# 🦷 Predicting Dental Caries Using Systemic Health Markers

### Using a Support Vector Machine (SVM) to classify patients by their likelihood of having dental caries

This project explores the potential of **Support Vector Machine (SVM)** algorithms to predict dental caries likelihood in patients using comprehensive systemic health data. We will use a dataset from the [Kaggle Korean National Health Survey](https://www.kaggle.com/datasets/vernicacastillo/enfermedad-periodontal) that contains information about systemic health indicators and dental caries presence in patients.

**Research Innovation**: This is the first study to explore correlations between liver function markers (AST, ALT, GTP) and dental caries risk - a completely novel research area with significant clinical implications.


# 🎯 YOUR LEARNING JOURNEY: Step-by-Step SVM Discovery

## 📚 **Your Mission**
You will discover how SVM can predict dental caries by implementing each step yourself. I'll provide guidance and help when you need it, but YOU will be the one coding and discovering the insights!

## 🗺️ **Learning Roadmap**

### **Phase 1: Data Exploration (YOUR TURN!)**
- [ ] Load and explore the dataset
- [ ] Check data quality and missing values
- [ ] Understand the target variable distribution
- [ ] Calculate basic statistics for each feature category

### **Phase 2: Feature Analysis (YOUR DISCOVERY!)**
- [ ] Analyze correlations between features and dental caries
- [ ] Identify which features might be most predictive
- [ ] Explore the novel liver function markers (AST, ALT, GTP)
- [ ] Create your own feature engineering ideas

### **Phase 3: SVM Implementation (YOUR CODING!)**
- [ ] Build your first Linear SVM model
- [ ] Try RBF SVM and compare performance
- [ ] Experiment with feature engineering
- [ ] Tune hyperparameters and see what happens

### **Phase 4: Model Evaluation (YOUR ANALYSIS!)**
- [ ] Evaluate model performance with different metrics
- [ ] Analyze which features are most important
- [ ] Discover novel correlations
- [ ] Draw your own conclusions

## 🎓 **Learning Goals**
- Learn to implement SVM classifiers step by step with scikit-learn
- Discover which features matter most for caries prediction
- Learn how to evaluate medical ML models
- Explore the novel liver-oral health connection
- Build confidence in your ML skills

## 💡 **My Role**
- Provide guidance when you're stuck
- Answer questions about SVM theory
- Help debug code issues
- Suggest next steps when you're ready
- Celebrate your discoveries! 🎉

**Ready to start your discovery journey? Let's begin with Phase 1!**


# 📊 PHASE 1: Data Exploration (YOUR TURN!)

## 🎯 **Your Mission**
Load the dataset and explore it step by step. Discover what we're working with!

## 📝 **Your Tasks**
1. **Load the dataset** using pandas
2. **Check the shape** - how many patients and features?
3. **Look at the first few rows** - what does the data look like?
4. **Check for missing values** - is the data clean?
5. **Understand the target variable** - what's the caries prevalence?
6. **Calculate basic statistics** - what are the ranges and means?

## 💡 **Hints**
- Use `pd.read_csv()` to load the data
- Use `.shape`, `.head()`, `.info()` to explore
- Use `.isnull().sum()` to check missing values
- Use `.describe()` for basic statistics
- Use `.value_counts()` for the target variable
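
If you want to see what these calls return before touching the real file, here is a minimal sketch on a small synthetic table. The DataFrame `demo`, its values, and the injected missing entries are all invented for illustration; only the column names `AST` and `dental caries` mirror the ones used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real dataset: all values below are randomly generated.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "age": rng.integers(20, 80, size=200),
    "AST": rng.normal(25, 8, size=200).round(1),
    "dental caries": rng.choice([0, 1], size=200, p=[0.8, 0.2]),
})

# Inject five missing values so the missing-value check has something to find.
demo.loc[demo.sample(5, random_state=0).index, "AST"] = np.nan

n_rows, n_cols = demo.shape                      # patients x columns
missing_per_column = demo.isnull().sum()         # count of NaN per column
class_share = demo["dental caries"].value_counts(normalize=True)  # prevalence as fractions

print(f"{n_rows} rows, {n_cols} columns")
print(missing_per_column)
print(class_share.round(3))
print(demo.describe().loc[["mean", "min", "max"]])
```

Passing `normalize=True` to `.value_counts()` returns fractions instead of counts, which is usually the easier way to read prevalence.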

## 🎓 **Learning Goal**
Get familiar with your dataset and understand what you're working with!

**Go ahead and write your code below! I'm here to help if you get stuck.**


In [None]:
# YOUR CODE GOES HERE!
# Start by loading the dataset and exploring it step by step

# Step 1: Load the dataset
# import pandas as pd
# df = pd.read_csv('data/test_dataset.csv')

# Step 2: Check the shape
# print(f"Dataset shape: {df.shape}")

# Step 3: Look at the first few rows
# df.head()

# Step 4: Check for missing values
# df.isnull().sum()

# Step 5: Understand the target variable
# df['dental caries'].value_counts()                # counts per class
# df['dental caries'].value_counts(normalize=True)  # prevalence as fractions

# Step 6: Calculate basic statistics
# df.describe()

# Add your code here and discover what you find!


# 🔍 PHASE 2: Feature Analysis (YOUR DISCOVERY!)

## 🎯 **Your Mission**
Analyze which features are most correlated with dental caries. Discover the patterns!

## 📝 **Your Tasks**
1. **Calculate correlations** between each feature and dental caries
2. **Find the top 10 most correlated features** (positive and negative)
3. **Explore liver function markers** (AST, ALT, GTP) - the novel research area!
4. **Create some visualizations** to see the patterns
5. **Think about feature engineering** - what new features could you create?

## 💡 **Hints**
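- Use `df.corr(numeric_only=True)['dental caries']` to get correlations (`numeric_only=True` skips any text columns, which would otherwise raise an error in pandas 2.x)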
- Use `.sort_values(key=abs, ascending=False)` to sort by absolute correlation
- Try plotting correlations with `plt.barh()` or `sns.barplot()`
- Explore the liver markers: `df[['AST', 'ALT', 'Gtp', 'dental caries']].corr()`
- Think about ratios: BMI, AST/ALT ratio, HDL/LDL ratio
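
The hints above can be tried out on synthetic data first. The sketch below is only an illustration: the DataFrame `demo`, the `signal` column (deliberately tied to the target), and the `gender` text column are all made up, and the correlations it prints say nothing about the real survey.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 500

# Synthetic data for illustration only; none of these values come from the real dataset.
target = rng.integers(0, 2, size=n)
demo = pd.DataFrame({
    "AST": rng.normal(25, 8, size=n).clip(min=5),
    "ALT": rng.normal(27, 10, size=n).clip(min=5),
    "HDL": rng.normal(55, 12, size=n).clip(min=20),
    "LDL": rng.normal(115, 30, size=n).clip(min=30),
    "signal": target + rng.normal(0, 0.5, size=n),  # hypothetical feature built to track the target
    "gender": rng.choice(["F", "M"], size=n),       # text column, skipped by numeric_only=True
    "dental caries": target,
})

# Ratio features, as suggested in the hints.
demo["AST_ALT_ratio"] = demo["AST"] / demo["ALT"]
demo["HDL_LDL_ratio"] = demo["HDL"] / demo["LDL"]

# Correlation of every numeric feature with the target, ranked by absolute value.
corr = (
    demo.corr(numeric_only=True)["dental caries"]
    .drop("dental caries")                      # the target always correlates 1.0 with itself
    .sort_values(key=abs, ascending=False)
)
print(corr.round(3))
```

Dropping the target from its own ranking keeps the trivial 1.0 entry out of your top 10 and out of any bar chart you draw from it.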

## 🎓 **Learning Goal**
Discover which features matter most for predicting dental caries!

**What patterns do you find? Which features surprise you?**


In [None]:
# YOUR CODE GOES HERE!
# Analyze correlations and discover which features matter most

# Step 1: Calculate correlations with dental caries
# (numeric_only=True skips text columns; drop the target's correlation with itself)
# correlations = df.corr(numeric_only=True)['dental caries'].drop('dental caries')

# Step 2: Find top correlations (positive and negative)
# top_correlations = correlations.sort_values(key=abs, ascending=False).head(10)

# Step 3: Explore liver function markers specifically
# liver_correlations = df[['AST', 'ALT', 'Gtp', 'dental caries']].corr()

# Step 4: Create visualizations
# import matplotlib.pyplot as plt
# plt.figure(figsize=(10, 6))
# plt.barh(top_correlations.index, top_correlations.values)
# plt.xlabel('Correlation with dental caries')
# plt.show()

# Step 5: Think about feature engineering
# BMI = weight / (height/100)**2
# AST_ALT_ratio = AST / ALT
# HDL_LDL_ratio = HDL / LDL

# What patterns do you discover? Which features surprise you?


# 🤖 PHASE 3: SVM Implementation (YOUR CODING!)

## 🎯 **Your Mission**
Build your first SVM models and see how they perform! Experiment and learn!

## 📝 **Your Tasks**
1. **Prepare your data** - separate features (X) from target (y)
2. **Split the data** - train/test split (80/20 or 70/30)
3. **Scale your features** - use StandardScaler
4. **Build Linear SVM** - start with LinearSVC
5. **Build RBF SVM** - try SVC with RBF kernel
6. **Compare performance** - which works better?

## 💡 **Hints**
- Use `train_test_split()` from sklearn
- Use `StandardScaler()` to scale features
- Try `LinearSVC()` and `SVC(kernel='rbf')`
- Use `accuracy_score()` to evaluate
- Try different hyperparameters: `C=1.0`, `C=10.0`
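
To see the whole workflow run end to end before applying it to the real data, here is a minimal sketch on a synthetic, imbalanced problem from `make_classification`. The data, the 80/20 class weights, and the variable names (`results`, `baseline`) are assumptions made for illustration; the accuracies it prints are not indicative of what you will get on the caries dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC

# Synthetic stand-in for the real features and target (roughly 80% negatives, 20% positives).
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    weights=[0.8, 0.2], random_state=42,
)

# stratify=y keeps the class ratio the same in the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

results = {}
for C in (1.0, 10.0):
    models = {
        # Putting the scaler in a pipeline means it is fitted on the training data only.
        "linear": make_pipeline(StandardScaler(), LinearSVC(C=C, max_iter=10000, random_state=42)),
        "rbf": make_pipeline(StandardScaler(), SVC(kernel="rbf", C=C)),
    }
    for name, model in models.items():
        model.fit(X_train, y_train)
        results[(name, C)] = accuracy_score(y_test, model.predict(X_test))

# Accuracy of always predicting the majority class, for comparison.
baseline = max(y_test.mean(), 1 - y_test.mean())
print(f"Majority-class baseline: {baseline:.3f}")
for (name, C), acc in results.items():
    print(f"{name:>6} SVM, C={C:<4}: accuracy = {acc:.3f}")
```

On imbalanced data the majority-class baseline is worth printing next to each model: an accuracy close to it means the model has learned little, which is one reason Phase 4 looks at other metrics.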

## 🎓 **Learning Goal**
Build your first SVM models and understand how they work!

**What accuracy do you get? Which kernel works better?**


In [None]:
# YOUR CODE GOES HERE!
# Build your first SVM models and see how they perform!

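# Step 1: Prepare your data
# X = df.drop('dental caries', axis=1)  # Features
# y = df['dental caries']  # Target
# If X contains text columns, encode or drop them first (StandardScaler needs numeric input)

# Step 2: Split the data (stratify=y keeps the caries ratio equal in train and test)
# from sklearn.model_selection import train_test_split
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)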

# Step 3: Scale your features
# from sklearn.preprocessing import StandardScaler
# scaler = StandardScaler()
# X_train_scaled = scaler.fit_transform(X_train)
# X_test_scaled = scaler.transform(X_test)

# Step 4: Build Linear SVM
# from sklearn.svm import LinearSVC
# linear_svm = LinearSVC(random_state=42)
# linear_svm.fit(X_train_scaled, y_train)

# Step 5: Build RBF SVM
# from sklearn.svm import SVC
# rbf_svm = SVC(kernel='rbf', random_state=42)
# rbf_svm.fit(X_train_scaled, y_train)

# Step 6: Evaluate performance
# from sklearn.metrics import accuracy_score
# linear_pred = linear_svm.predict(X_test_scaled)
# rbf_pred = rbf_svm.predict(X_test_scaled)
# 
# print(f"Linear SVM Accuracy: {accuracy_score(y_test, linear_pred):.3f}")
# print(f"RBF SVM Accuracy: {accuracy_score(y_test, rbf_pred):.3f}")

# What accuracy do you get? Which kernel works better?
