# PCOS Risk Prediction Model

## Introduction

This notebook builds a machine learning model to predict Polycystic Ovary Syndrome (PCOS) risk using symptoms, hormonal data, and vitals. We train Logistic Regression (for interpretability) and Random Forest (for accuracy) models, evaluate them, and save the best one for the RituCare app.

### Why PCOS Detection Matters
PCOS affects roughly 1 in 10 women of reproductive age and can cause irregular periods, infertility, and metabolic problems. Early AI-assisted screening can prompt users to seek medical advice sooner, which may improve outcomes.

### Features Used
- Symptoms: Irregular periods, acne, hirsutism, weight gain.
- Hormones: LH, FSH, testosterone, insulin.
- Vitals: BMI, blood pressure, cholesterol.

**Disclaimer:** This is an AI-assisted screening tool, not a medical diagnosis. Consult a healthcare professional for confirmation.

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix, roc_curve
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import pickle
import warnings
warnings.filterwarnings('ignore')

# Set visual style
sns.set_style("whitegrid")
palette = ['#FFB6C1', '#DDA0DD', '#FF69B4', '#FFC0CB', '#E6E6FA']
sns.set_palette(palette)
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 12

## 1. Import & Load

Load the PCOS dataset and display basic info.

In [None]:
# Load dataset
df = pd.read_csv('../dataset/menstrual_cycle_data.csv')

print(f"Dataset Shape: {df.shape}")
print(f"Columns: {list(df.columns)}")
print("\nHead:")
print(df.head())

## 2. Clean & Prepare

Set the target column (`Group`), drop rows with a missing target, cast the labels to integers, and remove ID columns.

In [None]:
# Target is 'Group' column (0 = normal, 1 = PCOS)
target_col = 'Group'
print(f"Target column: {target_col}")

# Drop rows where target is missing
df = df.dropna(subset=[target_col])
df[target_col] = df[target_col].astype(int)

# Remove ID-like columns
id_cols = ['ClientID', 'CycleNumber']
df = df.drop(columns=id_cols, errors='ignore')

# Separate X and y
y = df[target_col]
X = df.drop(columns=[target_col])

print(f"After cleaning: {X.shape}, Target classes: {y.value_counts().to_dict()}")

## 3. Train/Test Split

Split 80/20 with `stratify=y` so that the train and test sets keep the same class proportions as the full dataset.
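
To illustrate what `stratify` does, the sketch below splits a small synthetic label vector with a 90/10 class ratio and checks that both parts keep that ratio. The data and the `*_demo` names are assumptions made for illustration, not the PCOS dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 90 negatives and 10 positives (illustration only)
y_demo = np.array([0] * 90 + [1] * 10)
X_demo = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratification, both splits keep the original share of positives
train_pos_rate = y_tr.mean()
test_pos_rate = y_te.mean()
print(f"Train positive rate: {train_pos_rate:.2f}")
print(f"Test positive rate: {test_pos_rate:.2f}")
```

Without `stratify`, a small test set could by chance contain very few positive cases, which would make recall and F1 unstable.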

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
print(f"Train: {X_train.shape}, Test: {X_test.shape}")
print(f"Train classes: {y_train.value_counts().to_dict()}")
print(f"Test classes: {y_test.value_counts().to_dict()}")

## 4. Preprocessing Pipeline

Numeric: impute median + scale; Categorical: impute mode + one-hot.
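
One detail worth seeing in isolation is `handle_unknown='ignore'`: a category that first appears at prediction time is encoded as an all-zero row instead of raising an error. The sketch below uses a made-up `flow` column, which is hypothetical and not taken from the dataset.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column, used only for illustration
train_demo = pd.DataFrame({'flow': ['light', 'heavy', 'light']})
new_demo = pd.DataFrame({'flow': ['heavy', 'spotting']})  # 'spotting' was never seen in training

encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train_demo)

# transform returns a sparse matrix by default; convert it for display
encoded = encoder.transform(new_demo).toarray()
print(encoder.categories_[0])  # categories learned from the training data
print(encoded)                 # the row for the unseen category is all zeros
```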

In [None]:
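# Convert object columns to numeric where possible.
# This is done on the split frames so the dtypes detected below match
# what the preprocessor actually receives.
X_train, X_test = X_train.copy(), X_test.copy()
for col in X_train.select_dtypes(include=['object']).columns:
    converted = pd.to_numeric(X_train[col], errors='coerce')
    if converted.notna().any():  # leave the column categorical if nothing parses
        X_train[col] = converted
        X_test[col] = pd.to_numeric(X_test[col], errors='coerce')

# Identify column types from the training data
numeric_features = X_train.select_dtypes(include=[np.number]).columns
categorical_features = X_train.select_dtypes(include=['object']).columns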

# Preprocessing pipeline
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Fit preprocessor
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)

print(f"Processed train shape: {X_train_processed.shape}")

## 5. Models to Train

Train Logistic Regression and Random Forest.
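
Both models in the cell below are configured with balanced class weights. As a rough sketch of what that setting means, scikit-learn documents the `'balanced'` heuristic as `n_samples / (n_classes * class_count)`. The label counts here are made up for illustration.

```python
from collections import Counter

# Made-up label counts for illustration: 80 normal (0) and 20 PCOS (1)
y_demo = [0] * 80 + [1] * 20

counts = Counter(y_demo)
n_samples = len(y_demo)
n_classes = len(counts)

# 'balanced' heuristic: n_samples / (n_classes * number of samples in that class)
class_weights = {label: n_samples / (n_classes * count) for label, count in counts.items()}
print(class_weights)
```

The minority class receives the larger weight, so misclassifying a PCOS case costs more during training. `'balanced_subsample'` applies the same formula to the bootstrap sample drawn for each tree.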

In [None]:
# Models
models = {
    'Logistic Regression': LogisticRegression(max_iter=2000, class_weight='balanced', random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=400, class_weight='balanced_subsample', random_state=42)
}

trained_models = {}
for name, model in models.items():
    model.fit(X_train_processed, y_train)
    trained_models[name] = model
    print(f"{name} trained.")

## 6. Evaluate

Compute metrics, confusion matrix, ROC curve.

In [None]:
# Evaluate
results = {}
for name, model in trained_models.items():
    y_pred = model.predict(X_test_processed)
    y_prob = model.predict_proba(X_test_processed)[:, 1] if hasattr(model, 'predict_proba') else y_pred
    
    acc = accuracy_score(y_test, y_pred)
    prec = precision_score(y_test, y_pred)
    rec = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    auc = roc_auc_score(y_test, y_prob)
    
    results[name] = {'Accuracy': acc, 'Precision': prec, 'Recall': rec, 'F1': f1, 'ROC-AUC': auc}

# Display table
results_df = pd.DataFrame(results).T
print(results_df)

# Confusion Matrix for best model
best_model_name = max(results, key=lambda x: results[x]['F1'])
best_model = trained_models[best_model_name]
y_pred_best = best_model.predict(X_test_processed)
cm = confusion_matrix(y_test, y_pred_best)
sns.heatmap(cm, annot=True, fmt='d', cmap='coolwarm')
plt.title(f'Confusion Matrix - {best_model_name}')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

# ROC Curve
fpr, tpr, _ = roc_curve(y_test, best_model.predict_proba(X_test_processed)[:, 1])
plt.plot(fpr, tpr, color=palette[0], label=f'{best_model_name} (AUC = {results[best_model_name]["ROC-AUC"]:.2f})')
plt.plot([0, 1], [0, 1], 'k--')
plt.title('ROC Curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.grid(True)
plt.show()

## 7. Pick Best Model

Select the model with the highest F1 score and save it, together with the fitted preprocessor, as pickle files.
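
On the app side, the saved preprocessor and model would need to be loaded and applied in the same order as in training. The sketch below shows that round trip with in-memory bytes and a tiny synthetic dataset. The columns `bmi` and `cycle_length`, the data, and the helper `predict_risk` are all hypothetical; RituCare's actual loading code is not part of this notebook.

```python
import pickle

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Tiny synthetic training set (hypothetical columns, illustration only)
train_demo = pd.DataFrame({'bmi': [20.0, 22.0, 30.0, 33.0],
                           'cycle_length': [28, 29, 40, 45]})
labels_demo = [0, 0, 1, 1]

scaler_demo = StandardScaler().fit(train_demo)
model_demo = LogisticRegression().fit(scaler_demo.transform(train_demo), labels_demo)

# Serialize and restore both objects, standing in for the two .pkl files
restored_scaler = pickle.loads(pickle.dumps(scaler_demo))
restored_model = pickle.loads(pickle.dumps(model_demo))

def predict_risk(row: pd.DataFrame) -> float:
    """Return the predicted probability of the positive class for one row."""
    features = restored_scaler.transform(row)
    return float(restored_model.predict_proba(features)[0, 1])

risk = predict_risk(pd.DataFrame({'bmi': [31.0], 'cycle_length': [42]}))
print(f"Predicted risk: {risk:.2f}")
```

Pickle files should only be loaded from trusted sources, and with a scikit-learn version compatible with the one used for training.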

In [None]:
# Best model
print(f"Best Model: {best_model_name} with F1 {results[best_model_name]['F1']:.3f}")

# Save
with open('../models/pcos_model.pkl', 'wb') as f:
    pickle.dump(best_model, f)

print("Model saved to ../models/pcos_model.pkl")

# Also save preprocessor for app
with open('../models/preprocessor.pkl', 'wb') as f:
    pickle.dump(preprocessor, f)

## 8. Feature Importance

Plot for Random Forest.

In [None]:
# Feature importance
if 'Random Forest' in trained_models:
    rf = trained_models['Random Forest']
    # Get feature names after preprocessing so they line up with the model's inputs
    # (get_feature_names_out on this pipeline needs scikit-learn >= 1.1)
    feature_names = preprocessor.get_feature_names_out()
    importances = pd.Series(rf.feature_importances_, index=feature_names).sort_values(ascending=False)
    
    top10 = importances.head(10)[::-1]  # reversed so the largest bar is drawn at the top
    plt.barh(top10.index, top10.values, color=palette[1])
    plt.title('Top 10 Feature Importance - Random Forest')
    plt.xlabel('Importance')
    plt.ylabel('Features')
    plt.grid(True)
    plt.show()
    
    print("Top predictive features:")
    for feat, imp in importances.head(5).items():
        print(f"- {feat}: {imp:.3f}")
    print("\nCheck whether cycle characteristics and hormonal markers rank among the top predictors.")

## Interpretation of Metrics

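- **Accuracy**: Share of all predictions that are correct; it can look high on imbalanced data even when few PCOS cases are found.
- **Precision**: True positives divided by all predicted positives.
- **Recall**: True positives divided by all actual positives.
- **F1**: Harmonic mean of precision and recall.
- **ROC-AUC**: Probability that the model ranks a randomly chosen positive case above a randomly chosen negative one.

Higher F1 and AUC indicate better performance.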
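
As a worked example of these definitions, the sketch below computes each threshold-based metric from the four cells of a confusion matrix. The counts are invented for illustration and are not results from this notebook.

```python
# Invented confusion-matrix counts (illustration only)
tp, fp, fn, tn = 40, 10, 20, 130

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1:        {f1:.3f}")
```

With these counts accuracy is 0.85 while recall is only about 0.67, because the many true negatives dominate the accuracy figure. That gap is one reason to look beyond accuracy when classes are imbalanced.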

## Conclusion

PCOS model training complete ✅
Best model saved at ../models/pcos_model.pkl

Remember, this tool aids in awareness; always consult a doctor. 🌸