<p style="font-family: Cambria; text-align: center; font-size: 48px;">II. Predictive Analysis</p> 

In [5]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
from scipy.stats import mannwhitneyu
from scipy.stats import chi2_contingency
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant
from sklearn.preprocessing import LabelEncoder

In [7]:
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report,precision_score,recall_score,f1_score
import xgboost as xgb


In [90]:
pd.set_option('display.max_rows',None)
pd.set_option('display.max_columns',None)

In [92]:
data= pd.read_excel("GDM_Python_Aug2025.xlsx")

In [93]:
df=data.copy()

### Early Risk Prediction of Gestational Diabetes Mellitus (GDM)

**Research Hypotheses:**

- **Null Hypothesis (H₀):** Gestational Diabetes Mellitus (GDM) is *not* significantly associated with adverse pregnancy outcomes or complications.
- **Alternative Hypothesis (H₁):** Gestational Diabetes Mellitus (GDM) *is* significantly associated with adverse pregnancy outcomes and complications.

In [97]:
mean_weeks = df['Gestational Age_V1'].mean() / 7
print(f"Mean gestational age_V1 in weeks: {mean_weeks:.2f}")

Mean gestational age_V1 in weeks: 13.59


##### In our dataset, the first prenatal visit takes place at around 14 weeks of gestation on average.
##### Our objective is to evaluate the predictive power of the early indicators through rigorous statistical testing and modeling.

### **Reasoning (Why it is important)**

* **Early identification** of at-risk women allows for **early lifestyle or therapeutic interventions**.
* Can **prevent GDM-related complications** (e.g., macrosomia, pre-eclampsia, stillbirth).
* Enables **personalized prenatal care** before routine screening (\~24–28 weeks).
* Reduces long-term maternal risk and poor neonatal outcomes.

### Correlation Check for GDM Diagnosis

Since the data isn't normally distributed, we used Spearman’s correlation to check how each clinical feature is related to the GDM diagnosis (Yes = 1, No = 0).

In [102]:
df1=data.copy()

In [104]:
# Map GDM to binary
df1['GDM_binary'] = df1['GDM Diagonised'].map({'Yes': 1, 'No': 0})

# Define the feature list
features = ['systolicBP_V1', 'diastolicBP_V1', 'PulseinV1', 'Platelet_V1', 
            'Calcium_V1', 'Albumin_V1', 'U Protein_V1', 'ALT_V1', 
            'U Albumin_V1', 'V1 CRP.1', 'V1 Creatinine.1', 'Hemoglobin_V1',
            'V1 PCR.1', 'BMIinV1','White Cell Count']

# Check and filter valid features
valid_features = [f for f in features if f in df1.columns]

# Calculate Spearman correlation with GDM_binary
corr_spearman = df1[valid_features + ['GDM_binary']].corr(method='spearman')['GDM_binary'].drop('GDM_binary')

# Print sorted results
corr_spearman.sort_values(ascending=False)

V1 CRP.1            0.150546
PulseinV1           0.148724
BMIinV1             0.133116
White Cell Count    0.123750
Platelet_V1         0.112980
diastolicBP_V1      0.092703
Calcium_V1          0.091657
systolicBP_V1       0.088896
ALT_V1              0.084171
U Protein_V1        0.074699
V1 PCR.1            0.070136
Hemoglobin_V1       0.006601
U Albumin_V1       -0.007803
V1 Creatinine.1    -0.019262
Albumin_V1         -0.113102
Name: GDM_binary, dtype: float64

In [106]:
results = {}

for col in ['systolicBP_V1', 'diastolicBP_V1', 'PulseinV1','Platelet_V1', 'Calcium_V1', 'Albumin_V1', 'U Protein_V1', 'ALT_V1',
            'U Albumin_V1','V1 CRP.1','V1 Creatinine.1', 'Hemoglobin_V1','V1 PCR.1','BMIinV1','White Cell Count']:
    group_yes = df[df['GDM Diagonised'].str.lower() == 'yes'][col].dropna()
    group_no = df[df['GDM Diagonised'].str.lower() == 'no'][col].dropna()
    
    if len(group_yes) > 0 and len(group_no) > 0:
        stat, p = mannwhitneyu(group_yes, group_no, alternative='two-sided')
        results[col] = {
            'U-statistic': round(stat, 3),
            'p-value': round(p, 5),
            'Chance of randomness (%)': round(p * 100, 2)
        }
    else:
        results[col] = {
            'U-statistic': None,
            'p-value': None,
            'Chance of randomness (%)': None
        }

pd.DataFrame(results).T.sort_values('p-value')

                  U-statistic  p-value  Chance of randomness (%)
PulseinV1             22091.5  0.00049                      0.05
V1 CRP.1              17671.0  0.00071                      0.07
BMIinV1               21627.0  0.00180                      0.18
White Cell Count      21299.5  0.00374                      0.37
Platelet_V1           20978.5  0.00813                      0.81
diastolicBP_V1        20417.5  0.02973                      2.97
systolicBP_V1         20304.5  0.03712                      3.71
ALT_V1                17411.0  0.05499                      5.50
Albumin_V1             4392.5  0.06228                      6.23
U Protein_V1          19699.5  0.08097                      8.10


- Pulse rate, CRP, BMI, White Cell Count, and Platelet count show the largest positive correlations (Spearman ρ ≈ 0.11–0.15) and the lowest p-values (p < 0.01), meaning they tend to be higher in women diagnosed with GDM. The correlations are modest in absolute terms, so these are best treated as candidate early indicators.
- Blood pressure values, ALT, and Albumin also differ between the groups, though less strongly (p ≈ 0.03–0.06); they may help as supporting features in prediction.
- Hemoglobin and Creatinine have correlations close to zero and high p-values, so they are unlikely to be related to GDM in this dataset.
- The p-value is the probability of seeing a difference at least this large if the two groups did not really differ; the "Chance of randomness (%)" column is the p-value expressed as a percentage. It is low for the top features, so these differences are unlikely to be due to chance alone, which gives more confidence in using them for early GDM prediction.
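
Because several features are tested at once, a few small p-values can appear by chance. The sketch below applies a Benjamini-Hochberg adjustment to the ten p-values displayed in the table above. It is only an illustration: the helper `bh_adjust` is hypothetical and not part of the original analysis, and the five features whose p-values are not displayed are left out, so the adjustment would be somewhat stricter with all fifteen tests included.

```python
# P-values copied from the Mann-Whitney output above (only the ten rows displayed)
p_values = {
    "PulseinV1": 0.00049, "V1 CRP.1": 0.00071, "BMIinV1": 0.0018,
    "White Cell Count": 0.00374, "Platelet_V1": 0.00813,
    "diastolicBP_V1": 0.02973, "systolicBP_V1": 0.03712,
    "ALT_V1": 0.05499, "Albumin_V1": 0.06228, "U Protein_V1": 0.08097,
}

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values (hypothetical helper for illustration)."""
    ordered = sorted(pvals.items(), key=lambda kv: kv[1])
    m = len(ordered)
    adjusted = {}
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone
    for rank in range(m, 0, -1):
        name, p = ordered[rank - 1]
        running_min = min(running_min, p * m / rank)
        adjusted[name] = running_min
    return adjusted

adjusted = bh_adjust(p_values)
significant = [name for name, q in adjusted.items() if q <= 0.05]

for name in p_values:
    print(f"{name:18s} raw p = {p_values[name]:.5f}  adjusted = {adjusted[name]:.5f}")
```

With these ten values, six features stay below 0.05 after adjustment, while systolic blood pressure moves just above it.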


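### Checking the Categorical Features

For the binary and categorical markers, a chi-squared test of independence is used to check the association with the GDM diagnosis.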

In [36]:
# Define binary health markers and outcome
markers = ['Age_gt_30', 'Vit D Deficiency', 'Smoking 123','PreviousGDM10 V1','Nutritional counselling','HighRisk','Chronic Illness','GestationalHP','Took Vit D Supplements']
results = {}

for marker in markers:
    # Create contingency table
    contingency = pd.crosstab(df[marker], df['GDM Diagonised'])
    
    # Run chi-squared test
    chi2, p, dof, expected = chi2_contingency(contingency)
    results[marker] = {'Chi2': chi2, 'p-value': p}

# Convert to DataFrame for a cleaner view
correlation_df = pd.DataFrame(results).T
print(correlation_df.sort_values('p-value'))

                               Chi2        p-value
Nutritional counselling  512.854705  4.315348e-112
GestationalHP            495.051278  7.876842e-106
PreviousGDM10 V1          24.557809   4.648786e-06
Took Vit D Supplements    12.357633   1.488082e-02
Vit D Deficiency           8.240258   1.624242e-02
HighRisk                   4.013410   1.344309e-01
Age_gt_30                  3.405225   1.822069e-01
Smoking 123                5.992763   4.240013e-01
Chronic Illness            0.714893   6.994601e-01


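- Nutritional counselling, GestationalHP, and Previous GDM show the strongest association with the GDM outcome (p < 0.001); Vitamin D deficiency and Vitamin D supplementation are significant at the 5% level. Nutritional Counselling, Previous GDM, and Vitamin D Deficiency can be used with confidence in the early prediction models.
- HighRisk and Age_gt_30 are not statistically significant here (p ≈ 0.13 and 0.18) but are kept as possibly relevant.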
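
A p-value shows whether an association exists but not how strong it is. As a rough complement, the sketch below converts some of the chi-squared statistics above into Cramér's V. It rests on assumptions not confirmed by the notebook: that all 565 rows enter each contingency table, that `GDM Diagonised` has three categories and the markers have the number of levels shown in the label mappings printed in this notebook, and that `PreviousGDM10 V1` is binary. The helper `cramers_v` is introduced here for illustration only.

```python
import math

n_rows = 565      # rows in df_final; assumes no rows are dropped by pd.crosstab
gdm_levels = 3    # assumed number of categories in 'GDM Diagonised'

# marker -> (chi-squared statistic from the output above, assumed number of marker levels)
chi2_results = {
    "Nutritional counselling": (512.854705, 2),
    "GestationalHP": (495.051278, 3),
    "PreviousGDM10 V1": (24.557809, 2),
    "Vit D Deficiency": (8.240258, 2),
}

def cramers_v(chi2, n, n_row_levels, n_col_levels):
    """Cramér's V effect size for an r x c contingency table."""
    k = min(n_row_levels - 1, n_col_levels - 1)
    return math.sqrt(chi2 / (n * k))

effect_sizes = {
    name: cramers_v(chi2, n_rows, levels, gdm_levels)
    for name, (chi2, levels) in chi2_results.items()
}

for name, v in effect_sizes.items():
    print(f"{name:25s} Cramér's V = {v:.2f}")
```

Under these assumptions, Nutritional counselling comes out close to 1, meaning the marker almost determines the outcome group. For a marker this strong it may be worth confirming that it is recorded before the diagnosis, since otherwise it would not be available as an early predictor.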

In [39]:
selected_columns = ['systolicBP_V1', 'diastolicBP_V1', 'PulseinV1', 'BMIinV1', 
                    'Smoking 123', 'PreviousGDM10 V1', 'Chronic Illness', 'Age_gt_30',
                    'Platelet_V1', 'Albumin_V1', 'U Albumin_V1', 'U Protein_V1','White Cell Count',
                    'ALT_V1', 'V1 CRP.1', 'Vit D Deficiency','HighRisk','GestationalHP', 
                    'Took Vit D Supplements', 'Nutritional counselling','GDM Diagonised']

df_final = df[selected_columns]


In [41]:
df_final.shape

(565, 21)

In [43]:
# Filter rows where GDM was diagnosed
gdm_positive_df = df_final [df_final['GDM Diagonised'] == 'Yes']


In [45]:
gdm_positive_df.isna().sum()

systolicBP_V1               0
diastolicBP_V1              0
PulseinV1                   0
BMIinV1                     0
Smoking 123                 0
PreviousGDM10 V1            0
Chronic Illness             0
Age_gt_30                   0
Platelet_V1                 0
Albumin_V1                 27
U Albumin_V1                0
U Protein_V1                0
White Cell Count            0
ALT_V1                      7
V1 CRP.1                   11
Vit D Deficiency            0
HighRisk                    0
GestationalHP               0
Took Vit D Supplements      0
Nutritional counselling     0
GDM Diagonised              0
dtype: int64

In [47]:
gdm_positive_df.shape

(74, 21)

In [49]:
# Step 1: Get object-type columns
object_cols = df_final.select_dtypes(include='object').columns

# Step 2: Convert each to string, then apply label encoding to get integers
from sklearn.preprocessing import LabelEncoder

df_final = df_final.copy()  # Just to be safe
for col in object_cols:
    df_final.loc[:, col] = df_final[col].astype(str)
    le = LabelEncoder()
    df_final.loc[:, col] = le.fit_transform(df_final[col])

# Step 3: Confirm that the converted columns are now integers
int_cols = df_final.select_dtypes(include='int').columns.to_list()
print(int_cols)

['systolicBP_V1', 'diastolicBP_V1', 'PulseinV1', 'PreviousGDM10 V1', 'Chronic Illness', 'HighRisk']
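
The printed list contains only columns that were already integers, which suggests the encoded columns kept their `object` dtype after the in-place `.loc` assignment (this can depend on the pandas version). A minimal sketch of one way to end up with integer columns is shown below. It swaps `LabelEncoder` for pandas category codes and uses a made-up two-column frame, so the names and values are illustrative only.

```python
import pandas as pd

# Made-up stand-in for two of the categorical columns
toy = pd.DataFrame({
    "GDM Diagonised": ["No", "Yes", "No"],
    "Age_gt_30": ["0", "1", "1"],
})

for col in ["GDM Diagonised", "Age_gt_30"]:
    # Category codes follow the sorted labels, much like LabelEncoder;
    # assigning the whole column replaces it, so the integer dtype is kept
    toy[col] = toy[col].astype(str).astype("category").cat.codes.astype("int64")

print(toy.dtypes)
print(toy)
```

Assigning with `toy[col] = ...` replaces the column rather than writing values into the existing object column, which is why the dtype changes.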


In [51]:
# Columns to convert to integers
object_columns = [
    'Smoking 123',
    'Age_gt_30',
    'Vit D Deficiency',
    'GestationalHP',
    'Took Vit D Supplements',
    'Nutritional counselling',
    'GDM Diagonised'
]

# Dictionary to store label mappings
label_mappings = {}

# Apply encoding
for col in object_columns:
    df_final.loc[:, col] = df_final[col].astype(str)  # ensure all values are strings
    le = LabelEncoder()
    df_final.loc[:, col] = le.fit_transform(df_final[col])
    label_mappings[col] = dict(zip(le.classes_, le.transform(le.classes_)))

# Print the mappings (optional)
for col, mapping in label_mappings.items():
    print(f"Mapping for '{col}': {mapping}")

Mapping for 'Smoking 123': {'0': 0, '1': 1, '2': 2, '3': 3}
Mapping for 'Age_gt_30': {'0': 0, '1': 1}
Mapping for 'Vit D Deficiency': {'0': 0, '1': 1}
Mapping for 'GestationalHP': {'0': 0, '1': 1, '2': 2}
Mapping for 'Took Vit D Supplements': {'0': 0, '1': 1, '2': 2}
Mapping for 'Nutritional counselling': {'0': 0, '1': 1}
Mapping for 'GDM Diagonised': {'0': 0, '1': 1, '2': 2}


In [53]:
 df_final.head()

,systolicBP_V1,diastolicBP_V1,PulseinV1,BMIinV1,Smoking 123,PreviousGDM10 V1,Chronic Illness,Age_gt_30,Platelet_V1,Albumin_V1,U Albumin_V1,U Protein_V1,White Cell Count,ALT_V1,V1 CRP.1,Vit D Deficiency,HighRisk,GestationalHP,Took Vit D Supplements,Nutritional counselling,GDM Diagonised
0,114,58,73,20.650699,3,0,0,1,203.0,,30.0,0.02,9.64,10.0,0.45,0,0,0,0,0,1
1,178,78,84,29.215625,3,0,0,1,233.0,,94.0,0.09,10.27,12.0,0.1,0,0,0,0,1,2
2,123,62,79,26.063378,3,0,0,1,330.0,,91.0,0.1,9.46,15.0,1.15,0,0,0,0,0,1
3,115,68,82,24.736333,2,0,0,0,253.0,,30.0,0.03,10.22,12.0,0.22,0,0,0,0,0,1
4,116,61,92,23.383904,2,0,0,1,217.0,,30.0,0.05,11.61,9.0,0.41,0,0,0,0,0,1


In [55]:
from statsmodels.tools.tools import add_constant
from statsmodels.stats.outliers_influence import variance_inflation_factor

# 1. Select only numeric columns to avoid type issues
df_numeric = df_final.select_dtypes(include=['number'])

# 2. Drop any remaining NaNs just in case
df_numeric = df_numeric.dropna()

# 3. Add intercept term
X = add_constant(df_numeric)

# 4. Compute VIFs
vif_data = pd.DataFrame()
vif_data["Feature"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

# 5. Display results
print(vif_data)

             Feature         VIF
0              const  158.408796
1      systolicBP_V1    1.500839
2     diastolicBP_V1    1.494248
3          PulseinV1    1.231250
4            BMIinV1    1.413617
5   PreviousGDM10 V1    2.085284
6    Chronic Illness    1.018301
7        Platelet_V1    1.210679
8         Albumin_V1    1.518809
9       U Albumin_V1    1.600968
10      U Protein_V1    1.233546
11  White Cell Count    1.260889
12            ALT_V1    1.024145
13          V1 CRP.1    1.401168
14          HighRisk    2.206162


In [231]:
df_final.isna().sum()

systolicBP_V1                0
diastolicBP_V1               0
PulseinV1                    0
BMIinV1                      0
Smoking 123                  0
PreviousGDM10 V1             0
Chronic Illness              0
Age_gt_30                    0
Platelet_V1                  1
Albumin_V1                 284
U Albumin_V1               264
U Protein_V1                 4
ALT_V1                      42
V1 CRP.1                    56
Vit D Deficiency             0
HighRisk                     0
GestationalHP                0
Took Vit D Supplements       0
Nutritional counselling      0
GDM Diagonised               0
dtype: int64

In [181]:
X.columns

Index(['const', 'systolicBP_V1', 'diastolicBP_V1', 'PulseinV1', 'BMIinV1',
       'PreviousGDM10 V1', 'Chronic Illness', 'Platelet_V1', 'Albumin_V1',
       'U Albumin_V1', 'U Protein_V1', 'ALT_V1', 'V1 CRP.1', 'HighRisk'],
      dtype='object')

In [183]:
df_final.columns

Index(['systolicBP_V1', 'diastolicBP_V1', 'PulseinV1', 'BMIinV1',
       'Smoking 123', 'PreviousGDM10 V1', 'Chronic Illness', 'Age_gt_30',
       'Platelet_V1', 'Albumin_V1', 'U Albumin_V1', 'U Protein_V1', 'ALT_V1',
       'V1 CRP.1', 'Vit D Deficiency', 'HighRisk', 'GestationalHP',
       'Took Vit D Supplements', 'Nutritional counselling', 'GDM Diagonised'],
      dtype='object')

In [187]:
vif_data 

            Feature         VIF
0             const  460.285607
1     systolicBP_V1    1.474383
2    diastolicBP_V1    1.433343
3         PulseinV1    1.134141
4           BMIinV1    1.605460
5  PreviousGDM10 V1    2.350795
6   Chronic Illness    1.066114
7       Platelet_V1    1.065271
8        Albumin_V1    1.103865
9      U Albumin_V1    1.103513

In [193]:
df_numeric.shape

(164, 13)

In [195]:
df_numeric.dtypes

systolicBP_V1         int64
diastolicBP_V1        int64
PulseinV1             int64
BMIinV1             float64
PreviousGDM10 V1      int64
Chronic Illness       int64
Platelet_V1         float64
Albumin_V1          float64
U Albumin_V1        float64
U Protein_V1        float64
ALT_V1              float64
V1 CRP.1            float64
HighRisk              int64
dtype: object

In [76]:
df_final.dtypes

systolicBP_V1                int64
diastolicBP_V1               int64
PulseinV1                    int64
BMIinV1                    float64
Smoking 123                 object
PreviousGDM10 V1             int64
Chronic Illness              int64
Age_gt_30                   object
Platelet_V1                float64
Albumin_V1                 float64
U Albumin_V1               float64
U Protein_V1               float64
White Cell Count           float64
ALT_V1                     float64
V1 CRP.1                   float64
Vit D Deficiency            object
HighRisk                     int64
GestationalHP               object
Took Vit D Supplements      object
Nutritional counselling     object
GDM Diagonised              object
dtype: object

In [80]:
df_final['GDM Diagonised'].value_counts()

GDM Diagonised
1    477
2     74
0     14
Name: count, dtype: int64

In [None]:
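# Split rows by GDM diagnosis.
# These string comparisons only match while 'GDM Diagonised' still holds 'Yes'/'No';
# once the label-encoding cells above have run, both frames come back empty.
gdm_positive_df = df_final[df_final['GDM Diagonised'] == 'Yes']
gdm_negative_df = df_final[df_final['GDM Diagonised'] == 'No']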


In [59]:
gdm_positive_df.isna().sum()
null_counts = gdm_positive_df.isnull().sum()
null_percent = (null_counts / len(gdm_positive_df)) * 100
null_summary = pd.DataFrame({'Null Count': null_counts, 'Null Percentage': null_percent})
print(null_summary)

                         Null Count  Null Percentage
systolicBP_V1                     0              NaN
diastolicBP_V1                    0              NaN
PulseinV1                         0              NaN
BMIinV1                           0              NaN
Smoking 123                       0              NaN
PreviousGDM10 V1                  0              NaN
Chronic Illness                   0              NaN
Age_gt_30                         0              NaN
Platelet_V1                       0              NaN
Albumin_V1                        0              NaN
U Albumin_V1                      0              NaN
U Protein_V1                      0              NaN
White Cell Count                  0              NaN
ALT_V1                            0              NaN
V1 CRP.1                          0              NaN
Vit D Deficiency                  0              NaN
HighRisk                          0              NaN
GestationalHP                     0           

#### For non-GDM patients, we retain only records with complete values for the selected features. These filtered records will be used for undersampling, which avoids the need to handle missing values in this group.


In [61]:

gdm_negative_df.isna().sum()
null_counts = gdm_negative_df.isnull().sum()
null_percent = (null_counts / len(gdm_negative_df)) * 100
null_summary = pd.DataFrame({'Null Count': null_counts, 'Null Percentage': null_percent})
print(null_summary)


                         Null Count  Null Percentage
systolicBP_V1                     0              NaN
diastolicBP_V1                    0              NaN
PulseinV1                         0              NaN
BMIinV1                           0              NaN
Smoking 123                       0              NaN
PreviousGDM10 V1                  0              NaN
Chronic Illness                   0              NaN
Age_gt_30                         0              NaN
Platelet_V1                       0              NaN
Albumin_V1                        0              NaN
U Albumin_V1                      0              NaN
U Protein_V1                      0              NaN
White Cell Count                  0              NaN
ALT_V1                            0              NaN
V1 CRP.1                          0              NaN
Vit D Deficiency                  0              NaN
HighRisk                          0              NaN
GestationalHP                     0           

In [64]:
# Require every selected column to be non-null
columns_to_check = gdm_negative_df.columns


In [74]:
# Filter where GDM = 0 and other columns are not null
filtered_df_0 = gdm_negative_df[
    (gdm_negative_df['GDM Diagonised'] == '0') & 
    gdm_negative_df[columns_to_check].notnull().all(axis=1)
]
filtered_df_0.shape


(0, 21)

In [68]:
filtered_df_0_sampled = filtered_df_0.sample(n=74, random_state=42)


ValueError: a must be greater than 0 unless no samples are taken
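
The error follows from the empty frame above: `sample(n=74)` has nothing to draw from. One likely cause is that `GDM Diagonised` holds integer codes after label encoding, so comparisons against strings such as `'0'` or `'No'` match no rows. The sketch below reproduces the effect on a made-up frame and shows a guarded version of the filter and the sampling step. The integer code used for "No" (`NO_CODE = 1`) is an assumption and should be checked against `label_mappings`.

```python
import pandas as pd

# Made-up stand-in for df_final after label encoding
toy = pd.DataFrame({
    "GDM Diagonised": [1, 1, 1, 2, 2, 1],
    "Albumin_V1": [40.0, None, 38.5, None, 41.0, 39.0],
})

# Comparing integer codes with a string matches nothing, so the result is empty
empty_match = toy[toy["GDM Diagonised"] == "0"]
print(empty_match.shape)

NO_CODE = 1  # assumed integer code for 'No'; confirm against label_mappings

# Keep non-GDM rows that have no missing values in any column
complete_no = toy[(toy["GDM Diagonised"] == NO_CODE) & toy.notnull().all(axis=1)]

# Cap n at the number of available rows so sample() cannot ask for more than exist
n_target = 74
sampled = complete_no.sample(n=min(n_target, len(complete_no)), random_state=42)
print(sampled.shape)
```

Capping `n` keeps the cell from failing when fewer complete non-GDM records are available than GDM records, at the cost of a slightly imbalanced combined dataset.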

# Undersampling the non-GDM patients

# In[192]:


# Keep every GDM-positive record; missing values in this group are imputed below
filtered_df_1 = gdm_positive_df[gdm_positive_df['GDM Diagonised'] == 'Yes']
filtered_df_1.shape


# We combine the non-GDM records with complete data and the GDM records to create a consolidated dataset. This combined data will be used for model training, validation, and testing.

# In[219]:


# Stack the undersampled non-GDM records and the GDM records row-wise
combined_df = pd.concat([filtered_df_0_sampled, filtered_df_1], axis=0, ignore_index=True)

print(combined_df.shape)
df=combined_df.copy()


# To handle missing values in the medical features Albumin_V1, U Albumin_V1, ALT_V1, and V1 CRP.1, we use a strategy tailored to GDM diagnosis status:
# 
# For patients diagnosed with GDM, we use K-Means clustering to impute missing values. This method captures underlying patterns while retaining important data from this smaller subgroup.
# 
# For non-GDM cases, we take advantage of the large sample size by removing records with missing values, ensuring that only complete and reliable data are used.
# 
# To prevent class imbalance, we will apply undersampling to the filtered non-GDM group.
# 
# This selective imputation strategy aims to maintain data quality, minimize noise, and improve the robustness of our predictive model.

# In[220]:


df.shape


# In[221]:


import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Columns to impute
cols = ['Albumin_V1', 'U Albumin_V1', 'ALT_V1', 'V1 CRP.1']

# Step 1: Work on a copy of the dataframe with only relevant columns
df_kmeans = df[cols].copy()

# Step 2: Temporarily fill missing values with column means to perform clustering
df_temp = df_kmeans.fillna(df_kmeans.mean())

# Step 3: Normalize the data
scaler = StandardScaler()
scaled = scaler.fit_transform(df_temp)

# Step 4: Apply KMeans clustering (k=3 is fixed here; choose k via the elbow method or domain knowledge)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
clusters = kmeans.fit_predict(scaled)

# Attach cluster labels to the working dataframe
df_kmeans['cluster'] = clusters

# Step 5: Impute missing values with the cluster-wise mean,
# falling back to the overall column mean if a cluster has no observed values
for col in cols:
    cluster_means = df_kmeans.groupby('cluster')[col].transform('mean')
    df_kmeans[col] = df_kmeans[col].fillna(cluster_means).fillna(df_kmeans[col].mean())

# Step 6: Drop the helper 'cluster' column and update main df
df_kmeans.drop(columns='cluster', inplace=True)
df[cols] = df_kmeans


# In[222]:


df.isnull().sum()
#Imputed the null values


# In[223]:


df.select_dtypes(include='object').columns.to_list()


# Label Encoding the categorical field

# In[224]:


from sklearn.preprocessing import LabelEncoder

# Columns to convert to integers
object_columns = [
    'Smoking 123',
    'Age_gt_30',
    'Vit D Deficiency',
    'GestationalHP',
    'Took Vit D Supplements',
    'Nutritional counselling',
    'GDM Diagonised'
]

# Dictionary to store label mappings
label_mappings = {}

# Apply encoding (direct column assignment so the result is stored with an integer dtype)
for col in object_columns:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col].astype(str))  # ensure all values are strings
    label_mappings[col] = dict(zip(le.classes_, le.transform(le.classes_)))

# Print the mappings (optional)
for col, mapping in label_mappings.items():
    print(f"Mapping for '{col}': {mapping}")


# In[225]:


df.head()


# In[226]:


# Get list of object dtype columns
column_names = df.select_dtypes(include='object').columns.to_list()

# Convert each column in the list to numeric
for col in column_names:
    df[col] = pd.to_numeric(df[col], errors='coerce')


# In[227]:


df.dtypes


# In[228]:


df.shape


# In[229]:


df.isna().sum()


# In[241]:


from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Separate features and target
X = df.drop('GDM Diagonised', axis=1)
y = df['GDM Diagonised']

# Split the dataset (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Optionally scale numeric features only
numeric_cols = X.select_dtypes(include=['int64', 'float64']).columns

scaler = StandardScaler()
X_train[numeric_cols] = scaler.fit_transform(X_train[numeric_cols])
X_test[numeric_cols] = scaler.transform(X_test[numeric_cols])


X_train.shape


# In[242]:


models=[]
models.append(('LogisticRegression', LogisticRegression(max_iter=1000)))
models.append(('Naive Bayes',GaussianNB()))
models.append(('KNN',KNeighborsClassifier(n_neighbors=3)))
models.append(('RandomForestClassifier',RandomForestClassifier()))
models.append(('DecisionTreeClassifier',DecisionTreeClassifier()))
models.append(('Neural Network (MLPClassifier)', MLPClassifier()))
models.append(('XGBClassifier',XGBClassifier()))
models.append(('SVM', SVC(kernel='linear', C=1.0, random_state=42)))


# In[243]:


from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report, roc_curve, roc_auc_score
)
import matplotlib.pyplot as plt
from sklearn.preprocessing import label_binarize

for name, model in models:
    print('\n')
    print(name, model)
    print()

    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    
    # Continuous scores for roc_auc_score and roc_curve
    if hasattr(model, "predict_proba"):
        y_proba = model.predict_proba(X_test)[:, 1]
    else:
        # fallback for models without predict_proba (e.g., SVC with probability=False)
        y_proba = model.decision_function(X_test)

    # changing the values into %
    accuracy = accuracy_score(y_test, y_pred) * 100
    precision = precision_score(y_test, y_pred) * 100
    recall = recall_score(y_test, y_pred) * 100
    f1 = f1_score(y_test, y_pred) * 100
    auc = roc_auc_score(y_test, y_proba) * 100

   # print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
    print("\nAccuracy: {:.2f}%".format(accuracy))
    print("Precision: {:.2f}%".format(precision))
    print("Recall: {:.2f}%".format(recall))
    print("F1 Score: {:.2f}%".format(f1))
    print("ROC AUC Score: {:.2f}%".format(auc))
   # print(classification_report(y_test, y_pred))

    # ROC Curve
    fpr, tpr, thresholds = roc_curve(y_test, y_proba)
    plt.figure()
    plt.plot(fpr, tpr, label=f'{name} (AUC = {auc:.2f}%)')
    plt.plot([0, 1], [0, 1], 'k--', label='Random Classifier')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title(f'ROC Curve - {name}')
    plt.legend(loc='lower right')
    plt.grid()
    plt.show()


# In[237]:


from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report

for name, model in models:
    print("\n" + "="*40)
    print(f"Model: {name}")
    print("="*40 + "\n")
    
    # Train the model
    model.fit(X_train, y_train)
    
    # Predict on test set
    y_pred = model.predict(X_test)
    
    # Calculate metrics
    accuracy = accuracy_score(y_test, y_pred) * 100
    precision = precision_score(y_test, y_pred) * 100
    recall = recall_score(y_test, y_pred) * 100
    f1 = f1_score(y_test, y_pred) * 100

    # Print metrics and confusion matrix
    #print("Confusion Matrix:")
    #print(confusion_matrix(y_test, y_pred), "\n")
    
    print(f"Accuracy: {accuracy:.2f}%")
    print(f"Precision: {precision:.2f}%")
    print(f"Recall: {recall:.2f}%")
    print(f"F1 Score: {f1:.2f}%\n")
    
    #print("Classification Report:")
    #print(classification_report(y_test, y_pred))


# In[244]:


from sklearn.metrics import accuracy_score

for name, model in models:
    print('\n')
    print(name, model)
    print()

    model.fit(X_train, y_train)

    # Training accuracy
    y_train_pred = model.predict(X_train)
    train_accuracy = accuracy_score(y_train, y_train_pred) * 100

    # Testing accuracy
    y_test_pred = model.predict(X_test)
    test_accuracy = accuracy_score(y_test, y_test_pred) * 100

    print(f"Training Accuracy: {train_accuracy:.2f}%")
    print(f"Testing Accuracy: {test_accuracy:.2f}%")

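    # Heuristic reading of the train/test gap (accuracy alone cannot fully separate bias from variance)
    if abs(train_accuracy - test_accuracy) < 5:
        print("✅ Train and test accuracy are within 5 points (no sign of overfitting)")
    elif train_accuracy > test_accuracy:
        print("⚠️ High variance (overfitting)")
    else:
        print("⚠️ Test accuracy exceeds training accuracy (possibly a small or unrepresentative test split)")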

