# 1. **Dataset Analysis, EDA, and Model Development**

**Original Dataset**: https://drive.google.com/file/d/1hz-IdScbQzkEg8VPaLoWZXifzLEVLLWH/view?usp=sharing

**Information about the dataset and column descriptions**: https://colab.research.google.com/drive/1dUJ0Y0R8RJb_8cIFeOXcFoLv4ech8e-n?usp=drive_link


In [25]:
import pandas as pd
d=pd.read_csv('/content/Updated_BD_Dataset.csv')
d.head().T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| Age | 44 | 46 | 62 | 56 | 48 |
| Sex | F | M | M | M | F |
| Family_History | Yes | Yes | No | Yes | Yes |
| ANK3_rs10994336 | AA | AG | GG | AA | AA |
| CACNA1C_rs1006737 | AA | AG | AA | GG | AG |
| ODZ4_rs12576775 | AA | AA | AG | GG | GG |
| Glutamate_Level | High | Normal | Low | Low | High |
| Tryptophan_Metabolites | Altered | Normal | Normal | Altered | Altered |
| Cortisol_Level | Elevated | Normal | Normal | Normal | Elevated |
| Circadian_Gene_Disruption | Yes | No | No | Yes | Yes |


In [26]:
#checking shape
d.shape

(1000, 18)

In [27]:
#value counts of target column
d["BD_Type"].value_counts()

| BD_Type | count |
|---|---|
| BD-I | 250 |
| False | 250 |
| Cyclothymia | 250 |
| BD-II | 250 |


In [28]:
#information about dataset
d.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 18 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   Age                        1000 non-null   int64 
 1   Sex                        1000 non-null   object
 2   Family_History             1000 non-null   object
 3   ANK3_rs10994336            1000 non-null   object
 4   CACNA1C_rs1006737          1000 non-null   object
 5   ODZ4_rs12576775            1000 non-null   object
 6   Glutamate_Level            1000 non-null   object
 7   Tryptophan_Metabolites     1000 non-null   object
 8   Cortisol_Level             1000 non-null   object
 9   Circadian_Gene_Disruption  1000 non-null   object
 10  Mitochondrial_Dysfunction  1000 non-null   object
 11  Neuroinflammation          1000 non-null   object
 12  Omega3_Intake              1000 non-null   object
 13  Folate_Level               1000 non-null   object
 14  VitaminD_

In [29]:
#checking null/empty values
d.isna().sum()

| | 0 |
|---|---|
| Age | 0 |
| Sex | 0 |
| Family_History | 0 |
| ANK3_rs10994336 | 0 |
| CACNA1C_rs1006737 | 0 |
| ODZ4_rs12576775 | 0 |
| Glutamate_Level | 0 |
| Tryptophan_Metabolites | 0 |
| Cortisol_Level | 0 |
| Circadian_Gene_Disruption | 0 |


In [30]:
#checking datatype of column
d['Physical_Activity_Level'].dtype

dtype('O')

In [31]:
pd.api.types.is_object_dtype(d["Physical_Activity_Level"])

True

In [32]:
#finding the columns that contain strings in our dataset
col_list=[]
for label,content in d.items():
  if pd.api.types.is_object_dtype(content):
    col_list.append(label)
col_list

['Sex',
 'Family_History',
 'ANK3_rs10994336',
 'CACNA1C_rs1006737',
 'ODZ4_rs12576775',
 'Glutamate_Level',
 'Tryptophan_Metabolites',
 'Cortisol_Level',
 'Circadian_Gene_Disruption',
 'Mitochondrial_Dysfunction',
 'Neuroinflammation',
 'Omega3_Intake',
 'Folate_Level',
 'VitaminD_Level',
 'Physical_Activity_Level',
 'BD_Type']

In [33]:
# converting strings to category format
for label,content in d.items():
  if pd.api.types.is_string_dtype(content) or pd.api.types.is_object_dtype(content):
    d[label]=content.astype("category").cat.as_ordered()

In [34]:
d.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 18 columns):
 #   Column                     Non-Null Count  Dtype   
---  ------                     --------------  -----   
 0   Age                        1000 non-null   int64   
 1   Sex                        1000 non-null   category
 2   Family_History             1000 non-null   category
 3   ANK3_rs10994336            1000 non-null   category
 4   CACNA1C_rs1006737          1000 non-null   category
 5   ODZ4_rs12576775            1000 non-null   category
 6   Glutamate_Level            1000 non-null   category
 7   Tryptophan_Metabolites     1000 non-null   category
 8   Cortisol_Level             1000 non-null   category
 9   Circadian_Gene_Disruption  1000 non-null   category
 10  Mitochondrial_Dysfunction  1000 non-null   category
 11  Neuroinflammation          1000 non-null   category
 12  Omega3_Intake              1000 non-null   category
 13  Folate_Level               1000 no

In [39]:
for i in col_list:
  d[i].cat.categories

In [40]:
#pandas stores each categorical column as integer codes rather than as object (string) values
for i in col_list:
  d[i].cat.codes

In [41]:
#creating new column with numerical codes for all string columns

for i in col_list:
  # Making new columns
  d[i+'_Codes']=d[i].cat.codes

  # Dropping the original string column
  d=d.drop(i,axis=1)
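
The loop above drops each string column once its codes are stored, so the code-to-label mapping is no longer available afterwards. The following is a minimal sketch, on a toy frame rather than the real dataset, of how that mapping could be captured first. The names `toy` and `code_maps` are hypothetical and do not appear in the notebook.

```python
import pandas as pd

# Toy stand-in for the real dataset (values are illustrative only)
toy = pd.DataFrame({
    "Sex": ["F", "M", "M"],
    "Cortisol_Level": ["Elevated", "Normal", "Normal"],
})

code_maps = {}  # hypothetical helper: column name -> {code: label}
for col in list(toy.columns):
    cat = toy[col].astype("category")
    # categories are sorted, so the codes follow alphabetical order
    code_maps[col] = dict(enumerate(cat.cat.categories))
    toy[col + "_Codes"] = cat.cat.codes
    toy = toy.drop(columns=col)

print(code_maps)
print(toy)
```

Keeping such a dictionary makes it possible to translate a predicted `BD_Type_Codes` value back into its original label.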

In [42]:
d.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 18 columns):
 #   Column                           Non-Null Count  Dtype
---  ------                           --------------  -----
 0   Age                              1000 non-null   int64
 1   Average_Sleep_Hours              1000 non-null   int64
 2   Sex_Codes                        1000 non-null   int8 
 3   Family_History_Codes             1000 non-null   int8 
 4   ANK3_rs10994336_Codes            1000 non-null   int8 
 5   CACNA1C_rs1006737_Codes          1000 non-null   int8 
 6   ODZ4_rs12576775_Codes            1000 non-null   int8 
 7   Glutamate_Level_Codes            1000 non-null   int8 
 8   Tryptophan_Metabolites_Codes     1000 non-null   int8 
 9   Cortisol_Level_Codes             1000 non-null   int8 
 10  Circadian_Gene_Disruption_Codes  1000 non-null   int8 
 11  Mitochondrial_Dysfunction_Codes  1000 non-null   int8 
 12  Neuroinflammation_Codes          1000 non-null   

In [43]:
d.head().T

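| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| Age | 44 | 46 | 62 | 56 | 48 |
| Average_Sleep_Hours | 3 | 6 | 7 | 6 | 3 |
| Sex_Codes | 0 | 1 | 1 | 1 | 0 |
| Family_History_Codes | 1 | 1 | 0 | 1 | 1 |
| ANK3_rs10994336_Codes | 0 | 1 | 2 | 0 | 0 |
| CACNA1C_rs1006737_Codes | 0 | 1 | 0 | 2 | 1 |
| ODZ4_rs12576775_Codes | 0 | 0 | 1 | 2 | 2 |
| Glutamate_Level_Codes | 0 | 2 | 1 | 1 | 0 |
| Tryptophan_Metabolites_Codes | 0 | 1 | 1 | 0 | 0 |
| Cortisol_Level_Codes | 0 | 1 | 1 | 1 | 0 |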


In [45]:
# Save to Colab's temporary storage
d.to_csv('modified_BP.csv', index=False)

In [46]:
df=pd.read_csv('modified_BP.csv')
df.head()

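| | Age | Average_Sleep_Hours | Sex_Codes | Family_History_Codes | ANK3_rs10994336_Codes | CACNA1C_rs1006737_Codes | ODZ4_rs12576775_Codes | Glutamate_Level_Codes | Tryptophan_Metabolites_Codes | Cortisol_Level_Codes | Circadian_Gene_Disruption_Codes | Mitochondrial_Dysfunction_Codes | Neuroinflammation_Codes | Omega3_Intake_Codes | Folate_Level_Codes | VitaminD_Level_Codes | Physical_Activity_Level_Codes | BD_Type_Codes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 44 | 3 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 2 | 0 | 0 | 1 | 0 |
| 1 | 46 | 6 | 1 | 1 | 1 | 1 | 0 | 2 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 3 |
| 2 | 62 | 7 | 1 | 0 | 2 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 3 |
| 3 | 56 | 6 | 1 | 1 | 0 | 2 | 2 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 2 | 2 |
| 4 | 48 | 3 | 0 | 1 | 0 | 1 | 2 | 0 | 0 | 0 | 1 | 1 | 1 | 2 | 0 | 1 | 1 | 1 |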


### **Random Forest, Logistic Regression and KNN**
Accuracy on the held-out test set is as follows:

* 'Random Forest': `1`
* 'Logistic Regression': `1`
* 'KNN': `0.87`

In [47]:
import numpy as np

#Models from scikit-learn
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

#model evaluation
from sklearn.model_selection import train_test_split,cross_val_score
from sklearn.model_selection import RandomizedSearchCV,GridSearchCV
from sklearn.metrics import confusion_matrix,classification_report
from sklearn.metrics import precision_score,recall_score,f1_score
from sklearn.metrics import RocCurveDisplay

In [49]:
# Splitting the dataset
X=df.drop("BD_Type_Codes",axis=1)
y=df["BD_Type_Codes"]

np.random.seed(42)
# Splitting the data into train and test sets
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2)

In [50]:
#Using three ML models
models={"Random Forest":RandomForestClassifier(),
        "Logistic Regression":LogisticRegression(),
        "KNN":KNeighborsClassifier()}

#function to fit and score models

def fit_and_score(models,X_train,X_test,y_train,y_test):
  """
    Fits and evaluates given machine learning models
    models: a dictionary of different scikit-learn models
    X_train: training data (no labels)
    X_test: testing data (no labels)
    y_train: training labels
    y_test: test labels

  """

  #setting a random seed (42) for reproducibility
  np.random.seed(42)

  #dictionary for storing model scores
  model_scores={}

  #looping through each model
  for name,model in models.items():
    # fit the model to data
    model.fit(X_train,y_train)

    #evaluate model and append score
    model_scores[name]=model.score(X_test,y_test)

  return model_scores
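
`fit_and_score` evaluates each model on a single train/test split. As a complement, `cross_val_score` (imported above but not otherwise used) averages accuracy over several folds. The sketch below assumes synthetic data from `make_classification` in place of the real features. The names `X_demo`, `y_demo`, `demo_models`, and `cv_summary` are hypothetical, and the scores it prints say nothing about the actual dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 4-class data standing in for the real features (assumption)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=6,
    n_classes=4, random_state=42,
)

demo_models = {
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
    "KNN": KNeighborsClassifier(),
}

# 5-fold cross-validated accuracy: mean and standard deviation per model
cv_summary = {}
for name, model in demo_models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_summary[name] = (scores.mean(), scores.std())

for name, (mean, std) in cv_summary.items():
    print(f"{name}: {mean:.3f} +/- {std:.3f}")
```

A small spread across folds suggests the single-split score is not an artifact of one particular partition.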


In [51]:
model_scores=fit_and_score(models,X_train,X_test,y_train,y_test)
model_scores

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


{'Random Forest': 1.0, 'Logistic Regression': 1.0, 'KNN': 0.87}
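
The warning above is emitted when `LogisticRegression` reaches its default iteration limit before converging. The message itself suggests two remedies: scale the features, or raise `max_iter`. The sketch below is one possible way to apply both. It assumes synthetic data from `make_classification`, and the names `X_demo`, `y_demo`, and `scaled_lr` are hypothetical. The value `max_iter=1000` is an arbitrary illustrative choice.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 4-class data standing in for the encoded dataset (assumption)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=6,
    n_classes=4, random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Standardize features, then fit logistic regression with a higher iteration cap
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(X_tr, y_tr)

demo_score = scaled_lr.score(X_te, y_te)
print(f"Test accuracy on synthetic data: {demo_score:.3f}")
```

Because the scaler sits inside the pipeline, it is fitted on the training split only and merely applied to the test split.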

### **RandomForest + SMOTE** (Highest Accuracy)

Accuracy: `1.0`

In [94]:
# Step 1: Load dataset
df = pd.read_csv("/content/Updated_BD_Dataset.csv")

In [95]:
# Step 2: Encode target label
from sklearn.preprocessing import LabelEncoder

target_col = "BD_Type"
label_encoder = LabelEncoder()
df[target_col] = label_encoder.fit_transform(df[target_col])

In [96]:
# Step 3: Separate features and target
X = df.drop(target_col, axis=1)
y = df[target_col]

In [97]:
# Step 4: Identify categorical columns
categorical_cols = X.select_dtypes(include=['object']).columns.tolist()

In [98]:
# Step 5: Preprocessing for categorical variables
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

preprocessor = ColumnTransformer([
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
], remainder='passthrough')

In [99]:
# Step 6: Train-test split (stratified)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
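
Passing `stratify=y` keeps the class proportions of the full dataset in both splits. The sketch below uses toy arrays with the same shape as the target reported earlier (four classes, 250 rows each) to show the effect. The names `X_toy`, `y_toy`, and `test_counts` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels mirroring the class balance shown earlier: 4 classes x 250 rows
y_toy = np.repeat([0, 1, 2, 3], 250)
X_toy = np.arange(1000).reshape(-1, 1)

_, _, _, y_te_toy = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Count how many test rows fall into each class
labels, counts = np.unique(y_te_toy, return_counts=True)
test_counts = {int(k): int(v) for k, v in zip(labels, counts)}
print(test_counts)
```

With a 20% test size this yields 50 test rows per class.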

In [100]:
# Step 7: Create pipeline with SMOTE and model
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline

pipeline = ImbPipeline(steps=[
    ('preprocessor', preprocessor),
    ('smote', SMOTE(random_state=42)),
    ('classifier', RandomForestClassifier(random_state=42))
])

In [101]:
# Step 8: Train the model
pipeline.fit(X_train, y_train)

The format of the columns of the 'remainder' transformer in ColumnTransformer.transformers_ will change in version 1.7 to match the format of the other transformers.
At the moment the remainder columns are stored as indices (of type int). With the same ColumnTransformer configuration, in the future they will be stored as column names (of type str).



In [102]:
# Step 9: Evaluate
from sklearn.metrics import accuracy_score

y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
print(classification_report(y_test, y_pred, target_names=label_encoder.classes_))

Accuracy: 1.0
              precision    recall  f1-score   support

        BD-I       1.00      1.00      1.00        50
       BD-II       1.00      1.00      1.00        50
 Cyclothymia       1.00      1.00      1.00        50
       False       1.00      1.00      1.00        50

    accuracy                           1.00       200
   macro avg       1.00      1.00      1.00       200
weighted avg       1.00      1.00      1.00       200
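
The classification report summarizes precision and recall per class. `confusion_matrix`, which is imported earlier in the notebook but not called, shows which classes are mistaken for which. The sketch below uses small hand-written label lists, not the model's actual predictions. The names `y_true_toy`, `y_pred_toy`, and `cm` are hypothetical.

```python
from sklearn.metrics import confusion_matrix

# Hand-written toy labels (not the notebook's predictions)
y_true_toy = [0, 0, 1, 1, 2, 2]
y_pred_toy = [0, 0, 1, 2, 2, 2]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_toy, y_pred_toy)
print(cm)
```

In the real notebook the same call would take `y_test` and `y_pred`. A perfect classifier produces a matrix with non-zero entries only on the diagonal.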



## **Prediction**

In [103]:
for col in X.columns:
  print(col)

Age
Sex
Family_History
ANK3_rs10994336
CACNA1C_rs1006737
ODZ4_rs12576775
Glutamate_Level
Tryptophan_Metabolites
Cortisol_Level
Circadian_Gene_Disruption
Mitochondrial_Dysfunction
Neuroinflammation
Omega3_Intake
Folate_Level
VitaminD_Level
Average_Sleep_Hours
Physical_Activity_Level


In [106]:
def predict_BP():
    """Predict the bipolar disorder type (BD_Type) from interactively entered values for:
    Age
    Sex
    Family_History
    ANK3_rs10994336
    CACNA1C_rs1006737
    ODZ4_rs12576775
    Glutamate_Level
    Tryptophan_Metabolites
    Cortisol_Level
    Circadian_Gene_Disruption
    Mitochondrial_Dysfunction
    Neuroinflammation
    Omega3_Intake
    Folate_Level
    VitaminD_Level
    Average_Sleep_Hours
    Physical_Activity_Level
    """

    input_data = {
        "Age": int(input("Enter Age: (int) ")),
        "Sex": input("Enter Sex: (F-Female, M-Male) "),
        "Family_History": input("Enter Family_History ('Yes','No'): "),
        "ANK3_rs10994336": input("Enter ANK3_rs10994336 ('AA', 'AG', 'GG'): "),
        "CACNA1C_rs1006737": input("Enter CACNA1C_rs1006737 ('AA', 'AG', 'GG'): "),
        "ODZ4_rs12576775": input("Enter ODZ4_rs12576775 ('AA', 'AG', 'GG'): "),
        "Glutamate_Level": input("Enter Glutamate_Level: (High,Normal,Low) "),
        "Tryptophan_Metabolites": input("Enter Tryptophan_Metabolites: (Altered,Normal) "),
        "Cortisol_Level": input("Enter Cortisol_Level: (Elevated,Normal) "),
        "Circadian_Gene_Disruption": input("Enter Circadian_Gene_Disruption: (Yes, No) "),
        "Mitochondrial_Dysfunction": input("Enter Mitochondrial_Dysfunction: (Yes, No) "),
        "Neuroinflammation": input("Enter Neuroinflammation: (Yes,No) "),
        "Omega3_Intake": input("Enter Omega3_Intake: (Low,Adequate,High) "),
        "Folate_Level": input("Enter Folate_Level: (Deficient,Normal) "),
        "VitaminD_Level": input("Enter VitaminD_Level: (Deficient,Normal) "),
        "Average_Sleep_Hours": int(input("Enter Average_Sleep_Hours: (int) ")),
        "Physical_Activity_Level": input("Enter Physical_Activity_Level: (Moderate,Low,High) ")

        }

    # Convert to DataFrame
    user_df = pd.DataFrame([input_data])

    # Predict using the pipeline
    prediction = pipeline.predict(user_df)
    predicted_label = label_encoder.inverse_transform(prediction)

    return predicted_label[0]
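
Because the encoder is created with `handle_unknown='ignore'`, a mistyped category such as `Noraml` does not raise an error. It is encoded as all zeros for that feature and the prediction proceeds silently. The sketch below shows one way to check the typed values before predicting. The `ALLOWED_VALUES` table and `validate_record` helper are hypothetical. The options are copied from the prompts above and are assumed to match the categories in the training data.

```python
# Hypothetical lookup of accepted values, copied from the input prompts
ALLOWED_VALUES = {
    "Sex": {"F", "M"},
    "Family_History": {"Yes", "No"},
    "ANK3_rs10994336": {"AA", "AG", "GG"},
    "CACNA1C_rs1006737": {"AA", "AG", "GG"},
    "ODZ4_rs12576775": {"AA", "AG", "GG"},
    "Glutamate_Level": {"High", "Normal", "Low"},
    "Tryptophan_Metabolites": {"Altered", "Normal"},
    "Cortisol_Level": {"Elevated", "Normal"},
    "Circadian_Gene_Disruption": {"Yes", "No"},
    "Mitochondrial_Dysfunction": {"Yes", "No"},
    "Neuroinflammation": {"Yes", "No"},
    "Omega3_Intake": {"Low", "Adequate", "High"},
    "Folate_Level": {"Deficient", "Normal"},
    "VitaminD_Level": {"Deficient", "Normal"},
    "Physical_Activity_Level": {"Moderate", "Low", "High"},
}


def validate_record(record, allowed=ALLOWED_VALUES):
    """Return a list of problems; an empty list means every checked field is valid."""
    problems = []
    for field, options in allowed.items():
        value = record.get(field)
        if value not in options:
            problems.append(f"{field}: {value!r} not in {sorted(options)}")
    return problems


# Example: one mistyped value is reported, the rest pass
sample = {field: sorted(options)[0] for field, options in ALLOWED_VALUES.items()}
sample["Cortisol_Level"] = "Noraml"
print(validate_record(sample))
```

A caller could run this on `input_data` and ask the user to re-enter any reported field before calling `pipeline.predict`.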

In [107]:
predict_BP()

Enter  Age: (int) 20
Enter Gender: (F-female ,M-Male)F
Enter Family_History ('Yes','No'): Yes
Enter ANK3_rs10994336 ('AA', 'AG', 'GG'): AA
Enter CACNA1C_rs1006737 ('AA', 'AG', 'GG'): AG
Enter ODZ4_rs12576775 ('AA', 'AG', 'GG'): GG
Enter Glutamate_Level: (High,Noraml,Low) High
Enter Tryptophan_Metabolites: (Altered,Noraml) Normal
Enter Cortisol_Level: (Elevated,Noraml) Elevated
Enter Circadian_Gene_Disruption: (Yes, No) Yes
Enter Mitochondrial_Dysfunction: (Yes, No) No
Enter Neuroinflammation: (Yes,No) Yes
Enter Omega3_Intake: (Low,Adequate,High) Low
Enter Folate_Level: (Deficient,Normal) Normal
Enter VitaminD_Level: (Deficient,Normal) Deficient
Enter Average_Sleep_Hours: (int) 6
Enter Physical_Activity_Level: (Moderate,Low,High) Moderate

Predicted Depression Diagnosis: Cyclothymia
