# ML Model Based on Cross Validation

- We will use cross validation to compare several algorithms and pick the one that gives the best output as our ML model algorithm.
- We will then train that model and use it to predict the values of the test data.

What are the steps we will be taking to build the model?
- We will first load the data and then do some basic data analysis
- We will then do some data preprocessing
- We will then use cross validation to find the best algorithm
- We will then use the best algorithm to predict the values of the test data


The problem being solved is single-label, multi-class classification: each sample gets exactly one label out of several possible classes.

We will compare the algorithms listed below with cross validation, pick the best one, and then evaluate that model.


Several machine learning algorithms can handle multi-class classification natively, meaning they work directly with multiple classes without needing additional strategies like One-vs-Rest (OvR) or One-vs-One (OvO). Examples include:
- K-Nearest Neighbors (KNN)
- Decision Tree Classifier
- Random Forest Classifier
- Gradient Boosting Classifier
- Naive Bayes: GaussianNB
- Lasso-style regularization (L1 penalty, applied through Logistic Regression)
- Ridge-style regularization (L2 penalty, applied through Logistic Regression)

Support Vector Machines (SVM) are binary classifiers at their core and need a strategy like One-vs-Rest (OvR) or One-vs-One (OvO) for multi-class problems; scikit-learn's `SVC` applies One-vs-One internally. Logistic Regression is often introduced as a binary classifier too, but scikit-learn's `LogisticRegression` also supports a multinomial formulation, so OvR is a choice there rather than a requirement.
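
To make the OvR and OvO strategies concrete, here is a minimal sketch that wraps a linear SVM in both. The synthetic four-class dataset and all variable names are illustrative assumptions, not part of this analysis.

```python
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

# Synthetic 4-class problem (illustrative data only)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=6,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

# One-vs-Rest: one binary model per class
ovr = OneVsRestClassifier(LinearSVC(max_iter=5000)).fit(X_demo, y_demo)

# One-vs-One: one binary model per pair of classes
ovo = OneVsOneClassifier(LinearSVC(max_iter=5000)).fit(X_demo, y_demo)

print("OvR binary models:", len(ovr.estimators_))
print("OvO binary models:", len(ovo.estimators_))
```

With k classes, OvR trains k binary models and OvO trains k(k-1)/2, which is why OvO gets expensive when there are many classes.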

## Methodology
- Data Preprocessing
- Data Analysis
- Data Splitting: For Cross Validation
- Model Building
- Testing all models and their Cross Validation Scores
- Finding the best model
- Data splitting in the traditional way to train the best model
- Model Evaluation

Importing libraries. 

In [75]:
import pandas as pd #Pandas is for Data processing. 
import numpy as np #Numpy is for numerical calculations.
import matplotlib.pyplot as plt #plotting and visually understanding the data.
from sklearn.preprocessing import LabelEncoder, OneHotEncoder #Encoding the data depending on the data type.
from sklearn.model_selection import train_test_split #split the data into testing and training set
from sklearn.model_selection import cross_val_score, KFold #cross validation
from sklearn.preprocessing import StandardScaler, MinMaxScaler, PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.multiclass import OneVsRestClassifier


In [76]:
from sklearn.linear_model import LogisticRegression, Lasso, Ridge, ElasticNet
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier



In [70]:
from sklearn.metrics import (accuracy_score, classification_report, roc_auc_score, roc_curve, auc,
                             RocCurveDisplay, precision_score, recall_score, f1_score)
from sklearn.preprocessing import label_binarize



### Data preprocessing and analysis

In [4]:
main_df = pd.read_csv("abundance.csv")
main_df = main_df.replace('nd', np.nan)

display(main_df)




Unnamed: 0,dataset_name,sampleID,subjectID,bodysite,disease,age,gender,country,sequencing_technology,pubmedid,...,k__Bacteria|p__Firmicutes|c__Bacilli|o__Lactobacillales|f__Enterococcaceae|g__Enterococcus|s__Enterococcus_gilvus|t__Enterococcus_gilvus_unclassified,k__Bacteria|p__Firmicutes|c__Bacilli|o__Lactobacillales|f__Lactobacillaceae|g__Lactobacillus|s__Lactobacillus_otakiensis,k__Bacteria|p__Firmicutes|c__Bacilli|o__Lactobacillales|f__Lactobacillaceae|g__Lactobacillus|s__Lactobacillus_otakiensis|t__GCF_000415925,k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Peptococcaceae,k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Peptococcaceae|g__Desulfotomaculum,k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Peptococcaceae|g__Desulfotomaculum|s__Desulfotomaculum_ruminis,k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Peptococcaceae|g__Desulfotomaculum|s__Desulfotomaculum_ruminis|t__GCF_000215085,k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Ruminococcaceae|g__Faecalibacterium|s__Faecalibacterium_prausnitzii|t__GCF_000209855,k__Bacteria|p__Firmicutes|c__Negativicutes|o__Selenomonadales|f__Veillonellaceae|g__Megasphaera|s__Megasphaera_sp_BV3C16_1,k__Bacteria|p__Firmicutes|c__Negativicutes|o__Selenomonadales|f__Veillonellaceae|g__Megasphaera|s__Megasphaera_sp_BV3C16_1|t__GCF_000478965
0,WT2D,S367,s367,stool,impaired_glucose_tolerance,70.91,,yugoslavia,Illumina,23719380,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,WT2D,S604,s604,stool,t2d,71.63,,yugoslavia,Illumina,23719380,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,hmp,SRS011061,158458797,stool,n,,female,usa,Illumina,22699609,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,hmp,SRS011084,158479027,stool,n,,male,usa,Illumina,22699609,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,hmp,SRS011086,158458797,tongue_dorsum,n,,female,usa,Illumina,22699609,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3605,t2dmeta_long,T2D-034,-,stool,-,-,-,-,Illumina,23023125,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3606,t2dmeta_long,T2D-035,-,stool,-,-,-,-,Illumina,23023125,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3607,t2dmeta_long,T2D-037,-,stool,-,-,-,-,Illumina,23023125,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3608,t2dmeta_long,T2D-038,-,stool,-,-,-,-,Illumina,23023125,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Calculate and print missing values in the dataset.

In [5]:
missing_percentages = main_df.isnull().mean() * 100
print("Missing value percentages:")
print(missing_percentages)


Missing value percentages:
dataset_name                                                                                                                                         0.000000
sampleID                                                                                                                                             0.000000
subjectID                                                                                                                                            0.000000
bodysite                                                                                                                                            12.686981
disease                                                                                                                                             13.157895
                                                                                                                                                      ...    
k__Bacteria|p__Firmicutes

These columns have no biological relevance for our purpose, so we drop them:

In [6]:

# Create a copy of main_df before dropping columns
main_df_cleaned = main_df.copy()

# Drop columns from main_df_cleaned instead of main_df
columns_to_drop = ['dataset_name', 'sampleID', 'subjectID', 'sequencing_technology', 'pubmedid', 'camp', 
                   'collectionweek', 'samplecollectionwindow', 'paired_end_insert_size', 'read_length', 
                   'total_reads', 'matched_reads', 'uniquely_matching_reads', 'uniquely_matched_reads', 
                   'gene_number', 'gene_number_for_11m_uniquely_matched_reads', 'hitchip_probe_number', 
                   'gene_count_class', 'hitchip_probe_class', '#SampleID', 'rna_sampleid', 
                   'postnatal_antimicrobial_use', 'infant_gestation_weeks', 'cohort', 'less_than_29weeks', 
                   'sample_collection_days', 'gut_sample_id_ncbipublic', 'gut_sample_id_corrected', 
                   'projectid', 'flowcell', 'comment', 'mlst_project', 'mlst_ec', 'st_ec', 'mlst_kp', 
                   'st_kp', 'extractionprotocolid', 'site_id_cincinnati', 'ascites', 'classification']

main_df_cleaned.drop(columns=columns_to_drop, inplace=True, errors='ignore')
display(main_df_cleaned)


Unnamed: 0,bodysite,disease,age,gender,country,bmi,infant_gender,delivery_mode,infant_ethnicity,infant_race,...,k__Bacteria|p__Firmicutes|c__Bacilli|o__Lactobacillales|f__Enterococcaceae|g__Enterococcus|s__Enterococcus_gilvus|t__Enterococcus_gilvus_unclassified,k__Bacteria|p__Firmicutes|c__Bacilli|o__Lactobacillales|f__Lactobacillaceae|g__Lactobacillus|s__Lactobacillus_otakiensis,k__Bacteria|p__Firmicutes|c__Bacilli|o__Lactobacillales|f__Lactobacillaceae|g__Lactobacillus|s__Lactobacillus_otakiensis|t__GCF_000415925,k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Peptococcaceae,k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Peptococcaceae|g__Desulfotomaculum,k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Peptococcaceae|g__Desulfotomaculum|s__Desulfotomaculum_ruminis,k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Peptococcaceae|g__Desulfotomaculum|s__Desulfotomaculum_ruminis|t__GCF_000215085,k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Ruminococcaceae|g__Faecalibacterium|s__Faecalibacterium_prausnitzii|t__GCF_000209855,k__Bacteria|p__Firmicutes|c__Negativicutes|o__Selenomonadales|f__Veillonellaceae|g__Megasphaera|s__Megasphaera_sp_BV3C16_1,k__Bacteria|p__Firmicutes|c__Negativicutes|o__Selenomonadales|f__Veillonellaceae|g__Megasphaera|s__Megasphaera_sp_BV3C16_1|t__GCF_000478965
0,stool,impaired_glucose_tolerance,70.91,,yugoslavia,32.3,,,,,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,stool,t2d,71.63,,yugoslavia,28.9,,,,,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,stool,n,,female,usa,,,,,,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,stool,n,,male,usa,,,,,,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,tongue_dorsum,n,,female,usa,,,,,,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3605,stool,-,-,-,-,-,,,,,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3606,stool,-,-,-,-,-,,,,,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3607,stool,-,-,-,-,-,,,,,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3608,stool,-,-,-,-,-,,,,,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Now, coming to data cleaning: we use mean imputation for the numerical columns and forward fill for the categorical columns.

In [None]:
from sklearn.impute import SimpleImputer

# Assuming main_df_cleaned is the cleaned dataframe from previous steps
df_cleaned = main_df_cleaned.copy()

# Define categorical and numerical columns
categorical_columns = df_cleaned.select_dtypes(include=['object']).columns
numerical_columns = df_cleaned.select_dtypes(include=[np.number]).columns

# For categorical columns, use forward fill
df_cleaned[categorical_columns] = df_cleaned[categorical_columns].ffill()

# For numerical columns, use mean imputation
numerical_imputer = SimpleImputer(strategy='mean')
df_cleaned[numerical_columns] = numerical_imputer.fit_transform(df_cleaned[numerical_columns])




Let's view and export the dataset now.

In [8]:
df_cleaned.to_csv('abundance_cleaned.csv', index=False, header=True)

# Columns in the current dataset 


In [69]:
column_name = df_cleaned.select_dtypes(exclude=[np.number]).columns.tolist()
print(column_name)



['bodysite', 'disease', 'age', 'gender', 'country', 'bmi', 'infant_gender', 'delivery_mode', 'infant_ethnicity', 'infant_race', 'birth_year', 'necrotizing_enterocolitis', 'dol_firstnecors', 'sepsis', 'infant_birthweight_kg', 'infant_birth_length_cm', 'maternal_abx_given', 'postnatal_abx_2window', 'perinatal_antimicrobial_use', 'death', 'nursing_status', 'maternal_age_at_delivery_years', 'mat_age_bin', 'parity', 'infant_fut2', 'mat_fut2', 'lowh', 'lowlea', 'highslea', 'married', 'gravida', 'preeclampsia', 'mult_birth', 'hypertension', 'hypertension_prepreg', 'chorioamnionitis', 'primipar', 'daysonform3daysprior', 'days_on_abx_14', 'visit_number', 'snprnt', 'wmsphase', 'first', 'repeat', 'stooltexture', 'daysafteronset', 'hus', 'stec_count', 'shigatoxin2elisa', 'readsmillions', 'nonhuman', 'stec_coverage', 'stxab_detected', 'stx_ratio', 'typingdata', 'c_difficile_frequency', 'ibd', 'sampling_day', 'known_consumers_of_a_defined_fermented_milk_product_(dfmp)', 'mgs_richness', 'mgs_profile_
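
The list above includes `age` and `bmi`, which suggests that placeholder strings such as `-` (visible in the earlier preview) keep those columns as `object` dtype. A minimal sketch of one way to handle this, using a toy frame with made-up values; whether `-` should count as missing in the real data is an assumption:

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the placeholder strings seen in the preview (illustrative values)
toy = pd.DataFrame({
    "age": ["70.91", "71.63", "-", "nd"],
    "gender": ["female", "-", "male", "nd"],
})

# Treat both placeholders as missing values
toy = toy.replace(["nd", "-"], np.nan)

# Convert the column that should be numeric; unparseable entries become NaN
toy["age"] = pd.to_numeric(toy["age"], errors="coerce")

print(toy.dtypes)
print(toy.isna().sum())
```

After this step `age` is a float column, so it would be picked up by `select_dtypes(include=[np.number])` and imputed with the other numerical columns.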

### Data splitting (X and Y)
Why are we not encoding before this? Because the target variable does not need to be encoded: scikit-learn classifiers accept string class labels directly.

Everything except the target variable (`disease`) is treated as a feature, and `disease` is the target.

In [12]:
# Every column except 'disease' is used as an input feature
Input_Features = [col for col in df_cleaned.columns if col != 'disease']
X = df_cleaned[Input_Features]
Y = df_cleaned['disease']

print(X.head())

print(Y.head())

        bodysite    age  gender     country   bmi infant_gender delivery_mode  \
0          stool  70.91     NaN  yugoslavia  32.3           NaN           NaN   
1          stool  71.63     NaN  yugoslavia  28.9           NaN           NaN   
2          stool  71.63  female         usa  28.9           NaN           NaN   
3          stool  71.63    male         usa  28.9           NaN           NaN   
4  tongue_dorsum  71.63  female         usa  28.9           NaN           NaN   

  infant_ethnicity infant_race birth_year  ...  \
0              NaN         NaN        NaN  ...   
1              NaN         NaN        NaN  ...   
2              NaN         NaN        NaN  ...   
3              NaN         NaN        NaN  ...   
4              NaN         NaN        NaN  ...   

  k__Bacteria|p__Firmicutes|c__Bacilli|o__Lactobacillales|f__Enterococcaceae|g__Enterococcus|s__Enterococcus_gilvus|t__Enterococcus_gilvus_unclassified  \
0                                                0.0     

Now let's encode the categorical columns.
We will do two types of encoding: label encoding and one-hot encoding.

We do this because the choice of encoding affects the complexity of the model, which in turn affects bias, variance, and accuracy.

Label encoding assigns a unique integer to each category, while one-hot encoding creates a new column for each category and fills it with a binary value.

- Label encoding suits algorithms that do not read the integer codes as ordered magnitudes. One-hot encoding suits algorithms that would otherwise treat the codes as an ordering that does not exist.

- Label encoding usually works well for decision trees and random forests; one-hot encoding usually works better for linear regression and logistic regression.

- For distance-dependent algorithms like KNN and SVM, and for neural networks, one-hot encoding is preferred.

We will encode only the features (X), not the target.
Tree-based algorithms such as decision trees and random forests are less sensitive to how categories are coded, but scikit-learn estimators still expect numeric input, so we apply label encoding for them (and for naive Bayes).

Which encoding to use for which algorithm:
- Decision Trees and Random Forests: Label Encoding
- Naive Bayes: Label Encoding
- KNN, SVM, Logistic Regression: One-Hot Encoding
- Lasso and Ridge regression: One-Hot Encoding
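
The difference between the two encodings is easiest to see on a single column. This is a minimal sketch with made-up category values:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Illustrative values for one categorical column
sites = np.array(["stool", "tongue_dorsum", "stool", "skin"])

# Label encoding: one integer per category (categories are sorted alphabetically)
label_codes = LabelEncoder().fit_transform(sites)
print(label_codes)

# One-hot encoding: one binary column per category
onehot = OneHotEncoder().fit_transform(sites.reshape(-1, 1)).toarray()
print(onehot.shape)
print(onehot)
```

scikit-learn documents `LabelEncoder` as an encoder for target labels; `OrdinalEncoder` is the feature-oriented equivalent and encodes several columns at once.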

In [13]:
# find numeric and non numeric columns once
numeric_columns = X.select_dtypes(include=[np.number]).columns
non_numeric_columns = X.select_dtypes(include=[object]).columns

In [14]:
# Convert all categorical data to strings (NaN becomes the string 'nan')
X_for_encoding = X.copy()
for column in non_numeric_columns:
    X_for_encoding[column] = X_for_encoding[column].astype(str)

# Label encoding: one integer code per category, one encoder per column
label_encoders = {}
Label_X = X_for_encoding.copy()
for column in non_numeric_columns:
    label_encoders[column] = LabelEncoder()
    Label_X[column] = label_encoders[column].fit_transform(X_for_encoding[column])

# One-hot encoding: one binary column per category.
# Note: Onehot_X holds only the encoded non-numeric columns; the numeric columns are not included.
onehot_encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
Onehot_X = onehot_encoder.fit_transform(X_for_encoding[non_numeric_columns])
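
`Onehot_X` above is built from the non-numeric columns only. If the numeric columns are also meant to be model inputs, one possible approach is a `ColumnTransformer` that encodes the categorical columns and scales the numeric ones in a single step. This is a sketch on a toy frame; the column names and values are hypothetical:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy feature frame (hypothetical columns and values)
toy_X = pd.DataFrame({
    "bodysite": ["stool", "skin", "stool", "tongue_dorsum"],
    "abundance_a": [0.0, 1.5, 3.0, 0.5],
    "abundance_b": [10.0, 0.0, 5.0, 2.5],
})
cat_cols = ["bodysite"]
num_cols = ["abundance_a", "abundance_b"]

# sparse_threshold=0 forces a dense array as output
preprocess = ColumnTransformer(
    [
        ("onehot", OneHotEncoder(handle_unknown="ignore"), cat_cols),
        ("scale", MinMaxScaler(), num_cols),
    ],
    sparse_threshold=0,
)

combined = preprocess.fit_transform(toy_X)
print(combined.shape)  # 3 one-hot columns + 2 scaled numeric columns
```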

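Now, coming back to scaling: we use MinMaxScaler for the numerical columns only. Note that this rescales `df_cleaned`; `X`, `Label_X`, and `Onehot_X` were built earlier and are not affected unless they are rebuilt.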

In [98]:
scaler_minmax = MinMaxScaler()
df_cleaned[numerical_columns] = scaler_minmax.fit_transform(df_cleaned[numerical_columns])
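
Fitting a scaler on the full dataset before cross validation lets information from the test folds influence the training folds. A common alternative, sketched here on synthetic data with illustrative names, is to put the scaler inside a `Pipeline` so that it is refitted on the training part of every fold:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic 3-class data (illustrative only)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=0
)

# The scaler is fitted on the training folds only, inside each CV iteration
scaled_knn = Pipeline([
    ("scale", MinMaxScaler()),
    ("knn", KNeighborsClassifier()),
])

cv_demo = KFold(n_splits=5, shuffle=True, random_state=1)
pipe_scores = cross_val_score(scaled_knn, X_demo, y_demo, scoring="accuracy", cv=cv_demo)
print("Mean accuracy:", pipe_scores.mean())
```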

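# Cross Validation technique
Cross validation is a technique for assessing the performance of a machine learning model more reliably: the data is separated into 'folds', and each fold takes a turn as the test set while the remaining folds are used for training.

This has several advantages: every sample is used for both training and testing, and averaging over the folds gives a less noisy estimate than a single train/test split, which helps when judging bias and variance.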

In [15]:
kf = KFold(n_splits=10, shuffle=True, random_state=42)
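
To see what `KFold` actually produces, here is a small sketch on ten dummy samples (the sample array and the five-fold setting are illustrative):

```python
import numpy as np
from sklearn.model_selection import KFold

samples = np.arange(10)
demo_kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Count how often each sample lands in a test fold
test_counts = np.zeros(len(samples), dtype=int)
for fold, (train_idx, test_idx) in enumerate(demo_kf.split(samples)):
    print(f"fold {fold}: train={train_idx}, test={test_idx}")
    test_counts[test_idx] += 1

print(test_counts)
```

Every sample appears in exactly one test fold. When the classes are imbalanced, `StratifiedKFold` is worth considering, since it keeps the class proportions roughly equal across folds.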

Models to Evaluate:
- SVC (one vs rest)
- Logistic Regression (one vs rest)
- K-Nearest Neighbors (KNN)
- Decision Tree Classifier
- Random Forest Classifier
- MLP Classifier
- Naive Bayes: GaussianNB
- Regularized Regression: Lasso
- Regularized Regression: Ridge

Let's go one by one and perform cross validation checks on each of the models.

# 1. SVC (one vs rest)
- SVC is a binary classifier at its core. scikit-learn's `SVC` handles multi-class problems internally with a One-vs-One scheme; `decision_function_shape='ovr'` only reshapes the decision function to a One-vs-Rest layout. A true One-vs-Rest model would wrap the estimator in `OneVsRestClassifier`.
- We will use the cross_val_score function to perform cross validation on the model.
- We use Onehot_X as X because SVC is a distance-dependent algorithm.

Support vector machines is 

In [29]:
SVC_OVR = SVC(decision_function_shape='ovr')  # One-vs-Rest shaped decision function

# score
cv = KFold(n_splits=10, random_state=1, shuffle=True)
scores = cross_val_score(SVC_OVR, Onehot_X, Y, scoring='accuracy', cv=cv, n_jobs=-1)
# n_jobs=-1: This tells scikit-learn to use all available CPU cores.
SVCMEAN = np.mean(scores) 
print("Accuracy:", SVCMEAN)

Accuracy: 0.9146814404432135


# 2. Logistic Regression (one vs rest)
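- Logistic Regression is often used as a binary classifier. scikit-learn can also fit it as a multinomial model, but here we use the One-vs-Rest strategy through `multi_class='ovr'` (newer scikit-learn releases deprecate this parameter in favour of wrapping the model in `OneVsRestClassifier`).
- We will use the cross_val_score function to perform cross validation on the model.
- We use Onehot_X as X because logistic regression is a linear model that would otherwise read label codes as ordered magnitudes.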

In [30]:
LR_OVR = LogisticRegression(multi_class='ovr', solver='liblinear')
# solver: the optimization algorithm; liblinear only supports the one-vs-rest scheme
cv = KFold(n_splits=10, random_state=1, shuffle=True)
scores = cross_val_score(LR_OVR, Onehot_X, Y, scoring='accuracy', cv=cv, n_jobs=-1)
LRMEAN = np.mean(scores) 
print("Accuracy:", LRMEAN)

Accuracy: 0.9570637119113574


# 3. K-Nearest Neighbors (KNN)
- KNN is a distance-dependent algorithm, so we use one-hot encoding for the categorical columns (Onehot_X as X).
- We will use the cross_val_score function to perform cross validation on the model.

In KNN, we take the k nearest training points to a query point and assign the query point the majority class among them.

In [31]:
KNN = KNeighborsClassifier()
cv = KFold(n_splits=10, random_state=1, shuffle=True)
scores = cross_val_score(KNN, Onehot_X, Y, scoring='accuracy', cv=cv, n_jobs=-1)
KNNMEAN = np.mean(scores) 
print("Accuracy:", KNNMEAN)

Accuracy: 0.9193905817174517
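
The majority-vote rule described for KNN can be sketched in a few lines of NumPy. This is a toy illustration of the idea, not how scikit-learn implements it; the helper `knn_predict` and the toy points are hypothetical:

```python
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify x_query by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k closest points
    nearest = np.argsort(distances)[:k]
    # Majority class among those neighbours
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Two well-separated toy clusters
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
y_toy = np.array(["n", "n", "n", "t2d", "t2d"])

pred_near_origin = knn_predict(X_toy, y_toy, np.array([0.05, 0.05]))
pred_far = knn_predict(X_toy, y_toy, np.array([4.8, 5.2]))
print(pred_near_origin, pred_far)
```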



# 4. Decision Tree Classifier
- A Decision Tree Classifier is a non-parametric supervised learning method used for classification and regression.
- A single decision tree is not itself an ensemble, but it is the building block of ensemble methods such as random forests and gradient boosting, which combine many trees into a final prediction.
- Decision trees are easy to interpret and understand, and in principle they can handle both numerical and categorical data (scikit-learn's implementation expects numeric input, which is why we encode first).
- Limiting the depth of the tree or pruning it is the usual way to manage the bias-variance tradeoff, since a fully grown tree tends to overfit.

In [38]:
DT = DecisionTreeClassifier()
cv = KFold(n_splits=10, random_state=1, shuffle=True)
scores = cross_val_score(DT, Onehot_X, Y, scoring='accuracy', cv=cv, n_jobs=-1)
DTMEAN = np.mean(scores)
print("Accuracy:", DTMEAN)

Accuracy: 0.9592797783933518
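
The interpretability of decision trees can be shown by printing the learned rules. This sketch uses the iris dataset bundled with scikit-learn rather than the abundance data, and the depth limit of 2 is an arbitrary choice for readability:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the printed rules short
small_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
small_tree.fit(iris.data, iris.target)

# Text rendering of the split rules and leaf classes
rules = export_text(small_tree, feature_names=list(iris.feature_names))
print(rules)
```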


# 5. Random Forest Classifier
- Random Forest Classifier is an ensemble method that uses multiple decision trees to make a prediction.
- Each tree is trained independently on a bootstrap sample of the data, and the trees' predictions are combined into a final prediction. *basically, decision trees pro max*
- It is a good choice for multi-class classification problems because it supports multiple classes natively and is usually less prone to overfitting than a single decision tree.

In [17]:
RTC = RandomForestClassifier()
cv = KFold(n_splits=10, random_state=1, shuffle=True)
scores = cross_val_score(RTC, Onehot_X, Y, scoring='accuracy', cv=cv, n_jobs=-1)
RTCMEAN = np.mean(scores)
print("Accuracy:", RTCMEAN)

Accuracy: 0.954016620498615
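
How a random forest combines its trees can be checked directly: in scikit-learn, the forest's class probabilities are the average of the individual trees' probabilities. A minimal sketch on synthetic data (the dataset and the choice of 25 trees are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=5, n_classes=3, random_state=0
)

forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_demo, y_demo)

# Average the per-tree class probabilities by hand
tree_probs = np.mean([tree.predict_proba(X_demo) for tree in forest.estimators_], axis=0)
forest_probs = forest.predict_proba(X_demo)

print("Trees in the forest:", len(forest.estimators_))
print("Manual average matches the forest:", np.allclose(tree_probs, forest_probs))
```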


# 6. Gradient Boosting Classifier
- Gradient Boosting Classifier is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees.
- Gradient boosting builds its trees sequentially: each new tree is fitted to reduce the loss left by the trees before it.
- It is a good choice for multi-class classification problems because it handles multiple classes natively, though it can overfit if the number of trees or the learning rate is set too high.

The only issue with gradient boosting is its computational cost and training time.

In [36]:
GBC = GradientBoostingClassifier()
cv = KFold(n_splits=10, random_state=1, shuffle=True)
scores = cross_val_score(GBC, Onehot_X, Y, scoring='accuracy', cv=cv, n_jobs=-1)
GBCMEAN = np.mean(scores)
print("Accuracy:", GBCMEAN)

KeyboardInterrupt: 

# 7. Naive Bayes: GaussianNB
- Naive Bayes applies Bayes' theorem with the 'naive' assumption that the features are independent of each other given the class.
- In our case we use GaussianNB because we have continuous data; the other variants are meant for text-based or count-based classification.

In [19]:
NB = GaussianNB()
cv = KFold(n_splits=10, random_state=1, shuffle=True)
scores = cross_val_score(NB, Onehot_X, Y, scoring='accuracy', cv=cv, n_jobs=-1)
NBMEAN = np.mean(scores)
print("Accuracy:", NBMEAN)

Accuracy: 0.9121883656509693


# 8. MLP Classifier
This uses a neural network (a multi-layer perceptron) to classify the data. It can model non-linear relationships, but it needs numeric input and is prone to overfitting unless it is regularized.

We don't know much about neural networks yet, so this is an attempt at understanding them.

It is also very computationally intensive.

You need one-hot encoding for this.

In [23]:
MLP = MLPClassifier(hidden_layer_sizes=(10,), activation='relu', solver='adam', max_iter=10)
cv = KFold(n_splits=10, random_state=1, shuffle=True)
scores = cross_val_score(MLP, Onehot_X, Y, scoring='accuracy', cv=cv, n_jobs=-1)
MLPMEAN = np.mean(scores)
print("Accuracy:", MLPMEAN)

# hidden_layer_sizes: number of nodes in each hidden layer
# activation: the non-linear function applied to each hidden node's output
# solver: optimization algorithm
# max_iter: maximum number of iterations (10 is very low, so the network may not converge)

Accuracy: 0.8360110803324099


# REGULARIZED REGRESSION
- Regularized regression is a form of regression that reduces the complexity of the model to prevent overfitting.
- We use two types of regularization here: Lasso (L1) and Ridge (L2).
- It adds a penalty on the size of the coefficients to the loss, which discourages overly complex models and helps strike a balance between bias and variance.



# 9. Lasso Regression (L1 Regularization, comes with Logistic Regression)
- Lasso stands for Least Absolute Shrinkage and Selection Operator.
- It works well when you have many features and want to reduce the number of features used in the model.
- The L1 penalty can shrink some coefficients exactly to zero, which acts as feature selection.

In [25]:
LRL1 = LogisticRegression(penalty='l1', solver='saga', C=0.1)
# penalty selects the L1 penalty, solver is the optimization algorithm,
# C is the inverse of the regularization strength (smaller C = stronger penalty)
cv = KFold(n_splits=10, random_state=1, shuffle=True)
scores = cross_val_score(LRL1, Onehot_X, Y, scoring='accuracy', cv=cv, n_jobs=-1)
LRL1MEAN = np.mean(scores)
print("Accuracy:", LRL1MEAN)

Accuracy: 0.9246537396121883


# 10. Ridge Regression (L2 Regularization, comes with Logistic Regression)
- Ridge penalizes the sum of the squared values of the coefficients, preventing them from growing too large.
- Useful when you have a large number of correlated features and want to prevent overfitting.
- Does not perform feature selection: coefficients shrink towards zero but rarely reach exactly zero.

In [27]:
LRL2 = LogisticRegression(penalty = 'l2', solver = 'saga', C = 0.1)
cv = KFold(n_splits=10, random_state=1, shuffle=True)
scores = cross_val_score(LRL2, Onehot_X, Y, scoring='accuracy', cv=cv, n_jobs=-1)
LRL2MEAN = np.mean(scores)
print("Accuracy:", LRL2MEAN)

Accuracy: 0.948753462603878


# Model Evaluation
We will now evaluate the models based on the cross validation scores and then select the best model based on the cross validation scores.

In [41]:
model_scores = {
    'SVC': SVCMEAN,
    'Logistic Regression': LRMEAN,
    'KNN': KNNMEAN,
    'Decision Tree': DTMEAN,
    'Random Forest': RTCMEAN,
    'Naive Bayes': NBMEAN,
    'MLP': MLPMEAN,
    'Lasso': LRL1MEAN,
    'Ridge': LRL2MEAN
}

best_model = max(model_scores.items(), key=lambda x: x[1])

print("Model Performance Summary:")
for model, score in model_scores.items():
    print(f"{model}: {score:.4f}")

print("\nBest Performing Model:")
print(f"{best_model[0]} with accuracy: {best_model[1]:.4f}")

Model Performance Summary:
SVC: 0.9147
Logistic Regression: 0.9571
KNN: 0.9194
Decision Tree: 0.9593
Random Forest: 0.9540
Naive Bayes: 0.9122
MLP: 0.8360
Lasso: 0.9247
Ridge: 0.9488

Best Performing Model:
Decision Tree with accuracy: 0.9593
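
The top three mean scores above are within about half a percentage point of each other, so it can help to look at the spread across folds before declaring a winner. A minimal sketch of reporting mean and standard deviation per model, on synthetic data with two illustrative candidates:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=0
)

candidates = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),
}
cv_demo = KFold(n_splits=5, shuffle=True, random_state=1)

# Store (mean, std) of the fold accuracies for each model
summary = {}
for name, model in candidates.items():
    fold_scores = cross_val_score(model, X_demo, y_demo, scoring="accuracy", cv=cv_demo)
    summary[name] = (fold_scores.mean(), fold_scores.std())
    print(f"{name}: {fold_scores.mean():.4f} +/- {fold_scores.std():.4f}")

best_name = max(summary, key=lambda name: summary[name][0])
print("Best by mean accuracy:", best_name)
```

If the gap between two models is smaller than their standard deviations, the ranking may simply reflect how the folds happened to be drawn.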


So we can infer that the best model is the Decision Tree Classifier, with a cross validation score of 0.9593.

Now let's train the model on the training data and then predict the values of the test data.


Data splitting for best model


In [91]:
# Remove rows where Y contains '-', 'n', or 'y'
mask = ~Y.isin(['-', 'n', 'y'])
Y_cleaned = Y[mask]

# Filter the corresponding rows in Onehot_X as well
Onehot_X_cleaned = Onehot_X[mask.to_numpy()]

# Now split the cleaned data
x_train, x_test, y_train, y_test = train_test_split(Onehot_X_cleaned, Y_cleaned, test_size=0.2, random_state=42)
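
With rare classes, a plain random split can leave a class with very few or no test samples. `train_test_split` accepts a `stratify` argument that keeps the class proportions the same in both parts. A sketch on dummy data with a 90/10 class imbalance (the labels are made up):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array(["common"] * 90 + ["rare"] * 10)

# stratify=y_demo keeps the 90/10 ratio in both the train and test parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

rare_in_test = int(np.sum(y_te == "rare"))
print("Test size:", len(y_te), "rare samples in test:", rare_in_test)
```

Stratification needs at least two samples of every class, so very rare classes would have to be merged or removed first.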


Model prediction and evaluation

In [97]:
best_model = DT
best_model.fit(x_train, y_train)
y_prob = best_model.predict_proba(x_test)  # one column per class, ordered as best_model.classes_
y_pred = best_model.predict(x_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
recall = recall_score(y_test, y_pred, average='weighted')
print("Recall:", recall)
precision = precision_score(y_test, y_pred, average='weighted')
print("Precision:", precision)
print(classification_report(y_test, y_pred))

Accuracy: 0.9230769230769231
Recall: 0.9230769230769231
Precision: 0.9506454772079772
                            precision    recall  f1-score   support

                         -       1.00      1.00      1.00         3
                    cancer       1.00      1.00      1.00         7
                 cirrhosis       1.00      1.00      1.00        21
         ibd_crohn_disease       1.00      0.33      0.50         6
    ibd_ulcerative_colitis       0.85      1.00      0.92        23
impaired_glucose_tolerance       0.56      0.90      0.69        10
             large_adenoma       1.00      1.00      1.00         4
                   leaness       1.00      1.00      1.00        16
                n_relative       0.93      1.00      0.97        14
                     obese       1.00      0.33      0.50         3
                   obesity       1.00      1.00      1.00        34
                overweight       0.25      1.00      0.40         1
             small_adenoma   

# Comparison with Metaphlan2's outcomes

This is a multi-class classification problem, so we compare the models on accuracy (accuracy_score) and on the macro-averaged F1 score (f1_score).

In [95]:
data = {
    'Metric': ['Accuracy', 'f1_score'],
    'DecisionTrees': [accuracy_score(y_test, y_pred) * 100, f1_score(y_test, y_pred, average='macro') * 100],
    'Metaphlan2': [91.56, 0.9257733827354075 * 100]
}
table = pd.DataFrame(data)
display(table)

Unnamed: 0,Metric,DecisionTrees,Metaphlan2
0,Accuracy,91.826923,91.56
1,f1_score,85.273648,92.577338
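
The gap between accuracy and macro F1 in the table is typical when small classes are predicted poorly: macro averaging gives every class the same weight regardless of its size. A tiny hand-checkable sketch with made-up labels:

```python
from sklearn.metrics import accuracy_score, f1_score

# 8 samples of class 'a', 2 of class 'b'; one 'b' is misclassified as 'a'
y_true = ["a"] * 8 + ["b"] * 2
y_hat = ["a"] * 8 + ["a", "b"]

acc = accuracy_score(y_true, y_hat)
macro_f1 = f1_score(y_true, y_hat, average="macro")
weighted_f1 = f1_score(y_true, y_hat, average="weighted")

print(f"accuracy:    {acc:.4f}")
print(f"macro F1:    {macro_f1:.4f}")
print(f"weighted F1: {weighted_f1:.4f}")
```

Here the per-class F1 scores are 16/17 for `a` and 2/3 for `b`, so the single mistake on the small class pulls the macro average well below the accuracy.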


# Inference
- For our dataset, we have used cross validation to find the best algorithm for classification of diseases.
- We have found that the Decision Tree Classifier is the best algorithm for our dataset, as it shows the highest cross validation score (0.9593), narrowly ahead of Logistic Regression (0.9571) and Random Forest (0.9540).
- We have trained the Decision Tree Classifier on the training split and then predicted the values of the test data.
- We have noticed that DecisionTrees

# ROC Curve

This ROC curve is not the usual binary ROC curve: there are multiple classes, so the 'predict_proba' function returns one probability per class.

Here we plot a one-vs-rest ROC curve for each class and calculate the AUC for each class.

ML MODEL FOR EVERY DISEASE

In [None]:
import numpy as np
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

# Clean the data first
mask = ~Y.isin(['-', 'n', 'y'])
Y_clean = Y[mask]
X_clean = X[mask]

# Split the cleaned data
x_train, x_test, y_train, y_test = train_test_split(X_clean, Y_clean, test_size=0.2, random_state=42)

def plot_multiclass_roc(classifier, X_test, y_test):
    # Probability scores, one column per class in classifier.classes_ order
    y_scores = classifier.predict_proba(X_test)

    # Initialize plot
    plt.figure(figsize=(10, 8))

    # Calculate a one-vs-rest ROC curve for each class the model knows
    for i, class_name in enumerate(classifier.classes_):
        # Convert to binary problem (one-vs-rest)
        y_binary = (np.asarray(y_test) == class_name).astype(int)

        # ROC is undefined unless the test set has both positives and negatives
        if np.unique(y_binary).size < 2:
            continue

        # Calculate ROC curve
        fpr, tpr, _ = roc_curve(y_binary, y_scores[:, i])
        roc_auc = auc(fpr, tpr)

        # Plot ROC curve
        plt.plot(fpr, tpr, label=f'{class_name} (AUC = {roc_auc:.2f})')

    # Plot diagonal line
    plt.plot([0, 1], [0, 1], 'k--')

    # Set plot details
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Multi-class ROC Curves')
    plt.legend(loc='lower right')
    plt.grid(True)
    plt.show()

# Keep only the numerical columns (the categorical columns are not encoded here)
numeric_features = X_clean.select_dtypes(include=[np.number]).columns.tolist()

# Train on the training split only, then plot ROC curves on the held-out test split
best_model.fit(x_train[numeric_features], y_train)
plot_multiclass_roc(best_model, x_test[numeric_features], y_test)

In [None]:
import numpy as np
from sklearn.metrics import roc_auc_score

def calculate_roc_scores(classifier, X_test, y_test):
    """
    Calculate one-vs-rest ROC AUC scores for multi-class classification.
    Returns a dictionary with class-wise and averaged scores.
    Classes without both positive and negative test samples are skipped.
    """
    # Get probabilities and classes
    y_prob = classifier.predict_proba(X_test)
    classes = classifier.classes_

    # Binary indicator matrix: one column per class
    y_bin = (np.asarray(y_test)[:, None] == classes[None, :]).astype(int)

    # ROC AUC needs both positives and negatives in a column
    positives = y_bin.sum(axis=0)
    usable = (positives > 0) & (positives < len(y_bin))

    # Calculate ROC AUC for each usable class
    roc_scores = {}
    for i in np.flatnonzero(usable):
        roc_scores[f'Class_{classes[i]}'] = roc_auc_score(y_bin[:, i], y_prob[:, i])

    # Calculate micro and macro average over the usable classes
    roc_scores['macro_avg'] = roc_auc_score(y_bin[:, usable], y_prob[:, usable], average='macro')
    roc_scores['micro_avg'] = roc_auc_score(y_bin[:, usable], y_prob[:, usable], average='micro')

    return roc_scores

# Usage example (same test split and numeric columns as the ROC plot):
scores = calculate_roc_scores(best_model, x_test[numeric_features], y_test)
for metric, score in scores.items():
    print(f"{metric}: {score:.3f}")