## **Owner**: Ahmed Tarek Mohamed

# **Data Frame One (Obesity)**

# **1. Import Libraries & Tools**

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report

**Setting up the environment by importing essential libraries and tools**
 * import pandas as pd: This imports the Pandas library, which is widely used for data manipulation and analysis in Python. It is typically used to work with structured data in tabular form.

 * from sklearn.model_selection import train_test_split: This imports the train_test_split function from scikit-learn, a machine learning library. It's used for splitting a dataset into training and testing sets.

 * from sklearn.preprocessing import StandardScaler, LabelEncoder: This imports the StandardScaler and LabelEncoder classes from scikit-learn. StandardScaler is often used to standardize or scale numerical features, while LabelEncoder is used to convert categorical labels into numeric format.

 * from sklearn.linear_model import LogisticRegression: This imports the LogisticRegression class, a linear model for classification. It is best known for binary problems, but scikit-learn's implementation also handles multiclass targets such as the ones used in this notebook.

 * from sklearn.tree import DecisionTreeClassifier: This imports the DecisionTreeClassifier class, which is used for constructing decision tree models. Decision trees are used for both classification and regression tasks.

 * from sklearn.ensemble import RandomForestClassifier: This imports the RandomForestClassifier class, which is an ensemble learning method based on constructing multiple decision trees and combining their outputs. It's often used for classification tasks.

 * from sklearn.svm import SVC: This imports the SVC class, which stands for Support Vector Classification. It is used for classification tasks and is a part of the support vector machines (SVM) family.

 * from sklearn.neighbors import KNeighborsClassifier: This imports the KNeighborsClassifier class, which is used for k-nearest neighbors classification. It classifies a data point based on the majority class of its k nearest neighbors.

 * from sklearn.naive_bayes import GaussianNB: This imports the GaussianNB class, which is used for Gaussian Naive Bayes classification. It is based on Bayes' theorem, assumes the features are conditionally independent given the class, and models each feature with a normal distribution within each class.

 * from sklearn.metrics import accuracy_score, classification_report: This imports evaluation metrics from scikit-learn. accuracy_score is a common metric for classification problems, and classification_report provides a comprehensive report with precision, recall, and F1-score for each class.
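
To make the two preprocessing classes above more concrete, here is a minimal sketch on toy values. The lists `transport` and `heights` are hypothetical and are not taken from the notebook's datasets; the sketch only illustrates how LabelEncoder assigns integer codes and what StandardScaler does to a numeric column.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Hypothetical toy values, not taken from the notebook's datasets.
transport = ["Walking", "Bike", "Walking", "Automobile"]
heights = np.array([[1.60], [1.70], [1.80], [1.90]])

encoder = LabelEncoder()
codes = encoder.fit_transform(transport)

# classes_ is sorted, so the integer codes follow alphabetical order.
print(list(encoder.classes_))   # ['Automobile', 'Bike', 'Walking']
print(codes.tolist())           # [2, 1, 2, 0]

# The fitted encoder can map the codes back to the original strings.
print(list(encoder.inverse_transform(codes)))

# StandardScaler subtracts the column mean and divides by its standard deviation.
scaler = StandardScaler()
scaled = scaler.fit_transform(heights)
print(scaled.mean(), scaled.std())
```

The integer codes carry no meaning beyond identity: code 2 is not "larger" than code 0 in any real sense, which is worth keeping in mind when such codes are fed to distance-based or linear models.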

# **2. Import Dataset**

In [None]:
# Load the dataset
data = pd.read_csv('/Datasets/ObesityDataSet.csv')
data.head()

| | Gender | Age | Height | Weight | family_history_with_overweight | FAVC | FCVC | NCP | CAEC | SMOKE | CH2O | SCC | FAF | TUE | CALC | MTRANS | NObeyesdad |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Female | 21.0 | 1.62 | 64.0 | yes | no | 2.0 | 3.0 | Sometimes | no | 2.0 | no | 0.0 | 1.0 | no | Public_Transportation | Normal_Weight |
| 1 | Female | 21.0 | 1.52 | 56.0 | yes | no | 3.0 | 3.0 | Sometimes | yes | 3.0 | yes | 3.0 | 0.0 | Sometimes | Public_Transportation | Normal_Weight |
| 2 | Male | 23.0 | 1.8 | 77.0 | yes | no | 2.0 | 3.0 | Sometimes | no | 2.0 | no | 2.0 | 1.0 | Frequently | Public_Transportation | Normal_Weight |
| 3 | Male | 27.0 | 1.8 | 87.0 | no | no | 3.0 | 3.0 | Sometimes | no | 2.0 | no | 2.0 | 0.0 | Frequently | Walking | Overweight_Level_I |
| 4 | Male | 22.0 | 1.78 | 89.8 | no | no | 2.0 | 1.0 | Sometimes | no | 2.0 | no | 0.0 | 0.0 | Sometimes | Public_Transportation | Overweight_Level_II |


# **3. Preprocess Dataset**

In [None]:
# Encoding categorical variables
label_encoders = {}
for column in data.select_dtypes(include=['object']).columns:
    label_encoders[column] = LabelEncoder()
    data[column] = label_encoders[column].fit_transform(data[column])

# Splitting the data into features and target
X = data.drop('NObeyesdad', axis=1)
y = data['NObeyesdad']

# Normalizing numerical variables
scaler = StandardScaler()
X = scaler.fit_transform(X)

# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


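* **Encoding Categorical Variables:**
 * Iterate over the columns with data type 'object'.
 * For each column, create a LabelEncoder and store it in label_encoders.
 * Fit and transform the column's values into integer labels, replacing the original column.
 * The target column 'NObeyesdad' is also of type 'object', so it is encoded too. This is why the classification reports below list classes 0 to 6; the original names can be recovered from label_encoders['NObeyesdad'].classes_.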

* **Splitting into Features and Target:**
 * Create features (X) by excluding the 'NObeyesdad' column.
 * Set the target variable (y) to the values in the 'NObeyesdad' column.

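* **Normalizing Numerical Variables:**
 * Use StandardScaler to standardize every column of X, including the label-encoded categorical ones.
 * Standardization subtracts each column's mean and divides by its standard deviation, giving zero mean and unit variance.
 * The scaler is fitted on the full dataset before the split, so the test rows also contribute to the means and standard deviations.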

* **Splitting into Training and Testing Sets:**
 * Use train_test_split to split the dataset.
 * Allocate 70% for training (X_train, y_train) and 30% for testing (X_test, y_test).
 * Set random_state for reproducibility.
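
The preprocessing cell fits StandardScaler on all rows before splitting. A common alternative is to split first and fit the scaler on the training rows only, so that no statistics from the test set reach the model. The sketch below shows that ordering on synthetic data; `X_demo`, `y_demo`, and `demo_scaler` are hypothetical names, and `stratify` is an optional addition that is not used in the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the encoded feature matrix and target.
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 4))
y_demo = np.repeat(np.arange(4), 50)  # four balanced classes

# Split first; stratify keeps the class proportions equal in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Fit the scaler on the training rows only, then reuse it on the test rows.
demo_scaler = StandardScaler().fit(X_tr)
X_tr_scaled = demo_scaler.transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)

print(X_tr_scaled.shape, X_te_scaled.shape)  # (140, 4) (60, 4)
print(np.bincount(y_te))                     # per-class counts in the test set
```

With this ordering the test columns are no longer exactly zero-mean, which is expected: they are scaled with the training statistics, just as unseen data would be.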

# **4. Classification Models**

In [None]:
# Initializing models
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB()
}

* **Logistic Regression:**
 Logistic Regression is a linear classifier. It is best known for binary problems, but scikit-learn's implementation also handles multiclass targets such as the seven obesity levels here. max_iter=1000 raises the solver's iteration limit from the default of 100 so that it has room to converge.

* **Decision Tree:**
A Decision Tree predicts a class by applying a sequence of feature tests learned from the training data, following one path from the root to a leaf. Trees can be used for both classification and regression; DecisionTreeClassifier is the classification variant.

* **Random Forest:**
Random Forest is an ensemble method that trains many decision trees on bootstrap samples of the training data and combines their predictions. Because no random_state is set, its results can vary slightly between runs.

* **SVM (Support Vector Machine):**
SVM separates classes with a maximum-margin boundary. SVC() uses an RBF kernel by default, so the boundary is a hyperplane in the kernel-induced feature space and can be nonlinear in the original features.

* **K-Nearest Neighbors (KNN):**
KNN classifies a data point by the majority class of its k nearest neighbors in the training set. KNeighborsClassifier() uses k = 5 by default.

* **Naive Bayes:**
Naive Bayes is a probabilistic model based on Bayes' theorem that assumes the features are conditionally independent given the class. GaussianNB additionally models each feature as normally distributed within each class.
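
As an illustration of the KNN description above, the following sketch implements the majority vote by hand and compares it with KNeighborsClassifier. The points, the `knn_predict` helper, and the choice of k = 3 are hypothetical and serve only to show the mechanism; this is not how scikit-learn implements the search internally.

```python
from collections import Counter

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical 2-D points in two well-separated groups.
X_pts = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_pts = np.array([0, 0, 0, 1, 1, 1])


def knn_predict(query, X, y, k=3):
    """Return the majority class among the k training points closest to query."""
    distances = np.linalg.norm(X - query, axis=1)  # Euclidean distances
    nearest = np.argsort(distances)[:k]            # indices of the k closest points
    return Counter(y[nearest].tolist()).most_common(1)[0][0]


query = np.array([4.5, 4.5])
manual = knn_predict(query, X_pts, y_pts, k=3)

# The library classifier with the same k should agree on this easy case.
knn = KNeighborsClassifier(n_neighbors=3).fit(X_pts, y_pts)
library = int(knn.predict(query.reshape(1, -1))[0])

print(manual, library)  # 1 1
```

Because the vote depends on Euclidean distance, features on larger scales dominate the result, which is one reason the notebook standardizes the features before fitting KNN.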

In [None]:
# Training and evaluating models
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred)
    results[name] = {'accuracy': accuracy, 'report': report}

# Displaying the results
for model_name, metrics in results.items():
    print(f"Model: {model_name}")
    print(f"Accuracy: {metrics['accuracy']}")
    print("Classification Report:")
    print(metrics['report'])
    print("---------------------------------------------------")

Model: Logistic Regression
Accuracy: 0.8596214511041009
Classification Report:
              precision    recall  f1-score   support

           0       0.81      0.99      0.89        86
           1       0.86      0.60      0.71        93
           2       0.94      0.90      0.92       102
           3       0.91      0.98      0.94        88
           4       1.00      0.99      0.99        98
           5       0.74      0.73      0.74        88
           6       0.74      0.82      0.78        79

    accuracy                           0.86       634
   macro avg       0.86      0.86      0.85       634
weighted avg       0.86      0.86      0.86       634

---------------------------------------------------
Model: Decision Tree
Accuracy: 0.9085173501577287
Classification Report:
              precision    recall  f1-score   support

           0       0.88      0.95      0.92        86
           1       0.82      0.78      0.80        93
           2       0.97      0.90   

* **Training and Evaluating Models:**
 * Iterates through each model in the models dictionary.
 * Calls the fit method to train the model using the training data (X_train, y_train).
 * Predicts the target variable (y_pred) using the test data (X_test).
 * Calculates the accuracy of the predictions using accuracy_score.
 * Generates a classification report using classification_report.
 * Stores the results (accuracy and report) in the results dictionary, with the model name as the key.

* **Displaying the Results:**
 * Iterates through the results of each model stored in the results dictionary.
 * Prints the model name, accuracy, and the classification report for each model.
 * Separates the output with a line of dashes for better readability.
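
The printed reports are easy to read one at a time but hard to compare across models. One possible way to get a compact comparison is to request the report as a dictionary and collect a few numbers per model into a DataFrame. The sketch below assumes synthetic data from make_classification and only two of the six models; `demo_models`, `rows`, and `summary` are hypothetical names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the notebook's own splits would be used instead.
X_syn, y_syn = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=0
)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=42
)

demo_models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

rows = []
for name, model in demo_models.items():
    pred = model.fit(Xs_train, ys_train).predict(Xs_test)
    # output_dict=True returns nested dicts instead of a formatted string.
    report = classification_report(ys_test, pred, output_dict=True, zero_division=0)
    rows.append({
        "model": name,
        "accuracy": accuracy_score(ys_test, pred),
        "macro_f1": report["macro avg"]["f1-score"],
    })

# One row per model, best accuracy first.
summary = pd.DataFrame(rows).sort_values("accuracy", ascending=False)
print(summary.to_string(index=False))
```

Macro F1 is included next to accuracy because it weights every class equally, which makes weak classes visible even when overall accuracy looks good.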

# **Data Frame Two (Diabetes)**

# **1. Import Dataset**

In [None]:
# Load the dataset
data2 = pd.read_csv('/content/DiabetesDataSet.csv')
data2.head()

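| | Diabetes_012 | HighBP | HighChol | CholCheck | BMI | Smoker | Stroke | HeartDiseaseorAttack | PhysActivity | Fruits | ... | AnyHealthcare | NoDocbcCost | GenHlth | MentHlth | PhysHlth | DiffWalk | Sex | Age | Education | Income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 1 | 1 | 40 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 5.0 | 18.0 | 15.0 | 1.0 | 0.0 | 9.0 | 4.0 | 3.0 |
| 1 | 0 | 0 | 0 | 0 | 25 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 1.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 7.0 | 6.0 | 1.0 |
| 2 | 0 | 1 | 1 | 1 | 28 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 1.0 | 1.0 | 5.0 | 30.0 | 30.0 | 1.0 | 0.0 | 9.0 | 4.0 | 8.0 |
| 3 | 0 | 1 | 0 | 1 | 27 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 1.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 11.0 | 3.0 | 6.0 |
| 4 | 0 | 1 | 1 | 1 | 24 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 1.0 | 0.0 | 2.0 | 3.0 | 0.0 | 0.0 | 0.0 | 11.0 | 5.0 | 4.0 |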


# **2. Preprocess Dataset**

In [None]:
# Encoding categorical variables
label_encoders2 = {}
for column in data2.select_dtypes(include=['object']).columns:
    label_encoders2[column] = LabelEncoder()
    data2[column] = label_encoders2[column].fit_transform(data2[column])

# Splitting the data into features and target
XX = data2.drop('Diabetes_012', axis=1)
yy = data2['Diabetes_012']

# Normalizing numerical variables
scaler2 = StandardScaler()
XX = scaler2.fit_transform(XX)

# Selecting the first 2111 rows
XX_selected = XX[:2111]
yy_selected = yy[:2111]

# Splitting the selected dataset into training and testing sets
XX_train, XX_test, yy_train, yy_test = train_test_split(XX_selected, yy_selected, test_size=0.3, random_state=42)


* **Encoding Categorical Variables:**
 * Similar to the previous code, this loop iterates over each column in data2 with data type 'object.'
 * Creates a LabelEncoder for each column.
 * Fits and transforms the column's values into numerical labels.
 * Replaces the original column with these numerical labels.
 * The label_encoders2 dictionary is used to store the LabelEncoder objects for potential later use.

* **Splitting into Features and Target:**
 * Creates features (XX) by excluding the 'Diabetes_012' column.
 * Sets the target variable (yy) to the values in the 'Diabetes_012' column.

* **Normalizing Numerical Variables:**
 * Standardizes every column of XX using StandardScaler. The scaler is fitted on all rows, before the subset is selected.

* **Selecting Subset of Data:**
 * Selects the first 2111 rows of the scaled features and of the target (XX and yy).
 * With test_size=0.3 this produces a 634-row test set, the same size as the Obesity test set.
 * The rows are taken in file order rather than sampled at random, so the subset is representative only if the file is not sorted in some meaningful way.

* **Splitting into Training and Testing Sets:**
 * Splits the selected dataset into training and testing sets using the train_test_split function.
 * Allocates 70% for training (XX_train, yy_train) and 30% for testing (XX_test, yy_test).
 * random_state is set for reproducibility.
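
Taking the first 2111 rows keeps whatever ordering the file happens to have. If a random subset with the same class balance as the full data is preferred, one option is to let train_test_split draw it. The sketch below is an assumption-laden illustration: `X_full` and `y_full` are synthetic, and the class proportions are invented rather than measured from the Diabetes file.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced three-class target; proportions are invented.
rng = np.random.default_rng(0)
y_full = rng.choice([0, 1, 2], size=10_000, p=[0.80, 0.05, 0.15])
X_full = rng.normal(size=(10_000, 5))

# Draw 2111 rows at random while preserving the class proportions.
# The remaining rows are returned as the "test" part and discarded here.
X_sub, _, y_sub, _ = train_test_split(
    X_full, y_full, train_size=2111, random_state=42, stratify=y_full
)

print(X_sub.shape)                        # (2111, 5)
print(np.bincount(y_full) / len(y_full))  # class shares in the full data
print(np.bincount(y_sub) / len(y_sub))    # class shares in the subset
```

The two printed share vectors should be nearly identical, which is the point of stratifying: rare classes keep their relative frequency in the smaller sample.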

# **3. Classification Models**

In [None]:
# Initializing models
models2 = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB()
}

* These are the same six models, with the same settings, as described in section 4 of the Obesity data frame. New instances are created in models2 so that they are fitted independently on the Diabetes data.

In [None]:
# Training and evaluating models
results2 = {}
for name, model in models2.items():
    model.fit(XX_train, yy_train)
    yy_pred = model.predict(XX_test)
    accuracy2 = accuracy_score(yy_test, yy_pred)
    report2 = classification_report(yy_test, yy_pred, zero_division=1)
    results2[name] = {'accuracy': accuracy2, 'report': report2}

# Displaying the results
for model_name, metrics in results2.items():
    print(f"Model: {model_name}")
    print(f"Accuracy: {metrics['accuracy']}")
    print("Classification Report:")
    print(metrics['report'])
    print("---------------------------------------------------")


Model: Logistic Regression
Accuracy: 0.7523659305993691
Classification Report:
              precision    recall  f1-score   support

           0       0.78      0.94      0.85       477
           1       1.00      0.00      0.00         9
           2       0.49      0.20      0.29       148

    accuracy                           0.75       634
   macro avg       0.76      0.38      0.38       634
weighted avg       0.72      0.75      0.71       634

---------------------------------------------------
Model: Decision Tree
Accuracy: 0.6451104100946372
Classification Report:
              precision    recall  f1-score   support

           0       0.78      0.76      0.77       477
           1       0.00      0.00      0.00         9
           2       0.30      0.31      0.31       148

    accuracy                           0.65       634
   macro avg       0.36      0.36      0.36       634
weighted avg       0.66      0.65      0.65       634

----------------------------------

* **Training and Evaluating Models:**
 * Iterates through each model in the models2 dictionary (the same six model types as before, re-instantiated for the new dataset).
 * Calls the fit method to train the model using the training data (XX_train, yy_train).
 * Predicts the target variable (yy_pred) using the test data (XX_test).
 * Calculates the accuracy of the predictions using accuracy_score.
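 * Generates a classification report using classification_report. zero_division=1 makes precision read 1.00 for a class the model never predicts, which is why Logistic Regression shows precision 1.00 but recall 0.00 for class 1.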
 * Stores the results (accuracy and report) in the results2 dictionary, with the model name as the key.

* **Displaying the Results:**
 * Iterates through the results of each model stored in the results2 dictionary.
 * Prints the model name, accuracy, and the classification report for each model.
 * Separates the output with a line of dashes for better readability.
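
The effect of zero_division is easiest to see on a tiny example in which one class is never predicted. The labels below are hypothetical; the sketch shows that the reported precision for such a class is simply the value passed as zero_division, and adds balanced_accuracy_score as one possible metric that is less flattering than plain accuracy on imbalanced data.

```python
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    precision_score,
    recall_score,
)

# Hypothetical labels: class 1 exists in the truth but is never predicted.
y_true = [0, 0, 0, 0, 1, 2, 2, 2]
y_hat = [0, 0, 0, 0, 0, 2, 2, 0]

# Precision for class 1 is 0/0, so the result is whatever zero_division says.
p_one = float(precision_score(y_true, y_hat, labels=[1], average=None, zero_division=1)[0])
p_zero = float(precision_score(y_true, y_hat, labels=[1], average=None, zero_division=0)[0])

# Recall for class 1 is well defined: 0 of 1 true samples were found.
r_one = float(recall_score(y_true, y_hat, labels=[1], average=None, zero_division=0)[0])

print(p_one, p_zero, r_one)  # 1.0 0.0 0.0

# Plain accuracy versus the mean of per-class recalls.
acc = accuracy_score(y_true, y_hat)
bal = balanced_accuracy_score(y_true, y_hat)
print(acc, bal)
```

A precision of 1.00 next to a recall of 0.00 is therefore a sign that the class was never predicted, not that the model is precise on it.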

# **Final Summary**

* **Summary (Obesity data frame):**
  * Logistic Regression achieved an accuracy of 0.86 on the Obesity test set, with uneven class-wise results: f1-scores of 0.89 and 0.99 for classes 0 and 4, but only 0.71 and 0.74 for classes 1 and 5.
  * Decision Tree achieved an accuracy of 0.91, demonstrating good performance with high precision, recall, and f1-scores in most classes. It outperformed Logistic Regression in terms of accuracy and overall class-wise metrics.
  * Random Forest outperformed both Logistic Regression and Decision Tree with an accuracy of 0.94. It showed high precision, recall, and f1-scores across all classes, indicating strong overall performance.
  * SVM achieved an accuracy of 0.87, with relatively balanced precision, recall, and f1-scores. It performed consistently across different classes, demonstrating robustness.
  * KNN had an accuracy of 0.81, displaying decent performance with varying precision, recall, and f1-scores across classes. It was less accurate compared to Decision Tree, Random Forest, and SVM.
  * Naive Bayes had the lowest accuracy of 0.61, with lower precision, recall, and f1-scores across most classes. It showed limitations in handling the complexity of the data.

* **Conclusion:**
  * On the Obesity data, Random Forest is the top-performing model, with the highest accuracy (0.94) and strong class-wise metrics, followed by Decision Tree at 0.91. SVM (0.87), Logistic Regression (0.86), and KNN (0.81) form a middle group, and Naive Bayes is clearly last at 0.61. These figures come from a single 70/30 split, so small differences between models should be read with caution. On the Diabetes subset, the accuracies shown above are lower (0.75 for Logistic Regression and 0.65 for Decision Tree), and class 1, with only 9 test samples, is not recovered by either model. The best choice still depends on the requirements and characteristics of the dataset, but Random Forest is a strong candidate for the Obesity classification task.
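
Since the ranking above rests on one train/test split, a possible follow-up is to compare models with stratified k-fold cross-validation, which reports a mean and a spread per model. The sketch below is a minimal illustration on synthetic data, with only two models; `candidates`, `cv_scores`, and the choice of five folds are assumptions, not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data; the notebook's features and target would be used instead.
X_cv, y_cv = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=0
)

# Five stratified folds, shuffled once with a fixed seed for reproducibility.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

candidates = {
    "Random Forest": RandomForestClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),
}

cv_scores = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_cv, y_cv, cv=cv, scoring="accuracy")
    cv_scores[name] = (scores.mean(), scores.std())
    print(f"{name}: mean accuracy {scores.mean():.3f}, std {scores.std():.3f}")
```

If two models' mean accuracies differ by less than their fold-to-fold spread, the single-split ranking between them should not be treated as settled.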