In [98]:
# imports
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

### **In the 1st task of this 3rd component we build baseline models without any feature engineering or preprocessing, to get a baseline understanding of how the models perform on raw data. After feature engineering and scaling the data we will train the same machine learning algorithms again and see whether the accuracy goes up or down.**
---
- **<b>Note:</b> later on we will use the dataset from the 2nd component, because there we handled the outliers, removed missing data from every column, and cleaned the dataset. We will import it when we need it.**

In [99]:
# importing the data set.
cdf = pd.read_csv('Datasets/diabetes.csv')

# splitting the features(x) and dependent variable(y).
X = cdf.iloc[:, :-1].values
y = cdf.iloc[:, -1].values

# splitting into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

# checking the number of rows in the training and testing sets
print(X_train.shape[0])
print(X_test.shape[0])

614
154
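The two row counts above follow from the 80/20 split. A minimal sketch of the arithmetic, assuming the test count is rounded up and the training set gets the remainder (this matches the printed shapes; the variable names are illustrative):

```python
import math

n_samples = 614 + 154   # total rows, taken from the two shapes printed above
test_size = 0.2

# Assumption: the test count is rounded up, the training set gets the rest.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```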


In [100]:
# Building a logistic regression model and its confusion matrix on the test set
LRM = LogisticRegression(max_iter=500)
LRM.fit(X_train, y_train)
y_pred_LR = LRM.predict(X_test)
confusion_matrix(y_test, y_pred_LR)

array([[78, 21],
       [18, 37]], dtype=int64)

In [101]:
# Building a K-Nearest Neighbors model and its confusion matrix on the test set
knn = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
knn.fit(X_train, y_train)
y_pred_knn = knn.predict(X_test)
confusion_matrix(y_test, y_pred_knn)


array([[70, 29],
       [23, 32]], dtype=int64)

In [102]:
# Building a decision tree classifier model and its confusion matrix on the test set
dec_tree = DecisionTreeClassifier(criterion='entropy', random_state=0)
dec_tree.fit(X_train, y_train)
y_pred_dec_tree = dec_tree.predict(X_test)
confusion_matrix(y_test, y_pred_dec_tree)


array([[78, 21],
       [17, 38]], dtype=int64)

In [103]:
# Building a random forest classifier model and its confusion matrix on the test set
rand_forest = RandomForestClassifier(n_estimators=100, random_state=0, criterion='entropy')
rand_forest.fit(X_train, y_train)
y_pred_rand_forest = rand_forest.predict(X_test)
confusion_matrix(y_test, y_pred_rand_forest)


array([[81, 18],
       [20, 35]], dtype=int64)

In [104]:
# We use two metrics: accuracy_score, the number of correct predictions divided by the total number of samples in the test set, and f1_score, the harmonic mean of precision and recall.

print(f"The accuracy score of logistic regression is {accuracy_score(y_test, y_pred_LR)}")
print(f"The accuracy score of K-Nearest Neighbors is {accuracy_score(y_test, y_pred_knn)}")
print(f"The accuracy of Decision Tree classifier is {accuracy_score(y_test, y_pred_dec_tree)}")
print(f"The accuracy of Random forest classifier is {accuracy_score(y_test, y_pred_rand_forest)}")
print("\n")
print(f"The f1 score of logistic regression is {f1_score(y_test, y_pred_LR)}")
print(f"The f1 score of K-Nearest Neighbors is {f1_score(y_test, y_pred_knn)}")
print(f"The f1 of Decision Tree classifier is {f1_score(y_test, y_pred_dec_tree)}")
print(f"The f1 of Random forest classifier is {f1_score(y_test, y_pred_rand_forest)}")

The accuracy score of logistic regression is 0.7467532467532467
The accuracy score of K-Nearest Neighbors is 0.6623376623376623
The accuracy of Decision Tree classifier is 0.7532467532467533
The accuracy of Random forest classifier is 0.7532467532467533


The f1 score of logistic regression is 0.6548672566371682
The f1 score of K-Nearest Neighbors is 0.5517241379310346
The f1 of Decision Tree classifier is 0.6666666666666665
The f1 of Random forest classifier is 0.6481481481481481
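
Both metrics can be recomputed by hand from a confusion matrix. The sketch below uses the logistic regression matrix printed earlier, reading rows as actual classes and columns as predicted classes; the variable names are illustrative.

```python
# Confusion matrix printed for logistic regression:
# [[78, 21],
#  [18, 37]]   rows = actual class (0, 1), columns = predicted class (0, 1)
tn, fp = 78, 21
fn, tp = 18, 37

accuracy = (tp + tn) / (tp + tn + fp + fn)   # correct predictions / all predictions
precision = tp / (tp + fp)                   # predicted positives that are really positive
recall = tp / (tp + fn)                      # actual positives that were found
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two

print(f"accuracy  = {accuracy:.4f}")
print(f"precision = {precision:.4f}")
print(f"recall    = {recall:.4f}")
print(f"f1        = {f1:.4f}")
```

The accuracy and F1 values agree with the scores printed above for logistic regression, and the lower F1 reflects the 21 false positives and 18 false negatives on the positive class.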


---
### Observations and Next Steps
***The baseline accuracy scores (about 0.66 to 0.75) and F1 scores (about 0.55 to 0.67) are not yet satisfactory, so there is room to improve how well the models predict the data. To address this, we will now apply feature engineering techniques to improve the input data. Additionally, we will implement hyperparameter tuning using methods such as GridSearchCV or Optuna, which search for the best hyperparameters for each model.***

---

# Feature engineering

**Let's scale our data using StandardScaler. Before that, we will write a function that trains the models and prints their scores, because we will train our models several times and repeating the same long code each time is not productive. With a function we can re-use it.**

- **In this user-defined function we only need to pass the training and testing data. The function trains every model initialised in the models dictionary and prints the metrics for each one (accuracy score and f1 score).**

In [105]:
def build_and_evaluate_classifiers(X_train, X_test, y_train, y_test):
    models = {
        'Logistic Regression': LogisticRegression(),
        'K-Nearest Neighbors': KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2),
        'Decision Tree': DecisionTreeClassifier(criterion='entropy', random_state=0),
        'Random Forest': RandomForestClassifier(n_estimators=100, random_state=0, criterion='entropy')
    }

    for name, model in models.items():
        model.fit(X_train, y_train) # train the model
        y_pred = model.predict(X_test) # predict the outcome based on the test data

        acc = accuracy_score(y_test, y_pred) # calculating accuracy score
        f1 = f1_score(y_test, y_pred) # calculating the f1_score


        print(f"{name}:") # name of the model
        print(f"  Accuracy Score: {round(acc, 3)}") # accuracy score rounded to 3 decimal places
        print(f"  F1 Score      : {round(f1, 3)}") # f1 score rounded to 3 decimal places
        print("-" * 40) # a line to separate the scores of each model

### Let's start by scaling the data with StandardScaler, which standardizes each feature to mean 0 and standard deviation 1. It does not change the shape of the distribution; for roughly normal features most values land approximately in the range -3 to 3. Then we test the accuracy with the scaled data.
---
- **Note: we haven't cleaned the dataset yet. We are using the unclean dataset to check how the models perform with scaling alone.**
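
As a rough illustration of what standardization does, here is a minimal plain-Python sketch. The readings are made up for the example (they are not from the dataset), and it assumes the population standard deviation is used. The mean and standard deviation are learned from the training values only and then re-used for the test values, which is the same fit-then-transform pattern used in the next cell.

```python
from statistics import fmean, pstdev

# Hypothetical readings for one feature, not taken from the dataset.
train = [85.0, 110.0, 140.0, 165.0]
test = [100.0, 180.0]

mu = fmean(train)       # mean learned from the training split only
sigma = pstdev(train)   # population standard deviation of the training split

train_scaled = [(v - mu) / sigma for v in train]
test_scaled = [(v - mu) / sigma for v in test]   # re-uses the training statistics

print([round(v, 3) for v in train_scaled])
print([round(v, 3) for v in test_scaled])
```

After scaling, the training values have mean 0 and standard deviation 1. The test values generally do not, because they are scaled with the training statistics.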

In [106]:
sc = StandardScaler()

Xscaled_train = sc.fit_transform(X_train)
Xscaled_test = sc.transform(X_test)

build_and_evaluate_classifiers(Xscaled_train, Xscaled_test, y_train, y_test)

Logistic Regression:
  Accuracy Score: 0.753
  F1 Score      : 0.661
----------------------------------------
K-Nearest Neighbors:
  Accuracy Score: 0.695
  F1 Score      : 0.544
----------------------------------------
Decision Tree:
  Accuracy Score: 0.76
  F1 Score      : 0.678
----------------------------------------
Random Forest:
  Accuracy Score: 0.753
  F1 Score      : 0.648
----------------------------------------


In [107]:
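ms = MinMaxScaler()
x_train_ms = ms.fit_transform(X_train)
x_test_ms = ms.transform(X_test)

# Evaluate on the MinMax-scaled arrays. The output recorded below came from a run that
# passed Xscaled_train and Xscaled_test, which is why it repeats the StandardScaler scores.
build_and_evaluate_classifiers(x_train_ms, x_test_ms, y_train, y_test)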

Logistic Regression:
  Accuracy Score: 0.753
  F1 Score      : 0.661
----------------------------------------
K-Nearest Neighbors:
  Accuracy Score: 0.695
  F1 Score      : 0.544
----------------------------------------
Decision Tree:
  Accuracy Score: 0.76
  F1 Score      : 0.678
----------------------------------------
Random Forest:
  Accuracy Score: 0.753
  F1 Score      : 0.648
----------------------------------------


# **As we can see, the scores above are close to those of the baseline models, even though we scaled the data. Why did this happen?**
- **The dataset still contains many outliers and missing values, which weaken model performance, so now we will use the clean dataset from our component 2.**
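
To make the outlier effect concrete, here is a small sketch with made-up numbers (not from the dataset). A single extreme value stretches the min-max range, so the remaining values are squeezed into a narrow band near 0 and become hard to tell apart. The helper `min_max` is a hypothetical stand-in for the scaling formula, not the library class.

```python
def min_max(values):
    """Scale values to the range [0, 1] using their own minimum and maximum."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

clean = [20.0, 25.0, 30.0, 35.0, 40.0]   # hypothetical readings
with_outlier = clean + [400.0]           # same readings plus one extreme value

print([round(v, 3) for v in min_max(clean)])
print([round(v, 3) for v in min_max(with_outlier)])
```

Without the outlier the five readings spread evenly over 0 to 1. With it, the same five readings all fall below 0.06.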

In [108]:
# here we will use the clean data from the component 2
cdf = pd.read_csv('Datasets/clean_df_from_component2.csv') # clean dataframe
x = cdf.iloc[:, :-1].values
y = cdf.iloc[:, -1].values
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 42)

In [109]:
xscaledtrain = sc.fit_transform(x_train)
xscaledtest = sc.transform(x_test)

build_and_evaluate_classifiers(xscaledtrain, xscaledtest, y_train, y_test)

Logistic Regression:
  Accuracy Score: 0.778
  F1 Score      : 0.638
----------------------------------------
K-Nearest Neighbors:
  Accuracy Score: 0.752
  F1 Score      : 0.596
----------------------------------------
Decision Tree:
  Accuracy Score: 0.771
  F1 Score      : 0.673
----------------------------------------
Random Forest:
  Accuracy Score: 0.752
  F1 Score      : 0.604
----------------------------------------


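**Now we can see that accuracy improved a little for three of the four models compared with scaling the unclean dataset (for example, K-Nearest Neighbors rose from 0.695 to 0.752), while Random Forest stayed at about 0.75. The F1 score (which combines precision and recall) is mixed: it rose for K-Nearest Neighbors but fell for Logistic Regression and Random Forest. The clean dataset also gives a slightly different test split, so the comparison is indicative rather than exact. Overall, this shows how important it is to clean and feature engineer our dataset.**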

---
- **Now we will see which features (columns) are the most important in the dataset, using the feature importances of a RandomForestClassifier. It gives each feature a score: the greater the score, the more the feature contributes to predicting the dependent variable.**

In [110]:
# feature importance using random forest classifier
from sklearn.ensemble import RandomForestClassifier

# define features and target
X = cdf.drop(columns=['Outcome'])  # Drop target variable
y = cdf['Outcome']

# train a Random Forest model
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# get the feature importances of every column
importances = rf.feature_importances_
feature_names = X.columns

# Print the values
feature_importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': importances})
print(feature_importance_df.sort_values(by='Importance', ascending=False))


                    Feature  Importance
1                   Glucose    0.281294
4                       BMI    0.176908
6                       Age    0.141002
5  DiabetesPedigreeFunction    0.130520
2             BloodPressure    0.094809
0               Pregnancies    0.091749
3             SkinThickness    0.083717


**So Glucose is the most important feature, followed by BMI, Age, and DiabetesPedigreeFunction. These four have the highest importance values in the result, so they are the top features in the dataset.**
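
The importances of a random forest are normalised to sum to 1, so each value can be read as a share of the total. A minimal sketch that recomputes the cumulative share from the table printed above (the values are copied by hand; the variable names are illustrative):

```python
# Importances copied from the table printed above.
importances = {
    "Glucose": 0.281294,
    "BMI": 0.176908,
    "Age": 0.141002,
    "DiabetesPedigreeFunction": 0.130520,
    "BloodPressure": 0.094809,
    "Pregnancies": 0.091749,
    "SkinThickness": 0.083717,
}

ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
total = sum(importances.values())
top4_share = sum(value for _, value in ranked[:4])

running = 0.0
for name, value in ranked:
    running += value
    print(f"{name:<26}{value:.3f}  cumulative {running:.3f}")

print(f"total = {total:.3f}, top 4 share = {top4_share:.3f}")
```

Read this way, the top four features account for roughly 73% of the total importance.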

---
**We have now seen that cleaning the data matters alongside scaling: in the results above, accuracy improved for most models when we used the clean, scaled data instead of the unclean data with scaling alone, although the F1 scores were mixed. Now we will apply GridSearchCV to tune our models and improve their performance. In other words, we will do hyperparameter tuning for each model and see which gives us the best accuracy.**

---

In [111]:
# imports
from sklearn.model_selection import GridSearchCV

# The hyperparameter names are taken from the sklearn docs. The models dictionary holds key-value pairs of
# model name (key) and a tuple (value) with the model instance and its parameter grid. GridSearchCV searches
# each grid exhaustively to find the best hyperparameters.

models = {
    'Logistic Regression': (LogisticRegression(max_iter=500), {
        'penalty': ['l2'],
        'C': [0.1, 1, 10],
        'solver': ['liblinear']
    }),
    'KNN': (KNeighborsClassifier(), {
        'n_neighbors': [3, 5, 7],
        'weights': ['uniform', 'distance'],
        'p': [1, 2]
    }),
    'Decision Tree': (DecisionTreeClassifier(), {
        'criterion': ['gini', 'entropy'],
        'max_depth': [None, 10, 20],
        'min_samples_split': [2, 5]
    }),
    'Random Forest': (RandomForestClassifier(), {
        'n_estimators': [100, 200],
        'max_depth': [None, 10, 20],
        'min_samples_split': [2, 5]
    })
}

best_models = {} # This dictionary will store the best results for each model.

for name, (model, param_grid) in models.items():
    grid_search = GridSearchCV(model, param_grid, cv=4, scoring='accuracy', n_jobs=-1) # initialized GridSearchCV for Hyperparameter Tuning

    grid_search.fit(xscaledtrain, y_train) # fit GridSearchCV on the scaled training data (xscaledtrain, y_train); it runs 4-fold cross-validation for every combination of hyperparameters

    best_models[name] = {
        "estimator": grid_search.best_estimator_, # the best estimator meaning the model with best hyperparameters
        "Best Score": grid_search.best_score_, # best cross validation score.
        "best_params": grid_search.best_params_, # best hyperparameters(the one we need the most)
        "test_score": grid_search.best_estimator_.score(xscaledtest, y_test) # test score with the hyperparameters
    }

# Loop over the best_models dictionary and print the results for each model: the best estimator, the best hyperparameters, the cross-validation score and the test accuracy.
for model_name, result in best_models.items():
    print(f"{model_name}:")
    print(f"Best Estimator: {result['estimator']}")
    print(f"Best Cross-Validation Score: {result['Best Score']}")
    print(f"Best Hyperparameters: {result['best_params']}")
    print(f"Test Accuracy: {result['test_score']}")
    print("-" * 100) # just prints the dashed line to separate each iteration for each model output

Logistic Regression:
Best Estimator: LogisticRegression(C=0.1, max_iter=500, solver='liblinear')
Best Cross-Validation Score: 0.7671891124871001
Best Hyperparameters: {'C': 0.1, 'penalty': 'l2', 'solver': 'liblinear'}
Test Accuracy: 0.7908496732026143
----------------------------------------------------------------------------------------------------
KNN:
Best Estimator: KNeighborsClassifier(n_neighbors=7, p=1)
Best Cross-Validation Score: 0.7212439800481596
Best Hyperparameters: {'n_neighbors': 7, 'p': 1, 'weights': 'uniform'}
Test Accuracy: 0.7777777777777778
----------------------------------------------------------------------------------------------------
Decision Tree:
Best Estimator: DecisionTreeClassifier(criterion='entropy', max_depth=10, min_samples_split=5)
Best Cross-Validation Score: 0.704968610251118
Best Hyperparameters: {'criterion': 'entropy', 'max_depth': 10, 'min_samples_split': 5}
Test Accuracy: 0.7647058823529411
----------------------------------------------------
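
For a sense of how much work the search does, the grids defined in the cell above can be enumerated directly. This is a minimal sketch that only counts combinations; it assumes each combination is fitted once per fold with `cv=4`, and it does not count any final refit of the best model.

```python
from itertools import product

# Parameter grids copied from the models dictionary above.
param_grids = {
    "Logistic Regression": {"penalty": ["l2"], "C": [0.1, 1, 10], "solver": ["liblinear"]},
    "KNN": {"n_neighbors": [3, 5, 7], "weights": ["uniform", "distance"], "p": [1, 2]},
    "Decision Tree": {"criterion": ["gini", "entropy"], "max_depth": [None, 10, 20],
                      "min_samples_split": [2, 5]},
    "Random Forest": {"n_estimators": [100, 200], "max_depth": [None, 10, 20],
                      "min_samples_split": [2, 5]},
}
cv_folds = 4

counts = {}
for name, grid in param_grids.items():
    combos = list(product(*grid.values()))   # every combination of the listed values
    counts[name] = len(combos)
    print(f"{name}: {len(combos)} combinations, {len(combos) * cv_folds} cross-validation fits")
```

The cost grows multiplicatively with every value added to a grid, which is why the grids here are kept small.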

# Observation
**In the code cell above we tuned our models with GridSearchCV, using 4-fold cross-validation to find the best hyperparameters. Before that, we feature engineered the dataset: we handled outliers, handled missing values, and scaled the data. Along the way we saw how cleaning and scaling the data change the accuracy of the models. For example, the tuned Logistic Regression reached a test accuracy of about 0.79, compared with about 0.75 for the baseline.**

---
# The end✨
## Author
### Ankit chimariya

----