# Content

1. [Load Data](#1)
2. [Importing Libraries](#2)
3. [Categorical Pregnancies - One-Hot encoding](#3)
4. [Data Splitting](#4)
5. [Model Training and Evaluation](#5)
6. [Comparison of Models](#6)
7. [Conclusions](#7)

The objective of this notebook is to evaluate how treating the `Pregnancies` feature as categorical (one-hot encoded) or as numerical (scaled) affects the accuracy of the ML model, here a random forest classifier.

## 1. Load Data <a name = 1></a>

In [1]:
import pandas as pd

In [2]:
diabetes_df_numerical = pd.read_csv('/content/diabetes_clean_scaled.csv')
diabetes_df_categorical = pd.read_csv('/content/diabetes_clean_scaled_categorical_pregnancies.csv')

In [3]:
diabetes_df_numerical

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,0.842406,0.968697,-0.019559,0.740160,1.377079,0.270163,0.852460,1.399009,1
1,-0.849975,-1.474492,-0.593325,0.045089,-1.289791,-0.917294,-0.182884,0.121670,0
2,1.246532,1.750139,-0.783908,0.233018,0.795782,-1.556530,0.972625,0.226764,1
3,-0.849975,-1.254536,-0.593325,-0.679244,-0.628768,-0.645419,-1.343867,-1.474295,0
4,-1.592118,0.654682,-1.919628,0.740160,0.471166,1.623697,1.787773,0.325504,1
...,...,...,...,...,...,...,...,...,...
763,1.586906,-0.666332,0.364544,1.740139,0.606021,0.159936,-1.312374,1.685355,0
764,-0.356029,0.167825,-0.211141,-0.192741,-0.048756,0.753933,-0.238567,-0.376157,0
765,0.606143,0.132554,-0.019559,-0.679244,-0.303459,-0.991644,-0.785753,0.009595,0
766,-0.849975,0.305156,-1.164009,-0.168805,0.297337,-0.298674,-0.192903,1.266248,1


In [4]:
diabetes_df_categorical

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,0.968697,-0.019559,0.740160,1.377079,0.270163,0.852460,1.399009,1
1,1,-1.474492,-0.593325,0.045089,-1.289791,-0.917294,-0.182884,0.121670,0
2,8,1.750139,-0.783908,0.233018,0.795782,-1.556530,0.972625,0.226764,1
3,1,-1.254536,-0.593325,-0.679244,-0.628768,-0.645419,-1.343867,-1.474295,0
4,0,0.654682,-1.919628,0.740160,0.471166,1.623697,1.787773,0.325504,1
...,...,...,...,...,...,...,...,...,...
763,10,-0.666332,0.364544,1.740139,0.606021,0.159936,-1.312374,1.685355,0
764,2,0.167825,-0.211141,-0.192741,-0.048756,0.753933,-0.238567,-0.376157,0
765,5,0.132554,-0.019559,-0.679244,-0.303459,-0.991644,-0.785753,0.009595,0
766,1,0.305156,-1.164009,-0.168805,0.297337,-0.298674,-0.192903,1.266248,1
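Both tables show an `Unnamed: 0` column. This is what pandas produces when a CSV was saved with its index and then read back without `index_col`, and the column stays among the features later on unless it is dropped. The sketch below illustrates the behaviour on a tiny in-memory CSV; the CSV text is hypothetical and only the column names mirror the notebook.

```python
import io
import pandas as pd

# Hypothetical CSV text that mimics a file written by DataFrame.to_csv()
# with the index included (note the empty first header field).
csv_text = ",Pregnancies,Outcome\n0,6,1\n1,1,0\n"

# Default read: the saved index comes back as a regular column.
with_extra = pd.read_csv(io.StringIO(csv_text))

# index_col=0 tells pandas to use the first column as the index instead.
without_extra = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_extra.columns))
print(list(without_extra.columns))
```

Whether the index column should be excluded in this notebook is a modelling decision; the sketch only shows where the column comes from.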


## 2. Importing Libraries <a name = 2></a>

In [5]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

## 3. Categorical Pregnancies - One-Hot encoding <a name = 3></a>

In [6]:
categorical = pd.get_dummies(diabetes_df_categorical, columns=['Pregnancies'], drop_first=True)
categorical

Unnamed: 0,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome,Pregnancies_1,Pregnancies_2,...,Pregnancies_7,Pregnancies_8,Pregnancies_9,Pregnancies_10,Pregnancies_11,Pregnancies_12,Pregnancies_13,Pregnancies_14,Pregnancies_15,Pregnancies_17
0,0.968697,-0.019559,0.740160,1.377079,0.270163,0.852460,1.399009,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1,-1.474492,-0.593325,0.045089,-1.289791,-0.917294,-0.182884,0.121670,0,1,0,...,0,0,0,0,0,0,0,0,0,0
2,1.750139,-0.783908,0.233018,0.795782,-1.556530,0.972625,0.226764,1,0,0,...,0,1,0,0,0,0,0,0,0,0
3,-1.254536,-0.593325,-0.679244,-0.628768,-0.645419,-1.343867,-1.474295,0,1,0,...,0,0,0,0,0,0,0,0,0,0
4,0.654682,-1.919628,0.740160,0.471166,1.623697,1.787773,0.325504,1,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
763,-0.666332,0.364544,1.740139,0.606021,0.159936,-1.312374,1.685355,0,0,0,...,0,0,0,1,0,0,0,0,0,0
764,0.167825,-0.211141,-0.192741,-0.048756,0.753933,-0.238567,-0.376157,0,0,1,...,0,0,0,0,0,0,0,0,0,0
765,0.132554,-0.019559,-0.679244,-0.303459,-0.991644,-0.785753,0.009595,0,0,0,...,0,0,0,0,0,0,0,0,0,0
766,0.305156,-1.164009,-0.168805,0.297337,-0.298674,-0.192903,1.266248,1,1,0,...,0,0,0,0,0,0,0,0,0,0
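To make the effect of `drop_first=True` easier to see, here is a minimal sketch on a toy frame (the values are hypothetical). `pd.get_dummies` creates one indicator column per observed value, which is why the output above jumps from `Pregnancies_15` to `Pregnancies_17`, and `drop_first=True` removes the indicator of the lowest level.

```python
import pandas as pd

# Toy frame with hypothetical values; only the column names mirror the notebook.
toy = pd.DataFrame({
    "Pregnancies": [0, 1, 8, 1],
    "Glucose": [0.9, -1.4, 1.7, -1.2],
})

# One indicator column per observed value, minus the first level (0 here).
# A row with all Pregnancies_* indicators off therefore means Pregnancies == 0.
encoded = pd.get_dummies(toy, columns=["Pregnancies"], drop_first=True)

print(list(encoded.columns))
```

Depending on the pandas version, the indicator columns are stored as integers (as in the output above) or as booleans.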


## 4. Data Splitting <a name = 4></a>

**Categorical**


---

In [7]:
X_categorical = categorical.drop(columns=['Outcome'])
y_categorical = categorical['Outcome']
X_train_categorical, X_test_categorical, y_train_categorical, y_test_categorical = train_test_split(
    X_categorical, y_categorical, test_size=0.3, random_state=42)

**Numerical**


---


In [8]:
X_numerical = diabetes_df_numerical.drop(columns=['Outcome'])
y_numerical = diabetes_df_numerical['Outcome']
X_train_numerical, X_test_numerical, y_train_numerical, y_test_numerical = train_test_split(
    X_numerical, y_numerical, test_size=0.3, random_state=42)
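Both splits use `test_size=0.3` and `random_state=42`. When two datasets have the same number of rows, this selects the same row positions for the test set in both, so the two models are evaluated on the same samples. The sketch below checks this on two hypothetical frames with different columns but the same length.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Two hypothetical frames: same number of rows, different feature columns.
a = pd.DataFrame({"x": np.arange(10), "Outcome": [0, 1] * 5})
b = pd.DataFrame({"x1": np.arange(10) * 2,
                  "x2": np.arange(10) * 3,
                  "Outcome": [0, 1] * 5})

# Same test_size and random_state for both splits, as in the notebook.
a_train, a_test = train_test_split(a, test_size=0.3, random_state=42)
b_train, b_test = train_test_split(b, test_size=0.3, random_state=42)

# The shuffle depends only on the seed and the number of rows,
# so both test sets contain the same row labels in the same order.
same_rows = a_test.index.equals(b_test.index)
print(same_rows)
```

In the notebook, the analogous check would be `X_test_categorical.index.equals(X_test_numerical.index)`.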

## 5. Model Training and Evaluation <a name = 5></a>

**Categorical**


---


In [9]:
# Initialize and train the classifier
rf_categorical = RandomForestClassifier(n_estimators=100, random_state=42)
rf_categorical.fit(X_train_categorical, y_train_categorical)

# Make predictions on the test set
y_pred_categorical = rf_categorical.predict(X_test_categorical)

# Calculate the accuracy
accuracy_categorical = accuracy_score(y_test_categorical, y_pred_categorical)

**Numerical**


---

In [10]:
# Initialize and train the classifier
rf_numerical = RandomForestClassifier(n_estimators=100, random_state=42)
rf_numerical.fit(X_train_numerical, y_train_numerical)

# Make predictions on the test set
y_pred_numerical = rf_numerical.predict(X_test_numerical)

# Calculate the accuracy
accuracy_numerical = accuracy_score(y_test_numerical, y_pred_numerical)

## 6. Comparison of Models <a name = 6></a>

In [11]:
accuracy_categorical, accuracy_numerical

(0.7229437229437229, 0.7229437229437229)
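Identical accuracy means both models classify the same number of test samples correctly. It does not by itself mean that they make the same predictions. A minimal sketch with hypothetical label arrays shows how the two can differ:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels and predictions, not taken from the notebook.
y_true = np.array([1, 0, 1, 0, 1, 0])
pred_a = np.array([1, 0, 0, 0, 1, 1])  # wrong at positions 2 and 5
pred_b = np.array([0, 0, 1, 0, 1, 1])  # wrong at positions 0 and 5

acc_a = accuracy_score(y_true, pred_a)
acc_b = accuracy_score(y_true, pred_b)

# Fraction of samples on which the two models give the same prediction.
agreement = np.mean(pred_a == pred_b)

print(acc_a == acc_b, agreement)
```

In the notebook, the same comparison could be made with `(y_pred_categorical == y_pred_numerical).mean()`.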

## 7. Conclusions <a name = 7></a>

There are a few things to consider:

- Both approaches give the same test accuracy (0.7229) on this train/test split, so the encoding of `Pregnancies` makes no measurable difference for this model:
    - The `Pregnancies` feature is less important than other features for predicting the Outcome (Diabetes or No-Diabetes). This agrees with the Pearson correlation computed during pre-processing.
    - The categorical representation adds one column per unique pregnancy count after one-hot encoding. In datasets with a wide range of pregnancy counts, this can increase the model's complexity.
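The claim about feature importance can also be checked on the fitted forests through the `feature_importances_` attribute. The sketch below uses synthetic data, generated so that only `Glucose` drives the label. It is an illustration of the technique, not a reproduction of the notebook's results.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (assumption): the label depends on Glucose only,
# while Pregnancies is pure noise.
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "Glucose": rng.normal(size=n),
    "Pregnancies": rng.normal(size=n),
})
y = (X["Glucose"] + 0.1 * rng.normal(size=n) > 0).astype(int)

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Impurity-based importances; they sum to 1 across all features.
importances = pd.Series(rf.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)

print(importances)
```

For `rf_categorical`, the importances of all `Pregnancies_*` columns would need to be summed to compare them with the single `Pregnancies` column of `rf_numerical`. Impurity-based importances are only a rough guide, so they complement the correlation analysis rather than replace it.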