# Random forest classifier: diabetes prediction

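An absolutely minimal MVP (minimum viable product) solution: load the data, split it, train a random forest with default settings, and score it on the test set.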

## 1. Data acquisition

### 1.1. Data loading

In [2]:
import pandas as pd

# Load the data from the URL
data_df=pd.read_csv("https://raw.githubusercontent.com/4GeeksAcademy/decision-tree-project-tutorial/main/diabetes.csv")

### 1.2. Train-test split

In [3]:
from sklearn.model_selection import train_test_split

# Separate features from labels
labels=data_df['Outcome']
features=data_df.drop('Outcome', axis=1)

# Split the data into training and testing features and labels
training_features, testing_features, training_labels, testing_labels=train_test_split(
    features,
    labels,
    test_size=0.2,
    random_state=315
)

In [16]:
data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


## 2. EDA

### 2.1. Features

In [4]:
# Inspect the training features' data types
training_features.info()

<class 'pandas.core.frame.DataFrame'>
Index: 614 entries, 765 to 611
Data columns (total 8 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               614 non-null    int64  
 1   Glucose                   614 non-null    int64  
 2   BloodPressure             614 non-null    int64  
 3   SkinThickness             614 non-null    int64  
 4   Insulin                   614 non-null    int64  
 5   BMI                       614 non-null    float64
 6   DiabetesPedigreeFunction  614 non-null    float64
 7   Age                       614 non-null    int64  
dtypes: float64(2), int64(6)
memory usage: 43.2 KB


All eight features are already numeric and have no null values, so the model can run without any preprocessing. Let's check the labels too.

### 2.2. Labels

In [5]:
training_labels.info()

<class 'pandas.core.series.Series'>
Index: 614 entries, 765 to 611
Series name: Outcome
Non-Null Count  Dtype
--------------  -----
614 non-null    int64
dtypes: int64(1)
memory usage: 9.6 KB


The labels are also already numeric with no null values, so we can move straight to training the model and setting a baseline performance result.

## 3. Training

In [6]:
from sklearn.ensemble import RandomForestClassifier

# Instantiate and train the random forest classifier
model=RandomForestClassifier(random_state=315)
fit_result=model.fit(training_features, training_labels)

## 4. Evaluation

In [7]:
from sklearn.metrics import accuracy_score

# Make predictions from test set features
predicted_labels=model.predict(testing_features)

# Score the predictions by accuracy, expressed as a percentage
percent_accuracy=accuracy_score(testing_labels, predicted_labels) * 100
print(f'Model is {percent_accuracy:.1f}% accurate on the test data')

Model is 77.9% accurate on the test data
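
An accuracy figure is easier to interpret next to a trivial baseline and a breakdown of the errors. The sketch below is one possible way to do that, not part of the original notebook: it uses synthetic stand-in data from `make_classification` (an assumption, so it runs without downloading anything), compares the forest to a `DummyClassifier` that always predicts the most common training label, and prints a confusion matrix. The variable names and the class weights are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): same row and feature counts as the
# diabetes data, with an arbitrary class imbalance chosen for illustration
X, y = make_classification(
    n_samples=768,
    n_features=8,
    n_informative=5,
    weights=[0.65, 0.35],
    random_state=315
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=315
)

# Trivial baseline: always predict the most frequent training label
dummy = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
forest = RandomForestClassifier(random_state=315).fit(X_train, y_train)

forest_predictions = forest.predict(X_test)

dummy_accuracy = accuracy_score(y_test, dummy.predict(X_test)) * 100
forest_accuracy = accuracy_score(y_test, forest_predictions) * 100

# Rows are true labels, columns are predicted labels
matrix = confusion_matrix(y_test, forest_predictions)

print(f'Majority-class baseline: {dummy_accuracy:.1f}% accurate')
print(f'Random forest: {forest_accuracy:.1f}% accurate')
print(matrix)
```

In the notebook itself, the same calls could be pointed at `training_features`, `training_labels`, `testing_features` and `testing_labels` instead of the synthetic arrays.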


In [None]:
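# Count how many zeros each training feature column contains
zero_counts = (training_features == 0).sum()

# Sort so the columns with the most zeros come first
sorted_zero_counts = zero_counts.sort_values(ascending=False)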

print("Columns with the most zeroes:")
print(sorted_zero_counts)

Columns with the most zeroes:
Insulin                     294
SkinThickness               172
Pregnancies                  93
BloodPressure                23
BMI                           7
Glucose                       4
DiabetesPedigreeFunction      0
Age                           0
dtype: int64
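
A zero is a legitimate value for `Pregnancies`, but in measurement columns such as `Insulin`, `BMI` or `Glucose` it more likely stands in for a value that was never recorded. The following is a minimal sketch of one common way to handle that, assuming those zeros really are missing values: mark them as `NaN` and fill them with the column median. The small `example_df` frame and the `zero_as_missing_columns` list are hypothetical and only illustrate the mechanics.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame standing in for training_features
example_df = pd.DataFrame({
    'Pregnancies': [0, 2, 5, 1],
    'Glucose': [148, 0, 120, 99],
    'Insulin': [0, 94, 0, 168],
    'BMI': [33.6, 26.6, 0.0, 28.1],
})

# Assumption: a zero in these columns means "not measured".
# Pregnancies is left out because zero is a valid count there.
zero_as_missing_columns = ['Glucose', 'Insulin', 'BMI']

imputed_df = example_df.copy()

# Mark the zeros as missing
imputed_df[zero_as_missing_columns] = (
    imputed_df[zero_as_missing_columns].replace(0, np.nan)
)

# Fill each column's missing values with that column's median
medians = imputed_df[zero_as_missing_columns].median()
imputed_df[zero_as_missing_columns] = (
    imputed_df[zero_as_missing_columns].fillna(medians)
)

print(imputed_df)
```

On the real data, the medians would be computed from the training features only and then reused to fill the testing features, so that no information from the test set leaks into preprocessing.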


In [None]:
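# Drop Insulin, the column with the most zeros
df_filtered = training_features.drop(columns=['Insulin'])

# Print the remaining columns and show their summary statistics
print(df_filtered)
df_filtered.describe()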

     Pregnancies  Glucose  BloodPressure  SkinThickness   BMI  \
765            5      121             72             23  26.2   
74             1       79             75             30  32.0   
733            2      106             56             27  29.0   
740           11      120             80             37  42.3   
0              6      148             72             35  33.6   
..           ...      ...            ...            ...   ...   
275            2      100             70             52  40.5   
746            1      147             94             41  49.3   
194            8       85             55             20  24.4   
567            6       92             62             32  32.0   
611            3      174             58             22  32.9   

     DiabetesPedigreeFunction  Age  
765                     0.245   30  
74                      0.396   22  
733                     0.426   22  
740                     0.785   48  
0                       0.627   50

       Pregnancies     Glucose  BloodPressure  SkinThickness        BMI  DiabetesPedigreeFunction        Age
count        614.0       614.0          614.0          614.0      614.0                     614.0      614.0
mean      3.851792  120.605863      69.602606      21.307818  32.215961                   0.47309  33.245928
std       3.403173   31.483407      18.224136      16.055309   7.706636                  0.339908  11.742608
min            0.0         0.0            0.0            0.0        0.0                     0.078       21.0
25%            1.0       100.0           64.0            0.0       27.5                   0.23925       24.0
50%            3.0       117.0           72.0           24.0      32.35                    0.3705       29.0
75%            6.0       138.0           80.0           33.0     36.575                    0.6285       41.0
max           17.0       199.0          122.0           99.0       67.1                      2.42       72.0


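Ok, done! An absolutely minimal random forest classifier in about 10 statements, reaching 77.9% accuracy on the test data. Out of the box, the random forest performs slightly better than a single decision tree classifier. But there are still many things we could try to improve it, starting with the zeros found above in columns such as Insulin and SkinThickness, which may stand in for missing measurements.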
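
The comparison with a single decision tree is stated here but not run in this notebook. One way to check it, sketched below under stated assumptions, is to score both models with 5-fold cross-validation on the same data. The example uses synthetic stand-in data from `make_classification` so that it is self-contained; the `models` and `cv_scores` names are illustrative, and the numbers it prints say nothing about the diabetes data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data (assumption): 614 rows, 8 features
X, y = make_classification(
    n_samples=614,
    n_features=8,
    n_informative=5,
    random_state=315
)

models = {
    'Decision tree': DecisionTreeClassifier(random_state=315),
    'Random forest': RandomForestClassifier(random_state=315),
}

# Score each model with 5-fold cross-validation on the same data
cv_scores = {}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    cv_scores[name] = scores
    print(f'{name}: {scores.mean() * 100:.1f}% +/- {scores.std() * 100:.1f}%')
```

Passing `training_features` and `training_labels` in place of `X` and `y` would run the same comparison on the diabetes data while leaving the test set untouched.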