### Bonus Diabetes Data Generation

## Section 1

#### Create a notebook that pre-processes this data for model fitting
In this notebook, I analyze and process the chosen data and identify the target variable and input variables. I include details about what I observed, what changes I am making, how I am making them, and why. I save the results into CSV files, so these files are pre-processed and ready for model fitting; later model-fitting notebooks should not need any further data manipulation or processing.

## Identification of a problem/goal for analysis

The data set was collected from Kaggle, and the task is to predict whether a patient has diabetes.
##### Problem: 
The data set contains measurements of several variables for each patient; based on these measurements, we need to identify the diabetic patients.
##### Goal:
My goal is to analyze the input variables (Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, and Age) and predict the target variable, Outcome, which indicates whether a patient has diabetes.

### Importing necessary modules

In [1]:
import pandas as pd
from matplotlib import pyplot as plt
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.impute import SimpleImputer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier 
from sklearn.preprocessing import StandardScaler

np.random.seed(1)

### Reading and displaying data from the chosen data set

In [2]:
df = pd.read_csv('bonus-data-set.csv') 

In [3]:
df.head(5)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


### Details of the data

Predictors (X)
- Pregnancies - Number of pregnancies person had
- Glucose - Glucose level of person
- BloodPressure - Blood Pressure of person
- SkinThickness - Skin Thickness of person
- Insulin - Insulin level of person
- BMI - BMI of person
- DiabetesPedigreeFunction - Diabetes Pedigree Function of person
- Age - Age of person
Target (Y)
- Outcome

(0 - not diabetic, 1 - diabetic)

## Cleaning the data

### No categorical values to replace with binary values

#### Cleaning up column names, in case there are leading or trailing whitespace characters
It is best practice to strip any whitespace from column names before starting the analysis.

In [4]:
df.columns = [s.strip() for s in df.columns] 
df.columns

Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')

### No unnecessary columns to drop

### Properties and observations of cleaned data

In [5]:
df.head(3)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1


In [6]:
df.shape

(768, 9)

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


In [8]:
df.describe()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,120.894531,69.105469,20.536458,79.799479,31.992578,0.471876,33.240885,0.348958
std,3.369578,31.972618,19.355807,15.952218,115.244002,7.88416,0.331329,11.760232,0.476951
min,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0,0.0
25%,1.0,99.0,62.0,0.0,0.0,27.3,0.24375,24.0,0.0
50%,3.0,117.0,72.0,23.0,30.5,32.0,0.3725,29.0,0.0
75%,6.0,140.25,80.0,32.0,127.25,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0


### Looking for null values

In [9]:
df.isna().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

Looks like there are no null values.
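
One caveat: the `min` row of `df.describe()` above shows a minimum of 0 for Glucose, BloodPressure, SkinThickness, Insulin, and BMI. A zero in these columns may be a placeholder for a missing measurement rather than a real reading; that is an assumption, not something the data set states. Since `isna()` does not flag such values, a separate count of zeros can be a useful check. The sketch below uses only the five rows shown by `df.head(5)`, and the list name `zero_suspect_cols` is hypothetical:

```python
import pandas as pd

# Small illustrative frame built from the five rows shown by df.head(5) above.
sample = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
})

# Assumption: a value of 0 in these columns may stand for "not measured".
zero_suspect_cols = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

# isna() reports no missing values because 0 is a valid number ...
na_total = int(sample.isna().sum().sum())

# ... so count the zeros per column explicitly.
zero_counts = (sample[zero_suspect_cols] == 0).sum()

print(na_total)
print(zero_counts)
```

On the full data set the same expression would be applied to `df` instead of `sample`. This notebook keeps the zeros as they are.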

## Section 2

### Splitting the data for training and testing (data partitioning 70/30)

The data set has 768 observations, so I decided to partition it into 70% for training and 30% for testing, which I believe will give good results.

In [10]:
train_df, test_df = train_test_split(df, test_size=0.3, random_state=1)

### Separating the predictors and target variable

In [11]:
target = 'Outcome'
predictors = list(df.columns)
predictors.remove(target)

### Looking for null values

In [12]:
numeric_cols_with_nas = list(train_df.isna().sum()[train_df.isna().sum() > 0].index)
numeric_cols_with_nas

[]

### Standardizing the input variables

In [13]:
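# Fit the scaler on the training predictors only, then apply the same scaling to the test set
scaler = preprocessing.StandardScaler()
train_df, test_df = train_df.copy(), test_df.copy()  # explicit copies avoid SettingWithCopyWarning
train_df[predictors] = scaler.fit_transform(train_df[predictors])
test_df[predictors] = scaler.transform(test_df[predictors])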
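
To make the scaling step concrete, here is a minimal sketch of the same calculation done by hand with NumPy: subtract the training mean and divide by the training standard deviation, then reuse those two training statistics for the test values. The arrays `train_col` and `test_col` are hypothetical toy values, not columns from the data set. `StandardScaler` uses the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np

# Hypothetical toy values standing in for one predictor column.
train_col = np.array([85.0, 89.0, 137.0, 148.0, 183.0])
test_col = np.array([100.0, 160.0])

# Statistics are learned from the training data only.
mean = train_col.mean()
std = train_col.std()  # population standard deviation (ddof=0)

# Apply the same transformation to both sets.
train_scaled = (train_col - mean) / std
test_scaled = (test_col - mean) / std  # reuses the training mean and std

print(train_scaled.mean(), train_scaled.std())
print(test_scaled)
```

After scaling, the training column has mean 0 and standard deviation 1 up to floating-point error. The test column generally does not, because it is scaled with the training statistics so that no information from the test set leaks into preprocessing.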

### Saving the datasets for testing and training

In [14]:
X_train = train_df[predictors]
y_train = train_df[target]
X_test = test_df[predictors]
y_test = test_df[target]

In [15]:
X_train.shape

(537, 8)

In [16]:
y_train.shape

(537,)

In [17]:
X_test.shape

(231, 8)

In [18]:
y_test.shape

(231,)
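
The four shapes above are consistent with a 70/30 split of 768 rows. A quick arithmetic check, assuming the test share is rounded up and the remaining rows go to training (my reading of how `train_test_split` handles a fractional `test_size`):

```python
import math

n_samples = 768   # rows reported by df.shape
test_size = 0.3   # value passed to train_test_split

# Assumption: the test share is rounded up; the remainder is the training set.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

This gives 537 training rows and 231 test rows, matching `X_train.shape` and `X_test.shape`.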

In [19]:
X_train.head(3)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
88,3.353608,0.480473,0.05217,0.781406,0.288597,0.68612,-0.946901,0.810205
467,-1.121017,-0.768911,-0.246393,1.032297,0.199796,0.646996,0.39613,-0.695262
550,-0.822709,-0.160237,0.05217,0.530515,-0.688222,-0.578899,-0.79367,-1.02981


In [20]:
y_train.head(3)

88     1
467    0
550    0
Name: Outcome, dtype: int64

In [21]:
X_test.head(3)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
285,0.967141,0.480473,0.251212,0.40507,0.510602,-0.761478,0.537343,1.479301
101,-0.822709,0.961005,-0.445435,-1.225722,-0.688222,-0.748437,-0.868783,-0.946173
581,0.668833,-0.384485,-0.445435,0.467793,-0.688222,-0.891893,-0.787661,-0.527988


In [22]:
y_test.head(3)

285    0
101    0
581    0
Name: Outcome, dtype: int64

In [23]:
X_train.to_csv('bonus-train_X-data.csv', index=False)
y_train.to_csv('bonus-train_y-data.csv', index=False)
X_test.to_csv('bonus-test_X-data.csv', index=False)
y_test.to_csv('bonus-test_y-data.csv', index=False)
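
Because the files are written with `index=False`, the original row labels (for example 88, 467, 550) are not stored, and a later notebook gets a fresh 0-based index when it reads them back. The sketch below shows this round trip in memory; the Series `y` is a hypothetical stand-in for `y_train`, and `io.StringIO` replaces the file on disk:

```python
import io
import pandas as pd

# Hypothetical stand-in for y_train: a Series named 'Outcome' with a non-default index.
y = pd.Series([1, 0, 0], index=[88, 467, 550], name="Outcome")

# Same call as in the cell above, written to an in-memory buffer instead of a file.
buffer = io.StringIO()
y.to_csv(buffer, index=False)
buffer.seek(0)

# Reading back yields a one-column DataFrame; select the column to get a Series again.
y_loaded = pd.read_csv(buffer)["Outcome"]

print(y_loaded.tolist())
print(list(y_loaded.index))
```

The values survive unchanged, but the index is reset. Since the X and y files are written in the same row order, the rows still line up by position when loaded.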

## Conclusion

In this notebook, I used the techniques covered in class to load and clean the data, and saved the train and test sets of the predictors and the target variable in CSV files. I will use these files in the bonus-model-fit notebook.