# Welcome to Day 2 of Kickstart AI
### Today's practical covers the basics of the machine learning process (pipeline). By the end of it you will have a better understanding of how to start and finish a basic machine learning project.
## Table Of Contents:
1. [Problem Statement](#ProblemStatement)
2. [Data gathering/ingestion](#Data-gathering-ingestion)
3. [Data Exploration](#Data-Exploration)
4. [Data Preprocessing](#Data-Preprocessing)
5. [Feature Selection](#Feature-Selection)
6. [Feature Engineering](#Feature-Engineering)
7. [Modelling](#Modelling)
8. [Logistic Regression](#Logistic-Regression)
9. [Hyperparameter Tuning](#Hyperparameter-Tuning)

# Problem Statement <a class="anchor" id="ProblemStatement"></a>

# Data gathering/ingestion <a name="Data-gathering-ingestion"></a>
#### In this step we ingest (load) our data so that we can use it for modeling

--2021-05-31 11:54:51--  https://dl.dropboxusercontent.com/s/wdzi026ifw960o4/healthcare-dataset-stroke-data.csv
Resolving dl.dropboxusercontent.com (dl.dropboxusercontent.com)... 162.125.2.15, 2620:100:6017:15::a27d:20f
Connecting to dl.dropboxusercontent.com (dl.dropboxusercontent.com)|162.125.2.15|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 322048 (314K) [text/csv]
Saving to: ‘healthcare-dataset-stroke-data.csv’


2021-05-31 11:54:51 (9.35 MB/s) - ‘healthcare-dataset-stroke-data.csv’ saved [322048/322048]



In [None]:
import pandas as pd 
import numpy as np

To ingest our data we first import the pandas library using `import pandas as pd`, which lets us refer to the library as `pd`.
<br/>
Since our data is in the CSV (comma-separated values) file format, we use `pd.read_csv("file path")`, where `"file path"` is the location of the file on our computer. This reads the file and puts all of its contents in a dataframe, which we store in a variable called `data`.

In [None]:
data = pd.read_csv("/content/healthcare-dataset-stroke-data.csv")
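
To see what `pd.read_csv` does without needing the downloaded file, here is a minimal sketch that reads a tiny, made-up CSV from memory. The names `csv_text` and `sample` and the three rows are assumptions for illustration only; they are not part of the stroke dataset.

```python
import io

import pandas as pd

# A tiny, made-up CSV kept in memory so the example runs without a file on disk.
csv_text = """id,age,bmi
1,67.0,36.6
2,61.0,
3,80.0,32.5
"""

# pd.read_csv accepts a file path or any file-like object such as StringIO.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)   # (rows, columns)
print(sample.dtypes)  # data types that pandas inferred for each column
print(sample)         # the empty bmi field is read as a missing value (NaN)
```

The empty `bmi` field in the second row becomes `NaN`, which is how missing values show up in the real dataset as well.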

To check that we have successfully ingested the data we can use `.head()`, which shows the **first 5 rows** of the data in a tabular format.

In [None]:
data.head()

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1


# Data Exploration <a class="anchor" name="Data-Exploration"></a>
#### Now that we have ingested our data, we need to understand what kind of data we are working with and check whether it is clean enough for modeling

So we are going to do the following steps:

1. check the number of rows and columns  
2. check the data types of the columns  
3. check which columns contain missing values and how much data is missing   
4. ~~View the descriptive statistics of columns that contain numerical data~~ 

#### 1. check the number of rows and columns:
To check the number of rows and columns in our dataframe we can use `.shape` on the variable that the dataframe is stored in. It returns a tuple of the form `(rows, columns)`.

In [None]:
data.shape

(5110, 12)

#### 2. check the data types of the columns: 
To view the data types of all the columns in the dataframe we can, similarly to the previous step, use `.info()` on the variable that the dataframe is stored in.

In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 5110 non-null   int64  
 1   gender             5110 non-null   object 
 2   age                5110 non-null   float64
 3   hypertension       5110 non-null   int64  
 4   heart_disease      5110 non-null   int64  
 5   ever_married       5110 non-null   object 
 6   work_type          5110 non-null   object 
 7   Residence_type     5103 non-null   object 
 8   avg_glucose_level  5110 non-null   float64
 9   bmi                4909 non-null   float64
 10  smoking_status     5110 non-null   object 
 11  stroke             5110 non-null   int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 479.2+ KB



#### 3. check which columns contain missing values and how much data is missing:
To view which columns have missing values and how much data is missing from them we can use `.isna().sum()`. `.isna()` marks every cell in the dataframe as `True` if its value is missing and `False` otherwise, and `.sum()` then counts the `True` cells in each column.



In [None]:
data.isna().sum()

id                     0
gender                 0
age                    0
hypertension           0
heart_disease          0
ever_married           0
work_type              0
Residence_type         7
avg_glucose_level      0
bmi                  201
smoking_status         0
stroke                 0
dtype: int64
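
The same check can be tried on a small, made-up dataframe, which makes it easier to see how `.isna()` and `.sum()` work together. This is only a sketch: `toy` and its values are assumptions for illustration, and `missing_share` is an extra step (not used elsewhere in this practical) that expresses the counts as fractions.

```python
import numpy as np
import pandas as pd

# Made-up data with a few deliberately missing cells.
toy = pd.DataFrame({
    "age": [67.0, 61.0, 80.0, 49.0],
    "bmi": [36.6, np.nan, 32.5, np.nan],
    "Residence_type": ["Urban", None, "Rural", "Urban"],
})

mask = toy.isna()              # True wherever a cell is missing
missing_counts = mask.sum()    # True counts as 1, so this counts missing cells per column
missing_share = mask.mean()    # fraction of missing cells per column

print(missing_counts)
print(missing_share)
```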

#### 4. ~~View the descriptive statistics of columns that contain numerical data~~ 
Using `.describe()` shows us the following information for the columns whose data type is `int64` or `float64`, **i.e. only numerical data**:

*   min
*   max
*   mean
*   standard deviation (std)
*   count
*   25th, 50th and 75th percentiles




In [None]:
data.describe()

Unnamed: 0,id,age,hypertension,heart_disease,avg_glucose_level,bmi,stroke
count,5110.0,5110.0,5110.0,5110.0,5110.0,4909.0,5110.0
mean,36517.829354,43.226614,0.097456,0.054012,106.147677,28.893237,0.048728
std,21161.721625,22.612647,0.296607,0.226063,45.28356,7.854067,0.21532
min,67.0,0.08,0.0,0.0,55.12,10.3,0.0
25%,17741.25,25.0,0.0,0.0,77.245,23.5,0.0
50%,36932.0,45.0,0.0,0.0,91.885,28.1,0.0
75%,54682.0,61.0,0.0,0.0,114.09,33.1,0.0
max,72940.0,82.0,1.0,1.0,271.74,97.6,1.0


# Data cleaning <a class="anchor" name="Data-Preprocessing"></a>
Now that we understand our data and know which columns have missing data or incorrect data types, in this step we will clean the data so that we can use it to train a model.



#### 1. filling missing data with the mean
From the previous data exploration step we saw that the `bmi` column contains missing values. Since the data type of `bmi` is `float64`, which is numerical, we can use the `mean` of the entire column to impute (replace) the missing values.
<br/>

To do this we can use `.fillna()` to impute (replace) the missing values, and we can get the mean of the entire column using `data["bmi"].mean()`, where `data["bmi"]` selects the specific column and `.mean()` calculates its mean (missing values are skipped).
<br/>
Lastly, to make sure the change is applied permanently to our dataframe's column, we assign the result back to `data["bmi"]`. Calling `.fillna(..., inplace=True)` on a selected column also worked in older pandas versions, but recent versions warn about it and may leave the dataframe unchanged.

In [None]:
data["bmi"] = data["bmi"].fillna(data["bmi"].mean())
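
Here is a minimal sketch of mean imputation on a made-up column, so the arithmetic can be checked by hand. `toy` and its values are assumptions for illustration. The sketch assigns the filled column back to the dataframe, which works the same way across pandas versions.

```python
import numpy as np
import pandas as pd

# Made-up bmi values with one missing entry.
toy = pd.DataFrame({"bmi": [20.0, np.nan, 30.0, 40.0]})

# .mean() skips the missing value: (20 + 30 + 40) / 3 = 30.0
bmi_mean = toy["bmi"].mean()

# Replace the missing value with the mean and store the result back in the column.
toy["bmi"] = toy["bmi"].fillna(bmi_mean)

print(bmi_mean)
print(toy["bmi"].tolist())
```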

#### 2. filling missing data with mode 
Another column we found that has missing values is `Residence_type`. Unlike in the previous step, `Residence_type` is not a numerical value but a string value, so we use the **mode**, which is the most common value (the value that appears the most in that column).

To do this we can use the same `.fillna()`, and we can use `data["Residence_type"].mode()[0]` to get the most common value. We use `[0]` because `.mode()` returns a pandas Series (it can hold more than one value if there is a tie) and we want the first value inside this Series. Lastly, we assign the result back to the column to apply the change.

In [None]:
data["Residence_type"].mode()

0    Urban
dtype: object

In [None]:
data["Residence_type"] = data["Residence_type"].fillna(data["Residence_type"].mode()[0])
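
The following sketch shows why `[0]` is needed after `.mode()`, using made-up values. `toy`, `modes`, `most_common` and `tie_modes` are hypothetical names used only for this illustration.

```python
import pandas as pd

# Made-up column with two missing entries; "Urban" is the most frequent value.
toy = pd.DataFrame({"Residence_type": ["Urban", "Rural", None, "Urban", None]})

# .mode() ignores missing values and returns a Series, not a single value.
modes = toy["Residence_type"].mode()
most_common = modes[0]  # take the first (here: only) entry of that Series

# Fill the gaps with the most common value and store the result back.
toy["Residence_type"] = toy["Residence_type"].fillna(most_common)
print(toy["Residence_type"].tolist())

# With a tie, .mode() returns every tied value, which is why it is a Series.
tie_modes = pd.Series(["a", "b"]).mode()
print(len(tie_modes))
```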

Finally, let's double check that we have cleaned our data properly.

In [None]:
data.isna().sum()

id                   0
gender               0
age                  0
hypertension         0
heart_disease        0
ever_married         0
work_type            0
Residence_type       0
avg_glucose_level    0
bmi                  0
smoking_status       0
stroke               0
dtype: int64

We have successfully cleaned our data and are now ready to move on to feature selection.

# Feature Selection <a class="anchor" name="Feature-Selection"></a>
In this step we are going to select the features we are going to train our model on.

Since our goal in this practical is to predict whether an individual will have a stroke, we select the `stroke` column as our target, i.e. what we want to predict. `.pop()` removes that column from `data` and returns it, so the remaining columns become our features.

In [None]:
data.head() 

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,28.893237,never smoked,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1


In [None]:
y = data.pop("stroke")

In [None]:
x = data

In [None]:
x.head()

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,28.893237,never smoked
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked


Here we drop the `id` column, as it is only an identifier and does not add anything meaningful to our data.

In [None]:
x.drop("id",axis=1,inplace=True)

again let's check if we have done this correctly

In [None]:
x.head()

Unnamed: 0,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status
0,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked
1,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,28.893237,never smoked
2,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked
3,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes
4,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked


# Feature Engineering <a class="anchor" name="Feature-Engineering"></a>
In this step we have to engineer our features so that they are in a format that the computer can understand and train on.

If we look at features such as `gender`, `work_type` and `smoking_status` using `.head()`, we see that they are strings, but machine learning models can't work with strings directly; they need numbers. So we'll have to turn these features into numbers. Fortunately the sklearn library has a class called `OneHotEncoder` that can do that!


In [None]:
data.head()

Unnamed: 0,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status
0,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked
1,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,28.893237,never smoked
2,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked
3,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes
4,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked


But before we start we need to understand what one hot encoding does. Put simply, it creates a new column for each unique value in the column and assigns 1 if the row's value matches that new column and 0 if it does not.

![oht example](https://miro.medium.com/max/3758/1*O_pTwOZZLYZabRjw3Ga21A.png)
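
To make the idea concrete, here is a small sketch that one hot encodes a single made-up column. `toy`, `encoder`, `encoded` and `categories` are names chosen for this illustration. `.toarray()` is used only to turn the sparse result into a regular array that is easy to print.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up column with three unique values.
toy = pd.DataFrame(
    {"smoking_status": ["formerly smoked", "never smoked", "smokes", "never smoked"]}
)

encoder = OneHotEncoder()

# fit_transform learns the unique values and builds one 0/1 column per value.
encoded = encoder.fit_transform(toy[["smoking_status"]]).toarray()

# categories_ holds the unique values in the order of the new columns.
categories = list(encoder.categories_[0])

print(categories)
print(encoded)
```

Each row has exactly one 1, in the column that matches its original value.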

To do this we first import `OneHotEncoder` using the code shown below.

In [None]:
from sklearn.preprocessing import OneHotEncoder

Next we create an instance of the encoder and store it in a variable called `oht`; this initializes the encoder.

In [None]:
oht = OneHotEncoder()

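To encode the data we call `oht.fit_transform(x)`, and we override the data in the `x` variable with the newly encoded data. Note that passing the whole of `x` encodes every column, including numerical ones such as `age` and `bmi`, where each distinct number gets its own column.
<br>
`x = oht.fit_transform(x)`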

In [None]:
x = oht.fit_transform(x)
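
Calling `fit_transform` on the whole of `x` one hot encodes the numerical columns too. One possible alternative, which is not what this practical does, is to encode only the string columns and pass the numerical ones through unchanged using sklearn's `ColumnTransformer`. The sketch below uses made-up data; `toy_x`, `categorical_cols` and `encode_categoricals` are hypothetical names.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Made-up features: two string columns and two numerical columns.
toy_x = pd.DataFrame({
    "gender": ["Male", "Female", "Male", "Female"],
    "age": [67.0, 61.0, 80.0, 49.0],
    "smoking_status": ["smokes", "never smoked", "never smoked", "smokes"],
    "bmi": [36.6, 28.9, 32.5, 34.4],
})

# Columns that hold strings and therefore need encoding.
categorical_cols = ["gender", "smoking_status"]

# Encode only the listed columns; remainder="passthrough" keeps the rest as they are.
encode_categoricals = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_cols)],
    remainder="passthrough",
)
selective = encode_categoricals.fit_transform(toy_x)

# For comparison: encoding everything gives each distinct number its own column.
everything = OneHotEncoder().fit_transform(toy_x)

print(selective.shape)   # 2 + 2 encoded columns, plus age and bmi
print(everything.shape)  # 2 + 4 + 2 + 4 columns
```

Swapping this in for the real data would change the trained model, so the accuracy scores shown later in this practical would no longer apply as-is.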

# Modelling <a class="anchor" name="Modelling"></a>

Before we train our model we need to split our data into one part for training and one for testing, because we need a way to test whether our model predicts accurately. We can't test the model on the same data we trained it on: that would be like taking the same test a second time, when you could have remembered the answers.
<br>
To solve this problem we use `train_test_split` from sklearn which, as the name implies, splits our data into training and testing segments.


In [None]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y)

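We pass in our data like this: `train_test_split(features, labels)`. Setting `random_state` fixes the random shuffling, so we get the same split every time we run the cell.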



In [None]:
x_train,x_test,y_train,y_test = train_test_split(x,y,random_state=101)
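
The sketch below shows on made-up data how `train_test_split` divides the rows and why `random_state` matters. By default 25% of the rows go to the test set; here `test_size=0.25` is written out to make that visible. The `stratify` argument is an optional extra that this practical does not use: it keeps the share of each label the same in both parts, which can help when one label is rare. `features` and `labels` are hypothetical stand-ins for `x` and `y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 20 rows, 2 features, and only 4 positive labels.
features = np.arange(40).reshape(20, 2)
labels = np.array([0] * 16 + [1] * 4)

x_tr, x_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.25, random_state=101, stratify=labels
)

# Running the split again with the same random_state gives the same rows.
_, x_te_again, _, _ = train_test_split(
    features, labels, test_size=0.25, random_state=101, stratify=labels
)

print(x_tr.shape, x_te.shape)             # 15 training rows, 5 test rows
print(y_tr.sum(), y_te.sum())             # positives in each part
print(np.array_equal(x_te, x_te_again))   # same split both times
```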

For our modeling we will be using logistic regression: we want to predict whether an individual will have a stroke, which means this is a classification task, and logistic regression is one of the models that can do just that.

To use the model we import `LogisticRegression` from the sklearn library.

In [None]:
from sklearn.linear_model import LogisticRegression

Similar to the previous day, we initialise the model by creating it and assigning it to a variable.

In [None]:
model = LogisticRegression()

To train the model we use the same step as yesterday: we call `.fit` and pass in the `x_train` and `y_train` variables from our train test split.

In [None]:
model.fit(x_train,y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

Now that we have trained our model we need to test how good it is at predicting. To do this we use `.predict()` and pass in our test data `x_test`.

In [None]:
predictions = model.predict(x_test)

### model evaluation

Now that we have our predictions for the test data, we need to score our model to determine how much it predicted correctly. To do this we use `accuracy_score` from the sklearn library to measure our model's accuracy.

We import `accuracy_score` from `sklearn.metrics` and pass in the correct answers `y_test` and our `predictions`, like this: `accuracy_score(y_test,predictions)`

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
accuracy = accuracy_score(y_test,predictions)

Next we print out our model's accuracy score to see how well it performed.

In [None]:
print(accuracy)

0.9444444444444444


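Wow, that's a high accuracy score: our model managed to predict about 94% of the labels correctly. Keep in mind, though, that only about 5% of the rows have a stroke (the mean of `stroke` in the `.describe()` output is about 0.049), so accuracy alone can flatter a model on this data. Let's move on to the next step and tune some of the model's parameters to see if we can improve it.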
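
Accuracy is the share of predictions that match the true labels. When one label is rare, as `stroke` is in this dataset, it helps to compare the score against a simple baseline. The sketch below uses made-up labels (95 negatives, 5 positives, roughly mirroring the 5% stroke rate) and a "model" that always predicts no stroke. `y_true` and `always_no` are hypothetical names, and `recall_score` and `confusion_matrix` are extra sklearn metrics that this practical does not otherwise use.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Made-up true labels: 95 people without a stroke, 5 with a stroke.
y_true = np.array([0] * 95 + [1] * 5)

# A baseline that ignores the features and always predicts "no stroke".
always_no = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, always_no)  # share of correct predictions
baseline_recall = recall_score(y_true, always_no)      # share of real strokes that were found

# Rows are the true labels, columns are the predicted labels.
matrix = confusion_matrix(y_true, always_no)

print(baseline_accuracy)
print(baseline_recall)
print(matrix)
```

The baseline scores 95% accuracy while finding none of the strokes, so looking at recall or the confusion matrix alongside accuracy gives a fuller picture.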

# Hyperparameter Tuning <a class="anchor" name="Hyperparameter-Tuning"></a>



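To tune our model we will use 2 parameters called `max_iter` and `C`. `max_iter` is the maximum number of iterations the solver may take to converge, and `C` is the inverse of the regularization strength (smaller values mean stronger regularization).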



In [None]:
model_tuned = LogisticRegression(max_iter=1000,C=0.1)

In [None]:
model_tuned.fit(x_train,y_train)

LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [None]:
tuned_predictions = model_tuned.predict(x_test)

In [None]:
tuned_accuracy = accuracy_score(y_test,tuned_predictions)

In [None]:
print(tuned_accuracy)

0.945226917057903


Nice, our accuracy increased from about 0.9444 to about 0.9452, an improvement of roughly 0.001. That is a small improvement, and with more tuning this model may perform better!
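
Trying values by hand works, but sklearn can also search over several values for us. The sketch below is one possible way to do that with `GridSearchCV`, shown on synthetic data from `make_classification` so that it runs on its own. The grid of `C` values, the 5 folds, and the names `X`, `param_grid` and `search` are all assumptions for illustration, not settings taken from this practical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data so the example is self-contained.
X, y = make_classification(n_samples=300, n_features=6, random_state=101)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=101)

# Candidate values for C (smaller = stronger regularization).
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

# Each candidate is scored with 5-fold cross-validation on the training data only,
# so the test set stays untouched until the final check.
search = GridSearchCV(
    LogisticRegression(max_iter=1000), param_grid, cv=5, scoring="accuracy"
)
search.fit(X_train, y_train)

print(search.best_params_)  # the C value with the best cross-validated accuracy
print(search.best_score_)   # that cross-validated accuracy

# After the search, the best model is refit on all training data and can be scored.
test_accuracy = search.score(X_test, y_test)
print(test_accuracy)
```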