# Machine Learning

Make sure you have `scikit-learn` installed in your project's Anaconda environment.

In [1]:
import pandas as pd

In [5]:
dfm = pd.read_csv('heart.csv')

# Encoding  

[**Encoding**](https://www.analyticsvidhya.com/blog/2020/03/one-hot-encoding-vs-label-encoding-using-scikit-learn/) is the process of converting a categorical value into a numerical value. Most machine learning algorithms, including the scikit-learn models used below, expect numerical input, and depending on what your category is communicating, there are a few ways you can transform it so the model interprets it correctly.

## Label / Ordinal Encoding

**Label and ordinal encoding** will replace any categorical value with an integer. Label encoding works best for **boolean** or 2-option categories, but will also be necessary for your **target classes**. If your categories can also be considered "ranked", you should use ordinal encoding to preserve the order of the labels. There are many ways you can achieve this; the most practical is probably to use [`LabelEncoder()`](https://scikit-learn.org/dev/modules/generated/sklearn.preprocessing.LabelEncoder.html#sklearn.preprocessing.LabelEncoder) or [`OrdinalEncoder()`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html#sklearn.preprocessing.OrdinalEncoder) from scikit-learn. Note that the scikit-learn documentation intends `LabelEncoder()` for the target column and `OrdinalEncoder()` for feature columns, although both produce integer codes.

[Here](https://www.geeksforgeeks.org/label-encoding-across-multiple-columns-in-scikit-learn/) is a good article demonstrating various multi-column approaches. 
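As a minimal sketch of one multi-column approach, `OrdinalEncoder()` accepts several columns at once, with one list of categories per column. The `toy` DataFrame below is made-up data that only borrows the column names from `heart.csv`; it is not the real dataset.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# made-up stand-in for two categorical columns (not the real heart.csv rows)
toy = pd.DataFrame({
    "Sex": ["M", "F", "M", "F"],
    "ExerciseAngina": ["N", "N", "Y", "Y"],
})

# one list per column, in the order the integers should follow:
# Sex: M -> 0, F -> 1; ExerciseAngina: N -> 0, Y -> 1
encoder = OrdinalEncoder(categories=[["M", "F"], ["N", "Y"]], dtype=int)

# build a new DataFrame from the encoded array, keeping column names and index
encoded = pd.DataFrame(
    encoder.fit_transform(toy),
    columns=toy.columns,
    index=toy.index,
)

print(encoded)
```

Passing `categories=` explicitly keeps the mapping under your control; if it is left out, the encoder orders each column's categories by sorting them.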

<!-- Make sure to assign the dtype for the encoded column as [Categorical dtype](https://pandas.pydata.org/docs/user_guide/categorical.html#categoricaldtype).  -->

In [6]:
dfm[['Sex', 'ExerciseAngina']].head()

Unnamed: 0,Sex,ExerciseAngina
0,M,N
1,F,N
2,M,N
3,F,Y
4,M,N


In [None]:
# using DataFrame.replace() currently still works, but emits a deprecation warning

# dfm.replace(to_replace={'Sex': {'M':1, 'F':0}, 'ExerciseAngina': {'Y':1, 'N':0} }, inplace=True)


# manually using .apply() with a lambda function

# dfm['Sex'] = dfm['Sex'].apply(lambda v: 1 if v == 'M' else 0 if v == 'F' else v)
# dfm['ExerciseAngina'] = dfm['ExerciseAngina'].apply(lambda v: 1 if v == 'Y' else 0 if v == 'N' else v)

In [7]:
from sklearn.preprocessing import OrdinalEncoder, LabelEncoder

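# categories= defines the label order: 'M' -> 0, 'F' -> 1
oe = OrdinalEncoder(categories=[["M", "F"]], dtype=int)
# fit_transform() returns a 2D array with one column, so flatten it before assigning
dfm['Sex'] = oe.fit_transform(dfm[['Sex']]).ravel()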

le = LabelEncoder()
dfm['ExerciseAngina'] = le.fit_transform(dfm['ExerciseAngina'])

dfm[['Sex', 'ExerciseAngina']].head()

Unnamed: 0,Sex,ExerciseAngina
0,0,0
1,1,0
2,0,0
3,1,1
4,0,0


If working with more than two options within your category, this type of encoding can sometimes cause unintentional weight to be assigned to "labels" of a higher number. This happens when the algorithm incorrectly interprets the unrelated numeric categories as a range/scale.
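To see where that unintended ordering comes from, here is a small sketch that label encodes the four `ChestPainType` values from this dataset. `LabelEncoder()` sorts the classes alphabetically, so the integers carry no medical meaning; the variable names are illustrative only.

```python
from sklearn.preprocessing import LabelEncoder

# the four chest pain categories, in the order they first appear in the data
chest_pain = ["ATA", "NAP", "ASY", "TA"]

le_demo = LabelEncoder()
codes = le_demo.fit_transform(chest_pain)

# classes_ holds the sorted categories; a category's position is its integer code
print(list(le_demo.classes_))
print(dict(zip(chest_pain, codes.tolist())))
```

A model that treats these codes as a scale would read `TA` (3) as "more" than `ASY` (0), even though the categories are unrelated.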

## One Hot Encoding

In this scenario, we can instead use **one hot encoding**. This splits a categorical feature into as many columns as it has categories, each holding a 0/1 flag (false/true). Exactly one column in each row will ever be "true", so no category is treated as larger or smaller than another.

We can reduce the number of features by removing one of these one hot encoded columns: a row that is "false" in every remaining column can only belong to the omitted category, so no information is lost.

![one hot encode](https://drive.google.com/thumbnail?id=1NRF2th9dR69Xa1LH4hYTW9YiJhZGXBYB&sz=s4000)

Again, this is just manipulating your DataFrame, so you can achieve this manually with `.apply()`, or you can use the handy functions that both Pandas and scikit-learn offer: [`pd.get_dummies()`](https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html) or [`OneHotEncoder()`](https://scikit-learn.org/dev/modules/generated/sklearn.preprocessing.OneHotEncoder.html#onehotencoder).
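Before applying this to the real columns, a small sketch on made-up data may help show what dropping one column does. The `toy_ecg` DataFrame is hypothetical and only reuses the `RestingECG` category names.

```python
import pandas as pd

# made-up rows using the three RestingECG categories
toy_ecg = pd.DataFrame({"RestingECG": ["Normal", "ST", "LVH", "Normal"]})

# one column per category
full = pd.get_dummies(toy_ecg, dtype="int")

# drop_first=True removes the first (alphabetically sorted) category column
reduced = pd.get_dummies(toy_ecg, drop_first=True, dtype="int")

print(full.columns.tolist())
print(reduced.columns.tolist())
print(reduced)
```

In `reduced`, the `LVH` row contains only zeros, which is how the omitted category is still represented.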

In [8]:
# separate remaining categorical columns
cat_cols = dfm.select_dtypes('object')
for label, values in cat_cols.items():
    print(label, values.unique())

ChestPainType ['ATA' 'NAP' 'ASY' 'TA']
RestingECG ['Normal' 'ST' 'LVH']
ST_Slope ['Up' 'Flat' 'Down']


In [9]:
# using 'pd.get_dummies()'
cat_col_encode = pd.get_dummies(cat_cols, drop_first=True, dtype='int')
cat_col_encode

Unnamed: 0,ChestPainType_ATA,ChestPainType_NAP,ChestPainType_TA,RestingECG_Normal,RestingECG_ST,ST_Slope_Flat,ST_Slope_Up
0,1,0,0,1,0,0,1
1,0,1,0,1,0,1,0
2,1,0,0,0,1,0,1
3,0,0,0,1,0,1,0
4,0,1,0,1,0,0,1
...,...,...,...,...,...,...,...
913,0,0,1,1,0,1,0
914,0,0,0,1,0,1,0
915,0,0,0,1,0,1,0
916,1,0,0,0,0,1,0


In [None]:
# using OneHotEncoder()
# from sklearn.preprocessing import OneHotEncoder

# ohe = OneHotEncoder(drop="first")
# transformed = ohe.fit_transform(cat_cols)

# OneHotEncoder sorts the categories and drops the first *sorted* one, so take
# the column names from the encoder instead of building them from .unique()
# categorical_columns = ohe.get_feature_names_out(cat_cols.columns)

# categorical_columns

# keep the original index so the rows line up when concatenating later
# cat_col_encode = pd.DataFrame(transformed.toarray(), columns=categorical_columns, index=cat_cols.index, dtype='int')

# cat_col_encode

In [10]:
# drop categorical columns
dfm.drop(columns=cat_cols.columns, inplace=True)

# combine new columns with original dfm
dfml = pd.concat([dfm, cat_col_encode], axis=1)

# view first 5 rows
dfml.head()

Unnamed: 0,Age,Sex,RestingBP,Cholesterol,FastingBS,MaxHR,ExerciseAngina,Oldpeak,HeartDisease,ChestPainType_ATA,ChestPainType_NAP,ChestPainType_TA,RestingECG_Normal,RestingECG_ST,ST_Slope_Flat,ST_Slope_Up
0,40,0,140,289,0,172,0,0.0,0,1,0,0,1,0,0,1
1,49,1,160,180,0,156,0,1.0,1,0,1,0,1,0,1,0
2,37,0,130,283,0,98,0,0.0,0,1,0,0,0,1,0,1
3,48,1,138,214,0,108,1,1.5,1,0,0,0,1,0,1,0
4,54,0,150,195,0,122,0,0.0,0,0,1,0,1,0,0,1


# Split Data

Start by separating your **target** column from the rest of the **features**. The target is often referred to as **y**, while the remaining features are referred to as **X**. If you check the `.shape`, `X` will be a DataFrame with one fewer column than the original, while `y` will be a Series with the same number of rows as `X`.

![unsplit](https://drive.google.com/thumbnail?id=1jr0DVZ9lffixnDdzoTmA26QE4KQaHOOA&sz=s4000)
![X y](https://drive.google.com/thumbnail?id=1iv52oBp6E09ZrFMwZhcAMgJI-OK3xBw3&sz=s4000)

In [11]:
from sklearn.model_selection import train_test_split

X = dfml.drop('HeartDisease', axis=1)
y = dfml['HeartDisease']

print("original:", dfml.shape, "\nX:", X.shape, "\ny:", y.shape)

original: (918, 16) 
X: (918, 15) 
y: (918,)


Next, split both X and y by rows: 70-80% of the rows are kept for **training** purposes, and the remaining 20-30% are held back for **testing** purposes.

![train test](https://drive.google.com/thumbnail?id=12Hzj8-eDZ8zJEVDfdRV38G_-__W37Mqi&sz=s4000)

Use the function `train_test_split()` from scikit-learn, and define the split by specifying `test_size=`. By default, the rows are shuffled into a random order before splitting. Optionally set `random_state=` to seed that shuffle so the exact same order is reproduced the next time you run the model (you want to make sure you are using the same set of training/testing data as you iterate on the steps to improve your model's performance).
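As a quick check of that reproducibility, the sketch below splits the same made-up arrays twice with the same `random_state=` and compares the results. `X_demo` and `y_demo` are hypothetical stand-ins, not the heart disease data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# made-up data: 10 rows, 2 features, alternating 0/1 target
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1] * 5)

# two independent splits with the same seed
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# each split is [X_train, X_test, y_train, y_test]; compare them piece by piece
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))

print(same)
print(split_a[0].shape, split_a[1].shape)
```

With `test_size=0.2` and 10 rows, 8 rows land in the training set and 2 in the testing set.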

In [29]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=420)
print(X_train, X_test)

     Age  Sex  RestingBP  Cholesterol  FastingBS  MaxHR  ExerciseAngina  \
525   45    0        130          236          0    144               0   
801   56    0        132          184          0    105               1   
636   48    0        130          245          0    180               0   
380   60    0        160            0          0     99               1   
878   49    0        130          266          0    171               0   
..   ...  ...        ...          ...        ...    ...             ...   
627   44    0        140          235          0    180               0   
799   53    0        130          246          1    173               0   
575   56    0        137          282          1    126               1   
390   51    0        140            0          0     60               0   
72    52    0        120          182          0    150               0   

     Oldpeak  ChestPainType_ATA  ChestPainType_NAP  ChestPainType_TA  \
525      0.1               

# Building Model

Choose your model. There are two examples here, [`LogisticRegression`](https://scikit-learn.org/1.5/modules/generated/sklearn.linear_model.LogisticRegression.html) and [`RandomForestClassifier`](https://scikit-learn.org/1.5/modules/generated/sklearn.ensemble.RandomForestClassifier.html). Feel free to experiment with other classification models.

## Logistic Regression

You will need to import your model from scikit-learn and assign it to a variable. There are _many_ customization options, but start with the simplest setup; we will cover **hyperparameter tuning** in a later spike.

Then use the [`.fit()`](https://scikit-learn.org/1.5/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.fit) method to "fit" or train your model using your training data.

In [30]:
from sklearn.linear_model import LogisticRegression

LR_model = LogisticRegression(max_iter=1000) 

In [31]:
LR_model.fit(X_train, y_train)

You might experience a [convergence warning](https://forecastegy.com/posts/how-to-solve-logistic-regression-not-converging-in-scikit-learn/) while fitting the Logistic Regression model. Try increasing `max_iter=` to give the solver more iterations to find the optimal parameters for your dataset. Scaling the features often helps as well, and it might also be necessary to perform some **feature elimination** to reduce confusion from correlated columns.
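One common way to help the solver converge is to standardize the features before fitting. The sketch below is an assumption about how you might do that, not a step from this notebook: it uses synthetic data from `make_classification()` and wraps `StandardScaler()` and `LogisticRegression()` in a pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic data (not heart.csv): 200 rows, 5 numeric features
X_synth, y_synth = make_classification(n_samples=200, n_features=5, random_state=0)

# exaggerate the scale of one feature, similar to mixing columns
# measured in very different units
X_synth[:, 0] *= 1000

# the pipeline scales the features first, then fits the model on the scaled values
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression())
scaled_lr.fit(X_synth, y_synth)

# accuracy on the same synthetic data, only to confirm the pipeline runs end to end
print(scaled_lr.score(X_synth, y_synth))
```

Because the scaler lives inside the pipeline, calling `.predict()` on new rows applies the same scaling automatically.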

We can now let our model make [**predictions**](https://scikit-learn.org/1.5/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.predict) using the testing data we reserved. For each row, it will use what it learned from the training data to infer whether the patient does or doesn't have heart disease.

In [32]:
preds = LR_model.predict(X_test)
preds

array([1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1,
       0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1,
       0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0,
       1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0,
       0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1,
       1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1,
       1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0,
       1, 1, 0, 1, 1, 0, 1, 1])

In [33]:
y_test

314    0
403    1
82     1
454    1
498    1
      ..
628    1
355    1
590    0
405    1
572    1
Name: HeartDisease, Length: 184, dtype: int64

Since we have the actual correct results saved in our `y_test` variable, we can now compare the predicted results against the true results to gauge how accurately our model is predicting the target. The scikit-learn library has a [handy function](https://scikit-learn.org/1.5/modules/generated/sklearn.metrics.accuracy_score.html), `accuracy_score()`, that returns the fraction of correct predictions as a float between 0 and 1.

Be aware: while 1 is the highest possible score, a model that returns a perfect score every time usually has a problem. Accuracy is also not the only metric you should use to evaluate your model; we will cover the others in a different spike.
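To make the score less of a black box, here is a small sketch showing that accuracy is just the share of predictions that match the true labels. The two arrays are made-up values, not output from the models in this notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# made-up true labels and predictions
y_true_demo = np.array([1, 0, 1, 1, 0])
y_pred_demo = np.array([1, 0, 0, 1, 1])

# element-wise comparison gives True/False; the mean of that is the accuracy
manual_acc = (y_true_demo == y_pred_demo).mean()

print(manual_acc)  # 3 of 5 predictions match -> 0.6
print(accuracy_score(y_true_demo, y_pred_demo))
```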

In [34]:
from sklearn.metrics import accuracy_score 

acc = accuracy_score(y_test, preds)
acc

0.8641304347826086

## Random Forest

In [35]:
from sklearn.ensemble import RandomForestClassifier

In [36]:
RF_clf = RandomForestClassifier(random_state=420)

In [None]:
RF_clf.fit(X_train, y_train)

In [41]:
preds = RF_clf.predict(X_test)
preds

array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1,
       0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1,
       0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1,
       1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0,
       1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0,
       0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1,
       1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1,
       1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0,
       1, 1, 0, 1, 1, 0, 1, 1])

In [39]:
acc = accuracy_score(y_test, preds)
acc

0.8804347826086957