<a href="https://colab.research.google.com/github/TrollRider-Kristian/Springboard-AI-Mini-Projects/blob/main/codebasics_bagging_exercise.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [63]:
import pandas
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Force Pandas to display ALL columns when I call the .head function
# https://stackoverflow.com/questions/11361985/output-data-from-all-columns-in-a-dataframe-in-pandas
pandas.set_option ('display.max_columns', None)
heart_disease_raw = pandas.read_csv ("heart-disease.csv")
print (heart_disease_raw.head(10))

   Age Sex ChestPainType  RestingBP  Cholesterol  FastingBS RestingECG  MaxHR  \
0   40   M           ATA        140          289          0     Normal    172   
1   49   F           NAP        160          180          0     Normal    156   
2   37   M           ATA        130          283          0         ST     98   
3   48   F           ASY        138          214          0     Normal    108   
4   54   M           NAP        150          195          0     Normal    122   
5   39   M           NAP        120          339          0     Normal    170   
6   45   F           ATA        130          237          0     Normal    170   
7   54   M           ATA        110          208          0     Normal    142   
8   37   M           ASY        140          207          0     Normal    130   
9   48   F           ATA        120          284          0     Normal    120   

  ExerciseAngina  Oldpeak ST_Slope  HeartDisease  
0              N      0.0       Up             0  
1     

Per the describe() call, all the quantifiable columns with means and standard deviations for which a Z-score exists are: Age, RestingBP, Cholesterol, FastingBS, MaxHR, and Oldpeak. We omit the target 'HeartDisease' column.

In [45]:
heart_disease_raw.isnull().sum()

Unnamed: 0,0
Age,0
Sex,0
ChestPainType,0
RestingBP,0
Cholesterol,0
FastingBS,0
RestingECG,0
MaxHR,0
ExerciseAngina,0
Oldpeak,0


In [46]:
heart_disease_raw.describe()

Unnamed: 0,Age,RestingBP,Cholesterol,FastingBS,MaxHR,Oldpeak,HeartDisease
count,918.0,918.0,918.0,918.0,918.0,918.0,918.0
mean,53.510893,132.396514,198.799564,0.233115,136.809368,0.887364,0.553377
std,9.432617,18.514154,109.384145,0.423046,25.460334,1.06657,0.497414
min,28.0,0.0,0.0,0.0,60.0,-2.6,0.0
25%,47.0,120.0,173.25,0.0,120.0,0.0,0.0
50%,54.0,130.0,223.0,0.0,138.0,0.6,1.0
75%,60.0,140.0,267.0,0.0,156.0,1.5,1.0
max,77.0,200.0,603.0,1.0,202.0,6.2,1.0


In [47]:
# Feature names with mean and standard deviation from the dataset
feature_list = ['Age', 'RestingBP', 'Cholesterol', 'FastingBS', 'MaxHR', 'Oldpeak']

# The Z-score, also called the Standard Score: https://en.wikipedia.org/wiki/Standard_score
# I was advised to remove outliers with a Z-score < -3 or > 3, as in any row with a value
# more than 3 standard deviations away from the mean.

def calculate_z_score (dataset, column_name):
  get_column = dataset [column_name]
  return (get_column - get_column.mean()) / get_column.std()

# Uncomment to test the addition of new columns for the z-score of each row of each feature.
# for feature_name in feature_list:
#   heart_disease_raw [feature_name + '-Z-score'] = calculate_z_score (heart_disease_raw, feature_name)
# print (heart_disease_raw.head(10))

In [48]:
def is_outlier_in_given_column (dataset, column_name):
  return abs (calculate_z_score (dataset, column_name)) > 3

def is_outlier (dataset, column_list):
  is_outlier = is_outlier_in_given_column (dataset, column_list[0])
  for column_name in column_list[1:]:
    is_outlier = is_outlier | is_outlier_in_given_column (dataset, column_name)
  return is_outlier

In [49]:
heart_disease_raw ['is_outlier'] = is_outlier (heart_disease_raw, feature_list)
print (heart_disease_raw ['is_outlier'].value_counts())

# A bit tricky to use the drop method on rows by condition, rather than specific indices, but it can be done.
# Docs on index method: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.index.html
# KRISTIAN_NOTE - I have a feeling I'll be using this method more in the future.
heart_disease_raw.drop (heart_disease_raw[heart_disease_raw ['is_outlier'] == True].index, inplace = True)
print (heart_disease_raw['is_outlier'].value_counts())
heart_disease_raw.drop (columns = 'is_outlier', inplace = True) # Drop the outliers column once we've removed outliers.

is_outlier
False    899
True      19
Name: count, dtype: int64
is_outlier
False    899
Name: count, dtype: int64


There are five columns that require one-hot encoding before I can train models with this dataset:
1. Sex ['F', 'M']
2. ChestPainType ['ASY', 'ATA', 'NAP', 'TA']
3. RestingECG ['LVH', 'Normal', 'ST']
4. ExerciseAngina ['N', 'Y']
5. ST_Slope ['Down', 'Flat', 'Up']

This should yield 14 total bits with 1's indicating the applicable categories for each row.  For example, the following row:
{Sex: 'F', ChestPainType: 'ATA', RestingECG: 'Normal', Exerciseangina: 'N', ST_Slope: 'Up'}

will yield the following encoding:
[1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1]

In [50]:
# One-Hot Encoding works with a number of bits equal to the total number of categories (eg. Male: [1, 0]; Female: [0, 1]),
# but it can also work with multiple categories at a time.
heart_disease_category_names = ['Sex', 'ChestPainType', 'RestingECG', 'ExerciseAngina', 'ST_Slope']

# See documentation: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
# See Wikipedia article: https://en.wikipedia.org/wiki/One-hot
one_hot = OneHotEncoder (handle_unknown = 'ignore')
one_hot.fit (heart_disease_raw[heart_disease_category_names])
print (one_hot.categories_)

[array(['F', 'M'], dtype=object), array(['ASY', 'ATA', 'NAP', 'TA'], dtype=object), array(['LVH', 'Normal', 'ST'], dtype=object), array(['N', 'Y'], dtype=object), array(['Down', 'Flat', 'Up'], dtype=object)]


In [51]:
encoded_categories = one_hot.transform (heart_disease_raw[heart_disease_category_names]).toarray().tolist()
# print (encoded_categories[0:3]) # Uncomment to check if categories correctly encoded.
encoded_heart_disease_df = pandas.concat ([\
  heart_disease_raw,\
  pandas.DataFrame (encoded_categories, columns = one_hot.get_feature_names_out())
  ], axis = 1\
)
encoded_heart_disease_df.drop (columns = heart_disease_category_names, inplace = True)
print (encoded_heart_disease_df.head(10))

    Age  RestingBP  Cholesterol  FastingBS  MaxHR  Oldpeak  HeartDisease  \
0  40.0      140.0        289.0        0.0  172.0      0.0           0.0   
1  49.0      160.0        180.0        0.0  156.0      1.0           1.0   
2  37.0      130.0        283.0        0.0   98.0      0.0           0.0   
3  48.0      138.0        214.0        0.0  108.0      1.5           1.0   
4  54.0      150.0        195.0        0.0  122.0      0.0           0.0   
5  39.0      120.0        339.0        0.0  170.0      0.0           0.0   
6  45.0      130.0        237.0        0.0  170.0      0.0           0.0   
7  54.0      110.0        208.0        0.0  142.0      0.0           0.0   
8  37.0      140.0        207.0        0.0  130.0      1.5           1.0   
9  48.0      120.0        284.0        0.0  120.0      0.0           0.0   

   Sex_F  Sex_M  ChestPainType_ASY  ChestPainType_ATA  ChestPainType_NAP  \
0    0.0    1.0                0.0                1.0                0.0   
1    1.0   

In [64]:
# At last, we split the dataset by dropping the 'HeartDisease' column and saving it as the target vector.
heart_disease_features = encoded_heart_disease_df.drop (columns = 'HeartDisease')
heart_disease_target = encoded_heart_disease_df['HeartDisease']
print (heart_disease_features.head())
print (heart_disease_target.head())

    Age  RestingBP  Cholesterol  FastingBS  MaxHR  Oldpeak  Sex_F  Sex_M  \
0  40.0      140.0        289.0        0.0  172.0      0.0    0.0    1.0   
1  49.0      160.0        180.0        0.0  156.0      1.0    1.0    0.0   
2  37.0      130.0        283.0        0.0   98.0      0.0    0.0    1.0   
3  48.0      138.0        214.0        0.0  108.0      1.5    1.0    0.0   
4  54.0      150.0        195.0        0.0  122.0      0.0    0.0    1.0   

   ChestPainType_ASY  ChestPainType_ATA  ChestPainType_NAP  ChestPainType_TA  \
0                0.0                1.0                0.0               0.0   
1                0.0                0.0                1.0               0.0   
2                0.0                1.0                0.0               0.0   
3                1.0                0.0                0.0               0.0   
4                0.0                0.0                1.0               0.0   

   RestingECG_LVH  RestingECG_Normal  RestingECG_ST  ExerciseA

In [66]:
# KRISTIAN_NOTE - Looks pretty evenly split between people with heart disease and people without.
print (heart_disease_target.value_counts())

HeartDisease
1.0    492
0.0    407
Name: count, dtype: int64


In [62]:
# Same as logistic regression project, scale each column by the standard deviation of its features
scaler = StandardScaler (with_std = True)

# But in this case, codebasics scales ALL the features, as opposed to scaling both the training and test sets by the
# standard deviation of only the training set after a train/test split.
scaled_features = scaler.fit_transform (heart_disease_features)
print (scaled_features[:3])

[[-1.42815446  0.46590022  0.84963584 -0.5503622   1.38431998 -0.85546862
  -0.515943    0.515943   -1.07752387  2.06332497 -0.5349047  -0.22955001
  -0.50382083  0.80970176 -0.48989795  0.8229452  -0.8229452  -0.26018448
  -0.99888827  1.13469459]
 [-0.47585532  1.63471366 -0.16812204 -0.5503622   0.7529728   0.13751561
   1.93819859 -1.93819859 -1.07752387 -0.48465463  1.86949191 -0.22955001
  -0.50382083  0.80970176 -0.48989795  0.8229452  -0.8229452  -0.26018448
   1.00111297 -0.88129441]
 [-1.7455875  -0.1185065   0.79361247 -0.5503622  -1.53566071 -0.85546862
  -0.515943    0.515943   -1.07752387  2.06332497 -0.5349047  -0.22955001
  -0.50382083 -1.23502263  2.04124145  0.8229452  -0.8229452  -0.26018448
  -0.99888827  1.13469459]]


In [67]:
# Because the outcomes are almost evenly split between 0 and 1 for heart disease,
# I don't think 'stratify = heart_disease_target' will be necessary.
features_train, features_test, targets_train, targets_test = train_test_split(\
  scaled_features, heart_disease_target, random_state = 8\
)