# Baseline score
A baseline score is a lower bound for our test accuracy: it tells us how far we can get with the minimum possible effort in building a model. This helps us in the following ways:
1. It gives us a rough idea of how much effort is required to improve the model
2. We have a 'reset point' if things go really bad

In [46]:
# importing basic libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

For this step, upload the data to your Google Drive: create a new folder in your drive named "Heart Failure Prediction" and store the file "heart.csv" in that folder. The file is also available on the Notion page.
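
The path used in the next cell assumes the notebook runs on Google Colab with Drive mounted at `/content/drive`. Here is a minimal sketch of that mounting step, assuming Colab. The `google.colab` module is only importable there, so the snippet skips the mount elsewhere. `DATA_PATH` is a name introduced here purely for illustration.

```python
from pathlib import Path

# Mount Google Drive when running on Colab; skip the mount elsewhere.
try:
    from google.colab import drive  # only importable inside Colab
    drive.mount("/content/drive")
except ImportError:
    print("Not running on Colab; skipping the Drive mount.")

# Illustrative name: the folder and file described above.
DATA_PATH = Path("/content/drive/MyDrive/Heart Failure Prediction/heart.csv")
print(DATA_PATH.exists())  # True once the file has been uploaded and Drive is mounted
```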

In [47]:
df = pd.read_csv("/content/drive/MyDrive/Heart Failure Prediction/heart.csv")
df.head()

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,40,M,ATA,140,289,0,Normal,172,N,0.0,Up,0
1,49,F,NAP,160,180,0,Normal,156,N,1.0,Flat,1
2,37,M,ATA,130,283,0,ST,98,N,0.0,Up,0
3,48,F,ASY,138,214,0,Normal,108,Y,1.5,Flat,1
4,54,M,NAP,150,195,0,Normal,122,N,0.0,Up,0


Let's have a look at the kind of data present in each column by printing its data type. These are NumPy dtypes that describe the kind of data stored in each column.

In [48]:
for col in df.columns:
  print(f"{col}: {df[col].dtype}")

Age: int64
Sex: object
ChestPainType: object
RestingBP: int64
Cholesterol: int64
FastingBS: int64
RestingECG: object
MaxHR: int64
ExerciseAngina: object
Oldpeak: float64
ST_Slope: object
HeartDisease: int64


Next, we want to know the number of unique values in each column whose data type is "object" (strings).

In [49]:
for col in df.columns:
  if df[col].dtype == object:   # np.object was removed from NumPy; use the builtin object. Similarly, np.int64 and np.float64 match int and float columns
    print(f"Name: {col}")         # printing the name of the column -- the f'' prefix makes this a formatted string (f-string)
    unique_vals = list(df[col].unique())    # fetching all unique values
    print(f"Num. of unique values: {len(unique_vals)}")
    if len(unique_vals) < 7:    # if the number of unique values is small, we will print them and see
      print(unique_vals)
    print('')

Name: Sex
Num. of unique values: 2
['M', 'F']

Name: ChestPainType
Num. of unique values: 4
['ATA', 'NAP', 'ASY', 'TA']

Name: RestingECG
Num. of unique values: 3
['Normal', 'ST', 'LVH']

Name: ExerciseAngina
Num. of unique values: 2
['N', 'Y']

Name: ST_Slope
Num. of unique values: 3
['Up', 'Flat', 'Down']



It seems that all columns with object data type have a small number of unique values. We'll return to these columns later before running our model. 
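
The same check can be written without an explicit loop. The sketch below is an alternative, not the approach this notebook uses. It runs on a small toy frame (`toy` is a made-up stand-in, not the real dataset) and uses `select_dtypes` and `nunique` from pandas.

```python
import pandas as pd

# Toy stand-in for the heart-failure data (values are illustrative only).
toy = pd.DataFrame({
    "Age": [40, 49, 37, 48],
    "Sex": ["M", "F", "M", "F"],
    "ChestPainType": ["ATA", "NAP", "ATA", "ASY"],
})

# Keep only the non-numeric columns, then count distinct values in each.
text_cols = toy.select_dtypes(exclude="number")
unique_counts = text_cols.nunique()
print(unique_counts)
```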

Next, we'll look at some basic statistics for the columns that have a numeric data type. pandas can handle this in a single call.

In the code below, the last column (the HeartDisease label) is dropped first. Statistics are then reported only for the numeric columns; any column with an object data type is filtered out automatically.

In [50]:
df.drop(df.columns[-1], axis=1).describe()

Unnamed: 0,Age,RestingBP,Cholesterol,FastingBS,MaxHR,Oldpeak
count,918.0,918.0,918.0,918.0,918.0,918.0
mean,53.510893,132.396514,198.799564,0.233115,136.809368,0.887364
std,9.432617,18.514154,109.384145,0.423046,25.460334,1.06657
min,28.0,0.0,0.0,0.0,60.0,-2.6
25%,47.0,120.0,173.25,0.0,120.0,0.0
50%,54.0,130.0,223.0,0.0,138.0,0.6
75%,60.0,140.0,267.0,0.0,156.0,1.5
max,77.0,200.0,603.0,1.0,202.0,6.2


In the table above, the values we're initially concerned about are:
1. mean
2. std 
3. max
4. min
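
To look at only these four statistics, the rows of the `describe()` output can be selected by label. This is a minimal sketch on a toy frame built from the first five `Age` and `RestingBP` values shown earlier; `toy` and `summary` are illustrative names.

```python
import pandas as pd

# First five rows of two numeric columns from the table shown earlier.
toy = pd.DataFrame({
    "Age": [40, 49, 37, 48, 54],
    "RestingBP": [140, 160, 130, 138, 150],
})

# describe() returns a DataFrame indexed by statistic name,
# so .loc can pick just the rows of interest.
summary = toy.describe().loc[["mean", "std", "min", "max"]]
print(summary)
```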

## Preparing the dataset for ML models

Step 1: Replacing object types with numbers

A machine learning model can only work with numbers. However, we have 5 columns which hold non-numeric data. We will now convert these strings to numbers. The process of converting string data to numeric form is called encoding.

There are various forms of encoding. In fact, encoding text as numbers has evolved into its own research area under NLP. We will use the simplest possible protocol for encoding, known as ordinal encoding.

Ordinal encoding assigns a number to each distinct value. Since our categories have no natural order, the order we choose is arbitrary. Let's try this with the first column.

In [51]:
data = list(df['Sex']) # copying the data in a separate variable to avoid overwriting

We will arbitrarily label 'M' as 0 and 'F' as 1.

In [52]:
data_ordinal = []
for val in data:
  if val == 'M':
    data_ordinal.append(0.0)
  else:
    data_ordinal.append(1.0)

In [53]:
# Try printing a few values together
for i in range(5):
  print(data[i], data_ordinal[i])

M 0.0
F 1.0
M 0.0
F 1.0
M 0.0


Now that we have confirmed that our encoding works, we'll write it back to the original dataset.

In [54]:
df['Sex'] = data_ordinal
print(f"data type for the column 'Sex' is: {df['Sex'].dtype}")
df.head()

data type for the column 'Sex' is: float64


Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,40,0.0,ATA,140,289,0,Normal,172,N,0.0,Up,0
1,49,1.0,NAP,160,180,0,Normal,156,N,1.0,Flat,1
2,37,0.0,ATA,130,283,0,ST,98,N,0.0,Up,0
3,48,1.0,ASY,138,214,0,Normal,108,Y,1.5,Flat,1
4,54,0.0,NAP,150,195,0,Normal,122,N,0.0,Up,0


We'll do the same for the remaining columns.

In the code above, I explicitly wrote an if..else statement to label 'M' as 0 and 'F' as 1. This method becomes very tedious if a column has more than 5 categories, and we would have to write separate code for every such column. A smarter way is to fetch an array of the column's unique values and replace each string in the column with its index in that array. I'll state that more clearly:

Let's say that the array of unique values for 'Sex' is: ['M', 'F'].
The index of 'M' is 0 and that of 'F' is 1. These are the same values with which we replaced these strings above. You can fetch the index of an item by using the following code:



```Python
array = ['M', 'F']
print(array.index('F'))  # prints 1
```
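
As a cross-check on this index-lookup idea, pandas offers `pd.factorize`, which numbers values by order of first appearance. The sketch below is an alternative, not the method this notebook uses. It runs on the first five 'Sex' values shown earlier; `sex`, `codes` and `uniques` are illustrative names.

```python
import pandas as pd

# First five 'Sex' values from the table shown earlier.
sex = pd.Series(["M", "F", "M", "F", "M"])

# factorize returns integer codes plus the unique values,
# numbered in order of first appearance.
codes, uniques = pd.factorize(sex)
print(list(codes))    # [0, 1, 0, 1, 0]
print(list(uniques))  # ['M', 'F']
```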



In [55]:
for col in df.columns:
  if df[col].dtype == object:
    temp_unique = list(df[col].unique())  # fetching unique values
    temp_array = list(df[col])            # data present in the column
    temp_ordinal = []                     # empty array for converted data
    for val in temp_array:
      index_of_var = temp_unique.index(val)   # this is the modification -- we will replace the terms by their index in temp_unique
      temp_ordinal.append(float(index_of_var))
    df[col] = temp_ordinal                # writing it back to the dataset

In [56]:
df.head()

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,40,0.0,0.0,140,289,0,0.0,172,0.0,0.0,0.0,0
1,49,1.0,1.0,160,180,0,0.0,156,0.0,1.0,1.0,1
2,37,0.0,0.0,130,283,0,1.0,98,0.0,0.0,0.0,0
3,48,1.0,2.0,138,214,0,0.0,108,1.0,1.5,1.0,1
4,54,0.0,1.0,150,195,0,0.0,122,0.0,0.0,0.0,0


Now that we have all numbers in our dataset, we are just one step away from training our model.

Go back and have a look at the max values in the statistics table we printed earlier. ML models can take many more iterations to train when features have very large values.

To address that, if the maximum value of any column exceeds 5 (arbitrarily chosen), we will divide the values of that column by its mean. This technique is known as normalization. There are several better ways to normalize a dataset, and this one is honestly not a good choice, but we'll stick with it for now and explore more in the coming days.

In [57]:
for col in df.columns:
  if df[col].max() > 5:
    df[col] = df[col]/df[col].mean()

In [58]:
df.head()

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,0.747511,0.0,0.0,1.05743,1.453726,0,0.0,1.257224,0.0,0.0,0.0,0
1,0.915701,1.0,1.0,1.208491,0.905435,0,0.0,1.140273,0.0,1.126933,1.0,1
2,0.691448,0.0,0.0,0.981899,1.423544,0,1.0,0.716325,0.0,0.0,0.0,0
3,0.897014,1.0,2.0,1.042324,1.076461,0,0.0,0.78942,1.0,1.6904,1.0,1
4,1.00914,0.0,1.0,1.13296,0.980887,0,0.0,0.891752,0.0,0.0,0.0,0
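
For comparison, two commonly used alternatives to dividing by the mean are min-max scaling and standardization (z-scores). The sketch below applies all three to the first five `Age` values from the raw data. It is only an illustration under those assumptions, not necessarily what the later lessons will use, and the variable names are made up for this example.

```python
import pandas as pd

# First five raw 'Age' values from the table shown earlier.
age = pd.Series([40, 49, 37, 48, 54], dtype=float)

# 1. Divide by the mean (the approach used in this notebook): result averages 1.
mean_scaled = age / age.mean()

# 2. Min-max scaling: squeezes values into the range [0, 1].
min_max = (age - age.min()) / (age.max() - age.min())

# 3. Standardization: zero mean and unit (population) standard deviation.
z_score = (age - age.mean()) / age.std(ddof=0)

print(mean_scaled.round(3).tolist())
print(min_max.round(3).tolist())
print(z_score.round(3).tolist())
```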


## Getting started with training our model

The first step is to separate the labels from the features.

In [38]:
X_columns = []
for col in df.columns:
  if col != 'HeartDisease':
    X_columns.append(col)

In [39]:
X = df[X_columns]
y = df['HeartDisease']
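
The same separation can be sketched more compactly with `DataFrame.drop`, which returns a copy without the named column. The toy frame and the names `X_toy` and `y_toy` are illustrative only.

```python
import pandas as pd

# Toy frame with two features and the label column.
toy = pd.DataFrame({
    "Age": [40, 49, 37],
    "MaxHR": [172, 156, 98],
    "HeartDisease": [0, 1, 0],
})

X_toy = toy.drop(columns="HeartDisease")  # every column except the label
y_toy = toy["HeartDisease"]               # the label column on its own

print(list(X_toy.columns))
```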

Next, we will split the data into train and test subsets. The test subset serves as a proxy for unseen data and will be used to generate the baseline accuracy.

In [40]:
from sklearn.model_selection import train_test_split

In [41]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

The test_size used above is 0.2, which means that 20% of the data will be kept aside for testing and, consequently, 80% of the dataset will be used for training.
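
Because the split is random, every run produces a different train/test partition. The sketch below shows two optional `train_test_split` arguments. `random_state` fixes the shuffle so the split is repeatable, and `stratify` keeps the class proportions the same in both subsets. The toy arrays and the seed 42 are arbitrary choices for illustration, not settings used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples, 2 features, balanced binary labels.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1] * 5)

# random_state makes the shuffle repeatable; stratify preserves the class balance.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)
```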

Our first model is logistic regression. This is a simple classification model. To read more about it, see [scikit-learn: Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html?highlight=logistic#sklearn.linear_model.LogisticRegression)

In [42]:
from sklearn.linear_model import LogisticRegression

In [43]:
clf = LogisticRegression()
clf.fit(X_train, y_train)

LogisticRegression()

In [44]:
clf.score(X_test, y_test)

0.842391304347826

Because train_test_split shuffles the data randomly, you might see a different output for the last cell when you run the code yourself. The accuracy should be close to 0.84, which is pretty impressive for a start.
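
For a classifier, `score` reports the mean accuracy: the fraction of test rows whose predicted label matches the true label. Here is a minimal sketch of that computation on made-up labels (`y_true` and `y_pred` are illustrative, not outputs of the model above).

```python
import numpy as np

# Made-up true labels and predictions for five test rows.
y_true = np.array([0, 1, 0, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0])

# Accuracy = fraction of positions where the prediction equals the label.
accuracy = np.mean(y_pred == y_true)
print(accuracy)  # 4 of 5 predictions match -> 0.8
```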

Your work in the coming days will be focused on beating this baseline score. We will try to:
1. Use EDA to select the best features for learning
2. Use better ways to encode the string/object type columns
3. Try better techniques for normalizing the data
4. Use a better algorithm
5. Use regularization and cross-validation for optimum learning
6. Fine-tune the hyperparameters of our models

Don't worry if you don't understand any of the above.