### 1. Load and Preprocess Dataset
First, we load the dataset and analyze it to make sure it is suitable for training our models.

We import the required libraries for data loading and preprocessing,
- `pandas` for data loading and basic analysis
- `numpy` for efficient data manipulation
- `sklearn.preprocessing.StandardScaler` for standardization (zero mean, unit variance per feature)
- `sklearn.model_selection.train_test_split` for making the train, dev and test sets

In [26]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

Now, loading the dataset from the `csv` file,

In [15]:
df = pd.read_csv('heart_disease.csv')
df.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,num
0,63,1,1,145,233,1,2,150,0,2.3,3,0.0,6.0,0
1,67,1,4,160,286,0,2,108,1,1.5,2,3.0,3.0,2
2,67,1,4,120,229,0,2,129,1,2.6,2,2.0,7.0,1
3,37,1,3,130,250,0,0,187,0,3.5,3,0.0,3.0,0
4,41,0,2,130,204,0,2,172,0,1.4,1,0.0,3.0,0


Checking nulls in data,

In [16]:
df.isnull().sum()

Unnamed: 0,0
age,0
sex,0
cp,0
trestbps,0
chol,0
fbs,0
restecg,0
thalach,0
exang,0
oldpeak,0


Now checking the unique entries,

In [17]:
df.apply(lambda x: x.nunique())

Unnamed: 0,0
age,41
sex,2
cp,4
trestbps,50
chol,152
fbs,2
restecg,3
thalach,91
exang,2
oldpeak,40


The columns with nulls (`ca` and `thal`) look categorical, so we replace their missing values with the mode,

In [18]:
df.fillna({'ca': df.ca.mode()[0]}, inplace=True)
df.fillna({'thal': df.thal.mode()[0]}, inplace=True)
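To make the mode imputation concrete, here is a minimal sketch on a made-up toy frame (the values are illustrative, not taken from `heart_disease.csv`). It shows that `mode()` returns a Series, which is why `[0]` is needed to pick the most frequent value.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; the values are invented for illustration.
toy = pd.DataFrame({
    'ca': [0.0, 0.0, np.nan, 3.0],
    'thal': [3.0, np.nan, 7.0, 3.0],
})

# mode() ignores NaN and returns a Series (ties are possible),
# so [0] selects the first most frequent value.
fill_values = {col: toy[col].mode()[0] for col in ['ca', 'thal']}

# A single fillna call with a dict fills each column with its own value.
toy = toy.fillna(fill_values)

print(toy.isnull().sum().sum())
```

Passing one dict to `fillna` is equivalent to the two separate calls above and avoids `inplace=True`.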

Converting the target column `num` from 5 unique values to 2, so that every nonzero value becomes 1,

In [21]:
df.loc[:, 'num'] = df['num'].replace([1, 2, 3, 4], 1)
df['num'].unique()

array([0, 1])
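An equivalent way to express the same binarization is a boolean comparison. The sketch below uses a made-up label Series rather than the real column, and assumes all labels are non-negative integers.

```python
import pandas as pd

# Made-up labels covering the five original classes 0..4.
num = pd.Series([0, 2, 1, 0, 4, 3], name='num')

# Any value above 0 becomes 1; 0 stays 0.
binary = (num > 0).astype(int)

print(binary.tolist())  # [0, 1, 1, 0, 1, 1]
```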

Now, getting the summary stats of the data,

In [23]:
df.describe()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,num
count,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0
mean,54.438944,0.679868,3.158416,131.689769,246.693069,0.148515,0.990099,149.607261,0.326733,1.039604,1.60066,0.663366,4.722772,0.458746
std,9.038662,0.467299,0.960126,17.599748,51.776918,0.356198,0.994971,22.875003,0.469794,1.161075,0.616226,0.934375,1.938383,0.49912
min,29.0,0.0,1.0,94.0,126.0,0.0,0.0,71.0,0.0,0.0,1.0,0.0,3.0,0.0
25%,48.0,0.0,3.0,120.0,211.0,0.0,0.0,133.5,0.0,0.0,1.0,0.0,3.0,0.0
50%,56.0,1.0,3.0,130.0,241.0,0.0,1.0,153.0,0.0,0.8,2.0,0.0,3.0,0.0
75%,61.0,1.0,4.0,140.0,275.0,0.0,2.0,166.0,1.0,1.6,2.0,1.0,7.0,1.0
max,77.0,1.0,4.0,200.0,564.0,1.0,2.0,202.0,1.0,6.2,3.0,3.0,7.0,1.0


Now converting the dataframe to numpy arrays,

In [22]:
X = df.drop('num', axis=1).values
y = df['num'].values

Normalizing the input features,

In [25]:
scaler = StandardScaler()
X = scaler.fit_transform(X)
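As a quick illustration of what `StandardScaler` computes, the sketch below compares it with the manual formula `(x - mean) / std` on synthetic data (the array `X_demo` is an assumption made for the example, not the notebook's `X`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic features; not the heart disease data.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(100, 3))

scaled = StandardScaler().fit_transform(X_demo)

# StandardScaler uses the per-column mean and the population
# standard deviation (ddof=0), which is also NumPy's default.
manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(np.allclose(scaled, manual))
```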

Making train, dev, and test sets,

In [27]:
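# Hold out 40% of the data, keeping 60% for training
X_train, X_, y_train, y_ = train_test_split(X, y, test_size=0.4, random_state=42)
# Split the held-out part in half: 20% dev and 20% test
X_dev, X_test, y_dev, y_test = train_test_split(X_, y_, test_size=0.5, random_state=42)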
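The two-step split can be checked on a synthetic array with the same shape as the feature matrix (303 rows, 13 features). The sketch also shows one possible variant, which is not what the notebook does: fitting the scaler on the training part only, so that dev and test statistics do not influence preprocessing. All names here are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same shape as the real features.
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(303, 13))
y_demo = rng.integers(0, 2, size=303)

# Same two-step split as above: 60% train, then 20% dev and 20% test.
X_tr, X_rest, y_tr, y_rest = train_test_split(
    X_raw, y_demo, test_size=0.4, random_state=42)
X_dv, X_te, y_dv, y_te = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42)

print(len(X_tr), len(X_dv), len(X_te))  # 181 61 61

# Variant (an assumption, not the notebook's approach):
# fit the scaler on the training rows only, then reuse it everywhere.
scaler = StandardScaler().fit(X_tr)
X_tr_s, X_dv_s, X_te_s = (scaler.transform(part) for part in (X_tr, X_dv, X_te))
```

With 303 rows, the test set ends up with 61 samples, so each test sample accounts for roughly 1.6 percentage points of accuracy.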

### 2. Train and Evaluate the Models
We are now ready to train our `LogisticRegression` and `RandomForestClassifier` models.

Libraries,
- `sklearn.linear_model.LogisticRegression` for Logistic Regression
- `sklearn.ensemble.RandomForestClassifier` for Random Forest Classifier
- `sklearn.model_selection.RandomizedSearchCV` for hyperparameter tuning

In [36]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

First we train the logistic regression model,

In [37]:
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)

In [38]:
log_reg.score(X_test, y_test)

0.8032786885245902

So, we achieve about `80%` accuracy on the test set with the `LogisticRegression` model.
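For a classifier, `score` returns plain accuracy, which can hide how the errors are distributed between the two classes. The sketch below, run on synthetic data from `make_classification` (an assumption for illustration, not the heart disease data), shows that `score` matches `accuracy_score` and how a confusion matrix could be inspected alongside it.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem with 13 features.
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
pred = clf.predict(X_te)

# accuracy_score on the predictions equals clf.score on the same data.
acc = accuracy_score(y_te, pred)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, pred, labels=[0, 1])

print(acc)
print(cm)
```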

Now, let's train the `RandomForestClassifier` and perform hyperparameter tuning on it,

In [44]:
params = {
    'n_estimators': [100, 150, 200, 250, 300, 400, 500],
    'max_depth': [10, 15, 20, 25, 30, 40, 50, None],
    'min_samples_leaf': [1, 2, 4]
}

rand_forst = RandomizedSearchCV(RandomForestClassifier(), params, n_iter=20, cv=5, verbose=1)
rand_forst.fit(X_train, y_train)

Fitting 5 folds for each of 20 candidates, totalling 100 fits


In [45]:
rand_forst.score(X_test, y_test)

0.8524590163934426

And here we achieve about `85%` accuracy on the test set.
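After a randomized search, the fitted object exposes which combination won. The sketch below uses a deliberately small grid and synthetic data so it runs quickly. The grid values and the `random_state` arguments are assumptions for the example, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data; not the heart disease dataset.
X_demo, y_demo = make_classification(n_samples=200, n_features=13, random_state=0)

# Small illustrative grid so the search finishes quickly.
param_distributions = {
    'n_estimators': [10, 20, 30],
    'max_depth': [3, 5, None],
    'min_samples_leaf': [1, 2, 4],
}

# Fixing random_state on both objects makes the search repeatable.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions,
    n_iter=3,
    cv=3,
    random_state=0,
)
search.fit(X_demo, y_demo)

print(search.best_params_)   # winning hyperparameter combination
print(search.best_score_)    # mean cross-validated accuracy of that combination
best_model = search.best_estimator_  # refit on all of X_demo by default
```

Because `refit=True` is the default, calling `score` or `predict` on the search object delegates to `best_estimator_`.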

### 3. Save the Models for Future Use
Here we use the library,
- `joblib` for dumping and loading the sklearn models through `pkl` files

In [47]:
import joblib

Saving the logistic regression model,

In [48]:
joblib.dump(log_reg, 'log_reg.pkl')

['log_reg.pkl']

Saving the random forest model,

In [49]:
joblib.dump(rand_forst, 'rand_forst.pkl')

['rand_forst.pkl']
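Loading a saved model back works with `joblib.load`. The round trip below is a minimal sketch using a throwaway model and a temporary directory (the file name `demo_model.pkl` and the synthetic data are assumptions made for the example).

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Throwaway model trained on synthetic data.
X_demo, y_demo = make_classification(n_samples=100, n_features=13, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Dump to a temporary file and load it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo_model.pkl')
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should give identical predictions.
same = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print(same)
```

Since the models were trained on standardized features, new inputs need the same scaling at prediction time, so the fitted `StandardScaler` could be saved with `joblib.dump` in the same way.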