## Model Selection
##### Model selection is the process of choosing the best-suited model for a particular problem. The choice may depend on the dataset, the task, the nature of the model, etc.
**Two main factors to consider**
- The logical reason for selecting the model
- How the model's performance compares with that of the alternatives

Models can be selected depending on:
1. Type of data available
    - Images or videos - CNN
    - Text or speech data - RNN
    - Numeric data - SVM, logistic regression, decision trees
2. Task we want to carry out
    - Classification tasks - SVM, logistic regression, naive Bayes, KNN
    - Regression tasks - linear regression, ensemble models
    - Clustering tasks - K-means clustering, hierarchical clustering
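
As a rough illustration of the task-based shortlist above, the mapping can be written as a small lookup table of scikit-learn estimators. The dictionary name `CANDIDATES_BY_TASK`, the helper `shortlist`, and the particular ensemble chosen for regression (`RandomForestRegressor`) are assumptions made for this sketch and are not part of the notebook.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Hypothetical lookup table: task name -> estimator classes worth trying first
CANDIDATES_BY_TASK = {
    "classification": [SVC, LogisticRegression, GaussianNB, KNeighborsClassifier],
    "regression": [LinearRegression, RandomForestRegressor],
    "clustering": [KMeans, AgglomerativeClustering],
}

def shortlist(task):
    """Return freshly constructed default estimators for the given task."""
    return [estimator_cls() for estimator_cls in CANDIDATES_BY_TASK[task]]

for task in CANDIDATES_BY_TASK:
    print(task, [type(m).__name__ for m in shortlist(task)])
```

A table like this only narrows the field; the candidates still have to be compared on the actual data.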


**Import necessary packages**

In [2]:
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV 

In [3]:
# importing the models
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

In [5]:
# load the dataset
data = pd.read_csv('heart.csv')

In [6]:
# number of columns and rows
data.shape

(303, 14)

In [7]:
#check for null values 
data.isnull().sum()

age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
target      0
dtype: int64

In [8]:
#check distribution of target variable
data["target"].value_counts()

target
1    165
0    138
Name: count, dtype: int64

In [21]:
# separate the data into the response (y) and the predictor variables (X)
y = data["target"]
X = data.drop("target", axis=1)

In [24]:
# convert to NumPy arrays
X = np.asarray(X)
y = np.asarray(y)

**Model Selection**

#### Compare models with default hyperparameters using `cross_val_score`

In [28]:
# create a list of models
models = [
    LogisticRegression(max_iter=1000),
    SVC(kernel="linear"),
    KNeighborsClassifier(),
    RandomForestClassifier(random_state=0),
]

In [35]:
# create a function that prints the cross-validation accuracy of each model
def model_comparison():
    for model in models:
        # 5-fold cross-validation; the default scoring for classifiers is accuracy
        cv_score = cross_val_score(model, X, y, cv=5)
        mean_cv_score = round(cv_score.mean() * 100, 2)
        print(f'the cross value scores for {model} are {cv_score}')
        print(f'mean accuracy score for {model} is {mean_cv_score} %')
        print("*************************************************")
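
The function above prints its results. When the scores need to be sorted or reused later, one option is to collect them in a DataFrame together with the spread across folds. The sketch below is a variant built on assumptions, not the notebook's code: `summarize_models` is a hypothetical helper, and the data comes from `make_classification` as a stand-in so that the block runs without `heart.csv`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def summarize_models(models, X, y, cv=5):
    """Hypothetical helper: cross-validate each model and tabulate the results."""
    rows = []
    for model in models:
        # Default scoring for classifiers is accuracy
        scores = cross_val_score(model, X, y, cv=cv)
        rows.append({
            "model": type(model).__name__,
            "mean_accuracy_pct": round(scores.mean() * 100, 2),
            "std_accuracy_pct": round(scores.std() * 100, 2),
        })
    # Best mean accuracy first
    table = pd.DataFrame(rows).sort_values("mean_accuracy_pct", ascending=False)
    return table.reset_index(drop=True)

# Synthetic stand-in data (assumption), similar in shape to the heart dataset
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=0)
demo_models = [LogisticRegression(max_iter=1000), RandomForestClassifier(random_state=0)]
summary = summarize_models(demo_models, X_demo, y_demo)
print(summary)
```

Reporting the standard deviation next to the mean makes it easier to judge whether a gap between two models is larger than the fold-to-fold variation.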

In [36]:
model_comparison()

the cross value scores for LogisticRegression(max_iter=1000) are [0.80327869 0.86885246 0.85245902 0.86666667 0.75      ]
mean accuracy score for LogisticRegression(max_iter=1000) is 82.83 %
*************************************************
the cross value scores for SVC(kernel='linear') are [0.81967213 0.8852459  0.80327869 0.86666667 0.76666667]
mean accuracy score for SVC(kernel='linear') is 82.83 %
*************************************************
the cross value scores for KNeighborsClassifier() are [0.60655738 0.6557377  0.57377049 0.73333333 0.65      ]
mean accuracy score for KNeighborsClassifier() is 64.39 %
*************************************************
the cross value scores for RandomForestClassifier(random_state=0) are [0.85245902 0.90163934 0.81967213 0.81666667 0.8       ]
mean accuracy score for RandomForestClassifier(random_state=0) is 83.81 %
*************************************************


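From these results, the **Random Forest Classifier** has the highest mean accuracy (83.81 %). Logistic regression and the linear SVC are close behind at 82.83 % each, while KNN is clearly lowest at 64.39 %. The roughly one-point margin at the top is small compared with the variation between folds, so the ranking among the top three models is not conclusive.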
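
One plausible reason for the low KNN score is that KNN relies on distances between samples, so features measured on large scales can dominate the comparison, and the notebook passes the raw columns to every model. Below is a minimal sketch of that effect, assuming synthetic data from `make_classification` with one deliberately large-scale noise column added. This is not `heart.csv`, and the names `X_demo`, `raw_knn`, and `scaled_knn` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 6 features, 4 of them informative
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0, random_state=0
)
# Append a pure-noise column on a much larger scale than the other features
rng = np.random.default_rng(0)
X_demo = np.hstack([X_demo, rng.normal(scale=1000.0, size=(300, 1))])

raw_knn = KNeighborsClassifier()
# The scaler is fitted inside each training fold, so no test-fold statistics leak in
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier())

raw_mean = cross_val_score(raw_knn, X_demo, y_demo, cv=5).mean()
scaled_mean = cross_val_score(scaled_knn, X_demo, y_demo, cv=5).mean()
print(f"KNN without scaling: {raw_mean:.3f}")
print(f"KNN with StandardScaler: {scaled_mean:.3f}")
```

The same pipeline pattern could be applied to the heart data before comparing models; whether it closes the gap there would have to be checked by running it.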