 ## Model Selection
Model selection is the process of choosing the best-suited model for a particular problem. The choice may depend on the dataset, the task, the nature of the model, and so on.

**Two main factors to consider**
- A logical reason for selecting the model
- A comparison of the candidate models' performance

Models can be selected depending on:
1. Type of data available
    - Images or videos - CNN
    - Text or speech data - RNN
    - Numeric data - SVM, logistic regression, decision trees
2. Task we want to carry out
    - Classification - SVM, logistic regression, naive Bayes, KNN
    - Regression - linear regression, ensemble models
    - Clustering - K-means clustering, hierarchical clustering
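
The task-based shortlist above can be written down as a small lookup table. This is only an illustrative sketch: the `CANDIDATES` dictionary and the `candidate_models` helper are hypothetical names introduced here, and the specific estimators and their settings are assumptions, not a recommendation from this notebook.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Hypothetical lookup table: task name -> estimators worth trying first
CANDIDATES = {
    "classification": [
        SVC(),
        LogisticRegression(max_iter=1000),
        GaussianNB(),
        KNeighborsClassifier(),
    ],
    "regression": [
        LinearRegression(),
        RandomForestRegressor(random_state=0),  # one example of an ensemble model
    ],
    "clustering": [
        KMeans(n_clusters=2, n_init=10, random_state=0),
        AgglomerativeClustering(n_clusters=2),  # hierarchical clustering
    ],
}


def candidate_models(task):
    """Return the shortlist for a task, or raise if the task is unknown."""
    if task not in CANDIDATES:
        raise ValueError(f"unknown task: {task!r}")
    return CANDIDATES[task]


shortlist = [type(m).__name__ for m in candidate_models("classification")]
print(shortlist)
```

A shortlist like this only covers the "logical reason" factor; the candidates still have to be compared on the actual data.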


**Import necessary packages**

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV 

In [2]:
# importing the models
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

In [3]:
# load the dataset
data = pd.read_csv('heart.csv')

In [4]:
# number of columns and rows
data.shape

(303, 14)

In [5]:
#check for null values 
data.isnull().sum()

age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
target      0
dtype: int64

In [6]:
#check distribution of target variable
data["target"].value_counts()

target
1    165
0    138
Name: count, dtype: int64
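
The two classes are fairly balanced (165 vs 138). When `cross_val_score` is given an integer `cv` and a classifier, scikit-learn uses stratified folds, so each fold keeps roughly this class ratio. The sketch below illustrates that on a placeholder target with the same class counts; `y_demo` and `X_demo` are stand-ins invented for the example, and `shuffle=True` is an assumption added here (the integer-`cv` default does not shuffle).

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Placeholder target with the same class counts as heart.csv (assumption)
y_demo = np.array([1] * 165 + [0] * 138)
X_demo = np.zeros((len(y_demo), 1))  # dummy features; only y matters for the split

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Share of class 1 in each test fold, compared with the overall share
positive_share = [y_demo[test_idx].mean() for _, test_idx in skf.split(X_demo, y_demo)]
overall_share = y_demo.mean()

print(f"overall share of class 1: {overall_share:.3f}")
print("per-fold share of class 1:", [round(s, 3) for s in positive_share])
```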

In [7]:
# separate the data into response and predictor variables
y = data["target"]
X = data.drop("target", axis=1)

In [8]:
# convert to NumPy arrays
X = np.asarray(X)
y = np.asarray(y)

**Model Selection**

#### Compare models with near-default hyperparameters using cross_val_score

In [9]:
# create a list of models
models = [LogisticRegression(max_iter=1000), SVC(kernel="linear"), KNeighborsClassifier(), RandomForestClassifier(random_state=0)]

In [10]:
# print the 5-fold cross-validation accuracy of each model in the list
def model_comparison():
    for model in models:
        cv_score = cross_val_score(model, X, y, cv=5)
        mean_cv_score = round(cv_score.mean() * 100, 2)
        print(f'the cross-validation scores for {model} are {cv_score}')
        print(f'mean accuracy score for {model} is {mean_cv_score} %')
        print("*************************************************")

In [11]:
model_comparison()

the cross-validation scores for LogisticRegression(max_iter=1000) are [0.80327869 0.86885246 0.85245902 0.86666667 0.75      ]
mean accuracy score for LogisticRegression(max_iter=1000) is 82.83 %
*************************************************
the cross-validation scores for SVC(kernel='linear') are [0.81967213 0.8852459  0.80327869 0.86666667 0.76666667]
mean accuracy score for SVC(kernel='linear') is 82.83 %
*************************************************
the cross-validation scores for KNeighborsClassifier() are [0.60655738 0.6557377  0.57377049 0.73333333 0.65      ]
mean accuracy score for KNeighborsClassifier() is 64.39 %
*************************************************
the cross-validation scores for RandomForestClassifier(random_state=0) are [0.85245902 0.90163934 0.81967213 0.81666667 0.8       ]
mean accuracy score for RandomForestClassifier(random_state=0) is 83.81 %
*************************************************
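
KNN scores well below the other models here. One plausible reason (not verified in this notebook) is that KNN is distance-based and the features are unscaled, so columns with large numeric ranges dominate the distance. A minimal sketch of the usual remedy, a `StandardScaler` inside a pipeline, is shown below. It runs on synthetic data from `make_classification` as a stand-in for `heart.csv`; the names `X_demo`, `y_demo`, `raw_knn` and `scaled_knn` are made up for the example, and the scores it prints say nothing about the heart dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for heart.csv (assumption): 303 rows, 13 features
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=0)
# Put one feature on a much larger scale to mimic mixed units
X_demo[:, 0] *= 1000

raw_knn = KNeighborsClassifier()
# The scaler is fitted on the training folds only, because it lives inside the pipeline
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier())

raw_scores = cross_val_score(raw_knn, X_demo, y_demo, cv=5)
scaled_scores = cross_val_score(scaled_knn, X_demo, y_demo, cv=5)

print(f"unscaled KNN mean accuracy: {raw_scores.mean():.3f}")
print(f"scaled KNN mean accuracy:   {scaled_scores.mean():.3f}")
```

Keeping the scaler inside the pipeline matters: scaling the whole dataset before cross-validation would leak information from the validation folds into training.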


#### Compare models with tuned hyperparameters using GridSearchCV

In [25]:
# define the models to tune
models_list = [LogisticRegression(max_iter=10000), SVC(), KNeighborsClassifier(), RandomForestClassifier(random_state=0)]

In [26]:
# hyperparameter grid for each model; the keys are in the same order as models_list
hyperparameters = {
    "logistic_regression": {
        "C": [1, 5, 10, 20]
    },
    "support_vector_machine": {
        "kernel": ["linear", "rbf", "poly", "sigmoid"],
        "C": [1, 5, 10, 20]
    },
    "k_nearest_neighbors": {
        "n_neighbors": [3, 5, 10]
    },
    "random_forest": {
        "n_estimators": [10, 20, 50, 100]
    }
}
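
A grid search tries every combination of the listed values, so the cost grows multiplicatively. As a rough sketch, scikit-learn's `ParameterGrid` can count the combinations for the SVC grid above; the variable names here are illustrative only.

```python
from sklearn.model_selection import ParameterGrid

# Same SVC grid as in the hyperparameters dictionary above
svc_grid = {
    "kernel": ["linear", "rbf", "poly", "sigmoid"],
    "C": [1, 5, 10, 20],
}

n_candidates = len(ParameterGrid(svc_grid))  # 4 kernels x 4 values of C
n_folds = 5                                  # 5-fold cross-validation
n_fits = n_candidates * n_folds              # model fits during the search itself

print(f"{n_candidates} candidates, {n_fits} cross-validation fits")
```

With the default `refit=True`, `GridSearchCV` performs one additional fit of the best candidate on the full data after the search.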


In [27]:
model_keys=list(hyperparameters.keys())
model_keys[0]

'logistic_regression'

In [28]:
hyperparameters[model_keys[0]]

{'C': [1, 5, 10, 20]}

In [42]:
# run a grid search for each model and collect its best score and hyperparameters
def modelSelection(list_of_models, hyperparameters_dict):
    results = []

    # pair each model with its grid; relies on the dict keys being in the same order as the models
    for model, key in zip(list_of_models, hyperparameters_dict):
        param = hyperparameters_dict[key]

        print(model)
        print(param)

        # 5-fold cross-validated search over every combination in the grid
        classifier = GridSearchCV(model, param, cv=5)
        classifier.fit(X, y)

        results.append({
            "model_used": model.__class__.__name__,
            "highest_score": classifier.best_score_,
            "best_hyperparams": classifier.best_params_
        })

    results_dataframe = pd.DataFrame(results, columns=["model_used", "highest_score", "best_hyperparams"])
    return results_dataframe

In [43]:
modelSelection(models_list,hyperparameters)

LogisticRegression(max_iter=10000)
{'C': [1, 5, 10, 20]}
SVC()
{'kernel': ['linear', 'rbf', 'poly', 'sigmoid'], 'C': [1, 5, 10, 20]}
KNeighborsClassifier()
{'n_neighbors': [3, 5, 10]}
RandomForestClassifier(random_state=0)
{'n_estimators': [10, 20, 50, 100]}


|   | model_used | highest_score | best_hyperparams |
|---|---|---|---|
| 0 | LogisticRegression | 0.831585 | {'C': 5} |
| 1 | SVC | 0.828306 | {'C': 1, 'kernel': 'linear'} |
| 2 | KNeighborsClassifier | 0.64388 | {'n_neighbors': 5} |
| 3 | RandomForestClassifier | 0.838087 | {'n_estimators': 100} |
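
The `highest_score` values are mean cross-validation scores on the same data that was used to pick the hyperparameters, so they tend to be slightly optimistic. A common follow-up is to hold out a test set before the search and score the refitted best model on it. The sketch below shows the pattern on synthetic data standing in for `heart.csv`; the 80/20 split and all `*_demo` names are assumptions made for the example, not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for heart.csv (assumption): 303 rows, 13 features
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=0)

# Hold out 20% of the rows; the grid search never sees them
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {"n_estimators": [10, 20, 50, 100]},
    cv=5,
)
search.fit(X_train, y_train)

# score() uses the best estimator, refitted on the whole training split
test_accuracy = search.score(X_test, y_test)

print("best hyperparameters:", search.best_params_)
print(f"cross-validation accuracy: {search.best_score_:.3f}")
print(f"held-out test accuracy:    {test_accuracy:.3f}")
```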
