# Optimising models & SVMs

In all the ML algorithms we have learnt so far, you have learnt a few hyperparameters and understood the general effects of them on each model. But how do we best set the hyperparameter to customise our models to our dataset? We can objectively search different values for model hyperparameters and choose a subset that results in a model that achieves the best performance - this is called **hyperparameter tuning / hyperparameter optimisation**. 

Today we will discover hyperparameter tuning using scikit-learn. We will first try it on some of the classification / regression models that you already know, then we will learn a new algorithm, **Support Vector Machines**, and apply hyperparameter tuning as well. 

## Outline <a name="top"></a>
1. [House Price Data Preparation](#cleaning)
2. [Hyperparameter Tuning](#ht)
    1. [Grid Search](#grid)
    2. [Random Search](#random)
3. [Hyperparameter Tuning in Classification](#classification)
4. [Hyperparameter Tuning in Regression](#regression)
5. [Support Vector Machines](#svm)
6. [Activity](#activity)

In [1]:
### loading packages

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import random as rand

from sklearn.model_selection import train_test_split

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier


from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge

from sklearn.impute import SimpleImputer
from sklearn.neighbors import NearestNeighbors

from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler
from imblearn.over_sampling import SMOTE

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import RandomizedSearchCV

from sklearn import svm

from sklearn.metrics import f1_score, precision_score, recall_score, cohen_kappa_score, roc_curve, roc_auc_score, confusion_matrix, ConfusionMatrixDisplay, accuracy_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error


In [2]:
#Here is the pip-free version: Our own ReSampler for imbalance in your target variable 

def ReSampler(X,y,Target,Ratio,Mode,Seed):
    X[Target] = y 
    df = X
    if (Mode == 0) or (Mode == "Over"):
        #Get the Minority Class
        MinorityClass=df[Target].value_counts().loc[df[Target].value_counts()==min(df[Target].value_counts())].index[0]

        #Get a df of only Minority Cases - for resampling
        Minoritydf=df[df[Target]==MinorityClass]

        #Find out how many samples we need in total
        TargetTotal=round(max(df[Target].value_counts())/(1-Ratio))

        #Find out how many of our sample need to be of the minority class, so the target ratio is reached
        SamplesNeeded=TargetTotal-len(df)

        #Initialise Resampled Df
        ResampledDf=pd.DataFrame(np.empty((0,len(Minoritydf.columns))))
        ResampledDf.columns=Minoritydf.columns

        #Draw with replacement from the Minority Dataframe to get as many Samples as needed
        ResampledDf=ResampledDf.append(Minoritydf.sample(int(SamplesNeeded), replace = True, axis = 0,random_state=Seed))

        NewDf=df.append(ResampledDf)
    elif (Mode == 1) or (Mode == "Under"):
        #Get the Majority Class and Minority Class
        MajorityClass=df[Target].value_counts().loc[df[Target].value_counts()==max(df[Target].value_counts())].index[0]
        MinorityClass=df[Target].value_counts().loc[df[Target].value_counts()!=max(df[Target].value_counts())].index[0]

        #Get a df of only Minority Cases - for the output
        Minoritydf=df[df[Target]==MinorityClass]

        #Get a df of only Majority Classes - for resampling
        Majoritydf=df[df[Target]==MajorityClass]
        
        #Find out how many samples we need in total
        TargetTotal=round(min(df[Target].value_counts())/(1-Ratio))
        
        #Find out how many of our sample need to be of the majority class, so the target ratio is reached
        SamplesNeeded=TargetTotal-len(df)

        #get a list of random numbers the length of Majoritydf
        nums=list(range(len(Majoritydf)))

        #randomly reorder these numbers and draw an amount from them equal to the length of the Minoritydf
        rand.Random(Seed).shuffle(nums)
        nums=nums[0:SamplesNeeded]

        #Now use that shorter list of numbers to slice Majoritydf - so you you only select a random subset of that df that has the
        #length of Minoritydf
        Majoritydf=Majoritydf.iloc[nums,:]

        #Recombine the two df
        NewDf=Minoritydf.append(Majoritydf)
    elif (Mode == 2) or (Mode == "SMOTE"):
        #Get the Minority Class
        MajorityClass=df[Target].value_counts().loc[df[Target].value_counts()==max(df[Target].value_counts())].index[0]
        MinorityClass=df[Target].value_counts().loc[df[Target].value_counts()==min(df[Target].value_counts())].index[0]
        
        #Get a Dataframe of the MinorityClass
        Minoritydf=df[df[Target]==MinorityClass]
        
        #Get a df of only Majority Classes - for resampling
        Majoritydf=df[df[Target]==MajorityClass]
        
        #Standardise that dataframe (writign our own StandardScaler here is more efficient, because it saves us converting for DataFrame to array and back)
        StandardMinoritydf=Minoritydf.copy()
        for i in range(len(StandardMinoritydf.columns)):
            if Minoritydf.iloc[:,i].astype("float").std()==0:
                StandardMinoritydf.iloc[:,i]=(Minoritydf.iloc[:,i].astype("float")-Minoritydf.iloc[:,i].astype("float").mean())
            else:
                StandardMinoritydf.iloc[:,i]=(Minoritydf.iloc[:,i].astype("float")-Minoritydf.iloc[:,i].astype("float").mean())/Minoritydf.iloc[:,i].astype("float").std()
        
        #Initiate Nearest Neighbour Algorythm - we do 5 neighbours, as suggested in the original SMOTE paper
        Neigh=NearestNeighbors(n_neighbors=5, algorithm='ball_tree')
        
        #Preallocating an empty dataframe with the same columns as our input
        ResampledDf=pd.DataFrame(np.empty((0,len(Minoritydf.columns))))
        ResampledDf.columns=Minoritydf.columns
        
        #Preallocating a counter for the loop
        k=1
        
        #Setting a seed
        np.random.seed(Seed)
        
        #The Main SMOTE Loop - go until the Minority data is [Ratio] percent of all data
        while (len(StandardMinoritydf))/(len(StandardMinoritydf)+len(Majoritydf)) < Ratio:
            
            #get a random number for picking a random row of data later
            i=rand.randint(0,len(StandardMinoritydf)-1)
            
            #get a random number for picking a random one of the datapoint's 5 nearest neighbours later
            j=rand.randint(0,4)
            
            #Find the 5 nearest neighbours of all current datapoints
            Neigh.fit(StandardMinoritydf)
            _, indices = Neigh.kneighbors()
            
            #Create a new datapoint that lies exactly between a randomly chosen original datapoint, and one of its nearest neighbours (randomly chosen)
            ResampledDf=pd.DataFrame((StandardMinoritydf.iloc[i,:].astype("float")+StandardMinoritydf.iloc[indices[i][j]].astype("float"))/2).transpose()
            
            #Give that new datapoint an index that counts up from the highest index number in the original dataset
            ResampledDf.index=range(df.index.max()+k,df.index.max()+k+1)
            
            #Add the new datapoint to the standardised dataframe containing all data from the minority class
            StandardMinoritydf=StandardMinoritydf.append(ResampledDf)
            
            #Increase the loop counter
            k+=1

        #For my sanity, make a copy of the now SMOTEed, but still standardised dataframe
        NewDf=StandardMinoritydf.copy()
        
        #Reverse the Standardisation process by multiplying all data by the original standard deviations of their columns, and adding their columns original means
        for i in range(len(NewDf.columns)):
            if Minoritydf.iloc[:,i].astype("float").std()==0:
                NewDf.iloc[:,i]=(StandardMinoritydf.iloc[:,i].astype("float")+Minoritydf.iloc[:,i].astype("float").mean())
            else:
                NewDf.iloc[:,i]=(StandardMinoritydf.iloc[:,i].astype("float")*Minoritydf.iloc[:,i].astype("float").std())+Minoritydf.iloc[:,i].astype("float").mean()
        
        #Unite the (SMOTEed) Minority data with the Majority data - ready to eject!
        NewDf=Majoritydf.append(NewDf)
    NewX = NewDf.drop(columns = [Target])
    Newy = NewDf[Target]
    return NewX, Newy

In [3]:
## Impute missing values using Simple Imputation method - let's impute our numerical values with 'mean' value and the categorical with 'most_frequent'
### Since we need to impute on both X_train and X_test separately, let's use a function to avoid repeating ourselves!!

def impute_missing_values(df):
    # Separate numerical and categorical features
    num_features = df.select_dtypes(include=['float64', 'int64']).columns.tolist()
    cat_features = df.select_dtypes(include=['object']).columns.tolist()

    # Create separate SimpleImputer instances for numerical and categorical features
    num_imputer = SimpleImputer(strategy='mean')
    cat_imputer = SimpleImputer(strategy='most_frequent')

    # Impute missing values in numerical features
    df_num = df[num_features]
    df_num_imputed = num_imputer.fit_transform(df_num)

    # Convert the imputed numerical features back to a DataFrame
    df_num_imputed_df = pd.DataFrame(df_num_imputed, columns=num_features)

    # Impute missing values in categorical features
    df_cat = df[cat_features]
    df_cat_imputed = cat_imputer.fit_transform(df_cat)

    # Convert the imputed categorical features back to a DataFrame
    df_cat_imputed_df = pd.DataFrame(df_cat_imputed, columns=cat_features)

    # Combine the numerical and categorical DataFrames back into one DataFrame
    df_imputed = pd.concat([df_num_imputed_df, df_cat_imputed_df], axis=1)
    
    return df_imputed

In [4]:
# While we are on the function train, let's have another function that allows you to print the performance metric for each class balancing method 

def get_classification_results(truth, prediction, df, idx):

    acc = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    auc = roc_auc_score(y_test, y_probs)
    kappa = cohen_kappa_score(y_test, y_pred)

    df.loc[idx,:] = [acc, f1, precision, recall, auc, kappa]

    return df

In [5]:
def get_reg_results(truth, prediction, df, idx):
    '''
    This function generates a results dataframe given your y_test and predictions. 
    It allows you to take the output of this function and put it into the next time you call the function
    Filling out the entirity of the dataframe one function call (and row) at a time
    Inputs:
    - truth = y_test, the actual values
    - prediction = your predictions
    - df = a dataframe that is already set up to hold our data. 
    - idx = the method that you are using (which corresponds to a row index)
    Output:
    - the input df, but with another row filled out
    '''
    
    mse_int = mean_squared_error(y_test, y_pred)
    mse = round(mse_int, 3)
    rmse = round((mse_int)**0.5, 3)
    mae = round(mean_absolute_error(y_test, y_pred), 3)
    
    SS_Residual = sum((truth-prediction)**2)       
    SS_Total = sum((truth-np.mean(truth))**2)     
    r_squared = 1 - (float(SS_Residual))/SS_Total
    
    r2 = round(r_squared, 3)
    
    df.loc[idx, :] = [rmse, mae, r2]
    
    return df

## Data Preparation <a name="cleaning"></a>

You have guessed it right - its the London House Price Dataset again for demonstration!

You already know the cleaning process. It's exactly the same as in previous workshops. 


In [7]:
## Read in comma separated file 

df = pd.read_csv('../CSV files/london_house_price_data.csv',index_col = [0])
df.info()

FileNotFoundError: [Errno 2] No such file or directory: '../CSV files/london_house_price_data.csv'

In [None]:
def clean(df):
    ## Removing columns 
    byebye_col = ['link','address','description','added_date','sold_date','agent','postcode', 'borough']
    df = df.drop(columns = byebye_col)
    
    # Remove word "miles" in column "distance_to_station"
    df['distance_to_station'] = df['distance_to_station'].str.replace(' miles', '').astype('float')
    
    # Combining the property type of bungalow into house 
    df['property_type'] = df['property_type'].replace(['Detached Bungalow', 'Retirement Property', 'Semi-Detached Bungalow'], 'House')
    
    # Remove column leasehold_years_left
    df = df.drop('leasehold_years_left', axis = 1)
    
    # Dropping rows with target variable missing
    df.dropna(subset=['sold_under_90days'], inplace=True)
    df['sold_under_90days'] = df['sold_under_90days'].astype(int)
    
    return df

In [None]:
df = clean(df)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1379 entries, 0 to 1467
Data columns (total 15 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   saleprice            1379 non-null   float64
 1   property_type        1379 non-null   object 
 2   bedrooms             1379 non-null   int64  
 3   bathrooms            1379 non-null   int64  
 4   distance_to_station  1379 non-null   float64
 5   tenure               1188 non-null   object 
 6   crime_rate           1021 non-null   float64
 7   total_area           1379 non-null   float64
 8   year_built           1379 non-null   int64  
 9   property_condition   1379 non-null   float64
 10  amenities_rating     1379 non-null   float64
 11  garden               1379 non-null   bool   
 12  balcony              1379 non-null   bool   
 13  fuel_type            1379 non-null   object 
 14  sold_under_90days    1379 non-null   int64  
dtypes: bool(2), float64(6), int64(4), obje

In [None]:
# extract the target for classification and regression accordingly
X_class = df.drop(columns = ['sold_under_90days'])
y_class = df['sold_under_90days']

X_reg = df.drop(columns = ['saleprice'])
y_reg = df['saleprice']

In [None]:
# Train Test Split with on target variable for Classification
X_train, X_test, y_train, y_test = train_test_split(X_class, y_class,test_size=0.2,random_state=456)

In [None]:
# Resetting index to make sure our ReSampler function works
X_train.reset_index(drop=True, inplace = True)
y_train.reset_index(drop=True, inplace = True)

# Imputataion
X_train_imputed = impute_missing_values(X_train)
X_test_imputed = impute_missing_values(X_test)

## One-hot encoding on categorical columsn before training the model 
X_train_encoded = pd.get_dummies(X_train_imputed, columns=['property_type','tenure','fuel_type'])
X_test_encoded = pd.get_dummies(X_test_imputed, columns=['property_type', 'tenure','fuel_type'])

## Hyperparmeter Tuning <a name="ht"></a>


A **“parameter”** is a configuration variable that is internal to the model and whose value can be estimated from the data.
A **“hyperparameter”** is a configuration that is external to the model and whose value cannot be estimate from the data.

Hyperparameters are model parameters whose values are set before training. For example, the number of trees in a random forest or the penalty intensity of a Lasso regression. They are all numbers that are set before the training phase and their values affect the behavior of the model. A model with different hyperparameters is, actually, a different model so it may have a lower performance.

If the model has several hyperparameters, we need to find the best combination of values of the hyperparameters searching in a multi-dimensional space. That’s why hyperparameter tuning, which is the process of finding the right values of the hyperparameters, is a very complex and time-expensive task.

There are more than one way for hyperparameter tuning, but we will focus on **Grid Search** for demonstration. 

### Grid Search <a name="grid"></a>

Grid search is the simplest algorithm for hyperparameter tuning. We divide the domain of the hyperparameters into a discrete grid. We then try every combination of values of this grid, calculating some performance metrics using cross-validation. The point of the grid that maximizes the average value in cross-validation, is the optimal combination of values for the hyperparameters.

<img src="gridsearch.png" width="550" align="center">



**Cross Validation**

Cross validation is the best practice for tuning a model. It allows you to split your training data into n versions of training + validation data sets. You then build n models, averaging the performance metric of choice. In a one off tuning without a validation set, you may overfit the data or have selection bias present. Since you **do not (ever!) use your test data to select your best performing model**, you may accidentially choose an inferior model as the best one. Using the average performance of n different models trained on n different datasets will protect you from these sources of error and allow you to select the best tuning parameters.

<img src="CV.png" width="550" align="center">


Grid search is an exhaustive algorithm in order to find the best point in the domain. The biggest drawback is that it’s very slow. Checking every combination of the space requires a lot of time that and sometimes we might not have that time for it... Don’t forget that every point in the grid needs k-fold cross-validation, which requires k training steps. Grid search can be expensive and slow but if we want the best combination of values of the hyperparmeters, it can help us to achieve that. 

We will use [gridsearchcv](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) from sklearn. It creates a grid of all of your parameter values and tests all possible combinations - using cross validation to select the best ones from the training data.

For classification model demo, we will be using random forest and gradient boost, tuning on the purity metric used (criterion), depth of the tree (max_depth), and the minimum number of resulting observations required to accept a spliit (min_samples_leaf). The idea is to look at how varying the size (depth) and complexity (number of nodes) impacts our performance. We will also tune the learning rate which controls the shrinkage of each tree for gradient boost. 

For regression model demo, we will be using lasso and ridge, tuning on alpha. You can also use the [repeated kfold function](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RepeatedKFold.html) as your strategy - where we will set the number of folds in our cv, a random state, and the number of times to repeat the cv. Additionally, we will set the scoring function to be rmse as it is popular to use and reports values on the same scale as your outcome. 

*Note: n_jobs parameter in gridsearchcv is the number of jobs to run in parallel. -1 will use all the processors. I set it to 3 so that I can still run other scripts/programs on my computer if the model takes a long time to run. These won't, but a neural network can take 8+ hours*


### Random Search <a name="random"></a>

Random search is similar to grid search, but instead of using all the points in the grid, it tests only a randomly selected subset of these points. The smaller this subset, the faster but less accurate the optimization. The larger this dataset, the more accurate the optimization but the closer to a grid search.Random search is a very useful option when you have several hyperparameters with a fine-grained grid of values. Using a subset made by 5-100 randomly selected points, we are able to get a reasonably good set of values of the hyperparameters. It will not likely be the best point, but it can still be a good set of values that gives us a good model. 

[RandomizedSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.htmlhttps://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html) function is here for you to explore! 

## Hyperparameter Tuning on Classification models <a name="classification"></a>

In [None]:
# We are tuning a random forest, so let's make a dictionary for all that models parameters
rfDict = {'n_estimators':[100,250,500],
              'max_depth': [5,10,25],
              'min_samples_split': [2,4,6],
              'min_samples_leaf': [1,2,4]} 
#These aren't all the params of a forest - but we are already asking for 81 jobs (that is, models). 
#That is not even taking into account cross validation (which will multiply this number further), so let's see what happens.

In [None]:
#Build an empty Forest:
rf = RandomForestClassifier(random_state = 423) #setting a seed here will mean the bootstrapping will always be performed the same - making the hyperparameters more comparable

In [None]:
#Initialise the Grid Search with the Forrest and the Dictionary
rf_class = GridSearchCV(rf, rfDict, n_jobs=1,
                        verbose = 3, # setting verbose = 3 gives us some information about all the fits, so we see what's happening as we wait and don't get bored...
                        cv = 3, # Cross Validation can be useful... but time intensive. Default here is five... so set this down to 3, which is a reasonable min.
                        scoring = 'accuracy') 

In [None]:
#Fit the training data to perform the grid search (TRAINING ONLY! DON'T USE ALL YOUR DATA - OTHERWISE YOU'LL HAVE SPILLAGE!)
rf_class.fit(X_train_encoded, y_train) 

Fitting 3 folds for each of 81 candidates, totalling 243 fits
[CV 1/3] END max_depth=5, min_samples_leaf=1, min_samples_split=2, n_estimators=100;, score=0.774 total time=   0.4s
[CV 2/3] END max_depth=5, min_samples_leaf=1, min_samples_split=2, n_estimators=100;, score=0.799 total time=   0.3s
[CV 3/3] END max_depth=5, min_samples_leaf=1, min_samples_split=2, n_estimators=100;, score=0.812 total time=   0.3s
[CV 1/3] END max_depth=5, min_samples_leaf=1, min_samples_split=2, n_estimators=250;, score=0.783 total time=   0.8s
[CV 2/3] END max_depth=5, min_samples_leaf=1, min_samples_split=2, n_estimators=250;, score=0.796 total time=   0.7s
[CV 3/3] END max_depth=5, min_samples_leaf=1, min_samples_split=2, n_estimators=250;, score=0.807 total time=   0.7s
[CV 1/3] END max_depth=5, min_samples_leaf=1, min_samples_split=2, n_estimators=500;, score=0.777 total time=   1.4s
[CV 2/3] END max_depth=5, min_samples_leaf=1, min_samples_split=2, n_estimators=500;, score=0.793 total time=   1.6s
[C

In [None]:
# Let's get the grid search results
GridResults=pd.DataFrame(rf_class.cv_results_) # We make them a pandas dataframe, just it because it looks nice
GridResults=GridResults.sort_values("rank_test_score") #And we sort them by rank
GridResults

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_max_depth,param_min_samples_leaf,param_min_samples_split,param_n_estimators,params,split0_test_score,split1_test_score,split2_test_score,mean_test_score,std_test_score,rank_test_score
32,1.919612,0.468777,0.093650,0.018840,10,1,4,500,"{'max_depth': 10, 'min_samples_leaf': 1, 'min_...",0.769022,0.801630,0.822888,0.797847,0.022153,1
52,0.715979,0.038818,0.043580,0.001558,10,4,6,250,"{'max_depth': 10, 'min_samples_leaf': 4, 'min_...",0.771739,0.801630,0.817439,0.796936,0.018950,2
49,0.748214,0.033201,0.044152,0.003420,10,4,4,250,"{'max_depth': 10, 'min_samples_leaf': 4, 'min_...",0.771739,0.801630,0.817439,0.796936,0.018950,2
46,0.684530,0.006472,0.042665,0.000623,10,4,2,250,"{'max_depth': 10, 'min_samples_leaf': 4, 'min_...",0.771739,0.801630,0.817439,0.796936,0.018950,2
38,1.482456,0.032553,0.084759,0.005550,10,2,2,500,"{'max_depth': 10, 'min_samples_leaf': 2, 'min_...",0.777174,0.793478,0.817439,0.796030,0.016537,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
69,0.302522,0.001327,0.019457,0.000793,25,2,6,100,"{'max_depth': 25, 'min_samples_leaf': 2, 'min_...",0.766304,0.793478,0.801090,0.786958,0.014931,77
54,0.329156,0.009911,0.021663,0.002449,25,1,2,100,"{'max_depth': 25, 'min_samples_leaf': 1, 'min_...",0.769022,0.782609,0.798365,0.783332,0.011990,78
66,0.333898,0.025835,0.025191,0.006951,25,2,4,100,"{'max_depth': 25, 'min_samples_leaf': 2, 'min_...",0.760870,0.788043,0.798365,0.782426,0.015814,79
63,0.305774,0.003093,0.020491,0.001786,25,2,2,100,"{'max_depth': 25, 'min_samples_leaf': 2, 'min_...",0.760870,0.788043,0.798365,0.782426,0.015814,79


In [None]:
# Let's get the best estimator and use that as the model 
rf_model = rf_class.best_estimator_

y_pred = rf_model.predict(X_test_encoded)
y_probs = rf_model.predict_proba(X_test_encoded)[:, 1]  # Probability estimates of the positive class, need this to calculate auc

## Set up results df

results_class = pd.DataFrame(index = ['Random Forest', 'Gradient Boosting', 'SVM'], 
                       columns = ['accuracy', 'f1', 'precision', 'recall','auc','kappa']) #F1 is the harmonic mean of precision and recall.


results_class = get_classification_results(y_test, y_pred, results_class, 'Random Forest')


print('The best parameters are {}'.format(rf_class.best_params_))

The best parameters are {'max_depth': 10, 'min_samples_leaf': 1, 'min_samples_split': 4, 'n_estimators': 500}


In [None]:
results_class

Unnamed: 0,accuracy,f1,precision,recall,auc,kappa
Random Forest,0.789855,0.865741,0.802575,0.939698,0.813418,0.395879
Gradient Boosting,,,,,,
SVM,,,,,,


In [None]:
## Let's try with gradient boost
gbDict = {'n_estimators':[100,250,500],
          'max_depth': [5,10,25],
          'min_samples_split': [2,4,6],
          'min_samples_leaf': [1,2,4],
          'learning_rate':[0.1,0.5,0.8]} 

gb = GradientBoostingClassifier(random_state = 423)

gb_class = GridSearchCV(gb, gbDict, n_jobs=1,
                        verbose = 2, # setting verbose = 2 displays a progress bar, with some information in the grid searching process, so we don't get bored...
                        cv = 3, # Cross Validation can be useful... but time intense. Default is five... so set this down to 3, which is a reasonable min.
                        scoring = 'accuracy') 

gb_class.fit(X_train_encoded, y_train) 

gb_model = gb_class.best_estimator_

y_pred = gb_model.predict(X_test_encoded)
y_probs = gb_model.predict_proba(X_test_encoded)[:, 1]  # Probability estimates of the positive class, need this to calculate auc


results_class = get_classification_results(y_test, y_pred, results_class, 'Gradient Boosting')


print('The best parameters are {}'.format(gb_class.best_params_))

Fitting 3 folds for each of 243 candidates, totalling 729 fits
[CV] END learning_rate=0.1, max_depth=5, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   0.6s
[CV] END learning_rate=0.1, max_depth=5, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   0.5s
[CV] END learning_rate=0.1, max_depth=5, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   0.6s
[CV] END learning_rate=0.1, max_depth=5, min_samples_leaf=1, min_samples_split=2, n_estimators=250; total time=   1.2s
[CV] END learning_rate=0.1, max_depth=5, min_samples_leaf=1, min_samples_split=2, n_estimators=250; total time=   1.1s
[CV] END learning_rate=0.1, max_depth=5, min_samples_leaf=1, min_samples_split=2, n_estimators=250; total time=   1.2s
[CV] END learning_rate=0.1, max_depth=5, min_samples_leaf=1, min_samples_split=2, n_estimators=500; total time=   2.3s
[CV] END learning_rate=0.1, max_depth=5, min_samples_leaf=1, min_samples_split=2, n_estimators=500; tota

In [None]:
results_class

Unnamed: 0,accuracy,f1,precision,recall,auc,kappa
Random Forest,0.789855,0.865741,0.802575,0.939698,0.813418,0.395879
Gradient Boosting,0.76087,0.842857,0.800905,0.889447,0.789402,0.348544
SVM,,,,,,


## Hyperparameter Tuning on Regression models <a name="regression"></a>

In [None]:
# Train Test Split with on target variable for Regression 
X_train, X_test, y_train, y_test = train_test_split(X_reg, y_reg,test_size=0.2,random_state=456)

In [None]:
#Defining Variable types
cat_vars = ['property_type', 'tenure', 'fuel_type','sold_under_90days']
num_vars = list(df.columns)
num_vars = [x for x in num_vars if x not in cat_vars]

# Resetting index to make sure our ReSampler function works
X_train.reset_index(drop=True, inplace = True)
y_train.reset_index(drop=True, inplace = True)

# Imputataion
X_train_imputed = impute_missing_values(X_train)
X_test_imputed = impute_missing_values(X_test)

## One-hot encoding on categorical columsn before training the model 
X_train_encoded = pd.get_dummies(X_train_imputed, columns=['property_type','tenure','fuel_type'])
X_test_encoded = pd.get_dummies(X_test_imputed, columns=['property_type', 'tenure','fuel_type'])

# Standardisation
scaler = StandardScaler()
scaler.fit(X_train_encoded[num_vars])
X_train_encoded.loc[:, num_vars] = scaler.transform(X_train_encoded.loc[:, num_vars])
X_test_encoded.loc[:,num_vars] = scaler.transform(X_test_encoded.loc[:, num_vars])

KeyError: "['saleprice', 'garden', 'balcony'] not in index"

In [None]:
X_train_encoded

Unnamed: 0,bedrooms,bathrooms,distance_to_station,crime_rate,total_area,year_built,property_condition,amenities_rating,property_type_Bungalow,property_type_Flat,property_type_House,tenure_Freehold,tenure_Leasehold,tenure_Share of freehold,fuel_type_electric,fuel_type_gas
0,0.904363,-0.668307,0.826828,-0.105854,1.913288,0.798816,1.727897,1.922179,0,0,1,0,1,0,0,1
1,-1.186857,-0.668307,0.826828,-0.404973,-0.926341,0.979257,-1.284147,-0.084918,0,1,0,0,1,0,0,1
2,0.904363,1.086795,0.144455,-0.165678,0.039057,0.546198,-0.531136,-0.753950,0,1,0,0,1,0,0,1
3,-1.186857,-0.668307,-0.537918,-0.225501,-1.427257,-1.005595,-0.531136,-1.422983,0,0,1,1,0,0,0,1
4,-0.141247,-0.668307,-0.879104,-0.225501,-0.317118,0.870992,0.974886,0.584114,0,1,0,0,1,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1098,-1.186857,-0.668307,0.485642,0.000000,-0.729379,0.798816,-0.531136,0.584114,0,1,0,0,1,0,1,0
1099,1.949972,-0.668307,0.144455,-0.424914,0.108183,-0.536448,-0.531136,-1.422983,0,0,1,1,0,0,1,0
1100,-0.141247,-0.668307,-0.537918,0.000000,-1.705894,1.087521,0.974886,-1.422983,0,1,0,0,1,0,1,0
1101,-1.186857,-0.668307,-0.196731,-0.783858,-0.343127,0.798816,-0.531136,0.584114,0,1,0,0,1,0,0,1


In [None]:
## Setting global parameters

parameters = {'alpha':np.arange(0.1, 5, .1)} # start, stop, step
# for lasso and ridge

In [None]:
## Fitting the LASSO model and making predictions

model = Lasso(random_state = 4, max_iter=20000) #alpha will be adjusted in the gridsearch
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)

clf = GridSearchCV(model, parameters, cv = cv, n_jobs=1, scoring = 'neg_root_mean_squared_error',verbose=3)
# remember parameters is set above as a global parameter

clf.fit(X=X_train_encoded, y=y_train)
lasso_model = clf.best_estimator_

print('The best alpha value is {}'.format(clf.best_params_['alpha']))

y_pred = lasso_model.predict(X_test_encoded)

results_reg = pd.DataFrame(index = ['Lasso', 'Ridge'], columns = ['RMSE', 'MAE', 'R2'])

results_reg = get_reg_results(y_test, y_pred, results_reg, 'Lasso')
results_reg

Fitting 30 folds for each of 49 candidates, totalling 1470 fits
[CV 1/30] END ..................alpha=0.1;, score=-142024.888 total time=   4.9s
[CV 2/30] END ...................alpha=0.1;, score=-75466.469 total time=   0.0s
[CV 3/30] END ...................alpha=0.1;, score=-79089.876 total time=   0.0s
[CV 4/30] END ...................alpha=0.1;, score=-88235.848 total time=   0.0s
[CV 5/30] END ...................alpha=0.1;, score=-66767.053 total time=   0.0s
[CV 6/30] END ...................alpha=0.1;, score=-84272.782 total time=   0.0s
[CV 7/30] END ...................alpha=0.1;, score=-80689.247 total time=   0.0s
[CV 8/30] END ...................alpha=0.1;, score=-71971.829 total time=   0.0s
[CV 9/30] END ...................alpha=0.1;, score=-89814.535 total time=   0.0s
[CV 10/30] END ..................alpha=0.1;, score=-80913.887 total time=   0.0s
[CV 11/30] END ..................alpha=0.1;, score=-72125.459 total time=   0.0s
[CV 12/30] END ..................alpha=0.1;, 

Unnamed: 0,RMSE,MAE,R2
Lasso,72604.193,60654.32,0.942
Ridge,,,


In [None]:
## Fitting the Ridge model and making predictions

model = Ridge(random_state = 4) #alpha will be adjusted in the gridsearch
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)

clf = GridSearchCV(model, parameters, cv = cv, n_jobs=1, scoring = 'neg_root_mean_squared_error')
# remember parameters is set above as a global parameter

clf.fit(X=X_train_encoded, y=y_train)
ridge_model = clf.best_estimator_

print('The best alpha value is {}'.format(clf.best_params_['alpha']))

y_pred = ridge_model.predict(X_test_encoded)

results_reg = get_reg_results(y_test, y_pred, results_reg, 'Ridge')
results_reg

The best alpha value is 0.1


Unnamed: 0,RMSE,MAE,R2
Lasso,72604.193,60654.32,0.942
Ridge,72644.504,60699.709,0.942


## Suppor Vector Machines/ Classification <a name="svm"></a>


Support-vector machines are supervised learning models that analyse data used for classification and regression analysis.
The objective of the support vector machine algorithm is to find a hyperplane in an N-dimensional space(N — the number of features) that distinctly classifies the data points. To separate the two classes of data points, there are many possible hyperplanes that could be chosen. Our objective is to find a plane that has the maximum margin, i.e the maximum distance between data points of both classes. Maximizing the margin distance provides some reinforcement so that future data points can be classified with more confidence. The best hyperplane is that plane that has the maximum distance from both the classes, and this is the main aim of SVM. 

<img src="svm.png" width="550" align="center">

Hyperplanes are decision boundaries that help classify the data points. Data points falling on either side of the hyperplane can be attributed to different classes. Also, the dimension of the hyperplane depends upon the number of features. If the number of input features is 2, then the hyperplane is just a line. If the number of input features is 3, then the hyperplane becomes a two-dimensional plane, etc.

**Kernel**

The concept of kernels in SVM allows us to separate non-linearly separable data points. In simple terms, a kernel is like a mathematical function that takes the original data and transforms it into a higher-dimensional space where it becomes easier to separate the data points with a straight line or hyperplane.

Imagine that the data points cannot be perfectly separated by a straight line in 2D, but if you transform them to a higher-dimensional space using a kernel, they can be separated by a hyperplane. Kernels essentially help us to find more complex decision boundaries in the original space by projecting the data points into a higher-dimensional space where they become linearly separable.

Some commonly used kernels in SVM include the linear kernel (for linearly separable data), polynomial kernel, and radial basis function (RBF) kernel. Each kernel has its own mathematical formula to transform the data points into a higher-dimensional space.

<img src="kernel.png" width="550" align="center">

In general, SVM classifers work better when the data is in the same magnitude, hence we would want to do feature scaling. The main advantage of standardizing is to avoid attributes in greater numeric ranges dominating those in smaller numeric ranges. Another advantage is to avoid numerical difficulties during the calculation. Because kernel values usually depend on the inner products of 
feature vectors, e.g. the linear kernel and the polynomial kernel, large attribute values might cause numerical problems. 


*Note: all graphs taken from [this guide on SVM](https://www.analyticsvidhya.com/blog/2021/10/support-vector-machinessvm-a-complete-guide-for-beginners/).

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1379 entries, 0 to 1467
Data columns (total 15 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   saleprice            1379 non-null   float64
 1   property_type        1379 non-null   object 
 2   bedrooms             1379 non-null   int64  
 3   bathrooms            1379 non-null   int64  
 4   distance_to_station  1379 non-null   float64
 5   tenure               1188 non-null   object 
 6   crime_rate           1021 non-null   float64
 7   total_area           1379 non-null   float64
 8   year_built           1379 non-null   int64  
 9   property_condition   1379 non-null   float64
 10  amenities_rating     1379 non-null   float64
 11  garden               1379 non-null   bool   
 12  balcony              1379 non-null   bool   
 13  fuel_type            1379 non-null   object 
 14  sold_under_90days    1379 non-null   int32  
dtypes: bool(2), float64(6), int32(1), int6

In [None]:

# Train Test Split with on target variable for SVM Classification
X_train, X_test, y_train, y_test = train_test_split(X_class, y_class,test_size=0.2,random_state=456)

# Resetting index to make sure our ReSampler function works
X_train.reset_index(drop=True, inplace = True)
y_train.reset_index(drop=True, inplace = True)

# Standardisation
scaler = StandardScaler()
scaler.fit(X_train.loc[:, num_vars])
X_train.loc[:, num_vars] = scaler.transform(X_train.loc[:, num_vars])
X_test.loc[:,num_vars] = scaler.transform(X_test.loc[:, num_vars])

# Imputataion
X_train_imputed = impute_missing_values(X_train)
X_test_imputed = impute_missing_values(X_test)

## One-hot encoding on categorical columsn before training the model 
X_train_encoded = pd.get_dummies(X_train_imputed, columns=['property_type','tenure','fuel_type'])
X_test_encoded = pd.get_dummies(X_test_imputed, columns=['property_type', 'tenure','fuel_type'])

# Oversample with Lukas's function using SMOTE 
X_train_resampled, y_train_resampled = ReSampler(X_train_encoded, y_train,"sold_under_90days",0.5,"SMOTE",123)

  StandardMinoritydf=StandardMinoritydf.append(ResampledDf)
  StandardMinoritydf=StandardMinoritydf.append(ResampledDf)
  StandardMinoritydf=StandardMinoritydf.append(ResampledDf)
  StandardMinoritydf=StandardMinoritydf.append(ResampledDf)
  StandardMinoritydf=StandardMinoritydf.append(ResampledDf)
  StandardMinoritydf=StandardMinoritydf.append(ResampledDf)
  StandardMinoritydf=StandardMinoritydf.append(ResampledDf)
  StandardMinoritydf=StandardMinoritydf.append(ResampledDf)
  StandardMinoritydf=StandardMinoritydf.append(ResampledDf)
  StandardMinoritydf=StandardMinoritydf.append(ResampledDf)
  StandardMinoritydf=StandardMinoritydf.append(ResampledDf)
  StandardMinoritydf=StandardMinoritydf.append(ResampledDf)
  StandardMinoritydf=StandardMinoritydf.append(ResampledDf)
  StandardMinoritydf=StandardMinoritydf.append(ResampledDf)
  StandardMinoritydf=StandardMinoritydf.append(ResampledDf)
  StandardMinoritydf=StandardMinoritydf.append(ResampledDf)
  StandardMinoritydf=StandardMinoritydf.

In [None]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn import svm
# Parameter grid 
svm_param = {'C': [0.1, 1, 10, 100], 
              'gamma': [1, 0.1, 0.01, 0.001], 
              'kernel': ['linear', 'rbf','poly']}

# Create an SVM classifier
svm = svm.SVC(random_state = 423, probability = True)

#Fit both Grid and Random Search - for comparison
svm_class_grid = GridSearchCV(svm, svm_param, scoring='accuracy', cv=3, verbose = 3, n_jobs = 1)
svm_class_random = RandomizedSearchCV(svm, svm_param, scoring='accuracy', cv=3, verbose = 3, n_iter= 10, n_jobs = 1, random_state = 123)

# Fit the classifiers to the training data
svm_class_grid.fit(X_train_resampled, y_train_resampled)
svm_class_random.fit(X_train_resampled, y_train_resampled)

Fitting 3 folds for each of 48 candidates, totalling 144 fits
[CV 1/3] END .....C=0.1, gamma=1, kernel=linear;, score=0.673 total time=   0.1s
[CV 2/3] END .....C=0.1, gamma=1, kernel=linear;, score=0.659 total time=   0.0s
[CV 3/3] END .....C=0.1, gamma=1, kernel=linear;, score=0.680 total time=   0.0s
[CV 1/3] END ........C=0.1, gamma=1, kernel=rbf;, score=0.559 total time=   0.3s
[CV 2/3] END ........C=0.1, gamma=1, kernel=rbf;, score=0.913 total time=   0.2s
[CV 3/3] END ........C=0.1, gamma=1, kernel=rbf;, score=0.680 total time=   0.2s
[CV 1/3] END .......C=0.1, gamma=1, kernel=poly;, score=0.825 total time=   0.0s
[CV 2/3] END .......C=0.1, gamma=1, kernel=poly;, score=0.897 total time=   0.0s
[CV 3/3] END .......C=0.1, gamma=1, kernel=poly;, score=0.928 total time=   0.1s
[CV 1/3] END ...C=0.1, gamma=0.1, kernel=linear;, score=0.673 total time=   0.1s
[CV 2/3] END ...C=0.1, gamma=0.1, kernel=linear;, score=0.659 total time=   0.0s
[CV 3/3] END ...C=0.1, gamma=0.1, kernel=linear

In [None]:
#Get the Grid Search Results
GridResultsSV=pd.DataFrame(svm_class_grid.cv_results_) # We make them a pandas dataframe, just it because it looks nice
GridResultsSV=GridResultsSV.sort_values("rank_test_score") #And we sort them by rank
GridResultsSV.head(1)

#Get the Random Search Results
RandomResultsSV=pd.DataFrame(svm_class_random.cv_results_) # We make them a pandas dataframe, just it because it looks nice
RandomResultsSV=RandomResultsSV.sort_values("rank_test_score") #And we sort them by rank
RandomResultsSV.head(1)

#Print what we have found - Check out how much quicker Random is than Grid
print(f'Grid Search took {GridResultsSV.mean_fit_time.sum()*3} seconds to fit. The best fit is {GridResultsSV.mean_test_score[0]}')
print(f'Random Search took {RandomResultsSV.mean_fit_time.sum()*3} seconds to fit. The best fit is {RandomResultsSV.mean_test_score[0]}')

Grid Search took 405.6396188735962 seconds to fit. The best fit ist 0.6705507709529485
Random Search took 124.1187515258789 seconds to fit. The best fit ist 0.6723923915790994


In [None]:
#Get the best estimators
svm_model = svm_class_grid.best_estimator_

y_pred = svm_model.predict(X_test_encoded)
y_probs = svm_model.predict_proba(X_test_encoded)[:, 1]  # Probability estimates of the positive class, need this to calculate auc

results_class = get_classification_results(y_test, y_pred, results_class, 'SVM')

In [None]:
results_class

Unnamed: 0,accuracy,f1,precision,recall,auc,kappa
Random Forest,0.789855,0.865741,0.802575,0.939698,0.813418,0.395879
Gradient Boosting,0.76087,0.842857,0.800905,0.889447,0.789402,0.348544
SVM,0.884058,0.919192,0.923858,0.914573,0.923579,0.714082


## Activity - your turn! <a name="activity"></a>

Today, you have two choices! One is to explore hyperparameter tuning either going through the Telecom Churn dataset or you can start data preparation work on your own project! 
