<a href="https://colab.research.google.com/github/Prajaktahz/ML-Practice-Uni/blob/main/Prac_Evaluation_%5Bbase%5D.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![alt text](http://www.cs.nott.ac.uk/~pszgss/teaching/nlab.png)
# ML Practical 3: Evaluation of multiple models

## The task.

Task: Predict whether a person makes over $50k per year from census data known about them.

Data set from the paper: Kohavi, Ron. "Scaling up the accuracy of Naive-Bayes classifiers: a decision-tree hybrid." KDD. Vol. 96. 1996.
Data URL: We will be using modified versions of the publicly available data. Please download the data from the URLs provided.

**Output Feature:**

| Feature | Type | Values |
|:-------:|:----:|:------:|
| salary | categorical | >50K, <=50K |

**Input features**

| Feature | Type | Values |
|:-------:|:----:|:------:|
| age | continuous | |
| workclass | categorical | Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked |
| fnlwgt | continuous | |
| education | categorical | Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool. |
| education-num | continuous | |
| marital-status | categorical | Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse. |
| occupation | categorical | Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces. |
| relationship | categorical | Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried. |
| race | categorical | White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black. |
| sex | categorical | Female, Male. |
| capital-gain | continuous | |
| capital-loss | continuous | |
| hours-per-week | continuous | |
| native-country | categorical | United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands. |


# To help you along, some of the basic data preparation has been done for you.
Read the code. Understand what has been done.

In [14]:
# Some basic imports
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score, GridSearchCV

# Read the data into a pandas DataFrame
data = pd.read_csv('http://www.cs.nott.ac.uk/~pszgss/teaching/ML/Prac3/prac3_train_42.csv', header = 0, names = ['age','workclass','fnlwgt','education','education-num','matrial-status','occupation','relationship','race','sex','captial-gain','captial-loss','hours-per-week','salary'])

In [15]:
data.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,matrial-status,occupation,relationship,race,sex,captial-gain,captial-loss,hours-per-week,salary
0,25,Private,292058,HS-grad,9,Never-married,Other-service,Other-relative,White,Male,0,0,30,<=50K
1,28,Private,285294,Bachelors,13,Married-civ-spouse,Sales,Wife,Black,Female,15024,0,45,>50K
2,31,Private,113364,HS-grad,9,Married-civ-spouse,Craft-repair,Husband,White,Male,0,0,40,>50K
3,33,Federal-gov,29617,Some-college,10,Divorced,Other-service,Not-in-family,Black,Male,0,0,40,<=50K
4,34,Private,157289,Some-college,10,Married-civ-spouse,Craft-repair,Husband,White,Male,0,0,40,<=50K


In [16]:
# Define our input features and our output feature
# Call our input features X and our output feature y (the sklearn standard)
# Note that we have categorical features.
X = data.drop( columns = 'salary' )
y = data.salary

In [17]:
y.head()

0     <=50K
1      >50K
2      >50K
3     <=50K
4     <=50K
Name: salary, dtype: object

In [18]:
# Now we need to encode our output feature to be an integer 0 or 1.
# This is because we have a binary classification problem and in order to use sklearn's
# built-in evaluation measures we need to have one class defined as 1 (target) and one as 0 (non-target).

# We could do this by using the LabelEncoder from sklearn. The LabelEncoder will convert n-distinct values
# to 0,..,n-1 values in this case giving us what we want. We assume that our training set contains both
# labels and that this mapping will be valid. However, we have no control
# over which value is represented by 1 and which is represented by 0.
# Therefore it is easier (in terms of subsequent interpretation) to do this
# manually. Recall the problem: we want our target class (1) to be '>50K'.

# To do this (your variable y is a pandas.Series object, use the replace method):
# 1) update all values '<=50K' within y to equal 0
# 2) update all values '>50K' within y to equal 1
# Note: the strings matched below include a leading space.

y.replace(to_replace = ' <=50K', value = 0, inplace = True)
y.replace(to_replace = ' >50K', value = 1, inplace = True)
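
A minimal sketch of an alternative encoding, assuming the raw labels may carry surrounding whitespace (the `replace` calls above match `' <=50K'` and `' >50K'`). The toy Series `raw` and the name `label_map` are illustrative and not part of the practical's data.

```python
import pandas as pd

# Toy labels standing in for data.salary (illustrative values only)
raw = pd.Series([' <=50K', ' >50K', ' >50K', ' <=50K'], name='salary')

# Strip surrounding whitespace, then map each class to an integer.
# '>50K' is the target class, so it becomes 1.
label_map = {'<=50K': 0, '>50K': 1}
encoded = raw.str.strip().map(label_map)

# Any label missing from label_map would turn into NaN, so check for that
assert encoded.notna().all()
print(encoded.tolist())  # [0, 1, 1, 0]
```

Unlike `replace`, `map` turns an unexpected label into `NaN` instead of silently leaving the string in place, which makes a mismatched label easier to spot.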

In [19]:
# The baseline classifier for you to use
# (scikit-learn 1.2+ renames OneHotEncoder's `sparse` argument to `sparse_output`)
lr_model = Pipeline([
    ('onehot',OneHotEncoder(handle_unknown='ignore',sparse = False)),  # encodes every column, numeric ones included
    ('standardize', StandardScaler()), # scales every column produced by the one-hot step
    ('model',LogisticRegression(solver = 'liblinear') )
    ])
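
Because `OneHotEncoder` inside a plain `Pipeline` receives every column, numeric features such as `age` are also treated as categories. The sketch below shows one possible way to restrict the encoding to string columns with `ColumnTransformer`. It is not the baseline supplied with the practical, and the toy frame `toy_X`, the labels `toy_y` and the name `sketch_model` are assumptions made for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny illustrative frame with one numeric and one string column
toy_X = pd.DataFrame({
    'age': [25, 28, 31, 33, 34, 40],
    'workclass': ['Private', 'Private', 'Private',
                  'Federal-gov', 'Private', 'Local-gov'],
})
toy_y = [0, 1, 1, 0, 0, 1]

preprocess = ColumnTransformer([
    # One-hot encode only the listed categorical column
    ('onehot', OneHotEncoder(handle_unknown='ignore'), ['workclass']),
    # Standardize only the listed numeric column
    ('scale', StandardScaler(), ['age']),
])

sketch_model = Pipeline([
    ('preprocess', preprocess),
    ('model', LogisticRegression(solver='liblinear')),
])
sketch_model.fit(toy_X, toy_y)

# 3 one-hot columns (one per workclass value) + 1 scaled numeric column
n_features = sketch_model.named_steps['preprocess'].transform(toy_X).shape[1]
print(n_features)  # 4
```

On the census data the column lists would need to name the actual categorical and continuous features.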



# Your turn. See the instructions in the slide deck...

In [20]:
X.shape

(3500, 13)

In [21]:
pipeline = Pipeline(steps=[('onehot', OneHotEncoder(handle_unknown='ignore', sparse=False)),
                           ('model', RandomForestClassifier(min_samples_split=10))])

param_grid = {'model__max_depth': [50, 75, 100, 125, 150],
               'model__max_features': ['sqrt', 'log2']}

grid_search = GridSearchCV(estimator=pipeline, param_grid=param_grid)

In [22]:
grid_search.fit(X, y)



In [24]:
print(grid_search.best_params_)

{'model__max_depth': 75, 'model__max_features': 'sqrt'}
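
The search above relies on the default cross-validation and an unseeded forest, so the reported best parameters may change between runs. The following sketch, on synthetic data rather than the census data, shows how the folds, the scoring metric and the seeds could be made explicit. The names `X_demo`, `y_demo` and `search`, and the small grid, are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in data (not the census data)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Fixed seeds make the search repeatable
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=20, random_state=0),
    param_grid={'max_depth': [3, 10], 'max_features': ['sqrt', 'log2']},
    cv=cv,
    scoring='accuracy',
)
search.fit(X_demo, y_demo)

print(search.best_params_)
# Mean cross-validated accuracy of the best parameter setting
print(round(search.best_score_, 3))
```

Reporting `best_score_` next to `best_params_` shows how much the winning setting actually gained over the alternatives in `cv_results_`.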


In [29]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify = y)
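
A small illustration of what `stratify` does in the split above, using toy labels instead of the census data. The arrays `X_toy` and `y_toy` are made up for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy labels: 80 zeros and 20 ones (illustrative only)
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

# Both parts keep the 20% positive rate of the full set
print(y_tr.mean(), y_te.mean())
```

Without `stratify`, the positive rate in each part is left to chance, which matters more as the minority class gets smaller.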

In [27]:
pipeline_svc = Pipeline(steps=[('onehot', OneHotEncoder(handle_unknown='ignore', sparse=False)),
                           ('model', SVC())])
param_grid_svc = {'model__C': [0.1, 1, 10, 100],
               'model__gamma': [0.1, 1, 10, 100],
               'model__kernel': ['rbf', 'linear']}

grid_search_svc = GridSearchCV(estimator=pipeline_svc, param_grid=param_grid_svc)

In [31]:
grid_search_svc.fit(X_train,Y_train)



In [32]:
print(grid_search_svc.best_params_)

{'model__C': 1, 'model__gamma': 0.1, 'model__kernel': 'linear'}
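
`SVC` uses `gamma` only for the `rbf`, `poly` and `sigmoid` kernels, so in the grid above every `linear` candidate is refitted once per `gamma` value with no effect on the result. One way to avoid this, sketched here as a suggestion rather than as part of the practical, is to pass a list of sub-grids. The name `param_grid_sketch` is hypothetical.

```python
from sklearn.model_selection import ParameterGrid

# One sub-grid per kernel, so gamma is only varied where it has an effect
param_grid_sketch = [
    {'model__kernel': ['rbf'],
     'model__C': [0.1, 1, 10, 100],
     'model__gamma': [0.1, 1, 10, 100]},
    {'model__kernel': ['linear'],
     'model__C': [0.1, 1, 10, 100]},
]

# Count the candidates in the single grid used above and in the split grid
n_single = len(ParameterGrid({
    'model__C': [0.1, 1, 10, 100],
    'model__gamma': [0.1, 1, 10, 100],
    'model__kernel': ['rbf', 'linear'],
}))
n_split = len(ParameterGrid(param_grid_sketch))
print(n_single, n_split)  # 32 20
```

`GridSearchCV` accepts the list form directly through its `param_grid` argument.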


## **Task 1**

In [36]:
data = pd.read_csv('http://www.cs.nott.ac.uk/~pszgss/teaching/ML/Prac3/prac3_train_42.csv', header = 0, names = ['age','workclass','fnlwgt','education','education-num','matrial-status','occupation','relationship','race','sex','captial-gain','captial-loss','hours-per-week','salary'])

In [37]:
X = data.drop( columns = 'salary' )
y = data.salary

In [38]:
y

0        <=50K
1         >50K
2         >50K
3        <=50K
4        <=50K
         ...  
3495     <=50K
3496     <=50K
3497     <=50K
3498     <=50K
3499     <=50K
Name: salary, Length: 3500, dtype: object

In [39]:
y.replace(to_replace = ' <=50K', value = 0, inplace = True)
y.replace(to_replace = ' >50K', value = 1, inplace = True)

In [45]:
# Initial split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
results = []

In [50]:
rf_model = Pipeline([
    ('onehot',OneHotEncoder(handle_unknown='ignore',sparse = False)),  # encodes every column, numeric ones included
    ('model',RandomForestClassifier(min_samples_split=10, max_depth=10))
    ])

rf_model_50 = Pipeline([
    ('onehot',OneHotEncoder(handle_unknown='ignore',sparse = False)),  # same encoding as above
    ('model',RandomForestClassifier(min_samples_split=50, max_depth=10))
    ])

svc_model = Pipeline([
    ('onehot',OneHotEncoder(handle_unknown='ignore',sparse = False)),  # same encoding as above
    ('model',SVC(C=1.0, kernel='rbf'))
    ])

In [74]:
models = [lr_model, rf_model, rf_model_50, svc_model]

In [75]:
results = []
# Define the further data split once; every model is scored on the same validation set
X_subtrain, X_valid, y_subtrain, y_valid = train_test_split(X_train, y_train, random_state=42)
# For each model, train on the sub-training set and evaluate on the validation set
for model in models:
    model.fit(X_subtrain, y_subtrain)
    # Store the validation accuracy
    results.append(model.score(X_valid, y_valid))
# Next: pick the model with the highest validation accuracy



In [76]:
idx_of_best_model = results.index(max(results))

In [77]:
idx_of_best_model

3

In [79]:
best_model = models[idx_of_best_model]
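
The selection above rests on a single validation split. A possible alternative, sketched here on synthetic data, is to compare the candidates by their mean k-fold score with `cross_val_score` and `StratifiedKFold` (both imported at the top of the notebook). The dictionary `candidates` and the data `X_demo`, `y_demo` are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for X_train / y_train
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

candidates = {
    'logreg': LogisticRegression(solver='liblinear'),
    'forest': RandomForestClassifier(n_estimators=25, max_depth=10, random_state=0),
}

# The same folds are reused for every candidate so the scores are comparable
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
mean_scores = {
    name: cross_val_score(est, X_demo, y_demo, cv=cv).mean()
    for name, est in candidates.items()
}

best_name = max(mean_scores, key=mean_scores.get)
print(mean_scores, best_name)
```

Averaging over folds reduces the chance that the ranking depends on which rows happened to land in the validation set.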


In [80]:
# Train and test with the final split
best_model.fit(X_train, y_train)
score = best_model.score(X_test, y_test)
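
`score` here is plain accuracy. Since the labels were encoded with 1 as the target class, the class-specific measures in `sklearn.metrics` can also be applied. The sketch below uses made-up labels and predictions, not the notebook's results, to show the calls.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy true labels and predictions (illustrative, not the notebook's results)
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes, in label order [0, 1]
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
precision = precision_score(y_true, y_pred)  # pos_label defaults to 1
recall = recall_score(y_true, y_pred)

print(cm.tolist())  # [[4, 1], [1, 2]]
print(precision, recall)
```

With the fitted pipeline, the predictions would come from `best_model.predict(X_test)`.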



In [81]:
# Train the deployment version on all available data
best_model.fit(X, y)



In [82]:
results

[0.8270799347471451,
 0.7944535073409462,
 0.7960848287112561,
 0.8368678629690048]
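
The scores are easier to read when paired with the model they belong to. A short sketch, with the accuracies copied from the output above and labels that mirror the variable names in `models`:

```python
# Validation accuracies copied from the output above, in the order of `models`
results = [0.8270799347471451, 0.7944535073409462,
           0.7960848287112561, 0.8368678629690048]
# Labels mirror the variable names used in this notebook
names = ['lr_model', 'rf_model', 'rf_model_50', 'svc_model']

# Sort from highest to lowest validation accuracy
ranked = sorted(zip(names, results), key=lambda pair: pair[1], reverse=True)
for name, acc in ranked:
    print(f'{name:12s} {acc:.3f}')
```

The SVC comes first and logistic regression second, about one percentage point apart. On a validation set of roughly 600 rows that is only a handful of predictions, so the ranking of the top two should be read with some caution.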