# Model Performance Measures, ML Pipeline and Hyperparameter Tuning

## Can you correctly identify glass type?

## Context:
    
This is the Glass Identification Data Set from UCI. It contains 10 attributes, including the ID. The response is the glass type, a discrete variable with 7 possible values.

## Content:

Attribute Information:

Id number: 1 to 214

RI: refractive index

Na: Sodium (unit measurement: weight percent in corresponding oxide, as are attributes 4-10)

Mg: Magnesium

Al: Aluminum

Si: Silicon

K: Potassium

Ca: Calcium

Ba: Barium

Fe: Iron

Type of glass: (class attribute) 
    -- 1 building_windows_float_processed 
    -- 2 building_windows_non_float_processed 
    -- 3 vehicle_windows_float_processed 
    -- 4 vehicle_windows_non_float_processed (none in this database) 
    -- 5 containers 
    -- 6 tableware 
    -- 7 headlamps

## Source:
https://archive.ics.uci.edu/ml/datasets/Glass+Identification

# 1.  Import necessary libraries and load the data

In [2]:
import warnings
import pandas as pd 
df = pd.read_csv('glass.csv')
df.head()

| | ID | refractive index | Sodium | Magnesium | Aluminum | Silicon | Potassium | Calcium | Barium | Iron | Type |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1.52101 | 13.64 | 4.49 | 1.1 | 71.78 | 0.06 | 8.75 | 0.0 | 0.0 | building_windows_float_processed |
| 1 | 2 | 1.51761 | 13.89 | 3.6 | 1.36 | 72.73 | 0.48 | 7.83 | 0.0 | 0.0 | building_windows_float_processed |
| 2 | 3 | 1.51618 | 13.53 | 3.55 | 1.54 | 72.99 | 0.39 | 7.78 | 0.0 | 0.0 | building_windows_float_processed |
| 3 | 4 | 1.51766 | 13.21 | 3.69 | 1.29 | 72.61 | 0.57 | 8.22 | 0.0 | 0.0 | building_windows_float_processed |
| 4 | 5 | 1.51742 | 13.27 | 3.62 | 1.24 | 73.08 | 0.55 | 8.07 | 0.0 | 0.0 | building_windows_float_processed |


# 2. Split the data into dependent and independent variables. Also see what the data looks like

Hint: you can use any method (`iloc` or `drop`)

In [3]:
# Columns 1-9 hold the nine features (column 0 is the ID); column 10 is the glass type.
X = df.iloc[:, 1:10].values
y = df.iloc[:, 10].values
print(X.shape)
print(y.shape)
print(df.info())

(214, 9)
(214,)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 214 entries, 0 to 213
Data columns (total 11 columns):
ID                  214 non-null int64
refractive index    214 non-null float64
Sodium              214 non-null float64
Magnesium           214 non-null float64
Aluminum            214 non-null float64
Silicon             214 non-null float64
Potassium           214 non-null float64
Calcium             214 non-null float64
Barium              214 non-null float64
Iron                214 non-null float64
Type                214 non-null object
dtypes: float64(9), int64(1), object(1)
memory usage: 18.5+ KB
None


# 3. Convert Target variable into numerical

In [4]:
from sklearn.preprocessing import LabelEncoder 

le = LabelEncoder() 
y = le.fit_transform(y)

le.transform(['building_windows_float_processed','building_windows_non_float_processed','containers','headlamps',
              'tableware','vehicle_windows_float_processed'])

array([0, 1, 2, 3, 4, 5], dtype=int64)
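
To check which integer stands for which glass type, the fitted encoder can be inspected directly. The following is a minimal sketch, assuming a short hand-written label list in place of the real `Type` column; the names `labels`, `encoded`, `mapping` and `decoded` are illustrative and not part of the notebook.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative stand-in for the `Type` column (not the real data).
labels = ['headlamps', 'containers', 'tableware', 'containers', 'headlamps']

le = LabelEncoder()
encoded = le.fit_transform(labels)

# classes_ holds the original labels in sorted order; the position is the integer code.
mapping = {label: code for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform turns integer codes back into the original labels.
decoded = le.inverse_transform(encoded)
print(list(decoded))
```

Because `LabelEncoder` assigns codes in sorted label order, the codes in the notebook output above follow the alphabetical order of the glass type names rather than the 1 to 7 numbering used by UCI.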

# 4. Split the dataset into train, validation and test sets
It is good practice to split the dataset into three sets.

In [5]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .30, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=1)
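
The two chained calls above leave roughly 52% of the rows for training, 18% for validation and 30% for testing. The sketch below checks the resulting sizes, assuming placeholder arrays with the same number of rows as the glass data; `X_demo`, `y_demo` and the other variable names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows as the glass data (214).
X_demo = np.zeros((214, 9))
y_demo = np.zeros(214, dtype=int)

# First split: hold out 30% of the rows as the test set.
X_tmp, X_te, y_tmp, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=0)
# Second split: 25% of the remaining rows become the validation set.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=1)

for name, part in [('train', X_tr), ('validation', X_va), ('test', X_te)]:
    print(f'{name:<10} {len(part):>3} rows ({len(part) / len(X_demo):.1%})')
```

The validation set is intended for comparing candidate models, so that the test set is only scored once at the end.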

# 5. Build the pipeline
Steps:

Instantiate the pipeline: first standardize the data with a standard scaler, then run PCA on the scaled data, and finally feed the result to logistic regression (or any other algorithm).

Hint:

Import standard scaler to standardize the data

You can take an algorithm of choice and build a pipeline

In [6]:
#PCA - to reduce dimensions
from sklearn.preprocessing import StandardScaler 
from sklearn.decomposition import PCA 
from sklearn.linear_model import LogisticRegression 
from sklearn.pipeline import Pipeline 

pipe_lr = Pipeline([
    ('scl', StandardScaler()),
    ('pca', PCA(n_components=3)),
    ('clf', LogisticRegression(random_state=1)),
])
pipe_lr.fit(X_train, y_train) 
print('Test Accuracy: %.3f' % pipe_lr.score(X_test, y_test))

Test Accuracy: 0.523
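
One way to judge whether three principal components are enough is to look at how much variance they retain. The sketch below shows how a fitted step can be reached through `named_steps`. It assumes the wine dataset bundled with scikit-learn as a stand-in for `glass.csv`, and it sets `max_iter=1000` (not used in the notebook) to avoid convergence warnings; all variable names are illustrative.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data: the wine dataset shipped with scikit-learn, NOT glass.csv.
X_w, y_w = load_wine(return_X_y=True)
X_w_train, X_w_test, y_w_train, y_w_test = train_test_split(
    X_w, y_w, test_size=0.30, random_state=0)

pipe = Pipeline([
    ('scl', StandardScaler()),
    ('pca', PCA(n_components=3)),
    # max_iter is an assumption added here, not a notebook setting.
    ('clf', LogisticRegression(random_state=1, max_iter=1000)),
])
pipe.fit(X_w_train, y_w_train)

# named_steps gives access to each fitted step by the name used above.
ratios = pipe.named_steps['pca'].explained_variance_ratio_
print('Variance explained per component:', ratios.round(3))
print('Total variance kept: %.3f' % ratios.sum())
```

If the retained variance is low, raising `n_components` is one candidate change to try, which is what the grid search in the next section does.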




# 6. Following the steps above, tune the model parameters using grid search (any algorithm can be used)

In [7]:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC 

pipe_svc = Pipeline([('scl', StandardScaler()), ('pca', PCA()), ('svc', SVC())])

param_grid = {
    'pca__n_components': [4, 5],
    'svc__C': [0.001, 0.01, 0.1, 1, 10, 100],
    'svc__gamma': [0.001, 0.01, 0.1, 1, 10],
    'svc__kernel': ['rbf', 'poly'],
}

grid = GridSearchCV(pipe_svc, param_grid=param_grid, cv=5)

grid.fit(X_train, y_train)

print(" Best cross-validation accuracy: {:.2f}".format(grid.best_score_))
print(" Best parameters: ", grid.best_params_)
print(" Test set accuracy: {:.2f}".format(grid.score(X_test, y_test)))



 Best cross-validation accuracy: 0.73
 Best parameters:  {'pca__n_components': 5, 'svc__C': 100, 'svc__gamma': 0.01, 'svc__kernel': 'rbf'}
 Test set accuracy: 0.69




In [8]:
grid.predict(X_test)

array([3, 0, 1, 4, 2, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 3, 0, 1, 0, 1, 2, 0,
       3, 4, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 3, 1, 1, 1, 0,
       1, 1, 0, 1, 0, 1, 0, 4, 3, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0])
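
Accuracy alone hides which classes are confused with each other. The sketch below shows how predictions like the ones above could be summarized with a confusion matrix and per-class metrics. It uses small hypothetical label lists (`y_true_demo`, `y_pred_demo`), not the notebook's actual predictions.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical true labels and predictions for three classes (not notebook output).
y_true_demo = [0, 0, 0, 1, 1, 1, 2, 2]
y_pred_demo = [0, 0, 1, 1, 1, 0, 2, 1]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_demo, y_pred_demo)
print(cm)

# Overall fraction of correct predictions.
print('Accuracy: %.3f' % accuracy_score(y_true_demo, y_pred_demo))

# Per-class precision, recall and F1 score.
print(classification_report(y_true_demo, y_pred_demo))
```

In the notebook, the same calls would take `y_test` and `grid.predict(X_test)` as arguments.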

# 7. Optimize the model parameters (any algorithm can be used)
Use grid search to tune the hyperparameters.

Steps:

Split the dataset into train and test sets.

Pick any algorithm and define a parameter grid from its list of hyperparameters.

Once the hyperparameter grid is defined, import `GridSearchCV` and fit it on `X_train` and `y_train`.

Find the best parameters and the mean test score.


In [9]:
#split the dataset into train and test set
from sklearn.model_selection import train_test_split, cross_val_score

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = y,random_state = 7)
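
The `stratify` argument keeps the class proportions the same in both parts of the split, which matters when some glass types are rare. A minimal sketch of the effect, assuming hypothetical imbalanced labels (`y_imb`) rather than the glass data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 rows of class 0 and 20 rows of class 1.
y_imb = np.array([0] * 80 + [1] * 20)
X_imb = np.arange(100).reshape(-1, 1)

# With no test_size given, 25% of the rows go to the test part.
X_a, X_b, y_a, y_b = train_test_split(
    X_imb, y_imb, stratify=y_imb, random_state=7)

# Both parts keep the original 80/20 class ratio.
print('train class counts:', np.bincount(y_a))
print('test class counts: ', np.bincount(y_b))
```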

In [10]:
from sklearn.neighbors import KNeighborsClassifier
# KNN classifier with default settings (n_neighbors=5)
knn_clf = KNeighborsClassifier()

In [11]:
knn_clf.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=5, p=2,
           weights='uniform')

In [12]:
from sklearn.metrics import accuracy_score

In [13]:
param_grid = {'n_neighbors': list(range(1,9)),
             'algorithm': ('auto', 'ball_tree', 'kd_tree' , 'brute') }

In [14]:
from sklearn.model_selection import GridSearchCV
gs = GridSearchCV(knn_clf,param_grid,cv=10)

In [15]:
gs.fit(X_train, y_train)



GridSearchCV(cv=10, error_score='raise-deprecating',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=5, p=2,
           weights='uniform'),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8], 'algorithm': ('auto', 'ball_tree', 'kd_tree', 'brute')},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [16]:
gs.best_params_

{'algorithm': 'auto', 'n_neighbors': 1}

In [17]:
gs.cv_results_['mean_test_score']

array([0.6875 , 0.6625 , 0.63125, 0.65   , 0.61875, 0.60625, 0.58125,
       0.6125 , 0.6875 , 0.6625 , 0.63125, 0.65   , 0.61875, 0.60625,
       0.58125, 0.6125 , 0.6875 , 0.6625 , 0.63125, 0.65   , 0.61875,
       0.60625, 0.58125, 0.6125 , 0.6875 , 0.6625 , 0.63125, 0.65   ,
       0.61875, 0.60625, 0.58125, 0.6125 ])
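
The 32 scores above are easier to read once each one is matched to its parameter combination. The sketch below uses `ParameterGrid` to list the combinations for the same grid; `param_grid_demo` and `combos` are illustrative names. In the notebook itself, `gs.cv_results_['params']` gives the pairing directly.

```python
from sklearn.model_selection import ParameterGrid

param_grid_demo = {'n_neighbors': list(range(1, 9)),
                   'algorithm': ('auto', 'ball_tree', 'kd_tree', 'brute')}

# ParameterGrid enumerates every combination of the listed values.
combos = list(ParameterGrid(param_grid_demo))

# 4 algorithms x 8 neighbor counts
print(len(combos))
# First combination, and the first one for the second algorithm.
print(combos[0])
print(combos[8])
```

Read this way, the scores form four identical blocks of eight values, one block per `algorithm` setting. That is consistent with `algorithm` changing only how the neighbors are searched for, not which neighbors are found, so only `n_neighbors` appears to affect the score here.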