# ML Application Example
## Classification using Steel Plates Faults Data Set

The goal of this example is to implement a complete data-driven pipeline (loading, data analysis, visualisation, model selection and optimisation, prediction) on a specific dataset. The challenge is to classify the faults with several models and find the one that predicts most accurately.


## Dataset 
The notebook downloads a publicly available dataset: https://archive.ics.uci.edu/ml/datasets/steel+plates+faults
<blockquote>
  <b>Source:</b>
    Semeion, Research Center of Sciences of Communication, Via Sersale 117, 00128, Rome, Italy. www.semeion.it
    <br/>
    <b>Data Set Information:</b>
    Type of dependent variables (7 Types of Steel Plates Faults):
    <ol>
        <li> Pastry </li> <li> Z_Scratch </li> <li> K_Scatch </li> <li> Stains </li> <li>Dirtiness </li> <li> Bumps</li> <li> Other_Faults</li> </ol> 
    <br/>
    <b>Attribute Information:</b>
    27 independent variables:
    <table>
        <tr><td>X_Minimum        </td><td> X_Maximum           </td><td> Y_Minimum           </td><td>Y_Maximum            </td><td>Pixels_Areas      </td><td>X_Perimeter     </td></tr>
        <tr><td>Y_Perimeter      </td><td>Sum_of_Luminosity    </td><td>Minimum_of_Luminosity</td><td>Maximum_of_Luminosity</td><td>Length_of_Conveyer</td><td>TypeOfSteel_A300</td></tr>
        <tr><td>TypeOfSteel_A400 </td><td>Steel_Plate_Thickness</td><td>Edges_Index          </td><td>Empty_Index          </td><td>Square_Index      </td><td>Outside_X_Index </td></tr>
        <tr><td>Edges_X_Index    </td><td>Edges_Y_Index        </td><td>Outside_Global_Index </td><td>LogOfAreas           </td><td>Log_X_Index       </td><td>Log_Y_Index      </td></tr>
        <tr><td>Orientation_Index</td><td>Luminosity_Index     </td><td>SigmoidOfAreas       </td></tr></table> 
    <br/>
</blockquote>

In [None]:
# algebra
import numpy as np
# data structure
import pandas as pd
# data visualization
import matplotlib.pyplot as plt
# another module for data visualization
import plotly.express as px

import seaborn as sns
#file handling
from pathlib import Path




# Data load
The process consists of downloading the data if needed and loading them into a pandas DataFrame.

In [None]:
    
filename  = "Faults.NNA"
separator = '\t'
columns   = ['X_Minimum','X_Maximum','Y_Minimum','Y_Maximum','Pixels_Areas','X_Perimeter','Y_Perimeter','Sum_of_Luminosity','Minimum_of_Luminosity','Maximum_of_Luminosity','Length_of_Conveyer',
             'TypeOfSteel_A300','TypeOfSteel_A400','Steel_Plate_Thickness','Edges_Index','Empty_Index','Square_Index','Outside_X_Index','Edges_X_Index','Edges_Y_Index','Outside_Global_Index',
            'LogOfAreas','Log_X_Index','Log_Y_Index','Orientation_Index','Luminosity_Index','SigmoidOfAreas','Pastry','Z_Scratch','K_Scatch','Stains','Dirtiness','Bumps','Other_Faults']


In [None]:
#download the dataset if it is not already in the working directory
my_file = Path(filename)
if not my_file.is_file():
  print("Downloading dataset")
  !wget https://archive.ics.uci.edu/ml/machine-learning-databases/00198/Faults.NNA
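The `!wget` line is an IPython shell escape, so the cell only runs inside a notebook on a machine where `wget` is installed. A minimal, portable sketch of the same download-if-missing step using only the standard library could look as follows; the helper name `download_if_missing` is hypothetical and not part of the original notebook.

```python
from pathlib import Path
from urllib.request import urlretrieve


def download_if_missing(url, destination):
    """Download `url` to `destination` unless the file already exists.

    Hypothetical helper (not in the original notebook). Returns True when a
    download was performed and False when the local copy was reused.
    """
    destination = Path(destination)
    if destination.is_file():
        return False
    print("Downloading dataset")
    urlretrieve(url, str(destination))
    return True


# Usage in the notebook (same URL as in the wget cell):
# download_if_missing(
#     "https://archive.ics.uci.edu/ml/machine-learning-databases/00198/Faults.NNA",
#     "Faults.NNA")
```

The return value makes it easy to tell a fresh download from a reused local file.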


In [None]:
#helper to simplify loading the dataset, whether it is a csv, tsv or excel file
#output is a pandas dataframe
def load_csv(filename,separator,columns):

    try:
        csv_table = pd.read_csv(filename,sep=separator,names=columns,dtype='float64')
    except Exception:
        #fall back to the excel reader when the file cannot be parsed as delimited text
        csv_table = pd.read_excel(filename,names=columns)

    print("n. samples: {}".format(csv_table.shape[0]))
    print("n. columns: {}".format(csv_table.shape[1]))

    return csv_table #.dropna()

data = load_csv(filename,separator,columns)

# Data Analysis and Visualization
In this section we get familiar with the data: they are inspected, plotted and cleaned.

In [None]:
#What does the dataset look like?
data.head()

In [None]:
Faults = ['Pastry',
'Z_Scratch',
'K_Scatch',
'Stains',
'Dirtiness',
'Bumps',
'Other_Faults']

#encode the one-hot fault columns as a single class label (0 = Pastry, ..., 6 = Other_Faults)
data['class'] = (data[Faults]*np.arange(len(Faults))).sum(axis=1)
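The weighted sum gives a valid label only if every row has exactly one fault flag set to 1. A small sketch of that check, together with an equivalent `argmax`-based encoding, on a toy frame (the rows below are made up for illustration and the fault list is shortened):

```python
import numpy as np
import pandas as pd

fault_names = ['Pastry', 'Z_Scratch', 'K_Scatch']  # shortened list, for illustration only

# Toy one-hot table: each row flags exactly one fault (made-up data)
toy = pd.DataFrame([[1, 0, 0],
                    [0, 0, 1],
                    [0, 1, 0],
                    [0, 0, 1]], columns=fault_names, dtype='float64')

# Sanity check: exactly one flag per row, otherwise the encoding is ambiguous
assert (toy[fault_names].sum(axis=1) == 1).all()

# Encoding used in the notebook: weight column k by k and sum along the row
label_sum = (toy[fault_names] * np.arange(len(fault_names))).sum(axis=1)

# Equivalent encoding: position of the flagged column, as an integer
label_argmax = toy[fault_names].to_numpy().argmax(axis=1)

print(label_sum.tolist())     # labels as floats
print(label_argmax.tolist())  # labels as integers
```

Integer labels avoid float-valued classes in plot legends; calling `.astype(int)` on the weighted sum yields the same integers.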

In [None]:
#Do we have a balanced dataset?
plt.bar(Faults,data[Faults].sum())
plt.xticks(rotation=30)
plt.grid()

In [None]:
#Name of all columns
print(data.columns.values)

In [None]:
#let's have a look at the data and their correlations, if any
measurements = ['X_Minimum',
            'X_Maximum',
            'Y_Minimum',
            'Y_Maximum',
            'Pixels_Areas',
            'X_Perimeter',
            'Y_Perimeter',
            'Sum_of_Luminosity',
            'Minimum_of_Luminosity',
            'Maximum_of_Luminosity',
            'Length_of_Conveyer',
            'TypeOfSteel_A300',
            'TypeOfSteel_A400',
            'Steel_Plate_Thickness',
            'Edges_Index',
            'Empty_Index',
            'Square_Index',
            'Outside_X_Index',
            'Edges_X_Index',
            'Edges_Y_Index',
            'Outside_Global_Index',
            'LogOfAreas',
            'Log_X_Index',
            'Log_Y_Index',
            'Orientation_Index',
            'Luminosity_Index',
            'SigmoidOfAreas']
target       = ['class']

#let's have a look only at a few parameters
sns.pairplot(data[['X_Minimum', 'X_Maximum', 'Y_Minimum', 'Y_Maximum', 'Pixels_Areas', 'X_Perimeter', 'Y_Perimeter', 'Sum_of_Luminosity', 'Minimum_of_Luminosity', 'Maximum_of_Luminosity']+target],hue='class')

In [None]:
#Reduce the dimensionality with PCA to check visually
#whether the fault classes are "separable"

from sklearn.decomposition import PCA

#standardize each feature before the projection
aux = data[measurements]
aux = (aux-aux.mean())/aux.std()

pca = PCA(n_components=3)
X_r = pca.fit_transform(aux)
y_r = data[target].values.flatten()

fig = plt.figure(figsize=[10,10])

ax = plt.axes(projection='3d')
ax.scatter(X_r[:,0], X_r[:,1], X_r[:,2], c=y_r, cmap='viridis', linewidth=0.5);
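A 3D scatter only shows what the first three principal components retain. One way to judge how much of the standardized data that is: inspect `explained_variance_ratio_`. The sketch below uses synthetic data from `make_classification` as a stand-in for the steel-plates features, so the printed numbers say nothing about the real dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 27 measurements (assumption: not the real data)
X_demo, y_demo = make_classification(n_samples=300, n_features=27,
                                     n_informative=10, n_classes=3,
                                     random_state=0)

# Standardize each feature to zero mean and unit variance before PCA
X_std = StandardScaler().fit_transform(X_demo)

pca_demo = PCA(n_components=3).fit(X_std)

# Fraction of total variance captured by each component, and by all three together
ratios = pca_demo.explained_variance_ratio_
print("per component:", np.round(ratios, 3))
print("first three together: {:.3f}".format(ratios.sum()))
```

If the three components together capture only a small share of the variance, overlapping classes in the plot do not necessarily mean the classes are inseparable in the full feature space.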

In [None]:
#another fancy way of doing the previous plot
px.scatter_3d( x=X_r[:,0], y=X_r[:,1], z=X_r[:, 2], color=data['class'].values,color_continuous_scale='Rainbow' )


In [None]:
#t-distributed stochastic neighbor embedding (t-SNE) is a statistical method for visualizing high-dimensional data
#by giving each datapoint a location in a two- or three-dimensional map
#https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html

from sklearn.manifold import TSNE

tsne = TSNE(n_components=3)
X_r = tsne.fit_transform(aux)
y_r = data[target].values.flatten()

fig = plt.figure(figsize=[10,10])
ax = plt.axes(projection='3d')
ax.scatter(X_r[:,0], X_r[:,1], X_r[:,2], c=y_r, cmap='viridis', linewidth=0.5);
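t-SNE involves randomness and is sensitive to `perplexity`, so two runs may give different embeddings unless `random_state` is fixed. A minimal sketch on synthetic data; the data and the parameter values are illustrative assumptions, not tuned for the steel-plates set.

```python
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# Synthetic stand-in data: 3 groups of points in 10 dimensions
X_demo, y_demo = make_blobs(n_samples=90, n_features=10, centers=3, random_state=0)

# perplexity must be smaller than the number of samples
tsne_demo = TSNE(n_components=2, perplexity=15, random_state=0)
embedding = tsne_demo.fit_transform(X_demo)

print(embedding.shape)  # one 2D point per sample
```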


In [None]:
#Remove rows containing missing values (e.g. "nan") before modeling
data = data.dropna()

# Machine Learning
Here the input features and the output to predict are selected, the data are suitably preprocessed (i.e. normalized), the dataset is split into separate train and test subsets, and each model is trained on the training data and evaluated against the test set. <br/>
The list of evaluation metrics can be found <a href='https://scikit-learn.org/stable/modules/model_evaluation.html'>here</a>.
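The steps listed above (scale, split, train, evaluate) can be sketched as follows. In this variant the data are split first and the scaler sits inside a `Pipeline`, so its mean and variance are learned from the training rows only. The data come from `make_classification` and are a synthetic stand-in; the model choice is only an example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the features and the fault class (assumption)
X_demo, y_demo = make_classification(n_samples=400, n_features=12,
                                     n_informative=6, n_classes=3,
                                     random_state=0)

# stratify keeps the class proportions the same in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.33,
                                          stratify=y_demo, random_state=0)

# The scaler is fitted inside the pipeline, on the training data only
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_tr, y_tr)

score = f1_score(y_te, clf.predict(X_te), average='weighted')
print("weighted f1 on the held-out set: {:.2f}".format(score))
```

Keeping the scaler in the pipeline also means that cross-validation refits it on each training fold, so no statistics from held-out rows reach the model.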

In [None]:
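#the modules needed for modeling and data mining are imported
#train/test split
from sklearn.model_selection import train_test_split
#Data normalization
from sklearn.preprocessing   import StandardScaler
#metrics to evaluate the model
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix
#plot_confusion_matrix was removed in scikit-learn 1.2;
#ConfusionMatrixDisplay.from_estimator accepts the arguments used below, so it is bound to the old name
from sklearn.metrics import ConfusionMatrixDisplay
plot_confusion_matrix = ConfusionMatrixDisplay.from_estimator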

In [None]:
#Selection of feature and output variable, definition of the size (fraction of the total) of the random selected test set
input_features = measurements
output         = target
test_size      = 0.33
random_state   = 0

In [None]:
#not preprocessed data
unnormalized_X,y = data[input_features],data[output]

In [None]:
# normalisation
#Having features on a similar scale can help the model converge more quickly towards the minimum
scaler_X = StandardScaler().fit(unnormalized_X)
X = scaler_X.transform(unnormalized_X)

In [None]:
#check whether any nan is present in the data after normalization, to avoid trouble later
np.isnan(X).sum()

In [None]:
# basic train-test dataset random split
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=test_size,
                                                    random_state=random_state)

In [None]:
#dictionary to help the display of the results
Score_Dict = {}

#function introduced to simplify the comparison and testing of the various models
#it stores the weighted f1 score in Score_Dict and returns the trained model and its accuracy
def fit_predict_plot(model,X_train,y_train,X_test,y_test,class_names):
    model.fit(X_train,y_train)

    pred_y_test = model.predict(X_test)

    conf_matrix = confusion_matrix(y_test,pred_y_test)
    score = f1_score(y_test,pred_y_test,average='weighted')

    model_name = type(model).__name__
    if(model_name=='GridSearchCV'):
        model_name ='CV_'+type(model.estimator).__name__

    #Alternative metrics are listed here:https://scikit-learn.org/stable/modules/model_evaluation.html
    Score_Dict[model_name]=score

    fig,ax = plt.subplots(1,1,figsize=[10,10])
    
    np.set_printoptions(precision=2)

    plot_confusion_matrix(model,X_test,y_test,display_labels=class_names,
                                 cmap     =plt.cm.Blues,
                                 normalize='true',
                                 xticks_rotation=45,ax=ax)
    plt.axis('tight')
    
    correctly_classified = np.sum(np.diag(conf_matrix))/np.sum(conf_matrix)
    print("correctly classified :: {:.2f}".format(correctly_classified))
    print("f1 score :: {:.2f}".format(score))
    
    
    return model,correctly_classified
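`fit_predict_plot` stores the weighted F1 score but returns the accuracy, and the two can differ on an unbalanced test set. A small worked sketch with made-up labels shows how `average='weighted'` is assembled from the per-class scores:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Made-up labels for three classes with unequal support (5, 3 and 2 samples)
y_true = np.array([0, 0, 0, 0, 0, 1, 1, 1, 2, 2])
y_pred = np.array([0, 0, 0, 0, 1, 1, 1, 0, 2, 0])

per_class = f1_score(y_true, y_pred, average=None)   # one F1 value per class
support = np.bincount(y_true)                        # true samples per class

# 'weighted' = per-class F1 averaged with the class supports as weights
weighted_by_hand = np.average(per_class, weights=support)
weighted = f1_score(y_true, y_pred, average='weighted')

# 'macro' ignores the supports, so rare classes count as much as common ones
macro = f1_score(y_true, y_pred, average='macro')

print("accuracy   :", accuracy_score(y_true, y_pred))
print("weighted f1:", weighted, "(by hand:", weighted_by_hand, ")")
print("macro f1   :", macro)
```

On a dataset with a dominant class, the macro average is the more demanding of the two, because a poorly predicted rare class pulls it down as much as a common one.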



## Models used in this example are:
<ul>
    <li><a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeClassifier.html#sklearn.linear_model.RidgeClassifier">Ridge</a></li>
     <li><a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression">Logistic Regression</a></li>
    <li><a href="https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier">kNN</a></li>
    <li><a href="https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html">Support Vector Classification</a></li>
    <li><a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html">Random Forest</a></li>
</ul>

# Ridge Classifier

In [None]:
#initialization, fit and evaluation of the model
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import GridSearchCV
estimator = RidgeClassifier()

parameters = { 'alpha':np.logspace(-2,2,5)}
model = GridSearchCV(estimator, parameters,cv=5)

model, ridge_score = fit_predict_plot(model,X_train,y_train.values.flatten(),X_test,y_test.values.flatten(),Faults)
print(model.best_params_)
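`best_params_` reports only the winning value. To see how sensitive the model is to `alpha`, the per-candidate cross-validation scores stored in `cv_results_` can be tabulated. A minimal sketch on synthetic data (a stand-in, so the scores are not those of the steel-plates data):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the features and the fault class (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=12,
                                     n_informative=6, n_classes=3,
                                     random_state=0)

# Same alpha grid as in the Ridge cell
search = GridSearchCV(RidgeClassifier(), {'alpha': np.logspace(-2, 2, 5)}, cv=5)
search.fit(X_demo, y_demo)

# One row per candidate: mean and spread of the score over the 5 folds
results = pd.DataFrame(search.cv_results_)[
    ['param_alpha', 'mean_test_score', 'std_test_score', 'rank_test_score']]
print(results.sort_values('rank_test_score'))
```

When the mean scores of neighbouring candidates differ by less than their standard deviation, the choice of `alpha` is not strongly determined by the data.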

# Logistic Regression

In [None]:
#initialization, fit and evaluation of the model
from sklearn import linear_model
estimator = linear_model.LogisticRegression(max_iter=10000)

parameters = { 'C':np.logspace(-2,3,5)}
model = GridSearchCV(estimator, parameters,cv=5)

model, logistic_score = fit_predict_plot(model,X_train,y_train.values.flatten(),X_test,y_test.values.flatten(),Faults)
print(model.best_params_)

# kNN

In [None]:
#initialization, fit and evaluation of the model
from sklearn.neighbors import KNeighborsClassifier
estimator = KNeighborsClassifier()

parameters = { 'n_neighbors':[3,5,7]}
model = GridSearchCV(estimator, parameters,cv=5)

model, knn_score = fit_predict_plot(model,X_train,y_train.values.flatten(),X_test,y_test.values.flatten(),Faults)
print(model.best_params_)

# SVC

In [None]:
from sklearn.svm import SVC
estimator = SVC(gamma='auto')

parameters = { 'C':[0.1,1,10,100]}
model = GridSearchCV(estimator, parameters,cv=5)

model, svc_score = fit_predict_plot(model,X_train,y_train.values.flatten(),X_test,y_test.values.flatten(),Faults)
print(model.best_params_)

# Random Forest

In [None]:
#initialization, fit and evaluation of the model
from sklearn.ensemble import RandomForestClassifier
estimator = RandomForestClassifier()

parameters = { 'min_samples_leaf':[1,3,5],
              'class_weight':['balanced_subsample'],
              'n_estimators':[10,100,200]}
model = GridSearchCV(estimator, parameters,cv=5)

model, rf_score = fit_predict_plot(model,X_train,y_train.values.flatten(),X_test,y_test.values.flatten(),Faults)
print(model.best_params_)

In [None]:
#print out the results in a table
from IPython.display import Markdown as md
from IPython.display import display


table = '<table><tr><th> Model</th><th> Weighted F1 score </th></tr>'

for key, value in Score_Dict.items():
    table +='<tr> <td>'+key+'</td><td>' +'%.2f'%(value)+'</td></tr>'
table+='</table>'
display(md(table))


names = list(Score_Dict.keys())
values = list(Score_Dict.values())

plt.figure(figsize=(6, 3))
plt.bar(names, values)
plt.ylabel('Weighted F1 score')
plt.xticks(rotation=30)
plt.grid()
#plt.ylim([0.5,0.8])


# How to deal with an Unbalanced Dataset
There are at least two possibilities, as explained <a href="https://www.kaggle.com/rafjaa/resampling-strategies-for-imbalanced-datasets">here</a> and <a href="https://medium.com/strands-tech-corner/unbalanced-datasets-what-to-do-144e0552d9cd">here</a>:
<b>Undersampling or Oversampling</b>

<img  width="500" src="https://miro.medium.com/max/2400/1*ENvt_PTaH5v4BXZfd-3pMA.png">
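Whichever strategy is chosen, resampling is usually applied to the training subset only: if rows are duplicated before the split, copies of the same row can end up in both train and test, and the test score becomes optimistic. A minimal sketch of oversampling after the split with `sklearn.utils.resample`, on synthetic unbalanced data (an assumption, not the steel-plates set):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic unbalanced data: roughly 80% / 15% / 5% class shares (assumption)
X_demo, y_demo = make_classification(n_samples=600, n_features=10,
                                     n_informative=5, n_classes=3,
                                     weights=[0.8, 0.15, 0.05],
                                     random_state=0)

# Split first, so the test set keeps the original class distribution
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.33,
                                          stratify=y_demo, random_state=0)

# Oversample every class of the training set up to the size of the largest one
classes, counts = np.unique(y_tr, return_counts=True)
max_count = counts.max()

X_parts, y_parts = [], []
for label in classes:
    X_c = X_tr[y_tr == label]
    X_up = resample(X_c, replace=True, n_samples=max_count, random_state=0)
    X_parts.append(X_up)
    y_parts.append(np.full(max_count, label))

X_bal = np.vstack(X_parts)
y_bal = np.concatenate(y_parts)

print("train counts before:", dict(zip(classes, counts)))
print("train counts after :", dict(zip(*np.unique(y_bal, return_counts=True))))
print("test set size (untouched):", len(y_te))
```

Undersampling follows the same pattern with `replace=False` and `n_samples` set to the size of the smallest class.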



In [None]:
#Undersample: draw the same number of rows (the size of the rarest class) from every class
counts = data[target].value_counts()
mincounts = int(counts.min())

#one random sample per fault class, concatenated into a balanced dataframe
df_under = pd.concat(
    [data.loc[data[fault]==1].sample(mincounts,random_state=random_state) for fault in Faults],
    axis=0)
    
    
plt.bar(Faults,df_under[Faults].sum())
plt.xticks(rotation=30)
plt.grid()

In [None]:
#not preprocessed data
unnormalized_X,y = df_under[input_features],df_under[output]
# normalisation
#Having features on a similar scale can help the model converge more quickly towards the minimum
scaler_X = StandardScaler().fit(unnormalized_X)
X = scaler_X.transform(unnormalized_X)

# basic train-test dataset random split
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=test_size,
                                                    random_state=random_state)

#initialization, fit and evaluation of the model
from sklearn.neighbors import KNeighborsClassifier
estimator = KNeighborsClassifier()

parameters = { 'n_neighbors':[3,5,7,9]}
model = GridSearchCV(estimator, parameters,cv=5)

model, knn_score = fit_predict_plot(model,X_train,y_train.values.flatten(),X_test,y_test.values.flatten(),Faults)
print(model.best_params_)

#initialization, fit and evaluation of the model
from sklearn.ensemble import RandomForestClassifier
estimator = RandomForestClassifier()

parameters = { 'min_samples_leaf':[1,3,5],
              'class_weight':['balanced_subsample'],
              'n_estimators':[10,100,200,300]}
model = GridSearchCV(estimator, parameters,cv=5)

model, rf_score = fit_predict_plot(model,X_train,y_train.values.flatten(),X_test,y_test.values.flatten(),Faults)
print(model.best_params_)

In [None]:
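#Oversample: draw, with replacement, as many rows as the largest class from every class
#note: duplicating rows before the train/test split lets copies of the same row land in both subsets,
#which makes the test score optimistic
counts = data[target].value_counts()
maxcounts = int(counts.max())

df_over = pd.concat(
    [data.loc[data[fault]==1].sample(maxcounts,replace=True,random_state=random_state) for fault in Faults],
    axis=0)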
    
    
plt.bar(Faults,df_over[Faults].sum())
plt.xticks(rotation=30)
plt.grid()


In [None]:
#not preprocessed data
unnormalized_X,y = df_over[input_features],df_over[output]
# normalisation
#Having features on a similar scale can help the model converge more quickly towards the minimum
scaler_X = StandardScaler().fit(unnormalized_X)
X = scaler_X.transform(unnormalized_X)

# basic train-test dataset random split
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=test_size,
                                                    random_state=random_state)

#initialization, fit and evaluation of the model
from sklearn.neighbors import KNeighborsClassifier
estimator = KNeighborsClassifier()

parameters = { 'n_neighbors':[3,5,7,9]}
model = GridSearchCV(estimator, parameters,cv=5)

model, knn_score = fit_predict_plot(model,X_train,y_train.values.flatten(),X_test,y_test.values.flatten(),Faults)
print(model.best_params_)

#initialization, fit and evaluation of the model
from sklearn.ensemble import RandomForestClassifier
estimator = RandomForestClassifier()

parameters = { 'min_samples_leaf':[1,3,5],
              'class_weight':['balanced_subsample'],
              'n_estimators':[10,100,200,300]}
model = GridSearchCV(estimator, parameters,cv=5)

model, rf_score = fit_predict_plot(model,X_train,y_train.values.flatten(),X_test,y_test.values.flatten(),Faults)
print(model.best_params_)