# Challenge 1: The banknote-authentication data set problem

We will perform a nearly realistic analysis of the banknote authentication data set, which can be downloaded from https://archive-beta.ics.uci.edu/dataset/267/banknote+authentication

## Data set description

Data were extracted from images taken of genuine and forged banknote-like specimens. For digitization, an industrial camera usually used for print inspection was used. The final images have 400 x 400 pixels. Due to the object lens and the distance to the investigated object, gray-scale pictures with a resolution of about 660 dpi were obtained. A Wavelet Transform tool was used to extract features from the images.
These features are:
1. variance of Wavelet Transformed image (continuous) 
2. skewness of Wavelet Transformed image (continuous) 
3. curtosis of Wavelet Transformed image (continuous) 
4. entropy of image (continuous) 
5. class (integer)

## Task description
We have a binary classification problem. The assignment can be divided into several parts:

1. Loading and pretreating the data.
2. Exploring the data with Unsupervised Learning techniques.
3. Building several Supervised Learning models.

### 1. Data pretreatment

Load the data and look at it: Is some kind of scaling needed? Why? Are the data points sorted in the original data set? Can this cause problems? How can it be solved?

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import plot_tree
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay  # plot_confusion_matrix was removed in scikit-learn 1.2

In [None]:
colNames=['variance','skewness','curtosis','entropy','counterfit']
df = pd.read_csv('./data_banknote_authentication.txt', header=None, names=colNames)
df

In [None]:
df.isna().sum()

In [None]:
df_shuffled = df.sample(frac=1,random_state=123).reset_index(drop=True)
df_shuffled
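
The question about row order can be checked directly rather than by eye. The sketch below uses a small hypothetical table (`toy`, not the real banknote file) whose rows are sorted by class, and a helper `is_sorted_by_class` that is an illustrative name, not part of the notebook. It shows why an unshuffled tail of the data would contain a single class only, which is a problem for any split taken by position.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the banknote table: rows sorted by class (hypothetical worst case).
toy = pd.DataFrame({
    "variance": np.linspace(-3.0, 3.0, 10),
    "counterfit": [0] * 6 + [1] * 4,
})

def is_sorted_by_class(frame, target="counterfit"):
    """Return True when the target column never decreases from one row to the next."""
    return bool(frame[target].is_monotonic_increasing)

# Same shuffling call as in the notebook cell above.
toy_shuffled = toy.sample(frac=1, random_state=123).reset_index(drop=True)

print("sorted before shuffling:", is_sorted_by_class(toy))
print("sorted after shuffling: ", is_sorted_by_class(toy_shuffled))
# Without shuffling, the last rows contain one class only.
print("classes in the last 4 rows (original):", toy["counterfit"].tail(4).unique())
```

Running the same helper on `df` and `df_shuffled` would answer the question for the real data.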

In [None]:
y = df_shuffled["counterfit"].copy()
X = df_shuffled.drop(columns=["counterfit"])
X.shape

In [None]:
import seaborn as sb

plt.rcParams['figure.figsize']=15,5
plt.subplot(121)
plt.title('Banknote Class Type Count', fontsize=10)
s = sb.countplot(x = "counterfit", data = df, alpha=0.7)
for p in s.patches:
    s.annotate(format(p.get_height(), '.1f'), 
               (p.get_x() + p.get_width() / 2., p.get_height()), 
                ha = 'center', va = 'center', 
                xytext = (0, 4), 
                textcoords = 'offset points')

ax = plt.subplot(122)
classpie = df['counterfit'].value_counts()
size = classpie.values.tolist()
types = classpie.axes[0].tolist()
labels = [f'class {t}' for t in types]  # label each slice with the class value it counts
colors = ['#EAFFD0', '#F38181']
plt.title('Banknote Class Type Percentage', fontsize=10)
patches, texts, autotexts = plt.pie(size, labels=labels, colors=colors,
        autopct='%1.1f%%', shadow=True, startangle=150)
for text,autotext in zip(texts,autotexts):
    text.set_fontsize(14)
    autotext.set_fontsize(14)

plt.axis('equal')

In [None]:
from sklearn import preprocessing

scaler = preprocessing.StandardScaler().fit(X)
X_scaled = scaler.transform(X)
X_scaled = pd.DataFrame(X_scaled, columns=['variance','skewness','curtosis','entropy'])
X_scaled
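
One caveat about the cell above: the scaler is fitted on all rows, including those that later end up in the test set. A minimal sketch of the alternative, fitting the scaler on the training rows only, is given below. It assumes synthetic data from `make_classification` as a stand-in for the four banknote features; all `*_demo` names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the four banknote features (not the real data).
X_demo, y_demo = make_classification(n_samples=200, n_features=4, n_informative=3,
                                     n_redundant=1, random_state=0)
X_demo = X_demo * np.array([1.0, 10.0, 100.0, 0.1])  # give the columns very different scales

X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0)

# Fit the scaler on the training rows only, then apply it to both splits.
scaler_demo = StandardScaler().fit(X_tr_demo)
X_tr_scaled = scaler_demo.transform(X_tr_demo)
X_te_scaled = scaler_demo.transform(X_te_demo)

print("train means:", X_tr_scaled.mean(axis=0).round(3))
print("train stds: ", X_tr_scaled.std(axis=0).round(3))
print("test means: ", X_te_scaled.mean(axis=0).round(3))
```

The training columns end up with zero mean and unit standard deviation, while the test columns are only approximately standardized, which is the expected behaviour when no test information leaks into the scaler.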

### Exploratory data analysis

In [None]:
ax = X_scaled.hist(figsize=(20,10))
plt.show()

In [None]:
ax = X_scaled.boxplot(figsize=(20,10))
ax.set_xlabel('Features')
ax.set_ylabel('Values')
ax.set_title('Outliers detection')
plt.show()

In [None]:
pd.plotting.scatter_matrix(X_scaled, c = y, figsize = (18,18), diagonal = "kde", alpha = .8)
plt.suptitle("UCI Dataset Scatter Matrix", y = .9)
plt.show()
plt.close()

### 2. Unsupervised Learning

Use PCA and plot the first two components, colouring the points according to the class. Are the classes linearly separable in this projection? What happens when k-means with two clusters is applied in this space? And when all the coordinates are used? Also try t-SNE for projection and DBSCAN for clustering, and comment on the results.

#### PCA

In [None]:
from sklearn.decomposition import PCA
pca = PCA()
pca.fit(X_scaled)
X_pca = pca.transform(X_scaled)

perc_var = np.round(pca.explained_variance_ratio_*100, decimals=1)
labels = ['PC' + str(x) for x in range(1, len(perc_var)+1)]

plt.bar(x=range(1, len(perc_var)+1), height=perc_var, tick_label=labels)
plt.ylabel('Percentage of Explained Variance')
plt.xlabel('Principal Component')
plt.title('Scree Plot')
plt.show()

In [None]:
pca_df = pd.DataFrame(X_pca, columns=labels)
plt.scatter(pca_df.PC1, pca_df.PC2, c=y)
plt.title('PCA Graph')
plt.xlabel('PC1 - {0}%'.format(perc_var[0]))
plt.ylabel('PC2 - {0}%'.format(perc_var[1]))
plt.show()
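
Besides inspecting the scatter plot, linear separability in the PC1-PC2 plane can be probed numerically. The sketch below is one possible check, not the assignment's prescribed method: it fits an almost unregularized logistic regression on the two components and looks at the training accuracy. It assumes synthetic `make_blobs` data in place of the banknote features; the `*_demo` names, `pcs` and `linear_clf` are hypothetical.

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic two-class data with four features (stand-in for X_scaled and y).
X_demo, y_demo = make_blobs(n_samples=300, centers=2, n_features=4,
                            cluster_std=2.0, random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)

# Keep only the first two principal components, as in the PCA plot.
pcs = PCA(n_components=2).fit_transform(X_demo)

# A very large C makes the L2 penalty almost negligible.
linear_clf = LogisticRegression(C=1e4, max_iter=1000).fit(pcs, y_demo)
train_acc = linear_clf.score(pcs, y_demo)
print(f"training accuracy of a linear boundary in PC1-PC2: {train_acc:.3f}")
```

A training accuracy of 1.0 means a separating line exists in that plane; a lower value suggests that the classes overlap in the projection.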

#### K-means

What happens when k-means with two clusters is applied in this space? And when all the coordinates are used?

In [None]:
from sklearn.cluster import KMeans

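# X_pca holds all four components (PCA() keeps every component), so this clusters
# in the full rotated space; use X_pca[:, :2] to cluster in the PC1-PC2 plane only.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)

kmeans.fit(X_pca)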

kmeans_labels = kmeans.labels_
centroids = kmeans.cluster_centers_

plt.scatter(X_pca[:, 0], X_pca[:, 1], c=kmeans_labels)
plt.scatter(centroids[:, 0], centroids[:, 1], marker='*', s=300, c='r')
plt.xlabel('First principal component')
plt.ylabel('Second principal component')
plt.show()

In [None]:
from sklearn.metrics.cluster import normalized_mutual_info_score

normalized_mutual_info_score(kmeans_labels, np.array(y).flatten())
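
To compare k-means in the two-component projection against k-means on all coordinates, the same score can be computed for both inputs. The sketch below assumes synthetic `make_classification` data instead of the banknote features; `kmeans_nmi` and the `*_demo` names are hypothetical helpers introduced only for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.metrics import normalized_mutual_info_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled features and the class labels.
X_demo, y_demo = make_classification(n_samples=300, n_features=4, n_informative=3,
                                     n_redundant=1, random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)
pcs = PCA().fit_transform(X_demo)  # all four components

def kmeans_nmi(data, labels, seed=0):
    """Cluster `data` into two groups and score the agreement with `labels`."""
    km = KMeans(n_clusters=2, n_init=10, random_state=seed).fit(data)
    return normalized_mutual_info_score(labels, km.labels_)

nmi_2pc = kmeans_nmi(pcs[:, :2], y_demo)   # PC1-PC2 plane only
nmi_all_pc = kmeans_nmi(pcs, y_demo)       # every principal component
nmi_raw = kmeans_nmi(X_demo, y_demo)       # original scaled coordinates

print(f"NMI, first two PCs:      {nmi_2pc:.3f}")
print(f"NMI, all PCs:            {nmi_all_pc:.3f}")
print(f"NMI, scaled coordinates: {nmi_raw:.3f}")
```

Since PCA with all components is a rotation of the centred data, the last two scores should be essentially the same; only the two-component case discards information.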

#### t-SNE

Also try t-SNE for projection and DBSCAN for clustering, and comment on the results.

In [None]:
from sklearn.manifold import TSNE
X_embedded = TSNE(n_components=2, learning_rate='auto',
                  init='random', perplexity=15, random_state=42).fit_transform(X_scaled)
fig, ax =plt.subplots(figsize=(9,9))
ax.scatter(X_embedded[:,0],X_embedded[:,1], c=y)
ax.set_title('t-SNE')
plt.show()

#### DBSCAN

In [None]:
from sklearn.cluster import DBSCAN

dbscan = DBSCAN(eps=0.7, min_samples=15).fit(X_pca)

fig = plt.figure(figsize=(9, 9))
ax = fig.add_subplot()
ax.scatter(pca_df.PC1, pca_df.PC2,c=dbscan.labels_)

plt.show()

In [None]:
from sklearn.cluster import DBSCAN

dbscan = DBSCAN(eps=0.7, min_samples=12).fit(X_pca)

fig = plt.figure(figsize=(9, 9))
ax = fig.add_subplot()
ax.scatter(pca_df.PC1, pca_df.PC2,c=dbscan.labels_)

plt.show()
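
When commenting on the DBSCAN results, it helps to report how many clusters and noise points each parameter setting produces, since the plots alone do not show this clearly. A minimal sketch follows, assuming synthetic `make_blobs` data; `summarize_dbscan` is a hypothetical helper, and the `eps` and `min_samples` values simply mirror the two cells above.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_pca (not the real data).
X_demo, _ = make_blobs(n_samples=300, centers=3, n_features=4, random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)

def summarize_dbscan(data, eps, min_samples):
    """Return (number of clusters, number of noise points) found by DBSCAN."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit(data).labels_
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # label -1 marks noise
    n_noise = int(np.sum(labels == -1))
    return n_clusters, n_noise

for min_samples in (12, 15):
    n_clusters, n_noise = summarize_dbscan(X_demo, eps=0.7, min_samples=min_samples)
    print(f"min_samples={min_samples}: {n_clusters} clusters, {n_noise} noise points")
```

Applied to `X_pca`, the same helper would show how sensitive the clustering is to `min_samples` at a fixed `eps`.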

### 3. Supervised Learning

Set aside a subset of 372 elements of the data as the test set. With the rest of the data, build the following models: Logistic Regression, Decision Tree (use the ID3 algorithm), Naive Bayes and k-NN.

Investigate the effect of regularization (when possible) and use cross validation for setting the hyper-parameters when needed. 

Compare the performances in terms of accuracy, precision, recall and F1-score on the test set. Comment on these results in the light of those obtained from the Unsupervised Learning analysis. Could you propose a way to improve these results?


In [None]:
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2711, stratify=y, random_state=2)
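
The fraction `0.2711` above is presumably chosen so that the test set has 372 rows. `train_test_split` also accepts an integer `test_size`, which states the requirement directly. The sketch below assumes a data set of 1372 rows (the size implied by that fraction) and uses placeholder arrays with assumed class counts, not the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with an assumed 1372 rows; the class counts are illustrative.
X_demo = np.arange(1372 * 4).reshape(1372, 4)
y_demo = np.array([0] * 762 + [1] * 610)

# An integer test_size is read as an absolute number of test samples.
X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=372, stratify=y_demo, random_state=2)

print("train rows:", len(X_tr_demo))
print("test rows: ", len(X_te_demo))
print("class-1 share, full vs test:", y_demo.mean().round(3), y_te_demo.mean().round(3))
```

With `stratify`, the class proportions in the test set stay close to those of the full data.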

#### TEST

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
import xgboost as xgb

import sklearn.metrics as metric

In [None]:
estimators = {
    'LogisticRegression': LogisticRegression(),
    'RandomForestClassifier': RandomForestClassifier(),
    'GradientBoostingClassifier': GradientBoostingClassifier(),
    'KNeighborsClassifier': KNeighborsClassifier(n_neighbors=5),
    'SVM': SVC(),
    'XGB': xgb.XGBClassifier()
}

In [None]:
def train_model(estimator, X_train, X_test, y_train, y_test):
    estimator.fit(X_train, y_train)
    y_pred = estimator.predict(X_test)
    print(f'The accuracy score is: {metric.accuracy_score(y_test, y_pred):.4f}')
    print(f'The report is: {metric.classification_report(y_test, y_pred)}')
    print('#'*100)
    
def estimator_dict(X_train, X_test, y_train, y_test):
    for name, estimator in estimators.items():
        print(name)
        train_model(estimator, X_train, X_test, y_train, y_test)

In [None]:
estimator_dict(X_tr, X_val, y_tr, y_val)

#### LOGISTIC REGRESSION

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, roc_curve, roc_auc_score

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2711, random_state=2)

lr = LogisticRegression()

lr.fit(X_train, y_train)

y_pred = lr.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1)

cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='g')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()

y_prob = lr.predict_proba(X_test)[:,1]
fpr, tpr, _ = roc_curve(y_test, y_prob)
auc = roc_auc_score(y_test, y_prob)
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve (AUC={:.2f})'.format(auc))
plt.show()

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

logistic_regression_model = LogisticRegression(penalty='l2')

param_grid = {'C': [0.1, 1, 10, 100]}
grid_search = GridSearchCV(logistic_regression_model, param_grid, cv=5)
grid_search.fit(X_train, y_train)

print('Best parameter:', grid_search.best_params_)
print('Best score:', grid_search.best_score_)
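
The grid search reports only the best `C`. To see the effect of regularization itself, one option is to track the size of the coefficient vector as `C` varies (small `C` means a strong L2 penalty). This is a sketch under the assumption of synthetic `make_classification` data; `coef_norms` and the `*_demo` names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled training features and labels.
X_demo, y_demo = make_classification(n_samples=300, n_features=4, n_informative=3,
                                     n_redundant=1, random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)

coef_norms = {}
for C in [0.01, 0.1, 1, 10, 100]:
    # Default penalty is L2; C is the inverse of the regularization strength.
    model = LogisticRegression(C=C, max_iter=1000).fit(X_demo, y_demo)
    coef_norms[C] = float(np.linalg.norm(model.coef_))
    acc = model.score(X_demo, y_demo)
    print(f"C={C:<6} ||w||={coef_norms[C]:.3f}  train accuracy={acc:.3f}")
```

The coefficient norm grows as `C` increases, which is the shrinkage effect of the penalty made visible; whether accuracy changes much depends on the data.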

#### DECISION TREE

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix

dtc = DecisionTreeClassifier(criterion="entropy", random_state=42)

dtc.fit(X_train, y_train)

y_pred = dtc.predict(X_test)

cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='g')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()

print(classification_report(y_test, y_pred))

#### Regularization and Cross Validation

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

tree = DecisionTreeClassifier(criterion="entropy", random_state=42)  # entropy criterion, as the task asks for ID3

param_grid = {'max_depth': range(1, 11)}
grid_search = GridSearchCV(tree, param_grid, cv=5)
grid_search.fit(X_train, y_train)

print('Best parameter:', grid_search.best_params_)
print('Best score:', grid_search.best_score_)

In [None]:
# GridSearchCV refits the best model on X_train by default (refit=True).
best_tree = grid_search.best_estimator_

In [None]:
plt.figure(figsize=(15,7.5))
plot_tree(best_tree,
          filled=True,
          rounded=True,
          class_names=['0', '1'],  # class values as stored in the 'counterfit' column
          feature_names=colNames[:-1])  # the four features, without the target column

#### NAIVE BAYES

In [None]:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report, confusion_matrix

nb = GaussianNB()

nb.fit(X_train, y_train)

y_pred = nb.predict(X_test)

cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='g')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()
print(classification_report(y_test, y_pred))

#### K-NN

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix

knn = KNeighborsClassifier(n_neighbors=5)

knn.fit(X_train, y_train)

y_pred = knn.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

#### Cross Validation

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

knn = KNeighborsClassifier()

param_grid = {'n_neighbors': [1, 3, 5, 7, 9, 11, 13, 15]}

grid_search = GridSearchCV(knn, param_grid=param_grid, cv=5)

grid_search.fit(X_train, y_train)

best_n_neighbors = grid_search.best_params_['n_neighbors']

knn_best = KNeighborsClassifier(n_neighbors=best_n_neighbors)

knn_best.fit(X_train, y_train)

knn_best_score = knn_best.score(X_test, y_test)
print('Best n_neighbors:', best_n_neighbors)
print('Test accuracy:', knn_best_score)

In [None]:
from sklearn.model_selection import cross_validate
from sklearn.metrics import accuracy_score

knn_model = KNeighborsClassifier(n_neighbors=5)

cv_results = cross_validate(knn_model, X, y, cv=5, scoring='accuracy')

print('Mean accuracy: ', cv_results['test_score'].mean())
print('Standard deviation: ', cv_results['test_score'].std())
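
The task asks for a comparison of the four models in terms of accuracy, precision, recall and F1-score on the test set. One way to collect these numbers in a single table is sketched below. It assumes synthetic `make_classification` data and a hypothetical `models` dictionary; scikit-learn's tree is an optimized CART, so `criterion="entropy"` only approximates ID3's information-gain splitting.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the banknote features and labels.
X_demo, y_demo = make_classification(n_samples=400, n_features=4, n_informative=3,
                                     n_redundant=1, random_state=0)
X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=100, stratify=y_demo, random_state=2)

# Scale-sensitive models get their own scaler, fitted on the training rows only.
models = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "Decision Tree (entropy)": DecisionTreeClassifier(criterion="entropy", random_state=42),
    "Gaussian Naive Bayes": GaussianNB(),
    "k-NN (k=5)": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
}

rows = []
for name, model in models.items():
    y_hat = model.fit(X_tr_demo, y_tr_demo).predict(X_te_demo)
    rows.append({
        "model": name,
        "accuracy": accuracy_score(y_te_demo, y_hat),
        "precision": precision_score(y_te_demo, y_hat),
        "recall": recall_score(y_te_demo, y_hat),
        "f1": f1_score(y_te_demo, y_hat),
    })

results = pd.DataFrame(rows).set_index("model")
print(results.round(3))
```

Replacing the synthetic arrays with the notebook's own train and test splits would give the comparison table for the banknote data.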