# Lab | Avila Bible 

In this lab, we will explore the [**Avila Bible dataset**](https://archive.ics.uci.edu/ml/datasets/Avila), which was extracted from 800 images of the 'Avila Bible', a giant 12th-century Latin copy of the Bible. The prediction task consists of associating each pattern with a copyist. You will use supervised learning algorithms to figure out which feature patterns each copyist is likely to produce, and then use your models to predict the copyist.

-----------------------------------------------------------------------------------------------------------------

## Before you start:
- Comment as much as you can and use the APIla-bible in the README.md.
- Happy learning!

In [2]:
# Import your libraries
import pandas as pd
import requests
import seaborn as sns

![machine-learning](https://miro.medium.com/proxy/1*halC1X4ydv_3yHYxKqvrwg.gif)

The Avila data set has been extracted from 800 images of the **Avila Bible**, a giant Latin copy of the whole Bible produced during the 12th century between Italy and Spain. Palaeographic analysis of the manuscript has identified 8 copyists. The pages written by each copyist are not equally numerous. 
Each pattern contains 10 features and corresponds to a group of 4 consecutive rows.

# What am I expected to do?

Your prediction task consists of associating each pattern with one of the 8 monks we will be evaluating (labeled as: Marcus, Clarius, Philippus, Coronavirucus, Mongucus, Paithonius, Ubuntius, Esequlius). To that end, you should:
- Train a minimum of 4 different models
- Include a summary of the machine learning tools and algorithms you used
- Report the results or the score obtained with each of them

You won't get many more instructions from now on. Remember to comment your code as much as you can. Keep the requirements in mind and have fun! 

## Dataset

In [3]:
data = pd.read_csv('../data/training_dataset.csv', index_col=0)

In [4]:
data.head(4)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
0,0.241386,0.109171,-0.127126,0.380626,0.17234,0.314889,0.484429,0.316412,0.18881,0.134922,Marcus
1,0.303106,0.352558,0.082701,0.703981,0.261718,-0.391033,0.408929,1.045014,0.282354,-0.448209,Clarius
2,-0.116585,0.281897,0.175168,-0.15249,0.261718,-0.889332,0.371178,-0.024328,0.905984,-0.87783,Philippus
3,-0.32643,-0.652394,0.384996,-1.694222,-0.185173,-1.138481,-0.232828,-1.747116,-1.183175,-0.80738,Philippus


`Keep calm and code on!`

## Explore Data

In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12017 entries, 0 to 12016
Data columns (total 11 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       12017 non-null  float64
 1   1       12017 non-null  float64
 2   2       12017 non-null  float64
 3   3       12017 non-null  float64
 4   4       12017 non-null  float64
 5   5       12017 non-null  float64
 6   6       12017 non-null  float64
 7   7       12017 non-null  float64
 8   8       12017 non-null  float64
 9   9       12017 non-null  float64
 10  10      12017 non-null  object 
dtypes: float64(10), object(1)
memory usage: 1.1+ MB


In [8]:
data.dtypes

0     float64
1     float64
2     float64
3     float64
4     float64
5     float64
6     float64
7     float64
8     float64
9     float64
10     object
dtype: object

# Challenge - train your models, make the best prediction

### Label encoder 
Encode target labels with value between 0 and n_classes-1.
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html

In [9]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
data["numeric"] = le.fit_transform(data["10"])

In [10]:
data.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,numeric
0,0.241386,0.109171,-0.127126,0.380626,0.17234,0.314889,0.484429,0.316412,0.18881,0.134922,Marcus,3
1,0.303106,0.352558,0.082701,0.703981,0.261718,-0.391033,0.408929,1.045014,0.282354,-0.448209,Clarius,0
2,-0.116585,0.281897,0.175168,-0.15249,0.261718,-0.889332,0.371178,-0.024328,0.905984,-0.87783,Philippus,6
3,-0.32643,-0.652394,0.384996,-1.694222,-0.185173,-1.138481,-0.232828,-1.747116,-1.183175,-0.80738,Philippus,6
4,-0.437525,-0.471816,0.463236,-0.545248,0.261718,-0.972381,0.824183,-3.108388,-2.9917,-1.14103,Philippus,6
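The numeric codes in the table above follow the alphabetical order of the copyist names. The sketch below, which uses only the eight names listed in this lab and no real data, shows how the mapping can be inspected and reversed. The variable names `encoder`, `label_map`, and `decoded` are illustrative and not part of the original notebook.

```python
from sklearn.preprocessing import LabelEncoder

# The 8 copyist names used in this lab, in arbitrary order
names = ["Marcus", "Clarius", "Philippus", "Coronavirucus",
         "Mongucus", "Paithonius", "Ubuntius", "Esequlius"]

encoder = LabelEncoder()
codes = encoder.fit_transform(names)

# classes_ holds the original labels sorted; the position is the numeric code
label_map = {name: code for code, name in enumerate(encoder.classes_)}
print(label_map)

# inverse_transform turns numeric predictions back into copyist names
decoded = encoder.inverse_transform([3, 0, 6])
print(list(decoded))
```

Keeping the fitted encoder around is useful later, because model predictions come back as numbers and usually need to be reported as names.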


In [11]:
columnas = [a for a in data.columns if a not in ["10","numeric"]]
X = data[columnas]
y = data["numeric"]

### Train_Test Split
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

In [15]:
from sklearn.model_selection import train_test_split

In [17]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2)
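Because the copyists are not equally represented, one option worth considering is a stratified, seeded split. The sketch below is not what the notebook ran; it uses synthetic stand-in data (`X_demo`, `y_demo` are assumptions) to show how `stratify` preserves class proportions and `random_state` makes the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the lab data: 100 rows, 10 features, imbalanced labels
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 10))
y_demo = np.array([0] * 70 + [1] * 20 + [2] * 10)

# stratify keeps the class proportions in both parts;
# random_state makes the split identical on every run
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

print("train class counts:", np.bincount(y_tr))
print("test class counts: ", np.bincount(y_te))
```

Without a fixed `random_state`, scores can shift slightly from run to run, which makes model comparisons harder to reproduce.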

### Train One Model

Classifier implementing the k-nearest neighbors vote. https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

In [18]:
from sklearn.neighbors import KNeighborsClassifier

In [30]:
knc = KNeighborsClassifier()

In [31]:
knc.fit(X_train,y_train)

KNeighborsClassifier()

In [32]:
y_pred = knc.predict(X_test)
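To make the "nearest neighbors vote" concrete, here is a minimal sketch on a made-up one-dimensional dataset (all values are illustrative assumptions, unrelated to the Avila data). With `n_neighbors=3`, each prediction is the majority class among the three closest training points.

```python
from sklearn.neighbors import KNeighborsClassifier

# Toy training set: three points of class 0 on the left, three of class 1 on the right
X_toy = [[0.0], [1.0], [2.0], [6.0], [10.0], [11.0]]
y_toy = [0, 0, 0, 1, 1, 1]

knn_toy = KNeighborsClassifier(n_neighbors=3)
knn_toy.fit(X_toy, y_toy)

# For x=4 the three nearest points are 2 (class 0), 6 (class 1) and 1 (class 0)
# For x=9 the three nearest points are 10, 11 and 6, all class 1
toy_pred = knn_toy.predict([[4.0], [9.0]])
print(toy_pred)

# predict_proba exposes the vote share behind each prediction
toy_proba = knn_toy.predict_proba([[4.0]])
print(toy_proba)
```

`KNeighborsClassifier()` with no arguments, as used above, defaults to 5 neighbors with uniform weights.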

### Exploring metrics for multi-class classification algorithms
https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics

- Accuracy
- Precision
- Recall
- F1_score

In [33]:
from sklearn.metrics import accuracy_score, precision_score, f1_score, recall_score

In [34]:
print("Accuracy", round(accuracy_score(y_test,y_pred),3))
print("Precision",round(precision_score(y_test,y_pred, average = "weighted"),3))
print("Recall", round(recall_score(y_test,y_pred, average = "weighted"),3))
print("F1_score", round(f1_score(y_test,y_pred,average= "weighted"),3))

Accuracy 0.751
Precision 0.755
Recall 0.751
F1_score 0.749
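Accuracy and weighted recall are identical in the output above, and that is not a coincidence: weighting each class's recall by its share of samples gives the overall fraction of correct predictions. The sketch below checks this on a small made-up label vector (the toy values are assumptions) and contrasts it with the `macro` average, which treats every class equally.

```python
from sklearn.metrics import accuracy_score, recall_score

# Toy labels: class 1 is the minority class
y_true_toy = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred_toy = [0, 0, 1, 0, 1, 2, 2, 2, 0, 2]

acc = accuracy_score(y_true_toy, y_pred_toy)

# "weighted": per-class recall averaged with class frequencies as weights
rec_weighted = recall_score(y_true_toy, y_pred_toy, average="weighted")

# "macro": plain mean of per-class recall, so small classes count as much as large ones
rec_macro = recall_score(y_true_toy, y_pred_toy, average="macro")

print("accuracy       ", round(acc, 3))
print("weighted recall", round(rec_weighted, 3))
print("macro recall   ", round(rec_macro, 3))
```

On an imbalanced dataset such as this one, reporting the macro average alongside the weighted one can reveal weak performance on the rarer copyists.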


### Train several models and explore the metrics for each of them

- DecisionTreeClassifier
- SVC
- RandomForestClassifier
- AdaBoostClassifier

In [35]:
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

In [36]:
models = { "tree": DecisionTreeClassifier(), 
          "forest": RandomForestClassifier(),
          "ada": AdaBoostClassifier(),
          "svc": SVC()   
}

In [39]:
models.items()

dict_items([('tree', DecisionTreeClassifier()), ('forest', RandomForestClassifier()), ('ada', AdaBoostClassifier()), ('svc', SVC())])

In [37]:
for name, model in models.items():
    print(f"Training ---> {name}")
    model.fit(X_train,y_train)
    print(f"Finished ----> {name}")

Training ---> tree
Finished ----> tree
Training ---> forest
Finished ----> forest
Training ---> ada
Finished ----> ada
Training ---> svc
Finished ----> svc


In [38]:
for name, model in models.items():
    y_pred = model.predict(X_test)
    print(f"------{name}------")
    print("Accuracy", round(accuracy_score(y_test,y_pred),3))
    print("Precision",round(precision_score(y_test,y_pred, average = "weighted"),3))
    print("Recall", round(recall_score(y_test,y_pred, average = "weighted"),3))
    print("F1_score", round(f1_score(y_test,y_pred,average= "weighted"),3))

------tree------
Accuracy 0.983
Precision 0.983
Recall 0.983
F1_score 0.983
------forest------
Accuracy 0.99
Precision 0.99
Recall 0.99
F1_score 0.99
------ada------
Accuracy 0.495
Precision 0.321
Recall 0.495
F1_score 0.346


  _warn_prf(average, modifier, msg_start, len(result))


------svc------
Accuracy 0.683
Precision 0.69
Recall 0.683
F1_score 0.647
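The `_warn_prf` line in the output above is the tail of a scikit-learn warning. It most likely corresponds to `UndefinedMetricWarning`, which is raised when a class is never predicted and its precision is therefore undefined; that would also fit the low weighted precision of the AdaBoost model. The sketch below reproduces the situation on made-up labels (an assumption, not the lab data) and shows the `zero_division` parameter that makes the fallback value explicit.

```python
import warnings
from sklearn.metrics import precision_score

# Toy labels where class 2 exists in the truth but is never predicted
y_true_toy = [0, 0, 1, 1, 2, 2]
y_pred_toy = [0, 0, 1, 1, 1, 1]

# Default behaviour: the undefined precision of class 2 is set to 0 with a warning
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    precision_score(y_true_toy, y_pred_toy, average="weighted")
warning_names = [w.category.__name__ for w in caught]
print(warning_names)

# Explicit behaviour: same value, no warning
prec = precision_score(
    y_true_toy, y_pred_toy, average="weighted", zero_division=0
)
print(round(prec, 3))
```

Silencing the warning does not fix the underlying problem: a model that never predicts some copyists still needs attention.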


In [41]:
for name, model in models.items():
    print(f"------{name}------")
    model.fit(X_train,y_train)
    y_pred = model.predict(X_test)
    print("----Metrics----")
    print("Accuracy", round(accuracy_score(y_test,y_pred),3))
    print("Precision",round(precision_score(y_test,y_pred, average = "weighted"),3))
    print("Recall", round(recall_score(y_test,y_pred, average = "weighted"),3))
    print("F1_score", round(f1_score(y_test,y_pred,average= "weighted"),3))

------tree------
----Metrics----
Accuracy 0.983
Precision 0.983
Recall 0.983
F1_score 0.983
------forest------
----Metrics----
Accuracy 0.987
Precision 0.987
Recall 0.987
F1_score 0.987
------ada------
----Metrics----
Accuracy 0.495
Precision 0.321
Recall 0.495
F1_score 0.346
------svc------


  _warn_prf(average, modifier, msg_start, len(result))


----Metrics----
Accuracy 0.683
Precision 0.69
Recall 0.683
F1_score 0.647
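Since the lab asks for a summary of the score obtained with each model, one possible way to present it is a single table instead of printed blocks. The sketch below is an assumption about how that could look: `summarize_models` is a hypothetical helper, and the data comes from `make_classification` rather than the Avila files.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


def summarize_models(models, X_train, X_test, y_train, y_test):
    """Fit each model and return one row of test metrics per model (hypothetical helper)."""
    rows = []
    for name, model in models.items():
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        rows.append({
            "model": name,
            "accuracy": round(accuracy_score(y_test, y_pred), 3),
            "f1_weighted": round(f1_score(y_test, y_pred, average="weighted"), 3),
        })
    # Best model first
    return pd.DataFrame(rows).set_index("model").sort_values("accuracy", ascending=False)


# Synthetic 3-class data standing in for the real features and labels
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

demo_models = {
    "tree": DecisionTreeClassifier(random_state=0),
    "knn": KNeighborsClassifier(),
}
summary = summarize_models(demo_models, X_tr, X_te, y_tr, y_te)
print(summary)
```

The same helper could be called with the `models` dictionary and the `X_train`, `X_test`, `y_train`, `y_test` variables defined earlier in this notebook.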


In [42]:
data.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,numeric
0,0.241386,0.109171,-0.127126,0.380626,0.17234,0.314889,0.484429,0.316412,0.18881,0.134922,Marcus,3
1,0.303106,0.352558,0.082701,0.703981,0.261718,-0.391033,0.408929,1.045014,0.282354,-0.448209,Clarius,0
2,-0.116585,0.281897,0.175168,-0.15249,0.261718,-0.889332,0.371178,-0.024328,0.905984,-0.87783,Philippus,6
3,-0.32643,-0.652394,0.384996,-1.694222,-0.185173,-1.138481,-0.232828,-1.747116,-1.183175,-0.80738,Philippus,6
4,-0.437525,-0.471816,0.463236,-0.545248,0.261718,-0.972381,0.824183,-3.108388,-2.9917,-1.14103,Philippus,6


In [60]:
r = RandomForestClassifier()
r.fit(X_train,y_train)

RandomForestClassifier()

In [61]:
# Predict on the full dataset; X includes the rows used for training, so this view is optimistic
data["pred"] = r.predict(X)

In [62]:
pd.crosstab(data.numeric, data.pred)

numeric (rows) \ pred (columns),0,1,2,3,4,5,6,7
0,2354,0,0,8,0,0,0,0
1,0,1007,0,2,0,0,0,0
2,0,0,424,0,0,0,3,0
3,1,0,0,5105,0,0,0,1
4,0,0,0,1,636,0,3,0
5,0,0,0,5,0,594,1,0
6,1,0,0,1,1,0,1357,0
7,0,0,0,0,0,0,0,512


In [63]:
data_test = X_test.copy()

In [64]:
y_test

9068     1
2955     3
2215     6
2571     3
10476    3
        ..
11004    6
1049     2
1883     6
6093     3
1222     3
Name: numeric, Length: 2404, dtype: int64

In [65]:
data_test["y"] = y_test

In [66]:
data_test.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,y
9068,0.73514,-0.283388,0.089814,-1.848127,0.976743,-1.30458,-0.157328,-2.086827,-0.528364,-1.000672,1
2955,0.15498,0.23479,0.210732,0.643738,0.17234,0.356414,0.484429,0.259273,-0.621908,0.178597,3
2215,-0.22768,0.446772,0.534365,0.177429,0.17234,-0.764757,0.371178,0.882468,1.217798,-0.761136,6
2571,0.105604,-0.149918,0.438342,0.934659,0.261718,0.190314,0.748682,0.449738,0.219991,-0.115488,3
10476,-0.00549,-0.220579,-3.210528,1.071885,0.261718,0.190314,-0.006326,1.550844,0.064084,0.299619,3


In [67]:
data_test["y_pred"] = r.predict(X_test)

In [68]:
data_test

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,y,y_pred
9068,0.735140,-0.283388,0.089814,-1.848127,0.976743,-1.304580,-0.157328,-2.086827,-0.528364,-1.000672,1,1
2955,0.154980,0.234790,0.210732,0.643738,0.172340,0.356414,0.484429,0.259273,-0.621908,0.178597,3,3
2215,-0.227680,0.446772,0.534365,0.177429,0.172340,-0.764757,0.371178,0.882468,1.217798,-0.761136,6,6
2571,0.105604,-0.149918,0.438342,0.934659,0.261718,0.190314,0.748682,0.449738,0.219991,-0.115488,3,3
10476,-0.005490,-0.220579,-3.210528,1.071885,0.261718,0.190314,-0.006326,1.550844,0.064084,0.299619,3,3
...,...,...,...,...,...,...,...,...,...,...,...,...
11004,-3.412392,0.148427,0.438342,-1.047781,-3.224030,-1.968978,1.805694,0.432186,-1.245538,-2.347688,6,6
1049,-0.277055,-1.963541,0.616162,0.437826,0.172340,2.640280,0.295677,0.099562,-0.278912,2.327389,2,2
1883,-0.363462,-0.008597,0.256965,-0.226340,0.261718,0.480988,0.144676,1.525235,1.061891,0.481085,6,6
6093,0.303106,-0.032150,0.367214,-0.350585,0.172340,-0.349509,0.559930,-0.329635,-0.372457,-0.495791,3,3


In [69]:
pd.crosstab(data_test.y, data_test.y_pred)

y (rows) \ y_pred (columns),0,1,2,3,4,5,6,7
0,491,0,0,8,0,0,0,0
1,0,217,0,2,0,0,0,0
2,0,0,82,0,0,0,3,0
3,1,0,0,976,0,0,0,1
4,0,0,0,1,116,0,3,0
5,0,0,0,5,0,116,1,0
6,1,0,0,1,1,0,290,0
7,0,0,0,0,0,0,0,88
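The test-set crosstab above is a confusion matrix, so the headline metrics can be read straight off it: accuracy is the diagonal sum divided by the total, per-class recall divides each diagonal entry by its row sum, and per-class precision divides it by its column sum. The sketch below copies the counts from that table by hand; the variable names are illustrative.

```python
import numpy as np

# Test-set confusion matrix copied from the crosstab above
# (rows: true label y, columns: predicted label y_pred)
cm = np.array([
    [491,   0,  0,   8,   0,   0,   0,  0],
    [  0, 217,  0,   2,   0,   0,   0,  0],
    [  0,   0, 82,   0,   0,   0,   3,  0],
    [  1,   0,  0, 976,   0,   0,   0,  1],
    [  0,   0,  0,   1, 116,   0,   3,  0],
    [  0,   0,  0,   5,   0, 116,   1,  0],
    [  1,   0,  0,   1,   1,   0, 290,  0],
    [  0,   0,  0,   0,   0,   0,   0, 88],
])

# Correct predictions sit on the diagonal: 2376 of 2404 test rows
accuracy = np.trace(cm) / cm.sum()
print("accuracy:", round(accuracy, 3))  # 0.988

# Recall per class: correct predictions divided by the true count of that class
recall_per_class = np.diag(cm) / cm.sum(axis=1)
print("recall:   ", np.round(recall_per_class, 3))

# Precision per class: correct predictions divided by how often that class was predicted
precision_per_class = np.diag(cm) / cm.sum(axis=0)
print("precision:", np.round(precision_per_class, 3))
```

Most of the misclassified rows land in column 3, the largest class, which is a common pattern when classes are imbalanced.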


![hope-you-enjoy](https://imgs.xkcd.com/comics/machine_learning.png)