# Lab | Avila Bible 

In this lab, we will explore the [**Avila Bible dataset**](https://archive.ics.uci.edu/ml/datasets/Avila), which has been extracted from 800 images of the 'Avila Bible', a giant XII-century Latin copy of the Bible. The prediction task consists in associating each pattern with a copyist. You will use supervised learning algorithms to figure out which feature patterns each copyist is likely to have, and use your model to predict the copyist.

-----------------------------------------------------------------------------------------------------------------

## Before you start:
    - Read the README.md file,
    - Comment as much as you can and use the APIla-bible in the README.md,
    - Happy learning!

In [17]:
# Import your libraries
import matplotlib.pyplot as plt
import pandas as pd
import requests
from sklearn.model_selection import train_test_split
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegressionCV, LogisticRegression
from sklearn.svm import NuSVC, SVC, LinearSVC
from sklearn.tree import ExtraTreeClassifier, DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

![machine-learning](https://miro.medium.com/proxy/1*halC1X4ydv_3yHYxKqvrwg.gif)

The Avila data set has been extracted from 800 images of the **Avila Bible**, a giant Latin copy of the whole Bible produced during the XII century between Italy and Spain. The palaeographic analysis of the manuscript has identified 12 copyists. The pages written by each copyist are not equally numerous.
Each pattern contains 10 features and corresponds to a group of 4 consecutive rows.
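
Because the pages written by each copyist are not equally numerous, it can help to look at how the labels are distributed before training anything. The snippet below is a minimal sketch on a small hypothetical label column (the names and counts are made up for illustration, not taken from the dataset):

```python
import pandas as pd

# Hypothetical labels standing in for the copyist column of the training data.
labels = pd.Series(
    ["Marcus"] * 6 + ["Clarius"] * 3 + ["Philippus"] * 1, name="copyist"
)

# Absolute and relative frequency of each class, most frequent first.
counts = labels.value_counts()
shares = labels.value_counts(normalize=True)

print(counts)
print(shares.round(2))
```

On the real data the same two calls could be applied to the label column of the training file.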

## What am I expected to do?

Well, your prediction task consists in associating each pattern with one of the 8 monks we will be evaluating (labeled as: Marcus, Clarius, Philippus, Coronavirucus, Mongucus, Paithonius, Ubuntius, Esequlius). To that end, you should:
- Train a minimum of 4 different models
- Perform a minimum of 4 feature extraction and engineering techniques
- Include a summary of the machine learning tools and algorithms you used
- Report the results or the score obtained with each of them
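
The lab does not prescribe which feature extraction and engineering techniques to use. As one possible sketch, scaling and PCA can be chained in front of a classifier with a scikit-learn pipeline. The data here is synthetic (`make_classification`), and the choice of 8 components and 50 trees is an arbitrary assumption for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 10-feature training data.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=6, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

# Two feature-engineering steps (scaling, PCA) followed by the model.
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=8),
    RandomForestClassifier(n_estimators=50, random_state=0),
)
pipeline.fit(X_tr, y_tr)

accuracy = pipeline.score(X_te, y_te)
print(f"Hold-out accuracy: {accuracy:.3f}")
```

Keeping the steps inside one pipeline means the transformations are fitted on the training rows only and reused unchanged at prediction time.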

You won't get many more instructions from now on. Remember to comment your code as much as you can. Keep the requirements in mind and have fun! 

Just one last piece of advice: take a moment to explore the data. Remember that this dataset contains two files: **train** and **test**. You will find both files in the `data` folder. The **test** file contains the data you will predict for, so it does not include the labels.
Use the **train** dataset as you wish, but don't forget to split it into **train** and **test** again so you can evaluate your models. Just be sure to retrain the chosen model on the whole training data before predicting.
We have also included a **sample submission**, which has the exact shape and format you must use when evaluating your predictions against the ground truth through the `APIla-bible`. It won't work unless your file has exactly the same shape.
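
The evaluate-then-retrain advice above can be sketched as follows. This is only an illustration on synthetic data: the 300/20 split between "labelled" and "unlabelled" rows and the choice of `DecisionTreeClassifier` are assumptions, not part of the lab.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: 300 labelled rows play the role of the train file,
# the last 20 rows play the role of the unlabelled test file.
X_full, y_full = make_classification(
    n_samples=320, n_features=10, n_informative=5, n_classes=3, random_state=1
)
X_labelled, y_labelled = X_full[:300], y_full[:300]
X_unlabelled = X_full[300:]

# Step 1: hold out part of the labelled data to estimate accuracy.
X_fit, X_val, y_fit, y_val = train_test_split(
    X_labelled, y_labelled, test_size=0.2, random_state=1
)
candidate = DecisionTreeClassifier(random_state=1)
val_score = clone(candidate).fit(X_fit, y_fit).score(X_val, y_val)
print(f"Validation accuracy: {val_score:.3f}")

# Step 2: refit a fresh copy on all labelled rows, then predict.
final_model = clone(candidate).fit(X_labelled, y_labelled)
predictions = final_model.predict(X_unlabelled)
print(predictions.shape)
```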



#### Train dataset

In [18]:
train_dataset = pd.read_csv('../data/training_dataset.csv', index_col=0)

In [19]:
train_dataset.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
0,0.241386,0.109171,-0.127126,0.380626,0.17234,0.314889,0.484429,0.316412,0.18881,0.134922,Marcus
1,0.303106,0.352558,0.082701,0.703981,0.261718,-0.391033,0.408929,1.045014,0.282354,-0.448209,Clarius
2,-0.116585,0.281897,0.175168,-0.15249,0.261718,-0.889332,0.371178,-0.024328,0.905984,-0.87783,Philippus
3,-0.32643,-0.652394,0.384996,-1.694222,-0.185173,-1.138481,-0.232828,-1.747116,-1.183175,-0.80738,Philippus
4,-0.437525,-0.471816,0.463236,-0.545248,0.261718,-0.972381,0.824183,-3.108388,-2.9917,-1.14103,Philippus


In [20]:
train_dataset.corr()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,1.0,-0.034603,0.040097,-0.050536,0.406481,-0.051888,-0.034114,-0.06432,0.074687,-0.006808
1,-0.034603,1.0,0.405835,-0.00322,-0.063053,0.396154,0.57537,0.036323,0.318548,-0.026659
2,0.040097,0.405835,1.0,0.128052,0.00059,0.108497,0.279027,0.030528,0.16252,-0.064532
3,-0.050536,-0.00322,0.128052,1.0,0.084093,0.2546,0.066038,0.357121,0.274736,0.304903
4,0.406481,-0.063053,0.00059,0.084093,1.0,0.042959,0.01953,-0.078455,0.279677,0.145087
5,-0.051888,0.396154,0.108497,0.2546,0.042959,1.0,0.469109,-0.035121,0.220883,0.776504
6,-0.034114,0.57537,0.279027,0.066038,0.01953,0.469109,1.0,0.024779,0.264194,0.299467
7,-0.06432,0.036323,0.030528,0.357121,-0.078455,-0.035121,0.024779,1.0,0.500367,0.006262
8,0.074687,0.318548,0.16252,0.274736,0.279677,0.220883,0.264194,0.500367,1.0,0.20068
9,-0.006808,-0.026659,-0.064532,0.304903,0.145087,0.776504,0.299467,0.006262,0.20068,1.0
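
A correlation matrix of this size is easier to read when only the strongest pairs are listed (in the output above, features 5 and 9 stand out at about 0.78). The sketch below shows one way to extract such pairs. It uses a small synthetic frame, and the 0.7 threshold is an arbitrary assumption:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)
demo = pd.DataFrame({
    "a": base,
    "b": base + rng.normal(scale=0.1, size=200),  # nearly a copy of "a"
    "c": rng.normal(size=200),                    # independent noise
})

corr = demo.corr().abs()

# Keep only the strict upper triangle so each pair appears once
# and the diagonal (always 1.0) is dropped.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().sort_values(ascending=False)

strong_pairs = pairs[pairs > 0.7]
print(strong_pairs)
```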


#### Test dataset


In [21]:
test_dataset = pd.read_csv('../data/test_dataset.csv', index_col=0)

In [22]:
test_dataset.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,-0.017834,0.132725,0.125378,1.357345,0.261718,0.190314,0.182426,0.445253,-0.715453,0.189796
1,-0.202992,-0.000745,-3.210528,-0.527256,0.082961,0.771662,0.144676,0.098572,0.251173,0.745333
2,1.019049,0.211237,-0.155578,-0.311855,0.261718,0.107265,0.484429,0.339303,-0.310094,-0.04963
3,0.451232,-0.267686,0.335206,-0.831336,0.261718,0.024215,0.220177,0.988787,0.032902,0.025485
4,-0.22768,0.109171,0.413447,0.118917,0.17234,0.480988,0.52218,0.091562,0.313536,0.256389


#### Sample submission

In [23]:
sample_submission = pd.read_csv('../data/sample_submission.csv', header=None, index_col=0)

In [24]:
sample_submission.head()

0,1
0,Philippus
1,Ubuntius
2,Esequlius
3,Coronavirucus
4,Philippus


`Keep calm and code on!`

# Challenge - train your models, make the best prediction

In [25]:
#your code
X = train_dataset.drop(columns=["10"])
y = train_dataset['10']

In [26]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(9613, 10) (2404, 10) (9613,) (2404,)
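
Since the classes are not equally numerous, a stratified and seeded split is one option worth considering: `stratify` keeps the class proportions the same in both parts, and `random_state` makes the split reproducible. A minimal sketch on made-up data (the 80/20 class ratio and the seed 42 are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 80 rows of one class, 20 of another.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array(["Marcus"] * 80 + ["Clarius"] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# The minority class keeps its 20% share in both parts.
print((y_tr == "Clarius").sum(), (y_te == "Clarius").sum())
```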


In [27]:
models = {
    # Multiclass as One-Vs-One: both are SVM-type models used for classification
    "NuSupport": NuSVC(nu=0.1),
    "CSupport": SVC(C=1.0),
    # Multiclass as One-Vs-The-Rest
    "GradientBoosting": GradientBoostingClassifier(),  # ensemble method
    "Linear": LinearSVC(multi_class="ovr"),  # SVM-type model used for classification
    "Logistic": LogisticRegression(multi_class="ovr"),  # linear model
    "LogisCV": LogisticRegressionCV(Cs=10, multi_class="ovr"),  # linear model
    # Support multilabel
    "DecTree": DecisionTreeClassifier(),  # decision-tree-based classifier (non-parametric supervised learning)
    "ExtraTree": ExtraTreeClassifier(),  # decision-tree-based classifier (non-parametric supervised learning)
    "KNeigh": KNeighborsClassifier(n_neighbors=5),  # classification for data with discrete labels
    "NN": MLPClassifier(),  # neural network model
    "RandomForest": RandomForestClassifier(n_estimators=100),  # ensemble method
    }

In [28]:
for name, m in models.items():
    m.fit(X_train, y_train)
    #y_train_pred = m.predict(X_test)
    #y_scores = m.predict_proba(X_test)  #Return probability estimates for the vector X
    scores = m.score(X_test, y_test) #Return the mean accuracy on the given test data and labels.
    print(f"Score for {name} model is: {scores}")

Score for NuSupport model is: 0.7371048252911814
Score for CSupport model is: 0.668053244592346
Score for GradientBoosting model is: 0.9276206322795341




Score for Linear model is: 0.5162229617304492
Score for Logistic model is: 0.5299500831946755
Score for LogisCV model is: 0.512063227953411
Score for DecTree model is: 0.968801996672213
Score for ExtraTree model is: 0.8826955074875208
Score for KNeigh model is: 0.7296173044925125




Score for NN model is: 0.7753743760399334
Score for RandomForest model is: 0.9800332778702163


In [43]:
# The LinearSVC and MLPClassifier models did not converge even after reaching the maximum number of iterations, so we discard them
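
Before discarding a model that did not converge, one thing that could be tried is standardizing the features and raising `max_iter` (both `LinearSVC` and `MLPClassifier` accept this parameter). The sketch below does this for `LinearSVC` on synthetic data and counts any convergence warnings; `max_iter=10000` is an illustrative value, not a tuned one, and there is no guarantee it would be enough on the real data:

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=6, n_classes=3, random_state=0
)

# Scale first, then fit with a larger iteration budget than the default (1000).
model = make_pipeline(StandardScaler(), LinearSVC(max_iter=10000))

# Record warnings raised during fitting instead of printing them.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    model.fit(X_demo, y_demo)

n_convergence_warnings = sum(
    issubclass(w.category, ConvergenceWarning) for w in caught
)
train_accuracy = model.score(X_demo, y_demo)
print(f"Convergence warnings: {n_convergence_warnings}")
print(f"Training accuracy: {train_accuracy:.3f}")
```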

In [29]:
# I tested this one separately because the model takes a long time to fit
m = GaussianProcessClassifier(multi_class = "one_vs_one")
m.fit(X_train, y_train)

GaussianProcessClassifier(copy_X_train=True, kernel=None, max_iter_predict=100,
                          multi_class='one_vs_one', n_jobs=None,
                          n_restarts_optimizer=0, optimizer='fmin_l_bfgs_b',
                          random_state=None, warm_start=False)

In [30]:
#y_train_pred = m.predict(X_test)
scores = m.score(X_test, y_test)
scores

0.7333610648918469

The linear models (LogisticRegression, LogisticRegressionCV) do not predict well.
The 3 best models are RandomForestClassifier, DecisionTreeClassifier and GradientBoostingClassifier, so I will retrain these models on the complete training data (without splitting).

Two of the best models are ensemble models (RandomForestClassifier, GradientBoostingClassifier):

1) RandomForestClassifier is a bagging algorithm: it draws multiple samples from the training dataset (with replacement) and trains a model on each sample. The prediction is an average of the predictions of all the models trained on those samples. It also reduces the correlation between the classifiers.

2) GradientBoostingClassifier is a boosting algorithm: it builds a sequence of models, each of which tries to correct the errors of the previous ones.

As for DecisionTreeClassifier, or decision tree models in general, the algorithm learns a set of yes/no questions that lead to the decision. They can combine both categorical and numerical data.
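
To make the bagging idea concrete, here is a hand-rolled sketch on synthetic data: each tree is trained on a bootstrap sample (rows drawn with replacement) and the trees then vote. Note the simplifications: this version uses a plain majority vote, whereas scikit-learn's `RandomForestClassifier` averages the trees' predicted probabilities, and the 25 trees and `max_features="sqrt"` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=6, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

rng = np.random.default_rng(0)
n_trees = 25
n_classes = 3
votes = np.zeros((len(X_te), n_classes), dtype=int)  # one column per class

for _ in range(n_trees):
    # Bootstrap sample: draw row indices with replacement.
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    # Random feature subsets per split reduce correlation between trees.
    tree = DecisionTreeClassifier(
        max_features="sqrt", random_state=int(rng.integers(1_000_000))
    )
    tree.fit(X_tr[idx], y_tr[idx])
    preds = tree.predict(X_te)
    # Each tree casts one vote per test row (labels are 0, 1, 2).
    votes[np.arange(len(X_te)), preds] += 1

bagged_pred = votes.argmax(axis=1)
bagged_accuracy = np.mean(bagged_pred == y_te)
print(f"Bagged accuracy: {bagged_accuracy:.3f}")
```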

In [31]:
# Retrain the models, this time on the complete dataset
model = GradientBoostingClassifier()
model.fit(X, y)
y_predBoosting = model.predict(test_dataset)

In [32]:
model = DecisionTreeClassifier()
model.fit(X, y)
y_predDecTree = model.predict(test_dataset)

In [33]:
model = RandomForestClassifier(n_estimators=100)
model.fit(X, y)
y_predRandomForest = model.predict(test_dataset)

## What do I do once I have a prediction?

Once you have trained your model and made a prediction with it, you are ready to check its accuracy. 

Save your prediction as a `.csv` file.

In [35]:
#your code here
test_Bossting = pd.DataFrame(y_predBoosting) 
test_DecisionTree = pd.DataFrame(y_predDecTree)
test_RandomForest = pd.DataFrame(y_predRandomForest)

In [39]:
test_Bossting.to_csv("../data/test_Boosting.csv", header=False)
test_DecisionTree.to_csv("../data/test_DecTree.csv", header=False)
test_RandomForest.to_csv("../data/test_RandomForest.csv", header=False)
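
Because the submission must have exactly the same shape as the sample, it can be worth reading a saved file back the same way the sample submission was loaded and checking the result. The sketch below does the round trip through an in-memory buffer instead of a real file; the three predictions are made up for illustration:

```python
import io

import pandas as pd

# Hypothetical predictions standing in for a model's output.
predictions = pd.DataFrame(["Philippus", "Ubuntius", "Esequlius"])

# Write without a header row, keeping the integer index as first column.
buffer = io.StringIO()
predictions.to_csv(buffer, header=False)

# Read it back the same way the sample submission is loaded.
buffer.seek(0)
reloaded = pd.read_csv(buffer, header=None, index_col=0)

print(buffer.getvalue())
print(reloaded.shape)
```

With real files, the reloaded shape could then be compared against `sample_submission.shape`.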

Now you are ready to know the truth! Are you good enough to call yourself a pro?

Luckily, you have the ultimate **APIla-bible**, which gives you the chance to check the accuracy of your predictions as many times as you need in order to become the pro you want to be. 

## How do I post my prediction to the APIla-bible?

Easy peasy! You only need to fill in the path to your prediction `.csv` and run the cell below! 

In [40]:
my_submission = "../data/test_Boosting.csv"
with open(my_submission) as f:
    res = requests.post("http://apila-bible.herokuapp.com/check", files={"csv_data":f.read()})
res.json()

{'accuracy': 0.9377184223664503,
 'quote': "Great job! That's an impressive score. Will you give it an extra push? Almost at the top, care for a final `boost`?"}

In [41]:
my_submission = "../data/test_DecTree.csv"
with open(my_submission) as f:
    res = requests.post("http://apila-bible.herokuapp.com/check", files={"csv_data":f.read()})
res.json()

{'accuracy': 0.9883924113829257,
 'quote': "AWESOME! A-W-E-S-O-M-E! Amazing score!!! So cool! I can't even... But wait, maybe...too good to be true? Overfit much?",
 'tip': 'If you think you may have overfitted your model, visit http://apila-bible.herokuapp.com/check/overfit on your browser for some follow up. ;)'}

In [42]:
my_submission = "../data/test_RandomForest.csv"
with open(my_submission) as f:
    res = requests.post("http://apila-bible.herokuapp.com/check", files={"csv_data":f.read()})
res.json()

{'accuracy': 0.9936345481777334,
 'quote': "AWESOME! A-W-E-S-O-M-E! Amazing score!!! So cool! I can't even... But wait, maybe...too good to be true? Overfit much?",
 'tip': 'If you think you may have overfitted your model, visit http://apila-bible.herokuapp.com/check/overfit on your browser for some follow up. ;)'}
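
The API's remark about overfitting can be checked locally by comparing accuracy on the training rows with accuracy on held-out rows: a large gap suggests the model has memorized the training data. This is a sketch on synthetic, deliberately noisy data (`flip_y=0.1`), comparing an unrestricted tree with a depth-limited one; the depth of 4 is an arbitrary assumption:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10% of the labels flipped at random.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=5, n_classes=3,
    flip_y=0.1, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

settings = {"unlimited": None, "max_depth=4": 4}
gaps = {}
for label, depth in settings.items():
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    train_acc = tree.score(X_tr, y_tr)
    test_acc = tree.score(X_te, y_te)
    gaps[label] = (train_acc, test_acc)
    print(f"{label}: train={train_acc:.3f} test={test_acc:.3f}")
```

An unrestricted tree fits its training rows perfectly, so its training score alone says little; the hold-out score is the one to watch.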

![hope-you-enjoy](https://imgs.xkcd.com/comics/machine_learning.png)