# Lab | Avila Bible 

In this lab, we will explore the [**Avila Bible dataset**](https://archive.ics.uci.edu/ml/datasets/Avila), which has been extracted from 800 images of the 'Avila Bible', a giant XII century Latin copy of the Bible. The prediction task consists of associating each pattern with a copyist. You will use supervised learning algorithms to figure out which feature patterns each copyist is likely to have, and then use your model to predict the copyist of each pattern.

-----------------------------------------------------------------------------------------------------------------

## Before you start:
    - Read the README.md file,
    - Comment as much as you can and use the APIla-bible in the README.md,
    - Happy learning!

In [130]:
# Import your libraries
import pandas as pd
import requests
import seaborn as sns
import numpy as np
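from sklearn import datasets, svm, metrics
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split  # used below to hold out 20% for evaluation
from sklearn.metrics import classification_report  # used below for per-class precision/recall/F1
from scipy import stats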


![machine-learning](https://miro.medium.com/proxy/1*halC1X4ydv_3yHYxKqvrwg.gif)

The Avila data set has been extracted from 800 images of the **Avila Bible**, a giant Latin copy of the whole Bible produced during the XII century between Italy and Spain. The palaeographic analysis of the manuscript has identified 12 copyists. The pages written by each copyist are not equally numerous.
Each pattern contains 10 features and corresponds to a group of 4 consecutive rows.

## What am I expected to do?

Well, your prediction task consists in associating each pattern to one of the 8 monks we will be evaluating (labeled as:  Marcus, Clarius, Philippus, Coronavirucus, Mongucus, Paithonius, Ubuntius, Esequlius). For that aim, you should: 
- Train a minimum of 4 different models
- Perform a minimum of 4 Feature Extraction and Engineering techniques
- Include a summary of the machine learning tools and algorithms you used
- Include the results or the score obtained with each of them

You won't get many more instructions from now on. Remember to comment your code as much as you can. Keep the requirements in mind and have fun!

Just one last piece of advice: take a moment to explore the data. Remember that this dataset contains two files: **train** and **test**. You will find both files in the `data` folder. The **test** file contains the data you will predict for, so it does not include the labels.
Use the **train** dataset as you wish, but don't forget to split it into **train** and **test** again so you can evaluate your models. Just be sure to train the model again on the whole data before predicting.
We have also included a **sample submission**, which has the exact shape and format you must use when evaluating your predictions against the ground truth through the `APIla-bible`. It won't work unless your file has exactly the same shape.
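
The split, evaluate, then retrain-on-everything workflow described above can be sketched as follows. This is only an illustration on synthetic data: the frame `demo`, its two made-up classes, and the choice of `RandomForestClassifier` are assumptions for the example, not part of the lab files.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in (hypothetical) for the training file:
# 10 numeric feature columns "0".."9" plus a label column "10".
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(400, 10)), columns=[str(i) for i in range(10)])
demo["10"] = np.where(demo["0"] + demo["1"] > 0, "Marcus", "Clarius")

X_demo = demo.drop(columns="10")
y_demo = demo["10"]  # a 1-D Series rather than a one-column DataFrame

# Hold out 20% for evaluation; stratify keeps class proportions similar in both parts.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

model = RandomForestClassifier(n_estimators=50, random_state=42)
model.fit(X_tr, y_tr)
val_accuracy = accuracy_score(y_val, model.predict(X_val))
print(f"hold-out accuracy: {val_accuracy:.3f}")

# Once satisfied, refit on all labelled rows before predicting the unlabelled file.
final_model = RandomForestClassifier(n_estimators=50, random_state=42)
final_model.fit(X_demo, y_demo)
```

Fixing `random_state` makes the split reproducible, so scores from different models are compared on the same held-out rows.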



#### Train dataset

In [2]:
train_dataset = pd.read_csv('../data/training_dataset.csv', index_col=0)

In [4]:
train_dataset.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
0,0.241386,0.109171,-0.127126,0.380626,0.17234,0.314889,0.484429,0.316412,0.18881,0.134922,Marcus
1,0.303106,0.352558,0.082701,0.703981,0.261718,-0.391033,0.408929,1.045014,0.282354,-0.448209,Clarius
2,-0.116585,0.281897,0.175168,-0.15249,0.261718,-0.889332,0.371178,-0.024328,0.905984,-0.87783,Philippus
3,-0.32643,-0.652394,0.384996,-1.694222,-0.185173,-1.138481,-0.232828,-1.747116,-1.183175,-0.80738,Philippus
4,-0.437525,-0.471816,0.463236,-0.545248,0.261718,-0.972381,0.824183,-3.108388,-2.9917,-1.14103,Philippus


In [57]:
#Check the shape
train_dataset.shape

(12017, 11)

In [128]:
#See how much each copyist has written
train_dataset["10"].value_counts() #I may rename this column

Marcus           5107
Clarius          2362
Philippus        1360
Coronavirucus    1009
Mongucus          640
Paithonius        600
Ubuntius          512
Esequlius         427
Name: 10, dtype: int64
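
Since the classes are unbalanced, it can help to know what accuracy a trivial "always predict the most frequent copyist" rule would reach. The small sketch below recomputes that baseline from the counts printed above; the variable names are made up for the example.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({
    "Marcus": 5107, "Clarius": 2362, "Philippus": 1360, "Coronavirucus": 1009,
    "Mongucus": 640, "Paithonius": 600, "Ubuntius": 512, "Esequlius": 427,
})

proportions = counts / counts.sum()   # share of rows per copyist
majority_baseline = proportions.max() # accuracy of always predicting the top class

print(proportions.round(3))
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```

Any model worth keeping should clearly beat this baseline on held-out rows.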

In [59]:
#Data types. All fine, nothing unusual
train_dataset.dtypes 

0     float64
1     float64
2     float64
3     float64
4     float64
5     float64
6     float64
7     float64
8     float64
9     float64
10     object
dtype: object

In [60]:
#Null check. There are none, all fine
train_dataset.isnull().sum() 

0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
dtype: int64

In [61]:
#Correlation between columns. All fine, no pair of columns is excessively correlated

corr = train_dataset.corr().unstack().sort_values(ascending=False).drop_duplicates()
corr.head(2)


9  9    1.000000
5  9    0.776504
dtype: float64
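
The first row of that output is a column correlated with itself, which is always 1. One possible way to list only distinct column pairs is to keep the strict upper triangle of the correlation matrix. The sketch below uses a small synthetic frame (`demo`, with hypothetical columns `a`, `b`, `c`) rather than the lab data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in (hypothetical): "b" is built to be strongly correlated with "a".
rng = np.random.default_rng(4)
a = rng.normal(size=500)
demo = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=500),
    "c": rng.normal(size=500),
    "label": "Marcus",  # non-numeric column, excluded below
})

corr = demo.select_dtypes(include="number").corr()

# Keep only entries above the diagonal: each pair once, no self-correlations.
upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
top_pairs = corr.where(upper_mask).stack().dropna().sort_values(ascending=False)

print(top_pairs)
```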

In [62]:
#Some columns appear to have outliers.
train_dataset.describe() 

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
count,12017.0,12017.0,12017.0,12017.0,12017.0,12017.0,12017.0,12017.0,12017.0,12017.0
mean,0.02128,0.030684,-0.000379,-0.022127,0.006801,-0.001279,0.03254,-0.00672,-0.011368,-0.026942
std,1.004481,3.66103,1.072319,1.002045,0.963679,1.108192,1.245215,1.012995,1.085821,0.985799
min,-3.498799,-2.426761,-3.210528,-5.440122,-4.922215,-7.450257,-11.935457,-4.164819,-5.486218,-6.719324
25%,-0.128929,-0.259834,0.064919,-0.542563,0.17234,-0.598658,-0.006326,-0.555747,-0.372457,-0.528135
50%,0.056229,-0.055704,0.214288,0.080127,0.261718,-0.058835,0.220177,0.101115,0.064084,-0.053548
75%,0.216699,0.203385,0.349432,0.601905,0.261718,0.522513,0.446679,0.646377,0.500624,0.491862
max,11.819916,386.0,50.0,3.987152,1.066121,53.0,83.0,13.173081,44.0,11.911338


In [157]:
#Confirm the outliers. I will first try a couple of prediction models without removing them because they may be
#valuable. Then I will try the same models without outliers

train_dataset.skew(axis=0)

0   -0.717434
1    5.698805
2   -3.156520
3   -0.349086
4   -2.212842
5    0.371734
6   -1.988341
7   -0.435788
8   -0.171133
9    0.124302
dtype: float64

In [94]:
#Split into numerical columns (features, X) and categorical column (label, y)

X = train_dataset.select_dtypes(include=["number"])
y = train_dataset.select_dtypes(exclude=["number"])

In [98]:
#X.head()
y.head()

Unnamed: 0,10
0,Marcus
1,Clarius
2,Philippus
3,Philippus
4,Philippus


#### Test dataset


In [99]:
test_dataset = pd.read_csv('../data/test_dataset.csv', index_col=0)

In [100]:
test_dataset.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,-0.017834,0.132725,0.125378,1.357345,0.261718,0.190314,0.182426,0.445253,-0.715453,0.189796
1,-0.202992,-0.000745,-3.210528,-0.527256,0.082961,0.771662,0.144676,0.098572,0.251173,0.745333
2,1.019049,0.211237,-0.155578,-0.311855,0.261718,0.107265,0.484429,0.339303,-0.310094,-0.04963
3,0.451232,-0.267686,0.335206,-0.831336,0.261718,0.024215,0.220177,0.988787,0.032902,0.025485
4,-0.22768,0.109171,0.413447,0.118917,0.17234,0.480988,0.52218,0.091562,0.313536,0.256389


#### Sample submission

In [101]:
sample_submission = pd.read_csv('../data/sample_submission.csv', header=None, index_col=0)

In [102]:
sample_submission.head()

0,1
0,Philippus
1,Ubuntius
2,Esequlius
3,Coronavirucus
4,Philippus


`Keep calm and code on!`

# Challenge - train your models, make the best prediction

### MODEL 1 ---> SVC

In [105]:
#your code
#MODEL 1 ---> SVC

#Train the model

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

modelo_1 = svm.SVC(gamma='scale')
modelo_1.fit(X_train,y_train)

  y = column_or_1d(y, warn=True)


SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

In [107]:
#Prediction on my held-out 0.2
y_pred = modelo_1.predict(X_test)

In [108]:
#Check how it went. Not too bad. Let's see against the real test set.
print(classification_report(y_test,y_pred))  

               precision    recall  f1-score   support

      Clarius       0.62      0.44      0.51       466
Coronavirucus       0.98      0.98      0.98       184
    Esequlius       0.85      0.17      0.28       100
       Marcus       0.70      0.89      0.78      1030
     Mongucus       0.93      0.85      0.89       136
   Paithonius       0.81      0.50      0.62       118
    Philippus       0.71      0.76      0.73       272
     Ubuntius       0.70      0.46      0.56        98

     accuracy                           0.73      2404
    macro avg       0.79      0.63      0.67      2404
 weighted avg       0.73      0.73      0.71      2404



In [112]:
#Prediction on test_dataset
y_pred_total = modelo_1.predict(test_dataset)
y_pred_total = pd.DataFrame(y_pred_total)


In [113]:
#Check with the API

res = requests.post("http://apila-bible.herokuapp.com/check", files={"csv_data":y_pred_total.to_csv(header=None)})
res.json()

{'accuracy': 0.7302795806290564,
 'quote': "Close, but no cigar. It's a good begining. How can you improve it more? Maybe try some different models?"}

### MODEL 2 ---> Logistic Regression

In [115]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

modelo_2 = LogisticRegression()
modelo_2.fit(X_train,y_train)

  y = column_or_1d(y, warn=True)
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
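
The convergence warning above suggests either raising `max_iter` or scaling the data. A minimal sketch of doing both with a scikit-learn pipeline is shown below. It runs on synthetic data, and the value `max_iter=1000` is an arbitrary assumption, not a tuned setting for this lab.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (hypothetical): one feature lives on a much larger scale.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(300, 10))
X_demo[:, 1] *= 50.0
y_demo = np.where(X_demo[:, 0] + X_demo[:, 1] / 50.0 > 0, "Marcus", "Clarius")

# The scaler is fitted inside the pipeline, so the same transformation
# is applied automatically whenever the pipeline predicts on new rows.
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logreg.fit(X_demo, y_demo)

train_accuracy = scaled_logreg.score(X_demo, y_demo)
iterations_used = scaled_logreg.named_steps["logisticregression"].n_iter_[0]
print(f"training accuracy: {train_accuracy:.3f}, iterations used: {iterations_used}")
```

Whether scaling actually helps on the Avila features would need to be checked on a held-out split.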

In [116]:
#Prediction on my held-out 0.2
y_pred = modelo_2.predict(X_test)

In [117]:
#Check how it went. It looks very bad. Let's see against the real test set.
print(classification_report(y_test,y_pred))  

               precision    recall  f1-score   support

      Clarius       0.38      0.09      0.15       463
Coronavirucus       0.89      0.89      0.89       223
    Esequlius       0.00      0.00      0.00        80
       Marcus       0.54      0.94      0.69      1020
     Mongucus       0.75      0.76      0.76       129
   Paithonius       0.37      0.12      0.18       120
    Philippus       0.49      0.27      0.35       261
     Ubuntius       0.00      0.00      0.00       108

     accuracy                           0.57      2404
    macro avg       0.43      0.38      0.38      2404
 weighted avg       0.50      0.57      0.49      2404



  _warn_prf(average, modifier, msg_start, len(result))


In [119]:
#Prediction on test_dataset
y_pred_total = modelo_2.predict(test_dataset)
y_pred_total = pd.DataFrame(y_pred_total)

In [120]:
#Check with the API
#Terrible, I discard this model and look for another one

res = requests.post("http://apila-bible.herokuapp.com/check", files={"csv_data":y_pred_total.to_csv(header=None)})
res.json()

{'accuracy': 0.5766350474288567,
 'quote': 'Nope, not good enough. But you shall rise as the glorious phoenix from the ashes of this score and get to the top!'}

### MODEL 3 ---> Random Forest Classifier

In [121]:
#your code
#Train the model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

modelo_3 = RandomForestClassifier()
modelo_3.fit(X_train,y_train)

  


RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [122]:
#your code
#Prediction on my held-out 0.2
y_pred = modelo_3.predict(X_test)

In [123]:
print(classification_report(y_test,y_pred))  #Great!!! Let's see if it is real

               precision    recall  f1-score   support

      Clarius       0.99      0.99      0.99       473
Coronavirucus       1.00      0.99      1.00       196
    Esequlius       1.00      0.98      0.99        89
       Marcus       0.98      1.00      0.99      1001
     Mongucus       1.00      0.98      0.99       144
   Paithonius       0.98      0.94      0.96       115
    Philippus       0.99      1.00      0.99       277
     Ubuntius       0.98      0.96      0.97       109

     accuracy                           0.99      2404
    macro avg       0.99      0.98      0.98      2404
 weighted avg       0.99      0.99      0.99      2404



In [124]:
#Prediction on test_dataset
y_pred_total = modelo_3.predict(test_dataset)
y_pred_total = pd.DataFrame(y_pred_total)

In [125]:
#Check with the API
#Great!!!!!!!!

res = requests.post("http://apila-bible.herokuapp.com/check", files={"csv_data":y_pred_total.to_csv(header=None)})
res.json()

{'accuracy': 0.9890164752870694,
 'quote': "AWESOME! A-W-E-S-O-M-E! Amazing score!!! So cool! I can't even... But wait, maybe...too good to be true? Overfit much?",
 'tip': 'If you think you may have overfitted your model, visit http://apila-bible.herokuapp.com/check/overfit on your browser for some follow up. ;)'}
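
One common way to probe the "overfit much?" question locally is k-fold cross-validation: if the accuracy stays similar across several different hold-out folds, a single lucky split is a less likely explanation. The sketch below uses synthetic data and an arbitrary choice of five folds; both are assumptions for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (hypothetical): 300 rows, 10 features, 2 classes.
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(300, 10))
y_demo = np.where(X_demo[:, 0] - X_demo[:, 2] > 0, "Marcus", "Clarius")

# Five stratified folds: every row is used for validation exactly once.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
forest = RandomForestClassifier(n_estimators=50, random_state=0)
cv_scores = cross_val_score(forest, X_demo, y_demo, cv=cv, scoring="accuracy")

print("fold accuracies:", np.round(cv_scores, 3))
print(f"mean={cv_scores.mean():.3f}  std={cv_scores.std():.3f}")
```

A large spread between folds, or a mean far below the single-split score, would be a hint that the single-split result was optimistic.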

#### OUT OF CURIOSITY, I WILL TRY THE TWO BEST MODELS (1 AND 3) WITHOUT OUTLIERS TO SEE WHAT HAPPENS.

In [126]:
train_dataset

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
0,0.241386,0.109171,-0.127126,0.380626,0.172340,0.314889,0.484429,0.316412,0.188810,0.134922,Marcus
1,0.303106,0.352558,0.082701,0.703981,0.261718,-0.391033,0.408929,1.045014,0.282354,-0.448209,Clarius
2,-0.116585,0.281897,0.175168,-0.152490,0.261718,-0.889332,0.371178,-0.024328,0.905984,-0.877830,Philippus
3,-0.326430,-0.652394,0.384996,-1.694222,-0.185173,-1.138481,-0.232828,-1.747116,-1.183175,-0.807380,Philippus
4,-0.437525,-0.471816,0.463236,-0.545248,0.261718,-0.972381,0.824183,-3.108388,-2.991700,-1.141030,Philippus
...,...,...,...,...,...,...,...,...,...,...,...
12012,0.093260,-0.087108,-2.268081,-0.164963,0.261718,0.148790,0.333428,0.587587,0.219991,0.072596,Marcus
12013,-0.215336,0.101320,0.235627,-0.280585,0.261718,-1.719828,-0.308329,1.008086,-0.154186,-1.302496,Philippus
12014,0.031541,0.297600,-3.210528,-0.583590,-0.721442,-0.224934,0.333428,0.664239,0.687713,-0.224659,Marcus
12015,0.266074,0.580242,0.114709,-0.165469,0.261718,0.024215,0.446679,0.428536,0.375899,-0.103698,Marcus


In [148]:
#Remove the outliers: keep only rows whose checked features all have |z-score| < 3

train_dataset_so=train_dataset[(np.abs(stats.zscore(train_dataset[train_dataset.columns[0:9]])) < 3).all(axis=1)]
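
Note that the slice `train_dataset.columns[0:9]` covers features 0 to 8 only, so feature 9 is not checked. If the intention is to screen every numeric feature, a variant along these lines could be used. It is a sketch on a synthetic frame (`demo` is made up for the example), and the threshold of 3 simply mirrors the cell above.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in (hypothetical): 10 numeric features plus a label column "10".
rng = np.random.default_rng(2)
demo = pd.DataFrame(rng.normal(size=(200, 10)), columns=[str(i) for i in range(10)])
demo["10"] = "Marcus"
demo.loc[0, "9"] = 100.0  # plant an extreme value in the last feature

# Select every numeric column instead of slicing by position.
feature_cols = demo.select_dtypes(include="number").columns

# Column-wise z-scores; a row is kept only if all of its features are within 3 std.
z = np.abs(stats.zscore(demo[feature_cols].to_numpy(), axis=0))
keep_mask = (z < 3).all(axis=1)
demo_no_outliers = demo[keep_mask]

print(f"rows before: {len(demo)}, rows after: {len(demo_no_outliers)}")
```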


### MODEL 1 ---> SVC (without outliers)


In [149]:
#your code
#MODEL 1 ---> SVC
#Redefine X and y with my new train_dataset

X = train_dataset_so.select_dtypes(include=["number"])
y = train_dataset_so.select_dtypes(exclude=["number"])

#Train the model

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

modelo_1_so = svm.SVC(gamma='scale')
modelo_1_so.fit(X_train,y_train)

  y = column_or_1d(y, warn=True)


SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

In [150]:
#Prediction on my held-out 0.2
y_pred_so = modelo_1_so.predict(X_test)

In [151]:
#Prediction on test_dataset
y_pred_total_so = modelo_1_so.predict(test_dataset)
y_pred_total_so = pd.DataFrame(y_pred_total_so)

In [152]:
#Check with the API
#IT DEFINITELY GETS WORSE; LET'S SEE WHETHER THE SAME HAPPENS WITH RANDOM FOREST
res = requests.post("http://apila-bible.herokuapp.com/check", files={"csv_data":y_pred_total_so.to_csv(header=None)})
res.json()

{'accuracy': 0.6862206689965052,
 'quote': 'Nope, not good enough. But you shall rise as the glorious phoenix from the ashes of this score and get to the top!'}

### MODEL 3 ---> Random Forest Classifier (WITHOUT OUTLIERS)


In [153]:
#your code

#Redefine X and y with my new train_dataset

X = train_dataset_so.select_dtypes(include=["number"])
y = train_dataset_so.select_dtypes(exclude=["number"])

#Train the model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

modelo_3_so = RandomForestClassifier()
modelo_3_so.fit(X_train,y_train)



RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [154]:
#your code
#Prediction on my held-out 0.2
y_pred_so = modelo_3_so.predict(X_test)

In [155]:
#Prediction on test_dataset
y_pred_total_so = modelo_3_so.predict(test_dataset)
y_pred_total_so = pd.DataFrame(y_pred_total_so)

In [156]:
#Check with the API
#IT ALSO GETS CONSIDERABLY WORSE
res = requests.post("http://apila-bible.herokuapp.com/check", files={"csv_data":y_pred_total_so.to_csv(header=None)})
res.json()

{'accuracy': 0.8971542685971043,
 'quote': "It's good, but I'm sure you can do better! Try different models, adjust the hyperparameters, some fine tuning can lead you a long way."}

## What do I do once I have a prediction?

Once you have trained your model and made a prediction with it, you are ready to check its accuracy.

Save your prediction as a `.csv` file.

In [13]:
#your code here
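
A possible way to fill in this cell is sketched below. It writes a one-column frame of predictions without a header row, which is the layout the notebook reads back with `header=None, index_col=0`. The three labels and the temporary output path are placeholders for the example; in the lab you would write your real predictions to a path of your choice.

```python
import os
import tempfile

import pandas as pd

# Placeholder predictions (hypothetical); in practice this is the model output.
predictions = pd.DataFrame(["Philippus", "Ubuntius", "Esequlius"])

# Write "index,label" rows with no header line.
out_path = os.path.join(tempfile.mkdtemp(), "my_submission.csv")
predictions.to_csv(out_path, header=False)

# Read the file back the same way the sample submission is loaded, as a sanity check.
round_trip = pd.read_csv(out_path, header=None, index_col=0)
print(round_trip.shape)
```

Comparing `round_trip.shape` with the shape of the loaded sample submission is a quick check before posting.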

Now you are ready to know the truth! Are you good enough to call yourself a pro?

Luckily, you have the ultimate **APIla-bible**, which gives you the chance to check the accuracy of your predictions as many times as you need in order to become the pro you want to be.

## How do I post my prediction to the APIla-bible?

Easy peasy! You only need to fill in the path to your prediction `.csv` and run the cell below!

In [14]:
my_submission = "../data/sample_submission.csv"
with open(my_submission) as f:
    res = requests.post("http://apila-bible.herokuapp.com/check", files={"csv_data":f.read()})
res.json()

{'accuracy': 0.12368946580129805,
 'quote': 'Nope, not good enough. But you shall rise as the glorious phoenix from the ashes of this score and get to the top!'}

![hope-you-enjoy](https://imgs.xkcd.com/comics/machine_learning.png)