# Lab | Avila Bible 

In this lab, we will explore the [**Avila Bible dataset**](https://archive.ics.uci.edu/ml/datasets/Avila), which was extracted from 800 images of the 'Avila Bible', a giant XII-century Latin copy of the Bible. The prediction task consists of associating each pattern with a copyist. You will use supervised learning algorithms to work out which feature patterns each copyist is likely to produce, and then use your model to predict the copyist.

-----------------------------------------------------------------------------------------------------------------

## Before you start:
- Comment as much as you can and use the APIla-bible in the README.md.
- Happy learning!

In [1]:
# Import your libraries
import pandas as pd
import requests
import seaborn as sns

![machine-learning](https://miro.medium.com/proxy/1*halC1X4ydv_3yHYxKqvrwg.gif)

The Avila data set was extracted from 800 images of the **Avila Bible**, a giant Latin copy of the whole Bible produced during the XII century between Italy and Spain. Palaeographic analysis of the manuscript has identified 8 copyists. The pages written by each copyist are not equally numerous.
Each pattern contains 10 features and corresponds to a group of 4 consecutive rows.

# What am I expected to do?

Your prediction task consists of associating each pattern with one of the 8 monks we will be evaluating (labeled as: Marcus, Clarius, Philippus, Coronavirucus, Mongucus, Paithonius, Ubuntius, Esequlius). To that end, you should:
- Train a minimum of 4 different models
- Include a summary of the machine learning tools and algorithms you used
- Report the results or the score obtained with each of them

You won't get many more instructions from now on. Remember to comment your code as much as you can. Keep the requirements in mind and have fun!

## Dataset

In [2]:
data = pd.read_csv('../data/training_dataset.csv', index_col=0)

In [3]:
data.head(20)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
0,0.241386,0.109171,-0.127126,0.380626,0.17234,0.314889,0.484429,0.316412,0.18881,0.134922,Marcus
1,0.303106,0.352558,0.082701,0.703981,0.261718,-0.391033,0.408929,1.045014,0.282354,-0.448209,Clarius
2,-0.116585,0.281897,0.175168,-0.15249,0.261718,-0.889332,0.371178,-0.024328,0.905984,-0.87783,Philippus
3,-0.32643,-0.652394,0.384996,-1.694222,-0.185173,-1.138481,-0.232828,-1.747116,-1.183175,-0.80738,Philippus
4,-0.437525,-0.471816,0.463236,-0.545248,0.261718,-0.972381,0.824183,-3.108388,-2.9917,-1.14103,Philippus
5,-0.412837,-0.346197,0.601936,2.211191,0.440474,0.356414,-1.138838,1.623671,2.371513,1.346221,Mongucus
6,0.056229,0.187683,-0.031104,-0.958476,0.261718,-0.598658,0.861934,-0.885246,-0.528364,-0.848946,Clarius
7,-0.128929,-0.03215,0.068476,-0.151655,-0.989576,-0.806282,0.295677,0.259758,0.251173,-0.767984,Clarius
8,0.130292,-0.275537,0.274747,-0.574878,0.261718,2.889429,0.635431,-1.180749,-0.840179,2.303929,Marcus
9,0.080916,0.438921,0.14316,-0.523288,0.261718,-0.598658,0.106925,-0.641641,0.18881,-0.491704,Ubuntius


`Keep calm and code on!`

## Explore Data

In [4]:
# Explore column names, their data types and the number of non-null values
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12017 entries, 0 to 12016
Data columns (total 11 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       12017 non-null  float64
 1   1       12017 non-null  float64
 2   2       12017 non-null  float64
 3   3       12017 non-null  float64
 4   4       12017 non-null  float64
 5   5       12017 non-null  float64
 6   6       12017 non-null  float64
 7   7       12017 non-null  float64
 8   8       12017 non-null  float64
 9   9       12017 non-null  float64
 10  10      12017 non-null  object 
dtypes: float64(10), object(1)
memory usage: 1.1+ MB


In [5]:
# Check for null values
data.isna().sum()

0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
dtype: int64

In [6]:
# Statistics of the data frame
data.describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
count,12017.0,12017.0,12017.0,12017.0,12017.0,12017.0,12017.0,12017.0,12017.0,12017.0
mean,0.02128,0.030684,-0.000379,-0.022127,0.006801,-0.001279,0.03254,-0.00672,-0.011368,-0.026942
std,1.004481,3.66103,1.072319,1.002045,0.963679,1.108192,1.245215,1.012995,1.085821,0.985799
min,-3.498799,-2.426761,-3.210528,-5.440122,-4.922215,-7.450257,-11.935457,-4.164819,-5.486218,-6.719324
25%,-0.128929,-0.259834,0.064919,-0.542563,0.17234,-0.598658,-0.006326,-0.555747,-0.372457,-0.528135
50%,0.056229,-0.055704,0.214288,0.080127,0.261718,-0.058835,0.220177,0.101115,0.064084,-0.053548
75%,0.216699,0.203385,0.349432,0.601905,0.261718,0.522513,0.446679,0.646377,0.500624,0.491862
max,11.819916,386.0,50.0,3.987152,1.066121,53.0,83.0,13.173081,44.0,11.911338


In [7]:
# Count of rows and columns
data.shape

(12017, 11)
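The dataset description says the pages written by each copyist are not equally numerous, so it can help to look at how many rows each label has before training. Below is a minimal sketch, assuming the labels sit in a column named `"10"` as in the frame above; the `toy` frame and its contents are hypothetical stand-ins so the snippet runs without the CSV.

```python
import pandas as pd

# Hypothetical stand-in for the real frame: only the label column "10" matters here
toy = pd.DataFrame({
    "10": ["Marcus", "Marcus", "Marcus", "Clarius", "Clarius", "Philippus"],
})

# Absolute number of rows per copyist, most frequent first
counts = toy["10"].value_counts()

# The same information as a share of all rows
shares = toy["10"].value_counts(normalize=True).round(3)

print(pd.DataFrame({"rows": counts, "share": shares}))
```

On the real data the same two calls would be applied to `data["10"]`.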

# Challenge - train your models, make the best prediction

### Label encoder 
Encode target labels with values between 0 and n_classes-1.
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html

In [9]:
# Use label encoder to change names to numeric values
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
data["numeric"] = le.fit_transform(data["10"])

In [10]:
data.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,numeric
0,0.241386,0.109171,-0.127126,0.380626,0.17234,0.314889,0.484429,0.316412,0.18881,0.134922,Marcus,3
1,0.303106,0.352558,0.082701,0.703981,0.261718,-0.391033,0.408929,1.045014,0.282354,-0.448209,Clarius,0
2,-0.116585,0.281897,0.175168,-0.15249,0.261718,-0.889332,0.371178,-0.024328,0.905984,-0.87783,Philippus,6
3,-0.32643,-0.652394,0.384996,-1.694222,-0.185173,-1.138481,-0.232828,-1.747116,-1.183175,-0.80738,Philippus,6
4,-0.437525,-0.471816,0.463236,-0.545248,0.261718,-0.972381,0.824183,-3.108388,-2.9917,-1.14103,Philippus,6
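`LabelEncoder` assigns codes by sorting the labels, which is why Marcus becomes 3 and Philippus becomes 6 in the table above. The sketch below, which uses a separate encoder named `demo_le` (a hypothetical name) fitted on the eight names from the task description, shows the mapping and how to turn codes back into names.

```python
from sklearn.preprocessing import LabelEncoder

# The eight copyist names listed in the task description
names = ["Marcus", "Clarius", "Philippus", "Coronavirucus",
         "Mongucus", "Paithonius", "Ubuntius", "Esequlius"]

demo_le = LabelEncoder()
demo_le.fit(names)

# classes_ holds the sorted labels; a label's position is its integer code
mapping = {name: int(code)
           for name, code in zip(demo_le.classes_, demo_le.transform(demo_le.classes_))}
print(mapping)

# inverse_transform turns (predicted) codes back into copyist names
decoded = demo_le.inverse_transform([3, 0, 6])
print(list(decoded))
```

The same `inverse_transform` call on the notebook's own `le` would translate model predictions back into names.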


In [11]:
# Separate the features (independent variables) from the target (dependent variable);
# the original dataframe is left unchanged
columnas = [a for a in data.columns if a not in ["10","numeric"]]
X = data[columnas] # Independent variables (the 10 features)
y = data["numeric"] # Dependent variable (encoded copyist)


### Train_Test Split
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

In [12]:
from sklearn.model_selection import train_test_split

In [13]:
# Set training and testing 'parts' of the dataframe
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
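Because the copyists are not equally represented, one option is a stratified split with a fixed seed, so that both parts keep the class ratio and the split can be reproduced. This is only a sketch on hypothetical toy data (`X_toy`, `y_toy`); `stratify` and `random_state` are existing `train_test_split` parameters, but using them here is a suggestion, not something the lab requires.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 80 rows of class 0, 20 rows of class 1
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 10))
y_toy = np.array([0] * 80 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.2,
    stratify=y_toy,     # keep the 80/20 class ratio in both parts
    random_state=42,    # fixed seed so the split is reproducible
)

print(X_tr.shape, X_te.shape)
print(np.bincount(y_tr), np.bincount(y_te))
```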

### Train One Model

Classifier implementing the k-nearest neighbors vote. https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

In [14]:
from sklearn.neighbors import KNeighborsClassifier

In [15]:
classifier = KNeighborsClassifier(n_neighbors = 5)
classifier.fit(X_train, y_train)

KNeighborsClassifier()

In [16]:
y_pred = classifier.predict(X_test)
y_pred

array([0, 6, 3, ..., 3, 3, 4])
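To make the "k-nearest neighbors vote" concrete, here is a small from-scratch sketch of the idea, assuming Euclidean distance and equal-weight votes (the scikit-learn defaults). The function `knn_predict` and the toy arrays are hypothetical illustrations, not the library's implementation.

```python
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, x_new, k=5):
    """Predict one label by majority vote among the k closest training points."""
    # Euclidean distance from x_new to every training row
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbours
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]

# Hypothetical 2-D toy data: class 0 near the origin, class 1 near (5, 5)
X_demo = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y_demo = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_demo, y_demo, np.array([0.5, 0.5]), k=3))
print(knn_predict(X_demo, y_demo, np.array([5.5, 5.5]), k=3))
```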

### Exploring metrics for multi-class classification algorithms
https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics

- Accuracy
- Precision
- Recall
- F1_score
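With more than two classes, precision and recall are computed per class and then averaged. The sketch below works through the `average="weighted"` variant by hand on hypothetical labels (`y_true`, `y_hat`) and compares it with scikit-learn. It also shows that weighted recall reduces to plain accuracy, since each class's recall is weighted by its share of the samples.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# Hypothetical labels for a 3-class problem
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2])
y_hat = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 0])

cm = confusion_matrix(y_true, y_hat)   # rows = true class, columns = predicted class
tp = np.diag(cm)                       # correctly predicted samples per class
support = cm.sum(axis=1)               # true samples per class
per_class_precision = tp / cm.sum(axis=0)
per_class_recall = tp / support

# "weighted" = average of the per-class values, weighted by class support
weights = support / support.sum()
weighted_precision = float(np.sum(weights * per_class_precision))
weighted_recall = float(np.sum(weights * per_class_recall))

print(weighted_precision, precision_score(y_true, y_hat, average="weighted"))
print(weighted_recall, recall_score(y_true, y_hat, average="weighted"),
      accuracy_score(y_true, y_hat))
```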

In [17]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, fbeta_score, confusion_matrix

In [32]:
# Compute the metrics (weighted average over the classes)
accuracy = round(accuracy_score(y_test, y_pred), 4)
precision = round(precision_score(y_test, y_pred, average='weighted'), 4)
recall = round(recall_score(y_test, y_pred, average='weighted'), 4)
# Store the result as `f1`, not `f1_score`: rebinding the imported function's name
# to a float makes the next call fail with
# "TypeError: 'numpy.float64' object is not callable"
f1 = round(f1_score(y_test, y_pred, average='weighted'), 4)

# Print the metrics
print(f'Accuracy: {accuracy}')
print(f'Precision: {precision}')
print(f'Recall: {recall}')
print(f'F1_score: {f1}')
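The message `'numpy.float64' object is not callable` is what Python reports when a name that used to refer to a function now refers to a number, which typically happens in a notebook when a cell like `f1_score = f1_score(...)` is run more than once. A minimal sketch of the mechanism follows, using a hypothetical stand-in function `score` instead of the real metric.

```python
import numpy as np

def score():
    """Stand-in for an imported metric function such as f1_score."""
    return np.float64(0.9)

original = score          # keep a handle on the function
score = score()           # first run: works, but `score` is now a float

try:
    score()               # second run of the same line: a float cannot be called
except TypeError as err:
    message = str(err)
    print(message)

score = original          # restoring the function (or re-importing it) fixes it
print(score())
```

In the notebook, the equivalent fix is to re-run the `from sklearn.metrics import ...` cell and store results under a different name.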

### Training several models and exploring the metrics for each of them

- DecisionTreeClassifier
- SVC
- RandomForestClassifier
- AdaBoostClassifier

In [23]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier


In [31]:
# Models to train
models = [DecisionTreeClassifier(), SVC(), RandomForestClassifier(), AdaBoostClassifier()]

# Train the models
for model in models:
    model.fit(X_train, y_train)

# Print results for all models
for model in models:
    y_pred = model.predict(X_test)
    print(model, ':\n')
    print('Accuracy: ', accuracy_score(y_test, y_pred))
    print('Precision: ', precision_score(y_test, y_pred, average='weighted'))
    print('Recall: ', recall_score(y_test, y_pred, average='weighted'))
    # print('F1_score: ', f1_score(y_test, y_pred, average='weighted'))
    # Disabled: the call raises "'numpy.float64' object is not callable"
    # if the name f1_score has been reassigned to a number earlier in the session
    print('')

DecisionTreeClassifier() :

Accuracy:  0.9675540765391015
Precision:  0.9676333812583804
Recall:  0.9675540765391015

SVC() :

Accuracy:  0.7208818635607321
Precision:  0.7326776429070874
Recall:  0.7208818635607321

RandomForestClassifier() :

Accuracy:  0.9829450915141431
Precision:  0.9830665866269171
Recall:  0.9829450915141431

AdaBoostClassifier() :

Accuracy:  0.5091514143094842
Precision:  0.307988323181416
Recall:  0.5091514143094842



  _warn_prf(average, modifier, msg_start, len(result))
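Since the lab asks for a summary of the models and their scores, one way to present it is a single table. The sketch below is an assumption-heavy illustration: it uses synthetic data from `make_classification` in place of the Avila features, only two of the models, and hypothetical names (`candidates`, `summary`, `f1_metric`). It imports the F1 function under an alias so no variable can overwrite it, and passes `zero_division=0`, which scores a class that is never predicted as 0 instead of emitting the kind of warning that the truncated `_warn_prf` line above likely belongs to.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.metrics import f1_score as f1_metric
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical synthetic data standing in for the Avila features and labels
X_syn, y_syn = make_classification(n_samples=400, n_features=10, n_informative=6,
                                   n_classes=4, random_state=0)
Xa, Xb, ya, yb = train_test_split(X_syn, y_syn, test_size=0.2, random_state=0)

candidates = {
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "RandomForest": RandomForestClassifier(random_state=0),
}

rows = []
for name, model in candidates.items():
    pred = model.fit(Xa, ya).predict(Xb)
    rows.append({
        "model": name,
        "accuracy": accuracy_score(yb, pred),
        # zero_division=0 scores a never-predicted class as 0 without a warning
        "precision": precision_score(yb, pred, average="weighted", zero_division=0),
        "recall": recall_score(yb, pred, average="weighted", zero_division=0),
        "f1": f1_metric(yb, pred, average="weighted", zero_division=0),
    })

# One row per model, best accuracy first
summary = pd.DataFrame(rows).set_index("model").sort_values("accuracy", ascending=False)
print(summary.round(4))
```

Swapping in `X_train`, `X_test`, `y_train`, `y_test` and the full list of models would produce the same kind of table for the real data.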


![hope-you-enjoy](https://imgs.xkcd.com/comics/machine_learning.png)