# Lab | Avila Bible 

In this lab, we will explore the [**Avila Bible dataset**](https://archive.ics.uci.edu/ml/datasets/Avila), which has been extracted from 800 images of the 'Avila Bible', a giant XII-century Latin copy of the Bible. The prediction task consists of associating each pattern with a copyist. You will use supervised learning algorithms to figure out which feature patterns each copyist is likely to have, and use your model to predict the copyist.

-----------------------------------------------------------------------------------------------------------------

## Before you start:
- Read the README.md file.
- Comment as much as you can and use the APIla-bible described in the README.md.
- Happy learning!

In [96]:
# Import your libraries
import pandas as pd
import requests
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.calibration import CalibratedClassifierCV
# Only needed for scikit-learn < 1.0; newer versions expose the estimator directly
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.kernel_approximation import RBFSampler
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

![machine-learning](https://miro.medium.com/proxy/1*halC1X4ydv_3yHYxKqvrwg.gif)

The Avila data set has been extracted from 800 images of the **Avila Bible**, a giant Latin copy of the whole Bible produced during the XII century between Italy and Spain. The palaeographic analysis of the manuscript has identified 12 copyists. The pages written by each copyist are not equally numerous.
Each pattern contains 10 features and corresponds to a group of 4 consecutive rows.

## What am I expected to do?

Well, your prediction task consists of associating each pattern with one of the 8 monks we will be evaluating (labeled as: Marcus, Clarius, Philippus, Coronavirucus, Mongucus, Paithonius, Ubuntius, Esequlius). To that end, you should:
- Train a minimum of 4 different models
- Perform a minimum of 4 Feature Extraction and Engineering techniques
- Include a summary of the machine learning tools and algorithms you used
- Report the results or the score obtained with each of them

You won't get many more instructions from now on. Remember to comment your code as much as you can. Keep the requirements in mind and have fun!

Just one last piece of advice: take a moment to explore the data. Remember that this dataset contains two files: **train** and **test**. You will find both files in the `data` folder. The **test** file contains the data you will predict for, so it does not include the labels.
Use the **train** dataset as you wish, but don't forget to split it into **train** and **test** again so you can evaluate your models. Just be sure to retrain your model on the whole labelled dataset before predicting.
We have also included a **sample submission**, which has the exact shape and format you must use when evaluating your predictions against the ground truth through the `APIla-bible`. It won't work unless your file has exactly the same shape.
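
The workflow described above (evaluate on a hold-out split, then refit on all labelled rows before predicting) could look roughly like the sketch below. It assumes synthetic stand-in data from `make_classification` instead of the real CSV files, and `RandomForestClassifier` is only an illustrative choice of model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: 10 features, 8 classes, not the real files)
X_all, y_all = make_classification(
    n_samples=850, n_features=10, n_informative=8, n_redundant=0,
    n_classes=8, n_clusters_per_class=1, random_state=0,
)
X_demo, y_demo = X_all[:800], y_all[:800]   # plays the role of the labelled train file
X_unlabelled = X_all[800:]                  # plays the role of the unlabelled test file

# 1) Hold out part of the labelled data to estimate performance
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_tr, y_tr)
val_accuracy = accuracy_score(y_val, model.predict(X_val))
print(f"hold-out accuracy: {val_accuracy:.3f}")

# 2) Refit the same configuration on all labelled rows before predicting
final_model = RandomForestClassifier(n_estimators=50, random_state=0)
final_model.fit(X_demo, y_demo)
final_predictions = final_model.predict(X_unlabelled)
```

The hold-out score is only an estimate for the refitted model, since that model has seen more data than the one that was evaluated.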



#### Train dataset

In [97]:
train_dataset = pd.read_csv('../data/training_dataset.csv', index_col=0)

In [98]:
train_dataset.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
0,0.241386,0.109171,-0.127126,0.380626,0.17234,0.314889,0.484429,0.316412,0.18881,0.134922,Marcus
1,0.303106,0.352558,0.082701,0.703981,0.261718,-0.391033,0.408929,1.045014,0.282354,-0.448209,Clarius
2,-0.116585,0.281897,0.175168,-0.15249,0.261718,-0.889332,0.371178,-0.024328,0.905984,-0.87783,Philippus
3,-0.32643,-0.652394,0.384996,-1.694222,-0.185173,-1.138481,-0.232828,-1.747116,-1.183175,-0.80738,Philippus
4,-0.437525,-0.471816,0.463236,-0.545248,0.261718,-0.972381,0.824183,-3.108388,-2.9917,-1.14103,Philippus


#### Test dataset


In [99]:
test_dataset = pd.read_csv('../data/test_dataset.csv', index_col=0)

In [100]:
test_dataset.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,-0.017834,0.132725,0.125378,1.357345,0.261718,0.190314,0.182426,0.445253,-0.715453,0.189796
1,-0.202992,-0.000745,-3.210528,-0.527256,0.082961,0.771662,0.144676,0.098572,0.251173,0.745333
2,1.019049,0.211237,-0.155578,-0.311855,0.261718,0.107265,0.484429,0.339303,-0.310094,-0.04963
3,0.451232,-0.267686,0.335206,-0.831336,0.261718,0.024215,0.220177,0.988787,0.032902,0.025485
4,-0.22768,0.109171,0.413447,0.118917,0.17234,0.480988,0.52218,0.091562,0.313536,0.256389


#### Sample submission

In [101]:
sample_submission = pd.read_csv('../data/sample_submission.csv', header=None, index_col=0)

In [102]:
sample_submission.head()

0,1
0,Marcus
1,Esequlius
2,Marcus
3,Marcus
4,Esequlius


`Keep calm and code on!`

# Challenge - train your models, make the best prediction

In [103]:
train_dataset.dtypes

0     float64
1     float64
2     float64
3     float64
4     float64
5     float64
6     float64
7     float64
8     float64
9     float64
10     object
dtype: object

In [104]:
train_dataset["10"].value_counts()  # Really great names! :D

Marcus           5107
Clarius          2362
Philippus        1360
Coronavirucus    1009
Mongucus          640
Paithonius        600
Ubuntius          512
Esequlius         427
Name: 10, dtype: int64
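
The counts above are imbalanced: Marcus has 5107 of the 12017 rows, Esequlius only 427. One common way to keep those proportions in both halves of a split is the `stratify` argument of `train_test_split`. The sketch below is an illustration on made-up labels and random features, not on the real dataset; `class_shares` is a hypothetical helper.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels and random features (assumption: not the real data)
rng = np.random.default_rng(0)
labels = np.array(["Marcus"] * 500 + ["Clarius"] * 250 + ["Esequlius"] * 50)
features = rng.normal(size=(len(labels), 10))

# stratify=labels keeps each class's share roughly equal in both splits
f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=0
)

def class_shares(values):
    """Return a dict mapping each class to its fraction of the rows."""
    names, counts = np.unique(values, return_counts=True)
    return dict(zip(names, counts / counts.sum()))

print(class_shares(l_train))
print(class_shares(l_test))
```

Without `stratify`, a rare class can end up under-represented in the evaluation split purely by chance.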

In [105]:
train_dataset.shape

(12017, 11)

In [106]:
train_dataset.describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
count,12017.0,12017.0,12017.0,12017.0,12017.0,12017.0,12017.0,12017.0,12017.0,12017.0
mean,0.02128,0.030684,-0.000379,-0.022127,0.006801,-0.001279,0.03254,-0.00672,-0.011368,-0.026942
std,1.004481,3.66103,1.072319,1.002045,0.963679,1.108192,1.245215,1.012995,1.085821,0.985799
min,-3.498799,-2.426761,-3.210528,-5.440122,-4.922215,-7.450257,-11.935457,-4.164819,-5.486218,-6.719324
25%,-0.128929,-0.259834,0.064919,-0.542563,0.17234,-0.598658,-0.006326,-0.555747,-0.372457,-0.528135
50%,0.056229,-0.055704,0.214288,0.080127,0.261718,-0.058835,0.220177,0.101115,0.064084,-0.053548
75%,0.216699,0.203385,0.349432,0.601905,0.261718,0.522513,0.446679,0.646377,0.500624,0.491862
max,11.819916,386.0,50.0,3.987152,1.066121,53.0,83.0,13.173081,44.0,11.911338
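
The summary above shows a few extreme maxima (for example 386.0 in column `1`, whose 75th percentile is about 0.20). One possible feature-engineering step is to clip each column to its own percentile range. The sketch below uses a toy frame with an injected extreme value; the 1st/99th percentile bounds are an arbitrary assumption, and whether clipping helps on the real data is not established here.

```python
import numpy as np
import pandas as pd

# Toy frame with one injected extreme value (assumption: not the real data)
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["0", "1", "2"])
toy.loc[0, "1"] = 386.0  # mimic an extreme maximum

# Per-column bounds: 1st and 99th percentiles
lower = toy.quantile(0.01)
upper = toy.quantile(0.99)

# axis=1 aligns the bound Series with the column labels
clipped = toy.clip(lower=lower, upper=upper, axis=1)

print(toy["1"].max(), clipped["1"].max())
```

When applying this to a train/test workflow, the bounds would be computed on the training rows only and then reused for the rows to predict.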


In [107]:
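# Feature matrix: every column except the label
X = train_dataset.drop(columns=["10"])
# Ground truth: the copyist name
y = train_dataset["10"]
# Hold out 20% of the labelled rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)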

# Models

In [109]:
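models = {
    "logis": LogisticRegression(C=10,solver="lbfgs"),
    # NOTE: despite the "svm-rbf" key, this SVC uses a polynomial kernel
    "svm-rbf": CalibratedClassifierCV(SVC(kernel="poly",gamma="auto", max_iter=200),cv=3),
    # SVC() with default arguments uses the RBF kernel
    "svc": SVC(),
}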
for name,m  in models.items():
    print(f"Training {name}...")
    m.fit(X_train, y_train)
print("Train complete")

Training logis...


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Training svm-rbf...




Training svc...
Train complete
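
The convergence warning above suggests raising `max_iter` or scaling the data. A minimal sketch of both, assuming synthetic data and an arbitrary `max_iter=1000`, wraps the classifier in a pipeline so the scaler is always fitted on the same rows as the model:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the Avila features)
X_syn, y_syn = make_classification(
    n_samples=500, n_features=10, n_informative=6, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

# Standardise features, then fit logistic regression with more iterations
scaled_logis = make_pipeline(
    StandardScaler(),
    LogisticRegression(C=10, solver="lbfgs", max_iter=1000),
)
scaled_logis.fit(X_syn, y_syn)
print(scaled_logis.score(X_syn, y_syn))
```

Because the scaler lives inside the pipeline, calling `predict` on new rows applies the same scaling learned during `fit`.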


In [110]:
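# Helper that prints one metric rounded to 3 decimals
printMetric = lambda label,value:print(f"\t {label}: {round(value,3)}")
# List of {"model", "y_predict"} records, one per trained model
y_predictDict=[]
for name, model in models.items():
    # Predictions on the hold-out split, used for local evaluation
    y_pred = model.predict(X_test)
    # Predictions on the unlabelled test file, kept for submission
    y_predict = model.predict(test_dataset)
    y_predictDict.append({"model":name,"y_predict":y_predict})
    print(f"Evaluating model {name}")
    printMetric("Accuracy",accuracy_score(y_test, y_pred))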


Evaluating model logis
	 Accuracy: 0.565
Evaluating model svm-rbf
	 Accuracy: 0.533
Evaluating model svc
	 Accuracy: 0.687


In [111]:
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import GradientBoostingClassifier
models = {
    "SGDClassifier":  SGDClassifier(loss="hinge", penalty="l2", max_iter=5),
    "BaggingClassifier": BaggingClassifier(KNeighborsClassifier()),
    "MLPClassifier":  MLPClassifier(solver='lbfgs'),
    
   
}
for name,m  in models.items():
    print(f"Training {name}...")
    m.fit(X_train, y_train)
print("Train complete")

Training SGDClassifier...
Training BaggingClassifier...
Training MLPClassifier...




Train complete


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)


In [112]:
printMetric = lambda label,value:print(f"\t {label}: {round(value,3)}")
for name, model in models.items():
    y_pred = model.predict(X_test)
    y_predict = model.predict(test_dataset)
    y_predictDict.append({"model":name,"y_predict":y_predict})
    print(f"Evaluating model {name}")
    printMetric("Accuracy",accuracy_score(y_test, y_pred))

Evaluating model SGDClassifier
	 Accuracy: 0.493
Evaluating model BaggingClassifier
	 Accuracy: 0.757
Evaluating model MLPClassifier
	 Accuracy: 0.804


In [113]:
models = {
    "DecisionTreeClassifier": tree.DecisionTreeClassifier(),
    "randomforest": RandomForestClassifier(n_estimators=300),
    "GradientBoostingClassifier": GradientBoostingClassifier(n_estimators=300), 
    "HistGradientBoostingClassifier": HistGradientBoostingClassifier()
    }
   
for name,m  in models.items():
    print(f"Training {name}...")
    m.fit(X_train, y_train)
print("Train complete")

Training DecisionTreeClassifier...
Training randomforest...
Training GradientBoostingClassifier...
Training HistGradientBoostingClassifier...
Train complete


In [114]:
printMetric = lambda label,value:print(f"\t {label}: {round(value,3)}")
for name, model in models.items():
    y_pred = model.predict(X_test)
    y_predict = model.predict(test_dataset)
    y_predictDict.append({"model":name,"y_predict":y_predict})
    print(f"Evaluating model {name}")
    printMetric("Accuracy",accuracy_score(y_test, y_pred))

Evaluating model DecisionTreeClassifier
	 Accuracy: 0.984
Evaluating model randomforest
	 Accuracy: 0.993
Evaluating model GradientBoostingClassifier
	 Accuracy: 0.997
Evaluating model HistGradientBoostingClassifier
	 Accuracy: 0.998
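
The accuracies above come from a single random hold-out split. A hedged way to check how stable such a score is would be k-fold cross-validation. The sketch below assumes synthetic data, 5 folds, and a random forest as an illustrative model; it is not the evaluation the notebook actually ran.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (assumption: not the Avila features)
X_cv, y_cv = make_classification(
    n_samples=600, n_features=10, n_informative=6, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

# Five stratified folds; each fold is used once as the evaluation set
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X_cv, y_cv, cv=cv, scoring="accuracy",
)
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

A large spread between folds would suggest that a single-split score should not be trusted too much.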


## What do I do once I have a prediction?

Once you have trained your model and made a prediction with it, you are ready to check its accuracy.

Save your prediction as a `.csv` file.
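
The sample submission is loaded with `header=None, index_col=0`, which suggests a headerless two-column file: a row index and the predicted name. A minimal sketch of writing predictions in that layout and reading them back follows; the prediction values and the temporary output path are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Hypothetical predictions; in the lab these would come from model.predict(...)
predictions = ["Marcus", "Esequlius", "Marcus"]

# Hypothetical output path in a temporary directory
out_path = os.path.join(tempfile.mkdtemp(), "my_submission.csv")

# header=False writes no header row; the default index becomes the first column
pd.Series(predictions).to_csv(out_path, header=False)

# Read it back the same way the sample submission is loaded
round_trip = pd.read_csv(out_path, header=None, index_col=0)
print(round_trip)
```

Writing to a separate file keeps the provided sample submission intact for later comparison.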

In [115]:
# Score every model's test-set predictions against the APIla-bible
for e in y_predictDict:
    y_predict=e["y_predict"]
    submission=pd.DataFrame(y_predict)
    # NOTE: this overwrites the provided sample_submission.csv on each iteration
    submission.to_csv("../data/sample_submission.csv",header=None)
    my_submission = "../data/sample_submission.csv"
    with open(my_submission) as f:
        # Send the CSV contents to the checking endpoint
        res = requests.post("http://apila-bible.herokuapp.com/check", files={"csv_data":f.read()})
    print("For model: ",e["model"])
    print(res.json(),"\n")
    

For model:  logis
{'accuracy': 0.5757613579630554, 'quote': 'Nope, not good enough. But you shall rise as the glorious phoenix from the ashes of this score and get to the top!'} 

For model:  svm-rbf
{'accuracy': 0.5376934598102846, 'quote': 'Nope, not good enough. But you shall rise as the glorious phoenix from the ashes of this score and get to the top!'} 

For model:  svc
{'accuracy': 0.6829755366949576, 'quote': 'Nope, not good enough. But you shall rise as the glorious phoenix from the ashes of this score and get to the top!'} 

For model:  SGDClassifier
{'accuracy': 0.5008736894658014, 'quote': 'Nope, not good enough. But you shall rise as the glorious phoenix from the ashes of this score and get to the top!'} 

For model:  BaggingClassifier
{'accuracy': 0.744882675986021, 'quote': "Close, but no cigar. It's a good begining. How can you improve it more? Maybe try some different models?"} 

For model:  MLPClassifier
{'accuracy': 0.8170244633050424, 'quote': "It's good, but I'm sur

I get 4 models with very high accuracy and, given the type of data, they did not overfit; in other real-world cases, more general models would have to be chosen.

For model:  DecisionTreeClassifier
{'accuracy': 0.9822765851223165}

For model:  randomforest
{'accuracy': 0.9898901647528707}

For model:  GradientBoostingClassifier
{'accuracy': 0.9988766849725412}

For model:  HistGradientBoostingClassifier
{'accuracy': 0.9995007488766849}

Now you are ready to know the truth! Are you good enough to call yourself a pro?

Luckily, you have the ultimate **APIla-bible**, which gives you the chance to check the accuracy of your predictions as many times as you need in order to become the pro you want to be.

## How do I post my prediction to the APIla-bible?

Easy peasy! You only need to fill in the path to your prediction `.csv` and run the cell below!

In [116]:
my_submission = "../data/sample_submission.csv"
with open(my_submission) as f:
    res = requests.post("http://apila-bible.herokuapp.com/check", files={"csv_data":f.read()})
res.json()

{'accuracy': 0.9995007488766849,
 'quote': "AWESOME! A-W-E-S-O-M-E! Amazing score!!! So cool! I can't even... But wait, maybe...too good to be true? Overfit much?",
 'tip': 'If you think you may have overfitted your model, visit http://apila-bible.herokuapp.com/check/overfit on your browser for some follow up. ;)'}

![hope-you-enjoy](https://imgs.xkcd.com/comics/machine_learning.png)