# Lab | Avila Bible 

In this lab, we will explore the [**Avila Bible dataset**](https://archive.ics.uci.edu/ml/datasets/Avila), which has been extracted from 800 images of the 'Avila Bible', a giant XII century Latin copy of the Bible. The prediction task consists of associating each pattern with a copyist. You will use supervised learning algorithms to figure out which feature patterns each copyist is likely to have, and then use your model to predict the copyist of each pattern.

-----------------------------------------------------------------------------------------------------------------

## Before you start:
    - Read the README.md file,
    - Comment as much as you can and use the APIla-bible in the README.md,
    - Happy learning!

In [68]:
# Import your libraries
import pandas as pd
import numpy as np
import requests
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier

#from sklearn.feature_selection import chi2
#from sklearn.feature_selection import SelectKBest

![machine-learning](https://miro.medium.com/proxy/1*halC1X4ydv_3yHYxKqvrwg.gif)

The Avila data set has been extracted from 800 images of the **Avila Bible**, a giant Latin copy of the whole Bible produced during the XII century between Italy and Spain. The palaeographic analysis of the manuscript has identified 12 copyists. The pages written by each copyist are not equally numerous. 
Each pattern contains 10 features and corresponds to a group of 4 consecutive rows.

## What am I expected to do?

Your prediction task consists of associating each pattern with one of the 8 monks we will be evaluating (labeled as: Marcus, Clarius, Philippus, Coronavirucus, Mongucus, Paithonius, Ubuntius, Esequlius). To that end, you should: 
- Train a minimum of 4 different models
- Perform a minimum of 4 Feature Extraction and Engineering techniques
- Include a summary of the machine learning tools and algorithms you used
- Report the results or the score obtained with each of them

You won't get many more instructions from now on. Remember to comment your code as much as you can. Keep the requirements in mind and have fun! 

Just one last piece of advice: take a moment to explore the data. Remember that this dataset contains two files: **train** and **test**. You will find both files in the `data` folder. The **test** file contains the data you will predict for, therefore it does not include the labels.
Use the **train** dataset as you wish, but don't forget to split it into **train** and **test** again so you can evaluate your models. Just be sure to train the model again on the whole training data before predicting.
We have also included a **sample submission**, which has the exact shape and format you must use when evaluating your predictions against the ground truth through the `APIla-bible`. It won't work unless it is the exact same shape. 
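
The shape requirement can be checked locally before posting anything. The sketch below is one possible way to do that with only the standard library. The helper names `csv_shape` and `same_shape` and the miniature CSV strings are assumptions made for illustration; they are not part of the lab material.

```python
import csv
import io

def csv_shape(csv_text):
    """Return (number of rows, number of columns) of a CSV string.

    Raises ValueError if the rows do not all have the same number of columns.
    """
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    widths = {len(row) for row in rows}
    if len(widths) > 1:
        raise ValueError(f"inconsistent column counts: {sorted(widths)}")
    n_cols = widths.pop() if widths else 0
    return len(rows), n_cols

def same_shape(prediction_csv, sample_csv):
    """True when both CSV strings have the same number of rows and columns."""
    return csv_shape(prediction_csv) == csv_shape(sample_csv)

# Hypothetical miniature files: an index column plus a label column, no header row.
sample = "0,Philippus\n1,Ubuntius\n2,Esequlius\n"
good = "0,Marcus\n1,Marcus\n2,Clarius\n"
bad = "0,Marcus\n1,Marcus\n"  # one row short

print(same_shape(good, sample))  # True
print(same_shape(bad, sample))   # False
```

In the notebook, the two strings would be the text of your own prediction and of the sample submission file.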



#### Train dataset

In [2]:
train_dataset = pd.read_csv('../data/training_dataset.csv', index_col=0)
train_dataset.columns=range(11)

In [3]:
print(train_dataset.shape)
train_dataset.head(3)

(12017, 11)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
0,0.241386,0.109171,-0.127126,0.380626,0.17234,0.314889,0.484429,0.316412,0.18881,0.134922,Marcus
1,0.303106,0.352558,0.082701,0.703981,0.261718,-0.391033,0.408929,1.045014,0.282354,-0.448209,Clarius
2,-0.116585,0.281897,0.175168,-0.15249,0.261718,-0.889332,0.371178,-0.024328,0.905984,-0.87783,Philippus


#### Test dataset


In [4]:
test_dataset = pd.read_csv('../data/test_dataset.csv', index_col=0)
test_dataset.columns=range(10)

In [5]:
print(test_dataset.shape)
test_dataset.head(3)

(8012, 10)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,-0.017834,0.132725,0.125378,1.357345,0.261718,0.190314,0.182426,0.445253,-0.715453,0.189796
1,-0.202992,-0.000745,-3.210528,-0.527256,0.082961,0.771662,0.144676,0.098572,0.251173,0.745333
2,1.019049,0.211237,-0.155578,-0.311855,0.261718,0.107265,0.484429,0.339303,-0.310094,-0.04963


#### Sample submission

In [6]:
sample_submission = pd.read_csv('../data/sample_submission.csv', header=None, index_col=0)

In [7]:
sample_submission.head()

0,1
0,Philippus
1,Ubuntius
2,Esequlius
3,Coronavirucus
4,Philippus


`Keep calm and code on!`

# Challenge - train your models, make the best prediction

#Your code

        """
        My code starts here...
        """

### This is my trainer function

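Takes a model, features, labels, and some parameters.

With `predict=True` it splits the data, fits the model, prints the accuracy on the held-out part, and optionally returns the predicted labels. With `predict=False` it fits the model on all the data and returns the trained model instead.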
       

In [9]:
def fit_or_predict(model, X, y,
                   test_size=0.2, solver=None, params=None, debug=False,
                   return_pred=False, predict=True):
    # Note: `solver` and `params` are accepted but currently unused.
    if predict:
        # First, split the data (a new random split on every call):
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size)
        model = model.fit(X_train, y_train)    # Fit the model with data and labels
        y_pred = model.predict(X_test)         # Predict on the held-out part
        if debug:
            # Show how many times each label was predicted
            display(pd.DataFrame(y_pred)[0].value_counts())

        # Compare my results with the ground truth
        result = pd.DataFrame({"y_pred": y_pred, "gt": y_test})
        print('Accuracy: ', sum(result['y_pred'] == result['gt'])/len(y_pred))

        if return_pred:
            return y_pred # Labels

    else: # Train on all the data, without a train-test split
        print(f'I wont split the data you have into train-test, but I am going to train your model: \n{model}')
        X_train = X
        y_train = y
        fit_model = model.fit(X_train, y_train)
        print('OK. I just fit this model, try to use its `predict_` method with the `X_test`')
        return fit_model
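
The accuracy printed by this function is the fraction of held-out patterns whose predicted label matches the true one. Because the copyists do not have equally many pages, it may also help to look at each label separately. Here is a minimal sketch with made-up labels; the helper names `accuracy` and `per_label_recall` are assumptions, not functions from the notebook.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    hits = sum(t == p for t, p in zip(y_true, y_pred))
    return hits / len(y_true)

def per_label_recall(y_true, y_pred):
    """For each true label, the fraction of its patterns predicted correctly."""
    totals = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / totals[label] for label in totals}

# Made-up labels for illustration only (not taken from the dataset).
y_true = ["Marcus", "Marcus", "Marcus", "Clarius", "Philippus"]
y_pred = ["Marcus", "Marcus", "Marcus", "Marcus", "Philippus"]

print(accuracy(y_true, y_pred))          # 0.8
print(per_label_recall(y_true, y_pred))  # {'Marcus': 1.0, 'Clarius': 0.0, 'Philippus': 1.0}
```

In this toy case the overall accuracy looks decent even though one label is never predicted correctly, which is the kind of detail a single number can hide.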


### This is my train data, split into X and y

In [10]:
X = train_dataset.drop(columns=10).copy()
y = train_dataset[10].copy()
X.shape, y.shape

((12017, 10), (12017,))

### Logistic Regression Model

In [11]:
fit_or_predict(LogisticRegression(solver='liblinear'), X, y)

Accuracy:  0.5449251247920133


In [12]:
fit_or_predict(LogisticRegression(solver='sag'), X, y)

Accuracy:  0.5607321131447587




In [13]:
fit_or_predict(LogisticRegression(solver='saga'), X, y)

Accuracy:  0.5128951747088186




In [14]:
fit_or_predict(LogisticRegression(solver='newton-cg'), X, y)

Accuracy:  0.5678036605657238


    """
    From these experiments, we can see that the best result was achieved with the `newton-cg` solver (about 0.57), with `liblinear` at about 0.54.

    `sag` scored about 0.56 and `saga` about 0.51, but neither converged, so they won't be considered.
    """

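Each call to `fit_or_predict` draws a new random split, so a single accuracy per solver carries some noise. One hedged way to compare solvers is to repeat each experiment several times and look at the mean and the spread. The sketch below only does the summarising step. The scores in it are placeholders, not measurements from this notebook, and `summarize` and `clearly_better` are hypothetical helpers implementing a rough rule of thumb rather than a formal statistical test.

```python
from statistics import mean, stdev

def summarize(scores):
    """Return (mean, sample standard deviation) of a list of accuracies."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate the spread")
    return mean(scores), stdev(scores)

def clearly_better(scores_a, scores_b):
    """Rough check: is A's mean above B's by more than both spreads combined?"""
    mean_a, sd_a = summarize(scores_a)
    mean_b, sd_b = summarize(scores_b)
    return mean_a - mean_b > sd_a + sd_b

# Placeholder accuracies for two hypothetical solvers over five random splits.
solver_a = [0.56, 0.57, 0.55, 0.56, 0.57]
solver_b = [0.54, 0.56, 0.55, 0.57, 0.55]

print(clearly_better(solver_a, solver_b))  # False: the gap is within the noise
```
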
### Calibrated Classifier (Linear Support Vector Machine)

In [15]:
model = CalibratedClassifierCV(LinearSVC(),cv=3)
fit_or_predict(model, X, y)



Accuracy:  0.5507487520798668




### Random Forest Classifier

In [17]:
model = RandomForestClassifier(n_estimators=100)
fit_or_predict(model, X, y)

Accuracy:  0.9920965058236273


    """
    In this case, we see a large boost in accuracy when using the random forest model, reaching about 99%. This model will be included in our experiments.
    """

### Gradient Boosting Classifier:
`sklearn.ensemble.GradientBoostingClassifier`
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html#sklearn.ensemble.GradientBoostingClassifier

In [20]:
model = GradientBoostingClassifier()
fit_or_predict(model, X, y)

Accuracy:  0.9371880199667221


    """
    The Gradient Boosting Classifier also shows high accuracy (about 94%), well above the linear models but below the random forest. This model will be included in our experiments.
    """

### To make a better prediction, fit one of the following models, while using the complete `train_dataset`

In [21]:
models= {
    #---------------------------------------------------------------------------------
    # Accuracy < 60%                                           |         API SCORES:   
    #---------------------------------------------------------------------------------
    'logisticnewton':LogisticRegression(solver='newton-cg'),      #0.54000000000000
    'Calibrated-Classifiersvm-linear':CalibratedClassifierCV(LinearSVC(),cv=3),    
    #---------------------------------------------------------------------------------
    # Accuracy 90% to 99%                                      |        # API SCORES:   
    #---------------------------------------------------------------------------------
    'gradientboosting':GradientBoostingClassifier(),              #0.9377184223664503
    #---------------------------------------------------------------------------------        
    # Accuracy >= 99%                                          |        # API SCORES:     
    #---------------------------------------------------------------------------------
    'randomforest-200':RandomForestClassifier(n_estimators=200),  #0.9940089865202196
    'randomforest-225':RandomForestClassifier(n_estimators=225),  # the model selected below
    }

# Pick one model by key and train it on the complete training data
selected_model = models['randomforest-225']
trained_model = fit_or_predict(selected_model, X, y, predict=False)

I wont split the data you have into train-test, but I am going to train your model: 
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=225,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)
OK. I just fit this model, try to use its `predict_` method with the `X_test`


In [22]:
answer = trained_model.predict(test_dataset)
answer

array(['Marcus', 'Esequlius', 'Marcus', ..., 'Marcus', 'Marcus', 'Marcus'],
      dtype=object)

X_complete = pd.concat((train_dataset.iloc[:,:-1], test_dataset), axis=0).reset_index(drop=True)
X_complete.shape

## What do I do once I have a prediction?

Once you have trained your model and made a prediction with it, you are ready to check its accuracy. 

Save your prediction as a `.csv` file.

In [23]:
#your code here
my_pred = pd.DataFrame(answer).to_csv(header=None)
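
The cell above keeps the CSV as a string in memory, which is what gets posted later. If a file on disk is also wanted, a sketch along these lines would write the same index/label layout without a header row. The function name `write_predictions` and the file name are assumptions for illustration.

```python
import csv
import os
import tempfile

def write_predictions(labels, path):
    """Write one `index,label` row per prediction, with no header row."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for i, label in enumerate(labels):
            writer.writerow([i, label])

# Illustrative labels; in the notebook these would come from `answer`.
labels = ["Marcus", "Esequlius", "Marcus"]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "my_prediction.csv")  # hypothetical file name
    write_predictions(labels, path)
    # Read the file back to confirm what was written.
    with open(path, newline="") as f:
        rows = list(csv.reader(f))

print(rows)  # [['0', 'Marcus'], ['1', 'Esequlius'], ['2', 'Marcus']]
```

With pandas, passing a path as the first argument of `to_csv` (together with `header=False`) should produce the same layout.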

Now you are ready to know the truth! Are you good enough to call yourself a pro?

Luckily, you have the ultimate **APIla-bible**, which gives you the chance to check the accuracy of your predictions as many times as you need in order to become the pro you want to be. 

## How do I post my prediction to the APIla-bible?

Easy peasy! You only need to fill in the path to your prediction `.csv` and run the cell below! 

In [24]:
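my_submission = "../data/sample_submission.csv"  # Reference file with the expected shape (not sent)
def send_prediction(csv):
    # `csv` is the CSV content as a string, e.g. the return value of DataFrame.to_csv()
    res = requests.post("http://apila-bible.herokuapp.com/check", files={"csv_data":csv})
    print(res.json())   # Print the API response; the function itself returns None
send_prediction(my_pred)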

{'accuracy': 0.9932601098352472, 'quote': "AWESOME! A-W-E-S-O-M-E! Amazing score!!! So cool! I can't even... But wait, maybe...too good to be true? Overfit much?", 'tip': 'If you think you may have overfitted your model, visit http://apila-bible.herokuapp.com/check/overfit on your browser for some follow up. ;)'}


![hope-you-enjoy](https://imgs.xkcd.com/comics/machine_learning.png)

In [25]:
# Streamlined optimization

    """
    So, I need to modify my hyper-parameters more quickly.
    
    I've noticed they don't normally behave like a bell curve, and I want to find which variable would increase the accuracy of my model.
    """

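One way to vary hyper-parameters quickly is to generate every combination of a few candidate values and score each one in a loop. A minimal sketch using only the standard library follows. Here `score_fn` stands in for "train the model and return its accuracy", and `toy_score` is invented purely so the example runs; neither comes from the notebook.

```python
from itertools import product

def param_grid(options):
    """Yield one dict per combination of the candidate values in `options`."""
    names = list(options)
    for values in product(*(options[name] for name in names)):
        yield dict(zip(names, values))

def best_params(options, score_fn):
    """Score every combination with `score_fn` and return (best_score, best_params)."""
    scored = [(score_fn(params), params) for params in param_grid(options)]
    return max(scored, key=lambda pair: pair[0])

# Candidate values; the parameter names match RandomForestClassifier arguments.
options = {
    "n_estimators": [150, 200, 225],
    "criterion": ["gini", "entropy"],
    "bootstrap": [True, False],
}

def toy_score(params):
    """Invented stand-in for a real train-and-evaluate step."""
    return -abs(params["n_estimators"] - 200) + (params["criterion"] == "entropy")

print(len(list(param_grid(options))))  # 12
print(best_params(options, toy_score))
```

In the notebook, `score_fn` could build `RandomForestClassifier(**params)` and return a hold-out accuracy. scikit-learn's `GridSearchCV` offers the same idea with cross-validation built in.
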
In [122]:
def select_model(key, n=None, cv=None):
    models= {
        ###
        # Untested: GradientBoostingClassifier, HistGradientBoostingClassifier
        ##
        'decisiontree':DecisionTreeClassifier(min_samples_split=4),
        #---------------------------------------------------------------------------------
        # Accuracy < 60%                                           |         API SCORES:   
        #---------------------------------------------------------------------------------
        #'logisticnewton':LogisticRegression(solver='newton-cg'),      #0.
        #'Calibrated-Classifiersvm-linear':CalibratedClassifierCV(LinearSVC(),cv=cv),    
        #---------------------------------------------------------------------------------
        # Accuracy 90% to 99%                                      |        # API SCORES:   
        #---------------------------------------------------------------------------------
        'gradientboosting':GradientBoostingClassifier(min_samples_leaf=3), #0.
        #---------------------------------------------------------------------------------        
        # Accuracy >= 99%                                          |        # API SCORES:     
        #---------------------------------------------------------------------------------
        'randomforest-y':RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='entropy', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=154,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=1, warm_start=False),
        
        'randomforest-x':RandomForestClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None,
                       criterion='entropy', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=150,
                       n_jobs=2, oob_score=False, random_state=None,
                       verbose=1, warm_start=False)
        ,
        'randomforest-z':RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='entropy', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.00, n_estimators=150,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False),  #0.
        }
    return models[key]

In [123]:
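# My own variables
experiments =  [#'decisiontree', 'gradientboosting',
                'randomforest-x', 'randomforest-y', 'randomforest-z',
                #'gradientboosting',
              ]
ns = range(2)   # Assumption: `ns` is not defined in the cells shown; the output below has two runs per model
for key in experiments:
    for n in ns:   # Repeat each experiment, since every fit is randomised
        print(f"""\n----------------------------------------
        \nBeginning experiment with this model: {key}""")
        selected_model = select_model(key)                                    # Fresh, untrained model
        trained_model = fit_or_predict(selected_model, X, y, predict=False)   # Train on all the data
        answer = trained_model.predict(test_dataset)                          # Predict the unlabeled test set
        my_pred = pd.DataFrame(answer).to_csv(header=None)                    # CSV string for the API
        res = send_prediction(my_pred)   # Prints the API response and returns None
        print(res)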


----------------------------------------
        
Beginning experiment with this model: randomforest-x
I wont split the data you have into train-test, but I am going to train your model: 
RandomForestClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None,
                       criterion='entropy', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=150, n_jobs=2,
                       oob_score=False, random_state=None, verbose=1,
                       warm_start=False)


[Parallel(n_jobs=2)]: Using backend ThreadingBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:    1.9s
[Parallel(n_jobs=2)]: Done 150 out of 150 | elapsed:    6.0s finished
[Parallel(n_jobs=2)]: Using backend ThreadingBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:    0.1s


OK. I just fit this model, try to use its `predict_` method with the `X_test`


[Parallel(n_jobs=2)]: Done 150 out of 150 | elapsed:    0.2s finished


{'accuracy': 0.9985022466300549, 'quote': "AWESOME! A-W-E-S-O-M-E! Amazing score!!! So cool! I can't even... But wait, maybe...too good to be true? Overfit much?", 'tip': 'If you think you may have overfitted your model, visit http://apila-bible.herokuapp.com/check/overfit on your browser for some follow up. ;)'}
None

----------------------------------------
        
Beginning experiment with this model: randomforest-x
I wont split the data you have into train-test, but I am going to train your model: 
RandomForestClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None,
                       criterion='entropy', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=150, n_jobs=2,
                       oob_score=False, random_state=N

[Parallel(n_jobs=2)]: Using backend ThreadingBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:    1.8s
[Parallel(n_jobs=2)]: Done 150 out of 150 | elapsed:    6.4s finished
[Parallel(n_jobs=2)]: Using backend ThreadingBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:    0.1s


OK. I just fit this model, try to use its `predict_` method with the `X_test`


[Parallel(n_jobs=2)]: Done 150 out of 150 | elapsed:    0.2s finished


{'accuracy': 0.9980029955067399, 'quote': "AWESOME! A-W-E-S-O-M-E! Amazing score!!! So cool! I can't even... But wait, maybe...too good to be true? Overfit much?", 'tip': 'If you think you may have overfitted your model, visit http://apila-bible.herokuapp.com/check/overfit on your browser for some follow up. ;)'}
None

----------------------------------------
        
Beginning experiment with this model: randomforest-y
I wont split the data you have into train-test, but I am going to train your model: 
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='entropy', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=154,
                       n_jobs=None, oob_score=False, random_state

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 154 out of 154 | elapsed:    7.8s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


OK. I just fit this model, try to use its `predict_` method with the `X_test`


[Parallel(n_jobs=1)]: Done 154 out of 154 | elapsed:    0.3s finished


{'accuracy': 0.997378931602596, 'quote': "AWESOME! A-W-E-S-O-M-E! Amazing score!!! So cool! I can't even... But wait, maybe...too good to be true? Overfit much?", 'tip': 'If you think you may have overfitted your model, visit http://apila-bible.herokuapp.com/check/overfit on your browser for some follow up. ;)'}
None

----------------------------------------
        
Beginning experiment with this model: randomforest-y
I wont split the data you have into train-test, but I am going to train your model: 
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='entropy', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=154,
                       n_jobs=None, oob_score=False, random_state=

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 154 out of 154 | elapsed:    7.3s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


OK. I just fit this model, try to use its `predict_` method with the `X_test`


[Parallel(n_jobs=1)]: Done 154 out of 154 | elapsed:    0.3s finished


{'accuracy': 0.9976285571642536, 'quote': "AWESOME! A-W-E-S-O-M-E! Amazing score!!! So cool! I can't even... But wait, maybe...too good to be true? Overfit much?", 'tip': 'If you think you may have overfitted your model, visit http://apila-bible.herokuapp.com/check/overfit on your browser for some follow up. ;)'}
None

----------------------------------------
        
Beginning experiment with this model: randomforest-z
I wont split the data you have into train-test, but I am going to train your model: 
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='entropy', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=150,
                       n_jobs=None, oob_score=False, random_state

# 🏆️ HALL OF FAME 🏆️

This will at some point be sent to a database, storing the parameters used and the score.

    """    
    0.9980029955067399
    RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='entropy', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=150,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)
                       
    0.9971293060409386
    RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='entropy', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=167,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

    
    0.9942586120818772
    RandomForestClassifier(n_estimators=199)
    
    
    0.9941337993010484
    RandomForestClassifier(n_estimators=220)

    0.9938841737393909
    RandomForestClassifier(n_estimators=242)
    
    0.9937593609585622
    RandomForestClassifier(n_estimators=220)    
    
    0.9936345481777334
    RandomForestClassifier(n_estimators=180)
    
    0.9933849226160759
    RandomForestClassifier(n_estimators=166)
    
    0.9876435346979531
    DecisionTreeClassifier
    
    0.9377184223664503
    GradientBoostingClassifier
    """

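As a possible starting point for that database, the sketch below stores each experiment as a row in SQLite, with the parameters serialised as JSON. The table name, column names and helper functions are assumptions. The three example rows reuse scores and settings from the list above.

```python
import json
import sqlite3

def open_hall_of_fame(path=":memory:"):
    """Open (or create) the database and make sure the results table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results ("
        "id INTEGER PRIMARY KEY, model TEXT, params TEXT, accuracy REAL)"
    )
    return conn

def record(conn, model, params, accuracy):
    """Insert one experiment; params is stored as a JSON string."""
    conn.execute(
        "INSERT INTO results (model, params, accuracy) VALUES (?, ?, ?)",
        (model, json.dumps(params, sort_keys=True), accuracy),
    )
    conn.commit()

def top(conn, n=3):
    """Return the n best rows as (model, params dict, accuracy), best first."""
    rows = conn.execute(
        "SELECT model, params, accuracy FROM results ORDER BY accuracy DESC LIMIT ?",
        (n,),
    ).fetchall()
    return [(model, json.loads(params), acc) for model, params, acc in rows]

conn = open_hall_of_fame()  # in-memory database for the example
record(conn, "RandomForestClassifier",
       {"criterion": "entropy", "n_estimators": 150}, 0.9980029955067399)
record(conn, "RandomForestClassifier", {"n_estimators": 199}, 0.9942586120818772)
record(conn, "GradientBoostingClassifier", {}, 0.9377184223664503)

best_model, best_settings, best_accuracy = top(conn, n=1)[0]
print(best_model, best_settings, best_accuracy)
```

Passing a file path instead of `":memory:"` would keep the results between notebook sessions.
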
