The Avila data set was extracted from 800 images of the "Avila Bible", a giant Latin copy of the whole Bible produced during the XII century between Italy and Spain.
Palaeographic analysis of the manuscript identified 12 copyists. The number of pages written by each copyist varies.
Each pattern contains 10 features and corresponds to a group of 4 consecutive rows.

The prediction task consists of associating each pattern with one of the 12 copyists (labeled A, B, C, D, E, F, G, H, I, W, X, Y).
The data have been normalized using the Z-normalization method and divided into two data sets: a training set containing 10430 samples and a test set containing 10437 samples.
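
Z-normalization rescales each feature to zero mean and unit standard deviation. A minimal NumPy sketch of the idea, on made-up values (the array `raw` is illustrative and is not taken from the Avila files):

```python
import numpy as np

# Hypothetical raw measurements: 4 samples, 2 features (not Avila data).
raw = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [3.0, 30.0],
                [4.0, 40.0]])

# Z-normalization: subtract each column's mean, divide by its standard deviation.
mean = raw.mean(axis=0)
std = raw.std(axis=0)
z = (raw - mean) / std

print(z.mean(axis=0))  # approximately 0 for every feature
print(z.std(axis=0))   # approximately 1 for every feature
```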

Class distribution (training set)
A: 4286
B: 5  
C: 103 
D: 352 
E: 1095 
F: 1961 
G: 446 
H: 519
I: 831
W: 44
X: 522 
Y: 266
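
The counts above are strongly imbalanced, which matters when reading an accuracy score. As a rough sketch, the snippet below recomputes the total and the accuracy of a trivial classifier that always predicts the most frequent copyist (about 41%); the variable names are illustrative.

```python
# Training-set class counts as listed above.
counts = {"A": 4286, "B": 5, "C": 103, "D": 352, "E": 1095, "F": 1961,
          "G": 446, "H": 519, "I": 831, "W": 44, "X": 522, "Y": 266}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)

# Accuracy obtained by always predicting the majority class.
baseline_accuracy = counts[majority_class] / total

print(total)
print(majority_class, round(baseline_accuracy, 3))
```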

ATTRIBUTE DESCRIPTION

ID      Name    
F1       intercolumnar distance 
F2       upper margin 
F3       lower margin 
F4       exploitation 
F5       row number 
F6       modular ratio 
F7       interlinear spacing 
F8       weight 
F9       peak number 
F10     modular ratio/ interlinear spacing
Class: A, B, C, D, E, F, G, H, I, W, X, Y


In [36]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
import pickle

In [2]:
database = pd.read_csv('avila-tr.txt', sep=",")

In [3]:
database_test = pd.read_csv('avila-ts.txt', header = None, sep=",")

In [4]:
database_test.columns = ["intercolumnar distance", "upper margin", "lower margin", "exploitation", "row number", "modular ratio",
                        "interlinear spacing", "weight", "peak number", "modular ratio/ interlinear spacing", "class"]

In [5]:
database = database.rename(columns={"intercolumnar distance": "F1", "upper margin": "F2", "lower margin": "F3", "exploitation": "F4",
                        "row number": "F5", "modular ratio": "F6", "interlinear spacing": "F7", "weight": "F8", 
                        "peak number": "F9", "modular ratio/ interlinear spacing": "F10", " class": "class"})
database_test = database_test.rename(columns={"intercolumnar distance": "F1", "upper margin": "F2", "lower margin": "F3", "exploitation": "F4",
                        "row number": "F5", "modular ratio": "F6", "interlinear spacing": "F7", "weight": "F8", 
                        "peak number": "F9", "modular ratio/ interlinear spacing": "F10"})

In [6]:
database.columns, database_test.columns

(Index(['F1', 'F2', 'F3', 'F4', 'F5', 'F6', 'F7', 'F8', 'F9', 'F10', 'class'], dtype='object'),
 Index(['F1', 'F2', 'F3', 'F4', 'F5', 'F6', 'F7', 'F8', 'F9', 'F10', 'class'], dtype='object'))

In [7]:
database.dtypes, database_test.dtypes

(F1       float64
 F2       float64
 F3       float64
 F4       float64
 F5       float64
 F6       float64
 F7       float64
 F8       float64
 F9       float64
 F10      float64
 class     object
 dtype: object,
 F1       float64
 F2       float64
 F3       float64
 F4       float64
 F5       float64
 F6       float64
 F7       float64
 F8       float64
 F9       float64
 F10      float64
 class     object
 dtype: object)

#### The variables have the same types in the training and test sets, and all of them are floats except one: 'class', our target variable. The problem is therefore a classification task.

In [8]:
target = 'class'

In [9]:
database.columns

Index(['F1', 'F2', 'F3', 'F4', 'F5', 'F6', 'F7', 'F8', 'F9', 'F10', 'class'], dtype='object')

#### We split the dataset by separating out the target variable, so that X and y are clearly distinguished

In [22]:
y = database['class']
x = database.drop('class', axis = 1)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state = 0)
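
Since some classes are very rare (class B has only 5 training samples), a stratified split may be worth considering so that every class appears in both parts. The sketch below uses synthetic data, not the Avila files; `X_demo` and `y_demo` are hypothetical names, and `stratify` is an option of `train_test_split` that the notebook itself does not use.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the features: 200 samples, 10 features,
# with a deliberately rare class "B" (8 samples).
X_demo = rng.normal(size=(200, 10))
y_demo = np.array(["A"] * 152 + ["F"] * 40 + ["B"] * 8)

# stratify=y_demo keeps the class proportions similar in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

# Number of rare-class samples that ended up in each split.
b_train = int((y_tr == "B").sum())
b_test = int((y_te == "B").sum())
print(b_train, b_test)
```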

# Preprocessing

#### Check for NA values

In [11]:
database.isnull().values.any()

False

#### There are no NA values, so no cleaning is needed at this stage

#### Normalizing the values

In [23]:
scaler = preprocessing.StandardScaler().fit(x_train)

In [24]:
scaler.mean_, scaler.scale_

(array([ 0.00169849,  0.03980698,  0.00153056, -0.00548555,  0.00890724,
         0.02280871,  0.00851373,  0.00462168,  0.01596709,  0.0057365 ]),
 array([0.99804695, 4.46715465, 1.15367818, 1.0197531 , 0.98830433,
        1.16779084, 1.3728213 , 0.99328777, 1.11492302, 1.00168716]))

In [25]:
x_train = scaler.transform(x_train)

In [26]:
x_train.mean(axis = 0), x_train.std(axis = 0)

(array([ 1.27174614e-17, -2.27097525e-18,  5.45034060e-18, -2.27097525e-17,
         1.27174614e-17,  1.27174614e-17,  9.08390099e-18,  2.72517030e-18,
        -8.62970594e-18, -1.77136069e-17]),
 array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]))

#### The values are now standardized, so that differences in scale between the variables do not influence the result
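
The cells above scale only `x_train`. Any held-out split would need the same transformation, using the statistics learned on the training split, before a model is evaluated on it. A minimal sketch on synthetic data (`train_demo`, `test_demo` and `demo_scaler` are hypothetical names, not the notebook's variables):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins for a training and a held-out split (not Avila values).
train_demo = rng.normal(loc=5.0, scale=2.0, size=(300, 3))
test_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 3))

# Fit on the training split only, then reuse the same statistics for both.
demo_scaler = StandardScaler().fit(train_demo)
train_scaled = demo_scaler.transform(train_demo)
test_scaled = demo_scaler.transform(test_demo)

# The training split is standardized exactly; the held-out split only
# approximately, because it is scaled with the training mean and scale.
print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled.mean(axis=0), test_scaled.std(axis=0))
```

Fitting the scaler on the training split alone avoids leaking information from the held-out data into preprocessing.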

In [28]:
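from sklearn.model_selection import validation_curve, cross_val_score, GridSearchCV
# sklearn.utils.testing is gone from recent scikit-learn releases;
# ignore_warnings is available from the private _testing module instead
from sklearn.utils._testing import ignore_warnings
from sklearn.exceptions import ConvergenceWarning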



In [29]:
@ignore_warnings(category=ConvergenceWarning)
def test_model(model, X, y):
  """
  Evaluate a model with cross-validation and return its mean accuracy
  """
  accuracy = cross_val_score(model, X, y, scoring='accuracy', verbose=1)
  avg_accuracy = accuracy.mean()

  return avg_accuracy

In [30]:
@ignore_warnings(category=ConvergenceWarning)
def train_model(model, params, X, y):
  """
  Tune a model with a grid search and return the best parameters and the best estimator
  """
  grid = GridSearchCV(model, params, verbose=1)

  grid.fit(X, y)

  best_params = grid.best_params_
  best_model = grid.best_estimator_

  return best_params, best_model
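
The notebook defines `train_model` but does not show a call to it. As a rough illustration of the grid search it wraps, the sketch below tunes a k-nearest-neighbours classifier on synthetic data. The parameter grid is an assumption made for this example; the notebook does not specify one.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 3-class problem standing in for the scaled features.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)

# Hypothetical search space (not taken from the notebook).
param_grid = {"n_neighbors": [1, 3, 5, 7], "weights": ["uniform", "distance"]}

# Each parameter combination is scored by 5-fold cross-validated accuracy.
grid = GridSearchCV(KNeighborsClassifier(), param_grid, scoring="accuracy", cv=5)
grid.fit(X_demo, y_demo)

print(grid.best_params_)
print(grid.best_score_)
```

With the default `refit=True`, `grid.best_estimator_` is refitted on all the data passed to `fit`, so it can be used for prediction directly.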

In [31]:
from sklearn.neighbors import KNeighborsClassifier

In [32]:
knn_model = KNeighborsClassifier()

In [34]:
knn_accuracy = test_model(knn_model, x_train, y_train)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.5s finished


In [35]:
print('Accuracy:\t', knn_accuracy)

Accuracy:	 0.6807736368613287


In [37]:
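filename = 'finalized_model.sav'
knn_model.fit(x_train, y_train)  # cross_val_score only fits clones, so fit before saving
with open(filename, 'wb') as f:
    pickle.dump(knn_model, f)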
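
A saved model is only useful on reload if it was fitted before being pickled. The sketch below checks a pickle round trip on synthetic data; it serializes to bytes with `pickle.dumps` / `pickle.loads` rather than to a file, and all names in it are illustrative.

```python
import pickle

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in data: 60 samples, 10 features, 3 classes.
X_demo = rng.normal(size=(60, 10))
y_demo = np.array(["A", "F", "E"] * 20)

# Fit first, then serialize: an unfitted estimator cannot predict after reload.
model = KNeighborsClassifier().fit(X_demo, y_demo)

# Round trip through bytes; pickle.dump / pickle.load do the same via a file.
restored = pickle.loads(pickle.dumps(model))

# The restored model should reproduce the original predictions.
same = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print(same)
```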