## Data description
For each CA, two CSV files are provided: x_train and y_train. Each row of an x_train file holds a vector of rational numbers. These numbers are assessments of an object produced by given assessment models. All values in a single column (e.g. column 1) of one file (e.g. x_train_CA1) were produced by the same model. The corresponding y_train file holds the expected binary label (with noise) for each row of the x_train file. In that sense, $y = f_{CA}(x)$.

## Goal

Using the provided data and an analysis of it, build a prediction model that produces the y value from an x vector without knowing the function $f_{CA}$ (frankly, we do not know it either!).

## Expected outcome

A Jupyter notebook (not to be confused with Jupiter; we have no premises on that planet) containing a description of the data, its analysis, and the reason and purpose of each step along with its input, output and result. The notebook should produce a final model or models that predict the y value from an x vector for each CA, together with an analysis of the results.

## Modules import

In [145]:
import numpy as np
import pandas as pd
import os
import joblib
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score
from time import time


## Data Load and describe

In [146]:
def describe_data(train_data, labels_data):
    print(train_data.describe())
    print(labels_data.describe())

def describe_shape(CA_number, train_data, labels_data):
    # Use the arguments instead of the global lists, so the function works on any pair of arrays.
    print(f"shape for train data in set number {CA_number}: ", train_data.shape)
    print(f"shape for labels in set number  {CA_number}: ",   labels_data.shape)
    print(80*"-")
    

In [147]:
train = []
labels = []

for CA_number in range(1,18):
    train_path = os.path.join(os.getcwd(), 'dataset', f'x_train_CA{CA_number}.csv')
    labels_path = os.path.join(os.getcwd(), 'dataset', f'y_train_CA{CA_number}.csv')
    
    train_data = pd.read_csv(train_path, header=None)
    labels_data = pd.read_csv(labels_path, header=None)
    
    describe_data(train_data, labels_data)
    train.append(np.array(train_data))
    labels.append(np.array(labels_data))
    describe_shape(CA_number-1, train_data, labels_data)


                0            1            2            3            4   \
count  1060.000000  1060.000000  1060.000000  1060.000000  1060.000000   
mean      0.406296     0.302534     0.389756     0.367475     0.406316   
std       0.389307     0.323628     0.390292     0.384839     0.389539   
min       0.000000     0.000000     0.000000     0.000000     0.000000   
25%       0.005148     0.000239     0.003323     0.000565     0.004901   
50%       0.340362     0.312962     0.329896     0.322059     0.337146   
75%       0.679549     0.658765     0.677412     0.672623     0.680219   
max       1.000000     1.000000     1.000000     1.000000     1.000000   

                5            6            7            8            9   ...  \
count  1060.000000  1060.000000  1060.000000  1060.000000  1060.000000  ...   
mean      0.302119     0.390031     0.367795     0.099622     0.096031  ...   
std       0.323106     0.389898     0.384444     0.262318     0.256830  ...   
min       0.00000

                 0
count  1060.000000
mean      0.049760
std       0.203441
min       0.000000
25%       0.000000
50%       0.000003
75%       0.014018
max       1.000000
shape for train data in set number 3:  (1060, 96)
shape for labels in set number  3:  (1060, 1)
--------------------------------------------------------------------------------
                0            1            2            3            4   \
count  1060.000000  1060.000000  1060.000000  1060.000000  1060.000000   
mean      0.258850     0.390481     0.243240     0.227599     0.257207   
std       0.309700     0.384880     0.305310     0.302047     0.309545   
min       0.000000     0.000000     0.000000     0.000000     0.000000   
25%       0.000000     0.001553     0.000000     0.000000     0.000000   
50%       0.020241     0.332325     0.017197     0.015916     0.019318   
75%       0.649127     0.677760     0.645170     0.642074     0.647632   
max       1.000000     1.000000     1.000000     1.000000   

                0            1            2            3            4   \
count  1060.000000  1060.000000  1060.000000  1060.000000  1060.000000   
mean      0.607527     0.613617     0.590604     0.571106     0.607393   
std       0.368835     0.370294     0.377010     0.384977     0.368989   
min       0.000000     0.000000     0.000000     0.000000     0.000000   
25%       0.328699     0.327281     0.321277     0.315063     0.326707   
50%       0.667849     0.669369     0.666598     0.659988     0.668020   
75%       0.990529     0.992794     0.990383     0.990617     0.991563   
max       1.000000     1.000000     1.000000     1.000000     1.000000   

                5            6            7            8            9   ...  \
count  1060.000000  1060.000000  1060.000000  1060.000000  1060.000000  ...   
mean      0.613300     0.590358     0.570889     0.401203     0.406784  ...   
std       0.369938     0.377014     0.384975     0.424861     0.426113  ...   
min       0.00000

                0            1            2            3            4   \
count  1060.000000  1060.000000  1060.000000  1060.000000  1060.000000   
mean      0.369466     0.495564     0.357977     0.346858     0.369698   
std       0.350320     0.389686     0.350761     0.352280     0.350800   
min       0.000000     0.000000     0.000000     0.000000     0.000000   
25%       0.007841     0.013759     0.005232     0.004175     0.005850   
50%       0.331877     0.648268     0.324795     0.321454     0.334076   
75%       0.666204     0.978025     0.663880     0.664283     0.668347   
max       1.000000     1.000000     1.000000     1.000000     1.000000   

                5            6            7            8            9   ...  \
count  1060.000000  1060.000000  1060.000000  1060.000000  1060.000000  ...   
mean      0.495877     0.358560     0.346220     0.214039     0.212893  ...   
std       0.389692     0.351388     0.351642     0.358916     0.358024  ...   
min       0.00000

                0            1            2            3            4   \
count  1059.000000  1059.000000  1059.000000  1059.000000  1059.000000   
mean      0.508510     0.523688     0.496452     0.485938     0.509036   
std       0.382495     0.381556     0.387406     0.384182     0.383088   
min       0.000000     0.000000     0.000000     0.000000     0.000000   
25%       0.018829     0.020317     0.014919     0.016900     0.019283   
50%       0.649211     0.653107     0.649270     0.644693     0.651279   
75%       0.977373     0.980536     0.976443     0.690894     0.978150   
max       1.000000     1.000000     1.000000     1.000000     1.000000   

                5            6            7            8            9   ...  \
count  1059.000000  1059.000000  1059.000000  1059.000000  1059.000000  ...   
mean      0.524426     0.496734     0.486583     0.235173     0.240732  ...   
std       0.381698     0.387180     0.384238     0.373042     0.376884  ...   
min       0.00000

##### Purpose
Load the data into NumPy arrays and take a first look at its structure with pandas' describe(). First conclusions:
 * values are continuous, between 0 and 1; noise needs to be reduced
 * sets 6, 13, 16 and 17 have different shapes; the others are 1060 x 96
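
The shape check above can also be done programmatically instead of by reading the printed output. The following is a minimal sketch, assuming the arrays are kept in a dict keyed by CA number; `example_sets` and `find_odd_shapes` are hypothetical names, and the shapes are stand-ins rather than the real data.

```python
from collections import Counter

import numpy as np

# Hypothetical stand-ins for the per-CA arrays; only the shapes matter here.
example_sets = {
    1: np.zeros((1060, 96)),
    2: np.zeros((1060, 96)),
    3: np.zeros((1059, 96)),
}

def find_odd_shapes(sets):
    """Return the most common shape and the sets that deviate from it."""
    shape_counts = Counter(arr.shape for arr in sets.values())
    common_shape = shape_counts.most_common(1)[0][0]
    odd = {name: arr.shape for name, arr in sets.items() if arr.shape != common_shape}
    return common_shape, odd

common_shape, odd = find_odd_shapes(example_sets)
print("most common shape:", common_shape)
print("sets with a different shape:", odd)
```

Knowing which sets deviate, and in which dimension, matters later: a different row count is harmless, while a different column count would mean that a vector from one CA cannot be fed to another CA's model.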

## Labels convert

In [148]:
def convert_to_binary(CA_number):
    # Threshold the noisy labels at 0.5 in one vectorized step; keep the original dtype.
    labels[CA_number] = (labels[CA_number] >= 0.5).astype(labels[CA_number].dtype)

for CA_number in range(1, 18):
    convert_to_binary(CA_number - 1)


##### Purpose
The data description says that y_train holds an expected binary label with noise, so the labels are converted to binary values by thresholding at 0.5.
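
The describe() output shown earlier for one label file has a mean of about 0.05 and a median close to 0, which suggests that positive labels may be rare. A minimal sketch of how the class balance could be checked after thresholding follows; `positive_fraction` and `example_labels` are hypothetical, and the values are made up for illustration.

```python
import numpy as np

def positive_fraction(noisy_labels, threshold=0.5):
    """Threshold noisy labels and return the share of positives."""
    binary = (np.asarray(noisy_labels).ravel() >= threshold).astype(int)
    return binary.mean()

# Hypothetical noisy labels: mostly near 0, a few near 1.
example_labels = np.array(
    [[0.0], [0.01], [0.98], [0.000003], [0.49], [0.5], [0.02], [1.0]]
)
print(f"positive fraction: {positive_fraction(example_labels):.3f}")
```

If the positive fraction turns out to be small for a given CA, accuracy alone says little about that CA's model, and precision and recall deserve more attention.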


## Split into train, validation, and test set

In [134]:
X_train = []
X_test  = []
X_val   = []
y_train = []
y_test  = []
y_val   = []

def split_train_val_test(CA_number):
    # train set 60%, rest 40%; the _t suffix marks temporary variables
    X_train_t, X_test_t, y_train_t, y_test_t = train_test_split(train[CA_number],
                                                                labels[CA_number],
                                                                test_size=0.4,
                                                                random_state=42
                                                               )
    # split the remaining 40% in half: validation set 20%, test set 20%
    X_val_t, X_test_t, y_val_t, y_test_t = train_test_split(X_test_t,
                                                            y_test_t,
                                                            test_size=0.5,
                                                            random_state=42
                                                           )
    X_train.append(np.array(X_train_t))
    X_val.append(np.array(X_val_t))
    X_test.append(np.array(X_test_t))
    y_train.append(np.array(y_train_t))
    y_val.append(np.array(y_val_t))
    y_test.append(np.array(y_test_t))
    

In [149]:
# split for all CA 
for CA_number in range(0, 17):
    split_train_val_test(CA_number)
    

##### Purpose
For each CA, the data is split in a 60% / 20% / 20% ratio into a train set, a validation set and a test set. The validation set is meant for hyperparameter tuning and the test set for final verification of the results.
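
A variation worth considering, not used in this notebook, is a stratified split, which keeps the share of positive labels roughly equal across the three sets. The sketch below uses the `stratify` parameter of `train_test_split` on synthetic data; the array sizes and the 10% positive rate are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 96 columns, roughly 10% positive labels.
rng = np.random.default_rng(42)
X = rng.random((1000, 96))
y = (rng.random(1000) < 0.1).astype(int)

# 60% train, 40% rest; stratify keeps the class ratio in both parts.
X_tr, X_rest, y_tr, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y
)
# Split the rest in half: 20% validation, 20% test.
X_va, X_te, y_va, y_te = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42, stratify=y_rest
)

for name, part in [("train", y_tr), ("val", y_va), ("test", y_te)]:
    print(f"{name}: n={len(part)}, positive fraction={part.mean():.3f}")
```

With few positives per CA, an unstratified split can by chance leave the validation or test set with very few of them, which makes precision and recall on that set unstable.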

## Train models

After quick tests with basic ML algorithms, I found that all of them perform really well. I assume (and hope) that choosing the one with the best performance isn't important in this task, so I chose support vector machines because I like them. I then trained the models with GridSearchCV for fast parameter selection.
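
The quick tests themselves are not part of this notebook. As a rough sketch of what such a comparison could look like, the example below fits three candidate classifiers on a train split and scores them on a held-out validation split. The choice of algorithms and the synthetic data from `make_classification` are assumptions, not the tests that were actually run.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for one CA.
X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=0)

# Hypothetical candidate set; any scikit-learn classifier would fit here.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "svc_linear": SVC(kernel="linear"),
}

# Fit each candidate on the train split and record its validation accuracy.
val_scores = {
    name: clf.fit(X_tr, y_tr).score(X_va, y_va) for name, clf in candidates.items()
}
for name, acc in sorted(val_scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: validation accuracy {acc:.3f}")
```

Comparing on the validation split keeps the test set untouched, so it can still serve as a final, unbiased check of whichever candidate is chosen.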

In [136]:
def print_results(results):
    '''Prints best parameters and results of training 
    for each parameter in GridSearch. 
    '''
    
    print(f'BEST PARAMS: {results.best_params_}\n')

    means = results.cv_results_['mean_test_score']
    stds = results.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, results.cv_results_['params']):
        print(f'{round(mean, 3)} (+/-{round(std * 2, 3)}) for {params}')
        

In [150]:
def train_and_save_model(CA_number):
    '''Trains an SVM model using GridSearchCV (5-fold cross-validation)
    over the parameters stored in the dict and saves the best model to a pickle file.
    '''
   
    model_path = os.path.join(os.getcwd(), 'models', f'SVM_model_CA_{CA_number + 1}.pkl')
    svc = SVC()
    parameters = {
        'kernel': ['linear', 'rbf'],
        'C': [0.001, 0.1, 1, 10]
    }

    cv = GridSearchCV(svc, parameters, cv=5)
    cv.fit(X_train[CA_number], y_train[CA_number].ravel())
    
    print_results(cv)
    joblib.dump(cv.best_estimator_, model_path)
    

In [138]:
for CA_number in range(0, 17):
    train_and_save_model(CA_number)


BEST PARAMS: {'C': 1, 'kernel': 'linear'}

0.954 (+/-0.006) for {'C': 0.001, 'kernel': 'linear'}
0.954 (+/-0.006) for {'C': 0.001, 'kernel': 'rbf'}
0.98 (+/-0.008) for {'C': 0.1, 'kernel': 'linear'}
0.97 (+/-0.006) for {'C': 0.1, 'kernel': 'rbf'}
0.997 (+/-0.008) for {'C': 1, 'kernel': 'linear'}
0.978 (+/-0.012) for {'C': 1, 'kernel': 'rbf'}
0.997 (+/-0.008) for {'C': 10, 'kernel': 'linear'}
0.987 (+/-0.008) for {'C': 10, 'kernel': 'rbf'}
BEST PARAMS: {'C': 0.1, 'kernel': 'rbf'}

0.877 (+/-0.007) for {'C': 0.001, 'kernel': 'linear'}
0.877 (+/-0.007) for {'C': 0.001, 'kernel': 'rbf'}
0.937 (+/-0.022) for {'C': 0.1, 'kernel': 'linear'}
0.937 (+/-0.017) for {'C': 0.1, 'kernel': 'rbf'}
0.932 (+/-0.029) for {'C': 1, 'kernel': 'linear'}
0.936 (+/-0.023) for {'C': 1, 'kernel': 'rbf'}
0.917 (+/-0.044) for {'C': 10, 'kernel': 'linear'}
0.931 (+/-0.038) for {'C': 10, 'kernel': 'rbf'}
BEST PARAMS: {'C': 1, 'kernel': 'linear'}

0.934 (+/-0.026) for {'C': 0.001, 'kernel': 'linear'}
0.753 (+/-0.007)

##### Purpose
Keep all trained models in pickle files for quick and easy access later.
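
A minimal sketch of the save and load round trip follows, assuming a throwaway temporary folder instead of the notebook's models folder. The file name `SVM_model_example.pkl` and the toy data are hypothetical. Note that `joblib.dump` does not create missing folders, so the target folder has to exist beforehand.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.svm import SVC

# Toy data: the label depends only on the first column.
rng = np.random.default_rng(0)
X = rng.random((40, 4))
y = (X[:, 0] > 0.5).astype(int)

model = SVC(kernel="linear").fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_dir = os.path.join(tmp_dir, "models")
    os.makedirs(model_dir, exist_ok=True)  # create the folder before dumping
    model_path = os.path.join(model_dir, "SVM_model_example.pkl")
    joblib.dump(model, model_path)
    restored = joblib.load(model_path)

# The restored model should give exactly the same predictions.
same_predictions = np.array_equal(model.predict(X), restored.predict(X))
print("restored model predicts identically:", same_predictions)
```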

## Load models and evaluate

In [154]:
models = {}

for m_number in range(1, 18):
    models[m_number] = joblib.load(f'models/SVM_model_CA_{m_number}.pkl')

models


{1: SVC(C=1, kernel='linear'),
 2: SVC(C=0.1),
 3: SVC(C=1, kernel='linear'),
 4: SVC(C=1, kernel='linear'),
 5: SVC(C=0.1, kernel='linear'),
 6: SVC(C=1, kernel='linear'),
 7: SVC(C=10),
 8: SVC(C=1, kernel='linear'),
 9: SVC(C=0.1, kernel='linear'),
 10: SVC(C=1, kernel='linear'),
 11: SVC(C=1, kernel='linear'),
 12: SVC(C=0.1, kernel='linear'),
 13: SVC(C=1),
 14: SVC(C=1, kernel='linear'),
 15: SVC(C=0.1, kernel='linear'),
 16: SVC(C=1),
 17: SVC(C=10, kernel='linear')}

In [155]:
def evaluate_model(model, features, labels):
    ''' Predicts labels for the given features (X) with an already loaded model,
    then calculates and prints accuracy, precision, recall and prediction latency
    '''
    start = time()
    pred  = model.predict(features)
    end   = time()
    
    accuracy  = round(accuracy_score(labels, pred), 3)
    precision = round(precision_score(labels, pred), 3)
    recall    = round(recall_score(labels, pred), 3)
    
    print(f'Accuracy: {accuracy} | Precision: {precision} |'
          f'Recall: {recall} | Latency: {round((end - start)*1000, 1)}ms')
    

## Evaluate on validation sets
Evaluate the model of each CA on its validation set.

In [156]:
for CA_number in range(1, 18):
    print(f'CA number: {CA_number}')      # models are keyed from 1, list indexes start at 0, hence CA_number - 1
    evaluate_model(models[CA_number], X_val[CA_number - 1], y_val[CA_number - 1])
    print(80*'-')
    

CA number: 1
Accuracy: 0.995 | Precision: 0.923 |Recall: 1.0 | Latency: 1.0ms
--------------------------------------------------------------------------------
CA number: 2
Accuracy: 0.953 | Precision: 0.792 |Recall: 0.792 | Latency: 3.9ms
--------------------------------------------------------------------------------
CA number: 3
Accuracy: 0.986 | Precision: 1.0 |Recall: 0.941 | Latency: 1.0ms
--------------------------------------------------------------------------------
CA number: 4
Accuracy: 0.972 | Precision: 1.0 |Recall: 0.5 | Latency: 2.0ms
--------------------------------------------------------------------------------
CA number: 5
Accuracy: 1.0 | Precision: 1.0 |Recall: 1.0 | Latency: 0.0ms
--------------------------------------------------------------------------------
CA number: 6
Accuracy: 0.995 | Precision: 1.0 |Recall: 0.944 | Latency: 0.0ms
--------------------------------------------------------------------------------
CA number: 7
Accuracy: 0.972 | Precision: 1.0 |Rec

## Evaluate on test sets

Evaluate the model of each CA on its test set.

In [157]:
for CA_number in range(1, 18):
    print(f'CA number: {CA_number}')      # models are keyed from 1, list indexes start at 0, hence CA_number - 1
    evaluate_model(models[CA_number], X_test[CA_number - 1], y_test[CA_number - 1])
    print(80*'-')


CA number: 1
Accuracy: 1.0 | Precision: 1.0 |Recall: 1.0 | Latency: 0.0ms
--------------------------------------------------------------------------------
CA number: 2
Accuracy: 0.925 | Precision: 0.812 |Recall: 0.5 | Latency: 4.2ms
--------------------------------------------------------------------------------
CA number: 3
Accuracy: 0.986 | Precision: 0.961 |Recall: 0.98 | Latency: 1.8ms
--------------------------------------------------------------------------------
CA number: 4
Accuracy: 0.986 | Precision: 1.0 |Recall: 0.667 | Latency: 1.0ms
--------------------------------------------------------------------------------
CA number: 5
Accuracy: 1.0 | Precision: 1.0 |Recall: 1.0 | Latency: 1.0ms
--------------------------------------------------------------------------------
CA number: 6
Accuracy: 1.0 | Precision: 1.0 |Recall: 1.0 | Latency: 1.0ms
--------------------------------------------------------------------------------
CA number: 7
Accuracy: 0.976 | Precision: 0.625 |Recall: 

## Single predict
Based on an X vector, predict the y value with the model of each CA.

In [158]:
def predict(X, y, CA_number):
    ''' Predicts the label for a given vector (X) and prints whether it matches the expected label (y)'''
    X = X.reshape(1, -1) # predict expects a 2D array: one row per sample
    pred = models[CA_number].predict(X)
    
    if pred == y:
        state = ' CORRECT'
    else:
        state = 'WRONG'
    
    print(f' CA: {CA_number} y = {y}, prediction: {pred} {state}')
    
def predict_all(X, y):
    '''Prints the prediction for a given vector (X) with the model of every CA'''
    # predict() only prints, so there is nothing to collect here
    for CA_number in range(1, 18):
        predict(X, y, CA_number)

In [159]:
X = X_test[15][3] # an arbitrary vector from the CA 16 test set (list index 15) and its label
y = y_test[15][3]

predict_all(X, y)

 CA: 1 y = [1.], prediction: [0.] WRONG
 CA: 2 y = [1.], prediction: [0.] WRONG
 CA: 3 y = [1.], prediction: [1.]  CORRECT
 CA: 4 y = [1.], prediction: [0.] WRONG
 CA: 5 y = [1.], prediction: [1.]  CORRECT
 CA: 6 y = [1.], prediction: [1.]  CORRECT
 CA: 7 y = [1.], prediction: [1.]  CORRECT
 CA: 8 y = [1.], prediction: [1.]  CORRECT
 CA: 9 y = [1.], prediction: [1.]  CORRECT
 CA: 10 y = [1.], prediction: [1.]  CORRECT
 CA: 11 y = [1.], prediction: [0.] WRONG
 CA: 12 y = [1.], prediction: [1.]  CORRECT
 CA: 13 y = [1.], prediction: [0.] WRONG
 CA: 14 y = [1.], prediction: [1.]  CORRECT
 CA: 15 y = [1.], prediction: [0.] WRONG
 CA: 16 y = [1.], prediction: [0.] WRONG
 CA: 17 y = [1.], prediction: [1.]  CORRECT


## Conclusions and additional information

* All models perform well, with an accuracy of at least 0.925.
* It's hard to eliminate or choose a CA number based on model predictions, since the data is similar for all CAs. But was that a goal of this task?
* The data analysis could have been done better; knowing the data source and the intended use of the models might make working with synthetic data easier.
* Since I used only one type of model, the validation set was useless. Normally I would use it to choose the best model at this stage and then check its performance on the test set. But I leave it as it is.
* Some parts aren't explained; I believe those lines of code speak for themselves.
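
One caveat on the accuracy figures: when positive labels are rare, accuracy can stay high even if many positives are missed, which is the pattern visible in some results above (for example accuracy 0.972 with recall 0.5). The sketch below works through this with plain Python on a hypothetical set of 200 samples with 10 positives; the numbers are made up and only illustrate the arithmetic.

```python
def accuracy_precision_recall(y_true, y_pred):
    """Compute accuracy, precision and recall from two lists of 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Hypothetical set: 10 positives among 200 samples (5%).
y_true = [1] * 10 + [0] * 190
always_negative = [0] * 200          # never predicts the positive class
half_found = [1] * 5 + [0] * 195     # finds 5 of 10 positives, no false alarms

print(accuracy_precision_recall(y_true, always_negative))  # (0.95, 0.0, 0.0)
print(accuracy_precision_recall(y_true, half_found))       # (0.975, 1.0, 0.5)
```

A model that never predicts the positive class still reaches 0.95 accuracy here, so recall and precision are the more informative figures when comparing the per-CA models.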