# Assignment 5 - Pamfile Alex

## Classification models

Use 5 data sets for classification problems, starting from the repositories specified in Lecture 6. All data sets must have fully specified values (that is, no missing values) and must have at least one input feature that is a nominal categorical variable.

1. Transform the nominal categorical features using one-hot encoding, https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html.
2. (number of models * number of data sets * 1 point = 30 points) For each data set, apply 6 classification models from scikit-learn. For each model, report accuracy, precision, recall and F1 score (see [sklearn.metrics](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics), [Precision and recall](https://en.wikipedia.org/wiki/Precision_and_recall)) using 10-fold cross-validation. Report the mean results for both the training folds and the test folds. The runs use fixed hyperparameter values.
3. (number of models * number of data sets * 1 point = 30 points) Report the performance of each model using 10-fold cross-validation. For each of the 10 runs, search for the optimal hyperparameters using 4-fold cross-validation. The performance of the model is reported as the mean over the 10 runs.
    *Note:* the optimal hyperparameters may differ in each of the 10 runs, because of the data used for training/validation.
4. (number of models * 4 points = 20 points) Document each of the models used in the Jupyter notebook, in Romanian. If the same algorithm is used for several data sets, you may write a separate section that documents the algorithms and refer to it.

20 points are awarded by default.

Examples of classification models:
1. [Multi-layer Perceptron classifier](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html#sklearn.neural_network.MLPClassifier)
1. [KNN](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier)
1. [SVM](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC)
1. [Gaussian processes](https://scikit-learn.org/stable/modules/generated/sklearn.gaussian_process.GaussianProcessClassifier.html#sklearn.gaussian_process.GaussianProcessClassifier)
1. [RBF](https://scikit-learn.org/stable/modules/generated/sklearn.gaussian_process.kernels.RBF.html#sklearn.gaussian_process.kernels.RBF)
1. [Decision tree](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier)
1. [Random forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier)
1. [Gaussian Naive Bayes](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html#sklearn.naive_bayes.GaussianNB) 

*Submission:*
1. The submission is due by 25 November 2022, 23:00, in the assignment on the e-learning platform (Tema 5).
1. Mandatory: type annotations for variables, parameters and return types; docstrings.
1. The data files used are downloaded locally by the students and placed in a "data" directory. A zip archive is created that contains at least the ipynb file(s) and the data directory. Additionally, images included in the ipynb may be used; these are placed in an "images" directory that is included in the submitted zip archive.


In [1]:
import sklearn
import pandas as pd
import numpy as np

print("Sklearn version:", sklearn.__version__)
print("Pandas version:", pd.__version__)
print("Numpy version:", np.__version__)

Sklearn version: 1.0.2
Pandas version: 1.4.4
Numpy version: 1.21.5


In [2]:
from typing import Dict, List, Union
from collections import Counter
import random

In [3]:
import warnings
warnings.filterwarnings("ignore")

In [4]:
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

ModelClassifier = Union[LinearSVC, KNeighborsClassifier, MultinomialNB, DecisionTreeClassifier, RandomForestClassifier, GradientBoostingClassifier]
model_dict:Dict[str, ModelClassifier] = {
        "LinearSVC" : LinearSVC(C=1.0),
        "KNeighborsClassifier" : KNeighborsClassifier(n_neighbors=5, p=2),
        "MultinomialNB" : MultinomialNB(alpha=1.0),
        "DecisionTreeClassifier" : DecisionTreeClassifier(random_state=0, ccp_alpha=0.0),
        "RandomForestClassifier" : RandomForestClassifier(random_state=0, n_estimators=100),
        "GradientBoostingClassifier" : GradientBoostingClassifier(random_state=0, learning_rate=0.1)
    }

In [5]:
def one_hot_encoding(y:np.ndarray) -> pd.core.frame.DataFrame:
    '''
    One-hot encodes the class vector with pandas.get_dummies
    :param y: numpy array, classes
    :return: pandas DataFrame, one-hot encoded classes
    '''
    return pd.get_dummies(y)
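
The helper above encodes the class vector, while the assignment asks for one-hot encoding of nominal input features. A minimal sketch of how `pandas.get_dummies` could be applied to an input column, assuming a hypothetical toy DataFrame (the `color` and `size` columns are illustrative and not part of the data sets used here):

```python
import pandas as pd

# Hypothetical toy data: "color" is a nominal input feature, "size" is numeric.
toy: pd.DataFrame = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],
    "size": [1.0, 2.5, 0.7, 3.1],
})

# Encode only the nominal column; numeric columns pass through unchanged.
# dtype=int keeps 0/1 integers regardless of the pandas version's default dtype.
toy_encoded: pd.DataFrame = pd.get_dummies(toy, columns=["color"], dtype=int)
print(toy_encoded)
```

Each distinct value of `color` becomes its own 0/1 column, so no artificial ordering is imposed on the categories.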

In [6]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import cross_val_score

def models_metrics(x:np.ndarray, y:np.ndarray, cv:int=10) -> None:
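    '''
    Fits 6 different models on a train split and prints their metrics on the test split,
    followed by the mean cross-validated accuracy computed separately on each split
    :param x: numpy array, attribute values
    :param y: numpy array, classes
    :param cv: int, number of folds
    :return: None
    '''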
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=1/3, shuffle=True)
    print("#########################")
    print(f"## Train distribution: {Counter(y_train)}")
    print(f"## Test distribution: {Counter(y_test)}")
    print("#########################\n")
    
    for name, model in model_dict.items():
        model.fit(x_train, y_train)
        y_predicted:np.ndarray = model.predict(x_test)
        print(name)
        print("-----------------------")
        print("Accuracy:", accuracy_score(y_test, y_predicted))
        print("Precision:", precision_score(y_test, y_predicted, average='macro'))
        print("Recall:", recall_score(y_test, y_predicted, average='macro'))
        print("F1 Score:", f1_score(y_test, y_predicted, average='macro'))
        
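        # Note: these are two independent cross-validations, one on the train split and
        # one on the test split; they are not the train/test folds of a single CV run.
        scores_train:np.ndarray = cross_val_score(model, x_train, y_train, cv=cv, scoring='accuracy')
        scores_test:np.ndarray = cross_val_score(model, x_test, y_test, cv=cv, scoring='accuracy')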
        print("Trained fold mean:", scores_train.mean())
        print("Tested fold mean:", scores_test.mean(), end="\n\n")
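
The assignment asks for the mean of each metric over both the training folds and the test folds of a 10-fold cross-validation. One possible way to obtain exactly that is `sklearn.model_selection.cross_validate` with `return_train_score=True`. The sketch below is an illustration and not the code that produced the outputs in this notebook: the helper name `cv_report`, the choice of macro averaging and the use of the bundled Iris data are assumptions.

```python
from typing import Dict, List

import numpy as np
from sklearn.base import ClassifierMixin
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier


def cv_report(model: ClassifierMixin, x: np.ndarray, y: np.ndarray, cv: int = 10) -> Dict[str, float]:
    '''
    Hypothetical helper: mean train/test metrics over the folds of one cross-validation
    :param model: unfitted scikit-learn classifier
    :param x: numpy array, attribute values
    :param y: numpy array, classes
    :param cv: int, number of folds
    :return: dict mapping names such as "train_accuracy" or "test_f1_macro" to the fold mean
    '''
    scoring: List[str] = ["accuracy", "precision_macro", "recall_macro", "f1_macro"]
    results = cross_validate(model, x, y, cv=cv, scoring=scoring, return_train_score=True)
    # Keep only the score entries (skip fit_time / score_time) and average over the folds.
    return {key: float(np.mean(values)) for key, values in results.items()
            if key.startswith(("train_", "test_"))}


demo_x, demo_y = load_iris(return_X_y=True)
report: Dict[str, float] = cv_report(KNeighborsClassifier(n_neighbors=5), demo_x, demo_y)
for metric, value in report.items():
    print(f"{metric}: {value:.3f}")
```

Because every metric comes from the same folds, the train and test means are directly comparable, which makes over-fitting easier to spot.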

In [7]:
def models_performances(x:np.ndarray, y:np.ndarray, cv:int=10) -> None:
    '''
    Prints the mean cross-validated accuracy of 6 different models
    :param x: numpy array, attribute values
    :param y: numpy array, classes
    :param cv: int, number of folds
    :return: None
    '''
    for name, model in model_dict.items():
        print(name)
        scores:np.ndarray = cross_val_score(model, x, y, cv=cv, scoring='accuracy')
        print("Mean score:", scores.mean(), "\n")

In [8]:
def hyperparameters(x:np.ndarray, y:np.ndarray, cv:int=4) -> None:
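    '''
    Searches one hyperparameter for each of 6 different models and prints the candidate
    with the best mean cross-validated accuracy; most candidates are drawn at random
    without a fixed seed, so the reported values vary between runs
    :param x: numpy array, attribute values
    :param y: numpy array, classes
    :param cv: int, number of folds
    :return: None
    '''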
    size:int = 10
    print("LinearSVC")
    sample:np.ndarray = np.random.uniform(size=size)
    scores_k:np.ndarray = np.array([cross_val_score(LinearSVC(C=k), x, y, cv=cv, scoring='accuracy').mean() for k in sample])
    print('Optimal C parameter: {0}\nMax score: {1}\n'.format(sample[np.argmax(scores_k)], np.max(scores_k)))
    
    print("KNeighborsClassifier")
    scores_k:np.ndarray = np.array([cross_val_score(KNeighborsClassifier(n_neighbors=k), x, y, cv=cv, scoring='accuracy').mean() for k in range(1, size)])
    print('Optimal n_neighbors parameter: {0}\nMax score: {1}\n'.format(1+np.argmax(scores_k), np.max(scores_k)))
    
    print("MultinomialNB")
    sample:np.ndarray = np.random.uniform(size=size)
    scores_k:np.ndarray = np.array([cross_val_score( MultinomialNB(alpha=k), x, y, cv=cv, scoring='accuracy').mean() for k in sample])
    print('Optimal alpha parameter: {0}\nMax score: {1}\n'.format(sample[np.argmax(scores_k)], np.max(scores_k)))
    
    print("DecisionTreeClassifier")
    sample:np.ndarray = np.random.uniform(size=size)
    scores_k:np.ndarray = np.array([cross_val_score(DecisionTreeClassifier(random_state=0, ccp_alpha=k), x, y, cv=cv, scoring='accuracy').mean() for k in sample])
    print('Optimal ccp_alpha parameter: {0}\nMax score: {1}\n'.format(sample[np.argmax(scores_k)], np.max(scores_k)))
    
    print("RandomForestClassifier")
    sample:np.ndarray = np.random.randint(low=10, high=100, size=size)
    scores_k:np.ndarray = np.array([cross_val_score(RandomForestClassifier(random_state=0, n_estimators=k), x, y, cv=cv, scoring='accuracy').mean() for k in sample])
    print('Optimal n_estimators parameter: {0}\nMax score: {1}\n'.format(sample[np.argmax(scores_k)], np.max(scores_k)))
    
    print("GradientBoostingClassifier")
    sample:np.ndarray = np.random.uniform(size=size, high=0.3)
    scores_k:np.ndarray = np.array([cross_val_score(GradientBoostingClassifier(random_state=0, learning_rate=k), x, y, cv=cv, scoring='accuracy').mean() for k in sample])
    print('Optimal learning_rate parameter: {0}\nMax score: {1}\n'.format(sample[np.argmax(scores_k)], np.max(scores_k)))
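
The assignment describes a nested scheme: 10 outer folds for the performance estimate, with a 4-fold hyperparameter search repeated inside each outer training split. The function above tunes on the full data set instead. A minimal sketch of the nested variant, assuming `GridSearchCV` as the inner search, an illustrative grid for `n_neighbors`, and the bundled Iris data:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

demo_x, demo_y = load_iris(return_X_y=True)

# Inner loop: 4-fold grid search over an illustrative (assumed) grid.
inner_search: GridSearchCV = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": list(range(1, 10))},
    cv=4,
    scoring="accuracy",
)

# Outer loop: 10-fold CV. The search is re-run on each outer training split,
# so the selected n_neighbors may differ from fold to fold.
outer_scores: np.ndarray = cross_val_score(inner_search, demo_x, demo_y, cv=10, scoring="accuracy")
print("Nested CV mean accuracy:", outer_scores.mean())
```

Since the outer test fold is never seen by the inner search, the mean of `outer_scores` is not biased by the tuning step.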

## Iris Dataset
http://archive.ics.uci.edu/ml/datasets/Iris \
Missing Values: NO

In [9]:
iris_data:pd.core.frame.DataFrame = pd.read_csv("./data/iris.data", header=None)
iris_values:np.ndarray = iris_data.values
print(f"Shape: {iris_values.shape}")
print(f"Missing values: {pd.isnull(iris_values).sum()}")

Shape: (150, 5)
Missing values: 0


In [10]:
iris_x:np.ndarray = iris_values[:,:-1]
iris_y:np.ndarray = iris_values[:,-1]

In [11]:
one_hot_encoding(iris_y)

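|     | Iris-setosa | Iris-versicolor | Iris-virginica |
| --- | --- | --- | --- |
| 0 | 1 | 0 | 0 |
| 1 | 1 | 0 | 0 |
| 2 | 1 | 0 | 0 |
| 3 | 1 | 0 | 0 |
| 4 | 1 | 0 | 0 |
| ... | ... | ... | ... |
| 145 | 0 | 0 | 1 |
| 146 | 0 | 0 | 1 |
| 147 | 0 | 0 | 1 |
| 148 | 0 | 0 | 1 |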


In [12]:
models_metrics(iris_x, iris_y)

#########################
## Train distribution: Counter({'Iris-setosa': 34, 'Iris-virginica': 33, 'Iris-versicolor': 33})
## Test distribution: Counter({'Iris-versicolor': 17, 'Iris-virginica': 17, 'Iris-setosa': 16})
#########################

LinearSVC
-----------------------
Accuracy: 0.96
Precision: 0.9649122807017544
Recall: 0.9607843137254902
F1 Score: 0.9606481481481483
Trained fold mean: 0.96
Tested fold mean: 0.9800000000000001

KNeighborsClassifier
-----------------------
Accuracy: 0.96
Precision: 0.9649122807017544
Recall: 0.9607843137254902
F1 Score: 0.9606481481481483
Trained fold mean: 0.9700000000000001
Tested fold mean: 0.96

MultinomialNB
-----------------------
Accuracy: 0.94
Precision: 0.9421296296296297
Recall: 0.9411764705882352
F1 Score: 0.9411255411255411
Trained fold mean: 0.95
Tested fold mean: 0.9199999999999999

DecisionTreeClassifier
-----------------------
Accuracy: 0.96
Precision: 0.9649122807017544
Recall: 0.9607843137254902
F1 Score: 0.9606481481481483


In [13]:
models_performances(iris_x, iris_y)

LinearSVC
Mean score: 0.9666666666666668 

KNeighborsClassifier
Mean score: 0.9666666666666668 

MultinomialNB
Mean score: 0.9533333333333334 

DecisionTreeClassifier
Mean score: 0.96 

RandomForestClassifier
Mean score: 0.96 

GradientBoostingClassifier
Mean score: 0.96 



In [14]:
hyperparameters(iris_x, iris_y)

LinearSVC
Optimal C parameter: 0.4199344473847224
Max score: 0.9598150782361308

KNeighborsClassifier
Optimal n_neighbors parameter: 5
Max score: 0.9667496443812233

MultinomialNB
Optimal alpha parameter: 0.2788862401512243
Max score: 0.9731507823613087

DecisionTreeClassifier
Optimal ccp_alpha parameter: 0.11254199966588962
Max score: 0.9466571834992887

RandomForestClassifier
Optimal n_estimators parameter: 21
Max score: 0.9599928876244666

GradientBoostingClassifier
Optimal learning_rate parameter: 0.18854089032848845
Max score: 0.9667496443812233



## Wine Dataset
http://archive.ics.uci.edu/ml/datasets/Wine \
Missing Values: NO

In [15]:
wine_data:pd.core.frame.DataFrame = pd.read_csv("./data/wine.data", header=None)
wine_values:np.ndarray = wine_data.values
print(f"Shape: {wine_values.shape}")
print(f"Missing values: {pd.isnull(wine_values).sum()}")

Shape: (178, 14)
Missing values: 0


In [16]:
wine_x:np.ndarray = wine_values[:,1:]
wine_y:np.ndarray = wine_values[:,0]

In [17]:
one_hot_encoding(wine_y)

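|     | 1.0 | 2.0 | 3.0 |
| --- | --- | --- | --- |
| 0 | 1 | 0 | 0 |
| 1 | 1 | 0 | 0 |
| 2 | 1 | 0 | 0 |
| 3 | 1 | 0 | 0 |
| 4 | 1 | 0 | 0 |
| ... | ... | ... | ... |
| 173 | 0 | 0 | 1 |
| 174 | 0 | 0 | 1 |
| 175 | 0 | 0 | 1 |
| 176 | 0 | 0 | 1 |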


In [18]:
models_metrics(wine_x, wine_y)

#########################
## Train distribution: Counter({2.0: 50, 1.0: 39, 3.0: 29})
## Test distribution: Counter({2.0: 21, 1.0: 20, 3.0: 19})
#########################

LinearSVC
-----------------------
Accuracy: 0.8
Precision: 0.8366666666666666
Recall: 0.7953634085213032
F1 Score: 0.7925925925925926
Trained fold mean: 0.8287878787878787
Tested fold mean: 0.65

KNeighborsClassifier
-----------------------
Accuracy: 0.7
Precision: 0.7028265851795265
Recall: 0.7010025062656643
F1 Score: 0.6983311938382543
Trained fold mean: 0.7007575757575758
Tested fold mean: 0.6499999999999999

MultinomialNB
-----------------------
Accuracy: 0.8833333333333333
Precision: 0.8962962962962964
Recall: 0.8814954051796157
F1 Score: 0.8844496670583627
Trained fold mean: 0.8363636363636363
Tested fold mean: 0.8833333333333332

DecisionTreeClassifier
-----------------------
Accuracy: 0.9333333333333333
Precision: 0.939855072463768
Recall: 0.934126984126984
F1 Score: 0.934122934122934
Trained fold mean: 0.94

In [19]:
models_performances(wine_x, wine_y)

LinearSVC
Mean score: 0.8833333333333332 

KNeighborsClassifier
Mean score: 0.6754901960784313 

MultinomialNB
Mean score: 0.8496732026143791 

DecisionTreeClassifier
Mean score: 0.8705882352941178 

RandomForestClassifier
Mean score: 0.9833333333333332 

GradientBoostingClassifier
Mean score: 0.9160130718954248 



In [20]:
hyperparameters(wine_x, wine_y)

LinearSVC
Optimal C parameter: 0.021467787639915525
Max score: 0.9108585858585858

KNeighborsClassifier
Optimal n_neighbors parameter: 1
Max score: 0.7193181818181817

MultinomialNB
Optimal alpha parameter: 0.9609296649121679
Max score: 0.8327020202020202

DecisionTreeClassifier
Optimal ccp_alpha parameter: 0.12319987583789394
Max score: 0.8319444444444444

RandomForestClassifier
Optimal n_estimators parameter: 82
Max score: 0.9609848484848484

GradientBoostingClassifier
Optimal learning_rate parameter: 0.2141054561039081
Max score: 0.9386363636363637



## Haberman's Survival Dataset
http://archive.ics.uci.edu/ml/datasets/Haberman%27s+Survival \
Missing Values: NO

In [21]:
haberman_data:pd.core.frame.DataFrame = pd.read_csv("./data/haberman.data", header=None)
haberman_values:np.ndarray = haberman_data.values.astype(np.float64)
print(f"Shape: {haberman_values.shape}")
print(f"Missing values: {pd.isnull(haberman_values).sum()}")

Shape: (306, 4)
Missing values: 0


In [22]:
haberman_x:np.ndarray = haberman_values[:,:-1]
haberman_y:np.ndarray = haberman_values[:,-1]

In [23]:
one_hot_encoding(haberman_y)

|     | 1.0 | 2.0 |
| --- | --- | --- |
| 0 | 1 | 0 |
| 1 | 1 | 0 |
| 2 | 1 | 0 |
| 3 | 1 | 0 |
| 4 | 1 | 0 |
| ... | ... | ... |
| 301 | 1 | 0 |
| 302 | 1 | 0 |
| 303 | 1 | 0 |
| 304 | 0 | 1 |


In [24]:
models_metrics(haberman_x, haberman_y)

#########################
## Train distribution: Counter({1.0: 146, 2.0: 58})
## Test distribution: Counter({1.0: 79, 2.0: 23})
#########################

LinearSVC
-----------------------
Accuracy: 0.7647058823529411
Precision: 0.6200716845878136
Recall: 0.5553109521188773
F1 Score: 0.5552325581395349
Trained fold mean: 0.6071428571428571
Tested fold mean: 0.7300000000000001

KNeighborsClassifier
-----------------------
Accuracy: 0.6862745098039216
Precision: 0.5509080902586682
Recall: 0.5509080902586682
F1 Score: 0.5509080902586682
Trained fold mean: 0.7445238095238096
Tested fold mean: 0.7063636363636363

MultinomialNB
-----------------------
Accuracy: 0.7156862745098039
Precision: 0.6153474903474904
Recall: 0.631535498073748
F1 Score: 0.6209150326797386
Trained fold mean: 0.74
Tested fold mean: 0.7354545454545455

DecisionTreeClassifier
-----------------------
Accuracy: 0.6764705882352942
Precision: 0.5861607142857144
Recall: 0.6062190423775454
F1 Score: 0.589261744966443
Trained f

In [25]:
models_performances(haberman_x, haberman_y)

LinearSVC
Mean score: 0.6320430107526882 

KNeighborsClassifier
Mean score: 0.7027956989247313 

MultinomialNB
Mean score: 0.7388172043010753 

DecisionTreeClassifier
Mean score: 0.6021505376344086 

RandomForestClassifier
Mean score: 0.6931182795698925 

GradientBoostingClassifier
Mean score: 0.6189247311827957 



In [26]:
hyperparameters(haberman_x, haberman_y)

LinearSVC
Optimal C parameter: 0.044740092614624194
Max score: 0.7450444292549556

KNeighborsClassifier
Optimal n_neighbors parameter: 8
Max score: 0.7485047846889952

MultinomialNB
Optimal alpha parameter: 0.2941847130437514
Max score: 0.7387218045112782

DecisionTreeClassifier
Optimal ccp_alpha parameter: 0.05935464490240505
Max score: 0.7353041695146959

RandomForestClassifier
Optimal n_estimators parameter: 46
Max score: 0.669771018455229

GradientBoostingClassifier
Optimal learning_rate parameter: 0.015274518058623442
Max score: 0.669471975393028



## Yeast Dataset
https://archive.ics.uci.edu/ml/datasets/Yeast \
Missing Values: NO

In [27]:
yeast_data:pd.core.frame.DataFrame = pd.read_csv("./data/yeast.data", header=None)
yeast_values:np.ndarray = yeast_data.values
print(f"Shape: {yeast_values.shape}")
print(f"Missing values: {pd.isnull(yeast_values).sum()}")

Shape: (1484, 10)
Missing values: 0


In [28]:
yeast_x:np.ndarray = yeast_values[:,1:-1]
yeast_y:np.ndarray = yeast_values[:,-1]

In [29]:
one_hot_encoding(yeast_y)

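|     | CYT | ERL | EXC | ME1 | ME2 | ME3 | MIT | NUC | POX | VAC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1479 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1480 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1481 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1482 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |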


In [30]:
models_metrics(yeast_x, yeast_y)

#########################
## Train distribution: Counter({'CYT': 308, 'NUC': 291, 'MIT': 158, 'ME3': 111, 'ME1': 33, 'ME2': 28, 'EXC': 23, 'VAC': 21, 'POX': 12, 'ERL': 4})
## Test distribution: Counter({'CYT': 155, 'NUC': 138, 'MIT': 86, 'ME3': 52, 'ME2': 23, 'EXC': 12, 'ME1': 11, 'VAC': 9, 'POX': 8, 'ERL': 1})
#########################

LinearSVC
-----------------------
Accuracy: 0.5777777777777777
Precision: 0.48098352032314295
Recall: 0.4004740456169074
F1 Score: 0.389021948204878
Trained fold mean: 0.5894248608534323
Tested fold mean: 0.5657551020408162

KNeighborsClassifier
-----------------------
Accuracy: 0.5313131313131313
Precision: 0.5527953325356328
Recall: 0.5312645776980882
F1 Score: 0.5230662333775176
Trained fold mean: 0.5631313131313131
Tested fold mean: 0.545469387755102

MultinomialNB
-----------------------
Accuracy: 0.33131313131313134
Precision: 0.09713407542781193
Recall: 0.1071575502571295
F1 Score: 0.06830452285105375
Trained fold mean: 0.3437950937950938
Tested

In [31]:
models_performances(yeast_x, yeast_y)

LinearSVC
Mean score: 0.5754670778160711 

KNeighborsClassifier
Mean score: 0.5437828768365681 

MultinomialNB
Mean score: 0.32750317431525483 

DecisionTreeClassifier
Mean score: 0.465608561581716 

RandomForestClassifier
Mean score: 0.6037456920007255 

GradientBoostingClassifier
Mean score: 0.5794395066207148 



In [32]:
hyperparameters(yeast_x, yeast_y)

LinearSVC
Optimal C parameter: 0.9875029514458578
Max score: 0.5747978436657681

KNeighborsClassifier
Optimal n_neighbors parameter: 9
Max score: 0.570754716981132

MultinomialNB
Optimal alpha parameter: 0.01627461166974853
Max score: 0.33423180592991913

DecisionTreeClassifier
Optimal ccp_alpha parameter: 0.6455065007215198
Max score: 0.3119946091644205

RandomForestClassifier
Optimal n_estimators parameter: 95
Max score: 0.5862533692722371

GradientBoostingClassifier
Optimal learning_rate parameter: 0.04355796934040276
Max score: 0.5700808625336927



## Balance Scale Dataset
https://archive.ics.uci.edu/ml/datasets/Balance+Scale \
Missing Values: NO

In [33]:
balance_scales_data:pd.core.frame.DataFrame = pd.read_csv("./data/balance_scale.data", header=None)
balance_scales_values:np.ndarray = balance_scales_data.values
print(f"Shape: {balance_scales_values.shape}")
print(f"Missing values: {pd.isnull(balance_scales_values).sum()}")

Shape: (625, 5)
Missing values: 0


In [34]:
balance_scales_x:np.ndarray = balance_scales_values[:,1:]
balance_scales_y:np.ndarray = balance_scales_values[:,0].astype('str')

In [35]:
one_hot_encoding(balance_scales_y)

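|     | B | L | R |
| --- | --- | --- | --- |
| 0 | 1 | 0 | 0 |
| 1 | 0 | 0 | 1 |
| 2 | 0 | 0 | 1 |
| 3 | 0 | 0 | 1 |
| 4 | 0 | 0 | 1 |
| ... | ... | ... | ... |
| 620 | 0 | 1 | 0 |
| 621 | 0 | 1 | 0 |
| 622 | 0 | 1 | 0 |
| 623 | 0 | 1 | 0 |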


In [36]:
models_metrics(balance_scales_x, balance_scales_y)

#########################
## Train distribution: Counter({'R': 197, 'L': 193, 'B': 26})
## Test distribution: Counter({'L': 95, 'R': 91, 'B': 23})
#########################

LinearSVC
-----------------------
Accuracy: 0.8373205741626795
Precision: 0.5588379204892967
Recall: 0.627299016772701
F1 Score: 0.5910241932724224
Trained fold mean: 0.8775261324041812
Tested fold mean: 0.8323809523809524

KNeighborsClassifier
-----------------------
Accuracy: 0.8277511961722488
Precision: 0.5691906855560983
Recall: 0.6193560825139772
F1 Score: 0.5915343915343915
Trained fold mean: 0.8653310104529618
Tested fold mean: 0.78

MultinomialNB
-----------------------
Accuracy: 0.8564593301435407
Precision: 0.5713455657492355
Recall: 0.6414883362251783
F1 Score: 0.6043185162372104
Trained fold mean: 0.8872241579558653
Tested fold mean: 0.8276190476190475

DecisionTreeClassifier
-----------------------
Accuracy: 0.7894736842105263
Precision: 0.5858560090702948
Recall: 0.6025783522351028
F1 Score: 0.592998

In [37]:
models_performances(balance_scales_x, balance_scales_y)

LinearSVC
Mean score: 0.8480798771121352 

KNeighborsClassifier
Mean score: 0.7472606246799794 

MultinomialNB
Mean score: 0.8592933947772657 

DecisionTreeClassifier
Mean score: 0.6704301075268816 

RandomForestClassifier
Mean score: 0.6815924219150026 

GradientBoostingClassifier
Mean score: 0.7119559651817715 



In [38]:
hyperparameters(balance_scales_x, balance_scales_y)

LinearSVC
Optimal C parameter: 0.33138494239049765
Max score: 0.8480013882083945

KNeighborsClassifier
Optimal n_neighbors parameter: 8
Max score: 0.8032622897272579

MultinomialNB
Optimal alpha parameter: 0.26032780229273345
Max score: 0.8287706189776254

DecisionTreeClassifier
Optimal ccp_alpha parameter: 0.06166726488492469
Max score: 0.513688143067124

RandomForestClassifier
Optimal n_estimators parameter: 55
Max score: 0.6450269475747183

GradientBoostingClassifier
Optimal learning_rate parameter: 0.2537635611540968
Max score: 0.686693614241385



# Documentation

### 1. LinearSVC
https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html

Linear Support Vector Classification (LinearSVC) is an algorithm that looks for a hyperplane that maximizes the distance between the samples of the different classes. It uses a linear kernel to perform the classification and behaves well with a large number of samples. Compared with the SVC model, LinearSVC has additional parameters, such as the norm used in the penalty ("l1" or "l2") and the loss function. The kernel cannot be changed in LinearSVC, because the model is based on the linear kernel.

When the training data are linearly separable, we can select two parallel hyperplanes that separate the two classes of data so that the distance between them is as large as possible. The region bounded by these two hyperplanes is called the "margin", and the maximum-margin hyperplane is the hyperplane that lies halfway between them.

The underlying C implementation uses a random number generator to select features when fitting the model. It is therefore not uncommon to obtain slightly different results for the same input data. If that happens, trying a smaller `tol` parameter is recommended. The underlying implementation, liblinear, uses a sparse internal representation for the data, which incurs a memory copy.
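
The margin described above can be made concrete with a small sketch. It assumes a hypothetical, linearly separable toy data set in 2D; for a linear model with weight vector w, the two margin hyperplanes are w . x + b = +1 and w . x + b = -1, so the distance between them is 2 / ||w||. Note that `LinearSVC` fits a soft-margin, regularized model, so the value printed here is the margin of the fitted model and not necessarily the hard-margin optimum.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Hypothetical, linearly separable toy data (two classes in 2D).
x_toy: np.ndarray = np.array([[-2.0, -2.0], [-2.0, -1.0], [-1.0, -2.0],
                              [2.0, 2.0], [2.0, 1.0], [1.0, 2.0]])
y_toy: np.ndarray = np.array([0, 0, 0, 1, 1, 1])

clf: LinearSVC = LinearSVC(C=1.0, random_state=0, max_iter=10_000)
clf.fit(x_toy, y_toy)

# Separating hyperplane: w . x + b = 0; margin hyperplanes: w . x + b = +/-1.
w: np.ndarray = clf.coef_[0]
b: float = float(clf.intercept_[0])
margin_width: float = float(2.0 / np.linalg.norm(w))
print("w =", w, "b =", b, "margin width =", margin_width)
```

Fixing `random_state` addresses the run-to-run variation mentioned above for the feature selection inside liblinear.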

### 2. KNeighborsClassifier
https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

K Nearest Neighbors (kNN, k-NN) is a classification and regression model from the Case-Based Reasoning family, an approach in which decisions are made by searching a store of previously recorded experiences. It is simple enough to be implemented in under 20 minutes. Despite its simplicity, it is considered robust and useful for many problems, and it was included in the [Top 10 data mining algorithms](https://www.kdnuggets.com/2015/05/top-10-data-mining-algorithms-explained.html).

It memorizes the known cases, and for a query that needs an answer (classification or regression) it finds the 𝑘 closest cases and forms the answer by combining their answers. The model is nonparametric: the answer does not rely on any prior assumption about how the response is generated. It is determined by the contents of the knowledge base and, of course, influenced by the number of neighbours considered (𝑘) and by the way the distance is computed.

For classification, the working principle is simple:
- find the 𝑘 nearest neighbours of the case to be classified
- find the majority class among them and assign the new element to that class
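
The two steps above can be sketched from scratch in a few lines. This is an illustration under stated assumptions (Euclidean distance, a hypothetical helper name `knn_predict`, made-up toy points) and not the implementation used by `KNeighborsClassifier`:

```python
from collections import Counter
from typing import Hashable

import numpy as np


def knn_predict(x_train: np.ndarray, y_train: np.ndarray, query: np.ndarray, k: int = 3) -> Hashable:
    '''
    Hypothetical from-scratch k-NN classifier for a single query point
    :param x_train: numpy array, stored attribute values
    :param y_train: numpy array, stored classes
    :param query: numpy array, the point to classify
    :param k: int, number of neighbours
    :return: the majority class among the k nearest stored cases
    '''
    # Step 1: Euclidean distance from the query to every stored case, then the k closest.
    distances: np.ndarray = np.linalg.norm(x_train - query, axis=1)
    nearest: np.ndarray = np.argsort(distances)[:k]
    # Step 2: majority class among those neighbours.
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]


x_known: np.ndarray = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                                [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_known: np.ndarray = np.array(["a", "a", "a", "b", "b", "b"])
print(knn_predict(x_known, y_known, np.array([0.5, 0.5]), k=3))
```

Because the distance drives everything, features on very different scales usually need to be standardized first.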

### 3. MultinomialNB
https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html

Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes' theorem with the "naive" assumption of conditional independence between every pair of features, given the value of the class variable.

Multinomial Naive Bayes assumes a feature vector in which each element represents the number of times a term appears (or, very often, its frequency). This technique is very effective in natural language processing, or whenever the samples are composed from a common dictionary.

A multinomial distribution is useful for modelling feature vectors in which each value represents, for example, the number of occurrences of a term or its relative frequency.

MultinomialNB assumes that the features follow a multinomial distribution, which is a generalization of the binomial distribution.

The multinomial Naive Bayes classifier is suitable for classification with discrete features (for example, word counts for text classification).
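
A small sketch of the word-count setting described above. The vocabulary, the counts and the class names are all hypothetical and chosen only for illustration:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical word-count matrix: rows are documents, columns are counts of the
# (assumed) vocabulary ["goal", "match", "vote", "election"].
counts: np.ndarray = np.array([
    [3, 2, 0, 0],
    [2, 4, 0, 1],
    [0, 0, 3, 2],
    [0, 1, 2, 4],
])
labels: np.ndarray = np.array(["sport", "sport", "politics", "politics"])

# alpha=1.0 is Laplace smoothing: unseen words do not get zero probability.
nb: MultinomialNB = MultinomialNB(alpha=1.0)
nb.fit(counts, labels)

new_doc: np.ndarray = np.array([[1, 2, 0, 0]])
print(nb.predict(new_doc))
```

`MultinomialNB` expects non-negative feature values, which is one reason it can perform poorly on continuous measurements that are not counts or frequencies.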

### 4. DecisionTreeClassifier
https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

A decision tree is a flowchart-like tree structure in which an internal node represents a feature (or attribute), a branch represents a decision rule, and each leaf node represents the outcome. The topmost node of a decision tree is known as the root node. The tree learns to partition the data based on attribute values, and it does so recursively, a process called recursive partitioning.

A decision tree is a "white box" machine learning algorithm: it exposes its internal decision logic, which is not available in "black box" algorithms such as neural networks. It trains faster than a neural network. Its time complexity depends on both the number of records and the number of attributes in the data. Decision trees can handle high-dimensional data with good accuracy.

The basic idea behind any decision tree algorithm is the following:

- Select the best attribute for splitting the records, using an attribute selection measure (ASM).
- Turn that attribute into a decision node and break the data set into smaller subsets.
- Build the tree by repeating this process recursively for each child node until one of the following conditions holds:
  - All tuples belong to the same class (the same value of the target attribute).
  - There are no more remaining attributes.
  - There are no more instances.
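
The attribute selection step above can be sketched with Gini impurity as the selection measure. This is an illustration under assumptions (exhaustive threshold search, hypothetical helper names `gini` and `best_split`, made-up toy data), not the algorithm as implemented in `DecisionTreeClassifier`:

```python
from typing import Tuple

import numpy as np


def gini(y: np.ndarray) -> float:
    '''
    Gini impurity of a label vector: 1 - sum over classes of p_c^2
    :param y: numpy array, classes
    :return: float, impurity in [0, 1)
    '''
    _, counts = np.unique(y, return_counts=True)
    p: np.ndarray = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))


def best_split(x: np.ndarray, y: np.ndarray) -> Tuple[int, float, float]:
    '''
    Hypothetical helper: finds the (feature, threshold) pair whose two child nodes
    have the lowest weighted Gini impurity
    :param x: numpy array, attribute values
    :param y: numpy array, classes
    :return: (feature index, threshold, weighted impurity)
    '''
    best: Tuple[int, float, float] = (-1, 0.0, float("inf"))
    for j in range(x.shape[1]):
        # Skip the largest value: it would leave the right child empty.
        for t in np.unique(x[:, j])[:-1]:
            left, right = y[x[:, j] <= t], y[x[:, j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best[2]:
                best = (j, float(t), float(score))
    return best


x_toy: np.ndarray = np.array([[1.0, 7.0], [2.0, 3.0], [3.0, 8.0], [4.0, 2.0]])
y_toy: np.ndarray = np.array([0, 0, 1, 1])
print(best_split(x_toy, y_toy))
```

A full tree would apply `best_split` recursively to each child until one of the stopping conditions listed above is met.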
  
  

### 5. RandomForestClassifier
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

A random forest is an estimator that fits a number of decision tree classifiers on different sub-samples of the data set and uses averaging to improve predictive accuracy and control over-fitting.

Random Forest is a supervised learning algorithm. The "forest" it builds is an ensemble of decision trees, usually trained with the "bagging" (bootstrap aggregating) method.

The general idea of this method is that a combination of learning models improves the overall result. Thus, instead of searching for the most important feature when splitting a node, it searches for the best feature within a random subset of features. This results in wide diversity among the trees, which generally leads to a better model and helps prevent over-fitting.

So, in a random forest, only a random subset of the features is considered by the algorithm when splitting a node. The model is also convenient because its default hyperparameters often produce a good prediction result.
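
The two sources of randomness described above (bootstrap samples and random feature subsets) map onto parameters of `RandomForestClassifier`. A minimal sketch, assuming the bundled Iris data and illustrative parameter values:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

demo_x, demo_y = load_iris(return_X_y=True)

forest: RandomForestClassifier = RandomForestClassifier(
    n_estimators=50,
    max_features="sqrt",  # size of the random feature subset tried at each split
    bootstrap=True,       # each tree is trained on a bootstrap sample (bagging)
    random_state=0,
)
forest.fit(demo_x, demo_y)

# The ensemble is a list of fitted decision trees; the forest averages their
# predicted class probabilities to produce the final prediction.
print("Number of trees:", len(forest.estimators_))
print("Feature importances:", forest.feature_importances_)
```

The `feature_importances_` attribute gives a rough view of which attributes the trees relied on, which partly offsets the loss of the single tree's "white box" readability.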

### 6. GradientBoostingClassifier
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html

Gradient boosting is a greedy algorithm that can overfit a training data set.

The model can benefit from regularization methods applied to different parts of the algorithm, which improve performance and reduce over-fitting.
At the core of gradient boosting is the idea of taking a relatively weak learning algorithm (a "weak learner") and making a series of adjustments that improve its strength.

This idea was first realized in the Adaptive Boosting (AdaBoost) algorithm. In AdaBoost, many weak learners are created by initializing many decision trees that have only a single split (decision stumps). Predictions are made by a (weighted) majority vote, in which instances are classified according to the class that receives the most votes from these weak learners.

Gradient Boosting builds on the AdaBoost idea and generalizes it: instead of only reweighting the inputs, each new weak learner is fitted to reduce a differentiable loss function by following its gradient with respect to the current predictions. The objective of Gradient Boosting is to minimize the loss, that is, the difference between the predicted values and the actual values on the training set.

Gradient Boosting also uses decision trees as its weak learners.

Because Gradient Boosting can overfit, various constraints or regularization methods are used to improve the performance of the algorithm. Penalized learning with L1 or L2 regularization, or tree constraints, are possible ways to combat over-fitting.
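
A minimal sketch of some regularization controls that `GradientBoostingClassifier` exposes: shrinkage (`learning_rate`), a tree constraint (`max_depth`) and row subsampling (`subsample`). The parameter values, the train/test split and the use of the bundled Iris data are illustrative assumptions:

```python
from typing import List

from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

demo_x, demo_y = load_iris(return_X_y=True)
x_tr, x_te, y_tr, y_te = train_test_split(
    demo_x, demo_y, test_size=1/3, random_state=0, stratify=demo_y
)

gbc: GradientBoostingClassifier = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,  # shrinkage: scales the contribution of each tree
    max_depth=2,        # tree constraint: shallow weak learners
    subsample=0.8,      # fraction of rows used to fit each tree
    random_state=0,
)
gbc.fit(x_tr, y_tr)

# staged_predict yields the predictions after each boosting stage, which shows
# how the ensemble improves (or starts to overfit) as trees are added.
stage_accuracy: List[float] = [float((pred == y_te).mean()) for pred in gbc.staged_predict(x_te)]
print("Accuracy after 1 tree:", stage_accuracy[0])
print("Accuracy after all trees:", stage_accuracy[-1])
```

Plotting `stage_accuracy` against the stage index is a simple way to choose the number of trees before over-fitting sets in.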