1. (20 points) Apply a missing value imputation method, where applicable; document the method used.

In [1]:
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from typing import List

path: str = "./data/iris.data"
names: List[str] = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']
iris_df = pd.read_csv(path, names=names)

imp0 = SimpleImputer(missing_values=np.NaN, strategy='mean')
imp0.fit(iris_df[['sepal_length']])
imp1 = SimpleImputer(missing_values=np.NaN, strategy='most_frequent')
imp1.fit(iris_df[['sepal_width']])
imp2 = SimpleImputer(missing_values=np.NaN, strategy='median')
imp2.fit(iris_df[['petal_length']])
imp3 = SimpleImputer(missing_values=np.NaN, strategy='constant', fill_value=88)
imp3.fit(iris_df[['petal_width']])

print(iris_df)
iris_df['sepal_length'] = imp0.transform(iris_df[['sepal_length']])
iris_df['sepal_width'] = imp1.transform(iris_df[['sepal_width']])
iris_df['petal_length'] = imp2.transform(iris_df[['petal_length']])
iris_df['petal_width'] = imp3.transform(iris_df[['petal_width']])
print(iris_df)

     sepal_length  sepal_width  petal_length  petal_width           class
0             5.1          3.5           NaN          NaN     Iris-setosa
1             4.9          NaN           NaN          0.2     Iris-setosa
2             4.7          NaN           NaN          NaN     Iris-setosa
3             4.6          NaN           NaN          0.2     Iris-setosa
4             5.0          3.6           1.4          0.2     Iris-setosa
..            ...          ...           ...          ...             ...
145           6.7          3.0           5.2          NaN  Iris-virginica
146           6.3          2.5           5.0          NaN  Iris-virginica
147           NaN          3.0           5.2          NaN  Iris-virginica
148           NaN          3.4           5.4          2.3  Iris-virginica
149           NaN          3.0           5.1          NaN  Iris-virginica

[150 rows x 5 columns]
     sepal_length  sepal_width  petal_length  petal_width           class
0        5.100

In [2]:
path2:str = "./data/roadNet-TX.txt"
names2:List[str] = ['from_node_id', 'to_node_id']
nodes = pd.read_csv(path2, names=names2)

imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit(nodes[['from_node_id']])

print(nodes)
nodes['from_node_id'] = imp.transform(nodes[['from_node_id']])
print(nodes)

    from_node_id  to_node_id
0            NaN           1
1            NaN           2
2            0.0          29
3            NaN           0
4            1.0          23
5            NaN          32
6            2.0           0
7            NaN          26
8            2.0          34
9            NaN           0
10           NaN         358
11           NaN           1
12          23.0          13
    from_node_id  to_node_id
0            5.6           1
1            5.6           2
2            0.0          29
3            5.6           0
4            1.0          23
5            5.6          32
6            2.0           0
7            5.6          26
8            2.0          34
9            5.6           0
10           5.6         358
11           5.6           1
12          23.0          13


2. (number of models * number of datasets * 1 point = 20 points) For each dataset apply 5 classification models from scikit learn. For each report: accuracy, precision, recall, F1 score - see [sklearn.metrics](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics), [Precision and recall](https://en.wikipedia.org/wiki/Precision_and_recall) - using 5 fold cross validation. Report the mean results for both training and test folds. The runs will be done with fixed hyperparameter values.

3. (number of models * number of datasets * 1 point = 20 points) Report the performance of each model, using 5-fold cross validation. For each of the 5 runs, look for optimal hyperparameters using 4-fold cross validation. Model performance will be reported as the average of the 5 runs. 
    *Note:* for each of the 5 runs, the optimal hyperparameters may differ, due to the data used for training/validation. 

In [5]:
import pandas as pd
import numpy as np
import sklearn
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, cross_validate
from sklearn.datasets import load_iris
from typing import List

def print_scores(score_to_print:dict) -> None:
    """
    Displays the optimal performance for each model out of the following 4: accuracy, precision, recall, f1 score
    param: score_to_print: retains the chosen classification models and the results of the 5 folds
    """
    print('accuracy: ', score_to_print['test_accuracy'])
    print('precision_micro: ', score_to_print['test_precision_micro'])
    print('recall_micro: ', score_to_print['test_recall_micro'])
    print('f1_micro: ', score_to_print['test_f1_micro'])
    
def print_performance(score_to_print:dict) -> None:
    """
    Displays the optimal performance for each model out of the following 4: accuracy, precision, recall, f1 score
    param: score_to_print: retains the chosen classification models and the results of the 5 folds
    """
    print('Accuracy performance: ' + str(score_to_print['test_accuracy'].mean()))
    print('Precision_micro performance: ' + str(score_to_print['test_precision_micro'].mean()))
    print('Recall_micro performance: ' + str(score_to_print['test_recall_micro'].mean()))
    print('F1_micro performance: ' + str(score_to_print['test_f1_micro'].mean()))
    
def mean_score_for_model(model_number:int, parameter:int, data_set_X:np.ndarray, data_set_y:np.ndarray) -> float:
    """
    Creates and trains a given model as a parameter
    model parameter: number of the chosen classification model
    param parameter: parameter used for the chosen classification model
    return: average of scores for 4 fold CV 
    """
    if model_number == 1:
        model = KNeighborsClassifier(n_neighbors=parameter)
    elif model_number == 2:
        model = DecisionTreeClassifier(random_state=parameter)
    else:
        model = RandomForestClassifier(max_depth=parameter, random_state=0)
    scores = cross_val_score(model, data_set_X, data_set_y, cv=4, scoring='accuracy')
    return scores.mean()

def get_best_hyperparam(model_number:int, parameter:int, data_set_X:np.ndarray, data_set_y:np.ndarray) -> None:
    """
    Find the optimal parameter 
    param model: number of the chosen classification model
    param parameter: parameter used for the chosen classification model
    """
    range_parameters:range = range(1, 15)
    scores_parameters:List(float) = [mean_score_for_model(model,parameter,data_set_X,data_set_y) for parameter in range_parameters]
    print('Max score obtained for: {0} with value: {1}'.format(1+np.argmax(scores_parameters), np.max(scores_parameters)))

def print_everything(X:np.ndarray, y:np.ndarray, model) -> None:
    """
    Perform 5 fold cross validation and print for each fold the accuracy, precision, recall, f1 score,
    average scores, average scores for both training and test folds and last but not least
    performance.
    :param1: X - the date is retained
    :param2: y - the target is retained
    :param3: model - classification model used
    """
    scores:List(str) = ['accuracy', 'precision_micro', 'recall_micro', 'f1_micro']        
    score_to_print:dict = cross_validate(model, X, y, cv=5, scoring = scores)
    print_scores(score_to_print)
    print('')
    results:dict = cross_validate(model, X, y, cv=5, return_train_score=True)
    print('Mean for train scores: ', results['train_score'].mean())
    print('Mean for test scores: ', results['test_score'].mean())
    print('')
    print_performance(score_to_print)
    print('')
    
iris = load_iris()
X:np.ndarray = iris.data
y:np.ndarray = iris.target
model = KNeighborsClassifier(n_neighbors=5)
print_everything(X, y, model)
get_best_hyperparam(1,5,X,y)

accuracy:  [0.96666667 1.         0.93333333 0.96666667 1.        ]
precision_micro:  [0.96666667 1.         0.93333333 0.96666667 1.        ]
recall_micro:  [0.96666667 1.         0.93333333 0.96666667 1.        ]
f1_micro:  [0.96666667 1.         0.93333333 0.96666667 1.        ]

Mean for train scores:  0.97
Mean for test scores:  0.9733333333333334

Accuracy performance: 0.9733333333333334
Precision_micro performance: 0.9733333333333334
Recall_micro performance: 0.9733333333333334
F1_micro performance: 0.9733333333333334

Max score obtained for: 3 with value: 0.9667496443812233


In [6]:
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(random_state=0)
print_everything(X, y, model)
get_best_hyperparam(2,0,X,y)

accuracy:  [0.96666667 0.96666667 0.9        0.96666667 1.        ]
precision_micro:  [0.96666667 0.96666667 0.9        0.96666667 1.        ]
recall_micro:  [0.96666667 0.96666667 0.9        0.96666667 1.        ]
f1_micro:  [0.96666667 0.96666667 0.9        0.96666667 1.        ]

Mean for train scores:  1.0
Mean for test scores:  0.9600000000000002

Accuracy performance: 0.9600000000000002
Precision_micro performance: 0.9600000000000002
Recall_micro performance: 0.9600000000000002
F1_micro performance: 0.9600000000000002

Max score obtained for: 3 with value: 0.9667496443812233


In [7]:
from sklearn.naive_bayes import GaussianNB

clf = GaussianNB()
clf.fit(X, y)
print_everything(X,y,clf)

accuracy:  [0.93333333 0.96666667 0.93333333 0.93333333 1.        ]
precision_micro:  [0.93333333 0.96666667 0.93333333 0.93333333 1.        ]
recall_micro:  [0.93333333 0.96666667 0.93333333 0.93333333 1.        ]
f1_micro:  [0.93333333 0.96666667 0.93333333 0.93333333 1.        ]

Mean for train scores:  0.9616666666666667
Mean for test scores:  0.9533333333333334

Accuracy performance: 0.9533333333333334
Precision_micro performance: 0.9533333333333334
Recall_micro performance: 0.9533333333333334
F1_micro performance: 0.9533333333333334



In [8]:
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(max_depth=2, random_state=0)
rfc.fit(X, y)
print_everything(X,y,rfc)
get_best_hyperparam(3,2,X,y)

accuracy:  [0.96666667 0.96666667 0.93333333 0.9        1.        ]
precision_micro:  [0.96666667 0.96666667 0.93333333 0.9        1.        ]
recall_micro:  [0.96666667 0.96666667 0.93333333 0.9        1.        ]
f1_micro:  [0.96666667 0.96666667 0.93333333 0.9        1.        ]

Mean for train scores:  0.96
Mean for test scores:  0.9533333333333334

Accuracy performance: 0.9533333333333334
Precision_micro performance: 0.9533333333333334
Recall_micro performance: 0.9533333333333334
F1_micro performance: 0.9533333333333334

Max score obtained for: 3 with value: 0.9667496443812233


In [9]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

svc = make_pipeline(StandardScaler(), SVC(gamma='auto'))
svc.fit(X, y)
print_everything(X,y,svc)

accuracy:  [0.96666667 0.96666667 0.96666667 0.93333333 1.        ]
precision_micro:  [0.96666667 0.96666667 0.96666667 0.93333333 1.        ]
recall_micro:  [0.96666667 0.96666667 0.96666667 0.93333333 1.        ]
f1_micro:  [0.96666667 0.96666667 0.96666667 0.93333333 1.        ]

Mean for train scores:  0.975
Mean for test scores:  0.9666666666666666

Accuracy performance: 0.9666666666666666
Precision_micro performance: 0.9666666666666666
Recall_micro performance: 0.9666666666666666
F1_micro performance: 0.9666666666666666



In [10]:
dataset2 = './data/heart.csv'
data2 = pd.read_csv(dataset2)

X2 = data2.values[:, :-1]
y2 = data2.values[:, -1]
model = KNeighborsClassifier(n_neighbors=5)
print_everything(X2, y2,model)
get_best_hyperparam(1,5,X2,y2)

accuracy:  [0.62295082 0.6557377  0.58333333 0.73333333 0.61666667]
precision_micro:  [0.62295082 0.6557377  0.58333333 0.73333333 0.61666667]
recall_micro:  [0.62295082 0.6557377  0.58333333 0.73333333 0.61666667]
f1_micro:  [0.62295082 0.6557377  0.58333333 0.73333333 0.61666667]

Mean for train scores:  0.7640650183464216
Mean for test scores:  0.6424043715846995

Accuracy performance: 0.6424043715846995
Precision_micro performance: 0.6424043715846995
Recall_micro performance: 0.6424043715846995
F1_micro performance: 0.6424043715846995

Max score obtained for: 3 with value: 0.8276754385964913


In [11]:
model = DecisionTreeClassifier(random_state=0)
print_everything(X2, y2, model)
get_best_hyperparam(2,0,X2,y2)

accuracy:  [0.73770492 0.86885246 0.8        0.73333333 0.66666667]
precision_micro:  [0.73770492 0.86885246 0.8        0.73333333 0.66666667]
recall_micro:  [0.73770492 0.86885246 0.8        0.73333333 0.66666667]
f1_micro:  [0.73770492 0.86885246 0.8        0.73333333 0.66666667]

Mean for train scores:  1.0
Mean for test scores:  0.7613114754098361

Accuracy performance: 0.7613114754098361
Precision_micro performance: 0.7613114754098361
Recall_micro performance: 0.7613114754098361
F1_micro performance: 0.7613114754098361

Max score obtained for: 3 with value: 0.8276754385964913


In [12]:
clf = GaussianNB()
clf.fit(X2, y2)
print_everything(X2,y2,clf)

accuracy:  [0.81967213 0.86885246 0.78333333 0.85       0.71666667]
precision_micro:  [0.81967213 0.86885246 0.78333333 0.85       0.71666667]
recall_micro:  [0.81967213 0.86885246 0.78333333 0.85       0.71666667]
f1_micro:  [0.81967213 0.86885246 0.78333333 0.85       0.71666667]

Mean for train scores:  0.8418641336031
Mean for test scores:  0.8077049180327869

Accuracy performance: 0.8077049180327869
Precision_micro performance: 0.8077049180327869
Recall_micro performance: 0.8077049180327869
F1_micro performance: 0.8077049180327869



In [13]:
rfc = RandomForestClassifier(max_depth=2, random_state=0)
rfc.fit(X2, y2)
print_everything(X2,y2,rfc)
get_best_hyperparam(3,2,X2,y2)

accuracy:  [0.81967213 0.86885246 0.85       0.83333333 0.8       ]
precision_micro:  [0.81967213 0.86885246 0.85       0.83333333 0.8       ]
recall_micro:  [0.81967213 0.86885246 0.85       0.83333333 0.8       ]
f1_micro:  [0.81967213 0.86885246 0.85       0.83333333 0.8       ]

Mean for train scores:  0.8650732142244779
Mean for test scores:  0.8343715846994536

Accuracy performance: 0.8343715846994536
Precision_micro performance: 0.8343715846994536
Recall_micro performance: 0.8343715846994536
F1_micro performance: 0.8343715846994536

Max score obtained for: 3 with value: 0.8276754385964913


In [14]:
svc = make_pipeline(StandardScaler(), SVC(gamma='auto'))
svc.fit(X2, y2)
print_everything(X2,y2,svc)

accuracy:  [0.81967213 0.8852459  0.85       0.83333333 0.71666667]
precision_micro:  [0.81967213 0.8852459  0.85       0.83333333 0.71666667]
recall_micro:  [0.81967213 0.8852459  0.85       0.83333333 0.71666667]
f1_micro:  [0.81967213 0.8852459  0.85       0.83333333 0.71666667]

Mean for train scores:  0.9138849833681972
Mean for test scores:  0.8209836065573771

Accuracy performance: 0.8209836065573771
Precision_micro performance: 0.8209836065573771
Recall_micro performance: 0.8209836065573771
F1_micro performance: 0.8209836065573771



In [15]:
dataset3 = './data/Nationalities.txt'
data3 = pd.read_csv(dataset3, delimiter = ',')

X3 = data3.values[:, 1:]
y3 = data3.values[:, 1]
y3= y3.astype('int')
model = KNeighborsClassifier(n_neighbors=5)
print_everything(X3, y3, model)
get_best_hyperparam(1,5,X3,y3)

accuracy:  [0.30880929 0.25750242 0.32236205 0.26550388 0.14437984]
precision_micro:  [0.30880929 0.25750242 0.32236205 0.26550388 0.14437984]
recall_micro:  [0.30880929 0.25750242 0.32236205 0.26550388 0.14437984]
f1_micro:  [0.30880929 0.25750242 0.32236205 0.26550388 0.14437984]

Mean for train scores:  0.48527752427608417
Mean for test scores:  0.2597114973322227

Accuracy performance: 0.2597114973322227
Precision_micro performance: 0.2597114973322227
Recall_micro performance: 0.2597114973322227
F1_micro performance: 0.2597114973322227

Max score obtained for: 14 with value: 0.8857132563543675


In [16]:
model = DecisionTreeClassifier(random_state=0)
print_everything(X3, y3, model)
get_best_hyperparam(2,0,X3,y3)

accuracy:  [1. 1. 1. 1. 1.]
precision_micro:  [1. 1. 1. 1. 1.]
recall_micro:  [1. 1. 1. 1. 1.]
f1_micro:  [1. 1. 1. 1. 1.]

Mean for train scores:  1.0
Mean for test scores:  1.0

Accuracy performance: 1.0
Precision_micro performance: 1.0
Recall_micro performance: 1.0
F1_micro performance: 1.0

Max score obtained for: 14 with value: 0.8857132563543675


In [17]:
clf = GaussianNB()
clf.fit(X3, y3)
print_everything(X3, y3, clf)

accuracy:  [1. 1. 1. 1. 1.]
precision_micro:  [1. 1. 1. 1. 1.]
recall_micro:  [1. 1. 1. 1. 1.]
f1_micro:  [1. 1. 1. 1. 1.]

Mean for train scores:  1.0
Mean for test scores:  1.0

Accuracy performance: 1.0
Precision_micro performance: 1.0
Recall_micro performance: 1.0
F1_micro performance: 1.0



In [18]:
rfc = RandomForestClassifier(max_depth=2, random_state=0)
rfc.fit(X3, y3)
print_everything(X3,y3,rfc)
get_best_hyperparam(3,2,X3,y3)

accuracy:  [0.44530494 0.44433688 0.40658277 0.35465116 0.29554264]
precision_micro:  [0.44530494 0.44433688 0.40658277 0.35465116 0.29554264]
recall_micro:  [0.44530494 0.44433688 0.40658277 0.35465116 0.29554264]
f1_micro:  [0.44530494 0.44433688 0.40658277 0.35465116 0.29554264]

Mean for train scores:  0.40679513487755425
Mean for test scores:  0.38928367740531455

Accuracy performance: 0.38928367740531455
Precision_micro performance: 0.38928367740531455
Recall_micro performance: 0.38928367740531455
F1_micro performance: 0.38928367740531455

Max score obtained for: 14 with value: 0.8857132563543675


In [19]:
svc = make_pipeline(StandardScaler(), SVC(gamma='auto'))
svc.fit(X3, y3)
print_everything(X3,y3,svc)

accuracy:  [0.59244918 0.61374637 0.61277832 0.64922481 0.36337209]
precision_micro:  [0.59244918 0.61374637 0.61277832 0.64922481 0.36337209]
recall_micro:  [0.59244918 0.61374637 0.61277832 0.64922481 0.36337209]
f1_micro:  [0.59244918 0.61374637 0.61277832 0.64922481 0.36337209]

Mean for train scores:  0.5940339240948523
Mean for test scores:  0.5663141523522217

Accuracy performance: 0.5663141523522217
Precision_micro performance: 0.5663141523522217
Recall_micro performance: 0.5663141523522217
F1_micro performance: 0.5663141523522217



In [20]:
dataset4 = './data/Years.txt'
data4 = pd.read_csv(dataset4, delimiter = ',')

X4 = data4.values[:, 1:]
y4 = data4.values[:, 1]
y4 = y4.astype('int')
model = KNeighborsClassifier(n_neighbors=5)
print_everything(X4, y4, model)
get_best_hyperparam(1,5,X4,y4)

accuracy:  [1. 1. 1. 1. 1.]
precision_micro:  [1. 1. 1. 1. 1.]
recall_micro:  [1. 1. 1. 1. 1.]
f1_micro:  [1. 1. 1. 1. 1.]

Mean for train scores:  1.0
Mean for test scores:  1.0

Accuracy performance: 1.0
Precision_micro performance: 1.0
Recall_micro performance: 1.0
F1_micro performance: 1.0

Max score obtained for: 4 with value: 1.0


In [21]:
model = DecisionTreeClassifier(random_state=0)
print_everything(X4, y4, model)
get_best_hyperparam(2,0,X4,y4)

accuracy:  [1. 1. 1. 1. 1.]
precision_micro:  [1. 1. 1. 1. 1.]
recall_micro:  [1. 1. 1. 1. 1.]
f1_micro:  [1. 1. 1. 1. 1.]

Mean for train scores:  1.0
Mean for test scores:  1.0

Accuracy performance: 1.0
Precision_micro performance: 1.0
Recall_micro performance: 1.0
F1_micro performance: 1.0

Max score obtained for: 4 with value: 1.0


In [22]:
clf = GaussianNB()
clf.fit(X4, y4)
print_everything(X4, y4, clf)

accuracy:  [1. 1. 1. 1. 1.]
precision_micro:  [1. 1. 1. 1. 1.]
recall_micro:  [1. 1. 1. 1. 1.]
f1_micro:  [1. 1. 1. 1. 1.]

Mean for train scores:  1.0
Mean for test scores:  1.0

Accuracy performance: 1.0
Precision_micro performance: 1.0
Recall_micro performance: 1.0
F1_micro performance: 1.0



In [23]:
rfc = RandomForestClassifier(max_depth=2, random_state=0)
rfc.fit(X4, y4)
print_everything(X4,y4,rfc)
get_best_hyperparam(3,2,X4,y4)

accuracy:  [0.93442623 0.93442623 0.95081967 0.93333333 0.93333333]
precision_micro:  [0.93442623 0.93442623 0.95081967 0.93333333 0.93333333]
recall_micro:  [0.93442623 0.93442623 0.95081967 0.93333333 0.93333333]
f1_micro:  [0.93442623 0.93442623 0.95081967 0.93333333 0.93333333]

Mean for train scores:  0.9372921130496887
Mean for test scores:  0.9372677595628416

Accuracy performance: 0.9372677595628416
Precision_micro performance: 0.9372677595628416
Recall_micro performance: 0.9372677595628416
F1_micro performance: 0.9372677595628416

Max score obtained for: 4 with value: 1.0


In [24]:
svc = make_pipeline(StandardScaler(), SVC(gamma='auto'))
svc.fit(X4, y4)
print_everything(X4,y4,svc)

accuracy:  [1. 1. 1. 1. 1.]
precision_micro:  [1. 1. 1. 1. 1.]
recall_micro:  [1. 1. 1. 1. 1.]
f1_micro:  [1. 1. 1. 1. 1.]

Mean for train scores:  1.0
Mean for test scores:  1.0

Accuracy performance: 1.0
Precision_micro performance: 1.0
Recall_micro performance: 1.0
F1_micro performance: 1.0



4. (number of models * 4 points = 20 points) Document in jupyter notebook each of the models used. If the same algorithm is used for more than one dataset, you can make a separate section with documentation of the algorithms + reference to the algorithm.

# Classification methods used

1. [KNN](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier)
2. [Decision Tree](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier)
3. [Gaussian Naive Bayes](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html#sklearn.naive_bayes.GaussianNB)
4. [SVM](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC)
5. [Random Forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier)


## 1. KNN - (k-nearest neighbors)


In statistics, the K Nearest Neighbours (k-NN) algorithm is a non-parametric classification method. It is used for classification and regression. In both cases, the input consists of the K closest training examples in the data set. The result depends on whether k-NN is used for classification or regression:

   In k-NN classification, the output is a class member. An object is classified by voting on the plurality of its neighbors, the object is assigned to the most common class among its k nearest neighbors (k is a positive integer, usually small). If k = 1, then the nearest neighbour is chosen.
    
   In the k-NN regression, the output is the value of the property of the object. This value is the average of the values of the k nearest neighbours.

## 2. Decision Tree

A decision tree is a decision support tool that uses a tree-like decision model and its possible consequences, including the outcomes of chance events, resource costs and utility. It is a way of displaying an algorithm that contains only conditional control instructions.

Decision trees are commonly used in operations research, particularly in decision analysis, to help identify a strategy most likely to achieve a goal, but are also a popular tool in machine learning.

## 3. Gaussian Naive Bayes

In statistics, Naivi Bayes classifiers are a family of simple "probabilistic classifiers" based on the application of Bayes' theorem with strong assumptions of independence between features. They are among the simplest Bayesian network models, but combined with kernel density estimation, they can achieve higher levels of accuracy.

Naivi Bayes classifiers are highly scalable, requiring a number of linear parameters in the number of variables (features/predictors) in a learning problem. Training with maximum likelihood can be done by evaluating a closed-form expression, which requires linear time, rather than by costly iterative approximation, as is used for many other types of classifiers.

## 4. SVM - (Support-vector machine)

In machine learning, support vector machines (SVMs) are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis.

Given a set of training examples, each marker belongs to a category. An SVM training algorithm builds a model that assigns new examples to one category or another, making it a non-probabilistic binary linear classifier. An SVM maps the training examples to points in space so as to maximize the width of the gap between the two categories. New examples are then mapped to the same space and predicted to belong to a category based on which side of the gap they fall on.

## 5. Random Forest

Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks that works by building a multitude of decision trees during training and outputting the class that is the class mode (classification) or mean/average prediction ( regression) of the individual trees.

Random decision trees correct the habit of decision trees to match their training set. Random forests generally outperform decision trees, but their accuracy is lower than gradient trees. However, data characteristics can affect their performance.