### Importing packages

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_openml
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, confusion_matrix

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

### Loading data set

In [2]:
# as_frame=True returns pandas objects, which the .sample() calls below rely on
mnist = fetch_openml("mnist_784", as_frame=True)

### Creating data

In [3]:
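# A 3,000-image subsample is drawn for quicker processing.
# Using the same n and random_state in both calls selects the same rows, so x and y stay aligned.
x = mnist.data.sample(n=3000, random_state=1)
y = mnist.target.sample(n=3000, random_state=1)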
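
The sampling step above relies on both calls drawing the same rows. A minimal sketch of an alternative, assuming small made-up stand-ins for `mnist.data` and `mnist.target` (the names `features`, `labels`, `x_small`, and `y_small` are hypothetical), samples the features once and then selects the labels by index, which makes the alignment explicit:

```python
import pandas as pd

# Toy stand-ins for mnist.data and mnist.target (hypothetical values)
features = pd.DataFrame({"pixel1": range(10), "pixel2": range(10, 20)})
labels = pd.Series(list("0123456789"), name="class")

# Sample the features once, then pick the labels that share the sampled index
x_small = features.sample(n=4, random_state=1)
y_small = labels.loc[x_small.index]

# Every sampled row keeps its own label
print(x_small.index.equals(y_small.index))
```

Selecting by index keeps the pairing correct even if the two objects were sampled with different settings by mistake.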

### Machine learning algorithm 1 - Decision tree

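A decision tree classifies a digit by applying a sequence of forks (threshold tests on pixel values), with each fork narrowing down which digit the image represents. This step-by-step process resembles the way a person works through a series of yes/no questions to reach a decision.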

#### Tuned parameters
This tree will have a tuned 'splitter' parameter. The default is 'best'. This parameter refers to the strategy used to choose the split at each node.

In the tuned tree below the parameter will be set to 'random'. With this setting the algorithm draws random candidate splits and chooses the best of them, rather than searching for the best possible split.
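
To see what the 'splitter' setting changes in practice, the two strategies can be fitted side by side. This is a rough sketch, not the notebook's experiment: it assumes scikit-learn's bundled `load_digits` data (8x8 digit images) as a small offline stand-in for the MNIST sample, and the fixed `random_state=0` values are illustrative choices.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# load_digits is a small stand-in for the MNIST sample (assumption)
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Fit one tree per splitter strategy on the same training data
scores = {}
for strategy in ("best", "random"):
    tree = DecisionTreeClassifier(splitter=strategy, random_state=0)
    tree.fit(X_tr, y_tr)
    scores[strategy] = tree.score(X_te, y_te)

print(scores)
```

Which strategy scores higher depends on the data and the seed, so a single run should not be read as a general ranking.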

#### Test data ratio

For the tree below, a 75/25 train/test split will be used. This is a commonly used ratio and matches the default hold-out size of `train_test_split` (25%).
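
As a quick check of what the 75/25 ratio means for a 3,000-row sample, the sketch below splits placeholder arrays of the same length. The arrays `X_demo` and `y_demo` are hypothetical and contain no real image data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows as the 3,000-image sample
X_demo = np.zeros((3000, 4))
y_demo = np.arange(3000) % 10  # ten dummy classes

# test_size=0.25 holds out a quarter of the rows for testing
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

print(len(X_tr), len(X_te))  # 2250 750
```

The 750 held-out rows match the total of the counts in each confusion matrix in this notebook.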

#### Creating and training tree

In [4]:
# Creating a decision tree
base = DecisionTreeClassifier(splitter="random")

# Creating test/train data
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.25)

# Fitting the tree to the training data
base.fit(X_train, y_train)

DecisionTreeClassifier(splitter='random')

#### Confusion matrix

In [5]:
# Confusion matrix
y_base_predict = base.predict(X_test)
conf_mat = confusion_matrix(y_test, y_base_predict)
cm_df = pd.DataFrame(conf_mat)
cm_df

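| true \ predicted | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 54 | 2 | 4 | 3 | 0 | 4 | 4 | 0 | 1 | 3 |
| 1 | 0 | 77 | 1 | 1 | 2 | 0 | 1 | 0 | 1 | 1 |
| 2 | 4 | 2 | 55 | 1 | 6 | 3 | 5 | 2 | 4 | 1 |
| 3 | 0 | 2 | 3 | 49 | 1 | 7 | 1 | 2 | 3 | 6 |
| 4 | 1 | 1 | 1 | 1 | 43 | 0 | 2 | 3 | 2 | 8 |
| 5 | 1 | 2 | 2 | 6 | 3 | 46 | 1 | 2 | 3 | 4 |
| 6 | 2 | 1 | 0 | 3 | 1 | 3 | 46 | 1 | 1 | 0 |
| 7 | 1 | 2 | 3 | 0 | 5 | 2 | 0 | 78 | 0 | 7 |
| 8 | 2 | 2 | 2 | 6 | 4 | 6 | 4 | 4 | 35 | 3 |
| 9 | 1 | 2 | 0 | 3 | 3 | 1 | 0 | 7 | 1 | 60 |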


#### Confusion matrix results
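Based on the confusion matrix above, the decision tree struggled most with the digit 8: only 35 of the 68 true 8s were classified correctly (recall of about 0.51). The digit 4 had the lowest precision, with 43 of the 68 predictions of 4 being correct (about 0.63).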
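
One way to check which digit is hardest is to compute per-digit recall and precision directly from the matrix. The sketch below copies the decision tree counts from the output above into a NumPy array. It assumes the usual scikit-learn layout (rows are true digits, columns are predicted digits), and the variable names are illustrative.

```python
import numpy as np

# Decision tree confusion matrix copied from the output above
# (rows are true digits, columns are predicted digits)
conf_mat = np.array([
    [54, 2, 4, 3, 0, 4, 4, 0, 1, 3],
    [0, 77, 1, 1, 2, 0, 1, 0, 1, 1],
    [4, 2, 55, 1, 6, 3, 5, 2, 4, 1],
    [0, 2, 3, 49, 1, 7, 1, 2, 3, 6],
    [1, 1, 1, 1, 43, 0, 2, 3, 2, 8],
    [1, 2, 2, 6, 3, 46, 1, 2, 3, 4],
    [2, 1, 0, 3, 1, 3, 46, 1, 1, 0],
    [1, 2, 3, 0, 5, 2, 0, 78, 0, 7],
    [2, 2, 2, 6, 4, 6, 4, 4, 35, 3],
    [1, 2, 0, 3, 3, 1, 0, 7, 1, 60],
])

correct = np.diag(conf_mat)
recall = correct / conf_mat.sum(axis=1)     # share of each true digit that was found
precision = correct / conf_mat.sum(axis=0)  # share of each predicted digit that was right

worst_recall = int(np.argmin(recall))
worst_precision = int(np.argmin(precision))
print(f"Lowest recall: digit {worst_recall} ({recall[worst_recall]:.2f})")
print(f"Lowest precision: digit {worst_precision} ({precision[worst_precision]:.2f})")
```

For this matrix the sketch reports digit 8 as having the lowest recall (0.51) and digit 4 as having the lowest precision (0.63). Comparing raw diagonal counts alone can mislead, because the digits do not have equal numbers of test samples.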

#### Accuracy

In [6]:
# Accuracy of model
acc = base.score(X_test, y_test)
print("The accuracy of this model is {}".format(acc))

The accuracy of this model is 0.724


#### Precision

In [7]:
# Precision of model
prec = precision_score(y_test, y_base_predict, average = 'macro')
print("The precision of this model is {}".format(prec))

The precision of this model is 0.7201325989536053


#### Recall

In [8]:
# Recall of Model
rec = recall_score(y_test, y_base_predict, average = 'macro')
print("The recall of this model is {}".format(rec))

The recall of this model is 0.7185129142684611


#### f1-score

In [9]:
# f1-score
av = f1_score(y_test, y_base_predict, average = 'macro')
print("The f1-score of this model is {}".format(av))

The f1-score of this model is 0.7162358323745286
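
The accuracy, precision, recall, and f1 cells above can also be gathered in a single call with `classification_report`. The sketch below uses a few made-up labels (`y_true` and `y_pred` are hypothetical stand-ins for `y_test` and `y_base_predict`) to show where the macro averages appear in the result.

```python
from sklearn.metrics import classification_report

# Hypothetical labels and predictions, standing in for y_test and y_base_predict
y_true = ["0", "0", "1", "1", "2", "2"]
y_pred = ["0", "1", "1", "1", "2", "0"]

# output_dict=True returns nested dicts instead of a formatted string
report = classification_report(y_true, y_pred, output_dict=True)

print("accuracy:", report["accuracy"])
for metric in ("precision", "recall", "f1-score"):
    print(f"macro {metric}:", report["macro avg"][metric])
```

The "macro avg" entry is the unweighted mean of the per-class scores, which is what `average = 'macro'` computes in the cells above.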


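### Machine learning algorithm 2 - Random forest
A random forest improves on a single decision tree by averaging many decorrelated trees: each tree is trained on a bootstrap sample of the data and considers only a random subset of features at each split.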

#### Tuned parameter

In this model the n_estimators parameter will be tuned. The default is 100 trees in the forest.

In the tuned model this number will be increased to 120. Adding trees tends to make the forest's predictions more stable, although the gain in accuracy from 100 to 120 trees is likely to be small.
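
The effect of n_estimators can be checked empirically by fitting forests of different sizes on one fixed split. This is a rough sketch under assumptions: it uses scikit-learn's bundled `load_digits` data as a small stand-in for the MNIST sample, and the tree counts and `random_state=0` values are illustrative.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# load_digits is a small stand-in for the MNIST sample (assumption)
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Fit forests of increasing size on the same training data
forest_scores = {}
for n_trees in (10, 100, 120):
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    forest.fit(X_tr, y_tr)
    forest_scores[n_trees] = forest.score(X_te, y_te)

for n_trees, score in forest_scores.items():
    print(f"{n_trees:>3} trees: accuracy {score:.3f}")
```

Holding the split and seed fixed means any difference between the rows comes from the number of trees rather than from a different test set.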

#### Test data ratio

For the forest below, a 75/25 train/test split will be used. This is a commonly used ratio and matches the default hold-out size of `train_test_split` (25%).

#### Creating and training model

In [11]:
# Creating the random forest
forest = RandomForestClassifier(n_estimators=120)

# Creating test/train data
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.25)

# Fitting the forest to the training data
forest.fit(X_train, y_train)

RandomForestClassifier(n_estimators=120)

#### Confusion matrix

In [12]:
# Confusion matrix
y_forest_predict = forest.predict(X_test)
conf_mat = confusion_matrix(y_test, y_forest_predict)
cm_df = pd.DataFrame(conf_mat)
cm_df

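| true \ predicted | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 66 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 0 | 91 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 0 | 1 | 74 | 1 | 2 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 4 | 71 | 0 | 2 | 1 | 1 | 1 | 2 |
| 4 | 0 | 0 | 2 | 0 | 58 | 0 | 2 | 0 | 0 | 6 |
| 5 | 1 | 3 | 1 | 0 | 1 | 59 | 0 | 0 | 0 | 0 |
| 6 | 0 | 0 | 0 | 0 | 1 | 1 | 74 | 0 | 0 | 0 |
| 7 | 0 | 3 | 1 | 0 | 1 | 0 | 0 | 68 | 2 | 1 |
| 8 | 0 | 2 | 0 | 1 | 1 | 1 | 0 | 0 | 55 | 0 |
| 9 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 6 | 1 | 74 |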


#### Confusion matrix results
Based on the confusion matrix above, the random forest struggled most with the digit 4: 58 of the 68 true 4s were classified correctly (recall of about 0.85), and 6 of them were mistaken for a 9.

#### Accuracy

In [13]:
# Accuracy of model
acc = forest.score(X_test, y_test)
print("The accuracy of this model is {}".format(acc))

The accuracy of this model is 0.92


#### Precision

In [19]:
# Precision of model
prec = precision_score(y_test, y_forest_predict, average = 'macro')
print("The precision of this model is {}".format(prec))

The precision of this model is 0.9220337732132071


#### Recall

In [20]:
# Recall of Model
rec = recall_score(y_test, y_forest_predict, average = 'macro')
print("The recall of this model is {}".format(rec))

The recall of this model is 0.9184083601618976


#### f1-score

In [21]:
# f1-score
av = f1_score(y_test, y_forest_predict, average = 'macro')
print("The f1-score of this model is {}".format(av))

The f1-score of this model is 0.9196531935784051


### Conclusion

Based on the two models, the following has been discovered.

On the accuracy metric the random forest model did the best with a score of 0.92 (92%), compared with 0.72 for the decision tree.

On the precision metric the random forest model did the best with a score of 0.92, compared with 0.72 for the decision tree.

On the recall metric the random forest model did the best with a score of 0.92, compared with 0.72 for the decision tree.

On the f1-score metric the random forest model did the best with a score of 0.92, compared with 0.72 for the decision tree.

Thus the random forest is the best performing model overall.
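
Because each model above calls `train_test_split` separately without a fixed `random_state`, the two models are scored on different test sets. A like-for-like comparison could reuse one split for both models, as in the sketch below. It assumes scikit-learn's bundled `load_digits` data as a small stand-in for the MNIST sample, and the `random_state=0` values and the `results` dictionary are illustrative.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# load_digits is a small stand-in for the MNIST sample (assumption)
X, y = load_digits(return_X_y=True)

# One shared split, so both models see identical training and test rows
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

models = {
    "decision tree": DecisionTreeClassifier(splitter="random", random_state=0),
    "random forest": RandomForestClassifier(n_estimators=120, random_state=0),
}

results = {}
for name, model in models.items():
    y_hat = model.fit(X_tr, y_tr).predict(X_te)
    results[name] = {
        "accuracy": accuracy_score(y_te, y_hat),
        "f1_macro": f1_score(y_te, y_hat, average="macro"),
    }

for name, metrics in results.items():
    print(name, metrics)
```

Fixing `random_state` also makes the reported numbers reproducible from one run of the notebook to the next.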
