# MNIST - Ensemble Learning
In this code exercise we practice building several models and combining them into an ensemble.

In [1]:
import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import fetch_openml

from sklearn.model_selection import train_test_split

from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.svm import LinearSVC
from sklearn.ensemble import VotingClassifier

np.random.seed(42)

# Data
Load the MNIST data and split it into a training set, a validation set, and a test set (e.g., use 50,000 instances for training, 10,000 for validation, and 10,000 for testing). You can use the code below:

```python
mnist = fetch_openml('mnist_784', version=1, cache=True, as_frame=False)

X = mnist["data"]
y = mnist["target"].astype(np.uint8)

X_train_val, X_test, y_train_val, y_test = train_test_split(
    X, y, test_size=10000, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_val, y_train_val, test_size=10000, random_state=42)
```
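
As a quick sanity check, the two-stage split can be exercised on a stand-in index array instead of the real data. This is only a sketch, assuming the dataset has 70,000 rows as MNIST does; `indices` and the `idx_*` names are placeholders that are not used elsewhere in the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the 70,000 MNIST rows (assumption: no real data is loaded here).
indices = np.arange(70_000)

# First hold out 10,000 rows for testing, then 10,000 more for validation.
idx_train_val, idx_test = train_test_split(
    indices, test_size=10_000, random_state=42)
idx_train, idx_val = train_test_split(
    idx_train_val, test_size=10_000, random_state=42)

print(len(idx_train), len(idx_val), len(idx_test))  # 50000 10000 10000

# The three subsets should not share any row.
overlap = (
    (set(idx_train) & set(idx_val))
    | (set(idx_train) & set(idx_test))
    | (set(idx_val) & set(idx_test))
)
print(len(overlap))  # 0
```

Passing an integer as `test_size` requests an absolute number of rows, which is why the second call leaves 50,000 rows for training.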

In [2]:
mnist = fetch_openml('mnist_784', version=1, cache=True, as_frame=False)

X = mnist["data"]
y = mnist["target"].astype(np.uint8)

X_train_val, X_test, y_train_val, y_test = train_test_split(
    X, y, test_size=10000, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_val, y_train_val, test_size=10000, random_state=42)

# Modelling
Instantiate (1) a Random Forest, (2) an Extra Trees and (3) a LinearSVC model. You can use this code:
```python
random_forest_clf = RandomForestClassifier(n_estimators=100, random_state=42)
extra_trees_clf = ExtraTreesClassifier(n_estimators=100, random_state=42)
svm_clf = LinearSVC(max_iter=100, tol=20, random_state=42)
```
Train the models on the training data and print the score of each model on the validation data (what does the score method do?). Example of using the score method:

```python
my_model_name.score(X_val, y_val)
```
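
To answer the question about `score`: for scikit-learn classifiers it returns the mean accuracy, i.e. the fraction of samples whose predicted label matches the true label. The sketch below illustrates this on a small stand-in dataset (assumption: `load_digits` instead of MNIST to keep it fast; `demo_clf` and the `*_demo` names are placeholders).

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Small stand-in dataset (assumption: 8x8 digits, not the MNIST data used above).
X_demo, y_demo = load_digits(return_X_y=True)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

demo_clf = RandomForestClassifier(n_estimators=20, random_state=42)
demo_clf.fit(X_tr, y_tr)

# score() predicts on X_va and compares the predictions against y_va ...
score_via_method = demo_clf.score(X_va, y_va)
# ... which matches computing the accuracy by hand.
score_by_hand = accuracy_score(y_va, demo_clf.predict(X_va))

print(f"score():          {score_via_method:.4f}")
print(f"accuracy_score(): {score_by_hand:.4f}")
```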

In [None]:
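# I had trouble getting fit to run, so the prints below show the progress.
# Instantiate the three models as suggested in the instructions above.
random_forest_clf = RandomForestClassifier(n_estimators=100, random_state=42)
extra_trees_clf = ExtraTreesClassifier(n_estimators=100, random_state=42)
svm_clf = LinearSVC(max_iter=100, tol=20, random_state=42)
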
print("Training Random Forest...")
random_forest_clf.fit(X_train, y_train)
print("Random Forest Done!")

print("Training Extra Trees...")
extra_trees_clf.fit(X_train, y_train)
print("Extra Trees Done!")

print("Training Linear SVM...")
svm_clf.fit(X_train, y_train)
print("Linear SVM Done!")


Training Random Forest...
Random Forest Done!
Training Extra Trees...
Extra Trees Done!
Training Linear SVM...
Linear SVM Done!


In [6]:
rf_score = random_forest_clf.score(X_val, y_val)
et_score = extra_trees_clf.score(X_val, y_val)
svm_score = svm_clf.score(X_val, y_val)

print(f"Random Forest accuracy: {rf_score:.4f}")
print(f"Extra Trees accuracy: {et_score:.4f}")
print(f"Linear SVM accuracy: {svm_score:.4f}")


Random Forest accuracy: 0.9692
Extra Trees accuracy: 0.9715
Linear SVM accuracy: 0.8590


**Voting classifier**

Create a voting classifier, train it on the training data and evaluate it on the validation data using the score method. 
Some code to help: 

```python
named_estimators = [
    ("random_forest_clf", random_forest_clf),
    ("extra_trees_clf", extra_trees_clf),
    ("svm_clf", svm_clf)
]

voting_clf = VotingClassifier(named_estimators)
```
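
By default `VotingClassifier` uses hard voting: each model casts one vote per sample and the most frequent label wins. The sketch below reproduces that idea by hand on made-up predictions (the `preds` array and the `hard_vote` helper are hypothetical and not part of scikit-learn). On a tie it returns the smallest label, which matches the ascending-order tie-breaking described in the scikit-learn documentation, as far as I understand it.

```python
import numpy as np

# Hypothetical predictions from three classifiers for four samples.
preds = np.array([
    [3, 7, 1, 5],  # model A
    [3, 9, 1, 4],  # model B
    [8, 9, 1, 2],  # model C
])


def hard_vote(predictions):
    """Return the most frequent label for each sample (column)."""
    # np.bincount counts how often each label occurs; argmax picks the most
    # common one and returns the smallest label when counts are tied.
    return np.array([np.bincount(column).argmax() for column in predictions.T])


print(hard_vote(preds))  # [3 9 1 2]
```

In the second sample, model A is outvoted by B and C even if A happened to be right, which is how a strong model can lose accuracy inside a hard-voting ensemble.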

In [20]:
from sklearn.ensemble import VotingClassifier

# List of the individual models with their names
named_estimators = [
    ("random_forest_clf", random_forest_clf),
    ("extra_trees_clf", extra_trees_clf),
    ("svm_clf", svm_clf)
]

# Create a Voting Classifier
voting_clf = VotingClassifier(estimators=named_estimators, voting='hard')



In [21]:
print("Training Voting Classifier...")
voting_clf.fit(X_train, y_train)
print("Voting Classifier Done!")


Training Voting Classifier...
Voting Classifier Done!


In [22]:
# Start by evaluating on the validation data
voting_score = voting_clf.score(X_val, y_val)
print(f"Voting Classifier accuracy: {voting_score:.4f}")


Voting Classifier accuracy: 0.9693


In [23]:
# Collect the results in one place for easier comparison
rf_score = random_forest_clf.score(X_val, y_val)
et_score = extra_trees_clf.score(X_val, y_val)
svm_score = svm_clf.score(X_val, y_val)
voting_score = voting_clf.score(X_val, y_val)

print(f"Random Forest accuracy: {rf_score:.4f}")
print(f"Extra Trees accuracy: {et_score:.4f}")
print(f"Linear SVM accuracy: {svm_score:.4f}")
print(f"Voting Classifier accuracy: {voting_score:.4f}")


Random Forest accuracy: 0.9692
Extra Trees accuracy: 0.9715
Linear SVM accuracy: 0.8590
Voting Classifier accuracy: 0.9693


# Evaluate your best model on the test set. 

In [24]:
best_model = extra_trees_clf  # Extra Trees had the best validation accuracy

test_score = best_model.score(X_test, y_test)
print(f"Best model: {best_model.__class__.__name__}")
print(f"Test set accuracy: {test_score:.4f}")


Best model: ExtraTreesClassifier
Test set accuracy: 0.9691


# Summary and analysis
In this section, write down a summary of your work and some analysis. 

In theory, ensemble learning should improve performance, but here the Voting Classifier (0.9693) did not beat Extra Trees (0.9715) on the validation set.

Possible reasons:

- **Extra Trees was already very strong.** When a single model is clearly the best, combining it with weaker models can lower performance rather than raise it. With hard voting every model has one equal vote, so Extra Trees cannot dominate the result despite having the highest accuracy.
- **Linear SVM performed poorly.** Its 85.90% accuracy is well below that of the other two models. When the SVM guesses wrong, its vote can tip the majority toward the wrong class.
- **Majority voting is not always optimal.** Because we used hard voting, the strongest model (Extra Trees) is overruled whenever the two weaker models agree on a different label, which costs it some of its accuracy.

One could try soft voting or remove the weakest model (Linear SVM).

We also see that the model does not appear to overfit and generalizes well to new data: its test accuracy (0.9691) is close to its validation accuracy (0.9715).
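
The two follow-up ideas, dropping Linear SVM and switching to soft voting, could be compared roughly as follows. This is a sketch under stated assumptions: it uses the small `load_digits` dataset and 50 trees per forest to stay fast, so its numbers say nothing about MNIST, and `make_estimators`, `candidates` and `demo_scores` are hypothetical names. Soft voting averages predicted class probabilities, and `LinearSVC` has no `predict_proba`, so the soft-voting variant leaves it out.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import (ExtraTreesClassifier, RandomForestClassifier,
                              VotingClassifier)
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Small stand-in dataset (assumption: 8x8 digits, not the MNIST data used above).
X_demo, y_demo = load_digits(return_X_y=True)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)


def make_estimators(include_svm):
    """Build a fresh list of named, unfitted estimators."""
    estimators = [
        ("random_forest_clf",
         RandomForestClassifier(n_estimators=50, random_state=42)),
        ("extra_trees_clf",
         ExtraTreesClassifier(n_estimators=50, random_state=42)),
    ]
    if include_svm:
        estimators.append(
            ("svm_clf", LinearSVC(max_iter=100, tol=20, random_state=42)))
    return estimators


candidates = {
    "hard voting, all three": VotingClassifier(make_estimators(True), voting="hard"),
    "hard voting, no SVM": VotingClassifier(make_estimators(False), voting="hard"),
    "soft voting, no SVM": VotingClassifier(make_estimators(False), voting="soft"),
}

# Fit each variant and record its validation accuracy.
demo_scores = {}
for name, clf in candidates.items():
    clf.fit(X_tr, y_tr)
    demo_scores[name] = clf.score(X_va, y_va)
    print(f"{name}: {demo_scores[name]:.4f}")
```

On MNIST the same comparison should be made on the validation set, leaving the test set for the final chosen model only.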