# Exercises week 1

## Conda
- Create a separate directory for this course.
- Create a conda environment for this course named 'ml_algorithms'.
- Use this environment to add packages as you need them. Track the versions in an environment.yml file.
    - matplotlib, pandas, numpy and scikit-learn are needed right from the start.
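
Once the environment is active, a quick way to confirm that the essential packages are present is to ask Python for their installed versions. The sketch below is only a convenience check, not part of the exercise; the helper name `installed_versions` and the constant `ESSENTIAL_PACKAGES` are made up for this illustration.

```python
from importlib import metadata

# Packages named in the exercise; extend this tuple as the course goes on.
ESSENTIAL_PACKAGES = ("matplotlib", "pandas", "numpy", "scikit-learn")


def installed_versions(packages=ESSENTIAL_PACKAGES):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

The printed versions can be compared against what is recorded in environment.yml.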

## KNN
Implement KNN in Python, using the code snippet below as a starting point.

In [31]:
import numpy as np

class KNNClassifier:
    """K-nearest neighbours that predicts the mean of the k nearest training labels."""

    def __init__(self, k=3):
        self.k = k
        self.X_train = None
        self.y_train = None

    def fit(self, X, y):
        # KNN is a lazy learner: fitting only stores the training data.
        self.X_train = X
        self.y_train = y

    def euclidean_distance(self, x1, x2):
        """Euclidean distance between two feature vectors (1-D arrays, not scalars)."""
        return np.sqrt(np.sum((x1 - x2)**2))

    def predict(self, X):
        """Predict a value for every row (sample) of X by calling _predict on each."""
        predicted_labels = [self._predict(x) for x in X]
        return np.array(predicted_labels)

    def _predict(self, x):
        """Predict for one sample x from its k nearest training points."""
        distances = [self.euclidean_distance(x, x_train) for x_train in self.X_train]
        k_indices = np.argsort(distances)[:self.k]
        k_nearest_labels = [self.y_train[i] for i in k_indices]
        # Averaging suits numeric targets such as prices; dtype=int truncates the result.
        return np.mean(k_nearest_labels, dtype=int)

# Sample dataset
X_train = np.array([
    [0, 0, 1000, 2, 1],
    [1, 1, 1500, 3, 2],
    [2, 2, 1200, 2, 1],
    [3, 3, 1800, 4, 2],
    [4, 4, 2000, 3, 2]
])

y_train = np.array([200000, 250000, 220000, 280000, 300000])

X_test = np.array([
    [5, 5, 1600, 3, 1],
    [2, 3, 1300, 2, 1]
])

knn = KNNClassifier(k=3)
knn.fit(X_train[:, 1:], y_train)  # Note: [:, 1:] drops only the first column (one coordinate), not both
predictions = knn.predict(X_test[:, 1:])
print("Predicted house prices:", predictions)

Predicted house prices: [276666 223333]
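
The class above averages the labels of the nearest neighbours, which fits a numeric target such as a house price. For categorical labels the usual choice is a majority vote instead. The following is a minimal sketch of that variant, not the course's reference solution: the function name `knn_classify` and the toy data are invented for illustration, and the distance computation is vectorised with NumPy rather than looped.

```python
from collections import Counter

import numpy as np


def knn_classify(X_train, y_train, x, k=3):
    """Return the most common label among the k training points nearest to x."""
    # Euclidean distance from x to every training row, computed in one step.
    distances = np.sqrt(np.sum((X_train - x) ** 2, axis=1))
    # A stable sort resolves distance ties in favour of the lower row index.
    k_indices = np.argsort(distances, kind="stable")[:k]
    votes = Counter(y_train[i] for i in k_indices)
    return votes.most_common(1)[0][0]


# Made-up toy data: two well-separated clusters with labels 0 and 1.
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_classify(X_toy, y_toy, np.array([0.5, 0.5])))  # 0
print(knn_classify(X_toy, y_toy, np.array([5.5, 5.5])))  # 1
```

Choosing an odd `k` avoids tied votes when there are only two classes.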


### Iris Flowers dataset

- Load the iris flowers dataset. It ships with scikit-learn.
- Use the scikit-learn library to train a KNN classifier.
- Don't forget to score your algorithm.
- Examine the effect of using or not using a StandardScaler.

In [32]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
iris = load_iris()
x = iris.data    # features
y = iris.target  # labels

In [33]:
x_trn, x_tst, y_trn, y_tst = train_test_split(x, y, test_size=0.2, random_state=1)

In [34]:
# Fit the scaler on the training set only, then reuse its statistics for the test set.
scaler = StandardScaler()
x_trn_scl = scaler.fit_transform(x_trn)
x_tst_scl = scaler.transform(x_tst)

In [35]:
classifier = KNeighborsClassifier(n_neighbors=3)
classifier.fit(x_trn_scl, y_trn)
predicted = classifier.predict(x_tst_scl)
print(predicted)

[0 1 1 0 2 1 2 0 0 2 1 0 2 1 1 0 1 1 0 0 1 1 1 0 2 1 0 0 1 2]


In [36]:
accuracy = accuracy_score(y_tst, predicted)
print(accuracy)

1.0
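
A perfect score on a single 30-sample test set can depend on which rows happened to land in that split. One way to get a steadier estimate is k-fold cross-validation. The sketch below assumes 5 folds and the same `n_neighbors=3`; both are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# The pipeline refits the scaler on each training fold, so no statistics
# from the held-out fold leak into the scaling.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))

# With a classifier and an integer cv, scikit-learn uses stratified folds.
scores = cross_val_score(model, X, y, cv=5)
print(f"Mean accuracy over {len(scores)} folds: {scores.mean():.3f} (std {scores.std():.3f})")
```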


In [37]:
report = classification_report(y_tst, predicted)
print(report)

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        11
           1       1.00      1.00      1.00        13
           2       1.00      1.00      1.00         6

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30
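
The last bullet of this exercise asks for a comparison with and without a StandardScaler, while the cells above only run the scaled variant. A minimal sketch of the comparison on the same split could look as follows; the helper `knn_accuracy` is a name invented here.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_trn, X_tst, y_trn, y_tst = train_test_split(X, y, test_size=0.2, random_state=1)


def knn_accuracy(X_fit, X_eval, k=3):
    """Train KNN on X_fit and return its accuracy on X_eval."""
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_fit, y_trn)
    return clf.score(X_eval, y_tst)


# Scaling statistics come from the training set only.
scaler = StandardScaler().fit(X_trn)

acc_raw = knn_accuracy(X_trn, X_tst)
acc_scaled = knn_accuracy(scaler.transform(X_trn), scaler.transform(X_tst))

print(f"Accuracy without scaling: {acc_raw:.3f}")
print(f"Accuracy with scaling:    {acc_scaled:.3f}")
```

All four iris features are measured in centimetres, so the gap between the two numbers may well be small here. Scaling tends to matter more when features have very different ranges, as with the house-size column in the KNN example.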


## Mall customers

Look at the data in Mall_Customers.csv: load it and inspect the columns. Apply KNN to it (the last column, spending score, is the label to predict). What do you still need to change to make this work well?

In [38]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

df = pd.read_csv('Mall_Customers.csv')

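# CustomerID is only a row identifier, so it is dropped.
df = df.drop(columns='CustomerID')
# KNN needs numeric features: encode Gender as 0/1.
# Assigning the result back avoids in-place edits on a column selection.
df['Gender'] = df['Gender'].map({'Male': 0, 'Female': 1})

x = df.drop(columns='Spending Score (1-100)')
y = df['Spending Score (1-100)'].to_numpy()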

# Handy for checking that the feature/label split worked.
#print(x)
#print(y)

x_trn, x_tst, y_trn, y_tst = train_test_split(x, y, test_size=0.2, random_state=1)

scaler = StandardScaler()
x_trn_scl = scaler.fit_transform(x_trn)
x_tst_scl = scaler.transform(x_tst)

# KNeighborsClassifier treats every distinct score as a separate class.
classifier = KNeighborsClassifier(n_neighbors=3)
classifier.fit(x_trn_scl, y_trn)
predicted = classifier.predict(x_tst_scl)
accuracy = accuracy_score(y_tst, predicted) * 100

# Result
print(predicted)
print(f'We have an accuracy of: {accuracy}%')

[42 48 32 41 17  8 49  6 65 22 12  4  6 42 41 48 78 65 16 41 55  6 45 42
 65  5 42 17 17 13 42 22 17 12 11 41 16 49 42 13]
We have an accuracy of: 5.0%
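
An accuracy of 5% is what one would expect from this setup rather than a sign of a coding error: the spending score is a number from 1 to 100, so a classifier has to hit the exact value among dozens of classes. One possible answer to the question in the exercise is to treat it as a regression problem and score it with an error measure. The sketch below assumes scikit-learn's `KNeighborsRegressor` and the column names used in the cell above; the function name `evaluate_knn_regressor` and the choice of MAE and R² as metrics are illustrative.

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

TARGET = "Spending Score (1-100)"


def evaluate_knn_regressor(df, k=3, random_state=1):
    """Fit a scaled KNN regressor on df and return (MAE, R^2) on a 20% hold-out."""
    data = df.drop(columns="CustomerID", errors="ignore").copy()
    data["Gender"] = data["Gender"].map({"Male": 0, "Female": 1})
    X = data.drop(columns=TARGET)
    y = data[TARGET]
    X_trn, X_tst, y_trn, y_tst = train_test_split(
        X, y, test_size=0.2, random_state=random_state
    )
    # The regressor averages the scores of the k nearest customers.
    model = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=k))
    model.fit(X_trn, y_trn)
    predicted = model.predict(X_tst)
    return mean_absolute_error(y_tst, predicted), r2_score(y_tst, predicted)


# Usage, assuming Mall_Customers.csv is in the working directory:
# mae, r2 = evaluate_knn_regressor(pd.read_csv("Mall_Customers.csv"))
# print(f"MAE: {mae:.1f} points, R^2: {r2:.2f}")
```

An alternative that keeps the classifier is to bin the score into a few categories (for example low, medium, high) before training, at the cost of a coarser prediction.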
