Data Mining: Basic Concepts - Winter 2023/24
---------------
``` 
> University of Konstanz 
> Department of Computer and Information Science
> Maximilian T. Fischer, Frederik Dennig, Yannick Metz, Udo Schlegel
```
__Organize in teams of 2 people, return the exercise on time using ILIAS__

---

Assignment 08 in Python 
---------------
- ___Please put your names and student IDs here___:
    - Wei-Cheng Lin, 01/1348028
    - Kuon Ito, 01/1358810

---

#### Exercise 1: Lazy vs Eager: Theory

**a) Explain the difference between an eager learner and a lazy learner.**

```
An eager learner builds a generalized model from the training data during the training phase and uses only that model to predict new, unseen instances. A lazy learner builds no model during training. It stores the entire training dataset and defers generalization until a new instance needs a prediction, so most of the computation happens at prediction time.
```

**b) Provide an example of a machine learning algorithm that is an eager learner and one that is a lazy learner.**

```
Eager Learner: Random Forest
Lazy Learner: k-Nearest Neighbors (KNN)
```
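
A minimal sketch of the distinction, using two toy classifiers written only for illustration (the class names `EagerCentroid` and `LazyOneNN` and the toy data are assumptions, not part of the assignment): the eager learner compresses the training data into one centroid per class at fit time, while the lazy learner merely stores the data and does all the work at prediction time.

```python
import numpy as np


class EagerCentroid:
    """Eager: fit() summarizes the training data into one centroid per class."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        # All generalization happens here; the raw training data is not kept.
        self.centroids_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        # Distance of every query to every centroid: shape (n_queries, n_classes).
        d = np.linalg.norm(X[:, None, :] - self.centroids_[None, :, :], axis=2)
        return self.classes_[d.argmin(axis=1)]


class LazyOneNN:
    """Lazy: fit() only stores the data; predict() searches it for each query."""

    def fit(self, X, y):
        self.X_, self.y_ = X, y  # no model is built
        return self

    def predict(self, X):
        # Distance of every query to every stored training point.
        d = np.linalg.norm(X[:, None, :] - self.X_[None, :, :], axis=2)
        return self.y_[d.argmin(axis=1)]


# Hypothetical toy data: two well-separated clusters.
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
y_toy = np.array([0, 0, 1, 1])
queries = np.array([[0.5, 0.5], [5.5, 5.0]])

eager_pred = EagerCentroid().fit(X_toy, y_toy).predict(queries)
lazy_pred = LazyOneNN().fit(X_toy, y_toy).predict(queries)
print(eager_pred, lazy_pred)
```

After fitting, the eager model holds only two centroids regardless of the training set size, whereas the lazy model's prediction cost grows with the number of stored rows.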

You are given a dataset with 1 million rows and 10 columns. You need to build a machine learning model to predict a binary outcome based on the values in the columns. You have two options: a decision tree or a k-nearest neighbors algorithm.   
  
**c) Which algorithm should you choose and why? Explain your reasoning in detail, taking into account the size and complexity of the dataset and the computational resources available to you.**

```
For this dataset, a decision tree is likely the better choice.

(1) Efficiency: A decision tree is trained once, and training remains feasible for 1 million rows with only 10 columns.

(2) Scalability: Predicting with a tree only requires following one path from the root to a leaf, whereas a brute-force KNN has to compute the distance to all 1 million stored training rows for every query and keep the whole dataset in memory.

(3) Interpretability: A decision tree yields explicit decision rules, which helps when the model has to be understood and explained.
```
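
The scalability argument can be made concrete with a rough back-of-the-envelope count. This is a sketch under stated assumptions: KNN is taken to be a brute-force search, and the tree is assumed to be roughly balanced, so its depth is about log2 of the number of rows. Real trees may be deeper or shallower, and index structures such as k-d trees can reduce the KNN cost.

```python
import math

n_rows, n_cols = 1_000_000, 10

# Brute-force KNN: one distance per stored row, each touching every column.
knn_ops_per_query = n_rows * n_cols

# Roughly balanced decision tree (assumption): one threshold comparison per level.
tree_ops_per_query = math.ceil(math.log2(n_rows))

# KNN must also keep the full training set available at prediction time (float64 values).
knn_memory_mb = n_rows * n_cols * 8 / 1e6

print(knn_ops_per_query, tree_ops_per_query, knn_memory_mb)
```

Under these assumptions a single KNN prediction touches ten million values, while a tree prediction needs about twenty comparisons.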

#### Exercise 2: Lazy vs Eager: Practical

In this exercise, we want to implement a basic k-nearest neighbor algorithm.   
For this, we need a similarity function and a general k-nearest neighbor function.  
We will use the breast cancer dataset we previously used for the SVM with an 80/20 train test split.    
  
**Please implement the k-nearest neighbor algorithm on your own and show that the implementation works for the breast cancer data.**  
_(Hint: use the `sklearn.model_selection.train_test_split` method and the parameter `random_state=0`)_

In [1]:
from sklearn.datasets import load_breast_cancer

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
import numpy as np

# Load the data and create the 80/20 train/test split.
data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Classify each test sample by majority vote among its 9 nearest training samples.
knn_classifier = KNeighborsClassifier(n_neighbors=9)
knn_classifier.fit(X_train, y_train)
y_pred = knn_classifier.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")

AttributeError: 'NoneType' object has no attribute 'split'
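
The task asks for a hand-written k-nearest neighbor classifier, whereas the cell above relies on `KNeighborsClassifier`. A minimal sketch of such an implementation follows, assuming Euclidean distance as the (dis)similarity function and a simple majority vote. The function names `euclidean_distances` and `knn_predict` and the toy data are illustrative choices, not part of the assignment.

```python
import numpy as np


def euclidean_distances(A, B):
    """Pairwise Euclidean distances between the rows of A and the rows of B."""
    diff = A[:, None, :] - B[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))


def knn_predict(X_train, y_train, X_query, k=9):
    """Label each query row by majority vote among its k nearest training rows.

    Assumes the labels are non-negative integers (as in the breast cancer data).
    """
    dist = euclidean_distances(X_query, X_train)
    # Indices of the k smallest distances per query (order within the k is irrelevant).
    nearest = np.argpartition(dist, k - 1, axis=1)[:, :k]
    neighbor_labels = y_train[nearest]
    # Majority vote; on a tie, argmax returns the smallest label.
    return np.array([np.bincount(row).argmax() for row in neighbor_labels])


# Hypothetical toy data: two clusters of three points each.
X_toy = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                  [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])
toy_pred = knn_predict(X_toy, y_toy, np.array([[0.05, 0.05], [4.9, 5.2]]), k=3)
print(toy_pred)
```

With the split from the cell above, `knn_predict(X_train, y_train, X_test, k=9)` would return test predictions, and `np.mean(y_pred == y_test)` would give their accuracy. Because Euclidean distance is dominated by features with large value ranges, standardizing the features first is usually advisable.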

#### Exercise 3: Evaluation Metrics Theory

**a) Explain why different evaluation metrics are used for evaluating the performance of classifiers.**

```
Classifiers are applied with different objectives, and metrics such as accuracy, precision, recall, F1-score, or the area under the ROC curve (AUC-ROC) each capture a different aspect of performance. Which metric is appropriate depends on the goals of the application. For example, in medical diagnosis, false positives and false negatives have different consequences, so precision and recall are more informative than accuracy alone.
```

**b) Provide examples of common evaluation metrics for classification algorithms, and explain the strengths and limitations of each metric in the context of classifier evaluation.**

```
Accuracy:
    Strengths: Simple and intuitive; the fraction of correctly classified instances.
    Limitations: Misleading on imbalanced datasets, where always predicting the majority class already gives a high score.

Precision:
    Strengths: Useful when the cost of false positives is high.
    Limitations: Ignores false negatives and depends on class prevalence, so it should be read together with recall.

Recall (Sensitivity or True Positive Rate):
    Strengths: Valuable when the cost of false negatives is high.
    Limitations: Ignores false positives, so it should be read together with precision.

Receiver Operating Characteristic (ROC) Curve:
    Strengths: Visualizes performance across all thresholds and makes models easy to compare.
    Limitations: Can look overly optimistic on imbalanced datasets, does not indicate the optimal threshold, and does not account for differing misclassification costs.
```

**c) Explain what a perfect precision-recall curve looks like.**

```
With recall on the x-axis and precision on the y-axis, a perfect precision-recall curve is a horizontal line at precision = 1 for all recall values from 0 to 1, so it reaches the top-right corner (1,1). The classifier finds all positives without producing a single false positive. The area under this perfect curve is 1, indicating ideal performance.
```
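
A small numerical check, as a sketch on made-up labels and scores (the helper `precision_recall_points` is written for illustration): when every positive is scored above every negative, precision is 1 at every recall level, which is the horizontal line at precision = 1 that characterizes a perfect classifier.

```python
import numpy as np


def precision_recall_points(y_true, scores):
    """Precision and recall after each prediction, going from highest to lowest score."""
    order = np.argsort(-scores)
    hits = y_true[order]                 # 1 where the ranked item is a true positive
    tp = np.cumsum(hits)                 # true positives among the top-i items
    rank = np.arange(1, len(hits) + 1)   # number of items predicted positive so far
    return tp / rank, tp / hits.sum()


# Hypothetical perfect ranking: every positive outranks every negative.
y_true = np.array([0, 0, 0, 1, 1, 1])
scores = np.array([0.1, 0.2, 0.3, 0.7, 0.8, 0.9])
precision, recall = precision_recall_points(y_true, scores)

# Best precision achieved at each recall level.
for r in np.unique(recall):
    print(f"recall={r:.2f}  precision={precision[recall == r].max():.2f}")
```

Lowering the threshold further after recall 1 has been reached only adds false positives, which is why the curve drops vertically at the right edge of the plot.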

#### Exercise 4: Evaluation Metrics Practical

After thinking about the different evaluation metrics, we want to see them in real-world scenarios and inspect their workings.  
We tackle two datasets for this task and use the machine learning algorithms we have already explored.  
The first dataset will be the breast cancer dataset again. The second one will be the cover-type forest dataset.  
Both datasets have unique properties and need special care. Handling different data is not always easy and selecting proper algorithms is even more challenging.
Thus, we focus on the different evaluation metrics to select an algorithm that works well.  
In this case, we will compare the scikit-learn implementations of Naive Bayes, Decision Trees, SVMs, and k-Nearest Neighbors based on the classification accuracy and F1-score.  
  
As the cover type data has more than two classes, we will only use the second and third class samples.  
  
**Please implement a comparison between the scikit-learn implementations of Naive Bayes, Decision Trees, SVMs, and k-Nearest Neighbors on the breast cancer data and cover type data using your own implementation of accuracy and F1-score.**    
_(Hint: use the `sklearn.model_selection.train_test_split` method and the parameter `random_state=0`)_  
_(Hint: use the `sklearn.naive_bayes.GaussianNB`, `sklearn.tree.DecisionTreeClassifier`, `sklearn.svm.SVC`, `sklearn.neighbors.KNeighborsClassifier` methods with default parameters)_  

In [None]:
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, f1_score

cancer_data = load_breast_cancer()
cancer_X = cancer_data.data
cancer_y = cancer_data.target

cancer_X_train, cancer_X_test, cancer_y_train, cancer_y_test = train_test_split(cancer_X, cancer_y, test_size=0.2, random_state=0)

print("Breast Cancer:")

gnb = GaussianNB()
gnb.fit(cancer_X_train, cancer_y_train)
y_pred = gnb.predict(cancer_X_test)
accuracy = accuracy_score(cancer_y_test, y_pred)
f1 = f1_score(cancer_y_test, y_pred, average='binary')  
print(f"Naive Bayes - Accuracy: {accuracy:.4f}, F1-Score: {f1:.4f}")

clf = DecisionTreeClassifier()
clf = clf.fit(cancer_X_train, cancer_y_train)
y_pred = clf.predict(cancer_X_test)
accuracy = accuracy_score(cancer_y_test, y_pred)
f1 = f1_score(cancer_y_test, y_pred, average='binary')  
print(f"Decision Tree - Accuracy: {accuracy:.4f}, F1-Score: {f1:.4f}")

clf = SVC(kernel='linear')
clf.fit(cancer_X_train, cancer_y_train)
y_pred = clf.predict(cancer_X_test)
accuracy = accuracy_score(cancer_y_test, y_pred)
f1 = f1_score(cancer_y_test, y_pred, average='binary')  
print(f"SVM - Accuracy: {accuracy:.4f}, F1-Score: {f1:.4f}")

# Reuses knn_classifier (n_neighbors=9) defined in the first code cell.
knn_classifier.fit(cancer_X_train, cancer_y_train)
y_pred = knn_classifier.predict(cancer_X_test)
accuracy = accuracy_score(cancer_y_test, y_pred)
f1 = f1_score(cancer_y_test, y_pred, average='binary')  
print(f"k-Nearest Neighbors - Accuracy: {accuracy:.4f}, F1-Score: {f1:.4f}")



In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.datasets import fetch_covtype
covtype = fetch_covtype()

# Keep only cover types 2 and 3 and relabel them as 0 and 1.
mask = np.logical_or(covtype['target'] == 2, covtype['target'] == 3)
target = covtype['target'][mask] - 2
data = covtype['data'][mask]

cover_X_train, cover_X_test, cover_y_train, cover_y_test = train_test_split(data, target, test_size=0.2, random_state=0)

# Standardize the features, fitting the scaler on the training split only.
scaler = StandardScaler()
cover_X_train_scaled = scaler.fit_transform(cover_X_train)
cover_X_test_scaled = scaler.transform(cover_X_test)

# Keep the principal components that explain 95% of the variance.
pca = PCA(n_components=0.95)
cover_X_train = pca.fit_transform(cover_X_train_scaled)
cover_X_test = pca.transform(cover_X_test_scaled)

print("Cover-Type Forest (Classes 2 and 3):")

gnb = GaussianNB()
gnb.fit(cover_X_train, cover_y_train)
y_pred = gnb.predict(cover_X_test)
accuracy = accuracy_score(cover_y_test, y_pred)
f1 = f1_score(cover_y_test, y_pred, average='binary')  
print(f"Naive Bayes - Accuracy: {accuracy:.4f}, F1-Score: {f1:.4f}")

clf = DecisionTreeClassifier()
clf = clf.fit(cover_X_train, cover_y_train)
y_pred = clf.predict(cover_X_test)
accuracy = accuracy_score(cover_y_test, y_pred)
f1 = f1_score(cover_y_test, y_pred, average='binary')  
print(f"Decision Tree - Accuracy: {accuracy:.4f}, F1-Score: {f1:.4f}")


clf = SVC(kernel='linear')
clf.fit(cover_X_train, cover_y_train)
y_pred = clf.predict(cover_X_test)
accuracy = accuracy_score(cover_y_test, y_pred)
f1 = f1_score(cover_y_test, y_pred, average='binary')  
print(f"SVM - Accuracy: {accuracy:.4f}, F1-Score: {f1:.4f}")


# Reuses knn_classifier (n_neighbors=9) defined in the first code cell.
knn_classifier.fit(cover_X_train, cover_y_train)
y_pred = knn_classifier.predict(cover_X_test)
accuracy = accuracy_score(cover_y_test, y_pred)
f1 = f1_score(cover_y_test, y_pred, average='binary')  
print(f"k-Nearest Neighbors - Accuracy: {accuracy:.4f}, F1-Score: {f1:.4f}")
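
The task asks for a hand-written implementation of accuracy and F1-score, while the cells above use `sklearn.metrics`. A minimal sketch of such functions for binary labels follows, assuming class 1 is the positive class (which matches `average='binary'` with its default `pos_label=1`). The names `my_accuracy` and `my_f1` and the toy labels are illustrative.

```python
import numpy as np


def my_accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))


def my_f1(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == positive) & (y_true == positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    if tp == 0:
        return 0.0  # no true positives: define F1 as 0 to avoid dividing by zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return float(2 * precision * recall / (precision + recall))


# Hypothetical toy labels: 2 true positives, 1 false positive, 1 false negative.
y_true_toy = [1, 1, 1, 0, 0, 0]
y_pred_toy = [1, 1, 0, 1, 0, 0]
print(my_accuracy(y_true_toy, y_pred_toy), my_f1(y_true_toy, y_pred_toy))
```

In the cells above, each `accuracy_score(...)` and `f1_score(..., average='binary')` call could be replaced by `my_accuracy(...)` and `my_f1(...)` with the same label arguments.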

***<span style="color:orange">Feedback: </span>***
 - 3c) Wrong. Do not use ChatGPT for that.
 - 4a) No own implementation of accuracy and F1-score

***<span style="color:orange">Okay. Grade: Yellow</span>***