# Anvendt Programmering 5
---
## Machine Learning Basics with Scikit-Learn and Python




# Introduction

- Welcome to the lecture on Machine Learning Basics with Scikit-Learn and Python.
- Objectives:
  - Understand basic machine learning concepts
  - Learn how to use scikit-learn for machine learning tasks
  - Complete two hands-on exercises

---



# Setting Up Your Environment

- Install packages using pip:

```bash
pip install jupyter
pip install scikit-learn
pip install matplotlib
pip install seaborn
pip install pandas
```


In [None]:
%pip install scikit-learn matplotlib seaborn pandas jupyter


# Understanding the Basics

- **Supervised Learning**: Training a model on labeled data (e.g., classification, regression).
- **Unsupervised Learning**: Training a model on unlabeled data (e.g., clustering, dimensionality reduction).
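
In scikit-learn the difference shows up in how `fit` is called. The sketch below is an illustration on made-up blob data (the variable names are chosen here, not taken from the lecture): the supervised model receives the labels `y`, the unsupervised one only sees `X`.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

# Toy data: 100 points in 2 groups, with known labels y
X_toy, y_toy = make_blobs(n_samples=100, centers=2, cluster_std=0.5, random_state=0)

# Supervised: the labels are given to the model during training
classifier = KNeighborsClassifier(n_neighbors=3)
classifier.fit(X_toy, y_toy)
predicted_labels = classifier.predict(X_toy[:5])

# Unsupervised: only the features are given; the model finds groups on its own
clusterer = KMeans(n_clusters=2, n_init=10, random_state=0)
clusterer.fit(X_toy)
cluster_ids = clusterer.predict(X_toy[:5])

print("Predicted labels:", predicted_labels)
print("Cluster ids:     ", cluster_ids)
```

Note that the cluster ids are arbitrary numbers: cluster 0 does not have to correspond to label 0.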

---


# Unsupervised Learning

A type of machine learning where the algorithm learns patterns from unlabeled data.
- **Key Methods**:
  - Clustering
  - Dimensionality Reduction
- **Applications**:
  - Customer segmentation
  - Anomaly detection
  - Image compression
  

# What is K-means Clustering?

- **K-means Clustering**: A method to partition data into K clusters, where each data point belongs to the cluster with the nearest mean.
- **Steps**:
  1. Initialize K centroids randomly.
  2. Assign each data point to the nearest centroid.
  3. Update centroids by calculating the mean of assigned points.
  4. Repeat steps 2-3 until convergence.
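
The four steps above can be sketched in plain NumPy. This is a teaching sketch, not how scikit-learn implements `KMeans`: the function name `kmeans_sketch` and the toy data are made up here, and step 1 is assumed to mean "pick K random data points as starting centroids".

```python
import numpy as np


def kmeans_sketch(X, k, n_iterations=100, seed=0):
    """Plain NumPy version of the four K-means steps (illustration only)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick K distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iterations):
        # Step 2: assign each point to its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its old centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids


# Two obvious groups: one near (0, 0) and one near (10, 10)
points = np.array([[0.0, 0.0], [0.5, 0.0], [0.0, 0.5],
                   [10.0, 10.0], [10.5, 10.0], [10.0, 10.5]])
labels, centroids = kmeans_sketch(points, k=2)
print(labels)
print(centroids)
```

The first three points should end up in one cluster and the last three in the other; which cluster gets id 0 depends on the random start.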

# Visualizing K-means Clustering

- Data points are grouped into `N_CLUSTERS` clusters (2 in the code below).
- A feature is a single measured property of a sample; here each sample has two features.

In [None]:
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
import pandas as pd

plt.style.use("ggplot")

N_CLUSTERS = 2
# Generate sample data
X, _ = make_blobs(n_samples=300, centers=N_CLUSTERS, cluster_std=0.60, random_state=0)
df = pd.DataFrame(X, columns=["x", "y"])

df.head()

In [None]:
# Visualization of the data
plt.scatter(X[:, 0], X[:, 1])
plt.title("Scatter Plot for Unlabelled data")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()

## K-means Clustering


In [None]:
from sklearn.cluster import KMeans

# Apply K-means clustering (random_state makes the result repeatable)
kmeans = KMeans(n_clusters=N_CLUSTERS, random_state=0)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)

# Visualization of the Clusters
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap="viridis")
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c="red", s=200, alpha=0.75)
plt.title("Scatter Plot for K-means Clustering")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()

# KMeans on Breast Cancer Patients

## Load data

In [None]:
from sklearn.datasets import load_breast_cancer
import pandas as pd

# Load the Breast Cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target


# Display the dataframe
df_all = pd.DataFrame(X, columns=data.feature_names)

df_all["Target"] = y
df_all["Target"] = df_all["Target"].map({i: v for i, v in enumerate(data.target_names)})
df_all.head()




# Splitting Data into training and test datasets

- Splitting data into training and test sets lets us evaluate the model on data it has not seen, which shows how well it generalizes.

- Data can be split into training, test, and validation sets.
- A common starting split is:
    - Training: 80%
    - Test: 20%
    - At some point you will also need a validation set, but we will skip that for now!
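
To see why the test set matters, here is a small sketch (an illustration, not part of the lecture code). It assumes a 1-nearest-neighbor classifier, which essentially memorizes its training data, and uses `_demo` variable names chosen here so that nothing else in the notebook is overwritten.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = load_breast_cancer(return_X_y=True)

# 80% training, 20% test; random_state makes the split repeatable
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# With 1 neighbor, each training point is its own nearest neighbor,
# so the model essentially memorizes the training set
memorizer = KNeighborsClassifier(n_neighbors=1)
memorizer.fit(X_train_demo, y_train_demo)

train_accuracy = memorizer.score(X_train_demo, y_train_demo)
test_accuracy = memorizer.score(X_test_demo, y_test_demo)

print(f"Training samples: {len(X_train_demo)}, test samples: {len(X_test_demo)}")
print(f"Accuracy on training data:    {train_accuracy:.3f}")
print(f"Accuracy on unseen test data: {test_accuracy:.3f}")
```

The score on the training data is usually the more flattering one, which is why only the test score says something about how the model handles new samples.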


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

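scaler = StandardScaler()

# Split first, then standardize: fit the scaler on the training data only
# and reuse the same scaling for the test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Build a dataframe of the (unscaled) test set for display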
df_test = pd.DataFrame(X_test, columns=data.feature_names)

df_test["Target"] = y_test
df_test["Target"] = df_test["Target"].map({i: v for i, v in enumerate(data.target_names)})

df_test.sample(10)

## Visualize Breast Cancer Dataset

In [None]:
import seaborn as sns

sns.pairplot(df_all, hue="Target", vars=data.feature_names[:3])
plt.show()

In [None]:
import seaborn as sns

sns.pairplot(df_test, hue="Target", vars=data.feature_names[:3])
plt.show()

# Performance of the KMeans algorithm

- Often measured as accuracy: the fraction of samples whose predicted label matches the true label

!["acc.png"](acc.png)
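
As a worked example, accuracy can be read off a confusion matrix: correct predictions sit on the diagonal. The labels below are made up for illustration and are not results from the breast cancer data.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions for 10 samples (made up)
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

# Rows are true labels, columns are predicted labels
cm = confusion_matrix(y_true, y_pred)

# Accuracy = correct predictions (the diagonal) / all predictions
manual_accuracy = cm.trace() / cm.sum()

print(cm)
print(f"Accuracy by hand:      {manual_accuracy:.2f}")
print(f"accuracy_score result: {accuracy_score(y_true, y_pred):.2f}")
```

Here 7 of the 10 predictions are correct (3 for class 0 and 4 for class 1), so both lines report 0.70.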

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score

# Apply K-means clustering
kmeans = KMeans(n_clusters=N_CLUSTERS, random_state=42)
kmeans.fit(X_train_scaled)
y_kmeans = kmeans.predict(X_test_scaled)

# Map cluster ids to the original labels (0: malignant, 1: benign).
# Cluster ids are arbitrary, so swap them if cluster 0 holds more
# class-1 samples than class-0 samples.
cm_raw = confusion_matrix(y_test, y_kmeans)
mapping = {0: 1, 1: 0} if cm_raw[0][0] < cm_raw[1][0] else {0: 0, 1: 1}
y_kmeans_mapped = [mapping[label] for label in y_kmeans]

# Evaluate the clustering performance
accuracy_Kmeans = accuracy_score(y_test, y_kmeans_mapped)
conf_matrix_Kmeans = confusion_matrix(y_test, y_kmeans_mapped)

# Print the results
print(f"KMeans Accuracy using {len(data.feature_names)} features: {accuracy_Kmeans * 100:.1f}%")
print(f"KMeans Confusion Matrix using {len(data.feature_names)} features:")
print(conf_matrix_Kmeans)


# Accuracy visualized
!["acc2.png"](acc2.png)

# Visualize the clusters

In [None]:
df_test["Cluster"] = y_kmeans_mapped
df_test["Cluster"] = df_test["Cluster"].map({i: v for i, v in enumerate(data.target_names)})


plt.figure(figsize=(10, 6))
sns.scatterplot(data=df_test, x="mean radius", y="mean texture", hue="Cluster", style="Target", palette="viridis")
plt.title(f"K-means Clustering of Breast Cancer Data using {len(data.feature_names)} features")
plt.xlabel("Mean Radius")
plt.ylabel("Mean Texture")
plt.legend(title="Cluster/Target")
plt.show()

# Exercise

- What is the accuracy of KMeans if we use only 1 feature?
- Use the following code as the starting point.
- The model should be fitted on the training data and evaluated on the test data.

- What do you expect will happen to the accuracy?

In [None]:
# Keep only the first feature of X, then do KMeans clustering
X_train_scaled1 = X_train_scaled[:, :1]  # 1 feature only
X_test_scaled1 = X_test_scaled[:, :1]  # 1 feature only
# y_test holds the labels for the test set; the labels do not change!


# Answer

In [None]:
# Apply K-means clustering and predict the labels
print("Apply K-means clustering after this print!")
kmeans.fit(X_train_scaled1)
y_kmeans1 = kmeans.predict(X_test_scaled1)


# Map cluster ids to the original labels (0: malignant, 1: benign).
# Cluster ids are arbitrary, so swap them if cluster 0 holds more
# class-1 samples than class-0 samples.
cm_raw1 = confusion_matrix(y_test, y_kmeans1)
mapping = {0: 1, 1: 0} if cm_raw1[0][0] < cm_raw1[1][0] else {0: 0, 1: 1}
y_kmeans1_mapped = [mapping[label] for label in y_kmeans1]

# Evaluate the clustering performance
print("Evaluate the clustering performance")
accuracy_Kmeans1 = accuracy_score(y_test, y_kmeans1_mapped)
conf_matrix_Kmeans1 = confusion_matrix(y_test, y_kmeans1_mapped)

# Print the results
print(f"KMeans Accuracy using {1} features: {accuracy_Kmeans1 * 100:.1f}%")
print(f"KMeans Confusion Matrix using {1} features:")
print(conf_matrix_Kmeans1)

# Supervised Learning

**Why Train the Model?**

- Training the model involves learning patterns from the training data, which the model uses to make predictions.



## K Nearest Neighbors

- Lazy learner: it stores the training data and does the actual work at prediction time
- K: how many neighbors should be considered to find the closest fit
    - For example, if K = 3, we find the three training samples closest to a given sample and pick the majority label
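
The majority vote can be sketched in a few lines of NumPy. This is an illustration only: the function `knn_predict_one` and the toy data are made up here, and Euclidean distance is assumed as the measure of closeness.

```python
from collections import Counter

import numpy as np


def knn_predict_one(X_known, y_known, sample, k=3):
    """Predict one label by majority vote among the k nearest known samples."""
    # Euclidean distance from the sample to every known point
    distances = np.linalg.norm(X_known - sample, axis=1)
    # Indices of the k closest known points
    nearest = np.argsort(distances)[:k]
    # The most common label among those neighbors wins
    votes = Counter(y_known[nearest].tolist())
    return votes.most_common(1)[0][0]


# Tiny made-up data set: class 0 near the origin, class 1 near (5, 5)
X_toy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                  [5.0, 5.0], [6.0, 5.0], [5.0, 6.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict_one(X_toy, y_toy, np.array([0.5, 0.5])))  # neighbors are all class 0
print(knn_predict_one(X_toy, y_toy, np.array([5.5, 5.5])))  # neighbors are all class 1
```

Nothing is "trained" here: the function simply looks up the stored samples each time, which is what makes k-NN a lazy learner.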


## Steps
1. Import the k-nearest neighbor algorithm from the scikit-learn package.
2. Create feature and target variables.
3. Split data into training and test data.
4. Create a k-NN model with a chosen number of neighbors.
5. Train (fit) the model on the training data.
6. Predict labels for new, unseen data.

# K Nearest Neighbors

!["knn.png"](knn.png)

# K Nearest Neighbors


In [None]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=7)

knn.fit(X_train_scaled, y_train)
y_knn = knn.predict(X_test_scaled)

# Evaluate the classifier: accuracy and confusion matrix on the test set
accuracy_knn = knn.score(X_test_scaled, y_test)
conf_matrix_knn = confusion_matrix(y_test, y_knn)

# Print the results
print(f"KNN Accuracy using {len(data.feature_names)} features: {accuracy_knn * 100:.1f}%")
print(f"KNN Confusion Matrix using {len(data.feature_names)} features:")
print(conf_matrix_knn)

# Exercise

- What is the accuracy of K Nearest Neighbors if we use 1 feature?
- The model should be fitted on the training data and evaluated on the test data.

- What do you expect will happen to the accuracy?
- How will the accuracy compare to KMeans?

- Use the following code as the starting point.

In [None]:
# Keep only the first feature of X for KNN
X_train_scaled1 = X_train_scaled[:, :1]  # 1 feature only
X_test_scaled1 = X_test_scaled[:, :1]  # 1 feature only
# y_train and y_test are the labels; they do not change!


# Apply KNN and predict the labels
print("Apply KNN after this print!")

# Answer


In [None]:
# Apply KNN and predict the labels
print("Apply KNN after this print!")
knn.fit(X_train_scaled1, y_train)
y_knn1 = knn.predict(X_test_scaled1)


# Evaluate the classification performance
print("Evaluate the classification performance")
accuracy_knn1 = knn.score(X_test_scaled1, y_test)
conf_matrix_knn1 = confusion_matrix(y_test, y_knn1)

# Print the results
print(f"KNN Accuracy using {1} features: {accuracy_knn1 * 100:.1f}%")
print(f"KNN Confusion Matrix using {1} features:")
print(conf_matrix_knn1)

# Support Vector Machines (SVMs)

A Support Vector Machine tries to find the best separating line (in higher dimensions, a hyperplane) between classes.

**The advantages of support vector machines are:**
- Effective in high-dimensional spaces.
- Still effective in cases where the number of dimensions is greater than the number of samples.
- Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.
- Versatile: different kernel functions can be specified for the decision function.

**The disadvantages of support vector machines include:**
- If the number of features is much greater than the number of samples, avoiding over-fitting through the choice of kernel function and regularization term is crucial.
- SVMs do not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation.


## How SVM Works
- SVM finds the best hyperplane that separates the data into different classes while maximizing the margin.

!["svm.png"](svm.png)
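
The idea of a maximum-margin boundary can be tried on a tiny made-up data set. The sketch below assumes a linear kernel and a large `C` (chosen here so that the model behaves almost like a hard-margin SVM); the points and variable names are illustrative only.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up 2D points: class 0 on the left, class 1 on the right
X_toy = np.array([[0.0, 0.0], [0.0, 2.0], [1.0, 1.0],
                  [3.0, 1.0], [4.0, 0.0], [4.0, 2.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

# A linear kernel gives a straight-line decision boundary: w . x + b = 0
svm_toy = SVC(kernel="linear", C=1000)
svm_toy.fit(X_toy, y_toy)

w = svm_toy.coef_[0]
b = svm_toy.intercept_[0]
# For a linear SVM the margin width is 2 / ||w||
margin_width = 2 / np.linalg.norm(w)

print("Support vectors:")
print(svm_toy.support_vectors_)
print("Weights w:", w, "intercept b:", b)
print("Margin width:", margin_width)
```

In this toy data the closest points of the two classes are (1, 1) and (3, 1), so the widest possible gap is 2 and the boundary should lie roughly at x = 2. Only the points on the edge of the margin are kept as support vectors.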

# Example using Breast Cancer

In [None]:
from sklearn import svm

clf = svm.SVC()
clf.fit(X_train_scaled, y_train)

# Predict on the scaled test data, since the model was trained on scaled data
y_clf = clf.predict(X_test_scaled)
accuracy_clf = accuracy_score(y_test, y_clf)
conf_matrix_clf = confusion_matrix(y_test, y_clf)


# Print the results
print(f"SVM Accuracy using {len(data.feature_names)} features: {accuracy_clf * 100:.1f}%")
print(f"SVM Confusion Matrix using {len(data.feature_names)} features:")
print(conf_matrix_clf)


---

# Improving the Model

**Why Improve the Model?**
- Improving the model can lead to better performance and more accurate predictions.
- Experiment with different models, hyperparameters, and feature engineering to improve performance.

**Examples:**
- Hyperparameter Tuning: Adjusting parameters like max_depth for Decision Trees.
- Feature Engineering: Creating new features from existing data.

- You will not be doing this; it is far beyond the scope of this course!
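
For the curious, hyperparameter tuning for `max_depth` might look roughly like the sketch below. It assumes scikit-learn's `GridSearchCV` with 5-fold cross-validation; the candidate depths and the variable names are arbitrary choices made for this illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X_all, y_all = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, test_size=0.2, random_state=42)

# Candidate values are an arbitrary choice for illustration
param_grid = {"max_depth": [1, 2, 3, 5, None]}

# Try every candidate using cross-validation on the training data only
search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
search.fit(X_tr, y_tr)

print("Best max_depth:", search.best_params_["max_depth"])
print(f"Cross-validated accuracy: {search.best_score_:.3f}")
print(f"Test accuracy:            {search.score(X_te, y_te):.3f}")
```

The test set is only used once at the end, so it still measures performance on unseen data.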

# Exercises


## Exercise 1 - Classification with Breast Cancer Dataset

**Objective**: Train a classifier on the Breast Cancer dataset and evaluate its performance.
Steps:
1. Load the Breast Cancer dataset.
2. Split the data into training and testing sets.
3. Train a Decision Tree classifier using one feature.
4. Train a Decision Tree classifier using multiple features.
5. Make predictions and evaluate accuracy.



In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load data
data = load_breast_cancer()
X, y = data.data, data.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model with one feature
X_train_one_feature = X_train[:, [0]]  # Using 'mean radius'
X_test_one_feature = X_test[:, [0]]
model = DecisionTreeClassifier(random_state=42)  # fixed seed for repeatable results
model.fit(X_train_one_feature, y_train)

# Make predictions with one feature
y_pred_one_feature = model.predict(X_test_one_feature)

# Evaluate model with one feature
accuracy_one_feature = accuracy_score(y_test, y_pred_one_feature)
print(f"Accuracy with one feature: {accuracy_one_feature}")

# Train model with multiple features
model.fit(X_train, y_train)

# Make predictions with multiple features
y_pred = model.predict(X_test)

# Evaluate model with multiple features
accuracy_all_features = accuracy_score(y_test, y_pred)
print(f"Accuracy with multiple features: {accuracy_all_features}")
