# COMP4030 - Data Science and Machine Learning

# Clustering and Classification


**Topics from last labs you absolutely need to be familiar with**:

- how to conduct exploratory data analysis on a given dataset
- basic statistical descriptors and how to visualise statistical relationships
- data cleaning techniques
- normalisation


```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
!pip install pyclustering
!pip install yellowbrick
```



# Exercises to try out different Clustering algorithms

## Read up on the following before doing the tasks

### Clustering
https://scikit-learn.org/stable/modules/clustering.html

### Principal Component Analysis
https://scikit-learn.org/stable/modules/decomposition.html#decompositions

### Silhouette score
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html

### Distance measures
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.DistanceMetric.html

**Task 1**: 
- Load the cars dataset and apply the k-means clustering algorithm. 
- Experiment with different numbers of clusters and report the silhouette score. 
- Use a scatter plot to visualise the clustering results for the different numbers of clusters.

#### More information on K-means Clustering https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans
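
One possible starting point for the silhouette comparison is sketched below. It is not a model answer: the cars dataset is not loaded here, so `make_blobs` generates stand-in data (an assumption), and the helper name `silhouette_by_k` is made up for this example. Swap in the numeric columns of the cars data once you have loaded and cleaned them.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


def silhouette_by_k(X, k_values, random_state=0):
    """Fit k-means once per k and return {k: silhouette score}."""
    scores = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return scores


# Stand-in data (assumption): replace with the numeric cars features.
X_raw, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=0)
X = StandardScaler().fit_transform(X_raw)  # k-means is sensitive to feature scale

scores = silhouette_by_k(X, range(2, 7))
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
```

The silhouette score lies between -1 and 1, and higher values indicate tighter, better separated clusters. For the scatter plot, colouring two chosen features by the predicted `labels` is usually enough.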

**Task 2**: 
- Load the iris dataset from scikit-learn and perform k-means clustering to group the iris flowers into different clusters based on their features. 
- Use the elbow method to determine the optimal number of clusters. 

#### You can use the KElbowVisualizer Object from the Yellowbrick library https://www.scikit-yb.org/en/latest/api/cluster/elbow.html
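
If you want to see what the elbow method computes before using Yellowbrick, the sketch below records the k-means inertia (within-cluster sum of squared distances) for a range of k by hand. The range of k and the use of `StandardScaler` are choices made for this example, not requirements of the task.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)

inertias = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_  # sum of squared distances to the nearest centre

for k, inertia in inertias.items():
    print(f"k={k}: inertia={inertia:.1f}")
```

Plotting `inertias` against k and looking for the point where the curve flattens gives the "elbow". The `KElbowVisualizer` linked above automates this kind of plot.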

**Task 3**: 
- Load the wine dataset from scikit-learn and perform hierarchical clustering to group the wines into different clusters based on their features. 
- Experiment with different linkage methods and report the silhouette score. 
- Use dendrogram visualisation to analyze the clustering results.

#### More information on Hierarchical Clustering https://scikit-learn.org/stable/modules/clustering.html#hierarchical-clustering
#### Wine dataset https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_wine.html
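
A minimal sketch of the linkage comparison follows, assuming three clusters (the wine dataset has three cultivars, but treat that as a choice to experiment with). The dendrogram is built with SciPy, and `no_plot=True` is used here only so the example runs without a display; remove it to draw the tree with matplotlib.

```python
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_wine
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_wine().data)

linkage_scores = {}
for method in ("ward", "complete", "average", "single"):
    labels = AgglomerativeClustering(n_clusters=3, linkage=method).fit_predict(X)
    linkage_scores[method] = silhouette_score(X, labels)
    print(f"{method:>8}: silhouette={linkage_scores[method]:.3f}")

# SciPy computes the full merge tree; each row of Z describes one merge.
Z = linkage(X, method="ward")
# truncate_mode="lastp" condenses the tree to its last merges for readability.
tree = dendrogram(Z, no_plot=True, truncate_mode="lastp", p=12)
print("dendrogram keys:", sorted(tree))
```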

**Task 4**: 
- Load the diabetes dataset and apply the DBSCAN clustering algorithm to group the data points into different clusters based on their proximity to each other.
- Experiment with different epsilon (eps) and minimum samples (min_samples) values and report the silhouette score. 
- Use a scatter plot to visualise the clustering results for the different values of epsilon and minimum samples.

#### More information on DBSCAN https://scikit-learn.org/stable/modules/clustering.html#dbscan
#### More information on eps and min_samples https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html#sklearn.cluster.DBSCAN
#### Diabetes dataset https://www.kaggle.com/datasets/mathchi/diabetes-data-set
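
The parameter sweep could be organised along the following lines. The Kaggle diabetes file is not loaded here, so `make_moons` provides stand-in data (an assumption), and the `eps` and `min_samples` grids are arbitrary examples. Note that DBSCAN labels noise points as -1 and that the silhouette score is only defined for two or more clusters, so both cases need handling.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): replace with the scaled diabetes features.
X_raw, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
X = StandardScaler().fit_transform(X_raw)

results = []
for eps in (0.1, 0.3, 0.5):
    for min_samples in (3, 5, 10):
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        clustered = labels != -1  # DBSCAN labels noise points as -1
        n_clusters = len(set(labels[clustered]))
        n_noise = int((~clustered).sum())
        # Score only the clustered points, and only if there are 2+ clusters.
        score = (
            silhouette_score(X[clustered], labels[clustered])
            if n_clusters >= 2
            else float("nan")
        )
        results.append((eps, min_samples, n_clusters, n_noise, score))
        print(f"eps={eps}, min_samples={min_samples}: {n_clusters} clusters, "
              f"{n_noise} noise points, silhouette={score:.3f}")
```

Excluding noise from the silhouette calculation is a design choice made in this sketch; scoring all points with noise treated as its own group is another option, and you should state which one you report.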

**Task 5**: 
- Load the Wisconsin breast cancer dataset and apply the k-means clustering algorithm.
- Experiment with different numbers of clusters and report the silhouette score.
- Use a scatter plot to visualise the clustering results for the different numbers of clusters. What can you say about the dataset and its features compared with the previous task?
- Try using PCA and then compare again. Also try hierarchical clustering.

#### Wisconsin Breast Cancer Dataset https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data
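
For the PCA comparison, a sketch like the one below may help. It assumes that scikit-learn's built-in `load_breast_cancer` (the same Wisconsin diagnostic data) is an acceptable substitute for the Kaggle CSV, and it fixes k=2 and two principal components purely for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_breast_cancer().data)

# Project the 30 standardised features onto the first two principal components.
pca = PCA(n_components=2, random_state=0)
X_2d = pca.fit_transform(X)
explained = pca.explained_variance_ratio_.sum()
print(f"variance explained by 2 components: {explained:.3f}")

pca_scores = {}
for name, data in (("all features", X), ("2 components", X_2d)):
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(data)
    pca_scores[name] = silhouette_score(data, labels)
    print(f"{name}: silhouette={pca_scores[name]:.3f}")
```

The two columns of `X_2d` also give convenient axes for the scatter plot, since 30 features cannot be plotted directly.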

# Exercises to try out different Classification algorithms

## Read up on the following before doing the tasks

### Precision - Recall
https://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html#sphx-glr-auto-examples-model-selection-plot-precision-recall-py 

### ROC curves
https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc_crossval.html#sphx-glr-auto-examples-model-selection-plot-roc-crossval-py

### Confusion Matrix
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html#sklearn.metrics.confusion_matrix

### N-fold Cross Validation
https://scikit-learn.org/stable/modules/cross_validation.html


**Task 6**: 
- Load the iris dataset from scikit-learn and split it into training and testing sets. 
- Train a decision tree model to classify the different types of iris flowers based on their features. 
- Use the graphviz library to visualise the decision tree.

#### More information on Decision Trees https://scikit-learn.org/stable/modules/tree.html
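
The steps above could be sketched as follows. The 70/30 split and `max_depth=3` are arbitrary choices for this example. `export_graphviz` only produces the DOT description of the tree; rendering it to an image needs the separate `graphviz` package, which is shown as a comment because it may not be installed.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_graphviz

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, stratify=iris.target, random_state=0
)

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
test_accuracy = clf.score(X_test, y_test)
print(f"test accuracy: {test_accuracy:.3f}")

# With out_file=None, export_graphviz returns the tree as a DOT-format string.
dot_source = export_graphviz(
    clf,
    out_file=None,
    feature_names=iris.feature_names,
    class_names=list(iris.target_names),
    filled=True,
    rounded=True,
)
# To render (requires the graphviz Python package and system binaries):
#   import graphviz
#   graphviz.Source(dot_source).render("iris_tree", format="png")
print(dot_source[:60])
```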

**Task 7**: 
- Load the cars dataset and split it into training and testing sets. 
- Apply the k-nearest neighbours (KNN) algorithm with different values of k and find the accuracy, precision and recall.
- Use a confusion matrix to compare the performance of the different values of k.

#### More information on Nearest Neighbour Classification https://scikit-learn.org/stable/modules/neighbors.html
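
A possible structure for the comparison over k is shown below. Because the cars dataset and its target column are not specified here, `make_classification` generates a three-class stand-in (an assumption); the values of k and the macro averaging of precision and recall are likewise just example choices.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import (
    accuracy_score, confusion_matrix, precision_score, recall_score,
)
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): replace with the cars features and target.
X, y = make_classification(
    n_samples=400, n_features=6, n_informative=4, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

knn_results = {}
for k in (1, 5, 15):
    # Scaling inside the pipeline keeps test data out of the scaler's fit.
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    y_pred = model.fit(X_train, y_train).predict(X_test)
    knn_results[k] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, average="macro", zero_division=0),
        "recall": recall_score(y_test, y_pred, average="macro", zero_division=0),
        "confusion": confusion_matrix(y_test, y_pred),
    }
    print(f"k={k}: accuracy={knn_results[k]['accuracy']:.3f}")
    print(knn_results[k]["confusion"])
```

In each confusion matrix, rows are the true classes and columns the predicted classes, so off-diagonal entries show which classes a given k confuses.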

**Task 8**: 
- Load the Wisconsin breast cancer dataset from scikit-learn and split it into training and testing sets. 
- Train a support vector machine (SVM) model to classify the tumors as malignant or benign based on their features. 
- Evaluate the precision and recall of the model on the testing set.
- Experiment with different kernel functions and report the accuracy, precision and recall.
- Use a confusion matrix and ROC curve to compare the performance of the different kernels.

#### More information on Support Vector Machines: https://scikit-learn.org/stable/modules/svm.html#
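
The kernel comparison might be set up as in this sketch. The split size and the default `SVC` settings are assumptions made for the example. One detail worth knowing: in scikit-learn's copy of the dataset the target is 0 for malignant and 1 for benign, so the default precision and recall treat benign as the positive class.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import (
    confusion_matrix, precision_score, recall_score, roc_auc_score,
)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)  # target: 0 = malignant, 1 = benign
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

kernel_results = {}
for kernel in ("linear", "poly", "rbf", "sigmoid"):
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel)).fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # An ROC curve needs continuous scores, not hard class predictions.
    y_score = model.decision_function(X_test)
    kernel_results[kernel] = {
        # Pass pos_label=0 to treat malignant as the positive class instead.
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "auc": roc_auc_score(y_test, y_score),
        "confusion": confusion_matrix(y_test, y_pred),
    }
    r = kernel_results[kernel]
    print(f"{kernel:>7}: precision={r['precision']:.3f}, "
          f"recall={r['recall']:.3f}, AUC={r['auc']:.3f}")
```

To draw the ROC curves themselves, `sklearn.metrics.roc_curve(y_test, y_score)` returns the false positive rates, true positive rates and thresholds to plot.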

**Task 9**: 
- Load the digits dataset from scikit-learn and split it into training and testing sets. 
- Train a Random Forest model to classify the handwritten digits based on their pixel values. 
- Evaluate the accuracy of the model on the testing set.

#### More information on Random Forest Classifiers https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier 

#### Digits Dataset https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html
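
A minimal sketch of the task, assuming a 75/25 split and 100 trees (both arbitrary choices you can vary later):

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Each sample is an 8x8 greyscale image flattened to 64 pixel values.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

accuracy = accuracy_score(y_test, forest.predict(X_test))
print(f"test accuracy: {accuracy:.3f}")
```

No feature scaling is applied here because tree-based models split on thresholds and are not affected by monotonic rescaling of individual features.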

**Task 10**: 
- Load the wine dataset and use an artificial neural network (multi-layer perceptron - MLP) for classification, splitting the data into a training and test set.
- Experiment with different numbers of neurons in a single hidden layer using sigmoid activation functions, and report the accuracy, precision and recall on the training set for each number of neurons.
- Use a confusion matrix to compare the performance of the MLP with the optimal number of neurons for different numbers of iterations.

#### More information on MLPs https://scikit-learn.org/stable/modules/neural_networks_supervised.html
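
The neuron sweep could look something like the sketch below. In scikit-learn the sigmoid activation is selected with `activation="logistic"`. The hidden layer sizes, `max_iter=2000` and the macro averaging are assumptions for this example; test accuracy is printed alongside the training metrics only to make overfitting visible.

```python
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

mlp_results = {}
for n_hidden in (2, 8, 32):
    mlp = make_pipeline(
        StandardScaler(),
        MLPClassifier(
            hidden_layer_sizes=(n_hidden,),  # one hidden layer of n_hidden neurons
            activation="logistic",           # scikit-learn's name for sigmoid
            max_iter=2000,
            random_state=0,
        ),
    ).fit(X_train, y_train)
    train_pred = mlp.predict(X_train)
    mlp_results[n_hidden] = {
        "train_accuracy": accuracy_score(y_train, train_pred),
        "train_precision": precision_score(y_train, train_pred, average="macro", zero_division=0),
        "train_recall": recall_score(y_train, train_pred, average="macro", zero_division=0),
        "test_accuracy": mlp.score(X_test, y_test),
    }
    print(n_hidden, {k: round(v, 3) for k, v in mlp_results[n_hidden].items()})
```

For the iteration comparison, the same loop can be run over `max_iter` values with the chosen `n_hidden` held fixed; scikit-learn will warn when training stops before convergence.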

# Refine your models and analysis


**Task 11**: 
- Experiment with other hyperparameters of each algorithm, such as:
    - learning rate
    - number of neighbours
    - maximum depth of the tree
    - kernel functions, etc.

**Task 12**: 
- Use cross-validation to validate the performance of each algorithm and select the best hyperparameters.
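
One common way to combine cross-validation with hyperparameter selection is a grid search, sketched here for KNN on iris. The dataset, the grid of `n_neighbors` values and the 5 folds are example choices, and the step names `"scale"` and `"knn"` are labels invented for this pipeline.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Scaling lives inside the pipeline so it is refitted on each training fold.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])
# Parameters of a pipeline step are addressed as <step name>__<parameter>.
param_grid = {"knn__n_neighbors": [1, 3, 5, 7, 9, 11]}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print("best parameters:", search.best_params_)
print(f"mean cross-validated accuracy: {search.best_score_:.3f}")
```

The same pattern applies to the other models by swapping the estimator and the grid, for example `max_depth` for a decision tree or `kernel` for an SVM.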

**Task 13**: 
- Use feature selection or dimensionality reduction techniques, such as correlation analysis and PCA, to identify the most relevant features for each dataset.
- Repeat the clustering and classification tasks, comparing the performance of the different algorithms using appropriate evaluation metrics such as:
    - accuracy
    - precision
    - recall
    - ROC curve
    - silhouette score
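
As an illustration of correlation analysis for feature selection, the sketch below greedily drops any feature that is highly correlated with one already kept. The breast cancer data and the 0.95 threshold are hypothetical choices for this example, and this filter is only one of several reasonable approaches.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X, names = data.data, data.feature_names

# Absolute Pearson correlation between every pair of features (30 x 30).
corr = np.abs(np.corrcoef(X, rowvar=False))

threshold = 0.95  # hypothetical cut-off; tune it per dataset
kept = []
for j in range(X.shape[1]):
    # Keep feature j only if it is not highly correlated with a kept feature.
    if all(corr[j, i] < threshold for i in kept):
        kept.append(j)

X_reduced = X[:, kept]
print(f"kept {len(kept)} of {X.shape[1]} features")
print("dropped:", [str(names[j]) for j in range(X.shape[1]) if j not in kept])
```

Rerunning a clustering or classification task on `X_reduced` and comparing the metrics with those for the full feature set shows whether the dropped features carried useful information.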

**Task 14**: 
- Visualise the clustering results using scatter plots, dendrograms, or other visualisation techniques to gain insights into the structure of the data.