## Assignment 2 - Data Analysis - Web Mining

The web is an immense source of data, offering valuable insights across various domains. In this assignment, we will explore different ways to process and analyze web data to uncover meaningful patterns and make informed decisions. Through five practical exercises, we will apply key data mining techniques to real-world scenarios.

Each exercise focuses on a specific application:

* *Exercise 1* - Clickbait classification
* *Exercise 2* - Sentiment analysis on comments
* *Exercise 3* - Movie recommendation
* *Exercise 4* - Association rules in online shopping
* *Exercise 5* - Clustering of mobile apps

These exercises will provide hands-on experience in working with real-world data collected from the web, helping you understand its potential for analysis.

For this assignment, complete all exercises that are marked in <span style='color:red;font-weight:bold'>red</span>. Please make sure all your cells run correctly (try *Clear All Outputs* then *Run All* once before submitting). **Check that the cell outputs are visible, even for the coding parts.**

The assignment is due on <span style='color:red;font-weight:bold'>Thursday 27th of March 2025 at 23:59</span>.

No report is needed, as all questions can be answered directly in this notebook file. You only need to submit the completed notebook file on the [Moodle assignment page](https://moodle.msengineering.ch/course/view.php?id=2732). Only one file per group is required for submission.

If you have any questions or issues, please contact one of the assistants below:
- Cédric Campos Carvalho (*Teams* might be easier to discuss, mail: cedric.camposcarvalho@heig-vd.ch)
- Elena Najdenovska (mail: elena.najdenovska@heig-vd.ch)


Teacher:
- Laura Elena Raileanu (mail: Laura.Raileanu@heig-vd.ch)

### Exercise 1 - Clickbait classification

The objective of this first part is to model a filter for "clickbait" in online news media. Clickbait headlines are designed to attract attention and drive clicks, often at the expense of accuracy or relevance.

To achieve this, we provide you with a dataset containing more than 10'000 press headlines collected in 2016. Each row in this dataset corresponds to a single headline, which is described by the following two attributes:

* `headline`: the text representing the title
* `clickbait`: the label identifying whether the title is clickbait *(1)* or not *(0)*.

In [8]:
import pandas as pd
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import FunctionTransformer

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score

import warnings
warnings.simplefilter("ignore")

In [9]:
df_1 = pd.read_excel('data/part1_classification/news_clickbait.xlsx', engine="openpyxl")
df_1

Unnamed: 0,headline,clickbait
0,You Need To Tell Us If These Things Are Doughnuts,1
1,15 Great Pieces Of Relationship Advice From Books,1
2,Improved E-Mail Service From a Dedicated Device,0
3,"Two MBTA Green Line trains collide in Newton, ...",0
4,17 Struggles All Smartypants Will Understand,1
...,...,...
10536,Can You Match The Phone To The R&B Video,1
10537,19 Soul Food Recipes That Are Almost As Good A...,1
10538,16 Photos Of Desis That Will Give You Intense ...,1
10539,City Plans to Make Older Buildings Refit to Sa...,0


<p style='color:red;font-weight:bold'>Exercise 1.1 :</p>

The first step is to separate the dataset into two sets (training and test). Complete the input parameters of the `train_test_split` function.

In [10]:
# Split the data into features (X) and target (y)
X = df_1['headline']
y = df_1['clickbait']

# Split into training and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

<p style='color:red;font-weight:bold'>Exercise 1.2 :</p>

Create a pre-processing `Pipeline` using the [scikit-learn Pipeline object](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html).
Your pipeline needs to:
- Vectorize the headlines with term frequency-inverse document frequency (TF-IDF).
- Convert the result to a dense array when it is a [`sparse matrix`](https://docs.scipy.org/doc/scipy/reference/sparse.html#module-scipy.sparse), so the data goes through the model without any issues.

**Do not forget to remove words giving no information for classification (i.e. [stop words](https://en.wikipedia.org/wiki/Stop_word)).**

In [11]:
# Convert a sparse matrix to a dense array (required by GaussianNB)
def to_dense(X):
    return X.toarray()

# Pre-processing pipeline
preprocessor = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('densify', FunctionTransformer(to_dense, accept_sparse=True)),
])

<p style='color:red;font-weight:bold'>Exercise 1.3 :</p>

Create the pipeline using `make_pipeline`, combining the pre-processing pipeline and the `GaussianNB` model. Then, train the pipeline and find the score obtained with the test set.

In [12]:
# TODO 1.3
# Build the full pipeline (pre-processing + model)
pipeline = make_pipeline(preprocessor, GaussianNB())

# Train the pipeline
pipeline.fit(X_train, y_train)

# Evaluate on the test set
score = pipeline.score(X_test, y_test)
print(f"Score (accuracy) on the test set: {score}")

# Predictions on the test set (used later)
y_pred = pipeline.predict(X_test)

Score (accuracy) on the test set: 0.89900426742532


<p style='color:red;font-weight:bold'>Exercise 1.4 :</p>

Modify the split ratio between the training and test sets and see whether the model's performance changes. Please explain your findings.

*TODO 1.4*

In [13]:
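# Try several split ratios.
# Note: the loop reassigns X_train/X_test/y_train/y_test, so later cells
# use the last split (test_size=0.4).
for test_size in [0.1, 0.2, 0.3, 0.4]: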
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=42)
    pipeline.fit(X_train, y_train)
    score = pipeline.score(X_test, y_test)
    print(f"Test size: {test_size}, Score: {score}")

Test size: 0.1, Score: 0.8957345971563981
Test size: 0.2, Score: 0.89900426742532
Test size: 0.3, Score: 0.8921909579513121
Test size: 0.4, Score: 0.8878349537585961


<p style='color:red;font-weight:bold'>Exercise 1.5 :</p>

Keep your split validation but now incorporate [cross validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)). Use the [`cross_val_score`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html) function with a 5-fold split over the training set, without using the test set. Then, calculate the average accuracy (with standard deviation) over the 5 folds.

**Explain the results obtained and how to read them compared to the split validation.**

In [14]:
# TODO 1.5
# 5-fold cross-validation on the training set
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='accuracy')

print(f"Cross-validation scores: {cv_scores}")
print(f"Mean score: {cv_scores.mean()}")
print(f"Standard deviation: {cv_scores.std()}")

Cross-validation scores: [0.8743083  0.88458498 0.90671937 0.89249012 0.88370253]
Mean score: 0.8883610596887979
Standard deviation: 0.010839901287151002


*TODO 1.5*
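
One way to read the cross-validation output above: each of the five values is the accuracy on one held-out fold, and the mean and standard deviation summarize them. The sketch below recomputes both from the fold scores printed above, using only the standard library. The name `fold_scores` is introduced here for illustration, and the population standard deviation (`pstdev`) is used on the assumption that it corresponds to NumPy's default `std()`.

```python
from statistics import mean, pstdev

# Fold accuracies copied from the cross-validation output above
fold_scores = [0.8743083, 0.88458498, 0.90671937, 0.89249012, 0.88370253]

cv_mean = mean(fold_scores)
cv_std = pstdev(fold_scores)  # population std (ddof=0), like NumPy's default

print(f"CV accuracy: {cv_mean:.4f} +/- {cv_std:.4f}")
```

A single train/test split gives one number that depends on which rows end up in the test set, whereas the spread across folds gives a rough idea of that variability.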

<p style='color:red;font-weight:bold'>Exercise 1.6 :</p>

Try at least 3 different classifiers and report their results in a table. Compare the results.

In [None]:
# TODO 1.6
# Dictionary of classifiers
classifiers = {
    'GaussianNB': GaussianNB(),
    'LogisticRegression': LogisticRegression(max_iter=1000),
    'RandomForest': RandomForestClassifier(),
    'SVC': SVC()
}

# Store the results for each classifier
results = {}

# Loop over the classifiers
for name, clf in classifiers.items():
    # Build the pipeline
    pipeline = make_pipeline(preprocessor, clf)

    # Cross-validation
    cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='accuracy')

    # Store the results
    results[name] = {
        'mean_accuracy': cv_scores.mean(),
        'std_accuracy': cv_scores.std()
    }

# Display the results as a table
results_df = pd.DataFrame.from_dict(results, orient='index')
print(results_df)

*TODO 1.6*

<p style='color:red;font-weight:bold'>Exercise 1.7 :</p>

In text processing, a stemming step is often used during pre-processing. Explain what it consists of and how it can be useful, then give an example.

*TODO 1.7*
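
As a rough illustration of the idea, the sketch below strips a single common English suffix so that several word forms collapse to one token. `toy_stem` is a hypothetical helper written for this example and is deliberately naive; real stemmers such as the Porter stemmer (available in NLTK as `nltk.stem.PorterStemmer`) apply many more rules.

```python
def toy_stem(word: str) -> str:
    """Strip one common English suffix; a deliberately naive illustration."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical headline words sharing the same root
headline_words = ["shocking", "shocked", "shocks", "shock"]
stems = [toy_stem(w) for w in headline_words]
print(stems)
```

Because all four forms map to the same stem, a TF-IDF vectorizer would count them as one feature instead of four, which shrinks the vocabulary.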

### Exercise 2 - Sentiment analysis on comments

The goal of the second part is to analyze tweets from the COVID-19 period and perform sentiment analysis to determine whether they express a positive or negative sentiment.

In the first step, we will work with the dataset `CoronaTwitterComments_2labels.xlsx`, which contains approximately 3'000 comments related to COVID-19 from Twitter in March 2020. These comments are already labeled with *1* (positive comment) or *-1* (negative comment). In the second step, we will consider the dataset `CoronaTwitterComments_3labels.xlsx`, which includes additional neutral comments (labeled with *0*). You will see how we can process the text data with the help of [WordNet](https://wordnet.princeton.edu/) to retrieve a sentiment and evaluate the results.

In [1]:
from sklearn.base import BaseEstimator, TransformerMixin
import pandas as pd

import nltk

In [2]:
# NLTK packages needed for this exercise; feel free to add others if you need them.

nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('sentiwordnet')

[nltk_data] Downloading package stopwords to C:\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.
[nltk_data] Downloading package wordnet to C:\nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     C:\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger_eng.zip.
[nltk_data] Downloading package sentiwordnet to C:\nltk_data...
[nltk_data]   Unzipping corpora\sentiwordnet.zip.


True

In [3]:
corona_tweets2 = pd.read_excel('data/part2_sentimentanalysis/CoronaTwitterComments_2labels.xlsx', engine="openpyxl")
corona_tweets3 = pd.read_excel('data/part2_sentimentanalysis/CoronaTwitterComments_3labels.xlsx', engine="openpyxl")

print(corona_tweets2.SentimentLabel.unique(), corona_tweets3.SentimentLabel.unique())
corona_tweets2

  warn("Workbook contains no default style, apply openpyxl's default")


[-1  1] [-1  1  0]



Unnamed: 0,OriginalTweet,SentimentLabel
0,TRENDING: New Yorkers encounter empty supermar...,-1
1,When I couldn't find hand sanitizer at Fred Me...,1
2,Find out how you can protect yourself and love...,1
3,#Panic buying hits #NewYork City as anxious sh...,-1
4,Voting in the age of #coronavirus = hand sanit...,1
...,...,...
3174,"@RicePolitics @MDCounties Craig, will you call...",-1
3175,Meanwhile In A Supermarket in Israel -- People...,1
3176,Did you panic buy a lot of non-perishable item...,-1
3177,Gov need to do somethings instead of biar je r...,-1


<p style='color:red;font-weight:bold'>Exercise 2.1 :</p>

The first step is to pre-process the text (as in the previous exercise) so that a WordNet sentiment analysis can later be applied to it.

The class inherits from `BaseEstimator` and `TransformerMixin`, ensuring seamless integration with other scikit-learn modules. This allows it to utilize the familiar `fit`, `transform`, and `fit_transform` functions of the library.

Your task is to update the `NLTKPreprocessor` class with the functions needed later in the `transform` function:
- Add a tokenizer specialized for tweets.
- Remove stop words and other non-informative characters.
- Convert the text to lowercase.
- Lemmatize the text using WordNet.

*Advice : If you encounter any issues, refer to the `apply_pipeline` function to understand how it works. This function should **not** be modified.*

In [11]:
from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords, wordnet, sentiwordnet
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag


class NLTKPreprocessor(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.tokenizer = TweetTokenizer()
        self.stop_words = set(stopwords.words('english'))
        self.lemmatizer = WordNetLemmatizer()

        self.pipeline = [
            self.tokenize,
            self.remove_stopwords,
            self.lemmatize
        ]
        
    def tokenize(self, text):
        return self.tokenizer.tokenize(text.lower())
    
    def remove_stopwords(self, tokens):
        return [word for word in tokens if word.isalpha() and word not in self.stop_words]

    def lemmatize(self, tokens):
        return [self.lemmatizer.lemmatize(word) for word in tokens]
        
    def apply_pipeline(self, x):
        for transform in self.pipeline:
            x = transform(x)
        return x


    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [self.apply_pipeline(x) for x in X]

<p style='color:red;font-weight:bold'>Exercise 2.2 :</p>

The next step is to create the `WordNetSentimentAnalyzer`. You will need to create three functions to add to your pipeline:
1. The WordNet sentiment analysis requires the tag of each word in the sentence. Part-of-speech (POS) tagging is the process of assigning grammatical categories, such as nouns, verbs, or adjectives, to words in a text based on their role in a sentence. `nltk` has a pre-trained model that can tag these words; find it and apply it in the pipeline.
2. WordNet uses different tags from the pre-trained `nltk` model. Create a function that replaces the tags with the WordNet tags using the `nltk.corpus.wordnet` module.
3. Create the sentiment function, which sums the positive sentiment ($\text{pos}$) and negative sentiment ($\text{neg}$) of each word (using the `nltk.corpus.sentiwordnet` module). A value should be taken into account only if it is above a certain threshold (e.g. $\text{threshold} = 0.05$). Then, return a single value representing the sentiment of the sentence, such that $\text{sentiment} < 0$ if it is negative and $\text{sentiment} > 0$ if it is positive.


In [6]:
class WordNetSentimentAnalyzer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.pipeline = [
            self.pos_tagging,
            self.convert_pos_tags,
            self.compute_sentiment
        ]

    def pos_tagging(self, tokens):
        return pos_tag(tokens)
    
    def convert_pos_tags(self, tagged_tokens):
        tag_map = {
            'J': wordnet.ADJ,
            'V': wordnet.VERB,
            'N': wordnet.NOUN,
            'R': wordnet.ADV
        }
        return [(word, tag_map.get(tag[0], wordnet.NOUN)) for word, tag in tagged_tokens]
    
    def compute_sentiment(self, tagged_tokens):
        sentiment = 0
        threshold = 0.05
        
        for word, tag in tagged_tokens:
            synsets = list(sentiwordnet.senti_synsets(word, tag))
            if synsets:
                synset = synsets[0]
                pos, neg = synset.pos_score(), synset.neg_score()
                if max(pos, neg) > threshold:
                    sentiment += pos - neg
        return sentiment

    def apply_pipeline(self, x):
        for transform in self.pipeline:
            x = transform(x)
        return x
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        return [self.apply_pipeline(x) for x in X]

<p style='color:red;font-weight:bold'>Exercise 2.3 :</p>

Create two functions that return the sentiment class to compare it to the labels.
* `get_sentiment_class_2` : For positive and negative only.
* `get_sentiment_class_3` : For positive, negative and neutral.

**Do not forget that it is going to be used in a `scikit-learn` `Pipeline`!**

In [7]:
def get_sentiment_class_2(sent_val):
    return 1 if sent_val > 0 else -1

def get_sentiment_class_3(sent_val):
    if sent_val > 0:
        return 1
    elif sent_val < 0:
        return -1
    return 0

<p style='color:red;font-weight:bold'>Exercise 2.4 :</p>

Create a `scikit-learn` `Pipeline` using the 3 previous *Transformers* that you created (using the 2 class sentiment).

Then, complete the function `get_sentiment_tweet` and test it with your own Tweet (**Do not change its signature !**).

In [12]:
from sklearn.pipeline import Pipeline

pipeline = Pipeline(steps=[
    ('preprocessor', NLTKPreprocessor()),
    ('sentiment_analyzer', WordNetSentimentAnalyzer())
])

def get_sentiment_tweet(tweet:str)-> int:
    sentiment = pipeline.transform([tweet])[0]
    return get_sentiment_class_2(sentiment)

<p style='color:red;font-weight:bold'>Exercise 2.5 :</p>

Compute the accuracy obtained on all tweets of the dataset. Then, do the same for the 3-label dataset.

In [19]:
from sklearn.metrics import accuracy_score

pipeline.fit(corona_tweets2['OriginalTweet'], corona_tweets2['SentimentLabel'])

y_true_2 = corona_tweets2['SentimentLabel']
y_pred_2 = [get_sentiment_tweet(tweet) for tweet in corona_tweets2['OriginalTweet']]
accuracy_2 = accuracy_score(y_true_2, y_pred_2)
print(f"Accuracy for 2-label classification: {accuracy_2}")

pipeline.fit(corona_tweets3['OriginalTweet'], corona_tweets3['SentimentLabel'])

y_true_3 = corona_tweets3['SentimentLabel']
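# Note: get_sentiment_tweet maps scores with get_sentiment_class_2, so the
# neutral label (0) is never predicted in this run.
y_pred_3 = [get_sentiment_tweet(tweet) for tweet in corona_tweets3['OriginalTweet']]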
accuracy_3 = accuracy_score(y_true_3, y_pred_3)
print(f"Accuracy for 3-label classification: {accuracy_3}")



Accuracy for 2-label classification: 0.6517772884554891




Accuracy for 3-label classification: 0.545550289626119


<p style='color:red;font-weight:bold'>Exercise 2.6 :</p>

Analyze the results of each pre-processing (`NLTKPreprocessor`) step to try to understand what you could do to improve it. For example, what could you add into this pipeline ? *(No implementation needed)*

The preprocessing pipeline can be improved by better handling emojis and hashtags, as they often carry sentiment (e.g., 😀 → "happy"). 
Removing URLs, mentions (`@user`), and refining stopwords with a domain-specific list would reduce noise. Enhancing POS tagging with a more advanced model (e.g., spaCy) would improve lemmatization. 
A spell checker like SymSpell could correct typos common in tweets. Additionally, considering negations (e.g., "not good" should be negative) would refine sentiment detection. 
These improvements would likely enhance classification accuracy.

### Exercise 3 - Movie recommendation

The objective of this exercise is to create a recommendation system of movies based on the user ratings. We will focus on the collaborative approach for movie recommendation using the provided dataset, which contains approximately 9'800 movies rated by 610 users from MovieLens.

More specifically, the `movies.csv` dataset, which describes the movies, includes three attributes:

* `movieId`: unique identifier of the movie
* `title`: the title of the movie (with the release year in parentheses)
* `genres`: the genres of the movie

The `ratings.csv` dataset, which contains user ratings for the movies, includes four attributes:

* `userId`: unique identifier of the user
* `movieId`: unique identifier of the movie
* `rating`: the user's rating for the corresponding movie
* `timestamp`: the timestamp of the rating

In [None]:
from sklearn.neighbors import NearestNeighbors
import pandas as pd


In [None]:
df_movies = pd.read_csv('data/part3_recommandationdesfilms/movies.csv')
df_ratings = pd.read_csv('data/part3_recommandationdesfilms/ratings.csv')

df_3 = pd.merge(df_movies, df_ratings, on='movieId')
df_3

<p style='color:red;font-weight:bold'>Exercise 3.1 :</p>

Use `df_3` to create a second `DataFrame` containing each user's rating for every movie, with one row per `userId`. If the user never rated a movie, the value is $0$.
For this step, use the [`pivot` function](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot.html).

In [None]:
user_item_matrix = ... # TODO 3.1
X = user_item_matrix.to_numpy()
X
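
To show the reshaping step on a small scale, the sketch below pivots a toy long-format table into a user-by-movie matrix. `toy_ratings` and `toy_matrix` are hypothetical names with invented values; only the column names follow `df_3`.

```python
import pandas as pd

# Hypothetical ratings in the same long format as df_3 (illustration only)
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 3],
    "movieId": [10, 20, 10, 30],
    "rating":  [4.0, 5.0, 3.5, 2.0],
})

# One row per user, one column per movie; unrated movies become 0
toy_matrix = toy_ratings.pivot(index="userId", columns="movieId", values="rating").fillna(0)
print(toy_matrix)
```

Note that `pivot` raises an error if the same (`userId`, `movieId`) pair appears more than once, so this sketch assumes each user rated a movie at most once.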

<p style='color:red;font-weight:bold'>Exercise 3.2 :</p>

First, separate the data using the `train_test_split` function with $20\%$ of the data in the test set.
Create a `knn` model using the [`NearestNeighbors`](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html#sklearn.neighbors.NearestNeighbors) model from `scikit-learn`. For now, use $N=80$ neighbours and the `cosine` metric, then train it.

**Explain in a few sentences how the `cosine` metric works, using our case as an example (include the formula).**

In [None]:
... = train_test_split(...)

knn = ...

...

*TODO 3.2*
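
As a small worked example of the metric, the sketch below computes $\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$ for rating vectors in plain Python. The three users and their ratings are invented for illustration. The cosine distance is taken here as $1 - \cos(u, v)$, which is how scikit-learn defines it.

```python
import math

def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (||u|| * ||v||)"""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical rating vectors over the same three movies (0 = not rated)
user_a = [5.0, 0.0, 3.0]
user_b = [4.0, 0.0, 0.0]
user_c = [0.0, 2.0, 0.0]

sim_ab = cosine_similarity(user_a, user_b)  # share one rated movie
sim_ac = cosine_similarity(user_a, user_c)  # no rated movie in common
dist_ab = 1.0 - sim_ab                      # cosine distance

print(sim_ab, sim_ac, dist_ab)
```

Users with no rated movie in common get a similarity of 0, and the measure depends on the direction of the rating vectors rather than on their length.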

<p style='color:red;font-weight:bold'>Exercise 3.3 :</p>

Create the `predict` function by using the previous trained model.

*Hint: Think about what the model returns and what those values represent. Then, use the closest indices to obtain information on the best-matching movies.*

In [None]:
def predict(X, user_item_matrix, knn):
    ...
    return ...

<p style='color:red;font-weight:bold'>Exercise 3.4 :</p>

Finally, complete the `prec_at_k` function, which measures the precision of our model. This metric calculates how many of the top $k$ movies recommended to a user are actually relevant. A movie is considered relevant if the user has given it a score of at least 4.

In [None]:
def prec_at_k(Y_pred, user_item_matrix, k):
    ...
    return ...
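
The metric itself can be sketched for a single user as follows. `precision_at_k` is a hypothetical helper with a simpler signature than the `prec_at_k` stub above, and the recommendations and ratings are invented for illustration.

```python
def precision_at_k(recommended_ids, user_ratings, k, threshold=4.0):
    """Share of the first k recommended movies the user rated >= threshold."""
    top_k = recommended_ids[:k]
    hits = sum(1 for movie_id in top_k if user_ratings.get(movie_id, 0.0) >= threshold)
    return hits / k

# Hypothetical ranked recommendations and one user's ratings (movieId -> rating)
recommended = [101, 102, 103, 104, 105]
ratings = {101: 5.0, 102: 3.0, 103: 4.0, 105: 4.5}

p_at_3 = precision_at_k(recommended, ratings, k=3)  # 2 relevant out of 3
p_at_5 = precision_at_k(recommended, ratings, k=5)  # 3 relevant out of 5

print(p_at_3, p_at_5)
```

In this sketch an unrated movie (here 104) counts as not relevant, which is one possible convention; averaging the per-user values would then give a score for the whole test set.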

<p style='color:red;font-weight:bold'>Exercise 3.5 :</p>

Now, use the function with three different values of $k$ ($5$, $15$, $25$). Then, try different numbers of neighbours $N$ and a different split ratio.

Report all your results below and explain the impact of these hyperparameters on the model.

In [None]:
# TODO 3.5

*TODO 3.5*

### Exercise 4 - Association rules in online shopping

In this fourth part, we will focus on a **Market Basket Analysis** problem. The dataset provided contains **online sales transactions** from an e-commerce site over one year. Our goal is to generate **association rules** based on these sales to identify which items are frequently bought together. You can refer to the source mentioned in the **README file** included with the data for more details.  

The dataset is structured as follows, with each row representing the details of a product sale:  
- `InvoiceNo`: invoice/sale identifier  
- `StockCode`: product identifier  
- `Description`: purchased product  
- `Quantity`: quantity sold  
- `InvoiceDate`: order/payment date  
- `UnitPrice`: price of the product  
- `CustomerID`: customer identifier  
- `Country`: customer’s country of residence

In [None]:
from sklearn.preprocessing import MultiLabelBinarizer
import pandas as pd

In [None]:
df_4 = pd.read_excel('data/part4_marketbasketanalysis/OnlineRetailDataset.xlsx', engine='openpyxl')
df_4

<p style='color:red;font-weight:bold'>Exercise 4.1 :</p>

Transform the given dataset into a one-hot encoded format for association rule mining. Group items by customer so that each row represents a unique customer with a list of purchased items. Then, convert this into a binary matrix where each column is an item, and values indicate whether a customer bought that item (1) or not (0). Use [`MultiLabelBinarizer`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html) to achieve this.

In [None]:
# TODO 4.1

<p style='color:red;font-weight:bold'>Exercise 4.2 :</p>

Finally, create the association rules based on the frequent patterns obtained with the `FP-growth` algorithm. Check the [`mlxtend`](https://rasbt.github.io/mlxtend/) documentation to complete this task. Then, show the association rules with a confidence of at least $0.9$.

In [None]:
# TODO 4.2
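
To make the confidence threshold concrete, the sketch below computes support and confidence by hand on a few baskets. It does not use `mlxtend`; the baskets and the helper names `support` and `confidence` are invented for illustration.

```python
# Hypothetical baskets: one set of items per customer (illustration only)
baskets = [
    {"mug", "teapot"},
    {"mug", "teapot", "candle"},
    {"mug", "candle"},
    {"teapot"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    return sum(1 for b in baskets if itemset <= b) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """support(antecedent and consequent) / support(antecedent)."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

sup_pair = support({"mug", "teapot"}, baskets)          # 2 of 4 baskets
conf_rule = confidence({"mug"}, {"teapot"}, baskets)    # 2 of the 3 mug baskets

print(sup_pair, conf_rule)
```

With these toy values, the rule mug → teapot has a confidence of about 0.67, so it would be filtered out by a minimum confidence of $0.9$.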

<p style='color:red;font-weight:bold'>Exercise 4.3 :</p>

Do you observe any changes in the results when you vary the parameters of the *FP-Growth* and *Association Rules* functions? Please comment on the chosen parameters and the results obtained.

In [None]:
# TODO 4.3

*TODO 4.3*

<p style='color:red;font-weight:bold'>Exercise 4.4 :</p>

Is it possible to use other column(s) from the initial data to generate interesting rules?

*TODO 4.4*

### Exercise 5 - Clustering of mobile apps

In this final part, we will perform clustering of applications from the Google Play Store. You can refer to the source mentioned in the **README file** included with the data for more details.

Among the attributes available in the `googleplaystore.xlsx` file, we will use the following:

* `Rating`: overall user rating of the application
* `Reviews`: number of user reviews
* `Size`: size of the application
* `Installs`: number of users who installed the application
* `Price`: price of the application



In [None]:
from sklearn.cluster import KMeans
import seaborn as sns
import pandas as pd
import numpy as np

In [None]:
df_5 = pd.read_excel('data/part5_clustering/googleplaystore.xlsx', engine='openpyxl')
df_5

<p style='color:red;font-weight:bold'>Exercise 5.1 :</p>

Select the numerical features of the dataset, then standardize them by removing the mean and scaling to unit variance.
Then, run a first training using the [`KMeans`](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) algorithm with $5$ clusters.

In [None]:
# TODO 5.1
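
Standardization can be sketched on a single feature as below, using only the standard library. `standardize` is a hypothetical helper and the review counts are invented; the population standard deviation is used on the assumption that it mirrors what scikit-learn's `StandardScaler` does per column.

```python
from statistics import mean, pstdev

def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(x - mu) / sigma for x in values]

# Hypothetical review counts for four apps (illustration only)
reviews = [10.0, 20.0, 30.0, 40.0]
z = standardize(reviews)

print(z)
```

After this step the feature has mean 0 and unit variance, so a large-valued column such as `Installs` no longer dominates the Euclidean distances that K-Means relies on.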

<p style='color:red;font-weight:bold'>Exercise 5.2 :</p>

Complete the function `average_cluster_distance` to compute a clustering performance metric. It should return a list where the first value is the global average distance from all samples to their assigned centroids, followed by the average distances per centroid. 

In [None]:
def average_cluster_distance(X, model):
    ...
    return ...

avg_distances = average_cluster_distance(...)
print(avg_distances)
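
The expected return format can be illustrated on toy data as follows. `toy_average_cluster_distance` is a hypothetical helper that takes explicit labels and centroids instead of a fitted model, and the points are invented for illustration.

```python
import math

def toy_average_cluster_distance(points, labels, centroids):
    """Return [global average distance, average distance for cluster 0, 1, ...]."""
    distances = [math.dist(p, centroids[c]) for p, c in zip(points, labels)]
    result = [sum(distances) / len(distances)]
    for c in range(len(centroids)):
        in_cluster = [d for d, lab in zip(distances, labels) if lab == c]
        result.append(sum(in_cluster) / len(in_cluster))
    return result

# Hypothetical 2-D points, cluster assignments, and centroids
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 4.0)]
labels = [0, 0, 1, 1]
centroids = [(0.0, 1.0), (10.0, 2.0)]

avg = toy_average_cluster_distance(points, labels, centroids)
print(avg)
```

With a fitted `KMeans` model, the labels and centroids would come from its `labels_` and `cluster_centers_` attributes. This sketch assumes every cluster has at least one point; an empty cluster would cause a division by zero.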

<p style='color:red;font-weight:bold'>Exercise 5.3 :</p>

Try using different values for the maximum number of iterations of the *K-Means Algorithm*. Then, report your results and describe them.

In [None]:
# TODO 5.3

*TODO 5.3*

<p style='color:red;font-weight:bold'>Exercise 5.4 :</p>

Analyze how the category distributions change when clustering with $\text{n\_cluster} \in \{2, 3, 4, 5\}$ and compare the results. Observe how categories are grouped within each cluster and note any significant shifts as the number of clusters increases. Describe the distributions obtained and evaluate the average distances for each $\text{n\_cluster}$.

In [None]:
# TODO 5.4

*TODO 5.4*