# Phase 4 Code Challenge Review

## Overview

- Pipelines and gridsearching
- Ensemble Methods
- Natural Language Processing
- Clustering

In [109]:
# Basic Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline

In [110]:
# from src.call import call_on_students

# 1) Pipelines and Gridsearching

1. What are the benefits of using a pipeline?

Using a pipeline in data science and machine learning workflows offers several benefits:

1. Streamlined Workflow: Pipelines provide a structured and organized approach to data processing and model building. They allow you to define and automate a sequence of data transformation steps, such as data preprocessing, feature engineering, model training, and prediction, in a coherent manner. This streamlines the workflow and makes it easier to reproduce and maintain.

2. Code Readability and Maintainability: Pipelines improve the readability and maintainability of your code. By encapsulating the steps within a pipeline, it becomes easier to understand the flow of data and operations performed. This modular structure makes it simpler to modify or update specific steps without impacting the entire workflow.

3. Data Leakage Prevention: Pipelines help prevent data leakage, which occurs when information from the test set inadvertently influences the model during training or preprocessing. Because a pipeline is fit as a single unit, its preprocessing steps (for example, a scaler's mean and standard deviation) are learned only from the data passed to `fit`. During cross-validation, each transformer is therefore refit on the training folds alone and merely applied to the held-out fold, which produces more reliable model evaluations.

4. Hyperparameter Tuning: Pipelines can be combined with techniques like grid search or random search for hyperparameter tuning. This allows you to systematically explore different combinations of hyperparameters for your models within the pipeline. By automating this process, you can find the best set of hyperparameters that maximize model performance without manual intervention.

5. Model Deployment: Pipelines simplify the deployment of your models. Once you have defined a pipeline for your data processing and modeling steps, you can save and deploy it as a single object, so new data passes through exactly the same preprocessing before prediction. This is particularly beneficial when deploying models in production environments.

6. Reproducibility: Pipelines promote reproducibility by capturing the entire data processing and modeling workflow in a single entity. This ensures that the same transformations and modeling steps are consistently applied to new data or in future runs. By storing the pipeline configuration or using pipeline serialization, you can recreate the exact data processing and modeling pipeline, making it easier to reproduce results and share them with others.

Overall, pipelines enhance the efficiency, reliability, and maintainability of data science workflows by providing a structured framework for data processing, modeling, and deployment. They contribute to better code organization, prevent data leakage, enable hyperparameter tuning, and support reproducibility, ultimately leading to more robust and scalable data science projects.
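As a rough illustration of the leakage point, the sketch below cross-validates a scaler-plus-model pipeline. Because the scaler lives inside the pipeline, `cross_val_score` refits it on each set of training folds rather than on the full dataset. The variable names (`pipe`, `scores`) and the choice of 5 folds are assumptions made for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# The scaler is a pipeline step, so it is fit only on the training folds
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("logreg", LogisticRegression()),
])

# One accuracy score per fold; the held-out fold never influences the scaler
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```

Scaling `X` once up front and then cross-validating only the model would let every held-out fold contribute to the scaler's statistics, which is the kind of leak the pipeline avoids.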

In [111]:
# call_on_students(1)

2. What does a gridsearch achieve?

Grid search is a hyperparameter tuning technique used in machine learning to find the optimal combination of hyperparameter values for a given model. Hyperparameters are parameters that are set before the learning process begins and control the behavior of the model.

The goal of grid search is to systematically search through a predefined grid of hyperparameter values and evaluate the model's performance using each combination of values. It exhaustively tries all possible combinations within the specified grid and identifies the combination that produces the best performance according to a chosen evaluation metric, such as accuracy, precision, recall, or F1-score.

Here's how grid search works:

1. Define the Hyperparameter Grid: Specify a grid of hyperparameter values to explore for each hyperparameter of the model. For example, if you have two hyperparameters, A and B, with possible values [1, 2, 3] and [0.1, 0.2, 0.3], respectively, the grid search will create a Cartesian product of these values, resulting in nine combinations to evaluate.

2. Model Training and Evaluation: For each combination of hyperparameter values, train the model using the training dataset and evaluate its performance using a validation dataset or through cross-validation. The evaluation metric is calculated to assess how well the model performs with the given hyperparameter values.

3. Best Hyperparameters Selection: Compare the performance of each model based on the evaluation metric and select the combination of hyperparameters that achieves the best performance. This is typically the combination that maximizes the evaluation metric, although it can also be based on minimizing a loss function, error, or other relevant metrics.

By exhaustively searching through the grid of hyperparameters, grid search helps identify the hyperparameter values that lead to the best model performance on the validation dataset. It automates the process of finding the optimal hyperparameter values, saving time and effort compared to manually testing different combinations.

Grid search is implemented in various machine learning libraries, such as scikit-learn in Python, making it straightforward to apply in practice. It is a valuable technique for improving the performance of machine learning models and finding the best hyperparameter configuration, resulting in more accurate and reliable models.
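To make the "Cartesian product" step concrete, here is a small standard-library sketch that enumerates the grid from the A/B example above. The hyperparameter names `A` and `B` are the hypothetical ones from that example, not real model parameters.

```python
from itertools import product

# Hypothetical hyperparameters A and B from the example above
grid = {"A": [1, 2, 3], "B": [0.1, 0.2, 0.3]}

# Cartesian product of all candidate values, one dict per combination
names = list(grid)
combinations = [dict(zip(names, values)) for values in product(*grid.values())]

print(len(combinations))   # 9
print(combinations[0])     # {'A': 1, 'B': 0.1}
```

With 5-fold cross-validation, each of these nine combinations would be fit five times, so the cost of a grid search grows quickly as hyperparameters or candidate values are added.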

In [112]:
# call_on_students(1)

3. Set up a pipeline with a scaler and a logistic regression model on the breast cancer dataset that predicts the tumor's class. (In scikit-learn's `load_breast_cancer`, target = 1 is benign and target = 0 is malignant.) Don't worry for now about a train-test split.

**Answer**:

In [113]:
from sklearn.datasets import load_breast_cancer

In [114]:
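# Your code here
# Chain the scaler and the classifier so they are fit together
pipe1 = Pipeline([('scaler', StandardScaler()),
                  ('logreg', LogisticRegression()),
                  ])

data = load_breast_cancer()

X = data.data
y = data.target

pipe1.fit(X, y)
# Accuracy on the same data used for fitting, so this is an optimistic estimate
pipe1.score(X, y)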

0.9876977152899824

4. Split the data into train and test and then gridsearch over pipelines like the one you just built to find the best-performing model. Try C (inverse regularization) values of 10, 1, and 0.1. Try out the best estimator on the test set.

**Answer**:

In [115]:
# Your code here
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

param_grid = {
    'logreg__C': [10, 1, 0.1]
}

grid_search = GridSearchCV(estimator=pipe1,
                           param_grid=param_grid,
                           cv=5)



In [116]:
grid_search.fit(X_train, y_train)

In [117]:
# import accuracy score
from sklearn.metrics import accuracy_score

# Access the best parameters and best score
print("Best Parameters:", grid_search.best_params_)
print("Best Score:", grid_search.best_score_)

# Apply the best estimator to the test set
best_estimator = grid_search.best_estimator_
y_pred = best_estimator.predict(X_test)

# Evaluate the performance on the test set
accuracy = accuracy_score(y_test, y_pred)
print("Test Accuracy:", accuracy)

Best Parameters: {'logreg__C': 10}
Best Score: 0.9764705882352942
Test Accuracy: 0.972027972027972


# 2) Ensemble Methods

1. What sorts of ensembling methods have we looked at?

- Bagging
- Random forests
- Boosting

Some ensemble methods include:

1. Random Forest: It combines multiple decision trees by using random sampling of training data and random feature selection to improve predictive accuracy and reduce overfitting.

2. Gradient Boosting: It builds an ensemble of weak learners (usually decision trees) in a sequential manner. Each subsequent model corrects the errors made by the previous model, gradually improving the overall prediction.

3. AdaBoost: It trains a sequence of weak learners in which each subsequent model focuses on the instances that were misclassified by previous models. The models are combined by giving more weight to the more accurate ones.

4. Bagging: It involves training multiple instances of the same base model using bootstrap sampling and averaging the predictions to reduce variance and improve generalization.

5. Stacking: It combines the predictions of multiple different models by training a meta-model (also known as a blender or meta-learner) on the predictions of the base models. The meta-model learns to make the final prediction based on the outputs of the base models.

6. Voting: It combines predictions from multiple models by taking the majority vote (for classification problems) or averaging the predictions (for regression problems) to make the final prediction.

7. Extra Trees: Similar to Random Forest, it combines multiple decision trees. However, Extra Trees further randomizes the tree construction process by selecting random splits at each node, aiming for even more diversity.

These ensemble methods leverage the wisdom of multiple models to improve predictive performance, increase robustness, and handle complex patterns in the data.

In [118]:
# call_on_students(1)

2. What is random about a random forest?

A Random Forest is called "random" because it randomly selects subsets of the training data and features to build a collection of decision trees, promoting diversity and improving generalization.

In a Random Forest, the "random" refers to two main aspects of the algorithm:

1. Random Sampling of Training Data:
   - Random Forest uses a technique called "bootstrap aggregating" or "bagging," which draws random samples of the training data with replacement.
   - Each decision tree in the forest is trained on its own bootstrap sample, so every tree sees a slightly different version of the training data. This introduces diversity among the trees.

2. Random Feature Selection:
   - In addition to sampling the training data, Random Forest also performs random feature selection.
   - At each split of a decision tree, a random subset of features is considered as candidates for splitting. This helps to reduce the correlation among the trees and makes each tree focus on different subsets of features.
   - For classification, the number of features considered at each split is typically the square root of the total number of features.

By incorporating these two randomization techniques, Random Forest aims to reduce overfitting and improve the model's ability to generalize well to unseen data. The random sampling of training data and random feature selection help to decorrelate the trees and capture different aspects of the data, resulting in an ensemble of diverse decision trees that work together to make predictions.
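The two sources of randomness can be sketched with the standard library alone. This is only an illustration of the sampling ideas, not how scikit-learn implements them; the toy sizes and the seed are assumptions made for the example.

```python
import math
import random

rng = random.Random(42)          # seeded only so the sketch is repeatable

n_samples, n_features = 10, 30   # toy sizes; the breast cancer data has 30 features
row_ids = list(range(n_samples))

# 1) Bootstrap sample: draw n_samples row indices WITH replacement
bootstrap_rows = rng.choices(row_ids, k=n_samples)

# Rows that were never drawn are "out-of-bag" for this tree
out_of_bag = sorted(set(row_ids) - set(bootstrap_rows))

# 2) Feature subsampling: at a split, consider only about sqrt(n_features) features
n_candidates = int(math.sqrt(n_features))
candidate_features = rng.sample(range(n_features), n_candidates)

print(bootstrap_rows)
print(out_of_bag)
print(sorted(candidate_features))
```

Each tree would repeat step 1 once and step 2 at every split, which is why no two trees in the forest end up identical.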

In [119]:
# call_on_students(1)

3. What hyperparameters of a random forest might it be useful to tune? How so?

Important hyperparameters to tune in a Random Forest are `n_estimators` (number of trees), `max_depth` (maximum depth of trees), `min_samples_split` and `min_samples_leaf` (minimum samples for splitting and leaf formation), `max_features` (number of features considered), and `bootstrap` (use of bootstrap sampling). Tuning these parameters helps balance model complexity and generalization, prevent overfitting, and improve performance.

When tuning a Random Forest model, some important hyperparameters to consider are:

1. `n_estimators`: It represents the number of decision trees in the forest. Increasing the number of estimators generally improves the model's performance, but it also increases the computational cost.

2. `max_depth`: It determines the maximum depth allowed for each decision tree in the forest. Deeper trees can capture complex relationships in the data but may overfit. Shallow trees can reduce overfitting but may result in underfitting. Tuning this parameter helps balance the trade-off between model complexity and generalization.

3. `min_samples_split` and `min_samples_leaf`: These parameters control the minimum number of samples required to split an internal node or form a leaf node, respectively. Increasing these values can prevent overfitting by enforcing a minimum number of samples needed for a split or a leaf.

4. `max_features`: It determines the number of features to consider when looking for the best split at each tree node. By limiting the number of features, you can reduce the correlation among the trees and introduce more randomness into the model.

5. `bootstrap`: It specifies whether bootstrap sampling should be used when building trees. Setting this parameter to `True` (default) enables bootstrap sampling. Setting it to `False` trains every tree on the whole training set, so the only remaining randomness comes from feature selection. This tends to make the trees more similar to one another, which weakens the variance reduction that averaging provides.

6. `random_state`: It is the random seed used for random number generation. Strictly speaking, it is not a hyperparameter to tune for performance, but fixing it ensures reproducibility of the results.

By tuning these hyperparameters, you can find the optimal configuration for your Random Forest model. It helps you strike a balance between overfitting and underfitting, improve model performance, and enhance generalization to unseen data. Hyperparameter tuning can be performed using techniques like grid search, random search, or Bayesian optimization to find the best combination of values for these parameters.

In [120]:
# call_on_students(1)

4. Build a random forest model on the breast cancer dataset that predicts the tumor's class (in scikit-learn's `load_breast_cancer`, target = 1 is benign and target = 0 is malignant). Make sure you do a train-test split!

**Answer**:

In [121]:
# Your code here

pipe2 = Pipeline([('scaler', StandardScaler()),
                  ('rf', RandomForestClassifier()),
                  ])

data = load_breast_cancer()

X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

param_grid = {
    'rf__n_estimators': [100, 200, 300],  # Number of trees in the forest
    'rf__max_depth': [None, 5, 10],  # Maximum depth of trees
    # Add more hyperparameters to tune as needed
}

grid_search = GridSearchCV(estimator=pipe2, 
                           param_grid=param_grid, 
                           cv=5)

In [122]:
# Fit the GridSearchCV object on the training data
grid_search.fit(X_train, y_train)

print("Best Parameters:", grid_search.best_params_)
print("Best Score:", grid_search.best_score_)

y_pred = grid_search.predict(X_test)

Best Parameters: {'rf__max_depth': 5, 'rf__n_estimators': 100}
Best Score: 0.9624623803009577
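
The cell above computes `y_pred` but stops short of scoring it. A minimal, self-contained sketch of that final step might look like the following. It assumes the grid search winner printed above (`max_depth=5`, `n_estimators=100`) and adds `random_state=42` to the forest purely for repeatability, so its numbers will not necessarily match a run of the original cells.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Assumed settings: the grid search winner above, plus a seed for repeatability
rf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
rf.fit(X_train, y_train)

# Score the held-out test set, which played no part in fitting or tuning
y_pred = rf.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
print("Test Accuracy:", test_accuracy)
print(confusion_matrix(y_test, y_pred))  # rows: true class, columns: predicted class
```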


# 3) Natural Language Processing

## NLP Concepts

### Some Example Text

In [123]:
# Each sentence is a document
sentence_one = "Harry Potter is the best young adult book about wizards"
sentence_two = "Um, EXCUSE ME! Ever heard of Earth Sea?"
sentence_three = "I only like to read non-fiction.  It makes me a better person."

# The corpus is composed of all of the documents
corpus = [sentence_one, sentence_two, sentence_three]

### 1: NLP Pre-processing

List at least three steps you can take to turn raw text like this into something that would be semantically valuable (aka ready to turn into numbers):

In [124]:
# call_on_students(1)

To turn raw text into something semantically valuable for numerical analysis, you can: clean and preprocess the text, normalize and lemmatize the words, and represent the text as numerical features using techniques like bag-of-words or TF-IDF.

To turn raw text into something semantically valuable and ready for numerical representation, you can follow these three steps:

1. Text Cleaning and Preprocessing:
   - Remove any unnecessary characters, punctuation, or special symbols that do not contribute to the overall meaning of the text.
   - Convert the text to lowercase to ensure consistency and avoid treating the same words as different based on their case.
   - Tokenize the text by splitting it into individual words or tokens. This step separates the text into its fundamental units for further analysis.
   - Remove stop words (commonly occurring words like "is," "the," "and") as they often do not carry much semantic meaning.

2. Text Normalization and Lemmatization:
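   - Normalize the words by applying stemming or lemmatization techniques. Stemming strips suffixes with simple rules, so the result may not be a real word (for example, "studies" can become "studi"), while lemmatization uses vocabulary and grammar to return the dictionary form (for example, "studies" becomes "study").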
   - This step helps to consolidate words with the same meaning, reducing the vocabulary size and improving the accuracy of subsequent analyses.

3. Feature Extraction and Representation:
   - Convert the preprocessed text into numerical representations that machine learning algorithms can understand.
   - Use techniques like bag-of-words (BoW) or term frequency-inverse document frequency (TF-IDF) to represent the text as a numerical matrix.
   - BoW represents each document as a vector of word frequencies, while TF-IDF takes into account the importance of words in the context of the entire corpus.
   - Additionally, you can explore more advanced techniques like word embeddings (e.g., Word2Vec or GloVe) that capture semantic relationships between words.

By following these steps, you can transform raw text into a structured and meaningful representation that captures the essence of the original text. This enables you to perform various natural language processing (NLP) tasks such as sentiment analysis, text classification, or information retrieval.

#### Answer:

- Lowercase (standardize case)
- Remove stopwords (really common words that likely have no semantic value)
- Stem or lemmatize to remove prefixes/suffixes/grammar bits
- Remove punctuation
- Tokenize
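
As a rough sketch of these steps using only the standard library, the function below lowercases, strips punctuation, tokenizes, and removes stop words. The stop word list is a tiny hand-picked set for illustration, and the function name `preprocess` is made up for this example. Stemming and lemmatization are left out because they usually rely on a library such as NLTK or spaCy.

```python
import re

# Tiny hand-picked stop word list, for illustration only
STOP_WORDS = {"a", "about", "i", "is", "it", "me", "of", "the", "to"}

def preprocess(doc):
    """Lowercase, strip punctuation, tokenize, and drop stop words."""
    doc = doc.lower()                      # standardize case
    doc = re.sub(r"[^a-z\s-]", " ", doc)   # keep only letters, whitespace, hyphens
    tokens = doc.split()                   # tokenize on whitespace
    return [t for t in tokens if t not in STOP_WORDS]

sentence_two = "Um, EXCUSE ME! Ever heard of Earth Sea?"
print(preprocess(sentence_two))
# ['um', 'excuse', 'ever', 'heard', 'earth', 'sea']
```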

### 2: Describe what vectorized text would look like as a dataframe.

If you vectorize the above corpus, what would the rows and columns be in the resulting dataframe (aka document term matrix)?

The resulting dataframe (document-term matrix) would have rows representing each document and columns representing unique words from the corpus, with values indicating word frequencies in each document.

In [125]:
# call_on_students(1)

Vectorized text, when represented as a dataframe, typically has each document or text sample represented as a row, and each feature or word represented as a column. The values in the dataframe correspond to the frequency, occurrence, or weight of each word in the respective document.

Here's an example to illustrate the structure of a vectorized text dataframe:

```plaintext
| Document | word1 | word2 | word3 | ... | wordN |
|----------|-------|-------|-------|-----|-------|
|    1     |   2   |   0   |   1   | ... |   3   |
|    2     |   0   |   1   |   0   | ... |   2   |
|    3     |   1   |   0   |   2   | ... |   0   |
```

In this example, we have three documents represented as rows, and each column represents a specific word or feature in the text. The values in the cells indicate the frequency or weight of each word in the corresponding document.

It's worth noting that vectorized text dataframes can have different representations based on the specific encoding technique used, such as bag-of-words, TF-IDF, or word embeddings. The example above represents the bag-of-words approach, where each cell represents the frequency of a word in the document.

This dataframe representation allows for easy integration with machine learning algorithms since the textual data is transformed into a structured numerical format that algorithms can process and analyze.

If you vectorize the given corpus using the bag-of-words approach, the resulting document-term matrix (dataframe) would have the rows representing each document in the corpus and the columns representing each unique word (term) found in the corpus.

In this specific case, the resulting document-term matrix (dataframe) would look as follows:

```plaintext
| Document | about | adult | best | better | book | earth | ever | excuse | harry | heard | is | it | like | makes | me | non-fiction | of | only | person | potter | read | sea | the | to | um | wizards | young |
|----------|-------|-------|------|--------|------|-------|------|--------|-------|-------|----|----|------|-------|----|-------------|----|------|--------|--------|------|-----|-----|----|----|---------|-------|
| 1        | 1     | 1     | 1    | 0      | 1    | 0     | 0    | 0      | 1     | 0     | 1  | 0  | 0    | 0     | 0  | 0           | 0  | 0    | 0      | 1      | 0    | 0   | 1   | 0  | 0  | 1       | 1     |
| 2        | 0     | 0     | 0    | 0      | 0    | 1     | 1    | 1      | 0     | 1     | 0  | 0  | 0    | 0     | 1  | 0           | 1  | 0    | 0      | 0      | 0    | 1   | 0   | 0  | 1  | 0       | 0     |
| 3        | 0     | 0     | 0    | 1      | 0    | 0     | 0    | 0      | 0     | 0     | 0  | 1  | 1    | 1     | 1  | 1           | 0  | 1    | 1      | 0      | 1    | 0   | 0   | 1  | 0  | 0       | 0     |
```

Each row represents a document from the corpus (sentence_one, sentence_two, sentence_three), and each column represents a unique word (term) found in the corpus. The values in the cells indicate the frequency of each word in the respective document.

Note that this example assumes the text has been lowercased, stripped of punctuation, and tokenized, with "non-fiction" kept as a single token and the one-letter words "I" and "a" dropped. The actual resulting dataframe would depend on the preprocessing steps applied to the raw text before vectorization.

#### Answer:

- Columns: every word/token in the dataset/corpus
- Rows: the documents you're vectorizing
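
One way to see the rows and columns concretely is to build the matrix by hand with the standard library. The tokenization rules here (lowercase, hyphenated words kept whole, single-character tokens dropped) are assumptions chosen for this sketch; scikit-learn's `CountVectorizer` with default settings would, for instance, split "non-fiction" into "non" and "fiction".

```python
import re
from collections import Counter

corpus = [
    "Harry Potter is the best young adult book about wizards",
    "Um, EXCUSE ME! Ever heard of Earth Sea?",
    "I only like to read non-fiction.  It makes me a better person.",
]

def tokenize(doc):
    # Lowercase, keep hyphenated words whole, drop single-character tokens
    tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", doc.lower())
    return [t for t in tokens if len(t) > 1]

# One Counter of term frequencies per document
counts = [Counter(tokenize(doc)) for doc in corpus]

# Columns: every unique term in the corpus, in alphabetical order
vocab = sorted(set().union(*counts))

# Rows: one list of counts per document (a Counter returns 0 for absent terms)
matrix = [[c[term] for term in vocab] for c in counts]

print(len(matrix), len(vocab))   # 3 27
```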


### 3: What does TF-IDF do?

Also, what does TF-IDF stand for?

TF-IDF (Term Frequency-Inverse Document Frequency) is a numerical statistic that measures the importance of a term in a document by considering its frequency within the document and rarity across a corpus. It helps identify relevant and distinctive terms in natural language processing tasks.

TF-IDF stands for "Term Frequency-Inverse Document Frequency." It is a numerical statistic used in natural language processing and information retrieval to measure the importance of a term in a document relative to a corpus of documents.

TF-IDF combines two components:

1. Term Frequency (TF): Measures the frequency of a term in a document. It indicates how often a term appears in a specific document. A higher term frequency suggests that the term is more relevant to that document.

2. Inverse Document Frequency (IDF): Measures the rarity or uniqueness of a term across the corpus. It quantifies the importance of a term by considering how often it appears in the entire corpus. Rare terms that appear in few documents receive a higher IDF score, indicating their potential significance.

The TF-IDF score of a term in a document is calculated by multiplying the term frequency (TF) and the inverse document frequency (IDF). The resulting score emphasizes terms that are frequent within a document but rare across the entire corpus.

The purpose of TF-IDF is to highlight terms that are both relevant to a specific document and distinctive in the overall context of the corpus. It helps in identifying important terms and distinguishing them from commonly occurring terms. TF-IDF is widely used in various natural language processing tasks, such as text classification, information retrieval, and document similarity analysis.
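
The multiplication described above can be sketched in a few lines. This uses the plain textbook form, term frequency times the log of (number of documents / number of documents containing the term). scikit-learn's `TfidfVectorizer` applies a smoothed IDF and normalizes each row by default, so its numbers will differ. The pre-tokenized `docs` list is an assumption made for this example, and `idf` as written would divide by zero for a term that appears in no document.

```python
import math

# Pre-tokenized versions of the three example sentences (assumed preprocessing)
docs = [
    ["harry", "potter", "is", "the", "best", "young", "adult", "book", "about", "wizards"],
    ["um", "excuse", "me", "ever", "heard", "of", "earth", "sea"],
    ["only", "like", "to", "read", "non-fiction", "it", "makes", "me", "better", "person"],
]

def tf(term, doc):
    # Share of the document's tokens that are this term
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Terms found in fewer documents get a larger weight
    n_containing = sum(term in doc for doc in docs)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "me" appears in two documents, "sea" in only one
print(tf_idf("me", docs[1], docs))
print(tf_idf("sea", docs[1], docs))
```

Both terms occur once in the second sentence, yet "sea" scores higher because it is rarer across the corpus.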

In [126]:
# call_on_students(1)

#### Answer:

- TF-IDF: term frequency inverse document frequency
- TF-IDF is a vectorizer that takes into account the rarity of the words


## NLP in Code

### Set Up

In [127]:
# New section, new data
policies = pd.read_csv('data/2020_policies_feb_24.csv')

def warren_not_warren(label):
    
    '''Make label a binary between Elizabeth Warren
    policies and policies from all other candidates'''
    
    if label =='warren':
        return 1
    else:
        return 0
    
policies['candidate'] = policies['candidate'].apply(warren_not_warren)

The dataframe loaded above consists of policies of 2020 Democratic presidential hopefuls. The `policy` column holds text describing the policies themselves.  The `candidate` column indicates whether it was or was not an Elizabeth Warren policy.

In [128]:
policies.head()

| Unnamed: 0.1 | Unnamed: 0 | name | policy | candidate |
|---|---|---|---|---|
| 0 | 0 | 100% Clean Energy for America | As published on Medium on September 3rd, 2019:... | 1 |
| 1 | 1 | A Comprehensive Agenda to Boost America’s Smal... | Small businesses are the heart of our economy.... | 1 |
| 2 | 2 | A Fair and Welcoming Immigration System | As published on Medium on July 11th, 2019:\r\n... | 1 |
| 3 | 3 | A Fair Workweek for America’s Part-Time Workers | Working families all across the country are ge... | 1 |
| 4 | 4 | A Great Public School Education for Every Student | I attended public school growing up in Oklahom... | 1 |


The documents for this activity are in the `policy` column, and the target is `candidate`.

In [129]:
policies.shape

(189, 4)

### 4: Import the Relevant Class, Then Instantiate and Fit a Count Vectorizer Object

In [130]:
# call_on_students(1)

In [131]:
# First! Train-test split the dataset
from sklearn.model_selection import train_test_split

# Code here to train test split
X_train, X_test, y_train, y_test = train_test_split(policies['policy'], policies['candidate'])

In [132]:
# Import the relevant vectorizer
from sklearn.feature_extraction.text import CountVectorizer

In [133]:
# Instantiate it
vectorizer = CountVectorizer()

In [134]:
# Fit it
vectorizer.fit(X_train)

### 5: Vectorize Your Text, Then Model

In [135]:
# call_on_students(1)

In [136]:
# Code here to transform train and test sets with the vectorizer
X_tr_vec = vectorizer.transform(X_train)
X_te_vec = vectorizer.transform(X_test)

In [137]:
# Importing the classifier...
from sklearn.ensemble import RandomForestClassifier

# Code here to instantiate and fit a Random Forest model
rfc = RandomForestClassifier()
rfc.fit(X_tr_vec, y_train)

In [138]:
# Code here to evaluate your model on the test set
rfc.score(X_te_vec, y_test)

0.8541666666666666

# 4) Clustering

## Clustering Concepts

### 1: Describe how the K-Means algorithm updates its cluster centers after initialization.

In [139]:
# call_on_students(1)

#### Answer:

- You set the number of cluster centers (K) - algorithm randomly starts with that number of cluster centers (in random spots!)
- The algorithm calculates the distance between the centers and each observation and assigns the observation to the closest cluster center to create the first iteration of clusters
- The algorithm then takes all the observations assigned to each cluster, and moves that cluster center to be at the exact actual center (mean) of the newly created cluster
- Repeat! Until the cluster centers stop moving (or tolerance is met - some parameters in the implementation)
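
The assign-then-update loop described above can be sketched with the standard library. The six toy points and the fixed starting centers are assumptions chosen so the run is repeatable; K-Means would normally pick its starting centers at random.

```python
import math

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centers = [points[0], points[3]]  # fixed start for repeatability (normally random)

def assign(points, centers):
    # Label each point with the index of its nearest center
    return [min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points]

def update(points, labels, centers):
    # Move each center to the mean of the points assigned to it
    new_centers = []
    for k, old in enumerate(centers):
        members = [p for p, lab in zip(points, labels) if lab == k]
        if not members:               # leave an empty cluster's center in place
            new_centers.append(old)
            continue
        new_centers.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
    return new_centers

for _ in range(100):                  # cap on iterations, like max_iter
    labels = assign(points, centers)
    new_centers = update(points, labels, centers)
    if new_centers == centers:        # centers stopped moving: converged
        break
    centers = new_centers

print(labels)    # [0, 0, 0, 1, 1, 1]
print(centers)
```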

### 2: What is inertia, and how does K-Means use inertia to determine the best estimator?

Please also describe the method you can use to evaluate clustering using inertia.

Documentation, for reference: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

In [140]:
# call_on_students(1)

#### Answer:

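- Inertia is the sum of squared distances between each point and its assigned cluster center - the idea is that better clusters are more tightly concentrated
- KMeans tries to minimize inertia when choosing cluster centers; scikit-learn runs the algorithm from several different starting points (`n_init`) and keeps the run with the lowest inertia
- Method to evaluate - elbow plot! Plot inertia against K and look for the bend where adding more clusters stops reducing inertia by much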
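
A minimal sketch of gathering the numbers behind an elbow plot is shown below. It assumes the iris data (scaled) as a stand-in dataset, and the range of K values, `n_init=10`, and `random_state=42` are choices made for this example only.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# Scale first so no single feature dominates the distance calculation
X_scaled = StandardScaler().fit_transform(load_iris().data)

# Fit one KMeans model per candidate K and record its inertia
inertias = {}
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled)
    inertias[k] = km.inertia_

for k, value in inertias.items():
    print(k, round(value, 1))
```

Plotting the keys of `inertias` against its values (for example with `plt.plot`) gives the elbow plot. Inertia keeps shrinking as K grows, so the goal is to find the bend rather than the minimum.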

### 3: What other metric do we have to score the clusters which are formed?

Describe the difference between it and inertia.

In [141]:
# call_on_students(1)

#### Answer:

- Silhouette score
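- Difference between silhouette score and inertia: silhouette score compares how close each point is to its own cluster with how far it is from the nearest other cluster (it ranges from -1 to 1, higher is better), while inertia just looks within each cluster and tends to keep decreasing as clusters are added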
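
For a feel of how the silhouette score is computed, here is a hand-rolled sketch on four one-dimensional points. For each point, `a` is the mean distance to the other points in its own cluster and `b` is the mean distance to the points in the nearest other cluster; the point's score is `(b - a) / max(a, b)`. The toy data and function names are assumptions, and the sketch assumes every cluster has at least two points.

```python
points = [0.0, 1.0, 10.0, 11.0]
labels = [0, 0, 1, 1]

def mean_distance(x, others):
    return sum(abs(x - o) for o in others) / len(others)

def silhouette(i):
    # a: cohesion, mean distance to the other members of the point's own cluster
    own = [p for j, p in enumerate(points) if labels[j] == labels[i] and j != i]
    a = mean_distance(points[i], own)
    # b: separation, mean distance to the nearest other cluster
    b = min(
        mean_distance(points[i], [p for p, lab in zip(points, labels) if lab == other])
        for other in set(labels) if other != labels[i]
    )
    return (b - a) / max(a, b)

scores = [silhouette(i) for i in range(len(points))]
mean_score = sum(scores) / len(scores)
print(round(mean_score, 3))   # 0.9
```

The two clusters are tight and far apart, so the mean score lands close to 1. In practice `sklearn.metrics.silhouette_score` does this calculation for you.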

## Clustering in Code with Hierarchical Agglomerative Clustering

After the above conceptual review of KMeans, let's practice coding with agglomerative clustering.

### Set Up

In [142]:
# New dataset for this section!
from sklearn.datasets import load_iris

data = load_iris()
X = pd.DataFrame(data['data'])

### 4: Prepare our Data for Clustering

What steps do we need to take to preprocess our data effectively?

- scale

In [143]:
# call_on_students(1)

In [144]:
# Code to preprocess the data
k_scaler = StandardScaler()

# Name the processed data X_processed
X_processed = k_scaler.fit_transform(X)

### 5: Import the Relevant Class, Then Instantiate and Fit a Hierarchical Agglomerative Clustering Object

Let's use `n_clusters = 2` to start (default)

In [145]:
# call_on_students(1)

In [146]:
# Import the relevant clustering algorithm
from sklearn.cluster import AgglomerativeClustering

# Instantiate
cluster = AgglomerativeClustering(n_clusters=2)
# Fit the object
cluster.fit(X_processed)

# Calculate a silhouette score
from sklearn.metrics import silhouette_score
silhouette_score(X_processed, cluster.labels_)

0.5770346019475988

### 6: Write a Function to Test Different Options for `n_clusters`

The function should take in the number for `n_clusters` and the data to cluster, fit a new clustering model using that parameter to the data, print the silhouette score, then return the labels attribute from the fit clustering model.

In [147]:
# call_on_students(1)

In [148]:
def test_n_for_clustering(n, data):
    """ 
    Tests different numbers for the hyperparameter n_clusters
    Prints the silhouette score for that clustering model
    Returns the labels that are output from the clustering model

    Parameters: 
    -----------
    n: int
        number of clusters to use in the agglomerative clustering model
    data: Pandas DataFrame or array-like object
        Data to cluster

    Returns: 
    --------
    labels: array-like object
        Labels attribute from the clustering model
    """
    # Create the new clustering model
    cluster = AgglomerativeClustering(n_clusters=n)
    
    # Fit the new clustering model
    cluster.fit(data)

    # Print the silhouette score
    print(silhouette_score(data, cluster.labels_))
    
    # Return the labels attribute from the fit clustering model
    return cluster.labels_

# Testing your function

for n in range(2, 9):
    test_n_for_clustering(n, X_processed)

0.5770346019475988
0.4466890410285909
0.4006363159855973
0.33058726295230545
0.31485480100512825
0.316969830299128
0.310946529007258
