# Phase 4 Code Challenge Review

## Overview

- Pipelines and gridsearching
- Ensemble Methods
- Natural Language Processing
- Clustering

In [3]:
# Basic Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline

In [4]:
# from src.call import call_on_students

# 1) Pipelines and Gridsearching

1. What are the benefits of using a pipeline?

- Streamlined Workflow: Pipelines allow you to organize and automate the steps of your machine learning workflow. You can define a sequence of data preprocessing, feature extraction, and model training steps and execute them in a consistent manner.

- Code Simplicity: With pipelines, you can encapsulate multiple steps into a single object. This simplifies your code and makes it more readable, modular, and reusable. You can easily understand and modify the pipeline steps without affecting other parts of your code.

- Data Leakage Prevention: Pipelines help prevent data leakage, which occurs when information from the test set unintentionally influences the training process. By ensuring that all preprocessing steps are applied within the pipeline, transformations are fit only on the training data and then applied consistently to the test data.

- Hyperparameter Optimization: Pipelines can be combined with grid search or random search techniques to perform hyperparameter optimization efficiently. Instead of manually tuning hyperparameters for each step, you can define a grid of hyperparameters for all steps in the pipeline and automatically search for the best combination.

- Reproducibility: Pipelines enable better reproducibility of your experiments. By defining and saving the pipeline configuration, including preprocessing steps and model parameters, you can easily recreate the same pipeline and obtain consistent results.

- Deployment and Production: Pipelines are useful for deploying machine learning models in production. Once you have a finalized pipeline, you can save it as a single object and deploy it without worrying about individually saving and loading preprocessing steps or model components.

- Overall, pipelines enhance the efficiency, maintainability, and reliability of machine learning workflows by providing a structured approach to data preprocessing, model training, and deployment.
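
The leakage-prevention point above can be checked directly. The following is a minimal sketch, assuming the breast cancer dataset used later in this review and an 80/20 split; the names `pipe` and `scaler_means` are illustrative only. It fits a pipeline on the training rows and confirms that the scaler's learned means come from the training data alone, not from the full dataset.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scaler and model live in one object, so fit() only ever sees X_train
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)

# The scaler's statistics were learned from the training rows only
scaler_means = pipe.named_steps["scaler"].mean_
print("Matches training means:", np.allclose(scaler_means, X_train.mean(axis=0)))
print("Matches full-data means:", np.allclose(scaler_means, X.mean(axis=0)))

# The test rows are scaled with the training statistics before scoring
test_accuracy = pipe.score(X_test, y_test)
print("Test accuracy:", test_accuracy)
```

Because the fitted pipeline is a single object, it could also be saved and reloaded as one unit (for example with `joblib.dump`), which is the deployment benefit described above.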

In [5]:
# call_on_students(1)

2. What does a gridsearch achieve?

- Hyperparameter Optimization: Machine learning models often have hyperparameters, which are parameters that are not learned from the data but set before the training process. Examples of hyperparameters include the learning rate in neural networks or the regularization parameter in support vector machines. Grid search helps find the best values for these hyperparameters by exhaustively searching through all possible combinations specified in a grid.

- Performance Evaluation: Grid search evaluates the performance of the model using different hyperparameter values for each combination in the grid. This typically involves using a cross-validation technique, such as k-fold cross-validation, to estimate the model's performance on different subsets of the data. By comparing the performance metrics across different hyperparameter combinations, grid search helps identify the best combination that yields the highest performance.

- Model Generalization: The goal of hyperparameter tuning is to improve the model's generalization performance. By systematically exploring different hyperparameter values, grid search aims to find the combination that optimizes the model's performance not only on the training data but also on unseen data. This helps prevent overfitting (where the model performs well on the training data but poorly on new data) and enhances the model's ability to make accurate predictions on unseen instances.

- Efficiency and Automation: Grid search automates the process of trying out different hyperparameter combinations, eliminating the need for manual trial and error. It searches through the grid in a systematic manner, saving time and effort for the practitioner. It also ensures that every combination in the grid is evaluated, although values you did not include in the grid are never tried.

- Reproducibility: Grid search allows for reproducibility of experiments. By specifying the grid of hyperparameter values and using the same random seeds for the data splits, one can reproduce the exact experiments and results. This is important for research purposes and for comparing different models or approaches.

- In summary, grid search is a powerful technique that helps optimize the hyperparameters of machine learning models, enabling improved performance and generalization. It automates the process of exploring different hyperparameter combinations, allowing practitioners to efficiently find the optimal settings for their models.
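
To make "exhaustive" concrete, here is a small sketch that enumerates a grid with scikit-learn's `ParameterGrid` and counts the resulting model fits. The grid itself and the choice of 5 folds are assumptions made for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical grid: 3 values of C x 2 penalties = 6 combinations
param_grid = {
    "classifier__C": [10, 1, 0.1],
    "classifier__penalty": ["l1", "l2"],
}

combinations = list(ParameterGrid(param_grid))
for combo in combinations:
    print(combo)

# With k-fold cross-validation, every combination is fit once per fold
n_folds = 5
n_fits = len(combinations) * n_folds
print(len(combinations), "combinations x", n_folds, "folds =", n_fits, "fits")
```

The number of fits grows multiplicatively with each added hyperparameter, which is why large grids get expensive quickly.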

In [6]:
# call_on_students(1)

3. Set up a pipeline with a scaler and a logistic regression model on the breast cancer dataset that predicts whether the tumor is malignant. (Note: in scikit-learn's `load_breast_cancer`, malignant is encoded as target = 0 and benign as target = 1.) Don't worry for now about a train-test split.

**Answer**:

In [7]:
from sklearn.datasets import load_breast_cancer

In [8]:
# Your code here
data = load_breast_cancer()
X = data.data
y = data.target

pipeline = Pipeline([
    ('scaler', StandardScaler()),  # Scaling step
    ('classifier', LogisticRegression())  # Logistic regression model
])

# Fit the pipeline to the data
pipeline.fit(X, y)


4. Split the data into train and test and then gridsearch over pipelines like the one you just built to find the best-performing model. Try C (inverse regularization) values of 10, 1, and 0.1. Try out the best estimator on the test set.

**Answer**:

In [9]:
# Your code here


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a pipeline with a scaler and logistic regression
pipeline = Pipeline([
    ('scaler', StandardScaler()),  # Scaling step
    ('classifier', LogisticRegression())  # Logistic regression model
])

# Define the parameter grid for grid search
param_grid = {
    'classifier__C': [10, 1, 0.1]  # C (inverse regularization) values to try
}

# Perform grid search over pipelines
grid_search = GridSearchCV(pipeline, param_grid=param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Get the best-performing model
best_model = grid_search.best_estimator_

# Evaluate the best model on the test set
test_accuracy = best_model.score(X_test, y_test)
print("Test Accuracy:", test_accuracy)


Test Accuracy: 0.9736842105263158
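
Beyond the best estimator, the fitted search object keeps the cross-validated score for every value of C. The sketch below rebuilds the same search so it runs on its own, then reads `cv_results_` into a dataframe; the variable name `results` is illustrative.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression()),
])
grid_search = GridSearchCV(
    pipeline, param_grid={"classifier__C": [10, 1, 0.1]}, cv=5
)
grid_search.fit(X_train, y_train)

# One row per hyperparameter combination, with its mean CV score and rank
results = pd.DataFrame(grid_search.cv_results_)[
    ["param_classifier__C", "mean_test_score", "std_test_score", "rank_test_score"]
]
print(results.sort_values("rank_test_score"))
print("Best params:", grid_search.best_params_)
print("Best mean CV accuracy:", grid_search.best_score_)
```

Comparing `best_score_` (estimated on training folds) with the test-set accuracy is a quick sanity check that the tuned model generalizes.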


# 2) Ensemble Methods

1. What sorts of ensembling methods have we looked at?

Ensemble methods combine multiple learners into one model:

- bagging

- boosting

- stacking

- voting

- technically, neural networks

In [10]:
# call_on_students(1)

2. What is random about a random forest?

- Random sampling: Each individual decision tree within the random forest is trained on a bootstrap sample, meaning rows drawn at random, with replacement, from the training data. Training many trees this way and combining their predictions is known as "bootstrap aggregating" or "bagging." Randomly sampling the data introduces diversity among the individual trees and reduces the risk of overfitting to the training set. Each tree sees a slightly different version of the data, which increases the overall robustness and generalization ability of the forest.

- Random feature selection: For each split in a decision tree, the algorithm considers only a random subset of features or attributes rather than using all the available features. This randomness in feature selection helps to decorrelate the individual trees in the forest and ensures that no single feature dominates the decision-making process. By randomly selecting features at each split, the random forest becomes less prone to overfitting and captures a broader range of relationships between the features and the target variable.

- Extra credit: bootstrapping (sampling with replacement)
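
Both sources of randomness can be imitated with NumPy. This is an illustrative sketch, not scikit-learn's internal code: the sample size of 10,000 rows and the 30 features are assumptions, and the square-root rule mirrors the usual default for classification forests.

```python
import numpy as np

rng = np.random.default_rng(42)

# 1) Bootstrap sample: draw n row indices WITH replacement
n_samples = 10_000
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)
unique_fraction = np.unique(bootstrap_idx).size / n_samples
print("Fraction of distinct rows in the bootstrap sample:", unique_fraction)

# 2) Random feature subset considered at a single split (WITHOUT replacement)
n_features = 30
max_features = int(np.sqrt(n_features))
candidate_features = rng.choice(n_features, size=max_features, replace=False)
print("Features considered at this split:", sorted(candidate_features))
```

For large samples, roughly 63% of the rows appear in any one bootstrap sample; the rows left out are what "out-of-bag" evaluation uses.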

In [11]:
# call_on_students(1)

3. What hyperparameters of a random forest might it be useful to tune? How so?

- Number of trees (n_estimators): It determines the number of decision trees in the random forest. Increasing the number of trees generally improves and stabilizes the model's performance, with diminishing returns, but it also increases the computational cost. Adding trees does not by itself cause overfitting, so the balance to find is mainly between accuracy and training/prediction time.

- Maximum depth (max_depth): It limits the depth of each decision tree. A deeper tree can capture complex relationships in the data, but it also increases the risk of overfitting. Tuning the maximum depth helps control the complexity of individual trees and the overall model.

- Minimum samples for a split (min_samples_split) and leaf (min_samples_leaf): These parameters determine the minimum number of samples required to perform a split at a node or form a leaf node. Increasing these values can prevent overfitting by enforcing a higher threshold for splitting or creating leaves.

- Maximum number of features (max_features): It controls the number of features to consider when looking for the best split. Reducing the number of features can increase the diversity among trees and reduce overfitting. The ideal value depends on the dataset and can be determined through experimentation.

- Split criterion (criterion): It defines the measure of split quality used when choosing each split during tree construction. The two common options are "gini" and "entropy." The choice between them depends on the specific problem and often requires experimentation to find the best criterion.

- Randomness control (random_state): This parameter sets the random seed, ensuring reproducibility of results. It is something to fix rather than tune: by fixing the random seed, you can obtain consistent results across multiple runs or experiments.
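
The hyperparameters above can be tuned with the same grid search tool from section 1. The sketch below is one possible setup; the specific grid values and `cv=3` are assumptions chosen to keep the run short, not recommended settings.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Small illustrative grid over the hyperparameters discussed above
rf_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 5],
    "min_samples_leaf": [1, 5],
    "max_features": ["sqrt", 0.5],
}

# random_state is fixed for reproducibility, not searched over
rf_search = GridSearchCV(
    RandomForestClassifier(random_state=42), param_grid=rf_grid, cv=3
)
rf_search.fit(X_train, y_train)

print("Best params:", rf_search.best_params_)
rf_test_accuracy = rf_search.score(X_test, y_test)
print("Test accuracy:", rf_test_accuracy)
```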

In [12]:
# call_on_students(1)

4. Build a random forest model on the breast cancer dataset that predicts whether the tumor is malignant. (Note: in scikit-learn's `load_breast_cancer`, malignant is encoded as target = 0 and benign as target = 1.) Make sure you do a train-test split!

**Answer**:

In [14]:
# Your code here
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

random_forest = RandomForestClassifier(random_state=42)
random_forest.fit(X_train, y_train)

# Make predictions on the test set
y_pred = random_forest.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.9649122807017544


# 3) Natural Language Processing

## NLP Concepts

### Some Example Text

In [16]:
# Each sentence is a document
sentence_one = "Harry Potter is the best young adult book about wizards"
sentence_two = "Um, EXCUSE ME! Ever heard of Earth Sea?"
sentence_three = "I only like to read non-fiction.  It makes me a better person."

# The corpus is composed of all of the documents
corpus = [sentence_one, sentence_two, sentence_three]

### 1: NLP Pre-processing

List at least three steps you can take to turn raw text like this into something that would be semantically valuable (aka ready to turn into numbers):

In [None]:
# call_on_students(1)



#### Answer:

- Lowercase (standardize case)
- Remove stopwords (really common words that likely have no semantic value)
- Stem or lemmatize to remove prefixes/suffixes/grammar bits
- Remove punctuation
- Tokenize
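
Most of these steps can be sketched with the standard library alone. The stopword set below is a tiny hypothetical list made up for this example (real projects usually take one from a library such as NLTK or scikit-learn), stemming/lemmatization is left out because it needs an external library, and the order of the steps is one reasonable choice rather than the only one.

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration
STOPWORDS = {"is", "the", "of", "a", "to", "it", "i", "me", "about"}

def preprocess(document):
    """Lowercase, strip punctuation, tokenize, and drop stopwords."""
    lowered = document.lower()                    # standardize case
    no_punct = re.sub(r"[^\w\s]", " ", lowered)   # punctuation -> spaces
    tokens = no_punct.split()                     # tokenize on whitespace
    return [token for token in tokens if token not in STOPWORDS]

sentence_two = "Um, EXCUSE ME! Ever heard of Earth Sea?"
print(preprocess(sentence_two))
# ['um', 'excuse', 'ever', 'heard', 'earth', 'sea']
```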

### 2: Describe what vectorized text would look like as a dataframe.

If you vectorize the above corpus, what would the rows and columns be in the resulting dataframe (aka document term matrix)?

In [None]:
# call_on_students(1)

#### Answer:

- Columns: every unique word/token in the dataset/corpus
- Rows: the documents you're vectorizing


### 3: What does TF-IDF do?

Also, what does TF-IDF stand for?

In [None]:
# call_on_students(1)

#### Answer:

- TF-IDF: term frequency-inverse document frequency
- TF-IDF is a vectorizer that takes into account the rarity of words across the corpus

- It is a numerical statistic used to evaluate the importance of a term to a document within a collection or corpus of documents.

- TF (Term Frequency) measures the frequency of a term within a document. It calculates the ratio of the number of times a term appears in a document to the total number of terms in that document. The intuition behind TF is that terms that appear more frequently in a document are more important to that document.

- IDF (Inverse Document Frequency) measures the rarity or uniqueness of a term across the entire corpus. It is typically calculated as the logarithm of the ratio of the total number of documents in the corpus to the number of documents that contain the term. The intuition behind IDF is that terms that appear in fewer documents are more informative and carry more weight in distinguishing between documents.

- TF-IDF is calculated by multiplying the TF and IDF values for each term in a document. The higher the TF-IDF score for a term in a document, the more important and relevant that term is to the document.

- TF-IDF is commonly used in natural language processing tasks, such as text classification, information retrieval, and text mining. It helps in identifying the most significant words or terms in a document and can be used to represent documents as numerical feature vectors for machine learning algorithms.
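
The definitions above translate into a few lines of plain Python. This sketch uses the textbook formula (TF as a within-document ratio, IDF as log(N / document frequency)) on made-up, already-tokenized documents. It is not what scikit-learn's `TfidfVectorizer` computes by default: that class uses raw counts for TF, a smoothed IDF, and L2-normalizes each row.

```python
import math

# Hypothetical toy corpus, already tokenized
docs = [
    ["harry", "potter", "wizards", "book"],
    ["earth", "sea", "wizards", "book"],
    ["non", "fiction", "book"],
]

def tf(term, doc):
    """Share of the document's tokens that are this term."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Log of (number of documents / number of documents containing the term)."""
    n_containing = sum(term in doc for doc in docs)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

for term in ["harry", "wizards", "book"]:
    print(term, round(tf_idf(term, docs[0], docs), 4))
```

"book" appears in every document, so its IDF is log(1) = 0 and it carries no weight, while "harry" appears in only one document and scores highest.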


## NLP in Code

### Set Up

In [None]:
# New section, new data
policies = pd.read_csv('data/2020_policies_feb_24.csv')

def warren_not_warren(label):
    
    '''Make label a binary between Elizabeth Warren
    speeches and speeches from all other candidates'''
    
    if label =='warren':
        return 1
    else:
        return 0
    
policies['candidate'] = policies['candidate'].apply(warren_not_warren)

The dataframe loaded above consists of policies of 2020 Democratic presidential hopefuls. The `policy` column holds text describing the policies themselves.  The `candidate` column indicates whether it was or was not an Elizabeth Warren policy.

In [None]:
policies.head()

The documents for this activity are in the `policy` column, and the target is `candidate`.

### 4: Import the Relevant Class, Then Instantiate and Fit a Count Vectorizer Object

In [None]:
# call_on_students(1)

In [None]:
# First! Train-test split the dataset
from sklearn.model_selection import train_test_split

# Code here to train test split
# random_state fixes the shuffle so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(
    policies['policy'], policies['candidate'], random_state=42
)

In [None]:
# Import the relevant vectorizer
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
# Instantiate it
vectorizer = CountVectorizer()

In [None]:
# Fit it
vectorizer.fit(X_train)

### 5: Vectorize Your Text, Then Model

In [None]:
# call_on_students(1)

In [None]:
# Code here to transform train and test sets with the vectorizer
X_tr_vec = vectorizer.transform(X_train)
X_te_vec = vectorizer.transform(X_test)

In [None]:
# Importing the classifier...
from sklearn.ensemble import RandomForestClassifier

# Code here to instantiate and fit a Random Forest model
rfc = RandomForestClassifier(random_state=42)  # fixed seed for reproducible results
rfc.fit(X_tr_vec, y_train)

In [None]:
# Code here to evaluate your model on the test set
rfc.score(X_te_vec, y_test)
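
The fit/transform steps above can also be bundled into a `Pipeline`, tying this section back to section 1. The sketch below is an assumption-heavy illustration: the documents and labels are invented placeholders standing in for `policies['policy']` and `policies['candidate']`, and the labels are arbitrary.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# Invented placeholder data; labels are arbitrary 0/1 values
train_docs = [
    "tax the wealthy to fund childcare",
    "expand rural broadband access",
    "cancel student loan debt",
    "invest in highway repair",
]
train_labels = [1, 0, 1, 0]
test_docs = ["fund childcare with a wealth tax"]

text_pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", RandomForestClassifier(random_state=42)),
])

# fit() learns the vocabulary from the training documents only
text_pipeline.fit(train_docs, train_labels)
predictions = text_pipeline.predict(test_docs)
print(predictions)

# Words that appear only in the test documents never enter the vocabulary
vocab = text_pipeline.named_steps["vectorizer"].vocabulary_
print("'wealth' in vocabulary:", "wealth" in vocab)
```

Written this way, the vectorizer cannot accidentally be fit on test text, and the whole object can be passed to `GridSearchCV` if you want to tune vectorizer and model settings together.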

# 4) Clustering

## Clustering Concepts

### 1: Describe how the K-Means algorithm updates its cluster centers after initialization.

In [None]:
# call_on_students(1)

#### Answer:

- You set the number of cluster centers (K) - algorithm randomly starts with that number of cluster centers (in random spots!)
- The algorithm calculates the distance between the centers and each observation and assigns the observation to the closest cluster center to create the first iteration of clusters
- The algorithm then takes all the observations assigned to each cluster, and moves that cluster center to be at the exact actual center (mean) of the newly created cluster
- Repeat! Until the cluster centers stop moving (or tolerance is met - some parameters in the implementation)
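
The assign-then-update loop described above can be sketched in NumPy. This is a bare-bones illustration with purely random initialization, not scikit-learn's `KMeans` (which defaults to k-means++ initialization and multiple restarts); the function name `kmeans_sketch` and the two synthetic blobs are made up for the example.

```python
import numpy as np

def assign(X, centers):
    """Label each observation with the index of its closest center."""
    distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return distances.argmin(axis=1)

def kmeans_sketch(X, k, n_iter=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    # Start from k randomly chosen observations as the initial centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = assign(X, centers)
        # Move each center to the mean of its assigned points
        # (an empty cluster keeps its previous center)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        shift = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if shift < tol:  # centers stopped moving
            break
    return centers, assign(X, centers)

# Two synthetic, well-separated blobs around (0, 0) and (5, 5)
rng = np.random.default_rng(1)
X_demo = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])
centers, labels = kmeans_sketch(X_demo, k=2)
print(centers)
```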

### 2: What is inertia, and how does K-Means use inertia to determine the best estimator?

Please also describe the method you can use to evaluate clustering using inertia.

Documentation, for reference: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

In [None]:
# call_on_students(1)

#### Answer:

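- Inertia is the sum of squared distances between each point and its assigned cluster center - the idea is that better clusters are more tightly concentrated
- KMeans tries to minimize inertia when choosing cluster centers; scikit-learn's implementation runs from several initializations (`n_init`) and keeps the run with the lowest inertia
- Method to evaluate - elbow plot! Plot inertia against the number of clusters and look for the "elbow" where adding more clusters stops reducing inertia by much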
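
Here is a small sketch of both ideas, assuming the scaled iris data used later in this section: it collects `inertia_` for several values of k (the numbers you would put on an elbow plot) and recomputes inertia by hand for the last model to confirm the definition.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_iris().data)

# Inertia for each candidate number of clusters
inertias = {}
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled)
    inertias[k] = km.inertia_
    print(k, round(km.inertia_, 2))

# Inertia by hand for the last fitted model (k=8):
# squared distance from every point to its assigned center, summed
assigned_centers = km.cluster_centers_[km.labels_]
manual_inertia = ((X_scaled - assigned_centers) ** 2).sum()
print("Manual inertia matches:", np.isclose(manual_inertia, km.inertia_))
```

Plotting `inertias` with k on the x-axis (for example with `plt.plot`) gives the elbow plot. Inertia keeps falling as k grows, so the goal is to find where the curve flattens, not its minimum.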

### 3: What other metric do we have to score the clusters which are formed?

Describe the difference between it and inertia.

In [None]:
# call_on_students(1)

#### Answer:

- Silhouette score
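- Difference between silhouette score and inertia: silhouette score compares how close each point is to the other points in its own cluster with how far it is from the nearest other cluster, so it rewards clusters that are both tight and well separated (it ranges from -1 to 1, higher is better), while inertia just looks within each cluster and tends to keep decreasing as clusters are added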
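
For a single point, the silhouette value is (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the points in the nearest other cluster. The sketch below computes this by hand on a made-up six-point dataset and compares it with scikit-learn's `silhouette_samples`; the helper `silhouette_for_point` is hypothetical.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Hypothetical toy data: two tight, well-separated clusters
X_toy = np.array([
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
    [5.0, 5.0], [5.0, 6.0], [6.0, 5.0],
])
toy_labels = np.array([0, 0, 0, 1, 1, 1])

def silhouette_for_point(i, X, labels):
    distances = np.linalg.norm(X - X[i], axis=1)
    same_cluster = labels == labels[i]
    same_cluster[i] = False  # exclude the point itself
    a = distances[same_cluster].mean()  # cohesion: own cluster
    b = min(                            # separation: nearest other cluster
        distances[labels == other].mean()
        for other in np.unique(labels)
        if other != labels[i]
    )
    return (b - a) / max(a, b)

manual_value = silhouette_for_point(0, X_toy, toy_labels)
sklearn_value = silhouette_samples(X_toy, toy_labels)[0]
print(manual_value, sklearn_value)
```

`silhouette_score`, used in the coding section below, is the mean of these per-point values.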

## Clustering in Code with Hierarchical Agglomerative Clustering

After the above conceptual review of KMeans, let's practice coding with agglomerative clustering.

### Set Up

In [None]:
# New dataset for this section!
from sklearn.datasets import load_iris

data = load_iris()
X = pd.DataFrame(data['data'])

### 4: Prepare our Data for Clustering

What steps do we need to take to preprocess our data effectively?

- scale

In [None]:
# call_on_students(1)

In [None]:
# Code to preprocess the data
k_scaler = StandardScaler()

# Name the processed data X_processed
X_processed = k_scaler.fit_transform(X)

### 5: Import the Relevant Class, Then Instantiate and Fit a Hierarchical Agglomerative Clustering Object

Let's use `n_clusters = 2` to start (default)

In [None]:
# call_on_students(1)

In [None]:
# Import the relevant clustering algorithm
from sklearn.cluster import AgglomerativeClustering

# Instantiate
cluster = AgglomerativeClustering(n_clusters=2)
# Fit the object
cluster.fit(X_processed)

# Calculate a silhouette score
from sklearn.metrics import silhouette_score
silhouette_score(X_processed, cluster.labels_)

### 6: Write a Function to Test Different Options for `n_clusters`

The function should take in the number for `n_clusters` and the data to cluster, fit a new clustering model using that parameter to the data, print the silhouette score, then return the labels attribute from the fit clustering model.

In [None]:
# call_on_students(1)

In [None]:
def test_n_for_clustering(n, data):
    """ 
    Tests different numbers for the hyperparameter n_clusters
    Prints the silhouette score for that clustering model
    Returns the labels that are output from the clustering model

    Parameters: 
    -----------
    n: int
        number of clusters to use in the agglomerative clustering model
    data: Pandas DataFrame or array-like object
        Data to cluster

    Returns: 
    --------
    labels: array-like object
        Labels attribute from the clustering model
    """
    # Create the new clustering model
    cluster = AgglomerativeClustering(n_clusters=n)
    
    # Fit the new clustering model
    cluster.fit(data)

    # Print the silhouette score
    print(silhouette_score(data, cluster.labels_))
    
    # Return the labels attribute from the fit clustering model
    return cluster.labels_

# Testing your function

for n in range(2, 9):
    test_n_for_clustering(n, X_processed)