# Phase 4 Code Challenge Review

## Overview

- Pipelines and gridsearching
- Ensemble Methods
- Natural Language Processing
- Clustering

In [1]:
# Basic Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline

In [None]:
# from src.call import call_on_students

# 1) Pipelines and Gridsearching

1. What are the benefits of using a pipeline?

2. What does a gridsearch achieve?

3. Set up a pipeline with a scaler and a logistic regression model on the breast cancer dataset that predicts the tumor class. (In scikit-learn's `load_breast_cancer`, target = 0 is malignant and target = 1 is benign.) Don't worry for now about a train-test split.

**Answer**:

**My answer:**
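Using a pipeline reduces complexity by bundling the preprocessing and modeling steps into one object, so one specific part of the pipeline can be focused on and adjusted at a time.
It is more readable, which helps prevent mistakes, and because every step is fit only on the data passed to `fit`, it helps prevent data leakage during cross-validation.
A pipeline is compatible with many different transformers and models, so it can be used flexibly.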
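
To make the data leakage point concrete, here is a minimal sketch that cross-validates a pipeline like the one built in this section. The name `leak_free_pipe` and the choice of `cv=5` are assumptions for illustration only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Hypothetical name; same two steps as the pipeline in this section
leak_free_pipe = Pipeline([
    ('std_scaler', StandardScaler()),
    ('logreg', LogisticRegression(random_state=42)),
])

# In each of the 5 folds, the scaler is fit on the training portion only
# and then applied to the held-out portion, so no test information leaks
# into the scaling step.
scores = cross_val_score(leak_free_pipe, X, y, cv=5)
print(scores.mean())
```

Scaling the full dataset first and cross-validating afterwards would let every fold's scaler see the held-out rows, which is the leak the pipeline avoids.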

**My answer:**
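A grid search tries every combination of the hyperparameter values you specify, scores each one with cross-validation, and reports the best-performing combination. It is a tool for model tuning.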

In [2]:
from sklearn.datasets import load_breast_cancer

In [51]:
# Your code here
from sklearn.pipeline import Pipeline
data = load_breast_cancer(return_X_y=True)

X = data[0]
y = data[1]

pipeline = Pipeline(
   [ ('std_scaler', StandardScaler()),
    ('logreg', LogisticRegression(random_state=42))]
)

pipeline.fit(X, y)

4. Split the data into train and test and then gridsearch over pipelines like the one you just built to find the best-performing model. Try C (inverse regularization) values of 10, 1, and 0.1. Try out the best estimator on the test set.

**Answer**:

In [55]:
# Your code here
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    random_state=42)

pipe_grid = {'logreg__C': [10, 1, 0.1]}
gs_pipe = GridSearchCV(estimator=pipeline, param_grid=pipe_grid)
gs_pipe.fit(X_train, y_train)


In [57]:
print(gs_pipe.best_estimator_)

Pipeline(steps=[('std_scaler', StandardScaler()),
                ('logreg', LogisticRegression(C=10, random_state=42))])


In [58]:
pipeline = Pipeline(
   [ ('std_scaler', StandardScaler()),
    ('logreg', LogisticRegression(C=10, random_state=42))]
)

pipeline.fit(X_train, y_train)
pipeline.score(X_test, y_test)

0.972027972027972
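
As an alternative to rebuilding the pipeline by hand, the fitted grid search can be scored directly. This is a sketch that assumes the default `refit=True`, in which case `GridSearchCV` refits the best parameter combination on the full training set; the names `pipe` and `gs` are placeholders.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

pipe = Pipeline([
    ('std_scaler', StandardScaler()),
    ('logreg', LogisticRegression(random_state=42)),
])

# Step name + double underscore + parameter name addresses a pipeline step
gs = GridSearchCV(estimator=pipe, param_grid={'logreg__C': [10, 1, 0.1]})
gs.fit(X_train, y_train)

print(gs.best_params_)

# With refit=True (the default), score() delegates to best_estimator_
test_score = gs.score(X_test, y_test)
print(test_score)
```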

# 2) Ensemble Methods

1. What sorts of ensembling methods have we looked at?

2. What is random about a random forest?

3. What hyperparameters of a random forest might it be useful to tune? How so?

4. Build a random forest model on the breast cancer dataset that predicts the tumor class. (In scikit-learn's `load_breast_cancer`, target = 0 is malignant and target = 1 is benign.) Make sure you do a train-test split!

**Answer**:

**My answer:**
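1. Bagging and boosting. Random forest is an example of bagging.
2. Each tree is trained on a random sample (with replacement) of the training set, and a random subset of features is considered at each decision point.
3. `max_features` (how many features are considered at each split), `max_samples` (how many rows are drawn for each tree)...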
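
One way to tune those hyperparameters is a small grid search. The grid values below are arbitrary choices for illustration, not recommendations, and `rf_grid` and `gs_rf` are placeholder names.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Illustrative values only
rf_grid = {
    'max_features': ['sqrt', 0.5],  # features considered at each split
    'max_samples': [0.5, None],     # rows drawn per tree (None = all rows)
    'n_estimators': [50, 100],      # number of trees in the forest
}

gs_rf = GridSearchCV(RandomForestClassifier(random_state=42), rf_grid, cv=3)
gs_rf.fit(X_train, y_train)

print(gs_rf.best_params_)
print(gs_rf.score(X_test, y_test))
```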

In [60]:
# Your code here
from sklearn.ensemble import RandomForestClassifier
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    random_state=42)

rfc = RandomForestClassifier(random_state=42)
rfc.fit(X_train, y_train)
rfc.score(X_test, y_test)

0.965034965034965

# 3) Natural Language Processing

## NLP Concepts

### Some Example Text

In [21]:
# Each sentence is a document
sentence_one = "Harry Potter is the best young adult book about wizards"
sentence_two = "Um, EXCUSE ME! Ever heard of Earth Sea?"
sentence_three = "I only like to read non-fiction.  It makes me a better person."

# The corpus is composed of all of the documents
corpus = [sentence_one, sentence_two, sentence_three]

### 1: NLP Pre-processing

List at least three steps you can take to turn raw text like this into something that would be semantically valuable (aka ready to turn into numbers):

**My answers:**
Tokenization to n-grams
Remove capitalization
Remove punctuation
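
The three steps above can be sketched with the standard library alone. The helper name `preprocess` is hypothetical, and this version only produces single-word tokens (unigrams).

```python
import string


def preprocess(document):
    '''Lowercase a document, strip punctuation, and split it into tokens.'''
    lowered = document.lower()
    # Delete every character listed in string.punctuation
    no_punct = lowered.translate(str.maketrans('', '', string.punctuation))
    # Split on whitespace to get unigram tokens
    return no_punct.split()


tokens = preprocess("Um, EXCUSE ME! Ever heard of Earth Sea?")
print(tokens)
# ['um', 'excuse', 'me', 'ever', 'heard', 'of', 'earth', 'sea']
```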

### 2: Describe what vectorized text would look like as a dataframe.

If you vectorize the above corpus, what would the rows and columns be in the resulting dataframe (aka document term matrix)

**My answers:**
Row = each text in corpus
Column = each feature (n-gram)

### 3: What does TF-IDF do?

Also, what does TF-IDF stand for?

**My answers:**
Term frequency-inverse document frequency.
A vectorizer that weights the count of each n-gram in a document by how rare that n-gram is across the corpus, so terms that appear in nearly every document count for less.

## NLP in Code

### Set Up

In [22]:
# New section, new data
policies = pd.read_csv('data/2020_policies_feb_24.csv')

def warren_not_warren(label):
    
    '''Make label a binary between Elizabeth Warren
    speeches and speeches from all other candidates'''
    
    if label =='warren':
        return 1
    else:
        return 0
    
policies['candidate'] = policies['candidate'].apply(warren_not_warren)

The dataframe loaded above consists of policies of 2020 Democratic presidential hopefuls. The `policy` column holds text describing the policies themselves.  The `candidate` column indicates whether it was or was not an Elizabeth Warren policy.

In [23]:
policies.head()

| Unnamed: 0.1 | Unnamed: 0 | name | policy | candidate |
| --- | --- | --- | --- | --- |
| 0 | 0 | 100% Clean Energy for America | As published on Medium on September 3rd, 2019:... | 1 |
| 1 | 1 | A Comprehensive Agenda to Boost America’s Smal... | Small businesses are the heart of our economy.... | 1 |
| 2 | 2 | A Fair and Welcoming Immigration System | As published on Medium on July 11th, 2019:\nIm... | 1 |
| 3 | 3 | A Fair Workweek for America’s Part-Time Workers | Working families all across the country are ge... | 1 |
| 4 | 4 | A Great Public School Education for Every Student | I attended public school growing up in Oklahom... | 1 |


The documents for this activity are in the `policy` column, and the target is the `candidate` column.

### 4: Import the Relevant Class, Then Instantiate and Fit a Count Vectorizer Object

In [28]:
# First! Train-test split the dataset
X_train, X_test, y_train, y_test = train_test_split(policies.policy, 
                                                    policies.candidate,
                                                    )

In [29]:
# Import the relevant vectorizer
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords

In [30]:
# Instantiate it
sw = stopwords.words('english')
vec = CountVectorizer(token_pattern=r"([a-zA-Z]+(?:'[a-z]+)?)", stop_words=sw)

In [39]:
# Fit it
vec.fit(X_train)

### 5: Vectorize Your Text, Then Model

In [40]:
# Code here to transform train and test sets with the vectorizer
import pandas as pd

X_tr_vec = vec.transform(X_train)
train_df = pd.DataFrame(X_tr_vec.toarray(), columns=vec.get_feature_names_out())
train_df.head()

Unnamed: 0,aac,abandon,abandoned,abandoning,abandonment,abatement,abdicated,abdications,abets,abetting,...,zinke,zip,zoe,zombie,zone,zones,zoning,zorc,zuckerberg,zucman
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.012117,0.012117,0.0114,0.0,0.0,0.0,0.0,...,0.0,0.010835,0.0,0.0,0.0,0.009377,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [41]:
X_test_vec = vec.transform(X_test)
test_df = pd.DataFrame(X_test_vec.toarray(), columns=vec.get_feature_names_out())
test_df.head()

Unnamed: 0,aac,abandon,abandoned,abandoning,abandonment,abatement,abdicated,abdications,abets,abetting,...,zinke,zip,zoe,zombie,zone,zones,zoning,zorc,zuckerberg,zucman
0,0.0,0.0,0.006164,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.005115,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [42]:
# Code here to instantiate and fit a Random Forest model
rfc = RandomForestClassifier()
rfc.fit(train_df, y_train)

In [43]:
# Code here to evaluate your model on the test set
rfc.score(test_df, y_test)

0.9791666666666666
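
The vectorize-then-model steps can also be wrapped in a pipeline, as in section 1. This is only a sketch: `toy_docs` and `toy_labels` are made-up stand-ins, not rows from the policies dataset, and the step names are arbitrary.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# Made-up stand-in documents and labels, for illustration only
toy_docs = [
    "a plan to tax extreme wealth",
    "expand rural broadband access",
    "a wealth tax on large fortunes",
    "invest in broadband for rural towns",
    "tax the largest fortunes",
    "rural internet access for all",
]
toy_labels = [1, 0, 1, 0, 1, 0]

text_pipe = Pipeline([
    ('vec', CountVectorizer(stop_words='english')),
    ('rfc', RandomForestClassifier(random_state=42)),
])

# fit() learns the vocabulary and the forest from the training text only
text_pipe.fit(toy_docs, toy_labels)

# predict() reuses the fitted vocabulary on unseen text
preds = text_pipe.predict(["a new tax on wealth"])
print(preds)
```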

# 4) Clustering

## Clustering Concepts

### 1: Describe how the K-Means algorithm updates its cluster centers after initialization.

**My answer:**
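Each point is assigned to its nearest center, then each center is moved to the mean of the points assigned to it; this repeats until the assignments stop changing.
This minimizes within-cluster distance (intra-cluster), which is equivalent to maximizing between-cluster distance (inter-cluster).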
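
One iteration of the standard K-Means update (assign each point to its nearest center, then move each center to the mean of its points) can be sketched in NumPy. The four points and two starting centers are made up, and the sketch ignores the edge case of a center that ends up with no points.

```python
import numpy as np

# Made-up data: two obvious groups and two starting centers
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centers = np.array([[0.0, 0.0], [10.0, 10.0]])

# Assignment step: distance from every point to every center, shape (4, 2)
distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
labels = distances.argmin(axis=1)

# Update step: each center moves to the mean of the points assigned to it
new_centers = np.array([points[labels == k].mean(axis=0)
                        for k in range(len(centers))])

print(labels)       # [0 0 1 1]
print(new_centers)  # rows: [0, 0.5] and [10, 10.5]
```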

### 2: What is inertia, and how does K-Means use inertia to determine the best estimator?

**My answer:**
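Inertia is the sum of squared distances from each point to its assigned cluster center; it plays a role for K-Means similar to the one log loss plays for a classifier.
Within a single fit, K-Means runs several initializations (`n_init`) and keeps the run with the lowest inertia.
Inertia across different values of k can be drawn as an elbow plot to choose k. You want low inertia, but it keeps decreasing as k grows, so look for the elbow where the improvement levels off.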

Please also describe the method you can use to evaluate clustering using inertia.

Documentation, for reference: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
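
A minimal sketch of collecting inertia for an elbow plot, using the iris data with min-max scaling as an assumed example. It also recomputes inertia by hand for the last model as a check on the definition (sum of squared distances to the assigned center).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler

X_scaled = MinMaxScaler().fit_transform(load_iris()['data'])

# Fit one model per candidate k and record its inertia
inertias = []
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled)
    inertias.append(model.inertia_)

# These values are what you would plot against k to look for the elbow
print(inertias)

# Manual inertia for the last fitted model (k = 6)
manual = ((X_scaled - model.cluster_centers_[model.labels_]) ** 2).sum()
print(manual, model.inertia_)
```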

### 3: What other metric do we have to score the clusters which are formed?

Describe the difference between it and inertia.

**My answer:**
Silhouette coefficient.
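It's between -1 and 1, and you want it as close to 1 as possible.
Unlike inertia, it accounts for the separation between clusters as well as the tightness within them, so it does not automatically improve as k increases.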
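
For a closer look, each point has its own silhouette value, and the overall score is the mean of those values. The sketch below assumes the iris data, min-max scaling, and `n_clusters=3` purely for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.preprocessing import MinMaxScaler

X_scaled = MinMaxScaler().fit_transform(load_iris()['data'])
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_scaled)

# One value per point, each between -1 and 1
per_point = silhouette_samples(X_scaled, labels)

# The overall score is the mean of the per-point values
overall = silhouette_score(X_scaled, labels)

print(per_point.min(), per_point.max())
print(overall)
```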

## Clustering in Code with Hierarchical Agglomerative Clustering

After the above conceptual review of KMeans, let's practice coding with agglomerative clustering.

### Set Up

In [44]:
# New dataset for this section!
from sklearn.datasets import load_iris

data = load_iris()
X = pd.DataFrame(data['data'])

### 4: Prepare our Data for Clustering

What steps do we need to take to preprocess our data effectively?

- scale

In [45]:
# Code to preprocess the data
from sklearn.preprocessing import MinMaxScaler

minmax = MinMaxScaler()

# Name the processed data X_processed
X_processed = minmax.fit_transform(X)
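
A quick check of what the scaler does, as a sketch: distance-based clustering is sensitive to feature ranges, and after min-max scaling every column runs from 0 to 1.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler

X = load_iris()['data']
X_processed = MinMaxScaler().fit_transform(X)

# Raw feature ranges differ from column to column
print(X.min(axis=0), X.max(axis=0))

# After scaling, every column has minimum 0 and maximum 1
print(X_processed.min(axis=0), X_processed.max(axis=0))
```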

### 5: Import the Relevant Class, Then Instantiate and Fit a Hierarchical Agglomerative Clustering Object

Let's use `n_clusters = 2` to start (default)

In [46]:
# Import the relevant clustering algorithm
from sklearn.cluster import AgglomerativeClustering

# Instantiate and fit
agg_cluster = AgglomerativeClustering(n_clusters=2).fit(X_processed)



In [48]:
# Calculate a silhouette score
from sklearn import metrics

# AgglomerativeClustering has no predict method; use the fitted labels_
metrics.silhouette_score(X_processed, agg_cluster.labels_)

0.6300471284354711

### 6: Write a Function to Test Different Options for `n_clusters`

The function should take in the number for `n_clusters` and the data to cluster, fit a new clustering model using that parameter to the data, print the silhouette score, then return the labels attribute from the fit clustering model.

In [49]:
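def n_cluster_options(cluster_number, data):
    '''Fit a KMeans model with `cluster_number` clusters to `data`,
    print its silhouette score, and return the cluster labels.'''
    from sklearn.cluster import KMeans
    from sklearn import metrics

    model = KMeans(n_clusters=cluster_number).fit(data)

    # labels_ holds the cluster assignment for each row of the fitted data
    print(metrics.silhouette_score(data, model.labels_))

    return model.labels_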
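
Since this section is about agglomerative clustering, here is a sketch of the same idea with `AgglomerativeClustering`, plus a loop over a few values of `n_clusters`. The name `agg_cluster_options` and the range 2 to 5 are assumptions for illustration.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import MinMaxScaler


def agg_cluster_options(cluster_number, data):
    '''Fit an agglomerative model, print its silhouette score,
    and return the cluster labels.'''
    model = AgglomerativeClustering(n_clusters=cluster_number).fit(data)
    print(cluster_number, silhouette_score(data, model.labels_))
    return model.labels_


X_processed = MinMaxScaler().fit_transform(load_iris()['data'])

# Try several cluster counts and keep the labels from each fit
labels_by_k = {k: agg_cluster_options(k, X_processed) for k in range(2, 6)}
```

Comparing the printed scores across values of `n_clusters` gives a simple way to pick a cluster count.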