# a4 - Python

This assignment covers text mining and clustering.

Make sure that you keep this notebook named as "a4.ipynb" 

Any other packages or tools, outside those listed in the assignments or Canvas, should be cleared
by Dr. Brown before use in your submission.

# Q0 - Setup

The following code checks whether your notebook is running on Gradescope (GS), Colab (COLAB), or the Linux Python environment you were asked to set up.

In [1]:
import re 
import os
import platform 
import sys 

# flag if notebook is running on Gradescope 
if re.search(r'am', platform.uname().release): 
    GS = True
else: 
    GS = False

# flag if notebook is running on Colaboratory 
try:
    import google.colab
    COLAB = True
except ImportError:
    COLAB = False

# flag if running on Linux lab machines. 
cname = platform.uname().node
if re.search(r'(guardian|colossus|c28|coc-15954-m)', cname):
    LLM = True 
else: 
    LLM = False

print("System: GS - %s, COLAB - %s, LLM - %s" % (GS, COLAB, LLM))

System: GS - False, COLAB - False, LLM - True


## Notebook Setup

It is good practice to list all imports at the top of the notebook. You can import modules in later cells as needed, but listing them at the top makes it clear which packages must be available and installed.

If you are doing development on Colab, the otter-grader package is not available, so you will need to install it with pip (uncomment the cell directly below).

In [2]:
# Only uncomment if you are developing on Colab 
# if COLAB == True: 
#     print("Installing otter:")
#     !pip install otter-grader==4.2.0 

In [3]:
# Import standard DS packages 
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import math
import scipy
import statistics
import textwrap
%matplotlib inline


from sklearn.model_selection import train_test_split, StratifiedKFold 
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn import tree        # decision tree classifier
from sklearn import neighbors   # knn classifier
from sklearn import naive_bayes # naive bayes classifier 
from sklearn import svm         # svm classifier
from sklearn import ensemble    # ensemble classifiers
from sklearn import metrics     # performance evaluation metrics
from sklearn import model_selection
from sklearn import preprocessing 
from sklearn.decomposition import PCA
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer

from sklearn.cluster import AgglomerativeClustering
from scipy.spatial.distance import pdist
from scipy.spatial.distance import squareform
from sklearn import cluster
from scipy.cluster.hierarchy import dendrogram
from scipy.cluster import hierarchy

# Package for Autograder 
import otter 
grader = otter.Notebook()

In [4]:
grader.check("q0")

# Q1 - Text Classification 


You will predict whether scenes in Shakespeare's plays come from the comedies or the histories.  Shakespeare's comedies include plays such as The Taming of the Shrew, The Merchant of Venice, Much Ado About Nothing, and more.  The histories include Richard II, Richard III, Henry IV part 1, Henry IV part 2, Henry V, and Henry VI (parts 1-3). 

The plays were downloaded from the [Shakespeare Corpus](http://hdl.handle.net/11040/24448).  Note, the original plays were downloaded from [Project Gutenberg](https://www.gutenberg.org/). 

Note, the plays have already had significant preprocessing.  The plays have been scrubbed by: removing digits, making the file all lowercase, and removing punctuation, excluding hyphens and word-internal apostrophes. Also, the character names and stage directions have been removed manually. An example of the text would be like this:

*Before scrubbing:*

    ADAM. Yonder comes my master, your brother.
    ORLANDO. Go apart, Adam, and thou shalt hear how he will shake me
    up. [ADAM retires]
    OLIVER. Now, sir! what make you here?

*After scrubbing:*

    yonder comes my master your brother
    go apart adam and thou shalt hear how he will shake me
    up
    now sir what make you here

The text files are split into two classes: negative (comedies) and positive (histories).  


## Q1(a) - Load the Data 

Load the plays into a list `textdata` and a np.ndarray `yvalues`.  I highly suggest using `scikit-learn`'s `load_files` function, with the `random_state` set to 42.  

 

In [7]:
# Load the plays data     
plays = load_files("data/shakespeare", random_state=42)

# Extract text data (the loaded Bunch is named `plays`)
textdata = plays.data

# Extract target values
yvalues = plays.target

print("Samples per class: {}".format(np.bincount(yvalues)))

plays.filenames[0:10]

IsADirectoryError: [Errno 21] Is a directory: 'data/shakespeare/histories/.ipynb_checkpoints'

In [None]:
grader.check("q1a")

## Q1(b) - Prepare the Data 

Split the data into `text_trainval`, `text_test` and `y_trainval`, `y_test` variables.  Use 20% of the data in the test set with a `random_state` of 42 and make sure to stratify the split (the data is imbalanced). 
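
As a rough illustration of what stratification does, the sketch below splits a small made-up dataset rather than the plays. The names `toy_text` and `toy_y` are hypothetical stand-ins, and the class ratio is chosen only to make the effect visible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for the text list and label array (hypothetical data)
toy_text = ["doc %d" % i for i in range(20)]
toy_y = np.array([0] * 15 + [1] * 5)   # imbalanced: 75% class 0

# stratify=toy_y keeps the class ratio the same in both pieces
toy_trainval, toy_test, toy_y_trainval, toy_y_test = train_test_split(
    toy_text, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

print(np.bincount(toy_y_trainval), np.bincount(toy_y_test))
```

Without `stratify`, a small test set drawn from imbalanced data could end up with very few samples of the minority class.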

In [None]:
# Split the data 




In [None]:
grader.check("q1b")

## Q1(c)  - Explore the Data

Create a document-term count matrix for the "trainval" data using the default tokenizer with the standard English stop words removed, and store it in `dtm_trainval`.

Store the names of the terms in the dtm matrix in the variable `vocab`.

In [None]:
# Create document-term count matrix for the "trainval" text data 




In [None]:
grader.check("q1c")

<!-- BEGIN QUESTION -->

## Q1(d) - Explore the Data 

Create a plot showing the top 15 most frequently used words in the trainval text data. 

In [None]:
# Create a plot of the top 15 most frequently used words 




<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

## Q1(e) - Explore the Data 

For the trainval text data, plot the top 15 most frequently used words in the histories and the comedies.  Put these two bar plots side-by-side to compare the results. 

In [None]:
# Create a plot of the top 15 most frequently used words in the 
#  Comedies and Histories. 




<!-- END QUESTION -->

## Q1(f) - Bernoulli Naive Bayes 

Let's now explore using Bernoulli Naive Bayes as a classifier, `bern_nb`, to predict the type of play. 

We will use the trainval / test split created above to train the model and then evaluate its performance. 

Create the training data, `X_trainval`, as a binary document-term matrix built with the default tokenizer and stop words removed. Keep only terms that appear in at least 5 documents, and limit the matrix to the top 5000 features.  

Calculate the training accuracy `train_acc_bern` and testing accuracy `test_acc_bern` for the model.
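
A minimal sketch of the binary-features-plus-Bernoulli-Naive-Bayes workflow, on a made-up four-document corpus. All names are hypothetical, and `min_df=2` is used only because the toy corpus is too small for the threshold of 5 requested above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# Hypothetical toy data: label 1 = "history"-like, 0 = "comedy"-like
toy_train = ["crown king war", "king battle crown",
             "love fool jest", "fool love marriage"]
toy_y = [1, 1, 0, 0]
toy_new = ["king crown battle", "love jest fool"]

# binary=True records presence/absence instead of counts;
# min_df and max_features prune the vocabulary (toy values here)
toy_vect = CountVectorizer(binary=True, stop_words="english",
                           min_df=2, max_features=5000)
X_toy = toy_vect.fit_transform(toy_train)   # learn the vocabulary on training text only
X_new = toy_vect.transform(toy_new)         # reuse that vocabulary on unseen text

toy_nb = BernoulliNB()
toy_nb.fit(X_toy, toy_y)
print(toy_nb.score(X_toy, toy_y), toy_nb.predict(X_new))
```

The important design point is that the vectorizer is fitted on the training text and only applied (`transform`) to held-out text, so the test data never influences the vocabulary. A multinomial model follows the same pattern with count features instead of binary ones.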



In [None]:
# Run Bernoulli Naive Bayes model




In [None]:
grader.check("q1f")

## Q1(g) - Multinomial Naive Bayes 

Let's now explore using multinomial Naive Bayes as a classifier, `mult_nb`, to predict the type of play. 

We will use the trainval / test split created above to train the model and then evaluate its performance. 

Create the training data, `X_trainval`, with features built using the default tokenizer and stop words removed. Keep only terms that appear in at least 5 documents, and limit the features to the top 5000.  

Calculate the training accuracy `train_acc_mult` and testing accuracy `test_acc_mult` for the model.


In [None]:
# Run Multinomial Naive Bayes model




In [None]:
grader.check("q1g")

<!-- BEGIN QUESTION -->

## Q1(h) - Naive Bayes Models 

Looking at the results of the two models above. Answer the following questions.  


Which of the two models is preferred?  Why?   (10 words or less)


What is a problem for both models?  How might you solve it? (12 words or less)

<!-- END QUESTION -->

## Q1(i) - Other Models 

Let's now look to explore using other models. 

You will set up a pipeline, `pipe`, that will use a Random Forest model with 100 trees and a `random_state` = 42.  

In the pipeline (`param_grid`), you will consider using both a document term count matrix as well as a TF-IDF matrix. 
In either case, limit the matrix to words that appear in at least 5 documents and remove English stop words.  Consider features of unigrams, unigrams + bigrams, and bigrams.  Examine a maximum feature limit of either 2500 or 5000.  

Optimize your choice of hyperparameters using GridSearchCV, `grid`, with stratified 5-fold cross-validation (random_state = 42), and select the parameters using AUC. (See how to set up the scorer below.)

Note: do not run the jobs in parallel, because you may exceed the memory resources of the autograder on Gradescope. 
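
One possible way to let a grid search switch between two vectorizers is sketched below on a made-up corpus. This is an illustration under stated assumptions, not the expected solution: all `toy_*` names are hypothetical, the forest uses only 10 trees to keep the toy run quick, the grid covers only part of the search space described above, and `shuffle=True` is assumed because `StratifiedKFold` only uses `random_state` when shuffling.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn import metrics

# Hypothetical toy corpus: 10 "history"-like and 10 "comedy"-like documents
toy_text = ["king crown war battle"] * 10 + ["love fool jest marriage"] * 10
toy_y = [1] * 10 + [0] * 10

toy_pipe = Pipeline([
    ("vect", CountVectorizer()),
    ("clf", RandomForestClassifier(n_estimators=10, random_state=42)),
])

# Listing estimator objects under the step name "vect" lets the search swap
# the whole step; "vect__..." keys tune whichever vectorizer is in place.
toy_grid_params = {
    "vect": [CountVectorizer(), TfidfVectorizer()],
    "vect__ngram_range": [(1, 1), (1, 2), (2, 2)],
}

toy_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
toy_scorer = metrics.make_scorer(metrics.roc_auc_score)

# n_jobs is left at its default, so the fits run sequentially
toy_grid = GridSearchCV(toy_pipe, toy_grid_params, cv=toy_cv, scoring=toy_scorer)
toy_grid.fit(toy_text, toy_y)
print(toy_grid.best_score_, toy_grid.best_params_)
```

Keeping the vectorizer inside the pipeline means each cross-validation fold builds its vocabulary from that fold's training portion only.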

In [None]:
# Run Pipeline to find best model

pipe = ...

param_grid = ...

cvStrat = ...

score_fn = metrics.make_scorer(metrics.roc_auc_score, needs_threshold = False)
grid = ...


print("Best cross-validation score: {:.2f}".format(grid.best_score_))
print("Best parameters:\n{}".format(grid.best_params_))

In [None]:
grader.check("q1i")

## Q1(j) - Explore the Results 

Calculate the AUC on the test text, `auc_test`.   

Gather the importances of the features in the best model in `importance`. [Feature Importance Example](https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html)

Create a bar plot with the top 10 features sorted by importance.  


In [None]:
# Calculate performance on test data `auc_test` 

# Create plot of top 10 features sorted by importance. 




In [None]:
grader.check("q1j")

# Q2

Consider methods to cluster NBA players based on their statistics. 




## Q2(a) 

Load in data for NBA players from the 2018-2019 season. 

Filter the players to only consider those who have played in more than 20 games.  

Ignore the first 7 columns as well as ignore columns of statistics with percentages (FG%, 3P%, 2P%, eFG%, FT%). 

In [None]:
# Load in data and filter out requested rows and columns. 

nba = ...



nba.head()

In [None]:
grader.check("q2a")

## Q2(b)

The features have different ranges, therefore we should scale the data before considering the clustering analysis. Scale the data using min-max normalization with range of [0, 1].
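
A small sketch of min-max normalization on made-up numbers. The column names are borrowed from the statistics mentioned in this assignment, but the values are invented, and wrapping the result back into a DataFrame is an assumption rather than a stated requirement.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy statistics
toy_stats = pd.DataFrame({"MP": [10.0, 20.0, 30.0], "FG": [1.0, 2.0, 5.0]})

toy_scaler = MinMaxScaler(feature_range=(0, 1))

# fit_transform returns a NumPy array, so rebuild the DataFrame to keep labels
toy_scaled = pd.DataFrame(
    toy_scaler.fit_transform(toy_stats),
    columns=toy_stats.columns,
    index=toy_stats.index,
)
print(toy_scaled)
```

Each column is mapped with (x - min) / (max - min), so every feature ends up spanning exactly [0, 1].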

In [None]:
# Scale the data 

scaler = ...
nbaScaled = ...


In [None]:
grader.check("q2b")

## Q2(c)

Run Kmeans clustering on the data with k=2, . . . , 12.  Set the `random_state` in the Kmeans method to 42 and `n_init` to 10. 
For each value of k, keep track of the within-cluster variation (this quantity is referred to by different terms, such as "inertia" and total "within-cluster sum-of-squares"), the Calinski-Harabasz score, and the Davies-Bouldin index on the resulting clusters. 
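
The bookkeeping for such a sweep might look like the sketch below, which runs on synthetic blobs and a shorter range of k. The `toy_*` names and the data are hypothetical; only the three quantities tracked match the description above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn import metrics

rng = np.random.default_rng(0)
# Hypothetical toy data: three well-separated 2-D blobs of 30 points each
toy_X = np.vstack(
    [rng.normal(loc=c, scale=0.3, size=(30, 2)) for c in (0.0, 5.0, 10.0)]
)

toy_sse, toy_ch, toy_db = [], [], []
k_values = range(2, 7)
for k in k_values:
    km = KMeans(n_clusters=k, random_state=42, n_init=10).fit(toy_X)
    toy_sse.append(km.inertia_)  # within-cluster sum of squares
    toy_ch.append(metrics.calinski_harabasz_score(toy_X, km.labels_))
    toy_db.append(metrics.davies_bouldin_score(toy_X, km.labels_))

for k, s, c, d in zip(k_values, toy_sse, toy_ch, toy_db):
    print(k, round(s, 2), round(c, 1), round(d, 3))
```

When reading the results, a higher Calinski-Harabasz score and a lower Davies-Bouldin index both indicate better-separated clusters, while inertia is usually judged by where its decrease levels off.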

In [None]:
# Run Kmeans 

sse = []
dbscore = []
chscore = []




In [None]:
grader.check("q2c")

## Q2(d)

Assume the best number of clusters is 4 (depending on which measure is used, a different number of clusters may be preferred for this data). 

Run Kmeans again with this value for $k$ (use `n_init` = 10 and `random_state` = 10).

Create a DataFrame `clusterStats` with the mean statistics (centers) of each group. 

The DataFrame should have rows for each cluster group 0, 1, 2, 3 and columns for the mean statistics.  

Add a column `Num` reporting the number of samples in each group. 
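
One way to assemble such a table is sketched below on five made-up, already-scaled rows with two clusters. The `toy_*` names are hypothetical, and using `cluster_centers_` as the per-cluster means relies on the algorithm having converged.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical toy data, already scaled to [0, 1]
toy_scaled = pd.DataFrame({
    "MP": [0.0, 0.1, 0.9, 1.0, 0.95],
    "FG": [0.0, 0.2, 0.8, 1.0, 0.9],
})

toy_km = KMeans(n_clusters=2, n_init=10, random_state=10).fit(toy_scaled)

# One row per cluster: at convergence the centers are the per-cluster column means
toy_cluster_stats = pd.DataFrame(toy_km.cluster_centers_, columns=toy_scaled.columns)

# Count members per cluster; bincount is indexed by label, matching the row order
toy_cluster_stats["Num"] = np.bincount(toy_km.labels_, minlength=2)
print(toy_cluster_stats)
```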


In [None]:
# Create a Data Frame for the mean statistics of each group 


clusterStats[['Num', 'MP', 'FG', '3P', 'FT']]

In [None]:
grader.check("q2d")

## Q2(e)

Report the same statistics as in (d), but using the original data scaling (reverse the scaling back to the original data range). 

Store results in `clusterStatsOrig`; this DataFrame should not have the "Num" column.
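
Reversing a min-max scaling can be sketched as follows with made-up numbers. The centers here are invented for illustration; in practice they would come from the fitted clustering, and the scaler must be the same one that was fitted on the original data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical original data: column ranges are [10, 30] and [1, 5]
toy_orig = np.array([[10.0, 1.0], [20.0, 2.0], [30.0, 5.0]])
toy_scaler = MinMaxScaler().fit(toy_orig)

# Hypothetical cluster centers expressed in the scaled [0, 1] space
toy_centers_scaled = np.array([[0.0, 0.0], [0.75, 0.625]])

# inverse_transform applies x * (max - min) + min column by column
toy_centers_orig = toy_scaler.inverse_transform(toy_centers_scaled)
print(toy_centers_orig)
```

Because min-max scaling is a linear map plus a shift, un-scaling a cluster mean gives the same result as averaging the un-scaled members of that cluster.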



In [None]:
# Create a Data Frame for the mean statistics of each group (using the 
#   original data scaling)


clusterStatsOrig = ...

clusterStatsOrig[['MP', 'FG', '3P', 'FT']]

In [None]:
grader.check("q2e")

<!-- BEGIN QUESTION -->

## Q2(f) 

Apply PCA to the basketball data.  Plot the first two principal components, colored by the best group labels found above.  
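
A minimal sketch of projecting data onto its first two principal components, using random numbers as a hypothetical stand-in for the scaled statistics. The plotting step is described in a comment rather than executed.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
toy_X = rng.normal(size=(50, 5))   # hypothetical stand-in for the scaled statistics

toy_pca = PCA(n_components=2)
toy_pcs = toy_pca.fit_transform(toy_X)   # shape (n_samples, 2)

print(toy_pcs.shape, toy_pca.explained_variance_ratio_)
# A scatter plot would use toy_pcs[:, 0] and toy_pcs[:, 1] as x and y,
# with the cluster labels passed as the color argument.
```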

In [None]:
# Run PCA on the nba data and plot the first two principal components
#  colored by the group labels. 




<!-- END QUESTION -->

# Q3 : Clustering - Spotify Music

For this problem you will look at popular streaming music, specifically Spotify's top 100 streaming songs.  Each song is described by a set of properties: `duration`, `energy`, `key`, etc. 


## Q3(a) - Load and Prepare the Data 

Load in the `music.csv` data.  

The clustering algorithms will only consider the variables from `duration` through the last column of the DataFrame. 

Standardize the variables to be used in clustering.  
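
Selecting a run of columns by label and standardizing them might look like the sketch below. The DataFrame is made up: `duration` and `energy` are property names mentioned above, while the `artist` column name and all values are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy table
toy_music = pd.DataFrame({
    "artist": ["A", "B", "C"],
    "duration": [180.0, 200.0, 220.0],
    "energy": [0.2, 0.5, 0.8],
})

# .loc with a label slice keeps every column from "duration" to the last one
toy_feats = toy_music.loc[:, "duration":]

# Standardizing gives each column mean 0 and unit variance
toy_std = StandardScaler().fit_transform(toy_feats)
print(toy_std.mean(axis=0), toy_std.std(axis=0))
```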

In [None]:
# Load in music data 

music = pd.read_csv(...)

music.head()

In [None]:
grader.check("q3a")

## Q3(b) - Hierarchical Clustering 

Perform Hierarchical clustering with **single** linkage on just the top 30 songs. 

Report results in a dendrogram, `dg_single` and label the samples by the Artist.  
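
The general shape of a SciPy hierarchical clustering call is sketched below on five made-up points. The `toy_*` names are hypothetical, and `no_plot=True` is used here only so the example runs without a display; whether the autograder expects the returned dictionary is an assumption.

```python
import numpy as np
from scipy.cluster import hierarchy

# Hypothetical standardized features for five items, with labels
toy_X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.2, 5.0], [10.0, 0.0]])
toy_labels = ["a", "b", "c", "d", "e"]

# method can be "single", "complete", or "average";
# Euclidean distance is the default metric
toy_link = hierarchy.linkage(toy_X, method="single")

# no_plot=True returns the layout dictionary without drawing;
# omit it to draw the tree with matplotlib
toy_dg = hierarchy.dendrogram(toy_link, labels=toy_labels, no_plot=True)
print(toy_dg["ivl"])   # leaf labels in left-to-right order
```

Each row of the linkage matrix records one merge: the two clusters joined, the distance at which they were joined, and the size of the new cluster. Changing only the `method` argument switches between the linkage types.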



In [None]:
# Perform Hierarchical clustering with single linkage on top 30 songs 
# Report results in a dendrogram, dg_single


In [None]:
grader.check("q3b")

## Q3(c) - Hierarchical Clustering, part 2 

Perform Hierarchical clustering with **complete** linkage on just the top 30 songs.

Report results in a dendrogram, `dn_complete` and label the samples by the Artist.

In [None]:
# Perform Hierarchical clustering with complete linkage on top 30 songs 
# Report results in a dendrogram, dg_complete


In [None]:
grader.check("q3c")

## Q3(d) - Hierarchical Clustering, part 3

Perform Hierarchical clustering with **average** linkage on just the top 30 songs.

Report results in a dendrogram, `dn_average` and label the samples by the Artist.

In [None]:
# Perform Hierarchical clustering with average linkage on top 30 songs 
# Report results in a dendrogram, dg_average


In [None]:
grader.check("q3d")

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

**NOTE** the submission must be run on the campus Linux machines.  See the instructions in the Canvas assignment.

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False)