### Course Announcements

**Due tonight (11:59 PM)**
- D6
- Q6
- A3
- Weekly Project Survey (optional)

**Notes**
- Prof Ellis' Office hours
    - this Friday only (aka today): 10-11; 2-2:30, 3-3:30
    - Weeks 8-10: Wed 1:30-2:30; Fri 10-11

**Mid-Course Survey Summary** (N=237)

- Least Liked:
    - Too much work due on Fridays!
    - Quizzes (feel too high stakes!)
    - Discussion labs are too long!
    - Ah! Project! (it's a lot! had to change topics!)
    - Remote learning (generally, and for the project)

**Planning ahead** & Changes:

- Wk 8: Q7<sup>+</sup>, D7<sup>*</sup>, Checkpoint #2: EDA
- Wk 9: Q8<sup>+</sup>, D8<sup>*</sup>, A4
- Wk 10: Q9<sup>+</sup>, ~D9~
- Finals Week: Final Report, Final Video, Project Survey

<sup>*</sup> will shorten  
<sup>+</sup> will make more straightforward

# Machine Learning in Python

- Tools: scikit-learn (`sklearn`)
    - Data Partitioning
    - Feature selection
    - Modeling: SVM
    - Model Assessment


For more reading on scikit-learn (`sklearn`) and machine learning in Python: https://scikit-learn.org/stable/index.html

# Machine Learning: General Steps

1. Data Partitioning
2. Feature Selection
3. Model
4. Model Assessment

## Setup

In [None]:
# import ds/plotting packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Import nltk package 
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# import random for randomizing
import random

# ignore warnings
import warnings
warnings.filterwarnings('ignore')

# scikit-learn imports
from sklearn.svm import SVC
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report, precision_recall_fscore_support

In [None]:
# Uncomment if you need to download the NLTK English tokenizer and the stopwords of all languages
# nltk.download('punkt')
# nltk.download('stopwords')

# Example: Class Responses

## Data

Student responses on COGS 108 Mid-course survey to the following two questions: 

- What have you enjoyed MOST about COGS 108 so far? Please explain.
- What have you enjoyed LEAST about COGS 108 so far? Please explain.

In [None]:
# read data in
# 1 = most; 0 = least
df = pd.read_csv('https://raw.githubusercontent.com/shanellis/datasets/master/COGS108_ml.csv', encoding="ISO-8859-1")
df.tail()

In [None]:
# randomly sort data frame
df = df.sample(frac=1, random_state=200).reset_index(drop=True)
df.head()

Randomly sorted data frame:
- for selection of training and test set
- will be approximately balanced between outcomes in each

In [None]:
# see how much data we're working with
df.shape

### Train, Test, Validate

- We'll train the model on 80% of the responses from the past five quarters' surveys
- We'll test the model on the 20% we've held out
- We'll validate the model on this quarter's responses

In [None]:
## Train/Test
df_traintest = df[df['quarter']!='sp21']

## Validation
df_validation = df[df['quarter']=='sp21']

In [None]:
print(df_traintest.shape, df_validation.shape)

- train/test
    - Sp19: 631
    - Wi20: 492
    - Sp20: 705
    - Fa20: 551
    - Wi21: 615
- validate:
    - Sp21: 466

#### Clicker Question #1

How well do you think we will be able to predict whether a student comment is a response to what they liked most vs what they liked least in our ***test*** dataset?

- A) Accuracy ~0%
- B) Accuracy ~25%
- C) Accuracy ~50%
- D) Accuracy ~75%
- E) Accuracy ~100%

#### Clicker Question #2

How well do you think we will be able to predict whether a student comment is a response to what they liked most vs what they liked least in our ***validation*** dataset?

- A) Accuracy ~0%
- B) Accuracy ~25%
- C) Accuracy ~50%
- D) Accuracy ~75%
- E) Accuracy ~100%

## Prediction Task: 

**Classify text from students as 'most liked' or 'least liked'**

#### 11 Steps to Prediction:

1. Specify parameters for TF-IDF calculation
2. Calculate TF-IDF from text input (predictors)
3. Extract most or least (outcome)
4. Specify how data will be partitioned
5. Partition the data
6. Train model
7. Predict in training
8. Predict in testing
9. Assess accuracy in training
10. Assess accuracy in test set
11. Assess accuracy in validation set

### Data Processing

Step 1: Determine how you'll convert a collection of raw documents to a matrix of TF-IDF features.

In [None]:
# Create vectorizer & specify parameters
tfidf = TfidfVectorizer(sublinear_tf=True,        # apply sublinear TF scaling
                        analyzer='word',          # build features from words (not characters)
                        max_features=500,         # keep only the 500 most frequent tokens
                        tokenizer=word_tokenize)  # tokenize with NLTK's word_tokenize

* sublinear TF scaling - replaces term frequency (TF) with $1 + \log(TF)$
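
To make the effect of sublinear scaling concrete, here is a small worked example. The counts are made up for illustration, and the natural log is used, which is what scikit-learn applies.

```python
import math

# Hypothetical raw counts of one token in four different documents
raw_counts = [1, 2, 10, 100]

# Sublinear scaling: tf -> 1 + ln(tf), applied only where tf > 0
scaled = [1 + math.log(tf) for tf in raw_counts]

for tf, s in zip(raw_counts, scaled):
    print(f"raw tf = {tf:>3}  ->  sublinear tf = {s:.2f}")
```

A token that appears 100 times ends up weighted only a few times more than a token that appears once, so very repetitive responses do not dominate the features.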

Step 2: Generate matrix of TF-IDF features.

In [None]:
# Learn vocabulary and idf, return term-document matrix.
# convert to a dense array; this is our predictor matrix
tfidf_X = tfidf.fit_transform(df_traintest['response']).toarray()

# take a look at the output
print(tfidf_X.shape)

print("min: " , np.min(tfidf_X), '\n',
      "mean: ", np.mean(tfidf_X), '\n',
      "max: ",  np.max(tfidf_X))

In [None]:
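## get IDF to visualize
idf = tfidf.idf_
# the vectorizer was already fitted above, so no refit is needed;
# get_feature_names() was removed in scikit-learn 1.2 in favor of get_feature_names_out()
rr = dict(zip(tfidf.get_feature_names_out(), idf))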

token_weight = pd.DataFrame.from_dict(rr, orient='index').reset_index()
token_weight.columns=('token','weight')
token_weight = token_weight.sort_values(by='weight', ascending=False)
token_weight.head() 

In [None]:
sns.barplot(x='token', 
            y='weight', 
            data=token_weight[0:10], 
            color="gray")            
plt.title("Inverse Document Frequency (IDF) per token")
fig = plt.gcf()
fig.set_size_inches(15,5);
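
As a rough check on what the IDF weights mean, the sketch below recomputes them by hand on three made-up sentences. It assumes scikit-learn's default smoothed formula, $\ln\frac{1 + n}{1 + df(t)} + 1$, where $n$ is the number of documents and $df(t)$ is the number of documents containing token $t$. The toy sentences and the default vectorizer settings are assumptions and not the lecture's configuration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus
docs = ["the project was fun",
        "the quizzes were stressful",
        "the project took a lot of time"]

vec = TfidfVectorizer()   # default settings (smooth_idf=True)
vec.fit(docs)

# Tokens ordered by their column index in the fitted vocabulary
tokens = sorted(vec.vocabulary_, key=vec.vocabulary_.get)

# Document frequency: in how many documents does each token occur?
analyze = vec.build_analyzer()
doc_tokens = [set(analyze(d)) for d in docs]
df_counts = np.array([sum(t in toks for toks in doc_tokens) for t in tokens])

# Smoothed IDF computed by hand
n_docs = len(docs)
manual_idf = np.log((1 + n_docs) / (1 + df_counts)) + 1

for t, w in zip(tokens, manual_idf):
    print(f"{t:<10} df={df_counts[tokens.index(t)]}  idf={w:.3f}")
```

Tokens that occur in every document (here, "the") get the lowest weight, while tokens that occur in only one document get the highest.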

Step 3: Extract outcome variable

In [None]:
# specify outcome variable
tfidf_Y = np.array(df_traintest['most_least'])
tfidf_Y[0:5]

## Data Partitioning & Feature Selection

80/20 split

We'll assess how well the text of a response predicts whether it describes what a student liked most or liked least.

Step 4: Determine split in data.

In [None]:
# specify training and test
num_training = int(len(df_traintest)*0.8)
num_testing = len(df_traintest)-num_training

print(num_training, num_testing)

Step 5: Split (partition) the data.

In [None]:
# get data
# because rows have been randomized previously
tfidf_train_X = tfidf_X[:num_training]
tfidf_train_Y = tfidf_Y[:num_training]
tfidf_test_X = tfidf_X[num_training:]
tfidf_test_Y = tfidf_Y[num_training:]
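
An alternative to slicing by hand is scikit-learn's `train_test_split`, which shuffles and partitions in one call. The sketch below uses small made-up arrays in place of the TF-IDF matrix and outcome vector; `stratify` is an optional extra that keeps the class balance the same in both partitions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for the predictor matrix and outcome vector
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# 80/20 split; random_state makes the shuffle reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=200, stratify=y)

print(X_train.shape, X_test.shape)
print("test outcomes:", y_test)
```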

#### Clicker Question #3

Looking at the code above and thinking about what we've done so far in this analysis, what is stored in `tfidf_test_Y`?

- A) predictor variable - training data
- B) outcome variable - training data
- C) predictor variable - test data
- D) outcome variable - test data
- E) validation DataFrame

In [None]:
# take a look at the data we're using
print(tfidf_train_X.shape)
tfidf_train_X

## Model

### SVM: Support Vector Machines

- simple & interpretable machine learning model
- with a linear kernel, separates classes using a linear function of the features
- classification task
- supervised
    - input: labeled training data
    - model determines hyperplane that best discriminates between categories

### SVM: Tuning Parameters
- **regularization** parameter
    - can determine how this line is drawn
    - can increase accuracy of prediction
    - can lead to overfitting of the data
- **kernel** parameter
    - specifies how to model & transform data
    

For more reading on SVMs using `sklearn`: https://scikit-learn.org/stable/modules/svm.html
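
A minimal sketch of how these two tuning parameters appear in `SVC`, using made-up 2-D data rather than the survey responses. In scikit-learn the regularization parameter is `C`, and a smaller `C` means stronger regularization. The values of `C` tried here are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Toy data (hypothetical): two overlapping 2-D clusters, 50 points each
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(-1, 1, size=(50, 2)),
               rng.normal(1, 1, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

support_counts = {}
for C in (0.01, 1, 100):
    # smaller C = stronger regularization = wider, more tolerant margin
    model = SVC(kernel='linear', C=C).fit(X, y)
    support_counts[C] = int(model.n_support_.sum())
    print(f"C={C:<6} support vectors={support_counts[C]:<3} "
          f"training accuracy={model.score(X, y):.2f}")

# the kernel changes the shape of the boundary; 'rbf' allows a curved one
rbf_model = SVC(kernel='rbf', C=1).fit(X, y)
rbf_accuracy = rbf_model.score(X, y)
print("rbf training accuracy:", round(rbf_accuracy, 2))
```

Comparing training accuracy alone can be misleading, since a large `C` or a flexible kernel may fit the training data closely and still generalize poorly.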

### Model Generation
    

Step 6: Generate and train the model.

In [None]:
# uncomment to read documentation for model
# SVC?

In [None]:
# function we'll use to run the model
def train_SVM(X, Y, kernel='linear'):
    model = SVC(kernel=kernel)
    model.fit(X, Y)
    return model

In [None]:
# train model
svm_model = train_SVM(tfidf_train_X, tfidf_train_Y)
type(svm_model)

### Training Data

Step 7: Predict in the training data

In [None]:
# predict on training
df_predicted_train_Y = svm_model.predict(tfidf_train_X)
print(df_predicted_train_Y[0:5])
len(df_predicted_train_Y)

In [None]:
# see how many were predicted most vs. least
pd.Series(df_predicted_train_Y).value_counts()

### Testing Data

Step 8: Predict in the testing data

In [None]:
# predict on testing
df_predicted_test_Y = svm_model.predict(tfidf_test_X)
print(df_predicted_test_Y[0:5])
len(df_predicted_test_Y)

In [None]:
# see how many were predicted most vs. least
pd.Series(df_predicted_test_Y).value_counts()

## Accuracy Assessment

- RMSE (continuous)
- Accuracy, Sensitivity, Specificity, AUC
    - TP, TN, FP, FN

**Accuracy** - What % were predicted correctly?  
**Sensitivity (Recall)** - Of those that were actually positives, what % were predicted to be positive? $\frac {TP}{(TP + FN)}$  
**Specificity** - Of those that were actually negatives, what % were predicted to be negative? $\frac {TN}{(TN + FP)}$

**Precision (Positive Predictive Value, PPV)** = $\frac {TP}{(TP + FP)}$

- probability that predicted positive truly is positive
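
The definitions above can be worked through by hand on a small example. The labels below are invented for illustration, with 1 treated as the positive class.

```python
# Hypothetical true and predicted labels (1 = positive, 0 = negative)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
TP = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
TN = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives
FP = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
FN = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives

accuracy = (TP + TN) / len(pairs)
sensitivity = TP / (TP + FN)   # recall
specificity = TN / (TN + FP)
precision = TP / (TP + FP)     # positive predictive value

print(f"TP={TP} TN={TN} FP={FP} FN={FN}")
print(f"accuracy={accuracy:.2f} sensitivity={sensitivity:.2f} "
      f"specificity={specificity:.2f} precision={precision:.2f}")
```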

### Training Data

Step 9: Assess accuracy in training data

In [None]:
print(classification_report(tfidf_train_Y, df_predicted_train_Y))

**support** - the number of occurrences of each class  
**precision (PPV)** - ability of the classifier not to label a negative sample as positive  
**recall (sensitivity)** - ability of the classifier to find all the positive samples


**f1-score** - harmonic mean of the precision and recall; score reaches its best value at 1 and worst score at 0  
**macro average** - averaging the unweighted mean per label  
**weighted average** - averaging the support-weighted mean per label  
**micro average** - averaging the total true positives, false negatives and false positives
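
The difference between the averages is easiest to see on imbalanced data. This sketch uses invented labels (eight of class 0, two of class 1) and `precision_recall_fscore_support`, which is already imported in Setup, to compare them.

```python
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical imbalanced labels: class 0 is four times as common as class 1
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 8 + [0, 1]   # one of the two class-1 samples is missed

# Each call returns (precision, recall, f-score, support)
per_class = precision_recall_fscore_support(y_true, y_pred, average=None)
macro = precision_recall_fscore_support(y_true, y_pred, average='macro')
weighted = precision_recall_fscore_support(y_true, y_pred, average='weighted')
micro = precision_recall_fscore_support(y_true, y_pred, average='micro')

print("recall per class:", per_class[1])
print("macro recall:    ", macro[1])     # plain mean of the per-class values
print("weighted recall: ", weighted[1])  # mean weighted by class support
print("micro recall:    ", micro[1])     # pooled over all samples
```

The macro average treats both classes equally, so the missed class-1 sample pulls it down more than it pulls down the weighted or micro average.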

In [None]:
# where 'support' comes from
pd.Series(tfidf_train_Y).value_counts()

### Testing Data

Step 10: Assess accuracy in testing data

In [None]:
print(classification_report(tfidf_test_Y, df_predicted_test_Y))

#### Clicker Question #4

Given this output, would you use this model to predict whether or not text was something someone liked or disliked about COGS 108?

- A) Yes
- B) No
- C) Unsure

### Validation Data

Step 11: Assess accuracy in validation data

In [None]:
df_validation.head()

In [None]:
# the ground truth
tfidf_validation_Y = np.array(df_validation['most_least'])

# predicted values from class responses
# use transform (not fit_transform) so the validation text is encoded with the
# vocabulary and IDF weights the model was trained on
tfidf_validation_X = tfidf.transform(df_validation['response']).toarray()
df_predicted_validation_Y = svm_model.predict(tfidf_validation_X)

# assess accuracy
print(classification_report(tfidf_validation_Y, df_predicted_validation_Y))
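
A minimal sketch, on made-up sentences, of how a fitted `TfidfVectorizer` encodes new text. `transform` reuses the vocabulary learned during fitting, whereas fitting again on the new text learns a different vocabulary, so the columns no longer correspond to the features a model was trained on. The sentences and default vectorizer settings are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical training and new responses
train_docs = ["loved the lectures", "the project was too long"]
new_docs = ["the quizzes were stressful"]

vec = TfidfVectorizer()
X_train = vec.fit_transform(train_docs)  # learns vocabulary + IDF from training text
X_new = vec.transform(new_docs)          # reuses them; unseen words are ignored

# A vectorizer fitted separately on the new text learns its own vocabulary
refit = TfidfVectorizer().fit(new_docs)

print("training vocabulary:", sorted(vec.vocabulary_))
print("refit vocabulary:   ", sorted(refit.vocabulary_))
print("shapes:", X_train.shape, X_new.shape)
```

With `max_features` set, a refitted vectorizer can produce a matrix of the same width as the training matrix, so the mismatch may not raise an error even though each column now stands for a different token.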

#### Clicker Question #5

Given this output, would you use this model to predict whether or not text was something someone liked or disliked about COGS 108?

- A) Yes
- B) No
- C) Unsure

### Summary

1. 80:20 Partition
2. Specified TF-IDF as predictor and most/least (0,1) as outcome
3. Trained SVM linear classifier
4. Built model on Training data
5. Predicted in training data and on testing data
6. Assessed overall accuracy

### Approaches For Improvement?

- Data Cleaning/Stemming
- Different Tuning Parameters?
- Cross-Validation?
- Train/Test on all data OR Train only Sp20/Fa20/Wi21 (remote quarters)
- Different Model?
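
As one possible starting point for the cross-validation idea, the sketch below runs 5-fold cross-validation with `cross_val_score` on made-up 2-D data. The data, the number of folds, and the linear kernel are assumptions for illustration. With the survey responses, the TF-IDF matrix and outcome vector would take the place of `X` and `y`.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Toy data (hypothetical): two overlapping 2-D clusters, 50 points each
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(-1, 1, size=(50, 2)),
               rng.normal(1, 1, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

# Each fold is held out once while the model trains on the other four
scores = cross_val_score(SVC(kernel='linear'), X, y, cv=5)

print("accuracy per fold:", np.round(scores, 2))
print("mean accuracy:    ", round(scores.mean(), 2))
```

The spread across folds gives a sense of how much a single 80/20 split might over- or under-state performance.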