# Introduction to Machine Learning 2
Jevon Heath, Feb 2020

### ~~Linear~~ Logistic Regression: fitting to a ~~line~~ probability curve
Instead of a continuous outcome, we want a categorical response:
* Is this email junk or not?
* Is this a correct usage or not?
* Is the speaker a native speaker or not?
* Is the backchannel in question laughter, non-verbal, phrasal, or substantive?

For these **classification** questions, the outcome should be a specific category.

But a model can also give us the _probability_ of that predicted outcome.
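
As a minimal sketch of where that number comes from: logistic regression passes a weighted sum of the predictors through the logistic (sigmoid) function, which squeezes any real value into the range 0 to 1. The intercept and slope below are made-up values for illustration, not estimates from any data set.

```python
import math

def sigmoid(z):
    """Map a real-valued score (the log-odds) to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients, chosen only for illustration
intercept = -4.0
slope = 0.5

def predict_proba(x):
    """Probability of the 'positive' category for a single predictor value x."""
    return sigmoid(intercept + slope * x)

for x in [0, 4, 8, 12, 16]:
    p = predict_proba(x)
    label = 'positive' if p >= 0.5 else 'negative'
    print(f"x={x:>2}  p={p:.3f}  predicted: {label}")
```

The category prediction is just the probability compared against a cutoff (0.5 here), which is why one model can give us both.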

#### Assumptions:
* continuous values **for independent variables**
* ~~a linear relationship~~ **independent variables linearly related to the log-odds of the outcome**
* ~~multivariate normality~~
* no multicollinearity among independent variables
* ~~homoskedasticity~~
* **independence of observations**
* **a large sample size**

In [None]:
# Turns on/off pretty printing 
%pprint

# Every returned Out[] is displayed, not just the last one. 
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [None]:
import numpy as np
import pandas as pd
import sklearn               
import nltk 

import matplotlib.pyplot as plt
import seaborn as sns        
sns.set_style('darkgrid')

import statsmodels.api as sm
import statsmodels.formula.api as smf

## Classification: predicting discrete labels

##### Simple case: two labels
Quick example: Given a reaction time, is the participant young or old?

In [None]:
english = pd.read_csv('../../Class-Exercise-Repo/activity3/english_updated.csv', index_col='Index')

In [None]:
english.describe()

In [None]:
english.info()

In [None]:
english['AgeSubject'].value_counts()

In [None]:
logit1 = smf.glm("AgeSubject ~ RTlexdec + WrittenFrequency", data=english, family=sm.families.Binomial())

In [None]:
logit1f = logit1.fit()

In [None]:
logit1f.summary()

##### Complicated case: many labels

Now a textbook example using scikit-learn's built-in 20 Newsgroups data set.
- For detailed explanation, see the textbook section:
 https://jakevdp.github.io/PythonDataScienceHandbook/05.05-naive-bayes.html
- The original data set can be downloaded from: http://qwone.com/~jason/20Newsgroups/
- sklearn's tutorial on the dataset: https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html

Topic classification is our goal: 
- Given a short text, can we identify topic labels? 

Text-based classification requires converting **INDIVIDUAL WORDS into their own features**, which blows up the feature space. Some common strategies:

- Removing stop words and punctuation (depending on your data and goal) 
- Limiting word types to the top 2,000, 5,000, etc.
- Using "sparse vector" format
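
A rough sketch of the last two strategies, using only the standard library: keep the _n_ most frequent word types as the vocabulary, then store each document as a mapping from column index to count, leaving the zeros out. The toy documents and the helper names (`build_vocab`, `to_sparse`) are invented for illustration.

```python
from collections import Counter

docs = [
    "the shuttle reached the space station",
    "the graphics card renders the image",
    "the station orbits in space",
]

def build_vocab(documents, n):
    """Return {word: column index} for the n most frequent word types."""
    counts = Counter(word for doc in documents for word in doc.split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(n))}

def to_sparse(doc, vocab):
    """Store only the nonzero counts as {column index: count}."""
    counts = Counter(word for word in doc.split() if word in vocab)
    return {vocab[word]: c for word, c in counts.items()}

vocab = build_vocab(docs, n=3)
sparse_docs = [to_sparse(doc, vocab) for doc in docs]
print(vocab)
print(sparse_docs)
```

With a realistic vocabulary of thousands of word types, each document touches only a small fraction of the columns, so storing just the nonzero entries saves a lot of memory.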

In [None]:
from sklearn.datasets import fetch_20newsgroups
data = fetch_20newsgroups()   # default is subset='train'; other options are 'test' and 'all'
data.target_names

In [None]:
dir(data)
type(data)

In [None]:
data.target.shape

In [None]:
data.filenames[:5]

In [None]:
data.data[0]

In [None]:
data.target[:5]

### We'll download subsets of the data: four categories only, with separate training and test sets

In [None]:
categories = ['talk.religion.misc', 'soc.religion.christian',
              'sci.space', 'comp.graphics']
train = fetch_20newsgroups(subset='train', categories=categories)
test = fetch_20newsgroups(subset='test', categories=categories)

In [None]:
type(train)
dir(train)

In [None]:
train.data[3]
# Quick! Which topic is this? 

In [None]:
train.target[3]

In [None]:
train.target_names
train.target_names[train.target[3]]

In [None]:
train.target[:100]

In [None]:
len(train.data)
len(test.data)

In [None]:
# The data is not in DataFrame format, but you could shape it into one if you wanted to: 
train_df = pd.DataFrame()
train_df['target'] = train.target
train_df['text'] = train.data
train_df.head()

### Question: how do you extract & represent word-based features from the text?
- **Bag-of-words** approach: reduce a document to the words it contains
- **Occurrence** features: whether or not each word occurs in a document (0 or 1)
- **Count features**: how many times each word occurs in a document (0, 1, 2, ...)
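
To make the difference between occurrence and count features concrete, here is a small sketch on one made-up sentence with a hand-picked vocabulary (both are illustrative assumptions).

```python
from collections import Counter

vocabulary = ['god', 'space', 'the', 'computer']
tokens = "the probe left the launch pad and the space agency cheered".split()

counts = Counter(tokens)

# Count features: how many times each vocabulary word occurs
count_features = [counts[word] for word in vocabulary]

# Occurrence features: 1 if the word occurs at all, else 0
occurrence_features = [int(counts[word] > 0) for word in vocabulary]

print(count_features)
print(occurrence_features)
```

Either way, word order is thrown away: the document is reduced to a "bag" of its words.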

In [None]:
toy_df = train_df[:10].copy()   # first 10 rows
toy_df

In [None]:
# Lowercase and then tokenize
toy_df['tokens'] = toy_df.text.map(lambda x: nltk.word_tokenize(x.lower()))
toy_df

In [None]:
toy_df['god#'] = toy_df.tokens.map(lambda x: x.count('god'))
toy_df['believe#'] = toy_df.tokens.map(lambda x: x.count('believe'))
toy_df['space#'] = toy_df.tokens.map(lambda x: x.count('space'))
toy_df['computer#'] = toy_df.tokens.map(lambda x: x.count('computer'))
toy_df['graphics#'] = toy_df.tokens.map(lambda x: x.count('graphics'))
toy_df['the#'] = toy_df.tokens.map(lambda x: x.count('the'))
toy_df['you#'] = toy_df.tokens.map(lambda x: x.count('you'))
toy_df['way#'] = toy_df.tokens.map(lambda x: x.count('way'))
toy_df

### Now do this for ALL word types in the training data...
- Or, more realistically, we could do this for the _n_ most frequent word types (We'll use 3,000)
- Then, the word-count columns (3,000 of them!) will be `X_train`. Feed that into the Naive Bayes training algorithm...
- But is there a better way?
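
For a sense of what that training step involves, below is a bare-bones multinomial Naive Bayes sketch over word counts, with add-one (Laplace) smoothing. It is a teaching sketch, not scikit-learn's implementation; the tiny corpus, labels, and function names are invented for illustration.

```python
import math
from collections import Counter, defaultdict

train_docs = [
    ("god faith church believe", "religion"),
    ("believe god prayer", "religion"),
    ("orbit space shuttle launch", "space"),
    ("space station orbit", "space"),
]

def train_nb(labeled_docs):
    """Collect per-class document counts and per-class word counts."""
    class_doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labeled_docs:
        class_doc_counts[label] += 1
        for word in text.split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_doc_counts, word_counts, vocab

def predict_nb(text, class_doc_counts, word_counts, vocab):
    """Return the class with the highest log posterior (up to a constant)."""
    total_docs = sum(class_doc_counts.values())
    scores = {}
    for label in class_doc_counts:
        # Log prior: share of training documents in this class
        score = math.log(class_doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one smoothing keeps unseen (class, word) pairs above zero
            p = (word_counts[label][word] + 1) / (total_words + len(vocab))
            score += math.log(p)
        scores[label] = score
    return max(scores, key=scores.get)

nb_model = train_nb(train_docs)
print(predict_nb("shuttle launch to the station", *nb_model))
print(predict_nb("i believe in god", *nb_model))
```

Training amounts to counting; prediction adds up log probabilities for the words in the new text and picks the class with the largest total.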

### Considerations
1. We need to normalize the values: raw counts are sensitive to text length. 
2. Some words are going to be frequent across all topics, just because they are common words ('the', 'way', 'talked')
   - We could filter out function words, but that goes only so far 
   - 'space' will be common in one topic, not so in others. 'god' will be common in two, but not in others. How to better capture this? 
3. The vector is going to be SPARSE: most values will be 0. A DataFrame is not an efficient data structure for this.
4. We don't want to do all this manually, word by word! 

## Under the hood with CountVectorizer and TF-IDF

#### Count-vectorize, and then TF-IDF
- Considerations 3 and 4, as well as tokenization, are handled by `CountVectorizer`
- Considerations 1 and 2 are addressed by `TfidfTransformer`

A detour: we will take a look at a detailed illustration of `CountVectorizer` and TF-IDF:
http://www.pitt.edu/~naraehan/presentation/Movie%20Reviews%20sentiment%20analysis%20with%20Scikit-Learn.html#A-detour:-try-out-CountVectorizer-&-TF-IDF


**TF-IDF (Term Frequency - Inverse Document Frequency)**
- Textbook section on TF-IDF: https://jakevdp.github.io/PythonDataScienceHandbook/05.04-feature-engineering.html#Text-Features
- Better explanation here: http://www.tfidf.com/
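
A minimal sketch of the classic formulation, where a term's weight in a document is its relative frequency there multiplied by log(N / df), with N the number of documents and df the number of documents containing the term. Note that scikit-learn's `TfidfTransformer` uses a smoothed variant with normalization by default, so its numbers will differ from these. The toy corpus is made up.

```python
import math
from collections import Counter

corpus = [
    "the rocket reached space",
    "the church teaches the faith",
    "the rocket engine failed",
]
tokenized = [doc.split() for doc in corpus]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def tf_idf(tokens):
    """Return {word: tf-idf weight} for one tokenized document."""
    counts = Counter(tokens)
    weights = {}
    for word, count in counts.items():
        tf = count / len(tokens)                 # normalizes for text length
        idf = math.log(n_docs / doc_freq[word])  # 0 for words found in every document
        weights[word] = tf * idf
    return weights

for tokens in tokenized:
    print({word: round(w, 3) for word, w in tf_idf(tokens).items()})
```

A word like 'the' that shows up in every document gets weight 0, while a word confined to one document gets the largest boost.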

### Back to the textbook and our 4 newsgroups. 
-  **Reminder:  Refer to textbook for explanation!! Link up above.** 
- `TfidfVectorizer()` used below is a combination of `CountVectorizer()` and `TfidfTransformer()`. It takes care of:
   - Tokenizing text and dropping punctuation (stop words are removed only if you pass the `stop_words` argument)
   - Building a token count vector
   - Converting raw token counts into TF-IDF (Term Frequency - Inverse Document Frequency) weights
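
As a rough picture of the tokenizing step: with default settings, scikit-learn's vectorizers lowercase the text and keep runs of two or more word characters, which drops punctuation and one-letter tokens. The sketch below imitates that with a regular expression from the standard library; it is an approximation for illustration, and the function name `simple_tokenize` is made up.

```python
import re

# Two or more word characters between word boundaries
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def simple_tokenize(text):
    """Lowercase the text, then keep tokens of 2+ word characters."""
    return TOKEN_PATTERN.findall(text.lower())

print(simple_tokenize("Hello, World! I am a test."))
```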

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# pipeline! See textbook. 
model = make_pipeline(TfidfVectorizer(), MultinomialNB())

In [None]:
# train model
model.fit(train.data, train.target)

# predict labels on test data
labels = model.predict(test.data)

In [None]:
type(labels)
labels[:10]

In [None]:
test.target[1]
test.target_names[test.target[1]]   # look up the name for that label index
test.data[1]

In [None]:
# seems to match up pretty well
test.target[:10]
labels[:10]

In [None]:
from sklearn.metrics import confusion_matrix
mat1 = confusion_matrix(test.target, labels)

In [None]:
mat1

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(test.target, labels)
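
For intuition about what these two functions report, here is a hand-rolled sketch on made-up labels: entry [i][j] of the matrix counts items whose true class is i and predicted class is j (the same row/column convention `confusion_matrix` uses), and accuracy is the diagonal total divided by the number of items.

```python
# Made-up true and predicted labels for four classes (0 to 3)
y_true = [0, 0, 1, 1, 2, 2, 2, 3]
y_pred = [0, 1, 1, 1, 2, 2, 3, 3]
n_classes = 4

# matrix[i][j] = number of items with true class i predicted as class j
matrix = [[0] * n_classes for _ in range(n_classes)]
for t, p in zip(y_true, y_pred):
    matrix[t][p] += 1

# Correct predictions sit on the diagonal
correct = sum(matrix[i][i] for i in range(n_classes))
accuracy = correct / len(y_true)

for row in matrix:
    print(row)
print(accuracy)
```

Off-diagonal cells show which pairs of categories the classifier confuses with each other.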

In [None]:
sns.heatmap(mat1.T, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=train.target_names, yticklabels=train.target_names)
plt.xlabel('true label')
plt.ylabel('predicted label')
plt.show()

In [None]:
# If you run into this issue with top and bottom rows being cut off,
# it's because of a matplotlib version issue (Thanks StackOverflow!).
# You'll have to explicitly widen the y-axis, as below.

ax = sns.heatmap(mat1, annot=True)  # the keyword is "annot", not "annote"
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)

In [None]:
tests = ['sending a payload to the ISS', 'I met Santa Claus once']
preds = model.predict(tests)
print(preds)

In [None]:
# Map each predicted index back to its newsgroup name
for text, pred in zip(tests, preds):
    print(text, '->', train.target_names[pred])