# Text Classification

- Pandas Documentation: http://pandas.pydata.org/
- Scikit Learn Documentation: http://scikit-learn.org/stable/documentation.html
- Seaborn Documentation: http://seaborn.pydata.org/
- Keras Documentation: https://keras.io


In [1]:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt

## Text classification

Our goal is to perform binary classification on text data, first in a spam detection example and then in a sentiment analysis example. We will try three strategies:

1) build naive features based on our ideas

2) use a well-tested feature extraction technique

3) use deep learning and recurrent models on text

### 1. Spam detection on SMS messages

In [2]:
df = pd.read_csv('../data/sms.tsv', sep='\t')
df.head()

Unnamed: 0,label,msg
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [3]:
df['label'].value_counts() / len(df)

ham     0.865937
spam    0.134063
Name: label, dtype: float64

### Exercise 1: Encode Labels to 0 and 1

Create a variable called y that contains 0 for HAM messages and 1 for SPAM messages. There are several ways to do this.

In [4]:
from sklearn.preprocessing import LabelEncoder

In [9]:
df["label"].map({"ham":0, "spam":1})

0       0
1       0
2       1
3       0
4       0
5       1
6       0
7       0
8       1
9       1
10      0
11      1
12      1
13      0
14      0
15      1
16      0
17      0
18      0
19      1
20      0
21      0
22      0
23      0
24      0
25      0
26      0
27      0
28      0
29      0
       ..
5542    0
5543    0
5544    0
5545    0
5546    0
5547    1
5548    0
5549    0
5550    0
5551    0
5552    0
5553    0
5554    0
5555    0
5556    0
5557    0
5558    0
5559    0
5560    0
5561    0
5562    0
5563    0
5564    0
5565    0
5566    1
5567    1
5568    0
5569    0
5570    0
5571    0
Name: label, Length: 5572, dtype: int64

In [10]:
(df["label"] == "spam").astype(int).values

array([0, 0, 1, ..., 0, 0, 0])

In [8]:
encoder = LabelEncoder()
y = encoder.fit_transform(df["label"])
y

array([0, 0, 1, ..., 0, 0, 0])

In [11]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

def find_call(s):
    res = s.lower().find('call')
    return res != -1

y_pred = df['msg'].apply(find_call)

print("\nAccuracy:")
print(accuracy_score(y, y_pred))

print("\nConfusion Matrix:")
print(confusion_matrix(y, y_pred))

print("\nClassification Report:")
print(classification_report(y, y_pred))


Accuracy:
0.8761665470208184

Confusion Matrix:
[[4535  290]
 [ 400  347]]

Classification Report:
             precision    recall  f1-score   support

          0       0.92      0.94      0.93      4825
          1       0.54      0.46      0.50       747

avg / total       0.87      0.88      0.87      5572
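
The numbers in the report for the spam class can be recomputed by hand from the confusion matrix printed above. The short sketch below does that arithmetic in plain Python. The variable names are illustrative; the four counts are copied from the benchmark output, where rows are true labels and columns are predicted labels.

```python
# Counts copied from the benchmark confusion matrix above.
# Rows: true label (0 = ham, 1 = spam). Columns: predicted label.
tn, fp = 4535, 290   # true ham:  predicted ham / predicted spam
fn, tp = 400, 347    # true spam: predicted ham / predicted spam

total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision_spam = tp / (tp + fp)   # of the messages flagged as spam, how many are spam
recall_spam = tp / (tp + fn)      # of the real spam messages, how many were flagged
f1_spam = 2 * precision_spam * recall_spam / (precision_spam + recall_spam)

# Accuracy of always predicting "ham", i.e. the share of ham messages
majority_baseline = (tn + fp) / total

print(round(accuracy, 4), round(precision_spam, 2), round(recall_spam, 2), round(f1_spam, 2))
print(round(majority_baseline, 4))
```

Note that always predicting "ham" already reaches about 87% accuracy, so the keyword benchmark improves accuracy only slightly. Precision and recall on the spam class are more informative here.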



### Exercise 2: Build naive features based on keywords

- turn all your sms messages to lowercase
- define a function to count occurrences of a single keyword with the following signature:

        def count_word(word, sentence):
            ....
            return count_word_in_sentence
            
            
- to test your function, try it on these examples and check that the results match:
   
        count_word("the", "quick brown fox") # -> 0
        count_word("fox", "quick brown fox") # -> 1
        count_word("a", "a b a abab") # -> 2
     

- using the function `count_word` you just wrote, create a feature matrix `X` using counts of some keywords of your choice. (This will be a simple bag-of-words representation.)
- create other similar features. You could use:
    - the length of the message
    - the presence of numbers
    - the presence of special characters
    - ...

In [14]:
from collections import Counter

In [32]:
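# Three alternative implementations. Each definition overrides the previous
# one, so only the last is active when the tests below run.

# Variant 1: count whole tokens with a Counter.
def count_word(word, sentence):
    counts = Counter(sentence.split())
    return counts.get(word, 0)

# Variant 2: substring count. This also matches inside other words,
# e.g. count_word("a", "a b a abab") would return 4 instead of 2.
def count_word(word, sentence):
    return sentence.count(word)

# Variant 3: count whole tokens with a list comprehension.
def count_word(word, sentence):
    tokens = sentence.split()
    return len([w for w in tokens if w == word])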

In [33]:
count_word("the", "quick brown fox")

0

In [34]:
count_word("fox", "quick brown fox")

1

In [35]:
 count_word("a", "a b a abab")

2

In [36]:
import re

docs = df['msg'].values
docs_lower = [d.lower() for d in docs]

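# Substring-based count: this also matches inside longer words,
# e.g. 'win' in 'winner' or 'call' in 'called'.
def count_word(word, sentence):
    return sentence.count(word)

# Count the digit characters in a message.
def count_numbers(sentence):
    return len(re.findall('[0-9]', sentence))


X = pd.DataFrame([count_word('free', d) for d in docs_lower], columns=['free'])

for keyword in ['win', 'discount', 'call']:
    X[keyword] = [count_word(keyword, d) for d in docs_lower]

# Despite its name, 'num_char' holds the number of digits in each message.
X['num_char'] = [count_numbers(d) for d in docs_lower]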

X.head()

Unnamed: 0,free,win,discount,call,num_char
0,0,0,0,0,0
1,0,0,0,0,0
2,1,1,0,0,25
3,0,0,0,0,0
4,0,0,0,0,0
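
The exercise also suggests features such as message length, the presence of numbers and the presence of special characters. A minimal sketch of these is given below, using only the standard library and two made-up messages. The function names and the definition of "special character" (anything that is not a lowercase letter, digit or whitespace) are assumptions for illustration, not part of the lab solution.

```python
import re

# Two made-up, already lowercased messages used only for illustration
messages = [
    "free entry! call 08001234567 now",
    "ok lar... joking wif u oni",
]

def message_length(sentence):
    """Number of characters in the message."""
    return len(sentence)

def has_number(sentence):
    """1 if the message contains at least one digit, else 0."""
    return int(bool(re.search(r"\d", sentence)))

def count_special(sentence):
    """Number of characters that are not a-z, 0-9 or whitespace."""
    return len(re.findall(r"[^a-z0-9\s]", sentence.lower()))

# One row per message: [length, has_number, special character count]
features = [[message_length(m), has_number(m), count_special(m)] for m in messages]
print(features)
```

Each of these lists could be assigned as a new column of `X` in the same way as the keyword counts.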


### Exercise 3: Train first model and evaluate performance

- split the data into train and test sets with `test_size=0.3, random_state=0`. You can use the `train_test_split` function from sklearn, which we have used in previous labs
- train a model of your choice on these features
- evaluate performance on the training and test sets
- discuss with a classmate:
    - how did you evaluate performance?
    - is the model overfitting?
    - is the model better than the benchmark?

In [42]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

In [41]:
X_train, X_test, y_train, y_test = train_test_split(X.values, y, test_size=0.3, random_state=0)

In [44]:
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [45]:
y_pred = clf.predict(X_test)

In [48]:
print("\nAccuracy on train:")
y_pred_train = clf.predict(X_train)
print(accuracy_score(y_train, y_pred_train))
print(clf.score(X_train, y_train))

print("\nAccuracy on test:")
print(accuracy_score(y_test, y_pred))
print(clf.score(X_test, y_test))

print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

print("\nClassification Report:")
print(classification_report(y_test, y_pred))


Accuracy on train:
0.9782051282051282
0.9782051282051282

Accuracy on test:
0.9736842105263158
0.9736842105263158

Confusion Matrix:
[[1434   17]
 [  27  194]]

Classification Report:
             precision    recall  f1-score   support

          0       0.98      0.99      0.98      1451
          1       0.92      0.88      0.90       221

avg / total       0.97      0.97      0.97      1672



### Exercise 4: Cross Validation

- perform a 5-fold cross validation on your model. You can refer back to lab 8 to refresh your memory on how to do this.
- print the confusion matrix and the classification report on the test data
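
One possible way to approach this exercise is sketched below with `cross_val_score` and `cross_val_predict` from Scikit Learn. To keep the example self-contained it uses a small synthetic feature matrix (`X_demo`, `y_demo`, both made up here) as a stand-in for the keyword counts. With the real data you would pass `X.values` and `y` instead. Since `cross_val_predict` returns, for every sample, the prediction made while that sample was in the held-out fold, the confusion matrix and report describe out-of-fold performance.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the keyword-count features (assumption for the demo)
rng = np.random.RandomState(0)
X_demo = rng.randint(0, 3, size=(200, 5))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 2).astype(int)

clf = DecisionTreeClassifier(random_state=0)

# Accuracy on each of the 5 held-out folds
scores = cross_val_score(clf, X_demo, y_demo, cv=5)
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean(), "+/-", scores.std())

# Out-of-fold predictions for every sample
y_cv_pred = cross_val_predict(clf, X_demo, y_demo, cv=5)
cm = confusion_matrix(y_demo, y_cv_pred)

print("\nConfusion Matrix:")
print(cm)

print("\nClassification Report:")
print(classification_report(y_demo, y_cv_pred))
```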

### Exercise 5: Count Features

- use features based on word counts using the `CountVectorizer` class from Scikit Learn
- use the following function to simplify your code (it encapsulates model training and evaluation):


    from keras.models import Sequential
    from keras.layers import Dense

    def split_fit_eval(X, y, model=None, epochs=10, random_state=0):
        X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=random_state)

        # Default model: logistic regression expressed as a single Dense unit
        if model is None:
            model = Sequential()
            model.add(Dense(1, input_dim=X.shape[1], activation='sigmoid'))

            model.compile(loss='binary_crossentropy',
                          optimizer='adam',
                          metrics=['accuracy'])

        h = model.fit(X_train, y_train, epochs=epochs, verbose=1)

        # evaluate returns [loss, accuracy] when compiled with metrics=['accuracy']
        train_loss, train_acc = model.evaluate(X_train, y_train)
        test_loss, test_acc = model.evaluate(X_test, y_test)

        return train_loss, train_acc, test_loss, test_acc, model, h


- did you improve the performance?
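
As a starting point, the sketch below shows what `CountVectorizer` produces on three made-up messages (`docs_demo` is an assumption for the demo). With the SMS data you would fit it on the message column instead.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three made-up messages used only for illustration
docs_demo = ["free entry to win", "call me later", "win a free call"]

vectorizer = CountVectorizer()
X_counts = vectorizer.fit_transform(docs_demo)  # scipy sparse matrix, one row per message

# vocabulary_ maps each learned word to its column index
print(sorted(vectorizer.vocabulary_))
print(X_counts.toarray())
```

With the default settings, single-character tokens such as "a" are ignored. The result is a sparse matrix. Depending on the Keras version, it may need to be converted with `X_counts.toarray()` before being passed to `split_fit_eval`.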

## Sentiment Analysis

The previous dataset was easy. Let's switch to a harder one and do sentiment analysis on it.

In [None]:
df = pd.read_csv('../data/rt_critics.csv')
df.head()

In [None]:
df.info()

In [None]:
df['fresh'].value_counts() / len(df)

In [None]:
df = df[df.fresh != 'none'].copy()
df['fresh'].value_counts() / len(df)

In [None]:
le = LabelEncoder()
le.fit(df['fresh'])

In [None]:
y = le.transform(df['fresh'])

### Exercise 6: TFIDF

- build TF-IDF features from the text (Scikit Learn provides `TfidfVectorizer` for this)
- do train/test split
- train and evaluate a model
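
A minimal sketch of these three steps is shown below on a handful of made-up reviews. The toy `reviews` and `labels` and the choice of `LogisticRegression` are assumptions for the demo. With the real data you would use the review text and the encoded `fresh` labels. The vectorizer is fitted on the training texts only, so that no information from the test set leaks into the features.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Made-up reviews: 1 = positive, 0 = negative
reviews = [
    "a brilliant and moving film",
    "dull plot and wooden acting",
    "wonderful performances all around",
    "boring, predictable and far too long",
    "a moving story with brilliant acting",
    "predictable plot and dull dialogue",
    "wonderful film with a moving story",
    "boring dialogue and wooden performances",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Split the raw texts first, keeping the class balance in both parts
texts_train, texts_test, y_tr, y_te = train_test_split(
    reviews, labels, test_size=0.25, random_state=0, stratify=labels)

# Learn vocabulary and IDF weights on the training texts only
tfidf = TfidfVectorizer()
X_tr = tfidf.fit_transform(texts_train)
X_te = tfidf.transform(texts_test)

model = LogisticRegression()
model.fit(X_tr, y_tr)

train_acc = model.score(X_tr, y_tr)
test_acc = model.score(X_te, y_te)
print("Train accuracy:", train_acc)
print("Test accuracy:", test_acc)
```

With so few samples the scores are not meaningful. The point is only the order of the steps.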

### Exercise 7: NLP with deep learning

- Use the Tokenizer from Keras to:
    - Create a vocabulary
    - Convert sentences to sequences of integers
- pad the sequences so that they look like a tensor using the `pad_sequences` function from Keras.
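
To make the two steps concrete, here is a plain-Python sketch that mimics what tokenizing and padding produce. It is not the Keras implementation: the helper names `build_vocabulary`, `to_sequences` and `pad_left` are hypothetical, and the sketch only reproduces the general idea. Words are indexed from 1 in order of frequency, 0 is reserved for padding, and zeros are added at the front of short sequences.

```python
from collections import Counter

def build_vocabulary(texts):
    """Map each word to an integer index, most frequent word first (starting at 1)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Index 0 is reserved for padding, so real words start at 1
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, vocabulary):
    """Replace each known word by its index; unknown words are dropped."""
    return [[vocabulary[w] for w in text.lower().split() if w in vocabulary]
            for text in texts]

def pad_left(sequences, maxlen):
    """Left-pad with zeros and keep only the last maxlen items of long sequences."""
    return [[0] * (maxlen - len(s)) + s[-maxlen:] for s in sequences]

texts = ["a great movie", "a dull movie with a dull plot"]

vocabulary = build_vocabulary(texts)
sequences = to_sequences(texts, vocabulary)
padded = pad_left(sequences, maxlen=5)

print(vocabulary)
print(sequences)
print(padded)
```

After padding, every row has the same length, so the list can be converted to a 2D integer array and fed to an Embedding layer.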

### Train / Test split on sequences

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

### Exercise 8: Build a recurrent neural network model

- use what you have learned to build a recurrent model that classifies the sentiment
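
One possible shape for such a model is sketched below: an Embedding layer, a single LSTM layer and a sigmoid output. The vocabulary size, sequence length, layer sizes and random demo data are all assumptions chosen to keep the example small. It imports Keras through `tensorflow.keras`, which assumes TensorFlow is installed.

```python
import numpy as np
from tensorflow.keras.layers import LSTM, Dense, Embedding
from tensorflow.keras.models import Sequential

vocab_size = 1000  # assumed number of distinct word indices (including 0 for padding)
maxlen = 20        # assumed length of the padded sequences

model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=16),  # word index -> 16-dim vector
    LSTM(32),                                        # reads the sequence, returns the last state
    Dense(1, activation="sigmoid"),                  # probability of the positive class
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

# Random integer sequences and labels, standing in for the padded reviews
X_demo = np.random.randint(1, vocab_size, size=(8, maxlen))
y_demo = np.random.randint(0, 2, size=(8,))

model.fit(X_demo, y_demo, epochs=1, verbose=0)
probs = model.predict(X_demo, verbose=0)
print(probs.shape)
```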

### Exercise 9

- Try changing the network architecture and re-train the model at each change. Can you avoid overfitting?
    - change the number of nodes in the LSTM layer
    - change the output dimension of the Embedding layer
    - add dropout and recurrent dropout to the LSTM
    - add a second LSTM layer
    - add kernel regularizers
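
The variations listed above can be gathered in one small builder function so that each change is a single argument. This is a sketch under assumptions: `build_model` and its default values are hypothetical, and Keras is imported through `tensorflow.keras`.

```python
import numpy as np
from tensorflow.keras import regularizers
from tensorflow.keras.layers import LSTM, Dense, Embedding
from tensorflow.keras.models import Sequential

def build_model(vocab_size, embedding_dim=16, lstm_units=32,
                dropout=0.2, recurrent_dropout=0.2, l2=1e-4, two_layers=False):
    """Hypothetical helper: build an LSTM classifier with the options from the list above."""
    model = Sequential()
    model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim))
    if two_layers:
        # The first LSTM must return the full sequence so the second one can read it
        model.add(LSTM(lstm_units, return_sequences=True,
                       dropout=dropout, recurrent_dropout=recurrent_dropout))
    model.add(LSTM(lstm_units,
                   dropout=dropout,
                   recurrent_dropout=recurrent_dropout,
                   kernel_regularizer=regularizers.l2(l2)))
    model.add(Dense(1, activation="sigmoid"))
    model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
    return model

# Example: two stacked LSTM layers on a vocabulary of 500 word indices
stacked = build_model(vocab_size=500, two_layers=True)
out = stacked.predict(np.zeros((2, 10), dtype="int32"), verbose=0)
print(out.shape)
```

Comparing training and test accuracy after each change is a simple way to see whether a given option reduces overfitting.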