# Text Classification

- Pandas Documentation: http://pandas.pydata.org/
- Scikit Learn Documentation: http://scikit-learn.org/stable/documentation.html
- Seaborn Documentation: http://seaborn.pydata.org/
- Keras Documentation: https://keras.io


In [None]:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt

## Text classification

Our goal is to perform binary classification on text data. We will work through both a spam detection example and a sentiment analysis example, and we will attempt 3 strategies:

1) build naive features based on our ideas
2) use well-tested feature extraction techniques
3) use deep learning and recurrent models on text

### 1. Spam detection on SMS messages

In [None]:
df = pd.read_csv('../data/sms.tsv', sep='\t')
df.head()

In [None]:
df['label'].value_counts() / len(df)

### Exercise 1: Encode Labels to 0 and 1

Create a variable called y that contains 0 for HAM messages and 1 for SPAM messages. There are several ways to do this.
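
One possible way to do this is sketched below on a tiny stand-in frame. The `message` column name and the lowercase `ham`/`spam` label values are assumptions, so adjust them to whatever `df.head()` shows.

```python
import pandas as pd

# Toy stand-in for the SMS data; column names and label values are assumed.
toy = pd.DataFrame({
    'label': ['ham', 'spam', 'ham', 'spam'],
    'message': ['see you soon', 'WIN a free prize', 'ok thanks', 'call now'],
})

# Boolean comparison cast to int: spam -> 1, ham -> 0.
y = (toy['label'] == 'spam').astype(int).values
print(y)
```

Mapping with `toy['label'].map({'ham': 0, 'spam': 1})` or using scikit-learn's `LabelEncoder` should give the same encoding.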

### Exercise 2: Build naive features based on keywords

- turn all your sms messages to lowercase
- define a function to count occurrences of a keyword with the following signature:

        def count_word(word, sentence):
            ....
            return count_word_in_sentence

- create a feature matrix `X` using counts of some keywords of your choice
- create other similar features. You could use:
    - the length of the message
    - the presence of numbers
    - the presence of special characters
    - ...
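
A minimal sketch of these steps, assuming whole-word, case-insensitive matching is what we want. The keyword list and the toy messages are hypothetical, chosen only for illustration.

```python
import re
import pandas as pd

def count_word(word, sentence):
    """Count whole-word, case-insensitive occurrences of `word` in `sentence`."""
    pattern = r'\b' + re.escape(word.lower()) + r'\b'
    return len(re.findall(pattern, sentence.lower()))

# Toy messages standing in for the SMS text column.
messages = pd.Series([
    'Free entry! Call now to WIN free tickets',
    'are we meeting at 5?',
    'ok see you later',
])

keywords = ['free', 'win', 'call']  # hypothetical keyword choice

# One count column per keyword.
X = pd.DataFrame({
    kw: messages.apply(lambda s, kw=kw: count_word(kw, s)) for kw in keywords
})

# Other naive features: length, presence of digits, number of special characters.
X['length'] = messages.str.len()
X['has_digit'] = messages.str.contains(r'\d').astype(int)
X['n_special'] = messages.str.count(r'[^\w\s]')
print(X)
```

The word-boundary pattern keeps `free` from matching inside `freedom`; a plain substring count would be simpler but noisier.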

### Exercise 3: Train first model and evaluate performance

- split the data into training and test sets with `test_size=0.3, random_state=0`
- train a model of your choice on these features
- evaluate performance on the training and test sets
- discuss with a classmate:
    - is the model overfitting?
    - is the model better than the benchmark (for example, always predicting the most frequent class)?
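
The split, fit and comparison could look like the sketch below. It runs on synthetic features so that it is self-contained. `LogisticRegression` is an arbitrary model choice, and taking the benchmark to be the majority-class share is an assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the keyword features and 0/1 labels.
rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = (X[:, 0] + 0.1 * rng.randn(200) > 0.7).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = LogisticRegression()
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)

# Assumed benchmark: accuracy of always predicting the most frequent class.
benchmark = max(y_test.mean(), 1 - y_test.mean())

print('train accuracy: %.3f' % train_acc)
print('test accuracy:  %.3f' % test_acc)
print('benchmark:      %.3f' % benchmark)
```

A training score well above the test score suggests overfitting; a test score close to the benchmark suggests the features carry little signal.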

### Exercise 4: Cross Validation

- perform a 5-Fold cross validation on your model
- print the confusion matrix and the classification report on the test data
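
As a rough sketch, again on synthetic data and with an arbitrary model, cross-validation on the training set followed by a report on the held-out test set might look like this:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the features and 0/1 labels.
rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = (X[:, 0] + 0.1 * rng.randn(200) > 0.7).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = LogisticRegression()

# 5-fold cross-validation: one accuracy score per fold.
scores = cross_val_score(model, X_train, y_train, cv=5)
print('CV accuracy: %.3f +/- %.3f' % (scores.mean(), scores.std()))

# Refit on the full training set, then report on the test data.
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
print(cm)
print(classification_report(y_test, y_pred))
```

In the confusion matrix, rows are true classes and columns are predicted classes.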

### Exercise 5: Count Features

- use features based on word counts using the `CountVectorizer` class from Scikit Learn
- encapsulate model training and evaluation in a function
- did you improve the performance?
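
One way to structure this is sketched below. `train_and_evaluate` is a hypothetical helper name, the toy corpus is made up, and `MultinomialNB` is just one reasonable model for count features.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

def train_and_evaluate(model, X, y):
    """Split, fit, and return (train accuracy, test accuracy)."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0, stratify=y)
    model.fit(X_train, y_train)
    return model.score(X_train, y_train), model.score(X_test, y_test)

# Toy corpus standing in for the SMS messages (1 = spam, 0 = ham).
docs = [
    'win a free prize now', 'free entry call now', 'claim your free cash',
    'urgent call to win', 'you won a free phone',
    'see you at lunch', 'are you home yet', 'thanks for the help',
    'call me when you land', 'ok talk later',
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# Each column of X_counts is the count of one vocabulary word.
vectorizer = CountVectorizer()
X_counts = vectorizer.fit_transform(docs)

train_acc, test_acc = train_and_evaluate(MultinomialNB(), X_counts, labels)
print('train: %.3f  test: %.3f' % (train_acc, test_acc))
```

Fitting the vectorizer on the full corpus before splitting is a simplification; fitting it on the training split only gives a stricter estimate.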

## Sentiment Analysis

The previous dataset was easy. Let's switch to a harder one and do sentiment analysis on it.

In [None]:
df = pd.read_csv('../data/rt_critics.csv')
df.head()

In [None]:
df.info()

In [None]:
df['fresh'].value_counts() / len(df)

In [None]:
df = df[df.fresh != 'none'].copy()
df['fresh'].value_counts() / len(df)

In [None]:
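from sklearn.preprocessing import LabelEncoder
y = LabelEncoder().fit_transform(df['fresh'])  # classes are encoded as integers in sorted order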

### Exercise 6: TFIDF

- build features from TF-IDF weighted word frequencies (`TfidfVectorizer`)
- do a train/test split
- train and evaluate a model
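
A minimal sketch of these three steps follows. The toy `quotes` list stands in for the review text (the name of the text column is not shown above, so it is left out), and `LogisticRegression` is an arbitrary choice.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy reviews standing in for the critics data (1 and 0 are the two classes).
quotes = [
    'a wonderful and moving film', 'brilliant acting and a great script',
    'funny, warm and clever', 'a great ride from start to finish',
    'beautifully shot and moving',
    'dull and far too long', 'a boring mess of a script',
    'flat acting and a weak plot', 'tedious from start to finish',
    'a weak and forgettable film',
]
fresh = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

q_train, q_test, y_train, y_test = train_test_split(
    quotes, fresh, test_size=0.3, random_state=0, stratify=fresh)

# Fit the vocabulary and IDF weights on the training text only.
tfidf = TfidfVectorizer()
X_train = tfidf.fit_transform(q_train)
X_test = tfidf.transform(q_test)

model = LogisticRegression()
model.fit(X_train, y_train)
test_acc = model.score(X_test, y_test)
print('test accuracy: %.3f' % test_acc)
```

Unlike raw counts, TF-IDF down-weights words that appear in many documents, so common words contribute less to each feature vector.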

### Exercise 7: NLP with deep learning

- Use the Tokenizer from Keras to:
    - Create a vocabulary
    - Convert sentences to sequences of integers
- pad the sequences so that they look like a tensor
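
To make the idea concrete, here is a plain Python illustration of what these steps produce. It is not the Keras API: the helper names are hypothetical, and Keras' `Tokenizer` and `pad_sequences` do the same job with more options.

```python
from collections import Counter
import numpy as np

def build_vocabulary(texts):
    """Map each word to an integer, most frequent first; 0 is kept for padding."""
    counts = Counter(w for t in texts for w in t.lower().split())
    return {w: i + 1 for i, (w, _) in enumerate(counts.most_common())}

def texts_to_sequences(texts, vocab):
    """Replace each known word with its integer index."""
    return [[vocab[w] for w in t.lower().split() if w in vocab] for t in texts]

def pad(sequences, maxlen):
    """Left-pad with zeros (and truncate from the left) to a fixed length."""
    out = np.zeros((len(sequences), maxlen), dtype='int32')
    for row, seq in enumerate(sequences):
        trimmed = seq[-maxlen:]
        if trimmed:
            out[row, -len(trimmed):] = trimmed
    return out

texts = ['a great great movie', 'boring movie']
vocab = build_vocabulary(texts)
sequences = texts_to_sequences(texts, vocab)
padded = pad(sequences, maxlen=4)
print(vocab)
print(padded)
```

After padding, every row has the same length, so the result is a 2D integer array that an embedding layer can consume.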

### Train / Test split on sequences

### Exercise 8: Build a recurrent neural network model

- use what you have learned to build a recurrent model that classifies the sentiment
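
A minimal sketch of such a model, assuming the standalone `keras` package is installed. Random integer sequences stand in for the padded reviews, and the layer sizes are arbitrary.

```python
import numpy as np
from keras.layers import Dense, Embedding, LSTM
from keras.models import Sequential

vocab_size, maxlen = 50, 10  # assumed sizes, for illustration only

# Random stand-in for padded integer sequences and 0/1 sentiment labels.
rng = np.random.RandomState(0)
X = rng.randint(1, vocab_size, size=(32, maxlen))
y = rng.randint(0, 2, size=(32,))

model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=8))  # word index -> dense vector
model.add(LSTM(16))                                       # reads the sequence step by step
model.add(Dense(1, activation='sigmoid'))                 # probability of class 1
model.compile(loss='binary_crossentropy', optimizer='adam',
              metrics=['accuracy'])

model.fit(X, y, epochs=1, batch_size=8, verbose=0)
probs = model.predict(X, verbose=0)
print(probs.shape)
```

On real data, `input_dim` should be at least the largest word index plus one, and the model would be trained for more epochs with a validation split.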

### Exercise 9

- Try changing the network architecture and re-train the model at each change. Can you avoid overfitting?
    - change the number of nodes in the LSTM layer
    - change the output dimension of the Embedding layer
    - add dropout and recurrent dropout to the LSTM
    - add a second LSTM layer
    - add kernel regularizers
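
These variations can be gathered into one builder function, as in the sketch below. `build_model` and its parameter names are hypothetical, and the default values are arbitrary starting points rather than recommendations.

```python
import numpy as np
from keras.layers import Dense, Embedding, LSTM
from keras.models import Sequential
from keras.regularizers import l2

def build_model(vocab_size, embedding_dim=8, lstm_units=16,
                dropout=0.2, two_layers=False, l2_strength=0.0):
    """Build a small LSTM classifier with the knobs listed above."""
    reg = l2(l2_strength) if l2_strength > 0 else None
    model = Sequential()
    model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim))
    if two_layers:
        # The first LSTM must return the full sequence to feed the second one.
        model.add(LSTM(lstm_units, return_sequences=True, dropout=dropout,
                       recurrent_dropout=dropout, kernel_regularizer=reg))
    model.add(LSTM(lstm_units, dropout=dropout,
                   recurrent_dropout=dropout, kernel_regularizer=reg))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam',
                  metrics=['accuracy'])
    return model

model = build_model(vocab_size=50, two_layers=True, l2_strength=1e-4)

# Untrained forward pass on random sequences, just to check the shapes.
X = np.random.RandomState(0).randint(1, 50, size=(4, 10))
probs = model.predict(X, verbose=0)
print(probs.shape)
```

Changing one argument at a time and comparing training and validation curves makes it easier to see which change actually reduces overfitting.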

*Copyright &copy; 2017 Francesco Mosconi & CATALIT LLC. All rights reserved.*