# Exercise 15: Text Classification of Consumer Complaints

In this exercise, you will try to categorize consumer complaints, based on the complaint narrative, using supervised machine learning with Support Vector Machines (SVM). You will also be able to experiment with different forms of data pre-processing to test the effects on the categorization of the text.

### Loading the data

We will use a package called `sklearn` (Scikit-learn) for this lab. This package contains machine learning algorithms for Python focusing on classification, regression and clustering. If you wish, you can read more about the details of `sklearn` [here](http://scikit-learn.org/stable/documentation.html), but it is not necessary for completing this lab exercise.

First, let's load the relevant components we will use in this lab:

In [None]:
import numpy as np
import pandas as pd
import textmining as tm
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
from client.api.notebook import Notebook
ok = Notebook('ex15.ok')

Load the data into a `DataFrame` using Pandas and show the head of the data set:

In [None]:
complaints = pd.read_csv("Consumer_Complaints-sliced.tar.gz")
complaints.head()

If you are unsure what this `.csv` file looks like in its raw format, you can check its contents in a regular text editor, or by going to the Jupyter dashboard and opening it, to get an idea of the data we will handle.

Calculate the shape of the dataset. Use the code that you have learned to get the shape on `complaints` by replacing the ellipsis (`...`) below with your own code:

In [None]:
complaints.shape

**Q4.1.** How many complaints records are present in the data set?

In [None]:
num_records_in_complaints = ...

In [None]:
_ = ok.grade('q41')

### What is a Term Document Matrix?

The data set you work with consists of consumer complaint narratives (a description from a consumer about their complaint) alongside a lot of extra data about the complaints, such as when each complaint was made, which company it relates to, and some categorisations such as product or issue category.

For this exercise we are interested in looking at the `Product` relating to each complaint. Each row in the dataset corresponds to a complaint. We need to start by creating a term document matrix (TDM), which represents these complaints as feature vectors of term counts.

We can use the `textmining` package to create a TDM and do some processing on it. We can experiment with several techniques for optimizing the input dataset and inspect the TDMs after processing.
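To make the idea of a TDM concrete, here is a minimal sketch that builds one by hand for three made-up documents, using only the Python standard library. The `docs` list and the `build_tdm` helper are illustrative assumptions and are not part of `textmining`, which does this work for us in the cells that follow.

```python
from collections import Counter

# Three made-up documents, for illustration only
docs = [
    "the bank closed my account",
    "my card was charged twice",
    "the bank charged my account",
]

def build_tdm(documents):
    """Return (sorted vocabulary, one row of term counts per document)."""
    counts = [Counter(doc.lower().split()) for doc in documents]
    vocab = sorted(set(term for c in counts for term in c))
    rows = [[c[term] for term in vocab] for c in counts]
    return vocab, rows

vocab, rows = build_tdm(docs)
print(vocab)
for row in rows:
    print(row)
```

Each row is the feature vector of one document, and each column corresponds to one term in the vocabulary. Most entries are zero, which is why TDMs for large corpora quickly become very wide and sparse.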

First, let's compile the corpus from the complaints records:

In [None]:
complaints = complaints.dropna(subset=["Consumer complaint narrative"])  # drops null values
complaints.shape

**Q4.2.** How many complaints records contain non-null narratives?

In [None]:
num_non_null_records_in_complaints = ...

In [None]:
_ = ok.grade('q42')

#### Stemming

Stemming is a method where words are shortened to their morphological root. The algorithm that performs this truncation is adapted to the features of specific languages and thus it is not possible to use the same algorithm in Swedish as you would use in English. In this lab we focus on data in English.
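As a rough illustration of the idea, the sketch below strips a few common English suffixes. It is a deliberately naive toy and not the Porter algorithm used later in this lab; the `SUFFIXES` tuple and the `toy_stem` function are hypothetical names introduced only for this example.

```python
# Toy suffix stripper: NOT the Porter algorithm, just an illustration of the idea
SUFFIXES = ("ing", "ed", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["payment", "payments", "call", "called", "calling", "calls"]
stems = [toy_stem(w) for w in words]
print(stems)
print(len(set(words)), "distinct words ->", len(set(stems)), "distinct stems")
```

Six distinct words collapse to two stems here, which hints at why a stemmed TDM tends to have fewer columns than an unstemmed one. A real stemmer handles many more cases and exceptions than this sketch does.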

We will create three different TDMs based on a sample of the `Consumer_Complaints.csv` dataset. We use a sample initially because inspecting and manipulating a TDM built from a large input dataset easily becomes unworkable. We create the sample using the `sample()` function of a `DataFrame`:

In [None]:
sampled_complaints = complaints.sample(100)  # limits to a sample of 100 records
sampled_complaints.shape

Now we get our corpus by just considering the `Consumer complaint narrative` column:

In [None]:
sampled_corpus = sampled_complaints["Consumer complaint narrative"]

Now we use the `textmining` package to create our TDM. Write some code to create our TDM `DataFrame` named `tdm_df`, and then output the head of the new `DataFrame`:

In [None]:
tdm = tm.TermDocumentMatrix()
for complaint in sampled_corpus:
    tdm.add_doc(complaint)
tdm_df = tdm.to_df(cutoff=1)
tdm_df.head()

**Q4.3.** How many features (terms) are present in the initial TDM generated from the sampled corpus?

In [None]:
num_features_sampled = ...

In [None]:
_ = ok.grade('q43')

`textmining` provides a function to apply stemming to a corpus of text. We can use the `tm.stem()` function to apply stemming to the complaint narrative corpus. The algorithm applied is the Porter stemming algorithm described in http://snowball.tartarus.org/algorithms/porter/stemmer.html

The following line of code defines a function `stem_doc()` that takes as a parameter a document and returns the document with the individual words in the document already stemmed.

    stem_doc = lambda x: ' '.join(tm.stem(x.split()))
    
Write some code to create a TDM `DataFrame` named `tdm_stemmed_df`, and then output the head of the new `DataFrame`:

In [None]:
stem_doc = lambda x: ' '.join(tm.stem(x.split()))
tdm_stemmed = tm.TermDocumentMatrix()
for complaint in sampled_corpus:    
    tdm_stemmed.add_doc(stem_doc(complaint))
tdm_stemmed_df = tdm_stemmed.to_df(cutoff=1)
tdm_stemmed_df.head()

**Q4.4.** How many features (terms) are present in the stemmed TDM generated from the sampled corpus?

In [None]:
num_features_sampled_stemmed = ...

In [None]:
_ = ok.grade('q44')

**Q4.5.** What do you observe about the shapes of `tdm_df` and `tdm_stemmed_df`?

*Edit this cell to type your answer here*

Let's now study some of the terms in `tdm_df` and `tdm_stemmed_df`.

In [None]:
compare_stemmed = pd.concat([ 
    pd.Series(sorted(tdm_df.columns)), 
    pd.Series(sorted(tdm_stemmed_df.columns))
], ignore_index=True, axis=1)
compare_stemmed.columns = ["Raw terms", "Stemmed terms"]

In [None]:
compare_stemmed[0:50]
# you can change the selection range (for example to 20:70) to view other parts of the comparison data set

**Q4.6.** How do the terms differ in a TDM with stemming from a TDM without stemming?

*Edit this cell to type your answer here*

#### Stopwords

Stopwords are words of limited importance that do not significantly affect the text analysis. Typical examples of words that are filtered out are prepositions and conjunctions. By default, `TermDocumentMatrix` does not remove any words when you use the `add_doc()` function.

- A *preposition* is a word that tells you where or when something is in relation to something else. For example, words like "after", "before", "on", "under", "inside" and "outside".
- A *conjunction* is a connective word that joins sentences together. For example, the FANBOYS words: "for", "and", "nor", "but", "or", "yet", "so".

To remove the stopwords, we apply the function `tm.simple_tokenize_remove_stopwords()` to each document before adding it to the TDM. This function takes a document, tokenizes it (splits it into a list of individual words), and at the same time removes the stopwords from the list. We then reconstruct the document using `' '.join()`, which takes a list of strings and concatenates them with a space between each. The result is the document with the stopwords removed.

Run the following code to apply this to the sampled corpus:

In [None]:
tdm_stopped = tm.TermDocumentMatrix()
for complaint in sampled_corpus:
    complaint_stopped = ' '.join(tm.simple_tokenize_remove_stopwords(complaint))
    tdm_stopped.add_doc(complaint_stopped)
tdm_stopped_df = tdm_stopped.to_df(cutoff=1)
tdm_stopped_df.head()
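The tokenize, filter and rejoin steps described above can be sketched without `textmining` as follows. The `TOY_STOPWORDS` set and the `remove_stopwords` helper are made up for this illustration; the real stopword list in `textmining` is much longer, and its tokenizer may treat punctuation differently from a plain whitespace split.

```python
# A tiny, made-up stopword list; the list in `textmining` is much longer
TOY_STOPWORDS = {"the", "a", "on", "and", "my", "was", "to"}

def remove_stopwords(document):
    """Lowercase and split on whitespace, drop stopwords, then rejoin with spaces."""
    tokens = document.lower().split()
    kept = [t for t in tokens if t not in TOY_STOPWORDS]
    return " ".join(kept)

cleaned = remove_stopwords("The fee was charged to my account and never refunded")
print(cleaned)  # fee charged account never refunded
```

Five of the ten words are dropped, and the remaining words are the ones most likely to distinguish one complaint from another.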

**Q4.7.** How many features (terms) are present in the stop word truncated TDM generated from the sampled corpus?

In [None]:
num_features_sampled_stopped = ...

In [None]:
_ = ok.grade('q47')

Study the terms in `tdm_df` and `tdm_stopped_df`. The following code cell creates a table from two `Series` of terms so that you can more easily compare them.

In [None]:
compare_stopped = pd.concat([ 
    pd.Series(tdm_df.columns), 
    pd.Series(tdm_stopped_df.columns)
], ignore_index=True, axis=1)
compare_stopped.columns = ["Raw terms", "Stopped terms"]
compare_stopped[:50]

**Q4.8.** How do the terms differ in a TDM with removal of stopwords from a TDM without truncation?

*Edit this cell to type your answer here*

**Q4.9.** How does the deletion of stopwords affect the calculation efficiency?

*Edit this cell to type your answer here*

You can inspect the stopwords list used by the `textmining` package as follows:

In [None]:
tm.stopwords

#### Frequency

In the commands that you executed above, the weight of each word in the feature vectors is based only on the number of occurrences of each term in each record of the corpus (the TDM, loaded initially into `tdm_df`).

A further matrix we can derive is a TF-IDF (term frequency-inverse document frequency) matrix. This weighs the occurrence of a word in a particular document against how many of the other documents the word appears in. This means that if a word occurs in almost all documents, it is allocated a lower value in the matrix. A word that appears in only a few documents is instead weighted higher. A simpler way to weight a word in the feature vector is by TF (term frequency) alone. TF only counts the occurrences of the word in a document and records this count in the feature vector.

Instead of creating our TDM and then our TF-IDF matrix manually, we can use components from the `sklearn` package that do this for us. Firstly, we can use a `CountVectorizer` that takes an input corpus and creates an initial TDM. We then apply the `TfidfTransformer` to calculate the IDF weights and apply them to the TF values. `sklearn` then allows us to train models to classify texts.

Inspect the TF-IDF matrix created below with a small corpus:

In [None]:
simple_corpus = pd.Series([
    'She watches bandy and football', 
    'Alice likes to play bandy', 
    'Karl loves to play football'])

count_vectorizer = CountVectorizer(min_df=1)
term_freq_matrix = count_vectorizer.fit_transform(simple_corpus)

tf_df = pd.DataFrame(data=term_freq_matrix.toarray(), columns=count_vectorizer.get_feature_names())
tf_df.style.set_caption('Term Document Matrix')

In [None]:
tfidf = TfidfTransformer()
tfidf.fit(term_freq_matrix)
tf_idf_matrix = tfidf.transform(term_freq_matrix)

tf_idf_df = pd.DataFrame(data=tf_idf_matrix.toarray(), columns=count_vectorizer.get_feature_names())
tf_idf_df.style.set_caption('Term Frequency-Inverse Document Frequency Matrix')
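To see where such weights can come from, the sketch below recomputes TF-IDF by hand for the same three sentences using only the standard library. It assumes the default `TfidfTransformer` settings, that is a smoothed IDF of $\ln\frac{1+n}{1+\mathrm{df}} + 1$ followed by scaling each row to unit Euclidean length, and it uses a plain lowercase whitespace split in place of the `CountVectorizer` tokenizer. The helper name `tfidf_row` is hypothetical.

```python
import math
from collections import Counter

corpus = [
    "She watches bandy and football",
    "Alice likes to play bandy",
    "Karl loves to play football",
]

# Term counts per document (a hand-made TDM)
docs = [Counter(text.lower().split()) for text in corpus]
n_docs = len(docs)
vocab = sorted(set(term for doc in docs for term in doc))

# Document frequency: the number of documents each term appears in
df = {term: sum(1 for doc in docs if term in doc) for term in vocab}

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

def tfidf_row(doc):
    """Weight raw counts by IDF, then scale the row to unit Euclidean length."""
    weights = [doc[term] * idf[term] for term in vocab]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

first_row = dict(zip(vocab, tfidf_row(docs[0])))
for term, weight in first_row.items():
    if weight > 0:
        print(f"{term}: {weight:.3f}")
```

In the first sentence, "bandy" and "football" each appear in two of the three documents, so they end up with a lower weight than "she", "watches" and "and", which appear in only one.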

**Q4.10.** Describe how the weighting of terms differs depending on how the frequency is calculated based on the terms found above.

*You can try adding documents to and removing documents from the `simple_corpus`, then re-running the cells, to help you observe changes in weights.*

*Edit this cell to type your answer here*

### Create a Term Document Matrix

Now it's time to get back to our consumer complaints data set and create a TF-IDF matrix for text analysis. In this section, you will again work with the sampled corpus, this time using `sklearn`. The cells below overwrite the `tdm_df` and `tdm_stopped_df` variables created earlier. You will create three different TDMs and vary the input by applying stop word removal and then stemming.

In [None]:
count_vectorizer = CountVectorizer(min_df=1)
term_freq_matrix = count_vectorizer.fit_transform(sampled_corpus)
tdm_df = pd.DataFrame(data=term_freq_matrix.toarray(), columns=count_vectorizer.get_feature_names())
tdm_df.head()

In [None]:
count_vectorizer_stopped = CountVectorizer(min_df=1, stop_words=tm.stopwords)
term_freq_matrix_stopped = count_vectorizer_stopped.fit_transform(sampled_corpus)
tdm_stopped_df = pd.DataFrame(data=term_freq_matrix_stopped.toarray(), columns=count_vectorizer_stopped.get_feature_names())
tdm_stopped_df.head()

In [None]:
analyzer = CountVectorizer().build_analyzer()
def stemmed_words(doc):
    return (tm.stem(w) for w in analyzer(doc))
count_vectorizer_stemmed = CountVectorizer(min_df=1, analyzer=stemmed_words)
term_freq_matrix_stemmed = count_vectorizer_stemmed.fit_transform(sampled_corpus)
tdm_stopped_stemmed_df = pd.DataFrame(data=term_freq_matrix_stemmed.toarray(), columns=count_vectorizer_stemmed.get_feature_names())
tdm_stopped_stemmed_df.head()

Now create the TF-IDF matrix:

In [None]:
def create_tfidf_matrix(tdm, features):
    transformer = TfidfTransformer()
    tf_idf_matrix = transformer.fit_transform(tdm)
    tfidf_df = pd.DataFrame(data=tf_idf_matrix.toarray(), columns=features)
    return tfidf_df

tfidf_matrix = create_tfidf_matrix(term_freq_matrix, count_vectorizer.get_feature_names())
tfidf_matrix.head()

**Q4.11.** What are the implications of data pre-processing for the objectivity of an analysis? (e.g. see Boyd & Crawford 2012 for a discussion)

*Edit this cell to type your answer here*

## Training of SVM and classification

When you have created a TDM, it is time to divide the data set into a training set and a test set. Classification with supervised machine learning requires a training set from which the algorithm learns how to categorize data. An SVM is fitted so that it can classify the training set. The classifier is then evaluated on the test set. In `sklearn` there is a function `train_test_split` to extract a training data set from the full data.

Since training our classifier takes some time if we use the full complaints dataset, we will load the first 100000 rows only for the purposes of the rest of this lab. Run the next cell to reload the complaints dataset:

In [None]:
complaints = pd.read_csv("Consumer_Complaints-sliced.tar.gz", nrows=100000).dropna(subset=["Consumer complaint narrative"])

Let us now visualize the distribution of complaint records according to the product categorization:

In [None]:
fig = plt.figure(figsize=(8,6))
complaints.groupby('Product')["Consumer complaint narrative"].count().plot.bar(ylim=0)
plt.show()

**Q4.12.** What can you observe about the number of complaints per product? How might this affect our analysis?

*Edit this cell to type your answer here*

### Classifier with no data pre-processing

We will use the `LinearSVC` model from `sklearn` to create our classifier. We will use the `CountVectorizer` and then the `TfidfTransformer` to create an input TF-IDF matrix that we will train our model with, alongside the relevant labels.

When training a model, we take an input dataset, in our case the input complaints records, and split it into a training dataset and a test dataset. This allows us to train the model with labelled data, and then test the trained model with labelled data that was not used in the training process. The `train_test_split()` function by default splits the input data into 75% training data and 25% test data.

Run the next cell to train the model on the input `complaints` data that we loaded earlier:

In [None]:
X_train, X_test, y_train, y_test = train_test_split(complaints['Consumer complaint narrative'], 
                                                    complaints.Product, random_state=0)
count_vectorizer = CountVectorizer(stop_words=None)
X_train_counts = count_vectorizer.fit_transform(X_train)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
model = LinearSVC()
classifier = model.fit(X_train_tfidf, y_train)
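The 75%/25% split described above can be pictured with the following standard-library sketch. It is not how `sklearn` implements `train_test_split()`; the function `simple_train_test_split` and the toy `texts` and `labels` are assumptions made for illustration. The key point is that texts and labels are shuffled together, so each text keeps its own label.

```python
import random

def simple_train_test_split(items, labels, test_fraction=0.25, seed=0):
    """Shuffle indices with a fixed seed, then hold out test_fraction of them."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(items) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    pick = lambda seq, idx: [seq[i] for i in idx]
    return (pick(items, train_idx), pick(items, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))

# Made-up data, for illustration only
texts = [f"complaint {i}" for i in range(8)]
labels = ["Mortgage", "Credit card"] * 4

train_texts, test_texts, train_labels, test_labels = simple_train_test_split(texts, labels)
print(len(train_texts), len(test_texts))  # 6 2
```

Fixing the seed plays the same role as `random_state=0` in the cell above: the split is random, but it is the same every time the cell is run.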

We can now test our classifier against the test data `X_test`. We first have to vectorize the input data, then pass it to the `predict()` function of the classifier. We then build a `DataFrame` so we can inspect the predicted categories against the actual categories:

In [None]:
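# Apply the same TF-IDF weighting that was fitted on the training data
vec = tfidf_transformer.transform(count_vectorizer.transform(X_test))
predictions = classifier.predict(vec)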
results = pd.DataFrame({
    "Complaint narrative": X_test,
    "Actual category": y_test,
    "Predicted category": predictions
})
results

By inspecting the results table above, we can see if the classifier has done a good or bad job (it should have done an OK job). However, we can quantify the accuracy. We do this using cross-validation. `sklearn` allows us to do this with the `cross_val_score()` function.

Run the next cell to run cross-validation on our classifier:

In [None]:
tdf_vectorizer = TfidfVectorizer(min_df=1)
# Score against the true labels (y_test), not the classifier's own predictions
no_pprocess_scores = cross_val_score(classifier, tdf_vectorizer.fit_transform(X_test), y_test, scoring='accuracy', cv=5)
no_pprocess_scores
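As a rough picture of what 5-fold cross-validation does with the data, the sketch below generates the train and test indices for each fold using plain Python. It is a simplification: for classifiers, `cross_val_score()` with an integer `cv` uses stratified folds that try to preserve the class proportions, which this sketch ignores. The generator name `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, test_indices) for k contiguous, near-equal folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(10, k=5))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Every record is used for testing exactly once and for training in the other four rounds. A fresh copy of the model is fitted in each round, and the five accuracy values are what `cross_val_score()` returns.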

### Classifier with truncation of stop words

Next, we will create a classifier using the stopwords we used earlier from the `textmining` package, and then run the cross-validation on the results.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(complaints['Consumer complaint narrative'], 
                                                    complaints.Product, random_state=0)
stopped_count_vect = CountVectorizer(stop_words=tm.stopwords)
X_train_counts = stopped_count_vect.fit_transform(X_train)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
stopped_classifier = LinearSVC().fit(X_train_tfidf, y_train)

In [None]:
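# Use the same stop word list for the cross-validation features
tdf_vectorizer = TfidfVectorizer(min_df=1, stop_words=tm.stopwords)
# Apply the fitted TF-IDF weights before predicting, as in training
stopped_predictions = stopped_classifier.predict(tfidf_transformer.transform(stopped_count_vect.transform(X_test)))
# Score against the true labels (y_test), not the classifier's own predictions
stopped_scores = cross_val_score(stopped_classifier, tdf_vectorizer.fit_transform(X_test), y_test, scoring='accuracy', cv=5)
stopped_scores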

### Classifier with stemming

Next, we will create a classifier but apply stemming to our input data, using the stemming function we defined earlier `stem_doc`. We then run the cross-validation on the results.

In [None]:
stemmed_complaints = complaints.copy()
stemmed_complaints['Consumer complaint narrative'] = complaints['Consumer complaint narrative'].apply(stem_doc)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(stemmed_complaints['Consumer complaint narrative'], 
                                                    stemmed_complaints.Product, random_state=0)
stemmed_count_vect = CountVectorizer()
X_train_counts = stemmed_count_vect.fit_transform(X_train)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
stemmed_classifier = LinearSVC().fit(X_train_tfidf, y_train)

In [None]:
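tdf_vectorizer = TfidfVectorizer(min_df=1)
# Apply the fitted TF-IDF weights before predicting, as in training
stemmed_predictions = stemmed_classifier.predict(tfidf_transformer.transform(stemmed_count_vect.transform(X_test)))
# Score against the true labels (y_test), not the classifier's own predictions
stemmed_scores = cross_val_score(stemmed_classifier, tdf_vectorizer.fit_transform(X_test), y_test, scoring='accuracy', cv=5)
stemmed_scores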

### Comparison of accuracies

You can probably see the general accuracies from each of the cross-validation results, but to make things a little clearer we can build a `DataFrame` to compare them, and then visualize the results. We will do this twice, so we have written a function that builds the comparison for us:

In [None]:
def cross_val_comparison(labels, scores):
    d = {}
    for x, y in zip(labels, scores):
        d[x] = y
    cross_val_scores = pd.DataFrame(d)
    _ = cross_val_scores.boxplot().set_title(
        "Cross validation scores trained on {} records".format(len(complaints)))
    return cross_val_scores

Let us first look at the cross-validation scores for 100000 records as input to our models:

In [None]:
if len(complaints) > 1000:
    cross_val_scores_100k = cross_val_comparison(
        ('No preprocessing', 'With stop words', 'With stemming'),
        (no_pprocess_scores, stopped_scores, stemmed_scores))
    # An expression inside an `if` block is not displayed automatically, so print it
    print(cross_val_scores_100k.mean())

**Q4.13.** What do you observe about the cross-validated accuracies using Linear SVC without pre-processed features, stop word removed features, and stemmed features? Can you explain the reasons behind your observation(s)?

*Edit this cell to type your answer here*

**Re-run your analysis using only 1000 records from the input complaints dataset.**

*Hint: You need to change the number of input rows loaded at the beginning of the [Training of SVM and classification](#Training-of-SVM-and-classification) section by setting `nrows=1000`, then re-run the code cells that come afterwards. Instead of producing the boxplot that precedes Q4.13, skip that cell and run the code cell below so that you can compare the results.*

In [None]:
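# Use <= so the cell also runs if none of the 1000 loaded rows were dropped
if len(complaints) <= 1000:
    cross_val_scores_1k = cross_val_comparison(
        ('No preprocessing', 'With stop words', 'With stemming'),
        (no_pprocess_scores, stopped_scores, stemmed_scores))
    # An expression inside an `if` block is not displayed automatically, so print it
    print(cross_val_scores_1k.mean())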

**Q4.14.** Does this affect your previous observations? If so, provide possible reasons.

In [None]:
_ = ok.grade('q414')

*Edit this cell to type your answer here*

---
When you're finished with exercise 15, ask one of the TAs or the lecturer to discuss your observations.

If you are running this notebook using Binder, choose **Save and Checkpoint** from the **File** menu, **rename** your notebook to add a hyphen and your initials to the notebook name e.g. `Ex15_Text_Classification_of_Consumer_Complaints-DJ`, then choose **Download as Notebook** and save it to your computer or USB stick.

If you are running this notebook on your own machine, choose **Save and Checkpoint** from the **File** menu, choose **Make a copy** from the **File** menu, then **rename** your notebook to add a hyphen and your initials to the notebook name e.g. rename from `Ex15_Text_Classification_of_Consumer_Complaints-Copy1` to `Ex15_Text_Classification_of_Consumer_Complaints-DJ`.