# Text modeling

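In this notebook, we apply text mining techniques to a corpus of spoken lines from *The Simpsons*, restricted to Bart and Lisa Simpson, and create the document-feature matrix. The same steps apply to other labeled text corpora, such as this corpus [of genuine and fake reviews](https://www.kaggle.com/rtatman/deceptive-opinion-spam-corpus).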

In [95]:
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
import math

First, let's read in the data file.

In [55]:
df = pd.read_csv('simpsons.csv')
# Keep only the lines spoken by Lisa Simpson or Bart Simpson
df = df[df["raw_character_text"].isin(["Lisa Simpson", "Bart Simpson"])]
df.head()

Unnamed: 0,raw_character_text,spoken_words
1,Lisa Simpson,Where's Mr. Bergstrom?
3,Lisa Simpson,That life is worth living.
7,Bart Simpson,Victory party under the slide!
9,Lisa Simpson,Mr. Bergstrom! Mr. Bergstrom!
11,Lisa Simpson,Do you know where I could find him?


In [56]:
df['raw_character_text'].value_counts()

Bart Simpson    13759
Lisa Simpson    11489
Name: raw_character_text, dtype: int64

As we can see, there are 13,759 lines spoken by Bart Simpson and 11,489 lines spoken by Lisa Simpson.

To read the text and use it for our analysis, we need an object from `sklearn` called a [`CountVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html). Essentially, it builds a vocabulary from a collection of texts. It lowercases the text and tokenizes it, using whitespace and punctuation as separators between words. I use a list of frequent English words ('stop words') that will not be counted, because they are not informative enough.
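To make the vectorizer's behavior concrete, here is a minimal sketch on two made-up sentences. The `toy_docs` list and the variable names are ours and not part of the dataset. It relies on the default tokenizer, which only keeps tokens of two or more alphanumeric characters.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up documents, for illustration only
toy_docs = ["Eat my shorts!", "I will not eat the shorts, Bart."]

toy_vect = CountVectorizer(stop_words='english').fit(toy_docs)

# vocabulary_ maps each kept word to its column index
toy_vocab = sorted(toy_vect.vocabulary_)
print(toy_vocab)
```

The words come back lowercased and without punctuation, and stop words such as "the" are missing from the vocabulary.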

We will need to convert the text to Unicode strings, so that every entry in the column, including any missing value, is a string the vectorizer can read. We do so by using `.values.astype('U')`.
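The effect of the conversion can be checked on a small made-up column. This is a sketch under the assumption that the column holds generic Python objects (strings plus `NaN` for missing values); `toy_col` is a hypothetical stand-in for `df['spoken_words']`.

```python
import numpy as np
import pandas as pd

# A made-up column with one missing value, stored as generic Python objects
toy_col = pd.Series(["Ay caramba!", np.nan], dtype=object)

as_unicode = toy_col.values.astype('U')
print(as_unicode.dtype.kind)   # 'U' marks a NumPy Unicode string array
print(as_unicode.tolist())     # ['Ay caramba!', 'nan']
```

Note that the missing value becomes the literal text `nan`, which the vectorizer would then count as a word. If that is undesirable, replacing missing values with an empty string first (for instance with `fillna('')`) is one possible alternative.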

In [57]:
from sklearn.feature_extraction.text import CountVectorizer  # The CountVectorizer object

text = df['spoken_words'].values.astype('U')  # Take the text from the df and convert it to Unicode strings
vect = CountVectorizer(stop_words='english')  # Create the CV object, with English stop words
vect = vect.fit(text)  # Fit the model on the spoken lines to build the vocabulary
# Get the words from the vocabulary as a plain list
# (get_feature_names_out() needs scikit-learn >= 1.0; older versions used get_feature_names())
feature_names = vect.get_feature_names_out().tolist()
print(f"There are {len(feature_names)} words in the vocabulary. A selection: {feature_names[500:520]}")


There are 14258 words in the vocabulary. A selection: ['anguished', 'angus', 'anima', 'animal', 'animals', 'animated', 'animation', 'animators', 'anka', 'ankle', 'ann', 'annapolis', 'anne', 'annie', 'anniversary', 'annnnd', 'announce', 'announcement', 'announcements', 'announcer']


Now that we have the vocabulary, we can count the occurrences of each word in each document. This way, we can create a document-feature matrix, with documents (here, the spoken lines) in the rows, and features (words) in the columns.
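As an illustration of what such a matrix looks like, the following sketch builds one for three made-up sentences. The sentences and the `toy_` names are assumptions for the example only. It is small enough to convert to a regular table safely.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Three made-up documents, for illustration only
toy_docs = ["Bart eats donuts", "Lisa plays sax", "Bart plays"]

toy_vect = CountVectorizer()  # no stop words here, to keep every word visible
toy_matrix = toy_vect.fit_transform(toy_docs)

# Order the words by their column index in the matrix
toy_words = sorted(toy_vect.vocabulary_, key=toy_vect.vocabulary_.get)

# Documents in the rows, words in the columns
toy_dfm = pd.DataFrame(toy_matrix.toarray(), columns=toy_words)
print(toy_dfm)
```

Each row is one document and each cell holds the number of times the column's word occurs in it.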

In [92]:
docu_feat = vect.transform(text) #The transform method from the CountVectorizer object creates the matrix
print(docu_feat[0:5,0:5]) #Let's print a little part of the matrix: the first 5 documents & the first 5 words




As you can see, no zeroes are printed. Because the matrix is mostly zeroes, they are left out: only the positions of the cells that _don't_ hold a zero are stored, together with their values. (Nothing at all is printed for this small corner, presumably because it holds no nonzero cells.) This is a so-called _sparse matrix_, which saves a lot of memory. We can convert it to a regular matrix, however, with `.toarray()`. Let's do that and add it to the dataframe.
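The idea of storing only the nonzero cells can be seen on a small made-up matrix. This sketch uses `scipy.sparse.csr_matrix`, the same family of sparse formats that `CountVectorizer` returns; the numbers are invented for the example.

```python
import numpy as np
from scipy.sparse import csr_matrix

# A made-up 3x4 count matrix that is mostly zeroes
dense = np.array([[0, 2, 0, 0],
                  [0, 0, 0, 1],
                  [3, 0, 0, 0]])
sparse = csr_matrix(dense)

# Walk over the stored cells only: (row, column, value)
coo = sparse.tocoo()
stored = [(int(r), int(c), int(v)) for r, c, v in zip(coo.row, coo.col, coo.data)]
print(stored)
print(sparse.nnz, "stored values out of", dense.size, "cells")
```

Only three of the twelve cells are stored, and `.toarray()` restores the full matrix with the zeroes filled back in.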

**NOTE: we are doing this now just to provide an example**. In an application or Big Data analysis, you would not actually do this, because it uses way too much memory. Instead, you would use sparse matrices.
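A quick back-of-the-envelope calculation shows how much memory the dense version would need. The counts come from this notebook (13,759 + 11,489 documents and 14,258 vocabulary words); the 8 bytes per cell assume the counts are stored as 64-bit integers.

```python
# Figures taken from this notebook: number of documents and vocabulary size
n_docs = 13759 + 11489
n_words = 14258
bytes_per_cell = 8  # assumption: int64 counts

dense_bytes = n_docs * n_words * bytes_per_cell
dense_gib = dense_bytes / 2**30
print(f"{n_docs} x {n_words} cells -> {dense_gib:.2f} GiB")
```

That is roughly 2.68 GiB for a table that consists almost entirely of zeroes.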

In [93]:
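#Make a regular matrix out of docu_feat, make it into a DataFrame and concatenate it along the columns.
#df keeps its original row labels after filtering, so we reset the index first to line the rows up.
rev_words = pd.concat([df.reset_index(drop=True), pd.DataFrame(docu_feat.toarray())], axis=1)
rev_words.head()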


MemoryError: Unable to allocate 2.68 GiB for an array with shape (25248, 14258) and data type int64
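One possible way around this error is to keep the counts sparse inside pandas as well. The sketch below uses `pd.DataFrame.sparse.from_spmatrix` on a small made-up matrix; `toy_counts` and `toy_words` are hypothetical stand-ins for `docu_feat` and `feature_names`. Whether this is practical for a vocabulary of this size depends on your pandas version and on what you do with the frame afterwards.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Made-up counts standing in for docu_feat, with made-up column names
toy_counts = csr_matrix([[1, 0, 0], [0, 0, 2], [1, 0, 0]])
toy_words = ['bart', 'lisa', 'sax']

# Build a DataFrame whose columns stay sparse internally
toy_sparse_df = pd.DataFrame.sparse.from_spmatrix(toy_counts, columns=toy_words)

print(toy_sparse_df.shape)
print(toy_sparse_df.sparse.density)  # share of cells that are actually stored
```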

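Now, let's add column labels. The dense conversion above ran out of memory on this data, so the output shown below appears to come from a run on the hotel reviews corpus instead.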

In [17]:
#Relabeling the columns. feature_names contains the words in the text. I've used the v (variable) + underscore to distinguish from the words like 'hotel' in the text.
#Hard-coding the names like this is not really good practice (better would be some operation on the dataframe), but it's a lot clearer.
#Note: these five names belong to the hotel reviews corpus; adapt them to the columns of your own dataframe.

rev_words.columns = ['v_deceptive', 'v_hotel', 'v_polarity', 'v_source', 'v_text'] + feature_names
rev_words.head()

Unnamed: 0,v_deceptive,v_hotel,v_polarity,v_source,v_text,00,000,00a,00am,00pm,...,yum,yummo,yummy,yunan,yup,zagat,zest,zipped,zone,zoo
0,truthful,conrad,positive,TripAdvisor,We stayed for a one night getaway with family ...,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,truthful,hyatt,positive,TripAdvisor,Triple A rate with upgrade to view room was le...,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,truthful,hyatt,positive,TripAdvisor,This comes a little late as I'm finally catchi...,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,truthful,omni,positive,TripAdvisor,The Omni Chicago really delivers on all fronts...,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,truthful,hyatt,positive,TripAdvisor,I asked for a high floor away from the elevato...,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


As we can see, the matrix is almost entirely filled with zeroes.
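The share of zeroes can also be measured directly on the sparse matrix, without converting it. The sketch below does this for a made-up matrix; applying the same two lines to `docu_feat` should work the same way. It assumes that every stored value is a nonzero count, which is what we expect from `CountVectorizer` output.

```python
from scipy.sparse import csr_matrix

# A made-up 3x4 count matrix with two nonzero cells
toy_counts = csr_matrix([[1, 0, 0, 0], [0, 0, 2, 0], [0, 0, 0, 0]])

n_rows, n_cols = toy_counts.shape
share_nonzero = toy_counts.nnz / (n_rows * n_cols)  # nnz = number of stored values
print(f"{1 - share_nonzero:.1%} of the cells are zero")
```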