# Text Mining


**Data Science for Business - Instructor:  Chris Volinsky**



In this notebook we will be using features extracted from text as input into supervised (predictive) models.

In [None]:
# Import the libraries we will be using
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

import matplotlib.pyplot as plt
%matplotlib inline


## Application: Analyzing Customer Tweets for an Airline Company

Our problem setting: You've been hired by Trans American Airlines (TAA) as a business analytics professional. One of the top priorities of TAA is customer service. For TAA, it is of utmost importance to identify whenever customers are unhappy with the way employees have treated them. Your job is to analyze Twitter data in order to detect whenever a customer has complaints about flight attendants. Tweets suspected to be related to flight attendant complaints should be forwarded directly to the customer service department in order to track the issue and take corrective actions.

Let's start by loading the training data, which has been hand labelled with the subject of the tweet and the text of the tweet itself.

[Click here](https://drive.google.com/uc?download&id=1zgbAtmg3Pm2Wg7vMWujsbT2sUBe__Qy7) to download the file "TAA_tweets.csv"

In [None]:
#select file from computer

from google.colab import files
uploaded = files.upload()


In [None]:

tweets = pd.read_csv("TAA_tweets.csv")

pd.set_option("display.max_colwidth", 1000)
tweets[['negativereason','text']].head()

Let's take a look at what people complain about on Twitter.

In [None]:
tweets.negativereason.value_counts()

We will define our target variable based on "Flight Attendant Complaints"



In [None]:
pd.set_option("display.max_colwidth", 1000)  # To display up to 1000 characters
# We'll call our target variable "is_complaint" and keep only the text as a "feature" (really, the text is the field from which we will engineer features)
complaint = "Flight Attendant Complaints"
is_complaint = tweets.negativereason == complaint
tweets['is_complaint']=is_complaint
tweets[is_complaint]['text']


Let's take a look at the percentage of tweets related to complaints about flight attendants, aka the base rate:

In [None]:
is_complaint.mean().round(3)

In [None]:
X = tweets['text']
Y = tweets['is_complaint']

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=.2, random_state=42)
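Because the base rate is low, a purely random split can leave the training and test sets with slightly different shares of positive examples. One option, which the split above does not use, is the `stratify` argument of `train_test_split`. The sketch below uses hypothetical toy labels (`y_toy`, with a made-up 5% base rate) rather than the tweet data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 1000 records with a 5% positive base rate
y_toy = np.zeros(1000, dtype=bool)
y_toy[:50] = True
X_toy = np.arange(1000)  # stand-in for the tweet text

# stratify=y_toy keeps the share of positives the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print("overall base rate:", y_toy.mean())
print("train base rate:  ", y_tr.mean())
print("test base rate:   ", y_te.mean())
```

On the real data the equivalent call would pass `stratify=Y`.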

## Term Frequency Matrix using Binary CountVectorizer
How can we turn the large amount of text for each record into useful features?

We want to create a Term Frequency (TF) matrix, which keeps track of whether or not a word appears in a document/record. The easiest TF matrix is binary - it simply has zeros and ones for which words appear in the document.

You can do this in sklearn with a `CountVectorizer()` by setting `binary=True`. The process is very similar to how you fit a model: you will fit a `CountVectorizer()`, which figures out what words exist in your data.

In [None]:
binary_vectorizer = CountVectorizer(binary=True)
binary_vectorizer.fit(X_train)

In [None]:
# let's look at some of the words and their column indices
vocabulary_list = list(binary_vectorizer.vocabulary_.items())
vocabulary_list[0:20]

How big is the vocabulary in tweets?


In [None]:
vocabulary_size = len(binary_vectorizer.vocabulary_)
print(f"Vocabulary size: {vocabulary_size}")


Now that we have a list of the words that are in the data, we can transform our text into a clean matrix. Use `.transform()` on the raw data using our fitted `CountVectorizer()`. You will do this for the training and test data. What do you think happens if there are new words in the test data that were not seen in the training data?

In [None]:
X_train_binary = binary_vectorizer.transform(X_train)
X_test_binary = binary_vectorizer.transform(X_test)

We can take a look at our new matrices, starting with the shape of `X_train_binary` and then `X_test_binary` itself.

In [None]:
X_train_binary.shape

In [None]:
X_test_binary

Sparse matrix? Where is our data?

If you look at the output above, you will see that the data is stored in a *sparse* matrix (as opposed to the typical dense matrix) with about 3k rows and 13k columns. The rows here are records in the original data and the columns are words. Given the shape, there are about 39m cells that could hold values. However, from the above, we can see that only about 46k of them (~0.12%) are non-zero! Why is this?

To save space, sklearn uses a sparse matrix. This means that only values that are not zero are stored. This saves a ton of space! This also means that visualizing the data is a little trickier. Let's look at a very small chunk.

In [None]:
# Recall that 13183 is the index for "you"
X_test_binary[0:20, 13180:13200].todense()
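The share of filled cells can be computed directly from a sparse matrix. As a minimal sketch, the example below builds a small hypothetical matrix (`dense`, not the tweet data) and uses the `nnz` attribute, which counts the stored values.

```python
import numpy as np
from scipy import sparse

# Hypothetical 4 x 6 term matrix with only 5 non-zero cells
dense = np.array([
    [1, 0, 0, 0, 0, 1],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0],
])
m = sparse.csr_matrix(dense)

n_cells = m.shape[0] * m.shape[1]  # cells a dense matrix would store
density = m.nnz / n_cells          # fraction actually stored

print(f"{m.nnz} stored values out of {n_cells} cells ({density:.1%})")
```

The same calculation on our data would use `X_test_binary.nnz` and `X_test_binary.shape`.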


Now that we have a ton of features (one for every word!) let's try using a logistic regression model to predict which tweets are about flight attendant complaints.  

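We'll need some regularization, so we will set `C=1`. In sklearn, `C` is the inverse of the regularization strength, so smaller values mean stronger regularization; `C=1` is also the default.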

In [None]:
LogReg_bin = LogisticRegression(solver='liblinear',C=1)
LogReg_bin.fit(X_train_binary, Y_train)
# extract probabilities

probs = LogReg_bin.predict_proba(X_test_binary)[:,1]
y_pred = LogReg_bin.predict(X_test_binary)
# get ROC score
fpr, tpr, thresholds = metrics.roc_curve(Y_test, probs)
auc = metrics.roc_auc_score(Y_test, probs)

# plot ROC curve
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % auc)
plt.plot([0, 1], [0, 1], linestyle='dashed', color='black')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title('ROC - Binary (area = %0.4f)' % auc)
print ("AUC  = ",auc.round(3))


# Note that if you were doing this for real, you'd want to make sure you are regularizing well!
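One common way to "regularize well" is to pick `C` by cross-validation rather than fixing it. The sketch below is an illustration under stated assumptions: the corpus (`texts`, `labels`) is a hypothetical toy set, and the grid of `C` values is arbitrary. It uses `GridSearchCV` with a pipeline so the vectorizer is refitted inside each fold.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus (not the TAA tweets): 1 = flight attendant complaint
complaints = ["rude flight attendant", "attendant ignored me",
              "crew was rude", "attendant was unhelpful"]
others = ["bag was lost", "flight delayed again", "late flight",
          "lost my luggage", "delayed two hours", "no wifi on flight"]
texts = (complaints + others) * 5
labels = ([1] * len(complaints) + [0] * len(others)) * 5

pipe = make_pipeline(
    CountVectorizer(binary=True),
    LogisticRegression(solver="liblinear"),
)

# Arbitrary grid of C values; smaller C means stronger regularization
grid = GridSearchCV(
    pipe,
    param_grid={"logisticregression__C": [0.01, 0.1, 1, 10]},
    scoring="roc_auc",
    cv=5,
)
grid.fit(texts, labels)

print(grid.best_params_)
print(round(grid.best_score_, 3))
```

On the notebook's data you would call `grid.fit(X_train, Y_train)` on the raw text and then evaluate the chosen model once on the test set.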


## CountVectorizer with Counts
Instead of using a 0 or 1 to represent the occurrence of a word, we can use the actual counts. We do this the same way as before, but now we leave `binary` set to `False` (the default value).

In [None]:
# Fit a count vectorizer (binary=False is the default, so raw counts are kept)
count_vectorizer = CountVectorizer()
count_vectorizer.fit(X_train)

X_train_counts = count_vectorizer.transform(X_train)
X_test_counts = count_vectorizer.transform(X_test)

In [None]:
# Model
LogReg_counts = LogisticRegression(solver='liblinear',C=1)
LogReg_counts.fit(X_train_counts, Y_train)

# extract probabilities
probs = LogReg_counts.predict_proba(X_test_counts)[:,1]
y_pred = LogReg_counts.predict(X_test_counts)
# get ROC score
fpr, tpr, thresholds = metrics.roc_curve(Y_test, probs)
auc = metrics.roc_auc_score(Y_test, probs)

# plot ROC curve
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % auc)
plt.plot([0, 1], [0, 1], linestyle='dashed', color='black')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title('Counts (area = %0.4f)' % auc)
print ("AUC  = ",auc.round(3))


## TF-IDF Vectorizer

Often we can improve performance by combining the term frequency (TF) within a document with the inverse document frequency (IDF), which down-weights terms that appear in many documents. This way rarer, more distinctive words get more weight. This is called a TF-IDF matrix.

sklearn does this via `TfidfVectorizer()`.

In [None]:
# Fit a TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer()
tfidf_vectorizer.fit(X_train)

X_train_tfidf = tfidf_vectorizer.transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)



In [None]:
LogReg_tfidf = LogisticRegression(solver='liblinear',C=1)
LogReg_tfidf.fit(X_train_tfidf, Y_train)
# extract probabilities

probs = LogReg_tfidf.predict_proba(X_test_tfidf)[:,1]
y_pred = LogReg_tfidf.predict(X_test_tfidf)
# get ROC score
fpr, tpr, thresholds = metrics.roc_curve(Y_test, probs)
auc = metrics.roc_auc_score(Y_test, probs)

# plot ROC curve
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % auc)
plt.plot([0, 1], [0, 1], linestyle='dashed', color='black')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title('TFIDF (area = %0.4f)' % auc)
print ("AUC  = ",auc.round(3))


The `CountVectorizer()` and `TfidfVectorizer()` classes have many options.

We discussed in class the importance of pre-processing, and some of that can be done within the vectorizers. For instance, you can remove **stopwords**, which are very common English words (such as "the" or "was") that carry little information.

You can also include **n-grams**, which are sequences of n consecutive words. **Bigrams** (n-grams with n=2) capture two-word phrases so they can be included in the TF-IDF matrices. Be careful: increasing n adds many more features and so increases the complexity of your model.

In [None]:
# Remove stop words and include n-grams up to n=2

# Fit a TF-IDF vectorizer
ngram_stop_vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1, 2))
ngram_stop_vectorizer.fit(X_train)

X_train_ngs = ngram_stop_vectorizer.transform(X_train)
X_test_ngs = ngram_stop_vectorizer.transform(X_test)

LogReg_ngs = LogisticRegression(solver='liblinear',C=1)
LogReg_ngs.fit(X_train_ngs, Y_train)

# extract probabilities

probs = LogReg_ngs.predict_proba(X_test_ngs)[:,1]
y_pred = LogReg_ngs.predict(X_test_ngs)

# get ROC score
fpr, tpr, thresholds = metrics.roc_curve(Y_test, probs)
auc = metrics.roc_auc_score(Y_test, probs)

# plot ROC curve
plt.plot(fpr, tpr, label='BiGrams+Stopwords (area = %0.4f)' % auc)
plt.plot([0, 1], [0, 1], linestyle='dashed', color='black')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.title('BiGrams+Stopwords (area = %0.4f)' % auc)


In [None]:
# what n-grams are used?
feature_names = ngram_stop_vectorizer.get_feature_names_out()
feature_importance = LogReg_ngs.coef_[0]  # Assuming binary classification
feature_importance_df = pd.DataFrame({'feature': feature_names, 'importance': feature_importance})
top_bi_grams = feature_importance_df.sort_values(by='importance', ascending=False)

N = 20  # You can change this to display more or fewer bi-grams
print(top_bi_grams.head(N))

This list implies that we could also do with some stemming, to combine terms like "gate agent" and "gate agents". This can be done with the `PorterStemmer` class in the `nltk` library.

## Naive and Multinomial Naive Bayes Models

Naive Bayes is a class of classification models built on the assumption that, given the class, all words can be modelled independently of one another. The Bernoulli version (`BernoulliNB`) works on binary term frequency matrices.

Multinomial Naive Bayes (`MultinomialNB`) is a variant of Naive Bayes that works on a matrix of word counts, like the one produced by `CountVectorizer`.

Using these models in sklearn works just the same as the other classification models we've seen ([More details here](http://scikit-learn.org/stable/modules/naive_bayes.html)).

Let's fit both of these and see which one performs better.


In [None]:
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Bernoulli Naive Bayes works on the binary matrix

NaiveB = BernoulliNB(alpha=1)
NaiveB.fit(X_train_binary, Y_train)

probs = NaiveB.predict_proba(X_test_binary)[:,1]
y_pred = NaiveB.predict(X_test_binary)
# get ROC score
fpr, tpr, thresholds = metrics.roc_curve(Y_test, probs)
auc = metrics.roc_auc_score(Y_test, probs)

# print roc curve
plt.plot(fpr, tpr, label='Naive Bayes (area = %0.4f)' % auc)

# Use a new variable name so the MultinomialNB class is not overwritten
MultiNB = MultinomialNB(alpha=1)
MultiNB.fit(X_train_counts, Y_train)

probs = MultiNB.predict_proba(X_test_counts)[:,1]
y_pred = MultiNB.predict(X_test_counts)
# get ROC score
fpr, tpr, thresholds = metrics.roc_curve(Y_test, probs)
auc = metrics.roc_auc_score(Y_test, probs)
plt.plot(fpr, tpr, label='Multinomial (area = %0.4f)' % auc,color='red')
plt.legend()

plt.plot([0, 1], [0, 1], linestyle='dashed', color='black')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve - Naive Bayes")

Naive Bayes has a complexity hyperparameter that is typically tuned: the smoothing value **`alpha`**. You can try to see whether tuning `alpha` helps improve on the results above. (The default is `alpha=1`; try values of `alpha < 1`.)
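To see what `alpha` actually does, here is a minimal sketch on a hypothetical count matrix (`X_toy`, three words and four documents). In `MultinomialNB`, the estimated probability of a word given a class is `(count + alpha) / (total count + alpha * n_words)`, so a word never seen in a class still gets a small non-zero probability.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical counts for 3 words in 4 documents
X_toy = np.array([
    [2, 1, 0],   # class 1
    [1, 0, 0],   # class 1
    [0, 1, 3],   # class 0
    [0, 0, 2],   # class 0
])
y_toy = np.array([1, 1, 0, 0])

smoothed = {}
for alpha in (1.0, 0.1):
    nb = MultinomialNB(alpha=alpha).fit(X_toy, y_toy)
    # Row 1 of feature_log_prob_ is class 1; column 2 is the third word,
    # which never occurs in class 1
    smoothed[alpha] = np.exp(nb.feature_log_prob_[1, 2])
    print(f"alpha={alpha}: P(word 3 | class 1) = {smoothed[alpha]:.4f}")
```

Class 1 has 4 word occurrences in total and none of the third word, so the estimates are `1 / 7` for `alpha=1` and `0.1 / 4.3` for `alpha=0.1`: a smaller `alpha` trusts the observed counts more.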


Also, there are other versions of naive Bayes. For instance, **Gaussian Naive Bayes (GNB)** can be used when we have other numeric features that we can use in the predictive model (like, say, the age of the tweeter). Sometimes GNB and Bernoulli NB are combined when one has features of mixed types.


**Practice at home**:
- Create a tweet of your own that is about Flight Attendant Complaints and calculate what the model thinks the probability is of it being about a Flight Attendant problem.  
- Try some of these other models yourself, or maybe try a regularized (Lasso, Ridge) regression model.
- Or, see which type of complaint (Flight Attendant, Luggage, Bad flight) is the easiest to detect.  
- Cluster the data (Topic Modelling) without using the label and see if you come up with other interesting clusters.
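For the first exercise, the general pattern is to transform the new text with the already fitted vectorizer and then call `predict_proba`. The sketch below is self-contained, so it trains on a hypothetical toy corpus (`train_texts`, `train_labels`) instead of the TAA tweets; the two example tweets are made up as well.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy training data: 1 = flight attendant complaint
train_texts = [
    "the flight attendant was rude",
    "rude attendant ignored my request",
    "flight attendant yelled at me",
    "the attendant was unhelpful and rude",
    "my bag was lost",
    "the flight was delayed again",
    "lost luggage and no refund",
    "flight delayed for two hours",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]

toy_vec = TfidfVectorizer()
toy_model = LogisticRegression(solver="liblinear", C=1)
toy_model.fit(toy_vec.fit_transform(train_texts), train_labels)

# New text must go through the SAME fitted vectorizer (transform, not fit)
my_tweets = [
    "the flight attendant was so rude to me",
    "my bag was lost on a delayed flight",
]
my_probs = toy_model.predict_proba(toy_vec.transform(my_tweets))[:, 1]

for tweet, prob in zip(my_tweets, my_probs):
    print(f"{prob:.3f}  {tweet}")
```

In the notebook itself, the equivalent would be something like `LogReg_tfidf.predict_proba(tfidf_vectorizer.transform(["your tweet here"]))[:, 1]`.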

