# [Computational Social Science]
## 5-2 Exploratory Data Analysis and Unsupervised Methods - Solutions


This lab will demonstrate some exploratory methods for finding separating words, and introduce unsupervised topic models.

In [None]:
#!pip install scattertext
#!pip install wordcloud
#!pip install textblob

In [None]:
import pandas as pd
import numpy as np
from PIL import Image
import spacy
import en_core_web_sm
import scattertext as st
nlp = en_core_web_sm.load()
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from textblob import TextBlob
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

## Data

<img src = "../../images/cfpb logo.png"  />

We'll once again use the Consumer Financial Protection Bureau's [Consumer Complaint Database](https://www.consumerfinance.gov/data-research/consumer-complaints/). This time, we are going to focus on figuring out whether we can find text features that help distinguish different "Products." There are several products represented in the dataset:

In [None]:
cfpb = pd.read_csv("../../data/CFPB 2020 Complaints.csv")
cfpb = cfpb.dropna(subset = ['Consumer complaint narrative'])
cfpb['Product'].unique()

For the first few exercises, we will focus on mortgages and student loans. We will also just use the first one thousand observations so that the code runs faster. 

In [None]:
cfpb = cfpb[(cfpb['Product']=='Mortgage') | (cfpb['Product'] == 'Student loan')]
cfpb = cfpb[:1000]

In [None]:
cfpb.head()

## Preprocessing

Let's start by creating our tokens. We'll use the same `rem_punc_stop()` function we defined last time.

In [None]:
def rem_punc_stop(text):
    stop_words = STOP_WORDS
    punc = set(punctuation)
    
    punc_free = "".join([ch for ch in text if ch not in punc])
    
    doc = nlp(punc_free)
    
    spacy_words = [token.text for token in doc]
    
    no_punc = [word for word in spacy_words if word not in stop_words]
    
    return no_punc

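Notice here how we use the `map()` function to apply our `rem_punc_stop()` function to every row of our dataframe. `map()` is usually more concise, and often somewhat faster, than writing an explicit for loop over the rows. Most of the runtime here is spent inside spaCy, though, so don't expect a dramatic speedup. Other options include [list comprehensions](https://docs.python.org/3/tutorial/datastructures.html) and, when the operation allows it, vectorized numpy operations, which are typically the fastest.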

In [None]:
cfpb['tokens'] = cfpb['Consumer complaint narrative'].map(lambda x: rem_punc_stop(x))
cfpb['tokens']

## Word Cloud

One of the most popular text analysis visualizations is the word cloud. Word clouds visualize the most frequent words in a corpus, and size them according to frequency.

In [None]:
text = ' '.join(cfpb['tokens'].map(lambda text: ' '.join(text)))
wordcloud = WordCloud().generate(text)
plt.imshow(wordcloud, interpolation = 'bilinear')
plt.axis('off')
plt.show()
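Since the size of each word in the cloud is driven by how often it appears, it can be useful to look at the raw counts directly. The following is a minimal sketch using `collections.Counter`; the `token_lists` variable is a hypothetical stand-in for the lists stored in `cfpb['tokens']`.

```python
from collections import Counter

# Hypothetical token lists standing in for cfpb['tokens']
token_lists = [
    ["loan", "payment", "late"],
    ["mortgage", "payment", "escrow"],
    ["loan", "payment", "servicer"],
]

# Flatten the per-document lists and count each token
counts = Counter(token for tokens in token_lists for token in tokens)

# The most frequent tokens are the ones a word cloud draws largest
top_words = counts.most_common(2)
print(top_words)  # [('payment', 3), ('loan', 2)]
```

On the real data, the same pattern applied to `cfpb['tokens']` should surface the words that dominate the cloud.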

There are lots of ways to customize a word cloud, including by changing the background color:

In [None]:
text = ' '.join(cfpb['tokens'].map(lambda text: ' '.join(text)))
wordcloud = WordCloud(background_color = "white").generate(text)
plt.imshow(wordcloud, interpolation = 'bilinear')
plt.axis('off')
plt.show()

And we can even overlay the wordcloud onto an image:

In [None]:
cfpb_mask = np.array(Image.open("../../images/cfpb logo.png"))
text = ' '.join(cfpb['tokens'].map(lambda text: ' '.join(text)))
wordcloud = WordCloud(background_color = "white", 
                     mask = cfpb_mask,
                     width = 1000,
                     height = 1000).generate(text)
plt.imshow(wordcloud, interpolation = 'bilinear')
plt.axis('off')
plt.show()

### Challenge

Notice that in the above word clouds, tokens like "XXXX" and "XXXXXXXX" appear frequently. These are redaction placeholders (often dates or other identifying details) and likely won't help us with classification. Try to rewrite `rem_punc_stop` to remove these.

**Hint**: Try taking a look at `nlp.Defaults.stop_words` and see if there are any associated methods that might help.

In [None]:
def rem_punc_stop(text):
    # Add the redaction tokens to spaCy's default stop words.
    # Individually:
    # nlp.Defaults.stop_words.add("XX")
    # nlp.Defaults.stop_words.add("XXXX")
    # nlp.Defaults.stop_words.add("XXXXXXXX")
    
    # Or all at once, using the in-place set union operator |=
    nlp.Defaults.stop_words |= {"XX", "XXXX", "XXXXXXXX"}
    stop_words = nlp.Defaults.stop_words
    
    punc = set(punctuation)
    
    punc_free = "".join([ch for ch in text if ch not in punc])
    
    doc = nlp(punc_free)
    
    spacy_words = [token.text for token in doc]
    
    no_punc = [word for word in spacy_words if word not in stop_words]
    
    return no_punc

In [None]:
cfpb['tokens'] = cfpb['Consumer complaint narrative'].map(lambda x: rem_punc_stop(x))
text = ' '.join(cfpb['tokens'].map(lambda text: ' '.join(text)))
wordcloud = WordCloud(background_color = "white").generate(text)
plt.imshow(wordcloud, interpolation = 'bilinear')
plt.axis('off')
plt.show()

**Bonus**: This is an example where using [regular expressions](https://docs.python.org/3/library/re.html) can be useful. Instead of inputting all of the different ways that something like "XX" might show up, you can use regex to find and remove all similar patterns.
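As a rough sketch of the regex approach, the snippet below strips redaction placeholders from raw text. The pattern is an assumption about how redactions look (runs of two or more capital X's, optionally joined by slashes as in `XX/XX/XXXX`), and `strip_redactions` is a hypothetical helper, not part of the lab's code.

```python
import re

# Assumption: redactions are runs of two or more capital X's,
# optionally joined by slashes as in dates like "XX/XX/XXXX".
REDACTION = re.compile(r"\b(?:X{2,})(?:/X{2,})*\b")

def strip_redactions(text):
    """Remove redaction placeholders and collapse leftover whitespace."""
    cleaned = REDACTION.sub(" ", text)
    return " ".join(cleaned.split())

example = "On XX/XX/XXXX I called XXXX about my loan"
print(strip_redactions(example))  # On I called about my loan
```

One way to use this would be to call it on the raw narrative before stripping punctuation inside `rem_punc_stop`, so that a single pattern covers every redaction length.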

## Lengths and Counts

One of the basic things we might look for when analyzing text data is the length of a document. Let's see how we might grab the total number of characters and the total number of words:

In [None]:
cfpb['complaint_len'] = cfpb['Consumer complaint narrative'].apply(len)
cfpb['word_count'] = cfpb['Consumer complaint narrative'].apply(lambda x: len(str(x).split()))

In [None]:
sns.displot(cfpb, x="complaint_len")
plt.show()

In [None]:
sns.displot(cfpb, x="word_count")
plt.show()

In [None]:
sns.displot(cfpb, x="word_count", hue = "Product", col = "Product")
plt.show()

## Sentiment Polarity

A common area of research in the social sciences is thinking about the "sentiment" of a text. The [`TextBlob`](https://textblob.readthedocs.io/en/dev/quickstart.html) library gives us access to a pre-trained sentiment analysis model. Text might be characterized as "positive," "negative," or "neutral" on a [-1,1] scale with -1 being highly negative and 1 being highly positive. Before we look at the code, do you expect that the sentiment scores for these data should be negative or positive? 

In [None]:
cfpb['tokens'] = cfpb['tokens'].map(lambda text: ' '.join(text))
cfpb['polarity'] = cfpb['tokens'].map(lambda text: TextBlob(text).sentiment.polarity)
cfpb.head()

In [None]:
sns.displot(cfpb, x="polarity")
plt.show()

What's going on here? Why does sentiment look so close to neutral, or even slightly positive? We know that all of the narratives in this dataset are consumer **complaints**, so we should expect them to look somewhat negative. Let's look at the 5 most positive complaints:

In [None]:
for complaint in cfpb.nlargest(5, 'polarity')['Consumer complaint narrative']:
    print(complaint + "\n")

Do we have any words that are skewing things? Let's look at the sentiment score for this first comment, and the individual sentiments of the words in it:

In [None]:
sample_complaint = "the company said they are offering a covid relief program which I requested assistance and they are saying a balloon payment is owed in XXXX I called the company and I was told that if I can't make this payment they will be talking taking litigation steps how are people who have lost their job able to keep their homes"
print("overall polarity score is ", TextBlob(sample_complaint).sentiment.polarity)
for word in sample_complaint.split():
    print(word, TextBlob(word).sentiment.polarity)

Only one word actually has a sentiment score ("able")! TextBlob's sentiment polarity is not a simple average over every word in a string: words that are missing from its sentiment lexicon contribute nothing, so a single scored word can set the score for an entire complaint. This is why preprocessing is important and why you should validate these types of off-the-shelf methods. Let's take a look at the most negative complaints and see if these make sense:

In [None]:
for complaint in cfpb.nsmallest(5, 'polarity')['Consumer complaint narrative']:
    print(complaint + '\n')

In [None]:
sample_complaint = "Navient is the worst company to ever exist. Website does not work. Do the people at customer service even work for navient??? They don't know anything about whats going on. Applied for a repayement plan and their website always says an error has occured."
print("overall polarity score is ", TextBlob(sample_complaint).sentiment.polarity)
for word in sample_complaint.split():
    print(word, TextBlob(word).sentiment.polarity)

In this case we have a perfectly negative sentiment (-1.0), but again only one word is contributing - "worst". Sentiment polarity is a powerful tool, but not automatically suited to inference. That being said, maybe it can be helpful for distinguishing between labels. We can take a look at how polarity differs across mortgages and student loans:

In [None]:
sns.displot(cfpb, x="polarity", hue = "Product", col = "Product")
plt.show()
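To see why a single scored word can dominate the polarity of a whole complaint, consider a toy lexicon-based scorer. This is only an illustration under simplifying assumptions: the scores in `toy_lexicon` are made up, and `toy_polarity` is a hypothetical function that ignores the negation and intensifier handling a real analyzer would include.

```python
# Toy lexicon: hypothetical polarity scores, not TextBlob's real values
toy_lexicon = {"worst": -1.0, "able": 0.5}

def toy_polarity(text):
    """Average the scores of words found in the lexicon; ignore the rest."""
    scores = [toy_lexicon[w] for w in text.lower().split() if w in toy_lexicon]
    return sum(scores) / len(scores) if scores else 0.0

print(toy_polarity("Navient is the worst company to ever exist"))  # -1.0
print(toy_polarity("how are people able to keep their homes"))     # 0.5
print(toy_polarity("the website always says an error"))            # 0.0
```

Under this kind of scheme, a long and clearly unhappy complaint with no lexicon words scores as neutral, which is consistent with the pile-up near zero in the histograms.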

### Challenge

In addition to sentiment polarity, TextBlob also has a method for determining how "objective" or "subjective" a piece of text is. Plot the subjectivity measure by Product. Do these results make sense?

In [None]:
cfpb['subjectivity'] = cfpb['tokens'].map(lambda text: TextBlob(text).sentiment.subjectivity)
sns.displot(cfpb, x="subjectivity")
plt.show()

In [None]:
sns.displot(cfpb, x="subjectivity", hue = "Product", col = "Product")
plt.show()

**Answer**: We get "subjectivity" centered around 0.5, roughly midway between fully objective (0) and fully subjective (1). Maybe this makes sense as each complaint will be filled with both factual information and opinions about how the company is bad, but it is hard to say exactly.

## ScatterText

Next we'll take a look at a useful visualization for finding separating words. We'll use the [ScatterText](https://spacy.io/universe/project/scattertext) library to visualize both word frequencies and how well they separate two classes.

In [None]:
corpus = st.CorpusFromPandas(cfpb[:5000],
                            category_col = 'Product',
                            text_col = 'tokens',
                            nlp = nlp).build()

In [None]:
html = st.produce_scattertext_explorer(corpus,
                                       category='Student loan',
                                       category_name='Student loan',
                                       not_category_name='Mortgage',
                                        width_in_pixels=1000,
                                        minimum_term_frequency=5,
                                       metadata=cfpb['Complaint ID'])

In [None]:
# Use a context manager so the file is closed once the HTML is written
with open("CFPB Sentiment.html", 'wb') as f:
    f.write(html.encode('utf-8'))

## Unsupervised Methods

Finally, we'll look at unsupervised machine learning methods for text data. Specifically, we'll implement [Latent Dirichlet Allocation (LDA)](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation), which is a classic method for topic modeling. Topic models can help us uncover structure within a corpus. LDA does so through a "mixture model" - meaning every document is assumed to be "about" several topics, and we try to estimate the proportion each topic contributes to a document. Let's reload our cfpb dataset and look at "Checking or savings account" and "Student loan" products this time:

In [None]:
cfpb = pd.read_csv("../../data/CFPB 2020 Complaints.csv")
cfpb = cfpb.dropna(subset = ['Consumer complaint narrative'])
cfpb = cfpb[(cfpb['Product']=='Checking or savings account') | (cfpb['Product'] == 'Student loan')]
cfpb = cfpb[:1000]
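Before fitting anything, the "mixture model" idea can be made concrete with a small simulation. The sketch below generates a document word by word: pick a topic according to the document's topic proportions, then pick a word from that topic. The topics, vocabulary, and proportions are all hypothetical, and the proportions are fixed by hand here, whereas LDA treats them as draws from a Dirichlet distribution and estimates them from data.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical topics: each is a distribution over a tiny vocabulary
topics = {
    "banking": {"account": 0.5, "deposit": 0.3, "fee": 0.2},
    "student_loans": {"loan": 0.5, "servicer": 0.3, "payment": 0.2},
}

def generate_document(topic_weights, n_words):
    """Draw each word by first picking a topic, then a word from that topic."""
    words = []
    for _ in range(n_words):
        topic = rng.choices(list(topic_weights),
                            weights=list(topic_weights.values()))[0]
        word_dist = topics[topic]
        word = rng.choices(list(word_dist),
                           weights=list(word_dist.values()))[0]
        words.append(word)
    return words

# A document that is 80% about student loans and 20% about banking
doc = generate_document({"student_loans": 0.8, "banking": 0.2}, n_words=10)
print(doc)
```

Fitting LDA runs this story in reverse: given only the words, it tries to recover plausible topics and per-document proportions.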

Let's start by creating our tf-idf matrix again:

In [None]:
X = cfpb['Consumer complaint narrative']
tf = TfidfVectorizer(tokenizer = rem_punc_stop)

tfidf_matrix =  tf.fit_transform(X)
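# toarray() returns a regular ndarray; todense() returns np.matrix, which recent scikit-learn versions reject
dense_matrix = tfidf_matrix.toarray()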
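For intuition about what the tf-idf matrix holds, here is a hand-rolled sketch on two tiny hypothetical documents. It assumes scikit-learn's default settings as I understand them: raw term counts, the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The names `docs` and `tfidf_row` are illustrative only.

```python
import math
from collections import Counter

# Hypothetical tokenized documents
docs = [["loan", "payment", "loan"], ["account", "payment"]]

n_docs = len(docs)
vocab = sorted({t for d in docs for t in d})  # ['account', 'loan', 'payment']
# Document frequency: number of documents containing each term
df = Counter(t for d in docs for t in set(d))

def tfidf_row(tokens):
    """Term count times smoothed idf, then scale the row to unit length."""
    tf = Counter(tokens)
    raw = [tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

matrix = [tfidf_row(d) for d in docs]
for row in matrix:
    print([round(v, 3) for v in row])
```

In the first document, "loan" ends up with a larger weight than "payment" both because it occurs twice and because "payment" appears in every document.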

In [None]:
# k-means clustering

In [None]:
from sklearn.cluster import KMeans

In [None]:
kmeans = KMeans(n_clusters=3,
               max_iter=300
               ).fit(tfidf_matrix)

In [None]:
# We can access the centers of the clusters through the cluster_centers_ attribute.
# To get the labels (i.e., the cluster assigned to each document) we use labels_.

In [None]:
cfpb['cluster']= kmeans.labels_

In [None]:
cfpb

Now that we have a dense matrix, let's apply our LDA model. The key hyperparameter here is the `n_components` argument. Let's start with 5, and then print out our topics to see what we get.

In [None]:
lda = LatentDirichletAllocation(n_components=5, max_iter=20, random_state=0)
lda = lda.fit(dense_matrix)

In [None]:
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("\nTopic #{}:".format(topic_idx))
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()
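The slice `topic.argsort()[:-n_top_words - 1:-1]` in `print_top_words` is compact but easy to misread. The following plain-Python sketch reproduces the same idea on hypothetical weights: sort the indices by weight in ascending order, then walk backwards from the end to collect the largest ones first.

```python
feature_names = ["account", "bank", "loan", "payment"]
topic_weights = [0.1, 0.7, 0.05, 0.4]  # hypothetical weights for one topic

n_top_words = 2
# Indices sorted from smallest to largest weight (what argsort returns)
ascending = sorted(range(len(topic_weights)), key=lambda i: topic_weights[i])
# Walk backwards from the end to take the n largest, largest first
top_idx = ascending[:-n_top_words - 1:-1]
top_words = [feature_names[i] for i in top_idx]
print(top_words)  # ['bank', 'payment']
```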

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
tf_feature_names = tf.get_feature_names_out()
print_top_words(lda, tf_feature_names, 20)

We have 5 topics! Some seem to be sensible (e.g., topic 2 seems to be about banking while topic 3 is about student loans), but notice that the computer doesn't find the right "topic names" for us automatically - so there is still a role for humans to interpret and make sense of the output.

## Topic weights

One thing we may want to do with the output is compare the prevalence of each topic across documents. A simple way to do this is to merge the topic distribution back into the pandas dataframe.

First get the topic distribution array.

In [None]:
topic_dist = lda.transform(tfidf_matrix)
topic_dist

And then we'll merge back with original dataframe:

In [None]:
topic_dist_df = pd.DataFrame(topic_dist)
df_w_topics = topic_dist_df.join(cfpb.reset_index())
df_w_topics.head()

Now we can check the average weight of each topic across Product using `groupby`.

In [None]:
grouped = df_w_topics.groupby('Product')
for i in range(0, 5):
    print(grouped[i].mean().sort_values(ascending=False))

Topics 2 and 3 seem to have a lot of separation between checking or savings accounts and student loans. Let's plot them:

In [None]:
sns.displot(df_w_topics, x=df_w_topics[2], hue = "Product", kind = 'kde', fill = True)
plt.show()

In [None]:
sns.displot(df_w_topics, x=df_w_topics[3], hue = "Product", kind = 'kde', fill = True)
plt.show()

This is a very clean separation! These two topics would be good candidates to pass to a supervised learning algorithm if we were interested in predicting the Product.

## Challenge

Try retraining the LDA with a different number of topics, say 10. What do you notice? How is this similar to issues we've seen with other clustering algorithms?

In [None]:
lda = LatentDirichletAllocation(n_components=10, max_iter=20, random_state=0)
lda = lda.fit(dense_matrix)

def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("\nTopic #{}:".format(topic_idx))
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()
    
tf_feature_names = tf.get_feature_names_out()
print_top_words(lda, tf_feature_names, 20)

**Answer**: The topics produced are very sensitive to our choice of `n_components`. This is similar to the problem of choosing the right number of clusters in other unsupervised learning methods, like k-means.

---
Notebook developed by Aniket Kesari