# Text analysis Pt II

We previously looked at using dictionary and machine learning tools to conduct sentiment analysis. In this class, we'll continue our discussion of text analysis by looking at some more exploratory methods for describing or understanding text data.

To start, install (or at least attempt to install) these packages, then restart the kernel. We'll use them later in the class.

In [None]:
%pip install pyLDAvis
%pip install wordcloud


In [None]:
import numpy as np 
import pandas as pd 

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.feature_extraction.text import CountVectorizer
from sklearn import preprocessing
from sklearn.decomposition import LatentDirichletAllocation

import nltk
from nltk import tokenize
from nltk.corpus import stopwords
from nltk import SnowballStemmer

import pyLDAvis
from pyLDAvis import lda_model
from IPython.display import display, HTML



We'll start by reading in some example data. This .csv contains articles scraped from CNN and Fox News from around 2020 through 2023:

In [None]:
articles = pd.read_csv('https://github.com/Neilblund/APAN/raw/main/news_sample.csv')
# converting date to datetime
articles['date'] = pd.to_datetime(articles['date'])

# stripping some excess whitespace
articles['headline'] = articles.headline.str.strip()
articles['text'] = articles.text.str.strip()
# creating a hyperlink for later use
articles['hyperlink'] = articles.apply(axis=1, func=lambda x: f'<a href="{x.url}">{x.headline}</a>')

articles.head()

The articles in this collection are equally sized random samples from each source and each year:

In [None]:
pd.crosstab(articles['source'], articles['year(date)'])

...although there are some disparities in the amount of coverage within each year:

In [None]:
weekly_counts = articles.groupby([pd.Grouper(key='date', freq='W'), 'source'])['url'].size().reset_index()
weekly_counts = weekly_counts.pivot_table(index='date', columns ='source', values ='url')
full_week_range = pd.date_range(start=min(articles.date).to_period('W').start_time,
                                end=max(articles.date).to_period('W').start_time,
                                freq='W')

weekly_counts = weekly_counts.reindex(full_week_range, fill_value=0)



In [None]:
weekly_counts.plot()
plt.xlabel("Week")
plt.ylabel("Weekly stories")
# Place the legend above the plot
plt.legend(loc='center left', bbox_to_anchor=(.4, 1.2))
plt.tight_layout() 
plt.show()

Our goal in this analysis will be to compare and contrast the contents of these texts. We do have some additional data about each article (the source and the headline), but the most interesting part is the text itself.

# Pre-processing data
Just like in the previous class, we'll start by doing some initial pre-processing to our texts to make them into a document-term matrix. Our steps here will be more-or-less the same ones we used for the supervised modeling:

 - Splitting text into individual words
 - Lower casing and removing stop words
 - Stemming to remove endings like -ed or -ing
 - Creating a Document-Term-Matrix with one column per word and one row per document
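
To make these four steps concrete, here is a minimal pure-Python sketch on two made-up sentences. The tiny stop word list and the suffix-stripping rule are simplified stand-ins (assumptions for illustration), not the NLTK stop word list or the Snowball stemmer used in the rest of this notebook.

```python
from collections import Counter

# Toy inputs: these two "documents" are invented for illustration.
toy_docs = [
    "The senators voted on the bill",
    "Voting on the bill was delayed",
]

# Stand-in for a real stop word list (assumption: a handful of words is enough here).
toy_stopwords = {"the", "on", "was"}

def crude_stem(word):
    """Strip a few common endings. A real stemmer (e.g. Snowball) is far more careful."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_tokenize(text):
    # Steps 1-3: split into words, lower-case and drop stop words, then stem
    words = text.lower().split()
    return [crude_stem(w) for w in words if w.isalpha() and w not in toy_stopwords]

# Step 4: build the document-term matrix (one row per document, one column per term)
counts = [Counter(toy_tokenize(doc)) for doc in toy_docs]
vocabulary = sorted(set().union(*counts))
toy_dtm = [[c[term] for term in vocabulary] for c in counts]

print(vocabulary)
for row in toy_dtm:
    print(row)
```

Note how "voted" and "Voting" end up in the same column once they are lower-cased and stemmed.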


In [None]:

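# Assumes the NLTK tokenizer and stop word data have been downloaded (via nltk.download)
eng_stopwords = set(stopwords.words('english'))
stemmer = SnowballStemmer("english")
def tokenize(text):
    # lower-case first so that capitalized stop words ("The", "And") are also removed
    tokens = nltk.word_tokenize(text.lower())
    return [stemmer.stem(token) for token in tokens if token not in eng_stopwords and token.isalpha()]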




vectorizer = CountVectorizer(analyzer = 'word',
                             tokenizer = tokenize,
                             ngram_range=(1,1), # Tokens are individual words for now
                             strip_accents='unicode',
                             max_df = 0.1, # drop terms that occur in more than 10% of documents
                             min_df = .0025 # drop terms that occur in fewer than 0.25% of documents
                            )


dfm = vectorizer.fit_transform(articles['text'])

# get the names of the features for future use
features = vectorizer.get_feature_names_out()
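
When `max_df` and `min_df` are given as floats, `CountVectorizer` treats them as proportions of documents rather than raw counts. The sketch below mimics that kind of filter on a toy corpus. The documents, thresholds, and the `filter_by_document_frequency` helper are invented for illustration; this is the general idea, not scikit-learn's implementation.

```python
# Hypothetical tokenized documents (already lower-cased and stemmed).
toy_tokens = [
    ["presid", "vote", "elect"],
    ["presid", "vaccin", "mask"],
    ["presid", "vote", "border"],
    ["presid", "border", "migrant"],
]

def filter_by_document_frequency(docs, min_df, max_df):
    """Keep terms whose share of documents lies within [min_df, max_df]."""
    n_docs = len(docs)
    doc_freq = {}
    for doc in docs:
        for term in set(doc):  # count each term at most once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return sorted(
        term for term, df in doc_freq.items()
        if min_df <= df / n_docs <= max_df
    )

kept_terms = filter_by_document_frequency(toy_tokens, min_df=0.5, max_df=0.75)
print(kept_terms)
```

Here "presid" is dropped because it appears in every document, and the terms that appear only once are dropped for being too rare.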


The `dfm` object we created is a sparse matrix. By using the `.toarray()` method on this object, we can convert it to a numpy array and then use methods like `sum`, `sort`, and `where` to manipulate the results. For instance, if we wanted to calculate the number of occurrences of each term, we could use:

In [None]:
dfm.toarray().sum(axis=0)

Or we could sort this in descending order:

In [None]:
np.sort(dfm.toarray().sum(axis=0))[::-1]

Or, we could use `argsort` to get the indices that will sort the array from highest to lowest:

In [None]:
np.argsort(dfm.toarray().sum(axis=0))[::-1]

Or we could sort the `features` (the terms used to build the document-feature matrix) in order of their occurrence from most frequent to least frequent:

In [None]:
features = vectorizer.get_feature_names_out()
indices = np.argsort(dfm.toarray().sum(axis=0))[::-1]
features[indices]

Finally, we could use NumPy's `where` function on the `source` column of the original data frame to get the indices of documents that come from a particular source:

In [None]:
np.where(articles.source=="Fox News")

And we could put all of this together to do something like identify the most common terms in Fox News stories:

In [None]:
fox_articles = np.where(articles.source=="Fox News")
fox_dfm = dfm.toarray()[fox_articles]
indices = np.argsort(fox_dfm.sum(axis=0))[::-1]
features[indices][:10].tolist()

## Descriptive visualizations

We can start with a couple of simple descriptive and comparative visualizations for each text. 

One simple exploratory approach might be to look at what terms are more strongly associated with Fox News compared to CNN.

The `calcKeyness` function is a custom function included in the `text_functions.py` file in this directory. It gives us a way to compare term frequencies from two different sources using their relative ($\log_2$) odds ratios. 
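
As a rough illustration of the idea, the sketch below computes a smoothed $\log_2$ odds ratio for a single term from raw counts. This is not the code inside `calcKeyness` (its implementation is not shown here); the function name, the smoothing constant, and the counts are all assumptions made for the example.

```python
import math

def log2_odds_ratio(target_count, target_total, baseline_count, baseline_total, smoothing=0.5):
    """Log2 odds ratio of a term in a target group relative to a baseline group.

    Positive values mean the term is relatively more common in the target group,
    negative values mean it is relatively more common in the baseline group.
    """
    target_odds = (target_count + smoothing) / (target_total - target_count + smoothing)
    baseline_odds = (baseline_count + smoothing) / (baseline_total - baseline_count + smoothing)
    return math.log2(target_odds / baseline_odds)

# Made-up counts: a term used 40 times out of 10,000 tokens in the target group
# and 10 times out of 10,000 tokens in the baseline group.
print(round(log2_odds_ratio(40, 10_000, 10, 10_000), 2))
# Swapping the two groups flips the sign.
print(round(log2_odds_ratio(10, 10_000, 40, 10_000), 2))
```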

The general usage will be something like `calcKeyness(X, targets)`, where `X` is the sparse matrix that we get from using `CountVectorizer` and `targets` is a boolean vector that equals `False` for the "baseline" category and `True` for the category we want to compare against. 
The code below calculates the odds ratio for each term, so that terms that appear more often in CNN articles get positive values and terms that show up more often in Fox News articles get negative values:

In [None]:
from text_functions import calcKeyness
keyterms = calcKeyness(X=dfm,                         # the document-term matrix
                       targets = articles['source'] == "CNN", # True if the article is from CNN (False for Fox News)
                       minimum_threshold=200,       # remove words that occur fewer than 200 times
                       feature_names=features)      # including the vocabulary so we have labels for each term

The negative values are terms more strongly associated with Fox News articles:

In [None]:
keyterms.head(n=5)

Terms with a positive odds ratio are more strongly associated with CNN:

In [None]:
keyterms.tail(n=5)

A visualization can also be helpful for getting a sense of these results. If you were following the news in 2020, you might be able to spot some terms associated with stories that were in the news around that time:

In [None]:
top_bottom = pd.concat([keyterms.iloc[:15], keyterms.iloc[-15:]])
ax = sns.barplot(data=top_bottom,
                 y= 'term',    
                 hue='term',
                x=top_bottom['oddsratio'],dodge=False, palette='turbo')
ax.set(xlabel='Term associations Fox News (negative values) vs. CNN (positive values)', ylabel='term')

### Wordclouds 
We can also make a word cloud, either for the entire corpus, or separately for each source.

(Word clouds are not necessarily a great way to visualize text, but they look cool and people like them, so there's something to be said for playing the hits.)

In [None]:
from wordcloud import WordCloud

wordcloud = WordCloud().generate_from_frequencies(dict(zip(vectorizer.get_feature_names_out(), dfm.toarray().sum(axis=0))))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

Or, we can compare CNN to Fox side by side:

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

cnn_dfm = dfm[np.where(articles['source'] == "CNN")]

cnn_wordcloud = WordCloud().generate_from_frequencies(dict(zip(vectorizer.get_feature_names_out(), cnn_dfm.toarray().sum(axis=0))))
axes[0].imshow(cnn_wordcloud, interpolation='bilinear')
axes[0].axis("off")
axes[0].set_title("CNN")

fox_dfm = dfm[np.where(articles['source'] == "Fox News")]
fox_wordcloud = WordCloud().generate_from_frequencies(dict(zip(vectorizer.get_feature_names_out(), fox_dfm.toarray().sum(axis=0))))
axes[1].imshow(fox_wordcloud, interpolation='bilinear')
axes[1].axis("off")
axes[1].set_title("Fox News")

plt.show()

# Topic Modeling

We know that terms like "migrant", "immigrant", "border" etc. probably all reference the same general idea, but our bag-of-words method fails to capture even fairly obvious relationships like this. Topic modeling is a way to identify important themes or ideas in texts that can pick up on some of these conceptual similarities.

We'll apply **Latent Dirichlet Allocation** (LDA), a common method for topic modeling, to these texts.


### Latent Dirichlet Allocation

LDA is a statistical model that identifies topics by taking advantage of the fact that related words tend to appear together in the same documents. LDA is an example of an **unsupervised machine learning model**: we don't have any labels, we just have a collection of documents and a rough idea of the number of topics, and the model will infer the rest for us.


### Dirichlet distributions

Before starting, it's useful to have a rough idea of what we mean when we talk about a "latent Dirichlet". A Dirichlet distribution is a probability distribution that can be used to model the probabilities of multivariate outcomes. For instance, imagine trying to manufacture some fair six-sided dice. For each die you make, the probabilities of all six faces should sum to one, and every side will have a probability greater than zero. However, sometimes you might have manufacturing defects that cause one side to be a little more likely than the others. We could use a Dirichlet distribution to model this hypothetical process for manufacturing a single die:

In [None]:
sides  = 6     # the number of outcomes
n = 1      # the number of random draws from the dirichlet
alpha = [100] * sides  # the concentration parameter (higher values = more even distribution, lower values = more unbalanced distribution)
# manufacturing a single die:
pd.DataFrame(np.random.dirichlet(alpha, n), columns=range(1, len(alpha)+1 ,1))

You can try running the code above with different values for `alpha`, or simulate multiple dice by increasing the value of `n`, or model dice with more faces by increasing `sides`. Take note of how different parameters impact the values of your randomly generated "dice". You should note that using higher values of alpha usually results in a more uniform distribution of probabilities across all six faces, while lower values will typically result in a more uneven distribution:

In [None]:
alpha = [.1] * sides  
# manufacturing a single die:
pd.DataFrame(np.random.dirichlet(alpha, n), columns=range(1, len(alpha)+1 ,1)).plot.bar()
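
One way to see why `alpha` controls how even the draws are: a Dirichlet draw can be produced by sampling one gamma-distributed value per face and rescaling so the values sum to one. The sketch below does this with the standard library and compares the average gap between the most and least likely face across many simulated dice. The helper names, the number of draws, and the seed are arbitrary choices for illustration.

```python
import random

def dirichlet_draw(alpha, rng):
    """One Dirichlet draw: independent Gamma(alpha_i, 1) values, normalized to sum to 1."""
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    return [g / total for g in gammas]

def average_spread(alpha, n_draws=2000, seed=0):
    """Average (max - min) face probability across many simulated dice."""
    rng = random.Random(seed)
    spreads = []
    for _ in range(n_draws):
        die = dirichlet_draw(alpha, rng)
        spreads.append(max(die) - min(die))
    return sum(spreads) / n_draws

even_spread = average_spread([100] * 6)    # high concentration: nearly fair dice
uneven_spread = average_spread([0.1] * 6)  # low concentration: lopsided dice
print(f"alpha=100: {even_spread:.3f}   alpha=0.1: {uneven_spread:.3f}")
```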

Increasing the value of `sides` will just increase the number of possible outcomes. So we could model a 20-sided die by increasing that value:

In [None]:
sides  = 20 
n = 1      
alpha = [100] * sides 
pd.DataFrame(np.random.dirichlet(alpha, n), columns=range(1, len(alpha)+1 ,1))

In LDA, we assume that documents are generated by sampling from two sets of distributions, each of which is itself drawn from a Dirichlet:

- A **Topic-word distribution** (sometimes called phi or beta) models the probability of any word occurring in a single topic. A topic related to Covid-19 might have a high probability of terms like "Fauci", "mask", or "vaccine". A topic related to the election might have a high probability of "vote", "caucus", or "turnout".
- A **Document-topic distribution** (theta) models the probability of drawing each of `K` topics within a given document. So a document might be drawn 75% from the Covid-19 topic, 20% from the election topic, and 5% from some other random topic.

So, we might have something like this:



<div style="display: flex;">
    <div style="flex: 1; padding-right: 10px;">
        <!-- Content for the first column -->
        <h3>Topic 1</h3>
<table style="border-collapse:collapse;border-spacing:0" class="tg"><thead><tr><th style="border-color:inherit;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;font-weight:normal;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Term</th><th style="border-color:inherit;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;font-weight:normal;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Prob.</th></tr></thead><tbody><tr><td style="border-color:inherit;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">covid</td><td style="border-color:inherit;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">.04</td></tr>
<tr><td style="border-color:inherit;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Fauci</td><td style="border-color:inherit;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">.02</td></tr><tr><td style="border-color:inherit;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">vaccine</td><td style="border-color:inherit;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">.01</td></tr>
<tr><td style="border-color:inherit;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">mask</td><td style="border-color:inherit;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">.01</td></tr><tr><td style="border-color:inherit;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">...</td><td style="border-color:inherit;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal"></td></tr>
<tr><td style="border-color:inherit;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">election</td><td style="border-color:inherit;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">.0001</td></tr></tbody></table>
    </div>
    <div style="flex: 1; padding-left: 10px;">
        <!-- Content for the second column -->
        <h3>Topic 2</h3>
        <table style="border-collapse:collapse;border-spacing:0" class="tg"><thead><tr><th style="border-color:inherit;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;font-weight:normal;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Term</th><th style="border-color:inherit;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;font-weight:normal;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Prob</th></tr></thead><tbody><tr><td style="border-color:inherit;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">vote</td><td style="border-color:inherit;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">.03</td></tr>
<tr><td style="border-color:inherit;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">caucus</td><td style="border-color:inherit;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">.02</td></tr><tr><td style="border-color:inherit;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">turnout</td><td style="border-color:inherit;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">.01</td></tr>
<tr><td style="border-color:inherit;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">poll</td><td style="border-color:inherit;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">.01</td></tr>

<tr><td style="border-color:inherit;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">...</td><td style="border-color:inherit;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal"></td></tr><tr><td style="border-color:inherit;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">covid</td><td style="border-color:inherit;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">.0001</td></tr></tbody></table>
    </div>
        <div style="flex: 5; padding-left: 10px;">
        <!-- Content for the third column -->
        <h3>Document distribution</h3>
        <table style="border-collapse:collapse;border-spacing:0" class="tg"><thead><tr><th style="border-color:inherit;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;font-weight:normal;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal"></th><th style="border-color:inherit;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;font-weight:normal;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Topic 1</th><th style="border-color:inherit;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;font-weight:normal;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Topic 2</th><th style="border-color:inherit;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;font-weight:normal;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Topic 3</th></tr></thead>
<tbody><tr><td style="border-color:inherit;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">Document 1</td><td style="border-color:inherit;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">.75</td><td style="border-color:inherit;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">.20</td><td style="border-color:inherit;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;text-align:left;vertical-align:top;word-break:normal">.05</td></tr></tbody></table>

</div>
</div>




LDA assumes a generative model where each article is written by randomly sampling a topic from a document's topic distribution, and then randomly sampling a word from the selected topic's word distribution. So I might write a document where I drew 75% of my words from a "Covid topic", 20% from an "election topic", and another 5% from some other random topic. Obviously, no one actually writes documents by randomly sampling words from a list. The generative model is mostly useful because it gives us a clear idea of what we need to optimize for in our machine learning model: we want to find the values of $\phi$ and $\theta$ that would maximize the likelihood of generating the collection of documents we have in our data.
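
The generative story can be sketched in a few lines of plain Python. The topic names, word probabilities, and document mixture below are invented for illustration (they loosely mirror the toy tables above), and each topic is cut down to a handful of terms.

```python
import random

# Hypothetical topic-word distributions (phi): each topic is a distribution over words.
phi = {
    "covid":    {"covid": 0.5, "vaccine": 0.3, "mask": 0.2},
    "election": {"vote": 0.5, "caucus": 0.3, "turnout": 0.2},
    "other":    {"weather": 0.6, "sports": 0.4},
}

# Hypothetical document-topic distribution (theta) for a single document.
theta = {"covid": 0.75, "election": 0.20, "other": 0.05}

def generate_document(theta, phi, n_words, seed=None):
    """Generate words by first sampling a topic, then sampling a word from that topic."""
    rng = random.Random(seed)
    words = []
    for _ in range(n_words):
        topic = rng.choices(list(theta), weights=list(theta.values()))[0]
        word = rng.choices(list(phi[topic]), weights=list(phi[topic].values()))[0]
        words.append(word)
    return words

print(" ".join(generate_document(theta, phi, n_words=12, seed=1)))
```

Fitting LDA runs this story in reverse: given only the observed words, it estimates the $\phi$ and $\theta$ under which those documents would be most likely.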


## Fitting the model

Let's try fitting an LDA model. 

Similar to the number of clusters in K-means clustering, LDA doesn't choose the number of topics for us automatically. We'll have to make a guess based on our intuitions. For this model, I'll set `k=15`. This is probably too low (it's common to see 20, 40, or even 100 or more topics for large-scale models), but we'll use it here because a lower value will make the model converge more quickly.

We'll also want to set a `random_state` value. Multiple LDA models with the same number of topics and the same data should find similar topics, but even then the ordering of those topics is arbitrary, so setting the `random_state` argument will ensure that we can replicate our results.

We'll create an LDA model, then put it in a pipeline along with our function for converting texts into a document-term-matrix:

In [None]:
k = 15
ldamodel = LatentDirichletAllocation(n_components = k, # number of topics. Try different numbers here to see what works best. Usually somewhere between 20 and 100
                                random_state = 123, # random number seed. You can use any number here, but it's important to include so you can replicate your analysis
                                doc_topic_prior = .01,
                                topic_word_prior = .001
                               ) 



In [None]:
from sklearn.pipeline import Pipeline

text_vectorizer = CountVectorizer(analyzer = 'word',
                             tokenizer = tokenize,
                             ngram_range=(1,1), # Tokens are individual words for now
                             strip_accents='unicode',
                             max_df = 0.1, # drop terms that occur in more than 10% of documents
                             min_df = .0025 # drop terms that occur in fewer than 0.25% of documents
                            )


pipeline_steps = [
    ('vectorizer', text_vectorizer),
    ('lda', ldamodel )
]

# Create the pipeline
lda_pipeline = Pipeline(pipeline_steps)



In [None]:
doctopic = lda_pipeline.fit_transform(articles['text'])


The `doctopic` object has the document topic probabilities in an array. So here's the breakdown of topics in document 0:

In [None]:
doctopic[0]

In [None]:
plt.bar(range(k), doctopic[0])

The term distribution for each topic is stored in the `.components_` attribute of the fitted LDA model. So here are some of the top terms from topic 0:

In [None]:
# accessing the features from the vectorizer part of the pipeline (this is just the vocabulary from the texts)
features = lda_pipeline['vectorizer'].get_feature_names_out()

In [None]:
features[np.argsort(lda_pipeline['lda'].components_[0])[::-1][:10]]

For ease of use, we can put these together in a pandas `DataFrame` with named columns (we'll also normalize the rows so that each sums to 1):

In [None]:
phi_frame = pd.DataFrame(lda_pipeline['lda'].components_, columns=features)
phi_frame = phi_frame.div(phi_frame.sum(axis=1), axis=0)
phi_frame.head()
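
To see what this normalization does, here is a small standalone sketch with made-up weights. In scikit-learn, the rows of `components_` can be read as unnormalized topic-word weights (pseudo-counts), so dividing each row by its total turns it into a probability distribution over terms. The numbers and the `normalize_rows` helper are invented for illustration.

```python
# Made-up topic-word weights: 2 topics (rows) x 3 terms (columns).
toy_components = [
    [8.0, 1.5, 0.5],
    [0.2, 3.0, 6.8],
]

def normalize_rows(matrix):
    """Divide every entry by its row total so that each row sums to 1."""
    return [[value / sum(row) for value in row] for row in matrix]

toy_phi = normalize_rows(toy_components)
for row in toy_phi:
    print([round(p, 3) for p in row], "sum =", round(sum(row), 3))
```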

## Interpreting the results

A lot of the complexity of using LDA lies in interpreting the topics themselves. The simplest way to do this is by looking at the most probable words for each topic, which we can do by running the code below:

In [None]:
# Displaying the top keywords in each topic

n_terms = 10
ls_keywords = []
ls_freqs = []
topic_id = []
    
for topic_idx, topic in enumerate(lda_pipeline['lda'].components_):
    # Sorting and finding top keywords
    word_idx = np.argsort(topic)[::-1][:n_terms]
    freqs = list(topic[word_idx])
    keywords = [features[j] for j in word_idx]

    # Saving keywords and frequencies for later
    ls_keywords = ls_keywords + keywords
    ls_freqs = ls_freqs + freqs
    topic_id = topic_id + [topic_idx] * n_terms

    # Printing top keywords for each topic
    print(topic_idx, ', '.join(keywords))
top_words_df = pd.DataFrame({'keywords':ls_keywords, 'frequency':ls_freqs, 'topic_id':topic_id})


In [None]:
top_words_df

<b style="color:red;"> Question 1: Wrap the code above in a function that takes `n_terms`, a fitted `lda` model, and a list of `features` as arguments and returns a data frame with the top n terms for each topic in descending order of frequency. Try running the same function with a few different values for `n_terms`</b>

In [None]:
# define a function here



We can also plot terms by frequency within each topic (although this may get unwieldy for models with a larger number of components)

In [None]:
sns.catplot(top_words_df, x = 'frequency', y = 'keywords', col = 'topic_id', kind = 'bar', sharey = False, col_wrap=3)


In many cases, an interactive visualization can make it easier to identify topics. The pyLDAvis package provides an easy way to create an interactive HTML file. 

In [None]:
panel = pyLDAvis.lda_model.prepare(lda_pipeline['lda'], dfm, lda_pipeline['vectorizer'], mds='tsne', sort_topics=False, n_jobs = -1)
word_info = panel.topic_info

#To save panel in html
pyLDAvis.save_html(panel, 'panel.html')

In the left panel of the display below, you can see each topic scaled by its overall frequency in the corpus. The relative positions of the topics indicate how distinct they are, so topics that are further apart should share fewer terms. The plot on the right displays the top terms for each topic. Instead of using the probability of each term, the terms displayed in this visualization are ranked according to a metric that accounts for how specific each term is to each topic. In some cases, this can be a better way of identifying the concept each topic represents.
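
For reference, the ranking metric proposed alongside LDAvis (Sievert and Shirley) is usually called *relevance*: a weighted mix of a term's log probability within the topic and its log "lift" (the within-topic probability divided by the term's overall probability). The sketch below is a simplified illustration with made-up probabilities, not pyLDAvis's own code; the function name, the weight `lam=0.6`, and the example numbers are assumptions.

```python
import math

def relevance(p_term_given_topic, p_term_overall, lam=0.6):
    """lam * log p(term | topic) + (1 - lam) * log(p(term | topic) / p(term))."""
    lift = p_term_given_topic / p_term_overall
    return lam * math.log(p_term_given_topic) + (1 - lam) * math.log(lift)

# Made-up numbers: a word that is equally frequent everywhere versus
# a rarer word that appears almost only in this topic.
generic_term = relevance(p_term_given_topic=0.030, p_term_overall=0.030)
specific_term = relevance(p_term_given_topic=0.020, p_term_overall=0.001)

print(f"generic: {generic_term:.2f}   specific: {specific_term:.2f}")
```

With `lam=1` the ranking reduces to plain within-topic probability; lower values reward terms that are distinctive to the topic.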

In [None]:
HTML(filename='panel.html')

<b style="color:red;"> Question 2: Using the information we've gathered so far, see if you can assign a short label to each topic in the LDA model. Replace the generic labels in `label_map` below with some descriptive topic IDs and then recreate the catplot object from the previous section</b>

In [None]:
label_map = {
    0: 'topic 1',
    1: 'topic 2',
    2: 'topic 3',
    3: 'topic 4',
    4: 'topic 5',
    5: 'topic 6',
    6: 'topic 7',
    7: 'topic 8',
    8: 'topic 9',
    9: 'topic 10',
    10: 'topic 11',
    11: 'topic 12',
    12: 'topic 13',
    13: 'topic 14',
    14: 'topic 15',
}
# map the labels
top_words_df['topic_label'] = top_words_df['topic_id'].map(label_map)

# recreate the catplot object:
sns.catplot(top_words_df, x = 'frequency', y = 'keywords', col = 'topic_label', kind = 'bar', sharey = False, col_wrap=3)

## Identifying Document Topics

Remember that LDA gives us two distributions: a distribution for word occurrences in each topic, and a distribution of topic occurrences within each document. So we also have a way to see what documents are associated with what sources or what time periods. We just need to link the topic memberships in `doctopic` back to the original documents so that we can see which documents are getting categorized into which topics:

In [None]:
topic_memberships = pd.DataFrame(doctopic)
topic_memberships.columns = ["topic " + str(i)  for i in topic_memberships.columns ]
topic_memberships.head()

Each row in this result represents one of the documents from our original data, and each column represents a topic. We can make things a little easier to interpret by appending some information about each article as additional columns onto this data frame:

In [None]:
topic_memberships['text'] = articles.text
topic_memberships['source'] = articles.source
topic_memberships['headline'] =articles.headline
topic_memberships['url'] = articles.url
topic_memberships['hyperlink'] = articles.hyperlink

topic_memberships.head()

<b style="color:red;"> Question 3: Identify the top 5 articles most strongly associated with topic 6</b>

In [None]:
# 



<b style="color:red;"> Question 4: Which topics, if any, occurred in a higher proportion of Fox News stories vs. CNN?</b>

In [None]:
#


# Making a styled table

Since we included a hyperlink and an article title in our original data frame, we can make a styled table that includes formatted links to the top articles in each topic.

In [None]:
n_terms = 10
n_docs = 3
top_documents = []
top_index = topic_memberships.columns.values.tolist()[:k] # the first k columns hold the topic proportions
for i, label in enumerate(top_index):
    # the n_docs documents with the highest proportion of this topic
    top_n_documents =  topic_memberships.sort_values(label, ascending=False).head(n_docs)
    terms={ 'topic' : i,
           'mean proportion' : np.mean(topic_memberships[label]),
        'docs' : '<br>'.join(top_n_documents['hyperlink'].to_list()),
        'terms' : ', '.join([features[j] for j in np.argsort(lda_pipeline['lda'].components_[i])[::-1][:n_terms]]) 
    }
    top_documents.append(terms)



In [None]:
stylized_table = pd.DataFrame(top_documents).sort_values(['mean proportion'], ascending=False).reset_index(drop=True).style

stylized_table

<b style="color:red;"> Question 5: Try re-running the model with a larger number of topics. Compare your results. </b>

## Visualizing documents

The document-topic matrix gives us a lower-dimensional way to represent the contents of our different documents. If we take this matrix and perform some additional dimensionality reduction, we can use the results to visualize all of the texts in a 2-dimensional space. We'll use t-distributed stochastic neighbor embedding (t-SNE) to get a 2-d representation of our documents. t-SNE is a dimensionality reduction technique like principal components analysis, but it tends to be more effective in cases where there are non-linear relationships between observations. 

In [None]:
from sklearn.manifold import TSNE
# Create a t-SNE model with 2 components
tsne = TSNE(n_components=2, perplexity=30, random_state=42)

# Fit and transform the data
X_tsne = tsne.fit_transform(doctopic)

In [None]:
# adding the 2-d embedding to our articles data frame
positions = pd.DataFrame(X_tsne, columns=['Dim 1', 'Dim 2'])
positions = pd.concat([positions, articles], axis=1)
positions.head()


Creating some topic labels from the top words for each topic:

In [None]:
topic_labels = top_words_df.groupby('topic_id')['keywords'].agg(lambda x: ', '.join(x)).reset_index()


Identifying the most probable topic for each document:

In [None]:
max_topic = pd.DataFrame(doctopic).idxmax(axis='columns')
positions['max_topic'] = max_topic
positions = pd.merge(positions, topic_labels, left_on ='max_topic', right_on ='topic_id')


Showing the results in plotly:

In [None]:
positions

In [None]:
import plotly.express as px

In [None]:
fig = px.scatter(positions, x='Dim 1', y='Dim 2', 
                 title='Document positions from topic model',
                width= 1800, height=800, color='keywords'

                )
fig.show()