# Topic Modelling and Attitudes from Twitter Data

A short tutorial by **Eduardo Graells-Garrido** / <egraells@dcc.uchile.cl> / [@ZorzalErrante](http://twitter.com/ZorzalErrante) / http://datagramas.cl 

Last updated: **7/7/2022**

Today we have two aims:

1. Identify narratives in Twitter discussions within a given context (geographical in this example). We will use topic modelling for this.
2. Identify sentiment/emotions in the discussion. We will use transformers (a deep learning architecture) for this.

## Preamble

This notebook requires the [tsundoku environment](https://github.com/zorzalerrante/tsundoku). Clone the repository and execute the following:

```
# Create conda environment, install dependencies on it and activate it
conda env create --name tsundoku --file environment.yml
conda activate tsundoku

python -m ipykernel install --user --name tsundoku --display-name "Python (tsundoku)"
```

### Google Colab

If you use Google Colab, you need to install the dependencies on the server. This will take a few minutes! Execute the first cell, wait until the server gives you a restart error, and then run the second cell.

In [None]:
try:
    import google.colab
    !pip uninstall matplotlib -y
    !pip install -q condacolab
    
    import condacolab
    condacolab.install_mambaforge()
except ModuleNotFoundError:
    pass

In [None]:
try:
    import google.colab
    !git clone https://github.com/zorzalerrante/tsundoku.git tsundoku_git
    !mamba env update --name base --file tsundoku_git/environment.yml
except ModuleNotFoundError:
    pass

### Python

Here we load all the dependencies we will use in the notebook.

In [None]:
import csv
import urllib.request

import dask.dataframe as dd
import geopandas as gpd
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from aves.features.geo import clip_area_geodataframe, to_point_geodataframe
from aves.features.sparse import sparse_matrix_to_long_dataframe
from aves.features.utils import normalize_rows, standardize_columns
from aves.visualization.figures import small_multiples_from_geodataframe
from aves.visualization.maps import choropleth_map
from aves.visualization.text import draw_wordcloud
from scipy.special import softmax
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfTransformer
from transformers import AutoModelForSequenceClassification, AutoTokenizer

from tsundoku.features.dtm import build_vocabulary, tokens_to_document_term_matrix
from tsundoku.features.text import tokenize


## Dataset

This is a small dataset of tweets about migration in the UK. See https://fcorowe.github.io/intro-gds/04-spatial_econometrics.html for a description.

In [None]:
tweets = dd.read_csv('https://github.com/fcorowe/gds-bigsss-groningen/raw/main/data/uk_geo_tweets_01012019_31012019.csv')
tweets.head()

In [None]:
len(tweets)

In [None]:
tweets.columns

In [None]:
tweets['token'] = tweets['text'].apply(tokenize)

In [None]:
tweets = tweets.compute()

In [None]:
tweets.head()

In [None]:
len(tweets)

In [None]:
len(tweets['author_id'].unique())

In [None]:
print('\n---\n'.join(tweets['text'].sample(10)))

### Dates

In [None]:
tweets['created_at'] = pd.to_datetime(tweets['created_at'])

In [None]:
tweets.resample('1d', on='created_at').size().plot()

### Words

In [None]:
vocab = build_vocabulary(tweets, 'token')
vocab

In [None]:
fig, ax = plt.subplots()

draw_wordcloud(ax, vocab.set_index('token')['frequency'].to_dict())

In [None]:
vocab['frequency'].plot(kind='hist', bins=20)

In [None]:
np.log(vocab['frequency']).plot(kind='hist', bins=20)

In [None]:
filtered_vocab = vocab[vocab['frequency'].between(5, vocab['frequency'].quantile(0.985))].reset_index(drop=True)
filtered_vocab

In [None]:
filtered_vocab.sort_values('frequency', ascending=False).head(25)

In [None]:
fig, ax = plt.subplots(figsize=(9, 9))

draw_wordcloud(ax, filtered_vocab.set_index('token')['frequency'].to_dict())

### Geographical Context

In [None]:
tweets['place_name'].unique().shape

In [None]:
# https://www.geoboundaries.org/index.html#getdata
gdf = gpd.read_file('https://raw.githubusercontent.com/wmgeolab/geoBoundaries/793caebea9ccb4bb1c4f38e80684c1166daf288a/releaseData/gbOpen/GBR/ADM2/geoBoundaries-GBR-ADM2-all.zip')
gdf['geometry'] = gdf.simplify(0.0001)
gdf.plot()

In [None]:
# somehow lat and lon are reversed in the original data.

tweets = to_point_geodataframe(tweets, 'lat', 'long', drop=True)
tweets.plot()

In [None]:
ax = gdf.plot(facecolor='none', edgecolor='black', figsize=(7, 7))
tweets.plot(ax=ax, color='purple', markersize=1)

In [None]:
# `predicate` replaces the deprecated `op` argument (geopandas >= 0.10).
len(gpd.sjoin(tweets, gdf, predicate='within'))

In [None]:
tweets = gpd.sjoin(tweets, gdf, predicate='within')
print(len(tweets))
tweets.head()

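*Potential bug!* How do you know that all tweets are within your geography? If some are not, the spatial join drops them and the index will have gaps. Later we select rows of the document-term matrix by position using this index, so it is safer to reset it.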

In [None]:
tweets = tweets.reset_index(drop=True)

In [None]:
location_counts = (
    tweets.groupby("shapeName")
    .size()
    .sort_values(ascending=False)
    .rename("n_tweets")
)

location_counts.head()

In [None]:
ax = gdf.join(location_counts, on='shapeName', how='inner').plot(column='n_tweets', cmap='PuRd', edgecolor='none', figsize=(12, 12))
gdf.plot(facecolor='none', edgecolor='black', linewidth=0.1, ax=ax)

## Narratives

### Main Representation: Document-Term Matrix

In [None]:
dtm = tokens_to_document_term_matrix(tweets, 'tweet_id', 'token', filtered_vocab['token'])
dtm

We observed that the most frequent words are not necessarily the most informative. We filtered out some of them, but that only diminishes the problem.

One way of improving the situation is to assign a weight to each word.

The most common weighting formula is TF-IDF.
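As a rough illustration of what the weighting does, the following minimal sketch computes TF-IDF by hand on a toy corpus. It assumes the smoothed formula documented as the default of scikit-learn's `TfidfTransformer`, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. The toy documents and the helper name `tfidf_rows` are made up for this example.

```python
import math
from collections import Counter

# Toy corpus (hypothetical): each document is a list of tokens.
docs = [
    ["migration", "policy", "uk"],
    ["migration", "brexit", "uk", "uk"],
    ["migration", "football"],
]


def tfidf_rows(documents):
    """Return one {token: weight} dict per document.

    Assumes idf(t) = ln((1 + n) / (1 + df(t))) + 1 and L2-normalized rows.
    """
    n_docs = len(documents)
    # Document frequency: number of documents that contain each token.
    df = Counter(token for doc in documents for token in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in df}

    rows = []
    for doc in documents:
        counts = Counter(doc)
        # Raw weight: term count times inverse document frequency.
        raw = {t: counts[t] * idf[t] for t in counts}
        norm = math.sqrt(sum(w ** 2 for w in raw.values()))
        rows.append({t: w / norm for t, w in raw.items()})
    return rows


weights = tfidf_rows(docs)
for row in weights:
    print({t: round(w, 3) for t, w in row.items()})
```

In the first toy document, `migration` appears in every document and therefore gets the lowest weight, while `policy` appears in only one document and gets the highest.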

In [None]:
tfidf = TfidfTransformer()

dtm_weighted = tfidf.fit_transform(dtm)
dtm_weighted

In [None]:
word_importances = sparse_matrix_to_long_dataframe(dtm_weighted, var_map=filtered_vocab['token'].to_dict())
word_importances

In [None]:
global_word_importance = (
    word_importances.groupby("column")["value"]
    .sum()
    .sort_values(ascending=False)
)
global_word_importance.head(25)


In [None]:
place_tweet_idx = (
    tweets.groupby("shapeName")
    .apply(lambda x: x.index.values)
    #.loc[location_counts.index]
)

place_tweet_idx

In [None]:
place_dtm = np.vstack(place_tweet_idx.map(lambda x: np.squeeze(np.array(dtm[x].sum(axis=0)))))
place_dtm.shape


In [None]:
place_dtm

In [None]:
place_words = pd.DataFrame(
    tfidf.transform(place_dtm).todense(),
    index=place_tweet_idx.index,
    columns=filtered_vocab["token"],
)

place_words

In [None]:
place_words.T.apply(lambda x: x.sort_values(ascending=False).head(10).index).T


### Topic Model: Non-Negative Matrix Factorization (NMF)



In [None]:
nmf_model = NMF(n_components=20, random_state=42)
nmf_document_topic = nmf_model.fit_transform(dtm_weighted)

In [None]:
nmf_term_topic = nmf_model.components_
nmf_term_topic.shape

In [None]:
nmf_term_topic = pd.DataFrame(nmf_term_topic.T, index=filtered_vocab['token']).pipe(normalize_rows)
nmf_term_topic

In [None]:
sns.clustermap(nmf_term_topic)

In [None]:
nmf_term_topic.apply(lambda x: x.sort_values(ascending=False).head(25).index).add_prefix('topic_')

In [None]:
nmf_place_topic = nmf_model.transform(tfidf.transform(place_dtm))
nmf_place_topic = pd.DataFrame(nmf_place_topic, index=place_tweet_idx.index).add_prefix('topic_').pipe(normalize_rows)
nmf_place_topic

In [None]:
sns.clustermap(nmf_place_topic, metric='cosine')

In [None]:
nmf_topic_labels = nmf_term_topic.apply(lambda x: '\n'.join(x.sort_values(ascending=False).head(15).index))
nmf_topic_labels

In [None]:
fig, axes = small_multiples_from_geodataframe(gdf, n_variables=len(nmf_place_topic.columns), height=7, col_wrap=5)

place_topic = nmf_place_topic
topic_labels = nmf_topic_labels

joint_gdf = gdf.join(place_topic, on='shapeName')

for ax, col, labels in zip(axes, place_topic.columns, topic_labels.values):
    gdf.plot(facecolor='none', edgecolor='#abacab', linewidth=0.5, ax=ax, aspect=None)
    
    choropleth_map(ax, joint_gdf[joint_gdf[col] >= 0.05], col, edgecolor='black', linewidth=0.5, k=5, binning="fisher_jenks", palette='RdPu',
        cbar_args=dict(
            label=f"{col}",
            height="25%",
            width="3%",
            orientation="vertical",
            location="lower left",
            label_size="small",
            bbox_to_anchor=(0.0, 0.0, 0.8, 0.95),
        ),)
    ax.set_title(col)

    ax.annotate(labels, (0.99, 0.99), xycoords='axes fraction', ha='right', va='top', fontsize='medium')


fig.tight_layout()


### Latent Dirichlet Allocation (LDA)

In [None]:
lda_model = LatentDirichletAllocation(n_components=20, random_state=42)
lda_document_topic = lda_model.fit_transform(dtm)
lda_term_topic = pd.DataFrame(lda_model.components_.T, index=filtered_vocab['token'])
lda_term_topic.apply(lambda x: x.sort_values(ascending=False).head(25).index)

In [None]:
lda_place_topic = lda_model.transform(place_dtm)
lda_place_topic = pd.DataFrame(lda_place_topic, index=place_tweet_idx.index).add_prefix('topic_')
lda_place_topic

In [None]:
sns.clustermap(lda_place_topic, metric='cosine')

In [None]:
lda_topic_labels = lda_term_topic.apply(lambda x: '\n'.join(x.sort_values(ascending=False).head(15).index))
lda_topic_labels


In [None]:
fig, axes = small_multiples_from_geodataframe(gdf, n_variables=len(lda_place_topic.columns), height=7, col_wrap=5)

place_topic = lda_place_topic
topic_labels = lda_topic_labels

joint_gdf = gdf.join(place_topic, on='shapeName')

for ax, col, labels in zip(axes, place_topic.columns, topic_labels.values):
    gdf.plot(facecolor='none', edgecolor='#abacab', linewidth=0.5, ax=ax, aspect=None)
    
    choropleth_map(ax, joint_gdf[joint_gdf[col] >= 0.05], col, edgecolor='black', linewidth=0.5, k=5, binning="fisher_jenks", palette='RdPu',
        cbar_args=dict(
            label=f"{col}",
            height="25%",
            width="3%",
            orientation="vertical",
            location="lower left",
            label_size="small",
            bbox_to_anchor=(0.0, 0.0, 0.8, 0.95),
        ),)
    ax.set_title(col)

    ax.annotate(labels, (0.99, 0.99), xycoords='axes fraction', ha='right', va='top', fontsize='medium')


fig.tight_layout()

Which one is better? We can't say. It will depend on your task :)

## Sentiment using Transformers

Transformers are a deep learning architecture that has become popular because pre-trained models are widely available and can be fine-tuned.

Fine-tuning means that you can download a model and re-train it for your specific task, taking advantage of the structure the model has already learned.

Fortunately, the Hugging Face transformers library makes it very easy to download models and put them into operation.

In [None]:
def preprocess(text):
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)


task='emotion'
MODEL = f"cardiffnlp/twitter-roberta-base-{task}"

tokenizer = AutoTokenizer.from_pretrained(MODEL)

In [None]:
# download label mapping
mapping_link = f"https://raw.githubusercontent.com/cardiffnlp/tweeteval/main/datasets/{task}/mapping.txt"
with urllib.request.urlopen(mapping_link) as f:
    html = f.read().decode('utf-8').split("\n")
    csvreader = csv.reader(html, delimiter='\t')
labels = [row[1] for row in csvreader if len(row) > 1]
labels

In [None]:
def predict_emotion(text):
    text = preprocess(text)
    encoded_input = tokenizer(text, return_tensors='pt')
    output = model(**encoded_input)
    scores = output[0][0].detach().numpy()
    scores = softmax(scores)
    return pd.Series(scores, index=labels)

model = AutoModelForSequenceClassification.from_pretrained(MODEL)
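The model returns one raw score (a logit) per label, and `predict_emotion` uses `softmax` to turn those scores into positive values that sum to one. A minimal pure-Python sketch of that step follows. The logits are invented for illustration, and the label names are assumed to match the emotion mapping downloaded above.

```python
import math


def softmax_scores(logits):
    """Numerically stable softmax over a list of floats."""
    largest = max(logits)
    # Subtracting the maximum avoids overflow without changing the result.
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for a single tweet (not real model output).
example_labels = ["anger", "joy", "optimism", "sadness"]
example_logits = [-1.2, 3.1, 0.4, -0.8]

example_scores = dict(zip(example_labels, softmax_scores(example_logits)))
print(example_scores)
```

Because the scores of a tweet sum to one, a high value for one emotion necessarily lowers the others. Keep this in mind when comparing emotions across places.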

In [None]:
predict_emotion("Good night 😊")

In [None]:
sample_tweets= tweets[['text']].sample(5)
sample_tweets.join(sample_tweets['text'].apply(predict_emotion)).set_index('text')

In [None]:
tweet_emotions = tweets['text'].apply(predict_emotion)
tweet_emotions.describe()

In [None]:
place_sentiment = (
    tweets.join(tweet_emotions)
    .groupby("shapeName")[tweet_emotions.columns]
    .median()
    #.pipe(normalize_rows)
)
place_sentiment


In [None]:
place_sentiment.sort_values('anger', ascending=False)

In [None]:
fig, axes = small_multiples_from_geodataframe(
    gdf, n_variables=len(labels), height=7, col_wrap=4, remove_axes=True
)

joint_gdf = gdf.join(place_sentiment.pipe(standardize_columns), on="shapeName")

for ax, col in zip(axes, place_sentiment.columns):
    choropleth_map(
        ax,
        joint_gdf,
        col,
        k=5,
        linewidth=0.5,
        edgecolor="black",
        binning="fisher_jenks",
        cbar_args=dict(
            label=f"{col} [z]",
            height="25%",
            width="3%",
            orientation="vertical",
            location="upper right",
            label_size="small",
            bbox_to_anchor=(0.0, 0.0, 0.8, 0.95),
        ),
    )
    # joint_gdf.plot(ax=ax, column=col, aspect=None, cmap='RdBu')
    # gdf.plot(facecolor='none', edgecolor='black', linewidth=0.1, ax=ax, aspect=None)
    ax.set_title(col)


fig.tight_layout()
# fig.subplots_adjust(hspace=0.001, wspace=0.001)


In [None]:
# https://woeplanet.org/id/23416974/
london_bbox = [-0.51035, 51.286839, 0.33403, 51.692322]
gdf_london = clip_area_geodataframe(gdf, london_bbox, buffer=0.01)
gdf_london.plot()


In [None]:
fig, axes = small_multiples_from_geodataframe(gdf_london, n_variables=len(labels), height=6, col_wrap=5, remove_axes=False)

joint_gdf = gdf.join(place_sentiment.pipe(standardize_columns), on='shapeName')

for ax, col in zip(axes, place_sentiment.columns):
    choropleth_map(
        ax,
        joint_gdf,
        col,
        k=5,
        linewidth=0.5,
        edgecolor="black",
        binning="fisher_jenks",
        legend=None
    )
    #joint_gdf.plot(ax=ax, column=col, aspect=None, cmap='RdBu')
    #gdf.plot(facecolor='none', edgecolor='black', linewidth=0.1, ax=ax, aspect=None)
    ax.set_title(col)


fig.tight_layout()

In [None]:
aspect_ratio = (london_bbox[2] - london_bbox[0]) / (london_bbox[3] - london_bbox[1])
aspect_ratio

In [None]:
fig, axes = small_multiples_from_geodataframe(
    gdf, n_variables=4, height=9, col_wrap=4, remove_axes=True
)

joint_gdf = gdf.join(place_sentiment.pipe(standardize_columns), on="shapeName")

for ax, col in zip(axes, place_sentiment.columns):
    gdf.plot(facecolor='none', edgecolor='#abacab', linewidth=0.5, ax=ax, aspect=None)
    
    choropleth_map(
        ax,
        joint_gdf,
        col,
        k=5,
        linewidth=0.5,
        edgecolor="black",
        binning="fisher_jenks",
        cbar_args=dict(
            label=f"{col} [z]",
            height="25%",
            width="3%",
            orientation="vertical",
            location="upper right",
            label_size="small",
            bbox_to_anchor=(0.0, 0.0, 0.8, 0.95),
        ),
    )
    # joint_gdf.plot(ax=ax, column=col, aspect=None, cmap='RdBu')
    
    ax.set_title(col)

    axins = ax.inset_axes([0.75, -0.12, 0.4, 0.4 / aspect_ratio])
    axins.set_axis_off()
    #axins.imshow(Z2, extent=extent, origin="lower")
    # sub region of the original image
    #x1, x2, y1, y2 = -1.5, -0.9, -2.5, -1.9
    axins.set_xlim(london_bbox[0], london_bbox[2])
    axins.set_ylim(london_bbox[1], london_bbox[3])
    #axins.set_xticklabels([])
    #axins.set_yticklabels([])

    choropleth_map(
        axins,
        joint_gdf,
        col,
        k=5,
        linewidth=0.2,
        edgecolor="black",
        binning="fisher_jenks",
        legend=None
    )

    ax.indicate_inset_zoom(axins, edgecolor="black", zorder=50)


fig.tight_layout()

### Correlating Emotion and Narratives

We may want to characterize the topics underpinning the discussion. For instance, we cannot say with confidence that a topic with a high association with a negative word is itself negative, because we don't know the context in which that word is used. The sentiment model, however, does take that context into account.

Since we have estimated these measures for the same unit of analysis, one step toward characterizing topics is through correlation.
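The cells that follow rely on pandas' `DataFrame.corr()`, which computes the Pearson correlation by default. As a minimal sketch of what that coefficient measures, the code below computes it by hand for one hypothetical topic and one hypothetical emotion across five places. All numbers are invented.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-place values: weight of one topic and median anger score.
topic_weight = [0.10, 0.25, 0.05, 0.40, 0.20]
anger_score = [0.12, 0.30, 0.08, 0.45, 0.22]

r = pearson(topic_weight, anger_score)
print(round(r, 3))
```

A high coefficient only says that the two measures rise and fall together across places. It does not by itself establish that the topic expresses that emotion.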

In [None]:
nmf_topic_x_emotion = (
    place_sentiment.join(nmf_place_topic)
    .fillna(0)
    .corr()
    .loc[nmf_place_topic.columns, place_sentiment.columns]
    .set_index(nmf_topic_labels.map(lambda x: x.replace("\n", ",")))
)

lda_topic_x_emotion = (
    place_sentiment.join(lda_place_topic)
    .fillna(0)
    .corr()
    .loc[lda_place_topic.columns, place_sentiment.columns]
    .set_index(lda_topic_labels.map(lambda x: x.replace("\n", ",")))
)


In [None]:
g = sns.clustermap(nmf_topic_x_emotion, center=0, figsize=(16, 9), annot=True, metric='cosine')
g.fig.tight_layout()

In [None]:
g = sns.clustermap(lda_topic_x_emotion, center=0, figsize=(16, 9), annot=True, metric='cosine')
g.fig.tight_layout()

Which one to pick? It seems that the results are not _that_ different. 

This is not the end of the study. We should do a careful qualitative analysis that can be supported by these numbers.


## Remaining Questions

### How to select a topic model?

My advice is to first test whether simpler models give you reasonable results. Before moving to a more complex model, see if you can improve your data or your pre-processing. I like NMF for its simplicity and speed, and its results are usually good enough. As with LDA, there is plenty of evidence of its usefulness, so it is not a hard choice to justify.

Note that there are multiple versions of NMF and LDA. The [gensim](https://radimrehurek.com/gensim/) library is a good starting point as it has many implementations of those variants, as well as of other models.

### How to select the number of topics?

Do not focus only on quantitative measurements. Think about your assumptions about the data. Keep in mind typical evaluations (such as selecting a model based on log-likelihood or similar), but remember that those metrics are not necessarily related to your needs or to your assumptions about the data and the phenomena under study.

For instance, NMF tries to reconstruct the original matrix. As such, the "goodness of fit" is measured through the matrix reconstruction error. You will notice that, as you increase the rank of the latent matrices (that is, the number of topics), the fit keeps improving, so this measure alone cannot tell you when to stop.
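To make this concrete, a common formulation of NMF (the Frobenius-norm objective, which is also the default loss in scikit-learn) is sketched below. Here $X$ stands for the weighted document-term matrix, $W$ for the document-topic matrix, $H$ for the topic-term matrix, and $k$ for the number of topics. These symbols are introduced here for illustration only.

$$
\min_{W \ge 0,\; H \ge 0} \; \lVert X - W H \rVert_F^2,
\qquad
X \in \mathbb{R}_{\ge 0}^{n \times m},\;
W \in \mathbb{R}_{\ge 0}^{n \times k},\;
H \in \mathbb{R}_{\ge 0}^{k \times m}
$$

Any factorization with $k$ topics can be extended to $k + 1$ topics by appending a column of zeros to $W$ and an arbitrary non-negative row to $H$, which leaves $WH$ unchanged. The best achievable error with $k + 1$ topics is therefore never larger than with $k$ topics, which is why reconstruction error alone tends to favour ever larger models.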

Note that topics are _latent_: sometimes they do not have a human interpretation. One way to overcome this and obtain more interpretable topics is to use semi-supervised models. One of them is the [Corex Topic Model](https://github.com/gregversteeg/corex_topic), where you can anchor words to topics as a way to guide the inference.

Always visualize what you do :) It will help you to pinpoint potential insights and also potential errors.

### Which transformer model to use?

I would say that every week there is a new model! The field is growing at a spectacular pace. What I suggest is to find model authors that you trust and that have evaluated the new models on datasets similar to yours. For instance, here we used a model from [tweeteval](https://github.com/cardiffnlp/tweeteval), a benchmark whose models are pre-trained and fine-tuned on tweets.

## Thanks!