![SGSSS Logo](../img/SGSSS_Stacked.png)

# Text Analysis

## Introduction

Computational methods are transforming research practice across the disciplines. For social scientists these methods offer a number of valuable opportunities, including creating new datasets from digital sources; unearthing new insights and avenues for research from existing data sources; and improving the accuracy and efficiency of fundamental research activities.

In this lesson we introduce and apply a range of supervised and unsupervised text analysis techniques to social science data.

### Aims

This lesson has two aims:
1. Demonstrate how to use Python to analyse text data relating to charitable activities.
2. Cultivate your computational thinking skills through coding examples, in particular how to define and solve a data preprocessing problem using a computational method.

### Lesson details

* **Level**: Introductory
* **Time**: 40-60 minutes
* **Pre-requisites**: None
* **Audience**: Researchers and analysts from any disciplinary background
* **Learning outcomes**:
    1. Understand and apply common supervised and unsupervised text analysis techniques to social science data.
    2. Be able to use Python for performing text analysis.

## Guide to using this resource

This learning resource was built using <a href="https://jupyter.org/" target=_blank>Jupyter Notebook</a>, an open-source software application that allows you to mix code, results and narrative in a single document. As <a href="https://jupyter4edu.github.io/jupyter-edu-book/" target=_blank>Barba et al. (2019)</a> espouse:
> In a world where every subject matter can have a data-supported treatment, where computational devices are omnipresent and pervasive, the union of natural language and computation creates compelling communication and learning opportunities.

If you are familiar with Jupyter notebooks then skip ahead to the main content (*How do we analyse social science text data?*). Otherwise, the following is a quick guide to navigating and interacting with the notebook.

### Interaction

**You only need to execute the code that is contained in sections which are marked by `In []`.**

To execute a cell, click or double-click the cell and press the `Run` button on the top toolbar (you can also use the keyboard shortcut Shift + Enter).

Try it for yourself:

In [None]:
print("Enter your name and press enter:")
name = input()
print("\r")
print("Hello {}, enjoy learning more about Python and text analysis!".format(name))

### Learn more

Jupyter notebooks provide rich, flexible features for conducting and documenting your data analysis workflow. To learn more about additional notebook features, we recommend working through some of the <a href="https://github.com/darribas/gds19/blob/master/content/labs/lab_00.ipynb" target=_blank>materials</a> provided by Dani Arribas-Bel at the University of Liverpool. 

## How do we analyse social science text data?

There is a wide array of text analysis techniques that we could apply in our research:
* **Descriptive inference:** how to characterise text; vector space model, bag of words, (dis)similarity measures, diversity, complexity, style, bursts.
* **Supervised techniques:** dictionaries, sentiment analysis, categorising.
* **Unsupervised techniques:** cluster analysis, Principal Components Analysis (PCA), topic modelling, embeddings. (Spirling, 2022)

To say nothing of using Generative AI or Large Language Models (LLMs) to conduct these analyses on our behalf.

In this lesson we focus on a common unsupervised text analysis technique:
* Topic modelling

## Preliminaries

First we need to ensure Python has the functionality it needs for text analysis. As you will see, it needs quite a bit of extra functionality, so this may take some time to install / import depending on your machine.

In [None]:
# Install additional packages - only run once per machine
!pip install textblob
!pip install seaborn
!pip install pyldavis
!pip install gensim

Packages for general data and file management:

In [None]:
import pandas as pd
from pandas.plotting import scatter_matrix
import numpy as np
import json
import os
import re

Packages for processing text data:

In [None]:
import nltk                       # get nltk 
from nltk import word_tokenize    # and some of its key functions
from nltk import sent_tokenize
from nltk import FreqDist

English_punctuation = "-!\"#$%&()'*+,./:;<=>?@[\\]^_`{|}~''“”"      # Things for removing punctuation, stopwords and empty strings
table_punctuation = str.maketrans('','', English_punctuation)  

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('wordnet')
nltk.download('webtext')
nltk.download('words')

from nltk.corpus import words     # list of valid words
english_words = set(words.words())

from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

from nltk.corpus import wordnet                    # Functions we need for lemmatising
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer() 

from nltk.stem.porter import PorterStemmer         # Functions we need for stemming
porter = PorterStemmer()

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()

from collections import Counter

print("Successfully imported necessary modules")    # The print statement is just a bit of encouragement!

Packages for analysing text data:

In [None]:
# for sentiment analysis
from textblob import TextBlob

# for data visualisation
import matplotlib.pyplot as plt 
import seaborn as sns # for data visualisation

# for PCA
from sklearn.decomposition import PCA

# for topic modelling
import gensim
from gensim import corpora, models
from gensim.models.coherencemodel import CoherenceModel

# for topic modelling evaluation
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

### Import data

A second important preliminary step is to import the text data you will be using.

In [None]:
infile = "https://raw.githubusercontent.com/SGSSSonline/text-analysis-summer-school-2025/refs/heads/main/data/acnc-overseas-activities-2022.csv" # define file to be imported

data = pd.read_csv(infile, encoding="ISO-8859-1")

In [None]:
data.sample(10)

In [None]:
data["activity_desc"].sample(10)

### Create Document Term Matrix

You have likely created and saved this in a previous lesson but let's start afresh just in case.

In [None]:
def preprocess_text(text):

    # Tokenize the text and convert to lowercase
    words = nltk.word_tokenize(text)
    lower_words = [word.lower() for word in words]
    #print(lower_words)

    # Remove punctuation and numbers
    a_words = [word for word in lower_words if word.isalpha()]
    #print("Alpha words: ",a_words)

    # Lemmatise words
    lemmed_words = [lemmatizer.lemmatize(word) for word in a_words]
    #print("Lemmed words: ",lemmed_words)
    
    # Remove non-English words
    e_words = [word for word in lemmed_words if word in english_words]
    #print("English words: ", e_words)

    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    new_stop_words = ["registered", "registration", "company", "number", "australia", 
                      "australian", "report", "charity", "charities", "charitable", "year", 
                      "end", "statement", "statements", "trustee", "trustees", "trust", "overseas",
                     "international", "support", "fund", "provide", "provision", "activity", "activities",
                     "providing", "provided", "program", "programme", "project"]
    stop_words.update(new_stop_words)
    s_words = [word for word in e_words if word not in stop_words]
    #print("Stop words: ", s_words)

    # Stem words (optional; left disabled because we lemmatise instead)
    #stemmed_words = [porter.stem(word) for word in s_words]

    # Remove words with fewer than three characters
    clean_words = [word for word in s_words if len(str(word)) > 2]

    return ' '.join(clean_words)
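
To make the order of the steps concrete, here is a minimal, dependency-free sketch of the same kind of pipeline. It is not the function above: `simple_preprocess`, `TOY_STOP_WORDS` and the regular-expression tokeniser are illustrative assumptions that stand in for NLTK's tokeniser, lemmatiser and word lists, and the example sentence is made up.

```python
import re

# Hypothetical, tiny stopword list for illustration only (NLTK's list is much longer)
TOY_STOP_WORDS = {"the", "and", "in", "to", "of", "a", "charity", "support"}

def simple_preprocess(text):
    """Lowercase, keep runs of letters, drop stopwords and very short words."""
    tokens = re.findall(r"[a-z]+", text.lower())              # crude tokeniser: letters only
    tokens = [t for t in tokens if t not in TOY_STOP_WORDS]   # remove stopwords
    tokens = [t for t in tokens if len(t) > 2]                # remove words with fewer than three characters
    return " ".join(tokens)

example = "The charity built 3 schools in Nepal, and trained 40 teachers."
print(simple_preprocess(example))
```

Because this sketch skips lemmatisation and the English-word check, its output will differ from `preprocess_text` on real data; it is only meant to show how each step narrows down the list of tokens.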

### Clean text using function

In [None]:
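# Ensure text column is valid: drop missing values first, since casting to str
# beforehand can turn a missing value into the literal string "nan"
data = data.dropna(subset=["activity_desc"]).copy()
data["activity_desc"] = data["activity_desc"].astype(str)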

In [None]:
data["clean_text"] = data["activity_desc"].apply(preprocess_text)
data[["abn", "activity_desc", "clean_text"]].head(5)

### Create list of documents

We want to loop over every row in the dataset and extract the charity unique id and the cleaned activity description.

In [None]:
documents = [(row["abn"], row["clean_text"]) for _, row in data.iterrows()]
documents[0:5] # view first five elements in list of documents

### Extract just the cleaned text for converting to DTM

In [None]:
text_data = [text for _, text in documents]
text_data[0:5]

### Create a Document-Term Matrix using a Count or TF-IDF vectoriser

In [None]:
vectorizer = CountVectorizer()bow = vectorizer.fit_transform(text_data)
terms = vectorizer.get_feature_names_out() # extract unique terms in corpus (vocabulary)

In [None]:
#vectorizer = TfidfVectorizer()
#bow = vectorizer.fit_transform(text_data)
#terms = vectorizer.get_feature_names_out() # extract unique terms in corpus (vocabulary)
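
If the difference between the two vectorisers is unclear, the following sketch works through it by hand on a made-up count matrix. The smoothed inverse document frequency formula and the row normalisation are written out under the assumption that they mirror the default settings of scikit-learn's `TfidfVectorizer`; the names `toy_terms` and `toy_counts` are hypothetical.

```python
import numpy as np

# Toy count-based DTM: 3 documents (rows) x 4 terms (columns); values are illustrative
toy_terms = ["child", "education", "health", "water"]
toy_counts = np.array([
    [2, 1, 0, 0],
    [0, 1, 3, 0],
    [1, 1, 0, 2],
], dtype=float)

n_docs = toy_counts.shape[0]
doc_freq = (toy_counts > 0).sum(axis=0)      # number of documents containing each term

# Smoothed inverse document frequency (assumed to match scikit-learn's default)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

tfidf = toy_counts * idf                                        # weight raw counts by idf
tfidf = tfidf / np.linalg.norm(tfidf, axis=1, keepdims=True)    # L2-normalise each document

for term, weight in zip(toy_terms, idf):
    print(f"{term:<10} idf = {weight:.3f}")
```

A term that appears in every document ("education" here) receives the lowest weight, while terms confined to a single document are weighted up. Topic models such as LDA are usually estimated on raw counts, which is why the count-based version is the one left active above.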

In [None]:
# Convert DTM into a Pandas DataFrame
dtm = pd.DataFrame(bow.toarray(), columns=vectorizer.get_feature_names_out(), index=[doc_id for doc_id, _ in documents])
document_ids = dtm.index.tolist() # create list of document ids

In [None]:
dtm

In [None]:
print(terms[0:500]) # view first 500 terms in vocabulary

## Unsupervised techniques

An unsupervised text analysis technique (or unsupervised learning more generally) is one that seeks to uncover or determine what category or class a document belongs to when those categories are not known in advance. Cluster analysis is a type of unsupervised learning technique as it groups observations according to shared characteristics or features. We do not know *a priori* what group a document belongs to, so we need to estimate or predict it based on the features of the document. This and similar techniques (e.g., PCA, topic modelling) are termed **unsupervised** because the category or class is not already known and the text analysis technique is therefore not guided or supervised as to what the correct categories or classes are.

Perhaps documents share a linguistic style, or talk about similar topics at similar rates. Unless we want to read a lot of these documents and manually code them into particular categories or classes, we need to use techniques that construct these groups from the ground up.

### Topic modelling

Topic models are an important intellectual development in the social sciences (Blei, 2012). Topic modelling is a technique for discovering the main themes in a corpus of documents. At its simplest these topics can be used to organise the documents in a corpus; that is, are there groups of documents talking about similar topics? However, topic models can also be used to measure and explain the prevalence of interesting themes across documents, e.g., do small charities talk about a certain topic more frequently than medium and large organisations?


Let's get straight into this technique using the full charity activity DTM. We will use a package we haven't seen before (`gensim`) so don't worry if the code is unfamiliar: we just need a couple of simple steps to convert our DTM to a format that package can work with.

In [None]:
docs = [[word for word, freq in zip(dtm.columns, row) for _ in range(freq)] for row in dtm.to_numpy()]
dictionary = corpora.Dictionary(docs)
dictionary.filter_extremes(no_below = 10, no_above= .95) # words must appear in at least 10 documents, and no more than 95% of documents

corpus = [dictionary.doc2bow(doc) for doc in docs]

In [None]:
print(f"Dictionary size after filtering: {len(dictionary)}")

Topic models can take quite a long time to run depending on the size of the corpus, how many topics we are looking for, how many passes through the corpus we take, and so on. Therefore it can help if we limit the scope of the topic modelling. We do this in the above code block by only including words that appear in at least 10 documents and in no more than 95% of documents.
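
The filtering rule can be sketched in plain Python to show what is being counted. This is an illustration of the idea rather than gensim's implementation: `filter_by_document_frequency` and `toy_docs` are hypothetical names, and the thresholds are scaled down to suit a four-document corpus.

```python
from collections import Counter

# Toy tokenised corpus (illustrative only)
toy_docs = [
    ["water", "health", "child"],
    ["water", "school", "child"],
    ["water", "church", "child"],
    ["water", "health", "clinic"],
]

def filter_by_document_frequency(docs, no_below, no_above):
    """Keep terms found in at least `no_below` documents and in no more
    than a `no_above` share of all documents."""
    # Count each term once per document, however often it occurs within that document
    doc_freq = Counter(term for doc in docs for term in set(doc))
    n_docs = len(docs)
    return sorted(
        term for term, df in doc_freq.items()
        if df >= no_below and df / n_docs <= no_above
    )

kept = filter_by_document_frequency(toy_docs, no_below=2, no_above=0.95)
print(kept)
```

Here "water" is dropped because it appears in every document, and "school", "church" and "clinic" are dropped because each appears in only one. Note that the rule counts documents containing a term, not total occurrences of the term.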

Notice that the filtering of very rare and very common terms has reduced the number of unique terms (or *types*) from c.4,200 to c.800. This may not be optimal but is a good starting point for our first topic model. (If you want to avoid any filtering, just place a '#' at the beginning of `dictionary.filter_extremes(no_below = 10, no_above= .95)` above.)

OK, let's estimate a topic model with five topics, and let's have the model pass over each document 10 times, and the corpus as a whole 10 times. (Think of this as a human reading each document ten times and then going over the whole corpus ten times before determining what the five topics are.) In addition we set a seed, which is a way of ensuring the topic model generates the same results every time. Remember that topic models are **probabilistic** by design: they estimate the distribution of topics within each document, and of words within each topic. Without a fixed seed, these distributions change slightly every time you estimate them.
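
To see what a seed does, here is a small sketch that draws random topic mixtures with NumPy. It does not estimate a topic model; `sample_topic_mixtures` is a hypothetical helper that only illustrates how fixing the seed makes random draws repeatable.

```python
import numpy as np

def sample_topic_mixtures(n_docs, n_topics, seed):
    """Draw one random topic mixture per document from a symmetric Dirichlet distribution."""
    rng = np.random.default_rng(seed)
    return rng.dirichlet(alpha=np.ones(n_topics), size=n_docs)

first = sample_topic_mixtures(n_docs=3, n_topics=5, seed=37)
second = sample_topic_mixtures(n_docs=3, n_topics=5, seed=37)   # same seed
third = sample_topic_mixtures(n_docs=3, n_topics=5, seed=38)    # different seed

print(np.allclose(first, second))   # do the same-seed draws match?
print(np.allclose(first, third))    # do the different-seed draws match?
print(first.sum(axis=1))            # each row is a probability distribution over topics
```

The same logic applies to `random_state` in the model: rerunning with the same seed, data and settings should reproduce the same topics, while changing the seed will typically shift them a little.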

In [None]:
seed_value = 37
topic_model = models.LdaModel(corpus, num_topics=5, id2word=dictionary, iterations=10, passes=10, random_state=seed_value)

No error messages means the code ran successfully but what about the results? First let's extract the topics and the top 20 words associated with each.

In [None]:
topics = {}
for topic_id in range(topic_model.num_topics):
    top_words = topic_model.show_topic(topic_id, topn=20)
    topics[f"Topic {topic_id+1}"] = [word for word, _ in top_words]

# Convert topics to DataFrame for display
topics_df = pd.DataFrame(topics)

In [None]:
topics_df

We can also examine the proportion of a topic that is constituted by each word.

In [None]:
for topic in topic_model.print_topics(num_topics=5, num_words=10):
    print(topic)

**TASK:** Increase the number of words (`num_words=10`) in the above code to see what proportion other words contribute to the topic.

We can also view the number of unique terms associated with each topic:

In [None]:
topic_word_counts = {f"Topic {i+1}": len(topic_model.show_topic(i, topn=len(dictionary))) 
                     for i in range(topic_model.num_topics)}
topic_word_counts

It looks like every term is associated with every topic, which is not informative. However this is expected: all documents share the same topics, and all topics share the same words. Where topics differ is in the probability each word has of being associated with each topic. Put another way, what is the contribution of each term to a topic?

In [None]:
topic_word_counts = {
    f"Topic {i+1}": sum(1 for _, prob in topic_model.show_topic(i, topn=len(dictionary)) if prob > 0.01) 
    for i in range(topic_model.num_topics)
}
topic_word_counts

Now we can see that there are only a small number of unique terms per topic that make up a meaningful proportion of the topic (i.e., more than 1% probability of being associated with a topic).
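
The same comparison can be reproduced on a made-up topic-term matrix, which makes the logic easier to follow. The numbers and the name `toy_topics` are illustrative assumptions, not values from the model above.

```python
import numpy as np

# Toy topic-term probabilities: 2 topics (rows) x 6 terms (columns); each row sums to 1
toy_topics = np.array([
    [0.600, 0.300, 0.085, 0.005, 0.005, 0.005],
    [0.004, 0.004, 0.002, 0.490, 0.300, 0.200],
])

threshold = 0.01
nonzero_terms = (toy_topics > 0).sum(axis=1)             # terms with any probability at all
meaningful_terms = (toy_topics > threshold).sum(axis=1)  # terms above the 1% threshold

print(nonzero_terms)
print(meaningful_terms)
```

Every term has some probability under both topics, but only three terms per topic clear the 1% threshold. The threshold itself is a judgement call, so it is worth trying a few values.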

Tabular representations of the topics can be informative but visualisations are an excellent way of deriving deeper insight from the results.

In [None]:
pyLDAvis.enable_notebook()

# Prepare visualization
tm = gensimvis.prepare(topic_model, corpus, dictionary, mds='mmds', sort_topics=True)

# Save as HTML
pyLDAvis.save_html(tm, 'topic-model.html')

# Display visualization
pyLDAvis.display(tm)

The visualisation can be overwhelming so here is some practical guidance for interpreting the findings:
* Look at the left panel (scatterplot of topics): Are topics well-separated, or do they overlap?
* Click on topics one by one: What are the key words in each topic?
* Adjust the λ slider: Does it help clarify the topics? Which words are exclusive to a particular topic (choose a small lambda value)?
* Compare the top words: Do they align with expected themes?

An initial interpretation of topic 4 suggests it is about religion. Words like "church", "ministry", and "faith" are both common to and highly associated with this topic: that is, these words appear frequently in relation to this topic **and**, when these words are used in the corpus, it is almost entirely in relation to this topic. (Adjust the λ slider to a lower value to see words that are not especially common in the corpus overall but are distinctive of a given topic.)
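
The effect of the λ slider can be sketched with the relevance measure that pyLDAvis is based on: a weighted mix of how probable a term is within a topic and how much more probable it is there than in the corpus overall. The probabilities and the names `toy_terms` and `top_terms` below are made up for illustration.

```python
import numpy as np

toy_terms = ["church", "community", "faith", "health"]
p_term_given_topic = np.array([0.30, 0.40, 0.20, 0.10])   # share of the topic taken by each term
p_term = np.array([0.05, 0.40, 0.02, 0.53])               # share of the whole corpus taken by each term

def relevance(p_w_given_t, p_w, lam):
    """lam * log p(w|t) + (1 - lam) * log(p(w|t) / p(w)), for lam between 0 and 1."""
    return lam * np.log(p_w_given_t) + (1 - lam) * np.log(p_w_given_t / p_w)

def top_terms(lam):
    """Return the toy terms ordered from most to least relevant for a given lambda."""
    order = np.argsort(-relevance(p_term_given_topic, p_term, lam))
    return [toy_terms[i] for i in order]

print(top_terms(1.0))   # ranks purely by frequency within the topic
print(top_terms(0.2))   # favours terms that are distinctive of the topic
```

With λ = 1 the generic word "community" ranks first because it is frequent in the topic; with λ = 0.2 it falls behind "faith" and "church", which are rare in the corpus overall but concentrated in this topic.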

**TASK:** Select topic 5 and interpret the results. What theme do you think this topic represents? What words are common to this topic in terms of frequency and proportion?

#### Recovering the Document-Topic Matrix and the Topic-Term Matrix

Topic modelling produces two sets of results or matrices of interest:
* The Document-Topic Matrix - the distribution of documents (rows) across topics (columns). The cells contain the probability of each topic appearing in a document.
* The Topic-Term Matrix - the distribution of topics (rows) across terms (columns). The cells contain the probability of each word appearing in a topic.
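
A toy example may help to show how the two matrices fit together. The values below are invented, and `toy_doc_topic` and `toy_topic_term` are hypothetical names; the point is only that each row is a probability distribution and that multiplying the matrices gives the term probabilities implied for each document.

```python
import numpy as np

# Toy Document-Topic Matrix: 2 documents x 2 topics (rows sum to 1)
toy_doc_topic = np.array([
    [0.9, 0.1],
    [0.3, 0.7],
])

# Toy Topic-Term Matrix: 2 topics x 3 terms (rows sum to 1)
toy_topic_term = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
])

# Implied probability of each term in each document:
# sum over topics of p(topic | document) * p(term | topic)
doc_term_probs = toy_doc_topic @ toy_topic_term

# Most probable topic per document (+1 so labels match "Topic 1", "Topic 2", ...)
dominant_topic = toy_doc_topic.argmax(axis=1) + 1

print(doc_term_probs)
print(dominant_topic)
```

Assigning each document to its most probable topic, as in the last step, is a common simplification, but it discards information for documents that are split fairly evenly across topics.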

In [None]:
# Document-Topic Matrix
doc_topic_matrix = []
for doc_bow in corpus:
    topic_distribution = topic_model.get_document_topics(doc_bow, minimum_probability=0)
    doc_topic_matrix.append([prob for _, prob in topic_distribution])

# Convert to DataFrame for visualization
doc_topic_df = pd.DataFrame(doc_topic_matrix, columns=[f"Topic {i+1}" for i in range(topic_model.num_topics)])

# Add original document ids
doc_topic_df["doc_id"] = document_ids # this is possible because topic modelling preserves row order from the DTM

doc_topic_df

How can we validate whether the documents actually cohere with the topics? We have interpreted topic 4 as 'religion' though there are other words relating to education. Let's look at the documents that map most closely to topic 4 and read the original text.

In [None]:
doc_topic_df.sort_values(by='Topic 4', ascending=False, inplace=False)

For example, document `87161085650` seems to be mainly about topic 4 ('religion'), as evidenced by its high proportion (97% of the document's words are associated with this topic). Let's perform a close reading of this document to see if this really is the case.

In [None]:
pd.set_option('display.max_colwidth', None)  # Show full content in columns, including long text
data.loc[data["abn"] == 87161085650, ["abn", "activity_desc", "clean_text"]]

Hmm, seems to be more about education than religion. Let's look at some others:

In [None]:
data.loc[data["abn"] == 76611738464, ["abn", "activity_desc", "clean_text"]]

In [None]:
pd.reset_option('display.max_colwidth') # reset display settings

In [None]:
# Topic-Term Matrix
topic_term_matrix = topic_model.get_topics()
topic_term_df = pd.DataFrame(topic_term_matrix, columns=[dictionary[i] for i in range(len(dictionary))])
topic_term_df

#### Model evaluation

How do we know how many topics is optimal? Well this is really a subjective task, as it is the analyst's judgement that matters most here. There are "objective" approaches we could take. One of these is called the *coherence score*: 

In [None]:
coherence_model = CoherenceModel(model=topic_model, texts=docs, dictionary=dictionary, coherence='c_v')
coherence_score = coherence_model.get_coherence()
print("Coherence score: ", coherence_score)

Coherence is used to compare different topic models, especially where the number of topics differs. Higher values (closer to 1 for the `c_v` measure used here) indicate topics that are more semantically consistent: the top words within each topic tend to appear together in the same documents.
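
To give a feel for what a coherence score rewards, here is a sketch of a simpler, UMass-style measure written in plain Python. It is not the `c_v` measure computed above, and its values are on a different scale (often negative); `umass_coherence` and `toy_docs` are hypothetical names, and the function assumes every word in `top_words` appears in at least one document.

```python
import math
from itertools import combinations

# Toy tokenised corpus (illustrative only)
toy_docs = [
    ["church", "faith", "ministry"],
    ["church", "faith", "school"],
    ["school", "education", "child"],
    ["education", "child", "faith"],
]

def umass_coherence(top_words, docs):
    """Average, over pairs of top words, of log((co-document frequency + 1) / document frequency)."""
    doc_sets = [set(doc) for doc in docs]
    scores = []
    # top_words is assumed to be ordered from most to least probable in the topic
    for w_i, w_j in combinations(top_words, 2):
        df_i = sum(1 for d in doc_sets if w_i in d)                 # documents containing the higher-ranked word
        co_df = sum(1 for d in doc_sets if w_i in d and w_j in d)   # documents containing both words
        scores.append(math.log((co_df + 1) / df_i))
    return sum(scores) / len(scores)

coherent = umass_coherence(["church", "faith", "ministry"], toy_docs)
mixed = umass_coherence(["church", "education", "child"], toy_docs)
print(coherent, mixed)
```

The set of words that tend to share documents scores higher than the mixed set. In practice you would estimate several models with different numbers of topics and compare their scores, treating the result as one input to your judgement rather than a definitive answer.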

## What have we learned?

Let's recap what key skills and techniques we've learned:
* **How to import modules**. You will usually need to import modules into Python to support your work. Python does come with some methods and functions that are ready to use straight away, but for text analysis tasks you'll almost certainly need to import some additional modules.
* **How to perform unsupervised text analyses**. There are a number of common and key analytical techniques that can yield substantive insight into key features of documents.
* **How to do all of the above in an efficient, clear and effective manner**.

## Conclusion

These are but a selection of the analytical techniques at your disposal; however they are common and often key ones in text analysis projects. Topic modelling in particular is a comprehensive analytical technique that deserves deeper engagement and practice.

## Exercise

Perform topic modelling using the other file in the data folder (*acnc-overseas-activities-2021.csv*).

In [None]:
# INSERT CODE HERE

In [None]:
# INSERT CODE HERE

--END OF FILE--