![SGSSS Logo](../img/SGSSS_Stacked.png)

# Text Analysis

## Introduction

Computational methods are transforming research practice across the disciplines. For social scientists these methods offer a number of valuable opportunities, including creating new datasets from digital sources; unearthing new insights and avenues for research from existing data sources; and improving the accuracy and efficiency of fundamental research activities.

In this lesson we introduce and apply a range of text analysis techniques to social science data.

### Aims

This lesson has two aims:
1. Demonstrate how to use Python to analyse text data relating to charitable activities.
2. Cultivate your computational thinking skills through coding examples. In particular, how to define and solve a data preprocessing problem using a computational method.

### Lesson details

* **Level**: Introductory
* **Time**: 40-60 minutes
* **Pre-requisites**: None
* **Audience**: Researchers and analysts from any disciplinary background
* **Learning outcomes**:
    1. Understand and apply common text analysis techniques to social science data.
    2. Be able to use Python for performing text analysis.

## Guide to using this resource

This learning resource was built using <a href="https://jupyter.org/" target=_blank>Jupyter Notebook</a>, an open-source software application that allows you to mix code, results and narrative in a single document. As <a href="https://jupyter4edu.github.io/jupyter-edu-book/" target=_blank>Barba et al. (2019)</a> espouse:
> In a world where every subject matter can have a data-supported treatment, where computational devices are omnipresent and pervasive, the union of natural language and computation creates compelling communication and learning opportunities.

If you are familiar with Jupyter notebooks then skip ahead to the main content (*How do we analyse social science text data?*). Otherwise, the following is a quick guide to navigating and interacting with the notebook.

### Interaction

**You only need to execute the code that is contained in sections which are marked by `In []`.**

To execute a cell, click or double-click the cell and press the `Run` button on the top toolbar (you can also use the keyboard shortcut Shift + Enter).

Try it for yourself:

In [None]:
print("Enter your name and press enter:")
name = input()
print("\r")
print("Hello {}, enjoy learning more about Python and text analysis!".format(name))

### Learn more

Jupyter notebooks provide rich, flexible features for conducting and documenting your data analysis workflow. To learn more about additional notebook features, we recommend working through some of the <a href="https://github.com/darribas/gds19/blob/master/content/labs/lab_00.ipynb" target=_blank>materials</a> provided by Dani Arribas-Bel at the University of Liverpool. 

## How do we analyse social science text data?

There is a wide array of text analysis techniques that we could apply in our research:
* **Descriptive inference:** how to characterise text; vector space model, bag of words, (dis)similarity measures, diversity, complexity, style, bursts.
* **Supervised techniques:** dictionaries, sentiment analysis, categorising.
* **Unsupervised techniques:** cluster analysis, PCA, topic modelling, embeddings. (Spirling, 2022)

To say nothing of using Generative AI or Large Language Models (LLMs) to conduct these analyses on our behalf.

In this lesson we focus on a number of common and often key analytical approaches:
* Bag of words summaries and visualisations e.g., word clouds
* Similarity metrics e.g., cosine similarity
* Discriminating words e.g., Mutual Information and Fightin' Words

We will cover the following topics in a later lesson:
* Topic modelling

## Preliminaries

First we need to ensure Python has the functionality it needs for text analysis. As you will see, it needs quite a bit of extra functionality, so this may take some time to install / import depending on your machine.

In [None]:
# Install additional packages - only run once per machine
!pip install autocorrect

Packages for general data and file management:

In [None]:
import pandas as pd
import numpy as np
import json
import os
import re

Packages for processing text data:

In [None]:
import nltk                       # get nltk 
from nltk import word_tokenize    # and some of its key functions
from nltk import sent_tokenize
from nltk import FreqDist

from autocorrect import Speller   # things we need for spell checking
check = Speller(lang='en')

nltk.download('stopwords') # additional words or dictionaries we can use to check our documents against
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('wordnet')
nltk.download('webtext')
nltk.download('words')

from nltk.corpus import words     # list of valid English words
english_words = set(words.words())

from nltk.corpus import stopwords # list of common words that are not substantively informative e.g., "the"
stop_words = set(stopwords.words('english'))

from nltk.corpus import wordnet                    # functions we need for lemmatising
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer() 

from nltk.stem.porter import PorterStemmer         # function we need for stemming
porter = PorterStemmer()

from sklearn.feature_extraction.text import CountVectorizer # function we need for converting text to numeric
vectorizer = CountVectorizer()

from sklearn.feature_extraction.text import TfidfVectorizer # function we need for converting text to numeric
tfidf_vectorizer = TfidfVectorizer()

from collections import Counter

print("Successfully imported necessary modules")    # The print statement is just a bit of encouragement!

Packages for analysing text data:

In [None]:
from sklearn.metrics.pairwise import cosine_similarity # for calculating document similarity
from wordcloud import WordCloud # for producing word clouds of DTMs
import matplotlib.pyplot as plt # for data visualisation
from sklearn.feature_selection import mutual_info_classif # for calculating mutual information score

### Import data

A second important preliminary step is to import the text data you will be using.

In [None]:
infile = "https://raw.githubusercontent.com/SGSSSonline/text-analysis-summer-school-2025/refs/heads/main/data/acnc-overseas-activities-2022.csv" # define file to be imported

data = pd.read_csv(infile, encoding="ISO-8859-1")

In [None]:
data.sample(10)

In [None]:
data["activity_desc"].sample(10)

### Create Document Term Matrix

You have likely created and saved this in a previous lesson but let's start afresh just in case.

In [None]:
def preprocess_text(text):

    # Tokenize the text and convert to lowercase
    words = nltk.word_tokenize(text)
    lower_words = [word.lower() for word in words]
    #print(lower_words)

    # Remove punctuation and numbers
    a_words = [word for word in lower_words if word.isalpha()]
    #print("Alpha words: ",a_words)

    # Lemmatise words
    lemmed_words = [lemmatizer.lemmatize(word) for word in a_words]
    #print("Lemmed words: ",lemmed_words)
    
    # Remove non-English words
    e_words = [word for word in lemmed_words if word in english_words]
    #print("English words: ", e_words)

    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    new_stop_words = ["registered", "registration", "company", "number", "australia", 
                      "australian", "report", "charity", "charities", "charitable", "year", 
                      "end", "statement", "statements", "trustee", "trustees", "trust", "overseas",
                     "international", "support", "fund", "provide", "provision", "activity", "activities",
                     "providing", "provided", "program", "programme", "project"]
    stop_words.update(new_stop_words)
    s_words = [word for word in e_words if word not in stop_words]
    #print("Stop words: ", s_words)

    # Stem words (optional; uncomment to apply)
    #stemmed_words = [porter.stem(word) for word in s_words]

    # Remove words with fewer than three characters
    clean_words = [word for word in s_words if len(str(word)) > 2]

    return ' '.join(clean_words)

### Clean text using function

In [None]:
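# Ensure text column is valid: drop missing descriptions first, then convert to string
# (converting first would turn missing values into the literal string "nan")
data = data.dropna(subset=["activity_desc"]).copy()
data["activity_desc"] = data["activity_desc"].astype(str)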

In [None]:
data["clean_text"] = data["activity_desc"].apply(preprocess_text)
data[["abn", "activity_desc", "clean_text"]].sample(5)

### Create list of documents

We want to loop over every row in the dataset and extract the charity unique id and the cleaned activity description.

In [None]:
documents = [(row["abn"], row["clean_text"]) for _, row in data.iterrows()]
documents[0:5] # view first five elements in list of documents

### Extract just the cleaned text for converting to DTM

In [None]:
text_data = [text for _, text in documents]
text_data[0:5]

### Create a Document-Term Matrix using a Count or TF-IDF vectoriser

In [None]:
vectorizer = CountVectorizer()bow = vectorizer.fit_transform(text_data)
terms = vectorizer.get_feature_names_out() # extract unique terms in corpus (vocabulary)

In [None]:
#vectorizer = TfidfVectorizer()
#bow = vectorizer.fit_transform(text_data)
#terms = vectorizer.get_feature_names_out() # extract unique terms in corpus (vocabulary)

In [None]:
# Convert DTM into a Pandas DataFrame
dtm = pd.DataFrame(bow.toarray(), columns=vectorizer.get_feature_names_out(), index=[doc_id for doc_id, _ in documents])

In [None]:
dtm

## Descriptive inference

Descriptive inference in text analysis refers to the process of summarizing and identifying patterns, structures, and key characteristics in textual data without making causal claims (Grimmer & Stewart, 2013). There are many approaches we could take but in this lesson we will focus on:
* Bag of words summaries and visualisations e.g., word clouds
* Similarity metrics e.g., cosine similarity
* Discriminating words e.g., Mutual Information and Fightin' Words

### Simple summaries of documents and terms

A DTM represents a corpus as a series of rows and columns:
* Each row represents a document in the corpus
* Each column represents a term in the corpus
* Each cell represents the frequency (weighted if the tf-idf vectoriser was used) with which each term appears in each document

The DTM offers us the ability to apply linear algebra to text and produce summaries across documents (e.g., how many times does a particular term appear in the corpus overall?) and within documents (e.g., how many terms does a document contain?).

In [None]:
# Compute row totals (sum of term frequencies per document)
row_totals = dtm.sum(axis=1)

# Compute column totals (sum of term frequencies per term across documents)
column_totals = dtm.sum(axis=0)

# Convert both to data frames
dtm_row_totals = pd.DataFrame(row_totals, columns=["Document Totals"])
dtm_row_totals.index.name = "abn" # rename index as "abn" (unique charity id)
dtm_row_totals = dtm_row_totals.reset_index() # convert index to column

dtm_col_totals = pd.DataFrame(column_totals, columns=["Term Totals"])

In [None]:
dtm_row_totals

We can go back to the DTM and look at the terms for a specific document as follows:

In [None]:
index_value = 11000761571  # Pick a unique id of a document

# Filter the row using .loc[] and select only columns with nonzero values
dtm.loc[[index_value], dtm.loc[index_value] > 0]

In [None]:
# Produce summary statistics
dtm_row_totals["Document Totals"].describe()

Now we have a summary of how long each document is (at least after preprocessing). That allows us to do some useful analyses.

In [None]:
dtm_row_totals['Document Totals'].hist(bins=10, edgecolor='black', figsize=(8, 5))
plt.xlabel('Total number of terms in document')
plt.ylabel('Frequency')
plt.title('Histogram of total document terms')
plt.show()

**QUESTION:** What shape distribution - in broad terms - do the document totals have?

**TASK:** Change the number of bins in the histogram and recreate the plot. How does the shape of the distribution change?

How about the frequency of terms across the entire corpus, not just within a document?

In [None]:
dtm_col_totals

**TASK:** Produce summary statistics and a histogram for the term totals. Write a summary of the results. (*See a simple solution at the end of this notebook*)

In [None]:
# INSERT CODE HERE

### Word Clouds

A word cloud is a simple visualisation of a bag of words representation of a corpus. It displays the terms in a corpus according to how frequently each appears: common terms are shown in a larger font and tend to sit near the centre of the visualisation, while rare terms are shown in a smaller font and tend to sit near the edges.

In [None]:
# Compute word frequencies from the DTM by summing each term's column across all documents
word_frequencies = dtm.sum(axis=0).to_dict()

# Generate the word cloud from DTM word frequencies
wordcloud = WordCloud(width=800, height=400, background_color="white").generate_from_frequencies(word_frequencies)

# Display the word cloud
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

**QUESTION:** Write a short summary of the results of the word cloud. What can we say about the activities of overseas charities? Do the results point to any overlooked preprocessing work we needed to do?

### (Dis)similarity metrics

An oft-posed question in text analysis is: how different or similar are documents in a corpus? For example, do they use similar terms and in similar frequencies or proportions? A robust and common metric that helps us answer these types of questions is **Cosine Similarity**. This is a measure of how similar two vectors (rows) in a DTM are.
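
Cosine similarity is the cosine of the angle between two document vectors: their dot product divided by the product of their lengths. The sketch below computes it by hand with NumPy for a few made-up count vectors. The `cosine` helper and the toy vectors are illustrative only; the lesson itself uses scikit-learn's `cosine_similarity`.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two term-count vectors (illustrative helper)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical counts for three terms: school, water, health
doc_a = [1, 1, 0]
doc_b = [2, 2, 0]   # same mix of terms as doc_a, but twice as long
doc_c = [0, 0, 3]   # shares no terms with doc_a
doc_d = [1, 0, 0]   # shares one of doc_a's two terms

sim_ab = cosine(doc_a, doc_b)
sim_ac = cosine(doc_a, doc_c)
sim_ad = cosine(doc_a, doc_d)
print(round(sim_ab, 3), round(sim_ac, 3), round(sim_ad, 3))
```

Because term counts are never negative, scores fall between 0 (no shared terms) and 1 (identical mix of terms). Note that `doc_a` and `doc_b` score 1 despite differing in length: the measure compares the proportions of terms, not document size.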

Let's start by calculating this measure for all pairs of documents in the DTM.

In [None]:
cosine_sim_matrix = cosine_similarity(dtm)
cosine_sim_matrix

It is difficult to read in this format so let's convert it to a more usable state:

In [None]:
# Convert to DataFrame for better visualization
cosine_sim_df = pd.DataFrame(cosine_sim_matrix, index=dtm.index, columns=dtm.index)
cosine_sim_df

That's much better. We have retained the document ids (unique charity number) so we can make easier comparisons between documents. This format should be familiar by now: it is a matrix just like a DTM is. Instead now we have a DDM (Document Document Matrix) where the rows and columns are both documents and the cells are cosine similarity scores.

**QUESTION:** The diagonal values in the similarity matrix all equal 1. How do you interpret this?

We can select particular pairs of documents and their cosine similarity score as follows:

In [None]:
cosine_sim_df.loc[11095489197, 99821785872] # select a cell in the matrix for two documents using their unique ids

These documents are very different as the cosine similarity score is very close to zero. We should validate whether this is informative by looking at the original documents (activity descriptions).

In [None]:
data["activity_desc"].loc[data["abn"] == 11095489197].tolist() # convert to list for easier viewing

In [None]:
data["activity_desc"].loc[data["abn"] == 99821785872].tolist() # convert to list for easier viewing

**QUESTION:** Do you agree with the cosine similarity score that these documents are very different?

**TASK:** Select a different pair of documents and compare their similarity scores and original document text. (Hint: you can see the full list of unique document ids by running `cosine_sim_df.index.tolist()`)

In [None]:
# INSERT CODE HERE

Looking at pairs of documents can be interesting but is unlikely to be the focus of your analysis. Instead you are likely interested in summaries of cosine similarity across documents e.g., which documents have the highest / lowest average similarity score? We can use our techniques from earlier to achieve this i.e., calculating row summaries.

In [None]:
# Compute row means (mean of similarity scores per document)
cosine_sim_df["Avg similarity score"] = cosine_sim_df.mean(axis=1)
cosine_sim_df["Avg similarity score"].describe()
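
One design choice worth noting: a plain row mean includes each document's similarity with itself (the diagonal value of 1), which nudges every average upwards. The sketch below, using a made-up 3 x 3 similarity matrix with illustrative `sim_*` names, shows one possible way to exclude the diagonal before averaging.

```python
import numpy as np
import pandas as pd

# Hypothetical similarity matrix for three documents
sim_ids = ["doc_a", "doc_b", "doc_c"]
sim_toy = pd.DataFrame([[1.0, 0.5, 0.0],
                        [0.5, 1.0, 0.2],
                        [0.0, 0.2, 1.0]],
                       index=sim_ids, columns=sim_ids)

# Row means including the self-similarity on the diagonal
sim_mean_with_self = sim_toy.mean(axis=1)

# Replace the diagonal with NaN; pandas skips NaN values when averaging
off_diagonal = sim_toy.where(~np.eye(len(sim_toy), dtype=bool))
sim_mean_without_self = off_diagonal.mean(axis=1)

print(sim_mean_with_self)
print(sim_mean_without_self)
```

With many documents the difference is small, but with a small corpus it can noticeably change the averages.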

**QUESTION:** How different are activity descriptions of overseas charities?

**TASK:** Create a histogram of the `Avg similarity score` variable.

In [None]:
# INSERT CODE HERE

### Discriminating words

The idea with this analytical approach is to identify words that characterise the language use in a group of documents in the corpus (Grimmer et al., 2022). In essence, there may be words that are more prevalent in certain types of documents than others e.g., do large charities describe their overseas activities differently to medium or small organisations? The overarching aim is to be able to explain / predict documents belonging to certain categories (as opposed to uncovering what these categories or groups are).

#### Mutual Information

In order to implement this approach we need information from the DTM (the term frequencies) and the original data (the charity types). This helps us answer the question: are there terms that are more associated with large or medium charities compared to small organisations?
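
To build some intuition first, here is a minimal sketch on a made-up DTM with two terms and six documents (the counts, labels and `mi_toy_*` names are illustrative only). It uses the same `mutual_info_classif` function with `discrete_features=True`.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

# Hypothetical term counts: "school" only appears in documents from small charities,
# while "health" is used identically by both groups
mi_toy_dtm = pd.DataFrame({"school": [2, 1, 1, 0, 0, 0],
                           "health": [1, 0, 1, 1, 0, 1]})
mi_toy_size = pd.Series(["Small"] * 3 + ["Large"] * 3)

# Encode the size categories as numerical labels
mi_toy_labels = mi_toy_size.astype("category").cat.codes

mi_toy_scores = pd.Series(
    mutual_info_classif(mi_toy_dtm, mi_toy_labels, discrete_features=True),
    index=mi_toy_dtm.columns,
)
print(mi_toy_scores.round(3))
```

Knowing the count of "school" tells us the charity size exactly, so it receives the maximum possible score for two equally sized groups (the natural log of 2, roughly 0.693). The count of "health" tells us nothing about size, so its score is zero.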

In [None]:
dtm

In [None]:
data

In [None]:
# Convert the index into a column so we can merge charity size information
dtm_mi = dtm # create a new version of DTM
dtm_mi = dtm_mi.reset_index() # convert index to column
dtm_mi.rename(columns={"index": "abn"}, inplace=True) # rename index as "abn" (unique charity id)

In [None]:
dtm_mi.columns

In [None]:
# Merge charity size information from original data
dtm_mi = dtm_mi.merge(data[["abn", "charitysize"]], on="abn", how="left")

In [None]:
# Encode the "charitysize" column as numerical labels
charitysize_labels = dtm_mi["charitysize"].astype("category").cat.codes

# Compute Mutual Information scores for each term in the DTM
mi_scores = mutual_info_classif(dtm_mi.iloc[:, 1:-1], charitysize_labels, discrete_features=True)

# Create a DataFrame for Mutual Information Scores
mi_df = pd.DataFrame({"Term": dtm_mi.columns[1:-1], "Mutual Information": mi_scores})

# Sort by highest MI score
mi_df = mi_df.sort_values(by="Mutual Information", ascending=False)

In [None]:
mi_df

This gives us a list of mutual information scores for terms in the corpus, where higher values indicate greater usefulness in distinguishing between documents. From the analysis above we can see that the term "country" is the most useful for distinguishing between the documents of charities of different sizes. However, we do not know whether this term is particularly associated with a given charity size (e.g., is it mainly large charities that use the term?). Therefore we need to compute the mutual information scores for each charity size separately and compare.

In [None]:
class_labels = dtm_mi["charitysize"].unique()  # Get unique class names
mi_results = {}

for cls in class_labels:
    # Convert multi-class labels into binary (1 for current class, 0 for others)
    binary_labels = (dtm_mi["charitysize"] == cls).astype(int)
    
    # Compute MI scores
    mi_scores = mutual_info_classif(dtm_mi.iloc[:, 1:-1], binary_labels, discrete_features=True)
    
    # Store results in dictionary
    mi_results[cls] = mi_scores

# Convert to DataFrame
mi_df = pd.DataFrame(mi_results, index=dtm_mi.columns[1:-1])
mi_df.columns.name = "Charity Size"

In [None]:
mi_df

It can be difficult to read so let's sort by highest score for each charity size.

In [None]:
# Change display settings to avoid scientific notation
pd.set_option("display.float_format", "{:.4f}".format)

In [None]:
# Sort by highest MI score for small charities
mi_df_small = mi_df.sort_values(by="Small", ascending=False)
mi_df_small.head(10)

In [None]:
# Sort by highest MI score for medium charities
mi_df_medium = mi_df.sort_values(by="Medium", ascending=False)
mi_df_medium.head(10)

In [None]:
# Sort by highest MI score for large charities
mi_df_large = mi_df.sort_values(by="Large", ascending=False)
mi_df_large.head(10)

There are some potentially insightful differences: small and medium charities use words like "school" and "partner" much more than larger organisations, while the latter use words like "equality", "inclusion" and "research" more than their smaller counterparts.

**QUESTION:** Compare the top five distinguishing terms for each charity size. What do you notice? Are there any particular terms you think are noteworthy or indicative of real differences in the types of activities of different charities?

#### Fightin' Words

Or "Feature Weighting using Log-Odds Ratio with Informative Dirichlet Priors" to give it its proper title. These are words that are overrepresented in one document compared to another. The Fightin' Words approach is particularly useful for analysing differences in word usage between two documents (Monroe et al., 2008), as it takes into account the extent to which words are used in documents, not just whether they are present or not (like in Mutual Information scores). 

The calculation of the Fightin' Words score is a bit complicated, but essentially it generates z-scores that allow us to say whether certain words are **statistically significantly** more likely to appear in certain groups of documents than in others.
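
The core of the calculation can be sketched on made-up counts for two groups of documents. The function below is a minimal illustration, not a definitive implementation: it assumes a uniform prior of `alpha` added to every term, and the function name, term list and counts are all hypothetical.

```python
import numpy as np

def fightin_words_z(counts_a, counts_b, alpha=1.0):
    """Z-scores of the log-odds ratio with a uniform Dirichlet prior (sketch)."""
    counts_a = np.asarray(counts_a, dtype=float)
    counts_b = np.asarray(counts_b, dtype=float)
    alpha_0 = alpha * counts_a.size            # total prior mass over the vocabulary
    n_a, n_b = counts_a.sum(), counts_b.sum()  # total words in each group

    # Smoothed log-odds of each term within each group
    log_odds_a = np.log((counts_a + alpha) / (n_a + alpha_0 - counts_a - alpha))
    log_odds_b = np.log((counts_b + alpha) / (n_b + alpha_0 - counts_b - alpha))

    # Difference in log-odds and its approximate variance
    delta = log_odds_a - log_odds_b
    variance = 1.0 / (counts_a + alpha) + 1.0 / (counts_b + alpha)
    return delta / np.sqrt(variance)

fw_toy_terms = ["school", "global", "community"]
fw_toy_small = [10, 1, 5]   # hypothetical counts in small-charity documents
fw_toy_large = [1, 10, 5]   # hypothetical counts in medium/large-charity documents

fw_toy_z = dict(zip(fw_toy_terms, fightin_words_z(fw_toy_small, fw_toy_large)))
for term, z in fw_toy_z.items():
    print(term, round(float(z), 2))
```

Here "school" gets a large positive z-score (overrepresented in the first group), "global" gets an equally large negative score, and "community", used at the same rate by both groups, scores zero.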

To make the analysis simpler we will divide documents into two categories (or classes): those written by small charities and those written by medium or large charities.

In [None]:
# Convert the index into a column so we can merge charity size information
dtm_fw = dtm # create a new version of DTM
dtm_fw = dtm_fw.reset_index() # convert index to column
dtm_fw.rename(columns={"index": "abn"}, inplace=True) # rename index as "abn" (unique charity id)

In [None]:
# Merge charity size information from original data
dtm_fw = dtm_fw.merge(data[["abn", "charitysize"]], on="abn", how="left")

In [None]:
dtm_fw = dtm_fw.set_index("abn")
dtm_fw

In [None]:
# Ensure 'charitysize' is a string column
dtm_fw["charitysize"] = dtm_fw["charitysize"].astype(str)

# Create binary class labels: 'small' vs. 'medium/large'
dtm_fw["binary_charitysize"] = dtm_fw["charitysize"].apply(lambda x: "small" if x == "Small" else "medium_large")

# Identify term columns (excluding metadata)
term_columns = dtm_fw.columns.difference(["charitysize", "binary_charitysize"])

# Ensure term columns are numeric
dtm_fw[term_columns] = dtm_fw[term_columns].apply(pd.to_numeric, errors="coerce")

# Drop any remaining non-numeric values (optional)
dtm_fw.dropna(inplace=True)

# Compute word counts for each class
word_counts = dtm_fw.groupby("binary_charitysize")[term_columns].sum()

# Apply Dirichlet prior smoothing (Laplace smoothing with α=1)
alpha = 1  # Smoothing parameter
word_probs = (word_counts + alpha) / (word_counts.sum(axis=1) + alpha * len(term_columns)).values[:, None]

# Compute log-odds ratio (difference between the classes in the log-odds of each term)
log_odds = np.log(word_probs / (1 - word_probs))
log_odds_ratio = log_odds.loc["small"] - log_odds.loc["medium_large"]

# Compute variance using Dirichlet prior
variance = (1 / (word_counts.loc["small"] + alpha)) + (1 / (word_counts.loc["medium_large"] + alpha))
std_dev = np.sqrt(variance)

# Compute z-scores (Fightin' Words metric)
z_scores = log_odds_ratio / std_dev

# Create a DataFrame for Fightin' Words Scores
fw_df = pd.DataFrame({"Term": term_columns, "Log-Odds Ratio": log_odds_ratio, "Z-Score": z_scores})
fw_df = fw_df.sort_values(by="Z-Score", ascending=False)  # Sort by highest Z-score

There is a lot of mathematics in the above code, so let's focus on the interpretation instead. We are interested in terms with z-scores greater than 2, i.e., terms that are **statistically significantly** more likely to be found in the activity descriptions of small charities.

In [None]:
fw_df.head(20) # top 20 z-scores (most discriminating words for small charities)

Conversely, we can also look at negative z-scores to see which words are **statistically significantly** more likely to be found in the activity descriptions of medium/large charities.

In [None]:
fw_df.tail(20) # bottom 20 z-scores (most discriminating words for medium/large charities)

This looks considerably more insightful than the Mutual Information (MI) approach. Small charities seem to discuss issues relating to schooling ("pupil", "school", "scholarship") much more frequently than medium or large organisations. Conversely they are less likely to operate at a global scale according to their word usage (e.g., they don't tend to use words like "across", "country" or "world").

**QUESTION:** Are there meaningful words that distinguish the activity descriptions of small, medium and large charities? Can you make any substantive conclusions about differences in the nature of overseas activities by charity size? 

We are generally not interested in terms with z-scores between -2 and +2, as these terms are likely to be found at similar rates in the documents of small and medium/large charities; however we can look at them as follows:

In [None]:
fw_df[(fw_df["Z-Score"] >= -2) & (fw_df["Z-Score"] <= 2)]

**TASK:** Change the range of the z-scores so that you examine terms with values between -.5 and .5.

## What have we learned?

Let's recap what key skills and techniques we've learned:
* **How to import modules**. You will usually need to import modules into Python to support your work. Python does come with some methods and functions that are ready to use straight away, but for text analysis tasks you'll almost certainly need to import some additional modules.
* **How to make descriptive inferences from text data**. There are a number of common and key analytical techniques that can yield substantive insight into key features of documents.
* **How to do all of the above in an efficient, clear and effective manner**.

## Conclusion

These are but a selection of the analytical techniques at your disposal; however they are common and often key ones in text analysis projects. In the next practical we will focus on more substantively rich analytical techniques, specifically sentiment analysis and topic modelling.

## Exercise

Create a DTM / DFM using the other file in the data folder (*acnc-overseas-activities-2021.csv*) and apply the analytical techniques demonstrated in this notebook.

In [None]:
# INSERT CODE HERE

In [None]:
# INSERT CODE HERE

The solution is provided at the end of this course.

## Appendix A

#### Producing summaries and visualisations of term totals in corpus

In [None]:
dtm_col_totals["Term Totals"].describe()

In [None]:
dtm_col_totals['Term Totals'].hist(bins=500, edgecolor='black', figsize=(8, 5))
plt.xlabel('Total number of occurrences of terms in corpus')
plt.ylabel('Frequency')
plt.title('Histogram of total occurrences of terms')
plt.show()

In [None]:
# Sort and select top 25 terms
# (named top_terms so we do not overwrite the `terms` vocabulary created earlier)
top_terms = column_totals.sort_values(ascending=False).head(25)

# Plot bar chart
plt.figure(figsize=(12, 6))
plt.bar(top_terms.index, top_terms.values)
plt.xlabel("Terms")
plt.ylabel("Total Frequency")
plt.title("Top 25 Most Common Terms in DTM")
plt.xticks(rotation=45, ha="right")  # Rotate labels for better readability
plt.show()

--END OF FILE--