![SGSSS Logo](../img/SGSSS_Stacked.png)

# Text Analysis

## Introduction

Computational methods are transforming research practice across the disciplines. For social scientists these methods offer a number of valuable opportunities, including creating new datasets from digital sources; unearthing new insights and avenues for research from existing data sources; and improving the accuracy and efficiency of fundamental research activities.

In this lesson we introduce and apply the foundational preprocessing steps of text analysis for social science research.

### Aims

This lesson has two aims:
1. Demonstrate how to use Python to preprocess text data relating to charitable activities.
2. Cultivate your computational thinking skills through coding examples, in particular how to define and solve a data preprocessing problem using a computational method.

### Lesson details

* **Level**: Introductory
* **Time**: 40-60 minutes
* **Pre-requisites**: None
* **Audience**: Researchers and analysts from any disciplinary background
* **Learning outcomes**:
    1. Understand the key steps and concepts for getting social science data ready for text analysis.
    2. Be able to use Python for preprocessing text data.

## Guide to using this resource

This learning resource was built using <a href="https://jupyter.org/" target=_blank>Jupyter Notebook</a>, an open-source software application that allows you to mix code, results and narrative in a single document. As <a href="https://jupyter4edu.github.io/jupyter-edu-book/" target=_blank>Barba et al. (2019)</a> argue:
> In a world where every subject matter can have a data-supported treatment, where computational devices are omnipresent and pervasive, the union of natural language and computation creates compelling communication and learning opportunities.

If you are familiar with Jupyter notebooks then skip ahead to the main content (*How do we prepare social science data for text analysis?*). Otherwise, the following is a quick guide to navigating and interacting with the notebook.

### Interaction

**You only need to execute the code that is contained in sections which are marked by `In []`.**

To execute a cell, click or double-click the cell and press the `Run` button on the top toolbar (you can also use the keyboard shortcut Shift + Enter).

Try it for yourself:

In [None]:
print("Enter your name and press enter:")
name = input()
print("\r")
print("Hello {}, enjoy learning more about Python and text analysis!".format(name))

### Learn more

Jupyter notebooks provide rich, flexible features for conducting and documenting your data analysis workflow. To learn more about additional notebook features, we recommend working through some of the <a href="https://github.com/darribas/gds19/blob/master/content/labs/lab_00.ipynb" target=_blank>materials</a> provided by Dani Arribas-Bel at the University of Liverpool. 

## How do we prepare social science data for text analysis?

There are a number of common, initial steps before you can perform text analysis with social science data. Grimmer et al. (2022) suggest the following workflow:
1. Choose unit of analysis
2. Tokenise
3. Reduce complexity:
   * Convert to lowercase
   * Remove punctuation
   * Remove stop words
   * Create equivalence classes (lemmatisation / stemming)
   * Filter by frequency
4. Construct document-feature matrix (W is an N × J matrix, where Wij = count of type j in document i)
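
The workflow above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code used later in the lesson: the two toy documents, the tiny stop word list and the `build_dfm` helper are all assumptions made for this example, and step 3 is reduced to lowercasing, punctuation removal and stop word removal.

```python
import re
from collections import Counter

# Hypothetical toy corpus: N = 2 documents
docs = [
    "The charity funds schools, and the schools train teachers.",
    "Teachers and nurses receive training from the charity.",
]
toy_stop_words = {"the", "and", "from"}  # illustrative list only

def build_dfm(documents):
    """Return (vocabulary, W) where W[i][j] counts type j in document i."""
    counts = []
    for doc in documents:
        # Tokenise, lowercase and drop punctuation in one step
        tokens = re.findall(r"[a-z]+", doc.lower())
        # Remove stop words
        tokens = [t for t in tokens if t not in toy_stop_words]
        counts.append(Counter(tokens))
    vocabulary = sorted(set().union(*counts))                # the J unique types
    W = [[c[term] for term in vocabulary] for c in counts]   # N x J matrix of counts
    return vocabulary, W

vocabulary, W = build_dfm(docs)
print(vocabulary)
for row in W:
    print(row)
```

Each row of `W` is one document and each column is one type, which is the structure the rest of this lesson builds towards.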

## Preliminaries

First we need to ensure Python has the functionality it needs for text analysis. As you will see, it needs quite a bit of extra functionality, so this may take some time to install / import depending on your machine.

In [None]:
# Install additional packages - only run once per machine
!pip install autocorrect

Packages for general data and file management:

In [None]:
import pandas as pd
import numpy as np
import json
import os
import re

Packages for processing text data:

In [None]:
import nltk                       # get nltk 
from nltk import word_tokenize    # and some of its key functions
from nltk import sent_tokenize
from nltk import FreqDist

from autocorrect import Speller   # things we need for spell checking
check = Speller(lang='en')

nltk.download('stopwords') # additional words or dictionaries we can use to check our documents against
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('wordnet')
nltk.download('webtext')
nltk.download('words')

from nltk.corpus import words     # list of valid English words
english_words = set(words.words())

from nltk.corpus import stopwords # list of common words that are not substantively informative e.g., "the"
stop_words = set(stopwords.words('english'))

from nltk.corpus import wordnet                    # functions we need for lemmatising
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer() 

from nltk.stem.porter import PorterStemmer         # function we need for stemming
porter = PorterStemmer()

from sklearn.feature_extraction.text import CountVectorizer # function we need for converting text to numeric
vectorizer = CountVectorizer()

from sklearn.feature_extraction.text import TfidfVectorizer # function we need for converting text to numeric
tfidf_vectorizer = TfidfVectorizer()

from collections import Counter

print("Successfully imported necessary modules")    # The print statement is just a bit of encouragement!

How do we know what modules we need for text analysis? Thankfully it is an established method, therefore others have figured this out for us:
* https://github.com/UKDataServiceOpen/text-mining/blob/407d16015ba270b4e39462c20de9b370c4e78563/code/1-Processing.ipynb
* https://github.com/UKDataServiceOpen/text-mining/blob/407d16015ba270b4e39462c20de9b370c4e78563/code/2-Extraction.ipynb

Modules are additional techniques or functions that are not present when you launch Python. Some do not even come with Python when you download it and must be installed on your machine separately - think of using `ssc install <package>` in Stata, or `install.packages(<package>)` in R. For now just understand that many useful modules need to be imported every time you start a new Python session.

### Import data

A second important preliminary step is to import the text data you will be using.

In [None]:
infile = "https://raw.githubusercontent.com/SGSSSonline/text-analysis-summer-school-2025/refs/heads/main/data/acnc-overseas-activities-2022.csv" # define file to be imported

data = pd.read_csv(infile, encoding="ISO-8859-1")

In [None]:
data.sample(10)

In [None]:
data["activity_desc"].sample(10)

###  Choose unit of analysis

Choosing the unit of analysis is a fundamental task in social science research generally, and it is just as important for text analysis. In many cases the unit of analysis is the **document**; that is, we are interested in measuring relevant, salient features of a document (e.g., author, style, sentiment, topics) and comparing these across other documents. However we can also select other, often smaller units of analysis such as paragraphs or sentences - then we can compare *within* and *between* documents e.g., how do political speeches develop from beginning to end, and across different speakers?

In our analysis the unit of analysis is the **document**: each row in the raw data represents a single charity's description of its overseas activities. This description is what serves as the document.

### Pilot

Before unleashing this workflow on a corpus, let's apply it to a single document so we can get a sense of what happens at each step. Below we select the text in the `activity_desc` column for the 501st row in the dataset.

In [None]:
sample_text = data.iloc[500, data.columns.get_loc("activity_desc")]
sample_text

**TASK**: Read through the above activity description and note any issues with the text e.g., misspellings, odd words, improper punctuation.

#### Tokenise

The next major step is to split the text into subunits of analysis. The most common subunit of interest is the word: each occurrence of a word in the text is a *token*, and each distinct word is a *type*.

In [None]:
sample_words = word_tokenize(sample_text)
print(sample_words)  # comma-separated list of tokens in the text
print(len(sample_words)) # number of tokens in text
print(len(set(sample_words))) # number of types in text

We can see that tokenising is an important but not infallible step in preprocessing text data. Tokenisers generally work by splitting text into separate components. How do these approaches know when one component (e.g., word) begins and another ends? The simplest tokenisers use whitespace as a delimiter / separator; `word_tokenize` additionally applies rules that separate punctuation from the words it is attached to. This works very well but not perfectly: as you can see, punctuation like commas, periods and brackets are identified as tokens in the text. We are generally not interested in punctuation for analysis, so we need a later step to remove these instances.
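
The difference between splitting on whitespace alone and also separating punctuation can be seen in a small sketch. The sentence and the regular expression below are assumptions made for illustration; the pattern is much simpler than the rules NLTK applies.

```python
import re

text = "We support schools (primary and secondary), clinics, and shelters."

# Splitting on whitespace alone leaves punctuation attached to words
whitespace_tokens = text.split()

# A simple rule-based alternative: runs of word characters, or single punctuation marks
rule_tokens = re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokens)
print(len(whitespace_tokens))
print(rule_tokens)
print(len(rule_tokens))
```

With whitespace splitting, "(primary" and "secondary)," would be treated as different words from "primary" and "secondary", which is why punctuation-aware tokenising followed by punctuation removal is the more common route.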

We can also tokenise the text into sentences if these were our linguistic subunits of interest.

In [None]:
sample_sent = sent_tokenize(sample_text)
print(sample_sent) # comma-separated list of sentences in the text
print(len(sample_sent)) # number of sentences in text

#### Reduce complexity

##### Convert to lowercase

Unless capitalisation is of analytical interest, we generally convert all tokens to lowercase. In essence we want to avoid situations where we treat the same words as if they were different e.g., are "The" and "the" different words? "Charity" and "CHARITY"?

In [None]:
sample_lower = [word.lower() for word in sample_words]
print(sample_lower)
print(len(sample_lower))

##### Spell check

This step is often not necessary and can be computationally intensive. If you do need it, the `check` object created in the Preliminaries (from the `autocorrect` package) can be applied to each token.
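
As one illustration of the idea, the sketch below corrects tokens against a small word list using Python's built-in `difflib` module. This is a stand-in technique rather than the `autocorrect` package imported earlier; the `toy_vocabulary` list, the `correct` helper and the 0.8 similarity cutoff are assumptions chosen for this example.

```python
from difflib import get_close_matches

toy_vocabulary = ["charity", "community", "education", "health"]  # hypothetical word list

def correct(token, vocabulary, cutoff=0.8):
    """Return the closest vocabulary word, or the token unchanged if none is close enough."""
    if token in vocabulary:
        return token
    matches = get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else token

tokens = ["charty", "helth", "education", "xyz"]
corrected = [correct(t, toy_vocabulary) for t in tokens]
print(corrected)
```

With `autocorrect`, the equivalent step would be along the lines of `[check(word) for word in sample_lower]`, which can be slow on long texts.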

##### Remove punctuation

In [None]:
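import string
table_punctuation = str.maketrans("", "", string.punctuation)  # translation table that deletes punctuation characters
sample_no_punct = [w.translate(table_punctuation) for w in sample_lower]
print(sample_no_punct)
print(len(sample_no_punct))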

At this point there are still some issues:
* There are empty strings left as tokens e.g., where there used to be punctuation
* There are tokens consisting of a single character e.g., a number or letter that was separated from its apostrophe

In [None]:
sample_no_space = list(filter(None, sample_no_punct)) # strip out empty strings
print(sample_no_space)
print(len(sample_no_space))

The text looks much cleaner and we have reduced the number of tokens as well (good for computational efficiency and substantive clarity).
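
The second issue, single-character tokens, is not addressed by the filter above. A minimal sketch of one way to handle it follows; the token list here is made up for illustration, and the one-character threshold is a judgement call rather than a fixed rule.

```python
import string

# Hypothetical tokens after lowercasing (not taken from the dataset)
tokens = ["the", "charity", "'s", "work", ",", "in", "3", "countries", "."]

# As in the punctuation step: delete every punctuation character from each token
table = str.maketrans("", "", string.punctuation)
no_punct = [t.translate(table) for t in tokens]

no_empty = [t for t in no_punct if t]            # drop empty strings left by pure punctuation
no_single = [t for t in no_empty if len(t) > 1]  # drop one-character leftovers such as "s" or "3"

print(no_punct)
print(no_single)
```

Note that a length filter also removes genuine one-letter words such as "a" and "i", although these are usually stop words anyway.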

##### Remove stopwords

In [None]:
sample_no_stop_words = []

for word in sample_no_space:
    if word not in stop_words:
        sample_no_stop_words.append(word)

In [None]:
print(sample_no_stop_words)
print(len(sample_no_stop_words))

##### Create equivalence classes: Stemming

In [None]:
sample_stemmed = [porter.stem(word) for word in sample_no_stop_words]
print(sample_stemmed)
print(len(sample_stemmed)) # number of tokens
print(len(set(sample_stemmed))) # number of terms

Notice what has happened to words like "comply", "external" and "charity". They are now expressed in their common root form and thus are no longer words that we would find in an English dictionary. These are examples of terms rather than types.
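
To see how stemming collapses several types into one term, here is a deliberately crude suffix-stripping sketch. It is not the Porter algorithm used above, which applies many more rules; the `toy_stem` function and its handful of suffix rules are assumptions made purely for illustration.

```python
def toy_stem(word):
    """Strip a few common English suffixes (illustrative rules only)."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "i"      # charities -> chariti
    if word.endswith("y") and len(word) > 3:
        return word[:-1] + "i"      # charity -> chariti
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters would remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

types = ["charity", "charities", "funding", "funded", "funds"]
terms = [toy_stem(w) for w in types]

print(terms)
print(len(set(types)), "types ->", len(set(terms)), "terms")
```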

**QUESTION:** What is the value of transforming words to their root form?

##### Create equivalence classes: Lemmatisation

This is an alternative to stemming that maps each word to its dictionary form (lemma), so the results remain readable words e.g., "car" and "cars" both map to "car".

In [None]:
lemmatizer.lemmatize("car")

In [None]:
lemmatizer.lemmatize("cars")

In [None]:
sample_lemmed = [lemmatizer.lemmatize(word) for word in sample_no_stop_words]
print(sample_lemmed)
print(len(sample_lemmed)) # number of tokens
print(len(set(sample_lemmed))) # number of terms

**QUESTION:** What is the difference between stemming and lemmatisation in the example above, both in terms of the number of terms / tokens and the readability of the words?

##### Filter by frequency

As a final step we may want to remove very common or very rare words from the corpus. Very common words add little to substantive interpretation (e.g., perhaps all charities mention their beneficiaries in their activity descriptions), while very rare words may appear only once across the entire corpus (e.g., misspellings or acronyms).

We can view the frequency table of the terms in our corpus as follows:

In [None]:
freq_table = Counter(sample_lemmed)
print(freq_table)

As we are only working with one document at the moment we won't remove any words just yet. However there are better approaches for handling common / rare terms in a corpus that we shall see shortly (e.g., weighting). For completeness' sake, here is how you could remove words based on their frequencies:

In [None]:
max_count = max(freq_table.values()) # count of the most frequent word(s)
min_count = min(freq_table.values()) # count of the least frequent word(s)

sample_filtered = [word for word in sample_lemmed if freq_table[word] not in (max_count, min_count)] # create new list of terms

print("Original tokens:", sample_lemmed)
print("Filtered tokens:", sample_filtered)
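
Across a full corpus, frequency filtering is usually based on how many documents a term appears in rather than on raw counts within one document. The sketch below shows the idea with three made-up documents; the `min_docs` and `max_share` thresholds are hypothetical values chosen for this example.

```python
from collections import Counter

# Hypothetical corpus of already-cleaned, tokenised documents
corpus = [
    ["beneficiary", "school", "teacher"],
    ["beneficiary", "clinic", "nurse", "school"],
    ["beneficiary", "shelter", "schol"],   # "schol" is a deliberate misspelling
]

# Document frequency: number of documents each term appears in
doc_freq = Counter(term for doc in corpus for term in set(doc))

min_docs = 2       # drop terms found in fewer than 2 documents
max_share = 0.9    # drop terms found in more than 90% of documents

keep = {
    term for term, df in doc_freq.items()
    if df >= min_docs and df / len(corpus) <= max_share
}
filtered = [[t for t in doc if t in keep] for doc in corpus]

print(doc_freq)
print(filtered)
```

scikit-learn's `CountVectorizer` and `TfidfVectorizer` expose the same idea through their `min_df` and `max_df` parameters.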

#### Create Document-Term Matrix

If you are happy with the preprocessing steps above, both in terms of effect and order, we can convert the text to a numeric format suitable for analysis. This format is known as a Document-Term Matrix (DTM) or Document-Feature Matrix (DFM) - the latter is a more general format than the former. Both simply represent a document or corpus in a tabular format, where every row represents a document and every column represents a term or feature relating to the document. If you are a quantitative researcher then this format will be familiar to you e.g., the rows are units of analysis and the columns are variables representing numeric characteristics of the units of analysis.

In order to create the DTM we need to convert the list of terms into a single string of terms as follows:

In [None]:
sample_text = " ".join(sample_lemmed)
sample_text

We take the single string of terms and represent them in the "bag of words" format - there are a couple of ways of doing this.

In [None]:
bow = FreqDist(sample_lemmed)
#print(dict(bow))
dtm = pd.DataFrame([bow])
dtm

In [None]:
print(dtm.shape)

Or:

In [None]:
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform([sample_text])
dtm = pd.DataFrame(bow.toarray(), columns=vectorizer.get_feature_names_out())

# Get the vocabulary (unique words) and the bag of words matrix
#vocabulary = vectorizer.get_feature_names_out()
#bow_matrix = bow.toarray()

In [None]:
dtm

In [None]:
dtm.shape

**QUESTION:** What are the differences between the two approaches (look at the data frame content and shape results)?

It may be difficult to see in the notebook but most DTMs based on real social science text data are *sparse*: that is, there are lots of terms with zero counts for many documents in the corpus. This is a function of the nature of language (authors have lots of words to choose from when creating a given document) and of whether any reweighting of terms has been applied.
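
Sparsity can be quantified as the share of cells in the matrix that are zero. The sketch below uses a small made-up matrix rather than the lesson's data; the `toy_dtm` values are assumptions for illustration only.

```python
# Hypothetical 3 x 6 DTM: rows are documents, columns are terms
toy_dtm = [
    [2, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 3],
    [0, 0, 1, 0, 0, 0],
]

n_cells = sum(len(row) for row in toy_dtm)
n_zeros = sum(value == 0 for row in toy_dtm for value in row)
sparsity = n_zeros / n_cells

print(f"{n_zeros} of {n_cells} cells are zero (sparsity = {sparsity:.2f})")
```

For a pandas DataFrame such as `dtm`, the same proportion can be obtained with `(dtm == 0).to_numpy().mean()`.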

#### Pilot end

Phew, that is a lot of preprocessing and quite a bit of code to get your head around. The good news is these tasks are common to almost all text analysis projects, so once you get your head around them you will be set for future work. And we don't have to run through all of these steps in such a manual way either: we can rationalise our code and make use of functions from packages like `sklearn`.

We could still perform some additional work to improve the substantive relevance of the text:
* Remove numbers
* Remove single-character tokens
* Remove subject-specific stop words (e.g., "charity", "charities", "australia", "year", "trust", "fund")

## Creating the full DTM

Let's create the DTM we will use for analysis. Instead of sampling one document we will preprocess all of them and make some simple adjustments to improve the text cleaning (e.g., removing numbers and common stop words). To speed up this process, let's create a function (block of code) that handles all of these steps in one go.

### Define function

In [None]:
def preprocess_text(text):

    # Tokenize the text and convert to lowercase
    words = nltk.word_tokenize(text)
    lower_words = [word.lower() for word in words]
    #print(lower_words)

    # Remove punctuation and numbers
    a_words = [word for word in lower_words if word.isalpha()]
    #print("Alpha words: ",a_words)

    # Lemmatise words
    lemmed_words = [lemmatizer.lemmatize(word) for word in a_words]
    #print("Lemmed words: ",lemmed_words)
    
    # Remove non-English words
    e_words = [word for word in lemmed_words if word in english_words]
    #print("English words: ", e_words)

    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    new_stop_words = ["registered", "registration", "company", "number", "australia", 
                      "australian", "report", "charity", "charities", "charitable", "year", 
                      "end", "statement", "statements", "trustee", "trustees", "trust", "overseas",
                     "international", "support", "fund", "provide", "provision", "activity", "activities",
                     "providing", "provided", "program", "programme", "project"]
    stop_words.update(new_stop_words)
    s_words = [word for word in e_words if word not in stop_words]
    #print("Stop words: ", s_words)

    # Stem words (optional alternative to lemmatising; left disabled)
    #stemmed_words = [porter.stem(word) for word in s_words]

    # Remove words with fewer than three characters
    clean_words = [word for word in s_words if len(str(word)) > 2]

    return ' '.join(clean_words)

**QUESTION:** What are the consequences of removing non-English words from the corpus?

### Clean text using function

In [None]:
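# Ensure text column is valid: drop missing descriptions first, then convert to string
# (converting first would turn missing values into the string "nan", which dropna cannot detect)
data = data.dropna(subset=["activity_desc"]).copy()
data["activity_desc"] = data["activity_desc"].astype(str)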

In [None]:
data["clean_text"] = data["activity_desc"].apply(preprocess_text)
data[["abn", "activity_desc", "clean_text"]].sample(5)

### Create list of documents

We want to loop over every row in the dataset and extract the charity unique id and the cleaned activity description.

In [None]:
documents = [(row["abn"], row["clean_text"]) for _, row in data.iterrows()]
documents[0:5] # view first five elements in list of documents

### Extract just the cleaned text for converting to DTM

In [None]:
text_data = [text for _, text in documents]
text_data[0:5]

### Create a Document-Term Matrix using a Count or TF-IDF vectoriser

In [None]:
#vectorizer = CountVectorizer()#dtm = vectorizer.fit_transform(text_data)

In [None]:
vectorizer = TfidfVectorizer()bow = vectorizer.fit_transform(text_data)

In [None]:
# Convert DTM into a Pandas DataFrame
dtm = pd.DataFrame(bow.toarray(), columns=vectorizer.get_feature_names_out(), index=[doc_id for doc_id, _ in documents])

In [None]:
dtm

### Save DTM as .csv file

In [None]:
# Create a temporary downloads folder (no error is raised if it already exists)
os.makedirs("./tmp", exist_ok=True)

In [None]:
dtm.to_csv("./tmp/acnc-2022-activities-dtm.csv")

## What have we learned?

Let's recap what key skills and techniques we've learned:
* **How to import modules**. You will usually need to import modules into Python to support your work. Python does come with some methods and functions that are ready to use straight away, but for text analysis tasks you'll almost certainly need to import some additional modules.
* **How to preprocess text using a standard workflow**. There are a number of preprocessing steps common to almost all text analysis projects but you still retain some control over which steps and in which order you apply them.
* **How to convert text to a number format**. The DTM / DFM is the workhorse of text analysis as it offers an efficient format for performing calculations on salient terms or features of text.
* **How to do all of the above in an efficient, clear and effective manner**.

## Conclusion

There are a number of important steps in getting text data ready for analysis. However you need to think carefully about how sensitive your findings are to variation in the preprocessing steps or order. We will see why we go to the effort of creating a DTM / DFM in the next practical.

## Exercise

Create a DTM / DFM using the other file in the data folder (*acnc-overseas-activities-2021.csv*).

In [None]:
# INSERT CODE HERE

In [None]:
# INSERT CODE HERE

The solution is provided at the end of this course.

## Appendix A

### Exercise Solution

#### Creating a DTM for 2021 data on overseas charitable activities

In [None]:
# INSERT CODE HERE
