# Text Analysis

## Introduction

Computational methods are transforming research practice across the disciplines. For social scientists these methods offer a number of valuable opportunities, including creating new datasets from digital sources; unearthing new insights and avenues for research from existing data sources; and improving the accuracy and efficiency of fundamental research activities.

In this lesson we introduce and apply a range of supervised text analysis techniques to social science data.

### Aims

This lesson has two aims:
1. Demonstrate how to use Python to analyse text data relating to charitable activities.
2. Cultivate your computational thinking skills through coding examples, in particular how to define and solve a data preprocessing problem using a computational method.

### Lesson details

* **Level**: Introductory
* **Time**: 40-60 minutes
* **Pre-requisites**: None
* **Audience**: Researchers and analysts from any disciplinary background
* **Learning outcomes**:
    1. Understand and apply common supervised text analysis techniques to social science data.
    2. Be able to use Python for performing text analysis.

## Guide to using this resource

This learning resource was built using <a href="https://jupyter.org/" target=_blank>Jupyter Notebook</a>, an open-source software application that allows you to mix code, results and narrative in a single document. As <a href="https://jupyter4edu.github.io/jupyter-edu-book/" target=_blank>Barba et al. (2019)</a> espouse:
> In a world where every subject matter can have a data-supported treatment, where computational devices are omnipresent and pervasive, the union of natural language and computation creates compelling communication and learning opportunities.

If you are familiar with Jupyter notebooks then skip ahead to the main content (*How do we analyse social science text data?*). Otherwise, the following is a quick guide to navigating and interacting with the notebook.

### Interaction

**You only need to execute the code that is contained in sections which are marked by `In []`.**

To execute a cell, click or double-click the cell and press the `Run` button on the top toolbar (you can also use the keyboard shortcut Shift + Enter).

Try it for yourself:

In [None]:
print("Enter your name and press enter:")
name = input()
print("\r")
print("Hello {}, enjoy learning more about Python and text analysis!".format(name))

### Learn more

Jupyter notebooks provide rich, flexible features for conducting and documenting your data analysis workflow. To learn more about additional notebook features, we recommend working through some of the <a href="https://github.com/darribas/gds19/blob/master/content/labs/lab_00.ipynb" target=_blank>materials</a> provided by Dani Arribas-Bel at the University of Liverpool. 

## How do we analyse social science text data?

There is a wide array of text analysis techniques that we could apply in our research:
* **Descriptive inference:** how to characterise text; vector space model, bag of words, (dis)similarity measures, diversity, complexity, style, bursts.
* **Supervised techniques:** dictionaries, sentiment analysis, categorising.
* **Unsupervised techniques:** cluster analysis, Principal Components Analysis (PCA), topic modelling, embeddings. (Spirling, 2022)

That is to say nothing of using Generative AI or Large Language Models (LLMs) to conduct these analyses on our behalf.

In this lesson we focus on two common supervised text analysis techniques:
* Keyword searching / KWIC
* Sentiment analysis
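
To give a flavour of keyword-in-context (KWIC) searching, here is a minimal sketch using only the Python standard library. The function name `kwic`, the `window` parameter and the sample sentence are all invented for illustration; dedicated tools (for example NLTK's concordance features) offer richer options.

```python
import re

def kwic(text, keyword, window=3):
    """Return (left context, keyword, right context) for each match."""
    # Very simple tokenisation: lowercase runs of letters and apostrophes
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    hits = []
    for i, token in enumerate(tokens):
        if token == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, token, right))
    return hits

# Made-up sentence, for illustration only
sample = "We provide education grants to schools, and education resources to rural communities."
for left, key, right in kwic(sample, "education", window=2):
    print(f"{left:>25} | {key} | {right}")
```

Each row shows one occurrence of the keyword with a few words either side, which makes it easy to see how a term is actually being used.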

## Preliminaries

First we need to ensure Python has the functionality it needs for text analysis. As you will see, it needs quite a bit of extra functionality, so this may take some time to install / import depending on your machine.

In [None]:
# Install additional packages - only run once per machine
!pip install textblob
!pip install seaborn

Packages for general data and file management:

In [None]:
import pandas as pd
from pandas.plotting import scatter_matrix
import numpy as np
import json
import os
import re

Packages for processing text data:

In [None]:
import nltk                       # get nltk 
from nltk import word_tokenize    # and some of its key functions
from nltk import sent_tokenize
from nltk import FreqDist

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('wordnet')
nltk.download('webtext')
nltk.download('words')

from nltk.corpus import words     # list of valid words
english_words = set(words.words())

from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

from nltk.corpus import wordnet                    # Functions we need for lemmatising
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer() 

from nltk.stem.porter import PorterStemmer         # Functions we need for stemming
porter = PorterStemmer()

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()

from collections import Counter

print("Successfully imported necessary modules")    # The print statement is just a bit of encouragement!

Packages for analysing text data:

In [None]:
# for sentiment analysis
from textblob import TextBlob

# for data visualisation
import matplotlib.pyplot as plt
import seaborn as sns

### Import data

A second important preliminary step is to import the text data you will be using.

In [None]:
infile = "https://raw.githubusercontent.com/SGSSSonline/text-analysis-summer-school-2025/refs/heads/main/data/acnc-overseas-activities-2022.csv" # define file to be imported

data = pd.read_csv(infile, encoding="ISO-8859-1")

In [None]:
data.sample(10)

In [None]:
data["activity_desc"].sample(10)

### Create Document-Term Matrix

You have likely created and saved this in a previous lesson but let's start afresh just in case.

In [None]:
def preprocess_text(text):

    # Tokenize the text and convert to lowercase
    words = nltk.word_tokenize(text)
    lower_words = [word.lower() for word in words]
    #print(lower_words)

    # Remove punctuation and numbers
    a_words = [word for word in lower_words if word.isalpha()]
    #print("Alpha words: ",a_words)

    # Lemmatise words
    lemmed_words = [lemmatizer.lemmatize(word) for word in a_words]
    #print("Lemmed words: ",lemmed_words)
    
    # Remove non-English words
    e_words = [word for word in lemmed_words if word in english_words]
    #print("English words: ", e_words)

    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    new_stop_words = ["registered", "registration", "company", "number", "australia", 
                      "australian", "report", "charity", "charities", "charitable", "year", 
                      "end", "statement", "statements", "trustee", "trustees", "trust", "overseas",
                     "international", "support", "fund", "provide", "provision", "activity", "activities",
                     "providing", "provided", "program", "programme", "project"]
    stop_words.update(new_stop_words)
    s_words = [word for word in e_words if word not in stop_words]
    #print("Stop words: ", s_words)

    # Stem words (optional; uncomment to apply stemming)
    #stemmed_words = [porter.stem(word) for word in s_words]

    # Remove words with fewer than three characters
    clean_words = [word for word in s_words if len(str(word)) > 2]

    return ' '.join(clean_words)
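
To see what the main cleaning steps do to a single sentence, here is a simplified sketch that runs without NLTK. It is not the function above: `toy_preprocess`, the tiny stopword list and the sample sentence are assumptions made for illustration, and the lemmatising and English-word filtering steps are left out.

```python
import re

# Tiny illustrative stopword list (an assumption; NLTK's English list is far longer)
TOY_STOP_WORDS = {"in", "we", "and", "a", "the", "to", "of"}

def toy_preprocess(text):
    """Apply simplified versions of the cleaning steps, in the same order."""
    # Split into word and punctuation tokens, then lowercase
    tokens = [t.lower() for t in re.findall(r"\w+|[^\w\s]", text)]
    # Keep purely alphabetic tokens (drops numbers and punctuation)
    alpha = [t for t in tokens if t.isalpha()]
    # Drop stopwords, then tokens shorter than three characters
    kept = [t for t in alpha if t not in TOY_STOP_WORDS and len(t) > 2]
    return " ".join(kept)

# Made-up sentence, for illustration only
sample = "In 2022 we funded 3 schools and a clinic in Fiji!"
print(toy_preprocess(sample))
```

Uncommenting the `print` statements inside `preprocess_text` gives the same kind of step-by-step view for the full pipeline.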

### Clean text using function

In [None]:
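# Ensure text column is valid: drop missing values first, then cast to string
# (casting first can turn missing values into the literal string "nan")
data = data.dropna(subset=["activity_desc"])
data["activity_desc"] = data["activity_desc"].astype(str)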

In [None]:
data["clean_text"] = data["activity_desc"].apply(preprocess_text)
data[["abn", "activity_desc", "clean_text"]].head(5)

### Create list of documents

We want to loop over every row in the dataset and extract the charity unique id and the cleaned activity description.

In [None]:
documents = [(row["abn"], row["clean_text"]) for _, row in data.iterrows()]
documents[0:5] # view first five elements in list of documents

### Extract just the cleaned text for converting to DTM

In [None]:
text_data = [text for _, text in documents]
text_data[0:5]

### Create a Document-Term Matrix using a Count or TF-IDF vectoriser

In [None]:
vectorizer = CountVectorizer()bow = vectorizer.fit_transform(text_data)
terms = vectorizer.get_feature_names_out() # extract unique terms in corpus (vocabulary)

In [None]:
#vectorizer = TfidfVectorizer()
#bow = vectorizer.fit_transform(text_data)
#terms = vectorizer.get_feature_names_out() # extract unique terms in corpus (vocabulary)

In [None]:
# Convert DTM into a Pandas DataFrame
dtm = pd.DataFrame(bow.toarray(), columns=vectorizer.get_feature_names_out(), index=[doc_id for doc_id, _ in documents])
document_ids = dtm.index.tolist() # create list of document ids

In [None]:
dtm

In [None]:
print(terms[0:500]) # view first 500 terms in vocabulary
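
To make the idea of a document-term matrix concrete, here is a minimal sketch that builds one by hand with the standard library. The two toy documents and their ids are invented, and splitting on whitespace is a simplification of the tokenisation a vectoriser such as `CountVectorizer` performs.

```python
from collections import Counter

# Made-up cleaned documents keyed by a hypothetical document id
toy_docs = {
    "doc1": "education school education",
    "doc2": "health clinic school",
}

# Vocabulary: the sorted unique terms across all documents
vocab = sorted({term for text in toy_docs.values() for term in text.split()})

# One row of counts per document, one column per vocabulary term
toy_dtm = {}
for doc_id, text in toy_docs.items():
    counts = Counter(text.split())
    toy_dtm[doc_id] = [counts[term] for term in vocab]

print(vocab)
for doc_id, row in toy_dtm.items():
    print(doc_id, row)
```

Each row is a document, each column is a term, and each cell records how often that term appears in that document.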

## Supervised techniques

A supervised text analysis technique (or supervised learning more generally) is one that seeks to understand the relationship between a set of features (e.g., document word counts) and an outcome (e.g., what category or class a document belongs to). Logistic regression is a type of supervised learning technique as it estimates the relationship between a set of covariates and the probability of an observation being in a given category or not. This and similar techniques are termed **supervised** because the category or class is already known: we know whether a document was written by a small or large charity in our example. Therefore we are "supervising" the classifier / algorithm as it seeks to understand the relationship between the features and the outcome.

Often the outcome is known because it is a feature in the data e.g., in addition to charities' activity descriptions we have their organisation size and other characteristics. Other times we need to label the data so that the outcome is known: for example, we may manually review a sample of documents and categorise them as either "Good practice" or "Not good practice" (or whatever classification scheme we are using).
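
The "learn from labelled examples, then predict" pattern can be sketched in a few lines. The example below uses a small multinomial Naive Bayes classifier rather than the logistic regression mentioned above, purely because it is short to write by hand; the function names, training texts and "small" / "large" labels are all invented for illustration.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labelled_docs):
    """Count documents and words per class from (text, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labelled_docs:
        class_counts[label] += 1
        for word in text.split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab

def predict(text, model):
    """Return the class with the highest log-probability for `text`."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Invented labelled examples: the label is the "known outcome"
training = [
    ("volunteer community garden local", "small"),
    ("local volunteer school fete", "small"),
    ("global health program partnership", "large"),
    ("partnership global emergency relief", "large"),
]
model = train_naive_bayes(training)
print(predict("local community volunteer", model))
print(predict("global emergency partnership", model))
```

The labels in `training` play the role of the known outcome: the classifier only learns which words go with which class because we supplied that information.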

### Sentiment analysis

Sentiment analysis is a well-established and commonly used supervised learning technique for text data. The core idea is to use a set of pre-defined words with specific connotations to classify our documents automatically, quickly and accurately (Spirling, 2022). This set of pre-defined words, or **dictionary**, serves as our labelled data: we know whether the word "amazing" is positive or not, we just need to access this information from a suitable dictionary. In your data the outcome (whether a piece of text has a positive or negative sentiment or tone) is unknown, but the sentiment of individual words is known in general, so we can look up the words in your data and see their sentiment scores. It is then a simple calculation to see how positive or negative the words in a piece of text / document are overall.
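
The dictionary look-up idea can be sketched with a toy lexicon. The words and scores below are invented, and averaging the matched scores is a deliberate simplification: real sentiment dictionaries are far larger and libraries typically apply extra rules on top of the look-up.

```python
# Toy sentiment dictionary: words and scores are invented for illustration
TOY_LEXICON = {"amazing": 1.0, "good": 0.7, "powerful": 0.3, "poor": -0.4, "terrible": -1.0}

def toy_sentiment(text):
    """Average the scores of words found in the dictionary (0.0 if none)."""
    words = text.lower().split()
    scores = [TOY_LEXICON[w] for w in words if w in TOY_LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

print(toy_sentiment("an amazing and powerful song"))      # two positive matches
print(toy_sentiment("a terrible venue but good sound"))   # one negative, one positive
print(toy_sentiment("the band played two sets"))          # no dictionary words
```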

Let's take a simple piece of text to demonstrate the core idea:

In [None]:
text = """
The Pogues’ Body of an American is a raw, energetic anthem. The driving rhythm, powerful lyrics, and Shane MacGowan’s 
passionate vocals make it an unforgettable and emotional listening experience.
"""

In [None]:
blob = TextBlob(text)
sentiment_score = blob.sentiment.polarity
sentiment_score

Not a lot of code for once, that's nice. We pass the text to `TextBlob`, which looks up the dictionary of words and their sentiment scores. This returns an object, which we have called `blob` (call it something else if you like), whose attributes we can then access. One of those is the overall sentiment score, which in this case is *0.19*. Scores can range from -1 (very negative) to +1 (very positive). Scores near zero represent text that is neither positive nor negative overall.

**QUESTION:** Using the `sentiment_score` how would you describe the sentiment of the song review?

We are generally interested in the overall sentiment of the piece of text but we can also access the sentiment scores for each word:

In [None]:
word_sentiments = {word: TextBlob(word).sentiment.polarity for word in blob.words}
word_sentiments

**QUESTION:** What do you notice about how the overall sentiment score was constructed from individual word scores? Do you agree with the sentiment scores for individual words?

**TASK:** Take one of the short reviews for the movie "Io Capitano" and produce a sentiment score (overall and for each word). Do you agree with the results and do they fit the original text of the review? https://www.rottentomatoes.com/m/io_capitano

In [None]:
# INSERT CODE

OK, let's apply sentiment analysis to our charity activity data. First let's create a function that can be applied to the whole dataset.

In [None]:
def get_sentiment(text):
    return TextBlob(text).sentiment.polarity

In [None]:
data["sentiment_score"] = data["clean_text"].apply(get_sentiment)

In [None]:
# view a subset of columns in the dataset
data[["abn", "activity_desc", "clean_text", "sentiment_score"]]

It can be a bit difficult to view the whole dataset in Python / Jupyter Notebook, so let's try a different format.

In [None]:
sentiments = [(row["abn"], row["activity_desc"], row["sentiment_score"]) for _, row in data.iterrows()]
sentiments[0:20] # view first 20

**QUESTION:** What do you think of the sentiment scores for the first 20 documents in the corpus? Do you agree with the scores? Is it substantively meaningful to conceptualise these activity descriptions as positive or negative?

The sentiment score is a numeric representation of document tone, so let's look at the distribution of these scores across the corpus.

In [None]:
data["sentiment_score"].describe()

In [None]:
data["sentiment_score"].hist(bins=20, edgecolor='black', figsize=(8, 5))
plt.xlabel('Sentiment Score')
plt.ylabel('Frequency')
plt.title('Histogram of document sentiment scores')
plt.show()
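
Alongside the histogram, it can help to bucket scores into broad categories and count them. The sketch below assumes an arbitrary cut-off of 0.05 either side of zero for "neutral" and uses invented scores in place of `data["sentiment_score"]`; neither comes from the lesson data.

```python
from collections import Counter

def label_polarity(score, threshold=0.05):
    """Bucket a polarity score; the 0.05 cut-off is an arbitrary assumption."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"

# Invented scores standing in for data["sentiment_score"]
toy_scores = [0.19, 0.0, -0.3, 0.02, 0.5, -0.04]
label_counts = Counter(label_polarity(s) for s in toy_scores)
print(label_counts)
```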

**QUESTION:** How would you characterise the sentiment / tone of charity activity descriptions?

Finally, let's disaggregate sentiment scores by charity size.

In [None]:
# Create a boxplot to visualize distribution of sentiment score by charity size 
plt.figure(figsize=(8, 5))
sns.boxplot(x="charitysize", y="sentiment_score", data=data)
plt.xlabel("Charity Size")
plt.ylabel("Sentiment Score")
plt.title("Distribution of sentiment score by charity size")
plt.show()

**QUESTION:** Are there meaningful differences in sentiment score by charity size?

In [None]:
plt.figure(figsize=(8, 5))
sns.kdeplot(data=data, x="sentiment_score", hue="charitysize", common_norm=False, fill=True, alpha=0.3)

# Set labels and title
plt.xlabel("Sentiment Score")
plt.ylabel("Density")
plt.title("Kernel Density Plot of Sentiment Scores by Charity Size")

# Show the plot
plt.show()
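
To put numbers alongside the plots, one option is to summarise scores within each group. The sketch below uses only the standard library and invented (size, score) pairs standing in for the `charitysize` and `sentiment_score` columns; the category names "Small" and "Large" are assumptions about how sizes are labelled.

```python
from collections import defaultdict
from statistics import mean, median

# Invented (size, score) pairs, for illustration only
toy_rows = [("Small", 0.10), ("Small", 0.30), ("Small", 0.20),
            ("Large", 0.00), ("Large", 0.40)]

# Collect the scores belonging to each size category
scores_by_size = defaultdict(list)
for size, score in toy_rows:
    scores_by_size[size].append(score)

# Report group size, mean and median for each category
for size, scores in scores_by_size.items():
    print(f"{size}: n={len(scores)}, mean={mean(scores):.2f}, median={median(scores):.2f}")
```

With the lesson's DataFrame, `data.groupby("charitysize")["sentiment_score"].describe()` should give a comparable per-group summary.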

## What have we learned?

Let's recap what key skills and techniques we've learned:
* **How to import modules**. You will usually need to import modules into Python to support your work. Python does come with some methods and functions that are ready to use straight away, but for text analysis tasks you'll almost certainly need to import some additional modules.
* **How to perform supervised text analyses**. There are a number of common and key analytical techniques that can yield substantive insight into key features of documents.
* **How to do all of the above in an efficient, clear and effective manner**.

## Conclusion

These are but a selection of the analytical techniques at your disposal; however, they are common and often key ones in text analysis projects.

## Exercise

Perform sentiment analysis using the other file in the data folder (*acnc-overseas-activities-2021.csv*).

In [None]:
# INSERT CODE HERE

In [None]:
# INSERT CODE HERE
