<img src="../Images/DSC_Logo.png" style="width: 400px;">

# Basics of Natural Language Processing (NLP)
# - Analyses in Qualitative Research

This notebook gives a first glimpse into how simple NLP procedures can support **qualitative and mixed-methods analyses**. We work with already preprocessed texts and look at:

1. A simple keyword-in-context search to support close reading.
2. Word frequencies and a word cloud.
3. A simple sentiment analysis.
4. A rule-based topic grouping that shares the basic idea of topic modeling.
5. A small network analysis.

Technically, these procedures are quantitative: texts are turned into numerical representations (for example counts, scores, or topic proportions) so that algorithms can detect patterns. We can treat these patterns as findings in their own right (e.g. overall sentiment trends), or mainly use them as navigation aids, for example to get an overview, compare groups, and identify interesting passages for close, interpretive reading.

The examples in this notebook are intentionally small so that the code and outputs remain readable. In real projects, the same ideas can be scaled up to much larger text collections. Likewise, this notebook can only provide short, illustrative snapshots of each method. Each approach can be configured, validated, and combined in far more systematic and exhaustive ways than we can show here.

## 1. Close Reading Based on Computational Hints

Similar to the project by [The Pudding](https://pudding.cool/2025/07/kids-books/), where the team manually read and coded children’s books, we now want to close-read text passages in which animals occur. Here, we use Python to query their annotations and, in our case, to automatically **extract passages around words**, specifically animal terms. This helps us navigate the stories and identify segments for close, qualitative reading.

We illustrate this with the little mole story that we already lightly preprocessed in Notebook 9. 

First, we define a function for a simple **KWIC (Key Word In Context) search**:


In [None]:
def kwic_terms(text, terms, window=60):
    """
    For each term in `terms`, print the term with some left and right context.
    """
    # We make a lowercased copy for searching (so "Mole" and "mole" both match)
    text_lower = text.lower()

    for term in terms:
        term_lower = term.lower()

        print(f"KWIC for term: '{term}'")

        # Start searching at the beginning of the text
        search_start = 0
        hit_number = 1

        while True:
            # Find the next position of the term
            idx = text_lower.find(term_lower, search_start)

            # If nothing is found, we are done with this term
            if idx == -1:
                break

            # Define how much context we want left and right of the term
            left_start = max(0, idx - window)
            right_end = min(len(text), idx + len(term) + window)

            # Cut out left context, the term itself, and right context
            left = text[left_start:idx]
            middle = text[idx:idx + len(term)]
            right = text[idx + len(term):right_end]

            # Print one KWIC line
            print(f"{hit_number:02d}: ...{left}>>{middle}<<{right}...")

            # Move the search start to after this hit
            hit_number += 1
            search_start = idx + len(term)

Next, we load the preprocessed story, define a list of animals that we want to look for in the story and call the function to do the work:

In [None]:
# List of animals we want to look for in the story:
animals = ["mole", "horse", "goat", "cow", "flies", "dog"]

# Load text:
with open("../Data/the-story-of-the-little-mole_transcript_clean.txt", "r", encoding="utf-8") as f:
    story = f.read() 

# Call the function:
kwic_terms(story, animals, window=60)
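
Because `kwic_terms` searches for plain substrings, a term such as "cow" would also match inside a longer word like "coward". The following sketch shows one possible variant; it is not part of the original notebook. The hypothetical helper `kwic_whole_words` uses a regular expression with word boundaries and returns the hits as a list instead of printing them. The sample sentence is made up for illustration.

```python
import re

def kwic_whole_words(text, term, window=30):
    """Return KWIC lines for whole-word matches of `term` (case-insensitive)."""
    # \b marks a word boundary, so "cow" does not match inside "coward"
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", flags=re.IGNORECASE)

    lines = []
    for match in pattern.finditer(text):
        # Same windowing logic as in kwic_terms, based on the match positions
        left_start = max(0, match.start() - window)
        right_end = min(len(text), match.end() + window)
        left = text[left_start:match.start()]
        right = text[match.end():right_end]
        lines.append(f"...{left}>>{match.group()}<<{right}...")
    return lines

# Made-up sample text, only for illustration
sample = "The cow looked at the mole. The mole was no coward."
for line in kwic_whole_words(sample, "cow", window=15):
    print(line)
```

Returning a list rather than printing makes it easier to count hits or store them for later coding.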

---

### **Exercise 1:**

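Make a small change to the KWIC search, for example by changing the context size `window`. Note that the call `kwic_terms(story, animals, window=60)` passes its own value, which overrides the default in the function definition, so change the value in the call. Run the code again and look at the output.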
What is different now?

---


## 2. Word Frequencies and Word Cloud

Word frequencies offer a simple first orientation. For example: which terms appear most often across all news headlines? We use **lemmas** (base forms) instead of raw word forms so that variants such as "run" / "running" / "runs" are counted together.

We then visualise the most frequent lemmas in a **word cloud**, where more frequent words are shown in larger font. This highlights recurring topics and vocabulary.

In [None]:
!pip install wordcloud

In [None]:
import re
from collections import Counter

import matplotlib.pyplot as plt
import spacy
from wordcloud import WordCloud

# Load the small English spaCy pipeline (the model must be installed)
nlp = spacy.load("en_core_web_sm")

First, we load the text file, keep only the first 1000 headlines (to keep the runtime short), and clean the text with simple string methods and a regular expression. We then process the text with `spaCy` and collect lemmas.

In [None]:
# 1) Read headlines
with open("../Data/all_headlines.txt", "r", encoding="utf-8") as f:
    headlines = f.readlines()
print("Total headlines in file:", len(headlines))

# 2) Use only the first N headlines (to reduce runtime)
N = 1000
headlines_sample = headlines[:N]
print("Headlines used:", len(headlines_sample))

# 3) Join to a single text, then preprocess:
text = "".join(headlines_sample)

# remove everything in parentheses (e.g. (PHOTOS), (VIDEO), etc.) and extra spaces
text = re.sub(r"\([^)]*\)", "", text)
text = " ".join(text.split())
print(text)

# 4) Process with spaCy
doc = nlp(text)

# 5) Collect lemmas
lemmas = []
for tok in doc:
    if tok.is_stop or tok.is_punct or tok.is_space:
        continue
    if len(tok.lemma_) < 3: # filter out very short lemmas
        continue
    lemmas.append(tok.lemma_.lower())
print(lemmas)


Next, we count word (lemma) frequencies and plot a word cloud with `WordCloud`:

In [None]:
# 6) Count how often each lemma appears
freq = Counter(lemmas)
print(freq.most_common(20))

# 7) Word cloud from these frequencies
wc = WordCloud(width=800, height=400, background_color="white")
wc.generate_from_frequencies(freq)
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()

Why do both "woman" and "women" appear after lemmatization? `spaCy` does not always recognise "Women" as a regular plural noun in headlines, so its plural-to-singular rule is not always applied. In some cases the token is tagged as something else (for example as a proper noun, `PROPN`), and `spaCy` then keeps the original form "women" as the lemma instead of reducing it to "woman". As a result, both forms show up as lemmas. This is a normal amount of tagging and lemmatization noise in a headline dataset that has only been lightly preprocessed.
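
One pragmatic way to deal with such noise is a small, hand-written normalisation table that is applied to the lemma list before counting. The sketch below is only an illustration: the mapping `manual_fixes` and the short `lemmas_demo` list are made up, and in the notebook the same idea would be applied to `lemmas`.

```python
from collections import Counter

# Hypothetical hand-written corrections: irregular forms the lemmatizer missed
manual_fixes = {"women": "woman", "men": "man", "children": "child"}

def normalise_lemmas(lemmas, fixes):
    """Replace every lemma that has an entry in `fixes`; keep all others unchanged."""
    return [fixes.get(lemma, lemma) for lemma in lemmas]

# Made-up lemma list standing in for the `lemmas` list built above
lemmas_demo = ["woman", "women", "school", "women", "child", "children"]

freq_before = Counter(lemmas_demo)
freq_after = Counter(normalise_lemmas(lemmas_demo, manual_fixes))

print(freq_before.most_common(3))
print(freq_after.most_common(3))
```

Such a table should stay short and documented, because every manual correction is also an interpretive decision.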

## 3. Sentiment Analysis

To estimate how positive or negative each evaluation comment is, we use a simple off-the-shelf sentiment tool, `TextBlob`. It assigns two scores to each text:

- **Polarity:** from −1 (very negative) to +1 (very positive)
- **Subjectivity:** from 0 (very objective) to 1 (very subjective)

Internally, `TextBlob` relies on a **lexicon**: many words have predefined sentiment values, and the text’s overall score is an aggregation across these word-level values. More modern transformer-based models (for example BERT variants) can be more accurate, but are heavier to run. For now, `TextBlob` is enough to illustrate the idea.
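
To make the lexicon idea concrete, here is a deliberately simplified sketch. It is not `TextBlob`'s actual algorithm or lexicon: the word scores in `toy_lexicon` are invented, and the aggregation is a plain average over the words found in the lexicon.

```python
# Invented word-level scores, only for illustration (not TextBlob's lexicon)
toy_lexicon = {"amazing": 0.8, "good": 0.6, "sad": -0.5, "boring": -0.7}

def toy_polarity(text, lexicon):
    """Average the scores of all words that appear in the lexicon (0.0 if none do)."""
    # Lowercase and strip simple punctuation so "sad." matches "sad"
    words = [w.strip(".,!?;:").lower() for w in text.split()]
    scores = [lexicon[w] for w in words if w in lexicon]
    if not scores:
        return 0.0
    return sum(scores) / len(scores)

print(toy_polarity("Python is amazing.", toy_lexicon))
print(toy_polarity("Python is amazing. But sometimes debugging makes me sad.", toy_lexicon))
print(toy_polarity("The course took place on Monday.", toy_lexicon))
```

Mixed texts average out towards zero in this toy version, which hints at why a single document-level score can hide strong but opposing statements.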

Sentiment analysis can help us find very positive or very negative comments that we may want to read more closely, or compare the overall tone between groups. It is not a precise measure of "true" attitudes, because domain-specific language and mixed feelings are easy to misclassify.


In [None]:
!pip install textblob

In [None]:
from textblob import TextBlob
import glob
import pandas as pd

In [None]:
text = "Python is amazing. But sometimes debugging makes me sad."
blob = TextBlob(text)

print(blob.sentiment)

In [None]:
# 1) Find all .txt files in the folder
files = glob.glob("../Data/evaluation_comments/*.txt")

# 2) Create an empty list to store the results - one entry for each file
rows = []

# 3) Loop over all file paths
for filepath in files:
    # Open each file and read the text
    with open(filepath, encoding="utf-8") as f:
        text = f.read().strip()

    # 4) Compute the sentiment polarity for this text
    polarity = TextBlob(text).sentiment.polarity

    # 5) Store the result in a dictionary
    rows.append({
        "file": filepath,   # which file did this come from?
        "text": text,       # the full text
        "polarity": polarity
    })

# 6) Turn the list of dictionaries into a DataFrame
df_sent = pd.DataFrame(rows)

# 7) Print results
print(df_sent.head())
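
Once polarity scores sit in a DataFrame, sorting is a quick way to find candidates for close reading. The sketch below uses a made-up stand-in for `df_sent` (file names, texts, and scores are invented); `nlargest` and `nsmallest` are standard pandas methods.

```python
import pandas as pd

# Invented stand-in for df_sent; real scores would come from TextBlob
df_demo = pd.DataFrame({
    "file": ["c1.txt", "c2.txt", "c3.txt", "c4.txt"],
    "text": ["Great course!", "Too fast for me.", "It was okay.", "Terrible room."],
    "polarity": [0.9, -0.2, 0.3, -0.8],
})

# The two most positive and the two most negative comments
most_positive = df_demo.nlargest(2, "polarity")
most_negative = df_demo.nsmallest(2, "polarity")

print(most_positive[["file", "polarity"]])
print(most_negative[["file", "polarity"]])

# Overall tone as a single summary number
print("Mean polarity:", df_demo["polarity"].mean())
```

The extremes are starting points for reading, not results in themselves: each selected comment still needs to be read in context.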

---

### **Exercise 2:**

Use `TextBlob` to analyse the sentiment of your own short text.

---

## 4. Exploring Topics with Simple Rule-based Grouping

In this section we move from individual words to small **themes** or "topics". Instead of training a machine-learning model, we use a **simple rule-based grouping**: we define a few keyword lists (for example, words related to "friendship" or to "adventure") and assign each book description from the children's books collection to the topic whose keywords appear most often. This mimics very simple, dictionary-based coding and shares the **basic idea of topic modeling: grouping texts into topics based on the words they contain.**

In more advanced applications, researchers often use machine-learning approaches such as Latent Dirichlet Allocation (LDA) or transformer-based models to detect topics automatically. These models learn groups of words that tend to co-occur and represent texts as mixtures of such topics. In Python, such models are often implemented with packages like `gensim`, `scikit-learn`, or `BERTopic`. However, they require some mathematical and modeling background.

In [None]:
import pandas as pd

# 1) Load the data
df = pd.read_csv("../Data/kids-book-animals.csv")

# 2) Light preprocessing
df = df.drop_duplicates(subset="description").reset_index(drop=True) # Only keep each description once
df["description"] = df["description"].astype(str) # Make sure the description is a string
df["description_clean"] = df["description"].str.lower() # Make lowercase

# 3) Define simple word lists for two topics
friendship_words = ["friend", "friends", "together", "share", "help"]
adventure_words = ["journey", "adventure", "forest", "explore", "trip"]

# 4) Define a rule-based function
def rule_topic(text):
    """
    Simple rule-based topic assignment:
    - count how often friendship/adventure words appear
    - assign the topic with more matches
    """
    words = text.split()

    friend_count = 0
    adventure_count = 0

    # Count how many friendship and adventure words appear in this text
    for w in words:
        if w in friendship_words:
            friend_count += 1
        if w in adventure_words:
            adventure_count += 1

    # Decide which topic "wins"
    if friend_count > adventure_count:
        return "friendship"
    elif adventure_count > friend_count:
        return "adventure"
    else:
        return "unclear"

# 5) Apply the rule-based function to each description
df["rule_topic"] = df["description_clean"].apply(rule_topic)

# 6) Look at the results
print(df[["description", "rule_topic"]])

# 7) Print the first description that was not assigned to a topic
unclear = df[df["rule_topic"] == "unclear"]
print(unclear["description"].iloc[0])

We see limitations of the simple rule-based approach. For example, description number 282 starts with "Friendship is hard for Fluffy …", but our rule-based function assigns it the topic "unclear". 

---

### **Exercise 3:**  
Take a moment to think about why this happens.

- Which exact words does our code look for in `friendship_words`?
- What preprocessing step could help here?

Some answers:
- This is because our `friendship_words` list only contains exact forms like "friend" and "friends", but not "friendship". The computer only counts exact string matches.
- The issue illustrates how strongly our results depend on both preprocessing (for example lowercasing, lemmatization, tokenization) and on the design of our keyword lists. In NLP projects, it is therefore important to look at such outputs carefully and consider how preprocessing steps and methods interact, and where they might need to be refined. The same applies when machine-learning methods are used: these models combine many signals from the text, so their decisions are harder to trace back to individual words or preprocessing steps.
- Here, we could extend the dictionaries (e.g. add "friendship"), or use lemmatization or simple patterns (e.g. treating all words starting with "friend" as friendship-related).

---
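
The pattern idea from the exercise answers can be sketched as follows. This is one possible refinement, not the notebook's method: the helpers `count_prefix_matches` and `rule_topic_prefix` and the prefix lists are hypothetical, and punctuation is stripped so that a word like "friends," still counts.

```python
import string

# Hypothetical prefix lists: any word starting with one of these counts as a match
friendship_prefixes = ("friend", "together", "share", "help")
adventure_prefixes = ("journey", "adventur", "forest", "explor", "trip")

def count_prefix_matches(text, prefixes):
    """Count words in `text` that start with any of the given prefixes."""
    count = 0
    for word in text.lower().split():
        # Remove punctuation attached to the word, e.g. "friends," -> "friends"
        word = word.strip(string.punctuation)
        if word.startswith(prefixes):
            count += 1
    return count

def rule_topic_prefix(text):
    """Assign the topic whose prefixes match more words; ties stay 'unclear'."""
    friend_count = count_prefix_matches(text, friendship_prefixes)
    adventure_count = count_prefix_matches(text, adventure_prefixes)
    if friend_count > adventure_count:
        return "friendship"
    if adventure_count > friend_count:
        return "adventure"
    return "unclear"

print(rule_topic_prefix("Friendship is hard for Fluffy"))
```

Prefixes can also match too much (for example, "help" also matches "helpless"), so the assignments should still be checked against the texts.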

## 5. Network Analysis

We build a small network with `networkx` that links animal labels to pronouns to see which animals are associated with which genders or neutral references in the results of [The Pudding](https://pudding.cool/2025/07/kids-books/). From their results table we use the two columns "animal" and "pronoun". This illustrates the same ideas as larger network analyses (nodes, edges, edge weights) in a very simple setting.

In this toy example, animals and pronouns are **nodes**, and we draw an **edge** whenever a pronoun is used for an animal in our dataset. This lets us see, for instance:

- which pronouns act as **hubs** (used for many animals),
- which animals are linked to several pronouns,
- which animals share the same pronoun patterns.

In [None]:
!pip install matplotlib networkx

In [None]:
from matplotlib import pyplot as plt
import pandas as pd
import networkx as nx

# 1) Load the data
df = pd.read_csv("../Data/kids-book-animals.csv")

# 2) Keep only the most common animals (for example: the 10 most frequent ones)
animal_counts = df["animal"].value_counts()
common_animals = animal_counts.head(10).index
df_small = df[df["animal"].isin(common_animals)].copy()

# 3) Build a list of relationships (edges) between animals and pronouns
#    Each edge is a pair: [animal, pronoun]
edges = []
for _, row in df_small.iterrows():
    animal = row["animal"]
    pronoun = row["pronoun"]
    edges.append([animal, pronoun])

# 4) Create a NetworkX graph and add all edges at once
G = nx.Graph()
G.add_edges_from(edges)

# 5) Draw the network
plt.figure(figsize=(8, 6))
nx.draw_networkx(G)
plt.axis("off")
plt.show()
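
The plot above treats every animal-pronoun pair as a plain edge, so it does not show how often a pair occurs. A possible extension, sketched here with invented rows instead of the real table, stores the number of occurrences as an edge `weight` and uses node degree to spot hubs. The names `pairs` and `G_demo` are made up for this example.

```python
from collections import Counter
import networkx as nx

# Invented (animal, pronoun) rows standing in for the real data
pairs = [
    ("bear", "he"), ("bear", "he"), ("bear", "she"),
    ("cat", "she"), ("cat", "she"), ("dog", "he"), ("fish", "it"),
]

# Count how often each pair occurs and use the count as edge weight
pair_counts = Counter(pairs)
G_demo = nx.Graph()
for (animal, pronoun), n in pair_counts.items():
    G_demo.add_edge(animal, pronoun, weight=n)

# Degree = number of different neighbours; pronouns with a high degree act as hubs
pronouns = {pronoun for _, pronoun in pairs}
for pronoun in sorted(pronouns):
    print(pronoun, "is linked to", G_demo.degree(pronoun), "animals")

print("Weight of bear-he:", G_demo["bear"]["he"]["weight"])
```

Edge weights could then be passed to the drawing functions (for example as line widths) to make frequent pairings visually stronger.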