<img src="../Images/DSC_Logo.png" style="width: 400px;">

# Basics of Natural Language Processing (NLP)
# - Text Preprocessing

We now move from files and tables to text content. In this notebook, we focus on:
- applying transparent formatting-level cleaning (e.g., lowercasing, removing text-specific patterns),
- understanding the difference between characters, tokens, and sentences,
- tokenizing texts and inspecting basic linguistic annotations,
- preparing text data for quantitative methods (e.g., removing punctuation and stop words).

## 1. Formatting-level Cleaning

Many texts are not in a format that is immediately usable for NLP techniques and can also be improved for human reading. Web-scraped text is a typical example: it often contains menus, footers, references, and irregular line breaks.

For demonstration, we open the content of the Wikipedia page that we saved in Notebook 8 as a plain-text string, which we can now lightly preprocess.

In [None]:
# Open data
with open("../Data/wikipedia-article.csv", "r", encoding="utf-8") as f:
    text = f.read()
print(text)

One simple way to automate data cleaning is Python’s built-in **regular expression** (regex) module, `re`.

In [None]:
import re

First, we want to convert everything to lowercase using just a string method:

In [None]:
text = text.lower()
print(text)

Second, we remove numeric citations like [32] or [10] with regex:

In [None]:
text = re.sub(r"\[\d+\]", " ", text)
print(text)

Third, we collapse runs of whitespace (spaces, tabs, line breaks) into a single space with regex:

In [None]:
text = re.sub(r"\s+", " ", text).strip()
print(text)
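The three steps above can also be bundled into one reusable function. The following is a minimal sketch: the function name `light_clean` and the sample string are made up for illustration and are not part of the Wikipedia data.

```python
import re

def light_clean(raw: str) -> str:
    """Apply the three formatting-level steps in the same order as above."""
    lowered = raw.lower()                              # 1) lowercase
    no_citations = re.sub(r"\[\d+\]", " ", lowered)    # 2) drop numeric citations like [32]
    return re.sub(r"\s+", " ", no_citations).strip()   # 3) collapse whitespace

# Hypothetical sample text, used only to try out the function
sample = "Python  was created by Guido van Rossum.[1]\n\nIt is widely   used.[23]"
print(light_clean(sample))
```

Wrapping the steps in a function makes it easy to apply exactly the same cleaning to several texts. Note that only brackets containing digits are removed; something like `[x]` stays in the text.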

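Regex can also be used to extract patterns. In this example, we aim to extract dates of the form "12 March". The pattern consists of:
- `\b`: a word boundary, which ensures that the match starts at the beginning of a word,
- `\d{1,2}`: one or two digits (the day),
- a single space,
- `[A-Z][a-z]+`: any capitalized word (one uppercase letter followed by one or more lowercase letters).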

Limitation: It is not realistically possible to match all date formats with a single regular expression because dates can appear in a huge variety of formats.

In [None]:
text = "The event happened on 12 March, 5 April, and 23 December."
matches = re.findall(r"\b\d{1,2} [A-Z][a-z]+", text)
print(matches)

---

### **Exercise 1:** 

What happens if you introduce a typo into one of the month names, and why?

In [None]:
# Enter a typo:
text = "The meeting happened on 12 March, 5 Aprils, and 23 December." # Typo entered: Aprils
matches = re.findall(r"\b\d{1,2} [A-Z][a-z]+", text)
print(matches)

Solution: The pattern does not check whether the word is a real month name; it only checks the shape (a capitalized word after a one- or two-digit number). So `5 Aprils` still matches, even though "Aprils" is a typo.
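If only real month names should match, one option is to list them explicitly in the pattern. This is a sketch under the assumption that dates are written as day plus full English month name; the names `MONTHS` and `strict_pattern` are chosen here for illustration.

```python
import re

# Alternation of the twelve English month names
MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")

# The trailing \b requires the month name to end at a word boundary,
# so "Aprils" is rejected because "April" is followed by another letter.
strict_pattern = re.compile(rf"\b\d{{1,2}} (?:{MONTHS})\b")

text_with_typo = "The meeting happened on 12 March, 5 Aprils, and 23 December."
strict_matches = strict_pattern.findall(text_with_typo)
print(strict_matches)
```

The stricter pattern trades flexibility for precision: it skips the typo, but it would also miss abbreviations such as "Dec" or month names in other languages.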

---

## 2. NLP Library `spaCy`

For any data cleaning where we need to decide which words or units to keep or drop (linguistic cleaning), we use an NLP library. Here, we focus on `spaCy`, but other libraries such as `NLTK` exist. 

We use these NLP libraries to turn raw text into structured pieces that a computer can work with: words, sentences, and simple linguistic labels (e.g. "this is a verb", "this looks like a name", "this is probably not very informative"). For that, we generally use tokens. **Tokenization** is the process of taking a long string of text and breaking it into smaller units called tokens. The content of the text stays the same, but it becomes organized into pieces that can be counted, filtered, and analyzed.
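To make the difference between characters and tokens concrete, here is a small sketch that uses only the standard library. The regex-based split is a deliberately simplified stand-in and is not how `spaCy` tokenizes; the variable names are chosen for illustration.

```python
import re

example = "I love learning Python. It is fun!"

# Characters: every single symbol, including spaces and punctuation
characters = list(example)

# Whitespace tokens: punctuation stays glued to the neighbouring word ("Python.")
whitespace_tokens = example.split()

# Simple tokens: runs of word characters, or single punctuation marks
simple_tokens = re.findall(r"\w+|[^\w\s]", example)

print(len(characters), "characters")
print(whitespace_tokens)
print(simple_tokens)
```

Splitting on whitespace alone leaves `"Python."` and `"fun!"` as single units, which is one reason why a dedicated tokenizer is preferable for counting and filtering words.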


In [None]:
!pip install -U setuptools wheel spacy
!python -m spacy download en_core_web_sm

In [None]:
import spacy
nlp = spacy.load("en_core_web_sm") # load the small English model (downloaded in the cell above)

`spaCy` processes a string and returns a **`doc` object**. This object holds:

- tokens,
- sentences,
- linguistic annotations like lemma, part of speech, entities.

Tokens are a technical representation that supports both quantitative and qualitative analysis.

Let us see what this looks like for a short example text:

In [None]:
text = "I love learning Python. Python is great for data analysis and Python is also fun!"
print(text)

We let `spaCy` process the text. That means it creates a `doc` object:

In [None]:
doc = nlp(text)

Let's investigate the `doc` object. First, we print all the sentences detected (`doc.sents` object):

In [None]:
for sent in doc.sents:
    print(repr(sent))

The `token` object comes with many attributes that are essential for most NLP tasks. You can explore the full list in the documentation: https://spacy.io/api/token.

In [None]:
for token in doc:
    # token.text  : the original form
    # token.lemma_: base form (e.g. "learning" -> "learn")
    # token.pos_  : coarse part-of-speech tag (NOUN, VERB, etc.)
    # token.is_stop: True if spaCy thinks it is a stop word
    print(f"{token.text!r:12} {token.lemma_!r:12} {token.pos_:8} {token.is_stop}")

**Tokens** become "quantitative" once we start counting, aggregating, or modelling them. In many quantitative workflows (e.g. topic modelling), so-called stop words are removed so that very frequent but low-information words do not dominate the patterns. In addition, punctuation is often stripped, because it usually carries little signal for these frequency-based analyses.

Below is a small "cleaning" example for a text in which many words carry little information.

The workflow is:
1. Get a doc.
2. Decide which tokens to keep (rules/filters).
3. Collect cleaned tokens (list) or cleaned text (string).

In [None]:
statement = "Well, you know, I just want to say that we are basically here today to kind of talk about how we can, in some way, move forward together, and I think that, at the end of the day, people really just want to feel that things are sort of not going in the wrong direction with our democracy."
print(statement)

In [None]:
# Process with spaCy
doc = nlp(statement)

# Build a cleaned list of tokens step by step using a for-loop (Notebook 4):
clean_tokens = []

for token in doc:
    # Skip punctuation (e.g. "!" or ",")
    if token.is_punct:
        continue
    
    # Skip spaces 
    if token.is_space:
        continue
    
    # Skip stop words (very common words like 'and', 'the', 'is' ...)
    if token.is_stop:
        continue
    
    # Keep only alphabetic tokens (no numbers, no mixed tokens like "Python3")
    if not token.is_alpha:
        continue
    
    # If the token passed all checks, add its lowercase form to the list
    clean_tokens.append(token.text.lower())
print(clean_tokens)

# Turn the cleaned tokens back into a single string
clean_text = " ".join(clean_tokens)
print(clean_text)
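Once a list of cleaned tokens exists, counting is a natural first quantitative step. The sketch below uses `collections.Counter` on a hand-written token list; the list `example_tokens` is an assumption made for illustration and is not the actual output of the cell above.

```python
from collections import Counter

# Hypothetical cleaned tokens (lowercased, no stop words, no punctuation)
example_tokens = ["love", "learning", "python", "python", "great",
                  "data", "analysis", "python", "fun"]

# Count how often each token occurs
token_counts = Counter(example_tokens)

# Show the three most frequent tokens with their counts
print(token_counts.most_common(3))
```

The same two lines work on `clean_tokens` from the cell above, since it is also a plain list of strings.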


We now switch to a dataset that contains a YouTube transcript from the children's story "The Story of the Little Mole (Who Knew it Was None of His Business)" (video: https://www.youtube.com/watch?v=plzwDLnieAk). Speech-to-text transcripts like this are a common use-case where Python can support qualitative research. 

First, we read the transcript file:

In [None]:
with open("../Data/the-story-of-the-little-mole_transcript.txt", "r", encoding="utf-8") as f:
    story = f.read()   # read entire file as one string

# Quick peek at the first 200 characters
print(story[:200])

Next, we detect and skip lines that are only timestamps and clean small artifacts (like "[Music]") before finally joining everything into one continuous story.

In [None]:
import re

clean_lines = []

for line in story.splitlines():
    line = line.strip()
    if not line:
        continue  # skip empty lines

    # Skip pure timestamp lines like "0:02" or "12:45"
    if re.fullmatch(r"\d{1,2}:\d{2}", line):
        continue

    # Remove simple [Music] or other bracketed tags
    line = re.sub(r"\[.*?\]", "", line).strip()

    if line:
        clean_lines.append(line)

# Join all remaining lines into one continuous story
clean_story = " ".join(clean_lines)
print(clean_story)

# Save text
with open("../Data/the-story-of-the-little-mole_transcript_clean.txt", "w", encoding="utf-8") as f:
    f.write(clean_story)
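To check that the cleaning rules behave as intended, it can help to run them on a tiny input where the expected result is obvious. The sketch below wraps the same logic in a function; the name `clean_transcript` and the sample transcript are made up for illustration and are not taken from the real file.

```python
import re

def clean_transcript(raw: str) -> str:
    """Drop empty lines and pure timestamp lines, remove bracketed tags, join the rest."""
    kept = []
    for line in raw.splitlines():
        line = line.strip()
        # Skip empty lines and pure timestamps like "0:02" or "12:45"
        if not line or re.fullmatch(r"\d{1,2}:\d{2}", line):
            continue
        # Remove bracketed tags such as [Music]
        line = re.sub(r"\[.*?\]", "", line).strip()
        if line:
            kept.append(line)
    return " ".join(kept)

# Hypothetical mini transcript in the same layout as the real one
sample_transcript = """0:00
[Music]
0:02
once upon a time
0:05
there was a little mole [Applause]

12:45
the end"""

print(clean_transcript(sample_transcript))
```

One limitation of this pattern: a timestamp with an hour part, such as `1:02:45`, is not matched by `\d{1,2}:\d{2}` and would remain in the text, so longer recordings may need an extended pattern.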

---

### **Exercise 2:**

1. Turn the cleaned story string (`clean_story`) into a `spaCy` doc.  
2. Print all sentences that `spaCy` detects.  
3. Then, explore a token attribute you might be interested in (look them up here: https://spacy.io/api/token) and for every token in the text, print its attribute.

In [None]:
# Solution (example):
import spacy

# 1) Turn the cleaned story string into a spaCy Doc
doc = nlp(clean_story)

# 2) Print all sentences detected by spaCy
for sent in doc.sents:
    print(repr(sent))

# 3) Inspect token attributes for every token (including punctuation)
for token in doc:
    # token.text  : the original form
    # token.lemma_: base form (e.g. "learning" -> "learn")
    # token.pos_  : coarse part-of-speech tag (NOUN, VERB, etc.)
    # token.is_stop: True if spaCy thinks it is a stop word
    print(f"{token.text!r:12} {token.lemma_!r:12} {token.pos_:8} {token.is_stop}")

The sentences look strange because the transcript contains no punctuation or other sentence-like structure. `spaCy` relies heavily on sentence-final punctuation and similar cues to decide where one sentence ends and the next begins. Without those signals, `doc.sents` yields a largely arbitrary segmentation, so we should not interpret these "sentences" linguistically. The tokenization itself is still reliable because it mainly depends on whitespace and basic rules. Token-level information such as `text`, `lemma_`, and `pos_` can therefore still be inspected and used, although context-dependent annotations like `pos_` may be somewhat less accurate on unpunctuated text.

---