<img src="../Images/DSC_Logo.png" style="width: 400px;">

# Basics of Natural Language Processing (NLP)
# - Loading Files & Data Organization

Natural Language Processing (NLP) is the automatic processing of natural language. In qualitative research, we can treat Python as a "helper" for human, interpretive, qualitative work. The goal of this notebook is to show how Python can support reading, coding, and interpretation of text data. These NLP notebooks are inspired by an open Python textbook for digital humanities available at [python-textbook.pythonhumanities.com](https://python-textbook.pythonhumanities.com).

---

This first notebook demonstrates **how to load files and organise text together with basic metadata**. Throughout this notebook we use a small set of Python libraries. We use `json` and `pandas` to organise texts and metadata in tables, and `BeautifulSoup` to clean web-based texts by stripping HTML.





## 1. The `with` statement

Up until now, we’ve worked only with data created inside our own code. But in real-world projects, you’ll mainly work with existing data and files. Let's see how to open and read text files, as well as how to save (write) new files.

We can use the `with` statement together with `open()` **to open a text file** and store its contents in the variable `data`.

In [None]:
# To open a file (from a relative path) in read mode ("r"):
with open("../Data/kids-book-animals.csv", "r") as f: 
    data = f.read()

print(data)

**To save text** into a new file, we use almost the same structure, but change the mode to `"w"` (`write`):

In [None]:
new_string = "This is some text."

# To save data to a file (in a relative path) using write mode ("w"):
with open("../Data/some-text.txt", "w") as f:
    f.write(new_string)
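
The two modes above can be combined into a small round trip. The following is a minimal sketch, not part of the workshop data: it writes to a temporary folder (the file name `notes.txt` is made up for illustration) and additionally shows the append mode `"a"`, which adds text to the end of a file instead of overwriting it.

```python
import os
import tempfile

# Hypothetical location: a temporary folder, so nothing in ../Data is touched
folder = tempfile.mkdtemp()
path = os.path.join(folder, "notes.txt")

# "w" creates the file, or overwrites it if it already exists
with open(path, "w", encoding="utf-8") as f:
    f.write("First line.\n")

# "a" (append) adds to the end of the file instead of overwriting it
with open(path, "a", encoding="utf-8") as f:
    f.write("Second line.\n")

# "r" reads the file back so we can check what was stored
with open(path, "r", encoding="utf-8") as f:
    saved_text = f.read()

print(saved_text)
```

Passing `encoding="utf-8"` explicitly is a design choice here: it keeps special characters (umlauts, accents, emojis) consistent across operating systems.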

While the built-in Python `with` statement is sufficient for opening and saving text files, working with larger collections of texts or texts that require some form of structure is easier if we use a few dedicated libraries. The most important Python libraries for handling text data in this notebook are introduced below.

## 2. `json`

The `json` library helps us convert between **JSON text** (as it is stored in a file) and **Python objects** such as dictionaries and lists (compare Notebook 3). JSON (JavaScript Object Notation) is a common way to store and send data on the web, especially when working with websites and APIs.

In the dataset we use here, each **line** of the file is one news article stored as a small JSON object. With `json`, we can read the file line by line and turn each JSON line into a normal Python dictionary. Below, we load just a few entries from a real-world dataset of HuffPost news articles (2012–2022), where each line is a separate article in JSON format.

In [None]:
# Import the json library (part of Python’s standard library)
import json

In [None]:
# Path to the dataset file (relative path from current notebook)
file_path = "../Data/sample_articles_10000.json"

# Create an empty list to store the loaded articles
news_articles = []

# Open the file in read mode ("r")
with open(file_path, "r", encoding="utf-8") as f:
    # Loop over each line in the file (each line is a separate news article in JSON format)
    for i, line in enumerate(f):
        if i >= 5:  # Stop after reading the first 5 articles (just for demonstration)
            break
        # Convert the JSON string (one line) into a Python dictionary and add it to the list
        news_articles.append(json.loads(line))

In [None]:
print(news_articles)

To print the category of the first article: take the first item in the list `news_articles` (a dictionary) and access its `category` key:

In [None]:
# First item of the list (a dictionary), then the value stored under its key "category"
print(news_articles[0]["category"])
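
To see what `json.loads` does to a single line, here is a minimal sketch with one made-up article line. The keys and values are invented for illustration and are not taken from the dataset. It also shows `json.dumps`, which converts in the opposite direction.

```python
import json

# Hypothetical JSON line, made up for illustration (not from the dataset)
line = '{"headline": "An example headline", "category": "SCIENCE", "date": "2020-01-01"}'

# json.loads ("load string") turns the JSON text into a Python dictionary
article = json.loads(line)
print(type(article))
print(article["category"])

# json.dumps ("dump string") turns the dictionary back into JSON text
line_again = json.dumps(article)
print(line_again)
```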

## 3. `glob`

Suppose you have a folder with multiple .txt files - each one is a transcript of a different interview. You want to automatically load all these files to analyze them in Python. The `glob` library allows you to search for files in a folder based on patterns. In this example, we’ll load all .txt files from a folder and print their contents.

In [None]:
# Import the glob library (part of Python’s standard library)
import glob

In [None]:
# Get all text files in the "evaluation_comments" folder
files = glob.glob("../Data/evaluation_comments/*.txt")

# Read and display contents of each file
for filepath in files:
    with open(filepath, "r", encoding="utf-8") as f:
        content = f.read()
        print(f"--- Contents of {filepath} ---")
        print(content)
        print("\n")
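
For later analysis it is often more useful to keep the texts than to print them. The sketch below is one possible way to do this, assuming a temporary folder with made-up file names in place of the real transcripts: it stores each text in a dictionary keyed by file name. Because `glob.glob` returns files in no guaranteed order, the results are wrapped in `sorted()`.

```python
import glob
import os
import tempfile

# Hypothetical folder with made-up files, standing in for a folder of transcripts
folder = tempfile.mkdtemp()
example_files = {
    "interview_02.txt": "Second interview.",
    "interview_01.txt": "First interview.",
    "notes.md": "Not a transcript.",
}
for name, text in example_files.items():
    with open(os.path.join(folder, name), "w", encoding="utf-8") as f:
        f.write(text)

# Collect all .txt files; sorted() gives a stable, alphabetical order
texts = {}
for filepath in sorted(glob.glob(os.path.join(folder, "*.txt"))):
    with open(filepath, "r", encoding="utf-8") as f:
        # Use only the file name (without the folder) as the dictionary key
        texts[os.path.basename(filepath)] = f.read()

print(list(texts.keys()))
```

The pattern `*.txt` matches only files ending in `.txt`, so `notes.md` is skipped.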

## 4. `beautifulsoup4` & `requests`

`beautifulsoup4` is a library used to parse HTML (HyperText Markup Language) and extract information. It is well suited to getting data from websites in a structured, readable way. Together with `requests`, we can conduct our first web scraping task. Let's scrape some text from a Wikipedia page!

In [None]:
# Install & import the libraries
!pip install beautifulsoup4 
!pip install requests
from bs4 import BeautifulSoup
import requests

---
### **Exercise 1:** 

Define a variable that contains the web address (URL) of a Wikipedia page you want to scrape, and then use it in the code below.

In [None]:
# Wikipedia page we want to fetch


In [None]:
# Wikipedia asks scripts to send a short "User-Agent" string
headers = {
    "User-Agent": "PythonWorkshop"
}

# Download the page
response = requests.get(url, headers=headers)

# Turn the HTML into a BeautifulSoup object
soup = BeautifulSoup(response.text, "html.parser")

print(soup)

The direct output of `soup` is the whole HTML page, which is not very easy to read or work with.
If you open this HTML in a text editor, you can see how the page is structured with different tags, for example `<p>` for paragraphs. With `BeautifulSoup` we can select only the parts we care about. Here, we want to find all `<p>` tags (paragraphs), use a list comprehension (see Notebook 7) to turn them into plain text, and use `" ".join(...)` to combine all paragraphs into one long string.

In [None]:
# Find all <p> tags (paragraph elements) on the page
paragraph_tags = soup.find_all("p")

# Turn each <p> tag into plain text and skip empty paragraphs using list comprehension
paragraphs = [p.text.strip() for p in paragraph_tags if p.text.strip()]

# Join all paragraphs into one large text block
# " ".join(...) combines a list of strings into one string, with spaces in between
main_text = " ".join(paragraphs)

# Print the first 500 characters of the article text
print(main_text[:500])

# Save to data folder (will be used in Notebook 9)
with open("../Data/wikipedia-article.csv", "w", encoding="utf-8") as f:
    f.write(main_text)
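
The same steps can be tried without an internet connection. The following minimal sketch uses a small, made-up HTML snippet instead of a downloaded page, so it is easy to see which parts are kept and which are dropped.

```python
from bs4 import BeautifulSoup

# Made-up HTML snippet for illustration (not a real web page)
html = """
<html><body>
  <h1>Example page</h1>
  <p>First paragraph with a <a href="#">link</a>.</p>
  <p>   </p>
  <p>Second paragraph.</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Keep the text of every <p> tag, skipping paragraphs that contain only whitespace
paragraphs = [p.text.strip() for p in soup.find_all("p") if p.text.strip()]

# Combine the remaining paragraphs into one string
main_text = " ".join(paragraphs)

print(paragraphs)
print(main_text)
```

The heading (`<h1>`) and the whitespace-only paragraph do not appear in the result, while the text inside the nested `<a>` tag is kept as part of its paragraph.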

---

## 5. `pandas`

`pandas` is the most widely used Python library for working with tabular data. Tabular data is data arranged in rows and columns. It is commonly found in files like .csv or Excel spreadsheets. 

`pandas` makes it easy to read files like `.csv` into `DataFrames`, which keep both the data and its structure intact. This allows you to efficiently store, organize, and analyze data within your code. `DataFrames` are similar to Excel spreadsheets or database tables. They have a 2-dimensional data structure and labeled axes (rows and columns). These are indexed for efficient data retrieval.

<img src="../Images/dataframe.png" style="width: 300px;">

In [None]:
# Install & import pandas
!pip install pandas
import pandas as pd

We can also use `pandas` to load and save text files, similar to using the `with` statement, but now we assume a **tabular structure** (rows and columns). For example, the file we loaded at the beginning of this notebook is stored as a table, so we normally read it with `pandas` instead of treating it as one long text string.

The imported dataset comes from a study that examines how often animal characters in popular children’s books are explicitly given a gender, in order to reveal patterns and possible biases in representation. If you are curious, have a look at the excellent interactive story created by the team from [The Pudding](https://pudding.cool/2025/07/kids-books/) for a deeper exploration of their findings.

In [None]:
# Reading the same file as a table (DataFrame) with pandas
kids = pd.read_csv("../Data/kids-book-animals.csv")

# Show the first rows of the table: one row per entry, columns = metadata / text fields
kids.head()

Once the data are in a `DataFrame`, we can already do very basic analyses that can support qualitative work, such as checking how many entries we have, what columns exist, and how often certain values appear in a column. Almost any operation you might want to perform on structured data is possible. A `DataFrame` follows Python’s usual indexing logic, so we can, for example, select a specific column either by its name or its position (index):

In [None]:
# How many rows and columns do we have?
print(kids.shape)   # e.g. (number_of_rows, number_of_columns)

In [None]:
# What columns exist in this DataFrame?
print(kids.columns)

In [None]:
# Example 1: select one column by its name (e.g. "animal") and then print the "unique" string entries (animals) in that column
animals = kids["animal"]
print(animals.unique())

In [None]:
# Example 2: select the first column by its index position (column 0)
# .iloc is a pandas tool for selecting data from a DataFrame using integer positions.
# dataframe.iloc[row_position, column_position]
first_column = kids.iloc[:, 0]
print(first_column.head())

In [None]:
# Example 3: how often does each value appear in the "animal" column?
print(kids["animal"].value_counts())
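
The same operations work on data that did not start out as a `.csv` file. A list of dictionaries, such as the articles loaded with `json` in section 2, can be passed directly to `pd.DataFrame`. The sketch below uses three made-up records for illustration, not real articles.

```python
import pandas as pd

# Made-up records for illustration: each dictionary becomes one row
articles = [
    {"headline": "Headline A", "category": "SCIENCE"},
    {"headline": "Headline B", "category": "POLITICS"},
    {"headline": "Headline C", "category": "SCIENCE"},
]

# The dictionary keys become the column names
df = pd.DataFrame(articles)
print(df.shape)

# Count how often each category appears
counts = df["category"].value_counts()
print(counts)
```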

---
### **Exercise 2:** 

In Example 2, we selected the **first column** of `kids` by its index position. Now have a look at this example, think about how this works and write a line of code that selects the **first row** of `kids` by its index position instead.

---