# Doing Things with Text 1

## Cleaning a text document

This notebook introduces basic text processing for analyzing a single text document. You'll learn to clean up (preprocess) a text to prepare it for analysis. Don't worry if you're new to Python: we'll guide you through each step.

### Step 1: Setting Up NLTK

NLTK (Natural Language Toolkit) is a library for working with text. To use it, you'll need to download some additional language data the first time you use NLTK. Run the following cell once:

In [None]:
# Import NLTK and download required packages
import nltk
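nltk.download('punkt')  # Tokenizer models
nltk.download('punkt_tab')  # Tokenizer data that newer NLTK releases look for (assumption: NLTK 3.9 or later)
nltk.download('stopwords')  # Stopword lists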

### Step 2: Importing Required Packages

Here, we're loading a few packages to help with text cleaning:
- `BeautifulSoup`: To clean up HTML text.
- `unicodedata`: For handling special characters.
- `re`: For regular expressions (patterns used for finding and cleaning text).
- `os`: For building file paths and creating folders.
- `nltk.tokenize`: For splitting text into individual words.
- `nltk.corpus.stopwords`: A collection of common words like 'the', 'and', 'is', which are often removed in analysis.

In [None]:
# Import necessary libraries
from bs4 import BeautifulSoup
import unicodedata
import re
import os
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
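
The cleaning steps in this notebook do not call `unicodedata` directly. As a minimal sketch of what "handling special characters" can look like, the cell below strips accent marks from letters. The helper name `strip_accents` and the sample string are assumptions made for illustration, not part of the original workflow.

```python
import unicodedata

def strip_accents(s):
    """Return s with combining accent marks removed (hypothetical helper)."""
    # NFKD splits an accented letter into a base letter plus combining marks
    decomposed = unicodedata.normalize('NFKD', s)
    # Keep every character that is not a combining mark
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents('café naïve'))  # → cafe naive
```

Whether you want this depends on your text: for some languages, removing accents changes the meaning of words.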

### Step 3: Define Input and Output Paths

Define where your text file is located (input) and where you want to save your processed text (output). You will use `os.path.join()` to define your paths. This approach is cross-platform, meaning it will work on Windows, macOS, and Linux.

Replace the placeholder components ('path', 'to', 'your', 'input', 'folder') with the components of your actual folder paths. The output folder does not need to exist yet. If it doesn't, this code will create it for you.

In [None]:
# Define input and output paths
indir = os.path.join('path', 'to', 'your', 'input', 'folder')  # Example: os.path.join('Users', 'yourname', 'Documents')
outdir = os.path.join('path', 'to', 'your', 'output', 'folder')
os.makedirs(outdir, exist_ok=True)  # Create the output directory if it doesn't exist
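
To see how `os.path.join()` and `os.makedirs()` behave without touching your own folders, here is a small sketch that works inside a temporary directory. The folder names `results` and `cleaned` are made up for the demo.

```python
import os
import tempfile

# Use a throwaway base folder so this demo does not touch your real files
base = tempfile.mkdtemp()
demo_outdir = os.path.join(base, 'results', 'cleaned')  # hypothetical folder names

os.makedirs(demo_outdir, exist_ok=True)  # creates both 'results' and 'cleaned'
os.makedirs(demo_outdir, exist_ok=True)  # calling it again is harmless

print(os.path.isdir(demo_outdir))  # → True
```

Because of `exist_ok=True`, you can re-run the cell as often as you like without getting an error about an existing folder.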

### Step 4: Load Your Text Document

Now, let's load your text file. Make sure the file is in the folder you specified.

In [None]:
file = 'infile.txt'  # Replace 'infile.txt' with your actual file name
file_path = os.path.join(indir, file)  # Join folder and file name with the correct separator

In [None]:
with open(file_path, encoding='utf8') as f:
    text = f.read()
print('Sample text preview:', text[:400])  # Show a sample of the loaded text
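
If the file name or folder is wrong, Python raises a `FileNotFoundError`. The sketch below wraps the loading step in a small helper that gives a more explicit message. The name `load_text` and the temporary demo file are assumptions for illustration only.

```python
import os
import tempfile

def load_text(path):
    """Read a UTF-8 text file, with a clearer message if it is missing (hypothetical helper)."""
    if not os.path.isfile(path):
        raise FileNotFoundError(f'No file found at {path!r}; check indir and the file name.')
    with open(path, encoding='utf8') as f:
        return f.read()

# Demo with a temporary file standing in for your own document
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, 'demo.txt')
with open(demo_path, 'w', encoding='utf8') as f:
    f.write('Hello, <b>World</b>!')

demo_text = load_text(demo_path)
print(demo_text)  # → Hello, <b>World</b>!
```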

### Step 5: Cleaning

Now, we'll go through several steps to clean the text. Each step has a brief explanation and an example.

#### Step 5a: Lowercase the Text
Converting text to lowercase helps standardize it, so 'Python' and 'python' are treated as the same word.

In [None]:
# Convert text to lowercase
text = text.lower()
print('Lowercased text preview:', text[:400])
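
As a quick illustration of why lowercasing matters, this sketch counts distinct words in a made-up list before and after lowercasing.

```python
words = ['Python', 'python', 'PYTHON', 'Text']  # made-up sample

distinct_before = set(words)                 # case-sensitive: four different strings
distinct_after = {w.lower() for w in words}  # case-insensitive: two different words

print(len(distinct_before), len(distinct_after))  # → 4 2
```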

#### Step 5b: Remove HTML Tags
If your text includes HTML tags, this step removes them and keeps only the content.

In [None]:
# Remove HTML tags
soup = BeautifulSoup(text, 'html.parser')
text = soup.get_text()
print('HTML tags removed preview:', text[:400])
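
To make the idea of "tags out, content kept" concrete, here is a dependency-free sketch that uses Python's built-in `html.parser` instead of BeautifulSoup. It is only an illustration of the principle: the names `TextOnly` and `strip_tags` are hypothetical, and on messy real-world HTML the result may differ from what BeautifulSoup returns.

```python
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Collect the text between tags and ignore the tags themselves."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for every run of plain text found between tags
        self.parts.append(data)

def strip_tags(html):
    """Return the text content of an HTML string (hypothetical helper)."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.parts)

print(strip_tags('<p>hello <b>world</b></p>'))  # → hello world
```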

#### Step 5c: Remove Non-Alphabetic Characters
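This step removes every character that is not a lowercase letter (a-z) or whitespace, leaving only the words. Digits and punctuation are dropped, and so are accented letters such as 'é', so run this step only after lowercasing.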

In [None]:
# Remove non-alphabetic characters
text = re.sub(r'[^a-z\s]', '', text)  # Raw string so that \s reaches the regex engine unchanged
print('Non-alphabetic characters removed preview:', text[:400])
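
It helps to try the pattern on a short string before applying it to a whole document. The sample sentence below is made up. Note how the apostrophe and the accented letter disappear, which changes the words they belonged to.

```python
import re

sample = "don't stop: 3 cafés opened in 2024!"  # made-up, already lowercased

# Delete everything that is not a lowercase a-z letter or whitespace
cleaned = re.sub(r'[^a-z\s]', '', sample)

print(cleaned.split())  # → ['dont', 'stop', 'cafs', 'opened', 'in']
```

If contractions or accented words matter for your analysis, you may want to adjust the pattern rather than use it as-is.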

#### Step 5d: Tokenize the Text
Tokenizing splits the text into individual words (tokens).

In [None]:
# Tokenize the text
tokens = word_tokenize(text)
print('Tokens:', tokens[:20])  # Show the first few tokens

#### Step 5e: Remove Stopwords
Stopwords are common words like 'the', 'is', and 'and'. Removing them can make the text more focused on meaningful words.

In [None]:
# Remove stopwords
stop_words = set(stopwords.words('english'))
tokens = [word for word in tokens if word not in stop_words]
print('Text after stopwords removed:', tokens[:20])
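
The filtering logic can be checked on a toy example. The stopword set below is a tiny hand-written stand-in, not NLTK's English list, so the cell runs without any downloads.

```python
demo_tokens = ['the', 'cat', 'is', 'on', 'the', 'mat']  # made-up tokens
demo_stop_words = {'the', 'is', 'on', 'and'}            # tiny hand-written set, NOT NLTK's list

# Keep only the tokens that are not in the stopword set
kept = [w for w in demo_tokens if w not in demo_stop_words]

print(kept)  # → ['cat', 'mat']
```

Storing the stopwords in a `set`, as the notebook does, keeps the `not in` check fast even for long documents.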

### Step 6: Save Your Cleaned Text

Finally, let's save the cleaned text tokens back to a new file.

In [None]:
# Save the cleaned tokens to a new file
output_file = os.path.join(outdir, 'cleaned_text.txt')
with open(output_file, 'w', encoding='utf8') as f:
    f.write(' '.join(tokens))
print('Cleaned text saved to:', output_file)
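
Because the tokens are joined with single spaces, the saved file can be turned back into a token list with `split()`. The sketch below checks this round trip on made-up tokens in a temporary folder, so it does not overwrite your real output.

```python
import os
import tempfile

demo_tokens = ['cat', 'sat', 'mat']  # made-up tokens
demo_file = os.path.join(tempfile.mkdtemp(), 'cleaned_demo.txt')  # hypothetical file name

# Save the tokens the same way as above
with open(demo_file, 'w', encoding='utf8') as f:
    f.write(' '.join(demo_tokens))

# Read them back and split on whitespace
with open(demo_file, encoding='utf8') as f:
    restored = f.read().split()

print(restored == demo_tokens)  # → True
```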

### Step 7: Counting Words Before and After Preprocessing

To understand the impact of cleaning, we'll compare the total token count before and after preprocessing. This shows how many tokens (such as punctuation marks, numbers, and stopwords) were removed.


In [None]:
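# Count the total number of words before cleaning
# `text` has already been cleaned by this point, so reload the original file
with open(file_path, encoding='utf8') as f:
    raw_text = f.read()
word_count_before = len(word_tokenize(raw_text))
print('Total number of words before cleaning:', word_count_before)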

In [None]:
# Count the total number of words after cleaning
word_count_after = len(tokens)
print('Total number of words after cleaning:', word_count_after)

In [None]:
# Display the difference in word count
print('Words removed during cleaning:', word_count_before - word_count_after)
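
A raw difference is easier to interpret as a share of the original text. The sketch below turns the two counts into a percentage. The helper name `removal_summary` and the example numbers are made up for illustration.

```python
def removal_summary(count_before, count_after):
    """Return (tokens removed, percent removed) for two counts (hypothetical helper)."""
    removed = count_before - count_after
    # Guard against an empty document to avoid dividing by zero
    percent = 100 * removed / count_before if count_before else 0.0
    return removed, percent

removed, percent = removal_summary(200, 150)  # made-up counts
print(f'{removed} tokens removed ({percent:.1f}%)')  # → 50 tokens removed (25.0%)
```

With your own data, you would call it as `removal_summary(word_count_before, word_count_after)`.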