# Doing Things with Text 2: Basic textual statistics

This notebook introduces basic statistical analysis for a single text document after it has been preprocessed.
You will learn to calculate metrics such as word counts and lexical diversity, and to visualize the most common words.

### Step 1: Import Required Packages

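In this notebook, we’ll use the following packages to perform our analysis:
- `nltk.tokenize`: Splits text into individual words.
- `wordcloud`: Creates a word cloud visualization.
- `matplotlib.pyplot`: Allows us to plot bar charts.
- `collections.Counter`: Counts how often each token occurs.
- `os`: Builds file paths and creates the output folder.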

In [None]:
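# Import necessary libraries
# Note: word_tokenize needs NLTK's Punkt tokenizer data. If it is missing, run
# nltk.download('punkt') once (newer NLTK releases may ask for 'punkt_tab' instead).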
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from collections import Counter
import os

### Step 2: Define Input and Output Paths

Set your input and output directories. The notebook loads the text file from the input folder; the output folder is where any results you save (such as images) should go.
Make sure to replace the placeholder path components in `indir` and `outdir` with the actual folder paths on your machine.

In [None]:
# Define input and output paths
indir = os.path.join('path', 'to', 'your', 'input', 'folder')
outdir = os.path.join('path', 'to', 'your', 'output', 'folder')
os.makedirs(outdir, exist_ok=True)  # Create output directory if it doesn't exist

### Step 3: Load Your Text Document

Now, load the text document you want to analyze. Make sure the file is in the folder specified in the input path.

In [None]:
# Load the text file
file = os.path.join(indir, 'infile.txt')  # replace 'infile.txt' with your actual file name
with open(file, encoding='utf8') as f:
    text = f.read()
print('Sample text preview:', text[:400])  # Show a sample of the loaded text
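
If the folder or file name is mistyped, `open` fails with a `FileNotFoundError`. As a minimal sketch, the loading step could be wrapped in a small helper that reports the path it looked for. The helper name `load_text` and the temporary demo folder are assumptions made for illustration; they are not part of the notebook.

```python
import os
import tempfile

def load_text(path, encoding='utf8'):
    """Hypothetical helper: read a text file, with a clearer error if it is missing."""
    if not os.path.isfile(path):
        raise FileNotFoundError(f"No text file found at: {path}")
    with open(path, encoding=encoding) as f:
        return f.read()

# Demonstration with a temporary folder standing in for your input folder
with tempfile.TemporaryDirectory() as demo_indir:
    demo_file = os.path.join(demo_indir, 'infile.txt')
    with open(demo_file, 'w', encoding='utf8') as f:
        f.write('A short sample text.')
    demo_text = load_text(demo_file)
    print('Characters loaded:', len(demo_text))
```

In your own run you would call `load_text(os.path.join(indir, 'infile.txt'))` instead of using the temporary folder.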

#### Step 3a (Optional): Define Custom Stop Words

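You can define additional words to remove from your text. If you leave the list empty, no additional words will be removed.
Each token is lowercased before it is compared with the list, so write your entries in lowercase.
The list itself is optional, but run this cell in any case: the tokenization step that follows uses the function defined here.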

In [None]:
# Define a function that tokenizes the text and removes custom stop words
def remove_custom_stopwords(text, custom_stopwords=None):
    tokens = word_tokenize(text)
    if custom_stopwords:
        # Compare in lowercase so that matching is case-insensitive
        stop_set = {word.lower() for word in custom_stopwords}
        tokens = [word for word in tokens if word.lower() not in stop_set]
    return tokens

# Set up custom stop words (leave the list empty if not using additional stop words)
custom_stopwords = ['example', 'specific', 'word']
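
To see what the filter does, here is a small sketch on a hand-written token list. The list `demo_tokens` is made up for illustration and stands in for the output of `word_tokenize`, so the example runs without NLTK.

```python
# Toy tokens standing in for the output of word_tokenize (assumed, not real data)
demo_tokens = ['This', 'Example', 'shows', 'a', 'specific', 'word', 'filter', '.']
demo_stopwords = ['example', 'specific', 'word']

# Lowercase each token before comparing, so 'Example' matches 'example'
filtered = [t for t in demo_tokens if t.lower() not in demo_stopwords]
print(filtered)
```

The remaining tokens keep their original casing; only the comparison is done in lowercase.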

#### Step 3b: Tokenization

In [None]:
# Tokenize the text and remove custom stop words (if the list is empty, the text is only tokenized)
tokens = remove_custom_stopwords(text, custom_stopwords)
print("Tokenized text after optional custom stop word removal:", tokens[:20])  # Display sample tokens

### Step 4: Calculate Basic Statistics

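With our tokenized text, we can now calculate some basic statistics, such as the number of unique words (types)
and the total number of words (tokens). Note that `word_tokenize` also returns punctuation marks as tokens,
so they are included in these counts unless they were removed during preprocessing.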

In [None]:
# Count total words and unique words
word_counts_total = Counter(tokens)

# Print statistics
print("Total words (tokens):", len(tokens))
print("Unique words (types):", len(word_counts_total))

In [None]:
# Calculate lexical diversity as the type-token ratio (unique words / total words)
ttr = len(word_counts_total) / len(tokens)
print(f"Type-Token Ratio (TTR): {ttr:.2%}")
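
As a worked example of the type-token ratio, the sketch below uses a made-up token list (`demo_tokens` is an assumption, not data from your document). It also shows one possible way to drop punctuation first with `str.isalpha()`, which the notebook itself does not do.

```python
from collections import Counter

# Toy tokens standing in for the tokenized document (assumed, not real data)
demo_tokens = ['the', 'cat', 'sat', 'on', 'the', 'mat', '.']

# Optional: keep alphabetic tokens only, so punctuation does not count as a word
demo_words = [t for t in demo_tokens if t.isalpha()]

demo_counts = Counter(demo_words)
demo_ttr = len(demo_counts) / len(demo_words)
print(f"Types: {len(demo_counts)}, tokens: {len(demo_words)}, TTR: {demo_ttr:.2%}")
```

Here there are 5 types among 6 tokens, because "the" occurs twice. TTR tends to fall as a text gets longer, so it is most meaningful when comparing texts of similar length.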

### Step 5: Visualize Most Common Words in a Bar Chart

We'll plot the most common words in a bar chart.
This shows at a glance which words occur most frequently in your text.

In [None]:
# Count the most common words
number_top_words = 20  # Set the number of most common words to display
most_common_words = word_counts_total.most_common(number_top_words)

# Separate words and counts for plotting
words, counts = zip(*most_common_words)

# Plot bar chart
plt.figure(figsize=(10, 5))
plt.bar(words, counts)
plt.xticks(rotation=45)
plt.title("Most Common Words")
plt.xlabel("Words")
plt.ylabel("Frequency")
plt.show()
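
The plotting cell relies on two small idioms: `Counter.most_common(n)` returns a list of `(word, count)` pairs, and `zip(*pairs)` splits that list into two parallel tuples. A minimal sketch with made-up tokens (all `demo_` names are assumptions for illustration):

```python
from collections import Counter

# Toy tokens standing in for the tokenized document (assumed, not real data)
demo_counts = Counter(['data', 'text', 'data', 'word', 'text', 'data'])

demo_top = demo_counts.most_common(2)     # list of (word, count) pairs
demo_words, demo_freqs = zip(*demo_top)   # two parallel tuples
print(demo_top)
print(demo_words, demo_freqs)
```

The two tuples are what `plt.bar` receives as x positions and bar heights. If the token list is empty, `zip(*...)` has nothing to unpack and raises a `ValueError`.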

### Step 6: Visualize Most Common Words in a Word Cloud

A word cloud visualizes word frequency, where the size of each word indicates its frequency. You can customize the background color and colormap of the word cloud.

In [None]:
# Generate a word cloud from the tokens, so that custom stop word removal is reflected
wordcloud = WordCloud(background_color='white', colormap='viridis').generate(' '.join(tokens))

# Display the word cloud
plt.figure(figsize=(8, 8))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

### Step 7: Visualize Word Frequency by Length

Next, we’ll count how many tokens fall into each word-length category. Tokens shorter than three characters, which includes most punctuation marks, are left out.

In [None]:
# Count words by length (tokens shorter than 3 characters are skipped)
word_lengths = {f"{n} letters": 0 for n in range(3, 10)}
word_lengths["10+ letters"] = 0

for word in tokens:
    length = len(word)
    if 3 <= length <= 9:
        word_lengths[f"{length} letters"] += 1
    elif length >= 10:
        word_lengths["10+ letters"] += 1

# Plot word frequency by length
plt.figure(figsize=(10, 5))
plt.bar(word_lengths.keys(), word_lengths.values())
plt.title("Word Frequency by Length")
plt.xlabel("Word Length")
plt.ylabel("Frequency")
plt.xticks(rotation=45)
plt.show()
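
The same binning can be sketched more compactly with a small function and a `Counter`. The helper name `length_bucket` and the token list are assumptions made for illustration; this is an alternative to the loop above, not a replacement for it.

```python
from collections import Counter

def length_bucket(word):
    """Hypothetical helper: map a word to its length category, or None if shorter than 3."""
    n = len(word)
    if n < 3:
        return None
    return f"{n} letters" if n <= 9 else "10+ letters"

# Toy tokens standing in for the tokenized document (assumed, not real data)
demo_tokens = ['a', 'cat', 'reads', 'statistics', 'and', 'visualization', '.']
demo_buckets = Counter(b for b in map(length_bucket, demo_tokens) if b is not None)
print(dict(demo_buckets))
```

Unlike the dictionary in the notebook cell, a `Counter` built this way only contains categories that actually occur, so empty categories would not appear as zero-height bars.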

### Step 8: Print Most Common Words by Word Length

For each word length category, you can examine the most common words. This can highlight patterns in the use of specific word lengths.

In [None]:
# Define a function to get most common words by length
def most_common_words_by_length(tokens, word_length, top_n=10):
    words = [word for word in tokens if len(word) == word_length]
    word_counts = Counter(words)
    return word_counts.most_common(top_n)

# Example: Get the 10 most common 5-letter words
most_common_five_letter_words = most_common_words_by_length(tokens, 5)
print("Most common 5-letter words:", most_common_five_letter_words)
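
To inspect several length categories at once, the function can be called in a loop. The sketch below repeats the function definition so that it runs on its own, and uses a made-up token list (`demo_tokens` is an assumption, not data from your document).

```python
from collections import Counter

def most_common_words_by_length(tokens, word_length, top_n=10):
    # Same logic as the notebook function, repeated here to keep the sketch self-contained
    words = [word for word in tokens if len(word) == word_length]
    return Counter(words).most_common(top_n)

# Toy tokens standing in for the tokenized document (assumed, not real data)
demo_tokens = ['the', 'cat', 'and', 'the', 'bird', 'sing', 'bird', 'songs']

for n in range(3, 6):
    print(f"{n}-letter words:", most_common_words_by_length(demo_tokens, n, top_n=2))
```

Counting is case-sensitive, so "The" and "the" are treated as different words unless the tokens were lowercased during preprocessing.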