# Doing things with text 2

## Counting words from a *preprocessed* text

This notebook introduces some basic statistics for a single (cleaned) text document.

**Note:** To clean a single text document, use notebook 1.

## Step 0: Install necessary packages

In [None]:
# Run this command once if wordcloud isn't installed yet
!pip install wordcloud

## Step 1: Import Required Packages

Here, we're loading a few packages to help with counting and visualizing:
- `re`: For regular expressions (patterns used for finding and cleaning text).
- `os`: For interacting with the operating system, such as file and directory management.
- `wordcloud`: Generates visually appealing word clouds from text data, where word size represents frequency.
- `matplotlib.pyplot`: A plotting library used to create static, interactive, and animated visualizations in Python.
- `Counter` from `collections`: A specialized dictionary that counts the occurrences of elements in an iterable.
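
To make the `Counter` description concrete, here is a minimal sketch on a made-up token list. The tokens are hypothetical and only serve as an illustration; they are not taken from your text.

```python
from collections import Counter

# Hypothetical toy tokens, used only for illustration
tokens = ["the", "cat", "saw", "the", "dog", "the"]
counts = Counter(tokens)

print(counts["the"])          # how often "the" occurs
print(counts.most_common(2))  # the two most frequent tokens with their counts
print(len(counts))            # number of distinct tokens (types)
```

The same three operations (lookup, `most_common()`, and `len()`) are the ones this notebook relies on later.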

In [None]:
import re
import os
from wordcloud import WordCloud
import matplotlib.pyplot as plt 
from collections import Counter

## Step 3: Set Up Your File Paths

Define where your text file is located (input) and where you want to save your results (output). You will use `os.path.join()` to define your paths. This approach is cross-platform, meaning it will work on Windows, macOS, and Linux.

Replace the placeholders 'path', 'to', 'your', 'input', 'folder' with the folder names that make up the actual path to your files. The output folder does not need to exist. If it doesn't, this code will create it for you.
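
As a small sketch of why `os.path.join()` is preferable to gluing strings together, consider the example below. The folder and file names are hypothetical placeholders.

```python
import os

# Hypothetical folder and file names, used only for illustration
folder = os.path.join("data", "texts")
joined_path = os.path.join(folder, "infile.txt")

# os.path.join inserts the separator that fits the current operating system
print(joined_path)

# Plain string concatenation silently drops the separator before the file name
concatenated = folder + "infile.txt"
print(concatenated)
```

The second path points to a file called `textsinfile.txt` inside `data`, which is rarely what is intended.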

In [None]:
# Define input and output paths
indir = os.path.join('path', 'to', 'your', 'input', 'folder')  # Example: os.path.join('Users', 'yourname', 'Documents')
outdir = os.path.join('path', 'to', 'your', 'output', 'folder')
os.makedirs(outdir, exist_ok=True)  # Create the output directory if it doesn't exist

## Step 4: Load Your Text Document

Now, let's load your text file. Make sure the file is in the folder you specified.

In [None]:
file = 'infile.txt'  # change 'infile.txt' to the actual file name
file_path = os.path.join(indir, file)  # join folder and file name with the correct separator

In [None]:
with open(file_path, encoding='utf8') as f:
    text = f.read()
print('Sample text preview:', text[:400])  # Show a sample of the loaded text

In [1]:
text_name = 'text_name'  # give your document a single-word name (used in plot titles and file names)

### Split the text into single words and store them in a list called 'input_as_list'
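
How a string is split affects the word count. The sketch below uses a hypothetical sample string with a double space and a line break to compare splitting on a single space with splitting on any whitespace.

```python
# Hypothetical sample string with a double space and a line break
sample = "one  two\nthree four"

by_space = sample.split(' ')   # splits on every single space character
by_whitespace = sample.split() # splits on any run of whitespace

print(by_space)       # ['one', '', 'two\nthree', 'four']
print(by_whitespace)  # ['one', 'two', 'three', 'four']
```

If the cleaned text still contains line breaks or repeated spaces, splitting on a single space produces empty strings and merged words, both of which would be counted as tokens.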

In [None]:
input_as_list = text.split()  # split on any whitespace (spaces, tabs, line breaks)

#### Show a sample of the split text

In [None]:
print(input_as_list[:100])

#### Print the length of the list (= total number of words)

In [None]:
print(len(input_as_list))

### (Optional) Step 4a: Remove custom stop words from 'input_as_list'
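
As a minimal sketch of what stop word removal does, the example below filters a made-up token list. Both the tokens and the stop words are hypothetical.

```python
# Hypothetical tokens and stop words, used only for illustration
tokens = ["the", "king", "said", "that", "the", "queen", "would", "come"]
stop_words = {"the", "said", "that", "would", "come"}

# Keep only the tokens that are not in the stop word set
filtered = [token for token in tokens if token not in stop_words]
print(filtered)  # ['king', 'queen']
```

Note that removing stop words shortens the token list, so every count computed afterwards (total words, types, type token ratio) refers to the filtered text.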

In [None]:
def custom_stop_words(words, stop_words):
    """ Given a list of words and a list of custom stop words, return the words that are not stop words """
    new_words = []
    for word in words:
        if word not in stop_words:
            new_words.append(word)
    return new_words

In [None]:
custom_words = ['the', 'and', 'that', 'with', 'said', 'this', 'when', 
                'them', 'were', 'from', 'will', 'there', 'they', 'then', 
                'their', 'your', 'would', 'only', 'even', 'know', 'could', 
                'have', 'where', 'come', 'been', 'made', 'well']

In [None]:
input_as_list = custom_stop_words(input_as_list, custom_words)

## Step 5: Identify and count most common words

### 5.1 Count the total number of types (unique words) in text from 'file' after preprocessing

In [None]:
word_counts_total = Counter(input_as_list)

In [None]:
print("The total number of types in \'%s\' is: %s" %(text_name, len(word_counts_total)))

### 5.2 Calculate lexical diversity by dividing number of types by number of tokens (= type token ratio, or TTR)
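
A worked example on a hypothetical eight-word sentence shows the arithmetic: 3 distinct words divided by 8 words in total gives a TTR of 37.5%.

```python
from collections import Counter

# Hypothetical toy sentence, used only for illustration
tokens = "a rose is a rose is a rose".split()
types = Counter(tokens)  # one entry per distinct word

ttr = len(types) / len(tokens)
print(f"{len(types)} types / {len(tokens)} tokens = {ttr * 100:.1f}%")  # 3 types / 8 tokens = 37.5%
```

TTR tends to decrease as a text gets longer, because common words keep recurring, so it is best compared between texts of similar length.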

In [None]:
print(f"The type token ratio of \'{text_name}\' is: {round(len(word_counts_total)/len(input_as_list)*100, 1)}%")

### 5.3 Visualize most common words in a bar chart

In [None]:
most_common_total = word_counts_total.most_common(20) # set the number of most common words to print/plot (here: 20)

In [None]:
#### From https://stackoverflow.com/questions/63018726/counter-and-plot-the-most-common-word-in-a-text ####

x = [word for word, count in most_common_total]
y = [count for word, count in most_common_total]

plt.rcParams["figure.figsize"] = (20,10)
plt.rc('xtick',labelsize=12) # set tick label sizes before plotting
plt.rc('ytick',labelsize=12)
plt.bar(x, y, color='crimson')
plt.title("Most common terms in %s" %(text_name))
plt.ylabel("Counts")
plt.xlabel("Terms")
#plt.yscale('log') # optionally set a log scale for the y-axis
plt.xticks(rotation=45)
# Label each bar with its count: white inside the 10 tallest bars, black above the rest
for i, (word, count) in enumerate(most_common_total):
    plt.text(i, count, f' {count} ', rotation=45, size=16,
             ha='center', va='top' if i < 10 else 'bottom', color='white' if i < 10 else 'black')
plt.xlim(-0.6, len(x)-0.4) # optionally set tighter x lims
plt.tight_layout() # change the whitespace such that all labels fit nicely
plt.savefig(os.path.join(outdir, '%s_most_common.png' %(text_name)), dpi=200, bbox_inches='tight') # change filename as wished
plt.show()

In [None]:
# Alternative: a simpler bar chart without count labels

# Separate words and counts for plotting
words, counts = zip(*word_counts_total.most_common(20))

# Plot bar chart
plt.figure(figsize=(10, 5))
plt.bar(words, counts)
plt.xticks(rotation=45)
plt.title("Most Common Words in %s" %(text_name))
plt.xlabel("Words")
plt.ylabel("Frequency")
plt.show()

### 5.4 Visualize most common words in a word cloud

*The word cloud takes the original 'text' variable as input, because WordCloud works with strings, not lists. Optionally, pass a custom stop word list via the `stopwords` argument (here: `custom_words` from Step 4a, so run that cell first).*

In [None]:
text_cloud = WordCloud(background_color='white', colormap='viridis', stopwords=custom_words).generate(text)

In [None]:
plt.imshow(text_cloud, interpolation='bilinear')
plt.axis('off')
plt.savefig(os.path.join(outdir, 'wordcloud.png'), dpi=200, bbox_inches='tight') # change filename as wished
plt.show()

## Step 6: Count common words by word length

### 6.1 Visualize distribution of word length in file
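
As an alternative sketch, word lengths can also be tallied with `Counter` without defining the length categories in advance. The token list below is hypothetical.

```python
from collections import Counter

# Hypothetical tokens, used only for illustration
tokens = ["sea", "ship", "storm", "sailor", "sea", "lighthouse"]

# Count how many tokens there are of each length
length_counts = Counter(len(token) for token in tokens)

for length in sorted(length_counts):
    print(f"{length} letters: {length_counts[length]}")
```

Unlike fixed categories, this keeps every length as its own entry, so very short and very long words are not grouped or dropped.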

In [None]:
# Count words by length
word_lengths = {'3 letters': 0, 
                '4 letters': 0, 
                '5 letters': 0, 
                '6 letters': 0, 
                '7 letters': 0, 
                '8 letters': 0, 
                '9 letters': 0, 
                '10+ letters': 0}

# Words of 1 or 2 letters are not counted
for word in input_as_list:
    length = len(word)
    if 3 <= length <= 9:
        word_lengths[f"{length} letters"] += 1
    elif length >= 10:
        word_lengths["10+ letters"] += 1

In [None]:
plt.bar(range(len(word_lengths)), word_lengths.values(), align='center')
plt.title("Word Frequency by Length in %s" %(text_name))
plt.ylabel("Frequency")
plt.xlabel("Word lengths")
plt.xticks(range(len(word_lengths)), word_lengths.keys(), rotation=45)
plt.savefig(os.path.join(outdir, '%s_word_lengths.png' %(text_name)), dpi=200, bbox_inches='tight') # change filename as wished
plt.show()
plt.show()

### 6.2 Print Most Common Words by Word Length
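
To compare several word lengths at once, the lookup can be wrapped in a loop. The sketch below is self-contained: the helper `most_common_by_length` and the token list are hypothetical stand-ins for illustration.

```python
from collections import Counter

def most_common_by_length(tokens, word_length, top_n=3):
    """Return the top_n most frequent tokens that have exactly word_length characters."""
    words = [word for word in tokens if len(word) == word_length]
    return Counter(words).most_common(top_n)

# Hypothetical tokens, used only for illustration
tokens = ["ship", "sail", "ship", "storm", "waves", "storm", "storm"]

results = {}
for n in (4, 5):
    results[n] = most_common_by_length(tokens, n)
    print(f"Most common {n}-letter words:", results[n])
```

With your own data, replace `tokens` with `input_as_list` and adjust the lengths in the loop.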

In [None]:
# Function to get most common words by length
def most_common_words_by_length(tokens, word_length, top_n=15):
    words = [word for word in tokens if len(word) == word_length]
    word_counts = Counter(words)
    return word_counts.most_common(top_n)

In [None]:
# Get the 15 most common n-letter words

n = 5

most_common_n_letter_words = most_common_words_by_length(input_as_list, n)
print("Most common %s-letter words in %s:" %(n, text_name), most_common_n_letter_words)