# Doing Things with Text 3c: Statistics on multiple text documents

This notebook computes some basic statistics for the texts in one or more CSV files.

### Step 0: Install packages (only the first time)

In [None]:
!pip install wordcloud

### Step 1: Setting Up NLTK

NLTK (Natural Language Toolkit) is a library for working with text. To use it, you'll need to download some additional language data the first time you use NLTK. Run the following cell once:

In [None]:
# Import NLTK and download required packages
import nltk
nltk.download('punkt')  # Tokenizer
nltk.download('stopwords')  # Stopwords

### Step 2: Importing Required Packages

- `pathlib.Path`: Provides an object-oriented interface for filesystem paths.
- `csv`: Provides functionality for reading from and writing to CSV (Comma-Separated Values) files.
- `tqdm.tqdm`: Provides a progress bar for loops and iterable tasks, helping to track the progress of operations in real-time.
- `matplotlib.pyplot`: Allows for creating visualizations like charts and graphs to represent data visually.
- `pandas`: Provides tools for handling and analyzing structured data in tables, making it easier to work with datasets.
- `collections.Counter`: A specialized dictionary that counts the occurrences of elements in an iterable.
- `wordcloud`: Creates a word cloud visualization.

In [None]:
from pathlib import Path
import csv
from tqdm import tqdm
import matplotlib.pyplot as plt 
import pandas as pd
from collections import Counter
from wordcloud import WordCloud

### Step 3: Define Input and Output Paths

Define where your CSV files are located (`indir`) and where you want to save the output files (`outdir`).

In [None]:
# Define input and output paths
indir = Path('/Path/to/indir/')
outdir = Path('/Path/to/outdir/')
outdir.mkdir(parents=True, exist_ok=True)  # Create the output directory if it doesn't exist

allfiles = sorted(indir.glob("*.csv"))

dataset = 'dataset'  # replace with the name of your actual dataset; it is used in the output file names

Check what's in `allfiles`:

In [None]:
for file in allfiles:
    print(file)
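
If nothing is printed, the usual suspects are a wrong `indir` or a different file extension. The sketch below shows how the `sorted(indir.glob("*.csv"))` pattern behaves on a throwaway directory; the file names are hypothetical and only serve as illustration.

```python
from pathlib import Path
import tempfile

# Hypothetical file names, created in a temporary directory for illustration
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    for name in ["b_1901.csv", "a_1900.csv", "notes.txt"]:
        (demo_dir / name).write_text("year\ttext\n", encoding="utf-8")

    # Same pattern as above: only *.csv files, sorted by path
    demo_files = sorted(demo_dir.glob("*.csv"))
    demo_names = [f.name for f in demo_files]

print(demo_names)
```

Files with another extension (here `notes.txt`) are skipped, and the sort makes the processing order reproducible.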

#### Step 3.1: Check what the data structure of the CSV files looks like (replace 'file.csv' with one of the actual files in indir)

In [None]:
df_test = pd.read_csv(indir / 'file.csv', sep='\t')  # the most common separators are ';', ',' and '\t'
print(df_test.head())
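
If you are unsure which separator a file uses, the standard library can make an educated guess. This is a minimal sketch using `csv.Sniffer` on a hypothetical in-memory sample; in practice you would pass the first few lines of one of your own files, and the guess should still be checked by eye.

```python
import csv

# Hypothetical sample; in practice read the first few lines of one of your files
sample = "year\ttext\n1900\thello world\n1901\tfoo bar\n"

# Restrict the guess to the three separators mentioned above
dialect = csv.Sniffer().sniff(sample, delimiters=";,\t")
detected_sep = dialect.delimiter
print(repr(detected_sep))
```

The detected value can then be passed as `sep` to `pd.read_csv`.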

In [None]:
cols_to_keep = ['year', 'text'] # change according to the column headers in your csv above
index_col = 'year' # preferably, the date column

Some necessary functions

In [None]:
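def save_corpus(dataset):
    """Return the dataset name in lowercase with spaces replaced by underscores, for use in file names."""
    dataset_out = dataset.replace(" ", "_").lower()
    return dataset_out

def remove_user_defined_stopword_list(words):
    """Remove the words listed in the global `custom_words` (defined in Step 5c) from `words`."""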
    new_words = []
    for word in words:
        if word not in custom_words:
            new_words.append(word)
    return new_words
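
On large corpora, checking each word against a list can become slow. A possible variant, sketched here with a hypothetical helper `remove_stopwords` that takes the stopwords as an argument instead of reading a global, converts the list to a set first:

```python
def remove_stopwords(words, stopwords):
    """Return the words that are not in `stopwords` (hypothetical helper)."""
    stopword_set = set(stopwords)  # membership tests on a set are fast on average
    return [word for word in words if word not in stopword_set]

# Hypothetical tokens and stopwords for illustration
demo_tokens = ["het", "huis", "van", "de", "buurman"]
demo_stopwords = ["het", "van", "een"]
filtered = remove_stopwords(demo_tokens, demo_stopwords)
print(filtered)
```

The result is the same as with the list-based function; only the lookup is faster.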

### Step 4: Import the CSV files as DataFrames (with df['text'] as the text column) and merge them into one large DataFrame called 'data'

In [None]:
data = []  # Use a list to collect DataFrames

for file in tqdm(allfiles):
    df = pd.read_csv(file, sep="\t", usecols=cols_to_keep, index_col=index_col)
    data.append(df)  # Append the DataFrame to the list

# Concatenate all DataFrames at once
data = pd.concat(data, axis=0, ignore_index=False)     

#### (Optional) step 4a: group rows in dataframe 'data' by year

In [None]:
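# The format must match the dates in the index column. See https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes
data.index = pd.to_datetime(data.index, format="%Y-%m-%d")
data = data.sort_index()
# Join the texts of each year with a space; summing strings would glue the last word of one row to the first word of the next
data = data.groupby(data.index.year).agg({'text': ' '.join})  # assumes 'text' holds strings without missing values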
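
When rows of plain strings are grouped, the pieces should be joined with a space; concatenating them directly would fuse the last word of one row with the first word of the next. A minimal standard-library sketch of that idea, with hypothetical `(year, text)` rows standing in for the DataFrame:

```python
from collections import defaultdict

# Hypothetical (year, text) rows standing in for the DataFrame
rows = [(1900, "een korte tekst"), (1901, "iets anders"), (1900, "nog een tekst")]

texts_per_year = defaultdict(list)
for year, text in rows:
    texts_per_year[year].append(text)

# Join with a space so the words at row boundaries stay separate
grouped = {year: " ".join(texts) for year, texts in sorted(texts_per_year.items())}
print(grouped)
```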

### Step 5: Make single lists and strings

**Turn text column in data into big list 'input_as_list'**

#### Option 1: 'text' column is a list of words

In [None]:
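# Assumes every cell already holds a Python list. A list read from a CSV file arrives as a string
# such as "['a', 'b']" and has to be parsed first; otherwise this loop yields single characters.
input_as_list = [item for sublist in data['text'] for item in sublist]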
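
If the lists were written to CSV as text, one way to recover them is `ast.literal_eval` from the standard library. The sketch below uses hypothetical cell values and assumes every cell is a valid Python list literal:

```python
import ast

# Hypothetical cell values as they would come out of a CSV file
cells = ["['de', 'oude', 'stad']", "['een', 'nieuwe', 'dag']"]

# Parse each cell back into a real list, then flatten into one list of words
parsed = [ast.literal_eval(cell) for cell in cells]
flat = [word for sublist in parsed for word in sublist]
print(flat)
```

On a DataFrame the same parsing step could be applied to the text column before flattening.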

#### Option 2: 'text' column is a string of words

In [None]:
input_as_list = [word for text in data['text'] for word in text.split()]

In [None]:
print(input_as_list[:100])

**Turn text data into big string 'input_as_string'**

#### Option 1: 'text' column is a list of words

In [None]:
input_as_string = " ".join(word for sublist in data['text'] for word in sublist)

#### Option 2: 'text' column is a string of words

In [None]:
input_as_string = " ".join(data['text'].tolist())

In [None]:
print(input_as_string[:100])

#### (Optional) Step 5c: User defined stopwords (for wordcloud and Counter). Change if needed!

In [None]:
custom_words = ['het', 'van', 'een', 'dat', 'zijn']  # add words to the list: 'word', 'word', 'word', etc. Use [] to keep all words; Steps 8 and 9 also use this list

Remove `custom_words` from `input_as_list`:

In [None]:
input_as_list = remove_user_defined_stopword_list(input_as_list)

### Step 6: Calculate basic statistics

With our tokenized text, we can now calculate some basic statistics, such as the number of unique words (types)
and the total number of words (tokens).

`word_counts_total` below is a `Counter` object that holds the frequency of each word in `input_as_list`. It feeds the bar chart in Step 7. Words that need to be removed from the bar chart can be added to the custom stopword list `custom_words` above.

In [None]:
# Count how often each word occurs
word_counts_total = Counter(input_as_list)

In [None]:
print("The total number of tokens in %s is: %s"%(dataset, len(input_as_list)))

In [None]:
print("The total number of types in %s is: %s" %(dataset, len(word_counts_total)))

**Calculate lexical diversity by dividing number of types by number of tokens (= type token ratio, or TTR)**

In [None]:
print(f"The type token ratio of {dataset} is: {round(len(word_counts_total)/len(input_as_list)*100, 1)}%")
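
To make the calculation concrete, here is a small worked example on a hypothetical sentence. Keep in mind that the type token ratio tends to drop as a text gets longer, so it is best compared between corpora of similar size.

```python
from collections import Counter

# Hypothetical mini corpus
demo_tokens = "de kat zag de hond en de hond zag de kat".split()
demo_counts = Counter(demo_tokens)

n_tokens = len(demo_tokens)  # total number of words
n_types = len(demo_counts)   # number of distinct words
ttr = n_types / n_tokens

print(f"{n_types} types / {n_tokens} tokens = {round(ttr * 100, 1)}%")
```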

### Step 7: Visualize most common words in a bar chart

In [None]:
number_top_words = 20 # set number of most common words to print/plot
most_common_total = word_counts_total.most_common(number_top_words)
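
`most_common` returns a list of `(word, count)` tuples, ordered from most to least frequent. A minimal illustration with hypothetical tokens:

```python
from collections import Counter

# Hypothetical tokens
demo_counts = Counter(["stad", "dag", "stad", "huis", "dag", "stad"])
top_two = demo_counts.most_common(2)
print(top_two)
```

This list-of-tuples shape is what the plotting cell unpacks into `x` (words) and `y` (counts).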

In [None]:
# From https://stackoverflow.com/questions/63018726/counter-and-plot-the-most-common-word-in-a-text

y = [count for word, count in most_common_total]
x = [word for word, count in most_common_total]

plt.rcParams["figure.figsize"] = (20,10)
plt.bar(x, y, color='crimson')
plt.title("Most common terms")
plt.ylabel("Counts")
plt.xlabel("Terms")
plt.rc('xtick',labelsize=12)
plt.rc('ytick',labelsize=12)
#plt.yscale('log') # optionally set a log scale for the y-axis
plt.xticks(rotation=45)
for i, (word, count) in enumerate(most_common_total):
    plt.text(i, count, f' {count} ', rotation=45, size=16,
             ha='center', va='top' if i < 10 else 'bottom', color='white' if i < 10 else 'black')
plt.xlim(-0.6, len(x)-0.4) # optionally set tighter x lims
plt.tight_layout() # change the whitespace such that all labels fit nicely
plt.savefig(outdir / f'{save_corpus(dataset)}_most_common.png', dpi=200, bbox_inches='tight') # change filename as wished
plt.show()

### Step 8: Visualize most common words in a word cloud

**Generate word cloud**

A word cloud visualizes word frequency, where the size of each word indicates its frequency. You can customize the background color and colormap of the word cloud.

In [None]:
# Generate a word cloud
text_cloud = WordCloud(background_color='white', stopwords=custom_words).generate(input_as_string)

# Display the word cloud
plt.imshow(text_cloud, interpolation='bilinear')
plt.axis('off')
plt.savefig(outdir / f'{save_corpus(dataset)}_wordcloud.png', dpi=200, bbox_inches='tight') # change filename as wished
plt.show()

### Step 9: Most common words per dataframe row in bar charts and lists

In [None]:
outdir_bar = outdir / f'{save_corpus(dataset)}_barchart_per_csv_row'
outdir_bar.mkdir(parents=True, exist_ok=True)  # Create the output directory if it doesn't exist

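# 'w' starts a fresh file on every run (use 'a' to append to an existing file instead)
with open(outdir / f'{save_corpus(dataset)}_mostcommon_year.txt', 'w', encoding='utf-8') as f: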
    print('Most common words per year in %s\n' % (dataset), file=f)
    for date, row in zip(data.index, data['text']):
        # Tokenize the text into words
        tokens = row.split()  # Alternatively, use word_tokenize(row) for more accurate tokenization (requires: from nltk.tokenize import word_tokenize)
        cleaned_tokens = remove_user_defined_stopword_list(tokens)  # Apply stopword removal
        word_counts = Counter(cleaned_tokens)  # Count word frequencies
        most_common_words = word_counts.most_common(number_top_words)

        y = [count for word, count in most_common_words]
        x = [word for word, count in most_common_words]
    
        plt.rcParams["figure.figsize"] = (20,10)
        plt.bar(x, y, color='crimson')
        plt.title("Top term frequencies in " + str(date))
        plt.ylabel("Counts")
        #plt.yscale('log') # optionally set a log scale for the y-axis
        plt.xticks(rotation=45)
        for i, (word, count) in enumerate(most_common_words):
            plt.text(i, count, f' {count} ', rotation=45,
                     ha='center', va='top' if i < 10 else 'bottom', color='white' if i < 10 else 'black')
        plt.xlim(-0.6, len(x)-0.4) # optionally set tighter x lims
        plt.tight_layout() # change the whitespace such that all labels fit nicely
        plt.savefig(outdir_bar / f'{save_corpus(dataset)}_most_common_{date}.png', dpi=200, bbox_inches='tight') # change filename as wished
        plt.show()
        
        print('Most common words in %s:' % (date))
        print('Most common words in %s:' % (date), file=f)
        for word, count in most_common_words:
            print('%s: %7d' % (word, count))
            print('%s: %7d' % (word, count), file=f)
        print('\n')
        print('\n', file=f)

### Step 10: Visualize Total Word Frequency by Length

Next, we’ll analyze word frequency by word length to understand the distribution of different word lengths in your text.

In [None]:
# Count words by length
word_lengths = {'3 letters': 0, 
                '4 letters': 0, 
                '5 letters': 0, 
                '6 letters': 0, 
                '7 letters': 0, 
                '8 letters': 0, 
                '9 letters': 0, 
                '10+ letters': 0}

# Words shorter than 3 letters are not counted
for word in input_as_list:
    length = len(word)
    if 3 <= length <= 9:
        word_lengths[f"{length} letters"] += 1
    elif length >= 10:
        word_lengths["10+ letters"] += 1

# Plot word frequency by length
plt.figure(figsize=(10, 5))
plt.bar(word_lengths.keys(), word_lengths.values())
plt.title("Word Frequency by Length")
plt.xlabel("Word Length")
plt.ylabel("Frequency")
plt.xticks(rotation=45)
plt.savefig(outdir / f'{save_corpus(dataset)}_word_freq_length.png', dpi=200, bbox_inches='tight') # change filename as wished
plt.show()
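
The bucketing logic can also be written as a small function, which makes the boundaries easy to change. This is a sketch only; `length_bucket` and its `low`/`high` parameters are hypothetical names, not part of the notebook.

```python
from collections import Counter

def length_bucket(word, low=3, high=10):
    """Return a bucket label, or None for words shorter than `low` (hypothetical helper)."""
    n = len(word)
    if n < low:
        return None
    return f"{high}+ letters" if n >= high else f"{n} letters"

# Hypothetical tokens
demo_tokens = ["de", "kat", "huis", "huis", "verantwoordelijk"]
buckets = Counter(b for b in map(length_bucket, demo_tokens) if b is not None)
print(buckets)
```

A `Counter` only contains buckets that actually occur, so empty categories would have to be added back if they should appear in the plot.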

### Step 11: Print Most Common Words by Word Length

For each word length category, you can examine the most common words. This can highlight patterns in the use of specific word lengths.

In [None]:
word_length = 5 # Define word length - can be any number
top_n = 15 # Number of top-frequency words - can be any number

# Define a function to get most common words by length
def most_common_words_by_length(tokens, word_length, top_n=top_n):
    words = [word for word in tokens if len(word) == word_length]
    word_counts = Counter(words)
    return word_counts.most_common(top_n)

most_common_n_letter_words = most_common_words_by_length(input_as_list, word_length)

print("Most common %s-letter words:\n" %(str(word_length)))
for word, frequency in most_common_n_letter_words:
    print('\'%s\': %s' %(word, frequency))
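
To look at several word lengths in one pass instead of one length at a time, the counts can be grouped by length first. A minimal sketch, assuming a flat list of tokens; `most_common_per_length` is a hypothetical helper name.

```python
from collections import Counter, defaultdict

def most_common_per_length(tokens, top_n=3):
    """Map each word length to its `top_n` most frequent words (hypothetical helper)."""
    counts_by_length = defaultdict(Counter)
    for word in tokens:
        counts_by_length[len(word)][word] += 1
    return {length: counts.most_common(top_n)
            for length, counts in sorted(counts_by_length.items())}

# Hypothetical tokens
demo_tokens = ["kat", "kat", "hond", "huis", "hond", "de"]
per_length = most_common_per_length(demo_tokens)
for length, words in per_length.items():
    print(length, words)
```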