# Word clouds

To compare word use across the three papers and visualise it with word clouds, we follow these steps:

* Filter out stopwords.
* Generate word frequencies.
* Create word clouds for each `krantnaam`.


## Import packages

Import the necessary packages for this notebook.

In [None]:
import pandas as pd
import plotly.express as px
import pickle
import spacy
from collections import Counter
import matplotlib.pyplot as plt
from wordcloud import WordCloud

## Load the dataset

In [None]:
# Deserialize
with open('data/preprocessed_docs.pkl', 'rb') as f:
    processed_docs = pickle.load(f)

## Load the model

Next, we load our Dutch NLP model.

In [None]:
# Specify the relative path to the model directory
model_path = "model/nl_core_news_sm"

# Load the model from the relative path
nlp = spacy.load(model_path)

## Stop words

Stop words are a set of commonly used words in a language. Examples of stop words in English are “a,” “the,” “is,” “are,” etc. Stop words are commonly used in Natural Language Processing (NLP) to eliminate words that are so widely used that they carry very little useful information.

We will store stopwords in a Python `set` data structure.

In [None]:
# Let's get the Dutch stopwords from the model
stopwords = nlp.Defaults.stop_words

# print first 10 stopwords
list(stopwords)[:10]
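
As a minimal sketch of why a `set` is convenient here, the example below filters a short token list with a membership test. The names `toy_stopwords` and `toy_tokens` are hypothetical, and the stop words are a hand-picked subset for illustration, not the list shipped with the model.

```python
# Hypothetical, hand-picked stop words (NOT the model's actual list)
toy_stopwords = {"de", "het", "een", "en", "van"}

# Hypothetical tokens standing in for one short document
toy_tokens = ["De", "koning", "en", "de", "minister", "van", "oorlog"]

# Lowercase each token before the membership test, because the set is lowercase
kept = [token for token in toy_tokens if token.lower() not in toy_stopwords]
print(kept)  # ['koning', 'minister', 'oorlog']
```

Membership tests on a set take roughly constant time, which matters when every lemma in the corpus is checked.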

### Create functions to get word frequencies

The functions below filter the lemmas and count word frequencies per paper, storing the results in a single `dictionary` keyed by `krantnaam`.

In [None]:
def filter_lemmas(lemmas, filter_sets):
    # Combine all filter sets into one set
    combined_filters = set().union(*filter_sets)
    return [lemma for lemma in lemmas if lemma.lower() not in combined_filters and lemma.isalpha()]

def create_word_frequencies(processed_docs, filter_sets):
    word_frequencies = {}
    
    for krantnaam in processed_docs['krantnaam'].unique():
        # Filter DataFrame for the current krantnaam
        df_filtered = processed_docs[processed_docs['krantnaam'] == krantnaam]
        
        # Get all lemmas for the current krantnaam, dropping non-alphabetic tokens and anything in the filter sets
        all_lemmas = []
        for lemmas in df_filtered['lemmas']:
            all_lemmas.extend(filter_lemmas(lemmas, filter_sets))
        
        # Calculate word frequencies
        word_freq = Counter(all_lemmas)
        
        # Store the word frequencies in the dictionary
        word_frequencies[krantnaam] = word_freq
        
    return word_frequencies
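
To make the filtering step concrete, here is a small stand-alone sketch of the same idea on toy data: merge several filter sets with `set().union(*...)`, drop filtered and non-alphabetic lemmas, then count with `Counter`. All `toy_*` names and values are hypothetical and only serve as an illustration.

```python
from collections import Counter

# Hypothetical filter sets standing in for stopwords and short_words
toy_stopwords = {"de", "het"}
toy_short_words = {"aan", "dat"}

# set().union(*sets) merges any number of sets into one
combined = set().union(*[toy_stopwords, toy_short_words])

# Hypothetical lemmas for one paper, including a number and a capitalised stop word
toy_lemmas = ["De", "koning", "aan", "stad", "1875", "stad", "het"]

# Keep alphabetic lemmas whose lowercase form is not in the combined filter
kept = [lemma for lemma in toy_lemmas if lemma.lower() not in combined and lemma.isalpha()]

toy_counts = Counter(kept)
print(toy_counts.most_common())  # [('stad', 2), ('koning', 1)]
```

Note that the comparison uses the lowercase form of each lemma, so the filter sets themselves should contain lowercase entries.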

## Word use

Let's have a look at which words are most frequently used in each paper:

* 'De standaard'
* 'Het vaderland : staat- en letterkundig nieuwsblad'
* 'De Tĳd : godsdienstig-staatkundig dagblad'

### Generate word frequencies without stopwords

In [None]:
word_frequencies = create_word_frequencies(processed_docs, filter_sets = [stopwords])

In [None]:
Counter(word_frequencies['De standaard']).most_common(10)

Ouch, OCR rubbish. Let's do a bit more filtering.

## Create a filter set for OCR rubbish

A lot of short words appear to be OCR rubbish.

Let's find all short lemmas.

In [None]:
# Flatten the lemma lists of all documents into one list
all_lemmas = [lemma for lemmas in processed_docs['lemmas'] for lemma in lemmas]

# Collect every lemma shorter than 5 characters, lowercased to match the check in filter_lemmas
short_words = set(lemma.lower() for lemma in all_lemmas if len(lemma) < 5)

# print first 10 short_words
list(short_words)[:10]

### Generate word frequencies without stopwords and short_words

In [None]:
word_frequencies = create_word_frequencies(processed_docs, filter_sets = [stopwords, short_words])

In [None]:
Counter(word_frequencies['De standaard']).most_common(10)

### Generate word frequencies without stopwords, short_words and your own filter?

Make sure to put your words in the set in lowercase, because each lemma is lowercased before it is compared against the filter sets.

In [None]:
my_filter_set = {'jan','alhier','maken','komen','pct'}

In [None]:
word_frequencies = create_word_frequencies(processed_docs, filter_sets = [stopwords, short_words, my_filter_set])

In [None]:
Counter(word_frequencies['De standaard']).most_common(10)

## Word clouds

The code below creates three word clouds, one per paper. Do not forget to adjust the filter.

In [None]:
my_filter_set = {'jan','alhier','maken','komen','pct','aand','april','julij'}

word_frequencies = create_word_frequencies(processed_docs, filter_sets = [stopwords, short_words, my_filter_set])

# Function to generate word cloud
def generate_wordcloud(word_freq, krantnaam, max_words=50):
    wordcloud = WordCloud(width=800, height=400, background_color='white', max_words=max_words).generate_from_frequencies(word_freq)
    
    plt.figure(figsize=(10, 5))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.title(f'Word Cloud for {krantnaam}')
    plt.axis('off')
    plt.show()

# Generate word clouds for each krantnaam
for krantnaam, word_freq in word_frequencies.items():
    generate_wordcloud(word_freq, krantnaam, max_words=50)  # Adjust max_words as needed


## Conclusions?

Now that we have these clouds, what can we interpret from them?
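
When interpreting the clouds, it may help to remember that each cloud is scaled within its own paper, so raw counts are not directly comparable if the papers differ in size. The sketch below shows one possible way to compare: convert counts to shares of the paper's total. The function `relative_frequencies` and the `toy_frequencies` data are hypothetical and not part of the notebook above.

```python
from collections import Counter

# Hypothetical counts for two papers of very different size
toy_frequencies = {
    "Paper A": Counter({"koning": 30, "kerk": 10, "stad": 60}),
    "Paper B": Counter({"koning": 5, "kerk": 10, "stad": 5}),
}

def relative_frequencies(word_freq):
    """Return each word's share of all counted words in one paper."""
    total = sum(word_freq.values())
    return {word: count / total for word, count in word_freq.items()}

relative = {name: relative_frequencies(freq) for name, freq in toy_frequencies.items()}

# 'kerk' has the same raw count in both papers but a very different share
for name, shares in relative.items():
    print(name, round(shares["kerk"], 2))
```

In this toy data, "kerk" occurs 10 times in both papers, yet it makes up a tenth of the words in one and half of the words in the other.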