## Visualisation with Matplotlib

In this tutorial, you will learn about data visualization using Matplotlib. Visualization plays a crucial role in understanding and interpreting data, and one common task is visualizing word frequencies in a given text.

## Why visualise word frequencies?

Understanding word frequencies can provide valuable insights into the content of a text. Whether you are analyzing literary works, exploring patterns in historical documents, or studying any other corpus, visualizing word frequencies can help highlight key themes, identify significant terms, and uncover patterns in language usage.

Matplotlib, a powerful and widely used Python plotting library, provides a straightforward way to create visualizations that enhance our understanding of data. With Matplotlib, we can transform word frequency data into clear and informative visual representations.

## Defining the input text

First, we store the text we want to analyse in a variable so that we can easily access it. Since word frequencies are usually calculated for longer texts, it makes sense to read the input directly from a file.

For this, we need to specify the path to our file. You can use the extract from Alice in Wonderland by Lewis Carroll, which you will find at `'panel-7/data/alice_in_wonderland.txt'`. Alternatively, you can analyse a `.txt` file of your own by uploading it to the same location and specifying its path below:

In [None]:
# TODO: Specify the file path


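# Open the file in read mode ('r'); the explicit encoding assumes the file is UTF-8 encoded
with open(file_path, 'r', encoding='utf-8') as file:
    # Read the entire content of the file
    # Note: the name 'input' shadows Python's built-in input() function in this notebook
    input = file.read()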

# TODO: print the content of the input variable
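As a minimal sketch of the reading step, the snippet below writes a short sample file to a temporary directory and reads it back with `pathlib`. The file name `sample.txt`, its content, and the variable names are made up for illustration; for the exercise, point `file_path` at your own file instead.

```python
import tempfile
from pathlib import Path

# Hypothetical sample file, created only so that this sketch is self-contained
sample_dir = Path(tempfile.mkdtemp())
sample_path = sample_dir / "sample.txt"
sample_path.write_text("Alice was beginning to get very tired.", encoding="utf-8")

# Read the whole file into a single string
sample_text = sample_path.read_text(encoding="utf-8")

# Show how much was read and a short preview instead of the full text
print(len(sample_text), "characters read")
print(sample_text[:20])
```

Printing only a short preview is often more practical than printing the whole text when the input file is long.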


## Pre-processing the Input for Visualisation

Pre-processing is crucial for calculating word frequencies. Steps such as tokenization and stop-word removal ensure that the resulting counts are accurate and meaningful.

From here on, it's up to you: You can perform all the pre-processing steps manually to see how your input changes step by step. This involves more programming :) Or you can do it the easy way and use the text processing pipeline provided by the spaCy library. A third, equally convenient way is to use the `CountVectorizer` provided by `sklearn`, which handles lower-casing, tokenization, and stop-word removal for you (it does not lemmatise).

### Before you get started ...

Please run the code below. It will install the necessary libraries and the spaCy language model. This might take some time ;)

In [None]:
# install the libraries listed in the requirements.txt file
%pip install -r ../.devcontainer/python-3.12/requirements.txt --upgrade-strategy only-if-needed

# install spacy model en_core_web_sm
!python -m spacy download en_core_web_sm 

## Option A: Pre-processing step-by-step

**Step 1: Pre-processing - Setting the text to lower case**

In order to compare words, we have to make sure that words written with capital and lower-case letters are counted as the same word. Apply the `lower()` method to the text so that it contains only lower-case letters. Print your results afterwards.

In [None]:
# TODO: Set the complete text to lower-case and save it in a new variable text_lower_case


# TODO: Print the text
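To see why this step matters, here is a small sketch on a made-up sentence (the `sample` string is purely illustrative). It counts how often "the" occurs before and after lower-casing, using a plain whitespace split as a rough stand-in for tokenization.

```python
# Hypothetical sample sentence with mixed capitalisation
sample = "The Rabbit saw the rabbit-hole. THE END"
sample_lower = sample.lower()
print(sample_lower)

# Without lower-casing, "The", "the" and "THE" are three different strings
count_before = sample.split().count("the")
count_after = sample_lower.split().count("the")
print(count_before, count_after)
```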


**Step 2: Pre-processing - Tokenization**

Since we want to count the words in our text, we need to split the text up into words. Use the `word_tokenize()` function from the `nltk` library on the text for this and print your results afterwards:

In [None]:
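# import word_tokenize and download the tokenizer data it relies on
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')
# Newer NLTK releases look for 'punkt_tab' instead; downloading both is harmless
nltk.download('punkt_tab')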

# TODO: Split the text saved to your variable text_lower_case into tokens and save them in a tokens list


# TODO: Print the tokens list
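To get a feel for what tokenization produces, the sketch below uses a simple regular expression from the standard library as a stand-in. This is not the algorithm behind `word_tokenize()`, which treats contractions and other special cases differently; the helper name `simple_tokenize` and the sample sentence are assumptions made for illustration.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

# Hypothetical sample sentence
sample_tokens = simple_tokenize("alice said: 'oh dear!'")
print(sample_tokens)
```

Note that punctuation marks come out as tokens of their own, which is why a separate filtering step is needed afterwards.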


**Step 3: Pre-processing - Removing punctuation marks and special characters**

Now look at the results: There are still lots of punctuation marks in the text! Think about it: Why might that be a problem?

So let's assume we also want to get rid of all the punctuation marks in the text. How do we do this?
The most straightforward way is to apply the `isalpha()` method to each token and keep only those tokens that consist entirely of alphabetic characters. Be aware that this also drops numbers and tokens containing apostrophes or hyphens. Try it out below:

In [None]:
# TODO: Initialize an empty list filtered_tokens


# Use a for loop to keep only alphabetic tokens: apply the isalpha() method to every token in tokens and store the matching tokens in your filtered_tokens list
for token in tokens:
    if token.isalpha():
        filtered_tokens.append(token)

# TODO: print your results
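The following sketch shows which kinds of tokens `isalpha()` keeps and which it drops. The token list is made up for illustration, and the filter is written as a list comprehension, which is equivalent to the for loop above.

```python
# Hypothetical tokens: a word, punctuation, a hyphenated word, a number, a contraction part
sample_tokens = ["alice", ",", "rabbit-hole", "42", "n't", "down"]

# Keep only tokens that consist entirely of alphabetic characters
alpha_tokens = [token for token in sample_tokens if token.isalpha()]
print(alpha_tokens)
```

Only "alice" and "down" survive here, so hyphenated words and numbers are lost along with the punctuation.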


**Step 4: Pre-processing - Eliminating stop words**

Removing stop words means removing words that carry little meaning on their own, such as "the" or "and". This lets the analysis focus on the words that matter. Let's remove the stop words from our token list with the help of the stop word list provided by the `nltk` library:

In [None]:
# import nltk library and list of stopwords
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

# Get the list of English stop words from NLTK
stop_words = set(stopwords.words('english'))

# TODO: Initialize an empty filtered_tokens_no_stop list to store the remaining tokens


# TODO: Iterate over each token in the filtered_tokens list

    # TODO: Check if the token is not a stop word
    
        # TODO: If it's not a stop word, append the token to the filtered_tokens_no_stop list
        

# TODO: Print your results
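The idea can be sketched without NLTK by using a tiny hand-written stop-word set. The set `sample_stop_words` below is a hypothetical stand-in and far smaller than the NLTK list; it only serves to show the membership test.

```python
# Hypothetical mini stop-word set (the NLTK list is much longer)
sample_stop_words = {"the", "and", "was", "to"}

sample_tokens = ["alice", "was", "beginning", "to", "get", "tired", "and", "sleepy"]

# Keep every token that is not in the stop-word set
content_tokens = [token for token in sample_tokens if token not in sample_stop_words]
print(content_tokens)
```

Storing the stop words in a `set` rather than a list keeps the membership check fast, which is why the cell above wraps `stopwords.words('english')` in `set()`.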



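**Step 5: Pre-processing - Lemmatisation**

Lemmatization in NLP reduces words to their base form (lemma), for example "running" to "run". This groups inflected variants of a word under one entry, which makes the frequency counts more meaningful and helps in tasks such as feature extraction. Let's lemmatise our tokens using the `WordNetLemmatizer` provided by `nltk`. Note that `lemmatize()` treats every token as a noun unless you pass a part-of-speech tag, so verb forms such as "running" are only reduced to "run" when it is called with `pos='v'`: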

In [None]:
# import nltk
import nltk
nltk.download('omw-1.4')
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

# Instantiate the WordNet lemmatizer
lemmatizer = WordNetLemmatizer()

# TODO: Initialize an empty list lemmatized_tokens to store the lemmatized tokens


# Use a for loop to lemmatize each token and append the lemmatised token to the lemmatized_tokens list
for token in filtered_tokens_no_stop:
    lemmatized_tokens.append(lemmatizer.lemmatize(token))

# TODO: Print the lemmatized tokens


**Step 6: Count the word frequencies**

We will now create a Python dictionary to store the word frequencies. Unlike a list, which stores single items, a dictionary stores pairs: each pair consists of a unique key and a value associated with it. For example, if the word 'adventure' occurs 8 times in your text, the dictionary would store the key 'adventure' together with the value 8 like this:

`{'adventure': 8}`

For each token from our text, we will update the corresponding entry in the dictionary. If the token is encountered for the first time, a new entry will be added with a frequency of 1. If the token is already present, its frequency will be incremented by 1.

In [None]:
# Given a list of words, return a dictionary of word-frequency pairs.

# TODO: Define a function that takes a list of words as input

    # TODO: Initialise an empty dictionary to save word frequencies
    

    # TODO: Create a dictionary of word frequencies using a for loop, iterating over every word in the words list
    
        # TODO: Check if the word is already in the dictionary
        
            # TODO: If it is, increment the existing count
            

            # TODO: Otherwise (else branch), add the word to the dictionary with a count of 1
            

    # TODO: Return the dictionary of word frequencies
    

# TODO: Call the function with your list of lemmatised tokens and store the result in a variable


# TODO: Print the resulting word-frequency dictionary
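Once your own function works, you can cross-check its output against `collections.Counter` from the standard library, which performs the same counting. The sketch below uses a made-up token list; `Counter` is offered here as a point of comparison, not as a replacement for the exercise.

```python
from collections import Counter

# Hypothetical list of already pre-processed tokens
sample_tokens = ["alice", "rabbit", "alice", "hole", "rabbit", "alice"]

# Counter builds the token -> frequency mapping in one step
sample_counts = Counter(sample_tokens)

# The two most frequent tokens, highest count first
top_two = sample_counts.most_common(2)
print(top_two)

# A Counter can be converted to a plain dictionary
print(dict(sample_counts))
```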



## Option B: Pre-processing using spaCy

You can also use the tools offered by the spaCy library to pre-process your text. spaCy is a Python library for natural language processing (NLP). We perform the same steps as in Option A: converting the text to lower case, tokenisation, removing punctuation and stop words, and lemmatisation.

Call the `createFreqDict()` function on `input` and save the results in a new variable called `word_freq_dict`. Then print the contents of `word_freq_dict`.

In [None]:
import spacy
from collections import Counter

def createFreqDict(input_text):
    # Convert the input text to lowercase
    input_text_lower = input_text.lower()

    # Load the spaCy language model
    nlp = spacy.load("en_core_web_sm")

    # Tokenize
    doc = nlp(input_text_lower)

    # Initialize Counter to count word frequencies
    word_freq_counter = Counter()

    # Process each token in the document
    for token in doc:
        # Check if the token is a valid word (is_alpha) and not a stop word
        if token.is_alpha and not token.is_stop:
            # Lemmatize the word (get the base form)
            lemma = token.lemma_

            # Update the Counter with the lemmatized word
            word_freq_counter[lemma] += 1

    # Convert Counter to dictionary
    word_freq_dict = dict(word_freq_counter)

    return word_freq_dict

# TODO: Call the createFreqDict() function with your input text and store the result in a variable word_freq_dict


# TODO: Print the resulting word-frequency dictionary


## Option C: Pre-processing using sklearn

Another way to pre-process your text is to use the `CountVectorizer` class provided by the `sklearn` library. It converts a collection of text documents to a matrix of token counts. Apply the `get_word_frequencies()` function to the `input` variable and save the results in a new variable `word_freq_dict`. Then, print the contents of `word_freq_dict`:

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

def get_word_frequencies(input_text):
    # Create an instance of CountVectorizer with stop words removed
    vectorizer = CountVectorizer(stop_words='english')
    X = vectorizer.fit_transform([input_text])

    # Get the feature names (words)
    feature_names = vectorizer.get_feature_names_out()

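    # Convert the sparse matrix to a dense array and take the row for our single document;
    # tolist() turns the NumPy integers into plain Python ints so the dictionary prints cleanly
    counts = X.toarray()[0].tolist()

    # Create a dictionary of word frequencies
    word_freq_dict = dict(zip(feature_names, counts))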

    return word_freq_dict

# TODO: Apply the get_word_frequencies() function to your input text and save it in a variable word_freq_dict


# TODO: Print the variable word_freq_dict



## Visualising your data with Matplotlib

Now, let's plot our word frequency data using Matplotlib. For this, we will create a bar chart. Call the `plot_word_frequencies()` function with `word_freq_dict` as its input. Also pass `top_n=10` to plot only the 10 most frequent words:

In [None]:
# Import the 'matplotlib.pyplot' module and alias it as 'plt'
import matplotlib.pyplot as plt
from operator import itemgetter

# Define a function named 'plot_word_frequencies' that takes a 'word_freq_dict' and 'top_n' as inputs
def plot_word_frequencies(word_freq_dict, top_n=None):
    # Sort the word frequency dictionary by values in descending order
    sorted_word_freq = sorted(word_freq_dict.items(), key=itemgetter(1), reverse=True)

    # If top_n is provided, select the top N items; otherwise, use all items
    sorted_word_freq = sorted_word_freq[:top_n] if top_n is not None else sorted_word_freq

    # Stop early if there is nothing to plot (unpacking an empty list below would fail)
    if not sorted_word_freq:
        print('No word frequencies to plot.')
        return

    # Create a new figure for the plot
    plt.figure(figsize=(10, 6))

    # Unpack the sorted items into separate lists of words and frequencies
    words, frequencies = zip(*sorted_word_freq)

    # Create a bar plot using the top N words and their frequencies
    plt.bar(words, frequencies)

    # Label the x-axis as 'Words'
    plt.xlabel('Words')

    # Label the y-axis as 'Frequencies'
    plt.ylabel('Frequencies')

    # Set the title of the plot to 'Word Frequencies'
    plt.title('Word Frequencies')

    # Rotate x-axis labels for better readability
    plt.xticks(rotation=45, ha='right')

    # Display the plot
    plt.show()

# TODO: Call the function plot_word_frequencies on the word frequency dictionary as input parameter and plot the top 10 highest frequencies
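If the rotated labels on the x-axis are hard to read, a horizontal bar chart is one possible variation. The sketch below is an assumption-laden alternative, not part of the original exercise: the function names `top_items` and `plot_word_frequencies_horizontal` and the sample counts are invented for illustration. The ranking logic sits in its own helper so that it can be checked without drawing anything.

```python
def top_items(word_freq_dict, top_n=10):
    """Return the top_n (word, count) pairs, highest count first."""
    ranked = sorted(word_freq_dict.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


def plot_word_frequencies_horizontal(word_freq_dict, top_n=10):
    """Draw the top_n words as a horizontal bar chart."""
    # Imported here so that top_items() can be used without Matplotlib installed
    import matplotlib.pyplot as plt

    items = top_items(word_freq_dict, top_n)
    if not items:
        print("No word frequencies to plot.")
        return

    words, frequencies = zip(*items)

    fig, ax = plt.subplots(figsize=(8, 5))
    ax.barh(words, frequencies)
    ax.invert_yaxis()  # put the most frequent word at the top
    ax.set_xlabel("Frequency")
    ax.set_ylabel("Word")
    ax.set_title(f"Top {len(items)} word frequencies")
    fig.tight_layout()
    plt.show()


# Hypothetical counts, used only to show the ranking helper
sample_freqs = {"alice": 3, "rabbit": 2, "hole": 1}
print(top_items(sample_freqs, top_n=2))
```

Calling `plot_word_frequencies_horizontal(word_freq_dict, top_n=10)` in a notebook cell would then draw the chart for your own data.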
