# Counting terms

In this notebook we will look at different ways of counting words. We begin by simply counting how many times a specific term appears in a text. We proceed to calculate relative frequencies, which let us compare occurrences of a term across different texts. Finally, we cover term frequency - inverse document frequency (tf-idf) analysis, which scores terms on how characteristic they are of a specific text.

We also start writing our own functions and introduce various new datatypes.

## Setup

We start by importing the necessary libraries.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt

We select a data source and load it to a DataFrame.

In [None]:
data_file = '/work/Common-files/Data/Datasæt3/20201.csv' # Path to data

df = pd.read_csv(data_file)

To ensure that the data are loaded correctly we inspect the DataFrame with `df.head()`.

In [None]:
df.head()

## Simple counting

The pandas library has various built-in methods we can use to describe our data.

Below we count all occurrences of a specific term in the dataset.

First, we save our search term to a variable, which will make it easier to swap between search terms.

In [None]:
search_term = 'dansk'

The speech data are located in the `text` column. We access this column of our DataFrame with bracket notation and use the string method `count` to count the number of occurrences of our search term in each row of our dataset.

In [None]:
df['text'].str.count(search_term)

If we want to get the total number of occurrences across all rows, we apply the `sum` method.

In [None]:
df['text'].str.count(search_term).sum()
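
One caveat worth keeping in mind: `str.count` interprets the search term as a regular expression and counts every match, including matches inside longer words. The sketch below uses a small made-up DataFrame (`toy`) rather than the real dataset and compares plain substring counting with whole-word counting based on word boundaries (`\b`). The whole-word pattern is one possible refinement and is not something the rest of the notebook relies on.

```python
import re

import pandas as pd

# Made-up example data (not the real dataset).
toy = pd.DataFrame({'text': ['dansk politik og danske værdier',
                             'en dansk lov',
                             'ingen træffere']})

term = 'dansk'

# Substring counting: 'dansk' is also found inside 'danske'.
substring_counts = toy['text'].str.count(term)

# Whole-word counting: escape the term and wrap it in word boundaries.
pattern = r'\b' + re.escape(term) + r'\b'
whole_word_counts = toy['text'].str.count(pattern)

print(substring_counts.tolist())   # [2, 1, 0]
print(whole_word_counts.tolist())  # [1, 1, 0]
```

Which of the two is appropriate depends on the question: substring counting also catches inflected forms, while whole-word counting is stricter.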

If we want to investigate a specific party, we can filter the data before counting occurrences.

In [None]:
df[df['group_name'] == 'S']['text'].str.count(search_term).sum()

Note that missing values in the `group_name` column can cause errors in some filtering operations. If this is the case, we need to drop those rows or replace the missing values with some other value, e.g. an empty string.

In [None]:
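# Assign the result back to the column; calling fillna with inplace=True
# on a selected column is deprecated in recent pandas versions.
df['group_name'] = df['group_name'].fillna('')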

The syntax of the filtering step is a bit convoluted, but the expression `df['group_name'] == 'S'` checks, row by row, whether the value in the `group_name` column matches the name of the target party. We pass the output of the expression to our DataFrame `df`, which then returns only the rows where the expression evaluates to `True`. Finally, we apply the `count` and `sum` methods as before.
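
To make the filtering step easier to follow, it can be split into separate steps. The sketch below uses a small made-up DataFrame (`toy`) and stores the True/False values in a variable (`mask`, a name chosen here for illustration) before using it to select rows.

```python
import pandas as pd

# Made-up example data (not the real dataset).
toy = pd.DataFrame({
    'group_name': ['S', 'V', 'S'],
    'text': ['dansk dansk', 'dansk', 'en dansk lov'],
})

# Step 1: compare every value in the column with the target party.
mask = toy['group_name'] == 'S'
print(mask.tolist())  # [True, False, True]

# Step 2: keep only the rows where the mask is True.
filtered = toy[mask]

# Step 3: count the search term in the remaining rows and add up the counts.
total_hits = filtered['text'].str.count('dansk').sum()
print(total_hits)  # 3
```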

## Calculating relative frequencies

Relative frequency is a measure for comparing occurrences of keywords across a number of differently sized texts.

Relative frequency is calculated by dividing the number of occurrences of a word ($n_i$) by the total number of words in the text ($N$):

$$f_i = \frac{n_i}{N}$$
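
As a worked example with made-up numbers: suppose a term occurs 12 times in a text of 4,000 words and 30 times in a text of 20,000 words. The relative frequencies are then

$$f_i^{(1)} = \frac{12}{4000} = 0.003 \qquad f_i^{(2)} = \frac{30}{20000} = 0.0015$$

so the term is twice as frequent in the first text, even though the raw count is higher in the second.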

Translated into code we can do the following.

In [None]:
# ID of target party.
party = 'S'

# The number of occurrences of our search term.
hits = df[df['group_name'] == party]['text'].str.count(search_term).sum()

# The total number of words.
total = df[df['group_name'] == party]['text'].str.split().str.len().sum()

# Relative frequency calculation
rf = hits / total

print(rf)

We now get a very small number, which in itself does not say a lot. However, if we repeat the process with a different party name, we have a measure for comparing the two without having to worry about the individual sizes of the texts.

In [None]:
# ID of target party.
party1 = 'KF'

# The number of occurrences of our search term.
hits1 = df[df['group_name'] == party1]['text'].str.count(search_term).sum()

# The total number of words.
total1 = df[df['group_name'] == party1]['text'].str.split().str.len().sum()

# Relative frequency calculation
rf1 = hits1 / total1

print(rf1)

Further, we also calculate the relative frequency of our search term across the entire dataset for an average value.

In [None]:
# The number of occurrences of our search term.
full_hits = df['text'].str.count(search_term).sum()

# The total number of words.
full_total = df['text'].str.split().str.len().sum()

full_rf = full_hits / full_total

print(full_rf)

Finally, we can visualise our findings with the `matplotlib` library (`plt`).

We create two lists; one with our party titles (`labels`) and one with our relative frequencies (`values`). We then pass the two lists to the `bar` function, which returns a nice bar plot that illustrates the difference between the parties.

In [None]:
labels = [party, party1, 'Total']

values = [rf, rf1, full_rf]

plt.bar(labels, values)
plt.title(search_term)

## Reusing code with functions

In the examples above we reused a lot of the same code. Whenever we repeat code, it might be useful to write a function instead, which lets us reuse the same code over and over.

Below we refactor our relative frequency calculations into a function. The function takes as arguments a search term, a list of one or more party IDs and a data source (`df`).

In [None]:
def get_relative_frequencies(term, party_list, data, include_total=True, return_data=False):
    """Calculate relative frequencies for a term in a text collection.
    
    The function calculates the relative frequency of a term based on 
    its occurrences in the texts of the specified parties.
    
    At least one party ID must be provided in a list and the source of text data
    should be a DataFrame including the variables 'group_name' and 'text'.
    
    By default, the relative frequency across all parties is included.
    """
    
    # Copy the party IDs so the caller's list is not modified when 'Total' is added.
    labels = list(party_list)
    
    # Prepare list for relative frequency data
    values = []
    
    # For each party name we find the number of matches and the total word count.
    for party in party_list:

        # Select the texts of the current party once and reuse the selection.
        texts = data[data['group_name'] == party]['text']

        hits = texts.str.count(term).sum()

        total = texts.str.split().str.len().sum()
        
        # To avoid division-by-zero errors, we only calculate the relative frequency if there is at least 1 hit.
        # If there are no hits, we add zero to the list of relative frequencies.
        if hits > 0:
            values.append(hits / total)
        else:
            values.append(0)
    
    if include_total:
        labels.append('Total')
        
        values.append(data['text'].str.count(term).sum() / data['text'].str.split().str.len().sum())
    
    # Draw the plot
    plt.bar(labels, values)
    plt.xticks(rotation=45)
    plt.legend([term])
    plt.show()
    
    # If we need to work with the relative frequencies outside the function,
    # we can set the return_data parameter to True.
    if return_data:
        return values

We can now calculate relative frequencies across any number of parties in a single command.

In [None]:
party_list = ['S', 'V', 'RV', 'DF', 'KF', 'SF', 'EL', 'LA']

get_relative_frequencies('pandemi', party_list, df)
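
If the numbers are needed without a plot, for instance to store them in a table, the calculation can be kept in a function of its own. The sketch below is one possible variant, not part of the original notebook: the helper name `relative_frequency_table` and the toy data are made up for illustration. It returns a dictionary that maps each party ID to its relative frequency.

```python
import pandas as pd

def relative_frequency_table(term, parties, data):
    """Return {party: relative frequency of term} without drawing a plot.

    Hypothetical helper for illustration; expects the columns
    'group_name' and 'text' like the function above.
    """
    result = {}
    for party in parties:
        texts = data.loc[data['group_name'] == party, 'text']
        hits = texts.str.count(term).sum()
        total = texts.str.split().str.len().sum()
        # Guard against parties without any text (total word count of 0).
        result[party] = hits / total if total > 0 else 0.0
    return result

# Made-up example data (not the real dataset).
toy = pd.DataFrame({
    'group_name': ['S', 'S', 'V'],
    'text': ['pandemi og økonomi', 'en ny pandemi', 'skat og velfærd'],
})

freqs = relative_frequency_table('pandemi', ['S', 'V', 'EL'], toy)
print(freqs)
```

Separating calculation from plotting makes the function easier to reuse and to check with small, known inputs.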

## TF-IDF

Another way of counting terms is Term Frequency - Inverse Document Frequency (tf-idf) analysis. The tf-idf method is useful for identifying terms that are characteristic of a specific document within a collection of documents. The tf part identifies the most common words in a single document, while the idf part penalises words that also occur in other documents. That means that common words such as _the_ and _a_ receive a high tf score, but the score is cancelled out by idf because the words most likely occur in all documents.

The result of the tf-idf calculation is a list of the most distinctive words for the specific document along with a score of how distinctive they are.

The tf-idf is calculated by multiplying tf and idf. This means we first need to calculate these two statistics.
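
Written as formulas, the variant implemented in this notebook (natural logarithm, no smoothing) can be summarised as follows. The symbols are chosen here for illustration: $n_{t,d}$ is the number of times term $t$ occurs in document $d$, $N_d$ is the total number of words in document $d$, $D$ is the number of documents and $D_t$ is the number of documents that contain $t$.

$$\mathrm{tf}(t, d) = \frac{n_{t,d}}{N_d} \qquad \mathrm{idf}(t) = \ln\frac{D}{D_t} \qquad \text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)$$

A term that occurs in every document has $D_t = D$, so its idf is $\ln 1 = 0$ and its tf-idf score is 0 regardless of how frequent it is.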

Before we start, we need to rearrange our data into a more suitable format. Our starting point will be a text string for each party we want to include in the analysis. We extract all text data from two parties and save them to two text strings (`text_a` and `text_b`).

In [None]:
party_a = 'KF'
party_b = 'SF'

text_a = ' '.join(df[df['group_name'] == party_a]['text'])
text_b = ' '.join(df[df['group_name'] == party_b]['text'])

<u>Alternatively</u>, if we want to use texts saved as text files, we can load them into text strings with the code below.

In [None]:
# Paths to file locations
path_to_text_a = '' # INPUT PATHS TO TEXT FILES.
path_to_text_b = ''

with open(path_to_text_a) as f:
    text_a = f.read()

with open(path_to_text_b) as f:
    text_b = f.read()

#### Split the strings into lists
Once we have our texts loaded into strings, we split the strings into lists of words. This kind of list is often referred to as a bag of words. The bag of words method is useful for analyses where word order is irrelevant, such as tf-idf. This means that we could randomise the word order and make the text unintelligible to humans without changing the outcome of the analysis. This is especially useful when working with sensitive data.

Below we use the `split()` method, which by default splits our string on whitespace, i.e. spaces, tabs and line breaks.

In [None]:
bag_of_words_a = text_a.split()
bag_of_words_b = text_b.split()
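
A small sketch of what whitespace splitting does, using a made-up sentence: punctuation stays attached to the words, so `klima,` and `klima` would be counted as different words. The optional cleaning step (lowercasing and stripping ASCII punctuation from the ends of each token) is an assumption added for illustration and is not applied in the rest of the notebook.

```python
import string

# Made-up example sentence containing a space, a tab and a line break.
sample = 'Vi taler om Klima,\tenergi og\nvelfærd.'

# Plain whitespace split: punctuation stays attached to the words.
raw_tokens = sample.split()
print(raw_tokens)
# ['Vi', 'taler', 'om', 'Klima,', 'energi', 'og', 'velfærd.']

# Optional cleaning: lowercase each token and strip punctuation from both ends.
clean_tokens = [token.lower().strip(string.punctuation) for token in raw_tokens]

# Drop tokens that consisted of punctuation only and are now empty.
clean_tokens = [token for token in clean_tokens if token]
print(clean_tokens)
# ['vi', 'taler', 'om', 'klima', 'energi', 'og', 'velfærd']
```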

#### __OPTIONAL__: NLTK tokenization
If we want to split our strings into words in a more sophisticated way, we can use a dedicated tokenizer.

Apart from splitting on whitespace, NLTK's `word_tokenize` function takes care of punctuation and other neat stuff. The trade-off is that it is much slower.

In [None]:
from nltk import word_tokenize

bag_of_words_a = word_tokenize(text_a)
bag_of_words_b = word_tokenize(text_b)

### TF-IDF calculations
Now, our data are prepared for the analysis. Next, we need to count all the words in our two texts. We use the `set` datatype to create a list of unique words. Sets are like lists except that all elements in a set must be unique. That means that when we convert a list into a set, all duplicates are removed.

To get all the unique words across the two texts, we join our two bags of words and convert them to a set. Now that we have changed the datatype, no duplicates are allowed and we get a list (technically not a list but a set) of all unique words across the two texts.

In [None]:
unique_words = set(bag_of_words_a + bag_of_words_b)
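
The effect of the conversion is easy to see on two tiny, made-up bags of words (the variable names below are chosen for illustration only):

```python
# Two made-up bags of words with repeated and shared words.
toy_bag_a = ['vi', 'skal', 'vi']
toy_bag_b = ['skal', 'de']

# Joining the lists keeps every occurrence.
joined = toy_bag_a + toy_bag_b
print(len(joined))  # 5

# Converting to a set removes all duplicates.
toy_unique = set(joined)
print(len(toy_unique))  # 3

# Sets are unordered, so we sort before printing to get a stable result.
print(sorted(toy_unique))  # ['de', 'skal', 'vi']
```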

Now that we know all the unique words of the texts, we just need to count them.

To keep track of our word counts, we introduce a new datatype: the dictionary (`dict`). Dictionaries are useful when we want to store data in a way that allows us to access specific parts of the data easily.

#### OPTIONAL: More about dictionaries
Dictionaries are related to lists, because they can both store multiple values. However, instead of an index, the value of a dictionary is accessed by using a _key_.

An example: I have a dictionary (`days_per_month`) and I want to save the number of days in each month in the dictionary. I use the names of the months as my keys and the numbers of days as my values. To add a value to the dictionary, I assign it to the correct key: `days_per_month['january'] = 31`, and repeat for the remaining months. Now I can retrieve the values by passing a specific key to the dictionary, e.g. `days_per_month['november']` will return the value `30`. If I pass a key that has not been assigned a value, I will get an error (a `KeyError`).

Dictionaries are initiated with either curly brackets (`{}`) or the `dict()` command. Our previous dictionary could have been initiated with either `days_per_month = {}` or `days_per_month = dict()`.

Dictionaries don't have to be empty from the beginning; they can be initiated with data. The syntax requires each key and value to be separated by a colon and each key-value pair to be separated by a comma. Our previous dictionary could be initiated with data in this way:

`days_per_month = {'january': 31, 'february': 28, 'march': 31, 'april': 30, 'may': 31, 'june': 30, 'july': 31, 'august': 31, 'september': 30, 'october': 31, 'november': 30, 'december': 31}`
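
The operations described above can be tried out in a few lines. This is a minimal sketch with a shortened version of the dictionary; the `get` method shown at the end is a standard way of supplying a fallback value instead of raising an error.

```python
# Start with a shortened dictionary and add a value by key.
days_per_month = {'january': 31, 'february': 28}
days_per_month['november'] = 30

# Retrieve a value by passing its key.
print(days_per_month['november'])  # 30

# A key that has not been assigned a value raises a KeyError.
try:
    days_per_month['december']
except KeyError as error:
    print('Missing key:', error)

# The get method returns a fallback value instead of raising an error.
fallback = days_per_month.get('december', 0)
print(fallback)  # 0
```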

__Enough about dictionaries - back to our analysis.__

We use the `dict` command with the `fromkeys` method to create a dictionary which has all the words from `unique_words` as keys with an initial value of 0 each.

Then we use a `for` loop to iterate over every word in our first bag of words. Each time, we use the word as key in our dictionary of all unique words (`num_of_words_a`) and increment the value by 1. Once the loop has finished our dictionary contains the word counts for all words in our first text.

We repeat the process for our second bag of words and save the word counts in another dictionary (`num_of_words_b`).

In [None]:
num_of_words_a = dict.fromkeys(unique_words, 0)

for word in bag_of_words_a:
    num_of_words_a[word] += 1

num_of_words_b = dict.fromkeys(unique_words, 0)

for word in bag_of_words_b:
    num_of_words_b[word] += 1
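
As an aside, Python's standard library offers `collections.Counter`, which does the counting loop for us. The sketch below, on made-up toy data, shows one way to arrive at the same kind of dictionary, including zero counts for words that only occur in the other text. It is an alternative to the loop above, not a replacement the rest of the notebook depends on.

```python
from collections import Counter

# Made-up toy data: one bag of words and the unique words across both texts.
toy_bag = ['vi', 'skal', 'vi']
toy_unique_words = {'vi', 'skal', 'de'}

# Counter counts every element of the list in one go.
counts = Counter(toy_bag)

# A Counter returns 0 for missing keys, so words that only occur
# in the other text get a count of 0.
toy_num_of_words = {word: counts[word] for word in toy_unique_words}

print(toy_num_of_words['vi'])    # 2
print(toy_num_of_words['skal'])  # 1
print(toy_num_of_words['de'])    # 0
```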

#### Term frequency
Similar to relative frequency, term frequency is a measure of how many times a term occurs in a document relative to the length of the document.

We define a function which lets us calculate the term frequency for all words in our word count dictionary at the same time.

The function needs a word count dictionary and a bag of words. After some initial setup, the function loops over all the items in the word count dictionary, calculates tf by dividing the word count for the specific word by the total number of words in the bag of words, and saves the value to a new dictionary. Finally, the new dictionary with the tf value for each word is returned.

In [None]:
def compute_tf(num_of_words, bag_of_words):
    # Empty dictionary for results
    tf_dict = dict()
    
    # The number of words in the bag of words
    total_word_count = len(bag_of_words)
    
    # Calculate term frequency and add to new dictionary
    for word, count in num_of_words.items():
        tf_dict[word] = count / total_word_count
    
    return tf_dict

Term frequencies are calculated for both texts.

In [None]:
tf_a = compute_tf(num_of_words_a, bag_of_words_a)
tf_b = compute_tf(num_of_words_b, bag_of_words_b)

#### Inverse document frequency
Next, we need to calculate the inverse document frequency. Again we define a function that lets us process our entire text at once.

This function is a bit more complex because we use nested loops (i.e. loops within loops) and a bit more math.

In [None]:
# The math module is needed to calculate logarithm
import math

def compute_idf(documents):
    # Number of texts
    N = len(documents)
    
    # Dictionary with all unique words and initial values of 0.
    idf_dict = dict.fromkeys(documents[0].keys(), 0)
    
    # By looping through each word of each text,
    # idf_dict initially contains the number of texts
    # each specific word appears in.
    for document in documents:
        for word, val in document.items():
            if val > 0:
                idf_dict[word] += 1
    
    # idf is then calculated for each word
    # and the initial values are replaced.
    for word, val in idf_dict.items():
        idf_dict[word] = math.log(N / val)
    
    return idf_dict

Word counts for the two texts are passed to the function as a list.

In [None]:
idfs = compute_idf([num_of_words_a, num_of_words_b])
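
With only two texts, idf can take just two values: a word found in one text gets `log(2 / 1)`, roughly 0.693, and a word found in both gets `log(2 / 2)`, which is 0. The sketch below repeats the calculation on made-up word counts to make this visible. The variable names are chosen for illustration and the logic mirrors, but does not call, the function above.

```python
import math

# Made-up word counts for two texts sharing the same set of keys.
toy_counts_a = {'skat': 2, 'klima': 0, 'og': 3}
toy_counts_b = {'skat': 0, 'klima': 1, 'og': 4}
toy_documents = [toy_counts_a, toy_counts_b]

n_docs = len(toy_documents)

toy_idf = {}
for word in toy_counts_a:
    # Number of texts in which the word occurs at least once.
    docs_with_word = sum(1 for doc in toy_documents if doc[word] > 0)
    toy_idf[word] = math.log(n_docs / docs_with_word)

print(toy_idf['skat'])  # log(2), roughly 0.693: only in the first text
print(toy_idf['og'])    # 0.0: occurs in both texts
```

This is one reason why the analysis becomes more informative with more texts: the idf values can then distinguish between words that are rare, fairly common and ubiquitous.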

#### TF-IDF
Now that we have the inverse document frequency for each unique word, we can finally calculate the tf-idf by multiplying the tf values of each text with the idf values. Again, we use a function.

In [None]:
def compute_tf_idf(tf_dict, idf):
    # Empty dictionary for results
    tf_idf = dict()
    
    # tf-idf calculation for each word
    for word, val in tf_dict.items():
        tf_idf[word] = val * idf[word]
    
    return tf_idf

Finally we can get the tf-idf for our two texts.

In [None]:
tf_idf_a = compute_tf_idf(tf_a, idfs)
tf_idf_b = compute_tf_idf(tf_b, idfs)

### Back to the DataFrame
Now that our analysis is done, we return to pandas and convert our results into a DataFrame.

In [None]:
tf_idf_df = pd.DataFrame([tf_idf_a, tf_idf_b])

To get the words most distinctive of the first text, we sort the values of the first row in descending order and return the first 10 elements with `head(10)`.

In [None]:
top10_a = tf_idf_df.loc[0].sort_values(ascending=False).head(10)

We can now inspect the words which, according to the tf-idf analysis, are the most significant to this text.

In [None]:
print(top10_a)
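
To see how the conversion and sorting work, it can help to run the same steps on two tiny, made-up score dictionaries. Each dictionary becomes one row, each word becomes one column, and `loc[0]` selects the row of the first text.

```python
import pandas as pd

# Made-up tf-idf scores for two texts (same keys in both dictionaries).
toy_scores_a = {'skat': 0.2, 'klima': 0.0, 'og': 0.0}
toy_scores_b = {'skat': 0.0, 'klima': 0.1, 'og': 0.0}

# One row per text, one column per word.
toy_df = pd.DataFrame([toy_scores_a, toy_scores_b])
print(toy_df.shape)  # (2, 3)

# Sort the first row in descending order and keep the highest score.
toy_top_a = toy_df.loc[0].sort_values(ascending=False).head(1)
print(toy_top_a.index[0])  # skat
```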

Further, we can visualise the results with a simple bar plot.

In [None]:
# Bar plot with labels and values.
plt.bar(top10_a.index, top10_a)

# Rotate labels to avoid overlap.
plt.xticks(rotation=45)

plt.show()

To inspect the results for the second text, we repeat the process with the second row of the DataFrame.

In [None]:
top10_b = tf_idf_df.loc[1].sort_values(ascending=False).head(10)

In [None]:
top10_b

In [None]:
# Bar plot with labels and values.
plt.bar(top10_b.index, top10_b)

# Rotate labels to avoid overlap.
plt.xticks(rotation=45)

plt.show()

## Conclusion

This concludes our counting notebook. We have covered different ways of counting terms at different levels, from simple word counts through relative frequencies to tf-idf.

Each analysis can be modified and improved in various ways. The tf-idf in particular would benefit from using more texts for a more robust result. All analyses could also be improved by processing and cleaning the data first, for instance by lowercasing all words and removing punctuation.