## Distant reading course week 2 (VT-23)

### Learning material 2a: Counting frequencies and bigrams (and visualising the results)

Matti La Mela

In this learning material, we will use the pandas library to count word frequencies and save them to a csv file for visualization. We can visualise the csv files either with other software (Excel, RAWtools, etc.) or with Python. There is optional material about the basics of visualization in Python.


In [1]:
# We will take Pride and Prejudice from Project Gutenberg with a simple http request:

# NB, you can also open the URL in your browser to see that this is the right text. Remember utf-8!

import requests

request = requests.get("https://www.gutenberg.org/cache/epub/42671/pg42671.txt")

request.encoding = "utf-8"
book = request.text

In [None]:
# Let's see what the book looks like

print(len(book))

print(book[:1000])


In [None]:
# It is quite long, so let's take only the first ten chapters of the book.
# We use the first and the last sentence of this part to find the start and end offsets, and then store the slice in a new string.

start = book.find("It is a truth universally acknowledged")
end = book.find("leaving her room for a couple of hours that evening.")


# We could also use CHAPTER I etc. with find(), but be careful if there is a table of contents in your file.
# NB, find() returns -1 if the sentence is not found, so check the values if the slice looks odd.

end += len("leaving her room for a couple of hours that evening.")  # find() returns the position where the sentence starts, so we add its length to include the sentence itself in the slice


In [None]:
# We assign this slice of ten first chapters to the variable chapters

chapters = book[start:end]

# Let's print the beginning and the end to check that it looks ok:

print(chapters[0:100])

print("")

print(chapters[-100:]) # -100 starts from the 100 chars before the end of the string

print("")

print(len(chapters))
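The offset logic used above can be tried out on a short string first. This is only a minimal sketch: the text and the two sentences are made-up examples, not taken from the book.

```python
# Toy text standing in for the book (hypothetical example string)
text = "Preface. It was a dark night. The end came quickly. Appendix."

first = "It was a dark night."
last = "The end came quickly."

start = text.find(first)   # index where the first sentence begins
end = text.find(last)      # index where the last sentence begins

# find() returns -1 when the substring is missing, so we check before slicing
if start == -1 or end == -1:
    raise ValueError("Start or end sentence not found in the text")

end += len(last)           # move the end offset past the last sentence

excerpt = text[start:end]
print(excerpt)
```

The same check for -1 can be added to the notebook cell if the slice of the book ever looks wrong.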

In [None]:
import re

# We do a bit of cleaning here, in case we want to continue the analysis with plain Python. As we know, spaCy is very helpful with cleaning, but this works too:

chapters_clean = chapters.lower() # lowercase
chapters_clean = chapters_clean.replace("\n", " ") # replace line breaks with " " (if the lines in a text file already end with a space, we could simply remove the line breaks)
chapters_clean = re.sub(r"[^a-z0-9\s]", "", chapters_clean) # remove everything except a-z, 0-9 and whitespace (\s is the regex character class for whitespace)
tokens = chapters_clean.split() # split the string on whitespace into a list of tokens


print(tokens[0:100])

In [None]:
# we can count the occurrences of an element in the list with the count() method

print(tokens.count("she"))
print(tokens.count("he"))


In [None]:
# We can sort the list, and have a look at what it looks like.
# NB, sorted() returns a new sorted list, so our original tokens list keeps the word order of the text.

tokens_sorted = sorted(tokens)

print(tokens_sorted)
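A note on sorting: assigning a list to a new name does not copy it, and the sort() method changes the list in place, whereas sorted() returns a new list. The sketch below shows the difference with a made-up three-word list.

```python
words = ["pride", "and", "prejudice"]

alias = words            # no copy: both names point to the same list
alias.sort()             # sorts in place, so `words` changes too
print(words)             # the original order is lost

words = ["pride", "and", "prejudice"]
ordered = sorted(words)  # returns a new sorted list
print(words)             # the original order is kept
print(ordered)
```

For bigrams the word order matters, so it is safer to keep the original token list untouched.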


In [None]:
# For more counting, it is possible to use a Counter (from the collections module), which is a container for counting the elements of a list.

# We could also take this word list to Excel, for instance, and continue counting and visualization there

# However, we will do more counting with pandas in the following:
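Before moving on, here is a minimal sketch of the Counter mentioned above. The token list is a made-up example; in the notebook you could pass the `tokens` list instead.

```python
from collections import Counter

# Hypothetical short token list; in the notebook this would be `tokens`
sample_tokens = ["she", "said", "he", "said", "she", "was", "she"]

counts = Counter(sample_tokens)

print(counts["she"])          # frequency of a single word
print(counts.most_common(2))  # the two most frequent words with their counts
```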


In [None]:
# Let's continue with spaCy for some operations: we import spaCy, load the language model, and process our text into a spaCy Doc object, called here chapters_doc

import spacy

nlp = spacy.load("en_core_web_sm")

chapters_doc = nlp(chapters, disable=["parser", "ner"])  # We disable the parser and ner components, which are part of the spaCy nlp pipeline, to gain some speed

In [None]:
# Let's take the lemmas of the alphabetic, non-stopword tokens only, and have a look at what the "trash" looks like

# We use "and" in the if statement: when token.is_alpha is True and the token is not a stopword, we save the lemma to our lemmas list.
# Otherwise (else) the token is stored in the cleaned_tokens list, which thus holds the tokens that were cleaned away.

lemmas = []
cleaned_tokens = []

for token in chapters_doc:
    if token.is_alpha and not token.is_stop:
        lemmas.append(token.lemma_)
    else:
        cleaned_tokens.append(token)

print(lemmas[0:200])

# you can see what the non-alphabetic tokens and stopwords look like:

# print(cleaned_tokens[0:200])


In [None]:
# Let's save this to a file, so we can open it with Excel, AntConc, Voyant Tools, or similar for more analysis and visualization

import os

os.makedirs("texts_week2", exist_ok=True)  # create the output folder if it does not exist yet

with open("./texts_week2/output_lemmas.txt", mode="w", encoding="utf-8") as file:
    for lemma in lemmas:
        file.write(lemma)
        file.write("\n")


### 2. Data in tabular format (pandas)

In this exercise we use only pandas Series. A Series contains a single list of values, whereas a pandas DataFrame can contain several. You can compare this to an Excel sheet with only one column (pandas Series) or with several columns (pandas DataFrame).
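The difference between the two can be sketched as follows. The words and the `length` column are made-up examples for illustration only.

```python
import pandas as pd

# A Series: one column of values with an index
words = pd.Series(["pride", "and", "prejudice"], name="word")

# A DataFrame: several columns sharing the same index
table = pd.DataFrame({
    "word": ["pride", "and", "prejudice"],
    "length": [5, 3, 9],
})

print(words)
print(table)
print(type(table["word"]))  # a single DataFrame column is itself a Series
```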

In [None]:
# We want to do some basic calculations in Python about word frequencies. For this we use pandas where we can handle numbers in tables.

import pandas as pd

In [None]:
# We organise our list of lemmas (that we created in the previous section) into a pandas Series

lemmas_series = pd.Series(lemmas, name="chapters_lemmas")

# If you want, you can check the type of our object. We see that we have a Pandas series here.

# print(type(lemmas_series))

In [None]:
# when we print the Series we get only the first and the last entries, which makes it easy to study

print(lemmas_series)

In [None]:
# the method value_counts() of the Series returns another Series where identical values have been counted: we get the frequencies

lemmas_count = lemmas_series.value_counts()

print(lemmas_count)

# Again, we have only one list, so this is a Series. The index holds the lemmas and the values hold their counts.

# print(type(lemmas_count))
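To see how value_counts() arranges the index and the values, it can help to try it on a tiny Series first. The six words below are made-up toy data, not counts from the book.

```python
import pandas as pd

# Hypothetical toy data standing in for the lemma list
toy = pd.Series(["say", "Bingley", "say", "Miss", "say", "Bingley"])

toy_counts = toy.value_counts()   # sorted by frequency, most frequent first

print(toy_counts.index.tolist())  # the distinct values
print(toy_counts.tolist())        # their counts, in the same order
print(toy_counts["say"])          # look up one count by its label
```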


In [None]:
# Here the index holds the terms (lemmas) that we had in the original list:

print(lemmas_count.index)



In [None]:
# The index entries correspond to a list of values, which can be printed when converted to a list. Here we print the first ten values, which are the frequencies of
# "Bingley", "say", "Miss", "Elizabeth" ..

print(list(lemmas_count)[0:10])

In [None]:
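# Let's save the frequencies of our lemmas: we can open the csv in Excel, for instance, for further operations!
# NB, we use "/" in the path: "\" starts an escape sequence in Python strings, and "/" works on all operating systems.

lemmas_count.to_csv("texts_week2/lemmas.csv", encoding="utf-8")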

In [None]:
# One more time, we can have a look at the 20 most common terms:

print(lemmas_count[:20])

### 3. Counting bigrams

Bigrams (and n-grams) are sequences of two words (bi) or n words (n-gram). They are useful for studying how words occur together. There are many applications for n-grams, e.g. predicting word occurrence, or building single entities when two words belong together (e.g. for analysis it is better to have New York as one unit (New_York) than New and York separately).

The bigrams (2-gram) of the sentence "The weather is very good" are:

- The weather
- weather is
- is very
- very good
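The list above can be reproduced with plain Python. This is a minimal sketch, not the method used later in this material (we will use nltk there), and the helper name `ngrams` is our own choice for illustration.

```python
sentence = "The weather is very good"
words = sentence.split()

# Pair each word with the word that follows it
pairs = list(zip(words, words[1:]))
print(pairs)


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples (hypothetical helper)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


print(ngrams(words, 3))  # trigrams of the same sentence
```

A sentence with five words has four bigrams and three trigrams.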


In [None]:
# We cannot continue with our lemmas list, because we need the stopwords for building the bigrams

# Let's go through our spaCy Doc (chapters_doc) again! This time we won't remove stopwords.

# In this example we do not store the non-alphabetic tokens in an "else" part. We saved them previously to see what is removed,
# if there are errors etc.

tokens_lemma = []

for token in chapters_doc:
    if token.is_alpha:
        tokens_lemma.append(token.lemma_)
        
print(tokens_lemma[0:50])


In [None]:
# We use the nltk library, which has powerful tools for basic NLP operations. nltk.bigrams() turns a sequence (here our list of lemmas) into bigrams, which we save as a list to the variable
# token_bigrams

import nltk

token_bigrams = list(nltk.bigrams(tokens_lemma))

# This is a list containing tuples. We can access the elements of a tuple in the list with two pairs of brackets:

print(token_bigrams[0:50])  # prints the first 50 entries in our list token_bigrams

# print(token_bigrams[0][0]) # prints the first element of the first entry in our list token_bigrams


In [None]:
# We process our list of tuples a bit, and make it into a list of strings, where the two words of each bigram are combined. This is easier to work with.
# eg. [('it', 'be'), ('be', 'a')] into a list of strings
# -> ["it be", "be a"]

bigrams = []

for bigram in token_bigrams:
    bigrams.append(bigram[0] + ' ' + bigram[1])

print(bigrams[0:5])

In [None]:
# Let's use our bigrams for a simple analysis. Can we find any differences between the uses of "he" and "she"
# in the text when we look at the bigrams?

# We use the regex \bhe\b for capturing "he" and \bshe\b for capturing "she". \b marks a word boundary. NB we write the patterns as raw strings (r"..."),
# because in a normal Python string "\b" would be read as a backspace character (otherwise we would need to write "\\b").
#
# We also want to capture the possessive bigrams, i.e. those with "his" and "her". For these, we take only bigrams where "his" or "her" is the first word of the bigram.


he_bigrams = []
she_bigrams = []

his_bigrams = []
her_bigrams = []

for bigram in bigrams:
    if re.search(r"\bhe\b", bigram):     # "he" as a whole word, in either position of the bigram
        he_bigrams.append(bigram)
    if re.search(r"\bshe\b", bigram):
        she_bigrams.append(bigram)
    if re.search(r"\bhis\b", bigram.split()[0]):   # we split the bigram into two words, and search only the first one
        his_bigrams.append(bigram)
    if re.search(r"\bher\b", bigram.split()[0]):   # same for "her": only the first word of the bigram is searched
        her_bigrams.append(bigram)
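The effect of the word boundary \b can be checked on a few made-up bigram strings. This is only a small sketch for illustration; the example strings are not taken from the book.

```python
import re

examples = ["he say", "she say", "the man", "tell he"]

# \bhe\b matches "he" only as a whole word, not inside "she" or "the"
matches = [text for text in examples if re.search(r"\bhe\b", text)]
print(matches)

# In a normal string "\b" is a single backspace character;
# in a raw string r"\b" it stays as a backslash followed by "b", which the regex engine reads as a word boundary
print(len("\b"), len(r"\b"))
```

Without \b, a search for "he" would also match "she say" and "the man".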
    


In [None]:
# Let's take them to pandas Series and calculate directly the frequencies with value_counts()


he_bigrams_count = pd.Series(he_bigrams, name = "he").value_counts()
she_bigrams_count = pd.Series(she_bigrams, name = "she").value_counts()
his_bigrams_count = pd.Series(his_bigrams, name = "his").value_counts()
her_bigrams_count = pd.Series(her_bigrams, name = "her").value_counts()

# we can have a look at the he & she bigrams.

print(he_bigrams_count[:40])
print("******************")
print(she_bigrams_count[:40])



In [None]:
# how about the his and her bigrams?

print(his_bigrams_count[0:50])
print("****************")
print(her_bigrams_count[0:50])


In [None]:
# Let's save the output

his_bigrams_count.to_csv("texts_week2/bigrams-his.csv", encoding="utf-8")

her_bigrams_count.to_csv("texts_week2/bigrams-her.csv", encoding="utf-8")



Try to open the csv in Excel!

### 4. Visualization (optional)

In [None]:
# The library matplotlib enables visualisations. We can also plot directly with pandas, which uses matplotlib too.
#
# Here is a simple example for visualising the 50 most common bigrams with "her" as the first word

import matplotlib
import matplotlib.pyplot as plt

bigrams_her = her_bigrams_count[0:50]

# we plot now our bigrams on a bar chart:

# we define the size of the figure
plt.figure(figsize = (15, 5)) 

# we put "index" values thus the bigrams on x-axis, and the count values (list) to y-axis, and define our bar chart color as "green"
plt.bar(bigrams_her.index, bigrams_her.tolist(), color="green")

# we rotate the x-axis labels by 90 so we can read them   
plt.xticks(rotation=90)

# we give a label to the x axis
plt.xlabel("Bigrams")

# we give a label to the y axis
plt.ylabel("Frequency (n)")

# we give a title to our figure
plt.suptitle("50 most frequent lemmatized bigrams with 'her' as the first word (Pride and Prejudice, ch 1-10)")

# the figure is saved to our material folder
plt.savefig("./texts_week2/bigrams.png", dpi = 200)

# we display the figure in JupyterLab
plt.show()
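As mentioned, pandas can also draw the chart directly from a Series. Below is a minimal sketch under some assumptions: the three counts are made-up toy values standing in for `her_bigrams_count`, the file name `toy_bigrams.png` is hypothetical, and the "Agg" backend line is only there so that the sketch also runs outside a notebook (leave it out in JupyterLab).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; leave this line out in JupyterLab
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical counts standing in for her_bigrams_count
toy_counts = pd.Series({"her sister": 12, "her mother": 9, "her father": 4})

# Series.plot.bar() puts the index on the x-axis and the values on the y-axis
ax = toy_counts.plot.bar(color="green", figsize=(6, 3), rot=90)
ax.set_xlabel("Bigrams")
ax.set_ylabel("Frequency (n)")

plt.tight_layout()  # make room for the rotated labels
plt.savefig("toy_bigrams.png", dpi=100)
```

With the real data, the same call on `her_bigrams_count[0:50]` should give a chart similar to the one above.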



In [None]:
# We can also visualize our lemmas in a word cloud; you need to install the wordcloud package first with pip!

from wordcloud import WordCloud

# generate_from_frequencies() expects a dictionary of word -> frequency, so we convert our Series with to_dict()
wc = WordCloud(background_color="black", width=500, height=500, max_words=50).generate_from_frequencies(lemmas_count.to_dict())
plt.imshow(wc)
plt.axis("off")
plt.savefig("./texts_week2/wordcloud.png", dpi = 200)
plt.show()

