# Quantitative evaluation. Test your findings and make them pretty. A bit of statistics and visualisation

In this lesson we visualize some results, and we encourage you to test your findings before drawing conclusions.

We'll first import some libraries and load our data as usual.

In [None]:
import os
import numpy as np

In [None]:
texts_trump = []

for speech in os.scandir(r"C:\Users\au576018\OneDrive - Aarhus Universitet\Documents\Kurser\kvantitativ diskursanalyse\quantitative_discourse_analysis\data\politicians\english\trump"):
    with open(speech, encoding="utf8") as x: # the file is closed automatically after reading
        texts_trump.append(x.read())

We want to work with texts from both Obama and Trump, so we load Obama's speeches in the same way.

In [None]:
texts_obama = []

for speech in os.scandir(r"C:\Users\au576018\OneDrive - Aarhus Universitet\Documents\Kurser\kvantitativ diskursanalyse\quantitative_discourse_analysis\data\politicians\english\obama"):
    with open(speech, encoding="utf8") as x: # the file is closed automatically after reading
        texts_obama.append(x.read())

We inspect our data. It consists of two similar groups of texts, with 10 speeches from each speaker.

In [None]:
len(texts_trump)

In [None]:
len(texts_obama)

Now we arrange our data in a dataframe using the pandas library.

In [None]:
import pandas as pd
from pandas import DataFrame

In [None]:
df_speeches = DataFrame({"Speeches": texts_trump + texts_obama}) # make a dataframe from our two lists of text
df_speeches

In [None]:
speaking = ["Trump"]*len(texts_trump) + ["Obama"]*len(texts_obama) # metadata: the speaker of each text, in the same order as the texts

In [None]:
speaking

In [None]:
df_speeches["Speaker"] = speaking # this is added as a new column in the dataframe

In [None]:
df_speeches

Now we clean our text by removing punctuation. Because we want to analyse mentions of countries in the text, I make the names for the United States that contain more than one word recognizable as a single word by replacing every space in them with an underscore.

In [None]:
def clean_wrd(text_a):
    text_b = text_a.replace("\n"," ")
    text_c = text_b.replace("The United States of America","The_United_States_of_America")
    text_0 = text_c.replace("The US","The_US")
    text_1 = text_0.replace("United States","United_States")
    text_2 = text_1.replace("."," ")
    text_3 = text_2.replace(","," ")
    text_4 = text_3.replace(":"," ")
    text_5 = text_4.replace("*"," ")
    text_6 = text_5.replace("–"," ")
    text_7 = text_6.replace("'"," ")
    text_8 = text_7.replace("”"," ")
    text_clean = text_8.replace("-"," ")
    text_token = text_clean.split()
    return text_token

In [None]:
df_speeches["Clean_texts"] = df_speeches.Speeches.apply(clean_wrd) # the clean text is added as a new column

In [None]:
df_speeches

# How many mentions of countries?

In this analysis we want to count how many times the United States is mentioned compared to how many times other countries are mentioned in the speeches. To do so, I have prepared a list containing every country. It is loaded as a list of strings.

In [None]:
with open(r"C:\Users\au576018\OneDrive - Aarhus Universitet\Documents\Kurser\kvantitativ diskursanalyse\quantitative_discourse_analysis\countries.txt", encoding="utf8") as f:
    txt = f.read() # the file is closed automatically after reading
countries = txt.split("\n")

In [None]:
countries

I inspect the list of countries

In [None]:
len(countries)

In [None]:
countries[3]

Since we don't want to count the United States both among the world's countries and in the separate count for the United States, we remove it from the list of the world's countries.

In [None]:
"United States" in countries

In [None]:
countries.remove("United States")

In [None]:
"United States" in countries

By close reading the speeches, we find mentions of the United States in several forms. We list these different forms in the following:

In [None]:
us_list = ["The_United_States_of_America", "United_States", "America", "The_US", "US"]

Now we define a function that counts how many times specific words from a list are mentioned in a text. 

In [None]:
def count_words(txt, stopwords): # takes inputs: a text (as a list of words) you want to count in, and a list with the words you are counting
    
    count = 0
    
    for word in txt: # go through every word in the text
        if word in stopwords:
            count = count + 1 # the word is on the list, so we count it
    
    return count # the text itself is left unchanged

Let's try out the function with a simple example

In [None]:
# Example usage:
word_list = ["apple", "banana", "apple", "orange", "banana", "apple", "is", "the", "of", "apple", "banana"]
stopwords = ["is", "the", "of"]

counts = count_words(word_list, stopwords)
print("Word Counts:", counts)
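The function returns one total, so it does not show which of the listed words were actually found. As a sketch of one way to check this, the example below uses `Counter` from Python's standard library. The helper name `count_by_word` and the example tokens are made up for illustration.

```python
from collections import Counter

def count_by_word(tokens, target_words):
    """Return how often each target word occurs in a list of tokens."""
    targets = set(target_words)  # a set makes the lookup fast
    return Counter(token for token in tokens if token in targets)

# made-up tokens, not taken from the speeches
tokens = ["America", "is", "great", "and", "the", "United_States", "is", "America"]
us_forms = ["The_United_States_of_America", "United_States", "America", "The_US", "US"]

per_word = count_by_word(tokens, us_forms)
print(per_word)                # counts for each form that was found
print(sum(per_word.values()))  # the total number of mentions
```

Looking at the counts per word is a quick way to spot a form that is never matched, or one that dominates the total.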

We now count every mention of the words related to the United States and add the count as a new column.

In [None]:
df_speeches["US_counts"] = [count_words(txt, us_list) for txt in df_speeches["Clean_texts"].to_list()]

In [None]:
df_speeches

We also count every mention of the other countries and add the count as a new column.

In [None]:
df_speeches["World_counts"] = [count_words(txt, countries) for txt in df_speeches["Clean_texts"].to_list()]

In [None]:
df_speeches

# Plot the frequency

These counts give an impression of whether the US or the rest of the world is mentioned more often. Since this is numerical data, a visualization can give a better understanding of whether and how the counts are alike.

For visualizing the data we need a library called 'matplotlib'.

For more information about the package and for more ideas on different plot types, see the documentation: https://matplotlib.org/stable/plot_types/basic/bar.html#sphx-glr-plot-types-basic-bar-py

In [None]:
import matplotlib.pyplot as plt # imports the library

We now plot our data as a bar plot with the speaker on the horizontal axis and the number of mentions of the US on the vertical axis.

In [None]:
df_speeches.plot(x='Speaker', y='US_counts', kind='bar', legend=None)
plt.title("Mentions of US") # the title of the plot
plt.xlabel('Speakers') # the name of the x axis
plt.ylabel('Mentions') # the name of the y axis
plt.show()



We make the same type of plot for the counts of the mentions of countries in the world.

In [None]:
df_speeches.plot(x='Speaker', y='World_counts', kind='bar', legend=None)
plt.title("Mentions of the World") # the title of the plot 
plt.xlabel('Speakers') # the name of the x axis
plt.ylabel('Mentions') # the name of the y axis
plt.show()



To make it more readable you might want to plot your results together. But be careful!

In [None]:
# Selecting specific columns to plot
columns_to_plot = ['US_counts', 'World_counts']
df_speeches[columns_to_plot].plot(kind='bar')
plt.title('Mentions of US and the world')
plt.xlabel('Speaker')
plt.ylabel('Mentions')
plt.show()

Small exercise: 
    
- How do we read this plot?
- What does it tell us about the speakers individually? Compared to each other?
- Is it a good visualization?
- How could we improve?

------------------------------------------------------------------------------------------------------------------------------

To evaluate our findings better, we define the following function, which calculates the share of US mentions out of all country mentions in each speech.

In [None]:
def ratio(us_count, world_count): # takes two lists of numbers as input
    us_counts = [int(number) for number in us_count] # make sure our datatype is integers
    world_counts = [int(number) for number in world_count]
    ratios = [] # placeholder for the calculated ratios
    for i in range(0,len(us_counts)): # for every index from 0 to the length of the list "us_counts"
        total = world_counts[i]+us_counts[i] # all mentions in the speech
        if total == 0:
            ratios.append(0.0) # no mentions at all: we set the ratio to 0 to avoid dividing by zero (a choice, not the only option)
        else:
            ratios.append(us_counts[i]/total) # calculate the ratio and append it to the list
    return ratios # return the list with ratios


Using the function, we get a list of numbers between 0 and 1, where a value close to 1 means that almost all country mentions in the speech refer to the US.

In [None]:
ratio(df_speeches["US_counts"].to_list(), df_speeches["World_counts"].to_list())
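As a small worked example of the calculation, here is the same ratio for three hypothetical speeches. The counts are made up for illustration and are not taken from the data.

```python
# made-up counts for three hypothetical speeches
us = [30, 5, 0]       # mentions of the US
world = [10, 15, 20]  # mentions of other countries

# share of US mentions out of all country mentions, speech by speech
ratios = [u / (u + w) for u, w in zip(us, world)]
print(ratios)  # [0.75, 0.25, 0.0]
```

The first speech has 30 US mentions out of 40 mentions in total, which gives 0.75. The last speech never mentions the US, so its ratio is 0.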

The ratios are added as a new column in the dataframe.

In [None]:
df_speeches["Ratio_us_mentions"] = ratio(df_speeches["US_counts"].to_list(), df_speeches["World_counts"].to_list())

In [None]:
df_speeches

By plotting the ratio, it is more straightforward to compare Obama's and Trump's mentions.

In [None]:
df_speeches.plot(x='Speaker', y='Ratio_us_mentions', kind='bar', legend=None)
plt.title("Ratio of US mentions compared to mentions of the world")
plt.xlabel('Speakers')
plt.ylabel('Ratio')
plt.show()


A plot is an easy way to "read" data, but it can be hard to draw conclusions directly from a plot. It is a good idea to run a statistical test that might strengthen your conclusion.

# Permutation test

When working with text, one should always be careful about which statistical tests to use. The permutation test is well suited when you have two groups of data and want to test whether there is a difference between them. It also makes no assumptions about the distribution of your data.
The main thing to keep in mind is that, as a rule of thumb, one should have 30 data points or more in order to get useful results. We only have 20 data points (one ratio for each text). We will go through with the test anyway and discuss what to do about it afterwards.

You can read more about the permutation test here: https://www.jwilber.me/permutationtest/

In [None]:
df_speeches

We want to use the permutation test to see whether there is a difference between the two groups in the ratio of mentions of the US compared to mentions of the world.

Our null hypothesis is that there is no difference between the two groups, while the alternative hypothesis is that Trump has a bigger ratio of US mentions than Obama.

We split the data into the two groups.

In [None]:
df_trump  = df_speeches[df_speeches["Speaker"] == "Trump"] # makes a subset of the rows where the column "Speaker" is "Trump"
df_trump

In [None]:
df_obama = df_speeches[df_speeches["Speaker"] == "Obama"] # makes a subset of the rows where the column "Speaker" is "Obama"
df_obama

We now need the following library to calculate the mean.

In [None]:
import statistics as st # importing statistics library

We now calculate the mean ratio for both Trump and Obama.

In [None]:
trump_us_ratio = st.mean(df_trump["Ratio_us_mentions"].to_list()) # calculate the mean from the column "Ratio_us_mentions"
trump_us_ratio

In [None]:
obama_us_ratio = st.mean(df_obama["Ratio_us_mentions"].to_list()) # calculate the mean from the column "Ratio_us_mentions"
obama_us_ratio

We are now ready to run the permutation test on the mentions of the US in the two groups.

We will use a significance level of 0.05. This is the conventional threshold, and you can use it without further explanation unless you have a specific argument for choosing another level. If the p-value from the test is below this level, we reject the null hypothesis.

In words, this is what the permutation test does:

- calculate the observed difference in means between the two groups (Trump and Obama)
- make a list of all data points together (all ratios from both Trump and Obama)
- set N to a big number. This is how many times you want to permute (shuffle) your data
- create a list "result" to store the result from each permutation of your data

The for-loop is the permutation:
- for each index in the range 0 to N (which was a very big number!), do the following:
- index1: draw 10 random indexes from the numbers 0 to 19 (because we have 10 data points in one group and 20 data points in total)
- index2: make a list of the indexes that are not in index1 (go through all numbers from 0 to 19, and store those that aren't already in index1)
- group1: stores the data points from the list of all data points that have an index in index1
- group2: stores the data points from the list of all data points that have an index in index2
- calculate the difference in means between the two new groups and store it in the list "result"

Then every value in result is plotted in a histogram, and the observed difference in means is plotted as a red line.

The p-value is the proportion of permutation results that are greater than or equal to the observed difference (a proportion rather than a raw count).

In [None]:
from random import sample
import numpy as np

observed = trump_us_ratio-obama_us_ratio # the observed difference between trump and obama
all_data = df_speeches["Ratio_us_mentions"].to_list()

N = 10**4-1 # how many times we want to permute our data
result = [] # storage for our results
for i in range(0,N):
    index1 =  sample(list(range(0,20)), 10)
    index2 = [m for m in list(range(0,20)) if m not in index1]
    group1 = [all_data[k] for k in index1]
    group2 = [all_data[l] for l in index2]
    result.append(st.mean(group1)-st.mean(group2))

fig = plt.figure(figsize =(10, 7))
plt.hist(result, bins = 40)
plt.title("Permutation test over the difference in mentions of the US")
plt.axvline(observed, color = "red", linestyle = "--")
plt.show()


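greater_or_equals = [rs for rs in result if rs >= observed] # permutation results at least as large as the observed difference

pvalue = (len(greater_or_equals)+1)/(N+1)  # one-sided test; the +1 counts the observed difference itself as one of the N+1 outcomes
pvalue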


We now have a plot and a p-value.

### Questions
- How do we read the plot?
- What does the p-value mean?
- What can we conclude from the test?
- Is this an obvious conclusion compared to the bar plots above?
- Could we improve the readability?
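Since the same test is repeated in the next example, the steps described above could be collected in one function. The following is a minimal sketch under some assumptions: the function name `permutation_test`, its parameters and the example numbers are made up for illustration, the `seed` parameter is only there to make the shuffling reproducible, and the observed split is counted as one of the outcomes (the `+1` in the numerator and the denominator), which is a common convention.

```python
import random
import statistics as st

def permutation_test(group_a, group_b, n_permutations=9999, seed=None):
    """One-sided permutation test: is the mean of group_a greater than the mean of group_b?"""
    rng = random.Random(seed)  # a fixed seed gives the same shuffles every time
    observed = st.mean(group_a) - st.mean(group_b)
    pooled = list(group_a) + list(group_b)  # all data points together
    n_a = len(group_a)
    results = []
    for _ in range(n_permutations):
        shuffled = rng.sample(pooled, len(pooled))  # a random reordering of all data points
        # the first n_a values form the new group 1, the rest form the new group 2
        results.append(st.mean(shuffled[:n_a]) - st.mean(shuffled[n_a:]))
    greater_or_equal = sum(1 for r in results if r >= observed)
    # the observed split counts as one of the outcomes, hence the +1
    p_value = (greater_or_equal + 1) / (n_permutations + 1)
    return observed, p_value, results

# made-up ratios for two hypothetical groups, not taken from the speeches
group_a = [0.9, 0.8, 0.85, 0.95, 0.7]
group_b = [0.3, 0.4, 0.2, 0.35, 0.5]

observed, p_value, results = permutation_test(group_a, group_b, n_permutations=999, seed=1)
print(observed, p_value)
```

The function takes the group sizes from the input lists, so it is not tied to 10 data points per group. The returned list `results` can be plotted in a histogram in the same way as above.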

# Another example

From a close reading, I have found that Trump and Obama use various terms when speaking of immigrants. Some are neutral and some are more negative. This of course depends on the context, but it would be nice to have some kind of test of whether my perception of their discourse is correct.

First I split my data into lowercase words, and then I count the negative and neutral words in the immigrant discourse.

In [None]:
def clean_low(text_0):
    text_1 = text_0.replace("\n"," ")
    text_2 = text_1.replace("."," ")
    text_3 = text_2.replace(","," ")
    text_4 = text_3.replace(":"," ")
    text_5 = text_4.replace("*"," ")
    text_6 = text_5.replace("–"," ")
    text_7 = text_6.replace("'"," ")
    text_8 = text_7.replace("”"," ")
    text_clean = text_8.replace("-"," ")
    text_lower = text_clean.lower()
    text_token = text_lower.split()
    return text_token

In [None]:
df_speeches["Low"] = df_speeches.Speeches.apply(clean_low)
df_speeches

In [None]:
print(df_speeches["Low"][1])

In the following I have listed some words used when speaking of immigrants, divided into neutral and negative words.

In [None]:
word_list_neutral = ["immigrant", "immigrants", "refugee", "refugees", "migrant", "mexican"]
word_list_negative = ["terrorist", "terrorists", "monsters", "aliens", "smugglers", "smuggler"]

In [None]:
df_speeches["Neutral"] = [count_words(txt, word_list_neutral) for txt in df_speeches["Low"].to_list()]
df_speeches["Negative"] = [count_words(txt, word_list_negative) for txt in df_speeches["Low"].to_list()]

In [None]:
df_speeches

I calculate the ratio of negative words out of all the listed words found in each speech.

In [None]:
df_speeches["Ratio_negative"] = ratio(df_speeches["Negative"].to_list(), df_speeches["Neutral"].to_list())

In [None]:
df_speeches

Now we want to perform a permutation test to see whether there is a difference between the two groups.

In [None]:
df_trump  = df_speeches[df_speeches["Speaker"] == "Trump"] # makes a subset of the rows where the column "Speaker" is "Trump"
df_trump

In [None]:
df_obama  = df_speeches[df_speeches["Speaker"] == "Obama"] # makes a subset of the rows where the column "Speaker" is "Obama"
df_obama

In [None]:
trump_negative_ratio = st.mean(df_trump["Ratio_negative"].to_list()) # calculate the mean from the column "Ratio_negative"
trump_negative_ratio

In [None]:
obama_negative_ratio = st.mean(df_obama["Ratio_negative"].to_list()) # calculate the mean from the column "Ratio_negative"
obama_negative_ratio

Permutation test

### Questions:
- What should our hypothesis be?

In [None]:
from random import sample
import numpy as np

observed = trump_negative_ratio-obama_negative_ratio # the observed difference between trump and obama
all_data = df_speeches["Ratio_negative"].to_list()

N = 10**4-1 # how many times we want to permute our data
result = [] # storage for our results
for i in range(0,N):
    index1 =  sample(list(range(0,20)), 10)
    index2 = [m for m in list(range(0,20)) if m not in index1]
    group1 = [all_data[k] for k in index1]
    group2 = [all_data[l] for l in index2]
    result.append(st.mean(group1)-st.mean(group2))

fig = plt.figure(figsize =(10, 7))
plt.hist(result, bins = 40)
plt.title("Permutation test over the difference in ratio of negative words for immigrants")
plt.axvline(observed, color = "red", linestyle = "--")
plt.show()


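greater_or_equals = [rs for rs in result if rs >= observed] # permutation results at least as large as the observed difference

pvalue = (len(greater_or_equals)+1)/(N+1)  # one-sided test; the +1 counts the observed difference itself as one of the N+1 outcomes
pvalue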


### Questions:
- How do we read the plot?
- What should our conclusion be?

# Comment on the amount of data points

As a rule of thumb, most statistical tests need at least 30 data points in order to be strong and precise enough. An outlier would, for example, have a great impact on the result in the data set above, whereas it would be less problematic in a data set with 100 or 1000 data points.

If you work with, for example, interviews or are collecting data yourself, you can design your project to have at least 30 data points / 30 respondents to avoid the problem of having too little data.
If you work with longer texts, for example novellas and books, you can consider splitting the texts into chapters, sections etc. If you have no chapters or sections, you could choose a number of words or sentences and divide your text into smaller parts of that size.
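Splitting a text into parts with a fixed number of words could look like the sketch below. The function name `split_into_chunks` and the example text are made up for illustration, and the text is assumed to be split into a list of words already.

```python
def split_into_chunks(tokens, chunk_size):
    """Split a list of words into consecutive parts of chunk_size words each."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]

# a made-up text with ten words
tokens = "this is a made up text with exactly ten words".split()

chunks = split_into_chunks(tokens, 4)
print(chunks)
print([len(chunk) for chunk in chunks])  # [4, 4, 2]
```

Note that the last part can be shorter than the others. Depending on the analysis, you may want to drop it or merge it with the part before it.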

Depending on what you look for in your analysis, small amounts of data might make it difficult to test your data at all. Even if you have 30 texts, the counts may include a lot of zero values, which could make it hard to conclude anything. This should be a consideration when operationalizing: are you sure you are measuring the concept that you want to analyse?
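A quick way to see whether zero values are a problem is to check how large a share of the counts are zero. This is a minimal sketch. The function name `share_of_zeros` and the counts are made up for illustration.

```python
def share_of_zeros(counts):
    """Return the proportion of values in counts that are zero."""
    return sum(1 for c in counts if c == 0) / len(counts)

# made-up counts for ten hypothetical texts
negative_counts = [0, 0, 3, 0, 1, 0, 0, 2, 0, 0]
print(share_of_zeros(negative_counts))  # 0.7
```

If most of the texts have a count of zero, as in this made-up example, the measure says little about most of the texts, and it is worth reconsidering the word lists or the way the texts are divided.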
