# How do we work with several texts at once?

Regardless of whether our data consists of literary texts, tweets, emails, survey responses, or numbers, data frames (tables - think of spreadsheets) are the most effective way to handle them.

Often (but not always), we distinguish between data and metadata. Metadata is data about data. In our case, the texts would be our data, and information about the texts, such as author, title, and year, would be our metadata.

The most important package for working with data frames is 'Pandas'. However, sometimes we need to import more than one library because some libraries build on the functionality of others. For the tasks below, in addition to Pandas, we also need to use the libraries called 'os' and 'numpy'.

You can read more about Pandas and Numpy here:

Link to Pandas documentation: https://pandas.pydata.org/pandas-docs/stable/

Link to Numpy documentation: https://numpy.org/

In [None]:
import os
import numpy as np

In [None]:
# works on PC

texts = []

for speech in os.scandir(r"C:\Users\au576018\OneDrive - Aarhus Universitet\Documents\Kurser\kvantitativ diskursanalyse\quantitative_discourse_analysis\data\politicians\english\trump"):
    x = open(speech, encoding = "utf8")
    y = x.read()
    z = y.replace("\n", " ")
    texts.append(z)
    x.close()

In [None]:
# works on mac

# texts = []

path = os.path.join("text-files-path") # used to avoid the Mac/PC problems with absolute path names

for fil in os.scandir(path): # for-loop over every file in the folder
    with open(fil, encoding = "utf8") as f: # context manager: closes the file automatically
        texts.append(f.read())
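The two loading cells above can also be written once for both PC and Mac. The following is a minimal sketch, assuming the speeches are stored as ".txt" files; the function name `load_texts`, the temporary folder, and the demo file names are made up for illustration. Because `os.scandir` yields files in arbitrary order, the sketch sorts the paths and keeps the file names, so each text can later be matched with its source.

```python
import tempfile
from pathlib import Path

def load_texts(folder):
    """Read every .txt file in `folder` and return (file names, texts)."""
    names, loaded = [], []
    # sorted() gives the same file order on every operating system
    for file_path in sorted(Path(folder).glob("*.txt")):
        names.append(file_path.name)
        loaded.append(file_path.read_text(encoding="utf8"))
    return names, loaded

# Demo with two throwaway files (hypothetical names and contents)
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "speech_b.txt").write_text("Second speech.\nLine two.", encoding="utf8")
    Path(tmp, "speech_a.txt").write_text("First speech.", encoding="utf8")
    demo_names, demo_texts = load_texts(tmp)

print(demo_names)
print(len(demo_texts))
```

To use it on your own data, you would call `load_texts` with the path to your own folder of speeches.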

In [None]:
texts

In [None]:
len(texts)

In [None]:
texts[0]

In [None]:
texts[1]

In [None]:
len(texts[0])

In [None]:
len(texts[1])

# Example with Pandas

Here's a small example of how to create a dataframe with pandas.

I want to combine lists of numbers and lists of texts in a dataframe.

In [None]:
import pandas as pd
from pandas import DataFrame

In [None]:
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] # creating a list of numbers

In [None]:
numbers

In [None]:
written_numbers = ["One", "Two", "Three", "Four", "Five", "Six", "Seven", "Eight", "Nine", "Ten"] # creating a list of text elements

In [None]:
written_numbers

In [None]:
test_dataframe = pd.DataFrame({"Numbers": numbers, "Written numbers": written_numbers}) # making a dataframe called "test_dataframe" from the two lists we just created

In [None]:
test_dataframe

I now have some additional information to add to my dataframe. I can add this information to my existing dataframe by simply specifying the name of the new column and the data I want in that specific column.


In [None]:
# these are my lists of new information I want to add to my existing dataframe

colors = ["Red", "Blue", "Yellow","Green", "Purple", "White", "Grey", "Orange", "Black", "Pink"]
mood = ["Happy", "Happy", "Sad", "Happy" ,"Sad", "Sad", "Sad", "Happy", "Happy", "Happy"]
amount = ["25", "27", "3","19", "41", "5", "6", "33", "43", "7"]

I now add the new columns one by one:

In [None]:
test_dataframe["Colors"] = colors

In [None]:
test_dataframe

In [None]:
test_dataframe["Mood"] = mood

In [None]:
test_dataframe["Amount"] = amount

In [None]:
test_dataframe

We can make several calculations directly on our dataframe. Basically, you can do the same operations inside a dataframe as you can outside a dataframe. You just have to be aware of the type of data you are working with.
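Why the type of data matters can be seen with a column like "Amount", where the numbers are stored as text. The sketch below uses a small stand-in dataframe (the name `df` and the column "Amount_number" are chosen for illustration only) and shows that adding text glues it together, while adding numbers does arithmetic.

```python
import pandas as pd

# Hypothetical miniature of the "Amount" column, stored as text
df = pd.DataFrame({"Amount": ["25", "27", "3"]})

# Adding text to text glues the strings together: "25" + "25" becomes "2525"
doubled_text = df["Amount"] + df["Amount"]

# Convert the text to numbers first, then arithmetic behaves as expected
df["Amount_number"] = pd.to_numeric(df["Amount"])
doubled_number = df["Amount_number"] * 2

print(doubled_text.tolist())
print(doubled_number.tolist())
```

Checking the types with `df.dtypes` before calculating is a quick way to catch this kind of surprise.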

Here's an example of extracting the first letter of every word in the column "Mood" and adding these letters in a new column.
(This way of doing the operation is called a list comprehension.)

In [None]:
test_dataframe["First letter of mood"] = [word[0] for word in test_dataframe["Mood"]] # for every word in the column "Mood", take the character at index 0

In [None]:
test_dataframe
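The list comprehension is not the only way to get the first letter. As a small sketch on a stand-in dataframe (the name `df` and the two new column names are made up for this example), the same result can be produced with the pandas string accessor `.str`, which applies string operations to a whole column.

```python
import pandas as pd

df = pd.DataFrame({"Mood": ["Happy", "Sad", "Happy"]})

# List comprehension: loop over the words and take character 0 of each
df["First letter (comprehension)"] = [word[0] for word in df["Mood"]]

# Same result with the pandas string accessor
df["First letter (str)"] = df["Mood"].str[0]

print(df)
```

One practical difference: if a cell is empty (missing), `.str[0]` leaves it missing, whereas the list comprehension stops with an error.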

The options are many, and you can use Google or ChatGPT to your advantage if you have a specific task for your program in mind.

The pandas package has several built-in functions for inspecting your data (you should ALWAYS inspect your data).

In [None]:
test_dataframe.head() # shows the top 5 rows of your dataframe

In [None]:
test_dataframe.tail() # shows the bottom 5 rows of your dataframe

In [None]:
test_dataframe[2:6] # shows the rows at index 2 to 5 (the end index 6 is not included)

In [None]:
test_dataframe["Colors"] # shows the column "Colors"

In [None]:
test_dataframe[["Colors", "Amount"]] # shows the columns "Colors" and "Amount"

You can also slice your dataframe by the names of your rows and columns. Note that with ".loc" both the start and the end label are included.

In [None]:
test_dataframe.loc[0:4, "Colors": "Amount"]
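The difference between slicing by position and slicing by label is easy to miss, so here is a minimal sketch on a small stand-in dataframe (the names `df`, `by_position`, `by_label` and `by_iloc` are made up for illustration). Plain slicing and `.iloc` count positions and leave out the end, while `.loc` uses labels and includes the end.

```python
import pandas as pd

df = pd.DataFrame({"Colors": ["Red", "Blue", "Yellow", "Green", "Purple"],
                   "Mood": ["Happy", "Happy", "Sad", "Happy", "Sad"]})

by_position = df[1:3]        # rows 1 and 2, the end (3) is excluded
by_label = df.loc[1:3]       # rows 1, 2 and 3, the end label is included
by_iloc = df.iloc[1:3, 0:1]  # positions only: rows 1-2, first column

print(len(by_position))
print(len(by_label))
print(by_iloc.shape)
```

Here the row labels happen to be the same numbers as the positions, which is why the two are easy to confuse.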

### Small exercise: Try the following lines for yourself. What do you get?

```
test_dataframe.loc[3:7, "Colors": "Amount"]
test_dataframe.loc[:, "Mood"]
test_dataframe.loc[4, :]
```

In [None]:
test_dataframe

If we have a function, whether built-in (like len) or one we define ourselves, we can call it on a whole column at once.

In [None]:
test_dataframe["Length of color"] = test_dataframe.Colors.apply(len)

In [None]:
test_dataframe
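The same pattern works with a function you write yourself. The sketch below is only an illustration: the function `count_vowels`, the stand-in dataframe `df`, and the column name "Vowels in color" are made up for this example.

```python
import pandas as pd

def count_vowels(word):
    """Count the vowels (a, e, i, o, u) in a word, ignoring case."""
    return sum(1 for character in word.lower() if character in "aeiou")

df = pd.DataFrame({"Colors": ["Red", "Blue", "Yellow"]})

# apply() calls count_vowels once for every cell in the column
df["Vowels in color"] = df["Colors"].apply(count_vowels)

print(df)
```

Note that `df["Colors"]` and `df.Colors` refer to the same column; the dot form only works when the column name has no spaces or special characters.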

If you want to reach single cells by index or a given value, you can use the following.

In [None]:
test_dataframe["Amount"][1] # goes to the column "Amount" and the row at index 1

In [None]:
test_dataframe[test_dataframe["Mood"] == "Happy"] # goes to the column "Mood" and makes a subset containing every row with the value "Happy"
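Two small variations on the cells above may be useful. This is a sketch on a stand-in dataframe where, as an assumption for the example, "Amount" holds real numbers rather than text. It reaches a single cell with `.loc[row, column]` and combines two conditions with `&`.

```python
import pandas as pd

df = pd.DataFrame({"Mood": ["Happy", "Happy", "Sad", "Happy"],
                   "Amount": [25, 27, 3, 19]})

# One cell: row label 1, column "Amount"
single_cell = df.loc[1, "Amount"]

# Rows that are "Happy" AND have an amount above 20.
# Each condition needs its own parentheses.
happy_and_large = df[(df["Mood"] == "Happy") & (df["Amount"] > 20)]

print(single_cell)
print(happy_and_large)
```

Use `|` instead of `&` if you want rows where at least one of the conditions holds.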

When you are done with your dataframe, you can save it as a csv-file with the following command:

In [None]:
test_dataframe.to_csv('test_dataset.csv', index=False) # change the name of the file by substituting "test_dataset"
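A saved csv-file can be loaded again with `pd.read_csv`. The sketch below writes a small stand-in dataframe to a temporary folder (so it does not leave files behind) and reads it back; the names `df`, `csv_path` and `reloaded` are chosen for illustration.

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({"Colors": ["Red", "Blue"], "Amount": ["25", "27"]})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "test_dataset.csv")
    df.to_csv(csv_path, index=False)   # save without the row index
    reloaded = pd.read_csv(csv_path)   # load the file into a new dataframe

print(reloaded)
print(reloaded.dtypes)
```

A csv-file only stores text, so pandas guesses the types when reading. In this sketch the "Amount" values were saved as text but come back as numbers, which is another reason to inspect your data after loading.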

# Let's apply our analysis from earlier to the dataframe

First we want to arrange the speeches in a dataframe.

In [None]:
df_speeches = DataFrame({"Speeches": texts})
df_speeches
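The distinction between data and metadata from the beginning of the notebook fits naturally into a dataframe: the texts go in one column and the information about them in others. The following is a minimal sketch with made-up texts and file names (nothing here comes from the actual dataset).

```python
import pandas as pd

# Hypothetical data (the texts) and metadata (the file names)
speeches = ["First speech text.", "Second, somewhat longer speech text."]
file_names = ["speech_01.txt", "speech_02.txt"]

df = pd.DataFrame({"Filename": file_names, "Speeches": speeches})

# Metadata can also be computed from the data, e.g. the length in characters
df["Characters"] = df["Speeches"].apply(len)

print(df)
```

With the metadata in place, you can later filter or group the texts by it, for example by speaker or year if you have that information.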

Now we want to clean up the mess within the texts. We can make a function and apply it to all texts at once.

We used this code on one text, and now want to make a function that does it on all the texts.

```
speech_clean = speech.replace("\n", " ")
speech_clean
```

In [None]:
def clean(txt): # we define a function called "clean" which takes a text argument
    return txt.replace("\n", " ")

In [None]:
df_speeches["Clean_texts"] = df_speeches.Speeches.apply(clean) # we make a column "Clean_texts" by applying the clean-function on the column "Speeches"

In [None]:
df_speeches
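Replacing every "\n" with a space can leave several spaces in a row when a text contains blank lines. If that matters for your analysis, a slightly stricter variant is sketched below; the function name `clean_whitespace` and the example text are made up for illustration.

```python
def clean_whitespace(txt):
    """Replace every run of whitespace (newlines, tabs, spaces) with one
    space and remove whitespace at the start and end of the text."""
    # split() without arguments cuts the text at any whitespace,
    # and join() glues the pieces together with single spaces
    return " ".join(txt.split())

example = "Thank you.\n\nThank you   very much.\tPlease sit.\n"
print(clean_whitespace(example))
```

It can be applied to a column in the same way as "clean", by passing it to `.apply()`.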

# Spacy analysis

In [None]:
import spacy # import spacy library

In [None]:
nlp = spacy.load("en_core_web_lg") # load your language model

We now want to turn each text in the dataframe into an nlp-object in order to do our analysis. We can do this for a whole column at once.

In [None]:
df_speeches["nlp_texts"] = df_speeches.Clean_texts.apply(nlp) # applying the nlp-function on the "Clean_texts" column, and adding the result as a new column "nlp_texts"

In [None]:
df_speeches

We can inspect our data by index

In [None]:
df_speeches["nlp_texts"][2] # gives you the element in the nlp_texts column at row index 2

In [None]:
type(df_speeches["nlp_texts"][2]) # gives you the type of the element in the nlp_texts column at row index 2

We can now use the spacy functionality on the nlp_texts column, since they are spacy-objects

In [None]:
print(list(df_speeches["nlp_texts"][2].sents)) # gives us every sentence in the text separated by a comma

We can still print them one by one in order to make it more readable

In [None]:
for sentence in df_speeches["nlp_texts"][2].sents: # for every sentence in the text at row index 2 in the column "nlp_texts"
    print(sentence) # print the sentence

We can make a new column in the dataframe containing every sentence of the text by turning the code above into a function.

In [None]:
def sentences(nlp_object): # defines a function "sentences" that takes an nlp-object as input
    sentence_list = [] # placeholder for the sentences
    for sentence in nlp_object.sents: # for every sentence
        sentence_list.append(sentence) # append the sentence to the list. If you do not want to analyze sentences as nlp-objects, you can wrap it in "str()"
    return sentence_list # returns the collection of sentences in the text

We make the new column "Sentences" by calling our function on the column "nlp_texts".

In [None]:
df_speeches["Sentences"] = df_speeches.nlp_texts.apply(sentences)

In [None]:
df_speeches

We now want to find the most used nouns in every text. By using the code from earlier, this can be done easily by calling functions on the columns in the dataframe.

First we recall our counting function:

In [None]:
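from collections import Counter
def sorted_count(a): # takes a list of items (e.g. words)
    b = Counter(a) # counts how many times each item occurs
    c = sorted(b.items(), key=lambda item: item[1], reverse = True) # sorts the (item, count) pairs, highest count first
    return c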
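To see what the counting function returns, here is a small sketch on a made-up list of words. It also shows `Counter.most_common`, a method from Python's standard library that gives the same sorted list and can cut it off after the top entries.

```python
from collections import Counter

def sorted_count(items):
    counts = Counter(items)  # how many times each item occurs
    # (item, count) pairs, highest count first
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)

# Hypothetical list of words
words = ["people", "country", "people", "job", "country", "people"]

by_function = sorted_count(words)
by_method = Counter(words).most_common()
top_two = Counter(words).most_common(2)

print(by_function)
print(top_two)
```

Both approaches return a list of (word, count) pairs, so either one can be used when looking for the most frequent nouns.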

And here is the function that finds the most frequent nouns in the text:

In [None]:
def frequent_nouns(nlp_object): # takes an nlp-object as input
    nouns = [token.lemma_ for token in nlp_object if token.pos_ == "NOUN"] # the lemma of every token tagged as a noun
    most_frequent = sorted_count(nouns)[:3] # keep the three most frequent nouns
    return most_frequent

We call the "frequent_nouns"-function on the column "nlp_texts" and add the result in a new column "Frequent_nouns"

In [None]:
df_speeches["Frequent_nouns"] = df_speeches.nlp_texts.apply(frequent_nouns)

In [None]:
df_speeches

# Exercise: Do the same analysis on the 10 speeches by Obama

It might help to follow this order:

- load the data
- make a dataframe with the new data
- clean your data from "\n" and add as a new column
- make each text an nlp object and add as a new column
- make a column with the sentences from the text
- make a column with the most frequent nouns


(SOLUTION is further down. But try your best and ask for help before looking at the solution)

In [None]:
# works on PC

texts_obama = []

for speech in os.scandir(r"C:\Users\au576018\OneDrive - Aarhus Universitet\Documents\Kurser\kvantitativ diskursanalyse\quantitative_discourse_analysis\data\politicians\english\obama"):
    x = open(speech, encoding = "utf8")
    texts_obama.append(x.read())
    x.close()

In [None]:
# make a dataframe from your speeches

df_speeches_obama = ...

## Related adjectives (Extra)

The word "people" is very frequent throughout the dataset. Find the noun chunks for the keyword "people" (see Notebook 1).

Help
- make a function that finds the sentences where the word "people" is included
- apply this function to every text in the dataframe and add the result as a new column
- find the noun chunks with the root "people" and add them as a new column in your dataframe
- BONUS: extract only the adjectives that describe the root "people", and add them as a new column

# Solution

In [None]:
# works on PC

texts_obama = []

for speech in os.scandir(r"C:\Users\au576018\OneDrive - Aarhus Universitet\Documents\Kurser\kvantitativ diskursanalyse\quantitative_discourse_analysis\data\politicians\english\obama"):
    x = open(speech, encoding = "utf8")
    texts_obama.append(x.read())
    x.close()

# make a dataframe from your speeches

df_speeches_obama = DataFrame({"Speeches": texts_obama})
df_speeches_obama

In [None]:
df_speeches_obama["Clean"] = df_speeches_obama.Speeches.apply(clean)

In [None]:
df_speeches_obama

In [None]:
df_speeches_obama["nlp_texts"] = df_speeches_obama.Clean.apply(nlp)

In [None]:
df_speeches_obama

In [None]:
df_speeches_obama["Sentences"] = df_speeches_obama.nlp_texts.apply(sentences)

In [None]:
df_speeches_obama

In [None]:
df_speeches_obama["Frequent_nouns"] = df_speeches_obama.nlp_texts.apply(frequent_nouns)

In [None]:
df_speeches_obama