# Pandas 
This notebook introduces pandas dataframes and a handful of practical methods. Pandas is a must if you use Python to work with data that is organized in rows and columns.

We have two folders with tales, and we are working towards putting the texts together into a dataframe. Once we have assembled the texts, we will start examining them.

In [2]:
# import the libraries
import os
from pathlib import Path
import pandas as pd
import re

import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

## Define the path to the folder with Grimms' fairy tales

Define the path to the folder with Grimms' fairy tales.

In [3]:
# Folder path
text_folder = Path.cwd() / '../data/txt_files/grimm'
text_files = os.listdir(text_folder)
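`os.listdir` returns every entry in the folder in no guaranteed order. The sketch below shows one possible alternative that keeps only `.txt` files and sorts them. The helper name `list_text_files` and the temporary demo folder are illustrative assumptions, not part of the notebook's data.

```python
from pathlib import Path
import tempfile

def list_text_files(folder):
    """Return the names of all .txt files in `folder`, sorted alphabetically."""
    return sorted(p.name for p in Path(folder).glob('*.txt'))

# Demonstration in a temporary folder (a stand-in for the real data folder)
with tempfile.TemporaryDirectory() as tmp:
    for name in ['RAPUNZEL.txt', 'ASHPUTTEL.txt', 'notes.md']:
        (Path(tmp) / name).write_text('demo', encoding='utf-8')
    demo_files = list_text_files(tmp)

print(demo_files)  # ['ASHPUTTEL.txt', 'RAPUNZEL.txt']
```

Sorting makes the row order of the later dataframe reproducible across machines.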

## Save the titles in a list

Save the titles in a list.

In [5]:
titles = []
for i in text_files:
    title = str(i[:-4])
    titles.append(title)

print (f'Total amount of titles : {len(titles)}.\nThe titles are: \n{titles}')

Total amount of titles : 62.
The titles are: 
['ASHPUTTEL', 'BRIAR ROSE', 'CAT AND MOUSE IN PARTNERSHIP', 'CAT-SKIN', 'CLEVER ELSIE', 'CLEVER GRETEL', 'CLEVER HANS', 'DOCTOR KNOWALL', 'FREDERICK AND CATHERINE', 'FUNDEVOGEL', 'HANS IN LUCK', 'HANSEL AND GRETEL', 'IRON HANS', 'JORINDA AND JORINDEL', 'KING GRISLY-BEARD', 'LILY AND THE LION', 'LITTLE RED-CAP [LITTLE RED RIDING HOOD]', 'MOTHER HOLLE', 'OLD SULTAN', 'RAPUNZEL', 'RUMPELSTILTSKIN', 'SNOW-WHITE AND ROSE-RED', 'SNOWDROP', 'SWEETHEART ROLAND', 'THE ADVENTURES OF CHANTICLEER AND PARTLET', 'THE BLUE LIGHT', 'THE DOG AND THE SPARROW', 'THE ELVES AND THE SHOEMAKER', 'THE FISHERMAN AND HIS WIFE', 'THE FOUR CLEVER BROTHERS', 'THE FOX AND THE CAT', 'THE FOX AND THE HORSE', 'THE FROG-PRINCE', 'THE GOLDEN BIRD', 'THE GOLDEN GOOSE', 'THE GOOSE-GIRL', 'THE JUNIPER-TREE', 'THE KING OF THE GOLDEN MOUNTAIN', 'THE LITTLE PEASANT', 'THE MISER IN THE BUSH', 'THE MOUSE, THE BIRD, AND THE SAUSAGE', 'THE OLD MAN AND HIS GRANDSON', 'THE PINK', 'THE Q
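The slice `i[:-4]` works because every file name ends in `.txt`. As a small sketch, `Path(...).stem` removes the extension whatever its length. The file names below are made up for illustration.

```python
from pathlib import Path

# Hypothetical file names; the second one has a five-character extension
file_names = ['ASHPUTTEL.txt', 'MOTHER HOLLE.text']

sliced = [name[:-4] for name in file_names]       # cuts a fixed number of characters
stems = [Path(name).stem for name in file_names]  # cuts the actual extension

print(sliced)  # ['ASHPUTTEL', 'MOTHER HOLLE.']
print(stems)   # ['ASHPUTTEL', 'MOTHER HOLLE']
```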

## Open the texts and save them in a list

In [6]:
texts = []
for item in text_files:
    with open( text_folder / item, 'r', encoding='utf-8-sig') as f:
        text = f.read()
        texts.append(text)

## Store data in a dataframe

In [7]:
df1 = pd.DataFrame({'title': titles,
                  'text':texts,
                  'author': 'grimms'})

df1

Unnamed: 0,title,text,author
0,ASHPUTTEL,ASHPUTTEL\n\n\n\n\n\nThe wife of a rich man fe...,grimms
1,BRIAR ROSE,BRIAR ROSE\n\n\n\n\n\nA king and queen once up...,grimms
2,CAT AND MOUSE IN PARTNERSHIP,CAT AND MOUSE IN PARTNERSHIP\n\n\n\n\n\nA cert...,grimms
3,CAT-SKIN,"CAT-SKIN\n\n\n\n\n\nThere was once a king, who...",grimms
4,CLEVER ELSIE,CLEVER ELSIE\n\n\n\n\n\nThere was once a man w...,grimms
...,...,...,...
57,THE WEDDING OF MRS FOX,THE WEDDING OF MRS FOX\n\n\n\n\n\nFIRST STORY\...,grimms
58,THE WHITE SNAKE,THE WHITE SNAKE\n\n\n\n\n\nA long time ago the...,grimms
59,THE WILLOW-WREN AND THE BEAR,THE WILLOW-WREN AND THE BEAR\n\n\n\n\n\nOnce i...,grimms
60,THE WOLF AND THE SEVEN LITTLE KIDS,THE WOLF AND THE SEVEN LITTLE KIDS\n\n\n\n\n\n...,grimms
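Note that `'author': 'grimms'` is a single string, while `titles` and `texts` are lists. Pandas repeats a scalar value for every row. A minimal sketch with made-up placeholder values:

```python
import pandas as pd

# Two list columns of equal length plus one scalar column
demo = pd.DataFrame({'title': ['A', 'B', 'C'],
                     'text': ['first text', 'second text', 'third text'],
                     'author': 'grimms'})

print(demo.shape)               # (3, 3)
print(demo['author'].tolist())  # ['grimms', 'grimms', 'grimms']
```

The list columns must have the same length, otherwise pandas raises a `ValueError`.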


## Build a function from the code above and create a dataframe with HCA's fairy tales

We're gathering the various code snippets you've written above. We'll collect them into a cell and use them to build a function that can construct a DataFrame from a folder of texts.

In [8]:
# path to the folder with the texts
text_folder = Path.cwd() / '../data/txt_files/hca'
text_files = os.listdir(text_folder)

def build_df(text_files, author_name):
    titles = []
    for i in text_files:
        title = str(i[:-4])
        titles.append(title)

    texts = []
    for item in text_files:
        with open( text_folder / item, 'r', encoding='utf-8-sig') as f:
            text = f.read()
            texts.append(text)

    df = pd.DataFrame({'title': titles,
                      'text':texts,
                      'author': author_name})
    
    return df

df2 = build_df(text_files, 'hca')

df2

Unnamed: 0,title,text,author
0,THE BELL,THE BELL\n\n\n\nPeople said “The Evening Bell ...,hca
1,THE DREAM OF LITTLE TUK,"THE DREAM OF LITTLE TUK\n\n\n\nAh! yes, that w...",hca
2,THE ELDERBUSH,THE ELDERBUSH\n\n\n\nOnce upon a time there wa...,hca
3,THE EMPEROR'S NEW CLOTHES,THE EMPEROR'S NEW CLOTHES\n\n\n\nMany years ag...,hca
4,THE FALSE COLLAR,THE FALSE COLLAR\n\n\n\nThere was once a fine ...,hca
5,THE FIR TREE,THE FIR TREE\n\n\n\nOut in the woods stood a n...,hca
6,THE HAPPY FAMILY,"THE HAPPY FAMILY\n\n\n\nReally, the largest gr...",hca
7,THE LEAP-FROG,"THE LEAP-FROG\n\n\n\nA Flea, a Grasshopper, an...",hca
8,THE LITTLE MATCH GIRL,THE LITTLE MATCH GIRL\n\n\n\nMost terribly col...,hca
9,THE NAUGHTY BOY,"THE NAUGHTY BOY\n\n\n\nAlong time ago, there l...",hca
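`build_df` reads files from the global variable `text_folder`, so it only works if that variable points to the right folder when the function is called. The sketch below shows one possible variant that takes the folder as a parameter instead. The name `build_df_from_folder` and the temporary demo files are assumptions made for illustration.

```python
from pathlib import Path
import tempfile
import pandas as pd

def build_df_from_folder(folder, author_name):
    """Build a dataframe with title, text and author from the .txt files in `folder`."""
    paths = sorted(Path(folder).glob('*.txt'))
    return pd.DataFrame({
        'title': [p.stem for p in paths],
        'text': [p.read_text(encoding='utf-8-sig') for p in paths],
        'author': author_name,
    })

# Demonstration with two placeholder files in a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / 'THE BELL.txt').write_text('placeholder text one', encoding='utf-8')
    (Path(tmp) / 'THE FIR TREE.txt').write_text('placeholder text two', encoding='utf-8')
    demo_df = build_df_from_folder(tmp, 'hca')

print(demo_df)
```

With this variant the folder and the author label always travel together in one call.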


## pd.concat()
Append the rows of one dataframe to another dataframe.

In [9]:
df = pd.concat([df1, df2]).reset_index(drop=True)
df

Unnamed: 0,title,text,author
0,ASHPUTTEL,ASHPUTTEL\n\n\n\n\n\nThe wife of a rich man fe...,grimms
1,BRIAR ROSE,BRIAR ROSE\n\n\n\n\n\nA king and queen once up...,grimms
2,CAT AND MOUSE IN PARTNERSHIP,CAT AND MOUSE IN PARTNERSHIP\n\n\n\n\n\nA cert...,grimms
3,CAT-SKIN,"CAT-SKIN\n\n\n\n\n\nThere was once a king, who...",grimms
4,CLEVER ELSIE,CLEVER ELSIE\n\n\n\n\n\nThere was once a man w...,grimms
...,...,...,...
74,THE SHADOW,THE SHADOW\n\n\n\nIt is in the hot lands that ...,hca
75,THE SHOES OF FORTUNE,THE SHOES OF FORTUNE\n\n\n\nI. A Beginning\n\n...,hca
76,THE SNOW QUEEN,THE SNOW QUEEN\n\n\n\nFIRST STORY. Which Treat...,hca
77,THE STORY OF A MOTHER,THE STORY OF A MOTHER\n\n\n\nA mother sat ther...,hca
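Without `reset_index(drop=True)`, the combined dataframe keeps the original row labels of both inputs, so labels repeat. A small sketch with made-up rows; `ignore_index=True` is an equivalent way to renumber the rows.

```python
import pandas as pd

a = pd.DataFrame({'title': ['A', 'B'], 'author': 'grimms'})
b = pd.DataFrame({'title': ['C'], 'author': 'hca'})

stacked = pd.concat([a, b])                        # keeps the old labels
renumbered = pd.concat([a, b], ignore_index=True)  # numbers the rows from 0

print(stacked.index.tolist())     # [0, 1, 0]
print(renumbered.index.tolist())  # [0, 1, 2]
```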


## Subset
The fairy tales are now collected in one dataframe. Once the data is gathered, we can start creating subsets, that is, dividing the data into smaller datasets. Let's try it and create different subsets.

In [10]:
df_hca = df[df['author'] == 'hca']
df_grimm = df[df['author'] == 'grimms']
df_queen = df[df['title'].str.contains('QUEEN')]
df_king = df[df['text'].str.contains('king')]
df_king

Unnamed: 0,title,text,author
0,ASHPUTTEL,ASHPUTTEL\n\n\n\n\n\nThe wife of a rich man fe...,grimms
1,BRIAR ROSE,BRIAR ROSE\n\n\n\n\n\nA king and queen once up...,grimms
2,CAT AND MOUSE IN PARTNERSHIP,CAT AND MOUSE IN PARTNERSHIP\n\n\n\n\n\nA cert...,grimms
3,CAT-SKIN,"CAT-SKIN\n\n\n\n\n\nThere was once a king, who...",grimms
5,CLEVER GRETEL,CLEVER GRETEL\n\n\n\n\n\nThere was once a cook...,grimms
...,...,...,...
74,THE SHADOW,THE SHADOW\n\n\n\nIt is in the hot lands that ...,hca
75,THE SHOES OF FORTUNE,THE SHOES OF FORTUNE\n\n\n\nI. A Beginning\n\n...,hca
76,THE SNOW QUEEN,THE SNOW QUEEN\n\n\n\nFIRST STORY. Which Treat...,hca
77,THE STORY OF A MOTHER,THE STORY OF A MOTHER\n\n\n\nA mother sat ther...,hca
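`str.contains('king')` matches the letter sequence anywhere, so words such as "looking" also count, and the match is case-sensitive. The sketch below contrasts this with a whole-word, case-insensitive search. The three sentences are invented for the demonstration.

```python
import pandas as pd

demo = pd.DataFrame({'text': ['She was looking at the sky.',
                              'The King rode home.',
                              'A king and a queen.']})

# Plain substring search: matches 'looking', misses 'King'
substring = demo['text'].str.contains('king')

# Whole-word search with word boundaries, ignoring case
whole_word = demo['text'].str.contains(r'\bking\b', case=False, regex=True)

print(substring.tolist())   # [True, False, True]
print(whole_word.tolist())  # [False, True, True]
```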


## Methods in Pandas

We would like to examine how different fairy tales differ in their use of animals. Such an analysis could, for example, include an investigation into the number of times animals are mentioned in the tales. When calculating this, we will need to use various pandas methods, but we should also keep in mind that the tales are not of equal length. So, if we simply count the number of times animals are mentioned in the respective tales, it would be difficult to compare the results. The solution is to normalize the results by dividing the number of times an animal appears (hits) by the total number of words.
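To see why normalization matters, here is a small worked example with invented numbers. The helper name `per_10k` is an assumption for illustration; the scaling factor of 10,000 matches the one used in this notebook.

```python
def per_10k(hits, total_words):
    """Return the number of hits per 10,000 words."""
    return hits / total_words * 10000

# A short tale with 5 hits in 1,000 words and a long tale with 10 hits in 10,000 words
short_tale = per_10k(5, 1000)
long_tale = per_10k(10, 10000)

print(round(short_tale, 1), round(long_tale, 1))
```

The long tale has twice as many raw hits, but relative to its length the animal is mentioned about five times less often.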

It is often easiest to start by building something simple and then transform the code into a function that can handle multiple tasks at once. We'll begin by building something that can investigate a single animal occurrence and proceed from there.

In [11]:
search_term = 'bird'

data_frame = df.copy()

# Number of hits: how many times the search term occurs across all texts
hits = data_frame['text'].str.count(search_term).sum()

# Total number of words across all texts
total = data_frame['text'].str.split().str.len().sum()

# Relative frequency: hits per 10,000 words
rf = hits / total * 10000

print(rf)




### Rebuild into a function ### 

def rel_freq(data_frame, column_name, search_term):
    hits1 = data_frame[column_name].str.count(search_term).sum()
    total1 = data_frame[column_name].str.split().str.len().sum()
    rf1 = ( hits1 / total1 ) * 10000
    return rf1


## Test the function

rel_freq(df, 'text', 'bird')

10.795694728969533


10.795694728969533
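`str.count('bird')` counts the letter sequence, so "birds" and "birdcage" are included, while "Bird" with a capital letter is not. If whole words are wanted, one possible variant uses a regular expression with word boundaries. The function name `rel_freq_whole_word` and the demo sentences are assumptions made for illustration.

```python
import re
import pandas as pd

def rel_freq_whole_word(data_frame, column_name, search_term):
    """Hits per 10,000 words, counting only whole-word, case-insensitive matches."""
    pattern = rf'\b{re.escape(search_term)}\b'
    hits = data_frame[column_name].str.count(pattern, flags=re.IGNORECASE).sum()
    total = data_frame[column_name].str.split().str.len().sum()
    return hits / total * 10000

demo = pd.DataFrame({'text': ['The bird sang.', 'Birds and a birdcage.']})

substring_hits = demo['text'].str.count('bird').sum()
whole_word_hits = demo['text'].str.count(r'\bbird\b', flags=re.IGNORECASE).sum()

print(substring_hits, whole_word_hits)  # 2 1
print(rel_freq_whole_word(demo, 'text', 'bird'))
```

Which variant is appropriate depends on the question: the substring count merges singular and plural, the whole-word count keeps them apart.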

## Test the function with a list of animals
We have a function that can take a dataframe, a column name, and an animal and return a relative frequency. It is relatively straightforward to expand the analysis to include a longer list of animal names in both singular and plural forms. We can iterate through the list using a loop and use the function to return relative frequencies.

In [12]:
fairy_tale_animals = [
    "swan", "swans",
    "mermaid", "mermaids",
    "nightingale", "nightingales",
    "duck", "ducks",
    "fish", "fish",
    "butterfly", "butterflies",
    "beetle", "beetles",
    "swallow", "swallows",
    "dolphin", "dolphins",
    "seahorse", "seahorses",
    "wolf", "wolves",
    "frog", "frogs",
    "donkey", "donkeys",
    "raven", "ravens",
    "bird", "birds",
    "cat", "cats",
    "mouse", "mice",
    "fox", "foxes",
    "horse", "horses",
    "pig", "pigs",
    "bear", "bears",
    "lion", "lions",
    "snake", "snakes"
]


for i in fairy_tale_animals:
    rf = round(rel_freq(df_hca, 'text', i), 4)
    print(f'Relative frequency of {i}: {rf}')

Relative frequency of swan: 0.3765
Relative frequency of swans: 0.3765
Relative frequency of mermaid: 0.0
Relative frequency of mermaids: 0.0
Relative frequency of nightingale: 0.9414
Relative frequency of nightingales: 0.0
Relative frequency of duck: 0.0
Relative frequency of ducks: 0.0
Relative frequency of fish: 0.9414
Relative frequency of fish: 0.9414
Relative frequency of butterfly: 0.0
Relative frequency of butterflies: 0.0
Relative frequency of beetle: 0.0
Relative frequency of beetles: 0.0
Relative frequency of swallow: 0.9414
Relative frequency of swallows: 0.5648
Relative frequency of dolphin: 0.0
Relative frequency of dolphins: 0.0
Relative frequency of seahorse: 0.0
Relative frequency of seahorses: 0.0
Relative frequency of wolf: 0.0
Relative frequency of wolves: 0.1883
Relative frequency of frog: 1.6945
Relative frequency of frogs: 0.1883
Relative frequency of donkey: 0.0
Relative frequency of donkeys: 0.0
Relative frequency of raven: 0.3765
Relative frequency of ravens: 0.1883
Relative frequency of 
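Printed lines are hard to compare across authors. As a sketch, the results can be collected in a dataframe and pivoted into a table with one column per author. The tiny demo dataframe is invented; `rel_freq` is repeated here only so the example runs on its own.

```python
import pandas as pd

def rel_freq(data_frame, column_name, search_term):
    hits = data_frame[column_name].str.count(search_term).sum()
    total = data_frame[column_name].str.split().str.len().sum()
    return hits / total * 10000

# Invented mini-corpus standing in for the real dataframe
demo = pd.DataFrame({
    'text': ['a frog and a swan', 'the fox saw a frog', 'one swan two swans here'],
    'author': ['hca', 'grimms', 'hca'],
})
animals = ['frog', 'swan', 'fox']

# One row per (author, animal) pair
rows = []
for author, group in demo.groupby('author'):
    for animal in animals:
        rows.append({'author': author, 'animal': animal,
                     'rel_freq': rel_freq(group, 'text', animal)})

results = pd.DataFrame(rows)
table = results.pivot(index='animal', columns='author', values='rel_freq').round(1)
print(table)
```

A table like this can be sorted, filtered, or plotted directly.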

## Data processing with the .apply() and .iterrows() methods

We want to further analyze the animal contexts within the fairy tales. To get an overview of the relevant contexts, we first break down the texts into sentences. Then, we search for sentences that contain an animal name, and finally, we print the relevant sentences.

To do this, we will need the .apply() and .iterrows() methods.

We'll start with .apply(), which we always use in conjunction with a function.

First, we define the function, and then we apply it.

In [13]:
# 'split_text' takes a text as input and splits it on periods.
# The resulting segments are roughly sentences; the column is still named 'paragraphs'.

def split_text(text):
    paragraphs = text.split('.')  # Split the text into segments, using periods as separators.
    return paragraphs  # Return the list of segments.


# Apply split_text to each value in the 'text' column of the DataFrame 'df'
# and store the resulting lists in a new column called 'paragraphs'.
df['paragraphs'] = df['text'].apply(split_text)

In [14]:
df

Unnamed: 0,title,text,author,paragraphs
0,ASHPUTTEL,ASHPUTTEL\n\n\n\n\n\nThe wife of a rich man fe...,grimms,[ASHPUTTEL\n\n\n\n\n\nThe wife of a rich man f...
1,BRIAR ROSE,BRIAR ROSE\n\n\n\n\n\nA king and queen once up...,grimms,[BRIAR ROSE\n\n\n\n\n\nA king and queen once u...
2,CAT AND MOUSE IN PARTNERSHIP,CAT AND MOUSE IN PARTNERSHIP\n\n\n\n\n\nA cert...,grimms,[CAT AND MOUSE IN PARTNERSHIP\n\n\n\n\n\nA cer...
3,CAT-SKIN,"CAT-SKIN\n\n\n\n\n\nThere was once a king, who...",grimms,"[CAT-SKIN\n\n\n\n\n\nThere was once a king, wh..."
4,CLEVER ELSIE,CLEVER ELSIE\n\n\n\n\n\nThere was once a man w...,grimms,[CLEVER ELSIE\n\n\n\n\n\nThere was once a man ...
...,...,...,...,...
74,THE SHADOW,THE SHADOW\n\n\n\nIt is in the hot lands that ...,hca,[THE SHADOW\n\n\n\nIt is in the hot lands that...
75,THE SHOES OF FORTUNE,THE SHOES OF FORTUNE\n\n\n\nI. A Beginning\n\n...,hca,"[THE SHOES OF FORTUNE\n\n\n\nI, A Beginning\n..."
76,THE SNOW QUEEN,THE SNOW QUEEN\n\n\n\nFIRST STORY. Which Treat...,hca,"[THE SNOW QUEEN\n\n\n\nFIRST STORY, Which Tre..."
77,THE STORY OF A MOTHER,THE STORY OF A MOTHER\n\n\n\nA mother sat ther...,hca,[THE STORY OF A MOTHER\n\n\n\nA mother sat the...
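Splitting on `'.'` removes the periods and ignores question marks and exclamation marks; it also cuts headings such as "I. A Beginning" in two, as the table above shows. A possible refinement, sketched below, splits after `.`, `!` or `?` when whitespace follows. The function name `split_sentences` is an assumption, and this simple rule still breaks on abbreviations.

```python
import re
import pandas as pd

def split_sentences(text):
    """Split a text after ., ! or ? followed by whitespace; punctuation stays attached."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

demo = pd.DataFrame({'text': ['What could he do? He could not speak. He stood still!']})
demo['sentences'] = demo['text'].apply(split_sentences)

print(demo.loc[0, 'sentences'])
# ['What could he do?', 'He could not speak.', 'He stood still!']
```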


## Use .iterrows()

In [15]:
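# 'find_word' checks whether a word occurs in a text as a whole word (case-insensitive).
# It is defined here as a helper; 'get_paragraphs' below uses a plain substring check instead.
import re

def find_word(word, text):
    # Build a regular expression that matches the word only between word boundaries,
    # so 'bird' does not match 'birds' or 'birdcage'
    pattern = r'\b{}\b'.format(re.escape(word))

    # Search for the pattern in the text
    if re.search(pattern, text, re.IGNORECASE):
        return text
    else:
        return f'The word {word} is not in the text'


# 'get_paragraphs' takes a DataFrame and a search term and prints every segment that contains the term.
def get_paragraphs(data_frame, search_term):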
    # Iterate over rows using iterrows()
    for index, row in data_frame.iterrows():
        for p in row['paragraphs']:
            # Check if the search term exists in the current paragraph
            if search_term in p:
                # Retrieve author and title information from the current row
                au = row['author']
                ti = row['title']
                # Print a separator line
                print('*' * 20)
                # Print a message indicating where the search term was found
                print(f'In {au.upper()} fairy tale {ti}, the word {search_term.upper()} is in this paragraph:\n {p}')
                print('\n')

# Call the 'get_paragraphs' function with the DataFrame 'df' and the search term 'nightingale'
get_paragraphs(df, 'nightingale') 

********************
In GRIMMS fairy tale JORINDA AND JORINDEL, the word NIGHTINGALE is in this paragraph:
  Jorindel turned to see the reason, and

beheld his Jorinda changed into a nightingale, so that her song ended

with a mournful _jug, jug_


********************
In GRIMMS fairy tale JORINDA AND JORINDEL, the word NIGHTINGALE is in this paragraph:
 



She mumbled something to herself, seized the nightingale, and went away

with it in her hand


********************
In GRIMMS fairy tale JORINDA AND JORINDEL, the word NIGHTINGALE is in this paragraph:
  Poor Jorindel saw the nightingale was gone--but

what could he do? He could not speak, he could not move from the spot

where he stood


********************
In GRIMMS fairy tale JORINDA AND JORINDEL, the word NIGHTINGALE is in this paragraph:
  He looked around at

the birds, but alas! there were many, many nightingales, and how then

should he find out which was his Jorinda? While he was thinking what to

do, he saw the fairy had
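`.iterrows()` is easy to read, but the same search can also be written without an explicit loop. The sketch below uses `.explode()` to give every segment its own row and then filters with `.str.contains()`. The demo dataframe is invented and only mimics the structure of `df`.

```python
import pandas as pd

demo = pd.DataFrame({
    'title': ['TALE ONE', 'TALE TWO'],
    'author': ['grimms', 'hca'],
    'paragraphs': [['A nightingale sang', ' The king slept'],
                   ['No birds here', ' Two nightingales flew']],
})

# One row per segment; title and author are repeated for each segment
long_form = demo.explode('paragraphs').rename(columns={'paragraphs': 'paragraph'})

# Keep only the segments that contain the search term (plain substring match)
hits = long_form[long_form['paragraph'].str.contains('nightingale', regex=False)]

print(hits[['author', 'title', 'paragraph']])
```

The result is itself a dataframe, so the matching segments can be counted, grouped, or saved instead of only printed.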

## Applying functions

In [16]:
import re
def calculate_ttr(text):

    # Transform the text to lower case
    text = text.lower()

    # Replace hyphens with spaces so the parts of hyphenated words are counted separately
    text = text.replace('-', ' ')

    # Build the word list
    words = re.findall(r'\b\S+\b', text)

    # Count the total number of words (tokens)
    total_tokens = len(words)

    # Count the number of unique words (types)
    unique_types = len(set(words))

    # Calculate TTR (Type-Token Ratio): the number of unique words divided by the total number of words
    ttr = unique_types / total_tokens

    return ttr

In [17]:
df['TTR'] = df['text'].apply(calculate_ttr)
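A small worked example may make the ratio concrete. The sketch below follows the same steps as `calculate_ttr` in compact form and adds a guard for empty texts, which would otherwise cause a division by zero. The name `type_token_ratio` is an assumption for illustration.

```python
import re

def type_token_ratio(text):
    """Unique words divided by total words; returns 0.0 for a text without words."""
    words = re.findall(r'\b\S+\b', text.lower().replace('-', ' '))
    return len(set(words)) / len(words) if words else 0.0

# Six tokens (the, cat, saw, the, other, cat) and four types (the, cat, saw, other)
ttr = type_token_ratio('The cat saw the other cat')
print(round(ttr, 3))  # 0.667
```

A higher value means less repetition. Because longer texts tend to repeat words more, TTR generally falls as text length grows.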

## Groupby

In [18]:
df.groupby('author')['TTR'].mean()

author
grimms    0.293228
hca       0.296339
Name: TTR, dtype: float64
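The two means are very close, so it can help to look at more than one statistic per group. The sketch below uses `.agg()` with named aggregations. The numbers and the `n_words` column are invented for the demonstration and are not results from the fairy tales.

```python
import pandas as pd

# Invented values; 'n_words' is a hypothetical column holding the length of each tale
demo = pd.DataFrame({'author': ['grimms', 'grimms', 'hca'],
                     'TTR': [0.30, 0.28, 0.31],
                     'n_words': [1000, 2000, 1500]})

summary = demo.groupby('author').agg(
    mean_ttr=('TTR', 'mean'),
    median_ttr=('TTR', 'median'),
    tales=('TTR', 'count'),
    mean_words=('n_words', 'mean'),
)
print(summary)
```

Showing the average text length next to the TTR is useful because TTR depends on how long the texts are.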