# Today, a Tale About Pythons and Pandas
![pandas](https://miro.medium.com/max/1400/1*6d5dw6dPhy4vBp2vRW6uzw.png)

# 1. Introduction

## 1.1 Recap
So far, we discussed
- the basics of **programming** in Python 
- elementary techniques for **text processing**

"Text" came primarily as **string variables** or `.txt` files.

## 1.2 This Session: Analysing Texts Embedded in Tabular Data


- In practice, text is often embedded in other **formats**.
- Common types are CSV and XML files (or JSON, but we covered that briefly in the previous sessions).
- In the morning, we look at tabular data (CSV). In the afternoon, we cover XML.
  

More generally, we have to talk about data **formats**...
    
![img](https://media.giphy.com/media/3d5O10XObbr8LW4bDY/giphy.gif)

 ... but also, we explore more **realistic research scenarios** using our novel computational skills!
 
 ![img](https://media.giphy.com/media/UuebWyG4pts3rboawU/giphy.gif)

In order to get there, we need to cover quite some ground in one session. 

This session may feel like an information overload (at times) but don't worry if you don't understand everything at once. It took me multiple years.

It will take time, but I hope that at the end of this notebook you have an intuition (before having mastered the skills). Patience is key, sorry...

![](https://media.giphy.com/media/UxREcFThpSEqk/giphy.gif)

# 2. What is Tabular Data?

- Organized in **rows** (observations) and **columns** (features/attributes of these observations). [Example](https://github.com/Living-with-machines/dhoxss-text2tech/blob/dev/Sessions/data/names_extract.csv)
- Tabular, structured data is often stored in CSV format (Comma-Separated Values). [Example](https://raw.githubusercontent.com/Living-with-machines/dhoxss-text2tech/dev/Sessions/data/names_extract.csv)
- A very common format for **storing** and **exchanging** text data (think of spreadsheets!)
- Most importantly, this format relates **text to context** (e.g. a name (text) to its year and frequency (context)).

**Studying the relation between text and context is the focus of this session.**



# 3. Pandas to the rescue! (or libraries continued)

The library we will be looking at in this session is called 'Pandas' (and explains all the excellent gifs in this Notebook ;-) )

![pandas](https://media.giphy.com/media/EatwJZRUIv41G/giphy.gif)

Pandas' functionalities are the bread and butter for doing **data science** in Python. 


## Ho, wait...
![](https://media.giphy.com/media/o7OChVtT1oqmk/giphy.gif)

We are in a course on text analysis, not (applied) data science.

#### Yes, but:

- any venture into text mining will require some data science skills (a critical aspect of "distant reading").
- in practice, these fields overlap 
- yes, **there will be numbers**

Text-as-data often involves turning texts into numbers, which we then analyse.

Let's start with importing Pandas into our notebook.

In [None]:
import pandas as pd # import pandas using pd as abbreviation
print(pd.__doc__) # print the __doc__ attribute (the docstring) attached to pd

Notice that using `pd` as an abbreviation for `pandas` is merely a **convention**, and you are free to use any other (syntactically acceptable) shorthand.

The command below 
```python
import pandas as dfsfgjrelfgdjgkldsjgkdfgjdfklgjdfklgjkflskfdklfsk 
```

will work but will also make your life miserable...

## 3.1 Opening a CSV File

Given a path, Pandas will **open**, **read** and **parse** the CSV file and return it as a `DataFrame` object. In this case, we are loading more "serious" data, i.e. a frequency list of American baby names from 1880 to the present.

This lecture is based on two excellent books:
  - Chapter 3 of [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/) by Jake VanderPlas
  - Chapter 4 of [Humanities Data Analysis](https://www.humanitiesdataanalysis.org/) by Folgert Karsdorp, Mike Kestemont, and Allen Riddell.

In [None]:
%%bash
# run this cell before continuing
# we need to unzip a larger csv file with name data first
unzip data/names.zip -d data # unzip into the data folder
ls data # list the content of the data folder




Let's inspect a realistic example: a list of [American baby names](https://www.ssa.gov/OACT/babynames/limits.html). We can open this `.csv` file as we have previously done. 
- Note the `sep` parameter (when working with `.tsv` files, you need to use `"\t"`)

In [None]:
# sep has ',' as default value, change this parameter to '\t' if you need to open a tsv file
df = pd.read_csv('data/names.csv', sep=',')
df

Never take my word for it, always check the data type!

In [None]:
type(df)
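To see what `sep` does without needing a file on disk, here is a minimal sketch that parses a small tab-separated string held in memory. The content of `tsv_text` and the variable names are made up for illustration; `io.StringIO` simply lets `pd.read_csv` treat a string like a file.

```python
import io
import pandas as pd

# made-up tab-separated content (toy values, not real name statistics)
tsv_text = "name\tsex\tfrequency\nAda\tF\t10\nBob\tM\t20\n"

# with sep='\t' Pandas splits each line on tabs and finds three columns
toy_tsv = pd.read_csv(io.StringIO(tsv_text), sep='\t')

# with the default sep=',' nothing is split, so everything lands in one column
toy_wrong = pd.read_csv(io.StringIO(tsv_text), sep=',')

print(toy_tsv.shape, toy_wrong.shape)
```

If a freshly loaded dataframe has only one column, a wrong separator is a likely culprit.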

## 3.2 Exploring Pandas DataFrames

As with other Python objects, the `DataFrame` comes with a specific set of **attributes** and **methods**, tools that you can use to inspect, analyse and manipulate tabular information.

Important **attributes** for understanding the contours of your data are `.shape` and `.columns`.

In [None]:
df.shape # returns the number of rows and columns as a tuple

In [None]:
df.columns # returns the column names

In [None]:
help(df) # heeeeelp!

We observe the general shape of the dataframe, but what's actually in there? Let's explore the **content** of our dataframe. 

In [None]:
# head allows us to view the first n rows
df.head(4)

In [None]:
df.tail(4)

Data in these spreadsheets are often ordered in certain ways. Therefore `.head()` and `.tail()` can give a misleading impression. `.sample()` allows us to view a random set of n rows.

In [None]:
df.sample(4)

Besides inspecting specific observations (rows), Pandas' `describe()` method returns a **quantitative summary** of all numerical columns in the dataframe.

In [None]:
help(df.describe)

In [None]:
df.describe()

## 3.3 Sequencing methods

This table is not very readable. Let's print it more neatly by using two methods in sequence:
- firstly, we generate the summary dataframe with `describe()`
- secondly, we round the numbers in this summary dataframe to their second decimal

Using dot notation, we can easily create a chain of multiple operations.

The general syntax is:

```python
df.method_1().method_2().method_3()
```

The operations are executed from left to right.

In [None]:
# dot notation -> read from left to right
# describe numerical values in the dataframe
# this returns a dataframe
# round the numerical values to the second decimal
df.describe().round(2)

Note that this is equivalent to the code below, but shorter. Chaining methods like this is very common.

In [None]:
df_described = df.describe()
df_described.round(2)


### ✏️ Questions:

- Which columns are missing and why?
- What do the numbers actually mean? Which ones could be interesting?
- Which value is the **median** in this summary?
- Why are the median and mean so different for the `frequency` column?

In [None]:
# type answer here

`describe()` returns a (statistical) summary for each column that contains **numerical information** (i.e. are of type `int` or `float`). 

For example, you'll notice that the `name` and `sex` columns are not present in the output. When working with dataframes, it is critical to be aware of the **data types** (abbreviated as `dtypes`) present in your table. 

You can access the data types via the `dtypes` attribute.


In [None]:
df.dtypes
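As a small illustration of how data types decide what `describe()` reports, the sketch below builds a toy dataframe (all values made up) with two text columns and two numerical ones. The name `toy` is hypothetical and chosen so that it does not overwrite `df`.

```python
import pandas as pd

# toy dataframe with made-up values: two text columns, two numerical columns
toy = pd.DataFrame({
    'name': ['Mary', 'John', 'Anna'],
    'sex': ['F', 'M', 'F'],
    'year': [1880, 1880, 1881],
    'frequency': [10, 20, 30],
})

print(toy.dtypes)  # one data type per column

# columns that Pandas considers numerical
numeric_cols = toy.select_dtypes(include='number').columns.tolist()

# by default, describe() only summarises those numerical columns
summary_cols = toy.describe().columns.tolist()

print(numeric_cols, summary_cols)
```

Text columns usually show up with the dtype `object` (or a dedicated string type, depending on your Pandas version).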

### ✏️ 1. Exercise:

The `data` folder contains one more interesting dataset (`muse_v3.csv`), the [MuSe](https://www.kaggle.com/datasets/cakiki/muse-the-musical-sentiment-dataset) or The Musical Sentiment Dataset.

On the Kaggle website, the dataset is described as:

> The MuSe dataset contains sentiment information for 90,001 songs. The sentiment conveyed by a song is derived from the social tags given to that song on Last.fm, derived through the Warriner et al. database, and expressed across 3 dimensions:

> valence ("the pleasantness of a stimulus"), arousal ("the intensity of emotion provoked by a stimulus"), dominance ("the degree of control exerted by a stimulus")

- Open the MuSe dataset with Pandas and save the resulting dataframe in the variable `df_muse`.
- Randomly select 10 rows.
- Apply the `describe()` method to `df_muse`, what do these numbers tell you?

In [None]:
%%bash
# run this cell before continuing
# we need to unzip a larger csv file with name data
unzip data/muse_v3.csv.zip -d data # unzip data
ls data # the content

In [None]:
# open muse dataset
# df_muse = 

In [None]:
# randomly select 10 rows

In [None]:
# apply the describe method

Important: by default, `describe()` skips text columns. We need **additional** tools to analyse these fields.

# 4. Analysing Text in Tabular Data

So far we've shown how to import and view CSV data. How can we actually interrogate and work with these files? 

**More specifically, how to use tabular data for text analysis?**



To do this we need to cover three steps:
- **Selecting** data (column and row-wise): what parts of the dataframe are relevant to your research?
- **Manipulating text columns**: converting text to numbers
- **Aggregating data**: analysing trends over time or language use across different groups



## 4.1 Data Selection

- Research often requires **data selection** (or sampling). What information is relevant to your research (e.g. a historical period)?
- Selection criteria based on **content** and/or **metadata**.
- In this example, the `name` column is the **content**, and `year` and `sex` are **metadata**.

Selecting data from lists and dictionaries:

- retrieving values by position:
```python 
l = ['a','b']
l[1]
```
- retrieving values by key:
```python 
d = {'a': 1}
d['a']
```

Dataframes provide more powerful and advanced tools for selecting data compared to those that come with lists or dictionaries.

More technically, a `DataFrame` is a **two-dimensional array** of **indexed data** (which means that you can access content by row **and/or** column).

The code below shows the row and column index.

In [None]:
df.index # row index, index is a list of numbers

In [None]:
df.columns # column index, index is a list of names

## 4.2 Retrieving data column-wise




In the code cell below we get the full `'name'` column.

In [None]:
# a simple way to retrieve a column
df['name']

As an aside, both rows and columns are instances of the `Series` class.

In [None]:
type(df['name'])

### ✏️ 2.  Exercise:

- Select the `year` column in `df`.

In [None]:
# write answer here

You can select multiple columns, but you have to pass the column names as a **list**.

In [None]:
# this won't work
df['name','sex']

In [None]:
# this will work
df[['name','sex']]

### ✏️  3. Exercise:

- Select the "track" and "artist" column from `df_muse`.

In [None]:
#df_muse

## 4.3 Inspecting columns

What information do these columns contain? Pandas provides you with multiple methods for **understanding** and **transforming** the content of a dataframe.

`.unique()` lists all the unique values in a column. It is useful for getting a sense of the (range of) possible values.

In [None]:
df['sex'].unique()

`.value_counts()` returns the frequency of each unique value in a column. Below, we count the number of 'male' and 'female' rows.

Be careful with interpretation here! Does this mean there are fewer female names in this dataframe?

In [None]:
df['sex'].value_counts()

In [None]:
df['sex'].value_counts(normalize=True)

In [None]:
df['sex'].value_counts(normalize=True).plot(kind='bar')
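The difference between plain counts and `normalize=True` is easiest to see on a tiny example. The sketch below uses a made-up series called `toy_sex` (a hypothetical name); it is not the baby names data.

```python
import pandas as pd

# made-up column with four rows
toy_sex = pd.Series(['F', 'F', 'F', 'M'])

# absolute number of rows per unique value
counts = toy_sex.value_counts()

# with normalize=True the counts are divided by the total number of rows,
# so the values are proportions that add up to 1
shares = toy_sex.value_counts(normalize=True)

print(counts)
print(shares)
```

Note that both versions count **rows**, so they say nothing about the values stored in any other column.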

### ✏️ Question:

If we apply `.value_counts()` to the "name" column, what do we see?

In [None]:
df['name'].value_counts()

In [None]:
df[df['name']=='Robert']['sex'].value_counts()

In [None]:
(2016 - 1880) * 2

You can select the top `n` values using normal Python slice notation.

In [None]:
# select the 10 most recurrent names
df['name'].value_counts()[:10]

### ✏️ 4. Exercise:

- Apply `.value_counts()` to the `"artist"` column. 
- Get the names of the ten most frequently mentioned artists in `df_muse`.
- Plot these results in a bar chart.

In [None]:
# write your answer here

### ✏️ 5. Exercise:

- How many different genres are there in the `df_muse` dataframe?
- What are the ten most (and least) popular (i.e. frequent) genres?

In [None]:
# write your answer here

## 4.4 Retrieving information row-wise

### 4.4.1 By index 

`.loc` allows you to select data by row and column. The general syntax for `.loc` is:
```python
df.loc[row_name, column_name]
```

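Similar to lists, we can retrieve rows by their index in the dataframe. We use the same square bracket notation, but preface it with `.loc`, shorthand for index location. Because the row index of `df` is simply 0, 1, 2, and so on, the row label coincides with the row position here.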

In [None]:
df.loc[0]

### ✏️ 6. Exercise:
Get the 11th row in the `df_muse` dataframe.

In [None]:
# write answer here

But Pandas provides more options for retrieving observations by their row index.
Imagine we'd like to access songs by their genre. We have several options at our disposal, but a convenient way would be to set the `genre` column as row index and then retrieve rows based on their genre value (i.e. "rap" or "rock")

In [None]:
# just in case let's load the data again
df_muse = pd.read_csv('data/muse_v3.csv')

In [None]:
df_muse_by_genre = df_muse.set_index('genre') # set the genre column as the index of the dataframe
df_muse_by_genre

In [None]:
rap_songs = df_muse_by_genre.loc['rap'] # select all rows with the index label 'rap'
rap_songs.shape
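To make the `set_index` plus `.loc` pattern concrete, here is a minimal sketch on a toy dataframe. The track names and the variable names (`toy_songs`, `by_genre`) are invented for illustration.

```python
import pandas as pd

# toy dataframe with made-up tracks
toy_songs = pd.DataFrame({
    'track': ['Song A', 'Song B', 'Song C'],
    'genre': ['rap', 'rock', 'rap'],
})

# use the genre column as row index (this returns a new dataframe)
by_genre = toy_songs.set_index('genre')

# 'rap' occurs twice in the index, so we get a dataframe with two rows
rap_rows = by_genre.loc['rap']

# 'rock' occurs only once, so Pandas returns a single row as a Series
rock_row = by_genre.loc['rock']

# passing the label inside a list always returns a dataframe
rock_frame = by_genre.loc[['rock']]

print(rap_rows.shape, type(rock_row), rock_frame.shape)
```

Unlike the keys of a dictionary, index labels do not have to be unique, which is why one label can return many rows.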

### 4.4.2 Masking

Another technique for selecting relevant information (which will often be useful) is called **masking**. In this scenario, 
- we define a boolean expression (i.e. one which evaluates to `True` or `False`)
- apply it to a column
- select all rows for which the mask evaluates to `True`.

Below we show how this works for selecting data within a certain date range.

In [None]:
# first we inspect the data type
df.year.dtype

In [None]:
# we formulate a boolean expression
year = 1899
print(f'{year} > 1900 = ', year > 1900)
print(f'{year} < 1900 = ', year < 1900)

We can apply this expression to a column, which returns a series with the boolean value for each row index.
This series we use as a **mask** for selecting rows in the dataframe.

In [None]:
df['year'] > 1900

In Python `False` is interpreted as zero and `True` as 1. To measure how many rows match a condition we can apply `sum` to a mask.

In [None]:
# we have 1802468 rows with names after 1900
sum(df['year'] > 1900)

In [None]:
# saving subset in a new dataframe
# note that index does not start from zero!
mask = df.year > 1900
df_after_1900 = df[mask]
df_after_1900

In [None]:
sum(df['year'] > 1900) == df_after_1900.shape[0]
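The whole masking workflow fits in a few lines on a toy dataframe, which makes it easier to follow what happens to the row index. All names and values below are made up, and `toy`, `toy_mask` and `subset` are hypothetical variable names.

```python
import pandas as pd

# toy dataframe with made-up values
toy = pd.DataFrame({
    'name': ['Mary', 'John', 'Anna', 'Emma'],
    'year': [1899, 1900, 1901, 1950],
})

# step 1: a boolean Series, one True/False value per row
toy_mask = toy['year'] > 1900

# step 2: True counts as 1, so the sum is the number of matching rows
n_matches = toy_mask.sum()

# step 3: keep only the rows where the mask is True
subset = toy[toy_mask]

# the subset keeps the original row labels (2 and 3 here);
# reset_index(drop=True) renumbers the rows from zero if you prefer that
subset_reset = subset.reset_index(drop=True)

print(n_matches, list(subset.index), list(subset_reset.index))
```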

### ✏️ 7. Exercises:
- What is the output of the code example below?
```python
df['year'].between(1900,1910)
```
- Can you adapt the code to select rows between 1950 and 1960 (all names from the fifties)?

In [None]:
# type answer here

### ✏️ 8. Exercises:

- Select all rows in which the value for the variable 'sex' is equal to 'F'. (Google "is equal to" operator Python)
- Save this in a new Python variable called `names_f`
- How many rows does the dataframe `names_f` contain?
- Repeat the same for male baby names. Apply `.describe()` to `names_f` and `names_m`. What differences do you see?

In [None]:
# type answer here

### 4.4.3 Combining conditions

In [None]:
# endless options
# for example all female names between 1940 and 1945
df[(df.year.between(1940,1945)) & (df.sex=='F')]
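Masks are combined with `&` (and), `|` (or) and `~` (not), and each condition needs its own pair of parentheses. The sketch below shows all three on a toy dataframe; the values and the variable names are made up for illustration.

```python
import pandas as pd

# toy dataframe with made-up values
toy = pd.DataFrame({
    'name': ['Mary', 'John', 'Anna', 'Paul'],
    'sex': ['F', 'M', 'F', 'M'],
    'year': [1939, 1942, 1944, 1950],
})

# AND: both conditions must hold (between() includes both end points by default)
both = toy[(toy['year'].between(1940, 1945)) & (toy['sex'] == 'F')]

# OR: at least one condition must hold
either = toy[(toy['year'] < 1940) | (toy['sex'] == 'M')]

# NOT: flip every True to False and vice versa
negated = toy[~(toy['sex'] == 'F')]

print(list(both['name']), list(either['name']), list(negated['name']))
```

Python's own `and`, `or` and `not` do not work on a whole Series, which is why Pandas relies on these operators instead.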

## 5. Working with Text Columns

Okay, but this is a session about text processing, not applied data science (even though, as said earlier, in practice, you'll need both!). 

In what follows, we will use bespoke functions to explore historical trends in American baby names.


To measure the number of characters in a string you can use `len()`.

In [None]:
len('Mary')

We can apply `len()` to each value in the `'name'` column by simply calling the `.apply()` method and passing the function `len` as an argument.

In [None]:
# counting the number of characters
# in the name column
# this returns a new series object 
df['name'].apply(len)

As an aside, in this specific case we could have used a built-in Pandas method.

In [None]:
df['name'].str.len()

The above operation returns the length of each name but does not store the result. We'd like to save the length of each name in our main dataframe. For this we need to create a new column where we store the output.  

The syntax below may be confusing at first, but basically, we  store the length of each name in a column titled `'name_length'`.

In [None]:
df['name_length'] = df['name'].apply(len)

Revisiting the original dataframe `df`, you'll notice that it now contains an extra column containing the length of each name.

In [None]:
df.head()

### ✏️ 9. Exercise:

Add a column to `df_muse` that records the number of characters in each title.

In [None]:
# type answer here

### ✏️ 10. Exercise:

Using `.sort_values()` we can investigate the longest names (or song titles).

Use `help()` to print the docstring of the `.sort_values` method. Read it carefully and try to figure out how to sort rows by the length of the name.

In [None]:
# type answer here

`len()` is just an example of a function we apply to a text document (in this case it is a built-in function). 

We can create our own custom or bespoke function and **apply** it to all the values in a column.

As earlier, and maybe somewhat unusually, we don't pass a variable of the type string or integer. We pass a function instead.
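If passing a function as an argument feels strange, the plain Python sketch below may help. It mimics, very roughly, what `.apply()` does for a column; `count_vowels` and `apply_to_each` are hypothetical helpers written for this illustration and are not part of Pandas.

```python
def count_vowels(text):
    """Hypothetical helper: count the vowels (a, e, i, o, u) in a string."""
    return sum(1 for char in text.lower() if char in 'aeiou')


def apply_to_each(func, values):
    """Call func on every item in values and collect the results in a list."""
    return [func(value) for value in values]


toy_names = ['Mary', 'Anna', 'Bob']

# we pass the function itself: note the absence of parentheses after len
name_lengths = apply_to_each(len, toy_names)

# any function that accepts one value works, including our own
vowel_counts = apply_to_each(count_vowels, toy_names)

print(name_lengths, vowel_counts)
```

Writing `len` hands over the function; writing `len()` would try to call it straight away.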

For example, let's create a function that checks if a name starts and ends with the letter "a".

In [None]:
def startsenda(name):
    """Check if a name starts and ends with the letter 'a'.
    Argument:
        name (str): the input name to check
    Returns:
        a boolean
    """
    name_lower = name.lower()
    return name_lower.startswith('a') and name_lower.endswith('a')

In [None]:
type(startsenda)

Before we apply the function more widely, let's first check if it works on a simple example.

In [None]:
startsenda('Andrea')

In [None]:
startsenda('Roberta')

Now we apply `startsenda` to the "name" column.

In [None]:
df['name'].apply(startsenda)

In [None]:
df['startsendwitha'] = df['name'].apply(startsenda)

This operation returns a "mask", which we can use to select the rows for which the boolean operation evaluates to `True`. 

In [None]:
mask = df['name'].apply(startsenda)
mask

In [None]:
df_startsenda = df[mask]
df.shape, df_startsenda.shape

In [None]:
df_startsenda['name'].unique()

### ✏️ 11. Exercise:

Add a column to `df_muse` that records the number of white spaces in each title.

In [None]:
# type answer here

### ✏️ 12. Exercise: Palindrome Names

Simple question: what is the proportion of palindrome names (i.e. names that stay the same when reversed, after lowercasing of course)? Think of 'Ada'.


- Write a function that returns True if the name is a palindrome (we will help you)
- Apply this function to the "name" column in the dataframe. It should return a "mask", i.e. a Series that has the value `True` for palindrome names.
- Use the mask to select the subset of palindrome names within `df`.
- Lastly, how many unique names are there in this subset of `df`?

In [None]:
def is_palindrome(name):
    # tip, check if string variable is palindrome
    # name == name[::-1]
    pass  # placeholder so the cell runs; replace it with your solution


# 6. Aggregating and Analysing Text Data

By now we have enough skills to address more realistic research questions. For example, on average, are male names longer than female names (ignoring the frequency by which the names occur for now)?

- compute the length of each name in the dataframe
- for each value in sex, select all rows
- compute the mean for the 'name_length' column

With minimal code you can start investigating this question.

In [None]:
# count the number of characters in each name
df['name_length'] = df['name'].apply(len)

In [None]:
# split the dataframe in female and male
df_m = df[df.sex=='M']
df_f = df[df.sex=='F']

In [None]:
# apply the .mean() method to the name_length column
avg_m = df_m['name_length'].mean()
avg_f = df_f['name_length'].mean()

In [None]:
# print the results nicely
print('Mean length of male names: ',round(avg_m,3),
      '\nMean length of female names: ',round( avg_f,3))

Pandas offers various plot functions that enable you to look more closely at gender (and other) differences. How different are these distributions?

In [None]:
df_m['name_length'].plot(kind='hist',alpha=.75) # blue bars
df_f['name_length'].plot(kind='hist',alpha=.75) # orange bars


## .groupby()

**Grouping** and **aggregating** data are incredibly common operations, and we'll need them almost any time we want to make **comparisons** (over time or between groups). 

Pandas offers a tool `.groupby()` to facilitate comparison (of subsets) **within** a dataframe. [Jake VanderPlas](https://jakevdp.github.io/PythonDataScienceHandbook/), who wrote an excellent introduction to Pandas, has very helpfully visualized the workflow: 

`.groupby()` 
- split the dataframe in sub-dataframes by key
- for each of these sub-dataframes, apply a function (to the values)
- glue these results back together

![grouby](https://jakevdp.github.io/PythonDataScienceHandbook/figures/03.08-split-apply-combine.png)

Following the previous example, **keys** are the values in the `sex` column, and **data** is the length of each name.

In general, you can think of the syntax as:

```python
df.groupby(key)[data].operation()
```

In [None]:
# we split the dataframe by sex and compute the mean name length for each group
df.groupby('sex')['name_length'].mean()
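To see what `.groupby()` does behind the scenes, the sketch below performs the split, apply and combine steps by hand on a toy dataframe and compares the outcome with the one-line version. The names and the variables (`toy`, `manual`, `manual_result`) are made up for illustration.

```python
import pandas as pd

# toy dataframe with made-up names
toy = pd.DataFrame({
    'name': ['Mary', 'Anna', 'John', 'Bob'],
    'sex': ['F', 'F', 'M', 'M'],
})
toy['name_length'] = toy['name'].str.len()

manual = {}
for key in toy['sex'].unique():
    # split: select the rows that belong to this key
    group = toy[toy['sex'] == key]
    # apply: compute the mean name length within the group
    manual[key] = group['name_length'].mean()

# combine: glue the results back together in one Series
manual_result = pd.Series(manual)

# the same computation with groupby
grouped_result = toy.groupby('sex')['name_length'].mean()

print(manual_result)
print(grouped_result)
```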

### ✏️ 13. Exercise:

Let's analyse the relationship between genre and title length.
- add a column to `df_muse` that records the number of characters
- optional: can you also add a column that counts the number of words in a song title
- group the dataframe by genre and compute the average title length


In [None]:
# write your answer here

### ✏️ Exercise (more difficult version).

Let's analyse the relation between genre and title length.
- add a column to `df_muse` that records number of characters
- use `value_counts()` to obtain the 20 most frequent genres
- create a mask to retrieve only songs from these most popular genres
- group the dataframe by genre and compute the average title length
- optional: plot the results as a barplot

In [None]:
# write your answer here

## Trends over time

We can also use `.groupby()` to study trends over time. 

In [None]:
df['end-n'] = df.name.str.endswith('n')

In [None]:
df['end-n']

In [None]:
df.groupby('year')['end-n'].mean().plot()
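Why does taking the mean of a `True`/`False` column make sense? Because `True` counts as 1 and `False` as 0, the mean equals the share of rows that are `True`. A minimal sketch with made-up names (not real statistics) and a hypothetical dataframe `toy`:

```python
import pandas as pd

# toy dataframe: two names for 1900, four names for 2000 (all made up)
toy = pd.DataFrame({
    'name': ['John', 'Mary', 'Ethan', 'Logan', 'Mason', 'Emma'],
    'year': [1900, 1900, 2000, 2000, 2000, 2000],
})

# True if the name ends with 'n'
toy['end-n'] = toy['name'].str.endswith('n')

# the mean per year is the proportion of names ending in 'n' in that year
share_by_year = toy.groupby('year')['end-n'].mean()

print(share_by_year)
```

In 1900 one of two names ends in 'n' (0.5); in 2000 three of four do (0.75).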

### ✏️ 14. Exercise:

Adapt the preceding code. On average, are palindromes becoming more or less common over time?

In [None]:
# write your answer here

## Grouping by multiple keys

We can compute the average name length by gender and year. We simply pass multiple keys as a list:

In [None]:
df['name_length'] = df['name'].apply(len)
by_year_gender = df.groupby(['year','sex'])['name_length'].mean()

In [None]:
by_year_gender

In [None]:
by_year_gender.unstack()

In [None]:
by_year_gender.unstack().plot()
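What `.unstack()` does is easier to see on a small example. Grouping by two keys returns a Series with a two-level row index; `.unstack()` moves the inner level (here `sex`) to the columns, which gives one column per gender. The values below are made up and the names `toy`, `stacked` and `wide` are hypothetical.

```python
import pandas as pd

# toy dataframe with made-up values
toy = pd.DataFrame({
    'year': [1880, 1880, 1881, 1881],
    'sex': ['F', 'M', 'F', 'M'],
    'name_length': [4, 5, 6, 3],
})

# a Series indexed by (year, sex) pairs
stacked = toy.groupby(['year', 'sex'])['name_length'].mean()

# a dataframe with years as rows and one column per sex
wide = stacked.unstack()

print(stacked)
print(wide)
```

This wide layout is also why `.plot()` draws one line per gender.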

### ✏️ 15. Final Exercise:

Genre and pronouns
- Select the 10 most popular genres in `df_muse`
- for each pronoun, "I", "we", "you", "they", add a column that records if this pronoun appears in the song title
> TIP: you can use a regex for this, `r'\bwe\b'` 
- Group the dataframe by genre to see how pronouns differ
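A quick illustration of why the `\b` (word boundary) in the tip matters, using plain Python and a few invented titles. A naive substring check also fires on words that merely contain the letters "we". The variable names and the choice to ignore case are assumptions made for this sketch.

```python
import re

# \b marks a word boundary; IGNORECASE also matches "We" and "WE"
pattern = re.compile(r'\bwe\b', flags=re.IGNORECASE)

# invented titles for illustration
toy_titles = ['We are here', 'The answer', 'Power', 'You and we']

# naive check: is the substring "we" anywhere in the lowercased title?
naive = ['we' in title.lower() for title in toy_titles]

# regex check: does "we" occur as a separate word?
with_boundary = [bool(pattern.search(title)) for title in toy_titles]

print(naive)
print(with_boundary)
```

The naive check flags "The answer" and "Power" as well, while the word boundary version only keeps the titles in which "we" stands on its own.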

In [None]:
# write your answer here

# Fin.

# Appendix: Advanced Topics

## Explicit and Implicit indexing

In [None]:
# let's make a toy dataframe
mock_df = pd.DataFrame([[1,4,6],
                        [2,3,5],
                        [0,9,6],
                        [7,8,9]], 
                       columns= ['c1','c2','c3'], 
                       index=['r1','r2','r3','r4'])
mock_df

A dataframe has an **explicit** index, which consists of the **row names** (i.e. "r1", "r2" etc.). With `.loc` we can access rows by their explicit index (also using **slice** notation!)

In [None]:
mock_df.loc['r2']

In [None]:
# slice notation, notice I am not using numbers here!
mock_df.loc['r2':"r4"]

The **implicit** index is the Python-style position-wise index, which looks similar to lists and starts at 0! We can retrieve rows by their position using `.iloc`.

In [None]:
mock_df.iloc[0]

In [None]:
# notice that this is different to mock_df.loc['r2':"r4"]
mock_df.iloc[1:3]

For our main `df` using `iloc` or `loc` doesn't make a difference when we want to access just one specific row.

In [None]:
df.iloc[3]

In [None]:
df.loc[3]

However, when using slice notation, `.iloc` and `.loc` do not return exactly the same rows.

### ✏️ Exercise:

 Get the first five rows with `iloc` and `loc`.

In [None]:
# write your answer here

## Advanced: "Fancy" Indexing

To retrieve multiple columns, you can use so-called **'fancy' indexing** by passing column names as a list or an array. (Often we do not need all the data!)

In [None]:
df.loc[:,['name','sex']]

In [None]:
# or simpler
df[['sex','name']]

In [None]:
# you can apply fancy indexing to both rows and columns!
# cool, no?!
df.loc[[3,5],['name','sex']]

In [None]:
# you can combine multiple indexing strategies, i.e. slicing and fancy indexing
df.loc[3:5,['name','sex']]

In [None]:
df_muse_by_genre.loc['rap',['track','artist']]

## Hierarchical Index

In [None]:
by_dec_sex = df.groupby(['year','sex'])['name_length'].mean()
by_dec_sex

In [None]:
by_dec_sex.loc[1880:1885]

In [None]:
# get the average length of female names between 1880 and 1885
by_dec_sex.loc[1880:1885,'F']

In [None]:
# you can plot the average name length by year
by_dec_sex.loc[:,'F'].plot()
by_dec_sex.loc[:,'M'].plot()

In [None]:
# a more elegant way is to use unstack
by_dec_sex.unstack().plot()

In [None]:
# just FYI you can apply multiple aggregations at once
df.groupby(['year','sex'])['name_length'].agg(['min','max','mean','median'])

## Apply and lambda functions

String methods do not always suffice. In this scenario, we often use a `lambda` function (or a normal one, of course) in combination with `.apply()`. For example, which names contain no more than two different characters? In Python we can formulate this as a boolean expression which evaluates whether the size of the set of characters is smaller than or equal to two:

In [None]:
name = 'dddffff'
len(set(name)) <= 2

Let's turn this into a function.

### ✏️ Exercise!

In [None]:
# enter solution here

In Python, we like to keep the number of lines small. We could write a more concise function using `lambda`.

In [None]:
twochars = lambda x: len(set(x)) <= 2

Now we can apply the `.apply()` method to the `name` column. This will "apply" (duh) the function we created to each value in the `name` column and return a `pd.Series` object.

In [None]:
df.name.apply(twochars)

In [None]:
# we can use it as a mask
df[df.name.apply(twochars)]

In [None]:
# or save the result in a new column
df['two_chars'] = df.name.apply(twochars)

In [None]:
# and plot results over time
df[df.sex=='M'].groupby('year')['two_chars'].mean().plot()
df[df.sex=='F'].groupby('year')['two_chars'].mean().plot()

### ✏️ Exercise

What about names starting with 'a' and ending with 'e': have they become more frequent over time? Use a lambda function to answer this question!

In [None]:
# write solution here