# 05 - Pandas basics

So far, we have mainly worked with built-in Python containers to store data, e.g., `list`, `tuple`, `dict`. Although these containers can be useful for many operations, they are usually not appropriate when we want to work with actual datasets.

In the real world, data is usually stored in a *tabular* format. That is, data is stored as a collection of rows and columns:

<img src="images/table.png" width = "50%" align="center"/>

In the table, each observation is uniquely identified by its row and column identifiers. Furthermore, the observations can be of different data types, e.g., names, dates, numbers, etc., and some observations might even be missing.

This notebook gives an introduction to **pandas**, which is the Python package for working with real-world tabular data. 

However, before we can use pandas, we must first import the package into our program. It is convention to give the package the shorter alias `pd` when importing it.

In [1]:
import pandas as pd
import numpy as np

> 📝 **Note:** Recall that we must write the alias (e.g., `pd`) every time we want to use a function from a package.

## Pandas data structures

Pandas offers two additional data structures:
1. `Series`: contains observations of a single variable
2. `DataFrame`: contains observations for several variables

We can think of a `Series` as a single column in a table, whereas a `DataFrame` is a collection of columns.

To create a `Series`, we can pass a sequence of values to the `Series` function.

In [2]:
name_lst = ['Ole', 'Jenny', 'Chang', 'Jonas']

name_lst

['Ole', 'Jenny', 'Chang', 'Jonas']

In [3]:
series = pd.Series(name_lst)

series

0      Ole
1    Jenny
2    Chang
3    Jonas
dtype: object

However, in the real world, we usually work with two-dimensional data (i.e., several variables). We can store two-dimensional data in a `DataFrame`.

First, we create a dictionary with the keys as the column labels and the values as the actual data.

In [4]:
grade_dict = {
    'Name'  : ['Ole', 'Jenny', 'Chang', 'Jonas'],
    'Score' : [65.0, 58.0, 79.0, 95.0],
    'Pass'  : ['yes', 'no', 'yes', 'yes']
}

grade_dict

{'Name': ['Ole', 'Jenny', 'Chang', 'Jonas'],
 'Score': [65.0, 58.0, 79.0, 95.0],
 'Pass': ['yes', 'no', 'yes', 'yes']}

Second, we create a `DataFrame` by passing the dictionary with the column names and values to the `DataFrame` function.

In [5]:
df = pd.DataFrame(grade_dict)

df

Unnamed: 0,Name,Score,Pass
0,Ole,65.0,yes
1,Jenny,58.0,no
2,Chang,79.0,yes
3,Jonas,95.0,yes


<div class="alert alert-info">
<h3> Your turn</h3>
<p>The following table contains daily observations of the average temperature in three different cities:</p>

| Oslo | Bergen | Trondheim |
|------|--------|-----------|
| 0    | 4      | 0         |
| -4   | 3      | -1        |
| -3   | 4      | -3        |
| 0    | 3      | -2        |
| 3    | 3      | -2        |
| 5    | 7      | -5        |
| 4    | 8      | -6        |

<p>Store the data in the table as a <TT>DataFrame</TT> called <TT>temps_df</TT>.</p>

</div>

In [6]:
dictionary = {
    'Oslo' : [0.0,-4.0,-3.0,0.0,3.0,5.0,4.0],
    'Bergen' : [4.0,3.0,4.0,3.0,3.0,7.0,8.0],
    'Trondheim' : [0.0,-1.0,-3.0,-2.0,-2.0,-5.0,-6.0]
}

temps_df = pd.DataFrame(dictionary)
temps_df

Unnamed: 0,Oslo,Bergen,Trondheim
0,0.0,4.0,0.0
1,-4.0,3.0,-1.0
2,-3.0,4.0,-3.0
3,0.0,3.0,-2.0
4,3.0,3.0,-2.0
5,5.0,7.0,-5.0
6,4.0,8.0,-6.0


In [7]:
temps_df.loc[:4, :"Bergen"]

Unnamed: 0,Oslo,Bergen
0,0.0,4.0
1,-4.0,3.0
2,-3.0,4.0
3,0.0,3.0
4,3.0,3.0


Note that a `Series` object has an `index` attribute.

In [8]:
series.index

RangeIndex(start=0, stop=4, step=1)

A `DataFrame` object has both an `index` and a `columns` attribute.

In [9]:
df.index

RangeIndex(start=0, stop=4, step=1)

In [10]:
df.columns

Index(['Name', 'Score', 'Pass'], dtype='object')

#### Indexing

In general, we can select rows and/or columns in a `DataFrame` by using the standard index operator `[]`.

However, pandas supports two types of indexing:
1. Indexing by label
2. Indexing by position

To select a column by its label, we place the column name inside `[]`. Note that this will return a `Series` object.

In [11]:
# df

In [12]:
df['Name']

0      Ole
1    Jenny
2    Chang
3    Jonas
Name: Name, dtype: object

To select multiple columns by label, we must place a *list* of the column names inside the index operator `[]`.

In [13]:
df[['Name', 'Score']]

Unnamed: 0,Name,Score
0,Ole,65.0
1,Jenny,58.0
2,Chang,79.0
3,Jonas,95.0
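
As a small illustration of the two forms of column selection, the sketch below rebuilds the grade table and checks which type each form returns. The variable names `single` and `multiple` are only chosen for this example.

```python
import pandas as pd

# Same grade table as earlier in the notebook
df = pd.DataFrame({
    'Name'  : ['Ole', 'Jenny', 'Chang', 'Jonas'],
    'Score' : [65.0, 58.0, 79.0, 95.0],
    'Pass'  : ['yes', 'no', 'yes', 'yes']
})

single = df['Name']       # a single label returns a Series
multiple = df[['Name']]   # a list of labels (even with one element) returns a DataFrame

print(type(single).__name__)     # Series
print(type(multiple).__name__)   # DataFrame
```

The extra pair of brackets is therefore what decides whether we get a `Series` or a `DataFrame` back.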


To select a specific row by its label, we must combine the index operator `[]` with the `loc` attribute.

In [14]:
# df

In [15]:
#df[3] # KeyError
df.loc[3]

Name     Jonas
Score     95.0
Pass       yes
Name: 3, dtype: object

We can extract the value in a specific row and column by specifying both the row and column labels in the `loc` attribute:
```
df.loc[row_label, col_label]
```

In [16]:
df.loc[3, 'Name']

'Jonas'

As before, we can extract the values in multiple columns by passing a *list* of column labels.

In [17]:
df.loc[3, ['Name', 'Score']]

Name     Jonas
Score     95.0
Name: 3, dtype: object

We can also use the `loc` attribute to slice the rows the same way as we did with strings and lists. In that case, we pass a *range* of row labels inside the index operator `[]`:

```
df.loc[start_label:end_label]
```

In [18]:
df.loc[:2]

Unnamed: 0,Name,Score,Pass
0,Ole,65.0,yes
1,Jenny,58.0,no
2,Chang,79.0,yes


Note that when slicing the rows in a `DataFrame`, it is technically no longer necessary to use the `loc` attribute:

In [19]:
df[:2]

Unnamed: 0,Name,Score,Pass
0,Ole,65.0,yes
1,Jenny,58.0,no


However, in that case, we are no longer extracting rows based on their labels, but instead on their *positions*.

As this can be confusing, it is generally better to use the `iloc` attribute when slicing a `DataFrame` by row position:
```
df.iloc[start_index:end_index]
```

In [20]:
df.iloc[:2]

Unnamed: 0,Name,Score,Pass
0,Ole,65.0,yes
1,Jenny,58.0,no


> 📝 **Note:** `loc` includes the last label, whereas `iloc` extracts all rows up to but not including the last index.
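
The difference between label-based and position-based slicing can be sketched on a small table. The data and the letter labels below are made up for illustration only.

```python
import pandas as pd

# Default integer labels 0, 1, 2, 3
df_default = pd.DataFrame({'Score': [65.0, 58.0, 79.0, 95.0]})

n_loc = len(df_default.loc[:2])     # labels 0, 1 and 2 (end label included)
n_iloc = len(df_default.iloc[:2])   # positions 0 and 1 (end position excluded)

print(n_loc, n_iloc)   # 3 2

# Hypothetical string labels make the distinction explicit
df_labeled = df_default.copy()
df_labeled.index = ['a', 'b', 'c', 'd']

by_label = df_labeled.loc[:'b']      # rows 'a' and 'b'
by_position = df_labeled.iloc[:2]    # also rows 'a' and 'b'
```

With the default integer index, labels and positions look identical, which is why the same slice `:2` can return a different number of rows depending on the attribute used.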

We can also use the `iloc` attribute to slice columns in a `DataFrame` based on their positions.

In [21]:
df.iloc[:2, :2]

Unnamed: 0,Name,Score
0,Ole,65.0
1,Jenny,58.0


In [22]:
df.iloc[:, -2:]

Unnamed: 0,Score,Pass
0,65.0,yes
1,58.0,no
2,79.0,yes
3,95.0,yes


<div class="alert alert-info">
<h3> Your turn</h3>
    <p> Show at least two different ways of extracting the last row in <TT>temps_df</TT>.
</div>

In [23]:
temps_df

Unnamed: 0,Oslo,Bergen,Trondheim
0,0.0,4.0,0.0
1,-4.0,3.0,-1.0
2,-3.0,4.0,-3.0
3,0.0,3.0,-2.0
4,3.0,3.0,-2.0
5,5.0,7.0,-5.0
6,4.0,8.0,-6.0


In [24]:
temps_df.loc[6]
temps_df.iloc[-1]

Oslo         4.0
Bergen       8.0
Trondheim   -6.0
Name: 6, dtype: float64

Instead of only displaying a subset of rows/columns, we can also save the subset as a new `DataFrame` by assigning it to a variable name.

However, note that slicing a `DataFrame` may not actually create a new `DataFrame` with the selected rows/columns. Instead, `pandas` may simply be "hiding" some rows/columns from view.

In [25]:
# Two last rows are "hidden"
# df.iloc[:2]

In order to create a new `DataFrame`, we need to apply the `copy` function on the sliced subset.

In [26]:
df_subset = df.iloc[:2].copy()

df_subset

Unnamed: 0,Name,Score,Pass
0,Ole,65.0,yes
1,Jenny,58.0,no


> 💡 **Tip:** If you ever encounter a `SettingWithCopyWarning`, it is most likely because you sliced a `DataFrame` without using the `copy` function to create a new copy of the subset.
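
A minimal sketch of why the explicit copy is useful: after calling `copy`, changes to the subset do not affect the original `DataFrame`. The table is a shortened version of the grade data, and the new score of `0.0` is an arbitrary value.

```python
import pandas as pd

df = pd.DataFrame({
    'Name'  : ['Ole', 'Jenny', 'Chang', 'Jonas'],
    'Score' : [65.0, 58.0, 79.0, 95.0]
})

# Independent copy of the first two rows
df_subset = df.iloc[:2].copy()

# Modify the copy only
df_subset.loc[0, 'Score'] = 0.0

print(df_subset.loc[0, 'Score'])   # 0.0
print(df.loc[0, 'Score'])          # 65.0, the original is unchanged
```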

## Import and save files

Instead of creating our own datasets, we often want to work with pre-existing data files that we have collected from different sources. Specifically, we want to be able to import data from files, perform operations on the data, and store the transformed data as new files. 

Pandas offers many input/output functions for handling different data files. For a complete list of these functions, see the [official documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html).

The file `titanic.csv` contains information on 891 of the passengers on the Titanic. The data is stored as comma-separated values (CSV), which is one of the most common ways of storing tabular data. A CSV file is simply a plain text file in which each line represents a row, and each value within a row is separated by a comma.

The file consists of the following columns/variables:

* `PassengerId`: ID of every passenger
* `Survived`: indicator whether the person survived (0 for "not survived" and 1 for "survived").
* `Pclass`: accommodation class (first, second, third)
* `Name`: name of passenger (last name, first name)
* `Sex`: gender of passenger ("male" or "female")
* `Age`: age of passenger (in years)
* `Fare`: ticket price paid by passenger (in pounds)

We can import the file by passing the file name as an input to the `read_csv` function. 

As a default, `read_csv` will look for the file in the same folder as the notebook. However, if the file is in a subfolder, we must specify the path to the file, i.e., the name of the subfolder.

In [27]:
# Use a single forward slash (/) or double backward slashes (\\) in file paths
titanic = pd.read_csv('data/titanic.csv')

In [28]:
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,Fare
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,7.25
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,71.2833
2,3,1,3,"Heikkinen, Miss Laina",female,26.0,7.925
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,53.1
4,5,0,3,"Allen, Mr. William Henry",male,35.0,8.05


> ⚠️ **Warning:** If the file is not in a subfolder but in a different location on your computer, you must specify the *full* path to the file, e.g., "C:/Users/MyUser/Documents/Data/titanic.csv" to import the file.

As a default, `read_csv` assumes that the values in the file are separated by a comma. However, we can change this by giving a new value to the optional parameter `sep`.

For example, a pipe-delimited version of the file can be read by setting `sep = '|'`.

In [29]:
titanic_pipe = pd.read_csv('data/titanic_pipe.csv', sep = '|')

titanic_pipe.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,Fare
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,7.25
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,71.2833
2,3,1,3,"Heikkinen, Miss Laina",female,26.0,7.925
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,53.1
4,5,0,3,"Allen, Mr. William Henry",male,35.0,8.05


`read_csv` has many optional parameters that we can pass arguments to in order to customize how we import the file.

For example, we can give a list of column names to `usecols` in order to import only a subset of the columns.

In [30]:
titanic_subset = pd.read_csv('data/titanic.csv', usecols = ['PassengerId', 'Survived', 'Name'])

titanic_subset

Unnamed: 0,PassengerId,Survived,Name
0,1,0,"Braund, Mr. Owen Harris"
1,2,1,"Cumings, Mrs. John Bradley (Florence Briggs Th..."
2,3,1,"Heikkinen, Miss Laina"
3,4,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)"
4,5,0,"Allen, Mr. William Henry"
...,...,...,...
886,887,0,"Montvila, Rev. Juozas"
887,888,1,"Graham, Miss Margaret Edith"
888,889,0,"Johnston, Miss Catherine Helen ""Carrie"""
889,890,1,"Behr, Mr. Karl Howell"


See the [function documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) for an overview of the parameters in `read_csv`.
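
To experiment with `sep` and `usecols` without any file on disk, one option is to wrap a string in `io.StringIO`, which `read_csv` accepts in place of a file name. The text below is a hypothetical two-row extract in pipe-delimited form, not the actual data file.

```python
import io
import pandas as pd

# Hypothetical pipe-delimited text standing in for a file
raw_text = """PassengerId|Survived|Name
1|0|Braund, Mr. Owen Harris
3|1|Heikkinen, Miss Laina
"""

# Split on '|' and keep only two of the three columns
passengers = pd.read_csv(
    io.StringIO(raw_text),
    sep = '|',
    usecols = ['PassengerId', 'Name']
)

print(passengers)
```

Because the delimiter is a pipe, the commas inside the names are kept as part of the values.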

We can save the data as a spreadsheet using the `to_excel` function.

In [31]:
# Store in "data" subfolder
titanic.to_excel('data/titanic.xlsx')

`to_excel` has many optional parameters that we can change.

We can for instance specify the parameters `sheet_name` and `index`.

In [32]:
titanic.to_excel('data/titanic.xlsx', sheet_name = 'passengers', index = False)

See the [function documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_excel.html) for an overview of the parameters in `to_excel`.

If we want to save the data as a CSV file instead, we use the `to_csv` function. See the [function documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html) for an overview of the parameters in `to_csv`.
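
The effect of the `index` parameter can be sketched without writing any files: when no path is given, `to_csv` returns the CSV text as a string. The small table and the variable names below are illustrative assumptions.

```python
import io
import pandas as pd

df = pd.DataFrame({'Name': ['Ole', 'Jenny'], 'Score': [65.0, 58.0]})

# Without a path, to_csv returns the CSV content as a string
csv_with_index = df.to_csv()
csv_without_index = df.to_csv(index = False)

# Read both versions back in
roundtrip_with = pd.read_csv(io.StringIO(csv_with_index))
roundtrip_without = pd.read_csv(io.StringIO(csv_without_index))

print(list(roundtrip_with.columns))      # ['Unnamed: 0', 'Name', 'Score']
print(list(roundtrip_without.columns))   # ['Name', 'Score']
```

Saving with the default `index = True` writes the row labels as an extra unnamed column, which then shows up as `Unnamed: 0` when the file is imported again.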

<div class="alert alert-info">
<h3> Your turn</h3>
    <p> Save <TT>temps_df</TT> both as an Excel file and a CSV file in the <TT>data</TT> subfolder.
</div>

In [33]:
temps_df.to_csv('data/temps.csv')

temps_df.to_excel('data/temps.xlsx')

We can then import the Excel file using the `read_excel` function.

In [34]:
titanic = pd.read_excel('data/titanic.xlsx')

titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,Fare
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,7.25
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,71.2833
2,3,1,3,"Heikkinen, Miss Laina",female,26.0,7.925
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,53.1
4,5,0,3,"Allen, Mr. William Henry",male,35.0,8.05


See the [function documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html) for an overview of the parameters in `read_excel`.

In addition to data files stored locally on our computers, most input functions from pandas can also import data directly from URLs.

For example, we can find the titanic dataset on GitHub from [this](https://github.com/datasciencedojo/datasets/blob/master/titanic.csv) user.

But instead of downloading the data file to our computer and then importing it, we can import the file directly into our program by using the URL.

In [35]:
url = 'https://raw.githubusercontent.com/datasciencedojo/datasets/refs/heads/master/titanic.csv'
print(url)

https://raw.githubusercontent.com/datasciencedojo/datasets/refs/heads/master/titanic.csv


We then pass the URL directly to `read_csv`. 

In [36]:
pd.read_csv(url, usecols = ['PassengerId', 'Survived', 'Name'])

URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1032)>

## Exploring the data

By default, pandas does not display the entire dataset in memory, for performance and usability reasons:
- Large datasets can consume a lot of memory and CPU when rendered (can slow down performance)
- Showing too much data makes it hard to interpret or navigate the data

Instead, pandas shows a truncated view, usually with a few rows at the top and bottom, and a few columns at the left and right. This allows us to quickly verify the structure, column names and a sample of values.

Although you *can* force pandas to display the entire `DataFrame`, it is better/more efficient to use functions to explore specific attributes of the data.

We can use the `head` and `tail` functions to display a specific number of rows at the top or bottom of the `DataFrame`. As a default, these functions will show the first/last five rows.
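
As a quick sketch of how the number of rows can be controlled, both functions accept the number of rows as an argument. The ten-row table below is made up for this example.

```python
import pandas as pd

# Hypothetical table with ten rows (values 0 to 9)
numbers = pd.DataFrame({'value': range(10)})

first_three = numbers.head(3)   # first three rows
last_two = numbers.tail(2)      # last two rows
default_head = numbers.head()   # no argument: first five rows

print(len(first_three), len(last_two), len(default_head))   # 3 2 5
```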

In [53]:
titanic.head()

Unnamed: 0.1,Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,Fare
0,0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,7.25
1,1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,71.2833
2,2,3,1,3,"Heikkinen, Miss Laina",female,26.0,7.925
3,3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,53.1
4,4,5,0,3,"Allen, Mr. William Henry",male,35.0,8.05


In [None]:
titanic.tail() 

`len` returns the number of rows.

In [None]:
len(titanic)

`info` displays the data types of the columns (note that 'object' indicates a string).

In [None]:
titanic.info()

`describe` displays descriptive statistics for the *numeric* columns.

In [52]:
titanic.describe()

Unnamed: 0.1,Unnamed: 0,PassengerId,Survived,Pclass,Age,Fare
count,891.0,891.0,891.0,891.0,714.0,891.0
mean,445.0,446.0,0.383838,2.308642,29.699118,32.204208
std,257.353842,257.353842,0.486592,0.836071,14.526497,49.693429
min,0.0,1.0,0.0,1.0,0.42,0.0
25%,222.5,223.5,0.0,2.0,20.125,7.9104
50%,445.0,446.0,0.0,3.0,28.0,14.4542
75%,667.5,668.5,1.0,3.0,38.0,31.0
max,890.0,891.0,1.0,3.0,80.0,512.3292


<div class="alert alert-info">
<h3> Your turn</h3>
    <p> Note that <TT>describe</TT> returns a <TT>DataFrame</TT> that contains the descriptive statistics. Use the table above to:
        
- Extract the average age of the passengers on the Titanic and store it in variable called <TT>mean_age</TT>
- Print the average age rounded to one decimal.</p>
</div>

In [61]:
mean_age = titanic['Age'].mean()
print(mean_age.round(1))

29.7


`nunique` returns the number of unique values in a column, whereas `unique` returns the actual unique values.

In [62]:
titanic['Survived'].nunique()

2

In [63]:
titanic['Survived'].unique()

array([0, 1])

`value_counts` returns a `Series` with the number of observations for each unique value in a column.

In [64]:
titanic['Survived'].value_counts()

Survived
0    549
1    342
Name: count, dtype: int64
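
The three functions can be compared side by side on a small `Series`. The values below are taken from the `Pass` column of the grade table used earlier.

```python
import pandas as pd

passed = pd.Series(['yes', 'no', 'yes', 'yes'], name = 'Pass')

n_unique = passed.nunique()     # number of distinct values
values = passed.unique()        # the distinct values, in order of appearance
counts = passed.value_counts()  # observations per distinct value

print(n_unique)       # 2
print(list(values))   # ['yes', 'no']
print(counts['yes'])  # 3
```

Since `value_counts` returns a `Series` indexed by the unique values, an individual count can be looked up by its label.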

`corr` calculates the correlation coefficient between all pairs of numeric columns in a `DataFrame`. Note that we must set `numeric_only = True` to include only the numeric columns.

In [65]:
titanic.corr(numeric_only = True)

Unnamed: 0.1,Unnamed: 0,PassengerId,Survived,Pclass,Age,Fare
Unnamed: 0,1.0,1.0,-0.005007,-0.035144,0.036847,0.012658
PassengerId,1.0,1.0,-0.005007,-0.035144,0.036847,0.012658
Survived,-0.005007,-0.005007,1.0,-0.338481,-0.077221,0.257307
Pclass,-0.035144,-0.035144,-0.338481,1.0,-0.369226,-0.5495
Age,0.036847,0.036847,-0.077221,-0.369226,1.0,0.096067
Fare,0.012658,0.012658,0.257307,-0.5495,0.096067,1.0


#### Missing data

Missing data in pandas is denoted by the value `NaN`, which stands for 'not a number'.

In [67]:
titanic.tail()

Unnamed: 0.1,Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,Fare
886,886,887,0,2,"Montvila, Rev. Juozas",male,27.0,13.0
887,887,888,1,1,"Graham, Miss Margaret Edith",female,19.0,30.0
888,888,889,0,3,"Johnston, Miss Catherine Helen ""Carrie""",female,,23.45
889,889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,30.0
890,890,891,0,3,"Dooley, Mr. Patrick",male,32.0,7.75


We can count the total number of missing values in each column in a `DataFrame` by combining the `isna` and `sum` functions.

`isna` creates boolean values (`True`/`False`) for each cell in a `DataFrame` indicating whether or not the cell has a missing value. We can then use `sum` to count the number of `True` values in each column.

In [68]:
titanic.isna().sum()

Unnamed: 0       0
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
Fare             0
dtype: int64
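
To make the two steps visible, the sketch below applies them separately to a small made-up table with one missing age. The names `ages`, `mask` and `missing_per_column` are only chosen for this example.

```python
import numpy as np
import pandas as pd

ages = pd.DataFrame({
    'Name' : ['Ole', 'Jenny', 'Chang'],
    'Age'  : [22.0, np.nan, 35.0]
})

# Step 1: True where a cell is missing, False otherwise
mask = ages.isna()

# Step 2: True counts as 1 and False as 0 when summing each column
missing_per_column = mask.sum()

print(mask)
print(missing_per_column)
```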

#### Filtering rows

In addition to selecting specific rows based on their position (`iloc`) or label (`loc`), we can also select a subset of rows based on one or several conditions using *comparison operators*. This is known as boolean indexing, or *filtering* the data.

Recall that the comparison operators are `<`, `>`, `<=`, `>=`, `==`, and `!=`.

For example, let us create a `Series` of boolean values (`True` or `False`) based on whether a passenger is male or not.

In [69]:
titanic['Sex'] == 'male'

0       True
1      False
2      False
3      False
4       True
       ...  
886     True
887    False
888    False
889     True
890     True
Name: Sex, Length: 891, dtype: bool

We can then use this boolean `Series` to select a subset of rows using the index operator `[]`.

In [75]:
titanic[titanic['Sex'] == 'male']

Unnamed: 0.1,Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,Fare
0,0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,7.2500
4,4,5,0,3,"Allen, Mr. William Henry",male,35.0,8.0500
5,5,6,0,3,"Moran, Mr. James",male,,8.4583
6,6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,51.8625
7,7,8,0,3,"Palsson, Master Gosta Leonard",male,2.0,21.0750
...,...,...,...,...,...,...,...,...
883,883,884,0,2,"Banfield, Mr. Frederick James",male,28.0,10.5000
884,884,885,0,3,"Sutehall, Mr. Henry Jr",male,25.0,7.0500
886,886,887,0,2,"Montvila, Rev. Juozas",male,27.0,13.0000
889,889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,30.0000


If we also want to select a subset of columns in addition to filtering the rows, we have to use the `loc` attribute instead.

In [71]:
titanic.loc[titanic['Sex'] == 'male', 'Name':'Age']

Unnamed: 0,Name,Sex,Age
0,"Braund, Mr. Owen Harris",male,22.0
4,"Allen, Mr. William Henry",male,35.0
5,"Moran, Mr. James",male,
6,"McCarthy, Mr. Timothy J",male,54.0
7,"Palsson, Master Gosta Leonard",male,2.0
...,...,...,...
883,"Banfield, Mr. Frederick James",male,28.0
884,"Sutehall, Mr. Henry Jr",male,25.0
886,"Montvila, Rev. Juozas",male,27.0
889,"Behr, Mr. Karl Howell",male,26.0


As with slicing, we have to assign the filtered `DataFrame` to a new variable name (or overwrite a previous variable name) if we want to save the subset of rows. Recall that we have to apply the `copy` function on the subset to actually create a new `DataFrame`.

In [76]:
males = titanic[titanic['Sex'] == 'male'].copy()

males.head()

Unnamed: 0.1,Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,Fare
0,0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,7.25
4,4,5,0,3,"Allen, Mr. William Henry",male,35.0,8.05
5,5,6,0,3,"Moran, Mr. James",male,,8.4583
6,6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,51.8625
7,7,8,0,3,"Palsson, Master Gosta Leonard",male,2.0,21.075


In [77]:
len(males)

577

We can double check our numbers using `value_counts`.

In [78]:
titanic['Sex'].value_counts()

Sex
male      577
female    314
Name: count, dtype: int64

We can also filter rows on multiple conditions by combining conditions with the logical operators.

> 📝 **Note:** Pandas uses the symbols `&` (and), `|` (or) and `~` (not) for the logical operators, whereas base Python uses the keywords `and`, `or` and `not`.

In [None]:
titanic[(titanic['Sex'] == 'male') & (titanic['Age'] > 70)]

In [None]:
titanic[(titanic['Sex'] == 'male') | (titanic['Age'] > 70)]

In [None]:
titanic[~(titanic['Sex'] == 'male')]

In addition, if we want to filter rows based on a subset of values, we can use the `isin` function to pass a list of allowed values to the condition.

In [None]:
titanic[titanic['Pclass'].isin([1, 2])]
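
The filters above can be traced by hand on a small table. The four rows below are made up for illustration. Storing each condition in its own variable also avoids the parentheses that are otherwise needed, because `&` and `|` bind more tightly than the comparison operators.

```python
import pandas as pd

# Hypothetical passengers, not taken from the data file
people = pd.DataFrame({
    'Sex'    : ['male', 'female', 'male', 'female'],
    'Age'    : [72.0, 75.0, 30.0, 28.0],
    'Pclass' : [1, 2, 3, 3]
})

is_male = people['Sex'] == 'male'
is_old = people['Age'] > 70

both = people[is_male & is_old]      # male AND older than 70: row 0
either = people[is_male | is_old]    # male OR older than 70: rows 0, 1, 2
not_male = people[~is_male]          # NOT male: rows 1, 3

# Rows where Pclass is one of the listed values: rows 0, 1
upper = people[people['Pclass'].isin([1, 2])]

print(len(both), len(either), len(not_male), len(upper))   # 1 3 2 2
```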

<div class="alert alert-info">
<h3> Your turn</h3>
    <p> How many passengers in <TT>titanic</TT> paid more for their ticket than the average cost of <TT>Fare</TT> but did not travel in 1st class?
</div>

In [79]:
titanic.head()

Unnamed: 0.1,Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,Fare
0,0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,7.25
1,1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,71.2833
2,2,3,1,3,"Heikkinen, Miss Laina",female,26.0,7.925
3,3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,53.1
4,4,5,0,3,"Allen, Mr. William Henry",male,35.0,8.05


In [108]:
avg_fare = titanic['Fare'].mean()

poor_and_unlucky = titanic[(titanic['Fare'] > avg_fare) & (titanic['Pclass'] != 1)]

len(poor_and_unlucky)

52

## Transforming the data

Once we have imported a file and explored the data, we usually have to perform operations on the `DataFrame` to transform it into a format that is suitable for our analysis. 

Some examples of common transformations are:
- Create new columns and/or rows
- Delete columns and/or rows
- Change index
- Handle missing observations
- Convert data types

**Create new columns and rows**

We can create a new column in an existing `DataFrame` by assigning either a single value or a sequence of values to a new column name using the `=` operator.

In [109]:
titanic = pd.read_csv('data/titanic.csv')

In [110]:
# Create new column in titanic
titanic['new_col1'] = 'A'

titanic

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,Fare,new_col1
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,7.2500,A
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,71.2833,A
2,3,1,3,"Heikkinen, Miss Laina",female,26.0,7.9250,A
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,53.1000,A
4,5,0,3,"Allen, Mr. William Henry",male,35.0,8.0500,A
...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,13.0000,A
887,888,1,1,"Graham, Miss Margaret Edith",female,19.0,30.0000,A
888,889,0,3,"Johnston, Miss Catherine Helen ""Carrie""",female,,23.4500,A
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,30.0000,A


In [111]:
# Generate sequence of random integers equal to number of rows in titanic 
new_nums = np.random.randint(0, 101, size = len(titanic))

# Assign sequence to new column label in titanic
titanic['new_col2'] = new_nums

titanic

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,Fare,new_col1,new_col2
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,7.2500,A,24
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,71.2833,A,48
2,3,1,3,"Heikkinen, Miss Laina",female,26.0,7.9250,A,7
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,53.1000,A,61
4,5,0,3,"Allen, Mr. William Henry",male,35.0,8.0500,A,57
...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,13.0000,A,88
887,888,1,1,"Graham, Miss Margaret Edith",female,19.0,30.0000,A,6
888,889,0,3,"Johnston, Miss Catherine Helen ""Carrie""",female,,23.4500,A,33
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,30.0000,A,39


We can also create a new column based on the values in an existing column in a `DataFrame`.

In [112]:
# Multiply fare with exchange rate to convert to dollars
titanic['fare_usd'] = titanic['Fare'] * 1.34

titanic

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,Fare,new_col1,new_col2,fare_usd
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,7.2500,A,24,9.715000
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,71.2833,A,48,95.519622
2,3,1,3,"Heikkinen, Miss Laina",female,26.0,7.9250,A,7,10.619500
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,53.1000,A,61,71.154000
4,5,0,3,"Allen, Mr. William Henry",male,35.0,8.0500,A,57,10.787000
...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,13.0000,A,88,17.420000
887,888,1,1,"Graham, Miss Margaret Edith",female,19.0,30.0000,A,6,40.200000
888,889,0,3,"Johnston, Miss Catherine Helen ""Carrie""",female,,23.4500,A,33,31.423000
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,30.0000,A,39,40.200000


Finally, we can update the values in specific rows based on a condition by combining the `loc` attribute with the `=` operator.

In [113]:
# Change value from A to B if passenger is male
titanic.loc[titanic['Sex'] == 'male', 'new_col1'] = 'B'

In [114]:
titanic

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,Fare,new_col1,new_col2,fare_usd
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,7.2500,B,24,9.715000
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,71.2833,A,48,95.519622
2,3,1,3,"Heikkinen, Miss Laina",female,26.0,7.9250,A,7,10.619500
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,53.1000,A,61,71.154000
4,5,0,3,"Allen, Mr. William Henry",male,35.0,8.0500,B,57,10.787000
...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,13.0000,B,88,17.420000
887,888,1,1,"Graham, Miss Margaret Edith",female,19.0,30.0000,A,6,40.200000
888,889,0,3,"Johnston, Miss Catherine Helen ""Carrie""",female,,23.4500,A,33,31.423000
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,30.0000,B,39,40.200000
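
The three ways of assigning values described above can be combined in one short sketch. The column names `Group` and `Score_share` and the threshold of 60 are assumptions made for this example.

```python
import pandas as pd

grades = pd.DataFrame({
    'Name'  : ['Ole', 'Jenny', 'Chang', 'Jonas'],
    'Score' : [65.0, 58.0, 79.0, 95.0]
})

# 1. A single value is repeated for every row
grades['Group'] = 'A'

# 2. A new column computed from an existing column
grades['Score_share'] = grades['Score'] / 100

# 3. Update only the rows that satisfy a condition
grades.loc[grades['Score'] < 60, 'Group'] = 'B'

print(grades)
```

Only Jenny has a score below 60, so she is the only one whose `Group` value changes to `'B'`.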


<div class="alert alert-info">
<h3> Your turn</h3>
    <p> Import the file <TT>temperatures.csv</TT> and store the data in a variable called <TT>temps_df</TT>:
        
- Create a new column called <TT>mean</TT> that contains the average temperature for each row (i.e., day). Round the average to one decimal. 
- Create another column called <TT>indicator</TT> that takes the value 1 if the value in <TT>mean</TT> is positive, and 0 otherwise. </p>
</div>

In [123]:
temps_df = pd.read_csv('data/temperatures.csv')
temps_df.head()

Unnamed: 0.1,Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
0,0,2020-01-02,74.059998,75.150002,73.797501,75.087502,74.333511,135480400
1,1,2020-01-03,74.287498,75.144997,74.125,74.357498,73.61084,146322800
2,2,2020-01-06,73.447502,74.989998,73.1875,74.949997,74.197395,118387200
3,3,2020-01-07,74.959999,75.224998,74.370003,74.597504,73.848442,108872000
4,4,2020-01-08,74.290001,76.110001,74.290001,75.797501,75.036385,132079200


In [141]:
temps_df['Mean'] = temps_df[['Open', 'High', 'Low', 'Close']].mean(axis = 1).round(1)
temps_df.head()

Unnamed: 0.1,Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,Mean
0,0,2020-01-02,74.059998,75.150002,73.797501,75.087502,74.333511,135480400,74.5
1,1,2020-01-03,74.287498,75.144997,74.125,74.357498,73.61084,146322800,74.5
2,2,2020-01-06,73.447502,74.989998,73.1875,74.949997,74.197395,118387200,74.1
3,3,2020-01-07,74.959999,75.224998,74.370003,74.597504,73.848442,108872000,74.8
4,4,2020-01-08,74.290001,76.110001,74.290001,75.797501,75.036385,132079200,75.1


In [170]:
temps_df['Indicator'] = (temps_df['Mean'] > 0).astype(int)

**Drop columns and rows**

We can drop rows and columns by using the `drop` function. In addition to specifying the *labels* of the rows/columns that we want to drop, we also need to specify `axis`:
- `axis = 0` will drop rows
- `axis = 1` will drop columns

In [171]:
titanic = pd.read_csv('data/titanic.csv')

In [172]:
# Drop first row
titanic.drop(0, axis = 0)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,Fare
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,71.2833
2,3,1,3,"Heikkinen, Miss Laina",female,26.0,7.9250
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,53.1000
4,5,0,3,"Allen, Mr. William Henry",male,35.0,8.0500
5,6,0,3,"Moran, Mr. James",male,,8.4583
...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,13.0000
887,888,1,1,"Graham, Miss Margaret Edith",female,19.0,30.0000
888,889,0,3,"Johnston, Miss Catherine Helen ""Carrie""",female,,23.4500
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,30.0000


In [176]:
titanic.drop('PassengerId', axis = 1)

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Fare
0,0,3,"Braund, Mr. Owen Harris",male,22.0,7.2500
1,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,71.2833
2,1,3,"Heikkinen, Miss Laina",female,26.0,7.9250
3,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,53.1000
4,0,3,"Allen, Mr. William Henry",male,35.0,8.0500
...,...,...,...,...,...,...
886,0,2,"Montvila, Rev. Juozas",male,27.0,13.0000
887,1,1,"Graham, Miss Margaret Edith",female,19.0,30.0000
888,0,3,"Johnston, Miss Catherine Helen ""Carrie""",female,,23.4500
889,1,1,"Behr, Mr. Karl Howell",male,26.0,30.0000


We can drop several columns/rows by passing a *list* of labels to `drop`.

In [177]:
titanic.drop(['PassengerId', 'Survived'], axis = 1)

Unnamed: 0,Pclass,Name,Sex,Age,Fare
0,3,"Braund, Mr. Owen Harris",male,22.0,7.2500
1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,71.2833
2,3,"Heikkinen, Miss Laina",female,26.0,7.9250
3,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,53.1000
4,3,"Allen, Mr. William Henry",male,35.0,8.0500
...,...,...,...,...,...
886,2,"Montvila, Rev. Juozas",male,27.0,13.0000
887,1,"Graham, Miss Margaret Edith",female,19.0,30.0000
888,3,"Johnston, Miss Catherine Helen ""Carrie""",female,,23.4500
889,1,"Behr, Mr. Karl Howell",male,26.0,30.0000


However, note that this operation does not actually modify the `DataFrame` in memory.

In [178]:
titanic

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,Fare
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,7.2500
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,71.2833
2,3,1,3,"Heikkinen, Miss Laina",female,26.0,7.9250
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,53.1000
4,5,0,3,"Allen, Mr. William Henry",male,35.0,8.0500
...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,13.0000
887,888,1,1,"Graham, Miss Margaret Edith",female,19.0,30.0000
888,889,0,3,"Johnston, Miss Catherine Helen ""Carrie""",female,,23.4500
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,30.0000


Instead, we must store the modified data in a new variable name (or overwrite a previous one).

Alternatively, we can set `inplace = True`, which will transform the original `DataFrame` in memory.

In [179]:
titanic.drop(['PassengerId', 'Survived'], axis = 1, inplace = True)

In [180]:
titanic

Unnamed: 0,Pclass,Name,Sex,Age,Fare
0,3,"Braund, Mr. Owen Harris",male,22.0,7.2500
1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,71.2833
2,3,"Heikkinen, Miss Laina",female,26.0,7.9250
3,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,53.1000
4,3,"Allen, Mr. William Henry",male,35.0,8.0500
...,...,...,...,...,...
886,2,"Montvila, Rev. Juozas",male,27.0,13.0000
887,1,"Graham, Miss Margaret Edith",female,19.0,30.0000
888,3,"Johnston, Miss Catherine Helen ""Carrie""",female,,23.4500
889,1,"Behr, Mr. Karl Howell",male,26.0,30.0000
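
The difference between returning a modified copy and modifying the object in place can be checked on a small example. The sketch below uses a hypothetical toy `DataFrame` called `toy` (not part of the course data) and also shows the `columns` keyword, which is equivalent to passing `axis = 1`.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

# drop returns a new DataFrame; toy itself is unchanged
dropped = toy.drop(['a', 'b'], axis = 1)
print(list(toy.columns))      # ['a', 'b', 'c']
print(list(dropped.columns))  # ['c']

# columns = ... is equivalent to combining a list of labels with axis = 1
same = toy.drop(columns = ['a', 'b'])
print(dropped.equals(same))   # True

# With inplace = True, drop returns None and modifies toy directly
result = toy.drop(['a', 'b'], axis = 1, inplace = True)
print(result)                 # None
print(list(toy.columns))      # ['c']
```

Because `drop` returns `None` when `inplace = True`, writing `toy = toy.drop(..., inplace = True)` would overwrite `toy` with `None`.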


**Change index**

In general, each row in a `DataFrame` should have a unique index, as this allows us to identify rows/observations by their index.

In [183]:
titanic = pd.read_csv('data/titanic.csv')

In [None]:
# titanic.head()

In [None]:
# titanic['PassengerId'].nunique()

We change the index of a `DataFrame` by assigning a sequence (e.g., a column) with new index values to the `index` attribute of the `DataFrame`.

In [186]:
titanic.index = titanic['PassengerId']

titanic.head()

Unnamed: 0_level_0,PassengerId,Survived,Pclass,Name,Sex,Age,Fare
PassengerId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
1,1,0,3,"Braund, Mr. Owen Harris",male,22.0,7.25
2,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,71.2833
3,3,1,3,"Heikkinen, Miss Laina",female,26.0,7.925
4,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,53.1
5,5,0,3,"Allen, Mr. William Henry",male,35.0,8.05


We can reset the index by using the `reset_index` function. However, note that this will return the old index as a column in the `DataFrame`.

In [191]:
# ValueError due to duplicate column labels
#titanic.reset_index()

In [187]:
# Drop PassengerId as column and reset index
titanic.drop('PassengerId', axis = 1).reset_index()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,Fare
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,7.2500
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,71.2833
2,3,1,3,"Heikkinen, Miss Laina",female,26.0,7.9250
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,53.1000
4,5,0,3,"Allen, Mr. William Henry",male,35.0,8.0500
...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,13.0000
887,888,1,1,"Graham, Miss Margaret Edith",female,19.0,30.0000
888,889,0,3,"Johnston, Miss Catherine Helen ""Carrie""",female,,23.4500
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,30.0000


Alternatively, we can set `drop = True` to avoid the old index being returned as a new column in the `DataFrame`.

In [None]:
# titanic

In [192]:
titanic.reset_index(drop = True)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,Fare
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,7.2500
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,71.2833
2,3,1,3,"Heikkinen, Miss Laina",female,26.0,7.9250
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,53.1000
4,5,0,3,"Allen, Mr. William Henry",male,35.0,8.0500
...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,13.0000
887,888,1,1,"Graham, Miss Margaret Edith",female,19.0,30.0000
888,889,0,3,"Johnston, Miss Catherine Helen ""Carrie""",female,,23.4500
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,30.0000


And finally, to actually modify the `DataFrame` in memory, we need to set `inplace = True`.

In [193]:
titanic.reset_index(drop = True, inplace = True)

In [194]:
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,Fare
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,7.25
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,71.2833
2,3,1,3,"Heikkinen, Miss Laina",female,26.0,7.925
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,53.1
4,5,0,3,"Allen, Mr. William Henry",male,35.0,8.05
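
The steps above can be reproduced on a small example. The sketch below uses a hypothetical toy `DataFrame` with an `id` column (an assumption for illustration, not the Titanic data). It also uses `set_index`, which moves a column into the index rather than copying it, so a later `reset_index` does not run into duplicate column labels.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({'id': [101, 102, 103], 'value': [1.5, 2.5, 3.5]})

# Check that the candidate index column has no duplicates
print(toy['id'].is_unique)     # True

# set_index moves the column into the index
indexed = toy.set_index('id')
print(list(indexed.columns))   # ['value']

# reset_index turns the index back into a regular column
restored = indexed.reset_index()
print(list(restored.columns))  # ['id', 'value']

# If the column is kept AND assigned as the index, reset_index fails
dup = toy.copy()
dup.index = dup['id']
try:
    dup.reset_index()
except ValueError as err:
    print('ValueError:', err)

# drop = True discards the old index instead of inserting it as a column
clean = dup.reset_index(drop = True)
print(list(clean.index))       # [0, 1, 2]
```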


Although it is common for the index in a `DataFrame` to be integers (usually starting at 0), the index can also hold other data types, e.g., strings.

In [195]:
males.index = males['Name']

males.head()

Unnamed: 0_level_0,Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,Fare
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
"Braund, Mr. Owen Harris",0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,7.25
"Allen, Mr. William Henry",4,5,0,3,"Allen, Mr. William Henry",male,35.0,8.05
"Moran, Mr. James",5,6,0,3,"Moran, Mr. James",male,,8.4583
"McCarthy, Mr. Timothy J",6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,51.8625
"Palsson, Master Gosta Leonard",7,8,0,3,"Palsson, Master Gosta Leonard",male,2.0,21.075

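A string index lets us look up rows by label with `loc`. The sketch below uses a hypothetical toy `DataFrame` (the ages are made up for illustration) to show label-based lookup next to position-based lookup with `iloc`.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
people = pd.DataFrame({'Name': ['Ole', 'Jenny', 'Chang'], 'Age': [30, 25, 41]})
people = people.set_index('Name')

# Look up a single value by row label and column label
print(people.loc['Jenny', 'Age'])                 # 25

# Label-based slices include both endpoints
print(people.loc['Ole':'Jenny', 'Age'].tolist())  # [30, 25]

# Position-based access is still available through iloc
print(people.iloc[0, 0])                          # 30
```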

<div class="alert alert-info">
<h3> Your turn</h3>
    <p> Import the file <TT>temperatures.csv</TT> and store the data in a variable called <TT>temps_df</TT>.
        
Change the index to indicate the day of the week (Mon-Sun) for each observation.
</p>
</div>

In [239]:
temps_df = pd.read_csv('data/temperatures.csv')
temps_df

Unnamed: 0.1,Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
0,0,2020-01-02,74.059998,75.150002,73.797501,75.087502,74.333511,135480400
1,1,2020-01-03,74.287498,75.144997,74.125000,74.357498,73.610840,146322800
2,2,2020-01-06,73.447502,74.989998,73.187500,74.949997,74.197395,118387200
3,3,2020-01-07,74.959999,75.224998,74.370003,74.597504,73.848442,108872000
4,4,2020-01-08,74.290001,76.110001,74.290001,75.797501,75.036385,132079200
...,...,...,...,...,...,...,...,...
247,247,2020-12-23,132.160004,132.429993,130.779999,130.960007,130.764603,88223700
248,248,2020-12-24,131.320007,133.460007,131.100006,131.970001,131.773087,54930100
249,249,2020-12-28,133.990005,137.339996,133.509995,136.690002,136.486053,124486200
250,250,2020-12-29,138.050003,138.789993,134.339996,134.869995,134.668762,121047300


In [240]:
temps_df['Date'] = pd.to_datetime(temps_df['Date'])

In [241]:
temps_df['Weekday'] = temps_df['Date'].dt.day_name().str[:3]
temps_df.head()

Unnamed: 0.1,Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,Weekday
0,0,2020-01-02,74.059998,75.150002,73.797501,75.087502,74.333511,135480400,Thu
1,1,2020-01-03,74.287498,75.144997,74.125,74.357498,73.61084,146322800,Fri
2,2,2020-01-06,73.447502,74.989998,73.1875,74.949997,74.197395,118387200,Mon
3,3,2020-01-07,74.959999,75.224998,74.370003,74.597504,73.848442,108872000,Tue
4,4,2020-01-08,74.290001,76.110001,74.290001,75.797501,75.036385,132079200,Wed


In [242]:
temps_df = temps_df.set_index('Weekday')

temps_df

Unnamed: 0_level_0,Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
Weekday,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Thu,0,2020-01-02,74.059998,75.150002,73.797501,75.087502,74.333511,135480400
Fri,1,2020-01-03,74.287498,75.144997,74.125000,74.357498,73.610840,146322800
Mon,2,2020-01-06,73.447502,74.989998,73.187500,74.949997,74.197395,118387200
Tue,3,2020-01-07,74.959999,75.224998,74.370003,74.597504,73.848442,108872000
Wed,4,2020-01-08,74.290001,76.110001,74.290001,75.797501,75.036385,132079200
...,...,...,...,...,...,...,...,...
Wed,247,2020-12-23,132.160004,132.429993,130.779999,130.960007,130.764603,88223700
Thu,248,2020-12-24,131.320007,133.460007,131.100006,131.970001,131.773087,54930100
Mon,249,2020-12-28,133.990005,137.339996,133.509995,136.690002,136.486053,124486200
Tue,250,2020-12-29,138.050003,138.789993,134.339996,134.869995,134.668762,121047300
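
Note that a weekday index is not unique, since the same label appears once per week. The sketch below uses three hypothetical observations (the `value` column is made up) to show what this means for `loc`.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
obs = pd.DataFrame({
    'Date': pd.to_datetime(['2020-01-02', '2020-01-06', '2020-01-13']),
    'value': [1.0, 2.0, 3.0],
})

# Abbreviated weekday names, as in the solution above
obs['Weekday'] = obs['Date'].dt.day_name().str[:3]
obs = obs.set_index('Weekday')

print(obs.index.tolist())   # ['Thu', 'Mon', 'Mon']
print(obs.index.is_unique)  # False

# A repeated label selects every matching row
print(len(obs.loc['Mon']))  # 2
```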


**Handle missing values**

It is common for real-world data files to contain missing observations.

In [243]:
titanic = pd.read_csv('data/titanic.csv')

In [244]:
# Count the missing values in each column
titanic.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
Fare             0
dtype: int64

In [250]:
# Select the rows where Age is missing
titanic[titanic['Age'].isna()]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,Fare
5,6,0,3,"Moran, Mr. James",male,,8.4583
17,18,1,2,"Williams, Mr. Charles Eugene",male,,13.0000
19,20,1,3,"Masselmani, Mrs. Fatima",female,,7.2250
26,27,0,3,"Emir, Mr. Farred Chehab",male,,7.2250
28,29,1,3,"O'Dwyer, Miss Ellen ""Nellie""",female,,7.8792
...,...,...,...,...,...,...,...
859,860,0,3,"Razi, Mr. Raihed",male,,7.2292
863,864,0,3,"Sage, Miss Dorothy Edith ""Dolly""",female,,69.5500
868,869,0,3,"van Melkebeke, Mr. Philemon",male,,9.5000
878,879,0,3,"Laleff, Mr. Kristo",male,,7.8958


Sometimes we want to drop rows and/or columns with missing observations from our data.

`dropna` drops rows/columns with missing observations from a `DataFrame`. As before, we need to use the `axis` parameter to specify whether we want to drop rows or columns:
- `axis = 0` will drop all rows that contain at least one `NaN`
- `axis = 1` will drop all columns that contain at least one `NaN`

In [None]:
titanic.dropna(axis = 0)

In [None]:
titanic.dropna(axis = 1)

Note that `dropna` did not change the `DataFrame`. To store the transformations we either have to assign them to a variable name or use the `inplace` parameter.

In [None]:
# titanic
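
The effect of the `axis` parameter is easiest to see on a small example. The sketch below uses a hypothetical toy `DataFrame` with a single missing value. It also shows the `subset` parameter of `dropna`, which restricts the check to selected columns.

```python
import pandas as pd
import numpy as np

# Hypothetical toy data with one missing age, used only for illustration
toy = pd.DataFrame({
    'name': ['A', 'B', 'C'],
    'age': [22.0, np.nan, 35.0],
    'fare': [7.25, 8.05, 13.0],
})

# axis = 0 removes the one row that contains a NaN
rows_dropped = toy.dropna(axis = 0)
print(rows_dropped.shape)  # (2, 3)

# axis = 1 removes the one column that contains a NaN
cols_dropped = toy.dropna(axis = 1)
print(cols_dropped.shape)  # (3, 2)

# subset only considers missing values in the listed columns
only_fare = toy.dropna(subset = ['fare'])
print(only_fare.shape)     # (3, 3)

# The original DataFrame is unchanged
print(toy.shape)           # (3, 3)
```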

Instead of dropping observations with missing values, we can replace these observations with an alternative value by using the `fillna` function.

In [None]:
# Replace NaN with string
titanic.fillna('MISSING')

In [None]:
# Replace NaN with average value
titanic['new_age'] = titanic['Age'].fillna(titanic['Age'].mean())

titanic

> ⚠️ **Warning:** In "real" applications, replacing missing values with sample means is usually not a good idea, as it can bias the analysis, e.g., by understating the variability in the data.
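
One way to see the problem is to compare the standard deviation before and after imputation. The sketch below uses a hypothetical toy `Series` of ages (made-up numbers): the mean is unchanged, but the spread shrinks because every imputed value sits exactly at the mean.

```python
import pandas as pd
import numpy as np

# Hypothetical toy data, used only for illustration
ages = pd.Series([20.0, 30.0, np.nan, 40.0, np.nan])

# Replace missing values with the sample mean
filled = ages.fillna(ages.mean())

# The mean is preserved...
print(ages.mean(), filled.mean())  # 30.0 30.0

# ...but the standard deviation is smaller after imputation
print(ages.std())                  # 10.0
print(round(filled.std(), 2))      # 7.07
```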

We can also go the other way and set observations in a `DataFrame` to missing by assigning the special value `nan` from NumPy.

In [252]:
# Replace values in new age column with NaN if Age is missing
titanic.loc[titanic['Age'].isna(), 'new_age'] = np.nan

titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,Fare,new_age
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,7.25,
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,71.2833,
2,3,1,3,"Heikkinen, Miss Laina",female,26.0,7.925,
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,53.1,
4,5,0,3,"Allen, Mr. William Henry",male,35.0,8.05,


**Converting data types**

When we import files, pandas infers the data types of the columns in the file.

However, sometimes we want to change the `dtype` of the columns, either because pandas got it wrong or because other data types are more appropriate for our analysis.

The file `AAPL.csv` contains data on stock prices and trading volumes for Apple on every trading day in 2020.

In [253]:
apple = pd.read_csv('data/AAPL.csv')

apple

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
0,2020-01-02,74.059998,75.150002,73.797501,75.087502,74.333511,135480400
1,2020-01-03,74.287498,75.144997,74.125000,74.357498,73.610840,146322800
2,2020-01-06,73.447502,74.989998,73.187500,74.949997,74.197395,118387200
3,2020-01-07,74.959999,75.224998,74.370003,74.597504,73.848442,108872000
4,2020-01-08,74.290001,76.110001,74.290001,75.797501,75.036385,132079200
...,...,...,...,...,...,...,...
247,2020-12-23,132.160004,132.429993,130.779999,130.960007,130.764603,88223700
248,2020-12-24,131.320007,133.460007,131.100006,131.970001,131.773087,54930100
249,2020-12-28,133.990005,137.339996,133.509995,136.690002,136.486053,124486200
250,2020-12-29,138.050003,138.789993,134.339996,134.869995,134.668762,121047300


Note that `read_csv` imported the price data as floats, the trading volumes as integers, and the dates as strings.

In [254]:
apple.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 252 entries, 0 to 251
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Date       252 non-null    object 
 1   Open       252 non-null    float64
 2   High       252 non-null    float64
 3   Low        252 non-null    float64
 4   Close      252 non-null    float64
 5   Adj Close  252 non-null    float64
 6   Volume     252 non-null    int64  
dtypes: float64(5), int64(1), object(1)
memory usage: 13.9+ KB


We can apply `astype` to a column to change its `dtype` to `str`, `float` or `int`.

In [255]:
apple['Volume'].astype('str')

0      135480400
1      146322800
2      118387200
3      108872000
4      132079200
         ...    
247     88223700
248     54930100
249    124486200
250    121047300
251     96452100
Name: Volume, Length: 252, dtype: object

However, to actually modify the `DataFrame`, we must assign the updated column to the old column by using the `=` operator.

In [256]:
apple['Volume'] = apple['Volume'].astype('str')

apple.head()

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
0,2020-01-02,74.059998,75.150002,73.797501,75.087502,74.333511,135480400
1,2020-01-03,74.287498,75.144997,74.125,74.357498,73.61084,146322800
2,2020-01-06,73.447502,74.989998,73.1875,74.949997,74.197395,118387200
3,2020-01-07,74.959999,75.224998,74.370003,74.597504,73.848442,108872000
4,2020-01-08,74.290001,76.110001,74.290001,75.797501,75.036385,132079200


In [257]:
apple.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 252 entries, 0 to 251
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Date       252 non-null    object 
 1   Open       252 non-null    float64
 2   High       252 non-null    float64
 3   Low        252 non-null    float64
 4   Close      252 non-null    float64
 5   Adj Close  252 non-null    float64
 6   Volume     252 non-null    object 
dtypes: float64(5), object(2)
memory usage: 13.9+ KB
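
The `dtype` matters because it determines how operations behave. The sketch below uses a hypothetical toy `Series` to show that `+` adds numbers but concatenates strings.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
volumes = pd.Series([10, 20])
as_text = volumes.astype(str)

# Integers are added...
print((volumes + volumes).tolist())  # [20, 40]

# ...whereas strings are concatenated
print((as_text + as_text).tolist())  # ['1010', '2020']
```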


We can change the column from `str` to `float`...

In [258]:
apple['Volume'] = apple['Volume'].astype(float)

apple.head()

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
0,2020-01-02,74.059998,75.150002,73.797501,75.087502,74.333511,135480400.0
1,2020-01-03,74.287498,75.144997,74.125,74.357498,73.61084,146322800.0
2,2020-01-06,73.447502,74.989998,73.1875,74.949997,74.197395,118387200.0
3,2020-01-07,74.959999,75.224998,74.370003,74.597504,73.848442,108872000.0
4,2020-01-08,74.290001,76.110001,74.290001,75.797501,75.036385,132079200.0


...and back to `int`.

In [259]:
apple['Volume'] = apple['Volume'].astype(int)

apple.head()

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
0,2020-01-02,74.059998,75.150002,73.797501,75.087502,74.333511,135480400
1,2020-01-03,74.287498,75.144997,74.125,74.357498,73.61084,146322800
2,2020-01-06,73.447502,74.989998,73.1875,74.949997,74.197395,118387200
3,2020-01-07,74.959999,75.224998,74.370003,74.597504,73.848442,108872000
4,2020-01-08,74.290001,76.110001,74.290001,75.797501,75.036385,132079200


Note that converting data types may not always work. For example, converting from floats to integers will not work if the column contains missing observations.

In [260]:
titanic = pd.read_csv('data/titanic.csv')

In [261]:
# IntCastingNaNError
titanic['Age'].astype(int)

IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer
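
There are a few ways around this error. The sketch below uses a hypothetical toy `Series` of ages and shows two options: the nullable integer dtype `'Int64'` (with a capital `I`), which can hold missing values, and dropping the missing values before converting. It assumes a reasonably recent pandas version, where the error raised by `astype(int)` is a `ValueError` or a subclass of it.

```python
import pandas as pd
import numpy as np

# Hypothetical toy data, used only for illustration
ages = pd.Series([22.0, np.nan, 35.0])

# Converting directly to int fails because of the missing value
try:
    ages.astype(int)
except ValueError as err:
    print('Conversion failed:', err)

# Option 1: the nullable integer dtype keeps the missing value
nullable = ages.astype('Int64')
print(nullable.dtype)         # Int64
print(nullable.isna().sum())  # 1

# Option 2: drop the missing values before converting
complete = ages.dropna().astype(int)
print(complete.tolist())      # [22, 35]
```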

<div class="alert alert-info">
<h3> Your turn</h3>
    <p> Use a <TT>for</TT> loop to convert all price columns (i.e., not <TT>Date</TT> or <TT>Volume</TT>) in <TT>apple</TT> from floats to integers.</p>
</div>

In [263]:
df = pd.read_csv('data/AAPL.csv')

for col in ['Open', 'High', 'Low', 'Close', 'Adj Close']:
    df[col] = df[col].astype(int)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 252 entries, 0 to 251
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Date       252 non-null    object
 1   Open       252 non-null    int64 
 2   High       252 non-null    int64 
 3   Low        252 non-null    int64 
 4   Close      252 non-null    int64 
 5   Adj Close  252 non-null    int64 
 6   Volume     252 non-null    int64 
dtypes: int64(6), object(1)
memory usage: 13.9+ KB
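
Keep in mind that `astype(int)` discards the decimals instead of rounding to the nearest integer. The sketch below uses a hypothetical toy `Series` of prices to compare the two.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
prices = pd.Series([74.06, 75.99, -1.7])

# astype(int) truncates towards zero
truncated = prices.astype(int)
print(truncated.tolist())  # [74, 75, -1]

# Rounding first gives the nearest integer
rounded = prices.round().astype(int)
print(rounded.tolist())    # [74, 76, -2]
```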


Note that the stock data in `apple` is a time series, i.e., we observe the daily prices and volumes over time. But which data type is appropriate for dates? `str`, `float` or `int`?

Recall that the dates in `apple` are initially imported as strings.

In [264]:
apple.loc[0, 'Date']

'2020-01-02'

Although we can interpret the `Date` column as a string, pandas was actually developed to handle time series data, and especially financial time series. Therefore, pandas comes with an additional data type known as `datetime`.

`to_datetime` will convert a series of dates to `datetime`.

In [265]:
pd.to_datetime(apple['Date'])

0     2020-01-02
1     2020-01-03
2     2020-01-06
3     2020-01-07
4     2020-01-08
         ...    
247   2020-12-23
248   2020-12-24
249   2020-12-28
250   2020-12-29
251   2020-12-30
Name: Date, Length: 252, dtype: datetime64[ns]
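
Real-world date columns are not always this clean. The sketch below uses a hypothetical toy `Series` with one invalid entry. It passes the optional `format` parameter to state the expected layout explicitly, and `errors = 'coerce'` to turn values that cannot be parsed into missing timestamps (`NaT`) instead of raising an error.

```python
import pandas as pd

# Hypothetical toy data with one invalid date, used only for illustration
raw = pd.Series(['2020-01-02', '2020-01-03', 'not a date'])

# Invalid entries become NaT instead of raising an error
parsed = pd.to_datetime(raw, format = '%Y-%m-%d', errors = 'coerce')

print(parsed.isna().sum())  # 1
print(parsed[0])            # 2020-01-02 00:00:00
```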

In [266]:
apple['Date'] = pd.to_datetime(apple['Date'])

apple.head()

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
0,2020-01-02,74.059998,75.150002,73.797501,75.087502,74.333511,135480400
1,2020-01-03,74.287498,75.144997,74.125,74.357498,73.61084,146322800
2,2020-01-06,73.447502,74.989998,73.1875,74.949997,74.197395,118387200
3,2020-01-07,74.959999,75.224998,74.370003,74.597504,73.848442,108872000
4,2020-01-08,74.290001,76.110001,74.290001,75.797501,75.036385,132079200


In [267]:
apple.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 252 entries, 0 to 251
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   Date       252 non-null    datetime64[ns]
 1   Open       252 non-null    float64       
 2   High       252 non-null    float64       
 3   Low        252 non-null    float64       
 4   Close      252 non-null    float64       
 5   Adj Close  252 non-null    float64       
 6   Volume     252 non-null    int64         
dtypes: datetime64[ns](1), float64(5), int64(1)
memory usage: 13.9 KB


The `Date` column is now `datetime`, meaning that each value in the column is interpreted as a *timestamp*.

In [268]:
apple.loc[0, 'Date']

Timestamp('2020-01-02 00:00:00')
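
Working with timestamps instead of strings gives access to date-specific operations. The sketch below uses a hypothetical toy `Series` of three dates to show the `dt` accessor, comparisons against a `Timestamp`, and date arithmetic.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
dates = pd.Series(pd.to_datetime(['2020-01-02', '2020-06-15', '2020-12-30']))

# Extract components with the dt accessor
print(dates.dt.month.tolist())       # [1, 6, 12]
print(dates.dt.day_name().tolist())  # ['Thursday', 'Monday', 'Wednesday']

# Filter on dates by comparing against a Timestamp
second_half = dates[dates > pd.Timestamp('2020-06-30')]
print(len(second_half))              # 1

# Subtracting two timestamps gives a Timedelta
span = dates.iloc[-1] - dates.iloc[0]
print(span.days)                     # 363
```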

## Additional resources

There are numerous tutorials on pandas on the internet. Useful additional material includes:
- The official [user guide](https://pandas.pydata.org/docs/user_guide/index.html).
- The official webpage for pandas also has a nice [tutorial](https://pandas.pydata.org/docs/getting_started/index.html) on how to get started.
- The official [API reference](https://pandas.pydata.org/docs/reference/index.html) with details on every pandas object and function.
- A list of community tutorials (including videos) is available [here](https://pandas.pydata.org/docs/getting_started/tutorials.html).

# Home exercises

In the home exercises, you will work with some additional data files, which are all located in the `data` subfolder.

### 📚 Exercise 1: Fuel economy

The file `mpg.xlsx` contains observations on fuel economy and six additional attributes for 398 different car models. The column `mpg` is a measure of the car's fuel economy, i.e. the number of miles per gallon of petrol.

Import the file and store it as a `DataFrame` in a variable called `mpg_df`.
        
**Task 1**: Explore the data by answering the following questions:
1. Which columns in the <code>DataFrame</code> are strings?
2. What is the average number of miles per gallon of the car models in the data?
3. What are the unique numbers of cylinders observed in the data?
4. How many of the car models in the data were from Europe?
5. What is the correlation between cars' fuel economy and horsepower?
6. Are there any missing observations in the data?

In [None]:
import pandas as pd

mpg_df = pd.read_excel('data/mpg.xlsx')

# Count the columns that pandas stores as strings (dtype 'object')
count = 0

for col in mpg_df.columns:
    if mpg_df[col].dtype == 'object':
        print(f'Column {col} has dtype: {mpg_df[col].dtype}.')
        count += 1

print(f'Total count of columns with strings: {count}')

In [None]:
mpg_df.describe()

mean_mpg = mpg_df['mpg'].mean()

print(f'The average number of miles per gallon is {round(mean_mpg,2)}.')

In [None]:
# unique returns the distinct values (nunique would only count them)
unique_cylinders = sorted(mpg_df['cylinders'].unique().tolist())

print(f'The unique numbers of cylinders are {unique_cylinders}.')

In [None]:
origin_amount = mpg_df['origin'].value_counts().get('europe', 0)

print(f'The number of car models from Europe is {origin_amount}.')

In [None]:
correlation = mpg_df['mpg'].corr(mpg_df['horsepower'])

print(f'The correlation between miles per gallon and horsepower is {round(correlation, 2)}.')

In [None]:
missing = mpg_df.isna().sum()

actually_missing = missing[missing > 0]

print(f'Yes, the following are missing: {actually_missing}.')

**Task 2**: Transform the data by performing the following operations:
1. The column `model_year` ranges from 1970 to 1982, but it contains only the last two digits of the year. Change the column so that it also contains the first two digits, e.g., '74' should be '1974'.
2. Convert the data type of the column `model_year` to `datetime`.
3. Drop the rows with missing observations on `horsepower` and store the new `DataFrame` in a variable called `mpg_df2`.
4. Instead of dropping the rows with missing values, you want to replace missing values in `horsepower` by the *origin-specific* sample mean of `horsepower`. Create a new column in `mpg_df` called `hp_imputed` that contains the observations on `horsepower`. Then, replace the missing values in `hp_imputed` with the average value of `horsepower` given the origin of the car.

   For each origin in the data ('usa', 'europe', or 'japan'):
   - Filter rows on origin and calculate the sample mean of `horsepower`
   - Use `loc` to fill missing values in `hp_imputed` with sample mean in rows of given origin
   - *Hint: To avoid code redundancy, use a `for` loop with the following statement:*
     ```
     for origin in mpg_df['origin'].unique():
     ```
You should inspect the data to verify that the operation worked as expected and that missing values in `hp_imputed` have in fact been replaced by origin-specific sample means.

In [None]:
mpg_df = pd.read_excel('data/mpg.xlsx')

mpg_df['model_year'] = mpg_df['model_year'].astype(str)

mpg_df.loc[~mpg_df['model_year'].str.startswith('19'), 'model_year'] = '19' + mpg_df['model_year']

# State the format explicitly so that '1974' is parsed as a year
mpg_df['model_year'] = pd.to_datetime(mpg_df['model_year'], format = '%Y')

print(mpg_df['model_year'].dtype)

In [None]:
mpg_df2 = mpg_df.dropna(subset = ['horsepower'])

mpg_df2

In [None]:
mpg_df['hp_imputed'] = mpg_df['horsepower']

for origin in mpg_df['origin'].unique():
    rows_of_origin = mpg_df['origin'] == origin

    mean_hp = mpg_df.loc[rows_of_origin, 'horsepower'].mean()

    mpg_df.loc[rows_of_origin & mpg_df['hp_imputed'].isna(), 'hp_imputed'] = mean_hp

mpg_df
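
One possible alternative to the loop is to combine `groupby` with `transform`, which returns the group mean aligned to every row. The sketch below is an illustration on a hypothetical toy `DataFrame` (the horsepower values are made up), not the intended solution to the exercise.

```python
import pandas as pd
import numpy as np

# Hypothetical toy data, used only for illustration
cars = pd.DataFrame({
    'origin': ['usa', 'usa', 'usa', 'japan', 'japan'],
    'horsepower': [100.0, 200.0, np.nan, 60.0, np.nan],
})

# Mean horsepower of each origin, repeated for every row of that origin
group_means = cars.groupby('origin')['horsepower'].transform('mean')

# Fill each missing value with the mean of its own origin
cars['hp_imputed'] = cars['horsepower'].fillna(group_means)

print(cars['hp_imputed'].tolist())  # [100.0, 200.0, 150.0, 60.0, 60.0]
```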

### 📚 Exercise 2: Electricity consumption

The file `eurostat.xlsx` contains data on electricity consumption (in gigawatt-hours) for European countries from 2001 to 2023. 

**Task 1**: Import the file and store it in a variable called `df_euro`. Note that the file contains many unnecessary rows and columns. Transform the `DataFrame` so that:
- the index is the country (incl. EU and Euro area)
- the columns are the years from 2001 to 2023

The final `DataFrame` should have 43 rows and 23 columns.

*Hint*: use optional parameters in `read_excel` (e.g., `skipfooter`) to control how the file is imported.

**Task 2**: Use `df_euro` and calculate the following:
- Average electricity consumption in Finland from 2001 to 2023
- Sum of electricity consumption in all countries (not incl. EU and Euro area) in 2022

In [None]:
df_euro = pd.read_excel('data/eurostat.xlsx', sheet_name = 'Sheet 1', header = 9)

df_euro = df_euro.loc[:, ~df_euro.columns.str.contains("Unnamed")]
df_euro = df_euro.iloc[3:]
df_euro = df_euro.set_index('TIME')
df_euro.index.name = 'Country'

df_euro

In [None]:
df_euro = df_euro.replace({":": pd.NA, "p": pd.NA}).apply(pd.to_numeric, errors="coerce")

mean_finland = df_euro.loc['Finland'].mean()

print(f'The average consumption in Finland from 2001 to 2023 is {round(mean_finland,2)}.')

In [None]:
non_eu = df_euro.drop(index=[
    "European Union - 27 countries (from 2020)", 
    "Euro area - 20 countries (from 2023)"], 
    errors = "ignore")

total_2022 = non_eu['2022'].sum()

print(f'The total electricity consumption in 2022 across all countries (excluding the EU and Euro area aggregates) is {round(total_2022, 2)}.')

### 📚 Exercise 3: Labor market statistics

The file `FRED_monthly.csv` contains time series for the US economy for each month from 1948 to 2024. The column `UNRATE` is the average monthly unemployment rate. 

Import the file and store it in a variable called `df_fred`. 

**Task 1**: Use `df_fred` to calculate and print the following:
- Average unemployment rate in the data from 1948 to 2024.
- Average unemployment rate for each *decade* in the data from 1950 to 2010 for which you have all the observations.

*Hint*: The decade can be computed from the `Year` column using integer (floor) division:
```
df_fred['Year'] // 10 * 10
```
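
To see why this works, the sketch below applies the expression to a hypothetical toy `Series` of years: `// 10` drops the last digit, and `* 10` restores the scale.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
years = pd.Series([1948, 1950, 1959, 2024])

# Integer division by 10 drops the last digit; multiplying by 10 restores the scale
decades = years // 10 * 10

print(decades.tolist())  # [1940, 1950, 1950, 2020]
```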

**Task 2**: Create a new `DataFrame` called `df_fred_year`, which contains the average *annual* unemployment rate for each year in `FRED_monthly.csv`.

In [None]:
df_fred = pd.read_csv('data/FRED_monthly.csv')
# set_index returns a new DataFrame, so the result must be assigned
df_fred = df_fred.set_index('DATE')


In [None]:
un_rate = round(df_fred['UNRATE'].mean(), 4)

print(f'The average unemployment rate from 1948 to 2024 is {un_rate}.')

In [None]:
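df_fred['Decades'] = df_fred['Year'] // 10 * 10

# Keep only the complete decades requested in the task (1950 to 2010)
complete_decades = df_fred[df_fred['Decades'].between(1950, 2010)]

avg_per_decade = complete_decades.groupby('Decades')['UNRATE'].mean()

print(f'The average unemployment rate in each decade was \n{avg_per_decade.round(2)}.')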
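
Whether a decade has all its observations can also be checked from the data itself by counting the monthly observations per decade. The sketch below uses synthetic monthly data from 1948 to 1961 with a constant, made-up unemployment rate. The names `toy` and `Decade` are assumptions for illustration only.

```python
import pandas as pd

# Synthetic monthly data (1948-1961) with a made-up constant rate
months = pd.date_range('1948-01-01', '1961-12-01', freq = 'MS')
toy = pd.DataFrame({'Year': months.year, 'UNRATE': 5.0})
toy['Decade'] = toy['Year'] // 10 * 10

# A complete decade has 10 years x 12 months = 120 observations
obs_per_decade = toy.groupby('Decade')['UNRATE'].count()
print(obs_per_decade.tolist())  # [24, 120, 24]

# Keep only the decades with all 120 observations
complete = obs_per_decade[obs_per_decade == 120].index
print(complete.tolist())        # [1950]
```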

In [None]:
# Average annual unemployment rate, with Year as a regular column
df_fred_year = df_fred.groupby('Year')['UNRATE'].mean().reset_index()

print(f'The average unemployment rate per year is \n{df_fred_year.round(2)}.')