# 07 - Data wrangling

Data wrangling is the process of transforming data into a format that is more suitable for data analysis. This involves, e.g., transforming variables, aggregating data and merging data sets.

We have already seen how to perform simple operations on a DataFrame, e.g., creating new columns. This notebook shows more advanced operations that are common in data wrangling:
- Grouping data
- Combining data
- Reshaping data

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

plt.style.use('seaborn-v0_8-whitegrid')

## Grouping data

The pandas method `groupby` groups together rows based on the values in one or more columns and returns an object that contains information about the groups.

This is very helpful in data analysis as it helps us summarize information about different groups in our data.

In [None]:
grade_dict = {
    'Name'  : ['Ole', 'Jenny', 'Chang', 'Jonas'],
    'Age' : [18, 19, 22, 20],
    'Score' : [65.0, 58.0, 79.0, 95.0],
    'Pass'  : ['yes', 'no', 'yes', 'yes']
}

df = pd.DataFrame(grade_dict)

df

`groupby` returns an object that we can perform operations on. 

In [None]:
pass_group = df.groupby('Pass')

In [None]:
pass_group
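Displaying a grouped object only shows its type, not its contents. As a minimal sketch (using a shortened version of the toy grade data above), we can inspect the groups with `ngroups`, `get_group` or by looping over the object. The variable names `n_groups`, `passed` and `group_sizes` are only chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name'  : ['Ole', 'Jenny', 'Chang', 'Jonas'],
    'Score' : [65.0, 58.0, 79.0, 95.0],
    'Pass'  : ['yes', 'no', 'yes', 'yes']
})

pass_group = df.groupby('Pass')

# Number of distinct groups
n_groups = pass_group.ngroups

# Extract the rows that belong to a single group
passed = pass_group.get_group('yes')

# Looping over the object yields (group label, sub-DataFrame) pairs
group_sizes = {}
for label, group_df in pass_group:
    group_sizes[label] = len(group_df)

print(n_groups)
print(passed)
print(group_sizes)
```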

#### Aggregation

Often we want to apply an *aggregation* function on the data separately for each group. By aggregation we mean that the result of a computation has a lower dimension than the original data.

For example, we can use the `mean` function on a grouped object to calculate the average value in each numeric column.

In [None]:
pass_group.mean(numeric_only = True)

Note that grouped objects also support column indexing, so we can aggregate a single column.

In [None]:
pass_group['Score'].mean()

We have already seen how we can use `value_counts` to count the number of passengers in our Titanic data that survived.

In [None]:
# Import data
titanic = pd.read_csv('data/titanic.csv')

# Display value counts
titanic['Survived'].value_counts()

But what if we want to know the number of passengers that survived in 1st, 2nd and 3rd class?

Then we have to group the data together on the column `Pclass`.

In [None]:
titanic.groupby('Pclass')['Survived'].value_counts()

Or we can use `mean` to calculate the average age of passengers traveling 1st, 2nd and 3rd class.

In [None]:
titanic.groupby('Pclass')['Age'].mean()

But what if we want to know the average age for men and women traveling 1st, 2nd and 3rd class?

We can group the data by *multiple* columns by passing a list of column names to `groupby`.

In [None]:
titanic.groupby(['Pclass', 'Sex'])['Age'].mean()

There are numerous functions to aggregate grouped data, for example:
- `mean`: compute average within each group
- `sum`: sum values within each group
- `std`, `var`: within-group standard deviation and variance
- `median`: compute median within each group
- `quantile`: compute quantiles within each group
- `size`: number of observations in each group
- `count`: number of non-missing observations in each group
- `first`, `last`: first and last elements in each group
- `min`, `max`: minimum and maximum elements within a group


See the [official documentation](https://pandas.pydata.org/docs/user_guide/groupby.html#built-in-aggregation-methods) for a complete list.
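The difference between `size` and `count` only shows up when there are missing values. The sketch below uses a small made-up data set (not the Titanic file) with two missing ages to illustrate this.

```python
import numpy as np
import pandas as pd

# Made-up data with missing ages
df = pd.DataFrame({
    'Pclass': [1, 1, 2, 2, 2],
    'Age'   : [38.0, np.nan, 27.0, np.nan, 14.0]
})

grouped = df.groupby('Pclass')['Age']

n_rows = grouped.size()     # all rows in each group, including missing ages
n_valid = grouped.count()   # only rows with a non-missing age

print(n_rows)
print(n_valid)
```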

<div class="alert alert-info">
<h3> Your turn</h3>
    <p> Use the <TT>titanic</TT> data to find out what was the most expensive ticket, i.e. highest fare, in 1st, 2nd and 3rd class?
        
</div>

We can also perform aggregations on a column by applying the `agg` function on that column. Note that the name of the operation, e.g., mean, is now passed as a string in the function call.

In [None]:
# Calculate group means in a slightly more complicated way
titanic.groupby(['Pclass', 'Sex'])['Age'].agg('mean')

The benefit of using the `agg` function is that it allows us to perform multiple aggregations at the same time on a single column.

In [None]:
titanic.groupby(['Pclass', 'Sex'])['Age'].agg(['mean', 'median'])

Alternatively, we can use the slightly more advanced syntax to perform multiple aggregations on *multiple* columns in a grouped object:

```
groups.agg(
    new_column_name1 = ('column_name1', 'operation1'),
    new_column_name2 = ('column_name2', 'operation2')
)
```

In [None]:
titanic.groupby(['Pclass', 'Sex']).agg(
    average_age = ('Age', 'mean'),      # average age in group
    max_fare = ('Fare', 'max')          # maximum fare in group
)

**Plotting**

Grouping data can also be very helpful in plotting. For example, let us plot the *share* of survivors by 1st, 2nd and 3rd class.

First, we calculate the share of survivors in each class.

In [None]:
pclass = titanic.groupby('Pclass')['Survived'].mean()

pclass

Second, we update the index of the `Series`.

In [None]:
pclass.index = ['1st class', '2nd class', '3rd class']

pclass

Third, we use the `bar` function from `matplotlib` to show the share of survivors by class in a bar plot.

In [None]:
fig, ax = plt.subplots(figsize = (8, 3))

ax.bar(pclass.index, pclass, width = 0.5)

ax.set_ylabel('Share of survivors')
ax.set_title('Survival rate on the Titanic (by class)')

plt.show()

#### Transformations

In the previous section, we combined `groupby` with aggregation functions to reduce data on the group level to a single statistic, e.g., mean. Alternatively, we can combine `groupby` with the `transform` function to assign the result of a computation to a new column in the data. This will leave the number of observations unchanged (i.e., no aggregation).

For example, we can create a new column that contains the average value of the fare for specific groups in the Titanic data.

In [None]:
# Average fare for each Pclass
titanic['Fare_avg'] = titanic.groupby('Pclass')['Fare'].transform('mean')

titanic.head()

In [None]:
# titanic[(titanic['Pclass'] == 1)]

In general, we use `transform` instead of `agg` when we want to perform computation based on both the individual observations as well as some aggregate statistic.

In [None]:
# Difference between each passenger's fare and the average fare (by Pclass)
titanic['Fare_diff'] = titanic['Fare'] - titanic['Fare_avg']

titanic.head()
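To make the difference between `agg` and `transform` concrete, the following sketch compares the shape of the two results on a small made-up data set. The column name `Fare_rel` is hypothetical and only used for illustration.

```python
import pandas as pd

# Made-up fares for two classes
df = pd.DataFrame({
    'Pclass': [1, 1, 3, 3],
    'Fare'  : [80.0, 60.0, 8.0, 12.0]
})

by_class = df.groupby('Pclass')['Fare']

fare_avg = by_class.agg('mean')             # one value per group
fare_avg_row = by_class.transform('mean')   # one value per original row

# Because transform keeps the original shape, we can combine it
# directly with the individual observations
df['Fare_rel'] = df['Fare'] / fare_avg_row

print(fare_avg)
print(df)
```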

<div class="alert alert-info">
<h3> Your turn</h3>
    <p> Compute the <em>excess</em> fare paid by each passenger relative to the minimum fare by sex and class, i.e., compute $Fare - min(Fare)$ by sex and class. 
        
</div>

#### Time series data

Data can also be grouped on time properties when the data is a time series.

We have used the `to_datetime` function in pandas to convert timestamps (e.g., dates) from objects (strings) to the `datetime` data type.

In [None]:
apple = pd.read_csv('data/AAPL.csv')
apple['Date'] = pd.to_datetime(apple['Date'])
apple.sort_values('Date', inplace = True)

apple.head()

This can be very useful when working with time series. For example, it lets us easily filter the data on time.

In [None]:
# Select specific date
apple[apple['Date'] == '2020-01-02']

In [None]:
# Filter on range of dates
apple[(apple['Date'] >= '2020-03-15') & (apple['Date'] < '2020-06-15')]

In addition, `datetime` columns have time-related properties that we can access with the `dt` accessor, e.g., `year`, `month`, `day`, `hour` etc. Note that these property names are written in lowercase.

In [None]:
apple['Month'] = apple['Date'].dt.month

apple.head()
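As a small illustration of the `dt` accessor, the sketch below extracts a few time properties from a handful of hand-picked dates (not the Apple data). Besides the properties, the accessor also offers methods such as `day_name`.

```python
import pandas as pd

# A few hand-picked dates
dates = pd.Series(pd.to_datetime(['2020-01-02', '2020-03-16', '2020-12-31']))

years = dates.dt.year           # calendar year
months = dates.dt.month         # month number (1-12)
quarters = dates.dt.quarter     # quarter number (1-4)
weekdays = dates.dt.day_name()  # name of the weekday

print(pd.DataFrame({
    'date': dates, 'year': years, 'month': months,
    'quarter': quarters, 'weekday': weekdays
}))
```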

We can use the values from the `dt` accessor to group time series data on time properties, and then aggregate or transform the data.

In [None]:
apple.groupby('Month')[['Open', 'Close']].mean()

In [None]:
apple.groupby('Month').agg(
    Close_mean = ('Close', 'mean'),
    Volume_sum = ('Volume', 'sum')
)

In [None]:
apple.groupby('Month')['Close'].transform('mean')

Note that pandas offers several transformation functions that can be especially useful when working with time series.

For example, we can use `diff` to compute the change between two adjacent observations (i.e., rows).

In [None]:
apple['Close_diff'] = apple['Close'].diff()

apple.head()

In [None]:
apple['Volume_diff'] = apple.groupby('Month')['Volume'].diff()

apple[:50]

Other useful transformation functions are:
- `ffill`: Forward fill NA values within each group
- `bfill`: Back fill NA values within each group
- `cumsum`: Compute the cumulative sum within each group
- `pct_change`: Compute the percent change between adjacent values within each group
- `shift`: Shift values up or down within each group

See the [official documentation](https://pandas.pydata.org/docs/user_guide/groupby.html#built-in-transformation-methods) for a complete list.
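The sketch below applies three of these functions within groups. It assumes a made-up data set with two tickers (`A` and `B`); the new column names are only chosen for illustration. Note that the first observation in each group is missing for `shift` and `pct_change`, since there is no previous row within the group.

```python
import pandas as pd

# Made-up prices and volumes for two tickers
df = pd.DataFrame({
    'Ticker': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Close' : [10.0, 11.0, 12.1, 50.0, 45.0, 54.0],
    'Volume': [100, 200, 300, 10, 20, 30]
})

# Previous closing price within each ticker
df['Close_lag'] = df.groupby('Ticker')['Close'].shift(1)

# Percent change in closing price within each ticker
df['Return'] = df.groupby('Ticker')['Close'].pct_change()

# Running total of traded volume within each ticker
df['Volume_cum'] = df.groupby('Ticker')['Volume'].cumsum()

print(df)
```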

Finally, note that pandas offers a special function called `resample` that we can use when we want to group observations by time period and apply some aggregation function.

Note that `resample` offers an alternative to `groupby` when working with time series, but by default it requires that the index of the `DataFrame` is a `DatetimeIndex` (i.e., the index values are `datetime` objects).

In [None]:
apple = pd.read_csv('data/AAPL.csv')
apple['Date'] = pd.to_datetime(apple['Date'])
apple.set_index('Date', inplace = True)

apple.head()

We use `resample` by applying it on a `DataFrame` and pass it a string that describes how the observations should be grouped (`'YE'` for aggregation to years, `'QE'` for quarters, `'ME'` for months, `'W'` for weeks, etc.)

In [None]:
apple.resample('ME').mean()

In [None]:
apple.resample('W').last()
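If we prefer to keep the dates as a regular column, `resample` also accepts an `on` argument that names the `datetime` column to use. The sketch below uses two weeks of made-up daily data and compares this to `groupby` combined with `pd.Grouper`, which should give the same weekly sums. By default, weekly periods end on Sundays.

```python
import pandas as pd

# Two full weeks of made-up daily data (Monday 6 Jan to Sunday 19 Jan 2020)
df = pd.DataFrame({
    'Date'  : pd.date_range('2020-01-06', periods=14, freq='D'),
    'Volume': [1] * 14
})

# Resample on a datetime column instead of the index
weekly_on = df.resample('W', on='Date')['Volume'].sum()

# Same grouping expressed with groupby and pd.Grouper
weekly_grouper = df.groupby(pd.Grouper(key='Date', freq='W'))['Volume'].sum()

print(weekly_on)
print(weekly_grouper)
```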

## Combining data

Pandas offers several different ways of combining multiple `DataFrames` along the row or column axes. The two most useful functions for combining data are `concat` and `merge`.

#### Concatenating 

We can use `concat` to *stack* `DataFrames` that share the same columns, but have different observations.

In [None]:
grade_dict = {
    'Name'  : ['Ole', 'Jenny', 'Chang', 'Jonas'],
    'Age' : [18, 19, 22, 20],
    'Score' : [65.0, 58.0, 79.0, 95.0],
    'Pass'  : ['yes', 'no', 'yes', 'yes']
}

df = pd.DataFrame(grade_dict)

df

In [None]:
# Create new dict with additional grades
grade_dict2 = {
    'Name'  : ['Nico', 'Maria', 'Mario', 'Janne'],
    'Age'   : [18, 24, 21, 20], 
    'Score' : [67, 48, 92, 71], 
    'Pass'  : ['yes', 'no', 'yes', 'yes']
}

df2 = pd.DataFrame(grade_dict2)

df2 

As a default, `concat` stacks a list of `DataFrames` on top of each other. 

In [None]:
pd.concat([df, df2])

But what if the `DataFrames` do not have the exact same columns?

Let us drop `Pass` from `df2`.

In [None]:
df2.drop('Pass', axis = 1, inplace = True)

df2

We can still concatenate the `DataFrames`. In that case, `concat` will simply fill the cells with missing data with `NaN`.

In [None]:
df3 = pd.concat([df, df2]) 

df3

However, note that `concat` also concatenates the index, which means that the index values are no longer unique for each observation (i.e. row). 

This can be fixed using the `reset_index` function.

In [None]:
df3.reset_index(inplace = True, drop = True)

df3
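Alternatively, we can avoid the duplicated index values in the first place by passing `ignore_index = True` to `concat`. A minimal sketch with two small made-up DataFrames (`df_a` and `df_b` are hypothetical names):

```python
import pandas as pd

df_a = pd.DataFrame({'Name': ['Ole', 'Jenny'], 'Score': [65.0, 58.0]})
df_b = pd.DataFrame({'Name': ['Nico', 'Maria'], 'Score': [67.0, 48.0]})

# Default: the original index values are kept (0, 1, 0, 1)
stacked = pd.concat([df_a, df_b])

# ignore_index = True: a new index is created (0, 1, 2, 3)
stacked_clean = pd.concat([df_a, df_b], ignore_index=True)

print(stacked)
print(stacked_clean)
```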

<div class="alert alert-info">
<h3> Your turn</h3>
    <p> Load the data in the files <TT>FRED_monthly_1990.csv</TT> and <TT>FRED_monthly_2000.csv</TT> found in the <TT>data</TT> subfolder. The files contain macroeconomic time series for the 1990s and 2000s, respectively.

- Concatenate the two data sets to get a final DataFrame with 240 observations.
- Set the column <TT>DATE</TT> as the index in the newly created DataFrame.
- Calculate the average unemployment rate (column <TT>UNRATE</TT>) in the data by month.
 
</div>

The most common use of `concat` is when we have observations on the same variables scattered across multiple data sets. But it is also possible to concatenate data sets along the column dimension by specifying `axis = 1` in `concat`.

In [None]:
pd.concat([df, df2], axis = 1)

However, it is very rare that we want to "stack" data sets side-by-side in this way. Instead, we usually combine data along the column dimension by *merging* data sets according to one or several keys (i.e., common columns/identifiers).

See the [function documentation](https://pandas.pydata.org/docs/reference/api/pandas.concat.html) for more information on `concat`.

#### Merging

While concatenation simply appends (or stacks) blocks of rows or columns from multiple data sets, merging involves more control over how the data should be combined. 

We use `merge` to combine data sets that share the same observations (i.e., rows), but have different columns.

In [None]:
df1 = pd.DataFrame({
    'Name': ['Ole', 'Jenny', 'Chang', 'Jonas', 'Mario'], 
    'Score1' : [65.0, 58.0, 79.0, 95.0, 92.0]
})

df1

In [None]:
df2 = pd.DataFrame({
    'Name': ['Ole', 'Chang', 'Jonas', 'Mario', 'Nico', 'Maria'], 
    'Score2' : [70.0, 77.0, 92.0, 92.0, 72.0, 68.0]
})

df2

To merge two data sets, we apply the `merge` function on the first data set, and then pass the second data set as the first input to the function call. In addition, we need to specify the `on` parameter, which requires the label of the shared column between the two data sets (i.e., the key). 

In [None]:
df1.merge(df2, on = 'Name')

Note that by default, `merge` combines only those observations (i.e., rows) found in both data sets. However, in data analysis, we often encounter the issue that some observations are present in one data set but not in the other. In that case, we also need to specify the `how` parameter in `merge`, which determines which subset of the data we retain in the final data set:

1. `how='inner'` performs an *inner join*: the merged data contains only the intersection of keys that are present in both data sets.
2. `how='outer'` performs an *outer join*: the merged data contains the union of keys present in either of the data sets.
3. `how='left'` performs a *left join*: all observations from the left data set are present in the final data, but rows that are only present in the right data are dropped.
4. `how='right'` performs a *right join*: all observations from the right data set are present in the final data, but rows that are only present in the left data are dropped.




A left join keeps all the observations in the first data set.

In [None]:
df1.merge(df2, on = 'Name', how = 'left')

A right join keeps all the observations in the second data set.

In [None]:
df1.merge(df2, on = 'Name', how = 'right')

An outer join keeps all observations in both data sets.

In [None]:
df1.merge(df2, on = 'Name', how = 'outer')
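When checking the result of a merge, it can be useful to know which data set each row came from. One way to do this is the `indicator` parameter in `merge`, which adds a column called `_merge` to the result. The sketch below uses shortened versions of the two score data sets; the dictionary `origin` is only created to make the result easy to read.

```python
import pandas as pd

df1 = pd.DataFrame({
    'Name'  : ['Ole', 'Jenny', 'Chang'],
    'Score1': [65.0, 58.0, 79.0]
})
df2 = pd.DataFrame({
    'Name'  : ['Ole', 'Chang', 'Nico'],
    'Score2': [70.0, 77.0, 72.0]
})

# indicator = True adds the column '_merge' with the values
# 'both', 'left_only' or 'right_only'
merged = df1.merge(df2, on='Name', how='outer', indicator=True)

# Map each name to the origin of its row
origin = merged.set_index('Name')['_merge'].astype(str).to_dict()

print(merged)
```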

Note that when the `DataFrames` have more than one common variable, we must merge on *multiple* keys. Otherwise, we get duplicate columns in the merged data.

In [None]:
df1 = pd.DataFrame({
    'Name'       : ['Ole', 'Jenny', 'Chang', 'Jonas', 'Mario'],
    'Student_no' : ['s1001', 's1002', 's1003', 's1004', 's1005'],
    'Score1'     : [65.0, 58.0, 79.0, 95.0, 92.0]
})

df1

In [None]:
df2 = pd.DataFrame({
    'Name'       : ['Ole', 'Chang', 'Jonas', 'Mario', 'Nico', 'Maria'],
    'Student_no' : ['s1001', 's1003', 's1004', 's1005', 's1006', 's1007'],
    'Score2'     : [70.0, 77.0, 92.0, 92.0, 72.0, 68.0]
})

df2

In [None]:
df1.merge(df2, on = 'Name', how = 'outer')

We merge on multiple keys by passing a *list* of column labels to the `on` parameter.

In [None]:
df_merge = df1.merge(df2, on = ['Name', 'Student_no'], how = 'outer')

df_merge
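If we merge on only one of the shared columns, pandas keeps both versions of the other shared column and adds the suffixes `_x` and `_y` to their labels. As a small sketch on shortened versions of the data above, the `suffixes` parameter lets us choose more descriptive labels (the suffixes `_left` and `_right` are just examples).

```python
import pandas as pd

df1 = pd.DataFrame({
    'Name'       : ['Ole', 'Jenny'],
    'Student_no' : ['s1001', 's1002'],
    'Score1'     : [65.0, 58.0]
})
df2 = pd.DataFrame({
    'Name'       : ['Ole', 'Nico'],
    'Student_no' : ['s1001', 's1006'],
    'Score2'     : [70.0, 72.0]
})

# Merging on 'Name' only: 'Student_no' appears twice with default suffixes
default_cols = list(df1.merge(df2, on='Name', how='outer').columns)

# Custom suffixes make the origin of each duplicated column explicit
custom_cols = list(
    df1.merge(df2, on='Name', how='outer', suffixes=('_left', '_right')).columns
)

print(default_cols)
print(custom_cols)
```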

See the [function documentation](https://pandas.pydata.org/docs/reference/api/pandas.merge.html) for more information on `merge`.

<div class="alert alert-info">
<h3> Your turn</h3>
    <p> The file <TT>titanic_additional.csv</TT> in the <TT>data</TT> subfolder contains additional information for the passengers on the Titanic: 
        
- <TT>Ticket</TT>: ticket number
- <TT>Cabin</TT>: Deck + cabin number
- <TT>Embarked</TT>: Port at which passenger embarked: <TT>C</TT> - Cherbourg, <TT>Q</TT> - Queenstown, <TT>S</TT> - Southampton
        
Import the file and merge it with the <TT>titanic</TT> data.
        
</div>

## Reshaping data

In data analysis, we often want *tidy* data, which is a standard format for organizing data sets so that variables, observations and values are consistently structured into columns, rows and cells:
1. Each column is a variable
2. Each row is an observation
3. Each cell contains a single value

<img src="images/tidy.png" width = "80%" align="left"/>

To ensure tidy data, we sometimes have to transform the shape of our data sets. In general, there are two types of data format: 
- **Wide data**: Each row represents a single entity and different attributes of that entity are spread across multiple columns (fewer rows and more columns)
- **Long data**: Each row represents the value of a single attribute for a specific entity (more rows and fewer columns)

<img src="images/format.png" width = "60%" align="left"/>

Let us create a wide data set in which we observe students and their score in different subjects.

In [None]:
wide_df = pd.DataFrame({
    'Student': ['Ole', 'Jenny', 'Chang', 'Jonas'],
    'Math'   : [88, 92, 85, 79],
    'English': [90, 85, 87, 93],
    'PE'     : [95, 89, 92, 88]
})

wide_df

We can use `melt` to reshape the data from a wide to long format, i.e., a single column with the scores and a new column that indicates the subject.

To use `melt`, we must pass a column label to the `id_vars` parameter. This is the column that denotes the entities (i.e., the unit of observation) and which we want to leave "untouched".

In [None]:
wide_df.melt(id_vars = 'Student')

In addition, we can pass arguments to the `var_name` and `value_name` parameters in order to specify the labels of the new columns in the data.

In [None]:
long_df = wide_df.melt(
    id_vars = 'Student', 
    var_name = 'Subject', # Name of new column with old column labels
    value_name = 'Score'  # Name of new column with old column values
)

long_df

See the [function documentation](https://pandas.pydata.org/docs/reference/api/pandas.melt.html) for more information on `melt`.
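By default, `melt` reshapes all columns that are not listed in `id_vars`. If we only want to reshape some of the columns, we can list them in the `value_vars` parameter. A minimal sketch on a shortened version of the student data, in which we keep only the scores in Math and English:

```python
import pandas as pd

wide_df = pd.DataFrame({
    'Student': ['Ole', 'Jenny'],
    'Math'   : [88, 92],
    'English': [90, 85],
    'PE'     : [95, 89]
})

long_df = wide_df.melt(
    id_vars = 'Student',
    value_vars = ['Math', 'English'],   # Columns to reshape (PE is dropped)
    var_name = 'Subject',
    value_name = 'Score'
)

print(long_df)
```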

<div class="alert alert-info">
<h3> Your turn</h3>
    <p> Load the <TT>apple</TT> data and reshape the data from wide to long using only the price columns: "Open", "High", "Low" and "Close". The reshaped DataFrame should have 1,008 rows and the following columns:
        
- <TT>Date</TT>: Date of a given observation
- <TT>Price</TT>: String indicating the type of price (e.g., "Close")
- <TT>Value</TT>: Value of a given price metric on a given date
        
</div>

Reshaping data from long to wide is known as "pivoting". To pivot a `DataFrame`, we can use the function `pivot`.

To use `pivot`, we must specify the following parameters:

- `index`: column to use as the new index 
- `columns`: column to use as the new column labels 
- `values`: column to use as the new values in the columns

Let us use `pivot` to reshape our student data back to a wide format.

In [None]:
long_df.pivot(
    index = 'Student',   # Column used as index in new df
    columns = 'Subject', # Column used as new columns labels in the df
    values = 'Score'     # Column used to populate the new columns
)

As before, we can use `reset_index` to move the new index back into the `DataFrame` as a regular column.

In [None]:
# Pivot from long to wide (and reset index)
wide_df = long_df.pivot(index = 'Student', columns = 'Subject', values = 'Score').reset_index()

# Remove index name (not necessary, just to make it look nicer)
# wide_df.rename_axis(None, axis = 1, inplace = True)

wide_df

See the [function documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot.html) for more information on `pivot`.
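Note that `pivot` requires each combination of index and column values to appear only once in the data. If there are duplicates, `pivot` raises a `ValueError`. In that case, one option is the related function `pivot_table`, which aggregates the duplicates with the function passed to `aggfunc`. The sketch below uses made-up data in which Ole has two Math scores; the flag `pivot_failed` is only used for illustration.

```python
import pandas as pd

# Made-up long data where Ole has two scores in Math
long_df = pd.DataFrame({
    'Student': ['Ole', 'Ole', 'Ole', 'Jenny'],
    'Subject': ['Math', 'Math', 'PE', 'Math'],
    'Score'  : [80, 90, 95, 92]
})

# pivot cannot place two values in the same cell
try:
    long_df.pivot(index='Student', columns='Subject', values='Score')
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table aggregates the duplicates (here: the mean score)
wide_mean = long_df.pivot_table(
    index = 'Student',
    columns = 'Subject',
    values = 'Score',
    aggfunc = 'mean'
)

print(pivot_failed)
print(wide_mean)
```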

## Additional resources
- See the official [user guide](https://pandas.pydata.org/docs/user_guide/groupby.html) for more information and examples on how to group data in pandas.
- See the official [user guide](https://pandas.pydata.org/docs/user_guide/merging.html) for more information and examples on how to combine data in pandas.
- See the official [user guide](https://pandas.pydata.org/docs/user_guide/reshaping.html) for more information and examples on how to reshape data in pandas.

# Home exercises

In the previous lecture, we saw how we could create non-value returning functions to display plots. Note that we can also create functions that return a DataFrame. Using self-defined functions in data analysis is very useful, especially to automate the workflow, such as applying the same operation on multiple data sets or columns in a DataFrame.

However, we have to take care when designing functions that transform data as DataFrames are a *mutable* data type. In general, it is a good rule-of-thumb to make a copy of the original DataFrame inside a function to avoid the function altering the underlying data.

For example, we can create a function that takes a DataFrame and converts all the column labels in the data to lowercase. We use the `copy` function to create a new copy of the data inside the function before performing the operation. The function returns a *copy* of the DataFrame, but with all column labels in lowercase.

In [None]:
def lower_cols(df):

    col_names = []

    # Loop over column labels and convert to lowercase
    for col in df.columns:
        col_name = col.lower()
        col_names.append(col_name)

    # Make a copy of old df
    df_new = df.copy()      
    
    # Update column names in new df
    df_new.columns = col_names
    
    return df_new

In [None]:
# Import data
apple = pd.read_csv('data/AAPL.csv')

apple.head()

In [None]:
# Convert column labels to lower case
apple_new = lower_cols(apple)

apple_new.head()

By working on a copy inside the function, the function call did not alter the original DataFrame.

In [None]:
apple.head()
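To see why the copy matters, the sketch below compares two hypothetical variants of the function on a small made-up DataFrame: `lower_cols_copy` works on a copy, while `lower_cols_inplace` changes the column labels directly on the DataFrame that is passed to it.

```python
import pandas as pd

def lower_cols_copy(df):
    # Work on a copy, so the caller's DataFrame is left unchanged
    df_new = df.copy()
    df_new.columns = [col.lower() for col in df_new.columns]
    return df_new

def lower_cols_inplace(df):
    # No copy: this changes the caller's DataFrame as a side effect
    df.columns = [col.lower() for col in df.columns]
    return df

original = pd.DataFrame({'Open': [1.0], 'Close': [2.0]})

result = lower_cols_copy(original)
cols_after_copy = list(original.columns)      # labels are unchanged

lower_cols_inplace(original)
cols_after_inplace = list(original.columns)   # labels are now lowercase

print(cols_after_copy)
print(cols_after_inplace)
```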

So far, we have focused on how to transform numeric data to suit the purpose of our analysis. However, most of the data that we deal with contain strings, i.e., text data (name, addresses, etc.). Often this data is not in the format needed for analysis, and we have to perform additional string manipulation to extract the data we need.

Such string manipulation can be achieved using the pandas [string methods](https://pandas.pydata.org/docs/user_guide/text.html#string-methods).

These string methods can be accessed using the `str` accessor of string columns.

For example, let us use `lower` to convert all names in the Titanic data to lowercase.

In [None]:
titanic = pd.read_csv('data/titanic.csv')
titanic.head()

In [None]:
titanic['Name'].str.lower()

Or we can use the [`partition`](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.partition.html) method to split a string column on the first space (the default separator). This will return each part of the string, including the separator itself, as a separate column in a DataFrame.

In [None]:
titanic['Name'].str.partition()
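We can also pass a different separator to `partition`. The sketch below splits two made-up names (not taken from the Titanic data) on the first comma. The result has three columns labeled 0, 1 and 2: the text before the separator, the separator itself, and the text after it. We use `strip` to remove the leading space from the last part.

```python
import pandas as pd

# Made-up names in the format 'Last name, Title First name'
names = pd.Series(['Hansen, Mr. Ola', 'Berg, Miss. Kari'])

# Split on the first comma
parts = names.str.partition(',')

last_name = parts[0]            # text before the comma
rest = parts[2].str.strip()     # text after the comma, without whitespace

print(parts)
print(rest)
```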

### 📚 Exercise 1: Titanic aggregations

Load and merge the data in <code>titanic.csv</code> and <code>titanic_additional.csv</code> to perform the following aggregations:
1. Compute the average survival rate by sex.
2. Count the number of passengers aged +50. Compute the average survival rate by sex for this group.
3. Count the number of passengers below the age of 20 by class and sex. Compute the average survival rate for this group by class and sex.
4. Count the number of non-missing values in each column by class and sex. 
5. Compute the minimum, maximum and average age by embarkation port (column `Embarked`) in a single `agg` operation. 
6. Compute the number of passengers, the average age and the fraction of women by embarkation port in a single `agg` operation.

   *Hint*: to compute the fraction of women, you can first create a numerical indicator variable for females.

### 📚 Exercise 2: Working with Titanic string data

In this exercise, you will work with the original Titanic data set in `titanic.csv` and additional data stored in `titanic_address.csv`, which contains the address for each passenger. Note that the second data set contains address information only for passengers from the UK, while all other passengers (non-UK) have missing address information.

The goal of the exercise is to calculate the survival rate by country of residence (for this exercise, we restrict ourselves to the UK, so these will be England, Scotland, Wales etc.).

**Task 1**: Load `titanic.csv` and `titanic_address.csv` into two DataFrames.

Inspect the columns contained in both data sets. As you can see, the original data contains the full name including the title and potential maiden name (for married women) in a single column. The address data contains this information in separate columns. You want to merge these data sets, but first you need to create common keys (i.e., columns) in both DataFrames.

**Task 2**: In the DataFrame with the original Titanic data, split the name information into three columns just like the columns in the second DataFrame by doing the following:
- Restrict the sample to men only. (This simplifies the task. Women in this data set have much more complicated names as they contain both their husband's and their maiden name). The filtered DataFrame should have 577 passengers.
- Split the `Name` column by `,` to extract the last name and the remainder as separate columns. You can achieve this by using the [`partition`](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.partition.html#pandas.Series.str.partition) string method.
- Split the remainder (containing the title and first name) using the space character `" "` as separator to obtain individual columns for the title and the first name.
- Store the three data series in the original DataFrame (using the column names `FirstName`, `LastName` and `Title`) and delete the `Name` column which is no longer needed.

*Hint*: Make sure that you don't have any leading or trailing whitespace at the start/end of the strings after partition. You can remove whitespace using the [`strip`](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.strip.html) method
```
df['FirstName'].str.strip()
```

**Task 3**: Merge the original Titanic data with the address data based on the name columns you just created using a *left join*. Since we don't have address information for non-UK residents, you can drop the passengers with missing addresses. The merged DataFrame should have 471 passengers with non-missing address information.

**Task 4**: The file `UK_post_codes.csv` contains UK post code prefixes (which you can ignore), the corresponding city, and the corresponding country.

Import the file and merge this data with your passenger data set using a *left join*.

*Hint*: The data with the post codes contains duplicate rows for countries due to the different postal codes. Before merging, you should ensure that you have only one row for each country-city combination. You can drop duplicate rows using the [`drop_duplicates`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html) method.

**Task 5**: Using the final DataFrame, compute the average survival rate by country of residence.

### 📚 Exercise 3: Importing multiple stock files

The subfolder `stocks` in the `data` folder contains data on prices and traded volume for each weekday in 2020 for 10 different companies. The data for each company is stored in a separate csv file with the company ticker as the file name.

Your task is to import and combine the data sets into a single DataFrame and then calculate the monthly sum of traded volume by company.

**Task 1**: Import the files and combine them into a single DataFrame. Make sure that the dates in the final DataFrame are stored as datetime objects.

*Hint*: Create a list of all the file names in the folder (e.g., use [`listdir`](https://docs.python.org/3/library/os.html#os.listdir) from `os` to generate the list) and then import each file in a `for` loop in which you append each DataFrame to a list. Use `concat` to combine all the DataFrames in the final list.

**Task 2**: Calculate the monthly sum of traded volume for each ticker in the data in three different ways:
1. Compute the monthly sums "manually" by looping over the data instead of using pandas aggregation methods (e.g., `groupby`).

   *Hint*: use a nested `for` loop, in which the outer loop iterates over ticker names, and the inner loop iterates over months. Recall that you can use the `dt` accessor to access time properties from a datetime object.
2. Compute the monthly sums using the pandas aggregation method `groupby` instead of loops.
3. Compute the monthly sums also using the pandas method `resample`.

### 📚 Exercise 4: Reshaping electricity data

The file `eurostat.xlsx` contains data on electricity consumption (in gigawatt-hours) for European countries from 2001 to 2023. 

1. Import the file and keep only observations for the years 2001 to 2020 and for actual countries (i.e., drop the EU/Euro aggregates). The data should have 41 countries observed for 20 unique years.
   
   *Hint*: See the solution proposal to home exercise #2 in lecture 5.
   
2. Many countries have missing observations on electricity consumption in some years. Calculate for how many years each country has a non-missing observation.

   *Hint*: First reshape the data from wide to long using the pandas method `melt`, and then count the number of non-missing observations for each country in a `groupby`.

3. Drop the countries from the data that you do not observe for every single year between 2001 and 2020. Note that you should have 35 countries left in the data.

4. Calculate the average annual electricity consumption for the countries with complete data. Display this in a horizontal bar plot that shows the countries in a descending order (high to low). Add a vertical line to the bar plot that shows the average annual electricity consumption across all the countries in the data (i.e., unweighted average).