# 05 - Pandas basics

This notebook contains solution proposals to the home exercises.

In [3]:
import pandas as pd
import numpy as np

### 📚 Exercise 1: Fuel economy

The file `mpg.xlsx` contains observations on fuel economy and six additional attributes for 398 different car models. The column `mpg` is a measure of the car's fuel economy, i.e. the number of miles per gallon of petrol.

Import the file and store it as a `DataFrame` in a variable called `mpg_df`.

In [4]:
mpg_df = pd.read_excel('mpg.xlsx')

**Task 1**: Explore the data by answering the following questions:
1. Which columns in the <code>DataFrame</code> are strings?

In [5]:
# Print overview of data types (info() prints directly and returns None, so no print() is needed)
print('Overview of datatypes:\n')
mpg_df.info()

Overview of datatypes:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           398 non-null    float64
 1   cylinders     398 non-null    int64  
 2   horsepower    392 non-null    float64
 3   weight        398 non-null    int64  
 4   acceleration  398 non-null    float64
 5   model_year    398 non-null    int64  
 6   origin        398 non-null    object 
 7   name          398 non-null    object 
dtypes: float64(3), int64(3), object(2)
memory usage: 25.0+ KB


2. What is the average number of miles per gallon of the car models in the data?

In [None]:
# Extract mean of mpg from descriptives...
mean_mpg = mpg_df.describe().loc['mean', 'mpg']
# ...or use "mean" function on mpg column
# mean_mpg = mpg_df['mpg'].mean()
print(f'Average mpg: {mean_mpg:.2f}')

3. How many unique values of cylinders are observed in the data?

In [None]:
# Print number of unique cylinders
print(f"Number of unique values of cylinders: {mpg_df['cylinders'].nunique()}")
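
As a small aside, `nunique` returns the count of distinct values, while `unique` returns the values themselves. The sketch below uses made-up cylinder counts (not the contents of `mpg.xlsx`) to show the difference.

```python
import pandas as pd

# Made-up cylinder counts for illustration only
cylinders = pd.Series([8, 4, 6, 4, 8, 3])

# unique() returns the distinct values; sort them for readability
values = sorted(cylinders.unique().tolist())

# nunique() returns how many distinct values there are
count = cylinders.nunique()

print(f'Distinct values: {values}')
print(f'Number of distinct values: {count}')
```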

4. How many of the car models in the data were from Europe?

In [None]:
# Print value count of origin
origin = mpg_df['origin'].value_counts()
print(f"Number of car models from Europe: {origin.loc['europe']}")

5. What is the correlation between cars' fuel economy and horsepower?

In [None]:
# Print correlation between mpg and horsepower
correlation = mpg_df['mpg'].corr(mpg_df['horsepower'])
print(f'Correlation between mpg and horsepower: {correlation:.2f}')
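
Note that `horsepower` has missing values. `Series.corr` computes the Pearson correlation by default and only uses the rows where both values are present, so the rows do not have to be dropped first. A minimal sketch with made-up numbers (the `toy` data is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Toy data (made up), with one missing value in "horsepower"
toy = pd.DataFrame({
    'mpg': [18.0, 15.0, 26.0, 32.0, 21.0],
    'horsepower': [130.0, 165.0, 95.0, np.nan, 110.0],
})

# Series.corr ignores the pair where horsepower is missing
corr_auto = toy['mpg'].corr(toy['horsepower'])

# Same calculation after dropping the incomplete row explicitly
complete = toy.dropna()
corr_manual = complete['mpg'].corr(complete['horsepower'])

print(f'{corr_auto:.4f} vs {corr_manual:.4f}')
```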

6. Are there any missing observations in the data?

In [None]:
# Print number of missing values
print('Overview of NaN:\n')
print(mpg_df.isna().sum())

**Task 2**: Transform the data by performing the following operations:
1. The column `model_year` ranges from 1970 to 1982, but it contains only the last two digits of the year. Change the column so that it also contains the two first digits, e.g., '74' should be '1974'.

In [None]:
# Alternative 1: convert to string and then add the string "19" to the beginning of each year
mpg_df['model_year'] = mpg_df['model_year'].astype('str')
mpg_df['model_year'] = '19' + mpg_df['model_year']
 
# Alternative 2: keep as integer and simply add 1900 to each year
#mpg_df['model_year'] = mpg_df['model_year'] + 1900

mpg_df.head()

2. Convert the data type of the column `model_year` to `datetime`.

In [None]:
# Note: specifying the format is optional here, as pandas is able to detect the format (year) in this case
mpg_df['model_year'] = pd.to_datetime(
    mpg_df['model_year'], 
    format = '%Y'
)

mpg_df.head()
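
The two steps above can be sketched on a small made-up series. Converting the years to strings before calling `pd.to_datetime` matters: plain integers passed without a format are read as nanoseconds since the Unix epoch, not as years. The variable names below are illustrative only.

```python
import pandas as pd

# Two-digit years (made-up sample)
short_years = pd.Series([70, 74, 82])

# Add the century, then convert to strings such as '1974'
full_years = (short_years + 1900).astype(str)

# Parse as dates; each year becomes January 1 of that year
dates = pd.to_datetime(full_years, format='%Y')
print(dates.dt.year.tolist())

# Pitfall: integers without a format are treated as nanoseconds since 1970-01-01
wrong = pd.to_datetime(pd.Series([1974]))
print(wrong.iloc[0])
```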

3. Drop the rows with missing observations on `horsepower` and store the new `DataFrame` in a variable called `mpg_df2`.

In [None]:
# Drop rows with missing from original dataframe
mpg_df2 = mpg_df.dropna(
    subset = 'horsepower', # not necessary to specify subset as "horsepower" is the only column with nans
    axis = 0
)

mpg_df2.head()

In [None]:
# Check that nans were dropped
mpg_df2.isna().sum()

4. Instead of dropping the rows with missing values, you want to replace missing values in `horsepower` by the *origin-specific* sample mean of `horsepower`. Create a new column in `mpg_df` called `hp_imputed` that contains the observations on `horsepower`. Then, replace the missing values in `hp_imputed` with the average value of `horsepower` given the origin of the car.

   For each origin in the data ('usa', 'europe', or 'japan'):
   - Filter rows on origin and calculate the sample mean of `horsepower`
   - Use `loc` to fill missing values in `hp_imputed` with sample mean in rows of given origin
   - *Hint: To avoid code redundancy, use a `for` loop with the following statement:*
     ```
     for origin in mpg_df['origin'].unique():
     ```
You should inspect the data to verify that the operation worked as expected and that missing values in `hp_imputed` have in fact been replaced by origin-specific sample means.

In [None]:
# Create new column with horsepower
mpg_df['hp_imputed'] = mpg_df['horsepower']

mpg_df.head()

In [None]:
# Loop over unique origins in the data
for origin in mpg_df['origin'].unique():

    # Calculate origin-specific mean
    mean = mpg_df[mpg_df['origin'] == origin]['horsepower'].mean()

    # Replace missing values in hp_imputed with origin-specific mean
    mpg_df.loc[(mpg_df['hp_imputed'].isna()) & (mpg_df['origin'] == origin), 'hp_imputed'] = mean

mpg_df.head()

In [None]:
# Check that there are no missing in the new column
mpg_df.isna().sum()

In [None]:
# Check the rows in which horsepower has been imputed
mpg_df[mpg_df['horsepower'] != mpg_df['hp_imputed']]
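
The exercise asks for a `for` loop, but the same imputation can also be written without one. The following is a sketch of an alternative using `groupby` and `transform`, shown on made-up data rather than on `mpg_df`:

```python
import numpy as np
import pandas as pd

# Toy data (made up) with missing horsepower in two origins
toy = pd.DataFrame({
    'origin': ['usa', 'usa', 'usa', 'europe', 'europe', 'japan'],
    'horsepower': [150.0, np.nan, 100.0, 80.0, np.nan, 70.0],
})

# Origin-specific mean, repeated on every row of that origin (NaNs are skipped)
group_means = toy.groupby('origin')['horsepower'].transform('mean')

# Fill only the missing entries with the mean of the matching origin
toy['hp_imputed'] = toy['horsepower'].fillna(group_means)

print(toy)
```

Because `transform` returns a series aligned with the original rows, `fillna` can pick the right mean for each missing entry.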

### 📚 Exercise 2: Electricity consumption

The file `eurostat.xlsx` contains data on electricity consumption (in gigawatt-hours) for European countries from 2001 to 2023. 

**Task 1**: Import the file and store it in a variable called `df_euro`. Note that the file contains many unnecessary rows and columns. Transform the `DataFrame` so that:
- the index is the country (incl. EU and Euro area)
- the columns are the years from 2001 to 2023

The final `DataFrame` should have 43 rows and 23 columns.

*Hint*: use optional parameters in `read_excel` (e.g., `skipfooter`) to control how the file is imported.

In [None]:
# Import file
df_euro = pd.read_excel(
    'eurostat.xlsx', 
    sheet_name = 'Sheet 1',           # Specify which sheet to import
    skiprows = list(range(9)) + [10], # Skip rows at the top of the file
    skipfooter = 5,                   # Skip rows at the bottom of the file
)

df_euro.head()

In [None]:
# Create list of years as strings
years = [str(i) for i in range(2001, 2024)]

# Keep only columns with country and year observations
df_euro = df_euro[['TIME'] + years]

# Use country column as index
df_euro.set_index('TIME', inplace = True)

df_euro

**Task 2**: Use `df_euro` and calculate the following:
- Average electricity consumption in Finland from 2001 to 2023
- Sum of electricity consumption in all countries (not incl. EU and Euro area) in 2022

In [None]:
# Note that no values are reported as missing,
# but that is only because missing values are indicated with ":" in the file
df_euro.isna().sum()

In [None]:
# Note also that this causes many columns to be objects instead of numeric
df_euro.info()

In [None]:
# Use replace function to change these values to NaN,
# then let pandas re-infer the column types so that the columns become numeric
df_euro = df_euro.replace(':', np.nan).infer_objects()

# Most columns contain NaNs now
df_euro.isna().sum()

In [None]:
# Columns are now numeric and we can calculate the statistics
df_euro.info()

In [None]:
avg_fin = df_euro.loc['Finland'].mean()
print(f'Average annual electricity consumption in Finland, 2001-2023: {avg_fin:,.2f} GWh')

In [None]:
# Rows from Belgium onwards are individual countries (EU and Euro area are listed first)
sum_22 = df_euro.loc['Belgium':, '2022'].sum()
print(f'Sum of electricity consumption in Europe in 2022: {sum_22:,.2f} GWh')
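
Slicing from `'Belgium'` onwards depends on the row order in the file. A sketch of an order-independent alternative is to drop the aggregate rows by label before summing. The index labels and numbers below are hypothetical placeholders, not the actual labels or values in `eurostat.xlsx`.

```python
import pandas as pd

# Hypothetical labels for the aggregate rows (check the real index before using)
aggregates = ['EU (hypothetical label)', 'Euro area (hypothetical label)']

# Toy data (made up) with two aggregate rows and two countries
toy = pd.DataFrame(
    {'2021': [300.0, 200.0, 80.0, 60.0], '2022': [310.0, 210.0, 82.0, 61.0]},
    index=aggregates + ['Belgium', 'Finland'],
)

# Remove the aggregate rows, then sum the remaining countries
countries = toy.drop(index=aggregates)
total_2022 = countries['2022'].sum()

print(f'Sum over countries in 2022: {total_2022:,.2f}')
```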

In [None]:
# Note: Can use the "na_values" parameter in "read_excel" to specify values that should be interpreted as nan
# df_euro = pd.read_excel(
#     'eurostat.xlsx', 
#     sheet_name = 'Sheet 1',           
#     skiprows = list(range(9)) + [10], 
#     skipfooter = 5,                   
#     na_values = ':'                  # Specify additional nan values
# )

# Note: This would have ensured that the columns are imported as numeric, so the replace step above would not be needed
# df_euro.info()
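
The effect of `na_values` can be checked without the Excel file. The sketch below uses `pd.read_csv` on an in-memory string as a stand-in for `read_excel` (both accept `na_values`); the data are made up.

```python
from io import StringIO

import pandas as pd

# Made-up file content where ":" marks a missing value
text = (
    'TIME,2021,2022\n'
    'Belgium,82.5,:\n'
    'Finland,:,81.0\n'
)

# Without na_values the ":" entries are kept as text, so the columns are not numeric
with_flag = pd.read_csv(StringIO(text), index_col='TIME')

# With na_values the ":" entries become NaN and the columns are read as floats
with_nan = pd.read_csv(StringIO(text), index_col='TIME', na_values=':')

print(with_flag.dtypes)
print(with_nan.dtypes)
```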

### 📚 Exercise 3: Labor market statistics

The file `FRED_monthly.csv` contains time series for the US economy for each month from 1948 to 2024. The column `UNRATE` is the average monthly unemployment rate. 

Import the file and store it in a variable called `df_fred`. 

In [None]:
# Import file
df_fred = pd.read_csv('FRED_monthly.csv')

# Create decade column
df_fred['Decade'] = (df_fred['Year'] // 10) * 10

df_fred.head()

In [None]:
# Check data types
df_fred.info()

In [None]:
# Check missings
df_fred.isna().sum()

In [None]:
# Check descriptives
df_fred.describe()

**Task 1**: Use the `df_fred` to calculate and print the following:
- Average unemployment rate in the data from 1948 to 2024.
- Average unemployment rate for each *decade* in the data from 1950 to 2010 for which you have all the observations.

*Hint*: The decade can be computed from the `Year` column using integer (floor) division:
```
df_fred['Year'] // 10 * 10
```

In [None]:
avg_tot = df_fred['UNRATE'].mean()
print(f'Average unemployment rate 1948-2024: {avg_tot:.2f}')

In [None]:
for dec in df_fred['Decade'].unique():
    # Keep only decades with all observations (the 1940s and 2020s are only partly covered)
    if 1950 <= dec <= 2010:
        # Filter on decade
        subset = df_fred[df_fred['Decade'] == dec]
        
        # Calculate average unemployment rate in subset
        avg_dec = subset['UNRATE'].mean()

        # Print average unemployment rate
        print(f"Average unemployment rate in the {dec}s: {avg_dec:.2f}") 
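
The same averages can also be computed without an explicit loop. A minimal sketch using `groupby`, on made-up yearly values rather than the FRED data:

```python
import pandas as pd

# Toy data (made up): two observations per decade
toy = pd.DataFrame({
    'Year': [1948, 1949, 1950, 1951, 1960, 1961],
    'UNRATE': [3.8, 6.0, 5.2, 3.3, 5.5, 6.7],
})

# Floor division drops the last digit of the year
toy['Decade'] = toy['Year'] // 10 * 10

# Average unemployment rate per decade
by_decade = toy.groupby('Decade')['UNRATE'].mean()

# Label-based slicing is inclusive, so this keeps the 1950s and 1960s only
selected = by_decade.loc[1950:1960]

print(selected)
```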

**Task 2**: Create a new `DataFrame` called `df_fred_year`, which contains the average annual unemployment rate for each year in `FRED_monthly.csv`.

In [None]:
years = df_fred['Year'].unique()
avgs = []

for year in years:

    # Filter on year
    subset = df_fred[df_fred['Year'] == year]

    # Calculate average unemployment rate in subset
    avg_year = subset['UNRATE'].mean()

    # Append average unemployment rate
    avgs.append(avg_year)

In [None]:
df_fred_year = pd.DataFrame({'Year' : years, 'UNRATE' : avgs})

df_fred_year
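
As an alternative to collecting the averages in a list, `groupby` with `as_index = False` returns the result directly as a `DataFrame` with `Year` and `UNRATE` columns. A sketch on made-up data:

```python
import pandas as pd

# Toy monthly-style data (made up): two observations per year
toy = pd.DataFrame({
    'Year': [2000, 2000, 2001, 2001],
    'UNRATE': [4.0, 4.2, 4.5, 5.5],
})

# Average per year; as_index=False keeps "Year" as a regular column
toy_year = toy.groupby('Year', as_index=False)['UNRATE'].mean()

print(toy_year)
```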