# Pandas solutions

### Exercise 1
Please recreate the table below as a DataFrame using one of the approaches detailed above:

|  | Year | Product | Cost |
| ---| :--: | :----:  | :--: |
| 0  | 2015 | Apples  | 0.35 |
| 1  | 2016 | Apples  | 0.45 |
| 2  | 2015 | Bananas | 0.75 |
| 3  | 2016 | Bananas | 1.10 |

In [None]:
import pandas as pd

# Option 1
data = [[2015, 'Apples', 0.35],
        [2016, 'Apples', 0.45],
        [2015, 'Bananas', 0.75],
        [2016, 'Bananas', 1.10]]

df = pd.DataFrame(data, columns=['Year', 'Product', 'Cost'])

df

In [None]:
# Option 2
data = [{'Year': 2015, 'Product': 'Apples', 'Cost': 0.35},
        {'Year': 2016, 'Product': 'Apples', 'Cost': 0.45},
        {'Year': 2015, 'Product': 'Bananas', 'Cost': 0.75},
        {'Year': 2016, 'Product': 'Bananas', 'Cost': 1.10}]
        
df = pd.DataFrame(data)

df

In [None]:
# Option 3
data = {'Year': [2015, 2016, 2015, 2016],
        'Product': ['Apples', 'Apples', 'Bananas', 'Bananas'],
        'Cost': [0.35, 0.45, 0.75, 1.10]}

df = pd.DataFrame(data)

df

In [None]:
# Option 4
data = {'Year': {0: 2015, 1: 2016, 2: 2015, 3: 2016},
        'Product': {0: 'Apples', 1: 'Apples', 2: 'Bananas', 3: 'Bananas'},
        'Cost': {0: 0.35, 1: 0.45, 2: 0.75, 3: 1.10}}

df = pd.DataFrame.from_dict(data)

df

In [None]:
# Option 5
data = {0: {'Year': 2015, 'Product': 'Apples', 'Cost': 0.35},
        1: {'Year': 2016, 'Product': 'Apples', 'Cost': 0.45},
        2: {'Year': 2015, 'Product': 'Bananas', 'Cost': 0.75},
        3: {'Year': 2016, 'Product': 'Bananas', 'Cost': 1.10}}
        
df = pd.DataFrame.from_dict(data, orient="index")

df

In [None]:
# Option 6
df = pd.DataFrame()
df['Year'] = [2015, 2016, 2015, 2016]
df['Product'] = ['Apples', 'Apples', 'Bananas', 'Bananas']
df['Cost'] = [0.35, 0.45, 0.75, 1.10]

df

Which approach did you prefer? Why?

### Exercise 2

Using the COVID-19 data.

Display the first 5 lines of the dataset.

In [None]:
df.head()

Show the last 15 lines

In [None]:
df.tail(15)

Check 10 random lines of the dataset

In [None]:
df.sample(10)

Make a new dataframe containing just the date, population data and number of cases and deaths

In [None]:
new_df = df[['dateRep', 'popData2020', 'cases', 'deaths']]
new_df.head()

Make a new dataframe containing just the 10th, 15th and 16th lines of the dataset. Which method did you use? Why?

In [None]:
new_df = df.iloc[[9, 14, 15]]
new_df

# iloc selects rows by position, so it is the right tool when a question asks for the "nth line".
# In this specific case loc[[9, 14, 15]] would return the same rows, because the index
# is the default one, starting at 0, so labels and positions coincide.
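
The difference between the two accessors only becomes visible once the index is not the default one. A minimal sketch on a made-up frame (the `toy` frame and its index labels are illustrative and not part of the exercise data):

```python
import pandas as pd

# Hypothetical toy frame whose index labels do not match the row positions
toy = pd.DataFrame({'cases': [10, 20, 30, 40]}, index=[3, 2, 1, 0])

by_position = toy.iloc[[0, 1]]  # first and second rows, whatever their labels
by_label = toy.loc[[0, 1]]      # rows whose index labels are 0 and 1

print(by_position['cases'].tolist())  # [10, 20]
print(by_label['cases'].tolist())     # [40, 30]
```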

### Exercise 3

Make a new dataframe with the year, month, day, number of cases and deaths, for countries in Europe.

In [None]:
new_df = df.loc[df['continentExp']=='Europe', ['year', 'month', 'day', 'cases', 'deaths']]
new_df.head()

With what you have learned so far, how would you calculate the average number of daily deaths for Austria?

In [None]:
# 1. Filter dataframe for Austria rows only
# 2. Start a counter of deaths (0)
# 3. Loop over each value in the "deaths" column of the Austrian rows
# 4. Add each value to the counter. At the end of the loop this is the total number of deaths
# 5. Get the total number of days: number of rows in the austrian dataframe
# 6. Divide total deaths by total days

austria_data = df.loc[df['countriesAndTerritories']=='Austria']

deaths_counter = 0

for value in austria_data['deaths']:
    deaths_counter += int(value)

total_days = austria_data.shape[0]

average_daily_deaths = deaths_counter/total_days

average_daily_deaths

### Exercise 4
Which country has the maximum number of deaths reported on one day?

In [None]:
df.loc[df['deaths'].idxmax(), 'countriesAndTerritories']

How many countries are there in Europe?

In [None]:
df.loc[df['continentExp']=='Europe', 'countriesAndTerritories'].nunique()

How many unique dates are in this data set?

In [None]:
df.dateRep.nunique()  # dates (days) are duplicated because there are multiple measurements per country

What is the average number of daily deaths for Norway?

In [None]:
df.loc[df['countriesAndTerritories']=='Norway', 'deaths'].mean()

### Exercise 5

Which country has the most daily deaths per capita?

In [None]:
# Scaled to deaths per 100,000 inhabitants; the scaling does not change which row is the maximum
df['deaths_per_capita'] = df['deaths'] / df['popData2020'] * 100000
df.loc[df['deaths_per_capita'].idxmax(), 'countriesAndTerritories']

What is the median daily infection rate in Europe?

In [None]:
df.loc[df['continentExp']=='Europe', 'cases'].median()

Convert the country code column to lower case.

In [None]:
df['countryterritoryCode'] = df['countryterritoryCode'].str.lower()
df.head()

Make a column called "survived", to be the opposite of the deaths column

In [None]:
df['survived'] = df['popData2020'] - df['deaths']
df.head()

### Exercise 6

What was the median number of daily cases per country?

In [None]:
df.groupby(by='countriesAndTerritories').cases.median()

How many days were there without deaths in each country?

In [None]:
df.loc[df['deaths']==0].groupby('countriesAndTerritories').dateRep.count().to_frame()

How many people were infected daily on average, for each country in each month?

In [None]:
df.groupby(['countriesAndTerritories', 'year', 'month'])['cases'].mean().to_frame()

Calculate the daily case fatality rate (CFR) per country.

In [None]:
summary = df.groupby(['countriesAndTerritories', 'dateRep'])[['deaths', 'cases']].sum()

# CFR as a percentage: deaths divided by cases, per country and day
(summary['deaths'] / summary['cases'] * 100).to_frame('CFR')
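
The ratio of deaths to cases is undefined on days with zero reported cases, so the CFR column can contain `NaN` or `inf`. One possible way to handle this, sketched on made-up numbers (the `toy` frame is hypothetical, and treating undefined ratios as missing is an assumption, not something the exercise prescribes):

```python
import numpy as np
import pandas as pd

# Hypothetical per-day totals; the second day reports no cases at all
toy = pd.DataFrame({'deaths': [2, 0, 1], 'cases': [100, 0, 50]})

cfr = toy['deaths'] / toy['cases'] * 100       # 0/0 gives NaN, x/0 would give inf
cfr = cfr.replace([np.inf, -np.inf], np.nan)   # treat every undefined ratio as missing

print(cfr)
```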

What was the infection rate for each country for the whole period?

In [None]:
# Option 1
total_cases_per_country = df.groupby(by='countriesAndTerritories')['cases'].sum()
total_pop_per_country = df.groupby('countriesAndTerritories')['popData2020'].mean()

total_cases_per_country / total_pop_per_country

In [None]:
# Option 2
df.groupby(by='countriesAndTerritories')['cases'].sum()/df.groupby('countriesAndTerritories')['popData2020'].mean()
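
Both options sum the cases but take the mean of the population. The reason is that the population figure is repeated on every row of a country, so summing it would count it once per day. A small sketch on made-up data (the countries `A` and `B` and their numbers are hypothetical):

```python
import pandas as pd

# Hypothetical data: two countries, two reporting days each
toy = pd.DataFrame({
    'countriesAndTerritories': ['A', 'A', 'B', 'B'],
    'cases': [10, 30, 5, 15],
    'popData2020': [1000, 1000, 200, 200],  # same value on every row of a country
})

grouped = toy.groupby('countriesAndTerritories')

# Total cases over the period divided by the (constant) population
rate = grouped['cases'].sum() / grouped['popData2020'].mean()

print(rate)
```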

### Exercise 7

Calculate how many people were infected and how many survived daily, on average, across Europe.

In [None]:
df[df['continentExp']=='Europe'].groupby('countriesAndTerritories').agg({'cases':'mean', 'survived':'mean'})

How many people were infected and how many survived daily, on average, for each country in each month?

In [None]:
df.groupby(['countriesAndTerritories', 'year', 'month']).agg({'cases':'mean', 'survived':'mean'})

### Exercise 8

Using Titanic data.

What proportion of the "deck" column is missing data?

In [None]:
titanic.deck.isna().mean()

How many rows don't contain any missing data at all?

In [None]:
# Option 1
(titanic.isna().sum(axis=1) == 0).sum()

In [None]:
# Option 2
mask_row_complete = titanic.isna().sum(axis=1) == 0
mask_row_complete.sum()

In [None]:
# Option 3
titanic.dropna().shape[0]

Make a dataframe with only the rows containing no missing data.

In [None]:
titanic_clean = titanic.dropna()
titanic_clean.head()

### Exercise 9

Using the following DataFrame, solve the exercises below.

In [None]:
data = pd.DataFrame({'time': [0.5, 1., 1.5, None, 2.5, 3., 3.5, None],
                     'value': [6, 4, 5, 8, None, 10, 11, None]})
data

Replace all the missing "value" rows with zeros.

In [None]:
data1 = data.copy()
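# Fill only the "value" column; the "time" column keeps its missing entries
data1['value'] = data1['value'].fillna(0)
data1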

Replace the missing "time" rows with the previous value.

In [None]:
data2 = data.copy()
data2.time = data2.time.ffill()
data2

Replace all of the missing values with the data from the next row. What do you notice when you do this with this dataset?

In [None]:
data3 = data.copy()
data3.bfill()
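
As a hint for the "what do you notice" part, counting the remaining missing values after the backward fill can help. A short sketch that rebuilds the same frame so it runs on its own (the name `filled` is only illustrative):

```python
import pandas as pd

data = pd.DataFrame({'time': [0.5, 1., 1.5, None, 2.5, 3., 3.5, None],
                     'value': [6, 4, 5, 8, None, 10, 11, None]})

filled = data.bfill()

# Interior gaps take the value of the next row; the last row has no next row to copy from
print(filled.isna().sum())
print(filled.tail(1))
```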

Linearly interpolate the missing data. What is the result for this dataset?

In [None]:
data4 = data.copy()
data4.interpolate()   # returns a new DataFrame; data4 itself is unchanged
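
To inspect the result more closely, the interpolated frame can be stored and compared with the original. A sketch using the default arguments of `interpolate` (linear method, forward direction); the name `interpolated` is only illustrative:

```python
import pandas as pd

data = pd.DataFrame({'time': [0.5, 1., 1.5, None, 2.5, 3., 3.5, None],
                     'value': [6, 4, 5, 8, None, 10, 11, None]})

interpolated = data.interpolate()

# Interior gaps land halfway between their neighbours.
# With the defaults, the trailing gap repeats the last valid value instead of extrapolating.
print(interpolated)
```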