# Assignment 3

For assignment 3 we have to load our data, make 'data management decisions' and then make three frequency tables again.

These data management decisions could include:
- removing missing data
- recoding data
- binning or grouping data

As this is exactly what I did for assignment 2, here it is again, rewritten and slightly tweaked for your reading pleasure.

In [1]:
# importing the libraries I need
import pandas as pd
import numpy as np

In [2]:
# loading the data from the local file
df = pd.read_csv('data/covid_data.csv')

#### Data Management Decision 1
Because for this assignment I want to look at the total case and death rates, it makes sense to only look at the data for the last date I have full data for. At the time of writing, the last date I had full data for was 2021-04-28, so I will limit my dataframe to that date.

I am also only interested in countries, so I will remove the grouped data for 'world', 'africa', etc.

In [3]:
df.date = pd.to_datetime(df.date)
df_latest = df[df.date == '2021-04-28']  # last date we have data for
df_latest = df_latest.dropna(subset=['continent'])  # gets rid of summaries for 'world' and 'africa' etc, as I only want data for countries
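To show what this filtering does, here is a minimal sketch on a made-up dataframe. The `toy` data is hypothetical and only mimics the shape of the real file; the assumption, as in my data, is that aggregate rows such as 'World' have no value in `continent`.

```python
import pandas as pd

# Hypothetical toy data standing in for the real file
toy = pd.DataFrame({
    'location': ['World', 'Africa', 'Kenya', 'Kenya', 'France'],
    'continent': [None, None, 'Africa', 'Africa', 'Europe'],
    'date': ['2021-04-28', '2021-04-28', '2021-04-27', '2021-04-28', '2021-04-28'],
})
toy['date'] = pd.to_datetime(toy['date'])

# keep only the chosen date
latest = toy[toy['date'] == pd.Timestamp('2021-04-28')]

# aggregate rows have no continent, so dropping them leaves only countries
countries = latest.dropna(subset=['continent'])
print(countries['location'].tolist())  # ['Kenya', 'France']
```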

#### Data Management Decision 2

The columns I want to make frequency tables for are 'total_cases_per_million', 'total_deaths_per_million' and 'human_development_index', so I will check for missing values from those columns.

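I will then restrict my current dataframe to just those columns and remove any rows which are missing data from those columns. In total, 21 rows will be removed. This is fewer than the sum of the per-column missing counts below, because some rows are missing more than one value.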

In [4]:
print(df_latest.total_cases_per_million.isnull().value_counts())
print(df_latest.total_deaths_per_million.isnull().value_counts())
print(df_latest.human_development_index.isnull().value_counts())

False    190
True       8
Name: total_cases_per_million, dtype: int64
False    182
True      16
Name: total_deaths_per_million, dtype: int64
False    185
True      13
Name: human_development_index, dtype: int64


In [5]:
cols = ['location', 'total_cases_per_million', 'total_deaths_per_million', 'human_development_index']
df_latest = df_latest[cols].dropna(subset=cols)  # removing the rows with missing data in these columns
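The number of rows removed is not simply the sum of the missing counts per column, because one row can be missing several values. A small sketch on hypothetical data (the `toy` values are invented for illustration) shows one way to check this:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row B is missing two values, row C is missing one
toy = pd.DataFrame({
    'location': ['A', 'B', 'C', 'D'],
    'total_cases_per_million': [10.0, np.nan, 30.0, 40.0],
    'total_deaths_per_million': [1.0, 2.0, np.nan, 4.0],
    'human_development_index': [0.5, np.nan, 0.7, 0.8],
})
cols = ['total_cases_per_million', 'total_deaths_per_million', 'human_development_index']

# total number of missing cells across the three columns
missing_cells = int(toy[cols].isnull().sum().sum())

# number of rows with at least one missing value
rows_with_missing = int(toy[cols].isnull().any(axis=1).sum())

cleaned = toy.dropna(subset=cols)
rows_removed = len(toy) - len(cleaned)

print(missing_cells, rows_with_missing, rows_removed)  # 3 2 2
```

Comparing the length of the dataframe before and after `dropna` is the most direct way to confirm how many rows were actually dropped.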

#### Data Management Decision 3

Because I am dealing with quantitative data here, not categorical data, a frequency table of the raw values makes no sense (almost every value is unique), so I will first bin the data into categories.

I have decided on three categories for total cases and total deaths per million people, and five categories for the Human Development Index.

In [6]:
df_latest['total_case_rate'] = pd.cut(df_latest.total_cases_per_million, 3, labels=['low', 'medium', 'high'])
df_latest['total_death_rate'] = pd.cut(df_latest.total_deaths_per_million, 3, labels=['low', 'medium', 'high'])
bins = np.linspace(0.350, 1.0, 6)  # 6 equally spaced edges, which define 5 equal-width bins
df_latest['development_rating'] = pd.cut(df_latest.human_development_index, bins, labels=['very low', 'low', 'medium', 'high', 'very high'])
df_latest

| | location | total_cases_per_million | total_deaths_per_million | human_development_index | total_case_rate | total_death_rate | development_rating |
|---|---|---|---|---|---|---|---|
| 429 | Afghanistan | 1525.110 | 67.072 | 0.511 | low | low | low |
| 1301 | Albania | 45471.888 | 829.106 | 0.795 | low | low | high |
| 1731 | Algeria | 2772.568 | 73.750 | 0.748 | low | low | high |
| 2155 | Andorra | 170167.605 | 1617.809 | 0.868 | high | medium | high |
| 2561 | Angola | 796.196 | 17.982 | 0.581 | low | low | low |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 82835 | Venezuela | 6856.076 | 73.815 | 0.711 | low | low | medium |
| 83298 | Vietnam | 29.433 | 0.360 | 0.704 | low | low | medium |
| 84147 | Yemen | 209.985 | 40.770 | 0.470 | low | low | very low |
| 84555 | Zambia | 4976.296 | 67.940 | 0.584 | low | low | low |

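To make the binning behaviour easier to see, here is a small sketch on invented values (the `values` and `hdi` series are hypothetical, not taken from my data). Passing an integer to `pd.cut` splits the range of the data into equal-width bins, so skewed data piles up in the lowest bin. Passing explicit edges gives intervals that are closed on the right, so a value exactly equal to the lowest edge is left uncategorised unless `include_lowest=True` is set.

```python
import numpy as np
import pandas as pd

# Hypothetical, deliberately skewed values
values = pd.Series([0.0, 5.0, 10.0, 20.0, 40.0, 90.0])

# an integer bin count means equal-width bins over the range of the data;
# retbins=True also returns the edges that were chosen
rate, edges = pd.cut(values, 3, labels=['low', 'medium', 'high'], retbins=True)
print(rate.tolist())  # ['low', 'low', 'low', 'low', 'medium', 'high']
print(edges)          # inner edges fall at 30 and 60

# explicit edges, as used for the Human Development Index
hdi_edges = np.linspace(0.350, 1.0, 6)
hdi = pd.Series([0.350, 0.511, 0.868])  # hypothetical values
rating = pd.cut(hdi, hdi_edges,
                labels=['very low', 'low', 'medium', 'high', 'very high'])

# 0.350 sits exactly on the lowest edge, so it falls outside the first bin
print(rating.tolist())
```

Whether the lowest edge matters depends on the data; in my case it would only affect a country with an index of exactly 0.350.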


### Frequency Tables

The frequency tables are below:

In [7]:
print('Total Case Rate Frequency Table Percentages')
# normalize=True returns proportions (fractions of 1) rather than counts
df_latest.total_case_rate.value_counts(sort=False, normalize=True)

Total Case Rate Frequency Table Percentages


low       0.751412
medium    0.225989
high      0.022599
Name: total_case_rate, dtype: float64

In [8]:
print('Total Death Rate Frequency Table Percentages')
df_latest.total_death_rate.value_counts(sort=False, normalize=True)

Total Death Rate Frequency Table Percentages


low       0.751412
medium    0.186441
high      0.062147
Name: total_death_rate, dtype: float64

In [9]:
print('Human Development Index Frequency Table Percentages')
df_latest.development_rating.value_counts(sort=False, normalize=True)

Human Development Index Frequency Table Percentages


very low     0.084746
low          0.192090
medium       0.214689
high         0.316384
very high    0.192090
Name: development_rating, dtype: float64
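The tables above show proportions between 0 and 1. If counts and percentages side by side are easier to read, something like the following sketch could be used. The `rating` series is hypothetical toy data, not my actual results.

```python
import pandas as pd

# Hypothetical toy ratings
levels = ['low', 'medium', 'high']
rating = pd.Series(pd.Categorical(['low', 'low', 'medium', 'high'],
                                  categories=levels, ordered=True))

# raw counts per category
counts = rating.value_counts(sort=False)

# proportions scaled to percentages and rounded to one decimal place
percent = (rating.value_counts(sort=False, normalize=True) * 100).round(1)

# combine both into one frequency table
table = pd.DataFrame({'count': counts, 'percent': percent})
print(table)
```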

## Summary

### Total case rates and total death rates
Around three quarters of all countries are in the low category for total cases per million people.
Around three quarters of all countries are in the low category for total deaths per million people.
From these frequency tables alone it isn't possible to see how many are in the low category for both.
Most of the remaining countries are in the medium category for both total cases and total deaths, with very few in the high category, although more countries are in the high category for deaths (6%) than for cases (2%).
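
To see how many countries fall into the low category for both variables, a cross-tabulation would be needed, which goes beyond what the assignment asks for. A minimal sketch on hypothetical data (the `toy` values are invented, so the numbers say nothing about the real dataset):

```python
import pandas as pd

levels = ['low', 'medium', 'high']

# Hypothetical toy data with the same two binned columns
toy = pd.DataFrame({
    'total_case_rate': pd.Categorical(
        ['low', 'low', 'medium', 'high', 'low'], categories=levels, ordered=True),
    'total_death_rate': pd.Categorical(
        ['low', 'medium', 'medium', 'high', 'low'], categories=levels, ordered=True),
})

# rows are case rate categories, columns are death rate categories
both = pd.crosstab(toy['total_case_rate'], toy['total_death_rate'])
print(both)

# number of countries that are low on both measures
low_both = int(both.loc['low', 'low'])
print(low_both)  # 2
```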

### Human Development Index
The countries are more evenly spread here, partly because there are more categories. The biggest category is high, with 32% of countries, and the smallest is very low, with 8% of countries. The other three categories each contain approximately 20% of countries.