# COGS 108 - Data Checkpoint

# Names

- Salma Sheriff
- Mizuki Kadowaki 
- Zoe Lederman 
- Yashaswat Malhotra


<a id='research_question'></a>
# Research Question

Does perceived citizen happiness correlate with COVID-19 outcomes?

# Dataset(s)

Dataset Name: World Happiness Report
Link to the dataset: https://www.kaggle.com/unsdsn/world-happiness
Number of observations:
Results compiled from a happiness survey administered across countries. The survey is used to assign each country a happiness score and to rank 155 countries by happiness in 2019.

Dataset Name: COVID-19 Coronavirus Complete Dataset
Link to the dataset: https://www.kaggle.com/ashudata/covid19dataset?select=COVID_Data_Basic.csv
Number of observations:

Reports on outcomes of COVID-19 (confirmed cases, deaths, etc.) in 194 unique countries between 12/31/2019 and 11/6/2020.

We plan to combine these datasets based on country. We will only use countries that are present in both datasets.
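As a rough sketch of the combination step, an inner join on the country name keeps only countries that appear in both tables. The frames `happy_toy` and `covid_toy` and all values in them are placeholders for illustration; only the column names `Country or region` and `Country` come from the datasets.

```python
import pandas as pd

# Toy stand-ins for the two datasets (all values are placeholders).
happy_toy = pd.DataFrame({
    "Country or region": ["Finland", "Denmark", "Hong Kong"],
    "Score": [7.8, 7.6, 5.4],
})
covid_toy = pd.DataFrame({
    "Country": ["Finland", "Denmark", "Andorra"],
    "Confirmed": [100, 200, 50],
})

# An inner join keeps only the countries present in both tables.
merged_toy = happy_toy.merge(
    covid_toy,
    left_on="Country or region",
    right_on="Country",
    how="inner",
)
print(merged_toy[["Country", "Score", "Confirmed"]])
```

Hong Kong and Andorra drop out here because each appears in only one of the toy tables.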


# Setup

In [1]:
import pandas as pd 
import numpy as np
import datetime

In [2]:
# import datasets
happy_2019 = pd.read_csv("data/2019.csv")
covid_basic = pd.read_csv("data/COVID_Data_Basic.csv")

# Data Cleaning

*For COVID Dataset*
1. Remove data from before January 2020 and from November 2020 onward.
2. Remove cruise ships and countries that do not have any data between January and October 2020.
3. Drop the newConfirmed, newDeath, and newRecovered columns.
4. Convert the Date column values from string type to date-time type.
5. Remove countries that are not in the Happiness dataset.
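Steps 1 through 4 could be sketched as a single helper, shown here on toy data. The function name `clean_covid`, its parameters, and the toy values are hypothetical; the date bounds mirror the filtering code in this notebook (January 1 through October 31, 2020), and the ship names are the non-country entries that appear in the COVID dataset.

```python
import pandas as pd

# Toy rows shaped like COVID_Data_Basic.csv (all values are placeholders).
raw = pd.DataFrame({
    "Country": ["Afghanistan", "Afghanistan", "Afghanistan", "Diamond Princess"],
    "Date": ["2019-12-31", "2020-03-15", "2020-11-05", "2020-03-15"],
    "Confirmed": [0, 16, 41000, 700],
    "newConfirmed": [0, 5, 60, 0],
})

def clean_covid(df, start="2020-01-01", end="2020-10-31",
                non_countries=("Cruise Ship", "Diamond Princess", "MS Zaandam")):
    """Hypothetical helper covering cleaning steps 1-4."""
    out = df.copy()
    # Step 4: parse the date strings into datetime values.
    out["Date"] = pd.to_datetime(out["Date"])
    # Step 1: keep only rows inside the date window (inclusive).
    out = out[out["Date"].between(pd.Timestamp(start), pd.Timestamp(end))]
    # Step 2: drop cruise ships, which are not countries.
    out = out[~out["Country"].isin(list(non_countries))]
    # Step 3: drop the daily-change columns if they are present.
    out = out.drop(columns=["newConfirmed", "newDeath", "newRecovered"],
                   errors="ignore")
    return out.reset_index(drop=True)

cleaned = clean_covid(raw)
print(cleaned)
```

Only the Afghanistan row from March 2020 survives in this toy example.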

In [3]:
# convert the Date column from strings to datetime values
covid_basic['Date'] = pd.to_datetime(covid_basic['Date'])

# remove all rows dated before January 2020
covid_test = covid_basic[~(covid_basic['Date'] < '2020-01-01')]

# remove all rows dated after October 31, 2020
# this was done because not all countries have updated data for November
covid_test = covid_test[~(covid_test['Date'] > '2020-10-31')]

In [5]:
#getting different properties of the original dataset, to compare with changes we make
covid_basic.head()
covid_basic.Date[0]
covid_basic.shape

(54522, 8)

In [6]:
#checking the same properties on a test dataset to see if we've made the desired changes
covid_test.head()
covid_test.Date[1]
covid_test.shape

(53316, 9)

In [7]:
#both datasets have 194 countries
len(covid_basic['Country'].unique()) == len(covid_test['Country'].unique())

True

In [8]:
#checking countries that are in covid dataset but not in happy
np.setdiff1d(covid_test['Country'] , happy_2019['Country or region'])

array(['Andorra', 'Angola', 'Antigua and Barbuda', 'Bahamas',
       'Bahamas, The', 'Barbados', 'Belize', 'Brunei', 'Burma',
       'Cabo Verde', "Cote d'Ivoire", 'Cruise Ship', 'Cuba', 'Czechia',
       'Diamond Princess', 'Djibouti', 'Dominica', 'Equatorial Guinea',
       'Eritrea', 'Eswatini', 'Fiji', 'Gambia, The', 'Grenada',
       'Guinea-Bissau', 'Guyana', 'Holy See', 'Korea, South',
       'Liechtenstein', 'MS Zaandam', 'Maldives', 'Marshall Islands',
       'Martinique', 'Monaco', 'Oman', 'Papua New Guinea',
       'Saint Kitts and Nevis', 'Saint Lucia',
       'Saint Vincent and the Grenadines', 'San Marino',
       'Sao Tome and Principe', 'Seychelles', 'Solomon Islands', 'Sudan',
       'Suriname', 'Taiwan*', 'Timor-Leste', 'Trinidad and Tobago', 'US',
       'West Bank and Gaza', 'Western Sahara'], dtype=object)

In [9]:
#checking countries that are in happiness dataset but not in covid
np.setdiff1d(happy_2019['Country or region'], covid_test['Country'])

array(['Czech Republic', 'Hong Kong', 'Ivory Coast', 'Myanmar',
       'Northern Cyprus', 'Palestinian Territories', 'South Korea',
       'Swaziland', 'Taiwan', 'Trinidad & Tobago', 'Turkmenistan',
       'United States'], dtype=object)

In [10]:
#renaming countries that are the same but entered differently
covid_test = covid_test.replace(["Czechia", "Cote d'Ivoire", "Burma", 
                    "West Bank and Gaza", "Korea, South", 
                    "Eswatini","Taiwan*","Trinidad and Tobago", "US"],
                   ['Czech Republic', 'Ivory Coast', 'Myanmar',
                    'Palestinian Territories', 'South Korea',
                    'Swaziland', 'Taiwan', 'Trinidad & Tobago',
                    'United States'])
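One alternative to two parallel lists is a dictionary that pairs each COVID-dataset spelling with its happiness-dataset spelling and is applied only to the `Country` column. This is a sketch rather than the method used in this notebook; `NAME_MAP` and `harmonize_names` are hypothetical names, while the name pairs themselves are the ones from the rename cell above.

```python
import pandas as pd

# COVID-dataset spelling -> happiness-dataset spelling (pairs from the rename cell).
NAME_MAP = {
    "Czechia": "Czech Republic",
    "Cote d'Ivoire": "Ivory Coast",
    "Burma": "Myanmar",
    "West Bank and Gaza": "Palestinian Territories",
    "Korea, South": "South Korea",
    "Eswatini": "Swaziland",
    "Taiwan*": "Taiwan",
    "Trinidad and Tobago": "Trinidad & Tobago",
    "US": "United States",
}

def harmonize_names(df, column="Country", mapping=NAME_MAP):
    """Hypothetical helper: rename countries in one column only."""
    out = df.copy()
    out[column] = out[column].replace(mapping)
    return out

# Toy frame with placeholder counts.
toy = pd.DataFrame({"Country": ["US", "Czechia", "Finland"],
                    "Confirmed": [3, 2, 1]})
print(harmonize_names(toy)["Country"].tolist())
# ['United States', 'Czech Republic', 'Finland']
```

Keeping each pair on one line makes it harder for the old and new names to fall out of alignment when the list is edited.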

In [11]:
#checking countries that are in covid dataset but not in happy
np.setdiff1d(covid_test['Country'] , happy_2019['Country or region'])

array(['Andorra', 'Angola', 'Antigua and Barbuda', 'Bahamas',
       'Bahamas, The', 'Barbados', 'Belize', 'Brunei', 'Cabo Verde',
       'Cruise Ship', 'Cuba', 'Diamond Princess', 'Djibouti', 'Dominica',
       'Equatorial Guinea', 'Eritrea', 'Fiji', 'Gambia, The', 'Grenada',
       'Guinea-Bissau', 'Guyana', 'Holy See', 'Liechtenstein',
       'MS Zaandam', 'Maldives', 'Marshall Islands', 'Martinique',
       'Monaco', 'Oman', 'Papua New Guinea', 'Saint Kitts and Nevis',
       'Saint Lucia', 'Saint Vincent and the Grenadines', 'San Marino',
       'Sao Tome and Principe', 'Seychelles', 'Solomon Islands', 'Sudan',
       'Suriname', 'Timor-Leste', 'Western Sahara'], dtype=object)

In [12]:
#checking countries that are in happiness dataset but not in covid
np.setdiff1d(happy_2019['Country or region'], covid_test['Country'])

array(['Hong Kong', 'Northern Cyprus', 'Turkmenistan'], dtype=object)

In [13]:
#removing countries that are in covid dataset and not in happiness dataset
#the result is assigned back so the removal persists in later cells
covid_test = covid_test[~covid_test['Country'].isin(['Andorra','Angola', 'Antigua and Barbuda', 'Bahamas',
       'Bahamas, The', 'Barbados', 'Belize', 'Brunei', 'Cabo Verde',
       'Cruise Ship', 'Cuba', 'Diamond Princess', 'Djibouti', 'Dominica',
       'Equatorial Guinea', 'Eritrea', 'Fiji', 'Gambia, The', 'Grenada',
       'Guinea-Bissau', 'Guyana', 'Holy See', 'Liechtenstein',
       'MS Zaandam', 'Maldives', 'Marshall Islands', 'Martinique',
       'Monaco', 'Oman', 'Papua New Guinea', 'Saint Kitts and Nevis',
       'Saint Lucia', 'Saint Vincent and the Grenadines', 'San Marino',
       'Sao Tome and Principe', 'Seychelles', 'Solomon Islands', 'Sudan',
       'Suriname', 'Timor-Leste', 'Western Sahara'])]
covid_test

|  | Unnamed: 0 | Country | Date | Confirmed | Death | Recovered | newConfirmed | newDeath | newRecovered |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 2 | Afghanistan | 2020-01-01 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 3 | Afghanistan | 2020-01-02 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | Afghanistan | 2020-01-03 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | Afghanistan | 2020-01-04 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 6 | Afghanistan | 2020-01-05 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 54511 | 55090 | Zimbabwe | 2020-10-27 | 8315 | 242 | 7804 | 12 | 0 | 7 |
| 54512 | 55091 | Zimbabwe | 2020-10-28 | 8320 | 242 | 7845 | 5 | 0 | 41 |
| 54513 | 55092 | Zimbabwe | 2020-10-29 | 8349 | 242 | 7864 | 29 | 0 | 19 |
| 54514 | 55093 | Zimbabwe | 2020-10-30 | 8362 | 242 | 7884 | 13 | 0 | 20 |
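Instead of a hard-coded list, the set of shared countries can be computed from the two tables and used to filter both. The sketch below is one possible approach, not the one used in this notebook; the toy frames and the names `shared`, `covid_shared`, and `happy_shared` are illustrative.

```python
import pandas as pd

# Toy stand-ins with placeholder counts.
happy_toy = pd.DataFrame({"Country or region": ["Finland", "Zimbabwe", "Hong Kong"]})
covid_toy = pd.DataFrame({
    "Country": ["Finland", "Zimbabwe", "Andorra", "MS Zaandam"],
    "Confirmed": [10, 20, 30, 40],
})

# Countries that appear in both tables.
shared = set(happy_toy["Country or region"]) & set(covid_toy["Country"])

# Keep only the shared countries in each table.
covid_shared = covid_toy[covid_toy["Country"].isin(shared)]
happy_shared = happy_toy[happy_toy["Country or region"].isin(shared)]

print(sorted(shared))
```

Because the filter is derived from the data, it stays correct if the name mapping changes or a dataset is updated.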


In [16]:
#removing countries that are in happiness dataset and not in covid dataset
#the result is assigned back so the removal persists in later cells
happy_2019 = happy_2019[~happy_2019['Country or region'].isin(['Hong Kong', 'Northern Cyprus', 'Turkmenistan'])]
happy_2019

|  | Overall rank | Country or region | Score | GDP per capita | Social support | Healthy life expectancy | Freedom to make life choices | Generosity | Perceptions of corruption |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Finland | 7.769 | 1.340 | 1.587 | 0.986 | 0.596 | 0.153 | 0.393 |
| 1 | 2 | Denmark | 7.600 | 1.383 | 1.573 | 0.996 | 0.592 | 0.252 | 0.410 |
| 2 | 3 | Norway | 7.554 | 1.488 | 1.582 | 1.028 | 0.603 | 0.271 | 0.341 |
| 3 | 4 | Iceland | 7.494 | 1.380 | 1.624 | 1.026 | 0.591 | 0.354 | 0.118 |
| 4 | 5 | Netherlands | 7.488 | 1.396 | 1.522 | 0.999 | 0.557 | 0.322 | 0.298 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 151 | 152 | Rwanda | 3.334 | 0.359 | 0.711 | 0.614 | 0.555 | 0.217 | 0.411 |
| 152 | 153 | Tanzania | 3.231 | 0.476 | 0.885 | 0.499 | 0.417 | 0.276 | 0.147 |
| 153 | 154 | Afghanistan | 3.203 | 0.350 | 0.517 | 0.361 | 0.000 | 0.158 | 0.025 |
| 154 | 155 | Central African Republic | 3.083 | 0.026 | 0.000 | 0.105 | 0.225 | 0.235 | 0.035 |


In [15]:
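#make month column to make it easy to sum by month
covid_test['Month'] = covid_test['Date'].dt.month

#code for adding all the data for each country by month
#numeric_only=True skips the Date column, which cannot be summed
#note: Confirmed, Death, and Recovered appear to be running totals, so their
#monthly sums add up daily cumulative counts; the new* columns sum to monthly totals
covid_test = covid_test.groupby(['Country','Month']).sum(numeric_only=True)
covid_test.head()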

| Country | Month | Unnamed: 0 | Confirmed | Death | Recovered | newConfirmed | newDeath | newRecovered |
|---|---|---|---|---|---|---|---|---|
| Afghanistan | 1 | 527 | 0 | 0 | 0 | 0 | 0 | 0 |
| Afghanistan | 2 | 1363 | 6 | 0 | 0 | 1 | 0 | 0 |
| Afghanistan | 3 | 776549 | 1219 | 29 | 26 | 173 | 4 | 5 |
| Afghanistan | 4 | 2568000 | 27237 | 860 | 2927 | 1997 | 60 | 255 |
| Afghanistan | 5 | 2100611 | 225655 | 4994 | 24129 | 13034 | 193 | 1068 |
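The daily rows suggest that `Confirmed` is a running total while `newConfirmed` is the daily change (for Zimbabwe, 8315 + 5 = 8320). Under that assumption, a monthly summary could take the last cumulative value in each month and sum only the daily changes. The sketch below uses the four Zimbabwe rows shown earlier in this notebook; it is an alternative to consider, not the aggregation the notebook performs.

```python
import pandas as pd

# Four daily rows for Zimbabwe, copied from the output shown earlier.
toy = pd.DataFrame({
    "Country": ["Zimbabwe"] * 4,
    "Date": pd.to_datetime(["2020-10-27", "2020-10-28", "2020-10-29", "2020-10-30"]),
    "Confirmed": [8315, 8320, 8349, 8362],
    "newConfirmed": [12, 5, 29, 13],
})
toy["Month"] = toy["Date"].dt.month

# Last cumulative value per month, and the sum of daily changes per month.
monthly = (
    toy.sort_values("Date")
       .groupby(["Country", "Month"])
       .agg(Confirmed=("Confirmed", "last"),
            newConfirmed=("newConfirmed", "sum"))
)
print(monthly)
```

For these four rows the result is a cumulative count of 8362 and 59 new cases, whereas summing `Confirmed` directly would give 33346.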


In [17]:
#dropping the "Unnamed: 0" column, which contained the row index from before the data set was altered
covid_test = covid_test.drop(columns = 'Unnamed: 0')

In [21]:
covid_test.head()
covid_test.shape

#renaming covid dataset for ease
covid_clean = covid_test

# Project Proposal (updated)

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 1/20  |  1 PM | Read & Think about COGS 108 expectations; brainstorm topics/questions  | Determine best form of communication; Discuss and decide on final project topic; discuss hypothesis; begin background research | 
| 1/26  |  10 AM |  Do background research on topic | Discuss ideal dataset(s) and ethics; draft project proposal | 
| 2/1  | 10 AM  | Edit, finalize, and submit proposal; Search for datasets  | Discuss Wrangling and possible analytical approaches; Assign group members to lead each specific part   |
| 2/4  | 6 PM  | Import & Wrangle Data; EDA | Review/Edit wrangling/EDA; Discuss Analysis Plan   |
| 2/11  | 6 PM  | Finalize wrangling/EDA; Begin Analysis | Discuss/edit Analysis; Complete project check-in |
| 2/18  | 6 PM  | Complete analysis; Draft results/conclusion/discussion (Wasp)| Discuss/edit full project |
| 2/25  | Before 11:59 PM  | NA | Turn in Final Project & Group Project Surveys |