# COGS 108 - Final Project 

## Permissions

Place an `X` in the appropriate bracket below to specify if you would like your group's project to be made available to the public. (Note that PIDs will be scraped from the public submission, but student names will be included.)

* [  ] YES - make available
* [ x ] NO - keep private

# Overview

*Fill in your overview here*

# Names

- Yang Li
- Yiou Lyu
- Linfeng Hu
- Ruby Celeste Marroquin 

# Group Members IDs

- A15560579
- A15930345
- A15473121
- A16094382

# Research Question

How does the regional economic status of each province in mainland China correlate with that province's COVID-19 outbreak and recovery?

## Background and Prior Work

*Fill in your background and prior work here* 

References (include links):
- 1)
- 2)

# Hypothesis


*Fill in your hypotheses here*

# Dataset(s)

(Copy this information for each dataset)
- Dataset Name: 
- Link to the dataset:
- Number of observations:

1-2 sentences describing each dataset. 

If you plan to use multiple datasets, add 1-2 sentences about how you plan to combine these datasets.

# Setup

In [1]:
import pandas as pd
import json
import codecs
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

import os
import patsy
import scipy.stats as stats

import bs4
from bs4 import BeautifulSoup

# Data Cleaning

In [2]:
#with codecs.open('Data/virus.json', 'r', 'utf-8') as data_file:
    #data_teacher = json.load(data_file, 'utf-8')

#topic[worksheet] = data_teacher[worksheetID]['Topic']
#out = codecs.open('Worksheet.csv', 'w', 'utf-8')
#out.write(topic[worksheet])

To clean our data, we load each dataset into a pandas DataFrame, drop irrelevant information and outliers, and rename columns to simplify later analyses.

First, we clean the datasets that describe economic status.

This is the income per capita value for each province in mainland China. Income is measured in yuan. 

In [3]:
Income = pd.read_csv('Data/Income.csv')
Income = Income.dropna(axis=1, how='all')
Income.head()

Unnamed: 0,Region,2018,2017,2016,2015,2014,2013
0,Beijing,62361.22,57229.83,52530.38,48457.99,44488.57,40830.04
1,Tianjin,39506.15,37022.33,34074.46,31291.36,28832.29,26359.2
2,Hebei,23445.65,21484.13,19725.42,18118.09,16647.4,15189.64
3,Shanxi,21990.14,20420.01,19048.88,17853.67,16538.32,15119.72
4,Inner Mongolia,28375.65,26212.23,24126.64,22310.09,20559.34,18692.89


This is the per capita Gross Regional Product value for each province. GRP per capita is measured in yuan.

In [4]:
GRP = pd.read_csv('Data/GRP.csv')
GRP = GRP.dropna(axis=1, how='all')
GRP.head()

Unnamed: 0,Region,2018,2017,2016,2015,2014,2013,2012,2011,2010,2009
0,Beijing,140211,128994,118198,106497,99995,94648,87475,81658,73856,66940
1,Tianjin,120711,118944,115053,107960,105231,100105,93173,85213,72994,62574
2,Hebei,47772,45387,43062,40255,39984,38909,36584,33969,28668,24581
3,Shanxi,45328,42060,35532,34919,35070,34984,33628,31357,26283,21522
4,Inner Mongolia,68302,63764,72064,71101,71046,67836,63886,57974,47347,39735


Next, we clean the datasets related to population density.

Population per province is given in units of 10,000 persons. It includes all residents (permanent and temporary, rural and urban) at the end of each year.

In [5]:
population = pd.read_csv('Data/Population.csv')
population = population.dropna(axis = 1, how = 'all')
population.head()

Unnamed: 0,Region,2018,2017,2016,2015,2014,2013,2012,2011,2010,2009
0,Beijing,2154,2171,2173,2171,2152,2115,2069,2019,1962,1860.0
1,Tianjin,1560,1557,1562,1547,1517,1472,1413,1355,1299,1228.0
2,Hebei,7556,7520,7470,7425,7384,7333,7288,7241,7194,7034.0
3,Shanxi,3718,3702,3682,3664,3648,3630,3611,3593,3574,3427.0
4,Inner Mongolia,2534,2529,2520,2511,2505,2498,2490,2482,2472,2458.0


To calculate the population density of a region, we also need the area of each province. Here, the area of each province is measured in square kilometers.

Since we only need the area of each individual region, we drop the "Toal" row at the end, which contains the total area of China (judging by the data it contains, the row name is a typo for "Total").

In [6]:
area = pd.read_csv('Data/Area.csv')
area = area.dropna(axis = 1, how = 'all')
# shorten the column name to simplify later analysis
area = area.rename(columns={"Area (sq.km)": "Area"})
area = area[area.District != 'Toal']
area.head()

Unnamed: 0,District,Area,proportion
0,Shanghai,8359,0.09%
1,Tianjin,11917,0.13%
2,Beijing,16406,0.17%
3,Hainan,35177,0.37%
4,Ningxia,51893,0.55%
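
Because population is stored in units of 10,000 persons and area in square kilometers, a density calculation needs a unit conversion as well as a join on the region name. The following is a minimal sketch of that step, not the project's final method. The helper `population_density` is hypothetical, the two small frames only reuse the Beijing and Tianjin rows from the previews above, and it assumes the `Region` and `District` columns spell province names identically.

```python
import pandas as pd

# Illustrative rows copied from the previews above; the real tables have one row per province.
population = pd.DataFrame({"Region": ["Beijing", "Tianjin"], "2018": [2154, 1560]})
area = pd.DataFrame({"District": ["Beijing", "Tianjin"], "Area": [16406, 11917]})


def population_density(population, area, year="2018"):
    """Return persons per square kilometer for each region (hypothetical helper)."""
    merged = population[["Region", year]].merge(
        area[["District", "Area"]],
        left_on="Region",
        right_on="District",
        how="inner",  # assumption: regions without a matching area row are dropped
    )
    # Population is stored in units of 10,000 persons, so convert to persons first.
    merged["Density"] = merged[year] * 10_000 / merged["Area"]
    return merged[["Region", "Density"]]


density = population_density(population, area)
print(density)
```

An inner join silently drops provinces whose names differ between the two files, so comparing the row count of the result with the row count of `population` is a cheap check before relying on the output.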


In [7]:
# read virus data into dataframes 

list_of_virus_data = list()

# append data between Feb 1 and Feb 25 to list
for i in range(20200201,20200226): 
    path = './Data/virus/' + str(i) + '.csv'
    list_of_virus_data.append(pd.read_csv(path))
    
# File 20200226.csv is missing, reason unknown. 

    
# append data between Feb 27 and Feb 29 to list
for i in range(20200227,20200230): 
    path = './Data/virus/' + str(i) + '.csv'
    list_of_virus_data.append(pd.read_csv(path))

# append data for Mar 1 to list
for i in range(20200301,20200302): 
    path = './Data/virus/' + str(i) + '.csv'
    list_of_virus_data.append(pd.read_csv(path))
    
print('number of dataframes for virus: ',len(list_of_virus_data))

# access the ith element in the list using list_of_virus_data[i]
# for example list_of_virus_data[0] gives the first dataframe


number of dataframes for virus:  29
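
The three integer ranges above work because the gap and the month boundary are handled by hand. As an alternative sketch, the file names can be generated from calendar dates so that missing days are reported rather than hard-coded. The helper `daily_csv_paths` is hypothetical and assumes one file per day named `YYYYMMDD.csv`, as in the loading cell.

```python
from datetime import date, timedelta
from pathlib import Path


def daily_csv_paths(directory, start, end):
    """Split the dates from start to end (inclusive) into existing and missing files.

    Hypothetical helper: assumes one file per day named YYYYMMDD.csv.
    """
    found, missing = [], []
    day = start
    while day <= end:
        path = Path(directory) / f"{day:%Y%m%d}.csv"
        # Sort each expected path into the list that matches its state on disk.
        (found if path.exists() else missing).append(path)
        day += timedelta(days=1)
    return found, missing


found, missing = daily_csv_paths("./Data/virus", date(2020, 2, 1), date(2020, 3, 1))
print("files found:", len(found), "missing:", [p.name for p in missing])
# The frames could then be loaded with: [pd.read_csv(p) for p in found]
```

Keeping the paths in date order preserves the positional indexing used in the cleaning cells, as long as the same files are present.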


## Start cleaning virus data

### Clean the 0th and 1st df in the list

In [8]:
# Clean the 0th and 1st df in the list
for i in range(0,2):
    # get the df of the ith day
    df = list_of_virus_data[i]
    # use the first data row as column names
    df.columns = df.iloc[0]
    # drop the first row, because it was used as the header
    df = df.drop(0)
    # drop the column 1, because it is irrelevant
    df = df.drop(1, axis=1)
    # save cleaned data to list_of_virus_data 
    list_of_virus_data[i] = df

### Clean the 2nd df in the list

In [9]:
# Clean the 2nd df in the list
# get the df at index 2
df = list_of_virus_data[2]
# reset column names
df.columns = ["Province/Region/City", "Confirmed Cases", 1]
# drop the meaningless 1 column, keep "Confirmed Cases" and "Province/Region/City"
df = df.drop(1, axis=1)
# drop the last row, because it is a comment instead of data
df = df.drop(df.shape[0] - 1)
# save cleaned data to list_of_virus_data 
list_of_virus_data[2] = df

### Clean the 3rd to 10th df in the list

In [10]:
# Clean the 3rd to 10th df in the list
for i in range(3,11):
    # get the df of the ith day
    df = list_of_virus_data[i]
    # use the first data row as column names
    df.columns = df.iloc[0]
    # drop the first row, because it was used as the header
    df = df.drop(0)
    # drop the column 1, because it is irrelevant
    df = df.drop(1, axis=1)
    # save cleaned data to list_of_virus_data
    list_of_virus_data[i] = df
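
The loops for frames 0 to 1 and 3 to 10 apply the same three steps: promote the first row to the header, drop that row, and drop the unneeded column. One way to avoid repeating them is a small helper, sketched below. The name `promote_first_row` is hypothetical, and the sample frame is an assumed layout for the raw daily files that reuses two values from the printed preview.

```python
import pandas as pd


def promote_first_row(df, drop_columns=()):
    """Use the first data row as the header, then drop that row and any unwanted columns.

    Hypothetical helper mirroring the per-day loops above.
    """
    cleaned = df.copy()  # leave the caller's frame untouched
    cleaned.columns = cleaned.iloc[0]
    cleaned = cleaned.drop(cleaned.index[0])
    # errors="ignore" skips labels that are absent from a given day's file.
    return cleaned.drop(columns=list(drop_columns), errors="ignore")


# Assumed raw layout: the header text sits in row 0 and a spare column holds the value 1.
raw = pd.DataFrame(
    [
        ["Confirmed Cases", "Province/Region/City", 1],
        [7153, "Hubei", 1],
        [599, "Zhejiang", 1],
    ]
)
clean = promote_first_row(raw, drop_columns=[1])
print(clean)
```

With such a helper, each loop body would reduce to a single assignment such as `list_of_virus_data[i] = promote_first_row(list_of_virus_data[i], drop_columns=[1])`.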

In [11]:
# View virus data of the first 11 days
# Delete this cell
for i in range(0, 11):
    print(list_of_virus_data[i])
    print("\n")
    print("\n")
    print("\n")

0  Confirmed Cases Province/Region/City
1             7153                Hubei
2              599             Zhejiang
3              520            Guangdong
4              422                Henan
5              389                Hunan
6              297                Anhui
7              286              Jiangxi
8              238            Chongqing
9              207              Sichuan
10             202              Jiangsu
11             202             Shandong
12             156              Beijing
13             153             Shanghai
14             144               Fujian
15             101              Shaanxi
16             100              Guangxi
17              96                Hebei
18              91               Yunnan
19              80         Heilongjiang
20              60             Liaoning
21              57               Hainan
22              47               Shanxi
23              35                Gansu
24              34              Tianjin


In [12]:
# Clean the 11th to 28th df in the list
# TODO

# Data Analysis & Results

Include cells that describe the steps in your data analysis.

In [13]:
## YOUR CODE HERE
## FEEL FREE TO ADD MULTIPLE CELLS PER SECTION

# Ethics & Privacy

*Fill in your ethics & privacy discussion here*

# Conclusion & Discussion

*Fill in your discussion information here*

# Team Contributions

*Specify who in your group worked on which parts of the project.*