- Team: Hangyu Zhou (hz477) , Evian Liu (yl2867) , YingYun Zhang (yz549) , Yang Zhao (yz563)
- Github repository link: https://github.coecis.cornell.edu/yl2867/INFO2950_Project.git
- Rubric: https://docs.google.com/document/d/1W3mPBOMhM9SD3oym2LG3NLyB78jG-Kmv9egM3B_EObg/edit

# Phase IV:  #
#### Due Nov 23. ####
- Submit an executed Jupyter notebook (.ipynb) file on CMS, with all of the required elements, as detailed in the deliverables section above. Include a “Questions for reviewers” section at the end of your submission, listing specific questions for your peer reviewers and project mentor to answer in giving you feedback on this phase.
- We often find that the moment we "finish" a project is also the time when we have the most ideas about how to continue it. The goal of this phase is to create a version of your project that could be complete, but with enough time remaining that you can revisit your analysis, fill in gaps, and continue logical extensions.
- You will provide peer review for other groups' submissions.

## 1. Research question(s) ## 
State your research question(s) clearly.

By comparing metropolitan cities on the East and West Coasts, we examine how crime rates and employment in various industries respond to the intensity of the COVID pandemic.

## 2. Data cleaning ## 
Have an initial draft of your data cleaning appendix.
Document every step that takes your raw data file(s) and turns it into the analysis-ready data set that you would submit with your final project. 
All of your data cleaning code should be found in this section, and you may want to explain the steps of your data cleaning in words as well.

#### Final Draft Requirement ####

- Data cleaning description. Submit an updated version of your data cleaning description from phase II that describes all data cleaning steps performed on your raw data to turn it into the analysis-ready dataset submitted with your final project. The data cleaning description should be a separate Jupyter notebook with executed cells, and it should output the dataset you submit as part of your project (e.g. written as a .csv file).

### IMPORT DATA ###

In [1]:
# import packages
import pandas as pd
import numpy as np
import datetime
from matplotlib import pyplot as plt

In [2]:
# initialize final dataset
final_data = pd.DataFrame() 
time_lst = pd.date_range(start='1/1/2020', periods=9, freq='M').tolist()
final_data['Period'] = time_lst
final_data

Unnamed: 0,Period
0,2020-01-31
1,2020-02-29
2,2020-03-31
3,2020-04-30
4,2020-05-31
5,2020-06-30
6,2020-07-31
7,2020-08-31
8,2020-09-30


In [3]:
# Data cleaning helper function
def DataCleaning(dataname, colname):
    """
    Requires dataname: path to a CSV file with exactly 2 data rows and month columns from "Jan" to "Dec"
    Effect: adds column <colname> to the global final_data, holding the Jan-Sep values of the second row
    Returns: None
    """
    # import data
    temp_data = pd.read_csv(dataname)
    
    # keep only the second row's Jan-Sep values and add them to the combined data
    final_data[colname] = pd.Series(temp_data.loc[1, "Jan":"Sep"].tolist())
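
The helper above writes into the global `final_data` and returns nothing. A minimal sketch of a side-effect-free variant is shown below, assuming the same two-row layout with month columns. The name `extract_monthly`, its parameters, and the toy CSV are hypothetical and not part of the original notebook.

```python
import io
import pandas as pd

def extract_monthly(source, first="Jan", last="Sep", row=1):
    """Return the values of row `row` between columns `first` and `last` as a Series."""
    temp_data = pd.read_csv(source)
    # Label-based column slicing with .loc is inclusive on both ends
    values = temp_data.loc[row, first:last]
    # Coerce to numbers and reset the index so it lines up with a 0..8 row index
    return pd.to_numeric(values, errors="coerce").reset_index(drop=True)

# Toy two-row file standing in for a real input such as "NYC_unemployment.csv" (values are made up)
toy_csv = io.StringIO(
    "Series,Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct\n"
    "labor_force,100,100,100,100,100,100,100,100,100,100\n"
    "unemployment_rate,1,2,3,4,5,6,7,8,9,10\n"
)
monthly = extract_monthly(toy_csv)
print(monthly.tolist())
```

With such a helper, a column could be added explicitly, for example `final_data["UnemplNY"] = extract_monthly("NYC_unemployment.csv")`, which keeps the assignment visible in the calling cell.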

### a. Covid ###
- The imported datasets were manually edited to condense and format them for further data cleaning, including:
    - the date column was manually formatted to a consistent date format
    - the remaining columns were converted to numeric data types
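
The manual edits listed above could also be done in pandas, which would make them reproducible. The following is only a sketch under assumptions: the column names follow `NYC_covid.csv`, but the `%m/%d/%Y` date format, the thousands separators, and the toy values are hypothetical.

```python
import pandas as pd

# Toy frame standing in for the raw export (values and string formats are made up)
raw = pd.DataFrame({
    "DATE_OF_INTEREST": ["02/29/2020", "03/01/2020", "03/02/2020"],
    "CASE_COUNT": ["1", "0", "1,000"],
})

# Parse dates with an explicit format; entries that do not match become NaT
raw["DATE_OF_INTEREST"] = pd.to_datetime(
    raw["DATE_OF_INTEREST"], format="%m/%d/%Y", errors="coerce"
)

# Strip thousands separators, then convert to numbers; unparseable cells become NaN
raw["CASE_COUNT"] = pd.to_numeric(
    raw["CASE_COUNT"].str.replace(",", "", regex=False), errors="coerce"
)

print(raw.dtypes)
```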

In [4]:
# import the manually edited NY data
covid_ny = pd.read_csv("NYC_covid.csv")

In [5]:
# convert the date column to datetime
covid_ny['DATE_OF_INTEREST'] = pd.to_datetime(covid_ny['DATE_OF_INTEREST'], utc=True)

# extract the month as a two-digit string, used below to group rows by month
covid_ny['months'] = covid_ny['DATE_OF_INTEREST'].dt.strftime('%m')

In [6]:
# Obtain monthly totals, and adjust rows for subset data
covidNY_sum = covid_ny.groupby('months').sum(numeric_only=True)   # sum daily counts within each month

new_row = pd.DataFrame({'CASE_COUNT':0, 'HOSPITALIZED_COUNT':0, 'DEATH_COUNT':0}, index =[0])
covidNY_sum = pd.concat([new_row, covidNY_sum]).reset_index(drop = True)  # add Jan data (no recorded cases)
covidNY_sum = covidNY_sum.loc[0:8]  # keep Jan-Sep (rows 0-8); drop incomplete October data

In [7]:
# add monthly case and death totals to the combined data
final_data["TotalCasesNY"] = covidNY_sum['CASE_COUNT']
final_data["TotalDeathsNY"] = covidNY_sum['DEATH_COUNT']
# months with zero cases give 0/0 = NaN, which is replaced with 0
final_data["DeathRateNY"] = (final_data['TotalDeathsNY'] / final_data['TotalCasesNY']).fillna(0)

In [8]:
# import the manually edited LA data
covid_la = pd.read_csv("LA_county_covid.csv")

In [9]:
# convert the date column to datetime
covid_la['date_use'] = pd.to_datetime(covid_la['date_use'], utc=True)

# extract the month as a two-digit string, used below to group rows by month
covid_la['months'] = covid_la['date_use'].dt.strftime('%m')

In [10]:
# Obtain monthly totals, and adjust rows for subset data
covidLA_sum = covid_la.groupby('months').sum(numeric_only=True)   # sum daily counts within each month
covidLA_sum = covidLA_sum[['new_case', 'new_deaths']]  

feb_row = pd.DataFrame({'new_case':147, 'new_deaths':0}, index =[0])
covidLA_sum = pd.concat([feb_row, covidLA_sum]).reset_index(drop = True)  # add Feb data
jan_row = pd.DataFrame({'new_case':0, 'new_deaths':0}, index =[0])
covidLA_sum = pd.concat([jan_row, covidLA_sum]).reset_index(drop = True)  # add Jan data
covidLA_sum = covidLA_sum.loc[0:8]               # keep Jan-Sep (rows 0-8); drop incomplete October data

In [11]:
# add monthly case and death totals to the combined data
final_data["TotalCasesLA"] = covidLA_sum['new_case']
final_data["TotalDeathsLA"] = covidLA_sum['new_deaths']
final_data["DeathRateLA"] = (final_data['TotalDeathsLA'] / final_data['TotalCasesLA']).fillna(0)
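
The monthly aggregation above groups on a two-digit month string and then prepends rows by hand. A minimal alternative sketch is shown below, assuming daily records with a parsed date column. It groups by calendar month (year and month together) and reindexes onto the study period, so missing months are filled and October is dropped in one step. The toy data and the names `daily` and `study_months` are hypothetical. Filling with 0 would not reproduce the manually entered LA February figure of 147, which would still need to be set explicitly.

```python
import pandas as pd

# Toy daily records (values are made up)
daily = pd.DataFrame({
    "date": pd.to_datetime(["2020-02-29", "2020-03-01", "2020-03-02", "2020-10-01"]),
    "CASE_COUNT": [1, 2, 3, 4],
})

# Sum daily counts within each calendar month
monthly = daily.groupby(daily["date"].dt.to_period("M"))["CASE_COUNT"].sum()

# Reindex onto Jan-Sep 2020: months without records become 0, October is dropped
study_months = pd.period_range("2020-01", "2020-09", freq="M")
monthly = monthly.reindex(study_months, fill_value=0)

print(monthly)
```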

### b. Jail ###
- The imported datasets were manually edited to condense and format them for further data cleaning, including:
    - removing historical data prior to 2019
    - removing analysis cells other than the data records
    - removing unnamed columns
    - converting the data types to numbers
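
Some of the manual steps above can be sketched in pandas. This example assumes a layout with one label column (hypothetically named `Measure`) and month columns whose headers look like `YY-Mon`, as in `20-Jan`. The `18-Dec` value is made up.

```python
import io
import pandas as pd

# Toy export standing in for the raw jail spreadsheet; trailing commas mimic empty columns
raw_csv = io.StringIO(
    "Measure,18-Dec,20-Jan,20-Feb,,\n"
    "Total population,8000,5544,5356,,\n"
)
raw = pd.read_csv(raw_csv)

# Drop columns that pandas auto-named "Unnamed: N" because their header cell was empty
raw = raw.loc[:, ~raw.columns.str.startswith("Unnamed")]

# Keep the label column plus columns from 2019 onward (two-digit year before the hyphen)
keep = [c for c in raw.columns if c == "Measure" or int(c.split("-")[0]) >= 19]
clean = raw[keep]

print(clean)
```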

In [12]:
# import the manually edited NY data
jail_ny = pd.read_csv("NYC_jail_pplt.csv")

In [13]:
# add only the total jail population to the combined data
# (the slice ends at "20-Aug", so the September row is left as NaN)
final_data["JailPpltNY"] = pd.Series(jail_ny.loc[0, "20-Jan":"20-Aug"].tolist())
#final_data["JailMonthChgNY"] = final_data["JailPpltNY"].pct_change()

In [14]:
# import the manually edited LA data
jail_la = pd.read_csv("LAcounty_jail_pplt.csv")

In [15]:
# add only the total jail population to the combined data
final_data["JailPpltLA"] = jail_la.loc[4, "January,2020":"Sept, 2020"].tolist()
#final_data["JailMonthChgLA"] = final_data["JailPpltLA"].pct_change()

### c. Unemployment ###
- To use the helper function, the dataset was manually edited into the format it requires.

In [16]:
# import and clean NY data by calling helper function
DataCleaning("NYC_unemployment.csv", "UnemplNY")

In [17]:
# import and clean LA data by calling helper function
DataCleaning("LA_unemployment.csv", "UnemplLA")

### d. Weekly working Hours ###
- To use the helper function, the dataset was manually edited into the format it requires.

In [18]:
# import and clean data by calling helper function
DataCleaning("NYC_weekly_hours.csv", "WorkHrNY")
DataCleaning("LA_weekly_hours.csv", "WorkHrLA")

### e. Computer services employee ###
- To use the helper function, the dataset was manually edited into the format it requires.

In [19]:
# import and clean data by calling helper function
DataCleaning("NYC_computer.csv", "CompEmplNY")
DataCleaning("LA_computer.csv", "CompEmplLA")

### f. Local government employee ###
- To use the helper function, the dataset was manually edited into the format it requires.

In [20]:
# import and clean data by calling helper function
DataCleaning("NYC_localgovernment.csv", "GovEmplNY")
DataCleaning("LA_localgovernment.csv", "GovEmplLA")

### g. Hospital employee ###
- To use the helper function, the dataset was manually edited into the format it requires.

In [21]:
# import and clean data by calling helper function
DataCleaning("NYC_hospitals.csv", "HospEmplNY")
DataCleaning("LA_hospitals.csv", "HospEmplLA")

### h. Financial activities employee ###
- To use the helper function, the dataset was manually edited into the format it requires.

In [22]:
# import and clean data by calling helper function
DataCleaning("NYC_financialactivities.csv", "FinEmplNY")
DataCleaning("LA_financialactivities.csv", "FinEmplLA")

### i. Educational services employee ###
- To use the helper function, the dataset was manually edited into the format it requires.

In [23]:
# import and clean data by calling helper function
DataCleaning("NYC_educationalservices.csv", "EduEmplNY")
DataCleaning("LA_educationalservices.csv", "EduEmplLA")

### j. Crime ###
- The imported datasets were manually edited to condense and format them for further data cleaning, including:
    - unused rows and columns were removed to reduce file size
    - data types were made consistent
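
Dropping unused columns can also be done at load time instead of by hand. The sketch below uses the `usecols` parameter of `pd.read_csv` and then counts records per month. The column `CMPLNT_FR_DT` comes from this notebook, while the other column names, the date format, and the toy rows are assumptions for illustration.

```python
import io
import pandas as pd

# Toy file standing in for the raw crime export (rows are made up)
raw_csv = io.StringIO(
    "CMPLNT_NUM,CMPLNT_FR_DT,OFNS_DESC\n"
    "1,01/15/2020,A\n"
    "2,01/20/2020,B\n"
    "3,02/03/2020,A\n"
)

# Load only the column needed for monthly counts
crime = pd.read_csv(raw_csv, usecols=["CMPLNT_FR_DT"])

# Parse dates with an explicit format; malformed entries become NaT
crime["CMPLNT_FR_DT"] = pd.to_datetime(
    crime["CMPLNT_FR_DT"], format="%m/%d/%Y", errors="coerce"
)

# Count records per month number, ordered from January onward
monthly_counts = crime["CMPLNT_FR_DT"].dt.month.value_counts().sort_index()

print(monthly_counts)
```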

In [24]:
crime_ny = pd.read_csv("NYC_crime.csv")

In [25]:
# convert the date column to datetime
crime_ny['CMPLNT_FR_DT'] = pd.to_datetime(crime_ny['CMPLNT_FR_DT'], utc=True)

# extract the month as a two-digit string, used below to group rows by month
crime_ny['months'] = crime_ny['CMPLNT_FR_DT'].dt.strftime('%m')

In [26]:
# obtain monthly counts
crimeNY_sum = crime_ny.groupby('months').count()
crimeNY_sum = crimeNY_sum['CMPLNT_FR_DT']

final_data["CrimeNY"] = pd.Series(crimeNY_sum.tolist())

In [27]:
crime_la = pd.read_csv("LA_crime.csv")

In [28]:
# convert the date column to datetime
crime_la['Date Rptd'] = pd.to_datetime(crime_la['Date Rptd'], utc=True)

# extract the month as a two-digit string, used below to group rows by month
crime_la['months'] = crime_la['Date Rptd'].dt.strftime('%m')

In [29]:
# obtain monthly counts
crimeLA_sum = crime_la.groupby('months').count()
final_data["CrimeLA"] = pd.Series(crimeLA_sum['Date Rptd'].tolist())

In [None]:
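# crime_covid is not defined in this notebook; preview the crime columns of final_data instead
final_data[["Period", "CrimeNY", "CrimeLA"]].head()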

### FINAL DATASET ###

In [30]:
final_data

Unnamed: 0,Period,TotalCasesNY,TotalDeathsNY,DeathRateNY,TotalCasesLA,TotalDeathsLA,DeathRateLA,JailPpltNY,JailPpltLA,UnemplNY,...,GovEmplNY,GovEmplLA,HospEmplNY,HospEmplLA,FinEmplNY,FinEmplLA,EduEmplNY,EduEmplLA,CrimeNY,CrimeLA
0,2020-01-31,0.0,0.0,0.0,0,0,0.0,5544.0,6035.0,3.5,...,489.6,578.0,168.4,152.3,475.3,342.8,246.3,173.6,38758,15991
1,2020-02-29,1.0,0.0,0.0,147,0,0.0,5356.0,5960.0,3.4,...,499.3,588.6,168.2,152.6,477.7,345.8,260.7,182.1,35354,16483
2,2020-03-31,65213.0,2193.0,0.033628,6478,95,0.014665,5195.0,6015.0,4.1,...,503.8,593.6,167.7,153.5,462.0,346.9,260.2,179.5,32419,15540
3,2020-04-30,109384.0,12742.0,0.116489,22023,1107,0.050266,3973.0,5798.0,15.0,...,486.4,572.5,165.5,155.8,455.6,329.5,226.7,166.8,24840,15166
4,2020-05-31,28481.0,2834.0,0.099505,30158,1235,0.040951,3859.0,5758.0,18.3,...,477.7,558.6,164.1,154.9,453.8,330.1,221.8,162.7,32038,16314
5,2020-06-30,10865.0,767.0,0.070594,52148,983,0.01885,3878.0,5620.0,20.3,...,472.8,555.0,165.3,155.1,453.9,331.4,212.4,157.3,32512,16969
6,2020-07-31,9783.0,356.0,0.03639,79700,1192,0.014956,3843.0,5223.0,19.9,...,419.9,501.9,166.1,154.1,456.1,337.4,208.9,147.9,35331,16913
7,2020-08-31,7366.0,150.0,0.020364,41481,1080,0.026036,3972.0,5053.0,16.0,...,447.8,502.0,165.5,154.1,460.8,335.8,208.0,148.1,36862,16573
8,2020-09-30,9902.0,124.0,0.012523,27127,618,0.022782,,4916.0,13.9,...,492.5,544.3,165.6,153.4,456.4,332.6,209.4,151.3,33660,14230


In [31]:
# final_data.to_csv("finaldata.csv", sep=",")

## 3. Data description ## 
Have an initial draft of your data description section.
Your data description should be about your analysis-ready data.


#### Final Draft Requirement ####
This should be inspired by the format presented in https://arxiv.org/abs/1803.09010. 
Answer the following questions:
- What are the observations (rows) and the attributes (columns)?
     
- Why was this dataset created?
- Who funded the creation of the dataset?
- What processes might have influenced what data was observed and recorded and what was not?
- What preprocessing was done, and how did the data come to be in the form that you are using?
- If people are involved, were they aware of the data collection and if so, what purpose did they expect the data to be used for?
- Where can your raw source data be found, if applicable? Provide a link to the raw data (hosted in a Cornell Google Drive or Cornell Box).

The dataset has nine rows, one for each of the first nine months of 2020. Its columns cover the variables used to compare the two cities, New York and Los Angeles:
- COVID variables:
- Criminal variables:
    - CrimeNY: number of crime cases in NYC per month
    - CrimeLA: number of crime cases in LAC per month
- Economic variables:
    - GovEmplNY: number of local government employees in NY per month, in thousands
- Demographic variables:
## 4. Data limitations ## 
Identify any potential problems with your dataset.

#### Final Draft Requirement ####

- What are the limitations of your study? What are the biases in your data or assumptions of your analyses that specifically affect the conclusions you’re able to draw?

1. The analysis assumes that the data sources are reliable and authentic.
2. COVID case counts are assumed to be zero before the first recorded entry, because no earlier data is available.
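
A further limitation visible in the final dataset is the missing September value for `JailPpltNY`. A quick check along these lines can surface such gaps before analysis. The values are copied from the final dataset shown earlier, and the frame name `check` is hypothetical.

```python
import numpy as np
import pandas as pd

# Jail population columns copied from the final dataset (September NY value is missing)
check = pd.DataFrame({
    "Period": pd.period_range("2020-01", "2020-09", freq="M"),
    "JailPpltNY": [5544, 5356, 5195, 3973, 3859, 3878, 3843, 3972, np.nan],
    "JailPpltLA": [6035, 5960, 6015, 5798, 5758, 5620, 5223, 5053, 4916],
})

# Number of missing values in each column
missing_per_column = check.isna().sum()

# Rows that contain at least one missing value
rows_with_gaps = check[check.isna().any(axis=1)]

print(missing_per_column)
print(rows_with_gaps)
```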

## 5. Exploratory data analysis ## 
Perform an (initial) exploratory data analysis.

#### Final Draft Requirement ####

- Use summary functions like mean and standard deviation along with visual displays like scatterplots and histograms to describe data.
- Provide at least one model showing patterns or relationships between variables that addresses your research question. This could be a regression or clustering, or something else that measures some property of the dataset.
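
As a starting point for the requirements above, the following sketch computes summary statistics, a correlation, and a simple least-squares line for two of the columns. The values are copied from the final dataset shown earlier. The choice of `TotalCasesNY` and `CrimeNY` is only an illustration, and with nine monthly observations any fitted relationship should be read as descriptive rather than conclusive.

```python
import numpy as np
import pandas as pd

# Values copied from the final dataset shown earlier in this notebook
eda = pd.DataFrame({
    "TotalCasesNY": [0, 1, 65213, 109384, 28481, 10865, 9783, 7366, 9902],
    "CrimeNY": [38758, 35354, 32419, 24840, 32038, 32512, 35331, 36862, 33660],
})

# Summary statistics (count, mean, std, min, quartiles, max) for each column
summary = eda.describe()

# Pearson correlation between monthly cases and monthly crime counts
r = eda["TotalCasesNY"].corr(eda["CrimeNY"])

# Least-squares line: CrimeNY ~ slope * TotalCasesNY + intercept
slope, intercept = np.polyfit(eda["TotalCasesNY"], eda["CrimeNY"], deg=1)

print(summary)
print(f"correlation: {r:.3f}, slope: {slope:.4f}, intercept: {intercept:.1f}")
```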

## 6. Questions for reviewers ## 
List specific questions for your peer reviewers and project mentor to answer in giving you feedback on this phase.