- Team: Hangyu Zhou (hz477), Evian Liu (yl2867), YingYun Zhang (yz549), Yang Zhao (yz563)
- Github repository link: https://github.coecis.cornell.edu/yl2867/INFO2950_Project.git
- Rubric: https://docs.google.com/document/d/1W3mPBOMhM9SD3oym2LG3NLyB78jG-Kmv9egM3B_EObg/edit

# Phase II:  #
#### Data collection and exploratory data analysis. Due Oct 6. ####
- Settle on a single idea and state your research question(s) clearly.
- Carry out most of your data collection.
- Compute some relevant summary statistics, and show some plots of your data, as applicable to your research question(s). Use this exploratory data analysis to:
    - update your research question(s), if applicable;
    - update your data description, if applicable (e.g. if you collect additional data).

### A dataset of moderate size and complexity. ###
- It should be large enough to have interesting complexity, but not so big as to be unwieldy. As a rough guideline, your dataset should be longer than you could print on a single page in standard spreadsheet format, but smaller than 20 MB.
- You may use existing datasets, combine data from APIs, or create entirely new data through instruments or surveys.
- The dataset you turn in does not have to be the dataset that you initially collected. For example, you might download 50 MB of raw logs, but use filtering and aggregation to reduce the dataset to 100 kB for your actual analysis. We want you to submit your analysis-ready data, but you should describe your full data-collection protocol and any preprocessing done in the data description section of your final report (see below). All source code used for data collection and preprocessing should also be linked in the source code section of your final report.
- If your final, curated dataset is larger than 10MB, share a copy in Cornell Box and include a link to it in your final report.

## 1. Research question(s) ## 
State your research question(s) clearly.

## 2. Data cleaning ## 
Have an initial draft of your data cleaning appendix.
Document every step that turns your raw data file(s) into the analysis-ready dataset that you will submit with your final project.
All of your data cleaning code should be found in this section, and you may want to explain the steps of your data cleaning in words as well.

#### Final Draft Requirement ####

- Data cleaning description. Submit an updated version of your data cleaning description from phase II that describes all data cleaning steps performed on your raw data to turn it into the analysis-ready dataset submitted with your final project. The data cleaning description should be a separate Jupyter notebook with executed cells, and it should output the dataset you submit as part of your project (e.g. written as a .csv file).

In [1]:
# import packages
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

### Crime ###
- As of 09/26/20
- NY: https://compstat.nypdonline.org/2e5c3f4b-85c1-4635-83c6-22b27fe7c75c/view/89
- LA: http://lapd-assets.lapdonline.org/assets/pdf/cityprof.pdf

In [2]:
# copy raw data from source
crime_raw = {'Homicide20YTD': [337, 228], 'Homicide19YTD': [241, 199], 
             'Rape20YTD': [1057, 966], 'Rape19YTD': [1371, 1260], 
             'Robbery20YTD': [9362, 5902], 'Robbery19YTD': [9478, 7731], 
             'FelonyAssult20YTD': [15209, 13254], 'FelonyAssult19YTD': [15687, 12895], 
             'Burglary20YTD': [11107, 10043], 'Burglary19YTD': [7835, 11633], 
            }
crimedata = pd.DataFrame(data=crime_raw)
crimedata.index = ['NY','LA']

In [3]:
# calculate new columns
crimedata['Homicide%Chg'] = (crimedata['Homicide20YTD']-crimedata['Homicide19YTD']) / crimedata['Homicide19YTD']
crimedata['Rape%Chg'] = (crimedata['Rape20YTD']-crimedata['Rape19YTD']) / crimedata['Rape19YTD']
crimedata['Robbery%Chg'] = (crimedata['Robbery20YTD']-crimedata['Robbery19YTD']) / crimedata['Robbery19YTD']
crimedata['FelonyAssult%Chg'] = (crimedata['FelonyAssult20YTD']-crimedata['FelonyAssult19YTD']) / crimedata['FelonyAssult19YTD']
crimedata['Burlary%Chg'] = (crimedata['Burglary20YTD']-crimedata['Burglary19YTD']) / crimedata['Burglary19YTD']
crimedata['TotalViolent20YTD'] = crimedata['Homicide20YTD'] + crimedata['Rape20YTD'] + crimedata['Robbery20YTD'] + crimedata['FelonyAssult20YTD']
crimedata['TotalViolent19YTD'] = crimedata['Homicide19YTD'] + crimedata['Rape19YTD'] + crimedata['Robbery19YTD'] + crimedata['FelonyAssult19YTD']
crimedata['TotalViolent%Chg'] = (crimedata['TotalViolent20YTD']-crimedata['TotalViolent19YTD']) / crimedata['TotalViolent19YTD']
crimedata['TotalCrime20YTD'] = crimedata['TotalViolent20YTD'] + crimedata['Burglary20YTD']
crimedata['TotalCrime19YTD'] = crimedata['TotalViolent19YTD'] + crimedata['Burglary19YTD']
crimedata['TotalCrime%Chg'] = (crimedata['TotalCrime20YTD']-crimedata['TotalCrime19YTD']) / crimedata['TotalCrime19YTD']
crimedata

Unnamed: 0,Homicide20YTD,Homicide19YTD,Rape20YTD,Rape19YTD,Robbery20YTD,Robbery19YTD,FelonyAssult20YTD,FelonyAssult19YTD,Burglary20YTD,Burglary19YTD,...,Rape%Chg,Robbery%Chg,FelonyAssult%Chg,Burlary%Chg,TotalViolent20YTD,TotalViolent19YTD,TotalViolent%Chg,TotalCrime20YTD,TotalCrime19YTD,TotalCrime%Chg
NY,337,241,1057,1371,9362,9478,15209,15687,11107,7835,...,-0.22903,-0.012239,-0.030471,0.417613,25965,26777,-0.030325,37072,34612,0.071074
LA,228,199,966,1260,5902,7731,13254,12895,10043,11633,...,-0.233333,-0.23658,0.02784,-0.13668,20350,22085,-0.07856,30393,33718,-0.098612
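Every `%Chg` column above uses the same formula: (2020 YTD count - 2019 YTD count) / 2019 YTD count. A minimal sketch of a reusable helper follows; `pct_change_ytd` is a hypothetical name that does not appear in the notebook, and the inputs are the NY homicide counts copied from the raw data above.

```python
def pct_change_ytd(current, previous):
    """Return the change from `previous` to `current` as a fraction of `previous`."""
    return (current - previous) / previous


# NY homicide counts copied from the raw data cell: 337 (2020 YTD), 241 (2019 YTD)
ny_homicide_chg = pct_change_ytd(337, 241)
print(round(ny_homicide_chg, 6))
```

Because the function only uses arithmetic operators, it should also accept pandas Series, e.g. `pct_change_ytd(crimedata['Rape20YTD'], crimedata['Rape19YTD'])`. Note that the result is a fraction, so a value of -0.22903 corresponds to a 22.9% decrease.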


### Covid ###
- As of 10/03/20
- NY: https://www1.nyc.gov/site/doh/covid/covid-19-data.page; https://covid19tracker.health.ny.gov/views/NYS-COVID19-Tracker/NYSDOHCOVID-19Tracker-Fatalities?%3Aembed=yes&%3Atoolbar=no&%3Atabs=n
- LA: https://corona-virus.la/data; http://publichealth.lacounty.gov/media/coronavirus/locations.htm; http://dashboard.publichealth.lacounty.gov/covid19_surveillance_dashboard/

In [4]:
# copy raw data from source
ny_Hospitalizations = 59345
covid_raw = {'Cases': [240873, 111292], 'Deaths': [19207, 2806], }
coviddata = pd.DataFrame(data=covid_raw)
coviddata.index = ['NY','LA']

In [5]:
# calculate new columns
coviddata["DeathRate"] = coviddata["Deaths"] / coviddata["Cases"]
coviddata

Unnamed: 0,Cases,Deaths,DeathRate
NY,240873,19207,0.079739
LA,111292,2806,0.025213


### Combine All ###

In [6]:
crime_subset = crimedata[ ['Homicide%Chg', 'Rape%Chg', 'Robbery%Chg', 'FelonyAssult%Chg', 'TotalViolent%Chg', 'Burlary%Chg', 'TotalCrime%Chg'] ]
data = pd.concat([crime_subset, coviddata], axis=1)
data

Unnamed: 0,Homicide%Chg,Rape%Chg,Robbery%Chg,FelonyAssult%Chg,TotalViolent%Chg,Burlary%Chg,TotalCrime%Chg,Cases,Deaths,DeathRate
NY,0.39834,-0.22903,-0.012239,-0.030471,-0.030325,0.417613,0.071074,240873,19207,0.079739
LA,0.145729,-0.233333,-0.23658,0.02784,-0.07856,-0.13668,-0.098612,111292,2806,0.025213
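`pd.concat(..., axis=1)` aligns rows by index label, so the combine step depends on both frames sharing the same `NY`/`LA` index. The sketch below uses abbreviated frames with values copied from the outputs above; the index check is an added sanity test, not part of the original notebook, and the names `crime_part`, `covid_part`, and `combined` are illustrative.

```python
import pandas as pd

# Abbreviated stand-ins for crime_subset and coviddata (values from the outputs above)
crime_part = pd.DataFrame({'TotalCrime%Chg': [0.071074, -0.098612]},
                          index=['NY', 'LA'])
covid_part = pd.DataFrame({'Cases': [240873, 111292], 'Deaths': [19207, 2806]},
                          index=['NY', 'LA'])

# Fail early if the city labels differ; mismatched labels would produce NaN-filled rows
assert crime_part.index.equals(covid_part.index)

# Place the two frames side by side, matching rows on the city label
combined = pd.concat([crime_part, covid_part], axis=1)
print(combined.shape)
```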


## 3. Data description ## 
Have an initial draft of your data description section.
Your data description should be about your analysis-ready data.


#### Final Draft Requirement ####
This should be inspired by the format presented in https://arxiv.org/abs/1803.09010. 
Answer the following questions:
- What are the observations (rows) and the attributes (columns)?
     
- Why was this dataset created?
- Who funded the creation of the dataset?
- What processes might have influenced what data was observed and recorded and what was not?
- What preprocessing was done, and how did the data come to be in the form that you are using?
- If people are involved, were they aware of the data collection and if so, what purpose did they expect the data to be used for?
- Where can your raw source data be found, if applicable? Provide a link to the raw data (hosted in a Cornell Google Drive or Cornell Box).

The dataset has two rows, one for each of the two cities this project focuses on (NY and LA). The columns are covariates used to compare the two cities:
- COVID variables: 
    - Cases = number of positive cases
    - Deaths = number of deaths
    - DeathRate = Deaths/Cases (deaths per positive case)
- Criminal variables:
    - Homicide%Chg = year-to-date change in homicides from 2019 to 2020, stored as a fraction (e.g. 0.398 = 39.8%)
    - Rape%Chg = year-to-date change in rapes from 2019 to 2020
    - Robbery%Chg = year-to-date change in robberies from 2019 to 2020
    - FelonyAssult%Chg = year-to-date change in felony assaults from 2019 to 2020
    - TotalViolent%Chg = year-to-date change in total violent crimes (the four crimes listed above) from 2019 to 2020
    - Burlary%Chg = year-to-date change in burglaries from 2019 to 2020
    - TotalCrime%Chg = year-to-date change in total crimes (violent crimes plus burglaries) from 2019 to 2020
- Demographic variables
- Economic variables

## 4. Data limitations ## 
Identify any potential problems with your dataset.

#### Final Draft Requirement ####

- What are the limitations of your study? What are the biases in your data or assumptions of your analyses that specifically affect the conclusions you’re able to draw?

1. The analysis assumes that the data sources are reliable and accurate.

## 5. Exploratory data analysis ## 
Perform an (initial) exploratory data analysis.

#### Final Draft Requirement ####

- Use summary functions like mean and standard deviation along with visual displays like scatterplots and histograms to describe data.
- Provide at least one model showing patterns or relationships between variables that addresses your research question. This could be a regression or clustering, or something else that measures some property of the dataset.
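As a rough illustration of the summary functions mentioned above, the sketch below computes the mean and standard deviation of the COVID counts. The values are copied from the data cleaning section; the `summary` name is illustrative. With only two observations (NY and LA), these statistics say very little, so this is a sketch of the mechanics rather than a meaningful result.

```python
import pandas as pd

# COVID counts copied from the data cleaning section
coviddata = pd.DataFrame({'Cases': [240873, 111292], 'Deaths': [19207, 2806]},
                         index=['NY', 'LA'])

# One row per statistic, one column per variable
summary = coviddata.agg(['mean', 'std'])
print(summary)
```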

### Crime statistics for all cities ###

In [7]:
crimestats = pd.read_csv("CrimeStatistics.csv", index_col='City')
crimestats

FileNotFoundError: [Errno 2] No such file or directory: 'CrimeStatistics.csv'
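The error above means `CrimeStatistics.csv` was not in the notebook's working directory when the cell ran. One possible guard, sketched under the assumption that the file should sit next to the notebook, checks for the file before reading it so that the message reports where it was looked for:

```python
from pathlib import Path

import pandas as pd

csv_path = Path("CrimeStatistics.csv")  # filename taken from the cell above

if csv_path.exists():
    # Same call as the original cell, with the city name as the row index
    crimestats = pd.read_csv(csv_path, index_col='City')
else:
    # Leave a clear marker instead of raising, and report the directory searched
    crimestats = None
    print(f"{csv_path} not found in {Path.cwd()}; add the file before running the plots.")
```

The later plotting cells would still need `crimestats` to be loaded, so this only makes the failure easier to diagnose.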

In [None]:
crimestats.dtypes

In [None]:
crimestats.loc['New Orleans':'Portland','YTD Murder':'2019 YTD Murder'].plot.bar()

In [None]:
crimestats.loc['New Orleans':'Portland','YTD Violent':'2019 YTD Violent'].plot.bar()

In [None]:
crimestats.loc['New Orleans':'Portland','YTD Property':'2019 YTD Property'].plot.bar()

In [None]:
crimestats.loc['New Orleans':'Portland','YTD Total':'2019 YTD Total'].plot.bar()

### Crime statistics for NY and LA ###

In [None]:
crimedata.loc['NY','Homicide20YTD':'Burglary19YTD'].plot.bar(color=['green', 'orange'])

In [None]:
crimedata.loc['LA','Homicide20YTD':'Burglary19YTD'].plot.bar(color=['green', 'orange'])

In [None]:
coviddata.loc[:,'Cases':'Deaths'].plot.bar()

## 6. Questions for reviewers ## 
List specific questions for your peer reviewers and project mentor to answer in giving you feedback on this phase.