# Data Analysis on NYC Crime

## Problem
I want to leave my home and I am open to living in any of the 5 boroughs of NYC. But I have heard that some boroughs, like the Bronx, are surrounded by crime.

In this analysis, I will take a deep dive into crime in NYC. I want to find out:
1) Whether the Bronx is the borough with the highest crime rate.
2) Which borough has the lowest crime rate.
3) Who is more likely to commit these crimes.

## Preparing Data
I found this data on NYC Open Data. It has 19 columns and over 110000 rows. The link to the data is below.

https://data.cityofnewyork.us/Public-Safety/NYPD-Arrest-Data-Year-to-Date-/uip8-fykc


## Process
To get the data I used the API provided by NYC Open Data. I had to raise the limit in the API call so that it would return all the available data.

I changed it from 2000 to 5500000.

The code is below.

In [1]:
#Install sodapy
#!pip install sodapy

In [2]:
#Install pandas
#!pip install pandas

In [4]:
#!/usr/bin/env python

# make sure to install these packages before running:
# pip install pandas
# pip install sodapy

import pandas as pd
from sodapy import Socrata

# Unauthenticated client only works with public data sets. Note 'None'
# in place of application token, and no username or password:
client = Socrata("data.cityofnewyork.us", None)

# Example authenticated client (needed for non-public datasets):
# client = Socrata("data.cityofnewyork.us",
#                  MyAppToken,
#                  username="user@example.com",
#                  password="AFakePassword")

# Request up to 5500000 results, returned as JSON from the API and converted
# to a Python list of dictionaries by sodapy.
results = client.get("8h9b-rp9u", limit=5500000)

# Convert to pandas DataFrame
results_df = pd.DataFrame.from_records(results)
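
A single very large limit works, but the same rows can also be pulled in pages. Below is a minimal sketch, assuming the API honours the limit and offset parameters that sodapy passes through. The names fetch_all and fetch_page are hypothetical (they are not part of sodapy), and the page size of 50000 is an arbitrary choice.

```python
from typing import Callable, Dict, Iterator, List

Record = Dict[str, str]


def fetch_all(fetch_page: Callable[[int, int], List[Record]],
              page_size: int = 50_000) -> Iterator[Record]:
    """Yield every record by requesting fixed-size pages until one comes back short."""
    offset = 0
    while True:
        page = fetch_page(page_size, offset)
        yield from page
        # A page shorter than page_size (including an empty one) is the last page.
        if len(page) < page_size:
            break
        offset += page_size


# Hypothetical wiring to the client above (not executed here). Ordering by
# arrest_key is an assumption made so that pages do not overlap:
# rows = list(fetch_all(lambda limit, offset: client.get(
#     "8h9b-rp9u", limit=limit, offset=offset, order="arrest_key")))
# results_df = pd.DataFrame.from_records(rows)
```

Passing the page fetcher in as a function keeps the loop independent of the network, so it can be tried against a plain Python list first.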






In [None]:
#Checking the last 5 rows of the data.
results_df.tail()

Unnamed: 0,arrest_key,arrest_date,pd_cd,law_code,law_cat_cd,arrest_boro,arrest_precinct,jurisdiction_code,age_group,perp_sex,...,longitude,lon_lat,:@computed_region_efsh_h5xi,:@computed_region_f5dn_yrer,:@computed_region_yeji_bk3q,:@computed_region_92fq_4b7q,:@computed_region_sbqj_enih,pd_desc,ky_cd,ofns_desc
999995,175929954,2018-03-15T00:00:00.000,339,PL 1552500,M,M,14,0,18-24,F,...,-73.98837157699995,"{'type': 'Point', 'coordinates': [-73.98837157...",13094,11,4,10,8,"LARCENY,PETIT FROM OPEN AREAS,UNCLASSIFIED",341,PETIT LARCENY
999996,175678433,2018-03-08T00:00:00.000,101,PL 1200001,M,M,18,97,18-24,M,...,-73.98784212899994,"{'type': 'Point', 'coordinates': [-73.98784212...",12081,12,4,19,10,ASSAULT 3,344,ASSAULT 3 & RELATED OFFENSES
999997,178327436,2018-04-19T00:00:00.000,478,PL 1651503,M,K,60,1,18-24,M,...,-73.98120362499998,"{'type': 'Point', 'coordinates': [-73.98120362...",18184,21,2,45,35,"THEFT OF SERVICES, UNCLASSIFIED",343,OTHER OFFENSES RELATED TO THEFT
999998,176290671,2018-03-25T00:00:00.000,101,PL 1200001,M,M,7,0,45-64,M,...,-73.98754415899998,"{'type': 'Point', 'coordinates': [-73.98754415...",11723,70,4,32,4,ASSAULT 3,344,ASSAULT 3 & RELATED OFFENSES
999999,176313446,2018-03-26T00:00:00.000,397,PL 1601001,F,B,43,0,<18,M,...,-73.87017045,"{'type': 'Point', 'coordinates': [-73.87017045...",11611,58,5,31,26,"ROBBERY,UNCLASSIFIED,OPEN AREAS",105,ROBBERY


After getting the data, I will clean it to make it easier to use. 
Let's look at the borough, race, and sex columns. 

In [None]:
#See the unique values in arrest_boro
results_df["arrest_boro"].unique()

array(['M', 'B', 'Q', 'K', 'S'], dtype=object)

In [None]:
#The perp_race column lists White Hispanic and Black Hispanic as separate values.
#That is a bit odd, so we are going to combine them into a single Hispanic value
results_df["perp_race"] = results_df['perp_race'].replace(['BLACK HISPANIC', 'WHITE HISPANIC'], ['HISPANIC', 'HISPANIC'])

In [None]:
#See the unique values in perp_sex
results_df["perp_sex"].unique()

array(['M', 'F'], dtype=object)

In [None]:
#Replace the one-letter borough and sex codes with readable names
results_df["arrest_boro"] = results_df['arrest_boro'].replace(['B', 'K', 'Q', 'M', 'S'], ['Bronx', 'Brooklyn', 'Queens', 'Manhattan', 'Staten Island'])
results_df['perp_sex'] = results_df['perp_sex'].replace(['M', 'F'], ["Male", "Female"])
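
Replacing codes with two parallel lists depends on both lists staying in the same order. Below is a minimal sketch of an alternative, assuming a dictionary lookup is acceptable here. The names decode_column and BORO_NAMES are hypothetical, and the sample Series is made up for illustration.

```python
import pandas as pd

# Borough codes as they appear in arrest_boro, paired with readable names.
BORO_NAMES = {"B": "Bronx", "K": "Brooklyn", "Q": "Queens",
              "M": "Manhattan", "S": "Staten Island"}


def decode_column(codes: pd.Series, names: dict) -> pd.Series:
    """Map codes to names, failing loudly on any code missing from the lookup."""
    unknown = set(codes.dropna().unique()) - set(names)
    if unknown:
        raise ValueError(f"unmapped codes: {sorted(unknown)}")
    return codes.map(names)


sample = pd.Series(["M", "B", "Q", "K", "S", "M"])  # made-up sample values
decoded = decode_column(sample, BORO_NAMES)
```

Raising on an unknown code is a design choice: a new or misspelled code stops the notebook instead of silently passing through unchanged.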


Next I will drop seven columns (the five computed region columns, lon_lat, and pd_desc) because they do not add anything useful to the analysis. 


In [None]:
#Drop the following columns
results_df = results_df.drop(columns=[':@computed_region_f5dn_yrer', ':@computed_region_yeji_bk3q', ':@computed_region_92fq_4b7q', ':@computed_region_sbqj_enih',":@computed_region_efsh_h5xi" ,"lon_lat", "pd_desc" ])

In [None]:
#Check if there are any null values
results_df.isna().sum()

arrest_key              0
arrest_date             0
pd_cd                 235
law_code              118
law_cat_cd           6562
arrest_boro             0
arrest_precinct         0
jurisdiction_code       6
age_group              17
perp_sex                0
perp_race               0
x_coord_cd              1
y_coord_cd              1
latitude                1
longitude               1
ky_cd                2989
ofns_desc            2989
dtype: int64

Because the sample is large and no column is missing more than 6562 values, I will drop every row that contains a null value. We should still have a good amount of data to use and explore. 

In [None]:
#dropping NA values
results_df = results_df.dropna()
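
Before dropping rows it can help to see how many are lost. Below is a small sketch on a made-up DataFrame; drop_nulls_with_report is a hypothetical helper, not part of pandas.

```python
import pandas as pd


def drop_nulls_with_report(df: pd.DataFrame):
    """Drop rows containing any null and report how many rows were removed."""
    before = len(df)
    cleaned = df.dropna()
    dropped = before - len(cleaned)
    share = dropped / before if before else 0.0
    return cleaned, dropped, share


# Made-up rows: two of the four contain a missing value.
toy = pd.DataFrame({
    "arrest_boro": ["M", "B", None, "K"],
    "age_group": ["18-24", None, "25-44", "45-64"],
})
cleaned, dropped, share = drop_nulls_with_report(toy)
print(f"dropped {dropped} rows ({share:.1%})")
```

On the real data the same call would take results_df in place of toy.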

In [None]:
#Checking the shape of the dataframe. 
results_df.shape

(990547, 17)

In [None]:
results_df.isna().sum()

arrest_key           0
arrest_date          0
pd_cd                0
law_code             0
law_cat_cd           0
arrest_boro          0
arrest_precinct      0
jurisdiction_code    0
age_group            0
perp_sex             0
perp_race            0
x_coord_cd           0
y_coord_cd           0
latitude             0
longitude            0
ky_cd                0
ofns_desc            0
dtype: int64

As you can see, after cleaning the data we still have 990547 rows and 17 columns. I will now save the dataframe to a CSV file for future analysis.

In [None]:
#Dataframe to CSV
results_df.to_csv('NYPD_Arrest_Data_Historic.csv', index=False)


## Analyze
For the next step of the project I will analyze the data. I will use Tableau as the data visualization tool. 

The dashboard is on Tableau Public:
https://public.tableau.com/app/profile/david.sierra.perez8682/viz/NYCArrestData/ArrestbyBorough
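
The borough and demographic questions can also be cross-checked in Python with a simple count. Below is a minimal sketch on made-up rows; arrests_by is a hypothetical helper, and the rows only imitate the cleaned column names and values.

```python
from collections import Counter


def arrests_by(rows, column):
    """Count rows per distinct value of `column`, most common first."""
    return Counter(row[column] for row in rows).most_common()


# Made-up rows for illustration only; they are not taken from the dataset.
rows = [
    {"arrest_boro": "Bronx", "perp_sex": "Male"},
    {"arrest_boro": "Brooklyn", "perp_sex": "Male"},
    {"arrest_boro": "Brooklyn", "perp_sex": "Female"},
    {"arrest_boro": "Queens", "perp_sex": "Male"},
]
by_boro = arrests_by(rows, "arrest_boro")
by_sex = arrests_by(rows, "perp_sex")
```

With the saved file, the rows could come from csv.DictReader over NYPD_Arrest_Data_Historic.csv. These are arrest counts, so comparing boroughs as rates would also need each borough's population.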