# Data Analysis on NYC Crime

## Problem
I want to move out of my home, and I am open to living in any of the 5 boroughs of NYC. But I have heard that some boroughs, like the Bronx, are surrounded by crime.

In this analysis, I will take a deep dive into crime in NYC. I will find out:
1) Whether the Bronx is the borough with the highest crime rate.
2) Which borough has the lowest crime rate.
3) Who is most likely to commit these crimes.

## Preparing Data
I found this data on NYC Open Data. It has 19 columns and over 110000 rows. The link to the data is below.

https://data.cityofnewyork.us/Public-Safety/NYPD-Arrest-Data-Year-to-Date-/uip8-fykc


## Process
To get the data, I used an API provided by NYC. I had to change the limit in the API call so that it returns all the available data.

I changed it from 2000 to 200000. 

Check out the code below.

In [18]:
#Install sodapy
#!pip install sodapy

In [19]:
#Install pandas
#!pip install pandas

In [20]:
#!/usr/bin/env python

# make sure to install these packages before running:
# pip install pandas
# pip install sodapy

import pandas as pd
import numpy as np
from sodapy import Socrata

# Unauthenticated client only works with public data sets. Note 'None'
# in place of application token, and no username or password:
client = Socrata("data.cityofnewyork.us", None)

# Example authenticated client (needed for non-public datasets):
# client = Socrata("data.cityofnewyork.us",
#                  MyAppToken,
#                  username="user@example.com",
#                  password="AFakePassword")

# Up to 200000 results (limit raised from 2000), returned as JSON from the API
# and converted to a Python list of dictionaries by sodapy.
results = client.get("uip8-fykc", limit=200000)

# Convert to pandas DataFrame
results_df = pd.DataFrame.from_records(results)




I changed the limit from 2000 to 200000 to give me all the data available. 
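
An alternative to one very large limit is to request the data page by page. The sketch below is only an illustration: `fetch_all` and `fake_get_page` are hypothetical helpers, and it assumes the API accepts an `offset` parameter alongside `limit`, so that with sodapy the callable could be something like `lambda limit, offset: client.get("uip8-fykc", limit=limit, offset=offset)`. A stand-in list replaces the real API so the example runs offline.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]


def fetch_all(get_page: Callable[[int, int], List[Record]],
              page_size: int = 2000) -> List[Record]:
    """Collect every record by requesting pages until a short page appears.

    get_page(limit, offset) is a hypothetical callable that returns one page
    of records; it is an assumption, not part of the original notebook.
    """
    records: List[Record] = []
    offset = 0
    while True:
        page = get_page(page_size, offset)
        records.extend(page)
        if len(page) < page_size:  # a short (or empty) page means we are done
            return records
        offset += page_size


# Stand-in for the API so the sketch runs offline: 4500 fake rows.
fake_rows = [{"arrest_key": str(i)} for i in range(4500)]


def fake_get_page(limit: int, offset: int) -> List[Record]:
    """Return one slice of the fake rows, like a paged API would."""
    return fake_rows[offset:offset + limit]


all_rows = fetch_all(fake_get_page, page_size=2000)
print(len(all_rows))
```

Paging this way avoids having to guess a limit that is larger than the dataset.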


In [21]:
#Checking the last 5 rows of the data.
results_df.tail()

| | arrest_key | arrest_date | pd_cd | pd_desc | ky_cd | ofns_desc | law_code | law_cat_cd | arrest_boro | arrest_precinct | ... | x_coord_cd | y_coord_cd | latitude | longitude | geocoded_column | :@computed_region_f5dn_yrer | :@computed_region_yeji_bk3q | :@computed_region_92fq_4b7q | :@computed_region_sbqj_enih | :@computed_region_efsh_h5xi |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 112566 | 267861068 | 2023-05-07T00:00:00.000 | 922 | TRAFFIC,UNCLASSIFIED MISDEMEAN | 348 | VEHICLE AND TRAFFIC LAWS | VTL05110MU | M | Q | 113 | ... | 1046315 | 187088 | 40.6799807384666 | -73.7762339071953 | {'type': 'Point', 'coordinates': [-73.77623390... | 41 | 3 | 6 | 71 | 24669 |
| 112567 | 270481110 | 2023-06-27T00:00:00.000 | 101 | ASSAULT 3 | 344 | ASSAULT 3 & RELATED OFFENSES | PL 1200001 | M | K | 79 | ... | 999872 | 187803 | 40.682141 | -73.943673 | {'type': 'Point', 'coordinates': [-73.943673, ... | 69 | 2 | 49 | 51 | 17618 |
| 112568 | 267833542 | 2023-05-06T00:00:00.000 | 397 | ROBBERY,OPEN AREA UNCLASSIFIED | 105 | ROBBERY | PL 1601502 | F | B | 43 | ... | 1019852 | 241853 | 40.830435 | -73.871349 | {'type': 'Point', 'coordinates': [-73.871349, ... | 58 | 5 | 31 | 26 | 11610 |
| 112569 | 268911088 | 2023-05-27T00:00:00.000 | 494 | STOLEN PROPERTY 2,1,POSSESSION | 111 | POSSESSION OF STOLEN PROPERTY | PL 1654502 | F | B | 46 | ... | 1011750 | 250274 | 40.853578 | -73.900591 | {'type': 'Point', 'coordinates': [-73.900591, ... | 6 | 5 | 22 | 29 | 10935 |
| 112570 | 269584440 | 2023-06-09T00:00:00.000 | 439 | LARCENY,GRAND FROM OPEN AREAS, UNATTENDED | 109 | GRAND LARCENY | PL 1553001 | F | M | 24 | ... | 993372 | 229301 | 40.79605 | -73.967052 | {'type': 'Point', 'coordinates': [-73.967052, ... | 20 | 4 | 23 | 15 | 12422 |
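
The dates and coordinates in this preview look like plain text, which I assume is how the API returns them. If that is the case, a minimal sketch like the one below could convert them to proper types before any date or map work. The `sample` dataframe is a hypothetical stand-in built from two of the rows shown above.

```python
import pandas as pd

# Two rows copied from the preview, with values kept as strings
# (assumed to be how they arrive from the API).
sample = pd.DataFrame({
    "arrest_date": ["2023-05-07T00:00:00.000", "2023-06-27T00:00:00.000"],
    "latitude": ["40.6799807384666", "40.682141"],
    "longitude": ["-73.7762339071953", "-73.943673"],
})

# Parse the ISO-style timestamps into datetime values.
sample["arrest_date"] = pd.to_datetime(sample["arrest_date"])

# Turn the coordinate strings into floats; unparseable values become NaN.
for col in ["latitude", "longitude"]:
    sample[col] = pd.to_numeric(sample[col], errors="coerce")

print(sample.dtypes)
```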


After getting the data, I am going to clean it to make it easier to use.
Let's look at race first.

In [22]:
#See the unique values in perp_race, because there are a lot of categories
results_df["perp_race"].unique()

array(['WHITE', 'BLACK', 'WHITE HISPANIC', 'BLACK HISPANIC', 'UNKNOWN',
       'ASIAN / PACIFIC ISLANDER', 'AMERICAN INDIAN/ALASKAN NATIVE'],
      dtype=object)

In [23]:
#We notice something strange: White Hispanic and Black Hispanic are listed separately.
#This is a bit weird, so we are going to combine them into one category, Hispanic
results_df["perp_race"] = results_df["perp_race"].replace(["BLACK HISPANIC", "WHITE HISPANIC"], ["HISPANIC", "HISPANIC"])

In [24]:
#Just to check our code worked
results_df["perp_race"].unique()

array(['WHITE', 'BLACK', 'HISPANIC', 'UNKNOWN',
       'ASIAN / PACIFIC ISLANDER', 'AMERICAN INDIAN/ALASKAN NATIVE'],
      dtype=object)


Next, I will drop the last 5 columns because they do not add anything useful to the analysis.


In [25]:
#Drop the following columns
results_df = results_df.drop(columns=[
    ':@computed_region_f5dn_yrer',
    ':@computed_region_yeji_bk3q',
    ':@computed_region_92fq_4b7q',
    ':@computed_region_sbqj_enih',
    ':@computed_region_efsh_h5xi',
])
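
Since all five dropped columns share the `:@computed_region_` prefix, the list could also be built from the prefix instead of typed out. This is only a sketch on a small hypothetical `demo` dataframe, not the code that was run above.

```python
import pandas as pd

# Small stand-in dataframe with two regular and two computed-region columns.
demo = pd.DataFrame({
    "arrest_key": ["267861068"],
    "arrest_boro": ["Q"],
    ":@computed_region_f5dn_yrer": ["41"],
    ":@computed_region_yeji_bk3q": ["3"],
})

# Collect every column whose name starts with the shared prefix, then drop them.
computed = [c for c in demo.columns if c.startswith(":@computed_region_")]
demo = demo.drop(columns=computed)

print(list(demo.columns))
```

This keeps working if the API adds or renames one of the computed-region columns.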

In [26]:
#Check if there are any null values
results_df.isna().sum()

arrest_key             0
arrest_date            0
pd_cd                461
pd_desc                0
ky_cd                466
ofns_desc              0
law_code               0
law_cat_cd           846
arrest_boro            0
arrest_precinct        0
jurisdiction_code      0
age_group              0
perp_sex               0
perp_race              0
x_coord_cd             0
y_coord_cd             0
latitude               0
longitude              0
geocoded_column        0
dtype: int64

Because we have a large sample of over 110000 rows and no column is missing more than 846 values, I will drop every row that contains a null value. We should still have plenty of data to use and explore.
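
Before dropping, it can help to count how many rows would actually be lost, since a row with nulls in several columns is only removed once. The sketch below shows one way to check this on a tiny hypothetical `demo` dataframe with made-up values.

```python
import numpy as np
import pandas as pd

# Made-up stand-in: row 1 is missing two values, row 2 is missing one.
demo = pd.DataFrame({
    "pd_cd": ["922", np.nan, "397", "494"],
    "ky_cd": ["348", np.nan, "105", "111"],
    "law_cat_cd": ["M", "M", np.nan, "F"],
})

# A row counts once, no matter how many of its columns are null.
rows_with_nulls = int(demo.isna().any(axis=1).sum())
share = rows_with_nulls / len(demo)

# dropna() removes exactly those rows.
cleaned = demo.dropna()

print(rows_with_nulls, f"{share:.0%}", len(cleaned))
```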

In [27]:
#dropping NA values
results_df = results_df.dropna()

In [28]:
#Checking the shape of the dataframe. 
results_df.shape

(111259, 19)

In [29]:
results_df.isna().sum()

arrest_key           0
arrest_date          0
pd_cd                0
pd_desc              0
ky_cd                0
ofns_desc            0
law_code             0
law_cat_cd           0
arrest_boro          0
arrest_precinct      0
jurisdiction_code    0
age_group            0
perp_sex             0
perp_race            0
x_coord_cd           0
y_coord_cd           0
latitude             0
longitude            0
geocoded_column      0
dtype: int64

As you can see, after cleaning the data we still have 111259 rows. I will now save the dataframe to a CSV file for future analysis.

In [30]:
#Dataframe to CSV (to_csv returns None when given a file path, so there is nothing to assign)
results_df.to_csv('NYPD_Arrest_Data_1.csv', index=False)


## Analyze
For the next step of the project, I will analyze the cleaned data, using Tableau as the data visualization tool.
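
Before moving to Tableau, a quick check in pandas can give a first look at the borough question. The sketch below is an assumption-heavy illustration: the `demo` dataframe only reuses the five borough codes from the preview rows, and the code-to-name mapping is my reading of the dataset documentation. It counts arrests, not a crime rate, because a rate would also need population figures that are not in this dataset.

```python
import pandas as pd

# Borough codes as I understand them from the dataset documentation (assumption).
BORO_NAMES = {
    "B": "Bronx",
    "K": "Brooklyn",
    "M": "Manhattan",
    "Q": "Queens",
    "S": "Staten Island",
}

# Stand-in data: the borough codes of the five preview rows.
demo = pd.DataFrame({"arrest_boro": ["Q", "K", "B", "B", "M"]})

# Count arrests per borough, most frequent first.
arrests_by_boro = demo["arrest_boro"].map(BORO_NAMES).value_counts()

# Share of all arrests that falls in each borough.
share_by_boro = arrests_by_boro / arrests_by_boro.sum()

print(arrests_by_boro)
print(share_by_boro)
```

On the real data, the same two lines could be run on `results_df` to see which borough has the most and the fewest arrests.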