## Task 1:

### Problem Statement:
The city of Buffalo records a wide range of crime incidents, but some records are incomplete or missing key fields, which makes it difficult to determine the type of crime that occurred. Predicting the crime type from such partial data supports decision-making and resource allocation by law enforcement.

This project aims to build a machine learning classification model capable of predicting the type of crime incident based on partial information such as time of day, location, and brief descriptions. This model will help in identifying the nature of crimes when details are scarce or when forecasting future incidents based on historical patterns.


In [32]:
import pandas as pd
import requests
from datetime import datetime
import numpy as np
import geopandas as gpd

In [8]:
url='https://data.buffalony.gov/resource/d6g9-xbgu.json'

## Task 3:
### Data Retrieval Process
The data extraction process was carried out using a REST API request, iterating through the dataset in a paginated manner to retrieve crime records. Below is a detailed breakdown of the process:
1. **Pagination and Data Fetching**:  
   - The dataset was accessed in chunks of **1000 records per request** using pagination with `$limit` and `$offset` to handle large amounts of data. The **`offset`** tracked how many records had been retrieved, allowing subsequent requests to fetch the next batch.

2. **Cutoff Date for Data Filtering**:  
   - The Buffalo crime data website indicates that data before **2009** is unreliable. Therefore, a **cutoff date** of **January 1, 2009** was applied. Records were retrieved in descending order based on **`incident_datetime`**, ensuring that only data from **2009** onwards was collected.

3. **Data Processing**:  
   - After converting the **`incident_datetime`** field to a **datetime** format, each batch was checked to filter out records before **2009**. Once data from earlier than **2009** was encountered, further fetching was stopped.

4. **Combining Data Chunks**:  
   - The data retrieved in chunks was combined into a single **DataFrame**, containing crime data from **January 1, 2009** onwards, ensuring the dataset's reliability based on the source's guidelines.

This approach ensures that only reliable crime data from **2009** onwards is used for analysis, following the cutoff requirements from the Buffalo crime data website.
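
The stopping rule described above can be checked offline before any request is sent. The following is a minimal sketch under stated assumptions, not the notebook's actual retrieval code: `fetch_until_cutoff` and `fake_fetch_page` are hypothetical names, and an in-memory list sorted newest-first stands in for the API.

```python
from datetime import datetime, timedelta

def fetch_until_cutoff(fetch_page, cutoff, limit=1000):
    """Page through records sorted newest-first and stop at the cutoff.

    fetch_page(limit, offset) is a hypothetical callable that returns a list
    of dicts with an 'incident_datetime' datetime; it stands in for the HTTP call.
    """
    kept = []
    offset = 0
    while True:
        page = fetch_page(limit, offset)
        if not page:
            break  # the source has no more records
        # Keep only records on or after the cutoff
        recent = [r for r in page if r['incident_datetime'] >= cutoff]
        kept.extend(recent)
        if len(recent) < len(page):
            break  # this page crossed the cutoff, so older pages are not needed
        offset += limit
    return kept

# Toy data (assumption): one record per day, newest first, spanning the cutoff
newest = datetime(2009, 1, 10)
records = [{'incident_datetime': newest - timedelta(days=i)} for i in range(20)]

def fake_fetch_page(limit, offset):
    # Mimics $limit/$offset pagination over the in-memory list
    return records[offset:offset + limit]

result = fetch_until_cutoff(fake_fetch_page, datetime(2009, 1, 1), limit=4)
print(len(result), min(r['incident_datetime'] for r in result))
```

Because the records arrive in descending order, the first page that contains anything older than the cutoff is also the last page worth fetching.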


In [9]:
df_list = []
offset = 0
limit = 1000  # Adjust this value as needed
cutoff_date = datetime(2009, 1, 1)  # Cutoff date: keep records from January 1, 2009 onwards

while True:
    params = {
        '$limit': limit,
        '$offset': offset,
        '$order': 'incident_datetime DESC'  # Sort by incident_datetime in descending order
    }
    response = requests.get(url, params=params)
    response.raise_for_status()  # Fail loudly on HTTP errors instead of parsing an error body
    data = response.json()
    df_page = pd.DataFrame(data)
    
    if df_page.empty:
        break
    
    # Convert incident_datetime to datetime objects
    df_page['incident_datetime'] = pd.to_datetime(df_page['incident_datetime'])
    
    # Check whether this page reaches back past the cutoff date
    if df_page['incident_datetime'].min() < cutoff_date:
        # Keep only rows on or after the cutoff, then stop fetching
        df_page = df_page[df_page['incident_datetime'] >= cutoff_date]
        df_list.append(df_page)
        break
    
    df_list.append(df_page)
    offset += limit

df = pd.concat(df_list, ignore_index=True)

In [10]:
df.to_csv('dataset.csv')

In [11]:
df['incident_description'].value_counts()

incident_description
Buffalo Police are investigating this report of a crime.  It is important to note that this is very preliminary information and further investigation as to the facts and circumstances of this report may be necessary.    250527
Buffalo Police are investigating this report of a crime. It is important to note that this is very preliminary information and further investigation as to the facts and circumstances of this report may be necessary.       5177
LARCENY/THEFT                                                                                                                                                                                                                 2012
BURGLARY                                                                                                                                                                                                                      1061
ASSAULT                                                                

In [12]:
# As shown above, the incident_description column contains two copies of the same
# description that differ only by an extra space.
# This can be fixed with a regex: r'\s+' matches any run of whitespace, and
# str.replace collapses each run into a single space.
df['incident_description'] = df['incident_description'].str.replace(r'\s+', ' ', regex=True)
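
The effect of the whitespace normalization can be seen on a small example. This is a sketch on made-up strings (the two sample values are assumptions, shortened from the descriptions above), showing that two values differing only by a double space collapse into one after the replacement.

```python
import pandas as pd

# Two sample strings (assumed, shortened) that differ only by an extra space
s = pd.Series([
    'preliminary information.  It is important',  # two spaces after the period
    'preliminary information. It is important',   # one space after the period
])

# Collapse every run of whitespace into a single space
normalized = s.str.replace(r'\s+', ' ', regex=True)

print(s.nunique(), normalized.nunique())
```

The pattern does not remove a leading or trailing space; adding `.str.strip()` afterwards would handle that case if it occurs.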

In [13]:

# Replace the long boilerplate description with a short label (literal match, not a regex)
df['incident_description'] = df['incident_description'].str.replace('Buffalo Police are investigating this report of a crime. It is important to note that this is very preliminary information and further investigation as to the facts and circumstances of this report may be necessary.', 'under investigation', regex=False)

In [18]:
df=df.replace('UNKNOWN',np.nan)

In [19]:
df=df.sort_values(by='incident_datetime')

In [35]:
df['year'] = df['incident_datetime'].dt.year
df['month'] = df['incident_datetime'].dt.month
df['day'] = df['incident_datetime'].dt.day
df['weekday'] = df['incident_datetime'].dt.weekday 
df['hour'] = df['incident_datetime'].dt.hour
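
As a quick check of what these accessors return, the sketch below applies them to two made-up timestamps (the sample values are assumptions). Note that `dt.weekday` numbers the days from Monday = 0 to Sunday = 6.

```python
import pandas as pd

# Two sample timestamps (assumed): a Thursday afternoon and a Sunday night
ts = pd.Series(pd.to_datetime(['2009-01-01 14:30:00', '2009-01-04 23:05:00']))

features = pd.DataFrame({
    'year': ts.dt.year,
    'month': ts.dt.month,
    'day': ts.dt.day,
    'weekday': ts.dt.weekday,  # Monday=0 ... Sunday=6
    'hour': ts.dt.hour,
})
print(features)
```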


In [24]:
df['incident_type_primary']=df['incident_type_primary'].str.lower()
df['parent_incident_type']=df['parent_incident_type'].str.lower()
df['address_1']=df['address_1'].str.lower()

In [25]:
df['latitude']=df['latitude'].astype('float64')
df['longitude']=df['longitude'].astype('float64')

In [26]:
df.isnull().sum()

case_number                     0
incident_datetime               0
incident_type_primary           0
incident_description            0
parent_incident_type            0
hour_of_day                     0
day_of_week                     0
address_1                      33
city                            0
state                           0
location                     5989
latitude                     5989
longitude                    5989
created_at                 189318
zip_code                     3277
neighborhood                 5975
council_district             2286
council_district_2011        3331
census_tract                 5882
census_block_group           5882
census_block                 5882
census_tract_2010           19429
census_block_group_2010     19461
census_block_2010           19431
police_district              5889
tractce20                    5882
geoid20_tract                5882
geoid20_blockgroup           5882
geoid20_block                5882
dtype: int64

In [27]:
# created_at is null in 189,318 rows, far more than any other column, so drop the column
df_filtered=df.drop(columns=['created_at'])

In [28]:
# The remaining null values are few relative to the size of the dataset, so drop the affected rows
df_filtered.dropna(axis='index',inplace=True)
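
Before dropping rows, it can help to measure how many would actually be lost, since nulls in different columns may fall on different rows. The sketch below does this on a toy frame; the column values are made up for illustration, and the same two lines could be run against `df_filtered` before the `dropna` call.

```python
import numpy as np
import pandas as pd

# Toy frame (assumed values) with nulls spread across different rows
toy = pd.DataFrame({
    'case_number': ['a', 'b', 'c', 'd'],
    'latitude': [42.9, np.nan, 42.8, 42.7],
    'zip_code': ['14201', '14202', None, '14203'],
})

# Count rows that contain at least one null, which is what dropna() removes
rows_with_nulls = int(toy.isnull().any(axis=1).sum())
fraction_lost = rows_with_nulls / len(toy)

print(f'{rows_with_nulls} of {len(toy)} rows ({fraction_lost:.1%}) would be dropped')
```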

In [30]:
# Categorize incident types into broader crime categories (sexual, assault, vehicle, theft, murder)
df_filtered['incident_type_primary'] = df_filtered['incident_type_primary'].str.lower()

sexual_crimes = ['other sexual offense','sexual assault', 'rape', 'sexual abuse', 'sodomy']
assault_crimes=['agg assault on p/officer', 'aggr assault', 'assault']
vehicle_crimes=['theft of vehicles', 'uuv','theft of vehicle']
theft_crimes=['burglary', 'larceny/theft','robbery', 'theft of services','theft', 'breaking & entering']
murder_crimes=['crim negligent homicide', 'homicide', 'manslaughter', 'murder']
df_filtered['incident_type_primary'] = df_filtered['incident_type_primary'].replace(sexual_crimes, 'sexual crime')
df_filtered['incident_type_primary'] = df_filtered['incident_type_primary'].replace(assault_crimes, 'assault crime')
df_filtered['incident_type_primary'] = df_filtered['incident_type_primary'].replace(vehicle_crimes, 'vehicle crime')
df_filtered['incident_type_primary'] = df_filtered['incident_type_primary'].replace(theft_crimes,'theft crimes')
df_filtered['incident_type_primary'] = df_filtered['incident_type_primary'].replace(murder_crimes,'murder crimes')
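
The same grouping can also be expressed as a single lookup table, which keeps the category labels in one place. This is an alternative sketch, not the code used above: `crime_categories` and `lookup` are hypothetical names, only a subset of the incident types is shown, and `'arson'` is an assumed value included to show that unmapped types pass through unchanged.

```python
import pandas as pd

# Category label -> raw incident types it absorbs (subset, for illustration)
crime_categories = {
    'assault crime': ['agg assault on p/officer', 'aggr assault', 'assault'],
    'theft crimes': ['burglary', 'larceny/theft', 'robbery'],
}

# Invert to raw type -> category so one replace call handles every group
lookup = {raw: category
          for category, raw_types in crime_categories.items()
          for raw in raw_types}

types = pd.Series(['assault', 'burglary', 'arson'])  # 'arson' is not in the lookup
grouped = types.replace(lookup)

print(grouped.tolist())
```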

In [33]:
# Convert crime data to GeoDataFrame
gdf_crimes = gpd.GeoDataFrame(
    df_filtered, 
    geometry=gpd.points_from_xy(df_filtered.longitude, df_filtered.latitude),
    crs="EPSG:4326"
)