Crime incident reports are provided by the Boston Police Department (BPD) to document the initial details surrounding an incident to which BPD officers respond. This dataset contains records from the new crime incident report system, which includes a reduced set of fields focused on capturing the type of incident as well as when and where it occurred.

**Dataset Source:** 
https://www.kaggle.com/datasets/AnalyzeBoston/crimes-in-boston


In [1]:
## Import the necessary libraries
import pandas as pd 
import numpy as np
import plotly.express as px
import streamlit as st

In [None]:
## Read the dataset
df = pd.read_csv("crime.csv", encoding='ISO-8859-1')
df

## Step 1: Data Overview

Understand the structure of the dataset.

In [None]:
# Get general information about the dataset
df.info()
''' Observations:
1- INCIDENT_NUMBER is stored as object
2- the column names are uppercase
3- DISTRICT has a lot of null values
4- most values of the SHOOTING column are missing
5- some UCR_PART values are missing
6- SHOOTING is a candidate to drop (revisited below before deciding)
'''

In [None]:

df.describe()

## Step 2: Check for duplicate rows

In [None]:
df.count()-df.drop_duplicates().count()
# there are 23 duplicated rows
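
The subtraction above reports one number per column, and because `count()` skips nulls, columns that contain NaN in the duplicated rows can show a smaller number than the true count. A minimal sketch of a single-number check with `duplicated()`, using a made-up toy frame rather than the real `crime.csv`:

```python
import pandas as pd

# Toy frame standing in for the crime data (values are made up).
toy = pd.DataFrame({
    "INCIDENT_NUMBER": ["I1", "I2", "I2", "I3"],
    "DISTRICT": ["B2", "C11", "C11", None],
})

# duplicated() flags every row that repeats an earlier row exactly.
n_duplicates = int(toy.duplicated().sum())
print(n_duplicates)

# Dropping them leaves one copy of each distinct row.
deduplicated = toy.drop_duplicates(ignore_index=True)
print(len(deduplicated))
```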

In [6]:
df.drop_duplicates(ignore_index=True,inplace=True)
# drop duplicate rows and rebuild a clean 0..n-1 index

In [None]:
# check whether any row consists entirely of NaN values
df.count()-df.dropna(how='all',axis=0).count()

## Check NaN values

In [None]:
df.isna().sum()*100/len(df)
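
The percentage of missing values can also be written as the mean of the boolean mask, which reads a little more directly. A small sketch on made-up data (the column names mirror the dataset, the values do not):

```python
import numpy as np
import pandas as pd

# Toy frame with a mostly-empty SHOOTING column (values are made up).
toy = pd.DataFrame({
    "SHOOTING": [np.nan, "Y", np.nan, np.nan],
    "DISTRICT": ["B2", None, "C11", "D4"],
    "YEAR": [2016, 2017, 2017, 2018],
})

# isna() gives True/False per cell; the mean of booleans is the missing share.
missing_pct = toy.isna().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```

Sorting in descending order puts the columns that need the most attention first.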

**Step 3: Edit column names**

In [9]:
df.columns=df.columns.str.lower() 

**Drop unnecessary columns**

In [10]:
df.drop(axis=1,columns='long',inplace=True)

In [11]:
df.drop(axis=1,columns='lat',inplace=True)

In [12]:
df.drop(axis=1,columns='incident_number',inplace=True)

In [13]:
df.drop(axis=1,columns='offense_description',inplace=True)

In [None]:

df[df['shooting'] =='Y']['offense_code_group'].value_counts()
## as we can see, this column is important for visualizing whether a crime involved a shooting
## so we can't delete it

**Filling NaN values**

In [None]:
df.isna().sum()

In [16]:
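# assign the result back; an in-place fill on a selected column may not update df in recent pandas
df['shooting'] = df['shooting'].fillna(value='NO')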

In [None]:
df.head()

In [None]:
df.isna().sum()*100/len(df)

In [None]:
df.isnull().sum()

In [None]:
df['location'].value_counts()

In [None]:
df['district'].value_counts()

**Fill District column**

In [22]:
district_mode = df['district'].mode()[0]
df['district'] = df['district'].fillna(value=district_mode)

**Fill ucr_part column**

In [23]:
ucr_mode = df['ucr_part'].mode()[0]
df['ucr_part'] = df['ucr_part'].fillna(value=ucr_mode)
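
Both mode fills can be done in one call by passing a dictionary of column-to-value pairs to `DataFrame.fillna`. A minimal sketch on a made-up frame (the values are illustrative only):

```python
import pandas as pd

# Toy frame with gaps in both categorical columns (values are made up).
toy = pd.DataFrame({
    "district": ["B2", "B2", None, "C11", None],
    "ucr_part": ["Part Three", None, "Part Three", "Part One", None],
})

# mode() returns a Series because ties are possible; [0] takes the first mode.
fill_values = {col: toy[col].mode()[0] for col in ["district", "ucr_part"]}

# fillna with a dict fills each column with its own value.
toy = toy.fillna(fill_values)
print(toy)
```

Mode imputation inflates the most common category, so it is worth comparing `value_counts()` before and after the fill.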

In [None]:
df['ucr_part'].isna().sum()

In [None]:
df.isna().sum()

**Dropping rows that have no street name**

In [None]:
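df.dropna(axis=0,subset=['street'],inplace=True)
# reset_index returns a new frame unless inplace=True; drop=True discards the old index
df.reset_index(drop=True,inplace=True)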
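
After rows are dropped the index has gaps, and `reset_index` only takes effect when its result is kept. A small sketch on made-up data showing the chained, non-in-place form:

```python
import pandas as pd

# Toy frame where the middle row has no street (values are made up).
toy = pd.DataFrame({
    "street": ["WASHINGTON ST", None, "BLUE HILL AVE"],
    "hour": [1, 2, 3],
})

# Drop rows without a street, then renumber the rows from 0.
# drop=True prevents the old index from being added as a new column.
cleaned = toy.dropna(subset=["street"]).reset_index(drop=True)
print(cleaned)
```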

In [None]:
df.info()

**Feature Engineering**

Create a new feature: a fine for each crime

In [None]:
len(df['offense_code_group'].value_counts())

In [None]:

df['offense_code_group']=df['offense_code_group'].apply(lambda x: x.lower().strip())
df['offense_code_group'].unique()

**Creating a new column with a fine for each crime**
because we don't have any numerical column to analyze

In [30]:
fee_for_each_crime= {
    'larceny': 500,
    'vandalism': 300,
    'towed': 100,
    'investigate property': 200,
    'motor vehicle accident response': 400,
    'auto theft': 800,
    'verbal disputes': 150,
    'robbery': 1200,
    'fire related reports': 700,
    'other': 100,
    'property lost': 250,
    'assembly or gathering violations': 200,
    'larceny from motor vehicle': 500,
    'medical assistance': 300,
    'residential burglary': 1000,
    'simple assault': 400,
    'restraining order violations': 600,
    'violations': 300,
    'harassment': 250,
    'ballistics': 1500,
    'property found': 150,
    'police service incidents': 200,
    'disorderly conduct': 250,
    'property related damage': 400,
    'missing person reported': 400,
    'investigate person': 200,
    'fraud': 1300,
    'drug violation': 1800,
    'aggravated assault': 1200,
    'license plate related incidents': 150,
    'firearm violations': 2500,
    'other burglary': 700,
    'arson': 3000,
    'warrant arrests': 500,
    'bomb hoax': 4000,
    'harbor related incidents': 800,
    'counterfeiting': 1500,
    'liquor violation': 300,
    'firearm discovery': 2000,
    'landlord/tenant disputes': 200,
    'missing person located': 200,
    'auto theft recovery': 400,
    'service': 100,
    'operating under the influence': 2000,
    'confidence games': 1500,
    'search warrants': 600,
    'license violation': 300,
    'commercial burglary': 1200,
    'home invasion': 4000,
    'recovered stolen property': 300,
    'offenses against child / family': 3000,
    'prostitution': 1200,
    'evading fare': 100,
    'prisoner related incidents': 600,
    'homicide': 5000,
    'embezzlement': 1800,
    'explosives': 3500,
    'criminal harassment': 500,
    'phone call complaints': 150,
    'aircraft': 800,
    'biological threat': 4000,
    'manslaughter': 4000,
    'gambling': 500,
    'human trafficking': 4000,
    'human trafficking - involuntary servitude': 4500,
    'burglary - no property taken': 600
}

In [31]:
df['fine'] = df['offense_code_group'].map(fee_for_each_crime)
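
`map` returns NaN for any category that is missing from the dictionary, so it helps to list the unmapped groups right after the lookup. A minimal sketch with a two-entry subset of the table above; `jaywalking` is a hypothetical category added only to trigger the check:

```python
import pandas as pd

# Small subset of the lookup table; the full table is defined above.
fee_lookup = {"larceny": 500, "vandalism": 300}

# "jaywalking" is a hypothetical group that is absent from the table.
groups = pd.Series(["larceny", "vandalism", "Larceny ", "jaywalking"])

# Normalize the text the same way as the notebook, then look up the fine.
normalized = groups.str.lower().str.strip()
fine = normalized.map(fee_lookup)

# Any NaN here means the category has no entry in the dictionary.
unmapped = sorted(normalized[fine.isna()].unique())
print(unmapped)
```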

Triple the fine if the crime involved a shooting

In [32]:
df['fine']=df['fine']*df['shooting'].apply(lambda x: 3 if x =='Y' else 1 )
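
The same multiplier can be built without a row-by-row `apply` by using `np.where` on the whole column. A minimal sketch on made-up rows:

```python
import numpy as np
import pandas as pd

# Toy rows: one ordinary incident and two with a shooting (values are made up).
toy = pd.DataFrame({
    "fine": [500, 1200, 5000],
    "shooting": ["NO", "Y", "Y"],
})

# np.where builds the multiplier for the whole column at once: 3 where shooting is 'Y', else 1.
toy["fine"] = toy["fine"] * np.where(toy["shooting"] == "Y", 3, 1)
print(toy)
```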

In [None]:
df[df['shooting']=='Y']

In [None]:
df['fine'].describe()

**Data Visualization**
**Univariate analysis**

In [None]:
df.head(2)

**Visualize the most frequent crimes**

In [None]:
o_c =df['offense_code_group'].value_counts().reset_index().head(25)
o_c.columns = ['offense_code_group','counts']
px.histogram(o_c,x='offense_code_group',y='counts',template='plotly_dark',title='most crimes occur',range_y=[0,40000])


In [None]:
df.head(2)

**Visualize the districts where most crimes occur**

In [None]:
o_c =df['district'].value_counts().reset_index().head(25)
o_c.columns = ['district','counts']
px.histogram(o_c,x='district',y='counts',template='plotly_dark',title='Districts where most crimes occur',range_y=[0,40000])


**Most frequent reporting areas**

In [None]:
o_c

In [None]:
df['reporting_area'].value_counts()

**The most frequent reporting area is unknown (blank)**

In [41]:
# replace only values that are exactly a single blank; str.replace would also rewrite spaces inside other values
df['reporting_area']=df['reporting_area'].replace(' ','unknown')
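
Whole-value replacement and substring replacement behave differently on values that contain a space. A small sketch on made-up reporting areas; `"12 A"` is a hypothetical value used only to show the difference:

```python
import pandas as pd

# Blank reporting areas are stored as a single space; "12 A" is hypothetical.
areas = pd.Series(["808", " ", "347", " ", "12 A"])

# Series.replace swaps values that are exactly equal to " ".
element_wise = areas.replace(" ", "unknown")

# Series.str.replace swaps every space it finds, even inside other values.
substring = areas.str.replace(" ", "unknown", regex=False)

print(element_wise.tolist())
print(substring.tolist())
```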

In [None]:
o_c =df['reporting_area'].value_counts().reset_index().head(25)
o_c.columns = ['reporting_area','counts']
px.histogram(o_c,x='reporting_area',y='counts',template='plotly_dark',title='Reporting areas with the most crimes')


**Visualize crimes with shooting**

In [None]:
px.histogram(df,x='offense_code_group',template='plotly_dark',color='shooting')

In [None]:
px.histogram(df,x='shooting',template='plotly_dark',color='shooting')

**Visualize crimes by month and year**

In [None]:
df

In [46]:
df['occurred_on_date']=pd.to_datetime(df['occurred_on_date'],errors='coerce')

In [47]:
df['month_year'] = df['occurred_on_date'].dt.strftime('%Y-%m')
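
With `errors='coerce'`, unparseable dates become `NaT` and then silently vanish from the monthly counts, so it is worth counting them first. A minimal sketch on made-up timestamps:

```python
import pandas as pd

# Toy timestamps; the last entry is deliberately unparseable.
raw = pd.Series([
    "2016-06-15 13:00:00",
    "2016-06-30 23:59:00",
    "2017-01-01 00:10:00",
    "not a date",
])

# Unparseable values become NaT instead of raising an error.
dates = pd.to_datetime(raw, errors="coerce")
n_bad = int(dates.isna().sum())

# 'YYYY-MM' strings sort chronologically, so sort_index gives a time-ordered series.
month_year = dates.dt.strftime("%Y-%m")
counts = month_year.value_counts().sort_index()
print(n_bad)
print(counts)
```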

In [None]:

o_c =df.groupby('month_year')['offense_code_group'].count().sort_index().reset_index()
o_c.columns = ['month_year','counts']
px.line(o_c,x='month_year',y='counts',template='plotly_dark',width=1000,title='Number of crimes per month over 3 years')

**Visualize which months contribute the most crimes**

In [49]:
df['month_name']=df['occurred_on_date'].dt.month_name()

In [None]:
o_c =df['month_name'].value_counts().reset_index().head(25)
o_c.columns = ['month_name','counts']
px.histogram(o_c,x='month_name',y='counts',template='plotly_dark',title='Months that contribute the most crimes')

**Visualize which years contribute the most crimes**

In [None]:
df['year']=df['year'].astype(str)
px.histogram(df,x='year',template='plotly_dark',title='Years that contribute the most crimes')

**Visualize which days of the week contribute the most crimes**

In [None]:

px.histogram(df,x='day_of_week',template='plotly_dark',title='Days of the week that contribute the most crimes')

**Visualize which hours contribute the most crimes**

In [None]:
df['hour']=df['hour'].astype(str)
sorted_fig = df['hour'].value_counts().reset_index()
sorted_fig.columns = ['hour','counts']
px.histogram(sorted_fig,x='hour',y='counts',template='plotly_dark',title='Hours that contribute the most crimes',range_y=[0,20000])

**Visualize which streets are most exposed to crime**

In [None]:
o_c =df['street'].value_counts().reset_index().head(25)
o_c.columns = ['street','counts']
px.histogram(o_c,x='street',y='counts',template='plotly_dark',title='Which streets are most exposed to crime?',range_y=[0,40000])

**Bivariate Analysis**

**Visualize the streets with the most shootings**

In [None]:
# keep only incidents flagged as shootings, then count them per street
most_street = df[df['shooting']=='Y']
most_street = most_street['street'].value_counts().reset_index().head(25)
most_street.columns = ['street','count']
px.histogram(most_street,x='street',y='count',template='plotly_dark',title='Streets with the most shootings')

**Here we can see the streets where shootings occur most frequently**

In [None]:
df['offense_code_group'].value_counts()

**Visualize the streets with the most car accidents**

In [None]:
most_accidents = df[df['offense_code_group']=='motor vehicle accident response']
most_accidents = most_accidents['street'].value_counts().reset_index().head(25)
most_accidents.columns = ['street','count']
px.histogram(most_accidents,x='street',y='count',template='plotly_dark',title='Streets with the most car accidents')

**Visualize the streets with the most homicides**

In [None]:
most_Homicide = df[df['offense_code_group'] =='homicide']
most_Homicide = most_Homicide['street'].value_counts().reset_index().head(25)
most_Homicide.columns = ['street','count']
px.histogram(most_Homicide,x='street',y='count',template='plotly_dark',title='Streets with the most homicides')


In [None]:
df.head()

In [None]:
## the average fine for each crime
avg_fine = df.groupby('offense_code_group')['fine'].mean().reset_index().sort_values(by='fine',ascending=False).head(30)
avg_fine

In [None]:
px.bar(avg_fine,x='offense_code_group',y='fine',template='plotly_dark')

**Crimes with the highest average fine**

In [None]:
px.pie(avg_fine,values='fine',names='offense_code_group',color_discrete_sequence=px.colors.qualitative.Pastel,title='Crimes with the highest average fine')

**Define a function that maps each hour to a period of the day**

In [94]:
df['hour'] = pd.to_numeric(df['hour'],errors='coerce')

In [95]:
def day_time(x):
    if x in range(0,6):
        return "night"
    elif x in range (6,13):
        return 'morning'
    elif x in range (13,17):
        return "afternoon"
    else :
        return 'evening'
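
As an alternative to the function above, `pd.cut` can bin the hours in one vectorized call. This is a sketch using the same boundaries as `day_time`; one difference to be aware of is that a missing hour stays missing here, whereas the `else` branch of `day_time` labels it `'evening'`:

```python
import pandas as pd

# Boundary hours of each period, chosen to exercise every bin edge.
hours = pd.Series([0, 5, 6, 12, 13, 16, 17, 23])

# right=False makes the bins [0,6), [6,13), [13,17), [17,24), matching day_time.
period = pd.cut(
    hours,
    bins=[0, 6, 13, 17, 24],
    labels=["night", "morning", "afternoon", "evening"],
    right=False,
)
print(period.tolist())
```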


Apply the function

In [None]:
df['period'] = df['hour'].apply(day_time)
new_df = df['period'].value_counts().reset_index()
new_df.columns =['period','counts']
new_df

Make a histogram to visualize which period of the day contributes the most crimes

In [None]:
px.histogram(new_df,x='period',y='counts',template='plotly_dark')

We can notice that the evening time is the most common time for crimes to occur

**Visualize each crime with its average fine**

In [None]:
new_df =df.groupby('offense_code_group')['fine'].mean().reset_index().sort_values(ascending=False,by='fine').head(25)
px.bar(new_df,x='offense_code_group',y='fine',template='plotly_dark',title='Average fine for each crime')

Visualize the streets with the most car accidents

In [None]:
new_df = df[df['offense_code_group'] == 'motor vehicle accident response']
new_df = new_df['street'].value_counts().reset_index().head(20)
new_df.columns = ['street','counts']
px.bar(new_df,x='street',y='counts',template='plotly_dark',title='Streets with the most car accidents')

In [None]:
df['offense_code_group'].value_counts()

In [None]:

df['year'] = df['year'].astype(str)
new_df = df['year'].value_counts().reset_index()
new_df.columns = ['year','counts']
px.bar(new_df,x='year',y='counts',title='Most Crimes Occur Over the Years',template= 'plotly_dark')


In [None]:
df.head()

**Save the cleaned dataset**

In [71]:
df.to_csv('cleand_dataset.csv',index=False)

**Step 5: Encoding the categorical data**

In [None]:
df= pd.read_csv('cleand_dataset.csv')
df.head()

**Keep only the columns needed for machine learning**

In [None]:
df

In [74]:
new_df = df[['offense_code_group','shooting','fine']]

**Data preprocessing**

Inspect the selected columns

In [None]:
new_df.info()

Detect outliers

In [None]:
from datasist.structdata import detect_outliers
outliers = detect_outliers(df, 0,features=['fine'])

print("Outliers detected at indices:", outliers)
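
To make the idea of an outlier concrete, here is a minimal sketch of the common interquartile-range (IQR) rule on made-up fines. It is only an illustration of one widely used rule and is not meant to reproduce the `datasist` helper:

```python
import pandas as pd

# Toy fines with one extreme value (values are made up).
fines = pd.Series([100, 150, 200, 250, 300, 400, 500, 15000], name="fine")

# Quartiles and the interquartile range.
q1 = fines.quantile(0.25)
q3 = fines.quantile(0.75)
iqr = q3 - q1

# Tukey's fences: anything more than 1.5 * IQR beyond a quartile is flagged.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outlier_idx = fines[(fines < lower) | (fines > upper)].index.tolist()
print("Outliers detected at indices:", outlier_idx)
```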

Now we have two options: either drop the outliers or use a robust scaler.
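
A robust scaler centers on the median and divides by the interquartile range, so a few extreme values do not dominate the scale. A minimal hand-written sketch on made-up fines (scikit-learn's `RobustScaler` applies the same formula with its default settings):

```python
import pandas as pd

# Toy fines with one extreme value (values are made up).
fines = pd.Series([100, 150, 200, 250, 300, 400, 500, 15000], name="fine")

# Median and IQR are barely affected by the single extreme value.
median = fines.median()
iqr = fines.quantile(0.75) - fines.quantile(0.25)

# Robust scaling: (x - median) / IQR.
scaled = (fines - median) / iqr
print(scaled.round(3).tolist())
```

After scaling, the typical fines sit close to zero while the extreme value stays clearly visible instead of compressing everything else.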

**Split the Data**

In [None]:
new_df.isna().sum()

In [78]:
from sklearn.model_selection import train_test_split
x = new_df.drop(axis=1,columns='fine')
y = new_df['fine']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1,shuffle=True, stratify=y)
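
`train_test_split` with `stratify=y` raises a `ValueError` when any value of `y` occurs only once, so a quick count check before splitting can save a confusing error. A small sketch on a made-up target:

```python
import pandas as pd

# Toy target: the fine 15000 occurs only once (values are made up).
y = pd.Series([500, 500, 300, 300, 300, 15000], name="fine")

# Stratification needs at least two rows per distinct target value.
class_counts = y.value_counts()
too_rare = class_counts[class_counts < 2].index.tolist()

# Missing targets are worth checking at the same point.
can_stratify = (len(too_rare) == 0) and (not y.isna().any())
print(too_rare, can_stratify)
```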

Convert the shooting column to 0 and 1

In [None]:
x_train['shooting'].value_counts()

In [80]:
x_train['shooting']=x_train['shooting'].apply(lambda x: 1 if x =='Y'  else  0)
x_test['shooting']=x_test['shooting'].apply(lambda x: 1 if x =='Y'  else  0)

In [None]:
x_train['shooting'].value_counts()

Use binary encoding

In [82]:
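import category_encoders as ce
encoder = ce.BinaryEncoder(cols=['offense_code_group'])
# fit on the training split; reset the index so it lines up with x_train below
encoded = encoder.fit_transform(x_train[['offense_code_group']]).reset_index(drop=True)

# both frames now share a 0..n-1 index, so concat pairs rows by position
x_train = x_train.drop('offense_code_group', axis=1).reset_index(drop=True)
x_train = pd.concat([x_train, encoded], axis=1)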


In [83]:
# reuse the encoder fitted on x_train so both splits share the same encoding
encoded = encoder.transform(x_test[['offense_code_group']]).reset_index(drop=True)

x_test = x_test.drop('offense_code_group', axis=1).reset_index(drop=True)
x_test = pd.concat([x_test, encoded], axis=1)
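
To show what binary encoding produces, and why the mapping should be learned from the training split only, here is a hand-written sketch in plain pandas. It illustrates the idea and is not the `category_encoders` implementation; the helper `to_bits` and the toy categories are assumptions made for this example:

```python
import pandas as pd

# Toy splits; "arson" never appears in the training data.
train = pd.Series(["larceny", "vandalism", "robbery", "larceny"])
test = pd.Series(["robbery", "arson"])

# Learn one ordinal code per category from the training split only (1, 2, 3, ...).
codes = {cat: i for i, cat in enumerate(pd.unique(train), start=1)}
n_bits = max(codes.values()).bit_length()

def to_bits(values: pd.Series) -> pd.DataFrame:
    """Hypothetical helper: write each ordinal code as binary digits."""
    # Categories unseen during fitting fall back to code 0 (all bits zero).
    ordinal = values.map(codes).fillna(0).astype(int)
    cols = {f"bit_{b}": (ordinal // 2**b) % 2 for b in reversed(range(n_bits))}
    return pd.DataFrame(cols)

train_bits = to_bits(train)
test_bits = to_bits(test)
print(train_bits)
print(test_bits)
```

Because `codes` comes from the training split, the same category always gets the same bit pattern in both splits, and the two frames end up with identical columns.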

In [None]:
x_train.head()

thanks