<a href="https://colab.research.google.com/github/omo776/safeline/blob/main/Chicago_crime.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Analyzing Chicago Crime From 2023 to the Present (2024) Using an Unsupervised Machine Learning Model

In [3]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt


In [27]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [28]:
try:
    data = pd.read_csv('/content/drive/MyDrive/Crimes.csv', low_memory=False, on_bad_lines='warn')
    data.info()
except Exception as e:
    print(f"Error reading CSV file: {e}")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5581 entries, 0 to 5580
Data columns (total 17 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   CASE#                   5581 non-null   object 
 1   DATE  OF OCCURRENCE     5581 non-null   object 
 2   BLOCK                   5581 non-null   object 
 3    IUCR                   5581 non-null   object 
 4    PRIMARY DESCRIPTION    5581 non-null   object 
 5    SECONDARY DESCRIPTION  5581 non-null   object 
 6    LOCATION DESCRIPTION   5560 non-null   object 
 7   ARREST                  5580 non-null   object 
 8   DOMESTIC                5580 non-null   object 
 9   BEAT                    5580 non-null   float64
 10  WARD                    5580 non-null   float64
 11  FBI CD                  5580 non-null   object 
 12  X COORDINATE            5573 non-null   float64
 13  Y COORDINATE            5573 non-null   float64
 14  LATITUDE                5573 non-null   

**Data Cleaning** starts here

In [55]:
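# Cleaning the dataset: drop the columns that are not needed for the analysis.
# NOTE: the dropna() result below is overwritten two lines later, because the
# column drop starts again from `data`. Rows with missing values are therefore
# still present in data_mod.

data_mod = data.dropna()
# Drop columns 0 (CASE#) and 3 (IUCR), plus columns 9 through 16 (BEAT onward)
columns_to_drop = data.columns[[0, 3]].tolist() + data.columns[9:17].tolist()
data_mod = data.drop(columns=columns_to_drop)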
data_mod.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5581 entries, 0 to 5580
Data columns (total 7 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   DATE  OF OCCURRENCE     5581 non-null   object
 1   BLOCK                   5581 non-null   object
 2    PRIMARY DESCRIPTION    5581 non-null   object
 3    SECONDARY DESCRIPTION  5581 non-null   object
 4    LOCATION DESCRIPTION   5560 non-null   object
 5   ARREST                  5580 non-null   object
 6   DOMESTIC                5580 non-null   object
dtypes: object(7)
memory usage: 305.3+ KB
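
The info() output above still lists 5581 rows, with 21 missing values in LOCATION DESCRIPTION. If the intent is to remove both the unwanted columns and the incomplete rows, one option is to chain the steps so that each works on the result of the previous one. The sketch below uses a small made-up frame (`toy` and its values are hypothetical, not taken from Crimes.csv) rather than the real data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the crime data (values are made up for illustration).
toy = pd.DataFrame({
    "CASE#": ["A1", "A2", "A3"],
    "BLOCK": ["001XX N STATE ST", "002XX S CLARK ST", "003XX W LAKE ST"],
    " LOCATION DESCRIPTION": ["STREET", np.nan, "APARTMENT"],
    "WARD": [1.0, np.nan, 3.0],
})

# Drop unwanted columns first, then drop rows that still contain nulls.
# Dropping columns first avoids discarding rows whose only missing values
# were in columns that are removed anyway.
cleaned = (
    toy.drop(columns=["CASE#", "WARD"])
       .dropna()
       .reset_index(drop=True)
)

print(cleaned.shape)
```

Whether incomplete rows should be dropped at all depends on the analysis; keeping them and handling the missing LOCATION DESCRIPTION values later is also a reasonable choice.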


In [56]:
# Rename the columns to remove leading spaces and collapse the double space in 'DATE  OF OCCURRENCE'
data_mod = data_mod.rename(columns={
    'DATE  OF OCCURRENCE': 'DATE OF OCCURRENCE',
    'BLOCK': 'BLOCK',
    ' PRIMARY DESCRIPTION': 'PRIMARY DESCRIPTION',
    ' SECONDARY DESCRIPTION': 'SECONDARY DESCRIPTION',
    ' LOCATION DESCRIPTION': 'LOCATION DESCRIPTION',
    'ARREST': 'ARREST',
    'DOMESTIC': 'DOMESTIC',
})
print(data_mod.columns)

Index(['DATE OF OCCURRENCE', 'BLOCK', 'PRIMARY DESCRIPTION',
       'SECONDARY DESCRIPTION', 'LOCATION DESCRIPTION', 'ARREST', 'DOMESTIC'],
      dtype='object')
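
As an alternative to listing every column by hand, the same normalization can be sketched with the string methods on the column index. This is a minimal example on an empty placeholder frame (`toy` is hypothetical); it strips surrounding whitespace and collapses repeated internal spaces.

```python
import pandas as pd

# Placeholder frame with the kinds of header problems seen in the raw CSV.
toy = pd.DataFrame(
    columns=["DATE  OF OCCURRENCE", "BLOCK", " PRIMARY DESCRIPTION", "ARREST"]
)

# Strip leading/trailing whitespace, then collapse runs of internal whitespace.
toy.columns = toy.columns.str.strip().str.replace(r"\s+", " ", regex=True)

print(list(toy.columns))
```

This approach keeps working if more columns with stray spaces are retained later, since no name is hard-coded.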


In [57]:
# Convert the 'DATE OF OCCURRENCE' column to datetime format
data_mod['DATE OF OCCURRENCE'] = pd.to_datetime(data_mod['DATE OF OCCURRENCE'], errors='coerce')



In [67]:
# Drop any rows where the date conversion failed
data_mod = data_mod.dropna(subset=['DATE OF OCCURRENCE'])
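
With errors='coerce', values that cannot be parsed become NaT silently, so it can help to count them before dropping. The sketch below also passes an explicit format, which makes parsing stricter. The format string and sample values are assumptions for illustration only; the actual layout of DATE OF OCCURRENCE in Crimes.csv should be checked first.

```python
import pandas as pd

# Made-up sample values; the last one is deliberately invalid.
raw = pd.Series(["01/15/2024 02:30:00 PM", "12/31/2023 11:05:00 AM", "not a date"])

# ASSUMPTION: this format string is illustrative; verify it against the real CSV.
parsed = pd.to_datetime(raw, format="%m/%d/%Y %I:%M:%S %p", errors="coerce")

# Count how many values failed to parse before discarding them.
n_failed = int(parsed.isna().sum())
print(f"{n_failed} value(s) could not be parsed")

# Keep only the rows with a valid timestamp.
parsed = parsed.dropna()
```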

In [68]:
# Create additional time-based columns
data_mod['Year'] = data_mod['DATE OF OCCURRENCE'].dt.year
data_mod['Month'] = data_mod['DATE OF OCCURRENCE'].dt.month
data_mod['Day'] = data_mod['DATE OF OCCURRENCE'].dt.day
data_mod['Weekday'] = data_mod['DATE OF OCCURRENCE'].dt.weekday
data_mod['Hour'] = data_mod['DATE OF OCCURRENCE'].dt.hour
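
When reading the Weekday column, note that pandas numbers weekdays from 0 (Monday) to 6 (Sunday). A small sketch with two made-up timestamps shows the numeric value next to the day name; the `WeekdayName` column is an illustrative addition, not part of the notebook.

```python
import pandas as pd

# Two made-up timestamps: a Monday afternoon and a Sunday night.
ts = pd.Series(pd.to_datetime(["2024-01-15 14:30:00", "2023-12-31 23:05:00"]))

features = pd.DataFrame({
    "Weekday": ts.dt.weekday,         # 0 = Monday ... 6 = Sunday
    "WeekdayName": ts.dt.day_name(),  # human-readable label for the same day
    "Hour": ts.dt.hour,               # 0 to 23
})

print(features)
```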

In [69]:
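# Convert 'ARREST' and 'DOMESTIC' columns to boolean type for consistency
# NOTE: on an object column, astype(bool) maps every non-empty string and NaN to True,
# so check the raw values (for example with value_counts()) before relying on these flags.
data_mod['ARREST'] = data_mod['ARREST'].astype(bool)
data_mod['DOMESTIC'] = data_mod['DOMESTIC'].astype(bool)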
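
Because ARREST and DOMESTIC are loaded as object columns, astype(bool) evaluates the truthiness of each value, and any non-empty string (as well as NaN) counts as True. The sketch below contrasts that with an explicit mapping. It assumes the flags are stored as the strings 'Y' and 'N', which is a guess about the CSV and should be confirmed with value_counts() on the real data.

```python
import numpy as np
import pandas as pd

# ASSUMPTION: flags are stored as 'Y'/'N' strings; these sample values are made up.
flags = pd.Series(["Y", "N", np.nan, "N"], dtype=object)

# Naive conversion: every non-empty string and NaN is truthy, so all become True.
naive = flags.astype(bool)

# Explicit mapping keeps the distinction; missing or unmapped values become <NA>.
mapped = flags.map({"Y": True, "N": False}).astype("boolean")

print(naive.tolist())
print(mapped)
```

The nullable "boolean" dtype is used here so that missing flags stay visibly missing instead of being counted as arrests or domestic incidents.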

In [70]:
data_mod.head()
data_mod.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5581 entries, 0 to 5580
Data columns (total 12 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   DATE OF OCCURRENCE     5581 non-null   datetime64[ns]
 1   BLOCK                  5581 non-null   object        
 2   PRIMARY DESCRIPTION    5581 non-null   object        
 3   SECONDARY DESCRIPTION  5581 non-null   object        
 4   LOCATION DESCRIPTION   5560 non-null   object        
 5   ARREST                 5581 non-null   bool          
 6   DOMESTIC               5581 non-null   bool          
 7   Year                   5581 non-null   int32         
 8   Month                  5581 non-null   int32         
 9   Day                    5581 non-null   int32         
 10  Weekday                5581 non-null   int32         
 11  Hour                   5581 non-null   int32         
dtypes: bool(2), datetime64[ns](1), int32(5), object(4)
memory usag

Goals achieved with Data Cleaning Process

    Loading the Dataset: The dataset was loaded into a pandas DataFrame from a CSV file. This dataset contains information about various crime incidents in Chicago.

    Removing Null Rows and Unwanted Columns:
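        Null Rows: dropna() was called on the original DataFrame, but its result was overwritten by the column-dropping step, so rows with missing values were not actually removed (LOCATION DESCRIPTION still has 5560 non-null values out of 5581).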
        Unwanted Columns: The following columns were dropped as they were deemed unnecessary for the analysis:
            CASE#
            IUCR
            BEAT
            WARD
            FBI CD
            X COORDINATE
            Y COORDINATE
            LATITUDE
            LONGITUDE
            LOCATION

    Renaming Columns:
        The columns were renamed for consistency and ease of use: leading spaces were removed and the double space in DATE  OF OCCURRENCE was collapsed. The new column names are:
            DATE OF OCCURRENCE
            BLOCK
            PRIMARY DESCRIPTION
            SECONDARY DESCRIPTION
            LOCATION DESCRIPTION
            ARREST
            DOMESTIC
            

    Converting Date Column to Datetime Format:
        The DATE OF OCCURRENCE column was converted to a datetime format using pd.to_datetime(). This ensures that the date information can be used effectively for time-based analysis.

Resulting DataFrame Information

The resulting DataFrame after cleaning has the following structure:

    Number of Rows: 5581
    Number of Columns: 12

The final columns in the DataFrame are:

    DATE OF OCCURRENCE
    BLOCK
    PRIMARY DESCRIPTION
    SECONDARY DESCRIPTION
    LOCATION DESCRIPTION
    ARREST
    DOMESTIC
    Year
    Month
    Day
    Weekday
    Hour

This cleaned dataset is now ready for further analysis, including time-based trends, location-based patterns, and predictive modeling.

