## **Chicago Crime Analysis**

### **Data Ingestion**

In [1]:
# Importing necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


# Loading dataset
data = r"Crimes_-_2001_to_Present.csv"
crime_data = pd.read_csv(data)
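
Because the file holds several million rows, it can help to declare dtypes at read time instead of fixing them after loading. The following is a minimal sketch and not the approach used in this notebook: the inline `sample_csv` text and the `dtypes` mapping are illustrative assumptions, and only a few of the real column names are used.

```python
import io

import pandas as pd

# Tiny stand-in for the real CSV (illustrative rows, not the full dataset).
sample_csv = io.StringIO(
    "ID,Primary Type,Arrest,Year\n"
    "10224738,BATTERY,False,2015\n"
    "10224739,THEFT,False,2015\n"
    "10224740,NARCOTICS,True,2015\n"
)

# Assumed dtype mapping: compact integers and a categorical for repeated labels.
dtypes = {
    "ID": "int32",
    "Primary Type": "category",
    "Arrest": "bool",
    "Year": "int16",
}

# usecols restricts parsing to the columns named in the mapping.
sample = pd.read_csv(sample_csv, dtype=dtypes, usecols=list(dtypes))
print(sample.dtypes)
```

The same `dtype` argument could be passed when reading the real file. Integer dtypes in such a mapping assume the column has no missing values, which would need checking first.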

### **Preliminary Data Analysis (PDA)**

In [2]:
# Checking the overview of the data
print("\n===== The first five rows =====")
print(crime_data.head())

print("\n===== The last five rows =====")
print(crime_data.tail())


===== The first five rows =====
         ID Case Number                    Date                  Block  IUCR  \
0  10224738    HY411648  09/05/2015 01:30:00 PM        043XX S WOOD ST  0486   
1  10224739    HY411615  09/04/2015 11:30:00 AM    008XX N CENTRAL AVE  0870   
2  11646166    JC213529  09/01/2018 12:01:00 AM  082XX S INGLESIDE AVE  0810   
3  10224740    HY411595  09/05/2015 12:45:00 PM      035XX W BARRY AVE  2023   
4  10224741    HY411610  09/05/2015 01:00:00 PM    0000X N LARAMIE AVE  0560   

  Primary Type              Description Location Description  Arrest  \
0      BATTERY  DOMESTIC BATTERY SIMPLE            RESIDENCE   False   
1        THEFT           POCKET-PICKING              CTA BUS   False   
2        THEFT                OVER $500            RESIDENCE   False   
3    NARCOTICS    POSS: HEROIN(BRN/TAN)             SIDEWALK    True   
4      ASSAULT                   SIMPLE            APARTMENT   False   

   Domestic  ...  Ward  Community Area  FBI Code  X C

In [3]:
crime_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7784664 entries, 0 to 7784663
Data columns (total 22 columns):
 #   Column                Dtype  
---  ------                -----  
 0   ID                    int64  
 1   Case Number           object 
 2   Date                  object 
 3   Block                 object 
 4   IUCR                  object 
 5   Primary Type          object 
 6   Description           object 
 7   Location Description  object 
 8   Arrest                bool   
 9   Domestic              bool   
 10  Beat                  int64  
 11  District              float64
 12  Ward                  float64
 13  Community Area        float64
 14  FBI Code              object 
 15  X Coordinate          float64
 16  Y Coordinate          float64
 17  Year                  int64  
 18  Updated On            object 
 19  Latitude              float64
 20  Longitude             float64
 21  Location              object 
dtypes: bool(2), float64(7), int64(3), object(1

The output above shows that several column names contain spaces. It is good practice to rename them to lowercase `snake_case` so they are easier to reference in code.

In [4]:
# Renaming all the columns
crime_data.rename(columns=lambda x: x.strip().lower().replace(" ", "_"), inplace=True)
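
To make the effect of the renaming rule concrete, here is a small sketch on a toy frame. The helper name `to_snake_case` and the toy headers are hypothetical; the transformation itself is the same strip, lowercase, and replace sequence used in the cell above.

```python
import pandas as pd

# Toy frame with hypothetical headers that mimic the raw column names.
toy = pd.DataFrame(columns=["Case Number", " Primary Type ", "FBI Code"])


def to_snake_case(name: str) -> str:
    """Trim whitespace, lowercase, and replace spaces with underscores."""
    return name.strip().lower().replace(" ", "_")


# rename returns a new frame here instead of modifying toy in place.
renamed = toy.rename(columns=to_snake_case)
print(list(renamed.columns))
```

Using a named function instead of a lambda makes the rule easy to reuse and test on its own.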

In [5]:
crime_data.info(memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7784664 entries, 0 to 7784663
Data columns (total 22 columns):
 #   Column                Dtype  
---  ------                -----  
 0   id                    int64  
 1   case_number           object 
 2   date                  object 
 3   block                 object 
 4   iucr                  object 
 5   primary_type          object 
 6   description           object 
 7   location_description  object 
 8   arrest                bool   
 9   domestic              bool   
 10  beat                  int64  
 11  district              float64
 12  ward                  float64
 13  community_area        float64
 14  fbi_code              object 
 15  x_coordinate          float64
 16  y_coordinate          float64
 17  year                  int64  
 18  updated_on            object 
 19  latitude              float64
 20  longitude             float64
 21  location              object 
dtypes: bool(2), float64(7), int64(3), object(1

The `crime_data.info()` output above reports a memory usage of `5.2+ GB`, which is high for a single DataFrame. I would like to reduce it.
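
Before optimizing, it can be useful to see which columns account for most of the memory. The sketch below uses a toy frame as a stand-in for `crime_data`; the variable names `toy` and `per_column_bytes` are illustrative assumptions.

```python
import pandas as pd

# Toy frame standing in for crime_data (only a few illustrative rows).
toy = pd.DataFrame(
    {
        "id": [10224738, 10224739, 11646166],
        "block": ["043XX S WOOD ST", "008XX N CENTRAL AVE", "082XX S INGLESIDE AVE"],
        "arrest": [False, False, False],
    }
)

# deep=True measures the string contents, not just the references to them.
per_column_bytes = toy.memory_usage(deep=True, index=False).sort_values(ascending=False)
print(per_column_bytes)
```

On the toy frame the string column comes out on top, which hints at where the bulk of the memory in a mostly-text dataset is likely to sit.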

In [6]:
# Function to downcast numeric dtypes to the smallest size that fits the data.
# It checks each column and downcasts those of type `int` or `float`.

def downcast_columns(df: pd.DataFrame, verbose: bool = True) -> pd.DataFrame:
    """
    Downcast numeric columns in a DataFrame to smaller dtypes 
    (integers -> smallest int, floats -> smallest float).
    Reports memory usage savings if verbose=True.
    
    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame.
    verbose : bool, optional
        If True, prints memory usage before/after, by default True.
    
    Returns
    -------
    pd.DataFrame
        DataFrame with numeric columns downcasted.
    """
    start_mem = df.memory_usage(deep=True).sum() / 1024**2  # Convert bytes to MB
    
    df_optimized = df.copy()

    # "number" selects integer and float columns regardless of the platform's default int size.
    for col in df_optimized.select_dtypes(include="number").columns:
        col_type = df_optimized[col].dtype

        if pd.api.types.is_integer_dtype(col_type):
            df_optimized[col] = pd.to_numeric(df_optimized[col], downcast="integer")
        elif pd.api.types.is_float_dtype(col_type):
            df_optimized[col] = pd.to_numeric(df_optimized[col], downcast="float")

    end_mem = df_optimized.memory_usage(deep=True).sum() / 1024**2  # MB
    
    if verbose:
        print(f"Memory usage before: {start_mem:.2f} MB")
        print(f"Memory usage after : {end_mem:.2f} MB")
        print(f"Reduced by        : {100 * (start_mem - end_mem) / start_mem:.1f}%")

    return df_optimized


In [7]:
crime_data = downcast_columns(crime_data)

Memory usage before: 5302.92 MB
Memory usage after : 4976.27 MB
Reduced by        : 6.2%
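
The saving is modest, plausibly because downcasting only touches numeric columns while the `object` (string) columns remain unchanged. One option, sketched here as an assumption and not something this notebook does, is to convert string columns with few distinct values to the `category` dtype. The helper `to_category` and its `max_unique_ratio` threshold are hypothetical.

```python
import pandas as pd


def to_category(df: pd.DataFrame, max_unique_ratio: float = 0.5) -> pd.DataFrame:
    """Convert object columns with few distinct values to the category dtype."""
    out = df.copy()
    for col in out.columns:
        # Assumes strings are stored with the object dtype.
        if not pd.api.types.is_object_dtype(out[col]):
            continue
        # Only convert when values repeat often enough to pay off.
        if out[col].nunique(dropna=False) / max(len(out), 1) <= max_unique_ratio:
            out[col] = out[col].astype("category")
    return out


# Toy data: one highly repetitive column and one with all-unique values.
toy = pd.DataFrame(
    {
        "primary_type": ["THEFT", "BATTERY", "THEFT", "THEFT"] * 250,
        "case_number": [f"HY{i:06d}" for i in range(1000)],
    },
    dtype=object,
)
converted = to_category(toy)

before = toy.memory_usage(deep=True).sum()
after = converted.memory_usage(deep=True).sum()
print(converted.dtypes)
print(f"bytes before: {before}, after: {after}")
```

Columns where nearly every value is unique, such as an identifier, are left alone because a categorical would add overhead without saving space.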


In [8]:
crime_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7784664 entries, 0 to 7784663
Data columns (total 22 columns):
 #   Column                Dtype  
---  ------                -----  
 0   id                    int32  
 1   case_number           object 
 2   date                  object 
 3   block                 object 
 4   iucr                  object 
 5   primary_type          object 
 6   description           object 
 7   location_description  object 
 8   arrest                bool   
 9   domestic              bool   
 10  beat                  int16  
 11  district              float32
 12  ward                  float32
 13  community_area        float32
 14  fbi_code              object 
 15  x_coordinate          float32
 16  y_coordinate          float32
 17  year                  int16  
 18  updated_on            object 
 19  latitude              float32
 20  longitude             float32
 21  location              object 
dtypes: bool(2), float32(7), int16(2), int32(1)