# Crime Data Analysis Project

Welcome! This notebook walks you through a full crime data analysis pipeline, from raw data processing to model development and dashboard creation.

Each section starts with a brief explanation, followed by code and results. Comments in code cells clarify each step.

## 1. Data Processing: Convert Raw CSV to Parquet

The raw CSV file is large (about one million rows and 28 columns), so we read it in chunks of 100,000 rows, assign compact data types, and save the result as Parquet for faster loading later.
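
To make the effect of compact data types concrete, here is a minimal sketch on synthetic data. The column names mirror three entries of the `dtype_mapping` used in this notebook, but the values are randomly generated and the row count is an arbitrary assumption, so the savings on the real file will differ.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a slice of the crime data (values are made up)
rng = np.random.default_rng(0)
n = 100_000
raw = pd.DataFrame({
    "AREA": rng.integers(1, 22, size=n),              # defaults to a 64-bit integer
    "Vict Sex": rng.choice(["M", "F", "X"], size=n),  # defaults to a string/object column
    "LAT": rng.uniform(33.7, 34.3, size=n),           # defaults to float64
})

# Apply the same kind of compact types as the notebook's dtype_mapping
optimized = raw.astype({"AREA": "int16", "Vict Sex": "category", "LAT": "float32"})

raw_bytes = raw.memory_usage(deep=True).sum()
opt_bytes = optimized.memory_usage(deep=True).sum()
print(f"default dtypes: {raw_bytes / 1e6:.2f} MB")
print(f"compact dtypes: {opt_bytes / 1e6:.2f} MB")
```

Low-cardinality text columns such as `Vict Sex` usually benefit most, because `category` stores each distinct value once and keeps only small integer codes per row.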

In [1]:
# Load raw CSV in chunks, optimize types, and save as Parquet
import pandas as pd
from pathlib import Path
import logging
import numpy as np
import os
import seaborn as sns
import matplotlib.pyplot as plt
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

def process_crime_data(input_path, output_path):
    # Define column types for memory efficiency
    dtype_mapping = {
        'DR_NO': 'int64',
        'AREA': 'int16',
        'Rpt Dist No': 'int16',
        'Part 1-2': 'int8',
        'Crm Cd': 'int16',
        'Vict Age': 'int8',
        'Vict Sex': 'category',
        'Vict Descent': 'category',
        'Premis Cd': 'float32',
        'Weapon Used Cd': 'float32',
        'Status': 'category',
        'Crm Cd 1': 'float32',
        'Crm Cd 2': 'float32',
        'Crm Cd 3': 'float32',
        'Crm Cd 4': 'float32',
        'LAT': 'float32',
        'LON': 'float32'
    }
    date_columns = ['Date Rptd', 'DATE OCC']
    try:
        chunk_iterator = pd.read_csv(
            input_path,
            chunksize=100000,
            dtype=dtype_mapping,
            parse_dates=date_columns,
            na_values=['', 'NA', 'N/A']
        )
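        # Collect the chunks, then combine them. The concatenated frame still has
        # to fit in memory; chunking mainly bounds memory use while parsing.
        processed_chunks = []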
        for i, chunk in enumerate(chunk_iterator):
            logging.info(f'Processing chunk {i+1}')
            processed_chunks.append(chunk)
        df = pd.concat(processed_chunks, ignore_index=True)
        df.to_parquet(output_path, index=False)
        logging.info('Data processing complete.')
    except Exception as e:
        logging.error(f'Error during data processing: {e}')

if __name__ == '__main__':
    input_file = Path('/home/ayyan/ANALYSIS_PROJECT/Crime_Data_from_2020_to_Present.csv')
    output_file = Path('/home/ayyan/ANALYSIS_PROJECT/processed_crime_data.parquet')
    process_crime_data(input_file, output_file)

2025-08-06 19:53:43,474 - INFO - Processing chunk 1
2025-08-06 19:53:44,174 - INFO - Processing chunk 2
2025-08-06 19:53:44,858 - INFO - Processing chunk 3
2025-08-06 19:53:45,530 - INFO - Processing chunk 4
2025-08-06 19:53:46,240 - INFO - Processing chunk 5
2025-08-06 19:53:46,866 - INFO - Processing chunk 6
2025-08-06 19:53:47,538 - INFO - Processing chunk 7

### Data Processing Complete

The raw CSV is now converted to Parquet. Next, let's understand the processed data.

## 2. Data Understanding

Let's load the processed data and check its structure, missing values, and summary statistics.

In [2]:
# Load processed Parquet and display info, missing values, and stats
def data_understanding(file_path):
    df = pd.read_parquet(file_path)
    print('DataFrame Info:')
    df.info()
    print('\nMissing Value Counts:')
    print(df.isnull().sum())
    print('\nSummary Statistics:')
    print(df.describe())
    print('\nTarget Variable:')
    print("We'll use 'Crm Cd Desc' (Crime Code Description) as the target for prediction.")
    return df

df_processed = data_understanding('/home/ayyan/ANALYSIS_PROJECT/processed_crime_data.parquet')

DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1004991 entries, 0 to 1004990
Data columns (total 28 columns):
 #   Column          Non-Null Count    Dtype         
---  ------          --------------    -----         
 0   DR_NO           1004991 non-null  int64         
 1   Date Rptd       1004991 non-null  datetime64[ns]
 2   DATE OCC        1004991 non-null  datetime64[ns]
 3   TIME OCC        1004991 non-null  int64         
 4   AREA            1004991 non-null  int16         
 5   AREA NAME       1004991 non-null  object        
 6   Rpt Dist No     1004991 non-null  int16         
 7   Part 1-2        1004991 non-null  int8          
 8   Crm Cd          1004991 non-null  int16         
 9   Crm Cd Desc     1004991 non-null  object        
 10  Mocodes         853372 non-null   object        
 11  Vict Age        1004991 non-null  int8          
 12  Vict Sex        860347 non-null   object        
 13  Vict Descent    860335 non-null   object        
 14  Pr

### Data Understanding Summary

We now know the data types, missing values, and basic statistics. Next, we clean and engineer features for modeling.
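
Raw missing-value counts are easier to judge as a share of all rows. In the `df.info()` output above, for example, `Mocodes` has 853,372 non-null values out of 1,004,991 rows, so about 15% of its entries are missing. The sketch below computes that kind of table on a tiny made-up frame; the helper name `missing_report` is an illustrative assumption and not part of the notebook.

```python
import pandas as pd

def missing_report(df):
    """Return missing counts and percentages for columns that have gaps."""
    counts = df.isnull().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    # Keep only columns with at least one missing value, worst first
    return report[report["missing"] > 0].sort_values("percent", ascending=False)

# Tiny made-up frame that reuses a few of the dataset's column names
demo = pd.DataFrame({
    "Mocodes": ["0100", None, None, "0344"],
    "Vict Sex": ["M", "F", None, "X"],
    "AREA": [1, 2, 3, 4],
})
report = missing_report(demo)
print(report)
```

Calling `missing_report(df_processed)` would apply the same summary to the real data.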

## 3. Data Preprocessing

We fill missing values, drop sparsely populated columns, engineer time features, replace implausible victim ages with the median, and drop records with zero coordinates to prepare the data for analysis and modeling.

In [3]:
# Clean data, engineer features, and save preprocessed Parquet
def preprocess_data(df, output_path):
    # Fill missing categorical values with 'Unknown'
    categorical_cols = ['Vict Sex', 'Vict Descent', 'Weapon Desc', 'Premis Desc', 'Status']
    for col in categorical_cols:
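        # Convert any non-categorical column (object or string dtype) before using .cat
        if not isinstance(df[col].dtype, pd.CategoricalDtype):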
            df[col] = df[col].astype('category')
        if "Unknown" not in df[col].cat.categories:
            df[col] = df[col].cat.add_categories("Unknown")
        df[col] = df[col].fillna("Unknown")
    # Fill missing numerical values with -1. Assign the result back instead of
    # calling fillna(inplace=True) on a column selection, which pandas warns
    # will stop working in 3.0.
    for col in ['Premis Cd', 'Crm Cd 1', 'Weapon Used Cd']:
        df[col] = df[col].fillna(-1)
    # Drop columns with many missing values
    df.drop(columns=['Mocodes', 'Crm Cd 2', 'Crm Cd 3', 'Crm Cd 4', 'Cross Street'], inplace=True)
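    # Feature engineering: extract hour, day, month
    # Assumption: TIME OCC holds 24-hour HHMM integers (e.g. 2130) and DATE OCC
    # carries only the date, so the hour is taken from TIME OCC.
    df['hour_of_day'] = df['TIME OCC'] // 100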
    df['day_of_week'] = df['DATE OCC'].dt.day_name().astype('category')
    df['month'] = df['DATE OCC'].dt.month_name().astype('category')
    # Outlier handling: replace implausible victim ages (<= 0 or > 100) with the median,
    # then drop rows whose latitude or longitude is 0
    df['Vict Age'] = np.where((df['Vict Age'] > 100) | (df['Vict Age'] <= 0), df['Vict Age'].median(), df['Vict Age'])
    df = df[(df['LAT'] != 0) & (df['LON'] != 0)].copy()
    df.to_parquet(output_path, index=False)
    print('Preprocessing complete.')
    return df

df_preprocessed = preprocess_data(df_processed.copy(), '/home/ayyan/ANALYSIS_PROJECT/preprocessed_crime_data.parquet')

Preprocessing complete.


### Preprocessing Complete

The data is now clean and ready for analysis. Let's visualize key patterns next.
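
One detail of the outlier step is worth a closer look: `df['Vict Age'].median()` is computed over every row, so the out-of-range ages (zero, negative, or above 100) influence the replacement value. A hedged alternative, shown here on made-up ages, takes the median of the valid ages only. The helper name `clean_victim_age` and its `low`/`high` parameters are illustrative and not part of the notebook.

```python
import pandas as pd

def clean_victim_age(ages, low=1, high=100):
    """Replace ages outside [low, high] with the median of the valid ages."""
    valid = ages.between(low, high)
    median_valid = ages[valid].median()
    # Keep valid ages, substitute the median everywhere else
    return ages.where(valid, median_valid)

# Made-up ages: 0, -2 and 120 are treated as invalid
ages = pd.Series([25, 0, -2, 40, 120, 31])
cleaned = clean_victim_age(ages)
print(cleaned.tolist())
```

Whether median imputation is appropriate at all depends on how many ages are invalid; if the share is large, a separate "age unknown" indicator may preserve more information.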

## 4. Exploratory Data Analysis (EDA)

We visualize distributions and relationships to uncover patterns in the crime data.

In [4]:
# Save EDA plots for age, area, time, and correlations
def perform_eda(df, output_dir):
    os.makedirs(output_dir, exist_ok=True)
    sns.set_style('whitegrid')
    # Victim Age Distribution
    plt.figure(figsize=(10, 6))
    sns.histplot(df['Vict Age'], bins=30, kde=True)
    plt.title('Distribution of Victim Age')
    plt.savefig(os.path.join(output_dir, 'victim_age_distribution.png')); plt.close()
    # Crimes by Area Name (Top 15)
    plt.figure(figsize=(12, 8))
    area_order = df['AREA NAME'].value_counts().iloc[:15].index
    sns.countplot(data=df, y='AREA NAME', hue='AREA NAME', order=area_order, palette='viridis', legend=False)
    plt.title('Top 15 Areas by Number of Crimes')
    plt.tight_layout()
    plt.savefig(os.path.join(output_dir, 'crimes_by_area.png')); plt.close()
    # Crimes by Day of Week
    plt.figure(figsize=(10, 6))
    day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
    sns.countplot(data=df, x='day_of_week', hue='day_of_week', order=day_order, palette='magma', legend=False)
    plt.title('Number of Crimes by Day of the Week')
    plt.savefig(os.path.join(output_dir, 'crimes_by_day_of_week.png')); plt.close()
    # Crimes by Month
    plt.figure(figsize=(10, 6))
    month_order = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
    sns.countplot(data=df, x='month', hue='month', order=month_order, palette='cividis', legend=False)
    plt.title('Number of Crimes by Month')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.savefig(os.path.join(output_dir, 'crimes_by_month.png')); plt.close()
    # Crimes by Hour of Day
    plt.figure(figsize=(12, 6))
    sns.countplot(data=df, x='hour_of_day', hue='hour_of_day', palette='plasma', legend=False)
    plt.title('Number of Crimes by Hour of the Day')
    plt.savefig(os.path.join(output_dir, 'crimes_by_hour.png')); plt.close()
    # Correlation Heatmap
    plt.figure(figsize=(12, 10))
    numerical_cols = df.select_dtypes(include=np.number).columns
    corr_matrix = df[numerical_cols].corr()
    sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm')
    plt.title('Correlation Matrix of Numerical Features')
    plt.tight_layout()
    plt.savefig(os.path.join(output_dir, 'correlation_heatmap.png')); plt.close()
    print('EDA visualizations saved.')

perform_eda(df_preprocessed, '/home/ayyan/ANALYSIS_PROJECT/eda_plots')


2025-08-06 19:54:08,814 - INFO - Using categorical units to plot a list of strings that are all parsable as floats or dates. If these strings should be plotted as numbers, cast to the appropriate data type before plotting.
2025-08-06 19:54:09,416 - INFO - Using categorical units to plot a list of strings that are all parsable as floats or dates. If these strings should be plotted as numbers, cast to the appropriate data type before plotting.

EDA visualizations saved.


### EDA Complete

Key visualizations are saved. Review them to spot trends in age, area, time, and feature correlations.
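
Because the plots are written to disk rather than displayed inline, it can help to print the numbers behind them. The sketch below tabulates counts by day of week in calendar order on a handful of made-up values; `counts_by_day` is a hypothetical helper, not a function from the notebook.

```python
import pandas as pd

DAY_ORDER = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def counts_by_day(day_names):
    """Count occurrences per weekday, in calendar order, with 0 for absent days."""
    counts = pd.Series(day_names).value_counts()
    return counts.reindex(DAY_ORDER, fill_value=0)

# Made-up sample; in the notebook this would be df_preprocessed['day_of_week']
sample = ["Monday", "Friday", "Monday", "Sunday"]
day_counts = counts_by_day(sample)
print(day_counts)
```

The same pattern works for `month` by swapping in the list of month names.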

## 5. Model Development

We tune a Random Forest classifier to predict crime type (`Crm Cd Desc`), using a grid search over its main hyperparameters. Each step is commented for clarity.

In [None]:
# =============================================
# IMPROVED MODEL DEVELOPMENT - RANDOM FOREST FOCUS
# =============================================

# Import required libraries
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
import warnings

# Filter out warnings
warnings.filterwarnings("ignore", category=FutureWarning)

# 1. Prepare features and target
# Filter out rare crime categories (those with < 5 instances)
crime_counts = df_preprocessed['Crm Cd Desc'].value_counts()
rare_crimes = crime_counts[crime_counts < 5].index
df_filtered = df_preprocessed[~df_preprocessed['Crm Cd Desc'].isin(rare_crimes)]

# Define features and target
X = df_filtered[[
    'AREA', 'Rpt Dist No', 'Part 1-2', 'Vict Age', 'Vict Sex', 
    'Vict Descent', 'hour_of_day', 'day_of_week', 'month', 'LAT', 'LON'
]]
y = df_filtered['Crm Cd Desc']

# 2. Encode target labels
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

# 3. Identify feature types
categorical_features = ['Vict Sex', 'Vict Descent', 'day_of_week', 'month']
numerical_features = ['AREA', 'Rpt Dist No', 'Part 1-2', 'Vict Age', 'hour_of_day', 'LAT', 'LON']

# 4. Create preprocessing pipeline
preprocessor = ColumnTransformer([
    ('cat', OneHotEncoder(handle_unknown='ignore', sparse_output=False), categorical_features),
    ('num', StandardScaler(), numerical_features)
])

# 5. Enhanced train/test split with stratification
X_train, X_test, y_train, y_test = train_test_split(
    X, y_encoded, 
    test_size=0.2, 
    random_state=42, 
    stratify=y_encoded
)

# 6. Optimized Random Forest Pipeline
rf_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(
        random_state=42,
        class_weight='balanced',  # Handle class imbalance
        n_jobs=-1  # Use all cores
    ))
])

# 7. Comprehensive hyperparameter tuning
param_grid = {
    'classifier__n_estimators': [100, 200, 300],
    'classifier__max_depth': [None, 15, 25, 35],
    'classifier__min_samples_split': [2, 5, 10],
    'classifier__min_samples_leaf': [1, 2, 4],
    'classifier__max_features': ['sqrt', 'log2', 0.5]
}

grid_search = GridSearchCV(
    rf_pipeline,
    param_grid,
    cv=5,  # More folds for better validation
    verbose=1,
    n_jobs=-1,
    scoring='accuracy'
)

print("\nStarting Grid Search...")
grid_search.fit(X_train, y_train)

# 8. Best model evaluation
best_rf = grid_search.best_estimator_
y_pred = best_rf.predict(X_test)

print('\nBest Parameters:')
print(grid_search.best_params_)

print('\nModel Performance:')
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print("\nClassification Report:")
print(classification_report(
    y_test, y_pred,
    target_names=label_encoder.classes_,
    zero_division=0
))

# 9. Feature Importance Analysis
print("\nFeature Importance Analysis:")
# Get feature names from one-hot encoding
cat_features = best_rf.named_steps['preprocessor'].named_transformers_['cat'].get_feature_names_out(categorical_features)
all_features = np.concatenate([cat_features, numerical_features])

# Extract and display top 20 features
importances = best_rf.named_steps['classifier'].feature_importances_
top_features = sorted(zip(all_features, importances), key=lambda x: x[1], reverse=True)[:20]

print("\nTop 20 Predictive Features:")
for feature, importance in top_features:
    print(f"{feature}: {importance:.4f}")

# =============================================
# END OF IMPROVED MODEL DEVELOPMENT
# =============================================


Starting Grid Search...
Fitting 5 folds for each of 324 candidates, totalling 1620 fits




### Model Results Summary

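Once the grid search finishes, the cell prints the best parameters, the test accuracy, a per-class classification report, and the top 20 feature importances. In the run recorded above the search had only just started (324 candidates × 5 folds = 1,620 fits), so no metrics are shown yet.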
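
The grid above has 3 × 4 × 3 × 3 × 3 = 324 parameter combinations, which with 5 folds gives the 1,620 fits reported by the search. On roughly a million rows that can take a long time. One common way to bound the cost is a randomized search that samples a fixed number of combinations. The sketch below uses synthetic data from `make_classification` and deliberately small settings, all of them assumptions chosen so it runs in seconds; it is not a tuned configuration for the crime data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic 3-class problem standing in for the real features (assumption)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=42
)

# Small illustrative search space, not the notebook's grid
param_distributions = {
    "n_estimators": [10, 20, 30],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=4,          # evaluate only 4 sampled combinations instead of all 27
    cv=3,
    scoring="accuracy",
    random_state=42,
)
search.fit(X_demo, y_demo)
print(search.best_params_)
```

With `n_iter=4` and `cv=3` this runs 12 fits during the search. Applied to the notebook's pipeline, the parameter names would need the `classifier__` prefix, and fitting on a stratified subsample first is another way to keep the run time manageable.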

## 6. Dashboard Creation

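An interactive Streamlit dashboard visualizes the dataset and the EDA results, with a placeholder page for model performance. The code below is meant to be saved as `dashboard_app.py` and launched with `streamlit run` rather than executed as a notebook cell; see the usage guide in the documentation section.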

In [None]:
import streamlit as st
import pandas as pd
import os
from PIL import Image

# Paths to data and plots
# These must match the locations the notebook writes to
PREPROCESSED_DATA_PATH = "/home/ayyan/ANALYSIS_PROJECT/preprocessed_crime_data.parquet"
EDA_PLOTS_DIR = "/home/ayyan/ANALYSIS_PROJECT/eda_plots"

st.set_page_config(layout="wide", page_title="Crime Data Analysis Dashboard")

st.title("Crime Data Analysis Dashboard")

@st.cache_data
def load_data(path):
    return pd.read_parquet(path)

# Load data
df = load_data(PREPROCESSED_DATA_PATH)

# Sidebar for navigation
st.sidebar.title("Navigation")
page = st.sidebar.radio("Go to", ["Dataset Overview", "EDA Visualizations", "Model Performance"]) # Model Performance will be empty for now

if page == "Dataset Overview":
    st.header("Dataset Overview")
    st.write("A quick look at the preprocessed crime data.")

    st.subheader("Data Sample")
    st.dataframe(df.head())

    st.subheader("Dataset Shape")
    st.write(f"Rows: {df.shape[0]}, Columns: {df.shape[1]}")

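    st.subheader("Column Information")
    # df.info() prints to stdout and returns None, so show a small summary table instead
    st.dataframe(pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "non_null": df.notnull().sum(),
    }))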

    st.subheader("Missing Values (after preprocessing)")
    st.write(df.isnull().sum())

    st.subheader("Descriptive Statistics")
    st.write(df.describe())

elif page == "EDA Visualizations":
    st.header("Exploratory Data Analysis Visualizations")
    st.write("Visual insights into the crime data.")

    plot_files = [
        "victim_age_distribution.png",
        "crimes_by_area.png",
        "crimes_by_day_of_week.png",
        "crimes_by_month.png",
        "crimes_by_hour.png",
        "correlation_heatmap.png"
    ]

    for plot_file in plot_files:
        plot_path = os.path.join(EDA_PLOTS_DIR, plot_file)
        if os.path.exists(plot_path):
            st.subheader(plot_file.replace("_", " ").replace(".png", "").title())
            image = Image.open(plot_path)
            st.image(image, use_column_width=True)
        else:
            st.warning(f"Plot not found: {plot_file}")

elif page == "Model Performance":
    st.header("Model Performance")
    st.write("This section will display the performance metrics of the trained models.")
    st.info("Model training and evaluation results will be displayed here once the model development section is complete.")
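
The Model Performance page is still a placeholder. One way to connect it to the notebook, sketched here under assumptions, is to have the notebook write its metrics to a JSON file that the dashboard reads. The file name `model_metrics.json`, the helper names, and the metric values below are all hypothetical; the example writes to a temporary directory so it does not touch the project folder.

```python
import json
import tempfile
from pathlib import Path

def save_metrics(metrics, path):
    """Write a dict of metrics to a JSON file (notebook side)."""
    Path(path).write_text(json.dumps(metrics, indent=2))

def load_metrics(path):
    """Read metrics back, or return None if the file does not exist (dashboard side)."""
    p = Path(path)
    if not p.exists():
        return None
    return json.loads(p.read_text())

with tempfile.TemporaryDirectory() as tmp_dir:
    metrics_path = Path(tmp_dir) / "model_metrics.json"  # hypothetical file name
    # Placeholder values for illustration only, not real results
    save_metrics({"accuracy": 0.0, "n_test_rows": 0}, metrics_path)
    loaded = load_metrics(metrics_path)
    missing = load_metrics(Path(tmp_dir) / "does_not_exist.json")

print(loaded)
```

In the dashboard, the `None` case could keep the existing `st.info` message, while a loaded dict could be displayed with `st.write`.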


## 7. Documentation

This section details the project documentation, including a `README.md` file and detailed explanations within this Jupyter notebook.

### README.md Content

```markdown
# Crime Data Analysis Project

## Project Overview
This project performs a comprehensive data analysis on crime data from 2020 to present. It covers data understanding, preprocessing, exploratory data analysis (EDA), model development for crime type prediction, and an interactive dashboard for visualization.

## Dataset Description
The dataset `Crime_Data_from_2020_to_Present.csv` contains crime incidents recorded from 2020 onwards. Key features include crime type, location, time, victim demographics, and weapon information.

## Installation Instructions
1.  **Clone the repository:**
    ```bash
    git clone <repository_url>
    cd <repository_name>
    ```
2.  **Create and activate a virtual environment:**
    ```bash
    python3 -m venv .venv
    source .venv/bin/activate
    ```
3.  **Install dependencies:**
    ```bash
    pip install -r requirements.txt
    ```

## Usage Guide
1.  **Run the Jupyter Notebook:**
    Open `crime_data_analysis.ipynb` in Jupyter Lab or Jupyter Notebook and run all cells sequentially to perform data processing, EDA, and model training.
    ```bash
    jupyter lab
    # or
    jupyter notebook
    ```
2.  **Run the Dashboard:**
    After running the notebook and generating the necessary processed data and EDA plots, you can launch the interactive dashboard:
    ```bash
    streamlit run dashboard_app.py
    ```

## Results Summary
(This section will be updated with key findings from EDA and model performance metrics after the notebook is fully executed.)

## Future Improvements
- Implement more advanced feature engineering (e.g., geospatial features).
- Explore deep learning models for crime prediction.
- Integrate real-time data streaming for live updates.
- Enhance dashboard interactivity and add more detailed model insights.
- Implement model explainability techniques (SHAP, LIME).
- Create deployment-ready code using Flask/FastAPI.
- Add unit tests for critical functions.
- Include Docker setup for reproducibility.
```

## 8. Bonus Features

This section outlines potential bonus features that can be implemented to further enhance the project.

### Model Explainability (SHAP, LIME)
Implement techniques like SHAP (SHapley Additive exPlanations) or LIME (Local Interpretable Model-agnostic Explanations) to understand how individual features contribute to model predictions. This makes otherwise opaque models such as the Random Forest easier to interpret.

### Deployment-Ready Code (Flask/FastAPI)
Develop a simple API using Flask or FastAPI to serve the trained model. This would allow the model to be integrated into other applications or services for real-time predictions.

### Unit Tests for Critical Functions
Write unit tests for key functions in data processing, feature engineering, and model training to ensure code reliability and maintainability.
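
As a rough idea of what such a test could look like, the sketch below checks the day-of-week and month features on two known dates. The helper `add_time_features` is hypothetical: it mirrors the corresponding lines in `preprocess_data`, but the notebook does not define it as a separate function. The test is written in a pytest-compatible style and is also called directly so the block runs on its own.

```python
import pandas as pd

def add_time_features(df):
    """Hypothetical helper mirroring the notebook's day-of-week and month features."""
    out = df.copy()
    out["day_of_week"] = out["DATE OCC"].dt.day_name()
    out["month"] = out["DATE OCC"].dt.month_name()
    return out

def test_add_time_features():
    df = pd.DataFrame({"DATE OCC": pd.to_datetime(["2020-03-02", "2020-12-25"])})
    result = add_time_features(df)
    # 2 March 2020 was a Monday, 25 December 2020 a Friday
    assert result["day_of_week"].tolist() == ["Monday", "Friday"]
    assert result["month"].tolist() == ["March", "December"]
    # The input frame must not be modified
    assert "day_of_week" not in df.columns

test_add_time_features()
```

Splitting `preprocess_data` into small functions like this would make each step testable without loading the full dataset.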

### Docker Setup for Reproducibility
Create a `Dockerfile` to containerize the entire project, including dependencies and the application. This ensures that the project can be easily reproduced and deployed across different environments.