<h1 align=center style="line-height:200%;color:#0099cc">
<font color="#0099cc">
Criminology
</font>
</h1>


<h2 align=left style="line-height:200%;color:#0099cc">
<font color="#0099cc">
Introduction and Problem Statement
</font>
</h2>

<p dir=ltr style="direction: ltr;text-align: justify;line-height:200%;font-size:medium">
<font size=3>
Your goal in this problem is to create a predictive model that can classify new incidents into one of several crime categories using historical data. This model should help law enforcement agencies identify patterns of criminal behavior and make more informed decisions regarding crime prevention strategies in Los Angeles.
</font>
</p>


<h2 align=left style="line-height:200%;color:#0099cc">
<font color="#0099cc">
Dataset Introduction
</font>
</h2>

<p dir=ltr style="direction: ltr; text-align: justify; line-height:200%; font-size:medium">
<font size=3>
The training dataset contains 84,113 rows. The description of each column is provided in the table below.
</font>
</p>

<center>
<div dir=ltr style="direction: ltr;line-height:200%;font-size:medium">
<font size=3>
    
|Column|Description|
|:------:|:---:|
|DR_NO| A unique identifier for each crime report|
|Date Rptd| Date the crime was reported|
|DATE OCC| Date the crime occurred|
|TIME OCC| Time the crime occurred|
|AREA| Code of the area where the crime occurred|
|AREA NAME| Name of the area where the crime occurred|
|Rpt Dist No| Reporting District Number|
|Crm Cd| Crime Code, indicating the crime category (This column is the target variable)|
|Crm Cd Desc| Description of the crime|
|Mocodes| Codes indicating the method used to commit the crime|
|Vict Age| Victim's Age|
|Vict Sex| Victim's Sex|
|Vict Descent| Victim's Descent/Ethnicity|
|Premis Cd| Code for the type of premises where the crime occurred|
|Premis Desc| Description of the premises|
|Weapon Used Cd| Code for the type of weapon used in the crime|
|Weapon Desc| Description of the weapon used|
|Status| Case Status (e.g., pending, closed)|
|Status Desc| Description of the case status|
|Crm Cd 1, Crm Cd 2, Crm Cd 3, Crm Cd 4| Additional crime codes if the incident involves multiple crimes|
|LOCATION| Street address of the crime location|
|Cross Street| Nearest cross street to the crime location|
|LAT| Crime's latitude coordinate|
|LON| Crime's longitude coordinate|

</font>
</div>
</center>

<p dir=ltr style="direction: ltr; text-align: justify; line-height:200%; font-size:medium">
<font size=3>
The test dataset is similar to the training set, except that it does not contain the <code>Crm Cd</code> column, which is the target variable for the problem. It also does not contain the <code>Crm Cd 1</code>, <code>Crm Cd 2</code>, <code>Crm Cd 3</code>, <code>Crm Cd 4</code>, and <code>Crm Cd Desc</code> columns, which provide direct information about the target variable. The test dataset has 9346 rows and 21 columns.
</font>
</p>


<h3 align=left style="line-height:200%;color:#0099cc">
<font color="#0099cc">
Evaluation Metric
</font>
</h3>

<p dir=ltr style="direction: ltr; text-align: justify; line-height:200%; font-size:medium">
<font size=3>
Submissions are judged by the <code>F1 Macro</code> score, which weighs precision and recall across all crime categories in a balanced manner. Unlike accuracy, which can be inflated by a few dominant classes in imbalanced data, <code>F1 Macro</code> computes the <code>F1</code> score for each crime category independently and then takes their unweighted average. Infrequent crime categories therefore count as much as frequent ones, which makes the metric suitable for predicting crime types regardless of their frequency in the dataset.
</font>
</p>
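
As a rough illustration of the averaging described above, the sketch below computes per-class F1 and their unweighted mean in plain Python. The helper names `f1_per_class` and `f1_macro` and the toy labels are assumptions made for this example, not part of the problem statement; scikit-learn's `f1_score(y_true, y_pred, average='macro')` is the library counterpart.

```python
from collections import Counter

def f1_per_class(y_true, y_pred):
    """Return {label: F1} for every label seen in y_true or y_pred."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # p was predicted, but the true label was t
            fn[t] += 1  # the true label t was missed
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        # F1 = 2TP / (2TP + FP + FN); the guard is purely defensive
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores

def f1_macro(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)

# Toy labels (illustrative, not taken from the dataset): code 210 dominates.
y_true = [210, 210, 210, 210, 420, 930]
y_pred = [210, 210, 210, 210, 210, 930]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy = {accuracy:.3f}")
print(f"F1 macro = {f1_macro(y_true, y_pred):.3f}")
```

In this toy case the single example of class 420 is missed, so its F1 is 0. That pulls the macro average well below the accuracy, even though five of the six predictions are correct.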


<h2 align=left style="line-height:200%;color:#0099cc">
<font color="#0099cc">
Prediction for Test Data and Output
</font>
</h2>

<p dir=ltr style="direction: ltr;text-align: left;line-height:200%;font-size:medium">
<font size=3>
Use your model to predict the samples in the test data and prepare the results in a table format (<code>dataframe</code>) as shown below.
</font>
</p>

<div dir=ltr style="direction: ltr;text-align: center;line-height:200%;font-size:medium">
<font size=3>
    
|Column|Description|
|------|---|
|Crm Cd|Predicted Crime Code|
    
</font>
</div>


<p dir=ltr style="direction: ltr;text-align: left;line-height:200%;font-size:medium">
<font size=3>
The dataframe must be named <i>submission</i>; otherwise, the judging system cannot evaluate your submission.
    <br>
    This dataframe contains only 1 column named <i>Crm Cd</i> and has 9346 rows.
    <br>
    You must have one predicted value for each row in the <i>test</i> dataframe.
    <br>
    The table below shows the first 5 rows of the <code>submission</code> dataframe. Note that in your answer, the values of the <i>Crm Cd</i> column may be different.
</font>
</p>

<div style="text-align: center;line-height:200%;font-size:medium">
<font size=3>
    
|Crm Cd|
|-----|
|210|
|420|
|930|
|624|
|420|

</font>
</div>
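
The layout requirements above can be checked before building the archive. The following is a minimal sketch; `check_submission` is a hypothetical helper introduced only for this example, and the demo frame is stand-in data.

```python
import pandas as pd

def check_submission(submission: pd.DataFrame, n_test_rows: int) -> None:
    """Raise AssertionError if the frame does not match the required layout."""
    assert list(submission.columns) == ["Crm Cd"], "exactly one column named 'Crm Cd'"
    assert len(submission) == n_test_rows, "one prediction per test row"
    assert submission["Crm Cd"].notna().all(), "no missing predictions"

# Stand-in data for illustration; in the notebook, pass len(test_df) instead.
demo = pd.DataFrame({"Crm Cd": [210, 420, 930, 624, 420]})
check_submission(demo, n_test_rows=5)
print("submission layout looks valid")
```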


In [15]:
# import libraries 
import os
import pandas as pd
import numpy as np

# scikit-learn is imported later, in the modeling cell, right before it is used
# from sklearn.model_selection import StratifiedKFold, cross_val_score
# from sklearn.metrics import f1_score, make_scorer
# from sklearn.ensemble import RandomForestClassifier


In [16]:
# Locate data directory 
possible_data_dirs = [
    os.path.join(os.getcwd(), 'data'),
    os.path.join(os.getcwd(), '..', 'data'),
    os.path.join(os.path.dirname(os.getcwd()), 'data')
]

data_dir = None
for p in possible_data_dirs:
    if os.path.exists(os.path.join(p, 'train.csv')) and os.path.exists(os.path.join(p, 'test.csv')):
        data_dir = p
        print(f"Data directory found: {data_dir}")
        break

if data_dir is None:
    raise FileNotFoundError("Could not find data directory containing train.csv and test.csv")

train_path = os.path.join(data_dir, 'train.csv')
test_path = os.path.join(data_dir, 'test.csv')

train_df = pd.read_csv(train_path)
test_df = pd.read_csv(test_path)


# define the target column
target_col = 'Crm Cd'



Data directory found: /Users/taher/Projects/data-processing–preliminary  /Q3/notebook/../data


In [17]:
# Define the function that builds model features from a raw dataframe (train or test)

def build_features(df: pd.DataFrame) -> pd.DataFrame:
    df_proc = df.copy()

    # Extract time features
    if 'TIME OCC' in df_proc.columns:
        def extract_hour(val):
            try:
                t = int(val)
            except Exception:
                return np.nan
            return (t // 100) % 24
        df_proc['TIME_OCC_HOUR'] = df_proc['TIME OCC'].apply(extract_hour)

    # Extract date features
    for date_col in ['Date Rptd', 'DATE OCC']:
        if date_col in df_proc.columns:
            # Parse with the explicit format below; values that do not match become NaT
            dt = pd.to_datetime(
                df_proc[date_col].astype(str).str.strip(),
                format='%m/%d/%Y %I:%M:%S %p',
                errors='coerce'
            )
            df_proc[date_col + '_YEAR'] = dt.dt.year
            df_proc[date_col + '_MONTH'] = dt.dt.month
            df_proc[date_col + '_DAY'] = dt.dt.day
            df_proc[date_col + '_DOW'] = dt.dt.dayofweek

    # Mocodes summary
    if 'Mocodes' in df_proc.columns:
        df_proc['MOCODES_COUNT'] = (
            df_proc['Mocodes']
            .fillna('')
            .astype(str)
            .map(lambda s: 0 if s.strip() == '' else len(s.strip().split()))
        )

    # Victim age cleaning
    if 'Vict Age' in df_proc.columns:
        df_proc['Vict Age'] = pd.to_numeric(df_proc['Vict Age'], errors='coerce')
        df_proc['Vict Age'] = df_proc['Vict Age'].clip(lower=0, upper=100)

    # Drop leakage / text-heavy columns not in test or too high-cardinality
    drop_cols = [
        'DR_NO', 'Crm Cd Desc', 'Crm Cd 1', 'Crm Cd 2', 'Crm Cd 3', 'Crm Cd 4',
        'Date Rptd', 'DATE OCC', 'Mocodes', 'LOCATION', 'Cross Street'
    ]
    df_proc = df_proc.drop(columns=[c for c in drop_cols if c in df_proc.columns], errors='ignore')

    return df_proc

print("Features are extracted and cleaned")

# print stats of the training data
# print(train_df.describe())

# print stats of the test data
# print(test_df.describe())



Features are extracted and cleaned
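
To make the time and date handling in `build_features` easier to follow, here is a small standard-library sketch of the same two steps. The sample values are assumptions chosen for illustration; the hour helper follows the notebook's logic but catches only `TypeError` and `ValueError`.

```python
import math
from datetime import datetime

def extract_hour(val):
    """HHMM-style value -> hour in 0..23, or NaN if the value cannot be parsed."""
    try:
        t = int(val)
    except (TypeError, ValueError):
        return math.nan
    return (t // 100) % 24

for raw in [1430, "0030", 2400, None]:
    print(repr(raw), "->", extract_hour(raw))

# Assumed sample string in the format the notebook passes to pd.to_datetime.
sample = "03/01/2020 12:00:00 AM"
dt = datetime.strptime(sample, "%m/%d/%Y %I:%M:%S %p")
# weekday() uses Monday=0 ... Sunday=6, the same convention as pandas' dt.dayofweek.
print(dt.year, dt.month, dt.day, dt.weekday())
```

Note that a value of 2400 wraps around to hour 0 because of the `% 24`.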


In [18]:
# Separate target
y = train_df[target_col].astype(int)
X_train_raw = train_df.drop(columns=[target_col], errors='ignore')
X_test_raw = test_df.copy()

# Build features
X_train_feats = build_features(X_train_raw)
X_test_feats = build_features(X_test_raw)

# Combine to ensure consistent encoding of categoricals 
combined = pd.concat([X_train_feats, X_test_feats], axis=0, ignore_index=True)

# Label-encode object columns deterministically
object_cols = [c for c in combined.columns if combined[c].dtype == 'object']
for col in object_cols:
    combined[col] = combined[col].fillna('NA').astype('category').cat.codes.astype(np.int32)

# Fill remaining numeric missing values
for col in combined.columns:
    if combined[col].dtype.kind in 'biufc':
        combined[col] = combined[col].fillna(-1)

# Split back
X_train = combined.iloc[:len(X_train_feats), :].reset_index(drop=True)
X_test = combined.iloc[len(X_train_feats):, :].reset_index(drop=True)

# Print the number of rows in the training, test, and combined data
print(f"Training data has {len(X_train)} rows")
print(f"Test data has {len(X_test)} rows")
print(f"Combined data has {len(combined)} rows")

# Print the first 5 rows of the combined data to inspect the encoded features
# print(combined.head())
 




Training data has 84113 rows
Test data has 9346 rows
Combined data has 93459 rows
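
The reason for encoding train and test together can be seen on a toy example. The sketch below uses made-up values (the column name `Vict Sex` is borrowed only for familiarity) and compares category codes from separate versus combined encoding.

```python
import pandas as pd

# Toy values, illustrative only.
train = pd.Series(["F", "M", "X"], name="Vict Sex")
test = pd.Series(["M", "X"], name="Vict Sex")

# Encoding each split separately: the same string can get different codes.
train_codes_sep = train.astype("category").cat.codes.tolist()
test_codes_sep = test.astype("category").cat.codes.tolist()

# Encoding the concatenation: one shared category list, so codes agree.
combined_demo = pd.concat([train, test], ignore_index=True).astype("category")
codes = combined_demo.cat.codes
train_codes = codes.iloc[: len(train)].tolist()
test_codes = codes.iloc[len(train):].tolist()

print("separate:", train_codes_sep, test_codes_sep)
print("combined:", train_codes, test_codes)
```

With separate encoding, `"M"` maps to 1 in the train split but 0 in the test split. With the combined encoding it maps to 1 in both.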


In [19]:
# Training and prediction phase
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.metrics import f1_score
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier

# use RandomForestClassifier from sklearn
clf = RandomForestClassifier(
    n_estimators=400,
    max_depth=None,
    min_samples_leaf=2,
    max_features='sqrt',
    class_weight='balanced',
    n_jobs=-1,
    random_state=42,
) 

# Print the model parameters 
print(clf)

# Print the model summary
# print(clf.get_params())
 




RandomForestClassifier(class_weight='balanced', min_samples_leaf=2,
                       n_estimators=400, n_jobs=-1, random_state=42)
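
The `class_weight='balanced'` setting above reweights classes inversely to their frequency. As a sketch of the heuristic that scikit-learn documents for this mode, `n_samples / (n_classes * count_c)`, the snippet below reproduces the arithmetic in plain Python. The helper name and the toy labels are assumptions for illustration.

```python
from collections import Counter

def balanced_class_weights(y):
    """Weight per class: n_samples / (n_classes * count of that class)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

# Toy target with a 6 : 3 : 1 class imbalance (not real data).
y_toy = [210] * 6 + [420] * 3 + [930] * 1
weights = balanced_class_weights(y_toy)
for label in sorted(weights):
    print(label, round(weights[label], 3))
```

The rarest class receives the largest weight, which is in line with a metric such as F1 Macro that treats every class equally.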


In [20]:

# Optional - quick CV to gauge F1-macro
try:
    cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
    f1_scores = cross_val_score(
        clf, X_train, y,
        scoring='f1_macro',
        cv=cv, n_jobs=-1
    )
    print(f"CV F1-macro (3-fold): mean={f1_scores.mean():.3f}, std={f1_scores.std():.3f}")
except Exception as e:
    print(f"CV failed: {e}")

clf.fit(X_train, y)
y_pred = clf.predict(X_test)
 

submission = pd.DataFrame({'Crm Cd': pd.Series(y_pred).astype(int)})

CV F1-macro (3-fold): mean=0.519, std=0.001


<h2 dir=ltr align=left style="line-height:200%;color:#0099cc">
<font color="#0099cc">
<b>Submission Builder Cell</b>
</font>
</h2>

<p dir=ltr style="direction: ltr; text-align: justify; line-height:200%; font-size:medium">
<font size=3>
Run the cell below to create the <code>result.zip</code> file. Before running it, save the changes made in the notebook (<code>ctrl+s</code>); otherwise, your score will be set to zero at the end of the competition.
    <br>
    Also, if you are using Colab to run this notebook file, download the latest version of your notebook and include it in the submission file before sending the <code>result.zip</code> file.
</font>
</p>


In [21]:
import os
import zipfile

def compress(file_names):
    print("File Paths:")
    print(file_names)
    compression = zipfile.ZIP_DEFLATED
    with zipfile.ZipFile("result.zip", mode="w") as zf:
        for file_name in file_names:
            zf.write('./' + file_name, file_name, compress_type=compression)

submission.to_csv('submission.csv', index=False)
file_names = ['crime_detection.ipynb', 'submission.csv']
compress(file_names)

File Paths:
['crime_detection.ipynb', 'submission.csv']
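
After the builder cell has run, it may be worth confirming what actually ended up in the archive. The sketch below is self-contained: it builds a throwaway archive from placeholder files in a temporary directory and lists its members. `list_zip_members` is a hypothetical helper; in the notebook it would be called on `"result.zip"`.

```python
import os
import tempfile
import zipfile

def list_zip_members(zip_path):
    """Return the member names stored in the archive."""
    with zipfile.ZipFile(zip_path) as zf:
        return zf.namelist()

# Demo with placeholder files so the example does not touch real outputs.
with tempfile.TemporaryDirectory() as tmp:
    names = ["crime_detection.ipynb", "submission.csv"]
    for name in names:
        with open(os.path.join(tmp, name), "w") as fh:
            fh.write("placeholder\n")
    zip_path = os.path.join(tmp, "result.zip")
    with zipfile.ZipFile(zip_path, mode="w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name in names:
            zf.write(os.path.join(tmp, name), arcname=name)
    members = list_zip_members(zip_path)

print(members)
```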
