## Overview

This notebook analyzes and preprocesses a crime classification dataset. The goal is to prepare the data for a machine learning model that classifies each crime report into a category and a sub-category based on its free-text description (`crimeaditionalinfo`).

## Import Required Libraries

In [1]:
import pandas as pd
import numpy as np
import pickle
import difflib

## Loading Data

In [2]:
train_dataset = pd.read_csv('./dataset/train.csv')
test_dataset = pd.read_csv('./dataset/test.csv')

In [3]:
print("Training Dataset Shape:", train_dataset.shape)
print("\nFirst few rows of training data:")
train_dataset.head()

Training Dataset Shape: (93686, 3)

First few rows of training data:


| | category | sub_category | crimeaditionalinfo |
|---|---|---|---|
| 0 | Online and Social Media Related Crime | Cyber Bullying Stalking Sexting | I had continue received random calls and abusi... |
| 1 | Online Financial Fraud | Fraud CallVishing | The above fraudster is continuously messaging ... |
| 2 | Online Gambling Betting | Online Gambling Betting | He is acting like a police and demanding for m... |
| 3 | Online and Social Media Related Crime | Online Job Fraud | In apna Job I have applied for job interview f... |
| 4 | Online Financial Fraud | Fraud CallVishing | I received a call from lady stating that she w... |


In [4]:
print("\nTest Dataset Shape:", test_dataset.shape)
print("\nFirst few rows of test data:")
test_dataset.head()


Test Dataset Shape: (31229, 3)

First few rows of test data:


| | category | sub_category | crimeaditionalinfo |
|---|---|---|---|
| 0 | RapeGang Rape RGRSexually Abusive Content | | Sir namaskar mein Ranjit Kumar PatraPaise neh... |
| 1 | Online Financial Fraud | DebitCredit Card FraudSim Swap Fraud | KOTAK MAHINDRA BANK FRAUD\r\nFRAUD AMOUNT |
| 2 | Cyber Attack/ Dependent Crimes | SQL Injection | The issue actually started when I got this ema... |
| 3 | Online Financial Fraud | Fraud CallVishing | I am amit kumar from karwi chitrakoot I am tot... |
| 4 | Any Other Cyber Crime | Other | I have ordered saree and blouse from rinki s... |


The dataset consists of crime reports with three main columns:
- `crimeaditionalinfo`: Detailed description of the crime
- `category`: Main crime category
- `sub_category`: More specific classification within the main category

## Checking for Missing Values

In [5]:
print("Training Dataset Info:")
train_dataset.info()

Training Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 93686 entries, 0 to 93685
Data columns (total 3 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   category            93686 non-null  object
 1   sub_category        87095 non-null  object
 2   crimeaditionalinfo  93665 non-null  object
dtypes: object(3)
memory usage: 2.1+ MB


### Check for missing values in both datasets

In [6]:
print("\nMissing values in training dataset:")
train_dataset.isna().sum()


Missing values in training dataset:


category                 0
sub_category          6591
crimeaditionalinfo      21
dtype: int64

In [7]:
print("\nMissing values in test dataset:")
test_dataset.isna().sum()


Missing values in test dataset:


category                 0
sub_category          2236
crimeaditionalinfo       7
dtype: int64

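From the missing value analysis, we can observe:
1. A few rows have missing crime descriptions (`crimeaditionalinfo`): 21 in the training set and 7 in the test set
2. Many more rows have missing sub-categories: 6591 in the training set and 2236 in the test set
3. No row is missing its main category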
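The raw counts above are easier to compare across datasets of different sizes when expressed as percentages. The following is a minimal sketch on a made-up toy frame; `missing_summary` is a hypothetical helper name, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    counts = df.isna().sum()
    percent = (counts / len(df) * 100).round(2)
    return pd.DataFrame({"missing": counts, "percent": percent})

# Toy frame (made-up values) using the same column names as the dataset
toy = pd.DataFrame({
    "category": ["A", "B", "C", "D"],
    "sub_category": ["a1", np.nan, "c1", np.nan],
    "crimeaditionalinfo": ["text", "text", np.nan, "text"],
})
summary = missing_summary(toy)
print(summary)
```

Calling `missing_summary(train_dataset)` and `missing_summary(test_dataset)` would give the same view for the real data.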

## Handling Missing Values
### 1. Crime Description Handling


In [8]:
# Remove rows where crime description is missing
train_dataset = train_dataset.dropna(subset=['crimeaditionalinfo'])
test_dataset = test_dataset.dropna(subset=['crimeaditionalinfo'])

We remove rows with missing crime descriptions because:
- The crime description is our main input feature
- Without a description, we cannot make meaningful predictions
- Imputation would not be appropriate for text data in this context

### 2. Sub-category Analysis

In [9]:
# Analyze patterns in missing sub-categories
subcat_nan_train_df = train_dataset[train_dataset['sub_category'].isna()]
print("\nSample rows with missing sub-categories:")
subcat_nan_train_df


Sample rows with missing sub-categories:


| | category | sub_category | crimeaditionalinfo |
|---|---|---|---|
| 8 | RapeGang Rape RGRSexually Abusive Content | | I got the message on Whatsapp to my number The... |
| 25 | RapeGang Rape RGRSexually Abusive Content | | Respected Sir\r\n\r\nA very serious matter I w... |
| 39 | Sexually Explicit Act | | httpswwwxnxxtvvideousapbfuckkkarrr\r\n\r\n Abo... |
| 45 | Sexually Obscene material | | Many fake accounts are created and Im sufferin... |
| 49 | Sexually Explicit Act | | SirMaam \r\nThis is my third report on this re... |
| ... | ... | ... | ... |
| 93632 | Sexually Explicit Act | | ob cash ... |
| 93648 | Sexually Explicit Act | | I got fraud by atm exchange in union bank of i... |
| 93653 | RapeGang Rape RGRSexually Abusive Content | | Respected Sir\r\n\r\nA very serious matter I w... |
| 93667 | RapeGang Rape RGRSexually Abusive Content | | Respected Sir\r\n\r\nA very serious matter I w... |


In [10]:
# Check categories with missing sub-categories
print("\nUnique categories with missing sub-categories:")
list(subcat_nan_train_df['category'].unique())


Unique categories with missing sub-categories:


['RapeGang Rape RGRSexually Abusive Content',
 'Sexually Explicit Act',
 'Sexually Obscene material',
 'Child Pornography CPChild Sexual Abuse Material CSAM']

In [11]:
# Check if these are sexually sensitive crimes
print("\nNumber of sexually-related crimes:")
subcat_nan_train_df['category'].str.lower().str.contains("sex").sum()


Number of sexually-related crimes:


6591

This count equals the number of rows with a missing sub-category (6591), so every such row in the training set belongs to a sexually sensitive category.

**Important insight:** Sub-categories are missing only for four sexually sensitive categories. This does not look like random missing data: either the omission was an intentional data collection choice, or these categories simply have no sub-categories.
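A keyword search for "sex" depends on how the categories happen to be named. A name-independent way to check the same pattern is to compute, for each category, the fraction of rows whose sub-category is missing. This is a sketch on made-up toy rows; `missing_rate_by_group` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

def missing_rate_by_group(df: pd.DataFrame, group_col: str, target_col: str) -> pd.Series:
    """Fraction of rows in each group whose target column is missing."""
    return df[target_col].isna().groupby(df[group_col]).mean()

# Toy rows (made-up) that mimic the pattern seen in the notebook
toy = pd.DataFrame({
    "category": ["Sexually Explicit Act", "Sexually Explicit Act",
                 "Online Financial Fraud", "Online Financial Fraud"],
    "sub_category": [np.nan, np.nan, "UPI Related Frauds", "Fraud CallVishing"],
})
rates = missing_rate_by_group(toy, "category", "sub_category")
print(rates)
```

A rate of 1.0 for a category means it never has a sub-category, and a rate of 0.0 means it always has one. Values strictly in between would indicate partially missing labels.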



### Checking Test Dataset for Similar Patterns

In [12]:
subcat_nan_test_df = test_dataset[test_dataset['sub_category'].isna()]
print("Test dataset patterns:")
list(subcat_nan_test_df['category'].unique())

Test dataset patterns:


['RapeGang Rape RGRSexually Abusive Content',
 'Sexually Explicit Act',
 'Sexually Obscene material',
 'Child Pornography CPChild Sexual Abuse Material CSAM']

In [13]:
print("\nSexually-related crimes in test data:")
subcat_nan_test_df['category'].str.lower().str.contains("sex").sum()


Sexually-related crimes in test data:


2235

The test dataset shows the same pattern, confirming our observation about sexually sensitive crimes.

### Checking Existing Sexual Crime Sub-categories

In [14]:
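# Drop missing values before collecting the unique labels; this is more
# robust than removing np.nan from the list afterwards
sub_categories = list(train_dataset['sub_category'].dropna().unique())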

print("Existing sub-categories related to sexual crimes:")
for text in sub_categories:
    if 'sex' in text.lower():
        print(text)

Existing sub-categories related to sexual crimes:
Cyber Bullying  Stalking  Sexting


While there is a sub-category **Cyber Bullying Stalking Sexting**, it's distinct from the categories with missing sub-categories.

### Sub-category Imputation

In [15]:
prefix = "sub-category of "
train_dataset.loc[:, 'sub_category'] = train_dataset['sub_category'].fillna(
    train_dataset['category'].apply(lambda x: prefix + str(x))
)
test_dataset.loc[:, 'sub_category'] = test_dataset['sub_category'].fillna(
    test_dataset['category'].apply(lambda x: prefix + str(x))
)

Reasons for this imputation strategy:
- Prefixing the main category with "sub-category of " keeps the category/sub-category hierarchy intact
- Each sensitive category gets its own placeholder label instead of being merged into an unrelated existing sub-category
- It handles every missing sub-category in the same, consistent way
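The effect of this imputation is easiest to see on a couple of rows. The sketch below uses made-up toy data and a vectorized string concatenation in place of `apply`; `impute_sub_category` is a hypothetical helper, and it assumes the `category` column has no missing values (as is the case in this dataset).

```python
import numpy as np
import pandas as pd

PREFIX = "sub-category of "

def impute_sub_category(df: pd.DataFrame, prefix: str = PREFIX) -> pd.DataFrame:
    """Fill missing sub-categories with a placeholder built from the category."""
    out = df.copy()  # leave the caller's frame untouched
    out["sub_category"] = out["sub_category"].fillna(prefix + out["category"])
    return out

toy = pd.DataFrame({
    "category": ["Sexually Explicit Act", "Online Financial Fraud"],
    "sub_category": [np.nan, "UPI Related Frauds"],
})
imputed = impute_sub_category(toy)
print(imputed["sub_category"].tolist())
```

Only the missing entry is replaced; existing sub-categories are left as they are.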


## Label Analysis

In [16]:
print("Label counts after imputation:")
print("Categories:", train_dataset['category'].nunique())
print("Sub-categories:", train_dataset['sub_category'].nunique())

Label counts after imputation:
Categories: 15
Sub-categories: 39


In [17]:
print("\nCategory distribution:")
print(train_dataset['category'].value_counts())


Category distribution:
category
Online Financial Fraud                                  57416
Online and Social Media Related Crime                   12138
Any Other Cyber Crime                                   10877
Cyber Attack/ Dependent Crimes                           3608
RapeGang Rape RGRSexually Abusive Content                2822
Sexually Obscene material                                1838
Hacking  Damage to computercomputer system etc           1710
Sexually Explicit Act                                    1552
Cryptocurrency Crime                                      480
Online Gambling  Betting                                  444
Child Pornography CPChild Sexual Abuse Material CSAM      379
Online Cyber Trafficking                                  183
Cyber Terrorism                                           161
Ransomware                                                 56
Report Unlawful Content                                     1
Name: count, dtype: int64


In [18]:
print("\nSub-category distribution:")
print(train_dataset['sub_category'].value_counts())


Sub-category distribution:
sub_category
UPI Related Frauds                                                      26843
Other                                                                   10877
DebitCredit Card FraudSim Swap Fraud                                    10802
Internet Banking Related Fraud                                           8871
Fraud CallVishing                                                        5802
Cyber Bullying  Stalking  Sexting                                        4089
EWallet Related Fraud                                                    4047
sub-category of RapeGang Rape RGRSexually Abusive Content                2822
FakeImpersonating Profile                                                2299
Profile Hacking Identity Theft                                           2072
Cheating by Impersonation                                                1987
sub-category of Sexually Obscene material                                1838
sub-category of Sexuall

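Label distribution insights:
- The counts above are computed after imputation: 15 main categories and 39 sub-categories
- 35 of the sub-categories are original; the other 4 are the placeholder labels created during imputation
- The classes are heavily imbalanced: `Online Financial Fraud` alone accounts for 57416 rows, while `Report Unlawful Content` has a single row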
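The split between original and placeholder sub-categories can be checked directly from the labels, because every placeholder starts with the imputation prefix. A minimal sketch on made-up toy labels; `split_label_counts` is a hypothetical helper.

```python
import pandas as pd

PREFIX = "sub-category of "

def split_label_counts(sub_categories: pd.Series, prefix: str = PREFIX):
    """Return (original, placeholder) counts of distinct sub-category labels."""
    labels = pd.Series(sub_categories.dropna().unique())
    is_placeholder = labels.str.startswith(prefix)
    return int((~is_placeholder).sum()), int(is_placeholder.sum())

toy = pd.Series([
    "UPI Related Frauds", "Other", "UPI Related Frauds",
    "sub-category of Sexually Explicit Act",
    "sub-category of Sexually Obscene material",
])
original, placeholder = split_label_counts(toy)
print(original, placeholder)
```

On the real training set, `split_label_counts(train_dataset['sub_category'])` would be expected to reproduce the 35 and 4 figures quoted in this section.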

## Checking Label Relationships

In [19]:
subcategories = train_dataset['sub_category'].unique()

print("Sub-categories with multiple parent categories:")
for subcategory in subcategories:
    classes = train_dataset[train_dataset['sub_category']==subcategory]['category'].unique()
    if len(classes) > 1:
        print(f"{subcategory} has {len(classes)} categories: {classes}")

Sub-categories with multiple parent categories:
Tampering with computer source documents has 2 categories: ['Cyber Attack/ Dependent Crimes'
 'Hacking  Damage to computercomputer system etc']


**Important finding:** Only `Tampering with computer source documents` appears under more than one category, which suggests a strong hierarchical structure in the data: roughly speaking, each category has a one-to-many relationship with its sub-categories.
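The same check can be written without an explicit loop by counting distinct parent categories per sub-category. This is a sketch on made-up toy labels; `multi_parent_sub_categories` is a hypothetical helper.

```python
import pandas as pd

def multi_parent_sub_categories(df: pd.DataFrame) -> pd.Series:
    """Sub-categories that occur under more than one category."""
    parents = df.groupby("sub_category")["category"].nunique()
    return parents[parents > 1]

toy = pd.DataFrame({
    "category": ["Cat A", "Cat B", "Cat A", "Cat C"],
    "sub_category": ["shared", "shared", "only A", "only C"],
})
result = multi_parent_sub_categories(toy)
print(result)
```

An empty result would mean the hierarchy is strict, with every sub-category belonging to exactly one category.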

## Handling Duplicates and Inconsistencies
### Initial Duplicate Check

In [20]:
print("Unique crime descriptions:", train_dataset['crimeaditionalinfo'].nunique())
print("Rows after dropping exact duplicates:", train_dataset.drop_duplicates().shape[0])
print("Total rows:", train_dataset.shape[0])

Unique crime descriptions: 85013
Rows after dropping exact duplicates: 85876
Total rows: 93665


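After dropping exact duplicate rows, 85876 rows remain, yet there are only 85013 unique descriptions. The 863 extra rows mean that some descriptions appear with more than one category or sub-category, which points to inconsistent labelling.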
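The three numbers printed in this check can be folded into one breakdown that separates exact duplicates from label conflicts. A minimal sketch on made-up toy rows; `duplicate_breakdown` and its dictionary keys are hypothetical names.

```python
import pandas as pd

def duplicate_breakdown(df: pd.DataFrame, text_col: str = "crimeaditionalinfo") -> dict:
    """Separate exact duplicate rows from descriptions with several label sets."""
    distinct_rows = len(df.drop_duplicates())
    unique_texts = df[text_col].nunique()
    return {
        "total_rows": len(df),
        "exact_duplicate_rows": len(df) - distinct_rows,
        "distinct_rows": distinct_rows,
        "unique_descriptions": unique_texts,
        # Distinct rows beyond one per description: same text, different labels
        "extra_label_combinations": distinct_rows - unique_texts,
    }

toy = pd.DataFrame({
    "category": ["A", "A", "B", "A"],
    "sub_category": ["a", "a", "b", "a"],
    "crimeaditionalinfo": ["x", "x", "x", "y"],
})
breakdown = duplicate_breakdown(toy)
print(breakdown)
```

In the toy frame, description `x` occurs three times: twice with identical labels (one exact duplicate) and once with different labels (one extra label combination).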

### Analyzing Inconsistent Classifications

In [21]:
# Find descriptions that are associated with more than one category or sub-category

inconsistent_data = (
    train_dataset.groupby('crimeaditionalinfo')[['category', 'sub_category']]
    .nunique()
    .reset_index()
    .query('category > 1 or sub_category > 1')
)
print("Inconsistent classifications:")
inconsistent_data

Inconsistent classifications:


| | crimeaditionalinfo | category | sub_category |
|---|---|---|---|
| 290 | \r\n Dear sir\r\n Please stop the frau... | 1 | 2 |
| 292 | \r\n Dear sir\r\n Please stop the frau... | 1 | 2 |
| 293 | \r\n Dear sir\r\n Please stop the frau... | 1 | 2 |
| 294 | \r\n Dear sir\r\n Please stop the frau... | 1 | 2 |
| 337 | \r\nCitizen Details \r\nAdd pnoJai singhpura ... | 1 | 2 |
| ... | ... | ... | ... |
| 82139 | please frezee this amount | 1 | 2 |
| 82195 | please take action | 1 | 2 |
| 83145 | sir pls hold this amount | 1 | 3 |
| 83149 | sir pls kindly stop this amount | 1 | 2 |


**Observation 1:** Some crime descriptions are labelled with more than one category or sub-category, which suggests:
1. The descriptions are ambiguous and could fit several labels

**Observation 2:** The same description appears on several rows of this table, even though the rows are grouped by unique description:
1. Leading or trailing whitespace probably makes otherwise identical descriptions count as different values
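One way to test the whitespace hypothesis before changing the data is to list raw values that become identical once stripped. The sketch below runs on made-up toy strings; `whitespace_variants` is a hypothetical helper.

```python
import pandas as pd

def whitespace_variants(texts: pd.Series) -> pd.DataFrame:
    """List raw values that collapse to the same string after stripping."""
    frame = pd.DataFrame({"raw": texts, "stripped": texts.str.strip()})
    frame = frame.drop_duplicates()
    # Number of distinct raw spellings behind each stripped value
    variants = frame.groupby("stripped")["raw"].transform("nunique")
    return frame[variants > 1].sort_values("stripped")

toy = pd.Series(["please take action", "please take action ", " upi", "upi", "ATM"])
found = whitespace_variants(toy)
# repr() makes the otherwise invisible leading/trailing spaces visible
print(found["raw"].map(repr).tolist())
```

Any rows returned here are descriptions that `groupby` treats as different only because of surrounding whitespace.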


### Text Cleaning

In [22]:
# Strip leading and trailing whitespace from every column (all are string columns)
train_dataset = train_dataset.apply(lambda col: col.str.strip())
test_dataset = test_dataset.apply(lambda col: col.str.strip())

In [23]:
inconsistent_data = (
    train_dataset.groupby('crimeaditionalinfo')[['category', 'sub_category']]
    .nunique()
    .reset_index()
    .query('category > 1 or sub_category > 1')
)
print("Inconsistent classifications:")
inconsistent_data

Inconsistent classifications:


| | crimeaditionalinfo | category | sub_category |
|---|---|---|---|
| 0 | | 12 | 28 |
| 1382 | ANY DESK APP FRAUD | 1 | 2 |
| 1438 | ATM | 1 | 2 |
| 2059 | Amount debited fraudulently | 1 | 2 |
| 2063 | Amount debited fraudulently from my account | 1 | 3 |
| ... | ... | ... | ... |
| 81705 | sir pls hold this amount | 1 | 3 |
| 81708 | sir pls kindly stop this amount | 1 | 2 |
| 82069 | someone made a fraud with me of amount | 1 | 2 |
| 83172 | upi | 2 | 3 |


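This cleaning step confirms that some apparent duplicates differed only in leading or trailing whitespace. The number of unique `crimeaditionalinfo` values has dropped as a result, and no description is repeated in the table any more. Stripping also exposes a group of empty descriptions (row 0), which spans 12 categories and 28 sub-categories.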

### Handling Multiple Classifications

In [24]:
resolved_data = (
    train_dataset[train_dataset['crimeaditionalinfo'].isin(inconsistent_data['crimeaditionalinfo'])]
    .groupby('crimeaditionalinfo')
    .agg({
        'category': lambda x: ' or '.join(sorted(x.unique())),
        'sub_category': lambda x: ' or '.join(sorted(x.unique()))
    })
    .reset_index()
)
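# Drop the empty-description group (row 0 in the table above); it has no usable text
resolved_data = resolved_data[resolved_data['crimeaditionalinfo'] != '']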

**Resolution strategy:**
- Instead of forcing a single classification, we keep every label seen for a description and join them with "or"
- This preserves the ambiguity of real-world crime classification
- It allows our model to learn that some crimes can span multiple categories
- The group with an empty description carries no usable text, so it is dropped
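The joining step can be illustrated on a few made-up rows. Sorting the labels before joining makes the combined label independent of row order, so the same set of labels always yields the same string. `join_labels` is a hypothetical name for the lambda used in the notebook.

```python
import pandas as pd

def join_labels(labels: pd.Series, sep: str = " or ") -> str:
    """Combine the distinct labels of one description into a single string."""
    return sep.join(sorted(labels.unique()))

toy = pd.DataFrame({
    "crimeaditionalinfo": ["upi", "upi", "upi", "ATM"],
    "sub_category": ["UPI Related Frauds", "Fraud CallVishing",
                     "UPI Related Frauds", "Other"],
})
joined = toy.groupby("crimeaditionalinfo")["sub_category"].agg(join_labels)
print(joined)
```

Descriptions with a single label pass through unchanged, and repeated labels are counted once.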

## Final Dataset Creation

In [25]:
# Combine cleaned and resolved data
cleaned_dataset = train_dataset[~train_dataset['crimeaditionalinfo'].isin(inconsistent_data['crimeaditionalinfo'])]
train_dataset = pd.concat([cleaned_dataset, resolved_data], ignore_index=True)

In [26]:
# Apply same process to test dataset

inconsistent_test_data = (
    test_dataset.groupby('crimeaditionalinfo')[['category', 'sub_category']]
    .nunique()
    .reset_index()
    .query('category > 1 or sub_category > 1')
)

resolved_test_data = (
    test_dataset[test_dataset['crimeaditionalinfo'].isin(inconsistent_test_data['crimeaditionalinfo'])]
    .groupby('crimeaditionalinfo')
    .agg({
        'category': lambda x: ' or '.join(sorted(x.unique())),
        'sub_category': lambda x: ' or '.join(sorted(x.unique()))
    })
    .reset_index()
)
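# Drop the empty-description group by value rather than by position,
# because row 0 is only the empty description if the test set contains one
resolved_test_data = resolved_test_data[resolved_test_data['crimeaditionalinfo'] != '']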

cleaned_test_dataset = test_dataset[~test_dataset['crimeaditionalinfo'].isin(inconsistent_test_data['crimeaditionalinfo'])]
test_dataset = pd.concat([cleaned_test_dataset, resolved_test_data], ignore_index=True)
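Because the same steps run on both datasets, they could be wrapped in one function so that the training and test sets are guaranteed to be treated identically. The following is a sketch under that assumption, not the notebook's own code: `resolve_inconsistent_labels` is a hypothetical helper, and it removes the empty-description group by value instead of by row position.

```python
import pandas as pd

TEXT = "crimeaditionalinfo"
LABELS = ["category", "sub_category"]

def resolve_inconsistent_labels(df: pd.DataFrame, sep: str = " or ") -> pd.DataFrame:
    """Merge rows whose description carries more than one label combination."""
    counts = df.groupby(TEXT)[LABELS].nunique()
    conflicted = counts[(counts > 1).any(axis=1)].index
    is_conflicted = df[TEXT].isin(conflicted)

    def join(labels: pd.Series) -> str:
        return sep.join(sorted(labels.unique()))

    resolved = (
        df[is_conflicted]
        .groupby(TEXT)
        .agg({col: join for col in LABELS})
        .reset_index()
    )
    # Conflicted empty descriptions carry no usable text, so drop them
    resolved = resolved[resolved[TEXT] != ""]
    return pd.concat([df[~is_conflicted], resolved], ignore_index=True)

# Toy rows (made-up): "upi" has two sub-categories, "" has conflicting labels
toy = pd.DataFrame({
    "category": ["Cat A", "Cat A", "Cat A", "Cat A", "Cat A", "Cat B"],
    "sub_category": ["s1", "s2", "s1", "s1", "s1", "s2"],
    "crimeaditionalinfo": ["upi", "upi", "ATM", "ATM", "", ""],
})
result = resolve_inconsistent_labels(toy)
print(result)
```

Exact duplicates such as the two `ATM` rows are left in place here and would be removed by the `drop_duplicates()` step that follows.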

### Remove final duplicates

In [27]:
train_dataset = train_dataset.drop_duplicates()
test_dataset = test_dataset.drop_duplicates()

In [28]:
# Final Dataset Shape

print("Pre-processed Training Dataset Shape:", train_dataset.shape)
print("Pre-processed Test Dataset Shape:", test_dataset.shape)

Pre-processed Training Dataset Shape: (83908, 3)
Pre-processed Test Dataset Shape: (28339, 3)


Compared with the raw datasets:
- the training set shrank from `93686` to `83908` rows (`9778` removed)
- the test set shrank from `31229` to `28339` rows (`2890` removed)

In [29]:
train_dataset.to_csv("./cleaned_dataset/cleaned_train_dataset.csv", index=False)
test_dataset.to_csv("./cleaned_dataset/cleaned_test_dataset.csv", index=False)

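**Final dataset characteristics:**
- Text and label columns stripped of leading and trailing whitespace
- Multiple classifications preserved where a description carried more than one label
- Placeholder sub-categories for the sexually sensitive categories that had none
- Exact duplicate rows removed
- Ready for model training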

**Important note for modeling:**
The presence of multiple categories (joined by "or") suggests we should consider this as a multi-label classification problem rather than a simple multi-class classification.
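For a multi-label setup, the joined strings would need to be turned back into one indicator column per individual class. The sketch below shows one possible encoding on made-up toy labels; `to_multi_hot` is a hypothetical helper, and it assumes that no original label contains the separator " or " itself.

```python
import pandas as pd

def to_multi_hot(labels: pd.Series, sep: str = " or ") -> pd.DataFrame:
    """Turn 'A or B' style labels into one 0/1 column per individual class."""
    parts = labels.str.split(sep)
    classes = sorted({c for row in parts for c in row})
    data = [[int(c in row) for c in classes] for row in parts]
    return pd.DataFrame(data, columns=classes, index=labels.index)

toy = pd.Series([
    "UPI Related Frauds",
    "Fraud CallVishing or UPI Related Frauds",
    "Other",
])
encoded = to_multi_hot(toy)
print(encoded)
```

Each row of `encoded` can then serve as the target vector for a multi-label classifier, with rows that had a single label containing exactly one `1`.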