# London Business Crime investigations

### Navigation
* [README](https://github.com/Kaori61/crime-data-analysis/blob/main/README.md)
* [Clean data](https://github.com/Kaori61/crime-data-analysis/blob/main/dataset/cleaned_data.csv)
* [Raw data](https://github.com/Kaori61/crime-data-analysis/blob/main/dataset/raw_data.csv)
* [Exploratory data analysis (EDA)](https://github.com/Kaori61/crime-data-analysis/blob/main/jupyter_notebooks/exploratory_data_analysis.ipynb)
* [Statistical analysis](https://github.com/Kaori61/crime-data-analysis/blob/main/jupyter_notebooks/statistical_analysis.ipynb)

### Import libraries and load data

In [1]:
# import necessary libraries
import pandas as pd
import numpy as np

In [2]:
# loading csv data 
df = pd.read_csv('../dataset/raw_data.csv')
df.head()

Unnamed: 0,Date,Measure,Borough,Crime Section,Crime group,Outcome,Positive Outcome,Outcome Count,Refresh Date
0,2023-01-01,Business Crime Outcomes,Barnet,PUBLIC ORDER OFFENCES,PUBLIC FEAR ALARM OR DISTRESS,Evidential Difficulties Victim Based-Named Sus...,N,1,2025-01-05
1,2023-01-01,Business Crime Outcomes,Barnet,ROBBERY,ROBBERY OF BUSINESS PROPERTY,Investigation Complete; No Suspect Identified....,N,1,2025-01-05
2,2023-01-01,Business Crime Outcomes,Barnet,THEFT,SHOPLIFTING,Evidential Difficulties Victim Based-Named Sus...,N,4,2025-01-05
3,2023-01-01,Business Crime Outcomes,Barnet,VIOLENCE AGAINST THE PERSON,VIOLENCE WITHOUT INJURY,Evidential Difficulties Victim Based-Named Sus...,N,1,2025-01-05
4,2023-01-01,Business Crime Outcomes,Barnet,VIOLENCE AGAINST THE PERSON,VIOLENCE WITHOUT INJURY,Investigation Complete; No Suspect Identified....,N,1,2025-01-05


In [3]:
# check the dataset is loaded correctly 
df.shape

(415847, 9)

# Data cleaning

### Check for Null values

In [4]:
# check for null values
df.isnull().sum()

Date                   0
Measure                0
Borough                0
Crime Section          0
Crime group            0
Outcome             1171
Positive Outcome       0
Outcome Count          0
Refresh Date           0
dtype: int64

The Outcome column contains the description of the crime outcome. A missing value in this column most likely means that no outcome has been recorded yet. Therefore, null values in the Outcome column will be replaced with "No outcome yet".

#### Handling missing Outcome values

In [5]:
# check rows with missing outcome values
df[df['Outcome'].isnull()]

Unnamed: 0,Date,Measure,Borough,Crime Section,Crime group,Outcome,Positive Outcome,Outcome Count,Refresh Date
207012,2024-03-17,Business Crime Outcomes,Other,MISCELLANEOUS CRIMES AGAINST SOCIETY,MISC CRIMES AGAINST SOCIETY,,N,1,2025-01-05
217397,2024-03-22,Business Crime Outcomes,Croydon,VIOLENCE AGAINST THE PERSON,VIOLENCE WITHOUT INJURY,,N,1,2025-01-05
217640,2024-03-23,Business Crime Outcomes,Hillingdon,MISCELLANEOUS CRIMES AGAINST SOCIETY,MISC CRIMES AGAINST SOCIETY,,N,1,2025-01-05
217899,2024-04-19,Business Crime Outcomes,Havering,PUBLIC ORDER OFFENCES,PUBLIC FEAR ALARM OR DISTRESS,,N,1,2025-01-05
220420,2024-04-01,Business Crime Outcomes,Hillingdon,VIOLENCE AGAINST THE PERSON,VIOLENCE WITHOUT INJURY,,N,1,2025-01-05
...,...,...,...,...,...,...,...,...,...
414749,2024-12-18,Business Crime Outcomes,Waltham Forest,DRUG OFFENCES,TRAFFICKING OF DRUGS,,N,1,2025-01-05
414798,2024-12-19,Business Crime Outcomes,Ealing,SEXUAL OFFENCES,RAPE,,N,1,2025-01-05
414941,2024-12-20,Business Crime Outcomes,Hillingdon,THEFT,OTHER THEFT,,N,1,2025-01-05
415014,2024-12-21,Business Crime Outcomes,Croydon,BURGLARY,BURGLARY BUSINESS AND COMMUNITY,,N,1,2025-01-05


In [6]:
# replacing null values in Outcome with 'No outcome yet'
df['Outcome'] = df['Outcome'].fillna('No outcome yet')

# check null values are replaced
(df['Outcome'] == 'No outcome yet').sum()

np.int64(1171)

Missing values have been replaced with 'No outcome yet'.
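As a small self-contained illustration of this check, the sketch below uses a toy frame (the values are made up, not taken from the dataset) and compares the null count before filling with the placeholder count afterwards. It assumes the placeholder string did not already occur in the column.

```python
import pandas as pd

# Toy frame standing in for the real data (illustrative values only)
toy = pd.DataFrame({"Outcome": ["Charged", None, "Caution", None]})

# Count nulls before filling so the result can be verified afterwards
n_missing = toy["Outcome"].isnull().sum()

toy["Outcome"] = toy["Outcome"].fillna("No outcome yet")

# The number of placeholder rows should equal the earlier null count,
# provided the placeholder did not already exist in the column
n_filled = (toy["Outcome"] == "No outcome yet").sum()
print(n_missing, n_filled)
```

If the two numbers differ, either the placeholder was already present or some nulls were not filled.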

#### Handling other null values

In [7]:
# check missing value in Date column
df[df['Date'].isnull()]

Unnamed: 0,Date,Measure,Borough,Crime Section,Crime group,Outcome,Positive Outcome,Outcome Count,Refresh Date


Row 407341 didn't have any values, so I will delete it.

In [8]:
# delete row 407341
df = df.drop(407341)
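Dropping by a hard-coded index label works for this snapshot of the data, but the label could change if the file is refreshed. One possible alternative, sketched here on a toy frame with made-up values, is to find and drop rows where every column is null using `dropna(how="all")`.

```python
import numpy as np
import pandas as pd

# Toy frame: row 1 is completely empty, row 2 is only partially empty
toy = pd.DataFrame(
    {
        "Date": ["2023-01-01", np.nan, "2023-01-02"],
        "Borough": ["Barnet", np.nan, np.nan],
    }
)

# Rows where every column is null
empty_rows = toy[toy.isnull().all(axis=1)]
print(empty_rows.index.tolist())

# Drop only fully empty rows; partially filled rows are kept
toy_clean = toy.dropna(how="all")
print(toy_clean.shape)
```

This keeps rows that are only partly missing, so they can still be handled column by column.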

In [9]:
# check null value
df.isnull().sum()

Date                0
Measure             0
Borough             0
Crime Section       0
Crime group         0
Outcome             0
Positive Outcome    0
Outcome Count       0
Refresh Date        0
dtype: int64

All missing values are handled now.

### Check data type

In [10]:
df.dtypes

Date                object
Measure             object
Borough             object
Crime Section       object
Crime group         object
Outcome             object
Positive Outcome    object
Outcome Count        int64
Refresh Date        object
dtype: object

Date and Refresh Date need to be in datetime format, so I will convert them.

In [11]:
# convert into date format
df['Date'] = pd.to_datetime(df['Date'])
df['Refresh Date'] = pd.to_datetime(df['Refresh Date'])
df.dtypes

Date                datetime64[ns]
Measure                     object
Borough                     object
Crime Section               object
Crime group                 object
Outcome                     object
Positive Outcome            object
Outcome Count                int64
Refresh Date        datetime64[ns]
dtype: object
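If the date strings might contain malformed values, a more defensive conversion could look like the sketch below. It assumes the dates follow the `YYYY-MM-DD` pattern seen in the preview above; the sample strings are made up for illustration.

```python
import pandas as pd

# Illustrative strings; the last one is deliberately invalid
raw_dates = pd.Series(["2023-01-01", "2024-12-21", "not a date"])

# errors="coerce" turns unparseable strings into NaT instead of raising
parsed = pd.to_datetime(raw_dates, format="%Y-%m-%d", errors="coerce")

# Values that failed to parse can then be inspected
bad = raw_dates[parsed.isna()]
print(bad.tolist())
```

An empty `bad` series would suggest that every value matched the expected format.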

Outcome Count should be an integer, not a float, so I will convert it to integer.

In [12]:
# convert into integer
df['Outcome Count'] = df['Outcome Count'].astype(int)
df.dtypes

Date                datetime64[ns]
Measure                     object
Borough                     object
Crime Section               object
Crime group                 object
Outcome                     object
Positive Outcome            object
Outcome Count                int64
Refresh Date        datetime64[ns]
dtype: object
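When a count column does arrive as float, casting with `astype(int)` silently truncates any fractional part. A minimal sketch of a safety check before casting, using a made-up float series rather than the real column:

```python
import pandas as pd

# Illustrative float counts, as they might appear after a CSV load
counts = pd.Series([1.0, 4.0, 2.0])

# Check that every value is a whole number before casting
is_whole = (counts % 1 == 0).all()

if is_whole:
    counts = counts.astype("int64")
print(counts.dtype)
```

If `is_whole` were false, the fractional values would need to be investigated before converting.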

### Check inconsistent values

In [13]:
# check how many unique values each categorical column has
df.select_dtypes(include='object').nunique()

Measure              1
Borough             33
Crime Section       13
Crime group         31
Outcome             31
Positive Outcome     2
dtype: int64

In [14]:
# check which unique values are in each categorical column
for col in df.select_dtypes(include='object').columns:
    print(f"\n Unique values in '{col}':")
    print(df[col].value_counts(dropna=False))


 Unique values in 'Measure':
Measure
Business Crime Outcomes    415846
Name: count, dtype: int64

 Unique values in 'Borough':
Borough
Westminster               23383
Camden                    16688
Newham                    15939
Tower Hamlets             15830
Lambeth                   15773
Southwark                 15213
Brent                     14673
Hackney                   14655
Other                     14647
Islington                 14266
Croydon                   13891
Ealing                    13843
Hillingdon                13402
Greenwich                 13183
Wandsworth                13142
Haringey                  12993
Barnet                    12965
Lewisham                  12943
Hounslow                  12818
Enfield                   12506
Hammersmith and Fulham    12154
Redbridge                 12005
Bromley                   11393
Waltham Forest            10745
Havering                  10671
Kensington and Chelsea    10550
Barking and Dagenham      10022


This analysis investigates geographical crime trends, so an unspecified location isn't useful. Therefore, I will delete the rows where Borough is 'Other'.

In [15]:
# keep only the rows where Borough is not equal to 'Other'
df = df[df['Borough'] != 'Other']

# check if the value is deleted correctly
df['Borough'].value_counts()

Borough
Westminster               23383
Camden                    16688
Newham                    15939
Tower Hamlets             15830
Lambeth                   15773
Southwark                 15213
Brent                     14673
Hackney                   14655
Islington                 14266
Croydon                   13891
Ealing                    13843
Hillingdon                13402
Greenwich                 13183
Wandsworth                13142
Haringey                  12993
Barnet                    12965
Lewisham                  12943
Hounslow                  12818
Enfield                   12506
Hammersmith and Fulham    12154
Redbridge                 12005
Bromley                   11393
Waltham Forest            10745
Havering                  10671
Kensington and Chelsea    10550
Barking and Dagenham      10022
Harrow                     8415
Merton                     8194
Kingston upon Thames       8085
Bexley                     7613
Sutton                     6869
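In some pandas versions, adding columns to a frame produced by boolean filtering can raise a `SettingWithCopyWarning`. One common way to avoid this, sketched here on a toy frame with made-up values, is to take an explicit copy after filtering.

```python
import pandas as pd

# Toy frame with one unspecified location
toy = pd.DataFrame(
    {"Borough": ["Barnet", "Other", "Camden"], "Outcome Count": [1, 2, 4]}
)

# .copy() makes the filtered result an independent DataFrame
filtered = toy[toy["Borough"] != "Other"].copy()

# New columns can now be added without affecting the original frame
# ("Flag" is a hypothetical column used only for this illustration)
filtered["Flag"] = True
print(filtered)
```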


In [16]:
# check the unique value again
df.select_dtypes(include='object').nunique()

Measure              1
Borough             32
Crime Section       13
Crime group         31
Outcome             31
Positive Outcome     2
dtype: int64
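Unique-value counts can hide labels that differ only in case or surrounding whitespace. The sketch below, which uses made-up labels rather than the real column, compares the number of unique values before and after normalising the text.

```python
import pandas as pd

# Illustrative labels; "barnet " differs from "Barnet" only in case and spacing
boroughs = pd.Series(["Barnet", "barnet ", "Camden", "Camden"])

# Strip whitespace and lower-case the labels for comparison only
normalised = boroughs.str.strip().str.lower()

# A smaller count after normalising points to inconsistent labels
print(boroughs.nunique(), normalised.nunique())
```

If both counts match for a column, the labels are likely consistent in case and spacing.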

### Check for duplicates

In [17]:
df.duplicated().sum()

np.int64(0)

No duplicates were found. The data cleaning process is satisfactory.
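Had duplicates been present, they could be removed with `drop_duplicates()`. A minimal sketch on a toy frame with made-up values:

```python
import pandas as pd

# Toy frame where the second row repeats the first
toy = pd.DataFrame(
    {"Borough": ["Barnet", "Barnet", "Camden"], "Outcome Count": [1, 1, 4]}
)

# duplicated() flags every repeat of an earlier identical row
n_dupes = toy.duplicated().sum()

# drop_duplicates() keeps the first occurrence of each row by default
deduped = toy.drop_duplicates()
print(n_dupes, len(deduped))
```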

## Create new columns

I would like to create separate Year, Month, and Weekday columns for analysis.

In [18]:
# create new columns
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Weekday'] = df['Date'].dt.day_name()

# check for new columns
df.head() 

Unnamed: 0,Date,Measure,Borough,Crime Section,Crime group,Outcome,Positive Outcome,Outcome Count,Refresh Date,Year,Month,Weekday
0,2023-01-01,Business Crime Outcomes,Barnet,PUBLIC ORDER OFFENCES,PUBLIC FEAR ALARM OR DISTRESS,Evidential Difficulties Victim Based-Named Sus...,N,1,2025-01-05,2023,1,Sunday
1,2023-01-01,Business Crime Outcomes,Barnet,ROBBERY,ROBBERY OF BUSINESS PROPERTY,Investigation Complete; No Suspect Identified....,N,1,2025-01-05,2023,1,Sunday
2,2023-01-01,Business Crime Outcomes,Barnet,THEFT,SHOPLIFTING,Evidential Difficulties Victim Based-Named Sus...,N,4,2025-01-05,2023,1,Sunday
3,2023-01-01,Business Crime Outcomes,Barnet,VIOLENCE AGAINST THE PERSON,VIOLENCE WITHOUT INJURY,Evidential Difficulties Victim Based-Named Sus...,N,1,2025-01-05,2023,1,Sunday
4,2023-01-01,Business Crime Outcomes,Barnet,VIOLENCE AGAINST THE PERSON,VIOLENCE WITHOUT INJURY,Investigation Complete; No Suspect Identified....,N,1,2025-01-05,2023,1,Sunday
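The `.dt` accessor used above can be tried on a couple of dates in isolation. This sketch uses two dates that appear in the dataset preview and builds the same three derived columns.

```python
import pandas as pd

# Two sample dates converted to datetime
dates = pd.to_datetime(pd.Series(["2023-01-01", "2024-12-21"]))

# Derive year, month number, and weekday name from each date
parts = pd.DataFrame(
    {
        "Year": dates.dt.year,
        "Month": dates.dt.month,
        "Weekday": dates.dt.day_name(),
    }
)
print(parts)
```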


Delete unnecessary columns

In [19]:
df.drop(columns=['Measure', 'Refresh Date'], inplace=True)
df.head()

Unnamed: 0,Date,Borough,Crime Section,Crime group,Outcome,Positive Outcome,Outcome Count,Year,Month,Weekday
0,2023-01-01,Barnet,PUBLIC ORDER OFFENCES,PUBLIC FEAR ALARM OR DISTRESS,Evidential Difficulties Victim Based-Named Sus...,N,1,2023,1,Sunday
1,2023-01-01,Barnet,ROBBERY,ROBBERY OF BUSINESS PROPERTY,Investigation Complete; No Suspect Identified....,N,1,2023,1,Sunday
2,2023-01-01,Barnet,THEFT,SHOPLIFTING,Evidential Difficulties Victim Based-Named Sus...,N,4,2023,1,Sunday
3,2023-01-01,Barnet,VIOLENCE AGAINST THE PERSON,VIOLENCE WITHOUT INJURY,Evidential Difficulties Victim Based-Named Sus...,N,1,2023,1,Sunday
4,2023-01-01,Barnet,VIOLENCE AGAINST THE PERSON,VIOLENCE WITHOUT INJURY,Investigation Complete; No Suspect Identified....,N,1,2023,1,Sunday


### Save cleaned data as csv

In [20]:
df.to_csv("../dataset/cleaned_data.csv", index=False)
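CSV files do not store data types, so the Date column will be read back as text unless it is parsed again on load. The sketch below shows a round trip on a toy frame, using an in-memory buffer instead of a real file so it can run anywhere.

```python
import io

import pandas as pd

# Toy frame with a datetime column (illustrative values)
toy = pd.DataFrame(
    {"Date": pd.to_datetime(["2023-01-01", "2023-01-02"]), "Outcome Count": [1, 4]}
)

# Write to an in-memory buffer rather than a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype when loading
reloaded = pd.read_csv(buffer, parse_dates=["Date"])
print(reloaded.dtypes)
```

The same `parse_dates` argument could be used when loading `cleaned_data.csv` in later notebooks.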

##### The next step is Exploratory Data Analysis, which can be found [here](https://github.com/Kaori61/crime-data-analysis/blob/main/jupyter_notebooks/exploratory_data_analysis.ipynb).
Go back to [README](https://github.com/Kaori61/crime-data-analysis/blob/main/README.md)