# London Business Crime investigations

### Navigation
* [README](https://github.com/Kaori61/crime-data-analysis/blob/main/README.md)
* [Clean data](https://github.com/Kaori61/crime-data-analysis/blob/main/dataset/data_cleaned.csv)
* [Raw data](https://github.com/Kaori61/crime-data-analysis/blob/main/dataset/raw_data.csv)
* [Exploratory data analysis (EDA)](https://github.com/Kaori61/crime-data-analysis/blob/main/jupyter_notebooks/exploratory_data_analysis.ipynb)
* [Statistical analysis](https://github.com/Kaori61/crime-data-analysis/blob/main/jupyter_notebooks/statistical_analysis.ipynb)
* [Dashboard - Business Crime Trends in London](https://public.tableau.com/views/LondonBusinessCrimeAnalysis/Dashboard1?:language=en-GB&:sid=&:redirect=auth&:display_count=n&:origin=viz_share_link)


### Import libraries and load data

In [1]:
# import necessary libraries
import pandas as pd
import numpy as np

In [None]:
# loading csv data 
raw = pd.read_csv('../dataset/raw_data.csv')

# sample data to reduce dataset. Using 10% of the original rows.
df = raw.sample(frac=0.1, random_state=42)
df.to_csv("../dataset/sample_data.csv", index=False)
df.head()

Unnamed: 0,Date,Measure,Borough,Crime Section,Crime group,Outcome,Positive Outcome,Outcome Count,Refresh Date
248172,2024-06-10,Business Crime Outcomes,Southwark,DRUG OFFENCES,POSSESSION OF DRUGS,Charged/Summonsed/Postal Requisition,Y,1,2025-01-05
298165,2024-09-04,Business Crime Outcomes,Lewisham,PUBLIC ORDER OFFENCES,PUBLIC FEAR ALARM OR DISTRESS,Investigation Complete; No Suspect Identified....,N,1,2025-01-05
191832,2024-01-16,Business Crime Outcomes,Other,VIOLENCE AGAINST THE PERSON,VIOLENCE WITHOUT INJURY,Investigation Complete; No Suspect Identified....,N,2,2025-01-05
84441,2023-06-08,Business Crime Outcomes,Islington,THEFT,SHOPLIFTING,Investigation Complete; No Suspect Identified....,N,18,2025-01-05
386902,2024-03-17,Business Crime Outcomes,Newham,THEFT,SHOPLIFTING,Investigation Complete; No Suspect Identified....,N,3,2025-01-05


In [3]:
# check the dataset is loaded correctly 
df.shape

(41585, 9)
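
One way to check that a 10% sample still reflects the full dataset is to compare category shares before and after sampling. The sketch below is illustrative only: it uses a small synthetic frame instead of `raw_data.csv`, and the helper name `share_gap` is hypothetical.

```python
import pandas as pd

def share_gap(full: pd.DataFrame, sample: pd.DataFrame, col: str) -> pd.Series:
    """Absolute difference in category shares between the full data and a sample."""
    full_share = full[col].value_counts(normalize=True)
    sample_share = sample[col].value_counts(normalize=True)
    # A category missing from the sample is treated as a share of 0
    return full_share.sub(sample_share, fill_value=0).abs().sort_values(ascending=False)

# Synthetic stand-in for the raw data (assumption: not the real dataset)
toy = pd.DataFrame({"Borough": ["Camden"] * 600 + ["Newham"] * 300 + ["Sutton"] * 100})
toy_sample = toy.sample(frac=0.1, random_state=42)

gaps = share_gap(toy, toy_sample, "Borough")
print(gaps)
```

Small gaps suggest the sample keeps roughly the same borough mix as the source; large gaps would be a reason to sample again with a larger fraction or to stratify.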

### Data cleaning

##### Check for Null values

In [4]:
# check for null values
df.isnull().sum()

Date                  0
Measure               0
Borough               0
Crime Section         0
Crime group           0
Outcome             145
Positive Outcome      0
Outcome Count         0
Refresh Date          0
dtype: int64

The Outcome column contains the description of the crime outcome. A missing value in this column likely means that no outcome has been recorded yet. Therefore, null values in the Outcome column will be replaced with "No outcome yet".

##### Handling missing Outcome values

In [5]:
# check rows with missing outcome values
df[df['Outcome'].isnull()]

Unnamed: 0,Date,Measure,Borough,Crime Section,Crime group,Outcome,Positive Outcome,Outcome Count,Refresh Date
288084,2024-08-16,Business Crime Outcomes,Kensington and Chelsea,VIOLENCE AGAINST THE PERSON,VIOLENCE WITHOUT INJURY,,N,1,2025-01-05
273330,2024-08-16,Business Crime Outcomes,Wandsworth,PUBLIC ORDER OFFENCES,PUBLIC FEAR ALARM OR DISTRESS,,N,1,2025-01-05
237436,2024-06-07,Business Crime Outcomes,Brent,THEFT,OTHER THEFT,,N,1,2025-01-05
327298,2024-11-10,Business Crime Outcomes,Hillingdon,VIOLENCE AGAINST THE PERSON,VIOLENCE WITHOUT INJURY,,N,1,2025-01-05
288820,2024-08-28,Business Crime Outcomes,Bexley,VIOLENCE AGAINST THE PERSON,VIOLENCE WITHOUT INJURY,,N,1,2025-01-05
...,...,...,...,...,...,...,...,...,...
325926,2024-10-22,Business Crime Outcomes,Tower Hamlets,ARSON AND CRIMINAL DAMAGE,CRIMINAL DAMAGE,,N,1,2025-01-05
293500,2024-09-18,Business Crime Outcomes,Barking and Dagenham,PUBLIC ORDER OFFENCES,PUBLIC FEAR ALARM OR DISTRESS,,N,1,2025-01-05
315501,2024-10-01,Business Crime Outcomes,Tower Hamlets,PUBLIC ORDER OFFENCES,RACE OR RELIGIOUS AGG PUBLIC FEAR,,N,1,2025-01-05
273416,2024-06-20,Business Crime Outcomes,Haringey,PUBLIC ORDER OFFENCES,RACE OR RELIGIOUS AGG PUBLIC FEAR,,N,1,2025-01-05


In [6]:
# replacing null values in Outcome with 'No outcome yet'
df['Outcome'] = df['Outcome'].fillna('No outcome yet')

# check null values are replaced
(df['Outcome'] == 'No outcome yet').sum()

np.int64(145)

The 145 missing values in the Outcome column have been replaced with 'No outcome yet'.
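
The assumption that a missing Outcome means "not recorded yet" could be checked by looking at when those rows occur: if they cluster in the most recent months, the interpretation is more plausible. This is a minimal sketch on synthetic rows (the values are invented for illustration and not taken from the dataset).

```python
import pandas as pd

# Synthetic rows (assumption: illustrative only)
toy = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2023-06-08", "2024-01-16", "2024-08-16", "2024-10-22", "2024-11-10"]
    ),
    "Outcome": ["Charged", "No suspect", None, None, None],
})

missing = toy["Outcome"].isna()

# Share of rows in each month that have no outcome recorded
missing_by_month = missing.groupby(toy["Date"].dt.to_period("M")).mean()
print(missing_by_month)

# Earliest date among the missing rows, compared with the earliest date overall
earliest_missing = toy.loc[missing, "Date"].min()
print(earliest_missing, toy["Date"].min())
```

This check needs to run before the `fillna` step, or be adapted to filter on the 'No outcome yet' label instead of `isna()`.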

##### Handling other null values

In [7]:
# check missing value in Date column
df[df['Date'].isnull()]

Unnamed: 0,Date,Measure,Borough,Crime Section,Crime group,Outcome,Positive Outcome,Outcome Count,Refresh Date


In [8]:
# check null value
df.isnull().sum()

Date                0
Measure             0
Borough             0
Crime Section       0
Crime group         0
Outcome             0
Positive Outcome    0
Outcome Count       0
Refresh Date        0
dtype: int64

All missing values are handled now.

### Check data type

In [9]:
df.dtypes

Date                object
Measure             object
Borough             object
Crime Section       object
Crime group         object
Outcome             object
Positive Outcome    object
Outcome Count        int64
Refresh Date        object
dtype: object

Date and Refresh Date are stored as strings (object), so I will convert both to a datetime type.

In [10]:
# convert into date format
df['Date'] = pd.to_datetime(df['Date'])
df['Refresh Date'] = pd.to_datetime(df['Refresh Date'])
df.dtypes

Date                datetime64[ns]
Measure                     object
Borough                     object
Crime Section               object
Crime group                 object
Outcome                     object
Positive Outcome            object
Outcome Count                int64
Refresh Date        datetime64[ns]
dtype: object
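
If some date strings might not follow the `YYYY-MM-DD` pattern, a stricter conversion can surface them instead of raising an error or guessing the format. The sketch below assumes that pattern and uses synthetic values, with one deliberately malformed entry.

```python
import pandas as pd

# Synthetic values (assumption: the last entry is deliberately malformed)
dates = pd.Series(["2024-06-10", "2024-09-04", "10/06/2024"])

# errors="coerce" turns anything that does not match the format into NaT
parsed = pd.to_datetime(dates, format="%Y-%m-%d", errors="coerce")

# Rows that failed to parse can be inspected before deciding how to handle them
bad_rows = dates[parsed.isna()]
print(bad_rows)
```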

Outcome Count should be an integer rather than a float. It already loads as int64 here, so the cast below simply makes the type explicit.

In [11]:
# convert into integer
df['Outcome Count'] = df['Outcome Count'].astype(int)
df.dtypes

Date                datetime64[ns]
Measure                     object
Borough                     object
Crime Section               object
Crime group                 object
Outcome                     object
Positive Outcome            object
Outcome Count                int64
Refresh Date        datetime64[ns]
dtype: object
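
When a count column does arrive as float, `astype(int)` truncates fractional values silently and fails on missing values, so it can help to verify the data first. A minimal sketch on a synthetic series (not the real column):

```python
import pandas as pd

# Synthetic float counts (assumption: illustrative only)
counts = pd.Series([1.0, 18.0, 3.0])

# Only cast when every value is present and is a whole number
if counts.notna().all() and (counts % 1 == 0).all():
    counts = counts.astype(int)

print(counts.dtype)
```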

### Check inconsistent categorical values

In [12]:
# check how many unique values each categorical column has
df.select_dtypes(include='object').nunique()

Measure              1
Borough             33
Crime Section       12
Crime group         30
Outcome             31
Positive Outcome     2
dtype: int64

In [13]:
# check which unique values appear in each categorical column
for col in df.select_dtypes(include='object').columns:
    print(f"\n Unique values in '{col}':")
    print(df[col].value_counts(dropna=False))


 Unique values in 'Measure':
Measure
Business Crime Outcomes    41585
Name: count, dtype: int64

 Unique values in 'Borough':
Borough
Westminster               2323
Newham                    1668
Camden                    1636
Tower Hamlets             1569
Lambeth                   1558
Brent                     1495
Hackney                   1491
Southwark                 1482
Islington                 1429
Other                     1426
Wandsworth                1378
Hillingdon                1375
Ealing                    1374
Lewisham                  1363
Croydon                   1359
Greenwich                 1341
Haringey                  1328
Barnet                    1255
Enfield                   1215
Redbridge                 1214
Hammersmith and Fulham    1206
Hounslow                  1198
Bromley                   1171
Waltham Forest            1051
Kensington and Chelsea    1035
Havering                  1035
Barking and Dagenham      1025
Harrow                     859
Kingston upon Thames       834
Merton                     820
Bexley                     761
Sutton                     683
Richmond upon Thames       628

This analysis investigates geographical crime trends, so rows with an unspecified location aren't useful. Therefore, I will drop the rows where Borough is 'Other'.

In [14]:
# keep only the rows where Borough is not 'Other'
# .copy() avoids a SettingWithCopyWarning when new columns are added later
df = df[df['Borough'] != 'Other'].copy()

# check if the value is deleted correctly
df['Borough'].value_counts()

Borough
Westminster               2323
Newham                    1668
Camden                    1636
Tower Hamlets             1569
Lambeth                   1558
Brent                     1495
Hackney                   1491
Southwark                 1482
Islington                 1429
Wandsworth                1378
Hillingdon                1375
Ealing                    1374
Lewisham                  1363
Croydon                   1359
Greenwich                 1341
Haringey                  1328
Barnet                    1255
Enfield                   1215
Redbridge                 1214
Hammersmith and Fulham    1206
Hounslow                  1198
Bromley                   1171
Waltham Forest            1051
Kensington and Chelsea    1035
Havering                  1035
Barking and Dagenham      1025
Harrow                     859
Kingston upon Thames       834
Merton                     820
Bexley                     761
Sutton                     683
Richmond upon Thames       628
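
It can be useful to record how many rows a filter removes. In the sample above, 'Other' accounts for 1,426 of the 41,585 rows. The sketch below shows the same bookkeeping on a tiny synthetic frame.

```python
import pandas as pd

# Synthetic frame (assumption: illustrative only)
toy = pd.DataFrame({"Borough": ["Camden", "Other", "Newham", "Other"]})

before = len(toy)
filtered = toy[toy["Borough"] != "Other"].copy()
removed = before - len(filtered)

print(f"Removed {removed} of {before} rows ({removed / before:.1%})")
```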


In [15]:
# check the unique value again
df.select_dtypes(include='object').nunique()

Measure              1
Borough             32
Crime Section       12
Crime group         30
Outcome             31
Positive Outcome     2
dtype: int64
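
Reading through `value_counts()` output catches obvious problems, but labels that differ only in case or surrounding whitespace are easy to miss. The following sketch flags such variants automatically. The labels are invented for illustration; no such inconsistencies were reported in this dataset.

```python
import pandas as pd

# Synthetic labels (assumption: the variants below are invented)
labels = pd.Series(["Camden", "camden ", "Newham", "NEWHAM", "Brent"])

# Normalise: trim whitespace and fold case
normalised = labels.str.strip().str.casefold()

# Count distinct raw spellings per normalised label; more than 1 flags an inconsistency
variants = labels.groupby(normalised).nunique()
inconsistent = variants[variants > 1]
print(inconsistent)
```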

##### Check for duplicates

In [16]:
df.duplicated().sum()

np.int64(0)

No duplicates found. The data cleaning is complete.
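
Had the count been non-zero, the duplicated rows could be inspected before removal. A minimal sketch on synthetic rows:

```python
import pandas as pd

# Synthetic frame with one repeated row (assumption: illustrative only)
toy = pd.DataFrame({
    "Borough": ["Camden", "Camden", "Newham"],
    "Outcome Count": [1, 1, 3],
})

# keep=False marks every member of a duplicate group, not just the repeats
dupes = toy[toy.duplicated(keep=False)]
print(dupes)

# drop_duplicates keeps the first occurrence of each group by default
deduped = toy.drop_duplicates()
```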

### Create new columns

I would like to create separate Year, Month, and Weekday columns for the analysis.

In [17]:
# create new columns
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Weekday'] = df['Date'].dt.day_name()

# check for new columns
df.head() 

Unnamed: 0,Date,Measure,Borough,Crime Section,Crime group,Outcome,Positive Outcome,Outcome Count,Refresh Date,Year,Month,Weekday
248172,2024-06-10,Business Crime Outcomes,Southwark,DRUG OFFENCES,POSSESSION OF DRUGS,Charged/Summonsed/Postal Requisition,Y,1,2025-01-05,2024,6,Monday
298165,2024-09-04,Business Crime Outcomes,Lewisham,PUBLIC ORDER OFFENCES,PUBLIC FEAR ALARM OR DISTRESS,Investigation Complete; No Suspect Identified....,N,1,2025-01-05,2024,9,Wednesday
84441,2023-06-08,Business Crime Outcomes,Islington,THEFT,SHOPLIFTING,Investigation Complete; No Suspect Identified....,N,18,2025-01-05,2023,6,Thursday
386902,2024-03-17,Business Crime Outcomes,Newham,THEFT,SHOPLIFTING,Investigation Complete; No Suspect Identified....,N,3,2025-01-05,2024,3,Sunday
259355,2024-05-25,Business Crime Outcomes,Lambeth,DRUG OFFENCES,POSSESSION OF DRUGS,Community resolution (Crime),Y,7,2025-01-05,2024,5,Saturday
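
Because `day_name()` returns plain strings, sorting or plotting by Weekday would follow alphabetical order. One option, shown here as a sketch on synthetic dates and not as part of the original cleaning steps, is to store the column as an ordered categorical.

```python
import pandas as pd

weekday_order = [
    "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
]

# Synthetic dates (assumption: illustrative only)
toy = pd.DataFrame({"Date": pd.to_datetime(["2024-09-04", "2024-06-10", "2023-06-08"])})
toy["Weekday"] = pd.Categorical(
    toy["Date"].dt.day_name(), categories=weekday_order, ordered=True
)

# Sorting now follows the calendar week instead of the alphabet
sorted_days = toy.sort_values("Weekday")["Weekday"].tolist()
print(sorted_days)
```

Note that the ordering is lost when the frame is written to CSV, so it would need to be reapplied after loading.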


Delete unnecessary columns.

Measure refers to the type of crime, and this dataset contains only one type (business crime), so I will delete the column.
Refresh Date is the date on which the dataset was last updated, which isn't needed for this analysis.

In [18]:
df.drop(columns=['Measure', 'Refresh Date'], inplace=True)
df.head()

Unnamed: 0,Date,Borough,Crime Section,Crime group,Outcome,Positive Outcome,Outcome Count,Year,Month,Weekday
248172,2024-06-10,Southwark,DRUG OFFENCES,POSSESSION OF DRUGS,Charged/Summonsed/Postal Requisition,Y,1,2024,6,Monday
298165,2024-09-04,Lewisham,PUBLIC ORDER OFFENCES,PUBLIC FEAR ALARM OR DISTRESS,Investigation Complete; No Suspect Identified....,N,1,2024,9,Wednesday
84441,2023-06-08,Islington,THEFT,SHOPLIFTING,Investigation Complete; No Suspect Identified....,N,18,2023,6,Thursday
386902,2024-03-17,Newham,THEFT,SHOPLIFTING,Investigation Complete; No Suspect Identified....,N,3,2024,3,Sunday
259355,2024-05-25,Lambeth,DRUG OFFENCES,POSSESSION OF DRUGS,Community resolution (Crime),Y,7,2024,5,Saturday


### Save cleaned data as csv

In [19]:
df.to_csv("../dataset/data_cleaned.csv", index=False)
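
CSV files do not store column types, so the Date column is read back as text unless it is parsed again on load. The sketch below illustrates this with an in-memory buffer and a synthetic frame standing in for the cleaned data.

```python
import io
import pandas as pd

# Synthetic frame (assumption: stands in for the cleaned data)
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2024-06-10", "2023-06-08"]),
    "Outcome Count": [1, 18],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Without parse_dates the Date column comes back as plain strings
buffer.seek(0)
plain = pd.read_csv(buffer)

# With parse_dates the column is converted back to datetime on load
buffer.seek(0)
typed = pd.read_csv(buffer, parse_dates=["Date"])

print(plain["Date"].dtype, typed["Date"].dtype)
```

In later notebooks, passing `parse_dates=["Date"]` when reading `data_cleaned.csv` would avoid repeating the conversion by hand.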

##### The next step is Exploratory Data Analysis, which can be found [here](https://github.com/Kaori61/crime-data-analysis/blob/main/jupyter_notebooks/exploratory_data_analysis.ipynb).
Go back to [README](https://github.com/Kaori61/crime-data-analysis/blob/main/README.md) / [Statistical analysis](https://github.com/Kaori61/crime-data-analysis/blob/main/jupyter_notebooks/statistical_analysis.ipynb) / [Dashboard](https://public.tableau.com/views/LondonBusinessCrimeAnalysis/Dashboard1?:language=en-GB&:sid=&:redirect=auth&:display_count=n&:origin=viz_share_link)
