# **A detailed analysis of crime in London**

## Objectives

* Extract the dataset from Kaggle and transform the data to prepare it for analysis.
* Conduct descriptive analysis to understand the basic characteristics of the data.
* Visualise the data with Matplotlib, Seaborn and Plotly.
* Draw conclusions, develop reports based on the data, and prepare for presentation.

## Inputs

* The `london_crime_by_lsoa.csv` file from the Kaggle dataset `jboysen/london-crime`, downloaded with `kagglehub`.

## Outputs

* Write here which files, code or artefacts you generate by the end of the notebook 

## Additional Comments

* If you have any additional comments that don't fit in the previous bullets, please state them here. 



---

# Change working directory

* We assume you store the notebooks in a subfolder, so when running the notebook in the editor you need to change the working directory.

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [72]:
import os
current_dir = os.getcwd()
current_dir

'/'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [73]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [74]:
current_dir = os.getcwd()
current_dir

'/'
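
Re-running the cell above moves the working directory up one more level each time, until it reaches the root. The sketch below is one possible guard, not part of the original notebook: `parent_directory` wraps the same `os.path.dirname()` call, and `change_to_parent_once` only moves up when a marker file is absent from the current folder. The marker name `README.md` and the example path are assumptions chosen for illustration.

```python
import os


def parent_directory(directory: str) -> str:
    """Return the parent of `directory` (the root directory is its own parent)."""
    return os.path.dirname(os.path.abspath(directory))


def change_to_parent_once(marker: str = "README.md") -> str:
    """Move up one level unless the current folder already contains `marker`.

    `marker` is a hypothetical file name assumed to exist only in the project root.
    """
    if not os.path.exists(marker):
        os.chdir(parent_directory(os.getcwd()))
    return os.getcwd()


# Hypothetical notebook location, used only to show the path arithmetic
print(parent_directory("/workspace/project/notebooks"))
```

Calling `change_to_parent_once()` in place of the plain `os.chdir(...)` cell would make the step safe to run more than once, provided the marker assumption holds for your repository.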

**Import packages required.**


In [75]:
# Data manipulation and analysis
import pandas as pd 
import numpy as np
import os

# File operations
import shutil

# Data visualisation
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go


---

**Data extraction**


In [76]:
!pip install kagglehub



In [77]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("jboysen/london-crime")

print("Path to dataset files:", path)

Path to dataset files: /home/gitpod/.cache/kagglehub/datasets/jboysen/london-crime/versions/1
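
Before loading anything, it can help to confirm which files the download folder actually contains. The following is a minimal sketch under the assumption that the dataset is delivered as one or more CSV files somewhere under the returned path; `list_csv_files` is a hypothetical helper, and the demonstration uses a throwaway temporary folder rather than the real download.

```python
import tempfile
from pathlib import Path


def list_csv_files(dataset_dir: str) -> list:
    """Return every CSV file found under `dataset_dir`, sorted by path."""
    return sorted(Path(dataset_dir).rglob("*.csv"))


# Demonstration on a temporary folder standing in for the download directory
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "london_crime_by_lsoa.csv").write_text("borough,value\n")
    (Path(tmp) / "notes.txt").write_text("not a csv\n")
    found = [p.name for p in list_csv_files(tmp)]

print(found)
```

In the notebook itself, `list_csv_files(path)` could be called with the `path` returned by `kagglehub.dataset_download`.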


**Data transformation**

In [78]:
# Load the dataset with pandas
df = pd.read_csv(os.path.join(path, "london_crime_by_lsoa.csv"))

In [79]:
# Display basic information about the dataset
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13490604 entries, 0 to 13490603
Data columns (total 7 columns):
 #   Column          Dtype 
---  ------          ----- 
 0   lsoa_code       object
 1   borough         object
 2   major_category  object
 3   minor_category  object
 4   value           int64 
 5   year            int64 
 6   month           int64 
dtypes: int64(3), object(4)
memory usage: 720.5+ MB
None
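
At roughly 720 MB, the frame is large enough that dtype choices matter. One common option, sketched here on a small made-up sample rather than the real data, is to store repeated text columns as `category` and downcast integer columns. The `shrink` helper and the sample rows are illustrative assumptions, and the actual saving on the full dataset is not measured here.

```python
import pandas as pd


def shrink(frame: pd.DataFrame, text_columns: list, int_columns: list) -> pd.DataFrame:
    """Return a copy with text columns as categories and integers downcast."""
    out = frame.copy()
    for column in text_columns:
        out[column] = out[column].astype("category")
    for column in int_columns:
        # Picks the smallest integer dtype that can hold every value
        out[column] = pd.to_numeric(out[column], downcast="integer")
    return out


# Toy sample with the same column layout as part of the dataset
sample = pd.DataFrame({
    "borough": ["Croydon", "Greenwich", "Croydon", "Bromley"] * 250,
    "value": [0, 1, 2, 0] * 250,
    "year": [2016, 2016, 2015, 2008] * 250,
    "month": [11, 11, 5, 6] * 250,
})

small = shrink(sample, ["borough"], ["value", "year", "month"])
before = sample.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
print(f"bytes before: {before}, bytes after: {after}")
```

Category columns behave like strings in `groupby` and `isin`, so the later cells should work unchanged, though that is worth verifying on the real frame.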


In [80]:
# Display the first few rows of the dataset
print(df.head())

   lsoa_code     borough               major_category  \
0  E01001116     Croydon                     Burglary   
1  E01001646   Greenwich  Violence Against the Person   
2  E01000677     Bromley  Violence Against the Person   
3  E01003774   Redbridge                     Burglary   
4  E01004563  Wandsworth                      Robbery   

                minor_category  value  year  month  
0  Burglary in Other Buildings      0  2016     11  
1               Other violence      0  2016     11  
2               Other violence      0  2015      5  
3  Burglary in Other Buildings      0  2016      3  
4            Personal Property      0  2008      6  


In [81]:
# Generate a summary of the statistics
print(df.describe())

              value          year         month
count  1.349060e+07  1.349060e+07  1.349060e+07
mean   4.779444e-01  2.012000e+03  6.500000e+00
std    1.771513e+00  2.581989e+00  3.452053e+00
min    0.000000e+00  2.008000e+03  1.000000e+00
25%    0.000000e+00  2.010000e+03  3.750000e+00
50%    0.000000e+00  2.012000e+03  6.500000e+00
75%    1.000000e+00  2.014000e+03  9.250000e+00
max    3.090000e+02  2.016000e+03  1.200000e+01
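
The summary shows a median `value` of 0, so at least half of the records report no crime for their area, category and month. A possible follow-up, sketched here on toy numbers that are not taken from the dataset, is to measure the share of zero rows and describe only the non-zero ones.

```python
import pandas as pd

# Toy values standing in for the `value` column
toy = pd.DataFrame({"value": [0, 0, 0, 1, 4]})

# Fraction of records with no recorded crime
zero_share = (toy["value"] == 0).mean()

# Summary statistics restricted to records with at least one crime
nonzero_summary = toy.loc[toy["value"] > 0, "value"].describe()

print(f"share of zero rows: {zero_share:.2f}")
print(nonzero_summary)
```

Applied to `df`, the same two expressions would show how much of the table is zero padding, which also affects how the per-record means later in the notebook should be read.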


In [82]:
# Check for missing values and data types
print(df.isnull().sum())
print(df.dtypes)

lsoa_code         0
borough           0
major_category    0
minor_category    0
value             0
year              0
month             0
dtype: int64
lsoa_code         object
borough           object
major_category    object
minor_category    object
value              int64
year               int64
month              int64
dtype: object


In [None]:
# Group data by borough and calculate total crimes per borough
total_crimes_by_borough = df.groupby('borough')['value'].sum().sort_values()
print(total_crimes_by_borough)

borough
City of London               780
Kingston upon Thames       89306
Richmond upon Thames       96771
Sutton                    100987
Bexley                    114136
Merton                    115654
Harrow                    116848
Havering                  138947
Barking and Dagenham      149447
Kensington and Chelsea    171981
Greenwich                 181568
Redbridge                 183562
Bromley                   184349
Hammersmith and Fulham    185259
Hounslow                  186772
Enfield                   193880
Waltham Forest            203879
Wandsworth                204741
Hillingdon                209680
Barnet                    212191
Haringey                  213272
Lewisham                  215137
Hackney                   217119
Brent                     227551
Tower Hamlets             228613
Islington                 230286
Ealing                    251562
Croydon                   260294
Newham                    262024
Camden                    275147
...

In [None]:
# List the central boroughs and run a comparison to the outer boroughs
central_boroughs = ['City of London', 'Westminster', 'Camden', 'Islington', 'Kensington and Chelsea']
df['is_central'] = df['borough'].isin(central_boroughs)
central_vs_outer = df.groupby('is_central')['value'].mean()
print(central_vs_outer)

is_central
False    0.439952
True     0.803246
Name: value, dtype: float64
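
The figures above are means per record, that is, per LSOA, category and month. A complementary view, sketched below on a made-up miniature table, first totals crimes per borough and then compares the two groups. The toy boroughs and counts are assumptions for illustration only, and the labels `central` and `outer` replace the boolean flag purely for readability.

```python
import pandas as pd

# Toy records; the counts are invented for the example
toy = pd.DataFrame({
    "borough": ["Camden", "Camden", "Islington", "Croydon", "Croydon", "Bexley"],
    "value": [4, 2, 3, 1, 2, 1],
})
central = ["Camden", "Islington"]

# Total crimes per borough, then label each borough as central or outer
per_borough = toy.groupby("borough", as_index=False)["value"].sum()
per_borough["area"] = per_borough["borough"].isin(central).map(
    {True: "central", False: "outer"}
)

# Group totals, mean per borough, and number of boroughs in each group
summary = per_borough.groupby("area")["value"].agg(["sum", "mean", "count"])
print(summary)
```

Comparing per-borough totals avoids weighting the result by how many LSOA records each borough happens to have.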


In [85]:
# Find the most common types of crime in each borough
top_crimes_summary = (
    df.groupby(['borough', 'major_category'])['value']
    .sum()
    .reset_index()
    .sort_values(['borough', 'value'], ascending=[True, False])
    .groupby('borough')
    .head(3)
)

print(top_crimes_summary)

                  borough               major_category   value
7    Barking and Dagenham           Theft and Handling   50999
8    Barking and Dagenham  Violence Against the Person   43091
1    Barking and Dagenham              Criminal Damage   18888
16                 Barnet           Theft and Handling   87285
17                 Barnet  Violence Against the Person   46565
..                    ...                          ...     ...
285            Wandsworth  Violence Against the Person   45865
277            Wandsworth                     Burglary   25533
293           Westminster           Theft and Handling  277617
294           Westminster  Violence Against the Person   71448
288           Westminster                        Drugs   34031

[99 rows x 3 columns]
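
The long table above is truncated when printed. One way to make it easier to scan, shown here as a sketch on invented numbers, is to rank categories within each borough and pivot the result so that each borough occupies a single row. The toy data keeps only two ranks, and the column name `rank` is an assumption introduced for this example.

```python
import pandas as pd

# Toy totals; the counts are invented for the example
toy = pd.DataFrame({
    "borough": ["Barnet"] * 3 + ["Westminster"] * 3,
    "major_category": ["Theft and Handling", "Burglary", "Drugs"] * 2,
    "value": [50, 30, 10, 90, 20, 40],
})

totals = toy.groupby(["borough", "major_category"], as_index=False)["value"].sum()
totals = totals.sort_values(["borough", "value"], ascending=[True, False])

# Position of each category within its borough, starting at 1
totals["rank"] = totals.groupby("borough").cumcount() + 1

# One row per borough, one column per rank
top = totals[totals["rank"] <= 2]
wide = top.pivot(index="borough", columns="rank", values="major_category")
print(wide)
```

With the real data, changing the filter to `<= 3` would reproduce the top three categories per borough in 33 rows instead of 99.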


In [86]:
# Analyse trends over time: build a monthly date column and total crimes per borough per month
if 'year' in df.columns and 'month' in df.columns:
    df['date'] = pd.to_datetime(df['year'].astype(str) + '-' + df['month'].astype(str))
    time_trend = df.groupby(['date', 'borough'])['value'].sum().unstack()
    print(time_trend)

borough     Barking and Dagenham  Barnet  Bexley  Brent  Bromley  Camden  \
date                                                                       
2008-01-01                  1615    2134    1346   2136     2097    2610   
2008-02-01                  1580    1861    1296   1895     1988    2608   
2008-03-01                  1417    1992    1342   1946     1923    2720   
2008-04-01                  1522    1999    1240   1797     1869    2532   
2008-05-01                  1460    2144    1280   2026     2027    2680   
...                          ...     ...     ...    ...      ...     ...   
2016-08-01                  1346    2043     994   2269     1537    2476   
2016-09-01                  1363    2110    1106   2185     1595    2425   
2016-10-01                  1323    2074    1148   2366     1758    2542   
2016-11-01                  1316    2032    1152   2158     1755    2419   
2016-12-01                  1278    1975    1182   2285     1860    2628   

borough    
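
The monthly table can be rolled up further. The sketch below, which uses a tiny invented sample, builds the date column by passing year, month and day columns to `pd.to_datetime` instead of concatenating strings, then sums the monthly table by calendar year. The toy rows are assumptions, and this is only one of several ways to construct the dates.

```python
import pandas as pd

# Toy records; the counts are invented for the example
toy = pd.DataFrame({
    "borough": ["Camden", "Camden", "Barnet", "Barnet"],
    "value": [5, 7, 2, 3],
    "year": [2008, 2008, 2008, 2009],
    "month": [1, 2, 1, 1],
})

# pd.to_datetime accepts a frame with year, month and day columns
toy["date"] = pd.to_datetime(toy[["year", "month"]].assign(day=1))

# Monthly totals per borough, with missing combinations filled with 0
monthly = toy.groupby(["date", "borough"])["value"].sum().unstack(fill_value=0)

# Yearly totals per borough, grouped on the year of the date index
yearly = monthly.groupby(monthly.index.year).sum()
print(yearly)
```

A yearly table like this is a natural input for a line chart of long-term trends per borough.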

---

NOTE

* You may add as many sections as you want, as long as they support your project workflow.
* All notebook cells should run top-down. Do not create a workflow in which, at a given point, you need to go back to a previous cell to execute a task, such as refreshing the content of a variable.

---

# Push files to Repo

* In cases where you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.