# Importing Data & Cleaning

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import glob
import os

<b> Import the PSNI street crime data covering 08-2015 to 05-2018 and display the first five rows. </b>

In [2]:
src_dir = '../../data/PSNI_StreetCrime_1Year'
list_ = []
# Walk the source directory and read every CSV file found in each folder
for root, dirs, files in os.walk(src_dir):
    allFiles = glob.glob(os.path.join(root, "*.csv"))
    for file in allFiles:
        year_df = pd.read_csv(file, index_col=None, header=0)
        list_.append(year_df)
# Concatenate once, after all files are read, instead of on every iteration
psni_crime_data = pd.concat(list_)
psni_crime_data.head()

Unnamed: 0,Crime ID,Month,Reported by,Falls within,Longitude,Latitude,Location,LSOA code,LSOA name,Crime type,Last outcome category,Context
0,,2015-08,Police Service of Northern Ireland,Police Service of Northern Ireland,-7.378949,54.717334,On or near Dublin Street,,,Anti-social behaviour,,
1,,2015-08,Police Service of Northern Ireland,Police Service of Northern Ireland,-5.891082,54.222501,On or near Bracken Avenue,,,Anti-social behaviour,,
2,,2015-08,Police Service of Northern Ireland,Police Service of Northern Ireland,-5.667276,54.663573,On or near High Street,,,Anti-social behaviour,,
3,,2015-08,Police Service of Northern Ireland,Police Service of Northern Ireland,-5.96233,54.587243,On or near Rodney Parade,,,Anti-social behaviour,,
4,,2015-08,Police Service of Northern Ireland,Police Service of Northern Ireland,-5.894063,54.590423,On or near Brenda Street,,,Anti-social behaviour,,
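The loading step above can also be written as a small reusable function. The following is a minimal sketch, not the notebook's own code: the function name `load_crime_csvs` and the toy files are hypothetical, and `ignore_index=True` is an assumption about what is wanted (by default `pd.concat` keeps each file's own row numbers, so index values repeat across files).

```python
from pathlib import Path
import tempfile

import pandas as pd


def load_crime_csvs(src_dir):
    """Read every CSV under src_dir (recursively) into one DataFrame."""
    csv_paths = sorted(Path(src_dir).rglob("*.csv"))
    if not csv_paths:
        raise FileNotFoundError(f"No CSV files found under {src_dir}")
    frames = [pd.read_csv(path) for path in csv_paths]
    # ignore_index=True builds a single 0..n-1 index instead of
    # repeating each file's own row numbers
    return pd.concat(frames, ignore_index=True)


# Toy demonstration with two small, made-up monthly files
with tempfile.TemporaryDirectory() as tmp:
    for month in ("2015-08", "2015-09"):
        month_dir = Path(tmp) / month
        month_dir.mkdir()
        toy_month = pd.DataFrame(
            {"Month": [month, month], "Crime type": ["Burglary", "Drugs"]}
        )
        toy_month.to_csv(month_dir / f"{month}-street.csv", index=False)
    demo = load_crime_csvs(tmp)

print(demo)
```

Sorting the paths makes the row order reproducible between runs, which `glob` alone does not guarantee.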


<b> Let's look at the data frame at a high level to see each column's data type and how many non-null values it holds. </b>

In [3]:
psni_crime_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 445745 entries, 0 to 13728
Data columns (total 12 columns):
Crime ID                 274726 non-null object
Month                    445745 non-null object
Reported by              445745 non-null object
Falls within             445745 non-null object
Longitude                443370 non-null float64
Latitude                 443370 non-null float64
Location                 445745 non-null object
LSOA code                0 non-null float64
LSOA name                0 non-null float64
Crime type               445745 non-null object
Last outcome category    0 non-null float64
Context                  0 non-null float64
dtypes: float64(6), object(6)
memory usage: 44.2+ MB
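The non-null counts above can also be read as a share of missing values per column, which makes fully empty columns easy to spot. This is a small sketch on made-up data; the frame `toy` and its values are illustrative only and are not taken from the PSNI files.

```python
import numpy as np
import pandas as pd

# Made-up frame with two fully empty columns and one partly empty column
toy = pd.DataFrame({
    "Crime ID": ["a1", None, "b2"],
    "Month": ["2015-08", "2015-08", "2015-09"],
    "LSOA code": [np.nan, np.nan, np.nan],
    "Context": [np.nan, np.nan, np.nan],
})

# isna() marks missing cells; the column mean is the share of missing values
null_share = toy.isna().mean()
all_null_columns = null_share[null_share == 1.0].index.tolist()

print(null_share)
print(all_null_columns)
```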


<b> Four columns (LSOA code, LSOA name, Last outcome category and Context) contain only null values, so they are redundant. Let's drop them from the table. </b>

In [4]:
psni_crime_data = psni_crime_data.drop(['LSOA code', 'LSOA name', 'Last outcome category', 'Context'], axis=1)
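As an alternative to naming the columns by hand, pandas can drop every column that is entirely null. A minimal sketch on made-up data (the frame `toy` is illustrative, not the PSNI data):

```python
import numpy as np
import pandas as pd

# Made-up frame: "LSOA code" is fully empty, "Crime ID" is only partly empty
toy = pd.DataFrame({
    "Crime ID": ["a1", None, "b2"],
    "Month": ["2015-08", "2015-08", "2015-09"],
    "LSOA code": [np.nan, np.nan, np.nan],
})

# axis=1 targets columns; how="all" drops a column only if every value is missing
cleaned = toy.dropna(axis=1, how="all")
print(cleaned.columns.tolist())
```

Listing the columns explicitly, as in the cell above, has the advantage of failing loudly if an expected column is absent.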

In [5]:
psni_crime_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 445745 entries, 0 to 13728
Data columns (total 8 columns):
Crime ID        274726 non-null object
Month           445745 non-null object
Reported by     445745 non-null object
Falls within    445745 non-null object
Longitude       443370 non-null float64
Latitude        443370 non-null float64
Location        445745 non-null object
Crime type      445745 non-null object
dtypes: float64(2), object(6)
memory usage: 30.6+ MB


<b> Not every row has a Crime ID: only 274,726 of the 445,745 rows have one. According to the official police API documentation (https://data.police.uk/docs/method/crime-street/), this ID is used by the API and is not a police identifier. Let's investigate this ID further and decide whether it is worth keeping in the table. </b>

In [6]:
num_unique_crime_ids = psni_crime_data['Crime ID'].nunique()
print(f"Number of unique crime ids: {num_unique_crime_ids}")

psni_crime_data_crimetype = psni_crime_data.groupby('Crime type')['Crime ID'].count()
psni_crime_data_crimetype

Number of unique crime ids: 11728


Crime type
Anti-social behaviour                0
Bicycle theft                     2141
Burglary                         20368
Criminal damage and arson        52336
Drugs                            16483
Other crime                       7472
Other theft                      37313
Possession of weapons             2701
Public order                      3334
Robbery                           1732
Shoplifting                      17389
Theft from the person             1253
Vehicle crime                    11743
Violence and sexual offences    100461
Name: Crime ID, dtype: int64
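The per-type counts above can be extended to show, side by side, how many rows each crime type has, how many of them carry an ID, and how many distinct IDs there are. This is a sketch on made-up data; the frame `toy` and the `missing_share` column name are hypothetical.

```python
import pandas as pd

# Made-up rows: one type with no IDs, one with a repeated ID, one partly missing
toy = pd.DataFrame({
    "Crime ID": [None, None, "a1", "a1", "b2", None],
    "Crime type": [
        "Anti-social behaviour", "Anti-social behaviour",
        "Burglary", "Burglary",
        "Drugs", "Drugs",
    ],
})

# size counts all rows, count ignores missing IDs, nunique counts distinct IDs
summary = toy.groupby("Crime type")["Crime ID"].agg(["size", "count", "nunique"])
summary["missing_share"] = 1 - summary["count"] / summary["size"]
print(summary)
```

A `nunique` value lower than `count` for a crime type indicates that some IDs are repeated within that type.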

<b> Anti-social behaviour rows have no Crime ID at all. The IDs that are present are not unique either: 274,726 rows share only 11,728 distinct values. This makes me suspect there could be duplicates of the same crime in the table. I'm going to investigate this by finding rows with the same Crime ID, Month, Longitude, Latitude, Location and Crime type. </b>

In [7]:
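# fillna(-1) keeps rows with missing values in the grouping columns, which groupby would otherwise drop
grouped_psni_crime_data = psni_crime_data.fillna(-1).groupby(['Crime ID','Month', 'Longitude','Latitude', 'Location', 'Crime type']).size().reset_index(name='counts')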

In [8]:
# Count every row that belongs to a group of identical rows (first occurrences included)
num_duplicate_rows = grouped_psni_crime_data.loc[grouped_psni_crime_data['counts'] > 1, 'counts'].sum()
percent_duplicate_rows = num_duplicate_rows / len(psni_crime_data)
print(f"{percent_duplicate_rows:.1%}")

16.3%
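The same kind of check can be sketched with `DataFrame.duplicated`, which treats missing values as equal to each other, so no `fillna` is needed. The data below is made up for illustration; `keep=False` flags every row in a repeated group (matching the grouped count above), while `keep="first"` flags only the extra copies.

```python
import numpy as np
import pandas as pd

key_columns = ["Crime ID", "Month", "Longitude", "Latitude", "Location", "Crime type"]

# Made-up rows: two identical pairs and one row that appears only once
toy = pd.DataFrame(
    [
        ["a1", "2015-08", -5.9, 54.6, "On or near High Street", "Burglary"],
        ["a1", "2015-08", -5.9, 54.6, "On or near High Street", "Burglary"],
        [None, "2015-08", -5.9, 54.6, "On or near High Street", "Anti-social behaviour"],
        [None, "2015-08", -5.9, 54.6, "On or near High Street", "Anti-social behaviour"],
        ["b2", "2015-09", np.nan, np.nan, "On or near Dublin Street", "Drugs"],
    ],
    columns=key_columns,
)

# keep=False marks every member of a repeated group, first occurrence included
in_duplicate_group = toy.duplicated(subset=key_columns, keep=False)
# keep="first" marks only the second and later copies
extra_copies = toy.duplicated(subset=key_columns, keep="first")

print(f"{in_duplicate_group.mean():.1%} of rows are in a repeated group")
print(f"{extra_copies.sum()} rows would be removed by drop_duplicates()")
```

The distinction matters when reading the percentage: the share of rows in repeated groups is always larger than the share that would actually be removed by de-duplication.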


<b> 16.3% of rows share all six fields with at least one other row, so they are potential duplicates. I will drop Crime ID, as it does not uniquely identify rows and has no use here beyond interacting with the Police API. </b>

In [9]:
psni_crime_data = psni_crime_data.drop(['Crime ID'], axis=1)

<b> The 'Reported by' and 'Falls within' columns can be dropped, as every entry in both is 'Police Service of Northern Ireland'. </b>

In [18]:
print(psni_crime_data['Reported by'].nunique())
print(psni_crime_data['Falls within'].nunique())
psni_crime_data = psni_crime_data.drop(['Reported by', 'Falls within'], axis=1)

1
1
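The single-value check above generalises to any column. The sketch below, on made-up data, finds every column with exactly one distinct value and drops it; `dropna=False` is used so that a column mixing one value with missing entries is not mistaken for a constant.

```python
import pandas as pd

# Made-up frame with two constant columns and one varying column
toy = pd.DataFrame({
    "Reported by": ["Police Service of Northern Ireland"] * 3,
    "Falls within": ["Police Service of Northern Ireland"] * 3,
    "Crime type": ["Burglary", "Drugs", "Burglary"],
})

# Count distinct values per column, treating missing values as a value of their own
distinct_counts = toy.nunique(dropna=False)
constant_columns = distinct_counts[distinct_counts == 1].index.tolist()

trimmed = toy.drop(columns=constant_columns)
print(constant_columns)
print(trimmed.columns.tolist())
```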


In [19]:
psni_crime_data.dtypes

Month          object
Longitude     float64
Latitude      float64
Location       object
Crime type     object
dtype: object
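Month is still stored as a plain string (`object`). Should later analysis need date arithmetic or time-based grouping, one option is to parse it into a datetime column. This is a sketch on made-up values and an assumption about a possible next step, not something the notebook does.

```python
import pandas as pd

# Made-up Month values in the same YYYY-MM form as the data
toy = pd.DataFrame({"Month": ["2015-08", "2015-09", "2018-05"]})

# An explicit format avoids guessing; each value becomes the first day of its month
toy["Month"] = pd.to_datetime(toy["Month"], format="%Y-%m")

print(toy.dtypes)
print(toy["Month"].dt.year.tolist())
```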