# **DATA COLLECTION NOTEBOOK**

## Objectives

* Fetch the data from Kaggle, save it as raw data and inspect it

## Inputs

* The data are from https://www.kaggle.com/datasets/candacegostinski/crime-data-analysis/data
* The input data is a csv file called Crime_Data_from_2020_to_Present.csv
* You need to install the packages listed in requirements.txt, plus kaggle==1.5.12

## Outputs

* The output data is a csv file called dataPP5.csv 

## Additional Comments

* ...


---

# Install Python packages in the notebook

In [5]:
%pip install -r /workspace/PP5_My_project/requirements.txt

Collecting streamlit==0.85.0 (from -r /workspace/PP5_My_project/requirements.txt (line 2))
  Downloading streamlit-0.85.0-py2.py3-none-any.whl.metadata (1.1 kB)
Collecting altair<5 (from -r /workspace/PP5_My_project/requirements.txt (line 3))
  Downloading altair-4.2.2-py3-none-any.whl.metadata (13 kB)
Collecting astor (from streamlit==0.85.0->-r /workspace/PP5_My_project/requirements.txt (line 2))
  Downloading astor-0.8.1-py2.py3-none-any.whl.metadata (4.2 kB)
Collecting base58 (from streamlit==0.85.0->-r /workspace/PP5_My_project/requirements.txt (line 2))
  Downloading base58-2.1.1-py3-none-any.whl.metadata (3.1 kB)
Collecting blinker (from streamlit==0.85.0->-r /workspace/PP5_My_project/requirements.txt (line 2))
  Downloading blinker-1.8.2-py3-none-any.whl.metadata (1.6 kB)
Collecting cachetools>=4.0 (from streamlit==0.85.0->-r /workspace/PP5_My_project/requirements.txt (line 2))
  Downloading cachetools-5.5.0-py3-none-any.whl.metadata (5.3 kB)
Collecting click<8.0,>=7.0 (from st

---

# Change working directory

* The notebooks are stored in a subfolder, so we change the working directory from the current folder to its parent folder

* We access the current directory with os.getcwd()

In [2]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/PP5_My_project/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [3]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [4]:
current_dir = os.getcwd()
current_dir

'/workspace/PP5_My_project'
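Re-running the `os.chdir` cell moves the working directory one level further up each time. The following is a minimal sketch of a guard against that, assuming the notebooks folder is named `jupyter_notebooks` as in the output above. The helper name `project_root_from` is hypothetical and not part of the original notebook.

```python
from pathlib import Path

def project_root_from(current_dir, notebooks_folder="jupyter_notebooks"):
    """Return the parent folder only if current_dir is the notebooks folder."""
    current = Path(current_dir)
    if current.name == notebooks_folder:
        return current.parent
    # Already at the project root (or elsewhere): leave it unchanged.
    return current

# Assumed usage in the notebook: os.chdir(project_root_from(os.getcwd()))
print(project_root_from("/workspace/PP5_My_project/jupyter_notebooks"))
print(project_root_from("/workspace/PP5_My_project"))
```

With this guard, running the cell a second time leaves the working directory where it is.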

---

# Fetch data from Kaggle

Install Kaggle package to fetch data

In [4]:
%pip install kaggle==1.5.12

Note: you may need to restart the kernel to use updated packages.


Make the token recognisable in the session

In [5]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
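The same step can be done in pure Python, which also lets the cell fail early when the token file is missing. This is a sketch under the assumption of a POSIX system; the helper name `prepare_kaggle_token` is hypothetical, and the demonstration uses a throwaway folder with a dummy token rather than a real one.

```python
import os
import stat
import tempfile
from pathlib import Path

def prepare_kaggle_token(config_dir):
    """Point KAGGLE_CONFIG_DIR at config_dir and restrict kaggle.json to its owner."""
    token = Path(config_dir) / "kaggle.json"
    if not token.is_file():
        raise FileNotFoundError(f"kaggle.json not found in {config_dir}")
    os.environ["KAGGLE_CONFIG_DIR"] = str(config_dir)
    # Same effect as `chmod 600`: read/write for the owner only (POSIX systems).
    token.chmod(stat.S_IRUSR | stat.S_IWUSR)
    return token

# Demonstration with a throwaway folder and a dummy token file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "kaggle.json").write_text("{}")
    token_path = prepare_kaggle_token(tmp)
    token_mode = stat.S_IMODE(token_path.stat().st_mode)
    print(oct(token_mode))
```

In the notebook itself, the assumed call would be `prepare_kaggle_token(os.getcwd())`.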

Define the Kaggle dataset and the destination folder, then download the dataset.

In [6]:
KaggleDatasetPath = "nathaniellybrand/los-angeles-crime-dataset-2020-present"
DestinationFolder = "inputs/datasets/raw"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading los-angeles-crime-dataset-2020-present.zip to inputs/datasets/raw
 95%|███████████████████████████████████▉  | 35.0M/37.0M [00:02<00:00, 24.1MB/s]
100%|██████████████████████████████████████| 37.0M/37.0M [00:02<00:00, 18.6MB/s]


Unzip the downloaded file, delete the zip file and delete the kaggle.json file

In [7]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json

Archive:  inputs/datasets/raw/los-angeles-crime-dataset-2020-present.zip
  inflating: inputs/datasets/raw/Crime_Data_from_2020_to_Present.csv  
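As a portable alternative to the shell commands, the extraction can be sketched with the standard `zipfile` module. The function name `extract_and_remove_zips` is hypothetical, and the demonstration runs on a temporary folder with a tiny archive instead of the real download. Unlike the cell above, this sketch does not delete `kaggle.json`.

```python
import tempfile
import zipfile
from pathlib import Path

def extract_and_remove_zips(destination_folder):
    """Extract every .zip in destination_folder, then delete the archive."""
    destination = Path(destination_folder)
    extracted = []
    for archive in sorted(destination.glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(destination)
            extracted.extend(zf.namelist())
        archive.unlink()  # remove the archive only after a successful extraction
    return extracted

# Demonstration on a throwaway folder with a tiny archive.
with tempfile.TemporaryDirectory() as tmp:
    with zipfile.ZipFile(Path(tmp) / "demo.zip", "w") as zf:
        zf.writestr("demo.csv", "DR_NO\n1\n")
    names = extract_and_remove_zips(tmp)
    remaining = sorted(p.name for p in Path(tmp).iterdir())
    print(names, remaining)
```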


---

# Load and Inspect Kaggle data

Load the raw CSV file into a DataFrame and preview the first rows

In [5]:
import pandas as pd
df = pd.read_csv("inputs/datasets/raw/Crime_Data_from_2020_to_Present.csv")
df.head()

Unnamed: 0,DR_NO,Date Rptd,DATE OCC,TIME OCC,AREA,AREA NAME,Rpt Dist No,Part 1-2,Crm Cd,Crm Cd Desc,...,Status,Status Desc,Crm Cd 1,Crm Cd 2,Crm Cd 3,Crm Cd 4,LOCATION,Cross Street,LAT,LON
0,10304468,01/08/2020 12:00:00 AM,01/08/2020 12:00:00 AM,2230,3,Southwest,377,2,624,BATTERY - SIMPLE ASSAULT,...,AO,Adult Other,624.0,,,,1100 W 39TH PL,,34.0141,-118.2978
1,190101086,01/02/2020 12:00:00 AM,01/01/2020 12:00:00 AM,330,1,Central,163,2,624,BATTERY - SIMPLE ASSAULT,...,IC,Invest Cont,624.0,,,,700 S HILL ST,,34.0459,-118.2545
2,200110444,04/14/2020 12:00:00 AM,02/13/2020 12:00:00 AM,1200,1,Central,155,2,845,SEX OFFENDER REGISTRANT OUT OF COMPLIANCE,...,AA,Adult Arrest,845.0,,,,200 E 6TH ST,,34.0448,-118.2474
3,191501505,01/01/2020 12:00:00 AM,01/01/2020 12:00:00 AM,1730,15,N Hollywood,1543,2,745,VANDALISM - MISDEAMEANOR ($399 OR UNDER),...,IC,Invest Cont,745.0,998.0,,,5400 CORTEEN PL,,34.1685,-118.4019
4,191921269,01/01/2020 12:00:00 AM,01/01/2020 12:00:00 AM,415,19,Mission,1998,2,740,"VANDALISM - FELONY ($400 & OVER, ALL CHURCH VA...",...,IC,Invest Cont,740.0,,,,14400 TITUS ST,,34.2198,-118.4468


DataFrame Summary

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 752911 entries, 0 to 752910
Data columns (total 28 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   DR_NO           752911 non-null  int64  
 1   Date Rptd       752911 non-null  object 
 2   DATE OCC        752911 non-null  object 
 3   TIME OCC        752911 non-null  int64  
 4   AREA            752911 non-null  int64  
 5   AREA NAME       752911 non-null  object 
 6   Rpt Dist No     752911 non-null  int64  
 7   Part 1-2        752911 non-null  int64  
 8   Crm Cd          752911 non-null  int64  
 9   Crm Cd Desc     752911 non-null  object 
 10  Mocodes         649650 non-null  object 
 11  Vict Age        752911 non-null  int64  
 12  Vict Sex        654681 non-null  object 
 13  Vict Descent    654675 non-null  object 
 14  Premis Cd       752902 non-null  float64
 15  Premis Desc     752476 non-null  object 
 16  Weapon Used Cd  261472 non-null  float64
 17  Weapon Des

We want to check whether any `DR_NO` value is duplicated. There are none: the result below is empty.

In [8]:
df[df.duplicated(subset=['DR_NO'])]

Unnamed: 0,DR_NO,Date Rptd,DATE OCC,TIME OCC,AREA,AREA NAME,Rpt Dist No,Part 1-2,Crm Cd,Crm Cd Desc,...,Status,Status Desc,Crm Cd 1,Crm Cd 2,Crm Cd 3,Crm Cd 4,LOCATION,Cross Street,LAT,LON
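An empty DataFrame is easy to misread in an export like this one, so a numeric check can make the result explicit. The sketch below uses a small hypothetical sample with one repeated `DR_NO`; in the notebook the same two expressions would be applied to `df['DR_NO']`.

```python
import pandas as pd

# Hypothetical sample; in the notebook this would be df["DR_NO"].
sample = pd.DataFrame({"DR_NO": [10304468, 190101086, 190101086]})

# Number of rows that repeat an earlier DR_NO, and a yes/no uniqueness flag.
n_duplicates = int(sample["DR_NO"].duplicated().sum())
is_unique = bool(sample["DR_NO"].is_unique)
print(n_duplicates, is_unique)
```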


Find missing values

In [9]:
df.isna().sum()

DR_NO                  0
Date Rptd              0
DATE OCC               0
TIME OCC               0
AREA                   0
AREA NAME              0
Rpt Dist No            0
Part 1-2               0
Crm Cd                 0
Crm Cd Desc            0
Mocodes           103261
Vict Age               0
Vict Sex           98230
Vict Descent       98236
Premis Cd              9
Premis Desc          435
Weapon Used Cd    491439
Weapon Desc       491439
Status                 0
Status Desc            0
Crm Cd 1              10
Crm Cd 2          697204
Crm Cd 3          751044
Crm Cd 4          752855
LOCATION               0
Cross Street      631859
LAT                    0
LON                    0
dtype: int64
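Raw counts are hard to compare at a glance, so it can help to show the share of missing values per column as well. This is a minimal sketch on a hypothetical four-row sample; the column names are borrowed from the dataset, but the values are illustrative only.

```python
import pandas as pd

sample = pd.DataFrame({
    "Mocodes": ["0444 0913", None, "1501", None],
    "Vict Sex": ["F", "M", None, "X"],
    "LAT": [34.01, 34.04, 34.04, 34.16],
})

# Count and percentage of missing values per column.
summary = pd.DataFrame({
    "missing": sample.isna().sum(),
    "percent": (sample.isna().mean() * 100).round(1),
})
# Keep only the columns that actually have missing values, largest first.
summary = summary[summary["missing"] > 0].sort_values("missing", ascending=False)
print(summary)
```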

Evaluate the distribution and shape of the variables with missing data

In [8]:
# Variable containing missing data
missing_data = df.columns[df.isna().sum() > 0].to_list()
missing_data

['Mocodes',
 'Vict Sex',
 'Vict Descent',
 'Premis Cd',
 'Premis Desc',
 'Weapon Used Cd',
 'Weapon Desc',
 'Cross Street']

In [10]:
from ydata_profiling import ProfileReport
if missing_data:
    profile = ProfileReport(df=df[missing_data], minimal=True)
    profile.to_notebook_iframe()
else:
    print("There are no variables with missing data")

Summarize dataset:   0%|          | 0/5 [00:00<?, ?it/s]

Generate report structure:   0%|          | 0/1 [00:00<?, ?it/s]

Render HTML:   0%|          | 0/1 [00:00<?, ?it/s]

Get example values from the variables with missing data

In [9]:
# Example non-missing values, to help decide what to do with the missing ones
for col in missing_data:
    unique_values = df[col].dropna().unique()[:5]
    print(f"{col}: {unique_values}\n")

Mocodes: ['0444 0913' '0416 1822 1414' '1501' '0329 1402' '0329']

Vict Sex: ['F' 'M' 'X' 'H' '-']

Vict Descent: ['B' 'H' 'X' 'W' 'A']

Premis Cd: [501. 102. 726. 502. 409.]

Premis Desc: ['SINGLE FAMILY DWELLING' 'SIDEWALK' 'POLICE FACILITY'
 'MULTI-UNIT DWELLING (APARTMENT, DUPLEX, ETC)' 'BEAUTY SUPPLY STORE']

Weapon Used Cd: [400. 500. 306. 511. 204.]

Weapon Desc: ['STRONG-ARM (HANDS, FIST, FEET OR BODILY FORCE)'
 'UNKNOWN WEAPON/OTHER WEAPON' 'ROCK/THROWN OBJECT' 'VERBAL THREAT'
 'FOLDING KNIFE']

Cross Street: ['OLIVE' 'VERMONT' 'HILL' 'LOS ANGELES' 'SAN JULIAN']



Next, we try to understand the importance of the missing data, starting with the 'Mocodes' variable. After some further research online, this variable looks interesting because it gives more information about the crime committed. It might give us an idea of the financial damage the victim endured.

In [10]:
# Filter rows where 'Mocodes' is missing and display the first 5
missing_rows = df[df['Mocodes'].isna()]
print(missing_rows.head(5))

         DR_NO               Date Rptd                DATE OCC  TIME OCC  \
33   200117988  09/15/2020 12:00:00 AM  09/03/2020 12:00:00 AM      2000   
45   221412410  06/15/2022 12:00:00 AM  11/12/2020 12:00:00 AM      1700   
78   200104073  01/02/2020 12:00:00 AM  01/02/2020 12:00:00 AM       345   
104  200516527  11/12/2020 12:00:00 AM  11/12/2020 12:00:00 AM       430   
105  200609101  04/15/2020 12:00:00 AM  04/12/2020 12:00:00 AM      1300   

     AREA  AREA NAME  Rpt Dist No  Part 1-2 Crm Cd  \
33      1    Central          111         1    510   
45     14    Pacific         1444         1    420   
78      1    Central          143         1    510   
104     5     Harbor          508         1    510   
105     6  Hollywood          648         1    510   

                                         Crm Cd Desc  ... Status  Status Desc  \
33                                  VEHICLE - STOLEN  ...     IC  Invest Cont   
45   THEFT FROM MOTOR VEHICLE - PETTY ($950 & UNDER)  ..

Print the unique values of 'Crm Cd Desc' for the rows where 'Mocodes' is missing

In [11]:
# Filter rows where 'Mocodes' is missing and get unique 'Crm Cd Desc' values
unique_crm_cd_desc = df[df['Mocodes'].isna()]['Crm Cd Desc'].unique()
# Display the unique values
print(unique_crm_cd_desc)

['VEHICLE - STOLEN' 'THEFT FROM MOTOR VEHICLE - PETTY ($950 & UNDER)'
 'VANDALISM - MISDEAMEANOR ($399 OR UNDER)'
 'CRIMINAL THREATS - NO WEAPON DISPLAYED' 'THEFT OF IDENTITY'
 'SEX OFFENDER REGISTRANT OUT OF COMPLIANCE'
 'THEFT-GRAND ($950.01 & OVER)EXCPT,GUNS,FOWL,LIVESTK,PROD'
 'OTHER MISCELLANEOUS CRIME' 'ARSON'
 'LETTERS, LEWD  -  TELEPHONE CALLS, LEWD' 'BOMB SCARE' 'STALKING'
 'CHILD NEGLECT (SEE 300 W.I.C.)'
 'EMBEZZLEMENT, GRAND THEFT ($950.01 & OVER)'
 'CRM AGNST CHLD (13 OR UNDER) (14-15 & SUSP 10 YRS OLDER)'
 'DOCUMENT WORTHLESS ($200 & UNDER)' 'BURGLARY, ATTEMPTED'
 'FAILURE TO YIELD' 'VEHICLE - ATTEMPT STOLEN' 'TRESPASSING'
 'CHILD ABUSE (PHYSICAL) - AGGRAVATED ASSAULT' 'BUNCO, PETTY THEFT'
 'RAPE, FORCIBLE' 'THEFT FROM MOTOR VEHICLE - GRAND ($950.01 AND OVER)'
 'DEFRAUDING INNKEEPER/THEFT OF SERVICES, OVER $950.01' 'BURGLARY'
 'DEFRAUDING INNKEEPER/THEFT OF SERVICES, $950 & UNDER'
 'DOCUMENT FORGERY / STOLEN FELONY' 'DISCHARGE FIREARMS/SHOTS FIRED'
 'DRIVING WITHOUT OWNER

Print the unique values of 'Crm Cd Desc' that contain '$', for the rows where 'Mocodes' is missing

In [12]:
# Filter rows where 'Mocodes' is missing and 'Crm Cd Desc' contains '$'
filtered_data = df[df['Mocodes'].isna() & df['Crm Cd Desc'].str.contains(r'\$', na=False)]

# Get unique values
unique_crm_cd_desc_with_dollar = filtered_data['Crm Cd Desc'].unique()

# Display
print(unique_crm_cd_desc_with_dollar)

['THEFT FROM MOTOR VEHICLE - PETTY ($950 & UNDER)'
 'VANDALISM - MISDEAMEANOR ($399 OR UNDER)'
 'THEFT-GRAND ($950.01 & OVER)EXCPT,GUNS,FOWL,LIVESTK,PROD'
 'EMBEZZLEMENT, GRAND THEFT ($950.01 & OVER)'
 'DOCUMENT WORTHLESS ($200 & UNDER)'
 'THEFT FROM MOTOR VEHICLE - GRAND ($950.01 AND OVER)'
 'DEFRAUDING INNKEEPER/THEFT OF SERVICES, OVER $950.01'
 'DEFRAUDING INNKEEPER/THEFT OF SERVICES, $950 & UNDER'
 'VANDALISM - FELONY ($400 & OVER, ALL CHURCH VANDALISMS)'
 'THEFT PLAIN - PETTY ($950 & UNDER)'
 'CREDIT CARDS, FRAUD USE ($950 & UNDER'
 'SHOPLIFTING - PETTY THEFT ($950 & UNDER)'
 'CREDIT CARDS, FRAUD USE ($950.01 & OVER)'
 'DOCUMENT WORTHLESS ($200.01 & OVER)'
 'EMBEZZLEMENT, PETTY THEFT ($950 & UNDER)']
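The dollar sign has to be escaped in the filter because `str.contains` treats its pattern as a regular expression by default, and `$` is the end-of-string anchor. A sketch of an alternative that avoids escaping altogether, shown on a small hypothetical Series:

```python
import pandas as pd

sample = pd.Series([
    "VEHICLE - STOLEN",
    "THEFT PLAIN - PETTY ($950 & UNDER)",
    None,
])

# regex=False treats "$" as a plain character, so no escaping is needed.
has_dollar = sample.str.contains("$", regex=False, na=False)
print(has_dollar.tolist())
```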


Print the unique 'Mocodes' values (where not missing) for the rows whose 'Crm Cd Desc' contains '$'. We print 'Mocodes', 'Crm Cd', 'Crm Cd Desc', 'Crm Cd 1', 'Crm Cd 2', 'Crm Cd 3' and 'Crm Cd 4', because some of these columns could be largely identical.

In [13]:
# Filter rows where 'Mocodes' is NOT missing and 'Crm Cd Desc' contains '$'
filtered_data = df[df['Mocodes'].notna() & df['Crm Cd Desc'].str.contains(r'\$', na=False)]

# Select the relevant columns and keep one row per unique 'Mocodes' value
selected_columns = filtered_data[['Mocodes', 'Crm Cd', 'Crm Cd Desc', 'Crm Cd 1', 'Crm Cd 2', 'Crm Cd 3', 'Crm Cd 4']].drop_duplicates(subset=['Mocodes'])

# Display
print(selected_columns)

                              Mocodes Crm Cd  \
3                           0329 1402    745   
4                                0329    740   
6                 1402 2004 0344 0387    442   
8                      1822 0344 1402    341   
9                 1300 0202 0378 0325    341   
...                               ...    ...   
752535       0329 1300 0903 2053 1528    740   
752553       0352 1221 0344 1822 1402    420   
752670       1300 0344 0346 1302 1822    420   
752733            1309 0104 0325 0326    442   
752873  1822 2004 0325 0344 0315 0329    442   

                                              Crm Cd Desc Crm Cd 1 Crm Cd 2  \
3                VANDALISM - MISDEAMEANOR ($399 OR UNDER)    745.0    998.0   
4       VANDALISM - FELONY ($400 & OVER, ALL CHURCH VA...    740.0      nan   
6                SHOPLIFTING - PETTY THEFT ($950 & UNDER)    442.0    998.0   
8       THEFT-GRAND ($950.01 & OVER)EXCPT,GUNS,FOWL,LI...    341.0    998.0   
9       THEFT-GRAND ($950.01

Looking for specific 'Mocodes', because they could imply damage to property:

| Mocodes | Description |
| --- | --- |
| 0311 | Graffiti: Involves defacing property with spray paint or other substances, which is a clear act of property damage. |
| 0321 | Ransacked: Indicates that the suspect caused damage by overturning or destroying property during a search. |
| 0322 | Smashed display case: Breaking a display case is a direct act of property damage. |
| 0329 | Vandalized: Vandalism involves intentional destruction or defacement of property. |
| 0330 | Victims vehicle taken: This implies damage if the vehicle was stolen forcefully or damaged in the process. |
| 0331 | Mailbox Bombing: Bombing a mailbox results in severe damage to property. |
| 0332 | Mailbox Vandalism: Vandalizing a mailbox is a form of property damage. |
| 0344 | Removes vict property: Taking someone's property could involve force, leading to potential damage to the property. |
| 0346 | Snatch property and runs: Property could be damaged during the act of forcibly snatching it. |
| 0358 | Forces Entry: Forcing entry into a place often results in damage to doors, windows, or other property. |
| 0363 | Home invasion: Typically involves forced entry and can lead to significant property damage. |
| 0375 | Removes cash register: Physically removing a cash register could involve damaging it or other property around it. |
| 0382 | Removed money/property from safe: Removing property from a safe could involve damaging the safe itself or surrounding areas. |
| 0385 | Suspect removed parts from vehicle: Removing parts from a vehicle likely causes damage to the vehicle. |
| 0386 | Suspect removed property from trunk of vehicle: Removing items from a vehicle’s trunk could involve force, potentially damaging the vehicle. |
| 0397 | Cut lock (to bicycle, gate, etc.): Cutting a lock to gain access typically causes damage to the lock and possibly to the surrounding property. |
| 0398 | Roof access (remove A/C, equip, etc.): Gaining access through the roof and removing equipment like A/C units often results in damage to the roof and the equipment. |
| 0399 | Vehicle to Vehicle shooting: A shooting between vehicles is highly likely to cause damage to the vehicles involved. |
| 0407 | Burned Victim: Although primarily affecting a person, burning could also cause collateral damage to property. |
| 0428 | Dismembered: While primarily about physical harm, in certain contexts, dismemberment could involve property damage, depending on the method used and the location. |
| 0434 | Bed Sheets/Linens: This could suggest the suspect used or damaged property like bed sheets or linens. |
| 0436 | Clothing: The involvement of clothing might suggest property damage if clothing was torn or altered. |
| 0605 | Traffic Accident/Traffic related incident: Traffic accidents typically involve damage to vehicles or other property. |
| 0916 | Forced theft of vehicle (Car-Jacking): Car-jacking often involves force, which can result in damage to the vehicle or nearby property. |
| 0919 | Road Rage: Road rage incidents frequently lead to damage to vehicles or other property. |
| 0922 | ATM Theft with PIN number: Theft from an ATM could imply damage to the ATM machine itself if it was tampered with or broken into. |
| 1100 | Shots Fired: This can lead to substantial damage to property, such as buildings, vehicles, or surrounding areas where gunfire occurs. |
| 1221 | Missing Clothing/Jewelry: If clothing or jewelry is missing due to theft or loss, there might be damage associated with the event from which they went missing. |
| 1301 | Forced victim vehicle to curb: Forcing a vehicle to the curb can lead to damage to the vehicle or surrounding property. |
| 1302 | Suspect forced way into victim's vehicle: This implies potential damage to the vehicle from forced entry. |
| 1307 | Breaks window: This directly indicates property damage due to the act of breaking a window. |
| 1308 | Drives by and snatches property: This can lead to damage if the property is forcibly taken, potentially damaging the surrounding area or the vehicle involved. |
| 1609 | Smashed: This suggests a forceful act that likely results in property damage. |
| 1612 | Punched/Pulled Door Lock: This directly indicates damage as it involves forcefully tampering with a door lock. |
| 1905 | Destruction of computer data: This implies damage to digital property, such as files or databases. |
| 1909 | Introduction of virus or contaminants into computer system/program: This suggests damaging a computer system or software, impacting its functionality. |
| 1911 | Theft of computer data: While primarily about data theft, it can also indicate damage to the integrity of the computer system. |
| 2032 | Victim left property unattended: This situation often leads to theft or damage to property left vulnerable. |
| 2046 | Suspect damaged property equal to or exceeding $25,000: This explicitly indicates significant property damage. |


In [14]:
# Define the list of specific 'Mocodes'
mocodes_list = [
    '0311', '0321', '0322', '0329', '0330', '0331', '0332', '0344', '0346', '0358',
    '0363', '0375', '0382', '0385', '0386', '0397', '0398', '0399', '0407', '0428',
    '0434', '0436', '0605', '0916', '0919', '0922', '1100', '1221', '1301', '1302',
    '1307', '1308', '1609', '1612', '1905', '1909', '1911', '2032', '2046'
]

# Filter rows where any Mocode in the list is found in the 'Mocodes' column
filtered_data = df[df['Mocodes'].notna() & df['Mocodes'].apply(lambda x: isinstance(x, str) and any(code in x.split() for code in mocodes_list))]

# Select the relevant columns
selected_columns = filtered_data[['Mocodes', 'Crm Cd', 'Crm Cd Desc', 'Crm Cd 1', 'Crm Cd 2', 'Crm Cd 3', 'Crm Cd 4']]

# Display the result
print(selected_columns)

                    Mocodes Crm Cd  \
3                 0329 1402    745   
4                      0329    740   
6       1402 2004 0344 0387    442   
8            1822 0344 1402    341   
10      1822 1414 0344 1307    330   
...                     ...    ...   
752902                 0344    341   
752904            1822 0344    440   
752905            0344 1822    341   
752907            1300 0329    740   
752910            0329 1822    745   

                                              Crm Cd Desc Crm Cd 1 Crm Cd 2  \
3                VANDALISM - MISDEAMEANOR ($399 OR UNDER)    745.0    998.0   
4       VANDALISM - FELONY ($400 & OVER, ALL CHURCH VA...    740.0      nan   
6                SHOPLIFTING - PETTY THEFT ($950 & UNDER)    442.0    998.0   
8       THEFT-GRAND ($950.01 & OVER)EXCPT,GUNS,FOWL,LI...    341.0    998.0   
10                                  BURGLARY FROM VEHICLE    330.0      nan   
...                                                   ...      ...   
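The row-wise `apply` with a lambda works, but it runs Python code for every row. One possible vectorised alternative is sketched below on a hypothetical four-row sample: split each 'Mocodes' string into individual codes, test membership, and collapse back to one flag per row. The set `damage_codes` is a shortened, illustrative subset of the list above.

```python
import pandas as pd

sample = pd.DataFrame({"Mocodes": ["0329 1402", "1822", None, "1822 0344 1402"]})
damage_codes = {"0329", "0344"}  # shortened, illustrative subset of mocodes_list

# One row per individual code; the original row label is kept in the index.
codes = sample["Mocodes"].str.split().explode()
# A row matches if any of its codes is in the set (missing values never match).
mask = codes.isin(damage_codes).groupby(level=0).any()
print(sample[mask])
```

Splitting on whitespace before testing membership keeps the exact-code semantics of the original filter, so a code such as '0329' is never matched as part of a longer string.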

Looking for Mocode '2046', which represents property damage equal to or exceeding $25,000

In [15]:
# Define the list of specific 'Mocodes'
mocodes_list = ['2046']

# Filter rows where any Mocode in the list is found in the 'Mocodes' column
filtered_data = df[df['Mocodes'].notna() & df['Mocodes'].apply(lambda x: isinstance(x, str) and any(code in x.split() for code in mocodes_list))]

# Select the relevant columns
selected_columns = filtered_data[['Mocodes', 'Crm Cd', 'Crm Cd Desc', 'Crm Cd 1', 'Crm Cd 2', 'Crm Cd 3', 'Crm Cd 4']]

# Display the result
print(selected_columns)

                                                  Mocodes Crm Cd  \
3326                                       1307 0329 2046    740   
13232                            1822 2046 1414 1420 0329    740   
13497                  1609 0329 0344 2046 1402 1822 0216    310   
52482                  1501 0206 0209 0216 0211 2046 1420    648   
54649             0344 1609 1822 2046 1414 0352 0325 0329    310   
64004                  1409 0329 0910 1822 0342 1414 2046    648   
65520                                 2046 0329 1236 0344    341   
66248                                 2046 1822 1420 0329    740   
70919                       1822 2046 0325 0344 0321 0329    310   
85134                                 1822 0329 2046 1214    740   
232941                                     0329 2004 2046    310   
244640       0311 0329 0903 1236 2053 1523 1414 2046 1822    946   
253765                      2046 0350 0398 0344 1402 1602    310   
285457                 1822 0321 0344 0352 0354 

Comparing 'Crm Cd' with 'Crm Cd 1': they look almost identical, so I want to know whether 'Crm Cd 1' can be dropped.

In [19]:
# Convert the columns to numeric (float) first, then to int, handling errors
for col in ['Crm Cd', 'Crm Cd 1', 'Crm Cd 2', 'Crm Cd 3', 'Crm Cd 4']:
    df[col] = pd.to_numeric(df[col], errors='coerce') # convert to float, invalid parsing will be set as NaN
    df[col] = df[col].fillna(0).astype(int) # Fill NaN with 0 and convert to int

# Compare the two columns directly
compare = df['Crm Cd'] == df['Crm Cd 1']

# Check if all values are identical
if compare.all():
    print("Identical")
else:
    # Otherwise, count the number of differing rows
    differing_rows_count = (~compare).sum()
    print(f"There are {differing_rows_count} different case.")
    # View the rows where the columns differ
    differences = df[~compare][['Crm Cd', 'Crm Cd 1']]
    print(differences.head())

There are 1573 different case.
      Crm Cd  Crm Cd 1
292      626       434
316      815       310
527      820       812
1545     761       753
2824     812       627
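Note that the cell above replaces missing codes with 0 and overwrites the columns in `df`, so a row with a missing 'Crm Cd 1' is counted as a difference. A sketch of a comparison that keeps missing values separate and leaves the DataFrame untouched, using pandas' nullable `Int64` dtype on a hypothetical three-row sample:

```python
import pandas as pd

sample = pd.DataFrame({
    "Crm Cd": [624, 626, 815],
    "Crm Cd 1": [624.0, 434.0, None],
})

# Nullable integers keep missing values as <NA> instead of replacing them with 0.
primary = pd.to_numeric(sample["Crm Cd"], errors="coerce").astype("Int64")
first = pd.to_numeric(sample["Crm Cd 1"], errors="coerce").astype("Int64")

# Count a difference only when both codes are present.
both_present = primary.notna() & first.notna()
differs = both_present & (primary != first).fillna(False).astype(bool)
print(int(differs.sum()), int(first.isna().sum()))
```

Reporting genuine differences and missing values as two separate numbers makes it easier to judge whether 'Crm Cd 1' carries extra information.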


Looking up the 'Crm Cd Desc' for each 'Crm Cd 1' value that differs from 'Crm Cd'.
'Crm Cd 1', 'Crm Cd 2', 'Crm Cd 3' and 'Crm Cd 4' were converted to int64 in the previous cell.

In [22]:
# Compare the two columns directly
compare = df['Crm Cd'] == df['Crm Cd 1']

# Check if all values are identical
if compare.all():
    print("Identical")
else:
    # rows where the columns differ
    differences = df[~compare][['Crm Cd', 'Crm Cd 1']]

    # Extract the differing values from 'Crm Cd 1'
    differing_values = differences['Crm Cd 1'].unique()

    # Collect the results in a list
    results = []

    # Loop through each differing value and find the corresponding 'Crm Cd Desc'
    for value in differing_values:
        # Find the rows whose 'Crm Cd' equals this value
        matching_row = df[df['Crm Cd'] == value]

        # If there is a matching row, take the 'Crm Cd Desc' from the first one
        if not matching_row.empty:
            first_matching_row = matching_row.iloc[0]
            results.append({
                'Crm Cd 1': value,
                'Crm Cd': first_matching_row['Crm Cd'],
                'Crm Cd Desc': first_matching_row['Crm Cd Desc']
            })

    # Convert result to a DataFrame for better readability
    results_df = pd.DataFrame(results)

    # Display the results
    print(results_df)
    print(f"Total results found: {results_df.shape[0]}")

    Crm Cd 1  Crm Cd                                        Crm Cd Desc
0        434     434                                 FALSE IMPRISONMENT
1        310     310                                           BURGLARY
2        812     812  CRM AGNST CHLD (13 OR UNDER) (14-15 & SUSP 10 ...
3        753     753                     DISCHARGE FIREARMS/SHOTS FIRED
4        627     627            CHILD ABUSE (PHYSICAL) - SIMPLE ASSAULT
5        624     624                           BATTERY - SIMPLE ASSAULT
6        740     740  VANDALISM - FELONY ($400 & OVER, ALL CHURCH VA...
7        625     625                                      OTHER ASSAULT
8        745     745           VANDALISM - MISDEAMEANOR ($399 OR UNDER)
9        888     888                                        TRESPASSING
10       762     762                                       LEWD CONDUCT
11       760     760                    LEWD/LASCIVIOUS ACTS WITH CHILD
12       626     626                  INTIMATE PARTNER - SIMPLE 

Print the unique pairs of 'Crm Cd' and 'Crm Cd Desc'

In [14]:
# Drop duplicates based on 'Crm Cd Desc' to get unique pairs of 'Crm Cd' and 'Crm Cd Desc'
unique_pairs_crm = df[['Crm Cd', 'Crm Cd Desc']].drop_duplicates(subset='Crm Cd Desc')

# Count
print(f'Total unique values: {len(unique_pairs_crm)}')

# Display
print(unique_pairs_crm.head(138))


Total unique values: 138
        Crm Cd                                        Crm Cd Desc
0          624                           BATTERY - SIMPLE ASSAULT
2          845          SEX OFFENDER REGISTRANT OUT OF COMPLIANCE
3          745           VANDALISM - MISDEAMEANOR ($399 OR UNDER)
4          740  VANDALISM - FELONY ($400 & OVER, ALL CHURCH VA...
5          121                                     RAPE, FORCIBLE
...        ...                                                ...
241246     904  FIREARMS EMERGENCY PROTECTIVE ORDER (FIREARMS ...
328295     830       INCEST (SEXUAL ACTS BETWEEN BLOOD RELATIVES)
463359     445                 DISHONEST EMPLOYEE ATTEMPTED THEFT
471249     432                     BLOCKING DOOR INDUCTION CENTER
530818     882                                    INCITING A RIOT

[138 rows x 2 columns]
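Dropping duplicates on 'Crm Cd Desc' alone assumes that every description belongs to exactly one code. A possible way to verify that assumption is sketched here on a hypothetical sample: deduplicate on both columns, then look for codes or descriptions that still appear more than once.

```python
import pandas as pd

sample = pd.DataFrame({
    "Crm Cd": [624, 624, 745, 740],
    "Crm Cd Desc": [
        "BATTERY - SIMPLE ASSAULT",
        "BATTERY - SIMPLE ASSAULT",
        "VANDALISM - MISDEAMEANOR ($399 OR UNDER)",
        "VANDALISM - FELONY ($400 & OVER, ALL CHURCH VANDALISMS)",
    ],
})

pairs = sample[["Crm Cd", "Crm Cd Desc"]].drop_duplicates()
# A code (or description) that appears more than once in `pairs` maps to several partners.
codes_with_many_descs = pairs.loc[pairs["Crm Cd"].duplicated(keep=False), "Crm Cd"].unique()
descs_with_many_codes = pairs.loc[pairs["Crm Cd Desc"].duplicated(keep=False), "Crm Cd Desc"].unique()
print(len(pairs), len(codes_with_many_descs), len(descs_with_many_codes))
```

If both arrays are empty, the mapping is one-to-one and either column can serve as the deduplication key.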


Next, the missing values in 'Vict Sex'

In [35]:
# Get all unique values of variable 'Vict Sex'
unique_values_vict_sex = df['Vict Sex'].unique()
print(unique_values_vict_sex)

# Get the unique values and their counts
unique_values_vict_sex_count = df['Vict Sex'].value_counts(dropna=False)
print(unique_values_vict_sex_count)

['F' 'M' 'X' nan 'H' '-']
Vict Sex
M      311959
F      278272
NaN     98230
X       64363
H          86
-           1
Name: count, dtype: int64


Next, the missing values in 'Vict Descent'.

The descent codes are: A - Other Asian; B - Black; C - Chinese; D - Cambodian; F - Filipino; G - Guamanian; H - Hispanic/Latin/Mexican; I - American Indian/Alaskan Native; J - Japanese; K - Korean; L - Laotian; O - Other; P - Pacific Islander; S - Samoan; U - Hawaiian; V - Vietnamese; W - White; X - Unknown; Z - Asian Indian

In [39]:
# Get all unique values of variable 'Vict Descent'
unique_values_vict_descent = df['Vict Descent'].unique()
print(unique_values_vict_descent)
print(f'Number of unique value: {len(unique_values_vict_descent)}')

# Get the unique values and their counts
unique_values_vict_descent_count = df['Vict Descent'].value_counts(dropna=False)
print(unique_values_vict_descent_count)

# Count the occurrences of NaN, 'X', 'O' and '-'
total_count = df['Vict Descent'].isna().sum() + df['Vict Descent'].isin(['X', 'O', '-']).sum()
print(f'Number of NaN/X/O/- summed: {total_count}')

['B' 'H' 'X' 'W' 'A' 'O' nan 'C' 'F' 'K' 'I' 'V' 'Z' 'J' 'P' 'S' 'G' 'U'
 'D' 'L' '-']
Number of unique value: 21
Vict Descent
H      231206
W      154519
B      108140
NaN     98236
X       71360
O       59578
A       16447
K        4003
F        3137
C        2829
J        1057
V         764
I         717
Z         367
P         205
U         155
D          51
G          50
L          46
S          42
-           2
Name: count, dtype: int64
Number of NaN/X/O/- summed: 229176
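The combined count can also be expressed as a single boolean mask, which is convenient if the same rows need to be selected or replaced later. A minimal sketch on a hypothetical seven-value sample; treating 'X', 'O' and '-' as uninformative follows the count above and is a modelling choice rather than a rule of the dataset.

```python
import pandas as pd

sample = pd.Series(["H", "W", None, "X", "O", "-", "B"], name="Vict Descent")

# Codes treated as "not informative" in the count above (X = Unknown, O = Other).
uninformative = ["X", "O", "-"]
mask = sample.isna() | sample.isin(uninformative)
print(int(mask.sum()), round(float(mask.mean()) * 100, 1))
```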


Next, 'Premis Cd' and 'Premis Desc'

In [42]:
# Drop duplicates based on 'Premis Cd' to get unique pairs of 'Premis Cd' and 'Premis Desc'
unique_pairs_premis = df[['Premis Cd', 'Premis Desc']].drop_duplicates(subset='Premis Cd')

# Count
print(f'Total unique values: {len(unique_pairs_premis)}')

# Display
print(unique_pairs_premis.head(307))

Total unique values: 313
        Premis Cd                                     Premis Desc
0           501.0                          SINGLE FAMILY DWELLING
1           102.0                                        SIDEWALK
2           726.0                                 POLICE FACILITY
3           502.0    MULTI-UNIT DWELLING (APARTMENT, DUPLEX, ETC)
4           409.0                             BEAUTY SUPPLY STORE
...           ...                                             ...
242468      305.0            CHEMICAL STORAGE/MANUFACTURING PLANT
252178      889.0      MTA - SILVER LINE - LAC/USC MEDICAL CENTER
284459      240.0                        DEPT OF DEFENSE FACILITY
297592      805.0        RETIRED (DUPLICATE) DO NOT USE THIS CODE
364050      126.0  TRAIN, OTHER THAN MTA (ALSO QUERY 809/810/811)

[307 rows x 2 columns]


Searching for 'other' or 'unknown'

In [45]:
# Define search keywords
search_terms = ['other', 'unknown']

# Use str.contains() to filter rows (case-insensitive)
filtered_rows_premis = df[df['Premis Desc'].str.contains('|'.join(search_terms), case=False, na=False)]

# Drop duplicate rows to get unique pairs of 'Premis Cd' and 'Premis Desc'
unique_pairs_premis_filtered = filtered_rows_premis[['Premis Cd', 'Premis Desc']].drop_duplicates()

# Display
print(unique_pairs_premis_filtered)

        Premis Cd                                     Premis Desc
9           203.0                                  OTHER BUSINESS
49          406.0                                     OTHER STORE
178         710.0                                   OTHER PREMISE
231         116.0                                   OTHER/OUTSIDE
341         504.0                                 OTHER RESIDENCE
1025        703.0               ENTERTAINMENT/COMEDY CLUB (OTHER)
1171        811.0         OTHER RR TRAIN (UNION PAC, SANTE FE ETC
1950        729.0                          SPECIALTY SCHOOL/OTHER
12144       136.0                   OTHER INTERSTATE, CHARTER BUS
15545       740.0                          OTHER PLACE OF WORSHIP
24451       215.0            TRAIN DEPOT/TERMINAL, OTHER THAN MTA
24713       214.0              BUS DEPOT/TERMINAL, OTHER THAN MTA
25343       742.0                             SPORTS VENUE, OTHER
102474      129.0                        TERMINAL, OTHER THAN MTA
364050    

Next, 'Weapon Used Cd' and 'Weapon Desc', using the same method as for the premises columns above. We want to replace missing values with the value that distorts the analysis least with regard to our business requirements: a value that can be seen as the "standard" case.

In [64]:
# Drop duplicates based on 'Weapon Used Cd' to get unique pairs of 'Weapon Used Cd' and 'Weapon Desc'
unique_pairs_weapon = df[['Weapon Used Cd', 'Weapon Desc']].drop_duplicates(subset='Weapon Used Cd')

# Count
print(f'Total unique values: {len(unique_pairs_weapon)}')

# Display
print(unique_pairs_weapon)

# Get the unique values and their counts
unique_values_weapon_count = df['Weapon Used Cd'].value_counts(dropna=False)

# Sort in descending order and print the 20 most frequent values
highest_20_values = unique_values_weapon_count.sort_values(ascending=False).head(20)
print(highest_20_values)


Total unique values: 80
        Weapon Used Cd                                     Weapon Desc
0                400.0  STRONG-ARM (HANDS, FIST, FEET OR BODILY FORCE)
1                500.0                     UNKNOWN WEAPON/OTHER WEAPON
2                  NaN                                             NaN
10               306.0                              ROCK/THROWN OBJECT
11               511.0                                   VERBAL THREAT
...                ...                                             ...
114389           125.0                                   RELIC FIREARM
136555           121.0   HECKLER & KOCH 91 SEMIAUTOMATIC ASSAULT RIFLE
176551           118.0                 UZI SEMIAUTOMATIC ASSAULT RIFLE
243263           117.0            UNK TYPE SEMIAUTOMATIC ASSAULT RIFLE
633797           124.0                M-14 SEMIAUTOMATIC ASSAULT RIFLE

[80 rows x 2 columns]
Weapon Used Cd
NaN      491439
400.0    140485
500.0     27112
511.0     19302
102.0     16205
109.0 

Looking for 'no weapon'

In [48]:
# Define search keywords
search_terms_weapon = ['no', 'without', 'unarmed']

# Use str.contains() to filter rows (case insensitive)
filtered_rows_weapon = df[df['Weapon Desc'].str.contains('|'.join(search_terms_weapon), case=False, na=False)]

# Drop duplicates to get unique pairs of 'Weapon Used Cd' and 'Weapon Desc'
unique_pairs_weapon_filtered = filtered_rows_weapon[['Weapon Used Cd', 'Weapon Desc']].drop_duplicates()

# Display
print(unique_pairs_weapon_filtered)

      Weapon Used Cd                      Weapon Desc
1              500.0      UNKNOWN WEAPON/OTHER WEAPON
513            504.0                      DEMAND NOTE
674            106.0                  UNKNOWN FIREARM
1442           223.0  UNKNOWN TYPE CUTTING INSTRUMENT
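
Note that the plain substring 'no' also matches inside 'UNKNOWN' and 'NOTE', which is why the rows above are returned even though none of them means "no weapon". A minimal sketch of a whole-word search is shown below. It runs on a small toy Series rather than on `df`, and the 'NO WEAPON' entry is a hypothetical label added only to show a true match.

```python
import pandas as pd

# Toy sample standing in for df['Weapon Desc'] (illustrative values only)
weapon_desc = pd.Series([
    'UNKNOWN WEAPON/OTHER WEAPON',
    'DEMAND NOTE',
    'UNKNOWN FIREARM',
    'NO WEAPON',  # hypothetical label, not taken from the dataset
    None,
])

search_terms_weapon = ['no', 'without', 'unarmed']

# \b anchors each term at word boundaries, so 'no' does not match
# inside 'UNKNOWN' or 'NOTE'; (?:...) is a non-capturing group
pattern = r'\b(?:' + '|'.join(search_terms_weapon) + r')\b'

whole_word_matches = weapon_desc[
    weapon_desc.str.contains(pattern, case=False, na=False)
]
print(whole_word_matches.tolist())
```

On the real data, the same pattern could be passed to `df['Weapon Desc'].str.contains(...)` in place of the plain joined terms.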


'Cross Street' analysis

In [49]:
# Get all unique values of variable 'Cross Street'
unique_values_cross_street = df['Cross Street'].unique()
print(unique_values_cross_street)
print(f'Number of unique value: {len(unique_values_cross_street)}')

# Get the unique values and their counts
unique_values_cross_street_count = df['Cross Street'].value_counts(dropna=False)
print(unique_values_cross_street_count)

[nan 'OLIVE' 'VERMONT' ... 'E  101ST' 'WOODLAWN                     ST'
 'GRAYSTONE                    ST']
Number of unique value: 9387
Cross Street
NaN                                631859
BROADWAY                             1982
FIGUEROA                     ST      1508
VERMONT                      AV      1279
FIGUEROA                             1266
                                    ...  
MILTON                                  1
MT LEE                                  1
LAKERIDGE                    PL         1
LAGUNA                       ST         1
GRAYSTONE                    ST         1
Name: count, Length: 9387, dtype: int64


Finding a fitting category to replace missing values

In [65]:
# Define search keywords
search_terms_street = ['unknown', 'without']

# Use str.contains() to filter rows (case insensitive)
filtered_rows_street = df[df['Cross Street'].str.contains('|'.join(search_terms_street), case=False, na=False)]

# Get the unique 'Cross Street' values among the filtered rows
unique_pairs_street_filtered = filtered_rows_street['Cross Street'].unique()

# Display
print(unique_pairs_street_filtered)

# Get the unique values and their counts
unique_values_street_count = df['Cross Street'].value_counts(dropna=False)

# Sort in descending order and print the 20 most frequent values
highest_20_values_street = unique_values_street_count.sort_values(ascending=False).head(20)
print(highest_20_values_street)

[]
Cross Street
NaN                                631859
BROADWAY                             1982
FIGUEROA                     ST      1508
VERMONT                      AV      1279
FIGUEROA                             1266
WESTERN                      AV      1189
SAN PEDRO                            1157
MAIN                         ST      1151
VERMONT                               908
CENTRAL                      AV       814
WESTERN                               776
AVALON                       BL       764
SAN PEDRO                    ST       668
MAIN                                  637
HOOVER                       ST       584
OLYMPIC                               575
LOS ANGELES                           567
SUNSET                                566
WILSHIRE                              562
HOLLYWOOD                             532
Name: count, dtype: int64


Description of data/columns

1. 'Mocodes'
* 'Mocodes' and the 'Crm Cd' variables do not share the same data or meaning.
* Only one Mocode, '2046', helps us assign a value to the damage caused when the crime was committed. This value does not appear in the 'Crm Cd' variables.
* Some 'Crm Cd' descriptions contain an estimate of the damage caused, but others, such as 'Robbery', 'Burglary' and 'vehicle, stolen', do not.
* The 'Crm Cd' variables all have the same meaning and sometimes repeat each other. Between 'Crm Cd' and 'Crm Cd 1' we found 1573 differences, so 'Crm Cd 1' can add meaning to the data. However, we are only interested in finding out the damage caused.
* I therefore decided to add a column with the maximum damage value, which I determined from the 'Crm Cd' title whenever it contained a '$'. I also created an ad hoc list of all 'Crm Cd' values that imply financial damage. Out of the 138 unique 'Crm Cd' values, I kept 39 and assigned each an arbitrary maximum damage amount, following the same logic.
2. 'Vict Sex'
* Contains 6 different values (incl. NaN). For the purpose of this exercise, I will consider only 3 (M/F/X), with X meaning 'unknown' or, here, 'other'. With regard to the business requirements, I consider it acceptable to convert (NaN/H/-) into X.
3. 'Vict Descent'
* There are 21 different values. Four of them (NaN/O/X/-) carry no further information, so I will replace (NaN/O/-) with X and call it 'other', as for 'Vict Sex'. Note that this makes it the second-largest 'Vict Descent' group in the data, with a count of 229176.
4. 'Premis Cd' and 'Premis Desc'
* I found a value that can be used to replace missing values without damaging the data analysis: "116" and "OTHER/OUTSIDE"
5. 'Weapon Used Cd' and 'Weapon Desc'
* I found a value that can be used to replace missing values without damaging the data analysis: "999" and "NO WEAPON"
6. 'Cross Street'
* I found a value that can be used to replace missing values without damaging the data analysis: "NO CROSS STREET"
7. Other variable work
* 'Date Occ': extract the day of the week
* Drop the following variables: DR_NO, Date Rptd, Rpt Dist No, Part 1-2, Mocodes, Status, Status Desc, Crm Cd 1, Crm Cd 2, Crm Cd 3, Crm Cd 4
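
The cleaning decisions listed above could be sketched roughly as follows. This is only an illustration on a two-row toy frame, not the code of the next notebook: the toy values, the ISO date strings, the `cleaned` variable and the 'Day of Week' column name are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Toy frame with the relevant columns (values are illustrative only)
df_demo = pd.DataFrame({
    'DR_NO': [1, 2],
    'Date Occ': ['2020-03-01', '2020-03-02'],  # assumed date format
    'Vict Sex': ['H', np.nan],
    'Vict Descent': ['O', '-'],
    'Premis Cd': [np.nan, 203.0],
    'Premis Desc': [np.nan, 'OTHER BUSINESS'],
    'Weapon Used Cd': [np.nan, 400.0],
    'Weapon Desc': [np.nan, 'STRONG-ARM (HANDS, FIST, FEET OR BODILY FORCE)'],
    'Cross Street': [np.nan, 'BROADWAY'],
})

cleaned = df_demo.copy()

# Collapse uninformative categories (and NaN) into 'X'
cleaned['Vict Sex'] = cleaned['Vict Sex'].replace(['H', '-'], 'X').fillna('X')
cleaned['Vict Descent'] = cleaned['Vict Descent'].replace(['O', '-'], 'X').fillna('X')

# Fill missing values with the neutral categories chosen above
cleaned = cleaned.fillna({
    'Premis Cd': 116.0,
    'Premis Desc': 'OTHER/OUTSIDE',
    'Weapon Used Cd': 999.0,
    'Weapon Desc': 'NO WEAPON',
    'Cross Street': 'NO CROSS STREET',
})

# Extract the day of the week ('Day of Week' is a hypothetical column name)
cleaned['Day of Week'] = pd.to_datetime(cleaned['Date Occ']).dt.day_name()

# Drop the variables that are not needed; errors='ignore' skips
# any column that is absent from the frame
cols_to_drop = ['DR_NO', 'Date Rptd', 'Rpt Dist No', 'Part 1-2', 'Mocodes',
                'Status', 'Status Desc', 'Crm Cd 1', 'Crm Cd 2', 'Crm Cd 3',
                'Crm Cd 4']
cleaned = cleaned.drop(columns=cols_to_drop, errors='ignore')

print(cleaned)
```

If 'Mocodes' is needed to derive the damage column, that derivation would have to happen before the drop step.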


---

# Push files to Repo

In [6]:
import os

# Create the output folder if it does not exist yet
os.makedirs('outputs/datasets/collection', exist_ok=True)

# Save the raw data without the index column
df.to_csv('outputs/datasets/collection/dataPP5.csv', index=False)


---

# Conclusion and next step

In conclusion, we saw that the dataset is very large, yet really clean (good usability).

However, we are lacking a column that can help us determine whether financial damage was done to the victim.

We found that there are two sources that we can use to do this:
- Mocodes: '2046'
- 'Crm Cd' (1, 2, 3, 4): according to the file 'Crm Cd Desc analyses.cvs' that we compiled manually.
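
As a rough sketch, a financial damage flag could be derived from these two sources along the following lines. The example assumes that 'Mocodes' holds space-separated codes and that the crime title is stored in a 'Crm Cd Desc' column. The toy rows and the `financial_damage` column name are made up for illustration, and the manually compiled list of codes is not reproduced here.

```python
import pandas as pd

# Toy rows (illustrative values only, not taken from the dataset)
df_demo = pd.DataFrame({
    'Mocodes': ['0344 2046', '1822', None],
    'Crm Cd Desc': [
        'EXAMPLE THEFT ($950 & UNDER)',
        'EXAMPLE OFFENCE WITHOUT AMOUNT',
        'EXAMPLE VANDALISM ($400 & OVER)',
    ],
})

# Assumption: 'Mocodes' is a string of space-separated codes.
# Splitting first avoids matching '2046' inside a longer token.
has_2046 = (
    df_demo['Mocodes']
    .fillna('')
    .str.split()
    .apply(lambda codes: '2046' in codes)
)

# Flag crime titles that mention a dollar amount (literal '$', no regex)
mentions_dollar = df_demo['Crm Cd Desc'].str.contains('$', regex=False, na=False)

# 'financial_damage' is a hypothetical column name
df_demo['financial_damage'] = has_2046 | mentions_dollar
print(df_demo)
```

The '$' check only covers titles that state an amount; titles from the manual list without a '$' would need an explicit lookup.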

We also found missing values and identified suitable values to replace them with.

We also identified variables that are not useful for the analysis.

We will carry out these steps in the next notebook.