LAPD Crime dataset from 2020 to 2025

28 Columns:
0. DR_NO: Division of Records Number: Official file number made up of a 2 digit year, area ID, and 5 digits
1. Date Rptd
2. DATE OCC
3. TIME OCC: In 24-hour military time.
4. AREA: The LAPD has 21 Community Police Stations referred to as Geographic Areas within the department. These Geographic Areas are sequentially numbered from 1-21.
5. AREA NAME: The 21 Geographic Areas or Patrol Divisions are also given a name designation that references a landmark or the surrounding community that it is responsible for. For example, the 77th Street Division is located at the intersection of South Broadway and 77th Street, serving neighborhoods in South Los Angeles.
6. Rpt Dist No: A four-digit code that represents a sub-area within a Geographic Area. Every crime record references the "RD" in which the crime occurred, for statistical comparisons.
7. Part 1-2
8. Crm Cd: Indicates the crime committed. (Same as Crime Code 1)
9. Crm Cd Desc: Defines the Crime Code provided
10. Mocodes: Modus Operandi: Activities associated with the suspect in commission of the crime. See attached PDF for list of MO Codes in numerical order.
11. Vict Age: Two character numeric
12. Vict Sex: F - Female M - Male X - Unknown
13. Vict Descent: Descent Code: A - Other Asian B - Black C - Chinese D - Cambodian F - Filipino G - Guamanian H - Hispanic/Latin/Mexican I - American Indian/Alaskan Native J - Japanese K - Korean L - Laotian O - Other P - Pacific Islander S - Samoan U - Hawaiian V - Vietnamese W - White X - Unknown Z - Asian Indian
14. Premis Cd: The type of structure, vehicle, or location where the crime took place.
15. Premis Desc: Defines the Premise Code provided.
16. Weapon Used Cd: The type of weapon used in the crime.
17. Weapon Desc: Defines the Weapon Used Code provided.
18. Status: Status of the case. (IC is the default)
19. Status Desc: Defines the Status Code provided.
20. Crm Cd 1: Indicates the crime committed. Crime Code 1 is the primary and most serious one. Crime Code 2, 3, and 4 are respectively less serious offenses. Lower crime class numbers are more serious.
21. Crm Cd 2: May contain a code for an additional crime, less serious than Crime Code 1.
22. Crm Cd 3
23. Crm Cd 4
24. LOCATION: Street address of crime incident rounded to the nearest hundred block to maintain anonymity.
25. Cross Street: Cross Street of rounded Address
26. LAT: Latitude
27. LON: Longitude
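
The DR_NO and TIME OCC descriptions above can be checked against the sample rows with a small sketch. The helper names `split_dr_no` and `format_time_occ` are hypothetical, and the code assumes a 9-digit DR_NO (2-digit year, 2-digit area ID, 5 digits), a year in the 2000s, and TIME OCC stored as an HHMM integer in 24-hour time, as values such as 845 and 1845 suggest.

```python
def split_dr_no(dr_no: int) -> dict:
    """Split a DR_NO into year, area ID, and sequence (hypothetical helper)."""
    text = str(dr_no).zfill(9)  # assumption: 9 digits = YY + AA + NNNNN
    return {
        "year": 2000 + int(text[:2]),  # assumption: all years are 20xx
        "area": int(text[2:4]),
        "sequence": text[4:],
    }


def format_time_occ(time_occ: int) -> str:
    """Turn an HHMM integer such as 845 into 'HH:MM' (hypothetical helper)."""
    text = str(time_occ).zfill(4)  # 845 -> '0845'
    return f"{text[:2]}:{text[2:]}"


parts = split_dr_no(211507896)      # first row of the sample output
time_text = format_time_occ(845)
print(parts, time_text)
```

For the first sample row the area part is 15, which matches that row's AREA value; rows where the two disagree would be worth a closer look.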

Reference: https://data.lacity.org/Public-Safety/Crime-Data-from-2020-to-Present/2nrs-mtv8/about_data 

Steps:
1. Read csv data
2. Check null data
- Fill null value with X in "Vict Sex" and "Vict Descent" column
- Replace null with "IC" (default value) in "Status" column
- Remove columns "Weapon Used Cd", "Weapon Desc", "Crm Cd 2", "Crm Cd 3", "Crm Cd 4" and "Cross Street" (too many nulls; data not useful)
----> what can I do with other null columns? 
3. Rename columns to avoid using spaces
4. Change data type of all columns
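
Steps 3 and 4 could look roughly like the sketch below, run here on a one-row stand-in rather than the real file. The helper `to_snake_case`, the choice of snake_case names, and the `category` dtype are assumptions; the date format string is inferred from values such as "2021 Apr 11 12:00:00 AM" in the sample output.

```python
import pandas as pd

# One-row stand-in using column names and values from the dataset's sample rows.
sample = pd.DataFrame({
    "Date Rptd": ["2021 Apr 11 12:00:00 AM"],
    "DATE OCC": ["2020 Nov 07 12:00:00 AM"],
    "AREA NAME": ["N Hollywood"],
    "Part 1-2": [2],
    "Vict Sex": ["M"],
})


def to_snake_case(name: str) -> str:
    """Lower-case a column name and replace spaces and hyphens (hypothetical helper)."""
    return name.strip().lower().replace(" ", "_").replace("-", "_")


# Step 3: rename every column with the helper.
sample = sample.rename(columns=to_snake_case)

# Step 4: parse the date strings with an explicit format, then
# store the low-cardinality code column as a category.
date_format = "%Y %b %d %I:%M:%S %p"
for col in ["date_rptd", "date_occ"]:
    sample[col] = pd.to_datetime(sample[col], format=date_format)
sample["vict_sex"] = sample["vict_sex"].astype("category")

print(sample.dtypes)
```

Passing an explicit `format` avoids per-row format guessing, which tends to matter on a file with about a million rows.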

In [3]:
import pandas as pd

In [4]:
crime = pd.read_csv("Crime_Data_from_2020_to_Present_20260131.csv")
crime.head()

Unnamed: 0,DR_NO,Date Rptd,DATE OCC,TIME OCC,AREA,AREA NAME,Rpt Dist No,Part 1-2,Crm Cd,Crm Cd Desc,...,Status,Status Desc,Crm Cd 1,Crm Cd 2,Crm Cd 3,Crm Cd 4,LOCATION,Cross Street,LAT,LON
0,211507896,2021 Apr 11 12:00:00 AM,2020 Nov 07 12:00:00 AM,845,15,N Hollywood,1502,2,354,THEFT OF IDENTITY,...,IC,Invest Cont,354.0,,,,7800 BEEMAN AV,,34.2124,-118.4092
1,201516622,2020 Oct 21 12:00:00 AM,2020 Oct 18 12:00:00 AM,1845,15,N Hollywood,1521,1,230,"ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT",...,IC,Invest Cont,230.0,,,,ATOLL AV,N GAULT,34.1993,-118.4203
2,240913563,2024 Dec 10 12:00:00 AM,2020 Oct 30 12:00:00 AM,1240,9,Van Nuys,933,2,354,THEFT OF IDENTITY,...,IC,Invest Cont,354.0,,,,14600 SYLVAN ST,,34.1847,-118.4509
3,210704711,2020 Dec 24 12:00:00 AM,2020 Dec 24 12:00:00 AM,1310,7,Wilshire,782,1,331,THEFT FROM MOTOR VEHICLE - GRAND ($950.01 AND ...,...,IC,Invest Cont,331.0,,,,6000 COMEY AV,,34.0339,-118.3747
4,201418201,2020 Oct 03 12:00:00 AM,2020 Sep 29 12:00:00 AM,1830,14,Pacific,1454,1,420,THEFT FROM MOTOR VEHICLE - PETTY ($950 & UNDER),...,IC,Invest Cont,420.0,,,,4700 LA VILLA MARINA,,33.9813,-118.435


In [5]:
crime.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1004991 entries, 0 to 1004990
Data columns (total 28 columns):
 #   Column          Non-Null Count    Dtype  
---  ------          --------------    -----  
 0   DR_NO           1004991 non-null  int64  
 1   Date Rptd       1004991 non-null  object 
 2   DATE OCC        1004991 non-null  object 
 3   TIME OCC        1004991 non-null  int64  
 4   AREA            1004991 non-null  int64  
 5   AREA NAME       1004991 non-null  object 
 6   Rpt Dist No     1004991 non-null  int64  
 7   Part 1-2        1004991 non-null  int64  
 8   Crm Cd          1004991 non-null  int64  
 9   Crm Cd Desc     1004991 non-null  object 
 10  Mocodes         853372 non-null   object 
 11  Vict Age        1004991 non-null  int64  
 12  Vict Sex        860347 non-null   object 
 13  Vict Descent    860335 non-null   object 
 14  Premis Cd       1004975 non-null  float64
 15  Premis Desc     1004403 non-null  object 
 16  Weapon Used Cd  327247 non-null   fl
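
As an alternative to checking columns one at a time, the null counts and percentages for every column can be collected in a single table. This is a minimal sketch on a small stand-in frame; the name `null_summary` is an assumption, and on the real data `sample` would be replaced by `crime`.

```python
import numpy as np
import pandas as pd

# Small stand-in for the crime DataFrame.
sample = pd.DataFrame({
    "DR_NO": [1, 2, 3, 4],
    "Mocodes": ["0377", None, "0344", None],
    "Vict Sex": ["M", None, "F", "M"],
    "Premis Cd": [501.0, 102.0, np.nan, 101.0],
})

# One row per column: how many nulls, and what share of all rows they are.
null_summary = pd.DataFrame({
    "null_count": sample.isnull().sum(),
    "null_pct": (sample.isnull().mean() * 100).round(2),
}).sort_values("null_count", ascending=False)

print(null_summary)
```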

In [6]:
# Find null in "Mocodes" column
print(crime['Mocodes'])
print(crime.loc[crime["Mocodes"].isnull(), "Mocodes"])
print(crime["Mocodes"].isnull().mean() * 100)

# Null: 151619 entries, equal to 15.09%

0                                             0377
1          0416 0334 2004 1822 1414 0305 0319 0400
2                                             0377
3                                             0344
4                              1300 0344 1606 2032
                            ...                   
1004986                                        NaN
1004987                             1258 0553 0602
1004988                                        NaN
1004989                        0400 1259 1822 0356
1004990                        0529 2024 1815 0913
Name: Mocodes, Length: 1004991, dtype: object
12         NaN
31         NaN
39         NaN
53         NaN
54         NaN
          ... 
1004980    NaN
1004981    NaN
1004983    NaN
1004986    NaN
1004988    NaN
Name: Mocodes, Length: 151619, dtype: object
15.086602765596908
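
Each Mocodes value is a space-separated string of MO codes, so one option for this column is to keep the nulls and split the rest into lists. The sketch below is only one possible treatment; the column names `mo_list` and `mo_count` are assumptions, and rows without Mocodes simply get a count of 0.

```python
import pandas as pd

# Stand-in rows; the code strings are taken from the printed sample.
sample = pd.DataFrame({
    "DR_NO": [1, 2, 3],
    "Mocodes": ["0377", "1300 0344 1606 2032", None],
})

# str.split() with no arguments splits on whitespace; missing values stay missing.
sample["mo_list"] = sample["Mocodes"].str.split()

# Number of MO codes per record, with 0 where Mocodes is null.
sample["mo_count"] = sample["mo_list"].str.len().fillna(0).astype(int)

# One row per (record, MO code); a record with no codes keeps a single NaN row.
exploded = sample.explode("mo_list")

print(sample[["DR_NO", "mo_count"]])
print(exploded[["DR_NO", "mo_list"]])
```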


In [7]:
# Find null in "Vict Sex" column
print(crime['Vict Sex'])
print(crime.loc[crime["Vict Sex"].isnull(), "Vict Sex"])
print(crime["Vict Sex"].isnull().mean() * 100)

# Null: 144644 entries, equal to 14.39%

0          M
1          M
2          M
3          F
4          M
          ..
1004986    M
1004987    M
1004988    F
1004989    M
1004990    F
Name: Vict Sex, Length: 1004991, dtype: object
12         NaN
31         NaN
39         NaN
53         NaN
54         NaN
          ... 
1004944    NaN
1004958    NaN
1004967    NaN
1004980    NaN
1004983    NaN
Name: Vict Sex, Length: 144644, dtype: object
14.392566699602286


In [8]:
# Turn NaN in "Vict Sex" column into unknown "X"
crime["Vict Sex"] = crime["Vict Sex"].fillna("X")


In [9]:
# Find null in "Vict Descent" column
print(crime['Vict Descent'])
print(crime.loc[crime["Vict Descent"].isnull(), "Vict Descent"])
print(crime["Vict Descent"].isnull().mean() * 100)

# Null: 144656 entries, equal to 14.39%

0          H
1          H
2          W
3          A
4          H
          ..
1004986    X
1004987    B
1004988    H
1004989    H
1004990    H
Name: Vict Descent, Length: 1004991, dtype: object
12         NaN
31         NaN
39         NaN
53         NaN
54         NaN
          ... 
1004944    NaN
1004958    NaN
1004967    NaN
1004980    NaN
1004983    NaN
Name: Vict Descent, Length: 144656, dtype: object
14.39376074014593


In [10]:
# Turn NaN in "Vict Descent" column into unknown "X"
crime["Vict Descent"] = crime["Vict Descent"].fillna("X")
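
The two fills can also be written as one call by passing a dictionary to `fillna`, and the result can be checked against the documented "Vict Sex" codes (F, M, X). This is a sketch on stand-in data; the "?" value is made up purely to show the check and is not claimed to appear in the real file.

```python
import pandas as pd

# Stand-in rows; "?" is an invented stray value for illustration only.
sample = pd.DataFrame({
    "Vict Sex": ["M", None, "F", "?"],
    "Vict Descent": ["H", None, "W", "A"],
})

# Fill both columns with the documented "unknown" code in a single call.
sample = sample.fillna({"Vict Sex": "X", "Vict Descent": "X"})

# Flag values outside the codes listed in the column description.
valid_sex = {"F", "M", "X"}
unexpected = sample.loc[~sample["Vict Sex"].isin(valid_sex), "Vict Sex"]

print(sample)
print(unexpected)
```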

In [11]:
# Find null in "Premis Cd" column
print(crime['Premis Cd'])
print(crime.loc[crime["Premis Cd"].isnull(), "Premis Cd"])
print(crime["Premis Cd"].isnull().mean() * 100)

# Null: 16 entries, equal to 0.0016%

0          501.0
1          102.0
2          501.0
3          101.0
4          103.0
           ...  
1004986    101.0
1004987    501.0
1004988    101.0
1004989    721.0
1004990    721.0
Name: Premis Cd, Length: 1004991, dtype: float64
44349    NaN
126003   NaN
342084   NaN
380958   NaN
387121   NaN
503306   NaN
508546   NaN
540858   NaN
647036   NaN
708864   NaN
882203   NaN
891044   NaN
907208   NaN
911939   NaN
924323   NaN
941367   NaN
Name: Premis Cd, dtype: float64
0.0015920540581955461
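
"Premis Cd" is read as float64 only because a few rows are null. One way to keep integer codes without inventing a fill value is pandas' nullable `Int64` dtype, sketched here on a stand-in Series; whether to keep, fill, or drop these 16 rows is still an open choice.

```python
import numpy as np
import pandas as pd

# Stand-in for crime["Premis Cd"]: whole-number codes stored as floats plus one null.
premis = pd.Series([501.0, 102.0, np.nan, 721.0], name="Premis Cd")

# "Int64" (capital I) is the nullable integer dtype; missing values become <NA>.
premis_int = premis.astype("Int64")

print(premis_int)
```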


In [12]:
# Find null in "Premis Desc" column
print(crime['Premis Desc'])
print(crime.loc[crime["Premis Desc"].isnull(), "Premis Desc"])
print(crime["Premis Desc"].isnull().mean() * 100)

# Null: 588 entries, equal to 0.059%

0          SINGLE FAMILY DWELLING
1                        SIDEWALK
2          SINGLE FAMILY DWELLING
3                          STREET
4                           ALLEY
                    ...          
1004986                    STREET
1004987    SINGLE FAMILY DWELLING
1004988                    STREET
1004989               HIGH SCHOOL
1004990               HIGH SCHOOL
Name: Premis Desc, Length: 1004991, dtype: object
861        NaN
899        NaN
13186      NaN
13967      NaN
14394      NaN
          ... 
990424     NaN
990707     NaN
996501     NaN
999798     NaN
1003658    NaN
Name: Premis Desc, Length: 588, dtype: object
0.05850798663868632


In [13]:
# Find null in "Weapon Used Cd" column
print(crime['Weapon Used Cd'])
print(crime.loc[crime["Weapon Used Cd"].isnull(), "Weapon Used Cd"])
print(crime["Weapon Used Cd"].isnull().mean() * 100)

# Null: 677744 entries, equal to 67.44%

0            NaN
1          200.0
2            NaN
3            NaN
4            NaN
           ...  
1004986      NaN
1004987      NaN
1004988      NaN
1004989    400.0
1004990      NaN
Name: Weapon Used Cd, Length: 1004991, dtype: float64
0         NaN
2         NaN
3         NaN
4         NaN
5         NaN
           ..
1004985   NaN
1004986   NaN
1004987   NaN
1004988   NaN
1004990   NaN
Name: Weapon Used Cd, Length: 677744, dtype: float64
67.43781785110514


In [14]:
# Find null in "Weapon Desc" column
print(crime['Weapon Desc'])
print(crime.loc[crime["Weapon Desc"].isnull(), "Weapon Desc"])
print(crime["Weapon Desc"].isnull().mean() * 100)

# Null: 677744 entries, equal to 67.44%

0                                                     NaN
1                        KNIFE WITH BLADE 6INCHES OR LESS
2                                                     NaN
3                                                     NaN
4                                                     NaN
                                ...                      
1004986                                               NaN
1004987                                               NaN
1004988                                               NaN
1004989    STRONG-ARM (HANDS, FIST, FEET OR BODILY FORCE)
1004990                                               NaN
Name: Weapon Desc, Length: 1004991, dtype: object
0          NaN
2          NaN
3          NaN
4          NaN
5          NaN
          ... 
1004985    NaN
1004986    NaN
1004987    NaN
1004988    NaN
1004990    NaN
Name: Weapon Desc, Length: 677744, dtype: object
67.43781785110514
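
Both weapon columns show the same null count, which suggests they are missing together. Before dropping them, one option is to keep a boolean flag recording whether a weapon code was present. This is a sketch only: the column name `weapon_recorded` is an assumption, and the source does not say whether a null means "no weapon" or "not recorded".

```python
import numpy as np
import pandas as pd

# Stand-in rows modelled on the printed sample output.
sample = pd.DataFrame({
    "Weapon Used Cd": [np.nan, 200.0, np.nan, 400.0],
    "Weapon Desc": [
        None,
        "KNIFE WITH BLADE 6INCHES OR LESS",
        None,
        "STRONG-ARM (HANDS, FIST, FEET OR BODILY FORCE)",
    ],
})

# True where a weapon code exists, False where it is null.
sample["weapon_recorded"] = sample["Weapon Used Cd"].notna()

# Rows where only one of the two weapon columns is null.
mismatched = sample["Weapon Used Cd"].isna() != sample["Weapon Desc"].isna()

print(sample["weapon_recorded"].tolist())
print(int(mismatched.sum()))
```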


In [28]:
# Find null in "Status" column
print(crime['Status'].unique())
print(crime.loc[crime["Status"].isnull(), "Status"])


['IC' 'AO' 'AA' 'JA' 'JO' 'CC' nan]
882203    NaN
Name: Status, dtype: object


In [29]:
# Replace null with "IC" (default value) in "Status" column
crime["Status"] = crime["Status"].fillna("IC")

In [33]:
# Find null in "Crm Cd 1" column
print(crime['Crm Cd 1'].unique())
print(crime.loc[crime["Crm Cd 1"].isnull(), "Crm Cd 1"])
print(crime["Crm Cd 1"].isnull().mean() * 100)

[354. 230. 331. 420. 812. 510. 310. 330. 440. 626. 624. 745. 520. 740.
 210. 906. 480. 668. 627. 930. 341. 860. 956. 900. 648. 901. 350. 903.
 236. 220. 760. 946. 761. 888. 666. 442. 320. 649. 343. 890. 623. 522.
 237. 122. 662. 954. 805. 625. 121. 845. 753. 940. 250. 231. 664. 437.
 902. 813. 251. 410. 647. 810. 815. 762. 352. 922. 351. 421. 822. 850.
 444. 670. 821. 910. 886. 661. 110. 441. 931. 435. 438. 928. 443. 755.
 235. 763. 820. 433. 622. 932. 439. 487. 951. 949. 865. 474. 654. 113.
 450. 920. 475. 446. 652. 434. 653. 950. 814. 806. 943. 660. 345. 921.
 933. 651. 756. 349. 944. 353. 436. 430. 473. 880. 471. 452. 521. 347.
 470.  nan 451. 485. 942. 870. 924. 840. 948. 884. 904. 830. 432. 882.
 445. 926. 453.]
91235    NaN
133312   NaN
237001   NaN
350431   NaN
432358   NaN
495781   NaN
508451   NaN
694465   NaN
705396   NaN
801499   NaN
873827   NaN
Name: Crm Cd 1, dtype: float64
0.001094537165009438
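
The column notes describe "Crm Cd" as the same as Crime Code 1, and "Crm Cd" has no nulls, so the 11 missing "Crm Cd 1" values could be filled from it. The sketch below assumes that equivalence holds; it has not been verified for every row, so comparing the two columns on the non-null rows first would be a sensible check.

```python
import numpy as np
import pandas as pd

# Stand-in rows: the second record is missing its "Crm Cd 1".
sample = pd.DataFrame({
    "Crm Cd": [354, 230, 331],
    "Crm Cd 1": [354.0, np.nan, 331.0],
})

# Share of non-null rows where the two columns agree (check before filling).
known = sample["Crm Cd 1"].notna()
agreement = (sample.loc[known, "Crm Cd"] == sample.loc[known, "Crm Cd 1"]).mean()

# fillna with a Series aligns on the index, so each gap takes its own row's "Crm Cd".
sample["Crm Cd 1"] = sample["Crm Cd 1"].fillna(sample["Crm Cd"]).astype("int64")

print(agreement)
print(sample)
```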


In [None]:
# Find null in "Crm Cd 2" column
print(crime['Crm Cd 2'])
print(crime.loc[crime["Crm Cd 2"].isnull(), "Crm Cd 2"])
print(crime["Crm Cd 2"].isnull().mean() * 100)

0         NaN
1         NaN
2         NaN
3         NaN
4         NaN
           ..
1004986   NaN
1004987   NaN
1004988   NaN
1004989   NaN
1004990   NaN
Name: Crm Cd 2, Length: 1004991, dtype: float64
0         NaN
1         NaN
2         NaN
3         NaN
4         NaN
           ..
1004986   NaN
1004987   NaN
1004988   NaN
1004989   NaN
1004990   NaN
Name: Crm Cd 2, Length: 935831, dtype: float64
93.11834633344975
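
Dropping the mostly-null columns listed in step 2 can be sketched as follows, on a stand-in frame. The name `columns_to_drop` is an assumption, and `errors="ignore"` is used so that the cell can be re-run after the columns are already gone.

```python
import numpy as np
import pandas as pd

# Columns listed for removal in step 2.
columns_to_drop = [
    "Weapon Used Cd", "Weapon Desc",
    "Crm Cd 2", "Crm Cd 3", "Crm Cd 4",
    "Cross Street",
]

# Stand-in frame with two columns to keep and the six to remove.
sample = pd.DataFrame({
    "DR_NO": [211507896, 201516622],
    "Weapon Used Cd": [np.nan, 200.0],
    "Weapon Desc": [None, "KNIFE WITH BLADE 6INCHES OR LESS"],
    "Crm Cd 2": [np.nan, np.nan],
    "Crm Cd 3": [np.nan, np.nan],
    "Crm Cd 4": [np.nan, np.nan],
    "LOCATION": ["7800 BEEMAN AV", "ATOLL AV"],
    "Cross Street": [None, "N GAULT"],
})

# errors="ignore" skips names that are no longer present instead of raising.
cleaned = sample.drop(columns=columns_to_drop, errors="ignore")

print(cleaned.columns.tolist())
```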


In [30]:
crime.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1004991 entries, 0 to 1004990
Data columns (total 28 columns):
 #   Column          Non-Null Count    Dtype  
---  ------          --------------    -----  
 0   DR_NO           1004991 non-null  int64  
 1   Date Rptd       1004991 non-null  object 
 2   DATE OCC        1004991 non-null  object 
 3   TIME OCC        1004991 non-null  int64  
 4   AREA            1004991 non-null  int64  
 5   AREA NAME       1004991 non-null  object 
 6   Rpt Dist No     1004991 non-null  int64  
 7   Part 1-2        1004991 non-null  int64  
 8   Crm Cd          1004991 non-null  int64  
 9   Crm Cd Desc     1004991 non-null  object 
 10  Mocodes         853372 non-null   object 
 11  Vict Age        1004991 non-null  int64  
 12  Vict Sex        1004991 non-null  object 
 13  Vict Descent    1004991 non-null  object 
 14  Premis Cd       1004975 non-null  float64
 15  Premis Desc     1004403 non-null  object 
 16  Weapon Used Cd  327247 non-null   fl