# Data cleaning for the dataset of murders in India, 2001 to 2010

The dataset was downloaded from Kaggle at https://www.kaggle.com/datasets/rajanand/crime-in-india/code

Import the required packages.

In [1]:
import os
import pandas as pd

Load the data.

In [2]:
path = os.path.join(os.getcwd(), "murders_india.csv")
data_org = pd.read_csv(path)
data_org

,Area_Name,Year,Group_Name,Sub_Group_Name,Victims_Above_50_Yrs,Victims_Total,Victims_Upto_10_15_Yrs,Victims_Upto_10_Yrs,Victims_Upto_15_18_Yrs,Victims_Upto_18_30_Yrs,Victims_Upto_30_50_Yrs
0,Andaman & Nicobar Islands,2001,Murder - Female Victims,2. Female Victims,,6,,,,4.0,2.0
1,Andhra Pradesh,2001,Murder - Female Victims,2. Female Victims,67.0,607,15.0,38.0,43.0,269.0,175.0
2,Arunachal Pradesh,2001,Murder - Female Victims,2. Female Victims,2.0,16,0.0,0.0,0.0,10.0,4.0
3,Assam,2001,Murder - Female Victims,2. Female Victims,11.0,128,8.0,4.0,23.0,45.0,37.0
4,Bihar,2001,Murder - Female Victims,2. Female Victims,12.0,366,0.0,0.0,40.0,191.0,123.0
...,...,...,...,...,...,...,...,...,...,...,...
1013,Tamil Nadu,2010,Murder - Total Victims,3. Total,327.0,1908,13.0,63.0,16.0,650.0,839.0
1014,Tripura,2010,Murder - Total Victims,3. Total,24.0,159,2.0,0.0,0.0,60.0,73.0
1015,Uttar Pradesh,2010,Murder - Total Victims,3. Total,344.0,4456,82.0,138.0,126.0,2358.0,1408.0
1016,Uttarakhand,2010,Murder - Total Victims,3. Total,19.0,176,1.0,,2.0,91.0,63.0
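
The output above shows blank cells in several age-group columns (the first row, for example, has no value for `Victims_Above_50_Yrs`), while `Victims_Total` is populated in every displayed row. A quick missing-value count can confirm this before the age-group columns are dropped. The sketch below rebuilds two of the displayed rows by hand purely for illustration; `sample` and `missing_per_column` are hypothetical names, and on the real data the equivalent call would be `data_org.isna().sum()`.

```python
import numpy as np
import pandas as pd

# Two rows copied from the output above; NaN stands for the blank cells.
sample = pd.DataFrame(
    {
        "Area_Name": ["Andaman & Nicobar Islands", "Andhra Pradesh"],
        "Year": [2001, 2001],
        "Victims_Above_50_Yrs": [np.nan, 67.0],
        "Victims_Total": [6, 607],
        "Victims_Upto_18_30_Yrs": [4.0, 269.0],
        "Victims_Upto_30_50_Yrs": [2.0, 175.0],
    }
)

# Count the missing values in each column.
missing_per_column = sample.isna().sum()
print(missing_per_column)
```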


Keep only the required columns.

In [3]:
data = data_org[["Area_Name", "Year", "Sub_Group_Name", "Victims_Total"]]
data

,Area_Name,Year,Sub_Group_Name,Victims_Total
0,Andaman & Nicobar Islands,2001,2. Female Victims,6
1,Andhra Pradesh,2001,2. Female Victims,607
2,Arunachal Pradesh,2001,2. Female Victims,16
3,Assam,2001,2. Female Victims,128
4,Bihar,2001,2. Female Victims,366
...,...,...,...,...
1013,Tamil Nadu,2010,3. Total,1908
1014,Tripura,2010,3. Total,159
1015,Uttar Pradesh,2010,3. Total,4456
1016,Uttarakhand,2010,3. Total,176


In [4]:
data.columns

Index(['Area_Name', 'Year', 'Sub_Group_Name', 'Victims_Total'], dtype='object')

Rename the columns.


In [5]:
key_rename = {
    'Area_Name' : "State",  
    'Sub_Group_Name' : 'Gender', 
    'Victims_Total' : 'Total Crimes'
}
data.rename(columns= key_rename, inplace= True)
data

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data.rename(columns= key_rename, inplace= True)


,State,Year,Gender,Total Crimes
0,Andaman & Nicobar Islands,2001,2. Female Victims,6
1,Andhra Pradesh,2001,2. Female Victims,607
2,Arunachal Pradesh,2001,2. Female Victims,16
3,Assam,2001,2. Female Victims,128
4,Bihar,2001,2. Female Victims,366
...,...,...,...,...
1013,Tamil Nadu,2010,3. Total,1908
1014,Tripura,2010,3. Total,159
1015,Uttar Pradesh,2010,3. Total,4456
1016,Uttarakhand,2010,3. Total,176
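
The `SettingWithCopyWarning` in the output above is raised because `data` was derived from `data_org` by column selection, so pandas flags the in-place rename as a modification of what may be a copy. One common way to avoid the warning, shown here as a minimal sketch and not as the notebook's own method, is to take an explicit `.copy()` when selecting the columns and to reassign the result of `rename` instead of using `inplace=True`. The small `data_org` below is a stand-in built from rows shown earlier.

```python
import pandas as pd

# Stand-in for data_org, built from rows shown in the outputs above.
data_org = pd.DataFrame(
    {
        "Area_Name": ["Andhra Pradesh", "Assam"],
        "Year": [2001, 2001],
        "Group_Name": ["Murder - Female Victims", "Murder - Female Victims"],
        "Sub_Group_Name": ["2. Female Victims", "2. Female Victims"],
        "Victims_Total": [607, 128],
    }
)

key_rename = {
    "Area_Name": "State",
    "Sub_Group_Name": "Gender",
    "Victims_Total": "Total Crimes",
}

# .copy() makes data an independent DataFrame rather than a slice of data_org.
data = data_org[["Area_Name", "Year", "Sub_Group_Name", "Victims_Total"]].copy()

# rename() returns a new DataFrame; reassigning it avoids inplace=True.
data = data.rename(columns=key_rename)
print(list(data.columns))
```

With this pattern `data_org` keeps its original column names, so the raw data stays available for later comparison.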


In [6]:
pd.unique(data.Gender)

array(['2. Female Victims', '1. Male Victims', '3. Total'], dtype=object)

In [7]:
gender = data.Gender
# Strip the numeric prefix ("1. ", "2. ", "3. ") from each label.
gender = gender.str.replace(r"^\d+\.\s*", "", regex=True)
# Strip the " Victims" suffix, leaving "Female", "Male" and "Total".
gender = gender.str.replace(" Victims", "", regex=False)
data.Gender = gender
data.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data.Gender = gender


,State,Year,Gender,Total Crimes
0,Andaman & Nicobar Islands,2001,Female,6
1,Andhra Pradesh,2001,Female,607
2,Arunachal Pradesh,2001,Female,16
3,Assam,2001,Female,128
4,Bihar,2001,Female,366


In [8]:
pd.unique(gender)

array(['Female', 'Male', 'Total'], dtype=object)
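
Because the column holds only the three labels listed by `pd.unique`, an explicit lookup table is an alternative to string replacement. The sketch below is an assumption about how that could look, not the approach used in this notebook; `gender_raw`, `label_map` and `gender_clean` are illustrative names. `Series.map` turns any label missing from the dictionary into `NaN`, which makes unexpected values easy to spot.

```python
import pandas as pd

# The three raw labels reported by pd.unique(data.Gender) above.
gender_raw = pd.Series(["2. Female Victims", "1. Male Victims", "3. Total"])

# Explicit mapping from raw label to clean label.
label_map = {
    "2. Female Victims": "Female",
    "1. Male Victims": "Male",
    "3. Total": "Total",
}

gender_clean = gender_raw.map(label_map)

# Any label not covered by label_map would show up here as NaN.
unmapped_count = int(gender_clean.isna().sum())
print(gender_clean.tolist(), unmapped_count)
```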

Keep only the rows whose **Gender** column contains either **Male** or **Female**.

In [9]:
index = data.Gender != 'Total'
data = data[index]
data

,State,Year,Gender,Total Crimes
0,Andaman & Nicobar Islands,2001,Female,6
1,Andhra Pradesh,2001,Female,607
2,Arunachal Pradesh,2001,Female,16
3,Assam,2001,Female,128
4,Bihar,2001,Female,366
...,...,...,...,...
671,Tamil Nadu,2010,Male,1279
672,Tripura,2010,Male,96
673,Uttar Pradesh,2010,Male,3597
674,Uttarakhand,2010,Male,132
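
Dropping the **Total** rows keeps later aggregations from counting the same victims twice. The same filter can also be written with `isin`, which states which values are kept rather than which one is excluded. The following is a minimal sketch on three rows taken from the outputs above; `sample` and `filtered` are illustrative names, and `reset_index(drop=True)` is optional and only renumbers the rows.

```python
import pandas as pd

# Three rows copied from the outputs above, one per Gender label.
sample = pd.DataFrame(
    {
        "State": ["Andhra Pradesh", "Tamil Nadu", "Tamil Nadu"],
        "Year": [2001, 2010, 2010],
        "Gender": ["Female", "Male", "Total"],
        "Total Crimes": [607, 1279, 1908],
    }
)

# Keep only the Male and Female rows, then renumber the index from 0.
filtered = sample[sample["Gender"].isin(["Male", "Female"])].reset_index(drop=True)
print(filtered)
```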


Save the clean data.

In [10]:
data.to_excel(os.path.join(os.getcwd(), "clean_murders_india.xlsx"), index=False)
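
Writing `.xlsx` files with `to_excel` requires an Excel writer engine such as `openpyxl` to be installed. If that is not available, CSV is a dependency-free alternative. The sketch below, which is an assumption and not part of the original workflow, writes two of the cleaned rows to a temporary CSV file and reads them back to confirm that nothing was lost; the file name and the variables `out_path`, `reloaded` and `round_trip_ok` are illustrative.

```python
import os
import tempfile

import pandas as pd

# Two cleaned rows copied from the final output above.
data = pd.DataFrame(
    {
        "State": ["Andhra Pradesh", "Tamil Nadu"],
        "Year": [2001, 2010],
        "Gender": ["Female", "Male"],
        "Total Crimes": [607, 1279],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "clean_murders_india.csv")
    # index=False leaves the row index out of the file, as in the to_excel call.
    data.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# The reloaded frame should have the same shape and column names.
round_trip_ok = (
    reloaded.shape == data.shape
    and list(reloaded.columns) == list(data.columns)
)
print(round_trip_ok)
```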

Author

Mangaljit Singh