# Dataset Link
https://www.kaggle.com/code/rohitgrewal/police-data-analysis

# Questions to analyze in this dataset

* Instruction (data cleaning) - Remove the column that contains only missing values.
* Question (filtering + value counts) - For speeding, were men or women stopped more often?
* Question (groupby) - Does gender affect who gets searched during a stop?
* Question (mapping + data-type casting) - What is the mean stop_duration?
* Question (groupby, describe) - Compare the age distributions for each violation.

In [27]:
# Import the dataset and required libraries

In [28]:
import pandas as pd
import matplotlib.pyplot as plt

In [29]:
df=pd.read_csv("Police Dataset.csv")

In [30]:
df.head()

Unnamed: 0,stop_date,stop_time,country_name,driver_gender,driver_age_raw,driver_age,driver_race,violation_raw,violation,search_conducted,search_type,stop_outcome,is_arrested,stop_duration,drugs_related_stop
0,1/2/2005,1:55,,M,1985.0,20.0,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False
1,1/18/2005,8:15,,M,1965.0,40.0,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False
2,1/23/2005,23:15,,M,1972.0,33.0,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False
3,2/20/2005,17:15,,M,1986.0,19.0,White,Call for Service,Other,False,,Arrest Driver,True,16-30 Min,False
4,3/14/2005,10:00,,F,1984.0,21.0,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False


In [31]:
df.shape

(65535, 15)

In [32]:
df.size

983025

In [33]:
df.ndim

2

In [34]:
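df.info  # without parentheses this shows the bound method and the frame's repr; call df.info() for dtypes and non-null counts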

<bound method DataFrame.info of        stop_date stop_time  country_name driver_gender  driver_age_raw  \
0       1/2/2005      1:55           NaN             M          1985.0   
1      1/18/2005      8:15           NaN             M          1965.0   
2      1/23/2005     23:15           NaN             M          1972.0   
3      2/20/2005     17:15           NaN             M          1986.0   
4      3/14/2005     10:00           NaN             F          1984.0   
...          ...       ...           ...           ...             ...   
65530  12/6/2012     17:54           NaN             F          1987.0   
65531  12/6/2012     22:22           NaN             M          1954.0   
65532  12/6/2012     23:20           NaN             M          1985.0   
65533  12/7/2012      0:23           NaN           NaN             NaN   
65534  12/7/2012      0:30           NaN             F          1985.0   

       driver_age driver_race                   violation_raw  violation  \
0  

In [35]:
df.describe()

Unnamed: 0,country_name,driver_age_raw,driver_age
count,0.0,61481.0,61228.0
mean,,1967.791106,34.148984
std,,121.050106,12.76071
min,,0.0,15.0
25%,,1965.0,23.0
50%,,1978.0,31.0
75%,,1985.0,43.0
max,,8801.0,88.0


# Data Cleaning

Check for missing values and remove the column that contains only missing values

In [36]:
df.isnull().sum()

stop_date                 0
stop_time                 0
country_name          65535
driver_gender          4061
driver_age_raw         4054
driver_age             4307
driver_race            4060
violation_raw          4060
violation              4060
search_conducted          0
search_type           63056
stop_outcome           4060
is_arrested            4060
stop_duration          4060
drugs_related_stop        0
dtype: int64

As the counts show, every value in the "country_name" column is missing (65535 of 65535 rows), so it adds nothing to the analysis and we can drop that column

In [37]:
df.drop(columns="country_name", inplace=True)

In [38]:
df.head(2)

Unnamed: 0,stop_date,stop_time,driver_gender,driver_age_raw,driver_age,driver_race,violation_raw,violation,search_conducted,search_type,stop_outcome,is_arrested,stop_duration,drugs_related_stop
0,1/2/2005,1:55,M,1985.0,20.0,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False
1,1/18/2005,8:15,M,1965.0,40.0,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False
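
The drop above names the column explicitly. As a minimal sketch of a more general alternative, `dropna` with `how="all"` removes any column whose values are all missing. The toy frame and its values below are made up for illustration and are not rows from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) with one column that is entirely missing.
toy = pd.DataFrame({
    "country_name": [np.nan, np.nan, np.nan],
    "driver_gender": ["M", None, "F"],
    "driver_age": [20.0, 40.0, np.nan],
})

# how="all" drops a column only when every value in it is missing,
# so partially filled columns such as driver_age are kept.
cleaned = toy.dropna(axis="columns", how="all")
print(list(cleaned.columns))
```

This leaves partially missing columns such as driver_gender untouched, which matches the instruction to remove only the column that contains nothing but missing values.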


* For speeding, were men or women stopped more often?

In [39]:
df[df.violation == "Speeding"].driver_gender.value_counts()

driver_gender
M    25517
F    11686
Name: count, dtype: int64
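
Raw counts can be easier to compare as shares. The sketch below uses `value_counts(normalize=True)` on a small made-up frame (the rows are illustrative assumptions, not dataset records) to show the same filter producing proportions instead of counts.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "violation": ["Speeding", "Speeding", "Speeding", "Equipment", "Speeding"],
    "driver_gender": ["M", "F", "M", "F", "M"],
})

# Keep only speeding stops, then ask for shares instead of raw counts.
speeding = toy[toy["violation"] == "Speeding"]
shares = speeding["driver_gender"].value_counts(normalize=True)
print(shares)
```

Applied to the counts reported above, 25517 / (25517 + 11686) is roughly 0.69, so men account for about 69% of the speeding stops that have a recorded gender.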

* Does gender affect who gets searched during a stop?

In [40]:
df.groupby("driver_gender").search_conducted.sum() # groupby forms one group per unique value in the column; sum() counts the True values in each group

driver_gender
F     366
M    2113
Name: search_conducted, dtype: int64
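
The sums above are counts of searches, and they depend on how many stops each group had. To compare groups of different sizes, one option is the search rate: the mean of a boolean column is the fraction of True values. A minimal sketch on made-up rows (not dataset records):

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "driver_gender": ["M", "M", "M", "M", "F", "F"],
    "search_conducted": [True, False, False, False, False, False],
})

# mean() of a boolean column gives the share of True values per group,
# i.e. the fraction of stops in each group that involved a search.
rates = toy.groupby("driver_gender")["search_conducted"].mean()
print(rates)
```

On the real frame, the same call with `.agg(["sum", "count", "mean"])` would show the number of searches, the number of stops, and the rate side by side.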

* How many times was a search conducted?

In [41]:
df.search_conducted.value_counts()

search_conducted
False    63056
True      2479
Name: count, dtype: int64

## Mapping + Data-type Casting

What is the mean stop_duration?

* Mapping - replace each category label in the column with a new numeric value
* Data-type casting - convert the column from one data type to another: string --> float
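
The two steps above can be sketched on a small series. The labels and the minute values (7, 24, 45) are the ones used later in this notebook; the four-element series itself is a made-up example. Labels that are missing from the dictionary come back as NaN.

```python
import pandas as pd

# Made-up series containing the three labels plus one stray value.
durations = pd.Series(["0-15 Min", "16-30 Min", "30+ Min", "2"])

# Representative minutes for each label.
minutes = {"0-15 Min": 7, "16-30 Min": 24, "30+ Min": 45}

# map() looks each value up in the dict; "2" has no key, so it becomes NaN.
mapped = durations.map(minutes)

# Explicit cast to float; NaN is preserved and skipped by mean().
as_float = mapped.astype(float)
print(as_float)
print(as_float.mean())
```

Because NaN forces a floating-point column, `map` already returns floats here; the explicit `astype(float)` simply makes the intended type visible.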

In [42]:
df.head(2)

Unnamed: 0,stop_date,stop_time,driver_gender,driver_age_raw,driver_age,driver_race,violation_raw,violation,search_conducted,search_type,stop_outcome,is_arrested,stop_duration,drugs_related_stop
0,1/2/2005,1:55,M,1985.0,20.0,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False
1,1/18/2005,8:15,M,1965.0,40.0,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False


* List the unique values in stop_duration and how often each occurs.

In [43]:
df.stop_duration.value_counts()

stop_duration
0-15 Min     47379
16-30 Min    11448
30+ Min       2647
2                1
Name: count, dtype: int64

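Now map each label to a representative number of minutes: 0-15 Min: 7, 16-30 Min: 24, 30+ Min: 45. Values without a mapping, such as the single stray "2", become NaN and are ignored by the mean.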

In [44]:
df["stop_duration"] = df["stop_duration"].map({'0-15 Min': 7, '16-30 Min': 24, '30+ Min': 45})

In [45]:
df.head()

Unnamed: 0,stop_date,stop_time,driver_gender,driver_age_raw,driver_age,driver_race,violation_raw,violation,search_conducted,search_type,stop_outcome,is_arrested,stop_duration,drugs_related_stop
0,1/2/2005,1:55,M,1985.0,20.0,White,Speeding,Speeding,False,,Citation,False,7.0,False
1,1/18/2005,8:15,M,1965.0,40.0,White,Speeding,Speeding,False,,Citation,False,7.0,False
2,1/23/2005,23:15,M,1972.0,33.0,White,Speeding,Speeding,False,,Citation,False,7.0,False
3,2/20/2005,17:15,M,1986.0,19.0,White,Call for Service,Other,False,,Arrest Driver,True,24.0,False
4,3/14/2005,10:00,F,1984.0,21.0,White,Speeding,Speeding,False,,Citation,False,7.0,False


* Get the average stop_duration 

In [46]:
df['stop_duration'].mean()

np.float64(11.802062660637016)
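
As a sanity check, the same mean can be recomputed by hand from the value counts shown earlier, weighting each mapped value by the number of stops with that label. The sketch assumes only those three counts and the 7/24/45 mapping.

```python
# Counts per label, taken from the value_counts() output above.
counts = {"0-15 Min": 47379, "16-30 Min": 11448, "30+ Min": 2647}

# Minutes assigned to each label by the mapping step.
minutes = {"0-15 Min": 7, "16-30 Min": 24, "30+ Min": 45}

# Weighted mean: total mapped minutes divided by the number of mapped stops.
total_minutes = sum(counts[label] * minutes[label] for label in counts)
total_stops = sum(counts.values())
weighted_mean = total_minutes / total_stops
print(total_minutes, total_stops, weighted_mean)
```

The total is 725520 minutes over 61474 stops, which agrees with the value returned by `mean()`. The result depends directly on the representative values chosen for each bin, so it is an estimate rather than a measured average.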

# Compare the age distributions for each violation

In [47]:
df.head(2)

Unnamed: 0,stop_date,stop_time,driver_gender,driver_age_raw,driver_age,driver_race,violation_raw,violation,search_conducted,search_type,stop_outcome,is_arrested,stop_duration,drugs_related_stop
0,1/2/2005,1:55,M,1985.0,20.0,White,Speeding,Speeding,False,,Citation,False,7.0,False
1,1/18/2005,8:15,M,1965.0,40.0,White,Speeding,Speeding,False,,Citation,False,7.0,False


In [48]:
df.describe()

Unnamed: 0,driver_age_raw,driver_age,stop_duration
count,61481.0,61228.0,61474.0
mean,1967.791106,34.148984,11.802063
std,121.050106,12.76071,9.640422
min,0.0,15.0,7.0
25%,1965.0,23.0,7.0
50%,1978.0,31.0,7.0
75%,1985.0,43.0,7.0
max,8801.0,88.0,45.0


In [49]:
df.groupby('violation').driver_age.describe()

violation,count,mean,std,min,25%,50%,75%,max
Equipment,6507.0,31.682957,11.380671,16.0,23.0,28.0,39.0,81.0
Moving violation,11876.0,36.736443,13.25835,15.0,25.0,35.0,47.0,86.0
Other,3477.0,40.362381,12.754423,16.0,30.0,41.0,50.0,86.0
Registration/plates,2240.0,32.656696,11.15078,16.0,24.0,30.0,40.0,74.0
Seat belt,3.0,30.333333,10.214369,23.0,24.5,26.0,34.0,42.0
Speeding,37120.0,33.262581,12.615781,15.0,23.0,30.0,42.0,88.0
