<a href="https://colab.research.google.com/github/Anuj-gitch/DataScience/blob/main/Analyzing_Traffic_police_Activity/Analyzing_Police_Activity_with_pandas_Exercise.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <font color = red> **Description:**

This course gives you the chance to apply your pandas knowledge by answering interesting questions about a real dataset! You will explore the Stanford Open Policing Project dataset and analyze the impact of gender on police behavior. During the course, you will gain more practice cleaning messy data, creating visualizations, combining and reshaping datasets, and manipulating time series data. Analyzing Police Activity with pandas will give you valuable experience analyzing a dataset from start to finish, preparing you for your data science career!



## <font color = red> **Prepare Tools**

In [None]:
# data Analysis Tools
import pandas as pd
import numpy as np

# Visualization Tools
import matplotlib.pyplot as plt
import seaborn as sns

## <font color = blue> **Preparing the data for analysis**

---



Before beginning your analysis, it is critical that you first examine and clean the dataset, to make working with it a more efficient process. In this chapter, you will practice fixing data types, handling missing values, and dropping columns and rows while learning about the Stanford Open Policing Project dataset.

### Examining the dataset

Throughout this course, you'll be analyzing a dataset of traffic stops in Rhode Island that was collected by the Stanford Open Policing Project.

Before beginning your analysis, it's important that you familiarize yourself with the dataset. In this exercise, you'll read the dataset into pandas, examine the first few rows, and then count the number of missing values.

* Import pandas using the alias pd.
* Read the file police.csv into a DataFrame named ri.
* Examine the first 5 rows of the DataFrame (known as the "head").
* Count the number of missing values in each column: Use .isnull() to check which DataFrame elements are missing, and then take the .sum() to count the number of True values in each column.

In [None]:
# Import the pandas library as pd
import pandas as pd

# Read 'police.csv' into a DataFrame named ri
ri = pd.read_csv('https://assets.datacamp.com/production/repositories/1497/datasets/62bd9feef451860db02d26553613a299721882e8/police.csv')

# Examine the head of the DataFrame
print(ri.head())

# Count the number of missing values in each column
print(ri.isnull().sum())

  state   stop_date stop_time  ...  stop_duration drugs_related_stop district
0    RI  2005-01-04     12:55  ...       0-15 Min              False  Zone X4
1    RI  2005-01-23     23:15  ...       0-15 Min              False  Zone K3
2    RI  2005-02-17     04:15  ...       0-15 Min              False  Zone X4
3    RI  2005-02-20     17:15  ...      16-30 Min              False  Zone X1
4    RI  2005-02-24     01:20  ...       0-15 Min              False  Zone X3

[5 rows x 15 columns]
state                     0
stop_date                 0
stop_time                 0
county_name           91741
driver_gender          5205
driver_race            5202
violation_raw          5202
violation              5202
search_conducted          0
search_type           88434
stop_outcome           5202
is_arrested            5202
stop_duration          5202
drugs_related_stop        0
district                  0
dtype: int64


Nine of the 15 columns have at least some missing values, and county_name is missing in every row. We'll figure out how to handle these values in the next exercise!
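Raw counts are easier to judge when set against the number of rows. The sketch below is one way to do that, assuming a small made-up DataFrame named `toy` (hypothetical, not the real data): because `.isnull()` returns Booleans, `.mean()` gives the share of missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a few columns of the real dataset
toy = pd.DataFrame({
    "stop_date": ["2005-01-04", "2005-01-23", "2005-02-17", "2005-02-20"],
    "county_name": [np.nan, np.nan, np.nan, np.nan],
    "driver_gender": ["M", np.nan, "F", "M"],
})

# Number of missing values in each column
missing_counts = toy.isnull().sum()

# Share of missing values in each column
# (the mean of a Boolean column is the fraction of True values)
missing_share = toy.isnull().mean()

print(missing_counts)
print(missing_share)
```

The same two calls should work unchanged on `ri`.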

### Dropping columns

Often, a DataFrame will contain columns that are not useful to your analysis. Such columns should be dropped from the DataFrame, to make it easier for you to focus on the remaining columns.

In this exercise, you'll drop the county_name column because it only contains missing values, and you'll drop the state column because all of the traffic stops took place in one state (Rhode Island). Thus, these columns can be dropped because they contain no useful information. The number of missing values in each column has been printed to the console for you.

1. Examine the DataFrame's .shape to find out the number of rows and columns.
2. Drop both the county_name and state columns by passing the column names to the .drop() method as a list of strings.
3. Examine the .shape again to verify that there are now two fewer columns.

In [None]:
# Examine the shape of the DataFrame
print(ri.shape)

# Drop the 'county_name' and 'state' columns
ri.drop(['county_name','state'], axis='columns', inplace=True)

# Examine the shape of the DataFrame (again)
print(ri.shape)

(91741, 15)
(91741, 13)


Great job! We'll continue to remove unnecessary data from the DataFrame in the next exercise.
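The two columns above were picked by reading the missing-value counts. As a rough sketch of how such columns could be found programmatically, the example below uses a hypothetical `toy` DataFrame; the names `all_missing`, `constant`, and `trimmed` are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one constant column, one empty column, one useful column
toy = pd.DataFrame({
    "state": ["RI", "RI", "RI"],
    "county_name": [np.nan, np.nan, np.nan],
    "violation": ["Speeding", "Equipment", "Speeding"],
})

# Columns in which every value is missing
all_missing = toy.columns[toy.isnull().all()].tolist()

# Columns that hold exactly one distinct non-missing value
constant = [col for col in toy.columns if toy[col].nunique(dropna=True) == 1]

# Without inplace=True, drop() returns a new DataFrame and leaves toy unchanged
trimmed = toy.drop(columns=all_missing + constant)

print(all_missing, constant)
print(trimmed.shape)
```

Returning a new DataFrame instead of using `inplace=True` is a design choice: it keeps the original available if a cell is re-run.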

### Dropping rows
When you know that a specific column will be critical to your analysis, and only a small fraction of rows are missing a value in that column, it often makes sense to remove those rows from the dataset.

During this course, the driver_gender column will be critical to many of your analyses. Because only a small fraction of rows (5,205 of 91,741, about 5.7%) are missing driver_gender, we'll drop those rows from the dataset.

1. Count the number of missing values in each column.
2. Drop all rows that are missing driver_gender by passing the column name to the subset parameter of .dropna().
3. Count the number of missing values in each column again, to verify that none of the remaining rows are missing driver_gender.
4. Examine the DataFrame's .shape to see how many rows and columns remain.

In [None]:
# Count the number of missing values in each column
print(ri.isnull().sum())
print('-----------------------')
# Drop all rows that are missing 'driver_gender'
ri.dropna(subset=['driver_gender'], inplace=True)

# Count the number of missing values in each column (again)
print(ri.isnull().sum())
print('--------------------------')
# Examine the shape of the DataFrame
print(ri.shape)

stop_date                 0
stop_time                 0
driver_gender          5205
driver_race            5202
violation_raw          5202
violation              5202
search_conducted          0
search_type           88434
stop_outcome           5202
is_arrested            5202
stop_duration          5202
drugs_related_stop        0
district                  0
dtype: int64
-----------------------
stop_date                 0
stop_time                 0
driver_gender             0
driver_race               0
violation_raw             0
violation                 0
search_conducted          0
search_type           83229
stop_outcome              0
is_arrested               0
stop_duration             0
drugs_related_stop        0
district                  0
dtype: int64
--------------------------
(86536, 13)
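The `subset` parameter matters here because search_type is missing for most stops. A minimal sketch on hypothetical toy data (the values are made up) contrasts `.dropna(subset=...)` with a plain `.dropna()`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: search_type is mostly missing, as in the real dataset
toy = pd.DataFrame({
    "driver_gender": ["M", np.nan, "F", "M"],
    "search_type": [np.nan, np.nan, "Probable Cause", np.nan],
})

# Drop only the rows that are missing driver_gender
by_gender = toy.dropna(subset=["driver_gender"])

# Drop every row that has a missing value in any column
any_missing = toy.dropna()

print(by_gender.shape)
print(any_missing.shape)
```

Calling `.dropna()` without `subset` would discard most of the toy rows, because it also drops rows that are only missing search_type.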


### Finding an incorrect data type
The dtypes attribute of the ri DataFrame has been printed for you. Your task is to explore the ri DataFrame in the IPython Shell to determine which column's data type should be changed.

Answer: is_arrested should have a data type of bool, because it contains only True and False values but is currently stored as object.

In [None]:
ri.dtypes

stop_date             object
stop_time             object
driver_gender         object
driver_race           object
violation_raw         object
violation             object
search_conducted        bool
search_type           object
stop_outcome          object
is_arrested           object
stop_duration         object
drugs_related_stop      bool
district              object
dtype: object

### Fixing a data type
We saw in the previous exercise that the is_arrested column currently has the object data type. In this exercise, we'll change the data type to bool, which is the most suitable type for a column containing True and False values.

Fixing the data type will enable us to use mathematical operations on the is_arrested column that would not be possible otherwise.

* Examine the head of the is_arrested column to verify that it contains True and False values and to check the column's data type.
* Use the .astype() method to convert is_arrested to a bool column.
* Check the new data type of is_arrested to confirm that it is now a bool column.

In [None]:
# Examine the head of the 'is_arrested' column
print(ri.is_arrested.head())

# Change the data type of 'is_arrested' to 'bool'
ri['is_arrested'] = ri.is_arrested.astype('bool')

# Check the data type of 'is_arrested' 
print(ri.is_arrested.dtypes)

0    False
1    False
2    False
3     True
4    False
Name: is_arrested, dtype: object
bool
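As an illustration of the Boolean math that the bool type enables, the sketch below uses hypothetical Series (`arrested`, `with_gap`) rather than the real column. It also shows a caveat worth keeping in mind: `.astype('bool')` turns a missing value into True, so the conversion is only safe once the missing rows have been dropped, as was done earlier.

```python
import numpy as np
import pandas as pd

# Hypothetical object column holding True/False values
arrested = pd.Series([False, False, True, False], dtype="object")
as_bool = arrested.astype("bool")

# Summing a bool column counts the True values; the mean is their share
total = as_bool.sum()
rate = as_bool.mean()
print(total, rate)

# Caveat: a missing value (NaN) is truthy, so it becomes True after conversion
with_gap = pd.Series([True, np.nan, False], dtype="object")
converted = with_gap.astype("bool")
print(converted.tolist())
```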


### Combining object columns
Currently, the date and time of each traffic stop are stored in separate object columns: stop_date and stop_time.

In this exercise, you'll combine these two columns into a single column, and then convert it to datetime format. This will enable convenient date-based attributes that we'll use later in the course.

* Use a string method to concatenate stop_date and stop_time (separated by a space), and store the result in combined.
* Convert combined to datetime format, and store the result in a new column named stop_datetime.
* Examine the DataFrame .dtypes to confirm that stop_datetime is a datetime column.

In [None]:
ri.head()

|  | stop_date | stop_time | driver_gender | driver_race | violation_raw | violation | search_conducted | search_type | stop_outcome | is_arrested | stop_duration | drugs_related_stop | district |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2005-01-04 | 12:55 | M | White | Equipment/Inspection Violation | Equipment | False | NaN | Citation | False | 0-15 Min | False | Zone X4 |
| 1 | 2005-01-23 | 23:15 | M | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False | Zone K3 |
| 2 | 2005-02-17 | 04:15 | M | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False | Zone X4 |
| 3 | 2005-02-20 | 17:15 | M | White | Call for Service | Other | False | NaN | Arrest Driver | True | 16-30 Min | False | Zone X1 |
| 4 | 2005-02-24 | 01:20 | F | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False | Zone X3 |


In [None]:
# Concatenate 'stop_date' and 'stop_time' (separated by a space)
combined = ri.stop_date.str.cat(ri.stop_time, sep=' ')

# Convert 'combined' to datetime format
ri['stop_datetime'] = pd.to_datetime(combined)

# Examine the data types of the DataFrame
print(ri.dtypes)

stop_date                     object
stop_time                     object
driver_gender                 object
driver_race                   object
violation_raw                 object
violation                     object
search_conducted                bool
search_type                   object
stop_outcome                  object
is_arrested                     bool
stop_duration                 object
drugs_related_stop              bool
district                      object
stop_datetime         datetime64[ns]
dtype: object
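When the layout of the combined string is known, the format can be passed to `pd.to_datetime()` explicitly instead of being inferred. This is a minimal sketch on a hypothetical two-row `toy` DataFrame; the format string assumes dates like 2005-01-04 and times like 12:55, as seen in the head of the data.

```python
import pandas as pd

# Hypothetical toy data with the same layout as stop_date and stop_time
toy = pd.DataFrame({
    "stop_date": ["2005-01-04", "2005-01-23"],
    "stop_time": ["12:55", "23:15"],
})

# Concatenate the two object columns, separated by a space
combined = toy.stop_date.str.cat(toy.stop_time, sep=" ")

# Parse with an explicit format rather than relying on inference
toy["stop_datetime"] = pd.to_datetime(combined, format="%Y-%m-%d %H:%M")

# Datetime columns expose date-based attributes through the .dt accessor
print(toy.stop_datetime.dt.hour)
```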


### Setting the index
The last step that you'll take in this chapter is to set the stop_datetime column as the DataFrame's index. By replacing the default index with a DatetimeIndex, you'll make it easier to analyze the dataset by date and time, which will come in handy later in the course!

* Set stop_datetime as the DataFrame index.
* Examine the index to verify that it is a DatetimeIndex.
* Examine the DataFrame columns to confirm that stop_datetime is no longer one of the columns.

In [None]:
# Set 'stop_datetime' as the index
ri.set_index('stop_datetime', inplace=True)

# Examine the index
print(ri.index)

# Examine the columns
print(ri.columns)

DatetimeIndex(['2005-01-04 12:55:00', '2005-01-23 23:15:00',
               '2005-02-17 04:15:00', '2005-02-20 17:15:00',
               '2005-02-24 01:20:00', '2005-03-14 10:00:00',
               '2005-03-29 21:55:00', '2005-04-04 21:25:00',
               '2005-07-14 11:20:00', '2005-07-14 19:55:00',
               ...
               '2015-12-31 13:23:00', '2015-12-31 18:59:00',
               '2015-12-31 19:13:00', '2015-12-31 20:20:00',
               '2015-12-31 20:50:00', '2015-12-31 21:21:00',
               '2015-12-31 21:59:00', '2015-12-31 22:04:00',
               '2015-12-31 22:09:00', '2015-12-31 22:47:00'],
              dtype='datetime64[ns]', name='stop_datetime', length=86536, freq=None)
Index(['stop_date', 'stop_time', 'driver_gender', 'driver_race',
       'violation_raw', 'violation', 'search_conducted', 'search_type',
       'stop_outcome', 'is_arrested', 'stop_duration', 'drugs_related_stop',
       'district'],
      dtype='object')
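To give a sense of what a DatetimeIndex makes easier, the sketch below builds a hypothetical four-row `toy` DataFrame and reads the hour from the index, selects one month with a partial date string, and groups by month. The data and variable names are illustrative only.

```python
import pandas as pd

# Hypothetical toy data indexed by stop date and time
toy = pd.DataFrame(
    {"violation": ["Equipment", "Speeding", "Speeding", "Other"]},
    index=pd.to_datetime([
        "2005-01-04 12:55", "2005-01-23 23:15",
        "2005-02-17 04:15", "2005-02-20 17:15",
    ]),
)
toy.index.name = "stop_datetime"

# Date-based attributes are available directly on the index
hours = toy.index.hour

# A partial date string selects every row in that month
january = toy.loc["2005-01"]

# The index attributes can also be used as grouping keys
per_month = toy.groupby(toy.index.month).violation.count()

print(list(hours))
print(january)
print(per_month)
```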


## <font color = blue> **Exploring the relationship between gender and policing**

Does the gender of a driver have an impact on police behavior during a traffic stop? In this chapter, you will explore that question while practicing filtering, grouping, method chaining, Boolean math, string methods, and more!

### Examining traffic violations
Before comparing the violations being committed by each gender, you should examine the violations committed by all drivers to get a baseline understanding of the data.

In this exercise, you'll count the unique values in the violation column, and then separately express those counts as proportions.

* Count the unique values in the violation column of the ri DataFrame, to see what violations are being committed by all drivers.

* Express the violation counts as proportions of the total.

In [None]:
# Count the unique values in 'violation'
print(ri.violation.value_counts())
display('---------------------------------------------------------------------')
# Express the counts as proportions
print(ri.violation.value_counts(normalize = True))

Speeding               48423
Moving violation       16224
Equipment              10921
Other                   4409
Registration/plates     3703
Seat belt               2856
Name: violation, dtype: int64


'---------------------------------------------------------------------'

Speeding               0.559571
Moving violation       0.187483
Equipment              0.126202
Other                  0.050950
Registration/plates    0.042791
Seat belt              0.033004
Name: violation, dtype: float64


**Interesting! More than half of all violations are for speeding, followed by other moving violations and equipment violations.**
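As a quick check on what `normalize=True` does, the sketch below uses a hypothetical `violations` Series and shows that the normalized result matches dividing each count by the total.

```python
import pandas as pd

# Hypothetical toy column of violations
violations = pd.Series(["Speeding", "Speeding", "Speeding", "Equipment", "Other"])

# Raw counts of each unique value
counts = violations.value_counts()

# Proportions computed by pandas
proportions = violations.value_counts(normalize=True)

# The same proportions computed by hand
manual = counts / counts.sum()

print(proportions)
print(manual)
```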

### Comparing violations by gender
The question we're trying to answer is whether male and female drivers tend to commit different types of traffic violations.

In this exercise, you'll first create a DataFrame for each gender, and then analyze the violations in each DataFrame separately.

* Create a DataFrame, `female`, that only contains rows in which `driver_gender` is `'F'`.
* Create a DataFrame, `male`, that only contains rows in which `driver_gender` is `'M'`.
* Count the violations committed by female drivers and express them as proportions.
* Count the violations committed by male drivers and express them as proportions.

In [None]:
ri.driver_gender

stop_datetime
2005-01-04 12:55:00    M
2005-01-23 23:15:00    M
2005-02-17 04:15:00    M
2005-02-20 17:15:00    M
2005-02-24 01:20:00    F
                      ..
2015-12-31 21:21:00    F
2015-12-31 21:59:00    F
2015-12-31 22:04:00    M
2015-12-31 22:09:00    F
2015-12-31 22:47:00    M
Name: driver_gender, Length: 86536, dtype: object

In [None]:
# Create a DataFrame of female drivers
female = ri[ri.driver_gender == 'F']

# Create a DataFrame of male drivers
male = ri[ri.driver_gender == 'M']
print('--------------Female-----------------------------------')
# Compute the violations by female drivers (as proportions)
print(female.violation.value_counts(normalize = True))

print('--------------Male-----------------------------------')

# Compute the violations by male drivers (as proportions)
print(male.violation.value_counts(normalize = True))

--------------Female-----------------------------------
Speeding               0.658114
Moving violation       0.138218
Equipment              0.105199
Registration/plates    0.044418
Other                  0.029738
Seat belt              0.024312
Name: violation, dtype: float64
--------------Male-----------------------------------
Speeding               0.522243
Moving violation       0.206144
Equipment              0.134158
Other                  0.058985
Registration/plates    0.042175
Seat belt              0.036296
Name: violation, dtype: float64


About two-thirds of female traffic stops are for speeding, whereas stops of males are more balanced among the six categories. This doesn't mean that females speed more often than males, however, since we didn't take into account the number of stops or drivers.
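The same comparison can be made without creating a separate DataFrame for each gender. The sketch below is one possible alternative, assuming a hypothetical eight-row `toy` DataFrame: grouping by driver_gender and normalizing within each group gives the proportions side by side, while the raw counts illustrate the caveat above.

```python
import pandas as pd

# Hypothetical toy data: 3 stops of female drivers, 5 stops of male drivers
toy = pd.DataFrame({
    "driver_gender": ["F", "F", "F", "M", "M", "M", "M", "M"],
    "violation": ["Speeding", "Speeding", "Equipment",
                  "Speeding", "Speeding", "Equipment", "Other", "Other"],
})

# Proportion of each violation within each gender, one row per gender
by_gender = (
    toy.groupby("driver_gender")
       .violation.value_counts(normalize=True)
       .unstack(fill_value=0)
)

# Raw counts for the same groups
raw_counts = (
    toy.groupby("driver_gender")
       .violation.value_counts()
       .unstack(fill_value=0)
)

print(by_gender)
print(raw_counts)
```

In this toy data the speeding proportion is higher for female drivers even though both groups have the same number of speeding stops, which is why proportions alone cannot show who speeds more often.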