___

<p style="text-align: center;"><img src="https://docs.google.com/uc?id=1lY0Uj5R04yMY3-ZppPWxqCr5pvBLYPnV" class="img-fluid" alt="CLRSWY"></p>

___

<h1><p style="text-align: center;">Data Analysis with Python <br>Project - 1</p><h1> - Traffic Police Stops <img src="https://docs.google.com/uc?id=17CPCwi3_VvzcS87TOsh4_U8eExOhL6Ki" class="img-fluid" alt="CLRSWY" width="200" height="100"> 

Before beginning your analysis, it is critical that you first examine and clean the dataset, to make working with it a more efficient process. You will practice fixing data types, handling missing values, and dropping columns and rows while learning about the Stanford Open Policing Project dataset.

***

## Examining the dataset

You'll be analyzing a dataset of traffic stops in Rhode Island that was collected by the Stanford Open Policing Project.

Before beginning your analysis, it's important that you familiarize yourself with the dataset. You'll read the dataset into pandas, examine the first few rows, and then count the number of missing values.

In [2]:
import pandas as pd

In [3]:
ri = pd.read_csv('police.csv') # nrow parameter gets the specified number of rows in the file

  interactivity=interactivity, compiler=compiler, result=result)


In [4]:
ri = pd.DataFrame(ri)

In [5]:
ri.head()

Unnamed: 0,id,state,stop_date,stop_time,location_raw,county_name,county_fips,fine_grained_location,police_department,driver_gender,...,search_conducted,search_type_raw,search_type,contraband_found,stop_outcome,is_arrested,stop_duration,out_of_state,drugs_related_stop,district
0,RI-2005-00001,RI,2005-01-02,01:55,Zone K1,,,,600,M,...,False,,,False,Citation,False,0-15 Min,False,False,Zone K1
1,RI-2005-00002,RI,2005-01-02,20:30,Zone X4,,,,500,M,...,False,,,False,Citation,False,16-30 Min,False,False,Zone X4
2,RI-2005-00003,RI,2005-01-04,11:30,Zone X1,,,,0,,...,False,,,False,,,,,False,Zone X1
3,RI-2005-00004,RI,2005-01-04,12:55,Zone X4,,,,500,M,...,False,,,False,Citation,False,0-15 Min,False,False,Zone X4
4,RI-2005-00005,RI,2005-01-06,01:30,Zone X4,,,,500,M,...,False,,,False,Citation,False,0-15 Min,False,False,Zone X4


In [6]:
ri = ri[0:50001] # I take a sample of 50000 rows of data

In [7]:
ri.isnull().sum()

id                           0
state                        0
stop_date                    0
stop_time                    0
location_raw                 0
county_name              50001
county_fips              50001
fine_grained_location    50001
police_department            0
driver_gender             1991
driver_age_raw            1972
driver_age                2209
driver_race_raw           1989
driver_race               1989
violation_raw             1989
violation                 1989
search_conducted             0
search_type_raw          47989
search_type              47989
contraband_found             0
stop_outcome              1989
is_arrested               1989
stop_duration             1989
out_of_state              2204
drugs_related_stop           0
district                     0
dtype: int64

***

In [13]:
ri.isnull().sum()/ri.isnull().shape[0]*100

id                         0.000000
state                      0.000000
stop_date                  0.000000
stop_time                  0.000000
location_raw               0.000000
county_name              100.000000
county_fips              100.000000
fine_grained_location    100.000000
police_department          0.000000
driver_gender              3.981920
driver_age_raw             3.943921
driver_age                 4.417912
driver_race_raw            3.977920
driver_race                3.977920
violation_raw              3.977920
violation                  3.977920
search_conducted           0.000000
search_type_raw           95.976080
search_type               95.976080
contraband_found           0.000000
stop_outcome               3.977920
is_arrested                3.977920
stop_duration              3.977920
out_of_state               4.407912
drugs_related_stop         0.000000
district                   0.000000
dtype: float64

## Dropping columns

Often, a DataFrame will contain columns that are not useful to your analysis. Such columns should be dropped from the ``DataFrame``, to make it easier for you to focus on the remaining columns.

You'll drop the ``county_name`` column because it only contains missing values, and you'll drop the ``state`` column because all of the traffic stops took place in one state (Rhode Island). Thus, these columns can be dropped because they contain no useful information.

In [14]:
ri.shape

(50001, 26)

In [17]:
drop_columns = ['state', 'county_name', 'county_fips', 'fine_grained_location']

In [19]:
ri.shape

(50001, 22)

***

## Dropping rows

When you know that a specific column will be critical to your analysis, and only a small fraction of rows are missing a value in that column, it often makes sense to remove those rows from the dataset.

During this course, the ``driver_gender`` column will be critical to many of your analyses. Because only a small fraction of rows are missing ``driver_gender``, we'll drop those rows from the dataset.

In [21]:
ri.isnull().sum()

id                        0
stop_date                 0
stop_time                 0
location_raw              0
police_department         0
driver_gender          1991
driver_age_raw         1972
driver_age             2209
driver_race_raw        1989
driver_race            1989
violation_raw          1989
violation              1989
search_conducted          0
search_type_raw       47989
search_type           47989
contraband_found          0
stop_outcome           1989
is_arrested            1989
stop_duration          1989
out_of_state           2204
drugs_related_stop        0
district                  0
dtype: int64

In [22]:
ri.dropna(subset=["driver_gender"], inplace=True)

In [23]:
ri.isnull().sum()

id                        0
stop_date                 0
stop_time                 0
location_raw              0
police_department         0
driver_gender             0
driver_age_raw            0
driver_age              232
driver_race_raw           0
driver_race               0
violation_raw             0
violation                 0
search_conducted          0
search_type_raw       45998
search_type           45998
contraband_found          0
stop_outcome              0
is_arrested               0
stop_duration             0
out_of_state            215
drugs_related_stop        0
district                  0
dtype: int64

In [24]:
ri.shape

(48010, 22)

***

## Fixing a data type

We know that the ``is_arrested`` column currently has the ``object`` data type. In this exercise, we'll change the data type to ``bool``, which is the most suitable type for a column containing ``True`` and ``False`` values.

Fixing the data type will enable us to use mathematical operations on the ``is_arrested`` column that would not be possible otherwise.

In [25]:
ri['is_arrested']

0        False
1        False
3        False
4        False
5        False
         ...  
49995    False
49996    False
49997    False
49998    False
49999    False
Name: is_arrested, Length: 48010, dtype: object

In [26]:
ri['is_arrested'].dtypes

dtype('O')

In [91]:
ri['is_arrested'] = ri['is_arrested'].astype('bool')

In [92]:
ri.dtypes

id                        object
stop_date                 object
stop_time                 object
location_raw              object
county_fips              float64
fine_grained_location    float64
police_department         object
driver_gender             object
driver_age_raw           float64
driver_age               float64
driver_race_raw           object
driver_race               object
violation_raw             object
violation                 object
search_conducted          object
search_type_raw           object
search_type               object
contraband_found            bool
stop_outcome              object
is_arrested                 bool
stop_duration             object
out_of_state              object
drugs_related_stop          bool
district                  object
combined                  object
dtype: object

***

## Combining object columns

Currently, the date and time of each traffic stop are stored in separate object columns: ``stop_date`` and ``stop_time``.

You'll combine these two columns into a single column, and then convert it to ``datetime`` format. This will enable convenient date-based attributes that we'll use later in the course.

In [27]:
ri['stop_time'].head()

0    01:55
1    20:30
3    12:55
4    01:30
5    08:05
Name: stop_time, dtype: object

In [28]:
# combined = ri.stop_date.str.cat(ri.stop_time, sep=" ")  --> alternative

In [29]:
ri['combined'] = ri['stop_date'] + " " + ri['stop_time']

In [30]:
ri['stop_datetime'] = pd.to_datetime(ri['combined'])

In [31]:
ri

Unnamed: 0,id,stop_date,stop_time,location_raw,police_department,driver_gender,driver_age_raw,driver_age,driver_race_raw,driver_race,...,search_type,contraband_found,stop_outcome,is_arrested,stop_duration,out_of_state,drugs_related_stop,district,combined,stop_datetime
0,RI-2005-00001,2005-01-02,01:55,Zone K1,600,M,1985.0,20.0,W,White,...,,False,Citation,False,0-15 Min,False,False,Zone K1,2005-01-02 01:55,2005-01-02 01:55:00
1,RI-2005-00002,2005-01-02,20:30,Zone X4,500,M,1987.0,18.0,W,White,...,,False,Citation,False,16-30 Min,False,False,Zone X4,2005-01-02 20:30,2005-01-02 20:30:00
3,RI-2005-00004,2005-01-04,12:55,Zone X4,500,M,1986.0,19.0,W,White,...,,False,Citation,False,0-15 Min,False,False,Zone X4,2005-01-04 12:55,2005-01-04 12:55:00
4,RI-2005-00005,2005-01-06,01:30,Zone X4,500,M,1978.0,27.0,B,Black,...,,False,Citation,False,0-15 Min,False,False,Zone X4,2005-01-06 01:30,2005-01-06 01:30:00
5,RI-2005-00006,2005-01-12,08:05,Zone X1,0,M,1973.0,32.0,B,Black,...,,False,Citation,False,30+ Min,True,False,Zone X1,2005-01-12 08:05,2005-01-12 08:05:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49995,RI-2006-35917,2006-08-08,22:45,Zone K3,300,M,1973.0,33.0,B,Black,...,,False,Citation,False,0-15 Min,False,False,Zone K3,2006-08-08 22:45,2006-08-08 22:45:00
49996,RI-2006-35918,2006-08-08,22:45,Zone K3,300,F,1971.0,35.0,B,Black,...,,False,Citation,False,0-15 Min,True,False,Zone K3,2006-08-08 22:45,2006-08-08 22:45:00
49997,RI-2006-35919,2006-08-08,22:53,Zone X4,500,M,1952.0,54.0,W,White,...,,False,Citation,False,16-30 Min,True,False,Zone X4,2006-08-08 22:53,2006-08-08 22:53:00
49998,RI-2006-35920,2006-08-08,23:00,Zone K1,600,F,1982.0,24.0,W,White,...,,False,Citation,False,0-15 Min,False,False,Zone K1,2006-08-08 23:00,2006-08-08 23:00:00


In [114]:
ri.drop(columns='combined')

Unnamed: 0,id,stop_date,stop_time,location_raw,county_fips,fine_grained_location,police_department,driver_gender,driver_age_raw,driver_age,...,search_type_raw,search_type,contraband_found,stop_outcome,is_arrested,stop_duration,out_of_state,drugs_related_stop,district,stop_datetime
0,RI-2005-00001,2005-01-02,01:55,Zone K1,,,600,M,1985.0,20.0,...,,,False,Citation,True,0-15 Min,False,False,Zone K1,2005-01-02 01:55:00
1,RI-2005-00002,2005-01-02,20:30,Zone X4,,,500,M,1987.0,18.0,...,,,False,Citation,True,16-30 Min,False,False,Zone X4,2005-01-02 20:30:00
3,RI-2005-00004,2005-01-04,12:55,Zone X4,,,500,M,1986.0,19.0,...,,,False,Citation,True,0-15 Min,False,False,Zone X4,2005-01-04 12:55:00
4,RI-2005-00005,2005-01-06,01:30,Zone X4,,,500,M,1978.0,27.0,...,,,False,Citation,True,0-15 Min,False,False,Zone X4,2005-01-06 01:30:00
5,RI-2005-00006,2005-01-12,08:05,Zone X1,,,0,M,1973.0,32.0,...,,,False,Citation,True,30+ Min,True,False,Zone X1,2005-01-12 08:05:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49995,RI-2006-35917,2006-08-08,22:45,Zone K3,,,300,M,1973.0,33.0,...,,,False,Citation,True,0-15 Min,False,False,Zone K3,2006-08-08 22:45:00
49996,RI-2006-35918,2006-08-08,22:45,Zone K3,,,300,F,1971.0,35.0,...,,,False,Citation,True,0-15 Min,True,False,Zone K3,2006-08-08 22:45:00
49997,RI-2006-35919,2006-08-08,22:53,Zone X4,,,500,M,1952.0,54.0,...,,,False,Citation,True,16-30 Min,True,False,Zone X4,2006-08-08 22:53:00
49998,RI-2006-35920,2006-08-08,23:00,Zone K1,,,600,F,1982.0,24.0,...,,,False,Citation,True,0-15 Min,False,False,Zone K1,2006-08-08 23:00:00


The last step that you'll take in this chapter is to set the ``stop_datetime`` column as the ``DataFrame``'s index. By replacing the default index with a ``DatetimeIndex``, you'll make it easier to analyze the dataset by date and time, which will come in handy later in the course.

In [117]:
ri.set_index(ri['stop_datetime'], inplace=True)

In [119]:
ri.head()

Unnamed: 0_level_0,id,stop_date,stop_time,location_raw,county_fips,fine_grained_location,police_department,driver_gender,driver_age_raw,driver_age,...,search_type,contraband_found,stop_outcome,is_arrested,stop_duration,out_of_state,drugs_related_stop,district,combined,stop_datetime
stop_datetime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2005-01-02 01:55:00,RI-2005-00001,2005-01-02,01:55,Zone K1,,,600,M,1985.0,20.0,...,,False,Citation,True,0-15 Min,False,False,Zone K1,2005-01-02 01:55,2005-01-02 01:55:00
2005-01-02 20:30:00,RI-2005-00002,2005-01-02,20:30,Zone X4,,,500,M,1987.0,18.0,...,,False,Citation,True,16-30 Min,False,False,Zone X4,2005-01-02 20:30,2005-01-02 20:30:00
2005-01-04 12:55:00,RI-2005-00004,2005-01-04,12:55,Zone X4,,,500,M,1986.0,19.0,...,,False,Citation,True,0-15 Min,False,False,Zone X4,2005-01-04 12:55,2005-01-04 12:55:00
2005-01-06 01:30:00,RI-2005-00005,2005-01-06,01:30,Zone X4,,,500,M,1978.0,27.0,...,,False,Citation,True,0-15 Min,False,False,Zone X4,2005-01-06 01:30,2005-01-06 01:30:00
2005-01-12 08:05:00,RI-2005-00006,2005-01-12,08:05,Zone X1,,,0,M,1973.0,32.0,...,,False,Citation,True,30+ Min,True,False,Zone X1,2005-01-12 08:05,2005-01-12 08:05:00


In [121]:
ri.index.values

array(['2005-01-02T01:55:00.000000000', '2005-01-02T20:30:00.000000000',
       '2005-01-04T12:55:00.000000000', ...,
       '2006-08-08T22:53:00.000000000', '2006-08-08T23:00:00.000000000',
       '2006-08-08T23:00:00.000000000'], dtype='datetime64[ns]')

In [124]:
ri.drop(['stop_datetime', 'combined'], axis=1)

Unnamed: 0_level_0,id,stop_date,stop_time,location_raw,county_fips,fine_grained_location,police_department,driver_gender,driver_age_raw,driver_age,...,search_conducted,search_type_raw,search_type,contraband_found,stop_outcome,is_arrested,stop_duration,out_of_state,drugs_related_stop,district
stop_datetime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2005-01-02 01:55:00,RI-2005-00001,2005-01-02,01:55,Zone K1,,,600,M,1985.0,20.0,...,False,,,False,Citation,True,0-15 Min,False,False,Zone K1
2005-01-02 20:30:00,RI-2005-00002,2005-01-02,20:30,Zone X4,,,500,M,1987.0,18.0,...,False,,,False,Citation,True,16-30 Min,False,False,Zone X4
2005-01-04 12:55:00,RI-2005-00004,2005-01-04,12:55,Zone X4,,,500,M,1986.0,19.0,...,False,,,False,Citation,True,0-15 Min,False,False,Zone X4
2005-01-06 01:30:00,RI-2005-00005,2005-01-06,01:30,Zone X4,,,500,M,1978.0,27.0,...,False,,,False,Citation,True,0-15 Min,False,False,Zone X4
2005-01-12 08:05:00,RI-2005-00006,2005-01-12,08:05,Zone X1,,,0,M,1973.0,32.0,...,False,,,False,Citation,True,30+ Min,True,False,Zone X1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2006-08-08 22:45:00,RI-2006-35917,2006-08-08,22:45,Zone K3,,,300,M,1973.0,33.0,...,False,,,False,Citation,True,0-15 Min,False,False,Zone K3
2006-08-08 22:45:00,RI-2006-35918,2006-08-08,22:45,Zone K3,,,300,F,1971.0,35.0,...,False,,,False,Citation,True,0-15 Min,True,False,Zone K3
2006-08-08 22:53:00,RI-2006-35919,2006-08-08,22:53,Zone X4,,,500,M,1952.0,54.0,...,False,,,False,Citation,True,16-30 Min,True,False,Zone X4
2006-08-08 23:00:00,RI-2006-35920,2006-08-08,23:00,Zone K1,,,600,F,1982.0,24.0,...,False,,,False,Citation,True,0-15 Min,False,False,Zone K1
