# Import, Wrangling and Basic Description
## Read me before you start coding
1. The raw data frame is **data**, and it must not be modified. Make all modifications on **df**, which is a copy of **data**, so that no raw data is lost.
2. I did not clean the data. Instead, I give an introduction to how to clean it and what problems might arise while doing so, along with a small example.
3. Reading this code will help you a lot when you start visualizing and analysing the data.
4. Comment out some of the code if the long output tables confuse you.

## 1.1 Define data importing functions, import packages

In [1]:
# created by TEAM SDPD Pull Over
import csv
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


In [2]:
# Data Wrangling : creates indexed dict with compiled data from files listed
files = ['actions_taken', 'basic_details', 'basis_of_search', 'basis_property_seizure', 'disability', 'evidence_found', 'gender', 'property_seized', 'race', 'reason_for_stop', 'stop_result']

def getFileData( filename ) :
    # Read Data/<filename>.csv into {row index: {field name: stripped value}}
    # The with-block closes the file once all rows have been read
    with open('Data/' + filename + '.csv', newline='') as f :
        data = csv.reader(f)
        fields = next(data)
        rows = {}
        for (index, row) in enumerate(data) :
            rows[index] = {name: value.strip() for (name, value) in zip(fields, row)}
    return rows

def getUidDict( files ) :
    # Merge the rows of every file into one dict per person, keyed by "<stop_id>_<pid>"
    uid_dict = {}
    for file in files :
        fileData = getFileData( file )
        for rowIndex in fileData :
            row = fileData[rowIndex]
            uid = row['stop_id'] + "_" + row['pid']
            # A later row with the same uid overwrites fields written earlier
            uid_dict.setdefault(uid, {}).update(row)
    return uid_dict

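def getIndexedDict( uid_dict ) :
    # Re-key the merged rows by position (0, 1, 2, ...) and store the uid as a field
    # (the result is named "indexed" so that it does not shadow the built-in dict)
    indexed = {}
    for (index, uid) in enumerate( uid_dict ) :
        indexed[index] = uid_dict[ uid ]
        indexed[index]['uid'] = uid
    return indexed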

# Example of how to get first index of data
#print(getIndexedDict( getUidDict(files) )[0])
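
The merge above keys every row by `stop_id` + `"_"` + `pid` and copies the row's fields into one dict per key. One consequence worth knowing: if a file holds several rows for the same person, each later row overwrites the fields of the earlier one. The sketch below reproduces that behaviour on two tiny in-memory CSVs. The sample rows and the helper name `merge_by_uid` are made up for illustration and are not part of the project code.

```python
import csv
import io

# Hypothetical miniature versions of two of the CSV files (values are invented).
race_csv = "stop_id,pid,race\n2447,1,White\n2447,2,Asian\n"
actions_csv = (
    "stop_id,pid,action\n"
    "2447,1,Curbside detention\n"
    "2447,1,Handcuffed or flex cuffed\n"
    "2447,2,Curbside detention\n"
)

def merge_by_uid(csv_texts):
    """Merge rows from several CSV texts into one dict per stop_id + pid."""
    merged = {}
    for text in csv_texts:
        for row in csv.DictReader(io.StringIO(text)):
            uid = row["stop_id"] + "_" + row["pid"]
            stripped = {name: value.strip() for name, value in row.items()}
            # update() overwrites existing fields, so the last row wins.
            merged.setdefault(uid, {}).update(stripped)
    return merged

merged = merge_by_uid([race_csv, actions_csv])
print(merged["2447_1"])
# {'stop_id': '2447', 'pid': '1', 'race': 'White', 'action': 'Handcuffed or flex cuffed'}
```

Person `2447_1` has two action rows, but only the second survives. If the real files contain repeated keys like this, it may be worth collecting such fields into lists instead of overwriting them.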


## 1.2 Import data into the **data** DataFrame
This cell might take about 30 seconds to run, because it imports all the data.

In [3]:
data = pd.DataFrame(getIndexedDict( getUidDict(files)))
data = data.T
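
As a side note, building the frame and then transposing it can probably be done in one step with `pd.DataFrame.from_dict(..., orient="index")`, which treats the outer keys as row labels. The sketch below compares both routes on a small hand-made dict that stands in for the output of `getIndexedDict`; the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for getIndexedDict(getUidDict(files)).
indexed = {
    0: {"stop_id": "2443", "pid": "1", "uid": "2443_1"},
    1: {"stop_id": "2444", "pid": "1", "uid": "2444_1"},
}

# Route used in the notebook: outer keys become columns, then transpose.
via_transpose = pd.DataFrame(indexed).T

# Alternative: outer keys become row labels directly.
via_from_dict = pd.DataFrame.from_dict(indexed, orient="index")

print(via_from_dict)
print(via_transpose.values.tolist() == via_from_dict.values.tolist())
```

Both routes give one row per person and one column per field, so the choice is mostly a matter of readability.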

## 1.3 Take a look at the data

In [30]:
data.head(7)

| | action | agency | assignment | basis_for_search | basis_for_search_explanation | basisforpropertyseizure | beat | beat_name | block | cityname | ... | resulttext | school_name | stop_id | stop_in_response_to_cfs | stopdate | stopduration | stoptime | street | type_of_property_seized | uid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | | SD | Patrol, traffic enforcement, field operations | | | | 122 | Pacific Beach 122 | 700.0 | SAN DIEGO | ... | 647(F) PC - DISORD CONDUCT:ALCOHOL (M) 64005 | | 2443 | 0 | 2018-07-01 00:00:00 | 30 | 00:01:37 | Grand Avenue | | 2443_1 |
| 1 | | SD | Patrol, traffic enforcement, field operations | | | | 121 | Mission Beach 121 | | SAN DIEGO | ... | 22349(B) VC - EXC 55MPH SPEED:2 LANE RD (I) 54395 | | 2444 | 0 | 2018-07-01 00:00:00 | 10 | 00:03:34 | NOBEL DRIVE | | 2444_1 |
| 2 | Curbside detention | SD | Other | | | | 822 | El Cerrito 822 | 4400.0 | SAN DIEGO | ... | | | 2447 | 1 | 2018-07-01 00:00:00 | 15 | 00:05:43 | 59th Street | | 2447_1 |
| 3 | Curbside detention | SD | Other | | | | 822 | El Cerrito 822 | 4400.0 | SAN DIEGO | ... | | | 2447 | 1 | 2018-07-01 00:00:00 | 15 | 00:05:43 | 59th Street | | 2447_2 |
| 4 | | SD | Patrol, traffic enforcement, field operations | | | | 614 | Ocean Beach 614 | 4800.0 | SAN DIEGO | ... | | | 2448 | 0 | 2018-07-01 00:00:00 | 5 | 00:19:06 | NIAGARA AVE | | 2448_1 |
| 5 | Search of person was conducted | SD | Patrol, traffic enforcement, field operations | Incident to arrest | subject was transported to detox and was searc... | | 115 | University City 115 | 4500.0 | SAN DIEGO | ... | 647(F) PC - DISORD CONDUCT:ALCOHOL (M) 64005 | | 2449 | 1 | 2018-07-01 00:00:00 | 15 | 00:03:00 | la jolla village dr | | 2449_1 |
| 6 | Handcuffed or flex cuffed | SD | Other | | | | 122 | Pacific Beach 122 | 800.0 | SAN DIEGO | ... | 647(F) PC - DISORD CONDUCT:ALCOHOL (M) 64005 | | 2451 | 0 | 2018-07-01 00:00:00 | 20 | 00:24:02 | Thomas | | 2451_1 |


## 1.4 All the columns

In [None]:
# show basic information about data
data.info()

## 2.1 Brief introduction to data
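1. count  - number of non-null entries in the column (empty strings count as entries)
2. unique - number of distinct values
3. top    - most frequent value (shown blank when that value is the empty string)
4. freq   - number of times the top value occurs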
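
To make the four statistics concrete, here is a minimal sketch on a toy string column. The frame `toy` and its values are invented for illustration. Note that the empty string is counted like any other value, which is why it can show up as `top`.

```python
import pandas as pd

# Toy column: two empty strings and two distinct actions (invented values).
toy = pd.DataFrame({"action": ["", "", "Curbside detention", "Vehicle impounded"]})

summary = toy.describe().T
print(summary)

# count  = 4 entries, none of them null
# unique = 3 distinct values ("", "Curbside detention", "Vehicle impounded")
# top    = "" because it occurs most often
# freq   = 2 occurrences of the top value
```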

In [4]:
data.shape

(89470, 49)

In [5]:
data.describe().T

| | count | unique | top | freq |
|---|---|---|---|---|
| action | 89470 | 23 | | 59871 |
| agency | 89470 | 1 | SD | 89470 |
| assignment | 89470 | 10 | Patrol, traffic enforcement, field operations | 80806 |
| basis_for_search | 89470 | 13 | | 72683 |
| basis_for_search_explanation | 89470 | 7210 | | 77080 |
| basisforpropertyseizure | 89470 | 6 | | 87454 |
| beat | 89470 | 126 | 521 | 6987 |
| beat_name | 89470 | 126 | East Village 521 | 6987 |
| block | 89470 | 221 | | 8008 |
| cityname | 89470 | 37 | SAN DIEGO | 88101 |


## 2.2 Create **df** object and drop columns
Here I use a deep copy, which means that modifications to **df** will not affect **data**.  
The columns that will be dropped are **agency** and **ori**.  
These two columns have the same value in every row, so they provide no useful information.

In [6]:
df = data.copy(deep=True)
df = df.drop(["agency","ori"],axis = 1)
print("data.columns:" + str(len(data.columns)))
print("df.columns:" + str(len(df.columns)))

data.columns:49
df.columns:47
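
Columns that hold a single value in every row can also be found programmatically rather than by eye. The sketch below does this with `nunique` on a toy frame and confirms that dropping them from a deep copy leaves the original untouched. The frame `raw`, the `ori` placeholder value, and the names `constant_cols` and `work` are assumptions made for the example.

```python
import pandas as pd

# Toy frame: two constant columns and one informative column (invented values).
raw = pd.DataFrame({
    "agency": ["SD", "SD", "SD"],
    "ori": ["PLACEHOLDER", "PLACEHOLDER", "PLACEHOLDER"],
    "beat": ["122", "121", "822"],
})

# A column is constant when it has exactly one distinct value.
constant_cols = [col for col in raw.columns if raw[col].nunique(dropna=False) == 1]

# Drop the constant columns from a deep copy; raw itself stays as it was.
work = raw.copy(deep=True).drop(columns=constant_cols)

print(constant_cols)        # ['agency', 'ori']
print(list(work.columns))   # ['beat']
print(raw.shape)            # (3, 3)
```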


## 2.3 Show all the distinct values in a column
Here I use the column "action" as an example

In [41]:
df['action'].unique()

array(['None', 'Curbside detention', 'Search of person was conducted',
       'Handcuffed or flex cuffed', 'Field sobriety test conducted',
       'Search of property was conducted', 'Physical or Vehicle contact',
       'Person removed from vehicle by order', 'Chemical spray used',
       'Property was seized', 'Patrol car detention', 'Vehicle impounded',
       'Person removed from vehicle by physical contact',
       'Asked for consent to search person',
       'Asked for consent to search property', 'Person photographed',
       'Canine removed from vehicle or used to search',
       'Electronic control device used',
       'Admission or written statement obtained from student',
       'Firearm pointed at person', 'Baton or other impact weapon used',
       'Canine bit or held person',
       'Impact projectile discharged or used'], dtype=object)
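
`unique()` lists the distinct values but not how often each one occurs. When the frequencies matter, `value_counts()` is a natural companion. A minimal sketch on invented values:

```python
import pandas as pd

# Toy column with a repeated value (invented data).
toy = pd.DataFrame(
    {"action": ["None", "Curbside detention", "None", "Vehicle impounded", "None"]}
)

distinct = toy["action"].unique()        # distinct values, in order of first appearance
counts = toy["action"].value_counts()    # occurrences per value, most frequent first

print(distinct)
print(counts)
```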

## 2.4 Notice: Data Cleaning Example
All the data in the frame is of type string, as shown below.  
Please convert a column to a numeric type if you need to.  
  
  
I wrote a small example of converting a column to a numeric type.  
Note that "null" in the original table is stored as the empty string "".  
Please replace it with **np.nan** or a **default value** such as 0.

In [40]:
# All the columns are in str type
# df.iloc[0, i] reads the first row of column i, so every column is checked
for i in range(len(df.columns)):
    print("Column type: "+str(type(df.iloc[0, i]))+"\tColumn name:\t"+df.columns[i])

Column type: <class 'str'>	Column name:	action
Column type: <class 'str'>	Column name:	assignment
Column type: <class 'str'>	Column name:	basis_for_search
Column type: <class 'str'>	Column name:	basis_for_search_explanation
Column type: <class 'str'>	Column name:	basisforpropertyseizure
Column type: <class 'str'>	Column name:	beat
Column type: <class 'str'>	Column name:	beat_name
Column type: <class 'str'>	Column name:	block
Column type: <class 'str'>	Column name:	cityname
Column type: <class 'str'>	Column name:	code
Column type: <class 'str'>	Column name:	consented
Column type: <class 'str'>	Column name:	contraband
Column type: <class 'str'>	Column name:	disability
Column type: <class 'str'>	Column name:	exp_years
Column type: <class 'str'>	Column name:	gend
Column type: <class 'str'>	Column name:	gend_nc
Column type: <class 'str'>	Column name:	gender
Column type: <class 'str'>	Column name:	gender_nonconforming
Column type: <class 'str'>	Column name:	highway_exit
Column type: <class '

'None'
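
One possible way to carry out the conversion described in 2.4 is sketched below on a toy frame: replace the empty string first, then call `pd.to_numeric`. The frame `toy` and its values are invented, and whether **np.nan** or 0 is the right replacement depends on the analysis, so treat this as a sketch rather than a recommendation.

```python
import numpy as np
import pandas as pd

# Toy string column with one missing entry stored as "" (invented values).
toy = pd.DataFrame({"stopduration": ["30", "10", "", "15"]})

# Option 1: treat "" as missing. The result is a float column containing NaN.
as_nan = pd.to_numeric(toy["stopduration"].replace("", np.nan))

# Option 2: treat "" as the default value 0. The result is an integer column.
as_zero = pd.to_numeric(toy["stopduration"].replace("", "0"))

print(as_nan.mean())    # NaN is skipped, so this averages three values
print(as_zero.mean())   # 13.75, because the 0 is included in the average
```

The two options give different summary statistics, so the replacement value should be chosen per column and stated alongside any result that depends on it.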