***
# Exploratory Data Analysis
MSDS 7331-403, Lab 1  
*Jenna Ford, Edward Fry, Christian Nava, and Jonathan Tan* 
***

In [None]:
import pandas as pd
import numpy as np
import datetime
from datetime import date
import calendar

df = pd.read_csv('Data/Arrest_Data_from_2010_to_Present.csv')  # read in the csv file (forward slash avoids the invalid '\A' escape)

## Business Understanding
*(10 points)*

*Describe the purpose of the data set you selected (i.e., why was this data collected in the first place?). Describe how you would define and measure the outcomes from the dataset. That is, why is this data important and how do you know if you have mined useful knowledge from the dataset? How would you measure the effectiveness of a good prediction algorithm? Be specific.*

"This dataset reflects arrest incidents in the City of Los Angeles dating back to 2010. This data is transcribed from original arrest reports that are typed on paper and therefore there may be some inaccuracies within the data. Some location fields with missing data are noted as (0.0000°, 0.0000°). Address fields are only provided to the nearest hundred block in order to maintain privacy."

## Data Understanding

*(10 points)*  
*Describe the meaning and type of data (scale, values, etc.) for each
attribute in the data file.*


The data set consists of 17 attributes. Their descriptions are provided by the [City of Los Angeles open data](https://data.lacity.org/A-Safe-City/Arrest-Data-from-2010-to-Present/yru6-6re4) portal and [Kaggle](https://www.kaggle.com/cityofLA/los-angeles-crime-arrest-data), and are reproduced in Table 1 below:

**Table 1: Attribute Descriptions**

| Attribute | Description |
| :--- | :--- |
| **Report ID** | ID for the arrest |
| **Arrest Date** | Date in MM/DD/YYYY format |
| **Time** | In 24-hour military time |
| **Area ID** | The LAPD has 21 Community Police Stations referred to as Geographic Areas within the department. These Geographic Areas are sequentially numbered from 1-21. |
| **Area Name** | The 21 Geographic Areas or Patrol Divisions are also given a name designation that references a landmark or the surrounding community that it is responsible for. For example 77th Street Division is located at the intersection of South Broadway and 77th Street, serving neighborhoods in South Los Angeles. |
| **Reporting District** | A four-digit code that represents a sub-area within a Geographic Area. All arrest records reference the "RD" that it occurred in for statistical comparisons. Find LAPD Reporting Districts on the LA City GeoHub at http://geohub.lacity.org/datasets/c4f83909b81d4786aa8ba8a74a4b4db1_4 |
| **Age** | Two character numeric.|
| **Sex Code** | F - Female; M - Male|
| **Descent Code** | Descent Code: A - Other Asian; B - Black; C - Chinese; D - Cambodian; F - Filipino; G - Guamanian; H - Hispanic/Latin/Mexican; I - American Indian/Alaskan Native; J - Japanese; K - Korean; L - Laotian; O - Other; P - Pacific Islander; S - Samoan; U - Hawaiian; V - Vietnamese; W - White; X - Unknown; Z - Asian Indian |
| **Charge Group Code** | Category of arrest charge. |
| **Charge Group Description** | Defines the Charge Group Code provided. |
| **Arrest Type Code** | A code to indicate the type of charge the individual was arrested for. D - Dependent; F - Felony; I - Infraction; M - Misdemeanor; O - Other |
| **Charge** | The charge the individual was arrested for. |
| **Charge Description** | Defines the Charge provided. |
| **Address** | Street address of crime incident rounded to the nearest hundred block to maintain anonymity. |
| **Cross Street** | Cross Street of rounded Address. |
| **Location** | The location where the crime incident occurred. Actual address is omitted for confidentiality. XY coordinates reflect the nearest 100 block. |
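
The single-letter codes in Table 1 can be decoded into readable labels before plotting or tabulating. The following is a minimal sketch using the `Sex Code` and `Arrest Type Code` definitions from the table; the dictionary names, the new column names, and the toy rows are illustrative assumptions rather than part of the original notebook.

```python
import pandas as pd

# Code-to-label mappings taken from Table 1 (dictionary names are hypothetical)
SEX_LABELS = {"F": "Female", "M": "Male"}
ARREST_TYPE_LABELS = {
    "D": "Dependent",
    "F": "Felony",
    "I": "Infraction",
    "M": "Misdemeanor",
    "O": "Other",
}

# Toy rows standing in for the arrest records
toy = pd.DataFrame({
    "Sex Code": ["F", "M", "M"],
    "Arrest Type Code": ["M", "F", "I"],
})

# map() returns NaN for any code that is not in the dictionary,
# which makes unexpected codes easy to spot afterwards
toy["Sex"] = toy["Sex Code"].map(SEX_LABELS)
toy["Arrest Type"] = toy["Arrest Type Code"].map(ARREST_TYPE_LABELS)
print(toy)
```

Keeping the original code columns alongside the decoded ones makes it possible to check for codes that fall outside the documented set.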



*(15 points)*  
*Verify data quality: Explain any missing values, duplicate data, and outliers. Are those mistakes? How do you deal with these problems? Be specific.*

#### Missing Values

A check for missing values reveals that 56% of observations have missing data for the `Cross Street` attribute. This is a significant amount that will require closer inspection.

In [None]:
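def missing_data(data):
    # count of missing values per column, largest first
    total = data.isnull().sum().sort_values(ascending = False)
    # share of missing values per column, as a percentage
    percent = (data.isnull().sum()/data.isnull().count()*100).sort_values(ascending = False)
    return pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])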

In [None]:
missing_data(df).head(17)
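
The same summary can be sketched more compactly, because the mean of a boolean mask is the fraction of missing values. The helper name `missing_summary` and the toy frame below are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def missing_summary(data: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages per column, largest first."""
    mask = data.isnull()
    summary = pd.DataFrame({
        "Total": mask.sum(),            # number of missing values per column
        "Percent": mask.mean() * 100,   # fraction missing, as a percentage
    })
    return summary.sort_values("Total", ascending=False)


# Toy data (hypothetical values) standing in for the arrest records
toy = pd.DataFrame({
    "Charge": ["A", "B", "C", "D"],
    "Cross Street": ["MAIN", np.nan, np.nan, np.nan],
})
print(missing_summary(toy))
```

Building both columns from one mask keeps the counts and percentages aligned on the same index without a separate `concat`.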

#### Duplicate Data

We want to make sure we have unique observations, i.e., no two records have the same values for all attributes. This will reduce the risk of biased estimates. In this dataset a duplicate record would warrant further inspection, as it is unlikely that two arrests were made in the exact same location, on the same date and time, for the same charge, for two individuals of the same age, gender, and ethnic descent. A check for duplicate records verifies that our dataset contains unique observations.  

In [None]:
# check for duplicate records
df.duplicated().sum()
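
If the count above were ever non-zero, the affected rows could be inspected before deciding whether to drop them. This is a small sketch on hypothetical toy rows; the variable names are illustrative and not taken from the notebook.

```python
import pandas as pd

# Toy rows (hypothetical values); the first two are exact duplicates
toy = pd.DataFrame({
    "Arrest Date": ["01/01/2015", "01/01/2015", "02/03/2016"],
    "Charge": ["X", "X", "Y"],
    "Age": [30, 30, 41],
})

# duplicated() marks every repeat of an earlier row, so the sum is the
# number of rows that drop_duplicates() would remove
n_duplicates = toy.duplicated().sum()

# keep=False marks all members of each duplicate group for inspection
duplicate_rows = toy[toy.duplicated(keep=False)]

# keep the first occurrence of each distinct row
deduplicated = toy.drop_duplicates()

print(n_duplicates)
print(duplicate_rows)
```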

#### Remove Unnecessary Attributes

The `Charge Group Description`, `Charge Group Code`, and `Charge Description` attributes all provide information similar to `Charge`. Of these four attributes, `Charge` is the only one that does not have missing values, so we will drop `Charge Group Description`, `Charge Group Code`, and `Charge Description`. The cell below also drops `Report ID`, `Area ID`, and `Location`, which are redundant or not relevant to this analysis.

In [2]:
# remove redundant columns or those that do not contain relevant information
df.drop(['Report ID',
         'Area ID',
         'Location',
         'Charge Group Description', 
         'Charge Group Code', 
         'Charge Description'], axis=1, inplace=True)

In [3]:
# Assign, clean up, and split out the classes of data
ordinalFeatures = ['Arrest Date', 'Time', 'Age']
# only columns that remain after the drop above are listed here
categoricalFeatures = ['Area Name', 'Reporting District', 'Sex Code', 'Descent Code',
                       'Arrest Type Code', 'Charge', 'Address', 'Cross Street']
df['Age'] = df['Age'].astype(np.int8)
#df['Arrest Date'] = df['Arrest Date'].astype(np.datetime64)
# np.str was removed in NumPy 1.24; the built-in str is equivalent
df['Reporting District'] = df['Reporting District'].astype(str)

In [4]:
#need to cleanup the time field...it is stored like 645 instead of 06:45
df_cleansed = df

#convert float to string
df_cleansed['Time'] = df_cleansed['Time'].astype(str) 

#get rid of decimals
df_cleansed['Time'] = df_cleansed['Time'].str.split(".", expand=True)[0] 

#convert missing to 0000
df_cleansed['Time'] = df_cleansed['Time'].replace(to_replace="nan",value="0000") 

#treat 0 as missing and convert to 0000
df_cleansed['Time'] = df_cleansed['Time'].replace(to_replace="0",value="0000") 

#2400 is not a valid time, converting to 0001 so it isn't the same as missing
df_cleansed['Time'] = df_cleansed['Time'].replace(to_replace="2400",value="0001") 

#split the time string to get the appropriate digits that correspond to hours and minutes
df_cleansed['Hour'] = np.where(df_cleansed['Time'].str.len() == 4,df_cleansed['Time'].str[-4:2],np.where(df_cleansed['Time'].str.len() == 3,df_cleansed['Time'].str[-3:1],"00"))
df_cleansed['Minute'] = df_cleansed['Time'].str[-2:4]

#put hour and minute back together in time format
df_cleansed['NewTime'] = pd.to_datetime(df_cleansed['Hour'] + ':' + df_cleansed['Minute'] + ':00',format='%H:%M:%S').dt.time
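
The hour and minute slicing above can also be sketched with zero padding, which avoids branching on string length. This is an alternative sketch, not the notebook's method: the helper name `clean_time` and the toy values are assumptions. It follows the same conventions as the cell above, treating missing and `0` as `0000` and mapping `2400` to `0001`.

```python
import pandas as pd


def clean_time(times: pd.Series) -> pd.Series:
    """Convert times stored as numbers like 645.0 into datetime.time values."""
    # missing -> 0, drop the decimal part, then work with strings
    text = times.fillna(0).astype(int).astype(str)
    # 2400 is not a valid time; map it to 0001 as in the cell above
    text = text.replace({"2400": "0001"})
    # left-pad to four digits so 645 becomes 0645 and 5 becomes 0005
    padded = text.str.zfill(4)
    return pd.to_datetime(padded, format="%H%M").dt.time


# Toy values (hypothetical) covering the cases handled above
raw_time = pd.Series([645.0, 1530.0, 5.0, 2400.0, 0.0, None])
cleaned = clean_time(raw_time)
print(cleaned.tolist())
```

After padding, every value has the same `HHMM` layout, so a single format string is enough to parse it.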

In [5]:
#need to clean up cross street field

#remove duplicate whitespaces
# raw strings keep '\s' from being read as an (invalid) string escape
df_cleansed['Cross Street'] = df_cleansed['Cross Street'].replace(r'\s+',' ',regex=True)
df_cleansed['Address'] = df_cleansed['Address'].replace(r'\s+',' ',regex=True)

#if all digits are numeric, nullify
df_cleansed['Address New'] = np.where(df_cleansed["Address"].str.isdigit() == True,np.nan, df_cleansed["Address"])
df_cleansed['Cross Street New'] = np.where(df_cleansed["Cross Street"].str.isdigit() == True,np.nan, df_cleansed["Cross Street"])

df_cleansed['Address_first_word'] = df_cleansed['Address'].str.split(n=1).str[0]
df_cleansed['Street'] = np.where(df_cleansed['Address_first_word'].str.isdigit() == True,df_cleansed['Address'].str.split(n=1).str[1],df_cleansed['Address'])

df_cleansed['Cross_street_first_word'] = df_cleansed['Cross Street'].str.split(n=1).str[0]
df_cleansed['CrossStreet'] = np.where(df_cleansed['Cross_street_first_word'].str.isdigit() == True,df_cleansed['Cross Street'].str.split(n=1).str[1],df_cleansed['Cross Street'])
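
Stripping the leading block number can also be sketched with one anchored regular expression. The helper name `street_name` and the toy addresses are assumptions for illustration. One difference from the cell above: an address consisting only of digits is left unchanged here rather than becoming missing.

```python
import numpy as np
import pandas as pd


def street_name(address: pd.Series) -> pd.Series:
    """Collapse repeated whitespace and drop a leading all-digit block number."""
    collapsed = address.str.replace(r"\s+", " ", regex=True).str.strip()
    # '^\d+ ' matches only a run of digits at the start followed by a space
    return collapsed.str.replace(r"^\d+ ", "", regex=True)


# Toy addresses (hypothetical values); missing values pass through as NaN
toy = pd.Series(["1200  W  7TH ST", "MAIN ST", np.nan])
streets = street_name(toy)
print(streets.tolist())
```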

In [6]:
#delete columns not relevant to analysis
df_cleansed.drop(['Time','Hour','Minute','Address','Cross Street','Address New','Cross Street New','Address_first_word','Cross_street_first_word'],axis=1,inplace=True)

In [7]:
df_cleansed.describe()

| Statistic | Age |
| :--- | :--- |
| count | 1326626.0 |
| mean | 34.22556 |
| std | 13.60807 |
| min | 0.0 |
| 25% | 23.0 |
| 50% | 32.0 |
| 75% | 45.0 |
| max | 97.0 |
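
The minimum `Age` of 0 in the summary above is worth a closer look when checking for outliers. The sketch below counts zero ages and flags values outside the usual 1.5 × IQR fences. The toy ages are hypothetical, and whether an age of 0 is a placeholder or a data-entry error is an assumption that would need to be checked against the real records.

```python
import pandas as pd

# Toy ages (hypothetical values), not drawn from the arrest data
ages = pd.Series([0, 23, 28, 30, 32, 45, 97], name="Age")

# count records with an age of zero
n_zero = (ages == 0).sum()

# interquartile-range fences: values beyond 1.5 * IQR from the quartiles
q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
flagged = ages[(ages < lower) | (ages > upper)]

print(n_zero, lower, upper)
print(flagged.tolist())
```

Flagged values are candidates for inspection, not automatic removals; a high but plausible age may be a legitimate record.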


In [None]:
#create year, month, day, day of week columns

def findYear(date):
    year = datetime.datetime.strptime(date, '%m/%d/%Y').year
    return(year)

df_cleansed['Year'] = df_cleansed['Arrest Date'].apply(findYear)

def findMonth(date):
    month = datetime.datetime.strptime(date, '%m/%d/%Y').month
    return(month)

df_cleansed['Month'] = df_cleansed['Arrest Date'].apply(findMonth)

def findDay(date):
    day = datetime.datetime.strptime(date, '%m/%d/%Y').day
    return(day)

df_cleansed['Day'] = df_cleansed['Arrest Date'].apply(findDay)

# renamed from findDay so it does not shadow the day-of-month helper above
def findDayName(date): 
    day = datetime.datetime.strptime(date, '%m/%d/%Y').weekday() 
    return (calendar.day_name[day]) 

#df_cleansed['Day of Week'] = df_cleansed['Arrest Date'].apply(findDayName)

def findNDay(date): 
    day = datetime.datetime.strptime(date, '%m/%d/%Y').weekday() 
    return (day) 

df_cleansed['N Day of Week'] = df_cleansed['Arrest Date'].apply(findNDay)

#check results
df_cleansed
df_cleansed.to_csv(r'Cleansed.csv')
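
The row-by-row `apply` calls above can also be sketched as one vectorized parse followed by `.dt` accessors, which is typically faster on a data set of this size. The toy frame is an assumption; the column names mirror those created above.

```python
import pandas as pd

# Toy dates (hypothetical values) in the MM/DD/YYYY format of the source data
toy = pd.DataFrame({"Arrest Date": ["01/15/2010", "12/31/2018"]})

# parse the whole column once instead of once per derived field
parsed = pd.to_datetime(toy["Arrest Date"], format="%m/%d/%Y")

toy["Year"] = parsed.dt.year
toy["Month"] = parsed.dt.month
toy["Day"] = parsed.dt.day
toy["Day of Week"] = parsed.dt.day_name()
toy["N Day of Week"] = parsed.dt.dayofweek  # Monday = 0, as with weekday()
print(toy)
```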

*(10 points)*   
*Give simple, appropriate statistics (range, mode, mean, median, variance,
counts, etc.) for the most important attributes and describe what they mean or if you found something interesting. Note: You can also use data from other sources for comparison. Explain the significance of the statistics run and why they are meaningful.*  

*(15 points)*  
*Visualize the most important attributes appropriately (at least 5 attributes). Important: Provide an interpretation for each chart. Explain for each attribute why the chosen visualization is appropriate.*

*(15 points)*  
*Explore relationships between attributes: Look at the attributes via scatter plots, correlation, cross-tabulation, group-wise averages, etc. as appropriate. Explain any interesting relationships.*  

*(10 points)*  
*Identify and explain interesting relationships between features and the class you are trying to predict (i.e., relationships with variables and the target classification).*

*(5 points)*  
*Are there other features that could be added to the data or created from
existing features? Which ones?*

## Exceptional Work
*(10 points)*  
• *You have free rein to provide additional analyses.  
• One idea: implement dimensionality reduction, then visualize and interpret the results.*