Names: Jessica Hatfield and Nate Ostrander

# Assessing Arrest Data at the University of Maryland

The purpose of this project is ...

In [35]:
from bs4 import BeautifulSoup
import requests as req
import pandas as pd
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt

# Data collection and processing

We are getting our data from the <a href="http://www.umpd.umd.edu/stats/arrest_report.cfm">University of Maryland Police Department's Crime Stats</a> website. We use the arrest data from 2010 through 2017, because the data for the years 2006 to 2009 is not available. To collect the data from the webpage, we use the Python requests and BeautifulSoup packages and store the data in a Pandas DataFrame. Note that the UMPD updates its arrest data every day, so the data we use for 2017 will change as the semester progresses.

Below is an example of the UMPD arrest data from 2010 before we began cleaning and processing the data. As you can see, because the HTML table contains rowspans and colspans, information corresponding to each arrest record is spread out across two rows of the dataframe.

In [36]:
r = req.get('http://www.umpd.umd.edu/stats/arrest_report.cfm?year=2010')
text = r.text
soup = BeautifulSoup(text, 'html.parser')
table = soup.find('table')
df = pd.read_html(str(table), flavor='bs4')[0]
df.head()

Unnamed: 0,0,1,2,3,4,5
0,ARRESTNUMBER,ARRESTED DATE TIMECHARGE,UMPD CASE NUMBER,AGE,RACE,SEX
1,16001,11/09/10 23:30,2010-00000115,,Black,Male
2,CDS: Possess Paraphernalia,,,,,
3,16002,11/10/10 00:20,2010-00000126,,Black,Male
4,"Theft: $1,000 to Under $10,000",,,,,
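
Before cleaning, it can be worth confirming that arrest rows and description rows really do alternate, since the processing step relies on that layout. The following is a minimal sketch, not part of the original pipeline: `rows_alternate` is a hypothetical helper, the toy frame is typed in by hand from the preview above rather than fetched from the site, and a "description row" is assumed to be one where every column except the first is empty.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the raw layout shown above (header row already removed).
# The values are copied from the 2010 preview, not fetched from the website.
raw = pd.DataFrame(
    [
        ['16001', '11/09/10 23:30', '2010-00000115', np.nan, 'Black', 'Male'],
        ['CDS: Possess Paraphernalia', np.nan, np.nan, np.nan, np.nan, np.nan],
        ['16002', '11/10/10 00:20', '2010-00000126', np.nan, 'Black', 'Male'],
        ['Theft: $1,000 to Under $10,000', np.nan, np.nan, np.nan, np.nan, np.nan],
    ]
)


def rows_alternate(frame):
    """Return True if arrest rows and description rows strictly alternate.

    Hypothetical helper: a description row is assumed to be a row in which
    every column except the first is empty (NaN).
    """
    is_description = frame.iloc[:, 1:].isna().all(axis=1).to_numpy()
    # Odd positions (1, 3, 5, ...) are expected to hold the descriptions
    expected = np.arange(len(frame)) % 2 == 1
    return len(frame) % 2 == 0 and bool((is_description == expected).all())


alternates = rows_alternate(raw)
```

If a check like this fails for some year, the row-pairing step would attach descriptions to the wrong arrests, so it is safer to inspect that year's table by hand.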


Next, we need to clean up and process the data. The problems resulting from rowspans and colspans are solved by moving the description from the second row of each arrest record into the first row and then removing the second row completely. Additionally, a year column is added to the dataframe for easy access to each year's data. Finally, after an individual dataframe is created for each year from 2010 to 2017, the dataframes are concatenated to form one large dataframe that contains the arrest data for every year from 2010 to 2017.

In [37]:
frames = []

# Have to make separate requests for every year
for i in range(2010, 2018):

    # Putting the data into a dataframe
    r = req.get('http://www.umpd.umd.edu/stats/arrest_report.cfm?year=' + str(i))
    text = r.text
    soup = BeautifulSoup(text, 'html.parser')
    table = soup.find('table')
    df = pd.read_html(str(table), flavor='bs4')[0]
    
    df.drop(df.index[:1], inplace=True)
    df.columns = ['arrest_number', 'date_time', 'case_number', 'age', 'race', 'gender']
    df['description'] = ''
    df['year'] = 0
    
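    # Fixing issues caused by the rowspans and colspans: each arrest row is
    # followed by a row whose first cell holds the charge description.
    # (get_value/set_value were removed in pandas 1.0, so .at is used instead.)
    for index in range(1, df.shape[0], 2):
        df.at[index, 'description'] = df.at[index + 1, 'arrest_number']
        df.at[index, 'year'] = i
    
    # Keep only the arrest rows; the description rows are no longer needed
    df = df.iloc[::2]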
    
    # Adding current df to the list of dataframes
    frames.append(df)

# Combining individual dataframes
dataframe = pd.concat(frames)
dataframe.reset_index(inplace=True, drop=True)
dataframe.head()

Unnamed: 0,arrest_number,date_time,case_number,age,race,gender,description,year
0,16001,11/09/10 23:30,2010-00000115,,Black,Male,CDS: Possess Paraphernalia,2010
1,16002,11/10/10 00:20,2010-00000126,,Black,Male,"Theft: $1,000 to Under $10,000",2010
2,16003,11/10/10 00:20,2010-00000126,,Black,Male,"Theft: $1,000 to Under $10,000",2010
3,16005,11/10/10 22:44,2010-00000292,,White,Male,"(Driving, Attempting to drive) veh. while unde...",2010
4,16006,11/11/10 17:54,2010-00000414,,White,Male,"(Driving, Attempting to drive) motor veh. on h...",2010
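
The row-pairing step can also be written without an explicit loop. The sketch below is one possible alternative, not the method used in the cell above: `pair_rows` is a hypothetical helper, the toy frame is typed in by hand from the 2010 preview, and it assumes the header row has already been dropped and that arrest rows and description rows strictly alternate.

```python
import numpy as np
import pandas as pd


def pair_rows(raw, year):
    """Fold each description row into the arrest row directly above it.

    Hypothetical helper; assumes the header row is already gone and that
    arrest rows and description rows strictly alternate.
    """
    out = raw.reset_index(drop=True).copy()
    out.columns = ['arrest_number', 'date_time', 'case_number', 'age', 'race', 'gender']
    # shift(-1) moves every value up by one row, so each arrest row picks up
    # the description stored in the first column of the row below it
    out['description'] = out['arrest_number'].shift(-1)
    out['year'] = year
    # Positions 0, 2, 4, ... are the arrest rows; drop the description rows
    return out.iloc[::2].reset_index(drop=True)


# Toy input copied from the 2010 preview (not fetched from the website)
toy = pd.DataFrame(
    [
        ['16001', '11/09/10 23:30', '2010-00000115', np.nan, 'Black', 'Male'],
        ['CDS: Possess Paraphernalia', np.nan, np.nan, np.nan, np.nan, np.nan],
        ['16002', '11/10/10 00:20', '2010-00000126', np.nan, 'Black', 'Male'],
        ['Theft: $1,000 to Under $10,000', np.nan, np.nan, np.nan, np.nan, np.nan],
    ]
)

cleaned = pair_rows(toy, 2010)
```

Inside the yearly loop, a helper like this would be called once per year and its result appended to `frames`, leaving the `pd.concat` step unchanged.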


# Exploratory data analysis and visualization

Description of what we were looking for, initial ideas before establishing a hypothesis, etc.

# Analysis and hypothesis testing

Probably some stat and regression stuff here.

# Insight

Conclusions and things that we learned.