# COGS 108 - Data Checkpoint

# Names

- Nick Fithen
- Lupe De Anda
- Andrew Lona
- Kimberly Alonzo
- Andres Villegas


<a id='research_question'></a>
# Research Question



*Is there a causal relationship between the number of nationwide police shootings and the increase in bills proposed in California concerning law enforcement?*


# Dataset(s)

**Top 50 Police Vocabulary**
- Dataset Name: vocab.csv
- Link to the dataset: https://raw.githubusercontent.com/lupedeanda/test_repo/main/vocab.csv
- Gained access to dataset via: https://policeteststudyguide.com/top-50-police-vocabulary-you-need-to-know/
- Number of observations: 50

The dataset lists 50 words that are used and understood by professionals in the police force. We added the definition of each word in a separate column.

**Popular Police Codes**
- Dataset Name: codes.csv
- Link to the dataset: https://raw.githubusercontent.com/lupedeanda/test_repo/main/codes.csv
- Gained access to dataset via: https://policecodes.net
- Number of observations: 30

The dataset lists 30 police codes that officers use to communicate emergencies, crimes, and situations on the job. We included the description of each code in a separate column.

**Fatal Police Shootings Data**
- Dataset Name: shootings.csv
- Link to the dataset: https://github.com/washingtonpost/data-police-shootings/releases/download/v0.1/fatal-police-shootings-data.csv (original), https://raw.githubusercontent.com/lupedeanda/test_repo/main/shootings.csv (ours)
- Gained access to dataset via: https://github.com/washingtonpost/data-police-shootings
- Number of observations: 6,241

The dataset tracks fatal police shootings reported across the United States since 2015. Columns in the dataset document how the person was shot and describe the victim and the events leading up to the incident.

**California Legislative Information**
- Dataset Name: bills.csv
- Link to the dataset: https://raw.githubusercontent.com/lupedeanda/test_repo/main/bills.csv
- Gained access to dataset via: https://leginfo.legislature.ca.gov/faces/billSearchClient.xhtml 
- Number of observations: 62,699

The dataset keeps track of bills proposed in the California Legislature. Columns in the dataset record who proposed the bill, what the bill intended to change, and the current status of the proposed bill.




# Setup

In [None]:
#importing libraries, with an explanation of each import
import pandas as pd #Needed to create and modify dataframes from the data we collect (and hopefully merge them together)

#import seaborn as sns #Not used yet but needed to quickly graph data for visual inspection (such as for outliers)
#import matplotlib.pyplot as plt #Not used but will be important for specific visualizations in the near future
#import numpy as np #Not used yet but is useful for array manipulations

from bs4 import BeautifulSoup # Beautiful Soup library used to parse HTML when web scraping
import requests # needed to send HTTP requests to the website and fetch its HTML

# Data Cleaning

In [None]:
#reading in dataframes
df_vocab = pd.read_csv('https://raw.githubusercontent.com/lupedeanda/test_repo/main/vocab.csv')
df_codes = pd.read_csv('https://raw.githubusercontent.com/lupedeanda/test_repo/main/codes.csv')
df_bills = pd.read_csv('https://raw.githubusercontent.com/lupedeanda/test_repo/main/bills.csv')
df_shootings = pd.read_csv('https://raw.githubusercontent.com/lupedeanda/test_repo/main/shootings.csv')

In [None]:
#only have 2 columns and what we have currently seems to be as usable/clean as we can make it
df_vocab.head()

Unnamed: 0,Word,Definition
0,Corroborate,To give support to a theory or finding
1,Suborn,To bribe someone to commit an unlawful act
2,Sequester,To isolate
3,Libel,To publish a false statement that damages an i...
4,Adjourn,"To postpone, often referring to court"


In [None]:
#same as above
#only have 2 columns and what we have currently seems to be as usable/clean as we can make it
df_codes.head()

Unnamed: 0,Code,Description
0,10-35,Confidential information or open window
1,10-60,Squad in vicinity
2,10-11,Identify frequency / Dispatching too fast
3,10-100,Misdemeanor warrant / Out using restroom
4,10-10,Negative / Fight in progress


In [None]:
### this is a tentative dataset while we attempt to build a webscraper to collect all of the available data
#dataset as is is pretty concise and usable
#was created by government officials with uniformity in mind so there isn't much we need to fix

#Steps needed for this dataset:
#0. Webscrape full dataset. Deadline is next monday May 10. If unable to acquire, then it'll have to be made by hand (copy/paste)
#1. Check for missing values and remove rows (unless the specific bill can be researched to fill the row with such info)
#2. Check for outliers such as uncommon status strings
#3. Remove last 7 characters for Session year and then convert to an integer
#4. Possibly rename measures based on AB labels (will need to research more into this)

df_bills.head()

Unnamed: 0,Measure,Session Year,Subject,Author,Status
0,AB-1,2009 - 2010,Teachers: program of professional growth: conf...,Monning,Vetoed
1,ABX1-1,2009 - 2010,Taxation: corporate reorganizations: built-in ...,Charles Calderon,Assembly - Died
2,ABX2-1,2009 - 2010,State employment: salary freeze.,Portantino,Assembly - Died
3,ABX3-1,2009 - 2010,2009–10 Budget.,Evans,Senate - Died
4,ABX4-1,2009 - 2010,Budget Act of 2009: revisions.,Evans,Chaptered


In [None]:
#Checking for missing values
df_bills.isna().values.any()
#No NaNs found!

False

In [None]:
#Checking for unique values in Session Year (Skipping Measure, Subject, and Author for now as they are meant to be unique)
list(df_bills['Session Year'].unique())

['2009 - 2010',
 '2011 - 2012',
 '2013 - 2014',
 '2015 - 2016',
 '2017 - 2018',
 '2019 - 2020',
 '2021 - 2022']
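
Step 3 of the plan above (trimming `Session Year` and converting it to an integer) could look roughly like the sketch below. It runs on a small made-up sample rather than on `df_bills`, and the `Session Start` column name is a hypothetical choice, not something already in the dataset. Because every session string shown above starts with a four-digit year, keeping the first four characters is equivalent to removing the last seven.

```python
import pandas as pd

# Hypothetical sample that mirrors the 'Session Year' format shown above
sample = pd.DataFrame({'Session Year': ['2009 - 2010', '2011 - 2012', '2021 - 2022']})

# Keep the first four characters (the starting year) and convert them to integers.
# 'Session Start' is an assumed column name used only for illustration.
sample['Session Start'] = sample['Session Year'].str[:4].astype(int)

print(sample)
```

Writing the result to a new column instead of overwriting `Session Year` keeps the original label available in case the two-year span is needed later.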

In [None]:
#Checking for unique values in Status

#list(df_bills['Status'].unique())

#Uh Oh, too many unique statuses.

#Will need to check each string for Senate and Assembly and remove the specific dept.
#We won't need this information for the scope of the analysis we are doing (these are too specific)

#Seems like the unique strings that would need to be kept are Vetoed, Assembly, Senate, Chaptered, and Enrolled [V,A,S,C,E]
#Char counts for each string in order are 6, 8, 6, 9, 8

#Best practice would be to iterate through each entry in the column and strip all characters that follow each word (using an if check)

#df_bills_test = df_bills.copy() #This was just to do a bit of testing so I don't have to rerun the entire kernel

#Setup
index = 0 #Creating index as iterating per each in the loop caused an error (saying it was too much to unpack)
strfindlist = ['Vetoed', 'Assembly', 'Senate', 'Chaptered', 'Enrolled'] #A list of the words we want to keep
separator = '' #Separator when using the join function. We don't want any spaces so it will be blank


for row in df_bills['Status']: #Iterating through each row first
  tempstring = "" #A temporary string to store the finished cleaned rows
  tempstringlist = [] #A list to store each character (python cannot append to a string easily)
  strcount = 0 #Counting each character when looping through the string. Just a check to ensure nothing goes wrong
  strfind = "" #Used to ensure each if statement is properly accessed
  strboolean = False #Used to initially start the if statement for character check (this might not be needed)

  for char in row: #Looping through each character in each row/string
    if (char == "V" and strboolean == False) or (strfind == 'Vetoed' and strcount < 6): #Start on a matching first letter (only while the boolean check is False); after that, strfind keeps us in this branch until the keyword's length is reached
      tempstringlist.append(char) #Appending each character to the list to join for later (append returns None, so its result is not assigned)
      strfind = strfindlist[0] #Assigning the string to the finder so that this if statement will be the one used
      strcount += 1 #Counting each character added so that the statement can be broken once finished
      strboolean = True #Stop looking for the first letter of the string
    elif (char == "A" and strboolean == False) or (strfind == 'Assembly' and strcount < 8):
      tempstringlist.append(char)
      strfind = strfindlist[1]
      strcount += 1
      strboolean = True
    elif (char == "S" and strboolean == False) or (strfind == 'Senate' and strcount < 6):
      tempstringlist.append(char)
      strfind = strfindlist[2]
      strcount += 1
      strboolean = True
    elif (char == "C" and strboolean == False) or (strfind == 'Chaptered' and strcount < 9):
      tempstringlist.append(char)
      strfind = strfindlist[3]
      strcount += 1
      strboolean = True
    elif (char == "E" and strboolean == False) or (strfind == 'Enrolled' and strcount < 8):
      tempstringlist.append(char)
      strfind = strfindlist[4]
      strcount += 1
      strboolean = True
    else:
      break #Breaks from the loop so that no other characters are added (what we wanted!)

  tempstring = separator.join(tempstringlist) #Joins the list into a temporary string variable
  df_bills.at[index,'Status'] = tempstring #Rewrites the row under the 'Status' column so that the new string is there
  index += 1 #Increases index count

df_bills.head(20) #Checking the new dataframe

Unnamed: 0,Measure,Session Year,Subject,Author,Status
0,AB-1,2009 - 2010,Teachers: program of professional growth: conf...,Monning,Vetoed
1,ABX1-1,2009 - 2010,Taxation: corporate reorganizations: built-in ...,Charles Calderon,Assembly
2,ABX2-1,2009 - 2010,State employment: salary freeze.,Portantino,Assembly
3,ABX3-1,2009 - 2010,2009–10 Budget.,Evans,Senate
4,ABX4-1,2009 - 2010,Budget Act of 2009: revisions.,Evans,Chaptered
5,ABX5-1,2009 - 2010,School accountability.,Solorio,Assembly
6,ABX6-1,2009 - 2010,Taxation: Oil Industry Fair Share Act.,Nava,Assembly
7,ABX7-1,2009 - 2010,Public resources.,Fuller,Assembly
8,ABX8-1,2009 - 2010,Budget Act of 2009.,Committee on Budget,Chaptered
9,AB-1,2011 - 2012,Education finance: CalWORKs Stage 3.,John A. Pérez,Assembly
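
The character loop above decides which keyword to keep from the first letter alone, so any status that merely starts with V, A, S, C, or E would be cut to that keyword's length even if it is a different word. A possible alternative, sketched here on made-up data, is to match the whole keyword at the start of the string with pandas' `Series.str.extract`. The sample statuses are assumptions for illustration (the last one is invented to show a non-match), and we have not run this against the full `df_bills`.

```python
import pandas as pd

# Hypothetical raw statuses; 'Some other status' is an invented non-matching value
raw_status = pd.Series([
    'Vetoed',
    'Assembly - Died',
    'Senate - Died',
    'Chaptered',
    'Enrolled',
    'Some other status',
])

# The keywords we want to keep, as listed in the cleaning cell above
keep = ['Vetoed', 'Assembly', 'Senate', 'Chaptered', 'Enrolled']

# Anchor the match at the start of the string and capture the whole keyword
pattern = r'^(' + '|'.join(keep) + r')'

# expand=False returns a Series; rows that match no keyword become NaN
clean_status = raw_status.str.extract(pattern, expand=False)

print(clean_status)
```

With this approach, rows that match none of the keywords come back as missing values instead of truncated strings, which makes unexpected statuses easy to spot with `isna()`.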


In [None]:
#Nick- trying to think of a definition to create that would take a string and find if it has Senate or Assembly. Depending on if it has the '- Died' part to it
#def rid_dept(stringy):
#  for 

#Got you! -Andrew
#Thank u ur wizard -Nick

In [None]:
#dropping names of victims for privacy reasons
df_shootings = df_shootings.drop(labels='name',axis=1)

#dropping unnecessary geocoding data 
#we don't need precise locations as we already have city and state available
df_shootings = df_shootings.drop(labels='latitude',axis=1)
df_shootings = df_shootings.drop(labels='longitude',axis=1)
df_shootings = df_shootings.drop(labels='is_geocoding_exact',axis=1)

In [None]:
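#Getting rid of the .0 that trails after the ages in the 'age' column
#Casting to pandas' nullable integer type removes the trailing .0 and keeps missing ages as <NA>
#(str.replace('.0', '') can treat '.' as a regex wildcard, which would turn an age like '30.0' into an empty string)
df_shootings['age'] = df_shootings['age'].astype('Int64')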

In [None]:
#printing dataset to see columns
df_shootings.head()

Unnamed: 0,id,date,manner_of_death,armed,age,gender,race,city,state,signs_of_mental_illness,threat_level,flee,body_camera
0,3,2015-01-02,shot,gun,53,M,A,Shelton,WA,True,attack,Not fleeing,False
1,4,2015-01-02,shot,gun,47,M,W,Aloha,OR,False,attack,Not fleeing,False
2,5,2015-01-03,shot and Tasered,unarmed,23,M,H,Wichita,KS,False,other,Not fleeing,False
3,8,2015-01-04,shot,toy weapon,32,M,W,San Francisco,CA,True,attack,Not fleeing,False
4,9,2015-01-04,shot,nail gun,39,M,H,Evans,CO,False,attack,Not fleeing,False


In [None]:
#isolating arm types 
df_shootings['armed'].value_counts()

#99 distinct values for armed; consider condensing to just unarmed or armed regardless of weapon
#kimberly

gun                    3577
knife                   919
unarmed                 402
toy weapon              212
vehicle                 194
                       ... 
bayonet                   1
claimed to be armed       1
ice pick                  1
wasp spray                1
nail gun                  1
Name: armed, Length: 99, dtype: int64
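
One way the condensing idea in the comment above might be sketched is shown below, on a small made-up sample rather than on `df_shootings`. The `armed_status` function and the `armed_binary` column are hypothetical names, and how to label borderline values such as `toy weapon` or `claimed to be armed` is a judgment call that this sketch does not settle: it simply treats everything other than `unarmed` and missing values as `armed`.

```python
import pandas as pd

# Hypothetical sample using values that appear in the 'armed' column, plus one missing value
sample = pd.DataFrame({'armed': ['gun', 'knife', 'unarmed', 'toy weapon', None]})

def armed_status(value):
    """Collapse a weapon description into 'armed', 'unarmed', or 'unknown'."""
    if pd.isna(value):
        return 'unknown'   # keep missing entries separate instead of guessing
    if value == 'unarmed':
        return 'unarmed'
    return 'armed'         # assumption: every other label counts as armed

# 'armed_binary' is an assumed column name used only for illustration
sample['armed_binary'] = sample['armed'].apply(armed_status)

print(sample)
```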

In [None]:
df_shootings['manner_of_death'].value_counts()
# this raises a question: the dataset only contains shootings that resulted in death,
# so a new dataset may be needed to account for all shooting-related incidents involving police
#Kimberly

shot                5923
shot and Tasered     318
Name: manner_of_death, dtype: int64

### Webscraping of Bills from CA Legislature (Still In Progress)
* Due to the setup of the CA Legislature website, we ran into some problems when webscraping with Beautiful Soup and requests
* Website = https://leginfo.legislature.ca.gov/faces/advance/advance.xhtml
* The typical request scraping process is not compatible with this type of html as it seems to be waiting for user input in the search parameters before it presents a table of bills.
* As a placeholder, we copied the tables directly into a spreadsheet program and exported them as a csv; we are still looking into how to accomplish a proper webscrape of the data:
_____________________________________________________
* We used this tutorial to learn another form of scraping, which submits the search form with a POST request
  * http://jonathansoma.com/lede/foundations/classes/friday%20sessions/advanced-scraping-form-submissions-completed/
* This requires us to locate the specific form parameters, which are then sent via requests.post(url, data=post_params)
* So far we have found some of the required parameters, but we still need more experimentation to figure out which ones return the full list of bills since 1999
  * A meeting with a TA next week will take place to help with understanding this process a bit more
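
The POST approach described above might be sketched as follows. This is only an outline under stated assumptions: it supposes the search page contains a form whose hidden inputs have to be sent back along with our own search values, and the field names in the sample HTML (`token`, `keyword`) are made up for the offline check, not taken from the CA Legislature site. `collect_form_fields` and `submit_search` are hypothetical helper names.

```python
import requests
from bs4 import BeautifulSoup

def collect_form_fields(html):
    """Return {name: value} for every named <input> in the first <form> on the page."""
    soup = BeautifulSoup(html, 'html.parser')
    form = soup.find('form')
    fields = {}
    if form is None:
        return fields
    for tag in form.find_all('input'):
        name = tag.get('name')
        if name:  # inputs without a name are not submitted with the form
            fields[name] = tag.get('value', '')
    return fields

def submit_search(url, search_params):
    """GET the page, merge its form fields with our search values, then POST them back."""
    with requests.Session() as session:  # a session keeps cookies between the GET and the POST
        page = session.get(url, timeout=30)
        post_params = collect_form_fields(page.text)
        post_params.update(search_params)  # our values override the page defaults
        return session.post(url, data=post_params, timeout=30)

# Offline check on a made-up form (no network access needed)
sample_html = '''
<form>
  <input type="hidden" name="token" value="abc123">
  <input type="text" name="keyword" value="">
</form>
'''
fields = collect_form_fields(sample_html)
print(fields)
```

`submit_search` is not called here because the real parameter names are still unknown; once they are identified, the response text could be passed to Beautiful Soup to pull out the table of bills.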