# Cleaning Searchlight

## PLEASE READ BELOW!

Thank you so much for helping GoodlyLabs clean Searchlight's Daily Congressional Database! For context, our database consists of all speeches delivered on the floor by both representatives and senators in the US Congress over the past 24 years. Currently, we manage this data by storing information about the speakers (the state they represent, the party they belong to, etc.) and information about the speeches (the date each was given, the actual speech text, etc.) in two separate tables. The speakers table is about 2,000 rows long, and the speeches table is about 1 million rows long. Unfortunately, with this much data, there are bound to be errors.

This Jupyter notebook exists so that volunteers like you can easily help us edit and clean both the speakers and speeches tables. Each table has its own challenges and errors, which we explain in the sections below. At the end of this notebook, you will have two final products: a cleaned speakers table and a cleaned speeches table, both in CSV format, which you will then zip and upload to this Google Drive folder here (<url>).
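As a rough illustration of that final step, the sketch below writes two tables to CSV and bundles them into a single zip archive. The function name `export_and_zip`, the file names, and the `label` argument are all hypothetical choices made for this example, not names the notebook defines, and the toy tables stand in for the real ones.

```python
import tempfile
import zipfile
from pathlib import Path

import pandas as pd


def export_and_zip(speakers_df, speeches_df, out_dir, label):
    """Write both tables to CSV and bundle them into one zip archive.

    All file names here are hypothetical; use whatever naming you are asked for.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    speakers_path = out_dir / f"speakers_{label}.csv"
    speeches_path = out_dir / f"speeches_{label}.csv"
    # index=False keeps pandas from writing an extra unnamed index column.
    speakers_df.to_csv(speakers_path, index=False)
    speeches_df.to_csv(speeches_path, index=False)

    zip_path = out_dir / f"cleaned_{label}.zip"
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.write(speakers_path, arcname=speakers_path.name)
        archive.write(speeches_path, arcname=speeches_path.name)
    return zip_path


# Toy stand-ins for the cleaned tables.
toy_speakers = pd.DataFrame({"last_name": ["ALPHA"], "district": [12]})
toy_speeches = pd.DataFrame({"speech_id": [1], "text": ["example"]})

with tempfile.TemporaryDirectory() as tmp:
    zip_path = export_and_zip(toy_speakers, toy_speeches, tmp, "demo")
    with zipfile.ZipFile(zip_path) as archive:
        archive_names = sorted(archive.namelist())

print(archive_names)
```

Writing with `index=False` is a design choice for this sketch: it avoids adding another `Unnamed: 0` column each time a table is saved and reloaded.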

# Initialization

Don't worry about the inner workings of these functions or packages; just run all of the cells below (except for the first, which you will have to edit). Later, for each section, we'll explain exactly what each corresponding function does and how you can use it.

For the cell directly below, use the Excel spreadsheet here (<url>) to determine which values you need to fill in for each constant.

In [8]:
# SPEECHES_TABLE_START = "fill"
# SPEECHES_TABLE_END = "fill"
# SPEAKERS_TABLE_START = "fill"
# SPEAKERS_TABLE_END = "fill"
# NOTEBOOK_ID = "fill"

SPEECHES_TABLE_START = 0
SPEECHES_TABLE_END = "fill"
SPEAKERS_TABLE_START = "fill"
SPEAKERS_TABLE_END = "fill"
NOTEBOOK_ID = "fill"

# If you're curious, these constants constrain the dataset to just
# the small subset you have been assigned to clean.
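
For the curious, here is a minimal sketch of how a pair of start/end constants could restrict a table to an assigned slice. It is an assumption about the mechanism, not the notebook's actual code: the helper `assigned_subset`, the toy table, and the choice to treat the end position as exclusive are all hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the speeches table; the real one is loaded from a CSV.
toy_speeches = pd.DataFrame(
    {"speech_id": range(10), "text": [f"speech {i}" for i in range(10)]}
)

# Hypothetical values; your real ones come from the spreadsheet.
TOY_START = 2
TOY_END = 5


def assigned_subset(table, start, end):
    """Return the rows at positions start (inclusive) through end (exclusive)."""
    return table.iloc[start:end]


subset = assigned_subset(toy_speeches, TOY_START, TOY_END)
print(subset)
```

Because `iloc` slices by position and keeps the original row labels, the subset's index values still match the full table, so an index seen in the subset refers to the same row in the complete data.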

In [3]:
import pandas as pd
import re
import numpy as np
import urllib
from pathlib import Path
import math

In [58]:
speakers = pd.read_csv('allspeakers.csv')

In [60]:
# speeches = pd.read_csv('allspeeches.csv')

In [62]:
def fix_district(index, district):
    # Set the district for the speaker at row label `index`, then print a confirmation.
    speakers.loc[index, 'district'] = district
    print(str(speakers.loc[index, 'last_name']) + "'s district is now: " + str(speakers.loc[index, 'district']))

# Cleaning Speakers

## The speakers table currently has two major errors: missing type and missing district values.

### Fixing District Values

In [67]:
# Run this cell to see which rows in the speakers table are missing district values.
# Only House representatives are checked, since senators are not assigned districts.
representatives = speakers[speakers['chamber'] == 'HOUSE']
representatives[representatives['district'].isnull()]

Unnamed: 0,speaker_id,first_name,last_name,chamber,type,party,state,district,bio_guide_id,congress_id
25,2350.0,Jodey,ARRINGTON,HOUSE,REPRESENTATIVE,R,TX,,A000375,115.0
34,2337.0,Don,BACON,HOUSE,REPRESENTATIVE,R,NE,,B001298,115.0
40,1734.0,Frank,BALLANCE,HOUSE,REPRESENTATIVE,D,NC,,B001238,108.0
42,2326.0,Jim,BANKS,HOUSE,REPRESENTATIVE,R,IN,,B001299,115.0
44,51.0,Peter,BARCA,HOUSE,REPRESENTATIVE,D,WI,,B001226,103.0
47,53.0,Tom,BARLOW,HOUSE,REPRESENTATIVE,D,KY,,B000151,103.0
50,2311.0,Nanette,BARRAGAN,HOUSE,REPRESENTATIVE,D,CA,,B001300,115.0
68,72.0,Anthony,BEILENSON,HOUSE,REPRESENTATIVE,D,CA,,B000318,103.0
74,78.0,Helen,BENTLEY,HOUSE,REPRESENTATIVE,R,MD,,B000392,103.0
79,2333.0,Jack,BERGMAN,HOUSE,REPRESENTATIVE,R,MI,,B001301,115.0


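To fix the district values in the database, you will use the "fix_district" function. This function takes two inputs, index and district: the row index of the speaker in the speakers table and the district value to assign to that speaker.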

In [66]:
fix_district(21, 33)

APPLEGATE's district is now: 33.0
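
The output reads 33.0 rather than 33 because pandas stores a numeric column that contains missing values as floats. The sketch below reproduces the same behavior on a toy table and then re-runs the missing-district check to confirm the fix. The table contents and the helper `fix_district_toy` are made up for illustration. Unlike `fix_district`, the helper takes the table as an argument and returns its message instead of printing it.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the speakers table (hypothetical names and values).
toy_speakers = pd.DataFrame(
    {
        "last_name": ["ALPHA", "BETA", "GAMMA"],
        "chamber": ["HOUSE", "HOUSE", "SENATE"],
        "district": [np.nan, 4.0, np.nan],
    }
)


def fix_district_toy(table, index, district):
    """Set the district at row label `index` and return a confirmation message."""
    table.loc[index, "district"] = district
    return f"{table.loc[index, 'last_name']}'s district is now: {table.loc[index, 'district']}"


message = fix_district_toy(toy_speakers, 0, 12)
print(message)

# Re-run the missing-district check, restricted to House members.
reps = toy_speakers[toy_speakers["chamber"] == "HOUSE"]
still_missing = reps[reps["district"].isnull()]
print(len(still_missing))
```

Re-running the check after each batch of fixes is a cheap way to see how many representatives still need a district.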
