Created by: [SmirkyGraphs](https://smirkygraphs.github.io/). Code: [GitHub](https://github.com/SmirkyGraphs/Python-Notebooks). Source: [BOE](http://www.elections.ri.gov/fines/).
<hr>

# Cleaning Campaign Finance Fines

This notebook uses tabula-py to read the tables from each PDF file and merges them into one dataset. The data is then cleaned with pandas: the month is added from the filename, the total is converted to a numeric format, and some candidate names are normalized. Lastly, dates are converted from Month_Year strings to datetimes and the data is saved.

In [1]:
import pandas as pd
import glob
from tabula import read_pdf
from datetime import datetime

In [2]:
import os

files = glob.glob('./data/raw/*.pdf')

# February_2018 and June_2019 cause errors when parsed, so they are added manually later
# compare on the file name only, so the filter works with both '/' and '\\' separators
remove = ['February_2018.pdf', 'June_2019.pdf']
files = [x for x in files if os.path.basename(x) not in remove]

In [3]:
# read in all files and create a dataframe

frames = []
for file in files:
    # read pdf into dataframe
    df = read_pdf(file, pages="all")
    
    # skip total row (copy so the new column is added to an independent frame)
    df = df[:-1].copy()
    
    # add filename
    df['date'] = file[11:-4]
    
    # add file to frames
    frames.append(df)

df = pd.concat(frames, sort=True)
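
The loop above assumes that `read_pdf` returns a single DataFrame and that every path has an 11-character prefix. Depending on the installed tabula-py version, `read_pdf` may instead return a list of DataFrames (one per table). The sketch below shows one way to make both steps more tolerant; `month_label` and `as_single_frame` are hypothetical helpers and are not part of the original notebook or of tabula-py.

```python
import pandas as pd
from pathlib import PureWindowsPath


def month_label(path):
    """Return the file name without directory or extension, e.g. 'March_2018'."""
    # PureWindowsPath treats both '/' and '\\' as separators on any OS
    return PureWindowsPath(path).stem


def as_single_frame(result):
    """Normalize a read_pdf result (one DataFrame or a list of them) into one DataFrame."""
    if isinstance(result, pd.DataFrame):
        return result
    # assumes the list is non-empty; pd.concat raises on an empty list
    return pd.concat(list(result), ignore_index=True, sort=True)
```

Inside the loop these would be used as `df = as_single_frame(read_pdf(file, pages="all"))` and `df['date'] = month_label(file)`.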

In [4]:
# add 3 manually collected months
df2 = pd.read_csv('./data/files/February_2018.csv')
df3 = pd.read_csv('./data/files/June_2019.csv')
df4 = pd.read_csv('./data/files/September_2019.csv')

# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
df = pd.concat([df, df2, df3, df4], sort=True)

In [5]:
# spacing in some of the PDFs shifts values into the wrong columns, so fill the gaps from those columns
df['Name'] = df['Name'].fillna(df['October 2011 Aging'])
df['Total'] = df['Total'].fillna(df['Unnamed: 1'])
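
To illustrate what the two `fillna` calls do, here is a minimal sketch on made-up rows. The names and amounts are invented for the example; only the column labels come from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: the second one mimics a PDF whose values landed in the wrong columns
demo = pd.DataFrame({
    'Name': ['JANE DOE', np.nan],
    'Total': ['$25.00', np.nan],
    'October 2011 Aging': [np.nan, 'JOHN ROE'],
    'Unnamed: 1': [np.nan, '$50.00'],
})

# fillna with a Series fills each missing value from the same row of that Series
demo['Name'] = demo['Name'].fillna(demo['October 2011 Aging'])
demo['Total'] = demo['Total'].fillna(demo['Unnamed: 1'])
```

Rows that already have a `Name` and `Total` are left unchanged; only the missing cells are filled.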

In [7]:
# remove error row
df = df[df['Name'] != 'Name']

# remove nulls (row spaces)
df = df[df['Name'].notnull()]

# convert to uppercase
df['Name'] = df['Name'].str.upper()

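# remove periods and extra spaces (regex=False so '.' is treated as a literal period)
df['Name'] = df['Name'].str.replace('.', '', regex=False)
df['Name'] = df['Name'].str.strip()

# strip letters and symbols (e.g. $ and commas) from the totals
# regex=True is required in pandas 2.0+, where str.replace defaults to literal matching
df['Total'] = df['Total'].str.replace(r"[a-zA-Z]", '', regex=True)
df['Total'] = df['Total'].str.replace(r"[!@#$%^&*(),]", '', regex=True)
df['Total'] = df['Total'].str.strip()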

# convert to float
df['Total'] = df['Total'].astype(float)
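
The total-cleaning steps can also be bundled into one function, which makes them easy to check on a few sample strings. This is a sketch, not the notebook's code: `clean_total` is a hypothetical helper, and it swaps `astype(float)` for `pd.to_numeric(..., errors="coerce")`, so unparseable values become NaN instead of raising.

```python
import pandas as pd


def clean_total(series):
    """Strip letters, currency symbols, and commas, then convert to numbers."""
    cleaned = (
        series.astype(str)
        .str.replace(r"[a-zA-Z]", "", regex=True)       # drop stray letters
        .str.replace(r"[!@#$%^&*(),]", "", regex=True)  # drop $ , and similar symbols
        .str.strip()
    )
    # assumption: coercing bad values to NaN is acceptable for this data
    return pd.to_numeric(cleaned, errors="coerce")


sample = pd.Series(['$1,250.00', ' 75.50 ', '$100.00 total'])
totals = clean_total(sample)
```

Because parentheses are stripped as well, amounts written in accounting style such as `(50.00)` would come out positive.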

In [8]:
# fix naming errors
df.loc[df['Name'].str.contains('CCRI PSA PAC'), 'Name'] = 'CCRI PSA PAC'
df.loc[df['Name'].str.contains('IBEW LOCAL 2323 PAC'), 'Name'] = 'IBEW LOCAL 2323 PAC'
df.loc[df['Name'].str.contains('POLICE OFFICERS LOCAL'), 'Name'] = 'INTERNATIONAL BROTHERHOOD OF POLICE OFFICERS LOCAL 301'
df.loc[df['Name'].str.contains('POLICE & FIREFIGHTERS ASSOCIATIO'), 'Name'] = 'PROVIDENCE RETIRED POLICE & FIREFIGHTERS ASSOCIATION'
df.loc[df['Name'].str.contains('RI ASSOCIATION FOR JUSTICE PAC'), 'Name'] = 'RI ASSOCIATION FOR JUSTICE PAC'
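
If the list of naming fixes grows, the repeated `df.loc` lines could be driven by a mapping instead. The sketch below uses two of the fragments from the cell above; `CANONICAL` and `normalize_names` are hypothetical names, and the sample values are made up.

```python
import pandas as pd

# substring to search for -> canonical name to assign
CANONICAL = {
    'CCRI PSA PAC': 'CCRI PSA PAC',
    'POLICE OFFICERS LOCAL': 'INTERNATIONAL BROTHERHOOD OF POLICE OFFICERS LOCAL 301',
}


def normalize_names(names, mapping=CANONICAL):
    """Replace every name containing a known fragment with its canonical form."""
    names = names.copy()
    for fragment, canonical in mapping.items():
        # regex=False matches the fragment literally; na=False skips missing names
        mask = names.str.contains(fragment, regex=False, na=False)
        names.loc[mask] = canonical
    return names


sample = pd.Series([
    'CCRI PSA PAC (INACTIVE)',
    'INTL BROTHERHOOD OF POLICE OFFICERS LOCAL 301',
    'JANE DOE',
])
fixed = normalize_names(sample)
```

In the notebook this would be applied as `df['Name'] = normalize_names(df['Name'])`.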

In [9]:
# keep only wanted columns (copy so later column assignments do not warn)
cols = ['Name', 'Total', 'date']
df = df[cols].copy()

In [10]:
# convert date (month_year)
df['date'] = df['date'].apply(lambda x: datetime.strptime(x, '%B_%Y'))
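
As an alternative to applying `strptime` row by row, `pd.to_datetime` accepts the same format string and converts the whole column at once. A small sketch on the three manually collected month labels, assuming an English locale for the month names:

```python
import pandas as pd

labels = pd.Series(['February_2018', 'June_2019', 'September_2019'])

# '%B' is the full month name and '%Y' the four-digit year; the day defaults to 1
dates = pd.to_datetime(labels, format='%B_%Y')
```

Each label becomes a timestamp for the first day of that month, for example `2018-02-01` for `February_2018`.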

In [11]:
# save file
df.to_csv('./data/clean/ri_campaign_fines_clean.csv', index=False)