Gearoid Lacey C00183380
Data quality issues and improvements
#NOTE: The following code assumes the 'Food_Inspections.csv' file is in the same directory as the notebook
The first data quality issue I know exists is different spellings of the same DBA name. To attempt to counteract this I intend to change all DBA names to lowercase and remove any apostrophes, commas full stops and hypens. I will also remove leading and trailing whitespace. I also perform some amendments to the variations of the names mcdonalds, subway, dunkin donuts, 7 eleven and kfc.


Another issue I intend to fix is location data being stored as seperate values and also in a combined string in the location column. From my initial inspections the location data in the location column appears to be more accurate so removing the seperate location rows would appear to be the better option. I initially populate the location column if its empty by creating a tuple and storing the values from the latitude and longitude columns in it. There are also some outliers regarding the location, longitude and latitude rows. There were rows without any location latitude or longitude data but I still keep these rows as theres also an address and zip code which could be used to locate a premises. Note there where no rows missing an address value.


Empty cells are populated with the value 'null'
I also noticed that some rows are missing information, therefore I chose to populate these rows with the value null as they are predominatly text based. If they were numerically based then you could potentially fill the missing values with a zero to keep the rows.

In rows that are missing an AKA Name I will copy in the DBA Name. Or in rows that are missing a DBA Name I copy in the AKA Name. Although the AKA Name is not the legal name of the business copying the AKA Name into the DBA Name may still be useful as they are often quite similar.

As the data is based on premises's in Chicago Illinois, the state column was dropped.


I noticed that some city values contained misspellings or were populated with names of other cities in Illinois. To counteract this I look for the substring 'chicago' in every city cell. When a city does not contain the chicago substring, I remove the original value and insert 'non-chicago address, attention required' in its place meaning the address or the location column should be used to determine which city the facility is in. Also one city value used was "chcicago". As looking for the substring "chicago" would not work in this case I look specifically for this value also and change it to "chicago".


Also the dates were in the format mm/dd/yyyy. To adjust this I split the date on the occurence of the "/" and reconstruct the date so it is in the format of dd/mm/yyyy. I also allow for dates that contain hyphens instead of forward slashes, if they occur I change them to forward slashes.


When working with the Inspection type column I noticed numerous faults. One of the values present in this column was "two people ate and got sick". To me this is not categorical, therefore I replace that string with "suspected food poisoning" which is another category within this column.


Also there were numerous different types of canvass inspections most of which had different spellings of the word canvass. Therefore I ammended each of these spelling mistakes by using the replace function and explicitly stating the error and the replacement value. Another error in the Inspection Type column is when the user performing the inspection appears to leave reminders in the inspection type value e.g 'finish complaint inspection from 5 18 10'. As this Inspection type is in relation to a a complaint I will change the value in the cell to 'complaint'.


I also noticed errors with duplicate license numbers where the license number is 0. As this license number was being assigned to numerous different premises which is incorrect according to the dataset description linked below. I will replace any row with a license number 0 to 'unknown'


Regarding duplicates in the csv file, if you open the data in excel and highlight every column except the inspection id column and then press the remove duplicates function in the data tab, it says there are appoximately 87 duplicates. As the inspection ID was different for these, then technically the rows are not duplicated and hence I did not remove them.
Dataset description: https://data.cityofchicago.org/api/assets/BAD5301B-681A-4202-9D25-51B2CAE672FF

In [55]:
import pandas as pd
import numpy as np


def dba_names(data):
    dba_names = data['DBA Name']
    print('Number of unique names before: ', len({items for items in dba_names}))
    
    column_names = ['DBA Name', 'AKA Name']
    for column in column_names:
        data[column] = data[column].str.lower()
        data[column] = data[column].str.strip(' ')
        data[column] = data[column].str.replace(',', '')
        data[column] = data[column].str.replace('.', '')
        data[column] = data[column].str.replace('-', '')
        data[column] = data[column].str.replace('/', '')
    
    new_dba_names = data['DBA Name']
    print('Number of unique names after: ', len({items for items in new_dba_names}))
    return data
    

def ammend_location_data(data):
    def fill_location(row):
        if pd.isnull(row['Location']):
            new_location = str((row['Latitude'],  row['Longitude']))
            return new_location
        
    data['Location'] = data.apply(fill_location, axis=1) # apply change date function to every row
    
    return data


def missing_names(data):
    def fill_DBA(row):
        if pd.isnull(row['DBA Name']):
            return row['AKA Name']
    def fill_AKA(row):
        if pd.isnull(row['AKA Name']):
            return row['DBA Name']
    
    data['DBA Name'] = data.apply(fill_DBA, axis=1) 
    data['AKA Name'] = data.apply(fill_AKA, axis=1) 

    return data


def remove_columns(data):
    # axis 1 means its applied to each row whereas axis 0 means its applied to each column
    data = data.drop(['State'], axis=1) 
    data = data.drop(['Latitude'], axis=1)
    data = data.drop(['Longitude'], axis=1)
    
    return data


def ammend_city(data):
    def fill_city(row):
        if 'chicago' in str(row['City']).lower() and str(row['City']).lower() != 'chicago':
            city = 'chicago' 
            return city
        
        if 'chicago' not in str(row['City']).lower():
            city = 'non-chicago address, attention required'
            return city
    data['City'] = data.apply(fill_city, axis=1)
    return data
    
    # THE COMMENTED CODE BELOW TAKES APPROX. 12 SECONDS
    '''for index, rows in data.iterrows():
        if 'chicago' in str(data.loc[index, 'City']).lower() and str(data.loc[index, 'City']).lower() != 'chicago':
            data.loc[index, 'City'] = 'chicago' 
        if 'chicago' not in str(data.loc[index, 'City']).lower():
            data.loc[index, 'City'] = 'non-chicago address, attention required'
    return data'''


def ammend_dates(data):
    def change_date(row):
        if pd.notnull(row['Inspection Date']):
            old_date = str(row['Inspection Date']).split('/')
            new_date = old_date[1] + '/' + old_date[0] + '/' + old_date[2]
            return new_date
            
    data['Inspection Date'] = data.apply(change_date, axis=1) # apply change date function to every row
    
    return data
 

def ammend_inspections(data):
    def change_inspection(row):
        if str(row['Inspection Type']).lower() == 'two people ate and got sick.':
            inspection_desc = 'Suspected food poisoning'
            return inspection_desc
        elif 'canv' in str(str(row['Inspection Type']).lower()):
            inspection_desc = 'Canvass'
            return inspection_desc
        elif 'fire' in str(str(row['Inspection Type']).lower()):
            inspection_desc = 'Fire Complaint'
            return inspection_desc
        elif 'out' in str(str(row['Inspection Type']).lower()) and 'business' in str(str(row['Inspection Type']).lower()):
            inspection_desc = 'Out of business'
            return inspection_desc
        elif 'finish complaint inspection from 5 18 10' in str(str(row['Inspection Type']).lower()):
            inspection_desc = 'Complaint'
            return inspection_desc
            
            
    data['Inspection Type'] = data.apply(change_inspection, axis=1) # apply change date function to every row
    return data
 
    
def ammend_license_num(data):
    def change_lisence(row):
        if pd.notnull(row['License #']):
            if row['License #'] == 0:
                unknown = 'Unknown'
                return unknown
            
    data['License #'] = data.apply(change_lisence, axis=1) # apply change date function to every row
    
    return data  
            
                    
#  need to supply parse_dates function otherwise dates appear as none as it only accepts strings, ints and floats when
#  reading csv files, noticeably has a big impact on execution time (approx. 18 second increase)
data = pd.read_csv('Food_Inspections.csv', infer_datetime_format=True) 
data = dba_names(data)
data = ammend_location_data(data)
data = missing_names(data)
data = remove_columns(data)
data = ammend_city(data)
data = ammend_dates(data)
data = ammend_inspections(data)
data = ammend_license_num(data)
data.to_csv('Output.csv')


CPU times: user 2.01 s, sys: 180 ms, total: 2.19 s
Wall time: 2.21 s
Number of unique names before:  24895
Number of unique names after:  24490
CPU times: user 803 ms, sys: 19.1 ms, total: 822 ms
Wall time: 819 ms
CPU times: user 2.3 s, sys: 47.9 ms, total: 2.34 s
Wall time: 2.34 s
CPU times: user 4.31 s, sys: 60.3 ms, total: 4.37 s
Wall time: 4.36 s
CPU times: user 61.6 ms, sys: 16.8 ms, total: 78.4 ms
Wall time: 77.2 ms
CPU times: user 4.36 s, sys: 19.3 ms, total: 4.38 s
Wall time: 4.38 s
CPU times: user 3.49 s, sys: 14.1 ms, total: 3.5 s
Wall time: 3.5 s
CPU times: user 4.55 s, sys: 11.3 ms, total: 4.56 s
Wall time: 4.56 s
CPU times: user 3.21 s, sys: 11.2 ms, total: 3.22 s
Wall time: 3.22 s
CPU times: user 4.07 s, sys: 92.1 ms, total: 4.16 s
Wall time: 4.22 s
CPU times: user 29.2 s, sys: 475 ms, total: 29.6 s
Wall time: 29.7 s
