# Integrity constraints with Python and SQL

In [1]:
import sqlite3
import csv
import pandas as pd

First, we load both CSV files and write them to SQLite tables:

In [4]:
# Reading files
con = sqlite3.connect("salary_responses.db", timeout = 10)
df = pd.read_csv('Data/salary_responses.csv')
df_clean = pd.read_csv('Data/data_full.csv')
df.columns = df.columns.str.replace(' ','_')
df.columns = df.columns.str.strip()
df_clean.columns = df_clean.columns.str.strip()

# Creating tables
df.to_sql("SalaryResponsesDirty", con)
df_clean.to_sql("SalaryResponsesClean", con)
con.close()

Next, we create a function that returns the column names of a table, so we can compare the original "dirty" file with the clean one:

In [19]:
def get_col_names(table):
    con = sqlite3.connect("salary_responses.db")
    cur = con.cursor()
    query = 'SELECT * FROM {}'.format(table)
    cur.execute(query)
    # Read the column metadata before closing the cursor
    names = [member[0] for member in cur.description]
    cur.close()
    con.close()
    return names
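
As an alternative to running a `SELECT *`, SQLite's `PRAGMA table_info` returns one row per column. The following is a minimal sketch rather than the method used in this notebook; the helper name `get_col_names_pragma` and the in-memory demo table are assumptions made for illustration.

```python
import sqlite3

def get_col_names_pragma(con, table):
    """Return the column names of `table` using PRAGMA table_info."""
    # PRAGMA statements cannot take bound parameters, so the table name is
    # quoted as an identifier (embedded double quotes are doubled).
    quoted = '"' + table.replace('"', '""') + '"'
    rows = con.execute("PRAGMA table_info({})".format(quoted)).fetchall()
    # Each row is (cid, name, type, notnull, dflt_value, pk); keep the name.
    return [row[1] for row in rows]

# Demo on a throwaway in-memory database (hypothetical table and columns).
demo_con = sqlite3.connect(":memory:")
demo_con.execute('CREATE TABLE Demo ("index" INTEGER, timestamp TEXT, age TEXT)')
demo_cols = get_col_names_pragma(demo_con, "Demo")
demo_con.close()
print(demo_cols)
```

Passing the connection in, instead of opening one inside the helper, makes the function easy to try against an in-memory database.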

## Column Names

In [7]:
dirty_names = get_col_names('SalaryResponsesDirty')
clean_names = get_col_names('SalaryResponsesClean')

As we can see below, the column names for the original data are extremely long and contain special characters:

**Dirty File:**

In [8]:
print(*dirty_names, sep = "\n")

index
Timestamp
How_old_are_you?
What_industry_do_you_work_in?
Job_title
If_your_job_title_needs_additional_context,_please_clarify_here:
What_is_your_annual_salary?_(You'll_indicate_the_currency_in_a_later_question._If_you_are_part-time_or_hourly,_please_enter_an_annualized_equivalent_--_what_you_would_earn_if_you_worked_the_job_40_hours_a_week,_52_weeks_a_year.)
How_much_additional_monetary_compensation_do_you_get,_if_any_(for_example,_bonuses_or_overtime_in_an_average_year)?_Please_only_include_monetary_compensation_here,_not_the_value_of_benefits.
Please_indicate_the_currency
If_"Other,"_please_indicate_the_currency_here:_
If_your_income_needs_additional_context,_please_provide_it_here:
What_country_do_you_work_in?
If_you're_in_the_U.S.,_what_state_do_you_work_in?
What_city_do_you_work_in?
How_many_years_of_professional_work_experience_do_you_have_overall?
How_many_years_of_professional_work_experience_do_you_have_in_your_field?
What_is_your_highest_level_of_education_completed?
Wh

**Clean File:**

In [9]:
print(*clean_names, sep = "\n")

level_0
timestamp
age
industry
job_title
annual_salary
additional_salary
currency
country
state
city
total_experience
current_experience
education
gender
race
age_min
age_max
total_experience_min
total_experience_max
current_experience_min
current_experience_max
education_lvl
gender_idx
continent
lat_long
lat
long
USD_rate
annual_salary_USD
additional_salary_USD
total_salary_USD
race_idx
index
job_title_clean
industry_clean
total_salary


Because the dirty file's column names are very difficult to work with, we rename them to short names matching the clean file and run the remaining integrity constraint checks on the renamed table:

In [10]:
# One short name per survey column, in the order printed above. The DataFrame
# itself has no 'index' column (to_sql adds it when writing the table), and
# 'income_context' is our label for the income-context question.
new_df = df.set_axis(['timestamp', 'age', 'industry', 'job_title', 'job_context', 'annual_salary',
                      'additional_salary', 'currency', 'currency_context', 'income_context', 'country',
                      'state', 'city', 'total_experience', 'current_experience', 'education',
                      'gender', 'race'], axis = 1)
con = sqlite3.connect("salary_responses.db")
new_df.to_sql("SalaryDirtyUpdated", con)
con.close()

## Education

Next, we get the unique values in the Education column:

In [11]:
def unique_values(col, table):
    con = sqlite3.connect("salary_responses.db")
    cur = con.cursor()
    # Column and table names cannot be bound as parameters, so they are interpolated
    query = 'SELECT DISTINCT {} FROM {}'.format(col, table)
    cur.execute(query)
    res = cur.fetchall()
    cur.close()
    con.close()
    return [i[0] for i in res]

**Dirty File:**

In [12]:
print(unique_values('education', 'SalaryDirtyUpdated'))

["Master's degree", 'College degree', 'PhD', None, 'Some college', 'High School', 'Professional degree (MD, JD, etc.)']


**Clean File:**

In [13]:
print(unique_values('education', 'SalaryResponsesClean'))

["Master's degree", 'College degree', 'PhD', None, 'Some college', 'High School', 'Professional degree']
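
The comparison above can also be phrased as a domain constraint: every non-NULL value must belong to an allowed set. The sketch below assumes the allowed set is the list printed for the clean file; the helper `violations` and the in-memory demo table are hypothetical and only illustrate the idea.

```python
import sqlite3

# Allowed education levels, copied from the clean file's output above.
ALLOWED_EDUCATION = [
    "Master's degree", "College degree", "PhD",
    "Some college", "High School", "Professional degree",
]

def violations(con, table, col, allowed):
    """Return distinct non-NULL values of `col` that are not in `allowed`."""
    placeholders = ",".join("?" for _ in allowed)
    # Identifiers cannot be bound as parameters, so they are interpolated;
    # only pass trusted table and column names.
    query = (
        "SELECT DISTINCT {c} FROM {t} "
        "WHERE {c} IS NOT NULL AND {c} NOT IN ({p})"
    ).format(c=col, t=table, p=placeholders)
    return [row[0] for row in con.execute(query, allowed)]

# Demo on a throwaway in-memory table (hypothetical data).
demo_con = sqlite3.connect(":memory:")
demo_con.execute("CREATE TABLE Demo (education TEXT)")
demo_con.executemany(
    "INSERT INTO Demo VALUES (?)",
    [("PhD",), ("Professional degree (MD, JD, etc.)",), (None,), ("PhD",)],
)
bad_values = violations(demo_con, "Demo", "education", ALLOWED_EDUCATION)
demo_con.close()
print(bad_values)
```

An empty result means the column satisfies the constraint; here the unshortened `Professional degree (MD, JD, etc.)` label is reported.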


## Race

For race, the clean file encodes each response as a comma-separated list of category indices (`race_idx`) instead of free-text labels, which makes the values easier to analyze:

**Dirty File:**

In [14]:
print(*unique_values('race', 'SalaryDirtyUpdated'), sep = "\n")

White
Hispanic, Latino, or Spanish origin, White
Asian or Asian American, White
Asian or Asian American
Another option not listed here or prefer not to answer
Hispanic, Latino, or Spanish origin
Middle Eastern or Northern African
Hispanic, Latino, or Spanish origin, Middle Eastern or Northern African, White
Black or African American
Black or African American, White
None
Black or African American, Hispanic, Latino, or Spanish origin, White
Native American or Alaska Native
Native American or Alaska Native, White
Hispanic, Latino, or Spanish origin, Another option not listed here or prefer not to answer
Black or African American, Middle Eastern or Northern African, Native American or Alaska Native, White
White, Another option not listed here or prefer not to answer
Black or African American, Native American or Alaska Native, White
Asian or Asian American, Another option not listed here or prefer not to answer
Middle Eastern or Northern African, White
Asian or Asian American, Black or Afri

**Clean File:**

In [22]:
print(*unique_values('race_idx', 'SalaryResponsesClean'), sep = "\n")

7
4,7
1,7
1
8
4
5
4,5,7
2
2,7
None
2,4,7
6
6,7
4,8
2,5,6,7
7,8
2,6,7
1,8
5,7
1,2,7
2,4
1,2
1,4,7
6,7,8
1,4
1,6,7
4,6
2,5,7
2,4,6,7
2,8
6,8
1,7,8
1,5
1,4,6,7
4,5
4,6,7
5,7,8
4,7,8
1,2,4
1,2,6,7
5,6,7
1,5,7
2,5
4,6,8
1,6
5,6
1,4,8
1,4,7,8
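
One way such an encoding could be produced is sketched below. The label-to-index mapping is an assumption inferred by lining up the two outputs above in order, not a mapping confirmed by the source, and `encode_race` is a hypothetical helper. Index 3 does not appear in the outputs shown, so it is left out.

```python
# Label-to-index mapping inferred from the two outputs above (an assumption).
RACE_INDEX = {
    "Asian or Asian American": 1,
    "Black or African American": 2,
    "Hispanic, Latino, or Spanish origin": 4,
    "Middle Eastern or Northern African": 5,
    "Native American or Alaska Native": 6,
    "White": 7,
    "Another option not listed here or prefer not to answer": 8,
}

def encode_race(response):
    """Encode a multi-select response as a comma-separated string of indices."""
    if response is None:
        return None
    # Some labels contain commas, so splitting on ", " would break them apart;
    # test for each known label as a substring instead.
    found = sorted(idx for label, idx in RACE_INDEX.items() if label in response)
    return ",".join(str(idx) for idx in found)

encoded = [encode_race(r) for r in [
    "White",
    "Hispanic, Latino, or Spanish origin, White",
    "Black or African American, Middle Eastern or Northern African, "
    "Native American or Alaska Native, White",
    None,
]]
print(encoded)
```

The substring test works here because none of the labels is contained in another; a label set with overlapping names would need a stricter parser.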


## Country

In [15]:
def country_ic(col, table):
    con = sqlite3.connect("salary_responses.db")
    cur = con.cursor()
    # The search pattern is bound as a parameter; identifiers are interpolated
    query = "SELECT DISTINCT {} FROM {} WHERE {} LIKE ?".format(col, table, col)
    cur.execute(query,('%United States%',))
    res = cur.fetchall()
    cur.close()
    con.close()
    return [i[0] for i in res]

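We can test the integrity constraint (IC) for country by checking the unique values entered for the `United States` as an example. SQLite's `LIKE` is case-insensitive for ASCII characters by default, so variants such as `united states` also match: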

**Dirty File:**

In [16]:
print(*country_ic('country', 'SalaryDirtyUpdated'), sep = '\n')

United States
United States 
United States of America
United states
United states 
united states
United States of America 
united states of america
The United States
UNITED STATES
united States
United States of American 
United States (I work from home and my clients are all over the US/Canada/PR
United Statesp
UNited States
United States of Americas
United States of america 
united states 
United states of America 
For the United States government, but posted overseas
United States of america
 United States
United States Of America
United states of America
United STates
United States- Puerto Rico
United states of america
United States is America
United States of American


**Clean File:**

In [17]:
print(*country_ic('country', 'SalaryResponsesClean'), sep = '\n')

United States
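
The same check can be generalized beyond a single country by grouping values on a normalized key and flagging keys that have more than one raw spelling. This is a minimal sketch under the assumption that trimming whitespace and lower-casing is an adequate key; it will not catch typos such as `United Statesp`. The helper `spelling_variants` and the demo table are hypothetical.

```python
import sqlite3

def spelling_variants(con, table, col):
    """Return (normalized_key, n_spellings) for keys with several raw spellings."""
    # Identifiers are interpolated, so only pass trusted table and column names.
    query = (
        "SELECT LOWER(TRIM({c})), COUNT(DISTINCT {c}) "
        "FROM {t} WHERE {c} IS NOT NULL "
        "GROUP BY LOWER(TRIM({c})) "
        "HAVING COUNT(DISTINCT {c}) > 1 "
        "ORDER BY LOWER(TRIM({c}))"
    ).format(c=col, t=table)
    return con.execute(query).fetchall()

# Demo on a throwaway in-memory table (hypothetical data).
demo_con = sqlite3.connect(":memory:")
demo_con.execute("CREATE TABLE Demo (country TEXT)")
demo_con.executemany(
    "INSERT INTO Demo VALUES (?)",
    [("United States",), ("United States ",), ("united states",),
     ("UNITED STATES",), ("Canada",), (None,)],
)
variants = spelling_variants(demo_con, "Demo", "country")
demo_con.close()
print(variants)
```

On a table that satisfies the constraint, such as the clean file's `country` column is expected to, the function would return an empty list.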
