# Lab Assignment 8: Data Management Using `pandas`, Part 1
## DS 6001: Practice and Application of Data Science
## Name: Afnan Alabdulwahab

### Instructions
Please answer the following questions as completely as possible using text, code, and the results of code as needed. Format your answers in a Jupyter notebook. To receive full credit, make sure you address every part of the problem, and make sure your document is formatted in a clean and professional way.

In this lab, you will be working with the [2017 Workplace Health in America survey](https://www.cdc.gov/workplacehealthpromotion/survey/data.html), which was conducted by the Centers for Disease Control and Prevention. According to the survey's [guidance document](https://www.cdc.gov/workplacehealthpromotion/data-surveillance/docs/2017-WHA-Guidance-Document-for-Use-of-Public-Data-files-508.pdf):

> The Workplace Health in America (WHA) Survey gathered information from a cross-sectional, nationally representative sample of US worksites. The sample was drawn from the Dun & Bradstreet (D&B) database of all private and public employers in the United States with at least 10 employees. Like previous national surveys, the worksite served as the sampling unit rather than the companies or firms to which the worksites belonged. Worksites were selected using a stratified simple random sample (SRS) design, where the primary strata were ten multi-state regions defined by the Centers for Disease Control and Prevention (CDC), plus an additional stratum containing all hospital worksites. 

The data contain over 300 features that report the industry and type of company where the respondents are employed, what kind of health insurance and other health programs are offered, and other characteristics of the workplaces including whether employees are allowed to work from home and the gender and age makeup of the workforce. The data are full of interesting information, but in order to make use of the data a great deal of data manipulation is required first.

## Problem 0
Import the following libraries:

In [10]:
import numpy as np
import pandas as pd
import sidetable
import sqlite3
import warnings
warnings.filterwarnings('ignore')

## Problem 1
The raw data are stored in an ASCII file on the 2017 Workplace Health in America survey [homepage](https://www.cdc.gov/workplacehealthpromotion/survey/data.html). Load the raw data directly into Python without downloading the data onto your hard drive and display a dataframe with only the 14th, 28th, and 102nd rows of the data. [1 point]

Setting pandas option to display all columns:

In [114]:
pd.set_option('display.max_columns', None) 

First, I define the URL of the ASCII file. Then I load the data using `pd.read_csv()`, passing the delimiter `'~'` to the `sep` argument, because the website describes this file as an ASCII “~” delimited file:

In [117]:
url = "https://www.cdc.gov/workplacehealthpromotion/data-surveillance/docs/whpps_120717.csv"
df = pd.read_csv(url, sep="~")
df.iloc[[14, 28, 102],:]

Unnamed: 0,OC1,OC3,HI1,HI2,HI3,HI4,HRA1,HRA1A,HRA1B,HRA1E,CP1,CP2,CP3,CP4,CP5,HP1,HP2,HP3,HP4,HP5,HP5A,HP6,HP7A,HP7B,HP7C,HP7D,HP7D1,HP7D2,HP7D3,HP7E,HP7E1,HP7F,HP7F1,HP7F2,HP7F3,HP7F4,HP7F5,HP7F6,HP7F7,HP7F8,HP7F9,HP7F10,HP7F11,HP7G,HP8,HPR1_1,HPR1_1A,HPR1_1B,HPR1_1C,HPR1_2A,HPR1_2B,HPR1_2C,HPR1_2D,HPR1_2E,HPR1_2F,HPR1_2G,HPR1_2H,HPR1_2I,HPR1_2J,HPR1_2K,HPR2_1,HPR2_1A,HPR2_1B,HPR2_1C,HPR2_2A,HPR2_2B,HPR2_2C,HPR2_2D,HPR2_3A,HPR2_3B,HPR2_3C,HPR2_3D,HPR2_4A,HPR2_4B,HPR2_4C,HPR2_4D,HPR2_4E,HPR3_1,HPR3_1A,HPR3_1B,HPR3_1C,HPR3_2,HPR4_1,HPR4_1A,HPR4_1B,HPR4_1C,HPR4_2A,HPR4_2B,HPR4_2C,HPR4_2D,HPR4_2E,HPR4_2F,HPR4_2G,HPR4_2G1,HPR4_2G2,HPR4_2G3,HPR4_2G4,HPR4_2G5,HPR4_2G6,HPR4_2G7,HPR5_1,HPR5_1A,HPR5_1B,HPR6_1,HPR6_1A,HPR6_1B,HPR6_1C,HPR7_1,HPR7_1A,HPR7_1B,HPR7_1C,HPR8_1,HPR8_1A,HPR8_1B,HPR8_1C,HPR9_1,HPR9_1A,HPR9_1B,HPR9_1C,HS11,HS11_2,HS11A,HS12,HS12_2,HS13,HS13_2,HS14,HS14_2,HS15,HS15_2,HS16,HS16_2,HS17,HS17_2,HS18,HS18_2,HS19,HS19_2,HS2A,HS2B,HS3,DM11M1,DM11M2,DM11M3,DM12M1,DM12M2,DM12M3,DM13M1,DM13M2,DM13M3,DM14M1,DM14M2,DM14M3,DM15M1,DM15M2,DM15M3,DM16M1,DM16M2,DM16M3,DM17M1,DM17M2,DM17M3,DM18M1,DM18M2,DM18M3,DM19M1,DM19M2,DM19M3,DM20M1,DM20M2,DM20M3,DM2A,DM2B,DM3,KP2,KP3_1,KP3_2,KP3_3,KP4,KP5A,KP5C,KP5E,KP5F,KP5G,KP5H,KP5J,KP5J_01,WL1,WL2,WL3M1,WL3M2,WL3M3,WL3M4,WL3M5,WL5,WL6,WL7,WL8,WL9,WL11,WL12,WL14,WL15,B1_1,B1_2,B1_3,B1_4,B1_5,B1_6,B1_7,B1_8,B1_9,B1_10,B1_11,B1_12,OSH1,OSH2,OSH3,OSH4,OSH5,OSH6,OSH7_1,OSH7_2,OSH7_3,E1M1,E1M2,E1M3,E1M4,E1M5,E1M6,E1M7,E1M8,E1M9,E2,WD1_1,WD1_2,WD2,WD3,WD4,WD5,WD6,WD7,HPR5_2A_S,HPR5_2B_S,HPR5_2C_S,HPR5_2D_S,HPR5_2E_S,WL1_1_S,WL1_2_S,WL1_3_S,WL1_4_S,WL1_5_S,HPR9_2A_S,HPR9_2B_S,HPR9_2C_S,HPR9_2D_S,HPR9_2E_S,HPR9_2F_S,HPR8_2A_S,HPR8_2B_S,HPR8_2C_S,HPR8_2D_S,HPR8_2E_S,HPR7_2A_S,HPR7_2C_S,HPR7_2D_S,HPR7_2E_S,HPR7_2F_S,HPR6_2A_S,HPR6_2B_S,HPR6_2C_S,HPR6_2D_S,HPR6_2E_S,OSH7_1_S,OSH7_2_S,OSH7_4_S,OSH7_6_S,OSH8_S,OSH81A_S,OSH81B_S,OSH81C_S,KP1A_S,KP1B_S,KP1C_S,KP1D_S,KP1E_S,KP1F_S,KP1G_S,KP1G_S_01,HOSPITAL,OC1_07,WL3_05,E1_09,Suppquex,Id,Region
,CDC_Region,Industry,Size,Varstrata,"Finalwt_worksite,,,,"
14,7,2.0,2.0,1.0,2.0,1.0,1.0,3.0,2.0,2.0,1.0,1.0,97.0,1.0,1.0,1.0,3.0,3.0,1.0,3.0,3.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,97.0,1.0,97.0,1.0,1.0,1.0,1.0,1.0,97.0,1.0,1.0,1.0,1.0,1.0,1.0,97.0,3.0,1.0,3.0,3.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0,1.0,2.0,2.0,2.0,1.0,1.0,3.0,2.0,1.0,2.0,2.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0,1.0,1.0,2.0,1.0,1.0,3.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,2.0,1.0,1.0,2.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,4.0,1.0,1.0,1.0,4.0,1.0,1.0,3.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,9.0,2.0,9.0,2.0,9.0,2.0,9.0,2.0,9.0,3.0,2.0,1.0,1.0,2.0,2.0,1.0,2.0,2.0,1.0,2.0,2.0,1.0,1.0,1.0,1.0,2.0,2.0,1.0,2.0,2.0,1.0,2.0,2.0,1.0,2.0,2.0,1.0,2.0,2.0,1.0,2.0,2.0,4.0,1.0,2.0,1.0,1.0,1.0,1.0,3.0,2.0,2.0,2.0,2.0,2.0,1.0,1.0,,1.0,1.0,1.0,2.0,1.0,2.0,2.0,2.0,2.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0,3.0,3.0,2.0,97.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,3.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,1.0,2.0,2.0,1.0,1.0,2.0,8.0,40.0,25.0,75.0,90.0,45.0,1.0,0.0,5.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,,1.0,Infection Control,,,2.0,1539.0,2.0,4.0,7.0,5.0,0.0,"47.793940929,,,,"
28,1,3.0,2.0,3.0,1.0,1.0,2.0,96.0,96.0,96.0,1.0,1.0,2.0,1.0,2.0,1.0,97.0,1.0,97.0,97.0,96.0,2.0,1.0,1.0,2.0,2.0,96.0,96.0,96.0,2.0,96.0,2.0,96.0,96.0,96.0,96.0,96.0,96.0,96.0,96.0,96.0,96.0,96.0,97.0,97.0,1.0,3.0,1.0,97.0,2.0,2.0,1.0,2.0,1.0,1.0,1.0,2.0,2.0,2.0,2.0,1.0,3.0,1.0,97.0,1.0,2.0,2.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0,1.0,2.0,2.0,1.0,1.0,1.0,97.0,2.0,1.0,1.0,1.0,97.0,1.0,1.0,2.0,1.0,97.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,3.0,1.0,3.0,1.0,97.0,97.0,96.0,96.0,96.0,1.0,1.0,3.0,97.0,2.0,96.0,96.0,96.0,2.0,9.0,9.0,2.0,9.0,2.0,9.0,2.0,9.0,2.0,9.0,2.0,9.0,2.0,9.0,2.0,9.0,2.0,9.0,99.0,99.0,1.0,1.0,2.0,2.0,2.0,1.0,2.0,2.0,1.0,2.0,2.0,1.0,2.0,1.0,2.0,2.0,1.0,2.0,2.0,1.0,2.0,2.0,1.0,2.0,2.0,2.0,1.0,2.0,1.0,2.0,2.0,4.0,2.0,2.0,3.0,96.0,96.0,96.0,96.0,96.0,96.0,96.0,96.0,96.0,96.0,96.0,,2.0,1.0,1.0,1.0,1.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,97.0,2.0,4.0,2.0,4.0,1.0,97.0,2.0,97.0,97.0,97.0,97.0,97.0,3.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0,2.0,8.0,997.0,997.0,997.0,997.0,997.0,997.0,0.0,997.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,,1.0,,,,2.0,2755.0,3.0,5.0,7.0,6.0,0.0,"47.793940929,,,,"
102,1,3.0,2.0,3.0,1.0,1.0,1.0,1.0,4.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,5.0,1.0,1.0,4.0,3.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0,2.0,2.0,10.0,1.0,1.0,3.0,4.0,3.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,3.0,4.0,2.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,3.0,4.0,2.0,1.0,1.0,3.0,4.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,96.0,96.0,1.0,3.0,4.0,1.0,1.0,1.0,4.0,1.0,1.0,3.0,4.0,1.0,1.0,3.0,4.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,9.0,2.0,9.0,1.0,1.0,2.0,9.0,3.0,4.0,3.0,2.0,1.0,2.0,2.0,1.0,1.0,2.0,1.0,2.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,2.0,4.0,1.0,2.0,1.0,1.0,2.0,97.0,4.0,1.0,2.0,2.0,2.0,2.0,1.0,2.0,,2.0,1.0,1.0,1.0,1.0,2.0,2.0,1.0,1.0,1.0,1.0,97.0,1.0,1.0,97.0,2.0,2.0,2.0,3.0,2.0,3.0,2.0,5.0,2.0,2.0,4.0,3.0,5.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,2.0,2.0,1.0,1.0,2.0,8.0,997.0,997.0,75.0,997.0,997.0,25.0,0.0,997.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,95.0,,1.0,,,,2.0,12686.0,3.0,5.0,7.0,8.0,0.0,"47.793940929,,,,"
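
A note on the row selection above: `.iloc` counts positions from zero, so position 14 is the 15th row in ordinal terms. The sketch below uses a small made-up `~`-delimited string (an assumption for illustration, not the WHA file) to show the difference between selecting by position and selecting by ordinal row number. Which of the two the problem intends is left open here.

```python
import io
import pandas as pd

# Hypothetical stand-in for a "~"-delimited ASCII file; not the WHA data.
raw = "id~value\n" + "\n".join(f"{i}~{i * 10}" for i in range(1, 21))
toy = pd.read_csv(io.StringIO(raw), sep="~")

# Selecting positions 3 and 6 returns the 4th and 7th rows of the file.
by_position = toy.iloc[[3, 6]]

# To ask for the 3rd and 6th rows in ordinal terms, subtract 1 from each.
ordinal = [3, 6]
by_ordinal = toy.iloc[[n - 1 for n in ordinal]]
```
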


## Problem 2 
The data contain 301 columns. Create a new variable in Python's memory to store a working version of the data. In the working version, delete all of the columns except for the following:

* `Industry`: 7 Industry Categories with NAICS codes

* `Size`: 8 Employee Size Categories

* `OC3` Is your organization for profit, non-profit, government?

* `HI1` In general, do you offer full, partial or no payment of premiums for personal health insurance for full-time employees?

* `HI2` Over the past 12 months, were full-time employees asked to pay a larger proportion, smaller proportion or the same proportion of personal health insurance premiums?

* `HI3`: Does your organization offer personal health insurance for your part-time employees?

* `CP1`: Are there health education programs, which focus on skill development and lifestyle behavior change along with information dissemination and awareness building?

* `WL6`: Allow employees to work from home?

* Every column that begins `WD`, expressing the percentage of employees that have certain characteristics at the firm

[1 point]

The easiest way to reduce the dataframe to only these columns is to define a list of the column names, then index the dataframe with that list as follows:

In [118]:
mycols = ['Industry', 'Size', 'OC3', 'HI1', 'HI2', 'HI3', 'CP1', 'WL6']

Here, I am creating a list of column names from the dataframe, `df`, that start with the prefix "WD". I am using a list comprehension to iterate over the column names and check if each name starts with "WD". The resulting list, `wdcols`, contains only those column names that meet this condition:

In [119]:
wdcols = [x for x in df.columns if x.startswith("WD")]
wdcols

['WD1_1', 'WD1_2', 'WD2', 'WD3', 'WD4', 'WD5', 'WD6', 'WD7']

Concatenating `mycols` and `wdcols` gives the final list of columns, which I use to create the new working dataframe `df_clean`:

In [120]:
df_clean = df[mycols + wdcols].copy()
df_clean

Unnamed: 0,Industry,Size,OC3,HI1,HI2,HI3,CP1,WL6,WD1_1,WD1_2,WD2,WD3,WD4,WD5,WD6,WD7
0,7.0,7.0,3.0,2.0,1.0,2.0,1.0,1.0,25.0,20.0,85.0,60.0,40.0,15.0,0.0,22.0
1,7.0,6.0,3.0,2.0,3.0,1.0,1.0,1.0,997.0,997.0,90.0,90.0,997.0,997.0,0.0,997.0
2,7.0,8.0,3.0,1.0,3.0,1.0,1.0,1.0,35.0,4.0,997.0,997.0,40.0,15.0,997.0,997.0
3,7.0,4.0,2.0,1.0,2.0,1.0,2.0,2.0,50.0,15.0,50.0,85.0,75.0,0.0,0.0,997.0
4,7.0,4.0,3.0,1.0,3.0,1.0,1.0,1.0,50.0,40.0,60.0,60.0,40.0,30.0,0.0,28.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2838,6.0,5.0,4.0,1.0,3.0,1.0,1.0,99.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0
2839,6.0,5.0,4.0,2.0,3.0,1.0,1.0,2.0,997.0,997.0,997.0,997.0,997.0,997.0,997.0,997.0
2840,6.0,8.0,4.0,2.0,3.0,1.0,1.0,1.0,27.0,997.0,61.0,997.0,997.0,997.0,997.0,997.0
2841,6.0,8.0,4.0,2.0,3.0,1.0,2.0,99.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0
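
The prefix selection above can also be written with vectorized string methods or with `.filter()`. The following is a minimal sketch on a made-up toy frame (the values and the `OSH1` column are placeholders chosen for illustration), and the `.copy()` call is one way to keep the working version independent of the raw dataframe.

```python
import pandas as pd

# Toy frame with made-up values; only the column names mimic the survey.
toy = pd.DataFrame({"Industry": [7, 6], "Size": [7, 5],
                    "WD1_1": [25, 997], "WD2": [85, 90], "OSH1": [1, 2]})

keep = ["Industry", "Size"]

# Vectorized prefix test on the column index.
wd = toy.columns[toy.columns.str.startswith("WD")].tolist()

# .copy() makes the working version independent of the raw frame.
working = toy[keep + wd].copy()

# Same prefix selection using a regular expression anchored at the start.
wd_alt = toy.filter(regex=r"^WD").columns.tolist()
```
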


## Problem 3
The [codebook](https://www.cdc.gov/workplacehealthpromotion/data-surveillance/docs/2017-WHA-Datafile-Codebook-508.pdf) for the WHA data contains short descriptions of the meaning of each of the columns in the data. Use these descriptions to decide on better and more intuitive names for the columns in the working version of the data, and rename the columns accordingly. [1 point]

New column names:
* **Industry** -> `industry`
* **Size** -> `company_size`
* **OC3** -> `org_type`
* **HI1** -> `hi_coverage_ft` for health insurance coverage for full-time employees.
* **HI2** -> `hi_change_ft`
* **HI3** -> `hi_pt`
* **CP1** -> `health_edu_prog`
* **WL6** -> `wfh`

I based my naming for the following columns on the questions in the survey instrument: https://www.cdc.gov/workplacehealthpromotion/data-surveillance/docs/2017-WHA-Survey-Instrument-508.pdf
* **WD1_1** -> `employees_under_30`
* **WD1_2** -> `employees_60plus`
* **WD2** -> `female_employees`
* **WD3** -> `hourly_employees`
* **WD4** -> `non_daytime_shift_workers`
* **WD5** -> `remote_workers`
* **WD6** -> `unionized_employees`
* **WD7** -> `annual_turnover_percentage`

To rename the columns, I use the `.rename()` method on the dataframe, passing a dictionary that maps the old column names to the new names and `axis=1` so that the mapping applies to columns rather than rows.

In [121]:
df_clean = df_clean.rename({
    'Industry': 'industry',
    'Size': 'company_size',
    'OC3': 'org_type',
    'HI1': 'hi_coverage_ft',
    'HI2': 'hi_change_ft',
    'HI3': 'hi_pt',
    'CP1': 'health_edu_prog',
    'WL6': 'wfh',
    'WD1_1': 'employees_under_30',
    'WD1_2': 'employees_60plus', 
    'WD2': 'female_employees', 
    'WD3': 'hourly_employees', 
    'WD4': 'non_daytime_shift_workers', 
    'WD5': 'remote_workers', 
    'WD6': 'unionized_employees', 
    'WD7': 'annual_turnover_percentage'}, axis=1)
df_clean

Unnamed: 0,industry,company_size,org_type,hi_coverage_ft,hi_change_ft,hi_pt,health_edu_prog,wfh,employees_under_30,employees_60plus,female_employees,hourly_employees,non_daytime_shift_workers,remote_workers,unionized_employees,annual_turnover_percentage
0,7.0,7.0,3.0,2.0,1.0,2.0,1.0,1.0,25.0,20.0,85.0,60.0,40.0,15.0,0.0,22.0
1,7.0,6.0,3.0,2.0,3.0,1.0,1.0,1.0,997.0,997.0,90.0,90.0,997.0,997.0,0.0,997.0
2,7.0,8.0,3.0,1.0,3.0,1.0,1.0,1.0,35.0,4.0,997.0,997.0,40.0,15.0,997.0,997.0
3,7.0,4.0,2.0,1.0,2.0,1.0,2.0,2.0,50.0,15.0,50.0,85.0,75.0,0.0,0.0,997.0
4,7.0,4.0,3.0,1.0,3.0,1.0,1.0,1.0,50.0,40.0,60.0,60.0,40.0,30.0,0.0,28.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2838,6.0,5.0,4.0,1.0,3.0,1.0,1.0,99.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0
2839,6.0,5.0,4.0,2.0,3.0,1.0,1.0,2.0,997.0,997.0,997.0,997.0,997.0,997.0,997.0,997.0
2840,6.0,8.0,4.0,2.0,3.0,1.0,1.0,1.0,27.0,997.0,61.0,997.0,997.0,997.0,997.0,997.0
2841,6.0,8.0,4.0,2.0,3.0,1.0,2.0,99.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0
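
One caveat with `.rename()` is that, by default, a misspelled old name is silently ignored. The sketch below, on a made-up two-column frame, shows the equivalent `columns=` keyword together with `errors="raise"`, which turns such a typo into a `KeyError`. The misspelled key `OC_3` is a deliberate, hypothetical mistake.

```python
import pandas as pd

toy = pd.DataFrame({"OC3": [3, 2], "WL6": [1, 2]})

# columns=... is equivalent to passing the dict with axis=1.
# errors="raise" turns a misspelled old name into a KeyError
# instead of silently leaving the column unrenamed.
renamed = toy.rename(columns={"OC3": "org_type", "WL6": "wfh"}, errors="raise")

try:
    toy.rename(columns={"OC_3": "org_type"}, errors="raise")
    typo_caught = False
except KeyError:
    typo_caught = True
```
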


## Problem 4
Using the codebook and this [dictionary of NAICS industrial codes](https://www.naics.com/search-naics-codes-by-industry/), place descriptive labels on the categories of the industry column in the working data. [1 point]

I create a dictionary in which the keys are the category numbers that we want to recode and the values are the new labels that replace them. Then I use `.map()` on the `industry` column to apply the mapping defined by the dictionary `replace_map`.

In [123]:
replace_map ={1.0: 'Agriculture_Construction_Manufacturing',
              2.0: 'Wholesale_Retail_Transportation',
              3.0: 'Arts_Entertainment_Food_Services',
              4.0: 'Information_Finance_Professional_Services', 
              5.0: 'Education_Health_Social_Assistance', 
              6.0: 'Public_Administration',
              7.0: 'Hospital_Worksites'}
df_clean.industry = df_clean.industry.map(replace_map)
df_clean.industry

0          Hospital_Worksites
1          Hospital_Worksites
2          Hospital_Worksites
3          Hospital_Worksites
4          Hospital_Worksites
                ...          
2838    Public_Administration
2839    Public_Administration
2840    Public_Administration
2841    Public_Administration
2842    Public_Administration
Name: industry, Length: 2843, dtype: object
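
It may help to see how `.map()` treats values that are absent from the dictionary, since that behavior differs from `.replace()`. In this sketch the code `9.0` is a made-up value with no label: `.map()` turns it into `NaN`, while `.replace()` leaves it unchanged.

```python
import pandas as pd

codes = pd.Series([7.0, 6.0, 9.0])   # 9.0 is a made-up, unmapped code
labels = {6.0: "Public_Administration", 7.0: "Hospital_Worksites"}

mapped = codes.map(labels)           # unmapped code becomes NaN
replaced = codes.replace(labels)     # unmapped code is left as 9.0

n_unmapped = int(mapped.isna().sum())
```

Counting the `NaN` values after `.map()` is a quick check for codes that the dictionary does not cover.
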

## Problem 5
Using the codebook, recode the "size" column to have three categories: "Small" for workplaces with fewer than 100 employees, "Medium" for workplaces with at least 100 but fewer than 500 employees, and "Large" for companies with at least 500 employees. [Note: Python dataframes have an attribute `.size` that reports the number of elements (rows times columns) in the dataframe. Don't confuse this attribute with the column named "Size" in the raw data.] [1 point]

Based on the codebook, the codes for the company size categories are:

`Size`: 8 Employee Size Categories

* 1 = Size Category 1: 10-24 
* 2 = Size Category 2: 25-49 
* 3 = Size Category 3: 50-99
* 4 = Size Category 4: 100-249 
* 5 = Size Category 5: 250-499
* 6 = Size Category 6: 500-749 
* 7 = Size Category 7: 750-999 
* 8 = Size Category 8: 1,000+

I create a dictionary in which the keys are the categories that we want to recode and the values are the new labels that replace them. Then I use `.map()` on the `company_size` column to apply the mapping defined by the dictionary `replace_map`.

In [124]:
replace_map ={1: 'Small',
              2: 'Small',
              3: 'Small', 
              4: 'Medium', 
              5: 'Medium', 
              6: 'Large', 
              7: 'Large', 
              8: 'Large'}
df_clean.company_size = df_clean.company_size.map(replace_map)
df_clean.company_size

0        Large
1        Large
2        Large
3       Medium
4       Medium
         ...  
2838    Medium
2839    Medium
2840     Large
2841     Large
2842     Large
Name: company_size, Length: 2843, dtype: object
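
Because the size codes are ordered, the same grouping can be expressed with break points instead of a label for every code. This is only an alternative sketch on made-up codes, using `pd.cut()` with bin edges chosen so that codes 1 to 3, 4 to 5, and 6 to 8 fall into the three groups.

```python
import pandas as pd

size_codes = pd.Series([1.0, 3.0, 4.0, 5.0, 6.0, 8.0])

# Intervals are (0, 3], (3, 5], (5, 8], so codes 1-3, 4-5, 6-8 fall together.
size_label = pd.cut(size_codes, bins=[0, 3, 5, 8],
                    labels=["Small", "Medium", "Large"])
```
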

## Problem 6
Use the codebook to write accurate and descriptive labels for each category for each categorical column in the working data. Then apply all of these labels to the data at once. Code "Legitimate Skip", "Don't know", "Refused", and "Blank" as missing values. [2 points]

To label the categories of many features, I am going to be using a dictionary in which each feature to be recoded is a key and the mapping dictionary for that feature is the value. Then I can pass the entire map to the whole dataframe with `.replace()`.



In [126]:
replace_map = {
    'org_type': {1: 'For profit, public', 2: 'For profit, private', 3: 'Non-profit', 4: 'State or local government', 5: 'Federal government', 6: 'Other'},
    'hi_coverage_ft': {1: 'Full insurance coverage offered', 2: 'Partial insurance coverage offered', 3: 'No insurance coverage offered'},
    'hi_change_ft': {1: 'Larger', 2: 'Smaller', 3: 'About the same'},
    'hi_pt': {1: 'Yes', 2: 'No'},
    'health_edu_prog': {1: 'Yes', 2: 'No'},
    'wfh': {1: 'Yes', 2: 'No'}
}
df_clean = df_clean.replace(replace_map)

All the categorical columns share the same code for the following:
* 96 = Legitimate skip
* 97 = Don't know 
* 98 = Refusal
* 99 = Blank

So we can apply the same recoding to each of the following columns: `org_type`, `hi_coverage_ft`, `hi_change_ft`, `hi_pt`, `health_edu_prog`, and `wfh`.

In [127]:
catcols = ['org_type', 'hi_coverage_ft', 'hi_change_ft', 'hi_pt', 'health_edu_prog', 'wfh']
df_clean[catcols] = df_clean[catcols].replace([96, 97, 98, 99], np.nan)
df_clean

Unnamed: 0,industry,company_size,org_type,hi_coverage_ft,hi_change_ft,hi_pt,health_edu_prog,wfh,employees_under_30,employees_60plus,female_employees,hourly_employees,non_daytime_shift_workers,remote_workers,unionized_employees,annual_turnover_percentage
0,Hospital_Worksites,Large,Non-profit,Partial insurance coverage offered,Larger,No,Yes,Yes,25.0,20.0,85.0,60.0,40.0,15.0,0.0,22.0
1,Hospital_Worksites,Large,Non-profit,Partial insurance coverage offered,About the same,Yes,Yes,Yes,997.0,997.0,90.0,90.0,997.0,997.0,0.0,997.0
2,Hospital_Worksites,Large,Non-profit,Full insurance coverage offered,About the same,Yes,Yes,Yes,35.0,4.0,997.0,997.0,40.0,15.0,997.0,997.0
3,Hospital_Worksites,Medium,"For profit, private",Full insurance coverage offered,Smaller,Yes,No,No,50.0,15.0,50.0,85.0,75.0,0.0,0.0,997.0
4,Hospital_Worksites,Medium,Non-profit,Full insurance coverage offered,About the same,Yes,Yes,Yes,50.0,40.0,60.0,60.0,40.0,30.0,0.0,28.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2838,Public_Administration,Medium,State or local government,Full insurance coverage offered,About the same,Yes,Yes,,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0
2839,Public_Administration,Medium,State or local government,Partial insurance coverage offered,About the same,Yes,Yes,No,997.0,997.0,997.0,997.0,997.0,997.0,997.0,997.0
2840,Public_Administration,Large,State or local government,Partial insurance coverage offered,About the same,Yes,Yes,Yes,27.0,997.0,61.0,997.0,997.0,997.0,997.0,997.0
2841,Public_Administration,Large,State or local government,Partial insurance coverage offered,About the same,Yes,No,,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0
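
The labeling and the missing-value recoding can also be combined into a single `.replace()` call by merging the two dictionaries for each column. The following is a minimal sketch on a made-up two-column frame, not a replacement for the two-step approach above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"wfh": [1.0, 2.0, 99.0], "hi_pt": [2.0, 1.0, 97.0]})

yes_no = {1: "Yes", 2: "No"}
missing = {code: np.nan for code in (96, 97, 98, 99)}

# Merge the labels and the missing-value codes so one .replace() call does both.
recode = {col: {**yes_no, **missing} for col in ["wfh", "hi_pt"]}
labeled = toy.replace(recode)
```
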


## Problem 7
The features that measure the percent of the workforce with a particular characteristic use the codes 997, 998, and 999 to represent "Don't know", "Refusal", and "Blank/Invalid" respectively. Replace these values with missing values for all of the percentage features at the same time. [1 point]

First, I defined a list of columns that measure workforce percentages and named it `percentcols`. Then replaced codes 997, 998, 999 with NaN using `.replace()`:

In [128]:
percentcols = ['employees_under_30', 'employees_60plus', 'female_employees', 'hourly_employees',
               'non_daytime_shift_workers', 'remote_workers', 'unionized_employees', 'annual_turnover_percentage']
df_clean[percentcols] = df_clean[percentcols].replace([997, 998, 999], np.nan)
df_clean

Unnamed: 0,industry,company_size,org_type,hi_coverage_ft,hi_change_ft,hi_pt,health_edu_prog,wfh,employees_under_30,employees_60plus,female_employees,hourly_employees,non_daytime_shift_workers,remote_workers,unionized_employees,annual_turnover_percentage
0,Hospital_Worksites,Large,Non-profit,Partial insurance coverage offered,Larger,No,Yes,Yes,25.0,20.0,85.0,60.0,40.0,15.0,0.0,22.0
1,Hospital_Worksites,Large,Non-profit,Partial insurance coverage offered,About the same,Yes,Yes,Yes,,,90.0,90.0,,,0.0,
2,Hospital_Worksites,Large,Non-profit,Full insurance coverage offered,About the same,Yes,Yes,Yes,35.0,4.0,,,40.0,15.0,,
3,Hospital_Worksites,Medium,"For profit, private",Full insurance coverage offered,Smaller,Yes,No,No,50.0,15.0,50.0,85.0,75.0,0.0,0.0,
4,Hospital_Worksites,Medium,Non-profit,Full insurance coverage offered,About the same,Yes,Yes,Yes,50.0,40.0,60.0,60.0,40.0,30.0,0.0,28.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2838,Public_Administration,Medium,State or local government,Full insurance coverage offered,About the same,Yes,Yes,,,,,,,,,
2839,Public_Administration,Medium,State or local government,Partial insurance coverage offered,About the same,Yes,Yes,No,,,,,,,,
2840,Public_Administration,Large,State or local government,Partial insurance coverage offered,About the same,Yes,Yes,Yes,27.0,,61.0,,,,,
2841,Public_Administration,Large,State or local government,Partial insurance coverage offered,About the same,Yes,No,,,,,,,,,
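
An equivalent way to blank out sentinel codes is `.mask()` combined with `.isin()`, followed by a sanity check that the remaining values look like percentages. The frame below is made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({"female_employees": [85.0, 997.0, 61.0],
                    "remote_workers": [15.0, 999.0, 998.0]})

sentinels = [997, 998, 999]
cleaned = toy.mask(toy.isin(sentinels))   # sentinel cells become NaN

# .max() skips NaN, so this is the largest surviving value in the frame.
max_value = cleaned.max().max()
```

If `max_value` is above 100 after cleaning, some code other than 997, 998, or 999 is probably still present.
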


## Problem 8
Sort the working data by industry in ascending alphabetical order. Within industry categories, sort the rows by size in ascending alphabetical order. Within groups with the same industry and size, sort by percent of the workforce that is under 30 in descending numeric order. [1 point]

To sort, I use the `.sort_values()` method. Within the method, I use the `by` argument to specify which columns I want to sort by in a list, starting with `industry`, then `company_size`, and lastly `employees_under_30`. Then I pass a list of boolean values to the `ascending` argument to specify whether each of the columns should be sorted in ascending or descending order.

In [129]:
sorted_df = df_clean.sort_values(by = ['industry', 'company_size', 'employees_under_30'], ascending = [True, True, False])
sorted_df

Unnamed: 0,industry,company_size,org_type,hi_coverage_ft,hi_change_ft,hi_pt,health_edu_prog,wfh,employees_under_30,employees_60plus,female_employees,hourly_employees,non_daytime_shift_workers,remote_workers,unionized_employees,annual_turnover_percentage
1732,Agriculture_Construction_Manufacturing,Large,"For profit, private",Partial insurance coverage offered,About the same,No,Yes,No,50.0,10.0,50.0,75.0,10.0,0.0,0.0,75.0
1476,Agriculture_Construction_Manufacturing,Large,"For profit, private",Partial insurance coverage offered,About the same,No,Yes,No,40.0,10.0,30.0,60.0,30.0,5.0,0.0,10.0
1477,Agriculture_Construction_Manufacturing,Large,"For profit, private",Partial insurance coverage offered,Smaller,No,Yes,Yes,25.0,15.0,20.0,60.0,10.0,2.0,60.0,5.0
704,Agriculture_Construction_Manufacturing,Large,"For profit, private",Full insurance coverage offered,About the same,No,Yes,Yes,20.0,15.0,17.0,62.0,10.0,5.0,0.0,11.0
1241,Agriculture_Construction_Manufacturing,Large,"For profit, private",Full insurance coverage offered,About the same,No,Yes,Yes,20.0,25.0,50.0,70.0,20.0,5.0,0.0,3.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2604,Wholesale_Retail_Transportation,Small,Non-profit,Full insurance coverage offered,About the same,No,Yes,No,,,,,,,,
2626,Wholesale_Retail_Transportation,Small,"For profit, private",Partial insurance coverage offered,Larger,Yes,No,No,,,,,,,,
2629,Wholesale_Retail_Transportation,Small,"For profit, public",Full insurance coverage offered,Larger,No,Yes,Yes,,2.0,15.0,,,90.0,0.0,15.0
2631,Wholesale_Retail_Transportation,Small,"For profit, private",Partial insurance coverage offered,Larger,Yes,No,No,,,,95.0,,,,
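
The tail of the output above shows rows with missing `employees_under_30` at the end of their group. That follows from the default `na_position="last"`, which applies regardless of sort direction. A small sketch with made-up rows:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "industry": ["B", "A", "A", "A"],
    "company_size": ["Small", "Large", "Large", "Large"],
    "employees_under_30": [10.0, np.nan, 20.0, 50.0],
})

# Third key is descending; NaN still sorts last within its group
# because na_position defaults to "last".
ordered = toy.sort_values(
    by=["industry", "company_size", "employees_under_30"],
    ascending=[True, True, False],
)
```
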


## Problem 9
There is one row in the working data that has a `NaN` value for industry. Delete this row. Use a logical expression, and not the row number. [1 point]

I use the `.drop()` method to remove rows with `NaN` in the `industry` column. The expression `df_clean[df_clean.industry.isnull()].index` identifies the index labels of the rows where the value is `NaN`, and `.drop()` removes these rows from the dataframe.

In [130]:
df_clean = df_clean.drop(df_clean[df_clean.industry.isnull()].index)
df_clean

Unnamed: 0,industry,company_size,org_type,hi_coverage_ft,hi_change_ft,hi_pt,health_edu_prog,wfh,employees_under_30,employees_60plus,female_employees,hourly_employees,non_daytime_shift_workers,remote_workers,unionized_employees,annual_turnover_percentage
0,Hospital_Worksites,Large,Non-profit,Partial insurance coverage offered,Larger,No,Yes,Yes,25.0,20.0,85.0,60.0,40.0,15.0,0.0,22.0
1,Hospital_Worksites,Large,Non-profit,Partial insurance coverage offered,About the same,Yes,Yes,Yes,,,90.0,90.0,,,0.0,
2,Hospital_Worksites,Large,Non-profit,Full insurance coverage offered,About the same,Yes,Yes,Yes,35.0,4.0,,,40.0,15.0,,
3,Hospital_Worksites,Medium,"For profit, private",Full insurance coverage offered,Smaller,Yes,No,No,50.0,15.0,50.0,85.0,75.0,0.0,0.0,
4,Hospital_Worksites,Medium,Non-profit,Full insurance coverage offered,About the same,Yes,Yes,Yes,50.0,40.0,60.0,60.0,40.0,30.0,0.0,28.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2838,Public_Administration,Medium,State or local government,Full insurance coverage offered,About the same,Yes,Yes,,,,,,,,,
2839,Public_Administration,Medium,State or local government,Partial insurance coverage offered,About the same,Yes,Yes,No,,,,,,,,
2840,Public_Administration,Large,State or local government,Partial insurance coverage offered,About the same,Yes,Yes,Yes,27.0,,61.0,,,,,
2841,Public_Administration,Large,State or local government,Partial insurance coverage offered,About the same,Yes,No,,,,,,,,,
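
The same logical condition can be applied more directly, either as a boolean mask or with `.dropna(subset=...)`. This sketch on a made-up frame shows that both keep exactly the rows where `industry` is present, while leaving missing values in other columns alone.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "industry": ["Hospital_Worksites", np.nan, "Public_Administration"],
    "wfh": ["Yes", "No", np.nan],
})

via_mask = toy[toy["industry"].notna()]        # keep rows where the test is True
via_dropna = toy.dropna(subset=["industry"])   # same rows, stated declaratively
```
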


## Problem 10
Create a new feature named `gender_balance` that has three categories: "Mostly men" for workplaces with between 0% and 35% female employees, "Balanced" for workplaces with more than 35% and at most 65% female employees, and "Mostly women" for workplaces with more than 65% female employees. [1 point]

To break the values of the `female_employees` column into categories, I am using `pandas`' `pd.cut()` function. This function creates categories from break points in a continuous-valued column. The first argument of the function is the column whose values we want to categorize, here `female_employees`. The second argument is `bins`, in which I put a list of the breakpoints (0%, 35%, 65%, and 100%). The `labels` argument is a tuple of the labels to assign ("Mostly Men", "Balanced", "Mostly Women"). Because `pd.cut()` leaves the first bin open on the left by default, I also set `include_lowest=True` so that workplaces with exactly 0% female employees are labeled "Mostly Men" rather than left missing.

In [131]:
df_clean['gender_balance'] = pd.cut(df_clean.female_employees, bins=[0, 35, 65, 100], labels=("Mostly Men", "Balanced", "Mostly Women"), include_lowest=True)
df_clean[['female_employees', 'gender_balance']]

Unnamed: 0,female_employees,gender_balance
0,85.0,Mostly Women
1,90.0,Mostly Women
2,,
3,50.0,Balanced
4,60.0,Balanced
...,...,...
2838,,
2839,,
2840,61.0,Balanced
2841,,
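
A detail worth checking with `pd.cut()` is how the bin edges are treated. By default each bin is closed on the right and open on the left, so a value of exactly 0 falls outside the first bin `(0, 35]` unless `include_lowest=True` is set. The sketch below uses made-up percentages that sit on the edges.

```python
import pandas as pd

pct_female = pd.Series([0.0, 35.0, 35.5, 65.0, 100.0])
labels = ["Mostly Men", "Balanced", "Mostly Women"]

# Default: bins are (0, 35], (35, 65], (65, 100]; exactly 0 is not covered.
default_cut = pd.cut(pct_female, bins=[0, 35, 65, 100], labels=labels)

# include_lowest=True closes the first bin on the left, so 0 is covered.
closed_cut = pd.cut(pct_female, bins=[0, 35, 65, 100], labels=labels,
                    include_lowest=True)
```
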


## Problem 11
Change the data type of all categorical features in the working data from "object" to "category". [1 point]

First, I inspect the current data types of the columns using the `.dtypes` attribute of the dataframe:

In [132]:
df_clean.dtypes

industry                        object
company_size                    object
org_type                        object
hi_coverage_ft                  object
hi_change_ft                    object
hi_pt                           object
health_edu_prog                 object
wfh                             object
employees_under_30             float64
employees_60plus               float64
female_employees               float64
hourly_employees               float64
non_daytime_shift_workers      float64
remote_workers                 float64
unionized_employees            float64
annual_turnover_percentage     float64
gender_balance                category
dtype: object

To change the data type of the categorical features to 'category', I am using `.astype('category')` method applied to all the columns whose type is object: 

In [133]:
catcols = ['industry', 'company_size', 'org_type', 'hi_coverage_ft', 'hi_change_ft', 'hi_pt', 'health_edu_prog', 'wfh']
df_clean[catcols] = df_clean[catcols].astype('category')
df_clean.dtypes

industry                      category
company_size                  category
org_type                      category
hi_coverage_ft                category
hi_change_ft                  category
hi_pt                         category
health_edu_prog               category
wfh                           category
employees_under_30             float64
employees_60plus               float64
female_employees               float64
hourly_employees               float64
non_daytime_shift_workers      float64
remote_workers                 float64
unionized_employees            float64
annual_turnover_percentage     float64
gender_balance                category
dtype: object
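
One optional refinement, not required by the problem, is to make a category ordered. A plain `category` column sorts its labels alphabetically ("Large", "Medium", "Small"), whereas an ordered categorical sorts by the declared order. A sketch on a made-up series:

```python
import pandas as pd

sizes = pd.Series(["Medium", "Small", "Large", "Small"])

# Declare the logical order explicitly instead of relying on the alphabet.
size_type = pd.CategoricalDtype(categories=["Small", "Medium", "Large"],
                                ordered=True)
sizes_cat = sizes.astype(size_type)

alphabetical = sizes.sort_values().tolist()
logical = sizes_cat.sort_values().tolist()
```
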

## Problem 12
Filter the data to only those rows that represent small workplaces that allow employees to work from home. Then report how many of these workplaces offer full insurance, partial insurance, and no insurance. Use a function that reports the percent, cumulative count, and cumulative percent in addition to the counts. [1 point]

First, I filter the data to only the rows that represent small workplaces that allow employees to work from home by applying `.query()` to the dataframe and passing the logical condition `"company_size=='Small' & wfh=='Yes'"`. Next, I chain the `.stb.freq()` method to the filtered dataframe and pass the column name `hi_coverage_ft` for insurance coverage:

In [134]:
df_clean.query("company_size=='Small' & wfh=='Yes'").stb.freq(['hi_coverage_ft'])

Unnamed: 0,hi_coverage_ft,count,percent,cumulative_count,cumulative_percent
0,Full insurance coverage offered,324,46.285714,324,46.285714
1,Partial insurance coverage offered,310,44.285714,634,90.571429
2,No insurance coverage offered,66,9.428571,700,100.0
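
For readers without `sidetable` installed, a table with the same four statistics can be assembled from `value_counts()` and `cumsum()`. This is a plain-`pandas` sketch on made-up labels. It is not how `sidetable` is implemented, and the column names are simply chosen to mirror the output above.

```python
import pandas as pd

coverage = pd.Series(["Full", "Partial", "Full", "None", "Full", "Partial"],
                     name="hi_coverage_ft")

counts = coverage.value_counts()   # sorted by count, descending
freq = pd.DataFrame({
    "count": counts,
    "percent": counts / counts.sum() * 100,
})
freq["cumulative_count"] = freq["count"].cumsum()
freq["cumulative_percent"] = freq["percent"].cumsum()
```
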


## Problem 13
Anything that can be done in SQL can be done with `pandas`. The next several questions ask you to write `pandas` code to match a given SQL query. But to check that the SQL query and `pandas` code yield the same result, create a new database using the `sqlite3` package and input the cleaned WHA data as a table in this database. (See module 6 for a discussion of SQLite in Python.) [1 point]

Setting the working directory to where I want to save the database:

In [135]:
import os
os.chdir("/Users/afnan/Documents/DS6001/databases/ds6001databases/M08/Lab")

Creating the database file (calling it `wha.db`) and establishing a connection to the database with the `.connect()` method:

In [84]:
wha_db = sqlite3.connect("wha.db")

To add the dataframe as a table named `whadata` in the database I just created, I'll use the `.to_sql()` method. The number it returns is the count of rows written:

In [86]:
df_clean.to_sql('whadata', wha_db, index=False, chunksize=1000, if_exists='replace')

2842
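
The same round trip can be tried without touching the file system by using an in-memory SQLite database. The sketch below is illustrative only: the `toy` dataframe and the table name `toy_table` are assumptions, not part of the lab data.

```python
import sqlite3
import pandas as pd

# Toy dataframe with one missing value; the values are invented for illustration.
toy = pd.DataFrame({
    "company_size": ["Small", "Medium", "Large"],
    "female_employees": [40.0, 55.5, None],
})

# ":memory:" creates a temporary database that disappears when the connection closes.
conn = sqlite3.connect(":memory:")

# Write the dataframe as a table, replacing it if it already exists.
toy.to_sql("toy_table", conn, index=False, if_exists="replace")

# Read the table back to confirm that the rows and the missing value survived.
round_trip = pd.read_sql_query("SELECT * FROM toy_table", conn)
conn.close()
print(round_trip)
```

Missing values are stored as SQL `NULL` and come back as missing values when the table is read into `pandas` again.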

## Problem 14
Write `pandas` code that replicates the output of the following SQL code:
```
SELECT size, type, premiums AS insurance, percent_female FROM whpps
WHERE industry = 'Hospitals' AND premium_change='Smaller'
ORDER BY percent_female DESC;
```
For each of these queries, your feature names might be different from the ones listed in the query, depending on the names you chose in problem 3.
[2 points]

First, I run the SQL query against the SQLite database so that I can compare its output with the `pandas` output:

In [138]:
query = '''
SELECT company_size, org_type, hi_coverage_ft AS insurance, female_employees FROM whadata
WHERE industry = 'Hospital_Worksites' AND hi_change_ft='Smaller'
ORDER BY female_employees DESC;
'''
sqloutput_df = pd.read_sql_query(query, wha_db)
sqloutput_df

| | company_size | org_type | insurance | female_employees |
|---|---|---|---|---|
| 0 | Medium | Non-profit | Full insurance coverage offered | 89.0 |
| 1 | Large | Non-profit | Partial insurance coverage offered | 80.0 |
| 2 | Large | Non-profit | Partial insurance coverage offered | 80.0 |
| 3 | Small | Non-profit | Full insurance coverage offered | 75.0 |
| 4 | Medium | Non-profit | Partial insurance coverage offered | 65.0 |
| 5 | Medium | For profit, private | Full insurance coverage offered | 50.0 |
| 6 | Medium | | Partial insurance coverage offered | |
| 7 | Medium | Non-profit | Partial insurance coverage offered | |
| 8 | Medium | Non-profit | Full insurance coverage offered | |
| 9 | Medium | Non-profit | Full insurance coverage offered | |


Now using `pandas`:

First, I filter the dataframe for `industry == 'Hospital_Worksites'` and `hi_change_ft == 'Smaller'` using `.query()`:

In [139]:
filtered_df1 = df_clean.query("industry == 'Hospital_Worksites' & hi_change_ft == 'Smaller'")

Next, I select the columns and rename `hi_coverage_ft` to `insurance`:

In [140]:
mycols = ['company_size', 'org_type', 'hi_coverage_ft', 'female_employees']
filtered_df1 = filtered_df1[mycols]
filtered_df1 = filtered_df1.rename({'hi_coverage_ft': 'insurance'}, axis=1)

Finally, I sort the dataframe by `female_employees` in descending order:

In [141]:
filtered_df1 = filtered_df1.sort_values(by='female_employees', ascending=False)
filtered_df1

| | company_size | org_type | insurance | female_employees |
|---|---|---|---|---|
| 320 | Medium | Non-profit | Full insurance coverage offered | 89.0 |
| 187 | Large | Non-profit | Partial insurance coverage offered | 80.0 |
| 214 | Large | Non-profit | Partial insurance coverage offered | 80.0 |
| 229 | Small | Non-profit | Full insurance coverage offered | 75.0 |
| 191 | Medium | Non-profit | Partial insurance coverage offered | 65.0 |
| 3 | Medium | For profit, private | Full insurance coverage offered | 50.0 |
| 11 | Medium | | Partial insurance coverage offered | |
| 48 | Medium | Non-profit | Partial insurance coverage offered | |
| 51 | Medium | Non-profit | Full insurance coverage offered | |
| 75 | Medium | Non-profit | Full insurance coverage offered | |
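
The filter, select, rename, and sort steps can also be written as one method chain, with each step lined up against the SQL clause it mirrors. This is a sketch on invented toy data (the `toy` dataframe is an assumption); the column names follow the ones used in this notebook.

```python
import pandas as pd

# Toy data with the same column names as the cleaned WHA dataframe; values are invented.
toy = pd.DataFrame({
    "company_size": ["Small", "Medium", "Large", "Medium"],
    "industry": ["Hospital_Worksites", "Hospital_Worksites",
                 "Public_Administration", "Hospital_Worksites"],
    "hi_change_ft": ["Smaller", "Smaller", "Smaller", "Larger"],
    "hi_coverage_ft": ["Full", "Partial", "Full", "Full"],
    "female_employees": [75.0, 89.0, 60.0, 50.0],
})

result = (
    toy
    # WHERE industry = ... AND hi_change_ft = ...
    .query("industry == 'Hospital_Worksites' & hi_change_ft == 'Smaller'")
    # SELECT company_size, hi_coverage_ft, female_employees
    .loc[:, ["company_size", "hi_coverage_ft", "female_employees"]]
    # hi_coverage_ft AS insurance
    .rename(columns={"hi_coverage_ft": "insurance"})
    # ORDER BY female_employees DESC
    .sort_values(by="female_employees", ascending=False)
)
print(result)
```

Note that `pandas` keeps the original row labels in the index, whereas the SQL result is numbered from 0; chaining `.reset_index(drop=True)` at the end would make the two indexes match as well.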


## Problem 15
Write `pandas` code that replicates the output of the following SQL code:
```
SELECT industry, 
    AVG(percent_female) as percent_female, 
    AVG(percent_under30) as percent_under30,
    AVG(percent_over60) as percent_over60
FROM whpps
GROUP BY industry
ORDER BY percent_female DESC;
```
[2 points]

First, I run the SQL query against the SQLite database so that I can compare its output with the `pandas` output:

In [142]:
query = '''
SELECT industry, 
    AVG(female_employees) as percent_female, 
    AVG(employees_under_30) as percent_under30,
    AVG(employees_60plus) as percent_over60
FROM whadata
GROUP BY industry
ORDER BY percent_female DESC;
'''
sqloutput_df = pd.read_sql_query(query, wha_db)
sqloutput_df

| | industry | percent_female | percent_under30 | percent_over60 |
|---|---|---|---|---|
| 0 | Education_Health_Social_Assistance | 80.657143 | 25.745665 | 11.34957 |
| 1 | Hospital_Worksites | 76.427027 | 27.213793 | 16.489655 |
| 2 | Arts_Entertainment_Food_Services | 53.804416 | 38.566343 | 11.544872 |
| 3 | Information_Finance_Professional_Services | 50.632184 | 23.821752 | 12.465465 |
| 4 | Public_Administration | 39.056738 | 21.015625 | 15.015385 |
| 5 | Wholesale_Retail_Transportation | 32.657258 | 29.108696 | 12.584034 |
| 6 | Agriculture_Construction_Manufacturing | 20.328605 | 22.257143 | 10.690355 |


Using `pandas`:

I group the data by `industry` using `.groupby()` on the dataframe. Then, I use the `.agg()` method to calculate the averages of `female_employees`, `employees_under_30`, and `employees_60plus` and assign them to their respective column names. Finally, I sort by `percent_female` in descending order using `.sort_values()`:

In [143]:
grouped_df = df_clean.groupby('industry').agg(
    percent_female = ('female_employees', 'mean'),
    percent_under30 = ('employees_under_30', 'mean'),
    percent_over60 = ('employees_60plus', 'mean')
).reset_index()
grouped_df = grouped_df.sort_values(by='percent_female', ascending=False)
grouped_df 

| | industry | percent_female | percent_under30 | percent_over60 |
|---|---|---|---|---|
| 2 | Education_Health_Social_Assistance | 80.657143 | 25.745665 | 11.34957 |
| 3 | Hospital_Worksites | 76.427027 | 27.213793 | 16.489655 |
| 1 | Arts_Entertainment_Food_Services | 53.804416 | 38.566343 | 11.544872 |
| 4 | Information_Finance_Professional_Services | 50.632184 | 23.821752 | 12.465465 |
| 5 | Public_Administration | 39.056738 | 21.015625 | 15.015385 |
| 6 | Wholesale_Retail_Transportation | 32.657258 | 29.108696 | 12.584034 |
| 0 | Agriculture_Construction_Manufacturing | 20.328605 | 22.257143 | 10.690355 |
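
Rather than comparing the two tables by eye, the match can be checked in code. The sketch below uses an in-memory database and invented toy data (`toy` and `toy_table` are assumptions) to show that SQL `AVG` and the `pandas` `'mean'` aggregation both skip missing values and give the same grouped result.

```python
import sqlite3
import pandas as pd

# Toy data with one missing value in group "B"; values are invented for illustration.
toy = pd.DataFrame({
    "industry": ["A", "A", "B", "B", "B"],
    "female_employees": [80.0, 60.0, 20.0, None, 40.0],
})

# SQL side: AVG ignores NULL values.
conn = sqlite3.connect(":memory:")
toy.to_sql("toy_table", conn, index=False)
sql_result = pd.read_sql_query(
    "SELECT industry, AVG(female_employees) AS percent_female "
    "FROM toy_table GROUP BY industry ORDER BY percent_female DESC;",
    conn,
)
conn.close()

# pandas side: 'mean' skips NaN by default.
pandas_result = (
    toy.groupby("industry")
    .agg(percent_female=("female_employees", "mean"))
    .reset_index()
    .sort_values(by="percent_female", ascending=False)
    .reset_index(drop=True)  # renumber rows from 0 so the index matches the SQL output
)

# Raises an AssertionError if the two results differ; dtypes are not compared.
pd.testing.assert_frame_equal(sql_result, pandas_result, check_dtype=False)
```

The final `.reset_index(drop=True)` matters for the comparison: without it, the `pandas` result keeps the pre-sort row labels, as seen in the index column of the output above.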


## Problem 16
Write `pandas` code that replicates the output of the following SQL code:
```
SELECT gender_balance, premiums, COUNT(*)
FROM whpps
GROUP BY gender_balance, premiums
HAVING gender_balance is NOT NULL and premiums is NOT NULL;
```
[2 points]

First, I run the SQL query against the SQLite database so that I can compare its output with the `pandas` output:

In [144]:
query = '''
SELECT gender_balance, hi_coverage_ft, COUNT(*)
FROM whadata
GROUP BY gender_balance, hi_coverage_ft
HAVING gender_balance is NOT NULL and hi_coverage_ft is NOT NULL;
'''
sqloutput_df = pd.read_sql_query(query, wha_db)
sqloutput_df

| | gender_balance | hi_coverage_ft | COUNT(*) |
|---|---|---|---|
| 0 | Balanced | Full insurance coverage offered | 226 |
| 1 | Balanced | No insurance coverage offered | 77 |
| 2 | Balanced | Partial insurance coverage offered | 271 |
| 3 | Mostly Men | Full insurance coverage offered | 293 |
| 4 | Mostly Men | No insurance coverage offered | 87 |
| 5 | Mostly Men | Partial insurance coverage offered | 321 |
| 6 | Mostly Women | Full insurance coverage offered | 267 |
| 7 | Mostly Women | No insurance coverage offered | 107 |
| 8 | Mostly Women | Partial insurance coverage offered | 333 |


Using `pandas`:

I group the data by `gender_balance` and `hi_coverage_ft` using `.groupby()` on the dataframe. Then, I chain `.size()` to get the count of rows for each group and convert the resulting series to a dataframe using `.reset_index()`, naming the new column `COUNT(*)` to match the output of the SQL query.

Next, to filter out the groups where either `gender_balance` or `hi_coverage_ft` is NULL, I use `.dropna()`. By default `.groupby()` already excludes rows whose group keys are missing, so this step mainly acts as a safeguard:

In [145]:
grouped_df1 = df_clean.groupby(['gender_balance', 'hi_coverage_ft']).size().reset_index(name='COUNT(*)')
grouped_df1 = grouped_df1.dropna(subset=['gender_balance', 'hi_coverage_ft'])
grouped_df1

| | gender_balance | hi_coverage_ft | COUNT(*) |
|---|---|---|---|
| 0 | Mostly Men | Full insurance coverage offered | 293 |
| 1 | Mostly Men | No insurance coverage offered | 87 |
| 2 | Mostly Men | Partial insurance coverage offered | 321 |
| 3 | Balanced | Full insurance coverage offered | 226 |
| 4 | Balanced | No insurance coverage offered | 77 |
| 5 | Balanced | Partial insurance coverage offered | 271 |
| 6 | Mostly Women | Full insurance coverage offered | 267 |
| 7 | Mostly Women | No insurance coverage offered | 107 |
| 8 | Mostly Women | Partial insurance coverage offered | 333 |


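The result is the same except for the order of the rows: the SQL output lists the `gender_balance` groups alphabetically, while the `pandas` output follows the order of the categories in the categorical `gender_balance` column.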
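
To confirm that two results contain the same rows regardless of order, both can be sorted by the grouping keys before comparing. The helper `same_rows` below is hypothetical (it is not part of `pandas`), and the two small frames are invented stand-ins for the SQL and `pandas` outputs.

```python
import pandas as pd

# Invented stand-ins for the two outputs: same rows, different row order.
sql_like = pd.DataFrame({
    "gender_balance": ["Balanced", "Balanced", "Mostly Men"],
    "hi_coverage_ft": ["Full", "Partial", "Full"],
    "COUNT(*)": [2, 1, 3],
})
pandas_like = sql_like.iloc[[2, 0, 1]]

def same_rows(left, right, keys):
    """Return True if two frames hold the same rows, ignoring row order and index."""
    left_sorted = left.sort_values(keys).reset_index(drop=True)
    right_sorted = right.sort_values(keys).reset_index(drop=True)
    return left_sorted.equals(right_sorted)

keys = ["gender_balance", "hi_coverage_ft"]
matches = same_rows(sql_like, pandas_like, keys)
print(matches)
```

One caveat when applying this to the real outputs: `.equals()` also compares dtypes, so categorical columns would likely need to be converted first (for example with `.astype(str)`) to match the plain text columns returned by the SQL query.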

To save the changes made to the database, I'll use the `.commit()` method. Then, to free up the resources the database is using on my machine, I'll use the `.close()` method:

In [146]:
wha_db.commit()
wha_db.close()
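
An alternative pattern that guarantees the connection is closed even if an error occurs is `contextlib.closing`. The sketch below uses an in-memory database with a made-up table `t` purely for illustration.

```python
import sqlite3
from contextlib import closing

# closing() calls conn.close() when the block exits, even after an exception.
with closing(sqlite3.connect(":memory:")) as conn:
    conn.execute("CREATE TABLE t (x INTEGER)")
    conn.execute("INSERT INTO t VALUES (1)")
    conn.commit()  # save the inserted row
    n_rows = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]

print(n_rows)
```

`closing` is used here because a plain `with sqlite3.connect(...) as conn:` block only commits or rolls back the transaction; it does not close the connection.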