# Lab Assignment 8: Data Management Using `pandas`, Part 1
## DS 6001: Practice and Application of Data Science

**Emily Lien, egl6a**

### Instructions
Please answer the following questions as completely as possible using text, code, and the results of code as needed. Format your answers in a Jupyter notebook. To receive full credit, make sure you address every part of the problem, and make sure your document is formatted in a clean and professional way.

In this lab, you will be working with the [2017 Workplace Health in America survey](https://www.cdc.gov/workplacehealthpromotion/survey/data.html), which was conducted by the Centers for Disease Control and Prevention. According to the survey's [guidance document](https://www.cdc.gov/workplacehealthpromotion/data-surveillance/docs/2017-WHA-Guidance-Document-for-Use-of-Public-Data-files-508.pdf):

> The Workplace Health in America (WHA) Survey gathered information from a cross-sectional, nationally representative sample of US worksites. The sample was drawn from the Dun & Bradstreet (D&B) database of all private and public employers in the United States with at least 10 employees. Like previous national surveys, the worksite served as the sampling unit rather than the companies or firms to which the worksites belonged. Worksites were selected using a stratified simple random sample (SRS) design, where the primary strata were ten multi-state regions defined by the Centers for Disease Control and Prevention (CDC), plus an additional stratum containing all hospital worksites. 

The data contain over 300 features that report the industry and type of company where the respondents are employed, what kind of health insurance and other health programs are offered, and other characteristics of the workplaces including whether employees are allowed to work from home and the gender and age makeup of the workforce. The data are full of interesting information, but in order to make use of the data a great deal of data manipulation is required first.

## Problem 0
Import the following libraries:

In [2]:
import numpy as np
import pandas as pd
import sidetable
import sqlite3
import warnings
warnings.filterwarnings('ignore')

## Problem 1
The raw data are stored in an ASCII file on the 2017 Workplace Health in America survey [homepage](https://www.cdc.gov/workplacehealthpromotion/survey/data.html). Load the raw data directly into Python without downloading the data onto your hard drive and display a dataframe with only the 14th, 28th, and 102nd rows of the data. [1 point]

In [224]:
Data = pd.read_csv("https://www.cdc.gov/workplacehealthpromotion/data-surveillance/docs/whpps_120717.csv",sep='~')

In [225]:
DataSlice = Data.iloc[[14,28,102],]

In [226]:
DataSlice

Unnamed: 0,OC1,OC3,HI1,HI2,HI3,HI4,HRA1,HRA1A,HRA1B,HRA1E,...,WL3_05,E1_09,Suppquex,Id,Region,CDC_Region,Industry,Size,Varstrata,"Finalwt_worksite,,,,"
14,7,2.0,2.0,1.0,2.0,1.0,1.0,3.0,2.0,2.0,...,,,2.0,1539.0,2.0,4.0,7.0,5.0,0.0,"47.793940929,,,,"
28,1,3.0,2.0,3.0,1.0,1.0,2.0,96.0,96.0,96.0,...,,,2.0,2755.0,3.0,5.0,7.0,6.0,0.0,"47.793940929,,,,"
102,1,3.0,2.0,3.0,1.0,1.0,1.0,1.0,4.0,2.0,...,,,2.0,12686.0,3.0,5.0,7.0,8.0,0.0,"47.793940929,,,,"
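
One thing to watch here: `iloc` positions are zero-based, so the 14th, 28th, and 102nd rows sit at positions 13, 27, and 101. The call above returns the rows labeled 14, 28, and 102, which are the 15th, 29th, and 103rd rows. A minimal sketch of the off-by-one adjustment on a made-up frame (the `toy` frame and its `value` column are hypothetical, not the survey data):

```python
import pandas as pd

# Hypothetical stand-in for the survey data: 200 rows with a default RangeIndex.
toy = pd.DataFrame({"value": range(200)})

# Ordinal (1-based) rows requested by the problem statement.
ordinals = [14, 28, 102]

# iloc is zero-based, so subtract 1 from each ordinal position.
positions = [n - 1 for n in ordinals]
rows = toy.iloc[positions]
print(rows)
```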


## Problem 2 
The data contain 301 columns. Create a new variable in Python's memory to store a working version of the data. In the working version, delete all of the columns except for the following:

* `Industry`: 7 Industry Categories with NAICS codes

* `Size`: 8 Employee Size Categories

* `OC3`: Is your organization for profit, non-profit, government?

* `HI1`: In general, do you offer full, partial or no payment of premiums for personal health insurance for full-time employees?

* `HI2`: Over the past 12 months, were full-time employees asked to pay a larger proportion, smaller proportion or the same proportion of personal health insurance premiums?

* `HI3`: Does your organization offer personal health insurance for your part-time employees?

* `CP1`: Are there health education programs, which focus on skill development and lifestyle behavior change along with information dissemination and awareness building?

* `WL6`: Allow employees to work from home?

* Every column that begins `WD`, expressing the percentage of employees that have certain characteristics at the firm

[1 point]

In [227]:
columns = ['Industry','Size','OC3','HI1','HI2','HI3','CP1','WL6']
WD = [x for x in Data.columns if x.startswith("WD")]
DataWork=Data[columns+WD]

In [228]:
DataWork

Unnamed: 0,Industry,Size,OC3,HI1,HI2,HI3,CP1,WL6,WD1_1,WD1_2,WD2,WD3,WD4,WD5,WD6,WD7
0,7.0,7.0,3.0,2.0,1.0,2.0,1.0,1.0,25.0,20.0,85.0,60.0,40.0,15.0,0.0,22.0
1,7.0,6.0,3.0,2.0,3.0,1.0,1.0,1.0,997.0,997.0,90.0,90.0,997.0,997.0,0.0,997.0
2,7.0,8.0,3.0,1.0,3.0,1.0,1.0,1.0,35.0,4.0,997.0,997.0,40.0,15.0,997.0,997.0
3,7.0,4.0,2.0,1.0,2.0,1.0,2.0,2.0,50.0,15.0,50.0,85.0,75.0,0.0,0.0,997.0
4,7.0,4.0,3.0,1.0,3.0,1.0,1.0,1.0,50.0,40.0,60.0,60.0,40.0,30.0,0.0,28.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2838,6.0,5.0,4.0,1.0,3.0,1.0,1.0,99.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0
2839,6.0,5.0,4.0,2.0,3.0,1.0,1.0,2.0,997.0,997.0,997.0,997.0,997.0,997.0,997.0,997.0
2840,6.0,8.0,4.0,2.0,3.0,1.0,1.0,1.0,27.0,997.0,61.0,997.0,997.0,997.0,997.0,997.0
2841,6.0,8.0,4.0,2.0,3.0,1.0,2.0,99.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0
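
Since later problems assign new values into the working data, it may be safer to make the working version an explicit copy. Problem 0 silences warnings, so a `SettingWithCopyWarning` would not be visible if pandas raised one. A small sketch, assuming a made-up frame (`toy`, `work`, and the `Other` column are hypothetical names):

```python
import pandas as pd

# Hypothetical miniature of the raw data with one column we do not want.
toy = pd.DataFrame({
    "Industry": [7, 6],
    "Size": [5, 8],
    "WD1_1": [25, 997],
    "WD2": [85, 90],
    "Other": [1, 2],
})

keep = ["Industry", "Size"]
wd_cols = [c for c in toy.columns if c.startswith("WD")]

# .copy() makes the working frame independent of the raw frame.
work = toy[keep + wd_cols].copy()

# Recoding the working copy leaves the raw frame untouched.
work["Size"] = work["Size"].map({5: "Medium", 8: "Large"})
print(work)
```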


## Problem 3
The [codebook](https://www.cdc.gov/workplacehealthpromotion/data-surveillance/docs/2017-WHA-Datafile-Codebook-508.pdf) for the WHA data contains short descriptions of the meaning of each of the columns in the data. Use these descriptions to decide on better and more intuitive names for the columns in the working version of the data, and rename the columns accordingly. [1 point]

Industry = Industry

Size = Size

OrgType = OC3 categorical

PremiumCoverage = HI1 categorical

PremiumProportion = HI2 categorical

PartTimeInsure = HI3 categorical

HealthEducation = CP1 categorical

WorkFromHome = WL6 categorical

Under30 = WD1_1

Over60 = WD1_2

PercentFemale = WD2

Hourly = WD3

ATypShift = WD4

Remote = WD5

Union = WD6

AnnualTurnover = WD7

In [229]:
# rename() returns a new dataframe; assign it back instead of combining inplace=True with an assignment
DataWork = DataWork.rename(columns={'OC3':'OrgType',
                                    'HI1':'PremiumCoverage',
                                    'HI2':'PremiumProportion',
                                    'HI3':'PartTimeInsure',
                                    'CP1':'HealthEducation',
                                    'WL6':'WorkFromHome',
                                    'WD1_1':'Under30',
                                    'WD1_2':'Over60',
                                    'WD2':'PercentFemale',
                                    'WD3':'Hourly',
                                    'WD4':'ATypShift',
                                    'WD5':'Remote',
                                    'WD6':'Union',
                                    'WD7':'AnnualTurnover'})

In [230]:
DataWork

Unnamed: 0,Industry,Size,OrgType,PremiumCoverage,PremiumProportion,PartTimeInsure,HealthEducation,WorkFromHome,Under30,Over60,PercentFemale,Hourly,ATypShift,Remote,Union,AnnualTurnover
0,7.0,7.0,3.0,2.0,1.0,2.0,1.0,1.0,25.0,20.0,85.0,60.0,40.0,15.0,0.0,22.0
1,7.0,6.0,3.0,2.0,3.0,1.0,1.0,1.0,997.0,997.0,90.0,90.0,997.0,997.0,0.0,997.0
2,7.0,8.0,3.0,1.0,3.0,1.0,1.0,1.0,35.0,4.0,997.0,997.0,40.0,15.0,997.0,997.0
3,7.0,4.0,2.0,1.0,2.0,1.0,2.0,2.0,50.0,15.0,50.0,85.0,75.0,0.0,0.0,997.0
4,7.0,4.0,3.0,1.0,3.0,1.0,1.0,1.0,50.0,40.0,60.0,60.0,40.0,30.0,0.0,28.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2838,6.0,5.0,4.0,1.0,3.0,1.0,1.0,99.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0
2839,6.0,5.0,4.0,2.0,3.0,1.0,1.0,2.0,997.0,997.0,997.0,997.0,997.0,997.0,997.0,997.0
2840,6.0,8.0,4.0,2.0,3.0,1.0,1.0,1.0,27.0,997.0,61.0,997.0,997.0,997.0,997.0,997.0
2841,6.0,8.0,4.0,2.0,3.0,1.0,2.0,99.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0


## Problem 4
Using the codebook and this [dictionary of NAICS industrial codes](https://www.naics.com/search-naics-codes-by-industry/), place descriptive labels on the categories of the industry column in the working data. [1 point]

In [231]:
DataWork.Industry

0       7.0
1       7.0
2       7.0
3       7.0
4       7.0
       ... 
2838    6.0
2839    6.0
2840    6.0
2841    6.0
2842    6.0
Name: Industry, Length: 2843, dtype: float64

In [232]:
replace_map = {1:'Industrial Trades',
               2:'Retail',
               3:'Entertainment',
               4:'Entreprenurial',
               5:'Public Services',
               6:'Public Admin',
               7:'Hospitals'}
DataWork.Industry = DataWork.Industry.map(replace_map)
DataWork.Industry

0          Hospitals
1          Hospitals
2          Hospitals
3          Hospitals
4          Hospitals
            ...     
2838    Public Admin
2839    Public Admin
2840    Public Admin
2841    Public Admin
2842    Public Admin
Name: Industry, Length: 2843, dtype: object

## Problem 5
Using the codebook, recode the "size" column to have three categories: "Small" for workplaces with fewer than 100 employees, "Medium" for workplaces with at least 100 but fewer than 500 employees, and "Large" for companies with at least 500 employees. [Note: Python dataframes have an attribute `.size` that reports the number of elements in the dataframe (rows times columns). Don't confuse this attribute with the column named "Size" in the raw data.] [1 point]

In [233]:
replace_map = {1:'Small',
               2:'Small',
               3:'Small',
               4:'Medium',
               5:'Medium',
               6:'Large',
               7:'Large',
               8:'Large'}
DataWork.Size = DataWork.Size.map(replace_map)
DataWork.Size

0        Large
1        Large
2        Large
3       Medium
4       Medium
         ...  
2838    Medium
2839    Medium
2840     Large
2841     Large
2842     Large
Name: Size, Length: 2843, dtype: object

## Problem 6
Use the codebook to write accurate and descriptive labels for each category for each categorical column in the working data. Then apply all of these labels to the data at once. Code "Legitimate Skip", "Don't know", "Refused", and "Blank" as missing values. [2 points]

OrgType = OC3 categorical

PremiumCoverage = HI1 categorical

PremiumProportion = HI2 categorical

PartTimeInsure = HI3 categorical

HealthEducation = CP1 categorical

WorkFromHome = WL6 categorical

In [234]:
replace_mass = {'OrgType':{1:'PublicForProfit',
                           2:'PrivateForProfit',
                           3:'NonProfit',
                           4:'State/LocalGov',
                           5:'FedGov',
                           6:'Other',
                           97:'Missing',
                           98:'Missing',
                           99:'Missing'},
                'PremiumCoverage':{1:'FullCoverage',
                                   2:'PartialCoverage',
                                   3:'NoCoverage',
                                   96:'Missing',
                                   97:'Missing',
                                   98:'Missing',
                                   99:'Missing'},
                'PremiumProportion':{1:'Larger',
                                     2:'Smaller',
                                     3:'Same',
                                     96:'Missing',
                                     97:'Missing',
                                     98:'Missing',
                                     99:'Missing'},
                'PartTimeInsure':{1:'Yes',
                                  2:'No',
                                  97:'Missing',
                                  98:'Missing',
                                  99:'Missing'},
                'HealthEducation':{1:'Yes',
                                   2:'No',
                                   97:'Missing',
                                   98:'Missing'},
                'WorkFromHome':{1:'Yes',
                                2:'No',
                                97:'Missing',
                                98:'Missing',
                                99:'Missing'}}
DataWork= DataWork.replace(replace_mass)

In [235]:
DataWork

Unnamed: 0,Industry,Size,OrgType,PremiumCoverage,PremiumProportion,PartTimeInsure,HealthEducation,WorkFromHome,Under30,Over60,PercentFemale,Hourly,ATypShift,Remote,Union,AnnualTurnover
0,Hospitals,Large,NonProfit,PartialCoverage,Larger,No,Yes,Yes,25.0,20.0,85.0,60.0,40.0,15.0,0.0,22.0
1,Hospitals,Large,NonProfit,PartialCoverage,Same,Yes,Yes,Yes,997.0,997.0,90.0,90.0,997.0,997.0,0.0,997.0
2,Hospitals,Large,NonProfit,FullCoverage,Same,Yes,Yes,Yes,35.0,4.0,997.0,997.0,40.0,15.0,997.0,997.0
3,Hospitals,Medium,PrivateForProfit,FullCoverage,Smaller,Yes,No,No,50.0,15.0,50.0,85.0,75.0,0.0,0.0,997.0
4,Hospitals,Medium,NonProfit,FullCoverage,Same,Yes,Yes,Yes,50.0,40.0,60.0,60.0,40.0,30.0,0.0,28.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2838,Public Admin,Medium,State/LocalGov,FullCoverage,Same,Yes,Yes,Missing,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0
2839,Public Admin,Medium,State/LocalGov,PartialCoverage,Same,Yes,Yes,No,997.0,997.0,997.0,997.0,997.0,997.0,997.0,997.0
2840,Public Admin,Large,State/LocalGov,PartialCoverage,Same,Yes,Yes,Yes,27.0,997.0,61.0,997.0,997.0,997.0,997.0,997.0
2841,Public Admin,Large,State/LocalGov,PartialCoverage,Same,Yes,No,Missing,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0
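
The cell above labels the skip, don't know, refused, and blank codes with the string `'Missing'`, which pandas treats as an ordinary category rather than a missing value. One way to get true missing values is to map those codes to `np.nan` instead. A rough sketch on made-up data (the `toy` frame and the shortened label dictionaries are assumptions for illustration, not the full codebook):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of two categorical survey columns.
toy = pd.DataFrame({
    "OrgType": [1.0, 3.0, 97.0, 99.0],
    "WorkFromHome": [1.0, 2.0, 98.0, 1.0],
})

# Codes that should become missing values in every categorical column.
missing_codes = {96: np.nan, 97: np.nan, 98: np.nan, 99: np.nan}

# One nested dictionary labels every column in a single replace() call.
labels = {
    "OrgType": {1: "PublicForProfit", 3: "NonProfit", **missing_codes},
    "WorkFromHome": {1: "Yes", 2: "No", **missing_codes},
}
labeled = toy.replace(labels)

# isna() now counts the recoded values as missing.
print(labeled.isna().sum())
```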


## Problem 7
The features that measure the percent of the workforce with a particular characteristic use the codes 997, 998, and 999 to represent "Don't know", "Refusal", and "Blank/Invalid" respectively. Replace these values with missing values for all of the percentage features at the same time. [1 point]

In [236]:
replace_nan = {'Under30':{997:'NaN',
                          998:'NaN',
                          999:'NaN'},
               'Over60':{997:'NaN',
                          998:'NaN',
                          999:'NaN'},
               'PercentFemale':{997:'NaN',
                                998:'NaN',
                                999:'NaN'},
               'Hourly':{997:'NaN',
                         998:'NaN',
                         999:'NaN'},
               'AtypShift':{997:'NaN',
                            998:'NaN',
                            999:'NaN'},
               'Remote':{997:'NaN',
                         998:'NaN',
                         999:'NaN'},
               'Union':{997:'NaN',
                        998:'NaN',
                        999:'NaN'},
               'AnnualTurnover':{997:'NaN',
                                 998:'NaN',
                                 999:'NaN'}}
DataWork= DataWork.replace(replace_nan)

In [237]:
DataWork

Unnamed: 0,Industry,Size,OrgType,PremiumCoverage,PremiumProportion,PartTimeInsure,HealthEducation,WorkFromHome,Under30,Over60,PercentFemale,Hourly,ATypShift,Remote,Union,AnnualTurnover
0,Hospitals,Large,NonProfit,PartialCoverage,Larger,No,Yes,Yes,25,20,85,60,40.0,15,0,22
1,Hospitals,Large,NonProfit,PartialCoverage,Same,Yes,Yes,Yes,,,90,90,997.0,,0,
2,Hospitals,Large,NonProfit,FullCoverage,Same,Yes,Yes,Yes,35,4,,,40.0,15,,
3,Hospitals,Medium,PrivateForProfit,FullCoverage,Smaller,Yes,No,No,50,15,50,85,75.0,0,0,
4,Hospitals,Medium,NonProfit,FullCoverage,Same,Yes,Yes,Yes,50,40,60,60,40.0,30,0,28
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2838,Public Admin,Medium,State/LocalGov,FullCoverage,Same,Yes,Yes,Missing,,,,,999.0,,,
2839,Public Admin,Medium,State/LocalGov,PartialCoverage,Same,Yes,Yes,No,,,,,997.0,,,
2840,Public Admin,Large,State/LocalGov,PartialCoverage,Same,Yes,Yes,Yes,27,,61,,997.0,,,
2841,Public Admin,Large,State/LocalGov,PartialCoverage,Same,Yes,No,Missing,,,,,999.0,,,
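
Two details in the cell above are worth flagging. The string `'NaN'` is not a missing value, so the replaced columns become `object` columns holding text. The dictionary key `'AtypShift'` also does not match the column name `ATypShift`, which is why that column still shows 997 and 999 in the output. A shorter sketch that replaces all three codes with `np.nan` across a list of columns (the `toy` frame and `pct_cols` list are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the working data.
toy = pd.DataFrame({
    "Industry": ["Hospitals", "Retail", "Retail"],
    "Under30": [25.0, 997.0, 50.0],
    "ATypShift": [40.0, 999.0, 998.0],
})

# Taking the column names from the frame itself avoids spelling mismatches.
pct_cols = ["Under30", "ATypShift"]
assert set(pct_cols) <= set(toy.columns)

# Replace every sentinel code with a true missing value in one call.
toy[pct_cols] = toy[pct_cols].replace([997, 998, 999], np.nan)
print(toy)
```

Because the columns stay numeric, no later `astype('float')` conversion should be needed.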


## Problem 8
Sort the working data by industry in ascending alphabetical order. Within industry categories, sort the rows by size in ascending alphabetical order. Within groups with the same industry and size, sort by percent of the workforce that is under 30 in descending numeric order. [1 point]

In [238]:
DataWork.sort_values(by=['Industry','Size','Under30'],ascending=[True,True,False])

Unnamed: 0,Industry,Size,OrgType,PremiumCoverage,PremiumProportion,PartTimeInsure,HealthEducation,WorkFromHome,Under30,Over60,PercentFemale,Hourly,ATypShift,Remote,Union,AnnualTurnover
1310,Entertainment,Large,PublicForProfit,PartialCoverage,Same,Missing,Missing,Missing,,,,,999.0,,,
1827,Entertainment,Large,PublicForProfit,Missing,Missing,Yes,Missing,Missing,,,,,999.0,,,
1830,Entertainment,Large,PrivateForProfit,PartialCoverage,Same,No,No,No,,,,,997.0,0,0,
1831,Entertainment,Large,State/LocalGov,PartialCoverage,Same,Yes,Yes,No,,,,,997.0,,,
2431,Entertainment,Large,PrivateForProfit,PartialCoverage,Smaller,No,No,No,70,5,25,95,15.0,0,0,45
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2095,Retail,Small,PrivateForProfit,PartialCoverage,Same,No,No,Yes,0,20,85,10,0.0,10,0,0
2099,Retail,Small,PrivateForProfit,FullCoverage,Same,No,No,Yes,0,4,80,90,0.0,15,0,
2118,Retail,Small,PrivateForProfit,PartialCoverage,Same,Yes,No,No,0,40,40,95,15.0,0,0,10
2590,Retail,Small,PrivateForProfit,PartialCoverage,Smaller,No,No,No,0,38,10,90,0.0,5,0,


## Problem 9
There is one row in the working data that has a `NaN` value for industry. Delete this row. Use a logical expression, and not the row number. [1 point]

In [268]:
DataWork.query("Industry=='NaN'")
#I don't know why this isn't working. When I look at the dataset, I SEE the row, but I can't query it! I can query other NaN values, but
# not this one for some godforsaken reason.

Unnamed: 0,Industry,Size,OrgType,PremiumCoverage,PremiumProportion,PartTimeInsure,HealthEducation,WorkFromHome,Under30,Over60,PercentFemale,Hourly,ATypShift,Remote,Union,AnnualTurnover,gender_balance
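
A likely explanation for the empty result: `.map()` in Problem 4 leaves any unmapped industry code as a true missing value (`np.nan`), not the text `'NaN'`, so comparing against the string never matches. Missing values have to be tested with `isna()` or `notna()`. A minimal sketch on a made-up frame (`toy` is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature with one truly missing industry.
toy = pd.DataFrame({
    "Industry": ["Hospitals", np.nan, "Retail"],
    "Size": ["Large", "Small", "Small"],
})

# Comparing against the string 'NaN' finds nothing.
empty = toy.query("Industry == 'NaN'")

# A logical expression that keeps only rows with a non-missing industry.
kept = toy[toy["Industry"].notna()]

# The same filter written as a query (python engine allows method calls).
kept_query = toy.query("Industry.notna()", engine="python")
print(kept)
```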


## Problem 10
Create a new feature named `gender_balance` that has three categories: "Mostly men" for workplaces with between 0% and 35% female employees, "Balanced" for workplaces with more than 35% and at most 65% female employees, and "Mostly women" for workplaces with more than 65% female employees. [1 point]

In [240]:
#You probably wanted me to use pd.cut() for this, but it was giving me errors because of the NaN values and I couldn't fix it
gender_balance=[]
for i in range(0,2843):
    if DataWork.PercentFemale[i] == 'NaN':
        gender_balance.append('Missing')
    elif 0 <= DataWork.PercentFemale[i] and DataWork.PercentFemale[i] <=35:
        gender_balance.append('Mostly men')
    elif 35 < DataWork.PercentFemale[i] and DataWork.PercentFemale[i] <=65:
        gender_balance.append('Balanced')
    else:
        gender_balance.append('Mostly women')


In [241]:
DataWork['gender_balance'] = gender_balance

In [242]:
DataWork

Unnamed: 0,Industry,Size,OrgType,PremiumCoverage,PremiumProportion,PartTimeInsure,HealthEducation,WorkFromHome,Under30,Over60,PercentFemale,Hourly,ATypShift,Remote,Union,AnnualTurnover,gender_balance
0,Hospitals,Large,NonProfit,PartialCoverage,Larger,No,Yes,Yes,25,20,85,60,40.0,15,0,22,Mostly women
1,Hospitals,Large,NonProfit,PartialCoverage,Same,Yes,Yes,Yes,,,90,90,997.0,,0,,Mostly women
2,Hospitals,Large,NonProfit,FullCoverage,Same,Yes,Yes,Yes,35,4,,,40.0,15,,,Missing
3,Hospitals,Medium,PrivateForProfit,FullCoverage,Smaller,Yes,No,No,50,15,50,85,75.0,0,0,,Balanced
4,Hospitals,Medium,NonProfit,FullCoverage,Same,Yes,Yes,Yes,50,40,60,60,40.0,30,0,28,Balanced
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2838,Public Admin,Medium,State/LocalGov,FullCoverage,Same,Yes,Yes,Missing,,,,,999.0,,,,Missing
2839,Public Admin,Medium,State/LocalGov,PartialCoverage,Same,Yes,Yes,No,,,,,997.0,,,,Missing
2840,Public Admin,Large,State/LocalGov,PartialCoverage,Same,Yes,Yes,Yes,27,,61,,997.0,,,,Balanced
2841,Public Admin,Large,State/LocalGov,PartialCoverage,Same,Yes,No,Missing,,,,,999.0,,,,Missing
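
For comparison, here is how the same recode might look with `pd.cut()`. The errors mentioned above probably came from the `'NaN'` strings, since `pd.cut()` needs numeric input; converting with `pd.to_numeric(..., errors='coerce')` first turns them into real missing values, which `pd.cut()` passes through as missing. This is a sketch on made-up values, not the survey column:

```python
import pandas as pd

# Hypothetical percentages, including a 'NaN' string like the one in the working data.
raw = pd.Series([0, 35, 36, 65, 66, 100, "NaN"], dtype=object)

# Coerce text placeholders to real missing values.
pct_female = pd.to_numeric(raw, errors="coerce")

# Bins are (0, 35], (35, 65], (65, 100]; include_lowest also admits exactly 0.
gender_balance = pd.cut(
    pct_female,
    bins=[0, 35, 65, 100],
    labels=["Mostly men", "Balanced", "Mostly women"],
    include_lowest=True,
)
print(gender_balance)
```

This also avoids the hard-coded row count in the loop, and the result is already a `category` column.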


## Problem 11
Change the data type of all categorical features in the working data from "object" to "category". [1 point]

OrgType = OC3 categorical

PremiumCoverage = HI1 categorical

PremiumProportion = HI2 categorical

PartTimeInsure = HI3 categorical

HealthEducation = CP1 categorical

WorkFromHome = WL6 categorical


In [243]:
categorical = ['OrgType','PremiumCoverage','PremiumProportion','PartTimeInsure','HealthEducation','WorkFromHome']

In [244]:
DataWork[categorical] = DataWork[categorical].astype('category')

## Problem 12
Filter the data to only those rows that represent small workplaces that allow employees to work from home. Then report how many of these workplaces offer full insurance, partial insurance, and no insurance. Use a function that reports the percent, cumulative count, and cumulative percent in addition to the counts. [1 point]

In [245]:
DataWork.query("Size=='Small' & WorkFromHome=='Yes'" )

Unnamed: 0,Industry,Size,OrgType,PremiumCoverage,PremiumProportion,PartTimeInsure,HealthEducation,WorkFromHome,Under30,Over60,PercentFemale,Hourly,ATypShift,Remote,Union,AnnualTurnover,gender_balance
5,Hospitals,Small,NonProfit,FullCoverage,Same,Yes,Yes,Yes,20,25,65,80,25.0,5,0,15,Balanced
10,Hospitals,Small,NonProfit,FullCoverage,Same,Yes,Yes,Yes,,,,,997.0,,,,Missing
20,Hospitals,Small,NonProfit,NoCoverage,Missing,No,No,Yes,20,20,66,90,20.0,5,0,5,Mostly women
22,Hospitals,Small,NonProfit,FullCoverage,Same,Yes,Yes,Yes,25,8,82,83,997.0,,7,15,Mostly women
30,Hospitals,Small,NonProfit,PartialCoverage,Larger,No,No,Yes,,,,,997.0,,0,,Missing
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2819,Public Admin,Small,State/LocalGov,FullCoverage,Same,Yes,Yes,Yes,10,15,7,15,75.0,2,99,25,Mostly men
2822,Public Admin,Small,State/LocalGov,Missing,Missing,Missing,Yes,Yes,,,,,997.0,,,,Missing
2826,Public Admin,Small,State/LocalGov,FullCoverage,Same,Missing,Yes,Yes,,,,,997.0,,,,Missing
2827,Public Admin,Small,FedGov,FullCoverage,Missing,Yes,Missing,Yes,5,5,5,95,5.0,0,90,10,Mostly men


In [246]:
DataWork.query("Size=='Small' & WorkFromHome=='Yes'" ).stb.freq(['PremiumCoverage'])

Unnamed: 0,PremiumCoverage,count,percent,cumulative_count,cumulative_percent
0,FullCoverage,324,45.698166,324,45.698166
1,PartialCoverage,310,43.723554,634,89.421721
2,NoCoverage,66,9.308886,700,98.730606
3,Missing,9,1.269394,709,100.0


## Problem 13
Anything that can be done in SQL can be done with `pandas`. The next several questions ask you to write `pandas` code to match a given SQL query. But to check that the SQL query and `pandas` code yield the same result, create a new database using the `sqlite3` package and input the cleaned WHA data as a table in this database. (See module 6 for a discussion of SQLite in Python.) [1 point]

In [247]:
DataWork.columns
#making these data types floats because the NaN values are causing me problems down the line
DataWork['Over60']=DataWork['Over60'].astype('float')
DataWork['Under30']=DataWork['Under30'].astype('float')
DataWork['PercentFemale']=DataWork['PercentFemale'].astype('float')

In [131]:
WHA_SQL= sqlite3.connect("WHA_sql.db")

In [248]:
DataWork.to_sql('WHA', WHA_SQL, index=False, chunksize=1000, if_exists='replace')
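
As a quick sanity check, the table can be read straight back out. The sketch below uses an in-memory database and a made-up two-row frame (both assumptions for illustration); the point is that `pd.read_sql_query()` takes the connection object as its second argument, not the table name:

```python
import sqlite3

import pandas as pd

# Hypothetical miniature of the cleaned data.
toy = pd.DataFrame({
    "Industry": ["Hospitals", "Retail"],
    "PercentFemale": [85.0, None],
})

# ":memory:" creates a throwaway database instead of a file on disk.
conn = sqlite3.connect(":memory:")
toy.to_sql("WHA", conn, index=False, if_exists="replace")

# Pass the connection, not the table name, when querying.
back = pd.read_sql_query("SELECT * FROM WHA", conn)
conn.close()
print(back)
```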

## Problem 14
Write `pandas` code that replicates the output of the following SQL code:
```
SELECT size, type, premiums AS insurance, percent_female FROM whpps
WHERE industry = 'Hospitals' AND premium_change='Smaller'
ORDER BY percent_female DESC;
```
For each of these queries, your feature names might be different from the ones listed in the query, depending on the names you chose in problem 3.
[2 points]

In [133]:
WHA_cursor = WHA_SQL.cursor()

In [255]:
#Checking the SQL query first
sql_query = """
SELECT Size, OrgType, PremiumCoverage AS insurance, PercentFemale FROM WHA WHERE Industry='Hospitals' AND
                   PremiumProportion='Smaller' 
                   ORDER BY PercentFemale DESC
"""
pd.read_sql_query(sql_query, WHA_SQL)

Unnamed: 0,Size,OrgType,insurance,PercentFemale
0,Medium,NonProfit,FullCoverage,89.0
1,Large,NonProfit,PartialCoverage,80.0
2,Large,NonProfit,PartialCoverage,80.0
3,Small,NonProfit,FullCoverage,75.0
4,Medium,NonProfit,PartialCoverage,65.0
5,Medium,PrivateForProfit,FullCoverage,50.0
6,Medium,Missing,PartialCoverage,
7,Medium,NonProfit,PartialCoverage,
8,Medium,NonProfit,FullCoverage,
9,Medium,NonProfit,FullCoverage,


In [250]:
#Now for the pandas version
DataDupe=DataWork.rename(columns={'PremiumCoverage':'insurance'},inplace=False)

In [251]:
DataDupe

Unnamed: 0,Industry,Size,OrgType,insurance,PremiumProportion,PartTimeInsure,HealthEducation,WorkFromHome,Under30,Over60,PercentFemale,Hourly,ATypShift,Remote,Union,AnnualTurnover,gender_balance
0,Hospitals,Large,NonProfit,PartialCoverage,Larger,No,Yes,Yes,25.0,20.0,85.0,60,40.0,15,0,22,Mostly women
1,Hospitals,Large,NonProfit,PartialCoverage,Same,Yes,Yes,Yes,,,90.0,90,997.0,,0,,Mostly women
2,Hospitals,Large,NonProfit,FullCoverage,Same,Yes,Yes,Yes,35.0,4.0,,,40.0,15,,,Missing
3,Hospitals,Medium,PrivateForProfit,FullCoverage,Smaller,Yes,No,No,50.0,15.0,50.0,85,75.0,0,0,,Balanced
4,Hospitals,Medium,NonProfit,FullCoverage,Same,Yes,Yes,Yes,50.0,40.0,60.0,60,40.0,30,0,28,Balanced
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2838,Public Admin,Medium,State/LocalGov,FullCoverage,Same,Yes,Yes,Missing,,,,,999.0,,,,Missing
2839,Public Admin,Medium,State/LocalGov,PartialCoverage,Same,Yes,Yes,No,,,,,997.0,,,,Missing
2840,Public Admin,Large,State/LocalGov,PartialCoverage,Same,Yes,Yes,Yes,27.0,,61.0,,997.0,,,,Balanced
2841,Public Admin,Large,State/LocalGov,PartialCoverage,Same,Yes,No,Missing,,,,,999.0,,,,Missing


In [252]:
DupeTable=DataDupe[['Size', 'OrgType','insurance','PercentFemale','Industry','PremiumProportion']].query(
    'Industry=="Hospitals" & PremiumProportion=="Smaller"').sort_values('PercentFemale')
DupeTable[['Size','OrgType','insurance','PercentFemale']]

Unnamed: 0,Size,OrgType,insurance,PercentFemale
3,Medium,PrivateForProfit,FullCoverage,50.0
191,Medium,NonProfit,PartialCoverage,65.0
229,Small,NonProfit,FullCoverage,75.0
187,Large,NonProfit,PartialCoverage,80.0
214,Large,NonProfit,PartialCoverage,80.0
320,Medium,NonProfit,FullCoverage,89.0
11,Medium,Missing,PartialCoverage,
48,Medium,NonProfit,PartialCoverage,
51,Medium,NonProfit,FullCoverage,
75,Medium,NonProfit,FullCoverage,
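
Note that the SQL query orders by percent female in descending order, while the `pandas` result above is ascending. Passing `ascending=False` to `sort_values()` should bring the two into line; missing values still land at the end, which matches where SQLite puts NULLs in a descending sort. A sketch on made-up rows (the `toy` frame is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the working data.
toy = pd.DataFrame({
    "Industry": ["Hospitals", "Hospitals", "Hospitals", "Retail", "Hospitals"],
    "PremiumProportion": ["Smaller", "Smaller", "Smaller", "Smaller", "Same"],
    "Size": ["Medium", "Large", "Small", "Small", "Large"],
    "PercentFemale": [50.0, 89.0, np.nan, 75.0, 60.0],
})

result = (
    toy.query("Industry == 'Hospitals' & PremiumProportion == 'Smaller'")
       .sort_values("PercentFemale", ascending=False)  # ORDER BY ... DESC
       .loc[:, ["Size", "PercentFemale"]]              # SELECT list
       .reset_index(drop=True)
)
print(result)
```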


## Problem 15
Write `pandas` code that replicates the output of the following SQL code:
```
SELECT industry, 
    AVG(percent_female) as percent_female, 
    AVG(percent_under30) as percent_under30,
    AVG(percent_over60) as percent_over60
FROM whpps
GROUP BY industry
ORDER BY percent_female DESC;
```
[2 points]

In [260]:
#Notes: Had to go all the way back to the beginning to rename 60andUp as Over60 because an unquoted SQL
#identifier can't start with a digit
query_15="""
SELECT Industry, 
    AVG(PercentFemale) as percent_female, 
    AVG(Under30) as percent_under30,
    AVG(Over60) as percent_over60
FROM WHA
GROUP BY Industry
ORDER BY percent_female DESC"""
pd.read_sql_query(query_15, WHA_SQL)

Unnamed: 0,Industry,percent_female,percent_under30,percent_over60
0,Public Services,80.657143,25.745665,11.34957
1,Hospitals,76.427027,27.213793,16.489655
2,Entertainment,53.804416,38.566343,11.544872
3,Entreprenurial,50.632184,23.821752,12.465465
4,Public Admin,39.056738,21.015625,15.015385
5,Retail,32.657258,29.108696,12.584034
6,Industrial Trades,20.328605,22.257143,10.690355
7,,,,


In [258]:
DataWork.groupby('Industry').agg({'PercentFemale':'mean',
                               'Under30':'mean',
                               'Over60':'mean',}).sort_values('PercentFemale',ascending=False)

Unnamed: 0_level_0,PercentFemale,Under30,Over60
Industry,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Public Services,80.657143,25.745665,11.34957
Hospitals,76.427027,27.213793,16.489655
Entertainment,53.804416,38.566343,11.544872
Entreprenurial,50.632184,23.821752,12.465465
Public Admin,39.056738,21.015625,15.015385
Retail,32.657258,29.108696,12.584034
Industrial Trades,20.328605,22.257143,10.690355
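
The two results differ in small ways: the SQL output includes a row for the missing industry and uses the `AS` aliases as column names, while `groupby()` drops missing keys by default. A sketch of one way to close that gap, assuming pandas 1.1 or later for `dropna=False` (the `toy` frame is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the working data, with one missing industry.
toy = pd.DataFrame({
    "Industry": ["Hospitals", "Hospitals", "Retail", np.nan],
    "PercentFemale": [80.0, 70.0, 30.0, np.nan],
    "Under30": [20.0, 30.0, 40.0, np.nan],
    "Over60": [10.0, 20.0, 15.0, np.nan],
})

result = (
    toy.groupby("Industry", dropna=False)  # keep the NULL group, as SQL does
       .agg(
           percent_female=("PercentFemale", "mean"),
           percent_under30=("Under30", "mean"),
           percent_over60=("Over60", "mean"),
       )
       .sort_values("percent_female", ascending=False)
       .reset_index()
)
print(result)
```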


## Problem 16
Write `pandas` code that replicates the output of the following SQL code:
```
SELECT gender_balance, premiums, COUNT(*)
FROM whpps
GROUP BY gender_balance, premiums
HAVING gender_balance is NOT NULL and premiums is NOT NULL;
```
[2 points]

In [264]:
query_16="""
SELECT gender_balance, PremiumCoverage, COUNT(*)
FROM WHA
GROUP BY gender_balance, PremiumCoverage
HAVING gender_balance is NOT 'Missing' and PremiumCoverage is NOT 'Missing'
"""
pd.read_sql_query(query_16, WHA_SQL)

Unnamed: 0,gender_balance,PremiumCoverage,COUNT(*)
0,Balanced,FullCoverage,226
1,Balanced,NoCoverage,77
2,Balanced,PartialCoverage,271
3,Mostly men,FullCoverage,301
4,Mostly men,NoCoverage,91
5,Mostly men,PartialCoverage,332
6,Mostly women,,1
7,Mostly women,FullCoverage,267
8,Mostly women,NoCoverage,107
9,Mostly women,PartialCoverage,333


In [287]:
DataDupe2=DataWork[['gender_balance','PremiumCoverage']].query('gender_balance !="Missing" & PremiumCoverage != "Missing"')

In [296]:
pd.DataFrame(DataDupe2.groupby(['gender_balance','PremiumCoverage']).size())

Unnamed: 0_level_0,Unnamed: 1_level_0,0
gender_balance,PremiumCoverage,Unnamed: 2_level_1
Balanced,FullCoverage,226
Balanced,Missing,0
Balanced,NoCoverage,77
Balanced,PartialCoverage,271
Mostly men,FullCoverage,301
Mostly men,Missing,0
Mostly men,NoCoverage,91
Mostly men,PartialCoverage,332
Mostly women,FullCoverage,267
Mostly women,Missing,0
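
The zero-count "Missing" rows appear because `PremiumCoverage` is a `category` column: filtering removes the rows but not the category, and `groupby()` can list every category whether or not it occurs. Passing `observed=True` is one way to keep only the combinations that are actually present, which is closer to what the SQL `GROUP BY` returns. A sketch on made-up rows (the `toy` frame and the `count` column name are hypothetical):

```python
import pandas as pd

# Hypothetical miniature with a categorical coverage column.
toy = pd.DataFrame({
    "gender_balance": ["Balanced", "Balanced", "Mostly men", "Missing"],
    "PremiumCoverage": pd.Categorical(
        ["FullCoverage", "NoCoverage", "FullCoverage", "Missing"]
    ),
})

# Drop the rows labeled 'Missing' in either column.
mask = (toy["gender_balance"] != "Missing") & (toy["PremiumCoverage"] != "Missing")
subset = toy[mask]

# observed=True skips categories that no longer occur after filtering.
counts = (
    subset.groupby(["gender_balance", "PremiumCoverage"], observed=True)
          .size()
          .reset_index(name="count")
)
print(counts)
```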
