# DS 3001 - Project 1: GSS Data

#### Glory Gurrola and Divya Kuruvilla

## Research Question: How does social class or income affect access to health care?
Strategy:
- Pull the GSS columns relating to health care, income, and related topics, then analyze the trends we find.

1. Rename variables so their meaning is clearer
2. Replace NaNs
3. Make some pretty graphs

In [109]:
import pandas as pd
data = pd.read_csv('./Project_Data.csv')

In [110]:
data.head()

Unnamed: 0,wrkstat,income,satfin,finalter,health,helpsick,hlthinsr,doc13,doc14,health1,...,hlthcare,hlthall,hlthcovr,hlthtype,hrdshp6,hlthacc1,hlthacc4,hlthacc3,hlthacc2,prvdhlth
0,working full time,,not satisfied at all,better,good,,,,,,...,,,,,,,,,,
1,retired,,more or less satisfied,stayed same,fair,,,,,,...,,,,,,,,,,
2,working part time,,pretty well satisfied,better,excellent,,,,,,...,,,,,,,,,,
3,working full time,,not satisfied at all,stayed same,good,,,,,,...,,,,,,,,,,
4,keeping house,,pretty well satisfied,better,good,,,,,,...,,,,,,,,,,


In [111]:
# .shape gives the number of rows and columns
print(data.shape, '\n')

# .columns gives the names of the columns
# There are 25 columns, as specified in the data.shape output
print(data.columns.tolist())

(72426, 25) 

['wrkstat', 'income', 'satfin', 'finalter', 'health', 'helpsick', 'hlthinsr', 'doc13', 'doc14', 'health1', 'hlthplan', 'diffcare', 'insrlmts', 'insrchng', 'emphlth', 'hlthcare', 'hlthall', 'hlthcovr', 'hlthtype', 'hrdshp6', 'hlthacc1', 'hlthacc4', 'hlthacc3', 'hlthacc2', 'prvdhlth']


### Changing Variable Names
We rename the variables in the dataframe to make it clearer what each one represents.
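As a rough sanity check on the renaming step, the sketch below uses a small made-up frame (`toy`) with three of the raw column names and a subset of the mapping. `rename` silently ignores keys that do not match a column, so passing `errors="raise"` is one way to catch a mistyped GSS name. The frame and the variable names `toy`, `new_names`, and `unmatched` are illustrative assumptions, not part of the project data.

```python
import pandas as pd

# Hypothetical miniature frame with three of the raw GSS column names.
toy = pd.DataFrame({
    "wrkstat": ["retired"],
    "income": [None],
    "satfin": ["not satisfied at all"],
})

# A subset of the mapping used in this notebook.
new_names = {
    "wrkstat": "WorkStatus",
    "income": "Income",
    "satfin": "FinancialSatisfaction",
}

# Keys that match no existing column would otherwise be skipped silently.
unmatched = [old for old in new_names if old not in toy.columns]
print("Unmatched keys:", unmatched)

# errors="raise" makes pandas raise a KeyError for any unmatched key.
renamed = toy.rename(columns=new_names, errors="raise")
print(renamed.columns.tolist())
```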


In [112]:
print(data.columns.tolist())
# Names of all of the variables before renaming

['wrkstat', 'income', 'satfin', 'finalter', 'health', 'helpsick', 'hlthinsr', 'doc13', 'doc14', 'health1', 'hlthplan', 'diffcare', 'insrlmts', 'insrchng', 'emphlth', 'hlthcare', 'hlthall', 'hlthcovr', 'hlthtype', 'hrdshp6', 'hlthacc1', 'hlthacc4', 'hlthacc3', 'hlthacc2', 'prvdhlth']


In [113]:
print(data.columns.tolist())

data = data.rename(columns={'wrkstat': 'WorkStatus',
                            'income':'Income' ,
                            'satfin': 'FinancialSatisfaction',
                            'finalter': 'FinancialSitChange',
                            'health':'HealthCondition',
                            'helpsick':'GovPayHealthBills',
                            'hlthinsr':'HealthInsurance98',
                            'doc13': 'DeniedTreatmentWorry', 
                            'doc14':'DocCostOverCare', 
                            'health1': 'GeneralHealth', 
                            'hlthplan':'HealthInsurance02', 
                            'diffcare':'DifficultCare', 
                            'insrlmts': 'InsuranceLimits', 
                            'insrchng':'SwitchInsurancePlans', 
                            'emphlth':'EmployerCoverage', 
                            'hlthcare':'GovResponsibleForSick', 
                            'hlthall':'UniversalHealthCare', 
                            'hlthcovr':'HealthInsurance08', 
                            'hlthtype':'InsuranceSource', 
                            'hrdshp6':'LackedHealthInsurance', 
                            'hlthacc1':'RichVsPoorAccess', 
                            'hlthacc4':'CitVsNCitAccess', 
                            'hlthacc3':'MenVsWomenAccess', 
                            'hlthacc2':'OldVsYoungAccess', 
                            'prvdhlth':'WhoShouldProvideHealthCare'})

['wrkstat', 'income', 'satfin', 'finalter', 'health', 'helpsick', 'hlthinsr', 'doc13', 'doc14', 'health1', 'hlthplan', 'diffcare', 'insrlmts', 'insrchng', 'emphlth', 'hlthcare', 'hlthall', 'hlthcovr', 'hlthtype', 'hrdshp6', 'hlthacc1', 'hlthacc4', 'hlthacc3', 'hlthacc2', 'prvdhlth']


### Cleaning Variables
We are cleaning the variables in the dataframe to get rid of NaN and make the values easier to analyze.
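Most of the cells below repeat the same steps: count the missing values, fill them with a label, and print the value counts. One possible way to package that pattern is a small helper like the sketch below. The function name `fill_no_response` and the toy frame are hypothetical; the cells that follow write the steps out by hand.

```python
import pandas as pd

def fill_no_response(df, column, label="no response"):
    """Return a copy of df with NaN in `column` replaced by `label`."""
    n_missing = int(df[column].isna().sum())
    print(f"{column}: {n_missing} missing values before cleaning")
    out = df.copy()
    # Assigning the filled column back avoids chained in-place modification.
    out[column] = out[column].fillna(label)
    print(out[column].value_counts(), "\n")
    return out

# Made-up example data, not taken from the GSS file.
toy = pd.DataFrame({"HealthCondition": ["good", None, "fair", None]})
cleaned = fill_no_response(toy, "HealthCondition")
```

Returning a new frame instead of modifying the input keeps each cleaning step easy to rerun from a notebook cell.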


In [114]:
# Cleaning the FinancialSitChange variable
# From the GSS codebook, the FinancialSitChange variable stands for "change in financial situation"
# The question asked was "During the last few years, has your financial situation been getting better, worse, or has it stayed the same?"
# This question was asked in all 34 years of the survey

# Check for the Number of Missing Values 
print(data['FinancialSitChange'].unique())
missing = data['FinancialSitChange'].isnull().sum()
print("Missing values Before Cleaning: ", missing, '\n')

# Replace NaN with the label 'no response', since this is a categorical variable.
# A labeled category is easier to show in graphs and visualizations than NaN.
# Assigning the result back avoids relying on chained in-place modification.
data['FinancialSitChange'] = data['FinancialSitChange'].fillna('no response')

# 36 rows have the value 'finalter' in the 'FinancialSitChange' column, so we drop those rows.
# The original column name is not a valid answer and would not be helpful in our visualizations.
# .copy() keeps the filtered result independent of the original frame.
data = data[data['FinancialSitChange'] != 'finalter'].copy()

# Count the number of instances of each unique value in the column 
print("Number of Instances After Cleaning: ", '\n', data['FinancialSitChange'].value_counts(), '\n')

# Check for Unique Types After Cleaning
unique_types_cleaned = data['FinancialSitChange'].unique()
print("Unique 'Type' values after cleaning:", unique_types_cleaned, '\n')

# Print missing values after cleaning
missing = data['FinancialSitChange'].isnull().sum()
print("Missing values After Cleaning: ", missing, '\n')




['better' 'stayed same' 'worse' nan 'finalter']
Missing values Before Cleaning:  4795 

Number of Instances After Cleaning:  
 FinancialSitChange
stayed same    26780
better         25358
worse          15457
no response     4795
Name: count, dtype: int64 

Unique 'Type' values after cleaning: ['better' 'stayed same' 'worse' 'no response'] 

Missing values After Cleaning:  0 
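A value such as 'finalter' is the raw column name showing up as data, which can happen when header rows are repeated inside a CSV file. Whether that is the cause here is an assumption on our part. Under that assumption, a minimal sketch for flagging such rows across all columns, before renaming, could look like this (the toy frame and the names `is_header_row` and `toy_clean` are made up for illustration):

```python
import pandas as pd

# Hypothetical toy data: the raw column names appear as values in row 1,
# as they would if a header row were repeated inside the file.
toy = pd.DataFrame({
    "finalter": ["better", "finalter", "worse"],
    "health": ["good", "health", "fair"],
})

# A row is flagged when any cell equals the name of its own column.
is_header_row = pd.Series(False, index=toy.index)
for col in toy.columns:
    is_header_row = is_header_row | (toy[col] == col)

print("Stray header rows:", int(is_header_row.sum()))
toy_clean = toy[~is_header_row].reset_index(drop=True)
print(toy_clean)
```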



In [115]:

# Cleaning the Income Variable 
# From the GSS codebook, the income variable stands for "total family income"
# The question asked was "In which of these groups did your total family income, from all sources, fall last year before taxes, that is? Just tell me the letter."
# The data provided from the GSS had ranges of incomes.

# Check for the Number of Missing Values 
print(data['Income'].unique())
missing = data['Income'].isnull().sum()
print("Missing values Before Cleaning: ", missing, '\n')

# To clean the income variable, we drop the rows where Income is NaN.
# Our research question relates income to the health care variables,
# so rows without an income value cannot be used in the visualizations.
data = data.dropna(subset=['Income'])

# Count the number of instances of each unique value in the column 
print("Number of Instances After Cleaning: ", '\n', data['Income'].value_counts(), '\n')

# Check for Unique Types After Cleaning
unique_types_cleaned = data['Income'].unique()
print("Unique 'Type' values after cleaning:", unique_types_cleaned, '\n')

#Print missing values after cleaning
missing = data['Income'].isnull().sum()
print("Missing values After Cleaning: ", missing, '\n')

[nan '$10,000 to $14,999' '$7,000 to $7,999' '$4,000 to $4,999'
 '$1,000 to $2,999' '$15,000 to $19,999' '$5,000 to $5,999'
 '$20,000 to $24,999' '$3,000 to $3,999' 'under $1,000' '$8,000 to $9,999'
 '$25,000 or more' '$6,000 to $6,999']
Missing values Before Cleaning:  8951 

Number of Instances After Cleaning:  
 Income
$25,000 or more       34785
$10,000 to $14,999     6850
$20,000 to $24,999     5528
$15,000 to $19,999     5301
$8,000 to $9,999       2285
$1,000 to $2,999       1412
$7,000 to $7,999       1315
$5,000 to $5,999       1314
$3,000 to $3,999       1309
$6,000 to $6,999       1249
$4,000 to $4,999       1189
under $1,000            902
Name: count, dtype: int64 

Unique 'Type' values after cleaning: ['$10,000 to $14,999' '$7,000 to $7,999' '$4,000 to $4,999'
 '$1,000 to $2,999' '$15,000 to $19,999' '$5,000 to $5,999'
 '$20,000 to $24,999' '$3,000 to $3,999' 'under $1,000' '$8,000 to $9,999'
 '$25,000 or more' '$6,000 to $6,999'] 

Missing values After Cleaning:  0 
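The income brackets are stored as strings, so sorting or plotting them directly would use alphabetical order rather than dollar order. One option, sketched below, is an ordered categorical. The bracket labels are copied from the output above; the toy frame and the name `income_order` are assumptions for illustration only.

```python
import pandas as pd

# Bracket labels from the output above, listed from lowest to highest.
income_order = [
    "under $1,000", "$1,000 to $2,999", "$3,000 to $3,999",
    "$4,000 to $4,999", "$5,000 to $5,999", "$6,000 to $6,999",
    "$7,000 to $7,999", "$8,000 to $9,999", "$10,000 to $14,999",
    "$15,000 to $19,999", "$20,000 to $24,999", "$25,000 or more",
]

# Made-up rows standing in for the cleaned data.
toy = pd.DataFrame({"Income": ["$25,000 or more", "under $1,000", "$8,000 to $9,999"]})
toy["Income"] = pd.Categorical(toy["Income"], categories=income_order, ordered=True)

# Sorting now follows bracket order instead of alphabetical order.
sorted_income = toy.sort_values("Income")["Income"].tolist()
print(sorted_income)
```

Note that the top bracket, '$25,000 or more', holds more than half of the remaining rows, so this variable cannot distinguish between middle and high incomes.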



In [116]:

# Cleaning the 'HealthCondition' variable 
# From the GSS codebook, the variable stands for the "condition of health."
# The question asked was "Would you say your own health, in general, is excellent, good, fair, or poor?"

# Check for the Number of Missing Values
print(data['HealthCondition'].unique())
missing = data['HealthCondition'].isnull().sum()
print("Missing values Before Cleaning: ", missing, '\n')

# Replace NaN with the label 'no response', since this is a categorical variable.
# A labeled category is easier to show in graphs and visualizations than NaN.
data['HealthCondition'] = data['HealthCondition'].fillna('no response')

# Count the number of instances of each unique value in the column 
print("Number of Instances After Cleaning: ", '\n', data['HealthCondition'].value_counts(), '\n')

# Check for Unique Types After Cleaning
unique_types_cleaned = data['HealthCondition'].unique()
print("Unique 'Type' values after cleaning:", unique_types_cleaned, '\n')

#Print missing values after cleaning
missing = data['HealthCondition'].isnull().sum()
print("Missing values After Cleaning: ", missing, '\n')


['fair' 'good' 'excellent' 'poor' nan]
Missing values Before Cleaning:  15409 

Number of Instances After Cleaning:  
 HealthCondition
good           22411
no response    15409
excellent      13873
fair            9204
poor            2542
Name: count, dtype: int64 

Unique 'Type' values after cleaning: ['fair' 'good' 'excellent' 'poor' 'no response'] 

Missing values After Cleaning:  0 
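With Income and HealthCondition both cleaned, one way to start on the research question is a row-normalized cross-tabulation that leaves out the 'no response' rows. The sketch below uses made-up rows, so the numbers it prints say nothing about the GSS; the names `answered` and `table` are hypothetical.

```python
import pandas as pd

# Made-up rows standing in for the cleaned data.
toy = pd.DataFrame({
    "Income": ["under $1,000", "under $1,000", "$25,000 or more",
               "$25,000 or more", "$25,000 or more"],
    "HealthCondition": ["poor", "no response", "excellent", "good", "excellent"],
})

# Keep only rows where the health question was answered.
answered = toy[toy["HealthCondition"] != "no response"]

# normalize="index" turns each income row into shares that sum to 1.
table = pd.crosstab(answered["Income"], answered["HealthCondition"], normalize="index")
print(table.round(2))
```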



In [117]:
# Cleaning the 'WhoShouldProvideHealthCare' variable
# From the GSS codebook, this variable stands for "who provides for sick people."
# The question asked was "People have different opinions on who should provide services in America. Who do you think should primarily provide... Healthcare for the sick?  Should it be...  "

# Check for the Number of Missing Values
print(data['WhoShouldProvideHealthCare'].unique())
missing = data['WhoShouldProvideHealthCare'].isnull().sum()
print("Missing values Before Cleaning: ", missing, '\n')

# Replace NaN with the label 'no response', since WhoShouldProvideHealthCare is a categorical variable.
# A labeled category is more helpful than NaN when we make graphs and visualizations.
data['WhoShouldProvideHealthCare'] = data['WhoShouldProvideHealthCare'].fillna('no response')

# Count the number of instances of each unique value in the column 
print("Number of Instances After Cleaning: ", '\n', data['WhoShouldProvideHealthCare'].value_counts(), '\n')

# Check for Unique Types After Cleaning
unique_types_cleaned = data['WhoShouldProvideHealthCare'].unique()
print("Unique 'Type' values after cleaning:", unique_types_cleaned, '\n')

#Print missing values after cleaning
missing = data['WhoShouldProvideHealthCare'].isnull().sum()
print("Missing values After Cleaning: ", missing, '\n')


[nan 'private companies/for-profit organizations' 'government'
 'non-profit organizations/charities/cooperatives'
 'family, relatives or friends' 'religious organizations']
Missing values Before Cleaning:  61201 

Number of Instances After Cleaning:  
 WhoShouldProvideHealthCare
no response                                        61201
government                                          1311
private companies/for-profit organizations           479
non-profit organizations/charities/cooperatives      261
family, relatives or friends                         166
religious organizations                               21
Name: count, dtype: int64 

Unique 'Type' values after cleaning: ['no response' 'private companies/for-profit organizations' 'government'
 'non-profit organizations/charities/cooperatives'
 'family, relatives or friends' 'religious organizations'] 

Missing values After Cleaning:  0 



In [118]:
# Cleaning the 'WorkStatus' variable 
# From the GSS codebook, this variable stands for the "labor force status."
# The question asked was "Last week were you working full time, part time, going to school, keeping house, or what?"

# Check for the Number of Missing Values
print(data['WorkStatus'].unique())
missing = data['WorkStatus'].isnull().sum()
print("Missing values Before Cleaning: ", missing, '\n')

# Replace NaN with the label 'no response', since this is a categorical variable.
# A labeled category is more helpful than NaN when we make graphs and visualizations.
data['WorkStatus'] = data['WorkStatus'].fillna('no response')

# Count the number of instances of each unique value in the column 
print("Number of Instances After Cleaning: ", '\n', data['WorkStatus'].value_counts(), '\n')

# Check for Unique Types After Cleaning
unique_types_cleaned = data['WorkStatus'].unique()
print("Unique 'Type' values after cleaning:", unique_types_cleaned, '\n')

#Print missing values after cleaning
missing = data['WorkStatus'].isnull().sum()
print("Missing values After Cleaning: ", missing, '\n')

['working full time' 'keeping house' 'retired' 'working part time'
 'in school'
 'with a job, but not at work because of temporary illness, vacation, strike'
 'other' 'unemployed, laid off, looking for work' nan]
Missing values Before Cleaning:  11 

Number of Instances After Cleaning:  
 WorkStatus
working full time                                                             32052
retired                                                                        9087
keeping house                                                                  8922
working part time                                                              6551
unemployed, laid off, looking for work                                         2256
in school                                                                      1825
with a job, but not at work because of temporary illness, vacation, strike     1384
other                                                                          1351
no response                

In [119]:
# Cleaning GeneralHealth Variable 
# GeneralHealth question: "Would you say that in general your health is Excellent, Very good, Good, Fair, or Poor?"

print(data['GeneralHealth'].value_counts())
print()
missing = data['GeneralHealth'].isnull().sum()
print("Missing values Before Cleaning: ", missing, '\n')

# Replace NaN with the label 'no response', since GeneralHealth is a categorical variable.
# A labeled category is more helpful than NaN when we make graphs and visualizations.
data['GeneralHealth'] = data['GeneralHealth'].fillna('no response')

#Print missing values after cleaning
missing = data['GeneralHealth'].isnull().sum()
print("Missing values After Cleaning: ", missing, '\n')

GeneralHealth
good         2956
very good    2800
excellent    2181
fair         1271
poor          202
Name: count, dtype: int64

Missing values Before Cleaning:  54029 

Missing values After Cleaning:  0 



In [120]:
# Cleaning the variable OldVsYoungAccess
# From the GSS codebook, this variable stands for "Old vs. Young Access to Healthcare"
# The question asked was "In the United States, do you think it is easier or harder to get access to health care... for old people than for young people?"

# Check for the Number of Missing Values
print(data['OldVsYoungAccess'].unique())
missing = data['OldVsYoungAccess'].isnull().sum()
print("Missing values Before Cleaning: ", missing, '\n')

# Replace NaN with the label 'no response', since this is a categorical variable.
# A labeled category is more helpful than NaN when we make graphs and visualizations.
data['OldVsYoungAccess'] = data['OldVsYoungAccess'].fillna('no response')

# Count the number of instances of each unique value in the column 
print("Number of Instances After Cleaning: ", '\n', data['OldVsYoungAccess'].value_counts(), '\n')

# Check for Unique Types After Cleaning
unique_types_cleaned = data['OldVsYoungAccess'].unique()
print("Unique 'Type' values after cleaning:", unique_types_cleaned, '\n')

#Print missing values after cleaning
missing = data['OldVsYoungAccess'].isnull().sum()
print("Missing values After Cleaning: ", missing, '\n')

[nan 'much harder' 'about the same' 'somewhat harder' 'much easier'
 'somewhat easier']
Missing values Before Cleaning:  62465 

Number of Instances After Cleaning:  
 OldVsYoungAccess
no response        62465
about the same       354
somewhat harder      224
somewhat easier      202
much harder          116
much easier           78
Name: count, dtype: int64 

Unique 'Type' values after cleaning: ['no response' 'much harder' 'about the same' 'somewhat harder'
 'much easier' 'somewhat easier'] 

Missing values After Cleaning:  0 



In [121]:
# Cleaning the variable MenVsWomenAccess
# From the GSS codebook, this variable stands for "Men vs Women Access to Healthcare"
# The question asked was "In the United States, do you think it is easier or harder to get access to health care... for women than for men?"

# Check for the Number of Missing Values
print(data['MenVsWomenAccess'].unique())
missing = data['MenVsWomenAccess'].isnull().sum()
print("Missing values Before Cleaning: ", missing, '\n')

# Replace NaN with the label 'no response', since this is a categorical variable.
# A labeled category is more helpful than NaN when we make graphs and visualizations.
data['MenVsWomenAccess'] = data['MenVsWomenAccess'].fillna('no response')

# Count the number of instances of each unique value in the column 
print("Number of Instances After Cleaning: ", '\n', data['MenVsWomenAccess'].value_counts(), '\n')

# Check for Unique Types After Cleaning
unique_types_cleaned = data['MenVsWomenAccess'].unique()
print("Unique 'Type' values after cleaning:", unique_types_cleaned, '\n')

#Print missing values after cleaning
missing = data['MenVsWomenAccess'].isnull().sum()
print("Missing values After Cleaning: ", missing, '\n')

[nan 'somewhat harder' 'about the same' 'much harder' 'somewhat easier'
 'much easier']
Missing values Before Cleaning:  62492 

Number of Instances After Cleaning:  
 MenVsWomenAccess
no response        62492
about the same       653
somewhat harder      139
somewhat easier       75
much easier           41
much harder           39
Name: count, dtype: int64 

Unique 'Type' values after cleaning: ['no response' 'somewhat harder' 'about the same' 'much harder'
 'somewhat easier' 'much easier'] 

Missing values After Cleaning:  0 



In [122]:
# Cleaning the variable CitVsNCitAccess
# From the GSS codebook, this variable stands for "Citizen vs Noncitizen Access to Healthcare"
# The question asked was "In the United States, do you think it is easier or harder to get access to health care... for citizens of the United States than for people who do not hold U.S. citizenship?"

# Check for the Number of Missing Values
print(data['CitVsNCitAccess'].unique())
missing = data['CitVsNCitAccess'].isnull().sum()
print("Missing values Before Cleaning: ", missing, '\n')

# Replace NaN with the label 'no response', since this is a categorical variable.
# A labeled category is more helpful than NaN when we make graphs and visualizations.
data['CitVsNCitAccess'] = data['CitVsNCitAccess'].fillna('no response')

# Count the number of instances of each unique value in the column 
print("Number of Instances After Cleaning: ", '\n', data['CitVsNCitAccess'].value_counts(), '\n')

# Check for Unique Types After Cleaning
unique_types_cleaned = data['CitVsNCitAccess'].unique()
print("Unique 'Type' values after cleaning:", unique_types_cleaned, '\n')

#Print missing values after cleaning
missing = data['CitVsNCitAccess'].isnull().sum()
print("Missing values After Cleaning: ", missing, '\n')

[nan 'much easier' 'about the same' 'much harder' 'somewhat easier'
 'somewhat harder']
Missing values Before Cleaning:  62514 

Number of Instances After Cleaning:  
 CitVsNCitAccess
no response        62514
much easier          278
somewhat easier      226
about the same       210
somewhat harder      112
much harder           99
Name: count, dtype: int64 

Unique 'Type' values after cleaning: ['no response' 'much easier' 'about the same' 'much harder'
 'somewhat easier' 'somewhat harder'] 

Missing values After Cleaning:  0 



In [123]:
# Cleaning the variable RichVsPoorAccess
# From the GSS codebook, this variable stands for "Rich vs Poor Access to Healthcare"
# The question asked was "In the United States, do you think it is easier or harder to get access to health care... for rich people than for poor people?"

# Check for the Number of Missing Values
print(data['RichVsPoorAccess'].unique())
missing = data['RichVsPoorAccess'].isnull().sum()
print("Missing values Before Cleaning: ", missing, '\n')

# Replace NaN with the label 'no response', since this is a categorical variable.
# A labeled category is more helpful than NaN when we make graphs and visualizations.
data['RichVsPoorAccess'] = data['RichVsPoorAccess'].fillna('no response')

# Count the number of instances of each unique value in the column 
print("Number of Instances After Cleaning: ", '\n', data['RichVsPoorAccess'].value_counts(), '\n')

# Check for Unique Types After Cleaning
unique_types_cleaned = data['RichVsPoorAccess'].unique()
print("Unique 'Type' values after cleaning:", unique_types_cleaned, '\n')

#Print missing values after cleaning
missing = data['RichVsPoorAccess'].isnull().sum()
print("Missing values After Cleaning: ", missing, '\n')


[nan 'much easier' 'somewhat easier' 'about the same' 'somewhat harder'
 'much harder']
Missing values Before Cleaning:  62444 

Number of Instances After Cleaning:  
 RichVsPoorAccess
no response        62444
much easier          618
somewhat easier      193
about the same       127
somewhat harder       34
much harder           23
Name: count, dtype: int64 

Unique 'Type' values after cleaning: ['no response' 'much easier' 'somewhat easier' 'about the same'
 'somewhat harder' 'much harder'] 

Missing values After Cleaning:  0 
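The four access variables (RichVsPoorAccess, CitVsNCitAccess, MenVsWomenAccess, OldVsYoungAccess) share the same five-point scale and the same cleaning steps, so they could also be handled in one loop. The sketch below fills the missing values and then tabulates the share of each answer among people who answered. The toy rows are invented, and `access_columns`, `scale`, and `summary` are hypothetical names.

```python
import pandas as pd

access_columns = ["RichVsPoorAccess", "CitVsNCitAccess",
                  "MenVsWomenAccess", "OldVsYoungAccess"]
scale = ["much easier", "somewhat easier", "about the same",
         "somewhat harder", "much harder"]

# Made-up rows; real values would come from the cleaned `data` frame.
toy = pd.DataFrame({
    "RichVsPoorAccess": ["much easier", None, "somewhat easier", "much easier"],
    "CitVsNCitAccess": [None, "about the same", "much easier", None],
    "MenVsWomenAccess": ["about the same", "about the same", None, None],
    "OldVsYoungAccess": [None, None, "somewhat harder", "much harder"],
})

shares = {}
for col in access_columns:
    toy[col] = toy[col].fillna("no response")
    answered = toy.loc[toy[col] != "no response", col]
    # Share of each answer among respondents who answered, in scale order.
    shares[col] = answered.value_counts(normalize=True).reindex(scale, fill_value=0.0)

summary = pd.DataFrame(shares)
print(summary.round(2))
```

Reindexing on `scale` keeps the rows in easier-to-harder order, which is usually what a bar chart of these answers needs.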



In [124]:
# Cleaning the variable LackedHealthInsurance
# From the GSS codebook, this variable stands for "lacking health insurance coverage"
# The question asked was "Now I am going to ask about specific hardships. 
    # Did any of the following occur to you since (CURRENT MONTH), (1990/2003)? 
    # Lacked health insurance coverage (e.g. Medicare, Medicaid, Blue Cross, an HMO, etc.)?"

# Check for the Number of Missing Values
print(data['LackedHealthInsurance'].unique())
missing = data['LackedHealthInsurance'].isnull().sum()
print("Missing values Before Cleaning: ", missing, '\n')

# Replace NaN with the label 'no response', since this is a categorical variable.
# A labeled category is more helpful than NaN when we make graphs and visualizations.
data['LackedHealthInsurance'] = data['LackedHealthInsurance'].fillna('no response')

# Count the number of instances of each unique value in the column 
print("Number of Instances After Cleaning: ", '\n', data['LackedHealthInsurance'].value_counts(), '\n')

# Check for Unique Types After Cleaning
unique_types_cleaned = data['LackedHealthInsurance'].unique()
print("Unique 'Type' values after cleaning:", unique_types_cleaned, '\n')

#Print missing values after cleaning
missing = data['LackedHealthInsurance'].isnull().sum()
print("Missing values After Cleaning: ", missing, '\n')


[nan 'no' 'yes']
Missing values Before Cleaning:  61357 

Number of Instances After Cleaning:  
 LackedHealthInsurance
no response    61357
no              1748
yes              334
Name: count, dtype: int64 

Unique 'Type' values after cleaning: ['no response' 'no' 'yes'] 

Missing values After Cleaning:  0 



In [125]:
# Cleaning the variable EmployerCoverage
# From the GSS codebook, this variable stands for "do you receive health insurance from employer"
# The question asked was "Do you receive health insurance from your employer?"

# Check for the Number of Missing Values
print(data['EmployerCoverage'].unique())
missing = data['EmployerCoverage'].isnull().sum()
print("Missing values Before Cleaning: ", missing, '\n')

# Replace NaN with the label 'no response', since this is a categorical variable.
# A labeled category is more helpful than NaN when we make graphs and visualizations.
data['EmployerCoverage'] = data['EmployerCoverage'].fillna('no response')

# Count the number of instances of each unique value in the column 
print("Number of Instances After Cleaning: ", '\n', data['EmployerCoverage'].value_counts(), '\n')

# Check for Unique Types After Cleaning
unique_types_cleaned = data['EmployerCoverage'].unique()
print("Unique 'Type' values after cleaning:", unique_types_cleaned, '\n')

#Print missing values after cleaning
missing = data['EmployerCoverage'].isnull().sum()
print("Missing values After Cleaning: ", missing, '\n')


[nan 'no' 'yes']
Missing values Before Cleaning:  62495 

Number of Instances After Cleaning:  
 EmployerCoverage
no response    62495
yes              602
no               342
Name: count, dtype: int64 

Unique 'Type' values after cleaning: ['no response' 'no' 'yes'] 

Missing values After Cleaning:  0 



In [126]:
# Cleaning the variable InsuranceSource
# From the GSS codebook, this variable stands for "source of health insurance"
# The question asked was "What is the source of your health insurance?"

# Check for the Number of Missing Values
print(data['InsuranceSource'].unique())
missing = data['InsuranceSource'].isnull().sum()
print("Missing values Before Cleaning: ", missing, '\n')

# Replace NaN with the label 'no response', since this is a categorical variable.
# A labeled category is more helpful than NaN when we make graphs and visualizations.
data['InsuranceSource'] = data['InsuranceSource'].fillna('no response')

# Count the number of instances of each unique value in the column 
print("Number of Instances After Cleaning: ", '\n', data['InsuranceSource'].value_counts(), '\n')

# Check for Unique Types After Cleaning
unique_types_cleaned = data['InsuranceSource'].unique()
print("Unique 'Type' values after cleaning:", unique_types_cleaned, '\n')

#Print missing values after cleaning
missing = data['InsuranceSource'].isnull().sum()
print("Missing values After Cleaning: ", missing, '\n')


[nan
 'individual plan from private insurer not related to current or past employment'
 'my employer' 'medicaid' 'medicare' 'employer of my spouse/partner'
 'other' 'employer of someone else in my family']
Missing values Before Cleaning:  63358 

Number of Instances After Cleaning:  
 InsuranceSource
no response                                                                       63358
my employer                                                                          40
medicare                                                                             14
employer of my spouse/partner                                                         9
individual plan from private insurer not related to current or past employment        6
medicaid                                                                              6
other                                                                                 4
employer of someone else in my family                                             

In [127]:
# 'hlthcare':'GovResponsibleForSick', 
# 'hlthall':'UniversalHealthCare', 
# 'hlthcovr':'HealthInsurance08', 

# Cleaning the variable HealthInsurance08
# From the GSS codebook, this variable stands for "has health insurance coverage"
# The question asked was "Do you currently have health insurance coverage?"

# Check for the Number of Missing Values
print(data['HealthInsurance08'].unique())
missing = data['HealthInsurance08'].isnull().sum()
print("Missing values Before Cleaning: ", missing, '\n')

# Replace NaN with the label 'no response', since this is a categorical variable.
# A labeled category is more helpful than NaN when we make graphs and visualizations.
data['HealthInsurance08'] = data['HealthInsurance08'].fillna('no response')

# Count the number of instances of each unique value in the column 
print("Number of Instances After Cleaning: ", '\n', data['HealthInsurance08'].value_counts(), '\n')

# Check for Unique Types After Cleaning
unique_types_cleaned = data['HealthInsurance08'].unique()
print("Unique 'Type' values after cleaning:", unique_types_cleaned, '\n')

#Print missing values after cleaning
missing = data['HealthInsurance08'].isnull().sum()
print("Missing values After Cleaning: ", missing, '\n')



[nan 'yes' 'no']
Missing values Before Cleaning:  63336 

Number of Instances After Cleaning:  
 HealthInsurance08
no response    63336
yes               79
no                24
Name: count, dtype: int64 

Unique 'Type' values after cleaning: ['no response' 'yes' 'no'] 

Missing values After Cleaning:  0 
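Several of the variables cleaned above are mostly 'no response'. HealthInsurance08, for example, has 103 answers out of 63,439 rows. A missing value may also mean the question was not asked in that survey year, so the label covers more than refusals. Before graphing, it may help to rank the columns by their 'no response' share, as in the sketch below. The toy frame and the name `no_response_share` are assumptions for illustration.

```python
import pandas as pd

# Made-up frame standing in for the cleaned `data`.
toy = pd.DataFrame({
    "HealthCondition": ["good", "no response", "fair", "poor"],
    "HealthInsurance08": ["no response", "no response", "no response", "yes"],
})

# Comparing every cell with the label gives a boolean frame;
# the mean of each boolean column is that column's 'no response' share.
no_response_share = (toy == "no response").mean().sort_values(ascending=False)
print(no_response_share)
```

Columns near the top of this ranking are better summarized only over the rows that have an answer.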

