# DS 3001 - Project 1: GSS Data

#### Glory Gurrola and Divya Kuruvilla

## Research Question: How does social class or income affect access to health care?
Strategy:
- Pull the GSS columns related to health care, income, and adjacent topics, then analyze the trends we find.

1. Rename variables so their meaning is clearer
2. Replace NaN values
3. Make graphs of the cleaned variables
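
Step 1 of the strategy could be sketched with `DataFrame.rename`. The toy frame and the descriptive names in the mapping are illustrative assumptions, not names used elsewhere in this notebook.

```python
import pandas as pd

# Toy frame standing in for the GSS extract (values are made up for illustration).
toy = pd.DataFrame({
    "wrkstat": ["retired", "working full time"],
    "hlthinsr": ["yes covered", None],
})

# Hypothetical mapping from GSS mnemonics to more descriptive names.
readable_names = {
    "wrkstat": "work_status",
    "hlthinsr": "has_health_insurance",
}

# rename returns a new frame; the original column names stay available in `toy`.
renamed = toy.rename(columns=readable_names)
print(list(renamed.columns))
```

Keeping the mapping in one dictionary makes it easy to switch between the GSS mnemonics and the readable names when looking something up in the codebook.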

In [156]:
import pandas as pd
data = pd.read_csv('./Project_Data.csv')

In [157]:
data.head()

Unnamed: 0,wrkstat,income,satfin,finalter,health,helpsick,hlthinsr,doc13,doc14,health1,...,hlthcare,hlthall,hlthcovr,hlthtype,hrdshp6,hlthacc1,hlthacc4,hlthacc3,hlthacc2,prvdhlth
0,working full time,,not satisfied at all,better,good,,,,,,...,,,,,,,,,,
1,retired,,more or less satisfied,stayed same,fair,,,,,,...,,,,,,,,,,
2,working part time,,pretty well satisfied,better,excellent,,,,,,...,,,,,,,,,,
3,working full time,,not satisfied at all,stayed same,good,,,,,,...,,,,,,,,,,
4,keeping house,,pretty well satisfied,better,good,,,,,,...,,,,,,,,,,


### Cleaning Variables
We are cleaning the variables in the dataframe to get rid of NaN and make the values easier to analyze.
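
Most of the categorical variables go through the same steps: count the missing values, replace NaN with `'no response'`, and drop rows where the column name shows up as a value. A minimal sketch of a reusable helper for that pattern is shown here; the function name `clean_categorical` and the toy data are hypothetical and not part of the original notebook.

```python
import pandas as pd


def clean_categorical(df, column, fill_value="no response"):
    """Return a copy of df with NaN in `column` replaced by `fill_value`
    and rows whose value equals the column name removed."""
    before = int(df[column].isnull().sum())
    out = df.copy()
    out[column] = out[column].fillna(fill_value)
    out = out[out[column] != column]
    after = int(out[column].isnull().sum())
    print(f"{column}: {before} missing before cleaning, {after} after")
    return out


# Toy data (made up) with one NaN and one stray column-name value.
toy = pd.DataFrame({"hlthplan": ["yes", None, "hlthplan", "no"]})
cleaned = clean_categorical(toy, "hlthplan")
print(cleaned["hlthplan"].value_counts())
```

Because the helper returns a new frame instead of modifying its input, each cleaning step can be checked against the previous one.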


In [158]:
# Start cleaning the variables here
# Use the original GSS variable names


In [159]:
# Cleaning the variable hlthplan
# From the GSS codebook, this variable stands for "r had medicare or medicaid"
# The question asked was "Do you have any health insurance, including Medicare or Medicaid?"

# Check for the Number of Missing Values
print(data['hlthplan'].unique())
missing = data['hlthplan'].isnull().sum()
print("Missing values Before Cleaning: ", missing, '\n')

# Impute NaN values with 'no response'.
# This is a categorical variable, so 'no response' is more helpful than NaN
# when we make graphs and visualizations.
data['hlthplan'] = data['hlthplan'].fillna('no response')

# Drop the rows where the column name 'hlthplan' appears as a value.
# Having the name of the column as a unique value won't be helpful when we create our visualizations.
data = data[data['hlthplan'] != 'hlthplan']

# Count the number of instances of each unique value in the column 
print("Number of Instances After Cleaning: ", '\n', data['hlthplan'].value_counts(), '\n')

# Check for Unique Types After Cleaning
unique_types_cleaned = data['hlthplan'].unique()
print("Unique 'Type' values after cleaning:", unique_types_cleaned, '\n')

#Print missing values after cleaning
missing = data['hlthplan'].isnull().sum()
print("Missing values After Cleaning: ", missing, '\n')

[nan 'hlthplan' 'no' 'yes']
Missing values Before Cleaning:  69635 

Number of Instances After Cleaning:  
 no response    69635
yes             2387
no               368
Name: hlthplan, dtype: int64 

Unique 'Type' values after cleaning: ['no response' 'no' 'yes'] 

Missing values After Cleaning:  0 
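
The `'hlthplan'` entry in the unique values above means that at least one row holds the column name instead of a response. One way to look for such rows across all columns at once is sketched here on made-up data; where these rows come from is not stated in the notebook, so treating them as stray header rows is an assumption.

```python
import pandas as pd

# Toy data (made up): row 1 repeats the column names as values.
toy = pd.DataFrame({
    "hlthplan": ["yes", "hlthplan", None, "no"],
    "finalter": ["better", "finalter", "worse", None],
})

# For each column, mark the cells whose value equals the column's own name.
header_mask = pd.DataFrame({col: toy[col] == col for col in toy.columns})

# A row is suspicious if any of its cells matches a column name.
header_rows = header_mask.any(axis=1)
print(toy[header_rows])

no_headers = toy[~header_rows]
```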



In [160]:
# Cleaning the variable hlthinsr
# From the GSS codebook, this variable stands for "have health insurance"
# The question asked was "Are you, yourself, covered by health insurance, a government plan like 
# Medicare or Medicaid, or some other plan that pays for your medical care?"
# This variable contains data from the 1998 survey only

# Check for the Number of Missing Values
print(data['hlthinsr'].unique())
missing = data['hlthinsr'].isnull().sum()
print("Missing values Before Cleaning: ", missing, '\n')

# Impute NaN values with 'no response'.
# This is a categorical variable, so 'no response' is more helpful than NaN
# when we make graphs and visualizations.
data['hlthinsr'] = data['hlthinsr'].fillna('no response')

# Drop any rows where the column name 'hlthinsr' appears as a value.
# Having the name of the column as a unique value won't be helpful when we create our visualizations.
data = data[data['hlthinsr'] != 'hlthinsr']

# Count the number of instances of each unique value in the column 
print("Number of Instances After Cleaning: ", '\n', data['hlthinsr'].value_counts(), '\n')

# Check for Unique Types After Cleaning
unique_types_cleaned = data['hlthinsr'].unique()
print("Unique 'Type' values after cleaning:", unique_types_cleaned, '\n')

#Print missing values after cleaning
missing = data['hlthinsr'].isnull().sum()
print("Missing values After Cleaning: ", missing, '\n')

[nan 'yes covered' 'no not covered']
Missing values Before Cleaning:  71008 

Number of Instances After Cleaning:  
 no response       71008
yes covered        1189
no not covered      193
Name: hlthinsr, dtype: int64 

Unique 'Type' values after cleaning: ['no response' 'yes covered' 'no not covered'] 

Missing values After Cleaning:  0 



In [161]:
# Cleaning the variable doc13
# From the GSS codebook, this variable stands for "doctors deny me the treatment needed"
# The question asked was "I worry that I will be denied the treatment or services I need"
# The question was answered on a scale of 1 to 5, where 1 = Strongly Agree, 2 = Agree, 3 = Uncertain, 4 = Disagree,
# 5 = Strongly Disagree (stored as 'strong disagree' in the data)
# The data responses for this question are from the 1998 survey

# Check for the Number of Missing Values
print(data['doc13'].unique())
missing = data['doc13'].isnull().sum()
print("Missing values Before Cleaning: ", missing, '\n')

# To clean this variable, we replace the NaN values with 'no response'.
# This is a categorical variable, so 'no response' is more helpful than NaN
# when we make graphs and visualizations.
data['doc13'] = data['doc13'].fillna('no response')

# The responses in this extract are already text labels rather than the numbers 1-5,
# so the number-to-label mapping below is left commented out.

# value_mapping = {
#     '1.0': 'STRONGLY AGREE',
#     '2.0': 'AGREE',
#     '3.0': 'UNCERTAIN',
#     '4.0': 'DISAGREE',
#     '5.0': 'STRONGLY DISAGREE',
# }
# data['doc13'].replace(value_mapping, inplace=True)

# Count the number of instances of each unique value in the column 
print("Number of Instances After Cleaning: ", '\n', data['doc13'].value_counts(), '\n')


[nan 'disagree' 'agree' 'strong disagree' 'uncertain' 'strongly agree']
Missing values Before Cleaning:  71033 

Number of Instances After Cleaning:  
 no response        71033
disagree             734
agree                261
uncertain            185
strong disagree      118
strongly agree        59
Name: doc13, dtype: int64 
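
`value_counts()` lists the doc13 answers by frequency, which hides the order of the agreement scale. A sketch of storing the answers as an ordered categorical so tables and plots follow the scale is shown here, using made-up responses. Values that are not in the category list (such as `'no response'`) become NaN, so that label would need to be added to the list or filtered out first.

```python
import pandas as pd

# Made-up responses using the labels that appear in doc13.
responses = pd.Series(
    ["disagree", "agree", "strong disagree", "uncertain", "strongly agree", "disagree"]
)

# Scale order taken from the codebook comment (1 = strongly agree ... 5 = strong disagree).
scale = ["strongly agree", "agree", "uncertain", "disagree", "strong disagree"]
ordered = pd.Series(pd.Categorical(responses, categories=scale, ordered=True))

# Sorting the index of a categorical follows the category order, not the alphabet.
counts = ordered.value_counts().sort_index()
print(counts)
```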



In [162]:
# Cleaning the hlthcare variable
# From the GSS codebook, this variable stands for "govts resp. provide hlth care for sick"
# The question asked was "On the whole, do you think it should or should not be the government's responsibility to . . . 
        # Provide health care for the sick."

# Check for the Number of Missing Values 
print(data['hlthcare'].unique())
missing = data['hlthcare'].isnull().sum()
print("Missing values Before Cleaning: ", missing, '\n')

# Impute NaN values with 'no response'.
# This is a categorical variable, so 'no response' is more helpful than NaN
# when we make graphs and visualizations.
data['hlthcare'] = data['hlthcare'].fillna('no response')

# Drop any rows where the column name 'hlthcare' appears as a value.
# Having the name of the column as a unique value won't be helpful when we create our visualizations.
data = data[data['hlthcare'] != 'hlthcare']

# Count the number of instances of each unique value in the column 
print("Number of Instances After Cleaning: ", '\n', data['hlthcare'].value_counts(), '\n')

# Check for Unique Types After Cleaning
unique_types_cleaned = data['hlthcare'].unique()
print("Unique 'Type' values after cleaning:", unique_types_cleaned, '\n')

#Print missing values after cleaning
missing = data['hlthcare'].isnull().sum()
print("Missing values After Cleaning: ", missing, '\n')



[nan 'definitely should be' 'probably should be' 'probably should not be'
 'definitely should not be']
Missing values Before Cleaning:  66475 

Number of Instances After Cleaning:  
 no response                 66475
definitely should be         2704
probably should be           2435
probably should not be        582
definitely should not be      194
Name: hlthcare, dtype: int64 

Unique 'Type' values after cleaning: ['no response' 'definitely should be' 'probably should be'
 'probably should not be' 'definitely should not be'] 

Missing values After Cleaning:  0 



In [163]:
# Cleaning the finalter variable
# From the GSS codebook, this variable stands for "change in financial situation"
# The question asked was "During the last few years, has your financial situation been getting better, worse, or has it stayed the same?"
# This question was asked in all 34 years of the survey

# Check for the Number of Missing Values 
print(data['finalter'].unique())
missing = data['finalter'].isnull().sum()
print("Missing values Before Cleaning: ", missing, '\n')

# Impute NaN values with 'no response'.
# This is a categorical variable, so in our graphs and visualizations we represent NaN
# as 'no response' to keep our categorical variables consistent.
data['finalter'] = data['finalter'].fillna('no response')

# We decided to just drop the value 'finalter' within this column
# Having the name of the column as a unique value won't be helpful when we create our visualizations
# Drop the instances that say 'finalter'
data = data[data['finalter'] != 'finalter']

# Count the number of instances of each unique value in the column 
print("Number of Instances After Cleaning: ", '\n', data['finalter'].value_counts(), '\n')

# Check for Unique Types After Cleaning
unique_types_cleaned = data['finalter'].unique()
print("Unique 'Type' values after cleaning:", unique_types_cleaned, '\n')

#Print missing values after cleaning
missing = data['finalter'].isnull().sum()
print("Missing values After Cleaning: ", missing, '\n')




['better' 'stayed same' 'worse' nan]
Missing values Before Cleaning:  4795 

Number of Instances After Cleaning:  
 stayed same    26780
better         25358
worse          15457
no response     4795
Name: finalter, dtype: int64 

Unique 'Type' values after cleaning: ['better' 'stayed same' 'worse' 'no response'] 

Missing values After Cleaning:  0 



In [164]:
# Cleaning the Income Variable 
# From the GSS codebook, the income variable stands for "total family income"
# The question asked was "In which of these groups did your total family income, from all sources, fall last year before taxes, that is? Just tell me the letter."
# The data provided from the GSS had ranges of incomes.

# Check for the Number of Missing Values 
print(data['income'].unique())
missing = data['income'].isnull().sum()
print("Missing values Before Cleaning: ", missing, '\n')

# To clean the income variable, we decided to drop the NaN.
# For our research question, we want to do analysis on income related to healthcare data points.
# Thus, it was not useful to keep the NaN values when making visualizations so we dropped them.
data = data.dropna(subset=['income'])

# Count the number of instances of each unique value in the column 
print("Number of Instances After Cleaning: ", '\n', data['income'].value_counts(), '\n')

# Check for Unique Types After Cleaning
unique_types_cleaned = data['income'].unique()
print("Unique 'Type' values after cleaning:", unique_types_cleaned, '\n')

#Print missing values after cleaning
missing = data['income'].isnull().sum()
print("Missing values After Cleaning: ", missing, '\n')

[nan '$10,000 to $14,999' '$7,000 to $7,999' '$4,000 to $4,999'
 '$1,000 to $2,999' '$15,000 to $19,999' '$5,000 to $5,999'
 '$20,000 to $24,999' '$3,000 to $3,999' 'under $1,000' '$8,000 to $9,999'
 '$25,000 or more' '$6,000 to $6,999']
Missing values Before Cleaning:  8951 

Number of Instances After Cleaning:  
 $25,000 or more       34785
$10,000 to $14,999     6850
$20,000 to $24,999     5528
$15,000 to $19,999     5301
$8,000 to $9,999       2285
$1,000 to $2,999       1412
$7,000 to $7,999       1315
$5,000 to $5,999       1314
$3,000 to $3,999       1309
$6,000 to $6,999       1249
$4,000 to $4,999       1189
under $1,000            902
Name: income, dtype: int64 

Unique 'Type' values after cleaning: ['$10,000 to $14,999' '$7,000 to $7,999' '$4,000 to $4,999'
 '$1,000 to $2,999' '$15,000 to $19,999' '$5,000 to $5,999'
 '$20,000 to $24,999' '$3,000 to $3,999' 'under $1,000' '$8,000 to $9,999'
 '$25,000 or more' '$6,000 to $6,999'] 

Missing values After Cleaning:  0 
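
The income brackets are strings, so they sort alphabetically rather than by amount. A sketch of recovering a natural order by parsing the lower edge of each bracket is shown here; the helper name `lower_bound` is hypothetical, and mapping `'under $1,000'` to 0 is an assumption made for ordering purposes only.

```python
import pandas as pd

# A few of the bracket labels from the income column.
income = pd.Series(
    ["$10,000 to $14,999", "under $1,000", "$25,000 or more", "$1,000 to $2,999"]
)


def lower_bound(bracket):
    """Lower edge of an income bracket in dollars; 'under $1,000' maps to 0."""
    if bracket.startswith("under"):
        return 0
    first_amount = bracket.split()[0]  # e.g. '$10,000'
    return int(first_amount.replace("$", "").replace(",", ""))


bounds = income.map(lower_bound)

# Order the labels by their lower edge and store income as an ordered categorical.
ordered_labels = sorted(income.unique(), key=lower_bound)
income_cat = pd.Categorical(income, categories=ordered_labels, ordered=True)
print(ordered_labels)
```

The top bracket is open-ended (`'$25,000 or more'`), so the lower edges are better suited to ordering the brackets than to computing averages.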



In [165]:

# Cleaning the health variable 
# From the GSS codebook, the variable stands for the "condition of health."
# The question asked was "Would you say your own health, in general, is excellent, good, fair, or  poor?"

# Check for the Number of Missing Values
print(data['health'].unique())
missing = data['health'].isnull().sum()
print("Missing values Before Cleaning: ", missing, '\n')

# Impute NaN values with 'no response'.
# This is a categorical variable, so in our graphs and visualizations we represent NaN
# as 'no response' to keep our categorical variables consistent.
data['health'] = data['health'].fillna('no response')

# Count the number of instances of each unique value in the column 
print("Number of Instances After Cleaning: ", '\n', data['health'].value_counts(), '\n')

# Check for Unique Types After Cleaning
unique_types_cleaned = data['health'].unique()
print("Unique 'Type' values after cleaning:", unique_types_cleaned, '\n')

#Print missing values after cleaning
missing = data['health'].isnull().sum()
print("Missing values After Cleaning: ", missing, '\n')


['fair' 'good' 'excellent' 'poor' nan]
Missing values Before Cleaning:  15409 

Number of Instances After Cleaning:  
 good           22411
no response    15409
excellent      13873
fair            9204
poor            2542
Name: health, dtype: int64 

Unique 'Type' values after cleaning: ['fair' 'good' 'excellent' 'poor' 'no response'] 

Missing values After Cleaning:  0 



In [166]:
# Cleaning the prvdhlth variable
# From the GSS codebook, this variable stands for "who provides for sick people."
# The question asked was "People have different opinions on who should provide services in America. Who do you think should primarily provide... Healthcare for the sick?  Should it be...  "

# Check for the Number of Missing Values
print(data['prvdhlth'].unique())
missing = data['prvdhlth'].isnull().sum()
print("Missing values Before Cleaning: ", missing, '\n')

# Impute NaN values with 'no response'.
# prvdhlth is a categorical variable, so 'no response' is more helpful than NaN
# when we make graphs and visualizations.
data['prvdhlth'] = data['prvdhlth'].fillna('no response')

# Count the number of instances of each unique value in the column 
print("Number of Instances After Cleaning: ", '\n', data['prvdhlth'].value_counts(), '\n')

# Check for Unique Types After Cleaning
unique_types_cleaned = data['prvdhlth'].unique()
print("Unique 'Type' values after cleaning:", unique_types_cleaned, '\n')

#Print missing values after cleaning
missing = data['prvdhlth'].isnull().sum()
print("Missing values After Cleaning: ", missing, '\n')


[nan 'private companies/for-profit organizations' 'government'
 'non-profit organizations/charities/cooperatives'
 'family, relatives or friends' 'religious organizations']
Missing values Before Cleaning:  61201 

Number of Instances After Cleaning:  
 no response                                        61201
government                                          1311
private companies/for-profit organizations           479
non-profit organizations/charities/cooperatives      261
family, relatives or friends                         166
religious organizations                               21
Name: prvdhlth, dtype: int64 

Unique 'Type' values after cleaning: ['no response' 'private companies/for-profit organizations' 'government'
 'non-profit organizations/charities/cooperatives'
 'family, relatives or friends' 'religious organizations'] 

Missing values After Cleaning:  0 



In [167]:
# Cleaning the wrkstat variable 
# From the GSS codebook, this variable stands for the "labor force status."
# The question asked was "Last week were you working full time, part time, going to school, keeping house, or what?"

# Check for the Number of Missing Values
print(data['wrkstat'].unique())
missing = data['wrkstat'].isnull().sum()
print("Missing values Before Cleaning: ", missing, '\n')

# Impute NaN values with 'no response'.
# This is a categorical variable, so 'no response' is more helpful than NaN
# when we make graphs and visualizations.
data['wrkstat'] = data['wrkstat'].fillna('no response')

# Count the number of instances of each unique value in the column 
print("Number of Instances After Cleaning: ", '\n', data['wrkstat'].value_counts(), '\n')

# Check for Unique Types After Cleaning
unique_types_cleaned = data['wrkstat'].unique()
print("Unique 'Type' values after cleaning:", unique_types_cleaned, '\n')

#Print missing values after cleaning
missing = data['wrkstat'].isnull().sum()
print("Missing values After Cleaning: ", missing, '\n')

['working full time' 'keeping house' 'retired' 'working part time'
 'in school'
 'with a job, but not at work because of temporary illness, vacation, strike'
 'other' 'unemployed, laid off, looking for work' nan]
Missing values Before Cleaning:  11 

Number of Instances After Cleaning:  
 working full time                                                             32052
retired                                                                        9087
keeping house                                                                  8922
working part time                                                              6551
unemployed, laid off, looking for work                                         2256
in school                                                                      1825
with a job, but not at work because of temporary illness, vacation, strike     1384
other                                                                          1351
no response                           

In [168]:
# Cleaning the health1 variable
# The question asked was "Would you say that in general your health is Excellent, Very good, Good, Fair, or Poor?"

print(data['health1'].value_counts())
print()
missing = data['health1'].isnull().sum()
print("Missing values Before Cleaning: ", missing, '\n')

# Impute NaN values with 'no response'.
# health1 is a categorical variable, so 'no response' is more helpful than NaN
# when we make graphs and visualizations.
data['health1'] = data['health1'].fillna('no response')

#Print missing values after cleaning
missing = data['health1'].isnull().sum()
print("Missing values After Cleaning: ", missing, '\n')

good         2956
very good    2800
excellent    2181
fair         1271
poor          202
Name: health1, dtype: int64

Missing values Before Cleaning:  54029 

Missing values After Cleaning:  0 



In [169]:
# Cleaning the variable hlthacc2
# From the GSS codebook, this variable stands for "Old vs. Young Access to Healthcare"
# The question asked was "In the United States, do you think it is easier or harder to get access to health care... for old people than for young people"                   

# Check for the Number of Missing Values
print(data['hlthacc2'].unique())
missing = data['hlthacc2'].isnull().sum()
print("Missing values Before Cleaning: ", missing, '\n')

# Impute NaN values with 'no response'.
# This is a categorical variable, so 'no response' is more helpful than NaN
# when we make graphs and visualizations.
data['hlthacc2'] = data['hlthacc2'].fillna('no response')

# Count the number of instances of each unique value in the column 
print("Number of Instances After Cleaning: ", '\n', data['hlthacc2'].value_counts(), '\n')

# Check for Unique Types After Cleaning
unique_types_cleaned = data['hlthacc2'].unique()
print("Unique 'Type' values after cleaning:", unique_types_cleaned, '\n')

#Print missing values after cleaning
missing = data['hlthacc2'].isnull().sum()
print("Missing values After Cleaning: ", missing, '\n')

[nan 'much harder' 'about the same' 'somewhat harder' 'much easier'
 'somewhat easier']
Missing values Before Cleaning:  62465 

Number of Instances After Cleaning:  
 no response        62465
about the same       354
somewhat harder      224
somewhat easier      202
much harder          116
much easier           78
Name: hlthacc2, dtype: int64 

Unique 'Type' values after cleaning: ['no response' 'much harder' 'about the same' 'somewhat harder'
 'much easier' 'somewhat easier'] 

Missing values After Cleaning:  0 



In [170]:
# Cleaning the variable hlthacc3
# From the GSS codebook, this variable stands for "Men vs Women Access to Healthcare"
# The question asked was "In the United States, do you think it is easier or harder to get access to health care... for women than for men"                

# Check for the Number of Missing Values
print(data['hlthacc3'].unique())
missing = data['hlthacc3'].isnull().sum()
print("Missing values Before Cleaning: ", missing, '\n')

# Impute NaN values with 'no response'.
# This is a categorical variable, so 'no response' is more helpful than NaN
# when we make graphs and visualizations.
data['hlthacc3'] = data['hlthacc3'].fillna('no response')

# Count the number of instances of each unique value in the column 
print("Number of Instances After Cleaning: ", '\n', data['hlthacc3'].value_counts(), '\n')

# Check for Unique Types After Cleaning
unique_types_cleaned = data['hlthacc3'].unique()
print("Unique 'Type' values after cleaning:", unique_types_cleaned, '\n')

#Print missing values after cleaning
missing = data['hlthacc3'].isnull().sum()
print("Missing values After Cleaning: ", missing, '\n')

[nan 'somewhat harder' 'about the same' 'much harder' 'somewhat easier'
 'much easier']
Missing values Before Cleaning:  62492 

Number of Instances After Cleaning:  
 no response        62492
about the same       653
somewhat harder      139
somewhat easier       75
much easier           41
much harder           39
Name: hlthacc3, dtype: int64 

Unique 'Type' values after cleaning: ['no response' 'somewhat harder' 'about the same' 'much harder'
 'somewhat easier' 'much easier'] 

Missing values After Cleaning:  0 



In [171]:
# Cleaning the variable hlthacc4
# From the GSS codebook, this variable stands for "Citizen vs Noncitizen Access to Healthcare"
# The question asked was "In the United States, do you think it is easier or harder to get access to health care... for citizens of the United States than for people who do not hold U.S.citizenship?"

# Check for the Number of Missing Values
print(data['hlthacc4'].unique())
missing = data['hlthacc4'].isnull().sum()
print("Missing values Before Cleaning: ", missing, '\n')

# Impute NaN values with 'no response'.
# This is a categorical variable, so 'no response' is more helpful than NaN
# when we make graphs and visualizations.
data['hlthacc4'] = data['hlthacc4'].fillna('no response')

# Count the number of instances of each unique value in the column 
print("Number of Instances After Cleaning: ", '\n', data['hlthacc4'].value_counts(), '\n')

# Check for Unique Types After Cleaning
unique_types_cleaned = data['hlthacc4'].unique()
print("Unique 'Type' values after cleaning:", unique_types_cleaned, '\n')

#Print missing values after cleaning
missing = data['hlthacc4'].isnull().sum()
print("Missing values After Cleaning: ", missing, '\n')

[nan 'much easier' 'about the same' 'much harder' 'somewhat easier'
 'somewhat harder']
Missing values Before Cleaning:  62514 

Number of Instances After Cleaning:  
 no response        62514
much easier          278
somewhat easier      226
about the same       210
somewhat harder      112
much harder           99
Name: hlthacc4, dtype: int64 

Unique 'Type' values after cleaning: ['no response' 'much easier' 'about the same' 'much harder'
 'somewhat easier' 'somewhat harder'] 

Missing values After Cleaning:  0 



In [172]:
# Cleaning the variable hlthacc1
# From the GSS codebook, this variable stands for "Rich vs Poor Access to Healthcare"
# The question asked was "In the United States, do you think it is easier or harder to get access to health care... for rich people than for poor people?"

# Check for the Number of Missing Values
print(data['hlthacc1'].unique())
missing = data['hlthacc1'].isnull().sum()
print("Missing values Before Cleaning: ", missing, '\n')

# Impute NaN values with 'no response'.
# This is a categorical variable, so 'no response' is more helpful than NaN
# when we make graphs and visualizations.
data['hlthacc1'] = data['hlthacc1'].fillna('no response')

# Count the number of instances of each unique value in the column 
print("Number of Instances After Cleaning: ", '\n', data['hlthacc1'].value_counts(), '\n')

# Check for Unique Types After Cleaning
unique_types_cleaned = data['hlthacc1'].unique()
print("Unique 'Type' values after cleaning:", unique_types_cleaned, '\n')

#Print missing values after cleaning
missing = data['hlthacc1'].isnull().sum()
print("Missing values After Cleaning: ", missing, '\n')


[nan 'much easier' 'somewhat easier' 'about the same' 'somewhat harder'
 'much harder']
Missing values Before Cleaning:  62444 

Number of Instances After Cleaning:  
 no response        62444
much easier          618
somewhat easier      193
about the same       127
somewhat harder       34
much harder           23
Name: hlthacc1, dtype: int64 

Unique 'Type' values after cleaning: ['no response' 'much easier' 'somewhat easier' 'about the same'
 'somewhat harder' 'much harder'] 

Missing values After Cleaning:  0 



In [173]:
# Cleaning the variable hrdshp6
# From the GSS codebook, this variable stands for "lacking health insurance coverage"
# The question asked was "Now I am going to ask about specific hardships. 
    # Did any of the following occur to you since (CURRENT MONTH), (1990/2003)? 
    # Lacked health insurance coverage (e.g. Medicare, Medicaid, Blue Cross, an HMO, etc.)?"

# Check for the Number of Missing Values
print(data['hrdshp6'].unique())
missing = data['hrdshp6'].isnull().sum()
print("Missing values Before Cleaning: ", missing, '\n')

# Impute NaN values with 'no response'.
# This is a categorical variable, so 'no response' is more helpful than NaN
# when we make graphs and visualizations.
data['hrdshp6'] = data['hrdshp6'].fillna('no response')

# Count the number of instances of each unique value in the column 
print("Number of Instances After Cleaning: ", '\n', data['hrdshp6'].value_counts(), '\n')

# Check for Unique Types After Cleaning
unique_types_cleaned = data['hrdshp6'].unique()
print("Unique 'Type' values after cleaning:", unique_types_cleaned, '\n')

#Print missing values after cleaning
missing = data['hrdshp6'].isnull().sum()
print("Missing values After Cleaning: ", missing, '\n')


[nan 'no' 'yes']
Missing values Before Cleaning:  61357 

Number of Instances After Cleaning:  
 no response    61357
no              1748
yes              334
Name: hrdshp6, dtype: int64 

Unique 'Type' values after cleaning: ['no response' 'no' 'yes'] 

Missing values After Cleaning:  0 



In [174]:
# Cleaning the variable emphlth
# From the GSS codebook, this variable stands for "do you receive health insurance from employer"
# The question asked was "Do you receive health insurance from your employer?"

# Check for the Number of Missing Values
print(data['emphlth'].unique())
missing = data['emphlth'].isnull().sum()
print("Missing values Before Cleaning: ", missing, '\n')

# Impute NaN values with 'no response'.
# This is a categorical variable, so 'no response' is more helpful than NaN
# when we make graphs and visualizations.
data['emphlth'] = data['emphlth'].fillna('no response')

# Count the number of instances of each unique value in the column 
print("Number of Instances After Cleaning: ", '\n', data['emphlth'].value_counts(), '\n')

# Check for Unique Types After Cleaning
unique_types_cleaned = data['emphlth'].unique()
print("Unique 'Type' values after cleaning:", unique_types_cleaned, '\n')

#Print missing values after cleaning
missing = data['emphlth'].isnull().sum()
print("Missing values After Cleaning: ", missing, '\n')


[nan 'no' 'yes']
Missing values Before Cleaning:  62495 

Number of Instances After Cleaning:  
 no response    62495
yes              602
no               342
Name: emphlth, dtype: int64 

Unique 'Type' values after cleaning: ['no response' 'no' 'yes'] 

Missing values After Cleaning:  0 



In [175]:
# Cleaning the variable hlthtype
# From the GSS codebook, this variable stands for "source of health insurance"
# The question asked was "What is the source of your health insurance?"

# Check for the Number of Missing Values
print(data['hlthtype'].unique())
missing = data['hlthtype'].isnull().sum()
print("Missing values Before Cleaning: ", missing, '\n')

# Impute NaN values with 'no response'.
# This is a categorical variable, so 'no response' is more helpful than NaN
# when we make graphs and visualizations.
data['hlthtype'] = data['hlthtype'].fillna('no response')

# Count the number of instances of each unique value in the column 
print("Number of Instances After Cleaning: ", '\n', data['hlthtype'].value_counts(), '\n')

# Check for Unique Types After Cleaning
unique_types_cleaned = data['hlthtype'].unique()
print("Unique 'Type' values after cleaning:", unique_types_cleaned, '\n')

#Print missing values after cleaning
missing = data['hlthtype'].isnull().sum()
print("Missing values After Cleaning: ", missing, '\n')


[nan
 'individual plan from private insurer not related to current or past employment'
 'my employer' 'medicaid' 'medicare' 'employer of my spouse/partner'
 'other' 'employer of someone else in my family']
Missing values Before Cleaning:  63358 

Number of Instances After Cleaning:  
 no response                                                                       63358
my employer                                                                          40
medicare                                                                             14
employer of my spouse/partner                                                         9
individual plan from private insurer not related to current or past employment        6
medicaid                                                                              6
other                                                                                 4
employer of someone else in my family                                                 2
Name: hlth

In [176]:
# Cleaning the variable hlthcovr
# From the GSS codebook, this variable stands for "has health insurance coverage"
# The question asked was "Do you currently have health insurance coverage?"

# Check for the Number of Missing Values
print(data['hlthcovr'].unique())
missing = data['hlthcovr'].isnull().sum()
print("Missing values Before Cleaning: ", missing, '\n')

# Impute NaN values with 'no response'.
# This is a categorical variable, so 'no response' is more helpful than NaN
# when we make graphs and visualizations.
data['hlthcovr'] = data['hlthcovr'].fillna('no response')

# Count the number of instances of each unique value in the column 
print("Number of Instances After Cleaning: ", '\n', data['hlthcovr'].value_counts(), '\n')

# Check for Unique Types After Cleaning
unique_types_cleaned = data['hlthcovr'].unique()
print("Unique 'Type' values after cleaning:", unique_types_cleaned, '\n')

#Print missing values after cleaning
missing = data['hlthcovr'].isnull().sum()
print("Missing values After Cleaning: ", missing, '\n')



[nan 'yes' 'no']
Missing values Before Cleaning:  63336 

Number of Instances After Cleaning:  
 no response    63336
yes               79
no                24
Name: hlthcovr, dtype: int64 

Unique 'Type' values after cleaning: ['no response' 'yes' 'no'] 

Missing values After Cleaning:  0 
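
The same fill-and-report pattern is repeated for each categorical variable above. A minimal sketch of how it could be wrapped in one helper follows; the function name `fill_categorical` and the toy frame are assumptions for illustration and are not part of the project code.

```python
import pandas as pd

def fill_categorical(df, column, label="no response"):
    """Return a copy of df with NaN in `column` replaced by `label`."""
    out = df.copy()
    out[column] = out[column].fillna(label)
    return out

# Toy frame standing in for the GSS extract (values are made up)
toy = pd.DataFrame({"hlthcovr": ["yes", None, "no", None]})
cleaned = fill_categorical(toy, "hlthcovr")

# The original frame is untouched; the copy has no missing values
print(cleaned["hlthcovr"].value_counts())
print("Missing after cleaning:", cleaned["hlthcovr"].isnull().sum())
```

Returning a copy keeps each cleaning step explicit and makes it safe to re-run a cell.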



In [177]:
# Cleaning the variable hlthall ('UniversalHealthCare' after rename)
# From the GSS codebook, this variable stands for "healthcare provided for everyone"
# The question asked was "On a scale of 1 to 7, where 1 is not at all important and 7 is very important,
        # how important is it: That health care be provided for everyone"

# Check for the Number of Missing Values
print(data['hlthall'].unique())
missing = data['hlthall'].isnull().sum()
print("Missing values Before Cleaning: ", missing, '\n')

# To clean this variable, we dropped the rows where it is NaN.
# Responses are on a 1-7 scale (stored in this file as strings such as '7.0').
# Replacing NaN with 'no response' would not help the analysis because the scale is numeric,
# and replacing NaN with 0 would be misleading because 0 was not an answer respondents could give.
# Note: dropna removes the entire row, so every later cell only sees respondents who answered this question.
data = data.dropna(subset=['hlthall'])

# Count the number of instances of each unique value in the column 
print("Number of Instances After Cleaning: ", '\n', data['hlthall'].value_counts(), '\n')

# Check for Unique Types After Cleaning
unique_types_cleaned = data['hlthall'].unique()
print("Unique 'Type' values after cleaning:", unique_types_cleaned, '\n')

#Print missing values after cleaning
missing = data['hlthall'].isnull().sum()
print("Missing values After Cleaning: ", missing, '\n')


[nan '6.0' '5.0' '3.0' '7.0' '1.0' '4.0' '2.0']
Missing values Before Cleaning:  62323 

Number of Instances After Cleaning:  
 7.0    638
6.0    138
4.0    104
5.0     83
1.0     74
3.0     45
2.0     34
Name: hlthall, dtype: int64 

Unique 'Type' values after cleaning: ['6.0' '5.0' '3.0' '7.0' '1.0' '4.0' '2.0'] 

Missing values After Cleaning:  0 
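
Because `dropna` on the dataframe removes whole rows, one alternative is to convert the scale to numbers and drop missing values from that column only. The sketch below uses a made-up toy frame and is not the approach taken in this notebook.

```python
import pandas as pd

# Toy frame standing in for the GSS extract (values are made up)
toy = pd.DataFrame({
    "hlthall": ["7.0", None, "6.0", None],
    "helpsick": ["1.0", "3.0", None, "5.0"],
})

# Convert the string codes to numbers; anything unparseable becomes NaN
scores = pd.to_numeric(toy["hlthall"], errors="coerce")

# Drop missing values from this column only, leaving the frame's rows intact
valid_scores = scores.dropna()

print("Mean importance score:", valid_scores.mean())
print("Rows still in the frame:", len(toy))
```

With this approach the other columns keep all of their responses, which matters when most respondents were never asked a given question.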



In [178]:
# Cleaning the variable helpsick ('GovPayHealthBills' after rename)
# From the GSS codebook, this variable stands for "should government help pay for medical care"
# The question asked was "In general, some people think that it is the responsibility of the government in Washington
# to see to it that people have help in paying for doctors and hospital bills. Others think that these matters are not
# the responsibility of the federal government and that people should take care of these things themselves.
# Where would you place yourself on this scale?"
# Data was collected for this question in 25 of the 34 available years, so there is a large amount of data available
# regarding this question

# Check for the Number of Missing Values
print(data['helpsick'].unique())
missing = data['helpsick'].isnull().sum()
print("Missing values Before Cleaning: ", missing, '\n')

# To clean this variable, we replaced the NaN values with 'no response'.
# The scale labels for this question were "GOV SHOULD HELP", "AGREE WITH BOTH", and "PEOPLE HELP SELVES". People answered
# on a scale of 1 to 5, where 1 is strongly agree that gov should help and 5 is strongly agree that people should help themselves.
# We treat this as a categorical variable, so a 'no response' label is more useful than NaN
# when we make graphs and visualizations.
data['helpsick'] = data['helpsick'].fillna('no response')

# Rather than keeping numbers as the data, we replaced the numbers with their actual meaning to make the data
# clear and understandable without needing to consult the GSS codebook to see what the numbers represent
value_mapping = {
    '1.0': 'gov should help',
    '2.0': 'lean towards gov should help',
    '3.0': 'agree with both',
    '4.0': 'lean toward people help selves',
    '5.0': 'people help selves',
}
data['helpsick'] = data['helpsick'].replace(value_mapping)

# Count the number of instances of each unique value in the column 
print("Number of Instances After Cleaning: ", '\n', data['helpsick'].value_counts(), '\n')

# Check for Unique Types After Cleaning
unique_types_cleaned = data['helpsick'].unique()
print("Unique 'Type' values after cleaning:", unique_types_cleaned, '\n')

#Print missing values after cleaning
missing = data['helpsick'].isnull().sum()
print("Missing values After Cleaning: ", missing, '\n')


[nan '4.0' '3.0' '1.0' '2.0' '5.0']
Missing values Before Cleaning:  400 

Number of Instances After Cleaning:  
 no response                       400
agree with both                   236
gov should help                   184
lean towards gov should help      139
lean toward people help selves     80
people help selves                 77
Name: helpsick, dtype: int64 

Unique 'Type' values after cleaning: ['no response' 'lean toward people help selves' 'agree with both'
 'gov should help' 'lean towards gov should help' 'people help selves'] 

Missing values After Cleaning:  0 
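
After mapping the codes to labels, `value_counts()` sorts by frequency, so the 1 to 5 ordering of the scale is lost. One way to keep that order for tables and graphs is an ordered categorical. This is a sketch on a made-up toy series, not the project's code.

```python
import pandas as pd

# Labels listed in scale order (1 through 5), matching the mapping used above
labels = [
    "gov should help",
    "lean towards gov should help",
    "agree with both",
    "lean toward people help selves",
    "people help selves",
]
codes = ["1.0", "2.0", "3.0", "4.0", "5.0"]
value_mapping = dict(zip(codes, labels))

# Toy responses standing in for data['helpsick'] (values are made up)
toy = pd.Series(["4.0", None, "1.0", "3.0"])
mapped = toy.map(value_mapping).fillna("no response")

# Ordered categories keep the scale order, with 'no response' placed last
ordered = pd.Categorical(mapped, categories=labels + ["no response"], ordered=True)
counts = pd.Series(ordered).value_counts(sort=False)
print(counts)
```

Passing `sort=False` keeps the counts in category order, including labels that have a count of zero.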



In [179]:
# 'diffcare'

In [180]:
# 'insrlmts'

In [181]:
# 'insrchng'

In [182]:
# 'satfin'

### Changing Variable Names
We are renaming the variables in the dataframe to make it clearer what each variable represents.


In [183]:
# Column names before renaming
print(data.columns.tolist())

['wrkstat', 'income', 'satfin', 'finalter', 'health', 'helpsick', 'hlthinsr', 'doc13', 'doc14', 'health1', 'hlthplan', 'diffcare', 'insrlmts', 'insrchng', 'emphlth', 'hlthcare', 'hlthall', 'hlthcovr', 'hlthtype', 'hrdshp6', 'hlthacc1', 'hlthacc4', 'hlthacc3', 'hlthacc2', 'prvdhlth']


In [184]:
print(data.columns.tolist())

data = data.rename(columns={'wrkstat': 'WorkStatus', # Done
                            'income':'Income' , # Done
                            'satfin': 'FinancialSatisfaction', 
                            'finalter': 'FinancialSitChange', # Done
                            'health':'HealthCondition', # Done 
                            'helpsick':'GovPayHealthBills', # Done
                            'hlthinsr':'HealthInsurance98',
                            'doc13': 'DeniedTreatmentWorry', 
                            'doc14':'DocCostOverCare', 
                            'health1': 'GeneralHealth', # Done
                            'hlthplan':'HealthInsurance02', 
                            'diffcare':'DifficultCare', 
                            'insrlmts': 'InsuranceLimits', 
                            'insrchng':'SwitchInsurancePlans', 
                            'emphlth':'EmployerCoverage', # Done
                            'hlthcare':'GovResponsibleForSick', # Done
                            'hlthall':'UniversalHealthCare', # Done
                            'hlthcovr':'HealthInsurance08', # Done
                            'hlthtype':'InsuranceSource', # Done
                            'hrdshp6':'LackedHealthInsurance', # Done
                            'hlthacc1':'RichVsPoorAccess', # Done
                            'hlthacc4':'CitVsNCitAccess', # Done
                            'hlthacc3':'MenVsWomenAccess', # Done
                            'hlthacc2':'OldVsYoungAccess', # Done
                            'prvdhlth':'WhoShouldProvideHealthCare' }) # Done

['wrkstat', 'income', 'satfin', 'finalter', 'health', 'helpsick', 'hlthinsr', 'doc13', 'doc14', 'health1', 'hlthplan', 'diffcare', 'insrlmts', 'insrchng', 'emphlth', 'hlthcare', 'hlthall', 'hlthcovr', 'hlthtype', 'hrdshp6', 'hlthacc1', 'hlthacc4', 'hlthacc3', 'hlthacc2', 'prvdhlth']


In [185]:
# .shape gives the number of rows and columns
print(data.shape, '\n')

# .columns gives the names of the columns
# There are 25 columns, as specified in the data.shape output
print(data.columns.tolist())

(1116, 25) 

['WorkStatus', 'Income', 'FinancialSatisfaction', 'FinancialSitChange', 'HealthCondition', 'GovPayHealthBills', 'HealthInsurance98', 'DeniedTreatmentWorry', 'DocCostOverCare', 'GeneralHealth', 'HealthInsurance02', 'DifficultCare', 'InsuranceLimits', 'SwitchInsurancePlans', 'EmployerCoverage', 'GovResponsibleForSick', 'UniversalHealthCare', 'HealthInsurance08', 'InsuranceSource', 'LackedHealthInsurance', 'RichVsPoorAccess', 'CitVsNCitAccess', 'MenVsWomenAccess', 'OldVsYoungAccess', 'WhoShouldProvideHealthCare']
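
By default `rename` silently ignores keys that do not match a column, so a typo in the mapping would go unnoticed. A small check like the one below can catch that. The toy frame and the misspelled key `helthplan` are hypothetical, chosen only to show the failure case.

```python
import pandas as pd

# Hypothetical mapping with one deliberately misspelled key ('helthplan')
rename_map = {
    "wrkstat": "WorkStatus",
    "income": "Income",
    "helthplan": "HealthInsurance02",
}
toy = pd.DataFrame(columns=["wrkstat", "income", "hlthplan"])

# List any mapping keys that are not actual column names
missing_keys = [old for old in rename_map if old not in toy.columns]
print("Keys not found in columns:", missing_keys)

# Alternatively, ask rename itself to fail loudly on unknown keys
try:
    toy.rename(columns=rename_map, errors="raise")
except KeyError as exc:
    print("rename failed:", exc)
```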


### Data Visualizations and Analysis 

After cleaning and renaming the variables, we can create graphs that visualize the variables and their relationships.
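
Before plotting, a cross-tabulation is a quick way to look at how two cleaned variables relate. The sketch below uses the renamed columns `Income` and `HealthInsurance08` on a made-up toy frame; the income labels are placeholders, not actual GSS categories.

```python
import pandas as pd

# Toy frame standing in for the cleaned data (values are made up)
toy = pd.DataFrame({
    "Income": ["low", "low", "high", "high", "low"],
    "HealthInsurance08": ["yes", "no", "yes", "yes", "no response"],
})

# Leave out 'no response' so the shares describe actual answers only
answered = toy[toy["HealthInsurance08"] != "no response"]

# Share of each answer within each income group (each row sums to 1)
shares = pd.crosstab(
    answered["Income"], answered["HealthInsurance08"], normalize="index"
)
print(shares)
```

Normalizing by row makes groups of different sizes comparable, and the resulting table can be passed to `shares.plot(kind="bar")` if matplotlib is installed.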
