# DS 3001 - Project 1: GSS Data

#### Glory Gurrola and Divya Kuruvilla

## Research Question: How does social class or income affect access to health care?
Strategy:
- Pull the GSS columns related to health care, income, and similar topics, then analyze the trends we find.

1. Rename variables so they are clearer
2. Replace NaNs
3. Make some pretty graphs

In [1]:
import pandas as pd
data = pd.read_csv('./Project_Data.csv')

In [2]:
data.head()

Unnamed: 0,wrkstat,income,satfin,finalter,health,helpsick,hlthinsr,doc13,doc14,health1,...,hlthcare,hlthall,hlthcovr,hlthtype,hrdshp6,hlthacc1,hlthacc4,hlthacc3,hlthacc2,prvdhlth
0,working full time,,not satisfied at all,better,good,,,,,,...,,,,,,,,,,
1,retired,,more or less satisfied,stayed same,fair,,,,,,...,,,,,,,,,,
2,working part time,,pretty well satisfied,better,excellent,,,,,,...,,,,,,,,,,
3,working full time,,not satisfied at all,stayed same,good,,,,,,...,,,,,,,,,,
4,keeping house,,pretty well satisfied,better,good,,,,,,...,,,,,,,,,,


In [3]:
# .shape gives the number of rows and columns
print(data.shape, '\n')

# .columns holds the names of the columns
# There are 25 columns, as specified in the data.shape output
print(data.columns.tolist())

(72426, 25) 

['wrkstat', 'income', 'satfin', 'finalter', 'health', 'helpsick', 'hlthinsr', 'doc13', 'doc14', 'health1', 'hlthplan', 'diffcare', 'insrlmts', 'insrchng', 'emphlth', 'hlthcare', 'hlthall', 'hlthcovr', 'hlthtype', 'hrdshp6', 'hlthacc1', 'hlthacc4', 'hlthacc3', 'hlthacc2', 'prvdhlth']


### Changing Variable Names
We are renaming the variables in the dataframe to make it clearer what each one represents.


In [None]:
print(data.columns.tolist())
## Updated Names of all of the variables

['WorkStatus', 'Income', 'FinancialSatisfaction', 'FinancialSitChange', 'HealthCondition', 'GovPayHealthBills', 'HealthInsurance98', 'DeniedTreatmentWorry', 'DocCostOverCare', 'GeneralHealth', 'HealthInsurance02', 'DifficultCare', 'InsuranceLimits', 'SwitchInsurancePlans', 'EmployerCoverage', 'GovResponsibleForSick', 'UniversalHealthCare', 'HealthInsurance08', 'InsuranceSource', 'LackedHealthInsurance', 'RichVsPoorAccess', 'CitVsNCitAccess', 'MenVsWomenAccess', 'OldVsYoungAccess', 'WhoShouldProvideHealthCare']


In [None]:
print(data.columns.tolist())

data = data.rename(columns={'wrkstat': 'WorkStatus',
                            'income':'Income' ,
                            'satfin': 'FinancialSatisfaction',
                            'finalter': 'FinancialSitChange',
                            'health':'HealthCondition',
                            'helpsick':'GovPayHealthBills',
                            'hlthinsr':'HealthInsurance98',
                            'doc13': 'DeniedTreatmentWorry', 
                            'doc14':'DocCostOverCare', 
                            'health1': 'GeneralHealth', 
                            'hlthplan':'HealthInsurance02', 
                            'diffcare':'DifficultCare', 
                            'insrlmts': 'InsuranceLimits', 
                            'insrchng':'SwitchInsurancePlans', 
                            'emphlth':'EmployerCoverage', 
                            'hlthcare':'GovResponsibleForSick', 
                            'hlthall':'UniversalHealthCare', 
                            'hlthcovr':'HealthInsurance08', 
                            'hlthtype':'InsuranceSource', 
                            'hrdshp6':'LackedHealthInsurance', 
                            'hlthacc1':'RichVsPoorAccess', 
                            'hlthacc4':'CitVsNCitAccess', 
                            'hlthacc3':'MenVsWomenAccess', 
                            'hlthacc2':'OldVsYoungAccess', 
                            'prvdhlth':'WhoShouldProvideHealthCare'})

['wrkstat', 'income', 'satfin', 'finalter', 'health', 'helpsick', 'hlthinsr', 'doc13', 'doc14', 'health1', 'hlthplan', 'diffcare', 'insrlmts', 'insrchng', 'emphlth', 'hlthcare', 'hlthall', 'hlthcovr', 'hlthtype', 'hrdshp6', 'hlthacc1', 'hlthacc4', 'hlthacc3', 'hlthacc2', 'prvdhlth']
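The printed list above still shows the original names because `print` runs before `rename`. As a minimal sketch on a toy frame (not the project data), printing after the rename confirms the new names, and passing `errors="raise"` makes pandas complain if a key in the mapping does not match an existing column. The toy frame and the misspelled key `wrkstatt` are made up for illustration.

```python
import pandas as pd

# Toy frame with two of the original GSS column names (values are illustrative).
toy = pd.DataFrame({"wrkstat": ["retired"], "income": ["$25,000 or more"]})

rename_map = {"wrkstat": "WorkStatus", "income": "Income"}

# errors="raise" raises a KeyError if a key in the mapping is not an existing
# column, instead of silently skipping it.
renamed = toy.rename(columns=rename_map, errors="raise")

# Print after renaming to confirm the new names took effect.
print(renamed.columns.tolist())

# A misspelled key (hypothetical) is caught rather than ignored.
try:
    toy.rename(columns={"wrkstatt": "WorkStatus"}, errors="raise")
except KeyError as exc:
    print("Unmapped column:", exc)
```

Note that `rename` returns a new dataframe, so reloading the CSV in a later cell would bring back the original names.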


### Cleaning Variables
We are cleaning the variables in the dataframe to get rid of NaN and make the values easier to analyze.
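The cells in this section repeat the same two steps for each column: drop the rows where the value is the column's own name, and replace NaN with 'No Response'. One possible way to factor that out is sketched below. The helper name `clean_categorical` and the toy frame are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def clean_categorical(df, column, fill_value="No Response"):
    """Return a copy of df with stray header rows dropped and NaN filled in `column`."""
    # Drop rows where the cell repeats the column's own name (a stray header row).
    cleaned = df[df[column] != column].copy()
    # Assign the result back rather than calling fillna(..., inplace=True) on a
    # column selection, which newer pandas versions warn about.
    cleaned[column] = cleaned[column].fillna(fill_value)
    return cleaned


# Toy frame standing in for Project_Data.csv (values are made up).
toy = pd.DataFrame({"finalter": ["better", np.nan, "finalter", "worse"]})
toy_clean = clean_categorical(toy, "finalter")
print(toy_clean["finalter"].tolist())
```

Because the helper works on a copy, the input frame is left unchanged, which makes it easier to rerun cells in any order.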


In [4]:
data = pd.read_csv('./Project_Data.csv')

# Cleaning the finalter variable 
# From the GSS codebook, the finalter variable stands for "change in financial situation"
# The question asked was "During the last few years, has your financial situation been getting better, worse, or has it stayed the same?"
# This question was asked in all 34 years of the survey

# Check for the Number of Missing Values 
print(data['finalter'].unique())
missing = data['finalter'].isnull().sum()
print("Missing values Before Cleaning: ", missing, '\n')


# Impute NaN values with 'No Response'
# Replace the NaN values in the finalter variable with 'No Response' since finalter is a categorical variable.
# 'No Response' is more helpful than NaN in relation to our research question.
# When making our graphs and visualizations, we can represent NaN with 'No Response' to keep our categorical variables consistent.
# Assign the result back; inplace=True on a column selection is deprecated in newer pandas.
data['finalter'] = data['finalter'].fillna('No Response')

# Count the number of instances of each unique value in the column 
print(data['finalter'].value_counts(), '\n')

# Since 36 values within the column 'finalter' are 'finalter', we decided to just drop these values
# Having the name of the column as a unique value won't be helpful when we create our visualizations
# Drop the instances that say 'finalter'
data = data[data['finalter'] != 'finalter']

# Check for Unique Types After Cleaning
unique_types_cleaned = data['finalter'].unique()
print("Unique 'Type' values after cleaning:", unique_types_cleaned, '\n')

# Print missing values after cleaning
missing = data['finalter'].isnull().sum()
print("Missing values After Cleaning: ", missing, '\n')




['better' 'stayed same' 'worse' nan 'finalter']
Missing values Before Cleaning:  4795 

finalter
stayed same    26780
better         25358
worse          15457
No Response     4795
finalter          36
Name: count, dtype: int64 

Unique 'Type' values after cleaning: ['better' 'stayed same' 'worse' 'No Response'] 

Missing values After Cleaning:  0 



In [5]:
data.head()

Unnamed: 0,wrkstat,income,satfin,finalter,health,helpsick,hlthinsr,doc13,doc14,health1,...,hlthcare,hlthall,hlthcovr,hlthtype,hrdshp6,hlthacc1,hlthacc4,hlthacc3,hlthacc2,prvdhlth
0,working full time,,not satisfied at all,better,good,,,,,,...,,,,,,,,,,
1,retired,,more or less satisfied,stayed same,fair,,,,,,...,,,,,,,,,,
2,working part time,,pretty well satisfied,better,excellent,,,,,,...,,,,,,,,,,
3,working full time,,not satisfied at all,stayed same,good,,,,,,...,,,,,,,,,,
4,keeping house,,pretty well satisfied,better,good,,,,,,...,,,,,,,,,,


In [15]:

# Cleaning the Income Variable 
# From the GSS codebook, the income variable stands for "total family income"
# The question asked was "In which of these groups did your total family income, from all sources, fall last year before taxes, that is? Just tell me the letter."
# The data provided from the GSS had ranges of incomes.

# Check for the Number of Missing Values 
print(data['income'].unique())
missing = data['income'].isnull().sum()
print("Missing values Before Cleaning: ", missing, '\n')

# To clean the income variable, we decided to drop the NaN.
# For our research question, we want to do analysis on income related to healthcare data points.
# Thus, it was not useful to keep the NaN values when making visualizations so we dropped them.
data = data.dropna(subset=['income'])

# Count the number of instances of each unique value in the column 
print(data['income'].value_counts(), '\n')

# Since 36 values within the column 'income' are 'income', we decided to just drop these values
# Having the name of the column as a unique value won't be helpful when we create our visualizations
# Drop the instances that say 'income'
data = data[data['income'] != 'income']

# Check for Unique Types After Cleaning
unique_types_cleaned = data['income'].unique()
print("Unique 'Type' values after cleaning:", unique_types_cleaned, '\n')

# Print missing values after cleaning
missing = data['income'].isnull().sum()
print("Missing values After Cleaning: ", missing, '\n')

[nan '$10,000 to $14,999' '$7,000 to $7,999' '$4,000 to $4,999'
 '$1,000 to $2,999' '$15,000 to $19,999' '$5,000 to $5,999'
 '$20,000 to $24,999' '$3,000 to $3,999' 'under $1,000' '$8,000 to $9,999'
 '$25,000 or more' '$6,000 to $6,999' 'income']
Missing values Before Cleaning:  8951 

income
$25,000 or more       34785
$10,000 to $14,999     6850
$20,000 to $24,999     5528
$15,000 to $19,999     5301
$8,000 to $9,999       2285
$1,000 to $2,999       1412
$7,000 to $7,999       1315
$5,000 to $5,999       1314
$3,000 to $3,999       1309
$6,000 to $6,999       1249
$4,000 to $4,999       1189
under $1,000            902
income                   36
Name: count, dtype: int64 

Unique 'Type' values after cleaning: ['$10,000 to $14,999' '$7,000 to $7,999' '$4,000 to $4,999'
 '$1,000 to $2,999' '$15,000 to $19,999' '$5,000 to $5,999'
 '$20,000 to $24,999' '$3,000 to $3,999' 'under $1,000' '$8,000 to $9,999'
 '$25,000 or more' '$6,000 to $6,999'] 

Missing values After Cleaning:  0 
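The income brackets are strings, so by default they sort alphabetically or by frequency rather than from lowest to highest. A minimal sketch of one way to handle this for later graphs is an ordered categorical. The bracket labels are copied from the output above; the short `income` sample is made up for illustration.

```python
import pandas as pd

# Bracket labels copied from the value_counts() output, listed lowest to highest.
income_order = [
    "under $1,000", "$1,000 to $2,999", "$3,000 to $3,999", "$4,000 to $4,999",
    "$5,000 to $5,999", "$6,000 to $6,999", "$7,000 to $7,999", "$8,000 to $9,999",
    "$10,000 to $14,999", "$15,000 to $19,999", "$20,000 to $24,999", "$25,000 or more",
]

# Toy sample standing in for the cleaned 'income' column.
income = pd.Series(["$25,000 or more", "under $1,000", "$8,000 to $9,999", "under $1,000"])

# An ordered categorical remembers the bracket order.
income_ordered = pd.Series(pd.Categorical(income, categories=income_order, ordered=True))

# sort_index() on a categorical index follows the category order, not the alphabet.
counts = income_ordered.value_counts().sort_index()
print(counts)
```

Brackets with no observations appear with a count of 0, which keeps the axis complete when plotting.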



In [7]:
data.head()

Unnamed: 0,wrkstat,income,satfin,finalter,health,helpsick,hlthinsr,doc13,doc14,health1,...,hlthcare,hlthall,hlthcovr,hlthtype,hrdshp6,hlthacc1,hlthacc4,hlthacc3,hlthacc2,prvdhlth
1613,working full time,"$10,000 to $14,999",pretty well satisfied,stayed same,fair,,,,,,...,,,,,,,,,,
1614,keeping house,"$7,000 to $7,999",more or less satisfied,stayed same,good,,,,,,...,,,,,,,,,,
1615,working full time,"$10,000 to $14,999",more or less satisfied,stayed same,excellent,,,,,,...,,,,,,,,,,
1616,working full time,"$10,000 to $14,999",not satisfied at all,stayed same,excellent,,,,,,...,,,,,,,,,,
1617,keeping house,"$10,000 to $14,999",pretty well satisfied,better,good,,,,,,...,,,,,,,,,,


In [8]:
data = pd.read_csv('./Project_Data.csv')

# Cleaning the 'health' variable 
# From the GSS codebook, the health variable stands for the "condition of health."
# The question asked was "Would you say your own health, in general, is excellent, good, fair, or poor?"

# Check for the Number of Missing Values
print(data['health'].unique())
missing = data['health'].isnull().sum()
print("Missing values Before Cleaning: ", missing, '\n')

# Impute NaN values with 'No Response'
# Replace the NaN values in the health variable with 'No Response' since health is a categorical variable.
# 'No Response' is more helpful than NaN in relation to our research question.
# When making our graphs and visualizations, we can represent NaN with 'No Response' to keep our categorical variables consistent.
# Assign the result back; inplace=True on a column selection is deprecated in newer pandas.
data['health'] = data['health'].fillna('No Response')

# Count the number of instances of each unique value in the column 
print(data['health'].value_counts(), '\n')

# Since 36 values within the column 'health' are 'health', we decided to just drop these values
# Having the name of the column as a unique value won't be helpful when we create our visualizations
# Drop the instances that say 'health'
data = data[data['health'] != 'health']

# Check for Unique Types After Cleaning
unique_types_cleaned = data['health'].unique()
print("Unique 'Type' values after cleaning:", unique_types_cleaned, '\n')

# Print missing values after cleaning
missing = data['health'].isnull().sum()
print("Missing values After Cleaning: ", missing, '\n')


['good' 'fair' 'excellent' 'poor' nan 'health']
Missing values Before Cleaning:  17236 

health
good           25651
No Response    17236
excellent      15712
fair           10737
poor            3054
health            36
Name: count, dtype: int64 

Unique 'Type' values after cleaning: ['good' 'fair' 'excellent' 'poor' 'No Response'] 

Missing values After Cleaning:  0 
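With income and health both cleaned, the relationship the research question asks about can be tabulated before any graphs are made. The sketch below uses a toy frame with made-up values, not the GSS data, and shows `pd.crosstab` with row-wise proportions.

```python
import pandas as pd

# Toy rows standing in for cleaned GSS data (values are made up for illustration).
toy = pd.DataFrame({
    "income": ["under $1,000", "under $1,000", "$25,000 or more", "$25,000 or more"],
    "health": ["poor", "fair", "good", "excellent"],
})

# normalize="index" turns each row into proportions that sum to 1,
# so income groups of different sizes can be compared directly.
share = pd.crosstab(toy["income"], toy["health"], normalize="index")
print(share)
```

Proportions are used instead of raw counts because the '$25,000 or more' group is far larger than the others in the real data.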



In [9]:
data.head()

Unnamed: 0,wrkstat,income,satfin,finalter,health,helpsick,hlthinsr,doc13,doc14,health1,...,hlthcare,hlthall,hlthcovr,hlthtype,hrdshp6,hlthacc1,hlthacc4,hlthacc3,hlthacc2,prvdhlth
0,working full time,,not satisfied at all,better,good,,,,,,...,,,,,,,,,,
1,retired,,more or less satisfied,stayed same,fair,,,,,,...,,,,,,,,,,
2,working part time,,pretty well satisfied,better,excellent,,,,,,...,,,,,,,,,,
3,working full time,,not satisfied at all,stayed same,good,,,,,,...,,,,,,,,,,
4,keeping house,,pretty well satisfied,better,good,,,,,,...,,,,,,,,,,


In [23]:

# Cleaning the 'WhoShouldProvideHealthCare' variable, which previously was 'prvdhlth'
# From the GSS codebook, this variable stands for "who provides for sick people."
# The question asked was "People have different opinions on who should provide services in America. Who do you think should primarily provide... Healthcare for the sick?  Should it be...  "

# Check for the Number of Missing Values
print(data['WhoShouldProvideHealthCare'].unique())
missing = data['WhoShouldProvideHealthCare'].isnull().sum()
print("Missing values Before Cleaning: ", missing, '\n')

# Impute NaN values with 'No Response'
# Replace the NaN values in the WhoShouldProvideHealthCare variable with No Response since WhoShouldProvideHealthCare is a categorical variable.
# No Response is more helpful than NaN when we make graphs and visualizations.
# It will make it easier to understand our data and draw conclusions when cross referencing this variable with other variables.
# Assign the result back; inplace=True on a column selection is deprecated in newer pandas.
data['WhoShouldProvideHealthCare'] = data['WhoShouldProvideHealthCare'].fillna('No Response')

# Count the number of instances of each unique value in the column 
value_counts = data['WhoShouldProvideHealthCare'].value_counts()
print("Number of Instances before Dropping: ", value_counts, '\n')

# Since 36 values within the column 'WhoShouldProvideHealthCare' are 'prvdhlth', we decided to just drop these values
# Having the name of the column as a unique value won't be helpful when we create our visualizations
# Drop the instances that say 'prvdhlth'
data = data[data['WhoShouldProvideHealthCare'] != 'prvdhlth']

# Check for Unique Types After Cleaning
unique_types_cleaned = data['WhoShouldProvideHealthCare'].unique()
print("Unique 'Type' values after cleaning:", unique_types_cleaned, '\n')

#Print missing values after cleaning
missing = data['WhoShouldProvideHealthCare'].isnull().sum()
print("Missing values After Cleaning: ", missing, '\n')


['No Response' 'prvdhlth' 'private companies/for-profit organizations'
 'government' 'non-profit organizations/charities/cooperatives'
 'family, relatives or friends' 'religious organizations']
Missing values Before Cleaning:  0 

Number of Instances before Dropping:  WhoShouldProvideHealthCare
No Response                                        69959
government                                          1429
private companies/for-profit organizations           515
non-profit organizations/charities/cooperatives      283
family, relatives or friends                         182
prvdhlth                                              36
religious organizations                               22
Name: count, dtype: int64 

Unique 'Type' values after cleaning: ['No Response' 'private companies/for-profit organizations' 'government'
 'non-profit organizations/charities/cooperatives'
 'family, relatives or friends' 'religious organizations'] 

Missing values After Cleaning:  0 



In [10]:
data = pd.read_csv('./Project_Data.csv')

# Cleaning the 'wrkstat' variable 
# From the GSS codebook, the wrkstat variable stands for the "labor force status."
# The question asked was "Last week were you working full time, part time, going to school, keeping house, or what?"

# Check for the Number of Missing Values
print(data['wrkstat'].unique())
missing = data['wrkstat'].isnull().sum()
print("Missing values Before Cleaning: ", missing, '\n')

# Impute NaN values with 'No Response'
# Replace the NaN values in the wrkstat variable with No Response since wrkstat is a categorical variable.
# No Response is more helpful than NaN when we make graphs and visualizations.
# It will make it easier to understand our data and draw conclusions when cross referencing this variable with other variables.
# Assign the result back; inplace=True on a column selection is deprecated in newer pandas.
data['wrkstat'] = data['wrkstat'].fillna('No Response')

# Count the number of instances of each unique value in the column 
value_counts = data['wrkstat'].value_counts()
print("Number of Instances before Dropping: ", value_counts, '\n')

# Since 36 values within the column 'wrkstat' are 'wrkstat', we decided to just drop these values
# Having the name of the column as a unique value won't be helpful when we create our visualizations
# Drop the instances that say 'wrkstat'
data = data[data['wrkstat'] != 'wrkstat']

# Check for Unique Types After Cleaning
unique_types_cleaned = data['wrkstat'].unique()
print("Unique 'Type' values after cleaning:", unique_types_cleaned, '\n')

#Print missing values after cleaning
missing = data['wrkstat'].isnull().sum()
print("Missing values After Cleaning: ", missing, '\n')

['working full time' 'retired' 'working part time' 'keeping house'
 'in school' 'unemployed, laid off, looking for work'
 'with a job, but not at work because of temporary illness, vacation, strike'
 'other' 'wrkstat' nan]
Missing values Before Cleaning:  36 

Number of Instances before Dropping:  wrkstat
working full time                                                             35267
retired                                                                       10886
keeping house                                                                 10764
working part time                                                              7430
unemployed, laid off, looking for work                                         2621
in school                                                                      2187
other                                                                          1643
with a job, but not at work because of temporary illness, vacation, strike     1556
wrkstat              

In [11]:
data.head()

Unnamed: 0,wrkstat,income,satfin,finalter,health,helpsick,hlthinsr,doc13,doc14,health1,...,hlthcare,hlthall,hlthcovr,hlthtype,hrdshp6,hlthacc1,hlthacc4,hlthacc3,hlthacc2,prvdhlth
0,working full time,,not satisfied at all,better,good,,,,,,...,,,,,,,,,,
1,retired,,more or less satisfied,stayed same,fair,,,,,,...,,,,,,,,,,
2,working part time,,pretty well satisfied,better,excellent,,,,,,...,,,,,,,,,,
3,working full time,,not satisfied at all,stayed same,good,,,,,,...,,,,,,,,,,
4,keeping house,,pretty well satisfied,better,good,,,,,,...,,,,,,,,,,


In [None]:
# Cleaning the GeneralHealth variable
# GeneralHealth question: Would you say that in general your health is Excellent, Very good, Good, Fair, or Poor?

print(data['GeneralHealth'].value_counts())
print()
missing = data['GeneralHealth'].isnull().sum()
print("Missing values Before Cleaning: ", missing, '\n')

# Impute NaN values with 'No Response'
# Replace the NaN values in the GeneralHealth variable with No Response since GeneralHealth is a categorical variable.
# No Response is more helpful than NaN when we make graphs and visualizations.
# It will make it easier to understand our data and draw conclusions when cross referencing this variable with other variables.
# Assign the result back; inplace=True on a column selection is deprecated in newer pandas.
data['GeneralHealth'] = data['GeneralHealth'].fillna('No Response')

# Print missing values after cleaning
missing = data['GeneralHealth'].isnull().sum()
print("Missing values After Cleaning: ", missing, '\n')