**Data Cleaning**



QUESTION: Can social, work, internet or time trends be used to determine happiness?

To look at happiness I decided to extract as much data as I could that pertained to my question from the GSS (General Social Survey).

In [0]:
# import packages
import pandas as pd
import numpy as np

In [9]:
# read in data
data = pd.read_csv('GSSdata.csv')

# how much happiness are we looking at, right off the bat?
data['General happiness'].value_counts()


Pretty happy      33563
Very happy        18823
Not too happy      7668
Not applicable     4383
No answer           338
Don't know           39
Name: General happiness, dtype: int64

It is time to start cleaning. Since the data is spread out across multiple years and different ballots, the columns are messy and can be inconsistent.

I'll start by replacing every non-response with NaN so it is obvious what I'm working with. This keeps the columns consistent wherever an answer was meant to carry no information.

I also want to make sure I'll be able to split up this data and then re-merge it later, so I'll extract the unique identifier for each row.


In [10]:
# replace blank responses with NaN
data = data.replace('Not applicable', np.nan)
data = data.replace('No issp', np.nan)
data = data.replace('No answer', np.nan)
# reset_index() adds an 'index' column, which serves as the unique row identifier
data = data.reset_index()
data.head()

Unnamed: 0,index,Rs income in constant $,Year of birth,Household type (condensed),Household type,Rs job is secure,Standard of living of r will improve,Hours per day r have to relax,Days per month r work extra hours,Www hours per week,Email hours per week,R most recent home you have purchased.,For how long have you had your present job,Years worked for your present employer,In uncertain times i usually expect best,If something can go wrong for me it will,I'm always optimistic about my future,How much time felt sad in past wk,How much time felt lonely in past wk,How much time felt happy in past wk,How much time sleep was restless in past wk,How much time felt depressed in past wk,Hours of internet use on weekends,Minutes of internet use on weekends,Hours of internet use on weekdays,Minutes of internet use on weekdays,I expect more good things to happen to me than bad,I rarely count on good things happening to me,I hardly ever expect things to go my way,Hours per day watching tv,Satisfaction with financial situation,Rs self ranking of social position,Respondents income,Total family income,Household members 18 yrs and older,Household members 13 thru 17 yrs old,Household members 6 thru 12 yrs old,Household members less than 6 yrs old,Number of persons in household,Rs highest degree,Age of respondent,Number of hours usually work a week,Number of hours worked last week,Respondent id number,Region of interview,Should govt reduce income differences,General happiness,Job or housework,Spend evening with siblings,Spend evening with parents,Spend evening at bar,Spend evening with friends,Spend evening with neighbor,Spend evening with relatives,Can people be trusted,People fair or try to take advantage,People helpful or looking out for selves,Is life exciting or dull,Happiness of marriage,Gss year for this respondent
0,0,0,1949,,,,,,,,,,-1.0,-1.0,,,,,,,,,,,,,,,,,Not at all sat,,,,1,0,0,0,1,Bachelor,23,,,1.0,E. nor. central,,Not too happy,A little dissat,,,,,,,Depends,Fair,Lookout for self,,,1972.0
1,1,0,1902,,,,,,,,,,-1.0,-1.0,,,,,,,,,,,,,,,,,More or less,,,,2,0,0,0,2,Lt high school,70,,,2.0,E. nor. central,,Not too happy,,,,,,,,Can trust,Fair,Helpful,,,1972.0
2,2,0,1924,,,,,,,,,,-1.0,-1.0,,,,,,,,,,,,,,,,,Satisfied,,,,2,1,1,0,4,High school,48,,,3.0,E. nor. central,,Pretty happy,Mod. satisfied,,,,,,,Cannot trust,Take advantage,Lookout for self,,,1972.0
3,3,0,1945,,,,,,,,,,-1.0,-1.0,,,,,,,,,,,,,,,,,Not at all sat,,,,2,0,0,0,2,Bachelor,27,,,4.0,E. nor. central,,Not too happy,Very satisfied,,,,,,,Cannot trust,Fair,Lookout for self,,,1972.0
4,4,0,1911,,,,,,,,,,-1.0,-1.0,,,,,,,,,,,,,,,,,Satisfied,,,,2,0,0,0,2,High school,61,,,5.0,E. nor. central,,Pretty happy,,,,,,,,Cannot trust,Fair,Lookout for self,,,1972.0
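
The three separate `replace` calls can also be collapsed into one. The snippet below is a minimal sketch on a made-up two-column frame (`sample` and its values are hypothetical, not taken from the GSS file); it only assumes that the same three labels mark a non-response.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the GSS frame, for illustration only
sample = pd.DataFrame({
    "General happiness": ["Very happy", "Not applicable", "No answer"],
    "Rs job is secure": ["No issp", "Very true", "Not applicable"],
})

# Labels treated as non-responses (the same three used in the cell above)
non_responses = ["Not applicable", "No issp", "No answer"]

# Passing a list to replace() swaps every listed label in a single pass
cleaned = sample.replace(non_responses, np.nan)

# Count the missing values per column to confirm the replacement
print(cleaned.isna().sum())
```

Keeping the labels in one list makes it easier to add further non-response codes later, such as "Don't know", if they should be treated the same way.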


In [0]:
# There is a lot of data here with a lot of empty values
# My goal is to look at happiness as a whole across years
# After looking at the data, I'd like to split the columns into categories
# and study happiness based on these groups
# The groups will be:
#             1) job/work
#             2) social life
#             3) Screen time
# Each of these groups of data will be connected with:
#             1) unique ID
#             2) Happiness of respondent

This splits the data into regions of interest. I want general happiness on its own so I can look at it as a standalone while exploring. The other groups contain multiple items, but I will probably focus on income, social time, and web time.

In [0]:
happiness = ['index',
             'Standard of living of r will improve', 
             'How much time felt sad in past wk', 
             'How much time felt happy in past wk',
             'How much time felt depressed in past wk',
             'I expect more good things to happen to me than bad',
             'I\'m always optimistic about my future',
             'Happiness of marriage',
             'General happiness',
             'Satisfaction with financial situation',
             'Rs self ranking of social position',
             'Is life exciting or dull',
             "Gss year for this respondent                       "]


career = ['index',
          'Rs income in constant $',
          'Rs job is secure',
          'Respondents income',
          'Number of hours usually work a week',
          'General happiness']

demo = ['index',
        'Gss year for this respondent                       ',
        'Year of birth',
        'Region of interview',
        'General happiness']
        
social = ['index',
          'Spend evening with siblings',
          'Spend evening with relatives',
          'Spend evening with neighbor',
          'Spend evening with parents',
          'Spend evening with friends',
          'Spend evening at bar',
          'General happiness']

web = ['index',
       'Www hours per week',
       'Email hours per week',
       'Hours per day watching tv',
       'Hours of internet use on weekends',
       'Minutes of internet use on weekends',
       'Hours of internet use on weekdays',
       'Minutes of internet use on weekdays',
       'Gss year for this respondent                       ',
       'General happiness']

In [0]:
# make a dataframe for each set of features
df_happiness = data[happiness]
df_jobs = data[career]
df_demo = data[demo]
df_social = data[social]
df_internet = data[web]

For some reason, year has a bunch of spaces after it. I'd like to fix that.

In [0]:
df_happiness = df_happiness.rename(columns={
    'Gss year for this respondent                       ' : "year"
})
df_jobs = df_jobs.rename(columns={
    'Gss year for this respondent                       ' : "year"
})
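
An alternative to spelling out the padded name is to strip whitespace from every column label first. This is only a sketch on a hypothetical two-column frame (`frame` is made up for illustration); it assumes the padding is ordinary trailing spaces.

```python
import pandas as pd

# Hypothetical frame with the padded column name seen in the GSS export
frame = pd.DataFrame({
    "Gss year for this respondent                       ": [1972, 1973],
    "General happiness": ["Pretty happy", "Very happy"],
})

# Strip leading/trailing whitespace from every column label
frame.columns = frame.columns.str.strip()

# The rename no longer needs the exact run of trailing spaces
frame = frame.rename(columns={"Gss year for this respondent": "year"})

print(list(frame.columns))
```

Stripping the labels once on `data`, before building the feature lists, would remove the need to repeat the padded string in each list.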

There were some strange values in the income column, and it was being read in as a string. I will remove the rows holding the 'Source' value and convert the column to a float. I also want to remove anyone with 0 income.

In [14]:
# clean jobs dataframe
# drop the stray 'Source' rows; .copy() avoids a SettingWithCopyWarning below
df_jobs = df_jobs[df_jobs['Rs income in constant $'] != 'Source'].copy()
df_jobs['income'] = df_jobs['Rs income in constant $'].astype(float)
# keep only respondents with a positive income
df_jobs = df_jobs[df_jobs['income'] > 0]
df_jobs.head()

Unnamed: 0,index,Rs income in constant $,Rs job is secure,Respondents income,Number of hours usually work a week,General happiness,income
3117,3117,4935,,$1000 to 2999,,Very happy,4935.0
3118,3118,43178,,$15000 - 19999,,Very happy,43178.0
3121,3121,18505,,$7000 to 7999,,Pretty happy,18505.0
3122,3122,22206,,$8000 to 9999,,Pretty happy,22206.0
3123,3123,55515,,$20000 - 24999,,Very happy,55515.0
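
If other non-numeric strings turn up in the income column, `astype(float)` will raise an error. A more forgiving option is `pd.to_numeric` with `errors="coerce"`, sketched here on made-up values (the `jobs` frame is hypothetical).

```python
import pandas as pd

# Hypothetical income column mixing numeric strings with a stray label
jobs = pd.DataFrame({
    "Rs income in constant $": ["4935", "Source", "0", "43178"],
})

# Anything that cannot be parsed as a number becomes NaN instead of raising
income = pd.to_numeric(jobs["Rs income in constant $"], errors="coerce")
jobs = jobs.assign(income=income)

# NaN > 0 evaluates to False, so unparseable rows drop out with the zeros
jobs = jobs[jobs["income"] > 0]

print(jobs)
```

The trade-off is that coercion hides unexpected labels silently, so it is worth checking `value_counts()` on the raw column first.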


Time to look at the social dataframe.

In [15]:
df_social.head()

Unnamed: 0,index,Spend evening with siblings,Spend evening with relatives,Spend evening with neighbor,Spend evening with parents,Spend evening with friends,Spend evening at bar,General happiness
0,0,,,,,,,Not too happy
1,1,,,,,,,Not too happy
2,2,,,,,,,Pretty happy
3,3,,,,,,,Not too happy
4,4,,,,,,,Pretty happy


If the ballot didn't contain any information about the person's social life then their responses to happiness are irrelevant to this dataframe.

In [16]:
# clean social dataframe: keep only rows that answered every social question
social_cols = ['Spend evening with siblings',
               'Spend evening with relatives',
               'Spend evening with neighbor',
               'Spend evening with parents',
               'Spend evening with friends']
# .copy() so that new columns can be added later without a SettingWithCopyWarning
df_social = df_social.dropna(subset=social_cols).copy()
df_social

Unnamed: 0,index,Spend evening with siblings,Spend evening with relatives,Spend evening with neighbor,Spend evening with parents,Spend evening with friends,Spend evening at bar,General happiness
9120,9120,Sev times a year,Sev times a year,Never,Never,Sev times a year,Never,Pretty happy
9121,9121,Sev times a year,Sev times a week,Sev times a week,No such people,Sev times a year,Never,Pretty happy
9122,9122,Never,Once a month,Once a month,Sev times a mnth,Once a month,Sev times a mnth,Very happy
9123,9123,Sev times a year,Sev times a week,Once a month,No such people,Sev times a year,Sev times a year,Very happy
9124,9124,Sev times a mnth,Once a month,Sev times a mnth,Sev times a mnth,Sev times a year,Once a month,Pretty happy
...,...,...,...,...,...,...,...,...
32337,32337,Never,Once a year,Sev times a mnth,Never,Once a month,Sev times a year,Pretty happy
32346,32346,Sev times a week,Sev times a week,Never,Sev times a week,Sev times a week,Once a month,Pretty happy
32351,32351,Never,Sev times a week,Never,Never,Sev times a mnth,Never,Not too happy
32362,32362,Once a year,Sev times a mnth,Once a year,No such people,Never,Never,Pretty happy


In [17]:
df_social['Spend evening with friends'].value_counts()

Once a month        2888
Sev times a mnth    2688
Sev times a week    2576
Sev times a year    2538
Never               1393
Once a year          952
Almost daily         378
Don't know            17
Name: Spend evening with friends, dtype: int64

I want to assign a score to each value in the output above. I will use these scores to determine how "social" a person is. I will create new columns holding the social value for siblings, relatives, neighbors, parents, and friends, then sum them for a total social score.

In [0]:
map_vals = {
    "Don't know": 0,
    "No such people": 0,
    "Never": 1,
    "Once a year": 2,
    "Once a month": 3,
    "Sev times a year": 4,
    "Sev times a mnth": 5,
    "Sev times a week": 6,
    "Almost daily": 7
}
df_social["siblings"] = df_social["Spend evening with siblings"].map(map_vals)
df_social["relatives"] = df_social["Spend evening with relatives"].map(map_vals)
df_social["neighbor"] = df_social["Spend evening with neighbor"].map(map_vals)
df_social["parents"] = df_social["Spend evening with parents"].map(map_vals)
df_social["friends"] = df_social["Spend evening with friends"].map(map_vals)
df_social["social_score"] = df_social[['siblings',
                                       'relatives',
                                       'neighbor',
                                       'parents',
                                       'friends']].sum(axis=1)

In [20]:
# observe the results
df_social.head()

Unnamed: 0,index,Spend evening with siblings,Spend evening with relatives,Spend evening with neighbor,Spend evening with parents,Spend evening with friends,Spend evening at bar,General happiness,siblings,relatives,neighbor,parents,friends,social_score
9120,9120,Sev times a year,Sev times a year,Never,Never,Sev times a year,Never,Pretty happy,4,4,1,1,4,14
9121,9121,Sev times a year,Sev times a week,Sev times a week,No such people,Sev times a year,Never,Pretty happy,4,6,6,0,4,20
9122,9122,Never,Once a month,Once a month,Sev times a mnth,Once a month,Sev times a mnth,Very happy,1,3,3,5,3,15
9123,9123,Sev times a year,Sev times a week,Once a month,No such people,Sev times a year,Sev times a year,Very happy,4,6,3,0,4,17
9124,9124,Sev times a mnth,Once a month,Sev times a mnth,Sev times a mnth,Sev times a year,Once a month,Pretty happy,5,3,5,5,4,22
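
One thing to watch in the mapping: "Sev times a year" scores 4 while "Once a month" scores 3, even though once a month is the more frequent of the two. If the score is meant to rise strictly with frequency, a mapping along these lines may be closer to the intent. This is a sketch, not the mapping used for the results above; `ordered_vals` and `answers` are hypothetical names.

```python
import pandas as pd

# Hypothetical alternative: labels ranked strictly from least to most frequent
ordered_vals = {
    "Don't know": 0,
    "No such people": 0,
    "Never": 1,
    "Once a year": 2,
    "Sev times a year": 3,
    "Once a month": 4,
    "Sev times a mnth": 5,
    "Sev times a week": 6,
    "Almost daily": 7,
}

# A few sample answers to show the mapping in action
answers = pd.Series(["Sev times a year", "Once a month",
                     "Don't know", "No such people"])
scores = answers.map(ordered_vals)

print(scores.tolist())
```

Switching mappings changes every `social_score`, so any downstream analysis would need to be rerun with the new values.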


In [0]:
# clean internet dataframe
df_internet = df_internet[(df_internet['Www hours per week'].notna()) &
                          (df_internet['Email hours per week'].notna()) &
                          (df_internet['Hours per day watching tv'].notna()) &
                          (df_internet['Hours of internet use on weekends'].notna()) &
                          (df_internet['Hours of internet use on weekdays'].notna()) &
                          (df_internet['Minutes of internet use on weekdays'].notna())]

In [21]:
# rename internet columns
df_internet = df_internet.rename(columns={
    'Www hours per week' : 'week_web',
    'Hours of internet use on weekdays' : 'weekday_internet',
    'Hours of internet use on weekends' : 'weekend_internet',
    'Hours per day watching tv' : 'week_tv',
    'Email hours per week' : 'week_email',
    'Minutes of internet use on weekends' : 'weekend_internet_minutes',
    'Minutes of internet use on weekdays' : 'weekday_internet_minutes'
})

# convert to numbers: treat non-responses as missing, then fill missing with '0'
df_internet = df_internet.replace("Don't know", np.nan)
df_internet = df_internet.replace("Not applicable", np.nan)
df_internet = df_internet.replace("No answer", np.nan)
df_internet = df_internet.fillna('0')


df_internet['week_web'] = df_internet['week_web'].astype('int32')
df_internet['week_email'] = df_internet['week_email'].astype('int32')
df_internet['week_tv'] = df_internet['week_tv'].astype('int32')
df_internet['weekend_internet'] = df_internet['weekend_internet'].astype('int32')
df_internet['weekend_internet_minutes'] = (df_internet['weekend_internet_minutes']
                                           .astype('int32'))
df_internet['weekday_internet'] = (df_internet['weekday_internet']
                                   .astype('int32'))
df_internet['weekday_internet_minutes'] = (df_internet['weekday_internet_minutes']
                                           .astype('int32'))

# get total internet time
# I'll ignore added minutes for simplicity
df_internet['internet_per_week'] = (df_internet['weekend_internet'] * 2 +
                                    df_internet['weekday_internet'] * 5)
# get total screen time
# lots of people probably miscalculate this...but it's the best data I could find
df_internet['screen_time_per_week'] = (df_internet['internet_per_week'] +
                                       df_internet['week_email'] +
                                       df_internet['week_tv'] * 7)

df_internet.head()

Unnamed: 0,index,week_web,week_email,week_tv,weekend_internet,weekend_internet_minutes,weekday_internet,weekday_internet_minutes,Gss year for this respondent,General happiness,internet_per_week,screen_time_per_week
0,0,0,0,0,0,0,0,0,1972,Not too happy,0,0
1,1,0,0,0,0,0,0,0,1972,Not too happy,0,0
2,2,0,0,0,0,0,0,0,1972,Pretty happy,0,0
3,3,0,0,0,0,0,0,0,1972,Not too happy,0,0
4,4,0,0,0,0,0,0,0,1972,Pretty happy,0,0
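
The weekly total above drops the minutes columns. If they are wanted, they can be folded in as fractions of an hour. The sketch below uses made-up numbers (the `usage` frame is hypothetical) and assumes each minutes column holds the extra minutes on top of the matching hours column, which I have not verified against the GSS codebook.

```python
import pandas as pd

# Hypothetical usage data: hours plus leftover minutes per day
usage = pd.DataFrame({
    "weekday_internet": [1, 0],
    "weekday_internet_minutes": [30, 45],
    "weekend_internet": [2, 0],
    "weekend_internet_minutes": [0, 15],
})

# Convert hours + minutes into fractional hours per day
weekday_hours = usage["weekday_internet"] + usage["weekday_internet_minutes"] / 60
weekend_hours = usage["weekend_internet"] + usage["weekend_internet_minutes"] / 60

# Five weekdays and two weekend days make up the week
usage["internet_per_week_exact"] = weekday_hours * 5 + weekend_hours * 2

print(usage["internet_per_week_exact"])
```

This matters most for light users: someone reporting 0 hours and 45 minutes per day would otherwise count as having no internet time at all.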


Now that cleaning is done, I'll write these dataframes out to CSV files so they can be imported for exploration and modeling.

In [0]:
df_happiness.to_csv('happy.csv')

In [0]:
df_jobs.to_csv('jobs.csv')

In [0]:
df_social.to_csv('social.csv')

In [0]:
df_internet.to_csv('internet.csv')
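
Because every frame keeps the `index` column, the files can be re-joined later. The following sketch shows one way to do that round trip with small hypothetical frames written to a temporary directory; the values and file locations are made up. It also passes `index=False`, which keeps pandas from writing its row labels as an extra unnamed column.

```python
import os
import tempfile

import pandas as pd

# Hypothetical slices standing in for the cleaned social and jobs frames
social = pd.DataFrame({"index": [9120, 9121], "social_score": [14, 20]})
jobs = pd.DataFrame({"index": [9121, 9122], "income": [18505.0, 22206.0]})

with tempfile.TemporaryDirectory() as tmp:
    social_path = os.path.join(tmp, "social.csv")
    jobs_path = os.path.join(tmp, "jobs.csv")

    # index=False avoids an extra 'Unnamed: 0' column when reading back
    social.to_csv(social_path, index=False)
    jobs.to_csv(jobs_path, index=False)

    social_back = pd.read_csv(social_path)
    jobs_back = pd.read_csv(jobs_path)

# Re-merge on the shared identifier; an inner join keeps respondents in both
merged = social_back.merge(jobs_back, on="index", how="inner")

print(merged)
```

An inner join keeps only respondents present in both files; `how="outer"` would keep everyone and leave NaN where a group has no data.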