# Project 1: Trends in General Happiness versus Financial Satisfaction and Work Satisfaction Over Time
## Group 11
- Elizabeth Hunter (mwm6nq)
- Michael Hijduk (aqt7bt)
- Eva Massarelli (ecm8yu)
- Anjali Mehta (wfn7ad)
- Anika Potluri (agu4yh)
## ! git clone https://www.github.com/DS3001/group11

# Summary


### Research Question 2 
One general topic this project explored is the relationship between age and general happiness and satisfaction. Three questions address this theme: at what age are people happiest, at what age are they most financially satisfied, and at what age are they most satisfied with their jobs? The project used box plots to explore these questions and analyze the relationship between age and each form of satisfaction. This analysis assesses the main findings, the statistical support for the conclusions, the research strategy, and the interpretation of the results.

# Data
Data used for this project was sourced from the General Social Survey (GSS), a public opinion survey which has been administered annually or biennially in the U.S. starting in 1972. It is important to note that according to the GSS 2022 release notes:

"Changes in opinions, attitudes, and behaviors observed in 2021 and 2022 relative to historical trends may be due to actual change in concept over time and/or may have resulted from methodological changes made to the survey methodology during the COVID-19 global pandemic. Research and interpretation done using the 2021 and 2022 GSS data should take extra care to ensure the analysis reflects actual changes in public opinion and is not unduly influenced by the change in data collection methods. For more information on the 2021 and 2022 GSS methodology and its implications, please visit https://gss.norc.org/Get-The-Data."

To simplify importing the large data set, the data were first loaded into R, and the variables of interest, shown in the data dictionary below, were selected and exported to a comma-separated values (CSV) file.

| Variable        | Description                                                      | Potential Responses |
| --- | --- | --- |
| AGE     | indicates the respondent's age |                     |
| YEAR    | indicates the year of the respondent's answers     |               |
| WRKSTAT | Answers the question: Last week were you working full time, part time, going to school, keeping house, or what?       | "working full time," "working part time," "with a job, but not at work because of temporary illness, vacation, strike," "unemployed, laid off, looking for work," "retired," "in school," "keeping house," "other"  |
| RINCOME | Answers the question: Did you earn any income from [OCCUPATION DESCRIBED IN OCC-INDUSTRY] in [the last year]? |  "under \\$1,000,"  "\\$1,000 to \\$2,999,"  "\\$3,000 to \\$3,999,"  "\\$4,000 to \\$4,999,"  "\\$5,000 to \\$5,999,"  "\\$6,000 to \\$6,999,"  "\\$7,000 to \\$7,999,"  "\\$8,000 to \\$9,999,"  "\\$10,000 to \\$14,999,"  "\\$15,000 to \\$19,999,"  "\\$20,000 to \\$24,999,"  "\\$25,000 or more"   |
| HAPPY   | Answers the question: how would you say things are these days--would you say that you are very happy, pretty happy, or not too happy?    | "very happy," "pretty happy," "not too happy"       |
| SATFIN  | Answers the question: We are interested in how people are getting along financially these days. So far as you and your family are concerned, would you say that you are pretty well satisfied with your present financial situation, more or less satisfied, or not satisfied at all?  | "pretty well satisfied," "more or less satisfied," "not satisfied at all" |
| SATJOB  | Answers the question: On the whole, how satisfied are you with the work you do -- would you say you are very satisfied, moderately satisfied, a little dissatisfied, or very dissatisfied?    | "very satisfied," "moderately satisfied," "a little dissatisfied," "very dissatisfied" |

See below for the first entries into our selected dataframe.

In [7]:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('./data/selectedData.csv') # Import data into environment
df.head()

| Unnamed: 0 | age | year | wrkstat | rincome | happy | satfin | satjob |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 23.0 | 1972 | working full time |  | not too happy | not satisfied at all | a little dissatisfied |
| 1 | 70.0 | 1972 | retired |  | not too happy | more or less satisfied |  |
| 2 | 48.0 | 1972 | working part time |  | pretty happy | pretty well satisfied | moderately satisfied |
| 3 | 27.0 | 1972 | working full time |  | not too happy | not satisfied at all | very satisfied |
| 4 | 61.0 | 1972 | keeping house |  | pretty happy | pretty well satisfied |  |


To start wrangling the data, we first looked at the dimensions of the data and the unique variable names.

In [None]:
# wrangling
print(df.shape, '\n') # List the dimensions of df
print(df.dtypes, '\n') # The types of the variables
print(df.columns, '\n') # Column names

Many of the variables are of the "object" data type; because these data are categorical, this is okay.
We looked at age, year, and work status first. Per good data management practices, any data that we manipulated was put into a separate dataframe, which we called "gdf."

In [None]:
# Age
var = 'age'
print(df[var].describe(),'\n') # 72390-71612=769 missing values
print(df[var].unique(),'\n') # missing values are already in nan format
df[var].hist(bins=50) # Initial histogram, odd spikes may be due to how survey was previously administered

In [None]:
print('Total Missings: \n', sum(df[var].isnull()),'\n') # says 769 are missing, matches expected
gdf = df.loc[df[var].notna(), :].copy() # keeps only rows where age is not null; copy() so later edits to gdf do not touch df or raise SettingWithCopyWarning
print('Total Missings after nans removed: \n', sum(gdf[var].isnull()),'\n') # checks to see if nans were removed

In [None]:
# Year
var = 'year'
print(df[var].describe(),'\n') # 72390-72390=0 missing values, no nans to remove
df[var].hist(bins=50) # Initial histogram
# can see that the GSS was previously conducted annually
# since the mid-90s, the survey has only been conducted on even numbered years.

In [None]:
# Work status
var = 'wrkstat'
print(df[var].describe(),'\n') # 72390-72354=36 missing values expected
print(df[var].unique(),'\n') # 8 categories not including nan (will remove nans and "other")
print(df[var].value_counts(), '\n')
print(df[var].hist(bins=8,grid=False), '\n') # plot
plt.xticks(rotation=90) # makes plot readable
df[var+'_NA'] = df[var].isnull() # Create a missing-value dummy for work status
print('Total Missings: \n', sum(df[var+'_NA']),'\n') # 36 missing values, already in nan form

In [None]:
gdf.loc[gdf[var] == 'with a job, but not at work because of temporary illness, vacation, strike', var] = 'with job, not at work rn' # shorten the long category label
gdf = gdf.loc[gdf['wrkstat'].notna(), :] # remove the nan values, as there are only 36
gdf = gdf.loc[gdf['wrkstat'] != 'other', :] # even though there are 1,643 "other" values, we decided to remove them
print(gdf[var].value_counts(), '\n')
gdf = gdf.rename(columns = {'wrkstat': 'work status'})

Next, we investigated income and converted the entries into real dollar amounts. We also cleaned work satisfaction and financial satisfaction. For this lab, we decided to drop null values from the data, as we were not confident in a method of interpolating values. However, after recent classes, we would likely use k-nearest neighbors in the future to handle some of the variables that included a large amount of missing data, such as "rincome," "happy," "satfin," and "satjob."
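As a rough illustration of the k-nearest-neighbors idea mentioned above, the sketch below fills a missing categorical answer with the most common answer among the k respondents closest in age. It is a minimal sketch and not part of our analysis: the function `knn_impute`, the choice of age as the only distance feature, and the toy rows are all assumptions made for illustration.

```python
from collections import Counter

def knn_impute(rows, target, feature, k=3):
    """Fill missing `target` values with the most common value among the
    k rows (with a known target) whose `feature` value is closest."""
    known = [r for r in rows if r[target] is not None]
    filled = []
    for r in rows:
        if r[target] is not None:
            filled.append(dict(r))  # keep observed answers unchanged
            continue
        # Rank rows with a known answer by distance in the feature (here: age)
        neighbors = sorted(known, key=lambda n: abs(n[feature] - r[feature]))[:k]
        vote = Counter(n[target] for n in neighbors).most_common(1)[0][0]
        filled.append({**r, target: vote})
    return filled

# Hypothetical toy rows, not GSS data
rows = [
    {"age": 23, "happy": "not too happy"},
    {"age": 25, "happy": "pretty happy"},
    {"age": 27, "happy": "pretty happy"},
    {"age": 61, "happy": "very happy"},
    {"age": 64, "happy": "very happy"},
    {"age": 70, "happy": "pretty happy"},
    {"age": 26, "happy": None},
    {"age": 66, "happy": None},
]
imputed = knn_impute(rows, target="happy", feature="age", k=3)
print([r["happy"] for r in imputed[-2:]])
```

In practice, more features than age would likely be needed, and an imputed answer should be flagged so it can be separated from observed answers.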

To clean "rincome," we converted the income values into real dollars so that a more meaningful comparison over time could be made. Categories with adjusted incomes were already available, but we decided to use the non-adjusted values and do the conversion ourselves.

In [None]:
# Income
# Conversion factors from https://liberalarts.oregonstate.edu/spp/polisci/faculty-staff/robert-sahr/inflation-conversion-factors-years-1774-estimated-2024-dollars-recent-years/individual-year-conversion-factor-table-0
# saved in a new Excel file with the year and the factor that converts that year's money to 2016 dollars
# (the 2017 to 2022 conversion factors are estimates)
cpi = pd.read_excel('./Data/cpi.xlsx')
print(cpi.head()) #values from cpi data

var = 'rincome'
print(df[var].describe(),'\n') # 72390-42333=30057 missing values
print(df[var].unique(),'\n') # missing values are already in nan format
df[var].hist(bins=12) # odd spikes may be due to how survey previously pooled ages (Methodological 56)
plt.xticks(rotation=90)
df[var+'_NA'] = df[var].isnull() # Create a missing-value dummy for income
print('Total Missings: \n', sum(df[var+'_NA']),'\n') # missing values match expected

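For income, we decided to make the variable numeric. To do this, we replaced each income range with a single representative value, in most cases the midpoint of the range (for example, the 3,000 to 3,999 range became 3,500). The open-ended categories were set to their bounds (1,000 for "under 1,000" and 25,000 for "25,000 or more"), and the 1,000 to 2,999 range was coded as 1,500.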
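As an alternative to typing each value by hand, the representative value could be computed from the category label itself. The following is a minimal sketch under that assumption: `income_midpoint` is a hypothetical helper, not the code we ran, and it returns the single stated bound for open-ended labels.

```python
import re

def income_midpoint(label):
    """Return a representative dollar value for a GSS-style income label."""
    # Pull out every dollar amount in the label, dropping thousands separators
    amounts = [int(a.replace(",", "")) for a in re.findall(r"\$([\d,]+)", label)]
    if len(amounts) == 2:
        low, high = amounts
        # Ranges end in ...999, so add 1 to land on the round midpoint
        return (low + high + 1) / 2
    # Open-ended labels ('under $1,000', '$25,000 or more') state one bound; use it
    return float(amounts[0])

for label in ["$3,000 to $3,999", "$8,000 to $9,999", "under $1,000", "$25,000 or more"]:
    print(label, "->", income_midpoint(label))
```

Computing the value from the label keeps the coding rule in one place, which makes it easier to check that every range is treated the same way.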

In [None]:
gdf = gdf.loc[gdf['rincome'].notna(), :] # remove nans
income_values = {'under $1,000': 1000, '$1,000 to $2,999': 1500, '$3,000 to $3,999': 3500,
                 '$4,000 to $4,999': 4500, '$5,000 to $5,999': 5500, '$6,000 to $6,999': 6500,
                 '$7,000 to $7,999': 7500, '$8,000 to $9,999': 9000, '$10,000 to $14,999': 12500,
                 '$15,000 to $19,999': 17500, '$20,000 to $24,999': 22500, '$25,000 or more': 25000}
gdf = gdf.replace({'rincome': income_values}) # replace each category label with its numeric value
print(gdf[var].unique(),'\n') # check

To merge the data from the GSS and the time value conversion factors, we joined the tables on the variable "year."

In [None]:
md = gdf.reset_index().merge(cpi, on='year').set_index('index') # merge the cleaned data with the cpi table on the common variable year; gdf's original row index is kept so results can be assigned back to gdf by index
md['Income_2016'] = (md[var]/md['cf']) # income in 2016 dollars: divide the income by the conversion factor for that year
print(md.loc[md['year'] == 2022, 'cf']) # cf from 2022 to 2016 is 1.149

In [None]:
md['real_income'] = md['Income_2016'] *(1.149) # multiply the income from 2016 by the conversion factor for 2022 to get the value of money in 2022
md['real_income'].value_counts()
print(md.describe(),'\n')
md['real_income'].hist(bins=12)
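The two-step conversion in the cells above can be written as one small function. This is a sketch of the arithmetic only: `to_2022_dollars` is a hypothetical name, and the conversion factor in the example call is made up for illustration rather than taken from the CPI table.

```python
def to_2022_dollars(nominal, cf_year, cf_2022=1.149):
    """Convert a nominal income to 2022 dollars by way of 2016 dollars.

    cf_year is the conversion factor for the survey year relative to 2016,
    so dividing by it gives 2016 dollars; multiplying by cf_2022 then moves
    the amount forward to 2022 dollars.
    """
    income_2016 = nominal / cf_year
    return income_2016 * cf_2022

# Hypothetical example: a 4,500 income value from a year whose factor is 0.25
print(round(to_2022_dollars(4500, 0.25), 2))

# Sanity check: an income reported in 2022 should come back unchanged
print(round(to_2022_dollars(1000, 1.149), 2))
```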

In [None]:
gdf = gdf.rename(columns = {'rincome': 'income'})
gdf['income'] = md['real_income']

It is important to note that the "happy" variable was adjusted in 1972 and 1985 by the GSS to correct for leading questions asked before the personal happiness question that skewed answers. Also, 1986 and 1987 variant data was removed by the GSS.

In [None]:
#happy
var = 'happy'
print(df[var].describe(),'\n') # 72390-67588=4,802 missing values expected
print(df[var].unique(),'\n') # 3 categories not including nan
print(df[var].value_counts(), '\n')
print(df[var].hist(bins=3,grid=False), '\n')
df[var+'_NA'] = df[var].isnull() # Create a missing-value dummy for happiness
print('Total Missings: \n', sum(df[var+'_NA']),'\n') # number of missings is the same as the nan count
# we assume that respondents with a missing value are unhappy

In [None]:
gdf.loc[gdf[var].isnull(), var] = 'not happy' # Changing rows with nans to a new category called "not happy"
print('Total Missings: \n', sum(gdf[var].isnull()),'\n') # checks that nulls were renamed

In [None]:
#satfin
var = 'satfin'
print(df[var].describe(),'\n') # 72390-67722=4,668 missing values expected
print(df[var].unique(),'\n') # 3 categories not including nan
print(df[var].value_counts(), '\n')
print(df[var].hist(bins=3,grid=False), '\n')
df[var+'_NA'] = df[var].isnull() # Create a missing-value dummy for financial satisfaction
print('Total Missings: \n', sum(df[var+'_NA']),'\n') #number of missings is same as nan

In [None]:
gdf = gdf.loc[gdf['satfin'].isnull() == 0,:] # remove nans
gdf = gdf.rename(columns = {'satfin': 'financial satisfaction'})

In [None]:
#satjob
var = 'satjob'
print(df[var].describe(),'\n') # 72390-51887=20,503 missing values expected
print(df[var].unique(),'\n') # 4 categories not including nan
print(df[var].value_counts(), '\n')
print(df[var].hist(bins=4,grid=False), '\n')
df[var+'_NA'] = df[var].isnull() # Create a missing-value dummy for job satisfaction
print('Total Missings: \n', sum(df[var+'_NA']),'\n') #number of expected missings is same as nan

In [None]:
gdf = gdf.loc[gdf[var].isnull() == 0,:] # remove nans
gdf = gdf.rename(columns = {'satjob': 'job satisfaction'})

Once the data was cleaned, we took a look at the dataframe without missing values. Below, one can see the variable names have been changed to more clearly describe the values contained in the variable and there are no longer missing values.

In [None]:
gdf.head()

In [None]:
print(gdf.shape) # 72,390-36,702=35,688 rows removed while cleaning
gdf.describe()
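A quick way to confirm that no missing values remain, and to reproduce the "rows removed" arithmetic, is sketched below on a toy dataframe. The frame `toy` and its values are hypothetical, and `dropna` stands in for the column-by-column filtering used above.

```python
import pandas as pd

# Hypothetical toy frame standing in for the raw data
toy = pd.DataFrame({
    "age": [23.0, 48.0, None, 27.0],
    "satjob": ["very satisfied", None, "moderately satisfied", "very satisfied"],
})
cleaned = toy.dropna(subset=["age", "satjob"])

missing_per_column = cleaned.isna().sum()  # count of nans left in each column
rows_removed = len(toy) - len(cleaned)     # same arithmetic as 72,390 - 36,702 above
print(missing_per_column)
print("rows removed:", rows_removed)
```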

# Results

### Research Question 1 

### Research Question 2 

In [5]:
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

In [6]:
# find value counts for each variable so the categories
# in the box plots can be ordered properly
print(gdf['happy'].value_counts(), '\n')
print(gdf['job satisfaction'].value_counts(), '\n')
print(gdf['financial satisfaction'].value_counts(), '\n')


In [None]:
# first box plot 
plt.figure(figsize=(12, 6))
sns.boxplot(data=gdf, x='age', y='happy', order=["very happy", "pretty happy", "not too happy", "not happy" ])
plt.title('Age vs. Happiness')
plt.xlabel('Age')
plt.ylabel('Happiness')
plt.xticks(rotation=45)
plt.show()

1. Age and Happiness: The box plot shows that respondents who report being 'very happy' are slightly older, but the median ages for those who report being 'very happy', 'pretty happy', and 'not too happy' are all roughly the same (~40 years old). This suggests that younger people are, at most, slightly less happy; beyond that, the plot alone does not explain why the results are so varied.

In [None]:
# second box plot 
plt.figure(figsize=(12, 6))
sns.boxplot(data=gdf, x='age', y='financial satisfaction', order=["pretty well satisfied", "more or less satisfied", 
                                                            "not satisfied at all"])
plt.title('Age vs. Financial Satisfaction')
plt.xlabel('Age')
plt.ylabel('Financial Satisfaction')
plt.xticks(rotation=45)
plt.show()

2. Age and Financial Satisfaction: The box plot indicates that the median age of respondents reporting high financial satisfaction is older (~44). This suggests that financial satisfaction tends to increase with age, possibly due to career progression and the accumulation of wealth. This finding is consistent with life-course theory, which suggests that people become more financially satisfied as they progress through life stages.

In [None]:
# third box plot 
plt.figure(figsize=(12, 6))
sns.boxplot(data=gdf, x='age', y='job satisfaction', order=["very satisfied", "moderately satisfied", 
                                                           "a little dissatisfied", "very dissatisfied"])
plt.title('Age vs. Work Satisfaction')
plt.xlabel('Age')
plt.ylabel('Work Satisfaction')
plt.xticks(rotation=45)
plt.show()

3. Age and Job Satisfaction: The box plot shows that the median age of respondents reporting very high job satisfaction is older (~43). This suggests that job satisfaction tends to increase with age, possibly due to career stability, experience, and finding one's professional niche.

The research strategy for this question used box plots to visually compare the age distribution across satisfaction levels. Box plots are an effective visual representation of these data because they allow a quick comparison of the center and spread of ages for each type of satisfaction.
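The ages read off the box plots can be backed by a numeric summary. The sketch below groups a toy dataframe by happiness level and reports the count, mean, and median age for each group; the values in `toy` are hypothetical, and the same `groupby` call would be applied to the cleaned dataframe in practice.

```python
import pandas as pd

# Hypothetical toy rows standing in for the cleaned dataframe
toy = pd.DataFrame({
    "age": [22, 35, 41, 58, 29, 44, 63, 37],
    "happy": ["not too happy", "pretty happy", "very happy", "very happy",
              "not too happy", "pretty happy", "very happy", "pretty happy"],
})

# Count, mean, and median age within each happiness category
summary = toy.groupby("happy")["age"].agg(["count", "mean", "median"])
print(summary)
```

Reporting the mean and median side by side also makes clear which statistic a stated age refers to, since a box plot's center line is the median.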

# Conclusion

### Research Question 1 

### Research Question 2
This analysis does not examine the reasons behind the observed age-satisfaction relationships: it identifies trends but does not explain why they exist. It also does not account for potential confounding variables that may affect satisfaction levels, such as social and economic factors.
To build on this research, future work should investigate the factors that cause unhappiness, especially among younger individuals. Exploring the role of loneliness, economic instability, and career prospects in happiness and life satisfaction could be a valuable extension of this project. Examining how external factors, such as societal changes or economic conditions, influence these age-satisfaction relationships would also provide a more comprehensive understanding.

# Works Cited

Davern, Michael; Bautista, Rene; Freese, Jeremy; Herd, Pamela; and Morgan, Stephen L.; *General Social Survey
1972-2022*. [Machine-readable data file]. Principal Investigator, Michael Davern; Co-Principal Investigators,
Rene Bautista, Jeremy Freese, Pamela Herd, and Stephen L. Morgan. NORC ed. Chicago, 2023. 1 datafile
(Release 1) and 1 codebook (2022 Release 1).

Smith, T.S. (October, 1988). *GSS Methodological Report No. 56*. University of Chicago. The National Science Foundation, Grant No. SES-8747227. https://gss.norc.org/Documents/reports/methodological-reports/MR056.pdf