## EDA LAB

The General Social Survey (GSS) is a biennial, nationally representative survey of Americans, with almost 7,000 different questions asked since the survey began in the 1970s. It has straightforward questions about respondents' demographic information, but also questions like "Does your job regularly require you to perform repetitive or forceful hand movements or involve awkward postures?" or "How often do the demands of your job interfere with your family life?" There are a variety of controversial questions. No matter what you're curious about, there's something interesting in here to check out. The codebook is 904 pages (use CTRL+F to search it).

The data and codebook are available at:
https://gss.norc.org/us/en/gss/get-the-data.html

The datasets are so large that it might make sense to pick the variables you want, and then download just those variables from:
https://gssdataexplorer.norc.org/variables/vfilter

Here is your task:
1. Download a small (5-15) set of variables of interest.
<br><br>
Variables: <br>
year&emsp; &emsp;GSS year for this respondent<br>
age&emsp; &emsp;age of respondent<br>
educ&emsp; &emsp;highest year of school completed<br>
polviews&emsp; &emsp;think of self as liberal or conservative<br>
gunlaw&emsp; &emsp;favor or oppose gun permits<br>
happy&emsp; &emsp;general happiness<br>
trust&emsp; &emsp;can people be trusted<br>
consci&emsp; &emsp;confidence in scientific community<br>
mntlhlth&emsp; &emsp;days of poor mental health past 30 days<br>
realinc&emsp; &emsp;family income in constant $<br>
<br>

2. Write a short description of the data you chose, and why. (1 page)
<br><br>
For this analysis, I chose 10 variables from the GSS to capture a cross-section of American life, focusing on the intersection of socioeconomics, ideological beliefs, and mental well-being. The variables include chronological and demographic markers (year and age); socioeconomic indicators, educ (education level) and realinc (real family income); ideological and cultural markers, polviews (political ideology) and gunlaw (opinions on gun permits); institutional and interpersonal trust variables, trust (general social trust) and consci (confidence in the scientific community); and personal well-being measures, happy (general happiness) and mntlhlth (number of poor mental health days in the past 30 days).
<br><br>
I selected this set because the interplay between wealth, education, ideology, and happiness is a long-running topic of debate in sociology. I wanted to use EDA to explore some familiar questions: Is higher income associated with greater happiness? Does more education go along with higher income or more confidence in scientific institutions? Does confidence in science vary with political ideology? Lastly, including mntlhlth alongside happy lets me compare a broad, subjective rating of general happiness with a more specific, though still self-reported, count of recent poor mental health days. Together these variables offer a multidimensional view of respondents and a good chance of surfacing socioeconomic patterns worth examining.
<br><br>
3. Load the data using Pandas. Clean them up for EDA. Do this in a notebook with comments or markdown chunks explaining your choices.
<br><br>
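As a rough illustration of the cleaning step, the sketch below turns GSS-style missing codes into NaN and reports the share of missing values per column. The toy rows, the helper name `to_missing`, and the exact label text (such as `.n: No answer`) are assumptions made for the example, not values taken from the real extract.

```python
import numpy as np
import pandas as pd

def to_missing(value):
    """Return NaN for GSS-style missing codes, otherwise the value unchanged."""
    # Assumption: missing codes are strings starting with '.' or '-'
    if isinstance(value, str) and value.strip().startswith((".", "-")):
        return np.nan
    return value

# Tiny made-up frame standing in for a GSS extract (not real survey rows).
toy = pd.DataFrame({
    "happy": ["Very happy", ".n: No answer", "Pretty happy", ".i: Inapplicable"],
    "mntlhlth": ["0", "5", ".i: Inapplicable", "-100"],
})

# Apply the helper element by element, then convert the day counts to numbers.
cleaned = toy.apply(lambda col: col.map(to_missing))
cleaned["mntlhlth"] = pd.to_numeric(cleaned["mntlhlth"], errors="coerce")

# Share of missing values per column, worth checking before any analysis.
missing_share = cleaned.isna().mean()
print(missing_share)
```

Checking the missing share first matters because a column that is mostly NaN (for example, a question asked only in some years) will drive every later summary from a much smaller sample.

<br><br>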
4. Produce some numeric summaries and visualizations. (1-3 pages)
<br><br>
For the GSS Lab, I created four visualizations to explore the data: a correlation matrix heatmap displaying the relationships between all numeric and ordinal variables (such as age, income, education, and ideology), a boxplot comparing real family income distributions across different levels of general happiness, a horizontal bar chart showing average trust in science grouped by political views, and a bar plot illustrating the average number of poor mental health days based on respondents' reported happiness.
<br><br>
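One caveat when reading a correlation heatmap: pandas `DataFrame.corr()` drops missing values pair by pair, so each cell can be based on a different number of respondents. The sketch below, using made-up numbers rather than GSS rows, shows one way to count how many complete pairs sit behind each coefficient.

```python
import numpy as np
import pandas as pd

# Made-up scores; the NaNs mimic questions that were skipped or not asked.
toy = pd.DataFrame({
    "happy_score": [1, 2, 3, 2, np.nan, 3],
    "realinc": [10000, 25000, 60000, np.nan, 40000, 80000],
    "mntlhlth": [10, np.nan, np.nan, 2, np.nan, 0],
})

# Pearson correlations computed on pairwise-complete observations.
corr = toy.corr()

# Count the rows where both variables in a pair are present.
present = toy.notna().astype(int)
pairwise_n = present.T.dot(present)
print(pairwise_n)
```

Reporting the pairwise counts next to the heatmap makes it easier to judge which coefficients rest on a large sample and which do not.

<br><br>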
5. Describe your findings in 1-2 pages.
<br><br>
My exploratory data analysis of the 10 selected GSS variables surfaced several patterns in how income, education, and political ideology relate to Americans' well-being and confidence in institutions. All correlations reported below are descriptive Pearson coefficients taken from the correlation matrix.
<br><br>
Money and Happiness: The numeric summaries and the correlation heatmap show a weak positive correlation (r = 0.17) between real family income (realinc) and self-reported happiness (happy_score), in line with a classic sociological finding. In the boxplot, respondents who identify as "Very happy" have a higher median income and a higher upper range of incomes than those who identify as "Not too happy." The pattern is consistent with financial security offering some protection against unhappiness, though a correlation of this size leaves most of the variation in happiness unexplained and does not establish causation.
<br><br>
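The boxplot comparison can be backed with a small table of group sizes and medians. The following is a minimal sketch on invented incomes (not GSS values), assuming the same three `happy` labels used in the notebook.

```python
import pandas as pd

# Display order for the happiness categories, lowest to highest.
order = ["Not too happy", "Pretty happy", "Very happy"]

# Invented incomes for illustration only.
toy = pd.DataFrame({
    "happy": ["Not too happy", "Not too happy",
              "Pretty happy", "Pretty happy", "Pretty happy",
              "Very happy", "Very happy"],
    "realinc": [12000, 18000, 25000, 30000, 41000, 35000, 55000],
})

# Group size and median income per happiness level, in a fixed order.
summary = toy.groupby("happy")["realinc"].agg(["count", "median"]).reindex(order)
print(summary)
```

Showing the count alongside the median helps the reader see whether a gap between groups comes from many respondents or only a handful.

<br><br>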
Education, Wealth, and Science: The correlation matrix also shows a moderate positive relationship between years of education (educ_years) and real income (r = 0.37), the strongest of the correlations reported here and consistent with an economic return to formal education in the U.S. Education also correlates positively, though weakly, with confidence in the scientific community (consci_score, r = 0.19). In this dataset, more education is associated with both higher income and somewhat greater confidence in science.
<br><br>
The Politics of Science: One of the most interesting findings came from comparing political ideology (polviews_score) with confidence in the scientific community. The bar chart shows an ideological gradient: respondents who identified as "Extremely liberal" had the highest average confidence in science, and the average declined as ideology moved rightward through moderate and conservative viewpoints, reaching its lowest point among respondents identifying as "Extremely conservative." The overall correlation is weak (r = -0.10), so ideology explains little of the individual-level variation, but the direction of the group averages is consistent. Because polviews measures ideology rather than party affiliation, this result speaks to an ideological divide, not a partisan one.
<br><br>
Mental Health and General Well-being: Finally, the analysis compared the broad concept of "general happiness" with the more specific measure of "poor mental health days in the last 30 days" (mntlhlth). The correlation matrix shows a moderate negative correlation (r = -0.28) between the two variables. In the bar plot, respondents classifying themselves as "Not too happy" averaged more poor mental health days (often exceeding 6-7 days a month) than "Pretty happy" or "Very happy" respondents (who averaged 2 or fewer days). Age was also weakly negatively correlated with poor mental health days (r = -0.10): older respondents report somewhat fewer poor mental health days than younger ones, which runs against a common stereotype about aging.
<br><br>
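Because scores such as happy_score are ordinal, a rank-based (Spearman) correlation is a reasonable cross-check on the Pearson values above. The sketch below computes it as a Pearson correlation of ranks on pairwise-complete rows; the helper name `spearman_pairwise` and the toy numbers are assumptions for illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd

def spearman_pairwise(x, y):
    """Spearman correlation of two Series, using only rows where both are present."""
    pair = pd.concat([x, y], axis=1).dropna()
    ranks = pair.rank()  # ties receive their average rank
    rho = ranks.iloc[:, 0].corr(ranks.iloc[:, 1])  # Pearson on ranks
    return rho, len(pair)

# Invented values for illustration only.
toy = pd.DataFrame({
    "happy_score": [1, 1, 2, 2, 3, 3, np.nan],
    "mntlhlth": [12, 8, 5, np.nan, 1, 0, 4],
})

rho, n = spearman_pairwise(toy["happy_score"], toy["mntlhlth"])
print(f"Spearman rho = {rho:.2f} on n = {n} complete pairs")
```

If the rank-based value is close to the Pearson value, the conclusion does not hinge on treating the 1-3 scores as evenly spaced numbers.

<br><br>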
6. If you have other content that you think absolutely must be included, you can include it in an appendix of any length.

For example, you might want to look at how aspects of a person's childhood family are correlated or not with their career or family choices as an adult. Or how political or religious affiliations correlate with drug use or sexual practices. It's an extremely wide-ranging survey.

Feel free to work with other people in groups, and ask questions!

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# 1 & 2. Load Data and Select Variables
filepath = "/Users/hanimoudarres/Downloads/Foundations of ML/EDA/lab/Data/GSS.csv"
df_raw = pd.read_csv(filepath)

columns_of_interest = [
    'year', 'age', 'educ', 'polviews', 'realinc', 
    'happy', 'mntlhlth', 'trust', 'consci', 'gunlaw'
]
df = df_raw[columns_of_interest].copy()

# 3. Clean Data for EDA

# Safely force these to numeric. Any text (like '.i: Inapplicable') becomes NaN automatically.
df['mntlhlth'] = pd.to_numeric(df['mntlhlth'], errors='coerce')
df['realinc'] = pd.to_numeric(df['realinc'], errors='coerce')

# Remove negative error codes (like -100 or -97); where() keeps valid values and sets the rest to NaN
df['mntlhlth'] = df['mntlhlth'].where(df['mntlhlth'] >= 0)
df['realinc'] = df['realinc'].where(df['realinc'] >= 0)

# Clean the remaining text/categorical columns, checking for both object AND modern string types
for col in df.columns:
    if df[col].dtype == object or pd.api.types.is_string_dtype(df[col]):
        df[col] = df[col].apply(lambda x: np.nan if isinstance(x, str) and (x.startswith('.') or x.startswith('-')) else x)

# Convert age to numeric; the top-coded label '89 or older' is recorded as 89
df['age'] = df['age'].replace('89 or older', 89)
df['age'] = pd.to_numeric(df['age'], errors='coerce')

# Map ordinal text variables to numerical scores for correlation
happy_map = {'Not too happy': 1, 'Pretty happy': 2, 'Very happy': 3}
df['happy_score'] = df['happy'].map(happy_map)

pol_map = {
    'Extremely liberal': 1, 'Liberal': 2, 'Slightly liberal': 3, 
    'Moderate, middle of the road': 4, 'Slightly conservative': 5, 
    'Conservative': 6, 'Extremely conservative': 7
}
df['polviews_score'] = df['polviews'].map(pol_map)

consci_map = {'HARDLY ANY': 1, 'ONLY SOME': 2, 'A GREAT DEAL': 3}
df['consci_score'] = df['consci'].map(consci_map)

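# Extract years of education from the string format
import re

def extract_educ(e):
    """Convert an educ label such as '12th grade' or '2 years of college' to years."""
    if pd.isna(e):
        return np.nan
    e = str(e).lower()
    # Assumption: zero years of schooling is labeled 'No formal schooling'
    if 'no formal' in e:
        return 0
    m = re.search(r'\d+', e)
    if not m:
        return np.nan
    num = int(m.group())
    # College labels count years beyond 12th grade
    if 'college' in e:
        return 12 + num
    return num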

df['educ_years'] = df['educ'].apply(extract_educ)


# 4. Numeric Summaries and Visualizations

print("Numeric Summaries")
numeric_cols = ['age', 'realinc', 'mntlhlth', 'happy_score', 'polviews_score', 'educ_years', 'consci_score']
print(df[numeric_cols].describe())

# Visualization 1: Correlation Matrix
plt.figure(figsize=(8, 6))
sns.heatmap(df[numeric_cols].corr(), annot=True, cmap='coolwarm', fmt=".2f", vmin=-1, vmax=1)
plt.title('Correlation Matrix of GSS Variables')
plt.tight_layout()
plt.show()

# Visualization 2: Boxplot of Income vs Happiness
plt.figure(figsize=(8, 5))
sns.boxplot(data=df, x='happy', y='realinc', order=['Not too happy', 'Pretty happy', 'Very happy'])
plt.title('Family Income Distribution by Happiness Level')
plt.xlabel('Happiness Level')
plt.ylabel('Real Family Income (Constant $)')
plt.show()

# Visualization 3: Trust in Science by Political Views
# Order the bars along the ideology scale (not by mean) so any gradient is visible
pol_consci = df.groupby('polviews')['consci_score'].mean().reindex(list(pol_map))
plt.figure(figsize=(10, 6))
pol_consci.plot(kind='barh', color='teal')
plt.title('Average Trust in Science by Political Ideology')
plt.xlabel('Trust in Science Score (1 = Hardly Any, 3 = A Great Deal)')
plt.ylabel('Political Views')
plt.tight_layout()
plt.show()

# Visualization 4: Mental Health by General Happiness
plt.figure(figsize=(8, 5))
sns.barplot(data=df, x='happy', y='mntlhlth', order=['Not too happy', 'Pretty happy', 'Very happy'])
plt.title('Poor Mental Health Days by Happiness Status')
plt.xlabel('Happiness Level')
plt.ylabel('Average Poor Mental Health Days (Past 30 Days)')
plt.show()