## EDA LAB

The General Social Survey (GSS) is a biennial, nationally representative survey of Americans, with almost 7000 different questions asked since the survey began in the 1970s. It has straightforward questions about respondents' demographic information, but also questions like "Does your job regularly require you to perform repetitive or forceful hand movements or involve awkward postures?" or "How often do the demands of your job interfere with your family life?" There are also a variety of controversial questions. No matter what you're curious about, there's something interesting in here to check out. The codebook is 904 pages (use CTRL+F to search it).

The data and codebook are available at:
https://gss.norc.org/us/en/gss/get-the-data.html

The datasets are so large that it might make sense to pick the variables you want, and then download just those variables from:
https://gssdataexplorer.norc.org/variables/vfilter

Here is your task:
1. Download a small (5-15) set of variables of interest. (You can also check out `get_gss.ipynb` for some processed data.)
2. Write a short description of the data you chose, and why. (~500 words)
3. Load the data using Pandas. Clean them up for EDA. Do this in this notebook with comments or markdown chunks explaining your choices.
4. Produce some numeric summaries and visualizations.
5. Describe your findings. (500 - 1000 words, or more)

For example, you might want to look at how aspects of a person's childhood family are correlated or not with their career or family choices as an adult. Or how political or religious affiliations correlate with drug use or sexual practices. It's an extremely wide-ranging survey.
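For step 3, cleaning often starts with counting missing values and giving ordinal answers an explicit order so they sort and plot sensibly. The sketch below shows one way to do that. The rows are made up for illustration (they are not real GSS records), and the seven-level `polviews` ordering is an assumption that should be checked against the codebook before use.

```python
import pandas as pd

# Made-up example rows, NOT real GSS records
toy = pd.DataFrame({
    "educ": [16.0, 10.0, None, 17.0],
    "polviews": ["slightly liberal", None, "extremely liberal",
                 "moderate, middle of the road"],
})

# Count missing values in each column before deciding how to handle them
n_missing = toy.isna().sum()

# ASSUMED ordering of the polviews scale; verify the labels in the codebook
polviews_order = [
    "extremely liberal", "liberal", "slightly liberal",
    "moderate, middle of the road",
    "slightly conservative", "conservative", "extremely conservative",
]

# Convert to an ordered categorical so comparisons and sorting respect the scale
toy["polviews"] = pd.Categorical(toy["polviews"],
                                 categories=polviews_order, ordered=True)

# Integer position on the scale; -1 marks a missing answer
polviews_code = toy["polviews"].cat.codes

print(n_missing)
print(polviews_code.tolist())
```

Any label that is not in `polviews_order` would silently become missing, so it is worth comparing `value_counts()` before and after the conversion.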


In [6]:
import pandas as pd

var_list = ['year','id','wrkstat','prestige','occ','educ','sex','race','rincome','polviews'] # List of variables you want to save
output_filename = 'selected_gss_data.csv' # Name of the file you want to save the data to

for k in range(3): # for each chunk of the data
    url = 'https://github.com/DS3001/project_gss/raw/main/gss_chunk_' + str(1+k) + '.parquet' # Create url to the chunk to be processed
    df = pd.read_parquet(url) # Download this chunk of data
    first_chunk = (k == 0) # First chunk creates the file; later chunks are appended to it
    df.loc[:,var_list].to_csv(output_filename, # specifies target file to save the chunk to
                            mode='w' if first_chunk else 'a', # write for the first chunk, append afterwards
                            header=first_chunk, # write the variable names only once, at the top of the file
                            index=False) # no row index saved

gss = pd.read_csv("selected_gss_data.csv")

# Display 1st 5 rows
print(gss.head())

   year  id            wrkstat  prestige    occ  educ     sex   race rincome  \
0  1972   1  working full time      50.0  205.0  16.0  female  white     NaN   
1  1972   2            retired      45.0  441.0  10.0    male  white     NaN   
2  1972   3  working part time      44.0  270.0  12.0  female  white     NaN   
3  1972   4  working full time      57.0    1.0  17.0  female  white     NaN   
4  1972   5      keeping house      40.0  385.0  12.0  female  white     NaN   

  polviews  
0      NaN  
1      NaN  
2      NaN  
3      NaN  
4      NaN  


We chose 10 variables to feature in our dataset: year, id, wrkstat, prestige, occ, educ, sex, race, rincome, and polviews. We chose year because it lets us track how opinions and demographics have changed over time. We chose id because it identifies respondents, which lets us clean and merge the data while keeping track of individual records; the id values appear to restart in each survey year (the last row of the 2022 data has id 3544), so year and id together, rather than id alone, identify a respondent. We chose wrkstat because it shows a respondent's employment status, which could be related to income, education, prestige, and other variables. We chose prestige because it measures the social standing of jobs, which is useful for connecting occupation, education, and other factors to social class. We chose occ for specific occupation information, including details about industries and job types. We chose educ because years of education is a key socioeconomic variable. We chose sex to analyze differences in our data across genders. We chose race to analyze inequality and see whether other variables in our data are related to race. We chose rincome because it measures the respondent's income directly and is likely related to occupation, education, and more. Finally, we chose polviews to analyze how political views align with demographics and social position.
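Before relying on `id` for cleaning or merging, it may be worth checking whether it identifies a respondent on its own or only in combination with `year`. The sketch below runs that check on a made-up toy frame (not real GSS records) in which ids restart every year; the same two lines could be pointed at the `gss` DataFrame.

```python
import pandas as pd

# Made-up rows, NOT real GSS records: ids restart in each survey year
toy = pd.DataFrame({
    "year": [1972, 1972, 2022, 2022],
    "id":   [1, 2, 1, 2],
})

# Is id unique by itself?
id_alone_unique = toy["id"].is_unique

# Is the (year, id) pair unique?
year_id_unique = not toy.duplicated(subset=["year", "id"]).any()

print(id_alone_unique, year_id_unique)  # prints: False True
```

If the pair is unique but `id` alone is not, merges and de-duplication should use both columns as the key.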

These variables let us explore several interesting relationships. Education vs. income: does more schooling translate to higher income? Work status vs. gender: are men and women equally represented in full-time work? Prestige vs. politics: are higher-prestige jobs associated with a particular political group? Race vs. education: are there gaps in average education by race, and how have they changed over time? Income vs. political views: do high earners tend to hold particular political views?
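Questions like these usually come down to a cross-tabulation or a grouped mean. As a minimal sketch, the code below computes the share of each work status within each sex and the mean years of education by work status. It uses a small made-up frame (not real GSS records) so that it runs on its own; the column names match the variables selected above.

```python
import pandas as pd

# Made-up rows, NOT real GSS records
toy = pd.DataFrame({
    "wrkstat": ["working full time", "working full time", "working part time",
                "keeping house", "retired", "working full time"],
    "sex":     ["male", "female", "female", "female", "male", "male"],
    "educ":    [16.0, 12.0, 12.0, None, 10.0, 14.0],
})

# Keep only rows where all three variables are observed
clean = toy.dropna(subset=["wrkstat", "sex", "educ"])

# Share of each work status within each sex (each column sums to 1)
share = pd.crosstab(clean["wrkstat"], clean["sex"], normalize="columns")

# Mean years of education for each work status
mean_educ = clean.groupby("wrkstat")["educ"].mean()

print(share)
print(mean_educ)
```

Dropping incomplete rows is only one possible choice; on the real data it is worth reporting how many rows are lost, since questions such as `polviews` and `rincome` were not asked in the earliest years.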

In [7]:
gss
## clean data for EDA
## numeric summaries and visualizations
## describe findings


| | year | id | wrkstat | prestige | occ | educ | sex | race | rincome | polviews |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1972 | 1 | working full time | 50.0 | 205.0 | 16.0 | female | white | NaN | NaN |
| 1 | 1972 | 2 | retired | 45.0 | 441.0 | 10.0 | male | white | NaN | NaN |
| 2 | 1972 | 3 | working part time | 44.0 | 270.0 | 12.0 | female | white | NaN | NaN |
| 3 | 1972 | 4 | working full time | 57.0 | 1.0 | 17.0 | female | white | NaN | NaN |
| 4 | 1972 | 5 | keeping house | 40.0 | 385.0 | 12.0 | female | white | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 72385 | 2022 | 3541 | working full time | NaN | NaN | 12.0 | female | white | NaN | extremely liberal |
| 72386 | 2022 | 3542 | working full time | NaN | NaN | 19.0 | female | white | $25,000 or more | moderate, middle of the road |
| 72387 | 2022 | 3543 | working full time | NaN | NaN | 15.0 | male | white | $25,000 or more | slightly liberal |
| 72388 | 2022 | 3544 | working full time | NaN | NaN | 17.0 | female | white | $25,000 or more | slightly liberal |
