## EDA LAB

The General Social Survey (GSS) is a biennial, nationally representative survey of Americans, with almost 7000 different questions asked since the survey began in the 1970s. It has straightforward questions about respondents' demographic information, but also questions like "Does your job regularly require you to perform repetitive or forceful hand movements or involve awkward postures?" or "How often do the demands of your job interfere with your family life?" There are a variety of controversial questions. No matter what you're curious about, there's something interesting in here to check out. The codebook is 904 pages (use CTRL+F to search it).

The data and codebook are available at:
https://gss.norc.org/us/en/gss/get-the-data.html

The datasets are so large that it might make sense to pick the variables you want, and then download just those variables from:
https://gssdataexplorer.norc.org/variables/vfilter

Here is your task:
1. Download a small (5-15) set of variables of interest.
2. Write a short description of the data you chose, and why. (1 page)
3. Load the data using Pandas. Clean them up for EDA. Do this in a notebook with comments or markdown chunks explaining your choices.
4. Produce some numeric summaries and visualizations. (1-3 pages)
5. Describe your findings in 1-2 pages.
6. If you have other content that you think absolutely must be included, you can include it in an appendix of any length.

For example, you might want to look at how aspects of a person's childhood family are correlated or not with their career or family choices as an adult. Or how political or religious affiliations correlate with drug use or sexual practices. It's an extremely wide-ranging survey.

Feel free to work with other people in groups, and ask questions!

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
import os
current_directory = os.getcwd()
print(current_directory)


C:\Users\karan\OneDrive\Desktop\College_Stuff\ds_ml\EDA\lab


In [8]:
# 1: Load the downloaded data
df = pd.read_excel('./GSS.xlsx') # File is located in the same folder as this notebook


# 2: Explain why I chose the data

year: I chose this because I wanted to see how mindsets and beliefs have changed over the past 50 years. <br>
id_: This gives each response a unique identifier. <br>
wrkstat: I chose this variable in case I wanted to see how working status has changed over time. <br>
marital: I was interested in seeing marital patterns across different groups. <br>
divorce: I wanted to see divorce patterns across different groups, specifically by race. <br>
age: I wanted to use age mostly as a filter, as some variables (like marital status) matter more at certain ages. <br>
educ: I chose this variable because I was interested in how level of education relates to different viewpoints and other variables. <br>
sex: I chose sex because I wanted to see how the work status of different genders has changed, as well as other viewpoints such as party. <br>
race: I chose race because I wanted to see how each variable changes across races. It will help me try to uncover societal restrictions. <br>
hompop: I chose hompop because I want to see if different groups of people have more people in their house. I know certain cultures believe in multigenerational households. <br>
partyid: This variable is one of the most interesting to me, as I want to see how a person's political party relates to their views, and which groups are more likely to belong to a certain party. <br>
spkrac: I chose this variable because I was interested to see which groups are OK with racists speaking in their neighborhood. There is no variable for holding racist tendencies, but this variable would help me see which groups are OK with racists. I also wanted to see how this changes over time. <br>
spkhomo: I chose the variable spkhomo for similar reasons as spkrac. I wanted to see which groups are OK with a "homosexual" speaking in their neighborhood. It is a good indication of LGBTQ+ acceptance. <br>
relig: I chose relig because it would be interesting to see how different religions have different viewpoints and behavior. I would like to graph religion against most of the variables in the chart. <br>
ballot: This variable came with the data set and will be dropped.

In [82]:
#3: Clean data
df.head()

Unnamed: 0,year,id_,wrkstat,marital,divorce,age,educ,sex,race,hompop,partyid,spkrac,spkhomo,relig,ballot
0,1972,1,Working full time,Never married,.i: Inapplicable,23,4 years of college,FEMALE,White,1,"Independent, close to democrat",.i: Inapplicable,.i: Inapplicable,Jewish,.i: Inapplicable
1,1972,2,Retired,Married,NO,70,10th grade,MALE,White,2,Not very strong democrat,.i: Inapplicable,.i: Inapplicable,Catholic,.i: Inapplicable
2,1972,3,Working part time,Married,NO,48,12th grade,FEMALE,White,4,"Independent (neither, no response)",.i: Inapplicable,.i: Inapplicable,Protestant,.i: Inapplicable
3,1972,4,Working full time,Married,NO,27,5 years of college,FEMALE,White,2,Not very strong democrat,.i: Inapplicable,.i: Inapplicable,Other,.i: Inapplicable
4,1972,5,Keeping house,Married,NO,61,12th grade,FEMALE,White,2,Strong democrat,.i: Inapplicable,.i: Inapplicable,Protestant,.i: Inapplicable


In [87]:
df_updated = df.drop('ballot', axis=1).copy() 
# dropping ballot because it is not useful to any of the data I want to visualize
df_updated.head()

Unnamed: 0,year,id_,wrkstat,marital,divorce,age,educ,sex,race,hompop,partyid,spkrac,spkhomo,relig
0,1972,1,Working full time,Never married,.i: Inapplicable,23,4 years of college,FEMALE,White,1,"Independent, close to democrat",.i: Inapplicable,.i: Inapplicable,Jewish
1,1972,2,Retired,Married,NO,70,10th grade,MALE,White,2,Not very strong democrat,.i: Inapplicable,.i: Inapplicable,Catholic
2,1972,3,Working part time,Married,NO,48,12th grade,FEMALE,White,4,"Independent (neither, no response)",.i: Inapplicable,.i: Inapplicable,Protestant
3,1972,4,Working full time,Married,NO,27,5 years of college,FEMALE,White,2,Not very strong democrat,.i: Inapplicable,.i: Inapplicable,Other
4,1972,5,Keeping house,Married,NO,61,12th grade,FEMALE,White,2,Strong democrat,.i: Inapplicable,.i: Inapplicable,Protestant


In [60]:
# year

# the year variable does not need to be cleaned

column_name = 'year'

unique_values = df_updated[column_name].unique()
print(unique_values) 







value_counts = df_updated[column_name].value_counts() # check how many responses each survey year has

print(value_counts)






[1972 1973 1974 1975 1976 1977 1978 1980 1982 1983 1984 1985 1986 1987
 1988 1989 1990 1991 1993 1994 1996 1998 2000 2002 2004 2006 2008 2010
 2012 2014 2016 2018 2021 2022]
2006    4510
2021    4032
2022    3544
1994    2992
1996    2904
2016    2867
1998    2832
2000    2817
2004    2812
2002    2765
2014    2538
2018    2348
2010    2044
2008    2023
2012    1974
1982    1860
1987    1819
1972    1613
1993    1606
1983    1599
1989    1537
1985    1534
1978    1532
1977    1530
1991    1517
1973    1504
1976    1499
1975    1490
1974    1484
1988    1481
1984    1473
1986    1470
1980    1468
1990    1372
Name: year, dtype: int64
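The counts above are ordered by frequency, which makes gaps in the survey years hard to spot. As a minimal sketch (the `years` series here is a toy stand-in for `df_updated['year']`, not the real data), `sort_index()` puts the same counts in chronological order:

```python
import pandas as pd

# Toy stand-in for df_updated['year']; the values are illustrative only.
years = pd.Series([1972, 2006, 1973, 2006, 1972, 2006], name="year")

# value_counts() sorts by frequency; sort_index() reorders the result by year.
counts_by_year = years.value_counts().sort_index()
print(counts_by_year)
```

On the real column, the chronological view should make it easier to see which years were skipped.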


In [55]:
# wrkstat

# Recoded 'Other', '.n:  No answer', '.s:  Skipped on Web', and '.d:  Do not Know/Cannot Choose' as np.nan to make the column clearer.
# The data does not need three different values for an unknown answer. Also, 'Other' and 'No answer' would both tell the audience nothing.


column_name = 'wrkstat'

unique_values = df_updated[column_name].unique()
print(unique_values) 

remove = ['Other', '.n:  No answer', '.s:  Skipped on Web', '.d:  Do not Know/Cannot Choose']



for val in remove:
    df_updated.loc[df_updated[column_name] == val, column_name] = np.nan

unique_values = df_updated[column_name].unique()
    
print(unique_values) 


['Working full time' 'Retired' 'Working part time' 'Keeping house'
 'In school' 'Unemployed, laid off, looking for work'
 'With a job, but not at work because of temporary illness, vacation, strike'
 'Other' '.n:  No answer' '.s:  Skipped on Web'
 '.d:  Do not Know/Cannot Choose']
['Working full time' 'Retired' 'Working part time' 'Keeping house'
 'In school' 'Unemployed, laid off, looking for work'
 'With a job, but not at work because of temporary illness, vacation, strike'
 nan]
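The recode-to-NaN loop in this cell is repeated for most of the columns below. One possible way to avoid the repetition is a small helper; this is only a sketch, and `codes_to_nan` is a hypothetical name that is not used elsewhere in this notebook. It relies on `Series.isin` and `Series.mask`, and the toy series stands in for a real column:

```python
import pandas as pd

def codes_to_nan(series, codes):
    """Return a copy of `series` with every value listed in `codes` set to NaN."""
    # mask() replaces entries with NaN wherever the condition is True.
    return series.mask(series.isin(codes))

# Toy stand-in for df_updated['wrkstat']; the values are illustrative only.
toy = pd.Series(['Retired', '.n:  No answer', 'Other', 'In school'])
cleaned = codes_to_nan(toy, ['Other', '.n:  No answer'])
print(cleaned.tolist())
```

With a helper like this, each cleaning cell could shrink to one assignment, for example `df_updated['wrkstat'] = codes_to_nan(df_updated['wrkstat'], remove)`.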


In [56]:
# marital

# Recoded '.n:  No answer', '.s:  Skipped on Web', and '.d:  Do not Know/Cannot Choose' as np.nan to make the column clearer.
# The data does not need three different values for an unknown answer.


column_name = 'marital'

unique_values = df_updated[column_name].unique()
print(unique_values) 

remove = ['.n:  No answer', '.s:  Skipped on Web', '.d:  Do not Know/Cannot Choose']



for val in remove:
    df_updated.loc[df_updated[column_name] == val, column_name] = np.nan

unique_values = df_updated[column_name].unique()
    
print(unique_values) 


['Never married' 'Married' 'Divorced' 'Widowed' 'Separated'
 '.n:  No answer' '.s:  Skipped on Web' '.d:  Do not Know/Cannot Choose']
['Never married' 'Married' 'Divorced' 'Widowed' 'Separated' nan]


In [57]:
# divorce

# Recoded '.i:  Inapplicable', '.n:  No answer', '.d:  Do not Know/Cannot Choose', and '.s:  Skipped on Web' as np.nan to make the column clearer.
# The data does not need four different values for an unknown answer.


column_name = 'divorce'

unique_values = df_updated[column_name].unique()
print(unique_values) 

remove = ['.i:  Inapplicable', '.n:  No answer',
 '.d:  Do not Know/Cannot Choose', '.s:  Skipped on Web']




for val in remove:
    df_updated.loc[df_updated[column_name] == val, column_name] = np.nan

unique_values = df_updated[column_name].unique()
    
print(unique_values) 


['.i:  Inapplicable' 'NO' 'YES' '.n:  No answer'
 '.d:  Do not Know/Cannot Choose' '.s:  Skipped on Web']
[nan 'NO' 'YES']


In [53]:
# age

# Replaced non-numeric values with np.nan and converted the column to numeric. Note that 89 represents '89 or older'.

column_name = 'age'

unique_values = df_updated[column_name].unique()
print(unique_values) 






df_updated.loc[df_updated[column_name] == '89 or older', column_name] = '89'

df_updated[column_name] = pd.to_numeric(df_updated[column_name], errors='coerce')

value_counts = df_updated[column_name].value_counts() # had to check that there were enough 89+ to not just remove it

print(value_counts)



unique_values = df_updated[column_name].unique()
print(unique_values) 



[23. 70. 48. 27. 61. 26. 28. 21. 30. 56. 54. 49. 41. 24. 62. 46. 57. 58.
 71. 53. 42. 20. 25. 78. 35. 51. 76. 39. 64. 50. 40. 43. 37. 22. 31. 52.
 47. 45. 68. 63. 19. 55. 44. 34. 36. 74. 69. 29. 67. 75. 38. 73. 84. 82.
 72. 59. 33. 81. 65. 32. nan 60. 80. 66. 77. 18. 79. 83. 85. 88. 87. 89.
 86.]
30.0    1571
32.0    1566
34.0    1552
28.0    1548
33.0    1526
        ... 
18.0     267
85.0     221
86.0     211
87.0     158
88.0     130
Name: age, Length: 72, dtype: int64
[23. 70. 48. 27. 61. 26. 28. 21. 30. 56. 54. 49. 41. 24. 62. 46. 57. 58.
 71. 53. 42. 20. 25. 78. 35. 51. 76. 39. 64. 50. 40. 43. 37. 22. 31. 52.
 47. 45. 68. 63. 19. 55. 44. 34. 36. 74. 69. 29. 67. 75. 38. 73. 84. 82.
 72. 59. 33. 81. 65. 32. nan 60. 80. 66. 77. 18. 79. 83. 85. 88. 87. 89.
 86.]
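Because 89 stands for "89 or older", the converted column no longer shows which ages were top-coded. A minimal sketch of one way to keep that information, assuming the flag is computed on the raw strings before conversion (`raw_age`, `age_topcoded`, and `age_num` are hypothetical names, and the values are toy data):

```python
import pandas as pd

# Toy stand-in for the raw age column as read from the spreadsheet.
raw_age = pd.Series(['23', '89 or older', '.n:  No answer', '70'])

# Record which rows were top-coded before the label is overwritten.
age_topcoded = raw_age.eq('89 or older')

# Same conversion as in the cell above: relabel, then coerce non-numbers to NaN.
age_num = pd.to_numeric(raw_age.replace('89 or older', '89'), errors='coerce')

print(pd.DataFrame({'age': age_num, 'age_topcoded': age_topcoded}))
```

The flag could then be used to exclude top-coded respondents from summaries such as a mean age, where treating 89+ as exactly 89 would bias the result downward.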


In [70]:
# educ

# Recoded '.n:  No answer' and '.d:  Do not Know/Cannot Choose' as np.nan to make the column clearer.
# Grouped the incomplete K-12 grades into two categories ('Some Elementary' and 'Some Secondary').
# I would like to combine the college values as well, but years of college do not map cleanly onto degrees (for example, some people graduate in three years), so I left them as they are.

column_name = 'educ'

unique_values = df_updated[column_name].unique()
print(unique_values) 

remove = ['.n:  No answer', '.d:  Do not Know/Cannot Choose']

some_elementary = ['1st grade', '2nd grade', '3rd grade', '4th grade', '5th grade']
# '5th grade' is included so that no single grade below 12th is left ungrouped

some_secondary = ['6th grade', '7th grade', '8th grade', '9th grade', '10th grade', '11th grade']

for val in some_elementary:
    df_updated.loc[df_updated[column_name] == val, column_name] = 'Some Elementary'

for val in some_secondary:
    df_updated.loc[df_updated[column_name] == val, column_name] = 'Some Secondary'


for val in remove:
    df_updated.loc[df_updated[column_name] == val, column_name] = np.nan

unique_values = df_updated[column_name].unique()
    
print(unique_values) 


['4 years of college' '10th grade' '12th grade' '5 years of college'
 '2 years of college' '1 year of college' '6th grade' '9th grade'
 '8th grade' '11th grade' '7th grade' '3 years of college'
 '8 or more years of college' '6 years of college' '3rd grade' '2nd grade'
 '4th grade' '5th grade' '7 years of college' '1st grade' nan
 'No formal schooling']
['4 years of college' 'Some Secondary' '12th grade' '5 years of college'
 '2 years of college' '1 year of college' '3 years of college'
 '8 or more years of college' '6 years of college' 'Some Elementary'
 '7 years of college' nan 'No formal schooling']
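The grouping of grade labels can also be written as a single dictionary lookup instead of two loops. This is a sketch under one assumption that should be checked against the intended categories: that '5th grade' belongs with the elementary group. The `recode` dictionary and the toy series are illustrative names and values only:

```python
import pandas as pd

elementary = ['1st grade', '2nd grade', '3rd grade', '4th grade', '5th grade']
secondary = ['6th grade', '7th grade', '8th grade',
             '9th grade', '10th grade', '11th grade']

# Build one label -> group mapping covering both lists.
recode = {label: 'Some Elementary' for label in elementary}
recode.update({label: 'Some Secondary' for label in secondary})

# Toy stand-in for df_updated['educ'].
toy = pd.Series(['5th grade', '10th grade', '12th grade', '4 years of college'])

# Labels missing from the dictionary (such as '12th grade') are left unchanged.
grouped = toy.replace(recode)
print(grouped.tolist())
```

Keeping the mapping in one dictionary makes it easy to check that every grade label is assigned to exactly one group.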


In [64]:
# sex

column_name = 'sex'

unique_values = df_updated[column_name].unique()
print(unique_values) 

remove = ['.n:  No answer', '.s:  Skipped on Web'] # set to np.nan because the question was not answered

change = ['.i:  Inapplicable', '.d:  Do not Know/Cannot Choose'] 
# recoded as 'Other' to allow for respondents who do not identify as male or female


for val in remove:
    df_updated.loc[df_updated[column_name] == val, column_name] = np.nan
    
for val in change:
    df_updated.loc[df_updated[column_name] == val, column_name] = 'Other'


unique_values = df_updated[column_name].unique()
    
print(unique_values) 


['FEMALE' 'MALE' '.n:  No answer' '.i:  Inapplicable'
 '.s:  Skipped on Web' '.d:  Do not Know/Cannot Choose']
['FEMALE' 'MALE' nan 'Other']


In [67]:
# race

# Recoded '.i:  Inapplicable' as np.nan: everyone has a race, so these respondents must not have answered

column_name = 'race'

unique_values = df_updated[column_name].unique()
print(unique_values) 

remove = ['.i:  Inapplicable']



for val in remove:
    df_updated.loc[df_updated[column_name] == val, column_name] = np.nan

unique_values = df_updated[column_name].unique()
    
print(unique_values) 


['White' 'Black' 'Other' '.i:  Inapplicable']
['White' 'Black' 'Other' nan]


In [69]:
# hompop

# Converted to numeric and set '-100' and '.n:  No answer' to np.nan because both mark missing values

column_name = 'hompop'

unique_values = df_updated[column_name].unique()
print(unique_values) 

remove = ['.n:  No answer', '-100']



for val in remove:
    df_updated.loc[df_updated[column_name] == val, column_name] = np.nan

df_updated[column_name] = pd.to_numeric(df_updated[column_name], errors='coerce')

    
unique_values = df_updated[column_name].unique()
    
print(unique_values) 


['1' '2' '4' '3' '7' '5' '6' '11' '9' '8' '10' '14' '15' '12'
 '.n:  No answer' '13' '16' '0' '-100']
[ 1.  2.  4.  3.  7.  5.  6. 11.  9.  8. 10. 14. 15. 12. nan 13. 16.  0.]


In [74]:
# partyid

# Recoded '.n:  No answer' and '.d:  Do not Know/Cannot Choose' as np.nan because they are both non-answers

column_name = 'partyid'

unique_values = df_updated[column_name].unique()
print(unique_values) 

remove = ['.n:  No answer','.d:  Do not Know/Cannot Choose']



for val in remove:
    df_updated.loc[df_updated[column_name] == val, column_name] = np.nan
          

df_updated.loc[df_updated[column_name] == 'Independent (neither, no response)', column_name] = 'Independent, neutral'
# Renamed this value so it is more concise and matches the wording of the other categories



unique_values = df_updated[column_name].unique()
    
print(unique_values) 


['Independent, close to democrat' 'Not very strong democrat'
 'Independent, neutral' 'Strong democrat' 'Not very strong republican'
 'Independent, close to republican' 'Strong republican' 'Other party'
 '.n:  No answer' '.d:  Do not Know/Cannot Choose']
['Independent, close to democrat' 'Not very strong democrat'
 'Independent, neutral' 'Strong democrat' 'Not very strong republican'
 'Independent, close to republican' 'Strong republican' 'Other party' nan]
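The partyid labels have a natural left-to-right order that plain strings lose when they are sorted or plotted. A minimal sketch of storing that order with an ordered categorical dtype follows; the position of 'Other party' at the end is an assumption made for illustration, and `party_order` and the toy series are hypothetical:

```python
import pandas as pd

# Assumed ordering, from strong Democrat to strong Republican.
# Placing 'Other party' last is a judgment call, not something the data dictates.
party_order = [
    'Strong democrat', 'Not very strong democrat',
    'Independent, close to democrat', 'Independent, neutral',
    'Independent, close to republican', 'Not very strong republican',
    'Strong republican', 'Other party',
]

# Toy stand-in for the cleaned df_updated['partyid'] column.
toy = pd.Series(['Strong republican', 'Independent, neutral', 'Strong democrat', None])

party = toy.astype(pd.CategoricalDtype(categories=party_order, ordered=True))

# Sorting now follows party_order rather than alphabetical order.
print(party.sort_values().tolist())
```

With the dtype set this way, grouped summaries and bar charts should list the parties in the declared order.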


In [84]:
# spkrac

# Recoded '.n:  No answer', '.s:  Skipped on Web', and '.i:  Inapplicable' as np.nan because they are all non-answers.
# Kept '.d:  Do not Know/Cannot Choose' because "not sure" is a valid response to an opinion question
# and can still show how a population's mindset changes. Only removed the '.d:  ' prefix.
column_name = 'spkrac'

unique_values = df_updated[column_name].unique()
print(unique_values) 

remove = ['.n:  No answer', '.s:  Skipped on Web', '.i:  Inapplicable']



for val in remove:
    df_updated.loc[df_updated[column_name] == val, column_name] = np.nan
          


df_updated.loc[df_updated[column_name] == '.d:  Do not Know/Cannot Choose', column_name] = 'Do not Know/Cannot Choose'


unique_values = df_updated[column_name].unique()
    
print(unique_values) 


['.i:  Inapplicable' 'ALLOWED' '.d:  Do not Know/Cannot Choose'
 'NOT ALLOWED' '.n:  No answer' '.s:  Skipped on Web']
[nan 'ALLOWED' 'Do not Know/Cannot Choose' 'NOT ALLOWED']


In [88]:
# spkhomo

# Recoded '.n:  No answer', '.s:  Skipped on Web', '.i:  Inapplicable', and '.y:  Not available in this year' as np.nan because they are all non-answers.
# Kept '.d:  Do not Know/Cannot Choose' because "not sure" is a valid response to an opinion question
# and can still show how a population's mindset changes. Only removed the '.d:  ' prefix.
column_name = 'spkhomo'

unique_values = df_updated[column_name].unique()
print(unique_values) 

remove = ['.n:  No answer', '.s:  Skipped on Web', '.i:  Inapplicable', '.y:  Not available in this year']



for val in remove:
    df_updated.loc[df_updated[column_name] == val, column_name] = np.nan
          


df_updated.loc[df_updated[column_name] == '.d:  Do not Know/Cannot Choose', column_name] = 'Do not Know/Cannot Choose'


unique_values = df_updated[column_name].unique()
    
print(unique_values) 


['.i:  Inapplicable' 'NOT ALLOWED' 'ALLOWED'
 '.d:  Do not Know/Cannot Choose' '.n:  No answer' '.s:  Skipped on Web'
 '.y:  Not available in this year']
[nan 'NOT ALLOWED' 'ALLOWED' 'Do not Know/Cannot Choose']


In [89]:
# relig

# Recoded '.n:  No answer', '.s:  Skipped on Web', '.i:  Inapplicable', and '.y:  Not available in this year' as np.nan because they are all non-answers.
# Kept '.d:  Do not Know/Cannot Choose' because being unsure is itself an informative answer
# and can still show how a population's mindset changes. Only removed the '.d:  ' prefix.
column_name = 'relig'

unique_values = df_updated[column_name].unique()
print(unique_values) 

remove = ['.n:  No answer', '.s:  Skipped on Web', '.i:  Inapplicable', '.y:  Not available in this year']



for val in remove:
    df_updated.loc[df_updated[column_name] == val, column_name] = np.nan
          


df_updated.loc[df_updated[column_name] == '.d:  Do not Know/Cannot Choose', column_name] = 'Do not Know/Cannot Choose'


unique_values = df_updated[column_name].unique()
    
print(unique_values) 

['Jewish' 'Catholic' 'Protestant' 'Other' 'None' '.n:  No answer'
 '.d:  Do not Know/Cannot Choose' 'Inter-nondenominational' 'Christian'
 'Muslim/islam' 'Buddhism' 'Orthodox-christian' 'Native american'
 'Hinduism' 'Other eastern religions' '.s:  Skipped on Web']
['Jewish' 'Catholic' 'Protestant' 'Other' 'None' nan
 'Do not Know/Cannot Choose' 'Inter-nondenominational' 'Christian'
 'Muslim/islam' 'Buddhism' 'Orthodox-christian' 'Native american'
 'Hinduism' 'Other eastern religions']


# 4. Numeric summaries and visualizations.
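One possible starting point for this step is a table of response shares by year, for example for spkhomo. The sketch below runs on a tiny made-up frame, so its numbers say nothing about the real survey; `toy`, `valid`, and `share` are illustrative names only:

```python
import pandas as pd

# Made-up rows standing in for df_updated[['year', 'spkhomo']].
toy = pd.DataFrame({
    'year': [1973, 1973, 1973, 2018, 2018, 2018],
    'spkhomo': ['ALLOWED', 'NOT ALLOWED', 'NOT ALLOWED', 'ALLOWED', 'ALLOWED', None],
})

# Drop missing answers explicitly so the shares are among valid responses only.
valid = toy.dropna(subset=['spkhomo'])

# normalize='index' turns each year's counts into row proportions.
share = pd.crosstab(valid['year'], valid['spkhomo'], normalize='index')
print(share)
```

On the real data, `share.plot()` would draw one line per response category across survey years.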