# Exercise Nine: Numbers

This week, you'll be exploring the [GSS](https://gssdataexplorer.norc.org/) dataset we worked with in the "Social Stats" exercise. Using our demo and the textbook as a guide, pick three new variables to explore. Your workflow should:

- Import the current version of the file (available for download at the link above), and isolate the columns of interest based on the variables you want to include
- Using the variable navigator provided by GSS, determine the years applicable and narrow your dataset accordingly.
- Visualize at least two quantitative relationships or patterns: these might include connections between clear numerical values, such as age and income, or more complex visualizations based on boolean data (for example, our "yes" and "no" to reading fiction).
- Group the data using at least two different divisions to spot interesting trends, and plot at least one variance across a group (refer to our example of happiness among fiction readers as a starting point).

For a bonus challenge, try running another analysis using an advanced method such as summary statistics or cross tabulation.

## Methodology for Determining Search Codes 

I am interested in the intersection of military service, firearm ownership, and support for legalization of marijuana. Through various searches in the GSS Codebook, I found the following applicable codes.

- MILWRKEV - EVER WORK FOR MILITARY OR DOD?
- MILWRKNW - CURRENTLY WORK FOR MILITARY OR DOD?
- VETAID - ANY IN HH RECEIVE MIL OR VET BENEFITS
- MEMVET - MEMBERSHIP IN VETERAN GROUP
- GUNLAW - FAVOR OR OPPOSE GUN PERMITS
- OWNGUN - HAVE GUN IN HOME
- GUNSDRNK - SHOULD CARRYING A FIREARM DRINKING ALCOHOL BE ILLEGAL
- GRASS - SHOULD MARIJUANA BE MADE LEGAL
- GRASSY - SHOULD MARIJUANA BE LEGAL-VERSION

Unfortunately, I could not find a simple yes-or-no code for military service. The closest codes I found were MILWRKNW and MEMVET, and both have drawbacks. MILWRKNW could refer to both military personnel and civilians working for the military. MEMVET deals specifically with veterans, since veteran groups require military service, but group members are a smaller subset of all veterans. Initially, I decided to use MILWRKNW as a stand-in for "yes or no military service," as this code may most widely capture the kind of individual I am interested in. However, it produced very few answers across my years, so MEMVET served as a better code.

I will use OWNGUN for firearm ownership, as this is a relatively straightforward code. If someone owns a gun, one would assume they support gun ownership. I will use GRASS for approval of marijuana legalization.

I will also include some basic demographic codes in my searches to further parse the data.

## Years of Interest 

I chose to focus on four years: 1975, 1991, 2007, and 2018. Except for 2018, each marks an important moment in American military history. 1975 marked the end of the Vietnam War. The Persian Gulf War occurred in 1991. 2007 was a point of high intensity in the Global War on Terror, specifically the troop surge in Iraq. 2018 is the last year of available data, so I will use it to assess current opinions.

## Import and Narrow by Column and Year

In [6]:
import pandas as pd

columns = ['id', 'year', 'age', 'sex', 'race', 'memvet', 'owngun', 'grass']
df = pd.read_stata("GSS7218_R1.dta", columns=columns)

df = df.loc[df['year'].isin({1975, 1991, 2007, 2018})]
# Printing the frame shows its first and last rows plus its shape
print(df)

         id  year age     sex   race memvet owngun      grass
4601      1  1975  38    male  white     no    NaN  NOT LEGAL
4602      2  1975  20  female  white     no    NaN  NOT LEGAL
4603      3  1975  61  female  white     no    NaN  NOT LEGAL
4604      4  1975  19    male  white     no    NaN      legal
4605      5  1975  28    male  white     no    NaN      legal
...     ...   ...  ..     ...    ...    ...    ...        ...
64809  2344  2018  37  female  white    NaN     no        NaN
64810  2345  2018  75  female  white    NaN     no        NaN
64811  2346  2018  67  female  white    NaN    yes      legal
64812  2347  2018  72    male  white    NaN    NaN  NOT LEGAL
64813  2348  2018  79  female  white    NaN    yes        NaN

[5355 rows x 8 columns]


In [19]:
# Clean the dataset by dropping rows with NaN in any column of interest
df = df.dropna(subset=['memvet', 'owngun', 'grass', 'age'])
# Printing the frame shows its first and last rows plus its shape
print(df)

         id  year   age     sex   race memvet   owngun      grass
26266     2  1991  32.0  female  white     no       no      legal
26268     4  1991  26.0  female  white     no       no      legal
26271     7  1991  46.0    male  black     no      yes      legal
26273     9  1991  57.0  female  black     no       no  NOT LEGAL
26284    20  1991  55.0    male  black     no       no  NOT LEGAL
...     ...   ...   ...     ...    ...    ...      ...        ...
27761  1497  1991  56.0  female  white     no  refused  NOT LEGAL
27764  1500  1991  73.0  female  white     no      yes  NOT LEGAL
27769  1505  1991  66.0  female  white     no      yes      legal
27773  1509  1991  22.0    male  white     no      yes  NOT LEGAL
27780  1516  1991  70.0    male  white    yes      yes  NOT LEGAL

[463 rows x 8 columns]
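
The preview above shows 1991 at both ends of the cleaned frame, which may mean that requiring every question to be answered dropped the other years entirely (in the first preview, `owngun` is NaN in the 1975 rows and `memvet` is NaN in the 2018 rows). One way to check is to count non-missing answers per year before dropping anything. This is a minimal sketch on a made-up frame called `toy`, not the GSS file; its values are assumptions chosen only to mimic that pattern.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the year-filtered frame: each year lacks a different question.
toy = pd.DataFrame({
    "year":   [1975, 1975, 1991, 1991, 2018, 2018],
    "memvet": ["no", "no", "no", "yes", np.nan, np.nan],
    "owngun": [np.nan, np.nan, "yes", "no", "no", "yes"],
    "grass":  ["legal", "NOT LEGAL", "legal", "NOT LEGAL", "legal", np.nan],
})

# count() tallies non-missing values, so a 0 marks a question with no answers in that year.
answered = toy.groupby("year")[["memvet", "owngun", "grass"]].count()
print(answered)

# Years that still have rows once every column is required to be present.
complete = toy.dropna(subset=["memvet", "owngun", "grass"])
surviving_years = sorted(complete["year"].unique().tolist())
print(surviving_years)
```

Running the same `groupby(...).count()` on the real frame before the cleaning step would show which of the four years can support a comparison across all three questions.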


## Visualize Two Quantitative Aspects of the Data

In [20]:
# I'm unclear on this error. To test the system, I ran the cells on the Complete example from class. I received the same error.
print(df['age'].mean())

df.groupby('memvet')['age'].mean().plot(kind='barh')
plt.xlabel('Mean Age for Membership in a Vet Group')
plt.legend();

TypeError: Categorical cannot perform the operation mean
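
A likely cause of the error above: by default `pd.read_stata` turns Stata columns that carry value labels into pandas Categorical columns, and a Categorical cannot be averaged. The sketch below illustrates one possible workaround on made-up data, not the GSS file. The frame `toy`, the label "89 or older", and the new column `age_num` are assumptions for the example. It converts the categorical ages to numbers, lets any text label become NaN, and then takes the grouped mean.

```python
import pandas as pd

# Hypothetical stand-in for the loaded frame: both columns are Categorical,
# and one age is a text label rather than a number.
toy = pd.DataFrame({
    "memvet": pd.Categorical(["no", "no", "yes", "yes"]),
    "age": pd.Categorical(
        [30, 50, 70, "89 or older"],
        categories=[30, 50, 70, "89 or older"],
    ),
})

# Averaging the Categorical column directly fails with a TypeError.
try:
    toy["age"].mean()
except TypeError as err:
    print("TypeError:", err)

# Convert to plain objects first, then to numbers; non-numeric labels become NaN.
toy["age_num"] = pd.to_numeric(toy["age"].astype(object), errors="coerce")

# observed=True keeps only the categories that actually appear in the data.
mean_age = toy.groupby("memvet", observed=True)["age_num"].mean()
print(mean_age)
```

Reading the file with `convert_categoricals=False` is another option. It keeps every labeled column as raw codes, so the text labels would then have to be looked up separately.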

In [12]:
# As above, this cell in the Complete version returned the same error.
performance_counts = df['grass'].value_counts()
labels=["Not Legal","Legal"]
colors=["#ff9999","#99ff99"]
explode = (0, 0.1)
fig1, ax1 = plt.subplots()
ax1.pie(performance_counts, explode=explode, labels=labels, colors=colors, autopct='%1.1f%%', shadow=True, startangle=90)
ax1.axis('equal')  
plt.tight_layout()
plt.show()

NameError: name 'plt' is not defined
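
The NameError above means that `matplotlib.pyplot` was never imported under the name `plt` in this session. The sketch below adds the import and redraws the pie from a made-up Series whose totals (371 "NOT LEGAL", 92 "legal") are taken from the grass-by-sex counts in this notebook. It also takes the labels from the counts themselves, since `value_counts()` orders by frequency and a hand-written label list can end up attached to the wrong slice. The `Agg` backend line is an assumption so the example runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend; drop this line in Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for df['grass'], rebuilt from the totals reported in this notebook.
answers = pd.Series(["NOT LEGAL"] * 371 + ["legal"] * 92, name="grass")
counts = answers.value_counts()

fig, ax = plt.subplots()
ax.pie(
    counts,
    labels=counts.index,            # labels follow the counts, whatever their order
    colors=["#ff9999", "#99ff99"],
    explode=(0, 0.1),
    autopct="%1.1f%%",
    startangle=90,
)
ax.axis("equal")                    # draw the pie as a circle
plt.tight_layout()
plt.show()
```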

## Use Groupby to Spot Additional Trends

In [13]:
exhibition_gender = df.groupby('owngun')['sex'].value_counts()
exhibition_gender

owngun   sex   
yes      male      105
         female     87
no       female    171
         male       96
refused  female      3
         male        1
Name: sex, dtype: int64
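
Raw counts like those above are hard to compare because the groups differ in size. As a possible take on the cross-tabulation bonus, the sketch below rebuilds one row per respondent from the printed owngun-by-sex counts and uses `pd.crosstab` with `normalize="index"` to show the sex split within each `owngun` answer. The frame `toy` is a reconstruction for illustration, not the GSS file.

```python
import pandas as pd

# Cell counts copied from the owngun-by-sex output above.
cell_counts = {
    ("yes", "male"): 105, ("yes", "female"): 87,
    ("no", "male"): 96, ("no", "female"): 171,
    ("refused", "male"): 1, ("refused", "female"): 3,
}

# Expand the counts into one (owngun, sex) row per respondent.
rows = [pair for pair, n in cell_counts.items() for _ in range(n)]
toy = pd.DataFrame(rows, columns=["owngun", "sex"])

# normalize="index" divides each row by its total, so every row sums to 1.
shares = pd.crosstab(toy["owngun"], toy["sex"], normalize="index")
print(shares.round(3))
```

On the real data, the same call would be `pd.crosstab(df['owngun'], df['sex'], normalize='index')`.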

In [14]:
exhibition_gender = df.groupby('memvet')['sex'].value_counts()
exhibition_gender

memvet  sex   
yes     male       24
        female      6
no      female    255
        male      178
Name: sex, dtype: int64

In [15]:
exhibition_gender = df.groupby('grass')['sex'].value_counts()
exhibition_gender

grass      sex   
legal      male       49
           female     43
NOT LEGAL  female    218
           male      153
Name: sex, dtype: int64

In [17]:
# Unclear why this conversion failed.
df['grass'] = df['grass'].replace(['Not Legal', 'Legal'], [0, 1])
df['owngun'] = df['owngun'].replace(['No', 'Yes'], [0, 1])
df.head()

df.groupby('owngun')['memvet'].mean().plot(kind='barh')
plt.xlabel('Gun Ownership Among Members of Vet Groups')
plt.legend();

DataError: No numeric types to aggregate
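
Two things may explain the failure above. The labels printed earlier are `'NOT LEGAL'`, `'legal'`, `'yes'`, and `'no'`, so replacing `'Not Legal'`, `'Legal'`, `'No'`, and `'Yes'` matches nothing. Also, `memvet` is never converted, so the grouped mean has no numeric column to work with. The sketch below shows one way around both on a made-up frame `toy`. The `*_num` column names are assumptions, and answers outside the mapping, such as "refused", become NaN.

```python
import pandas as pd

# Hypothetical stand-in with the same labels and Categorical dtype as the loaded data.
toy = pd.DataFrame({
    "memvet": pd.Categorical(["yes", "no", "no", "yes", "no", "no"]),
    "owngun": pd.Categorical(["yes", "yes", "no", "no", "refused", "yes"]),
    "grass": pd.Categorical(
        ["legal", "NOT LEGAL", "NOT LEGAL", "legal", "legal", "NOT LEGAL"]
    ),
})

# Map the exact labels to 1/0; any label missing from the dict turns into NaN.
yes_no = {"yes": 1, "no": 0}
toy["memvet_num"] = toy["memvet"].astype(object).map(yes_no)
toy["owngun_num"] = toy["owngun"].astype(object).map(yes_no)
toy["grass_num"] = toy["grass"].astype(object).map({"legal": 1, "NOT LEGAL": 0})

# Share of vet-group members within each gun-ownership answer.
vet_share = toy.groupby("owngun", observed=True)["memvet_num"].mean()
print(vet_share)

# Share of gun owners among vet-group members and non-members.
gun_share = toy.groupby("memvet", observed=True)["owngun_num"].mean()
print(gun_share)
```

Grouping by `owngun` and averaging `memvet_num` gives the share of vet-group members within each ownership answer. Grouping by `memvet` and averaging `owngun_num` gives gun ownership among vet-group members, which is closer to the axis label in the cell above.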