# Working with Categorical data

Categorical data has special attributes that you can leverage to make your analyses noticeably faster. In `stata` we are already used to working with categorical data under the hood (integer codes with value labels attached), although Stata doesn't call them categoricals.

A categorical __series__ is stored as _integers_, and each integer stands for a _string_ value. The mapping can be as simple as:


| Code | Value |
| :----: | :------: |
| 0 | "Male" |
| 1 | "Female" |

Because `pandas` stores the data as _integers_, a categorical column takes up __a lot less memory__ than the same column stored as strings, which in turn tends to make your analysis faster.
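A minimal sketch of the memory difference, using a made-up series rather than the ACS data (the labels and the row count are illustrative assumptions):

```python
import pandas as pd

# Toy data (hypothetical, not from the ACS file): 200,000 repeated labels
sex = pd.Series(["male", "female"] * 100_000)

# Converting to the categorical dtype stores each row as a small integer code
sex_cat = sex.astype("category")

string_bytes = sex.memory_usage(deep=True)
category_bytes = sex_cat.memory_usage(deep=True)

print(sex_cat.cat.codes.head())   # the integer codes behind each row
print(sex_cat.cat.categories)     # the labels those codes point to
print(f"strings: {string_bytes:,} bytes, category: {category_bytes:,} bytes")
```

The exact byte counts depend on your pandas version and on how many distinct labels the column has, so treat the printed numbers as indicative only.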

***
In this notebook we will take advantage of these attributes to analyze ACS data.

First, import `pandas` and load the data:
```python
import pandas as pd
```

In [2]:
df = pd.read_stata("../data/raw/usa_00026.dta")

df.head()

Unnamed: 0,year,datanum,serial,hhwt,statefip,countyfips,gq,pernum,perwt,sex,...,occ2010,ind,wkswork2,uhrswork,looking,availble,inctot,incwage,vetstat,vetstatd
0,2000,3,218,600,california,,households under 1970 definition,1,671,male,...,purchasing managers,607,,,"no, did not look for work",,24800,0,not a veteran,no military service
1,2000,3,226,600,california,,households under 1970 definition,2,610,female,...,"property, real estate, and community associati...",707,14-26 weeks,30.0,"no, did not look for work","no, other reason(s)",103640,100000,not a veteran,no military service
2,2000,3,784,600,california,,households under 1970 definition,2,618,female,...,"unemployed, with no work experience in the las...",0,,,"no, did not look for work",,34000,0,not a veteran,no military service
3,2000,3,1033,600,california,,households under 1970 definition,1,637,male,...,"first-line supervisors of farming, fishing, an...",27,50-52 weeks,50.0,,,37500,37500,veteran,veteran
4,2000,3,1115,600,california,,households under 1970 definition,1,579,female,...,"property, real estate, and community associati...",707,50-52 weeks,16.0,"no, did not look for work","no, other reason(s)",145000,120000,not a veteran,no military service


***
We will work with the variable `educd`, the detailed version of the educational attainment variable in the American Community Survey.

In [3]:
df['educd'].unique()

[some college, but less than 1 year, associate's degree, type not specified, no schooling completed, high school graduate or ged, 1 or more years of college credit, no degree, ..., grade 7, grade 5, grade 2, grade 1, nursery school, preschool]
Length: 28
Categories (28, object): [no schooling completed < nursery school to grade 4 < nursery school, preschool < kindergarten ... bachelor's degree < master's degree < professional degree beyond a bachelor's degree < doctoral degree]

To __access__ the attributes of a categorical variable, use the `.cat` __accessor__.

In [4]:
print(df['educd'].cat.categories)

Index(['no schooling completed', 'nursery school to grade 4',
       'nursery school, preschool', 'kindergarten', 'grade 1', 'grade 2',
       'grade 3', 'grade 4', 'grade 5 or 6', 'grade 5', 'grade 6',
       'grade 7 or 8', 'grade 7', 'grade 8', 'grade 9', 'grade 10', 'grade 11',
       '12th grade, no diploma', 'high school graduate or ged',
       'regular high school diploma', 'ged or alternative credential',
       'some college, but less than 1 year',
       '1 or more years of college credit, no degree',
       'associate's degree, type not specified', 'bachelor's degree',
       'master's degree', 'professional degree beyond a bachelor's degree',
       'doctoral degree'],
      dtype='object')
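Besides `.categories`, the `.cat` accessor also exposes whether the categories are ordered and the integer code behind each row. A small sketch on a hypothetical three-level series standing in for `df['educd']`:

```python
import pandas as pd

# Hypothetical ordered categorical; the real educd column has 28 levels
educ = pd.Series(
    pd.Categorical(
        ["bachelor's degree", "grade 1", "doctoral degree", "grade 1"],
        categories=["grade 1", "bachelor's degree", "doctoral degree"],
        ordered=True,
    )
)

print(educ.cat.categories)  # the labels, in their defined order
print(educ.cat.ordered)     # whether the categories carry an order
print(educ.cat.codes)       # position of each row's label within the categories
```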


Not all categoricals are _ordered_, but luckily ours is. If you look back at the output of `df['educd'].unique()`, the `Categories` line shows the order of the levels, separated by `<`.

We can use this __order__ attribute to compare values with `<` and `>` as if they were numeric values.

Try this:
```python
df[df['educd'] > "master's degree"].head()
```
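If you want to see the comparison at work without the ACS file, here is a sketch on a hypothetical ordered series (the three levels are an assumption chosen for illustration):

```python
import pandas as pd

levels = ["bachelor's degree", "master's degree", "doctoral degree"]
educ = pd.Series(
    pd.Categorical(
        ["master's degree", "doctoral degree", "bachelor's degree"],
        categories=levels,
        ordered=True,
    )
)

# The comparison follows the category order, not alphabetical order
above_masters = educ > "master's degree"
print(educ[above_masters])

# Ordered categoricals also support min() and max()
print(educ.min(), educ.max())
```

The value on the right-hand side has to be one of the existing categories; comparing against any other string raises a `TypeError`.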

We can use these comparisons to create our `educ_attainment` variable.

In [None]:
# anything above a BA is advanced
df.loc[(df['educd'] > "bachelor's degree"),'educ_attainment'] = 'advanced degree'

# if it's a bachelor's degree code it as bachelor
df.loc[(df['educd'] == "bachelor's degree"),'educ_attainment'] = 'bachelor'

# if it's less than a bachelor's degree AND greater than a ged or alternative credential code it as some college
df.loc[((df['educd'] < "bachelor's degree") & (df['educd'] > "ged or alternative credential")),'educ_attainment'] = 'some college'

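# if it's between high school graduate or ged AND ged or alternative credential (inclusive) code it as hs grad
# (a range, so that 'regular high school diploma', which sits between the two, is not left uncoded)
df.loc[((df['educd'] >= "high school graduate or ged") & (df['educd'] <= "ged or alternative credential")),'educ_attainment'] = 'hs grad'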

# if it's less than high school graduate or ged code it as no hs
df.loc[(df['educd'] < "high school graduate or ged"),'educ_attainment'] = 'no hs'

# Convert df['educ_attainment'] to an ordered Categorical and assign it back to the same column
df['educ_attainment'] = pd.Categorical(df['educ_attainment'], categories=['no hs', 'hs grad', 'some college', 'bachelor', 'advanced degree'], ordered=True)

df.head()
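Because each `.loc` rule only touches the rows it matches, any level that no rule covers is left as `NaN`. A quick way to check for gaps, sketched on a made-up miniature column (the `toy` frame, its six levels, and the subset of rules are assumptions, not the ACS data):

```python
import pandas as pd

# Hypothetical miniature of educd with only a handful of levels
levels = [
    "grade 11",
    "high school graduate or ged",
    "regular high school diploma",
    "ged or alternative credential",
    "bachelor's degree",
    "master's degree",
]
toy = pd.DataFrame({
    "educd": pd.Categorical(
        ["grade 11", "regular high school diploma", "bachelor's degree", "master's degree"],
        categories=levels,
        ordered=True,
    )
})

toy.loc[toy["educd"] > "bachelor's degree", "educ_attainment"] = "advanced degree"
toy.loc[toy["educd"] == "bachelor's degree", "educ_attainment"] = "bachelor"
in_hs_range = (toy["educd"] >= "high school graduate or ged") & (
    toy["educd"] <= "ged or alternative credential"
)
toy.loc[in_hs_range, "educ_attainment"] = "hs grad"
toy.loc[toy["educd"] < "high school graduate or ged", "educ_attainment"] = "no hs"

# Rows that matched no rule stay NaN, so count them before moving on
unassigned = toy["educ_attainment"].isna().sum()
print(unassigned)

# Cross-tabulate detailed level against the new variable to eyeball the mapping
print(pd.crosstab(toy["educd"], toy["educ_attainment"]))
```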

Now we do something similar for `citizen`, using its category order to build a `nativity` variable.

In [None]:
df['citizen'].cat.categories

In [None]:
df['nativity'] = 'foreign born'
df.loc[(df['citizen'] < 'naturalized citizen'), 'nativity'] = 'native'

In [None]:
education = df.groupby(['year','nativity','educ_attainment'])['perwt'].sum().to_frame()

In [None]:
# divide each row by the total of its (year, nativity) group to get the share of each education level within the group
education_pct = education / education.groupby(['year', 'nativity']).transform('sum')

education_pct.style.format('{:.2%}')
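To see what the percentage step computes, here is a sketch on hypothetical weighted counts shaped like `education` (the numbers are invented, and `transform('sum')` is one of several ways to get the group totals):

```python
import pandas as pd

# Hypothetical person weights, not taken from the ACS extract
toy = pd.DataFrame({
    "year": [2000, 2000, 2000, 2000],
    "nativity": ["native", "native", "foreign born", "foreign born"],
    "educ_attainment": ["no hs", "hs grad", "no hs", "hs grad"],
    "perwt": [30, 70, 50, 50],
})
counts = toy.groupby(["year", "nativity", "educ_attainment"])["perwt"].sum().to_frame()

# Total weight of each (year, nativity) group, repeated on every row of that group
group_totals = counts.groupby(["year", "nativity"])["perwt"].transform("sum")

# Share of each education level within its group
shares = counts["perwt"] / group_totals
print(shares)
```

Within each year and nativity group the shares add up to 1, which is what makes the stacked bars in the chart comparable across groups.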

In [None]:
import altair as alt

In [None]:
data = education_pct.reset_index()
data.columns = ['year', 'nativity', 'ed_att', 'pctg']
alt.Chart(data).mark_bar().encode(
    x = alt.X('nativity:O'),
    y = alt.Y('pctg:Q', scale = alt.Scale(domain=[0,1]), axis = alt.Axis(format="%")),
    color = alt.Color('ed_att:N'),
    column = alt.Column('year:O')
)