# Occupation

### Introduction:

Special thanks to: https://github.com/justmarkham for sharing the dataset and materials.

### Step 1. Import the necessary libraries

In [2]:
import pandas as pd
import numpy as np

### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user). 

### Step 3. Assign it to a variable called users.

In [3]:
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user'
users = pd.read_csv(url, sep='|')

In [4]:
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 943 entries, 0 to 942
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   user_id     943 non-null    int64 
 1   age         943 non-null    int64 
 2   gender      943 non-null    object
 3   occupation  943 non-null    object
 4   zip_code    943 non-null    object
dtypes: int64(2), object(3)
memory usage: 37.0+ KB


In [5]:
users.head()

Unnamed: 0,user_id,age,gender,occupation,zip_code
0,1,24,M,technician,85711
1,2,53,F,other,94043
2,3,23,M,writer,32067
3,4,24,M,technician,43537
4,5,33,F,other,15213


### Step 4. Discover what is the mean age per occupation

In [6]:
(
    users.groupby('occupation')
    .agg({'age':'mean'})
    .round(1)
    .sort_values(by='age', ascending=False)
)

Unnamed: 0_level_0,age
occupation,Unnamed: 1_level_1
retired,63.1
doctor,43.6
educator,42.0
healthcare,41.6
librarian,40.0
executive,38.7
administrator,38.7
marketing,37.6
lawyer,36.8
engineer,36.4


### Step 5. Discover the Male ratio per occupation and sort it from the most to the least

- group by occupation
- create a column that is True if gender='M'
- aggregate mean on that column

In [17]:
users[['occupation', 'gender' ]].head()

Unnamed: 0,occupation,gender
0,technician,M
1,other,F
2,writer,M
3,technician,M
4,other,F


[Formatting pandas series and dataframes w/out changing the underlying data](https://stackoverflow.com/questions/20937538/how-to-display-pandas-dataframe-of-floats-using-a-format-string-for-columns)

In [9]:
df = (
    users[['occupation', 'gender']]
    # create a boolean column, True if male
    .assign(pct_male=lambda x: x.gender == 'M')
    .groupby('occupation')
    # the mean of a boolean column within each group is the fraction
    # of True values, i.e. the share of men per occupation
    .agg({'pct_male': 'mean'})
    .sort_values(by='pct_male', ascending=False)
)

# style.format changes only how the values are displayed;
# the underlying data in df is left unaltered
df.style.format({
    'pct_male': lambda val: f'{val*100:,.1f}%',
})


Unnamed: 0_level_0,pct_male
occupation,Unnamed: 1_level_1
doctor,100.0%
engineer,97.0%
technician,96.3%
retired,92.9%
programmer,90.9%
executive,90.6%
scientist,90.3%
entertainment,88.9%
lawyer,83.3%
salesman,75.0%


### Step 6. For each occupation, calculate the minimum and maximum ages

In [10]:
(
    users.groupby('occupation')
    .agg({'age':['min', 'max']})
)

Unnamed: 0_level_0,age,age
Unnamed: 0_level_1,min,max
occupation,Unnamed: 1_level_2,Unnamed: 2_level_2
administrator,21,70
artist,19,48
doctor,28,64
educator,23,63
engineer,22,70
entertainment,15,50
executive,22,69
healthcare,22,62
homemaker,20,50
lawyer,21,53


### Step 7. For each combination of occupation and gender, calculate the mean age

In [11]:
(
    users.groupby(['occupation', 'gender'])
    .agg({'age':'mean'})
)

Unnamed: 0_level_0,Unnamed: 1_level_0,age
occupation,gender,Unnamed: 2_level_1
administrator,F,40.638889
administrator,M,37.162791
artist,F,30.307692
artist,M,32.333333
doctor,M,43.571429
educator,F,39.115385
educator,M,43.101449
engineer,F,29.5
engineer,M,36.6
entertainment,F,31.0


#### This is the same as the prior solution, but I used `unstack()` to put the gender data side-by-side.

In [12]:
(
    users.groupby(['occupation', 'gender'])
    .agg({'age':'mean'})
).unstack('gender').round(1)

Unnamed: 0_level_0,age,age
gender,F,M
occupation,Unnamed: 1_level_2,Unnamed: 2_level_2
administrator,40.6,37.2
artist,30.3,32.3
doctor,,43.6
educator,39.1,43.1
engineer,29.5,36.6
entertainment,31.0,29.0
executive,44.0,38.2
healthcare,39.8,45.4
homemaker,34.2,23.0
lawyer,39.5,36.2
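
What `unstack` does here can be seen on a small example. This is a sketch on a made-up `toy` dataframe (the ages are invented for illustration); it also shows where the empty cell for female doctors comes from:

```python
import pandas as pd

# Hypothetical toy data: there is no (doctor, F) row
toy = pd.DataFrame({
    'occupation': ['artist', 'artist', 'doctor'],
    'gender': ['F', 'M', 'M'],
    'age': [30, 32, 44],
})

# Long form: one row per (occupation, gender) pair
long_form = toy.groupby(['occupation', 'gender'])['age'].mean()

# Wide form: the gender index level becomes the columns
wide_form = long_form.unstack('gender')
print(wide_form)
```

Any occupation/gender pair missing from the long form shows up as `NaN` in the wide form.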


### Step 8.  For each occupation present the percentage of women and men

In [13]:
users.head()

Unnamed: 0,user_id,age,gender,occupation,zip_code
0,1,24,M,technician,85711
1,2,53,F,other,94043
2,3,23,M,writer,32067
3,4,24,M,technician,43537
4,5,33,F,other,15213


### My complete solution comes first; the same solution is then repeated step by step below.

In [30]:

(
    users[['occupation', 'gender']]
    .groupby(['occupation', 'gender'])
    .agg({'gender':'count'}) # index is occupation-gender, 
                             # data is count of each combination
    .rename({'gender':'gndr_cnt'},axis=1) # provides a name for the count 
                                          # different from the index heading
 
    .assign(occ_gndr_tot = lambda x: x.groupby('occupation').gndr_cnt.transform('sum'))
                          # this calculates the total # of people in each occupation and
                          # then uses transform to replicate the value for each gender

    .assign(pct_gndr = lambda x: x.gndr_cnt/x.occ_gndr_tot) # divides each occupation/gender
                                        # count by the total for that occupation
    .drop(columns=['gndr_cnt', 'occ_gndr_tot'])
    .unstack()                         # moves the gender level from the row index into columns
    .droplevel(0, axis=1)              # drops the pct_gndr level of the column index
    .style.format('{:.1%}', na_rep="--") # makes it pretty

)   


gender,F,M
occupation,Unnamed: 1_level_1,Unnamed: 2_level_1
administrator,45.6%,54.4%
artist,46.4%,53.6%
doctor,--,100.0%
educator,27.4%,72.6%
engineer,3.0%,97.0%
entertainment,11.1%,88.9%
executive,9.4%,90.6%
healthcare,68.8%,31.2%
homemaker,85.7%,14.3%
lawyer,16.7%,83.3%


### Now step-by-step

In [31]:
df = (
    users[['occupation', 'gender']]
    .groupby(['occupation', 'gender'])
    .agg({'gender':'count'}) # index is occupation-gender, 
                             # data is count of each combination
)
df.head(10)   

Unnamed: 0_level_0,Unnamed: 1_level_0,gender
occupation,gender,Unnamed: 2_level_1
administrator,F,36
administrator,M,43
artist,F,13
artist,M,15
doctor,M,7
educator,F,26
educator,M,69
engineer,F,2
engineer,M,65
entertainment,F,2


##### Two things we need to address about the above:
##### - We rename the `gender` column, which now holds the number of people in each gender/occupation combination, to `gndr_cnt`

##### - We need a total count per occupation by which to divide the gender count column, which we get with `transform`

##### How this `transform` command works:
##### - It groups the dataframe by `occupation` (`lambda x: x.groupby('occupation')`)
##### - It selects the series `gndr_cnt` from the `occupation` groupby
##### - `transform('sum')` sums `gndr_cnt` within each occupation group and replicates that value for every row in the group
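
The difference between `transform('sum')` and a plain `sum()` is easiest to see side by side. This is a minimal sketch using three of the counts from the output above (artist F/M and doctor M); the names `counts`, `replicated` and `collapsed` are just for illustration:

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [('artist', 'F'), ('artist', 'M'), ('doctor', 'M')],
    names=['occupation', 'gender'],
)
counts = pd.DataFrame({'gndr_cnt': [13, 15, 7]}, index=idx)

# transform keeps the original shape: one total per input row
replicated = counts.groupby('occupation').gndr_cnt.transform('sum')

# a plain aggregation collapses to one row per occupation
collapsed = counts.groupby('occupation').gndr_cnt.sum()

print(replicated)
print(collapsed)
```

Because `replicated` has the same index as `counts`, it can be assigned straight back as a new column, which is what the `assign` call relies on.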

In [32]:
df = \
(
    users[['occupation', 'gender']]
    .groupby(['occupation', 'gender'])
    .agg({'gender':'count'}) # index is occupation-gender, 
                             # data is count of each combination
    .rename({'gender':'gndr_cnt'},axis=1) # provides a name for the count 
                                          # different from the index heading
 
    .assign(occ_gndr_tot = lambda x: x.groupby('occupation').gndr_cnt.transform('sum'))
                          # this calculates the total # of people in each occupation and
                          # then uses transform to replicate the value for each gender

)   

df.head(10)

Unnamed: 0_level_0,Unnamed: 1_level_0,gndr_cnt,occ_gndr_tot
occupation,gender,Unnamed: 2_level_1,Unnamed: 3_level_1
administrator,F,36,79
administrator,M,43,79
artist,F,13,28
artist,M,15,28
doctor,M,7,7
educator,F,26,95
educator,M,69,95
engineer,F,2,67
engineer,M,65,67
entertainment,F,2,18


##### The rest is pretty straightforward; `transform` is by far the hardest part:
##### - Now that every occupation/gender row carries the total for its occupation, we can calculate the percent of each gender in each occupation using another `assign` statement
##### - We `drop` `gndr_cnt` and `occ_gndr_tot`, as they were just intermediate values
##### - I used `unstack` to move the gender level out of the row index and into columns, leaving one row per occupation
##### - I dropped the outer column index `pct_gndr` using `droplevel` as it's unattractive and we know what the data is
##### - I again used `style.format` to make the data more readable

In [33]:
(
    users[['occupation', 'gender']]
    .groupby(['occupation', 'gender'])
    .agg({'gender':'count'}) # index is occupation-gender, 
                             # data is count of each combination
    .rename({'gender':'gndr_cnt'},axis=1) # provides a name for the count 
                                          # different from the index heading
 
    .assign(occ_gndr_tot = lambda x: x.groupby('occupation').gndr_cnt.transform('sum'))
                          # this calculates the total # of people in each occupation and
                          # then uses transform to replicate the value for each gender

    .assign(pct_gndr = lambda x: x.gndr_cnt/x.occ_gndr_tot) # divides each occupation/gender
                                        # count by the total for that occupation
    .drop(columns=['gndr_cnt', 'occ_gndr_tot'])
    .unstack()                         # moves the gender level from the row index into columns
    .droplevel(0, axis=1)              # drops the pct_gndr level of the column index
    .style.format('{:.1%}', na_rep="--") # makes it pretty

)   

gender,F,M
occupation,Unnamed: 1_level_1,Unnamed: 2_level_1
administrator,45.6%,54.4%
artist,46.4%,53.6%
doctor,--,100.0%
educator,27.4%,72.6%
engineer,3.0%,97.0%
entertainment,11.1%,88.9%
executive,9.4%,90.6%
healthcare,68.8%,31.2%
homemaker,85.7%,14.3%
lawyer,16.7%,83.3%


### guipsamora's solution

##### Simpler than mine. What was completely new to me was `gender_ocup.div(occup_count, level = "occupation")`.
##### - As best as I can understand it, this is an alternative to using `transform`
##### - It broadcasts the `occupation` counts from a groupby on `occupation` alone across a groupby on both `occupation` and `gender`
##### - I had assumed this wouldn't be possible with two differently sized dataframes. The `level` parameter of `div` allows the `occupation` counts to be broadcast across the `occupation`/`gender` counts, matching on `occupation`
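
The broadcasting on `level` can be tried in isolation. This is a sketch on two small Series rather than the full dataframes, reusing three counts from the outputs above; `counts`, `totals` and `pct` are illustrative names:

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [('artist', 'F'), ('artist', 'M'), ('doctor', 'M')],
    names=['occupation', 'gender'],
)
counts = pd.Series([13, 15, 7], index=idx)

# One total per occupation, indexed by occupation only
totals = pd.Series([28, 7], index=pd.Index(['artist', 'doctor'], name='occupation'))

# level='occupation' matches each total to every row sharing that occupation
pct = counts.div(totals, level='occupation') * 100
print(pct)
```

Each artist row is divided by 28 and the doctor row by 7, even though `totals` has only two entries.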

In [50]:
# create a data frame and apply count to gender
gender_ocup = users.groupby(['occupation', 'gender']).agg({'gender': 'count'})
#gender_ocup = gender_ocup.rename({'gender': 'gender2'}, axis=1)

# create a DataFrame and apply count for each occupation
occup_count = users.groupby(['occupation']).agg('count')

# divide gender_ocup by occup_count, matching on occupation, and multiply by 100
occup_gender = gender_ocup.div(occup_count, level = "occupation") * 100

# show the first 12 rows of the 'gender' column
occup_gender.loc[: , 'gender'].head(12)

occupation     gender
administrator  F          45.569620
               M          54.430380
artist         F          46.428571
               M          53.571429
doctor         M         100.000000
educator       F          27.368421
               M          72.631579
engineer       F           2.985075
               M          97.014925
entertainment  F          11.111111
               M          88.888889
executive      F           9.375000
Name: gender, dtype: float64

##### Here is the structure of `gender_ocup`. We want to divide each occupation's gender counts by the occupation totals

In [42]:
gender_ocup.head(10)

Unnamed: 0_level_0,Unnamed: 1_level_0,gender
occupation,gender,Unnamed: 2_level_1
administrator,F,36
administrator,M,43
artist,F,13
artist,M,15
doctor,M,7
educator,F,26
educator,M,69
engineer,F,2
engineer,M,65
entertainment,F,2


`occup_count` holds the total for each occupation, repeated in every remaining column

In [43]:
occup_count.head(10)

Unnamed: 0_level_0,user_id,age,gender,zip_code
occupation,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
administrator,79,79,79,79
artist,28,28,28,28
doctor,7,7,7,7
educator,95,95,95,95
engineer,67,67,67,67
entertainment,18,18,18,18
executive,32,32,32,32
healthcare,16,16,16,16
homemaker,7,7,7,7
lawyer,12,12,12,12


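##### This intermediate result works by matching on `occupation` for the rows and on the `gender` column name for the columns; columns that exist in only one of the two dataframes come out as `NaN`. (When I renamed the `gender` column in `gender_ocup` with the commented-out line above, the broadcasting ceased to work.)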

In [47]:
occup_gender = gender_ocup.div(occup_count, level = "occupation") * 100
occup_gender.head(10)

Unnamed: 0_level_0,Unnamed: 1_level_0,age,gender,user_id,zip_code
occupation,gender,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
administrator,F,,45.56962,,
administrator,M,,54.43038,,
artist,F,,46.428571,,
artist,M,,53.571429,,
doctor,M,,100.0,,
educator,F,,27.368421,,
educator,M,,72.631579,,
engineer,F,,2.985075,,
engineer,M,,97.014925,,
entertainment,F,,11.111111,,
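
The column matching behind those `NaN` columns can be reproduced on a toy pair of frames. This is a sketch with made-up numbers and a plain (non-MultiIndex) row index, so it isolates only the column alignment; `left`, `right` and `gender2` are illustrative names:

```python
import pandas as pd

left = pd.DataFrame({'gender': [2, 6]}, index=['a', 'b'])
right = pd.DataFrame({'gender': [4, 8], 'age': [4, 8]}, index=['a', 'b'])

# 'gender' exists in both frames, so it divides; 'age' exists only in
# right, so that column is all NaN in the result
same_name = left.div(right)

# After renaming, no column label is shared and every cell is NaN
renamed = left.rename(columns={'gender': 'gender2'}).div(right)

print(same_name)
print(renamed)
```

Selecting a single column from each side before dividing is one way to avoid the extra `NaN` columns.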
