# Task Overview
This project belongs to the NLP domain. The task is straightforward: assign the correct job category to a job description. It is therefore a multi-class classification task with 28 classes.
The data was retrieved from CommonCrawl, which was famously used to train OpenAI's GPT-3 model. It is therefore representative of what can be found on the English-speaking part of the Internet, and thus contains a certain amount of bias. One of the goals of this competition is to design a solution that is both accurate and fair, as explained in the Evaluation section.

## Evaluation

First of all, solutions are evaluated according to the Macro F1 metric. The Macro F1 score is simply the arithmetic average of the per-class F1 scores.
We will also analyse proposed solutions according to their fairness with respect to the provided genders. In other words, we want you to design a solution that is not biased towards one gender in particular. To be specific, we will measure the average demographic parity across all classes. A fair model is one where this criterion is close to 1.
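
To make the Macro F1 definition concrete, here is a minimal pure-Python sketch. The function name `macro_f1` and the toy labels are illustrative assumptions, not part of the competition code, and the official scorer may handle absent classes differently.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores.

    Assumption: the set of classes is the union of the labels seen in
    y_true and y_pred. A class with no true and no predicted samples
    would not be counted at all.
    """
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0.
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example with three classes (hypothetical labels).
toy_true = [0, 0, 1, 1, 2]
toy_pred = [0, 1, 1, 1, 2]
print(macro_f1(toy_true, toy_pred))
```

If scikit-learn is available, `sklearn.metrics.f1_score(y_true, y_pred, average="macro")` computes the same kind of average and is probably the more convenient choice in practice.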

## Datasets

**data.json**
Contains job descriptions as well as genders for the training set, which contains 217,197 samples. If you're using pandas, then you can easily open this with pd.read_json.

**label.csv**
Contains job labels for the training set.

**categories_string.csv**
Provides a mapping between job labels and label integers.
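
The three files can be combined into a single training table by joining on `Id` and then on the label integer. The sketch below uses tiny in-memory stand-ins for the real files (the descriptions are placeholders), and it assumes the column names `"0"` and `"1"` that appear when `categories_string.csv` is printed in this notebook.

```python
import pandas as pd

# Stand-ins for data.json, label.csv and categories_string.csv.
df = pd.DataFrame({
    "Id": [0, 1, 2],
    "description": ["placeholder text 0", "placeholder text 1", "placeholder text 2"],
    "gender": ["F", "M", "M"],
})
label = pd.DataFrame({"Id": [0, 1, 2], "Category": [19, 9, 19]})
category = pd.DataFrame({"0": ["professor", "accountant"], "1": [19, 9]})

# Give the mapping readable column names before joining.
category = category.rename(columns={"0": "job", "1": "Category"})

# Attach the integer label to each description, then the job name.
train = df.merge(label, on="Id", how="left").merge(category, on="Category", how="left")
print(train)
```

Joining explicitly on the integer column avoids relying on the row order of the mapping file.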

## Reading the data and libraries

In [2]:
import pandas as pd
import pickle

df = pd.read_json("resources/data.json")
label = pd.read_csv("resources/label.csv")
category = pd.read_csv("resources/categories_string.csv")

print("done")

done


In [3]:
print(df.head())
print(df.shape)

   Id                                        description gender
0   0   She is also a Ronald D. Asmus Policy Entrepre...      F
1   1   He is a member of the AICPA and WICPA. Brent ...      M
2   2   Dr. Aster has held teaching and research posi...      M
4   3   He runs a boutique design studio attending cl...      M
5   4   He focuses on cloud security, identity and ac...      M
(217197, 3)


In [4]:
print(label.head())
print(category.head())

   Id  Category
0   0        19
1   1         9
2   2        19
3   3        24
4   4        24
                  0  1
0            pastor  0
1             model  1
2      yoga_teacher  2
3           teacher  3
4  personal_trainer  4


## Cleaning

The only cleaning transformation applied here is lowercasing, so that all words are lower case. Hence research and Research will be treated as the same word.

You might want to look at other cleaning steps, such as removing stopwords, stemming words, etc.
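
As a starting point for such extra steps, here is a minimal sketch that lowercases, strips punctuation and drops stopwords using only the standard library. The function `clean_description` and the very short `STOPWORDS` set are hypothetical; a real pipeline would more likely use a full stopword list and a stemmer from a library such as NLTK or spaCy.

```python
import re

# Deliberately tiny, hypothetical stopword list for illustration only.
STOPWORDS = {"a", "an", "and", "the", "is", "of", "in", "to"}


def clean_description(text):
    """Lowercase, keep alphanumeric tokens only, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(tok for tok in tokens if tok not in STOPWORDS)


print(clean_description("He is a member of the AICPA and WICPA."))
```

With pandas, the function could be applied to the whole column with `df["description"].map(clean_description)`.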

In [5]:
# Lowercase every description.
df["description_lower"] = df["description"].str.lower()

## Fairness Metric

The fairness of the proposed solution will be evaluated through the macro disparate impact. Below we show how it can be computed on the original data.
Essentially, we look at the individual disparate impact of each job with respect to both genders, and then compute the unweighted average of these disparate impacts.
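
In symbols, and matching the computation carried out in this section, the definition can be sketched as follows. The notation is introduced here for illustration: $n_{j,F}$ and $n_{j,M}$ are the numbers of samples of job $j$ with gender F and M, and $K$ is the number of jobs.

$$
\mathrm{DI}_j = \frac{\max(n_{j,F},\, n_{j,M})}{\min(n_{j,F},\, n_{j,M})},
\qquad
\mathrm{MacroDI} = \frac{1}{K} \sum_{j=1}^{K} \mathrm{DI}_j
$$

Each $\mathrm{DI}_j$ is at least 1, and equals 1 exactly when both genders are equally represented in job $j$.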

In [3]:
names = pd.read_csv('categories_string.csv')['0'].to_dict()
jobs = pd.read_csv('label.csv', index_col='Id')['Category']
jobs = jobs.map(names)
jobs = jobs.rename('job')
print(jobs.head())

genders = pd.read_json('data.json').set_index('Id')['gender']
print(genders.head())

Id
0     professor
1    accountant
2     professor
3     architect
4     architect
Name: job, dtype: object
Id
0      F
1      M
8      F
80     F
780    M
Name: gender, dtype: object


In [4]:
people = pd.concat((jobs, genders), axis='columns')
people.head()

           job gender
Id
0    professor      F
1   accountant      M
2    professor      M
3    architect      M
4    architect      M


Let's first look at the gender distribution for each job.

In [10]:
counts = people.groupby(['job', 'gender']).size().unstack('gender')
counts

gender           F      M
job
accountant    1129   1992
architect     1314   4527
attorney      7106  11714
chiropractor   391   1015
comedian       345   1294
composer       553   2842
dentist       1895   3555
dietitian     2120    168
dj             125    706
filmmaker     1394   2730


Now let's compute the disparate impact for each job.

In [11]:
counts['disparate_impact'] = counts[['M', 'F']].max(axis='columns') / counts[['M', 'F']].min(axis='columns')
counts.sort_values('disparate_impact', ascending=False)

gender                 F     M  disparate_impact
job
dietitian           2120   168         12.619048
rapper                64   719         11.234375
nurse              11493  1129         10.179805
surgeon              890  5726          6.433708
yoga_teacher         803   141          5.695035
dj                   125   706          5.648000
software_engineer    613  3447          5.623165
paralegal            814   153          5.320261
composer             553  2842          5.139241
model               3398   717          4.739191


Now we can obtain the macro disparate impact by simply computing the average of the disparate_impact column.

In [12]:
counts['disparate_impact'].mean()

3.898171170378378

Of course, you can do all these steps in a function!
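
One possible way to wrap these steps is sketched below. The function name `macro_disparate_impact` and the toy data are assumptions made for illustration; the logic mirrors the cells above, except that missing job/gender combinations are filled with 0 and all gender columns present are used rather than only `M` and `F`.

```python
import pandas as pd


def macro_disparate_impact(jobs: pd.Series, genders: pd.Series) -> float:
    """Unweighted mean over jobs of max(gender count) / min(gender count).

    Both Series are expected to share the same (unique) index, e.g. Id.
    Note: if a job has zero samples for one gender, its ratio is inf,
    and so is the resulting mean.
    """
    people = pd.concat((jobs.rename("job"), genders.rename("gender")), axis="columns")
    # One row per job, one column per gender, cell = number of samples.
    counts = people.groupby(["job", "gender"]).size().unstack("gender", fill_value=0)
    per_job = counts.max(axis="columns") / counts.min(axis="columns")
    return float(per_job.mean())


# Toy data (hypothetical): job "a" has 2 F / 1 M, job "b" has 1 F / 3 M.
toy_jobs = pd.Series(["a", "a", "a", "b", "b", "b", "b"])
toy_genders = pd.Series(["F", "F", "M", "F", "M", "M", "M"])
print(macro_disparate_impact(toy_jobs, toy_genders))
```

Since the walkthrough above computes the metric on the original labels, evaluating a model would presumably mean passing its predicted jobs as `jobs`, with the same index as `genders`.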