# Kagglers' Gender Pay Gap & Salary Prediction

#### In this kernel, I will use the answers to the first 8 questions:
- Gender
- Age
- Nationality
- Education
- Major
- Profession
- Industry of profession
- Seniority (years of experience)

To try to predict the answer to the 9th : **The annual income**.

 > #### **I will also tackle the issue of the gender pay gap in this dataset.**
 
#### **Don't forget to leave an upvote !**

### Summary : 
- **1 - Target visualization**
- **2 - EDA & Gender Wage Gap Analysis**
- **3 - Model designing & Results**




 ### Tools used :
 - *pandas* and *NumPy* for manipulating the data
 - *Matplotlib* and *Seaborn* for dataviz
 - *LightGBM* for the model
 - *scikit-learn* for some machine learning tools.

In [None]:
import warnings
import itertools
import numpy as np 
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import lightgbm as lgb

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

warnings.simplefilter(action='ignore', category=FutureWarning)
sns.set_style('whitegrid')

## 0 - Loading data
We focus on the multiple-choice questions, as their answers are much easier to feed to a model.

In [None]:
df_choice = pd.read_csv('../input/multipleChoiceResponses.csv')

In [None]:
df_choice.head()

In [None]:
print("Number of replies to the survey :", df_choice.shape[0])

The first row holds the text of each question, so we store it separately.

In [None]:
question_names = df_choice.iloc[0]
df_choice = df_choice.drop(0, axis=0)
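
As a side note, here is a minimal sketch of the same header-row handling on a tiny in-memory CSV. The two columns and all the values are made up for illustration and are not taken from the survey file.

```python
import io
import pandas as pd

# Made-up miniature of the survey file: row 0 holds the question text.
raw = io.StringIO(
    "Q1,Q9\n"
    "What is your gender?,What is your current yearly compensation?\n"
    "Male,\"0-10,000\"\n"
    "Female,\"10-20,000\"\n"
)
toy = pd.read_csv(raw)

# Keep the question text aside, then drop that row from the answers.
toy_questions = toy.iloc[0]
toy_answers = toy.drop(0, axis=0).reset_index(drop=True)

print(toy_questions['Q9'])
print(toy_answers)
```

Resetting the index is optional; I only do it here so the toy answers start at 0 again.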

## 1 - Analyzing our target

In [None]:
print(question_names['Q9'])

In [None]:
print(df_choice['Q9'].unique())

Only numerical values interest us.

In [None]:
df_choice = df_choice[df_choice['Q9'].notnull()]
df_choice = df_choice[df_choice['Q9'] != 'I do not wish to disclose my approximate yearly compensation']

In [None]:
print(df_choice.shape[0], "replies left")

In [None]:
order = ['0-10,000', '10-20,000', '20-30,000', '30-40,000', '40-50,000', 
  '50-60,000', '60-70,000', '70-80,000', '80-90,000', '90-100,000', 
  '100-125,000', '125-150,000', '150-200,000', '200-250,000', '250-300,000', 
  '300-400,000', '400-500,000', '500,000+']

plt.figure(figsize=(15,10))
sns.countplot(x=df_choice['Q9'], order=order)
plt.xticks(rotation=-45)
plt.xlabel("Yearly Income ($)", fontsize=12)
plt.ylabel("Count", fontsize=12)
plt.title("Yearly Income Distribution", fontsize=15)
plt.show()

The distribution is quite logical. However, the sparsely populated classes are going to be hard to predict, so I will have to merge some of them later. Let us keep it like this for now.
For visualization purposes, we make the target numerical by taking the midpoint of each interval (the open-ended top bracket is set to 500,000).

In [None]:
dic = {'0-10,000': 5000, '10-20,000': 15000, '20-30,000': 25000, '30-40,000': 35000, 
       '40-50,000': 45000, '50-60,000': 55000, '60-70,000': 65000, '70-80,000': 75000, 
       '80-90,000': 85000, '90-100,000': 95000, '100-125,000': 112500, 
       '125-150,000': 137500, '150-200,000': 175000, '200-250,000': 225000, 
       '250-300,000': 275000, '300-400,000': 350000, '400-500,000': 450000, 
       '500,000+':500000}

df_choice['target'] = df_choice['Q9'].apply(lambda x: dic[x])
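
The dictionary above is written by hand. As a sanity check, the same midpoints can be derived from the labels themselves. This is only a sketch: `interval_midpoint` is a hypothetical helper, and it assumes every label looks like `'low-high,000'` or `'500,000+'`.

```python
def interval_midpoint(label):
    """Return the midpoint of an income bracket such as '100-125,000'."""
    if label.endswith('+'):
        # Open-ended top bracket: no midpoint exists, so use its lower bound.
        return float(label.rstrip('+').replace(',', ''))
    low, high = label.split('-')
    low = int(low) * 1000                 # lower bound is written in thousands
    high = int(high.replace(',', ''))     # upper bound is written in full
    return (low + high) / 2


for label in ['0-10,000', '90-100,000', '100-125,000', '500,000+']:
    print(label, '->', interval_midpoint(label))
```

Comparing `interval_midpoint(k)` with `dic[k]` for every key is a quick way to catch a typo in the hand-written values.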

#### Can we spot obvious trolls ?
For instance, students that make half a million a year...

In [None]:
liars = df_choice[df_choice['Q6'] == "Student"]
liars = liars[liars['target'] >= 500000]

In [None]:
liars.head(10)

Come on, you can't be a student and earn more than 500k a year. But if you do, please tell me the trick. To avoid obvious trolls, I will exclude everyone who reported earning more than 500k.

In [None]:
df_choice = df_choice[df_choice['target'] < 500000]

## 2 - Features EDA
I'm going to go through the first questions, checking if they can be useful for the prediction task.

### 2.1 - Gender

In [None]:
print(question_names['Q1'])

Gender is always an interesting feature. Let's see what we got.

In [None]:
plt.figure(figsize=(15,10))
sns.countplot(x=df_choice['Q1'])
plt.xlabel("Gender", fontsize=12)
plt.ylabel("Count", fontsize=12)
plt.title("Gender Distribution among the Kaggle Community", fontsize=15)
plt.show()

Not sure how to deal with the two small groups ("Prefer not to say" and "Prefer to self-describe"). Let us keep them as they are.

In [None]:
plt.figure(figsize=(15,10))
sns.violinplot(x='Q1', y='target', data=df_choice)
plt.ylabel("Yearly Income ($)", fontsize=12)
plt.xlabel("Gender", fontsize=12)
plt.title("Distribution of the Yearly income for Different Genders", fontsize=15)
plt.show()

The distribution of male and female salaries is quite similar so far. I am a bit more concerned about the other categories.

### 2.2 - Age

In [None]:
print(question_names['Q2'])

Simple feature. Always useful.

In [None]:
order = ['18-21', '22-24', '25-29', '30-34','35-39', '40-44', '45-49', '50-54', '55-59', '60-69', '70-79', '80+']
plt.figure(figsize=(15,10))
sns.countplot(x=df_choice['Q2'], order=order)
plt.xlabel("Age", fontsize=12)
plt.ylabel("Count", fontsize=12)
plt.title("Age Distribution of Kagglers", fontsize=15)
plt.show()

In [None]:
plt.figure(figsize=(15,10))
sns.boxplot(x='Q2', y='target', data=df_choice, order=order, showfliers=False)
plt.ylabel("Yearly Income ($)", fontsize=12)
plt.xlabel("Age", fontsize=12)
plt.title("Distribution of the Yearly Income for Age Groups", fontsize=15)
plt.show()

The older you are, the more you earn. Until you retire. Therefore the distribution is understandable.

### 2.3 - Nationality

In [None]:
print(question_names['Q3'])

Salaries vary widely depending on the country you work in. I group most countries by continent / region, except for the five most represented ones (USA, India, China, Russia and Brazil).
Note that Asia means Asia except India, China and Russia; that North America is Canada and Mexico only; and that South America does not include Brazil.

In [None]:
country_dic = {'Morocco': 'Africa',
             'Tunisia': 'Africa',
             'Austria': 'Europe',
             'Hong Kong (S.A.R.)': 'Asia',
             'Republic of Korea': 'Asia',
             'Thailand': 'Asia',
             'Czech Republic': 'Europe',
             'Philippines': 'Asia',
             'Romania': 'Europe',
             'Kenya': 'Africa',
             'Finland': 'Europe',
             'Norway': 'Europe',
             'Peru': 'South America',
             'Iran, Islamic Republic of...': 'Middle East',
             'Bangladesh': 'Asia',
             'New Zealand': 'Oceania',
             'Egypt': 'Africa',
             'Chile': 'South America',
             'Belarus': 'Europe',
             'Hungary': 'Europe',
             'Ireland': 'Europe',
             'Belgium': 'Europe',
             'Malaysia': 'Asia',
             'Denmark': 'Europe',
             'Greece': 'Europe',
             'Pakistan': 'Asia',
             'Viet Nam': 'Asia',
             'Argentina': 'South America',
             'Colombia': 'South America',
             'Indonesia': 'Asia',
             'Portugal': 'Europe',
             'South Africa': 'Africa',
             'South Korea': 'Asia',
             'Switzerland': 'Europe',
             'Sweden': 'Europe',
             'Israel': 'Middle East',
             'Nigeria': 'Africa',
             'Singapore': 'Asia',
             'I do not wish to disclose my location': 'dna',
             'Mexico': 'North America',
             'Ukraine': 'Europe',
             'Netherlands': 'Europe',
             'Turkey': 'Asia',
             'Poland': 'Europe',
             'Australia': 'Oceania',
             'Italy': 'Europe',
             'Spain': 'Europe',
             'Japan': 'Asia',
             'France': 'Europe',
             'Canada': 'North America', 
             'United Kingdom of Great Britain and Northern Ireland': 'Europe',
             'Germany': 'Europe',
             'Brazil': 'Brazil',
             'Russia': 'Russia',
             'Other': 'Other',
             'China': 'China',
             'India': 'India',
             'United States of America': 'USA'}

In [None]:
df_choice['Q3'] = df_choice['Q3'].apply(lambda x: country_dic[x])
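
Indexing the dictionary directly raises a `KeyError` as soon as one country is missing from it. A more forgiving variant, sketched below on made-up values, uses `Series.map` and sends unknown countries to a catch-all bucket. The names `toy_regions` and `toy_countries` are mine, and falling back to `'Other'` is an assumption, not what the cell above does.

```python
import pandas as pd

# Toy mapping and answers, made up for illustration.
toy_regions = {'France': 'Europe', 'Japan': 'Asia', 'India': 'India'}
toy_countries = pd.Series(['France', 'India', 'Atlantis', 'Japan'])

# map() returns NaN for unknown keys instead of raising a KeyError,
# so unmapped countries can be sent to a catch-all bucket.
toy_mapped = toy_countries.map(toy_regions).fillna('Other')
print(toy_mapped.tolist())
```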

In [None]:
plt.figure(figsize=(15,10))
sns.countplot(x=df_choice['Q3'], order=df_choice['Q3'].value_counts().index)
plt.xticks(rotation=-70)
plt.xlabel("Country / Region", fontsize=12)
plt.ylabel("Count", fontsize=12)
plt.title("Where are Kagglers from ?", fontsize=15)
plt.show()

In [None]:
plt.figure(figsize=(15,10))
sns.violinplot(x='Q3', y='target', data=df_choice, order=df_choice['Q3'].value_counts().index)
plt.ylabel("Yearly Income ($)", fontsize=12)
plt.xlabel("Nationality", fontsize=12)
plt.title("Distribution of the Yearly Income for Different Regions", fontsize=15)
plt.show()

As expected, North American, Oceanian and Middle Eastern Kagglers earn a bit more, mostly because the cost of living is high there and the economic system permits high wages.

#### Gender wage gap in different countries

In [None]:
df = df_choice[df_choice['Q1'] != "Prefer not to say"]
df = df[df['Q1'] != "Prefer to self-describe"]

In [None]:
plt.figure(figsize=(15,10))
sns.violinplot(x='Q3', y='target', hue='Q1', data=df, split=True, order=df_choice['Q3'].value_counts().index)
plt.ylabel("Yearly Income ($)", fontsize=12)
plt.xlabel("Nationality", fontsize=12)
plt.title("Illustration of the Gender Wage Gap for Different Regions", fontsize=15)
plt.show()

The wage gap is visible here, and appears to be higher in Europe and in North America than in Asia. I will focus on the gender pay gap in the USA for the next visualizations. 
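
Violin plots are nice, but a number is easier to compare. Here is a rough sketch of how the gap could be summarized per region as the difference between median incomes. `median_gap` is a hypothetical helper and the rows are made up, so the printed values say nothing about the survey.

```python
import pandas as pd

# Made-up rows, for illustration only (not survey results).
toy = pd.DataFrame({
    'region': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'gender': ['Male', 'Male', 'Female', 'Female'] * 2,
    'income': [60000, 80000, 50000, 70000, 30000, 40000, 30000, 40000],
})


def median_gap(data, group_col, gender_col='gender', income_col='income'):
    """Median male income minus median female income, per group."""
    medians = data.groupby([group_col, gender_col])[income_col].median().unstack()
    return medians['Male'] - medians['Female']


toy_gap = median_gap(toy, 'region')
print(toy_gap)
```

On the notebook's data this would be called as `median_gap(df, 'Q3', gender_col='Q1', income_col='target')`, while `Q1` still holds the `Male` / `Female` strings.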

### 2.4 - Education

In [None]:
print(question_names['Q4'])

Your level of education is linked to your income, at least at the beginning of your career.

In [None]:
order = ['Doctoral degree', 'Master’s degree', 'Bachelor’s degree',  'Some college/university study without earning a bachelor’s degree',
         'Professional degree', 'No formal education past high school', 'I prefer not to answer']

plt.figure(figsize=(15,10))
sns.countplot(x=df_choice['Q4'], order=order)
plt.xticks(rotation=-70)
plt.xlabel("Studies", fontsize=12)
plt.ylabel("Count", fontsize=12)
plt.title("Which Level of Study do Kagglers Have ?", fontsize=15)
plt.show()

In [None]:
plt.figure(figsize=(15,10))
sns.violinplot(x='Q4', y='target', data=df_choice, order=order)
plt.xticks(rotation=-70)
plt.ylabel("Yearly Income ($)", fontsize=12)
plt.xlabel("Studies", fontsize=12)
plt.title("Distribution of the Yearly Income for Different Levels of Study", fontsize=15)
plt.show()

The more you study, the more you earn? The trend is visible here, but only faintly.

#### More about the gender pay gap

In [None]:
df = df[df['Q3'] == 'USA']

plt.figure(figsize=(15,10))
sns.violinplot(x='Q4', y='target', hue='Q1', data=df, split=True, order=order)
plt.xticks(rotation=-70)
plt.ylabel("Yearly Income ($)", fontsize=12)
plt.xlabel("Studies", fontsize=12)
plt.title("Illustration of the Gender Wage Gap for Different Levels of Education in the USA", fontsize=15)
plt.show()

This shows that the gender pay gap is not (only) caused by a difference in education in the USA, as there are differences within each level of education.

Furthermore, recent studies have shown that women tend to study more than men in the USA. This is not really the case here because women are under-represented. But the point I'm making is that the length of studies is not the cause of the pay gap.

### 2.5 - Major

In [None]:
print(question_names['Q5'])

It is a bit hard to tell whether what you studied will determine your income. There is no big difference between people who majored in Computer Science and those who majored in Engineering or Mathematics.

In [None]:
plt.figure(figsize=(15,10))
sns.countplot(x=df_choice['Q5'], order=df_choice['Q5'].value_counts().index)
plt.xticks(rotation=-80)
plt.xlabel("Major", fontsize=12)
plt.ylabel("Count", fontsize=12)
plt.title("What Are Kagglers' Fields of Study ?", fontsize=15)
plt.show()

In [None]:
plt.figure(figsize=(15,10))
sns.violinplot(x='Q5', y='target', data=df_choice, order=df_choice['Q5'].value_counts().index)
plt.xticks(rotation=-80)
plt.ylabel("Yearly Income ($)", fontsize=12)
plt.xlabel("Major", fontsize=12)
plt.title("Distribution of the Yearly Income for Different Fields of Study", fontsize=15)
plt.show()

As expected, the distributions are very similar, but perhaps a model can learn a bit from this. I expect more from the next feature.

### 2.6 - Profession

In [None]:
print(question_names['Q6'])

Well, your profession has to determine your salary. At least a little.

In [None]:
plt.figure(figsize=(15,10))
sns.countplot(x=df_choice['Q6'], order=df_choice['Q6'].value_counts().index)
plt.xticks(rotation=-70)
plt.xlabel("Profession", fontsize=12)
plt.ylabel("Count", fontsize=12)
plt.title("What Are Kagglers' Jobs ?", fontsize=15)
plt.show()

Again, it is going to be hard to learn from this feature, because most jobs are similar. However, students and research assistants are expected to earn less.

In [None]:
plt.figure(figsize=(15,10))
sns.boxplot(x='Q6', y='target', data=df_choice, order=df_choice['Q6'].value_counts().index, showfliers=False)
plt.xticks(rotation=-70)
plt.ylabel("Yearly Income ($)", fontsize=12)
plt.xlabel("Profession", fontsize=12)
plt.title("Distribution of the Yearly Income for Different Types of Jobs", fontsize=15)
plt.show()

Top earning jobs are Chief Officer, Manager and Principal Investigator. Which I believe should be jobs held by older people (?).

#### Checking the wage gap in the US in the same profession

In [None]:
plt.figure(figsize=(15,10))
sns.violinplot(x='Q6', y='target', hue='Q1', data=df, split=True, order=df['Q6'].value_counts().index)
plt.xticks(rotation=-70)
plt.ylabel("Yearly Income ($)", fontsize=12)
plt.xlabel("Profession", fontsize=12)
plt.title("Illustration of the Gender Wage Gap for Different Professions in the USA", fontsize=15)
plt.show()

Within the same job, the pay gap seems to be smaller, and it is harder to draw a direct conclusion. Overall, men seem to earn more, but women do better as Chief Officer, and earn approximately as much as men in most jobs.


This and the previous graph lead us to think that men tend to occupy higher earning jobs. Let us verify this.

In [None]:
# Work on a copy so the USA subset can be modified safely
df = df.copy()

# Mean salary of each job
means = df.groupby(['Q6'])['target'].mean().sort_values(ascending=False)

# Proportion of women in each job (Female = 1, Male = 0)
d = {"Female": 1, "Male": 0}
df['Q1'] = df['Q1'].map(d)
women_perc = df.groupby(['Q6'])['Q1'].mean()

# Join both statistics into one table indexed by job
df_job = pd.concat([means, women_perc], axis=1)

In [None]:
plt.figure(figsize=(15,10))
sns.barplot(x=df_job.index, y='Q1', data=df_job, order=means.index)
plt.xticks(rotation=-70)
plt.ylabel("Women proportion", fontsize=12)
plt.xlabel("Profession", fontsize=12)
plt.title("Proportion of Women in Jobs, Sorted by Average Salary in the USA", fontsize=15)
plt.show()

In [None]:
# Linear regression
z = np.polyfit(df_job['Q1'], df_job['target'], 1)
p = np.poly1d(z)

plt.figure(figsize=(15,10))
plt.scatter(df_job['Q1'], df_job['target'], label='Samples')
plt.plot(np.arange(0, 0.6, 0.01), p(np.arange(0, 0.6, 0.01)), linestyle=':', label='Trend')
plt.ylabel("Average Yearly Income of the Job ($)", fontsize=12)
plt.xlabel("Proportion of Women in the Job", fontsize=12)
plt.title("In the USA, the Higher Earning the Job, the fewer Women", fontsize=15)
plt.legend()
plt.show()

This *(kind of)* shows what I wanted to. There are still two features left, so let us not stop here.
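
To put a number on "kind of", the slope of the fitted line can be paired with a Pearson correlation coefficient. The sketch below uses made-up job-level values, so the figures it prints are not survey results. On the notebook's data the two arrays would be `df_job['Q1']` and `df_job['target']`.

```python
import numpy as np

# Made-up job-level values, for illustration only.
women_share = np.array([0.10, 0.20, 0.30, 0.40])
mean_income = np.array([120000, 100000, 90000, 70000])

# Slope of the fitted line (income change per unit of women share) ...
slope, intercept = np.polyfit(women_share, mean_income, 1)
# ... and Pearson correlation, which says how tight the trend is.
r = np.corrcoef(women_share, mean_income)[0, 1]

print(f"slope = {slope:.0f}, r = {r:.2f}")
```

With only a handful of jobs, a correlation like this remains a rough indication rather than a proof.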

### 2.7 - Industry

In [None]:
print(question_names['Q7'])

There surely have to be industries that pay better, because more money is involved.

In [None]:
plt.figure(figsize=(15,10))
sns.countplot(x=df_choice['Q7'], order=df_choice['Q7'].value_counts().index)
plt.xticks(rotation=-70)
plt.xlabel("Industry of Employer", fontsize=12)
plt.ylabel("Count", fontsize=12)
plt.title("Where do people work ?", fontsize=15)
plt.show()

In [None]:
plt.figure(figsize=(15,10))
sns.boxplot(x='Q7', y='target', data=df_choice, order=df_choice['Q7'].value_counts().index, showfliers=False)
plt.xticks(rotation=-70)
plt.ylabel("Yearly Income ($)", fontsize=12)
plt.xlabel("Industry", fontsize=12)
plt.title("Distribution of the Yearly Income for Different Industries of Employment", fontsize=15)
plt.show()

The three categories that stand out are students *(again, well they're not paid)*, Non-profit/Services and Academics/Education. Nothing surprising, as those last two are largely paid by the state.

I was expecting more contrasted results, but a model can definitely learn something from this.

### 2.8 - Experience

In [None]:
question_names['Q8']

Experience is correlated with age, but I believe it is a more precise predictor of salary.

In [None]:
order = ['0-1', '1-2', '2-3',  '3-4', '4-5', '5-10', '10-15', '15-20', '20-25', '25-30', '30 +']

plt.figure(figsize=(15,10))
sns.countplot(x=df_choice['Q8'], order=order)
plt.xlabel("Years of Experience", fontsize=12)
plt.ylabel("Count", fontsize=12)
plt.title("How Experienced are Kagglers in their Current Jobs ?", fontsize=15)
plt.show()

In [None]:
plt.figure(figsize=(15,10))
sns.violinplot(x='Q8', y='target', data=df_choice, order=order)
plt.xticks(rotation=-70)
plt.ylabel("Yearly Income ($)", fontsize=12)
plt.xlabel("Years of Experience", fontsize=12)
plt.title("Distribution of the Yearly Income as a Function of the Years of Experience", fontsize=15)
plt.show()

Once again, logical results. Salary tends to increase with your experience in the job.
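
One way to check how strictly this trend holds is to compare the median income of consecutive experience brackets. The sketch below runs on made-up rows, where the last bracket deliberately dips. On the notebook's data the same idea would group `df_choice` by `Q8` and reindex with `order`.

```python
import pandas as pd

# Made-up rows, for illustration only.
toy = pd.DataFrame({
    'experience': ['0-1', '0-1', '1-2', '1-2', '2-3', '2-3'],
    'income': [20000, 30000, 40000, 50000, 35000, 45000],
})
toy_order = ['0-1', '1-2', '2-3']

# Median income per experience bracket, in bracket order.
medians = toy.groupby('experience')['income'].median().reindex(toy_order)
print(medians)
print("Increasing at every step:", medians.is_monotonic_increasing)
```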

### I am going to stop here for the features. I might add more in the future but these should be enough to get some results.

## 3 - Model

### 3.1 - Input data

In [None]:
features = ["Q1", "Q2", "Q3", "Q4", "Q5", "Q6", "Q7", "Q8"]
target = ["target"]

df = df_choice[features + target]

df = df.fillna('?')

We need our classifier to understand our data, so we encode it as categories. However, age and experience have a natural order that we need to keep, so we use the midpoint of each interval as the feature (or the lower bound for open-ended intervals).

In [None]:
dic_age = {'30-34': 32, '22-24': 23, '35-39': 37, '18-21': 19.5, '40-44': 42, '25-29': 27, '55-59': 57, '60-69': 64.5, '45-49': 47, '50-54': 52, '70-79': 74.5, '80+': 80}
dic_exp = {'5-10': 7.5, '0-1': 0.5, '10-15': 12.5, '3-4': 3.5, '1-2': 1.5, '2-3': 2.5, '15-20': 17.5, '4-5': 4.5, '25-30': 27.5, '20-25': 22.5, '30 +': 30, '?': 0}

df['Q2'] = df['Q2'].apply(lambda x: dic_age[x])
df['Q8'] = df['Q8'].apply(lambda x: dic_exp[x])

for q in ["Q1", "Q3", "Q4", "Q5", "Q6", "Q7"]:
    df[q] = df[q].astype('category')
    
cat_columns = df.select_dtypes(['category']).columns
df[cat_columns] = df[cat_columns].apply(lambda x: x.cat.codes)
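
An alternative to midpoints, sketched here and not used in this kernel, is an ordered categorical: the codes then follow the order we give instead of the alphabetical order that `astype('category')` would produce. The answers in `toy_age` are made up.

```python
import pandas as pd

# Toy answers; the bracket labels follow the survey's format.
toy_age = pd.Series(['25-29', '18-21', '30-34', '22-24'])
age_order = ['18-21', '22-24', '25-29', '30-34']

# An ordered categorical assigns codes in the order we specify,
# whereas astype('category') sorts the labels alphabetically.
ordered = pd.Categorical(toy_age, categories=age_order, ordered=True)
print(list(ordered.codes))
```

The midpoints keep the width of each bracket, which plain rank codes lose, so both choices are defensible for a tree-based model.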

We also rename our columns, for easier feature understanding.

In [None]:
df = df.rename(index=str, columns={"Q1": 'Gender', "Q2": 'Age', "Q3": 'Country', "Q4": 'Education', "Q5": 'Major', "Q6": 'Profession', "Q7": 'Industry', "Q8": 'Experience'})

### 3.2 - Target
Let us say we want to predict the income in thousands of USD. I will tackle the problem as a classification task.
Indeed, a regression would be inaccurate on lower salaries, and I believe it is more important to distinguish between earning 40k and 60k than between earning 200k and 250k.

Therefore I make 6 categories : 
- less than 10k
- between 10k and 30k
- between 30k and 50k
- between 50k and 80k
- between 80k and 125k
- more than 125k

In [None]:
classes = ['less than 10k', 'between 10k and 30k', 'between 30k and 50k', 'between 50k and 80k', 'between 80k and 125k', 'more than 125k']

In [None]:
dic_target = {5000: 0,  
              15000: 1, 25000: 1, 
              35000: 2, 45000: 2, 
              55000: 3,  65000: 3,  75000: 3,
              85000: 4, 95000: 4, 112500: 4,
              137500: 5,  175000: 5, 225000: 5, 275000: 5, 350000: 5,  450000: 5
             }

df['target'] = df['target'].apply(lambda x: dic_target[x])
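
The same grouping can be expressed with bin edges instead of a hand-written dictionary. This is a sketch using `pd.cut` on a few of the midpoints defined earlier. The name `edges` is mine, and the bins are right-closed, which is why 112,500 still lands in the 80k to 125k class.

```python
import numpy as np
import pandas as pd

# A few of the bracket midpoints built earlier in this notebook.
midpoints = pd.Series([5000, 15000, 45000, 75000, 112500, 137500, 450000])

# Right-closed bins: (0, 10k], (10k, 30k], ..., (125k, inf).
edges = [0, 10000, 30000, 50000, 80000, 125000, np.inf]
class_ids = pd.cut(midpoints, bins=edges, labels=False)
print(class_ids.tolist())
```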

In [None]:
plt.figure(figsize=(15,10))
sns.countplot(x=df['target'])
plt.xticks(range(len(classes)), classes)
plt.ylabel("Count", fontsize=12)
plt.xlabel("Yearly income ($)", fontsize=12)
plt.title("Distribution of our New Classes", fontsize=15)
plt.show()

In [None]:
df.head()

### 3.3 - Train / Test split

In [None]:
df_train, df_test = train_test_split(df, test_size=0.2)

In [None]:
print(f"Training on {df_train.shape[0]} samples.")

### 3.4 Gradient Boosting

Note that I did not bother tweaking the parameters. The goal here is not to get the best results but to check the importance of each feature.

In [None]:
features = ['Gender', 'Age', 'Country', 'Education', 'Major', 'Profession', 'Industry', 'Experience']
      
def run_lgb(df_train, df_test):
    params = {"objective" : "multiclass",
              "num_class": 6,
              "metric" : "multi_error",
              "num_leaves" : 30,
              "min_child_weight" : 50,
              "learning_rate" : 0.05,
              "bagging_fraction" : 0.7,
              "feature_fraction" : 0.7,
              "bagging_seed" : 420,
              "verbosity" : -1
             }
    
    lg_train = lgb.Dataset(df_train[features], label=(df_train["target"].values))
    lg_test = lgb.Dataset(df_test[features], label=(df_test["target"].values))
    # Recent LightGBM versions take early stopping and logging as callbacks
    model = lgb.train(params, lg_train, 1000, valid_sets=[lg_test],
                      callbacks=[lgb.early_stopping(100), lgb.log_evaluation(100)])
    
    return model

In [None]:
model = run_lgb(df_train, df_test)

In [None]:
pred_train = model.predict(df_train[features], num_iteration=model.best_iteration)
pred_test = model.predict(df_test[features], num_iteration=model.best_iteration)
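
With the multiclass objective, `predict` returns one row per sample with one probability per class, so the predicted class is the column with the highest value. Here is a minimal sketch on a made-up probability matrix. On the notebook's data, `toy_proba` would be `pred_test` and `toy_labels` would be `df_test['target'].values`.

```python
import numpy as np

# Made-up class probabilities for 4 samples and 3 classes.
toy_proba = np.array([[0.7, 0.2, 0.1],
                      [0.1, 0.6, 0.3],
                      [0.2, 0.3, 0.5],
                      [0.5, 0.4, 0.1]])
toy_labels = np.array([0, 1, 2, 1])

# The predicted class is the column with the highest probability.
toy_pred = np.argmax(toy_proba, axis=1)
accuracy = (toy_pred == toy_labels).mean()
print(toy_pred, accuracy)
```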

### 3.5 Results
#### Feature importance

In [None]:
fig, ax = plt.subplots(figsize=(12,10))
lgb.plot_importance(model, max_num_features=50, height=0.8, ax=ax)
ax.grid(False)
plt.ylabel('Feature', size=12)
plt.xlabel('Importance', size=12)
plt.title("Importance of the Features in our LightGBM Model", fontsize=15)
plt.show()

**Gender is by far the least important feature!**

This does not mean that the gender pay gap does not exist, but it does suggest that gender is not what the model relies on most when predicting the salary of a Kaggler.

The profession is the most important parameter, and we have shown earlier that higher earning jobs had a higher proportion of men. 



We also notice that studies (education & major) have little influence on earnings. **It is what you do more than what you did that will determine your income.**


#### Confusion Matrices
We check how well the model's predictions match the true labels.

In [None]:
def plot_confusion_matrix(cm, classes, title='Confusion matrix', normalize=False, cmap=plt.cm.Blues):
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    fmt = '.2f' if normalize else 'd'

    fig, ax = plt.subplots(figsize=(15, 10))

    # Draw the matrix once on the figure created above
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title, size=15)
    plt.colorbar()
    plt.grid(False)
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    thresh = (cm.max()+cm.min()) / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label', size=12)
    plt.xlabel('Predicted label', size=12)

In [None]:
# confusion_matrix expects (y_true, y_pred): rows are true labels, columns are predictions
conf_mat_train = confusion_matrix(df_train['target'].values, np.argmax(pred_train, axis=1))

plot_confusion_matrix(conf_mat_train, classes, title='Confusion matrix on train data', normalize=True)

So far, so good !

In [None]:
# confusion_matrix expects (y_true, y_pred): rows are true labels, columns are predictions
conf_mat_test = confusion_matrix(df_test['target'].values, np.argmax(pred_test, axis=1))

plot_confusion_matrix(conf_mat_test, classes, title='Confusion matrix on test data', normalize=True)

As expected, low-paid and high-paid data scientists are the easiest to detect, mostly because the high-income class I chose is very wide and because students are easy to identify.
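
"Easiest to detect" can be read off the matrix as per-class recall: the diagonal divided by the row sums, when rows are the true classes. The sketch below builds a small matrix by hand on made-up labels. `confusion_counts` is a hypothetical helper written for this illustration, not the scikit-learn function.

```python
import numpy as np


def confusion_counts(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm


# Made-up labels, for illustration only.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]

cm = confusion_counts(y_true, y_pred, 3)
# Per-class recall: correct predictions divided by the true class size.
recall = cm.diagonal() / cm.sum(axis=1)
print(cm)
print(recall)
```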

#### *Thanks for reading!*
This took me quite a while to do, hope you enjoyed and learned some stuff!