# What factors are driving pay discrimination between men and women in your organization?

In this case you will learn how to run linear regressions in Python, using the `statsmodels` library.

In [2]:
!pip install -r requirements.txt

In [3]:
import pandas as pd
import statsmodels.formula.api as sm

## Loading our data

As always, let's start by reading in our dataset and inspecting a few rows:

In [None]:
df = pd.read_csv('data/company_dataset.csv')

In [None]:
df.head()

## Simple linear regressions

During lecture we ran this model, which we will now replicate using code:

$$ PAY{\_}YEARLY = \beta_0 + \beta_1 AGE{\_}YEARS + \varepsilon $$

The first step is to transform the mathematical formula into a `statsmodels` formula. The syntax is as follows:

~~~plain
output_variable ~ input_variable
~~~

Thus, we will define our formula as:

In [None]:
formula1 = 'pay_yearly ~ age_years'
formula1

We used only the column names. There is no need to add the name of the DataFrame (`df`) just yet.

After this, we create a `statsmodels` model with the [**`ols()`**](https://www.statsmodels.org/stable/generated/statsmodels.formula.api.ols.html) function. OLS stands for "Ordinary Least Squares", the estimation method that chooses the coefficients that minimize the sum of squared residuals (which we saw during lecture). This is where we tell the library which DataFrame the variables belong to:

In [None]:
model1 = sm.ols(formula = formula1, data = df)

Here we passed two arguments - the formula (in our case `formula = formula1`) and the DataFrame (`data = df`).
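As a minimal sketch of what the model object holds at this stage, the cell below builds the same kind of model on a tiny made-up DataFrame (`toy` and its values are hypothetical, not taken from the company dataset) and inspects how the formula was parsed:

```python
import pandas as pd
import statsmodels.formula.api as sm

# Hypothetical stand-in for the company dataset (values are made up)
toy = pd.DataFrame({
    'age_years': [25, 32, 41, 50, 58],
    'pay_yearly': [30000, 36000, 45000, 52000, 61000],
})

toy_model = sm.ols(formula='pay_yearly ~ age_years', data=toy)

# The formula has already been parsed into an output vector and an input matrix
print(toy_model.endog_names)  # name of the output variable
print(toy_model.exog_names)   # names of the input columns, including the intercept
print(toy_model.exog.shape)   # (number of rows, number of input columns)
```

Note that an intercept column is added automatically, which corresponds to $\beta_0$ in the mathematical formula.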

The model does not do anything yet. It is only a computer representation of our model: it knows the formula and the data, but it has not been fitted yet, and therefore cannot produce coefficients or any other meaningful outputs. To actually fit the model to the data, we must call the `.fit()` method like this:

In [None]:
fitted1 = model1.fit()

However, if you now simply evaluate the `fitted1` variable or try to print it, you will only see the object's default representation rather than the regression results:

In [None]:
fitted1

In [None]:
print(fitted1)

To see the actual regression output, call the [**`.summary()`**](https://medium.com/swlh/interpreting-linear-regression-through-statsmodels-summary-4796d359035a) method and print the result:

In [None]:
print(fitted1.summary())

Now you can see your regression output! To summarize:

1. You first define a `statsmodels` formula (`formula1 = 'pay_yearly ~ age_years'`)
2. Then you pass the formula to `ols()` (`model1 = sm.ols(formula = formula1, data = df)`)
3. After that, you fit the model with `.fit()` (`fitted1 = model1.fit()`)
4. Finally, you print the output with the help of the `.summary()` method (`print(fitted1.summary())`)

In just one cell:

In [None]:
formula1 = 'pay_yearly ~ age_years'
model1 = sm.ols(formula = formula1, data = df)
fitted1 = model1.fit()
print(fitted1.summary())
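If you want to convince yourself that the four steps do what they claim, one option is to run them on simulated data where the true coefficients are known. The sketch below is only an illustration: the DataFrame `sim`, the true intercept of 20000, and the true slope of 800 are all made-up assumptions, not values from the company dataset. It also compares the fitted coefficients with the textbook closed-form solution for a simple regression.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as sm

# Simulated data with known coefficients (all numbers are made up)
rng = np.random.default_rng(0)
n = 500
age = rng.uniform(20, 65, size=n)
pay = 20000 + 800 * age + rng.normal(0, 5000, size=n)
sim = pd.DataFrame({'age_years': age, 'pay_yearly': pay})

# Same four steps as above, chained together
sim_fit = sm.ols(formula='pay_yearly ~ age_years', data=sim).fit()

# Closed-form least-squares solution for one input variable
slope = np.cov(sim['age_years'], sim['pay_yearly'])[0, 1] / sim['age_years'].var()
intercept = sim['pay_yearly'].mean() - slope * sim['age_years'].mean()

print(sim_fit.params)     # coefficients estimated by statsmodels
print(intercept, slope)   # the same numbers, computed by hand
```

The estimated slope should land close to the simulated value of 800, but not exactly on it, because of the random noise term.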

You can also retrieve only a subset of the output:

* `fitted1.params` gives you the coefficients
* `fitted1.pvalues` gives you the $p$-values
* `fitted1.rsquared` gives you the $R^2$

In [None]:
fitted1.params

In [None]:
fitted1.pvalues

In [None]:
fitted1.rsquared

For a full list of the attributes that you can retrieve, run `dir(fitted1)` (the [**`dir()`**](https://www.geeksforgeeks.org/python-dir-function) function in Python lets you inspect all the attributes of an object).
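Beyond the three attributes above, fitted results also offer methods such as `.conf_int()` (confidence intervals for the coefficients) and `.predict()` (predictions for new rows). The following is a small sketch on simulated data; the DataFrame `sim`, the new ages 30 and 45, and all numeric values are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as sm

# Simulated stand-in data (all numbers are made up)
rng = np.random.default_rng(0)
age = rng.uniform(20, 65, size=200)
sim = pd.DataFrame({
    'age_years': age,
    'pay_yearly': 20000 + 800 * age + rng.normal(0, 5000, size=200),
})
res = sm.ols(formula='pay_yearly ~ age_years', data=sim).fit()

# .params is a pandas Series, so a single coefficient can be looked up by name
age_coef = res.params['age_years']

# 95% confidence intervals: one row per coefficient, columns 0 (lower) and 1 (upper)
ci = res.conf_int()

# Predictions for new observations: pass a DataFrame with the input columns
new_people = pd.DataFrame({'age_years': [30, 45]})
predictions = res.predict(new_people)

print(age_coef)
print(ci)
print(predictions)
```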

## Multiple linear regression

The steps to run a regression with several input variables in `statsmodels` are exactly the same as in the simple regression case, with a small change to the formula.

This is one of the models we fitted during lecture:

$$ PAY{\_}YEARLY = \beta_0 + \beta_1 AGE{\_}YEARS + \beta_2 MALE{\_}FEMALE  + \varepsilon $$

The corresponding `statsmodels` formula would be:

In [None]:
formula2 = 'pay_yearly ~ age_years + male_female'

That is, whenever you need to include a new input variable, you attach it to the formula using the `+` symbol. If you need more than two input variables, just keep appending them with `+`:

~~~plain
output_variable ~ input_variable_1 + input_variable_2 + input_variable_3 + ...
~~~
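Because a formula is just a Python string, you can also assemble it from a list of column names, which can be handy when there are many input variables. This is a sketch under stated assumptions: the column names reuse the placeholders from the syntax above, and the DataFrame `demo` contains random numbers generated only so that the formula can be fitted.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as sm

# Hypothetical column names, mirroring the placeholders in the syntax above
inputs = ['input_variable_1', 'input_variable_2', 'input_variable_3']
formula_many = 'output_variable ~ ' + ' + '.join(inputs)
print(formula_many)

# Made-up data, only so that the formula can be fitted
rng = np.random.default_rng(1)
demo = pd.DataFrame(rng.normal(size=(50, 3)), columns=inputs)
demo['output_variable'] = rng.normal(size=50)

fitted_many = sm.ols(formula=formula_many, data=demo).fit()
print(fitted_many.params)  # one intercept plus one coefficient per input
```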

### Exercise 1

#### 1.1

Using `formula2` above, create the variables `model2` and `fitted2` and print the output of your linear model.

**Answer.**

In [None]:
model2 = sm.ols(formula=formula2, data = df)


-------

#### 1.2

What is the $R^2$? (Access it directly using the fitted model's attributes.)

**Answer.**

-------

### Exercise 2

Code the following model and print its output (call it `model3`):

$$
PAY{\_}YEARLY = \beta_0 + \beta_1 AGE{\_}YEARS + \beta_2 {MALE{\_}FEMALE} + \beta_3 EDUCATION + \varepsilon
$$

**Answer.**

-------

Adding categorical variables to the model is very simple. You do not need to do anything special: when a column contains text, `statsmodels` automatically converts it into dummy variables, so you just include it as usual:

In [None]:
formula4 = 'pay_yearly ~ job_title'
model4 = sm.ols(formula = formula4, data = df)
fitted4 = model4.fit()
print(fitted4.summary())
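To see how the coefficients of a categorical variable can be read, here is a minimal sketch on a tiny made-up table (the job titles and salaries in `toy` are hypothetical). With a single categorical input, the intercept equals the mean pay of the baseline category, and each remaining coefficient is the difference between that category's mean and the baseline mean.

```python
import pandas as pd
import statsmodels.formula.api as sm

# Hypothetical data: three job titles, two employees each (values are made up)
toy = pd.DataFrame({
    'job_title': ['Analyst', 'Analyst', 'Engineer', 'Engineer', 'Manager', 'Manager'],
    'pay_yearly': [40000, 44000, 60000, 64000, 80000, 90000],
})

cat_fit = sm.ols(formula='pay_yearly ~ job_title', data=toy).fit()

# One coefficient per category except the baseline (the first one alphabetically)
print(cat_fit.params)

# Group means, for comparison with the coefficients
group_means = toy.groupby('job_title')['pay_yearly'].mean()
print(group_means)
```

If a categorical column is stored as numbers (for example, department codes), wrapping it as `C(column_name)` inside the formula asks for the same dummy coding.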

### Exercise 3

Code the following model and print only the *coefficients* (call it `model5`):

$$
PAY{\_}YEARLY = \beta_0 + \beta_1 AGE{\_}YEARS + \beta_2 {MALE{\_}FEMALE} + \beta_3 EDUCATION + \beta_4 JOB{\_}TITLE + \varepsilon
$$

**Answer.**

-------