# Your first predictions - Predicting Salaries 📈

--------------

1. Run the cell below to `import` some Python libraries - these will be our tools for working with data 📊

In [4]:
import pandas as pd
import numpy as np
import seaborn as sns

2. Run the cell below to read the `CSV` file into a `DataFrame` - a format that is great for data analysis inside Python!

*Note: the dataset is cleaned and federated for learning purposes*

In [5]:
salaries = pd.read_csv('../../data/salaries_day3.csv')
salaries

Unnamed: 0,Gender,Age,Department,Department_code,Years_exp,Tenure (months),Gross
0,0,25,Tech,7,7.5,7,74922
1,1,26,Operations,3,8.0,6,44375
2,0,24,Operations,3,7.0,8,82263
3,0,26,Operations,3,8.0,6,44375
4,0,29,Engineering,0,9.5,25,235405
...,...,...,...,...,...,...,...
1797,0,29,Other,4,9.5,34,88934
1798,0,27,Engineering,0,8.5,33,133224
1799,0,29,Operations,3,9.5,15,72547
1800,0,47,Other,4,18.5,30,227176


--------------

## We can get a lot of insight without ML! 🤔

### 2. Your turn! 🚀

Let's start by **understanding the data we have** - how big the dataset is, what information (columns) it contains, and so on:

**💡 Tip:** remember to check the slides for the right methods ;)

In [None]:
# your code here

<details>
    <summary>Reveal Solution 🙈</summary>

<p> 
<pre>
salaries.shape # to see how many rows, columns
salaries.dtypes # to see available columns and their data type
round(salaries.describe()) # to see a readable summary about the dataset, like averages, minimums and maximums
</pre>
</details>
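To get a feel for what these three calls return, here is a minimal sketch on a tiny stand-in table. The `toy` DataFrame and its values are made up for illustration and are not taken from `salaries_day3.csv`.

```python
import pandas as pd

# Hypothetical toy data, NOT the real salaries dataset
toy = pd.DataFrame({
    "Department": ["Tech", "Operations", "Engineering", "Other"],
    "Years_exp": [7.5, 8.0, 9.5, 18.5],
    "Gross": [70000, 45000, 230000, 220000],
})

n_rows, n_cols = toy.shape        # a (rows, columns) tuple
column_types = toy.dtypes         # one data type per column
summary = round(toy.describe())   # count, mean, std, min, quartiles, max (numeric columns only)

print(n_rows, n_cols)
print(column_types)
print(summary)
```

Note that `describe()` skips the text column here, which is why `Department` does not show up in the summary.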

Now try to **separate only some columns** - say we only want to see departments, or departments and salaries:

In [None]:
# your code here

<details>
    <summary>Reveal Solution 🙈</summary>

<p> 
<pre>
salaries["Department"] # to see one column
salaries[["Department", "Gross"]] # double bracket if we want to see multiple columns
</pre>
</details>
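The difference between single and double brackets is easier to see by checking the type of what comes back. A small sketch, again on made-up data (the `toy` table is hypothetical):

```python
import pandas as pd

# Hypothetical toy data, NOT the real salaries dataset
toy = pd.DataFrame({
    "Department": ["Tech", "Operations", "Engineering"],
    "Years_exp": [7.5, 8.0, 9.5],
    "Gross": [70000, 45000, 230000],
})

one_column = toy["Department"]               # single brackets -> a pandas Series
two_columns = toy[["Department", "Gross"]]   # double brackets -> a smaller DataFrame

print(type(one_column).__name__)
print(type(two_columns).__name__)
print(two_columns.shape)
```

Double brackets also work with a single name, for example `toy[["Gross"]]`, when you want a one-column `DataFrame` instead of a `Series`.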

-------

### 3. Your turn - Now let's do some **visualization** 📊. 


Let's follow some basic intuition - **does years of experience affect gross salary❓**

Let's use a [Seaborn Scatterplot](https://seaborn.pydata.org/generated/seaborn.scatterplot.html) - a method inside the Seaborn library (which we imported above and shortened to `sns`) that gives us a graph with data points as dots with `x` and `y` values.

In [None]:
# your code here

<details>
    <summary>Reveal Solution 🙈</summary>

<p> 
<pre>
sns.scatterplot(data=salaries, x="Years_exp", y="Gross")
</pre>
</details>

Remember one of the questions from the slides - **do women and men earn equally in this example❓**

*Note: 'Male' is coded as 0, 'Female' - as 1*

In [None]:
# your code here

<details>
    <summary>Reveal Solution 🙈</summary>

<p> 
<pre>
sns.scatterplot(data=salaries, x="Years_exp", y="Gross", hue="Gender")
</pre>
</details>
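A plot gives the visual answer; a number can back it up. One possible companion, sketched here on made-up values (the `toy` table is hypothetical and says nothing about the real dataset), is to average `Gross` per gender with `groupby`:

```python
import pandas as pd

# Hypothetical toy data, NOT the real salaries dataset
# Gender coding follows the note above: 0 = 'Male', 1 = 'Female'
toy = pd.DataFrame({
    "Gender": [0, 1, 0, 1, 0, 1],
    "Years_exp": [3.0, 3.0, 6.0, 6.0, 9.0, 9.0],
    "Gross": [50000, 48000, 70000, 66000, 90000, 86000],
})

# Average gross salary for each gender code
mean_by_gender = toy.groupby("Gender")["Gross"].mean()
gap = mean_by_gender.loc[0] - mean_by_gender.loc[1]

print(mean_by_gender)
print(gap)
```

Keep in mind that a plain average ignores experience and department, so it is a starting point and not a full answer.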

Let's also look at **how many** data points of each kind we have - how many women and men? How many in each department? Seaborn's `countplot` is here to help with that.

**💡 Tip:** you can always check the `.dtypes` or `.columns` attributes of your dataset to see what columns you have.

In [None]:
# your code here

<details>
    <summary>Reveal Solution 🙈</summary>

<p> 
<pre>
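# run these one at a time (separate cells) - in a single cell both plots are drawn on the same axes
sns.countplot(data=salaries, x="Gender") # to see how many of each gender we have in the dataset
sns.countplot(data=salaries, x="Department") # to see how many of each department we have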
</pre>
</details>

**Bonus question:** can you visualize **how many men and women there are per department**? 🤔 A `hue` might help...

In [None]:
# your code here


<details>
    <summary>Reveal Solution 🙈</summary>

<p> 
<pre>
sns.countplot(data=salaries, x="Department", hue="Gender")
</pre>
</details>
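If you want the numbers behind the bars, `pd.crosstab` gives the same counts as a table. This is a sketch on made-up rows (the `toy` table is hypothetical), not a replacement for the plot:

```python
import pandas as pd

# Hypothetical toy data, NOT the real salaries dataset
toy = pd.DataFrame({
    "Department": ["Tech", "Tech", "Operations", "Operations", "Operations"],
    "Gender": [0, 1, 0, 0, 1],
})

# Rows are departments, columns are gender codes, cells are counts
counts = pd.crosstab(toy["Department"], toy["Gender"])
print(counts)
```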

--------------

#### 🥈*A good data expert knows all the most complex models.* 
### 🥇*A great data expert knows when results can be achieved without them.* 

--------------

## Your first model - Linear Regression 📈

**1.** First, let's create what will be our...
  * Features and target
  * Inputs and output
  * X and Y

In [None]:
# your code here


<details>
    <summary>Reveal Solution 🙈</summary>

<p> 
<pre>
features = salaries.drop(["Gross", "Department"], axis="columns") # dropping the Department column because it's text
target = salaries["Gross"]
</pre>
</details>
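It helps to see the shapes of the two pieces: features stay a two-dimensional table, the target is a single column. A minimal sketch with made-up rows (the `toy` table and the `toy_` names are hypothetical):

```python
import pandas as pd

# Hypothetical toy data, NOT the real salaries dataset
toy = pd.DataFrame({
    "Gender": [0, 1, 0],
    "Department": ["Tech", "Operations", "Engineering"],
    "Department_code": [7, 3, 0],
    "Years_exp": [7.5, 8.0, 9.5],
    "Gross": [70000, 45000, 230000],
})

# Everything except the answer (Gross) and the text column
toy_features = toy.drop(["Gross", "Department"], axis="columns")
# The column we want to predict
toy_target = toy["Gross"]

print(toy_features.shape)   # (rows, feature columns)
print(toy_target.shape)     # (rows,)
```

`drop` returns a new table and leaves `toy` untouched, so you can re-run the cell with a different list of columns.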

Feel free to check what is in your `features` and `target` below:

In [None]:
# your code here

**2.** Time to **import** the Linear Regression model

Python libraries like [Scikit-learn](https://scikit-learn.org/0.21/modules/classes.html) make it super easy for people getting into Data Science and ML to experiment.

The code is already in the library, it's just about **calling the right methods!** 🛠

In [None]:
# your code here


<details>
    <summary>Reveal Solution 🙈</summary>

<p> 
<pre>
from sklearn.linear_model import LinearRegression
</pre>
</details>

Now to **initialize** the model

In [None]:
# your code here


<details>
    <summary>Reveal Solution 🙈</summary>

<p> 
<pre>
model = LinearRegression()
</pre>
</details>

**3.** We **train** the model. 

This is the process where the Linear Regression model looks for a line that best fits all the points in the dataset. This is the part where the computer is hard at work **learning**! 🤖

In [None]:
# your code here


<details>
    <summary>Reveal Solution 🙈</summary>

<p> 
<pre>
model.fit(features, target)
</pre>
</details>
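To see what "finding the best line" means, here is a sketch where the answer is known in advance: the made-up salaries below follow exactly `20000 + 3000 * years`, so a fitted model should recover those two numbers. The data and the `toy_model` name are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data that lies exactly on a line
years = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])   # 2-D: one row per person
gross = 20000 + 3000 * years.ravel()                    # 1-D: one salary per person

toy_model = LinearRegression()
toy_model.fit(years, gross)

slope = toy_model.coef_[0]          # salary change per extra year
intercept = toy_model.intercept_    # salary at zero years

print(slope, intercept)
```

Real data never sits perfectly on a line, so on `salaries` the model settles for the line with the smallest overall squared error.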

**4.** We **score** the model

Models can have different default scoring metrics. Linear Regression by default uses `R-squared` - a metric that shows how much of the variation in the target (Gross salary) can be explained by the features (Age, Tenure, Gender etc.).

In [None]:
# your code here


<details>
    <summary>Reveal Solution 🙈</summary>

<p> 
<pre>
model.score(features, target)
</pre>
</details>

⚠️ **Careful not to confuse this with accuracy**. The number above shows that **the inputs we have explain around 40-45% of the variation in salary**, which is decent considering we did this in 10 minutes!
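R-squared can also be computed by hand as `1 - (squared error of the model) / (squared error of always guessing the average)`. The sketch below, on made-up numbers, checks that this matches what `.score` returns; the data and variable names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data, NOT the real salaries dataset
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([30000.0, 31000.0, 45000.0, 41000.0, 52000.0])

toy_model = LinearRegression().fit(X, y)
predictions = toy_model.predict(X)

ss_res = np.sum((y - predictions) ** 2)   # error left over after using the model
ss_tot = np.sum((y - y.mean()) ** 2)      # error of always predicting the mean
r2_manual = 1 - ss_res / ss_tot
r2_sklearn = toy_model.score(X, y)

print(r2_manual, r2_sklearn)
```

A value of 1 would mean the features explain all of the variation; a value near 0 means the model does little better than guessing the average.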

**5.** Let's **predict** the salary of a new hire 🔮

*Note: here is a reminder of the columns in the table:* `['Gender', 'Age', 'Department_code', 'Years_exp', 'Tenure (months)']`

In [None]:
# here's a freebie! You can change the numbers below to change the info of your hire ;)
hire = [[1, 29, 2, 5.2, 10]]

# your code here


<details>
    <summary>Reveal Solution 🙈</summary>

<p> 
<pre>
model.predict(hire)
</pre>
</details>
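When a model was trained on a `DataFrame`, recent scikit-learn versions may warn about missing feature names if you predict from a plain list. One way around that, sketched here under the assumption that the column order matches the reminder above, is to wrap the hire in a `DataFrame` with the same column names. The training rows and the `toy_` names below are made up.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

columns = ["Gender", "Age", "Department_code", "Years_exp", "Tenure (months)"]

# Hypothetical toy training data, NOT the real salaries dataset
toy_features = pd.DataFrame([
    [0, 25, 7, 7.5, 7],
    [1, 26, 3, 8.0, 6],
    [0, 24, 3, 7.0, 8],
    [0, 29, 0, 9.5, 25],
    [1, 31, 4, 10.0, 30],
    [0, 47, 4, 18.5, 30],
], columns=columns)
toy_target = pd.Series([50000, 45000, 52000, 90000, 80000, 150000])

toy_model = LinearRegression().fit(toy_features, toy_target)

# One new hire, as a one-row table with the same column names
hire = pd.DataFrame([[1, 29, 2, 5.2, 10]], columns=columns)
prediction = toy_model.predict(hire)

print(prediction)   # an array with one predicted salary
```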

💡 A hint for **departments and their codes**:

* Engineering - 0
* Finance - 1
* Media - 2
* Operations - 3
* Other - 4
* Product - 5
* Sales - 6
* Tech - 7

--------------

**6.** **Explaining** the model

There is a whole field called [**Explainable AI (XAI)**](https://arxiv.org/abs/2006.00093) which is rising in popularity: the widespread application of machine learning, particularly deep learning, has led to highly accurate models, but **these models often lack explainability and interpretability**.

Luckily, Linear Regression is a [linear model](https://scikit-learn.org/stable/modules/linear_model.html), so its explainability is quite high.

**6.1.** We can check `coef_`, the **coefficients** of the model. These show how much the predicted target (Gross salary) changes when one of the features (inputs) increases by `1`, while holding the other features constant.

In [None]:
# your code here


<details>
    <summary>Reveal Solution 🙈</summary>

<p> 
<pre>
model.coef_
</pre>
</details>

🤔 We'd need to check the column order again, to know which number is which input. But, **we got you covered!** Run the cell below:

In [None]:
pd.concat([pd.DataFrame(features.columns),pd.DataFrame(np.transpose(model.coef_))], axis = 1)
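Another way to label the coefficients, shown here as a sketch on made-up data, is to put them in a `Series` indexed by column name. The toy salaries follow exactly `20000 + 3000 * Years_exp + 500 * Tenure (months)`, so the output is easy to check; all names prefixed with `toy_` are hypothetical.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data, NOT the real salaries dataset
toy_features = pd.DataFrame({
    "Years_exp": [1.0, 2.0, 3.0, 4.0],
    "Tenure (months)": [5.0, 3.0, 8.0, 1.0],
})
toy_target = 20000 + 3000 * toy_features["Years_exp"] + 500 * toy_features["Tenure (months)"]

toy_model = LinearRegression().fit(toy_features, toy_target)

# Pair every coefficient with the name of its column
coefficients = pd.Series(toy_model.coef_, index=toy_features.columns)
print(coefficients)
```

A single coefficient can then be read by name, for example `coefficients["Years_exp"]`.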

**6.2.** The other thing we can check is the **intercept** of the model. This is the predicted target (Gross salary) when all inputs are 0. So imagine a newborn baby going to the office:

In [None]:
# your code here


<details>
    <summary>Reveal Solution 🙈</summary>

<p> 
<pre>
model.intercept_
</pre>
</details>
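You can confirm the "all inputs are 0" reading yourself: predicting for a row of zeros should give back the intercept. A minimal sketch on made-up numbers (the data and the `toy_model` name are hypothetical):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data, NOT the real salaries dataset
X = np.array([[1.0, 5.0], [2.0, 3.0], [3.0, 8.0], [4.0, 1.0], [5.0, 6.0]])
y = np.array([31000.0, 34000.0, 43000.0, 41000.0, 52000.0])

toy_model = LinearRegression().fit(X, y)

# One "person" with every feature set to 0
all_zero = np.zeros((1, X.shape[1]))
prediction_at_zero = toy_model.predict(all_zero)[0]

print(prediction_at_zero, toy_model.intercept_)
```

The intercept is often a value no real employee could have, which is fine: it anchors the line and is not meant as a realistic salary.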

# Congratulations, you are a Linear Regression wizard! 🧙‍♀️🧙‍♂️

* You can try to play around with the `hire` variable to see the `.predict`ion results
* You can also try to change the `features` variable - try removing more columns!
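
As a starting point for the second idea, here is a sketch that trains one model with all columns and one with a column removed, then compares the scores. The data is randomly generated for illustration (it is not the real dataset), and every name below is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical synthetic data, NOT the real salaries dataset
rng = np.random.default_rng(0)
n = 50
toy = pd.DataFrame({
    "Age": rng.integers(22, 60, size=n),
    "Tenure (months)": rng.integers(1, 48, size=n),
})
toy["Gross"] = (
    30000 + 1500 * toy["Age"] + 400 * toy["Tenure (months)"]
    + rng.normal(0, 5000, size=n)   # random noise so the fit is not perfect
)

target = toy["Gross"]
all_features = toy.drop("Gross", axis="columns")
fewer_features = all_features.drop("Tenure (months)", axis="columns")

# Train and score one model per feature set
score_all = LinearRegression().fit(all_features, target).score(all_features, target)
score_fewer = LinearRegression().fit(fewer_features, target).score(fewer_features, target)

print(round(score_all, 3), round(score_fewer, 3))
```

When scoring on the same data the model was trained on, removing a column can only keep R-squared the same or lower it, so the interesting question is how much each column costs you.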