# Your first predictions - Predicting Salaries 📈

--------------

## But first - a tutorial on `Jupyter Notebook` and `Python` basics 🚴‍♀️

### Jupyter Notebook 📝

A notebook consists of two main parts.

1. Text instructions like this one - these are made using a text formatting language called [Markdown](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet)

2. Code cells like the one below:

In [None]:
1 + 1 * 2

1. To run a code cell, click into it with your mouse and press the `► Run` button in the navbar at the top of the notebook. 
2. You can also use the shortcut `Shift + Enter` to run a cell!
3. A cell that has been run will get an `In [number]` next to it
4. The output (returned value) of a cell will be displayed below it with an `Out[number]` next to it
5. If you want to add another code cell - look for the `➕` button in the navbar.

In [None]:
# you will have cells like these for you to code in

--------

### 🐍 Python basics

[**Python**](https://docs.python.org/) has been around since the late 1980s. In fact, the concept of Machine Learning has been around since the 1950s! 😯

But rapid advances in internet speed and data storage, together with the very active Python community, have married the two very well in the last 5 years.

In **Python** we have **built-in data types** to help us work with different kinds of data:

**Strings** (`str` in Python) for **literal text, column or file names**. Made by putting quotes (`""`) around the text.

In [None]:
"hello!"
"ML like a pro"

**Integers** (`int` in Python) for **whole numbers**

In [None]:
42
-10

**Floats** (`float` in Python) for **numbers with decimal points**. The decimal separator is always `.`

In [None]:
3.14
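
If you are ever unsure which data type a value has, Python can tell you. The short sketch below uses the built-in `type()` function; the variable names are made up for this example.

```python
# Each value has a type that Python can report with the built-in type() function.
text = "ML like a pro"   # a string
whole = 42               # an integer
decimal = 3.14           # a float

print(type(text).__name__)     # str
print(type(whole).__name__)    # int
print(type(decimal).__name__)  # float

# Mixing an integer and a float in arithmetic gives a float.
mixed = whole + decimal
print(type(mixed).__name__)    # float
```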

📦 We have **variables** to help store data:

In [None]:
name = "Alan Turing"
age = 42
new_employee_data = [0, 30, 3, 7.1, 12]

...and **re-use** it later:

In [None]:
"Hi, my name is " + name

In [None]:
# getting one year older :(
age = age + 1
age

💥 And we have **methods** to perform actions on data:

In [None]:
name.upper()

In [None]:
number_of_n = name.count('n') # creating a new variable as a result of the method call
number_of_n

Don't worry if some things feel unnatural at first - you *are* learning a new language in just 5 minutes! 💪

--------------

# Let's get back into Machine Learning 🤖

1. Run the cell below to `import` some Python libraries - these will be our tools for working with data 📊

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns

2. Run the cell below to read the `CSV` file into a `DataFrame` - a format that is great for data analysis inside Python!

*Note: the dataset has been cleaned and federated for learning purposes*

In [None]:
salaries = pd.read_csv('clean_data/salaries.csv')
salaries
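
A few `DataFrame` tools are handy for a first look at any table. The sketch below uses a tiny made-up table (the name `toy` and all its values are invented for illustration, they are not taken from `salaries.csv`), but the same calls should work on `salaries`.

```python
import pandas as pd

# Made-up rows for illustration only; the real data comes from salaries.csv.
toy = pd.DataFrame({
    "Gender": [0, 1, 1, 0],
    "Years_exp": [1.5, 4.0, 7.2, 10.0],
    "Gross": [1200, 1800, 2500, 3100],
})

first_rows = toy.head(2)     # the first 2 rows of the table
n_rows, n_cols = toy.shape   # (number of rows, number of columns)
column_types = toy.dtypes    # the data type stored in each column

print(first_rows)
print(n_rows, n_cols)
print(column_types)
```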

--------------

## We can get a lot of insight without ML! 🤔

Let's follow some basic intuition - **does years of experience affect gross salary❓**

Let's use a [Seaborn Scatterplot](https://seaborn.pydata.org/generated/seaborn.scatterplot.html) - a function in the Seaborn library (which we imported above and shortened to `sns`) that draws a graph where each data point is a dot positioned by its `x` and `y` values.

In [None]:
sns.scatterplot(data=salaries, x='Years_exp', y='Gross')

### Your turn! 🚀
Try to re-code the `scatterplot` in the cell below, but change the `x` value to another column.

**Tip**: It's okay to copy-paste code 😉

In [None]:
# your code goes here

Remember one of the questions from the slides - **do women and men earn equally in our company❓**

*Note: 'Male' is coded as 0, 'Female' as 1*

In [None]:
sns.scatterplot(data=salaries, x='Years_exp', y='Gross', hue='Gender')

If the trend is still not clear, we can also go back to our `DataFrame` (the data table itself) and group it by the 'Gender' column and see the `mean()` (average) of the Gross salary:

In [None]:
# select the 'Gross' column first, so text columns do not get in the way of mean()
salaries.groupby('Gender')['Gross'].mean()
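
To see what grouping does, here is a minimal sketch on a made-up four-row table (the `toy` table and its numbers are invented for illustration). `groupby` splits the rows by the value in one column, and `mean()` then averages each group separately.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Gender": [0, 0, 1, 1],
    "Gross": [1000, 2000, 1500, 2500],
})

# One average per distinct value in the 'Gender' column.
mean_by_gender = toy.groupby("Gender")["Gross"].mean()
print(mean_by_gender)
```

Group `0` averages 1000 and 2000, group `1` averages 1500 and 2500.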

### Your turn! 🚀

Try to re-code the above cells, but change `'Gender'` and `'Gross'` to other columns, like `'Department'` and `'Tenure (months)'`.

In [None]:
# your code goes here

--------------

#### 🥈*A good data expert knows all the most complex models.* 
### 🥇*A great data expert knows when results can be achieved without them.* 

--------------

## Your first model - Linear Regression 🚀

**1.** First, let's decide what will be our...
  * Features and targets
  * Inputs and output
  * X and Y

In [None]:
# dropping the output column ('Gross') and the 'Department' column to create the inputs (features)
inputs = salaries.drop(['Gross', 'Department'], axis='columns')
output = salaries[['Gross']]

**2.** Feel free to check what is in your `inputs` and `output` below:

In [None]:
# your code here

**3.** Time to **import** the Linear Regression model

Python libraries like [Scikit-learn](https://scikit-learn.org/0.21/modules/classes.html) make it super easy for people getting into Data Science and ML to experiment.

The code is already in the library, it's just about **calling the right methods!** 🛠

In [None]:
from sklearn.linear_model import LinearRegression

**4.** We **initialize** a model and store it into a variable

In [None]:
model = LinearRegression()

**5.** We **train** the model. 

This is the process where the Linear Regression model looks for a line that best fits all the points in the dataset. This is the part where the computer is hard at work! 🤖

In [None]:
model.fit(inputs, output)
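
To build some intuition for what "finding the best line" means, here is a small sketch on made-up points that lie exactly on the line `y = 2x + 1`. The names `x`, `y` and `toy_model` are invented for this example. After training, the model should recover a slope close to 2 and an intercept close to 1.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up points that lie exactly on the line y = 2x + 1.
x = np.array([[0.0], [1.0], [2.0], [3.0]])  # one input column, four rows
y = 2 * x[:, 0] + 1

toy_model = LinearRegression()
toy_model.fit(x, y)

slope = toy_model.coef_[0]        # how steep the fitted line is
intercept = toy_model.intercept_  # where the fitted line crosses x = 0
print(slope, intercept)
```

Real data is never this tidy, so the line will not pass through every point; the model picks the line with the smallest overall squared error.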

**6.** We **score** the model

Models can have different default scoring metrics. Linear Regression by default uses something called `R-squared` - a metric that shows how much of the variation in the output (Gross salary) can be explained by the inputs (Age, Tenure, Gender etc.).

In [None]:
model.score(inputs, output)
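
For the curious, `R-squared` can be worked out by hand as 1 minus (unexplained variation / total variation). The sketch below does this on made-up, slightly noisy data (all names and numbers are invented for illustration) and compares the result with what `score` returns.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up, slightly noisy data for illustration only.
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

toy_model = LinearRegression().fit(x, y)
predicted = toy_model.predict(x)

ss_residual = np.sum((y - predicted) ** 2)  # variation the model fails to explain
ss_total = np.sum((y - y.mean()) ** 2)      # total variation around the average
r_squared = 1 - ss_residual / ss_total

print(r_squared)
print(toy_model.score(x, y))  # the same quantity, computed by scikit-learn
```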

⚠️ **Careful not to confuse this with accuracy**. The number above shows that **"the inputs we have can explain around 40-45% of the variation in the salary"**, which is decent considering we did this in 10 min!

**7.** We **predict** the salary of a new hire 🔮

*Note: here is a reminder of the columns in the table:* `['Gender', 'Age', 'Department_code', 'Years_exp', 'Tenure (months)']`

In [None]:
hire = [[0, 30, 1, 5.2, 10]]

In [None]:
model.predict(hire)
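
A plain list works, but it is easy to mix up the order of the numbers. One alternative, sketched below, is to store the new hire in a `DataFrame` with the same column names used for training. Recent versions of scikit-learn remember the training column names and may show a warning when given a bare list instead. The training rows here are made up for illustration, and `toy_inputs`, `toy_output` and `toy_model` are invented names.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

columns = ["Gender", "Age", "Department_code", "Years_exp", "Tenure (months)"]

# Made-up training rows for illustration only.
toy_inputs = pd.DataFrame(
    [[0, 25, 1, 2.0, 12],
     [1, 32, 2, 8.5, 40],
     [0, 41, 1, 15.0, 90],
     [1, 29, 3, 5.0, 24],
     [0, 36, 2, 11.0, 60],
     [1, 45, 3, 20.0, 120]],
    columns=columns,
)
toy_output = pd.DataFrame({"Gross": [1500, 2400, 3900, 2000, 3100, 4800]})

toy_model = LinearRegression().fit(toy_inputs, toy_output)

# Same values as the plain list, but each one is labelled with its column name.
hire = pd.DataFrame([[0, 30, 1, 5.2, 10]], columns=columns)
prediction = toy_model.predict(hire)
print(prediction.shape)  # (1, 1): one new hire, one predicted value
```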

### Your turn! 🚀

1. Try changing the numbers stored inside the `hire` variable. Can you find some insight from what you see?
2. You can try dropping different columns in the cell where we set the `inputs`. How does that change the score and predictions?

In [None]:
# your code here

--------------

**8.** A debated point - **explaining** the model

There is a whole field called [**Explainable AI (XAI)**](https://arxiv.org/abs/2006.00093) which is rising in popularity: the widespread application of machine learning, particularly deep learning, has produced highly accurate models, but **these models often lack explainability and interpretability**.

Luckily, Linear Regression is a [linear model](https://scikit-learn.org/stable/modules/linear_model.html), so its explainability is quite high.

**8.1.** We can check the `coef_`, or the **coefficients**, of the model. Each one tells us how much the target (Gross salary) changes when that feature (input) increases by `1` while the other features stay the same.

In [None]:
model.coef_

🤔 We'd need to check the column order again to know which number belongs to which input. But **we got you covered!** Run the cell below:

In [None]:
# put each input column name next to its coefficient
pd.concat([pd.DataFrame(inputs.columns), pd.DataFrame(np.transpose(model.coef_))], axis=1)
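
To check that coefficients really mean "change in the target per change of 1 in a feature", here is a sketch on made-up data built from a known rule: `target = 3 * a + 5 * b + 10`. The column names `a` and `b` and all values are invented. Because the target here is a single column of numbers rather than a one-column table, `coef_` comes back as a flat list, which makes it easy to label with a pandas `Series`.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up data built from a known rule: target = 3 * a + 5 * b + 10
toy_inputs = pd.DataFrame({"a": [0, 1, 2, 3, 4], "b": [1, 0, 2, 1, 3]})
toy_output = 3 * toy_inputs["a"] + 5 * toy_inputs["b"] + 10

toy_model = LinearRegression().fit(toy_inputs, toy_output)

# Label each coefficient with the column it belongs to.
coefficients = pd.Series(toy_model.coef_, index=toy_inputs.columns)
print(coefficients.round(3))

# Raising "a" by 1 while keeping "b" fixed shifts the prediction by the "a" coefficient.
before = toy_model.predict(pd.DataFrame({"a": [2], "b": [2]}))[0]
after = toy_model.predict(pd.DataFrame({"a": [3], "b": [2]}))[0]
print(round(after - before, 3))
```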

**8.2.** The other thing we can check is the **intercept** of the model. This is the predicted target (Gross salary) when all inputs are 0. So imagine a newborn baby going to the office:

In [None]:
model.intercept_
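
Put together, a Linear Regression prediction is just the intercept plus each input multiplied by its coefficient. The sketch below checks both claims on made-up data (all names and numbers are invented for illustration): a row of zeros gets the intercept as its prediction, and a prediction worked out by hand matches `predict`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data for illustration only: four rows, two input columns.
x = np.array([[1.0, 10.0], [2.0, 8.0], [3.0, 13.0], [4.0, 9.0]])
y = np.array([5.0, 6.5, 9.0, 9.5])

toy_model = LinearRegression().fit(x, y)

# A row where every input is 0 gets the intercept as its prediction.
all_zeros = np.zeros((1, 2))
at_zero = toy_model.predict(all_zeros)[0]

# Any prediction is the intercept plus each input times its coefficient.
row = np.array([2.5, 11.0])
by_hand = toy_model.intercept_ + np.dot(toy_model.coef_, row)
by_model = toy_model.predict(row.reshape(1, -1))[0]

print(at_zero, toy_model.intercept_)
print(by_hand, by_model)
```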

# Congratulations, you are a Linear Regression wizard! 🧙‍♀️🧙‍♂️