# Your first predictions - Predicting Salaries 📈

--------------

## But first - how do we use this `Jupyter Notebook`? 🤔

A notebook consists of two main parts.

1. Text instructions like this one - these are made using a text formatting language called [Markdown](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet)

2. Code cells like the one below:

1 + 1 * 2

To run a code cell, click into it with your mouse and press the `► Run` button in the navbar at the top of the notebook. 

You can also use the shortcut `Shift + Enter` to run a cell!

In [None]:
1 + 1

As you can see, after you run a cell it gets an `In [number]` next to it - this means that the code was run, and the number shows the *order* in which it was run.

We also get an output! 💡If the **last line inside a code cell returns** something (produces some result) then that result will be printed directly below with an `Out[number]` next to it.
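To see the "last line" rule in action, here is a small sketch (the variable names `first` and `second` are made up for this example). Only the final expression of a cell is shown automatically, while `print()` displays a value from anywhere in the cell.

```python
# Python follows the usual maths rules: multiplication happens before addition.
first = 1 + 1        # 2
second = 1 + 1 * 2   # 3, because 1 * 2 is calculated first

# print() shows a value no matter where it sits in the cell.
print(first)
print(second)

# In a notebook, a bare expression on the last line is shown next to Out[number].
second
```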

If you want to add another code cell - look for the `➕` button in the navbar.

--------------

--------------

## Next - Python basics in 10 min! ⏰

[Python](https://docs.python.org/) has been around since the late 1980s. In fact, the concept of Machine Learning has been around since the 1950s! 😯

But rapid advances in internet speed and data storage, together with the very active Python community, have married the two very well in the last 5 years.

### Python built-in data `types`

To work easily with different kinds of data, we have the notion of `type` in Python.

Let's first see the **String** (shortened as `str` in Python). Strings represent **text, labels, names** and are written **inside single quotes or double quotes**.

In [None]:
"hello"
'this is also a string'
"136 8866 8855" # this looks like a number, but phone numbers are also stored as strings!
"Python" + " is " + "cool" # we can also add strings together

⚠️**Syntax rules are very important**. As humans we can still read through typos, but computers can't and will throw errors! Try to run the cells below:

In [None]:
"what if I forget a quote?

In [None]:
"what if I use different quotes?'

💡**Tip**: note the small upward arrow `^` in the error message. Python will try to help you find the error, so looking for that arrow can be very helpful!

--------------

Next let's check out **numeric data** 🔢

For numeric data we have two types.

1. **Integer** (shortened as `int` in Python) is for **whole numbers**, like the ones below:

In [None]:
1
42
-11

2. **Float** (written as `float` in Python) is for **numbers with decimal points**, like these ones:

In [None]:
3.14
0.3

❓**How do I know the data type of something?**

Great question! You can use the built-in `type()` function and put the object you want to check inside the parentheses. You can try running the cells below:

In [None]:
type("Getting nerdy")

In [None]:
type(42)

In [None]:
type(3.14)
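Why does the type matter? Here is a minimal sketch (the variable names are made up for this example) showing that text which looks like a number still behaves like text until we convert it with the built-in `int()` function.

```python
looks_like_a_number = "42"   # quotes make this a str
real_number = 42             # no quotes, so this is an int

print(type(looks_like_a_number))
print(type(real_number))

# Adding strings glues them together, adding integers does maths.
print(looks_like_a_number + looks_like_a_number)   # 4242
print(real_number + real_number)                   # 84

# int() converts text made of digits into a whole number.
converted = int(looks_like_a_number)
print(converted + 1)                               # 43
```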

--------------

### Variables - storing data for later 📦

Knowing Python basics, like built-in data types, is a great first step! But we need to find a convenient way to store this data inside the program.

Let's take an example! Say we have a phone number, written as a `str`:

In [None]:
"155 8899 8877"

Now let's say I want to create a confirmation message with this phone number. What to do? Without variables, we need to write the whole thing out again:

In [None]:
"A confirmation message has been sent to 155 8899 8877"

Let's also imagine that on my profile page I can see what my current phone number is set to:

In [None]:
"Your current phone number: 155 8899 8877"

Now let's imagine that **the user changed their phone number**. What needs to be done? 

**We need to go and rewrite everything again. 😭** This would make software impossible to maintain. ❗️

Instead, let's **store the phone number into a variable** and then just **call the variable** when we need it.

Here is what that looks like:

In [None]:
phone = "155 8899 8877"

In [None]:
"A confirmation message has been sent to " + phone

In [None]:
"Your current phone number: " + phone

You can try changing the phone number and re-running the two cells above - you'll see that they pick up the new phone number! 💡

We will use variables to store our data, Machine Learning models and predictions, so make note of this syntax!

```
variable = value
```

The value can be **any data type**. In the above example we are storing a simple string, but we will soon store much richer data!
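One detail worth seeing in code: a message built from a variable does not update by itself when the variable changes, which is why the cells need to be run again. A minimal sketch (the second phone number and the `message` variable are just for illustration):

```python
phone = "155 8899 8877"
message = "Your current phone number: " + phone

# The user changes their number, so we store a new value in the same variable.
phone = "136 8866 8855"

# The old message was built from the old value and does not update by itself...
print(message)

# ...so we build it again, this time from the new value.
message = "Your current phone number: " + phone
print(message)
```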

--------------

### Methods - performing an action on a piece of data 💥

Let's take a name. What is a good data `type` to store a name? 🤔

In [None]:
name = "monica"

This name needs a fix - **the first letter is lower case**. We can change it by **calling a method on the variable name**.

Here is what that looks like:

In [None]:
name.title()

⚠️**Note that most methods don't change the original value!**

In [None]:
name

**But in most cases we want to overwrite the original value!** Let's **update the variable** `name`:

In [None]:
name = name.title()

In [None]:
name

☝️This is something we will be doing a lot, as we **transform data** and **store it back into a variable**

Notice that in `title()` the **parentheses are empty**. But **some methods take what are called `parameters` or `arguments`.**

These are like extra settings for the method (action) to work correctly 🛠

Here is an example:

In [None]:
name.count('i')

You probably guessed it - the above method counts how many letters `i` are **in the value stored inside the variable** (in our case, the string `Monica`).

The syntax to keep in mind here is:

```
variable.method()
variable.method(parameters)
variable = variable.method()

# or if you want to make a new variable:
new_variable = variable.method()
```
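Here are the same patterns on a different string, as a quick sketch. The `city` variable is made up for this example, while `upper()`, `replace()` and `title()` are standard string methods.

```python
city = "new york"

# variable.method(): returns a new value, the original stays the same
print(city.upper())
print(city)

# variable.method(parameters): replace() needs to know what to swap for what
print(city.replace("new", "old"))

# new_variable = variable.method(): keep the result under a new name
city_title = city.title()
print(city_title)
```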

Don't worry if some things feel unnatural at first - you are learning a new language in just 10 minutes! 💪


--------------

--------------

# Let's get back into Machine Learning 🤖

1. Run the cell below to `import` some Python libraries - these will be our tools for working with data 📊

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns

2. Run the cell below to read the `CSV` file into a `DataFrame` - a format that is great for data analysis inside Python! 

*Note: the dataset is cleaned and federated for learning purposes*

In [None]:
salaries = pd.read_csv('clean_salaries.csv')
salaries
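Once a `DataFrame` is loaded, a few quick checks help you get a feel for it. The sketch below uses a tiny invented table instead of `clean_salaries.csv`: the values are made up, and only the column names mirror ones used in this notebook.

```python
import pandas as pd

# Tiny invented table standing in for clean_salaries.csv (values are NOT real data).
toy_salaries = pd.DataFrame({
    "Gender": [0, 1, 0, 1],
    "Age": [25, 31, 40, 28],
    "Department": ["Sales", "IT", "IT", "Sales"],
    "Years_exp": [1.0, 6.5, 15.0, 3.0],
    "Gross": [3000, 5200, 8100, 3900],
})

print(toy_salaries.head(2))   # first 2 rows
print(toy_salaries.shape)     # (number of rows, number of columns)
print(toy_salaries.dtypes)    # data type of each column
```

The same three checks should work on `salaries` too.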

--------------

## We can get a lot of insight without ML! 🤔

Let's follow some basic intuition - **does years of experience affect gross salary❓**

Let's use a [Seaborn Scatterplot](https://seaborn.pydata.org/generated/seaborn.scatterplot.html) - a function inside the Seaborn library (which we imported above and shortened to `sns`) that draws a graph where each data point is a dot positioned by its `x` and `y` values.

In [None]:
sns.scatterplot(data=salaries, x='Years_exp', y='Gross')

### Your turn! 🚀
Try to re-code the `scatterplot` in the cell below, but change the `x` value to something else. 

**Tip**: It's okay to copy-paste code 😉

In [None]:
# your code goes here

Remember one of the questions from the slides - **do women and men earn equally in our company❓**

*Note: 'Male' is coded as 0, 'Female' - as 1*

In [None]:
sns.scatterplot(data=salaries, x='Years_exp', y='Gross', hue='Gender')

If the trend is still not clear, we can also go back to our `DataFrame` (the data table itself) and group it by the 'Gender' column and see the `mean()` (average) of the Gross salary:

In [None]:
salaries.groupby('Gender')['Gross'].mean() # pick the column first, so text columns like Department are not averaged
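Under the hood, `groupby` splits the rows into groups and then calculates something for each group. Here is a minimal sketch on a tiny invented table (`toy` and its numbers are made up for illustration, not taken from the salaries dataset). It picks the `Gross` column before calling `mean()`, so only that column is averaged.

```python
import pandas as pd

# Invented numbers, just to make the averages easy to check by hand.
toy = pd.DataFrame({
    "Gender": [0, 1, 0, 1],
    "Gross": [3000, 5000, 4000, 6000],
})

# Split rows by Gender, pick the Gross column, then average each group.
average_gross = toy.groupby("Gender")["Gross"].mean()
print(average_gross)
```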

### Your turn! 🚀

Try to re-code the above cells, but change `'Gender'` and `'Gross'` to other columns, like `'Department'` and `'Tenure (months)'`.

In [None]:
# your code goes here

--------------

#### 🥈*A good data expert knows all the most complex models.* 
### 🥇*A great data expert knows when results can be achieved without them.* 

--------------

## Your first model - Linear Regression 🚀

1. First, let's decide what will be our...
  * Features and targets
  * Inputs and output
  * X and Y

In [None]:
inputs = salaries.drop(['Gross', 'Department'], axis='columns') # Linear Regression only works with numbers, let's drop Department
output = salaries[['Gross']]
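Two small details in the cell above are easy to miss: `drop()` gives back a new table and leaves the original alone, and double square brackets keep the result as a table. A minimal sketch on an invented table (`toy` and its values are made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "Years_exp": [1.0, 3.0, 6.5],
    "Department": ["Sales", "IT", "IT"],
    "Gross": [3000, 3900, 5200],
})

# drop() returns a new table without the listed columns; toy itself is unchanged.
toy_inputs = toy.drop(["Gross", "Department"], axis="columns")
print(list(toy_inputs.columns))
print(list(toy.columns))

# Double brackets keep a table (DataFrame), single brackets give one column (Series).
print(type(toy[["Gross"]]))
print(type(toy["Gross"]))
```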

2. Feel free to check what is in your `inputs` and `output` below:

In [None]:
# your code here

3. **Time to import the Linear Regression model**

Python libraries like [Scikit-learn](https://scikit-learn.org/0.21/modules/classes.html) make it super easy for people getting into Data Science and ML to experiment.

The code is already in the library, it's just about **calling the right methods!** 🛠

In [None]:
from sklearn.linear_model import LinearRegression

4. We **initialize** a model and store it into a variable

In [None]:
model = LinearRegression()

5. We **train** the model. 

This is the process where the Linear Regression model looks for a line that best fits all the points in the dataset. This is the part where the computer is hard at work! 🤖

In [None]:
model.fit(inputs, output)
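To get a feel for what "looking for a line" means, here is a minimal sketch with invented points that sit exactly on the line `y = 2 * x + 1`. The names `x`, `y` and `toy_model` are made up for this example. After training, the model should recover the slope and the starting point of that line.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented points that sit exactly on the line y = 2 * x + 1.
x = np.array([[1.0], [2.0], [3.0], [4.0]])   # one input column
y = np.array([3.0, 5.0, 7.0, 9.0])

toy_model = LinearRegression()
toy_model.fit(x, y)

slope = toy_model.coef_[0]
intercept = toy_model.intercept_
print(slope, intercept)   # very close to 2 and 1
```

Real data never sits perfectly on a line, so on the salaries dataset the model settles for the line with the smallest overall error.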

6. We **score** the model

Models can have different default scoring metrics. By default, Linear Regression uses something called `R-squared` - a metric that shows how much of the variation in the output (Gross salary) can be explained by the inputs (Age, Tenure, Gender etc.).

In [None]:
model.score(inputs, output)
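If you are curious where that number comes from, here is a sketch that calculates `R-squared` by hand on a few invented points and compares it with `score()`. The data and the variable names are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented points that roughly, but not exactly, follow a line.
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

toy_model = LinearRegression().fit(x, y)
predictions = toy_model.predict(x)

# R-squared = 1 - (error left after the model) / (total variation around the mean)
leftover_error = np.sum((y - predictions) ** 2)
total_variation = np.sum((y - y.mean()) ** 2)
manual_r2 = 1 - leftover_error / total_variation

print(manual_r2)
print(toy_model.score(x, y))   # same value
```

A score of 1 would mean the line passes through every point, and a score near 0 means the line is no better than always guessing the average.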

⚠️**Careful not to confuse this with accuracy**. The above number means that **"the inputs we have explain around 40-45% of the variation in salary"**, which is decent considering we did this in 10 min! 

7. We **predict** the salary of a new hire 🔮

*Note: here is a reminder of the columns in the table:* `['Gender', 'Age', 'Department_code', 'Years_exp', 'Tenure (months)']`

In [None]:
hire = [[0, 30, 1, 5.2, 10]]

In [None]:
model.predict(hire)
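The numbers in `hire` have to follow the column order exactly, which is easy to get wrong. One alternative, sketched below on invented data, is to pass a small `DataFrame` with named columns instead of a plain list. The names `toy_inputs`, `toy_output` and `toy_hire` are made up, and the toy salary formula is an assumption chosen so the result is easy to check. Recent versions of scikit-learn may also show a warning about feature names when a model trained on a `DataFrame` receives a plain list.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Invented data where Gross is exactly 1000 * Years_exp + 100 * Age.
toy_inputs = pd.DataFrame({"Years_exp": [1, 2, 3, 4], "Age": [25, 30, 28, 40]})
toy_output = 1000 * toy_inputs["Years_exp"] + 100 * toy_inputs["Age"]

toy_model = LinearRegression().fit(toy_inputs, toy_output)

# Naming the columns makes it clear which number is which input.
toy_hire = pd.DataFrame({"Years_exp": [5], "Age": [30]})
prediction = toy_model.predict(toy_hire)
print(prediction)   # close to 8000
```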

### Your turn! 🚀

1. Try changing the numbers stored inside the `hire` variable. Can you find some insight from what you see?
2. You can try dropping different columns in the cell where we set the `inputs`. How does that change the score and predictions?

In [None]:
# your code here

--------------

8. Now comes the part where there's the most debate - **explaining** the model.

There is a whole field called [**Explainable AI (XAI)**](https://arxiv.org/abs/2006.00093) which is rising in popularity, as the widespread application of machine learning, particularly deep learning, has led to highly accurate models that lack explainability and interpretability.

Luckily, Linear Regression is a [linear model](https://scikit-learn.org/stable/modules/linear_model.html), so its explainability is quite high.

8.1. We can check the `coef_` or the **coefficients** of the model. These show how much the predicted target (Gross salary) changes when one feature (input) goes up by `1` while the other features stay the same.

In [None]:
model.coef_

🤔We'd need to check the column order again, to know which number is which input. But, **we got you covered!** Run the cell below:

In [None]:
pd.concat([pd.DataFrame(inputs.columns), pd.DataFrame(np.transpose(model.coef_))], axis=1)
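A shorter way to label the coefficients, shown here as a sketch on invented data, is to build a pandas `Series` that uses the column names as its index. The names `toy_inputs`, `toy_output` and `labelled` are made up, and the toy salary formula is an assumption chosen so the coefficients are easy to recognise.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Invented data where Gross is exactly 1000 * Years_exp + 100 * Age.
toy_inputs = pd.DataFrame({"Years_exp": [1, 2, 3, 4], "Age": [25, 30, 28, 40]})
toy_output = pd.DataFrame({"Gross": 1000 * toy_inputs["Years_exp"] + 100 * toy_inputs["Age"]})

toy_model = LinearRegression().fit(toy_inputs, toy_output)

# The output is a one-column table, so coef_ has one row: coef_[0].
labelled = pd.Series(toy_model.coef_[0], index=toy_inputs.columns)
print(labelled)
```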

8.2. The other thing we can check is the **intercept** of the model. This is the predicted target (Gross salary) when all inputs are 0. So imagine a newborn baby going to the office:

In [None]:
model.intercept_
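Put together, the coefficients and the intercept are all the model needs: a prediction is the intercept plus each input multiplied by its coefficient. Here is a minimal sketch on invented numbers (all names and values are made up for illustration) that checks this by hand.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented data: two input columns, one output.
x = np.array([[1.0, 25.0], [2.0, 30.0], [3.0, 28.0], [4.0, 40.0], [6.0, 35.0]])
y = np.array([3400.0, 5100.0, 5900.0, 8200.0, 9400.0])

toy_model = LinearRegression().fit(x, y)

new_row = np.array([5.0, 30.0])

# A prediction is the intercept plus each input multiplied by its coefficient.
by_hand = toy_model.intercept_ + np.sum(toy_model.coef_ * new_row)
by_model = toy_model.predict([new_row])[0]
print(by_hand, by_model)   # the two numbers match

# With all inputs set to 0, only the intercept is left.
all_zero = toy_model.predict([[0.0, 0.0]])[0]
print(all_zero, toy_model.intercept_)
```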

### Congratulations, you are a Linear Regression wizard! 🧙‍♀️🧙‍♂️