# Predicting Salaries

## Intro - Jupyter & Python

### Jupyter Notebook 📝

A Jupyter Notebook is one of the most **essential** tools for any Data Scientist. It allows you to write regular text and run code *in the same environment*!

Notebooks are made up of cells that can be written and run **independently from one another** (although code cells share the same variables once they have been run). There are several kinds of cells, but the two main ones are:

- **Markdown cells** allow you to write regular text using [Markdown](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet) (like this one!)
- **Code cells** allow you to run **any** Python code your heart desires, right inside the notebook.

In [1]:
# This is a code cell, the output is shown below
1 + 1 * 2

3

Cells have two **modes**:

- Editing mode
- Selection mode

While in **editing mode**, you can write things inside the cell, no matter what kind of cell it is. If you press `escape` (or click outside the cell), you switch into **selection mode**, which enables all sorts of cell shortcuts and navigation hotkeys. Here are some of them:

- Arrows navigate between cells
- `A` inserts a cell above
- `B` inserts a cell below
- `Y` changes the cell type to `code`
- `M` changes the cell type to Markdown
- `Enter` goes into editing mode
- `D`x2 deletes a cell (holding it down "chain-deletes" **all cells**)

Try it out below!

In [3]:
# Play around with cells!

To run a code cell, select it and click the `► Run` button in the navbar at the top of the notebook. You can also use the shortcut `Shift + Enter` while in editing mode.

A cell that has been run will get an `In [number]` next to it, denoting its position in the sequence of cells that have been run so far (if the number is `2`, then it was overall the second cell that ran). An output (returned value) of a cell will be displayed below it, with an `Out[number]` next to it.

In [4]:
print('This cell has been run!')

This cell has been run!
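To make the difference between printed text and a returned value a bit more concrete, here is a small sketch (the variable name `total` is only for illustration). In a notebook, only the last expression of a cell is displayed as its returned value, while `print` output simply appears below the cell.

```python
# An assignment on its own displays nothing in a notebook
total = 1 + 1

# print() writes text below the cell, without an Out[number] label
print(total)

# In a notebook, a bare expression on the last line is the cell's
# returned value and is the part shown as the cell's output
total * 2
```

If you run this as a plain Python script instead of a notebook cell, only the `print` line produces visible output.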


--------

### Python Basics 🐍

[**Python**](https://docs.python.org/) has been around since the early 1990s (its development started in the late 1980s). Fun fact: Machine Learning as a concept has been around since the 1950s! 😯

Rapid advances in internet speeds and data storage, together with a very active Python community, have helped Python mature into a language that is well suited to applying ML in the real world.

In **Python**, we have **built-in data types** to help us work with different kinds of data:

- Strings (`str`) are **literal text**, and you can create one by putting single or double quotes around any text
- Integers (`int`) are whole numbers and can be created by simply typing in the desired number
- Floats (`float`) are decimal numbers with a `.` (as opposed to a `,`)

There are a few other types out there, but these will do for now!
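If you are ever unsure which type a value has, the built-in `type()` function will tell you. A quick sketch (the variable names and values are made up for illustration):

```python
greeting = "Hello"   # quotes make a string
floors = 3           # a whole number is an integer
ratio = 0.5          # a number with a decimal point is a float

print(type(greeting))  # <class 'str'>
print(type(floors))    # <class 'int'>
print(type(ratio))     # <class 'float'>
```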

In [5]:
# These are strings
'Hello World!'
"Do you prefer double or single quotes?"
'You have to choose one, no mixing allowed!"

SyntaxError: unterminated string literal (detected at line 4) (2867853561.py, line 4)

In [6]:
# Numbers are your friends...sometimes
42
12

12

In [7]:
# This one is a good friend 🥧
3.141592

3.141592

In [8]:
# You can do calculations with any kind of number
print(9 / 3)
print(2 + 5)
print(2.5 * 2)

3.0
7
5.0
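Notice that `9 / 3` gave `3.0` rather than `3`: in Python 3, the `/` operator always returns a float. The short sketch below shows a few related operators side by side.

```python
print(9 / 3)    # 3.0 (true division always gives a float)
print(9 // 3)   # 3   (floor division of two integers gives an integer)
print(7 // 2)   # 3   (the remainder is dropped)
print(7 % 2)    # 1   (modulo gives the remainder)
print(2 + 5.0)  # 7.0 (mixing an integer with a float gives a float)
```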


Apart from data types, we have something else that is very important: **variables**. Variables are like boxes that store data and can be accessed at any time.

In [9]:
name = "Ada Lovelace"
age = 42

print(name)
print(age)

Ada Lovelace
42


You can also do some operations using variables (mind the data types!)

In [10]:
'Hi, my name is ' + name

'Hi, my name is Ada Lovelace'

In [11]:
age = age + 1
age

43
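Here is what "mind the data types" means in practice. This is a small sketch that redefines `name` and `age` so it can run on its own; `message` and `same_message` are names chosen just for this example.

```python
name = "Ada Lovelace"
age = 43

# Adding a string and an integer raises a TypeError
try:
    message = name + " is " + age
except TypeError as error:
    print("Oops:", error)

# Option 1: convert the number to a string first
message = name + " is " + str(age)

# Option 2: use an f-string, which does the conversion for you
same_message = f"{name} is {age}"

print(message)
```

Both options produce the same text, so pick whichever reads better to you.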

Finally, we have **functions** (also called **methods** when they belong to a value, like a string). Functions are, quite simply, sets of instructions that can be called upon and executed **as many times as you want**!

We differentiate between **built-in** and **defined** functions. Built-in functions are included in Python and in whatever packages you're using. Defined functions are functions that you defined yourself!

In [12]:
# Built-in string method to lowercase strings
name.lower()

'ada lovelace'

In [13]:
def say_hello():
    sentence = 'Hello world!'

    return sentence

In [14]:
say_hello()

'Hello world!'
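The `say_hello` function above always returns the same sentence. Functions become more useful once they accept input. The sketch below uses a hypothetical `say_hello_to` function (not part of the original notebook) to show the idea.

```python
def say_hello_to(person):
    # `person` is a parameter: a placeholder that is filled in on each call
    sentence = 'Hello ' + person + '!'

    return sentence

print(say_hello_to('Ada'))   # Hello Ada!
print(say_hello_to('Alan'))  # Hello Alan!
```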

In [15]:
# You can use a function's output for other things, like creating variables
new_sentence = say_hello() + " Let's roll!"  # Notice that I changed quote types!

new_sentence

"Hello world! Let's roll!"

### 1. Your turn! 🚀
Practice using some of the basic types we just covered. Here are some ideas:

* Create two strings and add them together with a `+` sign
* Create a variable with your age in years, then count your age in hours (roughly)
* Check if your birth month number is higher than (`>`) your birthday number
* Create a variable with your full name, then tell yourself that you rock in all caps! 💪 (e.g. `"YOU ROCK ALAN TURING!"`)

In [None]:
# Your code here

Don't worry if some things feel unnatural at first - you are learning a new language in just 20 minutes! 💪

## Let's get back to Data Science 🤖

Run the cell below to `import` some Python libraries - these will be our tools for working with data 📊

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns

Now, run the cell below to read the `CSV` file into a `DataFrame` - a format that is great for data analysis inside Python! 

*Note: the dataset is cleaned and federated for learning purposes*

In [2]:
salaries = pd.read_csv('data/salaries.csv')
salaries.head()

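|   | Gender | Age | Department | Department_code | Years_exp | Tenure (months) | Gross |
|---|--------|-----|------------|-----------------|-----------|-----------------|-------|
| 0 | 0 | 25 | Tech | 7 | 7.5 | 7 | 74922 |
| 1 | 1 | 26 | Operations | 3 | 8.0 | 6 | 44375 |
| 2 | 0 | 24 | Operations | 3 | 7.0 | 8 | 82263 |
| 3 | 0 | 26 | Operations | 3 | 8.0 | 6 | 44375 |
| 4 | 0 | 29 | Engineering | 0 | 9.5 | 25 | 235405 |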
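If you don't have `data/salaries.csv` at hand, you can still try `pd.read_csv` on a few lines of text. The sketch below types in the five rows shown above and reads them from memory. Passing `index_col=0` is an assumption made for this example (the notebook's own call does not use it); it tells pandas to treat the first, unnamed column as the row index.

```python
import io

import pandas as pd

# The five rows shown above, written out as CSV text.
# The first column has no header: here it is treated as the row index.
csv_text = """,Gender,Age,Department,Department_code,Years_exp,Tenure (months),Gross
0,0,25,Tech,7,7.5,7,74922
1,1,26,Operations,3,8.0,6,44375
2,0,24,Operations,3,7.0,8,82263
3,0,26,Operations,3,8.0,6,44375
4,0,29,Engineering,0,9.5,25,235405
"""

# read_csv accepts file-like objects, so StringIO stands in for a file
sample = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(sample.shape)          # (number of rows, number of columns)
print(list(sample.columns))
```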


**Remember:** we can gain a lot of insight with analysis alone, no ML needed!

In [3]:
salaries.columns

Index(['Gender', 'Age', 'Department', 'Department_code', 'Years_exp',
       'Tenure (months)', 'Gross'],
      dtype='object')

In [4]:
# Note: In DataFrames, the `object` dtype usually means strings
salaries.dtypes

Gender               int64
Age                  int64
Department          object
Department_code      int64
Years_exp          float64
Tenure (months)      int64
Gross                int64
dtype: object

In [6]:
# Note: The dataset is already clean, so there will be no missing data ;)
salaries.isna().sum()

Gender             0
Age                0
Department         0
Department_code    0
Years_exp          0
Tenure (months)    0
Gross              0
dtype: int64
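Since our dataset has no gaps, every count is `0`. To see what `isna().sum()` looks like when something is missing, here is a sketch on a tiny made-up DataFrame (the `people` data is invented for illustration and is not part of the salaries dataset).

```python
import pandas as pd

# A tiny made-up DataFrame with one missing age
people = pd.DataFrame({
    "Age": [25, None, 24],
    "Department": ["Tech", "Operations", "Operations"],
})

# isna() marks missing cells as True; sum() counts them per column
missing = people.isna().sum()
print(missing)
```

Here `Age` reports one missing value and `Department` reports none.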

In [7]:
salaries['Age'].mean()

31.516648168701444
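Averages get more interesting once you compare groups. As a sketch, the block below rebuilds the five rows shown by `salaries.head()` by hand (so the numbers describe only those five rows, not the full dataset) and computes the mean gross salary per department with `groupby`.

```python
import pandas as pd

# The five rows from salaries.head(), typed in by hand
sample = pd.DataFrame({
    "Department": ["Tech", "Operations", "Operations", "Operations", "Engineering"],
    "Age": [25, 26, 24, 26, 29],
    "Gross": [74922, 44375, 82263, 44375, 235405],
})

# Mean of a single column
print(sample["Age"].mean())  # 26.0

# Mean gross salary for each department
mean_gross = sample.groupby("Department")["Gross"].mean()
print(mean_gross)
```

On the full dataset, the same `groupby` call would work with `salaries` in place of `sample`.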

In [8]:
# Remember: Data is often biased because humans are biased!
salaries['Gender'].unique()

array([0, 1])
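`unique()` tells you which values exist, but not how often each one appears. One way to check how balanced a column is, sketched here on the `Gender` codes from the five rows shown earlier (not the full dataset), is `value_counts()`.

```python
import pandas as pd

# Gender codes from the five rows shown by salaries.head()
gender = pd.Series([0, 1, 0, 0, 0], name="Gender")

# The distinct values in the column
print(gender.unique())

# How often each value occurs
counts = gender.value_counts()
print(counts)
```

In this small sample, the code `0` appears four times and `1` only once.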

# Your turn! 🚀