# ML Week 1 - Introduction and Prob/Stat

This notebook has the following sections:

* [Part 0: What is machine learning?](#Part-0:-What-is-machine-learning?)
* [Part 1: Math basics](#Part-1:-Math-basics)
* [Part 2: Probability](#Part-2:-Probability)
* [Part 3: Variable relationships and some statistics](#Part-3:-Variable-relationships-and-some-statistics)


## Part 0: What is machine learning?

[Top](#ML-Week-1---Introduction-and-Prob/Stat) | [Previous section](#ML-Week-1---Introduction-and-Prob/Stat) | [Next section](#Part-1:-Math-basics) | [Bottom](#Downloading-the-notebook)

### Background

[SAS](https://www.sas.com/en_au/insights/analytics/machine-learning.html) defines machine learning as follows...

> **Machine learning** is a method of data analysis that automates analytical model building. It is a branch of **artificial intelligence** based on the idea that systems can learn from data, identify patterns and make decisions with minimal human intervention.

Let's dive into this with an example, namely Spotify's recommendation system, which uses machine learning to tailor music recommendations to each customer. This [blogpost](https://blogs.unimelb.edu.au/sciencecommunication/2018/08/27/how-spotify-already-knows-your-next-favourite-song/) from Holly Whiftield at UniMelb does a great job of explaining what goes into Spotify's recommendation engine.

Spotify's algorithm takes in four types of information...

1. The song titles from your library
2. Audio information from your library
3. Your most recently played playlists
4. The playlists of your network - meaning people who listen to _similar_ songs as yourself, but aren't you

It then uses this to create the following simple workflow.

![](https://blogs.unimelb.edu.au/sciencecommunication/files/2018/08/generic_algo-x0xm2c.png)

The **machine learning** portion decodes this inputted information and combines it to produce your [**Discover Weekly**](https://www.spotify.com/discoverweekly/) playlist. Now you might ask, what machine learning is used? I'll ping a few questions back at you to ponder...

* How does a machine know **what song titles are similar**? It's not quite an obvious thing...and the field of [**Natural Language Processing**](https://en.wikipedia.org/wiki/Natural_language_processing) helps out with this.
* How does a machine know **what audio files are similar**? Sound information is complex...I mean look at a raw sound wave below and tell me what **features** you would be able to pick from it.

![](https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSNtFAuDQGLpxBLlVSqvjaFgPRGDNzAhaHcr6wX1T0fJhXkhQVY)

* How do we jointly combine similar playlists? This is called [**Collaborative Filtering**](https://en.wikipedia.org/wiki/Collaborative_filtering), which helps a machine learn similarities based upon music taste.

...so there's a ton of machine learning involved in this process!

![](https://media1.tenor.com/images/d255f86cd4be8a67347842a4c615f06c/tenor.gif?itemid=8009182)

### Overview of the class - what will we cover?

This class will feature six weeks of content around machine learning fundamentals. We'll cover some of the main algorithms used in practice, when it's appropriate to use these algorithms, and expose you to some of the fundamental processes used by data scientists.

Hopefully by the end of this you'll have an idea about how to introduce machine learning within your organisation, and the skills required to successfully implement machine learning.

Here's an outline of this week and weeks to come.

---

**Week 1: Probability and Statistics**

If you think about the data we collect, it is usually generated by **processes** that happen around us. Maybe we're observing weather patterns, or capturing information on mobile network traffic. If we want to start finding patterns in our data, we need to start thinking about

* modeling our data
* the randomness that governs these processes
* and relationships between these processes

Computers are really good at number crunching...and not much else, which is why machine learning comes from a large mathematical foundation, mostly governed by probability and statistics. So, we'll start there.


---

**Week 2: Linear Regression**

We'll then cover our first learning algorithm, called **linear regression**, something some of you might be familiar with. If you've ever drawn a trendline in Excel, the process that creates that trendline uses linear regression. We'll try to **predict** some continuous variable, say the price of a house, based upon a collection of data that is related to housing prices.

---

**Week 3: Cross-Validation**

Great, so we've trained a machine learning algorithm, but how do we know if it's _good_? **Cross-validation** is a process for evaluating and **tuning** an algorithm on data it was not trained on, to check that its predictions are **accurate**.

---

**Week 4: Logistic Regression**

So, linear regression works really well in predicting continuous variables, but is not great at predicting **categorical variables**. A categorical variable, as it sounds, is based upon a category...for instance whether an image on my phone contains my face...or not. **Logistic regression** is very similar to linear regression, and can be used to predict categorical data.

---

**Week 5: Neural Networks**

Logistic regression is not always great though, especially when we have a lot of **complex information**. On the other hand, **neural networks**, the model that is fueling our current buzz around data science, work really well to find complex relationships, as long as we have enough data. We'll take a brief look at what a neural network is, why it's become so popular, and some use cases.

---

**Week 6: Clustering**

During our last week, we'll take a look at machine learning through a different lens. We've been doing a lot of prediction, which assumes we know what we want to predict. But what happens if our data doesn't have a specific target variable? Maybe I have a bunch of data that represents my customers, and I just want to find **some interesting patterns**. We'll conclude the course with a discussion on **clustering**, which is a form of **unsupervised learning**.

---

#### What won't we cover?

By the end of this course...you won't necessarily be a pro data scientist.

But, that being said, you'll know some of the lingo, and hopefully will have a slight edge to investigate further. There's a ton of opportunities to further dive into data science, from learning...

* data architecture
* feature engineering
* data visualisation
* building services around machine learning algorithms
* cloud computing
* deep learning
* natural language processing

and a _ton_ more. You'll get exposed to some of this lingo, and hopefully know what subject area you want to explore next :)

![](http://www.quickmeme.com/img/6b/6b8d1b344bff170d04555e05af3211f74a6d439065e59872c278c2cabcc75ee3.jpg)

### Good resources

Here are a couple of great resources as you kick off your machine learning careers.

* [Coursera's ML Course by Andrew Ng](https://www.coursera.org/learn/machine-learning)
* [Udacity's Data Science Programs](https://www.udacity.com/course/data-scientist-nanodegree--nd025)
* [Datacamp](https://www.datacamp.com/)
* [Kaggle](https://www.kaggle.com/)...probably the best resource. They host a **ton of competitions with free data** to start practicing your own data science skills!


### Re-introduction to Jupyter


#### How to use this notebook?

A Jupyter Notebook is an interactive way to work with code in a web browser. 

Jupyter is a pseudo-acronym for three programming languages: 
* Ju - [Julia](https://julialang.org/) 
* Pyt - [Python](https://www.python.org/), and
* eR - [R](https://www.r-project.org/about.html)

Notebooks let you write instructions and run code in one file - this makes writing and running code simpler for you!

#### Essential Keyboard Commands for Notebooks

You need to know some keyboard shortcuts to use notebooks effectively.

We suggest you start with [Max Melnick's blog post](http://maxmelnick.com/2016/04/19/python-beginner-tips-and-tricks.html) on keyboard shortcuts for beginners, but the following tips are essential for navigating and running code in a notebook:

* A cell with a **<span style="color:blue">blue</span>** background is in <strong style="color:blue">Command Mode</strong>. This will allow you to toggle up/down cells using the `arrow keys`. You can press `enter/return` on a cell in command mode to enter edit mode

* A cell with a **<span style="color:green">green</span>** background is in <strong style="color:green">Edit Mode</strong>. This will allow you to change the content of cells. You can press the `escape key` on a cell in edit mode to return to command mode

* To run the contents of a cell, you can type:
  * `ctrl + enter`, which will run the contents of a cell and keep the cursor in place
  * `shift + enter`, which will run the contents of a cell, and move the cursor to the next cell (or create a new cell)

### Exercise

Edit the below by changing "Gretchen" to your own name by entering edit mode, and then run the cell using the directions above.

In [None]:
print("Gretchen")

We can add/delete cells using the following commands in <span style="color:blue">**Command Mode**</span>:

- `a`, adds a cell above the current cell

- `b`, adds a cell below the current cell

- `d + d`, (pressing the "d" key twice in succession) deletes a cell


### Exercise

Add/delete the cells such that each individual cell prints the numbers 1-5 in order. The numbers 2 and 4 are already completed for you.

In [None]:
print(2)

In [None]:
print(33)

In [None]:
print(4)

### Re-introduction to Python

Python is an _interpreted_ programming language that was conceived in the late 1980s and first released in 1991. It's actually named after the BBC comedy series Monty Python's Flying Circus.

#### Why learn Python?

Python has gained popularity because it has an easier syntax (rules to follow while coding) than many other programming languages. Python is also very diverse in its applications, which has led to its adoption in areas such as data science and web development.

All of the following companies actively use Python:

![Image](https://www.probytes.net/wp-content/uploads/2018/08/appl.png)

#### Is Python the _only_ language used in Data Science?

**NO!!** It's not! R is another language frequently used by data scientists.

---

<img src="https://i.stack.imgur.com/y1cQw.png" width="600">

---

Python has a growing user base, so we've chosen to teach our course in Python!

---

<img src="https://cdn-images-1.medium.com/max/1600/0*5sXl34xnPk7LW6YP" width="600">

---

## Part 1: Math basics

[Top](#ML-Week-1---Introduction-and-Prob/Stat) | [Previous section](#Part-0:-What-is-machine-learning?) | [Next section](#Part-2:-Probability) | [Bottom](#Downloading-the-notebook)

Let's start with the basics, and we'll build up from there.

### Scalars, Vectors and Matrices

> A **scalar** is a number, like the number `3.5`, or the number `-15`.


> A **vector** is a list, or array of numbers. We can represent vectors in python using objects like **lists**, **numpy arrays** and **pandas series**.

We will often represent vectors in [latex notation](http://web.ift.uib.no/Teori/KURS/WRK/TeX/symALL.html). Here is a two-element vector, labeled as the variable $\textbf{a}$.

$$\textbf{a} = [2, 3]$$


> A **matrix** is a rectangular array of numbers, arranged in rows and columns. We can create matrices using objects in python like **2-dimensional lists**, **2-dimensional numpy arrays (or matrices)** and **pandas DataFrames**.

Here is an example of a matrix with 2 rows and 3 columns, also called a 2x3 matrix.

$$
\textbf{A} = \begin{bmatrix}
    -1 & 3.5 & 106 \\
    25 & -7.6 & 0.3
\end{bmatrix}
$$

Run the following code cell. It will do the following...

* Import in numpy and pandas
* Create a set of scalars
* Create a set of vectors, using different python objects
* Create a set of matrices, using different python objects

In [None]:
# Import numpy and pandas
import numpy as np
import pandas as pd

# Create scalars
a = 3.5
b = -20

print('Scalars: ')
print(a)
print(b)
print("\n")

# Create vectors
m = [10.0, 20.3, -30.0]
n = np.array([50.0, 25, 36, 0.6])
o = pd.Series(data=[20.6, 30.0, 41.0], name='my_data')

print('Vectors: ')
print(m)
print(n)
print(o)
print("\n")

# Create matrices
v = [[20.0, 30.0], [10.0, 5], [11, -6]]
w = np.array([[11, 12, 14], [-5, -7, 88]])
x = np.matrix([[20.0, 30.0], [27, 35]])
y = pd.DataFrame(data=[[20.0, 11], [30.0, 54], [26, 43]], columns=['Column_1', 'Column_2'])

print('Matrices: ')
print(v)
print(w)
print(x)
print(y)

### Notation

#### Subscript

We often refer to a single element of a vector $\textbf{a}$ as $a_i$. For instance, for the vector:

$$\textbf{a} = [2, 3, 6], \ a_1 = 2,\ a_2 = 3,\ a_3 = 6$$

#### Sigma

Often times, we might use something called **capital sigma** notation. This simply represents a **sum** of numbers. For example, the following states to **sum the values of a.**

$$ \sum_{i=1}^{3}a_i = a_1 + a_2 + a_3 = 2 + 3 + 6 = 11 $$

####  Pi

We might also use **capital pi** notation. This represents a **multiplication** of numbers. For example, the following states to **multiply the values of a.**

$$ \prod_{i=1}^{3} a_i = a_1 \times a_2 \times a_3 = 2 \times 3 \times 6 = 36 $$
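
To make the two notations concrete, here is a minimal sketch that spells them out as plain Python loops. The vector `v` is a made-up example (it is not used elsewhere in this notebook), so the exercise below is still yours to solve.

```python
# Hypothetical example vector (an assumption for illustration only)
v = [1, 4, 5]

# Capital sigma: start at 0 and add each element in turn
total = 0
for v_i in v:
    total = total + v_i

# Capital pi: start at 1 and multiply by each element in turn
product = 1
for v_i in v:
    product = product * v_i

print(total)    # 10
print(product)  # 20
```

Note that the sum starts from 0 and the product starts from 1, since those are the values that leave a sum or a product unchanged.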

### Exercise

The following code defines a single vector, called `a`. Your job is to calculate...

1. $\sum a_i$, using the `np.sum(my_vec)` command

2. $\prod a_i$, using the `np.prod(my_vec)` command

In [None]:
# Create a vector
a = np.array([2, 3, 6])

# MODIFY THE CODE TO CALCULATE THE SUM HERE
print(a)

# MODIFY THE CODE TO CALCULATE THE PRODUCT HERE
print(a)

### Functions

Functions (_mathematical_ functions, specifically) define a relationship between an input and an output. For instance, the following function squares its input and adds 1 to the result.

$$ f(x) = x^2 + 1 $$

### Exercise

Code the above function, $f(x) = x^2 + 1$, into python below, by creating a Python function called `square_plus_one`. Run the function on each value of the input vector $\textbf{a} = [-5, -4, 0, 6, 7]$, and store the output in an `np.array` called `b`. The result for the input vector should be $\textbf{b} = [26, 17, 1, 37, 50]$

We have already started the code for you, and initialised the $\textbf{a}$ and $\textbf{b}$ arrays. We have also outlined the skeleton to the function.

In [None]:
# Define input
a = np.array([-5, -4, 0, 6, 7])

# Define output
b = np.array([0, 0, 0, 0, 0])

# Function setup
def square_plus_one(v):
    
    # MODIFY THE CODE ON THE LINE BELOW
    # TO square a number we can use the "**" operator
    # for example 3**2 = 9
    res = 0
    
    return res

# Run the function on each element
for i in range(len(a)):
    b[i] = square_plus_one(a[i])
    
# Print b
print(b)

### Graphing functions

It's tough to understand the relationship that a function defines without **graphing** the function. As you might remember, Python has some handy tools for visualising data, specifically the [matplotlib](https://matplotlib.org/) and [seaborn](https://seaborn.pydata.org/) libraries.

Let's plot a visual of the two vectors we just made, $\textbf{a}$ and $\textbf{b}$. Run the cell below to see what our data looks like. We'll need to import matplotlib and seaborn first.

In [None]:
# Imports
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Data setup
a = np.array([-5, -4, 0, 6, 7])
b = np.array([26, 17, 1, 37, 50])

# Graph the data (lmplot creates its own figure, so we size it with `height`)
sns.lmplot(x='a', y='b', data=pd.DataFrame({'a': a, 'b': b}), order=2, ci=None, height=10)
plt.title('b vs. a')

#### Small tangent...error functions

In machine learning, we are often trying to **minimise an error**. For instance, imagine I have a table, where the first column represents the square footage of a house, and the second column represents its price.

| Square Footage | Price | 
| - | - |
| 1,000 | \$100,000 |
| 1,200 | \$100,000 | 
| 1,200 | \$100,000 |
| 1,200 | \$180,000 |
| 2,100 | \$585,000 |

We might want to create an algorithm that predicts the price from a given square footage. In other words, our goal is  to **create a function** that says:

$$f(square\_footage) = price $$

We can use `sns.lmplot` to draw a straight line through this data as a first attempt at such a function. Run the cell below to do so.

In [None]:
# Create a dataframe
housing_df = pd.DataFrame({
    'Square Footage': [1000, 1200, 1200, 1200, 2100],
    'Price': [100000, 100000, 100000, 180000, 585000]
})

# Plot (lmplot creates its own figure, so we size it with `height`)
sns.lmplot(x='Square Footage', y='Price', data=housing_df, ci=None, height=10)
plt.title('Housing Price vs. Square Footage')

Now how do we get this line? Usually, the algorithm creates a function that **minimises the error between each point in our data and the line itself**.

We'll talk more about this next week in our **linear regression** lesson, but today we'll dive into the concept of **minimisation**.
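
As a rough illustration of what "minimising the error" can mean, the sketch below measures the error of a line as the sum of squared vertical distances between each house price and the line's prediction. That choice of error (least squares) is an assumption made here for illustration, and the hand-picked slope and intercept are arbitrary guesses. `np.polyfit` with `deg=1` returns the straight line that minimises this particular error.

```python
import numpy as np

# Housing data from the table above
sqft = np.array([1000, 1200, 1200, 1200, 2100], dtype=float)
price = np.array([100000, 100000, 100000, 180000, 585000], dtype=float)

def sum_squared_error(slope, intercept):
    """Sum of squared differences between actual and predicted prices."""
    predicted = slope * sqft + intercept
    return np.sum((price - predicted) ** 2)

# An arbitrary hand-picked line (hypothetical guess)
guess_error = sum_squared_error(slope=300.0, intercept=-200000.0)

# The least-squares line: polyfit returns [slope, intercept] for deg=1
best_slope, best_intercept = np.polyfit(sqft, price, deg=1)
best_error = sum_squared_error(best_slope, best_intercept)

print(f"hand-picked line error:   {guess_error:.3e}")
print(f"least-squares line error: {best_error:.3e}")
```

Whatever slope and intercept you try by hand, the error should never come out lower than the least-squares line's error.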

### Optimisation, Derivatives and Gradients

Optimisation is the field of mathematics devoted to finding the **minimums and maximums** of different things, whether it be functions, or the efficiency of a process, or anything else.  The entire field of [industrial engineering](https://en.wikipedia.org/wiki/Industrial_engineering) is largely based upon optimisation and its processes.

When we **minimise an error** function, we are often looking to find a **global optimum** of a solution. Sometimes functions have many **local optimum** values, and it can be tough to find a global value. 

---

![](https://www.researchgate.net/profile/Joze_Tavcar/publication/264812060/figure/fig5/AS:358257924296708@1462426753350/Global-and-local-optimum.png)

---

#### Derivatives

Oftentimes we use what's called the **derivative** of a function to find its **optimum** values. The **derivative** is itself a function: at each input, it gives the slope of the original function.

### Exercise

The following code creates a three column dataframe, with columns, `X, F_X, Derivative(F_X)`. It then graphs `X vs. F_X` on a `sns.lineplot`. It also plots a line at $y=0$.

Add one additional line of code to also graph `X vs. Derivative(F_X)` on the same `sns.lineplot`. Is there any relationship between the minimum of `F_X` and the `Derivative(F_X)`?

In [None]:
# Create the dataframe
df = pd.DataFrame({
    'X': [-5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5],
    'F_X': [29, 20, 13, 8, 5, 4, 5, 8, 13, 20, 29],
    'Derivative(F_X)': [-10, -8, -6, -4, -2, 0, 2, 4, 6, 8, 10]
})

# Plot
sns.lineplot(x='X', y='F_X', data=df, ci=None)
sns.lineplot(x='X', y=[0] * df.shape[0], data=df, ci=None, color='grey')

# INSERT YOUR CODE HERE


The derivative of a function crosses the $y=0$ line at the local optimum value. The derivative tells us information about **how fast or slow a function is increasing/decreasing**.

#### Gradients and Gradient Descent

A **gradient** is the generalisation of a derivative to functions with more than one input. When a function has several inputs, we'll use the word **gradient** instead of derivative.

It's often not easy to find the global minimum exactly when we have really complex error functions, so we often use techniques to _approximate_ the minimum. One of these techniques is called **gradient descent**, which slowly works its way down an error function until it reaches a minimum, as the picture below shows.

---

![](https://ml-cheatsheet.readthedocs.io/en/latest/_images/gradient_descent_demystified.png)

---
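
Here is a minimal sketch of gradient descent on a function of one input. It uses $f(x) = x^2 + 4$, which matches the `F_X` values in the derivatives exercise above, with derivative $2x$. The starting point, the step size (`learning_rate`) and the number of steps are arbitrary assumptions chosen for illustration.

```python
def f(x):
    """The function we want to minimise."""
    return x ** 2 + 4

def derivative_f(x):
    """The derivative (slope) of f at x."""
    return 2 * x

x = 5.0              # arbitrary starting point
learning_rate = 0.1  # how big a step we take each time

for step in range(100):
    # Step in the direction opposite to the slope, i.e. downhill
    x = x - learning_rate * derivative_f(x)

print(x, f(x))
```

After enough steps, `x` ends up very close to 0, where the derivative is zero and `f(x)` reaches its minimum of 4.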

## Part 2: Probability

[Top](#ML-Week-1---Introduction-and-Prob/Stat) | [Previous section](#Part-1:-Math-basics) | [Next section](#Part-3:-Variable-relationships-and-some-statistics) | [Bottom](#Downloading-the-notebook)

Every day, random events occur around us. There is a chance it will rain today, a chance that Donald Trump will not be President of the U.S., or that there will be traffic on the way to work tomorrow. The field of **probability** explains the science of how we can estimate and model random events.

> A **probability** is the likelihood or chance of a random event occurring.

If you think of what we do in machine learning, we try to **predict** events and show **relationships** within a dataset, and all of our data points can usually be modeled by probabilities. Thus, much of machine learning is built upon probability, and we will take a brief detour into this field.

### Random Variables

A **random variable** is a variable whose possible values reflect the outcome of a specific event. Here are a few examples of random processes...

* Flipping a coin. There are two values our random variable might take on, $H$ for heads, or $T$ for tails.
* The time this class will take, measured in _minutes_, which on average will likely be about 2 hours.

### Continuous vs. discrete random variables

The first example highlights a **discrete random variable**, because our random variable can only take on a specific set of distinct values.

The second example highlights a **continuous random variable**, since there are an infinite number of possible values that the variable can take on, namely any number of minutes, seconds, milliseconds, and everything in between.

### Distribution function

The following code samples 100,000 values from a fair coin toss (meaning equally likely to get heads or tails), and then creates a **histogram** of the results.

In [None]:
# Toss a coin 100,000 times
coin_toss = np.random.choice([1, 0], size=100000, replace=True, p=[0.5, 0.5])

# Graph these values
sns.distplot(coin_toss, kde=False, bins=2, norm_hist=True, hist_kws={'edgecolor':'white'})
sns.despine()

What do you notice about the total area of the bars? You likely see that

* Each bar has a height of ~1.0
* Each bar has a width of ~0.5

If we add up the areas of these bars (height times width), we get a total of 1.0.

A normalised **histogram** approximates the **probability mass function** (for a discrete random variable) or the **probability density function** (for a continuous random variable). The rule for a density function is that the total area under it must equal 1.0.
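
The area claim can be checked numerically. The sketch below repeats the coin toss and uses `np.histogram` with `density=True` to get the bar heights and bin edges without plotting. The seeded random generator is an assumption added so the result is repeatable.

```python
import numpy as np

# Seeded generator so the example is repeatable (an assumption for illustration)
rng = np.random.default_rng(seed=0)
coin_toss = rng.choice([1, 0], size=100_000, p=[0.5, 0.5])

# density=True scales the bar heights so the total area is 1
heights, edges = np.histogram(coin_toss, bins=2, density=True)
widths = np.diff(edges)

total_area = np.sum(heights * widths)

print(heights)     # each close to 1.0
print(widths)      # each 0.5
print(total_area)
```

Each bar's area (height times width) is roughly 0.5, which is the probability of that outcome, and together the two areas add up to 1.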

### Exercise

Let's graph the second process.

The following code samples 100,000 times from a **normal distribution** with a **mean of 2**, modeling the distribution of time we spend in our machine learning class. We'll talk about what these things mean (pun intended) in a second. Your job is to...

* graph the resulting density in an `sns.distplot()`. You can copy the code we just used
* Please set `kde=False`, `bins=None` and `norm_hist=True`.

In [None]:
# Sample 100,000 values
class_time = np.random.normal(loc=2.0, scale=(1.0/6.0), size=100000)

# INSERT YOUR CODE HERE


That might look like a familiar shape...we'll get back to what this shape means.

### Expected value (also known as a mean, or average)

The **expected value**, otherwise known as the **mean** or **average**, of a random variable is a measure of the **center** of a distribution.

It can be calculated in many ways, but normally, if we have a sample of values, we calculate it by summing up the values and dividing by the total number of values we have. The mean is often denoted by the Greek letter mu, or $\mu$.

$$ \mu = \frac{1}{N}\sum_{i=1}^{N}X_i = \frac{X_1 + X_2 + ... + X_N}{N} $$

In the above plot, the **mean** or center was about 2.0.
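
As a quick check of the formula, the sketch below draws a sample like the class-time one, computes the mean "by hand" (sum divided by count) and compares it with `np.mean`. The seeded generator and the variable names are assumptions made for this illustration.

```python
import numpy as np

# Seeded generator so the example is repeatable (an assumption for illustration)
rng = np.random.default_rng(seed=1)
sample = rng.normal(loc=2.0, scale=1.0 / 6.0, size=100_000)

# Mean by the formula: sum the values, then divide by how many there are
manual_mean = np.sum(sample) / len(sample)

# Mean using numpy's built-in function
numpy_mean = np.mean(sample)

print(manual_mean)
print(numpy_mean)
```

Both values should agree with each other and sit very close to 2.0, the centre of the distribution we sampled from.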

### Variance and Standard Deviation

The **variance** and **standard deviation** define the **spread** of a distribution. Spread is how wide or thin a distribution is. The standard deviation is often signified by the lowercase Greek letter sigma, or $\sigma$.

If a distribution has a higher variance, there is more **randomness** in an outcome. Here is an example of two distributions of IQ scores, one with a small standard deviation, and one with a large standard deviation.

![](https://spss-tutorials.com/img/standard-deviation-example-4.png)

* The left distribution has a _small_ standard deviation.
* The right distribution has a _large_ standard deviation
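
To make spread concrete: the variance is the average squared distance of each value from the mean, and the standard deviation is the square root of the variance. The sketch below computes both by hand for a small made-up dataset and compares them with `np.var` and `np.std`. It assumes the "divide by N" (population) form, which is also numpy's default.

```python
import numpy as np

# Small made-up dataset (an assumption for illustration)
x = np.array([2, 4, 4, 4, 5, 5, 7, 9], dtype=float)

mean = np.sum(x) / len(x)

# Variance: average of the squared distances from the mean
variance = np.sum((x - mean) ** 2) / len(x)

# Standard deviation: square root of the variance
std_dev = np.sqrt(variance)

print(mean)      # 5.0
print(variance)  # 4.0
print(std_dev)   # 2.0

# numpy's built-in versions use the same "divide by N" form by default
print(np.var(x), np.std(x))
```

Because the standard deviation is in the same units as the data, it is usually the easier of the two to interpret.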

### Thought exercise

* What are examples of processes in real-life that have a lot of randomness, and likely large standard deviations?
* What are examples of processes that have small standard deviations?
* If someone gave you a Verbal IQ score of 160 and a Spatial IQ score of 160, which would you more confidently say is less likely to occur, based upon the given data?
* Take a look at [538's 2016 election forecast](https://projects.fivethirtyeight.com/2016-election-forecast/), specifically the "electoral votes" section in "How the forecast has changed". Based upon your new knowledge of variance, what do you think this shows? Scroll through this and find other nuances that might not be so obvious.

### Normal Distribution

The **normal distribution** is a very common distribution in probability and statistics. It's pictured below.

---

<img src="https://www.syncfusion.com/books/Statistics_Using_Excel_Succinctly/Images/normal-curve.png" width="500">

---

There is a special case of a normal distribution, called the **standard normal distribution**. ("Gaussian distribution" is simply another name for the normal distribution in general.) There are a few special things about this distribution.

* It is symmetric (like every normal distribution)
* It has a mean of 0
* It has a standard deviation of 1
* It makes it easier to estimate the likelihood that something is going to happen from the mean and standard deviation

It turns out that a lot of things in nature are approximately governed by a normal distribution. We won't go too deep into why, but we'll see it in action!

### Exercise

The code in this exercise samples from what is called a **binomial distribution**. A binomial distribution is often used to model events like...

* If there's a 30% chance it'll rain over the next seven days, what's the probability it rains 4 out of 7 days?
* If there's a 0.01% chance I'll win the lottery per ticket, and I bought 100 tickets, what's the probability I will have 20 winning tickets out of these 100 (super low...)
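
The rain question above can be answered exactly with the binomial probability formula, $P(k) = \binom{n}{k} p^k (1-p)^{n-k}$. The sketch below assumes that each of the 7 days independently has a 30% chance of rain, and then checks the answer by simulation. The seeded generator and the variable names are assumptions made for this illustration.

```python
import math
import numpy as np

n_days = 7    # number of days
k_rainy = 4   # number of rainy days we are asking about
p_rain = 0.3  # assumed chance of rain on any single day

# Exact probability: (ways to choose the rainy days) * P(rain)^k * P(dry)^(n-k)
exact = math.comb(n_days, k_rainy) * p_rain ** k_rainy * (1 - p_rain) ** (n_days - k_rainy)

# Simulation: draw many 7-day weeks and count how often exactly 4 days are rainy
rng = np.random.default_rng(seed=2)
weeks = rng.binomial(n=n_days, p=p_rain, size=200_000)
simulated = np.mean(weeks == k_rainy)

print(exact)      # about 0.097
print(simulated)  # should be close to the exact value
```

So under these assumptions, it rains on exactly 4 of the 7 days a little under 10% of the time.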

In the code cell below, we sample from a binomial distribution, and the number of times we sample is governed by the variable `s` defined in the code. We then graph the resulting samples.

Play with values of `s` (make them greater). How does the distribution change as we sample more values?

In [None]:
# CHANGE S HERE
s = 10

# Sample from a binomial "s" times
my_sample = np.random.binomial(n=1000, p=0.3, size=s)

# Plot
sns.distplot(my_sample, kde=False)

## Part 3: Variable relationships and some statistics

[Top](#ML-Week-1---Introduction-and-Prob/Stat) | [Previous section](#Part-2:-Probability) | [Next section](#Downloading-the-notebook) | [Bottom](#Downloading-the-notebook)

### Order statistics: median and IQR

#### Median

Another measure of center is the **median**.

> A **median** is the value separating the higher and lower halves of a sample of data.

Let's say we have the following dataset...

$$\textbf{a} = [5, 12, 0, 4, 7]$$

If we order the dataset, the central value is the median.

$$\textbf{a} = [0, 4, \textbf{5}, 7, 12]$$.

The bolded $5$ is the median.

#### IQR

Another name for the median is the **50th percentile, or quantile (50%)** because it divides the data into two equal halves. If the data has an _even_ number of values, the median can be found by averaging the two middle data values.

We can find any other percentiles by performing a similar exercise. Two other important percentiles are the...

* **25th percentile (25%)** which splits the data between the lower quartile and the upper three quartiles
* **75th percentile (75%)** which splits the data between the lower three quartiles and the upper quartile

Here is a dataset...the 25th, 50th and 75th percentiles are bolded.

$$\textbf{a} = [5,\ 20,\ \textbf{37},\ 42,\ \textbf{78},\ 81,\ \textbf{83},\ 90,\ 94] $$

The **IQR** is another measure of spread.

> The **interquartile range, or IQR** is the difference between the 75th and 25th percentiles within a data sample.

Since all of these measures depend upon the order of the data, we call these measures **order statistics**. 

One last statistic is the **range**.

> The **range** is the difference between the **max** and **min** values in a dataset.
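
These order statistics can be computed directly with numpy. The sketch below uses the nine-value dataset shown above and `np.percentile`, which sorts the data for us. The variable names are assumptions made for this illustration.

```python
import numpy as np

# The example dataset from above
a = np.array([5, 20, 37, 42, 78, 81, 83, 90, 94])

# 25th, 50th (median) and 75th percentiles
q25, q50, q75 = np.percentile(a, [25, 50, 75])

# IQR: distance between the 75th and 25th percentiles
iqr = q75 - q25

# Range: distance between the largest and smallest values
data_range = np.max(a) - np.min(a)

print(q25, q50, q75)  # 37.0 78.0 83.0
print(iqr)            # 46.0
print(data_range)     # 89
```

The three percentiles match the bolded values in the dataset above.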

### Exercise


The following code creates a pandas dataframe that simulates household income in the U.S. You can find documentation on the [actual dataset here](https://healthinequality.org/rankings/).

Please calculate and print the following:

* The mean (we give this to you)
* The median
* The 25th percentile. [Hint here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.quantile.html).
* The 75th percentile
* The IQR

In [None]:
# Create data
income_data = pd.DataFrame(
    np.random.lognormal(mean=11, sigma=1.15, size=10000),
    columns=['hh_inc']
)

# Here is how you access the household income column
print(income_data[['hh_inc']].head())
print("\n")

# Finding the mean
print(income_data['hh_inc'].mean())

# MODIFY THE CODE HERE
print(income_data['hh_inc'].mean())
print(income_data['hh_inc'].mean())
print(income_data['hh_inc'].mean())
print(income_data['hh_inc'].mean() - income_data['hh_inc'].mean())

#### Boxplots

A boxplot is a visual that summarises data based upon the median and the IQR. It also shows us **outliers**, or values that are very uncommon and far away from the center of the dataset.

<img src="https://cdn-images-1.medium.com/max/1600/1*2c21SkzJMf3frPXPAR_gZA.png" width="700">

The following code will create a boxplot of our data:

In [None]:
# Create boxplot
sns.boxplot(x=income_data['hh_inc'])

#### Outliers and skewness

As you see, we have **a lot of outliers!!** (the dots on the very right). Often, our choice of whether to summarise data by the mean or median is due to how many outliers we have. As you likely saw in the data above, the mean was **much higher** than the median. This is because outliers affect the mean _much more_ than the median.

We call this difference **skewness**. Here are examples of positively and negatively skewed distributions. Our own data is **positively skewed**, which is expected of income distributions.

---

![](https://www.managedfuturesinvesting.com/images/default-source/default-album/measure-of-skewness.jpg)

---

### Exercise

Skewness can often be [measured](https://www.statisticshowto.datasciencecentral.com/pearsons-coefficient-of-skewness/) by the following formula...

$$ skewness = \frac{mean - median}{standard\ deviation}$$

Calculate the measure of skewness for household income in the cell below.

In [None]:
# INSERT YOUR CODE HERE


### Correlation

The last type of statistic we'll talk about is **correlation**. If you think about machine learning algorithms, we depend upon a set of variables in data to give us meaningful insight. We thus depend upon the **interactions** between these variables to find a pattern, or predict a result.

Two things correlate when:

> Either of two things so related that one directly implies or is complementary to the other (from Merriam-Webster)

Put into mathematics, this simply means that knowing information about one variable implies information about another variable. There are two main types of correlation.

> A positive correlation means that as the value of one variable rises, the value of the other variable tends to rise too. Likewise, as the value of one variable falls, the value of the other tends to fall.

> A negative correlation means the opposite. As the value of one variable rises, the value of the other variable tends to fall.

There is a numerical measure of correlation, called the correlation coefficient, or $r$. There are three key points about the $r$ value of a pair of variables...

* An $r$ that approaches 1 implies positive correlation
* An $r$ that approaches -1 implies negative correlation
* An $r$ that approaches 0 implies no (linear) correlation

Here are some pictures to show you examples of positively correlated, uncorrelated, and negatively correlated variables with corresponding $r$ values.

---

![](https://www.statisticshowto.datasciencecentral.com/wp-content/uploads/2012/10/pearson-2-small.png)

---
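
The correlation coefficient (Pearson's $r$) can be computed with `np.corrcoef`, which returns a small matrix whose off-diagonal entry is the $r$ value for the two inputs. The sketch below uses tiny made-up vectors (assumptions for illustration) constructed so that the three cases come out cleanly.

```python
import numpy as np

# Made-up data (assumptions for illustration)
x = np.array([1, 2, 3, 4, 5], dtype=float)
y_up = 2 * x + 1                                     # rises exactly in step with x
y_down = -3 * x + 10                                 # falls exactly in step with x
y_none = np.array([1, -2, 2, -2, 1], dtype=float)    # no linear relationship with x

# np.corrcoef returns a 2x2 matrix; entry [0, 1] is r between the two inputs
r_up = np.corrcoef(x, y_up)[0, 1]
r_down = np.corrcoef(x, y_down)[0, 1]
r_none = np.corrcoef(x, y_none)[0, 1]

print(r_up)    # close to 1
print(r_down)  # close to -1
print(r_none)  # close to 0
```

Real data is rarely this clean, so in practice you will mostly see values somewhere strictly between -1 and 1.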

### Exercise

Run the following code cell to load the [Boston housing pricing dataset from sklearn](https://scikit-learn.org/stable/datasets/index.html#boston-dataset). This dataset has a set of 14 columns, described below, that can be used to predict the median value of owner occupied homes given data about an area.

* CRIM per capita crime rate by town
* ZN proportion of residential land zoned for lots over 25,000 sq.ft.
* INDUS proportion of non-retail business acres per town
* CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
* NOX nitric oxides concentration (parts per 10 million)
* RM average number of rooms per dwelling
* AGE proportion of owner-occupied units built prior to 1940
* DIS weighted distances to five Boston employment centres
* RAD index of accessibility to radial highways
* TAX full-value property-tax rate per \$10,000
* PTRATIO pupil-teacher ratio by town
* B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
* LSTAT % lower status of the population
* MEDV Median value of owner-occupied homes in \$1000's (**this is what we usually predict**)

In [None]:
# Import sklearn dataset
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version
from sklearn.datasets import load_boston
boston = load_boston()
boston_df = pd.DataFrame(boston['data'], columns=boston['feature_names'])
boston_df['MEDV'] = boston['target']

target = boston['target']

The following code creates a [correlation matrix](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html) of our data within a variable called `boston_corr`. A cell within a correlation matrix shows the $r$ value for the pair of variables that its row and column represent.

Your job is to do the following.

* Create a [sns.heatmap](https://seaborn.pydata.org/generated/seaborn.heatmap.html) to visualise the correlation matrix

Then analyse the following:

* Which columns are positively correlated with each other ($r > 0$)?
* Which columns are negatively correlated with each other ($r < 0$)?
* Which columns completely correlate ($r = 1$)? Why would this be?

In [None]:
# Create correlation matrix
boston_corr = boston_df.corr()

# Make a figure
plt.figure(figsize=(8, 5))

# INSERT YOUR CODE HERE


## Thank you!!

[Top](#ML-Week-1---Introduction-and-Prob/Stat) | [Previous section](#Part-3:-Variable-relationships-and-some-statistics) | [Next section](#Downloading-the-notebook) | [Bottom](#Downloading-the-notebook)

That concludes our week 1 lesson. Hopefully you enjoyed it :)


## Downloading the notebook
If you would like to retain your work, please follow these directions:

* On the top of this screen, in the header menu, click "File", then "Download as" and then "Notebook".
* You will need to download [Python 3.7 with Anaconda](https://www.anaconda.com/distribution/#download-section) to use this in the future