

                                            
# **Mathematics for Machine Learning** - Part I





Before we move on to `NumPy`, `Pandas`, & the whole "Scientific Python Stack", we think it would be good to get familiar with some of the underlying **mathematics for ML.**  


_**Please don't worry or panic!**_  Yes - these can be very complex topics, but you don't have to master them in their _entirety_ to benefit from them, or to use the principles & equations effectively.  Additionally, we can always decide to delve deeper into particular topics, if you need clarification or want more!


&nbsp;
_____

### At the Heart:  Random Variables

The "learning" part of Machine Learning leverages `Random Variables`.    

In probability & statistics, a `Random Variable` is a type of _**variable**_ whose value is subject to _**variations**_ due to  mathematical _**chance**_ (in other words probability -- that's the 'random' part). In contrast to other programming or mathematical variables (_where a variable represents a **single** unknown or assigned value or data structure_), a `Random Variable` can take on an entire _**set of possible different values**_ -- each of which carries an associated probability (odds, chance) of it happening.

&nbsp;

A `Random Variable`'s _**possible values**_ might represent the _**possible outcomes**_ of a yet-to-be-performed experiment, or the _**possible outcomes**_ of an experiment whose already-existing values are uncertain (for example, as a result of incomplete information). 

`Random Variables` can also represent the results of either:

1.  An _**objectively random**_ process (_think rolling dice, spinning roulette, or a 'random walk'_)
2.  A _**subjective random process**_ resulting from incomplete knowledge of a quantity or outcome (_What is the probability it will rain today & I'll need my umbrella??, If I surf in the Pacific, what are the chances of shark attack?, What are the chances that my Jelly Belly bag will have 53 red Jelly Beans?_).





`Random Variables` can further be classified as either _**discrete**_ or as _**continuous**_ 

<br><br>
<img width="60%" src="./images/Random Variables.png">

<br><br>

A **Discrete Random Variable** takes on a *countable* number of *distinct values*, so its possible values can be listed out one by one. For example, `Random Variable`**`R`** can be defined as the number that comes up when you roll a fair die. 
**`R`** can take on any of the _**possible values**_ **`[1,2,3,4,5,6]`** (_each of which has a probability of 1/6, or about 0.167_). Each potential outcome is **_distinct_**, & all possible outcomes can be **_enumerated_** (countable):

<br><br>
<img src="./images/Discrete RV.png">
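
As a rough sketch of the fair-die example above (standard library only; the names `die_distribution` and `frequencies`, the seed, and the number of rolls are our own illustrative choices), we can list each possible value with its probability and then compare against simulated rolls:

```python
import random
from fractions import Fraction

# The possible values of R (one roll of a fair six-sided die)
faces = [1, 2, 3, 4, 5, 6]

# Each possible value carries the same probability: 1/6
die_distribution = {face: Fraction(1, 6) for face in faces}

# The probabilities of all possible values add up to exactly 1
total_probability = sum(die_distribution.values())

# Simulate many rolls; the seed is arbitrary and only makes the run repeatable
rng = random.Random(0)
rolls = [rng.choice(faces) for _ in range(60_000)]
frequencies = {face: rolls.count(face) / len(rolls) for face in faces}

print("Sum of all probabilities:", total_probability)
for face in faces:
    print(face, "theory:", round(float(die_distribution[face]), 3),
          "simulated:", round(frequencies[face], 3))
```

The simulated frequencies will not be exactly 1/6, but they should settle close to it as the number of rolls grows.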

&nbsp;

In contrast, a **Continuous Random Variable** can take on an _infinite number_ of possible values (uncountable). These values are drawn from an **interval** or **collection of intervals**. `Random Variable`**`R`** here can represent the height of a student in a class: heights fall within a range, but any individual height could be any one of an _infinite set_ of specific numbers in that range. The probability (_chance_) of a student having a height that falls in a given interval is represented by the area under the corresponding section of the curve.

<br><br>
<img width="90%" align="center" src="./images/Normal Curve w Random Variable.png">


&nbsp;


The mathematical function (_formula, equation_) describing the possible values of a discrete or continuous `Random Variable` and the associated probabilities of each outcome is known as a _**probability distribution**_ or _**statistical distribution**_.  Since all random functions, variables, & operations are based on these **_statistical distributions_**, we're going to cover some of the more common and useful ones in this session.

&nbsp;

Later on, we'll go through some ***derivatives*** -- which is a fancy way of saying equations that calculate the slope of the line that just touches a curve at a specific point (_the tangent line_). They're one of the most important building blocks of ML algorithms -- used in things like _**cost functions**_ & _**gradient descent**_.

&nbsp;

In the next session we'll dive deeper into probability, statistical concepts & derivatives. In this one, let's cover equations, matrices, introductory probability & common distributions.

_**Let's get started !**_


## Some Mathematical Refreshers 

1.  [Simplilearn Math Refresher](https://www.simplilearn.com/math-refresher-machine-learning-tutorial)
2.  [Edx.org Essential Math for ML](https://www.edx.org/course/essential-math-for-machine-learning-python-edition)
3.  [Mathematics for Machine Learning Book](https://mml-book.github.io/)

## Where to find help?

1. Khan Academy- The best place for clear explanations!
   * [Variables](https://www.khanacademy.org/math/algebra/introduction-to-algebra/alg1-intro-to-variables/v/what-is-a-variable)  
   * [Coefficients](https://www.khanacademy.org/math/cc-sixth-grade-math/cc-6th-equivalent-exp/cc-6th-parts-of-expressions/v/expression-terms-factors-and-coefficients)
   * [Functions](https://www.khanacademy.org/math/algebra/algebra-functions)
   * [Linear equations](https://wikipedia.org/wiki/Linear_equation)
   * [The Sigmoid function](https://wikipedia.org/wiki/Sigmoid_function)
   * [Linear Algebra](https://www.khanacademy.org/math/linear-algebra)
   * [Multivariable Calculus](https://www.khanacademy.org/math/multivariable-calculus)
   * [Statistics & Probability](https://www.khanacademy.org/math/statistics-probability)  
   
  

2. [Stanford CS229 Review Notes](http://cs229.stanford.edu/section/cs229-linalg.pdf) 


3. [3Blue1Brown YouTube](https://www.youtube.com/channel/UCYO_jab_esuFRV4b17AJtAw)  
   *  _This channel is a good place to dive deep. It has **great visualizations**._  
   

4. [Coursera Mathematics for ML Specialization](https://www.coursera.org/specializations/mathematics-machine-learning)
   *  _This is the "long path" of math but **worth it**. Very beneficial in long run_ 




### _Having Doubts??_ 

....Well that's why we have our Slack channel! Drop all your doubts there, and let us deal with them. ;-) 

**Our Advice?** Start by reading/reviewing general resources around the topics below & dig in further with more detailed resources & exercises as required.  It's not as complicated as it looks.  _**Really**_.  Also?  _*we're programmers*_ - so we're going to make our computers do most of this math - we only really have to understand enough to _apply_ it.  :D



# Equations

### Khan Academy : [Linear equations](https://wikipedia.org/wiki/Linear_equation)  & [Polynomial equations](https://www.khanacademy.org/math/algebra/x2f8bb11595b61c86:quadratics-multiplying-factoring/x2f8bb11595b61c86:multiply-monomial-polynomial/v/polynomials-intro)  

&nbsp;

An **equation** states that two expressions are equal, and it can combine any number (_n_) of **variables**. The **variables (_`a,b,x,y`, etc._)** in math (_just like the variables in programming_) are **terms** (_symbols_) which do not have a **fixed value** -- unlike **constants (_`3,5,1,8`_)**, which **do** have a fixed value. We combine **variables** & **constants** using mathematical **operations (_`+,-,*,/`_)** to create **equations** (_functions_).

<img width="70%" align="left" src="./images/linear vs polynomial.png">
<br><br>

A **linear equation** is an equation whose variables are raised only to the _**first power**_. When graphed, it makes a **line**:

_**`1/2(x) + 2 = y`** in the graph to the left._


Conversely, a **polynomial equation** of higher degree has variables raised to _**powers of more than 1**_.  When graphed, these equations form **curves**:

_**`x^2 - x - 2 = y`** in the graph._


Another word for "power" here is _**degree**_.  We use _**degree**_ to refer to the **highest exponent** or **highest power** of the variables in the **polynomial**. 

**`x^2 - x - 2 = y`** is a **second degree, or _quadratic_ polynomial**
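
To make the line-vs-curve difference concrete, here is a small sketch that evaluates the two example equations above at a few values of `x`. The function names `linear` and `quadratic` and the chosen `x` values are just for illustration:

```python
def linear(x):
    """y = 1/2(x) + 2 : x appears only to the first power."""
    return 0.5 * x + 2


def quadratic(x):
    """y = x^2 - x - 2 : the highest power of x is 2 (degree 2)."""
    return x**2 - x - 2


xs = [-2, -1, 0, 1, 2, 3]
linear_ys = [linear(x) for x in xs]
quadratic_ys = [quadratic(x) for x in xs]

# For a line, y changes by the same amount for every equal step in x ...
linear_steps = [b - a for a, b in zip(linear_ys, linear_ys[1:])]
# ... while for a curve the size of the step keeps changing
quadratic_steps = [b - a for a, b in zip(quadratic_ys, quadratic_ys[1:])]

print("linear y values:   ", linear_ys)
print("linear steps:      ", linear_steps)
print("quadratic y values:", quadratic_ys)
print("quadratic steps:   ", quadratic_steps)
```

The constant step of the linear equation is its slope (1/2 here); the changing steps of the quadratic are what bend its graph into a curve.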





## Matrix


### Khan Academy : [Matrices](https://www.khanacademy.org/math/algebra-home/alg-matrices)  


A **Matrix** is nothing more than a type of _**array**_ (_in core Python think of a `list` or `list of lists` -- which are built on arrays_).  


<img width="55%" align="left" src="./images/Matrix_fun.png">
<br><br>


A **matrix** has two dimensions - **height** & **width** (_rows & columns_). 

In Python libraries such as `numpy` & `pandas`, matrices are modeled as _**multi-dimensional arrays**_ -- essentially tables -- of elements all with matching data types (_all `int`, all `float`....etc. etc._).  


When dealing with only **one dimension** of a matrix, we use the term **vector**. **Vectors** are just one-dimensional  _**matrices**_ (_a single `list` of items ... a flat array_). Matrices are essentially **_stacked vectors_** -- multiple rows / multiple columns glued together.


**Shape** is used to describe how many rows and columns are in a matrix (_**e.g.** `shape = 4x3` = a matrix with **4 rows** & **3 columns**, or `shape = 3x3x4` = a three-dimensional array with **3 rows**, **3 columns**, & **4 deep**_)  

**Rank** is used here to describe the _number of dimensions_ of an array (_**e.g.** if an array is `4x3x3`, it is said to have a **rank** of 3 -- or **3 dimensions**_). Be aware that in linear algebra the **rank** of a matrix means something different: the number of linearly independent rows (_or columns_).
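
Here is a minimal sketch of shape and number of dimensions using plain nested lists. The helper `shape` is our own illustrative function (it is not part of core Python), and it assumes the lists are non-empty and evenly nested:

```python
def shape(nested):
    """Return the size of each dimension of an evenly nested, non-empty list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]  # step one level deeper
    return tuple(dims)


vector = [1, 2, 3]                                       # one dimension
matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]  # 4 rows, 3 columns
cube = [[[0] * 3 for _ in range(3)] for _ in range(4)]   # 4 x 3 x 3

for name, value in [("vector", vector), ("matrix", matrix), ("cube", cube)]:
    dims = shape(value)
    print(name, "shape:", dims, "number of dimensions:", len(dims))
```

The number of dimensions is simply the length of the shape, which is why a `4x3x3` array has 3 dimensions.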


Various mathematical operations can be performed on matrices such as addition, subtraction, multiplication etc....although there are some specific rules to follow. For addition & subtraction, every element in one matrix is operated on with the corresponding element of another matrix of the same shape. Multiplication follows a different rule -- & we'll get into it below.  

You can look over addition & subtraction yourself -- we'll cover the multiplication...which can be a little tricky.
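
If you'd like a quick check on addition & subtraction, this sketch pairs up corresponding elements of two same-shaped matrices. The helper name `elementwise` and the example values are our own:

```python
import operator


def elementwise(a, b, op):
    """Apply op to each pair of corresponding elements of two same-shaped matrices."""
    same_shape = len(a) == len(b) and all(
        len(row_a) == len(row_b) for row_a, row_b in zip(a, b)
    )
    if not same_shape:
        raise ValueError("matrices must have the same shape")
    return [
        [op(x, y) for x, y in zip(row_a, row_b)]
        for row_a, row_b in zip(a, b)
    ]


A = [[1, 2], [3, 4]]
B = [[10, 20], [30, 40]]

total = elementwise(A, B, operator.add)       # A + B
difference = elementwise(B, A, operator.sub)  # B - A

print(total)
print(difference)
```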




## Matrix multiplication

To multiply matrices we need the **dot product**.

The **`Dot Product`** is where we multiply _matching members_ from each corresponding row --> column, then sum them up:


<img width="55%" align="left" src="./images/vector multiply.png">
<br><br>
   

We match the 1st members (**`1 & 7`**), multiply them, likewise for the 2nd members (**`2 & 9`**) and the 3rd members (**`3 & 11`**), and finally sum them up!  
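
The same calculation as a short sketch in plain Python, using the numbers mentioned above (the function name `dot` is our own):

```python
def dot(u, v):
    """Multiply matching members of two equal-length vectors, then sum the products."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))


row = [1, 2, 3]
column = [7, 9, 11]

# 1*7 + 2*9 + 3*11 = 7 + 18 + 33
result = dot(row, column)
print(result)
```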

....but when we add another dimension, it gets a little more complicated....



## Matrix multiplication part II

&nbsp;&nbsp;

### Given two matrices `A` and `B`, we proceed in a specific order:  


<br>
<img width="35%" align="center" src="./images/matrix math start.png">



&nbsp;&nbsp;

### `Step 1:` Multiply the elements in the first row of `A` with the elements in the first column of `B`.  Add these together to get the element C<sub>1,1</sub> (_the first entry in the answer_):

&nbsp;&nbsp;

<img width="70%" align="center" src="./images/matrix math step 1.png">


&nbsp;&nbsp;

### `Step 2:` Multiply the elements in the first row of `A` with the corresponding elements in the second column of `B`. Add the products to get the element C<sub>1,2</sub>  (_the second entry in the answer_):

&nbsp;&nbsp;

<img width="70%" align="center" src="./images/matrix math step 2.png">

&nbsp;&nbsp;

### `Step 3:` Multiply the elements in the second row of `A` with the elements in the first column of `B`.  Add these together to get the element C<sub>2,1</sub> (_the third entry in the answer_):

&nbsp;&nbsp;

<img width="70%" align="center" src="./images/matrix math step 3.png">


&nbsp;&nbsp;

### `Step 4:` Multiply the elements in the second row of `A` with the elements in the second column of `B`.  Add these together to get the element C<sub>2,2</sub> (_the final entry in the answer_):

&nbsp;&nbsp;


<img width="70%" align="center" src="./images/matrix math step 4.png">


&nbsp;&nbsp;

### So the product of `A` * `B` is:

&nbsp;&nbsp;


<img width="35%" align="center" src="./images/matrix math step final.png">
<br>
<br>
<br>
<br>
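
The four steps above can be sketched in plain Python. Note that the matrices below are made-up values for illustration and are not necessarily the ones shown in the figures:

```python
# Hypothetical 2x2 matrices (not taken from the figures)
A = [[1, 2],
     [3, 4]]
B = [[5, 6],
     [7, 8]]

# Step 1: first row of A with first column of B
c11 = A[0][0] * B[0][0] + A[0][1] * B[1][0]
# Step 2: first row of A with second column of B
c12 = A[0][0] * B[0][1] + A[0][1] * B[1][1]
# Step 3: second row of A with first column of B
c21 = A[1][0] * B[0][0] + A[1][1] * B[1][0]
# Step 4: second row of A with second column of B
c22 = A[1][0] * B[0][1] + A[1][1] * B[1][1]

C = [[c11, c12],
     [c21, c22]]
print(C)
```

Each entry of `C` is one dot product: a row of `A` paired with a column of `B`.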

## But there are _rules:_ **Do not forget the order of matrices!**

### Two matrices can only be multiplied if the number of **columns** in the first (_left_) matrix matches the number of **rows** in the second (_right_) matrix.  

#### Let's say `A` has the shape (2,3); then `B` has to have a shape with number of rows=3 (_`B` can be (3,4) or (3,whatever)_)


&nbsp;&nbsp;


<img width="35%" align="center" src="./images/Matrix Dimension Match.png">
<br>
<br>
<br>
<br>


#### If this order is mismatched, you cannot multiply the 2 matrices.  

**No match = no go.**  In the image below, we do not have a **third row** in the second matrix to match up with the numbers in the **third column** of the first matrix -- so the products for those values in the result matrix would be **"undefined"** -- there's nothing for those numbers to be multiplied with. 
&nbsp;&nbsp;


<img width="35%" align="center" src="./images/Matrix Dimension Mismatch.png">
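
A minimal sketch of this rule in plain Python: the hypothetical helper `matmul` below checks that the columns of the left matrix match the rows of the right matrix before multiplying, and raises an error otherwise. The example matrices are our own:

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows, checking the shapes first."""
    rows_a, cols_a = len(a), len(a[0])
    rows_b, cols_b = len(b), len(b[0])
    if cols_a != rows_b:
        raise ValueError(
            f"cannot multiply {rows_a}x{cols_a} by {rows_b}x{cols_b}"
        )
    # Entry (i, j) is the dot product of row i of a with column j of b
    return [
        [sum(a[i][k] * b[k][j] for k in range(cols_a)) for j in range(cols_b)]
        for i in range(rows_a)
    ]


A = [[1, 2, 3],
     [4, 5, 6]]        # shape 2x3
B = [[1, 0, 0, 1],
     [0, 1, 0, 1],
     [0, 0, 1, 1]]     # shape 3x4

C = matmul(A, B)       # columns of A (3) match rows of B (3)
print(C, "-> shape", f"{len(C)}x{len(C[0])}")

too_short = [[1, 2],
             [3, 4]]   # shape 2x2: only 2 rows, but A has 3 columns
try:
    matmul(A, too_short)
except ValueError as error:
    mismatch_message = str(error)
    print("No match = no go:", mismatch_message)
```

The result has as many rows as the left matrix and as many columns as the right matrix.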


&nbsp;
&nbsp;

____

### **Points to Remember!**

1. The number of **columns** in the **`left matrix`** must equal the number of **rows** in the **`right matrix`**.
2. The answer matrix always has the `same number of rows as the left matrix` and the `same number of columns as the right matrix`.
3. **Order matters.** In general, multiplying A•B is not the same as multiplying B•A.
4. Data in the left matrix should be arranged as rows, while data in the right matrix should be arranged as columns.  

&nbsp;

_**FYI:**_  


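There is a special square matrix called the **identity matrix**, in which the diagonal elements (_top-left to bottom-right_) are 1 and every other element is 0. Multiplying a matrix by the identity matrix leaves it unchanged.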
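
A small sketch of the identity matrix, plus a check that order matters in multiplication. The helpers `identity` and `matmul` and the example matrices are our own, and `matmul` here assumes the shapes are already compatible:

```python
def matmul(a, b):
    """Multiply two compatible matrices stored as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def identity(n):
    """Build an n x n matrix with 1 on the diagonal and 0 elsewhere."""
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


A = [[1, 2],
     [3, 4]]
B = [[0, 1],
     [1, 0]]
I = identity(2)

# Multiplying by the identity matrix returns the original matrix
print(matmul(A, I), matmul(I, A))

# Swapping the order of A and B changes the answer
AB = matmul(A, B)
BA = matmul(B, A)
print("A.B =", AB)
print("B.A =", BA)
```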





## Probability 

Again, the [Stanford CS229 Review Notes PDF](http://cs229.stanford.edu/section/cs229-prob.pdf) comes to the rescue.  If visualizations are more your speed, the [Seeing Theory](https://seeing-theory.brown.edu/) website is _**amazing**_. 



1. The probability of any specific event is between 0 and 1 (inclusive), that is, 0 <= p(x) <= 1. The probabilities of all the possible outcomes together add up to exactly 1.

2. Probability is all about the _**possibility**_ of various outcomes. The set of `all possible outcomes` is called the **sample space**. e.g. The **sample space** for a coin flip is `{heads, tails}`.

3. A `random variable X` is a **variable** which takes on **`randomly chosen values from a sample space`**. The outcomes can be equally likely (_as with a fair coin or a fair die_), but they don't have to be.
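
To see these three points together, here is a sketch that builds the sample space for rolling two fair dice and defines a random variable `X` as the sum of the two faces. The two-dice example and all names are our own choice of illustration:

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Sample space: every (first die, second die) pair, 36 outcomes in total
sample_space = list(product(range(1, 7), repeat=2))

# Random variable X = sum of the two faces; count how many outcomes give each sum
counts = Counter(first + second for first, second in sample_space)

# Probability of each value of X = matching outcomes / all outcomes
distribution = {
    total: Fraction(count, len(sample_space))
    for total, count in sorted(counts.items())
}

for total, probability in distribution.items():
    print(total, probability)

print("Sum of all probabilities:", sum(distribution.values()))
```

Every probability lies between 0 and 1 and together they add up to 1. Notice that a sum of 7 is six times as likely as a sum of 2, even though each individual pair of faces is equally likely.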

## Distributions 


#### As a Reminder:  _`continuous`_ vs _`discrete`_ values:

Values that can be anything within a **range** (_such as a height or a temperature_) are **continuous**. Values that come from a set of separate, **countable** possibilities (_such as a die roll or a number of arrivals_) are **discrete**.

## 1. Bernoulli Distribution 

The simplest **discrete** probability distribution, consisting of only 2 possible outcomes/values in a **single** experiment:  

0 & 1, which occur with probabilities **q** & **p** (_where q = 1 - p_). **0 (q)** = `false|failure|no` &  **1 (p)** = `true|success|yes`


<br><br>
<img width="90%" align="left" src="./images/Bernoulli.png">
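
A minimal sketch of a Bernoulli experiment in plain Python. The success probability `p = 0.3`, the seed, and the function names are arbitrary choices for illustration:

```python
import random


def bernoulli_pmf(k, p):
    """Probability of outcome k (1 = success, 0 = failure) in a single trial."""
    if k == 1:
        return p
    if k == 0:
        return 1 - p  # this is q
    return 0.0


def bernoulli_trial(p, rng):
    """Run one experiment: return 1 with probability p, otherwise 0."""
    return 1 if rng.random() < p else 0


p = 0.3
rng = random.Random(42)  # arbitrary seed so the run is repeatable
trials = [bernoulli_trial(p, rng) for _ in range(100_000)]
success_rate = sum(trials) / len(trials)

print("P(1) =", bernoulli_pmf(1, p), " P(0) =", bernoulli_pmf(0, p))
print("Observed share of successes:", round(success_rate, 3))
```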


&nbsp;


## 2. Uniform Distribution


The simplest **continuous** probability distribution.  Unlike the Bernoulli distribution above -- which models **one** experiment with two outcomes -- this distribution represents a _**range**_ containing infinitely many numbers (outcomes), all _**equally likely.**_
 
 <br><br>
<img width="90%" align="left" src="./images/Uniform.png">
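
For a continuous uniform distribution on a range from `a` to `b`, the curve is flat at height `1 / (b - a)`, and the probability of landing in a sub-interval is just that interval's share of the range. A sketch, with our own function names and an example range of 0 to 10:

```python
def uniform_pdf(x, a, b):
    """Height of the flat density curve on [a, b]; zero outside the range."""
    return 1.0 / (b - a) if a <= x <= b else 0.0


def uniform_probability(low, high, a, b):
    """Probability that a value drawn uniformly from [a, b] falls in [low, high]."""
    overlap = max(0.0, min(high, b) - max(low, a))
    return overlap / (b - a)


a, b = 0, 10
density = uniform_pdf(3, a, b)
chance_2_to_5 = uniform_probability(2, 5, a, b)
chance_whole_range = uniform_probability(a, b, a, b)

print("Density anywhere inside the range:", density)
print("P(2 <= X <= 5):", chance_2_to_5)
print("P(0 <= X <= 10):", chance_whole_range)
```

The probability is the area of a rectangle: width of the interval times the height of the curve.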



## 3. Binomial Distribution

The **Binomial distribution** is a _**generalization**_ of the **Bernoulli distribution**. When we need the probability of the **same event** occurring a certain number of times across several repeated trials, we head towards a **binomial distribution**. Here (_as with Bernoulli_) each trial has two possible outcomes (1 & 0), every trial has the same probability of success, and every trial is **independent** of the others in its outcome. A binomial distribution that only includes **one** trial is a **Bernoulli distribution**.


https://math.stackexchange.com/questions/838107/what-is-the-difference-and-relationship-between-the-binomial-and-bernoulli-distr


<br><br>
<img width="90%" align="left" src="./images/Binomial.png">
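
A sketch of the standard binomial formula: the probability of exactly `k` successes in `n` independent trials, each with success probability `p`, is "n choose k" times `p^k` times `(1 - p)^(n - k)`. The function name and the coin-flip numbers are our own:

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Chance of exactly 5 heads in 10 flips of a fair coin
five_heads = binomial_pmf(5, 10, 0.5)

# The probabilities for every possible number of heads add up to 1
total = sum(binomial_pmf(k, 10, 0.5) for k in range(11))

# With a single trial, the binomial reduces to the Bernoulli distribution
single_trial_success = binomial_pmf(1, 1, 0.3)

print("P(5 heads in 10 flips):", five_heads)
print("Sum over all outcomes:", total)
print("One trial, p = 0.3:", single_trial_success)
```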


&nbsp;


## 4. Normal (Gaussian | Laplace-Gauss) Distribution

A bell-shaped **continuous** distribution in which the mean, median, & mode are equal to one another (_..time to Google these terms.._ :) ).  

Exactly half of the probability falls to the left of the mean and half to the right, and the two halves mirror each other, which makes this distribution _**symmetrical**_.  If you look closer at how the values are distributed, most are centered at or near the peak, with fewer values the further we travel from the **mean** (_which here is also the median_). 

We'll be using this distribution very often throughout this study group.

<br><br>
<img width="90%" align="left" src="./images/Gaussian Distibution corrected.png">
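
A sketch of the normal curve using only the standard library. The mean of 170 and standard deviation of 10 (think heights in cm) are made-up numbers, and the function names are our own:

```python
from math import erf, exp, pi, sqrt


def normal_pdf(x, mean, std):
    """Height of the bell curve at x."""
    return exp(-0.5 * ((x - mean) / std) ** 2) / (std * sqrt(2 * pi))


def normal_cdf(x, mean, std):
    """Area under the bell curve to the left of x."""
    return 0.5 * (1 + erf((x - mean) / (std * sqrt(2))))


mean, std = 170.0, 10.0  # hypothetical heights in cm

# Symmetry: half of the area lies to the left of the mean
below_mean = normal_cdf(mean, mean, std)

# Area within one standard deviation of the mean (160 to 180)
within_one_std = normal_cdf(mean + std, mean, std) - normal_cdf(mean - std, mean, std)

# The curve is highest at the mean and equally high at mirrored points
peak = normal_pdf(mean, mean, std)
left, right = normal_pdf(mean - 5, mean, std), normal_pdf(mean + 5, mean, std)

print("P(X < mean):", below_mean)
print("P(160 < X < 180):", round(within_one_std, 4))
print("Peak height:", round(peak, 4), " mirrored heights:", round(left, 4), round(right, 4))
```

Roughly 68% of the values fall within one standard deviation of the mean, which is the "most values are near the peak" behaviour described above.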


&nbsp;



## 5. Poisson Distribution

Similar to the **Binomial** distribution in that all the events are _**independent of each other**_.
_So what's the difference?_

Remember **continuous** vs **discrete** ? 

Both distributions count how many times something happens, so both are **discrete**. The **Binomial** counts successes across a set of **discrete attempts**, while the **Poisson** counts events that occur somewhere within a **continuous interval** (_usually time_).

To break it down further --> The **Binomial** distribution is based on events that have a _**fixed number of attempts**_. The Poisson distribution is based on events that have _**no fixed number of attempts**_ (_effectively unlimited opportunities to occur_) in a **fixed amount of time**, happening at some average rate.

Poisson distributions are most often used to model events that could happen a very large number of times -- but happen _rarely_ -- within a fixed interval of time, or of space such as distance, area or volume.  Think meteor sightings per hour, bus arrivals per hour, or subscription renewals per day - anything that occurs independently at a roughly constant average rate.

<br><br>
<img width="90%" align="left" src="./images/Poisson.png">
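
A sketch of the standard Poisson formula, `P(k) = rate^k * e^(-rate) / k!`, where `rate` is the average number of events per interval. The rate of 3 events per interval and the function names are our own. The comparison at the end illustrates the "very many opportunities, each one rare" idea by putting a binomial with many trials and a tiny success probability next to it:

```python
from math import comb, exp, factorial


def poisson_pmf(k, rate):
    """Probability of exactly k events in an interval, given the average rate."""
    return rate**k * exp(-rate) / factorial(k)


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


rate = 3.0  # hypothetical average: 3 events per interval

# Probabilities over all possible counts add up to (very nearly) 1
total = sum(poisson_pmf(k, rate) for k in range(50))

# Many attempts (n) with a small chance each (p), keeping n * p equal to the rate
n = 10_000
p = rate / n
largest_gap = max(
    abs(poisson_pmf(k, rate) - binomial_pmf(k, n, p)) for k in range(10)
)

for k in range(6):
    print(k, round(poisson_pmf(k, rate), 4), round(binomial_pmf(k, n, p), 4))
print("Sum of Poisson probabilities:", round(total, 6))
print("Largest gap between the two:", largest_gap)
```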


&nbsp;



Further reading :



https://www.analyticsvidhya.com/blog/2017/09/6-probability-distributions-data-science/

# **Exercises**

*Practice makes a woman perfect!*


Here's a list of some cool exercises so that you can apply what you've learnt!

1. **Some Linear Algebra Fun** : https://www.albert.io/linear-algebra

2. **Remember multiplication of matrices is important?** : https://www.intmath.com/matrices-determinants/4-multiplying-matrices.php