---

# Week 1 Lecture - Introduction {-}


### Unit Convenor & Lecturer {-}
[George Milunovich](https://www.georgemilunovich.com)  
[george.milunovich@mq.edu.au](mailto:george.milunovich@mq.edu.au)


### References  {-}

1. Python Machine Learning 3rd Edition by Raschka & Mirjalili - Chapter 1
    - Macquarie University Library Link: [https://learning.oreilly.com/library/view/python-machine-learning/9781789955750/?ar](https://learning.oreilly.com/library/view/python-machine-learning/9781789955750/?ar)
    - Sign in with your university email address
2. Various open-source material



### Learning Objectives  {-}

1. Installing and Running Python
2. Learning How to open and read Jupyter Notebooks (.ipynb) with JupyterLab
3. Datasets Used & Basic Terminology
4. Running Python Code in Jupyter Notebooks
5. Mathematical Notation Used in Machine Learning
6. Explanatory vs. Predictive Models
7. Different Types of Machine Learning
    - Supervised Learning
    - Unsupervised Learning
    - Reinforcement Learning

---

## Installing and Running Python  {-}


### Installing Python  {-}

We will use the Anaconda distribution of Python.
- Go to Anaconda website [https://www.anaconda.com/products/individual](https://www.anaconda.com/products/individual)
- Scroll down to the bottom of the page until you see "Anaconda Installers" as in the pic below


<img src="images/pic1.png" alt="Drawing" style="width: 450px;"></img>

<!-- ![](images/pic1.png) -->

- Choose an installer for your OS, Python 3.11 (or higher) and 64-Bit Installer
- Click on the saved installer and install to your machine
- Done!


- If unable to install see: 
    - Step-by-step instructions [https://docs.anaconda.com/anaconda/navigator/](https://docs.anaconda.com/anaconda/navigator/)
    


### Running Python {-}
- If on Windows, click Start and type Anaconda
    - If using another OS, follow similar instructions
- Click on Anaconda Navigator as in pic below

<img src="images/pic2.png" alt="Drawing" style="width: 450px;"/>

<!-- ![](images/pic2.png) -->

- Anaconda Navigator will open
- Click on "Launch" below **JupyterLab** package

<!-- ![image.png](images/pic3.png) -->
<img src="images/pic3.png" alt="Drawing" style="width: 450px;"/>

<!-- ![](images/pic3.png) -->


- Download Lecture Notes for Week 1 (week1.zip) from iLearn
- In the "File Browser" navigation panel on the left-hand side of JupyterLab, navigate to the directory where you saved and extracted the lecture notes

<img src="images/pic4.png" alt="Drawing" style="width: 400px;"/>

<br>

- Select week1_lecture.ipynb to open the Jupyter notebook for Week 1
- We can now proceed with the lecture

---
---

<br>

## Predictive Analytics vs. Explanatory Statistical Modeling {-}

Many students and practitioners confuse **Explanatory Statistical Models** and **Predictive Models**.
Listed below are some key differences.


<hr style="width:25%;margin-left:0;"> 

- **Explanatory Statistical Modeling**
    1. Explanatory Models - models that are built for the purpose of **testing causal hypotheses** that specify how and why certain empirical phenomena occur
        - Explanatory statistical models are based on underlying **causal** relationships between **theoretical** constructs
        - Focus: Knowing what happens to $y$ when we change $x$ 
    2. Causal theoretical model -> A set of hypotheses -> Test using statistical models and statistical inference 
    3. Methods for evaluating the **explanatory power** of a model are statistical tests or measures such as **$R^2$**, which indicate the strength of the relationship between $y$ and $x$ 
    4. It is often assumed that *predictive power* follows automatically from the explanatory model 
        - However, **explanatory power does not imply predictive power** 
    5. A statistically significant effect or relationship does not guarantee high predictive power 
        - Because **the magnitude of the causal effect might not be sufficient for obtaining levels of predictive accuracy** that are practically meaningful
    6. Example: predict whether it will rain today: $P(\text{rain}) = \frac{1}{1 + e^{-(\beta_0 + \beta_1\text{Temperature} + \beta_2 \text{Humidity} + \beta_3 \text{Pressure} + \beta_4\text{WindSpeed} + \beta_5\text{CloudCover} + \text{other relevant variables})}}$


<hr style="width:25%;margin-left:0;"> 

- **Predictive Modeling** (predictive accuracy)
    1. Predictive models are statistical models which have the ability to generate accurate predictions of **new** observations
        - Focus: Accuracy of prediction - knowing $y$ for a given $x$
        - New observations can be defined temporally (i.e., observations in a future time period) or cross-sectionally (i.e., observations that were not included in the original sample used to build the model)
    2. Predictive models integrate **knowledge from existing theoretical models in a less formal way** 
        - Such models rely on *associations* between measurable variables
    3. **Statistical significance plays a smaller role** in assessing predictive performance.
        - Sometimes removing predictors with small coefficients, even if they are statistically significant (and theoretically justified), results in improved prediction accuracy
    4. Predictive models are assessed by the accuracy of their predictions in practice, e.g. using the root mean squared error (RMSE)
    5. Example: predict whether it will rain today: $P(\text{rain}) = \frac{1}{1 + e^{-(\beta_0 + \beta_1\text{People Carry Umbrellas})}}$
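
Both rain examples above pass a linear combination of variables through the logistic function to obtain a probability. The following is a minimal sketch of that calculation for the umbrella model. The coefficient values and the interpretation of "People Carry Umbrellas" as a share between 0 and 1 are assumptions made purely for illustration; they are not estimated from data.

```python
import math

def logistic(z):
    """Map any real number z to a probability strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients (illustrative only, not estimated from data)
beta_0 = -2.0   # intercept
beta_1 = 4.0    # coefficient on the share of people carrying umbrellas

for umbrella_share in (0.0, 0.5, 1.0):
    p_rain = logistic(beta_0 + beta_1 * umbrella_share)
    print(f"share carrying umbrellas = {umbrella_share:.1f} -> P(rain) = {p_rain:.3f}")
```

Note that umbrellas do not cause rain, so this model has no explanatory value, yet it may still predict well.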

           
<hr style="width:25%;margin-left:0;"> 

---



## Machine Learning/Predictive Analytics - Introduction  {-}

**Machine learning (ML)** is the study of computer algorithms that improve through experience.

- Machine learning algorithms build a mathematical model based on **sample data**, also known as **training data**
- We make predictions about new observations called **out-of-sample data** or **test data**  
- Machine learning algorithms are used in a wide variety of applications, for example:  
    - Predicting demand quantities for a product
    - Forecasting house prices
    - Text and voice recognition   
    - Computer vision  
    - Medical applications, such as detecting skin cancer   


References: [https://en.wikipedia.org/wiki/Machine_learning](https://en.wikipedia.org/wiki/Machine_learning)

---

## Some Basic Terminology, Datasets and Notation {-}


### Introduction to datasets we will use {-}

We will use a number of different datasets in this unit, including the following:    

1. The famous Iris dataset - [https://archive.ics.uci.edu/ml/datasets/iris](https://archive.ics.uci.edu/ml/datasets/iris)    
2. Credit card default payments - [https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients](https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients)    
3. House prices in Iowa (US) - will need to register for Kaggle (free) [https://www.kaggle.com/c/house-prices-advanced-regression-techniques](https://www.kaggle.com/c/house-prices-advanced-regression-techniques)    

Our data is typically organised in tables (dataframes) where   
- Columns represent features - also known as variables, attributes, measurements   
- Rows represent individual observations - also known as instances or examples   


### Iris Dataset {-}

This is perhaps the best known database to be found in the pattern recognition literature. 

 
Let's have a quick look at the Iris dataset in Python:


```
# !pip install openpyxl  # if required: pandas uses openpyxl to write .xlsx files
```


```
# python comments are made using the hash "#" symbol
# python code to read Iris data from the internet
# original data file has no column names so we assign them ourselves

import pandas as pd  # import pandas library 
column_names = ['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width', 'Class Label']  # define column names

df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', names = column_names) # read data from URL

df
```

Next, we save the data to the `data` folder for later use (the folder must already exist, and writing `.xlsx` files requires the `openpyxl` package).

```
df.to_excel('data/iris.xlsx')  # save for later use
```


**Python Counting**  
As we can see from the above printout, Python starts counting at 0.  

- So we have 150 examples (observations) across rows 0 to 149 and 5 variables across columns 0 to 4, containing 4 features (explanatory variables) as well as the target variable in the last column.   
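
Zero-based counting can be checked directly in pandas. The sketch below types in the first four rows of the Iris data by hand so that it runs without a download; `df_small` is a name introduced here for illustration and stands in for the full 150-row `df`.

```python
import pandas as pd

# First four rows of the Iris data, entered by hand (stand-in for the full dataset)
df_small = pd.DataFrame(
    [[5.1, 3.5, 1.4, 0.2, 'Iris-setosa'],
     [4.9, 3.0, 1.4, 0.2, 'Iris-setosa'],
     [4.7, 3.2, 1.3, 0.2, 'Iris-setosa'],
     [4.6, 3.1, 1.5, 0.2, 'Iris-setosa']],
    columns=['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width', 'Class Label'])

print(df_small.shape)        # (number of rows, number of columns)
print(df_small.iloc[0])      # the first row sits at position 0
print(df_small.iloc[3, 1])   # fourth row, second column (Sepal Width)
```

Applied to the full dataset, `df.shape` would report 150 rows and 5 columns, with valid row positions running from 0 to 149.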


**Feature Information:**  

1. Sepal Length in cm
2. Sepal Width in cm
3. Petal Length in cm
4. Petal Width in cm
5. Class:
    - Iris Setosa
    - Iris Versicolour
    - Iris Virginica
    
Predicted attribute: class of iris plant.  

So the objective is to predict Iris class given the set of features: Sepal Length, Sepal Width, Petal Length and Petal Width.   

### Some Terminology {-}

Machine learning is a branch of computer science and has its own terminology, which may be unfamiliar to students with statistics/econometrics backgrounds.

- **Feature $(x)$** = Predictor = Input = Independent Variable = Explanatory Variable = a column in the data matrix $\mathbf{X}$
- **Target $(y)$** = **Label** (in classification) = Output = Dependent Variable = Response Variable
- **Example** = Observation from a sample, i.e. a sample is a collection of examples
- **Training** = Model fitting; for parametric models like linear regression this refers to parameter estimation

Therefore a **labeled dataset** is a dataset which contains data on the label/target $(\mathbf{y})$, i.e. Dependent Variable.

Examples:

1. Predict whether a product will sell or not, i.e. the target or dependent variable takes values True (1) or False (0), on the basis of Features or Predictors such as Price, Color, Country of Production, etc.

2. Predict the price of a house, i.e. the target is a numeric value, e.g. Price = $500k, given a set of features (predictors) such as Number of Bedrooms, Size, Suburb, etc. 
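
In code, a labeled dataset is usually split into a feature matrix and a target vector. The sketch below does this for the house price example; the column names and all values are invented for illustration.

```python
import pandas as pd

# Hypothetical labeled dataset (all values invented for illustration)
houses = pd.DataFrame({
    'Bedrooms': [2, 3, 4],
    'Size': [80, 120, 200],      # square metres
    'Price': [500, 650, 900],    # target, in $ thousands
})

X = houses[['Bedrooms', 'Size']]   # features (predictors): one column per feature
y = houses['Price']                # target (label): one value per example

print(X.shape)   # (number of examples, number of features)
print(y.shape)   # (number of examples,)
```

Each row of `X` together with the matching entry of `y` forms one example.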


---

## Mathematical Notation {-}

In contrast to Python, our mathematical notation starts at 1, e.g. rows 1 to 150 (instead of 0 to 149 as in Python). 
- *We must keep this in mind all the time.*

So our features will be stored in a $(150\times4)$ matrix $\mathbf{X}\in \mathbb{R}^{150\times4}$  
<br>

$\mathbf{X}=
\left(\begin{array}{cccc} 
x_1^{(1)} & x_2^{(1)} & x_3^{(1)} & x_4^{(1)}\\
x_1^{(2)} & x_2^{(2)} & x_3^{(2)} & x_4^{(2)}\\
\vdots & \vdots & \vdots & \vdots\\
x_1^{(150)} & x_2^{(150)} & x_3^{(150)} & x_4^{(150)}
\end{array}\right)$

<br>
A typical element is $x_m^{(n)}$, where the subscript $m=1\dots4$ represents columns and the superscript $n=1\dots150$ represents rows.   

- For instance, $x_2^{(4)}=3.1$ - Sepal Width, observation 4.

<br><br>
We can also represent columns of $\mathbf{X}$ as column vectors. For instance, column 2 which contains Sepal Width can be written as $\mathbf{x_2}=\left(\begin{array}{c} 
x_2^{(1)}\\
x_2^{(2)}\\
\vdots\\
x_2^{(150)}
\end{array}\right)=\left(\begin{array}{c} 
3.5\\
3.0\\
\vdots\\
3.0
\end{array}\right)$.

<br><br>
Similarly, rows of $\mathbf{X}$ are row vectors. Row 3, for example, becomes $\mathbf{x^{(3)}}=\left(\begin{array}{cccc}
x_1^{(3)} & x_2^{(3)} & x_3^{(3)} & x_4^{(3)}
\end{array}\right)=\left(\begin{array}{cccc} 
4.7 & 3.2 & 1.3 & 0.2
\end{array}\right)$.

<br><br>
Lastly, we store target variables (class labels) in a column vector $\mathbf{y}=\left(\begin{array}{c} 
y^{(1)}\\
y^{(2)}\\
\vdots\\
y^{(150)}
\end{array}\right)=\left(\begin{array}{c} 
\textrm{Iris-setosa}\\
\textrm{Iris-setosa}\\
\vdots\\
\textrm{Iris-virginica}
\end{array}\right)$.   
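
The 1-based notation above maps to Python's 0-based indexing by subtracting 1 from each index. The sketch below illustrates this with NumPy, using only the first four rows of the Iris features as a stand-in for the full $(150\times4)$ matrix; the helper `element` is a hypothetical function introduced here for illustration.

```python
import numpy as np

# First four rows of the Iris feature matrix (stand-in for the full 150 x 4 matrix)
X = np.array([[5.1, 3.5, 1.4, 0.2],
              [4.9, 3.0, 1.4, 0.2],
              [4.7, 3.2, 1.3, 0.2],
              [4.6, 3.1, 1.5, 0.2]])

def element(X, n, m):
    """Return x_m^(n), where n (row) and m (column) are 1-based as in the notation."""
    return X[n - 1, m - 1]

print(element(X, n=4, m=2))   # x_2^(4): Sepal Width of observation 4
print(X[:, 1])                # column vector x_2 (Sepal Width) is Python column 1
print(X[2, :])                # row vector x^(3) is Python row 2
```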




---

## Introduction to Different Types of Machine Learning {-}

Broadly, there are three distinct types of machine learning:    

1. Supervised Learning    
2. Unsupervised Learning    
3. Reinforcement Learning    

<img src="images/pic5.png" alt="Drawing" style="width: 450px;"/>

<!-- ![](images/pic5.png) -->



---

## Supervised Learning {-}

### Making Predictions with Supervised Learning {-}

Goal of Supervised Learning: learn a model from labeled training data that allows us to make predictions about unseen or future data.    

- **Supervised** means that we have a set of training data for which labels (values of the dependent variable) are known.   

<img src="images/pic6.png" alt="Drawing" style="width: 450px;"/>

<!-- ![](images/pic6.png) -->

There are two tasks in supervised learning:    

- **Classification**   
- **Regression**   


---

### Classification {-}
The problem of predicting the categorical class labels of new instances, based on a set of features.  

- For example, predict whether a product will sell or not sell (True/False)
- Predict whether it will rain today or not   

We can also distinguish between:  

- Binary Classification: Classification tasks with two classes, e.g. True/False  
- Multi-class Classification: Classification tasks with more than two classes, e.g. Buy/Sell/Hold   
    
Example: Classification

- Predict whether a bank loan ($y$) will be repaid (+) or will default (-)   
- Prediction made on the basis of 2 features: borrower's income ($x_1$) and borrower's age ($x_2$)   

<img src="images/pic7.png" alt="Drawing" style="width: 400px;"/>

<!-- ![](images/pic7.png) -->
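
As a rough sketch of the loan example, the code below classifies a new borrower by copying the label of the most similar borrower in the training data (a 1-nearest-neighbour rule). This is only one of many possible classifiers, chosen here because it is short; the training data is invented, and the features are left unscaled for simplicity.

```python
import math

# Hypothetical labeled training data: (income in $ thousands, age in years) -> label
# '+' = loan repaid, '-' = loan defaulted
training_data = [
    ((30, 25), '-'),
    ((35, 30), '-'),
    ((40, 22), '-'),
    ((80, 45), '+'),
    ((90, 50), '+'),
    ((85, 38), '+'),
]

def predict(training_data, x_new):
    """Return the label of the training example closest to x_new (Euclidean distance)."""
    nearest_features, nearest_label = min(
        training_data, key=lambda example: math.dist(example[0], x_new))
    return nearest_label

print(predict(training_data, (88, 47)))   # new borrower with high income
print(predict(training_data, (32, 27)))   # new borrower with low income
```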

### Regression {-}
The task of predicting a continuous outcome, e.g. House Price, on the basis of a set of features (explanatory variables).

- There are many algorithms for doing this.

- We are most familiar with Linear Regression from basic statistics/econometrics courses.

Example: Regression

- Use a linear regression to predict an exam mark ($y$) on the basis of the time spent studying ($x$)

<img src="images/pic8.png" alt="Drawing" style="width: 400px;"/>

<!-- ![](images/pic8.png) -->
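
The exam mark example can be sketched with the textbook least-squares formulas for a single feature. The data below is invented for illustration, and the function names are hypothetical helpers rather than part of any library.

```python
# Hypothetical data: hours spent studying (x) and exam mark (y)
hours = [2, 4, 6, 8, 10]
marks = [50, 58, 65, 74, 83]

def fit_line(xs, ys):
    """Ordinary least squares estimates (b0, b1) for the model y = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1

def rmse(xs, ys, b0, b1):
    """Root mean squared error of the fitted line on the data (xs, ys)."""
    squared_errors = [(y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys)]
    return (sum(squared_errors) / len(squared_errors)) ** 0.5

b0, b1 = fit_line(hours, marks)
print(f"fitted line: mark = {b0:.2f} + {b1:.2f} * hours")
print(f"predicted mark for 7 hours of study: {b0 + b1 * 7:.2f}")
print(f"in-sample RMSE: {rmse(hours, marks, b0, b1):.2f}")
```

The RMSE here is computed on the training data, so it is likely to understate the error we would see on new students.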


--- 

## Unsupervised Learning {-}

In supervised learning we   

- Know what the target (dependent) variable is (the label in classification), e.g. exam mark 
- Have the values of the target variable before we train a model, e.g. our training data contains exam marks for a sample of students

In unsupervised learning

- The data is unlabeled, i.e. we don't even know what the dependent variable is
- The data has unknown structure which we wish to discover

### Clustering {-}

Clustering is a type of unsupervised learning where we attempt to group a set of objects without having any prior knowledge of their group memberships.

- Objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters) 

Example:  

- Discover customer segments based on their age ($x_1$) and income ($x_2$) in order to develop targeted marketing programs.


<img src="images/pic9.png" alt="Drawing" style="width: 400px;"/>

<!-- ![](images/pic9.png) -->
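
One way to sketch the customer segmentation example is with a bare-bones version of the k-means algorithm, which alternates between assigning points to the nearest cluster centre and recomputing the centres. The customer data, the choice of two clusters, and the simple initialisation are all assumptions made for illustration; other clustering methods would be equally valid.

```python
import math

def k_means(points, k, n_iter=10):
    """Very small k-means sketch: returns (labels, centroids)."""
    centroids = list(points[:k])   # simple initialisation: use the first k points
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster with the nearest centroid
        for i, point in enumerate(points):
            distances = [math.dist(point, c) for c in centroids]
            labels[i] = distances.index(min(distances))
        # Update step: each centroid moves to the mean of its cluster members
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:   # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(coord) / len(members) for coord in zip(*members))
    return labels, centroids

# Hypothetical customers: (age in years, income in $ thousands); no labels are given
customers = [(25, 30), (27, 35), (23, 28), (55, 90), (60, 95), (58, 85)]

labels, centroids = k_means(customers, k=2)
print(labels)      # cluster index assigned to each customer
print(centroids)   # centre of each discovered segment
```

The cluster indices themselves carry no meaning; only the grouping of customers matters.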


---
## Reinforcement Learning {-}

The goal is to develop a system (agent) that improves its performance based on interactions (feedback) with the environment. 

- The agent can observe the environment, perform actions and get rewards in return (or penalties - negative rewards). 
- The agent learns to choose a series of actions that maximizes the total reward, which could be earned either immediately after taking an action or via delayed feedback.

**Example: Reinforcement Learning**  

A chess program where the agent decides upon a series of moves depending on the state of the board (the environment), and the reward can be defined as a win or a loss at the end of the game.

[https://www.youtube.com/watch?v=JgvyzIkgxF0](https://www.youtube.com/watch?v=JgvyzIkgxF0)
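
Chess is far too large for a short example, so the sketch below illustrates the same idea on a toy environment: an agent in a corridor of five positions that receives a reward only when it reaches the last one. It uses tabular Q-learning, one common reinforcement learning method. The environment, the function name and all parameter values are assumptions made for illustration.

```python
import random

def train_q_table(n_states=5, episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning on a corridor of states 0..n_states-1.

    Actions: 0 = move left, 1 = move right. The only reward (+1) is received
    on reaching the last state, so feedback for earlier moves is delayed.
    """
    rng = random.Random(seed)
    goal = n_states - 1
    q = [[0.0, 0.0] for _ in range(n_states)]   # q[state][action]

    for _ in range(episodes):
        state = 0
        while state != goal:
            # Explore with probability epsilon, or when both actions look equally good
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                action = rng.choice([0, 1])
            else:
                action = 0 if q[state][0] > q[state][1] else 1

            next_state = max(0, state - 1) if action == 0 else state + 1
            reward = 1.0 if next_state == goal else 0.0

            # Move the estimate towards: reward + discounted value of the next state
            target = reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q

q_table = train_q_table()
for state, (q_left, q_right) in enumerate(q_table[:-1]):
    best = "right" if q_right > q_left else "left"
    print(f"state {state}: Q(left) = {q_left:.3f}, Q(right) = {q_right:.3f}, best action: {best}")
```

After training, the agent should prefer moving right in every position, even though only the final move is ever rewarded directly.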
