# Module 2 - Introduction to Machine Learning

**Module Overview**
- What is machine learning?
    - a subfield of artificial intelligence (AI) that focuses on the development of algorithms and statistical models that enable computers to learn patterns from data and make decisions or predictions without being explicitly programmed for specific tasks 
- Difference between classification and prediction
    - Prediction problems have a numerical output variable, while classification problems have categorical outputs, i.e. we are sorting into categories (e.g. ‘beautiful or not beautiful’ or ‘defective or not defective’).
- Difference between supervised and unsupervised
    - Supervised learning trains on both the inputs $X$ and the output $Y$; unsupervised learning has only $X$ and seeks to learn the relationships within the data
- Difference between parametric and non-parametric models
    - Parametric models assume a specific form for the data, such as a linear relationship. Non-parametric models make few assumptions about the underlying data distribution
- Difference between forecasting and inference
    - Forecasting is used to predict future outcomes, e.g. predicting next year's sales; inference is used to understand relationships, e.g. determining the effect of advertising on sales
- How machine learning differs from statistics
    - ML places emphasis on how well models perform on unseen data. Statistics places emphasis on interpretability, providing insight into the data.

**Learning outcomes**
- LO 1: Identify the fundamental components of and approaches to machine learning problems.
- LO 2: Differentiate between machine learning and statistics.
- LO 3: Classify problems along the major dividing lines of the machine learning landscape.
- LO 4: Apply the ten steps of a typical machine learning project.
- LO 5: Identify real-world applications of machine learning in a variety of industries

## What is machine learning?

ML is about understanding the relationship between some input variables and an output: $Y = f(X_1, \dots, X_p) + \alpha$

This is difficult for two main reasons:
 - limited data from which to learn the function $f$.
 - noise ($\alpha$), e.g. variables that are missing from our model.

The input variables are also known as features, predictors, independent variables or fields.
The output is also known as the target, response variable or outcome variable.

Why learn this relationship between input and output?
 - To perform forecasting
    - we want to predict the output from the input on data we have not seen
    - e.g. determining whether a patient has cancer, where the input is a blood sample
    - forecasting aims to learn a function $f$ such that $Y = f(X_1, \dots, X_p)$ not only on the available data, but on new data too
    - prediction accuracy is the key measure
 - To do inference
    - we want to understand how the output $Y$ depends on the inputs $(X_1, \dots, X_p)$
    - e.g. which inputs play a positive or negative role
    - e.g. given two different marketing campaigns, we want to predict the sales increase, but also how much of it can be attributed to each campaign type, in order to determine the optimal marketing campaign.
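
The two goals can be illustrated on simulated data. The sketch below is not from the course material: the campaign effects, noise level and sample sizes are all made-up assumptions, and ordinary least squares stands in for whatever method would actually be used. It fits $f$ once, then reads the coefficients (inference) and scores predictions on held-out rows (forecasting).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: sales driven by spend on two campaigns plus noise (alpha).
n = 200
X = rng.uniform(0, 10, size=(n, 2))       # columns: campaign A, campaign B
true_effects = np.array([3.0, 0.5])       # assumed "true" effects, for illustration
alpha = rng.normal(0, 1.0, size=n)        # noise term
y = 20.0 + X @ true_effects + alpha

# Hold out the last 50 rows to mimic data we have not seen.
X_train, X_new = X[:150], X[150:]
y_train, y_new = y[:150], y[150:]

# Estimate f with ordinary least squares (a column of ones gives the intercept).
A_train = np.column_stack([np.ones(len(X_train)), X_train])
coefs, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)

# Inference: read the fitted coefficients to see how each campaign contributes.
intercept, effect_a, effect_b = coefs

# Forecasting: predict on the held-out rows and measure accuracy.
A_new = np.column_stack([np.ones(len(X_new)), X_new])
y_pred = A_new @ coefs
rmse_new = float(np.sqrt(np.mean((y_new - y_pred) ** 2)))

print(f"effect A: {effect_a:.2f}, effect B: {effect_b:.2f}, RMSE on new data: {rmse_new:.2f}")
```

Because of the noise term, the error on new data cannot fall much below the noise standard deviation, however good the estimate of $f$.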

## Similarities and differences between ML and statistics

- Machine learning is the more modern field, dating from the 1980s and 1990s. Statistics is centuries old

#### Statistics
- Data are generated by a stochastic data model identified by the statistician.
- The aim is to estimate the parameter values of the chosen model from the data.
- Validation is often yes/no, using goodness-of-fit metrics and residual analysis, with more emphasis on explainability and simplicity.

#### Machine Learning
- Data are generated by a complex, unknown black box process; few, if any, assumptions are made about the data.
- The aim is to find a function that operates on the input variables $X$ to predict the response $Y$
- Validation is often by prediction accuracy, with less emphasis on explaining why

#### Some key words:
- **Stochastic data models**: treat the data-generating process as random, with the data drawn from a probability distribution. Common examples include the normal, binomial and Student’s t-distributions.
- **Goodness of fit**: summarises the discrepancy between observed and expected values. A common example is the root mean squared error (RMSE)
- **Predictive accuracy**: how well the predicted values match the actual values
- **Generalisation**: the ability to perform well on new, unseen data
- **Summary statistics**: a set of values that describe a dataset, such as the mean, standard deviation and median.
- **Black box**: a system or process where the internal workings are hidden or not fully understood
- **Model validation**: the process of confirming the model achieves its purpose
- **Residual analysis**: Residual is the difference between observed and predicted value. Analysis of these residuals can determine how useful the model is
- **Deterministic input**: an input that produces the same output every time it is used, often derived from historical data, standards, or specifications.
- **Linear regression**: models the relationship between two or more variables. The goal is to fit a straight line (in the case of simple linear regression) or a hyperplane (in the case of multiple linear regression) that best represents this relationship.
- **Logistic regression**: used for binary classification, meaning it is typically used to predict one of two possible outcomes. Unlike linear regression, which predicts continuous values, logistic regression predicts probabilities that can be mapped to discrete classes
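
To make the residual and goodness-of-fit definitions above concrete, here is a minimal sketch using made-up observed and predicted values (the numbers are purely illustrative).

```python
import math

# Hypothetical observed values and the values some model predicted for them.
observed = [3.0, 5.0, 7.5, 9.0]
predicted = [2.5, 5.5, 7.0, 10.0]

# Residual = observed value minus predicted value.
residuals = [o - p for o, p in zip(observed, predicted)]

# Root mean squared error: square the residuals, average them, take the root.
rmse = math.sqrt(sum(r ** 2 for r in residuals) / len(residuals))

print(residuals)       # [0.5, -0.5, 0.5, -1.0]
print(round(rmse, 4))  # 0.6614
```

Residuals that look random around zero suggest the model has captured the structure in the data; a visible pattern in them suggests it has missed something.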

## Prediction vs Classification

- Prediction (also known as regression or estimation) has an output variable that is a number, typically continuous.
- Classification has an output that is a category.
   - Categorical variables are often divided into ordinal and nominal. Ordinal variables have a natural ordering, e.g. hot, medium, cold; nominal variables have no ordering.
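
The ordinal/nominal distinction matters when categories are converted to numbers. A small sketch, assuming a hypothetical `colours` variable as the nominal example: ordinal categories can be mapped to ranked integers, whereas nominal ones are usually given one indicator column each so that no false ordering is implied.

```python
# Ordinal: the categories have a natural order, so integer codes preserve it.
temperature_order = ["cold", "medium", "hot"]
ordinal_code = {label: rank for rank, label in enumerate(temperature_order)}

# Nominal: no order exists, so use one indicator (one-hot) column per category.
colours = ["red", "green", "blue"]  # hypothetical nominal variable


def one_hot(value, categories):
    """Return a list with 1 in the position of `value` and 0 elsewhere."""
    return [1 if value == category else 0 for category in categories]


encoded_temps = [ordinal_code[t] for t in ["hot", "cold", "medium"]]
encoded_colour = one_hot("green", colours)

print(encoded_temps)   # [2, 0, 1]
print(encoded_colour)  # [0, 1, 0]
```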
 
## Parametric vs non-parametric

- Parametric
   - require less data
   - inflexible
   - possibly poor fit
   - Make an assumption about the function $f$ we want to estimate, e.g. that $f$ is a linear function
- Non-Parametric
   - Requires more data
   - Flexible
   - Possibly better fit
   - Doesn't make any strong assumptions about the shape of the function $f$; instead it can learn any shape
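
The trade-off can be seen on a deliberately non-linear example. In this sketch the true function (a sine curve), the noise level and the choice of k-nearest-neighbours as the non-parametric method are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical non-linear truth: y = sin(x) + noise.
x_train = np.sort(rng.uniform(0, 2 * np.pi, 300))
y_train = np.sin(x_train) + rng.normal(0, 0.1, 300)
x_test = np.linspace(0.5, 2 * np.pi - 0.5, 50)
y_test = np.sin(x_test)

# Parametric: assume f is a straight line and estimate just two parameters.
slope, intercept = np.polyfit(x_train, y_train, deg=1)
linear_pred = slope * x_test + intercept


# Non-parametric: k-nearest-neighbours regression assumes no shape for f.
def knn_predict(x_query, x_known, y_known, k=10):
    """Average the targets of the k training points closest to each query."""
    dists = np.abs(x_known[None, :] - x_query[:, None])
    nearest = np.argsort(dists, axis=1)[:, :k]
    return y_known[nearest].mean(axis=1)


knn_pred = knn_predict(x_test, x_train, y_train)


def rmse(actual, predicted):
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


linear_rmse = rmse(y_test, linear_pred)
knn_rmse = rmse(y_test, knn_pred)
print(f"linear RMSE: {linear_rmse:.3f}, kNN RMSE: {knn_rmse:.3f}")
```

Here the straight line fits poorly because its assumption is wrong. With far fewer training points the comparison becomes much less favourable to the non-parametric method, which is the "requires more data" cost noted above.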

## Supervised vs unsupervised learning

- Supervised
   - Training data consists of samples that contain both $X_1, \dots, X_p$ and $Y$
   - Seek to understand how output $Y$ relates to input $X_1, \dots, X_p$
   - Two broad categories are regression and classification
- Unsupervised
   - Training data consists of only $X_1, \dots, X_p$ and does not contain $Y$.
   - The goal of unsupervised learning is to find patterns, relationships, or structures within the data without any prior knowledge of what the output should be.
   - Examples include K-means clustering, which may be good for identifying new labels, and PCA for transforming and reducing features.
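
As an illustration of learning from $X$ alone, below is a minimal from-scratch K-means sketch. It is a toy, not a library implementation: the two synthetic groups are made up, and the farthest-first starting centroids are just one simple initialisation chosen here to keep the example deterministic.

```python
import numpy as np


def farthest_first_init(X, k):
    """Start from the first point, then repeatedly add the point farthest
    from every centroid chosen so far (one of several possible choices)."""
    centroids = [X[0]]
    for _ in range(1, k):
        dists = np.min([np.linalg.norm(X - c, axis=1) for c in centroids], axis=0)
        centroids.append(X[np.argmax(dists)])
    return np.array(centroids, dtype=float)


def kmeans(X, k, n_iter=20):
    centroids = farthest_first_init(X, k)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its points.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids


# Hypothetical unlabelled data: two well-separated groups, with no Y supplied.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
group_b = rng.normal(loc=[8.0, 8.0], scale=0.5, size=(50, 2))
X = np.vstack([group_a, group_b])

labels, centroids = kmeans(X, k=2)
print(centroids.round(1))
```

The algorithm recovers the two groups without ever being told which point belongs where; the cluster ids themselves are arbitrary.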

## The machine learning process

The machine learning process involves 10 key steps

1. Define purpose of ML project;
   a. define the users
   b. what the end results will be
2. Obtain data set for analysis
3. Explore, clean and pre-process the data, including dealing with missing data and outliers, which can be done as follows:
   a. Remove affected records
   b. Manually fill in data (requires a domain expert)
   c. Fill in using an algorithm, e.g. the average of other values
4. Dimension reduction and feature engineering
   a. Eliminate missing or irrelevant data (Missing data is a critical aspect and it dramatically impacts ML performance)
   b. Transform variables i.e., categorical to numeric
   c. Add new features based on domain knowledge
5. Determine the ML task at hand
   a. What is the task i.e., classification, prediction?
6. Partition the data (if supervised ML)
   a. Into training, validation, testing
7. Choose ML techniques
   a. Such as regression, classification trees, clustering techniques etc.,
8. Apply the ML techniques
9. Interpret Results
   a. Compare results against each other and against benchmarks
10. Deploy the ML model (optional)

## Required activity 2.2: Dealing with missing data

### Missing Data Management - Probability Question
Construct a database with $100,000$ rows where each record has $100$ fields. Assume further that for each record, each of the $100$ fields has a $1$ per cent chance of being empty, i.e., its value is missing.

Remove all records with **two** or more empty fields and report the fraction of records that are removed.

    # Wasn't entirely sure whether to submit the code, or the answer, so here are both :)

    # Originally from the exercise sheet provided with the assignment

    import numpy as np
    import pandas as pd
    from scipy.stats import bernoulli

    # 100,000 rows as specified
    nr_rows = 100_000
    # 100 columns as specified
    nr_cols = 100
    df_exercise = pd.DataFrame(np.ones((nr_rows, nr_cols)))

    # Verification of shape
    # print("shape: ", df_exercise.shape)

    # Workings from exercise sheet to randomly generate missing values:
    # each field is independently missing with probability 1/100
    missing_or_not = bernoulli.rvs(1 / 100, size=nr_rows * nr_cols)
    missing_or_not = missing_or_not.reshape(nr_rows, nr_cols)

    # Replace the flagged cells with NaN (mask returns a new DataFrame,
    # which avoids writing through df.values)
    df_exercise = df_exercise.mask(missing_or_not == 1)

    # Corresponds to the requirement that at least two values must be missing
    # in order to be dropped: count the missing values per row, then apply
    # 'ge(2)' for a boolean result if greater than or equal to 2
    df_missing = df_exercise.isnull().sum(axis=1).ge(2)

    # Output as a fraction of all rows
    answer = df_missing.sum() / nr_rows
    answer

0.2662
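
The simulated fraction can be sanity-checked analytically. Since each field is missing independently with probability $0.01$, the number of missing fields in a record is binomial, and a record is removed unless it has zero or one missing field: $1 - 0.99^{100} - 100 \times 0.01 \times 0.99^{99}$. A short sketch of that calculation (the helper name `prob_k_missing` is my own):

```python
from math import comb

p_missing = 0.01  # chance that any single field is empty
n_fields = 100    # fields per record


def prob_k_missing(k):
    """Binomial probability that exactly k of the n_fields are missing."""
    return comb(n_fields, k) * p_missing**k * (1 - p_missing) ** (n_fields - k)


# A record is kept only if it has zero or one missing field.
p_removed = 1 - prob_k_missing(0) - prob_k_missing(1)
print(round(p_removed, 4))  # 0.2642
```

The exact value is close to the simulated result above; the small gap is ordinary sampling variation.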

### Mini-lesson 2.2: Real-world applications of machine learning

#### Case Study: Yelp
- Aimed to rate whether images were good or bad to determine which were shown on the business page
- Used CNNs to develop a model that gave images a score between 0 and 1 based on:
  - Contrast
  - Depth of field
  - Alignment
  - EXIF data
- Those with a higher score were included
- Evaluation was performed by showing selected users the images chosen by the DL model versus those chosen by the original model, and comparing clicks, actions and retention rates

### Live Stream - Office Hour with Matilde D'Amelio

- Forecast (i.e. accuracy) vs inference (i.e., accuracy and model simplicity)

- At the start of an ML project you need to ask yourself how much transparency you want. A **black box model** is more powerful, but at the cost of transparency. A model like a decision tree is not a black box: you can see each of the steps. Such models are known as **white boxes**, in which the decision-making process can be easily traced.
- Then you need to ask what type of output you want: a **prediction model** (a continuous value) or a **classification model** (a categorical value).
- AI projects in industry typically fail because the end users are not considered when making these decisions
- Then you need to consider whether the model is **parametric** (i.e., you assume the shape of $f$) or **non-parametric** (i.e., the shape of $f$ is unknown). If you know the behaviour of the problem then you would use the simpler parametric model.
- An example of a parametric model would be predicting a house price. This is a simple model as the factors that impact house prices are known, for example whether there is a school nearby, the size of the house and the number of rooms
- On the other hand, forecasting the weather is extremely complex and it is very difficult to assume what the relationships between the variables look like. Consequently, a non-parametric approach would be used.
- If you can make assumptions about the parameters and their relationship, the model can be classified as parametric


#### Missing data
- In ML, a row with missing data is typically deleted in its entirety (this doesn't apply to DL); consequently, you lose information
- Options to address this:
  - You **remove** the data, accept the data loss, and collect more if required
  - You fill in the missing data **manually**
  - You fill in the missing data using a **statistical** approach such as the mean, mode, median, max or min. However, if too much of the data is filled in (augmented) this way, ML performance will drop because the data is no longer truly representative of the real data: the model may achieve high performance during training, but it won't during validation.
  - Other methods, known as **imputations**, are:
     - **Normalisation**: take the feature with missing values, e.g. income level, and normalise its non-missing values so that the distribution resembles a bell curve as closely as possible, centred on 0 with a standard deviation of 1. You then fill in all missing data with 0, which is the mean, median and mode of that distribution, as these coincide.
     - **Random regression**: run a regression analysis with all the other features in the row with missing data, and use the correlation between these other features to forecast the value of the missing data.
     - **Multiple imputation**: uses all the features for all of the data to forecast and fill in the missing data. This is much more complex than random regression. Multiple imputation tends to be unstable, and can result in high validation errors if too much of the information was originally missing. It can also produce different values each time it is run if data has been added to the dataset. Despite this, it is one of the most common methods to deal with missing data.
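
The simpler options above can be sketched in a few lines. The age and income values below are invented, and the single-feature linear fit is only a stand-in for the regression approach described in the session; multiple imputation is not shown.

```python
import numpy as np

# Hypothetical data: income depends on age; two income values are missing.
age = np.array([25.0, 30.0, 35.0, 40.0, 45.0, 50.0])
income = np.array([30.0, 35.0, np.nan, 45.0, np.nan, 55.0])
observed = ~np.isnan(income)

# 1. Statistical fill: replace missing entries with the observed mean.
mean_filled = np.where(observed, income, np.nanmean(income))

# 2. Normalisation: standardise the observed values (mean 0, std 1),
#    then fill the gaps with 0, the centre of the standardised feature.
mu, sigma = income[observed].mean(), income[observed].std()
standardised = np.where(observed, (income - mu) / sigma, 0.0)

# 3. Regression fill: fit income ~ age on the complete rows and use
#    the fitted line to predict the missing incomes.
slope, intercept = np.polyfit(age[observed], income[observed], deg=1)
regression_filled = np.where(observed, income, slope * age + intercept)

print(mean_filled)
print(regression_filled.round(2))
```

In this toy data the observed incomes lie exactly on a line, so the regression fill recovers plausible values, while the mean fill gives both gaps the same number regardless of age.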

### 2.3 Assignment

After reviewing the four machine learning applications provided above, find an example of a challenge within your industry that could be addressed through machine learning.

To complete this activity, first briefly summarise a unique challenge in your industry (for example, ‘calculating risk for new insurance policies’ or ‘optimising yields for agricultural output’). Then answer the following questions:

What is the problem you are trying to solve?
Why do you think machine learning is a good fit for addressing this problem?
Finally, list any questions you have about your chosen application for machine learning and its viability. When reading other participants’ posts, provide any advice or suggestions you have to their questions.

Your post should be approximately 500-700 words.

  

#### Optimising Future Crop Yield Production using Predictive Modelling

I will be focusing on Machine learning (ML) applications within the realm of agricultural sciences for the prediction of crop yield performance based on weather patterns. I have chosen this topic as it is a future project that I plan to do within my work but has a lot of unknowns that require careful consideration for any predictive model.

**The Unique Challenge**

Food security is a critical priority for sustainable development. With a global population expected to reach ~10 billion people, competition for arable land, and unknown consequences of climate change, predicting future scenarios will be imperative to success. Ensuring food security in a dynamic and irregular environment is therefore a key challenge in which predictive models can provide a solution. Combining historic weather and yield information with climate models to make yield predictions is one way in which this can be achieved.

**Problem Being Solved:**

Predicting crop yield is no easy task, not least due to the dependence on a multitude of factors, some of which are correlated, whilst others are not. The ability to make ‘informed’ crop predictions requires access to wide-ranging historical data under diverse conditions, as well as accurate predictions of future environments. Features such as species, variation, location, time of year, rainfall, wind speed and cloud cover, to name a few, all have a direct impact on the performance of crops, and their complex relationships are often difficult to capture. Furthermore, obtaining data that is representative of the real world is incredibly challenging, particularly given that many crop growing regions are in developing countries which may lack resources. As such, missing or incomplete data is not uncommon.

The problem to be solved then becomes a two-part prediction problem:

1) How best do we identify the links between crop yield and weather variables given missing and/or redundant data?

2) How well can we use these relationships with climate models to make future yield predictions given further unknowns within the climate models?

**Why is ML a good fit?**

ML is particularly well suited, if not essential, for this task for several reasons. It can work on complex problem domains such as weather and crop yield prediction. It is scalable, offering the processing power to identify patterns and relationships between features that traditional statistical models may overlook, or be unable to process due to the large volumes of available data. Moreover, ML offers advanced data integration, which the diverse datasets require, and sophisticated techniques for dealing with missing data, such as imputation strategies.

Once trained, an ML model remains adaptable, updating with newly available data (e.g. real-time weather), which assists in producing accurate predictions of complex biological systems. As such, ML offers vast predictive power using historic and real-time data to make measurable predictions in the form of crop yield for both short-term weather changes and future climate predictions. A non-parametric model will provide several advantages, for example flexibility, robustness, and the ability to model complex relationships. From a research standpoint time is limiting, so it is valuable that ML can make fully automated predictions based on real-time data without the need for human intervention.

**Viability Questions:**

- Should there be a restriction on the lookback period of historical data? With weather patterns shifting in recent years, is the old data now technically ‘incorrect’ or redundant, and would this data reduce the accuracy of future predictions?

- Whilst non-parametric models can deal with large data, it still makes sense to perform some sort of dimensionality reduction, particularly for those that have very minor impact on predictions. It is however likely that all weather features have some impact on yield, therefore what would be the best technique for evaluating the impact of each feature and should we instead use feature extraction, such as PCA, to transform multiple of these ‘low impact’ features into a single feature?

- What is the best way to account for any errors in climate models? There are multiple published climate models available, and different scenarios incorporated, but these may also have their own errors associated.

- Climate change also leads to increasing frequency and severity of extreme weather events, which can be devastating to crop yield. What is the best way to account for this?

### 2.4 Assignment 

Now that you have identified a potential machine learning application for your industry and received feedback from your peers, you’ll now define more of the details of your specific problem. To complete this activity, provide a brief overview of the problem. Then, answer the following questions:

- What data would you use?
- What are your key input and output variables? 
- What type of machine learning problem is this? 
- What steps would you take to solve this problem through machine learning?
- What might cause missing data in your data set? Which approach outlined in the lecture materials do you think would be most suitable for dealing with missing data, and why?
- Once you have answered all of these questions in a separate document, upload your file below.


#### Problem identification

Ensuring global food security is vital for supporting the world’s population. To achieve this in a sustainable manner, we must therefore understand factors which influence crop yield, whilst simultaneously accounting for the impacts of climate change. Accounting for a dynamic and irregular environment is therefore a key challenge in which predictive models can provide a solution. Combining historic weather and yield information with climate models to make yield predictions is one way in which this can be achieved. 

#### 1. What data would you use?
i. Weather dataset containing features such as:
- Mapping data: timestamp and the latitude and longitude of the observation
- Meteorological data: rainfall, wind speed and direction, temperature, humidity, pressure, cloud cover, solar radiation, event

ii. Crop dataset:
- Latitude and longitude of the crop
- Agronomic data: crop species and variety, planting and harvesting dates, yield
- Field data: soil type, fertility and pH levels, crop area, tillage practices
- Treatment data: pesticides, herbicides, irrigation and fertiliser (applications and timings)
- Note: Data such as crop rotation can be obtained by looking back at previous years

iii. Climate models
- Application of existing climate models generated by meteorological experts
- Will vary for given locations and different climate scenarios


#### 2. What are the inputs and outputs?
##### Inputs
- Weather datasets with normalised (min-max) meteorological data
- Crop datasets with categorical data converted to numerical
- Climate models

##### Outputs
Given that this problem requires both a supervised and an unsupervised approach, outputs have been grouped into the pre-processing stage, which uses an unsupervised approach, and the training and deployment stage, which uses a supervised approach.

Pre-Processing output
- An adjusted dataset with fewer features (principal component analysis) and identification of feature relationships (clustering) 


Training and deployment output
- A continuous variable with a forecast of the weight of the yield. 

#### 3. Problem Type
This is primarily a forecasting problem with the goal of forecasting crop yield. However, given that this problem also uses non-temporal data as discussed above, it can also be classified as a prediction problem.

A non-parametric model is required due to the non-linear relationships between features.

Given the specific problem domain, a combination of both supervised and unsupervised learning will be used. The available datasets are large and complex, and relationships are predominantly unknown. Consequently:
- Unsupervised learning will be applied to the original dataset to establish relationships via clustering and to perform dimensionality reduction.
- Supervised learning will be used to train the ML model on the adjusted features output by the unsupervised learning stage.

#### 4. Solving the problem

##### i. The Purpose 
Predict future crop yield using weather and crop data together with existing climate models. These models will be of particular benefit to researchers, primarily biologists, to support plant phenotyping and breeding, as well as to policy makers involved in agriculture. Any information gathered will subsequently be passed to farmers to improve crop yields.

##### ii. Obtaining the Data
Crop datasets can be obtained from the Food and Agriculture Organization's (FAO) FAOSTAT database: https://www.fao.org/faostat/en/#data for the national scale. Higher temporal resolution data can be obtained directly from farmers or research institutes such as universities with a farm, or international research centres such as the International Rice Research Institute (IRRI) or CIMMYT (wheat and maize).

Using longitude and latitude from the crop datasets, the weather datasets can be obtained through various data providers using APIs (e.g. www.visualcrossing.com), which offer both high and low temporal resolution. Multiple sources will be used. It is likely that duplicate or missing data will arise in this step, which is discussed below.

Climate models are available from sources such as: https://climate.copernicus.eu/climate-datasets 

##### iii. Pre-Processing the Data
Weather data
- Combine datasets
- Handling duplicates:
    - If any of the values do not match, we use the mean value.
    - Duplicate rows in which one or more values are missing are deleted
- Handling missing data
    - Delete row where any value is missing
- Normalisation: Normalise using min-max

Crop data
- Combine datasets
- Handling missing data:
  - Manually obtained from the farmer (if applicable) or, the row is deleted.
- Normalisation: Normalise using min-max
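
The weather steps above could look roughly like this in pandas. The two provider tables, column names and values are all hypothetical, and rows with missing values are dropped before duplicates are averaged, which is one way of satisfying both duplicate rules at once.

```python
import numpy as np
import pandas as pd

# Hypothetical readings from two weather providers for overlapping stations.
source_a = pd.DataFrame({
    "station": ["s1", "s2", "s3"],
    "date": ["2024-01-01"] * 3,
    "rainfall_mm": [2.0, 0.0, 5.0],
    "temp_c": [10.0, 12.0, np.nan],
})
source_b = pd.DataFrame({
    "station": ["s1", "s4"],
    "date": ["2024-01-01"] * 2,
    "rainfall_mm": [4.0, 10.0],
    "temp_c": [12.0, 20.0],
})

# Combine datasets.
combined = pd.concat([source_a, source_b], ignore_index=True)

# Handle missing data: delete any row where a value is missing.
complete = combined.dropna()

# Handle duplicates: where readings for the same station and date
# disagree, keep their mean.
deduped = complete.groupby(["station", "date"], as_index=False).mean()

# Min-max normalise each numeric column to the range [0, 1].
# (A constant column would need special handling to avoid dividing by zero.)
numeric = ["rainfall_mm", "temp_c"]
col_min = deduped[numeric].min()
col_max = deduped[numeric].max()
normalised = deduped.copy()
normalised[numeric] = (deduped[numeric] - col_min) / (col_max - col_min)

print(normalised)
```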

##### iv. Dimensionality Reduction and Feature engineering
Unsupervised learning will be applied to the combined weather and crop dataset to identify the relationships between features. Additionally, feature engineering will be performed, to generate, for example, seasonal data due to various regions having opposing seasons which in theory makes the timestamp incomparable. Finally, dimensionality reduction can be employed to reduce the complexity of the dataset using principal component analysis which should equally reduce the required training time.

##### v.   Partition the Data
Create train (60%), test (20%) and validation (20%) datasets from the adjusted data in steps ii to iv. 
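
A minimal sketch of a 60/20/20 partition, assuming the rows can be shuffled freely; the function name `split_indices` and the row count are illustrative only.

```python
import numpy as np


def split_indices(n_rows, train=0.6, validation=0.2, seed=0):
    """Shuffle the row indices and cut them into train/validation/test parts.
    Whatever is left after train and validation becomes the test set."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)
    n_train = round(n_rows * train)
    n_val = round(n_rows * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )


train_idx, val_idx, test_idx = split_indices(1000)
print(len(train_idx), len(val_idx), len(test_idx))  # 600 200 200
```

The index arrays can then be used to select rows, for example with `df.iloc[train_idx]` on a pandas DataFrame.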

##### vi.  ML Definition, Application and Deployment
- Define a Gradient Boosting Machine (GBM) model
- Define the model hyperparameters including, but not limited to, learning rate, max depth, number of trees and loss function.
- Train the model and interpret results against benchmarks and available targets.
- Evaluate and adjust the model hyperparameters.
- Repeat the training, interpretation and adjustment steps until the results are satisfactory
- Deploy model to biologists to use with plant phenotyping models.
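
To show what the learning rate and number of trees control, here is a toy gradient boosting sketch written from scratch. It is not the library model that would be used in practice: it assumes a single feature, fixes the tree depth at 1 (decision stumps) and the loss at squared error, and the data is synthetic.

```python
import numpy as np


def fit_stump(x, residual):
    """Depth-1 tree: find the threshold whose two leaf means best fit the residual."""
    best = None
    for threshold in np.unique(x)[:-1]:
        left = x <= threshold
        left_val, right_val = residual[left].mean(), residual[~left].mean()
        error = np.sum((residual - np.where(left, left_val, right_val)) ** 2)
        if best is None or error < best[0]:
            best = (error, threshold, left_val, right_val)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_val, right_val = stump
    return np.where(x <= threshold, left_val, right_val)


def fit_gbm(x, y, n_trees=100, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the current residuals
    (the negative gradient of squared-error loss) and add a shrunken copy of it."""
    base = float(y.mean())
    pred = np.full_like(y, base)
    stumps = []
    for _ in range(n_trees):
        stump = fit_stump(x, y - pred)
        pred = pred + learning_rate * predict_stump(stump, x)
        stumps.append(stump)
    return base, learning_rate, stumps


def predict_gbm(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


# Synthetic stand-in for a non-linear yield response to one weather feature.
rng = np.random.default_rng(0)
x = np.linspace(0, 2 * np.pi, 100)
y = np.sin(x) + rng.normal(0, 0.1, 100)

model = fit_gbm(x, y)
gbm_rmse = float(np.sqrt(np.mean((y - predict_gbm(model, x)) ** 2)))
baseline_rmse = float(np.sqrt(np.mean((y - y.mean()) ** 2)))
print(f"baseline RMSE: {baseline_rmse:.3f}, boosted RMSE: {gbm_rmse:.3f}")
```

The errors printed here are training errors; tuning the hyperparameters should be judged on the validation set, since more trees will keep lowering training error even when the model has started to overfit.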

#### 5. Missing Data

i. Causes of missing data
- Lack of records, e.g. the farmer does not have historical data
- Lack of resources, e.g. in developing countries
- Corrupt data, e.g. past crop data may not have been stored correctly
- Legacy systems, e.g. systems that are no longer accessible
- Technological availability, e.g. weather stations 100 years ago
- Technological errors, e.g. failure to correctly log weather data

ii. Dealing with the missing data

A combination of both removal and manual adjustments can be made.

Data removal
- Where yield is not available
- Where multiple values are unavailable in the crop data
- Where only a single row exists in the weather data and it has missing values

Manual adjustments
- When possible, manually obtain the data, e.g. the farmer may have paper records

Whilst imputation methods such as normalisation, random regression and multiple imputation are powerful, using these to estimate missing weather data is not feasible. Small spikes and sudden changes are often highly random, and as such, predicting missing data in this way is bound to result in errors.


### 2.5 Quiz General Notes

- Independent variables, predictors, features and fields are all alternative names for input variables.
- Machine learning focusses on generalisation of the model for predicting new data, so is often validated using predictive accuracy, sometimes at the cost of explainability. In machine learning, the aim is to find a function that operates on the input variables to predict response variables.
- Statistics often uses goodness of fit metrics and residual analysis to validate models.
- The sole motivation of forecasting is predictive accuracy. We’re not interested in understanding how the outputs are affected by the inputs but instead in learning a function that can be used to predict future observations with high accuracy, regardless of how understandable the function is.
- Inference is used for understanding the relationship between input and output variables. Because of this, we care much more about the explainability of the model used. For example, how do sales depend on marketing? Or does smoking reduce life expectancy?
- Prediction problems have an output variable that can be numerical or continuous.
- Prediction problems have a numerical output variable, while classification problems have categorical outputs. In each of the correct examples, we are sorting into categories (e.g. ‘beautiful or not beautiful’ or ‘defective or not defective’).
- In unsupervised learning, we don’t have an output variable. Rather, we use input data to cluster the data into groups based on similar characteristics. The two grouping examples are unsupervised, as we don’t have any output variables.
- We can deal with missing data and outliers either by removing affected records (as long as we still have enough data left), manually filling in the data (often by experts) or filling in the data using an algorithm.
- Dimension reduction and feature engineering are used for removing variables that are not going to be available at the point in time the model is run and eliminating variables that aren’t relevant.

### Live Stream - Office Hour with Yu Qian Ang

- Machine learning is the ability of systems to acquire their own knowledge by extracting patterns from raw data
- Another definition: a computer program is said to learn from experience $E$ with respect to some task $T$ and some performance measure $P$, if its performance on $T$, as measured by $P$, improves with experience $E$
- Why do we want to find the function $f$ in $Y = f(X_1, \dots, X_n) + \alpha$?
    - For **prediction**: So we can use $f$ to predict values of $Y$ for new values of $X$
    - For **inference**: We might be interested in the type of relationship between $Y$ and $X$
- Machine learning places great emphasis on large-scale and real-time data as well as prediction accuracy; the models are often black boxes, so we don't know what is going on inside
- Statistical learning has a greater emphasis on models, model interpretability, and statistical properties of estimation
- Google Xai: Explainable artificial intelligence (XAI) is a set of processes and methods that allows human users to comprehend and trust the results and output created by machine learning algorithms
- **Regression problems** focus on predicting numbers such as income or stock prices; numeric / continuous value
- **Classification problems** predict categories such as a dog or cat; discrete / categorical value
- Supervised = labels; unsupervised = no labels, and can perform dimensionality reduction and clustering
- **Parametric** models reduce the problem of estimating $f$ to finding a finite set of parameters, e.g. in linear regression we only need to find the intercept and slope: $Y = \beta_0 + \beta_1 X$, where $\beta_0$ is the intercept and $\beta_1$ is the slope. **Non-parametric** models make no assumptions about the functional form of $f$, but require more data.
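
For simple linear regression the two parameters have a closed-form least-squares solution: $\beta_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}$ and $\beta_0 = \bar{y} - \beta_1 \bar{x}$. A minimal sketch on made-up points that lie exactly on $Y = 1 + 2X$:

```python
def fit_line(xs, ys):
    """Least-squares estimates of the intercept (beta_0) and slope (beta_1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    beta_1 = sxy / sxx
    beta_0 = y_bar - beta_1 * x_bar
    return beta_0, beta_1


xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]  # generated from y = 1 + 2x with no noise

beta_0, beta_1 = fit_line(xs, ys)
print(beta_0, beta_1)  # 1.0 2.0
```

However many data points are supplied, the fitted model is still described by just these two numbers, which is what makes it parametric.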