# Lecture 3 notes

- An ML model formalizes an expression that maps input data to an output.
- Put simply, it learns a target function (f) that best maps the input (X) to the output (Y):
- Y = f(X)
- The learned function is then used to predict the output (Y) for new examples of the input variables (X).
- We don’t know what the function (f) looks like or its form. If we did, we would use it directly and would not need to learn it from data.
- The complete model includes an error term (e) that is independent of the input data (X):
- Y = f(X) + e
- Here e is the irreducible error: the part of Y that X cannot explain (noise, unmeasured factors). It is not the loss function or an accuracy measure; those describe how well a learned approximation fits, whereas e remains even if f is known exactly.
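
To make Y = f(X) + e concrete, here is a minimal simulation sketch. The "true" function `f` (a line, 2x + 1) and the Gaussian noise level are assumptions chosen purely for illustration; in a real problem `f` is unknown.

```python
import random

def f(x):
    # Hypothetical "true" function; in practice this is what we do not know.
    return 2.0 * x + 1.0

def sample(n, noise_sd=0.5, seed=0):
    """Draw n points (x, y) with y = f(x) + e, where e does not depend on x."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    es = [rng.gauss(0, noise_sd) for _ in range(n)]  # error term e
    ys = [f(x) + e for x, e in zip(xs, es)]
    return xs, ys, es

xs, ys, es = sample(1000)
mean_error = sum(es) / len(es)  # close to 0 for zero-mean noise
print(f"average error term: {mean_error:.3f}")
```

Even with `f` known exactly, the observed `ys` differ from `f(x)` by `e`, which is why no model can predict Y perfectly.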

# **Function approximation**

### Notation:
- **H**: This represents the **hypothesis space**—the set of all possible functions (or models) that can be used to map the input data \( X \) to the output \( Y \).
  
- **h**: This represents an individual **hypothesis** or a specific function within the hypothesis space \( H \). In machine learning, \( h \) is the function we are trying to learn. It takes the input data \( X \) and maps it to the output \( Y \).
  
- **\( h: X \to Y \)**: This indicates that the hypothesis \( h \) is a function that maps elements from the **input space** \( X \) (the set of all possible input values) to the **output space** \( Y \) (the set of all possible output values).

### Breaking down the expression:
\[
H = \{ h \mid h: X \to Y \}
\]

This means:
- \( H \) is the set of all possible functions \( h \) such that each function \( h \) maps inputs from \( X \) to outputs in \( Y \).
- In other words, the set \( H \) includes every potential function \( h \) that could be a candidate for approximating the true underlying function \( f \) (the real relationship between \( X \) and \( Y \)).

### Example:
- Suppose \( X \) represents the input features for predicting house prices, such as the size of the house, number of rooms, and location.
- \( Y \) represents the target variable (e.g., house prices).
- The **hypothesis space** \( H \) might include various models, such as linear regression models, polynomial models, decision trees, neural networks, etc., each of which is a specific function \( h \) that maps from \( X \) to \( Y \).

The goal of **function approximation** in machine learning is to search for the best hypothesis \( h^* \in H \) that most accurately approximates the true function \( f \) based on the training data. This is done by minimizing some kind of error (e.g., the difference between predicted values \( \hat{Y} \) and actual values \( Y \)).
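
As a rough sketch of this search, the snippet below picks the best hypothesis from a tiny, hand-written hypothesis space by minimizing mean squared error. The house data and the three candidate functions are made up for illustration; real learners search far larger spaces with optimization rather than enumeration.

```python
# Hypothetical training data: house size (m^2) -> price (in thousands).
sizes = [50, 80, 100, 120, 150]
prices = [155, 238, 305, 356, 452]

# A tiny hypothesis space H: each entry is one candidate function h.
H = {
    "h1: 2 * x": lambda x: 2 * x,
    "h2: 3 * x": lambda x: 3 * x,
    "h3: 2 * x + 100": lambda x: 2 * x + 100,
}

def mse(h, xs, ys):
    """Mean squared error between predictions h(x) and actual values y."""
    return sum((h(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Evaluate every hypothesis and keep the one with the smallest error.
errors = {name: mse(h, sizes, prices) for name, h in H.items()}
best_name = min(errors, key=errors.get)
print(best_name, errors[best_name])
```

Here `best_name` plays the role of \( h^* \): the member of \( H \) with the lowest training error.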

### Summary:
- \( H \): The set of all possible functions (hypotheses) that could explain the relationship between \( X \) and \( Y \).
- \( h \): An individual function (hypothesis) within \( H \), mapping \( X \) to \( Y \).
- The machine learning process involves selecting or learning the best \( h \in H \) to approximate the true function \( f \) based on the available data.


![image.png](attachment:image.png)

# **Data** 
- Refers to a collection of factual information, including numbers, words, measurements, observations, or descriptions of objects, events, or phenomena.

- **Types of Data**:
  - **Qualitative Data**: Descriptive information that characterizes but does not measure something. For example, *"It was great fun"* or *"The sky is blue."*
  - **Quantitative Data**: Numerical data that can be measured and expressed as numbers.
    - **Discrete Data**: Countable values, typically whole numbers. For example, *"There are 5 apples."*
    - **Continuous Data**: Measurable values that can take any value within a range. For example, *"The temperature is 3.26°C."*



### **Types of Data**:
1. **Qualitative (Categorical) Data**:
   - Descriptive and conceptual, capturing non-numeric information that describes qualities or characteristics.
   - **Nominal**: Data can only be categorized, but not ranked.  
     - Example: *Gender (male/female), Eye color (blue/brown/green/black).*
   - **Ordinal**: Data can be categorized and ranked, but the differences between ranks are not measured.
     - Example: *Grade (A/B/C/D/F), Satisfaction level (satisfied/neutral/dissatisfied), Education level (bachelor/master/PhD).*

2. **Quantitative (Numerical) Data**:
   - Numerical data that can be measured or counted. It represents quantities or amounts, making it useful for mathematical calculations and statistical analysis.
   - **Discrete**: Consists of countable values, typically whole numbers.
     - Example: *Human population, Number of cars in a parking lot.*
   - **Continuous**: Represents measurable values that can take any value within a range.
     - Example: *Height, Weight, Temperature, Time.*
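
The practical difference between nominal and ordinal data shows up in which operations make sense. The sketch below uses made-up samples: nominal values can only be counted, while ordinal values can also be sorted once a rank is assigned (the `grade_rank` mapping is an assumption for illustration).

```python
from collections import Counter

# Nominal: categories with no order; counting is meaningful, ranking is not.
eye_colors = ["blue", "brown", "green", "brown", "brown"]
color_counts = Counter(eye_colors)
most_common_color = color_counts.most_common(1)[0][0]

# Ordinal: categories with an order; encode the rank explicitly.
grade_rank = {"F": 0, "D": 1, "C": 2, "B": 3, "A": 4}
grades = ["B", "A", "F", "C", "B"]
sorted_grades = sorted(grades, key=grade_rank.get)
median_grade = sorted_grades[len(sorted_grades) // 2]  # middle of an odd-length list

print(most_common_color, sorted_grades, median_grade)
```

Note that the ranks only give an order: the gap between A and B is not assumed to equal the gap between D and F.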

# Data Sample
We will use the following data sample:

$ X = [3, 7, 8, 5, 12, 14, 21, 13, 18] $

---

## 1. Sample Mean (𝜇̂, 𝑋̄)

**Formula**:  
$ \bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i $

**Example**:
$ \bar{X} = \frac{3 + 7 + 8 + 5 + 12 + 14 + 21 + 13 + 18}{9} = \frac{101}{9} \approx 11.22 $

The **mean** (or average) of the sample is approximately **11.22**.
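
A quick check of this calculation in Python, computing the mean by hand and with the standard-library `statistics` module:

```python
import statistics

X = [3, 7, 8, 5, 12, 14, 21, 13, 18]

mean_manual = sum(X) / len(X)   # (1/n) * sum of X_i
mean_lib = statistics.mean(X)   # same quantity from the standard library

print(round(mean_manual, 2))
```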

---

## 2. Sample Variance (𝜎̂², S²)

**Formula**:  
$ S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2 $

**Example**:
1. Compute the squared difference of each data point from the mean:
   $ (3 - 11.22)^2, (7 - 11.22)^2, \dots, (18 - 11.22)^2 $
   
2. Sum the squared differences (listed in the original data order, using the unrounded mean $ 101/9 $):
   $ 67.60 + 17.83 + 10.38 + 38.72 + 0.60 + 7.72 + 95.60 + 3.16 + 45.94 \approx 287.56 $
   
3. Divide by $ n-1 $ (degrees of freedom):
   $ S^2 = \frac{287.56}{8} \approx 35.94 $

The **variance** of the sample is approximately **35.94**.
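
The three steps can be recomputed from the raw data as a sanity check. This sketch follows the formula directly and compares against `statistics.variance`, which also divides by n - 1:

```python
import statistics

X = [3, 7, 8, 5, 12, 14, 21, 13, 18]
n = len(X)
mean = sum(X) / n

# Step 1: squared deviation of each point from the (unrounded) mean.
squared_devs = [(x - mean) ** 2 for x in X]

# Step 2: sum of squared deviations.
sum_sq = sum(squared_devs)

# Step 3: divide by n - 1 (degrees of freedom) for the sample variance.
sample_var = sum_sq / (n - 1)

print(round(sum_sq, 2), round(sample_var, 2))
```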

---

## 3. Order Statistic (𝑋(𝑖))

The $ i $-th order statistic $ X_{(i)} $ is the $ i $-th smallest value in the sample. To find it, sort the data in ascending order:

$ [3, 5, 7, 8, 12, 13, 14, 18, 21] $

For example, the **2nd order statistic** (2nd smallest value) is **5**.

---

## 4. Sample Median (𝑋₀.₅)

The median is the middle value when data is ordered:

$ [3, 5, 7, 8, 12, 13, 14, 18, 21] $

Since $ n = 9 $ (odd number of values), the median is the middle value, which is **12**.
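
Both the order statistic and the median come from the sorted sample. A small sketch, where `order_statistic` is a helper name introduced here for illustration:

```python
import statistics

X = [3, 7, 8, 5, 12, 14, 21, 13, 18]

def order_statistic(data, i):
    """Return the i-th smallest value of data (1-indexed, as in X_(i))."""
    return sorted(data)[i - 1]

ordered = sorted(X)
second = order_statistic(X, 2)   # 2nd smallest value
median = statistics.median(X)    # middle value for odd n

print(ordered, second, median)
```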

---

## 5. Sample Quartiles (𝑋₍𝛼₎)

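The quartiles divide the ordered data into four parts of roughly equal size. Several conventions exist and can give slightly different values; the method used here takes the median of each half, excluding the overall median from both halves when $ n $ is odd: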

- **First quartile (Q1, 𝑋₀.₂₅)**: The median of the lower half (below 12):
  $ [3, 5, 7, 8] $
  $ Q1 = \frac{5 + 7}{2} = 6 $
  
- **Third quartile (Q3, 𝑋₀.₇₅)**: The median of the upper half (above 12):
  $ [13, 14, 18, 21] $
  $ Q3 = \frac{14 + 18}{2} = 16 $

---

## 6. Sample Interquartile Range (IQR)

**Formula**:  
$ IQR = Q3 - Q1 $

**Example**:
$ IQR = 16 - 6 = 10 $

The **interquartile range** is **10**, representing the spread of the middle 50% of the data.
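
A sketch of the median-of-halves method used above. The function name is made up for illustration. The last lines compare against `statistics.quantiles`, whose default (`"exclusive"`) method agrees with these values for this sample, while `method="inclusive"` uses a different convention and returns different quartiles.

```python
import statistics

X = [3, 7, 8, 5, 12, 14, 21, 13, 18]

def quartiles_median_of_halves(data):
    """Q1, median, Q3 via the median of each half (needs at least 2 values).

    For odd n the overall median is excluded from both halves.
    """
    s = sorted(data)
    half = len(s) // 2
    lower, upper = s[:half], s[-half:]
    return statistics.median(lower), statistics.median(s), statistics.median(upper)

q1, q2, q3 = quartiles_median_of_halves(X)
iqr = q3 - q1
print(q1, q2, q3, iqr)

# Library conventions for comparison.
exclusive = statistics.quantiles(X, n=4)                      # default method
inclusive = statistics.quantiles(X, n=4, method="inclusive")  # alternative convention
```

Because conventions differ, it is worth stating which one was used whenever quartiles or the IQR are reported.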


# Data Type Graphs

- **Nominal Data**: 
  - **Bar Chart**: Useful for comparing the frequency or proportion of categories.
  - **Pie Chart**: Good for showing the proportion of each category as part of a whole.

- **Ordinal Data**:
  - **Bar Chart**: Effective for showing the frequency of ordered categories.
  - **Boxplot**: Can be used to show the distribution and central tendency of ordered categories.

- **Discrete Data**:
  - **Histogram**: Displays the frequency of distinct values or bins.
  - **Boxplot**: Provides a summary of the distribution, including median and quartiles.

- **Continuous Data**:
  - **Histogram**: Shows the distribution of continuous data by grouping data into bins.
  - **Boxplot**: Useful for visualizing the distribution, central tendency, and spread.
  - **Scatterplot**: Displays relationships between two continuous variables.
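
To show what "grouping data into bins" means for a histogram, here is a dependency-free sketch that counts values per bin and prints a text histogram. The bin width of 5 and the helper name `bin_counts` are arbitrary choices for illustration; a plotting library would normally do this step.

```python
def bin_counts(data, bin_width, start=0):
    """Count values per half-open bin [lo, lo + bin_width); empty bins are omitted."""
    counts = {}
    for x in data:
        lo = start + ((x - start) // bin_width) * bin_width
        counts[lo] = counts.get(lo, 0) + 1
    return dict(sorted(counts.items()))

X = [3, 7, 8, 5, 12, 14, 21, 13, 18]
counts = bin_counts(X, bin_width=5)

# One row per bin, one '#' per observation.
for lo, c in counts.items():
    print(f"[{lo:>2}, {lo + 5:>2}) {'#' * c}")
```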


# Univariate Analysis
- **Definition**: Involves analyzing only one variable at a time.
- **Purpose**: To summarize and find patterns or insights about the single variable.
- **Common Techniques**:
  - **Descriptive Statistics**: Mean, median, mode, variance, and standard deviation.
  - **Graphs**: Histograms, bar charts, pie charts, boxplots.
  - **Example**: Analyzing the distribution of student grades in a class.

# Multivariate Analysis
- **Definition**: Involves analyzing multiple variables simultaneously to understand relationships and interactions between them.
- **Purpose**: To explore how variables relate to each other and how they jointly affect outcomes.
- **Common Techniques**:
  - **Correlation Analysis**: Examines the relationship between pairs of variables.
  - **Multiple Regression** (often loosely called multivariate regression): Analyzes the effect of multiple predictors on a single outcome.
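  - **Principal Component Analysis (PCA)**: Reduces dimensionality by projecting the data onto a few new variables (principal components), which are linear combinations of the original variables chosen to capture the most variance.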
  - **Graphs**: Scatterplot matrices, 3D scatter plots, pair plots.
  - **Example**: Analyzing how different factors (e.g., hours studied, attendance, previous grades) affect overall student performance.
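
As a small illustration of correlation analysis, the sketch below computes the Pearson correlation coefficient between two variables from its definition. The hours-studied and score values are made-up data, not results from the lecture.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical data: hours studied vs. exam score for five students.
hours = [1, 2, 3, 4, 5]
scores = [52, 58, 65, 70, 78]

r = pearson_r(hours, scores)
print(f"r = {r:.3f}")
```

Values of r near +1 or -1 indicate a strong linear relationship; values near 0 indicate little linear relationship. Correlation alone does not show that one variable causes the other.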

### Summary
- **Univariate Analysis** focuses on a single variable, providing insights into its individual characteristics.
- **Multivariate Analysis** explores relationships and dependencies among multiple variables, offering a more complex view of data interactions.