# L16 : String Processing

**What is string processing?**
 *  String processing means manipulating strings (splitting, joining, searching, replacing) to extract useful information for our analysis

**Index of basic string processing functions**
  * `strsplit(x, split)`
    * Splits a string `x` into substrings wherever the delimiter `split` occurs.
    * Inputs:
        * `x` is the input string or character vector.
        * `split` is the pattern (string or character) on which to divide the input.
    * Returns a list with one character vector of pieces per element of `x` (e.g. `"Ana are mere"` -> `"Ana" "are" "mere"`)
    * Typical syntax:
        * `strsplit(x = "Your text here", split = " ")`

  * `paste0(x1, x2, x3, ...)`
    * concatenates strings with no delimiter or separation between the strings.

  * `paste(x1, x2, x3, ..., sep, collapse)`
    * concatenates strings with a string `sep` in between. Can also be used to concatenate strings in a vector using `collapse`

  * `toupper(x)`
    * converts all characters in a string to upper case

  * `tolower(x)`
    * converts all characters in a string to lower case

  * `nchar(x)`
    * outputs the number of characters in a string
    
  * `substring(text, first, last)`
    * Extracts the `first` through `last` characters from a string `text`

  * `grepl(pattern, x)`
    * Outputs a logical (`TRUE`/`FALSE`) for each element of `x`, indicating whether it contains `pattern`

  * `grep(pattern, x)`
    * Outputs the indices of the elements of a character vector `x` that contain `pattern`

  * `gsub(pattern, replacement, x)`
    * Replaces every occurrence of `pattern` in string `x` with another string `replacement`

  * `gregexpr(pattern, text)`
    * Outputs the starting position of every match of `pattern` in a string `text`

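A minimal sketch of the functions in the index above, applied to one short string. The variable names are chosen for illustration only; the values in the comments follow from the inputs shown.

```r
x <- "Ana are mere"

# strsplit() returns a list; [[1]] takes the pieces for the first (only) string
words <- strsplit(x, split = " ")[[1]]                  # "Ana" "are" "mere"

# paste0() joins with no separator; paste() puts `sep` between its arguments
# and uses `collapse` to join the elements of a single vector
joined_plain <- paste0("Ana", "are", "mere")            # "Anaaremere"
joined_sep   <- paste("Ana", "are", "mere", sep = "-")  # "Ana-are-mere"
rejoined     <- paste(words, collapse = " ")            # "Ana are mere"

upper      <- toupper(x)                                # "ANA ARE MERE"
lower      <- tolower(x)                                # "ana are mere"
n_chars    <- nchar(x)                                  # 12
first_word <- substring(x, 1, 3)                        # "Ana"

has_are     <- grepl("are", x)                          # TRUE
which_match <- grep("are", words)                       # 2
replaced    <- gsub("mere", "pere", x)                  # "Ana are pere"
```

Note the `[[1]]` after `strsplit()`: without it the result is still wrapped in a list, which is a common source of confusion.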

# L17 : Regular Expressions


## 🔍 Basic R String Matching Functions

### 1. `grepl(pattern, x)`
- **Purpose**: Checks if the `pattern` exists in each element of character vector `x`.
- **Returns**: A logical vector (`TRUE` or `FALSE` per element).
- **Example**:
  ```r
  grepl("data", c("data science", "statistics"))
  # [1] TRUE FALSE
  ```

### 2. `grep(pattern, x, value = FALSE)`
- **Purpose**: Returns the **indices** of elements that match the `pattern`.
- **Set `value = TRUE`** to return the matching strings instead.
- **Example**:
  ```r
  grep("science", c("data science", "biology"))
  # [1] 1

  grep("science", c("data science", "biology"), value = TRUE)
  # [1] "data science"
  ```

### 3. `gsub(pattern, replacement, x)`
- **Purpose**: Replaces all matches of `pattern` in `x` with `replacement`.
- **Example**:
  ```r
  gsub("dog", "cat", "The dog runs fast")
  # [1] "The cat runs fast"
  ```

### 4. `gregexpr(pattern, text)`
- **Purpose**: Returns a list of the **starting positions** of all matches of `pattern` in `text`.
- **Example**:
  ```r
  gregexpr("\\d", "abc123")
  # [[1]] 4 5 6   (plus match-length attributes)
  ```

`regmatches(x, m)`: extracts the matched substrings from a string `x`, given the match data `m` returned by `regexpr()` or `gregexpr()`.

---

## 🔣 Regular Expression Patterns (Regex Syntax)

### 1. Character Classes
| Pattern     | Meaning                              |
|-------------|---------------------------------------|
| `.`         | Any single character except newline    |
| `[abc]`     | Matches one of: a, b, or c            |
| `[^abc]`    | Matches anything except a, b, or c    |
| `[a-z]`     | Matches any lowercase letter          |
| `[0-9]` or `\d` | Matches any digit                |
| `\D`        | Non-digit character                  |
| `\s`        | Whitespace character                 |
| `\S`        | Non-whitespace character             |
| `\w`        | Word character (letter, digit, `_`)  |
| `\W`        | Non-word character                   |

### 2. Anchors
| Pattern | Meaning                      |
|---------|-------------------------------|
| `^`     | Start of string               |
| `$`     | End of string                 |

### 3. Quantifiers
| Pattern   | Meaning                          |
|-----------|-----------------------------------|
| `*`       | 0 or more occurrences              |
| `+`       | 1 or more occurrences              |
| `?`       | 0 or 1 occurrence                  |
| `{n}`     | Exactly n occurrences              |
| `{n,}`    | At least n occurrences             |
| `{n,m}`   | Between n and m occurrences        |
| `t*m`     | Zero or more `t`s followed by an `m` |

### 4. Grouping and Alternation
| Pattern     | Meaning                              |
|-------------|---------------------------------------|
| `(abc)`     | Grouping (for precedence/capture)     |
| `a\|b`      | Either a or b                         |

---

## 🛠️ Practical Examples

### Match a word anywhere in a sentence
```r
pattern <- "science"
text <- c("data science", "political science", "biology")
grep(pattern, text, value = TRUE)
```

### Replace multiple spaces with a single space
```r
gsub(" +", " ", "this    is  messy")
```

### Extract numeric values from a string
```r
text <- "Height: 5'10\""
gregexpr("\\d+", text)
# Returns positions of numbers
```
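
`gregexpr()` on its own only reports where the digits are. To pull out the digits themselves, the match data can be passed to `regmatches()`. A minimal sketch, using the same example string:

```r
text <- "Height: 5'10\""

# gregexpr() records where each run of digits starts and how long it is
locs <- gregexpr("\\d+", text)

# regmatches() uses that match data to extract the matched substrings
numbers <- regmatches(text, locs)[[1]]   # "5" "10"

# Convert to numeric if arithmetic is needed
values <- as.numeric(numbers)            # 5 10
```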

### Match U.S. phone number format
```r
pattern <- "\\(\\d{3}\\) \\d{3}-\\d{4}"
text <- "Call (123) 456-7890 now!"
grepl(pattern, text)
```
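
As a sketch of how anchors and alternation change what counts as a match, the phone-number pattern can be tested with and without `^...$`. The candidate strings here are made up for illustration.

```r
pattern <- "\\(\\d{3}\\) \\d{3}-\\d{4}"
candidates <- c("(123) 456-7890", "Call (123) 456-7890 now!", "123-456-7890")

# Unanchored: TRUE whenever the pattern appears anywhere in the string
anywhere <- grepl(pattern, candidates)                     # TRUE TRUE FALSE

# Anchored with ^ and $: TRUE only when the whole string is a phone number
whole <- grepl(paste0("^", pattern, "$"), candidates)      # TRUE FALSE FALSE

# Alternation inside a group: accept either "-" or "." as the separator
either_sep <- grepl("^\\d{3}(-|\\.)\\d{4}$",
                    c("456-7890", "456.7890", "456 7890")) # TRUE TRUE FALSE
```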

## 🧠 Tips
- Always use double backslashes (`\\`) in R regex to escape characters.
- Combine functions: use `strsplit()` + regex to tokenize or clean text.
- Wrap your regex in `^...$` to match the entire string.
- Test patterns using `grepl()` before replacing or extracting.



# L18 : Web Scraping

# R Web Scraping - Cheatsheet (Advanced Edition)

## 📘 What is Web Scraping?
Web scraping is the **automated extraction of data from websites**, especially when data is not available in structured file formats like `.csv`, `.json`, or `.xlsx`. Instead, the data lives inside HTML code — and we use R to **load, navigate, and extract** this information.

---

## 📦 Primary Package: `rvest`
- Built for scraping and parsing HTML in a tidy way.
- Uses CSS selectors to locate elements.

---

## 🌐 How the Web Works (Minimal Background)
- Websites are built in **HTML (HyperText Markup Language)**.
- HTML is hierarchical: tags nest within tags (e.g., `<body>`, `<div>`, `<h2>`).
- Browsers render this structure into the visuals we see.
- We can **view the source code** (Right-click → View Page Source or `F12` → Inspect) to understand how data is embedded.

---

## 🧩 HTML Vocabulary You Must Know
| Concept         | Meaning                                                                 |
|----------------|-------------------------------------------------------------------------|
| **Tag**        | HTML instruction, e.g., `<p>`, `<h2>`                                   |
| **Element**    | A tag plus its content: `<p>Hello</p>`                                  |
| **Text**       | The content displayed between opening & closing tags                    |
| **Attribute**  | Modifier inside opening tag: `<a href="url">`                         |
| **Attribute Value** | The value assigned to the attribute: `href="https://..."`         |
| **Class**      | A common attribute: used to group/style elements                        |

---

## 🔁 Full Web Scraping Workflow

### 1. Load the Web Page
```r
library(rvest)
html <- read_html("https://example.com")
```

### 2. Navigate and Explore with DevTools
Use **Right-click → Inspect** to:
- Find tag names (e.g., `h2`, `table`, `p`)
- Identify class names (e.g., `.director`)
- Check attributes (e.g., `href`, `data-id`)

### 3. Extract Elements
```r
html_elements(html, "section")               # Extract <section> blocks
html_elements(html, "section h2")            # Nested selection: <h2> inside <section>
html_elements(html, ".director")            # All elements with class="director"
```

### 4. Extract Visible Text
```r
html_text(elements, trim = TRUE)             # Only the human-visible part of an element
```

### 5. Extract Attributes
```r
html_attr(elements, "href")                  # Get hyperlink references
html_attr(elements, "data-id")               # Grab custom attribute values
```

```r
html_elements(html, "title")                 # extract all <title> elements
html_table(html)                             # extract every <table> in the page and
                                             # return them as a list of data frames
```

```r
# For the element <title>Ana are mere</title>:
title_element <- html_element(html, "title")
html_text(title_element)                    # extracts the string "Ana are mere"
```

```r
# For the element <meta name="viewport" content="">:
meta_element <- html_element(html, "meta[name='viewport']")
html_attr(meta_element, "name")             # extracts the string "viewport"
```

```r
html_elements(html,".highlight")                    # extract all elements with class = "highlight"
```
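
The extraction calls above can be tried without a network connection by parsing a small HTML string. This is only a sketch: the page content, class name, and URLs are invented for illustration, and it assumes `rvest` 1.0 or later is installed.

```r
library(rvest)

# read_html() also accepts a literal HTML string (hypothetical page)
page <- read_html('<html>
  <head><title>Ana are mere</title></head>
  <body>
    <section>
      <h2 class="highlight">Films</h2>
      <a href="https://example.com/a" data-id="1">First</a>
      <a href="https://example.com/b" data-id="2">Second</a>
    </section>
  </body>
</html>')

title_text  <- html_text(html_element(page, "title"))      # text inside <title>
links       <- html_elements(page, "section a")            # <a> nested in <section>
link_text   <- html_text(links, trim = TRUE)               # visible link text
link_urls   <- html_attr(links, "href")                    # href attribute values
link_ids    <- html_attr(links, "data-id")                 # custom attribute values
highlighted <- html_text(html_elements(page, ".highlight")) # by class name
```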

---

## 📋 Table Extraction Workflow

### Step-by-step:
```r
tables <- html_elements(html, "table")       # All <table> elements
all_tables <- html_table(tables)             # Convert to list of data frames
films <- all_tables[[2]]                     # Choose relevant table
```

### Cleaning Extracted Table
```r
names(films) <- c("title", "director", "release_date", ...)  # Rename cols
films <- films[, 1:(ncol(films) - 1)]                         # Drop reference col
```

### Extract and Clean Release Dates
```r
pattern <- "\\d{4}-\\d{2}-\\d{2}"
locs <- gregexpr(pattern, films$release_date)
films$release_date <- sapply(regmatches(films$release_date, locs), `[`, 1)
```

### Remove Non-Movie Rows (e.g., Trilogy Headers)
```r
films <- films[!grepl("trilogy", films$title, ignore.case = TRUE), ]
```
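
The cleaning steps can be rehearsed on a hand-built data frame before running them on a scraped table. A minimal base R sketch follows; the rows, column names, and dates are invented stand-ins, not real scraped output.

```r
# Hypothetical stand-in for a scraped table (all values invented)
films <- data.frame(
  X1 = c("Original trilogy", "Film A", "Film B"),
  X2 = c("Original trilogy", "Director A", "Director B"),
  X3 = c("Original trilogy", "May 25, 1977 (1977-05-25)", "June 20, 1980 (1980-06-20)"),
  X4 = c("Original trilogy", "[1]", "[2]"),
  stringsAsFactors = FALSE
)

# Rename columns, then drop the last (reference) column
names(films) <- c("title", "director", "release_date", "ref")
films <- films[, 1:(ncol(films) - 1)]

# Remove header-like rows that are not films
films <- films[!grepl("trilogy", films$title, ignore.case = TRUE), ]

# Keep only the first YYYY-MM-DD date found in each cell
pattern <- "\\d{4}-\\d{2}-\\d{2}"
locs <- gregexpr(pattern, films$release_date)
films$release_date <- sapply(regmatches(films$release_date, locs), `[`, 1)
films$release_date <- as.Date(films$release_date)
```

Converting with `as.Date()` at the end is an optional extra step so the column can be sorted or plotted as a date.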

---

## 🧠 Practical Summary: Key Functions
| Function               | Use Case                                      |
|------------------------|-----------------------------------------------|
| `read_html()`          | Load the HTML content from a URL              |
| `html_elements()`      | Find all matching HTML tags or classes        |
| `html_text()`          | Extract readable content                      |
| `html_attr()`          | Pull out values of specific attributes        |
| `html_table()`         | Convert HTML `<table>` into a data frame      |

---

## ⚠️ Static vs. Dynamic Websites
| Type     | Description                                                  | Tools       |
|----------|--------------------------------------------------------------|-------------|
| **Static**   | Content is part of the original HTML (scrape with `rvest`)  | `rvest`     |
| **Dynamic**  | Content appears only after interaction/JS execution        | `RSelenium` |

> 🔍 Dynamic scraping requires automation: loading pages, clicking buttons, etc.

---

## 🧠 Final Reminders
- Always inspect HTML structure before scraping.
- Tag navigation is hierarchical — use nested selectors for deeper elements.
- Use classes (with `.`) or tag names to isolate what you need.
- Attributes often contain useful metadata like IDs or links.
- Cleaning your data (renaming, trimming, regex) is a normal part of scraping.


# L20 + L21 : Data Visualization


## 🔍 What is Data Visualization?
Data visualization is the practice of representing data through graphical formats. It helps uncover trends, patterns, outliers, and relationships within a dataset and supports communication of insights.

In R, **ggplot2** is the dominant package for creating clear, structured, and customizable data graphics based on the **Grammar of Graphics** approach.

* `ggplot()` : creates an empty plot
* `library(ggplot2)` : loads the package needed to create a ggplot

* The size visual dimension is appropriate for ordinal (categorical) and numeric data
* The color visual dimension is appropriate for ordinal (categorical), numeric, and nominal (categorical) data
* A barplot is a better substitute for a pie chart
* A log transformation is appropriate when your data are severely right-skewed
* The default colors in ggplot aren't colorblind-friendly
* You can include another variable in a box plot using the `fill =` argument, or split the plot into panels with `facet_wrap()`
* Barplots need to include zero
* 3D plotting is good when
---

## 🧠 Grammar of Graphics: Conceptual Structure
The philosophy of ggplot2 is to build a plot layer-by-layer using consistent rules.

### 🧱 Core Components
Each plot consists of the following elements:

1. **Data**: The dataset you are working with
2. **Aesthetics (`aes`)**: Mappings from variables to visual properties
3. **Geoms**: Geometric objects that define how data should appear (bars, points, lines)
4. **Scales**: Define axes and color ranges
5. **Facets**: Split data into subplots
6. **Coordinate System**: Defines the plotting space
7. **Theme**: Controls the visual appearance

---

## ⚙️ ggplot2 Plot Syntax
```r
ggplot(data = <DATA>) +
  aes(x = <X-VAR>, y = <Y-VAR>, ...) +
  geom_<GEOM>() +
  other_layers
```
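
A concrete instance of the template above, as a sketch using the `mpg` dataset that ships with ggplot2. The choice of variables and labels is illustrative only.

```r
library(ggplot2)

# mpg ships with ggplot2: displ = engine size, hwy = highway miles per gallon
p <- ggplot(data = mpg) +
  aes(x = displ, y = hwy, color = class) +   # map variables to visual properties
  geom_point() +                             # draw one point per car
  labs(x = "Engine displacement (L)", y = "Highway MPG")

p  # printing the object draws the plot
```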
Each component is added with `+`, allowing flexible customization.

---

## 🎨 Aesthetic Mappings (`aes()`)
### 🔧 What is an Aesthetic?
An **aesthetic** is a visual property such as position, color, or size that gets mapped to a data variable.

| Aesthetic     | Purpose                                | Notes                                      |
|---------------|----------------------------------------|--------------------------------------------|
| `x`, `y`      | Axes values                            | Most basic aesthetics                      |
| `color`       | Outline color for lines/points         | Use for categories                         |
| `fill`        | Interior fill color (bars, boxes)      | Common in `geom_bar`, `geom_boxplot`      |
| `shape`       | Point symbol shape                     | Used in scatterplots                       |
| `size`        | Point or line size                     | Good for magnitude comparisons             |
| `alpha`       | Transparency                           | Ranges from 0 (invisible) to 1 (opaque)    |
| `group`       | Used to group lines                    | Important in time-series                  |

```r
ggplot(data, aes(x = year, y = pop, color = continent)) +
  geom_line()
```
* `fill` : in a box plot, `fill` adds an additional visual dimension by coloring the boxes according to another variable

---

## 📊 Geometric Objects (`geom_*()`)
### What is a Geom?
A **geom** specifies the type of plot (scatter, bar, line, etc.).

### 🧱 Common Geoms and Use Cases
| Geom Function       | Type                 | Example Usage                             |
|---------------------|----------------------|-------------------------------------------|
| `geom_point()`      | Scatterplot          | Compare two continuous variables          |
| `geom_bar()`        | Bar (count)          | Category frequency                        |
| `geom_col()`        | Bar (pre-aggregated) | Plot actual values instead of counts      |
| `geom_histogram()`  | Histogram            | Distribution of continuous variable       |
| `geom_boxplot()`    | Boxplot              | Summarize numeric data across categories  |
| `geom_line()`       | Line plot            | Time series or continuous trend           |
| `geom_smooth()`     | Trend line           | Add LOESS or linear model fit             |
| `geom_density()`    | Density estimate     | Smooth distribution shape                 |
| `geom_violin()`     | Violin plot          | Boxplot + density                         |
| `geom_text()`       | Text annotations     | To label points in a plot                 |
| `geom_tile()`       | Heatmap              | Relationship between two categorical variables |
| `geom_jitter()`     | Jittered points      | Overlay individual data points on a boxplot    |

```r
ggplot(mpg, aes(x = class)) +
  geom_bar(fill = "steelblue")
```
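
Two of the geoms in the table are often combined with boxplots. The sketch below, again on `mpg`, shows `fill` adding a second variable to a boxplot and `geom_jitter()` overlaying the raw points. The variable choices are assumptions made for illustration.

```r
library(ggplot2)

# fill = drv splits each class into side-by-side boxes, one per drive type
p_fill <- ggplot(mpg, aes(x = class, y = hwy, fill = drv)) +
  geom_boxplot() +
  labs(x = "Vehicle class", y = "Highway MPG", fill = "Drive type")

# Overlay individual observations on a boxplot
p_jitter <- ggplot(mpg, aes(x = drv, y = hwy)) +
  geom_boxplot(outlier.shape = NA) +                  # hide outliers; the points show them
  geom_jitter(width = 0.2, height = 0, alpha = 0.4)   # spread sideways only
```

Setting `height = 0` keeps the jitter horizontal, so the plotted y values stay equal to the data.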

---

## 📐 Coordinate Systems
### Why Modify Coordinates?
To improve readability or support different visualization formats.

| Function           | Description                                |
|--------------------|--------------------------------------------|
| `coord_flip()`      | Switch x and y axes                        |
| `coord_polar()`     | Create pie chart or radial visualizations  |
| `xlim()` / `ylim()` | Manually set axis limits                  |

```r
ggplot(mpg, aes(x = class)) +
  geom_bar() +
  coord_flip()
```

---

## 🔍 Faceting for Small Multiples
Faceting allows data to be split across multiple panels.

| Function               | Use Case                              |
|------------------------|----------------------------------------|
| `facet_wrap(~ var)`    | Create multiple plots by one variable |
| `facet_grid(row ~ col)`| Grid layout for two variables         |

```r
ggplot(gapminder, aes(x = year, y = life_expectancy, group = country)) +
  geom_line() +
  facet_wrap(~ continent)
```

---

## 🏷️ Labels and Titles
Make your plots self-explanatory.

```r
labs(
  title = "Life Expectancy Over Time",
  subtitle = "Across Continents",
  x = "Year",
  y = "Life Expectancy",
  caption = "Source: Gapminder"
)
```
Use `geom_text()` to add annotations at specific points.

---

## 🎨 Themes and Visual Appearance
Themes control the non-data aspects of your plot (text, background, margins).

| Theme Function     | Description                               |
|--------------------|-------------------------------------------|
| `theme_minimal()`  | Clean with minimal distractions           |
| `theme_classic()`  | Traditional axis lines                    |
| `theme_light()`    | Light grid background                     |
| `theme_void()`     | Blank plot for maps or infographics       |
| `theme()`          | Customize individual elements manually    |

```r
ggplot(...) + theme_minimal()
```

### Customizing with `theme()`
```r
ggplot(...) +
  theme(
    axis.text.x = element_text(angle = 45),
    legend.position = "bottom"
  )
```

---

## 🌈 Color Scales and Palettes
### Managing Color
```r
scale_fill_brewer(palette = "Set2")
scale_color_manual(values = c("red", "blue"))
```
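
As one possible way to keep colors accessible, ggplot2 also provides viridis scales, which are designed to remain distinguishable for readers with common forms of color blindness. A minimal sketch on `mpg`:

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  scale_color_viridis_d()   # discrete viridis palette for a categorical variable
```

For a numeric variable mapped to color, `scale_color_viridis_c()` is the continuous counterpart.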

| Function                | Purpose                                  |
|-------------------------|------------------------------------------|
| `scale_fill_brewer()`   | Fill colors using RColorBrewer palettes  |
| `scale_color_manual()`  | Manually define line/point colors        |
| `scale_fill_gradient()` | Continuous color gradient fill           |

> ✅ Tip: Avoid red-green palettes for accessibility.

---

## ✅ Best Practices in Plot Design
### DO:
- Use **bar charts** instead of **pie charts**
- Start **axes at 0** (unless justified)
- Add **meaningful titles and labels**
- Apply **color sparingly and meaningfully**
- Organize bars by **logical order**
- Use **facet wrapping** for category comparisons

### AVOID:
- Overuse of colors or decorations
- Omitting legends or labels
- Skewed axes or cherry-picked scales
- Using 3D or distorted perspectives

---

## 🔎 Example: Clean and Informative Plot
```r
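# Keep a single year so each country is counted once (assumes the data include 2007)
ggplot(subset(gapminder, year == 2007), aes(x = continent, fill = continent)) +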
  geom_bar() +
  labs(
    title = "Number of Countries by Continent",
    x = "Continent",
    y = "Number of Countries"
  ) +
  scale_fill_brewer(palette = "Pastel1") +
  theme_minimal()
```


# L22, L23, L24 

# 🧠 Lecture Concepts Review (Expanded) — Part 1: Simulation-Based Inference (Bootstrapping)

This document offers an in-depth, conceptually rich breakdown of **Lecture 22**: Simulation-Based Inference and Bootstrapping.

---

## 📘 Lecture 22: Simulation-Based Inference – Bootstrapping

### 🧭 Context & Motivation
In previous lectures, we've learned how to:
- Import data (CSV, JSON, web scraping)
- Clean and wrangle data (subsetting, merging, functions)
- Summarize data (mean, median, variance)
- Visualize relationships using `ggplot2`

This provided the groundwork for **basic data analysis**, but all of that was descriptive — based on **samples**. To go from sample to population, we need a new tool: **statistical inference**.

---

### 🧠 What is Statistical Inference?
**Statistical inference** refers to drawing conclusions about a population based on a sample. It lets us:
- Estimate unknown population values (like the population mean \( \mu \))
- Assess uncertainty in our estimates

It’s based on the idea that any single sample is just one of many that could have been drawn.

---

### 📊 Why Do We Sample?
Often, we cannot collect data on the full population due to constraints:
- Cost (e.g., testing all smokers with CT scans)
- Time (e.g., surveying all undergrad students)
- Accessibility (e.g., testing everyone in a country)

So we **randomly sample** and use the sample to infer characteristics of the population.

---

### 🎯 Sampling Variability
- Different random samples yield different estimates
- This variability is called **sampling variability**
- It's the foundation of **inference**, because it defines how uncertain our sample estimates are

Example: Two samples of size 50 can yield very different sample means due to random chance

---

### 📈 Sampling Distribution
- The **sampling distribution** of a statistic (like the mean) is the distribution of that statistic across many repeated samples
- We rarely get to see this distribution in practice

But we can **simulate** it using bootstrapping.

---

### 🔄 Bootstrapping: A Resampling Technique

**Definition:** Bootstrapping simulates the sampling distribution of a statistic by repeatedly resampling (with replacement) from your observed sample.

#### Why Bootstrap?
- Does not assume normality or other parametric forms
- Enables inference from *one* sample by mimicking sampling variability

#### Procedure:
1. Start with a sample (size \( n \))
2. Draw a new sample of size \( n \) **with replacement** from that sample
3. Compute a statistic (e.g., mean) from this bootstrap sample
4. Repeat steps 2 and 3 about 1000 to 2000 times to get a **bootstrap distribution**

---

### 🧮 Example in R
```r
# Initial sample
sample_data <- sample(population_data, size = 50, replace = FALSE)

# Bootstrap loop
boot_means <- numeric(1000)
for (i in 1:1000) {
  boot_sample <- sample(sample_data, size = 50, replace = TRUE)
  boot_means[i] <- mean(boot_sample)
}

# Plot bootstrap distribution
hist(boot_means, breaks = 30)
```

---

### 🔐 Confidence Intervals (CI)
A **95% Confidence Interval** is a range built by a procedure that would capture the true population value in about 95% of repeated samples.

#### Percentile Method (Simulation-based):
- Sort the bootstrap statistics
- Take the 2.5th and 97.5th percentiles
```r
quantile(boot_means, c(0.025, 0.975))
```
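
A self-contained sketch of the whole procedure, from sample to percentile interval. The "population" here is simulated with `rnorm()` purely so the code runs on its own; with real data, `sample_data` would be the observed sample.

```r
set.seed(1)  # for reproducibility

# Assumption: a simulated population stands in for real data
population_data <- rnorm(10000, mean = 65, sd = 10)
sample_data <- sample(population_data, size = 50, replace = FALSE)

# replicate() repeats the resample-and-summarize step 1000 times;
# sample() with replace = TRUE draws a bootstrap sample of the same size
boot_means <- replicate(1000, mean(sample(sample_data, replace = TRUE)))

# Percentile method: 2.5th and 97.5th percentiles of the bootstrap means
ci <- quantile(boot_means, c(0.025, 0.975))
ci
```

`replicate()` is used here instead of a `for` loop only for brevity; both produce the same kind of bootstrap distribution.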

#### Interpretation:
> “We are 95% confident that the true population mean is between X and Y.”

This works because the spread of the bootstrap distribution around the sample statistic approximates the spread of the sampling distribution around the true population parameter.

---

### 🧪 Statistical Inference via CI
CIs help us:
- Estimate population values
- Perform **hypothesis testing**

#### Decision Rules:
- If a reference value (like 60) is **outside** the CI: the true mean is **significantly different** from 60
- If a value is **inside** the CI: we **cannot** say it's significantly different

#### Example:
> If CI = [63, 72], then:
> - 60 lies below the interval → the mean is significantly greater than 60
> - 73 lies above the interval → the mean is significantly less than 73
> - 65 lies inside the interval → the mean is not significantly different from 65

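The decision rules can be written as a small function. This is a sketch only: `compare_to_ci()` is a hypothetical helper, not part of base R or any package.

```r
# Hypothetical helper: report where a reference value falls relative to a CI
compare_to_ci <- function(value, ci) {
  if (value < ci[1]) {
    "below the interval"      # significantly different: true mean is larger
  } else if (value > ci[2]) {
    "above the interval"      # significantly different: true mean is smaller
  } else {
    "inside the interval"     # cannot say it is significantly different
  }
}

ci <- c(63, 72)
results <- sapply(c(60, 73, 65), compare_to_ci, ci = ci)
results
```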
This leads into **hypothesis testing**, which connects to later lectures.

---

### 🪄 Summary of Lecture 22 Concepts
| Concept                   | Definition/Explanation                                                                 |
|--------------------------|----------------------------------------------------------------------------------------|
| Statistical inference    | Inferring population characteristics from a sample                                     |
| Sampling variability     | The fact that each random sample gives different results                               |
| Sampling distribution    | Distribution of a statistic across many random samples                                |
| Bootstrap sample         | A sample of size \( n \) drawn **with replacement** from the original sample           |
| Bootstrap distribution   | Distribution of statistics (means, medians, etc.) from many bootstrap samples         |
| Confidence Interval (CI) | Range of plausible values for the population parameter                                |
| Percentile CI            | CI based on 2.5% and 97.5% quantiles of the bootstrap distribution                     |





# 🧠 Lecture Concepts Review (Expanded) — Part 2: Simple Linear Regression (SLR)

This section offers a detailed, structured breakdown of **Lecture 23**, which covers Simple Linear Regression using numeric variables.

---

## 📘 Lecture 23: Simple Linear Regression with Numeric Predictors

### 🧭 Context & Motivation
So far, we’ve explored descriptive statistics and inference (e.g., bootstrapping) mostly around **summary metrics** like means. Now we turn to **modeling relationships** — specifically between two numeric variables — using **regression**.

Use case example: At Amazon, should you schedule more oil changes to reduce truck repair costs? To answer this, you can model the relationship between number of oil changes and cost of repairs.

---

### 📐 What is Simple Linear Regression?
**Simple Linear Regression (SLR)** models the relationship between a **predictor** (independent variable) \( x \) and a **response** (dependent variable) \( y \) using a straight line:

\[ y = mx + b + \varepsilon \]

Where:
- \( m \) is the **slope** (rate of change)
- \( b \) is the **intercept** (value of \( y \) when \( x = 0 \))
- \( \varepsilon \) is the **error term** (random noise)

---

### 🔎 Visualizing the Relationship
Before modeling, always **visualize** your data with a **scatterplot** to assess:
- Is the relationship linear?
- Is there a clear trend (positive/negative)?

Example: More oil changes → lower repair costs → negative linear trend

---

### 📏 Slope & Intercept Interpretation
Suppose our regression model is:

\[ \text{repair} = -72 \times \text{oil changes} + 652 \]

Then:
- **Slope**: For each additional oil change, the **predicted repair cost decreases by $72** on average
- **Intercept**: If no oil changes are done, the **average repair cost is $652**

Note: The intercept is often outside the range of meaningful values (e.g., 0 oil changes might not be realistic).

---

### ⚙️ Fitting a Linear Model in R
```r
fit <- lm(repair ~ oilchanges, data = dataset)
coef(fit)  # Returns intercept and slope
```

To extract them individually:
```r
intercept <- coef(fit)[1]
slope <- coef(fit)[2]
```

To visualize the regression line:
```r
plot(dataset$oilchanges, dataset$repair)
abline(fit, col = "red")
```
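
To try these calls without the course data, a dataset can be simulated. The sketch below assumes data that roughly follow repair = -72 × oil changes + 652 with random noise; the sample size and noise level are arbitrary choices.

```r
set.seed(42)

# Assumption: simulated data with a true slope of -72 and intercept of 652
dataset <- data.frame(oilchanges = rep(0:8, times = 5))
dataset$repair <- 652 - 72 * dataset$oilchanges + rnorm(nrow(dataset), sd = 20)

fit <- lm(repair ~ oilchanges, data = dataset)
intercept <- coef(fit)[["(Intercept)"]]
slope     <- coef(fit)[["oilchanges"]]

# Predicted repair cost for a truck with 3 oil changes, computed by hand
predicted_3 <- intercept + slope * 3
```

The estimates will be close to, but not exactly, -72 and 652, which is the sampling variability discussed in the next section.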

---

### 📊 Sampling Variability in Regression
Just like the **sample mean**, slope and intercept estimates **vary from sample to sample**:
- This variability makes inference necessary
- Bootstrapping can help us **estimate confidence intervals** for the slope/intercept

---

### 🔁 Bootstrap for Regression Coefficients
We can apply bootstrapping to linear regression parameters:

#### Procedure:
1. Sample \( n \) rows from the dataset **with replacement**
2. Fit a linear model to the bootstrap sample
3. Extract and store the **slope** and **intercept**
4. Repeat 1000+ times
5. Analyze the distribution of estimates

#### R Example:
```r
boot_slopes <- numeric(1000)
boot_intercepts <- numeric(1000)
n <- nrow(dataset)

for (i in 1:1000) {
  boot_indices <- sample(1:n, size = n, replace = TRUE)
  boot_sample <- dataset[boot_indices, ]
  model <- lm(repair ~ oilchanges, data = boot_sample)
  coefs <- coef(model)
  boot_intercepts[i] <- coefs[1]
  boot_slopes[i] <- coefs[2]
}
```

---

### 🔐 Bootstrap Confidence Intervals for Coefficients
Use the percentile method:
```r
quantile(boot_slopes, c(0.025, 0.975))  # CI for slope
quantile(boot_intercepts, c(0.025, 0.975))  # CI for intercept
```
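
A self-contained sketch that puts the bootstrap loop and the percentile interval together, then checks whether 0 falls inside the slope interval. The data are simulated (true slope of -72) so the code runs on its own, and `boot_slope()` is a hypothetical helper.

```r
set.seed(7)

# Assumption: simulated data with a true slope of -72
dataset <- data.frame(oilchanges = rep(0:8, times = 5))
dataset$repair <- 652 - 72 * dataset$oilchanges + rnorm(nrow(dataset), sd = 20)
n <- nrow(dataset)

# One bootstrap replicate: resample rows, refit the model, return the slope
boot_slope <- function() {
  rows <- sample(seq_len(n), size = n, replace = TRUE)
  coef(lm(repair ~ oilchanges, data = dataset[rows, ]))[["oilchanges"]]
}
boot_slopes <- replicate(1000, boot_slope())

# 95% percentile interval for the slope
slope_ci <- quantile(boot_slopes, c(0.025, 0.975))

# TRUE when the whole interval lies on one side of zero
excludes_zero <- slope_ci[[1]] > 0 || slope_ci[[2]] < 0
```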

### 🧠 Interpretation:
If the 95% CI for slope **does not contain 0**, the slope is **significantly different from zero** → there is a significant relationship between variables.

If the CI **contains 0** → **no significant relationship**.

---

### 🧪 Summary of Lecture 23 Concepts
| Concept                         | Definition/Explanation                                                                 |
|--------------------------------|----------------------------------------------------------------------------------------|
| Simple Linear Regression (SLR) | A model relating two numeric variables with a straight line                           |
| Slope (\( m \))                | Average change in \( y \) for a 1-unit change in \( x \)                               |
| Intercept (\( b \))            | Predicted \( y \) when \( x = 0 \)                                                   |
| Error term (\( \varepsilon \)) | Captures deviation from the line due to random noise                                  |
| lm()                           | R function for fitting linear models                                                  |
| coef()                         | Extracts slope and intercept from fitted model                                        |
| abline()                       | Adds regression line to scatterplot                                                   |
| Bootstrap regression           | Repeatedly resampling rows to estimate slope/intercept variability                    |
| Confidence interval (CI)       | Range of likely values for true slope/intercept based on resampled estimates          |




# 🧠 Lecture Concepts Review (Expanded) — Part 3: k-Nearest Neighbors (k-NN)

This section offers a detailed and structured breakdown of **Lecture 24**, which introduces the supervised learning algorithm **k-Nearest Neighbors (k-NN)**.

---

## 📘 Lecture 24: Prediction with k-Nearest Neighbors (k-NN)

### 🧭 Context & Motivation
We’ve now seen descriptive statistics, bootstrapping, and modeling numeric relationships (SLR). But what if our **goal is to make predictions** — especially for new, unseen data?

That’s where **machine learning** comes in.

---

### 🧠 What is Prediction in ML?
Prediction is about using known **features** (inputs) to forecast or classify unknown **outcomes** (outputs):

| Task         | Type of Output | ML Type       |
|--------------|----------------|----------------|
| Predict tumor type (benign/malignant) | Categorical     | Classification |
| Predict surf height                  | Numeric         | Regression      |

---

### 🤖 Supervised Learning
**Supervised Learning**: Learn a model from data where both inputs (X) and outputs (Y) are known.

- Inputs = **Features** (a.k.a. predictors, covariates)
- Output = **Target** (a.k.a. label, response)

---

### 🔍 k-Nearest Neighbors (k-NN) Algorithm
**k-NN** is a simple supervised learning algorithm used for both **classification** and **regression**.

#### 🔧 How it Works:
1. Choose a value of **k** (e.g. 3)
2. For a new observation, compute its **distance** to all training observations
3. Identify the **k closest (nearest) neighbors**
4. Predict the outcome:
   - **Classification**: Use majority vote
   - **Regression**: Use average of neighbors' values

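The steps above can be sketched from scratch in base R. `knn_classify()` is a hypothetical helper written for illustration, using Euclidean distance and a simple majority vote; the toy training points are invented.

```r
# Hypothetical helper: classify one new point with k-NN
knn_classify <- function(train_x, train_y, new_x, k = 3) {
  # Step 2: Euclidean distance from the new point to every training row
  diffs <- sweep(as.matrix(train_x), 2, new_x)
  distances <- sqrt(rowSums(diffs^2))
  # Step 3: labels of the k closest training points
  nearest <- train_y[order(distances)[seq_len(k)]]
  # Step 4: majority vote (ties go to the label that sorts first)
  votes <- table(nearest)
  names(votes)[which.max(votes)]
}

# Toy training data: three points near (0, 0) and three near (5, 5)
train_x <- data.frame(x1 = c(0, 0, 1, 5, 5, 6), x2 = c(0, 1, 0, 5, 6, 5))
train_y <- c("A", "A", "A", "B", "B", "B")

pred_near_origin <- knn_classify(train_x, train_y, new_x = c(0.5, 0.5), k = 3)  # "A"
pred_far_corner  <- knn_classify(train_x, train_y, new_x = c(5.5, 5.5), k = 3)  # "B"
```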
---

### 🌸 Visual Example: Classifying Flower Species
- Dataset: Iris (sepal width and sepal length)
- Task: Predict the species of a new flower

Steps:
- For a new flower (the “star” point), calculate distances to all others
- Choose the k-nearest labeled flowers
- Classify the new flower as the **most common label among neighbors**

---

### 🧪 Selecting k: Accuracy & Testing
Predictions can change based on the value of **k**.

#### To choose k:
- Split data into **training** and **testing** sets
- Try different values of **k**
- Pick the one with the **highest accuracy on the test set**

This is the concept of a **training/testing split**:
- Training data is used to train the model
- Testing data is used to evaluate prediction performance

---

### 🧪 Accuracy & Confusion Matrix
To assess how well k-NN performs:

- **Accuracy** = Proportion of correct classifications
- **Confusion Matrix** = Table comparing predictions to actual labels
  - Diagonal = correct predictions
  - Off-diagonal = misclassifications

---

### 📊 Implementing k-NN in R
Load packages:
```r
library(class)      # for knn()
library(ggplot2)    # for visualization
```

Split data:
```r
n <- nrow(iris)
train_index <- sample(1:n, size = 0.7 * n)
train_x <- iris[train_index, c("Sepal.Length", "Sepal.Width")]
train_y <- iris[train_index, "Species"]
test_x <- iris[-train_index, c("Sepal.Length", "Sepal.Width")]
test_y <- iris[-train_index, "Species"]
```

Run k-NN for one value of k (repeat with other values to compare):
```r
predictions <- knn(train = train_x, test = test_x, cl = train_y, k = 5)
```

Compute accuracy:
```r
mean(predictions == test_y)
```

Create confusion matrix:
```r
table(Predicted = predictions, Actual = test_y)
```
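
To compare several values of k, the fit-and-score steps can be wrapped in a loop. A minimal sketch on `iris`; the candidate k values, the seed, and the 70/30 split are arbitrary choices for illustration.

```r
library(class)   # for knn()

set.seed(123)
n <- nrow(iris)
train_index <- sample(seq_len(n), size = round(0.7 * n))
features <- c("Sepal.Length", "Sepal.Width")

# Test-set accuracy for each candidate k
k_values <- c(1, 3, 5, 7, 9, 11)
accuracy <- sapply(k_values, function(k) {
  pred <- knn(train = iris[train_index, features],
              test  = iris[-train_index, features],
              cl    = iris$Species[train_index],
              k     = k)
  mean(pred == iris$Species[-train_index])
})

data.frame(k = k_values, accuracy = accuracy)
best_k <- k_values[which.max(accuracy)]   # first k with the highest accuracy
```

Because the split is random, the best k can change with a different seed, so the result is a guide rather than a fixed answer.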

---

### 🌀 Visualization & Model Insights
- Scatterplots can help assess how well separation occurs between classes
- Using additional features (e.g., petal length) can **improve accuracy**
- Larger values of k **smooth out** predictions but may **miss boundaries**

---

### 🧠 Summary of Lecture 24 Concepts
| Concept              | Definition/Explanation                                                      |
|----------------------|-------------------------------------------------------------------------------|
| Prediction           | Forecasting unknown outcomes using known inputs                              |
| Classification       | Predicting **categorical** outcomes                                           |
| Regression           | Predicting **numeric** outcomes                                               |
| Supervised Learning  | Model learns from labeled data (features + output)                            |
| Feature              | Input variable (a.k.a. predictor, covariate)                                  |
| Label/Target         | Output variable (a.k.a. response, class)                                      |
| k-NN Algorithm       | Predict outcome using the majority or average of the k closest neighbors      |
| Distance Metric      | Usually Euclidean distance between feature vectors                            |
| Training/Test Split  | Training = learn model, Testing = evaluate model performance                  |
| Accuracy             | % of correct predictions on test data                                         |
| Confusion Matrix     | Table summarizing correct and incorrect classifications                       |


