#### Module 4: Statistics with R  
- **Random Forest**  
- **Decision Tree**  
- **Normal and Binomial Distributions**  
- **Time Series Analysis**  
- **Linear and Multiple Regression**  
- **Logistic Regression**  
- **Survival Analysis**  


---

# **👉 1. Random Forest**  

---

## **Introduction**  
Random Forest is an **ensemble learning algorithm** that builds multiple decision trees and combines their outputs for improved accuracy and robustness. It is widely used for **classification** (categorical outputs) and **regression** (continuous outputs).

Random Forest:
- reduces **overfitting**
- improves **accuracy**
- works well on large datasets.  

---

## **How Random Forest Works (Step-by-Step)**  

1. **Bootstrapping (Random Sampling with Replacement):**  
   - Multiple subsets of data are created by randomly selecting rows from the dataset (some data points may be repeated).  

2. **Random Feature Selection at Each Split:**  
   - Unlike a single decision tree, which evaluates every feature at each split, Random Forest evaluates only a **random subset of features** at each split. This makes the individual trees less correlated with one another.  

3. **Building Multiple Decision Trees:**  
   - Each subset is used to train a separate decision tree independently.  

4. **Aggregation of Outputs:**  
   - **For Classification**: The majority class predicted by the trees is chosen (voting mechanism).  
   - **For Regression**: The final output is the **average of all predictions** from the trees.  
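
Steps 1 and 2 above can be sketched in base R. This is only an illustration of the sampling ideas, not how any particular package implements them; the built-in `iris` data, the seed, and the choice of two candidate features are assumptions made for the example.

```r
set.seed(42)  # assumed seed, used only to make the sketch reproducible

# Step 1: bootstrap sample, i.e. draw row indices with replacement
n_rows <- nrow(iris)
boot_rows <- sample(n_rows, size = n_rows, replace = TRUE)
boot_data <- iris[boot_rows, ]

# Because rows can repeat, some original rows are left out of this sample
n_distinct_rows <- length(unique(boot_rows))

# Step 2: at one split, consider only a random subset of the predictors
predictors <- setdiff(names(iris), "Species")
split_candidates <- sample(predictors, size = 2)

n_distinct_rows
split_candidates
```

Each tree in the forest would get its own bootstrap sample, and a fresh set of candidate features would be drawn at every split.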

---

## **Mathematical Explanation**  

### **1. Classification in Random Forest**  
Each tree gives an output $C_i$, and the final prediction is determined by:  

$$
\hat{C} = \text{Mode} (C_1, C_2, ..., C_n)
$$

where Mode represents the most frequently occurring class.

### **2. Regression in Random Forest**  
For regression tasks, the final prediction is:  

$$
\hat{y} = \frac{1}{n} \sum_{i=1}^{n} y_i
$$

where $y_i$ is the prediction from the $i^{th}$ tree and $n$ is the number of trees.
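
The two aggregation rules can be checked with a few lines of base R. The tree outputs below are made-up values for illustration, not results from a fitted model.

```r
# Hypothetical class votes from five trees (classification)
tree_votes <- c("spam", "ham", "spam", "spam", "ham")
vote_counts <- table(tree_votes)
majority_class <- names(vote_counts)[which.max(vote_counts)]

# Hypothetical numeric predictions from four trees (regression)
tree_preds <- c(10.2, 9.8, 11.0, 10.5)
avg_pred <- mean(tree_preds)

majority_class  # "spam" (3 of 5 votes)
avg_pred        # 10.375
```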

---

## **Advantages of Random Forest**  
✅ **Can Work with Missing Values:** Some implementations tolerate or impute missing data. R's `randomForest` package does not accept `NA` values directly, but it provides `na.roughfix()` and `rfImpute()` for imputation.  
✅ **Reduces Overfitting:** Unlike a single decision tree, it reduces overfitting by averaging multiple trees.  
✅ **Works Well with Large Datasets:** Efficient even for large datasets with many features.  
✅ **Feature Importance:** Helps in selecting the most important features in a dataset.  

---

## **Disadvantages of Random Forest**  
❌ **Computationally Expensive:** Training a large number of trees requires time and memory.  
❌ **Less Interpretability:** Difficult to interpret compared to a single decision tree.  

---

## **Hyperparameters in Random Forest**  

| Hyperparameter  | Description |
|----------------|------------|
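| **ntree**      | Number of decision trees. More trees give more stable predictions, but the gains level off while computation keeps growing. |
| **mtry**       | Number of features considered at each split (smaller values lead to more diverse trees). |
| **nodesize**   | Minimum number of samples required at a leaf node (larger values give smaller trees). |
| **maxnodes**   | Maximum number of terminal nodes per tree. R's `randomForest` has no `max_depth` argument; tree size is limited through `maxnodes` and `nodesize`. |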
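
A minimal sketch of how these settings are passed in R, assuming the `randomForest` package is installed (the code is skipped if it is not). The `iris` data and the specific values are illustrative choices, not recommendations. Note that `randomForest` has no depth argument; `maxnodes` caps the number of terminal nodes instead.

```r
rf_fit <- NULL
if (requireNamespace("randomForest", quietly = TRUE)) {
  set.seed(1)  # assumed seed for reproducibility
  rf_fit <- randomForest::randomForest(
    Species ~ ., data = iris,
    ntree = 200,       # number of trees
    mtry = 2,          # features tried at each split
    nodesize = 5,      # minimum size of terminal nodes
    maxnodes = 8,      # cap on terminal nodes per tree
    importance = TRUE  # store feature-importance measures
  )
  print(rf_fit)                            # summary of the fitted forest
  print(randomForest::importance(rf_fit))  # per-feature importance
}
```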

---

## **Real-Life Applications of Random Forest**  

📌 **Medical Diagnosis:** Used to predict diseases like cancer based on patient data.  
📌 **Financial Fraud Detection:** Helps in detecting fraudulent transactions.  
📌 **E-commerce Recommendations:** Suggests products based on user activity.  
📌 **Stock Market Prediction:** Analyzes past stock trends to predict future prices.  

---

## **Comparison: Decision Tree vs. Random Forest**  

| Feature           | Decision Tree | Random Forest |
|------------------|--------------|--------------|
| **Overfitting**  | High         | Low         |
| **Accuracy**     | Moderate     | High        |
| **Speed**        | Fast         | Slower      |
| **Interpretability** | Easy     | Hard       |

---

## **Conclusion**  
- **Random Forest** is an improved version of Decision Trees, reducing overfitting by combining multiple trees.  
- It is widely used in **classification, regression, feature selection, and anomaly detection**.  
- While computationally expensive, it provides **high accuracy and stability**, making it one of the most powerful machine learning algorithms. 

---

# **👉 2. Decision Tree**  

---

## **Introduction**  
A **Decision Tree** is a supervised learning algorithm used for **classification** and **regression** tasks. It mimics human decision-making by breaking down a dataset into smaller subsets using a tree-like model of decisions.  

It is built from **if-else conditions** and has four kinds of components:
- **Root Node**: The starting point.
- **Internal Nodes**: Decision points.
- **Branches**: Outcomes of decisions.
- **Leaf Nodes**: Final prediction.  



```text
                  [Age < 40?]
                  /         \
               Yes           No
              /               \
     [Income = High?]      Buys = Yes
         /     \
      No        Yes
     /            \
Buys = No      Buys = Yes
```

- **If Age < 40**  
  - and **Income is High → Buys = Yes**  
  - else → **Buys = No**

- **If Age ≥ 40 → Buys = Yes**
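
The same rules can be written as nested if-else statements in R. The function name `predict_buys` and its arguments are hypothetical, chosen only to mirror the diagram.

```r
# Hand-coded version of the example tree (not a fitted model)
predict_buys <- function(age, income) {
  if (age < 40) {
    # Internal node: check income for younger customers
    if (income == "High") "Yes" else "No"
  } else {
    # Leaf node: customers aged 40 or more are predicted to buy
    "Yes"
  }
}

predict_buys(30, "High")  # "Yes"
predict_buys(30, "Low")   # "No"
predict_buys(55, "Low")   # "Yes"
```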

---

## **How Decision Tree Works (Step-by-Step)**  

1. **Choose the Best Splitting Feature**  
   - The dataset is split based on the most significant feature using measures like **Gini Index** or **Entropy (Information Gain)**.  

2. **Create Decision Nodes and Branches**  
   - Based on the split, different branches are formed, leading to further decisions.  

3. **Continue Splitting Until a Stopping Condition is Met**  
   - The splitting process stops when:  
     - All data points in a node belong to the same class.  
     - A predefined depth is reached.  

4. **Make Predictions**  
   - For **classification**, the majority class in a leaf node is the output.  
   - For **regression**, the average of all values in the leaf node is the output.  

---

## **Mathematical Explanation**  

### **1. Splitting Criteria in Decision Trees**  
The best split is selected using the following metrics:  

#### **Entropy & Information Gain (Used in ID3 Algorithm)**  
- **Entropy (H)** measures uncertainty in data:  
  $$
  H(S) = - \sum p_i \log_2 p_i
  $$  
  where $p_i$ is the probability of each class.  

- **Information Gain (IG)** is the reduction in entropy after splitting:  
  $$
  IG = H(Parent) - \sum \frac{|Child|}{|Parent|} H(Child)
  $$

#### **Gini Index (Used in CART Algorithm)**  
- Measures impurity in data:  
  $$
  Gini = 1 - \sum p_i^2
  $$  
  where $p_i$ is the proportion of class $i$.  
- Lower Gini value means a purer split.  
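
As a rough illustration, the entropy, information gain, and Gini formulas can be computed directly in base R. The helper names and the toy labels are assumptions for this sketch.

```r
# Entropy of a vector of class labels
entropy <- function(labels) {
  p <- as.numeric(table(labels)) / length(labels)
  -sum(p * log2(p))
}

# Gini index of a vector of class labels
gini <- function(labels) {
  p <- as.numeric(table(labels)) / length(labels)
  1 - sum(p^2)
}

# Information gain: parent entropy minus the weighted child entropies
info_gain <- function(parent, children) {
  weights <- sapply(children, length) / length(parent)
  entropy(parent) - sum(weights * sapply(children, entropy))
}

parent <- c(rep("Yes", 5), rep("No", 5))
left   <- c(rep("Yes", 4), "No")
right  <- c("Yes", rep("No", 4))

entropy(parent)                       # 1 (two equally likely classes)
gini(parent)                          # 0.5
info_gain(parent, list(left, right))  # about 0.278
```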

#### **Chi-Square Test (Used in CHAID Algorithm)**  
- Tests whether a categorical predictor and the target variable are statistically independent; CHAID splits on the predictor with the most significant association.  
- The **Chi-Square ($\chi^2$)** statistic is used in statistics to test independence or goodness of fit for categorical variables.

💡 **Chi-Square Test Formula:**

$$
\chi^2 = \sum \frac{(O - E)^2}{E}
$$

Where:  
- $\chi^2$ = Chi-square statistic  
- $O$ = Observed frequency  
- $E$ = Expected frequency  
- $\sum$ = Summation over all categories

---

📌 **Use Cases:**
- **Goodness-of-Fit Test** – To check whether observed frequencies match an expected distribution.
- **Test of Independence** – To check if two categorical variables are related.
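
The chi-square formula can be verified against R's built-in `chisq.test()`. The 2×2 table of counts is invented for illustration, and `correct = FALSE` turns off the continuity correction so the built-in result matches the plain formula.

```r
# Hypothetical observed counts for two categorical variables
observed <- matrix(c(30, 10,
                     20, 40), nrow = 2, byrow = TRUE)

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Chi-square statistic from the formula
chi_sq_manual <- sum((observed - expected)^2 / expected)

# Same statistic from the built-in test
chi_sq_builtin <- unname(chisq.test(observed, correct = FALSE)$statistic)

chi_sq_manual
chi_sq_builtin
```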

---

## **Types of Decision Trees**  

| Type                  | Description                                      |
|-----------------------|--------------------------------------------------|
| **Classification Tree** | Used when the target variable is categorical.    |
| **Regression Tree**     | Used when the target variable is continuous.     |

---

## **Advantages of Decision Trees**  
✅ **Easy to understand and visualize.**  
✅ **Handles both numerical and categorical data.**  
✅ **No need for feature scaling (e.g., normalization).**  
✅ **Requires little data preprocessing.**  

---

## **Disadvantages of Decision Trees**  
❌ **Prone to overfitting:** Deep trees can fit noise, and small changes in the data can change the tree structure.  
❌ **Greedy algorithm:** Each split is chosen locally, so the final tree may not be globally optimal.  
❌ **Biased toward dominant classes:** If one class has a much higher proportion, it may dominate the predictions.  

---

## **Hyperparameters in Decision Trees**  

| Hyperparameter (R `rpart`)      | Description                                      |
|---------------------------------|--------------------------------------------------|
| **maxdepth**                    | Maximum depth of the tree (controls overfitting). Called `max_depth` in scikit-learn. |
| **minsplit**                    | Minimum samples required to split a node (`min_samples_split` in scikit-learn). |
| **minbucket**                   | Minimum samples required in a leaf node (`min_samples_leaf` in scikit-learn). |
| **parms = list(split = ...)**   | Metric for choosing the best split: `"gini"` or `"information"` (`criterion` in scikit-learn). |
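
A minimal sketch of setting these hyperparameters in R with the `rpart` package, which normally ships with R as a recommended package. In `rpart`, `maxdepth`, `minsplit`, `minbucket`, and `parms = list(split = ...)` play the roles of the four hyperparameters in the table. The `iris` data and the values are illustrative assumptions.

```r
library(rpart)

tree_fit <- rpart(
  Species ~ ., data = iris, method = "class",
  parms = list(split = "gini"),  # use "information" for entropy
  control = rpart.control(
    maxdepth = 3,    # maximum depth of the tree
    minsplit = 20,   # minimum observations needed to attempt a split
    minbucket = 7,   # minimum observations in a leaf
    cp = 0.01        # complexity parameter used for pruning
  )
)

# Accuracy on the training data (optimistic; use held-out data in practice)
train_pred <- predict(tree_fit, newdata = iris, type = "class")
train_accuracy <- mean(train_pred == iris$Species)
train_accuracy
```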

---

## **Real-Life Applications of Decision Trees**  

📌 **Medical Diagnosis:** Predicting diseases based on symptoms.  
📌 **Loan Approval:** Assessing whether a person qualifies for a loan.  
📌 **Fraud Detection:** Identifying fraudulent transactions in banking.  
📌 **Customer Segmentation:** Grouping customers based on purchase behavior.  

---

## **Comparison: Decision Tree vs. Random Forest**  

| Feature              | Decision Tree | Random Forest |
|----------------------|----------------|----------------|
| **Overfitting**      | High           | Low            |
| **Accuracy**         | Moderate       | High           |
| **Speed**            | Fast           | Slower         |
| **Interpretability** | Easy           | Hard           |

---

## **Conclusion**  
- **Decision Trees** are intuitive, easy to interpret, and useful for both classification and regression tasks.  
- They may overfit data, but techniques like **pruning** or using **Random Forest** help improve performance.  
- Commonly used in **finance, healthcare, and customer segmentation**.  

---

# **👉 3. Normal and Binomial Distributions**  

---

## **1. Normal Distribution**  

### **Introduction**  
The **Normal Distribution**, also known as the **Gaussian distribution**, is a continuous probability distribution that is symmetric and bell-shaped. It is widely used in statistics, finance, and machine learning to approximate quantities such as heights, IQ scores, and measurement errors.

---

### **Characteristics of Normal Distribution**  
✅ **Bell-shaped and symmetric** around the mean.  
✅ **Mean (μ), Median, and Mode are equal.**  
✅ **Follows the Empirical Rule (68-95-99.7 Rule).**  
✅ **Defined by two parameters:**  
   - **Mean (μ)** – The central value.  
   - **Standard Deviation (σ)** – The spread of the data.  

---

### **Probability Density Function (PDF)**  
The probability density function (PDF) of a normal distribution is:  

$$
f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}
$$

where:  
- $ \mu $ = mean  
- $ \sigma $ = standard deviation  
- $ e $ = Euler's number (~2.718)  
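
As a quick check, the PDF formula can be coded by hand and compared with R's built-in `dnorm()`. The function name `normal_pdf` and the example values (mean 100, standard deviation 15) are assumptions for this sketch.

```r
# Normal PDF written out from the formula
normal_pdf <- function(x, mu = 0, sigma = 1) {
  1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu)^2 / (2 * sigma^2))
}

x <- c(85, 100, 115)
manual  <- normal_pdf(x, mu = 100, sigma = 15)
builtin <- dnorm(x, mean = 100, sd = 15)

all.equal(manual, builtin)  # TRUE
```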

---

### **Empirical Rule (68-95-99.7 Rule)**  
- About **68%** of data lies within **1 standard deviation** ($\mu \pm 1\sigma$).  
- About **95%** of data lies within **2 standard deviations** ($\mu \pm 2\sigma$).  
- About **99.7%** of data lies within **3 standard deviations** ($\mu \pm 3\sigma$).  
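
These percentages can be reproduced with the cumulative distribution function `pnorm()`, here for the standard normal distribution (mean 0, standard deviation 1):

```r
k <- 1:3
# Probability of falling within k standard deviations of the mean
coverage <- pnorm(k) - pnorm(-k)
round(coverage, 4)  # 0.6827 0.9545 0.9973
```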

---

### **Real-Life Applications of Normal Distribution**  
📌 **Height of People:** Adult heights within a population are approximately normally distributed.  
📌 **IQ Scores:** IQ scores are scaled to a normal distribution with a mean of 100 and $ \sigma \approx 15 $.  
📌 **Stock Market Returns:** Daily stock returns are often modelled as approximately normal, although real returns have heavier tails.  
📌 **Measurement Errors:** Errors in experiments tend to be normally distributed.  

---

## **2. Binomial Distribution**  

### **Introduction**  
The **Binomial Distribution** is a discrete probability distribution that represents the number of successes in a fixed number of independent trials, where each trial has two possible outcomes (**success/failure**).  

It is commonly used in scenarios like coin flips, pass/fail exams, and customer purchases.  

---

### **Characteristics of Binomial Distribution**  
✅ **Only two outcomes per trial:** Success (1) or Failure (0).  
✅ **Fixed number of trials (n).**  
✅ **Constant probability of success (p) for each trial.**  
✅ **Trials are independent.**  

---

### **Probability Mass Function (PMF)**  
The probability of getting exactly **k successes** in **n trials** is given by:  

$$
P(X = k) = \binom{n}{k} p^k (1 - p)^{n-k}
$$

where:  
- $ n $ = number of trials  
- $ k $ = number of successes  
- $ p $ = probability of success  
- $ \binom{n}{k} = \frac{n!}{k!(n-k)!} $ (Binomial coefficient)  
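
The PMF can be evaluated by hand with `choose()` and compared with R's `dbinom()`. The values n = 10, k = 5, p = 0.5 (five heads in ten fair coin flips) are chosen for illustration.

```r
n <- 10
k <- 5
p <- 0.5

# PMF from the formula
manual <- choose(n, k) * p^k * (1 - p)^(n - k)

# Same probability from the built-in function
builtin <- dbinom(k, size = n, prob = p)

manual   # 0.2460938 (exactly 252 / 1024)
builtin
```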

---

### **Real-Life Applications of Binomial Distribution**  
📌 **Coin Tossing:** Probability of getting heads 5 times in 10 flips.  
📌 **Quality Control:** Probability of 3 defective products in a batch of 10.  
📌 **Customer Purchases:** Probability that 4 out of 10 customers make a purchase.  
📌 **Medical Trials:** Probability that a drug is effective for 6 out of 10 patients.  

---

## **Comparison: Normal vs. Binomial Distribution**  

| Feature           | Normal Distribution | Binomial Distribution |
|------------------|--------------------|----------------------|
| **Type**         | Continuous         | Discrete |
| **Shape**        | Bell-shaped        | Varies (skewed when p is far from 0.5 and n is small; roughly symmetric when p is near 0.5 or n is large) |
| **Parameters**   | Mean (μ), Standard Deviation (σ) | Number of trials (n), Probability of success (p) |
| **Examples**     | Heights, IQ scores, Measurement errors | Coin tosses, Pass/fail outcomes, Customer purchases |

---


| **Aspect**                     | **Normal Distribution**                                   | **Binomial Distribution**                                  |
|-------------------------------|-----------------------------------------------------------|------------------------------------------------------------|
| **Type**                      | Continuous distribution                                   | Discrete distribution                                      |
| **Shape**                     | Bell-shaped, symmetric                                    | Skewed or symmetric depending on probability               |
| **Parameters**                | Mean (μ), Standard Deviation (σ)                          | Number of trials (n), Probability of success (p)           |
| **Domain**                    | All real numbers (−∞ to +∞)                               | Only integers from 0 to n                                  |
| **Example Use Case**          | Heights of people, marks in exam, IQ scores               | Tossing a coin, number of defective products, survey success/failure |
| **R Function to Generate Data** | `rnorm(n, mean, sd)`                                    | `rbinom(n, size, prob)`                                    |
| **R Function for Density**    | `dnorm(x, mean, sd)`                                      | `dbinom(x, size, prob)`                                    |
| **R Plot Example**            | `plot(x, dnorm(x), type = "l")` – continuous curve        | `barplot(dbinom(0:size, size, prob))` – bars for each value |
| **Central Limit Theorem**     | Limiting distribution that the binomial approaches when n is large | Approximately normal when n is large and p is not near 0 or 1 |
| **Probability Between Values**| Can find probability over an interval (e.g., P(5 < X < 10)) | Probabilities are exact for integers (e.g., P(X = 3))    |
| **Application in R**          | `dnorm()` gives the density, `pnorm()` the cumulative probability | `dbinom()` gives the probability mass, `pbinom()` the cumulative probability |

---

## **Conclusion**  
- **Normal Distribution** is used for **continuous** data with a bell curve shape, common in real-world applications like height and test scores.  
- **Binomial Distribution** is used for **discrete** data with a fixed number of trials, common in probability-based scenarios like coin flips and medical trials.  
- As **n increases** (with p not too close to 0 or 1), the Binomial Distribution **approaches the Normal Distribution** (Central Limit Theorem).  
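
The normal approximation to the binomial can be seen numerically. This sketch assumes n = 100 and p = 0.5, and uses a continuity correction (55.5 instead of 55) when switching from the discrete to the continuous distribution.

```r
n <- 100
p <- 0.5

# Exact binomial probability P(X <= 55)
exact <- pbinom(55, size = n, prob = p)

# Normal approximation with mean n*p and sd sqrt(n*p*(1-p))
normal_approx <- pnorm(55.5, mean = n * p, sd = sqrt(n * p * (1 - p)))

c(exact = exact, normal_approx = normal_approx)
```

The two values agree closely here; the agreement gets worse for small n or for p near 0 or 1.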

---

# **👉 4. Time Series Analysis**  

---

## **1. Introduction to Time Series Analysis**  
A **Time Series** is a sequence of data points collected or recorded at specific time intervals. Time series analysis involves analyzing trends, patterns, and seasonal effects in data over time to make forecasts and informed decisions.

### **Examples of Time Series Data:**  
📌 **Stock Market Prices** – Daily fluctuations in stock prices.  
📌 **Weather Data** – Temperature readings recorded hourly/daily.  
📌 **Sales Data** – Monthly or yearly revenue of a company.  
📌 **Website Traffic** – Daily number of visitors to a website.  

---

## **2. Components of Time Series Data**  
A time series typically consists of the following components:

### **1. Trend (Tt)**
- The general **direction** in which data is moving over time (increasing, decreasing, or stable).
- **Example:** Rising housing prices over the years.

### **2. Seasonality (St)**
- Patterns that **repeat at regular intervals** (daily, monthly, yearly).
- **Example:** Increased ice cream sales in summer.

### **3. Cyclic Patterns (Ct)**
- **Long-term fluctuations** that repeat but do not have a fixed period.
- **Example:** Economic cycles of boom and recession.

### **4. Random/Irregular Component (Et)**
- **Unpredictable variations** in data due to external factors.
- **Example:** Sudden drop in sales due to a pandemic.

### **Mathematical Representation of a Time Series:**  
In its additive form, a time series is represented as the sum of these components:

$$
Y_t = T_t + S_t + C_t + E_t
$$

where:  
- $Y_t$ = Observed Value  
- $T_t$ = Trend  
- $S_t$ = Seasonal Component  
- $C_t$ = Cyclical Component  
- $E_t$ = Random Error  

---

## **3. Types of Time Series Models**  
There are **two main ways** to combine these components into a model:

### **1. Additive Model:**  
$$
Y_t = T_t + S_t + C_t + E_t
$$
- Used when seasonal variations are **constant** over time.

### **2. Multiplicative Model:**  
$$
Y_t = T_t \times S_t \times C_t \times E_t
$$
- Used when seasonal variations **grow or shrink proportionally** with the level of the series.

---

## **4. Time Series Analysis Techniques**  

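### **1. Moving Average (MA)**
- Used to smooth fluctuations in data and identify trends. (This smoothing method is not the same as the MA term in ARIMA.)
- **Simple Moving Average (SMA)** over a window of the last $n$ observations:
  $$
  SMA_t = \frac{X_t + X_{t-1} + ... + X_{t-n+1}}{n}
  $$
- **Example:** Finding a 7-day moving average of stock prices.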
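
A simple moving average can be computed in base R with `stats::filter()`. The prices and the 3-point window below are made-up values for illustration; `sides = 1` means each average uses only the current and earlier observations.

```r
prices <- c(10, 12, 14, 16, 18, 20)
window <- 3

# Trailing moving average: equal weights of 1/window over the last 3 points
sma <- stats::filter(prices, rep(1 / window, window), sides = 1)

as.numeric(sma)  # NA NA 12 14 16 18
```

The first two entries are `NA` because a full window of three observations is not yet available.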

---

### **2. Exponential Smoothing**  
- Assigns more weight to recent observations.
- **Single Exponential Smoothing (SES):**  
  $$
  S_t = \alpha Y_t + (1 - \alpha) S_{t-1}
  $$
  where $\alpha$ (smoothing factor) is between 0 and 1.
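
The recursion can be written as a short loop. The function name `ses` and the choice to start with $S_1 = Y_1$ are assumptions of this sketch; other initialisations are also used in practice.

```r
ses <- function(y, alpha) {
  s <- numeric(length(y))
  s[1] <- y[1]  # assumed initial value: first observation
  for (t in 2:length(y)) {
    # New smoothed value: weighted mix of the latest observation
    # and the previous smoothed value
    s[t] <- alpha * y[t] + (1 - alpha) * s[t - 1]
  }
  s
}

smoothed <- ses(c(10, 20, 30, 40), alpha = 0.5)
smoothed  # 10.00 15.00 22.50 31.25
```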

---

### **3. Autoregressive Integrated Moving Average (ARIMA)**
- **ARIMA(p, d, q)** is a widely used forecasting model where:  
  - **p** = Number of lagged observations used as predictors (AR: Autoregression).  
  - **d** = Number of times the series is differenced to make it stationary (I: Integrated).  
  - **q** = Number of lagged forecast errors used as predictors (MA: Moving Average).  
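
A minimal sketch of fitting an ARIMA(1, 1, 1) model with base R's `arima()`. To keep the example self-contained, the series is simulated with `arima.sim()`; the seed and the true coefficients (0.5 and 0.3) are assumptions chosen for illustration.

```r
set.seed(123)  # assumed seed for reproducibility

# Simulate a series from a known ARIMA(1, 1, 1) process
y <- arima.sim(model = list(order = c(1, 1, 1), ar = 0.5, ma = 0.3), n = 200)

# Fit the model: p = 1 AR term, d = 1 difference, q = 1 MA term
fit <- arima(y, order = c(1, 1, 1))
coef(fit)

# Forecast the next 5 values
forecast_5 <- predict(fit, n.ahead = 5)
forecast_5$pred
```

With real data, p, d, and q are not known in advance and are usually chosen by inspecting the series and comparing candidate models.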

---

### **4. Seasonal Decomposition of Time Series (STL)**  
- Breaks down a time series into **trend, seasonal, and residual components**. STL stands for Seasonal and Trend decomposition using Loess.
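
Base R provides `stl()` for this decomposition. The sketch below uses the built-in `AirPassengers` series; taking the logarithm first is an assumption of the example, made because the seasonal swings in this series grow with its level and STL fits an additive model.

```r
log_air <- log(AirPassengers)

# Decompose into seasonal, trend, and remainder components
stl_fit <- stl(log_air, s.window = "periodic")
components <- stl_fit$time.series

colnames(components)  # "seasonal" "trend" "remainder"
head(components)

# The three components add back up to the (log) series
reconstruction_error <- max(abs(rowSums(components) - as.numeric(log_air)))
reconstruction_error
```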

---

## **5. Forecasting in Time Series Analysis**  
Forecasting helps predict future values based on historical data.

### **Common Forecasting Methods:**
1. **Naïve Forecasting** – Assumes future values will be the same as the last observed value.  
2. **Moving Average Forecasting** – Uses the average of recent observations as the forecast.  
3. **Exponential Smoothing** – Uses a weighted average of past data, with weights that decay exponentially for older observations.  
4. **ARIMA Model** – Combines autoregression, differencing, and moving-average terms for more complex series.  

---

## **6. Performance Evaluation of Time Series Models**  
To evaluate the accuracy of a model, we use:

📌 **Mean Absolute Error (MAE):**
$$
MAE = \frac{1}{n} \sum_{t=1}^{n} |Y_t - \hat{Y}_t|
$$

📌 **Mean Squared Error (MSE):**
$$
MSE = \frac{1}{n} \sum_{t=1}^{n} (Y_t - \hat{Y}_t)^2
$$

📌 **Root Mean Squared Error (RMSE):**
$$
RMSE = \sqrt{MSE}
$$

📌 **Mean Absolute Percentage Error (MAPE):** (undefined when any $Y_t = 0$)
$$
MAPE = \frac{100}{n} \sum_{t=1}^{n} \left| \frac{Y_t - \hat{Y}_t}{Y_t} \right|
$$
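
The four metrics can be computed in a few lines of base R. The actual and forecast values are invented for illustration.

```r
actual   <- c(100, 200, 300)
forecast <- c(110, 190, 330)

errors <- actual - forecast

mae  <- mean(abs(errors))                 # mean absolute error
mse  <- mean(errors^2)                    # mean squared error
rmse <- sqrt(mse)                         # root mean squared error
mape <- mean(abs(errors / actual)) * 100  # mean absolute percentage error

c(MAE = mae, MSE = mse, RMSE = rmse, MAPE = mape)
```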

---

## **7. Real-Life Applications of Time Series Analysis**  
📌 **Stock Market Prediction** – ARIMA models are used to forecast stock prices, with limited accuracy in practice.  
📌 **Weather Forecasting** – Seasonal models predict temperature and rainfall.  
📌 **Sales Forecasting** – Helps businesses plan inventory and marketing.  
📌 **Energy Consumption Analysis** – Power grids use time series to optimize electricity supply.  

---

## **8. Summary Table**  

| **Concept**               | **Description** |
|---------------------------|----------------|
| **Trend**                 | Long-term movement in data (upward/downward). |
| **Seasonality**           | Repeating patterns (daily, monthly, yearly). |
| **Cyclic Patterns**       | Long-term fluctuations without a fixed period. |
| **Moving Average (MA)**   | Smoothing method to remove short-term fluctuations. |
| **Exponential Smoothing** | More weight to recent values for forecasting. |
| **ARIMA Model**           | Advanced model for time series forecasting. |
| **Decomposition**         | Splits data into trend, seasonality, and noise. |
| **Forecasting**           | Predicting future data points. |

---

## **9. Conclusion**  
- **Time Series Analysis** is crucial for understanding patterns in data over time.  
- **Techniques like moving averages, exponential smoothing, and ARIMA** help produce forecasts.  
- **Performance metrics** such as MAE, RMSE, and MAPE quantify forecast error.  
- **Real-world applications** include forecasting stock prices, sales, and weather.  

---

# **👉 5. Linear and Multiple Regression**  

---

## **1. Introduction to Regression Analysis**  
Regression analysis is a **statistical method** used to understand relationships between variables and make predictions. It helps in determining how a dependent variable (outcome) changes when one or more independent variables (predictors) are modified.

### **Types of Regression:**  
📌 **Simple Linear Regression** – Relationship between one independent variable and one dependent variable.  
📌 **Multiple Regression** – Relationship between multiple independent variables and one dependent variable.  
📌 **Logistic Regression** – Used for categorical outcomes (e.g., Yes/No).  

---

## **2. Linear Regression**  
### **Definition:**  
Linear regression is used to model the relationship between a **dependent variable (Y)** and **one independent variable (X)** using a straight line.

### **Equation of a Simple Linear Regression Model:**  
$$
Y = mX + c + e
$$  
where:  
- $Y$ = Dependent variable (observed outcome).  
- $X$ = Independent variable (input).  
- $m$ = Slope of the regression line (rate of change).  
- $c$ = Intercept (value of Y when X = 0).  
- $e$ = Error term (unexplained variations in Y).  

### **Example:**  
- Predicting house prices based on square footage.  
- Predicting student scores based on study hours.  

### **Assumptions of Linear Regression:**  
✅ **Linearity** – Relationship between X and Y is linear.  
✅ **Independence** – Observations are independent.  
✅ **Homoscedasticity** – Constant variance of errors.  
✅ **Normality** – Errors follow a normal distribution.  

### **Visual Representation of Linear Regression:**  
A scatter plot with a best-fit line showing the relationship between **X** (independent variable) and **Y** (dependent variable).
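
A simple linear regression can be fitted in R with `lm()`. This sketch uses the built-in `cars` data (stopping distance versus speed) as an assumed example; the intercept and slope returned by `coef()` correspond to $c$ and $m$ in the equation above.

```r
# Fit dist = c + m * speed + e
fit <- lm(dist ~ speed, data = cars)

intercept <- unname(coef(fit)["(Intercept)"])  # c
slope     <- unname(coef(fit)["speed"])        # m

c(intercept = intercept, slope = slope)

# Predicted stopping distance for a speed of 20
predicted_20 <- unname(predict(fit, newdata = data.frame(speed = 20)))
predicted_20
```

Calling `plot(cars); abline(fit)` afterwards draws the scatter plot with the best-fit line.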

---

## **3. Multiple Regression**  
### **Definition:**  
Multiple regression is an extension of linear regression that includes **multiple independent variables** to predict a single dependent variable.

### **Equation of Multiple Regression:**  
$$
Y = b_0 + b_1X_1 + b_2X_2 + ... + b_nX_n + e
$$  
where:  
- $Y$ = Dependent variable.  
- $X_1, X_2, ..., X_n$ = Multiple independent variables.  
- $b_0$ = Intercept.  
- $b_1, b_2, ..., b_n$ = Coefficients for each independent variable.  
- $e$ = Error term.  

### **Example:**  
- Predicting house prices based on **square footage, number of bedrooms, and location**.  
- Estimating employee salaries based on **experience, education, and skill level**.  
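
In R, multiple regression uses the same `lm()` function with several predictors joined by `+`. The built-in `mtcars` data and the choice of weight (`wt`) and horsepower (`hp`) as predictors of fuel efficiency (`mpg`) are assumptions for this sketch.

```r
# Fit mpg = b0 + b1 * wt + b2 * hp + e
fit_multi <- lm(mpg ~ wt + hp, data = mtcars)

coefs <- coef(fit_multi)
coefs  # b0 (intercept), b1 (wt), b2 (hp)

# Each coefficient is the expected change in mpg for a one-unit change
# in that predictor, holding the other predictor fixed
summary(fit_multi)$r.squared
```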

### **When to Use Multiple Regression?**  
📌 When **one factor is not enough** to predict the outcome.  
📌 When **multiple factors** contribute to the dependent variable.  

### **Assumptions of Multiple Regression:**  
✅ **No Multicollinearity** – Independent variables should not be highly correlated.  
✅ **Linearity** – Relationship between independent and dependent variables is linear.  
✅ **Homoscedasticity** – Variance of errors is constant.  
✅ **Normal Distribution of Errors** – Errors should be normally distributed.  

---

## **4. Model Evaluation Metrics**  
To measure how well a regression model fits the data, we use:

### **1. R-Squared ($R^2$)**  
- Measures the proportion of variance in Y explained by the model.  
- **Typically ranges from 0 to 1** for a least-squares fit with an intercept (closer to 1 = better fit).  

$$
R^2 = 1 - \frac{\sum (Y - \hat{Y})^2}{\sum (Y - \bar{Y})^2}
$$  

📌 **Example:** If $R^2 = 0.85$, it means **85% of the variation in Y** is explained by the model.  

### **2. Mean Absolute Error (MAE)**  
$$
MAE = \frac{1}{n} \sum |Y - \hat{Y}|
$$  
Measures the average absolute difference between actual and predicted values. Lower is better.  

### **3. Mean Squared Error (MSE)**  
$$
MSE = \frac{1}{n} \sum (Y - \hat{Y})^2
$$  
Punishes larger errors more than MAE.  

### **4. Root Mean Squared Error (RMSE)**  
$$
RMSE = \sqrt{MSE}
$$  
Gives an idea of the error magnitude in original units.  
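
These metrics can be computed from a fitted model's predictions. The sketch below assumes the built-in `cars` data and checks the hand-computed $R^2$ against the value reported by `summary()`.

```r
fit <- lm(dist ~ speed, data = cars)

actual    <- cars$dist
predicted <- fitted(fit)

# R-squared from the formula
r_squared <- 1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)

mae  <- mean(abs(actual - predicted))
mse  <- mean((actual - predicted)^2)
rmse <- sqrt(mse)

c(R2 = r_squared, MAE = mae, MSE = mse, RMSE = rmse)
summary(fit)$r.squared  # matches r_squared
```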

---

## **5. Real-World Applications of Regression Analysis**  
📌 **Finance** – Predicting stock prices based on economic indicators.  
📌 **Healthcare** – Estimating disease risk based on patient history.  
📌 **Marketing** – Analyzing the effect of advertising on sales.  
📌 **Education** – Predicting student performance based on study time and attendance.  

---

## **6. Summary Table**  

| **Concept**                | **Linear Regression** | **Multiple Regression** |
|----------------------------|----------------------|--------------------------|
| **Number of Independent Variables** | 1 | 2 or more |
| **Equation** | $Y = mX + c$ | $Y = b_0 + b_1X_1 + b_2X_2 + ... + b_nX_n$ |
| **Example** | Study time → Exam Score | Study time + Attendance → Exam Score |
| **Usage** | Simple relationships | Complex relationships |
| **Interpretation** | Change in Y per unit change in X | Change in Y based on multiple factors |

---

## **7. Conclusion**  
- **Linear regression** is useful for understanding the relationship between one independent and one dependent variable.  
- **Multiple regression** allows analyzing the effect of multiple factors on an outcome.  
- **R-squared, MAE, MSE, and RMSE** help evaluate model performance.  
- **Used in various fields** like finance, healthcare, and education.  

---

# **👉 6. Logistic Regression**  

---

## **1. Introduction to Logistic Regression**  
Logistic regression is a **statistical method** used for **classification problems** where the output is **binary or categorical** (e.g., Yes/No, 0/1, Pass/Fail).  

📌 **Why Not Use Linear Regression?**  
- Linear regression produces unbounded continuous values, which can fall below 0 or above 1 and so cannot be read as probabilities.  
- Logistic regression transforms the output into probabilities between **0 and 1** using a **sigmoid function**.  

### **Types of Logistic Regression:**  
✅ **Binary Logistic Regression** – Two possible outcomes (e.g., Spam/Not Spam).  
✅ **Multinomial Logistic Regression** – More than two **unordered** categories (e.g., Red, Blue, Green).  
✅ **Ordinal Logistic Regression** – More than two **ordered** categories (e.g., Low, Medium, High).  

---

## **2. Sigmoid Function (Logistic Function)**  
Since logistic regression deals with probabilities, it uses the **sigmoid function** to squash values between **0 and 1**.  

$$
\sigma(z) = \frac{1}{1 + e^{-z}}
$$

where:  
- $ \sigma(z) $ = Probability value (between 0 and 1).  
- $ e $ = Euler's number (**≈ 2.718**).  
- $ z $ = Linear combination of input features.  

### **Graph of Sigmoid Function:**  
- If $ z $ is **large and positive**, $ \sigma(z) $ approaches **1**.  
- If $ z $ is **large and negative**, $ \sigma(z) $ approaches **0**.  
- If $ z = 0 $, $ \sigma(0) = 0.5 $.  

📌 **Interpretation:** The closer $ \sigma(z) $ is to **1**, the more likely the event is to happen. If it is closer to **0**, the event is unlikely.  
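
The sigmoid function is a one-liner in R. The name `sigmoid` is chosen for this sketch; base R already provides the same function as `plogis()`.

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

z <- c(-10, -2, 0, 2, 10)
probs <- sigmoid(z)
round(probs, 4)

sigmoid(0)                      # 0.5
all.equal(probs, plogis(z))     # TRUE: matches the built-in logistic CDF
```

Large negative inputs give values near 0 and large positive inputs give values near 1.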

---

## **3. Logistic Regression Equation**  
Instead of a straight line like in linear regression, logistic regression models a probability using:

$$
p = \frac{1}{1 + e^{-(b_0 + b_1X_1 + b_2X_2 + ... + b_nX_n)}}
$$

where:  
- $ p $ = Probability of success.  
- $ b_0 $ = Intercept.  
- $ b_1, b_2, ..., b_n $ = Coefficients of independent variables.  
- $ X_1, X_2, ..., X_n $ = Independent variables (features).  

Taking the **logit transformation**:

$$
\log\left(\frac{p}{1-p}\right) = b_0 + b_1X_1 + b_2X_2 + ... + b_nX_n
$$

📌 **This converts the probability into a linear equation.**  
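
In R, logistic regression is fitted with `glm()` and `family = binomial`. This sketch assumes the built-in `mtcars` data, predicting transmission type (`am`, coded 0/1) from car weight (`wt`). It also checks that the log-odds of the fitted probabilities equal the linear predictor.

```r
fit_logit <- glm(am ~ wt, data = mtcars, family = binomial)
coef(fit_logit)  # b0 (intercept) and b1 (wt)

# Fitted probabilities p and linear predictor b0 + b1 * wt
prob     <- predict(fit_logit, type = "response")
log_odds <- predict(fit_logit, type = "link")

# Logit transformation of p recovers the linear predictor
logit_matches <- isTRUE(all.equal(unname(log(prob / (1 - prob))),
                                  unname(log_odds), tolerance = 1e-6))
logit_matches
```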

---

## **4. Decision Boundary**  
Logistic regression classifies data based on a **threshold** (usually **0.5**):  
- **If $ p \geq 0.5 $**, classify as **1 (Yes, Positive, Success, True)**.  
- **If $ p < 0.5 $**, classify as **0 (No, Negative, Failure, False)**.  

📌 **Example:**  
- If $ p = 0.8 $ → Predict **Yes** (1).  
- If $ p = 0.3 $ → Predict **No** (0).  

---

## **5. Loss Function – Log Loss (Cross-Entropy Loss)**  
Since logistic regression is based on probabilities, it uses **log loss** instead of mean squared error (MSE).  

$$
Loss = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]
$$

where:  
- $ y_i $ = Actual class label (0 or 1) of observation $ i $.  
- $ p_i $ = Predicted probability that observation $ i $ belongs to class 1.  
- $ n $ = Number of observations.  

📌 **Goal:** Minimize log loss to make better predictions.  
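
Log loss is straightforward to compute by hand. The labels and predicted probabilities below are made-up values, and `log_loss` is a name chosen for this sketch.

```r
log_loss <- function(y, p) {
  # Average negative log-likelihood of the observed labels
  -mean(y * log(p) + (1 - y) * log(1 - p))
}

y <- c(1, 0, 1)
p_good <- c(0.9, 0.2, 0.6)  # fairly confident and mostly right
p_bad  <- c(0.4, 0.7, 0.3)  # leaning the wrong way each time

loss_good <- log_loss(y, p_good)
loss_bad  <- log_loss(y, p_bad)

c(good = loss_good, bad = loss_bad)
```

Predictions that put high probability on the wrong class are penalised more heavily, so the second set scores a larger loss.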

---

## **6. Model Evaluation Metrics**  
After training a logistic regression model, we evaluate its performance using:  

### **1. Accuracy**  
$$
Accuracy = \frac{\text{Correct Predictions}}{\text{Total Predictions}}
$$

### **2. Confusion Matrix**  
A table that shows the actual vs. predicted values:  

| **Actual/Predicted** | **Predicted: 0** | **Predicted: 1** |
|----------------------|------------------|------------------|
| **Actual: 0**        | True Negative (TN) | False Positive (FP) |
| **Actual: 1**        | False Negative (FN) | True Positive (TP)  |

📌 **Key Metrics Derived from the Confusion Matrix:**  

- **Precision**: $ \frac{TP}{TP + FP} $ (How many predicted positives are actually positive?)  
- **Recall (Sensitivity)**: $ \frac{TP}{TP + FN} $ (How many actual positives were correctly predicted?)  
- **F1-Score**: Harmonic mean of precision and recall.  

$$
F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
$$

- **ROC Curve & AUC Score**: Measures how well the model distinguishes between classes.  
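
The sketch below applies a 0.5 threshold to some predicted probabilities, builds the confusion matrix with `table()`, and derives the metrics from it. The labels and probabilities are invented for illustration.

```r
actual <- c(1, 0, 1, 1, 0, 0, 1, 0)
prob   <- c(0.9, 0.2, 0.6, 0.4, 0.7, 0.1, 0.8, 0.3)

# Apply the decision threshold
predicted <- ifelse(prob >= 0.5, 1, 0)

# Confusion matrix: rows are actual classes, columns are predicted classes
confusion <- table(Actual = actual, Predicted = predicted)
confusion

tp <- sum(actual == 1 & predicted == 1)
tn <- sum(actual == 0 & predicted == 0)
fp <- sum(actual == 0 & predicted == 1)
fn <- sum(actual == 1 & predicted == 0)

accuracy  <- (tp + tn) / length(actual)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)

c(accuracy = accuracy, precision = precision, recall = recall, f1 = f1)
```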

---

## **7. Real-World Applications of Logistic Regression**  
📌 **Healthcare** – Predicting disease risk (Diabetes: Yes/No).  
📌 **Finance** – Credit card fraud detection (Fraud/Not Fraud).  
📌 **Marketing** – Customer conversion prediction (Will buy/Will not buy).  
📌 **E-commerce** – Product recommendation (Interested/Not Interested).  

---

## **8. Summary Table**  

| **Concept**                | **Details** |
|----------------------------|-------------|
| **Used For**               | Classification Problems |
| **Output**                 | Probability (0 to 1) |
| **Function Used**          | Sigmoid Function |
| **Equation**               | $ \log(\frac{p}{1-p}) = b_0 + b_1X_1 + ... + b_nX_n $ |
| **Threshold**              | 0.5 (Default) |
| **Evaluation Metrics**     | Accuracy, Precision, Recall, F1-Score |
| **Loss Function**          | Log Loss (Cross-Entropy) |
| **Real-World Use Cases**   | Medical Diagnosis, Fraud Detection, Marketing Analytics |

---

## **9. Conclusion**  
- Logistic Regression is a **classification algorithm** that predicts binary or categorical outcomes.  
- It uses the **sigmoid function** to output probabilities between 0 and 1.  
- **Thresholding** helps decide the class (0 or 1).  
- **Log Loss** is minimized during training to fit the model coefficients.  
- It is widely used in **healthcare, finance, marketing, and e-commerce**.  

---

# **👉 7. Survival Analysis**  

---

## **1. Introduction to Survival Analysis**  
Survival analysis is a **statistical method** used to analyze the **time until an event occurs**. This event could be **death, failure of a system, recovery from an illness, or churn of a customer in business**.  

### **Key Features of Survival Analysis:**  
✅ Analyzes **time-to-event data**.  
✅ Handles **censored data** (when an event hasn’t happened yet for some subjects).  
✅ Estimates **survival probabilities** over time.  

📌 **Examples:**  
- Medical studies: Time until a patient dies after treatment.  
- Business: Time until a customer cancels a subscription.  
- Engineering: Time until a machine breaks down.  

---

## **2. Important Terminologies in Survival Analysis**  

### **1. Event (Failure or Success)**
The occurrence of a **specific event** (e.g., death, machine failure, recovery).  

### **2. Survival Time (T)**
The time from the **start of observation** until the event occurs.  

### **3. Censoring**
When the exact event time is **unknown**, we call it **censored data**. Types of censoring:  
- **Right-censoring**: The event hasn’t occurred by the end of the study (most common).  
- **Left-censoring**: The event happened before observation started.  
- **Interval-censoring**: The event happened between two time points but the exact time is unknown.  

📌 **Example of Right Censoring:**  
A patient is part of a cancer drug trial for **5 years**, but they are still alive at the end. We don’t know when they might die, so their data is **right-censored**.  
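
In practice, right-censored data is usually stored as a pair per subject: the observed follow-up time and a flag saying whether the event was seen. The sketch below illustrates this encoding in plain Python. The `Observation` class, the `observe` helper, and the "true" event times are hypothetical and exist only to show how a 5-year study cutoff produces censored records.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    time: float  # follow-up time in years
    event: bool  # True = event observed, False = right-censored


def observe(true_event_time, study_length):
    """Apply right-censoring at the end of the study period."""
    if true_event_time <= study_length:
        return Observation(true_event_time, True)
    # Event would happen after the study ends: we only know T > study_length
    return Observation(study_length, False)


# Hypothetical true event times (unknown to the analyst in a real study)
true_times = [2.0, 7.5, 4.1]
observations = [observe(t, study_length=5.0) for t in true_times]

for obs in observations:
    print(obs)
```

The second subject is recorded as `time=5.0, event=False`: the analysis knows only that this subject survived at least 5 years.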

---

## **3. Survival Function and Hazard Function**  

### **1. Survival Function $S(t)$**
The probability that the event **has not occurred** by time $t$.  

$$
S(t) = P(T > t)
$$

- $S(t)$ is always **between 0 and 1**.  
- $S(0) = 1$ (At the start, no one has experienced the event).  
- $S(\infty) = 0$ (Eventually, all subjects will experience the event; this is the usual assumption for events such as death or failure).  

📌 **Interpretation:**  
- $S(5) = 0.8$ → 80% of individuals **survive beyond 5 years**.  

### **2. Hazard Function $h(t)$**
The **instantaneous rate** at which events occur, given that the subject **has survived up to time t**.  

$$
h(t) = \frac{f(t)}{S(t)}
$$

where:  
- $f(t)$ = Probability density function of survival times.  
- $S(t)$ = Survival function.  

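📌 **Interpretation:**  
- A **higher hazard** means a higher risk of the event occurring at time $t$.  
- Example: If $h(5) = 0.2$ per year, subjects still at risk at year 5 experience the event at a rate of about **0.2 events per person-year** at that instant. The hazard is a rate, not a probability, so it can exceed 1.  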
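
To make the relation $h(t) = f(t)/S(t)$ concrete, the sketch below evaluates it for an exponential survival time, a distribution chosen here purely as an assumption because its density and survival function have simple closed forms. The rate of 0.2 per year is likewise illustrative.

```python
import math


def exp_pdf(t, rate):
    """Density f(t) of an exponential survival time."""
    return rate * math.exp(-rate * t)


def exp_survival(t, rate):
    """Survival function S(t) = P(T > t) of an exponential survival time."""
    return math.exp(-rate * t)


def hazard(t, rate):
    """Hazard h(t) = f(t) / S(t)."""
    return exp_pdf(t, rate) / exp_survival(t, rate)


rate = 0.2  # assumed event rate per year
h5 = hazard(5, rate)
s5 = exp_survival(5, rate)

# Probability of an event during year 5 to 6, given survival to year 5
p_next_year = 1 - exp_survival(6, rate) / s5

print(round(h5, 4), round(s5, 4), round(p_next_year, 4))
```

For this distribution the hazard is constant and equal to the rate, while the conditional probability of an event within the following year is $1 - e^{-0.2} \approx 0.18$. The two numbers are close but not the same quantity.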

---

## **4. Methods of Survival Analysis**  

### **1. Kaplan-Meier Estimator (KM Curve)**
A **non-parametric** method to estimate the survival function.  

$$
S(t) = \prod_{t_i \leq t} \left(1 - \frac{d_i}{n_i}\right)
$$

where:  
- $d_i$ = Number of events at time $t_i$.  
- $n_i$ = Number of subjects at risk just before $t_i$.  

📌 **How it Works:**  
- At each event time, the running survival probability is **multiplied** by the conditional probability of surviving past that time, $1 - d_i/n_i$.  
- Censored subjects leave the risk set $n_i$ after their censoring time without being counted as events.  
- The result is a **stepwise decreasing survival curve**.  

✅ **Used for:** Estimating survival curves and visually comparing groups (e.g., treatment vs. control).  
✅ **Plot:** The **Kaplan-Meier curve** shows survival probabilities over time.  
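
The product formula can be traced by hand on a small data set. The following is a minimal sketch in plain Python. The function name and the six follow-up times are hypothetical, and subjects censored at an event time are treated as still at risk at that time, which is the usual convention. In R, the same estimate is typically obtained with `survfit()` from the `survival` package.

```python
def kaplan_meier(times, events):
    """Return [(t_i, S(t_i))] at each distinct event time.

    times  : follow-up time of each subject
    events : 1 if the event was observed, 0 if the subject was censored
    """
    event_times = sorted({t for t, e in zip(times, events) if e})
    curve = []
    surv = 1.0
    for t in event_times:
        n_i = sum(1 for x in times if x >= t)  # at risk just before t
        d_i = sum(1 for x, e in zip(times, events) if x == t and e)  # events at t
        surv *= 1 - d_i / n_i
        curve.append((t, surv))
    return curve


# Hypothetical follow-up data (years); 0 marks a censored subject
times = [2, 3, 3, 5, 6, 8]
events = [1, 1, 0, 1, 0, 1]

curve = kaplan_meier(times, events)
for t, s in curve:
    print(t, round(s, 4))
```

With these values the estimate steps down through 5/6, 2/3, 4/9 and finally 0. The subject censored at year 6 lowers the risk set for the last event without causing a step of its own.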

### **2. Log-Rank Test**
A **hypothesis test** to compare **two or more survival curves** (e.g., comparing drug A vs. drug B).  

- Null Hypothesis ($H_0$): No difference in survival between groups.  
- Alternative Hypothesis ($H_1$): A significant difference exists.  

If the **p-value < 0.05**, we reject $H_0$ and conclude that there is a **difference in survival** between groups.  
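
For two groups, the log-rank statistic compares the observed number of events in one group with the number expected if both groups shared the same survival curve. The sketch below is a plain-Python illustration under stated assumptions: the data are made up, the variance uses the standard hypergeometric form, and the p-value comes from the chi-square distribution with 1 degree of freedom, computed via `math.erfc`. In R, this test is typically run with `survdiff()` from the `survival` package.

```python
import math


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value)."""
    all_times = list(times_a) + list(times_b)
    all_events = list(events_a) + list(events_b)
    event_times = sorted({t for t, e in zip(all_times, all_events) if e})

    observed_a = expected_a = variance = 0.0
    for t in event_times:
        n_a = sum(1 for x in times_a if x >= t)  # at risk in group A
        n_b = sum(1 for x in times_b if x >= t)  # at risk in group B
        d_a = sum(1 for x, e in zip(times_a, events_a) if x == t and e)
        d_b = sum(1 for x, e in zip(times_b, events_b) if x == t and e)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n  # expected events in A under H0
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)

    if variance == 0:
        raise ValueError("not enough information to compute the test")
    chi2 = (observed_a - expected_a) ** 2 / variance
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square tail, 1 degree of freedom
    return chi2, p_value


# Hypothetical data: every subject has an observed event
drug_a = ([1, 2, 3, 4], [1, 1, 1, 1])  # early events
drug_b = ([5, 6, 7, 8], [1, 1, 1, 1])  # late events

chi2, p_value = logrank_test(*drug_a, *drug_b)
print(round(chi2, 3), round(p_value, 4))
```

Here all events in one group occur before any in the other, so the p-value falls below 0.05 and $H_0$ is rejected. Running the test on two identical groups gives a statistic of 0 and a p-value of 1.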

### **3. Cox Proportional Hazards Model (Cox Regression)**
A **semi-parametric** model that estimates the effect of variables on the **hazard** (and therefore on survival time) without assuming a particular form for the baseline hazard.  

$$
h(t) = h_0(t) e^{(b_1X_1 + b_2X_2 + ... + b_nX_n)}
$$

where:  
- $h_0(t)$ = Baseline hazard.  
- $X_1, X_2, ..., X_n$ = Independent variables.  
- $b_1, b_2, ..., b_n$ = Regression coefficients.  

📌 **Interpretation:**  
- If $b_1 > 0$, the variable **increases the hazard (risk of event)**.  
- If $b_1 < 0$, the variable **decreases the hazard** (protective factor).  
- The **hazard ratio** is $HR = e^{b_1}$. If $HR = 2$, a one-unit increase in $X_1$ **doubles the hazard** at every time point, holding the other variables fixed.  

✅ **Used for:** Finding risk factors in medical research.
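
Because the baseline hazard $h_0(t)$ appears in both numerator and denominator, the hazard ratio between two subjects depends only on the coefficients and the difference in their variables. The sketch below shows this in plain Python. The coefficients and the two variables (smoker, treated) are hypothetical values, not estimates from any data set. In R, coefficients of this kind are typically estimated with `coxph()` from the `survival` package.

```python
import math


def hazard_ratio(coefs, x_a, x_b):
    """Hazard ratio of profile a vs. profile b under a Cox model.

    h(t | x_a) / h(t | x_b) = exp(sum_j b_j * (x_aj - x_bj)); h_0(t) cancels.
    """
    return math.exp(sum(b * (xa - xb) for b, xa, xb in zip(coefs, x_a, x_b)))


# Hypothetical coefficients for two 0/1 variables: smoker, treated
coefs = [math.log(2), -0.5]

hr_smoker = hazard_ratio(coefs, x_a=[1, 0], x_b=[0, 0])   # smoker vs. non-smoker
hr_treated = hazard_ratio(coefs, x_a=[0, 1], x_b=[0, 0])  # treated vs. untreated

print(round(hr_smoker, 3), round(hr_treated, 3))
```

A coefficient of $\log 2$ gives a hazard ratio of 2, while the negative coefficient gives a ratio below 1, matching the protective-factor reading of $b < 0$.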

---

## **6. Real-World Applications of Survival Analysis**  
📌 **Healthcare** – Studying the survival time of cancer patients.  
📌 **Business** – Predicting customer churn.  
📌 **Engineering** – Reliability of machine components.  
📌 **Finance** – Risk assessment for loans.  

---

## **7. Summary Table**  

| **Concept**                 | **Description** |
|-----------------------------|----------------|
| **Event** | The occurrence of interest (e.g., death, failure, churn). |
| **Censoring** | When the exact event time is unknown for some subjects (e.g., the event has not occurred by the end of the study). |
| **Survival Function $S(t)$** | Probability of survival beyond time $t$. |
| **Hazard Function $h(t)$** | Instantaneous risk of event occurring at $t$. |
| **Kaplan-Meier Estimator** | Non-parametric method to estimate survival probability. |
| **Log-Rank Test** | Compares survival curves between groups. |
| **Cox Regression** | Estimates the effect of variables on the hazard (risk of the event over time). |

---

## **8. Conclusion**  
- Survival analysis helps analyze **time-to-event data** and handles **censored cases**.  
- **Kaplan-Meier estimator** is used for survival probability estimation.  
- **Cox regression** identifies risk factors affecting survival.  
- It is widely used in **medicine, business, engineering, and finance**.  

---

