#### Module 4: Statistics with R  
- **Random Forest**  
- **Decision Tree**  
- **Normal and Binomial Distributions**  
- **Time Series Analysis**  
- **Linear and Multiple Regression**  
- **Logistic Regression**  
- **Survival Analysis**  

---

# **Module 4: Statistics with R**

## **1. Random Forest**

### **Introduction**
Random Forest is an **ensemble learning method** that builds multiple decision trees and combines their outputs to improve accuracy. It is widely used for both **classification** (categorical output) and **regression** (numerical output).

### **How Random Forest Works**
1. **Bootstrapping**: Draws multiple bootstrap samples from the original dataset, each by selecting rows at random with replacement.
2. **Feature Selection**: At each split of a decision tree, only a random subset of the features is considered as split candidates.
3. **Decision Trees**: Grows one decision tree per bootstrap sample, so every tree sees different rows and features.
4. **Voting/Averaging**:
   - **For Classification**: Takes the majority vote from the trees.
   - **For Regression**: Takes the average prediction of all trees.
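
The steps above can be tried out with a short R sketch. It assumes the CRAN package `randomForest` is installed, and uses the built-in `iris` data purely for illustration; the 70/30 split and the values of `ntree` and `mtry` are arbitrary choices, not recommendations.

```r
# Assumes the CRAN package 'randomForest' is installed:
# install.packages("randomForest")
library(randomForest)

set.seed(42)  # bootstrapping and feature sampling are random

# Hold out 30% of the built-in iris data for testing
test_idx <- sample(nrow(iris), size = round(0.3 * nrow(iris)))
train <- iris[-test_idx, ]
test  <- iris[test_idx, ]

# ntree = number of bootstrapped trees; mtry = features tried at each split
rf_model <- randomForest(Species ~ ., data = train,
                         ntree = 200, mtry = 2, importance = TRUE)

# Classification: each tree votes and the majority class is returned
pred <- predict(rf_model, newdata = test)
accuracy <- mean(pred == test$Species)
print(accuracy)

# Variable importance scores, useful as a rough guide for feature selection
print(importance(rf_model))
```

For a numeric response, the same call fits a regression forest and `predict()` returns the average of the trees' predictions.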

### **Advantages**
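- ✅ Reduces overfitting compared to a single decision tree
- ✅ Works well with large datasets
- ✅ Can handle missing values, although depending on the implementation this may require imputing them first
- ✅ Can be used for feature selection via variable importance scores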

### **Disadvantages**
- ❌ Computationally expensive
- ❌ Requires careful tuning of hyperparameters

---

## **2. Decision Tree**

### **Introduction**
A **Decision Tree** is a tree-like model used for decision-making. It recursively splits data based on conditions to arrive at a decision.

### **Types of Decision Trees**
1. **Classification Trees**: Used for categorical outcomes.
2. **Regression Trees**: Used for continuous numerical outcomes.

### **How It Works**
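1. Starts at a **root node** that contains all the data.
2. Splits into **branches** on the most informative feature, i.e. the one that best separates the outcomes (measured, for example, by Gini impurity or information gain).
3. Repeats the split on each branch until it ends at **leaf nodes** that contain the final decisions.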
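
As a minimal sketch of these steps, the example below fits a classification tree with `rpart`, a recommended package that ships with standard R installations. The `iris` data is only an illustrative choice, and accuracy is measured on the training data, so it is optimistic.

```r
library(rpart)  # recommended package shipped with standard R installations

# Fit a classification tree on the built-in iris data
# (rpart uses Gini impurity by default for classification)
tree_model <- rpart(Species ~ ., data = iris, method = "class")

# Printing the tree lists the root node, each split condition, and the leaves
print(tree_model)

# Predict a class by following the splits from the root to a leaf
pred <- predict(tree_model, newdata = iris, type = "class")
train_accuracy <- mean(pred == iris$Species)
print(train_accuracy)
```

Setting `method = "anova"` with a numeric response fits a regression tree instead.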

### **Advantages**
- ✅ Easy to interpret and visualize
- ✅ Handles both numeric and categorical data
- ✅ Requires little data preprocessing

### **Disadvantages**
- ❌ Prone to overfitting, especially when the tree is grown deep
- ❌ Unstable: small changes in the data can produce a very different tree

---

## **3. Normal and Binomial Distributions**

### **Normal Distribution (Bell Curve)**

- Data is **symmetrically distributed** around the mean.
- Many natural phenomena follow this distribution approximately (e.g., human heights, test scores).
- Defined by **mean (μ)** and **standard deviation (σ)**.

#### **Formula**
$$
f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}
$$
Where:
- $ f(x) $ = Probability density at $ x $
- $ \mu $ = Mean
- $ \sigma $ = Standard deviation
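
To check the density formula numerically, the sketch below evaluates it by hand and compares the result with base R's `dnorm()`. The mean of 170 and standard deviation of 10 are hypothetical values (think heights in cm) chosen only for illustration.

```r
# Hypothetical parameters, chosen only for illustration
mu    <- 170
sigma <- 10
x     <- 180

# Density computed directly from the formula
manual <- 1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu)^2 / (2 * sigma^2))

# Same density from base R
builtin <- dnorm(x, mean = mu, sd = sigma)
print(c(manual = manual, builtin = builtin))

# Probability of falling within one standard deviation of the mean
within_one_sd <- pnorm(mu + sigma, mean = mu, sd = sigma) -
                 pnorm(mu - sigma, mean = mu, sd = sigma)
print(within_one_sd)  # about 0.683
```

Note that the density is not itself a probability; probabilities come from areas under the curve, which is what `pnorm()` computes.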

### **Binomial Distribution**

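- Models the number of successes in a fixed number of independent trials, each with a binary outcome (Success/Failure, Yes/No, Heads/Tails)
- Requires two parameters: **number of trials (n)** and **probability of success (p)**, which is the same for every trial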

#### **Formula**
$$
P(X=k) = \binom{n}{k} p^k (1-p)^{(n-k)}
$$
Where:
- $ n $ = Number of trials
- $ k $ = Number of successes
- $ p $ = Probability of success
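
As a quick worked example, the sketch below computes the probability of exactly 3 heads in 10 fair coin tosses, first from the formula and then with base R's `dbinom()`. The values of `n`, `k`, and `p` are illustrative.

```r
n <- 10   # number of trials (coin tosses)
k <- 3    # number of successes (heads)
p <- 0.5  # probability of success on each trial

# Probability from the formula: choose(n, k) * p^k * (1 - p)^(n - k)
manual <- choose(n, k) * p^k * (1 - p)^(n - k)

# Same probability from base R
builtin <- dbinom(k, size = n, prob = p)
print(c(manual = manual, builtin = builtin))  # both 0.1171875

# The probabilities over all possible k (0 to n) sum to 1
total <- sum(dbinom(0:n, size = n, prob = p))
print(total)
```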

---

## **4. Time Series Analysis**

### **Introduction**
Time Series Analysis examines **patterns in data over time** to make predictions.

### **Components of Time Series**
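- **Trend**: Long-term increase or decrease in the data
- **Seasonality**: Patterns that repeat at a fixed, known period (e.g., sales increasing every holiday season)
- **Cyclic Variations**: Rises and falls that recur without a fixed period (e.g., business cycles)
- **Irregular Component**: Random variation that remains after the other components are accounted for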
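
One way to see these components is base R's `decompose()`, shown here on the built-in `AirPassengers` series (monthly airline passenger totals). This is a sketch under the assumption that a multiplicative model suits the data; classical decomposition also does not separate cyclic variation, which ends up in the trend or the remainder.

```r
# AirPassengers is a built-in monthly time series (frequency = 12)
print(frequency(AirPassengers))

# Split the series into trend, seasonal, and irregular (random) parts
parts <- decompose(AirPassengers, type = "multiplicative")

# One seasonal factor per month; values above 1 mark busier months
print(round(parts$figure, 2))

# The trend is a centered moving average, so its ends are NA
print(head(parts$trend, 12))

# Irregular component: what remains after removing trend and seasonality
print(summary(parts$random))
```

Calling `plot(parts)` draws the observed series and the three estimated components in stacked panels.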

---

## **5. Linear and Multiple Regression**

### **Linear Regression**

- Also called simple linear regression: predicts a **continuous** dependent variable from one independent variable by fitting a straight line.

#### **Formula**
$$
Y = mX + c
$$
Where:
- $ Y $ = Dependent variable
- $ X $ = Independent variable
- $ m $ = Slope
- $ c $ = Intercept
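
The sketch below estimates the slope and intercept with base R's `lm()` on the built-in `cars` data (stopping distance against speed), then uses the fitted line to make a prediction. The dataset and the speed of 15 are illustrative choices.

```r
# Fit dist = m * speed + c by least squares
fit <- lm(dist ~ speed, data = cars)

c_hat <- coef(fit)[["(Intercept)"]]  # estimated intercept c
m_hat <- coef(fit)[["speed"]]        # estimated slope m
print(c(intercept = c_hat, slope = m_hat))

# Prediction for speed = 15, computed from the formula Y = mX + c
manual <- m_hat * 15 + c_hat

# Same prediction using predict()
builtin <- predict(fit, newdata = data.frame(speed = 15))
print(c(manual = manual, builtin = unname(builtin)))
```

`summary(fit)` additionally reports standard errors, p-values, and R-squared for the fitted line.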

### **Multiple Regression**

- Extends Linear Regression by considering **multiple independent variables**.

#### **Formula**
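$$
Y = b_0 + b_1X_1 + b_2X_2 + \dots + b_nX_n
$$
Where:
- $ b_0 $ = Intercept
- $ b_1, \dots, b_n $ = Coefficients of the independent variables $ X_1, \dots, X_n $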
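
In R, adding predictors only requires extending the model formula. The sketch below regresses fuel efficiency on weight and horsepower using the built-in `mtcars` data; the choice of predictors is purely illustrative.

```r
# Fit mpg = b0 + b1 * wt + b2 * hp by least squares
fit <- lm(mpg ~ wt + hp, data = mtcars)

# One intercept plus one coefficient per independent variable
print(coef(fit))

# Each coefficient is the expected change in mpg for a one-unit increase
# in that variable, holding the other variable fixed
r_squared <- summary(fit)$r.squared
print(r_squared)
```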

---

## **6. Logistic Regression**

### **Introduction**
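- Used for **binary classification** (Yes/No, Spam/Not Spam) by modeling the probability of the positive class; a threshold such as 0.5 turns that probability into a class label.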

#### **Formula (Sigmoid Function)**
$$
P(Y=1) = \frac{1}{1+e^{-(b_0 + b_1X_1 + \dots + b_nX_n)}}
$$
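
A minimal sketch with base R's `glm()` shows how the sigmoid formula turns fitted coefficients into probabilities. It uses the built-in `mtcars` data, where `am` is 1 for a manual transmission and 0 for an automatic; the single predictor `wt` and the 0.5 threshold are illustrative choices.

```r
# Logistic regression: probability that a car has a manual transmission
fit <- glm(am ~ wt, data = mtcars, family = binomial)
b <- coef(fit)

# Linear predictor b0 + b1 * X1 for every car
linear <- b[["(Intercept)"]] + b[["wt"]] * mtcars$wt

# Probabilities from the sigmoid formula
p_formula <- 1 / (1 + exp(-linear))

# Same probabilities from predict()
p_builtin <- predict(fit, type = "response")
print(head(cbind(p_formula, p_builtin)))

# Turn probabilities into class labels with a 0.5 threshold
pred_class <- ifelse(p_formula > 0.5, 1, 0)
print(table(predicted = pred_class, actual = mtcars$am))
```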

---

## **7. Survival Analysis**

### **Introduction**
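Survival Analysis estimates **the time until an event happens** (e.g., patient survival time, machine failure). It also accounts for **censored** observations, where the event had not yet occurred when observation ended.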

### **Kaplan-Meier Estimator**
- A non-parametric estimator of the survival function $ S(t) $, the probability that the event has not yet occurred by time $ t $.
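
The estimate can be computed by hand in base R. At each observed event time $ t_i $, the estimator multiplies the running survival probability by $ 1 - d_i / n_i $, where $ d_i $ is the number of events at $ t_i $ and $ n_i $ is the number of subjects still at risk. The six observations below are hypothetical toy data; `event = 0` marks a censored subject, i.e. one who left the study before the event was observed.

```r
# Hypothetical toy data: follow-up time and event indicator
# (1 = event observed, 0 = censored)
time  <- c(2, 3, 3, 5, 7, 8)
event <- c(1, 1, 0, 1, 0, 1)

# Distinct times at which at least one event occurred
event_times <- sort(unique(time[event == 1]))

# Subjects still at risk just before each event time
at_risk <- sapply(event_times, function(t) sum(time >= t))

# Events occurring exactly at each event time
events <- sapply(event_times, function(t) sum(time == t & event == 1))

# Kaplan-Meier estimate: running product of (1 - d_i / n_i)
survival <- cumprod(1 - events / at_risk)

km <- data.frame(time = event_times, at_risk = at_risk,
                 events = events, survival = survival)
print(km)
```

In practice, `survfit()` together with `Surv()` from the `survival` package computes this estimate along with confidence intervals.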

---