# 📘 Methodology Manual

## **1️⃣ Tools and Libraries Used**
### **🛠 Tools:**
- **Jupyter Notebook**: Used as the primary environment for writing and executing Python code.

### **📚 Libraries Used:**
- **Pandas**: For data manipulation and preprocessing.
- **NumPy**: For numerical computations.
- **Matplotlib**: For data visualization.
- **Seaborn**: For advanced statistical plots.
- **Pylab**: A Matplotlib convenience module that imports `pyplot` and NumPy into a single namespace for interactive plotting.
- **SciPy**: For scientific computations and statistical analysis.
- **Scikit-learn (Sklearn)**: For machine learning model implementation.

---

## **2️⃣ Defining Dependent and Independent Features**
- **Dependent Feature (Target Variable)**:  
  - The feature we are trying to predict.  
  - Example: **Attrition** (1 = Employee left, 0 = Employee stayed).  

- **Independent Features (Predictors)**:  
  - Features that influence the target variable.  
  - Example: `Age`, `ExperienceYearsAtThisCompany`, `OverTime`, `EmpJobSatisfaction`, etc.  
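
A minimal sketch of how the target and predictors might be separated with Pandas. The rows below are made-up illustration data, not the project dataset; only the column names come from the examples above, and the one-hot encoding step is an assumption about how text categories such as `OverTime` could be handled.

```python
import pandas as pd

# Hypothetical toy rows for illustration only; not the real dataset.
df = pd.DataFrame({
    "Age": [29, 41, 35, 52],
    "ExperienceYearsAtThisCompany": [2, 10, 5, 20],
    "OverTime": ["Yes", "No", "Yes", "No"],
    "EmpJobSatisfaction": [1, 4, 2, 3],
    "Attrition": [1, 0, 1, 0],  # 1 = employee left, 0 = employee stayed
})

target_col = "Attrition"

# Dependent feature (target) and independent features (predictors).
y = df[target_col]
X = df.drop(columns=[target_col])

# Text categories need a numeric encoding before modelling (assumed step).
X_encoded = pd.get_dummies(X, columns=["OverTime"], drop_first=True)
print(X_encoded.columns.tolist())
```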

---

## **3️⃣ Balancing the Data**
### **⚠️ Problem: Data Imbalance**
- The dataset is **imbalanced**, meaning there are significantly more instances of one class than the other.  
- An **imbalanced dataset** can cause the model to be biased towards the majority class, leading to poor predictions.  
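
One way to check for imbalance before modelling is to count the rows per class. The sketch below uses a hypothetical label column with an 80/20 split; the real class ratio of the dataset is not stated in this manual.

```python
import pandas as pd

# Hypothetical labels for illustration: 1 = left, 0 = stayed.
y = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 1, 1], name="Attrition")

class_counts = y.value_counts()               # absolute number of rows per class
class_share = y.value_counts(normalize=True)  # fraction of rows per class

# Ratio of majority to minority class size; 1.0 would be perfectly balanced.
imbalance_ratio = class_counts.max() / class_counts.min()

print(class_counts)
print(f"Imbalance ratio: {imbalance_ratio:.1f}")
```

With labels like these, a model that always predicts the majority class would already reach 80% accuracy, which is why accuracy alone can hide poor minority-class predictions.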

### **✅ Solution: SMOTE (Synthetic Minority Oversampling Technique)**
- **SMOTE** is a widely used **oversampling** technique that generates synthetic examples for the **minority class** rather than simply duplicating existing data.  
- It creates **new data points** by interpolating between existing instances of the minority class.

### **🛠 How SMOTE Works:**
1. Randomly selects a **minority class** sample.
2. Finds its **k-nearest neighbors** among the other minority class samples.
3. Creates a **new synthetic data point** at a random position on the line segment between the sample and one of those neighbors.
4. Repeats the process until the dataset is balanced.
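
The steps above can be sketched with NumPy alone. This is a simplified illustration for numeric features, not the implementation used in the project (the manual does not say which one was used); the function name `smote_sketch` and its parameters are hypothetical.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by interpolation."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)  # cannot have more neighbours than other samples

    # Pairwise Euclidean distances between minority samples.
    diffs = X_min[:, None, :] - X_min[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a sample is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(n)                    # step 1: pick a minority sample
        neighbour = rng.choice(neighbours[base])  # step 2: pick one of its k neighbours
        gap = rng.random()                        # step 3: random position in [0, 1)
        synthetic[i] = X_min[base] + gap * (X_min[neighbour] - X_min[base])
    return synthetic  # step 4: n_new controls how many points are added

# Usage with four hypothetical minority samples on a unit square.
X_minority = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0]])
X_new = smote_sketch(X_minority, n_new=6, k=2)
print(X_new.shape)
```

In practice, the `SMOTE` class from the imbalanced-learn package (with its `fit_resample` method) is a common ready-made option. Oversampling is usually applied to the training split only, so that synthetic points do not leak into the test set.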

---

## **4️⃣ Splitting Training and Testing Data**
- **80% of the data** is used for **training**.  
- **20% of the data** is used for **testing**.  
- The **train-test split** ensures that the model is evaluated on unseen data to test its generalizability.  
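
A minimal sketch of an 80/20 split with scikit-learn's `train_test_split`. The data is randomly generated for illustration, and the `stratify` and `random_state` arguments are assumptions; the manual does not say whether the project used them.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 4 features, 20% positive class.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 4))
y = np.array([1] * 20 + [0] * 80)

# test_size=0.2 gives the 80/20 split; stratify keeps the class ratio
# the same in both parts; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)
```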

---

## **5️⃣ Algorithm Selection**
### **🎯 AIM: Create a Model with Low Bias & Low Variance**
To achieve an optimal balance between **bias and variance**, we experiment with **three algorithms**:

---

## **🔹 Support Vector Machine (SVM)**
### **📌 What is SVM?**
- SVM is a **supervised learning algorithm** used for **classification and regression tasks**.  
- It finds the **optimal hyperplane** that best separates the data into different classes.  

### **✅ Advantages of SVM:**
- Works well in **high-dimensional spaces**.
- Effective when the number of dimensions is **greater than the number of samples**.
- Resists overfitting when the kernel and the regularization parameter `C` are chosen appropriately.

### **📊 Performance on Our Data:**
- **Training Accuracy:** 96.61%  
- **Test Accuracy:** 94.66% (about 2 percentage points below training accuracy, indicating mild overfitting).  
- After **Hyperparameter Tuning**, test accuracy increased to **98.28%**, but the model was still considered to overfit.  
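
The comparison of training and test accuracy, and the tuning step, could look roughly like the sketch below. It runs on synthetic data from `make_classification`, and the search grid is an assumption, since the hyperparameters tuned in the project are not documented. It will not reproduce the figures above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data; the real features and labels are not shown here.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# SVMs are sensitive to feature scale, so scaling is part of the pipeline.
pipeline = make_pipeline(StandardScaler(), SVC())

# Assumed search grid; step names follow make_pipeline's lowercase naming.
param_grid = {
    "svc__C": [0.1, 1, 10],
    "svc__gamma": ["scale", 0.01, 0.1],
    "svc__kernel": ["rbf"],
}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

# A large gap between these two numbers is the usual sign of overfitting.
train_acc = search.score(X_train, y_train)
test_acc = search.score(X_test, y_test)
print(f"Best params: {search.best_params_}")
print(f"Train accuracy: {train_acc:.4f}, test accuracy: {test_acc:.4f}")
```

Because the grid search uses cross-validation on the training split only, the test set stays unseen until the final `score` call.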

---

## **🔹 Random Forest**
### **📌 What is Random Forest?**
- Random Forest is an **ensemble learning method** that combines multiple decision trees to improve accuracy.  
- Uses **bagging (Bootstrap Aggregation)** to reduce variance.  

### **✅ Advantages of Random Forest:**
- Handles **non-linear relationships** well.
- Works well with both **categorical and numerical** data (in scikit-learn, categorical features must first be encoded as numbers).
- Reduces the risk of overfitting compared to a single decision tree.

### **📊 Performance on Our Data:**
- **Training Accuracy:** 100%  
- **Test Accuracy:** 95.61%  
- The gap between 100% training accuracy and 95.61% test accuracy suggests **overfitting**. After **Hyperparameter Tuning**, the test accuracy **decreased**, suggesting the tuned model **underfits**.  
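
The trade-off described above can be illustrated by comparing an unconstrained forest with a heavily constrained one. This is a sketch on synthetic data; the constraint values (`max_depth=2`, `min_samples_leaf=10`) are assumptions chosen to make the effect visible, not the settings tuned in the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the real dataset is not shown here.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

models = {
    # Fully grown trees (the default) can fit the training data almost perfectly.
    "unconstrained": RandomForestClassifier(n_estimators=100, random_state=0),
    # Shallow trees with large leaves lower variance at the cost of higher bias.
    "constrained": RandomForestClassifier(
        n_estimators=100, max_depth=2, min_samples_leaf=10, random_state=0
    ),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"{name}: train={results[name][0]:.4f}, test={results[name][1]:.4f}")
```

If constraints are pushed too far, both training and test accuracy fall, which is the underfitting pattern mentioned above.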

---

## **🔹 Artificial Neural Network (Multilayer Perceptron - MLP Classifier)**
### **📌 What is an ANN (MLP Classifier)?**
- An **Artificial Neural Network (ANN)** is a computing system inspired by the human brain.
- **Multilayer Perceptron (MLP)** is a type of ANN that consists of **multiple layers** of neurons.  
- It learns by **adjusting weights** through **backpropagation**.  

### **✅ Advantages of ANN:**
- Can capture **complex patterns** in the data.  
- Works well with **both structured and unstructured data**.  
- Can generalize well when trained properly.  

### **📊 Performance on Our Data:**
- **Training Accuracy:** 98.95%  
- **Testing Accuracy:** 95.80%  
- This model had the highest test accuracy of the three untuned models, with a gap of about 3 percentage points between **training and testing accuracy**, so it was judged the most **generalizable model**.  
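
A minimal sketch of training an MLP classifier and measuring its train-test gap with scikit-learn. The data is synthetic, and the architecture (`hidden_layer_sizes=(32, 16)`) and `max_iter` value are assumptions, since the manual does not state the settings used.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the real dataset is not shown here.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Neural networks train more reliably on standardized inputs.
# Assumed architecture: two hidden layers with 32 and 16 neurons.
mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=1000, random_state=0),
)
mlp.fit(X_train, y_train)  # weights are adjusted via backpropagation

train_acc = mlp.score(X_train, y_train)
test_acc = mlp.score(X_test, y_test)
gap = train_acc - test_acc  # a smaller gap suggests better generalization
print(f"Train: {train_acc:.4f}, test: {test_acc:.4f}, gap: {gap:.4f}")
```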

---

## **6️⃣ Final Model Selection**
- **Artificial Neural Network (MLP Classifier)** is selected as the final model.  
- It provides **high accuracy on both training and testing data** without excessive overfitting.  
- This model will be used for **employee attrition prediction and performance analysis**.

---
