# Project Objective: Template for all projects 

## Module 1: DATA COLLECTION

### Step 1.1 Import the required libraries

### Step 1.2 Load the Data as a DataFrame

### Step 1.3 Data Inspection - head(), info(), describe() etc.

## Module 2: DATA EXPLORATION

**Data Exploration**: This is the process of describing, visualizing, and analyzing data to better understand it. It helps answer questions about the structure and nature of the data.

**Instances and Features**: An instance (or record/observation) refers to a row of data, while a feature (or variable) refers to a column of data. Features can be categorical (a limited set of discrete values) or continuous (any value within a range).

**Dimensionality and Sparsity**: Dimensionality refers to the number of features in a dataset, while sparsity and density describe the degree to which data exists for the features in the dataset.
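As a rough illustration, dimensionality and density can be read straight off a DataFrame. The sketch below assumes pandas is available and uses a small made-up table; the column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; None marks a missing value.
df = pd.DataFrame({
    "age": [34, None, 45, 29],
    "income": [52000, 61000, None, None],
    "city": ["Pune", "Delhi", None, "Mumbai"],
})

n_instances, n_features = df.shape           # rows (instances), columns (features)
density = df.notna().sum().sum() / df.size   # share of cells that hold a value
sparsity = 1 - density                       # share of cells that are missing

print(f"instances={n_instances}, features={n_features}")
print(f"density={density:.3f}, sparsity={sparsity:.3f}")
```

Here the dimensionality is 3 (three features), and 4 of the 12 cells are missing, so the data is about one-third sparse.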

### Step 2.1 Describe a specific column to explore and understand it

### Step 2.2 Data Aggregation - value_counts(), mean(), groupby(), sort_values()

### Step 2.3 Data Cleaning - Handle missing values, duplicates, inconsistent formats, etc.

### Step 2.4 Data Storage - Save the cleaned data for future use.

----

## Module 3: DATA VISUALIZATION

### *Because certain data patterns are only understood when represented with a visualization*

```python
# Jupyter magic command: render plots inline, directly below the cell that creates them
%matplotlib inline
```

### Step 3.1 Comparison visualization

#### Scatter Plot

### Step 3.2 Relationship visualization

#### Scatter Plot

### Step 3.3 Distribution visualization

#### Histogram

### Step 3.4 Composition visualization

----

## Module 4: DATA MODELLING

### Decision Trees

We will start the modelling with decision trees.

A **decision tree** is a machine learning approach that uses an inverted tree-like structure to model the relationship between independent variables and a dependent variable. 

It mimics human decision-making with **if-then-else rules**. Each decision node represents a question, and branches represent possible answers leading to further decision nodes or leaf nodes (outcomes). 
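To make the if-then-else idea concrete, here is a tiny hand-written rule set shaped like a decision tree. The features (`has_default`, `income`), the threshold, and the outcomes are hypothetical and chosen only for illustration; a real tree would learn them from data.

```python
def approve_loan(income: float, has_default: bool) -> str:
    """Hand-written rules mirroring a tiny, hypothetical decision tree."""
    if has_default:           # root decision node: "Has the applicant defaulted before?"
        return "reject"       # leaf node (outcome)
    elif income >= 40000:     # second decision node: "Is income at least 40,000?"
        return "approve"      # leaf node
    else:
        return "review"       # leaf node


print(approve_loan(income=50000, has_default=False))
print(approve_loan(income=30000, has_default=False))
print(approve_loan(income=90000, has_default=True))
```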

Decision trees are used for both **classification** (categorical outcomes) and **regression** (continuous outcomes) problems.

-  When the dependent variable is categorical or discrete, such as TRUE/FALSE or YES/NO, we build a **classification tree**.
-  When the dependent variable is continuous, such as age, income, or temperature, we build a **regression tree**.

### Step 4.1 Build Recursive Partitions to create child partitions that are purer than their parents

### Recursive Partitioning
Recursive partitioning is a process used to build classification trees by repeatedly splitting data into smaller subsets. The goal is to maximize the similarity (homogeneity) of items within each subset. Here's a simplified breakdown:

- Initial Split: Start with the entire dataset and find the best way to split it into two subsets to maximize the similarity within each subset.
- Subsequent Splits: For each subset, repeat the process of finding the best split to create even more homogeneous subsets.
- Stopping Criteria: Continue this process until all instances in a subset are of the same class, all features are exhausted, or a user-defined condition is met.
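The steps above can be sketched in plain Python. This is a minimal illustration under several assumptions, not a production algorithm: it handles a single numeric feature, scores candidate splits with weighted Gini impurity, and uses a hypothetical `max_depth` argument as the user-defined stopping condition.

```python
from collections import Counter


def gini(labels):
    """Impurity of a list of class labels (0.0 means perfectly pure)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows):
    """Return (threshold, left, right) with the lowest weighted impurity, or None."""
    best = None
    values = sorted({x for x, _ in rows})
    for lo, hi in zip(values, values[1:]):      # midpoints between distinct values
        t = (lo + hi) / 2
        left = [r for r in rows if r[0] <= t]
        right = [r for r in rows if r[0] > t]
        score = (len(left) * gini([y for _, y in left])
                 + len(right) * gini([y for _, y in right])) / len(rows)
        if best is None or score < best[0]:
            best = (score, t, left, right)
    return best[1:] if best else None


def build_tree(rows, depth=0, max_depth=3):
    labels = [y for _, y in rows]
    split = best_split(rows)
    # Stopping criteria: pure subset, no split left, or user-defined depth limit.
    if len(set(labels)) == 1 or split is None or depth >= max_depth:
        return Counter(labels).most_common(1)[0][0]   # leaf: majority class
    t, left, right = split
    return {"threshold": t,
            "left": build_tree(left, depth + 1, max_depth),
            "right": build_tree(right, depth + 1, max_depth)}


# Hypothetical (feature value, class label) pairs.
rows = [(1, "no"), (2, "no"), (3, "yes"), (4, "yes"), (5, "no")]
tree = build_tree(rows)
print(tree)
```

On this toy data the first split falls at 2.5, and the right-hand subset is split again at 4.5, after which every subset contains a single class.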

### Step 4.2 Measure the Degree of Impurity within a partition

#### Step 4.2.1 Measure the Entropy
- Entropy: A measure of impurity borrowed from information theory, representing the level of randomness or disorder within a partition. It is used by the C5.0 decision tree algorithm.
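A minimal sketch of the entropy calculation, assuming the class labels of a partition are held in a plain Python list. Entropy is the sum over classes of `p * log2(1 / p)`, where `p` is the proportion of instances in that class; the function name and sample labels are illustrative only.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy (in bits) of the class labels in one partition."""
    n = len(labels)
    proportions = [count / n for count in Counter(labels).values()]
    # p * log2(1 / p) is the same as -p * log2(p)
    return sum(p * math.log2(1 / p) for p in proportions)


print(entropy(["yes", "yes", "no", "no"]))    # evenly mixed partition
print(entropy(["yes", "yes", "yes", "yes"]))  # pure partition
```

An evenly mixed two-class partition has the maximum entropy of 1 bit, while a pure partition has an entropy of 0.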

#### Step 4.2.2 Measure the Gini Impurity
- Gini Impurity: Another measure of impurity, representing the statistical dispersion within a partition. It is used by the CART decision tree algorithm.
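A minimal sketch of the Gini impurity calculation, again assuming the labels of a partition are in a plain list. Gini impurity is one minus the sum of the squared class proportions; the function name and sample labels are illustrative only.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of the class labels in one partition."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


print(gini_impurity(["yes", "yes", "no", "no"]))    # evenly mixed partition
print(gini_impurity(["yes", "yes", "yes", "yes"]))  # pure partition
```

For two classes the value ranges from 0 (pure) to 0.5 (evenly mixed).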

#### Step 4.2.3 Measure the Information Gain
- Information Gain: The reduction in entropy that occurs as a result of a split, helping to determine the best split for the data.

These concepts help classification tree algorithms decide where to split the data to create the most homogeneous partitions.
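As a sketch, information gain can be computed as the parent partition's entropy minus the size-weighted average entropy of the child partitions. The helper names and sample labels below are illustrative assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy (in bits) of the class labels in one partition."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())


def information_gain(parent, children):
    """Entropy of the parent minus the weighted entropy of its child partitions."""
    n = len(parent)
    weighted_child_entropy = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted_child_entropy


parent = ["yes", "yes", "no", "no"]
clean_split = [["yes", "yes"], ["no", "no"]]   # each child is pure
poor_split = [["yes", "no"], ["yes", "no"]]    # each child is as mixed as the parent

print(information_gain(parent, clean_split))
print(information_gain(parent, poor_split))
```

The clean split removes all of the parent's entropy, so it has the highest possible gain for this data; the poor split removes none.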


#### Step 4.2.4 Measure the Sum of Squared Residuals (SSR)
- **Purpose**: SSR is the sum of the squared differences between each value in a partition and the partition's mean. A high SSR indicates high variability within the partition, meaning the values are very different from the mean. A low SSR indicates low variability, meaning the values are similar to the mean.
- The **regression tree algorithm** evaluates possible splits by calculating the SSR for each partition. It chooses the split that results in the lowest combined SSR, thereby minimizing variability and improving the model's accuracy.
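The split search described above can be sketched as follows, assuming a single numeric feature and candidate thresholds at the midpoints between distinct feature values. The function names and the toy data are hypothetical.

```python
def ssr(values):
    """Sum of squared differences between each value and the partition mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_regression_split(xs, ys):
    """Return (threshold, combined SSR) for the split with the lowest combined SSR."""
    pairs = list(zip(xs, ys))
    distinct = sorted(set(xs))
    best = None
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in pairs if x <= threshold]
        right = [y for x, y in pairs if x > threshold]
        combined = ssr(left) + ssr(right)
        if best is None or combined < best[1]:
            best = (threshold, combined)
    return best


# Hypothetical data with two obvious groups of target values.
xs = [1, 2, 3, 10, 11, 12]
ys = [5, 6, 7, 20, 21, 22]
best = best_regression_split(xs, ys)
print(best)
```

The chosen threshold separates the low targets from the high ones, because that is where the values on each side sit closest to their own mean.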

#### Step 4.2.5 Prune the Decision Tree
- **Pre-pruning**: This involves setting criteria to limit the size of the tree during the recursive partitioning process. It helps prevent overfitting but might stop tree growth too early.
- **Post-pruning**: This allows the tree to grow fully and then reduces its size afterward. It is more effective in discovering important patterns but is less efficient in terms of compute time.
- **Cost Complexity Pruning**: This method balances the sum of squared residuals (SSR) with a complexity penalty to choose the best sub-tree. The complexity parameter (alpha) is tuned to find the optimal tree size.

#### Step 4.2.6 Calculate the Tree Score
A tree score is a metric used to evaluate the quality of a decision tree, balancing its ability to explain the data with its complexity. It consists of two main components:

- Sum of Squared Residuals (SSR): Measures how well the tree explains the data. Lower SSR indicates better fit.
- Tree Complexity Penalty: Accounts for the number of leaf nodes in the tree. It is the product of a user-defined complexity parameter (alpha) and the number of leaf nodes.
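Putting the two components together, the tree score is the SSR plus alpha times the number of leaf nodes, and the sub-tree with the lowest score is preferred. The sketch below uses made-up SSR values and leaf counts for three hypothetical candidate sub-trees to show how the choice shifts as alpha grows.

```python
def tree_score(ssr, n_leaves, alpha):
    """Tree score = SSR + alpha * number of leaf nodes (lower is better)."""
    return ssr + alpha * n_leaves


def best_subtree(candidates, alpha):
    """Name of the candidate sub-tree with the lowest tree score."""
    return min(candidates, key=lambda name: tree_score(*candidates[name], alpha))


# Hypothetical candidates: name -> (SSR, number of leaf nodes).
candidates = {
    "full tree": (120.0, 8),
    "pruned once": (150.0, 5),
    "stump": (400.0, 2),
}

for alpha in (0, 20, 100):
    print(alpha, best_subtree(candidates, alpha))
```

With alpha at 0 there is no penalty and the full tree wins on fit alone; larger alpha values favour progressively smaller trees.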


## Module 5: DATA OPTIMIZATION