# Week 8: Capstone Project Documentation

## **Report**

### **1. Dataset Description**

The dataset used in this project covers Pokémon statistics and attributes. It contains **801 entries** and **41 columns**. Each row describes one Pokémon: its base stats, type classifications, and other descriptive features.

#### **Key Features**
- **Categorical Attributes**:
  - `type1`, `type2`: The primary and secondary elemental types of each Pokémon.
  - `is_legendary`: Indicates whether a Pokémon is legendary (1 for yes, 0 for no).
- **Numerical Attributes**:
  - Base stats such as `attack`, `defense`, `hp`, and `speed`.
  - Derived stats like `base_total`, which is the sum of all base stats.

This dataset was selected for its diversity in both categorical and numerical features, making it suitable for both classification and regression tasks.
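
As a minimal sketch of how a derived stat like `base_total` relates to the individual stats, the snippet below sums six stat columns row by row. The tiny frame is hand-made for illustration, and the exact list of stat columns (in particular `sp_defense`) is an assumption rather than something stated in this report.

```python
import pandas as pd

# Hand-made rows standing in for the real dataset (values are illustrative only).
df = pd.DataFrame({
    "name": ["A", "B", "C"],
    "hp": [45, 60, 80],
    "attack": [49, 62, 82],
    "defense": [49, 63, 83],
    "sp_attack": [65, 80, 100],
    "sp_defense": [65, 80, 100],
    "speed": [45, 60, 80],
})

# Assumed list of base stats; `sp_defense` is not named in the report.
STAT_COLUMNS = ["hp", "attack", "defense", "sp_attack", "sp_defense", "speed"]

# base_total is the row-wise sum of the individual base stats.
df["base_total"] = df[STAT_COLUMNS].sum(axis=1)
print(df[["name", "base_total"]])
```

On the real data, comparing a recomputed sum like this against the stored `base_total` column would be a quick consistency check.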

---

### **2. Methodology**

#### **Step 1: Data Cleaning**
The dataset was preprocessed to handle missing values and inconsistencies:
- **Handling Missing Values**:
  - The categorical column `type2` was filled with "None" for Pokémon without a secondary type.
  - Numerical columns (`height_m`, `weight_kg`, `percentage_male`) were imputed using median or mean values.
- **Feature Conversion**:
  - `capture_rate` was converted to numeric, handling any non-numeric values.
- **Final Check**:
  - Removed duplicates and ensured consistency in all columns.
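
The cleaning steps above could be sketched roughly as follows. This is an illustration under assumptions, not the project's actual code: the function name `clean_pokemon` is hypothetical, median imputation is chosen where the report says "median or mean", and imputing `capture_rate` after coercion is also an assumption. The sample rows are made up.

```python
import numpy as np
import pandas as pd

def clean_pokemon(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps to a copy of `df` (hypothetical helper)."""
    out = df.copy()
    # Pokémon without a secondary type get the explicit label "None".
    out["type2"] = out["type2"].fillna("None")
    # Coerce capture_rate to numeric; unparseable entries become NaN.
    out["capture_rate"] = pd.to_numeric(out["capture_rate"], errors="coerce")
    # Median imputation for numeric columns (assumed; the report allows median or mean).
    for col in ["height_m", "weight_kg", "percentage_male", "capture_rate"]:
        out[col] = out[col].fillna(out[col].median())
    # Drop exact duplicate rows and renumber the index.
    return out.drop_duplicates().reset_index(drop=True)

# Made-up rows: one missing type2, several missing numerics,
# one non-numeric capture_rate, and one duplicated row.
raw = pd.DataFrame({
    "name": ["A", "B", "C", "C"],
    "type2": ["poison", None, "flying", "flying"],
    "height_m": [0.7, np.nan, 1.7, 1.7],
    "weight_kg": [6.9, 13.0, np.nan, np.nan],
    "percentage_male": [88.1, np.nan, 50.0, 50.0],
    "capture_rate": ["45", "30 (alt form)", "3", "3"],
})
clean = clean_pokemon(raw)
print(clean)
```

Using `errors="coerce"` keeps the conversion from failing on stray strings, at the cost of silently turning them into missing values, so it is worth counting the resulting NaNs before imputing.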

#### **Step 2: Exploratory Data Analysis (EDA)**
EDA was performed to understand the dataset and guide feature selection. Key insights include:
- **Correlation Analysis**:
  - Strong positive correlations were observed among `base_total`, `attack`, `defense`, and `sp_attack`. 


![Correlation Heatmap](Datasets/week-8-images/correlation%20heatmap%20for%20numeric%20features.png)



- **Distribution Insights**:
  - Water-type Pokémon are the most common primary type.


![Count of Pokémon by Primary Type](Datasets/week-8-images/count%20of%20pokemon%20by%20primary%20type.png)



  - Generations 1 and 5 have the highest number of Pokémon. 


![Distribution Across Generations](Datasets/week-8-images/distribution%20of%20pokemon%20across%20generations.png)



  - Base total stats follow a bimodal distribution, with peaks at 300-400 and 500-600. 


![Base Total Stats](Datasets/week-8-images/distribution%20of%20pokemon%20base%20total%20statspng.png)
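
The summaries behind the plots above can be computed with a few pandas calls. The sketch below uses a small made-up frame, and the column name `generation` and the 100-point bin width are assumptions; it only illustrates the kind of tables that feed a heatmap, bar chart, or histogram.

```python
import pandas as pd

# Made-up rows standing in for the real dataset.
df = pd.DataFrame({
    "type1": ["water", "fire", "water", "grass", "water", "fire"],
    "generation": [1, 1, 2, 1, 5, 5],
    "attack": [48, 52, 65, 49, 100, 110],
    "defense": [65, 43, 64, 49, 90, 80],
    "base_total": [314, 309, 405, 318, 580, 600],
})

# Pairwise Pearson correlations among numeric columns (input to a heatmap).
corr = df.select_dtypes("number").corr()

# Counts per primary type and per generation (inputs to the bar charts).
type_counts = df["type1"].value_counts()
generation_counts = df["generation"].value_counts().sort_index()

# Histogram-style counts of base_total in 100-point-wide bins.
bins = list(range(100, 801, 100))
total_bins = pd.cut(df["base_total"], bins=bins).value_counts().sort_index()

print(corr.round(2))
print(type_counts)
print(generation_counts)
print(total_bins)
```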




#### **Step 3: Classification Model**
##### **Objective**
Predict whether a Pokémon is legendary or non-legendary.

##### **Approach**
- **Model Used**: Random Forest Classifier
- **Preprocessing Steps**:
  - Numeric features were scaled.
  - Categorical features were one-hot encoded.
  - Dimensionality reduction was applied using TruncatedSVD.
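
One way to wire these preprocessing steps to a Random Forest is a scikit-learn `Pipeline`, sketched below. The feature lists, `n_components=5`, the stratified split, and the synthetic data are all assumptions made for illustration; the report does not specify them.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

NUMERIC = ["attack", "defense", "hp", "speed"]   # assumed subset of features
CATEGORICAL = ["type1", "type2"]

def build_classifier(n_components: int = 5, seed: int = 0) -> Pipeline:
    """Scale numerics, one-hot encode categoricals, reduce with SVD, then classify."""
    preprocess = ColumnTransformer([
        ("num", StandardScaler(), NUMERIC),
        ("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL),
    ])
    return Pipeline([
        ("preprocess", preprocess),
        ("svd", TruncatedSVD(n_components=n_components, random_state=seed)),
        ("model", RandomForestClassifier(n_estimators=100, random_state=seed)),
    ])

# Synthetic, imbalanced data: roughly 10% "legendary" rows with higher stats.
rng = np.random.default_rng(0)
n = 200
is_legendary = (rng.random(n) < 0.1).astype(int)
stats = rng.normal(70, 15, size=(n, 4)) + 40 * is_legendary[:, None]
df = pd.DataFrame(stats, columns=NUMERIC)
df["type1"] = rng.choice(["water", "fire", "grass"], size=n)
df["type2"] = rng.choice(["None", "flying", "poison"], size=n)
df["is_legendary"] = is_legendary

# Stratifying keeps the class ratio similar in both splits (an assumption here).
X_train, X_test, y_train, y_test = train_test_split(
    df[NUMERIC + CATEGORICAL], df["is_legendary"],
    test_size=0.25, stratify=df["is_legendary"], random_state=0,
)
clf = build_classifier().fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(f"accuracy on synthetic data: {accuracy:.3f}")
```

Keeping the transformers inside the pipeline means they are fitted on the training split only, which avoids leaking test-set statistics into scaling or encoding.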

##### **Performance**
- **Accuracy**: 95.65%
- **Confusion Matrix**:
  - Non-legendary Pokémon predictions were highly accurate.
  - Legendary Pokémon were predicted less reliably: only half were identified (50% recall), likely because of class imbalance.



![Confusion Matrix](Datasets/week-8-images/confusion%20matrix%20for%20classification%20model.png)



- **Classification Metrics**:
  - Non-Legendary: Precision 95.45%, Recall 100%, F1-Score 97.67%.
  - Legendary: Precision 100%, Recall 50%, F1-Score 66.67%. 



![Classification Metrics](Datasets/week-8-images/Precision,%20f1,%20recall%20scores%20for%20classifcation%20model.png)
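
To make the relationship between these metrics concrete, the sketch below derives them from confusion-matrix counts. The counts are hypothetical: they were chosen because they reproduce the reported percentages, but the report does not state the actual test-set counts, and any proportional scaling of them would give the same figures.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) for one class from its raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical counts with "legendary" as the positive class.
tn, fp, fn, tp = 21, 0, 1, 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
legendary = precision_recall_f1(tp=tp, fp=fp, fn=fn)
# For the non-legendary class the roles swap: its hits are the true negatives,
# its false positives are missed legendaries, and its misses are the false positives.
non_legendary = precision_recall_f1(tp=tn, fp=fn, fn=fp)

print(f"accuracy:      {accuracy:.2%}")
print("legendary:     P={:.2%} R={:.2%} F1={:.2%}".format(*legendary))
print("non-legendary: P={:.2%} R={:.2%} F1={:.2%}".format(*non_legendary))
```

With counts like these, a single missed legendary Pokémon barely moves accuracy but halves recall for that class, which is why accuracy alone is a weak summary on imbalanced data.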



#### **Step 4: Regression Model**
##### **Objective**
Predict the `base_total` stat of Pokémon.

##### **Approach**
- **Model Used**: Random Forest Regressor
- **Preprocessing Steps**:
  - Numeric features were scaled.
  - Categorical features were one-hot encoded.
  - Dimensionality reduction was applied using TruncatedSVD.

##### **Performance**
- **Mean Squared Error (MSE)**: 2015.66
- **R-squared (R²)**: 0.849
- **Residual Analysis**:
  - Residuals are centered around 0, indicating little systematic bias in the predictions.



![Residual Plot](Datasets/week-8-images/residual%20plot%20of%20regression%20model.png)

- **Actual vs Predicted**:
  - Predictions align closely with actual values for most Pokémon, with few outliers.



![Actual vs Predicted](Datasets/week-8-images/actual%20VS%20predicted%20base%20total%20for%20regression%20model.png)
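
The regression metrics above follow directly from the residuals. The sketch below computes MSE, RMSE, R², and the mean residual with NumPy; the helper name `regression_report` and the four sample values are hypothetical and are not the project's predictions.

```python
import numpy as np

def regression_report(y_true, y_pred) -> dict:
    """Compute MSE, RMSE, R² and the mean residual (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    ss_res = float(np.sum(residuals ** 2))                    # unexplained variation
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))     # total variation
    mse = ss_res / len(y_true)
    return {
        "mse": mse,
        "rmse": mse ** 0.5,
        "r2": 1.0 - ss_res / ss_tot,
        "mean_residual": float(residuals.mean()),
    }

# Made-up values for illustration only.
actual = [300, 400, 500, 600]
predicted = [310, 390, 520, 580]
report = regression_report(actual, predicted)
print(report)
```

RMSE is often easier to interpret than MSE because it is in the same units as the target: the reported MSE of 2015.66 corresponds to an RMSE of roughly 44.9 `base_total` points.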



---

### **3. Challenges and Solutions**

#### **1. Missing Values**
- **Challenge**: Some attributes, such as `type2`, `height_m`, and `percentage_male`, contained missing values.
- **Solution**: Imputed missing values with a strategy suited to each column (e.g., the median for numeric columns and a "None" label for a missing `type2`).

#### **2. Imbalanced Dataset**
- **Challenge**: The dataset was imbalanced, with significantly fewer legendary Pokémon.
- **Solution**: Evaluated performance metrics beyond accuracy, such as Precision, Recall, and F1-Score.

#### **3. Dimensionality Reduction**
- **Challenge**: The one-hot encoding of categorical variables introduced high-dimensional sparse matrices.
- **Solution**: Applied TruncatedSVD to reduce dimensionality while retaining 95% of the variance.
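
Unlike `PCA`, scikit-learn's `TruncatedSVD` takes an integer `n_components` rather than a variance fraction, so a variance target has to be translated into a component count. The sketch below shows one possible way to do that using the cumulative `explained_variance_ratio_`; the helper `components_for_variance` and the synthetic low-rank matrix are assumptions, and the report does not say how the component count was actually chosen.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

def components_for_variance(X, target: float = 0.95, seed: int = 0) -> int:
    """Smallest component count whose cumulative explained variance reaches `target`."""
    max_components = X.shape[1] - 1  # keep the count below the number of features
    svd = TruncatedSVD(n_components=max_components, random_state=seed).fit(X)
    cumulative = np.cumsum(svd.explained_variance_ratio_)
    reached = np.nonzero(cumulative >= target)[0]
    # Fall back to the maximum tried if the target is never reached.
    return int(reached[0]) + 1 if reached.size else max_components

# Synthetic data: 20 features that are almost entirely driven by 3 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 3))
X = latent @ rng.normal(size=(3, 20)) + 0.01 * rng.normal(size=(100, 20))

k = components_for_variance(X, target=0.95)
reduced = TruncatedSVD(n_components=k, random_state=0).fit_transform(X)
print(k, reduced.shape)
```

Fitting once with the largest allowed component count and then reading off the cumulative curve avoids refitting the decomposition for every candidate value.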

#### **4. Performance of Models**
- **Challenge**: Both models underperformed on harder cases: the regressor on outliers and the classifier on the minority (legendary) class.
- **Solution**: Tuned hyperparameters and used robust algorithms like Random Forest for better generalization.
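
The report does not say which hyperparameters were tuned or how. As one possible approach, the sketch below runs a small grid search with stratified cross-validation and F1 scoring; the parameter grid, the scoring choice, the `class_weight` option, and the synthetic data are all assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic imbalanced data: roughly 15% positives, shifted so there is signal to learn.
rng = np.random.default_rng(0)
y = (rng.random(300) < 0.15).astype(int)
X = rng.normal(size=(300, 6)) + 1.5 * y[:, None]

# Hypothetical grid; the real search space is not given in the report.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 5],
    "class_weight": [None, "balanced"],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="f1",  # focuses the search on the positive (minority) class
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
)
search.fit(X, y)
best_params = search.best_params_
print(best_params, round(search.best_score_, 3))
```

Scoring on F1 rather than accuracy makes the search sensitive to minority-class performance, which accuracy would largely hide on data this imbalanced.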

---

### **4. Summary of Work**

- **Dataset**: Pokémon dataset with 801 rows and 41 features.
- **EDA**: Provided insights into feature distributions, correlations, and data imbalances.
- **Models**:
  - **Classification**: Achieved 95.65% accuracy in predicting legendary status, though recall on the legendary class was only 50%.
  - **Regression**: Explained 84.9% of the variance in `base_total` (R² = 0.849).
- **Challenges Addressed**: Missing values were imputed, high dimensionality was reduced with TruncatedSVD, and class imbalance was accounted for by reporting precision, recall, and F1-score alongside accuracy.

This capstone project demonstrates the application of data science techniques to extract insights and build predictive models, balancing analytical rigor with practical challenges.
