# 📚 Homework & Assignment

## 1. 📊 Feature Selection Using Covariance

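- **Covariance matrix** shows how pairs of variables vary together, but its values depend on each feature's units.
- **Correlation** is covariance rescaled to the range [-1, 1]; values close to -1 or 1 mean two features are nearly linearly related.
- Use `df.cov()` for covariance and `df.corr()` for correlation; plot the latter with `sns.heatmap(df.corr(numeric_only=True), annot=True)`.
- Helps identify **multicollinearity**, which can destabilize coefficients in linear models and make feature importances harder to interpret.
- Remove or combine highly correlated features.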

```python
import seaborn as sns
import matplotlib.pyplot as plt

# numeric_only=True skips text columns, which raise an error in pandas >= 2.0
corr_matrix = df.corr(numeric_only=True)
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.show()
```
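
The "remove highly correlated features" step can be sketched as below. This is a minimal illustration, not a fixed recipe: the helper name `drop_highly_correlated`, the `0.9` threshold, and the toy DataFrame are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def drop_highly_correlated(df: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop one feature from each pair whose absolute correlation exceeds threshold."""
    corr = df.corr(numeric_only=True).abs()
    # Keep only the strict upper triangle so each pair is checked once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)


# Toy data (hypothetical): 'b' is an exact multiple of 'a', 'c' is weakly related.
demo = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})
reduced = drop_highly_correlated(demo, threshold=0.9)
print(list(reduced.columns))
```

Which feature of a correlated pair gets dropped here depends only on column order; in practice you may prefer to keep the one that is easier to interpret or more related to the target.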

## 2. 🧼 Handling NaNs & Imputation Strategies

### Dealing with Missing Values:

* **Drop**: if rows or columns have too many NaNs.
* **Impute**:

  * Mean/Median: for numerical columns
  * Mode: for categorical columns
  * KNN or Model-based: for advanced imputation

```python
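from sklearn.impute import SimpleImputer

# Select numeric columns; the mean is undefined for text columns.
numeric_cols = df.select_dtypes(include='number').columns
imputer = SimpleImputer(strategy='mean')  # or 'median', 'most_frequent'
df[numeric_cols] = imputer.fit_transform(df[numeric_cols])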
```
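
For the KNN option in the list above, scikit-learn provides `KNNImputer`. The sketch below is only an illustration: the column names, values, and `n_neighbors=2` are made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy data (hypothetical): one missing height.
demo = pd.DataFrame({
    "height": [150.0, 160.0, np.nan, 180.0],
    "weight": [50.0, 60.0, 70.0, 80.0],
})

# Each missing value is filled with the average of that feature
# over the 2 nearest rows, measured on the features that are present.
knn_imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(knn_imputer.fit_transform(demo), columns=demo.columns)
print(imputed)
```

Because KNN relies on distances, features on very different scales may need scaling before imputation.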

### Scaling:

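* Normalize or scale numeric features for models that are sensitive to feature scale, such as k-nearest neighbors, SVMs, and regularized or gradient-descent-based linear models. Tree-based models are largely unaffected by scaling.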

```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])
```
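
When the data is split into training and test sets, a common practice is to fit the scaler on the training set only and reuse it on the test set, so that test-set statistics do not leak into training. A minimal sketch, with made-up arrays standing in for real data:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature train/test arrays.
X_train = np.array([[0.0], [5.0], [10.0]])
X_test = np.array([[2.5], [12.0]])

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns min=0 and max=10 from training data
X_test_scaled = scaler.transform(X_test)        # reuses the training min and max

print(X_test_scaled.ravel())
```

Test values outside the training range can fall outside [0, 1], as the second test value does here.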

## 3. 🔖 Label Encoding

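* Converts **categorical** data into numerical form.
* **LabelEncoder** assigns integers in sorted (alphabetical) order and is intended for target labels, so `low`, `medium`, `high` would become 1, 2, 0. For ordinal features, use **OrdinalEncoder** with an explicit category order.
* For nominal data, use one-hot encoding (**OneHotEncoder** or `pd.get_dummies`).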

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# Integer codes follow the sorted order of the labels, not any semantic order.
df['category_encoded'] = le.fit_transform(df['category'])
```
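
To keep a meaningful order for ordinal values, and to one-hot encode nominal ones, something like the following may work. The column names `size` and `color` and their values are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy data (hypothetical): 'size' is ordinal, 'color' is nominal.
demo = pd.DataFrame({
    "size": ["low", "high", "medium", "low"],
    "color": ["red", "blue", "red", "green"],
})

# Explicit category order: low -> 0, medium -> 1, high -> 2.
ordinal = OrdinalEncoder(categories=[["low", "medium", "high"]])
demo["size_encoded"] = ordinal.fit_transform(demo[["size"]]).ravel()

# One indicator column per color; no order is implied.
one_hot = pd.get_dummies(demo["color"], prefix="color")
encoded = pd.concat([demo, one_hot], axis=1)
print(encoded)
```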

## 4. 🌲 ML Models: RandomForest, GradientBoosting, XGBoost, CatBoost

### Overview:

| Model            | Characteristics                                                          |
| ---------------- | ------------------------------------------------------------------------ |
| Random Forest    | Robust, fast, handles non-linear data                                    |
| GradientBoosting | Accurate, but slower to train and prone to overfitting without tuning    |
| XGBoost          | Optimized GBM with regularization, very fast                             |
| CatBoost         | Handles categorical features natively when passed via `cat_features`     |

```python
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor
from catboost import CatBoostRegressor

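rf = RandomForestRegressor(n_estimators=100)
gb = GradientBoostingRegressor(learning_rate=0.1, n_estimators=100)
xgb = XGBRegressor(learning_rate=0.1, n_estimators=100)
cat = CatBoostRegressor(verbose=0)  # verbose=0 silences per-iteration training logs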
```
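
One way to compare such models is cross-validation on the same data. The sketch below uses only the two scikit-learn models so it runs without extra packages; the synthetic dataset from `make_regression` and the small `n_estimators` are assumptions chosen to keep the example quick.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic data stands in for a real feature matrix and target.
X, y = make_regression(n_samples=200, n_features=8, noise=15.0, random_state=0)

models = {
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "gradient_boosting": GradientBoostingRegressor(n_estimators=50, random_state=0),
}

# Mean R^2 over 3 folds for each model.
scores = {
    name: cross_val_score(model, X, y, cv=3, scoring="r2").mean()
    for name, model in models.items()
}
for name, score in scores.items():
    print(f"{name}: mean R^2 = {score:.3f}")
```

`XGBRegressor` and `CatBoostRegressor` follow the scikit-learn estimator interface, so they can usually be added to the same dictionary.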

## 5. 📏 Evaluation Metrics for Regression

| Metric                    | Use When You Want To…                                      |
| ------------------------- | ---------------------------------------------------------- |
| MAE (Mean Absolute Error) | Measure average error, less sensitive to outliers          |
| MSE (Mean Squared Error)  | Penalize large errors more heavily                         |
| RMSE (Root MSE)           | Penalize large errors like MSE, but in the target's units  |
| R² Score                  | See how much of the target's variance the model explains   |

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_true, y_pred)
```
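
To see what each metric computes, the same quantities can be worked out by hand with NumPy. The four target and prediction values below are made up for illustration.

```python
import numpy as np

# Hypothetical targets and predictions.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

errors = y_true - y_pred                 # [1, 0, -1, -2]
mae = np.mean(np.abs(errors))            # (1 + 0 + 1 + 2) / 4 = 1.0
mse = np.mean(errors ** 2)               # (1 + 0 + 1 + 4) / 4 = 1.5
rmse = np.sqrt(mse)                      # square root of MSE, back in target units

ss_res = np.sum(errors ** 2)                     # residual sum of squares = 6
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares = 20
r2 = 1 - ss_res / ss_tot                         # 1 - 6/20 = 0.7

print(mae, mse, rmse, r2)
```

The single error of 2 contributes 4 of the 6 squared-error units, which is why MSE and RMSE react more strongly to large errors than MAE does.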

## 6. 🔧 Hyperparameter Tuning for Boosting Models

* Use **GridSearchCV** or **RandomizedSearchCV** for tuning.
* Common parameters:

  * `learning_rate`
  * `n_estimators`
  * `max_depth`
  * `subsample`, `colsample_bytree` (for XGBoost)

```python
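from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

param_grid = {'max_depth': [3, 5, 7], 'learning_rate': [0.01, 0.1]}
# 3 x 2 = 6 parameter combinations, each evaluated with 3-fold cross-validation
grid = GridSearchCV(XGBRegressor(), param_grid, cv=3)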
grid.fit(X_train, y_train)
```
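
When the grid gets large, **RandomizedSearchCV** samples a fixed number of combinations instead of trying all of them. The sketch below swaps in scikit-learn's `GradientBoostingRegressor` so it runs without XGBoost; the synthetic data, the candidate values, and `n_iter=5` are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data stands in for X_train and y_train.
X, y = make_regression(n_samples=150, n_features=5, noise=10.0, random_state=0)

param_distributions = {
    "learning_rate": [0.01, 0.05, 0.1],
    "n_estimators": [50, 100],
    "max_depth": [2, 3, 4],
    "subsample": [0.7, 1.0],
}

# Try 5 random combinations out of the 36 possible, with 3-fold CV each.
search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_distributions,
    n_iter=5,
    cv=3,
    scoring="neg_root_mean_squared_error",
    random_state=0,
)
search.fit(X, y)

print(search.best_params_)
print(-search.best_score_)  # best mean RMSE across folds
```

scikit-learn scorers follow a "higher is better" convention, which is why RMSE is reported as a negative number and negated before printing.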


## 7. 🧠 Advanced Feature Engineering

### Custom Capacity Extraction Function

```python
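import numpy as np
import pandas as pd


def get_capacity(x: pd.Series, df: pd.DataFrame) -> pd.DataFrame:
    """
    Extracts 'X.XL' capacity (e.g., '2.0L') from a Series of text and adds it as a float column.

    Only the second whitespace-separated token is inspected; anything else yields NaN.
    """
    capacity = []
    for text in x:
        # Missing values (NaN/None) are not strings and cannot be split.
        tokens = text.split() if isinstance(text, str) else []
        if len(tokens) >= 2 and tokens[1].endswith('L'):
            try:
                capacity.append(float(tokens[1].rstrip('L')))
            except ValueError:
                # Token ends in 'L' but the rest is not a number.
                capacity.append(np.nan)
        else:
            capacity.append(np.nan)
    # Adds the column to df in place and also returns df.
    df['capacity'] = capacity
    return df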
```

* This function is helpful when features are **embedded in text**.
* Always clean and extract key features before modeling.
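
A vectorized alternative is to use a regular expression with pandas string methods. This is a sketch under assumptions: the helper name `extract_capacity`, the pattern, and the sample strings are hypothetical. Unlike `get_capacity`, it matches a capacity anywhere in the text rather than only in the second token.

```python
import pandas as pd


def extract_capacity(text: pd.Series) -> pd.Series:
    """Return the first number directly followed by 'L' (e.g. '2.0L') as a float."""
    pattern = r"(\d+(?:\.\d+)?)L\b"
    # expand=False returns a Series; rows without a match become NaN.
    return text.str.extract(pattern, expand=False).astype(float)


# Hypothetical engine descriptions.
engines = pd.Series(["172.0HP 1.6L 4 Cylinder", "3.5L V6", "Electric Motor", None])
capacity = extract_capacity(engines)
print(capacity)
```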