# **Machine Learning Task Report**

## **1. Preprocessing Steps and Rationale**

### Data Cleaning

Duplicate Removal: There were no duplicate entries in the data.

Handling Missing Values: Checked for null values and decided on appropriate handling mechanisms.
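The two checks above can be sketched with pandas. This is a minimal illustration on a hypothetical toy frame: only the `vomitoxin_ppb` column name comes from this report, the `band_*` columns and values are made up, and the median fill is just one possible handling policy, not necessarily the one used here.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset.
df = pd.DataFrame({
    "band_0": [0.41, 0.39, 0.41, 0.45],
    "band_1": [0.52, np.nan, 0.52, 0.50],
    "vomitoxin_ppb": [1100.0, 0.0, 1100.0, 250.0],
})

# Rows that are exact copies of an earlier row.
n_duplicates = int(df.duplicated().sum())

# Number of missing values in each column.
nulls_per_column = df.isnull().sum()

# Drop duplicates, then (as an assumed policy) fill gaps with column medians.
df_clean = df.drop_duplicates().reset_index(drop=True)
df_clean = df_clean.fillna(df_clean.median(numeric_only=True))

print(n_duplicates)
print(nulls_per_column)
```

The toy frame deliberately contains one duplicate row and one missing value so that both checks have something to report.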

### Outlier Detection and Handling

Identified Outliers: Used the Interquartile Range (IQR) method to detect extreme values in the vomitoxin_ppb column.
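As a sketch of the IQR rule, the snippet below flags values outside `[Q1 - k*IQR, Q3 + k*IQR]`. The multiplier `k=1.5` is the conventional default and is an assumption here, since the report does not state the value used. The sample values are invented for illustration.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask, True where a value lies outside
    [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)

# Invented target values with one extreme reading at the end.
vomitoxin_ppb = np.array([0, 120, 250, 300, 410, 500, 650, 131000], dtype=float)
mask = iqr_outlier_mask(vomitoxin_ppb)
print(vomitoxin_ppb[mask])
```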

Outlier Treatment: Compared different methods:

Median Replacement: Replaced outliers with the median value.

Log Transformation: Applied logarithmic scaling.

Winsorization: Capped extreme values at the 5th and 95th percentiles.
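The three treatments compared above can be sketched side by side. Some details are assumptions rather than the report's exact procedure: the median is taken over the non-outlier values, the log transform is written as `log1p` so that zeros are defined, and the sample values are invented.

```python
import numpy as np

def replace_with_median(values, mask):
    """Replace flagged entries with the median of the unflagged ones."""
    out = np.asarray(values, dtype=float).copy()
    out[mask] = np.median(out[~mask])
    return out

def log_transform(values):
    """log(1 + x): defined at zero, still undefined for x <= -1."""
    return np.log1p(np.asarray(values, dtype=float))

def winsorize(values, lower_pct=5, upper_pct=95):
    """Clip values to the given lower and upper percentiles."""
    values = np.asarray(values, dtype=float)
    lo, hi = np.percentile(values, [lower_pct, upper_pct])
    return np.clip(values, lo, hi)

values = np.array([0, 120, 250, 300, 410, 500, 650, 131000], dtype=float)

# IQR-based outlier mask with the conventional 1.5 multiplier (assumed).
q1, q3 = np.percentile(values, [25, 75])
mask = (values < q1 - 1.5 * (q3 - q1)) | (values > q3 + 1.5 * (q3 - q1))

median_replaced = replace_with_median(values, mask)
logged = log_transform(values)
winsorized = winsorize(values)
```

Median replacement changes only the flagged values, whereas the log transform and winsorization alter the shape of the whole distribution or both of its tails.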

### Normalization

Applied MinMax Scaling to normalize all features to a consistent range.
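Min-max scaling maps each feature to [0, 1] via `(x - min) / (max - min)`. The sketch below is a plain NumPy version with made-up numbers. Reusing the training minimum and maximum for new data is a common practice assumed here, not something the report specifies.

```python
import numpy as np

def min_max_scale(X, feature_min=None, feature_max=None):
    """Scale each column to [0, 1]; pass stored min/max to reuse a fit."""
    X = np.asarray(X, dtype=float)
    if feature_min is None:
        feature_min = X.min(axis=0)
    if feature_max is None:
        feature_max = X.max(axis=0)
    # Guard against constant columns, which would divide by zero.
    span = np.where(feature_max > feature_min, feature_max - feature_min, 1.0)
    return (X - feature_min) / span, feature_min, feature_max

X_train = np.array([[0.2, 10.0], [0.4, 20.0], [0.6, 40.0]])
X_new = np.array([[0.3, 25.0]])

X_train_scaled, f_min, f_max = min_max_scale(X_train)
X_new_scaled, _, _ = min_max_scale(X_new, f_min, f_max)
```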

### Exploratory Data Analysis (EDA)

Used box plots and histograms to visualize data distributions.

Plotted spectral band reflectance to observe feature importance.

Generated a heatmap to examine feature correlations.

## **2. Insights from Dimensionality Reduction**

Principal Component Analysis (PCA) was applied to reduce high-dimensional spectral data to 3 principal components.

Explained Variance: PCA components retained a significant proportion of variance, ensuring minimal information loss.

Visualization: A 3D scatter plot illustrated data separation post-reduction, demonstrating meaningful structure in the dataset.
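To make the explained-variance idea concrete, here is a minimal PCA sketch using NumPy's SVD on synthetic data. The data generator (3 latent factors mixed into 50 "bands" plus small noise) is purely illustrative and is not the report's dataset.

```python
import numpy as np

def pca_fit_transform(X, n_components=3):
    """Project X onto its top principal components and return the
    scores together with each kept component's explained-variance ratio."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value.
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained_variance = S ** 2 / (X.shape[0] - 1)
    ratio = explained_variance / explained_variance.sum()
    scores = X_centered @ Vt[:n_components].T
    return scores, ratio[:n_components]

# Synthetic stand-in: 200 samples, 50 correlated features.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 50))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 50))

scores, ratio = pca_fit_transform(X, n_components=3)
print("variance retained by 3 components:", ratio.sum())
```

The sum of the kept ratios is the fraction of total variance that survives the reduction. The three score columns are the coordinates one would plot in a 3D scatter.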

## **3. Model Selection, Training, and Evaluation**

### Model Choice: CNN

A CNN-based regression model was selected for its ability to extract local patterns across neighbouring bands in the spectral data.
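The intuition behind using convolution on spectra can be shown with a single hand-written 1D filter. This is only an illustration of the operation, not the model used in this report: the spectrum and the kernel are invented, and in a real CNN the kernel weights are learned during training.

```python
import numpy as np

def conv1d_valid(signal, kernel):
    """Slide the kernel along the signal (no padding, stride 1) and take
    a dot product at each position. This is cross-correlation, which is
    what deep learning libraries call convolution."""
    signal = np.asarray(signal, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    n_out = signal.size - kernel.size + 1
    return np.array([signal[i:i + kernel.size] @ kernel for i in range(n_out)])

# Invented reflectance curve with a dip at band index 3.
spectrum = np.array([0.5, 0.5, 0.5, 0.2, 0.5, 0.5, 0.5])

# Hand-picked second-difference kernel that responds to local dips.
kernel = np.array([1.0, -2.0, 1.0])

response = conv1d_valid(spectrum, kernel)
dip_center = int(np.argmax(response)) + kernel.size // 2
print(response, dip_center)
```

The filter output is near zero on flat regions and peaks where the window is centred on the dip, which is the kind of local band-to-band structure a convolutional layer can pick up.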

### Training

Data Splitting: 80% training, 20% testing to ensure sufficient training data while maintaining a robust evaluation set.
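A shuffled 80/20 split can be sketched in a few lines of NumPy. The function name, the seed, and the toy arrays are assumptions for illustration. The report does not say which utility was used for the split.

```python
import numpy as np

def train_test_split_80_20(X, y, test_fraction=0.2, seed=42):
    """Shuffle sample indices once, then hold out test_fraction of them."""
    X = np.asarray(X)
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    n_test = int(round(len(X) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Toy data: 50 samples with 2 features each, and y[i] labels row i.
X = np.arange(100).reshape(50, 2)
y = np.arange(50)

X_train, X_test, y_train, y_test = train_test_split_80_20(X, y)
print(X_train.shape, X_test.shape)
```

Fixing the seed keeps the split reproducible, so metric changes between runs reflect the model rather than a different test set.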

Loss Function: Mean Squared Error (MSE) was used as it is a standard regression loss that penalizes larger errors more.

Optimizer: Adam was chosen for its efficiency and adaptability in adjusting learning rates dynamically.
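How Adam adapts its step size can be seen in a small NumPy sketch that minimizes MSE for a one-parameter model. Everything here is a toy setup assumed for illustration: the data, the learning rate of 0.1, and the 200 steps. The beta and epsilon values are Adam's commonly used defaults.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar or array parameter (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad        # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# Toy regression: y = 2 * x, model is y_hat = w * x.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

def mse(w):
    return np.mean((w * x - y) ** 2)

def mse_grad(w):
    return np.mean(2.0 * (w * x - y) * x)

w, m, v = 0.0, 0.0, 0.0
initial_loss = mse(w)
w_history = []
for t in range(1, 201):
    w, m, v = adam_step(w, mse_grad(w), m, v, t)
    w_history.append(w)
final_loss = mse(w)
print(initial_loss, final_loss)
```

Because the update divides by the running gradient magnitude, the first step has size close to `lr` regardless of how large the raw gradient is. This normalization is what lets one learning rate work across parameters with very different gradient scales.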

### Evaluation

Mean Absolute Error (MAE) = 0.0261

MAE measures the average absolute difference between predicted and actual values.
A lower MAE means better accuracy.

Root Mean Squared Error (RMSE) = 0.0833

RMSE penalizes larger errors more than MAE because it squares the errors before averaging.
A lower RMSE indicates a better fit.

R² Score (R-squared) = 0.6700

R² measures how well the model explains the variance in the target variable.
Its maximum is 1, which means perfect prediction; 0 means the model explains nothing beyond predicting the mean, and on held-out data it can fall below 0 when the model does worse than the mean.
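The three metrics follow directly from their definitions. The sketch below computes them with NumPy on invented predictions that contain one large miss, to show how RMSE reacts more strongly than MAE. It also shows that R², computed as `1 - SS_res / SS_tot`, can fall below 0 when predictions are worse than simply predicting the mean.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    mae = np.mean(np.abs(err))
    rmse = np.sqrt(np.mean(err ** 2))
    ss_res = np.sum(err ** 2)                          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # total variation
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2

# Invented values: four exact predictions and one large error.
y_true = [0.0, 0.1, 0.2, 0.3, 0.4]
y_pred = [0.0, 0.1, 0.2, 0.3, 0.9]

mae, rmse, r2 = regression_metrics(y_true, y_pred)
print(mae, rmse, r2)
```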

Result:

The model performs reasonably well on MAE and RMSE, while its R² of 0.67 is moderate: roughly two thirds of the variance in the target is explained.

## **4. Key Findings and Suggestions for Improvement**

### Key Findings

PCA with 3 principal components retained almost 94% of the variance, so the dimensionality was reduced sharply while preserving most of the information in the data.

The CNN model successfully captured patterns in the spectral data.

The choice of outlier handling method significantly affected model performance. We selected median replacement because log transformation and winsorization suppressed important variation in the data.

### Suggestions for Improvement

**Alternative method**: An LSTM model could be tried as an alternative architecture to see whether it improves accuracy.

**Dataset**: Training on a larger dataset may improve R².

### Trade-offs and Challenges

Log transformation is effective in handling skewness but fails if data contains zero or negative values.

Winsorization prevents extreme influences but may distort natural data variation.

PCA reduces dimensionality but may discard some useful variance.

CNNs require substantial computational power compared to traditional regression models.