![CA5.png](attachment:CA5.png)

# Feature Engineering Techniques

## Table of Contents

- [Feature Engineering Techniques](#feature-engineering-techniques)
  - [Introduction](#introduction)
  - [Objectives](#objectives)
  - [Tasks](#tasks)
  - [Dataset Description](#dataset-description)
  - [Environment Setup](#environment-setup)
  - [Main Tasks](#main-tasks)
  - [Questions](#questions)
  - [References](#references)


## Introduction

In this assignment, we will apply feature engineering techniques to a football-related dataset in order to analyze the likelihood that a shot results in a goal. Following this analysis, we will explore regression and cross-validation in greater depth by implementing multivariate regression and k-fold cross-validation from scratch. These techniques will be applied to a preprocessed dataset related to cars. Lastly, we will compare the results obtained from our custom implementations with those from built-in Python libraries.

## Objectives

This assignment aims to:

- Implement multivariate regression and k-fold cross-validation from scratch.
- Compare the results obtained from custom implementations with those from built-in Python libraries.

## Tasks

1. Preprocessing

2. Multivariate Regression Implementation

3. Manual K-Fold Cross Validation Implementation

4. Comparison with Built-in Python Libraries

## Dataset Description

The dataset utilized for preprocessing, named "football.csv", encompasses football-related data, offering insights into various shot attributes, including timing, location (such as corner or penalty), and outcomes like "saved by the goalkeeper", "blocked by defenders", or "missed" shots.

For the implementation phase, a distinct preprocessed dataset, "cars.csv", focusing on automotive data, will be employed. This dataset contains comprehensive information about cars, and our objective is to use it to train a custom multivariate regression model, evaluated with our own k-fold cross-validation, for predicting the "Price in Thousands" and "Horsepower" attributes.

## Environment Setup

Let's begin with setting up the Python environment and importing the necessary libraries.

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split

In [None]:
FOOTBALL_CSV = "../data/football.csv"
CARS_CSV = "../data/cars.csv"

## Main Tasks

### Preprocessing

First, we will clean and analyze the dataset, summarizing its statistical attributes and visualizing its features. Our goal is to identify the beneficial features and justify our conclusions convincingly. Additionally, we should employ feature engineering techniques to refine the dataset, either by removing or replacing less desirable features. To gain a deeper understanding of feature engineering, we'll train an arbitrary but appropriate model and evaluate the outcomes before and after preprocessing. Furthermore, to assess the importance of each feature, we will use the mutual information method to create a `pandas` dataframe with two columns: one for features and the other for their importance. We'll then sort the dataframe in descending order of importance and display the results.
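As a rough illustration of the mutual information ranking described above, here is a minimal sketch on synthetic data. The column names (`distance`, `angle`, `noise`, `is_goal`) and the toy target are assumptions made for the example and are not taken from "football.csv"; `mutual_info_classif` is used because the toy target is binary.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

# Toy stand-in for the shot data; all column names here are hypothetical.
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "distance": rng.uniform(1, 40, n),
    "angle": rng.uniform(0, 90, n),
    "noise": rng.normal(size=n),
})
# Synthetic binary target: closer shots are more likely to be goals.
toy["is_goal"] = (rng.uniform(0, 40, n) > toy["distance"]).astype(int)

X = toy.drop(columns="is_goal")
y = toy["is_goal"]

# Estimate the mutual information between each feature and the target.
scores = mutual_info_classif(X, y, random_state=0)

# Two-column dataframe, sorted by importance in descending order.
importance = (
    pd.DataFrame({"feature": X.columns, "importance": scores})
    .sort_values("importance", ascending=False)
    .reset_index(drop=True)
)
print(importance)
```

On the real dataset, categorical columns would need to be encoded first, and discrete features can be flagged through the `discrete_features` parameter.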

Some notes to consider during preprocessing:

![football-pitch.jpg](attachment:football-pitch.jpg)

- The figure above shows a football pitch; use it to better interpret the dataset.
- We can apply various methods that we have learned throughout the course to fill missing values and manipulate categorical features in our data.
- We can consolidate similar features. For instance, we could treat "goal" and "own goal" as the same.
- We'll employ feature selection to exclude less significant features, thereby reducing the dimensionality and lowering computational costs.
- For a more thorough analysis, let's consider extracting new, more informative features from existing ones. For instance, we can calculate shot distance and angle using the following formulas and incorporate them into our analysis:

    - $\text{distance} = \sqrt{x^2 + y^2}$
    - $\text{angle} =$
      - $\operatorname{rad2deg}(\arctan(\theta))$, if $\arctan(\theta) \geq 0$
      - $\operatorname{rad2deg}(\arctan(\theta) + \pi)$, if $\arctan(\theta) < 0$
    - Where $\theta = \frac{7.32x}{x^2 + y^2 - (\frac{7.32}{2})^2}$ and $7.32$ is the width of the goal in meters.
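The distance and angle features can be sketched with NumPy as follows. This is only an illustration under stated assumptions: `x` is taken to be the distance from the goal line and `y` the lateral offset from the center of the goal, both in meters, and the negative branch is assumed to add $\pi$ to the arctangent so that the angle falls between 0° and 180°. The function names are hypothetical.

```python
import numpy as np

GOAL_WIDTH = 7.32  # meters; the constant that appears in the formula


def shot_distance(x, y):
    """Euclidean distance from the shot location to the origin (assumed goal center)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.sqrt(x**2 + y**2)


def shot_angle(x, y):
    """Angle in degrees under which the goal mouth is seen from (x, y)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    theta = GOAL_WIDTH * x / (x**2 + y**2 - (GOAL_WIDTH / 2) ** 2)
    angle = np.arctan(theta)
    # Assumption: shift negative values by pi so the result lies in [0, pi).
    angle = np.where(angle < 0, angle + np.pi, angle)
    return np.rad2deg(angle)


# Example: a shot from 11 m straight in front of the goal.
print(shot_distance(11, 0), shot_angle(11, 0))
```

One way to sanity-check the result is to compare it with the difference between the angles to the two goalposts, `arctan2(y + 3.66, x) - arctan2(y - 3.66, x)`, for a few points with `x > 0`.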

In [None]:
# code

### Multivariate Regression Implementation

We'll implement multivariate regression from scratch and use the gradient descent algorithm to update the weights. Then, we'll validate the regression model by providing a visual comparison between the predicted and actual values for "Price in Thousands" and "Horsepower". Additionally, we'll plot the accuracy across different random states for a more robust verification. Finally, we will display a learning curve to illustrate how training progresses.
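A minimal sketch of batch gradient descent for a linear model with several input features is shown below. It assumes standardized features, a mean squared error loss, and a single target at a time; the function name `fit_linear_gd`, the learning rate, and the synthetic data are illustrative choices, not part of the assignment.

```python
import numpy as np


def fit_linear_gd(X, y, lr=0.1, n_iters=2000):
    """Fit y ~ X @ w + b by batch gradient descent on mean squared error."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    losses = []  # loss per iteration, usable for a learning curve
    for _ in range(n_iters):
        error = X @ w + b - y
        losses.append(np.mean(error**2))
        # Gradients of the mean squared error with respect to w and b.
        w -= lr * (2 / n_samples) * (X.T @ error)
        b -= lr * (2 / n_samples) * error.sum()
    return w, b, losses


# Synthetic data with known coefficients (not the cars dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 4.0 + rng.normal(scale=0.1, size=200)

w, b, losses = fit_linear_gd(X, y)
print("weights:", w, "bias:", b)
print("first loss:", losses[0], "last loss:", losses[-1])
```

Plotting `losses` against the iteration number gives one possible learning curve. With unscaled features, the same learning rate may diverge, so scaling the inputs first is usually worthwhile.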

In [None]:
# code

### Manual K-Fold Cross Validation Implementation

We'll implement K-Fold cross-validation from scratch. As in the previous section, we will use the gradient descent algorithm to adjust the weights. Then, we will validate our custom K-Fold implementation using statistical metrics. Finally, we'll display a learning curve upon completion.
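The splitting logic can be sketched as follows. This is one possible approach, assuming shuffled folds of near-equal size and $R^2$ as the validation metric; all helper names (`k_fold_indices`, `fit_gd`, `predict`, `r2`) and the synthetic data are hypothetical.

```python
import numpy as np


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs for k shuffled, near-equal folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, folds[i]


def fit_gd(X, y, lr=0.1, n_iters=1000):
    """Batch gradient descent on mean squared error; the bias is the last weight."""
    Xb = np.column_stack([X, np.ones(len(X))])  # append a bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iters):
        w -= lr * (2 / len(Xb)) * (Xb.T @ (Xb @ w - y))
    return w


def predict(X, w):
    return np.column_stack([X, np.ones(len(X))]) @ w


def r2(y_true, y_pred):
    """Coefficient of determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot


# Synthetic data with known coefficients (not the cars dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 4.0 + rng.normal(scale=0.1, size=200)

scores = []
for train_idx, val_idx in k_fold_indices(len(X), k=5):
    w = fit_gd(X[train_idx], y[train_idx])
    scores.append(r2(y[val_idx], predict(X[val_idx], w)))

print("R^2 per fold:", np.round(scores, 3))
print("mean:", np.mean(scores), "std:", np.std(scores))
```

Each sample appears in exactly one validation fold, so the mean and standard deviation of the per-fold scores summarize how stable the model is across splits.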

In [None]:
# code

### Comparison with Built-in Python Libraries

Now, let's compare the results from our custom implementations in sections 2 and 3 with those obtained using built-in Python libraries, and report the findings.
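For the library side of the comparison, a sketch using scikit-learn might look like this. It assumes `LinearRegression`, `KFold`, and `cross_val_score` as the reference implementations and reuses the same kind of synthetic data as a stand-in for the cars dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data with known coefficients (not the cars dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 4.0 + rng.normal(scale=0.1, size=200)

# Reference fit on the full data.
reference = LinearRegression().fit(X, y)
print("coefficients:", reference.coef_, "intercept:", reference.intercept_)

# Reference 5-fold cross-validation, scored with R^2.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")
print("R^2 per fold:", np.round(scores, 3), "mean:", scores.mean())
```

`LinearRegression` solves the least-squares problem directly rather than by gradient descent, so small differences from a custom implementation are expected. The fold assignments will also differ unless both implementations use the same splits.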

In [None]:
# code

## Questions

1. **Describe your strategy for addressing challenges such as handling missing values and categorical features. Could you also elaborate on your feature selection metrics and explain the rationale behind them?**

2. **Why didn’t we use regression to predict whether a shot would result in a goal?**

3. **How would you go about verifying the accuracy of the given formula used to calculate the shot angle in the preprocessing section?**

4. **Discuss the advantages and disadvantages of k-fold cross-validation. Can you also explain other types of cross-validation methods that could address the limitations and issues associated with k-fold cross-validation?**

5. **What metrics did you use to evaluate your manual implementations of multivariate regression and k-fold cross-validation, and why did you choose them?**



## References

