[Home](../README.md)

### Feature Engineering

This Jupyter Notebook is a selection of data engineering processes I applied to my data before model training to maximise the performance of my machine learning model.

#### Feature Engineering Process
- Deriving new variables from existing ones
    - Encoding categorical features
    - Calculating new features from existing features
- Combining features/feature interactions
- Identifying the most relevant features for the model
- Transforming features
    - [Dividing data into categories](https://web.ma.utexas.edu/users/mks/statmistakes/dividingcontinuousintocategories.html)
    - Mathematical transformations (for example, logarithmic transformations). Logarithmic transformations are often applied to data that exhibit skewness or other irregularities, making the data easier to analyse, visualise, and interpret.
- Creating domain-specific features that incorporate knowledge from the specific domain to capture important characteristics of the data.
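
As an illustration of the logarithmic transformation mentioned above, here is a minimal sketch. The values are hypothetical and are not taken from this notebook's dataset; `np.log1p` is used here as one common choice because it handles zeros safely.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed values (illustrative only, not from the dataset)
values = pd.Series([1, 2, 2, 3, 4, 5, 8, 20, 150, 1000], name="example_value")

# log1p computes log(1 + x), so zero values do not produce -inf
log_values = np.log1p(values)

# Compare sample skewness before and after the transformation
skew_before = values.skew()
skew_after = log_values.skew()
print(f"skew before: {skew_before:.2f}, skew after: {skew_after:.2f}")
```

The skewness after the transformation should be noticeably closer to zero, which is the effect the transformation is used for.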

#### Load the required dependencies

In [1]:
# Import libraries
import pandas as pd
from sklearn.preprocessing import LabelEncoder


#### Store the data as a local variable

The data frame is a pandas object that structures your tabular data into labelled rows and columns. `pd.read_csv` loads the complete file into memory, so the data is ready for preprocessing.

In [2]:
data_frame = pd.read_csv("wrangled_data.csv")

#### Deriving new variables from existing ones

##### Encoding categorical variables

Data encoding converts textual data into a numerical format that algorithms can process. Encoding is needed because most machine learning algorithms work with numbers, not with text or categorical variables.


In [3]:
# Initialize the LabelEncoder
label_encoder = LabelEncoder()

# Fit and transform the 'rating' column
data_frame['rating'] = label_encoder.fit_transform(data_frame['rating'])

# Display the first few rows of the DataFrame to verify the encoding
print(data_frame.head())

                                             name  rating      genre  year  \
0                                     The Shining       6      Drama  1980   
1                                 The Blue Lagoon       6  Adventure  1980   
2  Star Wars: Episode V - The Empire Strikes Back       4     Action  1980   
3                                       Airplane!       4     Comedy  1980   
4                                      Caddyshack       6     Comedy  1980   

        released     score     votes         director  \
0  June 13, 1980  0.878378  0.386241  Stanley Kubrick   
1   July 2, 1980  0.527027  0.027069   Randal Kleiser   
2  June 20, 1980  0.918919  0.499993   Irvin Kershner   
3   July 2, 1980  0.783784  0.092070     Jim Abrahams   
4  July 25, 1980  0.729730  0.044986     Harold Ramis   

                    writer            star         country    budget  \
0             Stephen King  Jack Nicholson  United Kingdom  0.053363   
1  Henry De Vere Stacpoole  Brooke Shields  
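
To check which integer stands for which original label, the fitted encoder can be inspected. The sketch below uses a small hypothetical set of rating labels rather than the real column, so the codes it prints will not match the output above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of rating labels (illustrative values only)
ratings = pd.Series(["R", "PG", "R", "G", "PG"])

encoder = LabelEncoder()
codes = encoder.fit_transform(ratings)

# classes_ holds the original labels in sorted order; the index is the code
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the integer codes
restored = encoder.inverse_transform(codes)
```

Note that the codes follow the sorted order of the labels, not any meaningful ranking. The scikit-learn documentation describes `LabelEncoder` as intended for target labels, so for input features an alternative such as one-hot encoding may be worth considering.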

#### Combining features/feature interactions

While individual features can be powerful predictors, their interactions often carry even more information. Feature interaction engineering is the process of creating new features that represent the interaction between two or more features.

In this case, domain knowledge and data analysis suggest that BMI and AGE are risk multipliers: the greater the age and the greater the BMI, the greater the risk. From this we can derive a risk value based on the interaction of the two features.
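
A minimal sketch of such an interaction feature follows. The column names `AGE` and `BMI`, the sample rows, and the new column name `AGE_BMI_RISK` are assumptions for illustration; a simple product is only one possible way to combine the two features.

```python
import pandas as pd

# Hypothetical rows; 'AGE' and 'BMI' are assumed column names
df = pd.DataFrame({"AGE": [25, 40, 60], "BMI": [22.0, 27.5, 31.0]})

# Multiplicative interaction: the value grows when either AGE or BMI grows
df["AGE_BMI_RISK"] = df["AGE"] * df["BMI"]
print(df)
```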

#### Transforming Features

Filtering is like applying a WHERE clause in a database query. It is widely used and helps when you need to work on a specific subset of your data. For our use case, let us filter the data to include only rows where 'SEX' is 'Male'. No dedicated method call is needed for this; conditional (boolean) indexing fulfils the purpose.

In this case, domain knowledge and data analysis suggest that there is bimodality in the data and that males and females follow different trends.
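
The conditional indexing described above can be sketched as follows. The sample rows are hypothetical, and the `SEX` column with the values 'Male' and 'Female' is assumed from the description.

```python
import pandas as pd

# Hypothetical rows; the 'SEX' column and its values are assumed
df = pd.DataFrame({
    "SEX": ["Male", "Female", "Male", "Female"],
    "BMI": [24.1, 22.3, 29.8, 26.0],
})

# Build a boolean mask, then use it to select the matching rows
is_male = df["SEX"] == "Male"
male_df = df[is_male].copy()  # copy() avoids modifying a view of df later
print(male_df)
```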

#### Creating Domain-Specific Features

Domain knowledge is about understanding the domain or subject area of the data. 


#### Save the wrangled and engineered data to CSV

In [None]:
data_frame.to_csv('../2.3.Model_Training/2.3.1.model_ready_data.csv', index=False)