[Home](../README.md)

### Feature Engineering

This Jupyter Notebook collects the feature engineering steps I applied to my data before model training to maximise the performance of my machine learning model.

#### Feature Engineering Process
- Deriving new variables from existing ones
    - Encoding categorical features
    - Calculating new features from existing features
- Combining features/feature interactions
- Identifying the most relevant features for the model
- Transforming features
    - [Dividing data into categories](https://web.ma.utexas.edu/users/mks/statmistakes/dividingcontinuousintocategories.html)
    - Mathematical transformations, for example logarithmic transformations. These are often applied to data that exhibits skewness or other irregularities, making the results easier to analyse, visualise, and interpret.
- Creating domain-specific features that incorporate knowledge from the specific domain to capture important characteristics of the data
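
The logarithmic transformation mentioned in the list above can be sketched as follows. This is a minimal illustration, not a step from this notebook: the column name `raw_votes` and its values are made up, and `np.log1p` is chosen here so that a value of 0 stays finite.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed values, for illustration only (not from the dataset)
sample = pd.DataFrame({"raw_votes": [10, 100, 1_000, 10_000, 1_000_000]})

# log1p computes log(1 + x), so an input of 0 maps to 0 instead of -inf
sample["raw_votes_log"] = np.log1p(sample["raw_votes"])

# Compare skewness before and after the transformation
print(sample["raw_votes"].skew(), sample["raw_votes_log"].skew())
```

On this sample the skewness of the transformed column is smaller than that of the raw column, which is the effect the transformation is meant to achieve.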

#### Load the required dependencies

In [12]:
# Import frameworks
import pandas as pd
from sklearn.preprocessing import LabelEncoder


#### Store the data as a local variable

The data frame is a pandas object that holds tabular data in labelled rows and columns. Reading the CSV loads the complete dataset into memory, so it is ready for preprocessing.

In [13]:
data_frame = pd.read_csv("wrangled_data.csv")

#### Deriving new variables from existing ones

##### Encoding categorical variables

Data encoding converts textual or categorical data into a numerical format so that algorithms can process it. Encoding is needed because most machine learning algorithms work with numbers rather than with text or categorical variables.


##### 'rating' column encoding

This code encodes the 'rating' column by mapping movie ratings to a custom numerical order from least to most restrictive.

In [14]:
# to find the unique ratings in the data
data_frame['rating'].unique()

array(['R', 'PG', 'G', 'Not Rated', 'NC-17', 'Approved', 'TV-PG', 'PG-13',
       'Unrated', 'X', 'TV-MA', 'TV-14'], dtype=object)

Using a predefined list, each rating is assigned an index and mapped to its corresponding value. This approach preserves the ordinal relationship between ratings (e.g., G < PG < PG-13 < R), making it more meaningful for ML models.

In [15]:
# Define the custom rating order from least to most restrictive
rating_order = ['G', 'TV-PG', 'PG', 'Approved', 'Unrated', 'Not Rated', 'PG-13', 'TV-14', 'R', 'TV-MA', 'NC-17', 'X']

# Map each rating to its index in the custom order
data_frame['rating'] = data_frame['rating'].map({rating: idx for idx, rating in enumerate(rating_order)})

# Display the encoded 'rating' column to verify the encoding
print(data_frame[['rating']])

# Print the mapping of classes to labels
print(dict(enumerate(rating_order)))


      rating
0          8
1          8
2          2
3          2
4          8
...      ...
7569       5
7570       8
7571       6
7572       8
7573       5

[7574 rows x 1 columns]
{0: 'G', 1: 'TV-PG', 2: 'PG', 3: 'Approved', 4: 'Unrated', 5: 'Not Rated', 6: 'PG-13', 7: 'TV-14', 8: 'R', 9: 'TV-MA', 10: 'NC-17', 11: 'X'}
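
One thing to watch with a dictionary-based mapping is that any label missing from the list becomes NaN. The sketch below shows a possible check on a small made-up sample; the label 'TV-Y7' is hypothetical and only serves to trigger the case, and the names `rating_to_index` and `unmapped` are introduced here for illustration.

```python
import pandas as pd

rating_order = ['G', 'TV-PG', 'PG', 'Approved', 'Unrated', 'Not Rated',
                'PG-13', 'TV-14', 'R', 'TV-MA', 'NC-17', 'X']
rating_to_index = {rating: idx for idx, rating in enumerate(rating_order)}

# Small illustrative sample; 'TV-Y7' is a hypothetical label absent from rating_order
ratings = pd.Series(['R', 'PG', 'TV-Y7', 'G'])
encoded = ratings.map(rating_to_index)

# Labels that are not keys of the mapping come back as NaN
unmapped = ratings[encoded.isna()].unique()
print(encoded.tolist())
print(unmapped)
```

Running a check like this before overwriting the column makes it easier to spot ratings that need to be added to the order.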


##### 'genre' column encoding

This code one-hot encodes the 'genre' column, creating separate binary columns for each genre. If a movie belongs to a specific genre, the corresponding column is set to 1; otherwise, it's 0. 

In [16]:
# One-hot encode the 'genre' column
data_frame = pd.get_dummies(data_frame, columns=['genre'])

# Display the first few rows of the encoded dataframe
print(data_frame.head())

                                             name  rating  year  \
0                                     The Shining       8  1980   
1                                 The Blue Lagoon       8  1980   
2  Star Wars: Episode V - The Empire Strikes Back       2  1980   
3                                       Airplane!       2  1980   
4                                      Caddyshack       8  1980   

        released     score     votes         director  \
0  June 13, 1980  0.878378  0.386241  Stanley Kubrick   
1   July 2, 1980  0.527027  0.027069   Randal Kleiser   
2  June 20, 1980  0.918919  0.499993   Irvin Kershner   
3   July 2, 1980  0.783784  0.092070     Jim Abrahams   
4  July 25, 1980  0.729730  0.044986     Harold Ramis   

                    writer            star         country  ...  \
0             Stephen King  Jack Nicholson  United Kingdom  ...   
1  Henry De Vere Stacpoole  Brooke Shields   United States  ...   
2           Leigh Brackett     Mark Hamill   United S
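
Because the printed output is wide, the effect of one-hot encoding is easier to see on a toy frame. The sketch below uses made-up rows. Passing `dtype=int` is an assumption added here to force 0/1 values, since recent pandas versions return boolean columns by default.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "genre": ["Drama", "Comedy", "Drama"],
})

# Each distinct genre becomes its own indicator column, prefixed with 'genre_'
encoded = pd.get_dummies(toy, columns=["genre"], dtype=int)
print(encoded)
```

The original 'genre' column is replaced by 'genre_Comedy' and 'genre_Drama', with a 1 in the column matching each row's genre.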

##### 'company' & 'country' column encoding

This code applies frequency encoding to the 'company' and 'country' columns by calculating how often each unique value appears in the dataset and mapping those frequencies back to the original data. 

In [17]:
# Calculate the frequency of each unique value in the 'company' column
company_freq = data_frame['company'].value_counts() / len(data_frame)
# Map the frequencies back to the original dataframe
data_frame['company'] = data_frame['company'].map(company_freq)

# Calculate the frequency of each unique value in the 'country' column
country_freq = data_frame['country'].value_counts() / len(data_frame)
# Map the frequencies back to the original dataframe
data_frame['country'] = data_frame['country'].map(country_freq)

# Print the results
print(data_frame[['company', 'country']])


       company   country
0     0.043966  0.106681
1     0.043834  0.719039
2     0.001320  0.719039
3     0.042250  0.719039
4     0.007922  0.719039
...        ...       ...
7569  0.000132  0.004621
7570  0.000132  0.719039
7571  0.000132  0.719039
7572  0.000132  0.719039
7573  0.000264  0.719039

[7574 rows x 2 columns]
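
The same idea on a toy column, as a minimal sketch with made-up rows. It uses `value_counts(normalize=True)`, which gives the same result as dividing by the row count when the column has no missing values. With missing values the two differ, because `normalize=True` divides by the number of non-null entries.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "country": ["United States", "United States", "United Kingdom", "France"],
})

# Relative frequency of each category
country_freq = toy["country"].value_counts(normalize=True)

# Replace each category with its relative frequency
toy["country_freq"] = toy["country"].map(country_freq)
print(toy)
```

Here 'United States' maps to 0.5 and the other two countries map to 0.25 each. Categories that occur equally often receive the same value, so the model can no longer tell them apart.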


##### Convert 'released' to a date format

This code converts the 'released' column to datetime format, ensuring consistency in date representation. Rows with invalid or missing dates are removed to maintain data integrity, preventing errors in time-based feature engineering.

In [18]:
# Convert the 'released' column to datetime format
data_frame['released'] = pd.to_datetime(data_frame['released'], errors='coerce')

# Remove rows with null values in the 'released' column
data_frame = data_frame.dropna(subset=['released'])

# Display the 'released' column to verify the conversion
print(data_frame[['released']])

       released
0    1980-06-13
1    1980-07-02
2    1980-06-20
3    1980-07-02
4    1980-07-25
...         ...
7569 2020-08-28
7570 2020-04-17
7571 2020-06-03
7572 2020-02-07
7573 2020-03-03

[7519 rows x 1 columns]
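
How `errors='coerce'` and `dropna` interact can be seen on a small made-up sample. This is only a sketch; the invalid string and the missing value are invented to trigger the behaviour.

```python
import pandas as pd

# Toy data for illustration only: one valid date, one invalid string, one missing value
toy = pd.DataFrame({"released": ["June 13, 1980", "not a date", None]})

# Unparseable or missing entries become NaT instead of raising an error
toy["released"] = pd.to_datetime(toy["released"], errors="coerce")
print(toy["released"].isna().sum())

# Dropping NaT rows leaves only the rows with a valid date
toy = toy.dropna(subset=["released"])
print(toy)
```

Counting the NaT values before dropping them shows how many rows the conversion will cost.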


##### 'director' + 'star' column encoding

In [None]:
# May be implemented in the future

#### Combining features/feature interactions

While individual features can be powerful predictors, their interactions often carry even more information. Feature interaction engineering is the process of creating new features that represent the interaction between two or more features.
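
As a minimal sketch of what an interaction feature can look like, the example below multiplies two numeric columns. The 'score' and 'votes' column names exist in this dataset, but the values are made up and the choice of a product, like the name `score_x_votes`, is an assumption for illustration rather than a step from this notebook.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "score": [0.88, 0.53, 0.92],
    "votes": [0.39, 0.03, 0.50],
})

# Product interaction: large only when both score and votes are large
toy["score_x_votes"] = toy["score"] * toy["votes"]
print(toy)
```

Whether an interaction like this helps depends on the model. Tree-based models can often learn such combinations on their own, while linear models cannot.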

#### Transforming Features

Filtering works like the WHERE clause in a database query: it keeps only the rows that meet a condition. It is widely used and helps when you need to work on a specific subset of your data.
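
A minimal sketch of filtering in pandas, using made-up rows. The thresholds and the variable names `recent` and `recent_high` are assumptions for illustration.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "year": [1980, 1995, 2010],
    "score": [0.9, 0.4, 0.7],
})

# Boolean mask: comparable to SELECT * FROM toy WHERE year >= 1990
recent = toy[toy["year"] >= 1990]

# DataFrame.query accepts the condition as a string, closer to SQL syntax
recent_high = toy.query("year >= 1990 and score > 0.5")

print(recent)
print(recent_high)
```

Both forms return a new data frame and leave the original untouched.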

#### Creating Domain-Specific Features

Domain-specific features rely on domain knowledge: an understanding of the subject area the data describes, here films and their releases.


##### Time Based Features

In [20]:
# Extracting month and quarter from release date
data_frame['release_month'] = pd.to_datetime(data_frame['released']).dt.month
data_frame['release_quarter'] = pd.to_datetime(data_frame['released']).dt.quarter

print(data_frame[['release_month', 'release_quarter']])

      release_month  release_quarter
1466              7                3
4717             11                4
4260              1                1
4055              2                1
3912              6                2
...             ...              ...
7116              3                1
4359              8                3
5402              8                3
2739             12                4
3584              9                3

[7519 rows x 2 columns]


##### Sequel

In [21]:
# Binary feature for sequels
data_frame['is_sequel'] = data_frame['name'].str.contains(r'\b(?:2|3|4|II|III|IV)\b', regex=True).astype(int)
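
To see what this pattern does and does not catch, the sketch below applies the same regular expression to a few example titles. The titles are chosen for illustration, and `na=False` is an addition here so that missing names count as non-sequels instead of causing the integer conversion to fail.

```python
import pandas as pd

# Example titles for illustration only
titles = pd.Series([
    "Superman II",        # sequel, matched by 'II'
    "Rocky III",          # sequel, matched by 'III'
    "The Shining",        # no number, not matched
    "Apollo 13",          # '13' is not a standalone 2, 3 or 4, not matched
    "3 Men and a Baby",   # not a sequel, but the standalone '3' matches
    "Rocky V",            # sequel, but 'V' is not in the pattern
    None,                 # missing title
])

pattern = r'\b(?:2|3|4|II|III|IV)\b'

# \b requires the number or numeral to stand alone as a word
is_sequel = titles.str.contains(pattern, regex=True, na=False).astype(int)
print(is_sequel.tolist())
```

The feature is therefore a heuristic: it produces false positives for titles that start with a number and misses sequels numbered beyond IV.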

#### Save the wrangled and engineered data to CSV

In [22]:
data_frame.to_csv('../Model_Training/model_ready_data.csv', index=False)