[Home](../README.md)

### Feature Engineering

This Jupyter Notebook is a selection of data engineering processes I applied to my data before model training to maximise the performance of my machine learning model.

#### Feature Engineering Process
- Deriving new variables from existing ones
    - Encoding categorical features
    - Calculating new features from existing features
- Combining features/feature interactions
- Identifying the most relevant features for the model
- Transforming features
  - [Dividing data into categories](https://web.ma.utexas.edu/users/mks/statmistakes/dividingcontinuousintocategories.html)
  - Mathematical transformations, for example logarithmic transformations. These are often applied to skewed data to compress large values, which makes the data easier to analyze, visualize, and interpret.
- Creating domain-specific features that incorporate knowledge from the subject area to capture important characteristics of the data
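
The two transformations listed above can be sketched as follows. This is a minimal illustration, assuming a hypothetical right-skewed `budget` column; the values are made up and are not taken from the notebook's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical, right-skewed values (not from the notebook's dataset)
toy = pd.DataFrame({"budget": [1_000, 5_000, 20_000, 1_000_000, 50_000_000]})

# log1p computes log(1 + x), so a zero value stays finite
toy["log_budget"] = np.log1p(toy["budget"])

# Divide the continuous column into three equal-width bins on the log scale
toy["budget_band"] = pd.cut(toy["log_budget"], bins=3, labels=["low", "medium", "high"])

print(toy)
```

Binning on the log scale is only one possible choice here; quantile-based bins (`pd.qcut`) are another option when equal-sized groups matter more than equal-width ranges.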

#### Load the required dependencies

In [1]:
# Import frameworks
import pandas as pd
from sklearn.preprocessing import LabelEncoder


#### Store the data as a local variable

A pandas DataFrame holds tabular data in labelled rows and columns. `pd.read_csv` loads the complete file into memory, so the data is ready for preprocessing once the cell below has run.

In [2]:
data_frame = pd.read_csv("wrangled_data.csv")

#### Deriving new variables from existing ones

##### Encoding categorical variables

Data encoding converts text or categorical values into numbers so that they can be used as model input. Encoding is needed because most machine learning algorithms operate on numerical data and cannot process text labels directly.


##### 'rating' column encoding

In [3]:
data_frame['rating'].unique()

array(['R', 'PG', 'G', 'Not Rated', 'NC-17', 'Approved', 'TV-PG', 'PG-13',
       'Unrated', 'X', 'TV-MA', 'TV-14'], dtype=object)

In [4]:
# Define the custom class order from least to most violent
rating_order = ['G', 'TV-PG', 'PG', 'Approved', 'Unrated', 'Not Rated', 'PG-13', 'TV-14', 'R', 'TV-MA', 'NC-17', 'X']

# Map each rating to its position in the custom order (ordinal encoding).
# A rating that is missing from rating_order would become NaN.
data_frame['rating'] = data_frame['rating'].map({rating: idx for idx, rating in enumerate(rating_order)})

# Display the encoded column to verify the encoding
print(data_frame[['rating']])

# Print the mapping of classes to labels
print(dict(enumerate(rating_order)))


      rating
0          8
1          8
2          2
3          2
4          8
...      ...
7569       5
7570       8
7571       6
7572       8
7573       5

[7574 rows x 1 columns]
{0: 'G', 1: 'TV-PG', 2: 'PG', 3: 'Approved', 4: 'Unrated', 5: 'Not Rated', 6: 'PG-13', 7: 'TV-14', 8: 'R', 9: 'TV-MA', 10: 'NC-17', 11: 'X'}
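
Because `.map` returns NaN for any label that is absent from the mapping, it can be worth checking for unmapped values after an ordinal encoding like the one above. The sketch below assumes a hypothetical toy column and a shortened order list; `'Passed'` is an invented label that is deliberately left out of the order.

```python
import pandas as pd

# Hypothetical sample; the order list is shortened for illustration
toy = pd.DataFrame({"rating": ["G", "PG", "R", "PG", "Passed"]})
order = ["G", "PG", "R"]

# Build the label-to-position mapping and apply it
mapping = {label: idx for idx, label in enumerate(order)}
toy["rating_encoded"] = toy["rating"].map(mapping)

# Labels missing from the mapping ('Passed' here) show up as NaN
unmapped = toy.loc[toy["rating_encoded"].isna(), "rating"].unique()
print(unmapped)
```

An empty result from this check suggests every label in the column was covered by the custom order.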


##### 'genre' column encoding

In [5]:
# One-hot encode the 'genre' column
data_frame = pd.get_dummies(data_frame, columns=['genre'])

# Display the first few rows of the encoded dataframe
print(data_frame.head())

                                             name  rating  year  \
0                                     The Shining       8  1980   
1                                 The Blue Lagoon       8  1980   
2  Star Wars: Episode V - The Empire Strikes Back       2  1980   
3                                       Airplane!       2  1980   
4                                      Caddyshack       8  1980   

        released     score     votes         director  \
0  June 13, 1980  0.878378  0.386241  Stanley Kubrick   
1   July 2, 1980  0.527027  0.027069   Randal Kleiser   
2  June 20, 1980  0.918919  0.499993   Irvin Kershner   
3   July 2, 1980  0.783784  0.092070     Jim Abrahams   
4  July 25, 1980  0.729730  0.044986     Harold Ramis   

                    writer            star         country  ...  \
0             Stephen King  Jack Nicholson  United Kingdom  ...   
1  Henry De Vere Stacpoole  Brooke Shields   United States  ...   
2           Leigh Brackett     Mark Hamill   United S

##### 'company' & 'country' column encoding

In [6]:
# Calculate the frequency of each unique value in the 'company' column
company_freq = data_frame['company'].value_counts() / len(data_frame)
# Map the frequencies back to the original dataframe
data_frame['company'] = data_frame['company'].map(company_freq)

# Calculate the frequency of each unique value in the 'country' column
country_freq = data_frame['country'].value_counts() / len(data_frame)
# Map the frequencies back to the original dataframe
data_frame['country'] = data_frame['country'].map(country_freq)

# Print the results
print(data_frame[['company', 'country']])


       company   country
0     0.043966  0.106681
1     0.043834  0.719039
2     0.001320  0.719039
3     0.042250  0.719039
4     0.007922  0.719039
...        ...       ...
7569  0.000132  0.004621
7570  0.000132  0.719039
7571  0.000132  0.719039
7572  0.000132  0.719039
7573  0.000264  0.719039

[7574 rows x 2 columns]
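
The frequency encoding above can also be written with `value_counts(normalize=True)`. This is a minimal sketch on a hypothetical toy column; note that `value_counts` ignores missing values by default, so it matches the `value_counts() / len(data_frame)` form only when the column has no NaN entries.

```python
import pandas as pd

# Hypothetical toy column (not from the notebook's dataset)
toy = pd.DataFrame({"country": ["US", "US", "UK", "US", "FR"]})

# Relative frequency of each category, as a Series indexed by category
freq = toy["country"].value_counts(normalize=True)

# Replace each category with its relative frequency
toy["country_freq"] = toy["country"].map(freq)

print(toy)
```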


##### Convert 'released' to a date format

In [None]:
# to be worked on

#### Combining features/feature interactions

While individual features can be powerful predictors, their interactions often carry even more information. Feature interaction engineering is the process of creating new features that represent the interaction between two or more features.

##### Budget x Revenue (BxR)

In [7]:
# Calculate the revenue-to-budget ratio (gross divided by budget)
data_frame['BxR'] = data_frame['gross'] / (data_frame['budget'] + 1)  # +1 avoids division by zero

# Display the new column
print(data_frame[['BxR']])

           BxR
0     0.015670
1     0.020412
2     0.179987
3     0.029025
4     0.013763
...        ...
7569  0.132324
7570  0.000001
7571  0.000132
7572  0.027744
7573  0.000004

[7574 rows x 1 columns]
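
Adding 1 to the denominator avoids division by zero, but it also shifts every ratio, and the shift is large when the values have been scaled to a small range. One alternative, sketched here on hypothetical toy values, is to treat a zero budget as missing so that the ratio is left undefined for those rows. The column name `gross_per_budget` is an assumption made for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values (not from the notebook's dataset)
toy = pd.DataFrame({"gross": [50.0, 10.0, 8.0], "budget": [10.0, 0.0, 4.0]})

# Replace zero budgets with NaN so the division yields NaN instead of inf
toy["gross_per_budget"] = toy["gross"] / toy["budget"].replace(0, np.nan)

print(toy)
```

Rows with an undefined ratio then need to be imputed or dropped before training, depending on the model.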


##### Budget x Votes

In [8]:
# Calculate the budget-to-vote ratio
data_frame['budget_per_vote'] = data_frame['budget'] / (data_frame['votes'] + 1)  # +1 avoids division by zero

# Display the new column
print(data_frame[['budget_per_vote']])

      budget_per_vote
0            0.038495
1            0.012299
2            0.033703
3            0.008995
4            0.016120
...               ...
7569         0.224370
7570         0.100221
7571         0.100309
7572         0.000006
7573         0.100302

[7574 rows x 1 columns]


##### Director & Star Popularity

In [9]:
# Compute the total gross revenue across all films for each director/star
data_frame['director_popularity'] = data_frame.groupby('director')['gross'].transform('sum')
data_frame['star_popularity'] = data_frame.groupby('star')['gross'].transform('sum')

# Display the new columns
print(data_frame[['director_popularity', 'star_popularity']])

      director_popularity  star_popularity
0                0.032788         0.536282
1                0.069072         0.032608
2                0.224599         0.355951
3                0.194719         0.045201
4                0.207334         0.226050
...                   ...              ...
7569             0.162059         0.162059
7570             0.000001         0.000001
7571             0.000145         0.000145
7572             0.027744         0.027744
7573             0.000005         0.000005

[7574 rows x 2 columns]


#### Transforming Features

Filtering is like applying a `WHERE` clause in a database query. It is widely used and helps when you need to work on a specific subset of your data. For our use case, let us filter the data to include only the rows where 'SEX' is 'Male'. No dedicated method call is needed for this; conditional indexing is enough.

In this case, some domain knowledge and data analysis have informed you that there is bimodality in the data and that males and females follow different trends.
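
The conditional indexing described above can be sketched as follows. The DataFrame, the `height_cm` column, and its values are hypothetical and exist only to illustrate the filter on 'SEX'.

```python
import pandas as pd

# Hypothetical toy data with a 'SEX' column
toy = pd.DataFrame({
    "SEX": ["Male", "Female", "Male"],
    "height_cm": [180, 165, 175],
})

# The boolean mask is True for the rows to keep
is_male = toy["SEX"] == "Male"

# Indexing with the mask returns only the matching rows
males = toy[is_male]

print(males)
```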

#### Creating Domain-Specific Features

Domain knowledge is an understanding of the subject area that the data describes. It guides the choice of derived features that are likely to be informative.


##### Time Based Feature

In [None]:
# to be worked on

ValueError: time data "1981" doesn't match format "%B %d, %Y", at position 108. You might want to try:
    - passing `format` if your strings have a consistent format;
    - passing `format='ISO8601'` if your strings are all ISO8601 but not necessarily in exactly the same format;
    - passing `format='mixed'`, and the format will be inferred for each element individually. You might want to use `dayfirst` alongside this.
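
The error above indicates that some 'released' values contain only a year. One possible way to handle this, sketched here on hypothetical sample strings, is to parse in two passes with `errors="coerce"` and combine the results. Treating a year-only value as 1 January of that year is an assumption of this sketch, not something the notebook specifies.

```python
import pandas as pd

# Hypothetical sample mixing full dates and a year-only value
released = pd.Series(["June 13, 1980", "July 2, 1980", "1981"])

# First pass: full dates; strings that do not match the format become NaT
full = pd.to_datetime(released, format="%B %d, %Y", errors="coerce")

# Second pass: year-only strings, which parse to 1 January of that year
year_only = pd.to_datetime(released, format="%Y", errors="coerce")

# Prefer the full date and fall back to the year-only result
parsed = full.fillna(year_only)

# Example time-based feature: the release month
release_month = parsed.dt.month

print(parsed)
print(release_month)
```

Any value that matches neither format stays as NaT, which makes the remaining problem rows easy to find with `parsed.isna()`.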

#### Save the wrangled and engineered data to CSV

In [10]:
data_frame.to_csv('../Model_Training/model_ready_data2.csv', index=False)