[Home](../README.md)

### Feature Engineering

This Jupyter Notebook is a selection of feature engineering processes I applied to my data before model training to maximise the performance of my machine learning model.

#### Feature Engineering Process
- Deriving new variables from existing ones
    - Encoding categorical features
    - Calculating new features from existing features
- Combining features/feature interactions
- Identifying the most relevant features for the model
- Transforming features
  - [Dividing data into categories](https://web.ma.utexas.edu/users/mks/statmistakes/dividingcontinuousintocategories.html)
  - Mathematical transformations, for example logarithmic transformations. These are often applied to data that is skewed or otherwise irregular: they compress large values, which makes the data easier to analyze, visualize, and interpret.
- Creating domain-specific features that incorporate knowledge of the specific domain to capture important characteristics of the data.
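
As a rough illustration of the logarithmic transformation mentioned above, the sketch below applies `np.log1p` to a small, made-up, right-skewed series. The values and the name `gross_raw` are assumptions for illustration only and do not come from the dataset used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed values (for example raw revenue figures)
raw = pd.Series([1_000, 5_000, 20_000, 150_000, 3_000_000], name="gross_raw")

# log1p computes log(1 + x), which stays defined when a value is exactly 0
log_scaled = np.log1p(raw)

# The transform compresses large values, so the skewness is reduced
print(raw.skew(), log_scaled.skew())

# expm1 inverts log1p if the original scale is needed again
restored = np.expm1(log_scaled)
```

`log1p` is used here instead of `log` only so that zero values do not produce `-inf`; a plain logarithm works as well when all values are strictly positive.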

#### Load the required dependencies

In [1]:
# Import frameworks
import pandas as pd
from sklearn.preprocessing import LabelEncoder


####  Store the data as a local variable

A pandas DataFrame holds tabular data in labelled rows and columns. `pd.read_csv` loads the complete file into memory, so the data is ready for preprocessing.

In [2]:
data_frame = pd.read_csv("wrangled_data.csv")

####  Deriving new variables from existing ones

##### Encoding categorical variables

Encoding converts categorical or textual data into a numerical format that algorithms can take as input. It is needed because most machine learning algorithms work with numbers and not with text or categorical variables.
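
To make the idea concrete, here is a minimal sketch on a made-up column that contrasts two common encodings: an ordinal mapping (when the categories have a natural order) and one-hot encoding (when they do not). The `toy` frame and the `size` column are hypothetical and are not part of the dataset.

```python
import pandas as pd

# Hypothetical categorical column
toy = pd.DataFrame({"size": ["small", "large", "medium", "small"]})

# Ordinal encoding: each category is mapped to its rank in a chosen order
size_order = {"small": 0, "medium": 1, "large": 2}
toy["size_ordinal"] = toy["size"].map(size_order)

# One-hot encoding: one indicator column per category, no order implied
one_hot = pd.get_dummies(toy["size"], prefix="size")

print(toy)
print(one_hot)
```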


##### 'rating' column encoding

In [3]:
data_frame['rating'].unique()

array(['R', 'PG', 'G', 'Not Rated', 'NC-17', 'Approved', 'TV-PG', 'PG-13',
       'Unrated', 'X', 'TV-MA', 'TV-14'], dtype=object)

In [4]:
# Define the custom class order from least to most violent
rating_order = ['G', 'TV-PG', 'PG', 'Approved', 'Unrated', 'Not Rated', 'PG-13', 'TV-14', 'R', 'TV-MA', 'NC-17', 'X']

# Ordinal-encode 'rating': map each rating to its position in the custom order
# (LabelEncoder is not used here because it would assign labels alphabetically)
rating_to_idx = {rating: idx for idx, rating in enumerate(rating_order)}
data_frame['rating'] = data_frame['rating'].map(rating_to_idx)

# Display the 'rating' column to verify the encoding
print(data_frame[['rating']])

# Print the mapping of classes to labels
print(dict(enumerate(rating_order)))


      rating
0          8
1          8
2          2
3          2
4          8
...      ...
7569       5
7570       8
7571       6
7572       8
7573       5

[7574 rows x 1 columns]
{0: 'G', 1: 'TV-PG', 2: 'PG', 3: 'Approved', 4: 'Unrated', 5: 'Not Rated', 6: 'PG-13', 7: 'TV-14', 8: 'R', 9: 'TV-MA', 10: 'NC-17', 11: 'X'}


##### 'genre' column encoding

In [5]:
# One-hot encode the 'genre' column
data_frame = pd.get_dummies(data_frame, columns=['genre'])

# Display the first few rows of the encoded dataframe
print(data_frame.head())

                                             name  rating  year  \
0                                     The Shining       8  1980   
1                                 The Blue Lagoon       8  1980   
2  Star Wars: Episode V - The Empire Strikes Back       2  1980   
3                                       Airplane!       2  1980   
4                                      Caddyshack       8  1980   

        released     score     votes         director  \
0  June 13, 1980  0.878378  0.386241  Stanley Kubrick   
1   July 2, 1980  0.527027  0.027069   Randal Kleiser   
2  June 20, 1980  0.918919  0.499993   Irvin Kershner   
3   July 2, 1980  0.783784  0.092070     Jim Abrahams   
4  July 25, 1980  0.729730  0.044986     Harold Ramis   

                    writer            star         country  ...  \
0             Stephen King  Jack Nicholson  United Kingdom  ...   
1  Henry De Vere Stacpoole  Brooke Shields   United States  ...   
2           Leigh Brackett     Mark Hamill   United S

##### 'company' & 'country' column encoding

In [6]:
# Calculate the frequency of each unique value in the 'company' column
company_freq = data_frame['company'].value_counts() / len(data_frame)
# Map the frequencies back to the original dataframe
data_frame['company'] = data_frame['company'].map(company_freq)

# Calculate the frequency of each unique value in the 'country' column
country_freq = data_frame['country'].value_counts() / len(data_frame)
# Map the frequencies back to the original dataframe
data_frame['country'] = data_frame['country'].map(country_freq)

#Print the results
print(data_frame[['company', 'country']])


       company   country
0     0.043966  0.106681
1     0.043834  0.719039
2     0.001320  0.719039
3     0.042250  0.719039
4     0.007922  0.719039
...        ...       ...
7569  0.000132  0.004621
7570  0.000132  0.719039
7571  0.000132  0.719039
7572  0.000132  0.719039
7573  0.000264  0.719039

[7574 rows x 2 columns]
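
The frequency encoding above can also be written with `value_counts(normalize=True)`. The sketch below uses a made-up `country` column to show that both forms agree when the column has no missing values; with missing values they differ, because `value_counts` drops them by default while `len` counts every row.

```python
import pandas as pd

# Hypothetical column without missing values
toy = pd.DataFrame({"country": ["US", "US", "UK", "US", "FR"]})

# Form used in the cell above: counts divided by the number of rows
freq_manual = toy["country"].value_counts() / len(toy)

# Equivalent form: let pandas normalise the counts
freq_normalized = toy["country"].value_counts(normalize=True)

# Map the frequencies back onto the rows
toy["country_freq"] = toy["country"].map(freq_normalized)
print(toy)
```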


##### Convert 'released' to a date format

In [7]:
# Convert the 'released' column to datetime format
data_frame['released'] = pd.to_datetime(data_frame['released'], errors='coerce')

# Display the first few rows of the DataFrame to verify the conversion
print(data_frame[['released']])

       released
0    1980-06-13
1    1980-07-02
2    1980-06-20
3    1980-07-02
4    1980-07-25
...         ...
7569 2020-08-28
7570 2020-04-17
7571 2020-06-03
7572 2020-02-07
7573 2020-03-03

[7574 rows x 1 columns]


#### Combining features/feature interactions

While individual features can be powerful predictors, their interactions often carry even more information. Feature interaction engineering is the process of creating new features that represent the interaction between two or more features.
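
A minimal sketch of two typical interaction features, a ratio and a product, on made-up numbers. The `toy` frame and the derived column names are hypothetical. Zero denominators are turned into `NaN` here, which is one possible way to avoid division by zero; it is not the approach taken in the cells below.

```python
import numpy as np
import pandas as pd

# Hypothetical budget and gross values
toy = pd.DataFrame({"budget": [10.0, 0.0, 40.0], "gross": [30.0, 5.0, 20.0]})

# Ratio interaction: a zero denominator becomes NaN instead of inf
toy["gross_per_budget"] = toy["gross"] / toy["budget"].replace(0, np.nan)

# Product interaction
toy["budget_x_gross"] = toy["budget"] * toy["gross"]

print(toy)
```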

##### Budget x Revenue

In [8]:
# Calculate the revenue-to-budget ratio (gross divided by budget)
# Adding 1 avoids division by zero; note that when budget values are below 1,
# the added 1 dominates the denominator
data_frame['budget_revenue'] = data_frame['gross'] / (data_frame['budget'] + 1)

# Display the new column
print(data_frame[['budget_revenue']])

      budget_revenue
0           0.015670
1           0.020412
2           0.179987
3           0.029025
4           0.013763
...              ...
7569        0.132324
7570        0.000001
7571        0.000132
7572        0.027744
7573        0.000004

[7574 rows x 1 columns]


##### Budget x Votes

In [9]:
# Calculate the budget-to-vote ratio
data_frame['budget_per_vote'] = data_frame['budget'] / (data_frame['votes'] + 1)

# Display the first few rows of the DataFrame
print(data_frame[['budget_per_vote']])

      budget_per_vote
0            0.038495
1            0.012299
2            0.033703
3            0.008995
4            0.016120
...               ...
7569         0.224370
7570         0.100221
7571         0.100309
7572         0.000006
7573         0.100302

[7574 rows x 1 columns]


##### Director & Star Popularity

In [10]:
# Compute the total gross revenue across all films for each director/star
data_frame['director_popularity'] = data_frame.groupby('director')['gross'].transform('sum')
data_frame['star_popularity'] = data_frame.groupby('star')['gross'].transform('sum')

# Display the first few rows of the DataFrame
print(data_frame[['director_popularity', 'star_popularity']])

      director_popularity  star_popularity
0                0.032788         0.536282
1                0.069072         0.032608
2                0.224599         0.355951
3                0.194719         0.045201
4                0.207334         0.226050
...                   ...              ...
7569             0.162059         0.162059
7570             0.000001         0.000001
7571             0.000145         0.000145
7572             0.027744         0.027744
7573             0.000005         0.000005

[7574 rows x 2 columns]


#### Transforming Features

Filtering is like applying a WHERE clause in a database query. It is widely used and helps when you need to work on a specific subset of your data. For example, to keep only the rows where 'SEX' is 'Male', no dedicated method call is needed; conditional indexing is enough.

In this case, some domain knowledge and data analysis have informed you that the data is bimodal and that males and females follow different trends.
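
The conditional indexing described above might look like the sketch below. The `people` frame and its `SEX` and `HEIGHT` columns are made up for illustration; the movie dataset used in this notebook has no such columns.

```python
import pandas as pd

# Hypothetical data with a 'SEX' column
people = pd.DataFrame({
    "SEX": ["Male", "Female", "Male"],
    "HEIGHT": [180, 165, 175],
})

# The comparison builds a boolean mask; indexing with it keeps matching rows
males = people[people["SEX"] == "Male"]

print(males)
```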

#### Creating Domain-Specific Features

Domain knowledge means understanding the subject area the data comes from. Here it is used to derive features such as release timing, country-level success, and budget tiers.


##### Time Based Features

In [11]:
# Extracting month and quarter from release date
data_frame['release_month'] = pd.to_datetime(data_frame['released']).dt.month
data_frame['release_quarter'] = pd.to_datetime(data_frame['released']).dt.quarter

print(data_frame[['release_month', 'release_quarter']])

      release_month  release_quarter
0               6.0              2.0
1               7.0              3.0
2               6.0              2.0
3               7.0              3.0
4               7.0              3.0
...             ...              ...
7569            8.0              3.0
7570            4.0              2.0
7571            6.0              2.0
7572            2.0              1.0
7573            3.0              1.0

[7574 rows x 2 columns]


In [12]:
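# Extracting season from release month; rows whose month is missing
# (unparseable release dates) stay missing instead of defaulting to 'Fall'
data_frame['release_season'] = data_frame['release_month'].apply(lambda x: None if pd.isna(x)
                                                                else 'Winter' if x in [12, 1, 2]
                                                                else 'Spring' if x in [3, 4, 5]
                                                                else 'Summer' if x in [6, 7, 8]
                                                                else 'Fall')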

print(data_frame[['release_season']])


     release_season
0            Summer
1            Summer
2            Summer
3            Summer
4            Summer
...             ...
7569         Summer
7570         Spring
7571         Summer
7572         Winter
7573         Spring

[7574 rows x 1 columns]


##### Success by Country

In [13]:
# Calculate the average gross revenue for each country
country_avg_gross = data_frame.groupby('country')['gross'].transform('mean')
data_frame['country_avg_gross'] = country_avg_gross

print(data_frame[['country_avg_gross']])

      country_avg_gross
0              0.021821
1              0.031618
2              0.031618
3              0.031618
4              0.031618
...                 ...
7569           0.076471
7570           0.031618
7571           0.031618
7572           0.031618
7573           0.031618

[7574 rows x 1 columns]


##### Budget Categories

In [14]:
# Define the bins and labels for the budget categories
bins = [0, 0.01, 0.05, 0.1, 0.5, 1, float('inf')]
labels = ['Very Low', 'Low', 'Medium-Low', 'Medium', 'Medium-High', 'High']

# Create a new column with the budget categories
# (include_lowest=True keeps a budget of exactly 0 in the first bin instead of NaN)
data_frame['budget_category'] = pd.cut(data_frame['budget'], bins=bins, labels=labels, include_lowest=True)

# Display the budget alongside its new category
print(data_frame[['budget', 'budget_category']])

        budget budget_category
0     0.053363      Medium-Low
1     0.012632             Low
2     0.050554      Medium-Low
3     0.009823        Very Low
4     0.016846             Low
...        ...             ...
7569  0.224713          Medium
7570  0.100320          Medium
7571  0.100320          Medium
7572  0.000006        Very Low
7573  0.100320          Medium

[7574 rows x 2 columns]


#### Save the wrangled and engineered data to CSV

In [15]:
data_frame.to_csv('../Model_Training/model_ready_data2.csv', index=False)