[Home](../README.md)

### Feature Engineering

This Jupyter Notebook collects the feature engineering steps I applied to my data before model training to maximise the performance of my machine learning model.

#### Feature Engineering Process
- Deriving new variables from existing ones
    - Encoding categorical features
    - Calculating new features from existing features
- Combining features/feature interactions
- Identifying the most relevant features for the model
- Transforming Features
  - [Dividing Data into categories](https://web.ma.utexas.edu/users/mks/statmistakes/dividingcontinuousintocategories.html)
  - Mathematical transformations (for example, logarithmic transformations). Logarithmic transformations are often applied to data that exhibit skewness, which makes the data easier to analyze, visualize, and interpret.
- Creating domain-specific features that incorporate knowledge from the specific domain to capture important characteristics of the data.
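
As a minimal sketch of the logarithmic transformation mentioned in the list above: the `budget` values below are hypothetical and are not taken from this notebook's dataset. `np.log1p` is used here as one common choice because it stays defined for zero values.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed values (illustration only, not the notebook's data)
budget = pd.Series([1_000, 10_000, 100_000, 1_000_000, 100_000_000], name="budget")

# log1p computes log(1 + x), so a value of 0 does not produce -inf
log_budget = np.log1p(budget)

# Compare the skewness before and after the transformation
print("skew before:", budget.skew())
print("skew after: ", log_budget.skew())
```

The transformed column is usually stored alongside the original one so that both remain available during feature selection.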

#### Load the required dependencies

In [1]:
# Import frameworks
import pandas as pd
from sklearn.preprocessing import LabelEncoder


#### Store the data as a local variable

The data frame is a pandas object that holds tabular data in a structured format. Reading the CSV loads the complete data into memory, so it is ready for preprocessing.

In [2]:
data_frame = pd.read_csv("wrangled_data.csv")

In [3]:
data_frame.isnull().sum()

name        0
rating      0
genre       0
year        0
released    0
score       0
votes       0
director    0
writer      0
star        0
country     0
budget      0
gross       0
company     0
runtime     0
dtype: int64

#### Deriving new variables from existing ones

##### Encoding categorical variables

Data encoding converts textual data into a numerical format so that algorithms can process it. Encoding is needed because most machine learning algorithms work with numbers, not with text or categorical variables.
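
The sketch below illustrates two of the encodings used in this notebook on a small, hypothetical frame (the `toy` data and its three-level `order` list are assumptions for illustration): an ordinal mapping for a category with a natural order, and one-hot encoding for a category without one.

```python
import pandas as pd

# Hypothetical toy data (illustration only)
toy = pd.DataFrame({
    "rating": ["PG", "R", "G", "R"],
    "genre": ["Drama", "Comedy", "Drama", "Action"],
})

# Ordinal encoding: each label becomes its position in a hand-written order
order = ["G", "PG", "R"]
toy["rating_code"] = toy["rating"].map({label: i for i, label in enumerate(order)})

# One-hot encoding: one indicator column per distinct label
genre_dummies = pd.get_dummies(toy["genre"], prefix="genre")

print(toy)
print(genre_dummies)
```

Ordinal codes imply that the distance between labels is meaningful, so they suit ordered categories; one-hot columns avoid that implication at the cost of more columns.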


##### 'rating' column encoding

In [4]:
data_frame['rating'].unique()

array(['R', 'PG', 'G', 'Not Rated', 'NC-17', 'Approved', 'TV-PG', 'PG-13',
       'Unrated', 'X', 'TV-MA', 'TV-14'], dtype=object)

In [5]:
# Define the custom class order from least to most violent
rating_order = ['G', 'TV-PG', 'PG', 'Approved', 'Unrated', 'Not Rated', 'PG-13', 'TV-14', 'R', 'TV-MA', 'NC-17', 'X']

# Map the 'rating' column to the custom order.
# A plain dictionary mapping is used instead of LabelEncoder,
# because LabelEncoder would assign codes in alphabetical order.
data_frame['rating'] = data_frame['rating'].map({rating: idx for idx, rating in enumerate(rating_order)})

# Display the 'rating' column to verify the encoding
print(data_frame[['rating']])

# Print the mapping of classes to labels
print(dict(enumerate(rating_order)))


      rating
0          8
1          8
2          2
3          2
4          8
...      ...
7569       5
7570       8
7571       6
7572       8
7573       5

[7574 rows x 1 columns]
{0: 'G', 1: 'TV-PG', 2: 'PG', 3: 'Approved', 4: 'Unrated', 5: 'Not Rated', 6: 'PG-13', 7: 'TV-14', 8: 'R', 9: 'TV-MA', 10: 'NC-17', 11: 'X'}
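
One caveat with a dictionary mapping is that any label missing from the order list silently becomes NaN. The sketch below shows one possible way to check for that; the `ratings` series and its `'Unknown'` label are hypothetical, while `rating_order` repeats the list used above.

```python
import pandas as pd

rating_order = ['G', 'TV-PG', 'PG', 'Approved', 'Unrated', 'Not Rated',
                'PG-13', 'TV-14', 'R', 'TV-MA', 'NC-17', 'X']

# Hypothetical input; 'Unknown' is deliberately absent from rating_order
ratings = pd.Series(['R', 'PG', 'G', 'Unknown'])

codes = ratings.map({rating: idx for idx, rating in enumerate(rating_order)})

# Labels that were not covered by the mapping end up as NaN
unmapped = ratings[codes.isna()].unique()
print("Unmapped labels:", list(unmapped))
```

Running a check like this before training makes it obvious when a new rating label appears in the data.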


##### 'genre' column encoding

In [6]:
# One-hot encode the 'genre' column
data_frame = pd.get_dummies(data_frame, columns=['genre'])

# Display the first few rows of the encoded dataframe
print(data_frame.head())

                                             name  rating  year  \
0                                     The Shining       8  1980   
1                                 The Blue Lagoon       8  1980   
2  Star Wars: Episode V - The Empire Strikes Back       2  1980   
3                                       Airplane!       2  1980   
4                                      Caddyshack       8  1980   

        released     score     votes         director  \
0  June 13, 1980  0.878378  0.386241  Stanley Kubrick   
1   July 2, 1980  0.527027  0.027069   Randal Kleiser   
2  June 20, 1980  0.918919  0.499993   Irvin Kershner   
3   July 2, 1980  0.783784  0.092070     Jim Abrahams   
4  July 25, 1980  0.729730  0.044986     Harold Ramis   

                    writer            star         country  ...  \
0             Stephen King  Jack Nicholson  United Kingdom  ...   
1  Henry De Vere Stacpoole  Brooke Shields   United States  ...   
2           Leigh Brackett     Mark Hamill   United S

##### 'company' & 'country' column encoding

In [7]:
# Calculate the frequency of each unique value in the 'company' column
company_freq = data_frame['company'].value_counts() / len(data_frame)
# Map the frequencies back to the original dataframe
data_frame['company'] = data_frame['company'].map(company_freq)

# Calculate the frequency of each unique value in the 'country' column
country_freq = data_frame['country'].value_counts() / len(data_frame)
# Map the frequencies back to the original dataframe
data_frame['country'] = data_frame['country'].map(country_freq)

# Print the results
print(data_frame[['company', 'country']])


       company   country
0     0.043966  0.106681
1     0.043834  0.719039
2     0.001320  0.719039
3     0.042250  0.719039
4     0.007922  0.719039
...        ...       ...
7569  0.000132  0.004621
7570  0.000132  0.719039
7571  0.000132  0.719039
7572  0.000132  0.719039
7573  0.000264  0.719039

[7574 rows x 2 columns]
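
Frequency encoding can assign the same number to different categories. The sketch below uses hypothetical company names to show this, and writes the frequencies to a separate column (`company_freq`, an assumed name) so the original labels stay available for later grouping. `value_counts(normalize=True)` is equivalent to dividing the counts by the number of rows.

```python
import pandas as pd

# Hypothetical company names (illustration only)
companies = pd.Series(["Acme", "Acme", "Birch", "Cedar"], name="company")

# Relative frequency of each label
freq = companies.value_counts(normalize=True)
encoded = companies.map(freq)

# Keep the raw label next to its encoded value
frame = pd.DataFrame({"company": companies, "company_freq": encoded})
print(frame)

# 'Birch' and 'Cedar' are different companies but share the value 0.25
print(frame["company"].nunique(), "labels ->", frame["company_freq"].nunique(), "encoded values")
```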


##### Convert 'released' to a date format

In [8]:
# Convert the 'released' column to datetime format
data_frame['released'] = pd.to_datetime(data_frame['released'], errors='coerce')

# Remove rows with null values in the 'released' column
data_frame = data_frame.dropna(subset=['released'])

# Display the 'released' column to verify the conversion
print(data_frame[['released']])

       released
0    1980-06-13
1    1980-07-02
2    1980-06-20
3    1980-07-02
4    1980-07-25
...         ...
7569 2020-08-28
7570 2020-04-17
7571 2020-06-03
7572 2020-02-07
7573 2020-03-03

[7519 rows x 1 columns]
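
With `errors='coerce'`, values that cannot be parsed become `NaT` and are then dropped. A possible way to inspect those values before dropping them is sketched below. The `released` series is hypothetical, and the explicit `format` string is an assumption based on values such as "June 13, 1980" shown earlier.

```python
import pandas as pd

# Hypothetical raw values; the last one has no month or day
released = pd.Series(["June 13, 1980", "July 2, 1980", "1980"])

# Values that do not match the assumed format become NaT
parsed = pd.to_datetime(released, format="%B %d, %Y", errors="coerce")

# Look at the raw strings that failed to parse
failed = released[parsed.isna()]
print(f"{len(failed)} of {len(released)} values could not be parsed")
print(failed.tolist())
```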


In [9]:
data_frame.isnull().sum()

name               0
rating             0
year               0
released           0
score              0
votes              0
director           0
writer             0
star               0
country            0
budget             0
gross              0
company            0
runtime            0
genre_Action       0
genre_Adventure    0
genre_Animation    0
genre_Biography    0
genre_Comedy       0
genre_Crime        0
genre_Drama        0
genre_Family       0
genre_Fantasy      0
genre_Horror       0
genre_Music        0
genre_Musical      0
genre_Mystery      0
genre_Romance      0
genre_Sci-Fi       0
genre_Sport        0
genre_Thriller     0
genre_Western      0
dtype: int64

#### Combining features/feature interactions

While individual features can be powerful predictors, their interactions often carry even more information. Feature interaction engineering is the process of creating new features that represent the interaction between two or more features.
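
In its simplest form, an interaction feature is a combination of two existing columns. The sketch below multiplies two hypothetical, already scaled columns; the column name `score_x_votes` is an assumption for illustration and is not part of this notebook's dataset.

```python
import pandas as pd

# Hypothetical scaled values (illustration only)
films = pd.DataFrame({
    "score": [0.8, 0.5, 0.9],
    "votes": [0.4, 0.1, 0.5],
})

# The product is large only when both score and votes are high
films["score_x_votes"] = films["score"] * films["votes"]
print(films)
```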

##### Director Average Gross

In [10]:
# Sort by director and release year
data_frame = data_frame.sort_values(by=['director', 'year'])

# Calculate the expanding (cumulative) mean of gross for each director.
# Note: the expanding window includes the current film's own gross.
data_frame['director_avg_gross'] = (
    data_frame.groupby('director')['gross']
    .expanding()
    .mean()
    .reset_index(level=0, drop=True)
)

# Fill any NaN with the overall average gross
# (the expanding mean already yields a value for a director's first film)
data_frame['director_avg_gross'] = data_frame['director_avg_gross'].fillna(data_frame['gross'].mean())

print(data_frame[['director_avg_gross']])

      director_avg_gross
5339            0.013457
5080            0.007691
4257            0.087944
1363            0.002175
1746            0.002264
...                  ...
1181            0.000081
1785            0.000066
2189            0.000058
2939            0.000060
3909            0.000341

[7519 rows x 1 columns]
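
The expanding mean above includes each film's own gross. If the goal is a feature based only on a director's earlier films, one possible variant shifts the values by one position within each group before averaging. This is a sketch on hypothetical data, not the method used in this notebook, and the column name `director_prior_avg_gross` is an assumption.

```python
import pandas as pd

# Hypothetical films (illustration only)
films = pd.DataFrame({
    "director": ["A", "A", "A", "B"],
    "year": [2000, 2005, 2010, 2001],
    "gross": [10.0, 20.0, 60.0, 5.0],
})
films = films.sort_values(["director", "year"])

# shift(1) drops the current film from the window, so each row only
# sees the director's earlier films; the first film of a director gets NaN
prior_mean = films.groupby("director")["gross"].transform(
    lambda s: s.shift(1).expanding().mean()
)

# First-time directors fall back to the overall average gross
films["director_prior_avg_gross"] = prior_mean.fillna(films["gross"].mean())
print(films)
```

Note that the fallback value here is still computed from the whole dataset; a stricter setup would compute it from the training split only.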


##### Actor Star Power

In [11]:
# Sort by star and release year
data_frame = data_frame.sort_values(by=['star', 'year'])

# Calculate the expanding (cumulative) mean score for each star (includes the current film)
data_frame['star_avg_score'] = data_frame.groupby('star')['score'].expanding().mean().reset_index(level=0, drop=True)

# Fill any NaN with the overall average IMDb score
data_frame['star_avg_score'] = data_frame['star_avg_score'].fillna(data_frame['score'].mean())

print(data_frame[['star_avg_score']])

      star_avg_score
1466        0.689189
4717        0.472973
4260        0.581081
4055        0.459459
3912        0.837838
...              ...
7080        0.560811
4359        0.662162
5402        0.722973
2739        0.743243
3584        0.756757

[7519 rows x 1 columns]


##### Production Company Popularity

In [12]:
# Sort by company and year to ensure chronological order
data_frame = data_frame.sort_values(by=['company', 'year'])

# Calculate the expanding mean of gross per company (the window includes the current movie).
# Note: 'company' was frequency-encoded above, so companies with the same frequency fall into the same group.
data_frame['company_avg_gross'] = (data_frame.groupby('company')['gross'].expanding().mean().reset_index(level=0, drop=True))

# Fill any NaN with the dataset's average gross
data_frame['company_avg_gross'] = data_frame['company_avg_gross'].fillna(data_frame['gross'].mean())

print(data_frame[['company_avg_gross']])


      company_avg_gross
49             0.027744
89             0.016541
57             0.011755
86             0.009398
41             0.013067
...                 ...
7433           0.048705
7449           0.048978
7506           0.048893
7563           0.048897
7567           0.048996

[7519 rows x 1 columns]


#### Transforming Features

Filtering is like applying a WHERE clause in a database query. It is widely used and helps when you need to work on a specific subset of your data. For example, to keep only the rows where a 'SEX' column is 'Male', no dedicated method call is needed; conditional indexing is enough.

Filtering like this is useful when domain knowledge and data analysis show that the data is bimodal, for example when males and females follow different trends.
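
Conditional indexing can be sketched as follows. The `people` frame and its columns are hypothetical and only illustrate the filter described above; they are not part of the movie dataset used in this notebook.

```python
import pandas as pd

# Hypothetical data (illustration only)
people = pd.DataFrame({
    "SEX": ["Male", "Female", "Male"],
    "height_cm": [180, 165, 175],
})

# The comparison builds a boolean mask; indexing with it keeps the matching rows
males = people[people["SEX"] == "Male"]
print(males)
```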

#### Creating Domain-Specific Features

Domain knowledge is about understanding the domain or subject area of the data. 


##### Time Based Features

In [13]:
# Extracting month and quarter from release date
data_frame['release_month'] = data_frame['released'].dt.month
data_frame['release_quarter'] = data_frame['released'].dt.quarter

print(data_frame[['release_month', 'release_quarter']])

      release_month  release_quarter
49                5                2
89               12                4
57                7                3
86                5                2
41                7                3
...             ...              ...
7433              8                3
7449              6                2
7506              4                2
7563              2                1
7567              1                1

[7519 rows x 2 columns]


In [14]:
# Extracting season from release date
data_frame['release_season'] = data_frame['release_month'].apply(lambda x: 'Winter' if x in [12, 1, 2] 
                                                                else 'Spring' if x in [3, 4, 5] 
                                                                else 'Summer' if x in [6, 7, 8] 
                                                                else 'Fall')

print(data_frame[['release_season']])


     release_season
49           Spring
89           Winter
57           Summer
86           Spring
41           Summer
...             ...
7433         Summer
7449         Summer
7506         Spring
7563         Winter
7567         Winter

[7519 rows x 1 columns]


#### Save the wrangled and engineered data to CSV

In [15]:
data_frame.to_csv('../Model_Training/model_ready_data.csv', index=False)
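
A CSV file stores dates as plain text, so the `released` column needs to be parsed again when the file is loaded for training. The sketch below shows a round trip on a small hypothetical frame; it writes to an in-memory buffer rather than to the path above so that it runs without touching the file system.

```python
import io

import pandas as pd

# Hypothetical two-row frame with a datetime column (illustration only)
frame = pd.DataFrame({
    "released": pd.to_datetime(["1980-06-13", "1980-07-02"]),
    "release_month": [6, 7],
})

# Write to an in-memory buffer instead of a file
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype that the CSV format does not keep
reloaded = pd.read_csv(buffer, parse_dates=["released"])
print(reloaded.dtypes)
```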