# Feature Engineering

## Objectives

* Engineer features for classification, regression and cluster models

## Inputs

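* outputs/datasets/cleaned/X_train_cleaned.csv
* outputs/datasets/cleaned/X_test_cleaned.csv
* outputs/datasets/cleaned/y_train_cleaned.csv
* outputs/datasets/cleaned/y_test_cleaned.csv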

## Outputs

* A list of variables to engineer
* Engineered datasets, plus the fitted encoder and scalers, saved to outputs/datasets/engineered/

## Conclusions

* Feature Engineering Transformers
  * Label encoding: `['original_language']`
  * Log transformation followed by standard scaling: `['budget']` and the target (revenue)
  * Quantile-based outlier removal on the train set, using budget, revenue and ROI

# Change working directory

We need to change the working directory from the current folder to its parent folder.
* We access the current directory with `os.getcwd()`

In [20]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/Film_Hit_prediction'

We want to make the parent of the current directory the new current directory.
* `os.path.dirname()` gets the parent directory
* `os.chdir()` sets the new current directory

In [21]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [22]:
current_dir = os.getcwd()
current_dir


'/workspace'

---

# Load Cleaned Data

Train Set

In [23]:
import pandas as pd
X_train_path = "/workspace/Film_Hit_prediction/outputs/datasets/cleaned/X_train_cleaned.csv"
X_train = pd.read_csv(X_train_path)
print(X_train.head(3))

  original_language    budget  Action  Adventure  Animation  Comedy  Crime  \
0                en         0       0          0          0       1      0   
1                en  11000000       0          0          0       0      0   
2                en   2000000       1          1          0       0      0   

   Documentary  Drama  Family  ...  History  Horror  Music  Mystery  Romance  \
0            0      1       0  ...        0       0      0        0        1   
1            0      1       0  ...        0       0      1        0        0   
2            0      0       0  ...        0       0      0        0        0   

   Science Fiction  TV Movie  Thriller  War  Western  
0                0         0         0    0        0  
1                0         0         0    0        0  
2                0         0         1    0        0  

[3 rows x 22 columns]


Test Set

In [24]:
import pandas as pd
X_test_set_path = "/workspace/Film_Hit_prediction/outputs/datasets/cleaned/X_test_cleaned.csv"
X_test = pd.read_csv(X_test_set_path)
print(X_test.head(3))

  original_language    budget  Action  Adventure  Animation  Comedy  Crime  \
0                en  70000000       1          1          0       1      0   
1                en         7       1          0          0       0      1   
2                en  14000000       0          0          0       0      0   

   Documentary  Drama  Family  ...  History  Horror  Music  Mystery  Romance  \
0            0      0       0  ...        0       0      0        0        0   
1            0      0       0  ...        0       1      0        0        0   
2            0      1       0  ...        0       0      0        1        0   

   Science Fiction  TV Movie  Thriller  War  Western  
0                0         0         1    0        0  
1                1         0         1    0        0  
2                0         0         1    0        0  

[3 rows x 22 columns]


Y Train Set


In [25]:
y_train_path = "/workspace/Film_Hit_prediction/outputs/datasets/cleaned/y_train_cleaned.csv"
y_train = pd.read_csv(y_train_path)
print(y_train.head(3))
print("\nShapes:")
print("y_train shape:", y_train.shape)

          0
0         0
1  21935319
2  76898765

Shapes:
y_train shape: (3842, 1)


Y Test Set

In [26]:
import pandas as pd
y_test_path = "/workspace/Film_Hit_prediction/outputs/datasets/cleaned/y_test_cleaned.csv"
y_test = pd.read_csv(y_test_path)
print(y_test.head(3))
print("\nShapes:")
print("y_test shape:", y_test.shape)

          0
0 -36438863
1        -2
2  -8891180

Shapes:
y_test shape: (961, 1)


# Data Exploration

In feature engineering, the goal at this stage is to evaluate which transformations could be applied to each variable.
* Take notes in a separate spreadsheet
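
One quick check that can feed those notes is skewness before and after a candidate transform. The sketch below uses a small made-up budget column (the values are illustrative, not taken from the dataset) and compares the skew of the raw values with the skew after `np.log1p`, the transform applied to `budget` later in this notebook.

```python
import numpy as np
import pandas as pd

# Toy budget values (hypothetical): mostly small, with a few very large ones
budget = pd.Series(
    [500_000, 2_000_000, 5_000_000, 11_000_000, 14_000_000, 70_000_000, 250_000_000]
)

# Sample skewness of the raw values and of the log1p-transformed values
raw_skew = budget.skew()
logged_skew = np.log1p(budget).skew()

print(f"skew (raw):   {raw_skew:.2f}")
print(f"skew (log1p): {logged_skew:.2f}")
```

A skew much closer to zero after the transform is one reason to note the variable as a log-transform candidate.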

In [27]:
from ydata_profiling import ProfileReport
pandas_report = ProfileReport(df=X_train, minimal=True)
pandas_report.to_notebook_iframe()

# Correlation and PPS Analysis

* We don’t expect major changes compared to the data cleaning notebook, since the only difference in the data is the removal of “ID”. Correlation levels and PPS will be essentially the same.
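
Should a quick re-check of correlation levels be wanted anyway, a minimal sketch with plain pandas could look like the following. The frame and its values are hypothetical stand-ins for a numeric feature, two one-hot genre columns and the target. PPS is not shown here because it needs a separate library.

```python
import pandas as pd

# Toy frame (hypothetical values): a numeric feature, two one-hot genres and the target
df = pd.DataFrame({
    "budget": [1, 5, 10, 20, 50, 100],
    "Action": [0, 0, 1, 0, 1, 1],
    "Drama": [1, 1, 0, 1, 0, 0],
    "revenue": [2, 4, 30, 25, 120, 300],
})

# Spearman is rank-based, so it is less affected by heavy skew than Pearson
corr_with_target = (
    df.corr(method="spearman")["revenue"]
    .drop("revenue")
    .sort_values(key=abs, ascending=False)
)
print(corr_with_target)
```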

# Feature Engineering

## Custom function

In [28]:
from sklearn.preprocessing import StandardScaler, LabelEncoder
import pandas as pd
import numpy as np
import joblib

def engineer_film_features(X_train, X_test, y_train, y_test):
    print("Starting feature engineering...")

    def handle_infinite_values(data):
        if isinstance(data, pd.Series):
            # Replace inf values first
            data = data.replace([np.inf, -np.inf], np.nan)
            # Then handle NaN values with median
            median_value = data.median()
            data = data.fillna(median_value)
        else:
            # For numpy arrays
            data = np.nan_to_num(data, 
                                nan=np.nanmedian(data) if np.any(np.isnan(data)) else 0.0,
                                posinf=np.nanmax(data[~np.isinf(data)]) if np.any(~np.isinf(data)) else 0.0,
                                neginf=np.nanmin(data[~np.isinf(data)]) if np.any(~np.isinf(data)) else 0.0)
        return data

    # Clip remaining negative values to zero
    def handle_negative_values(data):
        if isinstance(data, pd.Series):
            return data.clip(lower=0)
        elif isinstance(data, np.ndarray):
            return np.maximum(data, 0)
        else:
            return max(data, 0.0)

    def remove_outliers(X, y):
        # Convert y from a single-column DataFrame to a Series
        if isinstance(y, pd.DataFrame):
            y = y.squeeze()

        print("Removing outliers...")

        # Clean data before calculating ROI
        clean_budget = handle_infinite_values(X['budget'])
        clean_revenue = handle_infinite_values(y)

        # Calculate ROI
        roi = (clean_revenue - clean_budget) / clean_budget
        roi = handle_infinite_values(roi)
        
        # Create masks for each metric using quantiles
        budget_mask = (X['budget'] < X['budget'].quantile(0.95)) & (X['budget'] > X['budget'].quantile(0.05))
        revenue_mask = (y < y.quantile(0.95)) & (y > y.quantile(0.05))
        roi_mask = (roi < roi.quantile(0.95)) & (roi > roi.quantile(0.05))
        
        # Combine all masks
        combined_mask = budget_mask & revenue_mask & roi_mask
        print(f"Rows remaining after outlier removal: {combined_mask.sum()} out of {len(X)}")
        
        # Return filtered datasets
        return X[combined_mask], y[combined_mask]
    
    # Apply outlier removal to the training data only
    X_train, y_train = remove_outliers(X_train, y_train)

    # Copy the train and test features before transforming them
    X_train_processed = X_train.copy()
    X_test_processed = X_test.copy()

    # Language processing with LabelEncoder (fitted on train and test labels combined)
    print("Processing languages...")
    le_language = LabelEncoder()
    le_language.fit(pd.concat([X_train_processed['original_language'], X_test_processed['original_language']]))
    X_train_processed['language_encoded'] = le_language.transform(X_train_processed['original_language'])
    X_test_processed['language_encoded'] = le_language.transform(X_test_processed['original_language'])

    # Budget processing with log transform and scale
    print("Processing budget...")
    scaler = StandardScaler()

    # Log transform
    X_train_processed['budget'] = handle_infinite_values(X_train_processed['budget'])
    X_test_processed['budget'] = handle_infinite_values(X_test_processed['budget'])
    
    X_train_processed['budget_logged'] = np.log1p(X_train_processed['budget'])
    X_test_processed['budget_logged'] = np.log1p(X_test_processed['budget'])

    # Scale
    X_train_processed['budget_scaled'] = scaler.fit_transform(X_train_processed[['budget_logged']])
    X_test_processed['budget_scaled'] = scaler.transform(X_test_processed[['budget_logged']])

    # Get genre columns (they're already one-hot encoded)
    genre_columns = [col for col in X_train.columns if col not in ['original_language', 'budget']]
    print(f"Genre columns found: {genre_columns}")

    # Combine features
    print("Combining features...")
    X_train_final = pd.concat([
        X_train_processed[['language_encoded', 'budget_scaled']],
        X_train_processed[genre_columns]
    ], axis=1)

    X_test_final = pd.concat([
        X_test_processed[['language_encoded', 'budget_scaled']],
        X_test_processed[genre_columns]
    ], axis=1)

    # Process target variable (revenue)
    print("Processing revenue...")
    scaler_y = StandardScaler()

    # Handle infinite values in revenue
    y_train = handle_infinite_values(y_train)
    y_test = handle_infinite_values(y_test)

    # Clip negative revenue values to zero so that log1p stays finite
    y_train = handle_negative_values(y_train)
    y_test = handle_negative_values(y_test)

    # Log-transform revenue
    y_train_logged = np.log1p(y_train)
    y_test_logged = np.log1p(y_test)

    # Handle any infinite values that might have been created during the log transform
    y_train_logged = handle_infinite_values(y_train_logged)
    y_test_logged = handle_infinite_values(y_test_logged)

    
    # Convert to numpy array if pandas Series
    if isinstance(y_train_logged, pd.Series):
        y_train_logged = y_train_logged.to_numpy()
    if isinstance(y_test_logged, pd.Series):
        y_test_logged = y_test_logged.to_numpy()
    
    # Additional safety check for infinite values
    y_train_logged = np.nan_to_num(y_train_logged, nan=0.0, posinf=0.0, neginf=0.0)
    y_test_logged = np.nan_to_num(y_test_logged, nan=0.0, posinf=0.0, neginf=0.0)

    # Reshape and scale
    y_train_final = scaler_y.fit_transform(y_train_logged.reshape(-1, 1))
    y_test_final = scaler_y.transform(y_test_logged.reshape(-1, 1))

    print("\nFinal shapes:")
    print(f"X_train_final: {X_train_final.shape}")
    print(f"X_test_final: {X_test_final.shape}")
    print(f"y_train_final: {y_train_final.shape}")
    print(f"y_test_final: {y_test_final.shape}")

    return X_train_final, X_test_final, y_train_final, y_test_final, le_language, scaler, scaler_y 



In [29]:
X_train_final, X_test_final, y_train_final, y_test_final, le_language, scaler, scaler_y = engineer_film_features(X_train, X_test, y_train, y_test)

Starting feature engineering...
Removing outliers...
Rows remaining after outlier removal: 2054 out of 3842
Processing languages...
Processing budget...
Genre columns found: ['Action', 'Adventure', 'Animation', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Family', 'Fantasy', 'Foreign', 'History', 'Horror', 'Music', 'Mystery', 'Romance', 'Science Fiction', 'TV Movie', 'Thriller', 'War', 'Western']
Combining features...
Processing revenue...

Final shapes:
X_train_final: (2054, 22)
X_test_final: (961, 22)
y_train_final: (2054, 1)
y_test_final: (961, 1)
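
The outlier step keeps 2054 of 3842 training rows, noticeably fewer than a 5%/95% trim on one column would suggest. Three masks are combined, and each uses strict inequalities. One possible contributor, assuming a sizeable share of budgets are zero (the first training row shown above has a budget of 0), is that the 5% quantile itself becomes 0, so `budget > quantile(0.05)` drops every zero-budget row. The toy values below are hypothetical and only illustrate that effect.

```python
import pandas as pd

# Toy budget column (hypothetical): 3 of 10 values are zero
budget = pd.Series([0, 0, 0, 1, 2, 3, 4, 5, 6, 100])

# Same bounds and strict comparisons as in remove_outliers
lower, upper = budget.quantile(0.05), budget.quantile(0.95)
mask = (budget > lower) & (budget < upper)

print(f"lower bound: {lower}, upper bound: {upper}")
print(f"rows kept: {mask.sum()} out of {len(budget)}")
```

Here the lower bound is 0, so all three zero rows are removed along with the single large value. Checking `X_train['budget'].quantile(0.05)` on the real data would show whether the same thing happens there.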


In [30]:
import os

# Create output directory
output_path = "/workspace/Film_Hit_prediction/outputs/datasets/engineered/"
os.makedirs(output_path, exist_ok=True)

# Save processed datasets
X_train_final.to_csv(output_path + 'X_train_final.csv', index=False)
X_test_final.to_csv(output_path + 'X_test_final.csv', index=False)
pd.DataFrame(y_train_final).to_csv(output_path + 'y_train_final.csv', index=False)
pd.DataFrame(y_test_final).to_csv(output_path + 'y_test_final.csv', index=False)

# Save encoders and scalers
joblib.dump(le_language, output_path + 'language_encoder.joblib')
joblib.dump(scaler, output_path + 'budget_scaler.joblib')
joblib.dump(scaler_y, output_path + 'revenue_scaler.joblib')

['/workspace/Film_Hit_prediction/outputs/datasets/engineered/revenue_scaler.joblib']

## Feature Engineering Spreadsheet Summary


- Languages are encoded using LabelEncoder
- Genre columns are already one-hot encoded
- Budget is both log-transformed and scaled
- Target variable (revenue) is log-transformed to handle skewness and scaled using StandardScaler
- The encoder and scalers are saved
- Processed datasets are saved
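
Because the target is log-transformed and then scaled, predictions made on the engineered target need both steps undone, in reverse order, to get back to revenue units. The sketch below shows that round trip on hypothetical revenue values with a freshly fitted scaler. In a later notebook the fitted scaler would presumably be loaded with `joblib.load` from `revenue_scaler.joblib` instead.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy revenue values (hypothetical), transformed the same way as the target above
revenue = np.array([0.0, 21_935_319.0, 76_898_765.0, 150_000_000.0]).reshape(-1, 1)

scaler_y = StandardScaler()
y_scaled = scaler_y.fit_transform(np.log1p(revenue))

# Undo the steps in reverse order: unscale first, then invert log1p with expm1
revenue_restored = np.expm1(scaler_y.inverse_transform(y_scaled))

print(np.allclose(revenue_restored, revenue))
```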
    

