# Project: Automated Feature Engineering Workflow

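This project demonstrates automated feature engineering for a simple regression problem: predicting `Sales` from region, customer count, and marketing spend.
It combines a preprocessing step with a tree-based model:

- `OneHotEncoder` for categorical variables
- Passthrough (no transformation) for numeric features
- `RandomForestRegressor` for prediction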

## Steps:
1. Load dataset
2. Split into training/testing sets
3. Automatically transform numerical + categorical features
4. Train RandomForest regression model
5. Evaluate using RMSE and R²

## Files:
- automated_feature_engineering.ipynb
- sales.csv
- requirements.txt

In [47]:
# Import packages and libraries

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

In [48]:
# Load dataset
df = pd.read_csv("sales.csv")
df.head()

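|   | Region | Customers | Marketing_Spend | Sales |
|---|--------|-----------|-----------------|-------|
| 0 | North  | 120       | 5000            | 15000 |
| 1 | South  | 80        | 3000            | 9000  |
| 2 | East   | 95        | 4000            | 11000 |
| 3 | West   | 110       | 5200            | 14500 |
| 4 | North  | 130       | 5500            | 16000 |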


In [49]:
df.shape

(152, 4)

In [51]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 152 entries, 0 to 151
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Region           152 non-null    object
 1   Customers        152 non-null    int64 
 2   Marketing_Spend  152 non-null    int64 
 3   Sales            152 non-null    int64 
dtypes: int64(3), object(1)
memory usage: 4.9+ KB


In [52]:
df.describe()

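|       | Customers  | Marketing_Spend | Sales        |
|-------|------------|-----------------|--------------|
| count | 152.0      | 152.0           | 152.0        |
| mean  | 112.927632 | 4922.368421     | 13750.657895 |
| std   | 22.459412  | 1379.482933     | 3177.596382  |
| min   | 70.0       | 2350.0          | 7600.0       |
| 25%   | 94.0       | 3687.5          | 10675.0      |
| 50%   | 115.0      | 5200.0          | 14875.0      |
| 75%   | 130.0      | 6100.0          | 16100.0      |
| max   | 160.0      | 7400.0          | 19000.0      |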


In [53]:
df.isnull().sum()

Region             0
Customers          0
Marketing_Spend    0
Sales              0
dtype: int64

In [54]:
# Separate features (X) from the target (y)
X = df.drop("Sales", axis=1)
y = df["Sales"]

In [55]:
# Identify numerical + categorical features
num_features = ["Customers", "Marketing_Spend"]
cat_features = ["Region"]
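
The feature lists above are typed by hand. Since the project is about automating this step, one possible alternative is to derive them from the column dtypes. The following is a minimal sketch, not the notebook's method: the `sample` frame is hypothetical data that mirrors the `sales.csv` schema, and `auto_num_features` / `auto_cat_features` are illustrative names.

```python
import pandas as pd

# Hypothetical rows mirroring the sales.csv schema (not the real dataset).
sample = pd.DataFrame({
    "Region": ["North", "South", "East", "West"],
    "Customers": [120, 80, 95, 110],
    "Marketing_Spend": [5000, 3000, 4000, 5200],
    "Sales": [15000, 9000, 11000, 14500],
})
features = sample.drop(columns="Sales")

# Numeric dtypes become numerical features; every other column is treated as categorical.
auto_num_features = features.select_dtypes(include="number").columns.tolist()
auto_cat_features = features.select_dtypes(exclude="number").columns.tolist()

print("Numerical:", auto_num_features)
print("Categorical:", auto_cat_features)
```

This only works when dtypes reflect meaning. A numeric code column (for example, a store ID stored as `int64`) would be classified as numerical and would need to be listed manually.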

In [56]:
# Build preprocessing pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ("num", "passthrough", num_features),
        ("cat", OneHotEncoder(), cat_features)
    ]
)

In [57]:
# Split data into training (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [58]:
# Fit the preprocessor on the training set only, then apply it to the test set
X_train_transformed = preprocessor.fit_transform(X_train)
X_test_transformed = preprocessor.transform(X_test)

In [59]:
# Fit model
model = RandomForestRegressor(random_state=42)
display(model.fit(X_train_transformed, y_train))

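| Parameter                | Value           |
|--------------------------|-----------------|
| n_estimators             | 100             |
| criterion                | 'squared_error' |
| max_depth                | None            |
| min_samples_split        | 2               |
| min_samples_leaf         | 1               |
| min_weight_fraction_leaf | 0.0             |
| max_features             | 1.0             |
| max_leaf_nodes           | None            |
| min_impurity_decrease    | 0.0             |
| bootstrap                | True            |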


In [60]:
# Predict
y_pred = model.predict(X_test_transformed)

In [61]:
# Calculate RMSE on the test set
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("RMSE:", rmse)

RMSE: 202.69831328769715


In [62]:
# Calculate the R² score on the test set
from sklearn.metrics import r2_score
print("R2 Score:", r2_score(y_test, y_pred))

R2 Score: 0.9948520509645178
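
RMSE and R² are closely related, which is easier to see when both are computed by hand. The sketch below uses four hypothetical actual/predicted values (not the notebook's predictions). It computes RMSE from its definition, computes the RMSE of a baseline that always predicts the mean of the actual values, and recovers R² as one minus the squared ratio of the two.

```python
import numpy as np

# Hypothetical values for illustration only, not taken from the notebook.
actual = np.array([15000.0, 9000.0, 11000.0, 14500.0])
predicted = np.array([14800.0, 9300.0, 11000.0, 14400.0])

# RMSE: square the errors, average them, take the square root.
model_rmse = np.sqrt(np.mean((actual - predicted) ** 2))

# Baseline: always predict the mean of the actual values.
baseline_rmse = np.sqrt(np.mean((actual - actual.mean()) ** 2))

# R² compares the model's squared error with the baseline's squared error.
r2_from_rmse = 1 - (model_rmse / baseline_rmse) ** 2

print("Model RMSE:", model_rmse)
print("Baseline RMSE:", baseline_rmse)
print("R2:", r2_from_rmse)
```

Read this way, an R² close to 1 means the model's RMSE is a small fraction of what a constant mean prediction would achieve on the same rows.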


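## Results:
On the held-out 20% test split, the regression model achieved an RMSE of 202.70. That is roughly 1.5% of the mean sales value (13,750.66) and well below the standard deviation of sales (3,177.60) across the full dataset. The R² score of 0.9949 means the model explains about 99.5% of the variance in test-set sales. Both figures come from a single split of a 152-row dataset, so they are an estimate of performance rather than a guarantee on new data.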
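
The scores above come from one train/test split. A common way to check how stable they are is k-fold cross-validation. The following is a minimal sketch under stated assumptions: the `synthetic` frame is made-up data that only borrows the column names of `sales.csv`, the linear relationship used to generate `Sales` is invented, and `n_estimators=50` is chosen only to keep the example fast. Wrapping the preprocessor and model in a `Pipeline` lets the encoder be refit inside each fold.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for sales.csv (assumption: same column names, made-up values).
rng = np.random.default_rng(42)
n_rows = 150
synthetic = pd.DataFrame({
    "Region": rng.choice(["North", "South", "East", "West"], size=n_rows),
    "Customers": rng.integers(70, 161, size=n_rows),
    "Marketing_Spend": rng.integers(2350, 7401, size=n_rows),
})
# Invented linear relationship plus noise, only so there is a signal to learn.
synthetic["Sales"] = (
    100 * synthetic["Customers"]
    + 0.5 * synthetic["Marketing_Spend"]
    + rng.normal(0, 200, size=n_rows)
)

X_demo = synthetic.drop(columns="Sales")
y_demo = synthetic["Sales"]

cv_pipeline = Pipeline([
    ("preprocess", ColumnTransformer([
        ("num", "passthrough", ["Customers", "Marketing_Spend"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Region"]),
    ])),
    ("model", RandomForestRegressor(n_estimators=50, random_state=42)),
])

# scikit-learn maximizes scores, so RMSE is reported as a negative number.
neg_rmse = cross_val_score(
    cv_pipeline, X_demo, y_demo, cv=5, scoring="neg_root_mean_squared_error"
)
fold_rmse = -neg_rmse

print("RMSE per fold:", fold_rmse.round(1))
print(f"Mean RMSE: {fold_rmse.mean():.1f} (std {fold_rmse.std():.1f})")
```

A large spread between folds would suggest that a single-split RMSE depends heavily on which rows landed in the test set.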

## Further Scope:

- Incorporate advanced feature engineering such as interaction terms, polynomial features, and domain-specific ratios to enhance model performance.
- Experiment with additional regression models (e.g. Gradient Boosting, XGBoost) and compare them against the Random Forest baseline using cross-validation.
- Automate the entire pipeline by integrating feature generation, model training, and evaluation into a reusable, production-ready workflow.
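
The three items above could be combined along the following lines. This is a rough sketch under several assumptions, not a production implementation: `add_ratio_features`, `build_pipeline`, and the `Spend_Per_Customer` ratio are hypothetical names introduced here, the data is synthetic with an invented relationship, and scikit-learn's `GradientBoostingRegressor` stands in for XGBoost so the example needs no extra dependency.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, PolynomialFeatures


def add_ratio_features(frame):
    """Add a hypothetical domain ratio: marketing spend per customer."""
    out = frame.copy()
    out["Spend_Per_Customer"] = out["Marketing_Spend"] / out["Customers"]
    return out


def build_pipeline(model):
    """Bundle feature generation, preprocessing, and a model into one object."""
    num_features = ["Customers", "Marketing_Spend", "Spend_Per_Customer"]
    preprocessor = ColumnTransformer([
        # Keeps the numeric columns and adds their pairwise products.
        ("num", PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
         num_features),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Region"]),
    ])
    return Pipeline([
        ("ratios", FunctionTransformer(add_ratio_features)),
        ("preprocess", preprocessor),
        ("model", model),
    ])


# Synthetic stand-in for sales.csv (assumption: same column names, made-up values).
rng = np.random.default_rng(0)
n_rows = 120
data = pd.DataFrame({
    "Region": rng.choice(["North", "South", "East", "West"], size=n_rows),
    "Customers": rng.integers(70, 161, size=n_rows),
    "Marketing_Spend": rng.integers(2350, 7401, size=n_rows),
})
target = 100 * data["Customers"] + 0.5 * data["Marketing_Spend"] + rng.normal(0, 200, size=n_rows)

candidates = {
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=42),
    "gradient_boosting": GradientBoostingRegressor(random_state=42),
}

# Mean cross-validated RMSE per candidate model (lower is better).
results = {}
for name, model in candidates.items():
    scores = cross_val_score(
        build_pipeline(model), data, target, cv=3, scoring="neg_root_mean_squared_error"
    )
    results[name] = -scores.mean()

for name, rmse in results.items():
    print(f"{name}: mean CV RMSE = {rmse:.1f}")
```

Keeping feature generation inside the pipeline means the same object can be fit, cross-validated, and later applied to new rows without repeating the transformation code.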