# **Product Prediction Model**

### 1. Import Libraries
This cell imports the Python libraries used for data manipulation, model building, and evaluation.
- `pandas` is used for data loading and manipulation.
- `sklearn.model_selection.train_test_split` is for splitting data into training and testing sets.
- `sklearn.preprocessing.LabelEncoder` is for encoding categorical features.
- `sklearn.ensemble.RandomForestClassifier` is the chosen machine learning model.
- `sklearn.metrics.classification_report` is for evaluating the model's performance.

In [32]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

### 2. Load Datasets
This cell loads two CSV files, `customer_social_profiles.csv` and `customer_transactions.csv`, into pandas DataFrames named `social_df` and `transactions_df` respectively. These datasets contain customer social media and transaction information.

In [None]:
social_df = pd.read_csv('../data/raw/dataset/customer_social_profiles.csv')
transactions_df = pd.read_csv('../data/raw/dataset/customer_transactions.csv')


### 3. Standardize Customer ID Columns
This cell renames the customer ID columns in both `social_df` and `transactions_df` to a consistent name, `customer_id`, to facilitate merging.

In [51]:
social_df = social_df.rename(columns={'customer_id_new': 'customer_id'})
transactions_df = transactions_df.rename(columns={'customer_id_legacy': 'customer_id'})
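
By default, `DataFrame.rename` ignores mapping keys that are not present, so a misspelled source column would pass silently and leave the frame without a `customer_id` column. The following is a minimal sketch of a stricter variant, using a toy frame with made-up values in place of the real CSV data:

```python
import pandas as pd

# Toy stand-in for social_df; the values are illustrative only.
toy_social = pd.DataFrame({
    "customer_id_new": ["A100", "A101"],
    "engagement_score": [74, 52],
})

# errors="raise" makes rename fail loudly if the source column is missing.
toy_social = toy_social.rename(
    columns={"customer_id_new": "customer_id"}, errors="raise"
)
print(toy_social.columns.tolist())

# A misspelled source column now raises KeyError instead of being ignored.
try:
    toy_social.rename(columns={"customer_id_nw": "customer_id"}, errors="raise")
    rename_failed = False
except KeyError:
    rename_failed = True
print("Misspelled rename rejected:", rename_failed)
```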


### 4. Create Merged Dataset
This cell first drops the `customer_id` columns from both DataFrames. The cross-join does not use them, and the IDs do not uniquely link social profiles to transactions (the relationship is many-to-many). It then performs a cross-join (Cartesian product) between `social_df` and `transactions_df` to create `merged_df`, pairing every row of `social_df` with every row of `transactions_df`. The underlying assumption is that any social profile could be linked to any transaction for the purpose of feature engineering. As a consequence, each transaction appears in `merged_df` once per social profile row.

In [52]:
# Drop the ID columns if they exist
social_df = social_df.drop(columns=['customer_id'], errors='ignore')
transactions_df = transactions_df.drop(columns=['customer_id'], errors='ignore')

# Create a cross join (cartesian product) to combine datasets
social_df['key'] = 1
transactions_df['key'] = 1
merged_df = pd.merge(social_df, transactions_df, on='key').drop(columns=['key'])

print("Merged shape:", merged_df.shape)
print(merged_df.head())


Merged shape: (23250, 9)
  social_media_platform  engagement_score  purchase_interest_score  \
0              LinkedIn                74                      4.9   
1              LinkedIn                74                      4.9   
2              LinkedIn                74                      4.9   
3              LinkedIn                74                      4.9   
4              LinkedIn                74                      4.9   

  review_sentiment  transaction_id  purchase_amount purchase_date  \
0         Positive            1001              408    2024-01-01   
1         Positive            1002              332    2024-01-02   
2         Positive            1003              442    2024-01-03   
3         Positive            1004              256    2024-01-04   
4         Positive            1005               64    2024-01-05   

  product_category  customer_rating  
0           Sports              2.3  
1      Electronics              4.2  
2      Electronics       
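
To make the row count concrete, here is a small sketch of the same cross-join on toy frames with made-up values. It also shows `how="cross"`, which, assuming pandas 1.2 or later, should give the same result without the helper `key` column:

```python
import pandas as pd

# Toy stand-ins for social_df and transactions_df; values are illustrative only.
toy_social = pd.DataFrame({
    "social_media_platform": ["LinkedIn", "Instagram"],
    "engagement_score": [74, 52],
})
toy_transactions = pd.DataFrame({
    "transaction_id": [1001, 1002, 1003],
    "product_category": ["Sports", "Electronics", "Electronics"],
})

# Constant-key merge, as in the cell above (assign avoids mutating the inputs).
via_key = pd.merge(
    toy_social.assign(key=1), toy_transactions.assign(key=1), on="key"
).drop(columns=["key"])

# Built-in cross join (assumes pandas >= 1.2).
via_cross = pd.merge(toy_social, toy_transactions, how="cross")

# Every social row is paired with every transaction row: 2 * 3 rows.
print(via_key.shape, via_cross.shape)
```

The row count of a cross-join is always the product of the two input row counts, which is where the 23,250 rows reported above come from.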

### 5. Encode Categorical Features
This cell label-encodes the categorical (object-typed) feature columns in `merged_df`. `LabelEncoder` converts each distinct text value into an integer code; this includes `purchase_date`, which is loaded as a string column. Each fitted encoder is stored in `le_dict`, keyed by column name, so the same mapping can be applied to new data or inverted later. The target variable `product_category` is skipped here and encoded separately in the next step.

In [53]:
from sklearn.preprocessing import LabelEncoder

le_dict = {}  # To save encoders for later use

for col in merged_df.select_dtypes(include=['object']).columns:
    if col != 'product_category':  # don't encode target yet
        le = LabelEncoder()
        merged_df[col] = le.fit_transform(merged_df[col].astype(str))
        le_dict[col] = le
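
As an illustration of what each encoder in `le_dict` does, here is a minimal sketch on a toy list of sentiment values (made up for this example). It shows that codes follow the sorted order of the labels and that the mapping can be inverted:

```python
from sklearn.preprocessing import LabelEncoder

# Toy values standing in for one categorical column.
sentiments = ["Positive", "Negative", "Neutral", "Positive"]

le = LabelEncoder()
codes = le.fit_transform(sentiments)

# classes_ holds the sorted unique labels; a label's code is its index there.
print(le.classes_.tolist())
print(codes.tolist())

# inverse_transform recovers the original strings.
restored = le.inverse_transform(codes).tolist()
print(restored)
```

Because the codes follow alphabetical order, they carry no meaningful ranking. The scikit-learn documentation describes `LabelEncoder` as intended for target values, with `OrdinalEncoder` as the counterpart for feature columns, which may be worth considering here.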


### 6. Prepare Data for Modeling
This cell separates the features (`X`) from the target variable (`y`), which is `product_category`. The target variable `y` is then also label-encoded using a separate `LabelEncoder` instance, `le_target`.

In [54]:
X = merged_df.drop(columns=['product_category'])
y = merged_df['product_category']

# Encode target
le_target = LabelEncoder()
y = le_target.fit_transform(y)


### 7. Split Data into Training and Testing Sets
This cell splits the preprocessed data (`X` and `y`) into training and testing sets. `train_test_split` allocates 80% of the rows for training the model (`X_train`, `y_train`) and 20% for evaluating it (`X_test`, `y_test`), with `random_state=42` fixed for reproducibility.

In [55]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
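
Because the cross-join in step 4 repeats every transaction once per social profile row, a plain random split can place copies of the same transaction in both sets. The following is a minimal sketch of a group-aware alternative using `GroupShuffleSplit`, on a toy frame with made-up values. Grouping by `transaction_id` is an assumption about what should count as one unit; it is not something the notebook does:

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy stand-in for merged_df: 10 transactions, each repeated 3 times.
transaction_ids = [tid for tid in range(1001, 1011) for _ in range(3)]
toy = pd.DataFrame({
    "transaction_id": transaction_ids,
    "engagement_score": list(range(30)),
    "product_category": [tid % 2 for tid in transaction_ids],
})
X_toy = toy.drop(columns=["product_category"])
y_toy = toy["product_category"]

# Keep all rows of a transaction on the same side of the split.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(
    splitter.split(X_toy, y_toy, groups=toy["transaction_id"])
)

X_train_g, X_test_g = X_toy.iloc[train_idx], X_toy.iloc[test_idx]
y_train_g, y_test_g = y_toy.iloc[train_idx], y_toy.iloc[test_idx]

# No transaction should appear on both sides.
shared = set(X_train_g["transaction_id"]) & set(X_test_g["transaction_id"])
print("Transactions in both sets:", len(shared))
```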


### 8. Train the Random Forest Classifier Model
This cell initializes a `RandomForestClassifier` with 150 estimators and a fixed `random_state` for consistent results. It then trains the model using the training data (`X_train`, `y_train`).

In [56]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=150, random_state=42)
model.fit(X_train, y_train)
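
Once a forest is fitted, its `feature_importances_` attribute gives a rough view of which columns it relies on. Here is a minimal sketch on synthetic data; the column names `signal` and `noise` are hypothetical and chosen only for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
n = 300
labels = rng.integers(0, 2, size=n)

# Synthetic features: "signal" tracks the label closely, "noise" is unrelated.
X_toy = pd.DataFrame({
    "signal": labels + rng.normal(0, 0.1, size=n),
    "noise": rng.normal(0, 1, size=n),
})

toy_model = RandomForestClassifier(n_estimators=50, random_state=42)
toy_model.fit(X_toy, labels)

# Importances sum to 1; larger values mean heavier use in the trees' splits.
importances = pd.Series(toy_model.feature_importances_, index=X_toy.columns)
print(importances.sort_values(ascending=False))
```

The same pattern should apply to the notebook's model by pairing `model.feature_importances_` with `X_train.columns`.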


### 9. Verify Model Feature Order
This cell prints the list of column names that the trained model expects as input features. This is crucial for ensuring that new data used for prediction has the features in the correct order.

In [59]:
# Check the columns your model expects
print(X_train.columns.tolist())


['social_media_platform', 'engagement_score', 'purchase_interest_score', 'review_sentiment', 'transaction_id', 'purchase_amount', 'purchase_date', 'customer_rating']


### 10. Evaluate Model Performance
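This cell uses the trained model to make predictions on the test set (`X_test`). It then prints a `classification_report`, which gives precision, recall, f1-score, and support for each product category, along with overall accuracy. Every metric in the report below is 1.00. Because the cross-join in step 4 repeats each transaction across many rows and `transaction_id` is kept as a feature, copies of the same transaction can appear in both the training and test sets, so these scores likely overstate how well the model generalizes to unseen customers.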

In [57]:
from sklearn.metrics import classification_report

y_pred = model.predict(X_test)
print("Model Evaluation Report:\n")
print(classification_report(y_test, y_pred))


Model Evaluation Report:

              precision    recall  f1-score   support

           0       1.00      1.00      1.00       765
           1       1.00      1.00      1.00       865
           2       1.00      1.00      1.00      1027
           3       1.00      1.00      1.00       893
           4       1.00      1.00      1.00      1100

    accuracy                           1.00      4650
   macro avg       1.00      1.00      1.00      4650
weighted avg       1.00      1.00      1.00      4650
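
The report above labels classes by their integer codes (0 to 4). As a sketch, passing `target_names` should print the original category names instead; the toy labels below are made up, and in the notebook the names would come from `le_target.classes_`:

```python
from sklearn.metrics import classification_report
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for product_category; values are illustrative only.
true_labels = ["Sports", "Electronics", "Electronics", "Sports", "Electronics", "Sports"]
pred_labels = ["Sports", "Electronics", "Sports", "Sports", "Electronics", "Electronics"]

le_toy = LabelEncoder()
y_true = le_toy.fit_transform(true_labels)
y_hat = le_toy.transform(pred_labels)

# target_names replaces the integer codes with readable names;
# classes_ is already in code order, so the names line up with the codes.
report = classification_report(
    y_true, y_hat, target_names=le_toy.classes_.tolist()
)
print(report)
```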



### 11. Predict Product Category for New Data
This cell demonstrates how to use the trained model to predict the product category for a new data point. It builds a one-row DataFrame `new_data`, encodes its categorical features with the fitted encoders stored in `le_dict`, and sets `transaction_id` and `purchase_date` to placeholder values of 0. It then reorders the columns to match the training data, predicts the encoded category, and inverse-transforms the prediction back to its original label with `le_target`.

In [61]:
new_data = pd.DataFrame([{
    'social_media_platform': le_dict['social_media_platform'].transform(['Instagram'])[0],
    'engagement_score': 78,
    'purchase_interest_score': 90,
    'review_sentiment': le_dict['review_sentiment'].transform(['Positive'])[0],
    'transaction_id': 0,        # placeholder
    'purchase_amount': 50,
    'purchase_date': 0,         # placeholder
    'customer_rating': 5
}])

# Ensure column order matches X_train
new_data = new_data[X_train.columns]

# Predict
prediction_encoded = model.predict(new_data)
prediction = le_target.inverse_transform(prediction_encoded)
print("Predicted Product Category:", prediction[0])


Predicted Product Category: Sports
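
`LabelEncoder.transform` raises a `ValueError` for any label that was not present when the encoder was fitted, so the cell above only works for platform and sentiment values seen in training. The following sketch wraps that check in a small helper; `encode_known` is a hypothetical name introduced for illustration, and the platform names are made up:

```python
from sklearn.preprocessing import LabelEncoder


def encode_known(encoder, value):
    """Return the integer code for value, or raise listing the known labels.

    Hypothetical helper for illustration; not part of the notebook or scikit-learn.
    """
    known = encoder.classes_.tolist()
    if value not in known:
        raise ValueError(f"Unseen label {value!r}; known labels: {known}")
    return int(encoder.transform([value])[0])


# Toy encoder fitted on made-up platform names.
platform_encoder = LabelEncoder().fit(["Facebook", "Instagram", "LinkedIn"])

print(encode_known(platform_encoder, "Instagram"))

# A label outside the fitted set is rejected with a readable message.
try:
    encode_known(platform_encoder, "Threads")
except ValueError as exc:
    print(exc)
```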


### 12. Save Model and Encoders
This cell saves the trained model and label encoders to disk for later use in the product prediction system. The model is saved as `product_model.pkl` and the encoders (including both feature encoders and target encoder) are saved as `encoders.pkl`.

In [None]:
import pickle
import os

# Create models directory if it doesn't exist
os.makedirs('../models', exist_ok=True)

# Save the trained model
with open('../models/product_model.pkl', 'wb') as f:
    pickle.dump(model, f)

# Save all encoders (feature encoders + target encoder)
encoders = {
    'features': le_dict,
    'target': le_target
}
with open('../models/encoders.pkl', 'wb') as f:
    pickle.dump(encoders, f)

print("Model saved to: ../models/product_model.pkl")
print("Encoders saved to: ../models/encoders.pkl")
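
For completeness, here is a sketch of the reverse step: loading the pickled model and encoders and using them together. It uses a toy model, made-up data, and a temporary directory so that it runs on its own; the real files would live under `../models/` as saved above:

```python
import os
import pickle
import tempfile

from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

# Toy model and target encoder with made-up data, standing in for the real artifacts.
le_target_toy = LabelEncoder().fit(["Electronics", "Sports"])
y_toy = le_target_toy.transform(["Sports", "Electronics", "Sports", "Electronics"])
toy_model = RandomForestClassifier(n_estimators=10, random_state=42)
toy_model.fit([[10, 1], [20, 0], [30, 1], [40, 0]], y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "product_model.pkl")
    encoders_path = os.path.join(tmp_dir, "encoders.pkl")

    # Save using the same layout as the cell above.
    with open(model_path, "wb") as f:
        pickle.dump(toy_model, f)
    with open(encoders_path, "wb") as f:
        pickle.dump({"features": {}, "target": le_target_toy}, f)

    # Load both artifacts back.
    with open(model_path, "rb") as f:
        loaded_model = pickle.load(f)
    with open(encoders_path, "rb") as f:
        loaded_encoders = pickle.load(f)

# Predict with the loaded model and map the code back to a label.
sample = [[10, 1]]
predicted_code = loaded_model.predict(sample)
restored_label = loaded_encoders["target"].inverse_transform(predicted_code)[0]
same_prediction = bool((predicted_code == toy_model.predict(sample)).all())
print(restored_label, same_prediction)
```

Pickle files should only be loaded from trusted sources, and a pickled scikit-learn model is generally expected to be loaded with the same library version that saved it.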