## Name: Feven Belay Araya
##  ID: 20027

# 3A - Recommend based on likelihood of category purchase

# Logistic Regression

# 1. Transform Dataset to a Format that can be Used for Training a Logistic Regression Model

In [464]:

# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression


In [465]:
# Load the dataset
df = pd.read_csv('/content/e-shop-clothing-2008.csv', delimiter=';')
df

Unnamed: 0,year,month,day,order,country,session ID,page 1 (main category),page 2 (clothing model),colour,location,model photography,price,price 2,page
0,2008,4,1,1,29,1,1,A13,1,5,1,28,2,1
1,2008,4,1,2,29,1,1,A16,1,6,1,33,2,1
2,2008,4,1,3,29,1,2,B4,10,2,1,52,1,1
3,2008,4,1,4,29,1,2,B17,6,6,2,38,2,1
4,2008,4,1,5,29,1,2,B8,4,3,2,52,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
165469,2008,8,13,1,29,24024,2,B10,2,4,1,67,1,1
165470,2008,8,13,1,9,24025,1,A11,3,4,1,62,1,1
165471,2008,8,13,1,34,24026,1,A2,3,1,1,43,2,1
165472,2008,8,13,2,34,24026,3,C2,12,1,1,43,1,1


# Analyzing data

In [466]:
# Retrieve column names
column_names = df.columns

# Print column names
print(column_names)

Index(['year', 'month', 'day', 'order', 'country', 'session ID',
       'page 1 (main category)', 'page 2 (clothing model)', 'colour',
       'location', 'model photography', 'price', 'price 2', 'page'],
      dtype='object')


In [467]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 165474 entries, 0 to 165473
Data columns (total 14 columns):
 #   Column                   Non-Null Count   Dtype 
---  ------                   --------------   ----- 
 0   year                     165474 non-null  int64 
 1   month                    165474 non-null  int64 
 2   day                      165474 non-null  int64 
 3   order                    165474 non-null  int64 
 4   country                  165474 non-null  int64 
 5   session ID               165474 non-null  int64 
 6   page 1 (main category)   165474 non-null  int64 
 7   page 2 (clothing model)  165474 non-null  object
 8   colour                   165474 non-null  int64 
 9   location                 165474 non-null  int64 
 10  model photography        165474 non-null  int64 
 11  price                    165474 non-null  int64 
 12  price 2                  165474 non-null  int64 
 13  page                     165474 non-null  int64 
dtypes: int64(13), object

In [468]:
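# Rows recorded on page 5 of the shop. Treating these as completed transactions is an
# assumption: the `page` column appears to hold a page number (1 to 5), not a purchase flag.
completed_transactions = df[df['page'] == 5]
completed_transactions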

Unnamed: 0,year,month,day,order,country,session ID,page 1 (main category),page 2 (clothing model),colour,location,model photography,price,price 2,page
8,2008,4,1,9,29,1,4,P82,6,4,2,48,1,5
24,2008,4,1,6,21,3,4,P77,7,2,1,43,1,5
478,2008,4,1,7,29,76,4,P80,7,3,1,28,2,5
479,2008,4,1,8,29,76,4,P78,14,2,2,48,1,5
480,2008,4,1,9,29,76,4,P76,2,2,2,43,1,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
165138,2008,8,12,9,29,23993,4,P77,7,2,1,43,1,5
165396,2008,8,13,9,29,24014,4,P73,11,1,2,33,2,5
165397,2008,8,13,10,29,24014,4,P77,7,2,1,43,1,5
165431,2008,8,13,27,29,24018,4,P82,6,4,2,48,1,5


In [469]:
# Create a binary target variable for blouses (category = 3)
df['is_blouse'] = (df['page 1 (main category)'] == 3).astype(int)

In [470]:
# Select features for the model
features = ['month', 'day', 'country', 'colour', 'location', 'model photography', 'price', 'price 2']
X = df[features]
y = df['is_blouse']

In [471]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [472]:
# Define categorical and numerical features for preprocessing
categorical_features = ['country', 'colour', 'location', 'model photography', 'price 2']
numerical_features = ['month', 'day', 'price']

In [473]:
# Setup preprocessing for categorical variables: one-hot encoding
# Setup preprocessing for numerical variables: standardization
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ])


In [474]:
# Create a pipeline with a logistic regression model
lr_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('classifier', LogisticRegression(solver='liblinear'))])

In [475]:
# Train the logistic regression model
lr_pipeline.fit(X_train, y_train)

In [476]:

# Evaluate the fitted pipeline on the held-out test set
from sklearn.metrics import classification_report
y_pred = lr_pipeline.predict(X_test)
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       0.86      0.93      0.89     25429
           1       0.68      0.49      0.57      7666

    accuracy                           0.83     33095
   macro avg       0.77      0.71      0.73     33095
weighted avg       0.82      0.83      0.82     33095



In [477]:
from sklearn.metrics import accuracy_score, recall_score, precision_score

# Headline metrics for the positive class (is_blouse = 1), using y_test and y_pred from the previous cell
accuracy = accuracy_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)

print(f"Accuracy: {accuracy:.2f}, Recall: {recall:.2f}, Precision: {precision:.2f}")

# Monitor and log these metrics over time to evaluate model performance
# Consider integrating A/B testing to compare model versions and updates


Accuracy: 0.83, Recall: 0.49, Precision: 0.68
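
The recall of 0.49 for the positive class reflects the default decision threshold of 0.5 used by `predict`. If catching more blouse cases matters more than precision, one hedged option is to threshold `predict_proba` at a lower value. The sketch below uses synthetic data from `make_classification` rather than the clickstream data, and the 0.3 threshold is an arbitrary illustrative choice.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (roughly 25% positives); not the real dataset
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=8, weights=[0.75, 0.25], random_state=42)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

demo_model = LogisticRegression(solver="liblinear").fit(Xd_train, yd_train)
proba = demo_model.predict_proba(Xd_test)[:, 1]  # estimated P(class = 1)

# Compare the default threshold with a lower, illustrative one
for threshold in (0.5, 0.3):
    pred = (proba >= threshold).astype(int)
    print(f"threshold={threshold}: "
          f"recall={recall_score(yd_test, pred):.2f}, "
          f"precision={precision_score(yd_test, pred, zero_division=0):.2f}")
```

Lowering the threshold can only keep or raise recall, usually at the cost of precision. The appropriate value depends on how costly a missed recommendation is compared with an irrelevant one.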


# 2. Provide rationale for selected vs dropped features
# Selected Features:
- month, day: These temporal features could capture seasonal trends or specific days when blouses are more likely to be purchased, reflecting consumer buying patterns.
- country: Geographic information can be relevant because fashion preferences can vary significantly across regions.
- colour, location: `colour` is an attribute of the clothing item. `location` takes values 1 to 6 in this data and appears to encode where the item's photo sits on the page rather than a store location. Both might influence whether a customer engages with a blouse.
- model photography: How the item is photographed could affect its visual appeal and thus the likelihood of purchase.
- price, price 2: The cost of the item and a related two-level price indicator could significantly affect purchasing decisions, as customers may have specific budgets or price sensitivities.
# Dropped Features:
- year: Dropped because the dataset only contains data for 2008, making it a constant feature with no variability to contribute to the model's predictive power.
- order: This feature likely represents the sequence of page views or actions within a session and may not directly influence the decision to purchase a specific category, making it less relevant for predicting the purchase of blouses.
- page 2 (clothing model): Although this feature provides detailed information about the specific clothing model viewed, it was excluded to keep the model focused on category-level predictions. In the rows shown above, the letter prefix of the model code tracks the main category (A for 1, B for 2, C for 3, P for 4), so including it would also leak the target and encourage overfitting.
- session ID: It identifies a browsing session rather than describing the item or the customer, so it was excluded. It would only be useful as a feature if the task involved session-based recommendations or behavior analysis.
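
The `year` argument above can be checked mechanically. A minimal sketch, using a small made-up frame in place of `df`, lists the columns that hold a single distinct value:

```python
import pandas as pd

# Hypothetical stand-in for df; values are illustrative only
sample = pd.DataFrame({
    "year": [2008, 2008, 2008, 2008],
    "month": [4, 4, 8, 8],
    "price": [28, 33, 67, 62],
})

# Columns with exactly one distinct value carry no signal for a model
constant_columns = [col for col in sample.columns if sample[col].nunique() == 1]
print(constant_columns)  # ['year']
```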


# 3. Discuss limitations of using predictive models for recommending next category to a customer
1. Data Quality and Representation: The model's effectiveness relies heavily on the quality and comprehensiveness of the data. If important features that influence purchase decisions are missing or incorrectly represented, the model's recommendations may not be accurate.

2. User Behavior Complexity: Logistic regression assumes a linear relationship between the features and the log-odds of the outcome. User behavior and purchase decisions can be driven by non-linear effects and interactions between features, which logistic regression does not capture unless they are engineered in explicitly.

3. Changing Preferences and Trends: User preferences and trends can change over time, which a static model trained on historical data will not capture. Without continuous updates, the model's recommendations may become less relevant.

4. Bias and Fairness: The model might inherit biases present in the training data, leading to unfair recommendations for certain user groups. Ensuring fairness and mitigating bias requires deliberate effort beyond fitting the model.

5. Overfitting and Generalization: There is a risk of overfitting the model to the training data, making it less effective at generalizing to new, unseen data. This can limit the model's utility in providing accurate recommendations across diverse user sessions.
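
Point 2 can be illustrated with a small sketch on synthetic data (not the clickstream data). The label depends only on the interaction of two features, so a plain logistic regression stays close to chance, while adding an explicit interaction term via `PolynomialFeatures` recovers the pattern. The data-generating rule is an assumption chosen to make the effect visible.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X_syn = rng.uniform(-1, 1, size=(2000, 2))
# Label is 1 when both features share the same sign (an XOR-like pattern)
y_syn = (X_syn[:, 0] * X_syn[:, 1] > 0).astype(int)

# Model 1: linear in the two raw features only
plain = LogisticRegression().fit(X_syn, y_syn)

# Model 2: same classifier, with the product x1 * x2 added as a third feature
with_interaction = make_pipeline(
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    LogisticRegression(),
).fit(X_syn, y_syn)

acc_plain = plain.score(X_syn, y_syn)
acc_interaction = with_interaction.score(X_syn, y_syn)
print(f"plain: {acc_plain:.2f}, with interaction term: {acc_interaction:.2f}")
```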

# 3B - Recommend based on association rules

In [478]:
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

In [479]:
# Load the dataset
df = pd.read_csv('/content/e-shop-clothing-2008.csv', delimiter=';')
df



# Association Rule

# Step 1: Transform the Clickstream Dataset into a Transaction Format

In [480]:
# Group the viewed main categories by session: one list (a "transaction") per session ID
transactions = df.groupby('session ID')['page 1 (main category)'].apply(list).reset_index(name='transaction')

In [481]:
# One-hot encode the transactions: one boolean column per category, one row per session
te = TransactionEncoder()
te_ary = te.fit(transactions['transaction']).transform(transactions['transaction'])
transaction_df = pd.DataFrame(te_ary, columns=te.columns_)

# Step 2: Mine Frequent Itemsets
1. Convert Transactions to One-hot Encoded Format: Most algorithms require the dataset to be in a one-hot encoded format where each column represents an item, and each row represents a transaction. A value of 1 indicates the item is present in the transaction, and 0 indicates it is not.

2. Apply the Apriori Algorithm: Use the one-hot encoded format to identify frequent itemsets with a specified minimum support threshold.
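
As a cross-check of the one-hot format described above, the sketch below builds the same kind of session-by-category boolean table with plain pandas (`pd.crosstab`) instead of `TransactionEncoder`. The three toy sessions are made up for illustration.

```python
import pandas as pd

# Hypothetical clicks: session 1 views categories 1 and 2, session 2 views 3, session 3 views 1
clicks = pd.DataFrame({
    "session ID": [1, 1, 1, 2, 2, 3],
    "page 1 (main category)": [1, 2, 2, 3, 3, 1],
})

# Count clicks per (session, category), then reduce the counts to presence/absence
onehot = pd.crosstab(clicks["session ID"], clicks["page 1 (main category)"]).astype(bool)
print(onehot)
```

Unlike a frame built from the `TransactionEncoder` array with a default index, this table is indexed by `session ID`, which makes later lookups by session straightforward.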

In [482]:
# Mine frequent itemsets
frequent_itemsets = apriori(transaction_df, min_support=0.01, use_colnames=True)

# Step 3: Generate Association Rules

To generate association rules, we need to specify a metric (e.g., confidence, lift) and a minimum threshold for that metric. Confidence is the share of transactions containing the antecedent that also contain the consequent, so it measures how reliable the rule is. Lift indicates how much more often the antecedent and consequent occur together than would be expected if they were statistically independent. A lift value greater than 1 indicates a positive association between the antecedent and consequent.
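
The definitions above can be worked through by hand. This sketch computes support, confidence, and lift for a single rule {1} -> {2} on five made-up transactions, using only pandas. The numbers are illustrative and unrelated to the real dataset.

```python
import pandas as pd

# Hypothetical one-hot transactions: columns are categories, rows are sessions
basket = pd.DataFrame({
    1: [True, True, True, False, False],
    2: [True, True, False, True, False],
    3: [False, False, False, True, True],
})

support_a = basket[1].mean()                 # P(antecedent) = 3/5
support_c = basket[2].mean()                 # P(consequent) = 3/5
support_ac = (basket[1] & basket[2]).mean()  # P(both)       = 2/5

confidence = support_ac / support_a          # P(consequent | antecedent) = 2/3
lift = confidence / support_c                # (2/3) / (3/5) = 10/9

print(f"support={support_ac:.2f}, confidence={confidence:.2f}, lift={lift:.2f}")
```

Because the lift is above 1, the two categories co-occur slightly more often in this toy data than independence would predict.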

In [483]:
# Generate association rules
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)


In [484]:
# Estimate potential additional revenue per category from the selected sessions
categories = {'Trousers': 1, 'Skirts': 2, 'Blouse': 3, 'Sale': 4}
for category_name, category_code in categories.items():
    # Rules whose consequent contains this category (kept for inspection; not used below)
    category_rules = rules[rules['consequents'].apply(lambda x: category_code in x)]

    # Select sessions for this category.
    # Caveat: transaction_df has a 0-based positional index while session IDs start at 1,
    # so this isin() test is offset by one row: with contiguous IDs, session s is kept
    # when the row at position s (i.e. session s + 1) contains the category.
    missed_transactions = transactions[~transactions['session ID'].isin(transaction_df[transaction_df[category_code] == False].index)]

    # Average price of the items viewed in this category
    avg_category_price = df.loc[df['page 1 (main category)'] == category_code, 'price'].mean()

    # Potential additional revenue: number of selected sessions times the average price
    additional_revenue = missed_transactions.shape[0] * avg_category_price

    print(f"Category: {category_name}, Additional Revenue: {additional_revenue:.2f}")
    print(missed_transactions)

Category: Trousers, Additional Revenue: 611124.18
       session ID                  transaction
0               1  [1, 1, 2, 2, 2, 3, 3, 4, 4]
2               3           [2, 3, 3, 3, 3, 4]
5               6              [3, 3, 3, 2, 2]
7               8  [3, 2, 3, 3, 3, 3, 3, 3, 2]
8               9                    [1, 1, 2]
...           ...                          ...
24020       24021              [1, 1, 4, 1, 1]
24021       24022                    [1, 1, 1]
24023       24024                          [2]
24024       24025                          [1]
24025       24026                    [1, 3, 2]

[13082 rows x 2 columns]
Category: Skirts, Additional Revenue: 588201.07
       session ID                                        transaction
0               1                        [1, 1, 2, 2, 2, 3, 3, 4, 4]
1               2                     [2, 2, 2, 2, 1, 1, 2, 4, 4, 4]
4               5                                                [3]
5               6                   
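
For reference, one way to keep a per-category session filter aligned is to index the one-hot table by `session ID` and filter with a boolean mask. The minimal sketch below assumes that a "missed" session means one that never viewed the category, which is an interpretation rather than something the code above states. The toy sessions and prices are made up.

```python
import pandas as pd

# Hypothetical clickstream rows (not from the real dataset)
toy_df = pd.DataFrame({
    "session ID": [1, 1, 2, 3, 3, 4],
    "page 1 (main category)": [1, 2, 2, 3, 1, 4],
    "price": [30, 40, 50, 20, 34, 25],
})

# One-hot table indexed by session ID, so masks align by session rather than by position
viewed = pd.crosstab(toy_df["session ID"], toy_df["page 1 (main category)"]).astype(bool)

category_code = 1  # Trousers in the notebook's category mapping
missed_sessions = viewed.index[~viewed[category_code]]  # sessions that never viewed it
avg_price = toy_df.loc[toy_df["page 1 (main category)"] == category_code, "price"].mean()
potential_revenue = len(missed_sessions) * avg_price

print(list(missed_sessions), avg_price, potential_revenue)
```

This is an upper-bound style estimate: it assumes every such session would convert at the average category price.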

# 4. Discuss the difference and similarities between 3A and 3B approaches to recommending.

The tasks 3A and 3B represent two fundamentally different approaches to making recommendations based on a dataset of e-commerce clothing transactions. Here's a detailed comparison focusing on their differences and similarities:

# 3A: Logistic Regression for Category Purchase Likelihood
Approach: Utilizes a logistic regression model to predict the likelihood that a customer will purchase an item from a specific category (e.g., blouses) based on various features (e.g., country, color, price).

# Differences:

1. Model-based Approach: It relies on a predictive model that uses historical data to estimate the probability of future events (i.e., purchases).
2. Feature Dependency: The effectiveness of the model depends on the selection and relevance of input features used to predict the outcome.
3. Requires Numerical Transformation: Categorical data must be transformed (e.g., via one-hot encoding) to be used as input for the model.
4. Binary Outcome: The logistic regression model specifically addresses binary outcomes (purchase or no purchase).

# Similarities:

1. Data-Driven: Both approaches rely on historical transaction data to make recommendations.
2. Customization Potential: Each method can be tailored or adjusted based on the dataset's characteristics, such as varying the features or the minimum support and confidence thresholds.

# 3B: Association Rules for Recommending
Approach: Employs association rule mining to discover relationships between different items within transactions. This method identifies itemsets that frequently occur together and generates rules that suggest if a customer buys item(s) X, they are likely to buy item Y.

# Differences:

1. Rule-based Approach: It does not predict outcomes but identifies patterns of item co-occurrence in transactions, which are used to make recommendations.
2. No Need for Feature Selection: This method directly works with transactional data without the need for selecting and transforming features.
3. Itemset Focus: The focus is on finding associations between items rather than predicting a specific outcome.
4. Versatility in Recommendations: Can recommend multiple items and is not limited to binary outcomes.

# Similarities:

1. Data-Driven: Both methods utilize historical transaction data to inform recommendations.
2. Dependent on Data Quality: The effectiveness of both approaches is influenced by the quality and comprehensiveness of the transaction data.
3. Customization Potential: Parameters such as support, confidence, and lift in association rules, and feature selection in logistic regression, offer customization based on specific goals or dataset peculiarities.

# Summary:
- Logistic Regression is a predictive modeling approach that estimates the probability of a specific outcome (e.g., purchase) based on selected features. It is well-suited for scenarios where the goal is to understand or predict customer behavior based on specific attributes.

- Association Rule Mining is a pattern-finding method that identifies items frequently purchased together. This approach is ideal for discovering relationships between items that might not be intuitively obvious, providing a basis for cross-selling and upselling strategies.