# Project data model
In the Project data model, we will present the details for the data preparation, model training and model evaluation.

## Data Preparation
The primary objective during the data preparation phase is to construct a Utility Matrix that will serve as the foundation for our recommendation system. To achieve this, we will generate explicit ratings derived from the amount of time a user spends on a particular item. This approach assumes that the duration of interaction directly correlates with user preference, thereby allowing us to quantify interest levels in a meaningful way. The utility Matrix will be built using the following structure:




| UserID    | ItemID    | Session_Duration |
|-----------|-----------|------------------|
| User_1    | Item_A    | 120              |
| User_1    | Item_B    | 60               |
| User_2    | Item_A    | 45               |
| User_2    | Item_C    | 30               |
| User_3    | Item_B    | 85               |
| User_3    | Item_C    | 90               |

### Step 1: Importing the data set

#### Feature Selection Details

In this step, we  selected specific features from the BigQuery database by running an SQL query. These features are pivotal for our analysis/model, offering insights into user behavior, session details, and product interactions. Below is an overview of the selected features and their significance:

1. **`fullVisitorId` | User ID**: A unique identifier for each user visiting the website. 

2. **`visitNumber` | Session/Visit Number**: Indicates the ordinal number of the user's visit. For example, the first visit is 1, the second visit is 2, and so on.

3. **`hits.eCommerceAction.action_type` | Ecommerce Action Type**: Categorizes the type of interaction a user had, such as viewing an item list (1), viewing a specific item (2), etc.

4. **`hits.time` | Action Time**: Timestamp indicating when the action occurred.

5. **`hits.hitNumber` | Event Number Within a Session**: Sequential number of the event/action within a session, starting from 1.

6. **`prod.productSKU` | Product ID**: Unique identifier for each product that was interacted with.



#### Data Retrieval Process


During this phase, we initiated the retrieval of raw data from the BigQuery public dataset. To conduct a preliminary evaluation of our analytical model, we opted to extract a dataset entries from a singular day's worth of data (August 1, 2017).

```sql
SELECT fullVisitorId, visitNumber, h.eCommerceAction.action_type, prod.productSKU, h.time, h.hitNumber
FROM bigquery-public-data.google_analytics_sample.ga_sessions_20170801, UNNEST(hits) as h, UNNEST(h.product) as prod
ORDER BY fullVisitorId ASC, visitNumber ASC, h.time ASC
```
The results from the executed query have been saved to the file located at `data/ga_sessions_20170801.csv`.

### Step 2: Data Preprocessing

#### Initialize Spark and necessary imports

In [154]:
from pyspark.sql import SparkSession, functions as F, Window
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Initialize Spark session
def init_spark():
    return SparkSession \
        .builder \
        .appName("GA360RECOMMENDER") \
        .getOrCreate()

spark = init_spark()

#### Defining the Sechma and loading the data into Spark DataFrame

In [155]:
schema = StructType([
    StructField("fullVisitorId", StringType(), True),
    StructField("visitNumber", IntegerType(), True),
    StructField("action_type", StringType(), True),
    StructField("productSKU", StringType(), True),
    StructField("time", IntegerType(), True),
    StructField("hitNumber", IntegerType(), True)
])

df = spark.read.csv("../data/ga_sessions_20170801.csv", header=True, schema=schema)

df.show(5)
print("Total number of rows in the dataframe: ", df.count(), "row")

+-------------------+-----------+-----------+--------------+----+---------+
|      fullVisitorId|visitNumber|action_type|    productSKU|time|hitNumber|
+-------------------+-----------+-----------+--------------+----+---------+
|0004915997121163857|          1|          0|GGOEYFKQ020699|   0|        1|
|0004915997121163857|          1|          0|GGOEYDHJ056099|   0|        1|
|0004915997121163857|          1|          0|GGOEYHPB072210|   0|        1|
|0004915997121163857|          1|          0|GGOEYOCR077799|   0|        1|
|0004915997121163857|          1|          0|  GGOEGAAX0351|   0|        1|
+-------------------+-----------+-----------+--------------+----+---------+
only showing top 5 rows

Total number of rows in the dataframe:  47723 row


#### Calculate the Session Duration 

In [223]:
# Define window specification for calculating pageview durations
windowSpec = Window.partitionBy("fullVisitorId", "visitNumber").orderBy("time")

# Calculate the next hit's time and pageview duration
df_with_durations = df.withColumn("next_time", F.lead("time", 1).over(windowSpec)) \
                      .withColumn("pageview_duration", F.when(F.isnull(F.col("next_time") - F.col("time")), 1)
                                                          .otherwise(F.col("next_time") - F.col("time")))

# Filter for product detail views only 
prodview_durations = df_with_durations.filter(df_with_durations.action_type == '2') \
                                      .select("fullVisitorId", "visitNumber", "productSKU", "pageview_duration")

# Aggregate pageview durations by fullVisitorId and productSKU
aggregate_web_stats = prodview_durations.groupBy("fullVisitorId", "productSKU") \
                                        .agg(F.sum("pageview_duration").alias("session_duration"))

user_item_rating = aggregate_web_stats

# Display the aggregated results
user_item_rating.orderBy(user_item_rating.fullVisitorId.asc()).show(10)

+-------------------+--------------+----------------+
|      fullVisitorId|    productSKU|session_duration|
+-------------------+--------------+----------------+
|0049931492016965831|GGOEGEVA022399|            9821|
|0052381813974609729|GGOEAOCB077499|           14292|
|0052381813974609729|GGOEGOCB017499|            6931|
|0052381813974609729|GGOEGOCC077299|            4745|
| 008016723867009901|GGOEGESB015099|            1488|
| 008016723867009901|GGOEGBJL013999|            1419|
| 008016723867009901|GGOEGDHC074099|            1394|
| 008016723867009901|GGOEGESC014099|               0|
| 008016723867009901|GGOEGCKQ013199|               1|
| 008016723867009901|GGOEACCQ017299|               0|
+-------------------+--------------+----------------+
only showing top 10 rows



#### Normalization Functions

##### Z-Score Normalization
Normalizes the data by subtracting the mean rating and dividing by the standard deviation, for users or items.

In [167]:
from pyspark.sql.functions import mean, stddev

def z_score_normalization(df):
    """
    Apply Z-score normalization to the rating column of a DataFrame.
    """
    # Calculate the mean and standard deviation of session_duration
    mean_val = df.select(mean(df['session_duration'])).collect()[0][0]
    stddev_val = df.select(stddev(df['session_duration'])).collect()[0][0]

    # Apply Z-score normalization
    normalized_df = df.withColumn('normalized_duration', 
                   (df['session_duration'] - mean_val) / stddev_val)
    return normalized_df

normalized_df = z_score_normalization(aggregate_web_stats)
normalized_df.orderBy(normalized_df.normalized_duration.desc()).show(10)

+-------------------+--------------+----------------+-------------------+
|      fullVisitorId|    productSKU|session_duration|normalized_duration|
+-------------------+--------------+----------------+-------------------+
|0834628261584717467|  GGOEGAAX0325|         1527925| 14.694681589535893|
|0834628261584717467|  GGOEGAAX0686|         1470110| 14.131252184050535|
|0485797735449723544|GGOEGESB015199|         1172597| 11.231873602122572|
| 431781159932899381|GGOEGBRJ037299|          759499| 7.2060747515268755|
|7484497031611210287|GGOEYHPB072210|          748694|  7.100775871888905|
|5873059317509196502|  GGOEGAAX0104|          594894| 5.6019357341452745|
|2863022817351466072|GGOEYFKQ020699|          536689|  5.034705628700749|
|7641607978785523241|GGOEGGCX056199|          443459|  4.126143430769419|
|2827498353821012092|  GGOEGAAX0680|          427854| 3.9740667054801513|
|1933634293342529288|GGOEGDHQ015399|          370320|  3.413375753042297|
+-------------------+--------------+--

##### Min-Max Normalization: 
Scale the data within the range [0, 1]

In [171]:
from pyspark.sql.functions import  min, max

def min_max_normalization(df):
    """
    Apply Min-Max Normalization on a specified column of a PySpark DataFrame.
    """
    # Calculate the minimum and maximum values of the specified column
    column_min_max = df.select(min(df['session_duration']).alias("min"), max(df['session_duration']).alias("max")).collect()[0]
    min_value, max_value = column_min_max["min"], column_min_max["max"]
    
    # Apply Min-Max Normalization
    df = df.withColumn('normalized_duration', (df['session_duration'] - min_value) / (max_value - min_value))
    
    return df

min_max_normalized_df = min_max_normalization(aggregate_web_stats)
min_max_normalized_df.orderBy(min_max_normalized_df.normalized_duration.desc()).show(10)

+-------------------+--------------+----------------+-------------------+
|      fullVisitorId|    productSKU|session_duration|normalized_duration|
+-------------------+--------------+----------------+-------------------+
|0834628261584717467|  GGOEGAAX0325|         1527925|                1.0|
|0834628261584717467|  GGOEGAAX0686|         1470110| 0.9621611008393737|
|0485797735449723544|GGOEGESB015199|         1172597| 0.7674440826611254|
| 431781159932899381|GGOEGBRJ037299|          759499|0.49707871786900537|
|7484497031611210287|GGOEYHPB072210|          748694| 0.4900070356856521|
|5873059317509196502|  GGOEGAAX0104|          594894| 0.3893476446815125|
|2863022817351466072|GGOEYFKQ020699|          536689|0.35125349739025147|
|7641607978785523241|GGOEGGCX056199|          443459| 0.2902361045208371|
|2827498353821012092|  GGOEGAAX0680|          427854| 0.2800229068835185|
|1933634293342529288|GGOEGDHQ015399|          370320|0.24236791727342638|
+-------------------+--------------+--

##### Logarithmic Transformation
Apply the logarithm function to each data point, to reduce the impact of outliers and diminish the skewness of the original distribution, making the data more symmetrical.

In [172]:
from pyspark.sql.functions import log

def logarithmic_transformation(df):
    """
    Apply Logarithmic Transformation on a specified column of a PySpark DataFrame.
    """
    # Adding 1 to avoid log(0) which is undefined
    df = df.withColumn('normalized_duration', log(df['session_duration'] + 1))
    
    return df

logarithmic_df = logarithmic_transformation(aggregate_web_stats)
logarithmic_df.orderBy(logarithmic_df.normalized_duration.desc()).show(10)

[Stage 8758:>                                                       (0 + 1) / 1]

+-------------------+--------------+----------------+-------------------+
|      fullVisitorId|    productSKU|session_duration|normalized_duration|
+-------------------+--------------+----------------+-------------------+
|0834628261584717467|  GGOEGAAX0325|         1527925| 14.239421818216494|
|0834628261584717467|  GGOEGAAX0686|         1470110|  14.20084846610825|
|0485797735449723544|GGOEGESB015199|         1172597| 13.974732357899336|
| 431781159932899381|GGOEGBRJ037299|          759499| 13.540415601017965|
|7484497031611210287|GGOEYHPB072210|          748694| 13.526086969954191|
|5873059317509196502|  GGOEGAAX0104|          594894| 13.296140198366766|
|2863022817351466072|GGOEYFKQ020699|          536689| 13.193175925608253|
|7641607978785523241|GGOEGGCX056199|          443459| 13.002362885006988|
|2827498353821012092|  GGOEGAAX0680|          427854| 12.966539632116586|
|1933634293342529288|GGOEGDHQ015399|          370320| 12.822125476068756|
+-------------------+--------------+--

                                                                                

### Building the Utility Matrxix:

In [177]:
def user_user_utility_matrix(df):
    """
    Create a user-user utility matrix from a DataFrame.
    """
    # Pivot the DataFrame to create a user-user utility matrix
    utility_matrix = df.groupBy("fullVisitorId").pivot("productSKU").agg(F.first("normalized_duration"))
    return utility_matrix

def item_item_utility_matrix(df):
    """
    Create an item-item utility matrix from a DataFrame.
    """
    # Pivot the DataFrame to create an item-item utility matrix
    utility_matrix = df.groupBy("productSKU").pivot("fullVisitorId").agg(F.first("normalized_duration"))
    return utility_matrix

utility_matrix_user_user = user_user_utility_matrix(normalized_df)
utility_matrix_item_item = item_item_utility_matrix(normalized_df)
# Show the result
utility_matrix_user_user.show(5)
utility_matrix_item_item.show(5)

+-------------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------------+--------------+--------------+--------------+------------+------------+------------+------------+------------+------------+------------+-------------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+--------------------+------------+------------+------------+------------+------------+

[Stage 8939:>                                                       (0 + 1) / 1]

+--------------+-------------------+-------------------+------------------+-------------------+------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+------------------+-------------------+-------------------+------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+------------------+-----------

                                                                                

## Model Development

In this step, we will implement three collaborative filtering algorithms:
1. **Latent Factor Model**
2. **Item-Based Filtering**
3. **User-Based Filtering**

#### Utility Functions


In [231]:
from pyspark.ml.feature import StringIndexer
from pyspark.ml import Pipeline

def string_to_Numeric(df):
    '''
    Convert string item and user ids to numeric for ALS, since ALS only accepts numberical IDs
    '''
    userIndexer = StringIndexer(inputCol="fullVisitorId", outputCol="userId").setHandleInvalid("skip")
    itemIndexer = StringIndexer(inputCol="productSKU", outputCol="itemId").setHandleInvalid("skip")

    # Pipeline to apply the transformations
    pipeline = Pipeline(stages=[userIndexer, itemIndexer])

    # Fit and transform
    transformed_ratings_dataFrame = pipeline.fit(df).transform(df)

    transformed_ratings_dataFrame.orderBy(transformed_ratings_dataFrame.fullVisitorId).show()
    
    return transformed_ratings_dataFrame

def row_mean(row):
    '''
    This function must return the mean of the non-zero elements in the row.
    '''
    row = row[1:]
    non_NULL = [x for x in row if x != None]
    if(len(non_NULL) == 0):
        return 0
    return sum(non_NULL) / len(non_NULL)

def pearson_Correlation(row1, row2):
    '''
    This function must return the Pearson correlation between two rows.
    '''
    row1 = row1[1:]
    row2 = row2[1:]
    mean1 = row_mean(row1)
    mean2 = row_mean(row2)

    #subtract the mean from the row
    row1 = [(x1 - mean1) if x1 is not None else 0 for x1 in row1]
    row2 = [(x2 - mean2) if x2 is not None else 0 for x2 in row2]

    #calculate cossine similarity with centered rows
    numerator = sum([row1[i] * row2[i] for i in range(len(row1))])
    denominator = (sum([x ** 2 for x in row1]) ** 0.5) * (sum([x ** 2 for x in row2]) ** 0.5)

    if denominator == 0:
        return 0
    else:
        return numerator / denominator
    
ratings_dataframe = string_to_Numeric(normalized_df)


[Stage 14449:>                                                      (0 + 1) / 1]

+-------------------+--------------+----------------+--------------------+------+------+
|      fullVisitorId|    productSKU|session_duration| normalized_duration|userId|itemId|
+-------------------+--------------+----------------+--------------------+------+------+
|0049931492016965831|GGOEGEVA022399|            9821|-0.09982561767578439| 165.0|   8.0|
|0052381813974609729|GGOEAOCB077499|           14292| -0.0562540035285037|  49.0|  20.0|
|0052381813974609729|GGOEGOCB017499|            6931|-0.12798977891166924|  49.0|  21.0|
|0052381813974609729|GGOEGOCC077299|            4745| -0.1492931894520306|  49.0|  22.0|
| 008016723867009901|GGOEGESB015099|            1488| -0.1810339068033375|  12.0|  42.0|
| 008016723867009901|GGOEGESC014099|               0|-0.19553503895523947|  12.0|  74.0|
| 008016723867009901|GGOEGDHC074099|            1394|-0.18194997294734208|  12.0|  13.0|
| 008016723867009901|GGOEGBJL013999|            1419| -0.1817063383345749|  12.0|   9.0|
| 008016723867009901|

                                                                                

#### Latent Factor Model

##### Basic ALS Recommender System

In [28]:
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder


def basic_als_recommender(ratings_dataframe, seed):
    '''
    This function prints the RMSE of recommendations obtained
    through ALS collaborative filtering after hyperparameter tuning
    and returns the best model.
    
    The following parameters must be used in the ALS
    optimizer:
    - coldStartStrategy: 'drop'
    '''

    (trainingSet, testSet) = ratings_dataframe.randomSplit([0.8, 0.2], seed)

    #Build the recommendation model
    als = ALS(coldStartStrategy="drop", userCol="userId", itemCol="itemId", ratingCol="normalized_duration", seed=seed)

    param_grid = ParamGridBuilder()\
             .addGrid(als.rank, [30, 50, 70])\
             .addGrid(als.maxIter, [5, 10])\
             .addGrid(als.regParam, [0.01, 0.05, 0.15])\
             .build()
    
    evaluator = RegressionEvaluator(metricName="rmse", labelCol="normalized_duration", predictionCol="prediction")

    cv = CrossValidator(
        estimator=als,
        estimatorParamMaps=param_grid,
        evaluator=evaluator,
        numFolds=3)
    
    model = cv.fit(trainingSet)

    best_model = model.bestModel

    print('rank: ', best_model.rank)
    print('MaxIter: ', best_model._java_obj.parent().getMaxIter())
    print('RegParam: ', best_model._java_obj.parent().getRegParam())
    

    #Evaluate the model
    predictions = best_model.transform(testSet)

    rmse = evaluator.evaluate(predictions)

    print("Root-mean-square error = " + str(rmse))
    return best_model, trainingSet, testSet


(best_model, trainingSet, testSet) = basic_als_recommender(ratings_dataframe, 123)

rank:  50
MaxIter:  10
RegParam:  0.15
Root-mean-square error = 1.1808151204122308


##### Recommend Items with Basic ALS

In [45]:
from pyspark.sql.functions import explode, col


userRecs = best_model.recommendForAllUsers(50)  # Top-50 recommendations for each user

# Explode the recommendations to have one item per row for each user
userRecsExploded = userRecs.withColumn("rec_exp", explode("recommendations")).select(
    col("userId"),
    col("rec_exp.itemId").alias("recItemId"),
    col("rec_exp.rating").alias("predictedRating")
)

# Join the exploded recommendations with the actual ratings on userId and itemId
comparisonDf = userRecsExploded.join(
    ratings_dataframe,
    (userRecsExploded.userId == ratings_dataframe.userId) & (userRecsExploded.recItemId == ratings_dataframe.itemId),
    "left_outer"
).select(
    userRecsExploded.userId,
    userRecsExploded.recItemId,
    userRecsExploded.predictedRating,
    ratings_dataframe.normalized_duration.alias("actualRating")
).where(col("actualRating").isNotNull())

comparisonDf.show(100)


+------+---------+---------------+--------------------+
|userId|recItemId|predictedRating|        actualRating|
+------+---------+---------------+--------------------+
|     0|       34|   2.3718482E-4|0.047456378434225246|
|     0|       86|   1.1611577E-4| -0.1955252935707288|
|     1|       71|   0.0041744504| 0.17470186399026283|
|     1|      159|   0.0018805244|  0.0510329345496473|
|     2|       74|   1.9507561E-4| 0.02279081023767695|
|     3|       56|     0.02378653|  0.5136560826564605|
|     3|       13|   0.0031386532|-0.15921399088390975|
|     5|      169|   7.8896584E-4| 0.03170783706495538|
|     8|       15|   0.0044652536| 0.23103993184654323|
|     9|       27|   2.4136652E-4|-0.12527081663318762|
|    11|       59|   0.0010359924|  -0.170937688450266|
|    14|       17|    0.023564458|-0.16343374237703714|
|    14|        6|     0.01660411|  0.5595568437017954|
|    17|       41|    9.588975E-5|-0.13346668500667525|
|    18|       27|    6.689514E-4| 0.09876582788

##### ALS Recommender with Bias

In [50]:
from pyspark.sql.functions import mean

def global_average(ratings_df, seed):
    '''
    This function must print the global average rating for all users and
    all movies in the training set. Training and test
    sets should be determined as before (e.g: as in function basic_als_recommender).
    '''
    splits = ratings_df.randomSplit([0.8, 0.2], seed)
    training_set = splits[0]

    #now just return the average of the ratings in the training_set
    rating_average = training_set.agg({"normalized_duration": "avg"}).collect()[0][0]

    return rating_average



def als_with_bias_recommender(ratings_df, seed):
    '''
    This function must return the RMSE of recommendations obtained 
    using ALS + biases. Your ALS model should make predictions for *i*, 
    the user-item interaction, then you should recompute the predicted 
    rating with the formula *i+user_mean+item_mean-m* (*m* is the 
    global rating). The RMSE should compare the original rating column 
    and the predicted rating column.  Training and test sets should be 
    determined as before. Your ALS model should use the same parameters 
    as before and be initialized with the random seed passed as 
    parameter.
    '''

    splits = ratings_df.randomSplit([0.8, 0.2], seed)

    training_set = splits[0]
    test_set = splits[1]

    user_mean = training_set.groupby("userId").agg(mean("normalized_duration").alias("user_mean"))
    item_mean = training_set.groupby("itemId").agg(mean("normalized_duration").alias("item_mean"))

    training_set_with_means = training_set.join(user_mean, "userId").join(item_mean, "itemId")
    test_set_with_means = test_set.join(user_mean, "userId").join(item_mean, "itemId")

    global_mean = global_average(ratings_df, seed)

    final_training_set = training_set_with_means.withColumn("user_item_interaction", training_set_with_means.normalized_duration - (
                training_set_with_means.user_mean + training_set_with_means.item_mean - global_mean))


    #Build the recommendation model
    als = ALS(coldStartStrategy="drop", userCol="userId", itemCol="itemId", ratingCol="user_item_interaction", seed=seed)

    param_grid = ParamGridBuilder()\
             .addGrid(als.rank, [30, 50, 70])\
             .addGrid(als.maxIter, [5, 10])\
             .addGrid(als.regParam, [0.01, 0.05, 0.15])\
             .build()
    
    
    evaluator = RegressionEvaluator(metricName="rmse", labelCol="normalized_duration", predictionCol="prediction")

    cv = CrossValidator(
        estimator=als,
        estimatorParamMaps=param_grid,
        evaluator=evaluator,
        numFolds=3)
    
    model = cv.fit(final_training_set)

    best_model = model.bestModel

    print('rank: ', best_model.rank)
    print('MaxIter: ', best_model._java_obj.parent().getMaxIter())
    print('RegParam: ', best_model._java_obj.parent().getRegParam())

    # Evaluate the model by computing the RMSE on the test data
    predictions = best_model.transform(test_set_with_means)
    predictions = predictions.withColumn("prediction", predictions['prediction']+predictions['user_mean']+predictions['item_mean']-global_mean)

    rmse = evaluator.evaluate(predictions)

    print("Root-mean-square error = " + str(rmse))
    return best_model, trainingSet, testSet

(best_model, trainingSet, testSet) = als_with_bias_recommender(ratings_dataframe, 123)

rank:  50
MaxIter:  5
RegParam:  0.15
Root-mean-square error = 1.2883306372695404


##### Intermediate Results Discussion

the ALS RMSE = 1.1808151204122308

the ALS+biases RMSE = 1.2883306372695404

Implicit ratings, unlike explicit ratings (e.g., a 1-5 star rating), are derived from user behavior (such as session duration, page views, purchases) and not directly from user preferences. This means they are inherently noisy and less precise. Implicit signals might not always indicate preference but rather engagement or necessity. When biases based on these signals are introduced, they might amplify the noise or inaccuracies in the data, leading to a higher RMSE. 

In our specific case, a user may have the habit to leave their screen open on a product then going on to do another task. Or maybe they are convinced about a product that they like so much that they did not hesitate to buy without wasting much time on the details page. 

An interesting direction the project can head to is to try incorporating regularization techniques, to avoid overfitting if this is what is happening.

#### Item Based Collaborative filtering

In [181]:

def k_most_similar_items_with_ratings_rdd(utility_matrix, utility_matrix_t, item_id, userId, k):
    '''
    Calculates the top k items most similar to the specified item_id for a given user,
    based on Pearson correlation of ratings.
    '''
    # Fetch the row corresponding to the specified item_id
    item_row = utility_matrix.filter(utility_matrix.productSKU == item_id).collect()[0]
    
    # Calculate Pearson correlation for each item with the specified item, excluding the item itself
    pearson_correlation_rdd = utility_matrix.rdd.map(lambda row: (row[0], pearson_Correlation(item_row, row))).filter(lambda x: x[0] != item_id)
    
    # Remove items with no correlation TODO check if this is necessary
    # pearson_correlation_rdd  = pearson_correlation_rdd.filter(lambda x: x[1] is not None)
    
    # Fetch the user-specific ratings for all items
    user_ratings = utility_matrix_t.filter(utility_matrix_t.fullVisitorId == userId).collect()[0]

    # Map each item to its Pearson correlation and the user's rating for that item
    items_with_with_user_rating = pearson_correlation_rdd.map(lambda x: (x[0], x[1], user_ratings[x[0]]))
    
    # Remove items with no rating
    items_with_with_user_rating = items_with_with_user_rating.filter(lambda x: x[2] is not None)
    
    return items_with_with_user_rating.takeOrdered(k, key=lambda x: -x[1])


def item_item_recommender_rdd(utility_matrix, utility_matrix_t, itemId, userId, k):
    '''
    Predicts a rating for a specified item based on item-item similarity.
    '''
    # Find similar items and their ratings by the user
    similar_items_with_ratings = k_most_similar_items_with_ratings_rdd(utility_matrix, utility_matrix_t, itemId, userId, k)
    
    # Calculate the weighted sum of ratings (numerator) and sum of similarities (denominator)
    numerator = sum([item[1] * item[2] for item in similar_items_with_ratings]) # weighted sum
    denominator = sum([item[1] for item in similar_items_with_ratings]) # sum of similarities

    if denominator == 0:
        return 0
    else:
        return numerator / denominator
    
    

prediction = item_item_recommender_rdd(utility_matrix_item_item, utility_matrix_user_user,  "GGOEGAAX0596", '0167247604162700002', 500)
actual = ratings_dataframe.filter(ratings_dataframe.fullVisitorId == '0167247604162700002').filter(ratings_dataframe.productSKU == 'GGOEGAAX0596').collect()[0][3]

print("The predicted rating for the item is: ", prediction)
print("The actual rating for the item is: ", actual)

                                                                                

The predicted rating for the item is:  -0.17420239226134607
The actual rating for the item is:  -0.1724872045874652


#### User Based Collaborative Filtering

In [179]:

def k_most_similar_users_with_ratings_rdd(utility_matrix, utility_matrix_t, item_id, userId, k):
    '''
    Calculates the top k users most similar to the specified user for a given item,
    '''
    # Collect the row corresponding to the specified user
    user_row = utility_matrix.filter(utility_matrix.fullVisitorId == userId).collect()[0]
    
    # Calculate Pearson correlation for each user with the specified user, excluding the user themselves
    pearson_correlation_rdd = utility_matrix.rdd.map(lambda row: (row[0], pearson_Correlation(user_row, row))).filter(lambda x: x[0] != userId)

    # Remove users with no correlation TODO check if this is necessary
    # pearson_correlation_rdd  = pearson_correlation_rdd.filter(lambda x: x[1] is not None)

    # Collect the ratings for the specified item across users
    item_ratings = utility_matrix_t.filter(utility_matrix_t.productSKU == item_id).collect()[0]
    
    # Map each user to their Pearson correlation with the specified user and their rating for the specified item
    users_with_item_ratings = pearson_correlation_rdd.map(lambda x: (x[0], x[1], item_ratings[x[0]]))
    
    # Remove users with no rating
    users_with_item_ratings = users_with_item_ratings.filter(lambda x: x[2] is not None)
    
    # Return the top k similar users and their ratings
    return users_with_item_ratings.takeOrdered(k, key=lambda x: -x[1])


def user_user_recommender_rdd(utility_matrix, utility_matrix_t, itemId, userId, k):
    '''
    Predicts a rating for a specified item based on user-user similarity.
    '''
    # Calculate similarities and get ratings for the top k similar users
    similar_items_with_ratings = k_most_similar_users_with_ratings_rdd(utility_matrix, utility_matrix_t, itemId, userId, k)
    
    # Calculate the weighted sum of ratings (numerator) and sum of similarities (denominator)
    numerator = sum([item[1] * item[2] for item in similar_items_with_ratings]) # weighted sum of ratings
    denominator = sum([item[1] for item in similar_items_with_ratings]) # sum of similarities

    if denominator == 0:
        return 0
    else:
        return numerator / denominator
    
    

prediction = user_user_recommender_rdd( utility_matrix_user_user, utility_matrix_item_item, "GGOEGAAX0596", '0167247604162700002', 500)
actual = ratings_dataframe.filter(ratings_dataframe.fullVisitorId == '0167247604162700002').filter(ratings_dataframe.productSKU == 'GGOEGAAX0596').collect()[0][3]

print("The predicted rating for the item is: ", prediction)
print("The actual rating for the item is: ", actual)

                                                                                

The predicted rating for the item is:  0.03269441886722827
The actual rating for the item is:  -0.1724872045874652


### Normalization Function Selection: 
The goal is to identify which normalization technique—MinMax, Z-Score, or Log Transformation—optimizes the performance of the recommendation system.

In [232]:
import math

def evaluate_recommender_with_RMSE(function, utility_matrix,utility_matrix_t,k):
    '''
    This function must evaluate the recommender system using the specified function
    and return the RMSE of the predictions.
    '''
    user_item_pairs = [
        ('0167247604162700002', 'GGOEGAAX0596'),
        # ('0834628261584717467', 'GGOEGAAX0325'),
        ('008016723867009901', 'GGOEGBPB021199')
    ]
    predictions = []
    for user_id, product_sku in user_item_pairs:
        predicted_rating = function(
            utility_matrix, 
            utility_matrix_t,  
            product_sku, 
            user_id, 
            k
        )
        
        actual_rating = ratings_dataframe.filter(
            (ratings_dataframe.fullVisitorId == user_id) & 
            (ratings_dataframe.productSKU == product_sku)
        ).select('normalized_duration').collect()[0][0]  # Assuming 'normalized_duration' is the actual rating
        
        predictions.append((user_id, product_sku, actual_rating, predicted_rating))

    # Convert the list to a DataFrame for easier display 
    predictions_df = spark.createDataFrame(predictions, ["userId", "productSKU", "actual_rating", "predicted_rating"])
    predictions_df.show()

    squared_errors = [(pred[2] - pred[3]) ** 2 for pred in predictions]
    # Calculate RMSE
    rmse = math.sqrt(sum(squared_errors) / len(squared_errors))

    if(function == user_user_recommender_rdd):
        print(f"RMSE for User Based Recommender: {rmse}")
    else:
        print(f"RMSE for Item Based Recommender: {rmse}")


In [230]:
print("\n===============================================")
print("Evaluating the Z Score Normalization function")
print("===============================================")
z_score_normalized_df = z_score_normalization(user_item_rating)
utility_matrix_item_item = item_item_utility_matrix(z_score_normalized_df)
utility_matrix_user_user = user_user_utility_matrix(z_score_normalized_df)

evaluate_recommender_with_RMSE(item_item_recommender_rdd, utility_matrix_item_item, utility_matrix_user_user, 500) # Item Based Evaluation
evaluate_recommender_with_RMSE(user_user_recommender_rdd, utility_matrix_user_user, utility_matrix_item_item, 500) # User Based Evaluation

print("\n===============================================")
print("Evaluating the Min-Max Normalization function")
print("===============================================")
min_max_normalized_df = min_max_normalization(user_item_rating)
utility_matrix_item_item = item_item_utility_matrix(min_max_normalized_df)
utility_matrix_user_user = user_user_utility_matrix(min_max_normalized_df)

evaluate_recommender_with_RMSE(item_item_recommender_rdd, utility_matrix_item_item, utility_matrix_user_user, 500) # Item Based Evaluation
evaluate_recommender_with_RMSE(user_user_recommender_rdd, utility_matrix_user_user, utility_matrix_item_item, 500) # User Based Evaluation

print("\n===============================================")
print("Evaluating the Logarithmic Transformation function")
print("===============================================")
log_normalized_df = min_max_normalization(user_item_rating)
utility_matrix_item_item = item_item_utility_matrix(log_normalized_df)
utility_matrix_user_user = user_user_utility_matrix(log_normalized_df)

evaluate_recommender_with_RMSE(item_item_recommender_rdd, utility_matrix_item_item, utility_matrix_user_user, 500) # Item Based Evaluation
evaluate_recommender_with_RMSE(user_user_recommender_rdd, utility_matrix_user_user, utility_matrix_item_item, 500) # User Based Evaluation


Evaluating the Z Score Normalization function


                                                                                

+-------------------+--------------+--------------------+--------------------+
|             userId|    productSKU|       actual_rating|    predicted_rating|
+-------------------+--------------+--------------------+--------------------+
|0167247604162700002|  GGOEGAAX0596| -0.1724872045874652|-0.17420239226134607|
| 008016723867009901|GGOEGBPB021199|-0.18294400216743212|-0.19347412459520413|
+-------------------+--------------+--------------------+--------------------+

RMSE for Item Based Recommender: 0.007544048883076649


                                                                                

+-------------------+--------------+--------------------+--------------------+
|             userId|    productSKU|       actual_rating|    predicted_rating|
+-------------------+--------------+--------------------+--------------------+
|0167247604162700002|  GGOEGAAX0596| -0.1724872045874652| 0.03269441886722827|
| 008016723867009901|GGOEGBPB021199|-0.18294400216743212|-0.28593456492598723|
+-------------------+--------------+--------------------+--------------------+

RMSE for User Based Recommender: 0.1623369252832323

Evaluating the Min-Max Normalization function


                                                                                

+-------------------+--------------+--------------------+--------------------+
|             userId|    productSKU|       actual_rating|    predicted_rating|
+-------------------+--------------+--------------------+--------------------+
|0167247604162700002|  GGOEGAAX0596| -0.1724872045874652|0.001432661943485446|
| 008016723867009901|GGOEGBPB021199|-0.18294400216743212|1.384072785141305...|
+-------------------+--------------+--------------------+--------------------+

RMSE for Item Based Recommender: 0.17855991798647983


                                                                                

+-------------------+--------------+--------------------+--------------------+
|             userId|    productSKU|       actual_rating|    predicted_rating|
+-------------------+--------------+--------------------+--------------------+
|0167247604162700002|  GGOEGAAX0596| -0.1724872045874652|0.015327477330703599|
| 008016723867009901|GGOEGBPB021199|-0.18294400216743212|-0.00607106855636...|
+-------------------+--------------+--------------------+--------------------+

RMSE for User Based Recommender: 0.1824258608150262

Evaluating the Logarithmic Transformation function


                                                                                

+-------------------+--------------+--------------------+--------------------+
|             userId|    productSKU|       actual_rating|    predicted_rating|
+-------------------+--------------+--------------------+--------------------+
|0167247604162700002|  GGOEGAAX0596| -0.1724872045874652|0.001432661943485446|
| 008016723867009901|GGOEGBPB021199|-0.18294400216743212|1.384072785141305...|
+-------------------+--------------+--------------------+--------------------+

RMSE for Item Based Recommender: 0.17855991798647983


                                                                                

+-------------------+--------------+--------------------+--------------------+
|             userId|    productSKU|       actual_rating|    predicted_rating|
+-------------------+--------------+--------------------+--------------------+
|0167247604162700002|  GGOEGAAX0596| -0.1724872045874652|0.015327477330703599|
| 008016723867009901|GGOEGBPB021199|-0.18294400216743212|-0.00607106855636...|
+-------------------+--------------+--------------------+--------------------+

RMSE for User Based Recommender: 0.1824258608150262


#### Results:
Z-Score Normalization appears to perform exceptionally well with the item-based collaborative filtering approach, yielding a very low RMSE. This suggests that standardizing the ratings to have a mean of 0 and a standard deviation of 1 effectively captures the preference patterns across items, leading to accurate predictions. However, for the user-based approach, the performance is noticeably weaker, although still competitive.

Given these results, we chose Z-Score for User and Item Based Filtering

## Model Evalutation: User Based and Item Based

In [233]:
print("Evaluation for Item Based Recommender:")
evaluate_recommender_with_RMSE(item_item_recommender_rdd, utility_matrix_item_item, utility_matrix_user_user, 500)
print("Evaluation for User Based Recommender")
evaluate_recommender_with_RMSE(user_user_recommender_rdd, utility_matrix_user_user, utility_matrix_item_item, 500)

Evaluation for Item Based Recommender:


                                                                                

+-------------------+--------------+--------------------+--------------------+
|             userId|    productSKU|       actual_rating|    predicted_rating|
+-------------------+--------------+--------------------+--------------------+
|0167247604162700002|  GGOEGAAX0596| -0.1724872045874652|0.001432661943485446|
| 008016723867009901|GGOEGBPB021199|-0.18294400216743212|1.384072785141305...|
+-------------------+--------------+--------------------+--------------------+

RMSE for Item Based Recommender: 0.17855991798647983
Evaluation for User Based Recommender


                                                                                

+-------------------+--------------+--------------------+--------------------+
|             userId|    productSKU|       actual_rating|    predicted_rating|
+-------------------+--------------+--------------------+--------------------+
|0167247604162700002|  GGOEGAAX0596| -0.1724872045874652|0.015327477330703599|
| 008016723867009901|GGOEGBPB021199|-0.18294400216743212|-0.00607106855636...|
+-------------------+--------------+--------------------+--------------------+

RMSE for User Based Recommender: 0.1824258608150262
