# 📊 ***Data Science, CA3 - Task 3*** 📚

* **Member 1** : [Kasra Kashani, 810101490] 🆔
* **Member 2** : [Borna Foroohari, 810101480] 🆔

📄 **Subjects**: Machine Learning: Recommendation

## 🔹**Imports**

Import required modules.

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.ensemble import GradientBoostingRegressor
from xgboost import XGBRegressor
from catboost import CatBoostRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.feature_selection import SelectKBest, f_regression

## 📍 Movie Recommender System

In this task, our goal is to design and evaluate a movie recommendation system that predicts the `ratings` a user would give to a movie they have not yet rated.

### 🧠 Feature Understanding & Analysis

Below is a breakdown of all major features in the dataset, including their types, meanings, and the potential insights they provide for predicting the ratings a user would give to a movie:

| 🏷️ Feature Name   | 🧬 Type     | 💡 Description                              | 🔍 Key Insight                            |
|----------|----------|------------------------------------------|------------------------------------------------------------------|
| `id`       | numeric    | Unique row identifier                     | Not useful for modeling; just a row index                        |
| `user_id`  | numeric    | ID of the user who rated a movie         | Key for user-based collaborative filtering and personalization   |
| `item_id`  | numeric    | ID of the movie rated by the user        | Key for item-based filtering and popularity analysis             |
| `label`    | target  | Rating score given by the user (0.5–4.0) | Target variable for prediction; consider normalization if needed |

$\newline$

---

$\newline$

| 🏷️ Feature Name            | 🧬 Type   | 💡 Description                                             |  🔍 Key Insight                         |
|-------------------|--------|---------------------------------------------------------|---------------------------------------------------------------------------|
| `id`                | numeric  | Unique row identifier                                   | Not useful for modeling; just a row index                                 |
| `user_id_trustor`   | numeric  | User who expresses trust                                | Can be used to model trust graph or social regularization                 |
| `user_id_trustee`   | numeric  | User who is being trusted                               | Indicates who is trusted by whom; useful in user similarity estimation    |
| `trust_value`      | binary  | Trust level (always 1 in this dataset)                  | Binary relationship; indicates presence of trust between two users        |

First we read and load both train datasets into two seperated Pandas dataframes.

In [2]:
# Load the rate CSV file into a dataframe
df_rate = pd.read_csv("train_data_movie_rate.csv")

# Load the trust CSV file into a dataframe
df_trust = pd.read_csv("train_data_movie_trust.csv")

In [3]:
# Show the rate dataframe
df_rate

Unnamed: 0,id,user_id,item_id,label
0,1,1,1,2.0
1,2,1,2,4.0
2,3,1,3,3.5
3,4,1,4,3.0
4,5,1,5,4.0
...,...,...,...,...
34293,34294,1508,84,3.5
34294,34295,1508,17,4.0
34295,34296,1508,669,1.0
34296,34297,1508,686,2.5


In [4]:
df_rate.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34298 entries, 0 to 34297
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   id       34298 non-null  int64  
 1   user_id  34298 non-null  int64  
 2   item_id  34298 non-null  int64  
 3   label    34298 non-null  float64
dtypes: float64(1), int64(3)
memory usage: 1.0 MB


In [5]:
# Show the rate dataframe
df_trust

Unnamed: 0,id,user_id_trustor,user_id_trustee,trust_value
0,1,2,966,1
1,2,2,104,1
2,3,5,1509,1
3,4,6,1192,1
4,5,7,1510,1
...,...,...,...,...
1848,1849,1507,806,1
1849,1850,1507,361,1
1850,1851,1508,1187,1
1851,1852,1508,509,1


In [6]:
df_trust.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1853 entries, 0 to 1852
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   id               1853 non-null   int64
 1   user_id_trustor  1853 non-null   int64
 2   user_id_trustee  1853 non-null   int64
 3   trust_value      1853 non-null   int64
dtypes: int64(4)
memory usage: 58.0 KB


We see that there are just numeric features in both datasets. So we don't need to encode any categorical features because we don't have any categorical features.

In [7]:
# First see the count of unique values for each rate df column
for col in df_rate.columns:
    print(f"{col} -> {df_rate[col].unique().size}")

id -> 34298
user_id -> 1498
item_id -> 2071
label -> 8


In [8]:
# First see the count of unique values for each trust df column
for col in df_trust.columns:
    print(f"{col} -> {df_trust[col].unique().size}")

id -> 1853
user_id_trustor -> 609
user_id_trustee -> 732
trust_value -> 1


We use the movie trust dataframe and also the rating dataframe to create a dictionary from them.

In [9]:
# Dictionary for user's trustors
trustors = df_trust.groupby("user_id_trustor")["user_id_trustee"].apply(set).to_dict()

# Dictionary for user's trustees
trustees = df_trust.groupby("user_id_trustee")["user_id_trustor"].apply(set).to_dict()

# Dictionary for item's raters
item_raters = df_rate.groupby("item_id")["user_id"].apply(set).to_dict()

df_rate = df_rate.groupby(["user_id", "item_id"], as_index=False)["label"].mean()

First we extract some new features from each dataframe separately.

In [10]:
# Extract some statistical features for each user from the rating dataset
user_avg_rating = df_rate.groupby("user_id")["label"].mean().rename("user_avg_rating")
user_std_rating = df_rate.groupby("user_id")["label"].std().rename("user_std_rating")
user_min_rating = df_rate.groupby("user_id")["label"].min().rename("user_min_rating")
user_max_rating = df_rate.groupby("user_id")["label"].max().rename("user_max_rating")
user_count_rating = df_rate.groupby("user_id")["label"].count().rename("user_count_rating")

# Extract some statistical features for each item from the rating dataset
item_avg_rating = df_rate.groupby("item_id")["label"].mean().rename("item_avg_rating")
item_std_rating = df_rate.groupby("item_id")["label"].std().rename("item_std_rating")
item_min_rating = df_rate.groupby("item_id")["label"].min().rename("item_min_rating")
item_max_rating = df_rate.groupby("item_id")["label"].max().rename("item_max_rating")
item_count_rating = df_rate.groupby("item_id")["label"].count().rename("item_count_rating")

First we extract some new features from each dataframe separately using groupby.

In [11]:
# Join the dataframe and user based features on the `user_id`
df_rate = df_rate.join(user_avg_rating, on="user_id").fillna(-1)
df_rate = df_rate.join(user_std_rating, on="user_id").fillna(0)
df_rate = df_rate.join(user_min_rating, on="user_id").fillna(-1)
df_rate = df_rate.join(user_max_rating, on="user_id").fillna(-1)
df_rate = df_rate.join(user_count_rating, on="user_id").fillna(0)

# Join the dataframe and item based features on the `item_id`
df_rate = df_rate.join(item_avg_rating, on="item_id").fillna(-1)
df_rate = df_rate.join(item_std_rating, on="item_id").fillna(0)
df_rate = df_rate.join(item_min_rating, on="item_id").fillna(-1)
df_rate = df_rate.join(item_max_rating, on="item_id").fillna(-1)
df_rate = df_rate.join(item_count_rating, on="item_id").fillna(0)

In [12]:
# Show the dataframe
df_rate

Unnamed: 0,user_id,item_id,label,user_avg_rating,user_std_rating,user_min_rating,user_max_rating,user_count_rating,item_avg_rating,item_std_rating,item_min_rating,item_max_rating,item_count_rating
0,1,1,2.0,3.416667,0.668558,2.0,4.0,12,2.978339,0.855957,0.5,4.0,831
1,1,2,4.0,3.416667,0.668558,2.0,4.0,12,3.190286,0.830323,0.5,4.0,875
2,1,3,3.5,3.416667,0.668558,2.0,4.0,12,3.045519,0.821318,0.5,4.0,703
3,1,4,3.0,3.416667,0.668558,2.0,4.0,12,3.192969,0.907212,0.5,4.0,640
4,1,5,4.0,3.416667,0.668558,2.0,4.0,12,3.230030,0.802293,0.5,4.0,676
...,...,...,...,...,...,...,...,...,...,...,...,...,...
34290,1508,669,1.0,2.844828,1.018608,1.0,4.0,29,3.250000,1.500000,1.0,4.0,4
34291,1508,686,2.5,2.844828,1.018608,1.0,4.0,29,2.400000,0.894427,1.5,3.5,5
34292,1508,693,3.5,2.844828,1.018608,1.0,4.0,29,2.687500,1.066955,1.0,4.0,8
34293,1508,751,1.0,2.844828,1.018608,1.0,4.0,29,1.000000,0.500000,0.5,1.5,3


In [13]:
df_rate.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34295 entries, 0 to 34294
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   user_id            34295 non-null  int64  
 1   item_id            34295 non-null  int64  
 2   label              34295 non-null  float64
 3   user_avg_rating    34295 non-null  float64
 4   user_std_rating    34295 non-null  float64
 5   user_min_rating    34295 non-null  float64
 6   user_max_rating    34295 non-null  float64
 7   user_count_rating  34295 non-null  int64  
 8   item_avg_rating    34295 non-null  float64
 9   item_std_rating    34295 non-null  float64
 10  item_min_rating    34295 non-null  float64
 11  item_max_rating    34295 non-null  float64
 12  item_count_rating  34295 non-null  int64  
dtypes: float64(9), int64(4)
memory usage: 3.4 MB


Then, we can extracting some additional good combinational features from both dataframes and save them into the joined dataframe.

In [14]:
# Extracting some combinational features

df_rate["user_item_rating_ratio"] = df_rate["user_avg_rating"] / (df_rate["item_avg_rating"] + 1e-5)

df_rate["user_recent_vs_old_item_rating_diff"] = df_rate["user_avg_rating"] - df_rate["item_avg_rating"]

df_rate["is_user_stricted"] = (df_rate["user_avg_rating"] < 2.5).astype(int)

df_rate["rating_diff_from_user_avg"] = df_rate["label"] - df_rate["user_avg_rating"]

df_rate["rating_diff_from_item_avg"] = df_rate["label"] - df_rate["item_avg_rating"]

In [15]:
df_rate.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34295 entries, 0 to 34294
Data columns (total 18 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   user_id                              34295 non-null  int64  
 1   item_id                              34295 non-null  int64  
 2   label                                34295 non-null  float64
 3   user_avg_rating                      34295 non-null  float64
 4   user_std_rating                      34295 non-null  float64
 5   user_min_rating                      34295 non-null  float64
 6   user_max_rating                      34295 non-null  float64
 7   user_count_rating                    34295 non-null  int64  
 8   item_avg_rating                      34295 non-null  float64
 9   item_std_rating                      34295 non-null  float64
 10  item_min_rating                      34295 non-null  float64
 11  item_max_rating             

Also we create some functions to apply them on our dataframe.

In [16]:
# Define strictness of users in rating
stricts = df_rate.groupby("user_id")["label"].mean().rename("stricts")
is_user_strict = (stricts >= 2.5).astype(int).rename("is_user_strict")

# Functions for extracting new features

def user_rating_percentile(user_id, label):
    user_ratings = df_rate[df_rate["user_id"] == user_id]["label"]

    if len(user_ratings <= 1):
        return 0.5

    return (user_ratings < label).mean()

def trustors_strict(user_id):
    vals = [is_user_strict.get(user) for user in trustors.get(user_id, set()) if user in is_user_strict]

    if not vals:
        return -1

    return sum(vals) / len(vals)

def trustee_strict(user_id):
    vals = [is_user_strict.get(user) for user in trustees.get(user_id, set()) if user in is_user_strict]

    if not vals:
        return -1

    return sum(vals) / len(vals)

def trustor_strict_per_item(user_id, item_id):
    valids = (trustors.get(user_id, set())) & (item_raters.get(item_id, set()))

    vals = [is_user_strict.get(user) for user in valids if user in is_user_strict]

    if not vals:
        return -1

    return sum(vals) / len(vals)

def trustee_strict_per_item(user_id, item_id):
    valids = (trustees.get(user_id, set())) & (item_raters.get(item_id, set()))

    vals = [is_user_strict.get(user) for user in valids if user in is_user_strict]

    if not vals:
        return -1

    return sum(vals) / len(vals)

def trusted_rating_count(user_id, item_id):
    trusted_users = trustors.get(user_id, set())

    if not trusted_users:
        return 0 
    
    return df_rate[(df_rate["user_id"].isin(trusted_users)) & (df_rate["item_id"] == item_id)].shape[0]

def trustor_ratio(user_id, item_id):
    trusted_users = trustors.get(user_id, set())

    raters = item_raters.get(item_id, set())
    
    if not trusted_users:
        return 0.0
    
    return len(trusted_users & raters) / len(trusted_users)

def trustee_ratio(user_id, item_id):
    trusteed_users = trustees.get(user_id, set())

    raters = item_raters.get(item_id, set())

    if not trusteed_users:
        return 0.0
    
    return len(trusteed_users & raters) / len(trusteed_users)

In [17]:
# Apply the functions to our dataframe
df_rate["user_rating_percentile"] = df_rate.apply(lambda row: user_rating_percentile(row["user_id"], row["label"]), axis=1)

df_rate["trustors_strict"] = df_rate.apply(lambda row: trustors_strict(row["user_id"]), axis=1)

df_rate["trustee_strict"] = df_rate.apply(lambda row: trustee_strict(row["user_id"]), axis=1)

df_rate["trustor_strict_per_item"] = df_rate.apply(lambda row: trustor_strict_per_item(row["user_id"], row["item_id"]), axis=1)

df_rate["trustee_strict_per_item"] = df_rate.apply(lambda row: trustee_strict_per_item(row["user_id"], row["item_id"]), axis=1)

In [18]:
# Show the dataframe
df_rate

Unnamed: 0,user_id,item_id,label,user_avg_rating,user_std_rating,user_min_rating,user_max_rating,user_count_rating,item_avg_rating,item_std_rating,...,user_item_rating_ratio,user_recent_vs_old_item_rating_diff,is_user_stricted,rating_diff_from_user_avg,rating_diff_from_item_avg,user_rating_percentile,trustors_strict,trustee_strict,trustor_strict_per_item,trustee_strict_per_item
0,1,1,2.0,3.416667,0.668558,2.0,4.0,12,2.978339,0.855957,...,1.147168,0.438327,0,-1.416667,-0.978339,0.5,-1.0,-1.0,-1.0,-1.0
1,1,2,4.0,3.416667,0.668558,2.0,4.0,12,3.190286,0.830323,...,1.070956,0.226381,0,0.583333,0.809714,0.5,-1.0,-1.0,-1.0,-1.0
2,1,3,3.5,3.416667,0.668558,2.0,4.0,12,3.045519,0.821318,...,1.121863,0.371147,0,0.083333,0.454481,0.5,-1.0,-1.0,-1.0,-1.0
3,1,4,3.0,3.416667,0.668558,2.0,4.0,12,3.192969,0.907212,...,1.070056,0.223698,0,-0.416667,-0.192969,0.5,-1.0,-1.0,-1.0,-1.0
4,1,5,4.0,3.416667,0.668558,2.0,4.0,12,3.230030,0.802293,...,1.057779,0.186637,0,0.583333,0.769970,0.5,-1.0,-1.0,-1.0,-1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
34290,1508,669,1.0,2.844828,1.018608,1.0,4.0,29,3.250000,1.500000,...,0.875329,-0.405172,0,-1.844828,-2.250000,0.5,1.0,1.0,-1.0,-1.0
34291,1508,686,2.5,2.844828,1.018608,1.0,4.0,29,2.400000,0.894427,...,1.185340,0.444828,0,-0.344828,0.100000,0.5,1.0,1.0,1.0,-1.0
34292,1508,693,3.5,2.844828,1.018608,1.0,4.0,29,2.687500,1.066955,...,1.058537,0.157328,0,0.655172,0.812500,0.5,1.0,1.0,1.0,-1.0
34293,1508,751,1.0,2.844828,1.018608,1.0,4.0,29,1.000000,0.500000,...,2.844799,1.844828,0,-1.844828,0.000000,0.5,1.0,1.0,1.0,1.0


In [19]:
df_rate.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34295 entries, 0 to 34294
Data columns (total 23 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   user_id                              34295 non-null  int64  
 1   item_id                              34295 non-null  int64  
 2   label                                34295 non-null  float64
 3   user_avg_rating                      34295 non-null  float64
 4   user_std_rating                      34295 non-null  float64
 5   user_min_rating                      34295 non-null  float64
 6   user_max_rating                      34295 non-null  float64
 7   user_count_rating                    34295 non-null  int64  
 8   item_avg_rating                      34295 non-null  float64
 9   item_std_rating                      34295 non-null  float64
 10  item_min_rating                      34295 non-null  float64
 11  item_max_rating             

Here we implement some insightfull functions again to apply them to our dataframe for extracting new features.

In [20]:
# Functions for user's trustors for each item

def trustors_rating_std_per_item(user_id, item_id):
    trusted_users = trustors.get(user_id, set())

    if not trusted_users:
        return -1
    
    ratings = df_rate[(df_rate["user_id"].isin(trusted_users)) & (df_rate["item_id"] == item_id)]["label"]

    if ratings.empty:
        return -1 
    
    if len(ratings) == 1:
        return 0
    
    return ratings.std()

def trustors_rating_avg_per_item(user_id, item_id):
    trusted_users = trustors.get(user_id, set())

    if not trusted_users:
        return -1
    
    ratings = df_rate[(df_rate["user_id"].isin(trusted_users)) & (df_rate["item_id"] == item_id)]["label"]

    if ratings.empty:
        return -1
    
    return ratings.mean()

def trustors_rating_max_per_item(user_id, item_id):
    trusted_users = trustors.get(user_id, set())

    if not trusted_users:
        return -1
    
    ratings = df_rate[(df_rate["user_id"].isin(trusted_users)) & (df_rate["item_id"] == item_id)]["label"]

    if ratings.empty:
        return -1
    
    return ratings.max() 

def trustors_rating_min_per_item(user_id, item_id):
    trusted_users = trustors.get(user_id, set())

    if not trusted_users:
        return -1
    
    ratings = df_rate[(df_rate["user_id"].isin(trusted_users)) & (df_rate["item_id"] == item_id)]["label"]

    if ratings.empty:
        return -1
    
    return ratings.min() 

def trustors_count_per_item(user_id, item_id):
    trusted_users = trustors.get(user_id, set())

    if not trusted_users:
        return 0
    
    count = df_rate[(df_rate["user_id"].isin(trusted_users)) & (df_rate["item_id"] == item_id)]
    
    return count["user_id"].unique().size


# Functions for user's trustors totally

def trustors_rating_avg_totally(user_id):
    trusted_users = trustors.get(user_id, set())

    if not trusted_users:
        return -1
    
    ratings = df_rate[df_rate["user_id"].isin(trusted_users)]["label"]

    if ratings.empty:
        return -1
    
    return ratings.mean()

In [21]:
# Functions for user's trustees for each item

def trustees_rating_std_per_item(user_id, item_id):
    trusteed_users = trustees.get(user_id, set())

    if not trusteed_users:
        return -1
    
    ratings = df_rate[(df_rate["user_id"].isin(trusteed_users)) & (df_rate["item_id"] == item_id)]["label"]

    if ratings.empty:
        return -1
    
    if len(ratings) == 1:
        return 0
    
    return ratings.std()

def trustees_rating_avg_per_item(user_id, item_id):
    trusteed_users = trustees.get(user_id, set())

    if not trusteed_users:
        return -1
    
    ratings = df_rate[(df_rate["user_id"].isin(trusteed_users)) & (df_rate["item_id"] == item_id)]["label"]

    if ratings.empty:
        return -1
    
    return ratings.mean() 

def trustees_rating_max_per_item(user_id, item_id):
    trusteed_users = trustees.get(user_id, set())

    if not trusteed_users:
        return -1
    
    ratings = df_rate[(df_rate["user_id"].isin(trusteed_users)) & (df_rate["item_id"] == item_id)]["label"]

    if ratings.empty:
        return -1
    
    return ratings.max() 

def trustees_rating_min_per_item(user_id, item_id):
    trusteed_users = trustees.get(user_id, set())

    if not trusteed_users:
        return -1
    
    ratings = df_rate[(df_rate["user_id"].isin(trusteed_users)) & (df_rate["item_id"] == item_id)]["label"]

    if ratings.empty:
        return -1
    
    return ratings.min() 

def trustees_count_per_item(user_id, item_id):
    trusteed_users = trustees.get(user_id, set())

    if not trusteed_users:
        return 0
    
    count = df_rate[(df_rate["user_id"].isin(trusteed_users)) & (df_rate["item_id"] == item_id)]
    
    return count["user_id"].unique().size


# Functions for user's trustors totally

def trustees_rating_avg_totally(user_id):
    trusteed_users = trustees.get(user_id, set())

    if not trusteed_users:
        return -1
    
    ratings = df_rate[df_rate["user_id"].isin(trusteed_users)]["label"]

    if ratings.empty:
        return -1
    
    return ratings.mean() 

And now we use and apply those functions on our dataframe.

In [22]:
# Apply functions for user's trustors
df_rate["trustors_rating_std_per_item"] = df_rate.apply(lambda row: trustors_rating_std_per_item(row["user_id"], row["item_id"]), axis=1)

df_rate["trustors_rating_avg_per_item"] = df_rate.apply(lambda row: trustors_rating_avg_per_item(row["user_id"], row["item_id"]), axis=1)

df_rate["trustors_rating_max_per_item"] = df_rate.apply(lambda row: trustors_rating_max_per_item(row["user_id"], row["item_id"]), axis=1)

df_rate["trustors_rating_min_per_item"] = df_rate.apply(lambda row: trustors_rating_min_per_item(row["user_id"], row["item_id"]), axis=1)

df_rate["trustors_count_per_item"] = df_rate.apply(lambda row: trustors_count_per_item(row["user_id"], row["item_id"]), axis=1)

df_rate["trustors_rating_avg_totally"] = df_rate.apply(lambda row: trustors_rating_avg_totally(row["user_id"]), axis=1)

# Apply functions for user's trustees
df_rate["trustees_rating_std_per_item"] = df_rate.apply(lambda row: trustees_rating_std_per_item(row["user_id"], row["item_id"]), axis=1)

df_rate["trustees_rating_avg_per_item"] = df_rate.apply(lambda row: trustees_rating_avg_per_item(row["user_id"], row["item_id"]), axis=1)

df_rate["trustees_rating_max_per_item"] = df_rate.apply(lambda row: trustees_rating_max_per_item(row["user_id"], row["item_id"]), axis=1)

df_rate["trustees_rating_min_per_item"] = df_rate.apply(lambda row: trustees_rating_min_per_item(row["user_id"], row["item_id"]), axis=1)

df_rate["trustees_count_per_item"] = df_rate.apply(lambda row: trustees_count_per_item(row["user_id"], row["item_id"]), axis=1)

df_rate["trustees_rating_avg_totally"] = df_rate.apply(lambda row: trustees_rating_avg_totally(row["user_id"]), axis=1)

Also we can extract some additional combinational features again!

In [23]:
# Count of each users's trustors
trustors_count = df_trust.groupby("user_id_trustor").size().rename("trustors_count")

# Count of each users's trustees
trustees_count = df_trust.groupby("user_id_trustee").size().rename("trustees_count")

# Build average of trustors and trustees count series from these features
user_ids = df_rate["user_id"].unique()
trustor_avg_rating = pd.Series({uid: trustors_rating_avg_totally(uid) for uid in user_ids}, name="trustor_avg_rating")
trustee_avg_rating = pd.Series({uid: trustees_rating_avg_totally(uid) for uid in user_ids}, name="trustee_avg_rating")

# Join these features to our dataframe
df_rate = df_rate.join(trustors_count, on="user_id")
df_rate = df_rate.join(trustees_count, on="user_id")
df_rate["trustors_count"] = df_rate["trustors_count"].fillna(0)
df_rate["trustees_count"] = df_rate["trustees_count"].fillna(0)
df_rate = df_rate.join(trustor_avg_rating, on="user_id")
df_rate = df_rate.join(trustee_avg_rating, on="user_id")

df_rate["trust_bias"] = df_rate["user_count_rating"] - df_rate["trustor_avg_rating"]

In [24]:
# Show the dataframe
df_rate

Unnamed: 0,user_id,item_id,label,user_avg_rating,user_std_rating,user_min_rating,user_max_rating,user_count_rating,item_avg_rating,item_std_rating,...,trustees_rating_avg_per_item,trustees_rating_max_per_item,trustees_rating_min_per_item,trustees_count_per_item,trustees_rating_avg_totally,trustors_count,trustees_count,trustor_avg_rating,trustee_avg_rating,trust_bias
0,1,1,2.0,3.416667,0.668558,2.0,4.0,12,2.978339,0.855957,...,-1.0,-1.0,-1.0,0,-1.00000,0.0,0.0,-1.000000,-1.00000,13.000000
1,1,2,4.0,3.416667,0.668558,2.0,4.0,12,3.190286,0.830323,...,-1.0,-1.0,-1.0,0,-1.00000,0.0,0.0,-1.000000,-1.00000,13.000000
2,1,3,3.5,3.416667,0.668558,2.0,4.0,12,3.045519,0.821318,...,-1.0,-1.0,-1.0,0,-1.00000,0.0,0.0,-1.000000,-1.00000,13.000000
3,1,4,3.0,3.416667,0.668558,2.0,4.0,12,3.192969,0.907212,...,-1.0,-1.0,-1.0,0,-1.00000,0.0,0.0,-1.000000,-1.00000,13.000000
4,1,5,4.0,3.416667,0.668558,2.0,4.0,12,3.230030,0.802293,...,-1.0,-1.0,-1.0,0,-1.00000,0.0,0.0,-1.000000,-1.00000,13.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
34290,1508,669,1.0,2.844828,1.018608,1.0,4.0,29,3.250000,1.500000,...,-1.0,-1.0,-1.0,0,2.72381,3.0,2.0,2.738872,2.72381,26.261128
34291,1508,686,2.5,2.844828,1.018608,1.0,4.0,29,2.400000,0.894427,...,-1.0,-1.0,-1.0,0,2.72381,3.0,2.0,2.738872,2.72381,26.261128
34292,1508,693,3.5,2.844828,1.018608,1.0,4.0,29,2.687500,1.066955,...,-1.0,-1.0,-1.0,0,2.72381,3.0,2.0,2.738872,2.72381,26.261128
34293,1508,751,1.0,2.844828,1.018608,1.0,4.0,29,1.000000,0.500000,...,0.5,0.5,0.5,1,2.72381,3.0,2.0,2.738872,2.72381,26.261128


In [25]:
df_rate.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34295 entries, 0 to 34294
Data columns (total 40 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   user_id                              34295 non-null  int64  
 1   item_id                              34295 non-null  int64  
 2   label                                34295 non-null  float64
 3   user_avg_rating                      34295 non-null  float64
 4   user_std_rating                      34295 non-null  float64
 5   user_min_rating                      34295 non-null  float64
 6   user_max_rating                      34295 non-null  float64
 7   user_count_rating                    34295 non-null  int64  
 8   item_avg_rating                      34295 non-null  float64
 9   item_std_rating                      34295 non-null  float64
 10  item_min_rating                      34295 non-null  float64
 11  item_max_rating             

As our test data is in another CSV file, we have to do all of these previous preproccesing and feature extracting for the test file too, after reading the test data into another Pandas datrame.

In [26]:
# Load the CSV file into a dataframe
df_test = pd.read_csv("test_data.csv")

# Save the id of each row
test_ids = df_test["id"]

In [27]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1716 entries, 0 to 1715
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   id       1716 non-null   int64
 1   user_id  1716 non-null   int64
 2   item_id  1716 non-null   int64
dtypes: int64(3)
memory usage: 40.3 KB


In [28]:
# Join the dataframe and user based features on the `user_id`
df_test = df_test.join(user_avg_rating, on="user_id").fillna(-1)
df_test = df_test.join(user_std_rating, on="user_id").fillna(0)
df_test = df_test.join(user_min_rating, on="user_id").fillna(-1)
df_test = df_test.join(user_max_rating, on="user_id").fillna(-1)
df_test = df_test.join(user_count_rating, on="user_id").fillna(0)

# Join the dataframe and item based features on the `item_id`
df_test = df_test.join(item_avg_rating, on="item_id").fillna(-1)
df_test = df_test.join(item_std_rating, on="item_id").fillna(0)
df_test = df_test.join(item_min_rating, on="item_id").fillna(-1)
df_test = df_test.join(item_max_rating, on="item_id").fillna(-1)
df_test = df_test.join(item_count_rating, on="item_id").fillna(0)

# Extracting some combinational features

df_test["user_item_rating_ratio"] = df_test["user_avg_rating"] / (df_test["item_avg_rating"] + 1e-5)

df_test["user_recent_vs_old_item_rating_diff"] = df_test["user_avg_rating"] - df_test["item_avg_rating"]

df_test["is_user_stricted"] = (df_test["user_avg_rating"] < 2.5).astype(int)

df_test["rating_diff_from_user_avg"] = 0 

df_test["rating_diff_from_item_avg"] = 0

# Apply the functions to our dataframe
df_test["user_rating_percentile"] = df_test.apply(lambda row: user_rating_percentile(row["user_id"], user_avg_rating.get(row["user_id"], 2.5)), axis=1)

df_test["trustors_strict"] = df_test.apply(lambda row: trustors_strict(row["user_id"]), axis=1)

df_test["trustee_strict"] = df_test.apply(lambda row: trustee_strict(row["user_id"]), axis=1)

df_test["trustor_strict_per_item"] = df_test.apply(lambda row: trustor_strict_per_item(row["user_id"], row["item_id"]), axis=1)

df_test["trustee_strict_per_item"] = df_test.apply(lambda row: trustee_strict_per_item(row["user_id"], row["item_id"]), axis=1)

# Apply functions for user's trustors
df_test["trustors_rating_std_per_item"] = df_test.apply(lambda row: trustors_rating_std_per_item(row["user_id"], row["item_id"]), axis=1)

df_test["trustors_rating_avg_per_item"] = df_test.apply(lambda row: trustors_rating_avg_per_item(row["user_id"], row["item_id"]), axis=1)

df_test["trustors_rating_max_per_item"] = df_test.apply(lambda row: trustors_rating_max_per_item(row["user_id"], row["item_id"]), axis=1)

df_test["trustors_rating_min_per_item"] = df_test.apply(lambda row: trustors_rating_min_per_item(row["user_id"], row["item_id"]), axis=1)

df_test["trustors_count_per_item"] = df_test.apply(lambda row: trustors_count_per_item(row["user_id"], row["item_id"]), axis=1)

df_test["trustors_rating_avg_totally"] = df_test.apply(lambda row: trustors_rating_avg_totally(row["user_id"]), axis=1)

# Apply functions for user's trustees
df_test["trustees_rating_std_per_item"] = df_test.apply(lambda row: trustees_rating_std_per_item(row["user_id"], row["item_id"]), axis=1)

df_test["trustees_rating_avg_per_item"] = df_test.apply(lambda row: trustees_rating_avg_per_item(row["user_id"], row["item_id"]), axis=1)

df_test["trustees_rating_max_per_item"] = df_test.apply(lambda row: trustees_rating_max_per_item(row["user_id"], row["item_id"]), axis=1)

df_test["trustees_rating_min_per_item"] = df_test.apply(lambda row: trustees_rating_min_per_item(row["user_id"], row["item_id"]), axis=1)

df_test["trustees_count_per_item"] = df_test.apply(lambda row: trustees_count_per_item(row["user_id"], row["item_id"]), axis=1)

df_test["trustees_rating_avg_totally"] = df_test.apply(lambda row: trustees_rating_avg_totally(row["user_id"]), axis=1)

# Join these features to our dataframe
df_test = df_test.join(trustors_count, on="user_id")
df_test = df_test.join(trustees_count, on="user_id")
df_test["trustors_count"] = df_test["trustors_count"].fillna(0)
df_test["trustees_count"] = df_test["trustees_count"].fillna(0)
df_test = df_test.join(trustor_avg_rating, on="user_id")
df_test = df_test.join(trustee_avg_rating, on="user_id")

df_test["trust_bias"] = df_test["user_count_rating"] - df_test["trustor_avg_rating"]

# Delete the id column because we don't need it anymore
df_test = df_test.drop(columns=["id"])

In [29]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1716 entries, 0 to 1715
Data columns (total 39 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   user_id                              1716 non-null   int64  
 1   item_id                              1716 non-null   int64  
 2   user_avg_rating                      1716 non-null   float64
 3   user_std_rating                      1716 non-null   float64
 4   user_min_rating                      1716 non-null   float64
 5   user_max_rating                      1716 non-null   float64
 6   user_count_rating                    1716 non-null   int64  
 7   item_avg_rating                      1716 non-null   float64
 8   item_std_rating                      1716 non-null   float64
 9   item_min_rating                      1716 non-null   float64
 10  item_max_rating                      1716 non-null   float64
 11  item_count_rating             

In [30]:
df_rate.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34295 entries, 0 to 34294
Data columns (total 40 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   user_id                              34295 non-null  int64  
 1   item_id                              34295 non-null  int64  
 2   label                                34295 non-null  float64
 3   user_avg_rating                      34295 non-null  float64
 4   user_std_rating                      34295 non-null  float64
 5   user_min_rating                      34295 non-null  float64
 6   user_max_rating                      34295 non-null  float64
 7   user_count_rating                    34295 non-null  int64  
 8   item_avg_rating                      34295 non-null  float64
 9   item_std_rating                      34295 non-null  float64
 10  item_min_rating                      34295 non-null  float64
 11  item_max_rating             

Finally, we use some models to get and compare their accuracies and choose the best one for this recommendation. Models like:

- **Linear Regression**

- **Ridge**

- **Lasso**

- **ElasticNet**

- **Gradient Boosting**

- **XGBoost**

- **CatBoost**

At the end, we show the best model with its accuracy and save their predictions into a CSV file.

In [31]:
# Building train data and labels
X = df_rate.drop(columns=["label"])
Y = df_rate["label"]

# Drop columns in test dataframe which are not exist in the train dataframe
for col in X.columns:
    if col not in df_test.columns:
        df_test[col] = 0
df_test = df_test[X.columns]

# Standardize both train and test data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_test_scaled = scaler.transform(df_test)

# Select best features fom our model and drop others
selector = SelectKBest(score_func=f_regression, k=60)
X_selected = selector.fit_transform(X_scaled, Y)
X_selected_names = X.columns[selector.get_support()]
X_test_selected = X_test_scaled[:, selector.get_support(indices=True)]

# Building train and test data and labels
X_train, X_val, Y_train, Y_val = train_test_split(X_selected, Y, test_size=0.2, random_state=42)

# Use models and their hyper parameters
models = {
    "Linear Regression": LinearRegression(
    ),
    "Ridge": Ridge(
        alpha=0.0001
    ),
    "Lasso": Lasso(
        max_iter=1000000,
        alpha=0.0001
    ),
    "ElasticNet": ElasticNet(
        max_iter=1000000,
        alpha=0.0001,
        l1_ratio=0.1
    ),
    "Gradient Boosting": GradientBoostingRegressor(
        n_estimators=1000,
        learning_rate=0.01
    ),
    "XGBoost": XGBRegressor(
        n_estimators=1000,
        learning_rate=0.01,
        max_depth=6,
        subsample=0.85,
        colsample_bytree=0.85,
        reg_lambda=1.2,
        reg_alpha=0.3
    ),
    "CatBoost": CatBoostRegressor(
        iterations=1000, 
        learning_rate=0.01,
        depth=6,
        verbose=0
    )
}

# Run all models and choose the bet one
results = pd.DataFrame(columns=["Model", "MSE", "RMSE", "R2-Score", "MAPE", "MAE"])

for name, model in models.items():
    model.fit(X_train, Y_train)

    y_pred = model.predict(X_val)

    mse = mean_squared_error(Y_val, y_pred)

    print(f"The model {name} has the MSE {mse}")

    new_result = pd.DataFrame([{
        "Model": name,
        "MSE": f"{mse}",
        "RMSE": f"{(mse ** 0.5)}",
        "R2-Score": f"{r2_score(Y_val, y_pred)}",
        "MAPE": f"{(np.mean(np.abs((Y_val - y_pred) / Y_val)) * 100)}%",
        "MAE": f"{mean_absolute_error(Y_val, y_pred)}"
    }])
    results = pd.concat([results, new_result], ignore_index=True)

    model.fit(X_selected, Y)

    final_preds = model.predict(X_test_selected)

    submission = pd.DataFrame({"id": test_ids, "label": final_preds})

    submission.to_csv(f"{name}_predictions.csv", index=False)



The model Linear Regression has the MSE 1.412594671694237e-30
The model Ridge has the MSE 5.5733209632875605e-18
The model Lasso has the MSE 2.173425766771575e-08
The model ElasticNet has the MSE 3.880147127620721e-08
The model Gradient Boosting has the MSE 0.0008923466808216046
The model XGBoost has the MSE 0.00015685135141043887
The model CatBoost has the MSE 0.00048205770155213776


Then we calculate and report MSE, RMSE, R2-Score, MAPE and MAE during our model development process on the train data.

- **MSE (Mean Squared Error)** -> The average of the squared differences between the actual values and the predicted values. It measures how close the predicted values are to the actual values. Squaring the error magnifies the impact of larger errors, making it sensitive to outliers.

$$ MSE = \dfrac{1}{N} \sum_{i=1}^{n} (y_{i} - \hat{y_{i}})^2 $$

- **RMSE (Root Mean Squared Error)** -> The square root of the MSE. It has the same unit as the target variable, making it more interpretable. It is more sensitive to large errors than MSE because of the square root.

$$ RMSE = \sqrt{MSE} $$

- **R2-Score** -> This measures the proportion of variance in the target variable that is explained by the model. It ranges between 0 (the worst) and 1 (the best).

$$ R2Score = 1 - \dfrac{\sum_{i=1}^{n} (y_{i} - \hat{y_{i}})^2}{\sum_{i=1}^{n} (y_{i} - \bar{y})^2} $$

- **MAPE (Mean Absolute Percentage Error)** -> The average of the absolute percentage errors between the actual and predicted values. It is useful when the magnitude of the target variable is important.

$$ MAPE = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{y_{i} - \hat{y_{i}}}{y_{i}} \right| \times 100 $$

- **MAE (Mean Absolute Error)** -> The average of the absolute differences between the actual and predicted values. It is less sensitive to outliers compared to MSE and RMSE.

$$ MAE = \frac{1}{n} \sum_{i=1}^{n} \left| y_{i} - \hat{y_{i}} \right| $$

Where $y_{i}$ is the actual value, $\hat{y_{i}}$ is the predicted value, $\bar{y}$ is the mean of actual values and *n* is the number of samples.

In [32]:
# # Show the model results for the train data
results

Unnamed: 0,Model,MSE,RMSE,R2-Score,MAPE,MAE
0,Linear Regression,1.412594671694237e-30,1.1885262604142311e-15,1.0,3.815158274022139e-14%,8.716683236931612e-16
1,Ridge,5.573320963287561e-18,2.360788208054157e-09,1.0,8.938288410206541e-08%,1.8679030470884813e-09
2,Lasso,2.1734257667715752e-08,0.0001474254308717,0.9999999735881692,0.006172791443397987%,0.000113049765055
3,ElasticNet,3.880147127620721e-08,0.0001969808906371,0.9999999528478076,0.00453046730842764%,0.0001058526576975
4,Gradient Boosting,0.0008923466808216,0.0298721723485521,0.9989156054940084,0.8714202417236224%,0.0182171282928168
5,XGBoost,0.0001568513514104,0.0125240309569418,0.9998093916328906,0.21717635881366737%,0.0040522218315398
6,CatBoost,0.0004820577015521,0.0219558124776137,0.9994141954754032,0.6911546284962629%,0.0138460032160466
