<center>
    <h1 id='content-based-filtering' style='color:#7159c1; font-size:350%'>Collaborative Filtering</h1>
    <i style='font-size:125%'>Recommendations of Items from Similar Items that Similar Users Liked</i>
</center>

> **Topics**

```
- 🧑‍🤝‍🧑 Hands-on
```

<h1 id='0-hands-on' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>🧑‍🤝‍🧑 | Hands-on</h1>

```
- Settings
- Reading Datasets
- Calculating Ratings Matrix and Similarity Matrix
- Calculating Predictions
- Recommendations
```

---

**- Settings**

In [1]:
# ---- Importings ----
import numpy as np                               # pip instal numpy
import pandas as pd                              # pip install pandas
from sklearn.metrics import mean_squared_error   # pip install sklearn
from sklearn.model_selection import train_test_split

# ---- Constants ----
ANIMES_SCORED_BY_CUTOFF = (0.75)
ANIMES_NUMBER_RATINGS_CUTOFF = (75_000)

BASELINE_PREDICTION = (2.5)
DATASETS_PATH = ('./datasets')
SEED = (20240420) # April 20, 2024 (fourth Bitcoin Halving)

# ---- Settings ----
np.random.seed(SEED)

# ---- Functions ----
def calculate_score(user_id, anime_id):
    """
    \ Description:
        - drops the selected user from 'user_id' parameter on similarities and ratings matrices;
        - calculates the total score and weight between the users;
        - calculates the average user rating for the item from 'anime_id';
        - returns the predicted rating balanced by the weight.
    
    \ Parameters:
        - user_id: integer;
        - anime_id: integer.
        
    \ Return:
        - Baseline Prediction: float (when item is not into training dataset OR
    none of the similar users have rated items in common with the 'user_id' parameter);
        - Predicted Rating: float.
    """
    # If the item is not into the training dataset, the baseline value is returned
    if anime_id not in ratings_matrix.columns: return BASELINE_PREDICTION

    if user_id not in normalized_ratings_matrix.columns: return BASELINE_PREDICTION

    # Dropping the selected anime from 'anime_id' parameter
    similarity_scores = similarity_matrix[anime_id].drop(labels=anime_id)
    normalized_ratings = normalized_ratings_matrix[user_id].drop(index=anime_id)
    
    # Dropping animes that haven't been rated
    similarity_scores.drop(index=normalized_ratings[normalized_ratings.isnull()].index, inplace=True)
    normalized_ratings.dropna(inplace=True)
    
    # If none of the other users have rated items in common with the user in question, the baseline value is returned
    if similarity_scores.isna().all(): return BASELINE_PREDICTION
    
    # Calculating Predicted Rating
    total_score = 0
    total_weight = 0
    
    for anime_id_rating in normalized_ratings.index:
        # It is possible that another user rated the item but that
        # they have not rated any items in common with the user in question
        if not pd.isna(similarity_scores[anime_id_rating]):
            total_score += normalized_ratings[anime_id_rating] * similarity_scores[anime_id_rating]
            total_weight += abs(similarity_scores[anime_id_rating])
            
    avg_user_rating = ratings_matrix.T.mean()[anime_id]
    return avg_user_rating + total_score / total_weight

def get_recommendations(df, animes_df, user_id, number_recommendations=10):
    """
    \ Description:
        - filters the top 10 recommendations by 'predicted_rating';
        - creates a dataframe containing info about the filtered animes;
        - merges 'predicted_rating' to the dataset;
        - drops unuseful columns;
        - returns the recommendations descended sorted by 'predicted_rating'.
    
    \ Parameters:
        - df: Pandas DataFrame;
        - animes_df: Pandas DataFrame;
        - user_id: integer;
        - number_recommendations: integer.
        
    \ Return:
        - recommendations_df: Pandas DataFrame.
    """
    filtered_animes = df.loc[df.user_id == user_id]            \
      .sort_values(by='predicted_rating', ascending=False)    \
      .head(number_recommendations)
    
    recommended_animes_ids = filtered_animes.anime_id.unique().tolist()
    
    recommendations_df = animes_df.loc[animes_df.id.isin(recommended_animes_ids)][
        ['id', 'title', 'synopsis', 'score', 'genres', 'image_url']
    ]
    
    recommendations_df = recommendations_df.merge(
        filtered_animes
        , left_on='id'
        , right_on='anime_id'
        , how='left'
    )
    
    recommendations_df.drop(columns=['anime_id', 'user_id'], inplace=True)
    
    return recommendations_df.sort_values(by='predicted_rating', ascending=False)

---

**- Reading Datasets**

In [2]:
# ---- Reading Animes Dataset ----
animes_df = pd.read_csv(f'{DATASETS_PATH}/anime-transformed-dataset-2023.csv')[
    ['id', 'title', 'synopsis', 'score', 'genres', 'image_url', 'scored_by']
]

# ---- Filterig Animes with more than or equal to a cutoff of number of Users Ratings ----
minimum_number_of_ratings = animes_df.scored_by.quantile(q=ANIMES_SCORED_BY_CUTOFF, interpolation='linear')
animes_df = animes_df.loc[animes_df.scored_by >= minimum_number_of_ratings].copy()

print(f'- Number of Observations: {animes_df.shape[0]:,}')
print(f'- Number of Variables: {animes_df.shape[1]:,}')
print('---')

animes_df.head()

- Number of Observations: 5,938
- Number of Variables: 7
---


Unnamed: 0,id,title,synopsis,score,genres,image_url,scored_by
0,1,cowboy bebop,"crime is timeless. by the year 2071, humanity ...",8.75,"action, sci-fi, award winning",https://cdn.myanimelist.net/images/anime/4/196...,914193
1,5,cowboy bebop tengoku no tobira,"another day, another bounty—such is the life o...",8.38,"action, sci-fi",https://cdn.myanimelist.net/images/anime/1439/...,206248
2,6,trigun,"vash the stampede is the man with a $$60,000,0...",8.22,"adventure, action, sci-fi",https://cdn.myanimelist.net/images/anime/7/203...,356739
3,7,witch hunter robin,robin sena is a powerful craft user drafted in...,7.25,"mystery, action, supernatural, drama",https://cdn.myanimelist.net/images/anime/10/19...,42829
4,8,bouken ou beet,it is the dark century and the people are suff...,6.94,"adventure, fantasy, supernatural",https://cdn.myanimelist.net/images/anime/7/215...,6413


In [3]:
# ---- Reading Ratings Dataset ----
ratings_df = pd.read_csv(f'{DATASETS_PATH}/users-scores-transformed-2023.csv')[
    ['user_id', 'anime_id', 'rating']
]

# ---- Filterig Ratings by Filtered Animes ----
filtered_animes_ids = animes_df.id.to_list()
ratings_df = ratings_df.loc[ratings_df.anime_id.isin(filtered_animes_ids)].copy()

# ---- Filtering Ratings with users with more than or equal to 2000 Ratings ----
animes_ratings_count = ratings_df.anime_id.value_counts()
ratings_df = ratings_df.loc[
    ratings_df.anime_id.isin(animes_ratings_count[animes_ratings_count >= ANIMES_NUMBER_RATINGS_CUTOFF].index)
].copy()


print(f'- Number of Observations: {ratings_df.shape[0]:,}')
print(f'- Number of Variables: {ratings_df.shape[1]:,}')
print('---')

ratings_df.head()

- Number of Observations: 651,318
- Number of Variables: 3
---


Unnamed: 0,user_id,anime_id,rating
26,1,1575,8
37,1,226,8
49,1,121,9
147,1,20,7
231,1,269,9


---

**- Calculating Ratings Matrix and Similarity Matrix**

In [4]:
# ---- Splitting Dataset into Train and Validation ----
train_ratings_df, valid_ratings_df = train_test_split(
    ratings_df
    , train_size=0.80
    , test_size=0.20
    , random_state=SEED
)

print(f'- Train Ratings Observations: {train_ratings_df.shape[0]:,}')
print(f'- Validation Ratings Observations: {valid_ratings_df.shape[0]:,}')

- Train Ratings Observations: 521,054
- Validation Ratings Observations: 130,264


In [5]:
# ---- Calculating Ratings Matrix ----
#
# - values: users ratings to animes;
# - indexes: animes ids;
# - columns: users ids;
#
ratings_matrix = pd.pivot_table(train_ratings_df, values='rating', index='anime_id', columns='user_id')
normalized_ratings_matrix = ratings_matrix.subtract(ratings_matrix.mean(axis=1), axis=0)
normalized_ratings_matrix

user_id,1,4,9,20,23,47,66,70,71,80,...,1291021,1291029,1291033,1291039,1291049,1291057,1291079,1291085,1291087,1291097
anime_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
20,,,1.425853,-1.574147,-0.574147,,-0.574147,-1.574147,-0.574147,,...,0.425853,-0.574147,2.425853,2.425853,,-2.574147,,,,
121,,0.60134,-0.39866,1.60134,0.60134,1.60134,0.60134,-0.39866,,1.60134,...,,-1.39866,,,,-2.39866,1.60134,,,
226,0.045439,,2.045439,0.045439,1.045439,0.045439,0.045439,-2.954561,,,...,,-1.954561,,,0.045439,1.045439,,,-0.954561,
269,1.202507,,,-0.797493,-0.797493,,0.202507,,,-1.797493,...,,-1.797493,,1.202507,,-2.797493,,,,
1535,0.274634,,,0.274634,-0.725366,-0.725366,-1.725366,1.274634,1.274634,,...,-1.725366,,1.274634,-0.725366,-1.725366,-2.725366,-0.725366,,1.274634,0.274634
1575,-0.764645,-0.764645,,-0.764645,-0.764645,-0.764645,0.235355,-1.764645,,,...,,,,,,-1.764645,,,0.235355,
2904,-2.890233,-0.890233,,,-0.890233,,,,,,...,,,,,,-0.890233,,1.109767,,


In [6]:
# ---- Calculating Animes Similarity Matrix ----
similarity_matrix = ratings_matrix.T.corr(method='pearson')
similarity_matrix.head()

anime_id,20,121,226,269,1535,1575,2904
anime_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
20,1.0,0.283111,0.244758,0.546313,0.302034,0.234618,0.278365
121,0.283111,1.0,0.258206,0.291144,0.28032,0.246503,0.240533
226,0.244758,0.258206,1.0,0.29027,0.293107,0.286265,0.317477
269,0.546313,0.291144,0.29027,1.0,0.294805,0.264667,0.31255
1535,0.302034,0.28032,0.293107,0.294805,1.0,0.347047,0.371049


---

**- Calculating Predictions**

In [7]:
# ---- Predictions Calculation ----
valid_ratings = np.array(valid_ratings_df['rating'])
users_ids_list = valid_ratings_df['user_id']
animes_ids_list = valid_ratings_df['anime_id']
predicted_ratings = np.array([calculate_score(user_id, anime_id) for (user_id, anime_id) in zip(users_ids_list, animes_ids_list)])

# ---- Validation ----
rmse = np.sqrt(mean_squared_error(valid_ratings, predicted_ratings))
print(f'- RMSE: {rmse}')

- RMSE: 5.774137938770761


---

**- Recommendations**

In [8]:
# --- Predicted Ratings DF ----
predicted_ratings_df = pd.DataFrame(columns=['user_id', 'anime_id', 'predicted_rating'])

predicted_ratings_df['user_id'] = users_ids_list
predicted_ratings_df['anime_id'] = animes_ids_list
predicted_ratings_df['predicted_rating'] = predicted_ratings
predicted_ratings_df.reset_index(drop=True, inplace=True)

predicted_ratings_df.head()

Unnamed: 0,user_id,anime_id,predicted_rating
0,1101057,1535,2.5
1,334264,269,2.5
2,397867,20,8.48585
3,1265487,1535,2.5
4,538119,121,2.5


In [9]:
# ---- Recommendations ----
get_recommendations(
    df=predicted_ratings_df
    , animes_df=animes_df
    , user_id=1129199         
    , number_recommendations=10
)

Unnamed: 0,id,title,synopsis,score,genres,image_url,predicted_rating
0,20,naruto,"moments prior to naruto uzumaki's birth, a hug...",7.99,"adventure, action, fantasy",https://cdn.myanimelist.net/images/anime/13/17...,9.619586
1,269,bleach,ichigo kurosaki is an ordinary high schooler—u...,7.92,"adventure, action, fantasy",https://cdn.myanimelist.net/images/anime/3/404...,2.5
2,1535,death note,"brutal murders, petty thefts, and senseless vi...",8.62,"suspense, supernatural",https://cdn.myanimelist.net/images/anime/9/945...,2.5
3,1575,code geass hangyaku no lelouch,"in the year 2010, the holy empire of britannia...",8.7,"action, sci-fi, award winning, drama",https://cdn.myanimelist.net/images/anime/1032/...,2.5
4,2904,code geass hangyaku no lelouch r2,"one year has passed since the black rebellion,...",8.91,"action, sci-fi, award winning, drama",https://cdn.myanimelist.net/images/anime/1088/...,2.5


---

<h1 id='reach-me' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>📫 | Reach Me</h1>

> **Email** - [csfelix08@gmail.com](mailto:csfelix08@gmail.com?)

> **Linkedin** - [linkedin.com/in/csfelix/](https://www.linkedin.com/in/csfelix/)

> **GitHub:** - [CSFelix](https://github.com/CSFelix)

> **Kaggle** - [DSFelix](https://www.kaggle.com/dsfelix)

> **Portfolio** - [CSFelix.io](https://csfelix.github.io/).