# Unsupervised user explicit rating based recommendation system

## Import Libraries

In [2]:
# Standard library imports
import os # allows access to OS-dependent functionalities
import sys # to manipulate different parts of the Python runtime environment

import numpy as np # linear algebra, Fourier transforms, matrices and arrays
import pandas as pd # data analysis and manipulation tool

# setting display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', None)

# Get the current working directory
cwd = os.getcwd()

# Add the path of the utils directory to sys.path
utils_path = os.path.abspath(os.path.join(cwd, '..', 'utils'))
sys.path.append(utils_path)

# Utils libraries
from cleaning import *
from recommend import *
from testing import *
from training import *

# Prepare folder variables
main_folder = os.path.abspath(os.pardir)
data_folder = os.path.join(main_folder, "data")
saved_models_folder = os.path.join(data_folder, "saved_models")
raw_data = os.path.join(data_folder, "_raw")
processed_data = os.path.join(data_folder, "processed")
content_based_supervised_data = os.path.join(main_folder, "processed", "content_based_supervised")



## Cleaning and preparing the data

### Importing data

In [3]:
# Read "anime.csv" from the raw_data directory into a pandas DataFrame
anime = pd.read_csv(raw_data + "/" + "anime.csv")

# Read the zipped "rating.csv.zip" from the raw_data directory into a pandas DataFrame
rating = pd.read_csv(raw_data + "/" + "rating.csv.zip")

### Cleaning and Merging dataframes

To do this, we call the following functions from cleaning.py in the utils folder:
- final_df
- clean_anime_df
    - predict_source
    - clean_synopsis
- merging

In [15]:
print(final_df.__doc__)


    This function merges the two dataframes of anime data, reorders and selects columns, 
    renames the columns to lowercase, and saves the resulting dataframe to a CSV file. 
    The merged and cleaned dataframe is returned as the output.
    In other words, the merge adds extra information such as the cover or the Japanese title.
    


The steps of this function are:
- Load the original anime dataframe
- Load the updated anime dataframe
- Merge the two dataframes on the anime_id column 
- Reorder and select columns 
- Rename columns to lower case 
- Save the final dataframe to a CSV file in the processed data directory
- Return the final dataframe
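
The merge, reorder, and rename steps listed above can be sketched on toy data. This is only an illustration: the two small DataFrames, their column names, the inner join, and the commented-out file name are assumptions, not the contents of cleaning.py.

```python
import pandas as pd

# Hypothetical stand-ins for the original and the updated anime tables.
original = pd.DataFrame({
    "anime_id": [1, 2, 3],
    "name": ["Cowboy Bebop", "Naruto", "Bleach"],
    "Genre": ["Action, Sci-Fi", "Action, Shounen", "Action, Shounen"],
})
updated = pd.DataFrame({
    "anime_id": [1, 2, 4],
    "Japanese_Title": ["カウボーイビバップ", "ナルト", "ワンピース"],
    "Cover": ["bebop.jpg", "naruto.jpg", "onepiece.jpg"],
})

# Merge on anime_id. The inner join is an assumption: it keeps only
# titles present in both tables.
merged = original.merge(updated, on="anime_id", how="inner")

# Reorder and select the columns to keep.
merged = merged[["anime_id", "name", "Japanese_Title", "Genre", "Cover"]]

# Rename every column to lower case.
merged.columns = merged.columns.str.lower()

# The real function also saves the result; the file name here is hypothetical.
# merged.to_csv("anime_merged.csv", index=False)

print(merged)
```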

In [16]:
print(clean_anime_df.__doc__)

The function clean_anime_df() takes an anime dataframe as input and performs several 
    cleaning and preprocessing steps, such as removing special characters from anime names, 
    converting all names to lowercase, filling missing values for "episodes" and "score" 
    columns with their median, dropping rows with null values for "genre" or "type" columns, 
    and saving the cleaned dataframe to a CSV file. The cleaned dataframe is also returned as output.


The steps of this function:
- Create a copy of the original dataframe called anime_cleaned
- Remove all non-word characters from the name column and replace them with spaces
- Convert all names to lowercase
- Replace all "Unknown" values in the episodes column with NaN
- Replace all NaN values in the episodes column with the median of the column
- Convert the score column to float type
- Replace all NaN values in the score column with the median of the column
- Convert the members column to float type
- Apply the clean_synopsis function to the synopsis column
    - Remove \r and \n from synopsis
    - Remove extra spaces from synopsis
    - Replace encoded characters
    - Return synopsis
- Add prediction to the source column of the dataframe using the predict_source function
    - change unknown values to NaN from 'source' column
    - fill missing values in the 'episodes' column with 0
    - create dummy variables for the 'type' column
    - create dummy variables for the 'rating' column
    - First, split the genre column by comma and expand the list, so there is one column per genre position. This yields 13 columns, because the anime with the most genre tags has 13 tags
    - Now we can get the list of unique genres. We "convert" the dataframe into a single dimension array and take the unique values
    - Getting the dummy variables will result in having a lot more columns than unique genres
    - So we sum up the columns with the same genre to have a single column for each genre
    - split the data into training and validation sets
    - create the decision tree classifier
    - train the model using the training data
    - predict the 'source' values for the validation data
    - fill the 'NaN' 'source' values in the original DataFrame with the predicted values
    - undo the get_dummies() operation to convert the one-hot encoded 'type' and 'rating' columns back to a single categorical column
    - Dropping unnecessary columns
    - calculate the accuracy of the model
- Replace all NaN values in the genre column with the mode of the column
- Replace all NaN values in the rating column with the mode of the column
- Replace all NaN values in the type column with the mode of the column
- Save the cleaned dataframe to a CSV file called "_anime_to_compare_with_name.csv" 
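
As a rough illustration of the name cleaning and the median and mode imputation steps above, the following sketch runs them on a tiny, made-up DataFrame. The sample rows are hypothetical, and the synopsis cleaning, source prediction, and CSV export are left out.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows; the real data comes from anime.csv.
anime = pd.DataFrame({
    "name": ["Fullmetal Alchemist: Brotherhood", "Steins;Gate", "K-On!"],
    "episodes": ["64", "Unknown", "13"],
    "score": [9.1, np.nan, 7.8],
    "genre": ["Action, Adventure", None, "Comedy, Music"],
    "type": ["TV", "TV", None],
})

anime_cleaned = anime.copy()

# Replace every non-word character with a space, then lowercase the names.
anime_cleaned["name"] = (
    anime_cleaned["name"].str.replace(r"\W", " ", regex=True).str.lower()
)

# "Unknown" episodes become NaN, then NaN becomes the column median.
episodes = pd.to_numeric(anime_cleaned["episodes"].replace("Unknown", np.nan))
anime_cleaned["episodes"] = episodes.fillna(episodes.median())

# Scores are cast to float and missing values take the median.
score = anime_cleaned["score"].astype(float)
anime_cleaned["score"] = score.fillna(score.median())

# Categorical columns are filled with their most frequent value (the mode).
for column in ["genre", "type"]:
    mode_value = anime_cleaned[column].mode()[0]
    anime_cleaned[column] = anime_cleaned[column].fillna(mode_value)

print(anime_cleaned)
```

Note that replacing each non-word character with a space can leave double or trailing spaces in the names, as in the first and third rows here.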

In [17]:
print(merging.__doc__)


    This function merges the given DataFrame with a rating DataFrame 
    based on the anime_id column. It then renames the 'rating_user' 
    column to 'user_rating' and returns the merged DataFrame.
    


The steps of this function:
- Load the rating DataFrame
- Add suffixes for the rating DataFrame, since the rating column has the same name in both DataFrames
- Rename a couple of columns
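
A minimal sketch of the merge with suffixes, assuming toy tables with the same clashing `rating` column. The sample rows are made up; the `_user` suffix is inferred from the `rating_user` name mentioned in the docstring.

```python
import pandas as pd

# Hypothetical anime table: here "rating" is the age rating.
anime = pd.DataFrame({
    "anime_id": [1, 2],
    "name": ["naruto", "bleach"],
    "rating": ["PG-13", "PG-13"],
})

# Hypothetical user ratings table: here "rating" is the score a user gave.
rating = pd.DataFrame({
    "user_id": [10, 10, 11],
    "anime_id": [1, 2, 1],
    "rating": [8, -1, 9],
})

# Both tables have a "rating" column, so the right-hand one gets a suffix.
merged = anime.merge(rating, on="anime_id", suffixes=("", "_user"))

# Give the suffixed column a clearer name.
merged = merged.rename(columns={"rating_user": "user_rating"})

print(merged)
```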

In [18]:
anime = final_df()
anime.shape

(12201, 15)

In [19]:
anime_cleaned = clean_anime_df(anime)

The accuracy of source prediction is 0.8886884550084889
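
The source imputation behind this accuracy figure follows the predict_source steps listed earlier. The sketch below is one possible reading of those steps on toy data: the rows, the split ratio, and the random seed are assumptions, and the accuracy it prints says nothing about the real model.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy data (hypothetical); NaN marks the sources to predict.
df = pd.DataFrame({
    "genre": ["Action, Shounen", "Comedy", "Action", "Shounen, Comedy",
              "Action, Shounen", "Comedy", "Action", "Comedy"],
    "type": ["TV", "Movie", "TV", "TV", "TV", "Movie", "TV", "Movie"],
    "source": ["Manga", "Original", "Manga", "Manga",
               "Manga", "Original", np.nan, np.nan],
})

# Split genres on the comma and expand: one column per tag position.
genre_split = df["genre"].str.split(", ", expand=True)
# One-hot encode each position; a genre can now appear in several columns.
genre_dummies = pd.get_dummies(genre_split, prefix="", prefix_sep="", dtype=int)
# Sum the columns that share a name so each genre ends up in one column.
genre_features = genre_dummies.T.groupby(level=0).sum().T

# Dummy variables for the type column, then the full feature matrix.
type_features = pd.get_dummies(df["type"], prefix="type", dtype=int)
X = pd.concat([genre_features, type_features], axis=1)

# Train and validate only on rows whose source is known.
known = df["source"].notna()
X_train, X_val, y_train, y_val = train_test_split(
    X[known], df.loc[known, "source"], test_size=0.33, random_state=0
)
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)
print("Validation accuracy:", model.score(X_val, y_val))

# Fill the missing sources with the model's predictions.
df_filled = df.copy()
df_filled.loc[~known, "source"] = model.predict(X[~known])
print(df_filled)
```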


In [6]:
merged_df = merging(anime_cleaned)

In [21]:
merged_df.shape

(7808397, 17)

### Handling NaN values

To do this, we call the function named features_user_based_unsupervised from cleaning.py in the utils folder.

In [22]:
print(features_user_based_unsupervised.__doc__)


    This function takes in a merged dataframe, preprocesses the data to drop users 
    who have not given any ratings and users who have given fewer ratings than a 
    specified threshold value, and saves the resulting pivot table to a pickle file. 
    It then compresses the pickle file into a zip file and returns the resulting pivot table.
    


The steps of this function:
-  A user who has not given a rating (-1) adds no value to the engine, so those rows are dropped.
-  Drop rows with NaN values (the user has not given any rating).
-  Some users have rated only one title, so it is worth keeping only users with a minimum number of ratings as a threshold value. Let's say 50.
-  Only consider users with at least 200 ratings
-  Saving the pivot table to pickle
-  Create a zip file for the saved pickle file
-  Return the cleaned and filtered features dataframe
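
The filtering steps above might look like the following sketch. The toy data and the `MIN_RATINGS` constant are hypothetical; the threshold is set to 2 only so that some toy users survive, and the pickle and zip steps are omitted.

```python
import numpy as np
import pandas as pd

MIN_RATINGS = 2  # hypothetical threshold, kept tiny for the toy data

# Hypothetical merged data: -1 means the user watched but did not rate.
merged = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3],
    "name": ["naruto", "bleach", "one piece", "naruto", "bleach", "naruto"],
    "user_rating": [8, -1, 9, 7, 6, 10],
})

features = merged.copy()

# Treat -1 as a missing rating, then drop the rows without a rating.
features["user_rating"] = features["user_rating"].replace(-1, np.nan)
features = features.dropna(subset=["user_rating"])

# Keep only users who have at least MIN_RATINGS ratings left.
counts = features["user_id"].value_counts()
active_users = counts[counts >= MIN_RATINGS].index
features = features[features["user_id"].isin(active_users)]

print(features)
```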

In [7]:
features = features_user_based_unsupervised(merged_df)

In [7]:
features.shape

(2378351, 19)

### Pivoting

To do this, we call the function named create_pivot_table_unsupervised from cleaning.py in the utils folder.

In [25]:
print(create_pivot_table_unsupervised.__doc__)


    The function create_pivot_table_unsupervised creates a pivot table with rows as anime titles, 
    columns as user IDs, and the corresponding ratings as values. The pivot table is then saved 
    to a pickle file and zipped. The function also saves a separate file containing only the 
    anime titles. Finally, the pivot table is returned.
    


The steps of this function:
- This function takes a DataFrame of features as input, and returns a pivot table of user ratings
- Creates the pivot table using pandas' pivot_table method, with user_id as columns, name as index, and user_rating as values
- Saves the pivot table as a pickle file using joblib
- Compresses the pickle file using zip and saves it
- Creates a DataFrame containing the index of the pivot table
- Saves the DataFrame as a csv file
- Returns the pivot table
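
A small sketch of the pivoting step, using made-up ratings. Filling the missing ratings with 0 is an assumption added here so that the table can later be turned into a sparse matrix; the steps above do not say how missing values are handled, and the file-saving steps are skipped.

```python
import pandas as pd

# Hypothetical filtered ratings.
features = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2],
    "name": ["naruto", "bleach", "naruto", "bleach", "one piece"],
    "user_rating": [8, 9, 7, 6, 10],
})

# Rows are anime titles, columns are user IDs, values are ratings.
pivot_df = features.pivot_table(
    index="name", columns="user_id", values="user_rating"
)

# Assumption: an unrated title counts as 0 for that user.
pivot_df = pivot_df.fillna(0)

# A separate table holding only the titles, in pivot order.
titles = pd.DataFrame(pivot_df.index)

print(pivot_df)
print(titles)
```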

In [None]:
pivot_df = create_pivot_table_unsupervised(features)
pivot_df.head(2)

In [27]:
pivot_df.shape

(8580, 8551)

## Recommendation building phase

### Cosine Similarity using KNN

We need to create a sparse matrix using the csr_matrix function, so we call the function named matrix_creation_and_training from training.py in the utils folder.

In [28]:
print(matrix_creation_and_training.__doc__)

None


The steps of this function:
- Convert pivot table of user-item ratings to a sparse matrix in CSR format
- Create k-Nearest Neighbors model with 2 neighbors, Euclidean distance metric, brute force algorithm, and p-norm=2
- Fit k-Nearest Neighbors model on the user-item rating matrix
- Save the trained k-Nearest Neighbors model to a file using the pickle module
- Return the trained k-Nearest Neighbors model
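
The steps above can be sketched with SciPy and scikit-learn. The model parameters are taken from the list (2 neighbors, Euclidean metric, brute-force search, p=2); the toy pivot table is hypothetical, and the model is pickled in memory rather than to a file.

```python
import pickle

import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Hypothetical pivot table: rows are titles, columns are user IDs.
pivot_df = pd.DataFrame(
    [[8, 7, 0], [9, 6, 0], [0, 0, 10], [8, 7, 1]],
    index=["naruto", "bleach", "k on", "one piece"],
    columns=[1, 2, 3],
)

# Convert the dense pivot table to a sparse matrix in CSR format.
matrix = csr_matrix(pivot_df.values)

# k-Nearest Neighbors with the parameters listed in the steps.
model_knn = NearestNeighbors(
    n_neighbors=2, metric="euclidean", algorithm="brute", p=2
)
model_knn.fit(matrix)

# Round-trip through pickle to mimic saving and loading the model.
restored = pickle.loads(pickle.dumps(model_knn))

# Query the neighbors of the first title (the title itself comes first).
distances, indices = restored.kneighbors(matrix[0], n_neighbors=2)
print(indices[0], distances[0])
```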

In [29]:
model_knn = matrix_creation_and_training(pivot_df)

In [30]:
model_knn

## Getting Recommendations

To get the recommendations, we use the following functions from recommend.py in the utils folder:
- unsupervised_user_based_recommender
- reco
- finding_the_closest_title
- from_title_to_index
- match_the_score
- from_index_to_title
- create_dict
- filtering_and
- filtering_or

print(unsupervised_user_based_recommender.__doc__)

Steps of unsupervised_user_based_recommender:
- Load the anime data with features to compare  
- Convert the input anime title to lowercase 
- Load the pivot table to find the index of the input anime title
- Find the closest title to the input title based on string similarity
- When the user input has no spelling mistakes
    - Print the recommendations for similar animes to the closest title
- When the user input has spelling mistakes
    - Print a message asking if the user meant the closest title found


print(reco.__doc__)

Steps of reco:
- Load the trained KNN model for user-based unsupervised learning.
- Load the pivot table which stores the user rating data.
- Get the index of the anime given the name of the anime.
- Get the n nearest neighbors (anime recommendations) of the given anime.
- Store the names of the n nearest neighbors in a list.
    - If no recommendations are found, print a message to the user.
- Return the list of recommended anime names.
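
A possible shape for this lookup, assuming a fitted model and a pivot table are already in memory rather than loaded from disk. The `recommend` helper and the toy data are hypothetical; the query asks for one extra neighbor because the title itself is always its own nearest neighbor.

```python
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Hypothetical pivot table: rows are titles, columns are user IDs.
pivot_df = pd.DataFrame(
    [[8, 7, 0], [9, 6, 0], [0, 0, 10], [8, 7, 1]],
    index=["naruto", "bleach", "k on", "one piece"],
    columns=[1, 2, 3],
)
model_knn = NearestNeighbors(n_neighbors=2, metric="euclidean", algorithm="brute")
model_knn.fit(csr_matrix(pivot_df.values))


def recommend(title, pivot, model, n=2):
    """Return up to n titles whose rating vectors are closest to `title`."""
    if title not in pivot.index:
        print("No recommendations found for", title)
        return []
    row = pivot.index.get_loc(title)
    query = csr_matrix(pivot.values[row].reshape(1, -1))
    # Ask for one extra neighbor, since the title matches itself.
    k = min(n + 1, len(pivot))
    _, indices = model.kneighbors(query, n_neighbors=k)
    return [pivot.index[i] for i in indices[0] if i != row][:n]


print(recommend("naruto", pivot_df, model_knn, n=2))
```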

In [31]:
print(finding_the_closest_title.__doc__)


    Function that takes in a string title and a pandas DataFrame df as input arguments, 
    and returns a tuple containing the closest matching title to the input title 
    and the Levenshtein distance score between the closest title and the input title.
    In other words, the function returns the most similar title to the name a user typed.
    


Steps of finding_the_closest_title:
- This function takes a string `title` and a pandas DataFrame `df` as input arguments.
- Create a new variable `anime` to hold the DataFrame `df` for readability.
- Calculate the Levenshtein distance between each title in the 'name' column of the DataFrame and the input `title`.
- The `match_the_score` function is used to calculate the distance score.
- The `enumerate` function adds an index number to each distance score.
- Sort the list of (index, distance score) tuples in descending order by the distance score: `sorted_levenshtein_scores = sorted(levenshtein_scores, key=lambda x: x[1], reverse=True)`
- Get the closest matching title to the input `title` by using the index of the highest scoring match.
- The `from_index_to_title` function is used to return the title string from the DataFrame given an index.
- Get the Levenshtein distance score of the closest matching title.
- Return a tuple containing the closest matching title and its Levenshtein distance score.
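
Since the scores are sorted in descending order and the highest one wins, the score behaves like a similarity rather than a raw distance. The sketch below assumes a normalised score of 1 minus the Levenshtein distance divided by the longer length; the helper names are hypothetical and the project may well use a library function instead.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Assumed score: 1.0 for identical strings, lower for distant ones."""
    longest = max(len(a), len(b)) or 1
    return 1 - levenshtein(a, b) / longest


def closest_title(title, names):
    """Return (best matching name, its score) for the typed title."""
    scores = list(enumerate(similarity(title, name) for name in names))
    ranked = sorted(scores, key=lambda x: x[1], reverse=True)
    best_index, best_score = ranked[0]
    return names[best_index], best_score


print(closest_title("narutto", ["bleach", "naruto", "one piece"]))
```

A score of exactly 1.0 would correspond to input with no spelling mistakes, which is one way the caller could decide between recommending directly and asking the user to confirm the title.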

In [32]:
print(from_title_to_index.__doc__) # just one step


    Function to return the matched index number of the anime name
    


In [33]:
print(match_the_score.__doc__) # just one step


    Function to find the closest title. It uses Levenshtein distance to calculate the differences between sequences
    


In [34]:
print(from_index_to_title.__doc__) # just one step


    Function to return the anime name that matches the index number
    


The resulting information is passed to:
- create_dict
- filtering_and
- filtering_or

In [35]:
print(create_dict.__doc__)


    The create_dict() function takes in four arguments - names (list of anime names to search for), 
    gen (list of genres to filter by), typ (list of anime types to filter by), 
    method (string indicating whether to filter by "or" or "and"), 
    and an optional n parameter indicating the maximum number of results to return. 
    It reads in a pre-processed anime DataFrame, filters it based on the input criteria, 
    and returns a dictionary of the resulting rows. If there are no matches, 
    it returns a string indicating it.
    


Steps of create_dict:
- This function takes in a list of anime titles `names`, lists of `gen`res and `typ`es, a filtering method `method`, and an optional number of results `n`.
- Load the anime dataframe from a CSV file using pandas.
- Filter the anime dataframe to only include titles that match those in the input list `names`.
- Remove the 'anime_id' and 'members' columns from the resulting dataframe.
- Reset the index of the resulting dataframe.
- Apply a filtering method based on the input `method`:
    - If 'or', use the `filtering_or()` function to filter the dataframe.
    - If 'and', use the `filtering_and()` function to filter the dataframe.
    - If `method` is neither 'or' nor 'and', raise a ValueError.
- Drop any duplicate titles from the resulting dataframe.
- Limit the resulting dataframe to the first `n` rows.
- If the resulting dataframe is empty, print an error message and return None.
- Otherwise, convert the resulting dataframe to a dictionary and return the dictionary.
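
The selection, deduplication, limiting, and conversion steps can be sketched as follows. The `to_records` helper and the toy DataFrame are hypothetical, the genre and type filtering is left out, and the list-of-records orientation is an assumption based on the output shown at the end of this notebook.

```python
import pandas as pd

# Hypothetical processed anime table (the real one is read from CSV).
anime = pd.DataFrame({
    "anime_id": [1, 2, 3, 4],
    "name": ["bleach", "one piece", "k on", "bleach"],
    "genre": ["Action, Shounen", "Action, Shounen", "Comedy, Music", "Action, Shounen"],
    "type": ["TV", "TV", "TV", "TV"],
    "members": [100, 200, 50, 100],
})


def to_records(names, n=10):
    """Keep rows whose name is in `names` and return them as a list of dicts."""
    result = anime[anime["name"].isin(names)]
    result = result.drop(columns=["anime_id", "members"]).reset_index(drop=True)
    # Drop duplicate titles, then keep at most n rows.
    result = result.drop_duplicates(subset="name").head(n)
    if result.empty:
        print("No matches found")
        return None
    return result.to_dict("records")


print(to_records(["bleach", "k on"]))
```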

In [36]:
print(filtering_and.__doc__)


    This function takes a DataFrame df, a list of genres, and a list of types as input arguments. 
    The function first creates a boolean mask genre_mask by applying a lambda function to 
    the 'genre' column of the DataFrame. The lambda function checks if the value is a 
    string using isinstance(x, str) and if all genres in the genres list are present 
    in the string, which is split by comma and space using x.split(', '). 
    The all() function returns True if all genres in the genres list are present 
    in the string. The resulting genre_mask will be True for rows where the genre 
    column contains all of the genres in the genres list.

    Then the function creates another boolean mask type_mask by using the isin() 
    method to check if each value in the 'type' column of the DataFrame is in the types list.

    Finally, the function applies both masks to the DataFrame df using the & operator 
    to create a new DataFrame filtered_df that includes only rows where b

Steps of filtering_and:
- This function takes a DataFrame `df`, a list of `genres`, and a list of `types` as input arguments.
- Create a boolean mask that filters rows where the genre column contains all of the genres in the `genres` list.
- Create a boolean mask that filters rows where the type column is in the `types` list.
- Apply both masks to the DataFrame `df` and create a new DataFrame `filtered_df` that includes only rows where both masks are True.
- Return the filtered DataFrame.
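
Following the docstring, an "and" filter might be written like this. The function name `filter_and` and the sample rows are hypothetical; the mask logic mirrors the description above.

```python
import pandas as pd

# Hypothetical rows; the last one has a missing genre.
df = pd.DataFrame({
    "name": ["naruto", "bleach movie", "k on", "unknown show"],
    "genre": ["Action, Shounen", "Action, Shounen", "Comedy, Music", None],
    "type": ["TV", "Movie", "TV", "TV"],
})


def filter_and(df, genres, types):
    """Keep rows that contain every requested genre and a requested type."""
    genre_mask = df["genre"].apply(
        lambda x: isinstance(x, str) and all(g in x.split(", ") for g in genres)
    )
    type_mask = df["type"].isin(types)
    return df[genre_mask & type_mask]


print(filter_and(df, ["Action", "Shounen"], ["TV"]))
```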

In [37]:
print(filtering_or.__doc__)


    The code defines a function "filtering_or" that filters a pandas dataframe based on user-defined 
    genres and types using an "OR" method. The function allows the user to select one or all possible 
    genres and types and returns a filtered dataframe with the selected genres and types. 
    The function also splits the genre and type columns and explodes them to account for multiple entries.
    


Steps of filtering_or:
- Make a copy of the input DataFrame
- Split the genre column into a list of genres
- Explode the genre column to create a new row for each genre in the list
- If genres are specified and 'ALL' is not one of them, filter the DataFrame to keep only rows where the genre is in the specified list  
- If types are specified and 'ALL' is not one of them, filter the DataFrame to keep only rows where the type is in the specified list
- If both genres and types are specified:
    - If 'ALL' is in the genres list, set genres to be all the unique genres in the filtered DataFrame
    - If 'ALL' is in the types list, set types to be all the unique types in the filtered DataFrame
    - Filter the DataFrame to keep only rows where the genre is in the genres list AND the type is in the types list
- Return the filtered DataFrame
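
A simplified sketch of the "or" filter, covering the split, explode, and `isin` steps. The name `filter_or` and the toy rows are hypothetical, and the 'ALL' handling is reduced to skipping the corresponding filter, which should give the same rows as expanding 'ALL' to every unique value.

```python
import pandas as pd

# Hypothetical rows with comma-separated genres.
df = pd.DataFrame({
    "name": ["naruto", "bleach movie", "k on"],
    "genre": ["Action, Shounen", "Action, Shounen", "Comedy, Music"],
    "type": ["TV", "Movie", "TV"],
})


def filter_or(df, genres, types):
    """Keep rows matching any requested genre and any requested type."""
    filtered = df.copy()
    # One row per (title, genre) pair.
    filtered["genre"] = filtered["genre"].str.split(", ")
    filtered = filtered.explode("genre")
    if genres and "ALL" not in genres:
        filtered = filtered[filtered["genre"].isin(genres)]
    if types and "ALL" not in types:
        filtered = filtered[filtered["type"].isin(types)]
    return filtered


print(filter_or(df, ["Shounen", "Music"], ["TV"]))
```

Because the genre column is exploded, a title that matches several requested genres appears once per matching genre, which is why a later duplicate-dropping step is useful.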

In [38]:
# Get the recommendations as a dictionary.
# First, the name of the anime we want similar titles for,
# then the number of suggestions (we may get fewer, or none, if there are not enough matches),
# then the genre we want (or write "All" if we choose the "or" filter),
# then the type we want (or write "All" if we choose the "or" filter),
# and finally the filtering method, "or" or "and".
create_dict(unsupervised_user_based_recommender("Naruto",5),["Shounen"],["TV"],"or")

These are the recommendations for similar animes to naruto

or


[{'name': 'bleach',
  'english_title': 'Bleach',
  'japanses_title': 'BLEACH - ブリーチ -',
  'genre': 'Shounen',
  'source': 'Manga',
  'duration': '24 min per ep',
  'episodes': 366.0,
  'score': 7.9,
  'rank': 722.0,
  'synopsis': "Ichigo Kurosaki is an ordinary high schooler—until his family is attacked by a Hollow, a corrupt spirit that seeks to devour human souls. It is then that he meets a Soul Reaper named Rukia Kuchiki, who gets injured while protecting Ichigo's family from the assailant. To save his family, Ichigo accepts Rukia's offer of taking her powers and becomes a Soul Reaper as a result. However, as Rukia is unable to regain her powers, Ichigo is given the daunting task of hunting down the Hollows that plague their town. However, he is not alone in his fight, as he is later joined by his friends—classmates Orihime Inoue, Yasutora Sado, and Uryuu Ishida—who each have their own unique abilities. As Ichigo and his comrades get used to their new duties and support each other o