<a href="https://colab.research.google.com/github/JCherryA050/phase_4_project/blob/main/Tech_Notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Anime Planet Recommendation System With ALS

## Business Problem

Anime Planet is a website where users can track and rate the anime and manga they consume. Using this information, Anime Planet maintains a global ranking of anime and manga, recommends titles based on users' preferences, and directs users to where they can watch or read a given title.

We have been tasked with using the anime-specific user data to find out whether we can build a more effective recommendation system with Alternating Least Squares (ALS).

## Setting Up Environment and Data

Because ALS operates on a sparse user-item rating matrix, which stays large even after cleaning the data, we will take advantage of Google's cloud service Colab to process this very large dataset. Colab also lets us use PySpark, which provides an ALS implementation.
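To make the sparsity point concrete, here is a minimal sketch in plain Python. The ratings are made-up illustrative values, not taken from the dataset, and `sparsity` is a hypothetical helper introduced only for this example.

```python
def sparsity(ratings):
    """Fraction of the user-item matrix that has no rating.

    `ratings` is an iterable of (user_id, item_id, rating) tuples.
    """
    ratings = list(ratings)
    users = {u for u, _, _ in ratings}
    items = {i for _, i, _ in ratings}
    observed = {(u, i) for u, i, _ in ratings}
    total_cells = len(users) * len(items)
    return 1.0 - len(observed) / total_cells


# Made-up toy data: 3 users, 4 titles, 5 observed ratings
toy_ratings = [(1, 10, 8), (1, 11, 6), (2, 10, 9), (3, 12, 7), (3, 13, 10)]
print(round(sparsity(toy_ratings), 3))
```

Even in this tiny example more than half of the cells are empty; with many users and titles the empty fraction is typically far higher, which is why the data is handled as a list of ratings rather than a dense matrix.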

To start, we will run the following to set up our Google Colab environment.

In [1]:
# Run for Google Colab environment
!pip install pyspark
!apt install openjdk-8-jdk-headless -qq
!pip install mlflow

openjdk-8-jdk-headless is already the newest version (8u292-b10-0ubuntu1~18.04).
0 upgraded, 0 newly installed, 0 to remove and 39 not upgraded.


Now that we have installed everything, we will filter our data. We are going to avoid Hentai recommendations for now, and we want only full-length shows or movies, so we remove any anime whose duration is given in seconds. We also drop the Music, OVA, Special, ONA, and Unknown types, keep only titles with a popularity rank below 14000, and sample the ratings down to a more manageable size.

After that, we will import the PySpark libraries we need and set up a Spark session.

In [2]:
import pandas as pd

anime_df = pd.read_csv('anime.csv')
anime_list_df = pd.read_csv('animelist.csv')
anime_df = anime_df[~anime_df['Genders'].str.contains("Hentai")]
anime_df = anime_df[~anime_df['Duration'].str.contains("sec")]
anime_df = anime_df[~anime_df['Type'].str.contains('|'.join(['Music' , 'OVA', 'Special', 'ONA', 'Unknown']))]
anime_df = anime_df[anime_df['Popularity'] < 14000]
cleaned_df = anime_list_df[anime_list_df['anime_id'].isin(anime_df['MAL_ID'])]

# There is way too much data in this set so we will be resampling the data to 
# make a smaller data set.
cleaned_df = cleaned_df.sample(2000000, random_state=1)

In [3]:
import pyspark
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml import feature
from pyspark.ml.feature import StringIndexer, VectorAssembler, OneHotEncoder
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit, CrossValidator
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml import Pipeline
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType

In [4]:
spark = SparkSession\
        .builder\
        .appName('anime_rec').config('spark.driver.host', 'localhost')\
        .getOrCreate()

Next, we are going to create a Spark DataFrame for our user rating data. We only want the rating, user ID, and anime ID. Because the dataset is already large, we drop everything else, namely watching_status and watched_episodes. Additionally, we need to make sure all of the remaining values are integers.

Note: If you are also using Google Colab, make sure to upload the CSVs to the Colab instance.

In [5]:
rec_data = spark.createDataFrame(cleaned_df)

In [6]:

rec_data = rec_data.withColumn('rating', rec_data['rating'].cast(IntegerType()))
rec_data = rec_data.withColumn('user_id', rec_data['user_id'].cast(IntegerType()))
rec_data = rec_data.withColumn('anime_id', rec_data['anime_id'].cast(IntegerType()))

In [7]:
rec_data.dtypes

[('user_id', 'int'),
 ('anime_id', 'int'),
 ('rating', 'int'),
 ('watching_status', 'double'),
 ('watched_episodes', 'double')]

In [8]:
rec_data = rec_data.drop('watching_status')
rec_data = rec_data.drop('watched_episodes')

In [9]:
rec_data

DataFrame[user_id: int, anime_id: int, rating: int]

## Building Our Model

Now that everything is set up, we can do a train/test split and build a model. Our first model starts from guessed parameter values, and from there we can try tuning them.
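For intuition about what ALS does before handing the work to Spark, the following is a minimal rank-1 sketch in plain Python. It is not Spark's implementation. It alternates between solving for the user factors with the item factors held fixed and the reverse, where each step is a closed-form regularized least-squares update. The toy ratings, the `lam` value, and the function name `als_rank1` are assumptions made for illustration.

```python
def als_rank1(ratings, n_iters=10, lam=0.1):
    """Rank-1 ALS on (user, item, rating) tuples.

    Returns the user factors, item factors, and the loss after each sweep.
    """
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    x = {u: 1.0 for u in users}  # one factor per user
    y = {i: 1.0 for i in items}  # one factor per item

    def loss():
        err = sum((r - x[u] * y[i]) ** 2 for u, i, r in ratings)
        reg = lam * (sum(v * v for v in x.values()) + sum(v * v for v in y.values()))
        return err + reg

    history = [loss()]
    for _ in range(n_iters):
        # Hold item factors fixed; each user's factor is a 1-D ridge regression
        for u in users:
            num = sum(r * y[i] for uu, i, r in ratings if uu == u)
            den = lam + sum(y[i] ** 2 for uu, i, _ in ratings if uu == u)
            x[u] = num / den
        # Hold user factors fixed; update each item's factor the same way
        for i in items:
            num = sum(r * x[u] for u, ii, r in ratings if ii == i)
            den = lam + sum(x[u] ** 2 for u, ii, _ in ratings if ii == i)
            y[i] = num / den
        history.append(loss())
    return x, y, history


# Made-up toy ratings: (user_id, anime_id, rating)
toy_ratings = [(1, 10, 8), (1, 11, 6), (2, 10, 9), (2, 12, 7), (3, 11, 5), (3, 12, 6)]
x, y, history = als_rank1(toy_ratings)
print('loss before:', history[0])
print('loss after: ', history[-1])
```

Because each half-step exactly minimizes the regularized squared error for one set of factors, the loss never increases from one sweep to the next. The rank and regParam settings used with Spark below play the roles of the factor dimension (1 here) and `lam`.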

In [10]:
from pyspark.ml.evaluation import RegressionEvaluator

from pyspark.ml.recommendation import ALS

(training, test) = rec_data.randomSplit([0.8, 0.2], seed=1)

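# Note: implicitPrefs=True makes ALS treat 'rating' as a confidence weight, so
# predictions are preference scores rather than values on the rating scale.
als = ALS(userCol='user_id', itemCol='anime_id', ratingCol='rating',
          coldStartStrategy='drop', nonnegative=True, implicitPrefs=True)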


In [11]:
params = ParamGridBuilder()\
  .addGrid(als.regParam, [0.1, 0.2])\
  .addGrid(als.maxIter, [20])\
  .addGrid(als.rank, [30]).build() # Ran earlier and found 10 to be best

evaluator = RegressionEvaluator(metricName='rmse', labelCol='rating', predictionCol='prediction')

tvs = TrainValidationSplit(estimator=als,
                    estimatorParamMaps=params,
                    evaluator=evaluator)

In [12]:
model = tvs.fit(training)

best_model = model.bestModel

Py4JJavaError: ignored

----------------------------------------
Exception happened during processing of request from ('127.0.0.1', 34244)
ERROR:root:Exception while sending command.
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/py4j/java_gateway.py", line 1207, in send_command
    raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/py4j/java_gateway.py", line 1033, in send_command
    response = connection.send_command(command)
  File "/usr/local/lib/python3.7/dist-packages/py4j/java_gateway.py", line 1212, in send_command
    "Error while receiving", e, proto.ERROR_ON_RECEIVE)
py4j.protocol.Py4JNetworkError: Error while receiving
ERROR:root:Exception while sending command.
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/py

In [None]:
# We see the best model has a rank of 10, so we will use that in our future models with this dataset
predictions = best_model.transform(test)
rmse = evaluator.evaluate(predictions)

print('RMSE = ' + str(rmse))
print('---Best Model---')
print(' Rank:', best_model.rank) 
print(' MaxIter:', best_model._java_obj.parent().getMaxIter())
print(' RegParam:', best_model._java_obj.parent().getRegParam())
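
As a reminder of what the rmse metric measures, here is a plain-Python sketch of the formula. The (actual, predicted) pairs are made up, and `rmse_of` is a hypothetical helper, not part of PySpark.

```python
import math


def rmse_of(pairs):
    """Root mean squared error over (actual, predicted) pairs."""
    pairs = list(pairs)
    squared_errors = [(actual - predicted) ** 2 for actual, predicted in pairs]
    return math.sqrt(sum(squared_errors) / len(pairs))


# Made-up ratings and predictions
toy_pairs = [(8, 7.0), (6, 8.0), (10, 9.0), (5, 5.0)]
print(rmse_of(toy_pairs))
```

Squaring the errors means a few large misses raise the score more than many small ones, so a lower RMSE indicates predictions that stay close to the observed ratings.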

## Matching IDs to Names

Now that we have a model with more dependable scores, we need to make its output user friendly. Recommending anime_id 1 to a user is far less helpful than recommending Cowboy Bebop!

We are going to pull in our data set with anime information and write a function to get the title.

In [None]:
anime_titles = spark.read.csv('anime.csv', header=True)
anime_titles.head(5)

In [None]:
def name_retriever(anime_id, dataframe=anime_titles):
    # Look the row up once in the dataframe that was passed in
    row = dataframe.where(dataframe.MAL_ID == anime_id).take(1)[0]
    name = row['English name']
    # Fall back to the original title if there is no English name available
    if name == 'Unknown':
        name = row['Name']
    return name

In [None]:
print(name_retriever(1, anime_titles))
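
Each call to name_retriever runs a Spark query, which adds up when printing many titles. One alternative, sketched here under the assumption that the title table fits comfortably in memory, is to build a plain dictionary once and look titles up from it. The inline CSV rows are made-up stand-ins that only reuse the column names seen above (MAL_ID, Name, English name), and `build_title_lookup` is a hypothetical helper.

```python
import csv
import io


def build_title_lookup(csv_file):
    """Map MAL_ID -> display title, preferring the English name when known."""
    lookup = {}
    for row in csv.DictReader(csv_file):
        english = row['English name']
        lookup[int(row['MAL_ID'])] = row['Name'] if english == 'Unknown' else english
    return lookup


# Made-up sample rows; in the notebook this could instead be
# open('anime.csv', newline='', encoding='utf-8')
sample = io.StringIO(
    "MAL_ID,Name,English name\n"
    "101,Example Show A (romaji),Example Show A\n"
    "102,Example Show B,Unknown\n"
)
titles = build_title_lookup(sample)
print(titles[101])
print(titles[102])
```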

## Getting Recommendations

We are now going to pull recommendations for users. Spark's ALS model already has a built-in method for this called recommendForUserSubset, along with a more general method called recommendForAllUsers.

In [None]:
users = rec_data.select(als.getUserCol()).distinct().limit(10)
userSubsetRecs = best_model.recommendForUserSubset(users, 10)
recs = userSubsetRecs.take(1)

In [None]:
recs

In [None]:
# use indexing to obtain the anime id of the top predicted item
first_recommendation = recs[0]['recommendations'][0][0]

# use the name retriever function to get the values
name_retriever(first_recommendation,anime_titles)
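
The chained indexing above is easier to read once the shape of a collected row is clear: each row pairs a user with a ranked list of (anime_id, predicted rating) entries. The sketch below mimics that shape with plain dictionaries and tuples; the IDs and scores are made up.

```python
# Stand-in for userSubsetRecs.take(1): one row per user, each holding a ranked
# list of (anime_id, predicted_rating) pairs. All values here are made up.
recs = [
    {'user_id': 42, 'recommendations': [(101, 9.1), (102, 8.7), (103, 8.2)]},
]

first_row = recs[0]                                        # first user in the result
top_anime_id, top_score = first_row['recommendations'][0]  # best-ranked pair
all_ids = [anime_id for anime_id, _ in first_row['recommendations']]

print(top_anime_id, top_score)
print(all_ids)
```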

In [None]:
recommendations = best_model.recommendForAllUsers(5)
recommendations.where(recommendations.user_id == 7340).collect()

We can also set up a function for creating a new user and getting recommendations for them based on review scores we input. We can put the same scores into Anime Planet and see how our recommendations compare to theirs!

In [None]:
def new_user_recs(user_id, new_ratings, rating_df, anime_title_df, num_recs):
    # turn the new_ratings list into a Spark DataFrame
    new_user_ratings = spark.createDataFrame(new_ratings, rating_df.columns)
    
    # combine the new ratings df with the rating_df
    anime_ratings_combined = rating_df.union(new_user_ratings)
    
    # create an ALS model and fit it
    als = ALS(maxIter=15, rank=15, regParam=0.2, userCol='user_id', itemCol='anime_id', ratingCol='rating', coldStartStrategy='drop')
    model = als.fit(anime_ratings_combined)
    
    # make recommendations for all users using the recommendForAllUsers method
    recommendations = model.recommendForAllUsers(num_recs)

    # get recommendations specifically for the new user that has been added to the DataFrame
    recs_for_user = recommendations.where(recommendations.user_id == user_id).take(1)

    for ranking, (anime_id, rating) in enumerate(recs_for_user[0]['recommendations']):
        anime_string = name_retriever(anime_id, anime_title_df)
        print('Recommendation {}: {}  | predicted score: {}'.format(ranking+1, anime_string, rating))

In [None]:
user_id = 1000000
user_ratings_1 = [(user_id,1,7), # Cowboy Bebop
                  (user_id,6,7), # Trigun
                  (user_id,30,10), # Neon Genesis Evangelion
                  (user_id,32937,10), # KonoSuba 2
                  (user_id,22199,5), # Akame ga Kill!
                  (user_id,18679,8)] # Kill la Kill
new_user_recs(user_id,
             new_ratings=user_ratings_1,
             rating_df=rec_data,
             anime_title_df=anime_titles,
             num_recs = 10)