<a href="https://colab.research.google.com/github/JCherryA050/phase_4_project/blob/main/First_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Anime Planet Recommendation System With ALS

## Business Problem

Anime Planet is a website where users can track and rate the anime and manga that they consume. Using this information, Anime Planet maintains a global ranking of anime/manga, offers a recommendation system based on users' preferences, and directs users to where they can watch or read a given anime/manga.

We have been tasked with taking user data, specifically the data dealing with anime, to find out whether we can create a more effective recommendation system using Alternating Least Squares (ALS).

In [None]:
import pandas as pd

## Setting Up Environment and Data

Because ALS operates on a large, sparse matrix of user ratings (even after cleaning the data to reduce its size), we will take advantage of Google's cloud service Colab, which lets us process an extremely large dataset. Google Colab also lets us use PySpark, which provides an ALS implementation.
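
To make the idea behind ALS concrete, here is a minimal pure-Python sketch. It is not PySpark's implementation: it assumes a toy list of (user, item, rating) triples and a rank of 1, so each regularized least-squares solve collapses to a scalar division. The names `als_rank1`, `sse`, `toy_ratings`, and `reg` are illustrative only.

```python
def als_rank1(ratings, n_users, n_items, reg=0.1, iters=20):
    """Rank-1 ALS on sparse (user, item, rating) triples (illustrative helper)."""
    u = [1.0] * n_users  # one latent factor per user
    v = [1.0] * n_items  # one latent factor per item
    for _ in range(iters):
        # Hold item factors fixed and solve for each user factor in closed form
        for i in range(n_users):
            num = sum(r * v[j] for (a, j, r) in ratings if a == i)
            den = reg + sum(v[j] ** 2 for (a, j, r) in ratings if a == i)
            u[i] = num / den
        # Hold user factors fixed and solve for each item factor
        for j in range(n_items):
            num = sum(r * u[i] for (i, b, r) in ratings if b == j)
            den = reg + sum(u[i] ** 2 for (i, b, r) in ratings if b == j)
            v[j] = num / den
    return u, v


def sse(ratings, u, v):
    """Sum of squared errors over the observed ratings only."""
    return sum((r - u[i] * v[j]) ** 2 for (i, j, r) in ratings)


# Only observed ratings are stored; missing user/item pairs are simply absent
toy_ratings = [(0, 0, 8), (0, 1, 6), (1, 0, 9), (1, 2, 4), (2, 1, 7), (2, 2, 5)]
u, v = als_rank1(toy_ratings, n_users=3, n_items=3)
print(sse(toy_ratings, u, v))
```

The alternation is the key point: with one side fixed, the other side's update is an ordinary least-squares problem, and only the observed entries of the sparse matrix ever enter the sums.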

To start, we will run the following to set up our Google Colab environment.

In [4]:
# Run for Google Colab environment
!pip install pyspark
!apt install openjdk-8-jdk-headless -qq
!pip install mlflow

Collecting pyspark
  Downloading https://files.pythonhosted.org/packages/89/db/e18cfd78e408de957821ec5ca56de1250645b05f8523d169803d8df35a64/pyspark-3.1.2.tar.gz (212.4MB)
Collecting py4j==0.10.9
  Downloading https://files.pythonhosted.org/packages/9e/b6/6a4fb90cd235dc8e265a6a2067f2a2c99f0d91787f06aca4bcf7c23f3f80/py4j-0.10.9-py2.py3-none-any.whl (198kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.1.2-py2.py3-none-any.whl size=212880768 sha256=a9bb4728e4119634e5ac1f16d79ca533611d691889e53181e591e309bc1a6c2f
  Stored in directory: /root/.cache/pip/wheels/40/1b/2c/30f43be2627857ab80062bef1527c0128f7b4070b6b2d02139
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9 pyspark-3.1.2

Now that we have installed everything we need, we will import the following libraries for PySpark, and set up a spark session.

In [5]:
import pyspark
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml import feature
from pyspark.ml.feature import StringIndexer, VectorAssembler, OneHotEncoder
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit, CrossValidator
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml import Pipeline
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType

In [6]:
spark = SparkSession\
        .builder\
        .appName('anime_rec').config('spark.driver.host', 'localhost')\
        .getOrCreate()

Next, we are going to create a Spark dataframe for our user recommendation data. We only want the rating, user ID, and anime ID. Because the dataset is already quite large, we want to drop anything we can, so we can get rid of watching_status and watched_episodes. Additionally, we need to make sure all of our remaining values are integers.

Note: If you are also using Google Colab, you will need to upload any CSVs to the Colab instance.

In [8]:
rec_data = spark.read.csv('animelist.csv', header=True)

In [9]:
rec_data = rec_data.withColumn('rating', rec_data['rating'].cast(IntegerType()))
rec_data = rec_data.withColumn('user_id', rec_data['user_id'].cast(IntegerType()))
rec_data = rec_data.withColumn('anime_id', rec_data['anime_id'].cast(IntegerType()))

In [10]:
rec_data.dtypes

[('user_id', 'int'),
 ('anime_id', 'int'),
 ('rating', 'int'),
 ('watching_status', 'string'),
 ('watched_episodes', 'string')]

In [11]:
rec_data = rec_data.drop('watching_status')
rec_data = rec_data.drop('watched_episodes')

In [12]:
rec_data

DataFrame[user_id: int, anime_id: int, rating: int]

## Building Our Model

Now that everything is set up, we can do a train/test split and build a model. Our first model uses guessed values for its parameters, and from there we can try tuning them.

In [19]:
from pyspark.ml.evaluation import RegressionEvaluator

from pyspark.ml.recommendation import ALS

(training, test) = rec_data.randomSplit([0.8, 0.2], seed=1)

als = ALS(maxIter=5, rank=10, regParam=0.01, userCol='user_id', itemCol='anime_id', ratingCol='rating', coldStartStrategy='drop')
# Fit the ALS model to the training set
model = als.fit(training)

In [20]:
# importing appropriate library
from pyspark.ml.evaluation import RegressionEvaluator

# Evaluate the model by computing the RMSE on the test data
predictions = model.transform(test)

evaluator = RegressionEvaluator(metricName='rmse', labelCol='rating', predictionCol='prediction')
rmse = evaluator.evaluate(predictions)
print('Root-mean-square error = ' + str(rmse))

Root-mean-square error = 3.13945031256278


Our Root Mean Square Error (RMSE) is about 3.14, which means our model's predicted rating for a given anime is typically off by roughly 3 points. Ratings are on a 10-point scale, so this error is too large to depend on. Next, we will use cross-validation to search for parameters that reduce the RMSE.
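
As a small worked example of what an RMSE of 3 means, the sketch below computes it by hand in plain Python. The ratings are hypothetical values on a 10-point scale, not taken from the dataset, and `rmse` is an illustrative helper rather than Spark's `RegressionEvaluator`.

```python
import math


def rmse(actual, predicted):
    """Root-mean-square error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical ratings, for illustration only
actual = [8, 6, 9, 4]
uniform = [5, 9, 6, 7]   # every prediction is off by exactly 3 points
skewed = [8, 6, 9, 10]   # three perfect predictions and one that is off by 6

print(rmse(actual, uniform))  # 3.0
print(rmse(actual, skewed))   # 3.0
```

Both prediction sets give the same RMSE, so the metric describes a typical error size and does not guarantee that every individual prediction lands within 3 points.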

In [22]:
# Takes about an hour to run
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# initialize the ALS model
als_model = ALS(userCol='user_id', itemCol='anime_id', ratingCol='rating', coldStartStrategy='drop')

# create the parameter grid              
params = ParamGridBuilder()\
  .addGrid(als_model.regParam, [0.01, 0.001, 0.1])\
  .addGrid(als_model.rank, [10]).build() # Ran earlier and found 10 to be best

# instantiating crossvalidator estimator
cv = CrossValidator(estimator=als_model, estimatorParamMaps=params, evaluator=evaluator, parallelism=4)
best_model = cv.fit(rec_data)
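
The long runtime follows from how many ALS models get fitted. The sketch below counts them in plain Python, assuming the grid from the cell above and `CrossValidator`'s default of 3 folds (the cell does not set `numFolds`). The variable names are illustrative.

```python
from itertools import product

reg_params = [0.01, 0.001, 0.1]  # same values as the grid above
ranks = [10]
num_folds = 3  # CrossValidator's default when numFolds is not set

# Every (regParam, rank) combination is fitted once per fold
grid = list(product(reg_params, ranks))
cv_fits = len(grid) * num_folds
# The best combination is then refitted once on the full dataset
total_fits = cv_fits + 1

print(len(grid), cv_fits, total_fits)  # 3 9 10
```

Each extra value added to the grid therefore costs three more full ALS fits on a dataset of this size.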

In [30]:
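# We see the best model has a rank of 10, so we will use that in our future models with this dataset
print(best_model.bestModel.rank)
# getParam() needs a parameter name, so the bare call fails; instead, look up the
# grid entry with the lowest average RMSE (avgMetrics is in the same order as params)
best_index = best_model.avgMetrics.index(min(best_model.avgMetrics))
print(params[best_index])

10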

## Matching IDs to Names

Now that we have a model with more dependable scores, we need to make it more user friendly. It is not very helpful to recommend anime_id 1 to a user rather than Cowboy Bebop!

We are going to pull in our data set with anime information and write a function to get the title.

In [31]:
anime_titles = spark.read.csv('anime.csv', header=True)
anime_titles.head(5)

[Row(MAL_ID='1', Name='Cowboy Bebop', Score='8.78', Genders='Action, Adventure, Comedy, Drama, Sci-Fi, Space', English name='Cowboy Bebop', Japanese name='カウボーイビバップ', Type='TV', Episodes='26', Aired='Apr 3, 1998 to Apr 24, 1999', Premiered='Spring 1998', Producers='Bandai Visual', Licensors='Funimation, Bandai Entertainment', Studios='Sunrise', Source='Original', Duration='24 min. per ep.', Rating='R - 17+ (violence & profanity)', Ranked='28.0', Popularity='39', Members='1251960', Favorites='61971', Watching='105808', Completed='718161', On-Hold='71513', Dropped='26678', Plan to Watch='329800', Score-10='229170.0', Score-9='182126.0', Score-8='131625.0', Score-7='62330.0', Score-6='20688.0', Score-5='8904.0', Score-4='3184.0', Score-3='1357.0', Score-2='741.0', Score-1='1580.0'),
 Row(MAL_ID='5', Name='Cowboy Bebop: Tengoku no Tobira', Score='8.39', Genders='Action, Drama, Mystery, Sci-Fi, Space', English name='Cowboy Bebop:The Movie', Japanese name='カウボーイビバップ 天国の扉', Type='Movie', Epis

In [45]:
def name_retriever(anime_id, dataframe=anime_titles):
    # Look up the title in the dataframe that was passed in, not the global one
    return dataframe.where(dataframe.MAL_ID == anime_id).take(1)[0]['Name']

In [1]:
# Should show Cowboy Bebop
print(name_retriever(1, anime_titles))


## Getting Recommendations

We are going to go ahead and pull recommendations for users. Spark already has built-in methods for this: recommendForUserSubset returns recommendations for a chosen set of users, and the more general recommendForAllUsers returns them for every user.

In [47]:
# Get a single user (an arbitrary one, since no ordering is specified)
users = rec_data.select(als.getUserCol()).distinct().limit(1)
# Get the top 10 recommended anime for that user
userSubsetRecs = model.recommendForUserSubset(users, 10)
# Collect the recommendations row to the driver
recs = userSubsetRecs.take(1)

In [48]:
# use indexing to obtain the anime id of the top predicted item
first_recommendation = recs[0]['recommendations'][0][0]

# use the name retriever function to get the title
name_retriever(first_recommendation, anime_titles)

'Kindaichi Shounen no Jikenbo Movie 1: Operazakan - Aratanaru Satsujin'

In [49]:
recommendations = model.recommendForAllUsers(5)
recommendations.where(recommendations.user_id == 3).collect()

[Row(user_id=3, recommendations=[Row(anime_id=28251, rating=21.12415885925293), Row(anime_id=40639, rating=18.66773223876953), Row(anime_id=3444, rating=18.416454315185547), Row(anime_id=928, rating=16.966115951538086), Row(anime_id=41260, rating=16.548372268676758)])]
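
The predicted ratings above exceed 10 even though ratings are on a 10-point scale. A fitted ALS model scores a user/item pair as the dot product of their latent factor vectors, and nothing bounds that product to the rating scale. The sketch below illustrates the ranking step in plain Python with hypothetical rank-2 factors. `top_k`, `user_vec`, and `item_factors` are illustrative names, and the values are made up rather than read from the model.

```python
def top_k(user_vec, item_factors, k):
    """Rank items by the dot product of the user and item factor vectors."""
    scores = {
        item_id: sum(u * v for u, v in zip(user_vec, vec))
        for item_id, vec in item_factors.items()
    }
    # Highest predicted score first, keep only the first k
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]


# Hypothetical factors; real ones would come from a fitted model
user_vec = [2.0, 3.0]
item_factors = {101: [1.0, 1.0], 102: [4.0, 3.0], 103: [0.5, 2.0]}

print(top_k(user_vec, item_factors, k=2))  # [(102, 17.0), (103, 7.0)]
```

Item 102 scores 17.0 here purely because its factors are large, which is the same effect that produces the out-of-range scores in the output above.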

We can also set up a function that creates a new user and gets recommendations for them based on review scores we input. We can put the same scores into Anime Planet and see how our recommendations compare to theirs!

In [50]:
def new_user_recs(user_id, new_ratings, rating_df, anime_title_df, num_recs):
    # turn the new_ratings list into a Spark DataFrame
    new_user_ratings = spark.createDataFrame(new_ratings, rating_df.columns)
    
    # combine the new ratings df with the rating_df
    anime_ratings_combined = rating_df.union(new_user_ratings)
    
    # create an ALS model and fit it
    als = ALS(maxIter=5, rank=10, regParam=0.01, userCol='user_id', itemCol='anime_id', ratingCol='rating',
              coldStartStrategy='drop')
    model = als.fit(anime_ratings_combined)
    
    # make recommendations for all users using the recommendForAllUsers method
    recommendations = model.recommendForAllUsers(num_recs)
    
    # get recommendations specifically for the new user that has been added to the DataFrame
    recs_for_user = recommendations.where(recommendations.user_id == user_id).take(1)

    for ranking, (anime_id, rating) in enumerate(recs_for_user[0]['recommendations']):
        anime_string = name_retriever(anime_id, anime_title_df)
        print('Recommendation {}: {}  | predicted score: {}'.format(ranking+1, anime_string, rating))

In [51]:
user_id = 1000000
user_ratings_1 = [(user_id,1,7),
                  (user_id,2,7),
                  (user_id,30,10),
                  (user_id,32937,10),
                  (user_id,8625,5),
                  (user_id,203,10)]
new_user_recs(user_id,
             new_ratings=user_ratings_1,
             rating_df=rec_data,
             anime_title_df=anime_titles,
             num_recs = 10)

Recommendation 1: Kanojo ga Kanji wo Suki na Riyuu.  | predicted score: 16.453216552734375
Recommendation 2: Universe  | predicted score: 15.597274780273438
Recommendation 3: 1/100 Train Station  | predicted score: 14.501545906066895
Recommendation 4: Attakai, Fuyu Canada  | predicted score: 13.245718002319336
Recommendation 5: Code Geass: Hangyaku no Lelouch Picture Drama - Kiseki no Anniversary  | predicted score: 13.189346313476562
Recommendation 6: 1/100 Shibuya Crossing  | predicted score: 12.995405197143555
Recommendation 7: 1/100 Rice Planting  | predicted score: 12.851329803466797
Recommendation 8: Kimagure Mercy  | predicted score: 12.703506469726562
Recommendation 9: Re Boot  | predicted score: 12.488151550292969
Recommendation 10: Tokimeki Runners  | predicted score: 12.397653579711914
