<a href="https://colab.research.google.com/github/JCherryA050/phase_4_project/blob/main/Copy_of_Tech_Notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# My Anime List Recommendation System With ALS

## Business Problem

My Anime List is a website where users can track and rate the anime and manga that they consume. Using this information, My Anime List maintains a global ranking of anime and manga, offers a recommendation system based on users' preferences, and directs users to where they can watch or read a given title.

We have been tasked with taking the user data that deals specifically with anime and finding out whether we can create a more effective recommendation system using Alternating Least Squares (ALS).

## Setting Up Environment and Data

Because ALS works on a large, sparse user-item matrix (even after cleaning the data to reduce its size), we will take advantage of Google's cloud service Colab, which will allow us to process an extremely large dataset. Google Colab also lets us use PySpark, which provides an ALS implementation.
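
To make the idea of ALS concrete, here is a minimal NumPy sketch on a tiny, made-up rating matrix. It is not Spark's implementation. The function name `als_sketch`, the toy ratings, and the choice to treat 0 as "not rated" are all assumptions made for illustration.

```python
import numpy as np

def als_sketch(ratings, mask, rank=2, reg=0.1, iters=20, seed=0):
    """Toy alternating least squares.

    ratings: (n_users, n_items) array
    mask:    same shape, True where a rating was observed
    """
    rng = np.random.default_rng(seed)
    n_users, n_items = ratings.shape
    U = rng.normal(scale=0.1, size=(n_users, rank))  # user factors
    V = rng.normal(scale=0.1, size=(n_items, rank))  # item factors
    ridge = reg * np.eye(rank)
    for _ in range(iters):
        # Hold item factors fixed and solve a small ridge regression per user
        for u in range(n_users):
            seen = mask[u]
            if seen.any():
                Vu = V[seen]
                U[u] = np.linalg.solve(Vu.T @ Vu + ridge, Vu.T @ ratings[u, seen])
        # Hold user factors fixed and solve a small ridge regression per item
        for i in range(n_items):
            seen = mask[:, i]
            if seen.any():
                Ui = U[seen]
                V[i] = np.linalg.solve(Ui.T @ Ui + ridge, Ui.T @ ratings[seen, i])
    return U, V

# Hypothetical 4 users x 5 anime; 0 means "not rated" in this toy example
ratings = np.array([
    [9, 0, 7, 0, 0],
    [0, 8, 0, 0, 6],
    [10, 0, 0, 5, 0],
    [0, 7, 8, 0, 0],
], dtype=float)
mask = ratings > 0

U, V = als_sketch(ratings, mask)
predicted = U @ V.T
observed_rmse = np.sqrt(np.mean((predicted[mask] - ratings[mask]) ** 2))
print("share of cells with a rating:", mask.mean())
print("RMSE on observed cells:", observed_rmse)
```

Only the observed cells enter each least-squares solve, which is why the method copes with a matrix that is mostly empty.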

To start, we will run the following to set up our Google Colab environment.

In [1]:
# Run for Google Colab environment
!pip install pyspark
!apt install openjdk-8-jdk-headless -qq
!pip install mlflow

openjdk-8-jdk-headless is already the newest version (8u292-b10-0ubuntu1~18.04).
0 upgraded, 0 newly installed, 0 to remove and 39 not upgraded.


Now that everything is installed, we will filter the data. We are going to avoid Hentai recommendations for now, and we want to keep only full-length TV series, so we remove any anime whose duration is given in seconds and any entry whose type is Movie, Music, OVA, Special, ONA, or Unknown.

Following that, we will import the libraries we need for PySpark and set up a Spark session.

In [2]:
import pandas as pd

anime_df = pd.read_csv('anime.csv')
anime_list_df = pd.read_csv('animelist.csv')
anime_df = anime_df[~anime_df['Genders'].str.contains("Hentai")]
anime_df = anime_df[~anime_df['Duration'].str.contains("sec")]
anime_df = anime_df[~anime_df['Type'].str.contains('|'.join(['Movie','Music' , 'OVA', 'Special', 'ONA', 'Unknown']))]
# anime_df = anime_df[anime_df['Popularity'] < 14000]
# anime_df = anime_df[anime_df['Premeired'] >= 1980]
cleaned_df = anime_list_df[anime_list_df['anime_id'].isin(anime_df['MAL_ID'])]

# The full set is too large to process here, so we draw a random sample
# of 5,000,000 ratings to work with.
cleaned_df = cleaned_df.sample(5000000, random_state=1)
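
The filtering logic can be checked on a few rows before running it on the full file. The sketch below uses made-up rows with the same column names as `anime.csv`. It combines the three conditions into one mask, passes `na=False` so that missing values would not break the mask, and uses `isin` for an exact match on `Type`. These are illustrative choices, not the notebook's original code.

```python
import pandas as pd

# Hypothetical rows using the same column names as anime.csv
toy_anime = pd.DataFrame({
    "MAL_ID":   [1, 5, 100, 200, 300],
    "Genders":  ["Action, Sci-Fi", "Action, Drama", "Hentai", "Comedy", "Music"],
    "Duration": ["24 min. per ep.", "1 hr. 55 min.", "25 min. per ep.",
                 "30 sec. per ep.", "4 min."],
    "Type":     ["TV", "Movie", "OVA", "TV", "Music"],
})

excluded_types = ["Movie", "Music", "OVA", "Special", "ONA", "Unknown"]

# Build one boolean mask from the three exclusion rules
keep = (
    ~toy_anime["Genders"].str.contains("Hentai", na=False)
    & ~toy_anime["Duration"].str.contains("sec", na=False)
    & ~toy_anime["Type"].isin(excluded_types)
)
kept_ids = toy_anime.loc[keep, "MAL_ID"].tolist()
print(kept_ids)
```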

In [3]:
cleaned_df

| | user_id | anime_id | rating | watching_status | watched_episodes |
|---|---|---|---|---|---|
| 92987086 | 300385 | 36949 | 5 | 2 | 12 |
| 98321496 | 317645 | 10447 | 0 | 1 | 0 |
| 12895905 | 41942 | 33475 | 0 | 6 | 0 |
| 31188545 | 101089 | 235 | 0 | 3 | 400 |
| 14474966 | 47015 | 11887 | 9 | 2 | 13 |
| ... | ... | ... | ... | ... | ... |
| 6724690 | 22026 | 7791 | 7 | 2 | 26 |
| 23052922 | 74586 | 36563 | 8 | 2 | 13 |
| 65350483 | 211503 | 20899 | 10 | 2 | 24 |
| 8701273 | 28432 | 11615 | 6 | 2 | 13 |
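
A quick way to see how sparse the user-item matrix is: divide the number of ratings by the number of possible user-anime pairs. The sketch below does this on a made-up miniature of the ratings table. The variable names and values are hypothetical, and the same three lines could be pointed at `cleaned_df`.

```python
import pandas as pd

# Hypothetical miniature of the ratings table
toy_ratings = pd.DataFrame({
    "user_id":  [1, 1, 2, 3, 3, 3],
    "anime_id": [1, 30, 30, 1, 20, 30],
    "rating":   [7, 10, 9, 0, 8, 6],
})

n_users = toy_ratings["user_id"].nunique()
n_items = toy_ratings["anime_id"].nunique()
# Fraction of the user x anime grid that actually holds a rating
density = len(toy_ratings) / (n_users * n_items)
print(f"{n_users} users x {n_items} anime, density = {density:.2%}")
```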


In [4]:
import pyspark
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml import feature
from pyspark.ml.feature import StringIndexer, VectorAssembler, OneHotEncoder
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit, CrossValidator
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml import Pipeline
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType

In [5]:
spark = SparkSession\
        .builder\
        .appName('anime_rec').config('spark.driver.host', 'localhost')\
        .getOrCreate()

Next, we are going to create a Spark dataframe for our user recommendation data. We only want the rating, user id, and anime id. Because the dataset is already pretty large, we want to drop anything we can, so we can get rid of watching_status and watched_episodes. Additionally, we need to make sure all of our remaining values are integers.

Note: If you are also using Google Colab, you will need to upload any CSVs to the Colab instance.

In [6]:
rec_data = spark.createDataFrame(cleaned_df)

In [7]:

rec_data = rec_data.withColumn('rating', rec_data['rating'].cast(IntegerType()))
rec_data = rec_data.withColumn('user_id', rec_data['user_id'].cast(IntegerType()))
rec_data = rec_data.withColumn('anime_id', rec_data['anime_id'].cast(IntegerType()))

In [8]:
rec_data.dtypes

[('user_id', 'int'),
 ('anime_id', 'int'),
 ('rating', 'int'),
 ('watching_status', 'bigint'),
 ('watched_episodes', 'bigint')]

In [9]:
rec_data = rec_data.drop('watching_status')
rec_data = rec_data.drop('watched_episodes')

In [10]:
rec_data

DataFrame[user_id: int, anime_id: int, rating: int]
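
As an alternative sketch, the same narrowing could be done in pandas before the data is handed to Spark, so that fewer columns have to be transferred. The frame `toy_cleaned` below is a hypothetical stand-in for `cleaned_df`, and the `int32` downcast assumes that all ids fit in a 32-bit integer, which Spark's ALS requires for its user and item columns in any case.

```python
import pandas as pd

# Hypothetical stand-in for cleaned_df with the same columns
toy_cleaned = pd.DataFrame({
    "user_id": [300385, 317645],
    "anime_id": [36949, 10447],
    "rating": [5, 0],
    "watching_status": [2, 1],
    "watched_episodes": [12, 0],
})

# Keep only the three columns ALS needs and downcast to 32-bit integers
slim_df = toy_cleaned[["user_id", "anime_id", "rating"]].astype("int32")
print(slim_df.dtypes)
```

Calling `spark.createDataFrame` on a frame like `slim_df` would then skip the separate cast and drop steps.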

## Building Our Model

Now that everything is set up, we can do a train/test split and build a model. Our first model will make some guesses for our parameters, and from there we can try tweaking things a bit.

In [11]:
from pyspark.ml.evaluation import RegressionEvaluator

from pyspark.ml.recommendation import ALS

(training, test) = rec_data.randomSplit([0.8, 0.2], seed=1)

als = ALS(userCol='user_id', itemCol='anime_id', ratingCol='rating', coldStartStrategy ='drop', nonnegative= True, implicitPrefs=True)


In [12]:
params = ParamGridBuilder()\
  .addGrid(als.regParam, [0.1, 0.2])\
  .addGrid(als.maxIter, [20])\
  .addGrid(als.rank, [15]).build() # Ran earlier and found 10 to be best

evaluator = RegressionEvaluator(metricName='rmse', labelCol='rating', predictionCol='prediction')

tvs = TrainValidationSplit(estimator=als,
                    estimatorParamMaps=params,
                    evaluator=evaluator)
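
For intuition, the grid above expands into every combination of the listed values, and `TrainValidationSplit` fits one model per combination on part of the training data (75% by default), keeping the one that scores best on the rest. The plain-Python sketch below only illustrates that expansion. The dictionary `grid` is a hypothetical mirror of the values used above, not a PySpark object.

```python
from itertools import product

# Same candidate values as the ParamGridBuilder call above
grid = {"regParam": [0.1, 0.2], "maxIter": [20], "rank": [15]}

# Cartesian product of all candidate values, one dict per combination
combos = [dict(zip(grid, values)) for values in product(*grid.values())]
for combo in combos:
    print(combo)
print("models to fit:", len(combos))
```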

In [13]:
model = tvs.fit(training)

best_model = model.bestModel

In [14]:
# Evaluate the best model on the held-out test set and print its hyperparameters
predictions = model.transform(test)
rmse = evaluator.evaluate(predictions)

print('RMSE = ' + str(rmse))
print('---Best Model---')
print(' Rank:', best_model.rank) 
print(' MaxIter:', best_model._java_obj.parent().getMaxIter())
print(' RegParam:', best_model._java_obj.parent().getRegParam())

RMSE = 5.731110336044163
---Best Model---
 Rank: 15
 MaxIter: 20
 RegParam: 0.2
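
To read the RMSE figure, it helps to see how the metric reacts to the scale of the predictions. The sketch below computes RMSE by hand on made-up numbers. The helper `rmse` and both prediction lists are hypothetical. One thing that may be worth checking, offered as an assumption rather than a diagnosis: a model trained with `implicitPrefs=True` tends to return preference-style scores of roughly 0 to 1, and comparing those against 0 to 10 ratings would inflate the error.

```python
import math

def rmse(labels, predictions):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical ratings on a 0-10 scale
labels = [8, 6, 9, 7]
on_scale = [7.5, 6.5, 8.0, 7.0]   # predictions on the same scale as the ratings
off_scale = [0.6, 0.5, 0.7, 0.5]  # predictions that live in roughly [0, 1]

print("same scale:     ", rmse(labels, on_scale))
print("different scale:", rmse(labels, off_scale))
```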


## Matching IDs to Names

Now that we have a trained model, we need to make its output more user friendly. It is not very helpful to recommend anime_id 1 to a user rather than Cowboy Bebop!

We are going to pull in our data set with anime information and write a function to get the title.

In [15]:
anime_titles = spark.read.csv('anime.csv', header=True)
anime_titles.head(5)

[Row(MAL_ID='1', Name='Cowboy Bebop', Score='8.78', Genders='Action, Adventure, Comedy, Drama, Sci-Fi, Space', English name='Cowboy Bebop', Japanese name='カウボーイビバップ', Type='TV', Episodes='26', Aired='Apr 3, 1998 to Apr 24, 1999', Premiered='Spring 1998', Producers='Bandai Visual', Licensors='Funimation, Bandai Entertainment', Studios='Sunrise', Source='Original', Duration='24 min. per ep.', Rating='R - 17+ (violence & profanity)', Ranked='28.0', Popularity='39', Members='1251960', Favorites='61971', Watching='105808', Completed='718161', On-Hold='71513', Dropped='26678', Plan to Watch='329800', Score-10='229170.0', Score-9='182126.0', Score-8='131625.0', Score-7='62330.0', Score-6='20688.0', Score-5='8904.0', Score-4='3184.0', Score-3='1357.0', Score-2='741.0', Score-1='1580.0'),
 Row(MAL_ID='5', Name='Cowboy Bebop: Tengoku no Tobira', Score='8.39', Genders='Action, Drama, Mystery, Sci-Fi, Space', English name='Cowboy Bebop:The Movie', Japanese name='カウボーイビバップ 天国の扉', Type='Movie', Epis

In [16]:
def name_retriever(anime_id, dataframe=anime_titles):
    # Look up the matching row once, using the dataframe that was passed in
    row = dataframe.where(dataframe.MAL_ID == anime_id).take(1)[0]
    name = row['English name']
    # Fall back to the original title if there is no English name available
    if name == 'Unknown':
        name = row['Name']
    return name

In [17]:
print(name_retriever(1, anime_titles))

Cowboy Bebop
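
Each call to `name_retriever` runs a Spark query, which can add up when many titles are needed. One possible alternative, sketched below under stated assumptions, is to collect the id and name columns once and build a plain dictionary. The `rows` list is a hypothetical stand-in for collected Spark rows, and the third entry is invented to show the fallback.

```python
# Hypothetical rows shaped like collected records from anime_titles
rows = [
    {"MAL_ID": "1", "Name": "Cowboy Bebop", "English name": "Cowboy Bebop"},
    {"MAL_ID": "5", "Name": "Cowboy Bebop: Tengoku no Tobira",
     "English name": "Cowboy Bebop:The Movie"},
    {"MAL_ID": "9999", "Name": "Romaji Only Title", "English name": "Unknown"},
]

def build_title_lookup(rows):
    """Map integer anime id -> display title, preferring the English name."""
    lookup = {}
    for row in rows:
        english = row["English name"]
        # Use the original title when the English name is the 'Unknown' placeholder
        lookup[int(row["MAL_ID"])] = row["Name"] if english == "Unknown" else english
    return lookup

titles = build_title_lookup(rows)
print(titles[1])
print(titles[9999])
```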


## Getting Recommendations

We are now going to pull recommendations for users. Spark's ALS model already has built-in methods for this: recommendForUserSubset returns recommendations for a given set of users, and the more general recommendForAllUsers returns them for every user.

In [18]:
users = rec_data.select(als.getUserCol()).distinct().limit(10)
userSubsetRecs = best_model.recommendForUserSubset(users, 10)
recs = userSubsetRecs.take(1)

In [19]:
recs

[Row(user_id=109050, recommendations=[Row(anime_id=1575, rating=0.5800600647926331), Row(anime_id=20, rating=0.557708203792572), Row(anime_id=19815, rating=0.5254242420196533), Row(anime_id=4224, rating=0.5145238041877747), Row(anime_id=23273, rating=0.5140708684921265), Row(anime_id=11061, rating=0.5120140314102173), Row(anime_id=20507, rating=0.5088703036308289), Row(anime_id=25777, rating=0.4907777011394501), Row(anime_id=10620, rating=0.489093542098999), Row(anime_id=8074, rating=0.48525917530059814)])]

In [20]:
# use indexing to obtain the anime id of the top predicted item
first_recommendation = recs[0]['recommendations'][0][0]

# use the name retriever function to get the values
name_retriever(first_recommendation,anime_titles)

'Code Geass:Lelouch of the Rebellion'

In [21]:
recommendations = best_model.recommendForAllUsers(5)
recommendations.where(recommendations.user_id == 7340).collect()

[Row(user_id=7340, recommendations=[Row(anime_id=5081, rating=0.7071511149406433), Row(anime_id=226, rating=0.7050871849060059), Row(anime_id=2167, rating=0.6992486119270325), Row(anime_id=9756, rating=0.6958047151565552), Row(anime_id=14741, rating=0.6565090417861938)])]
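
The nested rows above are awkward to read, so it can help to flatten them into one record per user and anime. The sketch below works on plain tuples that mimic the structure of the output. The `recs` list reuses three of the scores shown earlier, while `flatten_recs` and the partial `titles` mapping are hypothetical helpers, with only id 1575 taken from the notebook's own output.

```python
# Stand-in for collected recommendation rows: (user_id, [(anime_id, score), ...])
recs = [
    (109050, [(1575, 0.5800600647926331),
              (20, 0.557708203792572),
              (19815, 0.5254242420196533)]),
]

# Partial id -> title mapping; only 1575 is confirmed by the output above
titles = {1575: "Code Geass:Lelouch of the Rebellion"}

def flatten_recs(recs, titles):
    """Turn nested recommendations into one flat record per (user, anime) pair."""
    flat = []
    for user_id, user_recs in recs:
        for rank, (anime_id, score) in enumerate(user_recs, start=1):
            flat.append({
                "user_id": user_id,
                "rank": rank,
                "anime_id": anime_id,
                # Fall back to the raw id when no title is known
                "title": titles.get(anime_id, f"anime_id {anime_id}"),
                "score": round(score, 3),
            })
    return flat

flat = flatten_recs(recs, titles)
for record in flat:
    print(record)
```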

We can also set up a function that creates a new user and gets recommendations for them based on review scores we input. We can put the same scores into Anime Planet and see how our recommendations compare to theirs!

In [22]:
def new_user_recs(user_id, new_ratings, rating_df, anime_title_df, num_recs):
    # turn the new_ratings list into a Spark DataFrame
    new_user_ratings = spark.createDataFrame(new_ratings, rating_df.columns)
    
    # combine the new ratings df with the rating_df
    anime_ratings_combined = rating_df.union(new_user_ratings)
    
    # create an ALS model and fit it
    als = ALS(maxIter=15, rank=15, regParam=0.2, userCol='user_id', itemCol='anime_id', ratingCol='rating', coldStartStrategy='drop')
    model = als.fit(anime_ratings_combined)
    
    # make recommendations for all users using the recommendForAllUsers method
    recommendations = model.recommendForAllUsers(num_recs)

    # get recommendations specifically for the new user that has been added to the DataFrame
    recs_for_user = recommendations.where(recommendations.user_id == user_id).take(1)

    for ranking, (anime_id, rating) in enumerate(recs_for_user[0]['recommendations']):
        anime_string = name_retriever(anime_id, anime_title_df)
        print('Recommendation {}: {}  | predicted score: {}'.format(ranking+1, anime_string, round(rating)))

As a sanity check, I will enter myself as a user with some anime ratings and get 10 recommendations. We can investigate the recommendations to see if they make sense, and make adjustments based on our findings.

In [35]:
user_id = 1000000
user_ratings_1 = [(user_id,1,7), # Cowboy Bebop
                  (user_id,2,7), # Trigun
                  (user_id,30,10), # Neon Genesis Evangelion
                  (user_id,32937,10), # KonoSuba 2
                  (user_id,22199,5), # Akame ga Kill!
                  (user_id,18679,8), # Kill la Kill
                  (user_id, 28121, 1)] # Is It Wrong to Pick Up Girls in a Dungeon?


In [36]:
new_user_recs(user_id,
             new_ratings=user_ratings_1,
             rating_df=rec_data,
             anime_title_df=anime_titles,
             num_recs = 10)

Recommendation 1: Neon Genesis Evangelion  | predicted score: 10
Recommendation 2: Kaguya-sama:Love is War Season 2  | predicted score: 9
Recommendation 3: Owarimonogatari Second Season  | predicted score: 9
Recommendation 4: Kaguya-sama:Love is War  | predicted score: 9
Recommendation 5: Gintama.:Silver Soul Arc - Second Half War  | predicted score: 9
Recommendation 6: Attack on Titan Season 3 Part 2  | predicted score: 9
Recommendation 7: JoJo's Bizarre Adventure:Diamond is Unbreakable  | predicted score: 9
Recommendation 8: Grand Blue Dreaming  | predicted score: 9
Recommendation 9: Bakemonogatari  | predicted score: 9
Recommendation 10: KonoSuba:God's Blessing on This Wonderful World! 2  | predicted score: 9
