<a href="https://colab.research.google.com/github/Datangels/Machine_Learning_with_PySpark/blob/master/pyspark_recommendation_system.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Google Colab Configuration & Creating the SparkSession Object**

In [0]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
!tar xf spark-2.4.4-bin-hadoop2.7.tgz
!pip install -q findspark

In [0]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.4-bin-hadoop2.7"

In [0]:
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt

## **Read the Dataset**

In [0]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

In [0]:
dataset_not_clean = spark.read.csv('/content/drive/My Drive/pycharm_colab_training/dataset/movie_ratings_df.csv',inferSchema=True, header=True)

## **Exploratory Data Analysis**


In [0]:
print((dataset_not_clean.count(), len(dataset_not_clean.columns)))
# dataset_not_clean.printSchema()
# dataset_not_clean.describe().show()

## **Feature Engineering**

In [0]:
from pyspark.sql.functions import col, lit, rand
from pyspark.ml.feature import StringIndexer, IndexToString

In [0]:
stringIndexer = StringIndexer(inputCol="title", outputCol="title_new")
model = stringIndexer.fit(dataset_not_clean)
indexed_df = model.transform(dataset_not_clean)
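
ALS needs numeric item IDs, which is why the `title` strings are mapped to `title_new` above. By default, `StringIndexer` assigns index 0.0 to the most frequent label, 1.0 to the next, and so on. The following is a minimal plain-Python sketch of that idea, not Spark's implementation; the function name `frequency_index`, the sample titles, and the alphabetical tie-break are assumptions made for illustration.

```python
from collections import Counter


def frequency_index(labels):
    """Map each distinct label to a float index, most frequent label first."""
    counts = Counter(labels)
    # Sort by descending count; break ties alphabetically (an assumption here)
    # so the mapping is deterministic.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# Hypothetical sample data, not taken from the notebook's dataset.
titles = ["Star Wars", "Fargo", "Star Wars", "Contact", "Fargo", "Star Wars"]
title_to_index = frequency_index(titles)

# The reverse mapping plays the role IndexToString plays in Spark.
index_to_title = {index: title for title, index in title_to_index.items()}

print(title_to_index)
```

Keeping the reverse mapping around makes it possible to turn recommended indices back into readable titles later on.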

## **Splitting the Dataset**

In [0]:
# Fix the seed so the 75/25 split is reproducible across runs
train_df, test_df = indexed_df.randomSplit([0.75, 0.25], seed=42)
print("whole dataset: " + str(indexed_df.count()))
print("train_df dataset: " + str(train_df.count()))
print("test_df dataset: " + str(test_df.count()))

## **Build and Train the ALS Recommender Model**


In [0]:
from pyspark.ml.recommendation import ALS
rec = ALS(maxIter=10,regParam=0.01,userCol='userId',itemCol='title_new',ratingCol='rating',nonnegative=True,coldStartStrategy="drop")
rec_model = rec.fit(train_df)

## **Predictions and Evaluation on Test Data**

In [0]:
predicted_ratings = rec_model.transform(test_df)
predicted_ratings.orderBy(rand()).show(10)

In [0]:
from pyspark.ml.evaluation import RegressionEvaluator
evaluator = RegressionEvaluator(metricName='rmse',predictionCol='prediction',labelCol='rating')
rmse = evaluator.evaluate(predicted_ratings)
print(rmse)
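
The `rmse` metric is the square root of the mean squared difference between the actual and the predicted ratings, so it is expressed in the same units as the ratings themselves. The following plain-Python sketch shows the calculation on a handful of made-up values; the function name `rmse_score` and the sample numbers are illustrative assumptions, and this is not how `RegressionEvaluator` is implemented internally.

```python
import math


def rmse_score(actual, predicted):
    """Root mean squared error between two equal-length lists of numbers."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up ratings and predictions, for illustration only.
sample_ratings = [4.0, 3.0, 5.0, 1.0]
sample_predictions = [3.5, 3.0, 4.0, 2.0]

sample_rmse = rmse_score(sample_ratings, sample_predictions)
print(sample_rmse)  # 0.75
```

A lower value means the predictions are closer to the observed ratings on average; large individual errors are penalized more heavily because they are squared before averaging.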

## **Recommend Top Movies the Active User Might Like**

In [0]:
# Distinct movie indices in the full dataset
unique_movies_df = indexed_df.select('title_new').distinct()
a = unique_movies_df.alias('a')
user_id = 85  # Pick one user to generate recommendations for
# Movies this user has already rated
watched_movies_df = indexed_df.filter(indexed_df['userId'] == user_id).select('title_new').distinct()
b = watched_movies_df.alias('b')
# Left join, then keep only rows with no match in b: the movies the user has not rated yet
total_movies_df = a.join(b, a.title_new == b.title_new, how='left')
remaining_movies_df = total_movies_df.where(col("b.title_new").isNull()).select(a.title_new).distinct()
remaining_movies_df = remaining_movies_df.withColumn("userId", lit(int(user_id)))
# Score the unseen movies and show the five highest predicted ratings
recommendations_df = rec_model.transform(remaining_movies_df).orderBy('prediction', ascending=False)
recommendations_df.show(5, False)
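
The cell above follows a simple pattern: take every movie, drop the ones the user has already rated, score what remains, and keep the highest predictions. The following plain-Python sketch shows the same logic on toy data. The helper `top_unseen`, the movie indices, and the scores are all hypothetical; in the notebook the scores come from `rec_model.transform`.

```python
def top_unseen(all_items, seen_items, predict, n=5):
    """Return the n unseen items with the highest predicted score."""
    unseen = set(all_items) - set(seen_items)
    scored = [(item, predict(item)) for item in unseen]
    # Highest score first; ties are broken by item id for determinism.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:n]


# Toy data standing in for title_new indices and model predictions.
all_movies = [0, 1, 2, 3, 4]
watched = [1, 3]
toy_scores = {0: 3.2, 1: 4.9, 2: 4.1, 3: 2.0, 4: 4.6}

top_two = top_unseen(all_movies, watched, toy_scores.get, n=2)
print(top_two)  # [(4, 4.6), (2, 4.1)]
```

Movie 1 has the highest toy score but is excluded because the user has already rated it, which mirrors the null filter on `b.title_new` in the Spark version.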