In [1]:
from IPython.display import display, HTML
display(HTML("<style>.container {width : 99% !important;}</style>"))
display(HTML("<style>.output_result {width : 99% !important;}</style>"))
display(HTML("<style>.jp-Notebook {--jp-notebook-max-width: 99%;}</style>"))

In [2]:
import pyspark
sc = pyspark.SparkContext.getOrCreate()  # reuse a running context if one exists
spark = pyspark.sql.SparkSession(sc)
sc
spark

23/10/23 20:31:11 WARN Utils: Your hostname, Alashmony-Lenovo-Z51-70 resolves to a loopback address: 127.0.1.1; using 192.168.1.182 instead (on interface wlp3s0)
23/10/23 20:31:11 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/10/23 20:31:13 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


# Recommendations Are Everywhere

## Why learn how to build recommendation engines?
1. Why learn how to build recommendation engines?

Hi. Welcome to this course on building recommendation engines using Alternating Least Squares or "ALS" in PySpark.

2. What recommendations look like
You're probably already familiar with the output of these types of recommendation engines where a website tells you something along the lines of, "If you like that, then you'll probably like this." You've likely seen these types of recommendations on your favorite retail or media streaming websites. These recommendations are generated through different types of data that you as a user or customer provide either directly or indirectly.

3. Learning about you

When you purchase something online, or watch a movie, or even read an article, you are often given a chance to rate that item on a scale of 1 to 5 stars, a thumbs up or thumbs down, or some other type of rating scale. Based on your feedback from these types of rating systems, companies can learn a lot about your preferences, and offer you recommendations based on preferences of users that are similar to you.

4. How recommendation engines work

For example, suppose your movie streaming service sees that you liked Dark Knight and Iron Man and did not like Tangled, and it also sees other users who liked Dark Knight and Iron Man and did not like Tangled. The ALS algorithm would see that you and these other users have similar tastes. It would then look at the movies that you have not yet seen, find the ones rated highest among those similar users, and offer them as recommendations to you. This is why websites will often say things like, "Because you liked that movie, we think you'll like this movie," or "Users like you also watched this movie."

9. The Power of Recommendation Engines

These types of rating systems are extremely powerful. In fact, an article published by McKinsey & Company in October of 2013 stated that 35% of what customers buy on Amazon and 75% of what they watch on Netflix come from product recommendations based on algorithms such as the one you are going to be learning in this course. That's a powerful use of data, and with this course, you will learn how to do this. In addition, recommendation algorithms have alternate uses that can be extremely useful for purposes as broad as feature space reduction, image compression, mathematical user and product grouping, and latent feature discovery. You're going to learn some of these in this course.

10. Prerequisites

This tutorial is intended for those that have experience with Spark and Python, and understand the fundamentals of machine learning. If needed, some good introductory resources are DataCamp's Introduction to PySpark course, their Intermediate Python for Data Science course, and their Supervised Machine Learning with Python's SciKitLearn course.

11. Let's practice!

Let's jump in.

### See the power of a recommendation engine
Taylor and Jane both like watching movies. Taylor only likes dramas, comedies, and romances. Jane likes only action, adventure, and otherwise exciting films. One of the greatest benefits of ALS-based recommendation engines is that they can identify movies or items that users will like, even if they themselves think that they might not like them. Take a look at the movie ratings that Taylor and Jane have provided below. It would stand to reason that their different preferences would generate different recommendations.

This course touches on a lot of concepts you may have forgotten, so if you ever need a quick refresher, download the [PySpark Cheat Sheet](https://datacamp-community-prod.s3.amazonaws.com/65076e3c-9df1-40d5-a0c2-36294d9a3ca9) and keep it handy!

**Instructions**

- Take a look at `TJ_ratings` using the `.show()` method and any other methods you prefer to see how each of them rated the various movies they've seen.
- Input user names into the `get_ALS_recs()` function provided to see what movies ALS recommends for Jane and Taylor based on the ratings provided. Do the ratings make sense to you?

In [3]:
'''
# View TJ_ratings
TJ_ratings.show()

# Generate recommendations for users
get_ALS_recs(["Taylor","Jane"]) 
'''

'\n# View TJ_ratings\nTJ_ratings.show()\n\n# Generate recommendations for users\nget_ALS_recs(["Taylor","Jane"]) \n'

### Power of recommendation engines
What is a reason for learning to build recommendation engines?

Answer the question

- Sales always go up with recommendations.

- **Show users items/products relevant to them that they may not know are available.**

- Customers always take recommendations.

- Because other successful companies do it.

## Recommendation engine types and data types

1. Recommendation engine types and data types

In the world of recommendation engines, there are two basic types:

2. Two types of recommendation engines

Collaborative-filtering engines and content-based filtering engines. Both aim to offer meaningful recommendations, but they do so in slightly different ways. Content-based filtering, as the name suggests, tries to understand the content, or features, of the items, and makes recommendations based on your preferences for those specific features. For example, a movie streaming service might go to great lengths to add descriptive tags to its movies, such as the genre, whether it's animated or not, the language spoken in the movie, the decade it was filmed, and which actors were in it. So when a user like you gives 5 stars to a really dramatic Portuguese movie with specific actors from a specific decade, the service can infer that you like movies like this and will also like other dramatic movies in Portuguese with those same actors, and recommend those movies to you. Collaborative filtering is a little bit different.

6. Two types of recommendation engines

As explained in the previous video, collaborative filtering is based on user similarity. However, unlike content-based filtering, manually-created tags are not necessary. The features and groupings are created mathematically from patterns in the ratings provided by users. When you provide ratings for a product or item, whether it be a thumbs up or thumbs down, or even if you just watch a video without giving it a rating, you are providing meaningful insight about your preferences. From this behavior, the ALS algorithm can mathematically group you with similar users, predict your behavior, and help you have a more effective customer experience. While ALS can have content-based applications, this course will focus on its application to collaborative filtering, although many of the principles of collaborative filtering can be applied to content-based applications.

7. Two types of ratings

Now let's talk about ratings. In the realm of recommendation engines, there are two main types of ratings: explicit ratings and implicit ratings.

Explicit ratings are pretty straightforward. Examples of these are when you input a number of stars or something like a thumbs up or thumbs down. These are explicit ratings because users explicitly state how much they like or dislike something.

Implicit ratings are a little bit different. They are based on the passive tracking of your behavior, like the number of movies you've seen in different genres. Fundamentally, implicit ratings are generated from the frequency of your actions. For example, if you watch 30 movies, and of those 30 movies 22 are action movies and only 1 is a comedy, the low number of comedy views will be converted into a low confidence that you like comedies, and the high number of action movie views will be converted into a high confidence that you like action movies. These confidence values are then used as ratings. The logic behind this is, in essence, that the more you carry out a behavior, the higher the likelihood that you like it, and thus the higher the rating.

Additionally, in some cases you may not have access to behavior counts like this. A simpler form of ratings that still works with the ALS algorithm is binary ratings. Rather than having a count of user actions, binary ratings just show whether a user has done something, like watched a comedy (represented by a 1) or not watched a comedy (represented by a 0). These types of ratings aren't nearly as rich, but they can still provide meaningful insight and still work perfectly fine with the ALS algorithm.
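The conversion from behavior counts to implicit ratings described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the 22 action views and 1 comedy view out of 30 come from the example, while the other genre counts, the `view_counts` name, and the choice to normalize by total views are hypothetical and not necessarily how any particular library derives its confidence values.

```python
# Hypothetical view counts for one user. The 22 action views and 1 comedy view
# out of 30 follow the example above; the remaining genres are made up so the
# total comes to 30.
view_counts = {"action": 22, "comedy": 1, "drama": 4, "documentary": 3}

total_views = sum(view_counts.values())

# Count-based implicit rating: the share of the user's views in each genre.
share_ratings = {genre: n / total_views for genre, n in view_counts.items()}

# Binary implicit rating: 1 if the user has watched the genre at all, else 0.
all_genres = ["action", "comedy", "drama", "documentary", "horror"]
binary_ratings = {genre: int(view_counts.get(genre, 0) > 0) for genre in all_genres}

for genre in all_genres:
    print(genre, round(share_ratings.get(genre, 0.0), 3), binary_ratings[genre])
```

The count-based values keep the difference between a heavily watched genre and a rarely watched one, while the binary values only record that a genre was watched at all.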

14. Let's practice!

Now, let's look at some actual data.

### Collaborative vs content-based filtering
Below are statements that are often used when providing recommendations. Select the one that DOES NOT indicate collaborative filtering.

Answer the question

- **"Because you liked that product, we think you'll like this product."**

- "Users that bought that also bought this."

- "Other people like you also liked this movie."

- "80% of your friends liked this movie, we think you'll like it too."

- "Here are top choices from similar users."

### Collaborative vs content based filtering part II
Look at the `df` dataframe using the `.show()` method and/or the `.columns` attribute, and determine whether it is best suited for "collaborative filtering", "content-based filtering", or "both".

- collaborative filtering
- content-based filtering
- **both**

### Implicit vs explicit data
Recall the differences between implicit and explicit ratings. Take a look at the `df1` dataframe to understand whether the data includes implicit or explicit ratings data.

**Instructions**

- Use the `.columns` and `.show()` methods to get an idea of the data provided, and see if the data includes implicit or explicit ratings.
- Type "implicit" or "explicit" based on whether you think this data contains "implicit" ratings or "explicit" ratings. Name your response `answer`.

In [5]:
'''df1.show()'''

'df1.show()'

In [6]:
# Type "implicit" or "explicit"
answer = "implicit"

### Ratings data types
Markus watches a lot of movies, including documentaries, superhero movies, classics, and dramas. Drawing on your previous experience with Spark, use the `markus_ratings` dataframe, which contains data on the number of times Markus has seen movies in various genres, and think about whether these are implicit or explicit ratings. Use the `groupBy()` method to determine which genre has the highest rating, which could likely influence what recommendations ALS would generate for Markus.

**Instructions**

- Use the `groupBy()` method to group the `markus_ratings` dataframe by `"Genre"`.
- Apply the `.sum()` method to get the total number of movies watched for each genre.
- Be sure to add the `.show()` method at the end to view the counts.

In [7]:
'''
# Group the data by "Genre"
markus_ratings.groupBy("Genre").sum().show()
'''

'\n# Group the data by "Genre"\nmarkus_ratings.groupBy("Genre").sum().show()\n'

## Uses for recommendation engines
1. Uses for recommendation engines

So far, we've only considered recommendations as the use case for the ALS algorithm, but there are other applications that are also useful. These include latent feature discovery, item grouping, dimensionality reduction and image compression. In this course we'll only talk about some of these. First let's talk about latent features. As mentioned earlier, people will go to great lengths to effectively categorize items. But some products span various categories making them difficult to organize. Movies are often like this. Horror movies can be comedies. Dramas can be satires. Documentaries can be romances or even mysteries. Because of this, they can sometimes be difficult to market. If we had a better understanding of how consumers categorize movies based on their experience watching them, we could add more power to marketing strategies. ALS can help with this.

2. Basic factorization of ALS

When we have a matrix that contains users and movie ratings, ALS

3. Basic factorization of ALS (cont.)

will factor that matrix into two matrices, one containing

4. User matrix

user information and the other containing

5. Product matrix

product information, or in this case, movie information. Each matrix takes the respective

6. Factor matrix axes

labeled axis from the original matrix, and is given another axis that is unlabeled. The unlabeled axes contain what’s called

7. Latent feature axes

latent features. The number of latent features is referred to as the "rank" of these matrices. In this case, the rank chosen is 3. You, as a data scientist, get to choose how many of these ALS will create. These latent features represent groups that are created from patterns in the original ratings matrix, and the values in these columns represent how much each item falls into these groups. For example,

8. Horror vs drama

in the original ratings matrix, there might be a lot of people who like horror movies and don’t like dramas. They would rate

9. Horror vs Drama Part II

horror movies high

10. Horror vs Drama Part III

and dramas low. Likewise other people might like dramas and not like horror movies, and would rate

11. Horror vs Drama Part IV

dramas high

12. Horror vs Drama Part V

and horror films low. ALS can see this and determine that these are different types of movies.

13. Horror vs Drama Part VI

And if we were to look at the movie factor matrix, we would likely see that in one of the latent feature rows the dramas would score high and the horror movies would score low, while in another latent feature row we might see the

14. Horror vs Drama Part VII

opposite. Knowing a little about these movies, we could determine that those latent features reflect those two genres. This allows us to mathematically see how users experience these movies and to what degree users feel each movie falls into each respective category. This concept goes a bit deeper though. For example, if we were to look at a

15. Uncovering Features

movie matrix, we might see in one latent-feature row,

16. Uncovering Features Part II

that several movies have scored very high, but they

17. Uncovering Features Part III

don’t seem to have anything in common. If they are all popular movies,

18. Uncovering Features Part IV

we might want to research what’s going on here to see if there's a business opportunity. Digging deeper,

19. Shakespeare Adaptations

we find that these movies are all adaptations of Shakespeare plays, and that there seems to be a strong customer group that likes these types of movies. Now that we know this, we can use this information to inform how we choose what movies to make and hopefully give our customers more of what they want. It’s worth reiterating that in the original data set there was no column anywhere called “Shakespeare Adaptations”. It’s also worth noting that many or all of these customers may not even know that this is something that draws them to these movies. This is the type of powerful information that ALS can help us uncover.

20. Let's practice!

Now let's try to actually use this.

### Alternate uses of recommendation engines.
Select the best definition of "latent features".

Answer the question

- Features or tags that have manually been attached to items that categorize those items.

- **Features that are contained in data, but aren't directly observable.**

- Features that show up "later" in the machine learning process.

- Features that are added by human beings.

### Confirm understanding of latent features
Matrix `P` is provided here. Its columns represent movies and its rows represent several latent features. Use your understanding of Spark commands to view matrix `P` and see if you can determine what some of the latent features might represent. After examining the matrix, look at the dataframe `Pi`, which contains a rough approximation of what these latent features could represent. See if you weren't far off.

**Instructions**

- Examine matrix `P` using the `.show()` method.
- Examine matrix `Pi` using the `.show()` method.

In [8]:
'''
# Examine matrix P using the .show() method
P.show()

# Examine matrix Pi using the .show() method
Pi.show()
'''

'\n# Examine matrix P using the .show() method\nP.show()\n\n# Examine matrix Pi using the .show() method\nPi.show()\n'

# How does ALS work

## Overview of matrix multiplication
1. Matrix Multiplication

As you've probably realized, matrix operations are fundamental to the ALS algorithm. We're going to review matrix multiplication and matrix factorization here. Let's start with multiplication.

2. Matrix Multiplication

Here we have two square matrices. In order to multiply them together, we make specific pairs of the values from the two matrices, and add the products of those pairs. We start at the top left-hand corner of each matrix, and create pairs moving to the right on the first matrix and moving down on the second matrix, one at a time. Each pair is multiplied, and the products from all pairs are added together. The final sum makes up one number of the resulting matrix. That's a lot to digest, so let's walk through an example.

Starting at the top left number of each matrix, we have a pair of numbers, 1 and 9. We will multiply those numbers together. Then, moving to the right on the first matrix and down on the second matrix, we have 2 and 6. Moving right again on the first matrix and down again on the second matrix, we have 3 and 3. We have completed the first set of pairs, so let's add their products together: 1 times 9, plus 2 times 6, plus 3 times 3 is 9 plus 12 plus 9, which gives us 30. 30 is the first number in our final matrix.

From here we stay on the first row of the first matrix, but move on to the second column of the second matrix. These pairs give us 1 and 8, 2 and 5, and 3 and 2. 1 times 8, plus 2 times 5, plus 3 times 2 is equal to 8 plus 10 plus 6, which is 24. Moving to the next set of pairs, we multiply 1 and 7, 2 and 4, and 3 and 1. Their products are 7, 8 and 3, which add up to 18.

Once we've multiplied the first row of the first matrix by all columns of the second matrix, we go through the same process for the second row of the first matrix with all the columns of the second matrix, and so on, until all rows of the first matrix have been multiplied by all columns of the second matrix.

In this example, we multiplied two square matrices of the same dimensions. In reality, you can multiply any two matrices as long as the number of columns of the first matrix matches the number of rows of the second matrix. If they don't match, then some values in one of the matrices won't be paired, and the multiplication can't be completed.
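The walkthrough above can be checked with NumPy. In this sketch, the first row of `first` and all of `second` are read off the pairs named in the walkthrough; rows 2 and 3 of `first` are not given in the text and are assumed here purely for illustration.

```python
import numpy as np

# The first row of `first` and the whole of `second` follow the walkthrough;
# rows 2 and 3 of `first` are assumed for illustration.
first = np.array([[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]])
second = np.array([[9, 8, 7],
                   [6, 5, 4],
                   [3, 2, 1]])

# One entry by hand: pair row 0 of `first` with column 0 of `second`.
top_left = sum(first[0, k] * second[k, 0] for k in range(3))

# The full product: every row of `first` against every column of `second`.
product = first @ second

print(top_left)    # 30
print(product[0])  # [30 24 18]

# The shape rule: columns of the left matrix must equal rows of the right one.
print(first.shape[1] == second.shape[0])  # True
```

The first row of `product` reproduces the three sums worked out by hand: 30, 24 and 18.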

21. Let's practice!

Let's look at some examples, and practice matrix multiplication.

### Matrix multiplication
To understand matrix multiplication more directly, let's do some matrix operations manually.

**Instructions**

- Matrices `a` and `b` are Pandas dataframes. Use the `.head()` method on each of them to view them.
- Work out the product of these two matrices on your own.
- Enter the values of the product of the `a` and `b` matrices into the product array, created using `np.array()`.
- Use the validation on the last line of code to evaluate your estimate. The `.dot()` method multiplies two matrices together.

In [10]:
import pandas as pd
import numpy as np

In [15]:
a = pd.DataFrame(columns = [0,1], index= ['one', 'two'], data = np.array([[2,2], [3,3]]))
b = pd.DataFrame(columns = [0,1], index= ['one', 'two'], data = np.array([[1,2], [4,4]]))
a.head()


     0  1
one  2  2
two  3  3


In [16]:
b.head()

     0  1
one  1  2
two  4  4


In [17]:
# Use the .head() method to view the contents of matrices a and b
print("Matrix A: ")
print (a.head())

print("Matrix B: ")
print (b.head())

# Complete the matrix with the product of matrices a and b
product = np.array([[10,12], [15,18]])

# Run this validation to see how your estimate performs
product == np.dot(a,b)

Matrix A: 
     0  1
one  2  2
two  3  3
Matrix B: 
     0  1
one  1  2
two  4  4


array([[ True,  True],
       [ True,  True]])

### Matrix multiplication part II
Let's put your matrix multiplication skills to the test.

**Instructions**

- Use the `.shape` attribute to understand the dimensions of matrices `C` and `D`, and determine whether these two matrices can be multiplied together or not.
- If they can be multiplied, use the `np.matmul()` method to multiply them. If not, set `C_times_D` to `None`.

In [18]:
C = pd.DataFrame(data = np.array([[3 , 4 , 5 , 1,  2], [2 , 5  ,7  ,6  ,8], [1,  9  ,0  ,7  ,6], [2 , 2  ,3 , 3 ,1]])) 
D = pd.DataFrame(data = np.array([[1, 2], [3, 3],[9,8]]))
C.head()

   0  1  2  3  4
0  3  4  5  1  2
1  2  5  7  6  8
2  1  9  0  7  6
3  2  2  3  3  1


In [19]:
D.head()

   0  1
0  1  2
1  3  3
2  9  8


In [21]:
# Print the dimensions of C
print(C.shape)

# Print the dimensions of D
print(D.shape)

# Can C and D be multiplied together?
C_times_D = None

(4, 5)
(3, 2)


That's right. `C` has 5 columns but `D` has only 3 rows. Because the number of columns in `C` does not match the number of rows in `D`, `C` and `D` cannot be multiplied.

## Overview of matrix factorization
1. Overview of matrix factorization

Matrix factorization, or matrix decomposition, is essentially the opposite of matrix multiplication. Rather than multiplying two matrices together to get one new matrix, matrix factorization splits a matrix into two or more matrices which, when multiplied back together, produce an approximation of the original matrix. There are several different mathematical approaches for this, each of which has a different application. We aren't going to go into any of that here; we are simply going to review the factorization that ALS performs. Used in the context of collaborative filtering, ALS uses a factorization called non-negative matrix factorization. Because matrix factorization generally returns only approximations of the original matrix, in some cases it can return negative values in the factor matrices, even when attempting to predict positive values. When predicting what rating a user will give to an item, negative values don't really make sense. Neither do they make sense in the context of latent features. For this reason, the version of ALS that we will use will require that the factorization return only non-negative values. Let's look at some sample factorizations.

4. Matrix Factorization

Here is a sample matrix of possible item ratings. There are 5 rows and 5 columns. And here is one factorization of that matrix, called the LU factorization. Notice that the factor matrices are the same dimensions or rank as the original matrix. Also notice that some of the values in this factorization are negative. Using this type of factorization could result in negative predictions that wouldn't make sense in our context.

6. Matrix Factorization

Here is another factorization. In this case, all the values are positive meaning the resulting product of these factor matrices is guaranteed to be positive. This is closer to what we need for our purposes. Notice here that the dimensions of the factor matrices are such that the first factor matrix has the same number of rows as the original matrix, but a different number of columns. Also, the second factor matrix has the same number of columns as the original matrix, but a different number of rows. The dimensions of the factor matrices that don't match the original matrix are called the "rank" or number of latent features. In this case, we have chosen the "rank" of the factor matrices to be 3. What that means is that the number of

7. Rank of Factor Matrices

latent features of the factor matrices is 3. Remember that as a data scientist, when doing these types of factorizations, you get to choose the rank, or number of latent features the factor matrices will have.
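To make the dimensions concrete, here is a small NumPy sketch of two non-negative factor matrices for a 5x5 ratings matrix with rank 3. The factor values are random placeholders, and the names `user_factors` and `item_factors` are hypothetical; the point is only how the shapes line up, not a real factorization of any particular matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

n_users, n_items, rank = 5, 5, 3  # 5x5 ratings matrix, rank 3 as in the text

# Hypothetical non-negative factor matrices (random values, for shape only).
user_factors = rng.uniform(0.0, 2.0, size=(n_users, rank))  # 5 rows, 3 columns
item_factors = rng.uniform(0.0, 2.0, size=(rank, n_items))  # 3 rows, 5 columns

# Multiplying the factors back together gives a matrix with the original shape.
approximation = user_factors @ item_factors

print(user_factors.shape, item_factors.shape, approximation.shape)

# With non-negative factors, every entry of the product is non-negative too.
print(bool((approximation >= 0).all()))
```

Changing `rank` changes only the inner dimension shared by the two factor matrices; the product keeps the shape of the original ratings matrix.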

8. Filling in the Blanks

Now look at this matrix. Not all the cells have numbers in them. Despite this, we can still factor the values in the matrix. Also notice that because there is at least one value in every row and at least one value in every column, each of the factor matrices is totally full. Because of this, factoring a sparse matrix into two factor matrices gives us the means not only to approximate the original values that existed in the matrix to begin with, but also to provide predictions for the cells that were originally blank. And because the factorization is based on the values that existed previously, the blank cells are filled in based on those already-existing patterns. So when we do this with user ratings, the blanks are filled in with values that reflect the individual user behavior and the behavior of users similar to them. This is why this method is called collaborative filtering.
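The idea of factoring a matrix with blank cells and then reading predictions out of the product can be sketched with a bare-bones alternating least squares loop in NumPy. This is a teaching sketch, not Spark's implementation: the ratings in `R`, the values of `rank`, `reg` and `n_iters`, and the helper `solve_side` are all hypothetical, and this plain version does not enforce the non-negativity constraint discussed earlier.

```python
import numpy as np

# Hypothetical 4-user x 4-item ratings matrix; np.nan marks a blank cell.
# Every row and every column has at least one observed rating.
R = np.array([[5.0, 4.0, np.nan, 1.0],
              [4.0, np.nan, 1.0, 1.0],
              [1.0, 1.0, np.nan, 5.0],
              [np.nan, 1.0, 5.0, 4.0]])
observed = ~np.isnan(R)

rank, reg, n_iters = 2, 0.1, 20  # assumed values, chosen only for illustration
rng = np.random.default_rng(0)
U = rng.uniform(0.1, 1.0, size=(R.shape[0], rank))  # user factors
V = rng.uniform(0.1, 1.0, size=(R.shape[1], rank))  # item factors


def solve_side(fixed, ratings, mask):
    """Least-squares update of one factor matrix while the other is held fixed."""
    updated = np.zeros((ratings.shape[0], fixed.shape[1]))
    for row in range(ratings.shape[0]):
        seen = mask[row]               # which cells this row has observed
        A = fixed[seen]                # factors of the observed partners
        b = ratings[row, seen]         # the observed ratings themselves
        gram = A.T @ A + reg * np.eye(fixed.shape[1])
        updated[row] = np.linalg.solve(gram, A.T @ b)
    return updated


for _ in range(n_iters):
    U = solve_side(V, R, observed)      # fix items, solve for users
    V = solve_side(U, R.T, observed.T)  # fix users, solve for items

filled = U @ V.T  # every cell now has a value, including the former blanks
rmse = np.sqrt(np.mean((filled[observed] - R[observed]) ** 2))
print(np.round(filled, 2))
print("RMSE on observed cells:", round(float(rmse), 3))
```

Only the observed cells enter each least-squares solve, yet `filled` has a value in every position, which is where the predictions for the blank cells come from.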

15. Let's practice!

Let's look at some real-life examples of this.

## Data preparation for Spark ALS
1. Data preparation for Spark ALS

Let's talk about data preparation. Data preparation will consist of two things: (1) the correct dataframe format and (2) the correct schema. First, dataframe format.

2. Conventional Dataframe

Most dataframes you've seen probably look like this, with userIds in one column, all the features in the remaining columns, and the values of those features making up the contents of those columns. However, many PySpark algorithms, ALS included, require your data to be in row-based format

3. Row-based data format

like this. The data is the same. The first column contains userIds, but rather than a different feature in each column, column 2 contains feature names, and column 3 contains the value of that feature for that user. So a user's data can be spread

4. Row-based data format (cont.)

across several rows, and rows contain no null values. Depending on your data, you may need to convert it to this format. Now let's talk about creating the right schema.

5. Correct schema

As you see, our userId column and our generically named column of movie titles are strings. PySpark's implementation of ALS can only consume

6. Must be integers

userIds and movieIds as integers. So, again, you might need to convert your data to integers. Let's walk through an example of how to do all of this.

7. Conventional Dataframe

Here's a conventional dataframe. To convert it to a "long" or "dense" matrix, we will use a user-defined function called "wide_to_long":

8. Wide to long function

We won't go into the detail of how it works here, but it turns the conventional dataframe into a row-based dataframe like this:
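The course's `wide_to_long` function is not reproduced in these notes. As a stand-in, the same reshaping can be sketched with pandas `melt`; the table contents and the `wide` and `long_df` names below are hypothetical, and this is an analogous pandas operation rather than the Spark function used in the course.

```python
import pandas as pd

# Hypothetical wide ("conventional") ratings table: one row per user,
# one column per movie, missing values where the user gave no rating.
wide = pd.DataFrame({
    "userId": ["u1", "u2", "u3"],
    "Movie A": [5.0, None, 2.0],
    "Movie B": [None, 4.0, 3.0],
})

# Melt every movie column into (userId, variable, rating) rows,
# then drop the rows that carry no rating.
long_df = (
    wide.melt(id_vars="userId", var_name="variable", value_name="rating")
        .dropna(subset=["rating"])
        .reset_index(drop=True)
)

print(long_df)
```

Each user can now appear on several rows, one per rated movie, and no row contains a null rating.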

9. Long DF Output

If you'd like to access this function directly, a link will be provided at the end of the course. Now that we have the right dataframe format, let's get the right schema. In order to have integer userIds and movieIds, we need to assign unique integers to the userIds and the movieIds. To do this, we will follow 3 steps:

10. Steps to get integer ID's

(1) extract the unique userIds/movieIds, (2) assign a unique integer to each ID, and (3) rejoin these unique integer IDs to the ratings data. Let's start with userIds.

11. Extracting distinct user IDs

Let's first run this query to get all the distinct userIds into one dataframe and call it users.

12. Monotonically increasing ID

Then we'll import a method called "monotonically_increasing_id()", which will assign a unique integer to each row of our users dataframe. We need to be careful when using this because it numbers each partition of data independently: the IDs are unique, but they are not consecutive, and they can jump to very large values from one partition to the next. In order to get around this, we'll convert our data into one partition using the coalesce method.

13. Coalesce method

Also note that while the integers will be increasing by a value of 1 over each row, they may not necessarily start at 1. That's not super important here, what's really important is that they are unique. So now we can create a new column in our users dataframe

14. Persist method

called userIntId, set it to monotonically_increasing_id(), and we will have our new integer user IDs. Note that the monotonically_increasing_id() method can be a bit tricky, as the values it provides can change as you do different things to your dataset. For this reason, we've called the .persist() method to tell Spark to keep these values the same across all dataframe operations. We'll do the

15. Movie integer IDs

same thing with the movie id's and now we have two dataframes, one with our userIds and one with our movieIds. So let's join

16. Joining UserIds and MovieIds

them together along with our original dataframe on our userId and variable columns using the .join() method, specifying a "left" join. We can be even more thorough by creating a new dataframe with only the columns ALS needs, and renaming our columns using the .alias() method, which renames the column on which it is called,

17. Joining User and Movie Integer Ids

like this.

18. Let's practice!

Now let's prepare some data.

## ALS parameters and hyperparameters
1. ALS parameters and hyperparameters

As with most algorithms, ALS has arguments that we give it and hyperparameters which must be tuned in order to generate the best predictions.

2. Example ALS model code

Here is what a built-out ALS model looks like. Let's review each argument and hyperparameter.

3. Column names

The userCol, itemCol and ratingCol are straightforward. They simply tell Spark which columns in your dataframe contain the respective userIds, itemIds, and ratings. The first ALS hyperparameter is the rank.

4. Rank

As you already know, ALS will take a matrix of ratings, and it will factor that matrix into two different matrices, one representing the users, and the other representing the products, or items, or in our case, movies. In the process of doing this,

5. Rank (cont.)

latent features are uncovered. ALS allows you to choose the number of latent features that are created, which is referred to as the "rank" hyperparameter, often represented by the letter k.

6. Rank

Your objective with the data will be to determine the "rank". If you're trying to find meaningful groupings or categories of movies to see how similar or different movies are, you may want to experiment with different numbers of latent features. If you have too few or too many latent features, the groupings might be difficult to understand, so you'll want to look at different numbers of latent features and manually identify what makes the most sense. For purposes of recommendations however, the best number of latent features will be found through cross-validation.

7. MaxIter

The number of iterations, or "maxIter", simply tells ALS how many times to iterate back and forth between the factor matrices, adjusting the values to reduce the RMSE. Obviously, the higher the number of iterations, the longer it will take to complete, and the lower the number of iterations, the higher the risk of not fully reducing the error. So you'll have to determine what works for you.

8. RegParam

Many other machine learning algorithms have a regularization parameter, often called lambda. Lambda is simply a number that is added to an error metric as a penalty, to keep the algorithm from converging too quickly and overfitting to the training data. The lambda for ALS in PySpark is referred to as the "regParam".

9. Alpha

We'll talk about alpha later in the course, but suffice it to say that alpha is only used when using implicit ratings, and not used with explicit ratings.

10. Non-negative

Let's talk about the ALS arguments. As mentioned previously, there are several different factorizations that can be used to factor a matrix. The one that we are interested in is the non-negative factorization, so we set the nonnegative argument to True.

11. Cold start strategy

You might be familiar with the term coldStartStrategy already. In the context of ALS, when splitting data into test and train sets, it's possible for a user to have all of their ratings inadvertently put into the test set, leaving nothing in the train set to be used for making a prediction. In this case, ALS can't make meaningful predictions for that user, or calculate an error metric. To avoid this, we set the coldStartStrategy to "drop", which tells Spark not to use these cases when calculating the RMSE, and to only use users that have ratings in both the test AND training sets.

12. Implicit preferences

We also need to tell Spark whether our ratings are implicit or explicit. We do this by setting the implicitPrefs argument to True or False.

13. Sample ALS model build

Once we have a built-out model like you see here, we can fit it to training data, and then generate test predictions to see how well it performs. We can do this by

14. Fit and transform methods

calling the fit and transform methods as you see here. You'll do this yourself in subsequent exercises.

15. Let's practice!

Now it's your turn to build some models.

# Recommending Movies

## Introduction to the MovieLens dataset
1. Introduction to the MovieLens dataset

Up until now we've only been using sample datasets. Now we're going to begin using actual data using the

2. MovieLens dataset

MovieLens dataset. This dataset is made available by the good people at GroupLens.org and contains

3. MovieLens summary stats

roughly 20 million ratings from over 138,000 users on more than 27,000 movies. In order to provide you with a better learning experience, we will achieve shorter runtimes by using a subset of the original dataset including 100,000 ratings. In addition to the ratings data, GroupLens.org also provides additional datafiles that include information on movie genres and other types of tags that movie watchers have provided for them. We'll take what you've learned from the previous chapters and explore the data, prepare the data, build out a cross-validated ALS model, generate predictions and assess the model's performance. First we'll view the data using the

4. Explore the data

.show() method and the .columns attribute, as well as some other methods to understand the nature of the dataset.

5. MovieLens sparsity

Then we'll calculate its sparsity using this sparsity formula, and then we'll assess whether further preparation is needed in order to adequately prepare it for ALS. If you're not familiar with the term sparsity, it simply provides a measure of how empty a matrix is, or what percentage of the matrix is empty. In essence, this formula takes the number of ratings that a matrix contains, divides it by the number of ratings it could contain given the number of users and movies in the matrix, and subtracts the result from 1.

6. Sparsity: numerator

The code to calculate sparsity is pretty straightforward. We'll simply get the numerator by counting the number of ratings in the ratings dataframe

7. Sparsity: users and movies

then we'll get the number of distinct users and the number of distinct items or movies.

8. Sparsity: denominator

We'll then multiply the number of users and number of movies together to get the denominator

9. Sparsity

and simply divide the numerator by the denominator, and subtract the result from 1. Because dividing two integers in Python 2 returns an integer, we multiply the numerator by 1.0 to ensure a decimal or float is returned; in Python 3, the / operator already returns a float. Let's go over some other techniques that may or may not be new to you.
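Before moving on, the sparsity calculation described above can be sketched in plain Python. The toy `ratings` list is hypothetical; the comments note the PySpark DataFrame calls that would play the same role on a ratings dataframe.

```python
# Hypothetical toy ratings as (userId, movieId, rating) tuples.
ratings = [(1, 10, 4.0), (1, 20, 3.5), (2, 10, 5.0), (3, 30, 2.0)]

# Numerator: the number of ratings the matrix contains
# (on a DataFrame: ratings.select("rating").count()).
numerator = len(ratings)

# Distinct users and movies
# (on a DataFrame: ratings.select("userId").distinct().count(), and likewise for movieId).
num_users = len({user for user, _, _ in ratings})
num_movies = len({movie for _, movie, _ in ratings})

# Denominator: the number of ratings the matrix could contain.
denominator = num_users * num_movies

# Sparsity: the share of the matrix that is empty.
sparsity = 1.0 - numerator / denominator
print(f"{sparsity:.2%}")  # 55.56%
```

Plugging in the rounded figures quoted earlier for the full dataset (about 20 million ratings, 138,000 users and 27,000 movies) gives a sparsity of roughly 99.5%.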

10. The .distinct() method

As you may already know, the .distinct() method simply returns all the unique values in a column. For example, if you want to know how many unique users there are in a table, you could simply select the userId column from the dataframe, then run the distinct and count methods like you see here.

11. GroupBy method

The groupBy method organizes data by the unique values of a specific column to return subtotals for those unique values. For example, if you wanted to look at total number of ratings each user has provided you would first need to groupBy userId as you see here, then

12. GroupBy method

call the count method as you see here. With this, you could then

13. GroupBy method min

get the min

14. GroupBy method max

or max

15. GroupBy method avg

or average of that same column.

16. Filter method

The filter method allows you to filter out any data that doesn't meet your specified criteria. For example, if you wanted to only consider users that have rated at least 20 movies, you would simply apply the same groupBy and count methods, and then add a filter method specifying that the count column should only include values of 20 or more.

17. Let's practice!

Let's apply what you've learned to a real data set.

## ALS model buildout on MovieLens Data
1. ALS model buildout on MovieLens Data

If you remember from the last chapter, you built out a model on the ratings dataset. The code looked like this:

2. Fitting a basic model

Now, the RMSE that you got was lower than the 1.45 shown here. But what if you went through this whole process and got an error metric that you weren't satisfied with, like this RMSE of 1.45? You might want to try other combinations of hyperparameter values to try and reduce it. Spark makes it easy to do this by using two additional tools called

3. Intro to ParamGridBuilder and CrossValidator

the ParamGridBuilder and the CrossValidator. These tools will allow you to try many different hyperparameter values and have Spark identify the best combination. Let's talk about how to use them.

4. ParamGridBuilder

The ParamGridBuilder tells Spark all the hyperparameter values you want it to try. To do this, we first import the ParamGridBuilder class, instantiate it and give it a name. We'll call it param_grid. We then add each hyperparameter name by calling the .addGrid()

5. Adding Hyperparameters to the ParamGridBuilder

method on our als algorithm and hyperparameter name as you see here. Notice the empty lists to the right of the hyperparameter names. This is where we input the values we want Spark to try for each hyperparameter, like this:

6. Adding Hyperparameter Values to the ParamGridBuilder

Once we've added all of this, we call the .build() method to complete the build of our param_grid. Now let's look at the CrossValidator.

7. CrossValidator

The CrossValidator essentially fits a model to several different portions of our training dataset called folds, and then generates predictions for each respective holdout portion of the dataset to see how it performs.

8. CrossValidator instantiation and estimator

To properly use the CrossValidator, we first import the CrossValidator class, instantiate a CrossValidator and give it a name. We'll call it cv here.

9. CrossValidator ParamMaps

We then tell it to use our als model as an estimator by setting the estimator argument equal to the name of our model, which is als. We'll set the estimatorParamMaps to our param_grid that we built so that Spark knows what values to try as it works to identify the best combination of hyperparameters. Then we provide the name of our evaluator so it knows how to measure each model's performance by simply setting the evaluator argument to the name of our evaluator, which is "evaluator".

10. CrossValidator

We finish by setting the numFolds argument to the number of times we want Spark to test each model on the training data, in this case, 5 times. Let's go over how to integrate these into a full code buildout.

11. Random split

We'll first split our data into training and test sets using the randomSplit() method and we'll build a generic ALS model without any hyperparameters, only the model parameters as you see here. The cross validator will take care of the hyperparameters.

12. ParamGridBuilder

We'll build our ParamGridBuilder so Spark knows what hyperparameter values to test.

13. Evaluator

We'll create an evaluator so Spark knows how to evaluate each model.

14. CrossValidator

Then the CrossValidator will tell Spark the algorithm, the hyperparameters and values, and the evaluator to use to find the best model, and the number of training set folds we want each model to be tested on.

15. Best model

We then fit our CrossValidator on the training data to have Spark try all the combinations of hyperparameters we specified by calling the cv.fit() method on the training data. Once it's finished running, we extract the best-performing model by accessing the bestModel attribute of the fitted model. We'll call this our best_model and

16. Predictions and performance evaluation

with it, we can generate predictions on the test set, print the error metric and the respective hyperparameter values using the code you see here. And now we have our cross-validated model.

17. Let's practice!

Let's build a real model on a real dataset.

## Model Performance Evaluation
1. Model Performance Evaluation and Output Cleanup

Congratulations. You just built your first cross-validated ALS model. Now let's determine whether the model suits your needs or not. The primary way to do this is to examine

2. Root mean squared error

the error metric, in this case, the RMSE. The RMSE tells us, on average, how far a given prediction is from its corresponding actual value.
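The calculation walked through on the next few slides can be sketched in plain Python. The four prediction and actual values below are made up for illustration (N = 4, as on the slides) and are not the values shown in the course.

```python
import math

# Hypothetical actual ratings and predictions for N = 4 observations.
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.5, 4.0, 3.0]

# Subtract the actual value from each prediction.
differences = [p - a for p, a in zip(predicted, actual)]

# Square the differences to make them positive.
squared = [d ** 2 for d in differences]

# Sum the squared differences and average over the number of observations.
mean_squared_error = sum(squared) / len(actual)

# Take the square root to undo the squaring.
rmse = math.sqrt(mean_squared_error)
print(round(rmse, 4))  # 0.7906
```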

3. Pred vs actual

It's fairly straightforward. If we have predictions and actual values,

4. Pred vs actual: difference

the RMSE subtracts the actual value from the prediction,

5. Difference squared

then squares those differences to make them positive.

6. Sum of difference squared

It then sums those squared differences,

7. Average of difference squared

takes the average by dividing by the number of observations, in this case N = 4,

8. RMSE

and it then takes the square root to undo the squaring of the values that we did previously. So if we have an RMSE of 0.61, then on average, our predictions are 0.61 above or below the original rating. Another way to evaluate our model is to look at its recommendations. Remember, however, that ALS is often used to identify patterns and uncover latent features that are unobservable by humans, meaning that ALS can sometimes see things that may not initially make sense to us as humans. Bear this in mind as you move forward. To generate recommendations, we will use the native Spark function

9. Recommend for all users

recommendForAllUsers() which generates the top recommendations for all users. ALS recommendation output has two challenges that need to be addressed. The first is that it is in a format like

10. Unclean recommendation output

this, which is perfectly usable in PySpark but isn't very human-readable. To resolve this, we save the dataframe as a temporary table and use

11. Cleaning up recommendation output

this SQL query to make it readable. The explode command essentially takes an array like our recommendation column and separates each item within it, like this:

12. Explode function

Notice that only one movieId and its respective recommendation value for each user is contained on each line, where previously, all recommendations for a given user were contained on one line. Also notice that ALS conveniently includes the movieId and rating column names with each value on each line. This makes it easy to separate them into different columns.

13. Adding lateral view

Adding the LATERAL VIEW to the explode function allows us to treat the exploded column as a table, and extract the individual values as separate columns. We first name the lateral view, in this case we call it exploded_table and then give it a formal table name which we call movieIds_and_ratings. This allows us to SELECT userId, and then get the movieId and ratings by referencing the movieIds_and_ratings table in the beginning of our query. The output is now readable:

14. Explode and lateral view together

And if we join it to the original movie information,

15. Joining clean recs with movie info

we can see what's going on even better. The other challenge with these recommendations is that they include predictions for movies that have already been watched. Remember that ALS creates two factor matrices that are multiplied together to produce an approximation of the original ratings matrix. That's essentially what the ALS output is, including all movies for all users, whether they've seen them or not. A simple way to address this is to filter out the movies that have already been seen. Since we already have our clean recommendations, and the original movie_ratings,

16. Filtering recommendations

we can simply join these two dataframes together on "userId" and "movieId" using a "left" join.

17. Filtering recommendations (cont.)

The movies that have already been seen are those that have a rating from the original movie_ratings dataframe,

18. Filtering recommendations (cont.)

so if we simply add a filter so that the "rating" column of the movie_ratings dataframe is null, we'll only have predictions for movies that the individual users haven't seen.

19. Let's practice!

Now let's evaluate your model.

# What if you don't have customer ratings?

## Introduction to the Million Songs Dataset
1. Introduction to the Million Songs Dataset

By now you should be pretty comfortable with ALS. So far, we've only used explicit ratings. In most real-life situations, however, explicit ratings aren't available, and you'll have to get creative in building these types of models. One way to get around this is to use implicit ratings. Remember that while

2. Explicit vs implicit

explicit ratings are explicitly provided by users in various forms, implicit ratings are data used to infer ratings. For example, if a news website sees that in the last month you clicked on

3. Explicit vs implicit (cont.)

21 geopolitical articles and only 1 local news article, ALS can convert these numbers into scores indicating how confident it is that you like them. This approach assumes that the more you do something, the more you prefer it.

4. Implicit refresher II

ALS can use these confidence ratings to generate recommendations and you're going to learn how to do this. First, let's talk about the dataset you will be using.

5. Introduction to the Million Songs Dataset

The dataset this time comes from the Million Songs Dataset available from LabROSA at Columbia University. You're going to be using one file of this dataset called The Echo Nest Taste profile dataset. It contains information on over 1 million users including the number of times they've played nearly 400,000 songs. This is more data than we can use for this course, so we will only be using a portion of it. We'll first examine the data, get summary statistics, and then build and evaluate our model. One thing to note here is that because the use of implicit ratings causes ALS to calculate a level of confidence that a user likes a song based on the number of times they've played it, the matrix will need to include zeros for the songs that each user has not yet listened to. In case your data doesn't already include the zeros, we'll walk through how to do this.

6. Add zeros sample

Let's say we have a ratings dataframe like this:

7. Cross join intro

You can use the .distinct() method to extract the unique userIds and songIds, like this.

8. Cross join output

You can then perform a cross join, which joins each user to each song, like this. Notice that the 3 users and 3 songs we originally had now create 9 unique pairs. Using a left join,

9. Joining back original ratings data

you can take that cross_join table, and join it with the original ratings to get the num_plays column. Notice it joins on both userId and songId. And because we want 0's in place of the null values, so that every user has a value for every song, we simply call the

10. Filling in with zero

.fillna() method telling Spark to fill the null values with 0. And you have your final product to feed to ALS.

11. Add zeros function

Here are all those steps in a clean function.

12. Let's practice!

Let's do this with our Million Songs dataset.

## Evaluating implicit ratings models
1. Evaluating implicit ratings models

Now that we have an implicit ratings dataset, let's discuss these types of models. The first thing you should know is that implicit ratings models have an additional hyperparameter called alpha. Alpha is a numeric value that tells Spark how much each additional song play should add to the model's confidence that a user actually likes a song. Like the other hyperparameters, this will need to be tuned through cross validation. The challenge of these models is the evaluation. With explicit ratings, the metric we used was the RMSE. It made sense in that situation because we could

2. Why RMSE worked before

match predictions back to a true measure of user preference. In the case of implicit ratings however,

3. Why RMSE doesn't work now

we don’t have a true measure of user preference. We only have the number of times a user listened to a song and a measure of how confident our model is that they like that song. These aren't the same thing and calculating an RMSE between them doesn't make sense. However, using a test set, we can see if our model is giving high predictions to the songs that users have actually listened to. The logic is that if our model is returning a high prediction for a song that the respective user has actually listened to, then the predictions make sense, especially if they've listened to it more than once. We can measure this using this

4. (ROEM) Rank Ordering Error Metric

Rank Order Error Metric. In essence this metric checks to see if songs with higher numbers of plays have higher predictions.

5. ROEM bad predictions

For example, here is a set of bad predictions. The perc_rank column has ranked the predictions for each individual user such that the lowest prediction is in the highest percentile and the highest prediction is in the lowest percentile. Notice that these bad predictions include low predictions and high predictions for songs with more than one play, indicating that the predictions may not be any better than random. If we multiply the number of plays by the percentRank, we get

6. ROEM: PercRank * plays

this np*rank column.

7. ROEM: bad predictions

When we sum that column we get our ROEM numerator, and the sum of the numPlays column gives us our ROEM denominator. Using these, we can calculate our ROEM to be 0.556. Values close to 0.5 indicate that the predictions aren't much better than random. If we were to look at good predictions where the model gave high predictions to songs that had more than 1 play, they might look like this:

8. Good predictions

Notice that songs that have been played have high ratings indicating that the predictions are better than random. Which subsequently gives us an ROEM of

9. ROEM: good predictions

0.1111. This is much closer to 0, where we want to be. Unfortunately Spark hasn't implemented an evaluator metric like this, so you'll need to build it manually. An ROEM function will be provided to you in subsequent exercises. And for your reference, the code to build it is provided at the end of this course.
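To make the arithmetic concrete, here is a plain-Python sketch of the ROEM calculation described above. It is not the course's Spark function: the function name `roem`, the row layout and the toy data are assumptions, and ties between predictions are broken by input order as a simplification.

```python
from collections import defaultdict


def roem(rows):
    """ROEM for rows of (userId, songId, num_plays, prediction)."""
    by_user = defaultdict(list)
    for user, _song, plays, prediction in rows:
        by_user[user].append((plays, prediction))

    numerator = 0.0
    denominator = 0.0
    for items in by_user.values():
        # Rank each user's predictions from highest to lowest.
        ranked = sorted(items, key=lambda item: item[1], reverse=True)
        n = len(ranked)
        for position, (plays, _prediction) in enumerate(ranked):
            # Percent rank: 0.0 for the highest prediction, 1.0 for the lowest.
            perc_rank = position / (n - 1) if n > 1 else 0.0
            numerator += plays * perc_rank
            denominator += plays

    if denominator == 0:
        raise ValueError("ROEM is undefined when there are no plays")
    return numerator / denominator


# Hypothetical data for one user and three songs.
good = [(1, "a", 2, 0.9), (1, "b", 0, 0.1), (1, "c", 1, 0.5)]  # high predictions for played songs
bad = [(1, "a", 2, 0.1), (1, "b", 0, 0.9), (1, "c", 1, 0.5)]   # high prediction for an unplayed song

print(round(roem(good), 4))  # 0.1667
print(round(roem(bad), 4))   # 0.8333
```

As in the narration, lower values are better: the good predictions land near 0, while the bad ones land well above 0.5.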

10. ROEM: link to function on GitHub

Using this function, and a for loop

11. Building several ROEM models

you can build several models as you see here, each with different hyperparameter values. You'll want to create a model for each combination of hyperparameter values that you want to try.

12. Error output

You can then fit each one to the training data, extract each model's test predictions, and then calculate the ROEM for each one. This is a simplified way to do this. Full cross-validation is imperative to building good models. It is beyond the scope of this course to teach how to code a function that manually cross-validates and evaluates models like this, but you should do so in practice, and code to do so is provided at the end of the course.

13. Let's practice!

Let's put this into practice.

## Overview of binary, implicit ratings
1. Overview of binary, implicit ratings

So far we've covered situations when you have explicit ratings, and when you have implicit ratings from user behavior counts. Now we're going to cover the situation when you might not even have behavior counts. In some situations, you may only have binary data that tells you whether a user has or has not taken an action with no indication of how many times they've done so. To go back to the movie example, if you know whether customers have watched certain movies, but don't have information on how many times or how much they actually liked them, you could simply feed binary data to ALS that indicates which customers have watched each movie and which ones haven't. ALS can still pull signal from this type of data and make meaningful predictions. When taking this approach, the data will look like this.

2. Binary ratings

Notice that all ratings are either a 1 or a 0. We must treat binary ratings like these as implicit ratings. If we treated them like explicit ratings and didn't include the 0's, the best performing model would simply predict 1 for everything, and deliver a deceptively ideal RMSE of 0. Also, as with our previous Million Songs model, we can't use the RMSE as a model evaluation metric. Ultimately, when our machine learning process holds out random observations in the test set, we want our model to generate high predictions for those movies that users have actually watched. For this reason, we'll use our ROEM metric again. We'll apply the same concepts we've covered previously on this binary dataset. The convenience of using the MovieLens dataset is that we can see how our binary model performs against the original, true preference ratings of the original MovieLens dataset.

3. Class imbalance

One word about binary models. While it's perfectly feasible to feed binary data like this into ALS and get meaningful recommendations, the data does have a sort of class imbalance where the vast majority of ratings are 0's with a small percentage of 1's. Since implicit ratings models use customized error metrics like ROEM and not RMSE, the class imbalance doesn't really pose a problem like it might in classification problems. ALS can still generate meaningful recommendations from this type of data but there are strategies that can be taken with the data to try and improve recommendations.

4. Item weighting

For example, rather than treat unseen movies purely as 0's, you can weight them higher if more people have seen them. This assumes that if many people have seen a movie, it must be a pretty good movie and therefore should be treated with a little more weight, and vice versa. This is called item weighting.

5. Item weighting and user weighting

Likewise you could weight movies by individual user behavior. For example, if a user has seen lots of movies, you could weight their unseen movies lower, assuming that if a user has seen lots of movies, they know what they like and have deliberately chosen NOT to view the movies they haven't seen, and therefore those movies deserve a lower weighting. While these methods are applicable, they haven't been implemented in the PySpark framework, and therefore require a lot of manual work which is beyond the scope of this course. However, if you'd like to learn more about these types of approaches, you can read the paper referenced at the end of the course.

6. Let's practice!

Let's build a binary ratings model.

# Course recap
1. Course recap

Congratulations. You've now completed this course on building Collaborative Filtering recommendation engines in PySpark. We've covered a number of things, from why these are important to matrix multiplication and factorization and latent features. But most importantly, you've learned how to build and interpret recommendation engines with three different data types:

2. Course summary

* Explicit Ratings
* Implicit Ratings using user behavior counts
* Implicit Ratings using binary user behavior

With this information you'll be well-prepared to build a collaborative-filtering recommendation engine with the relevant data available to you as a data scientist. Some things to bear in mind about these types of models:

3. Things to bear in mind

If users don't have a lot of ratings, and ALS can't infer much about them, it's likely that ALS will make broad general recommendations that aren't really personalized. You might have seen this if you spent extra time exploring some of the recommendation output. Like all models, the more data there is, the better the model performs.

4. Things to bear in mind (cont.)

While we've gone over different ways of evaluating recommendation engines, the only way to really know if your model performs well is to test it on users and see if they actually take your recommendations. It's entirely possible that a simple binary implicit ratings model provides better recommendations for users than an explicit model, but the only way to know is to test it. Bear this in mind as you move forward.

5. Resources

Here are some resources to help you as you continue to learn about these models and begin to build them on your own. The first is a paper published by McKinsey and Company discussing the power of recommendation engines like ALS based models. The second is the code to build the wide_to_long function discussed in the section about preparing data for ALS. The third is the white paper that provides the academic background for building ALS models using implicit ratings. I highly recommend reading this paper as it provides a lot of context and insight into how these models work and alternative ways to evaluate them. The fourth is a GitHub link for code that manages the cross validation and model evaluation for implicit ratings models using ALS in PySpark. The last resource listed here is a paper that discusses the math and intuition behind the user-based weighting and item-based weighting methodologies for addressing the class imbalance present in binary ratings models. Congratulations on completing this course, and best of luck as you move forward.