# HW 8 - Tim Demetriades
## PySpark Part 2
4/17/2021
1.   For the provided data file "re_u.data", fix the "numIterations=20" and use different "rank" size,  rank=5, 7, 10, 20 and test the MSE values. 
2.   For the fixed "rank=20", and use different numIterations=2, 5,10, 20 and test the MSE values.
3.   For the fixed "rank=20" and "numIterations=20",  take different size of data. i.e.,  2000, 5000, 10000, 20000, 50000, 100000

For the above 3 different scenarios, from your observation, how MSE is changed related to the parameters?   That is, which factor, the rank size, or the numIterations size,  or the data size will change the MSE value more significantly? 

First step is to import the needed modules.

In [1]:
from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, SparkContext, Rating
import time

We should then initialize SparkContext, the main entry point for Spark functionality.

In [2]:
sc = SparkContext(appName='HW 8-2')

We can use `sc` to get some data on it and get a link to the UI.

In [3]:
sc

Read `re_u.data` file from local file system (available on all nodes) and return it as an RDD of Strings.

In [4]:
data_original = sc.textFile('re_u.data')

We can use the `.take()` method to get the first num elements of the RDD.

Return a new RDD by applying a function to each element of this RDD. Here we are changing the data type for the 3 columns to make the first 2 columns ints and the last column float.

In [5]:
data = sc.parallelize(data_original.take(10000))
ratings = data.map(lambda l: l.split(',')).map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2])))

To print the data in the RDD we need to use the `.collect()` method first.

In [6]:
ratings_collection = ratings.collect()    # return a list that contains all of the elements in this RDD
print(ratings_collection[0])              # print first row

Rating(user=196, product=242, rating=3.0)


We can use the `.count()` method to return the number of elements in this RDD. This is the number of rows in `re_u.data`.

In [7]:
ratings.count()

10000

Next, we build the recommendation model using Alternating Least Squares (ALS) algorithm to do matrix factorization.

The ratings matrix is approximated as the product of two lower-rank matrices of a given rank (number of features). To solve for these features, ALS is run iteratively with a configurable level of parallelism.

In [8]:
rank = 10    # number of latent features
numIterations = 10

In [9]:
model = ALS.train(ratings, rank, numIterations)

Then, we evalutate the model.

In [10]:
start_time = time.time()

test_data = ratings.map(lambda p: (p[0], p[1]))
predictions = model.predictAll(test_data).map(lambda r: ((r[0], r[1]), r[2]))
rates_and_predictions = ratings.map(lambda r: ((r[0], r[1]), r[2])).join(predictions)
MSE = rates_and_predictions.map(lambda r: (r[1][0] - r[1][1])**2).mean()

end_time = time.time()
elapsed_time = end_time - start_time
print(f'Done! Time elapsed - {elapsed_time:.2f} seconds.')

Done! Time elapsed - 28.00 seconds.


In [11]:
MSE = round(MSE, 5)
print('Mean Squared Error = ' + str(MSE))

Mean Squared Error = 0.05808


### Part 1 - Adjust Rank

In [12]:
numIteration = 20

for rank in [5, 7, 10, 20]:
    model = ALS.train(ratings, rank, numIterations)
    
    start_time = time.time()

    test_data = ratings.map(lambda p: (p[0], p[1]))
    predictions = model.predictAll(test_data).map(lambda r: ((r[0], r[1]), r[2]))
    rates_and_predictions = ratings.map(lambda r: ((r[0], r[1]), r[2])).join(predictions)
    MSE = rates_and_predictions.map(lambda r: (r[1][0] - r[1][1])**2).mean()

    end_time = time.time()
    elapsed_time = end_time - start_time
    print(f'Rank {rank} Done! Time elapsed - {elapsed_time:.2f} for seconds.')
    MSE = round(MSE, 5)
    print(f'Mean Squared Error = {str(MSE)} for Rank {rank}.')

Rank 5 Done! Time elapsed - 28.21 for seconds.
Mean Squared Error = 0.22746 for Rank 5.
Rank 7 Done! Time elapsed - 27.71 for seconds.
Mean Squared Error = 0.14346 for Rank 7.
Rank 10 Done! Time elapsed - 27.85 for seconds.
Mean Squared Error = 0.0623 for Rank 10.
Rank 20 Done! Time elapsed - 27.51 for seconds.
Mean Squared Error = 0.00625 for Rank 20.


We can see that for the above, **as the rank increases from 5 to 20, the MSE decreases significantly**, starting at **0.227** for rank 5 and decreasing all the way down to **0.006** for rank 20.

### Part 2 - Adjust numIterations

In [13]:
rank = 20

for numIterations in [2, 5, 10]:    # we did 20 interations for rank 20 above
    model = ALS.train(ratings, rank, numIterations)
    
    start_time = time.time()

    test_data = ratings.map(lambda p: (p[0], p[1]))
    predictions = model.predictAll(test_data).map(lambda r: ((r[0], r[1]), r[2]))
    rates_and_predictions = ratings.map(lambda r: ((r[0], r[1]), r[2])).join(predictions)
    MSE = rates_and_predictions.map(lambda r: (r[1][0] - r[1][1])**2).mean()

    end_time = time.time()
    elapsed_time = end_time - start_time
    print(f'{numIterations} Iterations Done! Time elapsed - {elapsed_time:.2f} for seconds.')
    MSE = round(MSE, 5)
    print(f'Mean Squared Error = {str(MSE)} for {numIterations} iterations.')

2 Iterations Done! Time elapsed - 27.61 for seconds.
Mean Squared Error = 0.103 for 2 iterations.
5 Iterations Done! Time elapsed - 27.23 for seconds.
Mean Squared Error = 0.01631 for 5 iterations.
10 Iterations Done! Time elapsed - 27.39 for seconds.
Mean Squared Error = 0.00638 for 10 iterations.


We can see that for the above, **as the number of iterations increases, the MSE decreases significantly**, starting at **0.103** for 2 iterations and decreasing all the way down to **0.006** (from the previous part) for 20 iterations.

### Part 3 - Adjust Data Amount

In [14]:
numIterations = 20
rank = 20

for data_size in [2000, 5000, 10000, 50000, 100000]:
    data = sc.parallelize(data_original.take(data_size))
    ratings = data.map(lambda l: l.split(',')).map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2])))    
    
    model = ALS.train(ratings, rank, numIterations)
    
    start_time = time.time()

    test_data = ratings.map(lambda p: (p[0], p[1]))
    predictions = model.predictAll(test_data).map(lambda r: ((r[0], r[1]), r[2]))
    rates_and_predictions = ratings.map(lambda r: ((r[0], r[1]), r[2])).join(predictions)
    MSE = rates_and_predictions.map(lambda r: (r[1][0] - r[1][1])**2).mean()

    end_time = time.time()
    elapsed_time = end_time - start_time
    print(f'{data_size} Rows of Data Done! Time elapsed - {elapsed_time:.2f} for seconds.')
    MSE = round(MSE, 5)
    print(f'Mean Squared Error = {str(MSE)} for {data_size} rows of data.')

2000 Rows of Data Done! Time elapsed - 26.95 for seconds.
Mean Squared Error = 0.0001 for 2000 rows of data.
5000 Rows of Data Done! Time elapsed - 27.81 for seconds.
Mean Squared Error = 0.00066 for 5000 rows of data.
10000 Rows of Data Done! Time elapsed - 26.66 for seconds.
Mean Squared Error = 0.00285 for 10000 rows of data.
50000 Rows of Data Done! Time elapsed - 27.39 for seconds.
Mean Squared Error = 0.15198 for 50000 rows of data.
100000 Rows of Data Done! Time elapsed - 27.75 for seconds.
Mean Squared Error = 0.29247 for 100000 rows of data.


We can see that for the above, **as the amount of data taken increases, the MSE increases** significantly, starting at **0.0001** for 2000 rows of data and increasing all the way up to **0.292** for 100,000 rows of data.

### Analysis
From the results above, we can see that all 3 parameters have an affect on the MSE. Starting with rank, we can see that as you **increase the value of rank** from 5 to 20, the **MSE decreases** by a great amount and therefore the model improves. This makes sense as the model is now using more latent features. 

Similarly for **number of iterations**, we can see that as this **value goes up the MSE decreases**. This makes sense as well since we are allowing the model to train longer.

Finally, we can see that as the **size of the data taken in increases, the MSE decreases**.

Of the 3, it appears that **changing the rank produces the biggest affect on MSE**. 