# AWS Data Output Processing

The following code downloads the output of the ALS model, trained with Spark on AWS, and prepares it for further analysis.

## Local Code Imports

In [None]:
# DO NOT REMOVE THESE
%load_ext autoreload
%autoreload 2

In [None]:
# DO NOT REMOVE THIS
%reload_ext autoreload

In [None]:
from src import model as mdl
from src import custom as cm
from src import make_data as md

## Code Imports

In [None]:
import boto3
import os
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from scipy.spatial import distance_matrix

# AWS ALS Factor Import and Conversion

## Link to AWS S3 and view objects

In order to download objects from AWS S3, a client connection must be established. The following cells establish a client connection and list the objects in the specified bucket.

In [None]:
s3 = boto3.resource('s3')
client = boto3.client('s3')
my_bucket = s3.Bucket('fp-movielens-data')

In [None]:
for obj in my_bucket.objects.all():
    print(os.path.join(obj.bucket_name, obj.key))

## Item Factors

Spark saves the output of the ALS model as a set of part files, one per partition of the distributed data. To work with the output outside of AWS EMR, these files need to be combined into a single CSV file. The following code does this for the item factors.

In [None]:
bucket = 'fp-movielens-data'
key = 'item_factors.csv/part-0000{}-40db7616-e552-48cd-bb18-9fba706fe5aa-c000.csv'

In [None]:
item_factors_df = md.get_factors(client, bucket, key, 10)
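
The body of `md.get_factors` is not shown in this notebook, so the following is only a minimal sketch of how the part files could be combined, not the actual implementation. It assumes the `{}` placeholder in the key is filled with the part index (0 to n - 1) and that every part file starts with the same header row. The `combine_part_files` function, the `fetch` callable, and the demo keys are hypothetical names used for illustration.

```python
import csv
import io


def combine_part_files(fetch, key_template, n_parts):
    """Concatenate CSV part files into one list of rows, keeping the header once.

    fetch        -- hypothetical callable: takes an object key, returns CSV text
    key_template -- key containing one '{}' placeholder for the part index
    n_parts      -- number of part files, indexed 0 .. n_parts - 1
    """
    combined = []
    for part in range(n_parts):
        text = fetch(key_template.format(part))
        rows = list(csv.reader(io.StringIO(text)))
        if not rows:
            continue  # skip part files that contain no rows at all
        # Assumption: every part starts with the same header row.
        header, body = rows[0], rows[1:]
        if not combined:
            combined.append(header)
        combined.extend(body)
    return combined


# In-memory stand-in for the bucket so the sketch runs without AWS credentials.
fake_bucket = {
    "factors.csv/part-00000-demo.csv": "id,features\n1,0.12\n1,0.34\n",
    "factors.csv/part-00001-demo.csv": "id,features\n2,0.56\n2,0.78\n",
}
rows = combine_part_files(
    fake_bucket.__getitem__, "factors.csv/part-0000{}-demo.csv", 2
)
print(rows)
```

Against a real bucket, `fetch` could wrap the boto3 client, for example `lambda k: client.get_object(Bucket=bucket, Key=k)["Body"].read().decode("utf-8")`.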

Further analysis requires the item factors in unstacked form. To unstack them, each feature is first assigned a label (the 'value' column in the output below). The unstacking function is in the model.py file in the src folder. The unstacked data is then saved.

In [None]:
# Rank = number of stacked feature rows per id, read from the first id
rank = item_factors_df.groupby('id')['features'].count().iloc[0]

In [None]:
item_factors_unstacked = mdl.unstack(item_factors_df, rank)
item_factors_unstacked.head()
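
The reshaping itself can be illustrated in plain Python. This is a sketch under stated assumptions rather than the code in `mdl.unstack`: it assumes the stacked data holds one (id, feature) row per latent factor and that the rows for each id arrive in factor order, so the position of a row within its id serves as the label. The `unstack_factors` function is a hypothetical name.

```python
from collections import defaultdict


def unstack_factors(stacked_rows, rank):
    """Reshape (id, feature) rows into one list of `rank` features per id.

    Assumption: rows for each id appear in factor order, so the n-th row
    seen for an id is that id's n-th latent factor.
    """
    wide = defaultdict(list)
    for entity_id, feature in stacked_rows:
        wide[entity_id].append(feature)  # list position acts as the label
    for entity_id, features in wide.items():
        if len(features) != rank:
            raise ValueError(
                f"id {entity_id} has {len(features)} factors, expected {rank}"
            )
    return dict(wide)


stacked = [(1, 0.12), (1, 0.34), (2, 0.56), (2, 0.78)]
wide = unstack_factors(stacked, rank=2)
print(wide)
```

Checking each id against `rank` catches part files that were skipped or truncated during the download.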

In [None]:
item_factors_unstacked.to_csv('../data/processed/item_factors_unstacked.csv')

## User Factors

As with the item factors, Spark saves the user factors as a set of part files, one per partition of the distributed data. To work with the output outside of AWS EMR, these files need to be combined into a single CSV file. The following code does this for the user factors.

In [None]:
key = 'user_factors.csv/part-0000{}-59dd1ef1-da71-4926-b18b-5a0d5f059a90-c000.csv'

In [None]:
user_factors_df = md.get_factors(client, bucket, key, 10)

Further analysis requires the user factors in unstacked form as well. To unstack them, each feature is first assigned a label, using the rank computed from the item factors above. The unstacked data is then saved.

In [None]:
user_factors_unstacked = mdl.unstack(user_factors_df, rank)

In [None]:
user_factors_unstacked.to_csv('../data/processed/user_factors.csv')

The user factors must be preprocessed for use in the KMeans model. The following code uses sklearn's StandardScaler to standardize each factor to zero mean and unit variance, so that all factors are on the same scale.

In [None]:
scaler = StandardScaler()
user_factors_scaled = scaler.fit_transform(user_factors_unstacked)
user_factors_scaled = pd.DataFrame(user_factors_scaled)
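
For reference, the transformation applied to each column is z = (x - mean) / std, where std is the population standard deviation. The following plain-Python sketch reproduces that arithmetic on a small made-up matrix. It illustrates the formula and is not a replacement for StandardScaler. The `standardize_columns` function and the sample values are hypothetical.

```python
from statistics import fmean, pstdev


def standardize_columns(rows):
    """Scale each column to zero mean and unit variance (population std)."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    # A constant column has std 0; dividing by 1.0 instead leaves it at zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]


# Made-up factors: the two columns differ in scale by a factor of ten.
factors = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
scaled = standardize_columns(factors)
for row in scaled:
    print(row)
```

After scaling, both columns hold the same values, which is why no single factor dominates the distance calculations in KMeans.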

In [None]:
user_factors_scaled.to_csv('../data/processed/user_factors_scaled.csv')