<i>Licensed under the MIT License.</i>

# Data Preparation - Movielens 25M Dataset with Visual Enrichment

## Movielens 25M Dataset
This notebook uses the [Movielens 25M Dataset](https://grouplens.org/datasets/movielens/25m/), which describes 5-star rating and free-text tagging activity from MovieLens, a movie recommendation service. It contains 25000095 ratings and 1093360 tag applications across 62423 movies. These data were created by 162541 users between January 09, 1995 and November 21, 2019. This dataset was generated on November 21, 2019.

Users were selected at random for inclusion. All selected users had rated at least 20 movies. No demographic information is included. Each user is represented by an id, and no other information is provided.

The data are contained in the files genome-scores.csv, genome-tags.csv, links.csv, movies.csv, ratings.csv and tags.csv. More details about the contents and use of these files follow.

## Visual Enrichment
The "tmdbId" column from the Movielens Dataset is used to query [The Movie Database (TMDb) API](https://www.themoviedb.org/documentation/api) for the corresponding movie poster. The poster URL and image file are stored for later use in the enrichment process.

Once the poster for each movie is retrieved, the image is sent to [Azure Computer Vision](https://docs.microsoft.com/en-us/azure/cognitive-services/computer-vision/overview) for analysis and metadata generation. The following resulting features are then used to enrich the Movielens Dataset:
* [Categories](https://docs.microsoft.com/en-us/azure/cognitive-services/computer-vision/concept-categorizing-images)
* [Color](https://docs.microsoft.com/en-us/azure/cognitive-services/computer-vision/concept-detecting-color-schemes)
* [Tags](https://docs.microsoft.com/en-us/azure/cognitive-services/computer-vision/concept-tagging-images)
* [Description](https://docs.microsoft.com/en-us/azure/cognitive-services/computer-vision/concept-describing-images)
* [Celebrities](https://docs.microsoft.com/en-us/azure/cognitive-services/computer-vision/concept-detecting-domain-content)

## Data Schema
### User Ids
MovieLens users were selected at random for inclusion. Their ids have been anonymized. User ids are consistent between ratings.csv and tags.csv (i.e., the same id refers to the same user across the two files).

### Movie Ids
Only movies with at least one rating or tag are included in the dataset. These movie ids are consistent with those used on the MovieLens web site (e.g., id 1 corresponds to the URL https://movielens.org/movies/1). Movie ids are consistent between ratings.csv, tags.csv, movies.csv, and links.csv (i.e., the same id refers to the same movie across these four data files).

### Ratings Data File Structure (ratings.csv)
All ratings are contained in the file ratings.csv. Each line of this file after the header row represents one rating of one movie by one user, and has the following format:
> userId,movieId,rating,timestamp

The lines within this file are ordered first by userId, then, within user, by movieId.
Ratings are made on a 5-star scale, with half-star increments (0.5 stars - 5.0 stars).
Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.
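
As a minimal sketch of how the rating and timestamp columns described above can be interpreted with pandas, the example below uses a tiny hand-made frame in place of ratings.csv. The rows and the `rated_at` and `valid_rating` names are assumptions made for illustration only.

```python
import pandas as pd

# Tiny stand-in for ratings.csv; these rows are made up for illustration.
ratings = pd.DataFrame(
    {
        "userId": [1, 1],
        "movieId": [10, 20],
        "rating": [5.0, 3.5],
        "timestamp": [0, 86400],
    }
)

# Interpret the integers as seconds since the Unix epoch, in UTC.
ratings["rated_at"] = pd.to_datetime(ratings["timestamp"], unit="s", utc=True)

# Ratings should sit on the half-star grid between 0.5 and 5.0.
valid_rating = ratings["rating"].between(0.5, 5.0) & (((ratings["rating"] * 2) % 1) == 0)

print(ratings[["timestamp", "rated_at"]])
print(valid_rating.all())
```

The same `pd.to_datetime(..., unit="s", utc=True)` call applies to the timestamp column of tags.csv, which uses the same encoding.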

### Tags Data File Structure (tags.csv)
All tags are contained in the file tags.csv. Each line of this file after the header row represents one tag applied to one movie by one user, and has the following format:
> userId,movieId,tag,timestamp

The lines within this file are ordered first by userId, then, within user, by movieId.
Tags are user-generated metadata about movies. Each tag is typically a single word or short phrase. The meaning, value, and purpose of a particular tag is determined by each user.
Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.

### Movies Data File Structure (movies.csv)
Movie information is contained in the file movies.csv. Each line of this file after the header row represents one movie, and has the following format:
> movieId,title,genres

Movie titles are entered manually or imported from https://www.themoviedb.org/, and include the year of release in parentheses. Errors and inconsistencies may exist in these titles.

Genres are a pipe-separated list, and are selected from the following:
* Action
* Adventure
* Animation
* Children's
* Comedy
* Crime
* Documentary
* Drama
* Fantasy
* Film-Noir
* Horror
* Musical
* Mystery
* Romance
* Sci-Fi
* Thriller
* War
* Western
* (no genres listed)
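
The sketch below shows one way to turn the pipe-separated genres string into a Python list and to pull the release year out of the title. The sample rows, the `split_genres` helper, and the `genre_list` and `year` column names are hypothetical, and the year pattern assumes the title ends with a four-digit year in parentheses.

```python
import pandas as pd

# Made-up rows that follow the movieId,title,genres layout.
movies = pd.DataFrame(
    {
        "movieId": [1, 2],
        "title": ["Sample Film A (1995)", "Sample Film B (2000)"],
        "genres": ["Adventure|Animation|Comedy", "(no genres listed)"],
    }
)


def split_genres(value):
    """Return the genres as a list; the placeholder value becomes an empty list."""
    if value == "(no genres listed)":
        return []
    return value.split("|")


movies["genre_list"] = movies["genres"].map(split_genres)

# Extract a trailing "(YYYY)" from the title; rows without one get NaN.
movies["year"] = movies["title"].str.extract(r"\((\d{4})\)\s*$", expand=False)

print(movies[["title", "genre_list", "year"]])
```

Because titles may contain errors and inconsistencies, the extracted year is best treated as optional.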

### Links Data File Structure (links.csv)
Identifiers that can be used to link to other sources of movie data are contained in the file links.csv. Each line of this file after the header row represents one movie, and has the following format:
> movieId,imdbId,tmdbId

movieId is an identifier for movies used by https://movielens.org. E.g., the movie Toy Story has the link https://movielens.org/movies/1.
imdbId is an identifier for movies used by http://www.imdb.com. E.g., the movie Toy Story has the link http://www.imdb.com/title/tt0114709/.
tmdbId is an identifier for movies used by https://www.themoviedb.org. E.g., the movie Toy Story has the link https://www.themoviedb.org/movie/862.
Use of the resources listed above is subject to the terms of each provider.
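
The link formats above can be sketched as a small helper. The `movie_urls` function is hypothetical, and the seven-digit zero padding of the IMDb id is an assumption inferred from the Toy Story example (pandas reads the column as a number, which drops leading zeros).

```python
def movie_urls(movie_id, imdb_id, tmdb_id):
    """Build the three external links for one row of links.csv."""
    return {
        "movielens": f"https://movielens.org/movies/{int(movie_id)}",
        # Assumption: IMDb title ids are zero-padded to at least seven digits.
        "imdb": f"http://www.imdb.com/title/tt{int(imdb_id):07d}/",
        # tmdbId is often read as a float because of missing values, so cast it.
        "tmdb": f"https://www.themoviedb.org/movie/{int(tmdb_id)}",
    }


urls = movie_urls(1, 114709, 862.0)
for name, url in urls.items():
    print(name, url)
```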

### Categories
In addition to tags and a description, Computer Vision returns the taxonomy-based categories detected in an image. Unlike tags, categories are organized in a parent/child hereditary hierarchy, and there are fewer of them (86, as opposed to thousands of tags). All category names are in English. Categorization can be done by itself or alongside the newer tags model.

Computer Vision can categorize an image broadly or specifically, using a fixed list of 86 categories. For the full taxonomy in text format, see [Category Taxonomy](https://docs.microsoft.com/en-us/azure/cognitive-services/computer-vision/category-taxonomy).

### Color
Computer Vision analyzes the colors in an image to provide three different attributes: the dominant foreground color, the dominant background color, and the larger set of dominant colors in the image. The set of possible returned colors is: black, blue, brown, gray, green, orange, pink, purple, red, teal, white, and yellow.

Computer Vision also extracts an accent color, which represents the most vibrant color in the image, based on a combination of the dominant color set and saturation. The accent color is returned as a hexadecimal HTML color code (for example, #00CC00).
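
A minimal sketch of reading these color attributes from a response is shown below. The `sample_response` dictionary is hand-written, the `summarize_color` helper is hypothetical, and the `accentColor` field name and its lack of a leading `#` are assumptions about the response shape.

```python
def summarize_color(data):
    """Collect the distinct dominant colors and the accent color of one response."""
    color = data["color"]
    # Foreground and background first, then the wider dominant set.
    ordered = [
        color["dominantColorForeground"],
        color["dominantColorBackground"],
        *color["dominantColors"],
    ]
    accent = color.get("accentColor")  # assumed field name
    return {
        # dict.fromkeys removes duplicates while keeping the order above.
        "colors": list(dict.fromkeys(ordered)),
        "accent": f"#{accent.lstrip('#')}" if accent else None,
    }


# Hand-written sample; not taken from a real API call.
sample_response = {
    "color": {
        "dominantColorForeground": "Black",
        "dominantColorBackground": "White",
        "dominantColors": ["Black", "White", "Red"],
        "accentColor": "00CC00",
    }
}

summary = summarize_color(sample_response)
print(summary)
```

Keeping the insertion order, rather than passing the values through `set`, makes the result reproducible between runs.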

### Tags
Computer Vision can return content tags for thousands of recognizable objects, living beings, scenery, and actions that appear in images. Tags are not organized as a taxonomy and do not have inheritance hierarchies. A collection of content tags forms the foundation for an image description displayed as human readable language formatted in complete sentences. When tags are ambiguous or not common knowledge, the API response provides hints to clarify the meaning of the tag in context of a known setting.

After you upload an image or specify an image URL, the Computer Vision algorithm can output tags based on the objects, living beings, and actions identified in the image. Tagging is not limited to the main subject, such as a person in the foreground, but also includes the setting (indoor or outdoor), furniture, tools, plants, animals, accessories, gadgets, and so on.
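
As an illustration of working with returned tags, the sketch below keeps only the most confident ones. The `top_tags` helper and its thresholds are hypothetical, the sample data is hand-written, and the per-tag `confidence` field is an assumption about the response shape.

```python
def top_tags(data, min_confidence=0.5, limit=3):
    """Return up to `limit` tag names, most confident first."""
    tags = [
        t for t in data.get("tags", [])
        if t.get("confidence", 0.0) >= min_confidence  # assumed field
    ]
    tags.sort(key=lambda t: t.get("confidence", 0.0), reverse=True)
    return [t["name"] for t in tags[:limit]]


# Hand-written sample; not taken from a real API call.
sample_response = {
    "tags": [
        {"name": "person", "confidence": 0.99},
        {"name": "outdoor", "confidence": 0.42},
        {"name": "poster", "confidence": 0.91},
        {"name": "text", "confidence": 0.87},
    ]
}

print(top_tags(sample_response))
```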

### Description
Computer Vision can analyze an image and generate a human-readable phrase that describes its contents. The algorithm returns several descriptions based on different visual features, and each description is given a confidence score. The final output is a list of descriptions ordered from highest to lowest confidence.

At this time, English is the only supported language for image description.
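
The confidence ordering described above can be sketched as follows. The `best_caption` helper is hypothetical, the sample data is hand-written, and the `captions` list with `text` and `confidence` fields is an assumption about the response shape.

```python
def best_caption(data):
    """Return the text of the highest-confidence caption, or None if there is none."""
    captions = data.get("description", {}).get("captions", [])  # assumed field
    if not captions:
        return None
    return max(captions, key=lambda c: c.get("confidence", 0.0))["text"]


# Hand-written sample; not taken from a real API call.
sample_response = {
    "description": {
        "tags": ["person", "poster"],
        "captions": [
            {"text": "a person standing in a field", "confidence": 0.61},
            {"text": "a movie poster", "confidence": 0.88},
        ],
    }
}

print(best_caption(sample_response))
```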

### Celebrities
In addition to tagging and high-level categorization, Computer Vision also supports further domain-specific analysis using models that have been trained on specialized data.

There are two ways to use the domain-specific models: by themselves (scoped analysis) or as an enhancement to the categorization feature.

## Importing Dataset | Movielens 25M

### Prerequisites
1) [Download raw dataset of Movielens 25M](https://files.grouplens.org/datasets/movielens/ml-25m.zip)

2) Unzip files

3) Copy files into "carve/datasets/ml-25m/"

In [None]:
import sys
import pandas as pd
import numpy as np
import datetime
import math
import requests
import json
import dask.dataframe as dd

print("System version: {}".format(sys.version))

In [None]:
# Dataset sample size - change to 0 for full dataset
SAMPLE_SIZE = 100

In [None]:
df_genome_scores = pd.read_csv("../../carve/datasets/ml-25m/genome-scores.csv")
df_genome_tags = pd.read_csv("../../carve/datasets/ml-25m/genome-tags.csv")
df_links = pd.read_csv("../../carve/datasets/ml-25m/links.csv")
df_movies = pd.read_csv("../../carve/datasets/ml-25m/movies.csv")
df_ratings = pd.read_csv("../../carve/datasets/ml-25m/ratings.csv")
df_tags = pd.read_csv("../../carve/datasets/ml-25m/tags.csv")

In [None]:
df_genome_scores.head()

In [None]:
df_genome_tags.head()

In [None]:
df_links.head()

In [None]:
df_movies.head()

In [None]:
df_ratings.head()

In [None]:
df_tags.head()

In [None]:
print(f"Movies: {df_movies.shape}")
print(f"Links: {df_links.shape}")

In [None]:
if SAMPLE_SIZE != 0:
    df_movies = df_movies.sample(n=SAMPLE_SIZE, random_state=0)
    # drop=True discards the old row labels instead of keeping them as an "index" column
    df_movies.reset_index(drop=True, inplace=True)

    checkpoint_file = f"../../carve/datasets/ml-25m/poster-urls-{SAMPLE_SIZE}.csv"
    output_file = f"../../carve/datasets/ml-25m/carve-movielens-{SAMPLE_SIZE}.csv"
    print(f"Movies: {df_movies.shape}")
else:
    checkpoint_file = "../../carve/datasets/ml-25m/poster-urls.csv"
    output_file = "../../carve/datasets/ml-25m/carve-movielens.csv"

In [None]:
# Join "movies" and "links" datasets, on Movielens's "movieId"
df = df_movies.set_index("movieId").join(df_links.set_index("movieId"))
print(f"Dataframe: {df.shape}")

In [None]:
df.reset_index(inplace=True)
df.head()

In [None]:
# Drop missing values
print(df.isnull().sum())
df = df.dropna()

In [None]:
print(df.isnull().sum())

In [None]:
# Type correct "tmdbId" as Integer (astype returns a new Series, so assign it back)
df["tmdbId"] = df["tmdbId"].astype("int")

## Getting Movie Image Url | The Movie Database API

Leverages [The Movie Database API](https://www.themoviedb.org/documentation/api). 

In [None]:
def getMovieImageUrl(tmdb_id):
    """
    Return the full-size poster URL for a TMDb movie id, or None on failure.
    """
    try:
        print(f"Processing {tmdb_id}...")
        TMDB_API_KEY = "<API-KEY>"
        image_prefix = "https://image.tmdb.org/t/p/"

        # Get The Movie Database poster art for given Movielens "tmdbId"
        # (cast to int so a float id such as 862.0 still yields a valid URL)
        response = requests.get(f"https://api.themoviedb.org/3/movie/{int(tmdb_id)}/images?api_key={TMDB_API_KEY}")
        response.raise_for_status()
        images = response.json()

        # Build the full-size ("original") URL of the first listed poster
        poster = f"{image_prefix}original{images['posters'][0]['file_path']}"

        print(f"Success {tmdb_id}.")
        return poster
    except Exception as e:
        print(f"Failure {tmdb_id}. {e}")
        return None


In [None]:
# For each The Movie Database "tmdbId" in Movielens Dataset, get movie poster art url.
# The download and enrichment cells below read the "posterUrl" column, so uncomment
# this line unless that column was already restored from a saved checkpoint.
#df["posterUrl"] = df["tmdbId"].map(lambda x: getMovieImageUrl(x))

In [None]:
df.head()

In [None]:
#Save checkpoint for new dataset
df.to_csv(checkpoint_file)

## Downloading Movie Poster Art | The Movie Database API

In [None]:
def downloadMovieImage(img_url):
    """
    Download the poster at img_url into the local images folder.
    Returns the local file path, or None on failure.
    """
    try:
        print(f"Processing {img_url}...")

        # Download The Movie Database image files for given image url
        img = requests.get(img_url)
        img.raise_for_status()
        img_file = img_url.split("/")[-1]
        img_path = f"../../carve/datasets/ml-25m/images/{img_file}"

        # Save movie poster as local image
        with open(img_path, "wb") as f:
            f.write(img.content)

        print(f"Success {img_url}.")
        return img_path
    except Exception as e:
        print(f"Failure {img_url}. {e}")
        return None

In [None]:
# Drop missing values
print(df.isnull().sum())
df = df.dropna()

In [None]:
print(df.isnull().sum())

In [None]:
# For each The Movie Database "posterUrl" in Movielens Dataset, get movie poster art image file
df["posterPath"] = df["posterUrl"].map(lambda x: downloadMovieImage(x))

In [None]:
# Drop missing values
print(df.isnull().sum())
df = df.dropna()

In [None]:
print(df.isnull().sum())

In [None]:
#Save checkpoint for new dataset
df.to_csv(checkpoint_file)

## Visual Enrichment - Azure Computer Vision

In [None]:
def getComputerVision(img):
    """
    Send the poster URL to Azure Computer Vision, save the JSON response
    next to the dataset, and return it. Returns None on failure.
    """
    headers = {
        "Content-Type": "application/json",
        "Ocp-Apim-Subscription-Key": "<API-KEY>",
    }

    params = {
        "visualFeatures": "Color,Tags,Categories,Objects,Description",
        "language": "en",
        "model-version": "latest",
        "details": "Celebrities",
    }

    data = {
        "url": img
    }

    try:
        print(f"Processing {img}...")
        url = "https://eastus.api.cognitive.microsoft.com/vision/v3.2/analyze"
        response = requests.post(url, headers=headers, params=params, json=data)
        # Raise on HTTP errors so an error payload is not saved as a result
        response.raise_for_status()

        # Same folder that the feature-extraction loop later reads from
        json_path = f"../../carve/datasets/ml-25m/jsons/{img.split('/')[-1].split('.')[0]}.json"
        json_response = response.json()
        with open(json_path, "w") as f:
            json.dump(json_response, f)

        print(f"Success {img}.")
        return json_response
    except Exception as e:
        print(f"Failure {img}. {e}")
        return None

In [None]:
df["posterJson"] = df["posterUrl"].map(lambda x: getComputerVision(x))

In [None]:
print(df.isnull().sum())
df = df.dropna()

In [None]:
#Save checkpoint for new dataset
df.to_csv(checkpoint_file)

In [None]:
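def getCategories(data):
    """
    Return up to three distinct category names from a Computer Vision response.
    """
    try:
        # dict.fromkeys removes duplicates while keeping the response order
        return list(dict.fromkeys(x["name"] for x in data["categories"]))[:3]
    except (KeyError, TypeError):
        return None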

In [None]:
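def getColor(data):
    """
    Return up to three distinct colors: foreground, background, then the
    remaining dominant colors of a Computer Vision response.
    """
    try:
        results = []

        x = data["color"]
        results.append(x["dominantColorForeground"])
        results.append(x["dominantColorBackground"])

        for dominant in x["dominantColors"]:
            results.append(dominant)

        # dict.fromkeys removes duplicates while keeping the order above
        return list(dict.fromkeys(results))[:3]
    except (KeyError, TypeError):
        return None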

In [None]:
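def getTags(data):
    """
    Return up to three distinct tag names from a Computer Vision response.
    """
    try:
        # dict.fromkeys removes duplicates while keeping the response order
        return list(dict.fromkeys(x["name"] for x in data["tags"]))[:3]
    except (KeyError, TypeError):
        return None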

In [None]:
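def getDescription(data):
    """
    Return up to three distinct description tags from a Computer Vision response.
    """
    try:
        # dict.fromkeys removes duplicates while keeping the response order
        return list(dict.fromkeys(data["description"]["tags"]))[:3]
    except (KeyError, TypeError):
        return None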

In [None]:
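def getCelebrities(data):
    """
    Return the distinct celebrity names found in the category details of a
    Computer Vision response.
    """
    try:
        results = []

        for x in data["categories"]:
            # Not every category carries a "detail" block
            for y in x.get("detail", {}).get("celebrities", []):
                # Each entry is a dict (unhashable), so keep only its "name"
                results.append(y["name"])

        return list(dict.fromkeys(results))
    except (KeyError, TypeError):
        return None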

In [None]:
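# Create empty object columns aligned with the current index
for col in ["categories", "color", "tags", "description"]:
    df[col] = pd.Series(index=df.index, dtype="object")
#df["celebrities"] = pd.Series(index=df.index, dtype="object")

# Iterate over index labels: after dropna() they are no longer 0..n-1
for idx, poster_path in df["posterPath"].items():
    try:
        json_path = f"../../carve/datasets/ml-25m/jsons/{poster_path.split('/')[-1].split('.')[0]}.json"

        with open(json_path) as f:
            data = json.load(f)

        # .at writes a single cell and avoids chained assignment
        df.at[idx, "categories"] = getCategories(data)
        df.at[idx, "color"] = getColor(data)
        df.at[idx, "tags"] = getTags(data)
        df.at[idx, "description"] = getDescription(data)
        #df.at[idx, "celebrities"] = getCelebrities(data)
    except Exception as e:
        # Report the failure instead of skipping the row silently
        print(f"Failure {poster_path}. {e}")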

In [None]:
df.head()

In [None]:
# Drop missing values
print(df.isnull().sum())
df = df.dropna()

In [None]:
#Save checkpoint for new dataset
df.to_csv(checkpoint_file)

In [None]:
# Join "poster-urls" and "ratings" datasets, on Movielens's "movieId"
df = df.set_index("movieId").join(df_ratings.set_index("movieId"))
print(f"Dataframe: {df.shape}")

In [None]:
df.reset_index(inplace=True)
df.head()

In [None]:
# Drop missing values
print(df.isnull().sum())
df = df.dropna()

In [None]:
df = df.drop(["posterUrl", "posterPath"], axis=1)

In [None]:
#Save checkpoint for new dataset
df.to_csv(output_file)

In [None]:
df.head()

In [None]:
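def transformGenre(data):
    """
    Convert a pipe-separated genre string, e.g. "Action|Comedy", into "[Action, Comedy]".
    """
    try:
        return f"[{', '.join(data.split('|'))}]"
    except AttributeError:
        # Non-string input, such as a missing value
        return None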

In [None]:
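def splitCategorical(data, num):
    """
    Return the item at position num of a list-like value, or None if absent.
    Accepts a list or its string form, e.g. "['a', 'b']" or "[a, b]".
    """
    try:
        if isinstance(data, (list, tuple)):
            data_list = [str(x) for x in data]
        else:
            data_list = data.replace("[", "").replace("]", "").replace("'", "").split(",")
        # strip() removes the space left after each comma
        return data_list[num].strip()
    except (AttributeError, IndexError):
        return None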

In [None]:
#Transform genres column to comma separated string instead of pipe
df["genres"] = df["genres"].map(lambda x: transformGenre(x))

In [None]:
categorical_cols = ["genres", "categories", "color", "tags", "description"]
categorical_length = 5

for cat in categorical_cols:
    #Create empty columns for splitting 
    for i in range(categorical_length):
        new_col = f"{cat}_{i}"
        
        #Split categorical Series columns
        df[new_col] = df[cat].map(lambda x: splitCategorical(x, i))

df.head()

In [None]:
df.columns

In [None]:
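#Drop empty columns
df.dropna(how='all', axis=1, inplace=True)
# errors="ignore" skips columns that are absent, e.g. "Unnamed: 0" only exists
# when the dataframe was reloaded from a saved CSV
df = df.drop(["Unnamed: 0", "title", "genres", "categories", "color", "tags", "description", "imdbId", "tmdbId"], axis=1, errors="ignore")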

In [None]:
df.columns

In [None]:
df.head()

In [None]:
df.to_csv(output_file)