<i>Licensed under the MIT License.</i>

# Data Preperation | Movielens 25M Dataset with Visual Enrichment

## Movielens 25M Dataset
This dataset leverages the [Movielens 25M Dataset](https://grouplens.org/datasets/movielens/25m/) describes 5-star rating and free-text tagging activity from MovieLens, a movie recommendation service. It contains 25000095 ratings and 1093360 tag applications across 62423 movies. These data were created by 162541 users between January 09, 1995 and November 21, 2019. This dataset was generated on November 21, 2019.

Users were selected at random for inclusion. All selected users had rated at least 20 movies. No demographic information is included. Each user is represented by an id, and no other information is provided.

The data are contained in the files genome-scores.csv, genome-tags.csv, links.csv, movies.csv, ratings.csv and tags.csv. More details about the contents and use of all these files follows.

## Visual Enrichment
The "tmdbId" column from the Movielens Dataset is utilized via the [The Movie Database (TMDb) API](https://www.themoviedb.org/documentation/api), in which the cooresponding movie poster url and image file are stored for later use in the enrichment process.

Once movie posters for each movie are retrived, each movie poster image is sent to [Azure Computer Vision](https://docs.microsoft.com/en-us/azure/cognitive-services/computer-vision/overview) for analysis and metadata generation. The resulting features are then used to finally enrich the Movielens Dataset:
* [Categories](https://docs.microsoft.com/en-us/azure/cognitive-services/computer-vision/concept-categorizing-images)
* [Color](https://docs.microsoft.com/en-us/azure/cognitive-services/computer-vision/concept-detecting-color-schemes)
* [Tags](https://docs.microsoft.com/en-us/azure/cognitive-services/computer-vision/concept-tagging-images)
* [Description](https://docs.microsoft.com/en-us/azure/cognitive-services/computer-vision/concept-describing-images)
* [Celebrities](https://docs.microsoft.com/en-us/azure/cognitive-services/computer-vision/concept-detecting-domain-content)

## Data Schema
### User Ids
MovieLens users were selected at random for inclusion. Their ids have been anonymized. User ids are consistent between ratings.csv and tags.csv (i.e., the same id refers to the same user across the two files).

### Movie Ids
Only movies with at least one rating or tag are included in the dataset. These movie ids are consistent with those used on the MovieLens web site (e.g., id 1 corresponds to the URL https://movielens.org/movies/1). Movie ids are consistent between ratings.csv, tags.csv, movies.csv, and links.csv (i.e., the same id refers to the same movie across these four data files).

### Ratings Data File Structure (ratings.csv)
All ratings are contained in the file ratings.csv. Each line of this file after the header row represents one rating of one movie by one user, and has the following format:
> userId,movieId,rating,timestamp

The lines within this file are ordered first by userId, then, within user, by movieId.
Ratings are made on a 5-star scale, with half-star increments (0.5 stars - 5.0 stars).
Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.

### Tags Data File Structure (tags.csv)
All tags are contained in the file tags.csv. Each line of this file after the header row represents one tag applied to one movie by one user, and has the following format:
> userId,movieId,tag,timestamp

The lines within this file are ordered first by userId, then, within user, by movieId.
Tags are user-generated metadata about movies. Each tag is typically a single word or short phrase. The meaning, value, and purpose of a particular tag is determined by each user.
Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.

### Movies Data File Structure (movies.csv)
Movie information is contained in the file movies.csv. Each line of this file after the header row represents one movie, and has the following format:
> movieId,title,genres

Movie titles are entered manually or imported from https://www.themoviedb.org/, and include the year of release in parentheses. Errors and inconsistencies may exist in these titles.

Genres are a pipe-separated list, and are selected from the following:
* Action
* Adventure
* Animation
* Children's
* Comedy
* Crime
* Documentary
* Drama
* Fantasy
* Film-Noir
* Horror
* Musical
* Mystery
* Romance
* Sci-Fi
* Thriller
* War
* Western
* (no genres listed)*

### Links Data File Structure (links.csv)
Identifiers that can be used to link to other sources of movie data are contained in the file links.csv. Each line of this file after the header row represents one movie, and has the following format:
> movieId,imdbId,tmdbId

movieId is an identifier for movies used by https://movielens.org. E.g., the movie Toy Story has the link https://movielens.org/movies/1.
imdbId is an identifier for movies used by http://www.imdb.com. E.g., the movie Toy Story has the link http://www.imdb.com/title/tt0114709/.
tmdbId is an identifier for movies used by https://www.themoviedb.org. E.g., the movie Toy Story has the link https://www.themoviedb.org/movie/862.
Use of the resources listed above is subject to the terms of each provider.

### Categories
In addition to tags and a description, Computer Vision returns the taxonomy-based categories detected in an image. Unlike tags, categories are organized in a parent/child hereditary hierarchy, and there are fewer of them (86, as opposed to thousands of tags). All category names are in English. Categorization can be done by itself or alongside the newer tags model.

Computer vision can categorize an image broadly or specifically, using the list of 86 categories in the following diagram. For the full taxonomy in text format, see [Category Taxonomy](https://docs.microsoft.com/en-us/azure/cognitive-services/computer-vision/category-taxonomy).

### Color
Computer Vision analyzes the colors in an image to provide three different attributes: the dominant foreground color, the dominant background color, and the larger set of dominant colors in the image. The set of possible returned colors is: black, blue, brown, gray, green, orange, pink, purple, red, teal, white, and yellow.

Computer Vision also extracts an accent color, which represents the most vibrant color in the image, based on a combination of the dominant color set and saturation. The accent color is returned as a hexadecimal HTML color code (for example, #00CC00).

### Tags
Computer Vision can return content tags for thousands of recognizable objects, living beings, scenery, and actions that appear in images. Tags are not organized as a taxonomy and do not have inheritance hierarchies. A collection of content tags forms the foundation for an image description displayed as human readable language formatted in complete sentences. When tags are ambiguous or not common knowledge, the API response provides hints to clarify the meaning of the tag in context of a known setting.

After you upload an image or specify an image URL, the Computer Vision algorithm can output tags based on the objects, living beings, and actions identified in the image. Tagging is not limited to the main subject, such as a person in the foreground, but also includes the setting (indoor or outdoor), furniture, tools, plants, animals, accessories, gadgets, and so on.

### Description
Computer Vision can analyze an image and generate a human-readable phrase that describes its contents. The algorithm returns several descriptions based on different visual features, and each description is given a confidence score. The final output is a list of descriptions ordered from highest to lowest confidence.

At this time, English is the only supported language for image description.

### Celebrities
In addition to tagging and high-level categorization, Computer Vision also supports further domain-specific analysis using models that have been trained on specialized data.

There are two ways to use the domain-specific models: by themselves (scoped analysis) or as an enhancement to the categorization feature.

## Importing Dataset | Movielens 25M

### Prerequisites
1) [Download raw dataset of Movielens 25M](https://files.grouplens.org/datasets/movielens/ml-25m.zip)

2) Unzip files

3) Copy files into "carve/datasets/ml-25m/"

In [265]:
import sys
import pandas as pd
import numpy as np
import datetime
import math
import requests
import json
import dask.dataframe as dd

print("System version: {}".format(sys.version))

System version: 3.7.13 (default, Mar 29 2022, 02:18:16) 
[GCC 7.5.0]


In [266]:
# Dataset sample size - change to 0 for full dataset
SAMPLE_SIZE = 100
TMDB_API_KEY = "<API-KEY>"
AZURE_CV_API_KEY = "<API-KEY>"

In [267]:
df_genome_scores = pd.read_csv("../../carve/datasets/ml-25m/genome-scores.csv")
df_genome_tags = pd.read_csv("../../carve/datasets/ml-25m/genome-tags.csv")
df_links = pd.read_csv("../../carve/datasets/ml-25m/links.csv")
df_movies = pd.read_csv("../../carve/datasets/ml-25m/movies.csv")
df_ratings = pd.read_csv("../../carve/datasets/ml-25m/ratings.csv")
df_tags = pd.read_csv("../../carve/datasets/ml-25m/tags.csv")

In [268]:
df_genome_scores.head()

Unnamed: 0,movieId,tagId,relevance
0,1,1,0.02875
1,1,2,0.02375
2,1,3,0.0625
3,1,4,0.07575
4,1,5,0.14075


In [269]:
df_genome_tags.head()

Unnamed: 0,tagId,tag
0,1,007
1,2,007 (series)
2,3,18th century
3,4,1920s
4,5,1930s


In [270]:
df_links.head()

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


In [271]:
df_movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [272]:
df_ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,296,5.0,1147880044
1,1,306,3.5,1147868817
2,1,307,5.0,1147868828
3,1,665,5.0,1147878820
4,1,899,3.5,1147868510


In [273]:
df_tags.head()

Unnamed: 0,userId,movieId,tag,timestamp
0,3,260,classic,1439472355
1,3,260,sci-fi,1439472256
2,4,1732,dark comedy,1573943598
3,4,1732,great dialogue,1573943604
4,4,7569,so bad it's good,1573943455


In [274]:
print(f"Movies: {df_movies.shape}")
print(f"Links: {df_links.shape}")

Movies: (62423, 3)
Links: (62423, 3)


In [275]:
if SAMPLE_SIZE != 0:
    df_movies = df_movies.sample(n=SAMPLE_SIZE, random_state=0)
    df_movies.reset_index(inplace=True)

    checkpoint_file = f"../../carve/datasets/ml-25m/poster-urls-{SAMPLE_SIZE}.csv"
    output_file = f"../../carve/datasets/ml-25m/carve-movielens-{SAMPLE_SIZE}.csv"
    print(f"Movies: {df_movies.shape}")
else:
    checkpoint_file = f"../../carve/datasets/ml-25m/poster-urls.csv"
    output_file = f"../../carve/datasets/ml-25m/carve-movielens.csv"

Movies: (100, 4)


In [276]:
# Join "movies" and "links" datasets, on Movielens's "movieId"
df = df_movies.set_index("movieId").join(df_links.set_index("movieId"))
print(f"Dataframe: {df.shape}")

Dataframe: (100, 5)


In [277]:
df.reset_index(inplace=True)
df.head()

Unnamed: 0,movieId,index,title,genres,imdbId,tmdbId
0,3636,3537,Those Who Love Me Can Take the Train (Ceux qui...,Drama,118834,31353.0
1,161504,41356,Wyrd Sisters (1997),Animation|Fantasy,159931,24057.0
2,79333,14978,Watch Out for the Automobile (Beregis avtomobi...,Comedy|Crime|Romance,60161,39768.0
3,134611,29741,Spring (1969),Comedy|Drama,64542,46809.0
4,123961,25538,Professional Sweetheart (1933),Comedy|Romance,24476,223497.0


In [278]:
# Drop missing values
print(df.isnull().sum())
df = df.dropna()

movieId    0
index      0
title      0
genres     0
imdbId     0
tmdbId     0
dtype: int64


In [279]:
print(df.isnull().sum())

movieId    0
index      0
title      0
genres     0
imdbId     0
tmdbId     0
dtype: int64


In [280]:
# Type correct "tmdbId" as Integer
df["tmdbId"] = df["tmdbId"].astype("int")

## Getting Movie Image Url | The Movie Database API

Leverages [The Movie Database API](https://www.themoviedb.org/documentation/api). 

In [281]:
def getMovieImageUrl(tmdb_id):
    """
    TODO: Function description
    """
    try:
        print(f"Processing {tmdb_id}...")
        tmdb_api_key = TMDB_API_KEY
        image_prefix = "https://image.tmdb.org/t/p/"

        # Get The Movie Database poster art for given Movielens "tmdbId"
        response = requests.get(f"https://api.themoviedb.org/3/movie/{tmdb_id}/images?api_key={tmdb_api_key}")
        images = response.json()

        # Gets full size images of movie poster art
        poster = f"{image_prefix}original{images['posters'][0]['file_path']}"

        print(f"Success {tmdb_id}.")
        return poster
    except:
        print(f"Failure {tmdb_id}.")
        return None


In [282]:
# For each The Movie Database "tmdbId" in Movielens Dataset, get movie poster art url
df["posterUrl"] = df["tmdbId"].map(lambda x: getMovieImageUrl(x))

Processing 31353...
Failure 31353.
Processing 24057...
Failure 24057.
Processing 39768...
Success 39768.
Processing 46809...
Success 46809.
Processing 223497...
Success 223497.
Processing 74415...
Success 74415.
Processing 48833...
Success 48833.
Processing 347881...
Success 347881.
Processing 852...
Success 852.
Processing 14071...
Success 14071.
Processing 148083...
Success 148083.
Processing 202764...
Success 202764.
Processing 101514...
Success 101514.
Processing 33670...
Success 33670.
Processing 216176...
Success 216176.
Processing 59981...
Success 59981.
Processing 1246...
Success 1246.
Processing 376534...
Success 376534.
Processing 1896...
Success 1896.
Processing 271033...
Success 271033.
Processing 32497...
Success 32497.
Processing 83955...
Success 83955.
Processing 86507...
Success 86507.
Processing 45322...
Success 45322.
Processing 284497...
Success 284497.
Processing 49809...
Success 49809.
Processing 10032...
Success 10032.
Processing 67342...
Success 67342.
Processing

In [283]:
df.head()

Unnamed: 0,movieId,index,title,genres,imdbId,tmdbId,posterUrl
0,3636,3537,Those Who Love Me Can Take the Train (Ceux qui...,Drama,118834,31353,
1,161504,41356,Wyrd Sisters (1997),Animation|Fantasy,159931,24057,
2,79333,14978,Watch Out for the Automobile (Beregis avtomobi...,Comedy|Crime|Romance,60161,39768,https://image.tmdb.org/t/p/original/pxf7dGhwpw...
3,134611,29741,Spring (1969),Comedy|Drama,64542,46809,https://image.tmdb.org/t/p/original/pR4WgogBEj...
4,123961,25538,Professional Sweetheart (1933),Comedy|Romance,24476,223497,https://image.tmdb.org/t/p/original/tEqInMA9cf...


In [284]:
#Save checkpoint for new dataset
df.to_csv(checkpoint_file)

## Downloading Movie Poster Art | The Movie Database API

In [289]:
def downloadMovieImage(img_url):
    """
    TODO: Function description
    """
    try:
        print(f"Processing {img_url}...")

        # Download The Movie Database image files for given image url
        img = requests.get(img_url)
        img_file = img_url.split("/")[-1]
        img_path = f"../../carve/datasets/ml-25m/images/{img_file}"

        # Save movie poster as local image
        with open(img_path, "wb") as f:
            f.write(img.content)

        print(f"Success {img_url}.")
        return img_path
    except:
        print(f"Failure {img_url}.")
        return None

In [290]:
# Drop missing values
print(df.isnull().sum())
df = df.dropna()

movieId      0
index        0
title        0
genres       0
imdbId       0
tmdbId       0
posterUrl    0
dtype: int64


In [291]:
print(df.isnull().sum())

movieId      0
index        0
title        0
genres       0
imdbId       0
tmdbId       0
posterUrl    0
dtype: int64


In [292]:
# For each The Movie Database "posterUrl" in Movielens Dataset, get movie poster art image file
"""
Uncomment the below code to download images locally. Currently, image files are not
used in the data enrichment process but may be desired for other use-cases.
"""
df["posterPath"] = df["posterUrl"].map(lambda x: downloadMovieImage(x))

Processing https://image.tmdb.org/t/p/original/pxf7dGhwpwxe8b1itqxkkjFQac8.jpg...
Failure https://image.tmdb.org/t/p/original/pxf7dGhwpwxe8b1itqxkkjFQac8.jpg.
Processing https://image.tmdb.org/t/p/original/pR4WgogBEjvlxIrjFToiInWEXhQ.jpg...
Success https://image.tmdb.org/t/p/original/pR4WgogBEjvlxIrjFToiInWEXhQ.jpg.
Processing https://image.tmdb.org/t/p/original/tEqInMA9cf9oug14wtPOk1q3eYn.jpg...
Success https://image.tmdb.org/t/p/original/tEqInMA9cf9oug14wtPOk1q3eYn.jpg.
Processing https://image.tmdb.org/t/p/original/nzEesgGHyqXHqqW8HeSHW4r0gSY.jpg...
Success https://image.tmdb.org/t/p/original/nzEesgGHyqXHqqW8HeSHW4r0gSY.jpg.
Processing https://image.tmdb.org/t/p/original/19Hpp2GpooohO0DorCR9nGwJ6L7.jpg...
Success https://image.tmdb.org/t/p/original/19Hpp2GpooohO0DorCR9nGwJ6L7.jpg.
Processing https://image.tmdb.org/t/p/original/uHX6ViDnUzvZhcuzROjSRWbypv5.jpg...
Success https://image.tmdb.org/t/p/original/uHX6ViDnUzvZhcuzROjSRWbypv5.jpg.
Processing https://image.tmdb.org/t/p/original

In [293]:
# Drop missing values
print(df.isnull().sum())
df = df.dropna()

movieId       0
index         0
title         0
genres        0
imdbId        0
tmdbId        0
posterUrl     0
posterPath    2
dtype: int64


In [294]:
print(df.isnull().sum())

movieId       0
index         0
title         0
genres        0
imdbId        0
tmdbId        0
posterUrl     0
posterPath    0
dtype: int64


In [295]:
#Save checkpoint for new dataset
df.to_csv(checkpoint_file)

## Visual Enrichment | Azure Computer Vision

[Azure Computer Vision](https://docs.microsoft.com/en-us/azure/cognitive-services/computer-vision/overview)

In [296]:
def getComputerVision(img):
    """
    TODO: Function description
    """
    headers = {
        "Content-Type": "application/json",
        "Ocp-Apim-Subscription-Key": AZURE_CV_API_KEY,
    }

    params = {
        "visualFeatures": "Color,Tags,Categories,Objects,Description",
        "language": "en",
        "model-version": "latest",
        "details": "Celebrities",
    }

    data = {
        "url": img
    }
    
    try:
        print(f"Processing {img}...")
        url = "https://eastus.api.cognitive.microsoft.com/vision/v3.2/analyze"
        response = requests.post(url, headers=headers, params=params, json=data)
        
        json_path = f"../../carve/datasets/ml-25m/jsons/{img.split('/')[-1].split('.')[0]}.json"
        json_response = response.json()
        with open(json_path, "w") as f:
            json.dump(json_response, f)
        
        print(f"Success {img}.")
        return json_response
    except Exception as e:
        print(f"Failure {img}. {e}")
        return None

In [297]:
df["posterJson"] = df["posterUrl"].map(lambda x: getComputerVision(x))

Processing https://image.tmdb.org/t/p/original/pR4WgogBEjvlxIrjFToiInWEXhQ.jpg...
Failure https://image.tmdb.org/t/p/original/pR4WgogBEjvlxIrjFToiInWEXhQ.jpg. HTTPSConnectionPool(host='eastus.api.cognitive.microsoft.com', port=443): Max retries exceeded with url: /vision/v3.2/analyze?visualFeatures=Color%2CTags%2CCategories%2CObjects%2CDescription&language=en&model-version=latest&details=Celebrities (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fa116b41350>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution'))
Processing https://image.tmdb.org/t/p/original/tEqInMA9cf9oug14wtPOk1q3eYn.jpg...
Success https://image.tmdb.org/t/p/original/tEqInMA9cf9oug14wtPOk1q3eYn.jpg.
Processing https://image.tmdb.org/t/p/original/nzEesgGHyqXHqqW8HeSHW4r0gSY.jpg...
Success https://image.tmdb.org/t/p/original/nzEesgGHyqXHqqW8HeSHW4r0gSY.jpg.
Processing https://image.tmdb.org/t/p/original/19Hpp2GpooohO0DorCR9nGwJ6L7.jpg...
Success https:

In [298]:
print(df.isnull().sum())
df = df.dropna()

movieId        0
index          0
title          0
genres         0
imdbId         0
tmdbId         0
posterUrl      0
posterPath     0
posterJson    12
dtype: int64


In [299]:
#Save checkpoint for new dataset
df.to_csv(checkpoint_file)

## Enhance Existing Dataset with Visual Enrichment JSON Data

In [300]:
def getCategories(data):
    """
    """
    try:        
        return list(set([x["name"] for x in data["categories"][:]]))[:3]
    except:
        return None

In [301]:
def getColor(data):
    """
    """
    try:
        results = []

        x = data["color"]
        results.append(x["dominantColorForeground"])
        results.append(x["dominantColorBackground"])
        
        for dominant in x["dominantColors"]:
            results.append(dominant)

        return list(set(results))[:3]
    except:
        return None

In [302]:
def getTags(data):
    """
    """
    try:        
        return list(set([x["name"] for x in data["tags"][:]]))[:3]
    except:
        return None

In [303]:
def getDescription(data):
    """
    """
    try:
        return list(set([x for x in data["description"]["tags"]]))[:3]
    except:
        return None

In [304]:
def getCelebrities(data):
    """
    """
    try:
        results = []

        for x in data["categories"]:
            for y in x["detail"]["celebrities"]:
                results.append(y)

        return list(set(results))
    except:
        return None

In [305]:
df["categories"]= pd.Series()
df["color"]= pd.Series()
df["tags"]= pd.Series()
df["description"]= pd.Series()
#df["celebrities"]= ""

for i in range(len(df["posterPath"])):
    try:
        poster_path = df["posterPath"][i]
        json_path = f"../../carve/datasets/ml-25m/jsons/{poster_path.split('/')[-1].split('.')[0]}.json"

        with open(json_path) as f:
            data = json.load(f)

        df["categories"][i] = getCategories(data)
        df["color"][i] = getColor(data)
        df["tags"][i] = getTags(data)
        df["description"][i] = getDescription(data)
        #df["celebrities"][i] = getCelebrities(data)
    except Exception as e:
        pass

  """Entry point for launching an IPython kernel.
  
  This is separate from the ipykernel package so we can avoid doing imports until
  after removing the cwd from sys.path.
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  from ipykernel import kernelapp as app
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_block(indexer, value, name)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https:/

In [306]:
df.head()

Unnamed: 0,movieId,index,title,genres,imdbId,tmdbId,posterUrl,posterPath,posterJson,categories,color,tags,description
4,123961,25538,Professional Sweetheart (1933),Comedy|Romance,24476,223497,https://image.tmdb.org/t/p/original/tEqInMA9cf...,../../carve/datasets/ml-25m/images/tEqInMA9cf9...,"{'categories': [{'name': 'others_', 'score': 0...",[others_],"[Yellow, Black, Brown]","[human face, poster, woman]",[calendar]
5,172375,46405,A Heartbeat Away (2011),Comedy,1612793,74415,https://image.tmdb.org/t/p/original/nzEesgGHyq...,../../carve/datasets/ml-25m/images/nzEesgGHyqX...,"{'categories': [{'name': 'abstract_nonphoto', ...","[others_, outdoor_, text_sign]","[Red, Black]","[poster, text, graphics]","[text, book]"
6,82380,15661,Night Catches Us (2010),Drama|Romance,775543,48833,https://image.tmdb.org/t/p/original/19Hpp2Gpoo...,../../carve/datasets/ml-25m/images/19Hpp2Gpooo...,"{'categories': [{'name': 'others_', 'score': 0...","[others_, text_sign]","[Red, Black]","[human face, poster, text]","[text, book]"
7,155595,38755,The Confirmation (2016),Comedy,4210080,347881,https://image.tmdb.org/t/p/original/uHX6ViDnUz...,../../carve/datasets/ml-25m/images/uHX6ViDnUzv...,"{'categories': [{'name': 'people_group', 'scor...",[people_group],"[White, Grey]","[human face, man, clothing]","[different, text, person]"
8,176499,48363,It Happened in Broad Daylight (1958),Crime|Thriller,51588,852,https://image.tmdb.org/t/p/original/dQppHkGxq3...,../../carve/datasets/ml-25m/images/dQppHkGxq3y...,"{'categories': [{'name': 'others_', 'score': 0...","[others_, outdoor_]","[Black, Grey]","[human face, mammal, cowboy]","[indoor, text]"


In [307]:
# Drop missing values
print(df.isnull().sum())
df = df.dropna()

movieId         0
index           0
title           0
genres          0
imdbId          0
tmdbId          0
posterUrl       0
posterPath      0
posterJson      0
categories     14
color          14
tags           14
description    14
dtype: int64


In [308]:
#Save checkpoint for new dataset
df.to_csv(checkpoint_file)

## Finalize Dataset 

By Joining Visual Enriched Dataset with Movielens Ratings Dataset

In [309]:
# Join "poster-urls" and "ratings" datasets, on Movielens's "movieId"
df = df.set_index("movieId").join(df_ratings.set_index("movieId"))
print(f"Dataframe: {df.shape}")

Dataframe: (34919, 15)


In [310]:
df.reset_index(inplace=True)
df.head()

Unnamed: 0,movieId,index,title,genres,imdbId,tmdbId,posterUrl,posterPath,posterJson,categories,color,tags,description,userId,rating,timestamp
0,508,503,Philadelphia (1993),Drama,107818,9800,https://image.tmdb.org/t/p/original/tFe5Yoo5zT...,../../carve/datasets/ml-25m/images/tFe5Yoo5zT4...,"{'categories': [{'name': 'others_', 'score': 0...","[others_, outdoor_]","[Black, White]","[human face, man, poster]","[silhouette, abstract, text]",12.0,3.5,1181653000.0
1,508,503,Philadelphia (1993),Drama,107818,9800,https://image.tmdb.org/t/p/original/tFe5Yoo5zT...,../../carve/datasets/ml-25m/images/tFe5Yoo5zT4...,"{'categories': [{'name': 'others_', 'score': 0...","[others_, outdoor_]","[Black, White]","[human face, man, poster]","[silhouette, abstract, text]",23.0,5.0,942965600.0
2,508,503,Philadelphia (1993),Drama,107818,9800,https://image.tmdb.org/t/p/original/tFe5Yoo5zT...,../../carve/datasets/ml-25m/images/tFe5Yoo5zT4...,"{'categories': [{'name': 'others_', 'score': 0...","[others_, outdoor_]","[Black, White]","[human face, man, poster]","[silhouette, abstract, text]",25.0,3.0,836217500.0
3,508,503,Philadelphia (1993),Drama,107818,9800,https://image.tmdb.org/t/p/original/tFe5Yoo5zT...,../../carve/datasets/ml-25m/images/tFe5Yoo5zT4...,"{'categories': [{'name': 'others_', 'score': 0...","[others_, outdoor_]","[Black, White]","[human face, man, poster]","[silhouette, abstract, text]",35.0,4.0,1511296000.0
4,508,503,Philadelphia (1993),Drama,107818,9800,https://image.tmdb.org/t/p/original/tFe5Yoo5zT...,../../carve/datasets/ml-25m/images/tFe5Yoo5zT4...,"{'categories': [{'name': 'others_', 'score': 0...","[others_, outdoor_]","[Black, White]","[human face, man, poster]","[silhouette, abstract, text]",36.0,3.0,834413500.0


In [311]:
# Drop missing values
print(df.isnull().sum())
df = df.dropna()

movieId        0
index          0
title          0
genres         0
imdbId         0
tmdbId         0
posterUrl      0
posterPath     0
posterJson     0
categories     0
color          0
tags           0
description    0
userId         4
rating         4
timestamp      4
dtype: int64


In [312]:
df = df.drop(["posterUrl", "posterPath"], axis=1)

In [313]:
#Save checkpoint for new dataset
df.to_csv(output_file)

In [314]:
df.head()

Unnamed: 0,movieId,index,title,genres,imdbId,tmdbId,posterJson,categories,color,tags,description,userId,rating,timestamp
0,508,503,Philadelphia (1993),Drama,107818,9800,"{'categories': [{'name': 'others_', 'score': 0...","[others_, outdoor_]","[Black, White]","[human face, man, poster]","[silhouette, abstract, text]",12.0,3.5,1181653000.0
1,508,503,Philadelphia (1993),Drama,107818,9800,"{'categories': [{'name': 'others_', 'score': 0...","[others_, outdoor_]","[Black, White]","[human face, man, poster]","[silhouette, abstract, text]",23.0,5.0,942965600.0
2,508,503,Philadelphia (1993),Drama,107818,9800,"{'categories': [{'name': 'others_', 'score': 0...","[others_, outdoor_]","[Black, White]","[human face, man, poster]","[silhouette, abstract, text]",25.0,3.0,836217500.0
3,508,503,Philadelphia (1993),Drama,107818,9800,"{'categories': [{'name': 'others_', 'score': 0...","[others_, outdoor_]","[Black, White]","[human face, man, poster]","[silhouette, abstract, text]",35.0,4.0,1511296000.0
4,508,503,Philadelphia (1993),Drama,107818,9800,"{'categories': [{'name': 'others_', 'score': 0...","[others_, outdoor_]","[Black, White]","[human face, man, poster]","[silhouette, abstract, text]",36.0,3.0,834413500.0


In [315]:
def transformGenre(data):
    """
    """
    try:
        return f"[{', '.join(data.split('|'))}]"
    except:
        None

In [316]:
def splitCategorical(data, num):
    """
    """
    try:
        data = str(data)
        data_list = data.replace("[", "").replace("]", "").replace("'", "").split(",")
        return data_list[num]
    except:
        return None

In [317]:
#Transform genres column to comma seperated string instead of pipe
df["genres"] = df["genres"].map(lambda x: transformGenre(x))

In [318]:
print(splitCategorical("[human face, man, poster]", 3))

None


In [319]:
categorical_cols = ["genres", "categories", "color", "tags", "description"]
categorical_length = 5

for cat in categorical_cols:
    #Create empty columns for splitting 
    for i in range(categorical_length):
        new_col = f"{cat}_{i}"
        
        #Split categorical Series columns
        df[new_col] = df[cat].map(lambda x: splitCategorical(x, i))

df.head()

Unnamed: 0,movieId,index,title,genres,imdbId,tmdbId,posterJson,categories,color,tags,...,tags_0,tags_1,tags_2,tags_3,tags_4,description_0,description_1,description_2,description_3,description_4
0,508,503,Philadelphia (1993),[Drama],107818,9800,"{'categories': [{'name': 'others_', 'score': 0...","[others_, outdoor_]","[Black, White]","[human face, man, poster]",...,human face,man,poster,,,silhouette,abstract,text,,
1,508,503,Philadelphia (1993),[Drama],107818,9800,"{'categories': [{'name': 'others_', 'score': 0...","[others_, outdoor_]","[Black, White]","[human face, man, poster]",...,human face,man,poster,,,silhouette,abstract,text,,
2,508,503,Philadelphia (1993),[Drama],107818,9800,"{'categories': [{'name': 'others_', 'score': 0...","[others_, outdoor_]","[Black, White]","[human face, man, poster]",...,human face,man,poster,,,silhouette,abstract,text,,
3,508,503,Philadelphia (1993),[Drama],107818,9800,"{'categories': [{'name': 'others_', 'score': 0...","[others_, outdoor_]","[Black, White]","[human face, man, poster]",...,human face,man,poster,,,silhouette,abstract,text,,
4,508,503,Philadelphia (1993),[Drama],107818,9800,"{'categories': [{'name': 'others_', 'score': 0...","[others_, outdoor_]","[Black, White]","[human face, man, poster]",...,human face,man,poster,,,silhouette,abstract,text,,


In [320]:
df.columns

Index(['movieId', 'index', 'title', 'genres', 'imdbId', 'tmdbId', 'posterJson',
       'categories', 'color', 'tags', 'description', 'userId', 'rating',
       'timestamp', 'genres_0', 'genres_1', 'genres_2', 'genres_3', 'genres_4',
       'categories_0', 'categories_1', 'categories_2', 'categories_3',
       'categories_4', 'color_0', 'color_1', 'color_2', 'color_3', 'color_4',
       'tags_0', 'tags_1', 'tags_2', 'tags_3', 'tags_4', 'description_0',
       'description_1', 'description_2', 'description_3', 'description_4'],
      dtype='object')

In [321]:
#Drop empty columns
df.dropna(how='all', axis=1, inplace=True)
df = df.drop(["title", "genres", "categories", "color", "tags", "description", "imdbId", "tmdbId", "posterJson", "index"], axis=1)

In [322]:
df.columns

Index(['movieId', 'index', 'posterJson', 'userId', 'rating', 'timestamp',
       'genres_0', 'genres_1', 'genres_2', 'genres_3', 'categories_0',
       'categories_1', 'categories_2', 'color_0', 'color_1', 'color_2',
       'tags_0', 'tags_1', 'tags_2', 'description_0', 'description_1',
       'description_2'],
      dtype='object')

In [329]:
df.head()

Unnamed: 0,movieId,userId,rating,timestamp,genres_0,genres_1,genres_2,genres_3,categories_0,categories_1,categories_2,color_0,color_1,color_2,tags_0,tags_1,tags_2,description_0,description_1,description_2
0,508,12.0,3.5,1181653000.0,Drama,,,,others_,outdoor_,,Black,White,,human face,man,poster,silhouette,abstract,text
1,508,23.0,5.0,942965600.0,Drama,,,,others_,outdoor_,,Black,White,,human face,man,poster,silhouette,abstract,text
2,508,25.0,3.0,836217500.0,Drama,,,,others_,outdoor_,,Black,White,,human face,man,poster,silhouette,abstract,text
3,508,35.0,4.0,1511296000.0,Drama,,,,others_,outdoor_,,Black,White,,human face,man,poster,silhouette,abstract,text
4,508,36.0,3.0,834413500.0,Drama,,,,others_,outdoor_,,Black,White,,human face,man,poster,silhouette,abstract,text


In [330]:
df.to_csv(output_file)