# Ethical Movie Recommendations


The deliverable for our project is an Altair-based user interface that provides
movie recommendations. Users are prompted to describe what kind of film they
want to see and, importantly, what they would prefer not to see. The latter can
be content they find triggering or inappropriate. We seek to make the search
experience more efficient and more comfortable for users.


For example, in our visualization section, we show a recommendation for which we
verified that ‘smoking’ is not a tag for that film. A user trying not to resume
a smoking habit may wish to avoid content that could trigger a craving. The
recommendation algorithm is popularity-based: after filtering for what users
wish to see and what they do not want to see, we return the movies with the
highest mean rating. The goal is to explore how recommendations can be more
sensitive to users' preferences and to demonstrate a need for these algorithms.
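
The filter-then-rank idea described above can be sketched on made-up data. This is only an illustration, not the notebook's implementation: the toy rows, the column name `tags`, and the helper `recommend` are assumptions made for the example.

```python
import pandas as pd

# Toy catalogue; titles, tags, and ratings are invented for illustration.
movies = pd.DataFrame(
    {
        "title": ["film a", "film b", "film c", "film d"],
        "tags": [
            ["friendship", "smoking"],
            ["friendship"],
            ["war"],
            ["friendship", "comedy"],
        ],
        "rating_mean": [4.5, 4.1, 4.8, 3.9],
    }
)


def recommend(df: pd.DataFrame, want: str, avoid: str, top_n: int = 3) -> pd.DataFrame:
    """Keep rows tagged with `want`, drop rows tagged with `avoid`, rank by mean rating."""
    has_want = df["tags"].apply(lambda tags: want in tags)
    has_avoid = df["tags"].apply(lambda tags: avoid in tags)
    kept = df[has_want & ~has_avoid]
    return kept.sort_values("rating_mean", ascending=False).head(top_n)


result = recommend(movies, want="friendship", avoid="smoking")
print(result["title"].tolist())
```

Note that the highest-rated "friendship" film is excluded here because it also carries the tag the user asked to avoid.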


# Imports


Here we create the path to the directory with our MovieLens, IMDB, and poster
movie data, which we can combine to get descriptive information about the films.


In [1]:
!python --version

Python 3.8.13


In [2]:
import pandas as pd

import base64, io
from collections import Counter
import warnings


from IPython.display import display
import ipywidgets as widgets
import urllib.request

import re
import string
import nltk
from nltk.stem import PorterStemmer

from PIL import Image
import altair as alt

nltk.download("wordnet")
warnings.simplefilter(action="ignore", category=FutureWarning)
warnings.filterwarnings("ignore")
alt.themes.enable("fivethirtyeight")

ModuleNotFoundError: No module named 'nltk'

# Data Ingestion


In [None]:
# MovieLens datasets
ml_movies = pd.read_csv("./data/ml_movies_100.csv")
ml_tags = pd.read_csv("./data/ml_tags_100.csv")
ml_ratings = pd.read_csv("./data/ml_ratings_100.csv")

# IMDB dataset
imdb_reviews = pd.read_json("./data/sample_100.json")

# posters dataset
images = pd.read_csv("./data/images_100.csv")


# Data Preprocessing

Define functions to be used to prepare text for filters and aid in
preprocessing.


In [None]:
def prep_string(text: str) -> str:
    """Lowercase the text, strip surrounding whitespace, and remove punctuation."""

    # strip leading/trailing whitespace and apply lowercasing
    text_stripped = str(text).lower().strip()

    # replace punctuation with the empty string
    text = re.sub("[%s]" % re.escape(string.punctuation), "", text_stripped)
    return text
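
To make the effect of this cleaning step concrete, here is a small standalone sketch. `normalize_tag` is a hypothetical copy of the same logic, and the sample tags are invented; it is meant only to show what the output looks like.

```python
import re
import string


def normalize_tag(text: str) -> str:
    """Lowercase, trim surrounding whitespace, and delete ASCII punctuation."""
    trimmed = str(text).lower().strip()
    return re.sub("[%s]" % re.escape(string.punctuation), "", trimmed)


examples = ["  Sci-Fi! ", "Based on a Book", "R-rated"]
cleaned = [normalize_tag(t) for t in examples]
print(cleaned)
```

Because punctuation is deleted rather than replaced with a space, hyphenated tags collapse into a single word (for instance "Sci-Fi" becomes "scifi"), which is worth keeping in mind when choosing filter words.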


In [None]:
def stem_words(text:str) -> str:
    """normalize text by shortening words to their root (i.e., removing commoner morphological and inflexional endings from words)"""
    stemmer = PorterStemmer()
    return " ".join([stemmer.stem(word) for word in text.split()])


## IMDB Data Preprocessing

Because we will use both title and year to join the IMDB and MovieLens data, we
need year and movie title to be separate columns. In the IMDB dataset, year and
title are in the same column, so we need to split them.


In [None]:
# separate the movie title and year from the movie column and create two separate columns for them
imdb_reviews["title"] = imdb_reviews["movie"].apply(
    lambda x: x.split("(")[0].strip().lower()
)
imdb_reviews["year"] = imdb_reviews["movie"].apply(
    lambda x: x.split("(")[-1].split(")")[0].strip().replace("–", "")
)
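
Splitting on the first "(" works for simple titles, but a title that itself contains parentheses would be cut short. As a hedged alternative, the sketch below anchors a regular expression on the trailing year group. The helper `split_title_year` and the sample strings are assumptions for illustration, and unlike the cell above it keeps only the first year of a range.

```python
import re
from typing import Optional, Tuple

# Matches "<title> (<4-digit year>...)" where the year group ends the string.
TITLE_YEAR = re.compile(r"^(.*?)\s*\((\d{4})[^)]*\)\s*$")


def split_title_year(movie: str) -> Tuple[str, Optional[str]]:
    """Return (lowercased title, first year), or (lowercased text, None) if no year is found."""
    match = TITLE_YEAR.match(movie)
    if match is None:
        return movie.strip().lower(), None
    return match.group(1).strip().lower(), match.group(2)


for sample in ["The Matrix (1999)", "Leon (The Professional) (1994)", "Untitled"]:
    print(split_title_year(sample))
```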


The IMDB data is at the review level, but we need it to be at the movie level,
so we will aggregate review information by movie.


In [None]:
# reshape the imdb reviews dataset to the movie level for easy joining with movie lens
# (named imdb_movies so the inspection cells below can reference it)
imdb_movies = (
    imdb_reviews.groupby(["title", "year"])
    .agg(
        {
            "rating": lambda x: list(x),
            "review_detail": lambda x: list(x),
            "review_summary": lambda x: list(x),
            "helpful": lambda x: list(x),
        }
    )
    .reset_index()
    .rename(
        columns={
            "title": "title",
            "year": "year",
            "rating": "imdb_ratings",
            "review_detail": "imdb_review_detail",
            "review_summary": "imdb_review_summary",
            "helpful": "imdb_helpful",
        }
    )
)
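
The review-to-movie aggregation can be seen in miniature below. The toy reviews are invented; the sketch only shows how grouping on title and year and aggregating with `list` collects each movie's values into one row.

```python
import pandas as pd

# Three invented reviews covering two movies.
reviews = pd.DataFrame(
    {
        "title": ["film a", "film a", "film b"],
        "year": ["2001", "2001", "2010"],
        "rating": [8, 6, 9],
    }
)

per_movie = (
    reviews.groupby(["title", "year"])
    .agg({"rating": list})  # one list of ratings per (title, year)
    .reset_index()
    .rename(columns={"rating": "imdb_ratings"})
)
print(per_movie)
```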


In [None]:
# uncomment to see the resulting dataframe
# imdb_movies.head(2)


In [None]:
# uncomment to reveal an example imdb detailed review
# imdb_movies['imdb_review_detail'].iloc[0]


## Movie Lens Data Preprocessing


Because we will use both title and year to join the IMDB and MovieLens data, we
need year and movie title to be separate columns. In the MovieLens dataset, the
year is appended to the title in the same column, so we need to split them.


In [3]:
# separate the movie title and year from the title column and create two separate columns for them
# note: don't run this multiple times during same session
ml_movies["year"] = ml_movies["title"].apply(lambda x: x[-5:-1])
ml_movies["title"] = ml_movies["title"].apply(lambda x: x[:-6].strip().lower())


NameError: name 'ml_movies' is not defined

In [4]:
# split genres into a list for easy use
ml_movies["genres"] = ml_movies["genres"].str.split("|")


NameError: name 'ml_movies' is not defined

In [5]:
# cleanup movie tags and gather all tags per movie for consistency

# remove punctuation and whitespace from tags
ml_tags["tag"] = ml_tags["tag"].apply(lambda x: prep_string(x))

# replace tags with their stemmed versions for comparison
ml_tags["tag"] = ml_tags["tag"].apply(lambda x: stem_words(x))

# group tags by movie and create a list of unique tags for each movie to be appended to ml_movies
ml_tags_by_movie = ml_tags.groupby("movieId")["tag"].apply(set).apply(list)
ml_tags_by_movie = pd.DataFrame(ml_tags_by_movie).reset_index()

# merge ml_tags_by_movie with ml_movies on movieId
ml_movies = ml_movies.merge(ml_tags_by_movie, on="movieId", how="left")


NameError: name 'ml_tags' is not defined

In [6]:
# add aggregated movie lens rating information to DataFrame
ratings_agg = (
    (
        ml_ratings.groupby("movieId")
        .agg({"rating": ["mean", "count", "median", "std"]})
        .reset_index()
    )
    .droplevel(level=0, axis=1)
    .rename(
        columns={
            "": "movieId",
            "mean": "rating_mean",
            "count": "rating_count",
            "median": "rating_median",
            "std": "rating_std",
        }
    )
)
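
As an aside, pandas named aggregation can produce the same flat column names without the `droplevel` and `rename` steps. The sketch below uses invented ratings and a separate variable name (`ratings_summary`) so that it does not interfere with the cell above; it is an alternative formulation, not a required change.

```python
import pandas as pd

# Invented ratings for two movies.
ratings = pd.DataFrame(
    {"movieId": [1, 1, 1, 2, 2], "rating": [4.0, 5.0, 3.0, 2.0, 4.0]}
)

ratings_summary = (
    ratings.groupby("movieId")
    .agg(
        rating_mean=("rating", "mean"),
        rating_count=("rating", "count"),
        rating_median=("rating", "median"),
        rating_std=("rating", "std"),
    )
    .reset_index()
)
print(ratings_summary)
```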

ml_movies = ml_movies.merge(ratings_agg, on="movieId", how="left").round(1)


NameError: name 'ml_ratings' is not defined

In [32]:
# uncomment the following line to see the merged DataFrame
# ml_movies.head(1)


# Joining MovieLens and IMDB


In [33]:
ml_movies.merge(imdb_reviews, on=["title", "year"], how="inner").to_csv(
    "ml_movies_imdb_joined.csv", index=False
)
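
An inner join silently drops every movie that fails to match on both title and year. One way to check how many rows match is an outer merge with `indicator=True`. The two small frames below are invented for illustration, and the join keys are assumed to share the same dtype (strings here).

```python
import pandas as pd

ml = pd.DataFrame(
    {"title": ["film a", "film b"], "year": ["2001", "2010"], "movieId": [1, 2]}
)
imdb = pd.DataFrame(
    {"title": ["film a", "film c"], "year": ["2001", "1999"], "rating": [8, 7]}
)

# indicator=True adds a `_merge` column whose values are
# "both", "left_only", or "right_only".
audit = ml.merge(imdb, on=["title", "year"], how="outer", indicator=True)
match_counts = audit["_merge"].value_counts().to_dict()
print(match_counts)
```

Rows marked "left_only" or "right_only" are the ones an inner join would discard, so large counts there may point to title or year formatting differences between the two sources.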


# Exploring the MovieLens Dataset

The MovieLens dataset is a popular resource for movie ratings, containing over
27,000 movies and their respective ratings from a diverse group of users.


## Analyzing Movies by Popularity

This bar chart is useful for quickly identifying the most popular or highly
rated movies in the dataset.


In [34]:
ml_ratings_scale = alt.Scale(domain=[0, 5])
imdb_ratings_scale = alt.Scale(domain=[0, 10])


In [35]:
def barchart_top_movies(
    df: pd.DataFrame = ml_movies,
    rating_col: str = "rating_mean",
    rating_counts_col: str = "rating_count",
    limit=10,
    descending=False,
):
    """Populates a barchart with the most popular movies."""

    mdf = df.copy()
    mdf["genres"] = mdf["genres"].apply(lambda x: x[0] if isinstance(x, list) else "")
    mdf = (
        mdf.groupby(["title", "year", "movieId", "genres"])
        .agg({rating_counts_col: "sum", rating_col: "mean"})
        .sort_values([rating_counts_col, rating_col], ascending=False)
        .dropna()
    )

    if descending:
        mdf = mdf.tail(limit).reset_index()
    else:
        mdf = mdf.head(limit).reset_index()

    return (
        alt.Chart(mdf)
        .mark_bar()
        .encode(
            x=alt.X(f"{rating_col}:Q", title="Rating", scale=ml_ratings_scale),
            y=alt.Y("title:N", title="Title", sort="-x"),
            tooltip=[
                "title:N",
                "year:O",
                f"{rating_col}:Q",
                f"{rating_counts_col}:Q",
                "movieId:N",
            ],
        )
        .configure_axis(grid=False)
        .configure_view(width=512, height=256)
    )


barchart_top_movies().properties(title={"text": "MovieLens' Top Movies"})

The Shawshank Redemption (`id:318`) ranks first, followed by Schindler's List
(`id:527`); both movies are highly regarded by critics.


In [38]:
barchart_top_movies(descending=True).properties(
    title={"text": "MovieLens' least popular movies"}
)


## MovieLens Average Rating per Genre

This bar chart can be useful for quickly identifying which movie genres tend to
be rated more highly by users in the MovieLens dataset, and can provide insights
into which genres may be more popular or highly regarded by a large sample of
moviegoers.


In [39]:
def ratings_by_genre(df=ml_movies):
    mdf = df.copy()
    mdf = (
        mdf.explode("genres")
        .groupby(["genres"])
        .agg({"rating_count": "sum", "rating_mean": "mean"})
    ).reset_index()

    chart = (
        alt.Chart(mdf)
        .mark_bar()
        .encode(
            x=alt.X("rating_mean:Q", title="Rating", scale=ml_ratings_scale),
            y=alt.Y("genres:N", title="Genre", sort="-x"),
            tooltip=["rating_count:Q"],
        )
        .configure_axis(grid=False)
        .configure_view(width=512, height=256)
        .properties(title={"text": "MovieLens average rating per genre"})
    )

    return chart


ratings_by_genre()
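
The `explode` step is what lets a movie with several genres count toward each of them. A minimal sketch on invented data:

```python
import pandas as pd

movies = pd.DataFrame(
    {
        "title": ["film a", "film b"],
        "genres": [["Drama", "Romance"], ["Drama"]],
        "rating_mean": [3.0, 4.0],
    }
)

# explode() repeats each row once per genre in its list.
by_genre = (
    movies.explode("genres")
    .groupby("genres")
    .agg({"rating_mean": "mean"})
    .reset_index()
)
print(by_genre)
```

Note that this is an unweighted mean of per-movie means: a movie with a handful of ratings counts as much as one with thousands.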


# Exploring the IMDB Movie Ratings Dataset

The IMDB ratings dataset is a collection of movie ratings from the popular
website IMDB, which provides users with an opportunity to rate movies on a scale
of 1-10.


## Understanding the distribution of the IMDB Movie ratings

The histogram below is useful in displaying the distribution of movie ratings,
which can provide insight into how users tend to rate movies and what patterns
or trends may exist in the data.


In [40]:
alt.data_transformers.disable_max_rows()


def get_rating_distribution(df=imdb_reviews):
    df = df.copy()
    return (
        alt.Chart(df)
        .mark_bar()
        .encode(
            alt.X("rating:Q", bin=True, title="Rating", scale=imdb_ratings_scale),
            y=alt.Y("count()", title="Number of Reviews"),
        )
        .configure_axis(grid=False)
        .configure_view(width=512, height=256)
        .properties(title={"text": "Rating Distribution Across Movies", "fontSize": 20})
    )


In [41]:
get_rating_distribution()


In [42]:
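def filter_func(df: pd.DataFrame, filter: str) -> pd.DataFrame:
    """Drop movies whose tags contain the stemmed filter phrase (substring match)."""
    df = df.copy()
    # join each movie's list of tags into a single space-separated string
    df["tag"] = df["tag"].apply(lambda d: " ".join(d) if isinstance(d, list) else d)

    filter = stem_words(filter)

    df["tag"] = df["tag"].astype(str)
    # regex=False treats the user's input literally rather than as a pattern
    new_df = df[~df["tag"].str.contains(str(filter), regex=False)]

    return new_df


def query_func(df: pd.DataFrame, query: str) -> pd.DataFrame:
    """Keep movies whose title or tags contain the query (substring match)."""
    df = df.copy()
    df["tag"] = df["tag"].apply(lambda d: " ".join(d) if isinstance(d, list) else "")

    new_df = df[
        df["title"].str.contains(str(query), regex=False)
        | df["tag"].str.contains(str(stem_words(query)), regex=False)
    ]
    return new_df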


pd.options.display.max_colwidth = 50

# display(query_func(ml_movies,'cult')['tag'])


In [43]:
df = filter_func(ml_movies, "abuse")


In [44]:
exp3 = filter_func(query_func(ml_movies, "cult"), "dumb down")
exp3.head(3)


Unnamed: 0.1,Unnamed: 0,movieId,title,genres,year,tag,rating_mean,rating_count,rating_median,rating_std
2,43267,165643,those people,"[Drama, Romance]",2015,test gay gay cultur young love lgbt teen paint...,3.3,33,3.5,1.0
10,16193,85510,sucker punch,"[Action, Fantasy, Thriller, IMAX]",2011,kaf ridicul upskirt femal protagonist jena mal...,2.9,2436,3.0,1.2
34,55592,192283,crazy rich asians,[Comedy],2018,asian cultur romcom wealth stylish fun awkwafi...,3.5,698,3.5,0.9
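
In the output above, the query "cult" also matches movies tagged "cultur", because `str.contains` performs substring matching. If whole-word matching were preferred, one option is to compare against the set of individual words. The helpers `tag_tokens` and `has_token` and the sample tags are hypothetical, and stemming is left out to keep the sketch dependency-free.

```python
from typing import Iterable, Set


def tag_tokens(tags: Iterable[str]) -> Set[str]:
    """Split each tag on whitespace and collect the individual words."""
    return {word for tag in tags for word in tag.split()}


def has_token(tags: Iterable[str], term: str) -> bool:
    """True only if `term` equals a whole word in one of the tags."""
    return term in tag_tokens(tags)


tags = ["asian cultur", "romcom", "wealth"]  # illustrative stemmed tags
substring_hit = "cult" in " ".join(tags)
whole_word_hit = has_token(tags, "cult")
print(substring_hit, whole_word_hit, has_token(tags, "cultur"))
```

Whether the looser substring behavior is desirable depends on the use case: for content a user wants to avoid, over-matching may be the safer default.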


# Posters


In [None]:
ml_movies_clean = ml_movies.copy()
print(len(ml_movies_clean))


In [None]:
# copy the column subset so the assignment below does not act on a view
images = images[["poster", "title", "year"]].copy()
images["title"] = images["title"].apply(prep_string)
poster_and_prior = pd.merge(ml_movies_clean, images, on=["title", "year"])




In [None]:
from typing import List  # list[str] annotations are not available on Python 3.8


def prepare_images(poster_sample: pd.Series, n: int) -> List[str]:
    """Download the posters and return the first n as base64-encoded data URIs."""
    # poster_sample holds the links to the posters of the movie titles we want to visualize
    png_names = []
    for count, url in enumerate(poster_sample):
        name = f"./data/local-filename{count}.jpg"
        urllib.request.urlretrieve(url, name)
        im = Image.open(name)
        png_name = f"./data/local-filename{count}.png"
        png_names.append(png_name)
        im.save(png_name)

    imgCode = []
    for imgPath in png_names[:n]:
        image = Image.open(imgPath)  # PIL image
        output = io.BytesIO()
        # JPEG cannot store an alpha channel, so convert to RGB before saving
        image.convert("RGB").save(output, format="JPEG")
        encoded_string = (
            "data:image/jpeg;base64," + base64.b64encode(output.getvalue()).decode()
        )
        imgCode.append(encoded_string)
    return imgCode
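
The last step of `prepare_images` embeds each poster as a base64 data URI so the chart can render it without a separate image file. The encoding itself needs only the standard library; `to_data_uri` below is a hypothetical helper and the input bytes are a placeholder, not real image data.

```python
import base64


def to_data_uri(raw: bytes, mime: str = "image/jpeg") -> str:
    """Encode raw bytes as a base64 data URI that a browser can render inline."""
    return f"data:{mime};base64," + base64.b64encode(raw).decode("ascii")


uri = to_data_uri(b"abc")  # placeholder bytes standing in for image content
print(uri)
```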

In [None]:
def poster_visual_altair(
    poster_sample: pd.DataFrame, n: int, row_level: int
) -> alt.Chart:
    """This function is used to create a visual representation of the poster sample"""
    # pass n number of images to show in that row
    imgCode = prepare_images(poster_sample, n)
    x = [i for i in range(1, 50000, 6000)]
    x = x[:n]
    y = [row_level] * n

    source = pd.DataFrame({"x": x, "y": y, "img": imgCode})
    vis = (
        alt.Chart(source)
        .mark_image(size=5, width=250, height=250)
        .encode(x=alt.X("x", axis=None), y=alt.Y("y", axis=None), url="img")
    )
    return vis

In [None]:
# uncomment the following line to see the output DataFrame
# poster_and_prior.head(1)


In [None]:
# given lists of titles and years we can make a dataframe to select the poster links
selected_titles = ["1408", "2012", "50/50"]
selected_years = [2007, 2009, 2011]
data_tuples = list(zip(selected_titles, selected_years))
selected_df = pd.DataFrame(data_tuples, columns=["title", "year"])
selected_final = pd.merge(poster_and_prior, selected_df, on=["title", "year"])
selected_final = selected_final.drop_duplicates(
    subset=["title", "year"], keep="last"
).reset_index(drop=True)
selected_posters = selected_final["poster"]


## The counts of some of the most common terms, and the least common terms


In [None]:
# We have 59868 tags for movies in the full dataset
def insights() -> None:
    count_tags = ml_movies.copy().dropna()
    count_tags = count_tags["tag"].tolist()
    agg_tags = Counter()
    for i in count_tags:
        agg_tags.update(i)
    print(agg_tags.most_common()[:5])
    print(agg_tags.most_common()[-5:])
    print(len(agg_tags))
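
For reference, here is how `Counter.update` behaves on lists of tags, using invented tags. Passing a list adds one count per element, which is why each row's `tag` value needs to be a list rather than a joined string.

```python
from collections import Counter

tag_lists = [["bite", "vampir"], ["bite", "friendship"], ["war"]]

counts = Counter()
for tags in tag_lists:
    counts.update(tags)  # a list contributes one count per tag

print(counts.most_common(1))
print(len(counts))

# A plain string would be counted character by character instead.
letters = Counter()
letters.update("bite")
print(letters)
```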

In [None]:
def poster_visual_altair(poster_sample: pd.DataFrame, n: int, row_level: int):
    # pass in n number of images to show in that row
    imgCode = prepare_images(poster_sample, n)
    x = [i for i in range(1, 50000, 6000)]
    x = x[:n]
    y = [row_level] * n

    source = pd.DataFrame({"x": x, "y": y, "img": imgCode, "poster": poster_sample})
    source = pd.merge(source, poster_and_prior, on="poster", how="inner")
    source = source[
        [
            "x",
            "y",
            "img",
            "poster",
            "title",
            "year",
            "genres",
            "rating_mean",
            "rating_count",
            "rating_median",
            "tag",
        ]
    ]
    vis = (
        alt.Chart(source)
        .mark_image(width=250, height=250)
        .encode(
            x=alt.X("x", axis=None),
            y=alt.Y("y", axis=None),
            url="img",
            tooltip=[
                "title",
                "year",
                "genres",
                "rating_mean",
                "rating_count",
                "rating_median",
                "tag",
            ],
        )
        .configure(background="#000000")
        .configure_view(strokeWidth=0)
    )
    return vis.properties(width=600, height=200)

# Altair GUI with Posters


To increase the effectiveness and expressiveness of the film recommendations, we
show the movie posters when they are available for the films matching the
user's query and filter. We limit the display to 3 posters for simplicity.

Otherwise, we return a dataframe with the information that would have appeared
in the poster tooltips.


In [None]:
# def query_filter(button):
#     if not filter_input.value:
#         filter = "None Marked"
#     else:
#         filter = filter_input.value
#     query = query_input.value
#     df = filter_func(query_func(ml_movies, query), filter).sort_values(
#         ["rating_mean", "rating_count"], ascending=False
#     )
#     title = df["title"].tolist()[:3]
#     year = df["year"].tolist()[:3]
#     year = [float(int(i)) for i in year]
#     data_tuples = list(zip(title, year))
#     selected_df = pd.DataFrame(data_tuples, columns=["title", "year"])
#     selected_final = pd.merge(poster_and_prior, selected_df, on=["title", "year"])
#     selected_posters = selected_final["poster"]
#     chart = poster_visual_altair(selected_posters, len(selected_posters), 1)
#     clear_output()
#     display(chart)


In [None]:
output = widgets.Output()


def calc(filter: str, query: str) -> alt.Chart:
    df: pd.DataFrame = filter_func(query_func(ml_movies, query), filter).sort_values(
        ["rating_mean", "rating_count"], ascending=False
    )
    title = df["title"].tolist()[:3]
    year = df["year"].tolist()[:3]
    year = [float(int(i)) for i in year]
    data_tuples = list(zip(title, year))

    selected_df = pd.DataFrame(data_tuples, columns=["title", "year"])
    selected_final = pd.merge(poster_and_prior, selected_df, on=["title", "year"])

    if len(selected_final) < 1:
        print("No Posters Found for the Following Film Results")
        show = df.drop(columns=["movieId", "rating_std"])
        display(show.head(3))
    selected_posters = selected_final["poster"]
    chart = poster_visual_altair(selected_posters, len(selected_posters), 1)
    return chart


def clicked(button=None):
    # ipywidgets passes the clicked Button instance to on_click callbacks

    output.clear_output()
    with output:
        _query = querybox.value
        _filter = querybox2.value
        if not querybox2.value:
            _filter = "None Marked"

        if _query == "":
            print("please enter a query")
        else:
            return calc(_filter, _query).display()


querybox = widgets.Text(description="Description:")
searchbutton = widgets.Button(description="Search")

querybox2 = widgets.Text(description="Filter Words:")

searchbutton.on_click(clicked)

list_widgets = [
    widgets.VBox([widgets.HBox([querybox, searchbutton]), widgets.HBox([querybox2])])
]
accordion = widgets.Accordion(children=list_widgets)
accordion.set_title(0, "Movie Recommendations")
display(accordion, output)
# e.g. search: last holiday

# Example search description: friendship

With and without the filter: bite

Filtering on "bite" should remove the film "Let Me In", whose first tag is
"bite", as you can see if you hover over its poster.


### Most Common and Least Common Terms in this Sample


In [None]:
insights()
