# Semantic similarity tasks
## Task
Let’s build a system that recommends what to watch next based on how similar the movie descriptions are.
1. Create a code file called movies_task. You may use a .py or .ipynb file.
2. Read in the movies.txt file from the code files for this task. Each separate line is a description of a different movie.
3. Your task is to create a function to return which movies a user would watch next if they have watched Planet Hulk. The film has the description: “Will he save their world or destroy it? When the Hulk becomes too dangerous for the Earth, the Illuminati trick Hulk into a shuttle and launch him into space to a planet where the Hulk can live in peace. Unfortunately, Hulk lands on the planet Sakaar where he is sold into slavery and trained as a gladiator.”
4. The function should take in the description as a parameter and return the title of the most similar movie.
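
As a rough illustration of the idea behind steps 3 and 4, the sketch below ranks candidates by cosine similarity, which is the measure spaCy's `Doc.similarity` uses by default on document vectors. The three-dimensional vectors and the helper names `cosine_similarity` and `most_similar` are invented for this example; in the task itself the vectors would come from the spaCy model.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Treat an all-zero vector as having no similarity to anything
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query_vector, candidates):
    """Return the title whose vector is closest to query_vector.

    candidates maps each title to its (toy) description vector.
    """
    return max(
        candidates,
        key=lambda title: cosine_similarity(query_vector, candidates[title]),
    )


# Toy vectors standing in for real description embeddings (assumed values)
toy_vectors = {
    "Movie X": [0.9, 0.1, 0.0],
    "Movie Y": [0.1, 0.8, 0.3],
    "Movie Z": [0.0, 0.2, 0.9],
}
watched = [0.8, 0.2, 0.1]

print(most_similar(watched, toy_vectors))
```

The watched vector points mostly along the first axis, so the candidate with the most weight on that axis ranks highest.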

In [1]:
# Import libraries
import spacy
import pandas as pd
from IPython.display import display

In [2]:
# Load English model
nlp = spacy.load("en_core_web_md")

### Define functions

In [3]:
def display_movies(df):
    '''
    Function to display the movies_df with wrapped text in the description
    column.
    Parameters:
        df = movies dataframe with columns title and description only
    outputs:
        displays df to screen
    '''
    # --- Defensive coding of inputs
    # Ensure input is a DataFrame
    if not isinstance(df, pd.DataFrame):
        raise TypeError("Input must be a pandas DataFrame")
    # --- Defensive coding ends
    
    styled = (
        df.style
          # make the description column wrap
          .set_properties(subset=['description'], **{
              'white-space': 'normal', 
              'word-wrap': 'break-word',
              'vertical-align': 'top'
          })
          # constrain cell widths (applies to the whole table)
          .set_table_styles([{
              'selector': 'td',
              'props': [('max-width', '400px')]
          }])
    )
    display(styled)

In [4]:
def pairwise_similarities(df, nlp_model=nlp, title_col='title',
                          text_col='description'):
    '''
    Compute pairwise similarities between descriptions in a DataFrame.
    Parameters:
        df = pandas.DataFrame containing at least `title_col` and `text_col`
        nlp_model = spaCy model to use
        title_col = column name for row/column labels
        text_col = column name with text to compare
    Outputs:
        df_out = pandas.DataFrame of pairwise similarities
    '''
    # --- Defensive coding of inputs
    # Ensure input is a DataFrame
    if not isinstance(df, pd.DataFrame):
        raise TypeError("Input must be a pandas DataFrame")

    # Ensure required columns exist
    try:
        labels_ser = df[title_col]
        texts_ser = df[text_col]
    except KeyError:
        raise ValueError(
            f"DataFrame must contain '{title_col}' and '{text_col}' columns"
        )
    # --- Defensive coding ends
    
    labels = labels_ser.astype(str).tolist()
    texts = texts_ser.astype(str).tolist()

    # Create spaCy Docs
    docs = list(nlp_model.pipe(texts))
    
    # Find similarities
    sims = [[d1.similarity(d2) for d2 in docs] for d1 in docs]

    # Prepare output
    df_out = pd.DataFrame(sims, index=labels, columns=labels)
    
    return df_out

In [20]:
def next_watch(library, title, description):
    '''
    Function to recommend the movie to watch next based on similarity.
    Parameters:
        library = dataframe of movies to recommend from
        title = the title of the last movie as string
        description = the description of the last movie as string
    Outputs:
        result = the recommended movie title as string, or a list of
                 titles if several movies tie for the top score
    '''
    # --- Defensive coding of inputs
    # Ensure input is a DataFrame
    if not isinstance(library, pd.DataFrame):
        raise TypeError("Library input must be a pandas DataFrame")

    # Ensure required columns exist
    try:
        labels_ser = library["title"]
        texts_ser = library["description"]
    except KeyError:
        raise ValueError(
            "Library must contain 'title' and 'description' columns"
        )
    # --- Defensive coding ends
    
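    # Add the watched movie to a copy of the library, so the caller's
    # dataframe is not modified and no existing movie is overwritten
    watched = pd.DataFrame(
        [[title, description]], columns=["title", "description"]
    )
    library = pd.concat(
        [library[["title", "description"]], watched], ignore_index=True
    )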

    # Find pairwise similarities
    sim = pairwise_similarities(library)

    # Rank the other movies by similarity (excluding the watched movie)
    scores = sim[title].drop(labels=title).sort_values(ascending=False)
    top_score = scores.iloc[0]

    # --- Defensive coding
    if top_score == 0:
        raise ValueError("No similar movies found.")
    # --- Defensive coding ends

    # Collect all movies that share the top_score
    tied = [str(t) for t in scores[scores == top_score].index]

    # Generate output: a single title, or a list of titles on a tie
    result = tied[0] if len(tied) == 1 else tied

    return result

### Prepare data

In [12]:
# Initialize list of pairs
pairs = []

# Read from file
with open("movies.txt", "r") as file:
    # Read lines
    for line_no, raw in enumerate(file, start=1):
        line = raw.strip()
        # --- Defensive coding
        if not line:
            continue
        if ":" not in line:
            print(f"Skipping line {line_no}: no ':' present")
            continue
        # --- End
        
        # Split the current line into a pair at the colon
        title, description = [item.strip() for item in line.split(":", 1)]

        # Append line to list
        pairs.append((title, description))
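
The parsing step above could also be wrapped in a small helper so it can be checked without the file on disk. This is only a sketch: `parse_movie_line` is a hypothetical name, and it assumes, as the loop above does, that the first colon on a line separates the title from the description.

```python
def parse_movie_line(line):
    """Split one 'title: description' line into a (title, description) tuple.

    Returns None for blank lines or lines without a colon.
    """
    line = line.strip()
    if not line or ":" not in line:
        return None
    # partition splits at the first colon only, so any later colons
    # stay inside the description
    title, _, description = line.partition(":")
    return title.strip(), description.strip()


# Sample lines invented for illustration
print(parse_movie_line("Movie A : A story: part one"))
print(parse_movie_line("no separator here"))
```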

In [13]:
# Create dataframe from pairs
movies_df = pd.DataFrame(pairs, columns=["title", "description"])

display_movies(movies_df)

| | title | description |
|---|---|---|
| 0 | Movie A | When Hiccup discovers Toothless isn't the only Night Fury, he must seek "The Hidden World", a secret Dragon Utopia before a hired tyrant named Grimmel finds it first. |
| 1 | Movie B | After the death of Superman, several new people present themselves as possible successors. |
| 2 | Movie C | A darkness swirls at the center of a world-renowned dance company, one that will engulf the artistic director, an ambitious young dancer, and a grieving psychotherapist. Some will succumb to the nightmare. Others will finally wake up. |
| 3 | Movie D | A humorous take on Sir Arthur Conan Doyle's classic mysteries featuring Sherlock Holmes and Doctor Watson. |
| 4 | Movie E | A 16-year-old girl and her extended family are left reeling after her calculating grandmother unveils an array of secrets on her deathbed. |
| 5 | Movie F | In the last moments of World War II, a young German soldier fighting for survival finds a Nazi captain's uniform. Impersonating an officer, the man quickly takes on the monstrous identity of the perpetrators he is trying to escape from. |
| 6 | Movie G | The world at an end, a dying mother sends her young son on a quest to find the place that grants wishes. |
| 7 | Movie H | A musician helps a young singer and actress find fame, even as age and alcoholism send his own career into a downward spiral. |
| 8 | Movie I | Corporate analyst and single mom, Jen, tackles Christmas with a business-like approach until her uncle arrives with a handsome stranger in tow. |
| 9 | Movie J | Adapted from the bestselling novel by Madeleine St John, Ladies in Black is an alluring and tender-hearted comedy drama about the lives of a group of department store employees in 1959 Sydney. |


### Test function with Planet Hulk

In [14]:
# Set Planet Hulk parameters
ph_title = "Planet Hulk"
ph_description = (
        "Will he save their world or destroy it? When the Hulk becomes too "
        "dangerous for the Earth, the Illuminati trick Hulk into a shuttle and "
        "launch him into space to a planet where the Hulk can live in peace. "
        "Unfortunately, Hulk lands on the planet Sakaar where he is sold into "
        "slavery and trained as a gladiator."
    )

In [21]:
# Make recommendation
recommendation = next_watch(movies_df, ph_title, ph_description)
print(f"If you enjoyed watching {ph_title}, we think you'll also like"
     f" {recommendation}.")

If you enjoyed watching Planet Hulk, we think you'll also like Movie F.
