# HW7 - Movie Recommendations
### Quentin Phillips
### DATA 440 Fall 2023
### 11/30/23

# My Approach

#### After looking over the questions I needed to answer, I decided to read the data into pandas DataFrames. From there, I could modify the functions to take data from the DataFrames instead of a dictionary.


In [124]:
import pandas as pd
columns = ['user_id','age','gender','occupation','zip code']
user_df = pd.read_csv('u.user', sep='|', header=None)
user_df.columns = columns
user_df.head(5)

Unnamed: 0,user_id,age,gender,occupation,zip code
0,1,24,M,technician,85711
1,2,53,F,other,94043
2,3,23,M,writer,32067
3,4,24,M,technician,43537
4,5,33,F,other,15213


In [91]:
columns = ['user_id','movie_id','rating','time']
data_df = pd.read_csv('u.data', sep='\t', header=None)
data_df.columns = columns
data_df.head(5)

Unnamed: 0,user_id,movie_id,rating,time
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596


#### For the next file, I had to convert the encoding to UTF-8 before pandas could load it, which is why it is saved as a .txt file.

In [9]:
item_df = pd.read_csv('u.item.txt', sep='|', header=None)
item_df.head(5)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,14,15,16,17,18,19,20,21,22,23
0,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
1,2,GoldenEye (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?GoldenEye%20(...,0,1,1,0,0,...,0,0,0,0,0,0,0,1,0,0
2,3,Four Rooms (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Four%20Rooms%...,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,4,Get Shorty (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Get%20Shorty%...,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,5,Copycat (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Copycat%20(1995),0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
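
A possible alternative to re-saving the file, sketched under the assumption that the original `u.item` is Latin-1 (ISO-8859-1) encoded (that is an assumption, not something confirmed above): pass an explicit encoding when opening it. `pd.read_csv` accepts an `encoding` argument as well. The sample line and the `load_titles` helper are made up for illustration.

```python
import os
import tempfile

# Hypothetical sample line in the same pipe-delimited layout as u.item;
# the accented character is written as a single Latin-1 byte.
sample = "1|Caf\u00e9 Example (1995)|01-Jan-1995\n".encode("latin-1")

with tempfile.NamedTemporaryFile(delete=False, suffix=".item") as tmp:
    tmp.write(sample)
    path = tmp.name


def load_titles(path, encoding="latin-1"):
    """Map movie id -> title, decoding the file with the given encoding."""
    titles = {}
    with open(path, encoding=encoding) as f:
        for line in f:
            movie_id, title = line.split("|")[0:2]
            titles[movie_id] = title
    return titles


titles = load_titles(path)
os.remove(path)  # clean up the temporary sample file
print(titles)
```

Reading the same bytes as UTF-8 would raise a `UnicodeDecodeError`, which is the kind of failure an explicit encoding avoids.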


### Functions to be used later:

In [139]:
from math import sqrt

def sim_distance(prefs, p1, p2):
    '''
    Returns a distance-based similarity score for person1 and person2.
    '''

    # Get the list of shared_items
    si = {}
    for item in prefs[p1]:
        if item in prefs[p2]:
            si[item] = 1

    # If they have no ratings in common, return 0
    if len(si) == 0:
        return 0

    # Add up the squares of all the differences
    sum_of_squares = sum([pow(prefs[p1][item] - prefs[p2][item], 2)
                          for item in si])

    return 1 / (1 + sqrt(sum_of_squares))

def sim_pearson(prefs, p1, p2):
    '''
    Returns the Pearson correlation coefficient for p1 and p2.
    '''

    # Get the list of mutually rated items
    si = {}
    for item in prefs[p1]:
        if item in prefs[p2]:
            si[item] = 1

    # If there are no ratings in common, return 0
    if len(si) == 0:
        return 0

    # Sum calculations
    n = len(si)

    # Sums of all the preferences
    sum1 = sum([prefs[p1][it] for it in si])
    sum2 = sum([prefs[p2][it] for it in si])

    # Sums of the squares
    sum1Sq = sum([pow(prefs[p1][it], 2) for it in si])
    sum2Sq = sum([pow(prefs[p2][it], 2) for it in si])

    # Sum of the products
    pSum = sum([prefs[p1][it] * prefs[p2][it] for it in si])

    # Calculate r (Pearson score)
    num = pSum - sum1 * sum2 / n
    den = sqrt((sum1Sq - pow(sum1, 2) / n) * (sum2Sq - pow(sum2, 2) / n))
    if den == 0:
        return 0
    r = num / den
    return r
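
For reference, the two functions above correspond to the following formulas, written out as a sketch. The notation is an assumption introduced here: $S$ is the set of items rated by both users, $n = |S|$, and $x_i$, $y_i$ are the ratings that `p1` and `p2` gave item $i$.

```latex
% Distance-based similarity (sim_distance)
d(p_1, p_2) = \frac{1}{1 + \sqrt{\sum_{i \in S} (x_i - y_i)^2}}

% Pearson correlation (sim_pearson), in the single-pass form used in the code
r = \frac{\sum_{i \in S} x_i y_i - \frac{\left(\sum_{i \in S} x_i\right)\left(\sum_{i \in S} y_i\right)}{n}}
         {\sqrt{\left(\sum_{i \in S} x_i^2 - \frac{\left(\sum_{i \in S} x_i\right)^2}{n}\right)
                \left(\sum_{i \in S} y_i^2 - \frac{\left(\sum_{i \in S} y_i\right)^2}{n}\right)}}
```

When the denominator of $r$ is zero (for example, when one user gave every shared item the same rating), the code returns 0 instead of dividing.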

In [145]:
def topMatches(
    prefs,
    person,
    n=5,
    similarity=sim_pearson,
):
    '''
    Returns the best matches for person from the prefs dictionary.
    Number of results and similarity function are optional params.
    '''

    scores = [(similarity(prefs, person, other), other) for other in prefs
              if other != person]
    scores.sort()
    scores.reverse()
    return scores[0:n]

def BottomMatches(
    prefs,
    person,
    n=5,
    similarity=sim_pearson,
):
    '''
    Returns the worst matches for person from the prefs dictionary.
    Number of results and similarity function are optional params.
    '''

    scores = [(similarity(prefs, person, other), other) for other in prefs
              if other != person]
    scores.sort()
    return scores[0:n]
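
Both functions sort a list of `(score, name)` tuples, so ties on the score are broken by the second element. Because the keys in `prefs` are strings, tied users are ordered by string comparison of their IDs. A small sketch with invented scores:

```python
# Invented (score, user) pairs; three users are tied at 1.0.
scores = [(1.0, "157"), (0.5, "9"), (1.0, "733"), (1.0, "467")]

# Same ordering as scores.sort(); scores.reverse(); scores[0:n]
top = sorted(scores, reverse=True)[:3]

# Same ordering as scores.sort(); scores[0:n]
bottom = sorted(scores)[:2]

print(top)     # ties at 1.0 appear in descending string order of the ID
print(bottom)
```

When many users share the same score, which of them appear in the top or bottom `n` therefore depends on their ID strings rather than on any further measure of similarity.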

In [141]:
def getRecommendations(prefs, person, similarity=sim_pearson):
    '''
    Gets recommendations for a person by using a weighted average
    of every other user's rankings
    '''

    totals = {}
    simSums = {}
    for other in prefs:
        # Don't compare me to myself
        if other == person:
            continue
        sim = similarity(prefs, person, other)
        # Ignore scores of zero or lower
        if sim <= 0:
            continue
        for item in prefs[other]:
            # Only score movies I haven't seen yet
            if item not in prefs[person] or prefs[person][item] == 0:
                # Similarity * Score
                totals.setdefault(item, 0)
                # The final score is calculated by multiplying each item by the
                #   similarity and adding these products together
                totals[item] += prefs[other][item] * sim
                # Sum of similarities
                simSums.setdefault(item, 0)
                simSums[item] += sim
    # Create the normalized list
    rankings = [(total / simSums[item], item) for (item, total) in
                totals.items()]
    # Return the sorted list
    rankings.sort()
    rankings.reverse()
    return rankings
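
The arithmetic behind each predicted score can be sketched for a single movie. The similarities and ratings are invented for illustration, and `weighted_score` is a hypothetical helper rather than part of the functions above.

```python
# Each pair is (similarity to the target user, that neighbour's rating).
neighbours = [(0.5, 4.0), (1.0, 2.0)]


def weighted_score(pairs):
    """Similarity-weighted average rating, skipping non-positive similarities."""
    used = [(sim, rating) for sim, rating in pairs if sim > 0]
    total = sum(sim * rating for sim, rating in used)
    sim_sum = sum(sim for sim, _ in used)
    return total / sim_sum if sim_sum else None


score = weighted_score(neighbours)        # (0.5*4 + 1.0*2) / (0.5 + 1.0)
single = weighted_score([(0.25, 5.0)])    # only one positively correlated rater
print(score, single)
```

With a single contributing rater the similarity cancels out, so the predicted score equals that rater's rating regardless of how weak the correlation is.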

In [113]:
# Get movie titles
movies = {}
with open('u.item.txt', encoding='utf-8') as f:
    for line in f:
        (movie_id, title) = line.split('|')[0:2]
        movies[movie_id] = title

# Load ratings into a nested dict: prefs[user][title] = rating
prefs = {}
with open('u.data') as f:
    for line in f:
        (user, movieid, rating, ts) = line.split('\t')
        prefs.setdefault(user, {})
        prefs[user][movies[movieid]] = float(rating)
prefs['196']

{'Kolya (1996)': 3.0,
 'Mrs. Doubtfire (1993)': 4.0,
 "Muriel's Wedding (1994)": 4.0,
 'Shall We Dance? (1996)': 3.0,
 'Stand by Me (1986)': 5.0,
 'Ace Ventura: Pet Detective (1994)': 5.0,
 'Mrs. Brown (Her Majesty, Mrs. Brown) (1997)': 4.0,
 'Raising Arizona (1987)': 4.0,
 'Being There (1979)': 5.0,
 'Truth About Cats & Dogs, The (1996)': 4.0,
 'Englishman Who Went Up a Hill, But Came Down a Mountain, The (1995)': 2.0,
 'Birdcage, The (1996)': 4.0,
 'English Patient, The (1996)': 5.0,
 'Home Alone (1990)': 3.0,
 'American President, The (1995)': 5.0,
 'Babe (1995)': 5.0,
 'Harold and Maude (1971)': 4.0,
 'Up in Smoke (1978)': 4.0,
 'Four Weddings and a Funeral (1994)': 3.0,
 'While You Were Sleeping (1995)': 3.0,
 'Men in Black (1997)': 2.0,
 'Kids in the Hall: Brain Candy (1996)': 4.0,
 'Groundhog Day (1993)': 3.0,
 'Boogie Nights (1997)': 3.0,
 "Marvin's Room (1996)": 3.0,
 'Cold Comfort Farm (1995)': 3.0,
 'Adventures of Priscilla, Queen of the Desert, The (1994)': 4.0,
 'Secrets &

In [125]:
user_df.sort_values('age')

Unnamed: 0,user_id,age,gender,occupation,zip code
29,30,7,M,student,55436
470,471,10,M,student,77459
288,289,11,M,none,94619
879,880,13,M,student,83702
608,609,13,F,student,55106
...,...,...,...,...,...
584,585,69,M,librarian,98501
766,767,70,M,engineer,00000
802,803,70,M,administrator,78212
859,860,70,F,retired,48322


In [134]:
x = user_df.loc[user_df['age'] == 21]
x = x.loc[x['gender'] == 'M']
x = x.loc[x['occupation'] == 'student']
x

Unnamed: 0,user_id,age,gender,occupation,zip code
80,81,21,M,student,21218
258,259,21,M,student,48823
275,276,21,M,student,95064
322,323,21,M,student,19149
541,542,21,M,student,60515
724,725,21,M,student,91711
922,923,21,M,student,E2E3R
927,928,21,M,student,55408


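### I will choose the first three rows in order, by their index labels in the leftmost column (the corresponding user_id values are 81, 259, and 276):
### 80, 258, and 275.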

In [136]:
print(prefs['80'])
print(prefs['258'])
print(prefs['275'])

{'Red Rock West (1992)': 5.0, 'Piano, The (1993)': 3.0, 'Life Less Ordinary, A (1997)': 4.0, 'Star Wars (1977)': 3.0, 'Searching for Bobby Fischer (1993)': 4.0, 'Eat Drink Man Woman (1994)': 4.0, 'Quiz Show (1994)': 4.0, 'Full Monty, The (1997)': 3.0, 'Sting, The (1973)': 3.0, 'E.T. the Extra-Terrestrial (1982)': 3.0, 'Little Women (1994)': 3.0, 'Young Frankenstein (1974)': 5.0, 'Bridge on the River Kwai, The (1957)': 2.0, 'Jaws (1975)': 3.0, 'Field of Dreams (1989)': 5.0, "Ulee's Gold (1997)": 4.0, 'Fargo (1996)': 5.0, 'Room with a View, A (1986)': 3.0, "Monty Python's Life of Brian (1979)": 3.0, 'Event Horizon (1997)': 1.0, 'Shine (1996)': 4.0, 'Annie Hall (1977)': 3.0, 'Jerry Maguire (1996)': 4.0, 'Patton (1970)': 5.0, 'Remains of the Day, The (1993)': 5.0, 'Casablanca (1942)': 5.0, 'Fugitive, The (1993)': 4.0, "Eve's Bayou (1997)": 4.0, 'Shawshank Redemption, The (1994)': 5.0}
{'Scream (1996)': 1.0, "Dante's Peak (1997)": 4.0, 'Titanic (1997)': 5.0, 'Rainmaker, The (1997)': 5.0, 'W

# Q1
### User 80 top 3 movies: Red Rock West, Young Frankenstein, Field of Dreams
### User 80 bottom 3 movies: Event Horizon, Bridge on the River Kwai, Jaws
### User 258 top 3 movies: Titanic, Rainmaker, Tomorrow Never Dies
### User 258 bottom 3 movies: For Richer or Poorer, Scream, Evita
### User 275 top 3 movies: Star Trek: tWoK, Raiders of the Lost Ark, Bridge on the River Kwai
### User 275 bottom 3 movies: Mission: Impossible, Pete's Dragon, Winnie the Pooh and the Blustery Day

### I definitely feel most similar to User 80, since I never liked Titanic or Indiana Jones (Raiders of the Lost Ark), which were in the top 3 of the other two users. User 80 will be the substitute me.

In [143]:
topMatches(prefs, "80", n=5)

[(1.000000000000004, '733'),
 (1.000000000000004, '467'),
 (1.000000000000004, '412'),
 (1.000000000000004, '163'),
 (1.000000000000004, '157')]

In [146]:
BottomMatches(prefs, "80", n=5)

[(-1.000000000000004, '111'),
 (-1.000000000000004, '38'),
 (-1.000000000000004, '926'),
 (-1.0000000000000027, '672'),
 (-1.0, '120')]

# Q2
### The users most correlated to my substitute user are: 733, 467, 412, 163, 157
### The users least correlated to my substitute user are: 111, 38, 926, 672, 120
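
One plausible explanation for so many scores of exactly ±1 (not verified against the data here) is that Pearson correlation over only two shared movies is always +1, -1, or 0. The sketch below uses invented ratings to show this, along with a hypothetical `top_matches_min_overlap` variant that skips users with too few shared ratings. The `min_shared` threshold is an assumption, not part of the assignment.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation over items rated in both dicts (0.0 if undefined)."""
    shared = [item for item in a if item in b]
    n = len(shared)
    if n == 0:
        return 0.0
    sum_a = sum(a[i] for i in shared)
    sum_b = sum(b[i] for i in shared)
    sum_a_sq = sum(a[i] ** 2 for i in shared)
    sum_b_sq = sum(b[i] ** 2 for i in shared)
    prod_sum = sum(a[i] * b[i] for i in shared)
    num = prod_sum - sum_a * sum_b / n
    den = sqrt((sum_a_sq - sum_a ** 2 / n) * (sum_b_sq - sum_b ** 2 / n))
    return num / den if den else 0.0


def top_matches_min_overlap(prefs, person, n=5, min_shared=3):
    """Like topMatches, but ignores users sharing fewer than min_shared items."""
    scores = []
    for other in prefs:
        if other == person:
            continue
        shared = sum(1 for item in prefs[person] if item in prefs[other])
        if shared < min_shared:
            continue
        scores.append((pearson(prefs[person], prefs[other]), other))
    scores.sort(reverse=True)
    return scores[:n]


# Toy ratings (made up, not taken from u.data).
toy = {
    "me":   {"A": 5.0, "B": 3.0, "C": 4.0, "D": 1.0},
    "two":  {"A": 4.0, "B": 2.0},                      # 2 shared items
    "four": {"A": 5.0, "B": 2.0, "C": 5.0, "D": 2.0},  # 4 shared items
}

r_two = pearson(toy["me"], toy["two"])    # perfect correlation from 2 points
r_four = pearson(toy["me"], toy["four"])  # more informative, below 1
filtered = top_matches_min_overlap(toy, "me")
print(r_two, r_four, filtered)
```

Requiring a minimum overlap trades coverage for reliability: fewer candidate neighbours remain, but a high score then reflects agreement across several movies.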

In [149]:
getRecommendations(prefs, "80")

[(5.0, 'They Made Me a Criminal (1939)'),
 (5.0, 'Stonewall (1995)'),
 (5.0, "Some Mother's Son (1996)"),
 (5.0, 'Saint of Fort Washington, The (1993)'),
 (5.0, 'Prefontaine (1997)'),
 (5.0, 'Nico Icon (1995)'),
 (5.0, 'My Favorite Season (1993)'),
 (5.0, 'Maya Lin: A Strong Clear Vision (1994)'),
 (5.0, 'Little City (1998)'),
 (5.0, 'Great Day in Harlem, A (1994)'),
 (5.0, 'Golden Earrings (1947)'),
 (5.0, 'Aiqing wansui (1994)'),
 (4.8642844819340425, 'Perfect Candidate, A (1996)'),
 (4.8259121933628855, 'Visitors, The (Visiteurs, Les) (1993)'),
 (4.67179609071082, "Margaret's Museum (1995)"),
 (4.666009663634963, 'Some Folks Call It a Sling Blade (1993)'),
 (4.633084344486344, 'Bitter Sugar (Azucar Amargo) (1996)'),
 (4.593926928600576, 'Grateful Dead (1995)'),
 (4.587068006664695, 'Pather Panchali (1955)'),
 (4.538777034715009, 'Third Man, The (1949)'),
 (4.497203847290115, '12 Angry Men (1957)'),
 (4.495500856236329, 'Rear Window (1954)'),
 (4.492930408805284, 'Close Shave, A (199

# Q3
### The top 5 movies to recommend to this user are: They Made Me a Criminal, Stonewall, Some Mother's Son, The Saint of Fort Washington, Prefontaine
### The bottom 5 movies to recommend to this user are: The Shadow, Preacher's Wife, Georgia, Female Perversions, and Broken Arrow

# Q4
### My favorite film on the list is probably one of the Wallace and Gromit ones, but to avoid skewing the algorithm toward returning only other films in the same series, I will select The Princess Bride as my favorite. I will select The Parent Trap as my least favorite.



In [164]:
def transformPrefs(prefs):
    '''
    Transform the recommendations into a mapping where persons are described
    with interest scores for a given title e.g. {title: person} instead of
    {person: title}.
    '''

    result = {}
    for person in prefs:
        for item in prefs[person]:
            result.setdefault(item, {})
            # Flip item and person
            result[item][person] = prefs[person][item]
    return result

new = transformPrefs(prefs)
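
To make the flip concrete, here is a small sketch on invented data. `flip_prefs` is a hypothetical compact equivalent of `transformPrefs`, written out so the example is self-contained.

```python
def flip_prefs(prefs):
    """Turn {person: {item: rating}} into {item: {person: rating}}."""
    result = {}
    for person, ratings in prefs.items():
        for item, rating in ratings.items():
            result.setdefault(item, {})[person] = rating
    return result


# Toy ratings (made up for illustration).
toy = {
    "u1": {"A": 5.0, "B": 3.0},
    "u2": {"A": 4.0},
}

flipped = flip_prefs(toy)
print(flipped)
```

After the flip, the similarity functions compare movies by the users who rated both of them, which is why `topMatches` can be reused unchanged with a movie title as the `person` argument.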

In [168]:
topMatches(new, "Princess Bride, The (1987)")

[(1.000000000000004, 'Colonel Chabert, Le (1994)'),
 (1.0000000000000013, 'Albino Alligator (1996)'),
 (1.0, 'Wedding Gift, The (1994)'),
 (1.0, 'Vermin (1998)'),
 (1.0, 'U.S. Marshalls (1998)')]

In [170]:
BottomMatches(new, "Princess Bride, The (1987)")

[(-1.000000000000004, 'Children of the Revolution (1996)'),
 (-1.0, '1-900 (1994)'),
 (-1.0, 'Broken English (1996)'),
 (-1.0, 'Clean Slate (Coup de Torchon) (1981)'),
 (-1.0, 'Dingo (1992)')]

In [171]:
topMatches(new, "Parent Trap, The (1961)")

[(1.0000000000000007, 'Mrs. Dalloway (1997)'),
 (1.0000000000000007, 'Kiss Me, Guido (1997)'),
 (1.0, 'Zeus and Roxanne (1997)'),
 (1.0, 'Year of the Horse (1997)'),
 (1.0, 'World of Apu, The (Apur Sansar) (1959)')]

In [172]:
BottomMatches(new, "Parent Trap, The (1961)")

[(-1.0000000000000027, "I'm Not Rappaport (1996)"),
 (-1.0000000000000007, 'Prophecy II, The (1998)'),
 (-1.0, 'Deep Rising (1998)'),
 (-1.0, 'Faster Pussycat! Kill! Kill! (1965)'),
 (-1.0, 'Girl 6 (1996)')]

Interestingly, I'm not sure I would enjoy the recommended films. After watching the trailers, Albino Alligator and Vermin seem similar to The Princess Bride in their goofy, offbeat tone, but they do not look like films I would particularly enjoy. It seems to me that this list doesn't fully capture why people like the movies they do. However, I am interested in watching a few of these to get a better sense of the accuracy.

https://www.youtube.com/watch?v=qJv3-qPBNSs - Vermin Trailer

https://www.youtube.com/watch?v=zqhzJMfbqv4 - Albino Alligator Trailer