# Recommending Movies

-------------------------------------------------

The raw code for this Jupyter notebook is by default hidden for easier reading. The main focus of this particular page of the notebook is on the graphs and their interpretation. To toggle on/off the raw code, click below:

In [1]:
# Setup Code toggle button
from IPython.display import HTML

HTML(''' 
<center><h3>
<a href="javascript:code_toggle()">Talk is cheap, show me the code.</a>
</h3></center>
<script>
    var code_show=true; //true -> hide code at first

    function code_toggle() {
        $('div.prompt').hide(); // always hide prompt

        if (code_show){
            $('div.input').hide();
        } else {
            $('div.input').show();
        }
        code_show = !code_show
    }
    $( document ).ready(code_toggle);
</script>
''')

In [2]:
# Setup notebook theme
from jupyterthemes import get_themes
from jupyterthemes.stylefx import set_nb_theme
set_nb_theme(get_themes()[1])

In [3]:
# Load R magic
%load_ext rpy2.ipython

&nbsp;

## Get the Dataset

This time we are skipping Python and going straight into R. The data is provided in delimited text files (tab-separated for `u.data`, pipe-separated for `u.user` and `u.item`), which can easily be read into an R dataframe. Unfortunately, Python dataframes print with far better formatting than R dataframes, which makes the data much easier to inspect.

#### `u.user`

&nbsp;

In [4]:
%%R

library(knitr)

u.user <- read.delim("../data/u.user",
                     sep="|",
                     header=FALSE,
                     col.names=c("user.id", "age", "gender", "occupation", "zip.code")
                    )

kable(head(u.user), format='rst')



user.id  age  gender  occupation  zip.code
      1   24  M       technician  85711   
      2   53  F       other       94043   
      3   23  M       writer      32067   
      4   24  M       technician  43537   
      5   33  F       other       15213   
      6   42  M       executive   98101   


&nbsp;

#### `u.data`

&nbsp;

In [5]:
%%R

u.data <- read.delim("../data/u.data",
                     sep="\t",
                     header=FALSE,
                     col.names=c("user.id", "item.id", "rating", "timestamp")
                    )

kable(head(u.data), format='rst')



user.id  item.id  rating  timestamp
    196      242       3  881250949
    186      302       3  891717742
     22      377       1  878887116
    244       51       2  880606923
    166      346       1  886397596
    298      474       4  884182806


&nbsp;

#### `u.item`

Sometimes, no matter how hard you try, R will just always be ugly...

&nbsp;

In [6]:
%%R

u.item <- read.delim("../data/u.item",
                     sep="|",
                     header=FALSE,
                     col.names=c("movie.id", "movie.title", "release.date",
                                 "video.release.date", "IMDB.URL", "unknown",
                                 "action", "adventure", "animation", "children",
                                 "comedy", "crime", "documentary", "drama",
                                 "fantasy", "film-noir", "horror", "musical",
                                 "mystery", "romance", "sci-fi", "thriller",
                                 "war", "western"
                                )
                    )

head(u.item)

  movie.id                                          movie.title release.date
1        1                                     Toy Story (1995)  01-Jan-1995
2        2                                     GoldenEye (1995)  01-Jan-1995
3        3                                    Four Rooms (1995)  01-Jan-1995
4        4                                    Get Shorty (1995)  01-Jan-1995
5        5                                       Copycat (1995)  01-Jan-1995
6        6 Shanghai Triad (Yao a yao yao dao waipo qiao) (1995)  01-Jan-1995
  video.release.date
1                 NA
2                 NA
3                 NA
4                 NA
5                 NA
6                 NA
                                                      IMDB.URL unknown action
1        http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)       0      0
2          http://us.imdb.com/M/title-exact?GoldenEye%20(1995)       0      1
3       http://us.imdb.com/M/title-exact?Four%20Rooms%20(1995)       0      0
4 

&nbsp;

## Find the 3 users who are closest to you

Using the metrics:

*  age
*  gender
*  occupation

Dataframes can be subset to the rows whose column values meet a condition using the pattern:

```R
dataframe[ dataframe$column.id == x, ]
```
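
As a minimal sketch of that pattern, the toy dataframe below (made-up rows, not taken from `u.user`) is filtered on several columns at once. Conditions are combined with the vectorised `&`, and the empty slot after the comma keeps all columns. The names `people`, `hits`, and `wanted` are placeholders for this example.

```R
# Toy data for illustration only; these rows are not from u.user.
people <- data.frame(
    user.id    = 1:4,
    age        = c(29, 35, 41, 29),
    gender     = c("M", "M", "M", "F"),
    occupation = c("programmer", "programmer", "writer", "scientist"),
    stringsAsFactors = FALSE
)

# Rows where every condition holds; nothing after the comma means "all columns".
hits <- people[ people$age == 29 & people$gender == "M", ]

# %in% tests a column against several allowed values at once.
wanted <- people[ people$occupation %in% c("programmer", "scientist") &
                  people$gender == "M", ]

print(hits$user.id)
print(wanted$user.id)
```

With `%in%`, several occupations can be checked in a single subset instead of one cell per occupation.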

&nbsp;

In [7]:
%%R

df <- u.user[ u.user$age == 29 & u.user$gender == 'M' & u.user$occupation == 'programmer', ]
kable(df, format='rst')



\    user.id  age  gender  occupation  zip.code
45        45   29  M       programmer  50233   
222      222   29  M       programmer  27502   


&nbsp;

Only two hits, so let's throw in a scientist as well.

&nbsp;

In [8]:
%%R

df <- u.user[ u.user$age == 29 & u.user$gender == 'M' & u.user$occupation == 'scientist', ]
kable(df, format='rst')



\    user.id  age  gender  occupation  zip.code
483      483   29  M       scientist   43212   


&nbsp;

Users 45, 222, and 483 it is.

&nbsp;
### Get the individual user data

To do this we just need to subset the `u.data` dataframe the same way we did for `u.user`, this time pulling out the ratings of the three users with a similar gender, age, and occupation. Because each new dataframe is named after its user (`u45.data`, `u222.data`, `u483.data`), the `user.id` column is redundant, and the `timestamp` column is not needed either.
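
Since the same subset is repeated for each user, it could also be wrapped in a small helper. This is only a sketch: `user.ratings` is a hypothetical function name, and the toy `ratings` dataframe (made-up rows) stands in for `u.data`.

```R
# Toy stand-in for u.data; the rows are made up.
ratings <- data.frame(
    user.id   = c(45, 45, 222, 483, 222),
    item.id   = c(25, 109, 366, 237, 750),
    rating    = c(4, 5, 4, 3, 5),
    timestamp = c(1, 2, 3, 4, 5)
)

# Hypothetical helper: keep one user's rows and only the two useful columns.
user.ratings <- function(df, id) {
    out <- df[ df$user.id == id, c("item.id", "rating") ]
    rownames(out) <- NULL   # renumber the rows from 1
    out
}

# One dataframe per user, collected in a named list.
similar <- lapply(c(u45 = 45, u222 = 222, u483 = 483), user.ratings, df = ratings)
print(similar$u222)
```

Keeping the dataframes in a named list rather than in separate variables makes it easy to apply the same step to every user later.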

#### `user 45`

In [9]:
%%R

u45.data <- u.data[ u.data$user.id == 45, ][c("item.id", "rating")]
rownames(u45.data) <- 1:nrow(u45.data)
kable(head(u45.data), format='rst')



item.id  rating
     25       4
    109       5
    118       4
    763       2
    473       3
    472       3


&nbsp;

#### `user 222`

In [10]:
%%R

u222.data <- u.data[ u.data$user.id == 222, ][c("item.id", "rating")]
rownames(u222.data) <- 1:nrow(u222.data)
kable(head(u222.data), format='rst')



item.id  rating
    366       4
    750       5
    755       4
    118       4
     77       4
    724       3


&nbsp;

#### `user 483`

In [11]:
%%R

u483.data <- u.data[ u.data$user.id == 483, ][c("item.id", "rating")]
rownames(u483.data) <- 1:nrow(u483.data)
kable(head(u483.data), format='rst')



item.id  rating
    237       3
    144       2
    181       4
    900       3
    462       3
    250       3


&nbsp;

### Get the `movie.id` for each `item.id`

To do this we need to make a list of `(movie, rating)` pairs for each user. This requires each individual user's ratings dataframe and the `u.item` dataframe. Each user dataframe contains a `rating` and an `item.id`, which corresponds to a `movie.id` in the `u.item` data. Pythonista pseudocode would look like:

    for item_id, rating in user_data:
        movie = u_item[item_id]    # look up the movie by its id
        add (movie, rating) to user_movies
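
In R the loop is not needed: `match()` returns, for each `item.id`, the position of the same id in the lookup table, and indexing the title column with those positions gives the titles in the right order. A minimal sketch with made-up ids and titles (the names `items` and `ratings` are placeholders, not objects from this notebook):

```R
# Made-up lookup table and ratings, for illustration only.
items <- data.frame(
    movie.id    = c(10, 20, 30),
    movie.title = c("Movie A", "Movie B", "Movie C"),
    stringsAsFactors = FALSE
)
ratings <- data.frame(item.id = c(30, 10, 99), rating = c(4, 2, 5))

# Position of each item.id inside items$movie.id (NA when the id is unknown).
pos <- match(ratings$item.id, items$movie.id)

# Index the titles with those positions to get one title per rating row.
ratings$title <- items$movie.title[pos]
print(ratings)
```

An id that is missing from the lookup table yields `NA` rather than an error, so it is worth checking for `NA` titles after the lookup.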

Then, from the ratings that stand out to me as particularly good or bad calls, the score is calculated as:

$$
\text{# of good rates} - \text{# of bad rates}
$$
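
A minimal sketch of that count, assuming the judgement calls are recorded by hand in a `verdict` column (a hypothetical name; the notebook itself keeps them as plain lists). The titles and verdicts are made up.

```R
# Hand-labelled verdicts, made up for illustration.
picks <- data.frame(
    title   = c("Movie A", "Movie B", "Movie C", "Movie D", "Movie E"),
    verdict = c("good", "bad", "bad", "good", "bad"),
    stringsAsFactors = FALSE
)

# Unweighted score: each good rate counts +1, each bad rate counts -1.
score <- sum(picks$verdict == "good") - sum(picks$verdict == "bad")
print(score)
```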

&nbsp;
#### `user 45`

In [12]:
%%R

u45.data$item.id <- u.item$movie.title[match(u45.data$item.id, u.item$movie.id)]
kable(u45.data, format='rst')



item.id                                                                          rating
Birdcage, The (1996)                                                                  4
Mystery Science Theater 3000: The Movie (1996)                                        5
Twister (1996)                                                                        4
Happy Gilmore (1996)                                                                  2
James and the Giant Peach (1996)                                                      3
Dragonheart (1996)                                                                    3
Godfather, The (1972)                                                                 5
Independence Day (ID4) (1996)                                                         4
Evening Star, The (1996)                                                              2
First Wives Club, The (1996)                                                          3
Leaving Las Vegas (1995)      

Good Rates:

* Return of the Jedi (1983): 4
* Men in Black (1997): 5
* Scream (1996): 3
* Star Wars (1977): 5
* Space Jam (1996): 4
* Toy Story (1995): 5

Bad Rates:

* Willy Wonka and the Chocolate Factory (1971): 2
* Nutty Professor, The (1996): 3
* Hunchback of Notre Dame, The (1996): 3
* Fargo (1996): 5
* Independence Day (ID4) (1996): 4
* James and the Giant Peach (1996): 3
* Dragonheart (1996): 3
* Mystery Science Theater 3000: The Movie (1996): 5
* Twister (1996): 4

Really not feeling the Willy Wonka rating here. I also cannot stand Mystery Science or Twister.

Unweighted, that gives a score of $6 - 9 = -3$.

&nbsp;
#### `user 222`

In [13]:
%%R

u222.data$item.id <- u.item$movie.title[match(u222.data$item.id, u.item$movie.id)]
kable(u222.data, format='rst')



item.id                                                                          rating
Dangerous Minds (1995)                                                                4
Amistad (1997)                                                                        5
Jumanji (1995)                                                                        4
Twister (1996)                                                                        4
Firm, The (1993)                                                                      4
Circle of Friends (1995)                                                              3
Willy Wonka and the Chocolate Factory (1971)                                          3
Dances with Wolves (1990)                                                             4
Bed of Roses (1996)                                                                   2
French Kiss (1995)                                                                    3
Andre (1994)                  

Good Rates:

* Jumanji (1995): 4
* Princess Bride, The (1987): 5
* Tales From the Crypt Presents: Demon Knight (1995): 1
* 2001: A Space Odyssey (1968): 5
* Indiana Jones and the Last Crusade (1989): 4
* Empire Strikes Back, The (1980): 5
* Forrest Gump (1994): 5
* Terminator 2: Judgment Day (1991): 5
* Stargate (1994): 4
* Star Trek: The Motion Picture (1979): 4
* Die Hard (1988): 5
* Back to the Future (1985): 5
* Jurassic Park (1993): 4
* Three Musketeers, The (1993): 4
* Lion King, The (1994): 4
* Toy Story (1995): 4
* Blade Runner (1982): 5
* Starship Troopers (1997): 4
* Men in Black (1997): 4
* Pink Floyd - The Wall (1982): 4
* Return of the Jedi (1983): 4
* Nightmare on Elm Street, A (1984): 4
* Star Wars (1977): 4
* Silence of the Lambs, The (1991): 4
* Shawshank Redemption, The (1994): 5

Bad Rates:

* Twister (1996): 4
* Willy Wonka and the Chocolate Factory (1971): 3
* Braveheart (1995): 5
* Fifth Element, The (1997): 2
* Dirty Dancing (1987): 4
* Tales from the Crypt Presents: Bordello of Blood (1996): 3
* Nutty Professor, The (1996): 3
* Bio-Dome (1996): 1
* Shining, The (1980): 3
* Air Bud (1997): 1
* James and the Giant Peach (1996): 1
* Wizard of Oz, The (1939): 2
* Hackers (1995): 3
* Aristocats, The (1970): 2
* 101 Dalmatians (1996): 1
* Apocalypse Now (1979): 3
* Liar Liar (1997): 3
* Balto (1995): 1
* Mystery Science Theater 3000: The Movie (1996): 3
* Free Willy (1993): 1
* Mortal Kombat (1995): 2
* Fargo (1996): 5
* Nightmare Before Christmas, The (1993): 2
* Full Metal Jacket (1987): 3
* Austin Powers: International Man of Mystery (1997): 1
* Jungle Book, The (1994): 2

Unweighted, that gives a score of $25 - 26 = -1$, but...

> Fifth Element
>
> 2
>
> Dirty Dancing
>
> 4
>
> WHAT!?

No, we can't have that. This is now a weighted scale where everything is weighted $1$ except:

$$
\text{The Fifth Element}=\infty 
$$

So that gives a $-\infty$ score.
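
A weighted score can be sketched as a sum of signed weights, which also shows why a single infinite weight swamps everything else. The `sign` and `weight` columns are assumptions made for this example, and the first two rows are placeholders rather than entries from the lists above.

```R
# Placeholder rows; sign is +1 for a good rate and -1 for a bad rate.
picks <- data.frame(
    title  = c("Movie A", "Movie B", "Fifth Element, The (1997)"),
    sign   = c(1, 1, -1),
    weight = c(1, 1, Inf),
    stringsAsFactors = FALSE
)

# Weighted score: sum of sign * weight over all judged movies.
weighted.score <- sum(picks$sign * picks$weight)
print(weighted.score)
```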

&nbsp;
#### `user 483`

In [14]:
%%R

u483.data$item.id <- u.item$movie.title[match(u483.data$item.id, u.item$movie.id)]
kable(u483.data, format='rst')



item.id                                                     rating
Jerry Maguire (1996)                                             3
Die Hard (1988)                                                  2
Return of the Jedi (1983)                                        4
Kundun (1997)                                                    3
Like Water For Chocolate (Como agua para chocolate) (1992)       3
Fifth Element, The (1997)                                        3
Austin Powers: International Man of Mystery (1997)               2
Star Trek: The Motion Picture (1979)                             3
Titanic (1997)                                                   2
Starship Troopers (1997)                                         3
Restoration (1995)                                               3
Mission: Impossible (1996)                                       3
English Patient, The (1996)                                      3
Bridge on the River Kwai, The (1957)                        

Good Rates:

* Return of the Jedi (1983): 4
* Toy Story (1995): 4
* Star Wars (1977): 5
* Princess Bride, The (1987): 4

Bad Rates:

* Die Hard (1988): 2
* Nightmare Before Christmas, The (1993): 3
* Fifth Element, The (1997): 3
* Austin Powers: International Man of Mystery (1997): 2
* Star Trek: The Motion Picture (1979): 3
* Apocalypse Now (1979): 2
* Mystery Science Theater 3000: The Movie (1996): 5
* Terminator, The (1984): 3
* Willy Wonka and the Chocolate Factory (1971): 2
* Men in Black (1997): 2

Am I the only one who likes Willy Wonka? Never mind the score, it is not worth counting.

### User Result

Well, no one even got a positive score. User 222 had a lot of good rates, but a lot of outstandingly bad rates too. User 483 was nearly all bad, and user 45 didn't really have anything outstanding at all. In this case being banal is better, so user 45 it is.

&nbsp;
## Find most and least correlated users

The `cor()` function can be used to calculate the correlation between two users, but each individual user's data must still be gathered first.
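
As a rough sketch of what has to happen before `cor()` can be called, the two users' ratings must be restricted to the movies both have rated and lined up movie by movie. One way to do this is `merge()` on `item.id`, shown here as an alternative to the filter-and-sort approach and not as the method this notebook uses. The ratings are made up.

```R
# Made-up ratings for two users.
a <- data.frame(item.id = c(1, 2, 3, 4), rating = c(5, 3, 4, 1))
b <- data.frame(item.id = c(4, 3, 2, 9), rating = c(2, 5, 3, 4))

# Inner join on item.id: keeps only movies rated by both users and
# puts their ratings side by side as rating.a and rating.b.
both <- merge(a, b, by = "item.id", suffixes = c(".a", ".b"))

# Pearson correlation over the shared movies.
r <- cor(both$rating.a, both$rating.b)
print(nrow(both))
print(r)
```

The join handles the alignment itself, so no separate sort is needed.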

In [70]:
%%R

# Build one ratings dataframe per user; list position n holds user n.
# This relies on user ids running from 1 to nrow(u.user) without gaps.
user.list <- list()
for (n in seq_len(nrow(u.user))) {
    user.list[[n]] <- u.data[ u.data$user.id == n, ][c("item.id", "rating")]
}

kable(user.list[[1]], format='rst')



\      item.id  rating
203         61       4
306        189       3
334         33       4
335        160       4
479         20       4
640        202       5
688        171       5
821        265       4
934        155       2
973        117       3
1168        47       4
1300       222       4
1383       253       5
1441       113       5
1618       227       4
1781        17       3
1990        90       4
2329        64       5
3050        92       3
3060       228       5
3172       266       1
3192       121       4
3235       114       5
3247       132       4
3249        74       1
3261       134       4
3359        98       4
3378       186       4
3432       221       5
3711        84       4
3734        31       3
3837        70       3
3889        60       5
3910       177       5
4002        27       2
4071       260       1
4166       145       2
4178       174       5
4233       159       3
4281        82       5
4291        56       4
4307       272       3
4412     

In [104]:
%%R

# Correlate Dataframes
cordf <- function(df.one, df.two) {
    
    # Keep only common movie data between the two
    df.one <- df.one[ df.one$item.id %in% df.two$item.id, ]
    df.two <- df.two[ df.two$item.id %in% df.one$item.id, ]
    
    # Sort by item.id to align the ratings
    df.one <- df.one[order(df.one$item.id), ]
    df.two <- df.two[order(df.two$item.id), ]
    
    cor(df.one$rating, df.two$rating)
}

sub.me <- u.data[ u.data$user.id == 45, ][c("item.id", "rating")]
cordf(sub.me, user.list[[1]])

[1] 0.4953035


In [103]:
%%R

# Get correlation with all other users
cors <- sapply(user.list, cordf, df.one=sub.me)
cors

  [1]  0.495303466  0.778624606  1.000000000  1.000000000  0.599692544
  [6]  0.460409167  0.286131692  1.000000000  0.500000000  0.166666667
 [11]  0.395284708 -1.000000000  0.561142542 -0.204866436  0.094491118
 [16]  0.159406512 -0.083918136  0.706172705           NA  0.293902599
 [21]  0.292174355  0.601510055  0.510674195 -0.049507377  0.071604144
 [26]  0.448411900  0.643796306 -0.456435465           NA -0.522232968
 [31]           NA  0.191485422           NA           NA           NA
 [36]           NA  0.077849894 -0.403962375           NA           NA
 [41] -0.102062073 -0.122403514  0.404688993  0.288800075  1.000000000
 [46]  0.435606842           NA -1.000000000 -0.272938259 -0.868599036
 [51]           NA -0.179161283  0.321633760  0.535573040  0.418330013
 [56]  0.152943823  0.514355440  0.276685786  0.388842837  0.000000000
 [61]           NA  0.261994556  0.585790715  0.526895472  0.436184305
 [66]  0.239775896 -0.102232603  0.554378608 -0.063088030  0.252603591
 [71] 

In [102]:
%%R

# Sort from largest to smallest. sort() drops NA values by default,
# and cors itself is left unchanged so position n still refers to user n.
sort(cors, decreasing=TRUE)

  [1]  1.000000000  1.000000000  1.000000000  1.000000000  1.000000000
  [6]  1.000000000  1.000000000  1.000000000  1.000000000  1.000000000
 [11]  1.000000000  1.000000000  1.000000000  1.000000000  1.000000000
 [16]  1.000000000  1.000000000  1.000000000  0.979957887  0.970725343
 [21]  0.933843014  0.928476691  0.918558654  0.893197737  0.887625365
 [26]  0.875000000  0.875000000  0.875000000  0.875000000  0.870388280
 [31]  0.870388280  0.867721831  0.867527617  0.866025404  0.866025404
 [36]  0.866025404  0.866025404  0.866025404  0.866025404  0.866025404
 [41]  0.852802865  0.852802865  0.848838215  0.818181818  0.818095988
 [46]  0.816666667  0.816496581  0.816496581  0.814618279  0.810092587
 [51]  0.801783726  0.792706853  0.791666667  0.783217803  0.781555173
 [56]  0.778624606  0.777777778  0.773020683  0.765531816  0.764705882
 [61]  0.763762616  0.756235342  0.755928946  0.755928946  0.752071047
 [66]  0.750000000  0.745816036  0.738107466  0.735860164  0.733799386
 [71] 

In [111]:
%%R

# So many perfect correlations are very suspicious.
# How many movies does each user have in common with me?

# Count the movies two users have both rated
movies.incommon <- function(df.one, df.two) {
    
    # Keep only common movie data between the two
    df.one <- df.one[ df.one$item.id %in% df.two$item.id, ]
    lengths(df.one)[1]
}

movies.incommon(sub.me, user.list[[1]])

item.id 
     20 


In [113]:
%%R

incommon <- sapply(user.list, movies.incommon, df.one=sub.me)
incommon

item.id item.id item.id item.id item.id item.id item.id item.id item.id item.id 
     20      14       2       2      13      17      14       4       3       7 
item.id item.id item.id item.id item.id item.id item.id item.id item.id item.id 
     10       5      32      18      25      12       8      18       1      13 
item.id item.id item.id item.id item.id item.id item.id item.id item.id item.id 
     14      11      11      11      11      26       7       5       0       4 
item.id item.id item.id item.id item.id item.id item.id item.id item.id item.id 
      0      11       1       1       0       1       9       8       1       0 
item.id item.id item.id item.id item.id item.id item.id item.id item.id item.id 
      5      20      28      16      48       7       1       2      16       5 
item.id item.id item.id item.id item.id item.id item.id item.id item.id item.id 
      2      16      13      20       7      16      30      16      30      12 
item.id item.id item.id item
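
The counts go some way towards explaining the suspicious perfect correlations: with only two shared movies, the two rating pairs always lie on a straight line, so Pearson's correlation is $1$ or $-1$ (or `NA` when one user gave both movies the same rating). A small check with made-up ratings:

```R
# Two shared movies: any two distinct points lie on a line.
up   <- cor(c(2, 4), c(1, 5))   # both users rate the second movie higher
down <- cor(c(2, 4), c(5, 3))   # the users disagree on the order

# If one user gives both movies the same rating, the standard deviation
# is zero and cor() returns NA (with a warning, silenced here).
flat <- suppressWarnings(cor(c(3, 3), c(1, 5)))

print(c(up, down, flat))
```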

In [129]:
%%R

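# Note that a meaningful correlation needs at least 3 movies in common:
# with only 2 the result is always 1, -1, or NA.
# User 45 (the user standing in for me) still needs to be removed.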
cor.data <- data.frame(cors, incommon)
colnames(cor.data) <- c("correlation", "incommon")
cor.data[order(cor.data[,1], decreasing=TRUE),]

     correlation incommon
8    1.000000000        4
45   1.000000000       48
420  1.000000000        3
482  1.000000000        4
683  1.000000000        3
728  1.000000000        6
928  1.000000000        3
3    1.000000000        2
4    1.000000000        2
154  1.000000000        2
461  1.000000000        2
516  1.000000000        2
556  1.000000000        2
558  1.000000000        2
574  1.000000000        2
607  1.000000000        2
753  1.000000000        2
876  1.000000000        2
743  0.979957887        5
739  0.970725343        3
210  0.933843014       12
871  0.928476691        6
252  0.918558654        5
409  0.893197737        6
71   0.887625365        7
781  0.875000000        5
480  0.875000000        5
610  0.875000000        6
710  0.875000000        6
197  0.870388280        4
846  0.870388280        4
563  0.867721831        6
473  0.867527617        5
142  0.866025404        3
218  0.866025404        3
645  0.866025404        3
762  0.866025404        3
785  0.86602

In [137]:
%%R

# Drop row 45 (the reference user) with a negative index
cor.data <- cor.data[-45,]
# Remove all rows with NA values
cor.data <- na.omit(cor.data)
cor.data[order(cor.data[,1], decreasing=TRUE),]

     correlation incommon
8    1.000000000        4
420  1.000000000        3
482  1.000000000        4
683  1.000000000        3
728  1.000000000        6
928  1.000000000        3
3    1.000000000        2
4    1.000000000        2
154  1.000000000        2
461  1.000000000        2
516  1.000000000        2
556  1.000000000        2
558  1.000000000        2
574  1.000000000        2
607  1.000000000        2
753  1.000000000        2
876  1.000000000        2
743  0.979957887        5
739  0.970725343        3
210  0.933843014       12
871  0.928476691        6
252  0.918558654        5
409  0.893197737        6
71   0.887625365        7
781  0.875000000        5
480  0.875000000        5
610  0.875000000        6
710  0.875000000        6
197  0.870388280        4
846  0.870388280        4
563  0.867721831        6
473  0.867527617        5
142  0.866025404        3
218  0.866025404        3
645  0.866025404        3
762  0.866025404        3
785  0.866025404        3
791  0.86602

The five most correlated users to me, each with a correlation of 1 and at least 3 movies in common, are users:

1. 728
2. 8
3. 482
4. 420
5. 683

The five least correlated users to me are:

1. 12
2. 124
3. 778
4. 204
5. 127
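
The same kind of list can be pulled out programmatically. The sketch below assumes a dataframe shaped like `cor.data` (columns `correlation` and `incommon`, row names identifying the users) and a minimum overlap of 3 movies, the threshold noted in the code comment above. The values and the row labels are made up, and breaking ties by the number of movies in common is a choice made for this example.

```R
# Made-up stand-in for cor.data; row names play the role of user ids.
toy <- data.frame(
    correlation = c(1.0, 1.0, 0.9, -1.0, -0.4, 1.0, NA),
    incommon    = c(2,   6,   12,  3,    20,   4,   0),
    row.names   = c("A", "B", "C", "D", "E", "F", "G")
)

min.common <- 3   # assumed threshold

# Drop NAs and users with too few shared movies.
keep <- toy[ !is.na(toy$correlation) & toy$incommon >= min.common, ]

# Most similar first; ties broken by the larger overlap.
ranked <- keep[ order(-keep$correlation, -keep$incommon), ]

most.similar  <- head(rownames(ranked), 2)
least.similar <- rev(tail(rownames(ranked), 2))
print(most.similar)
print(least.similar)
```

Filtering on `incommon` before ranking keeps the two-movie "perfect" correlations out of the result.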

## Get top and bottom 5 recommendations