# Predicting whether my husband will like a movie using a logistic classifier

In [1]:
import graphlab
import itertools
graphlab.canvas.set_target('ipynb')

# Data preparation

We will use a movies dataset downloaded from Kaggle.

In [2]:
d = "/Users/Gretel_MacAir/Documents/NewGitHub/Movie/movie_metadata_labeled.csv"
movies = graphlab.SFrame.read_csv(d)


[INFO] graphlab.cython.cy_server: GraphLab Create v2.1 started. Logging: /tmp/graphlab_server_1478963469.log


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,int,int,int,int,str,int,int,str,str,str,int,int,str,int,str,str,int,str,str,str,int,int,int,float,float,int,int,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


## Explore the data
Let us explore a record in the movie dataset.

In [3]:
movies.show()

In [4]:
movies[0]

{'Ray Scr': 1,
 'X30': '',
 'actor_1_facebook_likes': 1000,
 'actor_1_name': 'CCH Pounder',
 'actor_2_facebook_likes': 936,
 'actor_2_name': 'Joel David Moore',
 'actor_3_facebook_likes': 855,
 'actor_3_name': 'Wes Studi',
 'aspect_ratio': 1.78,
 'budget': 237000000,
 'cast_total_facebook_likes': 4834,
 'color': 'Color',
 'content_rating': 'PG-13',
 'country': 'USA',
 'director_facebook_likes': 0,
 'director_name': 'James Cameron',
 'duration': 178,
 'facenumber_in_poster': 0,
 'genres': 'Action|Adventure|Fantasy|Sci-Fi',
 'gross': 760505847,
 'imdb_score': 7.9,
 'language': 'English',
 'movie_facebook_likes': 33000,
 'movie_imdb_link': 'http://www.imdb.com/title/tt0499549/?ref_=fn_tt_tt_1',
 'movie_title': 'Avatar\xe5\xca',
 'num_critic_for_reviews': 723,
 'num_user_for_reviews': 3054,
 'num_voted_users': 886204,
 'plot_keywords': 'avatar|future|marine|native|paraplegic',
 'title_year': 2009}

In [5]:
movies['country'].show()

In [6]:
movies[movies['country'] == 'New Zealand'][['movie_title','actor_1_name','Ray Scr']]

movie_title,actor_1_name,Ray Scr
The Hobbit: The Battle of the Five Armies�� ...,Aidan Turner,2
King Kong��,Naomi Watts,1
The Lord of the Rings: The Fellowship of the ...,Christopher Lee,3
The Warrior's Way��,Tony Cox,1
The World's Fastest Indian�� ...,Anthony Hopkins,3
King Kong��,Naomi Watts,1
The Piano��,Holly Hunter,2
Tracker��,Ray Winstone,1
Heavenly Creatures��,Kate Winslet,2
Out of the Blue��,William Kircher,2


In [7]:
movies['Ray Scr'].show(view='Categorical')

In [8]:
movies['num_user_for_reviews'].show()

In [9]:
movies['imdb_score'].show()

To keep things simple, let's focus on a subset of the fields.

In [10]:
movies = movies[ 'movie_title',
                 'genres',
                 'plot_keywords',
                 'director_name',
                 'actor_1_name',
                 'actor_2_name',
                 'actor_3_name',
                 'title_year',
                 'country',
                 'language',
                 'imdb_score',
                 'num_user_for_reviews',
                 'Ray Scr']

## Clean up the data
Now, we will perform some simple data transformations:

1. Remove duplicate records
2. Remove records with empty features
3. Remove some rubbish characters from the movie title
4. Replace the | separator with a space in 'genres' and 'plot_keywords', then perform a word count on each.

In [11]:
def remove_rubbish(field):
    # Strip the stray '\xe5\xca' bytes that trail each movie title
    return field.replace('\xe5\xca', '')

In [12]:
def replace_bar(field):
    # Turn the '|' separators into spaces so the field can be word-counted
    return field.replace('|', ' ')

In [13]:
movies = movies.unique()

In [14]:
movies = movies.dropna()

In [15]:
movies['movie_title'] = movies['movie_title'].apply(remove_rubbish)
movies['genres'] = movies['genres'].apply(replace_bar)
movies['plot_keywords'] = movies['plot_keywords'].apply(replace_bar)

In [16]:
movies['genres_count'] = graphlab.text_analytics.count_words(movies['genres'])
movies['plot_keywords_count'] = graphlab.text_analytics.count_words(movies['plot_keywords'])

In [17]:
movies[['movie_title','genres_count','plot_keywords_count']][0:5]

movie_title,genres_count,plot_keywords_count
Our Idiot Brother,"{'drama': 1, 'comedy': 1}","{'art': 1, 'nude': 1, 'three': 1, 'jail': 1, ..."
D.E.B.S.,"{'romance': 1, 'action': 1, 'comedy': 1} ...","{'force': 1, 'voyeurism': 1, 'cadillac': 1, ..."
Evolution,"{'mystery': 1, 'drama': 1, 'horror': 1, 'sci- ...","{'boy': 1, 'birth': 1, 'giving': 1, 'sea': 1, ..."
The Gift,"{'mystery': 1, 'thriller': 1} ...","{'a': 1, 'substance': 1, 'from': 1, 'fired': 1, ..."
Mission: Impossible - Rogue Nation ...,"{'action': 1, 'adventure': 1, ...","{'hacker': 1, 'spy': 1, 'computer': 1, 'captu ..."
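
To see what the bar replacement and word count do to a single record, here is a minimal plain-Python sketch. It does not use GraphLab: the `count_words` helper below is a hypothetical stand-in built on `collections.Counter`, assuming lower-casing and splitting on whitespace, so it only approximates `graphlab.text_analytics.count_words`.

```python
from collections import Counter

def replace_bar(field):
    """Turn 'Action|Adventure' into 'Action Adventure'."""
    return field.replace('|', ' ')

def count_words(text):
    """Stand-in word count: lower-case, split on whitespace, count tokens."""
    return dict(Counter(text.lower().split()))

# Genres string taken from the first record shown earlier
genres = 'Action|Adventure|Fantasy|Sci-Fi'
genres_count = count_words(replace_bar(genres))
print(genres_count)
```

Because the split is on whitespace, a multi-word plot keyword ends up as several separate tokens, which would explain words such as 'a' and 'from' in the plot keyword counts above.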


In [18]:
movies.show()

In [19]:
movies.column_names

<bound method SFrame.column_names of Columns:
	Ray Scr	int
	actor_1_name	str
	actor_2_name	str
	actor_3_name	str
	country	str
	director_name	str
	genres	str
	imdb_score	float
	language	str
	movie_title	str
	num_user_for_reviews	int
	plot_keywords	str
	title_year	int
	genres_count	dict
	plot_keywords_count	dict

Rows: 4816

Data:
+---------+------------------+-----------------------+------------------------+
| Ray Scr |   actor_1_name   |      actor_2_name     |      actor_3_name      |
+---------+------------------+-----------------------+------------------------+
|    2    | Zooey Deschanel  |       Adam Scott      |      Steve Coogan      |
|    1    | Jordana Brewster |      Geoff Stults     |     Scoot McNairy      |
|    0    |  Nissim Renard   |      Roxane Duran     | Julie-Marie Parmentier |
|    0    |  Busy Philipps   |     Allison Tolman    |     Wendell Pierce     |
|    1    |    Tom Cruise    |     Jeremy Renner     |      Sean Harris       |
|    3    |   Bruce Willis   

# Feature engineering

We are going to engineer some features. For example, the exact year is not so important and can lead to overfitting. The decade, on the other hand, can be an important feature.

In [20]:
movies['title_year'] = movies['title_year'].astype(str)
movies['decade'] = movies['title_year'].apply(lambda x: x[0:-1] + '0')

In [21]:
movies['decade'] 

dtype: str
Rows: 4816
['2010', '2000', '2010', '2010', '2010', '2010', '2010', '2010', '2000', '2010', '2000', '2000', '2010', '1980', '2010', '2010', '2000', '2000', '2010', '1980', '2000', '2000', '2010', '2000', '2010', '2010', '2000', '2010', '2010', '1990', '2010', '2010', '2010', '2010', '2000', '1960', '1950', '2010', '1980', '2000', '2000', '1980', '2000', '1980', '1970', '2000', '2000', '2010', '2010', '2000', '2000', '2000', '1990', '2000', '2010', '2010', '2000', '2010', '2010', '1990', '2000', '2010', '2010', '2010', '2010', '2010', '1940', '2000', '2000', '2000', '2010', '2000', '2010', '2010', '2010', '2010', '2010', '2000', '2010', '2000', '1990', '2010', '1990', '2000', '2000', '2000', '1980', '2010', '2000', '1960', '2010', '2000', '1990', '2010', '2010', '2010', '2000', '2010', '2010', '2010', ... ]
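
The decade cell works by slicing off the last character of the year string and appending '0'. As a sketch, the same mapping can be written with integer arithmetic; the `to_decade` helper is a hypothetical name used only for this illustration.

```python
def to_decade(year):
    """Map a year (int or numeric string) to its decade, as a string."""
    return str(int(year) // 10 * 10)

# Both forms give the same result for four-digit years
for year in ['2009', '1998', '1960']:
    assert to_decade(year) == year[0:-1] + '0'

print(to_decade(2009))
```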

As far as the language feature is concerned, we are going to simplify it so that it contains only two categories: English and Other.

In [22]:
movies['language'].show()

In [23]:
movies['language_cat'] = movies['language'].apply(lambda x: 'English' if x == 'English' else 'Other')

In [24]:
movies['language_cat'].show()

## Rescale the data
For the numeric features, we divide each column by its sum, so that both columns sum to 1 and share the same mean. Note that their variances still differ after this step.

In [25]:
print movies['imdb_score'].mean()
print movies['imdb_score'].var()
print movies['imdb_score'].sum()

6.41299833887
1.24543735548
30885.0


In [26]:
movies['imdb_score'] = movies['imdb_score'] / movies['imdb_score'].sum()

In [27]:
print movies['imdb_score'].mean()
print movies['imdb_score'].var()
print movies['imdb_score'].sum()

0.000207641196013
1.30564970935e-09
1.0


In [28]:
print movies['num_user_for_reviews'].mean()
print movies['num_user_for_reviews'].var()
print movies['num_user_for_reviews'].sum()

271.155523256
140671.861818
1305885


In [29]:
movies['num_user_for_reviews'] = movies['num_user_for_reviews'] / movies['num_user_for_reviews'].sum()

In [30]:
print movies['num_user_for_reviews'].mean()
print movies['num_user_for_reviews'].var()
print movies['num_user_for_reviews'].sum()

0.000207641196013
8.24892526004e-08
1.0
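
Dividing a column by its sum makes the column sum to 1, so its mean becomes 1/n. With 4816 rows that is about 0.000207641, which matches the means printed for both columns. A minimal plain-Python sketch of this, using made-up numbers rather than the real data:

```python
def normalise_by_sum(values):
    """Divide every value by the column total, so the result sums to 1."""
    total = float(sum(values))
    return [v / total for v in values]

def mean(values):
    return sum(values) / float(len(values))

def variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / float(len(values))

# Made-up values, for illustration only
scores = [7.9, 6.1, 5.4, 8.2]
reviews = [3054.0, 120.0, 45.0, 900.0]

scaled_scores = normalise_by_sum(scores)
scaled_reviews = normalise_by_sum(reviews)

# Both means equal 1/n, but the spreads remain different
print(mean(scaled_scores))
print(mean(scaled_reviews))
print(variance(scaled_scores) < variance(scaled_reviews))

# Mean expected for the notebook's 4816 rows
expected_mean = 1.0 / 4816
```

The sketch also shows why the variances in the notebook output differ: sum-normalisation aligns the means but not the spread of the two features.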


## Extract sentiments

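We set aside all movies with Ray Scr = 0, since their sentiment is not known. For the rest, a score of 1 becomes a negative sentiment (-1) and a score above 1 becomes a positive sentiment (+1).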

In [31]:
unkown_movies = movies[movies['Ray Scr'] == 0]
known_movies = movies[movies['Ray Scr'] != 0]

In [32]:
known_movies['sentiment'] = known_movies['Ray Scr'].apply(lambda rating : +1 if rating > 1 else -1)

In [33]:
print len(known_movies[known_movies['sentiment'] == 1])
print len(known_movies[known_movies['sentiment'] == -1])

1121
3339


In [34]:
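# Number of positive examples (despite the name); used below to cap the negatives at the same count
neg_len = len(known_movies[known_movies['sentiment'] == 1])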

In [35]:
extra = known_movies[known_movies['Ray Scr'] == 3]
extra_len = len(extra)

In [36]:
pos = known_movies[known_movies['Ray Scr'] >= 2]
neg = known_movies[known_movies['Ray Scr'] == 1][0:neg_len]

In [37]:
posneg = pos.append(neg)

In [38]:
len(posneg)

2242
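
The cells above balance the classes by keeping every positive example and only as many negative examples as there are positives. A plain-Python sketch of that arithmetic, using the counts printed earlier (1121 positive, 3339 negative); `balance_by_truncation` is a hypothetical helper name:

```python
def balance_by_truncation(positives, negatives):
    """Keep all positives and only the first len(positives) negatives."""
    return positives + negatives[:len(positives)]

# Placeholder rows; only the counts come from the notebook output
positives = ['pos'] * 1121
negatives = ['neg'] * 3339

balanced = balance_by_truncation(positives, negatives)
print(len(balanced))
```

Note that this keeps the first negatives in row order rather than a random sample, so the result depends on how the rows happen to be ordered.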

# Split data into training and test sets

In [39]:
train_data, test_data = posneg.random_split(.8, seed=1)

In [40]:
#train_data = train_data.append(extra)

In [41]:
print len(test_data)
print len(train_data)

455
1787


# Create all possible feature combinations

In [42]:
features = [ 'genres_count',
             'plot_keywords_count',
             'director_name',
             'actor_1_name', 
             'actor_2_name',
             'actor_3_name',
             'decade',
             'language_cat',
             'imdb_score',
             'num_user_for_reviews']

In [43]:
features_list = []
for f in range(1, len(features)+1):
    for subset in itertools.combinations(features, f):
        features_list.append(list(subset))

In [44]:
len(features_list)

1023
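
The count of 1023 is the number of non-empty subsets of 10 features, that is 2^10 - 1. A small sketch that wraps the same loop in a function (the name `all_nonempty_subsets` is only for illustration) and checks it on a toy list:

```python
import itertools

def all_nonempty_subsets(items):
    """Return every non-empty combination of items, smallest subsets first."""
    subsets = []
    for size in range(1, len(items) + 1):
        for subset in itertools.combinations(items, size):
            subsets.append(list(subset))
    return subsets

toy_subsets = all_nonempty_subsets(['a', 'b', 'c'])
print(toy_subsets)

# Ten items give 2**10 - 1 subsets
ten_item_count = len(all_nonempty_subsets(list(range(10))))
print(ten_item_count)
```

The number of combinations doubles with every extra feature, so this exhaustive search is only practical for a short feature list.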

# Train using each of the feature combos and capture the accuracy of each model

In [45]:
def get_classification_accuracy(model, data, true_labels):
    # Get the raw margins (signed distance from the decision boundary)
    margins = model.predict(data, output_type='margin')
    # A non-negative margin means the positive class (+1), otherwise -1
    scores = margins.apply(lambda margin: +1 if margin >= 0 else -1)
    # Count the correctly classified examples
    no_of_correct = len([True for x, y in zip(scores, true_labels) if x == y])
    # Accuracy is the fraction of examples classified correctly
    total_no = len(data)
    accuracy = float(no_of_correct) / total_no
    return accuracy
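
The accuracy function thresholds the margin at 0. For a logistic model the probability of the positive class is the sigmoid of the margin, so a margin threshold of 0 is the same as a probability threshold of 0.5. A minimal sketch with made-up margins (the helper names are assumptions for this illustration):

```python
import math

def sigmoid(margin):
    """Logistic function: maps a margin to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-margin))

def label_from_margin(margin):
    return +1 if margin >= 0 else -1

def label_from_probability(probability):
    return +1 if probability >= 0.5 else -1

# Made-up margins, for illustration only
margins = [-3.2, -0.1, 0.0, 0.4, 5.0]
labels_by_margin = [label_from_margin(m) for m in margins]
labels_by_probability = [label_from_probability(sigmoid(m)) for m in margins]

print(labels_by_margin)
print(labels_by_probability)
```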

In [46]:
accuracy = []
for f in features_list:
    sentiment_model = graphlab.logistic_classifier.create(train_data,
                                                          target = 'sentiment',
                                                          features=f,
                                                          validation_set=None,
                                                          verbose=False)
    accuracy.append(get_classification_accuracy(sentiment_model, test_data, test_data['sentiment']))

# Identify the feature combo with the highest accuracy and train the model again

In [47]:
feature_accuracy = {}
for a, f in zip(accuracy, features_list):
    # Keep the first feature combo seen for each accuracy value
    feature_accuracy.setdefault(a, f)

In [48]:
best = max(feature_accuracy.keys())
print best

0.771428571429
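
When several feature combinations tie on accuracy, the lookup above keeps the first one in the list. As a sketch on toy values (not the real results), the same selection can be done in one step with `max` over the indices, which also returns the first of any tied maxima:

```python
# Toy values, for illustration only
toy_accuracy = [0.70, 0.77, 0.65, 0.77]
toy_features_list = [['decade'],
                     ['genres_count', 'decade'],
                     ['imdb_score'],
                     ['decade', 'imdb_score']]

# max() returns the first index that reaches the highest accuracy
best_index = max(range(len(toy_accuracy)), key=lambda i: toy_accuracy[i])
best_accuracy = toy_accuracy[best_index]
best_combo = toy_features_list[best_index]

print(best_accuracy)
print(best_combo)
```

Since the combinations are generated smallest first, this tie-breaking favours the combination with fewer features.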


In [49]:
best_features = feature_accuracy[best]
print best_features
sentiment_model = graphlab.logistic_classifier.create(train_data,
                                                      target = 'sentiment',
                                                      features=best_features,
                                                      validation_set=None)

['genres_count', 'decade', 'language_cat', 'imdb_score']


In [50]:
sentiment_model

Class                          : LogisticClassifier

Schema
------
Number of coefficients         : 35
Number of examples             : 1787
Number of classes              : 2
Number of feature columns      : 4
Number of unpacked features    : 26

Hyperparameters
---------------
L1 penalty                     : 0.0
L2 penalty                     : 0.01

Training Summary
----------------
Solver                         : newton
Solver iterations              : 8
Solver status                  : SUCCESS: Optimal solution found.
Training time (sec)            : 0.0831

Settings
--------
Log-likelihood                 : 893.085

Highest Positive Coefficients
-----------------------------
imdb_score                     : 39662.6449
genres_count[film-noir]        : 10.1442
decade[1990]                   : 1.374
genres_count[fantasy]          : 0.9399
genres_count[sci-fi]           : 0.8128

Lowest Negative Coefficients
----------------------------
(intercept)                    : -9.0628
genr

# Examine the accuracy, precision and recall

In [51]:
get_classification_accuracy(sentiment_model, train_data, train_data['sentiment'])

0.7548964745383324

In [52]:
get_classification_accuracy(sentiment_model, test_data, test_data['sentiment'])

0.7714285714285715

In [53]:
sentiment_model.evaluate(test_data,metric='roc_curve')

{'roc_curve': Columns:
 	threshold	float
 	fpr	float
 	tpr	float
 	p	int
 	n	int
 
 Rows: 100001
 
 Data:
 +-----------+-----+-----+-----+-----+
 | threshold | fpr | tpr |  p  |  n  |
 +-----------+-----+-----+-----+-----+
 |    0.0    | 1.0 | 1.0 | 232 | 223 |
 |   1e-05   | 1.0 | 1.0 | 232 | 223 |
 |   2e-05   | 1.0 | 1.0 | 232 | 223 |
 |   3e-05   | 1.0 | 1.0 | 232 | 223 |
 |   4e-05   | 1.0 | 1.0 | 232 | 223 |
 |   5e-05   | 1.0 | 1.0 | 232 | 223 |
 |   6e-05   | 1.0 | 1.0 | 232 | 223 |
 |   7e-05   | 1.0 | 1.0 | 232 | 223 |
 |   8e-05   | 1.0 | 1.0 | 232 | 223 |
 |   9e-05   | 1.0 | 1.0 | 232 | 223 |
 +-----------+-----+-----+-----+-----+
 [100001 rows x 5 columns]
 Note: Only the head of the SFrame is printed.
 You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.}
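
The ROC table reports the true positive rate (tpr, which is the same as recall) and the false positive rate (fpr) at each threshold. At threshold 0 every movie is predicted positive, which is why both rates are 1.0 in the first rows. The sketch below shows how these rates, plus precision, follow from the confusion counts; it uses made-up labels and hypothetical helper names, not GraphLab's own evaluation code.

```python
def confusion_counts(true_labels, predicted_labels):
    """Count true/false positives and negatives for labels in {+1, -1}."""
    tp = fp = tn = fn = 0
    for truth, guess in zip(true_labels, predicted_labels):
        if guess == 1 and truth == 1:
            tp += 1
        elif guess == 1 and truth == -1:
            fp += 1
        elif guess == -1 and truth == -1:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def rates(true_labels, predicted_labels):
    """Return precision, recall (tpr) and false positive rate (fpr)."""
    tp, fp, tn, fn = confusion_counts(true_labels, predicted_labels)
    precision = tp / float(tp + fp) if tp + fp else 0.0
    recall = tp / float(tp + fn) if tp + fn else 0.0
    fpr = fp / float(fp + tn) if fp + tn else 0.0
    return precision, recall, fpr

# Made-up labels, for illustration only
toy_truth = [1, 1, 1, -1, -1, -1]
toy_pred = [1, 1, -1, 1, -1, -1]
precision, recall, fpr = rates(toy_truth, toy_pred)

# Predicting everything as positive gives tpr = fpr = 1.0
all_positive = rates(toy_truth, [1] * len(toy_truth))

print(precision)
print(recall)
print(fpr)
```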

In [54]:
sentiment_model.show(view = 'Evaluation')

In [55]:
weights = sentiment_model.coefficients

In [56]:
num_positive_weights = len(weights[weights['value']>=0])
num_negative_weights = len(weights[weights['value']<0])

print "Number of positive weights: %s " % num_positive_weights
print "Number of negative weights: %s " % num_negative_weights

Number of positive weights: 17 
Number of negative weights: 18 


In [57]:
test_data['predictions'] = sentiment_model.predict(test_data, output_type='probability')

In [58]:
test_top_20 = test_data.sort('predictions', ascending=False)
test_top_20[['movie_title','predictions', 'Ray Scr']].print_rows(num_rows=20)

+-------------------------------+----------------+---------+
|          movie_title          |  predictions   | Ray Scr |
+-------------------------------+----------------+---------+
|            Rebecca            | 0.999932882943 |    1    |
| Star Wars: Episode IV - A ... | 0.989994644398 |    3    |
|         The Green Mile        | 0.988138830466 |    2    |
|         Jurassic Park         | 0.987808120057 |    2    |
|           The Matrix          | 0.982158241753 |    3    |
|         Reservoir Dogs        | 0.975229695874 |    3    |
|        The Truman Show        | 0.968189895436 |    2    |
|        Schindler's List       | 0.962242259792 |    2    |
|          The Fugitive         | 0.959460478985 |    2    |
| Star Wars: Episode III - R... | 0.956994542783 |    3    |
| Lock, Stock and Two Smokin... | 0.955094312158 |    3    |
|            The Game           | 0.945664790486 |    1    |
|         Donnie Brasco         | 0.945189661079 |    2    |
|       The Princess Bri

In [59]:
test_bottom_20 = test_data.sort('predictions', ascending=True)
test_bottom_20[['movie_title','predictions', 'Ray Scr']].print_rows(num_rows=20)

+--------------------------------+-------------------+---------+
|          movie_title           |    predictions    | Ray Scr |
+--------------------------------+-------------------+---------+
|    Capitalism: A Love Story    |  0.00048492223272 |    2    |
| The True Story of Puss'N Boots | 0.000651097894305 |    1    |
|           Snow Queen           | 0.000767465348712 |    1    |
|             Doogal             |  0.00118843879081 |    1    |
|        The Real Cancun         |  0.00222359620119 |    1    |
|          Khiladi 786           |  0.00254759220857 |    1    |
|           Date Movie           |  0.00482878798295 |    1    |
|            Marci X             |  0.00624245670872 |    1    |
| The Adventures of Rocky & ...  |  0.00627766178648 |    2    |
|        Son of the Mask         |  0.0070614646143  |    1    |
|             Fugly              |  0.00808903445354 |    1    |
|       Jaws: The Revenge        |  0.0101458237836  |    1    |
|        Mars Needs Moms 

# So which of the unknown movies shall I recommend to my hubby?

In [60]:
unkown_movies['predictions'] = sentiment_model.predict(unkown_movies, output_type='probability')

In [65]:
unknown_top_20 = unkown_movies.sort('predictions', ascending=False)
unknown_top_20[['movie_title','genres_count', 'title_year']].print_rows(num_rows=20)

+-----------------------------+-------------------------------+------------+
|         movie_title         |          genres_count         | title_year |
+-----------------------------+-------------------------------+------------+
|          Dark City          | {'mystery': 1, 'sci-fi': 1... |    1998    |
|             Cube            | {'mystery': 1, 'sci-fi': 1... |    1997    |
|        The Wild Bunch       | {'action': 1, 'western': 1... |    1969    |
|       The Love Letter       |  {'romance': 1, 'fantasy': 1} |    1998    |
|        Boys Don't Cry       | {'romance': 1, 'drama': 1,... |    1999    |
|      The Man from Earth     | {'romance': 1, 'drama': 1,... |    2007    |
|     The Sweet Hereafter     |          {'drama': 1}         |    1997    |
|        The Rainmaker        | {'drama': 1, 'thriller': 1... |    1997    |
|        The Apartment        | {'romance': 1, 'drama': 1,... |    1960    |
|     The Call of Cthulhu     | {'mystery': 1, 'fantasy': ... |    2005    |