# MovieLens - Predicting a user's gender based on the movies they have watched

...

Summary:

- Prediction type: __Classification model__
- Domain: __Entertainment__
- Prediction target: __The gender of a user__ 
- Population size: __6039__

_Author: Dr. Patrick Urbanke_

# Background

...

It has been downloaded from the [CTU Prague relational learning repository](https://relational.fit.cvut.cz/dataset/MovieLens) (Motl and Schulte, 2015).

### A web frontend for getML

The getML monitor is a frontend built to support your work with getML. It displays information such as the imported data frames and trained pipelines, and allows easy data and feature exploration. You can launch the getML monitor [here](http://localhost:1709).

### Where is this running?

Your getML live session is running inside a docker container on [mybinder.org](https://mybinder.org/), a service built by the Jupyter community and funded by Google Cloud, OVH, GESIS Notebooks and the Turing Institute. As it is a free service, this session will shut down after 10 minutes of inactivity.

# Analysis

Let's get started with the analysis and set up your session:

In [1]:
import copy
import os
from urllib import request

import numpy as np
import pandas as pd
from IPython.display import Image
import matplotlib.pyplot as plt
plt.style.use('seaborn')
%matplotlib inline  

from sklearn.feature_extraction.text import CountVectorizer

import getml

getml.engine.set_project('MovieLens')


Connected to project 'MovieLens'


Tuning is effective at improving our results, but it takes quite a long time, so we make it optional:

In [2]:
USE_FINE_TUNED = False

## 1. Loading data

### 1.1 Download from source

We begin by connecting to the source database, from which the data will be downloaded:

In [3]:
conn = getml.database.connect_mariadb(
    host="relational.fit.cvut.cz",
    dbname="imdb_MovieLens",
    port=3306,
    user="guest",
    password="relational"
)

conn

Connection(conn_id='default', dbname='imdb_MovieLens', dialect='mysql', 
           host='relational.fit.cvut.cz', port=3306)

In [4]:
def load_if_needed(name):
    """
    Loads the data from the relational learning
    repository, if the data frame has not already
    been loaded.
    """
    if not getml.data.exists(name):
        data_frame = getml.data.DataFrame.from_db(
            name=name,
            table_name=name,
            conn=conn
        )
        data_frame.save()
    else:
        data_frame = getml.data.load_data_frame(name)
    return data_frame

In [5]:
users = load_if_needed("users")
u2base = load_if_needed("u2base")
movies = load_if_needed("movies")
movies2directors = load_if_needed("movies2directors")
directors = load_if_needed("directors")
movies2actors = load_if_needed("movies2actors")
actors = load_if_needed("actors")

### 1.2 Prepare data for getML

getML requires that we define *roles* for each of the columns.

In [6]:
users["target"] = (users.u_gender == 'F')

In [7]:
users.set_role("userid", getml.data.roles.join_key)
users.set_role("age", getml.data.roles.numerical)
users.set_role("occupation", getml.data.roles.categorical)
users.set_role("target", getml.data.roles.target)

users.save()

Name,userid,target,occupation,age,u_gender
Role,join_key,target,categorical,numerical,unused_string
0.0,1,1,2,1,F
1.0,51,1,2,1,F
2.0,75,1,2,1,F
3.0,86,1,2,1,F
4.0,99,1,2,1,F
,...,...,...,...,...
6034.0,5658,0,5,56,M
6035.0,5669,0,5,56,M
6036.0,5703,0,5,56,M
6037.0,5948,0,5,56,M


In [8]:
u2base.set_role(["userid", "movieid"], getml.data.roles.join_key)
u2base.set_role("rating", getml.data.roles.numerical)

u2base.save()

Name,userid,movieid,rating
Role,join_key,join_key,numerical
0.0,2,1964242,1
1.0,2,2219779,1
2.0,3,1856939,1
3.0,4,2273044,1
4.0,5,1681655,1
,...,...,...
996154.0,6040,2560616,5
996155.0,6040,2564194,5
996156.0,6040,2581228,5
996157.0,6040,2581428,5


In [9]:
movies.set_role("movieid", getml.data.roles.join_key)
movies.set_role(["year", "runningtime"], getml.data.roles.numerical)
movies.set_role(["isEnglish", "country"], getml.data.roles.categorical)

movies.save()

Name,movieid,isEnglish,country,year,runningtime
Role,join_key,categorical,categorical,numerical,numerical
0.0,1672052,T,other,3,2
1.0,1672111,T,other,4,2
2.0,1672580,T,USA,4,3
3.0,1672716,T,USA,4,2
4.0,1672946,T,USA,4,0
,...,...,...,...,...
3827.0,2591814,T,other,4,2
3828.0,2592334,T,USA,4,2
3829.0,2592963,F,France,2,2
3830.0,2593112,T,USA,4,1


In [10]:
movies2directors.set_role(["movieid", "directorid"], getml.data.roles.join_key)
movies2directors.set_role( "genre", getml.data.roles.categorical)

movies2directors.save()

Name,movieid,directorid,genre
Role,join_key,join_key,categorical
0.0,1672111,54934,Action
1.0,1672946,188940,Action
2.0,1679461,179783,Action
3.0,1691387,291700,Action
4.0,1693305,14663,Action
,...,...,...
4136.0,2570825,265215,Other
4137.0,2572478,149311,Other
4138.0,2577062,304827,Other
4139.0,2590181,270707,Other


In [11]:
directors.set_role("directorid", getml.data.roles.join_key)
directors.set_role(["d_quality", "avg_revenue"], getml.data.roles.numerical)

directors.save()

Name,directorid,d_quality,avg_revenue
Role,join_key,numerical,numerical
0.0,67,4,1
1.0,92,2,3
2.0,284,4,0
3.0,708,4,1
4.0,746,4,4
,...,...,...
2196.0,305962,4,4
2197.0,305978,4,2
2198.0,306168,3,2
2199.0,306343,4,1


In [12]:
movies2actors.set_role(["movieid", "actorid"], getml.data.roles.join_key)
movies2actors.set_role( "cast_num", getml.data.roles.numerical)

movies2actors.save()

Name,movieid,actorid,cast_num
Role,join_key,join_key,numerical
0.0,1672580,981535,0
1.0,1672946,1094968,0
2.0,1673647,149985,0
3.0,1673647,261595,0
4.0,1673647,781357,0
,...,...,...
138344.0,2593313,947005,3
138345.0,2593313,1090590,3
138346.0,2593313,1347419,3
138347.0,2593313,2099917,3


In [13]:
actors.set_role("actorid", getml.data.roles.join_key)
actors.set_role("a_quality", getml.data.roles.numerical)
actors.set_role("a_gender", getml.data.roles.categorical)

actors.save()

Name,actorid,a_gender,a_quality
Role,join_key,categorical,numerical
0.0,4,M,4
1.0,16,M,0
2.0,28,M,4
3.0,566,M,4
4.0,580,M,4
,...,...,...
98685.0,2749162,F,3
98686.0,2749168,F,3
98687.0,2749204,F,3
98688.0,2749377,F,4


We need to separate our data set into a training set and a testing set:

In [14]:
random = users.random()

is_training = (random < 0.75)
is_test = ~is_training

data_train = users.where("data_train", is_training)
data_test = users.where("data_test", is_test)
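
The split above draws one random number per row and sends rows below 0.75 to the training set. The following is a minimal standard-library sketch of that idea, not the getML implementation; `split_rows`, `train_fraction` and `seed` are hypothetical names introduced for illustration.

```python
import random


def split_rows(row_ids, train_fraction=0.75, seed=0):
    """Assign each row to train or test by drawing one uniform number per row."""
    rng = random.Random(seed)
    train, test = [], []
    for row_id in row_ids:
        # Rows whose draw falls below the threshold go to the training set.
        if rng.random() < train_fraction:
            train.append(row_id)
        else:
            test.append(row_id)
    return train, test


train_ids, test_ids = split_rows(range(6039))
print(len(train_ids), len(test_ids))
```

Because every row is assigned independently, the resulting shares are only approximately 75% and 25%, and the two sets never overlap.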

## 2. Predictive modelling

We loaded the data and defined the roles. Next, we create a getML pipeline for relational learning.

### 2.1 Define relational model

To get started with relational learning, we need to specify the data model.

In [15]:
users_ph = getml.data.Placeholder('users')
u2base_ph = getml.data.Placeholder('u2base')
movies_ph = getml.data.Placeholder('movies')
movies2directors_ph = getml.data.Placeholder('movies2directors')
directors_ph = getml.data.Placeholder('directors')
movies2actors_ph = getml.data.Placeholder('movies2actors')
actors_ph = getml.data.Placeholder('actors')

users_ph.join(
    u2base_ph,
    join_key='userid'
)

u2base_ph.join(
    movies_ph,
    join_key='movieid',
    relationship=getml.data.relationship.many_to_one
)

movies_ph.join(
    movies2directors_ph,
    join_key='movieid'
)

movies2directors_ph.join(
    directors_ph,
    join_key='directorid',
    relationship=getml.data.relationship.many_to_one
)

movies_ph.join(
    movies2actors_ph,
    join_key='movieid'
)

movies2actors_ph.join(
    actors_ph,
    join_key='actorid',
    relationship=getml.data.relationship.many_to_one
)

users_ph
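
To make the shape of this data model easier to see, the joins defined above can be written down as a small graph. This is only an illustrative sketch in plain Python; the `joins` dictionary and `join_paths` helper are hypothetical and not part of getML.

```python
# Join graph mirroring the placeholder joins above: parent -> [(child, join key)].
joins = {
    "users": [("u2base", "userid")],
    "u2base": [("movies", "movieid")],
    "movies": [("movies2directors", "movieid"), ("movies2actors", "movieid")],
    "movies2directors": [("directors", "directorid")],
    "movies2actors": [("actors", "actorid")],
}


def join_paths(table, path=()):
    """Return every path from the given table down to a leaf table."""
    path = path + (table,)
    children = joins.get(table, [])
    if not children:
        return [path]
    paths = []
    for child, _key in children:
        paths.extend(join_paths(child, path))
    return paths


for p in join_paths("users"):
    print(" -> ".join(p))
```

The population table `users` sits at the root, and the model branches at `movies` into a directors path and an actors path.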

### 2.2 getML pipeline

<!-- #### 2.1.1  -->
__Set up the feature learner & predictor__

We can use either the default parameters of the feature learner or more fine-tuned parameters. Fine-tuning can increase our predictive accuracy to 85%, but the training time then increases to over 4 hours. We therefore use the default parameters here.

In [16]:
fast_prop = getml.feature_learning.FastPropModel(
    loss_function=getml.feature_learning.loss_functions.CrossEntropyLoss,
    n_most_frequent=3
)

predictor = getml.predictors.XGBoostClassifier()
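
FastProp generates features by aggregating rows of the peripheral tables for each row of the population table. The sketch below illustrates that general idea on toy data using only the standard library. It is not the getML implementation; the `ratings` rows and the `aggregate_by_key` helper are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy rows standing in for u2base: (userid, movieid, rating).
ratings = [
    (1, 10, 5),
    (1, 11, 3),
    (2, 10, 4),
    (2, 12, 2),
    (2, 13, 3),
]


def aggregate_by_key(rows):
    """Build one flat feature row per userid from many rating rows."""
    grouped = defaultdict(list)
    for userid, _movieid, rating in rows:
        grouped[userid].append(rating)
    return {
        userid: {
            "count_rating": len(values),
            "avg_rating": mean(values),
            "max_rating": max(values),
            "min_rating": min(values),
        }
        for userid, values in grouped.items()
    }


features = aggregate_by_key(ratings)
print(features[2])
```

Each aggregation yields one candidate feature per user. Combining many aggregations with many columns and join paths is what produces a large pool of candidate features.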

__Build the pipeline__

In [17]:
peripheral_ph = [
    u2base_ph, 
    movies_ph, 
    movies2directors_ph, 
    directors_ph, 
    movies2actors_ph,
    actors_ph
]

pipe = getml.pipeline.Pipeline(
    tags=['FastProp'],
    population=users_ph,
    peripheral=peripheral_ph,
    feature_learners=[fast_prop],
    predictors=[predictor]
)

### 2.3 Model training

In [18]:
peripheral = {
    "u2base": u2base, 
    "movies": movies, 
    "movies2directors": movies2directors,
    "directors": directors,
    "movies2actors": movies2actors,
    "actors": actors
}

In [19]:
pipe.check(data_train, peripheral)

Checking data model...


INFO [JOIN KEYS NOT FOUND]: When joining the composite data frame 'u2base'-'movies' that has been created by many-to-one joins or one-to-one joins and the composite data frame 'movies2directors'-'directors' that has been created by many-to-one joins or one-to-one joins over 'movieid' and 'movieid', there are no corresponding entries for 0.159513% of entries in 'movieid' in 'the composite data frame 'u2base'-'movies' that has been created by many-to-one joins or one-to-one joins'. You might want to double-check your join keys.
INFO [MIGHT TAKE LONG]: There are 43873711 matches between the composite data frame 'u2base'-'movies' that has been created by many-to-one joins or one-to-one joins and the composite data frame 'movies2actors'-'actors' that has been created by many-to-one joins or one-to-one joins when joined over 'movieid' and 'movieid'. This pipeline might take a very long time to fit. You should consider imposing a narrower limit on the scope of this join by reducing the memory
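
The match count reported in the note above grows quickly because every rating row is paired with every cast row of the same movie. A minimal sketch of how such a count arises for a join over a single key follows; the helper `count_join_matches` and the toy key lists are hypothetical.

```python
from collections import Counter


def count_join_matches(left_keys, right_keys):
    """Number of row pairs produced by an equi-join over one key."""
    left_counts = Counter(left_keys)
    right_counts = Counter(right_keys)
    # Each key contributes (rows on the left) * (rows on the right) pairs.
    return sum(n * right_counts[key] for key, n in left_counts.items())


# Hypothetical toy data: movie 10 was rated 3 times and has 4 cast entries.
rated_movieids = [10, 10, 10, 11]
cast_movieids = [10, 10, 10, 10, 11, 11, 12]
matches = count_join_matches(rated_movieids, cast_movieids)
print(matches)
```

Popular movies with large casts dominate the total, which is why roughly a million ratings can expand into tens of millions of matches.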

In [20]:
pipe.fit(data_train, peripheral)

Checking data model...


INFO [JOIN KEYS NOT FOUND]: When joining the composite data frame 'u2base'-'movies' that has been created by many-to-one joins or one-to-one joins and the composite data frame 'movies2directors'-'directors' that has been created by many-to-one joins or one-to-one joins over 'movieid' and 'movieid', there are no corresponding entries for 0.159513% of entries in 'movieid' in 'the composite data frame 'u2base'-'movies' that has been created by many-to-one joins or one-to-one joins'. You might want to double-check your join keys.
INFO [MIGHT TAKE LONG]: There are 43873711 matches between the composite data frame 'u2base'-'movies' that has been created by many-to-one joins or one-to-one joins and the composite data frame 'movies2actors'-'actors' that has been created by many-to-one joins or one-to-one joins when joined over 'movieid' and 'movieid'. This pipeline might take a very long time to fit. You should consider imposing a narrower limit on the scope of this join by reducing the memory


FastProp: Trying 7773 features...

FastProp: Building features...

FastProp: Building features...

XGBoost: Training as predictor...

Trained pipeline.
Time taken: 0h:14m:1.646866



### 2.4 Model evaluation

In [21]:
pipe.score(data_test, peripheral)


FastProp: Building features...

FastProp: Building features...



Unnamed: 0,date time,set used,target,accuracy,auc,cross entropy
0,2021-02-21 14:51:09,data_train,target,0.81272,0.86834,0.41461
1,2021-02-21 14:54:05,data_test,target,0.77167,0.77909,0.48689
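
For readers unfamiliar with the three metrics in the table, here is a small standard-library sketch of how accuracy, AUC and cross entropy can be computed from predicted probabilities. The 0.5 decision threshold and the toy labels are assumptions made for illustration; getML's exact computation may differ.

```python
import math


def accuracy(y_true, y_prob, threshold=0.5):
    """Share of rows whose thresholded prediction equals the label."""
    hits = sum((p >= threshold) == bool(y) for y, p in zip(y_true, y_prob))
    return hits / len(y_true)


def cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean negative log-likelihood of the labels under the predictions."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)


def auc(y_true, y_prob):
    """Probability that a random positive is ranked above a random negative."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = female) and predicted probabilities.
y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.6]
print(accuracy(y_true, y_prob), auc(y_true, y_prob), cross_entropy(y_true, y_prob))
```

Higher accuracy and AUC are better, whereas lower cross entropy is better. This is why the test row in the table, which has lower accuracy and AUC than the training row, also shows a higher cross entropy.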


### 2.6 Studying features

__Feature correlations__

We want to analyze how the features are correlated with the target variable.

In [None]:
names, correlations = pipe.features.correlations()

plt.subplots(figsize=(20, 10))

plt.bar(names, correlations, color='#6829c2')

plt.title('Feature Correlations')
plt.xlabel('Features')
plt.ylabel('Correlations')
plt.xticks(rotation='vertical')
plt.show()
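
As a reminder of what a feature-target correlation measures, the sketch below computes a Pearson correlation coefficient by hand. It assumes Pearson correlation is the quantity of interest; the `pearson` helper and the toy values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Hypothetical feature values and binary targets for four users.
feature = [1.0, 2.0, 3.0, 4.0]
target = [0, 0, 1, 1]
r = pearson(feature, target)
print(round(r, 4))
```

Values close to 1 or -1 indicate a strong linear relationship with the target, while values near 0 indicate little linear relationship.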

In [None]:
pipe.features.to_sql()

__Feature importances__
 
Feature importances are calculated by analyzing the improvement in predictive accuracy on each node of the trees in the XGBoost predictor. They are then normalized, so that all importances add up to 100%.

In [None]:
names, importances = pipe.features.importances()

plt.subplots(figsize=(20, 10))

plt.bar(names, importances, color='#6829c2')

plt.title('Feature Importances')
plt.xlabel('Features')
plt.ylabel('Importances')
plt.xticks(rotation='vertical')
plt.show()

most_important = names[0]
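
The normalization step described above can be sketched as follows: sum the improvement (gain) recorded at every split that uses a feature, then divide by the total so that all importances add up to 1. The `splits` records and the `normalized_importances` helper are hypothetical and stand in for what a tree ensemble would report.

```python
from collections import defaultdict

# Hypothetical split records from a tree ensemble: (feature name, gain).
splits = [
    ("feature_1", 6.0),
    ("feature_2", 3.0),
    ("feature_1", 2.0),
    ("feature_3", 1.0),
]


def normalized_importances(split_records):
    """Sum the gain per feature, normalize to 1, sort by importance."""
    gains = defaultdict(float)
    for name, gain in split_records:
        gains[name] += gain
    total = sum(gains.values())
    ranked = sorted(gains.items(), key=lambda item: item[1], reverse=True)
    return [(name, gain / total) for name, gain in ranked]


ranked_importances = normalized_importances(splits)
for name, importance in ranked_importances:
    print(name, round(importance, 3))
```

Sorting in descending order mirrors the convention used in this notebook, where the first returned name is the most important feature.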

__Column importances__

Because getML uses relational learning, we can apply the principles we used to calculate the feature importances to individual columns as well.

As we can see, most of the predictive accuracy is drawn from the roles played by the actors. This suggests that the text fields contained in this relational database have a higher impact on predictive accuracy than for most other data sets.

In [None]:
names, importances = pipe.columns.importances()

plt.subplots(figsize=(20, 10))

plt.bar(names, importances, color='#6829c2')

plt.title('Columns importances')
plt.xlabel('Columns')
plt.ylabel('Importances')
plt.xticks(rotation='vertical')
plt.show()

most_important = names[0]
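
One simple way to think about column importances is to pass each feature's importance on to the columns that feature reads. The sketch below assumes an even split among those columns, which is a simplification chosen for illustration and not necessarily how getML attributes importance. All names and values are hypothetical.

```python
from collections import defaultdict

# Hypothetical feature importances and the columns each feature reads.
feature_importances = {"feature_1": 0.6, "feature_2": 0.4}
feature_columns = {
    "feature_1": ["u2base.rating", "movies.year"],
    "feature_2": ["u2base.rating"],
}


def column_importances(importances, columns_used):
    """Distribute each feature's importance over the columns it uses."""
    totals = defaultdict(float)
    for feature, importance in importances.items():
        cols = columns_used[feature]
        for col in cols:
            # Assumption: a feature's importance is shared evenly by its columns.
            totals[col] += importance / len(cols)
    return dict(totals)


col_importances = column_importances(feature_importances, feature_columns)
print(col_importances)
```

Under this scheme the column importances still add up to the same total as the feature importances.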

__Transpiling the learned features__

We can also transpile the learned features to SQLite3 code. We want to show the two most important features, which is why we call the `.features.importances()` method again. The names it returns are already sorted by importance.

In [None]:
names, _ = pipe.features.importances()

pipe.features.to_sql()[names[0]]

In [None]:
names, _ = pipe.features.importances()

pipe.features.to_sql()[names[1]]

### 2.7 Benchmarks

## 3. Conclusion

In this notebook we have demonstrated how getML can be applied to text fields. We have demonstrated that our approach outperforms state-of-the-art relational learning algorithms on the IMDb dataset.

## Citations

Motl, Jan, and Oliver Schulte. "The CTU Prague relational learning repository." arXiv preprint arXiv:1511.03086 (2015).
    
Neville, Jennifer, and David Jensen. "Relational dependency networks." Journal of Machine Learning Research 8.Mar (2007): 653-692.
    
Neville, Jennifer, and David Jensen. "Collective classification with relational dependency networks." Workshop on Multi-Relational Data Mining (MRDM-2003). 2003.
    
Neville, Jennifer, et al. "Learning relational probability trees." Proceedings of the Ninth ACM SIGKDD international conference on Knowledge discovery and data mining. 2003.
    
Perovšek, Matic, et al. "Wordification: Propositionalization by unfolding relational data into bags of words." Expert Systems with Applications 42.17-18 (2015): 6442-6456.

# Next Steps

This tutorial went through the basics of applying getML to relational data. If you want to learn more about getML, here are some additional tutorials and articles that will help you:

__Tutorials:__
* [Loan default prediction: Introduction to relational learning](loans_demo.ipynb)
* [Occupancy detection: A multivariate time series example](occupancy_demo.ipynb)  
* [Expenditure categorization: Why relational learning matters](consumer_expenditures_demo.ipynb)
* [Disease lethality prediction: Feature engineering and the curse of dimensionality](atherosclerosis_demo.ipynb)
* [Traffic volume prediction: Feature engineering on multivariate time series](interstate94_demo.ipynb)
* [Air pollution prediction: Why feature learning outperforms brute-force approaches](air_pollution_demo.ipynb) 


__User Guides__ (from our [documentation](https://docs.getml.com/latest/)):
* [Feature learning with Multirel](https://docs.getml.com/latest/user_guide/feature_engineering/feature_engineering.html#multirel)
* [Feature learning with Relboost](https://docs.getml.com/latest/user_guide/feature_engineering/feature_engineering.html#relboost)


# Get in contact

If you have any questions, schedule a [call with Alex](https://go.getml.com/meetings/alexander-uhlig/getml-demo), the co-founder of getML, or write us an [email](mailto:team@getml.com). Prefer a private demo of getML? Just contact us to make an appointment.