## Dataset exploration
Before starting the search for a solution, it's good to take a look at the data and learn how to preprocess it.

In [2]:
# Download and unzip the dataset
# https://gist.github.com/hantoine/c4fc70b32c2d163f604a8dc2a050d5f6
from urllib.request import urlopen
from io import BytesIO
from zipfile import ZipFile

def download_and_unzip(url, extract_to='.'):
    http_response = urlopen(url)
    zipfile = ZipFile(BytesIO(http_response.read()))
    zipfile.extractall(path=extract_to)

download_and_unzip('https://files.grouplens.org/datasets/movielens/ml-100k.zip')

In [29]:
import pandas as pd

data = pd.read_csv('ml-100k/u.data', sep='\t', header=None)
data.head()

Unnamed: 0,0,1,2,3
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596
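
The columns above are unnamed, which makes later positional indexing hard to follow. As a minimal sketch, the same five rows can be loaded with explicit names. The names `user_id`, `item_id`, `rating` and `timestamp` are my assumption based on the column order described in the dataset README (user id, item id, rating, Unix timestamp); the rest of this notebook keeps using integer column labels.

```python
from io import StringIO

import pandas as pd

# The five rows printed above, tab-separated as in ml-100k/u.data.
sample = StringIO(
    "196\t242\t3\t881250949\n"
    "186\t302\t3\t891717742\n"
    "22\t377\t1\t878887116\n"
    "244\t51\t2\t880606923\n"
    "166\t346\t1\t886397596\n"
)

# Assumed column names (not used elsewhere in the notebook).
rating_columns = ['user_id', 'item_id', 'rating', 'timestamp']
named = pd.read_csv(sample, sep='\t', header=None, names=rating_columns)

# Timestamps are Unix seconds; convert them to datetimes for readability.
named['rated_at'] = pd.to_datetime(named['timestamp'], unit='s')
print(named)
```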


In [31]:
genres = pd.read_csv('ml-100k/u.genre', sep='|', header=None)
genres.head()

Unnamed: 0,0,1
0,unknown,0
1,Action,1
2,Adventure,2
3,Animation,3
4,Children's,4


In [27]:
users = pd.read_csv('ml-100k/u.user', sep='|', header=None)
users.head()

Unnamed: 0,0,1,2,3,4
0,1,24,M,technician,85711
1,2,53,F,other,94043
2,3,23,M,writer,32067
3,4,24,M,technician,43537
4,5,33,F,other,15213


In [28]:
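# u.item is Latin-1 (ISO-8859-1) encoded, so the default UTF-8 decoding fails on accented titles
items = pd.read_csv('ml-100k/u.item', sep='|', header=None, encoding='latin-1').drop(columns=[3])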
items.head()

Unnamed: 0,0,1,2,4,5,6,7,8,9,10,...,14,15,16,17,18,19,20,21,22,23
0,1,Toy Story (1995),01-Jan-1995,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,1,...,0,0,0,0,0,0,0,0,0,0
1,2,GoldenEye (1995),01-Jan-1995,http://us.imdb.com/M/title-exact?GoldenEye%20(...,0,1,1,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,3,Four Rooms (1995),01-Jan-1995,http://us.imdb.com/M/title-exact?Four%20Rooms%...,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,4,Get Shorty (1995),01-Jan-1995,http://us.imdb.com/M/title-exact?Get%20Shorty%...,0,1,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,5,Copycat (1995),01-Jan-1995,http://us.imdb.com/M/title-exact?Copycat%20(1995),0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


In [66]:
occupations = pd.read_csv('ml-100k/u.occupation', sep='|', header=None)
o_dict = {v: k for k, v in occupations.to_dict()[0].items()}
o_dict

{'administrator': 0,
 'artist': 1,
 'doctor': 2,
 'educator': 3,
 'engineer': 4,
 'entertainment': 5,
 'executive': 6,
 'healthcare': 7,
 'homemaker': 8,
 'lawyer': 9,
 'librarian': 10,
 'marketing': 11,
 'none': 12,
 'other': 13,
 'programmer': 14,
 'retired': 15,
 'salesman': 16,
 'scientist': 17,
 'student': 18,
 'technician': 19,
 'writer': 20}

The dataset is split across multiple tables, much like a SQL database. Regardless, it should contain enough information for rating prediction. The model's inputs would be the user's demographic data plus the movie's release year and genres; the user's rating would be the target.
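
To make the intended row layout concrete, here is a rough sketch that builds a single feature vector by hand. The helper `build_row`, the genre flags and the rating value are hypothetical and only for illustration; the occupation list is the one printed above, and the count of 19 genre flags comes from columns 5 to 23 of `u.item`.

```python
# Occupations in the same order as the o_dict output above.
OCCUPATIONS = [
    'administrator', 'artist', 'doctor', 'educator', 'engineer',
    'entertainment', 'executive', 'healthcare', 'homemaker', 'lawyer',
    'librarian', 'marketing', 'none', 'other', 'programmer', 'retired',
    'salesman', 'scientist', 'student', 'technician', 'writer',
]
N_GENRES = 19  # genre flag columns 5..23 of u.item


def build_row(age, sex, occupation, release_date, genre_flags, rating):
    """Hypothetical helper: concatenate the features, with the target last."""
    occupation_one_hot = [int(name == occupation) for name in OCCUPATIONS]
    release_year = int(release_date.split('-')[2])  # '01-Jan-1995' -> 1995
    return ([age, int(sex == 'M')] + occupation_one_hot
            + [release_year] + list(genre_flags) + [rating])


# User 1 from u.user; the genre flags and the rating of 5 are made up.
illustrative_genres = [0, 0, 0, 1, 1, 1] + [0] * (N_GENRES - 6)
row = build_row(24, 'M', 'technician', '01-Jan-1995', illustrative_genres, 5)

# Layout: age | sex | 21 occupation flags | release year | 19 genre flags | rating
print(len(row))
```

Under these assumptions each row has 2 + 21 + 1 + 19 + 1 = 44 entries, with the rating in the last position.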

Let's prepare the dataset for that training.

In [156]:
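def vectorize(user, movie, rating):
    if movie[1] == 'unknown':  # strange edge case, better to get rid of those
        return -1
    # One-hot encode the user's occupation
    o_vec = [0] * len(o_dict)
    o_vec[o_dict[user[3]]] = 1
    release_year = int(movie[2].split('-')[2])  # e.g. '01-Jan-1995' -> 1995
    genre_flags = list(movie.drop([0, 1, 2, 4]))  # drop ID, title, release date and URL
    # Layout: age, sex (1 = male), occupation one-hot, release year, genre flags, rating
    return [user[1], int(user[2] == 'M')] + o_vec + [release_year] + genre_flags + [rating]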

In [157]:
from tqdm import tqdm

d = {}
for i, row in tqdm(data.iterrows(), total=len(data)):
    # User and item IDs are 1-based and stored in order, so ID - 1 is the row position
    user = users.iloc[row[0]-1]
    movie = items.iloc[row[1]-1]
    rating = row[2]
    vec = vectorize(user, movie, rating)
    if vec == -1:  # skipped edge case
        continue
    d[i] = vec

classical = pd.DataFrame.from_dict(d, orient='index')
classical.head()

100%|█████████████████████████████████████████████████████████████████████████| 100000/100000 [02:31<00:00, 660.68it/s]


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,34,35,36,37,38,39,40,41,42,43
0,49,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,3
1,39,0,0,0,0,0,0,0,1,0,...,1,0,0,1,0,0,1,0,0,3
2,25,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,28,1,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,1,1,2
4,47,1,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
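
The row-by-row loop above took about two and a half minutes according to the progress bar. As a possible alternative, the same kind of table could be built with joins and column-wise operations. This is only a sketch on tiny made-up frames: the column names, the two-genre item table and the two-entry occupation list are all hypothetical stand-ins, not the notebook's actual `data`, `users` and `items`.

```python
import pandas as pd

# Tiny hypothetical stand-ins for the ratings, user and item tables.
ratings = pd.DataFrame({'user_id': [1, 2, 1], 'item_id': [10, 10, 11], 'rating': [4, 3, 5]})
toy_users = pd.DataFrame({
    'user_id': [1, 2], 'age': [24, 53], 'sex': ['M', 'F'],
    'occupation': ['technician', 'other'],
})
toy_items = pd.DataFrame({
    'item_id': [10, 11], 'title': ['Toy Story (1995)', 'unknown'],
    'release_date': ['01-Jan-1995', None], 'Action': [0, 0], 'Comedy': [1, 0],
})
occupation_names = ['other', 'technician']  # would be all 21 names in practice

# Join the three tables, then drop the 'unknown' placeholder movies.
merged = ratings.merge(toy_users, on='user_id').merge(toy_items, on='item_id')
merged = merged[merged['title'] != 'unknown'].reset_index(drop=True)

# A fixed category list keeps the one-hot columns stable even if a value is absent.
occupation = merged['occupation'].astype(pd.CategoricalDtype(occupation_names))
occupation_one_hot = pd.get_dummies(occupation, dtype=int)

release_year = merged['release_date'].str.split('-').str[2].astype(int).rename('release_year')
is_male = (merged['sex'] == 'M').astype(int).rename('is_male')

table = pd.concat(
    [merged['age'], is_male, occupation_one_hot, release_year,
     merged[['Action', 'Comedy']], merged['rating']],
    axis=1,
)
print(table)
```

The feature order matches `vectorize` (age, sex, occupation one-hot, release year, genre flags, rating), so the result should be interchangeable with the loop's output once real column names are substituted.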


In [None]:
classical.to_csv('classical.csv')

This approach loses some data, since zip codes cannot be vectorized easily. However, I doubt that there is much correlation between a postal code and cinematic preferences.
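
If that assumption ever needs testing, one cheap option would be to keep only a coarse region signal instead of the full code. The sketch below is an illustration rather than part of the pipeline above: it one-hot encodes the first digit of each zip code, which in US ZIP codes corresponds to a broad group of states, and lumps anything non-numeric into a single bucket. The value `'T8H1N'` is a hypothetical non-US code added to exercise that branch.

```python
import pandas as pd

# Zip codes of the first five users shown earlier, plus one hypothetical non-US code.
zip_codes = pd.Series(['85711', '94043', '32067', '43537', '15213', 'T8H1N'])

# Use the first character as a coarse region; non-digits share one bucket.
first_char = zip_codes.str[0]
region = first_char.where(first_char.str.isdigit(), 'non_us')

zip_one_hot = pd.get_dummies(region, prefix='zip', dtype=int)
print(zip_one_hot)
```

This adds at most eleven columns (ten digits plus the catch-all), so it would be easy to compare model quality with and without them.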