# MoReBo
MoReBo stands for Movie Recommendations Bot.

## Introduction

TODO

## Data cleansing

First we need to load our data into memory. We will create a different dataframe for each file. These dataframes will be merged at a later stage.

In [None]:
%pip install pandas
import pandas as pd

# Read the data
movies = pd.read_csv('data/movies.dat', sep='::', engine='python', names=['movieId', 'title', 'genres'], encoding="ISO-8859-1")
ratings = pd.read_csv('data/ratings.dat', sep='::', engine='python', names=['userId', 'movieId', 'rating', 'timestamp'])
users = pd.read_csv('data/users.dat', sep='::', engine='python', names=['userId', 'gender', 'age', 'occupation', 'zip-code'])
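
The files use `::` as a field separator, which is why the `python` parsing engine is needed: pandas' default C engine only supports single-character separators. The sketch below shows the same call on two in-memory sample lines in the `movies.dat` layout. The sample lines and the name `sample_movies` are illustrative assumptions, not part of the notebook.

```python
import io

import pandas as pd

# Two illustrative lines in the same `::`-separated layout as movies.dat
sample = (
    "1::Toy Story (1995)::Animation|Children's|Comedy\n"
    "2::Jumanji (1995)::Adventure|Children's|Fantasy\n"
)

# A multi-character separator requires engine='python'
sample_movies = pd.read_csv(
    io.StringIO(sample),
    sep='::',
    engine='python',
    names=['movieId', 'title', 'genres'],
)

# Quick sanity checks: row/column count and missing values per column
print(sample_movies.shape)
print(sample_movies.isna().sum())
```

Running the same `shape` and `isna().sum()` checks on the real dataframes is a cheap way to confirm that every file was split into the expected number of columns.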

Now that we have created our dataframes, let's have a brief look at our data. By default, the `head()` method displays the first 5 rows of a dataframe.

In [None]:
# Print the first 5 rows of the movies dataframe
movies.head()

In [None]:
# Print the first 5 rows of the ratings dataframe
ratings.head()

In [None]:
# Print the first 5 rows of the users dataframe
users.head()

When looking at the __movies__ dataframe we can see that the _title_ column also contains the release year. The title column should contain only the title of the movie. We still want to keep the year, but in a separate column. Let's create a new column for the release year and remove the year from the title column.

In [None]:
# Create a new column for the year (the four digits in the trailing parentheses)
movies['year'] = movies['title'].str.extract(r'\((\d{4})\)\s*$', expand=False)

# Convert year to int
movies['year'] = movies['year'].astype(int)

# Remove the trailing year from the title
movies['title'] = movies['title'].str.replace(r'\s\(\d{4}\)\s*$', '', regex=True)
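
To see what the year extraction does on titles that contain more than one pair of parentheses, here is a minimal sketch on two sample titles. It assumes every title ends with a four-digit year in parentheses; the sample titles and the variable names are illustrative.

```python
import pandas as pd

titles = pd.Series([
    'Toy Story (1995)',
    'City of Lost Children, The (Cité des enfants perdus, La) (1995)',
])

# Capture only a four-digit year in the final pair of parentheses
year = titles.str.extract(r'\((\d{4})\)\s*$', expand=False).astype(int)

# Strip the trailing " (YYYY)" and keep any other parenthesised text
clean = titles.str.replace(r'\s\(\d{4}\)\s*$', '', regex=True)

print(year.tolist())
print(clean.tolist())
```

Anchoring the pattern at the end of the string keeps alternative titles in parentheses intact. If a title had no trailing year, `extract` would return `NaN` for it and `astype(int)` would raise, which makes such rows easy to spot.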

Looking at the output of `head()`, we can see that there are still some data issues in our dataframe. The _genres_ column contains all genres of a movie in a single field, separated by `|`. To make the data easier to work with, we are going to create a dummy variable for every genre.

In [None]:
# Create dummy variables for the genres
movies = movies.join(movies['genres'].str.get_dummies(sep='|'))
movies.head()
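
As a small illustration of what `str.get_dummies` produces, the sketch below applies it to two sample genre strings (illustrative values, not taken from the notebook output). Each distinct genre becomes its own column, holding 1 where the movie has that genre and 0 otherwise.

```python
import pandas as pd

genres = pd.Series(["Animation|Children's|Comedy", 'Comedy|Romance'])

# Split on '|' and create one indicator column per distinct genre
dummies = genres.str.get_dummies(sep='|')

print(dummies)
```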

# TODO
@Jan:<br>
- Properly format the movie titles for movie titles containing a comma. Eg: Christmas Carol, A
- Describe how we found this issue without looking at the excel...
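
One possible way to approach both open points is sketched below. It is not the final implementation: it assumes that the year has already been removed from the title and that the article is one of the English articles "The", "An" or "A" at the very end of the title. Except for "Christmas Carol, A", the sample titles are illustrative.

```python
import pandas as pd

titles = pd.Series(['Christmas Carol, A', 'Toy Story', 'Usual Suspects, The'])

# Detect the issue programmatically: titles ending in ", <article>"
has_trailing_article = titles.str.contains(r', (?:The|An|A)$')
print(has_trailing_article.sum(), 'titles end with an article')

# Move the trailing article to the front: "Christmas Carol, A" -> "A Christmas Carol"
fixed = titles.str.replace(r'^(.*), (The|An|A)$', r'\2 \1', regex=True)
print(fixed.tolist())
```

Titles where the article is followed by an alternative title in parentheses, or where the article is not English, are left unchanged by this pattern and would need separate handling.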

Now that we have cleaned our data, we can merge our separate dataframes into a single dataframe. This dataframe will be used for the remainder of this notebook.

In [None]:
# Merge the data
merged = pd.merge(movies, ratings, on='movieId', how='inner')
merged = pd.merge(merged, users, on='userId', how='inner')
merged.head()
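
The sketch below illustrates the effect of the two inner merges on tiny made-up dataframes (all values and the `_s` names are assumptions for illustration). An inner merge keeps only keys present on both sides, so a movie without any rating does not appear in the result. The optional `validate` argument makes pandas raise an error if the keys are not unique on the side where we expect them to be.

```python
import pandas as pd

movies_s = pd.DataFrame({'movieId': [1, 2, 3], 'title': ['A', 'B', 'C']})
ratings_s = pd.DataFrame({
    'userId': [10, 10, 11],
    'movieId': [1, 2, 1],
    'rating': [5, 3, 4],
})
users_s = pd.DataFrame({'userId': [10, 11], 'gender': ['F', 'M']})

# One movie has many ratings; movieId must be unique in movies_s
merged_s = pd.merge(movies_s, ratings_s, on='movieId', how='inner',
                    validate='one_to_many')

# Many ratings belong to one user; userId must be unique in users_s
merged_s = pd.merge(merged_s, users_s, on='userId', how='inner',
                    validate='many_to_one')

# One row per rating; movie 3 has no rating and is dropped
print(len(merged_s))
print(sorted(merged_s['movieId'].unique()))
```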

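The _age_ and _occupation_ columns contain numeric codes. To make the data readable, we convert both columns to categoricals and replace the codes with descriptive labels.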

In [None]:
# Transform column age to categorical and change the labels
# (labels are assigned in the sorted order of the existing codes;
# assigning to .cat.categories directly no longer works in pandas 2.x)
merged['age'] = merged['age'].astype('category')
merged['age'] = merged['age'].cat.rename_categories(['Under 18', '18-24', '25-34', '35-44', '45-49', '50-55', '56+'])

# Transform column occupation to categorical and change the labels
merged['occupation'] = merged['occupation'].astype('category')
merged['occupation'] = merged['occupation'].cat.rename_categories(['other', 'academic/educator', 'artist', 'clerical/admin', 'college/grad student', 'customer service', 'doctor/health care', 'executive/managerial', 'farmer', 'homemaker', 'K-12 student', 'lawyer', 'programmer', 'retired', 'sales/marketing', 'scientist', 'self-employed', 'technician/engineer', 'tradesman/craftsman', 'unemployed', 'writer'])
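
Assigning labels by position only works if every code actually occurs in the data. A slightly more defensive variant, sketched below, passes a dictionary from code to label to `rename_categories`. The age codes in `age_labels` are an assumption about how the dataset encodes age groups and should be checked against the dataset's documentation; the sample series is made up.

```python
import pandas as pd

# Assumed mapping from age code to label (verify against the dataset README)
age_labels = {
    1: 'Under 18', 18: '18-24', 25: '25-34', 35: '35-44',
    45: '45-49', 50: '50-55', 56: '56+',
}

ages = pd.Series([25, 1, 56, 25]).astype('category')

# Codes missing from the data are simply ignored by the mapping
ages = ages.cat.rename_categories(age_labels)

print(ages.tolist())
```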

To save time, we are going to store our merged dataframe as a .csv file. This way we can load the data again later without having to clean and merge it.

In [None]:
# Save the merged dataset as a new csv file
merged.to_csv('data/merged.csv', index=False, sep=';')
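
When loading the file again, the same `;` separator has to be passed to `read_csv`. The sketch below shows the round trip on a tiny made-up dataframe written to a temporary directory (the data and variable names are illustrative). Note that a CSV file does not store dtypes, so categorical columns come back as plain text unless the dtype is requested explicitly.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'title': ['Toy Story'], 'age': pd.Categorical(['25-34'])})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'merged.csv')
    df.to_csv(path, index=False, sep=';')

    # Plain reload: the categorical dtype is lost
    reloaded = pd.read_csv(path, sep=';')

    # Reload and restore the categorical dtype for the age column
    restored = pd.read_csv(path, sep=';', dtype={'age': 'category'})

print(reloaded.dtypes)
print(restored.dtypes)
```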

## Data Exploration

# TODO
Let's have a look at our data.

In [None]:
# Create a barchart for the number of movies for every genre
movies['genres'].str.get_dummies(sep='|').sum().sort_values(ascending=False).plot(kind='bar', figsize=(20,10))

In [None]:
# Create a barchart for the number of movies for every year
movies['year'].value_counts().sort_index().plot(kind='bar', figsize=(20,10))