# Prepare for data import
The movie dataset comes from GroupLens and is available at this [link](https://grouplens.org/datasets/movielens/latest/) (the smaller version).  
The dataset contains about 100,000 ratings and 3,600 tag applications applied to 9,000 movies by 600 users.  
The files inside the zip archive are:
- `movies.csv`: list of movies with movieId, title and genres
- `ratings.csv`: list of ratings from users with userId, movieId, rating and timestamp
- `tags.csv`: list of tags from users with userId, movieId, tag and timestamp
- `links.csv`: mapping of movieIds to the ids used in other datasets _(not used)_

All CSVs are well structured, with a header row and the same delimiter.  
The data is already well normalized, but a user table (not present for privacy reasons) and a dedicated genres table will be created to support better graph analytics.  

## Load data
Start by loading the data from the directory specified in `source_dataset`.  
The dtypes are specified explicitly on import for better performance.

In [2]:
import pandas as pd
import numpy as np

source_dataset = "dataset"

movies = pd.read_csv(
    f"{source_dataset}/movies.csv",
    dtype={"movieId": np.int32, "title": str, "genres": str},
)
ratings = pd.read_csv(
    f"{source_dataset}/ratings.csv",
    dtype={
        "userId": np.int32,
        "movieId": np.int32,
        "rating": np.float16,
        "timestamp": np.float64,
    },
)
tags = pd.read_csv(
    f"{source_dataset}/tags.csv",
    dtype={
        "userId": np.int32,
        "movieId": np.int32,
        "tag": str,
        "timestamp": np.float64,
    },
)
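
To illustrate what the explicit dtypes change, the following minimal sketch loads a tiny made-up sample in the shape of `ratings.csv` twice, once with pandas' inferred dtypes and once with the dtypes used above, and compares the memory footprint. The sample rows and the names `csv_text`, `default_df` and `typed_df` are assumptions for illustration only; the real savings depend on the file.

```python
import io

import numpy as np
import pandas as pd

# Tiny synthetic sample in the same shape as ratings.csv (values are made up).
rows = [(1, 1, 4.0), (1, 3, 4.5), (2, 6, 0.5), (2, 47, 5.0)]
csv_text = "userId,movieId,rating,timestamp\n" + "\n".join(
    f"{u},{m},{r},{964982703 + u}" for u, m, r in rows
)

# Let pandas infer the dtypes (typically int64 / float64).
default_df = pd.read_csv(io.StringIO(csv_text))

# Same data with the narrower dtypes used in the notebook.
typed_df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={
        "userId": np.int32,
        "movieId": np.int32,
        "rating": np.float16,
        "timestamp": np.float64,
    },
)

# Compare column memory in bytes, ignoring the index.
default_bytes = default_df.memory_usage(index=False).sum()
typed_bytes = typed_df.memory_usage(index=False).sum()
print(f"inferred dtypes: {default_bytes} bytes, explicit dtypes: {typed_bytes} bytes")
```

Half-star ratings are exactly representable in `float16`, so the narrower type should not lose information here, but arithmetic on such a column is best done after casting to a wider float.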

## Check data quality

### Movies
The movies dataframe has no null values, and the movieIds are unique.  
Five titles are duplicated, but the duplicated rows differ in their other fields (they will be fixed).  
Titles can include the release year.  
The genres column holds a list of genres, which could become a separate table.

In [43]:
print(f"Movies size is {movies.shape}")
print(f"Unique movieIds? {movies['movieId'].nunique() == movies.shape[0]}")
print(f"Unique titles? {movies['title'].nunique() == movies.shape[0]}")
print(f"How many duplicated titles? {movies[movies[['title']].duplicated()].shape[0]}")
print(
    f"How many duplicated rows? {movies[movies[['title','genres']].duplicated()].shape[0]}"
)
# movies_duplicated = movies[movies['title'].duplicated(keep=False)]

movies.info()
movies.head()

Movies size is (9742, 3)
Unique movieIds? True
Unique titles? False
How many duplicated titles? 5
How many duplicated rows? 0
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int32 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int32(1), object(2)
memory usage: 190.4+ KB


|   | movieId | title | genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |
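
To look at the duplicated titles side by side, one option is `duplicated(keep=False)`, which flags every occurrence of a repeated title rather than only the later ones. This is a minimal sketch on made-up rows; the titles, ids and the name `duplicated_titles` are hypothetical.

```python
import pandas as pd

# Made-up rows that mimic the situation: same title, different genres.
sample = pd.DataFrame(
    {
        "movieId": [10, 11, 12, 13],
        "title": [
            "Movie A (2000)",
            "Movie A (2000)",
            "Movie B (1995)",
            "Movie C (1980)",
        ],
        "genres": ["Comedy|Drama", "Romance", "Action|Crime", "Sci-Fi"],
    }
)

# keep=False marks all rows that share a title with another row.
duplicated_titles = sample[sample["title"].duplicated(keep=False)].sort_values(
    ["title", "movieId"]
)
print(duplicated_titles)
```

On the real dataframe the same filter would be applied to `movies`, and the resulting rows could then be merged or relabeled by hand.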


### Ratings
In the ratings dataframe, all ratings are in the range 0 to 5 (in steps of 0.5, according to the dataset readme).  
There is a timestamp field that will be reformatted.

In [45]:
print(f"\nRatings size is {ratings.shape}")
print(f"Are there any negative ratings? {ratings[ratings['rating']<0].size!=0}")
print(f"Are there any voting above 5? {ratings[ratings['rating']>5].size!=0}")
print(f"Are ratings all integer? {ratings['rating'].apply(float.is_integer).all()}")

ratings.info()
ratings.head()


Ratings size is (100836, 4)
Are there any negative ratings? False
Are there any voting above 5? False
Are ratings all integer? False
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int32  
 1   movieId    100836 non-null  int32  
 2   rating     100836 non-null  float16
 3   timestamp  100836 non-null  float64
dtypes: float16(1), float64(1), int32(2)
memory usage: 1.7 MB


|   | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 1 | 4.0 | 964982703.0 |
| 1 | 1 | 3 | 4.0 | 964981247.0 |
| 2 | 1 | 6 | 4.0 | 964982224.0 |
| 3 | 1 | 47 | 5.0 | 964983815.0 |
| 4 | 1 | 50 | 5.0 | 964982931.0 |
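
The checks above confirm the range and that not all ratings are integers, but they do not verify the 0.5 step itself. A possible extra check, sketched here on a hypothetical sample (in the notebook the input would be `ratings["rating"]`), doubles the values and tests that the result is a whole number.

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for ratings["rating"].
rating = pd.Series([0.5, 3.0, 4.5, 5.0], dtype=np.float16)

# Cast to float64 first so the arithmetic is not done in half precision.
as_f64 = rating.astype(np.float64)

# Multiples of 0.5 become whole numbers once doubled.
on_half_step = ((as_f64 * 2) % 1 == 0).all()
in_range = as_f64.between(0, 5).all()

print(f"All ratings on a 0.5 step? {on_half_step}")
print(f"All ratings within [0, 5]? {in_range}")
```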


### Tags
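In the tags dataframe there are no null values and no duplicated rows. The same tag text is applied more than once: there are 1,589 distinct tags across 3,683 tag applications.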

In [49]:
print(f"\nTags size is {tags.shape}")
print(f"Are there any duplicated tags? {tags[['tag']].duplicated().any()}")
print(f"Are there any duplicated rows? {tags.duplicated().any()}")
print(f"Unique tags? {tags['tag'].nunique()}")

tags.info()
tags.head()


Tags size is (3683, 4)
Are there any duplicated tags? True
Are there any duplicated rows? False
Unique tags? 1589
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3683 entries, 0 to 3682
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   userId     3683 non-null   int32  
 1   movieId    3683 non-null   int32  
 2   tag        3683 non-null   object 
 3   timestamp  3683 non-null   float64
dtypes: float64(1), int32(2), object(1)
memory usage: 86.4+ KB


|   | userId | movieId | tag | timestamp |
|---|---|---|---|---|
| 0 | 2 | 60756 | funny | 1445715000.0 |
| 1 | 2 | 60756 | Highly quotable | 1445715000.0 |
| 2 | 2 | 60756 | will ferrell | 1445715000.0 |
| 3 | 2 | 89774 | Boxing story | 1445715000.0 |
| 4 | 2 | 89774 | MMA | 1445715000.0 |


### Check Movies

TODO:
- Check that ids are unique
- Check whether genres can be split and whether any movie has no genres
- Check title --> there is year information

In [3]:
print("Movies")
print(f"Unique movieId? {movies['movieId'].nunique() == movies.shape[0]}")
print(movies["genres"].head(10))

Movies
Unique movieId? True
0    Adventure|Animation|Children|Comedy|Fantasy
1                     Adventure|Children|Fantasy
2                                 Comedy|Romance
3                           Comedy|Drama|Romance
4                                         Comedy
5                          Action|Crime|Thriller
6                                 Comedy|Romance
7                             Adventure|Children
8                                         Action
9                      Action|Adventure|Thriller
Name: genres, dtype: object
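
For the TODO item about movies without genres, a possible check is sketched below. It assumes that a missing genre shows up either as a null or as the `(no genres listed)` placeholder from the readme; the sample rows and the name `missing_mask` are made up for illustration.

```python
import pandas as pd

# Made-up rows; "(no genres listed)" is the placeholder used by the dataset.
sample = pd.DataFrame(
    {
        "movieId": [1, 2, 3],
        "genres": ["Comedy|Romance", "(no genres listed)", "Action"],
    }
)

# A movie counts as "without genres" if the field is null or the placeholder.
missing_mask = sample["genres"].isna() | sample["genres"].eq("(no genres listed)")

print(f"Movies without genres: {missing_mask.sum()}")
print(sample.loc[missing_mask, "movieId"].tolist())
```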


In [4]:
# Check ratings
print(f"Are there any negative ratings? {ratings[ratings['rating']<0].size!=0}")
print(f"Are there any voting above 5? {ratings[ratings['rating']>5].size!=0}")
print(f"Are ratings all integer? {ratings['rating'].apply(float.is_integer).all()}")

Are there any negative ratings? False
Are there any voting above 5? False
Are ratings all integer? False


# Process data

## Create users list
There are no missing values.  
The data is normalized and the CSV structure is OK.  
User records are not provided for privacy reasons, so we create them after a few checks.  

In [2]:
print(f"Users that do ratings: {ratings['userId'].value_counts().shape}")
print(f"Users that do tags: {tags['userId'].value_counts().shape}")
# True if at least one tagging user never rated a movie
print(f"Users that did only tags: {(~tags['userId'].isin(ratings['userId'])).any()}")
users = ratings["userId"].value_counts().reset_index()
users = users.merge(
    tags["userId"].value_counts().reset_index(), on="userId", how="outer"
)
users.columns = ["userId", "ratings", "tags"]
users = users.fillna(0)
users["tags"] = users["tags"].astype(int)
print(f"So total users is {users['userId'].shape}")
users.head()

Users that do ratings: (610,)
Users that do tags: (58,)
Users that did only tags: False
So total users is (610,)


|   | userId | ratings | tags |
|---|---|---|---|
| 0 | 1 | 232 | 0 |
| 1 | 2 | 29 | 9 |
| 2 | 3 | 39 | 0 |
| 3 | 4 | 216 | 0 |
| 4 | 5 | 44 | 0 |


## Add year info

In [3]:
movies["year"] = movies["title"].str.extract(r"\((\d{4})\)")
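
The extraction above leaves `year` as text and gives a missing value for any title without a four-digit year in parentheses. A possible variant, sketched on made-up titles (only `Toy Story (1995)` comes from the data), anchors the pattern at the end of the title and converts the result to a nullable integer. The anchoring and the `Int64` conversion are assumptions about what is wanted downstream, not part of the original cell.

```python
import pandas as pd

# Made-up titles: one without a year, one with two parenthesized numbers.
titles = pd.Series(
    ["Toy Story (1995)", "Untitled Project", "Example (1999) Returns (2005)"]
)

# expand=False returns a Series; "\s*$" keeps only a trailing "(YYYY)"
# and tolerates trailing whitespace.
year = titles.str.extract(r"\((\d{4})\)\s*$", expand=False)

# Nullable integer dtype keeps missing years as <NA> instead of forcing floats.
year = pd.to_numeric(year).astype("Int64")

print(year)
print(f"Titles without a year: {year.isna().sum()}")
```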

## Add genre nodes

In [4]:
# From readme file
genres = [
    "Action",
    "Adventure",
    "Animation",
    "Children",
    "Comedy",
    "Crime",
    "Documentary",
    "Drama",
    "Fantasy",
    "Film-Noir",
    "Horror",
    "IMAX",
    "Musical",
    "Mystery",
    "Romance",
    "Sci-Fi",
    "Thriller",
    "War",
    "Western",
    "(no genres listed)",
]

# Create links from movies to genres
movies["genres"] = movies["genres"].apply(lambda x: x.split("|"))
movies_genres = movies[["movieId", "genres"]].explode("genres")
movies_genres["genreId"] = movies_genres["genres"].apply(lambda x: genres.index(x) + 1)
movies_genres.drop("genres", axis=1, inplace=True)

# movies no longer needs the genres column
movies.drop(columns=["genres"], inplace=True)

# create genres nodes
genres = pd.DataFrame(genres, columns=["name"])
genres["genreId"] = genres.index + 1
genres = genres.reindex(columns=["genreId", "name"])

print(
    f"Are genres been extracted correctly? {movies_genres['genreId'].isin(genres['genreId']).all()}"
)

Are genres been extracted correctly? True
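
The split, explode and lookup steps can also be written as one chain with a dictionary lookup instead of `list.index`. The sketch below uses a shortened genre list and made-up movies; `genre_ids` and `links` are hypothetical names.

```python
import pandas as pd

genre_names = ["Action", "Comedy", "Romance"]  # shortened list for the example
genre_ids = {name: i for i, name in enumerate(genre_names, start=1)}

sample = pd.DataFrame({"movieId": [1, 2], "genres": ["Comedy|Romance", "Action"]})

links = (
    # Split the pipe-separated string into a list (literal "|", not a regex).
    sample.assign(genres=sample["genres"].str.split("|", regex=False))
    # One row per (movie, genre) pair.
    .explode("genres")
    # Look up the 1-based genre id; unknown names would become NaN.
    .assign(genreId=lambda df: df["genres"].map(genre_ids))
    .drop(columns="genres")
    .reset_index(drop=True)
)
print(links)
```

With `list.index`, an unknown genre raises a `ValueError` during the conversion. With `map`, it becomes a missing value that can be counted afterwards with `links["genreId"].isna().sum()`. Which behavior is preferable depends on whether the import should stop on unexpected data.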


In [6]:
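# Convert Unix epoch seconds to ISO 8601 strings in UTC.
# Note: the two columns use different formats
# (ratings: "+0000" offset, tags: microseconds plus a literal "Z").
ratings["timestamp"] = pd.to_datetime(
    ratings["timestamp"], unit="s", utc=True
).dt.strftime("%Y-%m-%dT%H:%M:%S%z")
tags["timestamp"] = pd.to_datetime(tags["timestamp"], unit="s", utc=True).dt.strftime(
    "%Y-%m-%dT%H:%M:%S.%fZ"
)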
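
To see what the two format strings produce, this small sketch converts a single epoch value (the first rating timestamp from the preview earlier in the notebook) with both patterns. The variable names are illustrative only.

```python
import pandas as pd

# First rating timestamp from the ratings preview (Unix epoch seconds).
ts = pd.Series([964982703.0])
as_utc = pd.to_datetime(ts, unit="s", utc=True)

# Pattern used for ratings: numeric UTC offset.
with_offset = as_utc.dt.strftime("%Y-%m-%dT%H:%M:%S%z")
# Pattern used for tags: microseconds plus a literal "Z".
with_z = as_utc.dt.strftime("%Y-%m-%dT%H:%M:%S.%fZ")

print(with_offset[0])
print(with_z[0])
```

Both strings describe the same instant. If the importer parses them as datetimes, it may be simpler to use one pattern for both files.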

# Move data to the volume container


## old import
Prepare data for neo4j-admin import ([info](https://neo4j.com/docs/operations-manual/current/tools/neo4j-admin/neo4j-admin-import))

In [10]:
# movies.columns = ["movieId:ID(Movie-ID)", "title", "year:int"]
# movies[":LABEL"] = "Movie"

# users.columns = ["userId:ID(User-ID)", "ratings:int", "tags:int"]
# users[":LABEL"] = "User"

# genres.columns = ["genreId:ID(Genre-ID)", "name:string"]
# genres[":LABEL"] = "Genre"

# ratings.columns = [
#     "userId:START_ID(User-ID)",
#     "movieId:END_ID(Movie-ID)",
#     "rating:float",
#     "timestamp:string",
# ]
# ratings[":TYPE"] = "RATED"


# tags.columns = [
#     "userId:START_ID(User-ID)",
#     "movieId:END_ID(Movie-ID)",
#     "tag:string",
#     "timestamp:string",
# ]
# tags[":TYPE"] = "TAGGED"

# movies_genres.columns = [
#     "movieId:START_ID(Movie-ID)",
#     "genreId:END_ID(Genre-ID)",
# ]
# movies_genres[":TYPE"] = "IN_GENRE"

# Save to the import_to_docker dir

In [7]:
import os

path_to_save = "clean_data/small"


# Create the output directory if it does not exist yet
os.makedirs(path_to_save, exist_ok=True)

movies.to_csv(f"{path_to_save}/movies.csv", index=False)
users.to_csv(f"{path_to_save}/users.csv", index=False)
genres.to_csv(f"{path_to_save}/genres.csv", index=False)

ratings.to_csv(f"{path_to_save}/ratings.csv", index=False)
tags.to_csv(f"{path_to_save}/tags.csv", index=False)
movies_genres.to_csv(f"{path_to_save}/movies_genres.csv", index=False)

In [12]:
# TODO: maybe check if tags are unique (upper case and lower case are the same)
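
For the TODO above, a possible way to check whether tags differ only by letter case or surrounding whitespace is sketched below on made-up tags. Stripping whitespace and using `casefold` are assumptions about what should count as "the same" tag.

```python
import pandas as pd

# Made-up tags in the style of tags.csv.
tag = pd.Series(["funny", "Funny", " funny ", "MMA", "mma", "Boxing story"])

# Normalize: trim whitespace and fold case.
normalized = tag.str.strip().str.casefold()

print(f"Unique tags as written: {tag.nunique()}")
print(f"Unique tags after normalizing: {normalized.nunique()}")

# Original spellings behind each normalized tag that has more than one variant.
variants = tag.groupby(normalized).unique()
multi = variants[variants.map(len) > 1]
print(multi)
```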