
# **Add features to ML-100K**

So far, we have fetched additional movie data for the ML-25M dataset. Because we work with two versions of the MovieLens dataset, here we add the same data to the smaller one, which we call ML-100K (the `ml-latest-small` files). The additional features are:

1. Director
1. Rotten Tomatoes Ratings
1. IMDB Ratings
1. IMDB Votes

### **Import Dependencies**

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.sparse.linalg import svds
from scipy.sparse import csr_matrix
from scipy.sparse import coo_matrix

### **Load Data**

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [2]:
# File paths:
path1 = '/content/drive/My Drive/Datasets/ml-25m/filtered_ratings.csv'
path2 = '/content/drive/My Drive/Datasets/ml-25m/filtered_movies.csv'
path3 = '/content/drive/My Drive/Datasets/ml-latest-small/ratings.csv'
path4 = '/content/drive/My Drive/Datasets/ml-latest-small/movies.csv'

In [5]:
small_ratings = pd.read_csv(path3)
small_movies = pd.read_csv(path4)
ratings = pd.read_csv(path1)
movies = pd.read_csv(path2)

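In [15]:
# Note: the execution count (15) shows this cell was re-run after the merge
# and dropna steps below, so the output describes the final, cleaned frame.
small_movies.info()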

<class 'pandas.core.frame.DataFrame'>
Index: 5207 entries, 0 to 9723
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   movieId          5207 non-null   int64  
 1   title            5207 non-null   object 
 2   genres           5207 non-null   object 
 3   Director         5207 non-null   object 
 4   Rotten Tomatoes  5207 non-null   float64
 5   imdbRatings      5207 non-null   float64
 6   imdbVotes        5207 non-null   float64
dtypes: float64(3), int64(1), object(3)
memory usage: 325.4+ KB


In [8]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5418 entries, 0 to 5417
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   movieId          5418 non-null   int64  
 1   title            5418 non-null   object 
 2   genres           5418 non-null   object 
 3   Director         5418 non-null   object 
 4   Rotten Tomatoes  5418 non-null   float64
 5   imdbRatings      5418 non-null   float64
 6   imdbVotes        5418 non-null   int64  
dtypes: float64(2), int64(2), object(3)
memory usage: 296.4+ KB


In [9]:
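# Left merge: keep every ml-latest-small movie and attach the fetched
# features wherever movieId matches a row in the ML-25M movies table.
small_movies = pd.merge(
    small_movies,
    movies[['movieId', 'Director', 'Rotten Tomatoes', 'imdbRatings', 'imdbVotes']],
    on='movieId', how='left')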
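
A left merge keeps every row of the left frame and fills the feature columns with NaN where no match exists. As a minimal sketch on made-up toy frames (the names `left` and `right` and all values are hypothetical, not taken from the datasets), pandas' `indicator=True` option adds a `_merge` column that shows which rows found a match:

```python
import pandas as pd

# Toy stand-ins for small_movies and movies (hypothetical values).
left = pd.DataFrame({"movieId": [1, 2, 3], "title": ["A", "B", "C"]})
right = pd.DataFrame({"movieId": [1, 3], "Director": ["X", "Y"]})

# indicator=True adds a categorical "_merge" column to the result.
merged = pd.merge(left, right, on="movieId", how="left", indicator=True)

# Count matched ("both") versus unmatched ("left_only") rows.
match_counts = merged["_merge"].value_counts()
print(match_counts)
```

Running the same count on the real merge would be one way to confirm how many movies received features before any rows are dropped.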

In [10]:
small_movies

Unnamed: 0,movieId,title,genres,Director,Rotten Tomatoes,imdbRatings,imdbVotes
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,John Lasseter,10.00,8.3,1064199.0
1,2,Jumanji (1995),Adventure|Children|Fantasy,Joe Johnston,5.59,7.1,374898.0
2,3,Grumpier Old Men (1995),Comedy|Romance,Howard Deutch,2.89,6.6,29548.0
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,Forest Whitaker,6.04,6.0,12108.0
4,5,Father of the Bride Part II (1995),Comedy,Charles Shyer,5.68,6.1,41501.0
...,...,...,...,...,...,...,...
9737,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy,,,,
9738,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy,,,,
9739,193585,Flint (2017),Drama,,,,
9740,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation,,,,


In [11]:
small_movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   movieId          9742 non-null   int64  
 1   title            9742 non-null   object 
 2   genres           9742 non-null   object 
 3   Director         5207 non-null   object 
 4   Rotten Tomatoes  5207 non-null   float64
 5   imdbRatings      5207 non-null   float64
 6   imdbVotes        5207 non-null   float64
dtypes: float64(3), int64(1), object(3)
memory usage: 532.9+ KB


In [12]:
nan_counts = small_movies.isna().sum()
print(nan_counts)

movieId               0
title                 0
genres                0
Director           4535
Rotten Tomatoes    4535
imdbRatings        4535
imdbVotes          4535
dtype: int64


In [13]:
# Filter the DataFrame to show only rows where all specified columns are NaN
rows_with_all_nan = small_movies[small_movies[['Director', 'Rotten Tomatoes', 'imdbRatings', 'imdbVotes']].isna().all(axis=1)]

print(rows_with_all_nan)

      movieId                                              title  \
28         29  City of Lost Children, The (Cité des enfants p...   
31         32          Twelve Monkeys (a.k.a. 12 Monkeys) (1995)   
45         49                       When Night Is Falling (1995)   
48         53                                    Lamerica (1994)   
52         58                  Postman, The (Postino, Il) (1994)   
...       ...                                                ...   
9737   193581          Black Butler: Book of the Atlantic (2017)   
9738   193583                       No Game No Life: Zero (2017)   
9739   193585                                       Flint (2017)   
9740   193587                Bungo Stray Dogs: Dead Apple (2018)   
9741   193609                Andrew Dice Clay: Dice Rules (1991)   

                                      genres Director  Rotten Tomatoes  \
28    Adventure|Drama|Fantasy|Mystery|Sci-Fi      NaN              NaN   
31                   Mystery|Sci-Fi

In [14]:
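# Drop every row that has a missing value. After the merge, only the four
# feature columns contain NaN, so this removes the movies with no fetched features.
small_movies.dropna(inplace=True)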
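
Without a `subset` argument, `dropna` removes a row when any of its columns is missing, not only the four added features. The sketch below, on a toy frame with hypothetical values, restricts the check to the feature columns. It gives the same result as a plain `dropna` here only because the other columns have no missing values:

```python
import numpy as np
import pandas as pd

feature_cols = ["Director", "Rotten Tomatoes", "imdbRatings", "imdbVotes"]

# Toy frame (hypothetical values): the second row has no fetched features.
toy = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["A", "B", "C"],
    "Director": ["X", np.nan, "Y"],
    "Rotten Tomatoes": [9.0, np.nan, 5.5],
    "imdbRatings": [8.0, np.nan, 7.0],
    "imdbVotes": [1000.0, np.nan, 500.0],
})

# Drop rows lacking any feature column; returns a new frame with a fresh index.
cleaned = toy.dropna(subset=feature_cols).reset_index(drop=True)
print(cleaned["movieId"].tolist())
```

Resetting the index is optional. It avoids the gaps that a drop leaves behind in the row labels.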

In [None]:
# Save the edited dataset to a new CSV file
small_movies.to_csv('/content/drive/My Drive/Datasets/ml-latest-small/small_movies.csv', index=False)

In [None]:
# Cleaning ratings dataset according to the new movies dataset
small_ratings = small_ratings[small_ratings['movieId'].isin(small_movies['movieId'])]
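
The filter above keeps only the ratings whose `movieId` still appears in the cleaned movie table. A small sketch of the same `isin` pattern on toy frames (all names and values are hypothetical):

```python
import pandas as pd

# Toy stand-ins for the cleaned movies and the raw ratings (hypothetical values).
toy_movies = pd.DataFrame({"movieId": [1, 2]})
toy_ratings = pd.DataFrame({
    "userId": [1, 1, 2, 3],
    "movieId": [1, 29, 2, 29],
    "rating": [4.0, 3.5, 5.0, 2.0],
})

# Boolean mask: True where the rated movie exists in toy_movies.
mask = toy_ratings["movieId"].isin(toy_movies["movieId"])
filtered = toy_ratings[mask]

print(f"kept {len(filtered)} of {len(toy_ratings)} ratings")
```

Comparing the row counts before and after the filter shows how many ratings were discarded together with their movies.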

In [None]:
# Save the filtered ratings to a new CSV file
small_ratings.to_csv('/content/drive/My Drive/Datasets/ml-latest-small/small_ratings.csv', index=False)