
In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# MovieLens Dataset for a Recommendation System

The MovieLens dataset is widely used in research on recommendation systems and collaborative filtering. It is provided by GroupLens, a research group in the Department of Computer Science and Engineering at the University of Minnesota, and it consists of user ratings and movie metadata.


## About Dataset

## Context
The dataset describes ratings and free-text tagging activity from MovieLens, a movie recommendation service. It contains 20000263 ratings and 465564 tag applications across 27278 movies. The data were created by 138493 users between January 09, 1995 and March 31, 2015. The dataset was generated on October 17, 2016.

Users were selected at random for inclusion. All selected users had rated at least 20 movies.

### Ratings Data:

The ratings file (loaded in this notebook as 'rating.csv') contains one row per user rating.
Its columns are 'userId' (user identifier), 'movieId' (movie identifier), 'rating' (the user's rating for the movie), and 'timestamp' (when the rating was given).

### Movies Data:

The movies file (loaded in this notebook as 'movie.csv') contains information about the movies themselves.
Its columns are 'movieId' (unique movie identifier), 'title' (movie title, including the release year), and 'genres' (the genres associated with the movie, separated by '|').
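Because several genres are packed into one string, it is often convenient to split them before analysis. The following is a minimal sketch on a toy frame; `toy_movies`, `genre_list`, and `genres_long` are names made up for illustration, and the three rows mirror the sample of the movie table displayed later in this notebook.

```python
import pandas as pd

# Toy rows in the same shape as the movie table (illustrative only).
toy_movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
    "genres": ["Adventure|Animation|Children|Comedy|Fantasy",
               "Adventure|Children|Fantasy",
               "Comedy|Romance"],
})

# Split the pipe-separated string into a list of genres per movie.
toy_movies["genre_list"] = toy_movies["genres"].str.split("|")

# One row per (movie, genre) pair, convenient for counting or filtering.
genres_long = toy_movies.explode("genre_list")[["movieId", "genre_list"]]

# How many movies carry each genre.
genre_counts = genres_long["genre_list"].value_counts()
print(genre_counts)
```

The long format makes it easy to count or filter by a single genre without string matching on the original column.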
### User-Movie Interaction:

The dataset represents the interaction between users and movies through ratings. Users provide ratings for movies they have watched, indicating their preferences.
### Collaborative Filtering:

The dataset is often used for collaborative filtering algorithms. Collaborative filtering recommends items (movies in this case) to users based on the preferences and behavior of other users.
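To make the idea concrete, here is a toy sketch of user-based collaborative filtering with cosine similarity. It is only an illustration under stated assumptions, not the method used later in this project: the matrix `R`, the helper names `cosine` and `predict`, and all rating values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy user x movie rating matrix (hypothetical values); NaN = not rated.
R = pd.DataFrame(
    [[5.0, 4.0, np.nan],
     [5.0, 5.0, 2.0],
     [1.0, 2.0, 5.0]],
    index=["u1", "u2", "u3"],
    columns=["m1", "m2", "m3"],
)

def cosine(a, b):
    """Cosine similarity over the movies both users have rated."""
    both = a.notna() & b.notna()
    if not both.any():
        return 0.0
    x, y = a[both].to_numpy(), b[both].to_numpy()
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def predict(user, movie):
    """Similarity-weighted average of other users' ratings for `movie`."""
    num = den = 0.0
    for other in R.index:
        if other == user or pd.isna(R.loc[other, movie]):
            continue
        w = cosine(R.loc[user], R.loc[other])
        num += w * R.loc[other, movie]
        den += abs(w)
    return num / den if den else np.nan

# Predicted rating of user u1 for the movie they have not rated yet.
pred = predict("u1", "m3")
print(round(pred, 2))
```

Raw cosine similarity on positive ratings tends to be high for every pair of users, so practical systems usually mean-center each user's ratings first; that step is left out here to keep the sketch short.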
### Timestamps:

The 'timestamp' field in the ratings data provides information about when a particular rating was given. This can be used for analyzing user behavior over time.
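As a small illustration of time-based analysis, the sketch below parses timestamp strings and aggregates ratings per year. The frame `toy_ratings` and its values are hypothetical; only the `YYYY-MM-DD HH:MM:SS` string format is taken from the files used in this notebook.

```python
import pandas as pd

# Toy ratings with timestamps stored as strings (hypothetical values).
toy_ratings = pd.DataFrame({
    "userId": [1, 1, 2, 2],
    "rating": [3.5, 3.5, 4.5, 5.0],
    "timestamp": ["2005-04-02 23:53:47", "2005-04-02 23:31:16",
                  "2009-11-13 15:42:00", "2009-12-03 18:31:48"],
})

# Parse the strings into datetime64 values.
toy_ratings["timestamp"] = pd.to_datetime(
    toy_ratings["timestamp"], format="%Y-%m-%d %H:%M:%S"
)

# Number of ratings and mean rating per calendar year.
per_year = (
    toy_ratings
    .groupby(toy_ratings["timestamp"].dt.year)["rating"]
    .agg(["count", "mean"])
)
print(per_year)
```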

# Data Cleaning

### Import Necessary Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### Load the Dataset

In [None]:
# Folder on the mounted Drive that holds the MovieLens CSV files
data_dir="/content/drive/MyDrive/Movie-Product-Recommendation /Datasets/movieLens_Dataset"

gnome_tags=pd.read_csv(f"{data_dir}/genome_tags.csv")
link=pd.read_csv(f"{data_dir}/link.csv")
tags=pd.read_csv(f"{data_dir}/tag.csv")
movie=pd.read_csv(f"{data_dir}/movie.csv")
gnome_scores=pd.read_csv(f"{data_dir}/genome_scores.csv")
rating=pd.read_csv(f"{data_dir}/rating.csv")

## File Wise Cleaning

#### **genome_tags.csv that contains tag descriptions:**

In [None]:
gnome_tags

Unnamed: 0,tagId,tag
0,1,007
1,2,007 (series)
2,3,18th century
3,4,1920s
4,5,1930s
...,...,...
1123,1124,writing
1124,1125,wuxia
1125,1126,wwii
1126,1127,zombie


In [None]:
gnome_tags.isna().sum()

tagId    0
tag      0
dtype: int64

In [None]:
gnome_tags.describe()

Unnamed: 0,tagId
count,1128.0
mean,564.5
std,325.769857
min,1.0
25%,282.75
50%,564.5
75%,846.25
max,1128.0


In [None]:
gnome_tags.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1128 entries, 0 to 1127
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   tagId   1128 non-null   int64 
 1   tag     1128 non-null   object
dtypes: int64(1), object(1)
memory usage: 17.8+ KB


**The genome tags file contains no null values and has only two columns, so there is nothing to drop or fill. I keep it as it is.**

#### **link**

In [None]:
link.isna().sum()

movieId      0
imdbId       0
tmdbId     252
dtype: int64

In [None]:
link.describe()

Unnamed: 0,movieId,imdbId,tmdbId
count,27278.0,27278.0,27026.0
mean,59855.48057,578186.0,63846.683083
std,44429.314697,780470.7,69862.134497
min,1.0,5.0,2.0
25%,6931.25,77417.25,15936.5
50%,68068.0,152435.0,39468.5
75%,100293.25,906271.5,82504.0
max,131262.0,4530184.0,421510.0


In [None]:
link.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27278 entries, 0 to 27277
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   movieId  27278 non-null  int64  
 1   imdbId   27278 non-null  int64  
 2   tmdbId   27026 non-null  float64
dtypes: float64(1), int64(2)
memory usage: 639.5 KB


- Finding:

    The link dataset contains 252 NaN values, all in the **tmdbId** column.

- Decision:

    I will fetch movie information through IMDb, so keeping the TMDB IDs only increases the size of the dataset. I therefore drop the **tmdbId** column.

In [None]:
# Drop the tmdbId column
link.drop(columns="tmdbId",inplace=True)

In [None]:
link

Unnamed: 0,movieId,imdbId
0,1,114709
1,2,113497
2,3,113228
3,4,114885
4,5,113041
...,...,...
27273,131254,466713
27274,131256,277703
27275,131258,3485166
27276,131260,249110


#### tags

In [None]:
tags

Unnamed: 0,userId,movieId,tag,timestamp
0,18,4141,Mark Waters,2009-04-24 18:19:40
1,65,208,dark hero,2013-05-10 01:41:18
2,65,353,dark hero,2013-05-10 01:41:19
3,65,521,noir thriller,2013-05-10 01:39:43
4,65,592,dark hero,2013-05-10 01:41:18
...,...,...,...,...
465559,138446,55999,dragged,2013-01-23 23:29:32
465560,138446,55999,Jason Bateman,2013-01-23 23:29:38
465561,138446,55999,quirky,2013-01-23 23:29:38
465562,138446,55999,sad,2013-01-23 23:29:32


In [None]:
tags.describe()

Unnamed: 0,userId,movieId
count,465564.0,465564.0
mean,68712.354263,32627.76292
std,41877.674053,36080.241157
min,18.0,1.0
25%,28780.0,2571.0
50%,70201.0,7373.0
75%,107322.0,62235.0
max,138472.0,131258.0


In [None]:
tags.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 465564 entries, 0 to 465563
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   userId     465564 non-null  int64 
 1   movieId    465564 non-null  int64 
 2   tag        465548 non-null  object
 3   timestamp  465564 non-null  object
dtypes: int64(2), object(2)
memory usage: 14.2+ MB


In [None]:
tags.isna().sum()

userId        0
movieId       0
tag          16
timestamp     0
dtype: int64

In [None]:
tags[tags["tag"].isna()]

Unnamed: 0,userId,movieId,tag,timestamp
373276,116460,123,,2008-01-04 12:47:47
373277,116460,346,,2008-01-04 13:05:46
373281,116460,1184,,2008-01-04 13:11:01
373288,116460,1785,,2008-01-04 13:06:46
373289,116460,2194,,2008-01-04 12:44:37
373291,116460,2691,,2008-01-04 12:50:02
373299,116460,4103,,2008-01-04 13:05:20
373301,116460,4473,,2008-01-04 12:50:40
373303,116460,4616,,2008-01-04 13:14:01
373319,116460,7624,,2008-01-04 13:11:06


In [None]:
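# Fill the 16 missing tags with the most frequent tag.
# Assign the result back: inplace fillna on a selected column is deprecated in recent pandas.
mode=tags["tag"].mode().iloc[0]
tags["tag"]=tags["tag"].fillna(mode)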

In [None]:
tags.isna().sum()

userId       0
movieId      0
tag          0
timestamp    0
dtype: int64

In [None]:
tags

Unnamed: 0,userId,movieId,tag,timestamp
0,18,4141,Mark Waters,2009-04-24 18:19:40
1,65,208,dark hero,2013-05-10 01:41:18
2,65,353,dark hero,2013-05-10 01:41:19
3,65,521,noir thriller,2013-05-10 01:39:43
4,65,592,dark hero,2013-05-10 01:41:18
...,...,...,...,...
465559,138446,55999,dragged,2013-01-23 23:29:32
465560,138446,55999,Jason Bateman,2013-01-23 23:29:38
465561,138446,55999,quirky,2013-01-23 23:29:38
465562,138446,55999,sad,2013-01-23 23:29:32


#### movie

In [None]:
movie

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
27273,131254,Kein Bund für's Leben (2007),Comedy
27274,131256,"Feuer, Eis & Dosenbier (2002)",Comedy
27275,131258,The Pirates (2014),Adventure
27276,131260,Rentun Ruusu (2001),(no genres listed)


In [None]:
movie.isna().sum()

movieId    0
title      0
genres     0
dtype: int64

In [None]:
movie.describe()

Unnamed: 0,movieId
count,27278.0
mean,59855.48057
std,44429.314697
min,1.0
25%,6931.25
50%,68068.0
75%,100293.25
max,131262.0


In [None]:
movie.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27278 entries, 0 to 27277
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  27278 non-null  int64 
 1   title    27278 non-null  object
 2   genres   27278 non-null  object
dtypes: int64(1), object(2)
memory usage: 639.5+ KB


#### gnome_scores

In [None]:
gnome_scores

Unnamed: 0,movieId,tagId,relevance
0,1,1,0.02500
1,1,2,0.02500
2,1,3,0.05775
3,1,4,0.09675
4,1,5,0.14675
...,...,...,...
11709763,131170,1124,0.58775
11709764,131170,1125,0.01075
11709765,131170,1126,0.01575
11709766,131170,1127,0.11450


In [None]:
gnome_scores.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11709768 entries, 0 to 11709767
Data columns (total 3 columns):
 #   Column     Dtype  
---  ------     -----  
 0   movieId    int64  
 1   tagId      int64  
 2   relevance  float64
dtypes: float64(1), int64(2)
memory usage: 268.0 MB


In [None]:
gnome_scores.describe()

Unnamed: 0,movieId,tagId,relevance
count,11709770.0,11709770.0,11709770.0
mean,25842.97,564.5,0.1164833
std,34676.15,325.6254,0.1542463
min,1.0,1.0,0.00025
25%,2926.0,282.75,0.02425
50%,6017.0,564.5,0.0565
75%,46062.0,846.25,0.1415
max,131170.0,1128.0,1.0


In [None]:
gnome_scores.isna().sum()

movieId      0
tagId        0
relevance    0
dtype: int64

#### rating

In [None]:
rating

Unnamed: 0,userId,movieId,rating,timestamp
0,1,2,3.5,2005-04-02 23:53:47
1,1,29,3.5,2005-04-02 23:31:16
2,1,32,3.5,2005-04-02 23:33:39
3,1,47,3.5,2005-04-02 23:32:07
4,1,50,3.5,2005-04-02 23:29:40
...,...,...,...,...
20000258,138493,68954,4.5,2009-11-13 15:42:00
20000259,138493,69526,4.5,2009-12-03 18:31:48
20000260,138493,69644,3.0,2009-12-07 18:10:57
20000261,138493,70286,5.0,2009-11-13 15:42:24


In [None]:
rating.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000263 entries, 0 to 20000262
Data columns (total 4 columns):
 #   Column     Dtype  
---  ------     -----  
 0   userId     int64  
 1   movieId    int64  
 2   rating     float64
 3   timestamp  object 
dtypes: float64(1), int64(2), object(1)
memory usage: 610.4+ MB


In [None]:
rating.describe()

Unnamed: 0,userId,movieId,rating
count,20000260.0,20000260.0,20000260.0
mean,69045.87,9041.567,3.525529
std,40038.63,19789.48,1.051989
min,1.0,1.0,0.5
25%,34395.0,902.0,3.0
50%,69141.0,2167.0,3.5
75%,103637.0,4770.0,4.0
max,138493.0,131262.0,5.0


## Data Integration

Join the dataframes with each other to create a master dataset.
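When joining tables it helps to check how many rows matched and whether the key is unique on the side where it should be. The sketch below uses two options of `pd.merge`, `indicator` and `validate`, on toy frames; the frames `left` and `right` and their values are made up for illustration.

```python
import pandas as pd

# Toy frames (hypothetical values): movie 3 has no rating, movie 4 has no metadata.
left = pd.DataFrame({"movieId": [1, 2, 3], "title": ["A", "B", "C"]})
right = pd.DataFrame({"movieId": [1, 1, 2, 4], "rating": [4.0, 3.0, 5.0, 2.5]})

merged = pd.merge(
    left, right, on="movieId", how="outer",
    indicator=True,          # adds a `_merge` column: left_only / right_only / both
    validate="one_to_many",  # raises MergeError if movieId repeats in `left`
)

# How many rows came from both sides, or from only one of them.
merge_counts = merged["_merge"].value_counts()
print(merge_counts)
```

Rows marked `left_only` or `right_only` are the ones that will carry NaN values after the join, so the counts give a quick picture of how complete the merged table is.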

#### Joining the tags dataframes with each other

In [None]:
gnome_tags.columns

Index(['tagId', 'tag'], dtype='object')

In [None]:
gnome_tags.shape

(1128, 2)

In [None]:
tags.columns

Index(['userId', 'movieId', 'tag', 'timestamp'], dtype='object')

In [None]:
tags.shape

(465564, 4)

In [None]:
tag_merged=pd.merge(left=tags,right=gnome_tags,on="tag",how="left")

In [None]:
tag_merged.isna().sum()

userId            0
movieId           0
tag               0
timestamp         0
tagId        247977
dtype: int64
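More than half of the rows (247977 of 465564) have no tagId after this exact-match merge. One possible contributor, which is an assumption and not verified here, is a difference in letter case or surrounding whitespace between user-entered tags and the genome vocabulary. The sketch below shows, on toy data, how merging on a normalized key could be tested; `toy_tags`, `toy_genome`, `normalize`, and `tag_key` are hypothetical names.

```python
import pandas as pd

# Toy user tags and a toy genome vocabulary (illustrative values).
toy_tags = pd.DataFrame({"tag": ["Dark Hero", " quirky ", "Mark Waters"]})
toy_genome = pd.DataFrame({"tagId": [288, 829], "tag": ["dark hero", "quirky"]})

def normalize(s: pd.Series) -> pd.Series:
    """Lower-case the tags and strip surrounding whitespace."""
    return s.str.strip().str.lower()

toy_tags["tag_key"] = normalize(toy_tags["tag"])
toy_genome["tag_key"] = normalize(toy_genome["tag"])

# Exact-match merge on the raw tag text.
exact = toy_tags.merge(toy_genome[["tag", "tagId"]], on="tag", how="left")

# Merge on the normalized key instead.
normalized = toy_tags.merge(toy_genome[["tag_key", "tagId"]], on="tag_key", how="left")

unmatched_exact = int(exact["tagId"].isna().sum())
unmatched_normalized = int(normalized["tagId"].isna().sum())
print(unmatched_exact, unmatched_normalized)
```

Tags such as person names have no counterpart in the genome vocabulary, so some unmatched rows are expected whichever key is used.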

In [None]:
tag_merged

Unnamed: 0,userId,movieId,tag,timestamp,tagId
0,18,4141,Mark Waters,2009-04-24 18:19:40,
1,65,208,dark hero,2013-05-10 01:41:18,288.0
2,65,353,dark hero,2013-05-10 01:41:19,288.0
3,65,521,noir thriller,2013-05-10 01:39:43,712.0
4,65,592,dark hero,2013-05-10 01:41:18,288.0
...,...,...,...,...,...
465559,138446,55999,dragged,2013-01-23 23:29:32,
465560,138446,55999,Jason Bateman,2013-01-23 23:29:38,
465561,138446,55999,quirky,2013-01-23 23:29:38,829.0
465562,138446,55999,sad,2013-01-23 23:29:32,871.0


In [None]:
tag_merged[tag_merged.tagId.isna()]

Unnamed: 0,userId,movieId,tag,timestamp,tagId
0,18,4141,Mark Waters,2009-04-24 18:19:40,
26,65,27803,Oscar (Best Foreign Language Film),2011-05-10 06:25:15,
27,65,27866,New Zealand,2011-05-09 16:05:53,
29,65,48082,unusual,2011-05-09 16:25:59,
33,65,58652,girls who play boys,2011-05-09 16:10:03,
...,...,...,...,...,...
465556,138446,7045,Scary Movies To See on Halloween,2013-01-23 23:27:40,
465557,138446,7164,Peter Pan,2013-01-23 23:30:55,
465559,138446,55999,dragged,2013-01-23 23:29:32,
465560,138446,55999,Jason Bateman,2013-01-23 23:29:38,


In [None]:
tags

Unnamed: 0,userId,movieId,tag,timestamp
0,18,4141,Mark Waters,2009-04-24 18:19:40
1,65,208,dark hero,2013-05-10 01:41:18
2,65,353,dark hero,2013-05-10 01:41:19
3,65,521,noir thriller,2013-05-10 01:39:43
4,65,592,dark hero,2013-05-10 01:41:18
...,...,...,...,...
465559,138446,55999,dragged,2013-01-23 23:29:32
465560,138446,55999,Jason Bateman,2013-01-23 23:29:38
465561,138446,55999,quirky,2013-01-23 23:29:38
465562,138446,55999,sad,2013-01-23 23:29:32


#### Join movie with ratings

In [None]:
print("Movie=",movie.shape,"Ratings =",rating.shape)

Movie= (27278, 3) Ratings = (20000263, 4)


In [None]:
print("Movie Columns=",movie.columns,"Rating Columns =",rating.columns)

Movie Columns= Index(['movieId', 'title', 'genres'], dtype='object') Rating Columns = Index(['userId', 'movieId', 'rating', 'timestamp'], dtype='object')


In [None]:
rating_merged=pd.merge(left=movie,right=rating,on="movieId",how="outer")
rating_merged
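The merged frame repeats each movie's title and genres string once per rating, which can be costly with about 20 million rows. One option, sketched here on synthetic data, is to store those repeated strings as pandas categoricals. The frames `toy_movie`, `toy_rating`, and `toy_merged` are hypothetical, and the actual saving on the full dataset is not measured here.

```python
import pandas as pd

toy_movie = pd.DataFrame({
    "movieId": [1, 2],
    "title": ["Toy Story (1995)", "Jumanji (1995)"],
    "genres": ["Adventure|Animation|Children|Comedy|Fantasy",
               "Adventure|Children|Fantasy"],
})

# 1,000 synthetic ratings spread evenly over the two movies.
toy_rating = pd.DataFrame({
    "userId": range(1000),
    "movieId": [1, 2] * 500,
    "rating": [3.5, 4.0] * 500,
})

toy_merged = pd.merge(left=toy_movie, right=toy_rating, on="movieId", how="outer")

# Memory footprint with plain string columns.
before = toy_merged.memory_usage(deep=True).sum()

# Store each distinct string once and keep only small integer codes per row.
for col in ["title", "genres"]:
    toy_merged[col] = toy_merged[col].astype("category")

after = toy_merged.memory_usage(deep=True).sum()
print(before, after)
```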