## Data Wrangling for Review Sentiment Analysis and Recommendation System

Importing the libraries

In [1]:
import pandas as pd
import numpy as np
import re

First, import the review dataset and prepare it for the next steps.

In [2]:
review = pd.read_csv('/Users/Amin/Documents/GitHub/Review-Sentiment-Analysis-with-Recommendation-System/data/reviews.csv')

In [3]:
review.head()

Unnamed: 0,review_id,reviewer,movie,rating,review_summary,review_date,spoiler_tag,review_detail,helpful
0,rw1133942,OriginalMovieBuff21,Kill Bill: Vol. 2 (2004),8.0,Good follow up that answers all the questions,24 July 2005,0,"After seeing Tarantino's Kill Bill Vol: 1, I g...","['0', '1']"
1,rw1133959,lost-in-limbo,Feardotcom (2002),3.0,"""I couldn't make much sense of it myself"". Too...",24 July 2005,0,There's a Website called FearDotCom and anyone...,"['1', '4']"
2,rw1133985,NateManD,Persona (1966),10.0,Persona gives me all the reasons to love art-h...,24 July 2005,0,"Long before ""Muholland Drive"" there was anothe...","['9', '23']"
3,rw1133999,CAMACHO-4,War of the Worlds (2005),3.0,A disappointing film from the team that you Mi...,24 July 2005,0,Spielberg said this film is based on the H.G. ...,"['9', '14']"
4,rw1134010,CAMACHO-4,Mr. & Mrs. Smith (2005),6.0,A fun action movie with great chemistry,24 July 2005,0,"Director Doug Liman, who's gotten famous for m...","['1', '3']"


The dataset looks clean. It contains the reviewer's name, the movie title, the rating the reviewer gave the film, a summary and the full text of the review, and the review date.
Some columns are not related to the project goal, and others need adjustments.

In [4]:
review.describe()

Unnamed: 0,rating,spoiler_tag
count,465661.0,542461.0
mean,6.955223,0.241282
std,2.306315,0.427861
min,1.0,0.0
25%,6.0,0.0
50%,7.0,0.0
75%,9.0,0.0
max,10.0,1.0


In [5]:
review.isnull().sum()

review_id             0
reviewer              0
movie                 0
rating            76800
review_summary        1
review_date           0
spoiler_tag           0
review_detail         1
helpful               0
dtype: int64

In [6]:
review.dropna(inplace=True)

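Some of the rows have no rating. Because there is no reliable way to fill these values and there is enough data left, those rows are dropped from the dataset. Since `dropna` is called without arguments, the rows where `review_summary` or `review_detail` is missing are dropped as well.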
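To make the effect of `dropna` concrete, here is a minimal sketch on a made-up frame (the `toy` data is hypothetical and not taken from the review file). It compares dropping on any missing value with dropping only on a missing rating.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the review data (values are made up).
toy = pd.DataFrame({
    "rating": [8.0, np.nan, 3.0, 6.0],
    "review_summary": ["good", "fine", None, "fun"],
    "review_detail": ["a", "b", "c", "d"],
})

# dropna() with no arguments removes a row if ANY column is missing.
dropped_any = toy.dropna()

# Restricting the check to 'rating' keeps the row whose only gap is the summary.
dropped_rating_only = toy.dropna(subset=["rating"])

print(len(toy), len(dropped_any), len(dropped_rating_only))
```

Which variant is appropriate depends on whether a review without a summary is still useful for the later steps.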

In [7]:
review.reviewer.nunique()

11256

After dropping the rows with NaN values, there are 11256 unique reviewers.

In [8]:
#checking the type of the data
review.dtypes

review_id          object
reviewer           object
movie              object
rating            float64
review_summary     object
review_date        object
spoiler_tag         int64
review_detail      object
helpful            object
dtype: object

Next, the data types need to be checked and changed so that the review, date, and rating columns can be adjusted.

In [9]:
#Dropping the columns that are not related to the project goal
review.drop(columns=['review_id','spoiler_tag','helpful'],inplace=True)

In [10]:
review.head()

Unnamed: 0,reviewer,movie,rating,review_summary,review_date,review_detail
0,OriginalMovieBuff21,Kill Bill: Vol. 2 (2004),8.0,Good follow up that answers all the questions,24 July 2005,"After seeing Tarantino's Kill Bill Vol: 1, I g..."
1,lost-in-limbo,Feardotcom (2002),3.0,"""I couldn't make much sense of it myself"". Too...",24 July 2005,There's a Website called FearDotCom and anyone...
2,NateManD,Persona (1966),10.0,Persona gives me all the reasons to love art-h...,24 July 2005,"Long before ""Muholland Drive"" there was anothe..."
3,CAMACHO-4,War of the Worlds (2005),3.0,A disappointing film from the team that you Mi...,24 July 2005,Spielberg said this film is based on the H.G. ...
4,CAMACHO-4,Mr. & Mrs. Smith (2005),6.0,A fun action movie with great chemistry,24 July 2005,"Director Doug Liman, who's gotten famous for m..."


In [11]:
# Changing the type of the movie name and review details to a string.
review.review_detail = review.review_detail.astype('string')
review.movie = review.movie.astype('string')

In [12]:
review.dtypes

reviewer           object
movie              string
rating            float64
review_summary     object
review_date        object
review_detail      string
dtype: object

In [13]:
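#Remove commas from the reviews. The chained .replace('.','') is Series.replace,
#which only matches cells equal to '.', so periods inside the text are kept.
review.review_detail = review.review_detail.str.replace(',','').replace('.','')
review.review_summary = review.review_summary.str.replace(',','').replace('.','')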
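A note on the two `replace` calls in the cell above: `Series.str.replace` works on substrings, while `Series.replace` matches whole cell values. That is why periods still appear in the review text shown further down. A minimal sketch on made-up strings (the series `s` is hypothetical) shows the difference and one way to strip periods too, if that is wanted.

```python
import pandas as pd

s = pd.Series(["Good, but long.", "."])

# Series.replace matches whole cell values: only the second cell changes.
whole_value = s.replace(".", "")

# Series.str.replace works on substrings; regex=False treats '.' literally.
substring = (
    s.str.replace(",", "", regex=False)
     .str.replace(".", "", regex=False)
)

print(whole_value.tolist())
print(substring.tolist())
```

Passing `regex=False` matters for the period, because in a regular expression `.` matches any character.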

In [14]:
#Make all the reviews lowercase
review.review_detail = review.review_detail.str.lower()
review.review_summary = review.review_summary.str.lower()

In [15]:
#Change the type of the review date to datetime
review.review_date = pd.to_datetime(review.review_date)
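The call above lets pandas infer the date format. Since the dates in the preview look like "24 July 2005", the format could also be given explicitly. The sketch below assumes every row uses day, full English month name, and four-digit year, which is not verified here; the `dates` series is made up.

```python
import pandas as pd

dates = pd.Series(["24 July 2005", "15 August 2005"])

# %d = day of month, %B = full month name, %Y = four-digit year.
parsed = pd.to_datetime(dates, format="%d %B %Y")

print(parsed.dt.strftime("%Y-%m-%d").tolist())
```

An explicit format raises an error on rows that do not match, which can surface malformed dates early.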

In [16]:
review.dtypes

reviewer                  object
movie                     string
rating                   float64
review_summary            object
review_date       datetime64[ns]
review_detail             string
dtype: object

In [17]:
review.head()

Unnamed: 0,reviewer,movie,rating,review_summary,review_date,review_detail
0,OriginalMovieBuff21,Kill Bill: Vol. 2 (2004),8.0,good follow up that answers all the questions,2005-07-24,after seeing tarantino's kill bill vol: 1 i go...
1,lost-in-limbo,Feardotcom (2002),3.0,"""i couldn't make much sense of it myself"". too...",2005-07-24,there's a website called feardotcom and anyone...
2,NateManD,Persona (1966),10.0,persona gives me all the reasons to love art-h...,2005-07-24,"long before ""muholland drive"" there was anothe..."
3,CAMACHO-4,War of the Worlds (2005),3.0,a disappointing film from the team that you mi...,2005-07-24,spielberg said this film is based on the h.g. ...
4,CAMACHO-4,Mr. & Mrs. Smith (2005),6.0,a fun action movie with great chemistry,2005-07-24,director doug liman who's gotten famous for ma...


The rating needs to be converted to a label for sentiment analysis. Because I am going to build a recommendation system afterwards, two labels were chosen: "LOVE IT" and "NOT LOVE IT". A rating above 7 counts as "LOVE IT". If the user loves the movie, we are going to make a recommendation.

In [18]:
def conditions(review):
    if (review.rating > 7.0):
        return 'LOVE IT'

    else:
        return 'NOT LOVE IT'

review['lable'] = review.apply(conditions, axis=1)
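The row-wise `apply` above works, but it can be slow on several hundred thousand rows. A vectorized alternative is sketched below with `np.where` on a made-up frame (the `toy` ratings are hypothetical). It uses the same threshold, so a rating of exactly 7 is still "NOT LOVE IT".

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"rating": [8.0, 3.0, 10.0, 7.0]})

# Same rule as conditions(): strictly greater than 7 means LOVE IT.
toy["lable"] = np.where(toy["rating"] > 7.0, "LOVE IT", "NOT LOVE IT")

print(toy)
```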

In [19]:
review.drop(columns=['rating'],inplace=True)
review.head()

Unnamed: 0,reviewer,movie,review_summary,review_date,review_detail,lable
0,OriginalMovieBuff21,Kill Bill: Vol. 2 (2004),good follow up that answers all the questions,2005-07-24,after seeing tarantino's kill bill vol: 1 i go...,LOVE IT
1,lost-in-limbo,Feardotcom (2002),"""i couldn't make much sense of it myself"". too...",2005-07-24,there's a website called feardotcom and anyone...,NOT LOVE IT
2,NateManD,Persona (1966),persona gives me all the reasons to love art-h...,2005-07-24,"long before ""muholland drive"" there was anothe...",LOVE IT
3,CAMACHO-4,War of the Worlds (2005),a disappointing film from the team that you mi...,2005-07-24,spielberg said this film is based on the h.g. ...,NOT LOVE IT
4,CAMACHO-4,Mr. & Mrs. Smith (2005),a fun action movie with great chemistry,2005-07-24,director doug liman who's gotten famous for ma...,NOT LOVE IT


In [20]:
review.lable.value_counts()

NOT LOVE IT    249026
LOVE IT        216633
Name: lable, dtype: int64

The labels are reasonably balanced: about 47% LOVE IT and 53% NOT LOVE IT.

Next, the movies dataset is loaded so that the IMDb ID can be added to the reviews. The ID is then used to join the IMDb dataset and add more information about each movie.

In [21]:
movie = pd.read_csv('/Users/Amin/Documents/GitHub/Review-Sentiment-Analysis-with-Recommendation-System/data/movies.csv')

In [22]:
movie.head()

Unnamed: 0,imdbId,Imdb Link,Title,IMDB Score,Genre,local_image_path
0,114709,http://www.imdb.com/title/tt114709,Toy Story (1995),8.3,Animation;Adventure;Comedy,images/114709_.jpg
1,113497,http://www.imdb.com/title/tt113497,Jumanji (1995),6.9,Action;Adventure;Family,images/113497_.jpg
2,113277,http://www.imdb.com/title/tt113277,Heat (1995),8.2,Action;Crime;Drama,images/113277_.jpg
3,114319,http://www.imdb.com/title/tt114319,Sabrina (1995),6.3,Comedy;Drama,images/114319_.jpg
4,114576,http://www.imdb.com/title/tt114576,Sudden Death (1995),5.7,Action;Crime;Thriller,images/114576_.jpg


In [23]:
#Merge the review and movie data on the movie title
df = review.merge(movie,left_on='movie',right_on='Title')
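The merged frame reports 479565 entries, which is more than the 465659 labelled reviews. One possible explanation, not verified here, is that some titles appear more than once in the movies file, so each matching review is repeated. The sketch below shows that effect on made-up data and how `validate` can guard against it; the `reviews` and `movies` frames are hypothetical.

```python
import pandas as pd

reviews = pd.DataFrame({
    "reviewer": ["a", "b", "c"],
    "movie": ["Heat (1995)", "Heat (1995)", "Sabrina (1995)"],
})
# 'Heat (1995)' is listed twice on purpose.
movies = pd.DataFrame({
    "Title": ["Heat (1995)", "Heat (1995)", "Sabrina (1995)"],
    "imdbId": [113277, 113277, 114319],
})

# Count repeated titles on the right-hand side of the merge.
dup_titles = int(movies["Title"].duplicated().sum())

# A plain inner merge repeats each 'Heat' review once per duplicate title.
merged_raw = reviews.merge(movies, left_on="movie", right_on="Title")

# Dropping duplicate titles first keeps one row per review;
# validate raises MergeError if the right-hand keys are not unique.
merged_dedup = reviews.merge(
    movies.drop_duplicates(subset="Title"),
    left_on="movie",
    right_on="Title",
    validate="many_to_one",
)

print(dup_titles, len(merged_raw), len(merged_dedup))
```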

In [24]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 479565 entries, 0 to 479564
Data columns (total 12 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   reviewer          479565 non-null  object        
 1   movie             479565 non-null  object        
 2   review_summary    479565 non-null  object        
 3   review_date       479565 non-null  datetime64[ns]
 4   review_detail     479565 non-null  string        
 5   lable             479565 non-null  object        
 6   imdbId            479565 non-null  int64         
 7   Imdb Link         479565 non-null  object        
 8   Title             479565 non-null  object        
 9   IMDB Score        479452 non-null  float64       
 10  Genre             479565 non-null  object        
 11  local_image_path  479565 non-null  object        
dtypes: datetime64[ns](1), float64(1), int64(1), object(8), string(1)
memory usage: 47.6+ MB


The IMDb ID in this dataset has a different format from the current standard, which is "tt" followed by a 7-digit number.
Because the ID is going to be used to merge this data with the IMDb data, the column needs to be adjusted to that standard: pad the number with zeros on the left to 7 digits and add the "tt" prefix.

In [25]:
#Convert imdbId to string, left-pad it with zeros to 7 digits, and add the "tt" prefix
df['imdbId'] = df['imdbId'].astype('string')
df['imdb_id'] = 'tt' + df['imdbId'].str.rjust(7, '0')
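As a quick check of the padding logic, here is a small sketch on made-up IDs (the `ids` series is hypothetical). It uses `str.zfill(7)`, which gives the same result as `str.rjust(7, '0')` for non-negative numbers.

```python
import pandas as pd

ids = pd.Series([114709, 378194, 9])

# Convert to string, left-pad with zeros to 7 digits, then add the prefix.
imdb_id = "tt" + ids.astype("string").str.zfill(7)

print(imdb_id.tolist())
```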

In [26]:
df.head()

Unnamed: 0,reviewer,movie,review_summary,review_date,review_detail,lable,imdbId,Imdb Link,Title,IMDB Score,Genre,local_image_path,imdb_id
0,OriginalMovieBuff21,Kill Bill: Vol. 2 (2004),good follow up that answers all the questions,2005-07-24,after seeing tarantino's kill bill vol: 1 i go...,LOVE IT,378194,http://www.imdb.com/title/tt378194,Kill Bill: Vol. 2 (2004),8.0,Action;Crime;Drama,images/378194_.jpg,tt0378194
1,Bogmeister,Kill Bill: Vol. 2 (2004),the bride ends her rampage; we applaud,2005-08-15,the 2nd half of tarantino's tale of bloody rev...,LOVE IT,378194,http://www.imdb.com/title/tt378194,Kill Bill: Vol. 2 (2004),8.0,Action;Crime;Drama,images/378194_.jpg,tt0378194
2,departed07,Kill Bill: Vol. 2 (2004),quentin tarantino's best since pulp fiction,2005-08-26,the bride is back and ready to kick ass in thi...,LOVE IT,378194,http://www.imdb.com/title/tt378194,Kill Bill: Vol. 2 (2004),8.0,Action;Crime;Drama,images/378194_.jpg,tt0378194
3,Angeneer,Kill Bill: Vol. 2 (2004),quentin made me eat my hat,2005-09-08,i'm very happy to admit that tarantino proved ...,LOVE IT,378194,http://www.imdb.com/title/tt378194,Kill Bill: Vol. 2 (2004),8.0,Action;Crime;Drama,images/378194_.jpg,tt0378194
4,LoneWolfAndCub,Kill Bill: Vol. 2 (2004),great ending to qt's tale of revenge,2005-09-07,kill bill volume 2 (directed by quentin tarant...,LOVE IT,378194,http://www.imdb.com/title/tt378194,Kill Bill: Vol. 2 (2004),8.0,Action;Crime;Drama,images/378194_.jpg,tt0378194


In [27]:
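#The movie title has the release year in parentheses at the end, which needs to be removed.
#regex=True is passed explicitly because newer pandas versions default to literal matching.
df['movie'] = df['movie'].str.replace(r"\(.*\)","",regex=True)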
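The pattern `\(.*\)` is greedy: it removes everything from the first "(" to the last ")" and leaves the space before the year in place. A stricter pattern, sketched below on made-up titles (the `titles` series is hypothetical), removes only a four-digit year at the end of the string together with the whitespace before it.

```python
import pandas as pd

titles = pd.Series([
    "Kill Bill: Vol. 2 (2004)",
    "Mr. & Mrs. Smith (2005)",
    "(500) Days of Summer (2009)",
])

# Greedy pattern: spans from the first '(' to the last ')'.
greedy = titles.str.replace(r"\(.*\)", "", regex=True)

# Anchored pattern: optional whitespace plus '(YYYY)' at the end only.
anchored = titles.str.replace(r"\s*\(\d{4}\)$", "", regex=True)

print(greedy.tolist())
print(anchored.tolist())
```

Whether the stricter pattern is needed depends on whether any title in the data contains other parentheses.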


In [28]:
df.head()

Unnamed: 0,reviewer,movie,review_summary,review_date,review_detail,lable,imdbId,Imdb Link,Title,IMDB Score,Genre,local_image_path,imdb_id
0,OriginalMovieBuff21,Kill Bill: Vol. 2,good follow up that answers all the questions,2005-07-24,after seeing tarantino's kill bill vol: 1 i go...,LOVE IT,378194,http://www.imdb.com/title/tt378194,Kill Bill: Vol. 2 (2004),8.0,Action;Crime;Drama,images/378194_.jpg,tt0378194
1,Bogmeister,Kill Bill: Vol. 2,the bride ends her rampage; we applaud,2005-08-15,the 2nd half of tarantino's tale of bloody rev...,LOVE IT,378194,http://www.imdb.com/title/tt378194,Kill Bill: Vol. 2 (2004),8.0,Action;Crime;Drama,images/378194_.jpg,tt0378194
2,departed07,Kill Bill: Vol. 2,quentin tarantino's best since pulp fiction,2005-08-26,the bride is back and ready to kick ass in thi...,LOVE IT,378194,http://www.imdb.com/title/tt378194,Kill Bill: Vol. 2 (2004),8.0,Action;Crime;Drama,images/378194_.jpg,tt0378194
3,Angeneer,Kill Bill: Vol. 2,quentin made me eat my hat,2005-09-08,i'm very happy to admit that tarantino proved ...,LOVE IT,378194,http://www.imdb.com/title/tt378194,Kill Bill: Vol. 2 (2004),8.0,Action;Crime;Drama,images/378194_.jpg,tt0378194
4,LoneWolfAndCub,Kill Bill: Vol. 2,great ending to qt's tale of revenge,2005-09-07,kill bill volume 2 (directed by quentin tarant...,LOVE IT,378194,http://www.imdb.com/title/tt378194,Kill Bill: Vol. 2 (2004),8.0,Action;Crime;Drama,images/378194_.jpg,tt0378194


In [29]:
#Load the IMDB dataset to add more information to the data
imdb_db = pd.read_csv('/Users/Amin/Documents/GitHub/Review-Sentiment-Analysis-with-Recommendation-System/data/IMDb movies.csv')


In [30]:
imdb_db.head()

Unnamed: 0,imdb_title_id,title,original_title,year,date_published,genre,duration,country,language,director,...,actors,description,avg_vote,votes,budget,usa_gross_income,worlwide_gross_income,metascore,reviews_from_users,reviews_from_critics
0,tt0000009,Miss Jerry,Miss Jerry,1894,1894-10-09,Romance,45,USA,,Alexander Black,...,"Blanche Bayliss, William Courtenay, Chauncey D...",The adventures of a female reporter in the 1890s.,5.9,154,,,,,1.0,2.0
1,tt0000574,The Story of the Kelly Gang,The Story of the Kelly Gang,1906,1906-12-26,"Biography, Crime, Drama",70,Australia,,Charles Tait,...,"Elizabeth Tait, John Tait, Norman Campbell, Be...",True story of notorious Australian outlaw Ned ...,6.1,589,$ 2250,,,,7.0,7.0
2,tt0001892,Den sorte drøm,Den sorte drøm,1911,1911-08-19,Drama,53,"Germany, Denmark",,Urban Gad,...,"Asta Nielsen, Valdemar Psilander, Gunnar Helse...",Two men of high rank are both wooing the beaut...,5.8,188,,,,,5.0,2.0
3,tt0002101,Cleopatra,Cleopatra,1912,1912-11-13,"Drama, History",100,USA,English,Charles L. Gaskill,...,"Helen Gardner, Pearl Sindelar, Miss Fielding, ...",The fabled queen of Egypt's affair with Roman ...,5.2,446,$ 45000,,,,25.0,3.0
4,tt0002130,L'Inferno,L'Inferno,1911,1911-03-06,"Adventure, Drama, Fantasy",68,Italy,Italian,"Francesco Bertolini, Adolfo Padovan",...,"Salvatore Papa, Arturo Pirovano, Giuseppe de L...",Loosely adapted from Dante's Divine Comedy and...,7.0,2237,,,,,31.0,14.0


In [31]:
#Merge the review data with imdb data on imdb ID
final_db = df.merge(imdb_db,left_on='imdb_id',right_on='imdb_title_id')

In [32]:
final_db.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 475111 entries, 0 to 475110
Data columns (total 35 columns):
 #   Column                 Non-Null Count   Dtype         
---  ------                 --------------   -----         
 0   reviewer               475111 non-null  object        
 1   movie                  475111 non-null  object        
 2   review_summary         475111 non-null  object        
 3   review_date            475111 non-null  datetime64[ns]
 4   review_detail          475111 non-null  string        
 5   lable                  475111 non-null  object        
 6   imdbId                 475111 non-null  string        
 7   Imdb Link              475111 non-null  object        
 8   Title                  475111 non-null  object        
 9   IMDB Score             474998 non-null  float64       
 10  Genre                  475111 non-null  object        
 11  local_image_path       475111 non-null  object        
 12  imdb_id                475111 non-null  obje

For the hybrid recommendation system, I am going to keep the director's name and the genre of each movie.

In [33]:
#Drop the unwanted columns
final_db.drop(columns=['imdbId','Imdb Link','Title','IMDB Score','local_image_path','imdb_id',
                      'imdb_title_id','title','original_title','date_published','duration','country',
                      'language','writer','production_company','description','avg_vote','votes',
                      'budget','usa_gross_income','worlwide_gross_income','metascore','reviews_from_users',
                      'reviews_from_critics','Genre','year','actors'],inplace=True)

The genre column holds up to three values separated by commas. It needs to be split into three columns for further use.

In [34]:
final_db[['genre_1','genre_2','genre_3']] = final_db['genre'].str.split(', ', expand=True)

In [35]:
final_db.drop(columns=['genre'],inplace=True)

For movies that have fewer than three genres, the empty columns are filled with the preceding genre, so a movie with only one genre has the same value in all three columns.

In [36]:
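#Assign the result back instead of using inplace=True on a single column,
#which newer pandas versions warn about and may not apply to final_db.
final_db['genre_2'] = final_db['genre_2'].fillna(final_db['genre_1'])

In [37]:
final_db['genre_3'] = final_db['genre_3'].fillna(final_db['genre_2'])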
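The split-and-fill steps can be tried end to end on a small made-up frame (the `toy` genres are hypothetical). The `reindex` call is an extra safeguard, assumed here, so that three columns exist even if no row has three genres.

```python
import pandas as pd

toy = pd.DataFrame({"genre": ["Action, Crime, Thriller", "Drama", "Comedy, Drama"]})

# Split on ', ' into separate columns; missing parts become None.
parts = toy["genre"].str.split(", ", expand=True)

# Guarantee exactly three columns, then name them.
parts = parts.reindex(columns=range(3))
parts.columns = ["genre_1", "genre_2", "genre_3"]

# Fill gaps from the column to the left, in order.
parts["genre_2"] = parts["genre_2"].fillna(parts["genre_1"])
parts["genre_3"] = parts["genre_3"].fillna(parts["genre_2"])

print(parts)
```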

In [38]:
final_db.head()

Unnamed: 0,reviewer,movie,review_summary,review_date,review_detail,lable,director,genre_1,genre_2,genre_3
0,OriginalMovieBuff21,Kill Bill: Vol. 2,good follow up that answers all the questions,2005-07-24,after seeing tarantino's kill bill vol: 1 i go...,LOVE IT,Quentin Tarantino,Action,Crime,Thriller
1,Bogmeister,Kill Bill: Vol. 2,the bride ends her rampage; we applaud,2005-08-15,the 2nd half of tarantino's tale of bloody rev...,LOVE IT,Quentin Tarantino,Action,Crime,Thriller
2,departed07,Kill Bill: Vol. 2,quentin tarantino's best since pulp fiction,2005-08-26,the bride is back and ready to kick ass in thi...,LOVE IT,Quentin Tarantino,Action,Crime,Thriller
3,Angeneer,Kill Bill: Vol. 2,quentin made me eat my hat,2005-09-08,i'm very happy to admit that tarantino proved ...,LOVE IT,Quentin Tarantino,Action,Crime,Thriller
4,LoneWolfAndCub,Kill Bill: Vol. 2,great ending to qt's tale of revenge,2005-09-07,kill bill volume 2 (directed by quentin tarant...,LOVE IT,Quentin Tarantino,Action,Crime,Thriller


In [39]:
#Checking the final dataset type
final_db.dtypes

reviewer                  object
movie                     object
review_summary            object
review_date       datetime64[ns]
review_detail             string
lable                     object
director                  object
genre_1                   object
genre_2                   object
genre_3                   object
dtype: object

In [40]:
#Number of unique reviewers
final_db.reviewer.nunique()

11251

In [41]:
#Number of unique movies
final_db.movie.nunique()

3925

In [42]:
#The label distribution
final_db.lable.value_counts()

NOT LOVE IT    254791
LOVE IT        220320
Name: lable, dtype: int64

In [43]:
#Saving the clean data
final_db.to_csv('/Users/Amin/Documents/GitHub/Review-Sentiment-Analysis-with-Recommendation-System/data/clean_db.csv',index=False)