## Data Wrangling for Review Sentimental Analysis and Recommnadation System 

Importing the library

In [53]:
import pandas as pd
import numpy as np
import re

First import the review dataset and making it ready for the next step.

In [54]:
review = pd.read_csv('/Users/Amin/Documents/GitHub/Review-Sentiment-Analysis-with-Recommendation-System/data/reviews.csv')

In [55]:
review.head()

Unnamed: 0,review_id,reviewer,movie,rating,review_summary,review_date,spoiler_tag,review_detail,helpful
0,rw1133942,OriginalMovieBuff21,Kill Bill: Vol. 2 (2004),8.0,Good follow up that answers all the questions,24 July 2005,0,"After seeing Tarantino's Kill Bill Vol: 1, I g...","['0', '1']"
1,rw1133959,lost-in-limbo,Feardotcom (2002),3.0,"""I couldn't make much sense of it myself"". Too...",24 July 2005,0,There's a Website called FearDotCom and anyone...,"['1', '4']"
2,rw1133985,NateManD,Persona (1966),10.0,Persona gives me all the reasons to love art-h...,24 July 2005,0,"Long before ""Muholland Drive"" there was anothe...","['9', '23']"
3,rw1133999,CAMACHO-4,War of the Worlds (2005),3.0,A disappointing film from the team that you Mi...,24 July 2005,0,Spielberg said this film is based on the H.G. ...,"['9', '14']"
4,rw1134010,CAMACHO-4,Mr. & Mrs. Smith (2005),6.0,A fun action movie with great chemistry,24 July 2005,0,"Director Doug Liman, who's gotten famous for m...","['1', '3']"


It looks a clean dataset, with name of the reviewer, name of the movie, rating that reviewer gets to film, summary and detail of the review and date of the review. 
First, there are some columns that are not related to the project goal and some need to do some changes.

In [4]:
review.describe()

Unnamed: 0,rating,spoiler_tag
count,465661.0,542461.0
mean,6.955223,0.241282
std,2.306315,0.427861
min,1.0,0.0
25%,6.0,0.0
50%,7.0,0.0
75%,9.0,0.0
max,10.0,1.0


In [5]:
review.isnull().sum()

review_id             0
reviewer              0
movie                 0
rating            76800
review_summary        1
review_date           0
spoiler_tag           0
review_detail         1
helpful               0
dtype: int64

In [56]:
review.dropna(inplace=True)

Some of the rows don't have the rating information. Becasue there is no way to fill these and there is enough data, those rows drop from the dataset.

In [57]:
review.reviewer.nunique()

11256

After droping these NaN values, there are 11256 unique reviewer.

In [8]:
#checking the type of the data
review.dtypes

review_id          object
reviewer           object
movie              object
rating            float64
review_summary     object
review_date        object
spoiler_tag         int64
review_detail      object
helpful            object
dtype: object

Next, the data type needs to be check and change to do some of the adjustment for the review, date, and rating columns.

In [58]:
#Droping the columns that are not related
review.drop(columns=['review_id','spoiler_tag','helpful','review_summary'],inplace=True)

In [59]:
review.head()

Unnamed: 0,reviewer,movie,rating,review_date,review_detail
0,OriginalMovieBuff21,Kill Bill: Vol. 2 (2004),8.0,24 July 2005,"After seeing Tarantino's Kill Bill Vol: 1, I g..."
1,lost-in-limbo,Feardotcom (2002),3.0,24 July 2005,There's a Website called FearDotCom and anyone...
2,NateManD,Persona (1966),10.0,24 July 2005,"Long before ""Muholland Drive"" there was anothe..."
3,CAMACHO-4,War of the Worlds (2005),3.0,24 July 2005,Spielberg said this film is based on the H.G. ...
4,CAMACHO-4,Mr. & Mrs. Smith (2005),6.0,24 July 2005,"Director Doug Liman, who's gotten famous for m..."


In [9]:
# Changing the type of the movie name and review details to a string.
review.review_detail = review.review_detail.astype('string')
review.movie = review.movie.astype('string')

In [10]:
review.dtypes

reviewer          object
movie             string
rating           float64
review_date       object
review_detail     string
dtype: object

In [11]:
#Remove the comma and dot from the review
review.review_detail = review.review_detail.str.replace(',','').replace('.','')

In [12]:
#Makes all the review lowercase
review.review_detail = review.review_detail.str.lower()

In [13]:
#Change the type of the review date to datetime
review.review_date = pd.to_datetime(review.review_date)

In [16]:
review.dtypes

reviewer                 object
movie                    string
rating                  float64
review_date      datetime64[ns]
review_detail            string
dtype: object

In [14]:
review.head()

Unnamed: 0,reviewer,movie,rating,review_date,review_detail
0,OriginalMovieBuff21,Kill Bill: Vol. 2 (2004),8.0,24 July 2005,"After seeing Tarantino's Kill Bill Vol: 1, I g..."
1,lost-in-limbo,Feardotcom (2002),3.0,24 July 2005,There's a Website called FearDotCom and anyone...
2,NateManD,Persona (1966),10.0,24 July 2005,"Long before ""Muholland Drive"" there was anothe..."
3,CAMACHO-4,War of the Worlds (2005),3.0,24 July 2005,Spielberg said this film is based on the H.G. ...
4,CAMACHO-4,Mr. & Mrs. Smith (2005),6.0,24 July 2005,"Director Doug Liman, who's gotten famous for m..."


The rating needs to change to a lable for sentiment analysis. Because I am going to make a recommandation system after, two lable chosed. " Love it" or "Not Love It". If the user love the movie we going to make a recommandation.

In [18]:
def conditions(review):
    if (review.rating > 7.0):
        return 'LOVE IT'

    else:
        return 'NOT LOVE IT'

review['lable'] = review.apply(conditions, axis=1)

In [19]:
#review.drop(columns=['rating'],inplace=True)
review.head()

Unnamed: 0,reviewer,movie,rating,review_date,review_detail,lable
0,OriginalMovieBuff21,Kill Bill: Vol. 2 (2004),8.0,2005-07-24,after seeing tarantino's kill bill vol: 1 i go...,LOVE IT
1,lost-in-limbo,Feardotcom (2002),3.0,2005-07-24,there's a website called feardotcom and anyone...,NOT LOVE IT
2,NateManD,Persona (1966),10.0,2005-07-24,"long before ""muholland drive"" there was anothe...",LOVE IT
3,CAMACHO-4,War of the Worlds (2005),3.0,2005-07-24,spielberg said this film is based on the h.g. ...,NOT LOVE IT
4,CAMACHO-4,Mr. & Mrs. Smith (2005),6.0,2005-07-24,director doug liman who's gotten famous for ma...,NOT LOVE IT


In [20]:
review.lable.value_counts()

NOT LOVE IT    249026
LOVE IT        216633
Name: lable, dtype: int64

It seems there is a good distribution of the lable.

Next the movies dataset will load so we can add the imdb Id to the dataset so we can use the IMDB dataset to add more information of the movie to the data.

In [60]:
movie = pd.read_csv('/Users/Amin/Documents/GitHub/Review-Sentiment-Analysis-with-Recommendation-System/data/movies.csv')

In [61]:
movie.head()

Unnamed: 0,imdbId,Imdb Link,Title,IMDB Score,Genre,local_image_path
0,114709,http://www.imdb.com/title/tt114709,Toy Story (1995),8.3,Animation;Adventure;Comedy,images/114709_.jpg
1,113497,http://www.imdb.com/title/tt113497,Jumanji (1995),6.9,Action;Adventure;Family,images/113497_.jpg
2,113277,http://www.imdb.com/title/tt113277,Heat (1995),8.2,Action;Crime;Drama,images/113277_.jpg
3,114319,http://www.imdb.com/title/tt114319,Sabrina (1995),6.3,Comedy;Drama,images/114319_.jpg
4,114576,http://www.imdb.com/title/tt114576,Sudden Death (1995),5.7,Action;Crime;Thriller,images/114576_.jpg


In [62]:
movie['imdbId'] = movie['imdbId'].astype('string')
movie['imdb_id'] = 'tt' + movie['imdbId'].str.rjust(7,'0')

In [22]:
movie['Title'] = movie['Title'].str.replace(r"\(.*\)","")

  movie['Title'] = movie['Title'].str.replace(r"\(.*\)","")


In [63]:
movie.head()

Unnamed: 0,imdbId,Imdb Link,Title,IMDB Score,Genre,local_image_path,imdb_id
0,114709,http://www.imdb.com/title/tt114709,Toy Story (1995),8.3,Animation;Adventure;Comedy,images/114709_.jpg,tt0114709
1,113497,http://www.imdb.com/title/tt113497,Jumanji (1995),6.9,Action;Adventure;Family,images/113497_.jpg,tt0113497
2,113277,http://www.imdb.com/title/tt113277,Heat (1995),8.2,Action;Crime;Drama,images/113277_.jpg,tt0113277
3,114319,http://www.imdb.com/title/tt114319,Sabrina (1995),6.3,Comedy;Drama,images/114319_.jpg,tt0114319
4,114576,http://www.imdb.com/title/tt114576,Sudden Death (1995),5.7,Action;Crime;Thriller,images/114576_.jpg,tt0114576


In [64]:
#Merge the review and movie
df = review.merge(movie,left_on='movie',right_on='Title')

In [65]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 479565 entries, 0 to 479564
Data columns (total 12 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   reviewer          479565 non-null  object 
 1   movie             479565 non-null  object 
 2   rating            479565 non-null  float64
 3   review_date       479565 non-null  object 
 4   review_detail     479565 non-null  object 
 5   imdbId            479565 non-null  string 
 6   Imdb Link         479565 non-null  object 
 7   Title             479565 non-null  object 
 8   IMDB Score        479452 non-null  float64
 9   Genre             479565 non-null  object 
 10  local_image_path  479565 non-null  object 
 11  imdb_id           479565 non-null  string 
dtypes: float64(2), object(8), string(2)
memory usage: 47.6+ MB


The dataset Imdb Id format is diffrent from the current ID which has a 7 digit format with "tt" in the beginning.
Because we are going to use the ID to merge this data with IMDB data, this columns need to be adjust to the current standard. So we need to add "tt" and make it 7 digit but adding zero to the left.

In [66]:
#make the imdbId to string,Add the "tt" and adjust the number of digit
df['imdbId'] = df['imdbId'].astype('string')
df['imdb_id'] = 'tt' + df['imdbId'].str.rjust(7, '0')

In [67]:
df.head()

Unnamed: 0,reviewer,movie,rating,review_date,review_detail,imdbId,Imdb Link,Title,IMDB Score,Genre,local_image_path,imdb_id
0,OriginalMovieBuff21,Kill Bill: Vol. 2 (2004),8.0,24 July 2005,"After seeing Tarantino's Kill Bill Vol: 1, I g...",378194,http://www.imdb.com/title/tt378194,Kill Bill: Vol. 2 (2004),8.0,Action;Crime;Drama,images/378194_.jpg,tt0378194
1,Bogmeister,Kill Bill: Vol. 2 (2004),9.0,15 August 2005,The 2nd half of Tarantino's tale of bloody rev...,378194,http://www.imdb.com/title/tt378194,Kill Bill: Vol. 2 (2004),8.0,Action;Crime;Drama,images/378194_.jpg,tt0378194
2,departed07,Kill Bill: Vol. 2 (2004),10.0,26 August 2005,The Bride is back and ready to kick ass in thi...,378194,http://www.imdb.com/title/tt378194,Kill Bill: Vol. 2 (2004),8.0,Action;Crime;Drama,images/378194_.jpg,tt0378194
3,Angeneer,Kill Bill: Vol. 2 (2004),10.0,8 September 2005,I'm very happy to admit that Tarantino proved ...,378194,http://www.imdb.com/title/tt378194,Kill Bill: Vol. 2 (2004),8.0,Action;Crime;Drama,images/378194_.jpg,tt0378194
4,LoneWolfAndCub,Kill Bill: Vol. 2 (2004),10.0,7 September 2005,Kill Bill Volume 2 (directed by Quentin Tarant...,378194,http://www.imdb.com/title/tt378194,Kill Bill: Vol. 2 (2004),8.0,Action;Crime;Drama,images/378194_.jpg,tt0378194


In [30]:
#The Movie title has the date at the end which needs to be remove.
df['movie'] = df['movie'].str.replace(r"\(.*\)","")

  df['movie'] = df['movie'].str.replace(r"\(.*\)","")


In [68]:
df.head()

Unnamed: 0,reviewer,movie,rating,review_date,review_detail,imdbId,Imdb Link,Title,IMDB Score,Genre,local_image_path,imdb_id
0,OriginalMovieBuff21,Kill Bill: Vol. 2 (2004),8.0,24 July 2005,"After seeing Tarantino's Kill Bill Vol: 1, I g...",378194,http://www.imdb.com/title/tt378194,Kill Bill: Vol. 2 (2004),8.0,Action;Crime;Drama,images/378194_.jpg,tt0378194
1,Bogmeister,Kill Bill: Vol. 2 (2004),9.0,15 August 2005,The 2nd half of Tarantino's tale of bloody rev...,378194,http://www.imdb.com/title/tt378194,Kill Bill: Vol. 2 (2004),8.0,Action;Crime;Drama,images/378194_.jpg,tt0378194
2,departed07,Kill Bill: Vol. 2 (2004),10.0,26 August 2005,The Bride is back and ready to kick ass in thi...,378194,http://www.imdb.com/title/tt378194,Kill Bill: Vol. 2 (2004),8.0,Action;Crime;Drama,images/378194_.jpg,tt0378194
3,Angeneer,Kill Bill: Vol. 2 (2004),10.0,8 September 2005,I'm very happy to admit that Tarantino proved ...,378194,http://www.imdb.com/title/tt378194,Kill Bill: Vol. 2 (2004),8.0,Action;Crime;Drama,images/378194_.jpg,tt0378194
4,LoneWolfAndCub,Kill Bill: Vol. 2 (2004),10.0,7 September 2005,Kill Bill Volume 2 (directed by Quentin Tarant...,378194,http://www.imdb.com/title/tt378194,Kill Bill: Vol. 2 (2004),8.0,Action;Crime;Drama,images/378194_.jpg,tt0378194


In [69]:
df.drop(columns=['imdbId','Imdb Link','Title','IMDB Score','Genre','local_image_path'],inplace=True)

In [70]:
df.head()

Unnamed: 0,reviewer,movie,rating,review_date,review_detail,imdb_id
0,OriginalMovieBuff21,Kill Bill: Vol. 2 (2004),8.0,24 July 2005,"After seeing Tarantino's Kill Bill Vol: 1, I g...",tt0378194
1,Bogmeister,Kill Bill: Vol. 2 (2004),9.0,15 August 2005,The 2nd half of Tarantino's tale of bloody rev...,tt0378194
2,departed07,Kill Bill: Vol. 2 (2004),10.0,26 August 2005,The Bride is back and ready to kick ass in thi...,tt0378194
3,Angeneer,Kill Bill: Vol. 2 (2004),10.0,8 September 2005,I'm very happy to admit that Tarantino proved ...,tt0378194
4,LoneWolfAndCub,Kill Bill: Vol. 2 (2004),10.0,7 September 2005,Kill Bill Volume 2 (directed by Quentin Tarant...,tt0378194


In [51]:
df.to_csv('/Users/Amin/Documents/GitHub/Review-Sentiment-Analysis-with-Recommendation-System/data/review_db.csv',index=False)

In [71]:
#Load the IMDB dataset to add more information to the data
imdb_db = pd.read_csv('/Users/Amin/Documents/GitHub/Review-Sentiment-Analysis-with-Recommendation-System/data/IMDb movies.csv')

  exec(code_obj, self.user_global_ns, self.user_ns)


In [72]:
imdb_db.columns

Index(['imdb_title_id', 'title', 'original_title', 'year', 'date_published',
       'genre', 'duration', 'country', 'language', 'director', 'writer',
       'production_company', 'actors', 'description', 'avg_vote', 'votes',
       'budget', 'usa_gross_income', 'worlwide_gross_income', 'metascore',
       'reviews_from_users', 'reviews_from_critics'],
      dtype='object')

In [73]:
#Merge the review data with imdb data on imdb ID
#imdb_db = df.merge(imdb_db,left_on='imdb_id',right_on='imdb_title_id')
imdb_db = imdb_db[imdb_db.imdb_title_id.isin(df.imdb_id)]
imdb_db.shape

(4013, 22)

In [74]:
imdb.head()

Unnamed: 0,year,genre,director,actors,description,Title,imdb_id
0,1915,"Drama, History, War",D.W. Griffith,"Henry B. Walthall, Lillian Gish, Mae Marsh, Mi...",The Stoneman family finds its friendship with ...,The Birth of a Nation,tt0004972
1,1920,"Fantasy, Horror, Mystery",Robert Wiene,"Werner Krauss, Conrad Veidt, Friedrich Feher, ...","Hypnotist Dr. Caligari uses a somnambulist, Ce...",The Cabinet of Dr. Caligari,tt0010323
2,1921,"Comedy, Drama, Family",Charles Chaplin,"Carl Miller, Edna Purviance, Jackie Coogan, Ch...","The Tramp cares for an abandoned child, but ev...",The Kid,tt0012349
3,1922,"Fantasy, Horror",F.W. Murnau,"Max Schreck, Gustav von Wangenheim, Greta Schr...",Vampire Count Orlok expresses interest in a ne...,Nosferatu,tt0013442
4,1923,"Action, Comedy, Thriller","Fred C. Newmeyer, Sam Taylor","Harold Lloyd, Mildred Davis, Bill Strother, No...",A boy leaves his small country town and heads ...,Safety Last!,tt0014429


In [45]:
imdb['Title'] = imdb['Title'].str.replace(r"\(.*\)","")

  imdb['Title'] = imdb['Title'].str.replace(r"\(.*\)","")


In [75]:
imdb = pd.merge(imdb_db,movie.drop_duplicates(),right_on='imdb_id',left_on='imdb_title_id',how='inner')

In [76]:
imdb.shape

(4013, 29)

In [77]:
imdb.head()

Unnamed: 0,imdb_title_id,title,original_title,year,date_published,genre,duration,country,language,director,...,metascore,reviews_from_users,reviews_from_critics,imdbId,Imdb Link,Title,IMDB Score,Genre,local_image_path,imdb_id
0,tt0004972,Nascita di una nazione,The Birth of a Nation,1915,1915-03-21,"Drama, History, War",195,USA,,D.W. Griffith,...,,368.0,97.0,4972,http://www.imdb.com/title/tt4972,The Birth of a Nation (1915),6.7,Drama;History;War,images/4972_.jpg,tt0004972
1,tt0010323,Il gabinetto del dottor Caligari,Das Cabinet des Dr. Caligari,1920,1920-02-27,"Fantasy, Horror, Mystery",76,Germany,German,Robert Wiene,...,,237.0,160.0,10323,http://www.imdb.com/title/tt10323,The Cabinet of Dr. Caligari (1920),8.1,Fantasy;Horror;Mystery,images/10323_.jpg,tt0010323
2,tt0012349,Il monello,The Kid,1921,1923-11-26,"Comedy, Drama, Family",68,USA,"English, None",Charles Chaplin,...,,173.0,105.0,12349,http://www.imdb.com/title/tt12349,The Kid (1921),8.3,Comedy;Drama;Family,images/12349_.jpg,tt0012349
3,tt0013442,Nosferatu - Il vampiro,"Nosferatu, eine Symphonie des Grauens",1922,1922-03-04,"Fantasy, Horror",94,Germany,German,F.W. Murnau,...,,419.0,202.0,13442,http://www.imdb.com/title/tt13442,Nosferatu (1922),8.0,Fantasy;Horror,images/13442_.jpg,tt0013442
4,tt0014429,Preferisco l'ascensore,Safety Last!,1923,1924-12-08,"Action, Comedy, Thriller",74,USA,English,"Fred C. Newmeyer, Sam Taylor",...,,91.0,93.0,14429,http://www.imdb.com/title/tt14429,Safety Last! (1923),8.2,Comedy;Thriller,images/14429_.jpg,tt0014429


For the sake of the hybrid recommandation, I am going to keep the name of the director and genre of the movie.

In [78]:
#Drop the unwanted columns
imdb.drop(columns=['imdbId','Imdb Link','IMDB Score','Genre','local_image_path',
                      'imdb_title_id','title','original_title','date_published','duration','country',
                      'language','writer','production_company','avg_vote','votes',
                      'budget','usa_gross_income','worlwide_gross_income','metascore','reviews_from_users',
                      'reviews_from_critics'],inplace=True)

In [34]:
imdb.drop(columns=['actors'],inplace=True)

In [79]:
imdb.to_csv('/Users/Amin/Documents/GitHub/Review-Sentiment-Analysis-with-Recommendation-System/data/imdb_subset.csv',index=False)

The genre columns include three value seperated by a comma. It needs to seperated to three columns for further use.

In [34]:
final_db[['genre_1','genre_2','genre_3']] = final_db['genre'].str.split(', ', expand=True)

In [35]:
final_db.drop(columns=['genre'],inplace=True)

Those movie that has only one genre, the other two columns will fill with the same genre.

In [36]:
final_db['genre_2'].fillna(final_db['genre_1'],inplace=True)

In [37]:
final_db['genre_3'].fillna(final_db['genre_2'],inplace=True)

In [38]:
final_db.head()

Unnamed: 0,reviewer,movie,review_summary,review_date,review_detail,lable,director,genre_1,genre_2,genre_3
0,OriginalMovieBuff21,Kill Bill: Vol. 2,good follow up that answers all the questions,2005-07-24,after seeing tarantino's kill bill vol: 1 i go...,LOVE IT,Quentin Tarantino,Action,Crime,Thriller
1,Bogmeister,Kill Bill: Vol. 2,the bride ends her rampage; we applaud,2005-08-15,the 2nd half of tarantino's tale of bloody rev...,LOVE IT,Quentin Tarantino,Action,Crime,Thriller
2,departed07,Kill Bill: Vol. 2,quentin tarantino's best since pulp fiction,2005-08-26,the bride is back and ready to kick ass in thi...,LOVE IT,Quentin Tarantino,Action,Crime,Thriller
3,Angeneer,Kill Bill: Vol. 2,quentin made me eat my hat,2005-09-08,i'm very happy to admit that tarantino proved ...,LOVE IT,Quentin Tarantino,Action,Crime,Thriller
4,LoneWolfAndCub,Kill Bill: Vol. 2,great ending to qt's tale of revenge,2005-09-07,kill bill volume 2 (directed by quentin tarant...,LOVE IT,Quentin Tarantino,Action,Crime,Thriller


In [39]:
#Checking the final dataset type
final_db.dtypes

reviewer                  object
movie                     object
review_summary            object
review_date       datetime64[ns]
review_detail             string
lable                     object
director                  object
genre_1                   object
genre_2                   object
genre_3                   object
dtype: object

In [40]:
#number of unique reviewer
final_db.reviewer.nunique()

11251

In [41]:
#Number of the unique movie
final_db.movie.nunique()

3925

In [42]:
#The lable distribution
final_db.lable.value_counts()

NOT LOVE IT    254791
LOVE IT        220320
Name: lable, dtype: int64

In [43]:
#saving the clean data 
final_db.to_csv('/Users/Amin/Documents/GitHub/Review-Sentiment-Analysis-with-Recommendation-System/data/clean_db.csv',index=False)