# Data Cleaning

In [1]:
Sample_Text = "Title: Opera (1987) Director: Dario Argento Cast: Cristina Masillach, Ian Charleson, Urbano Barberini, Daria Nicolodi Review: The only other Argento movie I had seen was Suspiria and that one blew me away with its style, colors and spooky story line. I next decided to go with Opera as I had been told it was one of his best. Man, I think I'm discovering what will ultimately be one of my favorite horror directors.<br /><br />Opera is about a young opera singer who gets her big break when the main star of a creepy modern opera take on Mc Beth gets hit by a car. Betty is the understudy so she gets to do the part herself. Too bad for her there's a psycho after her who makes her watch the brutal murders of her friends and co-workers.<br /><br />Wow, Id heard good things about this here flick, but I wasn't prepared for the level of greatness to which this film would take me. Yeah the movie has its shortcomings to which Ill get to later. But for the most part the movie blew me away.<br /><br />First off, this movie is not as filled with lots of colors as Suspiria. I was expecting it to be a bit like suspiria in that department, but no, to my surprise it had its own look and feel. The film is somehow devoid of color. It does have lots in colors in certain scenes (like the masterful kitchen/living room sequence) where Argento fills the screen with lush greens and blues, but for the most part the film has a grayish, black tone to it all through out and I liked that it had its own distinctive look.<br /><br />The real stars of this show are the incredibly well orchestrated death sequences. Wow. Every death scene was like a work of art. Beauty in destruction. These are not just your typical hack and slash death sequences, these deaths were carefully constructed to shock and get the most out of its situations. Loved every second of them, there's plenty of blood and mayhem here, but with style. Not gonna spoil em though.<br /><br />Then there's the direction. 
Man, there's some really original and beautiful shots on this one. I loved the inventive use of the camera on this one. You thought that Tarantinos shot in Kill Bill vol. 1 where we see the bullet coming out of the chamber of the gun was original? Well this is the movie he lifted it from! I honestly believe that Tarantino was heavily influenced by this specific movie with certain scenes in Kill Bill Vol. 1. Heck in the making of feature he mentions that the whole scene with Beatrix in the hospital and Elle Driver coming to kill her was influenced by Italian Giallos, and here my friends is the proof of that. Anyhows, Tarantino references aside, this movie has some amazing camera shots, like those scenes of the crows flying through the crowd in the opera house...great stuff. And a main reason why Argentos becoming one of my favorites.<br /><br />The acting from most of the cast was alright, but the best by far was Cristina Marsillach as the tortured young opera singer Betty. The looks in her eyes as the murders were being committed were great. The rest of the cast was a little wooden and stiff, but nothing that would deter your enjoyment of the film.<br /><br />There were very few things I didn't like about this movie. First off logic was thrown out the window in certain scenes. Specially those involving Bettys reactions after shes seen the murders. It seem to me that for the longest time, she just went on about her business, not telling anyone about the whole thing. Not even the police. I mean if you see someone brutally murder a loved one in front of your eyes...you don't just walk away from the murder scene and continue with your life. Someone would have connected her to the murders. She might have even become a suspect herself...but no. Also the ending is a bit anti climactic. You'll have to see this to understand, but it seemed a bit unnecessary the way the film ended, it felt like it could have ended earlier. It would not have felt so redundant. 
But thats about it, not real big problems for me really since I was enjoying the rest of this beautiful film.<br /><br />I've still got a lot of Argento territory to cover...but I'm devouring every step of the way like if I was eating a plate of the most expensive caviar. This guys really good. I think of his films as works of art, and I've only seen two of em! Cant wait to discover the rest of his films. Argento, you the man! <br /><br />Rating: 41/2 out of 5"


# Libraries

In [68]:
from nltk.tokenize import RegexpTokenizer
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
import numpy as np

In [69]:
tokenizer = RegexpTokenizer(r'\w+')
english_stopwords = set(stopwords.words('english'))
ps = PorterStemmer()

In [70]:
def clean_review(review):
    
    # lower-case everything first so stopwords match regardless of case
    review = review.lower()
    
    # the data was collected by web scraping, so strip the leftover HTML line-break tags
    review = review.replace("<br />", " ")
    
    # split the text into word tokens; punctuation and special characters are dropped
    tokens = tokenizer.tokenize(review)
    
    # remove all stopwords
    new_tokens = [token for token in tokens if token not in english_stopwords]
    
    # stem each token, e.g. "jumping" -> "jump"
    cleaned_tokens = [ps.stem(token) for token in new_tokens]
    
    return ' '.join(cleaned_tokens)
    

In [71]:
clean_review(Sample_Text)

'titl opera 1987 director dario argento cast cristina masillach ian charleson urbano barberini daria nicolodi review argento movi seen suspiria one blew away style color spooki stori line next decid go opera told one best man think discov ultim one favorit horror director opera young opera singer get big break main star creepi modern opera take mc beth get hit car betti understudi get part bad psycho make watch brutal murder friend co worker wow id heard good thing flick prepar level great film would take yeah movi shortcom ill get later part movi blew away first movi fill lot color suspiria expect bit like suspiria depart surpris look feel film somehow devoid color lot color certain scene like master kitchen live room sequenc argento fill screen lush green blue part film grayish black tone like distinct look real star show incred well orchestr death sequenc wow everi death scene like work art beauti destruct typic hack slash death sequenc death care construct shock get situat love eve
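
The tag-stripping and tokenizing steps of `clean_review` can be tried in isolation using only the standard library. This is a minimal sketch, not the notebook's implementation: `strip_and_tokenize` is a hypothetical helper, and it assumes that `re.findall` with the same `\w+` pattern yields the same tokens as `RegexpTokenizer(r'\w+')` does for plain text. It also uses a regular expression for the tags so that `<br>`, `<br/>` and `<br />` are all handled.

```python
import re

def strip_and_tokenize(text):
    """Hypothetical helper: lower-case, drop <br> tags, split into word tokens."""
    text = text.lower()
    # Replace any <br>, <br/> or <br /> tag with a space.
    text = re.sub(r"<br\s*/?>", " ", text)
    # Keep runs of letters, digits and underscores; punctuation is dropped.
    return re.findall(r"\w+", text)

print(strip_and_tokenize("Wow, I'd heard good things.<br /><br />Can't wait!"))
# ['wow', 'i', 'd', 'heard', 'good', 'things', 'can', 't', 'wait']
```

Note that the apostrophe is not a word character, so contractions such as "I'd" and "Can't" are split into two tokens each.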

In [72]:
import pandas as pd

In [73]:
def clean_input_data(inputfile, outputfile):
    # load the raw data: column 0 holds the review text, column 1 the label
    df = pd.read_csv(inputfile)
    
    # clean each review
    reviews = df.iloc[:, 0].apply(clean_review)
    
    # encode the labels: 'pos' -> 1.0, anything else -> 0.0
    labels = np.where(df.iloc[:, 1] == 'pos', 1.0, 0.0)
    
    # build the cleaned dataset
    cleaned = pd.DataFrame({'review': reviews, 'label': labels})
    
    # store the cleaned dataset
    cleaned.to_csv(outputfile, index=False)

In [74]:
# Dataset location
inputfile = r"D:\github\Sentiment Analysis\Train\Train.csv"
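
Windows paths contain backslashes, which Python treats as the start of an escape sequence in ordinary string literals. The dataset path here happens to contain no valid escapes, but unrecognized ones such as `\g` trigger a warning in recent Python versions, and a folder name starting with `t` or `n` would silently change the string. A small sketch of the difference, using a hypothetical path that is not part of this project:

```python
# Hypothetical path, chosen because \t and \n are valid escape sequences.
plain = "C:\temp\new.csv"   # \t becomes a tab, \n becomes a newline
raw = r"C:\temp\new.csv"    # raw string: backslashes are kept literally

print(len(plain), len(raw))   # 13 15
print("\t" in plain)          # True: the path now contains a tab character
print(raw.count("\\"))        # 2: both backslashes survived
```

Raw strings (or forward slashes) avoid the problem.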

In [75]:
outputfile = "Cleaned_review.csv"

In [76]:
clean_input_data(inputfile , outputfile)

## Cleaning the testing data

In [66]:
def clean_testing_data(inputfile, outputfile):
    # load the raw data: column 0 holds the review text
    df = pd.read_csv(inputfile)
    
    # clean each review
    reviews = df.iloc[:, 0].apply(clean_review)
    
    # build the cleaned dataset (the testing data has no labels)
    cleaned = pd.DataFrame({'review': reviews})
    
    # store the cleaned dataset
    cleaned.to_csv(outputfile, index=False)

In [67]:
inputfile = r"D:\github\Sentiment Analysis\Test\Test.csv"
outputfile = "Cleaned_Testing_Data.csv"
clean_testing_data(inputfile , outputfile)