## Dataset loading

**Dataset description**

Dataset used: IMDB Dataset

1. The IMDB dataset contains 50K movie reviews for natural language processing and text analytics.
2. The task is binary sentiment classification.
3. Total instances: 50,000 (sample used in this notebook: 22).
4. The goal is to classify each review as positive or negative using a classification algorithm.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


**Import necessary libraries**

In [None]:
import re
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer, word_tokenize

import sklearn.metrics
import sklearn.model_selection
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import classification_report, confusion_matrix

# Silence all warnings (FutureWarning included) to keep the notebook output readable
warnings.filterwarnings("ignore")

# Download each NLTK resource once
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


**Load the data**

In [None]:
df = pd.read_csv('/content/drive/MyDrive/SACE/IMDB_sample.csv')
df

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
5,"Probably my all-time favorite movie, a story o...",positive
6,I sure would like to see a resurrection of a u...,positive
7,"This show was an amazing, fresh & innovative i...",negative
8,Encouraged by the positive comments about this...,negative
9,If you like original gut wrenching laughter yo...,positive


In [None]:
df['sentiment'].value_counts()

positive    11
negative    11
Name: sentiment, dtype: int64

## Data wrangling

1. Lower-casing
2. HTML tag and URL removal
3. Number removal
4. Tokenizing: `RegexpTokenizer`
5. Stop word removal: NLTK English stop words, plus tokens of two characters or fewer
6. Stemming: `PorterStemmer`
7. Lemmatizing: `WordNetLemmatizer`




In [None]:
import re

# Build the stop word set once; calling stopwords.words() for every token is slow
stop_words = set(stopwords.words('english'))

def preprocess(sentence):
    sentence = str(sentence).lower()
    cleantext = re.sub(r'<.*?>', '', sentence)     # strip HTML tags such as <br />
    rem_url = re.sub(r'http\S+', '', cleantext)    # strip URLs
    rem_num = re.sub(r'[0-9]+', '', rem_url)       # strip digits
    tokens = RegexpTokenizer(r'\w+').tokenize(rem_num)
    # Keep tokens longer than two characters that are not stop words
    filtered_words = [w for w in tokens if len(w) > 2 and w not in stop_words]
    # Stems and lemmas are computed here but not returned: the cleanText column
    # in the output below holds the filtered tokens (e.g. "reviewers").
    stem_words = [stemmer.stem(w) for w in filtered_words]
    lemma_words = [lemmatizer.lemmatize(w) for w in stem_words]
    return " ".join(filtered_words)

df['cleanText'] = df['review'].map(preprocess)

In [None]:
df.head(5)

Unnamed: 0,review,sentiment,cleanText,sentiment_le
0,One of the other reviewers has mentioned that ...,positive,one reviewers mentioned watching episode hooke...,1
1,A wonderful little production. <br /><br />The...,positive,wonderful little production filming technique ...,1
2,I thought this was a wonderful way to spend ti...,positive,thought wonderful way spend time hot summer we...,1
3,Basically there's a family where a little boy ...,negative,basically family little boy jake thinks zombie...,0
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,petter mattei love time money visually stunnin...,1


In [None]:
def preprocess(sentence):
    sentence=str(sentence)
    sentence = sentence.lower()
    print("Sentence in lower case:\n", sentence)
    print("_________________________________________________")
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, '', sentence)
    rem_url=re.sub(r'http\S+', '',cleantext)
    print("After removing urls:\n", rem_url)
    print("_________________________________________________")
    rem_num = re.sub('[0-9]+', '', rem_url)
    print("After removing numbers:\n", rem_num)
    print("_________________________________________________")
    #rem_punc = re.sub(r'[^\w\s]', '', rem_num)
    tokenizer = RegexpTokenizer(r'\w+')
    tokens = tokenizer.tokenize(rem_num)  
    print("After tokenization:\n",tokens)
    print("_________________________________________________")
    filtered_words = [w for w in tokens if len(w) > 2 and w not in stopwords.words('english')]
    print("After removing stop words:\n", filtered_words)
    print("_________________________________________________")
    stem_words=[stemmer.stem(w) for w in filtered_words]
    print("After stemming:\n", stem_words)
    print("_________________________________________________")
    lemma_words=[lemmatizer.lemmatize(w) for w in filtered_words]
    print("After lemmatization:\n", lemma_words)
    print("_________________________________________________")
    return " ".join(lemma_words)

sentence = "If you like 100 original gut wrenching laughter you will leaves like this movie. https:www.google.com \\ .<br /><br />Great Camp!!!"
print(sentence)
clean_sent = preprocess(sentence)
print(clean_sent)

If you like 100 original gut wrenching laughter you will leaves like this movie. https:www.google.com \ .<br /><br />Great Camp!!!
Sentence in lower case:
 if you like 100 original gut wrenching laughter you will leaves like this movie. https:www.google.com \ .<br /><br />great camp!!!
_________________________________________________
After removing urls:
 if you like 100 original gut wrenching laughter you will leaves like this movie.  \ .great camp!!!
_________________________________________________
After removing numbers:
 if you like  original gut wrenching laughter you will leaves like this movie.  \ .great camp!!!
_________________________________________________
After tokenization:
 ['if', 'you', 'like', 'original', 'gut', 'wrenching', 'laughter', 'you', 'will', 'leaves', 'like', 'this', 'movie', 'great', 'camp']
_________________________________________________
After removing stop words:
 ['like', 'original', 'gut', 'wrenching', 'laughter', 'leaves', 'like', 'movie', 'great', 
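
The trace above shows `http\S+` removing `https:www.google.com` even though it is not a well-formed URL, and `<.*?>` removing each `<br />` tag. The sketch below checks the three patterns in isolation using only the standard library. The helper name `strip_markup` and the sample string are hypothetical, and `re.findall(r"\w+", ...)` is used here as a stand-in for `RegexpTokenizer(r'\w+')`.

```python
import re

TAG_RE = re.compile(r"<.*?>")    # non-greedy: matches one tag at a time
URL_RE = re.compile(r"http\S+")  # "http" followed by everything up to the next whitespace
NUM_RE = re.compile(r"[0-9]+")   # runs of digits

def strip_markup(text):
    """Hypothetical helper: apply the substitutions in the same order as the notebook."""
    text = text.lower()
    text = TAG_RE.sub("", text)
    text = URL_RE.sub("", text)
    text = NUM_RE.sub("", text)
    return text

sample = "Rated 10 stars. <br /><br />See https:www.example.com now"
cleaned = strip_markup(sample)
tokens = re.findall(r"\w+", cleaned)  # stand-in for RegexpTokenizer(r'\w+')
print(tokens)

# Tags are replaced with an empty string, so neighbouring words can be joined
print(strip_markup("good<br />bad"))
```

Because the tag is replaced with an empty string, words on either side of a tag are joined when no whitespace separates them, which is why `.great` appears in the trace. Substituting a single space instead is one possible way to avoid that.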

## Feature extraction : TF-IDF

In [None]:
from sklearn.preprocessing import LabelEncoder
Encoder = LabelEncoder()
df['sentiment_le']=Encoder.fit_transform(df['sentiment'])
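
`LabelEncoder` assigns integer codes in sorted order of the class names, which is why `negative` becomes 0 and `positive` becomes 1 in the `sentiment_le` column. A minimal sketch on toy labels (not the notebook's data) that makes the mapping explicit:

```python
from sklearn.preprocessing import LabelEncoder

labels = ["positive", "negative", "positive", "negative"]  # toy labels for illustration
encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ is sorted, so index 0 is "negative" and index 1 is "positive"
mapping = {cls: int(idx) for idx, cls in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform turns predicted codes back into readable labels
print(encoder.inverse_transform([0, 1]))
```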

In [None]:
df.head(5)

Unnamed: 0,review,sentiment,cleanText,sentiment_le
0,One of the other reviewers has mentioned that ...,positive,one reviewers mentioned watching episode hooke...,1
1,A wonderful little production. <br /><br />The...,positive,wonderful little production filming technique ...,1
2,I thought this was a wonderful way to spend ti...,positive,thought wonderful way spend time hot summer we...,1
3,Basically there's a family where a little boy ...,negative,basically family little boy jake thinks zombie...,0
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,petter mattei love time money visually stunnin...,1


In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df['cleanText'], df['sentiment_le'], test_size=0.3)
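
The split above is unseeded and unstratified, so on a sample this small the class mix in the test set can change from run to run. The sketch below shows one possible variant using the `stratify` and `random_state` parameters of `train_test_split`. The toy `texts` and `labels` lists stand in for the notebook's columns, and the seed value 42 is an arbitrary assumption.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy stand-in for the 22-row sample: 11 positive (1) and 11 negative (0) labels
texts = [f"review {i}" for i in range(22)]
labels = [1] * 11 + [0] * 11

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels,
    test_size=0.3,
    stratify=labels,   # keep the class ratio similar in both splits
    random_state=42,   # arbitrary seed, chosen only to make the split repeatable
)
print(len(X_tr), len(X_te))
print(Counter(y_te))
```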

In [None]:
tfidf = TfidfVectorizer(max_features=20000)
X_train = tfidf.fit_transform(X_train)
X_test = tfidf.transform(X_test)
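
To see what the vectorizer stores in each cell, the sketch below recomputes the weights of one document by hand and compares them with `TfidfVectorizer`. It assumes scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`), under which the inverse document frequency is `ln((1 + n) / (1 + df)) + 1` and each row is scaled to unit length. The two-document corpus is a toy example, not the notebook's data.

```python
import math
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["great movie great cast", "boring movie"]  # toy corpus for illustration

vec = TfidfVectorizer()
matrix = vec.fit_transform(docs)
vocab = vec.vocabulary_  # term -> column index

# Recompute the weights for docs[0] by hand
n_docs = len(docs)
counts = {"great": 2, "movie": 1, "cast": 1}    # raw term counts in docs[0]
doc_freq = {"great": 1, "movie": 2, "cast": 1}  # number of documents containing each term
raw = {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1) for t, c in counts.items()}
norm = math.sqrt(sum(v * v for v in raw.values()))  # L2 norm of the row
manual = {t: v / norm for t, v in raw.items()}

for term, weight in manual.items():
    print(term, round(weight, 4), round(float(matrix[0, vocab[term]]), 4))
```

The term `movie` appears in both documents, so it gets a lower weight than a term with the same count that appears in only one.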

## Classification : SVM

In [None]:
from sklearn.svm import SVC  # Support Vector Classifier

# All values below are scikit-learn's defaults, spelled out explicitly
clf = SVC(C=1.0, kernel='rbf', degree=3, gamma='scale',
          coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200,
          class_weight=None, verbose=False, max_iter=-1, decision_function_shape='ovr',
          break_ties=False, random_state=None)

clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print('SVC')
print("Accuracy score:", sklearn.metrics.accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n",confusion_matrix(y_test,y_pred)) 
print("Classification report")
print(classification_report(y_test,y_pred))

SVC
Accuracy score: 0.42857142857142855
Confusion Matrix:
 [[3 0]
 [4 0]]
Classification report
              precision    recall  f1-score   support

           0       0.43      1.00      0.60         3
           1       0.00      0.00      0.00         4

    accuracy                           0.43         7
   macro avg       0.21      0.50      0.30         7
weighted avg       0.18      0.43      0.26         7
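
The figures in the report can be reproduced directly from the confusion matrix, where rows are true labels and columns are predicted labels. The sketch below uses only the standard library. The helper name `prf` is hypothetical, and returning 0.0 for a zero denominator is an assumption that matches the zeros printed in the report.

```python
# Confusion matrix from the output above: rows are true labels, columns are predictions
cm = [[3, 0],   # true 0: 3 predicted as 0, 0 predicted as 1
      [4, 0]]   # true 1: 4 predicted as 0, 0 predicted as 1

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

def prf(cls):
    """Precision, recall and F1 for one class; 0.0 when a denominator is zero."""
    tp = cm[cls][cls]
    predicted = cm[0][cls] + cm[1][cls]  # column sum: how often the class was predicted
    actual = sum(cm[cls])                # row sum: how often the class really occurred
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(round(accuracy, 2))
for cls in (0, 1):
    print(cls, [round(x, 2) for x in prf(cls)])
```

The second column of the matrix is all zeros: every one of the seven test reviews was predicted as class 0, so the accuracy equals the share of class 0 in the test set.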

