# **Sentiment Analysis Project**

# **What is spaCy?**
### ***spaCy is an open-source Python library for NLP. It is designed to make it easy to extract information from text and to perform general-purpose natural language preprocessing.***

# **Installation of spaCy**
### ***You can install spaCy using pip, the Python package manager. Run this command to install spaCy on your system.***
### **! python -m pip install spacy**
### ***After installing spaCy, you also need to download a trained model for the language you want to process.***
### **! python -m spacy download en_core_web_sm**
### ***en_core_web_sm is a default model for the English language. Since the models are quite large, it is best to install them separately; including all languages in one package would make the download too massive.***

### **Read the CSV file**

In [1]:
import pandas as pd
import numpy as np
import re
from bs4 import BeautifulSoup
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD as TSVD
from sklearn.svm import SVC, LinearSVC
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score, cross_val_predict
import spacy
from spacy.lang.en.stop_words import STOP_WORDS as stopwords

In [2]:
data = pd.read_csv('IMDB.csv')

# ***What does the data look like?***

In [3]:
data.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


# ***How big is the data?***

In [4]:
data.shape

(50000, 2)

# ***Separate the data into train and test sets***

In [5]:
train = data.iloc[:25000]
test = data.iloc[25000:]

In [6]:
train.shape, test.shape

((25000, 2), (25000, 2))

# ***What does the training data look like?***

In [7]:
train.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [9]:
! python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.5.0/en_core_web_sm-3.5.0-py3-none-any.whl (12.8 MB)


# **What does the load function do?**
### ***The load() function returns a Language callable object, which is commonly assigned to a variable called nlp.***

In [10]:
nlp=spacy.load('en_core_web_sm')

In [22]:
# Workaround in case an error such as "A UTF-8 locale is required. Got ANSI_X3.4-1968" occurs
import locale
locale.getpreferredencoding = lambda: "UTF-8"

###  ***The text is used to instantiate a Doc object. From there, you can access a whole range of information about the processed text. For instance, you can iterate over the Doc object with a list comprehension to produce a list with the text of each Token.***

### **Tokenizing a sample sentence**

In [11]:
txt=nlp('My name is mirza ahmad awais')
intro_tok=[token.text for token in txt]
intro_tok

['My', 'name', 'is', 'mirza', 'ahmad', 'awais']

### **Converting the text to lowercase**

In [12]:
# Converting the text to lowercase
train['review'] = train['review'].apply(lambda x: str(x).lower())

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  train['review'] = train['review'].apply(lambda x: str(x).lower())
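The warning above appears because `train` was created by slicing `data`, so pandas cannot tell whether the assignment is meant to modify the original frame. One common way to avoid it is to take an explicit copy when slicing. The sketch below uses a tiny hypothetical DataFrame in place of the IMDB data; the column names mirror the notebook, the rows are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the IMDB data (made-up rows, same column names)
data = pd.DataFrame({
    "review": ["Great MOVIE", "Bad Film", "So-so"],
    "sentiment": ["positive", "negative", "positive"],
})

# .copy() makes `train` an independent DataFrame, so later assignments
# are unambiguous and should not trigger SettingWithCopyWarning
train = data.iloc[:2].copy()
train["review"] = train["review"].str.lower()

print(train["review"].tolist())
print(data["review"].iloc[0])  # the original frame is left untouched
```

With the copy in place, the later preprocessing cells can keep assigning to `train['review']` in the same way.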


### ***Here's what the training data looks like after it's converted to lowercase***

In [13]:
train.head()

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production. <br /><br />the...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically there's a family where a little boy ...,negative
4,"petter mattei's ""love in the time of money"" is...",positive


### ***Contractions Expansion***
### ***Contraction expansion replaces contracted forms with their full forms, for example can't to cannot and haven't to have not. The same dictionary-based approach can easily be extended to correct common spelling mistakes or even to change character names in a short story.***

In [26]:
! pip install contractions

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [14]:
import contractions

In [15]:
contractions_dict = contractions.contractions_dict
contractions_dict

{"I'm": 'I am',
 "I'm'a": 'I am about to',
 "I'm'o": 'I am going to',
 "I've": 'I have',
 "I'll": 'I will',
 "I'll've": 'I will have',
 "I'd": 'I would',
 "I'd've": 'I would have',
 'Whatcha': 'What are you',
 "amn't": 'am not',
 "ain't": 'are not',
 "aren't": 'are not',
 "'cause": 'because',
 "can't": 'cannot',
 "can't've": 'cannot have',
 "could've": 'could have',
 "couldn't": 'could not',
 "couldn't've": 'could not have',
 "daren't": 'dare not',
 "daresn't": 'dare not',
 "dasn't": 'dare not',
 "didn't": 'did not',
 'didn’t': 'did not',
 "don't": 'do not',
 'don’t': 'do not',
 "doesn't": 'does not',
 "e'er": 'ever',
 "everyone's": 'everyone is',
 'finna': 'fixing to',
 'gimme': 'give me',
 "gon't": 'go not',
 'gonna': 'going to',
 'gotta': 'got to',
 "hadn't": 'had not',
 "hadn't've": 'had not have',
 "hasn't": 'has not',
 "haven't": 'have not',
 "he've": 'he have',
 "he's": 'he is',
 "he'll": 'he will',
 "he'll've": 'he will have',
 "he'd": 'he would',
 "he'd've": 'he would have',
 

### ***Replace each contraction found in the text with its expanded form.***

In [16]:
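def contraction_expansion(x):
    # Non-string values (for example NaN) are returned unchanged
    if not isinstance(x, str):
        return x
    # Replace every dictionary key found in the text with its expansion
    for key, value in contractions_dict.items():
        x = x.replace(key, value)
    return x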
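Because `str.replace` is case-sensitive and matches substrings, keys such as "I'm" will not match lowercased text, and "he's" also matches inside "she's". A possible refinement, sketched here under the assumption of a small hypothetical dictionary (`SAMPLE_CONTRACTIONS` stands in for `contractions.contractions_dict`), is to build one case-insensitive regular expression with word boundaries. Keys that begin with an apostrophe, such as 'cause, would need separate handling because `\b` does not match before them.

```python
import re

# Hypothetical mini-dictionary for illustration only
SAMPLE_CONTRACTIONS = {"can't": "cannot", "he's": "he is", "i'm": "i am"}

# Longest keys first, so a longer contraction wins over a shorter one
_keys = sorted(SAMPLE_CONTRACTIONS, key=len, reverse=True)
CONTRACTION_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in _keys) + r")\b",
    flags=re.IGNORECASE,
)

def expand_contractions(text):
    """Expand whole-word contractions, ignoring case."""
    if not isinstance(text, str):
        return text
    return CONTRACTION_PATTERN.sub(
        lambda m: SAMPLE_CONTRACTIONS[m.group(0).lower()], text
    )

print(expand_contractions("I'm sure he's right, she can't lose"))
print(expand_contractions("she's here"))  # not in the sample dictionary, left as-is
```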

In [17]:
train['review'] = train['review'].apply(lambda x: contraction_expansion(x))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  train['review'] = train['review'].apply(lambda x: contraction_expansion(x))


In [18]:
train.head()

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production. <br /><br />the...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically there is a family where a little boy...,negative
4,"petter mattei's ""love in the time of money"" is...",positive


# What is a Regular Expression? 
###  A regular expression (or RE) specifies a set of strings that matches it. Regular expressions can be concatenated to form new regular expressions; if A and B are both regular expressions, then AB is also a regular expression. 

# Quantifiers or Repetition Operators 
###  Quantifiers specify the number of occurrences to match. Quantifiers (*, +, ?, {m,n}, etc.) cannot be directly nested. 

### Dot character(.)
### This matches any character except a newline. If the DOTALL flag has been specified, this matches any character including a newline.

### Caret character(^)
### Matches the start of the string, and in MULTILINE mode also matches immediately after each newline.

### Dollar character($)
### Matches the end of the string or just before the newline at the end of the string, and in MULTILINE mode also matches before a newline. 

### Asterisk character(*)
### Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible. ab* will match ‘a’, ‘ab’, or ‘a’ followed by any number of ‘b’s.

### Plus character(+)
### Causes the resulting RE to match 1 or more repetitions of the preceding RE. ab+ will match ‘a’ followed by any non-zero number of ‘b’s; it will not match just ‘a’.

### Question mark character(?)
### Causes the resulting RE to match 0 or 1 repetitions of the preceding RE. ab? will match either ‘a’ or ‘ab’.

### {m}
### Specifies that exactly m copies of the previous RE should be matched; fewer matches cause the entire RE not to match. For example, a{6} will match exactly six 'a' characters, but not five.

### {m,n}
### Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as many repetitions as possible. For example, a{3,5} will match from 3 to 5 'a' characters. 

### {m,n}?
### Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as few repetitions as possible. For example, on the 6-character string 'aaaaaa', a{3,5} will match 5 'a' characters, while a{3,5}? will only match 3 characters.

### {m,n}+ 
### Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as many repetitions as possible without establishing any backtracking points. This is known as a possessive quantifier and is available from Python 3.11. For example, on the 6-character string 'aaaaaa', a{3,5}+aa attempts to match 5 'a' characters, then, requiring 2 more 'a's, needs more characters than are available and thus fails, while a{3,5}aa will match with a{3,5} capturing 5, then 4 'a's by backtracking, after which the final 2 'a's are matched by the final aa in the pattern. x{m,n}+ is equivalent to (?>x{m,n}).

### []
### Used to indicate a set of characters. Ranges of characters can be indicated by giving two characters and separating them by a '-', for example [a-z] will match any lowercase ASCII letter, [0-5][0-9] will match all the two-digit numbers from 00 to 59. 
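The operators described above can be tried out with a few short calls to Python's standard `re` module. This is a minimal sketch; the sample strings are made up for illustration.

```python
import re

text = "aaaaaa"

# Greedy {m,n} takes as many repetitions as possible, lazy {m,n}? as few
greedy = re.match(r"a{3,5}", text).group()
lazy = re.match(r"a{3,5}?", text).group()
print(greedy, lazy)

# * allows zero repetitions, + requires at least one, ? allows zero or one
print(bool(re.fullmatch(r"ab*", "a")))    # zero 'b's are allowed
print(bool(re.fullmatch(r"ab+", "a")))    # at least one 'b' is required
print(bool(re.fullmatch(r"ab?", "ab")))   # zero or one 'b'

# The dot skips newlines unless DOTALL is set
print(bool(re.match(r"a.b", "a\nb")))
print(bool(re.match(r"a.b", "a\nb", flags=re.DOTALL)))

# Character sets with ranges: two-digit numbers from 00 to 59
minutes = re.findall(r"[0-5][0-9]", "07 42 75 59")
print(minutes)
```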

## Functions

### Compile Method 
### import re
### Syntax
### re.compile(pattern, flags=0)
### Python's re.compile() function compiles a regular expression pattern provided as a string into a regex pattern object (re.Pattern). The pattern object can then be reused to search different target strings with its methods, such as match() or search().

### Search Method 
### Syntax
### re.search(pattern, string, flags=0)
### Scan through string looking for the first location where the regular expression pattern produces a match, and return a corresponding match object. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.


### Match Method 
### If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding match object. Return None if the string does not match the pattern; note that this is different from a zero-length match.

### Full Match Method 
### If the whole string matches the regular expression pattern, return a corresponding match object. Return None if the string does not match the pattern; note that this is different from a zero-length match.

### Split Method
### Split string by the occurrences of pattern. If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list. 

### For Example
###  re.split(r'(\W+)', 'Words, words, words.')
### Output
### ['Words', ', ', 'words', ', ', 'words', '.', '']

### Sub method
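### re.sub(pattern, repl, string, count=0, flags=0) performs a substitution: it returns the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string with repl. If the pattern is not found, the string is returned unchanged. Several different characters or words can be replaced in a single call by using a character set or alternation in the pattern.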
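The functions described in this section can be compared side by side. The following sketch uses only the standard `re` module; the sample strings are invented for illustration.

```python
import re

# compile() builds a reusable pattern object
digits = re.compile(r"\d+")

# search() scans the whole string for the first match
found = digits.search("order 66 shipped").group()

# match() only succeeds if the pattern matches at the start of the string
at_start = digits.match("order 66 shipped")
at_start_ok = digits.match("66 orders").group()

# fullmatch() requires the entire string to match
whole = digits.fullmatch("66 orders")

# split() cuts the string wherever the pattern matches
parts = re.split(r"\W+", "Words, words, words.")

# sub() replaces each match; a character set handles several characters at once
masked = re.sub(r"[aeiou]", "*", "regular")

print(found, at_start, at_start_ok, whole)
print(parts)
print(masked)
```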

### ***Removing Emails***

In [19]:
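# Compiled once and reused; there are no ^/$ anchors, so addresses embedded in a longer review are matched too
EMAIL_PATTERN = re.compile(r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9.-]+")

def remove_emails(x):
    return EMAIL_PATTERN.sub('', x)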

In [20]:
train['review'] = train['review'].apply(lambda x: remove_emails(x))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  train['review'] = train['review'].apply(lambda x: remove_emails(x))


In [21]:
train.head()

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production. <br /><br />the...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically there is a family where a little boy...,negative
4,"petter mattei's ""love in the time of money"" is...",positive


### Removing HTML Tags

In [22]:
train['review'] = train['review'].apply(lambda x: BeautifulSoup(x, 'lxml').get_text().strip())

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  train['review'] = train['review'].apply(lambda x: BeautifulSoup(x, 'lxml').get_text().strip())


In [23]:
train['review'].iloc[6005]

'pretty.pretty actresses and actors. pretty bad script. pretty frequent "let us strip to our undies" scenes. pretty fair f/x. pretty jarring location decisions (the college dorm room looks like a high-end hotel room - probably because it was shot at a hotel). pretty bland storyline. pretty awful dialog. pretty locations. pretty annoying editing, unless you like the music video flash-cut style.this one is not a guilty pleasure - this is more an embarrassing one. if you must watch this, pick a good dance/techno album and turn the sound off on the movie - you will see the pretty people in their pretty black undies, and probably follow the story just fine.the cast may be able to act - i doubt that anyone could look skilled given the lines/plot that they had to deal with.'
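In the output above, words on either side of a removed `<br />` tag are fused ("pretty.pretty", "style.this"), because the tags are dropped without leaving a space. With BeautifulSoup, passing a separator such as `get_text(separator=' ')` is one way to keep the words apart. The sketch below shows the same idea with a plain regular expression instead of BeautifulSoup, so it runs with the standard library alone; it is a simplified stand-in, not a full HTML parser, and the helper name is hypothetical.

```python
import re

# Simplified tag matcher: anything between '<' and the next '>'
TAG_PATTERN = re.compile(r"<[^>]+>")

def strip_tags_keep_spacing(html_text):
    # Replace each tag with a space so neighbouring words do not fuse
    no_tags = TAG_PATTERN.sub(" ", html_text)
    # Collapse the resulting runs of whitespace into single spaces
    return " ".join(no_tags.split())

sample = "music video flash-cut style.<br /><br />this one is not a guilty pleasure"
cleaned = strip_tags_keep_spacing(sample)
print(cleaned)
```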

In [24]:
train.sample(5)

Unnamed: 0,review,sentiment
21933,if you were brought up on a diet of gameshows ...,positive
22800,"odd one should be able to stumble into ""classe...",positive
24553,have wanted to see this for a while: i never t...,positive
22984,"this is pretty funny. ""saturday the 12th"", a?....",negative
17476,"hi to read the entire plot around ""oz"" just lo...",positive


### Removing Special Characters

In [25]:
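def RemoveSpecialChars(x):
    # Drop every character that is not a word character (letter, digit, underscore) or a space
    x = re.sub(r'[^\w ]+', "", x)
    # Collapse repeated whitespace into single spaces
    x = ' '.join(x.split())
    return x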

In [26]:
train['review'] = train['review'].apply(lambda x: RemoveSpecialChars(x))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  train['review'] = train['review'].apply(lambda x: RemoveSpecialChars(x))


In [27]:
train.sample(5)

Unnamed: 0,review,sentiment
11643,what a porn movie would look like if you took ...,negative
24650,let us see here are the highlights of the brai...,negative
19893,i love all types of films especially horror th...,negative
4224,i am an australian currently living in japan i...,positive
10933,most of us at least inhabit two worlds the rea...,positive


In [28]:
train['review'].iloc[6005]

'prettypretty actresses and actors pretty bad script pretty frequent let us strip to our undies scenes pretty fair fx pretty jarring location decisions the college dorm room looks like a highend hotel room probably because it was shot at a hotel pretty bland storyline pretty awful dialog pretty locations pretty annoying editing unless you like the music video flashcut stylethis one is not a guilty pleasure this is more an embarrassing one if you must watch this pick a good dancetechno album and turn the sound off on the movie you will see the pretty people in their pretty black undies and probably follow the story just finethe cast may be able to act i doubt that anyone could look skilled given the linesplot that they had to deal with'

### Lemmatization
### Lemmatization reduces inflected words to their dictionary base form, called the lemma. It takes the context of a word (such as its part of speech) into account to produce a meaningful base form. For example, it converts caring into care and actresses into actress.

In [29]:
train.sample(5)

Unnamed: 0,review,sentiment
9206,i have not seen every single movie that burt r...,positive
14829,my first thoughts on this film were of using s...,negative
17721,this ambitious film suffers most from writerdi...,negative
17595,a killer john karlen with a penchant for reall...,negative
2598,erika kohut is a woman with deep sexual proble...,positive


In [30]:
# def lemme(x):
#     x = str(x)
#     x_list = []
#     doc = nlp(x)
#     for token in doc:
#         lemma = token.lemma_
#         if lemma == '-PRON-' or lemma == 'be':
#             lemma = token.text
#         x_list.append(lemma)
#     return ' '.join(x_list)
def lemmatize_sentence(doc):
    """
    Lemmatizes a sentence using spaCy.
    """
    lemmatized_sentence = [token.lemma_ for token in doc]
    return ' '.join(lemmatized_sentence)

def handle_special_lemmas(lemma, word):
    """
    Returns the original word if the lemma is '-PRON-' or 'be'.
    Note: '-PRON-' is the pronoun lemma used by spaCy 2.x; spaCy 3.x models no longer produce it.
    """
    if lemma == '-PRON-':
        return word
    elif lemma == 'be':
        return word.lower()
    else:
        return lemma

In [31]:
docs = list(nlp.pipe(train['review'], batch_size=100))

In [32]:
lemmatized_text = []
for doc in docs:
    lemmatized_text.append(lemmatize_sentence(doc))
train['review'] = pd.Series(lemmatized_text).apply(lambda x: ' '.join([handle_special_lemmas(token.lemma_, token.text) for token in nlp(x)]))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  train['review'] = pd.Series(lemmatized_text).apply(lambda x: ' '.join([handle_special_lemmas(token.lemma_, token.text) for token in nlp(x)]))



### Removing Stop Words

In [33]:
stopwords

{"'d",
 "'ll",
 "'m",
 "'re",
 "'s",
 "'ve",
 'a',
 'about',
 'above',
 'across',
 'after',
 'afterwards',
 'again',
 'against',
 'all',
 'almost',
 'alone',
 'along',
 'already',
 'also',
 'although',
 'always',
 'am',
 'among',
 'amongst',
 'amount',
 'an',
 'and',
 'another',
 'any',
 'anyhow',
 'anyone',
 'anything',
 'anyway',
 'anywhere',
 'are',
 'around',
 'as',
 'at',
 'back',
 'be',
 'became',
 'because',
 'become',
 'becomes',
 'becoming',
 'been',
 'before',
 'beforehand',
 'behind',
 'being',
 'below',
 'beside',
 'besides',
 'between',
 'beyond',
 'both',
 'bottom',
 'but',
 'by',
 'ca',
 'call',
 'can',
 'cannot',
 'could',
 'did',
 'do',
 'does',
 'doing',
 'done',
 'down',
 'due',
 'during',
 'each',
 'eight',
 'either',
 'eleven',
 'else',
 'elsewhere',
 'empty',
 'enough',
 'even',
 'ever',
 'every',
 'everyone',
 'everything',
 'everywhere',
 'except',
 'few',
 'fifteen',
 'fifty',
 'first',
 'five',
 'for',
 'former',
 'formerly',
 'forty',
 'four',
 'from',
 'fron

In [34]:
len(stopwords)

326

In [35]:
def RemoveStopWords(x):
    return ' '.join([word for word in x.split() if word not in stopwords])

In [49]:
x = train['review'].iloc[6005]
x

'prettypretty actress and actor pretty bad script pretty frequent let we strip to our undies scene pretty fair fx pretty jarring location decision the college dorm room look like a highend hotel room probably because it be shoot at a hotel pretty bland storyline pretty awful dialog pretty location pretty annoying editing unless you like the music video flashcut stylethis one be not a guilty pleasure this be more an embarrassing one if you must watch this pick a good dancetechno album and turn the sound off on the movie you will see the pretty people in their pretty black undie and probably follow the story just finethe cast may be able to act I doubt that anyone could look skilled give the linesplot that they have to deal with'

In [50]:
# EXAMPLE CODE
print("length of x: ", len(x))

length of x:  735


In [51]:
x1 = RemoveStopWords(x)
x1

'prettypretty actress actor pretty bad script pretty frequent let strip undies scene pretty fair fx pretty jarring location decision college dorm room look like highend hotel room probably shoot hotel pretty bland storyline pretty awful dialog pretty location pretty annoying editing like music video flashcut stylethis guilty pleasure embarrassing watch pick good dancetechno album turn sound movie pretty people pretty black undie probably follow story finethe cast able act I doubt look skilled linesplot deal'

In [52]:
len(x1)

511

In [53]:
%%time
train['review'] = train['review'].apply(lambda x: RemoveStopWords(x))

CPU times: user 803 ms, sys: 4 ms, total: 807 ms
Wall time: 808 ms


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


In [54]:
train.sample(5)

Unnamed: 0,review,sentiment
18197,film young man painful journey discover sexual...,negative
11967,I rest head think reason movie killer shark an...,negative
13576,undying good game bring new element tired genr...,positive
16663,curse wolf start reluctant werewolf dakota ren...,negative
8716,I admit I m fan shakespeare I familiar play I ...,negative
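The sample above still contains the pronoun "I", because the lemmatizer outputs it in uppercase while the stop-word set holds the lowercase 'i'. A case-insensitive comparison would catch it. The sketch below assumes a tiny hypothetical stop-word set (`SAMPLE_STOPWORDS`) in place of spaCy's `STOP_WORDS`, and a hypothetical function name.

```python
# Hypothetical mini stop-word set for illustration; the notebook uses spaCy's STOP_WORDS
SAMPLE_STOPWORDS = {"i", "a", "the", "that"}

def remove_stop_words_ci(text, stop_words=SAMPLE_STOPWORDS):
    """Drop stop words, comparing in lowercase so 'I' and 'The' are caught too."""
    return " ".join(w for w in text.split() if w.lower() not in stop_words)

filtered = remove_stop_words_ci("I doubt that I admit The plot")
print(filtered)
```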


### Removing Rare Words
### Usually both the most frequent and the rarest words are not useful in providing contextual information. Very frequent words are called stop words. As stop words occur in almost every sentence or document, they do not help in uniquely identifying its content. Very rare words could sometimes be useful, but they are often so sparse that it is hard to draw insights from them. For example, if the word 'Ali' occurs 100 times and the word 'uzair' occurs only once, neither word is likely to help predict the sentiment of a text.

In [55]:
text = ' '.join(train['review'])

In [56]:
#text

In [57]:
len(text)

17493821

In [58]:
# Creating Frequency
text_series = pd.Series(text.split())

In [59]:
#most frequent words in a text
freq_comm = text_series.value_counts()

In [60]:
freq_comm

I                91453
movie            49653
film             45997
like             21503
good             20114
                 ...  
mimis                1
robitussen           1
girlewwww            1
brittleat            1
tvpersonality        1
Length: 128435, dtype: int64

In [61]:
rare_words = freq_comm[-82000: -1]
'rockumentarie' in rare_words

True

In [62]:
rare_words

wiccan         2
1850s          2
murad          2
tummy          2
twentyseven    2
              ..
193040         1
mimis          1
robitussen     1
girlewwww      1
brittleat      1
Length: 81999, dtype: int64
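Note that the slice `freq_comm[-82000: -1]` stops one entry short of the end, which is why the result holds 81,999 words and the very last word is left out. An alternative that avoids picking a slice size by hand is to select rare words by a count threshold. The sketch below uses `collections.Counter` on three made-up reviews; the threshold `min_count` is an assumption to tune, not a value taken from the notebook.

```python
from collections import Counter

# Made-up reviews standing in for train['review']
reviews = ["good movie good plot", "bad movie", "good rockumentarie"]

# Count every word across all reviews
counts = Counter(word for review in reviews for word in review.split())

# Assumed threshold: words seen fewer than min_count times count as rare
min_count = 2
rare = {word for word, n in counts.items() if n < min_count}

# A set gives fast membership tests while filtering
cleaned = [" ".join(w for w in review.split() if w not in rare) for review in reviews]
print(sorted(rare))
print(cleaned)
```

A similar effect can be had later in the pipeline through the `min_df` parameter of `TfidfVectorizer`, which ignores terms that appear in too few documents.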

In [63]:
# Removing the rare words selected above
train['review'] = train['review'].apply(lambda x: ' '.join([word for word in x.split() if word not in rare_words]))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  train['review'] = train['review'].apply(lambda x: ' '.join([word for word in x.split() if word not in rare_words]))


In [64]:
train['review'].sample(5)

11294    think oliver stone movie come mind big controv...
20809    emperor assassin subtitle minute long time pac...
1194     I surprised I film underrated I understand dis...
16294    I ve nsna I ve roger I feel good connery prove...
19145    question like film version ghost train invaria...
Name: review, dtype: object

### Converting the Data into Vectors

In [65]:
train['sentiment'].value_counts()

negative    12526
positive    12474
Name: sentiment, dtype: int64

In [66]:
X = train['review']
y = train['sentiment']

### Term Frequency and Inverse Document Frequency
### TF-IDF weights a word in proportion to the number of times it appears in a document, counterbalanced by the number of documents in which it is present. Hence, words like 'this', 'are' etc., that are commonly present in all the documents are not given a very high weight. However, a word that appears many times in only a few of the documents will be given a higher weight, as it might be indicative of the content of those documents.
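To make the weighting concrete, here is a small hand-rolled sketch of the calculation. It follows the formula that scikit-learn documents for `TfidfVectorizer` with default settings (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization), but it is an illustration on three made-up token lists, not a replacement for the vectorizer.

```python
import math
from collections import Counter

# Made-up tokenized documents
docs = [["good", "movie"], ["bad", "movie"], ["good", "good", "plot"]]
n_docs = len(docs)

# Document frequency: in how many documents each term appears
df = Counter(term for doc in docs for term in set(doc))

def smooth_idf(term):
    # Smoothed inverse document frequency
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf_vector(doc):
    # Raw term count times IDF, then scaled to unit Euclidean length
    tf = Counter(doc)
    raw = {t: tf[t] * smooth_idf(t) for t in tf}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()}

vec = tfidf_vector(docs[1])
print(vec)  # 'bad' (one document) outweighs 'movie' (two documents)
```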


In [67]:
tfidf = TfidfVectorizer()

In [68]:
X = tfidf.fit_transform(X)

In [69]:
X.shape

(25000, 46398)

In [70]:
X

<25000x46398 sparse matrix of type '<class 'numpy.float64'>'
	with 1930964 stored elements in Compressed Sparse Row format>

### Splitting Data into Training and Testing sets

In [71]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.5, random_state = 1, stratify=y)

In [72]:
print(y_train.value_counts(normalize=True)*100)
print(y_test.value_counts(normalize=True)*100)

negative    50.104
positive    49.896
Name: sentiment, dtype: float64
negative    50.104
positive    49.896
Name: sentiment, dtype: float64


### Dimensionality reduction using Truncated Singular Value Decomposition
### Singular Value Decomposition, or SVD for short, is a matrix factorization technique used in machine learning to make sense of large, high-dimensional data. In layman's terms: imagine you have many different toys that you want to organize. Some are big, some are small, some are red, some are blue, and so on. SVD finds the few underlying characteristics that account for most of the differences between the toys, so that each toy can be described with far fewer numbers.
### $A = U \Sigma V^T$
###  U and V are orthogonal matrices 
###  $\Sigma$ is a diagonal matrix whose entries are the singular values
### PCA is a more specialized technique that centers the data before decomposing it, which makes it well suited to dense numerical data. Truncated SVD skips the centering step, so it can be applied directly to large sparse matrices such as the TF-IDF matrix above.
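For dimensionality reduction, only the k largest singular values are kept. The symbols m, n and k below are notation introduced here for illustration: A is an m × n matrix (documents by terms) and k is the number of components kept, which corresponds to `n_components` in `TruncatedSVD`.

```latex
A \approx A_k = U_k \Sigma_k V_k^T, \qquad
U_k \in \mathbb{R}^{m \times k}, \quad
\Sigma_k = \operatorname{diag}(\sigma_1, \dots, \sigma_k), \quad
V_k \in \mathbb{R}^{n \times k}

A_{\text{reduced}} = A V_k = U_k \Sigma_k \in \mathbb{R}^{m \times k}
```

Each document is then represented by k numbers instead of n, with the singular values ordered so that $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_k$.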

In [73]:
# sum(TSVD.explained_variance_)

In [74]:
# %%time
# tsvd = TSVD(n_components=10000, random_state=4)
# X_train_tsvd = tsvd.fit_transform(X_train)

### Using SVC for Classification

In [93]:
# clf_svc = SVC(kernel='linear', random_state=0)
# clf_svc.fit(X_train, y_train)

In [97]:
# y_pred=clf_svc.predict(X_test)
# y_pred

array(['positive', 'negative', 'positive', ..., 'positive', 'negative',
       'positive'], dtype=object)

In [100]:
# confusion matrix and classification report
# cm=confusion_matrix(y_test, y_pred)
# print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

    negative       0.89      0.86      0.87      6263
    positive       0.86      0.89      0.87      6237

    accuracy                           0.87     12500
   macro avg       0.87      0.87      0.87     12500
weighted avg       0.87      0.87      0.87     12500



In [104]:
# from sklearn.metrics import accuracy_score
# score=accuracy_score(y_test, y_pred)
# print(score * 100)

87.248


In [107]:
# scores=cross_val_score(clf_svc, X_train, y_train, cv=6, n_jobs=-1)

In [76]:
%%time
#scores = cross_val_score(clf_svc, X_train, y_train, cv=6, n_jobs=-1)

CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 6.91 µs


In [110]:
# print(scores.mean()*100)

87.05605564836631


In [77]:
#scores

### Using Logistic Regression

In [111]:
from sklearn.linear_model import LogisticRegression

In [113]:
clf_lr = LogisticRegression()

In [114]:
X_train

<12500x46398 sparse matrix of type '<class 'numpy.float64'>'
	with 961715 stored elements in Compressed Sparse Row format>

In [115]:
scores = cross_val_score(clf_lr, X_train, y_train, cv=10, n_jobs=4)

In [116]:
scores

array([0.8696, 0.8752, 0.876 , 0.8624, 0.884 , 0.864 , 0.8672, 0.8536,
       0.8904, 0.864 ])

In [118]:
scores.mean() * 100

87.06400000000001

In [119]:
clf_lr.fit(X_train, y_train)

In [120]:
y_test_pred = clf_lr.predict(X_test)

In [121]:
print(classification_report(y_test, y_test_pred))

              precision    recall  f1-score   support

    negative       0.89      0.84      0.86      6263
    positive       0.85      0.89      0.87      6237

    accuracy                           0.87     12500
   macro avg       0.87      0.87      0.87     12500
weighted avg       0.87      0.87      0.87     12500



In [122]:
confusion_matrix(y_test, y_test_pred)

array([[5291,  972],
       [ 685, 5552]])
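The figures in the classification report can be recomputed from this confusion matrix by hand. In scikit-learn's layout the rows are the true labels and the columns are the predicted labels, here in the order negative, positive. The helper name below is hypothetical.

```python
# Confusion matrix copied from the output above
# Rows: true labels, columns: predicted labels, order [negative, positive]
cm = [[5291, 972],
      [685, 5552]]

def per_class_metrics(cm, k):
    """Return (precision, recall) for class index k of a square confusion matrix."""
    true_pos = cm[k][k]
    predicted_k = sum(row[k] for row in cm)  # column sum: everything predicted as k
    actual_k = sum(cm[k])                    # row sum: everything that truly is k
    return true_pos / predicted_k, true_pos / actual_k

accuracy = sum(cm[i][i] for i in range(len(cm))) / sum(map(sum, cm))

for k, label in enumerate(["negative", "positive"]):
    precision, recall = per_class_metrics(cm, k)
    print(f"{label}: precision={precision:.2f} recall={recall:.2f}")
print(f"accuracy={accuracy:.2f}")
```

The row sums, 6263 and 6237, match the support column of the report.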

In [123]:
clf_lr.predict(tfidf.transform(['American Psycho deserved an Oscar, they were robbed']))

array(['positive'], dtype=object)

In [124]:
y_real_pred = clf_lr.predict(tfidf.transform(test['review']))

In [125]:
print(classification_report(test['sentiment'], y_real_pred))

              precision    recall  f1-score   support

    negative       0.90      0.79      0.84     12474
    positive       0.81      0.91      0.86     12526

    accuracy                           0.85     25000
   macro avg       0.85      0.85      0.85     25000
weighted avg       0.85      0.85      0.85     25000
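The held-out reviews in `test` are passed to `tfidf.transform` as raw text, while the training reviews went through lowercasing, HTML removal, special-character removal, lemmatization and stop-word removal. That mismatch may account for part of the drop from 0.87 to 0.85. One way to keep the two sides aligned is to collect the cleaning steps in a single function and apply it to both. The sketch below covers only the steps that need no external library (a regex stands in for BeautifulSoup), and the function name and sample reviews are hypothetical.

```python
import re

TAG_PATTERN = re.compile(r"<[^>]+>")        # simplified HTML tag matcher
NON_WORD_PATTERN = re.compile(r"[^\w ]+")   # same character class as RemoveSpecialChars

def preprocess(text):
    """Apply the same cleaning steps to any review, train or test."""
    text = str(text).lower()
    text = TAG_PATTERN.sub(" ", text)
    text = NON_WORD_PATTERN.sub("", text)
    return " ".join(text.split())

# Made-up reviews standing in for train['review'] and test['review']
train_reviews = ["A wonderful little production. <br /><br />The END!"]
test_reviews = ["Basically there's a FAMILY..."]

train_clean = [preprocess(r) for r in train_reviews]
test_clean = [preprocess(r) for r in test_reviews]
print(train_clean)
print(test_clean)
```

The cleaned test reviews would then be passed to `tfidf.transform` in place of the raw column.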



In [128]:
clf_lr.predict(tfidf.transform(["Sir Hafiz Zeshan fail whole the class in ppit despite of few students."]))

array(['negative'], dtype=object)