# <center> Sentiment Analysis </center>
    

Sentiment analysis helps us find the mood and emotion a customer expresses in a review. It is the process of analysing text and classifying it, for example as positive or negative, according to the needs of the research.

In [2]:
import pandas as pd
from textblob import TextBlob
from nltk.tokenize.toktok import ToktokTokenizer
import re
tokenizer=ToktokTokenizer()
import spacy
nlp=spacy.load('en_core_web_sm',disable=['ner'])

In [3]:
TextBlob("he is a very good boy").sentiment

Sentiment(polarity=0.9099999999999999, subjectivity=0.7800000000000001)

In [4]:
TextBlob("he is not a good boy").sentiment

Sentiment(polarity=-0.35, subjectivity=0.6000000000000001)

In [5]:
TextBlob("Everybody say this man is poor").sentiment

Sentiment(polarity=-0.4, subjectivity=0.6)

Polarity and Subjectivity:
- Polarity indicates whether a sentence is positive or negative. It lies in [-1, 1]: -1 => negative; 1 => positive.

- Subjectivity measures how much a sentence expresses personal opinion, emotion or judgement rather than fact. It lies in [0, 1]: 0 => objective (factual); 1 => subjective (personal opinion).
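As a small illustration of how the two scores can be read, the sketch below rebuilds the values printed in cells 3 and 4 in a stand-in namedtuple (TextBlob itself is not imported here) and turns them into a short description. The `describe` helper and its cut-offs (0 for polarity, 0.5 for subjectivity) are assumptions made for this example only.

```python
from collections import namedtuple

# Stand-in for the object TextBlob returns; the numbers are copied from the cell outputs above.
Sentiment = namedtuple("Sentiment", ["polarity", "subjectivity"])
good = Sentiment(polarity=0.9099999999999999, subjectivity=0.7800000000000001)
not_good = Sentiment(polarity=-0.35, subjectivity=0.6000000000000001)

def describe(s):
    """Hypothetical helper: map the two scores to words using illustrative cut-offs."""
    if s.polarity > 0:
        tone = "positive"
    elif s.polarity < 0:
        tone = "negative"
    else:
        tone = "neutral"
    kind = "subjective" if s.subjectivity >= 0.5 else "objective"
    return f"{tone}, mostly {kind}"

print(describe(good))      # positive, mostly subjective
print(describe(not_good))  # negative, mostly subjective

# The fields can be read by name or by tuple unpacking.
polarity, subjectivity = good
```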

In [6]:
### Data Loading
train= pd.read_csv("/kaggle/input/imdb-dataset-sentiment-analysis-in-csv-format/Train.csv")
train.head()

Unnamed: 0,text,label
0,I grew up (b. 1965) watching and loving the Th...,0
1,"When I put this movie in my DVD player, and sa...",0
2,Why do people who do not know what a particula...,0
3,Even though I have great interest in Biblical ...,0
4,Im a die hard Dads Army fan and nothing will e...,1


In [7]:
train.shape

(40000, 2)

In [8]:
label_0=train[train['label']==0].sample(n=5000)
label_1=train[train['label']==1].sample(n=5000)


In [9]:
train=pd.concat([label_1,label_0])
from sklearn.utils import shuffle
train=shuffle(train)

In [10]:
train


Unnamed: 0,text,label
19930,After seeing the movie last night I was left w...,1
32631,"This movie shows a row of sketches, which part...",0
17165,John Wayne is one of the few players in film h...,1
29585,They screwed up this story! In the end Nell is...,0
10321,I've seen this movie today for the first time ...,0
...,...,...
11813,This film without doubt is one of the worst I ...,0
6339,A couple of cowpokes help a group of Mormons c...,1
16324,Manhattan apartment dwellers have to put up wi...,1
7664,Mockney comes to Brighton; despite a poor rece...,0


### Data Preprocessing

In [11]:
train.isnull().sum()

text     0
label    0
dtype: int64

In [12]:
import numpy as np
train.replace(r'^\s*$',np.nan,regex=True,inplace=True)
train.dropna(axis=0,how='any',inplace=True)

These two lines first replace empty strings with NaN and then drop every row that contains a NaN.

train.replace(r'^\s*$', np.nan, regex=True, inplace=True): The pandas replace method swaps any cell that is empty or contains only whitespace (matched by the regular expression ^\s*$) for NaN. regex=True tells pandas to treat the pattern as a regular expression, and inplace=True modifies train directly instead of returning a new DataFrame.

train.dropna(axis=0, how='any', inplace=True): The dropna method removes rows (axis=0) from train. how='any' drops a row if at least one of its values is NaN, and inplace=True again modifies train directly.
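To make the pattern concrete, here is a minimal sketch that tests `^\s*$` against a few made-up strings with Python's `re` module. It only shows which strings the regular expression treats as empty; it does not call pandas.

```python
import re

# Same pattern as in the cell above: start of string, optional whitespace, end of string.
EMPTY = re.compile(r'^\s*$')

samples = ["", "   ", "\t\n", "ok", "  ok  "]
flags = [bool(EMPTY.match(s)) for s in samples]
print(flags)  # [True, True, True, False, False]
```

Only the strings without any visible character match, so a review with leading or trailing spaces is kept.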


In [13]:
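# First pattern: escape sequences typed as two literal characters (backslash + t/n/r).
# Second pattern: real tab, newline and carriage-return characters.
# The second pattern needs \r; a bare r deletes every letter "r", which is why the
# tables shown further down (produced with a bare r) read "Afte", "stoy", "fist".
train.replace(to_replace=[r"\\t|\\n|\\r","\t|\n|\r"],value=["",""],regex=True,inplace=True)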
print('escape seq removed')

escape seq removed
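In an alternation such as `\t|\n|r`, a bare `r` is an ordinary letter, not a carriage return. The sketch below, using `re.sub` on a made-up sentence, contrasts that pattern with `\t|\n|\r`.

```python
import re

text = "After the first round\r\n"

# A bare r matches the letter "r"; the carriage return is left in place.
with_bare_r = re.sub("\t|\n|r", "", text)

# \r matches the carriage-return character; letters are untouched.
with_escaped_r = re.sub("\t|\n|\r", "", text)

print(repr(with_bare_r))     # 'Afte the fist ound\r'
print(repr(with_escaped_r))  # 'After the first round'
```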


In [14]:
import numpy as np
train.replace(r'^\s*$',np.nan,regex=True,inplace=True)
train.dropna(axis=0,how='any',inplace=True)

In [15]:
train

Unnamed: 0,text,label
19930,Afte seeing the movie last night I was left wi...,1
32631,"This movie shows a ow of sketches, which patly...",0
17165,John Wayne is one of the few playes in film hi...,1
29585,They scewed up this stoy! In the end Nell is a...,0
10321,I've seen this movie today fo the fist time an...,0
...,...,...
11813,This film without doubt is one of the wost I h...,0
6339,A couple of cowpokes help a goup of Momons cos...,1
16324,Manhattan apatment dwelles have to put up with...,1
7664,Mockney comes to Bighton; despite a poo ecepti...,0


In [16]:
## Remove non-ASCII characters
train['text']=train['text'].str.encode('ascii','ignore').str.decode('ascii')
print('non-ascii data removed')

non-ascii data removed


In [17]:
### Remove Punctuation
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [18]:
def remove_punctuations(text):
    for punctuation in string.punctuation:
        text =text.replace(punctuation,'')
    return text
train['text']=train['text'].apply(remove_punctuations)
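The loop in `remove_punctuations` calls `str.replace` once per punctuation character. One possible alternative, sketched here under the assumption that only the characters in `string.punctuation` should go, is a single `str.translate` pass. The name `remove_punctuations_translate` is made up for this example.

```python
import string

# Translation table that maps every punctuation character to "delete".
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_punctuations_translate(text):
    """Remove all characters listed in string.punctuation in one pass."""
    return text.translate(_PUNCT_TABLE)

sample = "I've seen this movie; it's great!"
print(remove_punctuations_translate(sample))  # Ive seen this movie its great
```

Both versions give the same result; the table-based one simply avoids rescanning the text for each of the 32 characters.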

In [19]:
train

Unnamed: 0,text,label
19930,Afte seeing the movie last night I was left wi...,1
32631,This movie shows a ow of sketches which patly ...,0
17165,John Wayne is one of the few playes in film hi...,1
29585,They scewed up this stoy In the end Nell is al...,0
10321,Ive seen this movie today fo the fist time and...,0
...,...,...
11813,This film without doubt is one of the wost I h...,0
6339,A couple of cowpokes help a goup of Momons cos...,1
16324,Manhattan apatment dwelles have to put up with...,1
7664,Mockney comes to Bighton despite a poo eceptio...,0


In [20]:
import nltk
from nltk.corpus import stopwords
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [21]:
## Stop words are use-case dependent: keep 'no' and 'not' because negation changes the sentiment
stopword_list=nltk.corpus.stopwords.words('english')
stopword_list.remove('no')
stopword_list.remove('not')

In [22]:
def custom_remove_stopwords(text,is_lower_case=False):
    tokens=tokenizer.tokenize(text)
    tokens= [token.strip() for token in tokens]
    if is_lower_case:
        filtered_tokens =[token for token in tokens if token not in stopword_list]
    else:
        filtered_tokens= [token for token in tokens if token.lower() not in stopword_list]
    filtered_text=' '.join(filtered_tokens)
    return filtered_text


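- This function takes a piece of text and an optional boolean parameter (is_lower_case). It tokenizes the text and strips whitespace from each token. If is_lower_case is True, the text is assumed to be lowercase already and tokens are compared with the stopword list as they are; otherwise each token is lowercased for the comparison only, so the surviving tokens keep their original case. The remaining tokens are joined with spaces and returned as a string.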
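The reason for taking 'no' and 'not' out of the stopword list can be seen on a single sentence. The sketch below uses a tiny hypothetical stopword set and plain `str.split` instead of the NLTK list and `ToktokTokenizer`, so it only illustrates the idea.

```python
# Hypothetical miniature stopword set, not the NLTK list.
STOPWORDS = {"he", "is", "a", "the", "no", "not"}
NEGATIONS = {"no", "not"}

def strip_stopwords(text, stopwords):
    """Drop every whitespace-separated token whose lowercase form is a stopword."""
    return " ".join(tok for tok in text.split() if tok.lower() not in stopwords)

sentence = "He is not a good boy"

negation_removed = strip_stopwords(sentence, STOPWORDS)
negation_kept = strip_stopwords(sentence, STOPWORDS - NEGATIONS)

print(negation_removed)  # good boy
print(negation_kept)     # not good boy
```

With the full list, the negative sentence collapses to the same tokens as a positive one, which is what removing 'no' and 'not' from `stopword_list` avoids.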

In [23]:
train['text']=train['text'].apply(custom_remove_stopwords)

In [24]:
train

Unnamed: 0,text,label
19930,Afte seeing movie last night left sense hopele...,1
32631,movie shows ow sketches patly pass ove one ano...,0
17165,John Wayne one playes film histoy failed fist ...,1
29585,scewed stoy end Nell heoic taking fo team save...,0
10321,Ive seen movie today fo fist time neve head be...,0
...,...,...
11813,film without doubt one wost seen boing simply ...,0
6339,couple cowpokes help goup Momons coss ough cou...,1
16324,Manhattan apatment dwelles put kinds inconveni...,1
7664,Mockney comes Bighton despite poo eception Bit...,0


In [25]:
import re

def remove_special_characters(text):
    # Raw string so that \s reaches the regex engine unchanged
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    return text


- This function utilizes the re.sub() method from Python's re module to substitute any character that is not a letter (uppercase or lowercase), digit, or whitespace with an empty string. Afterward, it returns the text with special characters removed.
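A quick check on a made-up string shows what the character class keeps and what it drops. The function is restated here so the snippet runs on its own.

```python
import re

def remove_special_characters(text):
    # Keep ASCII letters, digits and whitespace; delete everything else.
    return re.sub(r'[^a-zA-Z0-9\s]', '', text)

sample = "Café #1: 100% fun!\n"
cleaned = remove_special_characters(sample)
print(repr(cleaned))  # 'Caf 1 100 fun\n'
```

Accented letters such as "é" fall outside `a-zA-Z` and are removed as well, while whitespace (including the newline) is preserved.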

In [26]:
train['text']=train['text'].apply(remove_special_characters)

In [27]:
def removal_html(text):
    import re
    html_pattern=re.compile('<.*?>')
    return html_pattern.sub(r'',text)

- The function removal_html takes a text parameter as input.
- It imports the re module for working with regular expressions.
- It creates a regular expression pattern html_pattern using re.compile('<.*?>'). This pattern matches any substring that starts with < and ends with > (i.e., HTML tags) and tries to remove them from the text.
- The html_pattern.sub(r'', text) method substitutes all matches of the HTML pattern in the text with an empty string (r''), effectively removing HTML tags from the text.
- This function can be useful when dealing with text data containing HTML tags, as it helps to clean the text by removing those tags, leaving behind the text content.
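The tag pattern depends on the `<` and `>` characters still being present. The sketch below, on a made-up review containing `<br />` tags, compares stripping tags before punctuation with stripping punctuation first. The helper names mirror the notebook's functions but are redefined here, with `str.translate` standing in for the punctuation loop.

```python
import re
import string

def removal_html(text):
    # Non-greedy match of anything between < and >.
    return re.sub(r'<.*?>', '', text)

def remove_punctuations(text):
    return text.translate(str.maketrans('', '', string.punctuation))

raw = "Great film. <br /><br /> Loved it"

tags_first = remove_punctuations(removal_html(raw))
punct_first = removal_html(remove_punctuations(raw))

print(tags_first.split())   # ['Great', 'film', 'Loved', 'it']
print(punct_first.split())  # ['Great', 'film', 'br', 'br', 'Loved', 'it']
```

When punctuation is removed first, the angle brackets are gone and the tag names survive as stray tokens such as "br", so the order of the cleaning steps matters.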

In [28]:
train['text']=train['text'].apply(removal_html)

In [29]:
def remove_URL(text):
    url=re.compile(r'https?://\S+|www\.\S+')
    return url.sub(r' ',text)

In [30]:
train['text']=train['text'].apply(remove_URL)
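To see what the `remove_URL` pattern matches, here is a minimal sketch on a made-up sentence with two placeholder addresses (`example.com` and `example.org` are illustrative only).

```python
import re

def remove_URL(text):
    # Match http:// or https:// links, or anything starting with "www.", up to the next whitespace.
    url = re.compile(r'https?://\S+|www\.\S+')
    return url.sub(' ', text)

sample = "See https://example.com/review?id=1 or www.example.org for details"
cleaned = remove_URL(sample)
print(cleaned.split())  # ['See', 'or', 'for', 'details']
```

Each link is replaced by a single space, so the surrounding words stay separated; `split()` is used in the printout only to hide the repeated spaces.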

In [31]:
def remove_numbers(text):
    text=''.join([i for i in text if not i.isdigit()])
    return text

- The function remove_numbers takes a text parameter as input.
- It uses a list comprehension [i for i in text if not i.isdigit()] to iterate through each character i in the text. For each character, it checks if it's not a digit using the isdigit() method (if not i.isdigit()).
- Characters that are not digits are retained in the list comprehension.
- The resulting list of non-digit characters is joined back together using ''.join() to form a string, effectively removing all numerical digits.
- The modified text with the removed numbers is returned as the output.
- This function can be used when there's a need to remove numerical values from a piece of text, leaving behind only the non-numeric content.

In [32]:
train['text']=train['text'].apply(remove_numbers)

In [33]:
import re

def cleanse(word):
    # \D*\d matches, from the start of the word, any run of non-digits followed by a digit,
    # so the match succeeds exactly when the word contains at least one digit
    rx = re.compile(r'\D*\d')

    # Drop the word if it contains a digit
    if rx.match(word):
        return ''

    # Otherwise, return the word as is
    return word

def remove_alphanumeric(strings):
    # Split the text into words, cleanse each one, and keep only the non-empty results
    words = [cleanse(word) for word in strings.split()]

    # Join the surviving words back into a single string
    return ' '.join(filter(None, words))


- cleanse(word): This function uses the regular expression \D*\d, which matches any run of non-digit characters followed by a digit. Because rx.match anchors at the start of the word, the pattern succeeds exactly when the word contains at least one digit. In that case the function returns an empty string, which removes the word; otherwise it returns the word unchanged.

- remove_alphanumeric(strings): This function applies cleanse to each word of the input text (strings). It splits the text into words, calls cleanse on each one, discards the empty results with filter, and joins the remaining words back into a string.

Overall, remove_alphanumeric removes every word that contains a digit, whether it is purely numeric or a mix of letters and digits, and leaves words made only of non-digit characters untouched.
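The effect of this step depends on whether digits are still present when it runs. The sketch below restates the two helpers compactly (same pattern, simplified bodies) and applies them to a made-up sentence, once on the raw text and once after `remove_numbers` has already stripped the digits.

```python
import re

_HAS_DIGIT = re.compile(r'\D*\d')

def cleanse(word):
    # Empty string for any word that contains a digit, the word itself otherwise.
    return '' if _HAS_DIGIT.match(word) else word

def remove_alphanumeric(text):
    return ' '.join(w for w in text.split() if cleanse(w))

def remove_numbers(text):
    return ''.join(ch for ch in text if not ch.isdigit())

raw = "born 1965 saw mp3 player 2nd time"

print(remove_alphanumeric(raw))                  # born saw player time
print(remove_alphanumeric(remove_numbers(raw)))  # born saw mp player nd time
```

Once the digits are gone, no word can match the pattern, so the second call keeps fragments such as "mp" and "nd" that the first call would have removed as whole words.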

In [34]:
train['text']=train['text'].apply(remove_alphanumeric)

In [35]:
train

Unnamed: 0,text,label
19930,Afte seeing movie last night left sense hopele...,1
32631,movie shows ow sketches patly pass ove one ano...,0
17165,John Wayne one playes film histoy failed fist ...,1
29585,scewed stoy end Nell heoic taking fo team save...,0
10321,Ive seen movie today fo fist time neve head be...,0
...,...,...
11813,film without doubt one wost seen boing simply ...,0
6339,couple cowpokes help goup Momons coss ough cou...,1
16324,Manhattan apatment dwelles put kinds inconveni...,1
7664,Mockney comes Bighton despite poo eception Bit...,0


In [36]:
def lemmatize_text(text):
    text=nlp(text)
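    # '-PRON-' is the placeholder lemma spaCy 2.x assigns to pronouns; keep the original word in that case
    text= ' '.join([word.lemma_ if word.lemma_ != '-PRON-' else word.text for word in text])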
    return text

In [37]:
train['text']=train['text'].apply(lemmatize_text)

In [38]:
train['sentiment']=train['text'].apply(lambda review: TextBlob(review).sentiment)

In [39]:
train

Unnamed: 0,text,label,sentiment
19930,afte see movie last night leave sense hopeless...,1,"(0.24455128205128202, 0.6069597069597069)"
32631,movie show ow sketch patly pass ove one anothe...,0,"(-0.02951388888888886, 0.5820436507936507)"
17165,John Wayne one playe film histoy fail fist big...,1,"(0.13167388167388167, 0.28614718614718615)"
29585,scewe stoy end Nell heoic taking fo team save ...,0,"(0.044318181818181805, 0.6818181818181819)"
10321,I have see movie today fo fist time neve head ...,0,"(0.39166666666666666, 0.6761904761904761)"
...,...,...,...
11813,film without doubt one wost see boe simply cou...,0,"(-0.020833333333333343, 0.5005952380952381)"
6339,couple cowpoke help goup Momons coss ough coun...,1,"(0.3141233766233766, 0.6077922077922079)"
16324,Manhattan apatment dwelle put kind inconvenien...,1,"(0.3073593073593074, 0.5193722943722945)"
7664,Mockney come Bighton despite poo eception Biti...,0,"(0.04999999999999999, 0.29500000000000004)"


In [40]:
sentiment_series=train['sentiment'].tolist()

In [41]:
columns=['polarity','subjectivity']
df1=pd.DataFrame(sentiment_series,columns=columns,index=train.index)

In [42]:
df1

Unnamed: 0,polarity,subjectivity
19930,0.244551,0.606960
32631,-0.029514,0.582044
17165,0.131674,0.286147
29585,0.044318,0.681818
10321,0.391667,0.676190
...,...,...
11813,-0.020833,0.500595
6339,0.314123,0.607792
16324,0.307359,0.519372
7664,0.050000,0.295000


In [43]:
result=pd.concat([train,df1],axis=1)
result

Unnamed: 0,text,label,sentiment,polarity,subjectivity
19930,afte see movie last night leave sense hopeless...,1,"(0.24455128205128202, 0.6069597069597069)",0.244551,0.606960
32631,movie show ow sketch patly pass ove one anothe...,0,"(-0.02951388888888886, 0.5820436507936507)",-0.029514,0.582044
17165,John Wayne one playe film histoy fail fist big...,1,"(0.13167388167388167, 0.28614718614718615)",0.131674,0.286147
29585,scewe stoy end Nell heoic taking fo team save ...,0,"(0.044318181818181805, 0.6818181818181819)",0.044318,0.681818
10321,I have see movie today fo fist time neve head ...,0,"(0.39166666666666666, 0.6761904761904761)",0.391667,0.676190
...,...,...,...,...,...
11813,film without doubt one wost see boe simply cou...,0,"(-0.020833333333333343, 0.5005952380952381)",-0.020833,0.500595
6339,couple cowpoke help goup Momons coss ough coun...,1,"(0.3141233766233766, 0.6077922077922079)",0.314123,0.607792
16324,Manhattan apatment dwelle put kind inconvenien...,1,"(0.3073593073593074, 0.5193722943722945)",0.307359,0.519372
7664,Mockney come Bighton despite poo eception Biti...,0,"(0.04999999999999999, 0.29500000000000004)",0.050000,0.295000


In [44]:
result.drop(['sentiment'],axis=1,inplace=True)

In [45]:
result

Unnamed: 0,text,label,polarity,subjectivity
19930,afte see movie last night leave sense hopeless...,1,0.244551,0.606960
32631,movie show ow sketch patly pass ove one anothe...,0,-0.029514,0.582044
17165,John Wayne one playe film histoy fail fist big...,1,0.131674,0.286147
29585,scewe stoy end Nell heoic taking fo team save ...,0,0.044318,0.681818
10321,I have see movie today fo fist time neve head ...,0,0.391667,0.676190
...,...,...,...,...
11813,film without doubt one wost see boe simply cou...,0,-0.020833,0.500595
6339,couple cowpoke help goup Momons coss ough coun...,1,0.314123,0.607792
16324,Manhattan apatment dwelle put kind inconvenien...,1,0.307359,0.519372
7664,Mockney come Bighton despite poo eception Biti...,0,0.050000,0.295000


In [46]:
result.loc[result['polarity']>=0.3,'Sentiment']='Positive'
result.loc[result['polarity']<0.3,'Sentiment']='Negative'
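One way to judge the 0.3 cut-off is to compare the polarity-based prediction with the dataset's own `label` column. The sketch below does this in plain Python for the nine rows visible in the tables of this notebook. The `accuracy` helper is an assumption of this example, and nine rows are far too few to say anything about the full 10,000-review sample; the point is only the mechanics of the comparison.

```python
# (polarity, label) pairs copied from the rows displayed in the notebook output.
rows = [
    (0.244551, 1), (-0.029514, 0), (0.131674, 1), (0.044318, 0), (0.391667, 0),
    (-0.020833, 0), (0.314123, 1), (0.307359, 1), (0.050000, 0),
]

def accuracy(rows, threshold):
    """Share of rows where (polarity >= threshold) agrees with the 0/1 label."""
    hits = sum((polarity >= threshold) == bool(label) for polarity, label in rows)
    return hits / len(rows)

for threshold in (0.0, 0.1, 0.3):
    print(threshold, round(accuracy(rows, threshold), 2))
# 0.0 0.67
# 0.1 0.89
# 0.3 0.67
```

On the real data the same comparison could be run over all rows of `result` to pick a threshold, rather than fixing it at 0.3 in advance.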

In [47]:
result

Unnamed: 0,text,label,polarity,subjectivity,Sentiment
19930,afte see movie last night leave sense hopeless...,1,0.244551,0.606960,Negative
32631,movie show ow sketch patly pass ove one anothe...,0,-0.029514,0.582044,Negative
17165,John Wayne one playe film histoy fail fist big...,1,0.131674,0.286147,Negative
29585,scewe stoy end Nell heoic taking fo team save ...,0,0.044318,0.681818,Negative
10321,I have see movie today fo fist time neve head ...,0,0.391667,0.676190,Positive
...,...,...,...,...,...
11813,film without doubt one wost see boe simply cou...,0,-0.020833,0.500595,Negative
6339,couple cowpoke help goup Momons coss ough coun...,1,0.314123,0.607792,Positive
16324,Manhattan apatment dwelle put kind inconvenien...,1,0.307359,0.519372,Positive
7664,Mockney come Bighton despite poo eception Biti...,0,0.050000,0.295000,Negative


In [48]:
result.loc[result['label']==1, 'Sentiment_label'] = 1
result.loc[result['label']==0, 'Sentiment_label'] = 0

In [49]:
result

Unnamed: 0,text,label,polarity,subjectivity,Sentiment,Sentiment_label
19930,afte see movie last night leave sense hopeless...,1,0.244551,0.606960,Negative,1.0
32631,movie show ow sketch patly pass ove one anothe...,0,-0.029514,0.582044,Negative,0.0
17165,John Wayne one playe film histoy fail fist big...,1,0.131674,0.286147,Negative,1.0
29585,scewe stoy end Nell heoic taking fo team save ...,0,0.044318,0.681818,Negative,0.0
10321,I have see movie today fo fist time neve head ...,0,0.391667,0.676190,Positive,0.0
...,...,...,...,...,...,...
11813,film without doubt one wost see boe simply cou...,0,-0.020833,0.500595,Negative,0.0
6339,couple cowpoke help goup Momons coss ough coun...,1,0.314123,0.607792,Positive,1.0
16324,Manhattan apatment dwelle put kind inconvenien...,1,0.307359,0.519372,Positive,1.0
7664,Mockney come Bighton despite poo eception Biti...,0,0.050000,0.295000,Negative,0.0
