# Twitter Sentiment Analysis Description

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns

In [4]:
import re                         # regular expression
import nltk                       # natural language tool kit

In [5]:
nltk.download('stopwords')  # stop words: very common words (this, is, an, or, are, ...) that carry little meaning on their own
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\wordnet.zip.


True

#### Stemming and Lemmatization

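Stemming strips suffixes by rule, so related word forms collapse to a common stem that need not be a dictionary word. Lemmatization uses a vocabulary and the part of speech to map each form to its dictionary base form (the lemma).

words, word $\Rightarrow$ word

tasty, tasteful $\Rightarrow$ taste (the intended grouping; an actual stemmer or lemmatizer may not merge these two)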

The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. For instance:

am, are, is $\Rightarrow$ be

car, cars, car's, cars' $\Rightarrow$ car
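To make the difference concrete, here is a toy sketch in plain Python. It is not how NLTK works internally: `toy_stem`, `toy_lemmatize`, `SUFFIXES` and `LEMMA_TABLE` are hypothetical names made up for this illustration, and the lookup table only stands in for a real lexicon such as WordNet.

```python
# Toy illustration only; a real project would use a library stemmer/lemmatizer.
SUFFIXES = ("'s", "s'", "ing", "ed", "s")  # hypothetical suffix rules, checked in order

def toy_stem(word):
    """Rule-based: strip the first matching suffix, no dictionary involved."""
    for suffix in SUFFIXES:
        # keep at least 3 characters so short words like "is" are left alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical lookup table standing in for a real lexicon
LEMMA_TABLE = {
    "am": "be", "are": "be", "is": "be",
    "cars": "car", "car's": "car", "cars'": "car",
}

def toy_lemmatize(word):
    """Dictionary-based: return the known base form, else the word unchanged."""
    return LEMMA_TABLE.get(word, word)

for w in ["am", "are", "is", "car", "cars", "car's", "cars'"]:
    print(f"{w:6} stem={toy_stem(w):4} lemma={toy_lemmatize(w)}")
```

The suffix rules handle the `car` family, but only the dictionary lookup can map `am`, `are` and `is` to `be`, because those forms share no common suffix pattern.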

In [8]:
# Absolute path to the training file (forward slashes also work on Windows)
data = pd.read_csv("C:/Users/HP/Documents/Avishkar Internship/ML Projects/Twitter Sentiment Analysis - 2/train.csv")

# The same absolute path written with escaped backslashes
#data = pd.read_csv("C:\\Users\\HP\\Documents\\Avishkar Internship\\ML Projects\\Twitter Sentiment Analysis - 2\\train.csv")

data.head()

Unnamed: 0,id,label,tweet
0,1,0,@user when a father is dysfunctional and is s...
1,2,0,@user @user thanks for #lyft credit i can't us...
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in ...
4,5,0,factsguide: society now #motivation


In [20]:
data.shape

(31962, 3)

In [21]:
data.tweet[10]

'ireland consumer price index mom climbed from previous 0 2 to 0 5 in may blog silver gold forex'

In [22]:
from nltk.corpus import stopwords

stop = stopwords.words('english')

In [23]:
print(stop)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '
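The stop list is loaded here but the later cells do not show it being applied. As a rough sketch of how it could be used, the snippet below filters a cleaned tweet from this dataset against a small hard-coded subset of the list, so it runs without any download. `STOP_SUBSET` and `remove_stopwords` are hypothetical names chosen for this example.

```python
# Small subset of the NLTK English stop words printed above (hard-coded for the sketch)
STOP_SUBSET = {"i", "a", "is", "and", "the", "with", "all", "in"}

def remove_stopwords(text, stop_words=STOP_SUBSET):
    """Keep only the whitespace-separated tokens that are not stop words."""
    return " ".join(word for word in text.split() if word.lower() not in stop_words)

tweet = "model i love u take with u all the time in ur"
print(remove_stopwords(tweet))  # model love u take u time ur
```

With the full list, the same function could be called as `remove_stopwords(tweet, set(stop))`; converting the list to a set keeps each membership test fast.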

In [24]:
help(re.sub)

Help on function sub in module re:

sub(pattern, repl, string, count=0, flags=0)
    Return the string obtained by replacing the leftmost
    non-overlapping occurrences of the pattern in string by the
    replacement repl.  repl can be either a string or a callable;
    if a string, backslash escapes in it are processed.  If it is
    a callable, it's passed the Match object and must return
    a replacement string to be used.
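A short sketch of the three behaviours the help text describes: a string replacement, the `count` limit, and a callable replacement. The sample sentence and the number pattern are made up for illustration.

```python
import re

text = "price rose from 0.2 to 0.5 in may"
number = r"\d+(?:\.\d+)?"  # an integer or a simple decimal

# String replacement: every match becomes the token NUM
masked = re.sub(number, "NUM", text)
print(masked)      # price rose from NUM to NUM in may

# count=1 stops after the leftmost match
first_only = re.sub(number, "NUM", text, count=1)
print(first_only)  # price rose from NUM to 0.5 in may

# Callable replacement: receives the Match object and returns the new string
doubled = re.sub(number, lambda m: str(float(m.group()) * 2), text)
print(doubled)     # price rose from 0.4 to 1.0 in may
```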



In [25]:
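def clean(text):
    # Swap " @handle", runs of non-alphanumeric characters (apostrophes kept) and "scheme://... " URLs for a space.
    # The literal spaces at both ends of the pattern mean a handle at the very start of a tweet only loses its "@".
    pattern = r" (@[A-Za-z0-9]+)|([^A-Za-z0-9']+)|(\w+:\/\/\S+) "
    text = ' '.join(re.sub(pattern, " ", text).split())  # split/join collapses repeated whitespace
    return text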

In [26]:
# Remove hyperlinks and user handles from the training tweets

data.tweet = data.tweet.apply(clean)
data.head()

Unnamed: 0,id,label,tweet
0,1,0,when a father is dysfunctional and is so selfi...
1,2,0,user thanks for lyft credit i can't use cause ...
2,3,0,bihday your majesty
3,4,0,model i love u take with u all the time in ur
4,5,0,factsguide society now motivation
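Row 1 still starts with `user` because the first `@user` sits at the very start of the tweet, where the pattern's leading space cannot match. One possible alternative, shown here only as a sketch and not as the notebook's method, is to apply separate patterns in sequence. `clean_sequential` and the pattern names are hypothetical, and the sample URL is made up.

```python
import re

URL = re.compile(r"\w+://\S+")             # scheme://anything-up-to-whitespace
MENTION = re.compile(r"@\w+")              # @handle anywhere in the text
NON_WORD = re.compile(r"[^A-Za-z0-9']+")   # everything except letters, digits, apostrophes

def clean_sequential(text):
    """Remove URLs first, then mentions, then remaining punctuation."""
    text = URL.sub(" ", text)       # URLs go first, while "://" is still intact
    text = MENTION.sub(" ", text)   # handles go before "@" is stripped as punctuation
    text = NON_WORD.sub(" ", text)
    return " ".join(text.split())   # collapse whitespace and trim

sample = "@user @user thanks for #lyft credit https://example.com/x"
print(clean_sequential(sample))  # thanks for lyft credit
```

The order matters: stripping punctuation first would break URLs and handles into ordinary words that the later patterns could no longer recognise.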


In [29]:
#Applying Lemmatization....

from nltk.stem.wordnet import WordNetLemmatizer

w = WordNetLemmatizer()

In [30]:
data.tweet = data.tweet.apply(lambda x:' '.join([w.lemmatize(word,'v') for word in x.split()]))     # pos='v' treats every word as a verb (is -> be, thanks -> thank)
data.head()

Unnamed: 0,id,label,tweet
0,1,0,when a father be dysfunctional and be so selfi...
1,2,0,user thank for lyft credit i can't use cause t...
2,3,0,bihday your majesty
3,4,0,model i love u take with u all the time in ur
4,5,0,factsguide society now motivation


In [31]:
data.label.value_counts()       # 0 --> positive tweet,  1 --> negative tweet

0    29720
1     2242
Name: label, dtype: int64

### Resampling Technique

A widely adopted technique for dealing with highly unbalanced datasets is called resampling. It consists of removing samples from the majority class (under-sampling) and/or adding more examples from the minority class (over-sampling).

![UnderOver_sampling.png](attachment:image.png)

Class Imbalance: Undersampling and Oversampling
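A minimal sketch of random under-sampling and over-sampling in plain Python, using toy data rather than the tweet dataset. The helper names (`group_by_label`, `random_undersample`, `random_oversample`) are hypothetical and written only for this illustration.

```python
import random
from collections import Counter

def group_by_label(samples, labels):
    """Collect the samples of each class in a dict keyed by label."""
    groups = {}
    for sample, label in zip(samples, labels):
        groups.setdefault(label, []).append(sample)
    return groups

def random_undersample(samples, labels, seed=0):
    """Randomly drop rows until every class has as many rows as the smallest class."""
    rng = random.Random(seed)
    groups = group_by_label(samples, labels)
    target = min(len(rows) for rows in groups.values())
    return [(s, label) for label, rows in groups.items() for s in rng.sample(rows, target)]

def random_oversample(samples, labels, seed=0):
    """Randomly duplicate rows until every class has as many rows as the largest class."""
    rng = random.Random(seed)
    groups = group_by_label(samples, labels)
    target = max(len(rows) for rows in groups.values())
    result = []
    for label, rows in groups.items():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        result.extend((s, label) for s in rows + extra)
    return result

# Toy data: 8 rows of class 0 and 2 rows of class 1
samples = [f"tweet {i}" for i in range(10)]
labels = [0] * 8 + [1] * 2

under = random_undersample(samples, labels)
over = random_oversample(samples, labels)
print(Counter(label for _, label in under))  # 2 rows per class
print(Counter(label for _, label in over))   # 8 rows per class
```

Under-sampling discards majority rows, while over-sampling only repeats existing minority rows, so neither adds new information about the minority class.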

### Synthetic Minority Oversampling Technique

This technique generates synthetic data for the minority class.

SMOTE (Synthetic Minority Oversampling Technique) works by randomly picking a point from the minority class and computing the k-nearest neighbors for this point. The synthetic points are added between the chosen point and its neighbors.

![SMOTE.png](attachment:image.png)

1. Choose a minority-class sample as the input vector
2. Find its k nearest neighbors (k_neighbors is specified as an argument in the SMOTE() function)
3. Choose one of these neighbors and place a synthetic point anywhere on the line joining the point under consideration and its chosen neighbor
4. Repeat the steps until the data is balanced
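The steps above can be sketched in plain Python as follows. This is a simplified illustration of the interpolation idea, not the imblearn implementation: `smote_sketch` is a hypothetical function, the points are toy 2-D vectors, and neighbors are found with a brute-force Euclidean search.

```python
import math
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points from a list of minority-class vectors.

    Assumes k <= len(minority) - 1 so that every point has k neighbors.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):                      # step 4: repeat until enough points exist
        i = rng.randrange(len(minority))        # step 1: pick a minority-class sample
        point = minority[i]
        others = [p for j, p in enumerate(minority) if j != i]
        # step 2: its k nearest minority neighbors by Euclidean distance
        neighbors = sorted(others, key=lambda p: math.dist(point, p))[:k]
        neighbor = rng.choice(neighbors)        # step 3: choose one neighbor ...
        lam = rng.random()                      # ... and a position along the joining line
        synthetic.append(tuple(a + lam * (b - a) for a, b in zip(point, neighbor)))
    return synthetic

minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (3.0, 3.0)]
new_points = smote_sketch(minority, n_new=6)
for p in new_points:
    print(p)
```

Every synthetic point lies on a segment between two existing minority points, which is why SMOTE needs numeric feature vectors rather than raw text.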

In [None]:
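# SMOTE lives in the imbalanced-learn package, which is imported as "imblearn"
# !pip install imbalanced-learn

from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.feature_extraction.text import TfidfVectorizer

# SMOTE interpolates between numeric feature vectors, so the raw text column cannot be
# passed to it directly. Assumption: TF-IDF is used here only as an example vectorizer.
X = TfidfVectorizer().fit_transform(data["tweet"])
y = data["label"]

smote = SMOTE()

# fit predictor and target variable
x_smote, y_smote = smote.fit_resample(X, y)

print('Original dataset shape', Counter(y))
print('Resampled dataset shape', Counter(y_smote))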