<a href="https://colab.research.google.com/github/PavanReddy28/CRuX/blob/main/SentimentAnalysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Analysis

## Packages

In [2]:
import nltk

In [3]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [4]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import time

import re
from nltk import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.model_selection import train_test_split

## Importing data

In [5]:
from google.colab import drive
drive.mount('/gdrive')

Drive already mounted at /gdrive; to attempt to forcibly remount, call drive.mount("/gdrive", force_remount=True).


### Different datasets

In [None]:
train = pd.read_csv("/gdrive/My Drive/Inductions/train.csv")

In [None]:
train.head()

In [None]:
train.value_counts()

In [None]:
test = pd.read_csv("/gdrive/My Drive/Inductions/test.csv")

In [None]:
test.head()

In [None]:
train.shape

In [None]:
test.shape

### Sentiment140 Dataset

Using the **Sentiment140** dataset, which contains **1,600,000** tweets extracted using the Twitter API. 
<br>The tweets are annotated as '0 : Negative', '2 : Neutral' and '4 : Positive'.
<br>I have used only the positive and negative tweets to train my model.

In [6]:
twitter_dataset = pd.read_csv('/gdrive/My Drive/Inductions/training.1600000.processed.noemoticon.csv', encoding="ISO-8859-1", names=["sentiment", "ids", "date", "flag", "user", "text"])

In [7]:
twitter_dataset.head()

| | sentiment | ids | date | flag | user | text |
|---|---|---|---|---|---|---|
| 0 | 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | \_TheSpecialOne\_ | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by ... |
| 2 | 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | @Kenichan I dived many times for the ball. Man... |
| 3 | 0 | 1467811184 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | ElleCTF | my whole body feels itchy and like its on fire |
| 4 | 0 | 1467811193 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | Karoli | @nationwideclass no, it's not behaving at all.... |


## Data Preprocessing

In [8]:
sentiment, text = twitter_dataset['sentiment'], twitter_dataset['text']

### **Sentiment**

Replacing 4s with 1s, so that 1 represents positive tweets and 0 represents negative tweets.

In [9]:
sentiment.value_counts()

4    800000
0    800000
Name: sentiment, dtype: int64

In [10]:
sentiment = sentiment.replace(4,1)

In [11]:
sentiment.value_counts()

1    800000
0    800000
Name: sentiment, dtype: int64

### **Text**

Preprocessing of the text data includes:
<ol type = "1">
<li>Converting all the data to **Lower Case**</li>
<li>Replacing **User IDs** ("@colab", etc.) with "USER"</li>
<li>Replacing **Emojis** with their text representation (according to the emoji dictionary).</li>
<li>Replacing **URLs** (starting with "http", "www", etc.) with "URL"</li>
<li>Removing **Non-Word Characters** (anything other than letters, digits and underscores)</li>
<li>**Tokenization** : Splitting each tweet into a list of words.</li>
<li>**Lemmatization** : Grouping together the different forms of a word so that they can be analyzed as a single word. Done using the NLTK library.</li>
<li>Collapsing letters repeated three or more times in a row down to two.</li>
</ol>

In [12]:
#Emojis and Stopwords taken from the internet.

emojis = {':)': 'smile', ':-)': 'smile', ';d': 'wink', ':-E': 'vampire', ':(': 'sad', 
          ':-(': 'sad', ':-<': 'sad', ':P': 'raspberry', ':O': 'surprised',
          ':-@': 'shocked', ':@': 'shocked',':-$': 'confused', ':\\': 'annoyed', 
          ':#': 'mute', ':X': 'mute', ':^)': 'smile', ':-&': 'confused', '$_$': 'greedy',
          '@@': 'eyeroll', ':-!': 'confused', ':-D': 'smile', ':-0': 'yell', 'O.o': 'confused',
          '<(-_-)>': 'robot', 'd[-_-]b': 'dj', ":'-)": 'sadsmile', ';)': 'wink', 
          ';-)': 'wink', 'O:-)': 'angel','O*-)': 'angel','(:-D': 'gossip', '=^.^=': 'cat'}

stopwordlist = ['a', 'about', 'above', 'after', 'again', 'ain', 'all', 'am', 'an',
             'and','any','are', 'as', 'at', 'be', 'because', 'been', 'before',
             'being', 'below', 'between','both', 'by', 'can', 'd', 'did', 'do',
             'does', 'doing', 'down', 'during', 'each','few', 'for', 'from', 
             'further', 'had', 'has', 'have', 'having', 'he', 'her', 'here',
             'hers', 'herself', 'him', 'himself', 'his', 'how', 'i', 'if', 'in',
             'into','is', 'it', 'its', 'itself', 'just', 'll', 'm', 'ma',
             'me', 'more', 'most','my', 'myself', 'now', 'o', 'of', 'on', 'once',
             'only', 'or', 'other', 'our', 'ours','ourselves', 'out', 'own', 're',
             's', 'same', 'she', "shes", 'should', "shouldve",'so', 'some', 'such',
             't', 'than', 'that', "thatll", 'the', 'their', 'theirs', 'them',
             'themselves', 'then', 'there', 'these', 'they', 'this', 'those', 
             'through', 'to', 'too','under', 'until', 'up', 've', 'very', 'was',
             'we', 'were', 'what', 'when', 'where','which','while', 'who', 'whom',
             'why', 'will', 'with', 'won', 'y', 'you', "youd","youll", "youre",
             "youve", 'your', 'yours', 'yourself', 'yourselves']

In [13]:
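def data_preprocess(textdata):
  """Clean and lemmatize an iterable of tweets; returns a list of strings."""

  lemmatizer = WordNetLemmatizer()

  Processed = []

  # Raw strings keep the regex escapes intact.
  urlPattern = r"((http://)[^ ]*|(https://)[^ ]*|( www\.)[^ ]*)"
  userPattern = r"@[^\s]+"
  alphaPattern = r"[^\w]"
  sequenceFind = r"(.)\1\1+"
  sequenceReplace = r"\1\1"

  for tweet in textdata:

    tweet = tweet.lower()

    tweet = re.sub(urlPattern, ' URL ', tweet)

    # Replace emoticons before punctuation is stripped, otherwise they never match.
    # Keys are lowercased because the tweet has already been lowercased.
    for emoji, meaning in emojis.items():
      tweet = tweet.replace(emoji.lower(), " EMOJI" + meaning + " ")

    tweet = re.sub(userPattern, ' USER ', tweet)
    tweet = re.sub(alphaPattern, ' ', tweet)
    # Collapse runs of three or more identical characters down to two.
    tweet = re.sub(sequenceFind, sequenceReplace, tweet)

    # Lemmatize each token and drop single-character tokens.
    tweetWords = ''
    for text in tweet.split():
      if len(text)>1:
        text = lemmatizer.lemmatize(text)
        tweetWords += (text +' ')
    
    Processed.append(tweetWords)

  return Processed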


In [14]:
%%time
text = data_preprocess(text)
print('Data Processed.')

Data Processed.
CPU times: user 1min 35s, sys: 978 ms, total: 1min 36s
Wall time: 1min 36s
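
To make the effect of the cleaning steps concrete, here is a minimal sketch that applies only the regex-based steps to a single tweet. It skips lemmatization so that it runs without the NLTK data. The function name `normalize_tweet`, the shortened `EMOJIS` table and the simplified URL pattern are assumptions made for illustration and are not part of the notebook.

```python
import re

# Hypothetical, shortened emoticon table used only for this illustration.
EMOJIS = {":)": "smile", ":(": "sad", ";)": "wink"}

# Simplified patterns (assumptions, not the notebook's exact expressions).
URL_PATTERN = r"(https?://\S+|www\.\S+)"
USER_PATTERN = r"@\S+"
NON_WORD_PATTERN = r"[^\w]"
REPEAT_PATTERN = r"(.)\1\1+"


def normalize_tweet(tweet):
    """Apply the regex-based cleaning steps to one tweet (no lemmatization)."""
    tweet = tweet.lower()
    tweet = re.sub(URL_PATTERN, " URL ", tweet)
    # Emoticons are replaced before punctuation is stripped.
    for emoticon, meaning in EMOJIS.items():
        tweet = tweet.replace(emoticon, " EMOJI" + meaning + " ")
    tweet = re.sub(USER_PATTERN, " USER ", tweet)
    tweet = re.sub(NON_WORD_PATTERN, " ", tweet)
    # Collapse runs of three or more identical characters down to two.
    tweet = re.sub(REPEAT_PATTERN, r"\1\1", tweet)
    # Drop single-character tokens and normalize whitespace.
    return " ".join(word for word in tweet.split() if len(word) > 1)


example = normalize_tweet("@colab I loooove this!!! :) http://example.com")
print(example)  # USER loove this EMOJIsmile URL
```

The order matters: the emoticon lookup has to run before the non-word characters are removed, because the emoticons consist entirely of such characters.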


### Train Test Split

In [15]:
# train_test_split returns the splits in the order X_train, X_test, y_train, y_test.
X_train, X_test, Y_train, Y_test = train_test_split(text, sentiment, test_size=0.05, random_state=0)

In [16]:
Y_train[:10]

391051     0
197655     0
905468     1
1492339    1
551346     0
909061     1
1091617    1
1323317    1
1249935    1
87105      0
Name: sentiment, dtype: int64

### **TF-IDF Vectorizer**

The 'TfidfVectorizer' class from the sklearn.feature_extraction.text module converts a collection of raw documents to a matrix of TF-IDF features.

Fitting it learns a vocabulary of features (here, words and word pairs are the features). When 'max_features' is set, only the terms with the highest frequency across the corpus are kept.

**Example**:
<br>
INPUT: corpus = ['This is the first document.','This document is the second document.','And this is the third one.','Is this the first document?',]
<br><br>When the vectorizer is fitted on the above list with default settings, it learns the feature names below.
<br><br>OUTPUT: ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

In [17]:
Vectorizer = TfidfVectorizer(ngram_range=(1,2), lowercase=True, max_features=50000)

In [18]:
Vectorizer.fit(X_train)

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.float64'>, encoding='utf-8',
                input='content', lowercase=True, max_df=1.0, max_features=50000,
                min_df=1, ngram_range=(1, 2), norm='l2', preprocessor=None,
                smooth_idf=True, stop_words=None, strip_accents=None,
                sublinear_tf=False, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, use_idf=True, vocabulary=None)

In [19]:
X_train = Vectorizer.transform(X_train)
X_test = Vectorizer.transform(X_test)

In [20]:
X_train.shape

(1520000, 50000)

In [35]:
X_train[1]

<1x50000 sparse matrix of type '<class 'numpy.float64'>'
	with 14 stored elements in Compressed Sparse Row format>
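
To show where the TF-IDF weights come from, the following pure-Python sketch recomputes them for the four-sentence example corpus. It assumes the settings shown in the fitted vectorizer's printout (smooth_idf=True, norm='l2', the default token pattern) but uses single words only, not word pairs. The helper names `tokenize` and `tfidf_row` are made up for this illustration; this is not how scikit-learn is implemented internally.

```python
import math
import re
from collections import Counter

corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

# Tokens of two or more word characters, as in the default token pattern.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize(doc):
    return TOKEN_PATTERN.findall(doc.lower())


tokenized = [tokenize(doc) for doc in corpus]
vocabulary = sorted({tok for doc in tokenized for tok in doc})

n_docs = len(corpus)
# Document frequency: in how many documents each term appears.
doc_freq = Counter(tok for doc in tokenized for tok in set(doc))
# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}


def tfidf_row(tokens):
    """Return the L2-normalized TF-IDF weights of one tokenized document."""
    counts = Counter(tokens)
    weights = [counts[t] * idf[t] for t in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]


matrix = [tfidf_row(doc) for doc in tokenized]
print(vocabulary)
print([round(w, 3) for w in matrix[0]])
```

Terms that occur in every document ('is', 'the', 'this') get the minimum IDF of 1, so within the first sentence the rarer word 'first' receives the largest weight.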

## Model

### Naive Bayes

In [None]:
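class NaiveBayesAlgo:

  def __init__(self, train, test):
    # Keep references to the training and test data.
    self.train = train
    self.test = test

  def fit(self):
    # Left unimplemented in the original notebook.
    raise NotImplementedError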

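
The class above is left unfinished in the notebook. As a rough illustration of what a multinomial Naive Bayes classifier does, here is a minimal sketch with add-one (Laplace) smoothing. It is not the author's implementation: the class name `MultinomialNBSketch`, its methods and the toy data are hypothetical, and it works on raw word counts from whitespace-tokenized strings instead of the TF-IDF matrix built earlier.

```python
import math
from collections import Counter, defaultdict


class MultinomialNBSketch:
    """Hypothetical multinomial Naive Bayes over whitespace-tokenized text."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing strength (1.0 = Laplace smoothing)

    def fit(self, documents, labels):
        self.classes_ = sorted(set(labels))
        self.class_counts_ = Counter(labels)
        # Per-class word counts.
        self.word_counts_ = defaultdict(Counter)
        for doc, label in zip(documents, labels):
            self.word_counts_[label].update(doc.split())
        self.vocabulary_ = {w for c in self.word_counts_.values() for w in c}
        self.total_words_ = {
            c: sum(self.word_counts_[c].values()) for c in self.classes_
        }
        return self

    def _log_score(self, tokens, label):
        # log P(class) + sum of log P(word | class), with add-alpha smoothing.
        n_docs = sum(self.class_counts_.values())
        score = math.log(self.class_counts_[label] / n_docs)
        denom = self.total_words_[label] + self.alpha * len(self.vocabulary_)
        for tok in tokens:
            if tok in self.vocabulary_:  # words never seen in training are ignored
                count = self.word_counts_[label][tok]
                score += math.log((count + self.alpha) / denom)
        return score

    def predict(self, documents):
        return [
            max(self.classes_, key=lambda c: self._log_score(doc.split(), c))
            for doc in documents
        ]


# Toy data invented for this example (1 = positive, 0 = negative).
train_docs = ["love this phone", "great happy day", "hate this traffic", "sad awful day"]
train_labels = [1, 1, 0, 0]

model = MultinomialNBSketch().fit(train_docs, train_labels)
predictions = model.predict(["happy love", "awful traffic"])
print(predictions)  # [1, 0]
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied, and the smoothing term keeps a word that was never seen in one class from forcing that class's probability to zero.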