# Twitter Sentiment Analysis

![Example of sentiment analysis](https://media-exp1.licdn.com/dms/image/C5612AQERP5yD4Ov6Fw/article-cover_image-shrink_600_2000/0?e=1610582400&v=beta&t=O99Hkcjllunfb-MsfL_ANv5dYlTpsZobRxIE-eqdUiw)
Sentiment analysis (also known as opinion mining or emotion AI) refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information.
A basic task in sentiment analysis is classifying the polarity of a given text at the document, sentence, or feature/aspect level—whether the expressed opinion in a document, a sentence or an entity feature/aspect is positive, negative, or neutral. Advanced, "beyond polarity" sentiment classification looks, for instance, at emotional states such as "angry", "sad", and "happy".

### About this Notebook

A Data Science project has 5 major stages in its life cycle:
1. Data Collection
2. Data Processing
3. Exploratory Analysis of the Data
4. Data Modeling
5. Interpreting the Data

This notebook shows how data science principles can be applied to real-world data to draw useful interpretations.

It uses popular libraries such as Plotly for visualizations.

PySpark is then used to handle the larger data volume, and modeling is done with TF-IDF features and Logistic Regression.

### Contents

1. Data Collection
2. EDA
3. Processing the Data
4. Modeling
5. Interpretation

**0. Necessary imports**

In [None]:
# some external packages
!pip install pyspellchecker
!pip install pyspark
!pip install findspark

In [None]:
#### for data manipulation and math operations ####
import pandas as pd
import numpy as np

#### for visualizations ####
# plotly
from plotly.offline import iplot
import plotly.graph_objs as go
from plotly.subplots import make_subplots

#### NLP packages ####
# NLTK library
from nltk.corpus import stopwords
# SKLearn 
from sklearn.feature_extraction.text import CountVectorizer
# py-spell checker
from spellchecker import SpellChecker


#### other useful packages ####
import string
from collections import Counter
import re
from tqdm import tqdm


#### Pyspark packages ####
import findspark
# findspark.init()
import pyspark as ps
import warnings
from pyspark.sql import SQLContext
from pyspark.ml.feature import HashingTF, IDF, Tokenizer
from pyspark.ml.feature import StringIndexer
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator


import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
from google.colab import drive
drive.mount('/content/drive')


In [None]:
from zipfile import ZipFile

# extract the dataset archive from Google Drive into the working directory
with ZipFile("/content/drive/My Drive/archive.zip", 'r') as zipped:
    zipped.extractall()
    print('Done')


## I. Data Collection

The dataset used is the "<a href="https://www.kaggle.com/kazanova/sentiment140">**Sentiment140 dataset with 1.6 million tweets**</a>", which is publicly available on Kaggle.


**Reading the Data**

In [None]:
file_path = 'training.1600000.processed.noemoticon.csv'
colnames=['sentiment', 'ids', 'date', 'flag','user','text'] 
train = pd.read_csv(file_path,encoding = "ISO-8859-1", header=None, names=colnames) 

In [None]:
train.head()

In [None]:
print(f"Shape of training data: {train.shape}")

In [None]:
train = train[['text','sentiment']]
train.head()

In [None]:
train.describe()

In [None]:
train.info()

In [None]:
# Let's use a subset of the data for faster processing
# test_size=0.85 keeps 15% of the 1.6M rows, i.e. about 240K rows

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(train['text'], train['sentiment'], test_size=0.85, random_state=42)
train = pd.concat([X_train,y_train],axis=1)
print(len(train))

## II. Exploratory Data Analysis

Sentiment is the class label to be predicted.

Before modeling, it is important to understand how the data is distributed across the classes.

In [None]:
# group by sentiment and count the tweets per class (similar to a SQL GROUP BY)
class_group_counts = train.groupby('sentiment').count()['text'].reset_index().sort_values(by='text',ascending=False)
class_group_counts.style.background_gradient(cmap='Blues')

Bar chart of the distribution of the data across the sentiment classes.

In [None]:
# create a trace
trace = go.Bar(
    x = class_group_counts.sentiment,
    y = class_group_counts.text,
    name = 'Data Frequency',
    marker={'color': ['#e57373','#f06292','#ba68c8']}
)

data = [trace]
layout = go.Layout(title="Distribution of the categorical classes")

fig = go.Figure(data = data,layout=layout)
fig.show()

Pie chart of the distribution of the data across the sentiment classes.

The pie chart makes the class proportions easier to read.

In [None]:
# create the trace
trace = go.Pie(
    labels = class_group_counts.sentiment,
    values = class_group_counts.text
)

data = [trace]
layout = go.Layout(title="Pie plot of the distribution of the categorical classes")

fig = go.Figure(data = data,layout=layout)
fig.show()

Let's count the number of words per tweet for each class

In [None]:
# lists of per-tweet word counts for the two classes (4 = positive, 0 = negative)
words_per_tweet = train['text'].str.split().str.len()
positive_words_count = words_per_tweet[train['sentiment'] == 4].tolist()
negative_words_count = words_per_tweet[train['sentiment'] == 0].tolist()

In [None]:
# plot the histogram

trace2 = go.Histogram(
    x = np.array(positive_words_count), 
    name = 'Positive'
)
trace3 = go.Histogram(
    x = np.array(negative_words_count),
    name = 'Negative'
)

data = [trace2,trace3]

layout = go.Layout(
    barmode='overlay',
    title="Number of words per tweet in each class")

fig = go.Figure(data = data, layout = layout)
fig.update_traces(opacity=0.6)
fig.show()

**Most Common Words in the Tweets**

This helps us understand which words appear most often in the tweets.

In [None]:
# let's now use Python's Counter class to build a word-frequency dictionary per class
# the key is the word and the value is the word's count

positive_words_count = Counter()
negative_words_count = Counter()

counts = {4:positive_words_count,0:negative_words_count}

# iterate over every data row; Counter.update adds one count per word
for text, sentiment_class in tqdm(zip(train['text'], train['sentiment']), total=len(train)):
    counts[sentiment_class].update(text.split())

# the 20 most frequent words of each class
top_words_positive_count = positive_words_count.most_common(20)
top_words_negative_count = negative_words_count.most_common(20)

In [None]:
# Creating subplots for the most common words in each class
fig = make_subplots(rows=2, cols=1,
                    subplot_titles=(
                        "Positive Tweets",
                        "Negative Tweets")
)
                    

# trace for positive class
x,y = zip(*top_words_positive_count)
fig.append_trace(
    go.Bar(
        x = x,
        y = y ,
        name = 'Positive'),
    row=1, col=1)

# trace for negative class
x,y = zip(*top_words_negative_count)
fig.append_trace(
    go.Bar(
        x = x,
        y = y ,
        name = 'Negative'),
    row=2, col=1)


fig.update_layout(height=500, width=900, title_text="Top words in different classes",showlegend=False)
fig.show()

From the above charts it is evident that the top words in each class are common English words. They appear in all classes and so contribute little towards understanding the sentiment of a tweet.

These words are called stop words and can be eliminated from the corpus.

**Analyzing punctuation**

In [None]:
punct = Counter()

# iterating over a string yields characters, so count every punctuation character
for text in train['text']:
    for char in text:
        if char in string.punctuation:
            punct[char] += 1

# sort the punctuation frequencies
top_punct_count = sorted(punct.items(),key=lambda x:x[1],reverse=True)[:20]

In [None]:
x, y = zip(*top_punct_count)
trace = go.Bar(
    x = x,
    y = y,
)

data = [trace]
layout = go.Layout(title="Most Frequent Punctuation",height=300, width=900,)
fig = go.Figure(data=data,layout=layout)
fig.show()

**N-Gram Analysis**

All of the analysis so far was done on unigrams (single words). Let's now check the most common bi-grams and tri-grams, using the following sentence as an example:

"*I love to eat pizza.*"

A 1-gram (or unigram) is a one-word sequence. For the above sentence, the unigrams are simply: "I", "love", "to", "eat", "pizza".

A 2-gram (or bigram) is a two-word sequence, like "I love", "love to", "to eat", or "eat pizza". A 3-gram (or trigram) is a three-word sequence, like "I love to", "love to eat", or "to eat pizza".
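
As a minimal sketch of the definitions above, n-grams can be built by sliding a window of `n` words over the sentence. The helper name `word_ngrams` is hypothetical and is used only for this illustration; it splits on whitespace and is not what `CountVectorizer` does internally.

```python
def word_ngrams(sentence, n):
    """Return the list of n-word sequences in `sentence` (hypothetical helper)."""
    tokens = sentence.split()
    # slide a window of n tokens across the sentence
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "I love to eat pizza"
unigrams = word_ngrams(sentence, 1)
bigrams = word_ngrams(sentence, 2)
trigrams = word_ngrams(sentence, 3)

print(unigrams)
print(bigrams)
print(trigrams)
```

Note that `CountVectorizer`, with its default settings, lower-cases the text and drops single-character tokens such as "I", so its n-grams can differ slightly from this whitespace-based version.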

In [None]:
# Using the CountVectorizer of the sklearn library for n-gram analysis
def get_top_n_grams(corpus,N=2, n=20):
    """
    args:
        corpus: list of text data
        N : N-gram 
        n : top n N-grams
    returns:
        Returns the n most common N-grams
    """
    
    # create the CountVectorizer object
    vec = CountVectorizer(ngram_range=(N,N))
    # fit it on the corpus
    bag_of_words = vec.fit_transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word,sum_words[0,idx]) for word,idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x:x[1], reverse=True)
    return words_freq[:n]

In [None]:
bi_grams = get_top_n_grams(train['text'],N=2)
tri_grams = get_top_n_grams(train['text'],N=3)

fig = make_subplots(
    rows=2,cols=1,
    subplot_titles=("Bi-grams","Tri-grams")   
)

x,y = zip(*bi_grams)
fig.append_trace(
    go.Bar(
        x=x,
        y=y,
        name='bi-gram'
    ),
    row=1,col=1
)

x,y = zip(*tri_grams)
fig.append_trace(
    go.Bar(
        x=x,
        y=y,
        name='tri-gram'
    ),
    row=2,col=1
)

fig.update_layout(title="Most common N-grams",height=900,width=900,)
fig.show()

## III. Data Processing

From the above analysis it is clear that the data needs substantial cleaning.

The major steps in processing the text data are the elimination of:
1. punctuation
2. URLs
3. emojis
4. stop words
5. HTML tags

Misspelled words are also corrected.

**Removing HTML-tags**

In [None]:
def remove_HTML(text):
    """
    Inputs a string and outputs a string free of any HTML tags
    """
    tag = re.compile(r'<.*?>')
    
    return tag.sub(r'',text)

Let's test the above function with an example

In [None]:
text = """<div>
<h1>Pizzeria</h1>
<p>Best pizza in town</p>
<a href="https://pizzeria.com">getting started</a>
</div>"""

print(remove_HTML(text))

**Removing URLs**

In [None]:
def remove_URL(text):
    """
    Inputs a string and outputs a string free of any URLs
    """
    url = re.compile(r'https?://\S+|www\.\S+')
    
    return url.sub(r'',text)

In [None]:
text = "New Pizza :https://pizzeria.com-getting-started you will love it"
remove_URL(text)

**Removing Emojis**

In [None]:
def remove_emojis(text):
    """
    Inputs a string and outputs a string free of any emojis
    """
    emoji = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
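        u"\U00002702-\U000027B0"  # dingbats
        u"\U000024C2-\U0001F251"  # enclosed characters; note this wide range also covers many non-emoji characters (e.g. CJK text)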
    "]+", flags=re.UNICODE)
    
    return emoji.sub(r'',text)

In [None]:
text = "I didn't like the pizza 😔😔"
remove_emojis(text)

**Removing Punctuations**

In [None]:
def remove_punctuations(text):
    """
    Inputs a string and outputs a string free of any punctuations
    """
    punct = re.compile(r'[^\w\s]')
    
    return punct.sub(r'',text)

In [None]:
s = "string. With. #$@#$Punctuation?"
remove_punctuations(s)

**Spell-checker and correction**

In [None]:
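# build the spell checker once; constructing it on every call is slow
spell = SpellChecker()

def correct_typo(text):
    """
    Inputs a string and outputs a string with misspelled words corrected
    """
    correct = []
    words = text.split()
    # find the wrongly spelled words
    misspelled = spell.unknown(words)
    
    for word in words:
        # if the word is misspelled then correct it
        if word in misspelled:
            # correction() may return None in newer pyspellchecker releases,
            # so fall back to the original word in that case
            correct.append(spell.correction(word) or word)
        else:
            correct.append(word)
            
    return " ".join(correct)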

In [None]:
# testing the correct_typo function
text = "I love the pizz at Jimmy's, it's simpl fanastic"
correct_typo(text)

**Remove Stop words**

In [None]:
import nltk
nltk.download('stopwords')
# set of all stopwords
stop = set(stopwords.words('english'))
stop.remove('not') # exclude not

def remove_stop_words(text):
    """
    inputs a text string and outputs a string without any stopwords
    """
    sentence = [] # list without any stopwords
    for word in text.split():
        if word not in stop:
            sentence.append(word)
            
    return " ".join(sentence)

In [None]:
# testing the elimination of stop words function
text = "I dislike the fried chicken, but crave for the Lasanga"
remove_stop_words(text)

**Let's now assemble all of the above functions to return a cleaned text**

In [None]:
def clean_text(text):
    """
    inputs a string:
    -------------------------------------
    outputs a string free from 
    1) html-tags
    2) urls
    3) emojis
    4) punctuations
    5) stopwords
    and lastly corrects the misspelled words
    """
    text = remove_HTML(text)
    text = remove_URL(text)
    text = remove_emojis(text)
    text = remove_punctuations(text)
    text = remove_stop_words(text)
    text = correct_typo(text)
    
    return text

In [None]:
text = """<div>
<h1>Pizzeria</h1>
<p>Best pizza in town</p>
<a href="https://pizzeria.com">getting started</a>
</div> Follow the link at https://pizzeria.com. But the pizz is not that great!!! 😔😔! Disapointed"""

clean_text(text)

In [None]:
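print(len(train))
# clean a 50K-row slice of the sampled data (rows 10000 to 59999);
# iterating over the slice itself keeps the tweets and labels aligned with it
subset = train.iloc[10000:60000]
corpus = [clean_text(text) for text in tqdm(subset['text'])]
sentimental = subset['sentiment'].tolist()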

In [None]:
print(len(corpus),
len(sentimental))

## IV. Data Modeling

**Creating a Spark session, the entry point for reading data and running Spark jobs**

In [None]:
# SparkSession is the unified entry point; it also exposes the DataFrame reader API
sc = ps.sql.SparkSession.builder.getOrCreate()  
sqlContext = sc
print("Just created a SparkSession")

In [None]:
print(len(train))
# assemble the cleaned tweets and their labels into a DataFrame
percentile_list = pd.DataFrame({'tweet' : corpus,
                                'target' : sentimental }, 
                                columns=['tweet','target'])


In [None]:
from google.colab import drive

drive.mount('/content/drive')
path = '/content/drive/My Drive/cleaned40_train.csv'

with open(path, 'w', encoding = 'utf-8-sig') as f:
  percentile_list.to_csv(f)

In [None]:
from google.colab import files
file=files.upload()

Loading the cleaned data

In [None]:
file_path = 'cleaned24_train - cleaned23_train.csv'
# the CSV reader is built into Spark 2+, so the external databricks format name is not needed
df = sqlContext.read.csv(file_path, header=True, inferSchema=True)
type(df)

In [None]:
df.show(10)

In [None]:
# let's view the size of the data
print(f"Size of the data = {df.count()}")

In [None]:
# Let's split the data for training, validating and testing the model
(train_set, val_set, test_set) = df.randomSplit([0.98, 0.01, 0.01], seed = 2000)

**Feature Extraction with TF-IDF**

TF-IDF is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. It is the product of two metrics: the term frequency (how many times a word appears in a document) and the inverse document frequency of the word across the set of documents, which is lower for words that appear in many documents.

It has many uses, most importantly in automated text analysis, and is very useful for scoring words in machine learning algorithms for Natural Language Processing (NLP).
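
As a rough illustration of the product described above, the sketch below computes TF-IDF weights by hand for a toy corpus. The corpus and the helper names `idf` and `tf_idf` are made up for this example. The smoothed formula log((m + 1) / (df + 1)) is the one documented for Spark's `IDF`, and raw counts are used as the term frequency, which is what `HashingTF` produces by default.

```python
import math
from collections import Counter

# toy corpus of already-tokenized documents (invented for illustration)
docs = [
    "love pizza love pasta".split(),
    "hate cold pizza".split(),
    "love sunny days".split(),
]


def idf(term, docs):
    """Smoothed inverse document frequency: log((m + 1) / (df + 1))."""
    m = len(docs)                                # number of documents
    df = sum(1 for doc in docs if term in doc)   # documents containing the term
    return math.log((m + 1) / (df + 1))


def tf_idf(doc, docs):
    """TF-IDF weight of every term in `doc`, with raw counts as term frequency."""
    tf = Counter(doc)
    return {term: count * idf(term, docs) for term, count in tf.items()}


weights = tf_idf(docs[0], docs)
for term, weight in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{term}: {weight:.4f}")
```

In the first document, "pasta" occurs once but in no other document, so it outweighs "pizza", which also occurs once but is shared with another document.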

In [None]:
# Creating the pipeline for feature extraction

# tokenizing the data
tokenizer = Tokenizer(inputCol="tweet", outputCol="words")

# Creating an instance of the TF-IDF
hashtf = HashingTF(numFeatures=2**16, inputCol="words", outputCol='tf')
idf = IDF(inputCol='tf', outputCol="features", minDocFreq=5) #minDocFreq: terms that appear in fewer than 5 documents get an IDF of 0

# to convert string target to index target
label_stringIdx = StringIndexer(inputCol = "target", outputCol = "label")

# the complete pipeline: sequence of various stages
pipeline = Pipeline(stages=[tokenizer, hashtf, idf, label_stringIdx])

**Extract Features**

In [None]:
train_set=train_set.na.drop()
pipelineFit = pipeline.fit(train_set)
train_df = pipelineFit.transform(train_set)

**Extracting features of the validation set**

In [None]:
val_set = val_set.na.drop()  # drop rows with missing values, as was done for the training set
val_df = pipelineFit.transform(val_set)
train_df.show(5)

**Modeling with Logistic Regression**

Let's now apply logistic regression to the data, now that features have been extracted for every data point.

In [None]:
LR = LogisticRegression()
model = LR.fit(train_df)
predictions = model.transform(val_df)

**Evaluation Metrics for the model (TF-IDF with Logistic Regression)**

In [None]:
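# inspect a few predictions next to their labels
predictions.select('tweet', 'label', 'prediction').show(5)
# BinaryClassificationEvaluator reports the area under the ROC curve by default
evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction")
evaluator.evaluate(predictions)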
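
`BinaryClassificationEvaluator` reports the area under the ROC curve (AUC) unless another metric is requested. One way to read that number: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition on made-up labels and scores; the helper name `auc_from_scores` is hypothetical, and the pairwise loop is meant for understanding, not for large datasets.

```python
def auc_from_scores(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# invented example: two negatives and two positives with model scores
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = auc_from_scores(labels, scores)
print(auc)  # 3 of the 4 positive/negative pairs are ordered correctly -> 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect separation of the two classes.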

## V. Data Interpretation

In [None]:
test = {
    'tweet':[
        'OMG! I"m so sick of the US elections and the corruptions',
        'I love the Master Chef US, its streaming this Friday on Fox #masterchef'
    ],
    # Sentiment140 labels: 0 = negative, 4 = positive
    # (the fitted StringIndexer only knows the labels seen during training)
    'target':[0,4]
}

test_ = pd.DataFrame(test)

test_ = sqlContext.createDataFrame(test_)
test_.show(truncate=False)

In [None]:
def model_predict(test_):
    features = pipelineFit.transform(test_)
    preds = model.transform(features)
    return preds

In [None]:
pred = model_predict(test_)
pred.select('prediction').show()

### Acknowledgements

* <a href="https://en.wikipedia.org/wiki/Sentiment_analysis#:~:text=Sentiment%20analysis%20(also%20known%20as,affective%20states%20and%20subjective%20information.">Sentiment Analysis Wikipedia</a>
