## Introduction

* **Natural Language Processing (NLP):** A field at the intersection of artificial intelligence, computer science, and linguistics concerned with creating computational models that process and understand natural language. Its tasks include teaching a computer the semantic grouping of words (e.g. cat and dog are semantically more similar than cat and spoon), text-to-speech, language translation, and many more.
* **Sentiment Analysis / Opinion Mining:** A natural language processing technique that determines whether textual data is negative, neutral, or positive. Businesses use it to detect sentiment in social data, gauge brand and product reputation from customer feedback, and understand customer needs.


In this notebook, we'll develop a **Sentiment Analysis model** to categorize a text and emoji input as **Positive or Negative.**

## Table of Contents
1. [Importing dependencies](#p1)
2. [Importing dataset](#p2)
3. [Importing Emoji Dataset](#p3)
4. [Determine Emoji Polarity](#p4)
5. [Split Emojis in two polarities](#p5)
6. [Preprocess Text](#p6)
7. [Convert the ASCII emoji into emoticon symbols](#p7)
8. [Machine Learning Implementation](#p8)
9. [Transform Data](#p9)
10. [Splitting the Data](#p10)
11. [Save the Model](#p11)
12. [Process inputs - Text and Emojis](#p12)
13. [Building a Sentiment Analysis](#p13)
14. [Final Pipeline Chat](#p14)
15. [References](#p15)

# <a name="p1">Importing Dependencies</a>

In [1]:
# Utilities
import pandas as pd
import numpy as np
import re
import pickle

# Machine Learning
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn import naive_bayes
from sklearn.metrics import roc_auc_score
from sklearn.metrics import confusion_matrix, classification_report

# Emoji Library
import emoji

# <a name="p2">Importing Dataset</a>

We took our datasets from the following links:
- 1.6m tweet dataset with sentiment polarity : https://www.kaggle.com/kazanova/sentiment140
- emoticon dataset : https://www.kaggle.com/thomasseleck/emoji-sentiment-data?select=Emoji_Sentiment_Data_v1.0.csv

### <a name="p2">Context</a>
This is the Sentiment140 dataset. It contains 1,600,000 tweets extracted using the Twitter API. The tweets have been annotated (0 = negative, 4 = positive) and can be used to detect sentiment.

### <a name="p2">Content</a>

It contains the following six fields:

- target: the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
- ids: tweet id.
- date: the date of the tweet.
- flag: The query. If there is no query, then this value is NO_QUERY.
- user: the user that tweeted.
- text: the text of the tweet.


In [2]:
DATASET_COLUMNS  = ["sentiment", "ids", "date", "flag", "user", "text"]

tweet = pd.read_csv('tweets.csv',names= DATASET_COLUMNS)

In [3]:
# Checking whether the columns were placed correctly or not.

tweet.columns

Index(['sentiment', 'ids', 'date', 'flag', 'user', 'text'], dtype='object')

In [4]:
# Drop unused columns

tweet = tweet.drop(columns= ['ids','date','flag','user'])

In [5]:
# First 5 rows

tweet.head()

Unnamed: 0,sentiment,text
0,0,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,is upset that he can't update his Facebook by ...
2,0,@Kenichan I dived many times for the ball. Man...
3,0,my whole body feels itchy and like its on fire
4,0,"@nationwideclass no, it's not behaving at all...."


In [6]:
# Replace the sentiment column value of 4 to 1

tweet['sentiment'] = tweet['sentiment'].replace(4,1)

In [7]:
# Checking the Polarity of the sentiment column

tweet.sentiment.unique()

array([0, 1], dtype=int64)

In [8]:
# Checking if there is any null value in the dataframe

tweet.isna().any()

sentiment    False
text         False
dtype: bool

In [9]:
# Counting the null values in each column

tweet.isna().sum()

sentiment    0
text         0
dtype: int64

# <a name="p3">Import Emoji Dataset</a>

### <a name="p3">Context</a>
A lexicon of 751 emoji characters with automatically assigned sentiment.
The sentiment is computed from 70,000 tweets, labelled by 83 human annotators
in 13 European languages.
The process and analysis of the emoji sentiment ranking are described in: Kralj Novak P, Smailović J, Sluban B, Mozetič I (2015) Sentiment of Emojis. PLoS ONE 10(12): e0144296. doi:10.1371/journal.pone.0144296

### <a name="p3">Content</a>

It contains the following nine fields:

- Emoji: Emoji symbol
- Unicode codepoint: Hexadecimal Emoji Universal code.
- Occurrences: Frequency of the emoji's appearance in tweets.
- Position: Position of the emoji, between 0.009615385 and 1.0.
- Negative: Negative score value of the emoji.
- Neutral: Neutral score value of the emoji.
- Positive: Positive score value of the emoji.
- Unicode name: Emoji Unicode name.
- Unicode block: Unicode block the emoji belongs to.

In [10]:
# Set up the emoji data (note: this rebinds the name `emoji`, shadowing the `emoji` library imported above)

emoji = pd.read_csv("emoji.csv")
emoji.head()

Unnamed: 0,Emoji,Unicode codepoint,Occurrences,Position,Negative,Neutral,Positive,Unicode name,Unicode block
0,😂,0x1f602,14622,0.805101,3614,4163,6845,FACE WITH TEARS OF JOY,Emoticons
1,❤,0x2764,8050,0.746943,355,1334,6361,HEAVY BLACK HEART,Dingbats
2,♥,0x2665,7144,0.753806,252,1942,4950,BLACK HEART SUIT,Miscellaneous Symbols
3,😍,0x1f60d,6359,0.765292,329,1390,4640,SMILING FACE WITH HEART-SHAPED EYES,Emoticons
4,😭,0x1f62d,5526,0.803352,2412,1218,1896,LOUDLY CRYING FACE,Emoticons


# <a name="p4">Determine Emoji Polarity</a>


During the data preprocessing stage, we had to decide which polarities to work with:

- Our text data has two polarities (negative, positive), while the emoji data has three (negative, neutral, positive). Only eleven emojis are neutral, so it would not be worthwhile to relabel the text data into three classes for so few emojis.

We therefore kept two classes, POSITIVE and NEGATIVE, to match our initial idea.

In [11]:
# compare the polarity of the dataset and turn the polarity to binary
# 0 = negative, 1= positive
polarity_ls = []
for index, row in emoji.iterrows():
    
    # polarity == sentiment
    # initial polarity is negative
    polarity = 0 
    
    # positive if positive value is greater than negative value
    arg_1 = row['Positive'] > row['Negative']
    
    # positive if neutral value is odd and positive and negative value are equal
    arg_2 = row['Positive'] == row['Negative'] and row['Neutral'] % 2 != 0 
    
    # positive if either of the two arguments are true
    if arg_1 or arg_2:
        polarity = 1
    polarity_ls.append(polarity)
    
# create new emoji dataset (a freshly built DataFrame already has a clean 0..n-1 index)
new_emoji_df = pd.DataFrame(polarity_ls, columns=['sentiment'])
new_emoji_df['emoji'] = emoji['Emoji'].values
new_emoji_df

Unnamed: 0,sentiment,emoji
0,1,😂
1,1,❤
2,1,♥
3,1,😍
4,0,😭
...,...,...
964,1,➛
965,1,♝
966,1,❋
967,1,✆


# <a name="p5">Split Emojis in two polarities</a>

We split the emoji dataset into positive and negative subsets to inspect the emojis in each group and count them.

- 795 `positive` emojis
- 174 `negative` emojis

In [12]:
def sentiment_dataset(df, polarity):
    emoticon_df = df.loc[df['sentiment'] == polarity]
    df_emoticon_df = pd.DataFrame(emoticon_df)
    df_emoticon_df.reset_index(inplace=True, drop=True)
    return df_emoticon_df

In [13]:
# Positive Emojis

positive_emoji = sentiment_dataset(new_emoji_df, 1)
positive_emoji

Unnamed: 0,sentiment,emoji
0,1,😂
1,1,❤
2,1,♥
3,1,😍
4,1,😘
...,...,...
790,1,➛
791,1,♝
792,1,❋
793,1,✆


In [14]:
# Negative emojis

negative_emoji = sentiment_dataset(new_emoji_df, 0)
negative_emoji

Unnamed: 0,sentiment,emoji
0,0,😭
1,0,😩
2,0,😒
3,0,😔
4,0,█
...,...,...
169,0,🕔
170,0,🈂
171,0,🎰
172,0,҂


# <a name="p6">Preprocess Text</a>

**Text Preprocessing** is traditionally an essential step for **Natural Language Processing (NLP)** tasks. It transforms text into a more digestible form so that machine learning algorithms can perform better.

**Our approach for this task:** 

While implementing the machine learning algorithm, we realized how much feature engineering a dataset of this size would require, which would have slowed down our development. For this reason, we chose a lighter alternative that fits our purpose.

- We decided to remove only the tokens that start with internet symbols such as '@', 'http://', 'https://', '&', '#'.
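As a rough illustration of this rule on a single string, the sketch below drops every whitespace-separated token that starts with one of those prefixes. The helper name `strip_internet_tokens` and the sample tweet are made up for this example; they are not part of the notebook.

```python
# Prefixes that mark mentions, links, HTML entities and hashtags
# (the same tuple the notebook uses; the names below are hypothetical).
REMOVE_KEYS = ('@', 'http://', 'https://', '&', '#')

def strip_internet_tokens(text):
    """Drop every whitespace-separated token that starts with one of REMOVE_KEYS."""
    return ' '.join(tok for tok in text.split() if not tok.startswith(REMOVE_KEYS))

# Made-up sample tweet, for illustration only.
sample = "@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. #sad &amp; :("
cleaned = strip_internet_tokens(sample)
print(cleaned)  # - Awww, that's a bummer. :(
```

Note that the emoticon `:(` survives the cleaning, which matters for the conversion step later on.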




In [15]:
posts = tweet['text']

In [16]:
# prefixes of the tokens to remove (mentions, links, HTML entities, hashtags)
remove_keys = ('@', 'http://','https://', '&', '#')
temp = []
for text in posts:
    # remove words that start with a symbol from remove_keys
    clean_text = ' '.join(txt for txt in text.split() if not txt.startswith(remove_keys))
    temp.append(clean_text)
posts = temp
tweet['text'] = posts
posts

["- Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D",
 "is upset that he can't update his Facebook by texting it... and might cry as a result School today also. Blah!",
 'I dived many times for the ball. Managed to save 50% The rest go out of bounds',
 'my whole body feels itchy and like its on fire',
 "no, it's not behaving at all. i'm mad. why am i here? because I can't see you all over there.",
 'not the whole crew',
 'Need a hug',
 "hey long time no see! Yes.. Rains a bit ,only a bit LOL , I'm fine thanks , how's you ?",
 "nope they didn't have it",
 'que me muera ?',
 "spring break in plain city... it's snowing",
 'I just re-pierced my ears',
 "I couldn't bear to watch it. And I thought the UA loss was embarrassing . . . . .",
 'It it counts, idk why I did either. you never talk to me anymore',
 "i would've been the first, but i didn't have a gun. not really though, zac snyder's just a doucheclown.",
 'I wish I got to watch it with you!! I miss you and 

In [17]:
# Convert the text column to Unicode strings (essentially **str** in Python 3)

tweet["text"]=tweet["text"].astype('U')

# <a name="p7">Convert the ASCII emoji into emoticon symbols</a>

To follow this step, we first need to define ASCII, Unicode, emoticons, and emojis. We break the topic into small chunks and finish with an explanation of what we plan to do.

**What is ASCII?**

`ASCII - American Standard Code for Information Interchange`

 It is a character encoding that uses numeric codes to represent characters, including lowercase and uppercase English letters, digits, and punctuation symbols. It can represent 128 characters using 7 bits per character; when a character is stored in an 8-bit byte, the first bit is always 0. (Computer Hope, 2020)

**What is Unicode?**

`Unicode` is a universal character encoding standard. It defines how to represent individual characters in web pages, text files, and other types of documents.
While ASCII uses just one byte to represent each character, Unicode encodings such as UTF-8 use up to 4 bytes for each character.
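The difference in size can be checked directly in Python. The short sketch below encodes a few characters as UTF-8 and counts the bytes; the characters chosen are arbitrary examples.

```python
# Arbitrary sample characters: an ASCII letter, an accented letter,
# a currency sign and an emoji.
samples = ['A', 'é', '€', '😍']

byte_lengths = []
for ch in samples:
    encoded = ch.encode('utf-8')          # bytes as stored in a UTF-8 file
    byte_lengths.append(len(encoded))
    print(ch, hex(ord(ch)), len(encoded), 'byte(s)')

# Only the ASCII letter fits in a single byte; the emoji needs four.
```

The code point printed for 😍 is `0x1f60d`, the same value listed in the `Unicode codepoint` column of the emoji dataset.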


**What is Emoticon?**

`Emoticons`, short for "emotion icons", are combinations of letters, numbers and punctuation marks used to create pictorial icons that generally display emotion or sentiment.

**What is Emoji?**

`Emoji` comes from the Japanese words for picture (e) and character (moji). It is the graphical successor to the emoticon, representing similar things to what emoticons express through sequences of characters. An emoji is a small image, either animated or static, resembling a facial expression, an entity, or a digital communications concept.

- We plan to map those character sequences to emojis; that is why we did not go further with feature engineering in the data preprocessing stage. We need these characters to extract emotions from them.
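The mapping idea can be sketched as a dictionary lookup over whitespace-separated tokens. This is a minimal illustration, not the notebook's implementation: the table below is a small subset of the pairs used later, and the names `EMOTICON_TO_EMOJI` and `emoticons_to_emoji` are hypothetical.

```python
# Small subset of emoticon -> emoji pairs, for illustration only.
EMOTICON_TO_EMOJI = {':)': '😊', ':(': '😧', '<3': '😍', ';)': '😉'}

def emoticons_to_emoji(text, table=EMOTICON_TO_EMOJI):
    """Replace each whitespace-separated token that exactly matches a known emoticon."""
    return ' '.join(table.get(tok, tok) for tok in text.split())

converted = emoticons_to_emoji("Need a hug :( but <3 you")
print(converted)  # Need a hug 😧 but 😍 you
```

Because the lookup works on whole tokens, an emoticon glued to a word (for example `hug:(`) is left unchanged. The list-based functions in the following cells compare whole tokens in the same way.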

In [18]:
# text emoticons and their corresponding emoji symbols
txt_emoji = [
    ':)', ':P', ':D', ':|', ":'(", ':O', ":*", '<3', ':(', ';)',
    'xD', ':/', '=D' 
]
txt_emoji_pic =[
    '😊', '😛', '😄', '😐', '😢', '😲', '😘', '😍', '😧', '😉', 
    '😁', '😒', '😀'
]

In [19]:
# Function to convert text to emoji icons

def convert_emoji(txt, conv_txt, conv_pic):
    temp = []
    for i in txt:
        for j in range(len(conv_txt)):
            if i == conv_txt[j]:
                i = conv_pic[j]
        temp.append(i)
    return ' '.join(temp)
    

In [20]:
# Function to split each tweet into tokens and call convert_emoji to turn the emoticon tokens into emojis.

def conv_emoji_on_data(df_data):
    conv_text = []
    for idx, row in df_data.iterrows():
        txt = [i for i in row['text'].split()]
        emoji_found = convert_emoji(txt, txt_emoji, txt_emoji_pic)
        conv_text.append(emoji_found)
    return conv_text

In [21]:
# Convert the text-based emoticons in every tweet into Unicode emoji symbols.

conv_text = conv_emoji_on_data(tweet)
conv_text

["- Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D",
 "is upset that he can't update his Facebook by texting it... and might cry as a result School today also. Blah!",
 'I dived many times for the ball. Managed to save 50% The rest go out of bounds',
 'my whole body feels itchy and like its on fire',
 "no, it's not behaving at all. i'm mad. why am i here? because I can't see you all over there.",
 'not the whole crew',
 'Need a hug',
 "hey long time no see! Yes.. Rains a bit ,only a bit LOL , I'm fine thanks , how's you ?",
 "nope they didn't have it",
 'que me muera ?',
 "spring break in plain city... it's snowing",
 'I just re-pierced my ears',
 "I couldn't bear to watch it. And I thought the UA loss was embarrassing . . . . .",
 'It it counts, idk why I did either. you never talk to me anymore',
 "i would've been the first, but i didn't have a gun. not really though, zac snyder's just a doucheclown.",
 'I wish I got to watch it with you!! I miss you and 

In [22]:
# Convert the following words into these emoji icons (each word maps to the emoji at the same index).

add_emoji_txt = ['sad', 'unhappy', 'crying', 'smile', 'happy', 'love']
add_emoji_pic =['😔', '😧', '😭', '😆', '😊', '😍']


In [23]:
# Function to convert a list of words into a list of emojis.

def add_emoji_text(df_data):
    reform_pos_text = []
    for ct in df_data:
        txt = [i for i in ct.split()]
        emoji_found = convert_emoji(txt, add_emoji_txt, add_emoji_pic)
        reform_pos_text.append(emoji_found)
    return reform_pos_text

In [24]:
# Convert Selected words into emojis from texts

text_conv = add_emoji_text(conv_text)
text_conv

["- Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D",
 "is upset that he can't update his Facebook by texting it... and might cry as a result School today also. Blah!",
 'I dived many times for the ball. Managed to save 50% The rest go out of bounds',
 'my whole body feels itchy and like its on fire',
 "no, it's not behaving at all. i'm mad. why am i here? because I can't see you all over there.",
 'not the whole crew',
 'Need a hug',
 "hey long time no see! Yes.. Rains a bit ,only a bit LOL , I'm fine thanks , how's you ?",
 "nope they didn't have it",
 'que me muera ?',
 "spring break in plain city... it's snowing",
 'I just re-pierced my ears',
 "I couldn't bear to watch it. And I thought the UA loss was embarrassing . . . . .",
 'It it counts, idk why I did either. you never talk to me anymore',
 "i would've been the first, but i didn't have a gun. not really though, zac snyder's just a doucheclown.",
 'I wish I got to watch it with you!! I miss you and 

# <a name="p8">Machine Learning Implementation</a>

Text classification assigns class labels to documents or pieces of text by examining their word usage. For instance, a binary classifier decides between two labels, such as cat or dog, or positive or negative; the text gets one label or the other, but not both. A multi-label classifier, by contrast, can assign one or more labels to a piece of text. A classifier learns from labelled feature sets (the training data) and then classifies unlabelled feature sets.

Some standard preprocessing techniques for Sentiment Analysis, such as `Bag of words(BOW)`, `Tokenization`, `Lemmatization`, `Stemming`, and `Stopwords`, were not a good fit for this specific problem. Although they are compelling from a feature engineering point of view, when we applied them in the early stages of this project, they eliminated substantial parts of the data that make up emojis and carry emotional meaning. We dropped that approach in favour of an alternative that follows our central idea: classifying input texts as POSITIVE or NEGATIVE when text and emojis are mixed in the same string. We decided to use tf-idf.


**TF-IDF (Term Frequency–Inverse Document Frequency)** is a fundamental technique for analyzing words in documents and retrieving relevant documents from a corpus (collection). It can query a corpus by calculating normalized scores that express the relative importance of terms in the documents. Mathematically, TF-IDF is the product of the term frequency and the inverse document frequency, `tf_idf = tf*idf`: `tf` represents the importance of a term in a specific document, and `idf` represents the importance of that term relative to the entire collection. Multiplying the two produces a score that accounts for both factors. The technique has long been a standard building block of search engines.
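To make `tf_idf = tf*idf` concrete, here is a small worked example in plain Python. It uses the textbook definitions (relative term frequency and `log(N / df)`) on a made-up three-document corpus; scikit-learn's `TfidfVectorizer` applies a smoothed and normalized variant by default, so its numbers will differ.

```python
import math
from collections import Counter

# Made-up toy corpus, already tokenized.
docs = [
    "good movie good plot".split(),
    "bad movie".split(),
    "good day".split(),
]

def tf(term, doc):
    """Share of the document's tokens that are `term`."""
    return Counter(doc)[term] / len(doc)

def idf(term, corpus):
    """log(N / df): rarer terms across the corpus get a larger weight.
    Assumes the term occurs in at least one document."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df)

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

for term in ("good", "plot", "movie"):
    print(term, round(tf_idf(term, docs[0], docs), 3))
# good 0.203
# plot 0.275
# movie 0.101
```

In the first document, "plot" appears only once but scores highest because no other document contains it, while "good" appears twice but is discounted for also occurring elsewhere.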

One of the major hurdles in machine learning with natural language is that the algorithms usually deal with numbers, while natural language is, essentially, text. So `we need to transform texts into numbers`, otherwise known as `text vectorization`. It is a fundamental step in machine learning for analyzing text data, and different vectorization algorithms can drastically affect results, so we need to choose whichever of `TfidfTransformer` and `TfidfVectorizer` will deliver the results we expect.

With `TfidfTransformer`, we first compute word counts using `CountVectorizer`, then compute the `Inverse Document Frequency (IDF)` values, and finally compute the tf-idf scores.
With `TfidfVectorizer`, by contrast, we do all three steps at once. It computes the word counts, IDF values, and tf-idf scores, all using the same dataset under the hood.

**When to use what?**

We may be wondering why we should use more steps than necessary if we can get everything done in one. There are cases where we want to use `TfidfTransformer` over `TfidfVectorizer`, and they are not always obvious (Kavita Ganesan, Ph.D, n.d.). Here is a general guideline:
- If we need the term frequency (term count) vectors for different tasks, use `TfidfTransformer`.
- If we need to compute tf-idf scores on documents within our “training” dataset, use `TfidfVectorizer`.
- If we need to compute tf-idf scores on documents outside our “training” dataset, use either one; **both will work**.

After carefully analyzing both options, we decided that `TfidfVectorizer` was the best way to transform words into numbers that machine learning algorithms can understand. The `TF-IDF` scores can be fed to algorithms such as `Naive Bayes` and `Support Vector Machines`, often improving on the results of more basic methods like word counts.
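The relationship between the two options can be checked on a toy corpus. The sketch below assumes scikit-learn's default parameters and a made-up three-sentence corpus; it compares the two-step route (`CountVectorizer` followed by `TfidfTransformer`) with the one-step `TfidfVectorizer`.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

# Made-up corpus, for illustration only.
corpus = ["I love this phone", "I hate this phone", "love love love"]

# Two steps: raw term counts first, then tf-idf weighting.
counts = CountVectorizer().fit_transform(corpus)
two_step = TfidfTransformer().fit_transform(counts)

# One step: counting and weighting handled by a single object.
one_step = TfidfVectorizer().fit_transform(corpus)

same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```

With default settings both routes should produce the same matrix; the two-step route is mainly useful when the raw `counts` are also needed for something else.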



In [25]:
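# TFIDF vectorizer
# requires the NLTK stop-word corpus: run nltk.download('stopwords') once
# stop_words is documented as accepting a list, so the set is converted back to a list
stopset = list(set(stopwords.words('english')))
vectorizer = TfidfVectorizer(use_idf=True, lowercase=True,
                            strip_accents='ascii', stop_words=stopset)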

In [26]:
# print out the emoticons and sentiment values
e_c, p = 0, 0
for index, row in new_emoji_df.iterrows():
    print(f"{row['emoji']} = {row['sentiment']}")
    p += 1 if row['sentiment'] else 0
    e_c += 1

😂 = 1
❤ = 1
♥ = 1
😍 = 1
😭 = 0
😘 = 1
😊 = 1
👌 = 1
💕 = 1
👏 = 1
😁 = 1
☺ = 1
♡ = 1
👍 = 1
😩 = 0
🙏 = 1
✌ = 1
😏 = 1
😉 = 1
🙌 = 1
🙈 = 1
💪 = 1
😄 = 1
😒 = 0
💃 = 1
💖 = 1
😃 = 1
😔 = 0
😱 = 1
🎉 = 1
😜 = 1
☯ = 1
🌸 = 1
💜 = 1
💙 = 1
✨ = 1
😳 = 1
💗 = 1
★ = 1
█ = 0
☀ = 1
😡 = 0
😎 = 1
😢 = 1
💋 = 1
😋 = 1
🙊 = 1
😴 = 0
🎶 = 1
💞 = 1
😌 = 1
🔥 = 1
💯 = 1
🔫 = 0
💛 = 1
💁 = 1
💚 = 1
♫ = 1
😞 = 0
😆 = 1
😝 = 1
😪 = 0
� = 1
😫 = 0
😅 = 1
👊 = 1
💀 = 0
😀 = 1
😚 = 1
😻 = 1
© = 1
👀 = 1
💘 = 1
🐓 = 1
☕ = 1
👋 = 1
✋ = 1
🎊 = 1
🍕 = 1
❄ = 1
😥 = 1
😕 = 0
💥 = 1
💔 = 0
😤 = 0
😈 = 1
► = 1
✈ = 1
🔝 = 1
😰 = 0
⚽ = 1
😑 = 0
👑 = 1
😹 = 1
👉 = 1
🍃 = 1
🎁 = 1
😠 = 0
🐧 = 1
☆ = 1
🍀 = 1
🎈 = 1
🎅 = 1
😓 = 0
😣 = 0
😐 = 0
✊ = 1
😨 = 0
😖 = 0
💤 = 1
💓 = 1
👎 = 0
💦 = 1
✔ = 1
😷 = 0
⚡ = 1
🙋 = 1
🎄 = 1
💩 = 0
🎵 = 1
➡ = 1
😛 = 1
😬 = 1
👯 = 1
💎 = 1
🌿 = 1
🎂 = 1
🌟 = 1
🔮 = 1
❗ = 1
👫 = 1
🏆 = 1
✖ = 1
☝ = 1
😙 = 1
⛄ = 1
👅 = 1
♪ = 1
🍂 = 1
💏 = 1
🔪 = 1
🌴 = 1
👈 = 1
🌹 = 1
🙆 = 1
➜ = 1
👻 = 1
💰 = 1
🍻 = 1
🙅 = 0
🌞 = 1
🍁 = 1
⭐ = 1
▪ = 1
🎀 = 1
━ = 1
☷ = 1
🐷 = 1
🙉 = 1
🌺 = 1
💅 = 1
🐶 = 1
🌚 = 1
👽 = 1
🎤 = 1
👭 = 1
🎧 = 

In [27]:
# Check the percentage of Positive emojis in the dataset

print(f'Total Positive Emojis are ({p}:{e_c}) or {round(p / e_c * 100)}%')

Total Positive Emojis are (795:969) or 82%




# <a name="p9">Transform Data</a>

Most scikit-learn objects are either `transformers` or `models`.

**Transformers** are for pre-processing before modelling. Imputers (such as `SimpleImputer`, which fills in missing values) and the feature selection classes in sklearn are examples of transformers.

**Models** are used to make predictions, for example the Linear Regression, Decision Tree, and Random Forest models. We usually pre-process our data (with transformers) before passing it to a model.
How the methods fit(), transform(), fit_transform() and predict() are used depends on the type of object.

**For Transformers:**

1. `fit()` - Calculates parameters from the training data (such as the mean of a column's values) and saves them as internal object state.
2. `transform()` - Applies the parameters calculated above and returns the modified data.
3. `fit_transform()` - Joins the two steps above. Internally, it calls fit() and then transform() on the same data. (StackExchange, 2017).

**For Models:**

1. `fit()` - Calculates the parameters/weights on the training data (e.g. the `coef_` attribute in the case of Linear Regression) and saves them as internal object state.
2. `predict()` - Uses the weights calculated above to make predictions on test data.
3. `transform()` - Cannot be used
4. `fit_transform()` - Cannot be used 

**Conclusion:**

To solve our problem, we used a transformer's `fit_transform()` to convert the words into a matrix of numbers to which we can apply the machine learning algorithm.
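The call order for a transformer and a model can be sketched on toy data. Everything below (texts, labels, variable names) is made up for illustration; only the scikit-learn classes are the ones used in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data, only to show which method is called when.
train_texts = ["good great happy", "love happy great", "bad sad awful", "hate sad bad"]
train_labels = [1, 1, 0, 0]
new_texts = ["happy great", "sad awful"]

vec = TfidfVectorizer()
X_train = vec.fit_transform(train_texts)  # transformer: learn vocabulary + IDF, then transform
X_new = vec.transform(new_texts)          # reuse what was learned; no refitting on new data

model = MultinomialNB()
model.fit(X_train, train_labels)          # model: learn its parameters from training data
predictions = model.predict(X_new)        # model: predict labels for unseen data
print(predictions)
```

Calling `transform()` rather than `fit_transform()` on the new texts keeps the columns of `X_new` aligned with the vocabulary the model was trained on.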

**Algorithm Choice:**

After we split the data, we had to choose which algorithm to use. We tried `Linear Support Vector`, `Logistic Regression`, and `Random Forest`, but none of them exceeded 80% accuracy. `Naive Bayes` was our choice: it scales well to a large dataset like ours, and on some text classification problems it can match or outperform more sophisticated classification methods. We reached a ROC AUC score of roughly `0.85`, and we could keep emojis and special characters. This classification algorithm comes in three types:

- `Gaussian:` Used in classification problems where the features are assumed to follow a normal distribution.
- `Multinomial:` Used for discrete counts, which suits our text classification problem because it models how frequently a word appears in a document.
- `Bernoulli:` Helpful if our feature vectors are binary (i.e. zeros and ones). One application would be text classification with a 'bag of words' model in which 1s and 0s stand for "word occurs in the document" and "word does not occur in the document", but we are not using a bag of words.


In [28]:
# dependent variable will be linked as: 0 = negative, 1 = positive
y = tweet.sentiment
# convert 'sentence' from text to features
X = vectorizer.fit_transform(tweet.text)

print(y.shape)
print(X.shape)
print(f'{X.shape[0]} observations X {X.shape[1]} unique words')


(1600000,)
(1600000, 278853)
1600000 observations X 278853 unique words


# <a name="p10">Splitting the Data</a>
The preprocessed data is divided into two sets:
* **Training Data:** The dataset on which the model is trained. Contains 75% of the data.
* **Test Data:** The dataset against which the model is tested. Contains 25% of the data.

In [29]:
# Test Train Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,random_state=None)

# we will train a naive bayes classifier
clf = naive_bayes.MultinomialNB()
#clf = naive_bayes.BernoulliNB()

clf.fit(X_train, y_train)

# evaluate the model with the ROC AUC score on the test set
roc_auc_score(y_test, clf.predict_proba(X_test)[:,1])


0.844727079947847
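The value above is a ROC AUC score, which is not the same quantity as accuracy. The sketch below uses made-up labels and probabilities to show how the two can differ, and also uses `confusion_matrix`, which is imported at the top of the notebook. The 0.5 cut-off is an assumption of this example.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

# Hypothetical true labels and predicted probabilities of the positive class.
y_true = [0, 0, 1, 1, 1]
y_prob = [0.2, 0.6, 0.55, 0.7, 0.9]

# Accuracy needs hard labels, so the probabilities are cut at 0.5 here.
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]

auc = roc_auc_score(y_true, y_prob)    # ranking quality, independent of a threshold
acc = accuracy_score(y_true, y_pred)   # share of correct hard predictions
cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class

print(f"ROC AUC: {auc:.3f}, accuracy: {acc:.3f}")
print(cm)
```

On the notebook's data, the same calls could be applied to `y_test`, `clf.predict_proba(X_test)[:,1]` and `clf.predict(X_test)`.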

# <a name="p11">Save the Model</a>

We used pickle to save our model as `Sentiment-NB.pickle`; it reached a ROC AUC score of nearly `0.85`.

In [30]:
# the context manager closes the file even if pickling fails
with open('Sentiment-NB.pickle','wb') as file:
    pickle.dump(clf, file)
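To score new text after reloading, the fitted vectorizer is needed alongside the classifier, because `get_sentiment` calls `vectorizer.transform` before `clf.predict`. One possible approach, sketched here with stand-in objects and a hypothetical file name, is to pickle both in a single dictionary.

```python
import os
import pickle
import tempfile

# Stand-in objects; in the notebook these would be the fitted `vectorizer` and `clf`.
vectorizer_stub = {"vocabulary": {"happy": 0, "sad": 1}}
model_stub = {"classes": [0, 1]}

# Hypothetical file name, written to a temporary directory for this example.
path = os.path.join(tempfile.mkdtemp(), "sentiment-bundle.pickle")

# Save both objects together so they cannot get out of sync.
with open(path, "wb") as fh:
    pickle.dump({"vectorizer": vectorizer_stub, "model": model_stub}, fh)

# Load them back in a later session.
with open(path, "rb") as fh:
    bundle = pickle.load(fh)

print(sorted(bundle))  # ['model', 'vectorizer']
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.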

# <a name="p12">Process inputs - Text and Emojis</a>

The functions in this section extract the text and emojis from an input and label them as negative or positive.

In [31]:
# Function that extracts both text and emojis

text = "I miss my ps3, it's out of commission Wutcha playing? 💗😍 Have you copped Blood On The Sand?"

def extract_emoji_text(text = text):
    global allchars, emoji_list
    # remove all tags and links; they are not needed for sentiment
    remove_keys = ('@', 'http://', 'https://', '&', '#')
    clean_text = ' '.join(txt for txt in text.split() if not txt.startswith(remove_keys))

    # set up the input: collect every character, then keep the known emojis
    allchars = [ch for ch in text]
    emoji_list = [c for c in allchars if c in emoji.Emoji.values]

    # extract text: keep the words that contain no emoji
    clean_text = ' '.join([word for word in clean_text.split() if not any(i in word for i in emoji_list)])

    # extract emoji: keep the words that contain at least one emoji
    clean_emoji = ''.join([word for word in text.split() if any(i in word for i in emoji_list)])
    return (clean_text, clean_emoji)

allchars, emoji_list = 0, 0
(ct, ce) = extract_emoji_text()
print('\nAll Char:', allchars)
print('\nAll Emoji:',emoji_list)
print('\n', ct)
print('\n',ce)


All Char: ['I', ' ', 'm', 'i', 's', 's', ' ', 'm', 'y', ' ', 'p', 's', '3', ',', ' ', 'i', 't', "'", 's', ' ', 'o', 'u', 't', ' ', 'o', 'f', ' ', 'c', 'o', 'm', 'm', 'i', 's', 's', 'i', 'o', 'n', ' ', 'W', 'u', 't', 'c', 'h', 'a', ' ', 'p', 'l', 'a', 'y', 'i', 'n', 'g', '?', ' ', '💗', '😍', ' ', 'H', 'a', 'v', 'e', ' ', 'y', 'o', 'u', ' ', 'c', 'o', 'p', 'p', 'e', 'd', ' ', 'B', 'l', 'o', 'o', 'd', ' ', 'O', 'n', ' ', 'T', 'h', 'e', ' ', 'S', 'a', 'n', 'd', '?']

All Emoji: ['💗', '😍']

 I miss my ps3, it's out of commission Wutcha playing? Have you copped Blood On The Sand?

 💗😍
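For comparison, a character-level variant of the same idea is sketched below. It is not the notebook's function: the emoji set is a small hypothetical subset of the lexicon, and the helper name `split_text_and_emoji` is made up. Unlike the token-based version above, it also handles emojis glued to a word.

```python
# Hypothetical subset of the emoji lexicon, for illustration only.
KNOWN_EMOJI = {'💗', '😍', '😭'}

def split_text_and_emoji(text, known=KNOWN_EMOJI):
    """Return (text without known emojis, string of the known emojis found)."""
    emojis = ''.join(ch for ch in text if ch in known)
    words = ''.join(ch for ch in text if ch not in known)
    # split/join collapses the extra spaces left behind by removed emojis
    return ' '.join(words.split()), emojis

result = split_text_and_emoji("Wutcha playing?💗😍 Have you copped it?")
print(result)  # ('Wutcha playing? Have you copped it?', '💗😍')
```

The trade-off is that a character-level split treats every emoji independently, so multi-character emoji sequences would be broken apart.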


In [32]:
# Function to predict the sentiment of a text input; it does not apply to emojis.

def get_sentiment(s_input = 'Happy'):
    # turn input into array
    input_array= np.array([s_input])
    # vectorize the input
    input_vector = vectorizer.transform(input_array)
    # predict the score of vector
    pred_senti = clf.predict(input_vector)

    return pred_senti[0]
print(get_sentiment())

1


In [33]:
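# Function to predict emoji sentiments; it returns a list. It does not apply to text inputs.

def get_emoji_sentiment(emoji_ls = '❤❤❤😭', emoji_df = new_emoji_df):
    # build an emoji -> sentiment lookup once instead of scanning the dataframe per character
    emoji_lookup = {}
    for e, s in zip(emoji_df['emoji'], emoji_df['sentiment']):
        emoji_lookup.setdefault(e, int(s))  # keep the first occurrence of each emoji
    # characters missing from the lexicon are skipped instead of raising an IndexError
    emoji_val_ls = [emoji_lookup[e] for e in emoji_ls if e in emoji_lookup]
    return emoji_val_ls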

ges = get_emoji_sentiment()
print('Sentiment value of each emoji:',ges)

Sentiment value of each emoji: [1, 1, 1, 0]


# <a name="p13">Building a Sentiment Analysis</a>

The following function `calculates the final score` for our `sentiment analysis` project and returns the result to the user. Afterwards, we implement a chat widget to check the sentiment of any phrase or emoji.

The score is the average of the text prediction and the emoji values, so it ranges from `0.0` to `1.0`, and for mixed inputs it is a float. An input is `negative` if its score is below `0.6` and `positive` from `0.6` up to `1.0`.

In [34]:
# Function that calculates the final score to our inputs

def get_text_emoji_sentiment(input_test = 'love 😭'):
    # separate text and emoji
    (ext_text, ext_emoji) = extract_emoji_text(input_test)
    print(f'\tExtracted: "{ext_text}" , {ext_emoji}')

    # get text sentiment
    senti_text = get_sentiment(ext_text)
    print(f'\tText value: {senti_text}')

    # get emoji sentiment
    senti_emoji_value = sum(get_emoji_sentiment(ext_emoji, new_emoji_df))
    print_emo_val_avg = 0 if len(ext_emoji) == 0 else senti_emoji_value/len(ext_emoji)
    print(f'\tEmoji average value: {print_emo_val_avg}')

    # avg the sentiment of emojis and text
    senti_avg = (senti_emoji_value + senti_text) / (len(ext_emoji) + 1)
    print(f'\tAverage value: {senti_avg}')

    # set value of avg sentiment to either pos or neg 
    senti_truth = "Positive" if senti_avg >= 0.6 else "Negative"
    
    return senti_truth

print(get_text_emoji_sentiment())

	Extracted: "love" , 😭
	Text value: 1
	Emoji average value: 0.0
	Average value: 0.5
Negative
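The arithmetic behind the result above can be isolated in a few lines. The helper `combine_scores` is a hypothetical stand-in that takes the text prediction and the list of emoji values directly, so it runs without the trained model.

```python
def combine_scores(text_score, emoji_scores, threshold=0.6):
    """Average the text prediction with the emoji values, then apply the 0.6 cut-off."""
    average = (text_score + sum(emoji_scores)) / (len(emoji_scores) + 1)
    label = "Positive" if average >= threshold else "Negative"
    return average, label

print(combine_scores(1, [0]))        # (0.5, 'Negative'), the 'love 😭' case above
print(combine_scores(1, [1, 1, 0]))  # (0.75, 'Positive')
print(combine_scores(0, [1, 1]))     # two positive emojis outweigh negative text
```

Since the text contributes a single vote, an input with several emojis is dominated by the emojis.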


In [35]:
# Function to print the Results

def print_status(test):
    print('========================================')
    print(f'Your input is "{test}" \n')
    sentiment = get_text_emoji_sentiment(test)
    print(f'\nYour input is of "{sentiment}" sentiment'.upper())
    print('========================================')

# <a name="p14">Final Pipeline Chat</a>

Pipeline chat to talk to `Ada`, our smart chatbot.

In [36]:
import ipywidgets as widgets
import warnings; warnings.simplefilter('ignore')

In [37]:
# for text area
l = widgets.Layout(flex='0 1 auto', height='50px',width='auto')
post_tweet = widgets.Textarea(value='Nice to meet you! I am Ada 😍', layout=l)
print(post_tweet.value)
# for button
button = widgets.Button(description="Talk to Ada!")
output = widgets.Output()

def on_tweet_clicked(b):
    output.clear_output()
    with output:
        output.layout={'border': '1px solid black'}
        print_status(post_tweet.value)


Nice to meet you! I am Ada 😍


In [38]:
# List all the emojis whose sentiment can be checked in the chat

emoji.Emoji.values

array(['😂', '❤', '♥', '😍', '😭', '😘', '😊', '👌', '💕', '👏', '😁', '☺', '♡',
       '👍', '😩', '🙏', '✌', '😏', '😉', '🙌', '🙈', '💪', '😄', '😒', '💃', '💖',
       '😃', '😔', '😱', '🎉', '😜', '☯', '🌸', '💜', '💙', '✨', '😳', '💗', '★',
       '█', '☀', '😡', '😎', '😢', '💋', '😋', '🙊', '😴', '🎶', '💞', '😌', '🔥',
       '💯', '🔫', '💛', '💁', '💚', '♫', '😞', '😆', '😝', '😪', '�', '😫', '😅',
       '👊', '💀', '😀', '😚', '😻', '©', '👀', '💘', '🐓', '☕', '👋', '✋', '🎊',
       '🍕', '❄', '😥', '😕', '💥', '💔', '😤', '😈', '►', '✈', '🔝', '😰', '⚽',
       '😑', '👑', '😹', '👉', '🍃', '🎁', '😠', '🐧', '☆', '🍀', '🎈', '🎅', '😓',
       '😣', '😐', '✊', '😨', '😖', '💤', '💓', '👎', '💦', '✔', '😷', '⚡', '🙋',
       '🎄', '💩', '🎵', '➡', '😛', '😬', '👯', '💎', '🌿', '🎂', '🌟', '🔮', '❗',
       '👫', '🏆', '✖', '☝', '😙', '⛄', '👅', '♪', '🍂', '💏', '🔪', '🌴', '👈',
       '🌹', '🙆', '➜', '👻', '💰', '🍻', '🙅', '🌞', '🍁', '⭐', '▪', '🎀', '━',
       '☷', '🐷', '🙉', '🌺', '💅', '🐶', '🌚', '👽', '🎤', '👭', '🎧', '👆', '🍸',
       '🍷', '®', '🍉', '😇', '☑', '🏃', '😿', '│', '💣', '🍺', '▶', '😲

In [39]:
# Run this cell to display the chat window

display(post_tweet,button, output)
button.on_click(on_tweet_clicked)

Textarea(value='Nice to meet you! I am Ada 😍', layout=Layout(flex='0 1 auto', height='50px', width='auto'))

Button(description='Talk to Ada!', style=ButtonStyle())

Output()

# <a name="p15">References</a>

Computer Hope (2020) ASCII, Available at: https://www.computerhope.com/jargon/a/ascii.htm (Accessed: 28th March 2020).

Kavita Ganesan, Ph.D (n.d.) How to Use Tfidftransformer & Tfidfvectorizer?, Available at: https://kavita-ganesan.com/tfidftransformer-tfidfvectorizer-usage-differences/#.YGCzN69KjIU (Accessed: 28th March 2020).

StackExchange (2017) What's the difference between fit and fit_transform in scikit-learn models?, Available at: https://datascience.stackexchange.com/questions/12321/whats-the-difference-between-fit-and-fit-transform-in-scikit-learn-models (Accessed: 29th March 2021).