## Objective

Sentiment analysis is a common NLP use case in which a tweet is classified as positive, negative, or neutral based on its text. This problem goes a step further: we must also identify the words in the tweet that determine its polarity.

## Understanding the Evaluation Metric

The metric in this competition is the word-level **Jaccard score**. The Jaccard score measures how similar two sets are: the higher the score, the more similar the two strings. The idea is to count the tokens the two strings have in common and divide that by the total number of unique tokens across both strings. It is expressed mathematically as:

![](https://images.deepai.org/glossary-terms/jaccard-index-452201.jpg)

![](https://images.deepai.org/glossary-terms/jaccard-index-391304.jpg)

[Source](https://en.wikipedia.org/wiki/Jaccard_index)

**Here is how one can implement the Jaccard score in Python:**

In [None]:

def jaccard(str1, str2): 
    a = set(str1.lower().split()) 
    b = set(str2.lower().split())
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))


Sentence_1 = 'Life well spent is life good'
Sentence_2 = 'Life is an art and it is good so far'
    
jaccard(Sentence_1,Sentence_2)

In [None]:
Sentence_1 = 'Life well spent is life good'
Sentence_2 = 'Life well spent is life good'
    
jaccard(Sentence_1,Sentence_2)
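
To see where the score for the first pair of sentences comes from, the calculation can be worked step by step. This is only a trace of the same set arithmetic used in `jaccard` above; the variable names are chosen here for illustration.

```python
s1 = "Life well spent is life good"
s2 = "Life is an art and it is good so far"

# Lowercasing and splitting on whitespace, then de-duplicating with set()
a = set(s1.lower().split())   # 5 unique tokens ("Life" and "life" collapse)
b = set(s2.lower().split())   # 9 unique tokens ("is" appears twice)

common = a & b                               # tokens present in both sentences
union_size = len(a) + len(b) - len(common)   # 5 + 9 - 3 = 11

score = len(common) / union_size
print(sorted(common), union_size, round(score, 4))
# ['good', 'is', 'life'] 11 0.2727
```

Two identical sentences share every token, so the intersection equals the union and the score is 1.0.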

# <a name="imports"></a>1. Importing the necessary libraries

In [None]:
import numpy as np 
import pandas as pd 

# text processing libraries
import re
import string
import nltk
from nltk.corpus import stopwords

# XGBoost
import xgboost as xgb
from xgboost import XGBClassifier

# sklearn 
from sklearn import model_selection
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import f1_score
from sklearn import preprocessing, decomposition, model_selection, metrics, pipeline
from sklearn.model_selection import GridSearchCV,StratifiedKFold,RandomizedSearchCV

# matplotlib and seaborn for plotting
import matplotlib.pyplot as plt
import seaborn as sns

# File system management
import os

# Suppress warnings 
import warnings
warnings.filterwarnings('ignore')

# <a name="reading"></a> 2. Reading the datasets

In [None]:
#Training data
train = pd.read_csv('../input/tweet-sentiment-extraction/train.csv')
print('Training data shape: ', train.shape)
train.head()

In [None]:
# Testing data 
test = pd.read_csv('../input/tweet-sentiment-extraction/test.csv')
print('Testing data shape: ', test.shape)
test.head()

# <a name="eda"></a> 3. Basic EDA

## Missing values

In [None]:
#Missing values in training set
train.isnull().sum()

The columns denote the following:

* The `textID` of a tweet
* The `text` of a tweet
* The `selected_text`, which determines the polarity of the tweet
* The `sentiment` of the tweet

The test dataset doesn't have the selected text column which needs to be identified.
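
Because `selected_text` is the target, any predicted span can be scored against the training labels with the Jaccard function defined earlier. The sketch below scores a naive baseline that returns the whole tweet as its prediction. The two rows are made up, and the `mean_jaccard` helper and the choice to average per-tweet scores are assumptions for illustration, not something taken from this notebook.

```python
def jaccard(str1, str2):
    # Same word-level Jaccard score as defined earlier in the notebook
    a = set(str1.lower().split())
    b = set(str2.lower().split())
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))


def mean_jaccard(rows, predict):
    """Average Jaccard score of predict(text) against selected_text (hypothetical helper)."""
    scores = [jaccard(predict(r["text"]), r["selected_text"]) for r in rows]
    return sum(scores) / len(scores)


# Made-up rows mimicking the train.csv columns
toy_rows = [
    {"text": "I love this song so much", "selected_text": "love this song"},
    {"text": "my phone is broken again", "selected_text": "broken"},
]

# Baseline: predict the full tweet text as the selected span
baseline_score = mean_jaccard(toy_rows, predict=lambda text: text)
print(round(baseline_score, 2))  # (3/6 + 1/5) / 2 = 0.35
```

A shorter labelled span relative to the tweet lowers the score of this baseline, which is why the second row scores only 0.2.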

In [None]:
#Missing values in test set
test.isnull().sum()

## Distribution of the Sentiment Column in the training set


In [None]:
# Positive tweet
print("Positive Tweet example :",train[train['sentiment']=='positive']['text'].values[0])
#negative_text
print("Negative Tweet example :",train[train['sentiment']=='negative']['text'].values[0])
#neutral_text
print("Neutral tweet example  :",train[train['sentiment']=='neutral']['text'].values[0])

In [None]:
train['sentiment'].value_counts()

A lot of the tweets have a neutral sentiment.

In [None]:
train_sentiment_counts = train['sentiment'].value_counts()
sns.barplot(x=train_sentiment_counts.index, y=train_sentiment_counts.values, palette='rocket')

## Distribution of the Sentiment Column in the test set

In [None]:
test['sentiment'].value_counts()

In [None]:
test_sentiment_counts = test['sentiment'].value_counts()
sns.barplot(x=test_sentiment_counts.index, y=test_sentiment_counts.values, palette='rocket')


## Exploring positive, negative and neutral text

In [None]:
positive_text = train[train['sentiment'] == 'positive']['selected_text']
negative_text = train[train['sentiment'] == 'negative']['selected_text']
neutral_text = train[train['sentiment'] == 'neutral']['selected_text'].apply(str)

In [None]:
# Positive selected text
print("Positive selected_text example :", positive_text.values[0])
# Negative selected text
print("Negative selected_text example :", negative_text.values[0])
# Neutral selected text
print("Neutral selected_text example  :", neutral_text.values[0])

# <a name="processing"></a> 4. Text Data Preprocessing

The helper functions below perform the basic text preprocessing steps:
- make the text lowercase
- remove text in square brackets
- remove links
- remove HTML tags and newlines
- remove punctuation
- remove words containing numbers
- tokenize
- remove stopwords


In [None]:
# text preprocessing helper functions

def clean_text(text):
    '''Make text lowercase, remove text in square brackets,remove links,remove punctuation
    and remove words containing numbers.'''
    text = text.lower()
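    # Raw strings keep regex escapes such as \[ and \S from being read as string escapes
    text = re.sub(r'\[.*?\]', '', text)                  # text in square brackets
    text = re.sub(r'https?://\S+|www\.\S+', '', text)    # links
    text = re.sub(r'<.*?>+', '', text)                   # HTML tags
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # punctuation
    text = re.sub(r'\n', '', text)                       # newlines
    text = re.sub(r'\w*\d\w*', '', text)                 # words containing digits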
    return text


def text_preprocessing(text):
    """
    Cleaning and parsing the text.

    """
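    tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
    # Build the stopword set once per call; looking up the list for every token is slow
    stop_words = set(stopwords.words('english'))

    nopunc = clean_text(text)
    tokenized_text = tokenizer.tokenize(nopunc)
    remove_stopwords = [w for w in tokenized_text if w not in stop_words]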
    combined_text = ' '.join(remove_stopwords)
    return combined_text
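
To make the effect of each step visible, the trace below applies a subset of the same steps to one made-up sentence and prints every intermediate result. It is a sketch only: `preprocess_trace` and `TOY_STOPWORDS` are hypothetical names, the tiny stopword set stands in for NLTK's English list, and `re.findall` stands in for `RegexpTokenizer` so that the example runs without NLTK data.

```python
import re
import string

# Hypothetical, tiny stopword set used only for this illustration;
# the notebook itself relies on NLTK's English stopword list.
TOY_STOPWORDS = {"i", "the", "is", "at", "my", "so", "a", "it"}


def preprocess_trace(text):
    """Apply the cleaning steps one by one and record each intermediate result."""
    steps = []
    text = text.lower()
    steps.append(("lowercase", text))
    text = re.sub(r"\[.*?\]", "", text)
    steps.append(("drop [bracketed] text", text))
    text = re.sub(r"https?://\S+|www\.\S+", "", text)
    steps.append(("drop links", text))
    text = re.sub("[%s]" % re.escape(string.punctuation), "", text)
    steps.append(("drop punctuation", text))
    text = re.sub(r"\w*\d\w*", "", text)
    steps.append(("drop words with digits", text))
    tokens = re.findall(r"\w+", text)  # stand-in for RegexpTokenizer(r'\w+')
    steps.append(("tokenize", tokens))
    tokens = [w for w in tokens if w not in TOY_STOPWORDS]
    steps.append(("drop stopwords", tokens))
    return " ".join(tokens), steps


sample = "I LOVED the show [live] at 9pm!!! http://example.com So good"
cleaned, steps = preprocess_trace(sample)
for name, value in steps:
    print(f"{name:>24}: {value}")
print(cleaned)  # loved show good
```

Note that the link has to be removed before punctuation is stripped; otherwise `http://example.com` would be reduced to `httpexamplecom` and survive as an ordinary token.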

## Pre-processed selected text columns

In [None]:
positive_text_clean = positive_text.apply(text_preprocessing)
negative_text_clean = negative_text.apply(text_preprocessing)
neutral_text_clean = neutral_text.apply(text_preprocessing)

In [None]:
#source of code : https://medium.com/@cristhianboujon/how-to-list-the-most-common-words-from-text-corpus-using-scikit-learn-dad4d0cab41d
def get_top_n_words(corpus, n=None):
    """
    List the top n words in a vocabulary according to occurrence in a text corpus.
    """
    vec = CountVectorizer().fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]
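
The same kind of frequency table can be sketched with only the standard library, which is handy for checking the output on a small input. The function name `top_n_words_counter` and the toy corpus are made up for illustration. One known difference: `CountVectorizer`'s default token pattern ignores one-character tokens, while this whitespace split keeps them, so counts may not match exactly on real data.

```python
from collections import Counter


def top_n_words_counter(corpus, n=None):
    """Return (word, count) pairs sorted by descending count (hypothetical helper)."""
    counts = Counter()
    for doc in corpus:
        # Lowercase and split on whitespace; no further tokenization is applied
        counts.update(doc.lower().split())
    return counts.most_common(n)


toy_corpus = ["good day", "good morning", "happy day good"]
print(top_n_words_counter(toy_corpus, n=2))
# [('good', 3), ('day', 2)]
```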

In [None]:
top_words_in_positive_text = get_top_n_words(positive_text_clean)
top_words_in_negative_text = get_top_n_words(negative_text_clean)
top_words_in_neutral_text = get_top_n_words(neutral_text_clean)

p1 = [x[0] for x in top_words_in_positive_text[:10]]
p2 = [x[1] for x in top_words_in_positive_text[:10]]


n1 = [x[0] for x in top_words_in_negative_text[1:11]]
n2 = [x[1] for x in top_words_in_negative_text[1:11]]


nu1 = [x[0] for x in top_words_in_neutral_text[1:11]]
nu2 = [x[1] for x in top_words_in_neutral_text[1:11]]

In [None]:
import plotly.graph_objects as go

fig = go.Figure([go.Bar(x=p1, y=p2, text=p2 )])
fig.update_traces(texttemplate='%{text:.2s}', textposition='outside')
fig.update_layout(uniformtext_minsize=8, uniformtext_mode='hide',title_text='Most common positive_text words')

fig.show()

fig = go.Figure([go.Bar(x=n1, y=n2, text=n2,marker_color='indianred')])
fig.update_traces(texttemplate='%{text:.2s}', textposition='outside')
fig.update_layout(uniformtext_minsize=8, uniformtext_mode='hide',title_text='Most common negative_text words')

fig.show()

fig = go.Figure([go.Bar(x=nu1, y=nu2, text=nu2, marker_color='lightsalmon' )])
fig.update_traces(texttemplate='%{text:.2s}', textposition='outside')
fig.update_layout(uniformtext_minsize=8, uniformtext_mode='hide',title_text='Most common neutral_text words')

fig.show()

## Wordclouds

Let's create wordclouds to see which words contribute to which type of polarity.

In [None]:
from wordcloud import WordCloud
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=[26, 8])
wordcloud1 = WordCloud( background_color='white',
                        width=600,
                        height=400).generate(" ".join(positive_text_clean))
ax1.imshow(wordcloud1)
ax1.axis('off')
ax1.set_title('Positive text',fontsize=40);

wordcloud2 = WordCloud( background_color='white',
                        width=600,
                        height=400).generate(" ".join(negative_text_clean))
ax2.imshow(wordcloud2)
ax2.axis('off')
ax2.set_title('Negative text',fontsize=40);

wordcloud3 = WordCloud( background_color='white',
                        width=600,
                        height=400).generate(" ".join(neutral_text_clean))
ax3.imshow(wordcloud3)
ax3.axis('off')
ax3.set_title('Neutral text',fontsize=40);


The wordclouds give an idea of the words which might influence the polarity of the tweet.