# Tweet Sentiment Analysis 

This notebook contains an exploratory data analysis of the Tweet Sentiment Analysis dataset, followed by text preprocessing. You can get the dataset [here](https://www.kaggle.com/competitions/tweet-sentiment-extraction/data?select=train.csv).

## Steps Involved

- [Checking NaN/Null Values and Duplicated Values](#Checking-Null/Nan-Values)
- [Cleaning the data](#Cleaning-the-Data)
- [Lemmatization](#Lemmatization)
- [Transforming the string labels into numeric values](#Data-Transformation)

## Importing the required libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Importing the Dataset

Loading the dataset and previewing the first few rows

In [2]:
df = pd.read_csv('data/train.csv')
df.head()

Unnamed: 0,textID,text,selected_text,sentiment
0,cb774db0d1,"I`d have responded, if I were going","I`d have responded, if I were going",neutral
1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,Sooo SAD,negative
2,088c60f138,my boss is bullying me...,bullying me,negative
3,9642c003ef,what interview! leave me alone,leave me alone,negative
4,358bd9e861,"Sons of ****, why couldn`t they put them on t...","Sons of ****,",negative


In [3]:
df.dtypes

textID           object
text             object
selected_text    object
sentiment        object
dtype: object

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27481 entries, 0 to 27480
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   textID         27481 non-null  object
 1   text           27480 non-null  object
 2   selected_text  27480 non-null  object
 3   sentiment      27481 non-null  object
dtypes: object(4)
memory usage: 858.9+ KB


In [5]:
df.shape

(27481, 4)

**The dataset has *27481* samples and *4* columns. All columns have the `object` dtype, i.e. they hold strings.**

## Checking Null/Nan Values

In [6]:
df.isna().sum()

textID           0
text             1
selected_text    1
sentiment        0
dtype: int64

Inspecting the single row that contains null values

In [7]:
df[df['text'].isna() | df['selected_text'].isna()]

Unnamed: 0,textID,text,selected_text,sentiment
314,fdb77c3752,,,neutral


Dropping the null entry

In [8]:
df.dropna(inplace=True)
df.head()

Unnamed: 0,textID,text,selected_text,sentiment
0,cb774db0d1,"I`d have responded, if I were going","I`d have responded, if I were going",neutral
1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,Sooo SAD,negative
2,088c60f138,my boss is bullying me...,bullying me,negative
3,9642c003ef,what interview! leave me alone,leave me alone,negative
4,358bd9e861,"Sons of ****, why couldn`t they put them on t...","Sons of ****,",negative


In [9]:
df.isna().sum()

textID           0
text             0
selected_text    0
sentiment        0
dtype: int64

**The data does not have any null values now.**

### Checking Duplicated Values

In [10]:
df.duplicated().sum()

0

**The dataset has no fully duplicated rows.**
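Called without arguments, `df.duplicated()` compares entire rows. If every row carries its own `textID` (an assumption, not verified here), two tweets with identical text would not be flagged by that check. A minimal sketch on a hypothetical toy frame, not the real data, of counting duplicates on the `text` column alone:

```python
import pandas as pd

# Hypothetical toy data: two rows share the same text but have different IDs.
toy = pd.DataFrame({
    "textID": ["a1", "b2", "c3"],
    "text": ["good morning", "good morning", "so tired"],
    "sentiment": ["positive", "positive", "negative"],
})

# Whole-row comparison: the distinct textID values hide the repeated text.
full_row_dupes = int(toy.duplicated().sum())

# Comparing only the `text` column flags the second "good morning".
text_dupes = int(toy.duplicated(subset=["text"]).sum())

print(full_row_dupes, text_dupes)
```

On the notebook's data, the equivalent call would be `df.duplicated(subset=['text']).sum()`.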

## Cleaning the Data

In [11]:
df.head()

Unnamed: 0,textID,text,selected_text,sentiment
0,cb774db0d1,"I`d have responded, if I were going","I`d have responded, if I were going",neutral
1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,Sooo SAD,negative
2,088c60f138,my boss is bullying me...,bullying me,negative
3,9642c003ef,what interview! leave me alone,leave me alone,negative
4,358bd9e861,"Sons of ****, why couldn`t they put them on t...","Sons of ****,",negative


**For this machine learning model, we will use only the `text` column as the input feature and `sentiment` as the output feature.**

In [12]:
tweets = df['text'].copy()
target = df['sentiment'].copy()

In [13]:
tweets.head()

0                  I`d have responded, if I were going
1        Sooo SAD I will miss you here in San Diego!!!
2                            my boss is bullying me...
3                       what interview! leave me alone
4     Sons of ****, why couldn`t they put them on t...
Name: text, dtype: object

In [14]:
target.head()

0     neutral
1    negative
2    negative
3    negative
4    negative
Name: sentiment, dtype: object

### Importing Necessary Libraries for Text Preprocessing

In [15]:
import nltk
from nltk.corpus import stopwords
import re

In [16]:
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/notaryanramani/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/notaryanramani/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [17]:
def clean_tweets(tweets):
    cleaned_tweets = []
    stopwords_eng = stopwords.words('english')
    for tweet in tweets:
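        # removing hyperlinks
        tweet = re.sub(r'https?://[^\s\n\r]+', ' ', tweet)
        # removing punctuation and digits, keeping only ASCII letters
        # ('A-z' would also keep the ASCII characters between 'Z' and 'a', such as '_' and '^')
        tweet = re.sub(r'[^a-zA-Z]', ' ', tweet)
        # collapsing the resulting runs of spaces into a single space
        tweet = re.sub(r'[\W_]+', ' ', tweet)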
        tweet = tweet.lower()
        words = nltk.word_tokenize(tweet)
        words = [word for word in words if word not in stopwords_eng]
        cleaned_tweets.append(' '.join(words))
    print('Completed cleaning tweets')
    return cleaned_tweets
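The regex steps of the cleaning function can be sanity-checked in isolation. The sketch below uses only the standard library and a hypothetical sample tweet; it skips tokenization and stopword removal, and the simplified `\S+` URL pattern is an assumption made for brevity. It also shows why the letter class should end in uppercase `Z`: the range `A-z` covers ASCII 65 to 122 and so also admits characters such as `_` and `^`.

```python
import re

# Hypothetical sample tweet, not taken from the dataset.
sample = "Loving the new phone!!! 100% worth it http://example.com/x_y #happy"

# Step 1: drop hyperlinks.
no_links = re.sub(r"https?://\S+", " ", sample)

# Step 2: replace every character that is not an ASCII letter with a space.
letters_only = re.sub(r"[^a-zA-Z]", " ", no_links)

# Step 3: lowercase and collapse the runs of spaces left behind.
cleaned = " ".join(letters_only.lower().split())
print(cleaned)

# The range A-z also matches '_' and '^', so they survive the substitution.
kept_by_wide_range = re.sub(r"[^a-zA-z]", " ", "a_b^c")
removed_by_exact_range = re.sub(r"[^a-zA-Z]", " ", "a_b^c")
print(kept_by_wide_range, "|", removed_by_exact_range)
```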

In [18]:
cleaned_tweets = np.array(clean_tweets(tweets))
cleaned_tweets[:5]

Completed cleaning tweets


array(['responded going', 'sooo sad miss san diego', 'boss bullying',
       'interview leave alone', 'sons put releases already bought'],
      dtype='<U129')

### Checking whether any punctuation is left

In [19]:
whole_text = ' '.join(cleaned_tweets)
whole_text[:50]

'responded going sooo sad miss san diego boss bully'

In [20]:
chars = []
for char in whole_text:
    if char not in chars:
        chars.append(char)
print(chars)

['r', 'e', 's', 'p', 'o', 'n', 'd', ' ', 'g', 'i', 'a', 'm', 'b', 'u', 'l', 'y', 't', 'v', 'w', 'h', 'f', 'c', 'j', 'k', 'q', 'z', 'x']


**The text does not contain any punctuation, so we can now lemmatize the tweets.**
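The same check can be expressed as a set difference, which reports only the characters that should not be there. This is a sketch using a hypothetical stand-in string; in the notebook, `whole_text` would take the place of `sample_text`.

```python
import string

# Hypothetical stand-in for the joined corpus (`whole_text` in the notebook).
sample_text = "responded going sooo sad miss san diego boss bullying"

# Only lowercase ASCII letters and the space separator are expected.
allowed = set(string.ascii_lowercase) | {" "}

# Any character outside the allowed set would show up here.
unexpected = set(sample_text) - allowed
print(sorted(unexpected))

# A string that still contains punctuation is caught by the same check.
unexpected_in_dirty = set("hi there!") - allowed
print(sorted(unexpected_in_dirty))
```

An empty result means the cleaning step left nothing but letters and spaces.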

## Lemmatization

In [21]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/notaryanramani/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/notaryanramani/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [22]:
lemmatizer = WordNetLemmatizer()

In [23]:
def lemmatize_text(tweets):
    lem_tweets = []
    for tweet in tweets:
        lem_words = [lemmatizer.lemmatize(word) for word in nltk.word_tokenize(tweet)]
        lem_tweet = ' '.join(lem_words)
        lem_tweets.append(lem_tweet)
    return lem_tweets

In [24]:
lem_tweets = np.array(lemmatize_text(cleaned_tweets))
lem_tweets[:5]

array(['responded going', 'sooo sad miss san diego', 'bos bullying',
       'interview leave alone', 'son put release already bought'],
      dtype='<U129')

## Data Transformation

In [25]:
from sklearn.feature_extraction.text import CountVectorizer

In [26]:
count_vec = CountVectorizer(max_features=10000)
count_vec.fit(lem_tweets)

CountVectorizer(max_features=10000)

In [27]:
vec_tweets = count_vec.transform(lem_tweets).toarray()
vec_tweets.shape

(27480, 10000)
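Each row of this matrix is a bag-of-words count vector: one column per vocabulary word, one cell per occurrence count. The sketch below reproduces the idea with the standard library on a hypothetical mini-corpus. It is a conceptual illustration only, not scikit-learn's implementation: `CountVectorizer` uses its own tokenizer (by default it ignores single-character tokens) and, with `max_features`, keeps only the most frequent terms.

```python
from collections import Counter

# Hypothetical mini-corpus in the same style as the lemmatized tweets.
docs = ["sooo sad miss san diego", "bos bullying", "sad sad day"]

# Vocabulary: every distinct token, sorted so each word gets a stable column index.
vocab = sorted({word for doc in docs for word in doc.split()})

# One row per document, one column per vocabulary word, cell = occurrence count.
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts.get(word, 0) for word in vocab])

print(vocab)
for row in matrix:
    print(row)
```

Most cells are zero, which is why `transform` returns a sparse matrix. Assuming the default `int64` dtype, the dense `(27480, 10000)` array produced by `.toarray()` takes 27480 × 10000 × 8 bytes, roughly 2.2 GB, so keeping the sparse result may be preferable when memory is limited.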

In [28]:
target.unique()

array(['neutral', 'negative', 'positive'], dtype=object)

In [29]:
target_map = {
    'negative' : 0,
    'positive' : 1,
    'neutral' : 2
}

In [30]:
target = target.map(target_map).astype(np.int64)
target.head()

0    2
1    0
2    0
3    0
4    0
Name: sentiment, dtype: int64
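The integer codes are plain identifiers; the order 0, 1, 2 carries no ranking between the classes. To read model outputs later, the mapping can be inverted. A small sketch, where `predicted_codes` is a hypothetical list standing in for a model's predictions:

```python
target_map = {
    "negative": 0,
    "positive": 1,
    "neutral": 2,
}

# Invert the mapping so integer codes can be turned back into labels.
inverse_target_map = {code: label for label, code in target_map.items()}

# Hypothetical predictions, used only to illustrate decoding.
predicted_codes = [2, 0, 1]
predicted_labels = [inverse_target_map[code] for code in predicted_codes]
print(predicted_labels)
```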

In [31]:
target.value_counts()

2    11117
1     8582
0     7781
Name: sentiment, dtype: int64
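The counts are easier to compare as shares of the whole. The sketch below recomputes them from the numbers printed above, using only the standard library:

```python
# Class counts copied from the value_counts() output (2 = neutral, 1 = positive, 0 = negative).
counts = {2: 11117, 1: 8582, 0: 7781}

total = sum(counts.values())

# Percentage share of each class, rounded to one decimal place.
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}

print(total)
print(shares)
```

Neutral tweets make up about 40% of the data, so the classes are moderately imbalanced rather than evenly split.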

In [32]:
encoded_data = pd.DataFrame(vec_tweets)
encoded_data.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,9990,9991,9992,9993,9994,9995,9996,9997,9998,9999
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [33]:
# Assign by position: `target` still carries the pre-dropna index (label 314 is missing),
# so index-aligned assignment would insert a NaN and cast the column to float.
encoded_data['target'] = target.to_numpy()
encoded_data.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,9991,9992,9993,9994,9995,9996,9997,9998,9999,target
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
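When a Series is assigned as a DataFrame column, pandas aligns it on index labels, not on position. Here `target` comes from `df` after `dropna`, so its index skips label 314, while `encoded_data` is built from a NumPy array and gets a fresh `RangeIndex`. The sketch below uses hypothetical toy data to show what label alignment does in that situation and how position-based assignment differs:

```python
import pandas as pd

# Hypothetical toy labels; dropping the missing entry leaves index labels 0, 2, 3.
labels = pd.Series([2, None, 0, 1]).dropna().astype("int64")

# A freshly built frame gets a new RangeIndex 0, 1, 2.
features = pd.DataFrame({"f0": [10, 20, 30]})

# Label-aligned assignment: row 1 finds no label (NaN) and label 3 is silently dropped.
aligned = features.copy()
aligned["target"] = labels

# Position-based assignment: values are attached in order and the dtype stays integer.
positional = features.copy()
positional["target"] = labels.to_numpy()

print(aligned)
print(positional)
```

A float `target` column where integers were expected is a typical symptom of this kind of misalignment; `target.reset_index(drop=True)` is another way to realign before assigning.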
