**Steps to implement the Twitter sentiment analysis:**
1. Load the datasets
2. Explore the datasets
3. Preprocess the dataset
4. Data Preparation
   1. Split Data
   2. Feature Engineering using TF-IDF
5. Model Building
   1. Naive Bayes
   2. Logistic Regression
   3. Model Summary
6. Final Sentiment Analysis pipeline 

# 1. Load the datasets

In [1]:
# Import basic necessary modules
import pandas as pd
import numpy as np
import re
import spacy

# Load the small English spaCy pipeline
nlp = spacy.load("en_core_web_sm")

In [2]:
# Loading training dataset
train_df = pd.read_csv(filepath_or_buffer="../datasets/train.csv")

# Sample of training dataset
train_df.head()

| | id | label | tweet |
|---|---|---|---|
| 0 | 1 | 0 | @user when a father is dysfunctional and is s... |
| 1 | 2 | 0 | @user @user thanks for #lyft credit i can't us... |
| 2 | 3 | 0 | bihday your majesty |
| 3 | 4 | 0 | #model i love u take with u all the time in ... |
| 4 | 5 | 0 | factsguide: society now #motivation |


In [3]:
# Loading testing dataset
test_df = pd.read_csv(filepath_or_buffer="../datasets/test.csv")

# Sample of testing dataset
test_df.head()

| | id | tweet |
|---|---|---|
| 0 | 31963 | #studiolife #aislife #requires #passion #dedic... |
| 1 | 31964 | @user #white #supremacists want everyone to s... |
| 2 | 31965 | safe ways to heal your #acne!! #altwaystohe... |
| 3 | 31966 | is the hp and the cursed child book up for res... |
| 4 | 31967 | 3rd #bihday to my amazing, hilarious #nephew... |


# 2. Explore the datasets

In [4]:
# Info of training dataset
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31962 entries, 0 to 31961
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      31962 non-null  int64 
 1   label   31962 non-null  int64 
 2   tweet   31962 non-null  object
dtypes: int64(2), object(1)
memory usage: 749.2+ KB


In [6]:
# Info of testing dataset
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17197 entries, 0 to 17196
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      17197 non-null  int64 
 1   tweet   17197 non-null  object
dtypes: int64(1), object(1)
memory usage: 268.8+ KB


In [7]:
# Shape of training dataset
print("Shape of training dataset:", train_df.shape)

Shape of training dataset: (31962, 3)


In [8]:
# Shape of testing dataset
print("Shape of testing dataset:", test_df.shape)

Shape of testing dataset: (17197, 2)


In [14]:
# Check the whole training dataset for null values
if train_df.isnull().values.any():
    print("Please remove null values")
else:
    print("No Null values present in the dataset")

No Null values present in the dataset


In [16]:
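def is_null_in_dataset(df: pd.DataFrame) -> None:
    """Print whether the DataFrame contains any null values."""
    # df[df.notnull()] keeps the original shape, so test the null mask directly
    if df.isnull().values.any():
        print("Please remove null values")
    else:
        print("No Null values present in the dataset")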

In [17]:
# Check whether null values are in the dataset or not.
is_null_in_dataset(train_df)

No Null values present in the dataset


In [18]:
is_null_in_dataset(test_df)

No Null values present in the dataset
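
A per-column count can complement the yes/no check, since it shows where any missing values sit. The sketch below uses a small hypothetical frame (`toy_df`, not part of the dataset) with one missing tweet. It also illustrates that boolean-mask indexing keeps the frame's shape, so comparing shapes does not reveal nulls.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for train_df, with one missing tweet
toy_df = pd.DataFrame(
    {"id": [1, 2, 3], "label": [0, 1, 0], "tweet": ["good day", np.nan, "hello"]}
)

# Number of null values in each column
null_counts = toy_df.isnull().sum()
print(null_counts)

# Single yes/no answer for the whole frame
has_nulls = bool(toy_df.isnull().values.any())
print("Any nulls:", has_nulls)

# Boolean-mask indexing fills unmatched cells with NaN but keeps the shape
masked_shape = toy_df[toy_df.notnull()].shape
print("Shape unchanged by masking:", masked_shape == toy_df.shape)
```

On the real data, `train_df.isnull().sum()` would give the same per-column view.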


In [20]:
train_df["label"].value_counts()

0    29720
1     2242
Name: label, dtype: int64

In [21]:
# Check whether the classes in the training dataset are balanced (share of each label, in percent)
train_df["label"].value_counts(normalize=True)*100

0    92.98542
1     7.01458
Name: label, dtype: float64
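
To make the imbalance concrete, here is a minimal sketch that reuses the label counts printed above. It works out how accurate a model would look if it always predicted the majority class. The variable names are illustrative and not part of the notebook.

```python
# Label counts copied from the value_counts() output above
label_counts = {0: 29720, 1: 2242}

total = sum(label_counts.values())

# Label that occurs most often
majority_label = max(label_counts, key=label_counts.get)

# Accuracy of a trivial model that always predicts the majority label
baseline_accuracy = label_counts[majority_label] / total

# How many label-0 tweets there are for every label-1 tweet
imbalance_ratio = label_counts[0] / label_counts[1]

print(f"Majority label: {majority_label}")
print(f"Always-majority baseline accuracy: {baseline_accuracy:.2%}")
print(f"Imbalance ratio (0 : 1): {imbalance_ratio:.2f} : 1")
```

Because this trivial baseline already scores about 93% accuracy, accuracy alone is likely to be a weak yardstick for the models built later. Per-class metrics such as precision, recall, or F1 on label 1 are usually more informative.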

In [25]:
# Summary statistics of the label column
train_df["label"].describe()

count    31962.000000
mean         0.070146
std          0.255397
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max          1.000000
Name: label, dtype: float64

In [28]:
# Check the skewness of the label column
print("Skewness of the data is:", train_df["label"].skew())

Skewness of the data is: 3.366381217473261
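
For a 0/1 label, the skewness depends only on the share of positive labels `p`, so this value restates the class imbalance rather than adding new information. The sketch below recomputes it from the counts shown earlier. It assumes pandas reports the usual bias-corrected sample skewness, and the variable names are illustrative.

```python
import math

n_total = 31962     # rows in the training dataset
n_positive = 2242   # tweets with label 1

# Share of positive labels
p = n_positive / n_total

# Skewness of a Bernoulli(p) variable: (1 - 2p) / sqrt(p * (1 - p))
population_skew = (1 - 2 * p) / math.sqrt(p * (1 - p))

# Bias correction for a sample of size n (assumed to match pandas' Series.skew)
adjusted_skew = population_skew * math.sqrt(n_total * (n_total - 1)) / (n_total - 2)

print("Population skewness:", population_skew)
print("Bias-corrected skewness:", adjusted_skew)
```

A positive value this large means the labels are concentrated at 0 with a thin tail at 1, which agrees with the 93% / 7% split.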


In [30]:
# sample tweets in the dataset
train_df["tweet"].sample(10)

9940     fab. encouraging planning meeting for #grow w/...
14828    clueless top nyc cop william bratton promises ...
3398                       @user so many sycophants.....  
24508    #amazon rc nitro gas truck hsp 1/10 car 4wd vi...
15652     , caoon, crying, eating, adventure time, caoo...
30897    24hrs stand between me and a plane to zante â...
4937     ð installation day-makes me   @user #outdoo...
10104    so exo dropped 2 music videos today and i have...
8423     good morning ððð have a blessed day ð...
25781                   i got the blues. #blues   #filter 
Name: tweet, dtype: object
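
The sampled tweets contain `@user` placeholders, hashtags, and mis-encoded characters. A rough way to gauge how common these are is to count them per tweet. This is a minimal sketch on three tweets copied from the sample above. The patterns and the `describe_tweet` helper are illustrative assumptions, not the notebook's preprocessing.

```python
import re

# Three tweets copied from the sample output above
sample_tweets = [
    "@user so many sycophants.....",
    "i got the blues. #blues   #filter",
    "good morning ððð have a blessed day ð...",
]

MENTION = re.compile(r"@\w+")           # e.g. @user
HASHTAG = re.compile(r"#\w+")           # e.g. #blues
NON_ASCII = re.compile(r"[^\x00-\x7F]")  # any character outside plain ASCII


def describe_tweet(tweet: str) -> dict:
    """Count mentions, hashtags, and non-ASCII characters in one tweet."""
    return {
        "mentions": len(MENTION.findall(tweet)),
        "hashtags": len(HASHTAG.findall(tweet)),
        "non_ascii": len(NON_ASCII.findall(tweet)),
    }


for tweet in sample_tweets:
    print(describe_tweet(tweet), "<-", tweet)
```

Applied to the full column (for example with `train_df["tweet"].apply(describe_tweet)`), counts like these can indicate which cleaning steps matter most.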