<a href="https://colab.research.google.com/github/ParthKhiriya/Sentiment_Analysis_PRML_Project/blob/main/Obtaining_the_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Step 1: Obtaining the Dataset**

In [1]:
# Installing the kaggle library
! pip install kaggle



In [2]:
# configuring the path of kaggle.json file
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

In [3]:
# Using the Kaggle API to download the dataset
! kaggle datasets download -d abhi8923shriv/sentiment-analysis-dataset

Dataset URL: https://www.kaggle.com/datasets/abhi8923shriv/sentiment-analysis-dataset
License(s): CC0-1.0


In [4]:
from zipfile import ZipFile
dataset = '/content/sentiment-analysis-dataset.zip'

# Use a name other than zip so the built-in zip() is not shadowed
with ZipFile(dataset, 'r') as zf:
  zf.extractall()
  print('The dataset is extracted')

The dataset is extracted
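
Before extracting, it can help to list what an archive contains. The following is a minimal sketch, not part of the original notebook: it builds a tiny in-memory archive (the member names and contents are made up for the demo) and lists its members with `ZipFile.namelist()`. The helper name `list_archive_members` is hypothetical; with the real download you would pass the path to the zip file instead of the buffer.

```python
import io
from zipfile import ZipFile

def list_archive_members(archive):
    """Return the member names of a zip archive (path or file-like object)."""
    with ZipFile(archive, 'r') as zf:
        return zf.namelist()

# Build a small in-memory archive purely for demonstration;
# the file names and contents here are hypothetical.
buffer = io.BytesIO()
with ZipFile(buffer, 'w') as zf:
    zf.writestr('train.csv', 'text,sentiment\n')
    zf.writestr('test.csv', 'text,sentiment\n')

buffer.seek(0)
members = list_archive_members(buffer)
print(members)
```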


In [5]:
# Importing all the required libraries and dependencies
import numpy as np
import pandas as pd

In [6]:
# Loading the main dataset and also the train and test set
train_given = pd.read_csv('train.csv', encoding='latin-1')
test_given = pd.read_csv('test.csv', encoding='latin-1')
df = pd.read_csv('training.1600000.processed.noemoticon.csv', encoding='latin-1', header=None)
# Drop the row with index label 0, then name the six columns
df = df.drop(labels=0)
df.columns = ["polarity","id","date","query","user","text"]

# Print the DataFrame to preview its rows and columns
print(df)
# Our data has no null values and no duplicates



        polarity          id                          date     query  \
1              0  1467810672  Mon Apr 06 22:19:49 PDT 2009  NO_QUERY   
2              0  1467810917  Mon Apr 06 22:19:53 PDT 2009  NO_QUERY   
3              0  1467811184  Mon Apr 06 22:19:57 PDT 2009  NO_QUERY   
4              0  1467811193  Mon Apr 06 22:19:57 PDT 2009  NO_QUERY   
5              0  1467811372  Mon Apr 06 22:20:00 PDT 2009  NO_QUERY   
...          ...         ...                           ...       ...   
1048568        4  1960186342  Fri May 29 07:33:44 PDT 2009  NO_QUERY   
1048569        4  1960186409  Fri May 29 07:33:43 PDT 2009  NO_QUERY   
1048570        4  1960186429  Fri May 29 07:33:44 PDT 2009  NO_QUERY   
1048571        4  1960186445  Fri May 29 07:33:44 PDT 2009  NO_QUERY   
1048572        4  1960186607  Fri May 29 07:33:45 PDT 2009  NO_QUERY   

                    user                                               text  
1          scotthamilton  is upset that he can't update h

# **Step 2: Data Preprocessing**

In [7]:
df['polarity'].value_counts()

# value_counts() lists 0 twice, so inspect the distinct values to find out why
print(df['polarity'].unique())

['0' 0 4]


In [8]:
# unique() shows that the polarity column contains two kinds of zero: the string '0' and the integer 0
# Convert every value to an integer with pd.to_numeric
# Note: errors='coerce' followed by fillna(0) maps any unparseable value to 0
df['polarity'] = pd.to_numeric(df['polarity'], errors='coerce').fillna(0).astype(int)

In [9]:
# Check again whether the column still contains two kinds of zero
print(df['polarity'].unique())

[0 4]
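
The mixed-type problem and its fix can be reproduced on a toy column. This is a minimal sketch with made-up values, not the notebook's data: it counts the Python types stored in an object column and then normalises them with `pd.to_numeric`. Here `errors='raise'` is used (an assumption for the demo, differing from the cell above) so that an unparseable label would surface as an error instead of silently becoming 0.

```python
import pandas as pd

# Toy column mimicking the mixed string/integer labels seen above (made-up values).
mixed = pd.Series(['0', 0, 4, '4'], dtype=object)

# Count how many entries of each Python type the column holds.
type_counts = mixed.map(lambda v: type(v).__name__).value_counts().to_dict()
print(type_counts)

# Convert every entry to a number; errors='raise' fails loudly on bad input.
numeric = pd.to_numeric(mixed, errors='raise')
print(numeric.unique())
```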


In [10]:
# Count the negative (0) and positive (4) tweets again; the counts are now correct
df['polarity'].value_counts()
# Relabel the positive class from 4 to 1 so that the labels are 0 and 1
df['polarity'] = df['polarity'].map({4:1,0:0})
df['polarity'].value_counts()

polarity   count
0         799996
1         248576
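
To see how the two classes compare, the counts above can be turned into shares. This is a small sketch that copies the two counts from the output by hand, so it does not need the full DataFrame; on the real data, `df['polarity'].value_counts(normalize=True)` should give the same proportions.

```python
import pandas as pd

# Class counts copied from the output above (0 = negative, 1 = positive).
counts = pd.Series({0: 799996, 1: 248576}, name='count')

# Divide by the total to get the share of each class.
shares = counts / counts.sum()
print(shares.round(3))
```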


In [11]:
# Now importing some more required dependencies
# The re library handles regular expressions, used here to strip URLs, mentions and hashtags
import re
import string
# The nltk library is for handling the natural language processing tasks such as tokenisation, stemming, removing stopwords etc.
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import nltk

nltk.download('stopwords')

# This function performs basic tasks for cleaning the data such as lowercasing, URL handling etc.
def clean_text(text):

    text = text.lower()  # Lowercasing
    text = re.sub(r'http\S+|www\.\S+', '', text)  # Remove URLs (dot escaped so it matches a literal '.')
    text = re.sub(r'@\w+|#\w+', '', text)  # Remove mentions and hashtags
    text = text.translate(str.maketrans('', '', string.punctuation))  # Remove punctuation
    text = re.sub(r'\d+', '', text)  # Remove numbers
    return text.strip()

# Save the cleaned text in our dataframe
df['cleaned_text'] = df['text'].apply(clean_text)

# Tokenisation - split each cleaned tweet into a list of words on whitespace (punctuation was already removed by clean_text)
df['tokens'] = df['cleaned_text'].apply(lambda x: x.split())

# Remove stopwords, on the assumption that they do not affect the sentiment of the tweet
stop_words = set(stopwords.words('english'))
df['tokens'] = df['tokens'].apply(lambda tokens: [word for word in tokens if word not in stop_words])

# Stemming: reduce each word to its stem (e.g. acting and acted both become act)
stemmer = PorterStemmer()
df['stemmed_text'] = df['tokens'].apply(lambda tokens: ' '.join([stemmer.stem(word) for word in tokens]))

print(df)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
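
To make the cleaning steps concrete, here is a minimal sketch that runs them on a single made-up tweet. It uses only the standard library: the function is a copy of the cleaning steps with the dot in `www\.` escaped so it matches a literal period, and `demo_stop_words` is a tiny hypothetical stand-in for the NLTK stopword list so that nothing needs to be downloaded. Stemming is left out of this sketch.

```python
import re
import string

def clean_text(text):
    """Lowercase a tweet and strip URLs, mentions, hashtags, punctuation and digits."""
    text = text.lower()
    text = re.sub(r'http\S+|www\.\S+', '', text)   # URLs
    text = re.sub(r'@\w+|#\w+', '', text)          # mentions and hashtags
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = re.sub(r'\d+', '', text)                # digits
    return text.strip()

# A made-up tweet, used only for illustration.
sample = "@user Loving the new phone!!! http://t.co/abc #happy 2day"

# Whitespace tokenisation, as in the notebook cell.
tokens = clean_text(sample).split()

# Hypothetical mini stopword list standing in for stopwords.words('english').
demo_stop_words = {'the', 'a', 'is'}
filtered = [word for word in tokens if word not in demo_stop_words]

print(tokens)
print(filtered)
```

Note that removing digits turns `2day` into `day`, which may or may not be what you want for informal text.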


         polarity          id                          date     query  \
1               0  1467810672  Mon Apr 06 22:19:49 PDT 2009  NO_QUERY   
2               0  1467810917  Mon Apr 06 22:19:53 PDT 2009  NO_QUERY   
3               0  1467811184  Mon Apr 06 22:19:57 PDT 2009  NO_QUERY   
4               0  1467811193  Mon Apr 06 22:19:57 PDT 2009  NO_QUERY   
5               0  1467811372  Mon Apr 06 22:20:00 PDT 2009  NO_QUERY   
...           ...         ...                           ...       ...   
1048568         1  1960186342  Fri May 29 07:33:44 PDT 2009  NO_QUERY   
1048569         1  1960186409  Fri May 29 07:33:43 PDT 2009  NO_QUERY   
1048570         1  1960186429  Fri May 29 07:33:44 PDT 2009  NO_QUERY   
1048571         1  1960186445  Fri May 29 07:33:44 PDT 2009  NO_QUERY   
1048572         1  1960186607  Fri May 29 07:33:45 PDT 2009  NO_QUERY   

                    user                                               text  \
1          scotthamilton  is upset that he c