<a href="https://colab.research.google.com/github/Jaspreetkhokhar12/Youtube_Comment_Analyzer/blob/main/PreProcessing_EDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Description**

This notebook is the starting point for building a **YouTube Comment Analyzer Chrome Plugin**. It covers the preprocessing stage: it loads a dataset of Reddit comments with their associated sentiment categories and prepares the data for later steps such as vectorization and model training for a **multiclass sentiment classification task**.

In [None]:
# Importing necessary libraries
import numpy as np  # Used for numerical operations
import pandas as pd  # Used for data manipulation and analysis

In [3]:
# Loading the dataset
df = pd.read_csv('/content/drive/MyDrive/YT_Comment_Analyser/reddit_sentiment_analysis.csv')

# Displaying the first 5 rows
df.head()

Unnamed: 0,clean_comment,category
0,family mormon have never tried explain them t...,1
1,buddhism has very much lot compatible with chr...,1
2,seriously don say thing first all they won get...,-1
3,what you have learned yours and only yours wha...,0
4,for your own benefit you may want read living ...,1


From the output above, we can see that the `category` column encodes three sentiments:

*   1 is for positive sentiment
*   0 is for neutral sentiment
*   -1 is for negative sentiment
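As a small illustration of this encoding, the sketch below maps the integer codes to readable names and counts each class. The `LABEL_NAMES` dictionary, the `sentiment` column, and the toy rows are assumptions made for the example; they are not part of the original dataset.

```python
import pandas as pd

# Hypothetical mapping; the dataset itself only stores the integer codes
LABEL_NAMES = {-1: "negative", 0: "neutral", 1: "positive"}

# Toy rows standing in for the real dataset
toy = pd.DataFrame({
    "clean_comment": ["great video", "who won", "this is awful", "haha nice"],
    "category": [1, 0, -1, 1],
})

# Add a readable label column next to the integer category
toy["sentiment"] = toy["category"].map(LABEL_NAMES)

# Count how many comments fall into each sentiment class
label_counts = toy["sentiment"].value_counts()
print(label_counts)
```

Keeping the integer `category` column as the training target and using the readable column only for inspection is one reasonable choice here.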


## **Preprocessing**



In [4]:
# Checking the shape of the dataset: number of rows and columns
df.shape

(37249, 2)

In [5]:
# Randomly sampling a comment from the 'clean_comment' column
# This helps us understand how individual comments look
df.sample()['clean_comment'].values

array(['sterlite copper draper pryce'], dtype=object)

In [7]:
# Displaying summary info about the DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37249 entries, 0 to 37248
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   clean_comment  37149 non-null  object
 1   category       37249 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 582.1+ KB


In [8]:
# Checking for missing values in each column
df.isnull().sum()

Unnamed: 0,0
clean_comment,100
category,0


In [9]:
# Displaying rows with missing 'clean_comment' values
df[df['clean_comment'].isna()]

Unnamed: 0,clean_comment,category
413,,0
605,,0
2422,,0
2877,,0
3307,,0
...,...,...
35975,,0
36036,,0
37043,,0
37111,,0


This shows the rows where `clean_comment` is NaN. Every row in the preview is labeled 0 (neutral). Because these rows carry no text, they should be removed before training a model.
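One possible way to remove such rows, sketched on toy data, is to restrict `dropna` to the text column so that only rows with a missing comment are dropped. The toy DataFrame is an assumption for illustration. With the two-column dataset used here, a plain `dropna()` behaves the same way as long as `category` has no missing values.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for the real dataset
toy = pd.DataFrame({
    "clean_comment": ["nice one", np.nan, "top kek", np.nan],
    "category": [1, 0, 1, 0],
})

# Drop only the rows whose comment text is missing
cleaned = toy.dropna(subset=["clean_comment"])
print(cleaned)
```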

In [11]:
# Count Categories for Missing Comments
df[df['clean_comment'].isna()]['category'].value_counts()

category,count
0,100


In [12]:
# Drop Rows with Missing Values
df.dropna(inplace=True)

In [16]:
# Check for Duplicate Rows
df[df.duplicated()]

Unnamed: 0,clean_comment,category
375,,0
392,,0
617,aurum mom,0
651,,0
1222,,0
...,...,...
36915,who won,0
37044,,0
37125,hari,0
37158,top kek,1


In [17]:
# Drop duplicates
df.drop_duplicates(inplace=True)

In [18]:
# Verify removal of duplicates
df.duplicated().sum()

np.int64(0)

In [19]:
# Check empty or whitespace-only comments
df[(df['clean_comment'].str.strip() == '')]

Unnamed: 0,clean_comment,category
181,,0
4432,\n,0
10592,,0
16173,,0
32149,\n,0
34959,,0


Six rows are empty strings or contain only whitespace (such as '\n'). They carry no information and should be removed.

In [20]:
# Filter out all rows where clean_comment is empty or whitespace-only after stripping
# .copy() avoids a SettingWithCopyWarning when the column is reassigned later
df = df[~(df['clean_comment'].str.strip() == '')].copy()

In [21]:
# Convert the 'clean_comment' column to lowercase
df['clean_comment'] = df['clean_comment'].str.lower()

# Verify the transformation by displaying the first few rows
df.head()

Unnamed: 0,clean_comment,category
0,family mormon have never tried explain them t...,1
1,buddhism has very much lot compatible with chr...,1
2,seriously don say thing first all they won get...,-1
3,what you have learned yours and only yours wha...,0
4,for your own benefit you may want read living ...,1


In [22]:
# Detect Comments With Leading or Trailing Spaces
df[df['clean_comment'].apply(lambda x: x.endswith(' ') or x.startswith(' '))]

Unnamed: 0,clean_comment,category
0,family mormon have never tried explain them t...,1
1,buddhism has very much lot compatible with chr...,1
2,seriously don say thing first all they won get...,-1
3,what you have learned yours and only yours wha...,0
4,for your own benefit you may want read living ...,1
...,...,...
37241,let the janta decide not ulema clerics,0
37242,hona hai same with vaccination education insu...,0
37246,downvote karna tha par upvote hogaya,0
37247,haha nice,1


In [23]:
# Remove trailing and leading whitespaces from the 'clean_comment' column
df['clean_comment'] = df['clean_comment'].str.strip()

# Verify the transformation by checking for any remaining leading or trailing spaces
df['clean_comment'].apply(lambda x: x.endswith(' ') or x.startswith(' ')).sum()

np.int64(0)

In [25]:
# Identify comments containing URLs
# Regex pattern explanation:
# - 'http[s]?://' matches 'http://' or 'https://'
# - The pattern continues with allowed URL characters (letters, digits, special characters)
url_pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
comments_with_urls = df[df['clean_comment'].str.contains(url_pattern, regex=True)]


# Display the comments containing URLs
comments_with_urls.head()

Unnamed: 0,clean_comment,category
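
The query above returned no rows, so the pattern is never seen matching anything. The sketch below runs the same kind of pattern on a few made-up strings to show what it matches, and how matches could be stripped if URLs did appear. The toy strings and the removal step are assumptions for illustration; the notebook itself only detects URLs.

```python
import pandas as pd

# URL pattern with the parentheses escaped once inside the character class
url_pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

# Made-up comments, two of which contain a URL
toy = pd.Series([
    "check https://example.com/watch?v=abc for details",
    "no links here",
    "old link http://example.org",
])

# Boolean mask of comments that contain at least one URL
has_url = toy.str.contains(url_pattern, regex=True)

# Optional step (not in the notebook): delete the matched URLs
without_urls = toy.str.replace(url_pattern, "", regex=True).str.strip()

print(has_url.tolist())
print(without_urls.tolist())
```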


In [26]:
# Identify comments containing newline characters
comments_with_newline = df[df['clean_comment'].str.contains('\n')]

# Display a preview of the comments that contain newline characters
comments_with_newline.head()

Unnamed: 0,clean_comment,category
448,what missing jpg\nand why this brilliant edit ...,1
781,india has been ruined congress and populist sc...,-1
847,like aap for its stand corruption and making p...,-1
871,reduced trade\ndeficit stronger rupee aren the...,0
1354,amsa press conference australian maritime safe...,1


In [27]:
# Replace newline characters in the 'clean_comment' column with a space
# so that words on adjacent lines stay separated
df['clean_comment'] = df['clean_comment'].str.replace('\n', ' ', regex=False)

# Verify that the transformation was successful by checking for any remaining newline characters
comments_with_newline_remaining = df[df['clean_comment'].str.contains('\n')]
comments_with_newline_remaining

Unnamed: 0,clean_comment,category
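
The cleaning steps in this section could also be collected into one reusable function, for example to apply them to newly scraped comments. The sketch below is one possible arrangement, not the notebook's own code: `preprocess_comments` and its `text_col` parameter are hypothetical names, and the steps run in a slightly different order (text is normalized before empty comments are filtered out).

```python
import numpy as np
import pandas as pd


def preprocess_comments(df: pd.DataFrame, text_col: str = "clean_comment") -> pd.DataFrame:
    """Hypothetical helper that applies the cleaning steps from this section."""
    # Drop rows with missing text, then exact duplicate rows
    out = df.dropna(subset=[text_col]).drop_duplicates().copy()

    # Lowercase, replace newlines with spaces, trim surrounding whitespace
    out[text_col] = (
        out[text_col]
        .str.lower()
        .str.replace("\n", " ", regex=False)
        .str.strip()
    )

    # Remove comments that are empty after normalization
    return out[out[text_col] != ""]


# Toy rows covering each case handled above
toy = pd.DataFrame({
    "clean_comment": ["  Haha Nice ", np.nan, "top kek", "top kek", "\n", "line one\nline two"],
    "category": [1, 0, 1, 1, 0, -1],
})

cleaned = preprocess_comments(toy)
print(cleaned)
```

The function returns a new DataFrame and leaves its input unchanged, which makes it easier to rerun on the raw data.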


## **EDA**