In [2]:
import pandas as pd

df = pd.read_csv("../Data/Dataset.csv")
df.head()

Unnamed: 0,text,sentiment
0,,0
1,Horrible!!! The worst experience ever. Do not ...,0
2,Terrible service!! I won't buy from here again...,0
3,"I had high hopes, but it broke after a week. :-/",0
4,"Product is okay, but packaging was awful. ?!?",0


Justification: Missing values can lead to errors during analysis. Identifying them early ensures proper handling.

In [3]:
print(df.isnull().sum())

text         20
sentiment     0
dtype: int64
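
`isnull()` counts only NaN/None entries; empty or whitespace-only strings are not flagged. Whether this dataset contains any is not known from the output above, so the following is only a sketch on a toy frame (the `toy` DataFrame and its values are illustrative assumptions, not rows from `Dataset.csv`):

```python
import pandas as pd

# Toy frame standing in for the notebook's df (illustrative values only)
toy = pd.DataFrame({
    "text": [None, "Great product", "   ", ""],
    "sentiment": [0, 1, 0, 1],
})

# Entries that isnull() reports as missing
is_null = toy["text"].isnull()

# Entries that are present but contain nothing except whitespace
is_blank = toy["text"].fillna("").str.strip().eq("") & ~is_null

print("null:", int(is_null.sum()), "blank:", int(is_blank.sum()))
```

If blank rows turn up, they could be dropped with the same boolean mask before continuing.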


In [5]:
# Drop rows with missing values in the 'text' or 'sentiment' columns
df.dropna(subset=['text', 'sentiment'], inplace=True)

# Reset the index after dropping rows
df.reset_index(drop=True, inplace=True)
df.head(5)

Unnamed: 0,text,sentiment
0,Horrible!!! The worst experience ever. Do not ...,0
1,Terrible service!! I won't buy from here again...,0
2,"I had high hopes, but it broke after a week. :-/",0
3,"Product is okay, but packaging was awful. ?!?",0
4,"Good quality, but a bit expensive. Worth it th...",0


Justification: Missing sentiment labels cannot be inferred, and missing text data cannot be used for analysis. Dropping these rows ensures data quality. Here only the 20 rows with missing text are affected, since sentiment has no missing values.

In [7]:
import re

def clean_text(text):
    # Keep only ASCII letters and whitespace (drops digits, punctuation, and non-ASCII characters)
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Remove extra spaces
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# Apply the cleaning function to the 'text' column
df['text'] = df['text'].apply(clean_text)
df.head()

Unnamed: 0,text,sentiment
0,Horrible The worst experience ever Do not buy,0
1,Terrible service I wont buy from here again,0
2,I had high hopes but it broke after a week,0
3,Product is okay but packaging was awful,0
4,Good quality but a bit expensive Worth it though,0


Justification: Digits, punctuation, and other non-letter characters are treated here as noise for word-based sentiment analysis. Note that the regex also strips apostrophes, so "won't" becomes "wont" (row 1).
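
One way to keep contractions such as "won't" intact is to allow apostrophes that sit between two letters. The function below is a minimal sketch of that variant, not the notebook's method; the name `clean_text_keep_apostrophes` is hypothetical, and it handles only the straight ASCII apostrophe.

```python
import re

def clean_text_keep_apostrophes(text):
    # Keep ASCII letters, whitespace, and straight apostrophes
    text = re.sub(r"[^a-zA-Z'\s]", "", text)
    # Drop apostrophes that are not between two letters (e.g. quote marks)
    text = re.sub(r"(?<![a-zA-Z])'|'(?![a-zA-Z])", "", text)
    # Collapse runs of whitespace
    return re.sub(r"\s+", " ", text).strip()

print(clean_text_keep_apostrophes("Terrible service!! I won't buy from here again..."))
```

With the apostrophe preserved, a later stopword step can still recognize the contraction if the stopword list contains it.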

In [9]:
df['text'] = df['text'].str.lower()
df.head()

Unnamed: 0,text,sentiment
0,horrible the worst experience ever do not buy,0
1,terrible service i wont buy from here again,0
2,i had high hopes but it broke after a week,0
3,product is okay but packaging was awful,0
4,good quality but a bit expensive worth it though,0


Justification: Lowercasing ensures that words like "Happy" and "happy" are treated as the same token.



In [11]:
import nltk
from nltk.corpus import stopwords

# Download stopwords if not already downloaded
nltk.download('stopwords')

# Get English stopwords
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    return ' '.join([word for word in text.split() if word not in stop_words])

# Apply the stopwords removal function
df['text'] = df['text'].apply(remove_stopwords)
df.head()

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\RC543\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,text,sentiment
0,horrible worst experience ever buy,0
1,terrible service wont buy,0
2,high hopes broke week,0
3,product okay packaging awful,0
4,good quality bit expensive worth though,0


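Justification: Stopwords such as "the" and "a" carry little meaning on their own and can be removed to reduce dimensionality. Note that NLTK's English list includes negations such as "not", so "do not buy" becomes "buy" (row 0), while "wont" survives because it no longer matches any list entry.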
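
If negations should survive this step, the stopword set can be trimmed before filtering. The sketch below is an assumption-laden illustration: `NEGATIONS`, `remove_stopwords_keep_negations`, and `demo_stop_words` are hypothetical names, and the small demo set stands in for `set(stopwords.words('english'))` so the example runs without downloading NLTK data.

```python
# Hypothetical set of negation words to protect; extend as needed
NEGATIONS = {"no", "not", "nor", "never"}

def remove_stopwords_keep_negations(text, stop_words, keep=NEGATIONS):
    """Drop stopwords from whitespace-tokenized text, except words in `keep`."""
    effective = set(stop_words) - set(keep)
    return " ".join(w for w in text.split() if w not in effective)

# Small illustrative stopword set; in the notebook this would be the NLTK list
demo_stop_words = {"the", "do", "not", "i", "from", "here", "again"}

print(remove_stopwords_keep_negations(
    "horrible the worst experience ever do not buy", demo_stop_words))
```

Whether keeping negations helps depends on the downstream model, so this is worth comparing against the original step rather than assuming.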



In [12]:
import nltk
from nltk.stem import WordNetLemmatizer

# Download WordNet if not already downloaded
nltk.download('wordnet')

# Initialize lemmatizer
lemmatizer = WordNetLemmatizer()

def lemmatize_text(text):
    return ' '.join([lemmatizer.lemmatize(word) for word in text.split()])

# Apply the lemmatization function
df['text'] = df['text'].apply(lemmatize_text)

df.head()

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\RC543\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Unnamed: 0,text,sentiment
0,horrible worst experience ever buy,0
1,terrible service wont buy,0
2,high hope broke week,0
3,product okay packaging awful,0
4,good quality bit expensive worth though,0


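Justification: Lemmatization reduces words to their base forms, ensuring that different forms of the same word are treated as one. Because `lemmatize` is called without a part-of-speech tag, every word is treated as a noun: "hopes" becomes "hope" (row 2), while the verb "broke" is left unchanged.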

## Summary of Preprocessing Steps
Handled Missing Values: Dropped rows with missing values in the text or sentiment columns to ensure data quality.

Removed Non-Alphabetic Characters: Used regex to remove digits, punctuation, and other non-letter characters, and to collapse extra spaces.

Converted Text to Lowercase: Ensured uniformity in the text data.

Removed Stopwords: Eliminated common words that do not contribute to sentiment analysis.

Performed Lemmatization: Normalized words to their base forms for consistency.
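
The text-level steps above (everything after missing-value handling) can be sketched as a single function applied in the same order. This is a minimal illustration, not the notebook's code: `preprocess`, `demo_stop_words`, and `toy_lemmas` are hypothetical, and the stopword set and lemmatizer are passed in so the example runs without NLTK data. In the notebook they would be the NLTK stopword set and `lemmatizer.lemmatize`.

```python
import re

def preprocess(text, stop_words, lemmatize=lambda word: word):
    """Clean, lowercase, remove stopwords, then lemmatize each token."""
    # 1. Keep only ASCII letters and whitespace, then collapse extra spaces
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    # 2. Lowercase
    text = text.lower()
    # 3. Drop stopwords and 4. lemmatize the remaining tokens
    tokens = [lemmatize(w) for w in text.split() if w not in stop_words]
    return ' '.join(tokens)

# Illustrative stand-ins for the NLTK stopword list and lemmatizer
demo_stop_words = {"i", "had", "but", "it", "after", "a"}
toy_lemmas = {"hopes": "hope"}

result = preprocess("I had high hopes, but it broke after a week. :-/",
                    demo_stop_words,
                    lemmatize=lambda w: toy_lemmas.get(w, w))
print(result)
```

On a DataFrame this could be applied with `df['text'].apply(...)`, which keeps the step order in one place instead of spread across cells.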