## About Dataset

### Social Media Sentiments Analysis Dataset Report

---

#### Dataset Overview

This dataset contains a collection of social media posts labeled with sentiment classes. The primary goal of this dataset is to facilitate training and evaluation of Natural Language Processing (NLP) models for **sentiment analysis**. Each entry in the dataset represents a short piece of user-generated content, typically a tweet or a post, annotated with a sentiment label.

---

#### Dataset Source

- **URL**: [Kaggle Dataset](https://www.kaggle.com/datasets/kashishparmar02/social-media-sentiments-analysis-dataset)
- **Author**: [Kashish Parmar](https://www.kaggle.com/kashishparmar02)

---

#### Columns Description

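| Column Name | Description                                             |
|-------------|---------------------------------------------------------|
| `Text`      | The actual text from the social media post              |
| `Sentiment` | The sentiment label (e.g., Positive, Neutral, Negative) |

The CSV also contains metadata columns such as `Timestamp`, `User`, `Platform`, `Hashtags`, `Retweets`, `Likes`, and `Country`; this notebook uses only `Text` and `Sentiment`.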

---



### Import Required Libraries and Download NLTK Resources

In [32]:
##pip install nltk

import re, string
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('wordnet')
nltk.download('stopwords')


[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\SaeedM\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     C:\Users\SaeedM\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\SaeedM\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\SaeedM\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

### Load Dataset and Display Top 10 Rows


In [33]:
##pip install pandas

import pandas as pd
df = pd.read_csv("SocialMedia.csv")

df.head(10)

| Unnamed: 0.2 | Unnamed: 0.1 | Unnamed: 0 | Text | Sentiment | Timestamp | User | Platform | Hashtags | Retweets | Likes | Country | Year | Month | Day | Hour |
|--------------|--------------|------------|------|-----------|-----------|------|----------|----------|----------|-------|---------|------|-------|-----|------|
| 0 | 0 | 0 | Enjoying a beautiful day at the park! ... | Positive | 2023-01-15 12:30:00 | User123 | Twitter | #Nature #Park | 15.0 | 30.0 | USA | 2023 | 1 | 15 | 12 |
| 1 | 1 | 1 | Traffic was terrible this morning. ... | Negative | 2023-01-15 08:45:00 | CommuterX | Twitter | #Traffic #Morning | 5.0 | 10.0 | Canada | 2023 | 1 | 15 | 8 |
| 2 | 2 | 2 | Just finished an amazing workout! 💪 ... | Positive | 2023-01-15 15:45:00 | FitnessFan | Instagram | #Fitness #Workout | 20.0 | 40.0 | USA | 2023 | 1 | 15 | 15 |
| 3 | 3 | 3 | Excited about the upcoming weekend getaway! ... | Positive | 2023-01-15 18:20:00 | AdventureX | Facebook | #Travel #Adventure | 8.0 | 15.0 | UK | 2023 | 1 | 15 | 18 |
| 4 | 4 | 4 | Trying out a new recipe for dinner tonight. ... | Neutral | 2023-01-15 19:55:00 | ChefCook | Instagram | #Cooking #Food | 12.0 | 25.0 | Australia | 2023 | 1 | 15 | 19 |
| 5 | 5 | 5 | Feeling grateful for the little things in lif... | Positive | 2023-01-16 09:10:00 | GratitudeNow | Twitter | #Gratitude #PositiveVibes | 25.0 | 50.0 | India | 2023 | 1 | 16 | 9 |
| 6 | 6 | 6 | Rainy days call for cozy blankets and hot coc... | Positive | 2023-01-16 14:45:00 | RainyDays | Facebook | #RainyDays #Cozy | 10.0 | 20.0 | Canada | 2023 | 1 | 16 | 14 |
| 7 | 7 | 7 | The new movie release is a must-watch! ... | Positive | 2023-01-16 19:30:00 | MovieBuff | Instagram | #MovieNight #MustWatch | 15.0 | 30.0 | USA | 2023 | 1 | 16 | 19 |
| 8 | 8 | 8 | Political discussions heating up on the timel... | Negative | 2023-01-17 08:00:00 | DebateTalk | Twitter | #Politics #Debate | 30.0 | 60.0 | USA | 2023 | 1 | 17 | 8 |
| 9 | 9 | 9 | Missing summer vibes and beach days. ... | Neutral | 2023-01-17 12:20:00 | BeachLover | Facebook | #Summer #BeachDays | 18.0 | 35.0 | Australia | 2023 | 1 | 17 | 12 |

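The preview shows several `Unnamed: ...` columns. In pandas these usually appear when a frame was saved with its index and later read back without `index_col`. Below is a minimal sketch of one way to drop such columns by name. It uses a tiny in-memory stand-in for the CSV; the sample rows and the variable names `sample_csv`, `sample`, and `sample_no_index` are illustrative assumptions, not part of the notebook.

```python
import io
import pandas as pd

# Tiny stand-in for SocialMedia.csv; the rows are illustrative only.
sample_csv = io.StringIO(
    "Unnamed: 0.1,Unnamed: 0,Text,Sentiment\n"
    "0,0,Enjoying a beautiful day at the park!,Positive\n"
    "1,1,Traffic was terrible this morning.,Negative\n"
)
sample = pd.read_csv(sample_csv)

# Boolean mask marking the leftover index columns.
leftover = sample.columns.str.startswith("Unnamed")

# Keep every column that is not a leftover index column.
sample_no_index = sample.loc[:, ~leftover]

print(list(sample_no_index.columns))
```

Selecting `['Text', 'Sentiment']` explicitly, as the next cell does, has the same effect here; the mask is mainly useful when the remaining metadata columns should be kept.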

### Remove Unnecessary Columns

In [34]:

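# Keep only the text and its label, drop incomplete rows, and renumber the index.
# Chaining avoids calling inplace methods on a slice of the original frame.
df = df[['Text', 'Sentiment']].dropna(subset=['Text', 'Sentiment']).reset_index(drop=True)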
df.head()


|   | Text | Sentiment |
|---|------|-----------|
| 0 | Enjoying a beautiful day at the park! ... | Positive |
| 1 | Traffic was terrible this morning. ... | Negative |
| 2 | Just finished an amazing workout! 💪 ... | Positive |
| 3 | Excited about the upcoming weekend getaway! ... | Positive |
| 4 | Trying out a new recipe for dinner tonight. ... | Neutral |

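To see what the column selection and `dropna` step do, here is a small sketch on a made-up frame. The `toy` data and its `Likes` column are illustrative assumptions, not rows from the dataset. The calls are chained rather than run with `inplace=True`, so the result is a fresh frame instead of a modified slice.

```python
import pandas as pd

# Hypothetical toy frame with one missing Text and one missing Sentiment.
toy = pd.DataFrame({
    "Text": ["Great day!", None, "Traffic again.", "Meh"],
    "Sentiment": ["Positive", "Neutral", "Negative", None],
    "Likes": [30, 5, 10, 2],
})

toy_clean = (
    toy[["Text", "Sentiment"]]               # keep only the two modeling columns
    .dropna(subset=["Text", "Sentiment"])    # drop rows missing either value
    .reset_index(drop=True)                  # renumber rows 0..n-1
)

print(toy_clean)
```

Only the two complete rows survive, and the original `toy` frame is left untouched.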

In [36]:
import re
import string
import nltk
from nltk.corpus import stopwords, wordnet
from nltk.tokenize import word_tokenize
from nltk.stem import SnowballStemmer, WordNetLemmatizer

class Preprocessor:
    def __init__(self):
        self.wl = WordNetLemmatizer()
        self.stemmer = SnowballStemmer('english')
        self.stop_words = set(stopwords.words('english'))
        
        self.html_tags = re.compile(r'<.*?>')
        self.punctuations = re.compile(f"[{re.escape(string.punctuation)}]")
        self.extra_spaces = re.compile(r'\s+')
        self.digits = re.compile(r'\d')
        self.brackets_numbers = re.compile(r'\[[0-9]*\]')
        self.special_chars = re.compile(r"[*/&|_<>~+=\\^™%\"”“❝„]+")
        self.unwanted_chars = re.compile(r"[ं-ో̇•】【\{\}\(\)\[\]‼.,;:?!…]+")

    def word_remove(self, text):
        return re.sub(r'\n\s*|http\S+', '', text)

    def char_replacing(self, text):
        text = re.sub(r"[‘´’̇]+", "'", text)
        text = re.sub(r"[#̇]+", "#", text)
        return re.sub(r"[”“❝„\"]", "\"", text)

    def word_expanding(self, text):
        contractions = {
            r"(\b)([Ii])'m": r"\1\2 am",
            r"(\b)([Tt]hey|[Ww]e|[Ww]hat|[Ww]ho|[Yy]ou)'re": r"\1\2 are",
            r"(\b)([Ll]et)'s": r"\1\2 us",
            r"(\b)([Hh]e|[Ii]|[Ss]he|[Tt]hey|[Ww]e|[Ww]hat|[Ww]ho|[Yy]ou)'ll": r"\1\2 will",
            r"(\b)([Ii]|[Ss]hould|[Tt]hey|[Ww]e|[Ww]hat|[Ww]ho|[Ww]ould|[Yy]ou)'ve": r"\1\2 have",
            r"'d": " would",
            r"'s": " is",
            r"isn't": "is not",
            r" its ": " it is "
        }
        for pattern, repl in contractions.items():
            text = re.sub(pattern, repl, text)
        return text

    def word_negation(self, text):
        negations = {
            r"(\b)([Aa]re|[Cc]ould|[Dd]id|[Dd]oes|[Dd]o|[Hh]ad|[Hh]as|[Hh]ave|[Ii]s|[Mm]ight|[Mm]ust|[Ss]hould|[Ww]ere|[Ww]as|[Ww]ould)n't": r"\1\2 not",
            r"(\b)([Cc]a)n't": r"\1\2n not",
            r"(\b)([Ww])on't": r"\1\2ill not",
            r"(\b)([Ss])han't": r"\1\2hall not"
        }
        for pattern, repl in negations.items():
            text = re.sub(pattern, repl, text)
        return text

    def char_removing(self, text):
        text = self.unwanted_chars.sub("", text)
        text = self.special_chars.sub(" ", text)
        text = self.digits.sub(" ", text)
        text = self.brackets_numbers.sub(" ", text)
        return text

    def word_stopwords(self, text):
        return ' '.join(word for word in text.split() if word not in self.stop_words)

    def get_wordnet_pos(self, tag):
        return {
            'J': wordnet.ADJ, 'V': wordnet.VERB,
            'N': wordnet.NOUN, 'R': wordnet.ADV
        }.get(tag[0], wordnet.NOUN)

    def lemmatization(self, text):
        word_pos_tags = nltk.pos_tag(word_tokenize(text))
        return " ".join(self.wl.lemmatize(word, self.get_wordnet_pos(pos)) for word, pos in word_pos_tags)

    def stemming(self, text):
        return " ".join(self.stemmer.stem(word) for word in word_tokenize(text))
    
    def emoji_categorization(self, text):
        text = re.sub(r"[☺☻😊😌🙂]+", "🙂", text)
        text = re.sub(r"[😀😁😆😄😃😸😺]+", "😀", text)
        text = re.sub(r"[☹😞😔🙁]+", "🙁", text)
        text = re.sub(r"[♥❤♡💟💝💜💛💚💙🖤💘💗💖💕💓💞💌]+", "💜", text)
        text = re.sub(r"[😗😙😚😍😽😻😘]+", "😘", text)
        text = re.sub(r"[😮😯😲🙀]+", "😮", text)
        text = re.sub(r"[😨😧😦]+", "😦", text)
        text = re.sub(r"[😏]+", "😏", text)
        text = re.sub(r"[😜😝😛]+", "😛", text)
        text = re.sub(r"[🤣😹😂]+", "😂", text)
        text = re.sub(r"[😿😢😭😥😪😢]+", "😢", text)
        text = re.sub(r"[😠😾😤👿😡]+", "😡", text)
        text = re.sub(r"[👬👭👫]+", "👫", text)
        text = re.sub(r"[✔]+", "✅", text)
        text = re.sub(r"[🌞]+", "☀", text)
        text = re.sub(r"[🎊🎉🎈🎂🎆🎇]+", "🎉", text)
        text = re.sub(r"[⚽⚾🏀🏐🏈🏉🎾🎳🏏🏑🏒🏓🏸🥊⛳🏊🏌🏃🏄🎿]+", " :sport: ", text)
        text = re.sub(r"[🌑🌓🌕🌙🌜🌛🌝]+", " :moon: ", text)
        text = re.sub(r"[🌍🌎🌏]+", " :earth: ", text)
        text = re.sub(r"[🐂🐄🐅🐇🐈🐉🐊🐋🐍🐎🐐🐑🐒🐓🐔🐕🐖🐗🐘🐚🐛🐝🐞🐟🐠🐢🐣🐥🐦🐨🐬🐭🐮🐯🐰🐱🐲🐳🐴🐵🐶🐷🐸🐹🐺🐻🐼]+", " :animal: ", text)
        text = re.sub(r"[🍄🍅🍆🍇🍉🍊🍌🍍🍎🍏🍑🍒🍓]+", " :fruit: ", text)
        text = re.sub(r"[🍔🍕🍖🍗🍛🍜🍝🍞🍟🍣🍥🍦🍧🍨🍩🍪🍫🍬🍭🍯🍰]+", " :food: ", text)
        text = re.sub(r"[🇦-🇿]{2}", " :flag: ", text)
        text = re.sub(r"[♩♪♫♬🎵🎶🎷🎸🎹🎺🎼🎤🎧🎻]+", " :music: ", text)
        text = re.sub(r"[🌷🌸🌹🌺🌻🌼]+", " :flower: ", text)
        text = re.sub(r"[🌱🌲🌳🌴🌵🌾🌿🍀🍁🍂🍃]+", " :plant: ", text)
        text = re.sub(r"[🍷🍸🍹🍺🍻🍼🍾]+", " :drink: ", text)
        text = re.sub(r"[👕👗👙👚👛👜👠]+", " :dress: ", text)
        text = re.sub(r"[💰💳💵💷💸]+", " :money: ", text)

        return text

    def emoticon_to_emoji(self,text):
        # Match the devil face first; the generic smiley rule would otherwise consume its ":-)".
        text = re.sub(r"3:-+\)", "😈", text)
        text = re.sub(r":-*\)+", "🙂", text)
        text = re.sub(r"\(+-*:", "🙂", text)
        text = re.sub(r":-*(d|D)+", "😀", text)
        text = re.sub(r"x-*(d|D)+", "😀", text)
        text = re.sub(r":-*(p|P)+", "😛", text)
        text = re.sub(r":-*\(+", "🙁", text)
        text = re.sub(r";-*\)+", "😉", text)
        text = re.sub(r":-*<+", "😠", text)
        text = re.sub(r":-*/+", "😕", text)
        text = re.sub(r":-*\*+", "😘", text)
        text = re.sub(r":-*(o|O)+", "😮", text)
        text = re.sub(r":'+-*\)+", "😂", text)
        text = re.sub(r":'+-*\(+", "😢", text)
        text = re.sub(r">_<", "😣", text)
        text = re.sub(r"\(-_-\)zzz", "😴", text)
        text = re.sub(r"-_+-", "😑", text)
        text = re.sub(r"\^_+\^", "😊", text)
        text = re.sub(r"\*_+\*", "😍", text)
        text = re.sub(r">_+>", "😒", text)
        text = re.sub(r"<_+<", "😒", text)
        text = re.sub(r"\(⌣́_⌣̀\)", "😌", text)
        text = re.sub(r";_+;", "😢", text)
        text = re.sub(r"<+3+", "💜", text)
        text = re.sub(r">\.<", "🤔", text)
        text = re.sub(r"\._+\.", "😔", text)
        text = re.sub(r"¯\\_\(ツ\)_/¯", "🤷", text)
        text = re.sub(r"¯_\(ツ\)_/¯", "💁", text)
        text = re.sub(r"(o|O)+_+(o|O)+", "😐", text)
        text = re.sub(r"(o|O)+\.+(o|O)+", "😮", text)
        return text
    
    def preprocess_text(self, text):
        text = text.lower().strip()
        text = self.html_tags.sub('', text)
        text = self.extra_spaces.sub(' ', text)
        # The next steps rely on punctuation (URLs, apostrophes, emoticons),
        # so they must run before punctuation is stripped.
        text = self.word_remove(text)
        text = self.char_replacing(text)
        text = self.word_expanding(text)
        text = self.word_negation(text)
        text = self.emoji_categorization(text)
        text = self.emoticon_to_emoji(text)
        text = self.punctuations.sub(' ', text)
        text = self.char_removing(text)
        text = self.extra_spaces.sub(' ', text)
        text = self.word_stopwords(text)
        return text

    def preprocessing(self, text):
        return self.preprocess_text(text)

# def prepare_dataset(doc):
#   txt=Preprocessor().preprocessing(doc)
#   print(txt)
#   return txt

# df['text_clean'] = df['Text'].iloc[:100].apply(prepare_dataset)

# df.head(10)

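Contraction and emoticon rules depend on apostrophes and other punctuation, so their position relative to punctuation stripping matters. The sketch below illustrates this with two simplified rules. The helpers `expand` and `strip_punct` are hypothetical stand-ins written for this example, not the class methods above.

```python
import re
import string

PUNCT = re.compile(f"[{re.escape(string.punctuation)}]")

def expand(text):
    # Two simplified contraction rules, for illustration only.
    text = re.sub(r"\b(i)'m\b", r"\1 am", text)
    text = re.sub(r"\b(do|does|did|is|was)n't\b", r"\1 not", text)
    return text

def strip_punct(text):
    # Replace punctuation with spaces, then collapse repeated whitespace.
    return re.sub(r"\s+", " ", PUNCT.sub(" ", text)).strip()

sample = "i'm sure it doesn't work!"

expand_first = strip_punct(expand(sample))  # contractions see the apostrophes
strip_first = expand(strip_punct(sample))   # apostrophes are gone before expansion

print(expand_first)  # i am sure it does not work
print(strip_first)   # i m sure it doesn t work
```

When punctuation is stripped first, the contraction patterns never match and fragments such as `m` and `t` are left behind for the stop-word filter to clean up.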
In [45]:
preprocessor = Preprocessor()

df['Text_Clean'] = df['Text'].apply(preprocessor.preprocess_text)
df['Text_Stemmed'] = df['Text_Clean'].apply(preprocessor.stemming)
df['Text_Lemmatized'] = df['Text_Clean'].apply(preprocessor.lemmatization)

df.head(10)

|   | Text | Sentiment | Text_Clean | Text_Stemmed | Text_Lemmatized |
|---|------|-----------|------------|--------------|-----------------|
| 0 | Enjoying a beautiful day at the park! ... | Positive | enjoying beautiful day park | enjoy beauti day park | enjoy beautiful day park |
| 1 | Traffic was terrible this morning. ... | Negative | traffic terrible morning | traffic terribl morn | traffic terrible morning |
| 2 | Just finished an amazing workout! 💪 ... | Positive | finished amazing workout 💪 | finish amaz workout 💪 | finish amaze workout 💪 |
| 3 | Excited about the upcoming weekend getaway! ... | Positive | excited upcoming weekend getaway | excit upcom weekend getaway | excited upcoming weekend getaway |
| 4 | Trying out a new recipe for dinner tonight. ... | Neutral | trying new recipe dinner tonight | tri new recip dinner tonight | try new recipe dinner tonight |
| 5 | Feeling grateful for the little things in lif... | Positive | feeling grateful little things life | feel grate littl thing life | feel grateful little thing life |
| 6 | Rainy days call for cozy blankets and hot coc... | Positive | rainy days call cozy blankets hot cocoa | raini day call cozi blanket hot cocoa | rainy day call cozy blanket hot cocoa |
| 7 | The new movie release is a must-watch! ... | Positive | new movie release must watch | new movi releas must watch | new movie release must watch |
| 8 | Political discussions heating up on the timel... | Negative | political discussions heating timeline | polit discuss heat timelin | political discussion heat timeline |
| 9 | Missing summer vibes and beach days. ... | Neutral | missing summer vibes beach days | miss summer vibe beach day | miss summer vibe beach day |

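The `Text_Clean` column has stop words removed. For sentiment work it may be worth checking how that step treats negations, since general-purpose English stop-word lists (NLTK's included) contain words such as "not" and "no". The sketch below uses a small hand-written stop-word set as a stand-in; `STOP_WORDS`, `NEGATIONS`, and `remove_stopwords` are hypothetical names for this example only.

```python
# Hand-written stand-in for an English stop-word list (an assumption for this
# sketch); it includes "not" and "no", as general-purpose lists often do.
STOP_WORDS = {"the", "is", "was", "a", "this", "not", "no"}
NEGATIONS = {"not", "no"}

def remove_stopwords(text, stop_words):
    # Keep only the whitespace-separated tokens that are not stop words.
    return " ".join(word for word in text.split() if word not in stop_words)

review = "the movie was not good"

plain = remove_stopwords(review, STOP_WORDS)
keep_negation = remove_stopwords(review, STOP_WORDS - NEGATIONS)

print(plain)          # movie good
print(keep_negation)  # movie not good
```

Whether to keep negation words is a modeling choice; excluding them from the stop-word set is one option to evaluate, not a requirement of the pipeline.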

In [46]:
df.to_csv("SocialMedia_processed.csv", index=False)
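
One detail to keep in mind when reading the processed file back: if cleaning leaves a row with an empty string (for example, when every token was a stop word), pandas reads that empty CSV field as `NaN` by default. The sketch below shows this on a hypothetical two-row frame, using an in-memory buffer in place of `SocialMedia_processed.csv`.

```python
import io
import pandas as pd

# Hypothetical frame: the second row lost all of its tokens during cleaning.
processed = pd.DataFrame({
    "Text": ["Great day!", "It is what it is"],
    "Text_Clean": ["great day", ""],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
processed.to_csv(buffer, index=False)
buffer.seek(0)

reloaded = pd.read_csv(buffer)

# The empty string comes back as a missing value.
missing = int(reloaded["Text_Clean"].isna().sum())
print(missing)

# One way to restore empty strings after loading.
reloaded["Text_Clean"] = reloaded["Text_Clean"].fillna("")
```

Filling the missing values after loading keeps later string operations, such as vectorization, from failing on non-string entries.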