## Data Wrangling
You're now in the data wrangling stage of your third capstone. In addition to
the data wrangling steps applied in your previous capstone projects, you now need to
address characteristics specific to the more advanced nature of this project.
The exact steps depend heavily on the type of data you're working with.
For this NLP project, the relevant methods include stemming, lemmatization,
tokenization, stop word removal, and frequency analysis.
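
As a rough illustration of frequency analysis, one of the methods listed above, the sketch below counts tokens across a couple of already-cleaned posts using only the standard library. The helper name `top_terms` and the sample posts are assumptions made for illustration; they are not part of the notebook.

```python
from collections import Counter

def top_terms(texts, n=3):
    """Count whitespace-separated tokens across all texts and return the n most common."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts.most_common(n)

# Hypothetical sample posts, shaped like cleaned social media text
sample_posts = [
    "traffic terrible morning traffic morning",
    "enjoying beautiful day park nature park",
]
print(top_terms(sample_posts, n=3))
```

The same idea could be applied to a cleaned text column by passing the column's values as `texts`.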

Data was pulled from the Social Media Sentiments Analysis dataset on Kaggle (https://www.kaggle.com/datasets/kashishparmar02/social-media-sentiments-analysis-dataset).

In [83]:
# import necessary packages and libraries
import pandas as pd

import string
import re
import nltk #python natural language processing toolkit
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Optional imports, only needed for the disabled BERT experiment further below
# import torch
# from transformers import BertTokenizerFast, BertForSequenceClassification
# from torch.utils.data import DataLoader, Dataset
# from tqdm import tqdm

from textblob import TextBlob

# Run once to download the NLTK resources used below
# nltk.download('vader_lexicon')
# nltk.download('punkt')
# nltk.download('stopwords')

In [84]:
# load the sentiment dataset and drop unused columns
df = pd.read_csv('sentimentdataset.csv')
df.drop(columns=["Unnamed: 0.1", "Unnamed: 0"], inplace=True)

In [85]:
# get the number of rows and columns in the dataset
df.shape

(732, 13)

In [86]:
# print the first 5 rows of the dataframe to better understand its structure and features
df.head()

,Text,Sentiment,Timestamp,User,Platform,Hashtags,Retweets,Likes,Country,Year,Month,Day,Hour
0,Enjoying a beautiful day at the park! ...,Positive,2023-01-15 12:30:00,User123,Twitter,#Nature #Park,15.0,30.0,USA,2023,1,15,12
1,Traffic was terrible this morning. ...,Negative,2023-01-15 08:45:00,CommuterX,Twitter,#Traffic #Morning,5.0,10.0,Canada,2023,1,15,8
2,Just finished an amazing workout! 💪 ...,Positive,2023-01-15 15:45:00,FitnessFan,Instagram,#Fitness #Workout,20.0,40.0,USA,2023,1,15,15
3,Excited about the upcoming weekend getaway! ...,Positive,2023-01-15 18:20:00,AdventureX,Facebook,#Travel #Adventure,8.0,15.0,UK,2023,1,15,18
4,Trying out a new recipe for dinner tonight. ...,Neutral,2023-01-15 19:55:00,ChefCook,Instagram,#Cooking #Food,12.0,25.0,Australia,2023,1,15,19


In [87]:
# Check our dataset for missing values and ensure the columns are the appropriate datatypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 732 entries, 0 to 731
Data columns (total 13 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Text       732 non-null    object 
 1   Sentiment  732 non-null    object 
 2   Timestamp  732 non-null    object 
 3   User       732 non-null    object 
 4   Platform   732 non-null    object 
 5   Hashtags   732 non-null    object 
 6   Retweets   732 non-null    float64
 7   Likes      732 non-null    float64
 8   Country    732 non-null    object 
 9   Year       732 non-null    int64  
 10  Month      732 non-null    int64  
 11  Day        732 non-null    int64  
 12  Hour       732 non-null    int64  
dtypes: float64(2), int64(4), object(7)
memory usage: 74.5+ KB
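
The `info()` output shows `Timestamp` stored as `object` (plain strings). A minimal sketch, on a toy frame with the same column names, of how the column could be parsed with `pd.to_datetime` and checked against the separate `Year`, `Month`, `Day`, and `Hour` columns. The toy rows are copied from the preview above; the variable names are assumptions.

```python
import pandas as pd

# Toy frame mirroring the notebook's column names (values taken from the preview)
toy = pd.DataFrame({
    "Timestamp": ["2023-01-15 12:30:00", "2023-01-15 08:45:00"],
    "Year": [2023, 2023],
    "Month": [1, 1],
    "Day": [15, 15],
    "Hour": [12, 8],
})

# Parse the string timestamps into datetime64 values
parsed = pd.to_datetime(toy["Timestamp"])

# Check that the pre-split date parts agree with the parsed timestamp
parts_match = (
    (parsed.dt.year == toy["Year"])
    & (parsed.dt.month == toy["Month"])
    & (parsed.dt.day == toy["Day"])
    & (parsed.dt.hour == toy["Hour"])
)
print(parsed.dtype, parts_match.all())
```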


In [88]:
# summary statistics for the numeric columns; outliers in these shouldn't meaningfully impact any analysis currently planned
df.describe()

,Retweets,Likes,Year,Month,Day,Hour
count,732.0,732.0,732.0,732.0,732.0,732.0
mean,21.508197,42.901639,2020.471311,6.122951,15.497268,15.521858
std,7.061286,14.089848,2.802285,3.411763,8.474553,4.113414
min,5.0,10.0,2010.0,1.0,1.0,0.0
25%,17.75,34.75,2019.0,3.0,9.0,13.0
50%,22.0,43.0,2021.0,6.0,15.0,16.0
75%,25.0,50.0,2023.0,9.0,22.0,19.0
max,40.0,80.0,2023.0,12.0,31.0,23.0


Explanation of Columns:

    Text: Text of the social media post
    Sentiment: Sentiment label for the text (e.g. Positive, Neutral, Negative, Happy; 191 distinct labels after whitespace stripping)
    Timestamp: Timestamp of when the post was created
    User: User ID of the post's creator
    Platform: Social media site the post was created on (Twitter, Facebook, Instagram)
    Hashtags: Hashtags used in the post
    Retweets: Number of retweets or shares of the post
    Likes: Number of likes on the post
    Country: Country the post was created in
    Year: Year post was created
    Month: Month post was created
    Day: Day the post was created
    Hour: Hour the post was created


In [89]:
# example of unnecessary whitespace being present: Twitter is split into two categories
df['Platform'].unique()

array([' Twitter  ', ' Instagram ', ' Facebook ', ' Twitter '],
      dtype=object)

In [90]:
# remove unnecessary white space to prevent splitting categorical variables
df["Text"] = df["Text"].str.strip()
df["Sentiment"] = df["Sentiment"].str.strip()
df["Hashtags"] = df["Hashtags"].str.strip()
df["User"] = df["User"].str.strip()
df["Platform"] = df["Platform"].str.strip()
df["Country"] = df["Country"].str.strip()
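
The repeated `str.strip()` calls could also be written as a loop over a list of column names. A minimal sketch, assuming a hypothetical helper `strip_columns` and a toy frame that reproduces the `Platform` values shown above:

```python
import pandas as pd

def strip_columns(frame, columns):
    """Return a copy of frame with leading/trailing whitespace removed from the given string columns."""
    cleaned = frame.copy()
    for col in columns:
        cleaned[col] = cleaned[col].str.strip()
    return cleaned

# Toy data: the same platform appears with different padding
toy = pd.DataFrame({"Platform": [" Twitter  ", " Twitter ", " Facebook "]})
cleaned = strip_columns(toy, ["Platform"])

print(toy["Platform"].nunique(), cleaned["Platform"].nunique())
```

Returning a copy leaves the input frame untouched, which makes it easy to compare distinct counts before and after.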

In [91]:
# check dataset for duplicate values
df.duplicated().sum()

22

In [92]:
# remove duplicated rows to prevent bias in modeling
# drop_duplicates returns a new DataFrame, so assign the result back to df
df = df.drop_duplicates()
df

,Text,Sentiment,Timestamp,User,Platform,Hashtags,Retweets,Likes,Country,Year,Month,Day,Hour
0,Enjoying a beautiful day at the park!,Positive,2023-01-15 12:30:00,User123,Twitter,#Nature #Park,15.0,30.0,USA,2023,1,15,12
1,Traffic was terrible this morning.,Negative,2023-01-15 08:45:00,CommuterX,Twitter,#Traffic #Morning,5.0,10.0,Canada,2023,1,15,8
2,Just finished an amazing workout! 💪,Positive,2023-01-15 15:45:00,FitnessFan,Instagram,#Fitness #Workout,20.0,40.0,USA,2023,1,15,15
3,Excited about the upcoming weekend getaway!,Positive,2023-01-15 18:20:00,AdventureX,Facebook,#Travel #Adventure,8.0,15.0,UK,2023,1,15,18
4,Trying out a new recipe for dinner tonight.,Neutral,2023-01-15 19:55:00,ChefCook,Instagram,#Cooking #Food,12.0,25.0,Australia,2023,1,15,19
...,...,...,...,...,...,...,...,...,...,...,...,...,...
727,Collaborating on a science project that receiv...,Happy,2017-08-18 18:20:00,ScienceProjectSuccessHighSchool,Facebook,#ScienceFairWinner #HighSchoolScience,20.0,39.0,UK,2017,8,18,18
728,Attending a surprise birthday party organized ...,Happy,2018-06-22 14:15:00,BirthdayPartyJoyHighSchool,Instagram,#SurpriseCelebration #HighSchoolFriendship,25.0,48.0,USA,2018,6,22,14
729,Successfully fundraising for a school charity ...,Happy,2019-04-05 17:30:00,CharityFundraisingTriumphHighSchool,Twitter,#CommunityGiving #HighSchoolPhilanthropy,22.0,42.0,Canada,2019,4,5,17
730,"Participating in a multicultural festival, cel...",Happy,2020-02-29 20:45:00,MulticulturalFestivalJoyHighSchool,Facebook,#CulturalCelebration #HighSchoolUnity,21.0,43.0,UK,2020,2,29,20
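
Note that `DataFrame.drop_duplicates()` returns a new DataFrame and leaves the original unchanged unless the result is assigned (or `inplace=True` is passed). A minimal sketch on a toy frame, with made-up values, showing the difference:

```python
import pandas as pd

# Toy frame with one exact duplicate row
toy = pd.DataFrame({"Text": ["a", "a", "b"], "Likes": [1, 1, 2]})

toy.drop_duplicates()      # result is discarded; toy still has 3 rows
rows_before = len(toy)

# Assign the result to keep the de-duplicated rows, and rebuild a gap-free index
deduped = toy.drop_duplicates().reset_index(drop=True)

print(rows_before, len(deduped))
```

Resetting the index is optional; it is included here because later positional access is simpler when the index has no gaps.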


In [93]:
# count the number of distinct values in each column of the dataframe
for column in df.columns:
    number_distinct_values = len(df[column].unique())
    print(f"{column} has {number_distinct_values} distinct values")

Text has 706 distinct values
Sentiment has 191 distinct values
Timestamp has 683 distinct values
User has 670 distinct values
Platform has 3 distinct values
Hashtags has 692 distinct values
Retweets has 26 distinct values
Likes has 38 distinct values
Country has 33 distinct values
Year has 14 distinct values
Month has 12 distinct values
Day has 31 distinct values
Hour has 22 distinct values


In [94]:
# combine text of post and hashtag for more information for sentiment analysis
df["Text_Combined"] = df["Text"] + ' ' + df["Hashtags"]
df = df.drop(columns="Hashtags")
df.head()

,Text,Sentiment,Timestamp,User,Platform,Retweets,Likes,Country,Year,Month,Day,Hour,Text_Combined
0,Enjoying a beautiful day at the park!,Positive,2023-01-15 12:30:00,User123,Twitter,15.0,30.0,USA,2023,1,15,12,Enjoying a beautiful day at the park! #Nature ...
1,Traffic was terrible this morning.,Negative,2023-01-15 08:45:00,CommuterX,Twitter,5.0,10.0,Canada,2023,1,15,8,Traffic was terrible this morning. #Traffic #M...
2,Just finished an amazing workout! 💪,Positive,2023-01-15 15:45:00,FitnessFan,Instagram,20.0,40.0,USA,2023,1,15,15,Just finished an amazing workout! 💪 #Fitness #...
3,Excited about the upcoming weekend getaway!,Positive,2023-01-15 18:20:00,AdventureX,Facebook,8.0,15.0,UK,2023,1,15,18,Excited about the upcoming weekend getaway! #T...
4,Trying out a new recipe for dinner tonight.,Neutral,2023-01-15 19:55:00,ChefCook,Instagram,12.0,25.0,Australia,2023,1,15,19,Trying out a new recipe for dinner tonight. #C...


In [95]:
# lowercase the combined text
df['Text_Combined'] = df['Text_Combined'].str.lower()
# use regular expressions to filter out punctuation, emoji (any non-ASCII character), and numbers
df['Text_Combined'] = df['Text_Combined'].str.replace(r'[%s]' % re.escape(string.punctuation), '', regex=True)
df['Text_Combined'] = df['Text_Combined'].str.replace(r'[^\x00-\x7F]+', '', regex=True)
# \d+ removes digits anywhere in the text; the anchored pattern ^[0-9]+$ would only blank entries made up entirely of digits
df['Text_Combined'] = df['Text_Combined'].str.replace(r'\d+', '', regex=True)

df.head()

,Text,Sentiment,Timestamp,User,Platform,Retweets,Likes,Country,Year,Month,Day,Hour,Text_Combined
0,Enjoying a beautiful day at the park!,Positive,2023-01-15 12:30:00,User123,Twitter,15.0,30.0,USA,2023,1,15,12,enjoying a beautiful day at the park nature park
1,Traffic was terrible this morning.,Negative,2023-01-15 08:45:00,CommuterX,Twitter,5.0,10.0,Canada,2023,1,15,8,traffic was terrible this morning traffic morning
2,Just finished an amazing workout! 💪,Positive,2023-01-15 15:45:00,FitnessFan,Instagram,20.0,40.0,USA,2023,1,15,15,just finished an amazing workout fitness workout
3,Excited about the upcoming weekend getaway!,Positive,2023-01-15 18:20:00,AdventureX,Facebook,8.0,15.0,UK,2023,1,15,18,excited about the upcoming weekend getaway tra...
4,Trying out a new recipe for dinner tonight.,Neutral,2023-01-15 19:55:00,ChefCook,Instagram,12.0,25.0,Australia,2023,1,15,19,trying out a new recipe for dinner tonight coo...
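
A minimal sketch, on a plain string rather than a DataFrame column, of how the lowercase, punctuation, and non-ASCII substitutions behave. It also contrasts an anchored digit pattern (`^[0-9]+$`, which only matches a string made up entirely of digits) with an unanchored one (`\d+`, which matches digits anywhere). The helper `clean` and the sample post are assumptions made for illustration.

```python
import re
import string

# Character class matching any ASCII punctuation character
PUNCT_RE = re.compile(f"[{re.escape(string.punctuation)}]")
# One or more characters outside the ASCII range (covers emoji)
NON_ASCII_RE = re.compile(r"[^\x00-\x7F]+")

def clean(text, digit_pattern):
    """Lowercase, then strip punctuation, non-ASCII characters, and digits matching digit_pattern."""
    text = text.lower()
    text = PUNCT_RE.sub("", text)
    text = NON_ASCII_RE.sub("", text)
    return re.sub(digit_pattern, "", text)

sample = "Ran 5 miles today! 💪 #Fitness"   # hypothetical post
anchored = clean(sample, r"^[0-9]+$")       # digit inside the sentence survives
unanchored = clean(sample, r"\d+")          # digit is removed

print(repr(anchored))
print(repr(unanchored))
```

Removing characters can leave double spaces behind; tokenizing and rejoining the text afterwards collapses them.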


In [96]:
# tokenize, stem, filter on stop words, and rejoin 'Text_Combined'

# Initialize the Porter Stemmer
stemmer = PorterStemmer()

# English stop words
stop_words = set(stopwords.words('english'))

def process_text_nltk(text):
    # Tokenize the text
    tokens = word_tokenize(text)
    
    # Apply stemming and remove stop words
    stemmed_tokens = [stemmer.stem(word) for word in tokens if word.lower() not in stop_words]
    
    # join stems into text entries again
    processed_text = ' '.join(stemmed_tokens)
    return processed_text

def process_text_nostem(text):
    # Tokenize the text
    tokens = word_tokenize(text)
    
    # remove stop words
    p_tokens = [word for word in tokens if word.lower() not in stop_words]
    
    # join the remaining tokens into a single string again
    processed_text = ' '.join(p_tokens)
    return processed_text


df["Processed_Text_NLTK"] = df["Text_Combined"].apply(process_text_nltk)
df["Text_Combined"] = df["Text_Combined"].apply(process_text_nostem)

df.head()

,Text,Sentiment,Timestamp,User,Platform,Retweets,Likes,Country,Year,Month,Day,Hour,Text_Combined,Processed_Text_NLTK
0,Enjoying a beautiful day at the park!,Positive,2023-01-15 12:30:00,User123,Twitter,15.0,30.0,USA,2023,1,15,12,enjoying beautiful day park nature park,enjoy beauti day park natur park
1,Traffic was terrible this morning.,Negative,2023-01-15 08:45:00,CommuterX,Twitter,5.0,10.0,Canada,2023,1,15,8,traffic terrible morning traffic morning,traffic terribl morn traffic morn
2,Just finished an amazing workout! 💪,Positive,2023-01-15 15:45:00,FitnessFan,Instagram,20.0,40.0,USA,2023,1,15,15,finished amazing workout fitness workout,finish amaz workout fit workout
3,Excited about the upcoming weekend getaway!,Positive,2023-01-15 18:20:00,AdventureX,Facebook,8.0,15.0,UK,2023,1,15,18,excited upcoming weekend getaway travel adventure,excit upcom weekend getaway travel adventur
4,Trying out a new recipe for dinner tonight.,Neutral,2023-01-15 19:55:00,ChefCook,Instagram,12.0,25.0,Australia,2023,1,15,19,trying new recipe dinner tonight cooking food,tri new recip dinner tonight cook food
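
The two processing functions differ only in whether a stemmer is applied, so they could be folded into one function with an optional stemming step. The sketch below uses only the standard library: `str.split` stands in for `word_tokenize`, `toy_stop_words` for the NLTK stop word list, and `toy_stem` for `PorterStemmer.stem`. All three are simplified assumptions, not the notebook's actual components.

```python
def process_text(text, stop_words, stem=None):
    """Tokenize on whitespace, drop stop words, optionally stem, and rejoin."""
    tokens = text.split()
    kept = [tok for tok in tokens if tok.lower() not in stop_words]
    if stem is not None:
        kept = [stem(tok) for tok in kept]
    return " ".join(kept)

# Hypothetical stand-ins for the NLTK stop word list and Porter stemmer
toy_stop_words = {"a", "at", "the"}

def toy_stem(word):
    """Crude suffix stripper used only for illustration."""
    return word[:-3] if word.endswith("ing") else word

text = "enjoying a beautiful day at the park"
print(process_text(text, toy_stop_words))
print(process_text(text, toy_stop_words, stem=toy_stem))
```

In the notebook's setting, `stemmer.stem` and the NLTK stop word set would be passed in place of the toy versions.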


In [110]:
# Use the NLTK VADER analyzer to assign a compound sentiment score to Text_Combined
# (the unstemmed text, since VADER's lexicon is keyed on unstemmed words),
# then derive sentiment labels from the scores using the +/-0.05 thresholds
analyzer_vader = SentimentIntensityAnalyzer()
df['Sentiment_Score_VADER'] = df['Text_Combined'].apply(lambda text: analyzer_vader.polarity_scores(text)['compound'])
df['Sentiment_VADER'] = df['Sentiment_Score_VADER'].apply(lambda score: 'Positive' if score >= 0.05 else ('Negative' if score <= -0.05 else 'Neutral'))

In [111]:
# use the TextBlob sentiment analyzer to assign sentiment scores to Text_Combined
# then derive sentiment labels from the polarity scores

# Create two new columns 'TextBlob_Subjectivity' & 'TextBlob_Polarity'
df['TextBlob_Subjectivity'] = df['Text_Combined'].apply(lambda text: TextBlob(text).sentiment.subjectivity)
df['TextBlob_Polarity'] = df['Text_Combined'].apply(lambda text: TextBlob(text).sentiment.polarity)

df['Sentiment_TextBlob'] = df['TextBlob_Polarity'].apply(lambda score: 'Positive' if score >= 0.05 else ('Negative' if score <= -0.05 else 'Neutral'))
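
The same threshold rule is written out twice as a nested lambda, once for VADER and once for TextBlob. A minimal sketch of that rule as a named function; the name `label_from_score` and the `threshold` parameter are assumptions, and the sample scores are taken from the output table further down.

```python
def label_from_score(score, threshold=0.05):
    """Map a polarity/compound score in [-1, 1] to a three-class sentiment label."""
    if score >= threshold:
        return "Positive"
    if score <= -threshold:
        return "Negative"
    return "Neutral"

# Scores taken from the first rows of the final output table
for score in (0.8074, -0.4767, 0.0, 0.136364):
    print(score, label_from_score(score))
```

With a named function, both label columns could be produced via `.apply(label_from_score)`, keeping the threshold in one place.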


In [112]:
'''
# Preparing the text
class SentimentDataset(Dataset):
    def __init__(self, texts, tokenizer, max_len=512):
        self.texts = texts
        self.tokenizer = tokenizer
        self.max_len = max_len
    
    def __len__(self):
        return len(self.texts)
    
    def __getitem__(self, item):
        text = str(self.texts[item])
        inputs = self.tokenizer.encode_plus(
            text,
            None,
            add_special_tokens=True,
            max_length=self.max_len,
            padding='max_length',
            return_token_type_ids=False,
            return_attention_mask=True,
            truncation=True,
            return_tensors='pt'
        )
        return {
            'input_ids': inputs['input_ids'].flatten(),
            'attention_mask': inputs['attention_mask'].flatten()
        }

# Initialize the tokenizer and dataset
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
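# Pass plain lists so positional indexing in __getitem__ works even if the DataFrame index has gaps
dataset_raw = SentimentDataset(df['Text'].tolist(), tokenizer)
dataset_combined = SentimentDataset(df['Text_Combined'].tolist(), tokenizer)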

# DataLoader
data_loader_raw = DataLoader(dataset_raw, batch_size=16)
data_loader_combined = DataLoader(dataset_combined, batch_size=16)

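# Load BERT model
# Note: the 3-label classification head on 'bert-base-uncased' is randomly initialized,
# so predictions are not meaningful until the model is fine-tuned on labeled data
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=3)  # 3 labels for sentiment
model.eval()  # Set model to evaluation mode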

# Prediction function
def predict_sentiments(data_loader, model):
    model = model.to(device)
    predictions = []
    with torch.no_grad():
        for data in tqdm(data_loader):
            input_ids = data['input_ids'].to(device)
            attention_mask = data['attention_mask'].to(device)
            outputs = model(input_ids, attention_mask=attention_mask)
            preds = torch.argmax(outputs.logits, dim=1)
            predictions.extend(preds.cpu().numpy())
    return predictions

# Predict sentiments
device = 'cuda' if torch.cuda.is_available() else 'cpu'
predicted_sentiments_raw = predict_sentiments(data_loader_raw, model)
predicted_sentiments_combined = predict_sentiments(data_loader_combined, model)

# Map predictions to sentiment labels
sentiment_labels = {0: 'Negative', 1: 'Neutral', 2: 'Positive'}

# Preview the first few raw predictions (class indices)
print(predicted_sentiments_raw[:5])

categorized_sentiments_raw = [sentiment_labels[pred] for pred in predicted_sentiments_raw]
categorized_sentiments_combined = [sentiment_labels[pred] for pred in predicted_sentiments_combined]

# Add sentiments to dataframe
df['Sentiment_BERT_raw'] = categorized_sentiments_raw
df['Sentiment_BERT_combined'] = categorized_sentiments_combined '''

In [113]:
# get a truncated view of the dataset (first and last rows) with all columns shown
with pd.option_context('display.max_rows', 10, 'display.max_columns', None): 
    display(df)

,Text,Sentiment,Timestamp,User,Platform,Retweets,Likes,Country,Year,Month,Day,Hour,Text_Combined,Processed_Text_NLTK,Sentiment_Score_VADER,Sentiment_VADER,TextBlob_Subjectivity,TextBlob_Polarity,Sentiment_TextBlob
0,Enjoying a beautiful day at the park!,Positive,2023-01-15 12:30:00,User123,Twitter,15.0,30.0,USA,2023,1,15,12,enjoying beautiful day park nature park,enjoy beauti day park natur park,0.8074,Positive,0.800000,0.675000,Positive
1,Traffic was terrible this morning.,Negative,2023-01-15 08:45:00,CommuterX,Twitter,5.0,10.0,Canada,2023,1,15,8,traffic terrible morning traffic morning,traffic terribl morn traffic morn,-0.4767,Negative,1.000000,-1.000000,Negative
2,Just finished an amazing workout! 💪,Positive,2023-01-15 15:45:00,FitnessFan,Instagram,20.0,40.0,USA,2023,1,15,15,finished amazing workout fitness workout,finish amaz workout fit workout,0.7096,Positive,0.900000,0.600000,Positive
3,Excited about the upcoming weekend getaway!,Positive,2023-01-15 18:20:00,AdventureX,Facebook,8.0,15.0,UK,2023,1,15,18,excited upcoming weekend getaway travel adventure,excit upcom weekend getaway travel adventur,0.5719,Positive,0.750000,0.375000,Positive
4,Trying out a new recipe for dinner tonight.,Neutral,2023-01-15 19:55:00,ChefCook,Instagram,12.0,25.0,Australia,2023,1,15,19,trying new recipe dinner tonight cooking food,tri new recip dinner tonight cook food,0.0000,Neutral,0.454545,0.136364,Positive
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
727,Collaborating on a science project that receiv...,Happy,2017-08-18 18:20:00,ScienceProjectSuccessHighSchool,Facebook,20.0,39.0,UK,2017,8,18,18,collaborating science project received recogni...,collabor scienc project receiv recognit region...,0.8126,Positive,0.900000,0.700000,Positive
728,Attending a surprise birthday party organized ...,Happy,2018-06-22 14:15:00,BirthdayPartyJoyHighSchool,Instagram,25.0,48.0,USA,2018,6,22,14,attending surprise birthday party organized fr...,attend surpris birthday parti organ friend sur...,0.9531,Positive,0.600000,0.600000,Positive
729,Successfully fundraising for a school charity ...,Happy,2019-04-05 17:30:00,CharityFundraisingTriumphHighSchool,Twitter,22.0,42.0,Canada,2019,4,5,17,successfully fundraising school charity initia...,success fundrais school chariti initi joy give...,0.9042,Positive,0.383333,0.516667,Positive
730,"Participating in a multicultural festival, cel...",Happy,2020-02-29 20:45:00,MulticulturalFestivalJoyHighSchool,Facebook,21.0,43.0,UK,2020,2,29,20,participating multicultural festival celebrati...,particip multicultur festiv celebr divers musi...,0.8910,Positive,1.000000,1.000000,Positive
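
The table shows that the two analyzers do not always agree (row 4 is `Neutral` under VADER but `Positive` under TextBlob). One way this could be quantified is a simple agreement rate between two label columns. The helper `agreement_rate` is a hypothetical name, and the labels below are the first five rows of the table above.

```python
def agreement_rate(labels_a, labels_b):
    """Fraction of positions where two equal-length label sequences match."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must have the same length")
    if not labels_a:
        return 0.0
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

# Labels for rows 0-4 of the table above
vader = ["Positive", "Negative", "Positive", "Positive", "Neutral"]
textblob = ["Positive", "Negative", "Positive", "Positive", "Positive"]

print(agreement_rate(vader, textblob))
```

On the full dataset, the two label columns could be passed in as lists, for example via `.tolist()`.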


In [115]:
# export cleaned data
df.to_csv('sentimentdataset_cleaned.csv', index=False)