## Data Wrangling
You’re now in the data wrangling stage of your third capstone. In addition to
the data wrangling steps applied in your previous capstone projects, you need to
address the characteristics that come with the more advanced nature of this
project. The exact steps depend heavily on the type of data you’re working with.
This capstone is an NLP project, so the relevant methods include tokenization,
stop word removal, stemming, lemmatization, and frequency analysis.

The data comes from the Social Media Sentiments Analysis dataset on Kaggle (https://www.kaggle.com/datasets/kashishparmar02/social-media-sentiments-analysis-dataset).

In [13]:
# import necessary packages and libraries
import re
import string

import pandas as pd
import nltk  # Python natural language processing toolkit
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# download the NLTK resources used below
nltk.download('vader_lexicon')
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\Quinn\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Quinn\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Quinn\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [14]:
# load the sentiment dataset and drop unused columns
df = pd.read_csv('sentimentdataset.csv')
df.drop(columns=["Unnamed: 0.1", "Unnamed: 0"], inplace=True)

In [15]:
# get the number of rows and columns in the dataset
df.shape

(732, 13)

In [16]:
# print the first 5 rows of the dataframe to better understand its structure and features
df.head()

Unnamed: 0,Text,Sentiment,Timestamp,User,Platform,Hashtags,Retweets,Likes,Country,Year,Month,Day,Hour
0,Enjoying a beautiful day at the park! ...,Positive,2023-01-15 12:30:00,User123,Twitter,#Nature #Park,15.0,30.0,USA,2023,1,15,12
1,Traffic was terrible this morning. ...,Negative,2023-01-15 08:45:00,CommuterX,Twitter,#Traffic #Morning,5.0,10.0,Canada,2023,1,15,8
2,Just finished an amazing workout! 💪 ...,Positive,2023-01-15 15:45:00,FitnessFan,Instagram,#Fitness #Workout,20.0,40.0,USA,2023,1,15,15
3,Excited about the upcoming weekend getaway! ...,Positive,2023-01-15 18:20:00,AdventureX,Facebook,#Travel #Adventure,8.0,15.0,UK,2023,1,15,18
4,Trying out a new recipe for dinner tonight. ...,Neutral,2023-01-15 19:55:00,ChefCook,Instagram,#Cooking #Food,12.0,25.0,Australia,2023,1,15,19


In [17]:
# Check our dataset for missing values and ensure the columns are the appropriate datatypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 732 entries, 0 to 731
Data columns (total 13 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Text       732 non-null    object 
 1   Sentiment  732 non-null    object 
 2   Timestamp  732 non-null    object 
 3   User       732 non-null    object 
 4   Platform   732 non-null    object 
 5   Hashtags   732 non-null    object 
 6   Retweets   732 non-null    float64
 7   Likes      732 non-null    float64
 8   Country    732 non-null    object 
 9   Year       732 non-null    int64  
 10  Month      732 non-null    int64  
 11  Day        732 non-null    int64  
 12  Hour       732 non-null    int64  
dtypes: float64(2), int64(4), object(7)
memory usage: 74.5+ KB
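
The summary above shows no missing values, but `Timestamp` is stored as `object` and `Retweets` and `Likes` are stored as `float64`. The sketch below shows one way these could be converted, assuming the timestamps all follow the `YYYY-MM-DD HH:MM:SS` form seen in the first rows and that the counts are always whole numbers. The `sample` frame is a hypothetical stand-in built from two rows of the output, not the full dataset.

```python
import pandas as pd

# Hypothetical two-row stand-in for the dataset, using values from df.head()
sample = pd.DataFrame({
    "Timestamp": ["2023-01-15 12:30:00", "2023-01-15 08:45:00"],
    "Retweets": [15.0, 5.0],
    "Likes": [30.0, 10.0],
})

# Parse the timestamp strings into datetime64 values
sample["Timestamp"] = pd.to_datetime(sample["Timestamp"])

# Counts are whole numbers, so integer columns describe them more accurately
sample = sample.astype({"Retweets": "int64", "Likes": "int64"})

print(sample.dtypes)
```

The notebook's later steps do not depend on this conversion, so it is optional for the analysis as planned.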


In [18]:
# summary statistics for the numeric columns; outliers in these columns shouldn't meaningfully impact any analysis currently planned
df.describe()

Unnamed: 0,Retweets,Likes,Year,Month,Day,Hour
count,732.0,732.0,732.0,732.0,732.0,732.0
mean,21.508197,42.901639,2020.471311,6.122951,15.497268,15.521858
std,7.061286,14.089848,2.802285,3.411763,8.474553,4.113414
min,5.0,10.0,2010.0,1.0,1.0,0.0
25%,17.75,34.75,2019.0,3.0,9.0,13.0
50%,22.0,43.0,2021.0,6.0,15.0,16.0
75%,25.0,50.0,2023.0,9.0,22.0,19.0
max,40.0,80.0,2023.0,12.0,31.0,23.0


Explanation of Columns:
    
    Text: Text of the social media post
    Sentiment: Sentiment label for the text (e.g. Positive, Neutral, Negative)
    Timestamp: Timestamp of when the post was created
    User: User ID of the post's creator
    Platform: Social media site the post was created on (Twitter, Facebook, Instagram)
    Hashtags: Hashtags used in the post
    Retweets: Number of retweets or shares of the post
    Likes: Number of likes on the post
    Country: Country the post was created in
    Year: Year the post was created
    Month: Month the post was created
    Day: Day the post was created
    Hour: Hour the post was created


In [19]:
# example of unnecessary white space: Twitter is split into two categories
df['Platform'].unique()

array([' Twitter  ', ' Instagram ', ' Facebook ', ' Twitter '],
      dtype=object)

In [20]:
# remove unnecessary white space to prevent splitting categorical variables
df["Text"] = df["Text"].str.strip()
df["Sentiment"] = df["Sentiment"].str.strip()
df["Hashtags"] = df["Hashtags"].str.strip()
df["User"] = df["User"].str.strip()
df["Platform"] = df["Platform"].str.strip()
df["Country"] = df["Country"].str.strip()
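
The same clean-up can be written as a loop over the text columns. This is a minimal sketch on a hypothetical `sample` frame that reuses the `Platform` values shown above; the `Country` values are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in: Platform values come from the unique() output above,
# Country values are illustrative only
sample = pd.DataFrame({
    "Platform": [" Twitter  ", " Instagram ", " Facebook ", " Twitter "],
    "Country": [" USA ", "USA", " UK", "Canada "],
})

# Strip leading and trailing white space from each listed text column
for column in ["Platform", "Country"]:
    sample[column] = sample[column].str.strip()

# After stripping, the two spellings of Twitter collapse into one category
platform_count = sample["Platform"].nunique()
print(sorted(sample["Platform"].unique()), platform_count)
```

Listing the columns explicitly keeps the loop from touching numeric columns, where the `.str` accessor would raise an error.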

In [21]:
# check dataset for duplicate values
df.duplicated().sum()

22

In [22]:
# remove duplicated rows to prevent bias in modeling
# drop_duplicates returns a new dataframe, so assign the result back to df
df = df.drop_duplicates()
df

Unnamed: 0,Text,Sentiment,Timestamp,User,Platform,Hashtags,Retweets,Likes,Country,Year,Month,Day,Hour
0,Enjoying a beautiful day at the park!,Positive,2023-01-15 12:30:00,User123,Twitter,#Nature #Park,15.0,30.0,USA,2023,1,15,12
1,Traffic was terrible this morning.,Negative,2023-01-15 08:45:00,CommuterX,Twitter,#Traffic #Morning,5.0,10.0,Canada,2023,1,15,8
2,Just finished an amazing workout! 💪,Positive,2023-01-15 15:45:00,FitnessFan,Instagram,#Fitness #Workout,20.0,40.0,USA,2023,1,15,15
3,Excited about the upcoming weekend getaway!,Positive,2023-01-15 18:20:00,AdventureX,Facebook,#Travel #Adventure,8.0,15.0,UK,2023,1,15,18
4,Trying out a new recipe for dinner tonight.,Neutral,2023-01-15 19:55:00,ChefCook,Instagram,#Cooking #Food,12.0,25.0,Australia,2023,1,15,19
...,...,...,...,...,...,...,...,...,...,...,...,...,...
727,Collaborating on a science project that receiv...,Happy,2017-08-18 18:20:00,ScienceProjectSuccessHighSchool,Facebook,#ScienceFairWinner #HighSchoolScience,20.0,39.0,UK,2017,8,18,18
728,Attending a surprise birthday party organized ...,Happy,2018-06-22 14:15:00,BirthdayPartyJoyHighSchool,Instagram,#SurpriseCelebration #HighSchoolFriendship,25.0,48.0,USA,2018,6,22,14
729,Successfully fundraising for a school charity ...,Happy,2019-04-05 17:30:00,CharityFundraisingTriumphHighSchool,Twitter,#CommunityGiving #HighSchoolPhilanthropy,22.0,42.0,Canada,2019,4,5,17
730,"Participating in a multicultural festival, cel...",Happy,2020-02-29 20:45:00,MulticulturalFestivalJoyHighSchool,Facebook,#CulturalCelebration #HighSchoolUnity,21.0,43.0,UK,2020,2,29,20
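
Note that `drop_duplicates` returns a new DataFrame rather than changing the original, so the result only persists if it is assigned. The sketch below illustrates this on a hypothetical three-row `sample` frame; the variable names are illustrative.

```python
import pandas as pd

# Hypothetical stand-in with one exact duplicate row
sample = pd.DataFrame({
    "Text": [
        "Enjoying a beautiful day at the park!",
        "Traffic was terrible this morning.",
        "Enjoying a beautiful day at the park!",
    ],
    "Platform": ["Twitter", "Twitter", "Twitter"],
})

# Calling drop_duplicates without assignment leaves sample unchanged
sample.drop_duplicates()
rows_before = len(sample)

# Assigning the result keeps the de-duplicated rows; resetting the index
# is optional and only renumbers the remaining rows
deduplicated = sample.drop_duplicates().reset_index(drop=True)
rows_after = len(deduplicated)

print(rows_before, rows_after)
```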


In [23]:
# count the number of distinct values in each column of the dataframe
for column in df.columns:
    number_distinct_values = len(df[column].unique())
    print(f"{column} has {number_distinct_values} distinct values")

Text has 706 distinct values
Sentiment has 191 distinct values
Timestamp has 683 distinct values
User has 670 distinct values
Platform has 3 distinct values
Hashtags has 692 distinct values
Retweets has 26 distinct values
Likes has 38 distinct values
Country has 33 distinct values
Year has 14 distinct values
Month has 12 distinct values
Day has 31 distinct values
Hour has 22 distinct values


In [27]:
# combine the post text and hashtags to give the sentiment analysis more information
df1 = df.copy()  # work on a copy so the original dataframe stays unchanged
df1["Text"] = df1["Text"] + ' ' + df1["Hashtags"]
df1 = df1.drop(columns="Hashtags")
df1.head()

Unnamed: 0,Text,Sentiment,Timestamp,User,Platform,Retweets,Likes,Country,Year,Month,Day,Hour
0,Enjoying a beautiful day at the park! #Nature ...,Positive,2023-01-15 12:30:00,User123,Twitter,15.0,30.0,USA,2023,1,15,12
1,Traffic was terrible this morning. #Traffic #M...,Negative,2023-01-15 08:45:00,CommuterX,Twitter,5.0,10.0,Canada,2023,1,15,8
2,Just finished an amazing workout! 💪 #Fitness #...,Positive,2023-01-15 15:45:00,FitnessFan,Instagram,20.0,40.0,USA,2023,1,15,15
3,Excited about the upcoming weekend getaway! #T...,Positive,2023-01-15 18:20:00,AdventureX,Facebook,8.0,15.0,UK,2023,1,15,18
4,Trying out a new recipe for dinner tonight. #C...,Neutral,2023-01-15 19:55:00,ChefCook,Instagram,12.0,25.0,Australia,2023,1,15,19


In [28]:
# make all text characters lowercase for easier processing
df1['Text'] = df1['Text'].str.lower()
# use regular expressions to filter out punctuation
df1['Text'] = df1['Text'].str.replace(r'[%s]' % re.escape(string.punctuation), '', regex=True)
# remove non-ASCII characters such as emoji
df1['Text'] = df1['Text'].str.replace(r'[^\x00-\x7F]+', '', regex=True)
# remove digits anywhere in the text (an anchored ^[0-9]+$ would only match all-digit entries)
df1['Text'] = df1['Text'].str.replace(r'[0-9]+', '', regex=True)

df1.head()

Unnamed: 0,Text,Sentiment,Timestamp,User,Platform,Retweets,Likes,Country,Year,Month,Day,Hour
0,enjoying a beautiful day at the park nature park,Positive,2023-01-15 12:30:00,User123,Twitter,15.0,30.0,USA,2023,1,15,12
1,traffic was terrible this morning traffic morning,Negative,2023-01-15 08:45:00,CommuterX,Twitter,5.0,10.0,Canada,2023,1,15,8
2,just finished an amazing workout fitness workout,Positive,2023-01-15 15:45:00,FitnessFan,Instagram,20.0,40.0,USA,2023,1,15,15
3,excited about the upcoming weekend getaway tra...,Positive,2023-01-15 18:20:00,AdventureX,Facebook,8.0,15.0,UK,2023,1,15,18
4,trying out a new recipe for dinner tonight coo...,Neutral,2023-01-15 19:55:00,ChefCook,Instagram,12.0,25.0,Australia,2023,1,15,19
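
The cleaning steps can also be wrapped in a single function, which makes them easy to check on one string at a time. This is a sketch using only the standard library; `clean_text` and the pattern names are hypothetical, and the final whitespace collapse is an addition that the notebook cell does not perform.

```python
import re
import string

# Patterns compiled once; the names are illustrative, not from the notebook
PUNCTUATION = re.compile(f"[{re.escape(string.punctuation)}]")
NON_ASCII = re.compile(r"[^\x00-\x7F]+")
DIGITS = re.compile(r"[0-9]+")
WHITESPACE = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Lowercase the text, then remove punctuation, non-ASCII characters, and digits."""
    text = text.lower()
    text = PUNCTUATION.sub("", text)
    text = NON_ASCII.sub("", text)
    text = DIGITS.sub("", text)
    # Assumption: collapse the gaps left behind by removed characters
    return WHITESPACE.sub(" ", text).strip()


example = clean_text("Just finished an amazing workout! 💪 #Fitness #Workout")
print(example)
```

With pandas, a function like this could be applied through `df1['Text'].apply(clean_text)` in place of the chained `str.replace` calls.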


In [29]:
# tokenize, stem, filter on stop words, and rejoin 'Text'

# Initialize the Porter Stemmer
stemmer = PorterStemmer()

# English stop words
stop_words = set(stopwords.words('english'))

def process_text(text):
    # Tokenize the text
    tokens = word_tokenize(text)
    
    # Apply stemming and remove stop words
    stemmed_tokens = [stemmer.stem(word) for word in tokens if word.lower() not in stop_words]
    
    # join stems into text entries again
    processed_text = ' '.join(stemmed_tokens)
    return processed_text

df1["Processed_Text"] = df1["Text"].apply(process_text)

df1.head()

Unnamed: 0,Text,Sentiment,Timestamp,User,Platform,Retweets,Likes,Country,Year,Month,Day,Hour,Processed_Text
0,enjoying a beautiful day at the park nature park,Positive,2023-01-15 12:30:00,User123,Twitter,15.0,30.0,USA,2023,1,15,12,enjoy beauti day park natur park
1,traffic was terrible this morning traffic morning,Negative,2023-01-15 08:45:00,CommuterX,Twitter,5.0,10.0,Canada,2023,1,15,8,traffic terribl morn traffic morn
2,just finished an amazing workout fitness workout,Positive,2023-01-15 15:45:00,FitnessFan,Instagram,20.0,40.0,USA,2023,1,15,15,finish amaz workout fit workout
3,excited about the upcoming weekend getaway tra...,Positive,2023-01-15 18:20:00,AdventureX,Facebook,8.0,15.0,UK,2023,1,15,18,excit upcom weekend getaway travel adventur
4,trying out a new recipe for dinner tonight coo...,Neutral,2023-01-15 19:55:00,ChefCook,Instagram,12.0,25.0,Australia,2023,1,15,19,tri new recip dinner tonight cook food
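
Frequency analysis, mentioned in the introduction, can be run on the stemmed text with `collections.Counter`. The sketch below counts tokens across the five `Processed_Text` values shown above; on the full dataset the same counting could be applied to the whole column. The variable names are illustrative.

```python
from collections import Counter

# The five Processed_Text values from the output above
processed_texts = [
    "enjoy beauti day park natur park",
    "traffic terribl morn traffic morn",
    "finish amaz workout fit workout",
    "excit upcom weekend getaway travel adventur",
    "tri new recip dinner tonight cook food",
]

# Split each entry on white space and count every stem
token_counts = Counter(
    token for text in processed_texts for token in text.split()
)

# The most frequent stems and how often each occurs
top_tokens = token_counts.most_common(4)
print(top_tokens)
```

Because stemming maps related word forms to one stem, counts like these group "morning" and "morn" style variants together.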


In [30]:
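# score each post with VADER and keep the compound score (ranges from -1 to 1)
# note: VADER's lexicon is built from unstemmed words, so stems such as 'beauti' may not match its entries
analyzer = SentimentIntensityAnalyzer()
df1['Sentiment_Score'] = df1['Processed_Text'].apply(lambda text: analyzer.polarity_scores(text)['compound'])
# replace the original sentiment labels with three classes based on the compound score
df1['Sentiment'] = df1['Sentiment_Score'].apply(lambda score: 'Positive' if score >= 0.1 else ('Negative' if score <= -0.1 else 'Neutral'))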
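
The labeling rule in the cell above can be written as a small named function, which is easier to read and test than a nested lambda. This is a sketch: `label_from_compound` and its `threshold` parameter are hypothetical names, with the default of 0.1 taken from the cell above.

```python
def label_from_compound(score: float, threshold: float = 0.1) -> str:
    """Map a compound sentiment score to Positive, Negative, or Neutral."""
    if score >= threshold:
        return "Positive"
    if score <= -threshold:
        return "Negative"
    return "Neutral"


# Illustrative scores, not values from the dataset
labels = [label_from_compound(score) for score in (0.6, 0.0, -0.4)]
print(labels)
```

It could then be applied with `df1['Sentiment_Score'].apply(label_from_compound)`, and a different cutoff can be tried by passing another `threshold`.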

In [31]:
# export cleaned data
df1.to_csv('sentimentdataset_cleaned.csv', index=False)