# Social Media Sentiment Analysis

## Problem Statement

Conduct a comprehensive analysis of social media data to uncover insights into user sentiment distribution, engagement patterns, and content preferences across different platforms and geographical locations.

## Imports

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import math
import re

from scipy import stats

## Dataset

In [2]:
data = pd.read_csv("dataset.csv")
data.drop(columns=["Unnamed: 0", "Unnamed: 0.1"], inplace=True)

data.head()

,Text,Sentiment,Timestamp,User,Platform,Hashtags,Retweets,Likes,Country,Year,Month,Day,Hour
0,Enjoying a beautiful day at the park! ...,Positive,2023-01-15 12:30:00,User123,Twitter,#Nature #Park,15.0,30.0,USA,2023,1,15,12
1,Traffic was terrible this morning. ...,Negative,2023-01-15 08:45:00,CommuterX,Twitter,#Traffic #Morning,5.0,10.0,Canada,2023,1,15,8
2,Just finished an amazing workout! 💪 ...,Positive,2023-01-15 15:45:00,FitnessFan,Instagram,#Fitness #Workout,20.0,40.0,USA,2023,1,15,15
3,Excited about the upcoming weekend getaway! ...,Positive,2023-01-15 18:20:00,AdventureX,Facebook,#Travel #Adventure,8.0,15.0,UK,2023,1,15,18
4,Trying out a new recipe for dinner tonight. ...,Neutral,2023-01-15 19:55:00,ChefCook,Instagram,#Cooking #Food,12.0,25.0,Australia,2023,1,15,19
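
Columns named `Unnamed: 0` typically appear when a dataframe was saved to CSV together with its index and then read back. As a minimal sketch, assuming a hypothetical two-row CSV in place of `dataset.csv`, such columns can be dropped by prefix rather than by listing each name:

```python
import io

import pandas as pd

# Hypothetical CSV text that mimics a file saved twice with its index.
csv_text = (
    "Unnamed: 0.1,Unnamed: 0,Text,Likes\n"
    "0,0,Enjoying a beautiful day at the park!,30.0\n"
    "1,1,Traffic was terrible this morning.,10.0\n"
)
sample = pd.read_csv(io.StringIO(csv_text))

# Keep every column whose name does not start with "Unnamed".
unnamed = sample.columns.str.startswith("Unnamed")
sample = sample.loc[:, ~unnamed]

print(list(sample.columns))
```

This avoids a `KeyError` if one of the expected index columns is absent, at the cost of also dropping any real column whose name happens to start with `Unnamed`.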


### Data Dictionary of this dataset


* Text: The text content of the social media post.
* Sentiment: Categorized sentiment of the post (e.g., Positive, Negative, Neutral).
* Timestamp: The date and time when the post was made.
* User: Identifier or username of the post's author.
* Platform: The social media platform where the post was made (e.g., Twitter, Facebook).
* Hashtags: Hashtags used in the post, indicating topics or themes.
* Retweets: The number of times the post has been retweeted or shared.
* Likes: The number of likes the post has received.
* Country: The country from which the post was made.
* Year: The year when the post was made.
* Month: The month when the post was made.
* Day: The day of the month when the post was made.
* Hour: The hour of the day when the post was made, using a 24-hour clock.

## Preprocessing Data

### 1. Checking and Handling Missing Values

In [3]:
data.shape

(732, 13)

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 732 entries, 0 to 731
Data columns (total 13 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Text       732 non-null    object 
 1   Sentiment  732 non-null    object 
 2   Timestamp  732 non-null    object 
 3   User       732 non-null    object 
 4   Platform   732 non-null    object 
 5   Hashtags   732 non-null    object 
 6   Retweets   732 non-null    float64
 7   Likes      732 non-null    float64
 8   Country    732 non-null    object 
 9   Year       732 non-null    int64  
 10  Month      732 non-null    int64  
 11  Day        732 non-null    int64  
 12  Hour       732 non-null    int64  
dtypes: float64(2), int64(4), object(7)
memory usage: 74.5+ KB


- `data.info()` reports 732 non-null entries in every column, so there are no missing values.
- Next, check whether the values themselves need cleaning.
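
The non-null counts only detect true `NaN` values; a cell that holds nothing but spaces still counts as non-null. A minimal sketch of an explicit check, using a hypothetical three-row sample rather than the real dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one real NaN and one whitespace-only string.
sample = pd.DataFrame({
    "Text": ["Great day!", "   ", "Traffic again"],
    "Likes": [30.0, np.nan, 10.0],
})

# Count genuine missing values per column.
missing_per_column = sample.isna().sum()

# Blank strings are not NaN, so count them separately for the text column.
blank_text = int(sample["Text"].str.strip().eq("").sum())

print(missing_per_column)
print(blank_text)
```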

### 2. Removing Leading and Trailing Spaces

In [5]:
data["Platform"].unique()

array([' Twitter  ', ' Instagram ', ' Facebook ', ' Twitter '],
      dtype=object)

In [6]:
data["Country"].unique()[: 10]

array([' USA      ', ' Canada   ', ' USA        ', ' UK       ',
       ' Australia ', ' India    ', ' USA    ', 'USA', ' Canada    ',
       ' USA       '], dtype=object)

In [7]:
data["Hashtags"].unique()[:5]

array([' #Nature #Park                            ',
       ' #Traffic #Morning                        ',
       ' #Fitness #Workout                        ',
       ' #Travel #Adventure                       ',
       ' #Cooking #Food                           '], dtype=object)

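*The outputs above show padded values such as `' Twitter  '` and `' Twitter '` being treated as distinct, so strip leading and trailing whitespace from every text column.*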

In [8]:
for column in data.columns:
    if data[column].dtype == object:
        data[column] = data[column].str.strip()
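
The effect of stripping can be checked on a small example. The sketch below assumes a hypothetical three-row sample and selects text columns with `pd.api.types.is_string_dtype` instead of comparing against `object`, which is one possible variation rather than the notebook's own approach:

```python
import pandas as pd

# Hypothetical sample with the kind of padding seen in the Platform column.
sample = pd.DataFrame({
    "Platform": [" Twitter  ", " Instagram ", " Twitter "],
    "Likes": [30.0, 40.0, 10.0],
})

labels_before = sample["Platform"].nunique()

# Strip only the columns that hold strings; numeric columns are left alone.
for column in sample.columns:
    if pd.api.types.is_string_dtype(sample[column]):
        sample[column] = sample[column].str.strip()

labels_after = sample["Platform"].nunique()
print(labels_before, labels_after)
```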

### 3. Dealing with duplicates

In [9]:
data.duplicated().sum()

22

In [10]:
data.drop_duplicates(keep='first', inplace=True)

In [11]:
data.shape

(710, 13)
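
The order of the two cleaning steps matters: rows that differ only in padding are not detected as duplicates until the whitespace is stripped. A minimal sketch with a hypothetical sample:

```python
import pandas as pd

# Hypothetical sample: rows 0 and 1 differ only in surrounding whitespace.
sample = pd.DataFrame({
    "Text": ["Great day!", "Great day! ", "Traffic again"],
    "Platform": [" Twitter ", "Twitter", "Facebook"],
})

duplicates_before_strip = int(sample.duplicated().sum())

stripped = sample.apply(lambda col: col.str.strip())
duplicates_after_strip = int(stripped.duplicated().sum())

# reset_index is optional; without it the surviving rows keep their old labels.
deduplicated = stripped.drop_duplicates(keep="first").reset_index(drop=True)

print(duplicates_before_strip, duplicates_after_strip, len(deduplicated))
```

The notebook does not reset the index after dropping duplicates, so its row labels still run up to 731 even though only 710 rows remain.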

### 4. Optimising memory for numerical column dtypes

In [12]:
print(f"memory usage before datatype optimisation: {data.memory_usage().sum()}")

memory usage before datatype optimisation: 79520


In [13]:
for column in data.columns:
    if np.issubdtype(data[column].dtype, np.number):
        print(f"{column} ==> Min: {min(data[column])}, Max: {max(data[column])}")


Retweets ==> Min: 5.0, Max: 40.0
Likes ==> Min: 10.0, Max: 80.0
Year ==> Min: 2010, Max: 2023
Month ==> Min: 1, Max: 12
Day ==> Min: 1, Max: 31
Hour ==> Min: 0, Max: 23


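- All values in `Retweets`, `Likes`, `Month`, `Day` and `Hour` fall within 0 to 255, so these columns are converted to **`np.uint8`**. `Retweets` and `Likes` are stored as floats; the cast assumes every value is a whole number, since any fractional part would be truncated.
- `Year` ranges from 2010 to 2023, which exceeds the `np.uint8` maximum of 255, so it is converted to **`np.uint16`**.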
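
The range check behind these choices can be automated. A minimal sketch, assuming a hypothetical helper named `smallest_uint_dtype` (it is not part of pandas or of the original notebook) that refuses to pick a dtype when a column contains missing, negative, or fractional values:

```python
import numpy as np
import pandas as pd


def smallest_uint_dtype(series):
    """Return the narrowest unsigned integer dtype that holds every value
    without loss, or None if the column cannot be cast safely."""
    values = series.to_numpy()
    # Missing, negative, or fractional values cannot be stored as unsigned ints.
    if series.isna().any() or (values < 0).any() or (values % 1 != 0).any():
        return None
    for dtype in (np.uint8, np.uint16, np.uint32, np.uint64):
        if values.max() <= np.iinfo(dtype).max:
            return dtype
    return None


# Hypothetical columns using the minimum and maximum values printed above.
retweets = pd.Series([5.0, 40.0])
year = pd.Series([2010, 2023])
fractional = pd.Series([1.5, 2.0])

print(smallest_uint_dtype(retweets), smallest_uint_dtype(year))
```

Checking `values % 1` guards against silently truncating a float such as 1.5, which a plain `astype(np.uint8)` would turn into 1.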

In [14]:
data["Retweets"] = data["Retweets"].astype(np.uint8)
data["Likes"] = data["Likes"].astype(np.uint8)
data["Month"] = data["Month"].astype(np.uint8)
data["Day"] = data["Day"].astype(np.uint8)
data["Hour"] = data["Hour"].astype(np.uint8)

data["Year"] = data["Year"].astype(np.uint16)

In [15]:
for column in data.columns:
    if np.issubdtype(data[column].dtype, np.number):
        print(f"{column} ==> Min: {min(data[column])}, Max: {max(data[column])}")


Retweets ==> Min: 5, Max: 40
Likes ==> Min: 10, Max: 80
Year ==> Min: 2010, Max: 2023
Month ==> Min: 1, Max: 12
Day ==> Min: 1, Max: 31
Hour ==> Min: 0, Max: 23


In [16]:
print(f"memory usage after datatype optimisation: {data.memory_usage().sum()}")

memory usage after datatype optimisation: 50410


- Memory usage before optimisation: `79520` bytes (about 77.7 KiB).
- Memory usage after dtype optimisation: `50410` bytes (about 49.2 KiB), a reduction of roughly 37%.
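
With default arguments, `DataFrame.memory_usage()` counts only the references stored in `object` columns, not the Python strings they point to. A minimal sketch of the difference, using a hypothetical two-row sample with an explicitly `object`-typed text column:

```python
import pandas as pd

# Hypothetical sample; dtype=object mirrors the text columns in data.info().
sample = pd.DataFrame({
    "Text": pd.Series(
        ["Enjoying a beautiful day at the park!",
         "Traffic was terrible this morning."],
        dtype=object,
    ),
    "Likes": [30, 10],
})

# Shallow: object columns contribute only their per-row references.
shallow_bytes = int(sample.memory_usage().sum())

# Deep: the string contents are measured as well.
deep_bytes = int(sample.memory_usage(deep=True).sum())

print(shallow_bytes, deep_bytes)
```

Because seven of the thirteen columns are text, the byte counts printed in this section probably understate the true footprint, although the saving from the numeric downcast is unaffected.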

### 5. Converting `Timestamp` to pandas datetime format.

In [17]:
data['Timestamp'] = pd.to_datetime(data['Timestamp'])

In [18]:
data['Timestamp'].dt.day_name()

0        Sunday
1        Sunday
2        Sunday
3        Sunday
4        Sunday
         ...   
727      Friday
728      Friday
729      Friday
730    Saturday
731      Sunday
Name: Timestamp, Length: 710, dtype: object
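
Once `Timestamp` is a datetime column, the `.dt` accessor can both derive new features and cross-check the existing `Year`, `Month`, `Day` and `Hour` columns. A minimal sketch on the first two rows of the dataset, where `DayName` is a hypothetical column name:

```python
import pandas as pd

# First two rows of the dataset, reduced to the time-related columns.
sample = pd.DataFrame({
    "Timestamp": ["2023-01-15 12:30:00", "2023-01-15 08:45:00"],
    "Year": [2023, 2023],
    "Month": [1, 1],
    "Day": [15, 15],
    "Hour": [12, 8],
})
sample["Timestamp"] = pd.to_datetime(sample["Timestamp"])

# Store the weekday name instead of only displaying it.
sample["DayName"] = sample["Timestamp"].dt.day_name()

# Verify that the pre-computed columns agree with the timestamp.
parts = sample["Timestamp"].dt
consistent = bool((
    (parts.year == sample["Year"])
    & (parts.month == sample["Month"])
    & (parts.day == sample["Day"])
    & (parts.hour == sample["Hour"])
).all())

print(sample["DayName"].tolist(), consistent)
```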

### 6. Handling Nested Data

- The `Hashtags` and `Sentiment` columns contain nested data: a single cell can hold several values.
- To unnest them, I will split the hashtags and sentiments into separate dataframes.
- To preserve the relationship between each hashtag or sentiment and its post, I will also create a unique `id` column.

#### Creating id column

In [19]:
total_records = data.shape[0]
total_records

710

In [20]:
data["id"] = np.arange(1, total_records +1, dtype=np.uint16)
data.head()

,Text,Sentiment,Timestamp,User,Platform,Hashtags,Retweets,Likes,Country,Year,Month,Day,Hour,id
0,Enjoying a beautiful day at the park!,Positive,2023-01-15 12:30:00,User123,Twitter,#Nature #Park,15,30,USA,2023,1,15,12,1
1,Traffic was terrible this morning.,Negative,2023-01-15 08:45:00,CommuterX,Twitter,#Traffic #Morning,5,10,Canada,2023,1,15,8,2
2,Just finished an amazing workout! 💪,Positive,2023-01-15 15:45:00,FitnessFan,Instagram,#Fitness #Workout,20,40,USA,2023,1,15,15,3
3,Excited about the upcoming weekend getaway!,Positive,2023-01-15 18:20:00,AdventureX,Facebook,#Travel #Adventure,8,15,UK,2023,1,15,18,4
4,Trying out a new recipe for dinner tonight.,Neutral,2023-01-15 19:55:00,ChefCook,Instagram,#Cooking #Food,12,25,Australia,2023,1,15,19,5


#### a) Handling Nested Hashtags

In [21]:
data_hashtags = data["Hashtags"].apply(lambda s: str(s).split()).tolist()
data_hashtags = pd.DataFrame(data_hashtags, index = data["id"])

# stack the wide frame into one row per (id, hashtag) pair
data_hashtags = data_hashtags.stack()

# convert back to a dataframe after resetting the index
data_hashtags = pd.DataFrame(data_hashtags.reset_index())

# rename the value column and drop the positional 'level_1' column
data_hashtags.rename(columns={0: "Hashtags"}, inplace=True)
data_hashtags.drop(columns=["level_1"], inplace=True)

data_hashtags.head()

,id,Hashtags
0,1,#Nature
1,1,#Park
2,2,#Traffic
3,2,#Morning
4,3,#Fitness
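
The same long format can also be produced with `Series.str.split` followed by `DataFrame.explode`. This is an alternative sketch, not the method used in this notebook, shown on a hypothetical three-row sample:

```python
import pandas as pd

# Hypothetical sample with the id column and space-separated hashtags.
sample = pd.DataFrame({
    "id": [1, 2, 3],
    "Hashtags": ["#Nature #Park", "#Traffic #Morning", "#Fitness"],
})

hashtags_long = (
    sample[["id", "Hashtags"]]
    .assign(Hashtags=sample["Hashtags"].str.split())  # string -> list of tags
    .explode("Hashtags")                              # one row per tag
    .reset_index(drop=True)
)

print(hashtags_long)
```

One difference to keep in mind: `explode` turns an empty list into a row containing `NaN`, so posts without hashtags may need a `dropna` afterwards.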


#### b) Handling Nested Sentiments

In [22]:
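def sentiment_splitter(s):
    """Split a compound sentiment label into its individual words."""
    if " " in s:
        # space-separated label: drop a possessive "'s" before splitting
        if "'s " in s:
            s = s.replace("'s ", " ")
        return s.split(" ")
    else:
        # no spaces: split a CamelCase label at each capital letter
        return re.findall(r'[A-Z][a-z]*', s)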
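
To make the splitting rules concrete, the sketch below runs the function on a few hypothetical labels. These strings are illustrative assumptions, not values taken from the dataset:

```python
import re


def sentiment_splitter(s):
    if " " in s:
        if "'s " in s:
            s = s.replace("'s ", " ")
        return s.split(" ")
    else:
        return re.findall(r'[A-Z][a-z]*', s)


# Hypothetical labels covering each branch of the function.
labels = ["Positive", "MixedFeelings", "Nature's Calm", "positive"]
examples = {label: sentiment_splitter(label) for label in labels}

for label, parts in examples.items():
    print(f"{label!r} -> {parts}")
```

The last case shows a limitation: a label with no space and no capital letter yields an empty list, so that post would contribute no row to the sentiment table.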

In [23]:
data_sentiments = data["Sentiment"].apply(sentiment_splitter).tolist()

data_sentiments = pd.DataFrame(data_sentiments, index = data["id"])

# stack the wide frame into one row per (id, sentiment) pair
data_sentiments = data_sentiments.stack()

# convert back to a dataframe after resetting the index
data_sentiments = pd.DataFrame(data_sentiments.reset_index())

# rename the value column and drop the positional 'level_1' column
data_sentiments.rename(columns={0: "Sentiment"}, inplace=True)
data_sentiments.drop(columns=["level_1"], inplace=True)

data_sentiments.head()

,id,Sentiment
0,1,Positive
1,2,Negative
2,3,Positive
3,4,Positive
4,5,Neutral
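
The `id` column is what allows the unnested tables to be joined back to the posts. A minimal sketch, assuming hypothetical stand-ins named `posts` and `hashtags` for the main dataframe and the unnested hashtag table:

```python
import pandas as pd

# Hypothetical stand-in for the main dataframe.
posts = pd.DataFrame({
    "id": [1, 2, 3],
    "Platform": ["Twitter", "Twitter", "Instagram"],
    "Likes": [30, 10, 40],
})

# Hypothetical stand-in for the unnested hashtag table.
hashtags = pd.DataFrame({
    "id": [1, 1, 2, 2, 3, 3],
    "Hashtags": ["#Nature", "#Park", "#Traffic", "#Morning", "#Fitness", "#Workout"],
})

# Each hashtag row maps to exactly one post; validate raises if ids repeat in posts.
joined = hashtags.merge(posts, on="id", how="left", validate="many_to_one")

hashtags_per_platform = joined.groupby("Platform")["Hashtags"].count()
print(hashtags_per_platform)
```

After such a join, post-level values like `Likes` are repeated once per hashtag, so summing them on the joined table would count a post several times.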
