# 📊 Social Media Sentiment Analysis
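This notebook performs sentiment analysis on a social media dataset using text preprocessing, TF-IDF feature extraction, and a Logistic Regression model. The `Sentiment` column is not limited to positive, negative, and neutral: it holds 279 distinct labels (for example Positive, Negative, Neutral, Joy, Excitement), most of them with very few samples.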

In [3]:
import pandas as pd

# Load dataset
df = pd.read_csv("sentimentdataset.csv")  # Replace with correct file name
df.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,Text,Sentiment,Timestamp,User,Platform,Hashtags,Retweets,Likes,Country,Year,Month,Day,Hour
0,0,0,Enjoying a beautiful day at the park! ...,Positive,2023-01-15 12:30:00,User123,Twitter,#Nature #Park,15.0,30.0,USA,2023,1,15,12
1,1,1,Traffic was terrible this morning. ...,Negative,2023-01-15 08:45:00,CommuterX,Twitter,#Traffic #Morning,5.0,10.0,Canada,2023,1,15,8
2,2,2,Just finished an amazing workout! 💪 ...,Positive,2023-01-15 15:45:00,FitnessFan,Instagram,#Fitness #Workout,20.0,40.0,USA,2023,1,15,15
3,3,3,Excited about the upcoming weekend getaway! ...,Positive,2023-01-15 18:20:00,AdventureX,Facebook,#Travel #Adventure,8.0,15.0,UK,2023,1,15,18
4,4,4,Trying out a new recipe for dinner tonight. ...,Neutral,2023-01-15 19:55:00,ChefCook,Instagram,#Cooking #Food,12.0,25.0,Australia,2023,1,15,19


In [5]:
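df.info()                       # prints column names, dtypes, and non-null counts
df.isnull().sum()               # result is discarded: a cell only displays its last expression
df['Sentiment'].value_counts()  # displayed below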

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 732 entries, 0 to 731
Data columns (total 15 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Unnamed: 0.1  732 non-null    int64  
 1   Unnamed: 0    732 non-null    int64  
 2   Text          732 non-null    object 
 3   Sentiment     732 non-null    object 
 4   Timestamp     732 non-null    object 
 5   User          732 non-null    object 
 6   Platform      732 non-null    object 
 7   Hashtags      732 non-null    object 
 8   Retweets      732 non-null    float64
 9   Likes         732 non-null    float64
 10  Country       732 non-null    object 
 11  Year          732 non-null    int64  
 12  Month         732 non-null    int64  
 13  Day           732 non-null    int64  
 14  Hour          732 non-null    int64  
dtypes: float64(2), int64(6), object(7)
memory usage: 85.9+ KB


Sentiment,count
Positive,44
Joy,42
Excitement,32
Happy,14
Neutral,14
...,...
Vibrancy,1
Culinary Adventure,1
Mesmerizing,1
Thrilling Journey,1


In [7]:
import re

def clean_text(text):
    text = str(text).lower()
    text = re.sub(r"http\S+", "", text)
    text = re.sub(r"@\w+", "", text)
    text = re.sub(r"#", "", text)
    text = re.sub(r"[^a-zA-Z\s]", "", text)
    return text.strip()

df['clean_text'] = df['Text'].apply(clean_text)
df[['Text', 'clean_text']].head()

Unnamed: 0,Text,clean_text
0,Enjoying a beautiful day at the park! ...,enjoying a beautiful day at the park
1,Traffic was terrible this morning. ...,traffic was terrible this morning
2,Just finished an amazing workout! 💪 ...,just finished an amazing workout
3,Excited about the upcoming weekend getaway! ...,excited about the upcoming weekend getaway
4,Trying out a new recipe for dinner tonight. ...,trying out a new recipe for dinner tonight


In [8]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def preprocess(text):
    tokens = text.split()
    tokens = [stemmer.stem(word) for word in tokens if word not in stop_words]
    return " ".join(tokens)

df['processed_text'] = df['clean_text'].apply(preprocess)
df[['clean_text', 'processed_text']].head()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Unnamed: 0,clean_text,processed_text
0,enjoying a beautiful day at the park,enjoy beauti day park
1,traffic was terrible this morning,traffic terribl morn
2,just finished an amazing workout,finish amaz workout
3,excited about the upcoming weekend getaway,excit upcom weekend getaway
4,trying out a new recipe for dinner tonight,tri new recip dinner tonight


In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(df['processed_text'])

In [13]:
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
y = encoder.fit_transform(df['Sentiment'])
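
The classification reports further down list some labels twice (for example `Admiration` and `Awe`) with different padding, which suggests that the raw `Sentiment` strings carry inconsistent surrounding whitespace. Below is a minimal sketch of normalizing labels before encoding, assuming whitespace is the only difference. `normalize_label` is a hypothetical helper, not part of the notebook, and the toy labels are illustrative.

```python
from collections import Counter


def normalize_label(label):
    """Collapse runs of whitespace and trim both ends of a label (hypothetical helper)."""
    return " ".join(str(label).split())


# Toy labels standing in for df['Sentiment']; the padding is made up for illustration.
raw_labels = [" Admiration ", " Admiration    ", " Awe ", " Awe          ", " Bad "]

print(len(Counter(raw_labels)))                              # distinct raw labels
print(len(Counter(normalize_label(l) for l in raw_labels)))  # distinct after cleanup
```

In the notebook this could be applied as `df['Sentiment'].apply(normalize_label)` before `encoder.fit_transform`. The outputs shown in this notebook were produced without it.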

In [14]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
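
With 732 rows spread over 279 labels, many labels occur only once, so a random 80/20 split can place a label's only example in the test set, where the model has never seen it. Below is a small sketch for measuring this. `unseen_test_labels` is a hypothetical helper, and the toy lists stand in for `y_train` and `y_test`.

```python
def unseen_test_labels(y_train, y_test):
    """Return the sorted labels that occur in y_test but never in y_train."""
    return sorted(set(y_test) - set(y_train))


# Toy encoded labels; in the notebook these would be y_train and y_test.
toy_train = [0, 0, 1, 2, 2, 2]
toy_test = [0, 3, 4]

missing = unseen_test_labels(toy_train, toy_test)
print(f"{len(missing)} test labels never appear in training: {missing}")
```

No resampling of the training data can recover such labels, because the classifier can only predict classes that are present in `y_train`.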

In [15]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)

In [17]:
from sklearn.metrics import classification_report, confusion_matrix
import numpy as np

y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred))

# Get the unique labels present in the test set
unique_labels_test = np.unique(y_test)

# Use these unique labels to get the corresponding target names
target_names_test = encoder.inverse_transform(unique_labels_test)

print(classification_report(y_test, y_pred, target_names=target_names_test, labels=unique_labels_test))

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
                        precision    recall  f1-score   support

         Acceptance          0.00      0.00      0.00         2
           Admiration        0.00      0.00      0.00         1
        Admiration           0.00      0.00      0.00         1
         Affection           0.00      0.00      0.00         1
      Ambivalence            0.00      0.00      0.00         1
         Anger               0.00      0.00      0.00         1
        Anticipation         0.00      0.00      0.00         1
        Arousal              0.00      0.00      0.00         3
                  Awe        0.00      0.00      0.00         1
         Awe                 0.00      0.00      0.00         1
                  Bad        0.00      0.00      0.00         1
             Betrayal        0.00      0.00      0.00         2
        Betrayal             0.00      0.00    

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


In [None]:
def predict_sentiment(text):
    clean = clean_text(text)
    processed = preprocess(clean)
    vector = vectorizer.transform([processed])
    prediction = model.predict(vector)
    return encoder.inverse_transform(prediction)[0]

# Example
predict_sentiment("I love this product!")

# Task
Explain the error in the provided Python code for training a Logistic Regression model on sentiment data, which is likely related to class imbalance, and fix the code to address this issue.

## Identify minority classes

### Subtask:
Determine which sentiment classes have a significantly lower number of samples compared to others.


**Reasoning**:
Analyze the sentiment value counts to identify minority classes.



In [18]:
sentiment_counts = df['Sentiment'].value_counts()
print("Sentiment class distribution:\n", sentiment_counts)

Sentiment class distribution:
 Sentiment
Positive               44
Joy                    42
Excitement             32
Happy                  14
Neutral                14
                       ..
Vibrancy                1
Culinary Adventure      1
Mesmerizing             1
Thrilling Journey       1
Winter Magic            1
Name: count, Length: 279, dtype: int64
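
The subtask above can be made concrete by counting the labels that fall under a sample threshold. The sketch below assumes that "minority" means fewer than `min_count` samples. Both the threshold and the helper name `rare_classes` are illustrative choices.

```python
from collections import Counter


def rare_classes(labels, min_count=5):
    """Map each label with fewer than min_count samples to its count."""
    counts = Counter(labels)
    return {label: n for label, n in counts.items() if n < min_count}


# Toy labels; in the notebook this would be df['Sentiment'].
toy_labels = ["Positive"] * 6 + ["Joy"] * 5 + ["Vibrancy", "Mesmerizing"]

rare = rare_classes(toy_labels, min_count=5)
print(f"{len(rare)} of {len(set(toy_labels))} classes have fewer than 5 samples")
```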


## Address class imbalance

### Subtask:
Implement techniques to balance the dataset, such as oversampling minority classes or undersampling majority classes.


**Reasoning**:
Implement random oversampling to balance the training data by importing the necessary class, instantiating it, and then applying it to the training data.



In [19]:
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(random_state=42)
X_train_resampled, y_train_resampled = ros.fit_resample(X_train, y_train)
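
For intuition, random oversampling duplicates randomly chosen rows of each smaller class until every class matches the largest one. The following is a pure-Python sketch of that idea, not imblearn's implementation. `random_oversample` is a hypothetical name.

```python
import random
from collections import Counter, defaultdict


def random_oversample(samples, labels, seed=42):
    """Duplicate random samples of each class until all classes match the largest."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    target = max(len(group) for group in by_class.values())
    out_samples, out_labels = [], []
    for label, group in by_class.items():
        # Keep the originals, then draw the missing copies with replacement.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_samples.extend(group + extra)
        out_labels.extend([label] * target)
    return out_samples, out_labels


texts = ["good", "great", "nice", "bad", "meh"]
labels = ["pos", "pos", "pos", "neg", "neu"]
new_texts, new_labels = random_oversample(texts, labels)
print(Counter(new_labels))
```

A class with a single example is simply repeated, so the model gains no new wording to learn from, and labels absent from the training split stay absent.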

## Retrain the model

### Subtask:
Train the Logistic Regression model on the balanced dataset.


**Reasoning**:
Train the Logistic Regression model on the resampled training data.



In [20]:
model_resampled = LogisticRegression()
model_resampled.fit(X_train_resampled, y_train_resampled)

## Evaluate the model

### Subtask:
Evaluate the performance of the retrained model using appropriate metrics like precision, recall, and F1-score, paying close attention to the performance on minority classes.


**Reasoning**:
Evaluate the performance of the retrained model using appropriate metrics.



In [21]:
y_pred_resampled = model_resampled.predict(X_test)

print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_resampled))

unique_labels_test = np.unique(y_test)
target_names_test = encoder.inverse_transform(unique_labels_test)

print("\nClassification Report:")
print(classification_report(y_test, y_pred_resampled, target_names=target_names_test, labels=unique_labels_test))

Confusion Matrix:
[[0 2 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]

Classification Report:
                        precision    recall  f1-score   support

         Acceptance          0.00      0.00      0.00         2
           Admiration        0.00      0.00      0.00         1
        Admiration           0.00      0.00      0.00         1
         Affection           1.00      1.00      1.00         1
      Ambivalence            1.00      1.00      1.00         1
         Anger               0.00      0.00      0.00         1
        Anticipation         0.00      0.00      0.00         1
        Arousal              1.00      1.00      1.00         3
                  Awe        0.00      0.00      0.00         1
         Awe                 0.00      0.00      0.00         1
                  Bad        1.00      1.00      1.00         1
             Betrayal        0.00      0.00      0.00         2
     

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


## Summary:

### Data Analysis Key Findings

*   The initial sentiment class distribution revealed a severe class imbalance, with 279 unique sentiment classes, many having extremely low counts (1, 2, or 3 samples) compared to more frequent classes (in the 30s and 40s).
*   Random oversampling was applied to the training data to address the class imbalance.
*   A Logistic Regression model was trained on the resampled training data.
*   Evaluation on the test set with a confusion matrix and classification report showed that the retrained model still struggled with minority classes. In the visible part of the report, a few classes (Affection, Ambivalence, Arousal, Bad) reached precision, recall, and F1-scores of 1.00 after oversampling, while most others remained at 0.00.
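
The 0.00 rows follow directly from how the per-class metrics are defined: a class that is never predicted correctly has zero true positives, so its precision, recall, and F1 are all zero. Below is a minimal sketch of the computation, treating undefined ratios as 0.0 (the convention assumed here). `per_class_prf` is a hypothetical helper, and the toy labels are illustrative.

```python
def per_class_prf(y_true, y_pred, label):
    """Precision, recall, and F1 for one label; undefined ratios are reported as 0.0."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


y_true = ["Awe", "Bad", "Bad", "Joy"]
y_pred = ["Joy", "Bad", "Bad", "Joy"]

for label in ("Awe", "Bad", "Joy"):
    print(label, per_class_prf(y_true, y_pred, label))
```

Here `Awe` is never predicted, so its precision is undefined and falls back to 0.0. scikit-learn applies the same fallback by default and emits a warning when it does.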

### Insights or Next Steps

*   Simple random oversampling was insufficient to enable the Logistic Regression model to effectively learn and predict minority sentiment classes in this highly imbalanced dataset.
*   Further investigation into more advanced class imbalance techniques (e.g., SMOTE, ADASYN, or different ensemble methods) or exploring alternative model architectures better suited for imbalanced data is recommended to improve performance on minority classes.
