*   **Title:** `Data Preprocessing for Sentiment Analysis`
    *   **Objective:** This notebook loads a raw sentiment dataset, checks for and handles missing values, encodes categorical columns, scales numerical features, and splits the data into training and test sets for machine learning.

In [1]:
# Import the pandas library for data manipulation
import pandas as pd

# Load the dataset from the uploaded file
df = pd.read_csv('3) Sentiment dataset.csv')

# Display the first 5 rows to see what the data looks like
print("Original Data Head:")
print(df.head())

# Display a summary of the dataframe to see data types and non-null counts
print("\nData Info:")
df.info()

Original Data Head:
   Unnamed: 0.1  Unnamed: 0  \
0             0           0   
1             1           1   
2             2           2   
3             3           3   
4             4           4   

                                                Text    Sentiment  \
0   Enjoying a beautiful day at the park!        ...   Positive     
1   Traffic was terrible this morning.           ...   Negative     
2   Just finished an amazing workout! 💪          ...   Positive     
3   Excited about the upcoming weekend getaway!  ...   Positive     
4   Trying out a new recipe for dinner tonight.  ...   Neutral      

             Timestamp            User     Platform  \
0  2023-01-15 12:30:00   User123          Twitter     
1  2023-01-15 08:45:00   CommuterX        Twitter     
2  2023-01-15 15:45:00   FitnessFan      Instagram    
3  2023-01-15 18:20:00   AdventureX       Facebook    
4  2023-01-15 19:55:00   ChefCook        Instagram    

                                     Hashtags  

*   **Headline:** `Handling Missing Data`
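    *   **Explanation:** First, I checked every column for missing values. Any gaps in 'Retweets' and 'Likes' are filled with the column median, which is less sensitive to outliers than the mean, and rows with a missing 'Hashtags' or 'Country' value are dropped. As the output shows, this dataset has no missing values in any column, so these steps leave the data unchanged.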

In [2]:
# Check for the total number of missing values in each column
print("Missing values before handling:")
print(df.isnull().sum())

# For numerical columns like 'Retweets' and 'Likes', we'll fill missing values with the median.
# The median is less sensitive to outliers than the mean.
median_retweets = df['Retweets'].median()
df['Retweets'] = df['Retweets'].fillna(median_retweets)

median_likes = df['Likes'].median()
df['Likes'] = df['Likes'].fillna(median_likes)

# For the 'Hashtags' and 'Country' columns, there is no natural fill value.
# For simplicity in this task, we drop any rows where these are missing.
df = df.dropna(subset=['Hashtags', 'Country'])

# Verify that the missing values have been handled
print("\nMissing values after handling:")
print(df.isnull().sum())

Missing values before handling:
Unnamed: 0.1    0
Unnamed: 0      0
Text            0
Sentiment       0
Timestamp       0
User            0
Platform        0
Hashtags        0
Retweets        0
Likes           0
Country         0
Year            0
Month           0
Day             0
Hour            0
dtype: int64

Missing values after handling:
Unnamed: 0.1    0
Unnamed: 0      0
Text            0
Sentiment       0
Timestamp       0
User            0
Platform        0
Hashtags        0
Retweets        0
Likes           0
Country         0
Year            0
Month           0
Day             0
Hour            0
dtype: int64
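
Because the dataset itself has no gaps, the effect of median imputation is not visible in the output above. The following is a minimal sketch on a small made-up frame (the values are hypothetical, not taken from the dataset) showing why the median is a safer fill value than the mean when a column contains an outlier.

```python
import pandas as pd

# Toy data (hypothetical values): one extreme outlier and one missing entry.
toy = pd.DataFrame({"Likes": [10.0, 12.0, 11.0, None, 1000.0]})

mean_likes = toy["Likes"].mean()      # pulled upward by the outlier
median_likes = toy["Likes"].median()  # stays near the typical values

# Fill the gap with the median, mirroring the approach in the cell above.
filled = toy["Likes"].fillna(median_likes)

print(f"mean={mean_likes}, median={median_likes}")  # mean=258.25, median=11.5
print(filled.tolist())
```

Both statistics skip missing entries by default, so the fill value is computed from the four observed numbers only.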


*   **Headline:** `Encoding Categorical Data`
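    *   **Explanation:** Most machine learning models require numerical input, so I converted the text-based 'Platform' and 'Sentiment' columns into indicator columns using one-hot encoding with pandas' `get_dummies` function. With `drop_first=True`, the first category of each column is omitted and acts as the implicit baseline.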

In [3]:
# Select the categorical columns we want to encode.
# We'll choose 'Platform' and 'Sentiment' as examples.
categorical_cols = ['Platform', 'Sentiment']

# Use pandas get_dummies() to perform one-hot encoding
df_encoded = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

# Display the first few rows to see the new columns.
# Each new column is named '<column>_<category>', using the category text verbatim
# (including any surrounding whitespace); drop_first=True omits the first category.
print("Data after One-Hot Encoding:")
print(df_encoded.head())

Data after One-Hot Encoding:
   Unnamed: 0.1  Unnamed: 0  \
0             0           0   
1             1           1   
2             2           2   
3             3           3   
4             4           4   

                                                Text            Timestamp  \
0   Enjoying a beautiful day at the park!        ...  2023-01-15 12:30:00   
1   Traffic was terrible this morning.           ...  2023-01-15 08:45:00   
2   Just finished an amazing workout! 💪          ...  2023-01-15 15:45:00   
3   Excited about the upcoming weekend getaway!  ...  2023-01-15 18:20:00   
4   Trying out a new recipe for dinner tonight.  ...  2023-01-15 19:55:00   

             User                                    Hashtags  Retweets  \
0   User123         #Nature #Park                                  15.0   
1   CommuterX       #Traffic #Morning                               5.0   
2   FitnessFan      #Fitness #Workout                              20.0   
3   AdventureX      #
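
The printed values in this dataset appear to carry padding around the labels, and `get_dummies` treats each distinct string as its own category. As a hedged sketch on made-up data (the padded labels below are hypothetical), stripping whitespace before encoding keeps identical labels in a single column:

```python
import pandas as pd

# Toy data (hypothetical): the same labels with inconsistent padding.
toy = pd.DataFrame({
    "Platform": [" Twitter  ", " Twitter ", " Instagram ", " Facebook "],
    "Sentiment": [" Positive  ", " Negative  ", " Positive ", " Neutral "],
})

cols = ["Platform", "Sentiment"]

# Without cleaning, each padded variant becomes its own category.
raw_dummies = pd.get_dummies(toy, columns=cols)

# Strip surrounding whitespace first so identical labels share one column.
clean = toy.copy()
for col in cols:
    clean[col] = clean[col].str.strip()
clean_dummies = pd.get_dummies(clean, columns=cols, drop_first=True)

print(len(raw_dummies.columns), "columns without stripping")
print(list(clean_dummies.columns))
```

After stripping, the column names are also free of stray spaces, which makes later lookups by name more predictable.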

*   **Headline:** `Scaling Numerical Features`
    *   **Explanation:** Distance-based algorithms like K-Nearest Neighbors are sensitive to feature scale: a feature with a large value range (like 'Likes') can dominate the distance calculation and disproportionately influence predictions. To prevent this, I used scikit-learn's `StandardScaler` to standardize the numerical features so that each has a mean of 0 and a standard deviation of 1 and contributes on a comparable scale.

In [4]:
from sklearn.preprocessing import StandardScaler

# Initialize the scaler
scaler = StandardScaler()

# Select the numerical columns to scale
numerical_features = ['Retweets', 'Likes', 'Year', 'Month', 'Day', 'Hour']

# Apply the scaler to our numerical features
df_encoded[numerical_features] = scaler.fit_transform(df_encoded[numerical_features])

print("Data after scaling numerical features:")
print(df_encoded.head())

Data after scaling numerical features:
   Unnamed: 0.1  Unnamed: 0  \
0             0           0   
1             1           1   
2             2           2   
3             3           3   
4             4           4   

                                                Text            Timestamp  \
0   Enjoying a beautiful day at the park!        ...  2023-01-15 12:30:00   
1   Traffic was terrible this morning.           ...  2023-01-15 08:45:00   
2   Just finished an amazing workout! 💪          ...  2023-01-15 15:45:00   
3   Excited about the upcoming weekend getaway!  ...  2023-01-15 18:20:00   
4   Trying out a new recipe for dinner tonight.  ...  2023-01-15 19:55:00   

             User                                    Hashtags  Retweets  \
0   User123         #Nature #Park                             -0.922303   
1   CommuterX       #Traffic #Morning                         -2.339444   
2   FitnessFan      #Fitness #Workout                         -0.213733   
3   Adventu
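
The scaling cell fits the scaler on the full dataset. A common alternative, sketched here on synthetic data (the arrays and their distributions are assumptions for illustration), is to fit the scaler on the training rows only and reuse those statistics on the test rows, so that test-set information does not influence the transformation.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy numeric features (hypothetical): two columns on very different scales.
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(50, 10, 100), rng.normal(5000, 2000, 100)])

X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training rows only
X_test_scaled = scaler.transform(X_test)        # reuse those statistics on the test rows

print(X_train_scaled.mean(axis=0).round(6))  # approximately 0 for each column
print(X_train_scaled.std(axis=0).round(6))   # approximately 1 for each column
```

The scaled test rows will generally not have a mean of exactly 0 or a standard deviation of exactly 1, which is expected.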

*   **Headline:** `Splitting the Dataset into Training and Testing Sets`
    *   **Explanation:** The final step of preprocessing is to split the data. I allocated 80% of the data for training the model (X_train, y_train) and reserved the remaining 20% as a 'final exam' for testing (X_test, y_test). This ensures that we can evaluate the model's performance on unseen data to get an unbiased measure of its effectiveness.
  

In [5]:
from sklearn.model_selection import train_test_split

# First, define the features (X) and the target (y).
# The goal is to predict whether a post's sentiment is positive, so the target is the
# one-hot 'Positive' sentiment column created in the encoding step.
# Its exact name depends on the raw label text, so it is looked up below.
# Columns that are not useful for prediction (raw text, user IDs, etc.) are dropped.

# Clean up column names by removing leading and trailing spaces from sentiment columns
df_encoded.columns = [col.strip() if col.startswith('Sentiment_') else col for col in df_encoded.columns]

# Print column names for debugging after stripping
print("Columns in df_encoded after stripping:")
print(df_encoded.columns)

# Find the exact column name for 'Positive' sentiment after encoding and stripping
positive_sentiment_col = None
for col in df_encoded.columns:
    if col.startswith('Sentiment_') and 'Positive' in col:
        positive_sentiment_col = col
        break

if positive_sentiment_col is None:
    raise ValueError("Could not find the 'Positive' sentiment column after encoding and stripping.")

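# Drop irrelevant columns and all sentiment columns except the target variable
other_sentiment_cols = [
    col for col in df_encoded.columns
    if col.startswith('Sentiment_') and col != positive_sentiment_col
]
irrelevant_cols = ['Unnamed: 0.1', 'Unnamed: 0', 'Text', 'Timestamp', 'User', 'Hashtags', 'Country']
columns_to_drop = other_sentiment_cols + irrelevant_cols
X = df_encoded.drop(columns=columns_to_drop)
# Note: X still contains positive_sentiment_col at this point, and if stripping left
# duplicate column names, the selection below returns every column with that name.
y = df_encoded[positive_sentiment_col]  # Target: predicting positive sentiment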

# Split the data into 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Print the shapes of the new datasets to confirm the split
print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

Columns in df_encoded after stripping:
Index(['Unnamed: 0.1', 'Unnamed: 0', 'Text', 'Timestamp', 'User', 'Hashtags',
       'Retweets', 'Likes', 'Country', 'Year',
       ...
       'Sentiment_ Vibrancy', 'Sentiment_ Whimsy',
       'Sentiment_ Whispers of the Past', 'Sentiment_ Winter Magic',
       'Sentiment_ Wonder', 'Sentiment_ Wonder', 'Sentiment_ Wonder',
       'Sentiment_ Wonderment', 'Sentiment_ Yearning', 'Sentiment_ Zest'],
      dtype='object', length=294)
Shape of X_train: (585, 11)
Shape of X_test: (147, 11)
Shape of y_train: (585, 2)
Shape of y_test: (147, 2)
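
The reported shapes show `y` with two columns. This suggests that stripping left more than one column with the same 'Positive' name, and that those columns also remain in `X`. One way to avoid both issues, sketched here on made-up data (the column values are hypothetical), is to build a single binary target from the cleaned label and keep it out of the features:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): labels with inconsistent padding.
toy = pd.DataFrame({
    "Likes": [30, 10, 40, 25, 5, 35, 15, 45, 20, 50],
    "Sentiment": [" Positive  ", " Negative ", " Positive ", " Neutral  ", " Negative  ",
                  " Positive ", " Neutral ", " Positive  ", " Negative ", " Positive "],
})

# Build a single binary target from the cleaned label (1 = positive, 0 = anything else).
y = (toy["Sentiment"].str.strip() == "Positive").astype(int)

# Keep the target and its source column out of the feature matrix.
X = toy.drop(columns=["Sentiment"])

# stratify=y keeps the positive/non-positive ratio similar in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)  # (8, 1) (2, 1)
print(y_train.shape, y_test.shape)  # (8,) (2,)
```

With this construction `y` is a one-dimensional Series, which is the shape most scikit-learn classifiers expect for a single binary target.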
