# Analyzing Political Discourse in the 2024 U.S. Presidential Election  
### Sentiment Analysis (VADER) and Topic Modeling (LDA) on Twitter Data
## Project Overview

This project examines political discourse surrounding the **2024 U.S. Presidential Election** using a large corpus of Twitter data.  
The analysis combines:

- **Sentiment Analysis** using VADER
- **Topic Modeling** using Latent Dirichlet Allocation (LDA)

The goal is to understand:

- How sentiment is distributed across candidates
- Where political negativity and positivity concentrate
- How discussion volume interacts with sentiment
- How sentiment varies across discourse themes



## 1. Dataset Description

The raw dataset consists of tweet-level data collected from Twitter and includes:

- Tweet text
- User identifiers
- Engagement metrics (likes, quotes)
- Timestamp information stored in epoch format

Due to dataset size and redistribution constraints, only preprocessing logic and
derived datasets are included in this repository.


In [None]:
import pandas as pd
import numpy as np
import re
import spacy


## 2. Data Loading and Initial Processing

The dataset is loaded from a CSV file.  
Timestamp information is converted from epoch format to a human-readable datetime format.
Only columns required for textual analysis and engagement-based interpretation are retained.


In [2]:
df = pd.read_csv('november_chunk_1.csv')

In [3]:
# Convert 'epoch' column to datetime
df['date'] = pd.to_datetime(df['epoch'], unit='s')

In [4]:

# Drop the old 'epoch' column if not needed
df.drop(columns=['epoch'], inplace=True)
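
One detail worth keeping in mind for any later time-based analysis: `pd.to_datetime(..., unit='s')` returns timezone-naive timestamps, which represent UTC. The sketch below uses a small made-up frame (not the real dataset; the `sample` name and the epoch values are illustrative assumptions) to show how the timezone could be made explicit with `utc=True`.

```python
import pandas as pd

# Hypothetical stand-in for the real data: two epoch timestamps in seconds
sample = pd.DataFrame({"epoch": [1732147199, 1732147197]})

# Naive conversion, as in the cell above (values are implicitly UTC)
sample["date"] = pd.to_datetime(sample["epoch"], unit="s")

# Timezone-aware alternative: same instants, explicitly labeled as UTC
sample["date_utc"] = pd.to_datetime(sample["epoch"], unit="s", utc=True)

print(sample.dtypes)
```

Whether the naive or the timezone-aware form is preferable depends on how the downstream notebooks compare or group timestamps.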

In [29]:
df.head()

Unnamed: 0,0,type,id,username,text,url,media,retweetedTweet,retweetedTweetID,retweetedUserID,...,links,viewCount,quotedTweet,in_reply_to_screen_name,in_reply_to_status_id_str,in_reply_to_user_id_str,location,cash_app_handle,user,date
0,,tweet-,1.859386e+18,SATraveler2,@MdBreathe continued- \nto develop a techniqu...,https://twitter.com/SATraveler2/status/1859386...,[],False,,,...,[{'display_url': 'childrenshealthdefense.org/d...,"{'count': '4', 'state': 'EnabledWithCount'}",False,MdBreathe,1.859234e+18,1.139217e+18,,,"{'id': 1280313070053502978, 'id_str': '1280313...",2024-11-20 23:59:59
1,,tweet-,1.859386e+18,JoeZumwalt,@UROCKlive1 all these banderite/nazi sycophant...,https://twitter.com/JoeZumwalt/status/18593862...,[],False,,,...,[],"{'count': '3', 'state': 'EnabledWithCount'}",False,UROCKlive1,1.859057e+18,87983040.0,,,"{'id': 1087502843102162945, 'id_str': '1087502...",2024-11-20 23:59:57
2,,tweet-,1.859386e+18,67sharona,Laken Rileys Death Would Of Never Ever Happene...,https://twitter.com/67sharona/status/185938622...,[],False,,,...,[],"{'count': '25', 'state': 'EnabledWithCount'}",False,,,,,,"{'id': 792933462256652288, 'id_str': '79293346...",2024-11-20 23:59:56
3,,tweet-,1.859386e+18,TruthAboutF,POV: you go downstairs in the middle of the ni...,https://twitter.com/TruthAboutF/status/1859386...,"[{'display_url': 'pic.x.com/tUK0JjrAHY', 'expa...",False,,,...,[],"{'count': '1476', 'state': 'EnabledWithCount'}",False,,,,,,"{'id': 1424726838144901122, 'id_str': '1424726...",2024-11-20 23:59:55
4,,tweet-,1.859386e+18,Hump98Clint,"@mtgreenee This is a good question, are you no...",https://twitter.com/Hump98Clint/status/1859386...,[],False,,,...,[],"{'count': '8', 'state': 'EnabledWithCount'}",False,mtgreenee,1.859042e+18,8.260652e+17,,,"{'id': 1942063262, 'id_str': '1942063262', 'ur...",2024-11-20 23:59:53


## 3. Feature Selection

To maintain analytical focus and reduce noise, the following features are retained:

- `username`: handle of the posting account
- `text`: raw tweet content
- `likeCount`, `quoteCount`: engagement indicators
- `date`: posting timestamp

All other metadata fields are excluded from further analysis.


In [35]:
# Retain only columns relevant for NLP analysis and engagement metadata.
# 'epoch' was dropped earlier, so the converted 'date' column is selected instead.
required_columns = [
    'username', 'text', 'likeCount', 'quoteCount', 'date'
]

df = df[required_columns].copy()
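
As a possible alternative for large chunks, the column selection could happen at load time so that unused metadata is never parsed. The following is a minimal sketch under stated assumptions: the in-memory CSV and the names `raw_csv`, `wanted`, and `subset` are hypothetical stand-ins, and only the `usecols` parameter of `pd.read_csv` is the point being illustrated.

```python
import io
import pandas as pd

# Tiny in-memory CSV standing in for the raw file (made-up rows)
raw_csv = io.StringIO(
    "id,username,text,likeCount,quoteCount,viewCount,epoch\n"
    "1,user_a,first tweet,3,0,10,1732147199\n"
    "2,user_b,second tweet,0,1,25,1732147197\n"
)

wanted = ["username", "text", "likeCount", "quoteCount", "epoch"]

# usecols makes pandas parse only the listed columns
subset = pd.read_csv(raw_csv, usecols=wanted)

# Convert the epoch seconds to datetime, then drop the raw column
subset["date"] = pd.to_datetime(subset["epoch"], unit="s")
subset = subset.drop(columns=["epoch"])

print(subset.columns.tolist())
```

This trades a little flexibility (dropped columns cannot be inspected later) for lower memory use during loading.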


In [38]:
df.head(10)

Unnamed: 0,username,text,likeCount,quoteCount,date
0,SATraveler2,@MdBreathe continued- \nto develop a techniqu...,0.0,0.0,2024-11-20 23:59:59
1,JoeZumwalt,@UROCKlive1 all these banderite/nazi sycophant...,0.0,0.0,2024-11-20 23:59:57
2,67sharona,Laken Rileys Death Would Of Never Ever Happene...,1.0,0.0,2024-11-20 23:59:56
3,TruthAboutF,POV: you go downstairs in the middle of the ni...,14.0,0.0,2024-11-20 23:59:55
4,Hump98Clint,"@mtgreenee This is a good question, are you no...",1.0,0.0,2024-11-20 23:59:53
5,BlakeBurman,The race is on for the next DNC chair. Martin ...,3.0,1.0,2024-11-20 23:59:51
6,Chip_quicksteps,@DefiyantlyFree Kamala Harris is a leech grift...,0.0,0.0,2024-11-20 23:59:50
7,Cynthia19941119,@BarronTrumpoo Kamala Harris is Vice President...,0.0,0.0,2024-11-20 23:59:49
8,G2638Garr,@MayorOfLA @POTUS A Pew study this year fall s...,0.0,0.0,2024-11-20 23:59:49
9,mascotmike12,@TheDemocrats Wtf are you doing Montana?,7.0,0.0,2024-11-20 23:59:45


In [39]:
df.shape

(50000, 5)

In [40]:

# Save the reduced DataFrame to CSV without the index
df.to_csv('election.csv', index=False)

print("✅ New dataset saved as 'election.csv' successfully!")



✅ New dataset saved as 'election.csv' successfully!


## 4. Output Files

The following processed dataset is generated:

- `election.csv`  
  Contains the retained metadata (username, engagement counts, posting date) and the raw tweet text.

This file is the input for the subsequent sentiment analysis and topic modeling notebooks.
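
Because CSV stores timestamps as plain strings, a downstream notebook would need to restore the datetime type when reading the file back. A minimal sketch, assuming the saved file has the five columns shown earlier; the in-memory rows and the names `saved` and `tweets` are made up for illustration:

```python
import io
import pandas as pd

# In-memory stand-in for election.csv (made-up rows, same column layout)
saved = io.StringIO(
    "username,text,likeCount,quoteCount,date\n"
    "user_a,first tweet,3.0,0.0,2024-11-20 23:59:59\n"
    "user_b,second tweet,0.0,1.0,2024-11-20 23:59:57\n"
)

# parse_dates restores the datetime dtype that CSV serialization drops
tweets = pd.read_csv(saved, parse_dates=["date"])

print(tweets.dtypes)
```

With the real file, the `io.StringIO` object would simply be replaced by the path `'election.csv'`.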
