# Twitter Political Affiliation Classification 

# Names
- Cameron Faulkner
- Nikhil Hegde
- Qianxi Gong
- Atul Nair

# Abstract 

##### Problem Statement
Twitter has become a popular platform for political discourse, where people express their views on a variety of issues. Given the increasing polarization in the political landscape, it is of great interest to understand how people's political views are reflected in their online behavior. In this project, we propose to investigate whether it is possible to accurately classify Twitter users' political affiliation based on their tweets using a supervised learning approach.

##### Dataset
To train our predictive model, we scraped every Tweet posted by members of the United States Congress who had a Twitter account during the 117th Congress, which ran from January 3rd, 2021 to January 3rd, 2023. In total, 530 of the 535 members of this Congress had Twitter accounts.

We are aware that this small collection of users could bias the data. If we were to work with a random sample of Twitter users instead, it would be difficult to manually determine, for hundreds of accounts, whether each user belongs to an American political party or holds liberal or conservative viewpoints. Additionally, filtering out bot accounts poses another difficult challenge.

Gathering data from the verified Twitter accounts of members of Congress lets us identify political leanings easily. We also expect our findings to extend to the wider population, since these members of Congress are the public faces of their parties.

Using the dataset of Congressional Twitter handles, we used the snscrape Python package to scrape every Tweet each member posted in our specified time frame. This amounted to 786,268 Tweets (clearly, members of Congress are prolific Tweeters). The median number of Tweets sent by a member of Congress over this span was 1336, corresponding to approximately 1.83 Tweets per day.
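
The per-day figure follows from the length of the collection window. A quick check, using only the numbers quoted above (the variable names are ours, chosen for illustration):

```python
from datetime import date

# Collection window: the 117th Congress, as stated above.
start = date(2021, 1, 3)
end = date(2023, 1, 3)
days_in_window = (end - start).days

# Median number of Tweets per member over the window, as reported above.
median_tweets = 1336

tweets_per_day = median_tweets / days_in_window
print(f"{days_in_window} days, {tweets_per_day:.2f} Tweets per day")
```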

For each Tweet in our dataset, we kept the following variables and eliminated some we found irrelevant:

KEPT VARIABLES:

- rawContent
- replyCount
- retweetCount
- likeCount
- quoteCount
- links
- Media
- mentionedUsers
- Hashtags
- Party (we added this while scraping by assigning each member of Congress's party identification to their Tweets)

We elected to drop the following variables as we found them to be redundant or of little utility to our project:

DROPPED VARIABLES:

- Url
- Date
- Id
- User
- renderedContent
- Coordinates
- place
- Card
- viewCount
- Vibe



However, these variables were only a starting point for our analysis; some of them had to be transformed before they were usable in our models.


##### Preprocessing and Addition of Features
TREATMENT OF VARIABLES TAKEN FROM TWITTER SCRAPING:

While all of our variables with the word “Count” in them were already in numerical format, we needed to transform the others into numerical data.

For the variables “links”, “Media”, “mentionedUsers”, and “Hashtags”, we weren’t interested in their actual content, but rather in how many of them were present in a given Tweet (e.g. the number of links placed in a single Tweet). As such, for every observation we replaced each of these variables with the length of its list of items, yielding integer counts of 0 or more.
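
A minimal sketch of this list-to-count transformation with pandas. The toy rows are made up, and we assume (it is not confirmed above) that a field with no items arrives as either `None` or an empty list:

```python
import pandas as pd

# Toy rows standing in for scraped Tweets; None marks a field with no items.
tweets = pd.DataFrame(
    {
        "links": [["https://example.com"], None, []],
        "Media": [None, ["photo1", "photo2"], None],
        "mentionedUsers": [["userA", "userB"], None, ["userC"]],
        "Hashtags": [["a"], ["b", "c", "d"], None],
    }
)

LIST_COLUMNS = ["links", "Media", "mentionedUsers", "Hashtags"]


def count_items(value):
    """Return the number of items in a list-like field, treating missing as 0."""
    return len(value) if isinstance(value, (list, tuple)) else 0


# Replace each list-valued column with the count of items it held.
for column in LIST_COLUMNS:
    tweets[column] = tweets[column].apply(count_items)

print(tweets)
```

Treating a missing field as a count of 0 keeps every column numeric without dropping any rows.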

For “Party” we changed our initial classification of “R” or “D” into 1 and 0, respectively, since text categories cannot be fed to a model directly. The party label is the target that our classifier will predict.
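
The label encoding can be sketched as a dictionary lookup. The mini-frame below is hypothetical; only the "R" to 1 and "D" to 0 mapping comes from the text above:

```python
import pandas as pd

# Hypothetical mini-frame; the real one has one row per Tweet.
tweets = pd.DataFrame({"Party": ["R", "D", "D", "R"]})

PARTY_TO_LABEL = {"R": 1, "D": 0}
tweets["Party"] = tweets["Party"].map(PARTY_TO_LABEL)

# map() leaves NaN for any value missing from the dictionary, so this
# check fails loudly if an unexpected party code slips through.
assert tweets["Party"].notna().all()
tweets["Party"] = tweets["Party"].astype(int)

print(tweets["Party"].tolist())  # [1, 0, 0, 1]
```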

As a fortunate result of using this particular scraper, all of our data was well formatted (no strings were in fields reserved for floats, for example) and, after doing our best practice checks on our data, we found no observations needed to be thrown out.


ADDITION OF DERIVED VARIABLES:

Sentiment Analysis:

In addition to the variables provided by our scraping, we conducted sentiment analysis on the text content of each of our Tweets with the theory that Republicans and Democrats may differ in the tone of their communication with the public. 

To apply our sentiment analysis, we took the rawContent variable for each of our Tweets (literally the text of the Tweet) and ran the VADER sentiment analysis algorithm on it. VADER is designed specifically for classifying the sentiment of social media posts, an important factor to consider given how different a Tweet is from a novel. We used this model to classify the sentiment of each Tweet as “positive”, “negative”, or “neutral” based on the scores provided by the algorithm and the thresholds recommended by the developers of the sentiment analysis package.
	
Again, we ran into the problem of using categorical data in an ML algorithm and so we used one-hot encoding to create three separate binary variables corresponding to each of the three sentiments. 
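
A sketch of the labeling and one-hot steps. It assumes the `vaderSentiment` package and the compound-score cutoffs of ±0.05 suggested in that package's README; the helper names and the example scores are ours, not taken from the project code:

```python
import pandas as pd

# Cutoffs on VADER's compound score, as suggested in the vaderSentiment README.
POSITIVE_CUTOFF = 0.05
NEGATIVE_CUTOFF = -0.05


def label_from_compound(compound):
    """Map a compound score in [-1, 1] to one of three sentiment labels."""
    if compound >= POSITIVE_CUTOFF:
        return "positive"
    if compound <= NEGATIVE_CUTOFF:
        return "negative"
    return "neutral"


def score_texts(texts):
    """Return VADER compound scores (requires `pip install vaderSentiment`)."""
    # Imported lazily so the helpers above work without the package installed.
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()
    return [analyzer.polarity_scores(text)["compound"] for text in texts]


# Illustrative compound scores; in practice these would come from score_texts().
tweets = pd.DataFrame({"compound": [0.64, -0.40, 0.00]})
tweets["sentiment"] = tweets["compound"].apply(label_from_compound)

# One binary column per label; reindex so all three columns exist even when
# a label never occurs in the data.
one_hot = (
    pd.get_dummies(tweets["sentiment"])
    .reindex(columns=["positive", "negative", "neutral"], fill_value=0)
    .astype(int)
)
tweets = pd.concat([tweets, one_hot], axis=1)
print(tweets)
```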


Word Frequency and Importance:

We also decided to take further advantage of the rich text content of our dataset by including features that capture the frequency and distinctiveness of the words in each Tweet, under the hypothesis that Republicans and Democrats each use certain words more frequently than the other party.

To do so, we used the standard Term Frequency-Inverse Document Frequency (TF-IDF) measure to quantify the significance of the words used in our Tweets. Because applying TF-IDF to unprocessed text can yield falsely significant results, we preprocessed the data first. We removed stop words (words that occur commonly and offer no value to our classification) using the Natural Language Toolkit’s (nltk) premade stopwords corpus, and applied nltk’s lemmatizer, which reduces inflected words to their base form so that variants of the same word are not counted as separate terms and do not dilute each other's significance.
We then vectorized our processed Tweets with sklearn’s TfidfVectorizer and extracted 4500 features from our dataset. The resulting matrix has one row per Tweet and one column per feature, holding the TF-IDF score of each of the 4500 features for every Tweet. This matrix, added to our 11 other variables, formed our completed dataset.
	
COMPLETED DATASET:

In sum, the dataset we will use to train our model has shape (786268, 4513). For each Tweet, it includes sentiment features, engagement metrics from Twitter users, word-importance scores, and the party of the Member of Congress who posted the Tweet, which serves as our classification label.


##### Model Development 
!!! TO BE UPDATED !!!
We will compare the performance of different algorithms, such as logistic regression, decision trees, support vector machines, and multinomial naive Bayes, to determine which algorithm is best suited for the task. We will also explore the use of ensemble methods, such as random forests and gradient boosting, to improve the accuracy of our model.
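
One possible shape for this comparison is sketched below with scikit-learn. The corpus is a made-up toy with placeholder words, `LinearSVC` is our assumed choice of support vector machine, and the macro-averaged F1 scoring is likewise an assumption rather than a decision already made in this project:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Toy corpus of placeholder words; labels mirror the 1/0 party encoding.
texts = [
    "alpha bravo shared", "alpha charlie shared", "bravo charlie alpha",
    "alpha alpha bravo", "charlie bravo shared", "alpha charlie charlie",
    "delta echo shared", "delta foxtrot shared", "echo foxtrot delta",
    "delta delta echo", "foxtrot echo shared", "delta foxtrot foxtrot",
]
labels = [1] * 6 + [0] * 6

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "linear SVM": LinearSVC(),
    "multinomial naive Bayes": MultinomialNB(),
}

results = {}
for name, model in candidates.items():
    # Vectorizing inside the pipeline refits TF-IDF on each training fold,
    # so no information from the held-out fold leaks into the features.
    pipeline = make_pipeline(TfidfVectorizer(), model)
    scores = cross_val_score(pipeline, texts, labels, cv=3, scoring="f1_macro")
    results[name] = scores.mean()

for name, score in results.items():
    print(f"{name}: mean macro F1 = {score:.2f}")
```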

##### Model Evaluation
!!! TO BE UPDATED !!!
We will evaluate the performance of our model using a variety of metrics, such as accuracy, precision, recall, and F1-score. We will also use techniques such as cross-validation to ensure that our model is not overfitting the training data. 
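
For reference, the standard definitions of these metrics for one class, written in terms of true positives ($TP$), false positives ($FP$), true negatives ($TN$), and false negatives ($FN$):

$$
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}, \qquad
\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
$$

Precision and recall are computed separately for each party label, which is why a single accuracy figure can hide an imbalance between the two classes.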


# Background
!!! OLD BACKGROUND, NEED CITATIONS !!!

Previous research has demonstrated that sentiment analysis of Members of Parliament's (MPs') tweets can be used to predict their popularity. In particular, studies have shown that negative sentiment is a key factor influencing the number of retweets MPs receive <a name="lorenz"></a>[<sup>[1]</sup>](#lorenznote). These findings suggest that analyzing the words and language used by MPs in their tweets can provide insights into their strategies for gaining popularity on social media platforms.

Given our real-world political knowledge, we suspect that the use of certain words and language patterns may have different impacts on the popularity of tweets from MPs of different parties. To explore this further, we are interested in investigating whether the distribution of language use and popularity among different parties can be modeled using a mixture of supervised and unsupervised methods. By examining the use of specific words and phrases across tweets from different political parties, it is possible to score a tweet from an MP given their party, by matching it against language patterns typical of that party, and to model how the public will react to it <a name="admonish"></a>[<sup>[2]</sup>](#admonishnote).

# Problem Statement

Clearly describe the problem that you are solving. Avoid ambiguous words. The problem described should be well defined and should have at least one ML-relevant potential solution. Additionally, describe the problem thoroughly such that it is clear that the problem is quantifiable (the problem can be expressed in mathematical or logical terms), measurable (the problem can be measured by some metric and clearly observed), and replicable (the problem can be reproduced and occurs more than once).

# Data

UPDATED FROM PROPOSAL!

You should have obtained and cleaned (if necessary) data you will use for this project.

Please give the following information for each dataset you are using:
- link/reference to obtain it
- description of the size of the dataset (# of variables, # of observations)
- what an observation consists of
- what some critical variables are, how they are represented
- any special handling, transformations, cleaning, etc you have done should be demonstrated here!


# Proposed Solution

In this section, clearly describe a solution to the problem. The solution should be applicable to the project domain and appropriate for the dataset(s) or input(s) given. Provide enough detail (e.g., algorithmic description and/or theoretical properties) to convince us that your solution is applicable. Make sure to describe how the solution will be tested.  

If you know details already, describe how (e.g., library used, function calls) you plan to implement the solution in a way that is reproducible.

If it is appropriate to the problem statement, describe a benchmark model<a name="sota"></a>[<sup>[3]</sup>](#sotanote) against which your solution will be compared. 

# Evaluation Metrics

Propose at least one evaluation metric that can be used to quantify the performance of both the benchmark model and the solution model. The evaluation metric(s) you propose should be appropriate given the context of the data, the problem statement, and the intended solution. Describe how the evaluation metric(s) are derived and provide an example of their mathematical representations (if applicable). Complex evaluation metrics should be clearly defined and quantifiable (can be expressed in mathematical or logical terms).

# Preliminary results

NEW SECTION!

After reading about the best algorithms for natural language processing classification, we decided the baseline algorithm against which others should be compared will be multinomial naive Bayes. As such, we trained a model using 70,000 randomly selected Tweets and their partisan labels, and tested the model's performance on 30,000 unseen Tweets. 
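
A minimal sketch of such a baseline run with scikit-learn. The data here is synthetic (random non-negative counts standing in for our feature matrix), and `random_state` is an arbitrary choice for reproducibility; only the 70/30 split and the choice of multinomial naive Bayes come from the text above:

```python
import numpy as np
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)

# Synthetic stand-in for the feature matrix: 1,000 rows of non-negative counts.
# Each class draws more heavily from its own half of the 20 columns.
n_per_class, n_features = 500, 20
rates_0 = np.r_[np.full(10, 3.0), np.full(10, 1.0)]
rates_1 = rates_0[::-1]
X = np.vstack(
    [
        rng.poisson(rates_0, (n_per_class, n_features)),
        rng.poisson(rates_1, (n_per_class, n_features)),
    ]
)
y = np.r_[np.zeros(n_per_class, dtype=int), np.ones(n_per_class, dtype=int)]

# 70% of the rows for training, the remaining 30% held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=0
)

baseline = MultinomialNB()
baseline.fit(X_train, y_train)
predictions = baseline.predict(X_test)

print(classification_report(y_test, predictions))
```

Multinomial naive Bayes requires non-negative feature values, which counts, one-hot indicators, and TF-IDF scores all satisfy.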


Classification report (label 0 = Democrat, label 1 = Republican):

| Class | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| 0 | 0.89 | 0.66 | 0.76 | 15466 |
| 1 | 0.72 | 0.91 | 0.80 | 14534 |
| accuracy | | | 0.78 | 30000 |
| macro avg | 0.80 | 0.79 | 0.78 | 30000 |
| weighted avg | 0.81 | 0.78 | 0.78 | 30000 |


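As we can see, the baseline achieves a promising level of accuracy (78%), but precision and recall differ sharply between our two categories. For label 0 (Democrat), precision is high (0.89) but recall is low (0.66), while for label 1 (Republican) the pattern is reversed (0.72 precision, 0.91 recall); in other words, the model tends to over-predict label 1. While this is a nice baseline, we would like to achieve both higher accuracy and more balanced precision and recall across our two labels.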
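
As a sanity check, the F1 scores and overall accuracy can be recomputed from the per-class precision, recall, and support in the report. The values below are copied from the report; because they are already rounded to two decimals, the results are only expected to agree after rounding:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Per-class values from the report: label -> (precision, recall, support).
report = {0: (0.89, 0.66, 15466), 1: (0.72, 0.91, 14534)}

f1 = {label: f1_score(p, r) for label, (p, r, _) in report.items()}

total = sum(support for _, _, support in report.values())
# Recall times support approximates the correctly classified Tweets per class.
correct = sum(r * support for _, r, support in report.values())
accuracy = correct / total

macro_f1 = sum(f1.values()) / len(f1)

for label, value in f1.items():
    print(f"label {label}: F1 = {value:.2f}")
print(f"accuracy = {accuracy:.2f}, macro F1 = {macro_f1:.2f}")
```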


As our data collection and processing was quite time intensive, we have not yet begun to analyze the effect of other models, nor to vary the number of features extracted in our text processing portion. In addition, we would like to compare different sizes of training and testing sets, as this model used only 100,000 of our 786,268 Tweets (roughly 13% of our data) for the sake of time. However, with the coding-intensive portions of this project behind us, we can turn our full focus towards finding the best combination of models and features to perform our task.


Please show any preliminary results you have managed to obtain.

Examples would include:
- Analyzing the suitability of a dataset or algorithm for prediction/solving your problem
- Performing feature selection or hand-designing features from the raw data. Describe the features available/created and/or show the code for selection/creation
- Showing the performance of a base model/hyper-parameter setting.  Solve the task with one "default" algorithm and characterize the performance level of that base model.
- Learning curves or validation curves for a particular model
- Tables/graphs showing the performance of different models/hyper-parameters



# Ethics & Privacy

**Possible Concerns**: 

!!! GENERIC FILL IN WITH MORE DETAIL !! 

**Biases in the training data**: The training data used to develop the machine learning model may be biased towards certain political affiliations, leading to biased predictions. This could result in discrimination against certain groups and contribute to polarization.

**Misuse of the model**: The model could be misused to target individuals or groups based on their political affiliations, leading to harassment, discrimination, and other unethical behavior.

**To address these concerns, we will take the following steps**:

**Model training and evaluation**: We will use techniques such as cross-validation to evaluate the performance of the model and identify any biases that may exist. 

**Deployment and monitoring**: We will monitor the performance of the model in production and identify any unintended consequences or ethical issues that may arise. We will also implement mechanisms to prevent the misuse of the model, such as restricting access to the model or using it only for specific purposes.

**Regular review**: We will regularly review the model's performance and re-evaluate its ethical implications, as well as update the model to address any new concerns that may arise.

# Team Expectations 

Put things here that cement how you will interact/communicate as a team, how you will handle conflict and difficulty, how you will handle making decisions and setting goals/schedule, how much work you expect from each other, how you will handle deadlines, etc...

- Cameron Faulkner: Twitter Data Collection
- Nikhil Hegde: training model
- Qianxi Gong: Implementation of sentiment analysis and feature parsing
- Atul Nair: Model evaluation and selection

# Project Timeline Proposal

2/28: finish data collection

3/5/23: finish model construction

3/8: complete checkpoint requirements

3/12: conduct model selection and analysis

3/15: begin final write up

# Footnotes
<a name="lorenznote"></a>1.[^] Antypas, D., Preece, A., and Camacho-Collados, J. “Politics, Sentiment and Virality: A Large-Scale Multilingual Twitter Analysis in Greece, Spain and United Kingdom”. Online Social Networks and Media. August 22, 2022. https://arxiv.org/pdf/2202.00396.pdf

<a name="admonishnote"></a>2.[^] Gangwar, A. and Mehta, T. “Sentiment Analysis of Political Tweets for Israel using Machine Learning”. LearnByResearch. April, 2022. https://arxiv.org/pdf/2204.06515.pdf

