# Twitter Political Affiliation Classification 

# Names
- Cameron Faulkner
- Nikhil Hegde
- Qianxi Gong
- Atul Nair

# Abstract 

##### Problem Statement
Twitter has become a popular platform for political discourse, where people express their views on a variety of issues. Given the increasing polarization in the political landscape, it is of great interest to understand how people's political views are reflected in their online behavior. In this project, we propose to investigate whether it is possible to accurately classify Twitter users' political affiliation based on their tweets using a supervised learning approach.

##### Dataset
We will collect data from Twitter via web scraping. We will primarily gather tweets from the verified accounts of Democratic and Republican members of Congress. We are aware that this restriction could bias the data. If we instead worked with a random subset of Twitter users, it would be difficult to manually determine, for hundreds of accounts, whether each user belongs to an American political party or holds liberal or conservative viewpoints. Additionally, filtering out bot accounts poses another difficult challenge.
Gathering data from the verified Twitter accounts of members of Congress allows us to easily identify political leanings and ensures that the tweets come from real individuals.

##### Preprocessing
We will preprocess the collected data to extract features that are relevant to the classification task. This will include tokenizing the text, removing stop words, and stemming. We will also engineer additional features, such as keyword frequencies and sentiment analysis predictions, to capture further information that may be relevant to the classification task.
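The text-cleaning steps described above could look roughly like the following sketch. It uses only the Python standard library. The stop-word list, the keyword set, and the suffix-stripping `crude_stem` helper are illustrative assumptions rather than the project's actual choices; a real pipeline would likely use a fuller stop-word list and an established stemmer such as Porter's. Sentiment features are not covered here.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption); a real run would use a fuller list.
STOP_WORDS = {"a", "an", "and", "are", "the", "to", "of", "in", "is", "for", "on", "we", "our"}

# Hypothetical keywords, chosen only for illustration.
KEYWORDS = {"tax", "border", "climate", "healthcare"}


def tokenize(text):
    """Lowercase a tweet, drop URLs and @mentions, and return alphabetic tokens."""
    text = re.sub(r"https?://\S+|@\w+", " ", text.lower())
    return re.findall(r"[a-z]+", text)


def crude_stem(token):
    """Strip a few common suffixes; a simple stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        # Only strip when at least three characters would remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, remove stop words, then stem the remaining tokens."""
    return [crude_stem(t) for t in tokenize(text) if t not in STOP_WORDS]


def keyword_counts(tokens):
    """Count how often each tracked keyword appears among the stemmed tokens."""
    counts = Counter(tokens)
    return {k: counts[k] for k in sorted(KEYWORDS)}


tokens = preprocess("We are cutting taxes for families! https://t.co/abc @SenExample")
features = keyword_counts(tokens)
```

Here `tokens` holds the stemmed, stop-word-free tokens of one tweet and `features` holds one keyword-frequency entry per tracked keyword, which could be appended to whatever text representation the model uses.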

##### Model Development 
!!! TO BE UPDATED !!!
We will compare the performance of different algorithms, such as logistic regression, decision trees, and support vector machines, to determine which algorithm is best suited for the task. We will also explore the use of ensemble methods, such as random forests and gradient boosting, to improve the accuracy of our model.
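As a minimal sketch of such a comparison, assuming scikit-learn is the library used (the proposal does not say), each candidate classifier can be wrapped in a pipeline with a TF-IDF vectorizer and scored with cross-validation. The tweets and labels below are made-up placeholders, not collected data, and the hyperparameters are arbitrary defaults.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Made-up placeholder tweets and party labels (0 and 1); not real data.
tweets = [
    "lower taxes help small businesses grow",
    "secure the border and support law enforcement",
    "cut government spending and reduce regulation",
    "protect gun rights and lower taxes",
    "expand healthcare access for working families",
    "act on climate change and clean energy",
    "raise the minimum wage for workers",
    "protect voting rights and expand healthcare",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Candidate models named in the text; settings are illustrative assumptions.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "linear_svm": LinearSVC(),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}


def compare_models(texts, y, cv=2):
    """Return the mean cross-validated accuracy of each candidate model."""
    scores = {}
    for name, clf in candidates.items():
        # The vectorizer sits inside the pipeline so it is refit on each training fold.
        pipeline = make_pipeline(TfidfVectorizer(), clf)
        scores[name] = cross_val_score(pipeline, texts, y, cv=cv, scoring="accuracy").mean()
    return scores


scores = compare_models(tweets, labels)
for name, score in scores.items():
    print(f"{name}: mean accuracy = {score:.2f}")
```

Keeping the vectorizer inside the pipeline means the vocabulary is learned only from the training portion of each fold, which avoids leaking information from held-out tweets. With only eight toy examples the printed scores carry no meaning; the sketch only shows the shape of the comparison.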

##### Model Evaluation
!!! TO BE UPDATED !!!
We will evaluate the performance of our model using a variety of metrics, such as accuracy, precision, recall, and F1-score. We will also use techniques such as cross-validation to ensure that our model is not overfitting the training data. 
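For reference, the four metrics named above can be computed directly from the confusion-matrix counts of a binary classifier. The sketch below is a plain-Python illustration; the function name `binary_metrics` and the example labels are hypothetical, and which party is treated as the positive class is an arbitrary choice.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical true and predicted labels for six tweets.
metrics = binary_metrics([1, 1, 1, 0, 0, 0], [1, 1, 0, 1, 0, 0])
```

In this example there are two true positives, one false positive, one false negative, and two true negatives, so all four metrics equal 2/3.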


# Background
!!! OLD BACKGROUND, NEED CITATIONS !!!

Previous research has demonstrated that sentiment analysis of tweets by Members of Parliament (MPs) can be used to predict their popularity. In particular, studies have shown that negative sentiment is a key factor influencing the number of retweets MPs receive <a name="lorenz"></a>[<sup>[1]</sup>](#lorenznote). These findings suggest that analyzing the words and language used by MPs in their tweets can provide insights into their strategies for gaining popularity on social media platforms.

Given our real-world political knowledge, we suspect that the use of certain words and language patterns may affect the popularity of tweets differently for MPs of different parties. To explore this further, we are interested in investigating whether the distribution of language use and popularity among different parties can be modeled using a mixture of supervised and unsupervised methods. By examining the use of specific words and phrases across tweets from different political parties, it may be possible to score an MP's tweet given their party by matching it against that party's language patterns, and to model how the public will react to it <a name="admonish"></a>[<sup>[2]</sup>](#admonishnote).

# Problem Statement

Clearly describe the problem that you are solving. Avoid ambiguous words. The problem described should be well defined and should have at least one ML-relevant potential solution. Additionally, describe the problem thoroughly such that it is clear that the problem is quantifiable (the problem can be expressed in mathematical or logical terms), measurable (the problem can be measured by some metric and clearly observed), and replicable (the problem can be reproduced and occurs more than once).

# Data

UPDATED FROM PROPOSAL!

You should have obtained and cleaned (if necessary) data you will use for this project.

Please give the following information for each dataset you are using:
- link/reference to obtain it
- description of the size of the dataset (# of variables, # of observations)
- what an observation consists of
- what some critical variables are, how they are represented
- any special handling, transformations, cleaning, etc you have done should be demonstrated here!


# Proposed Solution

In this section, clearly describe a solution to the problem. The solution should be applicable to the project domain and appropriate for the dataset(s) or input(s) given. Provide enough detail (e.g., algorithmic description and/or theoretical properties) to convince us that your solution is applicable. Make sure to describe how the solution will be tested.  

If you know details already, describe how (e.g., library used, function calls) you plan to implement the solution in a way that is reproducible.

If it is appropriate to the problem statement, describe a benchmark model<a name="sota"></a>[<sup>[3]</sup>](#sotanote) against which your solution will be compared. 

# Evaluation Metrics

Propose at least one evaluation metric that can be used to quantify the performance of both the benchmark model and the solution model. The evaluation metric(s) you propose should be appropriate given the context of the data, the problem statement, and the intended solution. Describe how the evaluation metric(s) are derived and provide an example of their mathematical representations (if applicable). Complex evaluation metrics should be clearly defined and quantifiable (can be expressed in mathematical or logical terms).

# Preliminary results

NEW SECTION!

Please show any preliminary results you have managed to obtain.

Examples would include:
- Analyzing the suitability of a dataset or algorithm for prediction/solving your problem
- Performing feature selection or hand-designing features from the raw data. Describe the features available/created and/or show the code for selection/creation
- Showing the performance of a base model/hyper-parameter setting.  Solve the task with one "default" algorithm and characterize the performance level of that base model.
- Learning curves or validation curves for a particular model
- Tables/graphs showing the performance of different models/hyper-parameters



# Ethics & Privacy

**Possible Concerns**: 

!!! GENERIC FILL IN WITH MORE DETAIL !! 

**Biases in the training data**: The training data used to develop the machine learning model may be biased towards certain political affiliations, leading to biased predictions. This could result in discrimination against certain groups and contribute to polarization.

**Misuse of the model**: The model could be misused to target individuals or groups based on their political affiliations, leading to harassment, discrimination, and other unethical behavior.

**To address these concerns, we will take the following steps**:

**Model training and evaluation**: We will use techniques such as cross-validation to evaluate the performance of the model and identify any biases that may exist. 

**Deployment and monitoring**: We will monitor the performance of the model in production and identify any unintended consequences or ethical issues that may arise. We will also implement mechanisms to prevent the misuse of the model, such as restricting access to the model or using it only for specific purposes.

**Regular review**: We will regularly review the model's performance and re-evaluate its ethical implications, as well as update the model to address any new concerns that may arise.

# Team Expectations 

Put things here that cement how you will interact/communicate as a team, how you will handle conflict and difficulty, how you will handle making decisions and setting goals/schedule, how much work you expect from each other, how you will handle deadlines, etc...

- Cameron Faulkner: Twitter Data Collection
- Nikhil Hegde: Model training
- Qianxi Gong: Implementation of sentiment analysis and feature parsing
- Atul Nair: Model evaluation and selection

# Project Timeline Proposal

2/28: finish data collection

3/5/23: finish model construction

3/8: complete checkpoint requirements

3/12: conduct model selection and analysis

3/15: begin final write up

# Footnotes
<a name="lorenznote"></a>1.[^](#lorenz): Antypas, D., Preece, A., and Camacho-Collados, J. “Politics, Sentiment and Virality: A Large-Scale Multilingual Twitter Analysis in Greece, Spain and United Kingdom”. Online Social Networks and Media. August 22, 2022. https://arxiv.org/pdf/2202.00396.pdf

<a name="admonishnote"></a>2.[^](#admonish): Gangwar, A. and Mehta, T. “Sentiment Analysis of Political Tweets for Israel using Machine Learning”. LearnByResearch. April 2022. https://arxiv.org/pdf/2204.06515.pdf

