# Research Project
COMP 435 Introduction to Machine Learning, Spring 2025

- Instructor: Jon Hutchins
- Author: Ina Tang
- Dataset: Sentiment140 on [Kaggle](https://www.kaggle.com/datasets/kazanova/sentiment140/data)
- Citation: Go, A., Bhayani, R. and Huang, L., 2009. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, 1(2009), p.12.

> Just 75% accuracy would be good... – Dr. Hutchins

### Schema
- target: the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
- ids: the id of the tweet (2087)
- date: the date of the tweet (Sat May 16 23:58:44 UTC 2009)
- flag: the query (lyx). If there is no query, then this value is NO_QUERY.
- user: the user that tweeted (robotickilldozr)
- text: the text of the tweet (Lyx is cool)

### Ideas

- [ ] Proportions of + and -
- [ ] Total frequency of word
    - [ ] remove pronouns, prepositions, conjunctions, articles, etc.?
    - [ ] cutoff for words with (say) less than 1% frequency
- [ ] Correlation between word and each label (proportions)
- [ ] Effect of capitalization and punctuation on prediction
- [ ] Use deep neural network(s) to identify strong FPs and FNs (weird data points)
- [ ] (Synthesized feature) Certain collection(s) of words that strongly correlate with one of the labels
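
Two of the ideas above (label proportions, and capitalization/punctuation as features) can be sketched on a toy frame. This is only a rough sketch: the rows in `toy` are made up, and `label_proportions` and `surface_features` are hypothetical helper names, not part of the notebook.

```python
import pandas as pd

# Toy stand-in for the Sentiment140 frame; these rows are invented for illustration.
toy = pd.DataFrame({
    "target": [0, 0, 4, 4, 4],
    "text": ["so sad", "I HATE this!!", "love it", "Great day!", "ok then"],
})


def label_proportions(frame: pd.DataFrame) -> pd.Series:
    """Share of tweets per label, indexed by label value."""
    return frame["target"].value_counts(normalize=True).sort_index()


def surface_features(text: str) -> dict:
    """Hypothetical capitalization/punctuation features for a single tweet."""
    letters = [c for c in text if c.isalpha()]
    upper = sum(c.isupper() for c in letters)
    return {
        "upper_ratio": upper / len(letters) if letters else 0.0,
        "exclamations": text.count("!"),
        "questions": text.count("?"),
    }


proportions = label_proportions(toy)
features = pd.DataFrame([surface_features(t) for t in toy["text"]])
print(proportions)
print(features)
```

On the real data, the same two calls would take `df` in place of `toy`. Whether these surface features help prediction at all is exactly what the idea is meant to test.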


### Previous labs
- [Linear Regression with a Real Dataset](https://colab.research.google.com/github/google/eng-edu/blob/main/ml/cc/exercises/linear_regression_with_a_real_dataset.ipynb)
- [Linear Regression with Synthetic Data](https://colab.research.google.com/github/google/eng-edu/blob/main/ml/cc/exercises/linear_regression_with_synthetic_data.ipynb)
- [Logistic Regression](https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/guide/core/logistic_regression_core.ipynb)

In [3]:
# pip install numpy pandas torch matplotlib seaborn

In [4]:
import numpy as np
import pandas as pd
# import torch
import matplotlib.pyplot as plt
import seaborn as sns  # sns.pairplot

from typing import Dict, List

In [5]:
df = pd.read_csv('training.1600000.processed.noemoticon.csv', encoding='latin-1', header=None)
# 0 = negative, 2 = neutral, 4 = positive
df.columns = ['target', 'id', 'date', 'flag', 'user', 'text']
print(df.head())

NUMBER_OF_TARGET_VALUES = 5  # labels lie in 0..4, so range(5) covers every possible key

   target          id                          date      flag  \
0       0  1467810369  Mon Apr 06 22:19:45 PDT 2009  NO_QUERY   
1       0  1467810672  Mon Apr 06 22:19:49 PDT 2009  NO_QUERY   
2       0  1467810917  Mon Apr 06 22:19:53 PDT 2009  NO_QUERY   
3       0  1467811184  Mon Apr 06 22:19:57 PDT 2009  NO_QUERY   
4       0  1467811193  Mon Apr 06 22:19:57 PDT 2009  NO_QUERY   

              user                                               text  
0  _TheSpecialOne_  @switchfoot http://twitpic.com/2y1zl - Awww, t...  
1    scotthamilton  is upset that he can't update his Facebook by ...  
2         mattycus  @Kenichan I dived many times for the ball. Man...  
3          ElleCTF    my whole body feels itchy and like its on fire   
4           Karoli  @nationwideclass no, it's not behaving at all....  


In [None]:
# generated by GitHub Copilot with minor edits
# runtime: 45s

print("Counting words...")
word_counts: Dict[str, Dict[int, int]] = {}  # word -> target -> frequency

for _, row in df.iterrows():
    # set() counts each word at most once per tweet (number of tweets containing it)
    words: List[str] = list(set(row['text'].split()))
    target: int = row['target']
    for word in words:
        if word not in word_counts:
            word_counts[word] = {t: 0 for t in range(NUMBER_OF_TARGET_VALUES)}  # initialize with 0
        word_counts[word][target] += 1

print("Sorting words by total frequency...")
word_counts = dict(sorted(word_counts.items(), key=lambda item: sum(item[1].values()), reverse=True))  # sort by total count, descending
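
One possible next step toward the "correlation between word and each label" idea is to turn the per-target counts into a positive share per word, after dropping rare words. The sketch below uses an invented `toy_counts` dict in the same word -> target -> frequency shape as `word_counts`. The helper `positive_share` and the absolute cutoff `min_total` are assumptions (the idea list suggests a percentage cutoff instead).

```python
from typing import Dict

# Toy counts shaped like `word_counts`; the numbers are invented for illustration.
toy_counts: Dict[str, Dict[int, int]] = {
    "love": {0: 10, 4: 90},
    "sad": {0: 45, 4: 5},
    "the": {0: 500, 4: 500},
    "zzz": {0: 1, 4: 0},
}


def positive_share(
    counts: Dict[str, Dict[int, int]],
    min_total: int = 20,       # assumed cutoff: skip words seen in fewer tweets
    positive_label: int = 4,   # 4 = positive in the schema above
) -> Dict[str, float]:
    """Map each sufficiently frequent word to the share of its tweets that are positive."""
    shares: Dict[str, float] = {}
    for word, by_target in counts.items():
        total = sum(by_target.values())
        if total < min_total:
            continue
        shares[word] = by_target.get(positive_label, 0) / total
    return shares


shares = positive_share(toy_counts)
# Words whose share is farthest from 0.5 lean most strongly toward one label.
ranked = sorted(shares.items(), key=lambda kv: abs(kv[1] - 0.5), reverse=True)
print(ranked)
```

A share near 0.5 only means "uninformative" if the two labels are roughly balanced, so it is worth checking the label proportions first.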

In [None]:
print(word_counts)

In [None]:
# TODO: remove pronouns, prepositions, conjunctions, article adjectives, etc.
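
A minimal sketch of how this TODO might be handled, assuming a hand-written stop-word set. `STOP_WORDS` below is a deliberately tiny, hypothetical list (a real one would be much longer), and `drop_stop_words` is an illustrative helper name, not existing notebook code.

```python
import string
from typing import Dict

# Hypothetical, deliberately short stop-word list covering a few pronouns,
# articles, conjunctions, and prepositions.
STOP_WORDS = frozenset({
    "i", "you", "he", "she", "it", "we", "they",
    "a", "an", "the",
    "and", "or", "but",
    "of", "in", "on", "to", "for", "with", "at",
})


def drop_stop_words(counts: Dict[str, Dict[int, int]]) -> Dict[str, Dict[int, int]]:
    """Return a copy of the counts without entries whose normalized word is a stop word."""
    kept: Dict[str, Dict[int, int]] = {}
    for word, by_target in counts.items():
        # Normalize only for the lookup; the original key is kept as-is.
        normalized = word.lower().strip(string.punctuation)
        if normalized not in STOP_WORDS:
            kept[word] = by_target
    return kept


# Toy input in the same shape as `word_counts`; the numbers are invented.
toy_counts = {
    "the": {0: 5, 4: 5},
    "The,": {0: 1, 4: 2},
    "love": {0: 1, 4: 9},
}
filtered = drop_stop_words(toy_counts)
print(filtered)
```

Because dicts preserve insertion order, applying this to the sorted `word_counts` would keep the frequency ordering of the remaining words.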