# Flatiron Capstone Project – Notebook #3: Feature Engineering

Student name: **Angelo Turri**

Student pace: **self paced**

Project finish date: **1/19/24**

Instructor name: **Mark Barbour**

# Instructions

Due to the size of this project, there are four notebooks instead of one. The proper order to execute these notebooks is as follows:

- **Gathering Data**
- **Preprocessing**
- **Feature Engineering** <---- You are here
- **Modeling**

Technically, we did minor feature engineering in the previous notebook with bigrams and trigrams. All major feature engineering occurs in this notebook.

### Sentiment analysis

Sentiment analysis was done with VADER (Valence Aware Dictionary and sEntiment Reasoner). I discovered VADER through [this video by Rob Mulla](https://www.youtube.com/watch?v=QpzMWQvxXWk). VADER uses [a lexicon of about 7,500 words](https://github.com/cjhutto/vaderSentiment/blob/master/vaderSentiment/vader_lexicon.txt) to score how positive, negative, or neutral a piece of text is. It also returns a compound score, a single normalized value between -1 (most negative) and +1 (most positive), which is what we use to score sentiment.
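To give a feel for why the compound score always lands between -1 and +1, here is a minimal sketch of the normalization step. To my understanding, the reference implementation squashes the summed word valences with `x / sqrt(x^2 + alpha)` using `alpha = 15`; the raw values below are made up for illustration and do not come from our data.

```python
import math


def normalize(score: float, alpha: float = 15.0) -> float:
    """Squash an unbounded summed valence score into the range [-1, 1]."""
    norm = score / math.sqrt(score * score + alpha)
    # Clamp defensively in case of floating-point overshoot.
    return max(-1.0, min(1.0, norm))


# Hypothetical raw valence sums, not taken from the dataset.
for raw in (-4.0, 0.0, 1.9, 6.0):
    print(f"raw={raw:+.1f} -> compound={normalize(raw):+.4f}")
```

The function is odd-symmetric, so equally strong positive and negative texts get compound scores of equal magnitude and opposite sign.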

[Neptune.ai](https://neptune.ai/blog/sentiment-analysis-python-textblob-vs-vader-vs-flair#:~:text=Valence%20aware%20dictionary%20for%20sentiment,to%20calculate%20the%20text%20sentiment.&text=Positive%2C%20negative%2C%20and%20neutral.) describes VADER as being "optimized for social media data and can yield good results when used with data from twitter, facebook, etc." Since Reddit counts as social media, I believe VADER is a good choice for this project.

### Bag of words features

We now have both target variables we need: score and sentiment. The raw text cannot be fed to a model as it stands, however, so it must first be converted into numeric features. To create proper training data:

- I limited my vocabulary to the 100 most common unigrams, the 100 most common bigrams, and the 100 most common trigrams. This was necessary because keeping all the words would have resulted in too many features.
- For each word/phrase in our vocabulary, I went through each post and determined whether the word/phrase was present.
- The result was three sparse (mostly zero) matrices, one each for unigrams, bigrams, and trigrams.
- Because the vocabulary is limited, many of the rows in these matrices consisted entirely of 0's (that is, none of the words in our vocabulary were present in the given post). These rows do nothing but dilute the data and impede the model's ability to make predictions, so they were all removed.
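The steps above can be sketched on a toy example. The posts, ids, and vocabulary here are hypothetical and only illustrate the shape of the result: one 0/1 column per vocabulary entry, with all-zero rows dropped afterwards.

```python
import pandas as pd

# Hypothetical toy posts and vocabulary, for illustration only.
posts = pd.Series(
    ["build the wall", "fake news again", "nothing relevant here"],
    index=[10, 11, 12],
    name="joined",
)
vocabulary = ["wall", "fake news"]

# One 0/1 column per vocabulary entry: 1 if the phrase occurs in the post.
features = pd.DataFrame(
    {
        "_" + phrase.replace(" ", "_"): posts.str.contains(phrase, regex=False).astype("int8")
        for phrase in vocabulary
    }
)

# Drop posts that contain none of the vocabulary entries.
all_zero = (features == 0).all(axis=1)
pruned = features[~all_zero]

print(pruned)
```

Note that the pruned frame keeps the original post ids (10 and 11 here), which is what makes it possible to match features back to their targets later.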

# Importing Packages

In [1]:
import pandas as pd
from matplotlib import pyplot as plt
import numpy as np

import os
import warnings
warnings.filterwarnings('ignore')
from IPython.display import clear_output, display_html, Markdown, display
import time

from tqdm.notebook import tqdm
tqdm.pandas()

from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.tag import pos_tag
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tag.perceptron import PerceptronTagger

In [2]:
import ray

ray.shutdown()
ray.init()

2024-02-21 21:26:15,579	INFO worker.py:1724 -- Started a local Ray instance.


Python version: 3.11.6
Ray version: 2.9.1


# Importing data

In [3]:
name='the_donald_comments'

df = pd.read_parquet(path=f'../data/{name}_preprocessed.parquet')

unigrams = pd.read_parquet(path=f'../data/{name}_unigrams.parquet')
bigrams = pd.read_parquet(path=f'../data/{name}_bigrams.parquet')
trigrams = pd.read_parquet(path=f'../data/{name}_trigrams.parquet')

# Sentiment analysis using VADER

In [4]:
sia = SentimentIntensityAnalyzer()

@ray.remote
def vader(series):
    """
    Calculates the compound VADER sentiment score for each text in a pandas Series.
    """
    return series.apply(lambda x: sia.polarity_scores(x)['compound'])


# Splits our data into chunks and runs sentiment analysis in parallel across
# all of them using the ray library.
num_chunks=8
chunks = np.array_split(df, num_chunks)
returns = ray.get([vader.remote(chunks[i].joined) for i in range(num_chunks)])
df['sentiment'] = pd.concat(returns)
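The split, score, and concatenate pattern works because every chunk keeps its original index labels, so the concatenated result lines up with `df` on assignment. Below is a minimal serial sketch of that pattern. `split_frame` and `score_chunk` are hypothetical helpers (the latter is a stand-in for the VADER call), and the positional split with `iloc` is an alternative I am suggesting because some recent pandas versions warn when `np.array_split` is called directly on a DataFrame.

```python
import numpy as np
import pandas as pd


def split_frame(frame: pd.DataFrame, num_chunks: int) -> list[pd.DataFrame]:
    """Split a DataFrame into roughly equal row chunks, keeping the index."""
    positions = np.array_split(np.arange(len(frame)), num_chunks)
    return [frame.iloc[pos] for pos in positions]


def score_chunk(series: pd.Series) -> pd.Series:
    """Hypothetical stand-in for the per-chunk VADER scoring."""
    return series.str.len().astype("float64")


toy = pd.DataFrame(
    {"joined": ["a", "bb", "ccc", "dddd", "eeeee"]},
    index=[3, 7, 8, 20, 21],
)

chunks = split_frame(toy, num_chunks=2)
# Serial here; in the notebook each chunk is dispatched to a Ray worker.
results = [score_chunk(chunk["joined"]) for chunk in chunks]

# Concatenation keeps each row's original index label, so assignment aligns.
toy["sentiment"] = pd.concat(results)
print(toy)
```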

# Bag of Words Features

In [5]:
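# Candidate vocabulary: the 300 most common n-grams of each size.
# Only the first 100 of each are turned into features in the loop below.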
most_common_unigrams = unigrams.head(300).unigram
most_common_bigrams = bigrams.head(300).bigram
most_common_trigrams = trigrams.head(300).trigram

# Initializes the dataframes for the bag of words features
unigram_features, bigram_features, trigram_features = pd.DataFrame(), pd.DataFrame(), pd.DataFrame()

for i in tqdm(np.arange(100)):
    
    # Creates a column for the presence of each unigram
    unigram = most_common_unigrams[i]
    unigram_features['_' + unigram] = df.joined.str.contains(unigram).astype('int8')
    
    # Creates a column for the presence of each bigram
    bigram = most_common_bigrams[i]
    bigram_features['_' + '_'.join(bigram)] = df.joined.str.contains(' '.join(bigram)).astype('int8')
    
    # Creates a column for the presence of each trigram
    trigram = most_common_trigrams[i]
    trigram_features['_' + '_'.join(trigram)] = df.joined.str.contains(' '.join(trigram)).astype('int8')

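One caveat worth noting about the cell above: `str.contains` treats its argument as a regular expression by default and matches anywhere in the string, so a short unigram can also match inside longer words. The sketch below, on hypothetical posts, contrasts that behavior with a word-boundary pattern. It is offered as a possible refinement, not as what the notebook does.

```python
import re

import pandas as pd

# Hypothetical posts, for illustration only.
posts = pd.Series(["someone said no", "one more time", "phone home"])
word = "one"

# Plain pattern: matches anywhere, including inside longer words.
substring_hits = posts.str.contains(word)

# Word-boundary pattern: matches only the standalone token.
pattern = r"\b" + re.escape(word) + r"\b"
whole_word_hits = posts.str.contains(pattern, regex=True)

print(substring_hits.tolist())   # [True, True, True]
print(whole_word_hits.tolist())  # [False, True, False]
```

Switching to whole-word matching would change the feature counts reported later in this notebook, so the two approaches should not be mixed within one run.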

In [6]:
# Calculates the density of each feature set: the share of cells equal to 1 (lower means sparser)
unigram_pct_before = unigram_features.sum().sum()/(len(unigram_features) * len(unigram_features.columns))
bigram_pct_before = bigram_features.sum().sum()/(len(bigram_features) * len(bigram_features.columns))
trigram_pct_before = trigram_features.sum().sum()/(len(trigram_features) * len(trigram_features.columns))

In [7]:
# Removes all rows consisting entirely of 0's from each set of feature data
unigram_mask = (unigram_features == 0).all(axis=1)
bigram_mask = (bigram_features == 0).all(axis=1)
trigram_mask = (trigram_features == 0).all(axis=1)

unigram_features = unigram_features[~unigram_mask]
bigram_features = bigram_features[~bigram_mask]
trigram_features = trigram_features[~trigram_mask]

In [8]:
# Re-calculates the density of each feature set (share of cells equal to 1) after pruning
unigram_pct_after = unigram_features.sum().sum()/(len(unigram_features) * len(unigram_features.columns))
bigram_pct_after = bigram_features.sum().sum()/(len(bigram_features) * len(bigram_features.columns))
trigram_pct_after = trigram_features.sum().sum()/(len(trigram_features) * len(trigram_features.columns))
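For a 0/1 matrix, the ratio computed above is simply the mean of all cells. The sketch below uses a hypothetical four-row feature frame to show that equivalence and how removing all-zero rows raises the density; the `density` helper is my own naming, not part of the notebook.

```python
import pandas as pd

# Hypothetical 0/1 feature matrix: two of the four rows are entirely zero.
features = pd.DataFrame(
    {"_wall": [1, 0, 0, 0], "_fake_news": [0, 0, 1, 0]},
    dtype="int8",
)


def density(frame: pd.DataFrame) -> float:
    """Fraction of cells that are non-zero in a 0/1 feature matrix."""
    return float(frame.to_numpy().mean())


before = density(features)
pruned = features[(features != 0).any(axis=1)]  # keep rows with at least one 1
after = density(pruned)

print(f"{before:.2%} -> {after:.2%}")  # 25.00% -> 50.00%
```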

In [9]:
unigram_features.head()

id,_like,_people,_trump,_would,_get,_one,_think,_know,_even,_right,...,_yes,_pretty,_trying,_come,_post,_getting,_oh,_thought,_government,_already
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,1,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,1,0,0,0,1,1,0,1,0,...,0,0,0,0,0,0,0,0,0,0


In [10]:
bigram_features.head()

id,_looks_like,_fake_news,_president_trump,_years_ago,_sounds_like,_look_like,_donald_trump,_holy_shit,_white_people,_trump_supporters,...,_build_wall,_think_would,_ever_seen,_shit_like,_never_seen,_people_know,_people_get,_whole_thing,_10_years,_people_need
1,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
13,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
21,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [11]:
trigram_features.head()

id,_make_america_great,_bill_clinton_rapist,_fucking_white_male,_donald_j_trump,_orange_man_bad,_ten_feet_higher,_president_united_states,_つ_つ_つ,_long_time_ago,_new_york_times,...,_right_side_history,_brick_brick_brick,_god_emperor_trump,_makes_look_like,_usa_usa_usa,_holy_fucking_shit,_behind_closed_doors,_last_8_years,_wearing_maga_hat,_feel_like_going
115,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
313,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
360,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
496,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
504,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


# Effects of bag-of-words pruning

In [12]:
print(f""" Before pruning: \n\t{unigram_pct_before:.2%} of cells in unigram features weren't empty.
\t{bigram_pct_before:.2%} of cells in bigram features weren't empty.
\t{trigram_pct_before:.2%} of cells in trigram features weren't empty.""")

print(f""" After pruning: \n\t{unigram_pct_after:.2%} of cells in unigram features weren't empty.
\t{bigram_pct_after:.2%} of cells in bigram features weren't empty.
\t{trigram_pct_after:.2%} of cells in trigram features weren't empty.""")

 Before pruning: 
	3.32% of cells in unigram features weren't empty.
	0.16% of cells in bigram features weren't empty.
	0.02% of cells in trigram features weren't empty.
 After pruning: 
	4.22% of cells in unigram features weren't empty.
	1.19% of cells in bigram features weren't empty.
	1.08% of cells in trigram features weren't empty.


In [13]:
# Each mask has one entry per original row, so its length is the row count before pruning
print(f""" Rows before pruning: \n\tUnigram features: {len(unigram_mask):,}
\tBigram features: {len(bigram_mask):,}
\tTrigram features: {len(trigram_mask):,}""")

print(f""" Rows removed: \n\tUnigram features: {unigram_mask.sum():,}
\tBigram features: {bigram_mask.sum():,}
\tTrigram features: {trigram_mask.sum():,}""")

 Rows before pruning: 
	Unigram features: 1,812,700
	Bigram features: 1,812,700
	Trigram features: 1,812,700
 Rows removed: 
	Unigram features: 384,319
	Bigram features: 1,570,057
	Trigram features: 1,784,996


# Preparing data for modeling

In [18]:
df[['date', 'score', 'sentiment']].to_parquet(path=f'../data/training_data/{name}_targets.parquet')

In [15]:
unigram_features.to_parquet(path=f'../data/training_data/{name}_unigram_features.parquet')
bigram_features.to_parquet(path=f'../data/training_data/{name}_bigram_features.parquet')
trigram_features.to_parquet(path=f'../data/training_data/{name}_trigram_features.parquet')
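Because pruning removed different rows from each feature set, the saved feature files have fewer rows than the targets file. A minimal sketch of how the two might be re-aligned by `id` before modeling is shown below. The toy frames and the `X`/`y` names are assumptions for illustration, and the modeling notebook may handle this differently.

```python
import pandas as pd

# Hypothetical targets for four posts, indexed by post id.
targets = pd.DataFrame(
    {"score": [5, 1, 12, 3], "sentiment": [0.4, -0.2, 0.0, 0.7]},
    index=pd.Index([0, 1, 2, 3], name="id"),
)

# Hypothetical pruned feature set: only posts 1 and 3 survived.
bigram_features = pd.DataFrame(
    {"_fake_news": [1, 0], "_build_wall": [0, 1]},
    index=pd.Index([1, 3], name="id"),
    dtype="int8",
)

# Keep only the target rows whose id survived pruning in this feature set.
aligned_targets = targets.loc[bigram_features.index]

X = bigram_features
y = aligned_targets["sentiment"]
assert X.index.equals(y.index)
```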