# Wallstreetbets data (Kaggle) EDA

### Parent Directory Configurations 

**Notebook note:** Make sure the working directory of this notebook is the base directory of this repository. 
All relative paths used below (for example, when reading the data) are resolved from that directory; 
otherwise, you will need to change the working directory manually, which is awkward in Jupyter notebooks. 
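
If the notebook is started from a subfolder, one way to recover the base directory is to walk up the parent folders until a known marker is found. The sketch below is only an illustration: the helper name `find_repo_root` is hypothetical, and using the `data_raw` folder as the marker is an assumption based on the data path used later in this notebook.

```python
from pathlib import Path
from typing import Optional


def find_repo_root(start: Path, marker: str = "data_raw") -> Optional[Path]:
    """Return `start` or its closest ancestor that contains a `marker` folder.

    `marker` defaults to "data_raw" (an assumption about the repository layout).
    Returns None when no such folder is found.
    """
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    return None


# Possible usage inside the notebook (left commented out on purpose):
# import os
# root = find_repo_root(Path.cwd())
# if root is not None:
#     os.chdir(root)
```

Changing the working directory once at the top keeps every later relative path unchanged.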

In [4]:
# verify working directory of the notebook 
import os 
print(os.getcwd())

c:\Users\jairp\OneDrive\Desktop_remote\HEC Montreal\3. Winter 2024\Machine Learning I\WallstreetbetsGenNLP


## Imports 

In [6]:
# General imports 
import zipfile 
from typing import Union, List 
from tqdm.notebook import tqdm 

# Data Analysis 
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 

# NLTK setup
import nltk 
nltk.download('punkt')
nltk.download('stopwords')

# Text Processing 
import re 
import spacy
from emoji import demojize
from spacy.tokens import Doc
from nltk.corpus import stopwords

# Configurations 
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# Exclude common negation words from the stop words list
negation_words = {'no', 'not', 'nor', 'neither', 'never', "n't", 'none'}
stop_words = stop_words - negation_words

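# Load the spaCy model and disable components that are not needed here, to speed up processing.
# Caveat: the rule-based lemmatizer relies on POS tags, so with "tok2vec" and "attribute_ruler"
# disabled, token.lemma_ may fall back to the lowercased token (e.g. "sending" is not reduced to "send").
nlp = spacy.load("en_core_web_md", disable=["parser", "ner", "textcat", "attribute_ruler", "tok2vec"])
# nlp.max_length = 1500000  # Adjust based on your text size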


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\jairp\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\jairp\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\jairp\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
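
To illustrate why the negation words are taken out of the stop-word set, here is a minimal sketch. It uses a tiny hand-written stop-word list (an assumption for illustration; the NLTK English list is much longer) and a hypothetical helper `drop_stop_words`.

```python
from typing import List, Set

# Toy stop-word list, for illustration only (not the NLTK list)
toy_stop_words: Set[str] = {"it", "is", "the", "a", "about", "not", "no"}
negation_words: Set[str] = {"no", "not", "nor", "neither", "never", "n't", "none"}


def drop_stop_words(tokens: List[str], stop_words: Set[str]) -> List[str]:
    """Keep only the tokens whose lowercase form is not a stop word."""
    return [t for t in tokens if t.lower() not in stop_words]


tokens = "it is not about the money".split()

# With the plain stop-word list, the negation disappears
without_negation = drop_stop_words(tokens, toy_stop_words)

# After removing the negation words from the stop-word set, "not" survives
with_negation = drop_stop_words(tokens, toy_stop_words - negation_words)

print(without_negation)
print(with_negation)
```

Keeping "not" matters for any downstream sentiment-like analysis, since "not about the money" and "about the money" would otherwise reduce to the same tokens.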


# Data Loading 

### Extracting the data 

Here, we extract the zipped CSV file into `data_raw` and preview the first rows of the data.

In [7]:
# Specify the path to the zip file
zip_file_path = 'data_raw/reddit_wsb.csv.zip'

# Extract the zip file
with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    zip_ref.extractall('data_raw')

# Read the CSV using Pandas
csv_file_path = 'data_raw/reddit_wsb.csv'
df = pd.read_csv(csv_file_path)

# Preview the loaded data 
display(df.head(40))

| | title | score | id | url | comms_num | created | body | timestamp |
|---|---|---|---|---|---|---|---|---|
| 0 | It's not about the money, it's about sending a... | 55 | l6ulcx | https://v.redd.it/6j75regs72e61 | 6 | 1611863000.0 | | 2021-01-28 21:37:41 |
| 1 | Math Professor Scott Steiner says the numbers ... | 110 | l6uibd | https://v.redd.it/ah50lyny62e61 | 23 | 1611862000.0 | | 2021-01-28 21:32:10 |
| 2 | Exit the system | 0 | l6uhhn | https://www.reddit.com/r/wallstreetbets/commen... | 47 | 1611862000.0 | The CEO of NASDAQ pushed to halt trading “to g... | 2021-01-28 21:30:35 |
| 3 | NEW SEC FILING FOR GME! CAN SOMEONE LESS RETAR... | 29 | l6ugk6 | https://sec.report/Document/0001193125-21-019848/ | 74 | 1611862000.0 | | 2021-01-28 21:28:57 |
| 4 | Not to distract from GME, just thought our AMC... | 71 | l6ufgy | https://i.redd.it/4h2sukb662e61.jpg | 156 | 1611862000.0 | | 2021-01-28 21:26:56 |
| 5 | WE BREAKING THROUGH | 405 | l6uf7d | https://i.redd.it/2wef8tc062e61.png | 84 | 1611862000.0 | | 2021-01-28 21:26:30 |
| 6 | SHORT STOCK DOESN'T HAVE AN EXPIRATION DATE | 317 | l6uf6d | https://www.reddit.com/r/wallstreetbets/commen... | 53 | 1611862000.0 | Hedgefund whales are spreading disinfo saying ... | 2021-01-28 21:26:27 |
| 7 | THIS IS THE MOMENT | 405 | l6ub9l | https://www.reddit.com/r/wallstreetbets/commen... | 178 | 1611862000.0 | Life isn't fair. My mother always told me that... | 2021-01-28 21:19:31 |
| 8 | Currently Holding AMC and NOK - Is it retarded... | 200 | l6ub4i | https://i.redd.it/6k2z7ouo42e61.png | 161 | 1611862000.0 | | 2021-01-28 21:19:16 |
| 9 | I have nothing to say but BRUH I am speechless... | 291 | l6uas9 | https://i.redd.it/bfzzw2yo42e61.jpg | 27 | 1611862000.0 | | 2021-01-28 21:18:37 |
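
As an alternative to extracting the archive to disk, the first rows can also be read straight from the zip file. The sketch below uses only the standard library; the helper name `preview_zipped_csv` is hypothetical, and the UTF-8 encoding is an assumption about the file.

```python
import csv
import io
import itertools
import zipfile
from typing import Dict, List


def preview_zipped_csv(zip_path: str, member: str, n_rows: int = 5) -> List[Dict[str, str]]:
    """Return the first `n_rows` rows of a CSV file stored inside a zip archive.

    Each row is a dict keyed by the CSV header; all values are strings.
    """
    with zipfile.ZipFile(zip_path, "r") as archive:
        with archive.open(member) as raw:
            # Assumes the CSV is UTF-8 encoded
            text_stream = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            reader = csv.DictReader(text_stream)
            return list(itertools.islice(reader, n_rows))


# Possible usage with the paths from this notebook (left commented out):
# rows = preview_zipped_csv("data_raw/reddit_wsb.csv.zip", "reddit_wsb.csv", n_rows=3)
```

This avoids writing a second copy of the data, at the cost of getting plain strings instead of a typed DataFrame.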


In [8]:
df.columns

Index(['title', 'score', 'id', 'url', 'comms_num', 'created', 'body',
       'timestamp'],
      dtype='object')

### Text Cleaning Versions 

In [44]:
def clean_text(text: Union[str, List[str], pd.Series]) -> Union[str, List[str]]:
    """
    Cleans a given text or collection of texts while preserving negation words.

    Steps: spaCy tokenization, punctuation stripping, stop-word removal, and lemmatization
    of the kept tokens (negation words are kept lowercased instead of lemmatized).
    Emoji aliases from `demojize` are only used to decide whether a token is kept;
    the emitted token is `token.lemma_`, so emojis appear unchanged in the output.

    Returns a cleaned string for a `str` input, or a list of cleaned strings otherwise.
    """
    # Define helper function to preprocess a single text
    def preprocess_single_text(doc: Doc) -> str:
        tokens = []
        for token in doc:
            # Map emojis to text aliases such as ":rocket:" (used only by the filter below)
            token_text = demojize(token.text)

            # Strip punctuation, keeping word characters, whitespace, dashes and colons
            token_text = re.sub(r'[^\w\s\-:]', '', token_text)
            lowered = token_text.lower()

            is_negation = lowered in negation_words
            is_emoji_alias = token_text.startswith(":") and token_text.endswith(":")
            is_content = lowered not in stop_words and (token_text.isalnum() or is_emoji_alias)

            # Keep negation words, plus non-stop-word tokens that are alphanumeric or emoji aliases
            if is_content or is_negation:
                # Negation words are appended lowercased; all other tokens use the lemma
                # of the original token (so an emoji is emitted as the emoji itself)
                tokens.append(lowered if is_negation else token.lemma_)
        return " ".join(tokens)

    # Check if input is a single string or a collection of strings
    if isinstance(text, str):
        # Process a single string
        return preprocess_single_text(nlp(text))
    elif isinstance(text, (pd.Series, list)):
        # Process a collection of strings using nlp.pipe for efficient batch processing
        texts = [preprocess_single_text(doc) for doc in nlp.pipe(text, batch_size=50)]
        return texts
    else:
        raise ValueError("Input type must be str, List[str], or pd.Series")
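
The punctuation pattern and the keep/drop condition used in `clean_text` can be explored without loading spaCy. The sketch below is a standalone illustration: `strip_punctuation` and `keep_token` are hypothetical helpers that mirror the logic above, and the toy word sets are assumptions rather than the NLTK list.

```python
import re
from typing import Set

# Same pattern as in clean_text: drop everything except word characters,
# whitespace, dashes and colons
PUNCT_PATTERN = re.compile(r"[^\w\s\-:]")


def strip_punctuation(token_text: str) -> str:
    return PUNCT_PATTERN.sub("", token_text)


def keep_token(token_text: str, stop_words: Set[str], negation_words: Set[str]) -> bool:
    """Keep/drop decision with explicit parentheses around each part."""
    lowered = token_text.lower()
    is_alias = token_text.startswith(":") and token_text.endswith(":")
    is_content = lowered not in stop_words and (token_text.isalnum() or is_alias)
    return is_content or lowered in negation_words


# Toy sets for illustration only
toy_stop_words = {"the", "about", "not"}
toy_negations = {"not", "never"}

examples = ["message.", "$GME", ":rocket:", "n't", ","]
stripped = [strip_punctuation(t) for t in examples]
decisions = [keep_token(t, toy_stop_words, toy_negations) for t in stripped]

for original, cleaned, kept in zip(examples, stripped, decisions):
    print(repr(original), "->", repr(cleaned), "kept" if kept else "dropped")
```

Two details are worth noticing: the apostrophe is stripped, so `"n't"` becomes `"nt"` and no longer matches the negation set literally, and a token that still contains a dash fails `isalnum()`, so it is dropped unless it is an emoji alias.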


In [46]:
# Extract an example title from the data
example_title = df['title'].iloc[0]

# Apply the cleaning function
cleaned_title = clean_text(example_title)

# Print the title before and after cleaning 
print(example_title)
print(cleaned_title)

It's not about the money, it's about sending a message. 🚀💎🙌
not money sending message 🚀 💎 🙌
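
When moving from a single title to a whole column, note that several rows in the preview have an empty `body`. Assuming that `nlp.pipe` expects string inputs, missing entries should be normalized before batching. The helper `to_text_list` below is a hypothetical sketch of that step, written with the standard library only.

```python
import math
from typing import Any, Iterable, List


def to_text_list(values: Iterable[Any]) -> List[str]:
    """Replace missing entries (None or float NaN) with "" and cast the rest to str."""
    texts = []
    for value in values:
        if value is None or (isinstance(value, float) and math.isnan(value)):
            texts.append("")
        else:
            texts.append(str(value))
    return texts


sample = ["Exit the system", float("nan"), None, 42]
normalized = to_text_list(sample)
print(normalized)

# Possible usage in this notebook (left commented out):
# cleaned_bodies = clean_text(to_text_list(df['body']))
```

An empty string passes through the cleaning function as an empty result, so row alignment with the original DataFrame is preserved.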
