Cleaning Comments Data using spaCy, re, and NLTK

Introduction
In this Jupyter Notebook, we demonstrate two ways to clean comments data. The first part uses spaCy for tokenization, stop word removal, and lemmatization; the second part uses NLTK for the same steps. Both parts use the re (regular expression) library to remove special characters. The goal is to preprocess the raw comments data and prepare it for further analysis or natural language processing tasks.

Important Note: spaCy and NLTK are two separate libraries, and each is applied independently in this notebook. Please don't run all the cells in one go; run either the spaCy part or the NLTK part.

The 1st Part: spaCy & re

In [None]:
# Install spaCy

%pip install spacy

In [None]:
# Download the spaCy English language model

!python -m spacy download en_core_web_sm

In [1]:
# Importing Required Libraries

import numpy as np
import pandas as pd
import spacy
import re

In [2]:
# Loading the Data

df = pd.read_csv("consol_Reddit_comment_Jan-Jun23.csv")

In [3]:
# Text Preprocessing with spaCy and re

nlp = spacy.load("en_core_web_sm")
nlp.max_length = 3000000  # Raise the character limit for long texts (default is 1,000,000)
spacy_stopwords = nlp.Defaults.stop_words

In [4]:
# Print the default spaCy stopwords and how many there are

print(spacy_stopwords)
print(len(spacy_stopwords))

{'any', 'until', 'one', 'thereafter', 'everything', 'around', 'not', 'perhaps', 'but', 'alone', 'nobody', 'several', 'that', 'whoever', 'it', 'me', 'none', 'themselves', 'still', '’ve', 'empty', 'only', 'get', 'is', 'see', '‘ve', 'full', 'may', 'mine', 'too', 'nine', 'anyhow', 'thence', 'everyone', 'meanwhile', 'next', 'above', 'per', 'which', 'wherever', 'used', 'amount', 'last', 'made', 'there', 'formerly', '’s', 'together', 'yet', 'indeed', 'behind', 'them', 'an', 'back', 'eleven', 'for', 'regarding', 'other', 'hereupon', 'they', 'thereupon', 'becoming', 'twelve', 'give', 'so', 'the', 'more', '‘re', 'ourselves', 'below', 'latter', 'beforehand', 'you', "'m", 'once', 'upon', 'now', 'others', 'are', 'he', 'might', 'should', 'third', 'with', 'nowhere', 'such', 'neither', 'whence', 'was', '‘ll', 'make', 'no', 'part', 'who', 'sometime', 'because', 'elsewhere', 'except', 'itself', 'us', 'throughout', 'then', 'less', 'whole', 'up', 'hereby', 'move', 'did', 'himself', 'though', 'unless', 'ca

In [5]:
# Define your own project-specific stopwords (lowercase), for example:
# custom_stopwords = ['reddit', 'stock', 'market', 'company', 'invest', 'close', 'green', 'candle', 'like']

custom_stopwords = []
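
The cleaning function in the next cell compares lowercased lemmas against this list, so entries only match if they are lowercase and free of stray whitespace. A minimal sketch of how a list could be normalized first; `normalize_stopwords` is a hypothetical helper, not part of the original notebook:

```python
def normalize_stopwords(words):
    """Return a lowercase, whitespace-trimmed set of stopwords, skipping empty entries."""
    return {w.strip().lower() for w in words if w and w.strip()}

# Hypothetical input with mixed case, padding, and an empty entry
example_stopwords = normalize_stopwords(["Reddit", " stock ", "", "market"])
print(sorted(example_stopwords))  # ['market', 'reddit', 'stock']
```

Using a set also makes the membership test faster than a list when the stopword list grows.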

In [6]:
# Tokenize, lemmatize, and filter stopwords and punctuation with spaCy

def clean_comment(comment, custom_stopwords=None):
    doc = nlp(comment)
    cleaned_tokens = []
    
    if custom_stopwords is None:
        custom_stopwords = []
    
    for token in doc:
        if not token.is_stop and not token.is_punct and token.lemma_.lower() not in custom_stopwords:
            cleaned_tokens.append(token.lemma_.lower())
    
    cleaned_comment = " ".join(cleaned_tokens)
    
    # Remove meaningless symbols first, then collapse the whitespace they leave behind
    cleaned_comment = re.sub(r'[^a-zA-Z0-9\s]', '', cleaned_comment)
    cleaned_comment = re.sub(r'\s+', ' ', cleaned_comment).strip()
    
    return cleaned_comment
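
Because the symbol-removal regex deletes characters such as ":", "/", and ".", any URL in a comment is fused into a single long token (the output further down shows "threadhttpswwwredditcom"). One possible remedy, sketched here under the assumption that links carry no useful signal, is to strip them before calling clean_comment. The `strip_links` helper, both patterns, and the sample URLs are hypothetical and not part of the original notebook:

```python
import re

# Hypothetical patterns for illustration only
MARKDOWN_LINK = re.compile(r'\[([^\]]*)\]\([^)]*\)')  # [text](url): keep only "text"
BARE_URL = re.compile(r'https?://\S+|www\.\S+')       # plain URLs

def strip_links(text):
    """Replace Markdown links with their link text and drop bare URLs."""
    text = MARKDOWN_LINK.sub(r'\1', text)
    text = BARE_URL.sub(' ', text)
    return re.sub(r'\s+', ' ', text).strip()

sample = "This week's [Earnings Thread](https://example.com/thread) is up, see https://example.com/x now"
print(strip_links(sample))  # This week's Earnings Thread is up, see now
```

It could then be applied as clean_comment(strip_links(x), custom_stopwords). The Markdown pattern is deliberately simple and does not handle URLs that themselves contain parentheses.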


In [7]:
# Apply to the data

df['cleaned_comments'] = df['comments'].apply(lambda x: clean_comment(x, custom_stopwords))
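
If the comments column contains empty cells, pandas reads them as float NaN, and passing a non-string value to nlp() would most likely raise an error. A minimal sketch of a guard, assuming missing comments should simply become empty strings; `to_text` is a hypothetical helper:

```python
import math

def to_text(value):
    """Return value as a string, mapping None and NaN to an empty string."""
    if value is None:
        return ""
    if isinstance(value, float) and math.isnan(value):
        return ""
    return str(value)

print([to_text("Cashed out"), to_text(float("nan")), to_text(None), to_text(56)])
# ['Cashed out', '', '', '56']
```

With this in place, the apply call could be written as df['comments'].apply(lambda x: clean_comment(to_text(x), custom_stopwords)).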

In [8]:
df.head()

Unnamed: 0.1,Unnamed: 0,index,date,comments,Open_x,High_x,Low_x,Close_x,Adj Close_x,Volume_x,...,Low_y,Close_y,Adj Close_y,Volume_y,percent_chnage_y,Jump_y,Big_Jump_y,Drop_y,Big_Drop_y,cleaned_comments
0,0,0,2022-07-13,This week's [Earnings Thread](https://www.redd...,3779.669922,3829.439941,3759.070068,3801.780029,3801.780029,4109390000,...,11031.26953,11247.58008,11247.58008,4433060000,1.727757,0,1,0,0,week earnings threadhttpswwwredditcom r wallst...
1,1,1,2022-07-14,This week's [Earnings Thread](https://www.redd...,3763.98999,3796.409912,3721.560059,3790.379883,3790.379883,4199690000,...,11005.92969,11251.19043,11251.19043,4481070000,0.896589,1,0,0,0,week earnings threadhttpswwwredditcom r wallst...
2,2,2,2022-07-15,Cashed out up 56k today. Now cuddled up watchi...,3818.0,3863.620117,3817.179932,3863.159912,3863.159912,4143800000,...,11295.33008,11452.41992,11452.41992,4369060000,0.642036,1,0,0,0,cashed 56k today cuddle watch netflix eat chur...
3,3,3,2022-07-19,This week's [Earnings Thread](https://www.redd...,3860.72998,3939.810059,3860.72998,3936.689941,3936.689941,4041070000,...,11448.96973,11713.15039,11713.15039,5302740000,1.720802,0,1,0,0,week earnings threadhttpswwwredditcom r wallst...
4,4,4,2022-07-20,If 2008 was the Great Recession\n\nThen 2022 i...,3935.320068,3974.129883,3922.030029,3959.899902,3959.899902,4185300000,...,11703.36035,11897.65039,11897.65039,5467080000,1.463067,0,1,0,0,2008 great recession 2022 fake recession nflx ...


In [9]:
# index=False avoids writing the row index as an extra "Unnamed: 0" column
df.to_csv("spacy_Reddit_comment_Jan-Jun23.csv", index=False)

The 2nd Part: NLTK and re

Important Note: NLTK and spaCy are two separate libraries, and each is applied independently in this notebook. Please don't run all the cells in one go; run this part on its own.

In [None]:
# Install NLTK (Natural Language Toolkit)

%pip install nltk

In [10]:
# Importing Required Libraries

import pandas as pd
import nltk
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

In [11]:
# Loading the Data

df = pd.read_csv("consol_Reddit_comment_Jan-Jun23.csv")

In [12]:
# Download NLTK resources

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/wailunchan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/wailunchan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/wailunchan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [13]:
# Define project-specific stopwords (lowercase) to remove in addition to the NLTK defaults

custom_stopwords = []

In [14]:
# Initialize the NLTK lemmatizer

lemmatizer = WordNetLemmatizer()

In [15]:
# Define a function to clean and preprocess the words

# Build the stopword set once instead of on every call
stop_words = set(stopwords.words('english') + custom_stopwords)

def clean_words(text):
    # Tokenize the lowercased text into individual words
    tokens = word_tokenize(text.lower())

    # Remove every character that is not a letter (digits and punctuation are dropped)
    tokens = [re.sub(r'[^a-zA-Z\s]', '', token) for token in tokens]

    # Strip surrounding whitespace from the tokens
    tokens = [token.strip() for token in tokens]

    # Remove stopwords
    tokens = [token for token in tokens if token not in stop_words]

    # Lemmatize the words
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]

    # Filter out empty strings ('')
    lemmatized_tokens = [token for token in lemmatized_tokens if token]

    # Join the lemmatized tokens back into a single string
    cleaned_text = ' '.join(lemmatized_tokens)

    return cleaned_text
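
Note that the letters-only pattern also removes digits inside tokens, which is why "56k" appears as "k" in the output further down. The sketch below compares that pattern with an alternative that keeps digits. The token list is hand-written to stand in for word_tokenize output, and `strip_chars` is a hypothetical helper:

```python
import re

LETTERS_ONLY = re.compile(r'[^a-zA-Z\s]')           # pattern used in clean_words
LETTERS_AND_DIGITS = re.compile(r'[^a-zA-Z0-9\s]')  # alternative that keeps digits

def strip_chars(tokens, pattern):
    """Apply the pattern to each token and drop tokens that end up empty."""
    cleaned = (pattern.sub('', token) for token in tokens)
    return [token for token in cleaned if token]

# Hand-written tokens, assumed to resemble tokenizer output
tokens = ['cashed', 'out', 'up', '56k', 'today', '.']
print(strip_chars(tokens, LETTERS_ONLY))        # ['cashed', 'out', 'up', 'k', 'today']
print(strip_chars(tokens, LETTERS_AND_DIGITS))  # ['cashed', 'out', 'up', '56k', 'today']
```

Which pattern is preferable depends on whether figures such as amounts and years matter for the downstream analysis.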

In [22]:
print(stopwords.words('english'))
print(len(stopwords.words('english')))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [16]:
# Apply the cleaning function to the comments column

df['cleaned_comments'] = df['comments'].apply(clean_words)

In [17]:
df.head()

Unnamed: 0.1,Unnamed: 0,index,date,comments,Open_x,High_x,Low_x,Close_x,Adj Close_x,Volume_x,...,Low_y,Close_y,Adj Close_y,Volume_y,percent_chnage_y,Jump_y,Big_Jump_y,Drop_y,Big_Drop_y,cleaned_comments
0,0,0,2022-07-13,This week's [Earnings Thread](https://www.redd...,3779.669922,3829.439941,3759.070068,3801.780029,3801.780029,4109390000,...,11031.26953,11247.58008,11247.58008,4433060000,1.727757,0,1,0,0,week earnings thread http wwwredditcomrwallstr...
1,1,1,2022-07-14,This week's [Earnings Thread](https://www.redd...,3763.98999,3796.409912,3721.560059,3790.379883,3790.379883,4199690000,...,11005.92969,11251.19043,11251.19043,4481070000,0.896589,1,0,0,0,week earnings thread http wwwredditcomrwallstr...
2,2,2,2022-07-15,Cashed out up 56k today. Now cuddled up watchi...,3818.0,3863.620117,3817.179932,3863.159912,3863.159912,4143800000,...,11295.33008,11452.41992,11452.41992,4369060000,0.642036,1,0,0,0,cashed k today cuddled watching netflix eating...
3,3,3,2022-07-19,This week's [Earnings Thread](https://www.redd...,3860.72998,3939.810059,3860.72998,3936.689941,3936.689941,4041070000,...,11448.96973,11713.15039,11713.15039,5302740000,1.720802,0,1,0,0,week earnings thread http wwwredditcomrwallstr...
4,4,4,2022-07-20,If 2008 was the Great Recession\n\nThen 2022 i...,3935.320068,3974.129883,3922.030029,3959.899902,3959.899902,4185300000,...,11703.36035,11897.65039,11897.65039,5467080000,1.463067,0,1,0,0,great recession fake recession nflx gon na war...


In [18]:
# index=False avoids writing the row index as an extra "Unnamed: 0" column
df.to_csv("nltk_Reddit_comment_Jan-Jun23.csv", index=False)