## Text Preprocessing: Tokenization, Stop Word Removal, Lemmatization, and Progress Tracking in Python

- Tokenization: Splitting text into individual words.
- Stop Word Removal: Eliminating common words that add little meaning.
- Lemmatization: Reducing words to their base or root form.
- Progress Tracking: Using a progress bar to monitor the preprocessing of large datasets.

In [1]:
import pandas as pd

In [4]:
data = pd.read_csv("./nlp_tweets_processing/cleaned_data.csv")

In [6]:
data.columns

Index(['Date', 'Tweet Count', 'Username', 'Text', 'Created At', 'Retweets',
       'Likes', 'stockname'],
      dtype='object')

In [17]:
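import nltk
# 'all' fetches every NLTK package; the steps below only use the tokenizer
# models, the English stop word list, and WordNet.
nltk.download('all')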

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     C:\Users\Dell\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\abc.zip.
[nltk_data]    | Downloading package alpino to
[nltk_data]    |     C:\Users\Dell\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\alpino.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     C:\Users\Dell\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping taggers\averaged_perceptron_tagger.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger_eng to
[nltk_data]    |     C:\Users\Dell\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping
[nltk_data]    |       taggers\averaged_perceptron_tagger_eng.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     C:\Users\Dell\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping
[nltk_data]    |       taggers\av

True

In [18]:
data["Text"].head(10)

0    ไม่พูดไม่ได้ adidas ลดเยอะมากก ทั้งรุ่นคลาสสิก...
1    WHAT A DAAAAAY!\n \nTraded with some awesome t...
2    Stocks to Watch out\n\n1. Hind Copper\n2. Fluo...
3    옵션 Implied Volatility을 보면, 곧 큰 변동성이 나타날 가능성이 높...
4    おはようございます。主要3指数はまちまちの展開。ダウは2日連続での最高値更新。一方ナスダック...
5    21年のSaaS企業 パフォーマンス（年初来）\n\nプラスで終えたのは以下の5社のみで、大...
6    #BBAS3 O LL anualizado de 2021 evoluiu 44% em ...
7    【ばっちゃまの米国株YouTube🇺🇸👵🏻】\n (1/4) 悪い兆候です。\n今日のビデオ...
8    FREE #OPTIONS Ideas\n\nScale out when above 25...
9    Take a guess on what’s my favorite play?\n\nCa...
Name: Text, dtype: object

In [20]:
# Import necessary libraries
import re
import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from tqdm import tqdm

# Initialize the WordNet Lemmatizer
lemmatizer = WordNetLemmatizer()

# Build the stop word set once instead of on every call
stop_words = set(stopwords.words('english'))

# Enable tqdm for pandas
tqdm.pandas()

# Define the preprocessing function
def preprocess_text(text):
    # Return an empty string for missing or non-string values (e.g. NaN)
    if not isinstance(text, str):
        return ''

    # Lowercasing
    text = text.lower()

    # Removing URLs ('http\S+' already covers https links)
    text = re.sub(r'http\S+|www\S+', '', text)

    # Removing Mentions (@username)
    text = re.sub(r'@\w+', '', text)

    # Removing Hashtags (#hashtag)
    text = re.sub(r'#\w+', '', text)

    # Removing Stock Symbols ($AAPL)
    text = re.sub(r'\$\w+', '', text)

    # Removing Punctuation, Numbers, and Special Characters.
    # This also drops every non-ASCII letter, so non-English text is mostly erased.
    text = re.sub(r'[^a-zA-Z\s]', '', text)

    # Tokenization
    tokens = nltk.word_tokenize(text)

    # Removing Stop Words
    tokens = [word for word in tokens if word not in stop_words]

    # Lemmatization (WordNet treats each word as a noun by default)
    tokens = [lemmatizer.lemmatize(word) for word in tokens]

    # Joining tokens back into a single string
    clean_text = ' '.join(tokens)

    return clean_text

# Apply the preprocessing function to the 'Text' column with a progress bar
data['clean_text'] = data['Text'].progress_apply(preprocess_text)

100%|████████████████████████████████████████████████████████████████████████| 349986/349986 [04:34<00:00, 1277.00it/s]


In [21]:
data["clean_text"].head(10)

0                  adidas stansmith superstar k adidas
1    daaaaay traded awesome trader added monitor di...
2    stock watch hind copper fluorochem praj zuari ...
3                              implied volatility hour
4                                                sampp
5                                 saas sansan wantedly
6    anualizado de evoluiu em relao no ltimos meses...
7                                 youtube youtube live
8    free idea scale profit cgt plt cgt plt cgt plt...
9    take guess whats favorite play blame stock mov...
Name: clean_text, dtype: object
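
The cleaned output above shows a side effect of the `[^a-zA-Z\s]` filter: tweets written in Thai, Korean, or Japanese collapse to a few stray Latin tokens (rows 0, 3, 4, 5, 7), and Portuguese loses its accented characters (row 6). One possible way to flag such rows before cleaning is a simple letter-ratio heuristic. The sketch below is an assumption, not part of the original notebook: `ascii_letter_ratio`, `is_probably_english`, and the 0.8 threshold are hypothetical names and values chosen for illustration. It only catches non-Latin scripts; Latin-script languages such as Portuguese would still pass.

```python
def ascii_letter_ratio(text: str) -> float:
    """Return the share of alphabetic characters that are ASCII letters."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    ascii_letters = sum(1 for ch in letters if ch.isascii())
    return ascii_letters / len(letters)


def is_probably_english(text, threshold: float = 0.8) -> bool:
    """Rough filter: True when most letters are ASCII (threshold is a guess)."""
    return isinstance(text, str) and ascii_letter_ratio(text) >= threshold
```

Applied to the raw column, this could be used as `data = data[data['Text'].apply(is_probably_english)]` before calling `preprocess_text`.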

## Sentiment Analysis

In [28]:
# Import necessary libraries
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from tqdm import tqdm

# Enable tqdm for pandas if not already enabled
tqdm.pandas()

# Initialize the VADER sentiment analyzer
analyzer = SentimentIntensityAnalyzer()

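# Extend VADER's lexicon with domain-specific terms.
# Note: VADER looks up single whitespace-separated tokens, so multi-word keys
# such as 'short squeeze' or 'dead cat bounce' are unlikely to match, and
# 'all-time high' cannot match clean_text because hyphens were stripped.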
new_words = {
    'bullish': 2.0,
    'bearish': -2.0,
    'fomo': -1.5,
    'moon': 3.0,
    'rocket': 2.5,
    'crash': -2.5,
    'bagholder': -2.0,
    'pump': 1.5,
    'dump': -2.0,
    'manipulation': -2.5,
    'hodl': 1.0,
    'yolo': 0.5,
    'short squeeze': 2.5,
    'all-time high': 2.0,
    'atm': 0.5,
    'green': 1.5,
    'red': -1.5,
    'long': 1.0,
    'short': -1.0,
    'overvalued': -2.0,
    'undervalued': 2.0,
    'buyback': 1.5,
    'selloff': -1.5,
    'breakout': 1.5,
    'fakeout': -1.5,
    'dip': -1.0,
    'rally': 2.0,
    'resistance': -0.5,
    'support': 0.5,
    'consolidation': 0.0,
    'downgrade': -1.5,
    'upgrade': 1.5,
    'miss': -1.5,
    'beat': 1.5,
    'whale': 0.5,
    'correction': -1.5,
    'volatility': -0.5,
    'dead cat bounce': -2.0,
    'golden cross': 2.0,
    'death cross': -2.0,
    # Add more terms as needed
}

# Update the analyzer's lexicon
analyzer.lexicon.update(new_words)

# Define the function to compute sentiment scores
def get_sentiment_score(text):
    if not text:
        return 0.0  # Neutral sentiment
    sentiment = analyzer.polarity_scores(text)
    return sentiment['compound']

# Apply the sentiment analysis function to the 'clean_text' column
data['sentiment_score'] = data['clean_text'].progress_apply(get_sentiment_score)


100%|████████████████████████████████████████████████████████████████████████| 349986/349986 [01:09<00:00, 5020.32it/s]
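
Because VADER scores text token by token after splitting on whitespace, the multi-word keys added to the lexicon above probably never fire. One possible workaround, sketched here as an assumption rather than the notebook's method, is to join each phrase into a single underscore token in both the text and the lexicon. `PHRASE_SCORES`, `merge_phrases`, and `MERGED_LEXICON` are hypothetical names; the scores are copied from `new_words`, and the keys are spelled the way they appear after cleaning (so `all-time high` becomes `alltime high`).

```python
import re

# Multi-word entries copied from new_words, spelled as they look in clean_text
# (the punctuation filter removes the hyphen from "all-time").
PHRASE_SCORES = {
    "short squeeze": 2.5,
    "alltime high": 2.0,
    "dead cat bounce": -2.0,
    "golden cross": 2.0,
    "death cross": -2.0,
}

# Longest phrases first so a longer phrase wins over any shorter overlap.
_PHRASE_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(p) for p in sorted(PHRASE_SCORES, key=len, reverse=True))
    + r")\b"
)


def merge_phrases(text: str) -> str:
    """Replace each known phrase with one underscore-joined token."""
    return _PHRASE_PATTERN.sub(lambda m: m.group(1).replace(" ", "_"), text)


# Single-token versions of the phrase entries, suitable for lexicon.update(...)
MERGED_LEXICON = {p.replace(" ", "_"): s for p, s in PHRASE_SCORES.items()}
```

With this in place, the analyzer would be updated with `analyzer.lexicon.update(MERGED_LEXICON)` and scored on `merge_phrases(text)` instead of the raw cleaned string.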


In [24]:
data.sentiment_score

0         0.0000
1         0.9091
2         0.0000
3         0.0000
4         0.0000
           ...  
349981    0.0000
349982    0.0000
349983    0.1531
349984    0.2023
349985    0.0000
Name: sentiment_score, Length: 349986, dtype: float64

In [27]:
# Calculate the number of tweets with a sentiment score of zero
num_zero_sentiments = (data['sentiment_score'] == 0.0).sum()

# Calculate the total number of tweets
total_tweets = len(data)

# Calculate the percentage of zero sentiment tweets
percentage_zero = (num_zero_sentiments / total_tweets) * 100

# Print the results
print(f"Number of tweets with sentiment score zero: {num_zero_sentiments}")
print(f"Total number of tweets: {total_tweets}")
print(f"Percentage of zero sentiment tweets: {percentage_zero:.2f}%")

Number of tweets with sentiment score zero: 137004
Total number of tweets: 349986
Percentage of zero sentiment tweets: 39.15%
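
A compound score of exactly 0.0 typically means that no lexicon word matched or that the cleaned text was empty, which applies to 39.15% of the rows here. When compound scores are turned into labels, the VADER documentation suggests treating 0.05 and above as positive, -0.05 and below as negative, and everything in between as neutral. A minimal sketch of that bucketing follows; `label_from_compound` is a hypothetical helper, not part of the notebook.

```python
def label_from_compound(score: float, threshold: float = 0.05) -> str:
    """Bucket a VADER compound score into positive / negative / neutral."""
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"
```

It could be applied as `data['sentiment_label'] = data['sentiment_score'].apply(label_from_compound)`, where `sentiment_label` is an illustrative column name.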


In [30]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from tqdm import tqdm
import pandas as pd
import numpy as np

# Enable tqdm for pandas
tqdm.pandas()

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("ProsusAI/finbert")
model = AutoModelForSequenceClassification.from_pretrained("ProsusAI/finbert")



In [31]:
def get_finbert_sentiment(text):
    if not isinstance(text, str) or text.strip() == '':
        return 0.0  # Neutral sentiment

    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    logits = outputs.logits
    probabilities = torch.nn.functional.softmax(logits, dim=1)
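    sentiment_score = probabilities.numpy()[0]

    # Map probabilities to a compound score (positive prob - negative prob).
    # Class indices come from the checkpoint config instead of being hard-coded:
    # the original indices 2 and 0 do not match the label order that
    # ProsusAI/finbert is expected to use (positive, negative, neutral).
    label2id = {label.lower(): idx for label, idx in model.config.label2id.items()}
    compound_score = sentiment_score[label2id['positive']] - sentiment_score[label2id['negative']]

    return float(compound_score)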

In [None]:
data['sentiment_score'] = data['clean_text'].progress_apply(get_finbert_sentiment)

  0%|▏                                                                        | 1144/349986 [02:06<12:26:36,  7.79it/s]
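
The progress bar above shows per-tweet inference running at roughly 8 tweets per second, with more than 12 hours estimated for the full dataset. Sending several tweets through the model in one forward pass usually reduces that overhead. The sketch below is one way to do it, under stated assumptions: `score_in_batches` and `make_finbert_scorer` are hypothetical helpers, the batch size of 32 is arbitrary, and the scorer reads class indices from `model.config.label2id` rather than assuming a fixed label order.

```python
from typing import Callable, List, Sequence


def score_in_batches(
    texts: Sequence[str],
    score_batch: Callable[[List[str]], List[float]],
    batch_size: int = 32,
) -> List[float]:
    """Score texts batch by batch; empty or non-string entries get 0.0."""
    scores = [0.0] * len(texts)
    # Keep only the positions that hold non-empty strings.
    valid = [i for i, t in enumerate(texts) if isinstance(t, str) and t.strip()]
    for start in range(0, len(valid), batch_size):
        idx = valid[start:start + batch_size]
        batch_scores = score_batch([texts[i] for i in idx])
        for i, s in zip(idx, batch_scores):
            scores[i] = float(s)
    return scores


def make_finbert_scorer(tokenizer, model):
    """Build a batch scorer returning P(positive) - P(negative) per text."""
    import torch  # imported lazily so the batching helper has no hard dependency

    label2id = {k.lower(): v for k, v in model.config.label2id.items()}
    pos, neg = label2id["positive"], label2id["negative"]

    def score_batch(batch: List[str]) -> List[float]:
        # Pad to the longest text in the batch and truncate long inputs.
        inputs = tokenizer(batch, return_tensors="pt", padding=True,
                           truncation=True, max_length=512)
        with torch.no_grad():
            logits = model(**inputs).logits
        probs = torch.softmax(logits, dim=1)
        return (probs[:, pos] - probs[:, neg]).tolist()

    return score_batch
```

A possible call, reusing the `tokenizer` and `model` loaded earlier, is `data['sentiment_score'] = score_in_batches(data['clean_text'].tolist(), make_finbert_scorer(tokenizer, model))`. Passing a plain list avoids any dependence on the DataFrame index.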

In [None]:
num_zero_sentiments = (data['sentiment_score'] == 0.0).sum()
total_tweets = len(data)
percentage_zero = (num_zero_sentiments / total_tweets) * 100

print(f"Number of tweets with sentiment score zero: {num_zero_sentiments}")
print(f"Total number of tweets: {total_tweets}")
print(f"Percentage of zero sentiment tweets: {percentage_zero:.2f}%")