## ***Natural Language Processing (NLP) on Financial Sentiment Data***

# **1. Project Overview and Objectives**

This project demonstrates the fundamental steps of Natural Language Processing (NLP) on real-world unstructured text data. The goal is to clean, standardize, and enrich the text to prepare it for advanced analytics, such as building a financial sentiment classifier.

### **1.1. Dataset Context: StockerBot Financial Tweets**


*Source*: Kaggle: https://www.kaggle.com/datasets/davidwallach/financial-tweets

*Content*: 28k+ tweets about publicly traded companies and cryptocurrencies.

*Influencers*: Tweets are sourced from key financial and news sources (e.g., MarketWatch, WSJ, Jim Cramer).

*Goal*: To understand public sentiment around financial markets based on influential voices.



### **1.2. Core Tasks**

We will execute and explain five essential NLP techniques on the tweet data:

1. Normalization: Cleaning and standardizing the text.

2. Stemming: Reducing words to their root stem (fast, heuristic method).

3. Lemmatization: Reducing words to their valid dictionary base form (accurate, linguistic method).

4. Text Enrichment (POS Tagging): Determining the grammatical role of each word.

5. Named Entity Recognition (NER): Identifying and classifying real-world entities (companies, persons, dates).

# **2. Environment Setup and Data Acquisition**

### **2.1. Installing Libraries and Downloading Resources**

This step installs the required Python packages (pandas, NLTK, spaCy) and downloads the NLTK resources and the spaCy language model used in later sections.

In [16]:
# Install required libraries
!pip install nltk spacy pandas kaggle -qq
!python -m spacy download en_core_web_sm -qq

# Import necessary modules
import nltk
import pandas as pd
import re
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import spacy
from google.colab import files
import os

# Download NLTK resources
# These downloads are required for Stop Words, Tokenization, Lemmatization, and POS Tagging
nltk.download('punkt', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('omw-1.4', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)
nltk.download('punkt_tab', quiet=True)
nltk.download('averaged_perceptron_tagger_eng', quiet=True)


# Load spaCy's small English pipeline (used for NER in section 3.5)
nlp = spacy.load("en_core_web_sm")


### **2.2. Data Loading, Selection, and Sampling**

In [4]:
# Load the dataset
data = pd.read_csv('stockerbot-export.csv', encoding='utf-8', on_bad_lines='skip')
data

| | id | text | timestamp | source | symbols | company_names | url | verified |
|---|---|---|---|---|---|---|---|---|
| 0 | 1019696670777503700 | VIDEO: “I was in my office. I was minding my o... | Wed Jul 18 21:33:26 +0000 2018 | GoldmanSachs | GS | The Goldman Sachs | https://twitter.com/i/web/status/1019696670777... | True |
| 1 | 1019709091038548000 | The price of lumber $LB_F is down 22% since hi... | Wed Jul 18 22:22:47 +0000 2018 | StockTwits | M | Macy's | https://twitter.com/i/web/status/1019709091038... | True |
| 2 | 1019711413798035500 | Who says the American Dream is dead? https://t... | Wed Jul 18 22:32:01 +0000 2018 | TheStreet | AIG | American | https://buff.ly/2L3kmc4 | True |
| 3 | 1019716662587740200 | Barry Silbert is extremely optimistic on bitco... | Wed Jul 18 22:52:52 +0000 2018 | MarketWatch | BTC | Bitcoin | https://twitter.com/i/web/status/1019716662587... | True |
| 4 | 1019718460287389700 | How satellites avoid attacks and space junk wh... | Wed Jul 18 23:00:01 +0000 2018 | Forbes | ORCL | Oracle | http://on.forbes.com/6013DqDDU | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 28259 | 1019730088617635800 | $FB : 29234a9c-7f08-4d5a-985f-cb1a5554ecf9 | Wed Jul 18 23:46:13 +0000 2018 | test5f1798 | FB | Facebook | | False |
| 28260 | 1019730115524288500 | 【仮想通貨】ビットコインの価格上昇、８０万円台回復　約１カ月半ぶり $BTC ht... | Wed Jul 18 23:46:19 +0000 2018 | keizai_toushi17 | BTC | Bitcoin | http://keizai-toushi-navi.com/?p=26838 | False |
| 28261 | 1019730115805184000 | RT @invest_in_hd: 'Nuff said! $TEL #telcoin #... | Wed Jul 18 23:46:19 +0000 2018 | iad81 | BTC | Bitcoin | https://twitter.com/CRYPTOVERLOAD/status/10178... | False |
| 28262 | 1019730117252341800 | 【仮想通貨】ビットコインの価格上昇、８０万円台回復　約１カ月半ぶり $BTC ht... | Wed Jul 18 23:46:20 +0000 2018 | O8viWMyrCV6cBOZ | BTC | Bitcoin | http://true.velvet.jp/monexx/archives/2357 | False |


In [8]:
# Select the 'text' column, which contains the tweet content
text_column = 'text' # Confirmed column name for tweet content

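# Select the text column, drop missing values, and take a random sample of 1000 tweets.
# dropna() must run before astype(str); otherwise NaN becomes the string 'nan' and is never dropped.
df_text = data[text_column].dropna().astype(str).sample(n=1000, random_state=42).reset_index(drop=True)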
sample_tweets_to_view = df_text.head(4).tolist()
sample_tweet = df_text.iloc[3]

print(f"\nDataset successfully loaded. Total rows: {len(data)}")
print(f"Total rows in sample for analysis: {len(df_text)}")
print("\n--- 4 Sample Tweets for Preview ---")
for i, tweet in enumerate(sample_tweets_to_view):
    print(f"Tweet {i+1}: {tweet}")

print(f"\n--- Selected Tweet (Tweet 4) for NLP Demonstration --- \n{sample_tweet}")



Dataset successfully loaded. Total rows: 28264
Total rows in sample for analysis: 1000

--- 4 Sample Tweets for Preview ---
Tweet 1: Some candidates for you to short:  $CHKP $XOM $CBS $MRK
Tweet 2: https://t.co/9VjKMnGXvX $CVNA $DAL $DISCA $MU $TKC https://t.co/wkeFqrTQxF
Tweet 3: As Accenture Plc Ireland $ACN Market Value Declined Morgan Stanley Trimmed Position - https://t.co/IFHuXpSSE9
Tweet 4: Are Analysts Bullish about Campbell Soup Company $CPB after last week? https://t.co/BcMiRuN8uF

--- Selected Tweet (Tweet 4) for NLP Demonstration --- 
Are Analysts Bullish about Campbell Soup Company $CPB after last week? https://t.co/BcMiRuN8uF


# **3. Natural Language Processing Techniques**

### **3.1. Normalization**

Normalization is the foundational step in text preprocessing. Its objective is to reduce variation and noise, leading to a more efficient and accurate analysis.

1. Lowercasing: All text is converted to lowercase, treating 'Apple', 'apple', and 'APPLE' as the same token.

2. Noise Removal: For financial tweets, this means stripping URLs, mentions (@user), hashtags (#finance), and stock tickers ($TSLA). Tickers matter for context, but they are removed here to focus on the surrounding verbal sentiment.

3. Stop Word Removal: Common function words (e.g., 'the', 'a', 'is') are filtered out using the NLTK English stop word list. This shrinks the data considerably, usually without losing core meaning. Note that the NLTK list also contains negations such as 'not' and 'no', which can matter for sentiment.

In [11]:
# Build the stop word set once instead of on every call
STOP_WORDS = set(stopwords.words('english'))

# Define the normalization function
def normalize_text(text):
    text = text.lower()
    # Remove URLs ('http\S+' also covers 'https')
    text = re.sub(r'http\S+|www\S+', '', text)
    # Remove mentions, hashtags, and stock tickers (e.g., $TSLA)
    text = re.sub(r'@\w+|#\w+|\$\w+', '', text)
    # Tokenization: split the text into individual words/units
    tokens = word_tokenize(text)
    # Keep only alphanumeric tokens that are NOT stop words (this also drops punctuation)
    return [word for word in tokens if word.isalnum() and word not in STOP_WORDS]

normalized_output = normalize_text(sample_tweet)

print(f"Original Tweet: {sample_tweet}")
print("\n--- Normalization Results ---")
print(f"Tokens after Normalization: {normalized_output}")

Original Tweet: Are Analysts Bullish about Campbell Soup Company $CPB after last week? https://t.co/BcMiRuN8uF

--- Normalization Results ---
Tokens after Normalization: ['analysts', 'bullish', 'campbell', 'soup', 'company', 'last', 'week']
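
Since tickers carry context even though the normalization step discards them, one option is to capture them as a separate feature before cleaning the text. The sketch below is only an illustration: `extract_tickers` and `TICKER_PATTERN` are hypothetical names that are not part of the original notebook, and the pattern assumes a cashtag is a `$` followed by a letter and then word characters.

```python
import re

# Hypothetical helper (not in the original notebook): collect cashtags
# such as $CPB before normalize_text() strips them out.
TICKER_PATTERN = re.compile(r'\$([A-Za-z]\w*)')

def extract_tickers(text):
    """Return the upper-cased ticker symbols found in a tweet, in order of appearance."""
    return [symbol.upper() for symbol in TICKER_PATTERN.findall(text)]

tweet = "Are Analysts Bullish about Campbell Soup Company $CPB after last week? https://t.co/BcMiRuN8uF"
print(extract_tickers(tweet))  # ['CPB']
```

Requiring a letter after the `$` keeps plain dollar amounts such as `$5` out of the result.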


### **3.2. Stemming**

Stemming is a simple, rule-based way to collapse inflected word forms. It applies heuristic rules (such as stripping common suffixes) to truncate words to a stem. We use the Porter stemmer, one of the most widely used algorithms.

*   Benefit: It is computationally fast and effective at reducing word variations.

*   Limitation: The resulting stem is often not a dictionary word (e.g., 'univers' for 'university', or 'compani' for 'company' in the output below), which can decrease linguistic accuracy.

In [12]:
# We will use the tokens obtained from the Normalization step
stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(word) for word in normalized_output]

print(f"Normalized Tokens: {normalized_output}")
print("\n--- Stemming Results ---")
print(f"Words after Stemming: {stemmed_words}")

Normalized Tokens: ['analysts', 'bullish', 'campbell', 'soup', 'company', 'last', 'week']

--- Stemming Results ---
Words after Stemming: ['analyst', 'bullish', 'campbel', 'soup', 'compani', 'last', 'week']
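
To make the "heuristic suffix rules" idea concrete, here is a toy stemmer with two rule groups loosely modelled on the first Porter steps. It is a hypothetical illustration (`toy_stem` is not an NLTK function) and deliberately incomplete: the full Porter algorithm has many more rules, which is why it also turns 'campbell' into 'campbel' while this sketch leaves it alone.

```python
def toy_stem(word):
    """Apply two Porter-style suffix rules; NOT the full Porter algorithm."""
    vowels = set("aeiou")
    # Rule group 1: plural endings (sses -> ss, ies -> i, trailing s dropped unless 'ss')
    if word.endswith("sses"):
        word = word[:-2]
    elif word.endswith("ies"):
        word = word[:-2]
    elif word.endswith("s") and not word.endswith("ss"):
        word = word[:-1]
    # Rule 2: terminal y -> i when the rest of the word contains a vowel
    if word.endswith("y") and any(ch in vowels for ch in word[:-1]):
        word = word[:-1] + "i"
    return word

tokens = ['analysts', 'bullish', 'campbell', 'soup', 'company', 'last', 'week']
print([toy_stem(t) for t in tokens])
# ['analyst', 'bullish', 'campbell', 'soup', 'compani', 'last', 'week']
```

The `y -> i` rule is what produces the non-dictionary stem 'compani': it lets 'company' and 'companies' collapse to the same string, at the cost of readability.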


### **3.3. Lemmatization**

Lemmatization is a vocabulary-aware technique that uses dictionary lookup and morphological analysis to reduce a word to its base form, or lemma.

*   Benefit: The output is a valid dictionary word whenever the word is found in the lexicon (e.g., 'better' → 'good', 'went' → 'go', given the correct part of speech); unknown words are returned unchanged. This gives higher linguistic accuracy than stemming.

*   Application in Finance: It helps sentiment analysis by mapping different tenses or forms of a word (e.g., 'rising', 'rose', 'risen') to the same base form, 'rise', provided they are lemmatized as verbs.

In [13]:
# Use the tokens obtained from the Normalization step
lemmatizer = WordNetLemmatizer()
# Lemmatization is often more accurate when the POS tag is provided, but we use the default (noun) here.
lemmatized_words = [lemmatizer.lemmatize(word) for word in normalized_output]

print(f"Normalized Tokens: {normalized_output}")
print("\n--- Lemmatization Results ---")
print(f"Words after Lemmatization: {lemmatized_words}")


Normalized Tokens: ['analysts', 'bullish', 'campbell', 'soup', 'company', 'last', 'week']

--- Lemmatization Results ---
Words after Lemmatization: ['analyst', 'bullish', 'campbell', 'soup', 'company', 'last', 'week']
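
Only 'analysts' changed above because every token was treated as a noun. A possible refinement is to pass each token's part of speech to the lemmatizer. The sketch below assumes Penn Treebank tags as input; `penn_to_wordnet` and `lemmatize_tagged` are hypothetical helper names, and the single-letter codes are the POS values that `WordNetLemmatizer.lemmatize(word, pos=...)` accepts.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to the single-letter POS codes WordNet uses."""
    if tag.startswith('J'):
        return 'a'   # adjective
    if tag.startswith('V'):
        return 'v'   # verb
    if tag.startswith('R'):
        return 'r'   # adverb
    return 'n'       # fall back to noun, which is also lemmatize()'s default

def lemmatize_tagged(tagged_tokens, lemmatizer):
    """Lemmatize (token, penn_tag) pairs with any object exposing lemmatize(word, pos=...)."""
    return [lemmatizer.lemmatize(token.lower(), pos=penn_to_wordnet(tag))
            for token, tag in tagged_tokens]

# Usage with the notebook's objects (requires the NLTK downloads from section 2.1):
#   tagged = nltk.pos_tag(word_tokenize("Shares were rising"))
#   lemmatize_tagged(tagged, WordNetLemmatizer())
print(penn_to_wordnet('VBG'), penn_to_wordnet('JJR'), penn_to_wordnet('NNS'))  # v a n
```

Passing the lemmatizer in as an argument keeps the mapping logic testable without loading WordNet.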


### **3.4. Text Enrichment / Augmentation (Part-of-Speech Tagging)**

Part-of-Speech (POS) tagging is a form of text enrichment that assigns a grammatical category (e.g., noun, verb, adjective) to every token. We use the NLTK tagger, which employs the Penn Treebank tag set.

* Value: This process is essential for disambiguation and feature engineering. For instance, in the sentence "The stock is trading down," the word 'stock' is tagged as a Noun (NN). In the sentence "We must stock up on cash," 'stock' is tagged as a Verb (VB). This contextual information is vital for training accurate text classifiers.

In [17]:
# We use NLTK POS Tagger on the original tokens (before stop word removal)
raw_tokens = word_tokenize(sample_tweet)
pos_tags = nltk.pos_tag(raw_tokens)

print(f"Original Tokens: {raw_tokens}")
print("\n--- POS Tagging Results ---")
print(f"POS Tags (Token, Tag): {pos_tags}")

Original Tokens: ['Are', 'Analysts', 'Bullish', 'about', 'Campbell', 'Soup', 'Company', '$', 'CPB', 'after', 'last', 'week', '?', 'https', ':', '//t.co/BcMiRuN8uF']

--- POS Tagging Results ---
POS Tags (Token, Tag): [('Are', 'NNP'), ('Analysts', 'NNS'), ('Bullish', 'VBP'), ('about', 'IN'), ('Campbell', 'NNP'), ('Soup', 'NNP'), ('Company', 'NNP'), ('$', '$'), ('CPB', 'NNP'), ('after', 'IN'), ('last', 'JJ'), ('week', 'NN'), ('?', '.'), ('https', 'NN'), (':', ':'), ('//t.co/BcMiRuN8uF', 'NN')]
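
One way to turn these tags into features is to bucket tokens by coarse tag class. The sketch below reuses the (token, tag) pairs printed above, with the URL fragments left out; `group_by_coarse_tag` is a hypothetical helper, and using the first two characters of the tag as the class is a simplifying assumption.

```python
from collections import defaultdict

# (token, tag) pairs copied from the tagger output above, URL fragments omitted
pos_tags = [('Are', 'NNP'), ('Analysts', 'NNS'), ('Bullish', 'VBP'), ('about', 'IN'),
            ('Campbell', 'NNP'), ('Soup', 'NNP'), ('Company', 'NNP'), ('$', '$'),
            ('CPB', 'NNP'), ('after', 'IN'), ('last', 'JJ'), ('week', 'NN')]

def group_by_coarse_tag(tagged_tokens):
    """Bucket tokens by the first two characters of their Penn Treebank tag."""
    groups = defaultdict(list)
    for token, tag in tagged_tokens:
        groups[tag[:2]].append(token)
    return dict(groups)

groups = group_by_coarse_tag(pos_tags)
print(groups.get('NN'))  # ['Are', 'Analysts', 'Campbell', 'Soup', 'Company', 'CPB', 'week']
print(groups.get('JJ'))  # ['last']
print(groups.get('VB'))  # ['Bullish']
```

Grouping like this also makes questionable tags easy to spot: 'Are' lands among the nouns and 'Bullish' among the verbs, possibly because the title-cased tweet differs from the text the tagger was trained on.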


### **3.5. Named Entity Recognition (NER)**

Named Entity Recognition (NER) is a key technique for structuring unstructured text. It automatically locates and classifies sequences of words that refer to real-world objects, such as people, organizations, dates, and locations. We use spaCy's `en_core_web_sm` pipeline for this task.

* Value in Finance: NER is critical for quickly identifying:

  *  ORG (Organizations): Names of companies (if not already tagged by the ticker).

  *  PERSON: Names of influencers or executives (e.g., Elon Musk, Jim Cramer).

  *  DATE / CARDINAL / MONEY: Specific timeframes, quantities, or prices mentioned.

In [18]:
# Run the spaCy pipeline; recognized entities are exposed on doc.ents
doc = nlp(sample_tweet)

# Extract and label the entities
named_entities = [(ent.text, ent.label_) for ent in doc.ents]

print(f"Sample Tweet: {sample_tweet}")
print("\n--- NER Results ---")
print(f"Named Entities (Entity, Type): {named_entities}")


Sample Tweet: Are Analysts Bullish about Campbell Soup Company $CPB after last week? https://t.co/BcMiRuN8uF

--- NER Results ---
Named Entities (Entity, Type): [('Campbell Soup Company', 'ORG'), ('CPB', 'ORG'), ('last week', 'DATE')]
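
For downstream use it is often convenient to group entities by label, for example to collect every organization mentioned in a tweet. The sketch below works on the (entity, label) pairs printed above; `entities_by_label` is a hypothetical helper, not a spaCy function.

```python
from collections import defaultdict

# (entity text, label) pairs copied from the NER output above
named_entities = [('Campbell Soup Company', 'ORG'), ('CPB', 'ORG'), ('last week', 'DATE')]

def entities_by_label(entities):
    """Group entity strings under their label, preserving order and dropping duplicates."""
    grouped = defaultdict(list)
    for text, label in entities:
        if text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)

print(entities_by_label(named_entities))
# {'ORG': ['Campbell Soup Company', 'CPB'], 'DATE': ['last week']}
```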


# **4. Conclusion and Next Steps**

This notebook demonstrated five core NLP techniques (normalization, stemming, lemmatization, POS tagging, and NER) on a sample financial tweet. Applying the same steps to the full 1,000-tweet sample yields tokens and entities ready for the next phase of analysis, which typically involves feature engineering (e.g., creating TF-IDF vectors) and model building (e.g., training a machine learning classifier for sentiment prediction).
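
As a rough illustration of the TF-IDF step mentioned above, the sketch below computes weights with the plain textbook formula, tf = count / document length and idf = log(N / df). The function name `tf_idf` and the two token lists are hypothetical, and libraries such as scikit-learn apply smoothing and normalization, so their numbers will differ.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Compute unsmoothed TF-IDF weights for a list of token lists."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
                        for term, count in counts.items()})
    return weights

# Hypothetical token lists standing in for normalized tweets
docs = [['analysts', 'bullish', 'soup'], ['analysts', 'bearish', 'oil']]
scores = tf_idf(docs)
print(scores[0]['analysts'])           # 0.0, the term appears in every document
print(round(scores[0]['bullish'], 3))  # 0.231
```

Terms shared by every tweet get a weight of zero, while words that set one tweet apart, such as 'bullish', receive positive weight.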