### NUS Fintech Society (ML) Project 2: Natural Language Processing classifier

Start date: 31-10-2023
Submission date: 15-12-2023, 11:59 pm

1. Use the dataset stock_data.csv to train and build an NLP model that makes binary predictions about the comments.
2. Make a train-test split to create your own train and test datasets.
3. Once testing/evaluation is complete, print a statement showing the accuracy of your model on the test set. There is no need for a separate screenshot of the accuracy; as long as the model accuracy is shown in the Jupyter notebook output, it can be evaluated for marking.
4. Use the submission Google Form (https://forms.gle/Jn8UCS4GMY1P9KWGA) to submit the link to your GitHub repository. Ensure that the repository is public.

In [1]:
import pandas as pd

df = pd.read_csv("stock_data.csv")

In [2]:
# Preprocess the text data
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

nltk.download('popular')

# Build these once instead of on every call to preprocess_text
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    # Remove URLs
    text = re.sub(r'http\S+', '', text)

    # Lowercasing is deliberately skipped: capitalized tokens such as
    # AMZN or AMAZING carry useful signal in this dataset

    # Remove punctuation (this also strips decimal points and percent signs)
    text = re.sub(r'[^\w\s]', '', text)

    # Tokenize and remove stopwords (the match is case-sensitive,
    # so capitalized stopwords such as "Over" are kept)
    tokens = word_tokenize(text)
    text = ' '.join([word for word in tokens if word not in stop_words])

    # Lemmatization
    tokens = word_tokenize(text)
    text = ' '.join([lemmatizer.lemmatize(word) for word in tokens])

    return text

df['Processed Text'] = df['Text'].apply(preprocess_text)

[nltk_data] Downloading collection 'popular'

In [3]:
df = df[['Text', 'Processed Text', 'Sentiment']]
df.head()

| | Text | Processed Text | Sentiment |
|---|---|---|---|
| 0 | Kickers on my watchlist XIDE TIT SOQ PNK CPW B... | Kickers watchlist XIDE TIT SOQ PNK CPW BPZ AJ ... | 1 |
| 1 | user: AAP MOVIE. 55% return for the FEA/GEED i... | user AAP MOVIE 55 return FEAGEED indicator 15 ... | 1 |
| 2 | user I'd be afraid to short AMZN - they are lo... | user Id afraid short AMZN looking like nearmon... | 1 |
| 3 | MNTA Over 12.00 | MNTA Over 1200 | 1 |
| 4 | OI Over 21.37 | OI Over 2137 | 1 |


I used the Python libraries re and nltk for basic text preprocessing: removing URLs, removing punctuation, tokenizing, removing stopwords, and lemmatizing. Initially, I also converted the text to lower case. Looking at the dataset, however, I noticed that a large proportion of the capitalized words are either names of special significance, e.g. AMZN, or express a strong positive or negative sentiment, e.g. AMAZING, so I removed the lowercasing step. This increased the accuracy of the classifier. One remaining problem is that removing punctuation also strips the decimal point and the percent sign from numbers: 21.37 becomes 2137 and 55% becomes 55. I do not have a simple solution for these specific cases.
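One possible direction for the decimal and percentage problem described above, offered as a sketch rather than the approach used in this notebook: replace the blanket `[^\w\s]` pattern with one that spares a `.` sitting between two digits and a `%` that directly follows a digit. The function name `strip_punct_keep_numbers` and the sample strings are illustrative assumptions.

```python
import re

# Characters to delete:
#   [^\w\s.%]     any punctuation other than "." and "%"
#   (?<!\d)[.%]   a "." or "%" that is not preceded by a digit
#   \.(?!\d)      a "." that is not followed by a digit
_PUNCT_TO_DROP = re.compile(r"[^\w\s.%]|(?<!\d)[.%]|\.(?!\d)")


def strip_punct_keep_numbers(text: str) -> str:
    """Remove punctuation but keep decimals (21.37) and percentages (55%)."""
    return _PUNCT_TO_DROP.sub("", text)


if __name__ == "__main__":
    samples = [
        "OI Over 21.37",
        "user: AAP MOVIE. 55% return for the FEA/GEED",
    ]
    for sample in samples:
        print(strip_punct_keep_numbers(sample))
```

This only addresses the cleaning step. Later stages can split these tokens again: CountVectorizer's default `token_pattern` (`(?u)\b\w\w+\b`) treats `.` and `%` as separators, so a custom `token_pattern` would likely be needed for values such as 21.37 to survive as single features. Whether that improves accuracy on this dataset is untested.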

In [4]:
# Split into train and test datasets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df['Processed Text'], df['Sentiment'], test_size=0.2, random_state=42)

In [5]:
# Vectorize using CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)

I chose CountVectorizer over TF-IDF because it gave a higher accuracy on this dataset. Two possible explanations: 1) raw counts may work better than TF-IDF weights on a small dataset like this one, and 2) TF-IDF downweights terms that appear in many documents, which is not ideal for comments about stocks, where common words can be important indicators of sentiment.
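To make the downweighting point concrete, here is a small pure-Python sketch on a made-up three-comment corpus (not taken from stock_data.csv). It uses the smoothed IDF formula that scikit-learn documents as its default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and omits the L2 row normalization that TfidfVectorizer also applies by default. All names here are illustrative assumptions.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration only
docs = [
    "buy AAPL buy",
    "sell AAPL",
    "buy TSLA breakout",
]

tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many comments each term appears
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def smoothed_idf(term: str) -> float:
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1


# Raw counts (what a count-based vectorizer keeps) for the first comment
counts = Counter(tokenized[0])

# Unnormalized TF-IDF weights for the same comment
weights = {term: count * smoothed_idf(term) for term, count in counts.items()}

if __name__ == "__main__":
    for term in sorted(doc_freq):
        print(f"{term:10s} df={doc_freq[term]}  idf={smoothed_idf(term):.3f}")
    print("counts :", dict(counts))
    print("tf-idf :", {t: round(w, 3) for t, w in weights.items()})
```

In this toy corpus "buy" appears in two of three comments and so receives a lower IDF than "sell", which appears in only one. If a frequent word like "buy" is itself a strong sentiment cue, that scaling reduces its relative weight, which is the effect the second explanation above refers to.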

In [6]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Train a Naive Bayes classifier
classifier = MultinomialNB()
classifier.fit(X_train_vectorized, y_train)

# Make predictions on the test set
predictions = classifier.predict(X_test_vectorized)

# Evaluate the model and print the accuracy
accuracy = accuracy_score(y_test, predictions)
print(f"Model Accuracy on Test Set: {accuracy:.2f}")

Model Accuracy on Test Set: 0.77
