<h1 style="text-align: center; font-size: 2.5rem; font-weight: bold; margin: 2.5rem 0 1rem;">Sentiment Analysis on Labeled Financial Data</h1>
<h2 style="text-align: center; font-size: 1rem; font-weight: 500; margin: 0 0 2rem;">Text Preprocessing and Sentiment Analysis</h2>

_____

<h2 style="text-align: center; font-size: 1rem; margin: 2rem 0 1rem 0;">Part I</h2>
<h1 style="text-align: center; font-size: 2rem; margin: 0 0 2rem 0;">Text Preprocessing</h1>

<h2 style="padding: 0.5rem; background-color: #513d5c; color: white;">Imports</h2>

In [1]:
# Imports
from textblob import TextBlob
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
import pandas as pd
import numpy as np
import nltk
import string
import emoji
import re

In [2]:
# nltk - Dependencies
nltk.download('stopwords')
nltk.download('punkt_tab')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\KubangPawis\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\KubangPawis\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [3]:
df = pd.read_csv('../data/raw/financial-data.csv')
df.head()

Unnamed: 0,Sentence,Sentiment
0,The GeoSolutions technology will leverage Bene...,positive
1,"$ESI on lows, down $1.50 to $2.50 BK a real po...",negative
2,"For the last quarter of 2010 , Componenta 's n...",positive
3,According to the Finnish-Russian Chamber of Co...,neutral
4,The Swedish buyout firm has sold its remaining...,neutral


<h2 style="padding: 0.5rem; background-color: #513d5c; color: white;">Lowercasing Text</h2>

In [4]:
df['Sentence'][3]

'According to the Finnish-Russian Chamber of Commerce , all the major construction companies of Finland are operating in Russia .'

In [5]:
df['Sentence'] = df['Sentence'].str.lower()

In [6]:
df['Sentence'][3]

'according to the finnish-russian chamber of commerce , all the major construction companies of finland are operating in russia .'

<h2 style="padding: 0.5rem; background-color: #513d5c; color: white;">Remove HTML Tags</h2>

In [7]:
def remove_html_tags(text):
    # Non-greedy match: strip everything from a '<' up to the nearest '>'
    pattern = re.compile(r'<.*?>')
    return pattern.sub('', text)

In [8]:
df['Sentence'] = df['Sentence'].apply(remove_html_tags)

In [9]:
df['Sentence']

0       the geosolutions technology will leverage bene...
1       $esi on lows, down $1.50 to $2.50 bk a real po...
2       for the last quarter of 2010 , componenta 's n...
3       according to the finnish-russian chamber of co...
4       the swedish buyout firm has sold its remaining...
                              ...                        
5837    rising costs have forced packaging producer hu...
5838    nordic walking was first used as a summer trai...
5839    according shipping company viking line , the e...
5840    in the building and home improvement trade , s...
5841    helsinki afx - kci konecranes said it has won ...
Name: Sentence, Length: 5842, dtype: object
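To make the effect of the `<.*?>` pattern concrete, here is a small standalone sketch. The sample strings and the helper name `strip_tags` are illustrative assumptions, not rows or code from this dataset. The second call shows a caveat: any text sitting between a literal `<` and a later `>` is removed too, even when it is not a tag.

```python
import re

# Same non-greedy pattern as remove_html_tags above
TAG_PATTERN = re.compile(r'<.*?>')

def strip_tags(text):
    """Remove everything from a '<' up to the nearest following '>'."""
    return TAG_PATTERN.sub('', text)

# Hypothetical sample strings, not rows from the dataset
html_sample = 'shares <b>rose</b> 5%<br/>today'
comparison_sample = 'margin < 5 and revenue > 10'

print(repr(strip_tags(html_sample)))        # 'shares rose 5%today'
print(repr(strip_tags(comparison_sample)))  # 'margin  10'
```

If comparisons like the second sample can occur in the text, a stricter pattern or an HTML parser may be a safer choice.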

<h2 style="padding: 0.5rem; background-color: #513d5c; color: white;">Remove URLs</h2>

In [10]:
def remove_url(text):
    pattern = re.compile(r'https?://\S+|www\.\S+')
    return pattern.sub(r'', text)

In [11]:
df['Sentence'] = df['Sentence'].apply(remove_url)

In [12]:
df['Sentence']

0       the geosolutions technology will leverage bene...
1       $esi on lows, down $1.50 to $2.50 bk a real po...
2       for the last quarter of 2010 , componenta 's n...
3       according to the finnish-russian chamber of co...
4       the swedish buyout firm has sold its remaining...
                              ...                        
5837    rising costs have forced packaging producer hu...
5838    nordic walking was first used as a summer trai...
5839    according shipping company viking line , the e...
5840    in the building and home improvement trade , s...
5841    helsinki afx - kci konecranes said it has won ...
Name: Sentence, Length: 5842, dtype: object
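The truncated output above does not show whether any URL was actually removed, so here is a minimal sketch of what the pattern matches. The sample strings and the name `strip_urls` are assumptions made for illustration.

```python
import re

# Same pattern as remove_url above
URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')

def strip_urls(text):
    """Remove tokens that start with http://, https:// or www."""
    return URL_PATTERN.sub('', text)

# Hypothetical sample strings, not rows from the dataset
print(repr(strip_urls('see https://example.com/q3.pdf and www.example.org for details')))
# 'see  and  for details'
print(repr(strip_urls('full report at www.example.org.')))
# 'full report at '
print(repr(strip_urls('visit example.com today')))
# 'visit example.com today'
```

Two things to keep in mind: `\S+` also swallows punctuation glued to the end of a URL, and bare domains without a scheme or `www.` prefix are left in place.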

<h2 style="padding: 0.5rem; background-color: #513d5c; color: white;">Remove Punctuation</h2>

In [13]:
# Use the string module to get all ASCII punctuation characters
punc = string.punctuation
punc

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [14]:
def remove_punc(text):
    # Delete every character listed in `punc`; nothing is inserted in its place
    return text.translate(str.maketrans('', '', punc))

In [15]:
df['Sentence'] = df['Sentence'].apply(remove_punc)

In [16]:
df['Sentence']

0       the geosolutions technology will leverage bene...
1       esi on lows down 150 to 250 bk a real possibility
2       for the last quarter of 2010  componenta s net...
3       according to the finnishrussian chamber of com...
4       the swedish buyout firm has sold its remaining...
                              ...                        
5837    rising costs have forced packaging producer hu...
5838    nordic walking was first used as a summer trai...
5839    according shipping company viking line  the eu...
5840    in the building and home improvement trade  sa...
5841    helsinki afx  kci konecranes said it has won a...
Name: Sentence, Length: 5842, dtype: object
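Because punctuation is deleted rather than replaced, row 1 turns `$1.50` into `150` and row 3 fuses `finnish-russian` into `finnishrussian`. If those distinctions matter for the sentiment model, one possible variant (an assumption, not what this notebook does) replaces punctuation with a space and keeps a `.` that sits between two digits. The name `remove_punc_keep_decimals` is hypothetical.

```python
import re
import string

# All ASCII punctuation except '.', escaped for use inside a character class
OTHER_PUNC = re.escape(string.punctuation.replace('.', ''))

# Match a '.' that is not between two digits, or any other punctuation character
PUNC_PATTERN = re.compile(r'(?<!\d)\.|\.(?!\d)|[' + OTHER_PUNC + ']')

def remove_punc_keep_decimals(text):
    """Replace punctuation with spaces, keep decimal points, collapse whitespace."""
    return ' '.join(PUNC_PATTERN.sub(' ', text).split())

row_1 = '$esi on lows, down $1.50 to $2.50 bk a real possibility'
print(remove_punc_keep_decimals(row_1))
# esi on lows down 1.50 to 2.50 bk a real possibility
print(remove_punc_keep_decimals('the finnish-russian chamber of commerce .'))
# the finnish russian chamber of commerce
```

Whether to keep decimals is a modelling choice; the sketch only shows that the information does not have to be lost at this step.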

<h2 style="padding: 0.5rem; background-color: #513d5c; color: white;">Handling Chat Words</h2>

In [17]:
# Chat-word (slang) expansions taken from a GitHub repository
# Repository link: https://github.com/rishabhverma17/sms_slang_translator/blob/master/slang.txt
chat_words = {
    "AFAIK": "As Far As I Know",
    "AFK": "Away From Keyboard",
    "ASAP": "As Soon As Possible",
    "ATK": "At The Keyboard",
    "ATM": "At The Moment",
    "A3": "Anytime, Anywhere, Anyplace",
    "BAK": "Back At Keyboard",
    "BBL": "Be Back Later",
    "BBS": "Be Back Soon",
    "BFN": "Bye For Now",
    "B4N": "Bye For Now",
    "BRB": "Be Right Back",
    "BRT": "Be Right There",
    "BTW": "By The Way",
    "B4": "Before",
    "CU": "See You",
    "CUL8R": "See You Later",
    "CYA": "See You",
    "FAQ": "Frequently Asked Questions",
    "FC": "Fingers Crossed",
    "FWIW": "For What It's Worth",
    "FYI": "For Your Information",
    "GAL": "Get A Life",
    "GG": "Good Game",
    "GN": "Good Night",
    "GMTA": "Great Minds Think Alike",
    "GR8": "Great!",
    "G9": "Genius",
    "IC": "I See",
    "ICQ": "I Seek you (also a chat program)",
    "ILU": "I Love You",
    "IMHO": "In My Honest/Humble Opinion",
    "IMO": "In My Opinion",
    "IOW": "In Other Words",
    "IRL": "In Real Life",
    "KISS": "Keep It Simple, Stupid",
    "LDR": "Long Distance Relationship",
    "LMAO": "Laugh My A.. Off",
    "LOL": "Laughing Out Loud",
    "LTNS": "Long Time No See",
    "L8R": "Later",
    "MTE": "My Thoughts Exactly",
    "M8": "Mate",
    "NRN": "No Reply Necessary",
    "OIC": "Oh I See",
    "PITA": "Pain In The A..",
    "PRT": "Party",
    "PRW": "Parents Are Watching",
    "QPSA": "Que Pasa?",
    "ROFL": "Rolling On The Floor Laughing",
    "ROFLOL": "Rolling On The Floor Laughing Out Loud",
    "ROTFLMAO": "Rolling On The Floor Laughing My A.. Off",
    "SK8": "Skate",
    "STATS": "Your sex and age",
    "ASL": "Age, Sex, Location",
    "THX": "Thank You",
    "TTFN": "Ta-Ta For Now!",
    "TTYL": "Talk To You Later",
    "U": "You",
    "U2": "You Too",
    "U4E": "Yours For Ever",
    "WB": "Welcome Back",
    "WTF": "What The F...",
    "WTG": "Way To Go!",
    "WUF": "Where Are You From?",
    "W8": "Wait...",
    "7K": "Sick:-D Laugher",
    "TFW": "That feeling when",
    "MFW": "My face when",
    "MRW": "My reaction when",
    "IFYP": "I feel your pain",
    "TNTL": "Trying not to laugh",
    "JK": "Just kidding",
    "IDC": "I don't care",
    "ILY": "I love you",
    "IMU": "I miss you",
    "ADIH": "Another day in hell",
    "ZZZ": "Sleeping, bored, tired",
    "WYWH": "Wish you were here",
    "TIME": "Tears in my eyes",
    "BAE": "Before anyone else",
    "FIMH": "Forever in my heart",
    "BSAAW": "Big smile and a wink",
    "BWL": "Bursting with laughter",
    "BFF": "Best friends forever",
    "CSL": "Can't stop laughing"
}

In [18]:
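def chat_conversion(text):
    # Replace each token that matches a chat_words key (case-insensitively) with its expansion
    new_text = []
    for word in text.split():
        if word.upper() in chat_words:
            new_text.append(chat_words[word.upper()])
        else:
            new_text.append(word)
    return " ".join(new_text)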

In [19]:
df['Sentence'] = df['Sentence'].apply(chat_conversion)

In [20]:
df['Sentence'].head()

0    the geosolutions technology will leverage bene...
1    esi on lows down 150 to 250 bk a real possibility
2    for the last quarter of 2010 componenta s net ...
3    according to the finnishrussian chamber of com...
4    the swedish buyout firm has sold its remaining...
Name: Sentence, dtype: object
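One thing worth checking on financial text: several keys in `chat_words` (for example `TIME` and `ATM`) are also ordinary words or finance terms, and the lookup is case-insensitive. The sketch below uses a three-entry subset of the dictionary and a made-up sentence to show the effect, plus one possible mitigation with a hypothetical exclusion set. The names `expand_chat_words`, `ambiguous` and `safe_subset` are assumptions for illustration.

```python
# Three entries copied from chat_words above
chat_words_subset = {
    "ASAP": "As Soon As Possible",
    "ATM": "At The Moment",
    "TIME": "Tears in my eyes",
}

def expand_chat_words(text, mapping):
    """Replace tokens whose upper-cased form is a key in mapping."""
    return " ".join(mapping.get(tok.upper(), tok) for tok in text.split())

# Made-up sentence, not a row from the dataset
sample = "first time the atm network was upgraded asap"

print(expand_chat_words(sample, chat_words_subset))
# first Tears in my eyes the At The Moment network was upgraded As Soon As Possible

# Hypothetical exclusion set for keys that collide with normal vocabulary;
# expansions are lower-cased so the text stays consistently lower case
ambiguous = {"ATM", "TIME"}
safe_subset = {k: v.lower() for k, v in chat_words_subset.items() if k not in ambiguous}

print(expand_chat_words(sample, safe_subset))
# first time the atm network was upgraded as soon as possible
```

Which keys count as ambiguous depends on the corpus, so the exclusion set would need to be chosen by inspecting the data.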

<h2 style="padding: 0.5rem; background-color: #513d5c; color: white;">Spelling Correction</h2>

In [21]:
# Compare records 0-3 before and after TextBlob spelling correction
print('[ORIGINAL]')
print()
for i in range(4):
    print(df.loc[i, 'Sentence'])

print('[SPELLING CORRECTED]')
print()
for i in range(4):
    print(TextBlob(df.loc[i, 'Sentence']).correct().string)

[ORIGINAL]

the geosolutions technology will leverage benefon s gps solutions by providing location based search technology a communities platform location relevant multimedia content and a new and powerful commercial model
esi on lows down 150 to 250 bk a real possibility
for the last quarter of 2010 componenta s net sales doubled to eur131m from eur76m for the same period a year earlier while it moved to a zero pretax profit from a pretax loss of eur7m
according to the finnishrussian chamber of commerce all the major construction companies of finland are operating in russia
[SPELLING CORRECTED]

the resolutions technology will beverage benefit s gas solutions by providing location based search technology a communities platform location relevant multimedia content and a new and powerful commercial model
est on lows down 150 to 250 by a real possibility
for the last quarter of 2010 component s net sales doubled to eur131m from eur76m for the same period a year earlier while it moved to
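The corrected output changes domain terms (`geosolutions` to `resolutions`, `leverage` to `beverage`, `gps` to `gas`), which is presumably why the correction is only printed here and not written back to `df`. If spelling correction were to be applied, one option is to skip a list of protected terms. The sketch below is an assumption, not part of the notebook: `correct_unprotected`, `toy_corrector` and `protected_terms` are hypothetical names, and the toy corrector merely imitates two of the substitutions seen above so the example runs without TextBlob.

```python
def correct_unprotected(text, correct_word, protected):
    """Apply correct_word to every token except those listed in protected."""
    return " ".join(
        word if word in protected else correct_word(word)
        for word in text.split()
    )

def toy_corrector(word):
    # Stand-in for a real spelling corrector: two unwanted substitutions
    # from the output above, plus one genuine typo fix
    fixes = {
        'geosolutions': 'resolutions',
        'leverage': 'beverage',
        'technolgy': 'technology',
    }
    return fixes.get(word, word)

# Hypothetical list of domain terms that should never be "corrected"
protected_terms = {'geosolutions', 'leverage', 'benefon', 'gps'}

sample = 'the geosolutions technolgy will leverage benefon s gps solutions'
print(correct_unprotected(sample, toy_corrector, protected_terms))
# the geosolutions technology will leverage benefon s gps solutions
```

In practice `correct_word` would wrap a real corrector, and the protected list would come from the corpus itself (company names, tickers, finance vocabulary).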

<h2 style="padding: 0.5rem; background-color: #513d5c; color: white;">Handling Stop Words</h2>

In [22]:
# English stop words from NLTK; other languages such as Spanish are also available.
stopword = stopwords.words('english')

In [23]:
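# Function
def remove_stopwords(text):
    # Each stop word is replaced with an empty string, so the result can
    # contain runs of spaces; later steps that call split() collapse them.
    new_text = []

    for word in text.split():
        if word in stopword:
            new_text.append('')
        else:
            new_text.append(word)
    return " ".join(new_text)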

In [24]:
df['Sentence'] = df['Sentence'].apply(remove_stopwords)

In [25]:
df['Sentence']

0        geosolutions technology  leverage benefon  gp...
1                esi  lows  150  250 bk  real possibility
2         last quarter  2010 componenta  net sales dou...
3       according   finnishrussian chamber  commerce  ...
4        swedish buyout firm  sold  remaining 224 perc...
                              ...                        
5837    rising costs  forced packaging producer huhtam...
5838    nordic walking  first used   summer training m...
5839    according shipping company viking line  eu dec...
5840      building  home improvement trade sales decre...
5841    helsinki afx kci konecranes said     order  fo...
Name: Sentence, Length: 5842, dtype: object
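The double spaces in the output above come from replacing each stop word with an empty string before joining. The sketch below reproduces that behaviour on row 1 and contrasts it with a variant that skips stop words instead. It uses a tiny hypothetical stop-word set so it runs without NLTK; `STOP_WORDS`, `blank_stopwords` and `drop_stopwords` are illustrative names.

```python
# Tiny hypothetical stop-word set; the notebook uses NLTK's English list instead.
# A set also gives constant-time membership tests, unlike a list.
STOP_WORDS = {'on', 'down', 'to', 'a'}

def blank_stopwords(text):
    """Notebook behaviour: stop words become empty strings, leaving extra spaces."""
    return " ".join('' if word in STOP_WORDS else word for word in text.split())

def drop_stopwords(text):
    """Variant: stop words are skipped, so no extra spaces remain."""
    return " ".join(word for word in text.split() if word not in STOP_WORDS)

row_1 = 'esi on lows down 150 to 250 bk a real possibility'
print(repr(blank_stopwords(row_1)))  # 'esi  lows  150  250 bk  real possibility'
print(repr(drop_stopwords(row_1)))   # 'esi lows 150 250 bk real possibility'
```

Both versions yield the same tokens once the text is split again, so the difference matters mainly for readability of intermediate output.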

<h2 style="padding: 0.5rem; background-color: #513d5c; color: white;">Handling Emojis</h2>

### Simply Remove Emojis

In [26]:
# Use a regular expression to strip emojis from the text.
def remove_emoji(text):
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

In [27]:
df['Sentence'] = df['Sentence'].apply(remove_emoji)

### Converting Emojis to Text

In [28]:
# Convert any emojis the pattern above did not strip into text aliases such as :thumbs_up:
df['Sentence'] = df['Sentence'].apply(emoji.demojize)
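A note on the removal pattern: its last range, `\U000024C2-\U0001F251`, spans far more than emojis and covers, among other things, CJK characters. For this lower-cased English dataset that is probably harmless, but the sketch below shows the difference against a narrower variant. `BROAD`, `NARROW` and the sample strings are assumptions for illustration, and the narrower pattern will in turn miss some symbols the original catches.

```python
import re

# Pattern as in remove_emoji above, including the wide final range
BROAD = re.compile("["
                   "\U0001F600-\U0001F64F"  # emoticons
                   "\U0001F300-\U0001F5FF"  # symbols & pictographs
                   "\U0001F680-\U0001F6FF"  # transport & map symbols
                   "\U0001F1E0-\U0001F1FF"  # flags
                   "\U00002702-\U000027B0"
                   "\U000024C2-\U0001F251"
                   "]+")

# Hypothetical narrower variant without the final range
NARROW = re.compile("["
                    "\U0001F600-\U0001F64F"
                    "\U0001F300-\U0001F5FF"
                    "\U0001F680-\U0001F6FF"
                    "\U0001F1E0-\U0001F1FF"
                    "\U00002702-\U000027B0"
                    "]+")

chart = '\U0001F4C8'   # chart-increasing emoji
cjk = '\u682a\u4fa1'   # two CJK characters, not emojis

print(repr(BROAD.sub('', 'profit up ' + chart)))   # 'profit up '
print(repr(BROAD.sub('', cjk + ' up')))            # ' up'
print(NARROW.sub('', cjk + ' up') == cjk + ' up')  # True
```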

<h2 style="padding: 0.5rem; background-color: #513d5c; color: white;">Tokenization</h2>

In [29]:
df.loc[0, 'Sentence']

' geosolutions technology  leverage benefon  gps solutions  providing location based search technology  communities platform location relevant multimedia content   new  powerful commercial model'

In [30]:
# Tokenization Methods
def word_tokenize_rows(text):
    return word_tokenize(text)

def sentence_tokenize_rows(text):
    return sent_tokenize(text)

In [31]:
# Word Tokenization
df['Word_Tokenization'] = df['Sentence'].apply(word_tokenize_rows)

In [32]:
df['Word_Tokenization']

0       [geosolutions, technology, leverage, benefon, ...
1            [esi, lows, 150, 250, bk, real, possibility]
2       [last, quarter, 2010, componenta, net, sales, ...
3       [according, finnishrussian, chamber, commerce,...
4       [swedish, buyout, firm, sold, remaining, 224, ...
                              ...                        
5837    [rising, costs, forced, packaging, producer, h...
5838    [nordic, walking, first, used, summer, trainin...
5839    [according, shipping, company, viking, line, e...
5840    [building, home, improvement, trade, sales, de...
5841    [helsinki, afx, kci, konecranes, said, order, ...
Name: Word_Tokenization, Length: 5842, dtype: object

In [33]:
# Sentence Tokenization (punctuation was removed earlier, so no row is split into multiple sentences)
df['Sent_Tokenization'] = df['Sentence'].apply(sentence_tokenize_rows)

In [34]:
df['Sent_Tokenization']

0       [ geosolutions technology  leverage benefon  g...
1              [esi  lows  150  250 bk  real possibility]
2       [  last quarter  2010 componenta  net sales do...
3       [according   finnishrussian chamber  commerce ...
4       [ swedish buyout firm  sold  remaining 224 per...
                              ...                        
5837    [rising costs  forced packaging producer huhta...
5838    [nordic walking  first used   summer training ...
5839    [according shipping company viking line  eu de...
5840    [  building  home improvement trade sales decr...
5841    [helsinki afx kci konecranes said     order  f...
Name: Sent_Tokenization, Length: 5842, dtype: object
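Every row above comes back as a one-element list because sentence boundaries are detected from punctuation, and punctuation was already removed. The sketch below illustrates the ordering effect with a deliberately naive regex splitter standing in for `sent_tokenize` (so it runs without NLTK data); `naive_sentence_split` and the sample text are assumptions.

```python
import re
import string

def naive_sentence_split(text):
    """Very rough stand-in for a sentence tokenizer: split after '.', '!' or '?'."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

# Made-up two-sentence sample in the dataset's spacing style
raw = 'net sales doubled . pretax loss narrowed .'
stripped = raw.translate(str.maketrans('', '', string.punctuation))

print(naive_sentence_split(raw))
# ['net sales doubled .', 'pretax loss narrowed .']
print(len(naive_sentence_split(stripped)))
# 1
```

If sentence-level features are wanted, sentence tokenization would need to run on the raw text, before the punctuation step.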

<h2 style="padding: 0.5rem; background-color: #513d5c; color: white;">Stemming</h2>

In [35]:
stemmer = PorterStemmer()

In [36]:
def stem_words(text):
    return " ".join([stemmer.stem(word) for word in text.split()])

In [37]:
df['Sentence'] = df['Sentence'].apply(stem_words)

In [38]:
df['Sentence']

0       geosolut technolog leverag benefon gp solut pr...
1                         esi low 150 250 bk real possibl
2       last quarter 2010 componenta net sale doubl eu...
3       accord finnishrussian chamber commerc major co...
4       swedish buyout firm sold remain 224 percent st...
                              ...                        
5837    rise cost forc packag produc huhtamaki axe 90 ...
5838    nordic walk first use summer train method cros...
5839    accord ship compani vike line eu decis signifi...
5840    build home improv trade sale decreas 225 eur 2...
5841    helsinki afx kci konecran said order four hot ...
Name: Sentence, Length: 5842, dtype: object

<h2 style="padding: 0.5rem; background-color: #513d5c; color: white;">Exporting the Preprocessed Text Data</h2>

In [39]:
clean_df = df[['Sentence', 'Sentiment']]
clean_df.to_csv('../data/clean/financial-data-clean.csv', index=False)
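As a recap, the first four cleaning steps can be sketched as a single function, which makes the order of operations explicit. This is a minimal illustration under the same patterns used above, not a replacement for the notebook cells; `basic_clean` is a hypothetical name. URL removal has to run before punctuation removal, since stripping `:`, `/` and `.` first would leave nothing for the URL pattern to match.

```python
import re
import string

HTML_TAG = re.compile(r'<.*?>')
URL = re.compile(r'https?://\S+|www\.\S+')
PUNC_TABLE = str.maketrans('', '', string.punctuation)

def basic_clean(text):
    """Lowercase, then strip HTML tags, URLs and punctuation, in that order."""
    text = text.lower()
    text = HTML_TAG.sub('', text)
    text = URL.sub('', text)
    text = text.translate(PUNC_TABLE)
    return text

# Record 3 as shown at the start of the notebook
record_3 = ('According to the Finnish-Russian Chamber of Commerce , all the major '
            'construction companies of Finland are operating in Russia .')

# Collapse whitespace for display, as the later split()/join steps do
print(' '.join(basic_clean(record_3).split()))
# according to the finnishrussian chamber of commerce all the major construction companies of finland are operating in russia
```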