# Sentiment Analysis
<hr/>

This is a beginner-friendly introduction to <b>Natural Language Processing</b> (NLP) using the Python package <b>nltk</b> (Natural Language Toolkit).<br>
NLP can be used to extract sentiment from text, which is useful in business areas such as social media monitoring and brand sentiment analysis.<br>
Some examples of sentiment analysis in business can be found <a href='https://theappsolutions.com/blog/development/sentiment-analysis-for-business/'>here</a>.

### Libraries

In [16]:
# data operations
import pandas as pd

# visualizations
import seaborn as sns
import matplotlib.pyplot as plt

# NLP
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer
import re
import string

# other
import random

### Data
Sentiment analysis can show what the general public thinks about a particular company.<br>The dataset used in this notebook can be found <a href='https://www.kaggle.com/davidwallach/financial-tweets'>here</a>.
* Load the data into a DataFrame using `pd.read_csv()`
* Set `error_bad_lines=False` to skip malformed lines (this parameter was deprecated in pandas 1.3 and removed in pandas 2.0; newer versions use `on_bad_lines='skip'` or `on_bad_lines='warn'` instead)

In [60]:
df = pd.read_csv('stockerbot-export.csv', error_bad_lines=False)
df.head()

Skipping line 731: expected 8 fields, saw 13
Skipping line 2836: expected 8 fields, saw 15
Skipping line 3058: expected 8 fields, saw 12
Skipping line 3113: expected 8 fields, saw 12
Skipping line 3194: expected 8 fields, saw 17
Skipping line 3205: expected 8 fields, saw 17
Skipping line 3255: expected 8 fields, saw 17
Skipping line 3520: expected 8 fields, saw 17
Skipping line 4078: expected 8 fields, saw 17
Skipping line 4087: expected 8 fields, saw 17
Skipping line 4088: expected 8 fields, saw 17
Skipping line 4499: expected 8 fields, saw 12
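
Each "Skipping line" message reports a row that has more fields than the 8 columns in the header, for example because the tweet text contains an unquoted comma. The following is a minimal sketch of the same behavior on a made-up, three-column CSV held in memory. It assumes pandas 1.3 or later, where the `on_bad_lines` parameter is available; the sample data and the name `sample_df` are hypothetical.

```python
import io
import pandas as pd

# Hypothetical three-column CSV; the second data row has one field too many
raw_csv = io.StringIO(
    "id,text,verified\n"
    "1,first tweet,True\n"
    "2,second tweet,with an unquoted comma,True\n"
    "3,third tweet,False\n"
)

# on_bad_lines='skip' (assumes pandas 1.3+) drops rows with too many fields;
# on_bad_lines='warn' would also report each skipped line
sample_df = pd.read_csv(raw_csv, on_bad_lines='skip')
print(sample_df)
```

Only the rows with `id` 1 and 3 survive, so skipping is convenient but silently loses data; it is worth checking how many lines were dropped.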


| | id | text | timestamp | source | symbols | company_names | url | verified |
|---|---|---|---|---|---|---|---|---|
| 0 | 1019696670777503700 | VIDEO: “I was in my office. I was minding my o... | Wed Jul 18 21:33:26 +0000 2018 | GoldmanSachs | GS | The Goldman Sachs | https://twitter.com/i/web/status/1019696670777... | True |
| 1 | 1019709091038548000 | The price of lumber $LB_F is down 22% since hi... | Wed Jul 18 22:22:47 +0000 2018 | StockTwits | M | Macy's | https://twitter.com/i/web/status/1019709091038... | True |
| 2 | 1019711413798035500 | Who says the American Dream is dead? https://t... | Wed Jul 18 22:32:01 +0000 2018 | TheStreet | AIG | American | https://buff.ly/2L3kmc4 | True |
| 3 | 1019716662587740200 | Barry Silbert is extremely optimistic on bitco... | Wed Jul 18 22:52:52 +0000 2018 | MarketWatch | BTC | Bitcoin | https://twitter.com/i/web/status/1019716662587... | True |
| 4 | 1019718460287389700 | How satellites avoid attacks and space junk wh... | Wed Jul 18 23:00:01 +0000 2018 | Forbes | ORCL | Oracle | http://on.forbes.com/6013DqDDU | True |


In [61]:
df.dtypes

id                int64
text             object
timestamp        object
source           object
symbols          object
company_names    object
url              object
verified           bool
dtype: object

In [62]:
df.isnull().any()

id               False
text             False
timestamp        False
source           False
symbols          False
company_names     True
url               True
verified         False
dtype: bool

### Preprocessing

In [63]:
# keep only tweets from verified accounts; .copy() avoids SettingWithCopyWarning
# when new columns are assigned below
df = df[df['verified']].copy()

# drop unused columns
df.drop(['id', 'source', 'company_names', 'url', 'verified'], axis=1, inplace=True)

# date and time
df["timestamp"] = pd.to_datetime(df["timestamp"])
df['date'] = df['timestamp'].dt.date
df['time'] = df['timestamp'].dt.time

# data types
df["text"] = df["text"].astype(str)
df["symbols"] = df["symbols"].astype("category")
df.head()

| | text | timestamp | symbols | date | time |
|---|---|---|---|---|---|
| 0 | VIDEO: “I was in my office. I was minding my o... | 2018-07-18 21:33:26+00:00 | GS | 2018-07-18 | 21:33:26 |
| 1 | The price of lumber $LB_F is down 22% since hi... | 2018-07-18 22:22:47+00:00 | M | 2018-07-18 | 22:22:47 |
| 2 | Who says the American Dream is dead? https://t... | 2018-07-18 22:32:01+00:00 | AIG | 2018-07-18 | 22:32:01 |
| 3 | Barry Silbert is extremely optimistic on bitco... | 2018-07-18 22:52:52+00:00 | BTC | 2018-07-18 | 22:52:52 |
| 4 | How satellites avoid attacks and space junk wh... | 2018-07-18 23:00:01+00:00 | ORCL | 2018-07-18 | 23:00:01 |
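
The raw `timestamp` strings follow a fixed layout (weekday, month, day, time, UTC offset, year), which `pd.to_datetime` infers on its own. As a rough illustration of what that conversion does, the sketch below parses one value from the preview with the standard library and an explicit format string. The format string is an assumption derived from the sample rows, not something stated by the dataset; a similar string could presumably be passed to `pd.to_datetime` through its `format` argument.

```python
from datetime import datetime

# Timestamp string copied from the first row of the dataset preview
raw = "Wed Jul 18 21:33:26 +0000 2018"

# Format assumed from that sample: weekday, month, day, time, UTC offset, year
TWITTER_FORMAT = "%a %b %d %H:%M:%S %z %Y"

parsed = datetime.strptime(raw, TWITTER_FORMAT)

print(parsed.isoformat())  # full timezone-aware timestamp
print(parsed.date())       # corresponds to the 'date' column
print(parsed.time())       # corresponds to the 'time' column
```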


In [67]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /home/dorsa/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

### Helper Functions

In [68]:
def preprocess_text(text):
    """
    Preprocess a tweet for sentiment extraction.
    :param text: the text to extract sentiment from
    :return: a list of lowercased, stemmed tokens without stop words or punctuation
    """
    # remove
    text = re.sub(r'^RT[\s]+', '', text)      # remove the retweet marker
    text = re.sub(r'https?://\S+', '', text)  # remove hyperlinks (only the URL, not the rest of the line)
    text = re.sub(r'#', '', text)             # remove the hashtag sign
    
    # tokenize
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
    tokens = tokenizer.tokenize(text)
    
    # stop words and punctuation
    stopwords_english = set(stopwords.words('english'))
    text_clean = [word for word in tokens
                  if word not in stopwords_english and word not in string.punctuation]

    # stemming
    stemmer = PorterStemmer()
    text_stem = [stemmer.stem(word) for word in text_clean]

    return text_stem
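

The regex cleanup step can be tried on its own with only the standard library. The sketch below is an illustration, not the notebook's exact code: it runs on a made-up tweet with a link in the middle, and the helper name `strip_tweet_markup` is hypothetical. It matches the link with `\S+` so that only the URL is removed, and contrasts this with a greedy `.*` pattern, which also discards everything after the link on the same line.

```python
import re

def strip_tweet_markup(text):
    """Remove the retweet marker, links, and hashtag signs from a tweet."""
    text = re.sub(r'^RT\s+', '', text)        # leading "RT " retweet marker
    text = re.sub(r'https?://\S+', '', text)  # the URL only, up to the next whitespace
    text = re.sub(r'#', '', text)             # keep the hashtag word, drop the sign
    return text

# Hypothetical tweet, not taken from the dataset
sample = "RT Stocks rally https://example.com/x after #earnings beat"

cleaned = strip_tweet_markup(sample)
print(cleaned.split())

# For comparison: a greedy pattern removes the rest of the line after the link
greedy = re.sub(r'https?://.*', '', sample)
print(repr(greedy))
```

The difference only matters for tweets where text follows the link; when the link is the last thing in the tweet, both patterns give the same result.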


In [90]:
print('original text:', '\t\t', '\033[91m' + df['text'][2], 
      '\033[90m', '\n', 'preprocessed text:', '\t', '\033[92m', preprocess_text(df['text'][2]))

original text: 		 Who says the American Dream is dead? https://t.co/CRgx19x7sA
 preprocessed text: 	 ['say', 'american', 'dream', 'dead']
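
One detail of the filtering step is worth a closer look: `string.punctuation` is a single string, so `word not in string.punctuation` is a substring test. Single marks such as `?` are filtered out, but a multi-character token such as `...` is not a substring and passes through. The sketch below shows this with a tiny stand-in stop word set and a character-by-character check as one possible alternative. The names `STOP_WORDS`, `is_punctuation`, and `filter_tokens` are hypothetical, and the token list is made up rather than produced by `TweetTokenizer`.

```python
import string

# Hypothetical mini stop word set, standing in for stopwords.words('english')
STOP_WORDS = {"who", "the", "is", "a"}

def is_punctuation(token):
    """True if every character of the token is an ASCII punctuation mark."""
    return bool(token) and all(ch in string.punctuation for ch in token)

def filter_tokens(tokens):
    """Drop stop words and tokens made up entirely of punctuation."""
    return [t for t in tokens if t not in STOP_WORDS and not is_punctuation(t)]

# Made-up token list for illustration
tokens = ["who", "says", "the", "dream", "is", "dead", "?", "..."]

# Substring test: '?' is removed, '...' is kept
substring_filtered = [t for t in tokens
                      if t not in STOP_WORDS and t not in string.punctuation]
print(substring_filtered)

# Character-by-character test: both '?' and '...' are removed
print(filter_tokens(tokens))
```

Whether leftover tokens like `...` matter depends on the downstream model; this is an optional refinement rather than a required change.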
