## Supervised Classification
In supervised classification, the classifier is trained with labeled training data.

NLTK’s twitter_samples corpus is used as the labeled training data.
The corpus contains three files:

    1) negative_tweets.json: contains 5k negative tweets
    2) positive_tweets.json: contains 5k positive tweets
    3) tweets.20150430-223406.json: contains 20k tweets with no sentiment labels

In [2]:
from nltk.corpus import twitter_samples
print (twitter_samples.fileids())

['negative_tweets.json', 'positive_tweets.json', 'tweets.20150430-223406.json']


In [3]:
pos_tweets = twitter_samples.strings('positive_tweets.json')
print (len(pos_tweets)) 
 
neg_tweets = twitter_samples.strings('negative_tweets.json')
print (len(neg_tweets))
 
all_tweets = twitter_samples.strings('tweets.20150430-223406.json')
print (len(all_tweets))

5000
5000
20000


In [4]:
for tweet in pos_tweets[:5]:
    print (tweet)

#FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)
@Lamb2ja Hey James! How odd :/ Please call our Contact Centre on 02392441234 and we will be able to assist you :) Many thanks!
@DespiteOfficial we had a listen last night :) As You Bleed is an amazing track. When are you in Scotland?!
@97sides CONGRATS :)
yeaaaah yippppy!!!  my accnt verified rqst has succeed got a blue tick mark on my fb profile :) in 15 days


### Cleaning Tweets

In the tweet cleaning process, we will do the following:

    – Remove stock market tickers like $GE
    – Remove retweet text “RT”
    – Remove hyperlinks
    – Remove hashtags (only the hashtag # and not the word)
    – Remove stop words like a, and, the, is, are, etc.
    – Remove emoticons like :), :D, :(, :-), etc.
    – Remove punctuation like full-stop, comma, exclamation sign, etc.
    – Convert words to their stems using the Porter stemming algorithm. E.g. ‘working’, ‘works’, and ‘worked’ are all reduced to the stem ‘work’.
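The regex-based steps at the top of this list (tickers, retweet marker, hyperlinks, hash signs) can be tried in isolation before tokenization comes into play. The sketch below uses only the standard library. `strip_twitter_markup` and the sample tweet are hypothetical, introduced for illustration, and the hyperlink pattern `https?://\S+` is one reasonable choice rather than the only one.

```python
import re

def strip_twitter_markup(text):
    """Hypothetical helper: apply the regex-based cleaning steps only."""
    text = re.sub(r'\$\w*', '', text)         # stock tickers such as $GE
    text = re.sub(r'^RT[\s]+', '', text)      # leading retweet marker "RT"
    text = re.sub(r'https?://\S+', '', text)  # hyperlinks (the URL only)
    text = re.sub(r'#', '', text)             # hash sign, keeping the word
    return text

# Invented sample tweet, not taken from the corpus
sample = "RT $GE is up today #stocks https://example.com/news :)"
print(strip_twitter_markup(sample).split())
# ['is', 'up', 'today', 'stocks', ':)']
```

The emoticon is still present at this point; it is removed later, together with stop words and punctuation, once the tweet has been tokenized.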

In [9]:
#importing libraries
import string
import re
 
from nltk.corpus import stopwords 
stopwords_english = set(stopwords.words('english'))  # a set makes membership tests fast
 
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
 
from nltk.tokenize import TweetTokenizer
tweet_tokenizer = TweetTokenizer(preserve_case=False, 
                                 strip_handles=True,
                                 reduce_len=True)

In [10]:
# Happy Emoticons
emoticons_happy = set([
    ':-)', ':)', ';)', ':o)', ':]', ':3', ':c)', ':>', '=]', '8)', '=)', ':}',
    ':^)', ':-D', ':D', '8-D', '8D', 'x-D', 'xD', 'X-D', 'XD', '=-D', '=D',
    '=-3', '=3', ':-))', ":'-)", ":')", ':*', ':^*', '>:P', ':-P', ':P', 'X-P',
    'x-p', 'xp', 'XP', ':-p', ':p', '=p', ':-b', ':b', '>:)', '>;)', '>:-)',
    '<3'
    ])
 
# Sad Emoticons
emoticons_sad = set([
    ':L', ':-/', '>:/', ':S', '>:[', ':@', ':-(', ':[', ':-||', '=L', ':<',
    ':-[', ':-<', '=\\', '=/', '>:(', ':(', '>.<', ":'-(", ":'(", ':\\', ':-c',
    ':c', ':{', '>:\\', ';('
    ])
 
# all emoticons (happy + sad)
emoticons = emoticons_happy.union(emoticons_sad)

In [14]:
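def clean_tweets(tweet):
    # remove stock market tickers like $GE
    tweet = re.sub(r'\$\w*', '', tweet)

    # remove the retweet marker "RT" at the start of the tweet
    tweet = re.sub(r'^RT[\s]+', '', tweet)

    # remove hyperlinks (only the URL itself, not the text that follows it)
    tweet = re.sub(r'https?://\S+', '', tweet)

    # remove only the hash sign, keeping the hashtag word
    tweet = re.sub(r'#', '', tweet)

    # lowercase, strip @handles, shorten elongated words, then tokenize
    tweet_tokens = tweet_tokenizer.tokenize(tweet)

    tweet_clean = []
    for word in tweet_tokens:
        # skip stop words, emoticons and single punctuation marks
        if (word not in stopwords_english and word not in emoticons and
                word not in string.punctuation):
            stem_word = stemmer.stem(word)  # reduce the word to its stem
            tweet_clean.append(stem_word)
    return tweet_clean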
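One detail of the filter is worth knowing: `string.punctuation` is a plain string, so `word not in string.punctuation` is a substring test rather than a per-character test. Single punctuation marks are caught, but a multi-character token such as `...` is not. Below is a minimal sketch of a stricter check; `is_punctuation` is a hypothetical helper and not part of the original notebook.

```python
import string

def is_punctuation(token):
    """Hypothetical helper: True if the token is non-empty and every
    character in it is a punctuation character."""
    return bool(token) and all(ch in string.punctuation for ch in token)

print('!' in string.punctuation)    # True: a single mark is a substring
print('...' in string.punctuation)  # False: the string contains only one '.'
print(is_punctuation('...'))        # True
print(is_punctuation('work'))       # False
```

Whether such tokens should be dropped is a design choice; keeping the original substring test leaves them in the feature set.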

### Feature Extraction

We define a simple bag_of_words function that cleans a tweet and maps each remaining word to True, producing the feature dictionary the classifier expects.

In [16]:
def bag_of_words(tweet):
    words = clean_tweets(tweet)
    # every word that survives cleaning becomes a feature with value True
    words_dictionary = {word: True for word in words}
    return words_dictionary
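To see what such a feature dictionary looks like without running the full cleaning pipeline, the sketch below applies the same dictionary construction to a hand-written token list. `tokens_to_features` and the tokens are hypothetical and stand in for the output of `clean_tweets`.

```python
def tokens_to_features(tokens):
    """Hypothetical helper: mark every token as a present feature."""
    return {token: True for token in tokens}

# Invented, already-stemmed tokens
tokens = ['happi', 'day', 'happi', 'friend']
features = tokens_to_features(tokens)
print(features)  # {'happi': True, 'day': True, 'friend': True}
```

Repeated words collapse into a single key, so the features record only whether a word occurs, not how often or in what order.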

In [17]:
#positive tweets feature set
pos_tweets_set = []
for tweet in pos_tweets:
    pos_tweets_set.append((bag_of_words(tweet), 'pos'))    
 
#negative tweets feature set
neg_tweets_set = []
for tweet in neg_tweets:
    neg_tweets_set.append((bag_of_words(tweet), 'neg'))


In [18]:
from random import shuffle 
shuffle(pos_tweets_set)
shuffle(neg_tweets_set)
 
test_set = pos_tweets_set[:1000] + neg_tweets_set[:1000]
train_set = pos_tweets_set[1000:] + neg_tweets_set[1000:]
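Because `shuffle` reorders the lists differently on every run, the split, and with it the measured accuracy, changes from run to run. If repeatable results are wanted, one option is to shuffle with a seeded generator. The sketch below uses toy data; `split_by_label`, the seed value, and the toy lists are assumptions made for illustration.

```python
import random

def split_by_label(pos_items, neg_items, test_size, seed=42):
    """Hypothetical helper: shuffle each class with a seeded generator,
    then take the first `test_size` items of each class as the test set."""
    rng = random.Random(seed)
    pos, neg = list(pos_items), list(neg_items)  # copy, leave inputs untouched
    rng.shuffle(pos)
    rng.shuffle(neg)
    test = pos[:test_size] + neg[:test_size]
    train = pos[test_size:] + neg[test_size:]
    return train, test

# Toy feature sets in the same (features, label) shape used above
pos_toy = [({'w%d' % i: True}, 'pos') for i in range(10)]
neg_toy = [({'w%d' % i: True}, 'neg') for i in range(10)]

train_toy, test_toy = split_by_label(pos_toy, neg_toy, test_size=2)
print(len(train_toy), len(test_toy))  # 16 4
```

Taking the same number of items from each class keeps the test set balanced, which is also what the split above does with 1000 positive and 1000 negative tweets.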

### Naive Bayes Classifier

In [19]:
from nltk import classify
from nltk import NaiveBayesClassifier
 
classifier = NaiveBayesClassifier.train(train_set)
 
accuracy = classify.accuracy(classifier, test_set)
print(accuracy)  # the exact value varies between runs because the tweets are shuffled
 
classifier.show_most_informative_features(10)  # prints the table itself and returns None

0.734
Most Informative Features
                     via = True              pos : neg    =     37.7 : 1.0
                     sad = True              neg : pos    =     29.0 : 1.0
                     bam = True              pos : neg    =     25.0 : 1.0
                     x15 = True              neg : pos    =     18.3 : 1.0
                 appreci = True              pos : neg    =     16.3 : 1.0
                   arriv = True              pos : neg    =     15.9 : 1.0
                     ugh = True              neg : pos    =     15.0 : 1.0
                      ff = True              pos : neg    =     14.6 : 1.0
                      aw = True              neg : pos    =     14.2 : 1.0
                    glad = True              pos : neg    =     13.8 : 1.0
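Each row of the table is a likelihood ratio: how much more probable a feature is under one label than under the other. The sketch below reproduces that arithmetic with hypothetical counts. It assumes add-0.5 (expected likelihood) smoothing with two possible values per feature (present or absent), which is assumed here to match NLTK's default estimator. The counts 56 and 1 are invented for illustration and were not read from the classifier; 4000 is the number of training tweets per label in the split above.

```python
def smoothed_prob(count, total, bins=2, gamma=0.5):
    """Add-gamma estimate of P(feature present | label).
    bins=2 and gamma=0.5 are assumptions, see the text above."""
    return (count + gamma) / (total + bins * gamma)

def likelihood_ratio(count_a, count_b, total_a, total_b):
    """Ratio P(feature | label a) / P(feature | label b)."""
    return smoothed_prob(count_a, total_a) / smoothed_prob(count_b, total_b)

# Hypothetical counts: a word seen in 56 positive and 1 negative training tweets
ratio = likelihood_ratio(56, 1, 4000, 4000)
print('pos : neg = %.1f : 1.0' % ratio)  # pos : neg = 37.7 : 1.0
```

With equal class sizes the denominators cancel, so the ratio depends only on the smoothed counts. This is why a word that is rare in one class can rank very high even when it is not especially frequent in the other.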
