# Transfer Learning MNIST

* Train a simple convnet on the first five MNIST digits [0..4].
* Freeze convolutional layers and fine-tune dense layers for the classification of digits [5..9].

## 1. Import necessary libraries for the model

In [24]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.callbacks import ModelCheckpoint, EarlyStopping
from keras.layers import Conv2D, MaxPooling2D, Activation, Flatten, Dense, Dropout

## 2. Import MNIST data and create two datasets: one with digits 0 to 4 and one with digits 5 to 9

In [25]:
from keras.datasets import mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
# create two datasets one with digits from 0 to 4 and one with 5 to 9
x_train_lt5 = x_train[y_train < 5]
y_train_lt5 = y_train[y_train < 5]
x_test_lt5 = x_test[y_test < 5]
y_test_lt5 = y_test[y_test < 5]

x_train_gte5 = x_train[y_train >= 5]
y_train_gte5 = y_train[y_train >= 5]
x_test_gte5 = x_test[y_test >= 5]
y_test_gte5 = y_test[y_test >= 5]
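
The split above relies on NumPy boolean-mask indexing: the comparison on the label array yields one `True`/`False` per sample, and indexing the image array with that mask keeps the matching images. A minimal sketch on toy arrays (the names `x`, `y` and `mask` are illustrative stand-ins, not the MNIST data):

```python
import numpy as np

# Toy stand-ins for the MNIST arrays: six 2x2 "images" and their labels
x = np.arange(24).reshape(6, 2, 2)
y = np.array([0, 7, 3, 9, 4, 5])

mask = y < 5                          # boolean array, one entry per sample
x_lt5, y_lt5 = x[mask], y[mask]       # samples labelled 0..4
x_gte5, y_gte5 = x[~mask], y[~mask]   # samples labelled 5..9

print(x_lt5.shape, y_lt5)     # (3, 2, 2) [0 3 4]
print(x_gte5.shape, y_gte5)   # (3, 2, 2) [7 9 5]
```

Because the same mask is applied to images and labels, the two arrays stay aligned sample by sample.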

## 3. Print x_train, y_train, x_test and y_test for both the datasets

In [26]:
# Note: x[x > 0] flattens the image array and keeps only the non-zero pixel values
print("Dataset Samples: \n")
print("X Train < 5: ", x_train_lt5[x_train_lt5 > 0])
print("X Test < 5: ", x_test_lt5[x_test_lt5 > 0])
print("Y Train < 5: ", y_train_lt5[0])
print("Y Test < 5: ", y_test_lt5[0])
print("X Train >= 5: ", x_train_gte5[x_train_gte5 > 0])
print("X Test >=5 : ", x_test_gte5[x_test_gte5 > 0])
print("Y Train >= 5: ", y_train_gte5[0])
print("Y Test >= 5: ", y_test_gte5[0])

Dataset Samples: 

X Train < 5:  [ 51 159 253 ... 168 108  15]
X Test < 5:  [116 125 171 ... 255 230  38]
Y Train < 5:  0
Y Test < 5:  2
X Train >= 5:  [  3  18  18 ... 193 197 134]
X Test >=5 :  [ 84 185 159 ... 132 110   4]
Y Train >= 5:  5
Y Test >= 5:  7


## 4. Take only the dataset (x_train, y_train, x_test, y_test) for digits 0 to 4 in MNIST
## Reshape x_train and x_test to a 4-dimensional array (channel = 1) to pass it into a Conv2D layer

In [27]:
x_train_lt5_4d = np.expand_dims(x_train_lt5, axis = 3)
x_test_lt5_4d = np.expand_dims(x_test_lt5, axis = 3)

In [28]:
print("New Shape X Train: ", x_train_lt5_4d.shape)
print("New Shape X Test: ", x_test_lt5_4d.shape)

New Shape X Train:  (30596, 28, 28, 1)
New Shape X Test:  (5139, 28, 28, 1)
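
As a small illustration of the reshape above: `np.expand_dims(..., axis = 3)` appends a trailing channel axis of length 1 without copying or changing any pixel values. The sketch below uses an all-zero toy batch (an assumption for illustration) and shows that indexing with `np.newaxis` gives the same shape.

```python
import numpy as np

# Toy batch of three blank 28x28 grayscale images
images = np.zeros((3, 28, 28), dtype=np.uint8)

with_channel = np.expand_dims(images, axis=3)   # add a channel axis at position 3
same_shape = images[..., np.newaxis]            # equivalent indexing form

print(with_channel.shape)   # (3, 28, 28, 1)
print(same_shape.shape)     # (3, 28, 28, 1)
```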


## 5. Normalize x_train and x_test by dividing it by 255

In [29]:
x_train_lt5_4d = x_train_lt5_4d/255
x_test_lt5_4d = x_test_lt5_4d/255
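
A quick check of what the division does to the data type, sketched on a toy `uint8` batch (the array `batch` is illustrative): true division promotes the integer pixels to `float64`, and casting to `float32` first is an optional way to halve the memory footprint.

```python
import numpy as np

# Toy uint8 batch shaped like the notebook's 4-D input: (samples, rows, cols, channels)
batch = np.array([[0, 128], [255, 64]], dtype=np.uint8).reshape(1, 2, 2, 1)

scaled = batch / 255                        # true division promotes uint8 to float64
scaled32 = batch.astype("float32") / 255    # optional: stay in float32

print(scaled.dtype, scaled.min(), scaled.max())   # float64 0.0 1.0
print(scaled32.dtype)                             # float32
```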

## 6. Use one-hot encoding to convert y_train and y_test into the required number of output classes

In [30]:
y_train_enc = pd.get_dummies(y_train_lt5)
y_test_enc = pd.get_dummies(y_test_lt5)

In [31]:
y_train_lt5 = y_train_enc
y_test_lt5 = y_test_enc
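
`pd.get_dummies` builds one column per distinct label value, in sorted order, so it only yields the expected five columns when every class occurs in the data. A NumPy-only sketch of the same encoding with an explicit class count is shown below; the helper `one_hot` and its `first_class` offset are hypothetical names introduced for illustration. The offset matters later for digits 5 to 9, where column 0 stands for digit 5.

```python
import numpy as np

def one_hot(labels, first_class, num_classes):
    """Return an (n, num_classes) one-hot matrix; column j stands for digit first_class + j."""
    labels = np.asarray(labels)
    return np.eye(num_classes, dtype=np.float32)[labels - first_class]

low = one_hot([0, 4, 2], first_class=0, num_classes=5)
high = one_hot([5, 9, 7], first_class=5, num_classes=5)

print(low.argmax(axis=1))    # [0 4 2]
print(high.argmax(axis=1))   # [0 4 2]
```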

## 7. Build a sequential model with 2 convolutional layers with 32 kernels of size (3,3), followed by a max pooling layer of size (2,2) and a dropout layer, to be trained for classification of digits 0-4

In [32]:
# input image dimensions
img_rows, img_cols = 28, 28
# number of convolutional filters to use
filters = 32
# size of pooling area for max pooling
pool_size = 2
# convolution kernel size
kernel_size = 3
# number of classes
num_classes = 5

conv_layers = [
    Conv2D(filters, kernel_size,
           padding='valid',
           input_shape=(28, 28, 1)),
    Activation('relu'),
    Conv2D(filters, kernel_size),
    Activation('relu'),
    MaxPooling2D(pool_size = pool_size),
    Dropout(0.25),
    Flatten(),
]

## 8. After flattening, add 2 dense layers: one with 128 neurons and activation = 'relu', and one with as many neurons as output classes and activation = 'softmax'. Add a dropout layer in between if necessary

In [33]:
#Referenced from a similar GIT Project
output_layers = [
    Dense(128),
    Activation('relu'),
    Dropout(0.5),
    Dense(num_classes),
    Activation('softmax')
]

# create complete model
model = Sequential(conv_layers + output_layers)

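# Checkpoint the weights with the best validation accuracy and stop early after 5 epochs without improvement
# (newer Keras releases may report this metric as 'val_accuracy' rather than 'val_acc')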
checkpoint = ModelCheckpoint("vgg16_initial_best.h5", monitor='val_acc', verbose=1, save_best_only=True, mode='auto')
early = EarlyStopping(monitor='val_acc', min_delta=0, patience=5, verbose=1, mode='auto')

In [34]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_3 (Conv2D)            (None, 26, 26, 32)        320       
_________________________________________________________________
activation_5 (Activation)    (None, 26, 26, 32)        0         
_________________________________________________________________
conv2d_4 (Conv2D)            (None, 24, 24, 32)        9248      
_________________________________________________________________
activation_6 (Activation)    (None, 24, 24, 32)        0         
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 12, 12, 32)        0         
_________________________________________________________________
dropout_3 (Dropout)          (None, 12, 12, 32)        0         
_________________________________________________________________
flatten_2 (Flatten)          (None, 4608)              0         
__________
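
The output shapes and parameter counts in the summary above can be reproduced by hand. The sketch below assumes 'valid' padding and stride 1, as in the model definition; the helper names `conv_out` and `conv_params` are illustrative.

```python
def conv_out(size, kernel, stride=1):
    """Output width of a 'valid' convolution along one dimension."""
    return (size - kernel) // stride + 1

def conv_params(in_channels, filters, kernel):
    """Kernel weights plus one bias per filter, for a square kernel."""
    return (kernel * kernel * in_channels + 1) * filters

side1 = conv_out(28, 3)             # first Conv2D: 28 -> 26
params1 = conv_params(1, 32, 3)     # 320
side2 = conv_out(side1, 3)          # second Conv2D: 26 -> 24
params2 = conv_params(32, 32, 3)    # 9248
pooled = side2 // 2                 # 2x2 max pooling: 24 -> 12
flat = pooled * pooled * 32         # Flatten: 12 * 12 * 32 = 4608

print(side1, params1, side2, params2, pooled, flat)
# 26 320 24 9248 12 4608
```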

In [35]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

model.fit(x_train_lt5_4d, y_train_lt5,
          batch_size = 512,
          epochs = 10,
          verbose = 1,
          callbacks = [checkpoint, early],
          validation_data=(x_test_lt5_4d, y_test_lt5))

Train on 30596 samples, validate on 5139 samples
Epoch 1/10

Epoch 00001: val_acc improved from -inf to 0.98288, saving model to vgg16_initial_best.h5
Epoch 2/10

Epoch 00002: val_acc improved from 0.98288 to 0.99222, saving model to vgg16_initial_best.h5
Epoch 3/10

Epoch 00003: val_acc improved from 0.99222 to 0.99533, saving model to vgg16_initial_best.h5
Epoch 4/10

Epoch 00004: val_acc improved from 0.99533 to 0.99591, saving model to vgg16_initial_best.h5
Epoch 5/10

Epoch 00005: val_acc improved from 0.99591 to 0.99747, saving model to vgg16_initial_best.h5
Epoch 6/10

Epoch 00006: val_acc improved from 0.99747 to 0.99766, saving model to vgg16_initial_best.h5
Epoch 7/10

Epoch 00007: val_acc improved from 0.99766 to 0.99825, saving model to vgg16_initial_best.h5
Epoch 8/10

Epoch 00008: val_acc did not improve from 0.99825
Epoch 9/10

Epoch 00009: val_acc did not improve from 0.99825
Epoch 10/10

Epoch 00010: val_acc did not improve from 0.99825


<keras.callbacks.History at 0x7f2c705dec50>

In [36]:
model_score_train = model.evaluate(x_train_lt5_4d, y_train_lt5)
model_score_test = model.evaluate(x_test_lt5_4d, y_test_lt5)



## 9. Print the training and test accuracy

In [37]:
print('Train Accuracy:', model_score_train[1])
print('Test accuracy:', model_score_test[1])

Train Accuracy: 0.9987906915936724
Test accuracy: 0.9980540961276513


## 10. Make only the dense layers to be trainable and convolutional layers to be non-trainable

In [38]:
# Freeze every layer whose name does not contain 'dense'
for layer in model.layers:
    if 'dense' not in layer.name:
        layer.trainable = False

In [39]:
#Module to print colourful statements
from termcolor import colored

#Check which layers have been frozen 
for layer in model.layers:
    print (colored(layer.name, 'blue'))
    print (colored(layer.trainable, 'red'))

conv2d_3
False
activation_5
False
conv2d_4
False
activation_6
False
max_pooling2d_2
False
dropout_3
False
flatten_2
False
dense_3
True
activation_7
False
dropout_4
False
dense_4
True
activation_8
False
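
In the listing above, the activation and dropout layers that sit between the dense layers are also marked non-trainable. That is harmless, because those layers hold no weights. The sketch below illustrates the name-based rule with a hypothetical `StubLayer` class standing in for Keras layers; it is not the Keras API.

```python
from dataclasses import dataclass

@dataclass
class StubLayer:            # hypothetical stand-in for a Keras layer
    name: str
    has_weights: bool
    trainable: bool = True

layers = [
    StubLayer("conv2d_3", has_weights=True),
    StubLayer("activation_5", has_weights=False),
    StubLayer("dense_3", has_weights=True),
    StubLayer("activation_7", has_weights=False),
]

# Same rule as in the notebook: freeze anything without 'dense' in its name
for layer in layers:
    if "dense" not in layer.name:
        layer.trainable = False

still_learning = [l.name for l in layers if l.trainable and l.has_weights]
frozen_with_weights = [l.name for l in layers if not l.trainable and l.has_weights]

print(still_learning)        # ['dense_3']
print(frozen_with_weights)   # ['conv2d_3']
```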


## 11. Use the model trained on 0 to 4 digit classification and train it on the dataset which has digits 5 to 9  (Using Transfer learning keeping only the dense layers to be trainable)

In [41]:
# Reload the best checkpoint saved while training on digits 0-4 (the file is in the current folder)
model.load_weights('./vgg16_initial_best.h5')

In [42]:
x_train_gte5_4d = np.expand_dims(x_train_gte5, axis = 3)
x_test_gte5_4d = np.expand_dims(x_test_gte5, axis = 3)

In [45]:
print("New Shape X Train: ", x_train_gte5_4d.shape)
print("New Shape X Test: ", x_test_gte5_4d.shape)

New Shape X Train:  (29404, 28, 28, 1)
New Shape X Test:  (4861, 28, 28, 1)


In [47]:
x_train_gte5_4d = x_train_gte5_4d/255
x_test_gte5_4d = x_test_gte5_4d/255

In [49]:
y_train_enc = pd.get_dummies(y_train_gte5)
y_test_enc = pd.get_dummies(y_test_gte5)

In [50]:
y_train_gte5 = y_train_enc
y_test_gte5 = y_test_enc

In [53]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

model.fit(x_train_gte5_4d, y_train_gte5,
          batch_size = 512,
          epochs = 10,
          verbose = 1,
          validation_data=(x_test_gte5_4d, y_test_gte5))

Train on 29404 samples, validate on 4861 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f2c7c43f9e8>

## 12. Print the accuracy for classification of digits 5 to 9

In [54]:
model_score_train = model.evaluate(x_train_gte5_4d, y_train_gte5)
model_score_test = model.evaluate(x_test_gte5_4d, y_test_gte5)



In [55]:
print('Train Accuracy:', model_score_train[1])
print('Test accuracy:', model_score_test[1])

Train Accuracy: 0.9943885185689022
Test accuracy: 0.9907426454358277


## Sentiment analysis

The objective of the second problem is to perform sentiment analysis on tweets that users posted about various mobile devices.
Based on the text of a tweet, we classify whether the user's sentiment toward a particular mobile device is positive or not.

### 13. Read the dataset (tweets.csv) and drop the NA's while reading the dataset

In [407]:
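# na_filter=True (the default) only detects missing values; it does not drop any rows.
# The one missing tweet that remains becomes the string 'nan' through str(text) in step 14.
tweets_data = pd.read_csv("tweets.csv", na_filter = True)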

In [408]:
tweets_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9093 entries, 0 to 9092
Data columns (total 3 columns):
tweet_text                                            9092 non-null object
emotion_in_tweet_is_directed_at                       3291 non-null object
is_there_an_emotion_directed_at_a_brand_or_product    9093 non-null object
dtypes: object(3)
memory usage: 213.2+ KB


In [409]:
tweets_data.head()

| | tweet_text | emotion_in_tweet_is_directed_at | is_there_an_emotion_directed_at_a_brand_or_product |
|---|---|---|---|
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | iPhone | Negative emotion |
| 1 | @jessedee Know about @fludapp ? Awesome iPad/i... | iPad or iPhone App | Positive emotion |
| 2 | @swonderlin Can not wait for #iPad 2 also. The... | iPad | Positive emotion |
| 3 | @sxsw I hope this year's festival isn't as cra... | iPad or iPhone App | Negative emotion |
| 4 | @sxtxstate great stuff on Fri #SXSW: Marissa M... | Google | Positive emotion |


### 14. Preprocess the text and add the preprocessed text in a column with name `text` in the dataframe.

In [410]:
import requests
from bs4 import BeautifulSoup
import nltk
import string
import re
nltk.download('stopwords')
from nltk.corpus import stopwords 
stopwords_english = stopwords.words('english')
 
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
 
from nltk.tokenize import TweetTokenizer

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [411]:
#Referenced from Online Source

# Happy Emoticons
emoticons_happy = set([
    ':-)', ':)', ';)', ':o)', ':]', ':3', ':c)', ':>', '=]', '8)', '=)', ':}',
    ':^)', ':-D', ':D', '8-D', '8D', 'x-D', 'xD', 'X-D', 'XD', '=-D', '=D',
    '=-3', '=3', ':-))', ":'-)", ":')", ':*', ':^*', '>:P', ':-P', ':P', 'X-P',
    'x-p', 'xp', 'XP', ':-p', ':p', '=p', ':-b', ':b', '>:)', '>;)', '>:-)',
    '<3'
    ])
 
# Sad Emoticons
emoticons_sad = set([
    ':L', ':-/', '>:/', ':S', '>:[', ':@', ':-(', ':[', ':-||', '=L', ':<',
    ':-[', ':-<', '=\\', '=/', '>:(', ':(', '>.<', ":'-(", ":'(", ':\\', ':-c',
    ':c', ':{', '>:\\', ';('
    ])
 
# all emoticons (happy + sad)
emoticons = emoticons_happy.union(emoticons_sad)
 
def clean_tweets(tweet):

    # remove stock market tickers like $GE
    tweet = re.sub(r'\$\w*', '', tweet)
 
    # remove old style retweet text "RT"
    tweet = re.sub(r'^RT[\s]+', '', tweet)
 
    # remove hyperlinks
    tweet = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet)
    
    # remove hashtags
    # only removing the hash # sign from the word
    tweet = re.sub(r'#', '', tweet)
 
    # tokenize tweets
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
    tweet_tokens = tokenizer.tokenize(tweet)
 
    tweets_clean = []    
    for word in tweet_tokens:
        if (word not in stopwords_english and # remove stopwords
              word not in emoticons and # remove emoticons
                word not in string.punctuation): # remove punctuation
            #tweets_clean.append(word)
            stem_word = stemmer.stem(word) # stemming word
            tweets_clean.append(stem_word)
 
    return tweets_clean

In [412]:
custom_tweet = "RT @Twitter @chapagain Hello There! Have a great day. :) #good #morning http://chapagain.com.np"
 
# print cleaned tweet
print (clean_tweets(custom_tweet))

['hello', 'great', 'day', 'good', 'morn']
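
To see what the regular-expression steps of `clean_tweets` do on their own, here is a standard-library sketch that applies just those four substitutions; the helper `strip_noise` and the sample tweet are illustrative. One side effect worth knowing: the hyperlink pattern ends in `.*`, so it removes the URL together with everything after it on the same line. The `@user` handle survives here because handles are only stripped later by `TweetTokenizer(strip_handles=True)`.

```python
import re

def strip_noise(tweet):
    """Apply only the regex clean-up steps, without tokenizing or stemming."""
    tweet = re.sub(r'\$\w*', '', tweet)                  # stock tickers like $GE
    tweet = re.sub(r'^RT[\s]+', '', tweet)               # leading "RT "
    tweet = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet)   # URL and the rest of the line
    tweet = re.sub(r'#', '', tweet)                      # hash sign only, keep the word
    return tweet

cleaned = strip_noise("RT @user Loving my $AAPL phone #iPad http://example.com so much")
print(cleaned.split())
# ['@user', 'Loving', 'my', 'phone', 'iPad']
```

The words "so much" after the link are gone, which may or may not be intended for tweets that put the URL in the middle of the text.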


In [413]:
tweets_data['text'] = [clean_tweets(str(text)) for text in tweets_data.tweet_text]

In [414]:
tweets_data.head(1)

| | tweet_text | emotion_in_tweet_is_directed_at | is_there_an_emotion_directed_at_a_brand_or_product | text |
|---|---|---|---|---|
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | iPhone | Negative emotion | [3g, iphon, 3, hr, tweet, rise_austin, dead, n... |


### 15. Consider only rows having Positive emotion and Negative emotion and remove other rows from the dataframe.

In [415]:
valid_emotions = ['Negative emotion', 'Positive emotion']
valid_tweets = tweets_data.is_there_an_emotion_directed_at_a_brand_or_product.isin(valid_emotions)
invalid_tweets = ~tweets_data.is_there_an_emotion_directed_at_a_brand_or_product.isin(valid_emotions)

In [416]:
tweets_data_valid = tweets_data[valid_tweets]
tweets_data_invalid = tweets_data[invalid_tweets]

In [417]:
print("Valid Tweets: \n")
print(tweets_data_valid.info())
tweets_data_valid.head(1)

Valid Tweets: 

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3548 entries, 0 to 9088
Data columns (total 4 columns):
tweet_text                                            3548 non-null object
emotion_in_tweet_is_directed_at                       3191 non-null object
is_there_an_emotion_directed_at_a_brand_or_product    3548 non-null object
text                                                  3548 non-null object
dtypes: object(4)
memory usage: 138.6+ KB
None


| | tweet_text | emotion_in_tweet_is_directed_at | is_there_an_emotion_directed_at_a_brand_or_product | text |
|---|---|---|---|---|
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | iPhone | Negative emotion | [3g, iphon, 3, hr, tweet, rise_austin, dead, n... |


In [418]:
print("Invalid Tweets: \n")
print(tweets_data_invalid.info())
tweets_data_invalid.head(1)

Invalid Tweets: 

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5545 entries, 5 to 9092
Data columns (total 4 columns):
tweet_text                                            5544 non-null object
emotion_in_tweet_is_directed_at                       100 non-null object
is_there_an_emotion_directed_at_a_brand_or_product    5545 non-null object
text                                                  5545 non-null object
dtypes: object(4)
memory usage: 216.6+ KB
None


| | tweet_text | emotion_in_tweet_is_directed_at | is_there_an_emotion_directed_at_a_brand_or_product | text |
|---|---|---|---|---|
| 5 | @teachntech00 New iPad Apps For #SpeechTherapy... | | No emotion toward brand or product | [new, ipad, app, speechtherapi, commun, showca... |


### 16. Represent text as numerical data using `CountVectorizer` and get the document term frequency matrix

#### Use `vect` as the variable name for initialising CountVectorizer.

In [456]:
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

In [547]:
# clean_tweets_for_vect is defined in step 22 (cell In [537]); that cell was run before this one
X_vect = [clean_tweets_for_vect(str(text)) for text in tweets_data.tweet_text]

In [548]:
doc_term_freq = vect.fit_transform(X_vect)

In [549]:
type(doc_term_freq)

scipy.sparse.csr.csr_matrix

In [550]:
vect.vocabulary_

{'iphon': 3272,
 'hr': 3021,
 'tweet': 6641,
 'rise_austin': 5329,
 'dead': 1539,
 'need': 4227,
 'upgrad': 6767,
 'plugin': 4803,
 'station': 5980,
 'sxsw': 6169,
 'know': 3499,
 'awesom': 446,
 'ipad': 3264,
 'app': 275,
 'like': 3655,
 'appreci': 301,
 'design': 1605,
 'also': 183,
 'they': 6395,
 'give': 2605,
 'free': 2417,
 'ts': 6608,
 'wait': 6918,
 'sale': 5404,
 'hope': 2989,
 'year': 7201,
 'festiv': 2247,
 'crashi': 1402,
 'great': 2708,
 'stuff': 6061,
 'fri': 2428,
 'marissa': 3862,
 'mayer': 3907,
 'googl': 2655,
 'tim': 6451,
 'reilli': 5199,
 'tech': 6293,
 'book': 715,
 'confer': 1258,
 'matt': 3896,
 'mullenweg': 4151,
 'wordpress': 7123,
 'new': 4255,
 'speechtherapi': 5900,
 'commun': 1215,
 'showcas': 5632,
 'nan': 4198,
 'start': 5970,
 'ctia': 1453,
 'around': 334,
 'corner': 1348,
 'googleio': 2665,
 'hop': 2988,
 'skip': 5706,
 'jump': 3413,
 'good': 2648,
 'time': 6453,
 'android': 228,
 'fan': 2177,
 'beauti': 561,
 'smart': 5750,
 'simpl': 5663,
 'idea': 30

### 17. Find number of different words in vocabulary

In [423]:
# Helper for the NLTK classifier in step 21; it relies on the global word_features built there
def extract_features(tweet):
    tweet_words = set(tweet)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in tweet_words)
    return features
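
For the question in this step's heading: `vect.vocabulary_` is a plain dictionary mapping each term to its column index, so its length is the number of distinct words. On the fitted vectorizer that would be `len(vect.vocabulary_)`, which should equal `doc_term_freq.shape[1]`. The sketch below uses a four-entry excerpt of the mapping printed in step 16 as a stand-in, so the count it prints is for the excerpt only.

```python
# Excerpt of the mapping printed in step 16; the real vect.vocabulary_ has many more entries
vocabulary = {'iphon': 3272, 'hr': 3021, 'tweet': 6641, 'dead': 1539}

num_terms = len(vocabulary)                          # number of distinct terms
first_term = min(vocabulary, key=vocabulary.get)     # term with the lowest column index

print(num_terms)    # 4
print(first_term)   # dead
```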

#### Tip: To see all available functions for an Object use dir

### 18. Find out how many Positive and Negative emotions are there.

Hint: Use value_counts on that column

In [424]:
tweets_data_valid.is_there_an_emotion_directed_at_a_brand_or_product.value_counts()

Positive emotion    2978
Negative emotion     570
Name: is_there_an_emotion_directed_at_a_brand_or_product, dtype: int64

### 19. Change the labels for Positive and Negative emotions as 1 and 0 respectively and store in a different column in the same dataframe named 'Label'

Hint: use map on that column and give labels

In [425]:
binary_nums = {"is_there_an_emotion_directed_at_a_brand_or_product": {"Negative emotion": 0, "Positive emotion": 1}}
tweets_data_valid = tweets_data_valid.replace(binary_nums)
tweets_data_valid.head(3)

| | tweet_text | emotion_in_tweet_is_directed_at | is_there_an_emotion_directed_at_a_brand_or_product | text |
|---|---|---|---|---|
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | iPhone | 0 | [3g, iphon, 3, hr, tweet, rise_austin, dead, n... |
| 1 | @jessedee Know about @fludapp ? Awesome iPad/i... | iPad or iPhone App | 1 | [know, awesom, ipad, iphon, app, like, appreci... |
| 2 | @swonderlin Can not wait for #iPad 2 also. The... | iPad | 1 | [wait, ipad, 2, also, sale, sxsw] |
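
The cell above overwrites the original emotion column with `replace`. The alternative suggested by the hint, `Series.map` into a separate `Label` column, could look like the sketch below. It runs on a toy frame whose column name `emotion` is a shortened stand-in for the real column, and it is not what the rest of this notebook uses.

```python
import pandas as pd

# Toy frame; 'emotion' stands in for is_there_an_emotion_directed_at_a_brand_or_product
df = pd.DataFrame({"emotion": ["Negative emotion", "Positive emotion", "Positive emotion"]})

# map looks each value up in the dict; values missing from the dict would become NaN
df["Label"] = df["emotion"].map({"Negative emotion": 0, "Positive emotion": 1})

print(df["Label"].tolist())   # [0, 1, 1]
```

Keeping the text labels alongside the numeric ones makes later inspection of the data easier.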


### 20. Define the feature set (independent variable or X) to be `text` column and `labels` as target (or dependent variable)  and divide into train and test datasets

In [389]:
X = tweets_data_valid.text
Y = tweets_data_valid.is_there_an_emotion_directed_at_a_brand_or_product
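# keep only the label column (index 2) and the tokenized text column (index 3)
tweets_final_data = tweets_data_valid.iloc[:, [2,3]]
# word_features is built in step 21 from these tweets (buildVocabulary is not defined in this notebook)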

In [428]:
neg = tweets_data_valid[tweets_data_valid.is_there_an_emotion_directed_at_a_brand_or_product == 0]
pos = tweets_data_valid[tweets_data_valid.is_there_an_emotion_directed_at_a_brand_or_product == 1]

In [429]:
# Pair each tokenized tweet with its sentiment label
def featureExtraction(tweets_and_sentiments):
    tweets = []
    for index, row in tweets_and_sentiments.iterrows():
        sentiment = row.is_there_an_emotion_directed_at_a_brand_or_product
        tweet = row.text
        tweets.append((tweet, sentiment))
    return tweets  # list of (tokens, sentiment) tuples

# Collect every token across all tweets
def get_words_in_tweets(tweets):
    all_words = []
    for (text, sentiment) in tweets:
        all_words.extend(text)
    return all_words

def get_word_features(wordlist):
    # Frequency distribution of all words in the tweets
    wordlist = nltk.FreqDist(wordlist)
    # The distinct words become the feature vocabulary
    word_features = wordlist.keys()
    return word_features

## 21. **Predicting the sentiment:**


### Use Naive Bayes and Logistic Regression and report their accuracy scores for predicting the sentiment of the given text

In [430]:
tweets = featureExtraction(tweets_final_data)

In [431]:
word_features = get_word_features(get_words_in_tweets(tweets)) #my list of many words 

In [433]:
training_set = nltk.classify.apply_features(extract_features, tweets[:500])

In [434]:
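# Note: tweets[:250] lies inside the training slice tweets[:500], so this is not a held-out set;
# a disjoint slice such as tweets[500:750] would give an unbiased accuracy estimate.
test_set = nltk.classify.apply_features(extract_features, tweets[:250])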

In [435]:
classifier = nltk.NaiveBayesClassifier.train(training_set)

# Accuracy on the training set
accuracy = nltk.classify.accuracy(classifier, training_set) 

In [436]:
#Printing the accuracy
print(accuracy) 

total = accuracy * 100 
print('Naive Bayes Accuracy: %4.2f' % total)

0.944
Naive Bayes Accuracy: 94.40


In [437]:
# Accuracy Test Set
accuracyTestSet = nltk.classify.accuracy(classifier, test_set) 

#Printing the accuracy for the test set 
print(accuracyTestSet)

totalTest = accuracyTestSet * 100 
print('\nNaive Bayes Accuracy with the Test Set: %4.2f' % totalTest)

print('\nInformative features')
# show_most_informative_features prints its table itself and returns None
classifier.show_most_informative_features(n=15)

0.952

Naive Bayes Accuracy with the Test Set: 95.20

Informative features
Most Informative Features
          contains(dead) = True                0 : 1      =     12.3 : 1.0
          contains(busi) = True                0 : 1      =     12.3 : 1.0
    contains(blackberri) = True                0 : 1      =     12.3 : 1.0
         contains(might) = True                0 : 1      =     12.3 : 1.0
          contains(feel) = True                0 : 1      =     12.3 : 1.0
           contains(say) = True                0 : 1      =      9.8 : 1.0
             contains(1) = True                0 : 1      =      9.5 : 1.0
         contains(exist) = True                0 : 1      =      8.8 : 1.0
         contains(guess) = True                0 : 1      =      8.8 : 1.0
         contains(spend) = True                0 : 1      =      8.8 : 1.0
          contains(base) = True                0 : 1      =      8.8 : 1.0
        contains(sunday) = True                0 : 1    

## 22. Create a function called `tokenize_test` which takes a count vectorizer object as input and prints the accuracy for x (text) and y (labels)

In [536]:
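from sklearn import metrics
from sklearn.naive_bayes import MultinomialNB

def tokenize_test(vect):
    # uses the global x_train, x_test, y_train, y_test created by train_test_split
    # learn the vocabulary on the training text only, then reuse it for the test text
    x_train_dtm = vect.fit_transform(x_train[0])
    print('Features: ', x_train_dtm.shape[1])
    x_test_dtm = vect.transform(x_test[0])
    nb = MultinomialNB()
    nb.fit(x_train_dtm, y_train)
    y_pred_class = nb.predict(x_test_dtm)
    print('Accuracy: ', metrics.accuracy_score(y_test, y_pred_class))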
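
The function calls `fit_transform` on the training text but only `transform` on the test text. The difference is that fitting learns the vocabulary, while transforming counts tokens against a vocabulary that is already fixed, so words seen only in the test set are ignored. A standard-library sketch of that idea follows; `fit_vocabulary` and `transform` are simplified stand-ins, not scikit-learn's implementation.

```python
from collections import Counter

def fit_vocabulary(docs):
    """Map each distinct training token to a column index (sorted for determinism)."""
    tokens = sorted({tok for doc in docs for tok in doc.split()})
    return {tok: i for i, tok in enumerate(tokens)}

def transform(docs, vocabulary):
    """Count known tokens per document; tokens outside the vocabulary are ignored."""
    rows = []
    for doc in docs:
        counts = Counter(tok for tok in doc.split() if tok in vocabulary)
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows

train_docs = ["great ipad app", "dead iphon"]
test_docs = ["great great android"]

vocab = fit_vocabulary(train_docs)
test_matrix = transform(test_docs, vocab)

print(list(vocab))    # ['app', 'dead', 'great', 'ipad', 'iphon']
print(test_matrix)    # [[0, 0, 2, 0, 0]]
```

Here "android" never appeared in training, so it contributes no column and no count.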

In [537]:
def clean_tweets_for_vect(tweet):

    # remove digits (raw string avoids an invalid-escape warning)
    tweet = re.sub(r"\d+", "", tweet)
    
    # remove stock market tickers like $GE
    tweet = re.sub(r'\$\w*', '', tweet)
 
    # remove old style retweet text "RT"
    tweet = re.sub(r'^RT[\s]+', '', tweet)
 
    # remove hyperlinks
    tweet = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet)
    
    # remove hashtags
    # only removing the hash # sign from the word
    tweet = re.sub(r'#', '', tweet)
 
    # tokenize tweets
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
    tweet_tokens = tokenizer.tokenize(tweet)
 
    tweets_clean = []    
    for word in tweet_tokens:
        # digits were already removed by the regex at the top of the function
        if (word not in stopwords_english and # remove stopwords
              word not in emoticons and # remove emoticons
                word not in string.punctuation): # remove punctuation
            #tweets_clean.append(word)
            stem_word = stemmer.stem(word) # stemming word
            tweets_clean.append(stem_word)
            
    tweets_clean = ', '.join(tweets_clean)
    return str(tweets_clean)

In [538]:
custom_tweet = "RT @Twitter @chapagain Hello There! Have a great day. :) #good #morning http://chapagain.com.np"
 
# print cleaned tweet
print (clean_tweets_for_vect(custom_tweet))

hello, great, day, good, morn


In [539]:
X = pd.DataFrame(clean_tweets_for_vect(str(text)) for text in tweets_data[valid_tweets].tweet_text)
Y = pd.DataFrame(tweets_data_valid.is_there_an_emotion_directed_at_a_brand_or_product)
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=42)

In [540]:
tokenize_test(CountVectorizer())

Features:  3870
Accuracy:  0.8676056338028169


  y = column_or_1d(y, warn=True)


### Create a count vectorizer with `ngram_range = (1,2)` and pass it to the `tokenize_test` function to print the accuracy score

In [541]:
tokenize_test(CountVectorizer(ngram_range = (1,2)))

Features:  18396
Accuracy:  0.8760563380281691


  y = column_or_1d(y, warn=True)
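
The jump from 3870 to 18396 features comes from adding every pair of adjacent tokens as an extra feature. A small sketch of how unigrams and bigrams are enumerated for one token list is shown below; `ngrams` is an illustrative helper, not the scikit-learn internals.

```python
def ngrams(tokens, n_min=1, n_max=2):
    """Return all contiguous n-grams with n_min <= n <= n_max, joined by spaces."""
    out = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out

features = ngrams(["great", "ipad", "app"])
print(features)
# ['great', 'ipad', 'app', 'great ipad', 'ipad app']
```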


### Create a count vectorizer with `stop_words = 'english'` and pass it to the `tokenize_test` function to print the accuracy score

In [542]:
tokenize_test(CountVectorizer(stop_words = 'english'))

Features:  3740
Accuracy:  0.8685446009389671


  y = column_or_1d(y, warn=True)


### Create a count vectorizer with `stop_words = 'english'` and `max_features = 300` and pass it to the `tokenize_test` function to print the accuracy score

In [543]:
tokenize_test(CountVectorizer(stop_words = 'english', max_features = 300))

Features:  300
Accuracy:  0.8262910798122066


  y = column_or_1d(y, warn=True)


### Create a count vectorizer with `ngram_range = (1,2)` and `max_features = 15000` and pass it to the `tokenize_test` function to print the accuracy score

In [545]:
tokenize_test(CountVectorizer(ngram_range = (1,2), max_features = 15000))

Features:  15000
Accuracy:  0.8732394366197183


  y = column_or_1d(y, warn=True)


### Create a count vectorizer with `ngram_range = (1,2)` that includes only terms appearing in at least 2 documents (`min_df = 2`) and pass it to the `tokenize_test` function to print the accuracy score

In [546]:
tokenize_test(CountVectorizer(ngram_range = (1,2), min_df = 2))

Features:  5249
Accuracy:  0.8469483568075117


  y = column_or_1d(y, warn=True)
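
Note that `min_df` counts the number of documents a term appears in, not its total number of occurrences. The standard-library sketch below illustrates that distinction on three toy documents (the data is made up): "iphon" occurs twice but only in one document, so a threshold of 2 drops it.

```python
from collections import Counter

docs = [["great", "ipad"], ["great", "app"], ["dead", "iphon", "iphon"]]

# Document frequency: count each term at most once per document
doc_freq = Counter(tok for doc in docs for tok in set(doc))

kept = sorted(term for term, df in doc_freq.items() if df >= 2)
print(kept)   # ['great']
```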
