# Transfer Learning MNIST

* Train a simple convnet on the first five MNIST digits [0..4].
* Freeze convolutional layers and fine-tune dense layers for the classification of digits [5..9].

## 1. Import necessary libraries for the model

In [107]:
import keras
from keras.datasets import mnist
from keras.utils import np_utils

## 2. Import MNIST data and create 2 datasets, one with digits 0 to 4 and the other with digits 5 to 9

In [136]:
import numpy as np
import pandas as pd

In [101]:
(x_train, y_train), (x_test, y_test) = mnist.load_data()

In [87]:
def break_data(x, y):
    # Split samples into digits 0-4 (x_4, y_4) and digits 5-9 (x_9, y_9)
    x_4, x_9, y_4, y_9 = [], [], [], []
    for i in range(len(x)):
        if y[i] < 5:
            x_4.append(x[i])
            y_4.append(y[i])
        else:
            x_9.append(x[i])
            y_9.append(y[i])
    return (np.asarray(x_4), np.asarray(x_9), np.asarray(y_4), np.asarray(y_9))

In [102]:
(x_train_4, x_train_9, y_train_4, y_train_9) = break_data(x_train, y_train)

In [103]:
(x_test_4, x_test_9, y_test_4, y_test_9) = break_data(x_test, y_test)
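The same split can also be written without an explicit loop. The following is a minimal sketch using a NumPy boolean mask; the helper name `split_by_label` and the toy arrays are assumptions for illustration and are not part of the original notebook. It returns the arrays in the same order as `break_data`.

```python
import numpy as np

def split_by_label(x, y, threshold=5):
    """Split (x, y) into samples with label < threshold and the rest."""
    x = np.asarray(x)
    y = np.asarray(y)
    low = y < threshold  # boolean mask, True for digits 0..4
    return x[low], x[~low], y[low], y[~low]

# Tiny stand-in for MNIST: six 2x2 "images" with arbitrary digit labels
x_demo = np.arange(24).reshape(6, 2, 2)
y_demo = np.array([0, 7, 4, 5, 9, 2])

x_lo, x_hi, y_lo, y_hi = split_by_label(x_demo, y_demo)
print(y_lo, y_hi)
```

Masking keeps the original sample order within each half, so the images and labels stay aligned.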

## 3. Print the sizes of x_train, y_train, x_test and y_test for both datasets

In [94]:
print('X Train Data 0 to 4: X:',len(x_train_4))
print('Y Train Data 0 to 4: X:',len(y_train_4))
print('X Test Data 0 to 4: X:',len(x_test_4))
print('Y Test Data 0 to 4: X:',len(y_test_4))

X Train Data 0 to 4: X: 30596
Y Train Data 0 to 4: X: 30596
X Test Data 0 to 4: X: 5139
Y Test Data 0 to 4: X: 5139


In [95]:
print('X Train Data 5 to 9: X:',len(x_train_9))
print('Y Train Data 5 to 9: X:',len(y_train_9))
print('X Test Data 5 to 9: X:',len(x_test_9))
print('Y Test Data 5 to 9: X:',len(y_test_9))

X Train Data 5 to 9: X: 29404
Y Train Data 5 to 9: X: 29404
X Test Data 5 to 9: X: 4861
Y Test Data 5 to 9: X: 4861


## 4. Take only the dataset (x_train, y_train, x_test, y_test) for digits 0 to 4 in MNIST
## Reshape x_train and x_test to a 4-dimensional array (channel = 1) to pass it into a Conv2D layer

In [104]:
x_train_4 = x_train_4.reshape(x_train_4.shape[0], 28, 28, 1).astype('float32')
x_test_4 = x_test_4.reshape(x_test_4.shape[0], 28, 28, 1).astype('float32')

## 5. Normalize x_train and x_test by dividing it by 255

In [105]:
x_train_4 /= 255
x_test_4 /= 255

## 6. Use One-hot encoding to divide y_train and y_test into required no of output classes

In [108]:
y_train_4 = np_utils.to_categorical(y_train_4, 10)
y_test_4 = np_utils.to_categorical(y_test_4, 10)
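To make the encoding concrete, here is a small NumPy sketch of what one-hot encoding with 10 classes produces. The helper `one_hot` is a hypothetical stand-in for illustration, not the Keras implementation.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return one row per label with a single 1.0 at the label's index."""
    labels = np.asarray(labels, dtype=int)
    return np.eye(num_classes, dtype='float32')[labels]

encoded = one_hot([0, 3, 4], 10)
print(encoded.shape)
print(encoded[1])
```

With 10 output classes, columns 5 to 9 stay all zero for the 0 to 4 subset. Keeping the full width means the same output layer can later be reused for digits 5 to 9.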

## 7. Build a sequential model with 2 convolutional layers (32 kernels of size (3,3) each), followed by a max pooling layer of size (2,2) and a dropout layer, to be trained to classify digits 0-4

In [109]:
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D
from keras.layers import Dense, Activation, Dropout, Flatten, Reshape

In [111]:
# Define model
model = Sequential()

# 1st Conv Layer
model.add(Conv2D(32, (3, 3), input_shape=(28, 28, 1)))
model.add(Activation('relu'))

# 2nd Conv Layer
model.add(Conv2D(32, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
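It can help to trace the tensor shape through this block by hand. The sketch below assumes the Keras defaults used above: unpadded ('valid') convolutions with stride 1, and a pooling stride equal to the pool size. The helper `conv_output` is hypothetical.

```python
def conv_output(size, kernel, stride=1):
    """Spatial output size of an unpadded ('valid') convolution or pooling."""
    return (size - kernel) // stride + 1

side = 28
side = conv_output(side, 3)            # first Conv2D:  28 -> 26
side = conv_output(side, 3)            # second Conv2D: 26 -> 24
side = conv_output(side, 2, stride=2)  # MaxPooling2D:  24 -> 12

# 32 feature maps of side x side values are flattened before the dense layers
flat_features = side * side * 32
print(side, flat_features)
```

Dropout does not change the shape, so the flattened vector fed to the first dense layer has 12 * 12 * 32 values.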

## 8. Then flatten the data and add 2 Dense layers: one with 128 neurons and 'relu' activation, and one with neurons = output classes and 'softmax' activation. Add a dropout layer in between if necessary

In [112]:
# Fully Connected Layer
model.add(Flatten())
model.add(Dense(128))
model.add(Activation('relu'))

#Batch Normalisation
model.add(keras.layers.BatchNormalization())

# Prediction Layer
model.add(Dense(10, kernel_initializer='he_normal', use_bias=True))
model.add(Activation('softmax'))

# Loss and Optimizer
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    
# Store Training Results
early_stopping = keras.callbacks.EarlyStopping(monitor='val_acc', patience=5, verbose=1, mode='auto')
callback_list = [early_stopping]

model.fit(x_train_4, y_train_4, batch_size=100, epochs=10,
          validation_data=(x_test_4, y_test_4), callbacks=callback_list)



Train on 30596 samples, validate on 5139 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x221d3f67780>

## 9. Print the test loss and accuracy

In [115]:
loss_and_metrics = model.evaluate(x_test_4, y_test_4)
loss_and_metrics



[0.007768147450103117, 0.9978595057404164]

## 10. Make only the dense layers to be trainable and convolutional layers to be non-trainable

In [116]:
model.layers

[<keras.layers.convolutional.Conv2D at 0x221d4f3cac8>,
 <keras.layers.core.Activation at 0x221d3800b38>,
 <keras.layers.convolutional.Conv2D at 0x221d3b10860>,
 <keras.layers.core.Activation at 0x221d43fdb70>,
 <keras.layers.pooling.MaxPooling2D at 0x221d3dbf828>,
 <keras.layers.core.Dropout at 0x221dc4b4908>,
 <keras.layers.core.Flatten at 0x221d3dd0908>,
 <keras.layers.core.Dense at 0x221d3dd09e8>,
 <keras.layers.core.Activation at 0x221d3dd0d30>,
 <keras.layers.normalization.BatchNormalization at 0x221dc5dc588>,
 <keras.layers.core.Dense at 0x221dc5dcb70>,
 <keras.layers.core.Activation at 0x221dc5dc7b8>]

In [124]:
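# The first 6 layers form the convolutional block (Conv2D, Activation, Conv2D,
# Activation, MaxPooling2D, Dropout): freeze them and keep the dense head trainable.
for index, layer in enumerate(model.layers):
    layer.trainable = index >= 6
# Recompile so that the changed trainable flags take effect
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])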
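As a rough check on what this step is meant to achieve, the parameter counts can be worked out by hand. The sketch below assumes the convolutional block is frozen and the dense head is trainable, as the heading of this step states, and that the flattened input to the first dense layer has 12 * 12 * 32 = 4608 values. The helper names are hypothetical.

```python
def conv_params(kernel, in_channels, out_channels):
    """Weights plus biases of a square-kernel Conv2D layer."""
    return kernel * kernel * in_channels * out_channels + out_channels

def dense_params(inputs, units):
    """Weights plus biases of a Dense layer."""
    return inputs * units + units

# Frozen: the two 3x3 convolutions with 32 filters each
frozen = conv_params(3, 1, 32) + conv_params(3, 32, 32)

# Trainable: Dense(128), BatchNormalization scale and shift, Dense(10)
bn_trainable = 2 * 128  # gamma and beta; the moving statistics are not trained
trainable = dense_params(12 * 12 * 32, 128) + bn_trainable + dense_params(128, 10)

print(frozen, trainable)
```

Almost all of the weights sit in the first dense layer, so freezing the convolutions mainly fixes the feature extractor rather than reducing the amount of training.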

## 11. Use the model trained on digits 0 to 4 and train it on the dataset with digits 5 to 9 (transfer learning, keeping only the dense layers trainable)

In [122]:
x_train_9 = x_train_9.reshape(x_train_9.shape[0], 28, 28, 1).astype('float32')
x_test_9 = x_test_9.reshape(x_test_9.shape[0], 28, 28, 1).astype('float32')
x_train_9 /= 255
x_test_9 /= 255
y_train_9 = np_utils.to_categorical(y_train_9, 10)
y_test_9 = np_utils.to_categorical(y_test_9, 10)

In [125]:
model.fit(x_train_9, y_train_9, batch_size=100, epochs=10,
          validation_data=(x_test_9, y_test_9), callbacks=callback_list)

  


Train on 29404 samples, validate on 4861 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 00007: early stopping


<keras.callbacks.History at 0x221d6d50da0>

## 12. Print the accuracy for classification of digits 5 to 9

In [126]:
loss_and_metrics = model.evaluate(x_test_9, y_test_9)
loss_and_metrics



[0.07296136675398307, 0.98724542275252]

## Sentiment analysis

The objective of the second problem is to perform sentiment analysis on tweets that users posted about various mobile devices.
Based on the text of a tweet, we classify whether the user's sentiment toward a particular mobile device is positive or negative.

### 13. Read the dataset (tweets.csv) and drop the NA's while reading the dataset

In [1]:
import pandas as pd

In [5]:
data = pd.read_csv('tweets.csv', encoding='ISO-8859-1')
data.head()

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion


### 14. Preprocess the text and add the preprocessed text in a column with name `text` in the dataframe.

In [26]:
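def preprocess(text):
    # Keep a tweet only if it is a pure-ASCII string.
    # Tweets with any non-ASCII character, and missing values (NaN), become "".
    try:
        etext = text.encode('ascii')
        return etext.decode('ascii')
    except (UnicodeEncodeError, AttributeError):
        return ""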

In [27]:
data['text'] = [preprocess(text) for text in data.tweet_text]
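The effect of this preprocessing is easy to miss: a single non-ASCII character blanks the whole tweet. The sketch below reproduces that behaviour on toy strings and contrasts it with a hypothetical alternative, `strip_non_ascii`, which drops only the offending characters. The alternative is an illustration, not what the notebook uses.

```python
def preprocess(text):
    """Mirror of the notebook's helper: all-or-nothing ASCII filter."""
    try:
        return text.encode('ascii').decode('ascii')
    except (UnicodeEncodeError, AttributeError):
        return ""

def strip_non_ascii(text):
    """Hypothetical alternative: keep the tweet, drop non-ASCII characters."""
    if not isinstance(text, str):
        return ""
    return text.encode('ascii', errors='ignore').decode('ascii')

print(repr(preprocess("great phone")))
print(repr(preprocess("caf\u00e9 wifi")))       # one accented letter blanks the tweet
print(repr(preprocess(float('nan'))))           # missing values also become ""
print(repr(strip_non_ascii("caf\u00e9 wifi")))
```

Blanked tweets still keep their row and label, so they reach the vectorizer as empty documents.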

In [127]:
data.head(5)

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product,text
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion,.@wesley83 I have a 3G iPhone. After 3 hrs twe...
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion,@jessedee Know about @fludapp ? Awesome iPad/i...
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion,@swonderlin Can not wait for #iPad 2 also. The...
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion,@sxsw I hope this year's festival isn't as cra...
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion,@sxtxstate great stuff on Fri #SXSW: Marissa M...


### 15. Consider only rows having Positive emotion and Negative emotion and remove other rows from the dataframe.

In [30]:
data.is_there_an_emotion_directed_at_a_brand_or_product.unique()

array(['Negative emotion', 'Positive emotion',
       'No emotion toward brand or product', "I can't tell"], dtype=object)

In [34]:
data.groupby(by='is_there_an_emotion_directed_at_a_brand_or_product').agg({'tweet_text':'count'})

Unnamed: 0_level_0,tweet_text
is_there_an_emotion_directed_at_a_brand_or_product,Unnamed: 1_level_1
I can't tell,156
Negative emotion,570
No emotion toward brand or product,5388
Positive emotion,2978


In [128]:
data = data[data.is_there_an_emotion_directed_at_a_brand_or_product.isin(['Positive emotion','Negative emotion'])]
print('Tweets with either positive/negative emotions:',data.shape[0])

Tweets with either positive/negative emotions: 3548


### 16. Represent text as numerical data using `CountVectorizer` and get the document term frequency matrix

#### Use `vect` as the variable name for initialising CountVectorizer.

In [131]:
tweet_data = np.asarray(data.text)

from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

In [134]:
#Term Frequency
tf = pd.DataFrame(vect.fit_transform(tweet_data).toarray(), columns=vect.get_feature_names())
tf.head(3)

Unnamed: 0,000,02,03,0310apple,08,10,100,100s,100tc,101,...,zimride,zing,zip,zite,zms,zombies,zomg,zone,zoom,zzzs
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [138]:
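# Document Frequency: number of tweets in which each term appears at least once
vect = CountVectorizer(binary=True)
df = vect.fit_transform(tweet_data).toarray().sum(axis=0)
# reshape(1, -1) infers the vocabulary size instead of hard-coding it
pd.DataFrame(df.reshape(1, -1), columns=vect.get_feature_names())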

Unnamed: 0,000,02,03,0310apple,08,10,100,100s,100tc,101,...,zimride,zing,zip,zite,zms,zombies,zomg,zone,zoom,zzzs
0,7,1,2,1,1,17,5,1,1,4,...,1,1,1,1,2,2,5,1,2,1


In [141]:
# Ratio of term frequency to document frequency (a simplified variant; standard TF-IDF uses a log-scaled inverse document frequency)
tf_df = tf/df
tf_df.head()

Unnamed: 0,000,02,03,0310apple,08,10,100,100s,100tc,101,...,zimride,zing,zip,zite,zms,zombies,zomg,zone,zoom,zzzs
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
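For comparison, a more conventional TF-IDF weighting uses a logarithmic inverse document frequency. The sketch below applies the smoothed form idf = ln((1 + n) / (1 + df)) + 1, which is scikit-learn's documented default for `TfidfTransformer`, to a toy count matrix. The toy data is an assumption, and the L2 row normalisation that scikit-learn also applies by default is left out.

```python
import numpy as np

# Toy document-term counts: 3 documents (rows) x 3 terms (columns)
counts = np.array([[2, 0, 1],
                   [1, 1, 0],
                   [1, 0, 0]], dtype=float)

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # number of documents containing each term

# Smoothed inverse document frequency
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1
tfidf = counts * idf

print(doc_freq)
print(idf)
```

A term that occurs in every document gets an IDF of exactly 1, while rarer terms are weighted up, so the scaling is much gentler than dividing by the raw document frequency.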


### 17. Find number of different words in vocabulary

In [144]:
# The number of distinct words is the length of the fitted vocabulary
print(len(vect.get_feature_names()))



#### Tip: To see all available functions for an Object use dir

### 18. Find out how many Positive and Negative emotions are there.

Hint: Use value_counts on that column

In [146]:
data.is_there_an_emotion_directed_at_a_brand_or_product.value_counts()

Positive emotion    2978
Negative emotion     570
Name: is_there_an_emotion_directed_at_a_brand_or_product, dtype: int64

### 19. Change the labels for Positive and Negative emotions as 1 and 0 respectively and store in a different column in the same dataframe named 'Label'

Hint: use map on that column and give labels

In [148]:
data['Label'] = data.is_there_an_emotion_directed_at_a_brand_or_product.map({'Positive emotion':1,'Negative emotion':0})

### 20. Define the feature set (independent variable or X) to be the `text` column and `Label` as the target (or dependent variable), and divide into train and test datasets

In [149]:
X = data['text']
Y = data['Label']

In [151]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, Y, random_state=1)
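No `test_size` is passed here, so scikit-learn falls back to its default of 0.25. A quick arithmetic sketch of the resulting split sizes for the 3548 filtered tweets:

```python
import math

n_samples = 3548   # tweets with a positive or negative label
test_size = 0.25   # scikit-learn's default when no size is given

n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)
```

With `random_state=1` the split is reproducible, but it is not stratified, so the share of negative tweets can differ slightly between the two parts.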

## 21. **Predicting the sentiment:**


### Use Naive Bayes and Logistic Regression to predict the sentiment of the given text and report their accuracy scores

In [156]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

In [154]:
vect = CountVectorizer()
x_train_dtm = vect.fit_transform(x_train)
x_test_dtm = vect.transform(x_test)
x_train_dtm.shape

(2661, 5046)

In [157]:
nb = MultinomialNB()
nb.fit(x_train_dtm, y_train)
y_pred_class = nb.predict(x_test_dtm)

# calculate accuracy
print(metrics.accuracy_score(y_test, y_pred_class))
print(metrics.confusion_matrix(y_test, y_pred_class))

0.8647125140924464
[[ 37 109]
 [ 11 730]]


In [158]:
lr = LogisticRegression()
lr.fit(x_train_dtm, y_train)
y_pred_class = lr.predict(x_test_dtm)

# calculate accuracy
print(metrics.accuracy_score(y_test, y_pred_class))
print(metrics.confusion_matrix(y_test, y_pred_class))

0.8523111612175873
[[ 45 101]
 [ 30 711]]
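Because the classes are imbalanced (2978 positive vs 570 negative tweets), accuracy alone can hide weak performance on the minority class. The sketch below reads the two confusion matrices printed above, assuming scikit-learn's convention of rows as true labels and columns as predicted labels, in label order 0 (negative) then 1 (positive). The helper `summarize` is hypothetical.

```python
def summarize(cm):
    """Derive accuracy, per-class recall and a majority-class baseline."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        'accuracy': (tn + tp) / total,
        'negative_recall': tn / (tn + fp),
        'positive_recall': tp / (fn + tp),
        # Accuracy of always predicting the more frequent class
        'majority_baseline': max(tn + fp, fn + tp) / total,
    }

nb_cm = [[37, 109], [11, 730]]   # Naive Bayes
lr_cm = [[45, 101], [30, 711]]   # Logistic Regression

for name, cm in [('Naive Bayes', nb_cm), ('Logistic Regression', lr_cm)]:
    stats = summarize(cm)
    print(name, {key: round(value, 4) for key, value in stats.items()})
```

Both models identify fewer than a third of the negative tweets, and always predicting "positive" would already score about 0.835 on this test set.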




## 22. Create a function called `tokenize_test` which takes a count vectorizer object as input and prints the number of features and the accuracy for x (text) and y (labels)

In [159]:
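def tokenize_test(vect):
    # Uses the x_train, x_test, y_train, y_test split defined above
    x_train_dtm = vect.fit_transform(x_train)   # learn the vocabulary on training text only
    print('Features: ', x_train_dtm.shape[1])
    x_test_dtm = vect.transform(x_test)         # reuse that vocabulary for the test text
    nb = MultinomialNB()
    nb.fit(x_train_dtm, y_train)
    y_pred_class = nb.predict(x_test_dtm)
    print('Accuracy: ', metrics.accuracy_score(y_test, y_pred_class))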

### Create a count vectorizer with `ngram_range=(1, 2)` and pass it to the `tokenize_test` function to print the accuracy score

In [160]:
tokenize_test(CountVectorizer(ngram_range=(1, 2)))

Features:  25736
Accuracy:  0.874859075535513
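The jump in feature count comes from every pair of adjacent words becoming a feature of its own. The pure-Python sketch below imitates that behaviour on one sentence. It assumes the default `CountVectorizer` tokenisation (lowercasing, word tokens of at least two characters), and the helper `word_ngrams` is hypothetical, not a scikit-learn function.

```python
import re

def word_ngrams(text, ngram_range=(1, 2)):
    """List the word n-grams of a text for every n in ngram_range."""
    # Lowercase and keep tokens of two or more word characters
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    shortest, longest = ngram_range
    grams = []
    for n in range(shortest, longest + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

print(word_ngrams("I love my new iPad"))
```

The single-letter word "I" is dropped by the tokeniser, and the remaining four words yield four unigrams plus three bigrams.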


### Create a count vectorizer with `stop_words='english'` and pass it to the `tokenize_test` function to print the accuracy score

In [161]:
tokenize_test(CountVectorizer(stop_words='english'))

Features:  4806
Accuracy:  0.8624577226606539


### Create a count vectorizer with `stop_words='english'` and `max_features=300` and pass it to the `tokenize_test` function to print the accuracy score

In [162]:
tokenize_test(CountVectorizer(stop_words='english',max_features=300))

Features:  300
Accuracy:  0.8094701240135288


### Create a count vectorizer with `ngram_range=(1, 2)` and `max_features=300` and pass it to the `tokenize_test` function to print the accuracy score

In [163]:
tokenize_test(CountVectorizer(ngram_range=(1, 2),max_features=300))

Features:  300
Accuracy:  0.7643742953776775


### Create a count vectorizer with `ngram_range=(1, 2)` that includes only terms appearing in at least 2 documents (`min_df=2`) and pass it to the `tokenize_test` function to print the accuracy score

In [164]:
tokenize_test(CountVectorizer(ngram_range=(1, 2),min_df=2))

Features:  8297
Accuracy:  0.8714768883878241
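An integer `min_df` counts documents, not total occurrences: a term repeated several times in a single tweet still has a document frequency of 1. A minimal pure-Python sketch of that filter on toy, already tokenised documents (the data is an assumption for illustration):

```python
from collections import Counter

docs = [
    ["ipad", "love"],
    ["ipad", "battery"],
    ["iphone", "love", "love"],
    ["charger", "charger"],
]

# Count each term once per document it appears in
doc_freq = Counter(term for doc in docs for term in set(doc))

min_df = 2
kept = sorted(term for term, freq in doc_freq.items() if freq >= min_df)
print(kept)
```

Here "charger" occurs twice but only in one document, so it is dropped along with the other rare terms.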
