<a href="https://colab.research.google.com/github/Akshay-Kumar-Arya/Identify_the_sentiments/blob/master/nnlm_vectors_from_tweets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Identify the Sentiments

In [None]:
# install modules
!pip install -q tensorflow
!pip install -q tensorflow_hub

In [1]:
# import Modules
import pandas as pd
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
import re
import spacy
import pickle
#import math

# Widen the column display so that full tweets are visible
pd.set_option('display.max_colwidth', 200)

print("TF version: ", tf.__version__)
print("Hub version: ", hub.__version__)

TF version:  2.2.0
Hub version:  0.8.0


## Dataset Preprocessing

In [2]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [2]:
# Data path
training_data_path = "/content/gdrive/My Drive/Identify_the_sentiments/train.csv"
test_data_path =  "/content/gdrive/My Drive/Identify_the_sentiments/test.csv"

save_path = "/content/gdrive/My Drive/Identify_the_sentiments/"

In [3]:
# reading data from csv
train_data = pd.read_csv(training_data_path)
test_data = pd.read_csv(test_data_path)

In [4]:
# quick overview of the data
print(f"Number of training examples: {train_data.shape[0]}", '\n')
print(f"Number of test examples: {test_data.shape[0]}", '\n')

print("The fraction of positive and negative comments:")
print(train_data['label'].value_counts(normalize = True), '\n')

print("Training Dataframe:")
print(train_data.head())

Number of training examples: 7920 

Number of test examples: 1953 

The fraction of positive and negative comments:
0    0.744192
1    0.255808
Name: label, dtype: float64 

Training Dataframe:
   id  ...                                                                                                                                tweet
0   1  ...     #fingerprint #Pregnancy Test https://goo.gl/h1MfQV #android #apps #beautiful #cute #health #igers #iphoneonly #iphonesia #iphone
1   2  ...  Finally a transparant silicon case ^^ Thanks to my uncle :) #yay #Sony #Xperia #S #sonyexperias… http://instagram.com/p/YGEt5JC6JM/
2   3  ...          We love this! Would you go? #talk #makememories #unplug #relax #iphone #smartphone #wifi #connect... http://fb.me/6N3LsUpCu
3   4  ...                     I'm wired I know I'm George I was made that way ;) #iphone #cute #daventry #home http://instagr.am/p/Li_5_ujS4k/
4   5  ...         What amazing service! Apple won't even talk to me about a question 

In [5]:
# removing URLs from data
train_data['clean_tweet'] = train_data['tweet'].apply(lambda x: re.sub(r'http\S+', '', x))
test_data['clean_tweet'] = test_data['tweet'].apply(lambda x: re.sub(r'http\S+', '', x))

In [6]:
# remove Twitter handles (raw string so that \w is not treated as a string escape)
train_data['clean_tweet'] = train_data['clean_tweet'].apply(lambda x: re.sub(r"@\w*", '', x))
test_data['clean_tweet'] = test_data['clean_tweet'].apply(lambda x: re.sub(r"@\w*", '', x))

In [7]:
# remove punctuation (build the lookup set once instead of once per character)
punctuation = set('.,\'!"#$%&()*+-/:;<=>?@[\\]^_`{|}~')

train_data['clean_tweet'] = train_data['clean_tweet'].apply(lambda x: "".join(ch for ch in x if ch not in punctuation))
test_data['clean_tweet'] = test_data['clean_tweet'].apply(lambda x: "".join(ch for ch in x if ch not in punctuation))

In [8]:
# convert to lower case

train_data['clean_tweet'] = train_data['clean_tweet'].str.lower()
test_data['clean_tweet'] = test_data['clean_tweet'].str.lower()

In [9]:
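# replace digits with spaces
# (regex=True is explicit because newer pandas versions treat the pattern as a literal string by default)
train_data['clean_tweet'] = train_data['clean_tweet'].str.replace("[0-9]", " ", regex=True)
test_data['clean_tweet'] = test_data['clean_tweet'].str.replace("[0-9]", " ", regex=True)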

In [10]:
# collapse runs of whitespace into single spaces

train_data['clean_tweet'] = train_data['clean_tweet'].apply(lambda x: ' '.join(x.split()))
test_data['clean_tweet'] = test_data['clean_tweet'].apply(lambda x: ' '.join(x.split()))
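
The cleaning cells above can be read as one pipeline applied to each tweet. The sketch below chains the same steps in a single function so the effect on one string is easy to inspect. The helper name `clean_tweet` and the sample tweet are hypothetical and are not part of the original notebook.

```python
import re

# Same character set the notebook strips in its punctuation step.
PUNCTUATION = set('.,\'!"#$%&()*+-/:;<=>?@[\\]^_`{|}~')

def clean_tweet(text):
    """Hypothetical helper: apply the notebook's cleaning steps to one tweet."""
    text = re.sub(r'http\S+', '', text)   # drop URLs
    text = re.sub(r'@\w*', '', text)      # drop Twitter handles
    text = ''.join(ch for ch in text if ch not in PUNCTUATION)
    text = text.lower()
    text = re.sub(r'[0-9]', ' ', text)    # digits become spaces
    return ' '.join(text.split())         # collapse repeated whitespace

sample = "Thanks @uncle for the iPhone 5 case :) #yay http://example.com/abc"
print(clean_tweet(sample))  # thanks for the iphone case yay
```

Note that hashtags survive as plain words (`#yay` becomes `yay`), because only the `#` character is removed.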

In [11]:
# lemmatize the tweets,
# i.e. convert each word to its base form
# ('en' is a spaCy 2.x shortcut; spaCy 3+ requires a full model name such as 'en_core_web_sm')
nlp = spacy.load('en', disable=['parser', 'ner'])

# function to lemmatize text
def lemmatization(texts):
    output = []
    for i in texts:
        s = [token.lemma_ for token in nlp(i)]
        output.append(' '.join(s))
    return output

train_data['clean_tweet'] = lemmatization(train_data['clean_tweet'])
test_data['clean_tweet'] = lemmatization(test_data['clean_tweet'])


#### Load the embedding layer
The URL for the module is https://tfhub.dev/google/tf2-preview/nnlm-en-dim128-with-normalization/1.

This embedding takes a batch of text tokens in a 1-D tensor of strings as input. It then embeds the separate tokens into a 128-dimensional space.

NB: this model can be used as a sentence embedding module. It preprocesses each input string by removing punctuation and splitting on spaces, then averages the resulting word embeddings to give a single embedding vector per sentence. It can also be used as a word embedding module by passing each word of the input sentence as a separate string.
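
To make the sentence-level pooling concrete, here is a minimal sketch of plain averaging over word vectors. It uses a hypothetical toy lookup table with 4-dimensional vectors instead of the real 128-dimensional module, and it only illustrates the averaging described above; the exact combiner used inside the hub module is not verified here.

```python
import numpy as np

# Toy 4-dimensional vectors standing in for the module's 128-dimensional ones.
toy_vectors = {
    "love": np.array([1.0, 0.0, 2.0, 0.0]),
    "this": np.array([0.0, 1.0, 0.0, 1.0]),
    "phone": np.array([2.0, 2.0, 1.0, 2.0]),
}

def sentence_vector(sentence):
    """Average the vectors of the space-separated tokens in `sentence`."""
    tokens = sentence.split()
    return np.mean([toy_vectors[t] for t in tokens], axis=0)

pooled = sentence_vector("love this phone")
print(pooled)        # [1. 1. 1. 1.]
print(pooled.shape)  # (4,)
```

Whatever the sentence length, the result is a single fixed-size vector, which is what lets a downstream classifier consume tweets of varying length.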



In [12]:
# nnlm-en-dim128 embedding model
embed = hub.load("https://tfhub.dev/google/tf2-preview/nnlm-en-dim128-with-normalization/1")

#### Pooled embedding (one vector per sentence)

In [13]:
def nnlm_vectors(x):
  # embed a batch of sentences and return the result as a NumPy array (one row per sentence)
  return embed(x).numpy()

In [14]:
# Build batches list
list_train = [train_data[i:i+100] for i in range(0,train_data.shape[0],100)]
list_test = [test_data[i:i+100] for i in range(0,test_data.shape[0],100)]

In [15]:
# extract embeddings
nnlm_train = [nnlm_vectors(x['clean_tweet']) for x in list_train]
nnlm_test = [nnlm_vectors(x['clean_tweet']) for x in list_test]

In [16]:
# concatenate the per-batch embeddings into a single array
nnlm_train_new = np.concatenate(nnlm_train, axis = 0)
nnlm_test_new = np.concatenate(nnlm_test, axis = 0)
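
The batch, embed, and concatenate pattern used in the cells above can be checked without downloading the model. The sketch below swaps the hub module for a hypothetical stand-in, `fake_embed`, which returns a zero vector per sentence; only the shapes are meaningful.

```python
import numpy as np

def fake_embed(batch, dim=128):
    """Stand-in for the hub module: one zero vector per input sentence."""
    return np.zeros((len(batch), dim), dtype=np.float32)

sentences = [f"tweet {i}" for i in range(250)]

# Slice into batches of 100; the last batch holds the 50 leftover items.
batches = [sentences[i:i + 100] for i in range(0, len(sentences), 100)]
vectors = [fake_embed(b) for b in batches]
stacked = np.concatenate(vectors, axis=0)

print([len(b) for b in batches])  # [100, 100, 50]
print(stacked.shape)              # (250, 128)
```

Concatenating along `axis=0` restores the original row order, so row `i` of the stacked array still corresponds to tweet `i`.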

In [17]:
# save the tweet embeddings
# (the with-statement closes each file even if pickling fails)
with open(save_path + "nnlm_train.pickle", mode='wb') as train_file:
    pickle.dump(nnlm_train_new, train_file)

with open(save_path + "nnlm_test.pickle", mode='wb') as test_file:
    pickle.dump(nnlm_test_new, test_file)
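
A later notebook would presumably read these files back with `pickle.load`. The sketch below shows the round trip on a small hypothetical array written to a temporary directory, so it does not depend on the Drive paths used above.

```python
import os
import pickle
import tempfile

import numpy as np

# Hypothetical stand-in for nnlm_train_new: 3 sentences, 128 dimensions.
vectors = np.arange(3 * 128, dtype=np.float32).reshape(3, 128)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "nnlm_train.pickle")
    with open(path, "wb") as f:
        pickle.dump(vectors, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored.shape)                     # (3, 128)
print(np.array_equal(vectors, restored))  # True
```

For plain NumPy arrays, `np.save` and `np.load` are a common alternative to pickle, though the notebook itself uses pickle.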