# Sentiment Analysis for drugs/medicines


Here are the steps for sentiment analysis
1.   Cleaning the text
2.   Creating word embeddings
3.   Training the network
4.   Final prediction



Installing the necessary packages like tensorflow-gpu,emoji etc.

In [1]:
!pip install tensorflow-gpu

Collecting tensorflow-gpu
[?25l  Downloading https://files.pythonhosted.org/packages/76/04/43153bfdfcf6c9a4c38ecdb971ca9a75b9a791bb69a764d652c359aca504/tensorflow_gpu-1.14.0-cp36-cp36m-manylinux1_x86_64.whl (377.0MB)
[K     |████████████████████████████████| 377.0MB 45kB/s 
Installing collected packages: tensorflow-gpu
Successfully installed tensorflow-gpu-1.14.0


In [1]:
!pip install emoji



Now mount the drive to load the training and testing data

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Importing the packges 

In [3]:
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import numpy as np # linear algebra
import emoji # handling emoticons in the text
import nltk # tokenizing
import string # string manipulation
import gensim # word embeddings
import tensorflow as tf
from autocorrect import spell
from nltk.stem import PorterStemmer 
from nltk.corpus import stopwords 
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, LSTM, CuDNNLSTM

Using TensorFlow backend.


visualisation of the data

In [4]:
train = pd.read_csv('/content/drive/My Drive/Innoplexus/train_F3WbcTw.csv')
train.head()

Unnamed: 0,unique_hash,text,drug,sentiment
0,2e180be4c9214c1f5ab51fd8cc32bc80c9f612e0,Autoimmune diseases tend to come in clusters. ...,gilenya,2
1,9eba8f80e7e20f3a2f48685530748fbfa95943e4,I can completely understand why you’d want to ...,gilenya,2
2,fe809672251f6bd0d986e00380f48d047c7e7b76,Interesting that it only targets S1P-1/5 recep...,fingolimod,2
3,bd22104dfa9ec80db4099523e03fae7a52735eb6,"Very interesting, grand merci. Now I wonder wh...",ocrevus,2
4,b227688381f9b25e5b65109dd00f7f895e838249,"Hi everybody, My latest MRI results for Brain ...",gilenya,1


In [5]:
print('Total number of samples',len(train.index))
print(train['sentiment'].value_counts())

Total number of samples 5279
2    3825
1     837
0     617
Name: sentiment, dtype: int64


**STEP 1: Cleaning the text**

This process can be furture divided into smaller steps as follows 

1.   Demojize the sentences
2.   Removing punctuations
3.   Conevrting sentences to lower case 
4.   Tokenizing the words
5.   Removing the stop words



In [6]:
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [0]:
def pre_process(list_sent): 
  # This function is used for text cleaning 
  ps = PorterStemmer()
  de_emoji = [emoji.demojize(sent) for sent in list_sent]
  rem_punc = [sent.translate(str.maketrans('','',string.punctuation)) for sent in de_emoji]
  rem_num = [sent.translate(str.maketrans('','','0123456789')) for sent in rem_punc]
  norm_corpus = [sent.lower() for sent in rem_num]
  tok_corpus = [nltk.word_tokenize(sent) for sent in norm_corpus]
  stop_words = set(stopwords.words('english')) 
  filtered_corpus = []
  for sent in tok_corpus:
    filtered_sent = []
    for word in sent:
      word = ps.stem(word)
      if not word in stop_words:
        filtered_sent.append(word)
    filtered_corpus.append(filtered_sent)
  return filtered_corpus

In [0]:
x = train['text'].values.tolist()
x_clean = pre_process(x)
del x

**STEP 2: Creating word embeddings**

Word Embedding is a representation of text where words that have the same meaning have a similar representation. In other words it represents words in a coordinate system where related words, based on a corpus of relationships, are placed closer together.

Here we use gensim package which creates word embeddings for us

In [0]:
VECTOR_SIZE = 100 # length of embedding vector
model = gensim.models.Word2Vec(x_clean,min_count=1,size = 100)

Below is an example showing word 'patients' is related to other words in the corpus 

In [10]:
model.most_similar('patient')

  """Entry point for launching an IPython kernel.
  if np.issubdtype(vec.dtype, np.int):


[('individu', 0.7232431769371033),
 ('particip', 0.7034831047058105),
 ('popul', 0.7011430263519287),
 ('among', 0.6941633224487305),
 ('women', 0.6825122833251953),
 ('initi', 0.6628513336181641),
 ('observ', 0.6550769805908203),
 ('monotherapi', 0.6412683725357056),
 ('enrol', 0.6347674131393433),
 ('previous', 0.6340286135673523)]

In [11]:
x_feature = [model[words] for words in x_clean]

del x_clean

  """Entry point for launching an IPython kernel.


convecting the unequal length sequence to equal length squences

In [0]:
MAX_LENGTH = 500

x_feature = pad_sequences(x_feature,maxlen=MAX_LENGTH)

In [0]:
y_labels = np.asarray(train['sentiment'].values.tolist())


spliting the data into testing and training data

In [0]:
X_train, X_test, Y_train, Y_test = train_test_split(x_feature,y_labels, test_size = 0.2, random_state = 42)

In [0]:
del x_feature
del y_labels

In [16]:
print(len(X_train))
print(len(X_test))
print(len(Y_train))
print(len(Y_test))
print(X_train.shape[1:])

4223
1056
4223
1056
(500, 100)


**STEP 3: Training the model**



In [19]:
lstm_out1 = 200

out_senti = 3
analysis_model = Sequential()
analysis_model.add(LSTM(lstm_out1,input_shape=(X_train.shape[1:]), dropout=0.2, recurrent_dropout=0.2))
analysis_model.add(Dense(100,activation='relu'))
analysis_model.add(Dense(3,activation='softmax'))
analysis_model.compile(loss = 'sparse_categorical_crossentropy', optimizer='adam',metrics = ['accuracy'])
analysis_model.fit(X_train,Y_train,epochs=3,validation_data=(X_test,Y_test))

Train on 4223 samples, validate on 1056 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<tensorflow.python.keras.callbacks.History at 0x7fab4fa3ca20>