**Mount Drive**

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


**Import Libraries**

In [3]:
import numpy as np
import pandas as pd
import collections
import pathlib
import re
import string
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras import losses
from tensorflow.keras import preprocessing
from tensorflow.keras import utils
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

import tensorflow_datasets as tfds

tfds.disable_progress_bar()

**Set File path for training file**

In [4]:
train_filepaths = ['/content/drive/MyDrive/NLP/Data.csv']

**record_defaults**

A list of Tensor objects with specific types. Acceptable types are float32, float64, int32, int64, string. One tensor per column of the input record, with either a scalar default value for that column or an empty vector if the column is required.


In [5]:
record_defaults=["Hello",0]
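To make the role of `record_defaults` more concrete, here is a minimal pure-Python sketch of the same idea. It is not TensorFlow's implementation: the helper `parse_with_defaults` and the sample rows are hypothetical, and the sketch only assumes that each default fixes the type of its column and that an empty field falls back to the default.

```python
import csv
import io

def parse_with_defaults(csv_text, record_defaults, header=True):
    """Parse CSV rows, casting each field to the type of its default.

    Empty fields fall back to the default value for that column.
    """
    reader = csv.reader(io.StringIO(csv_text))
    if header:
        next(reader, None)  # skip the header row
    rows = []
    for raw in reader:
        row = []
        for field, default in zip(raw, record_defaults):
            row.append(type(default)(field) if field != "" else default)
        rows.append(tuple(row))
    return rows

# Hypothetical two-column file: review text, integer label
sample = "review,label\nGreat pill,0\n,2\nNo change,\n"
parsed = parse_with_defaults(sample, ["Hello", 0])
print(parsed)
```

With `["Hello", 0]`, the first column is read as a string and the second as an integer, which matches the `dtype=string` and `dtype=int32` tensors printed further down.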

**tf.data.experimental.CsvDataset**

The tf.data.experimental.CsvDataset class provides a minimal CSV Dataset interface without the convenience features of the make_csv_dataset function: column header parsing, column type-inference, automatic shuffling, file interleaving.

In [6]:
testdata = tf.data.experimental.CsvDataset(train_filepaths, record_defaults=record_defaults, header=True)


In [7]:
for example, label in testdata.take(3):
  print('texts: ', example)
  print()
  print('labels: ', label)

texts:  tf.Tensor(b'I used to take another oral contraceptive, which had 21 pill cycle, and was very happy- very light periods, max 5 days, no other side effects. But it contained hormone gestodene, which is not available in US, so I switched to Lybrel, because the ingredients are similar. When my other pills ended, I started Lybrel immediately, on my first day of period, as the instructions said. And the period lasted for two weeks. When taking the second pack- same two weeks. And now, with third pack things got even worse- my third period lasted for two weeks and now it&#039;s the end of the third week- I still have daily brown discharge.\r\nThe positive side is that I didn&#039;t have any other side effects. The idea of being period free was so tempting... Alas.', shape=(), dtype=string)

labels:  tf.Tensor(0, shape=(), dtype=int32)
texts:  tf.Tensor(b'This is my first time using any form of birth control. I&#039;m glad I went with the patch, I have been on it for 8 months. At first

In [8]:
BUFFER_SIZE = 10000
BATCH_SIZE = 64

In [9]:
train_dataset = testdata.shuffle(BUFFER_SIZE).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
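The pipeline above shuffles through a fixed-size buffer and then groups rows into batches. The following pure-Python sketch illustrates those two steps under simplifying assumptions; `buffered_shuffle` and `batch` are hypothetical helpers, not the `tf.data` implementation, and prefetching is left out.

```python
import random

def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items in random order, sampling from a sliding buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        buffer.append(item)
        if len(buffer) == buffer_size:
            # Buffer is full: emit one random element to make room
            yield buffer.pop(rng.randrange(len(buffer)))
    while buffer:
        # Input exhausted: drain what is left, still in random order
        yield buffer.pop(rng.randrange(len(buffer)))

def batch(items, batch_size):
    """Group consecutive items into lists of at most batch_size."""
    current = []
    for item in items:
        current.append(item)
        if len(current) == batch_size:
            yield current
            current = []
    if current:
        yield current  # the final batch may be smaller

batches = list(batch(buffered_shuffle(range(10), buffer_size=4), batch_size=3))
print(batches)
```

A buffer of size 1 leaves the order unchanged, while a buffer at least as large as the dataset gives a full shuffle. Note that the last batch can hold fewer than `batch_size` items.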

In [10]:
for example, label in train_dataset.take(10):
  print('texts: ', example.numpy()[:3])
  print()
  print('labels: ', label.numpy()[:3])

texts:  [b'I have only been on this pill for week, and I feel as if I&#039;m not even on b.c. at all. I feel no change which is welcome after dealing with ortho tri cyclen lo for several years. With that pill if I missed a pill or started later than a week after the last set of pills, it made me feel like crap. This pill, I haven&#039;t felt anything since starting it. And that&#039;s awesome.'
 b'I was on and off Lexapro for many years, but I always felt tired. We tried a few combinations of anti-depressants but none of them gave me the energy I wanted--even with as high as 450 mg of Wellbutrin by itself. I&#039;ve been on 20 mg of Viibryd for the past couple of weeks as well as 300 mg of Wellbutrin.  I now have the energy and desire to go to the gym, breeze through work and housework, and visit family and friends. My sex drive is slightly improved. I&#039;ve had slight digestive issues but nothing major. I hope things continue to improve. I&#039;m impressed so far. I&#039;ll give an 

Next, you will standardize, tokenize, and vectorize the data using the `preprocessing.TextVectorization` layer.
* Standardization refers to preprocessing the text, typically to remove punctuation or HTML elements to simplify the dataset.

* Tokenization refers to splitting strings into tokens (for example, splitting a sentence into individual words by splitting on whitespace).

* Vectorization refers to converting tokens into numbers so they can be fed into a neural network.

All of these tasks can be accomplished with this layer. You can learn more about each of these in the [API doc](https://www.tensorflow.org/api_docs/python/tf/keras/layers/experimental/preprocessing/TextVectorization).

* The default standardization converts text to lowercase and removes punctuation.

* The default tokenizer splits on whitespace.

* The default vectorization mode is `int`. This outputs integer indices (one per token). This mode can be used to build models that take word order into account. You can also use other modes, like `binary`, to build bag-of-word models.
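The reviews in this dataset contain HTML entities such as `&#039;`. The sketch below approximates the default standardization (lowercase, strip punctuation) and whitespace tokenization in plain Python, and compares it with a variant that first unescapes HTML entities. The function names are hypothetical and the regular expression is an approximation of the layer's behaviour, not its actual source.

```python
import html
import re
import string

# Character class matching any ASCII punctuation character
_PUNCTUATION = re.compile(f"[{re.escape(string.punctuation)}]")

def default_like_standardize(text):
    """Lowercase and strip punctuation (approximation of the default)."""
    return _PUNCTUATION.sub("", text.lower())

def unescape_then_standardize(text):
    """Assumed extra step: decode HTML entities before standardizing."""
    return default_like_standardize(html.unescape(text))

def tokenize(text):
    """Split on whitespace."""
    return text.split()

raw = "I&#039;m glad I went with the patch."
default_tokens = tokenize(default_like_standardize(raw))
unescaped_tokens = tokenize(unescape_then_standardize(raw))
print(default_tokens)
print(unescaped_tokens)
```

Without unescaping, the entity leaves digits behind in the token (`i039m`). If that matters for your vocabulary, one option is to pass a custom callable through the layer's `standardize` argument.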




In [11]:
VOCAB_SIZE = 10000

binary_vectorize_layer = TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_mode='binary')

For `int` mode, in addition to the maximum vocabulary size, you need to set an explicit maximum sequence length, which causes the layer to pad or truncate every sequence to exactly `output_sequence_length` values.

In [12]:
MAX_SEQUENCE_LENGTH = 250

int_vectorize_layer = TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_mode='int',
    output_sequence_length=MAX_SEQUENCE_LENGTH)
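As a rough illustration of the two output modes, the sketch below encodes a sentence against a tiny hand-written vocabulary. It assumes index 0 is padding and index 1 is the out-of-vocabulary token, and it reuses one vocabulary for both modes for simplicity. The helpers `int_encode` and `multi_hot_encode` are hypothetical.

```python
def int_encode(text, vocab, sequence_length):
    """Replace each token by its index, then pad or truncate."""
    index = {token: i for i, token in enumerate(vocab)}
    ids = [index.get(token, 1) for token in text.lower().split()]
    ids = ids[:sequence_length]                      # truncate long inputs
    return ids + [0] * (sequence_length - len(ids))  # pad short inputs

def multi_hot_encode(text, vocab):
    """Mark which vocabulary entries occur at least once."""
    index = {token: i for i, token in enumerate(vocab)}
    vector = [0.0] * len(vocab)
    for token in text.lower().split():
        vector[index.get(token, 1)] = 1.0
    return vector

vocab = ["", "[UNK]", "i", "felt", "tired"]  # toy vocabulary
padded = int_encode("I felt very tired", vocab, sequence_length=6)
truncated = int_encode("I felt very tired", vocab, sequence_length=2)
bag = multi_hot_encode("I felt felt tired", vocab)
print(padded, truncated, bag)
```

The `int` output keeps word order and has a fixed length, while the multi-hot output has one slot per vocabulary entry and discards order and counts.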

Next, you will call `adapt` to fit the state of the preprocessing layer to the dataset. This will cause the layer to build an index that maps strings to integers.

Note: it's important to only use your training data when calling adapt (using the test set would leak information).

In [13]:
# Make a text-only dataset (without labels), then call adapt
train_text = train_dataset.map(lambda texts, labels: texts)
binary_vectorize_layer.adapt(train_text)
int_vectorize_layer.adapt(train_text)
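Conceptually, adapting builds a frequency-ordered vocabulary from the training text only. The sketch below is a simplified stand-in under that assumption; the function name `adapt_vocabulary` and the sample sentences are hypothetical, and two slots are reserved for padding and unknown tokens as in `int` mode.

```python
from collections import Counter

def adapt_vocabulary(train_texts, max_tokens):
    """Build a vocabulary of the most frequent training tokens."""
    counts = Counter(
        token for text in train_texts for token in text.lower().split()
    )
    # Reserve index 0 for padding and index 1 for unknown tokens
    kept = [token for token, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + kept

train_texts = [
    "i have been on this pill",
    "i have no side effects",
    "i feel great",
]
vocab = adapt_vocabulary(train_texts, max_tokens=6)
lookup = {token: i for i, token in enumerate(vocab)}
print(vocab)
print(lookup.get("victoza", 1))  # unseen word falls back to [UNK]
```

Any word that appears only in validation or test text maps to the unknown index, which is why adapting on the training split alone avoids leaking information.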

In [14]:
for example in train_text.take(1):
  print('texts: ', example)
 

texts:  tf.Tensor(
[b'I cannot give it a 10 because I have only taken the pills for two and a half months. I have no mood swings due to the pills my hormones are not all over the place which is surprising. I have not had any bacterial or yeast infection which I had that issue with other birth controls. My sex drive is great. I used the pills to manipulate my period And I have not had one. So I have to go to the doctor on Monday just to be sure I&#039;m not pregnant otherwise the pills are awesome'
 b'I have been on this for 6 months and only have one issue with it. I gained 10 pounds after being on it for 1-2 months, but have been able to get that 10 down to 5. But no matter how healthy I eat or how much I exercise, I cannot lose that extra 5. Other than that, I have had no issues. It has not affected my mood or sex drive, and I haven&#039;t had any headaches or anything like others have mentioned. It actually cleared up acne on my chin so I have a clear face now that the dermatologist

See the result of using these layers to preprocess data:

In [15]:
def binary_vectorize_text(text, label):
  text = tf.expand_dims(text, -1)
  return binary_vectorize_layer(text), label

In [16]:
def int_vectorize_text(text, label):
  text = tf.expand_dims(text, -1)
  return int_vectorize_layer(text), label

In [17]:
# Retrieve one batch (of up to BATCH_SIZE = 64 reviews and labels) from the dataset
text_batch, label_batch = next(iter(train_dataset))
first_question, first_label = text_batch[0], label_batch[0]
print("Text", first_question)
print("Label", first_label)

Text tf.Tensor(b'I was on celexa for about 3 months, started with 10 mg. and raised up to 20 when I felt no changes. After going up to 20, i felt constantly tired, drowsy and like a walking zombie. My anxiety and depression only got worse, it caused me to have thoughts of suicide and self harm and I just felt worse overall. My experience on celexa was terrible and I would not recomend this drug at all.', shape=(), dtype=string)
Label tf.Tensor(1, shape=(), dtype=int32)


In [18]:
print("'binary' vectorized question:", 
      binary_vectorize_text(first_question, first_label)[0])

'binary' vectorized question: tf.Tensor([[0. 1. 1. ... 0. 0. 0.]], shape=(1, 10000), dtype=float32)


In [19]:
print("'int' vectorized question:",
      int_vectorize_text(first_question, first_label)[0])

'int' vectorized question: tf.Tensor(
[[   2   11   13  542    9   26   71   28   53   22  195  251    3 2013
    74    7  326   65    2  112   25  477   31  137   74    7  326    2
   112  349  317 2225    3   45    8 1534 1256    5  127    3   88   46
    56  212    6  356   20    7   10  443   12 1142    3  656 2787    3
     2   51  112  212  286    5  139   13  542   11  230    3    2   64
    23 4894   15  291   35   37    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0

As you can see above, `binary` mode returns an array denoting which tokens exist at least once in the input, while `int` mode replaces each token by an integer, thus preserving their order. You can look up the token (string) that each integer corresponds to by calling `.get_vocabulary()` on the layer.

In [20]:
print("2 ---> ", int_vectorize_layer.get_vocabulary()[2])
print("10 ---> ", int_vectorize_layer.get_vocabulary()[10])
print("21 ---> ", int_vectorize_layer.get_vocabulary()[21])
print("Vocabulary size: {}".format(len(int_vectorize_layer.get_vocabulary())))

2 --->  i
10 --->  have
21 --->  been
Vocabulary size: 10000


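You are nearly ready to train your model. As a final preprocessing step, you will apply the `TextVectorization` layers you created earlier to the training and validation datasets. No separate test dataset is built in this notebook, so the corresponding test lines stay commented out.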

In [21]:
# train_dataset is already batched, so take/skip count batches of BATCH_SIZE rows, not single rows
raw_train_ds = train_dataset.take(25000) 
raw_val_ds = train_dataset.skip(25000)
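Because `take` and `skip` operate on batches here, a cut-off of 25000 only leaves validation data if the file holds more than 25000 batches. The pure-Python sketch below illustrates the effect with a hypothetical dataset of 1000 rows and shows one possible alternative, a split by a fraction of the batch count. The row count and the 20% share are assumptions for illustration only.

```python
from itertools import islice

def take(batches, n):
    """First n batches."""
    return list(islice(batches, n))

def skip(batches, n):
    """Everything after the first n batches."""
    return list(islice(batches, n, None))

num_rows, batch_size = 1000, 64  # hypothetical dataset size
batches = [
    list(range(start, min(start + batch_size, num_rows)))
    for start in range(0, num_rows, batch_size)
]

# Cut-off far beyond the number of batches: validation ends up empty
train_all, val_empty = take(batches, 25000), skip(batches, 25000)

# Alternative: hold out roughly 20% of the batches
num_val = max(1, len(batches) // 5)
train_part = take(batches, len(batches) - num_val)
val_part = skip(batches, len(batches) - num_val)
print(len(train_all), len(val_empty), len(train_part), len(val_part))
```

Also keep in mind that a shuffled dataset is reshuffled on every pass by default, so `take` and `skip` on it may not yield disjoint sets across epochs. Splitting before shuffling is one way to avoid that.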

In [22]:
binary_train_ds = raw_train_ds.map(binary_vectorize_text)
binary_val_ds = raw_val_ds.map(binary_vectorize_text)
#binary_test_ds = raw_test_ds.map(binary_vectorize_text)

int_train_ds = raw_train_ds.map(int_vectorize_text)
int_val_ds = raw_val_ds.map(int_vectorize_text)
#int_test_ds = raw_test_ds.map(int_vectorize_text)

### Configure the dataset for performance

These are two important methods you should use when loading data to make sure that I/O does not become blocking.

`.cache()` keeps data in memory after it's loaded off disk. This will ensure the dataset does not become a bottleneck while training your model. If your dataset is too large to fit into memory, you can also use this method to create a performant on-disk cache, which is more efficient to read than many small files.

`.prefetch()` overlaps data preprocessing and model execution while training. 

You can learn more about both methods, as well as how to cache data to disk in the [data performance guide](https://www.tensorflow.org/guide/data_performance).

In [23]:
AUTOTUNE = tf.data.AUTOTUNE

def configure_dataset(dataset):
  return dataset.cache().prefetch(buffer_size=AUTOTUNE)
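The caching idea can be sketched without TensorFlow: the first full pass reads from the slow source and stores the items, and later passes replay them from memory. `CachedIterable` and `expensive_source` are hypothetical names, and the sketch does not model prefetching.

```python
class CachedIterable:
    """Replay items from memory after the first complete pass."""

    def __init__(self, make_source):
        self._make_source = make_source
        self._cache = None

    def __iter__(self):
        if self._cache is not None:
            yield from self._cache
            return
        seen = []
        for item in self._make_source():
            seen.append(item)
            yield item
        # Only a fully consumed pass populates the cache
        self._cache = seen

load_calls = []

def expensive_source():
    load_calls.append(1)  # count how often the slow source is opened
    for i in range(3):
        yield i * i

dataset = CachedIterable(expensive_source)
first_epoch = list(dataset)
second_epoch = list(dataset)
print(first_epoch, second_epoch, len(load_calls))
```

The second epoch returns the same items without touching the source again, which is the benefit you get from caching vectorized text between epochs.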

In [24]:
binary_train_ds = configure_dataset(binary_train_ds)
binary_val_ds = configure_dataset(binary_val_ds)
#binary_test_ds = configure_dataset(binary_test_ds)

int_train_ds = configure_dataset(int_train_ds)
int_val_ds = configure_dataset(int_val_ds)
#int_test_ds = configure_dataset(int_test_ds)

### Train the binary model
It's time to create our neural network. For the `binary` vectorized data, train a simple bag-of-words linear model:

In [25]:
binary_model = tf.keras.Sequential([layers.Dense(3)])
binary_model.compile(
    loss=losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer='adam',
    metrics=['accuracy'])
history = binary_model.fit(
    binary_train_ds, validation_data=binary_val_ds, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
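The model is compiled with `SparseCategoricalCrossentropy(from_logits=True)`, so the `Dense(3)` layer outputs raw scores and the loss applies the softmax internally. As a worked example for a single review, the sketch below computes that loss by hand. The function name and the logit values are made up for illustration.

```python
import math

def sparse_categorical_crossentropy_from_logits(logits, label):
    """Negative log-softmax probability of the true class."""
    largest = max(logits)  # subtract the max for numerical stability
    log_sum_exp = largest + math.log(
        sum(math.exp(z - largest) for z in logits)
    )
    return log_sum_exp - logits[label]

logits = [2.0, 1.0, 0.0]  # hypothetical scores for the three classes
loss_if_class_0 = sparse_categorical_crossentropy_from_logits(logits, 0)
loss_if_class_2 = sparse_categorical_crossentropy_from_logits(logits, 2)
loss_uniform = sparse_categorical_crossentropy_from_logits([0.0, 0.0, 0.0], 1)
print(round(loss_if_class_0, 4), round(loss_if_class_2, 4), round(loss_uniform, 4))
```

The loss is small when the true class has the highest score and grows as its score falls behind the others. An untrained model that scores all three classes equally has a loss of ln 3.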


### Train the int model
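For the `int` vectorized data, the cell below trains the same single `Dense` layer directly on the integer token ids. Note that it contains no embedding layer, so it treats the ids as numeric features rather than learning a word embedding: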

In [51]:
int_model = tf.keras.Sequential([layers.Dense(3)])
int_model.compile(
    loss=losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer='adam',
    metrics=['accuracy'])
history = int_model.fit(
    int_train_ds, validation_data=int_val_ds, epochs=50)

Epoch 1/50
...
Epoch 50/50
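For comparison, a word embedding model would first look up a learned vector for each token id, pool the vectors, and only then apply the dense layer. In Keras this is typically built from `layers.Embedding` and `layers.GlobalAveragePooling1D` in front of `layers.Dense(3)`. The NumPy sketch below shows only the forward pass of that idea with random weights. The embedding size of 16 and the masked mean over non-padding positions are assumptions, and this is not the model trained above.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim, num_classes = 10000, 16, 3  # embed_dim is assumed

embedding_table = rng.normal(scale=0.05, size=(vocab_size, embed_dim))
dense_w = rng.normal(scale=0.05, size=(embed_dim, num_classes))
dense_b = np.zeros(num_classes)

def embedding_model_logits(token_ids):
    """Embedding lookup, masked mean pooling, then a dense layer."""
    mask = (token_ids != 0)[..., None]           # (batch, seq, 1)
    vectors = embedding_table[token_ids] * mask  # zero out padding
    counts = np.maximum(mask.sum(axis=1), 1)     # (batch, 1)
    pooled = vectors.sum(axis=1) / counts        # (batch, embed_dim)
    return pooled @ dense_w + dense_b            # (batch, num_classes)

batch = np.array([[2, 11, 13, 542, 0, 0],
                  [2, 10, 21, 0, 0, 0]])
logits = embedding_model_logits(batch)
print(logits.shape)
```

With this structure the token id is only an index into the table, so its numeric size has no direct effect on the scores, and the amount of padding does not change the result.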


**Save Binary Model**

In [26]:
# binary_model.save('/content/drive/My Drive/NLP/Models/DrugNLPbinary.h5')

**Save Int Model**

In [52]:
#  int_model.save('/content/drive/My Drive/NLP/Models/DrugNLPint.h5')

**Load Binary Model**

In [120]:
binary_model = tf.keras.models.load_model('/content/drive/My Drive/NLP/Models/DrugNLPbinary.h5')

**Load Int Model**

In [121]:
int_model = tf.keras.models.load_model('/content/drive/My Drive/NLP/Models/DrugNLPint.h5')

### Export the binary model

In the code above, you applied the `TextVectorization` layer to the dataset before feeding text to the model. If you want to make your model capable of processing raw strings (for example, to simplify deploying it), you can include the `TextVectorization` layer inside your model. To do so, you can create a new model using the weights you just trained.

In [122]:
export_model = tf.keras.Sequential(
    [binary_vectorize_layer, binary_model,
     layers.Activation('sigmoid')])

export_model.compile(
    loss=losses.SparseCategoricalCrossentropy(from_logits=False),
    optimizer='adam',
    metrics=['accuracy'])

# Test it with `raw_test_ds`, which yields raw strings
#loss, accuracy = export_model.evaluate(raw_test_ds)
#print("Accuracy: {:2.2%}".format(accuracy))

In [123]:
#Export Int Model
export_modelint = tf.keras.Sequential(
    [int_vectorize_layer, int_model,
     layers.Activation('sigmoid')])

export_modelint.compile(
    loss=losses.SparseCategoricalCrossentropy(from_logits=False),
    optimizer='adam',
    metrics=['accuracy'])
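Both export models end in a `sigmoid` activation. For three mutually exclusive classes a `softmax` is the more common choice, and the difference can matter when the logits are large: the sigmoid saturates at 1.0 for every class, and `argmax` then returns the first index. The NumPy sketch below uses made-up logits to show this. It may help explain the rows of all 1.0 in the `predicted_scores` output further down, although the notebook does not confirm the cause.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    shifted = x - x.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Hypothetical logits: the first row is large enough to saturate sigmoid
logits = np.array([[40.0, 50.0, 45.0],
                   [1.0, 3.0, 2.0]], dtype=np.float32)

sigmoid_scores = sigmoid(logits)
softmax_scores = softmax(logits)

sigmoid_labels = np.argmax(sigmoid_scores, axis=1)
softmax_labels = np.argmax(softmax_scores, axis=1)
print(sigmoid_scores[0], sigmoid_labels, softmax_labels)
```

In the first row the largest logit belongs to class 1, yet the saturated sigmoid scores tie and `argmax` picks class 0. The softmax keeps the ranking intact.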

Now your model can take raw strings as input and predict a score for each label using `model.predict`. Define a function to find the label with the maximum score:

In [124]:
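def get_string_labels(predicted_scores_batch):
  # raw_train_ds is built from a CsvDataset and has no class_names attribute,
  # so the names are listed here in label order (0, 1, 2).
  class_names = tf.constant(["Birth Control", "Depression", "Diabetes"])
  predicted_int_labels = tf.argmax(predicted_scores_batch, axis=1)
  predicted_labels = tf.gather(class_names, predicted_int_labels)
  return predicted_labels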

Get test data for inference from CSV

In [125]:
import csv

results = []
with open('/content/drive/MyDrive/NLP/testdata.csv', newline='') as inputfile:
    for row in csv.reader(inputfile):
        results.append(row[0])

Convert the list into a DataFrame so it can be sorted and later joined with the predictions to produce the detailed result Excel file

In [126]:
df = pd.DataFrame(results,columns =['Text'])
df

                                                    Text
0      I&#039;ve tried a few antidepressants over the...
1      I have been on this birth control for one cycl...
2      I&#039;ve had the copper coil for about 3 mont...
3      I was on this pill for almost two years. It do...
4      I absolutely love this product and recommend t...
...                                                  ...
13546  I have been on bydureon for 3 injections. I wa...
13547   I started Victoza about 5 weeks ago. Now I ha...
13548  My doctor placed me on this medicine to reduce...
13549  I dropped a Jean size. My close relatives say ...


In [127]:
df = df.sort_values(by=['Text'],ascending=True)
print(df)

                                                    Text
5527   \r\nHad it put in two months ago. Insertion wa...
9764   \r\nIn few words  - Life changing\r\nAll negat...
480     I have been on this for 2 1/2 years now. I am...
13369   I have only been on invokana for a few months...
10160  &bull;\t19 Apr. 2016\r\r\n\r\r\nBegan initial ...
...                                                  ...
8800   well I&#039;ve been on this for a few months, ...
3176   well, I personally do not recommend this pill....
6358   well, I personally do not recommend this pill....
8482   worst birth controL. BADDD MOOD SWINGS! ive be...
2028   zarah was my first form of birth control. ive ...

[13551 rows x 1 columns]
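Sorting changes the row order but keeps each row's original index label, as the shuffled index values above show. That matters later, because `DataFrame.join` aligns on the index: predictions made in the original order still attach to the right text. The small pandas sketch below uses made-up rows to illustrate this.

```python
import pandas as pd

# Hypothetical texts in their original order (index 0, 1, 2)
texts = pd.DataFrame({"Text": ["b review", "c review", "a review"]})
sorted_texts = texts.sort_values(by=["Text"], ascending=True)

# Predictions produced in the original, unsorted order
predictions = pd.DataFrame({"class_name": ["for b", "for c", "for a"]})

# join matches rows by index label, not by position
joined = sorted_texts.join(predictions)
print(joined)
```

If the index were reset after sorting, the same join would pair texts with the wrong predictions, so the index should be left untouched until the join is done.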


Convert dataframe to list to be fed as input to model

In [128]:
# results = df['Text'].tolist()
# results

**Predict**

In [129]:
# To predict with the binary model instead, replace export_modelint with export_model
predicted_scores = export_modelint.predict(results)
predicted_labels = tf.argmax(predicted_scores, axis=1)
pred1 =  list(zip(results, predicted_labels))
#predicted_labels = get_string_labels(predicted_scores)
# for input, label in zip(inputs, predicted_labels):
#   print("Question: ", input)

In [130]:
predicted_scores

array([[0.0000000e+00, 0.0000000e+00, 0.0000000e+00],
       [0.0000000e+00, 0.0000000e+00, 0.0000000e+00],
       [0.0000000e+00, 0.0000000e+00, 1.4940647e-38],
       ...,
       [5.9044946e-17, 1.3763348e-07, 1.3125978e-20],
       [1.0000000e+00, 1.0000000e+00, 1.0000000e+00],
       [1.0000000e+00, 1.0000000e+00, 1.0000000e+00]], dtype=float32)

Map the integer class ids to class names for prediction lookup

In [131]:
thisdict = {
  0: "Birth Control",
  1: "Depression",
  2: "Diabetes",
  }

In [132]:
predicted_labels

<tf.Tensor: shape=(13551,), dtype=int64, numpy=array([0, 0, 2, ..., 1, 0, 0])>

In [133]:
predicted_label = predicted_labels.numpy()

In [134]:
prediction_classes = [ thisdict.get(item,item) for item in predicted_label ]

In [135]:
# prediction_classes

In [136]:
# Copy the class names into a plain list for the DataFrame column
Labels = list(prediction_classes)

In [137]:
predictions = {}
predictions['class_name'] = Labels

In [138]:
prediction = pd.DataFrame(predictions)

In [139]:
prediction

          class_name
0      Birth Control
1      Birth Control
2           Diabetes
3      Birth Control
4      Birth Control
...              ...
13546  Birth Control
13547  Birth Control
13548     Depression
13549  Birth Control


In [140]:
detail_result = df.join(prediction)
detail_result 

                                                    Text     class_name
5527   \r\nHad it put in two months ago. Insertion wa...  Birth Control
9764   \r\nIn few words  - Life changing\r\nAll negat...  Birth Control
480     I have been on this for 2 1/2 years now. I am...  Birth Control
13369   I have only been on invokana for a few months...  Birth Control
10160  &bull;\t19 Apr. 2016\r\r\n\r\r\nBegan initial ...  Birth Control
...                                                  ...            ...
8800   well I&#039;ve been on this for a few months, ...  Birth Control
3176   well, I personally do not recommend this pill....  Birth Control
6358   well, I personally do not recommend this pill....  Birth Control
8482   worst birth controL. BADDD MOOD SWINGS! ive be...  Birth Control


In [141]:
prediction['class_name'].value_counts()

Birth Control    12539
Depression         838
Diabetes           174
Name: class_name, dtype: int64

In [142]:
# Write the detailed results to Excel
pd.DataFrame(detail_result).to_excel('/content/drive/My Drive/NLP/detail_result.xlsx', index = False)

In [143]:
# Real Values :
# Birth Control 9650
# Depression 3095
# Diabetes  809

# Previous Run Binary Model:
# Birth Control    9659
# Depression       3125
# Diabetes          767

# Previous Run Int Model:
# Birth Control    12539
# Depression         838
# Diabetes           174
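The counts recorded above can be compared directly. The sketch below computes, for each run, how far the predicted class counts are from the real ones. It uses only the numbers listed in the comments, and the helper `count_gap` is a hypothetical name.

```python
real = {"Birth Control": 9650, "Depression": 3095, "Diabetes": 809}
binary_run = {"Birth Control": 9659, "Depression": 3125, "Diabetes": 767}
int_run = {"Birth Control": 12539, "Depression": 838, "Diabetes": 174}

def count_gap(predicted, reference):
    """Predicted count minus real count, per class."""
    return {name: predicted[name] - reference[name] for name in reference}

binary_gap = count_gap(binary_run, real)
int_gap = count_gap(int_run, real)
print(binary_gap)
print(int_gap)
```

The binary model's class counts are within a few dozen of the real ones, while the int model assigns almost 2900 extra reviews to Birth Control. Matching counts do not prove that individual reviews are classified correctly, so a per-example accuracy on labelled test data would still be needed.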
