Written by: **DZE RICHARD**

# 1. EXTRACTING NEWS DATA FROM THE DATASET 


**Objective**: Extract the news data from the dataset, apply a standard text-cleaning process, and separate the training and target variables.



- **Remark**: This project runs on [Google Colab](https://colab.research.google.com), so we begin by mounting Google Drive, where the data will be stored and accessed.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Download and extract the dataset

- We change to the directory where we want to download the data, download it with `!wget`, and unzip it into the same directory with `!unzip`.


In [None]:
# %cd <path_to_save_file>

In [5]:
!ls

16119_db21c91a1ab47385bb13773ed8238c31/     drive/
16119_db21c91a1ab47385bb13773ed8238c31.zip  sample_data/


In [None]:
!wget https://s3.amazonaws.com/webhose-archive/16119_db21c91a1ab47385bb13773ed8238c31.zip # download dataset

--2022-09-26 20:08:01--  https://s3.amazonaws.com/webhose-archive/16119_db21c91a1ab47385bb13773ed8238c31.zip
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.42.6
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.42.6|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14711036286 (14G) [application/zip]
Saving to: ‘16119_db21c91a1ab47385bb13773ed8238c31.zip.1’


In [4]:
!unzip -q 16119_db21c91a1ab47385bb13773ed8238c31.zip -d ./16119_db21c91a1ab47385bb13773ed8238c31 #Unzip the dataset
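The archive is about 14 GB, so it may be worth extracting only the files that are read later. Below is a minimal sketch using Python's standard `zipfile` module. The helper name `extract_matching`, the suffix filter, and the demo file names are hypothetical; the sketch builds a tiny stand-in archive in a temporary directory instead of touching the real dataset.

```python
import os
import tempfile
import zipfile


def extract_matching(zip_path, dest_dir, suffix="_0000001.json"):
    """Extract only the archive members whose name ends with `suffix`."""
    with zipfile.ZipFile(zip_path) as archive:
        wanted = [name for name in archive.namelist() if name.endswith(suffix)]
        archive.extractall(dest_dir, members=wanted)
    return wanted


# Build a small stand-in archive so the sketch runs without the real dataset.
workdir = tempfile.mkdtemp()
demo_zip = os.path.join(workdir, "demo.zip")
with zipfile.ZipFile(demo_zip, "w") as archive:
    archive.writestr("demo_2019_12_0000001.json", '{"title": "a", "text": "b"}\n')
    archive.writestr("demo_2019_12_0000002.json", '{"title": "c", "text": "d"}\n')

out_dir = os.path.join(workdir, "out")
extracted = extract_matching(demo_zip, out_dir)
print(extracted)
```

Only the first demo file is written to `out_dir`; the second stays inside the archive.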

- Import the necessary libraries and read the first two JSON files (one JSON record per line).

In [None]:
# from __future__ import unicode_literals, print_function, division
# import numpy as np

import pandas as pd
import json
# import os, glob

# import re
# import unicodedata
# import string
import random  # used later by trainIters (random.choice) and train (random.random)

import torch
import torch.nn as nn
from torch import optim
import torch.nn.functional as F

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [None]:
# Reading the first file (Data for dec 2019)
with open("16119_webhose_2019_12_db21c91a1ab47385bb13773ed8238c31_0000001.json", "r", encoding="utf8") as file1:
  json_data1 = [json.loads(line) for line in file1]
len(json_data1)

104

In [None]:
# Reading the second json (Data for jan 2020)
with open("16119_webhose_2020_01_db21c91a1ab47385bb13773ed8238c31_0000001.json", "r", encoding="utf8") as file2:
  json_data2 = [json.loads(line) for line in file2]
len(json_data2)

94299

In [None]:
# Extracting the value of the text key and title key from the 2 jsons and putting them in the dataset and target lists respectively  
dataset = []
target = []
for dict1 in json_data1:
  text1 = dict1.get("text")
  dataset.append(text1)
  title1 = dict1.get("title")
  target.append(title1)
    
for dict2 in json_data2:
  text2 = dict2.get("text")
  dataset.append(text2)
  title2 = dict2.get("title")
  target.append(title2)
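The two loops above do the same thing for each file, so the extraction could also be written as one helper. This is a sketch, not the notebook's method: the name `load_pairs` is hypothetical, and unlike the cell above it drops records whose `text` or `title` is missing or empty at load time, which would change the counts reported below.

```python
import json
import os
import tempfile


def load_pairs(paths):
    """Read JSON-lines files and return parallel lists of texts and titles."""
    texts, titles = [], []
    for path in paths:
        with open(path, "r", encoding="utf8") as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue  # skip blank lines
                record = json.loads(line)
                text, title = record.get("text"), record.get("title")
                if text and title:  # drop records with a missing or empty field
                    texts.append(text)
                    titles.append(title)
    return texts, titles


# Tiny stand-in file; the real inputs are the JSON-lines files read above.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.json")
with open(demo_path, "w", encoding="utf8") as handle:
    handle.write('{"title": "Headline one", "text": "Body one"}\n')
    handle.write('{"title": "", "text": "Body without a title"}\n')
    handle.write('{"title": "Headline three", "text": "Body three"}\n')

demo_texts, demo_titles = load_pairs([demo_path])
print(len(demo_texts), len(demo_titles))
```

Reading line by line also avoids holding the parsed records of a whole file in memory when only two fields are needed.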

In [None]:
print("The length of dataset is:", len(dataset))
print("The length of target is:", len(target))

The length of dataset is: 94403
The length of target is: 94403


In [None]:
dataset[:2]

['Dublin, The “Swine Healthcare Market – Growth, Trends, and Forecast (2019 – 2024)” report has been added to ResearchAndMarkets.com’s offering.\nThe global swine health market is expected to register a healthy CAGR during the forecast period, owing to the increasing incidence of swine diseases.\nChina Ministry of Agriculture and Rural Affairs (MARA) confirmed its first African swine fever (ASF) outbreak in Liaoning Province in 2018, 145 ASF outbreaks detected in 32 Provinces/Autonomous Regions/Municipalities/Special Administrative Region. More than 1,160,000 pigs have been rising investments in R&D activities in the development of novel therapeutics, a growing number of governments initiatives for the prevention of zoonotic diseases, increasing demand for livestock products is expected to propel the global swine health market. culled in an effort to halt the further spread. Additionally, the growing consumption of pork globally.\nKey Market Trends\nVaccines are Expected to Lead the Ma

In [None]:
target[:2]

['Global Swine Healthcare Market by Products, Diseases & Geography – Forecast to 2024',
 'FDA launches app for health care professionals to report novel uses of existing medicines for patients with difficult-to-treat infectious diseases']

### Text cleanup
In this section we will:

- Define a contraction hashmap
- Define a function `preprocess(text)` and call it on every text and title
- Create a dataframe holding the text and summary columns, and remove rows where either column is an empty string
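The contraction lookup in the first step can be sketched on a three-entry map. The names `demo_map` and `expand_contractions` are hypothetical and only illustrate the idea of a whitespace-token lookup on lowercased text.

```python
demo_map = {"can't": "cannot", "it's": "it is", "won't": "will not"}


def expand_contractions(text, mapping):
    """Lowercase `text` and replace whitespace-delimited tokens found in `mapping`."""
    tokens = text.lower().split()
    return " ".join(mapping.get(token, token) for token in tokens)


expanded = expand_contractions("It's late and I can't stay", demo_map)
print(expanded)

# A token with attached punctuation is not an exact key, so it is left as-is.
unmatched = expand_contractions("I can't.", demo_map)
print(unmatched)
```

Two consequences of this kind of lookup are worth keeping in mind: once the text is lowercased, capitalised keys such as `"I'd"` can never match, and tokens written with a typographic apostrophe (`’`) do not match keys written with a straight one (`'`).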

In [None]:
# Defining a contraction hashmap
contraction_map = {"ain't": "is not", "aren't": "are not","can't": "cannot", "'cause": "because", "could've": "could have", "couldn't": "could not",

                           "didn't": "did not", "doesn't": "does not", "don't": "do not", "hadn't": "had not", "hasn't": "has not", "haven't": "have not",

                           "he'd": "he would","he'll": "he will", "he's": "he is", "how'd": "how did", "how'd'y": "how do you", "how'll": "how will", "how's": "how is",

                           "I'd": "I would", "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have","I'm": "I am", "I've": "I have", "i'd": "i would",

                           "i'd've": "i would have", "i'll": "i will",  "i'll've": "i will have","i'm": "i am", "i've": "i have", "isn't": "is not", "it'd": "it would",

                           "it'd've": "it would have", "it'll": "it will", "it'll've": "it will have","it's": "it is", "let's": "let us", "ma'am": "madam",

                           "mayn't": "may not", "might've": "might have","mightn't": "might not","mightn't've": "might not have", "must've": "must have",

                           "mustn't": "must not", "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have","o'clock": "of the clock",

                           "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not", "sha'n't": "shall not", "shan't've": "shall not have",

                           "she'd": "she would", "she'd've": "she would have", "she'll": "she will", "she'll've": "she will have", "she's": "she is",

                           "should've": "should have", "shouldn't": "should not", "shouldn't've": "should not have", "so've": "so have","so's": "so as",

                           "this's": "this is","that'd": "that would", "that'd've": "that would have", "that's": "that is", "there'd": "there would",

                           "there'd've": "there would have", "there's": "there is", "here's": "here is","they'd": "they would", "they'd've": "they would have",

                           "they'll": "they will", "they'll've": "they will have", "they're": "they are", "they've": "they have", "to've": "to have",

                           "wasn't": "was not", "we'd": "we would", "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have", "we're": "we are",

                           "we've": "we have", "weren't": "were not", "what'll": "what will", "what'll've": "what will have", "what're": "what are",

                           "what's": "what is", "what've": "what have", "when's": "when is", "when've": "when have", "where'd": "where did", "where's": "where is",

                           "where've": "where have", "who'll": "who will", "who'll've": "who will have", "who's": "who is", "who've": "who have",

                           "why's": "why is", "why've": "why have", "will've": "will have", "won't": "will not", "won't've": "will not have",

                           "would've": "would have", "wouldn't": "would not", "wouldn't've": "would not have", "y'all": "you all",

                           "y'all'd": "you all would","y'all'd've": "you all would have","y'all're": "you all are","y'all've": "you all have",

                           "you'd": "you would", "you'd've": "you would have", "you'll": "you will", "you'll've": "you will have",

                           "you're": "you are", "you've": "you have"}

In [None]:
import nltk
nltk.__version__

'3.2.5'

In [None]:
nltk.download('stopwords') # Download the stopwords (in case they are not already available)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
import re 
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english')) # a set makes the per-word membership test in preprocess() fast

In [None]:
#defining the preprocess function
def preprocess(text):
    text = text.lower() # convert text to lowercase
    text = text.split() 
    for i in range(len(text)):
        word = text[i]
        if word in contraction_map: 
            text[i] = contraction_map[word] #apply the contraction hashmap
    text = " ".join(text)
    text = text.split()
    newtext = []
    for word in text:
        if word not in stop_words:
            newtext.append(word)
    text = " ".join(newtext)
    text = text.replace("'s",'') # remove 's, e.g. writer's -> writer
    # remove parenthesised text, e.g. "a (b) c" -> "a  c"; note that .* is greedy,
    # so everything from the first "(" to the last ")" in the text is removed
    text = re.sub(r'\(.*\)','',text)
    text = re.sub(r'[^a-zA-Z0-9. ]','',text) # remove punctuations
    text = re.sub(r'\.',' . ',text) # add a space character before and after the full stop
    return text
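The parenthesis step in `preprocess` uses the greedy pattern `\(.*\)`. The following sketch compares it with a per-group pattern on a made-up sentence; the alternative pattern `\([^)]*\)` is a suggestion, not what the function above uses.

```python
import re

sample = "swine fever (asf) spread to china (2018) and beyond"

# Greedy: one match running from the first "(" to the last ")".
greedy = re.sub(r'\(.*\)', '', sample)

# Per group: each "(...)" is matched and removed on its own.
per_group = re.sub(r'\([^)]*\)', '', sample)

print(greedy)     # swine fever  and beyond
print(per_group)  # swine fever  spread to china  and beyond
```

With the greedy pattern, any article containing two or more parenthesised phrases loses all the text between them, which can shorten long articles considerably.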

In [None]:
X = []
Y = []
for d_text in dataset:
    prep_dataset = preprocess(d_text)
    X.append(prep_dataset)
for t_text in target:
    prep_target = preprocess(t_text)
    Y.append(prep_target)

In [None]:
print(len(X), len(Y))

94403 94403


In [None]:
# Reduce dataset size
max_len_text = 600
max_len_target = 30

short_text=[]
short_summary=[]

for i in range(len(dataset)):
    if(len(target[i].split())<=max_len_target and len(dataset[i].split())<=max_len_text):
        short_text.append(dataset[i])
        short_summary.append(target[i])

temp_df = pd.DataFrame({'text':short_text,'summary':short_summary})

In [None]:
temp_df.head()

| | text | summary |
|---|---|---|
| 0 | FDA launches app for health care professionals... | FDA launches app for health care professionals... |
| 1 | Of all of Regina Yan ’s many traits, an open m... | C-Suite Awards: Regina Yan |
| 2 | The CURE ID app allows clinicians to share and... | FDA Launches Infectious Disease Crowdsourcing ... |
| 3 | The DSB is composed of representatives from tw... | Drug Safety Oversight Board |
| 4 | The Centre for Health Protection (CHP) of the ... | Suspected MERS case reported |


In [None]:
# remove those empty strings both from summary and the text column
newdf = temp_df[temp_df['summary'].str.strip().astype(bool)]
df = newdf[newdf['text'].str.strip().astype(bool)]
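The length filter and the empty-string filter above operate on the raw `dataset` and `target` lists. If the same limits should apply to other parallel lists, for example the preprocessed `X` and `Y`, both checks could be done in one pass. The helper name `filter_pairs` and the demo data are hypothetical.

```python
def filter_pairs(texts, summaries, max_text_len=600, max_summary_len=30):
    """Keep pairs that are non-empty and within the word-count limits."""
    kept_texts, kept_summaries = [], []
    for text, summary in zip(texts, summaries):
        if not text.strip() or not summary.strip():
            continue  # drop pairs where either side is empty or whitespace
        if len(text.split()) <= max_text_len and len(summary.split()) <= max_summary_len:
            kept_texts.append(text)
            kept_summaries.append(summary)
    return kept_texts, kept_summaries


demo_texts = ["short article body", "   ", "word " * 700]
demo_summaries = ["short title", "title for empty text", "title for long text"]

kept_texts, kept_summaries = filter_pairs(demo_texts, demo_summaries)
print(kept_texts, kept_summaries)
```

Only the first pair survives: the second has an empty text and the third exceeds the 600-word limit.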

In [None]:
df.head()

| | text | summary |
|---|---|---|
| 0 | FDA launches app for health care professionals... | FDA launches app for health care professionals... |
| 1 | Of all of Regina Yan ’s many traits, an open m... | C-Suite Awards: Regina Yan |
| 2 | The CURE ID app allows clinicians to share and... | FDA Launches Infectious Disease Crowdsourcing ... |
| 3 | The DSB is composed of representatives from tw... | Drug Safety Oversight Board |
| 4 | The Centre for Health Protection (CHP) of the ... | Suspected MERS case reported |


### Text feature generation


In [None]:

SOS_token = 0
EOS_token = 1


class Lang:
    def __init__(self, name):
        self.name = name
        self.word2index = {} 
        self.word2count = {} 
        self.index2word = {0: "SOS", 1: "EOS"}
        self.n_words = 2  # SOS and EOS are already indexed 0 and 1, so the first new word starts at 2
        
    def addSentence(self, sentence):
        for word in sentence.split(' '): # get every word from the sentence and pass it to addWord
            self.addWord(word)
            
    #updating word2index, index2word and word2count hashmaps
    def addWord(self, word):
        if word not in self.word2index: 
            self.word2index[word] = self.n_words 
            self.word2count[word] = 1
            self.index2word[self.n_words] = word
            self.n_words += 1
        else:
            self.word2count[word] += 1
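To make the bookkeeping concrete, here is a compact stand-in for the class above, run on two short sentences. `Vocab` is a hypothetical name; it keeps the same three dictionaries and the same `split(' ')` tokenisation.

```python
class Vocab:
    """Compact stand-in mirroring the word2index / word2count / index2word maps."""

    def __init__(self):
        self.word2index = {}
        self.word2count = {}
        self.index2word = {0: "SOS", 1: "EOS"}
        self.n_words = 2  # indices 0 and 1 are taken by SOS and EOS

    def add_sentence(self, sentence):
        for word in sentence.split(' '):
            if word not in self.word2index:
                self.word2index[word] = self.n_words
                self.word2count[word] = 1
                self.index2word[self.n_words] = word
                self.n_words += 1
            else:
                self.word2count[word] += 1


vocab = Vocab()
vocab.add_sentence("fda launches app")
vocab.add_sentence("fda app update")
print(vocab.word2index)  # {'fda': 2, 'launches': 3, 'app': 4, 'update': 5}
print(vocab.word2count)  # {'fda': 2, 'launches': 1, 'app': 2, 'update': 1}

# split(' ') yields an empty string between two consecutive spaces,
# so the empty string becomes a vocabulary entry of its own.
spaced = Vocab()
spaced.add_sentence("a  b")
print('' in spaced.word2index)  # True
```

Because the preprocessed text can contain consecutive spaces, splitting with `split()` (no argument) would be one way to avoid the empty-string entry, at the cost of deviating from the class as written.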
        

### Make the features ready for the model

In [None]:
def readData(text, summary):
    pairs = [[text[i],summary[i]] for i in range(len(text))] # Put text and summary in pairs 
    input_lang = Lang("text") # vocabulary for the input texts (the name is only a label)
    output_lang = Lang("summary") # vocabulary for the target summaries
    return input_lang, output_lang, pairs 

def prepareData(X, Y):
    input_lang, output_lang, pairs = readData(X,Y)
    
    for pair in pairs:
        input_lang.addSentence(pair[0])
        output_lang.addSentence(pair[1])
        
    return input_lang, output_lang, pairs

input_lang, output_lang, pairs = prepareData(X,Y) 

In [None]:
pairs

[['dublin swine healthcare market  growth trends forecast  5 . 2 . 2 coccidiosis 5 . 2 . 3 respiratory diseases 5 . 2 . 4 swine dysentery 5 . 2 . 5 porcine parvovirus 5 . 2 . 6 others 5 . 3 geography 5 . 3 . 1 north america 5 . 3 . 2 europe 5 . 3 . 3 asiapacific 5 . 3 . 4 middle east  africa 5 . 3 . 5 south america 6 competitive landscape 6 . 1 company profiles 6 . 1 . 1 abaxis 6 . 1 . 2 bayer animal health 6 . 1 . 3 boehringer ingelheim 6 . 1 . 4 ceva animal health inc .  6 . 1 . 5 elanco 6 . 1 . 6 idvet 6 . 1 . 7 merck animal health 6 . 1 . 8 merial 6 . 1 . 9 vetoquinol s . a .  6 . 1 . 10 virbac 6 . 1 . 11 zoetis animal healthcare 7 market opportunities future trends information report visit httpswww . researchandmarkets . comrshhuje research markets also offers custom research services providing focused comprehensive tailored research .  contact researchandmarkets . com laura wood senior press manager pressresearchandmarkets . com e . s . t office hours call 19173000470 u . s . can

In [None]:
MAX_LENGTH = max_len_text

In [None]:
class Encoder(nn.Module):
  def __init__(self, input_size, hidden_size):
    super(Encoder, self).__init__()
    self.hidden_size = hidden_size
    self.embedding = nn.Embedding(input_size, hidden_size)
    self.gru = nn.GRU(hidden_size, hidden_size)

  def forward(self, input, hidden):
    # embed one token index and reshape to (seq_len=1, batch=1, hidden_size)
    output = self.embedding(input).view(1,1,-1)
    output, hidden = self.gru(output, hidden)
    return output, hidden

  def initHidden(self):
    hidden = torch.zeros(1,1, self.hidden_size, device=device)
    return hidden


In [None]:
class AttnDecoder(nn.Module):
  def __init__(self, hidden_size, output_size, dropout=0.2, max_length=MAX_LENGTH):
    super(AttnDecoder, self).__init__()
    self.hidden_size = hidden_size
    self.output_size = output_size
    self.dropout_p = dropout
    self.max_length = max_length
    self.embedding = nn.Embedding(self.output_size, self.hidden_size)
    # one attention score per encoder position, computed from [embedded; hidden]
    self.attn = nn.Linear(self.hidden_size * 2, self.max_length)
    self.attn_combine = nn.Linear(self.hidden_size * 2, self.hidden_size)
    self.dropout = nn.Dropout(self.dropout_p)
    self.gru = nn.GRU(self.hidden_size, self.hidden_size)
    self.out = nn.Linear(self.hidden_size, self.output_size)

  def forward(self, input, hidden, encoder_outputs):
    embedded = self.embedding(input).view(1, 1, -1)
    embedded = self.dropout(embedded)

    attn_weights = F.softmax(self.attn(torch.cat((embedded[0], hidden[0]), 1)), dim=1)
    attn_applied = torch.bmm(attn_weights.unsqueeze(0),encoder_outputs.unsqueeze(0))

    output = torch.cat((embedded[0], attn_applied[0]), 1)
    output = self.attn_combine(output).unsqueeze(0)

    output = F.relu(output)
    output, hidden = self.gru(output, hidden)

    output = F.log_softmax(self.out(output[0]), dim=1)
    return output, hidden, attn_weights

  def initHidden(self):
    return torch.zeros(1, 1, self.hidden_size, device=device)
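The attention step in `forward` turns a vector of scores into weights with a softmax and then takes a weighted sum of the encoder outputs (the `torch.bmm` call). The arithmetic can be sketched with plain Python lists; the scores and vectors below are made up, and the helper names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def apply_attention(weights, encoder_outputs):
    """Weighted sum of encoder output vectors, one weight per input position."""
    hidden_size = len(encoder_outputs[0])
    return [
        sum(weight * vector[dim] for weight, vector in zip(weights, encoder_outputs))
        for dim in range(hidden_size)
    ]


scores = [2.0, 0.0, 0.0]                                  # one score per input position
encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]    # three positions, hidden size 2

weights = softmax(scores)
context = apply_attention(weights, encoder_outputs)
print(weights)
print(context)
```

The first position receives the largest weight, so the context vector is dominated by the first encoder output. In the decoder above the scores always span `max_length` positions, and positions beyond the actual input length correspond to all-zero rows of `encoder_outputs`.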


In [None]:
def indexesFromSentence(lang, sentence):
    return [lang.word2index[word] for word in sentence.split(' ')]


def tensorFromSentence(lang, sentence):
    indexes = indexesFromSentence(lang, sentence)
    indexes.append(EOS_token)
    return torch.tensor(indexes, dtype=torch.long, device=device).view(-1, 1)


def tensorsFromPair(pair):
    input_tensor = tensorFromSentence(input_lang, pair[0])
    target_tensor = tensorFromSentence(output_lang, pair[1])
    return (input_tensor, target_tensor)
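`indexesFromSentence` looks every word up directly, so a word that is not in the vocabulary raises a `KeyError`. That cannot happen for the training pairs, but it would for new text at inference time. One possible workaround, sketched here on plain lists, is to map unseen words to a reserved index. The `UNK` index and the helper name are assumptions; they are not part of the notebook and would require reserving that index in `Lang` as well.

```python
EOS = 1
UNK = 2  # hypothetical index reserved for unknown words


def indexes_with_unk(word2index, sentence):
    """Map words to indices, falling back to UNK, and append EOS."""
    return [word2index.get(word, UNK) for word in sentence.split()] + [EOS]


demo_vocab = {"fda": 3, "launches": 4, "app": 5}
indexes = indexes_with_unk(demo_vocab, "fda launches new app")
print(indexes)  # [3, 4, 2, 5, 1]
```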


In [None]:
import time

def trainIters(encoder, decoder, n_iters, print_every=1000, plot_every=100, learning_rate=0.01):
  start = time.time()
  plot_losses = []
  print_loss_total = 0  # reset every print_every iterations
  plot_loss_total = 0   # reset every plot_every iterations

  encoder_optimizer = optim.SGD(encoder.parameters(), lr=learning_rate)
  decoder_optimizer = optim.SGD(decoder.parameters(), lr=learning_rate)
  # sample one (text, summary) pair per iteration, with replacement
  training_pairs = [tensorsFromPair(random.choice(pairs)) for _ in range(n_iters)]
  criterion = nn.NLLLoss()

  for iter in range(1, n_iters + 1):
    training_pair = training_pairs[iter - 1]
    input_tensor = training_pair[0]
    target_tensor = training_pair[1]

    loss = train(input_tensor, target_tensor, encoder, decoder, encoder_optimizer, decoder_optimizer, criterion)
    print_loss_total += loss
    plot_loss_total += loss

    if iter % print_every == 0:
        print_loss_avg = print_loss_total / print_every
        print_loss_total = 0
        # elapsed seconds, iteration, percent complete, average loss
        print('%.0fs (%d %d%%) %.4f' % (time.time() - start,
                                      iter, iter / n_iters * 100, print_loss_avg))

    if iter % plot_every == 0:
        plot_loss_avg = plot_loss_total / plot_every
        plot_losses.append(plot_loss_avg)
        plot_loss_total = 0

  showPlot(plot_losses)

In [None]:
import matplotlib.pyplot as plt
plt.switch_backend('agg')
import matplotlib.ticker as ticker
import numpy as np


def showPlot(points):
    plt.figure()
    fig, ax = plt.subplots()
    # this locator puts ticks at regular intervals
    loc = ticker.MultipleLocator(base=0.2)
    ax.yaxis.set_major_locator(loc)
    plt.plot(points)

In [None]:
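teacher_forcing_ratio = 0.5  # assumed value: this probability is not set anywhere else in the notebook

def train(input_tensor, target_tensor, encoder, decoder, encoder_optimizer, decoder_optimizer, criterion, max_length=MAX_LENGTH):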
  encoder_hidden = encoder.initHidden()
  encoder_optimizer.zero_grad()
  decoder_optimizer.zero_grad()
  input_length = min(input_tensor.size(0), max_length)  # truncate so that encoder_outputs[i] stays in range
  target_length = target_tensor.size(0)
  encoder_outputs = torch.zeros(max_length, encoder.hidden_size, device=device)
  loss = 0
  for i in range(input_length):
    encoder_output, encoder_hidden = encoder(input_tensor[i], encoder_hidden)
    encoder_outputs[i] = encoder_output[0, 0]
  decoder_input = torch.tensor([[SOS_token]], device=device)
  decoder_hidden = encoder_hidden
  
  use_teacher_forcing = random.random() < teacher_forcing_ratio

  if use_teacher_forcing:
      # Teacher forcing: Feed the target as the next input
      for di in range(target_length):
          decoder_output, decoder_hidden, decoder_attention = decoder(decoder_input, decoder_hidden, encoder_outputs)
          loss += criterion(decoder_output, target_tensor[di])
          decoder_input = target_tensor[di]  # Teacher forcing

  else:
      # Without teacher forcing: use its own predictions as the next input
      for di in range(target_length):
          decoder_output, decoder_hidden, decoder_attention = decoder(
              decoder_input, decoder_hidden, encoder_outputs)
          topv, topi = decoder_output.topk(1)
          decoder_input = topi.squeeze().detach()  # detach from history as input

          loss += criterion(decoder_output, target_tensor[di])
          if decoder_input.item() == EOS_token:
              break

  loss.backward()

  encoder_optimizer.step()
  decoder_optimizer.step()

  return loss.item() / target_length
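The two branches in `train` differ only in what is fed to the decoder at the next step. The toy below, with a stub in place of the decoder, shows the sequence of decoder inputs in each mode. The function `decode_inputs` and the stub `always_wrong` are hypothetical; as in `train`, the choice is made once per training example.

```python
import random

SOS, EOS = 0, 1


def decode_inputs(targets, predict, teacher_forcing_ratio, rng):
    """Return (used_teacher_forcing, list of tokens fed to the decoder).

    `predict` stands in for the decoder: it maps the current input token
    to a predicted token.
    """
    use_teacher_forcing = rng.random() < teacher_forcing_ratio
    decoder_input = SOS
    fed = []
    for target in targets:
        fed.append(decoder_input)
        predicted = predict(decoder_input)
        if use_teacher_forcing:
            decoder_input = target        # feed the ground-truth token
        else:
            decoder_input = predicted     # feed the model's own prediction
            if predicted == EOS:
                break
    return use_teacher_forcing, fed


def always_wrong(token):
    return 9  # a stub decoder that always predicts token 9


targets = [5, 6, EOS]
forced, fed_forced = decode_inputs(targets, always_wrong, 1.0, random.Random(0))
free, fed_free = decode_inputs(targets, always_wrong, 0.0, random.Random(0))
print(forced, fed_forced)  # True [0, 5, 6]
print(free, fed_free)      # False [0, 9, 9]
```

With teacher forcing the decoder always sees the correct previous token, even when its own prediction was wrong; without it, an early mistake is carried into every later step.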

