<a href="https://colab.research.google.com/github/ShrikaraVarna/Text-Summarizer/blob/main/Models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import numpy as np
import pandas as pd
import nltk
import logging
from nltk.tokenize import sent_tokenize
nltk.download('punkt') # one time execution
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from sklearn.metrics.pairwise import cosine_similarity
import re
import heapq
import networkx as nx
from gensim.summarization import summarize

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [2]:
#Get GloVe embeddings
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove*.zip

--2021-06-16 08:16:15--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2021-06-16 08:16:15--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2021-06-16 08:16:15--  http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


2021-0

In [3]:
class TextRank:

  def remove_stopwords(self, sen):
    stop_words = stopwords.words('english')
    sen_new = " ".join([i for i in sen if i not in stop_words])
    return sen_new

  word_embeddings = {}
  f = open('glove.6B.100d.txt', encoding='utf-8')
  for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    word_embeddings[word] = coefs
  f.close()

  def cosine_sim_vectors(self, vec1, vec2):
    vec1 = vec1.reshape(1,-1)
    vec2 = vec2.reshape(1,-1)
    return cosine_similarity(vec1,vec2)[0][0]

  def text_preprocessing(self, sentences):
    clean_sentences = []
    for s in sentences:
      clean_sentences.append(s.replace("[^a-zA-Z]", " "))
    clean_sentences = [s.lower() for s in clean_sentences]
    clean_sentences = [self.remove_stopwords(r.split()) for r in clean_sentences]
    return clean_sentences 

  def train(self):

    #Read rhe csv file being used for training
    df = pd.read_csv("/content/drive/MyDrive/BBC News Train.csv", encoding='unicode_escape')
    df.head()

    cal_acu = 0
    no_of_sum = 0

    for l in range(len(df['Text'])):
      sentences = sent_tokenize(df['Text'][l])
      clean_sentences = self.text_preprocessing(sentences)

      sentence_vectors = []
      s = np.zeros((100,))
      for i in clean_sentences:
        s = np.zeros((100,))
        if len(i) != 0:
          for w in i.split():
            s += self.word_embeddings.get(w,np.zeros((100,)))
          s = s/(len(i.split())+0.001)
        else:
          s = np.zeros((100,))
        sentence_vectors.append(s)

      #Similarity Matrix
      sim_mat = np.zeros([len(sentences), len(sentences)])
      for i in range(len(sentences)):
        for j in range(len(sentences)):
          if i != j:
            sim_mat[i][j] = cosine_similarity(sentence_vectors[i].reshape(1,100), sentence_vectors[j].reshape(1,100))[0,0]

      nx_graph = nx.from_numpy_array(sim_mat)
      scores = nx.pagerank(nx_graph)

      ranked_sentences = sorted(((scores[i],s) for i,s in enumerate(sentences)), reverse=True)

      nos = 0.2 * len(sentences)
      nos = int(nos)

      sum_lis = []
      for j in range(nos):
        #print(ranked_sentences[j][1])
        sum_lis.append(ranked_sentences[j][1])

      #print()

      summary_from_gensim = summarize(df['Text'][l],ratio = 0.2)
      #print(summary_from_gensim)
      #print()
      sum_from_gen_lis = sent_tokenize(summary_from_gensim)

      nor_sum_cleaned = self.text_preprocessing(sum_lis)
      sum_cleaned_normal = ''
      for s in nor_sum_cleaned:
        sum_cleaned_normal += s + " "

      gen_sum_cleaned = self.text_preprocessing(sum_from_gen_lis)
      sum_cleaned_gen = ''
      for s in gen_sum_cleaned:
        sum_cleaned_gen += s + ' '

      sum_cleaned = []
      sum_cleaned.append(sum_cleaned_normal)
      sum_cleaned.append(sum_cleaned_gen)

      sentence_vectors_sum = []
      s = np.zeros((100,))
      for i in sum_cleaned:
        s = np.zeros((100,))
        if len(i) != 0:
          for w in i.split():
            s += self.word_embeddings.get(w,np.zeros((100,)))
          s = s/(len(i.split())+0.001)
        else:
          s = np.zeros((100,))
        sentence_vectors_sum.append(s)

      if len(sentence_vectors_sum) == 2:
        no_of_sum += 1
        cal_acu += self.cosine_sim_vectors(sentence_vectors_sum[0], sentence_vectors_sum[1])
    
    cal_acu /= no_of_sum
    print("The average cosine similarity is (when compared to TextRank summarizer from Gensim): ", cal_acu)

  def test(self, article, ratio):
    sentences = sent_tokenize(article)
    cleaned_sentences = self.text_preprocessing(sentences)

    sentence_vectors = []
    s = np.zeros((100,))
    for i in cleaned_sentences:
      s = np.zeros((100,))
      if len(i) != 0:
        for w in i.split():
          s += self.word_embeddings.get(w,np.zeros((100,)))
        s = s/(len(i.split())+0.001)
      else:
        s = np.zeros((100,))
      sentence_vectors.append(s)

    #Similarity Matrix
    sim_mat = np.zeros([len(sentences), len(sentences)])
    for i in range(len(sentences)):
      for j in range(len(sentences)):
        if i != j:
          sim_mat[i][j] = cosine_similarity(sentence_vectors[i].reshape(1,100), sentence_vectors[j].reshape(1,100))[0,0]

    nx_graph = nx.from_numpy_array(sim_mat)
    scores = nx.pagerank(nx_graph)

    ranked_sentences = sorted(((scores[i],s) for i,s in enumerate(sentences)), reverse=True)

    nos = ratio * len(sentences)
    nos = int(nos)
    
    for j in range(nos):
      print(ranked_sentences[j][1])
      
      

In [4]:
ob = TextRank()
ob.train()

The average cosine similarity is (when compared to TextRank summarizer from Gensim):  0.9522935548018704


In [5]:
class WordFrequency:

  def remove_stopwords(self, sen):
    stop_words = stopwords.words('english')
    sen_new = " ".join([i for i in sen if i not in stop_words])
    return sen_new

  word_embeddings = {}
  f = open('glove.6B.100d.txt', encoding='utf-8')
  for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    word_embeddings[word] = coefs
  f.close()

  def cosine_sim_vectors(self, vec1, vec2):
    vec1 = vec1.reshape(1,-1)
    vec2 = vec2.reshape(1,-1)
    return cosine_similarity(vec1,vec2)[0][0]

  def text_preprocessing(self, sentences):
    clean_sentences = []
    for s in sentences:
      clean_sentences.append(s.replace("[^a-zA-Z]", " "))
    clean_sentences = [s.lower() for s in clean_sentences]
    clean_sentences = [self.remove_stopwords(r.split()) for r in clean_sentences]
    return clean_sentences 

  def train(self):

    df = pd.read_csv("/content/drive/MyDrive/BBC News Train.csv", encoding = 'unicode_escape')

    no_of_sum = 0
    cal_acu = 0

    for i in range(len(df['Text'])):
      text = df['Text'][i]
      tokenizer = RegexpTokenizer('\w+')
      words = tokenizer.tokenize(text)

      sentences = sent_tokenize(text)

      no_of_sen_present = len(sentences)

      nos = 0.2 * no_of_sen_present
      nos = int(nos)

      frequency = nltk.FreqDist(words)
      max_frequency = max(frequency.values())

      for word in frequency.keys():
        frequency[word] = frequency[word]/max_frequency

      sentence_scores = {}
      for x in sentences:
        if len(x.split()) < 50:
          for word in nltk.word_tokenize(x.lower()):
            if word in frequency.keys():
              sentence_scores[x] = sentence_scores.get(x,0) + frequency[word]

      summary_sentences = heapq.nlargest(nos, sentence_scores, key = sentence_scores.get)
      summary = ' '.join(summary_sentences[:])

      summary_from_gensim = summarize(df['Text'][i],ratio = 0.2)

      sum_lis = sent_tokenize(summary)
      sum_from_gen_lis = sent_tokenize(summary_from_gensim)

      nor_sum_cleaned = self.text_preprocessing(sum_lis)
      sum_cleaned_normal = ''
      for s in nor_sum_cleaned:
        sum_cleaned_normal += s + " "

      gen_sum_cleaned = self.text_preprocessing(sum_from_gen_lis)
      sum_cleaned_gen = ''
      for s in gen_sum_cleaned:
        sum_cleaned_gen += s + ' '

      sum_cleaned = []
      sum_cleaned.append(sum_cleaned_normal)
      sum_cleaned.append(sum_cleaned_gen)

      sentence_vectors_sum = []
      s = np.zeros((100,))
      for i in sum_cleaned:
        s = np.zeros((100,))
        if len(i) != 0:
          for w in i.split():
            s += self.word_embeddings.get(w,np.zeros((100,)))
          s = s/(len(i.split())+0.001)
        else:
          s = np.zeros((100,))
        sentence_vectors_sum.append(s)

      if len(sentence_vectors_sum) == 2:
        no_of_sum += 1
        cal_acu += self.cosine_sim_vectors(sentence_vectors_sum[0], sentence_vectors_sum[1])

    acc = cal_acu / no_of_sum
    print("The average cosine similarity is (when compared to TextRank summarizer from Gensim): ",acc)



  def test(self, text, ratio):
    tokenizer = RegexpTokenizer('\w+')
    words = tokenizer.tokenize(text)

    sentences = sent_tokenize(text)

    no_of_sen_present = len(sentences)

    nos = ratio * no_of_sen_present
    nos = int(nos)

    frequency = nltk.FreqDist(words)
    max_frequency = max(frequency.values())

    for word in frequency.keys():
      frequency[word] = frequency[word]/max_frequency

    sentence_scores = {}
    for x in sentences:
      if len(x.split()) < 50:
        for word in nltk.word_tokenize(x.lower()):
          if word in frequency.keys():
            sentence_scores[x] = sentence_scores.get(x,0) + frequency[word]

    summary_sentences = heapq.nlargest(nos, sentence_scores, key = sentence_scores.get)
    summary = ' '.join(summary_sentences[:])

    print(summary)


In [6]:
obWF = WordFrequency()
obWF.train()

The average cosine similarity is (when compared to TextRank summarizer from Gensim):  0.9418402206841024


In [7]:
class UpperCase:

  def remove_stopwords(self, sen):
    stop_words = stopwords.words('english')
    sen_new = " ".join([i for i in sen if i not in stop_words])
    return sen_new

  word_embeddings = {}
  f = open('glove.6B.100d.txt', encoding='utf-8')
  for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    word_embeddings[word] = coefs
  f.close()

  def cosine_sim_vectors(self, vec1, vec2):
    vec1 = vec1.reshape(1,-1)
    vec2 = vec2.reshape(1,-1)
    return cosine_similarity(vec1,vec2)[0][0]

  def text_preprocessing(self, sentences):
    clean_sentences = []
    for s in sentences:
      clean_sentences.append(s.replace("[^a-zA-Z0-9]", " "))
    clean_sentences = [self.remove_stopwords(r.split()) for r in clean_sentences]
    return clean_sentences

  def train(self):
    df = pd.read_csv("/content/drive/MyDrive/news_summary.csv", encoding = 'unicode_escape')

    no_of_sum = 0
    cal_acu = 0

    for l in range(len(df['ctext'])):
      try:
        text = df['ctext'][l]
        tokenizer = RegexpTokenizer('\w+')
        words = tokenizer.tokenize(text)

        sentences = sent_tokenize(text)

        nos = 0

        scores = {}
        prop_sent = []
        for s in sentences:
          sents = s.split('.')
          for sent in sents:
            nos += 1
            count = 0
            prop_sent.append(sent)
            for word in sent.split():
              sa = ''
              word = re.sub(r'[^a-zA-Z0-9]','',word)
              sa += word + ' '
              if len(word) > 1:
                if word[0].isupper():
                  count += 1
            dig = re.findall(r'\d+',sa)
            count += len(dig)
            scores[sent] = scores.get(sent, 0) + count
      
        nos = 0.2 * nos
        nos = int(nos)

        ranked_sentences = sorted(scores.items(), key=lambda kv:(kv[1], kv[0]), reverse = True)

        summary = ''
        for j in range(nos):
          summary += ranked_sentences[j][0] + '\n'

        summary_from_gensim = summarize(df['ctext'][l],ratio = 0.2)

        sum_lis = sent_tokenize(summary)
        sum_from_gen_lis = sent_tokenize(summary_from_gensim)

        nor_sum_cleaned = self.text_preprocessing(sum_lis)
        sum_cleaned_normal = ''
        for s in nor_sum_cleaned:
          sum_cleaned_normal += s + " "

        gen_sum_cleaned = self.text_preprocessing(sum_from_gen_lis)
        sum_cleaned_gen = ''
        for s in gen_sum_cleaned:
          sum_cleaned_gen += s + ' '

        sum_cleaned = []
        sum_cleaned.append(sum_cleaned_normal)
        sum_cleaned.append(sum_cleaned_gen)

        sentence_vectors_sum = []
        s = np.zeros((100,))
        for i in sum_cleaned:
          s = np.zeros((100,))
          if len(i) != 0:
            for w in i.split():
              s += self.word_embeddings.get(w,np.zeros((100,)))
            s = s/(len(i.split())+0.001)
          else:
            s = np.zeros((100,))
          sentence_vectors_sum.append(s)

        if len(sentence_vectors_sum) == 2:
          no_of_sum += 1
          cal_acu += self.cosine_sim_vectors(sentence_vectors_sum[0], sentence_vectors_sum[1])

      except:
        logging.error("Error Occurred :(")

    acc = cal_acu / no_of_sum
    print("The average cosine similarity is (when compared to TextRank summarizer from Gensim): ",acc)


  def test(self, text, ratio):
    tokenizer = RegexpTokenizer('\w+')
    words = tokenizer.tokenize(text)

    sentences = sent_tokenize(text)

    nos = 0

    scores = {}
    prop_sent = []
    for s in sentences:
      sents = s.split('.')
      for sent in sents:
        nos += 1
        count = 0
        prop_sent.append(sent)
        for word in sent.split():
          sa = ''
          word = re.sub(r'[^a-zA-Z0-9]','',word)
          sa += word + ' '
          if len(word) > 1:
            if word[0].isupper():
              count += 1
        dig = re.findall(r'\d+',sa)
        count += len(dig)
        scores[sent] = scores.get(sent, 0) + count
      
    nos = ratio * nos
    nos = int(nos)

    ranked_sentences = sorted(scores.items(), key=lambda kv:(kv[1], kv[0]), reverse = True)

    for j in range(nos):
      print(ranked_sentences[j][0])    


In [None]:
obUC = UpperCase()
obUC.train()

[1;30;43mStreaming output truncated to the last 5000 lines.[0m
ERROR:root:Error Occurred :(
ERROR:root:Error Occurred :(
ERROR:root:Error Occurred :(
ERROR:root:Error Occurred :(
ERROR:root:Error Occurred :(
ERROR:root:Error Occurred :(
ERROR:root:Error Occurred :(
ERROR:root:Error Occurred :(
ERROR:root:Error Occurred :(
ERROR:root:Error Occurred :(
ERROR:root:Error Occurred :(
ERROR:root:Error Occurred :(
ERROR:root:Error Occurred :(
ERROR:root:Error Occurred :(
ERROR:root:Error Occurred :(
ERROR:root:Error Occurred :(
ERROR:root:Error Occurred :(
ERROR:root:Error Occurred :(
ERROR:root:Error Occurred :(
ERROR:root:Error Occurred :(
ERROR:root:Error Occurred :(
ERROR:root:Error Occurred :(
ERROR:root:Error Occurred :(
ERROR:root:Error Occurred :(
ERROR:root:Error Occurred :(
ERROR:root:Error Occurred :(
ERROR:root:Error Occurred :(
ERROR:root:Error Occurred :(
ERROR:root:Error Occurred :(
ERROR:root:Error Occurred :(
ERROR:root:Error Occurred :(
ERROR:root:Error Occurred :(
ERROR:r

The average cosine similarity is (when compared to TextRank summarizer from Gensim):  0.7839259401269244


In [15]:
#@title Input the type of Algorithm and the Text that is to be summarized { run: "auto"}
Text = "When faced with significant uncertainty in the process of making a forecast or estimation, rather than just replacing the uncertain variable with a single average number, the Monte Carlo Simulation might prove to be a better solution by using multiple values.  Since business and finance are plagued by random variables, Monte Carlo simulations have a vast array of potential applications in these fields. They are used to estimate the probability of cost overruns in large projects and the likelihood that an asset price will move in a certain way.  Telecoms use them to assess network performance in different scenarios, helping them to optimize the network. Analysts use them to assess the risk that an entity will default, and to analyze derivatives such as options.  Insurers and oil well drillers also use them. Monte Carlo simulations have countless applications outside of business and finance, such as in meteorology, astronomy, and particle physics. Monte Carlo simulations are named after the popular gambling destination in Monaco, since chance and random outcomes are central to the modeling technique, much as they are to games like roulette, dice, and slot machines.  The technique was first developed by Stanislaw Ulam, a mathematician who worked on the Manhattan Project. After the war, while recovering from brain surgery, Ulam entertained himself by playing countless games of solitaire. He became interested in plotting the outcome of each of these games in order to observe their distribution and determine the probability of winning. After he shared his idea with John Von Neumann, the two collaborated to develop the Monte Carlo simulation. The basis of a Monte Carlo simulation is that the probability of varying outcomes cannot be determined because of random variable interference. Therefore, a Monte Carlo simulation focuses on constantly repeating random samples to achieve certain results.  A Monte Carlo simulation takes the variable that has uncertainty and assigns it a random value. The model is then run and a result is provided. This process is repeated again and again while assigning the variable in question with many different values. Once the simulation is complete, the results are averaged together to provide an estimate. One way to employ a Monte Carlo simulation is to model possible movements of asset prices using Excel or a similar program. There are two components to an asset's price movement: drift, which is a constant directional movement, and a random input, which represents market volatility.  By analyzing historical price data, you can determine the drift, standard deviation, variance, and average price movement of a security. These are the building blocks of a Monte Carlo simulation." #@param {type:"string"}
Algorithm = "Upper Case" #@param ["TextRank", "TextRank (Gensim) / LexRank", "Word Frequency", "Upper Case"]
Ratio = 0.1 #@param {type:"slider", min:0, max:1, step:0.1}

if Algorithm == 'TextRank':
  ob1 = TextRank()
  ob1.test(Text, Ratio)

elif Algorithm == 'TextRank (Gensim) / LexRank':
  print(summarize(Text, ratio=Ratio))

elif Algorithm == 'Word Frequency':
  ob2 = WordFrequency()
  ob2.test(Text, Ratio)

elif Algorithm == 'Upper Case':
  ob3 = UpperCase()
  ob3.test(Text, Ratio)


After he shared his idea with John Von Neumann, the two collaborated to develop the Monte Carlo simulation
The technique was first developed by Stanislaw Ulam, a mathematician who worked on the Manhattan Project
When faced with significant uncertainty in the process of making a forecast or estimation, rather than just replacing the uncertain variable with a single average number, the Monte Carlo Simulation might prove to be a better solution by using multiple values
One way to employ a Monte Carlo simulation is to model possible movements of asset prices using Excel or a similar program
