In [None]:
!pip install transformers bert-extractive-summarizer spacy

In [3]:
import pandas as pd
import numpy as np
import re

import nltk
from nltk.tokenize import sent_tokenize
from nltk.corpus import stopwords

from sklearn.metrics.pairwise import cosine_similarity

from summarizer import Summarizer, TransformerSummarizer

# 1. Summarization based on cosine similarity

In [4]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [5]:
df = pd.read_csv('/kaggle/input/medium-articles/articles.csv')

In [6]:
dataset=df['text']

In [7]:
dataset[1]

'If you’ve ever found yourself looking up the same question, concept, or syntax over and over again when programming, you’re not alone.\nI find myself doing this constantly.\nWhile it’s not unnatural to look things up on StackOverflow or other resources, it does slow you down a good bit and raise questions as to your complete understanding of the language.\nWe live in a world where there is a seemingly infinite amount of accessible, free resources looming just one search away at all times. However, this can be both a blessing and a curse. When not managed effectively, an over-reliance on these resources can build poor habits that will set you back long-term.\nPersonally, I find myself pulling code from similar discussion threads several times, rather than taking the time to learn and solidify the concept so that I can reproduce the code myself the next time.\nThis approach is lazy and while it may be the path of least resistance in the short-term, it will ultimately hurt your growth, p

## 1.1. Tokenization
- Split each article into sentences

In [8]:
articles = []
for article in dataset:
    articles.append(sent_tokenize(article))

In [9]:
len(articles)

337

In [10]:
articles[1]

['If you’ve ever found yourself looking up the same question, concept, or syntax over and over again when programming, you’re not alone.',
 'I find myself doing this constantly.',
 'While it’s not unnatural to look things up on StackOverflow or other resources, it does slow you down a good bit and raise questions as to your complete understanding of the language.',
 'We live in a world where there is a seemingly infinite amount of accessible, free resources looming just one search away at all times.',
 'However, this can be both a blessing and a curse.',
 'When not managed effectively, an over-reliance on these resources can build poor habits that will set you back long-term.',
 'Personally, I find myself pulling code from similar discussion threads several times, rather than taking the time to learn and solidify the concept so that I can reproduce the code myself the next time.',
 'This approach is lazy and while it may be the path of least resistance in the short-term, it will ultima

## 1.2. Word Embeddings
- GloVe

In [None]:
!wget http://nlp.stanford.edu/data/glove.twitter.27B.zip
!unzip glove*.zip

- The open function opens 'glove.twitter.27B.200d.txt', the file of pre-trained word embeddings, in read mode. The encoding parameter is set to 'utf-8' so the file is decoded correctly.

- The script then loops over the file line by line. Each line holds one word followed by its embedding. The split method breaks the line into a list of strings: the first string is the word, and the remaining 200 strings are the components of its embedding.

- The numpy function asarray converts those strings into a numpy array. The dtype parameter is set to 'float32' so that each component is parsed as a 32-bit floating-point number.

- The word and its embedding are then stored in the word_embeddings dictionary.
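The steps above can be tried without downloading the full archive. The following is a minimal sketch, assuming a tiny made-up sample in the same text format. The `SAMPLE` string and the `load_embeddings` helper are hypothetical and not part of this notebook, and plain Python lists stand in for NumPy arrays so that it runs with the standard library only.

```python
import io

# Hypothetical three-line sample in GloVe's text format: "<token> v1 v2 ... vN".
# The real file holds 200 values per line; 4 are used here to keep the sketch short.
SAMPLE = (
    "<user> 0.31553 0.53765 0.10177 0.032553\n"
    "the 0.1 -0.2 0.3 -0.4\n"
    "learning -0.5 0.25 0.0 1.5\n"
)


def load_embeddings(handle):
    """Parse GloVe-style lines into a {token: list of floats} mapping."""
    embeddings = {}
    for line in handle:
        values = line.split()
        if not values:  # tolerate blank lines
            continue
        # First field is the token, the rest are the embedding components.
        embeddings[values[0]] = [float(v) for v in values[1:]]
    return embeddings


word_vectors = load_embeddings(io.StringIO(SAMPLE))
print(len(word_vectors), word_vectors["the"])
```

Passing an open file handle instead of the `io.StringIO` object would parse a real file the same way.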

In [11]:
word_embeddings = {}
# Each line holds a token followed by its 200 embedding values.
with open('glove.twitter.27B.200d.txt', encoding='utf-8') as f:
    for line_number, line in enumerate(f, start=1):
        values = line.split()
        word = values[0]
        embeddings = np.asarray(values[1:], dtype='float32')
        word_embeddings[word] = embeddings
        # Report progress occasionally; the file has over a million lines.
        if line_number % 100000 == 0:
            print(f'Finished {line_number} lines.')

In [12]:
len(word_embeddings)

1193514

In [13]:
first_element = next(iter(word_embeddings))
print(first_element, word_embeddings[first_element])

<user> [ 3.1553e-01  5.3765e-01  1.0177e-01  3.2553e-02  3.7980e-03  1.5364e-02
 -2.0344e-01  3.3294e-01 -2.0886e-01  1.0061e-01  3.0976e-01  5.0015e-01
  3.2018e-01  1.3537e-01  8.7039e-03  1.9110e-01  2.4668e-01 -6.0752e-02
 -4.3623e-01  1.9302e-02  5.9972e-01  1.3444e-01  1.2801e-02 -5.4052e-01
  2.7387e-01 -1.1820e+00 -2.7677e-01  1.1279e-01  4.6596e-01 -9.0685e-02
  2.4253e-01  1.5654e-01 -2.3618e-01  5.7694e-01  1.7563e-01 -1.9690e-02
  1.8295e-02  3.7569e-01 -4.1984e-01  2.2613e-01 -2.0438e-01 -7.6249e-02
  4.0356e-01  6.1582e-01 -1.0064e-01  2.3318e-01  2.2808e-01  3.4576e-01
 -1.4627e-01 -1.9880e-01  3.3232e-02 -8.4885e-01 -2.5684e-01  2.6369e-01
  2.9562e-01  1.8470e-01 -2.0668e-01 -1.3297e-02  1.2233e-01 -4.7751e-01
 -1.7202e-01 -1.4577e-01  4.7446e-02 -1.5824e-01  5.4215e-02 -1.9426e-01
 -8.1484e-02  9.9009e-02  1.0159e-01  4.3571e-02  5.0245e-01  1.3362e-01
  6.5985e-02  3.2969e-02 -2.0170e-01 -5.6905e-01 -1.3203e-01  7.3347e-02
 -6.3728e-02 -2.7960e-01 -3.8481e-01 -2.0193

## 1.3. Preprocessing embeddings and articles

In [14]:
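cleaned_word_embeddings = {}
for word, embedding in word_embeddings.items():
    # Strip punctuation from the token, e.g. '<user>' becomes 'user'.
    # Tokens that clean to the same key overwrite one another (the last one wins).
    cleaned_word = re.sub(r'[^\w\s]', '', word)
    cleaned_word_embeddings[cleaned_word] = embedding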

In [15]:
first_element = next(iter(cleaned_word_embeddings))
print(first_element, cleaned_word_embeddings[first_element])

user [ 4.2766e-01  2.4532e-01 -8.4922e-01  3.1648e-01  1.6538e-01 -5.8628e-01
  8.5115e-01  6.3838e-03  6.5366e-02 -7.1155e-01 -2.2464e-01 -1.3552e-01
 -1.5848e-01 -1.0465e+00 -2.5073e-01 -3.2956e-01  5.8865e-01  7.2249e-02
  1.4451e-01 -1.7732e-01 -1.0752e-03  1.0197e-01  7.3183e-02  3.2360e-01
 -8.7387e-01 -1.0808e+00  3.3931e-01 -1.1672e-01  1.0462e-01  1.0419e+00
  3.6597e-01  2.0681e-02  1.3460e-01 -9.4696e-01 -2.3978e-02 -8.2389e-01
  9.6535e-02 -1.3870e-01 -5.2832e-01  3.9776e-01 -8.4624e-01  4.7066e-01
 -2.5022e-01  7.7712e-01 -4.3216e-01  6.6454e-02 -3.5458e-01 -4.9915e-01
  2.3011e-01  1.0778e-01 -1.4022e-01  5.0438e-01 -5.8577e-01 -2.7001e-01
  1.9374e-01  4.2087e-01  1.4654e-02  1.0986e+00 -8.2837e-01 -8.0830e-01
  1.5268e-02 -7.7748e-01  8.1273e-01  3.4014e-01  3.3876e-01  5.4724e-02
  2.0892e-01  3.0737e-01  1.4706e-01  4.2437e-01  1.1422e-01 -1.6948e-01
  2.4700e-01 -5.4657e-01 -2.7471e-01 -4.4670e-01  1.4814e-02 -1.7289e-01
 -6.4120e-02 -8.3405e-01 -2.6884e-01  1.0085e+
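The vector printed for user here differs from the one printed earlier for <user>, which suggests that both tokens were mapped to the same key and the later one replaced the earlier. The following sketch reproduces that effect on a made-up four-token vocabulary and shows one possible alternative that drops empty keys and prefers tokens that needed no cleaning. The function names and toy vectors are hypothetical; this is not the notebook's method.

```python
import re

# Hypothetical miniature vocabulary; the vectors are made up for illustration.
raw = {
    "<user>": [1.0, 0.0],
    ".": [0.0, 1.0],
    "user": [0.5, 0.5],
    "!": [0.2, 0.2],
}


def clean_keys_overwrite(embeddings):
    """Strip punctuation from keys; later duplicates overwrite earlier ones."""
    cleaned = {}
    for word, vector in embeddings.items():
        cleaned[re.sub(r"[^\w\s]", "", word)] = vector
    return cleaned


def clean_keys_prefer_exact(embeddings):
    """Skip empty keys and let a token that needed no cleaning win a collision."""
    cleaned = {}
    for word, vector in embeddings.items():
        key = re.sub(r"[^\w\s]", "", word)
        if not key:
            continue  # pure-punctuation tokens such as "." or "!"
        if key == word or key not in cleaned:
            cleaned[key] = vector
    return cleaned


overwritten = clean_keys_overwrite(raw)
preferred = clean_keys_prefer_exact(raw)
print(sorted(overwritten), overwritten["user"])
print(sorted(preferred), preferred["user"])
```

In the first mapping, the empty string ends up as a key holding whichever punctuation token came last.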

In [16]:
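cleaned_articles = []
for article in articles[:70]:  # only the first 70 articles are processed
    cleaned_sentences = []
    for sentence in article:
        # Replace backslash escapes and every character other than word
        # characters, straight apostrophes and spaces with a space.
        cleaned_sentence = re.sub(r'\\.|[^\'\w ]', ' ', sentence)
        cleaned_sentences.append(cleaned_sentence)
    cleaned_articles.append(cleaned_sentences)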

In [17]:
cleaned_articles[1]

['If you ve ever found yourself looking up the same question  concept  or syntax over and over again when programming  you re not alone ',
 'I find myself doing this constantly ',
 'While it s not unnatural to look things up on StackOverflow or other resources  it does slow you down a good bit and raise questions as to your complete understanding of the language ',
 'We live in a world where there is a seemingly infinite amount of accessible  free resources looming just one search away at all times ',
 'However  this can be both a blessing and a curse ',
 'When not managed effectively  an over reliance on these resources can build poor habits that will set you back long term ',
 'Personally  I find myself pulling code from similar discussion threads several times  rather than taking the time to learn and solidify the concept so that I can reproduce the code myself the next time ',
 'This approach is lazy and while it may be the path of least resistance in the short term  it will ultima
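In the output above, "you’ve" has become "you ve". The character class in the regular expression keeps the straight apostrophe ('), but the articles use the curly one (’, U+2019), so contractions are split in two. The sketch below shows the pattern's behaviour on a made-up sentence. The `clean` helper and the sample strings are hypothetical, and normalizing the quotes first is only one possible option, not something this notebook does.

```python
import re

# Same pattern as the cleaning cell: a backslash plus one character, or any
# character that is not a straight apostrophe, a word character or a space.
PATTERN = r"\\.|[^'\w ]"


def clean(text):
    """Replace every match of PATTERN with a single space."""
    return re.sub(PATTERN, " ", text)


sample = "you\u2019ve found it, it's long-term."

# The curly apostrophe is replaced, the straight one survives.
print(repr(clean(sample)))

# Optional normalization step (an assumption, not in the notebook).
print(repr(clean(sample.replace("\u2019", "'"))))

# A literal backslash followed by a character is replaced as one unit.
print(repr(clean("line\\nbreak")))
```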

In [18]:
stop_words = set(stopwords.words('english'))

In [19]:
filtered_articles = []
for article in cleaned_articles:
    filtered_sentences = []
    for sentence in article:
        words = sentence.split()
        filtered_words = [word for word in words if word.lower() not in stop_words]
        filtered_sentence = ' '.join(filtered_words)
        filtered_sentences.append(filtered_sentence)
    filtered_articles.append(filtered_sentences)

## 1.4. Get vector representations for each sentence in the articles

In [20]:
sentence_vectors = [] 
for article in filtered_articles:
    article_vectors = []
    for sentence in article:
        words = sentence.split()
        sentence_vector = np.zeros((200,))
        if len(words) != 0:
            for word in words:
                if word in cleaned_word_embeddings:
                    sentence_vector += cleaned_word_embeddings[word]
            sentence_vector /= len(words)
        article_vectors.append(sentence_vector)
    sentence_vectors.append(article_vectors)
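The cell above sums the embeddings of the words it finds in the vocabulary but divides by the number of all words in the sentence, and it looks words up with their original casing. The following is a sketch of a variant that averages over the matched words only and lowercases each word before the lookup, assuming the embedding vocabulary is lowercase. The `sentence_vector` helper and the toy dictionary are hypothetical, and plain lists are used instead of NumPy arrays to keep it dependency-free.

```python
def sentence_vector(sentence, embeddings, dim):
    """Average the embeddings of the words that have one; zeros if none do."""
    total = [0.0] * dim
    matched = 0
    for word in sentence.split():
        vector = embeddings.get(word.lower())
        if vector is None:
            continue  # out-of-vocabulary word: contributes nothing
        total = [t + v for t, v in zip(total, vector)]
        matched += 1
    if matched == 0:
        return total
    return [t / matched for t in total]


# Made-up 2-dimensional embeddings for illustration.
toy = {"deep": [1.0, 0.0], "learning": [0.0, 1.0]}
print(sentence_vector("Deep learning rocks", toy, dim=2))
```

Dividing by the matched count keeps sentences with many out-of-vocabulary words from being shrunk toward zero, although cosine similarity itself is unaffected by that scaling.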

In [21]:
print(f'vector representation of an article with {len(cleaned_articles[1])} sentences also has {len(sentence_vectors[1])} representations.')

vector representation of an article with 65 sentences also has 65 representations.


In [22]:
len(sentence_vectors[1])

65

In [23]:
sentence_vectors[1]

[array([ 2.76859627e-01, -1.79653007e-01, -2.06765747e-01,  7.78704993e-02,
        -2.52209749e-01,  3.43582495e-01,  3.50889874e-01, -1.87926875e-01,
         7.02167503e-02, -2.34442750e-01, -1.95482569e-02,  5.71268792e-02,
        -4.39225007e-01, -2.89839924e-01, -1.66169127e-02,  1.61749683e-03,
         5.53308732e-02,  2.61640999e-01, -7.01184010e-02,  4.18943753e-02,
        -5.88669981e-02, -1.18537499e-01, -8.36942503e-02, -6.12068810e-02,
        -1.22372879e-01,  8.75753734e-01, -1.81976248e-02,  2.08842999e-01,
        -8.20212299e-03,  8.54892423e-05,  7.84571264e-02, -7.82874981e-02,
        -1.08399185e-01, -4.16727757e-01, -6.32912517e-02, -1.86994131e-01,
         6.67008734e-02,  6.51374180e-03,  2.85458752e-01, -6.61124662e-03,
         4.90240493e-01,  1.65803026e-01, -1.45162921e-03,  2.54510215e-01,
         1.13638001e-01,  1.22177377e-01, -1.18621876e-01,  6.26288750e-02,
         2.83369469e-01, -2.41953877e-01,  1.96700003e-01,  1.80530746e-01,
        -6.0

In [24]:
flat_sentence_vectors = [vec for sublist in sentence_vectors for vec in sublist]

## 1.5. Calculate cosine similarity and summarize

In [25]:
similarity_matrix = cosine_similarity(flat_sentence_vectors)

In [26]:
similarity_matrix

array([[1.        , 0.22634313, 0.32236545, ..., 0.2595537 , 0.        ,
        0.21587905],
       [0.22634313, 1.        , 0.43875783, ..., 0.55118634, 0.        ,
        0.2447273 ],
       [0.32236545, 0.43875783, 1.        , ..., 0.44802914, 0.        ,
        0.20903227],
       ...,
       [0.2595537 , 0.55118634, 0.44802914, ..., 1.        , 0.        ,
        0.33908381],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.21587905, 0.2447273 , 0.20903227, ..., 0.33908381, 0.        ,
        1.        ]])
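The all-zero row and column in the matrix above, including the 0 on the diagonal, are what one gets for a sentence whose vector is entirely zero, for example a sentence with no words left after filtering. For reference, the similarity of two vectors is $\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$. The sketch below computes it in plain Python and returns 0 when either norm is zero, which is consistent with the output shown. The helper names and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def pairwise_cosine(vectors):
    """Dense matrix of pairwise cosine similarities."""
    return [[cosine(u, v) for v in vectors] for u in vectors]


# The third vector is all zeros, like an empty sentence.
toy_vectors = [[1.0, 0.0], [1.0, 1.0], [0.0, 0.0]]
toy_matrix = pairwise_cosine(toy_vectors)
for row in toy_matrix:
    print([round(x, 3) for x in row])
```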

In [27]:
import networkx as nx

nx_graph = nx.from_numpy_array(similarity_matrix)
scores = nx.pagerank(nx_graph)
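Here each sentence is a node, each similarity is an edge weight, and PageRank assigns higher scores to sentences that are similar to many other well-connected sentences. The following is a rough sketch of the power iteration behind PageRank on a small made-up weight matrix. It is not networkx's implementation; the `pagerank` function below is hypothetical, and its damping factor of 0.85 is chosen to match the default `alpha` of `nx.pagerank`.

```python
def pagerank(weights, damping=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank on a dense, non-negative weight matrix."""
    n = len(weights)
    row_sums = [sum(row) for row in weights]
    rank = [1.0 / n] * n
    for _ in range(max_iter):
        # Nodes with no outgoing weight spread their rank uniformly.
        dangling = sum(rank[i] for i in range(n) if row_sums[i] == 0.0)
        new_rank = []
        for j in range(n):
            incoming = sum(
                rank[i] * weights[i][j] / row_sums[i]
                for i in range(n)
                if row_sums[i] > 0.0
            )
            new_rank.append((1.0 - damping) / n + damping * (incoming + dangling / n))
        delta = sum(abs(a - b) for a, b in zip(new_rank, rank))
        rank = new_rank
        if delta < tol:
            break
    return rank


# Made-up similarities: sentence 0 is strongly similar to both others.
toy_weights = [
    [0.0, 0.9, 0.8],
    [0.9, 0.0, 0.1],
    [0.8, 0.1, 0.0],
]
toy_ranks = pagerank(toy_weights)
print([round(r, 3) for r in toy_ranks])
```

The scores sum to 1, and the sentence with the strongest overall connections receives the largest share.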

In [28]:
# Summarize each article
# scores is keyed by position in flat_sentence_vectors, so each article's
# sentences start at a running offset rather than at index 0.
article_summaries = []
offset = 0
for i, article in enumerate(articles[:9]):
    article_summary = ''
    ranked_sentences = sorted(((scores[offset + j], sentence) for j, sentence in enumerate(article)), reverse=True)
    for j in range(min(20, len(ranked_sentences))):  # Select up to the top 20 sentences as summary
        article_summary += ranked_sentences[j][1] + ' '
    print("article", i, article_summary, "\n")
    article_summaries.append(article_summary)
    offset += len(article)

article 0 Some platforms provide a bit of NLP, but even the best is at toddler-level capacity (for example, think about Siri understanding your words, but not their meaning.) By pitting two such disparate concepts against one another (instead of seeing them as separate entities designed to serve different purposes) we discouraged bot development. Building a bot for the sake of it, letting it loose and hoping for the best will never end well:
The vast majority of bots are built using decision-tree logic, where the bot’s canned response relies on spotting specific keywords in the user input. Sure, there are some concepts that we can only express using language (“show me all the ways of getting to a museum that give me 2000 steps but don’t take longer than 35 minutes”), but most tasks can be carried out more efficiently and intuitively with GUIs than with a conversational UI. Today’s rule-based dialogue systems are too brittle to deal with this kind of unpredictability, and statistical ap
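Because the selected sentences are emitted from highest to lowest score, the summary can jump around the article. A common refinement, not used in this notebook, is to pick the top sentences by score and then print them in their original order. The sketch below illustrates this; the `summarize` helper, the sample sentences and the toy scores are all hypothetical.

```python
def summarize(sentences, sentence_scores, top_n=3):
    """Pick the top_n highest-scoring sentences, then restore document order."""
    top_n = min(top_n, len(sentences))
    # Rank sentence indices by score, highest first; ties keep the earlier sentence.
    ranked = sorted(range(len(sentences)), key=lambda i: -sentence_scores[i])
    chosen = sorted(ranked[:top_n])
    return " ".join(sentences[i] for i in chosen)


toy_sentences = ["Intro.", "Key point one.", "Aside.", "Key point two.", "Outro."]
toy_scores = [0.10, 0.30, 0.05, 0.35, 0.20]
print(summarize(toy_sentences, toy_scores))
```

Capping `top_n` at the number of sentences also avoids an index error for articles shorter than the requested summary length.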

# 2. Using a pretrained model

In [29]:
article_summaries = []
model = TransformerSummarizer(transformer_type="XLNet", transformer_model_key="xlnet-base-cased")

for i, article in enumerate(dataset[:9]):
    # The summarizer returns the summary as a single string.
    article_summary = model(article)
    print("article", i, article_summary, "\n")
    article_summaries.append(article_summary)



article 0 Oh, how the headlines blared:
Chatbots were The Next Big Thing. All the road signs pointed towards insane success. Because there isn’t even an ecosystem for a platform to dominate. Chatbots weren’t the first technological development to be talked up in grandiose terms and then slump spectacularly. Digit’s Ethan Bloch sums up the general consensus:
According to Dave Feldman, Vice President of Product Design at Heap, chatbots didn’t just take on one difficult problem and fail: they took on several and failed all of them. Bots can interface with users in different ways. We became entranced by windows, mouse clicks, icons. Here’s an example dialog (dating back to the 1990s) with VCR setup system:
Pretty cool, right? A great bot can be about as useful as an average app. And that’s precisely their disadvantage, too. In an ideal world, the technology known as NLP (natural language processing) should allow a chatbot to understand the messages it receives. As Matt Asay outlines, this 



article 1 If you’ve ever found yourself looking up the same question, concept, or syntax over and over again when programming, you’re not alone. This approach is lazy and while it may be the path of least resistance in the short-term, it will ultimately hurt your growth, productivity, and ability to recall syntax (cough, interviews) down the line. Note that the list function simply converts the output to list type. For creating quick and easy Numpy arrays, look no further than the arange and linspace functions. Along with a starting and stopping point, you can also define a step size or data type if necessary. Linspace is very similar, but with a slight twist. Linspace returns evenly spaced numbers over a specified interval. If not, then you surely will at some point. Let’s use the example of dropping a column for now:
I don’t know how many times I wrote this line of code before I actually knew why I was declaring axis what I was. If you think about how this is indexed in Python, rows 



article 2 Machine learning is increasingly moving from hand-designed models to automatically optimized pipelines using tools such as H20, TPOT, and auto-sklearn. Let’s look at a few examples to see these concepts in action. A transformation acts on a single table (thinking in terms of Python, a table is just a Pandas DataFrame ) by creating new features out of one or more of the existing columns. Here’s how we would do that in Python using the language of Pandas. Although Pandas is a great resource, there’s only so much data manipulation we want to do by hand! ( Deep feature synthesis stacks multiple transformation and aggregation operations (which are called feature primitives in the vocab of featuretools) to create features from data spread across many tables. Like most ideas in machine learning, it’s a complex method built on a foundation of simple concepts. By learning one building block at a time, we can form a good understanding of this powerful method. Each entity must have an i



article 3 If your understanding of A.I. and Machine Learning is a big question mark, then this is the blog post for you. However, if you find yourself at the bottom of this article, you’ve earned your well-rounded knowledge and passion for this new world. Fun, but programmers aren’t always gifted in programming A.I. as we often see. You don’t need to come up with advanced algorithms anymore. So how does something like that even work? As a human, we all understand how to play a side-scroller, but identifying the predictive strategy of the resulting A.I. is insane. There’s something amazing about this idea, right? Firstly, Machine Learning (ML) is making computers do things that we’ve never made computers do before. I think Christian Heilmann said it best in his talk on ML. Everyone provides a caveat that this course is tough. Everyone should be impressed if you make it through because that’s not simple. If you’re not interested in writing the algorithms, but you want to use them to crea



article 4 Want to learn about applied Artificial Intelligence from leading practitioners in Silicon Valley, New York, or Toronto? Learn more about the Insight Artificial Intelligence Fellows Program. Are you a company working in AI and would like to get involved in the Insight AI Fellows Program? Many of the visuals are from the slides of the talk, and some are new. Deep RL adds the dimension of actions that influence the environment (what is the goal, and how do I get there?). On the other hand, Deep Reinforcement Learning focuses on the right sequences of sentences that will lead to a positive outcome, for example a happy customer. This makes Deep RL particularly attractive for tasks that require planning and adaptation, such as manufacturing or self-driving. A major reason is that Deep RL often requires an agent to experiment millions of times before learning anything useful. The best way to do this rapidly is by using a simulation environment. This is a classic problem called Multi



article 5 The advent of powerful and versatile deep learning frameworks in recent years has made it possible to implement convolution layers into a deep learning model an extremely simple task, often achievable in a single line of code. Do take note of this, as it’ll be critical to our later discussion. Before we move on, it’s definitely worth looking into two techniques that are commonplace in convolution layers: Padding and Strides. A stride of 1 means to pick slides a pixel apart, so basically every single slide, acting as a standard convolution. In practicality, most input images have 3 channels, and that number only increases the deeper you go into a network. The bias gets added to the output channel so far to produce the final output channel. If we were to use a kernel K of size 3 on the reshaped 4×4 input to get a 2×2 output, the equivalent transformation matrix would be:
(Note: while the above matrix is an equivalent transformation matrix, the actual operation is usually implem



article 6 There is an ongoing debate about whether or not designers should write code. Put simply, machine learning is a “field of study that gives computers the ability to learn without being explicitly programmed” (Arthur Samuel, 1959). Now is the best time to learn about machine learning and apply it to the products you are building. Machine learning can help create user-centric products by personalizing experiences to the individuals who use them. Knowing this, we can help prepare for a user’s next action. Depending on the application and what data is available, there are different types of machine learning algorithms to choose from. Labeled data is a group of examples that has informative tags or outputs. number of bedrooms, location) and its price. If the output we are trying to predict is a number we call it regression. Unsupervised learning is helpful when we have unlabeled data or we are not exactly sure what outputs (like an image’s hashtags or a house’s price) are meaningful



article 7 Data science interviews certainly aren’t easy. Long story short, I’ve decided to sort through all my bookmarks and notes in order to deliver a comprehensive list of data science resources. These were some of my favorite full-coverage questions to practice with right before an interview. As far as language goes, most companies will let you use whatever language you want. It may come up as a conceptual question regarding cross validation or bias-variance tradeoff, or it may take the form of a take home assignment with a dataset attached. Most interviews will have atleast one section solely dedicated to product thinking which often lends itself to A/B testing of some sort. Make sure your familiar with the concepts and statistical background necessary in order to be prepared when it comes up. If you only check out one section here, this is the one to focus on. Lastly, this post is part of an ongoing initiative to ‘open-source’ my experience applying and interviewing at data scien



article 8 Information theory is an important field that has made significant contribution to deep learning and AI, and yet is unknown to many. Some examples of concepts in AI that come from Information theory or related fields:
In the early 20th century, scientists and engineers were struggling with the question: “How to quantify the information? How can we quantify the difference between two sentences? Semantics, domain and form of data only added to the complexity of the problem. So, we can say that exp 1 is inherently more uncertain/unpredictable than exp 2. The probability distribution of experiment is used to calculate the entropy. A deterministic experiment, which is completely predictable, say tossing a coin with P(H)=1, has entropy zero. It tells us how similar two distributions are. Cross entropy between two probability distributions p and q defined over same set of outcomes is given by:
Mutual information is a measure of mutual dependency between two probability distributions

