# Scientific Paper Summarizer

The script aims to find and summarize papers based on a specific

Process:
1. Download the paper
2. Convert the PDF to text
3. Feed the text to the GPT-3 model using the OpenAI API
4. Show the summary

In [None]:
from newspaper import Article

url = 'https://www.sciencedaily.com/releases/2021/08/210811162816.htm'
article = Article(url)
article.download()
article.parse()

## Models

In [None]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation
from heapq import nlargest

def summarize(text, per):
    """Return an extractive summary holding roughly `per` (0-1) of the sentences."""
    nlp = spacy.load('en_core_web_sm')
    doc = nlp(text)

    # Count how often each non-stop-word, non-punctuation token occurs.
    # Keys are lower-cased so they match the lower-cased lookups below.
    word_frequencies = {}
    for word in doc:
        key = word.text.lower()
        if key in STOP_WORDS or key in punctuation or not key.strip():
            continue
        word_frequencies[key] = word_frequencies.get(key, 0) + 1
    if not word_frequencies:
        return ''

    # Normalize counts so the most frequent word has weight 1.
    max_frequency = max(word_frequencies.values())
    for word in word_frequencies:
        word_frequencies[word] /= max_frequency

    # Score each sentence by the summed weights of its words.
    sentence_tokens = list(doc.sents)
    sentence_scores = {}
    for sent in sentence_tokens:
        for word in sent:
            key = word.text.lower()
            if key in word_frequencies:
                sentence_scores[sent] = sentence_scores.get(sent, 0) + word_frequencies[key]

    # Keep the top-scoring fraction of sentences, but always at least one.
    select_length = max(1, int(len(sentence_tokens) * per))
    summary = nlargest(select_length, sentence_scores, key=sentence_scores.get)
    # Join with spaces so adjacent sentences do not run together.
    return ' '.join(sent.text.strip() for sent in summary)
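
The scoring idea behind `summarize` can be illustrated without spaCy. The following is a minimal sketch using only the standard library; the names `STOP`, `split_sentences`, and `frequency_summary` are hypothetical, the stop-word list is a tiny stand-in for spaCy's, and the regex sentence splitter is far cruder than `doc.sents`. Unlike `summarize`, it returns the selected sentences in their original document order.

```python
import re
from collections import Counter
from heapq import nlargest

# Tiny illustrative stop-word list (an assumption); spaCy's STOP_WORDS is far larger.
STOP = {"the", "a", "an", "of", "and", "is", "in", "to", "it", "on"}


def split_sentences(text):
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def frequency_summary(text, ratio=0.3):
    sentences = split_sentences(text)
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP]
    if not sentences or not words:
        return ""

    # Normalize word counts so the most frequent word has weight 1.
    counts = Counter(words)
    top = max(counts.values())
    weights = {w: c / top for w, c in counts.items()}

    # Score each sentence by the summed weights of its words.
    scores = {
        i: sum(weights.get(w, 0.0) for w in re.findall(r"[a-z']+", s.lower()))
        for i, s in enumerate(sentences)
    }

    # Keep the best-scoring sentences, then restore document order.
    keep = max(1, int(len(sentences) * ratio))
    chosen = sorted(nlargest(keep, scores, key=scores.get))
    return " ".join(sentences[i] for i in chosen)


print(frequency_summary("Cats chase mice. Cats sleep all day. Dogs bark.", 0.4))
```

Because scores are plain sums, longer sentences tend to win; dividing by sentence length is one possible adjustment if that bias matters.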

In [None]:
#!/usr/bin/env python
# coding: utf-8
import re

import nltk
from nltk.corpus import stopwords
from nltk.cluster.util import cosine_distance
import numpy as np
import networkx as nx


def read_article(file_name):
    """Read a text file and return its sentences as lists of words."""
    with open(file_name, "r") as file:
        text = file.read()

    sentences = []
    for sentence in text.split(". "):
        # Replace every non-letter with a space, then split on whitespace.
        # str.replace() treats its pattern literally, so re.sub() is needed here.
        words = re.sub("[^a-zA-Z]", " ", sentence).split()
        if words:
            sentences.append(words)

    return sentences

def sentence_similarity(sent1, sent2, stopwords=None):
    if stopwords is None:
        stopwords = []
 
    sent1 = [w.lower() for w in sent1]
    sent2 = [w.lower() for w in sent2]
 
    all_words = list(set(sent1 + sent2))
 
    vector1 = [0] * len(all_words)
    vector2 = [0] * len(all_words)
 
    # build the vector for the first sentence
    for w in sent1:
        if w in stopwords:
            continue
        vector1[all_words.index(w)] += 1
 
    # build the vector for the second sentence
    for w in sent2:
        if w in stopwords:
            continue
        vector2[all_words.index(w)] += 1
 
    return 1 - cosine_distance(vector1, vector2)
 
def build_similarity_matrix(sentences, stop_words):
    # Create an empty similarity matrix
    similarity_matrix = np.zeros((len(sentences), len(sentences)))
 
    for idx1 in range(len(sentences)):
        for idx2 in range(len(sentences)):
            if idx1 == idx2: #ignore if both are same sentences
                continue 
            similarity_matrix[idx1][idx2] = sentence_similarity(sentences[idx1], sentences[idx2], stop_words)

    return similarity_matrix


def generate_summary(file_name, top_n=5):
    nltk.download("stopwords")
    stop_words = stopwords.words('english')
    summarize_text = []

    # Step 1 - Read the text and split it into sentences
    sentences = read_article(file_name)

    # Step 2 - Generate the similarity matrix across sentences
    sentence_similarity_matrix = build_similarity_matrix(sentences, stop_words)

    # Step 3 - Rank sentences in the similarity matrix
    sentence_similarity_graph = nx.from_numpy_array(sentence_similarity_matrix)
    scores = nx.pagerank(sentence_similarity_graph)

    # Step 4 - Sort by rank and pick the top sentences
    ranked_sentence = sorted(((scores[i], s) for i, s in enumerate(sentences)), reverse=True)
    print("Ranked sentences, highest score first: ", ranked_sentence)

    # Never ask for more sentences than the article contains
    for i in range(min(top_n, len(ranked_sentence))):
        summarize_text.append(" ".join(ranked_sentence[i][1]))

    # Step 5 - Output the summarized text
    print("Summarized text: \n", ". ".join(summarize_text))

# let's begin
generate_summary("msft.txt", 2)
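
The ranking step above delegates to `nx.pagerank`. To show what that step computes, here is a minimal pure-Python sketch of cosine similarity and PageRank by power iteration. It is an illustration under stated assumptions, not networkx's implementation: the function names are hypothetical, dangling rows are spread evenly across all nodes, and the damping factor of 0.85 simply mirrors the default `alpha` in networkx.

```python
import math


def cosine_similarity(a, b):
    # Returns 0.0 for all-zero vectors instead of dividing by zero.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def pagerank(matrix, damping=0.85, iterations=100, tol=1e-9):
    """Rank the nodes of a weighted graph given as a square list of lists."""
    n = len(matrix)
    if n == 0:
        return []
    ranks = [1.0 / n] * n
    row_sums = [sum(row) for row in matrix]
    for _ in range(iterations):
        new = []
        for j in range(n):
            incoming = 0.0
            for i in range(n):
                if row_sums[i] > 0:
                    # Node i passes rank along its edges in proportion to weight.
                    incoming += ranks[i] * matrix[i][j] / row_sums[i]
                else:
                    # Dangling node: spread its rank evenly over all nodes.
                    incoming += ranks[i] / n
            new.append((1 - damping) / n + damping * incoming)
        converged = sum(abs(x - y) for x, y in zip(new, ranks)) < tol
        ranks = new
        if converged:
            break
    return ranks


# A star graph: node 0 is similar to both other nodes, which are unrelated.
star = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]
print(pagerank(star))
```

In the summarizer, `matrix[i][j]` would hold the similarity between sentences `i` and `j`, so a sentence that resembles many others accumulates the highest rank.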

In [None]:
import openai
import wget
import pathlib
import pdfplumber
import numpy as np

In [2]:
!pip install openai
!pip install wget
!pip install pdfplumber

Collecting openai
  Downloading openai-0.19.0.tar.gz (42 kB)
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
    Preparing wheel metadata: started
    Preparing wheel metadata: finished with status 'done'
Collecting pandas-stubs>=1.1.0.11
  Downloading pandas_stubs-1.2.0.60-py3-none-any.whl (162 kB)
Collecting openpyxl>=3.0.7
  Downloading openpyxl-3.0.10-py2.py3-none-any.whl (242 kB)
Collecting et-xmlfile
  Downloading et_xmlfile-1.1.0-py3-none-any.whl (4.7 kB)
Building wheels for collected packages: openai
  Building wheel for openai (PEP 517): started
  Building wheel for openai (PEP 517): finished with status 'done'
  Created wheel for openai: filename=openai-0.19.0-py3-none-any.whl size=53511 sha256=62b655bb7b93ea5880994c1c0d42fae03c5ea607ee023a3b66fc5aeb90728114
  Stored in directory: c:\users\aron gosch\appdata\lo

You should consider upgrading via the 'c:\users\aron gosch\appdata\local\programs\python\python39\python.exe -m pip install --upgrade pip' command.


Collecting wget
  Downloading wget-3.2.zip (10 kB)
Using legacy 'setup.py install' for wget, since package 'wheel' is not installed.
Installing collected packages: wget
    Running setup.py install for wget: started
    Running setup.py install for wget: finished with status 'done'
Successfully installed wget-3.2



Collecting pdfplumber
  Downloading pdfplumber-0.7.1-py3-none-any.whl (39 kB)
Collecting Wand>=0.6.7
  Downloading Wand-0.6.7-py2.py3-none-any.whl (139 kB)
Collecting pdfminer.six==20220524
  Downloading pdfminer.six-20220524-py3-none-any.whl (5.6 MB)


Collecting Pillow>=9.1
  Downloading Pillow-9.1.1-cp39-cp39-win_amd64.whl (3.3 MB)
Installing collected packages: Wand, Pillow, pdfminer.six, pdfplumber
  Attempting uninstall: Pillow
    Found existing installation: Pillow 9.0.1
    Uninstalling Pillow-9.0.1:
      Successfully uninstalled Pillow-9.0.1
Successfully installed Pillow-9.1.1 Wand-0.6.7 pdfminer.six-20220524 pdfplumber-0.7.1


In [22]:
import openai
import wget
import pathlib
import pdfplumber
import numpy as np

def getPaper(paper_url, filename="random_paper.pdf"):
    """
    Downloads a paper from its arXiv page and returns
    the local path to that file.
    """
    downloadedPaper = wget.download(paper_url, filename)    
    downloadedPaperFilePath = pathlib.Path(downloadedPaper)

    return downloadedPaperFilePath

def displayPaperContent(paperContent, page_start=0, page_end=5):
    for page in paperContent[page_start:page_end]:
        print(page.extract_text())

def showPaperSummary(paperContent):
    # Read the API key from the first line of the file, without the trailing newline
    with open('openAI_key.txt') as f:
        key = f.readline().strip()

    tldr_tag = "\n tl;dr:"
    openai.organization = ''
    openai.api_key = key

    for page in paperContent:
        # extract_text() may return nothing for pages without a text layer
        page_text = page.extract_text()
        if not page_text:
            continue
        text = page_text + tldr_tag
        response = openai.Completion.create(
            engine="davinci",
            prompt=text,
            temperature=0.3,
            max_tokens=140,
            top_p=1,
            frequency_penalty=0,
            presence_penalty=0,
            stop=["\n"]
        )
        print(response["choices"][0]["text"])

#paperContent = pdfplumber.open(paperFilePath).pages
#showPaperSummary(paperContent)
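
`showPaperSummary` sends each page's full text followed by the `tl;dr:` tag, so a very dense page could exceed the model's context window. The sketch below shows one way to cap the prompt before sending it. The helper name `build_tldr_prompt` and the 6000-character budget are assumptions chosen for illustration, not documented limits; a token-based budget would be more accurate.

```python
TLDR_TAG = "\n tl;dr:"


def build_tldr_prompt(page_text, max_chars=6000):
    """Return page text capped at max_chars plus the tl;dr tag, or None if empty."""
    text = (page_text or "").strip()
    if not text:
        return None
    if len(text) > max_chars:
        # Cut at the last space inside the budget to avoid splitting a word.
        cut = text.rfind(" ", 0, max_chars)
        text = text[: cut if cut > 0 else max_chars]
    return text + TLDR_TAG


print(repr(build_tldr_prompt("hello world again", max_chars=13)))
```

Returning `None` for empty pages lets the caller skip the API request entirely rather than sending a prompt that contains only the tag.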

In [12]:
url = "https://arxiv.org/pdf/1808.04295"
paperFilePath = getPaper(url)
paperFilePath = "random_paper.pdf"
paperContent = pdfplumber.open(paperFilePath).pages
displayPaperContent(paperContent)

100% [......................................................] 1108543 / 1108543
Understanding training and generalization in deep learning by Fourier analysis
Zhi-Qin John Xu∗
New York University Abu Dhabi
Abu Dhabi 129188, United Arab Emirates
zhiqinxu@nyu.edu

Abstract

Background: It is still an open research area to theoretically understand why Deep Neural Networks (DNNs), equipped with many more parameters than training data and trained by (stochastic) gradient-based methods, often achieve remarkably low generalization error. Contribution: We study DNN training by Fourier analysis. Our theoretical framework explains: i) DNN with (stochastic) gradient-based methods often endows low-frequency components of the target function with a higher priority during the training; ii) Small initialization leads to good generalization ability of DNN while preserving the DNN's ability to fit any function. These results are further confirmed by experiments of DNNs fitting the fo

$$C_1 = \exp\left(-2i\left(b_j k / w_j + \theta(k)\right)\right), \tag{10}$$

$$C_2 = \left[C_1\left(i(\pi k - 2w_j) - 2b_j k\right) + \left(-i(\pi k - 2w_j) - 2b_j k\right)\right], \tag{11}$$

The descent amount at any direction, say, with respect to parameter $\Theta_{jl}$, is

$$\frac{\partial L}{\partial \Theta_{jl}} = \sum_{k} \frac{\partial L(k)}{\partial \Theta_{jl}}. \tag{12}$$

The absolute contribution from frequency $k$ to this total amount at $\Theta_{jl}$ is

$$\left|\frac{\partial L(k)}{\partial \Theta_{jl}}\right| = A(k)\exp\left(-\left|\pi k / 2w_j\right|\right) G_{jl}(\Theta_j, k), \tag{13}$$

where $\Theta_j \triangleq \{w_j, b_j, a_j\}$, $\Theta_{jl} \in \Theta_j$, $G_{jl}(\Theta_j, k)$ is a function with respect to $\Theta_j$ and $k$, which can be found in one of Eqs. (6, 7, 8).
When the component at frequency $k$ does not converge yet, $\exp(-|\pi k / 2w_j|)$ would dominate $G_{jl}(\Theta_j, k)$ for a small $w_j$. Therefore, the behavior of Eq. (13) is dominated by $A(k)\exp(-|\pi k / 2w_j|)$. This dominant term also indicates that weights are much more important than bias terms, which will be verified by MNIST dataset later.
To examine the convergence behavior of different frequency components during the training, we compute the relative difference of the DNN output and $f(x)$ in the frequency domain at each recording step, that is,
|F[f](k)−
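
Text extracted from PDFs often contains words hyphenated across line breaks, ligature characters, and stray one-character lines from margin stamps. The sketch below shows one possible cleanup pass to run before summarizing. The function name `clean_extracted_text` and the two-character threshold are assumptions; the filter can also drop legitimate short lines, the dehyphenation step will wrongly merge genuine compounds split at a line end (for example `gradient-` / `based`), and nothing here can restore spaces that were lost between words.

```python
import re

# Common PDF ligature code points and their plain-text equivalents.
LIGATURES = {"\ufb01": "fi", "\ufb02": "fl"}


def clean_extracted_text(raw):
    # Drop very short lines, which are often fragments of a rotated margin stamp.
    lines = [line.strip() for line in raw.split("\n")]
    text = "\n".join(line for line in lines if len(line) > 2)

    # Rejoin words hyphenated across a line break: "re-\nmarkably" -> "remarkably".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)

    for ligature, plain in LIGATURES.items():
        text = text.replace(ligature, plain)

    # Treat the remaining line breaks as ordinary spaces.
    return text.replace("\n", " ")


sample = "often achieve re-\nG\nmarkably low error.\n4\nThese are con\ufb01rmed."
print(clean_extracted_text(sample))
```

Short lines are removed before dehyphenation so that a stamp fragment sitting between the two halves of a split word does not block the rejoin.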

In [23]:
showPaperSummary(paperContent)

 We study DNN training by Fourier analysis. Our theoretical framework explains why DNNs often endow low-frequency components of the target function with a higher priority during the training. Small initialization leads to good generalization ability of DNN while preserving the DNN’s ability to fit any function. These results are further confirmed by experiments of DNNs fitting the following datasets, that is, natural images, one-dimensional functions and MNIST dataset.

 The loss function is the sum of the squared errors between the target function and the DNN output.
The DNN with one hidden layer using tanh function as activation function is

 We propose a theoretical framework to understand deep learning, and show that it explains the training behavior of DNNs.

 The DNN training process is governed by the weights rather than the biases.
 We prove that the Fourier transform of the loss function of a DNN is the Fourier transform of the target function. We also prove that the Fourier transform of t

## Run tests