## Research Article Summarization

In [61]:
import os
path=r'C:\Users\stone\Documents\Python_Scripts\Graduating_Project\research_articles'
os.chdir(path)

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO

from nltk.corpus import stopwords
from nltk.cluster.util import cosine_distance
import numpy as np
import networkx as nx
import re

In [53]:
file='Analyst_interest_as_an_early_indicator.pdf'

In [55]:

class PdfConverter:
    def __init__(self, file_path):
        self.file_path = file_path
    # convert the pdf file to a single string with words separated by spaces
    def convert_pdf_to_txt(self):
        rsrcmgr = PDFResourceManager()
        retstr = StringIO()
        codec = 'utf-8'  # 'utf16','utf-8'
        laparams = LAParams()
        device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        password = ""
        maxpages = 0  # 0 means no page limit
        caching = True
        pagenos = set()  # an empty set means all pages
        # the context manager closes the file even if a page fails to parse
        with open(self.file_path, 'rb') as fp:
            for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
                interpreter.process_page(page)
        device.close()
        text = retstr.getvalue()  # named text to avoid shadowing the built-in str
        retstr.close()
        return text

In [66]:
raw_text=PdfConverter(file).convert_pdf_to_txt()

In [75]:
x1=raw_text.replace('\n',' ')
x1=re.sub(r'\(.*?\)','',x1) # removing text within parentheses
x1=re.sub('â€™s','',x1) # removing the mis-encoded possessive ’s (shows up as â€™s)
x1

"THE ACCOUNTING REVIEW Vol. 90, No. 3 2015 pp. 1049–1078  American Accounting Association DOI: 10.2308/accr-50912  Analyst Interest as an Early Indicator of Firm Fundamental Changes and Stock Returns  Michael J. Jung New York University  M. H. Franco Wong INSEAD and University of Toronto  X. Frank Zhang Yale University  ABSTRACT: We posit that a change in analyst interest in a firm is an early indicator of the firm’s future fundamentals, capital market activities, and stock returns. We measure increases in analyst interest by obser.ving analysts who do not cover a firm but participate in that firm’s earnings conference call, and we measure decreases in analyst interest by observing analysts who cover a firm, yet are absent from that firm’s call. We find that increases in analyst interest are positively associated with future changes in firm fundamentals and capital market activities, while decreases in analyst interest are negatively associated with capital market activities. We also f

### Summarizer code

In [82]:
#stopwords.words('english')

In [4]:
with open('pg1.txt','r') as f:
    simple_text=f.read().replace('\n',' ')

In [5]:
# Text Cleaning
# Converting everything into lowercase
simple_text=simple_text.strip().lower()
simple_text=re.sub(r'\(.*?\)','',simple_text) # removing text within parentheses
simple_text=re.sub('â€™s','',simple_text) # removing the mis-encoded possessive ’s (shows up as â€™s)

In [35]:
simple_text

'a large literature examines the link between firm fundamentals and future stock returns . typically, the motivation for this line of research is that firm fundamentals are reflected in accounting data, which are informative about a firm future cash flows, and that investors do not fully impound this information into stock prices. but since financial statement information is backward-looking, it is beneficial for investors to identify early indicators of firm fundamental changes that are not yet reflected in financial statements. in this paper, we examine whether a change in analyst interest, proxied by either the onset of non-covering sell-side equity analysts who participate in a firm earnings conference call or by the absence of participation from covering analysts, is an early indicator of not only firm fundamental changes, but also of future capital market activities and stock returns. our focus on analyst interest stems from two observations. first, prior research shows that anal
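The pattern `'â€™s'` matches what the UTF-8 bytes of `’s` look like once they have been decoded as Windows-1252, which suggests `pg1.txt` was read with the wrong encoding. A minimal sketch, assuming that is the cause, of repairing such text instead of deleting the possessive. `fix_mojibake` is a hypothetical helper and is not part of the notebook:

```python
def fix_mojibake(text):
    """Undo UTF-8 text that was mistakenly decoded as Windows-1252."""
    try:
        # Recover the original bytes, then decode them as UTF-8.
        return text.encode('cp1252').decode('utf-8')
    except UnicodeError:
        # The text was not mis-decoded in this way; leave it untouched.
        return text


print(fix_mojibake('a firmâ€™s earnings call'))
```

Opening the file with an explicit `encoding='utf-8'` would probably avoid the problem at the source, although the `'â€™s'` pattern would then no longer match anything.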

In [6]:
sent_list=simple_text.split('.')
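Splitting on every `'.'` also cuts inside decimals, DOIs, and abbreviations, and it leaves empty strings behind. A slightly more careful splitter can be sketched with a regular expression that breaks only where sentence punctuation is followed by whitespace. `split_sentences` is a hypothetical helper, and abbreviations such as "e.g." would still be split:

```python
import re


def split_sentences(text):
    """Split after '.', '!' or '?' only when whitespace follows."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    # Drop empty fragments so they never reach the similarity step.
    return [p for p in parts if p]


print(split_sentences('doi: 10.2308/accr-50912 is cited. returns rose 1.5 percent.'))
```

Unlike `str.split('.')`, this keeps the closing punctuation attached to each sentence.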

In [13]:
# Defining function to calculate sentence similarity

def sentence_similarity(sent1, sent2, stop_words=stopwords.words('english')):
    # sent1 and sent2 are treated as sequences of tokens
    # (iterating over a plain string yields single characters)
    if stop_words is None:
        stop_words = []
 
    sent1 = [w.lower() for w in sent1]
    sent2 = [w.lower() for w in sent2]
 
    all_words = list(set(sent1 + sent2))
 
    vector1 = [0] * len(all_words)
    vector2 = [0] * len(all_words)
 
    # build the vector for the first sentence
    for w in sent1:
        if w in stop_words:
            continue
        vector1[all_words.index(w)] += 1
 
    # build the vector for the second sentence
    for w in sent2:
        if w in stop_words:
            continue
        vector2[all_words.index(w)] += 1
        
    # cosine distance is undefined for an all-zero vector
    if any(vector1) and any(vector2): 
        return 1 - cosine_distance(vector1, vector2)
    else:
        return 0
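
Because `sent_list` holds plain strings, the `for w in sent1` loops above walk over characters rather than words, so the vectors count letters. A word-level variant can be sketched as follows. It is an illustration under stated assumptions, not the notebook's method: `word_level_similarity` and `DEMO_STOP_WORDS` are hypothetical names, and the tiny stop-word set stands in for NLTK's `stopwords.words('english')`.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word set (an assumption for this sketch).
DEMO_STOP_WORDS = {'a', 'an', 'the', 'of', 'in', 'is', 'and', 'to'}


def word_level_similarity(sent1, sent2, stop_words=DEMO_STOP_WORDS):
    """Cosine similarity between the word-count vectors of two sentence strings."""
    def counts(sentence):
        # Tokenize into lowercase alphanumeric words, then drop stop words.
        words = re.findall(r'[a-z0-9]+', sentence.lower())
        return Counter(w for w in words if w not in stop_words)

    c1, c2 = counts(sent1), counts(sent2)
    if not c1 or not c2:
        # Cosine similarity is undefined when either vector is all zeros.
        return 0.0
    dot = sum(c1[w] * c2[w] for w in c1)
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (norm1 * norm2)


print(word_level_similarity('analyst interest rises', 'analyst interest falls'))
```

Swapping this in would change the ranking, so the summary it produces would not necessarily match the output recorded in this notebook.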


In [14]:
def build_similarity_matrix(sentences, stop_words=stopwords.words('english')):
    # Create an empty similarity matrix
    similarity_matrix = np.zeros((len(sentences), len(sentences)))
 
    for idx1 in range(len(sentences)):
        for idx2 in range(len(sentences)):
            if idx1 == idx2: #ignore if both are same sentences
                continue 
            similarity_matrix[idx1][idx2] = sentence_similarity(sentences[idx1], sentences[idx2], stop_words)

    return similarity_matrix

In [28]:
#build_similarity_matrix(sent_list)

In [26]:
def generate_summary(arr, top_n=5):
    '''Print the top_n highest-ranked sentences from arr, a list of sentences.'''
    stop_words = stopwords.words('english')
    summarize_text = []

    # Step 1 - Generate similarity matrix across sentences
    sentence_similarity_matrix = build_similarity_matrix(arr, stop_words)

    # Step 2 - Rank sentences in similarity matrix
    sentence_similarity_graph = nx.from_numpy_array(sentence_similarity_matrix)
    scores = nx.pagerank(sentence_similarity_graph)

    # Step 3 - Sort the rank and pick top sentences
    ranked_sentence = sorted(((scores[i],s) for i,s in enumerate(arr)), reverse=True)    
    #print("Indexes of top ranked_sentence order are ", ranked_sentence)    

    # never ask for more sentences than are available
    for i in range(min(top_n, len(ranked_sentence))):
        summarize_text.append("".join(ranked_sentence[i][1]))

    # Step 4 - Output the summarized text
    print("Summarized Text: \n\n", ". ".join(summarize_text))
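
`nx.pagerank` in Step 2 treats each similarity value as an edge weight and scores a sentence highly when it is similar to other high-scoring sentences. The idea can be sketched without networkx as a plain power iteration. `pagerank_power_iteration` is a hypothetical helper for illustration; it uses the same default damping factor of 0.85 but is not claimed to reproduce networkx's implementation in every detail.

```python
def pagerank_power_iteration(weights, damping=0.85, tol=1e-8, max_iter=200):
    """Score the nodes of a weighted graph given as a square list of lists."""
    n = len(weights)
    row_sums = [sum(row) for row in weights]
    scores = [1.0 / n] * n  # start from a uniform distribution
    for _ in range(max_iter):
        # Nodes with no outgoing weight spread their score evenly.
        dangling = sum(scores[j] for j in range(n) if row_sums[j] == 0)
        new_scores = []
        for i in range(n):
            incoming = sum(
                scores[j] * weights[j][i] / row_sums[j]
                for j in range(n)
                if row_sums[j] > 0
            )
            new_scores.append((1 - damping) / n + damping * (incoming + dangling / n))
        change = sum(abs(a - b) for a, b in zip(new_scores, scores))
        scores = new_scores
        if change < tol:
            break
    return scores


# Toy similarity matrix: sentence 0 is similar to both of the others.
toy = [[0, 1, 1],
       [1, 0, 0],
       [1, 0, 0]]
print(pagerank_power_iteration(toy))
```

In the toy matrix, sentence 0 ends up with the highest score because both other sentences point to it.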

In [27]:
generate_summary(sent_list)

Summarized Text: 

  in this paper, we examine whether a change in analyst interest, proxied by either the onset of non-covering sell-side equity analysts who participate in a firm earnings conference call or by the absence of participation from covering analysts, is an early indicator of not only firm fundamental changes, but also of future capital market activities and stock returns. 2 we posit that an onset of analysts who do not cover a given firm , but participate in that firm earnings conference call, captures increasing analyst interest in the firm, while analyst absenteeism captures decreasing analyst interest in the firm. , a top-ranked sell-side equity research firm in institutional investor annual all-american research survey, gives newly hired analysts up to one year to conduct due diligence on firms before initiating coverage on them .  we explore one aspect of analyst due diligence and document that analysts regularly participate in a firm earnings conference calls before
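
`generate_summary` joins the selected sentences from highest to lowest score, so the summary can read out of sequence relative to the article. One possible refinement, sketched here with the hypothetical helper `top_sentences_in_order`, is to pick the top sentences first and then restore their original order:

```python
def top_sentences_in_order(sentences, scores, top_n=5):
    """Pick the top_n highest-scoring sentences, then restore document order."""
    # Indices sorted from highest to lowest score.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    # Keep the best top_n indices and sort them back into document order.
    chosen = sorted(ranked[:top_n])
    return [sentences[i] for i in chosen]


demo_sentences = ['first', 'second', 'third', 'fourth']
demo_scores = [0.1, 0.4, 0.2, 0.3]  # made-up scores for illustration
print(top_sentences_in_order(demo_sentences, demo_scores, top_n=2))
```

`scores` may be a list or the dictionary returned by `nx.pagerank`, since both are indexed by sentence position.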

[text_summarization](https://towardsdatascience.com/understand-text-summarization-and-create-your-own-summarizer-in-python-b26a9f09fc70)