# Data Extraction and Text Analysis
<b>The objective is to extract textual data from SEC / EDGAR financial reports and perform text analysis to compute the variables explained below.</b><br><br>
<b>Links to the SEC / EDGAR financial reports are given in the Excel spreadsheet “cik_list.xlsx”. We prepend https://www.sec.gov/Archives/ to every cell of column F (cik_list.xlsx) to get the full link to each financial report.</b>

In [1]:
#importing Libraries

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from modules import *

In [4]:
#loading data
df = pd.read_excel("cik_list.xlsx")
df.head()

Unnamed: 0,CIK,CONAME,FYRMO,FDATE,FORM,SECFNAME
0,3662,SUNBEAM CORP/FL/,199803,1998-03-06,10-K405,edgar/data/3662/0000950170-98-000413.txt
1,3662,SUNBEAM CORP/FL/,199805,1998-05-15,10-Q,edgar/data/3662/0000950170-98-001001.txt
2,3662,SUNBEAM CORP/FL/,199808,1998-08-13,NT 10-Q,edgar/data/3662/0000950172-98-000783.txt
3,3662,SUNBEAM CORP/FL/,199811,1998-11-12,10-K/A,edgar/data/3662/0000950170-98-002145.txt
4,3662,SUNBEAM CORP/FL/,199811,1998-11-16,NT 10-Q,edgar/data/3662/0000950172-98-001203.txt


In [5]:
#prepending https://www.sec.gov/Archives/ to every cell of column F and storing the results in urls
pre_url = 'https://www.sec.gov/Archives/'
urls = list(df['SECFNAME'].apply(lambda x: pre_url+str(x)))[:152] # keep the first 152 urls
total_urls = len(urls)
#print('Total', total_urls)
#print(urls)

The stop words list is used to clean the text: words found in the list are excluded before sentiment analysis is performed.<br> The stop words were downloaded from https://drive.google.com/file/d/0B4niqV00F3mseWZrUk1YMGxpVzQ/view?usp=sharing

In [6]:
#saving the stopwords
with open('StopWords_Generic.txt','r') as f:
    stopwords = f.readlines()
    stopwords = [word.strip() for word in stopwords]
#print(stopwords)
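The cleaning step itself lives in `modules`, which is not shown here. As a rough illustration of what stop word removal does, here is a minimal sketch. The helper name `remove_stopwords` and the letters-only tokenization are assumptions made for this example, not the notebook's actual `clean_sentencs` implementation.

```python
import re

def remove_stopwords(text, stopwords):
    """Hypothetical helper: lowercase the text, keep alphabetic tokens,
    and drop every token that appears in the stop word list."""
    stop = {word.lower() for word in stopwords}      # case-insensitive lookup
    tokens = re.findall(r"[A-Za-z]+", text.lower())  # letters only, no digits or punctuation
    return [token for token in tokens if token not in stop]

# Toy stop word list, standing in for the contents of StopWords_Generic.txt
print(remove_stopwords("The company is growing.", ["THE", "IS"]))
```

Lowercasing both sides keeps the match independent of how the stop word file is capitalized.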

We will be calculating several scores such as:
1. Positive Score: This score is calculated by assigning the value of +1 for each word if found in the Positive Dictionary and then adding up all the values.
2. Negative Score: This score is calculated by assigning the value of +1 for each word if found in the Negative Dictionary and then adding up all the values.
3. Polarity Score: This score determines whether a given text is positive or negative in nature. It is calculated using the formula: Polarity Score = (Positive Score - Negative Score) / ((Positive Score + Negative Score) + 0.000001). The range is from -1 to +1.
###### Analysis of Readability:
4. Average Sentence Length = the number of words / the number of sentences
5. Percentage of Complex words = the number of complex words / the number of words
6. Fog Index = 0.4 * (Average Sentence Length + Percentage of Complex words) (readability is measured using the Fog Index)
7. Average Number of Words Per Sentence = the total number of words / the total number of sentences
8. Complex Word Count (complex words are words in the text that contain more than two syllables)
9. Word Count
10. uncertainty_score: This score is calculated by assigning the value of +1 for each word if found in the Uncertain Dictionary and then adding up all the values.
11. constraining_score: This score is calculated by assigning the value of +1 for each word if found in the Constraining Dictionary and then adding up all the values.
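
The readability variables (items 4 to 8) can be sketched as follows. This is only an illustration under stated assumptions: the helper names `count_syllables` and `readability` are hypothetical, and syllables are approximated by counting runs of vowels, which may differ from what `complex_words` and `fog` in `modules` actually do.

```python
import re

def count_syllables(word):
    """Approximate syllables as runs of vowels (a heuristic, not a dictionary lookup)."""
    return len(re.findall(r"[aeiouy]+", word.lower()))

def readability(words, sentence_count):
    """Compute the readability variables from a list of words and a sentence count."""
    word_count = len(words)
    if word_count == 0 or sentence_count == 0:
        return None  # nothing to measure
    # A word is complex when it has more than two syllables
    complex_word_count = sum(1 for w in words if count_syllables(w) > 2)
    average_sentence_length = word_count / sentence_count
    percentage_of_complex_words = complex_word_count / word_count
    fog_index = 0.4 * (average_sentence_length + percentage_of_complex_words)
    return {
        "average_sentence_length": average_sentence_length,
        "percentage_of_complex_words": percentage_of_complex_words,
        "complex_word_count": complex_word_count,
        "fog_index": fog_index,
        "word_count": word_count,
    }

sample = ["the", "company", "reported", "significant", "losses"]
print(readability(sample, sentence_count=1))
```

The percentage is kept as a ratio between 0 and 1, following the definition in item 5.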






In [7]:
#saving Positive words
positive_words = pd.read_excel('LoughranMcDonald_SentimentWordLists_2018.xlsx',
                                header=None, sheet_name='Positive')
#display(positive_words.head())
positive_words = list(positive_words[0].apply(lambda x : x.lower()))
#print(positive_words[:5])

#saving Negative words
negative_words = pd.read_excel('LoughranMcDonald_SentimentWordLists_2018.xlsx',
                                header=None, sheet_name='Negative')
#display(negative_words.head())
negative_words = list(negative_words[0].apply(lambda x : x.lower()))
#print(negative_words[:5])

#loading uncertainty words
uncertaintywords = pd.read_excel('uncertainty_dictionary.xlsx')
uncertaintywords = list(uncertaintywords['Word'].apply(lambda x:x.lower()))

#loading constraining words
constrainingwords = pd.read_excel('constraining_dictionary.xlsx')
constrainingwords = list(constrainingwords['Word'].apply(lambda x:x.lower()))
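
The word lists loaded above feed the dictionary-based scores: each token found in a list adds +1, and the polarity score combines the positive and negative counts. A minimal sketch on toy data follows. The names `dictionary_score` and `polarity_score` and the toy word lists are assumptions for this example; the notebook's own `calculate_score` and `polarity` live in `modules` and may differ.

```python
def dictionary_score(tokens, word_list):
    """Count +1 for every token that appears in the given word list."""
    lookup = set(word_list)  # set membership keeps each lookup cheap
    return sum(1 for token in tokens if token.lower() in lookup)

def polarity_score(positive_score, negative_score):
    """(Positive - Negative) / ((Positive + Negative) + 0.000001)."""
    return (positive_score - negative_score) / ((positive_score + negative_score) + 0.000001)

# Toy lists, standing in for the Loughran-McDonald sheets
toy_positive = ["profit", "improved"]
toy_negative = ["loss"]
tokens = ["profit", "improved", "despite", "loss"]

pos = dictionary_score(tokens, toy_positive)
neg = dictionary_score(tokens, toy_negative)
print(pos, neg, round(polarity_score(pos, neg), 4))
```

The small constant in the denominator avoids a division by zero when a text contains no positive or negative words, in which case the polarity is 0.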

In [8]:
#tokenizer
from nltk import word_tokenize
text = "This is a wonderful sentence."
text_words = word_tokenize(text)

#print(text_words)

In [9]:
#requesting text from web
import requests
from bs4 import BeautifulSoup
from nltk import word_tokenize

# keys of our scores dictionary
keys = ['positive_score', 'negative_score', 'polarity_score',
        'average_sentence_length', 'percentage_of_complex_words',
        'complex_word_count', 'fog_index', 'word_count', 'uncertainty_score',
        'constraining_score', 'positive_word_proportion', 'negative_word_proportion',
        'uncertainty_word_proportion', 'constraining_word_proportion',
        'constraining_words_whole_report']

In [None]:
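import time

# SEC asks automated clients to identify themselves; without a declared
# User-Agent the response may be a block notice instead of the report.
# Placeholder value: replace it with your own name and contact e-mail.
headers = {'User-Agent': 'Your Name your.email@example.com'}

for i, url in enumerate(urls):
    #get the report
    req = requests.get(url, headers=headers)
    time.sleep(0.2)  # short pause between requests to avoid hammering the server
    html = req.text
    soup = BeautifulSoup(html, features="html.parser")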

    #print(soup.prettify())

    # Cleaning the soup 
    # kill all script and style elements
    for script in soup(["script", "style"]):
        script.extract()    # rip it out

    # get text
    text = soup.get_text()

    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())
    # break multi-headlines into a line each
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop blank lines
    text = '\n'.join(chunk for chunk in chunks if chunk)

    #print(text)

    #clean sentence to get text only
    report = clean_sentencs(text, stopwords)

    # Dictionary of our score
    variables = dict.fromkeys(keys)

    positive_score = calculate_score(positive_words, report, 'positive_score', variables)
    negative_score = calculate_score(negative_words, report, 'negative_score', variables)
    polarity_score = polarity(positive_score,negative_score, variables)

    print('row {} total words: {} , total sentences {}'.format(i, len(report),len(text.split('. '))))

    average_sentence_length = len(report) / len(text.split('. '))
    variables['average_sentence_length'] = average_sentence_length
    percentage_of_complex_words = complex_words(report, variables)
    fog_index = fog(variables)
    variables['word_count'] = len(report)

    #calculating uncertainty_score and constraining_score
    uncertainty_score = calculate_score(uncertaintywords, report, 'uncertainty_score',variables)
    constraining_score = calculate_score(constrainingwords, report, 'constraining_score',variables)

    positive_word_proportion = word_proportion('positive_word_proportion', report, 'positive_score', variables)
    negative_word_proportion = word_proportion('negative_word_proportion', report, 'negative_score', variables)
    uncertainty_word_proportion = word_proportion('uncertainty_word_proportion', report, 'uncertainty_score', variables)
    constraining_word_proportion = word_proportion('constraining_word_proportion', report, 'constraining_score', variables)
    variables['constraining_words_whole_report'] = variables['constraining_score']

    for key in variables.keys():
        df.loc[i,key] = variables[key]  


    #break

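We notice that several rows end up with identical scores. This occurs when the response is the notice "Your Request Originates from an Undeclared Automated Tool" instead of the report, so the same notice text is likely being scored for each blocked request.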
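
One way to keep such responses out of the results is to detect the notice and retry or skip the row. The sketch below is an assumption-laden illustration, not part of the original pipeline: `is_blocked` and `fetch_report` are hypothetical helpers, and `get` is any callable that takes a URL and returns the response text.

```python
import time

BLOCK_MARKER = "Your Request Originates from an Undeclared Automated Tool"

def is_blocked(html):
    """Return True when the body looks like the automated-tool notice."""
    return BLOCK_MARKER.lower() in html.lower()

def fetch_report(url, get, retries=3, pause=1.0):
    """Call get(url) up to `retries` times; return the text, or None if every
    attempt came back as the block notice."""
    for attempt in range(retries):
        html = get(url)
        if not is_blocked(html):
            return html
        time.sleep(pause * (attempt + 1))  # wait a little longer after each blocked attempt
    return None

# Stand-in for a real HTTP call: blocked on the first attempt, fine on the second
responses = iter([BLOCK_MARKER, "ANNUAL REPORT ..."])
print(fetch_report("edgar/data/...", lambda u: next(responses), pause=0))
```

In the loop above, `get` could wrap `requests.get(url, headers=...).text`, and rows for which `fetch_report` returns `None` could be left empty rather than filled with scores computed from the notice.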

In [None]:
print(df.head())
df.to_csv('sumbission.csv')

#print(variables)