# Data Gathering

***In this notebook we present the data-gathering process for this project***

This process consists of the following steps:

1) Downloading the companies' 10-Ks and converting them to .txt format.

   Tools: web scraping, using an RPA Bot.

2) Extracting section "1A. Risk Factors".


In the following sections we discuss our progress on the data gathering.


- - -  

## 1. Downloading and converting the 10-K reports

To download and convert the 10-K reports, we used an RPA Bot.

Robotic Process Automation (RPA) is a software technology used to automate digital tasks. RPA Bots can interact with any system or application the same way a human worker would: they watch the user perform tasks in the application's graphical user interface (GUI) and then automate those tasks by repeating them directly in the GUI.
For this project, the RPA Bot was created with the UiPath platform.

The Bot was designed to:

    •	Extract a company's CIK and the year of its Cyber Incident
    •	Open the EDGAR database
    •	Type the designated CIK
    •	Locate the relevant 10-K report (if it exists)
    •	Save the report locally
    •	Iterate
    
Initially, there were circa 2,000 cases of Cyber Incidents. In some cases, a company experienced more than one incident in a given year, so the Bot collected fewer reports than there are entries in the database. Moreover, for some companies no 10-K report was available, for reasons such as mergers and acquisitions or several CIKs per company (e.g. JPMorgan Chase & Co.).
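
The effect of repeated incidents on the number of downloads can be illustrated with a short sketch. The records below are invented for illustration, and representing an incident as a (CIK, year) pair is an assumption, not the layout of the actual incident database.

```python
# Hypothetical incident records as (CIK, year) pairs; all values are made up.
incidents = [
    ("0000000001", 2015),
    ("0000000001", 2015),  # second incident, same company and same year
    ("0000000001", 2017),
    ("0000000002", 2016),
]

# Assumption: one report is needed per company and year,
# so repeated (CIK, year) pairs collapse into a single download.
reports_to_download = sorted(set(incidents))

print(len(incidents), "incidents ->", len(reports_to_download), "reports")
```

Four incident records collapse into three company-year pairs here, which is why the number of reports can be smaller than the number of incidents.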

By the day of submission, we had managed to retrieve approximately 700 reports from each group.

---

## (*)  Imports

We begin with the libraries and downloads needed in this part of the work. The libraries were selected for their suitability for processing our data and for its initial analysis.

In [1]:
import nltk
from nltk.corpus import PlaintextCorpusReader
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.stem import LancasterStemmer
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from nltk import FreqDist
nltk.download('stopwords')
nltk.download('wordnet')

import os
import io
import sys
import shutil
import string
from string import punctuation
import numpy as np 

import pandas as pd
from pandas import DataFrame
from pandas.plotting import scatter_matrix

import matplotlib.pyplot as plt
from wordcloud import WordCloud
import seaborn as sns

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## (*) Variables

Next we define the variables used throughout much of the notebook. They hold key values such as the names of important directories, lists of files, and the words that mark important sections in each file. Defining them at the beginning of this section means that, should the code need to change, only minimal edits are required.

In [2]:
dir_with_data = 'Data'  # name of the directory that contains all of the raw data
new_dir_with_outcome = 'Risk' + dir_with_data  # name of the directory that will contain all of the outcomes
file_names = []  # names of all the files in the Data directory
start_key_words = ['ariel','1A.', 'Risk Factors\n',
                 'RISK FACTORS\n']  # keywords that indicate we have reached the needed section
start_key_word_counters = []  # one counter per start keyword, counting how many times it has appeared
end_key_words = ['ITEM 1B.', 'Item 1B.', 'ITEM 1B',
               'Item 1B', 'UNRESOLVED STAFF COMMENTS', 'Unresolved Staff Comments',
               'Properties']  # keywords that indicate the section is over
created_files_counter = 1
words_to_clean = []
project_directories = []

---

---

## 2. Extracting section "1A. Risk Factors"

In this part we validate and manipulate the text in the files, in order to extract section "1A. Risk Factors". 

We went through every file in the Data folder and kept only the part we wanted to work on. In most cases this part is titled "Item 1A. Risk Factors", so we call the excerpt taken from each file the Risk Section.

***Check: if a word exists in a line***

In [3]:
def is_word_in_line(word, line):
    return word in line

***Check: if the string named line contains any of the start key words***

Here we test whether the current line is the title of the Risk Section mentioned above. Finding this critical break is problematic because the files come in slightly different formats. From a large data set we were able to find the critical part in a large percentage of the files, but certainly not in all of them. The ambition was to cover as many files as possible; this proved quite difficult, but we are happy with the results.

In [4]:
def in_risk_paragraph_line(line):
    index = 0
    for word in start_key_words:
        if is_word_in_line(word, line):
            if start_key_word_counters[index] == 0:
                start_key_word_counters[index] = 1
                break
            elif start_key_word_counters[index] == 1:
                return True
        index += 1
    return False
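
The counter logic in this cell reports a match only the second time a keyword appears, presumably because the first occurrence is usually a table-of-contents entry. The sketch below restates that idea as a standalone function; `find_section_start` and the sample lines are hypothetical and not part of the notebook.

```python
def find_section_start(lines, keywords):
    """Return the index of the first line where a keyword appears for the second time."""
    seen = {word: False for word in keywords}
    for i, line in enumerate(lines):
        for word in keywords:
            if word in line:
                if seen[word]:
                    return i       # second occurrence: treat it as the section title
                seen[word] = True  # first occurrence: remember it and move on
                break              # as in the notebook, stop checking this line
    return None


# Invented sample: a table-of-contents entry followed by the real heading.
sample_lines = [
    "Table of Contents\n",
    "Item 1A. Risk Factors\n",
    "Item 1B. Unresolved Staff Comments\n",
    "Item 1. Business\n",
    "Item 1A. Risk Factors\n",
    "Our business faces risks.\n",
]
start_index = find_section_start(sample_lines, ["1A.", "Risk Factors\n"])
print(start_index)
```

A file whose heading appears only once never triggers the match, which is one way a file can end up without an extracted section.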

***Check: if a line contains any key words that indicate reaching the end of the section***

Here we test whether the current line marks the end of the Risk Section. As with finding the beginning of the critical part, finding the end was problematic, although a formal rule for the end was slightly easier to define. There is still some room for improvement, but as mentioned above we are very pleased with the results, so this is the current state.

In [5]:
def is_risk_paragraph_ended(line):
    global created_files_counter
    for word in end_key_words:
        if is_word_in_line(word, line):
            created_files_counter += 1
            return True
    return False
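
Because the end check is a plain substring match, a broad keyword such as 'Properties' can also match ordinary sentences. The sketch below copies `end_key_words` from the Variables cell and applies it to a few invented sample lines; `matching_end_keywords` is a hypothetical helper, not part of the notebook.

```python
# Copied from the Variables cell above.
end_key_words = ['ITEM 1B.', 'Item 1B.', 'ITEM 1B',
                 'Item 1B', 'UNRESOLVED STAFF COMMENTS', 'Unresolved Staff Comments',
                 'Properties']


def matching_end_keywords(line):
    """Return every end keyword that occurs in the line as a substring."""
    return [word for word in end_key_words if word in line]


# Invented sample lines.
heading_hits = matching_end_keywords("Item 1B. Unresolved Staff Comments\n")
sentence_hits = matching_end_keywords("Our Properties may be damaged by natural disasters.\n")
no_hits = matching_end_keywords("We face intense competition.\n")

print(heading_hits)
print(sentence_hits)
print(no_hits)
```

The second sample shows how a sentence inside the section could end the extraction early. Whether this happens in the real files has not been checked here.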

We started the process with 672 text files downloaded by the RPA Bot. After extracting section 1A. Risk Factors and removing empty files and files from which a different section had been extracted, we ended up with 385 usable files. We will try to improve this part of the data gathering.
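
A minimal sketch of how empty output files could be detected is given below. It is an illustration only: `find_empty_files` and the file names are hypothetical, and the notebook does not show how the empty files were actually removed.

```python
import os
import tempfile


def find_empty_files(directory):
    """Return the names of files in `directory` that contain only whitespace."""
    empty = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        with open(path, 'r', encoding='utf-8') as f:
            if not f.read().strip():
                empty.append(name)
    return empty


# Demonstration on a throwaway directory with made-up files.
with tempfile.TemporaryDirectory() as tmp:
    samples = [("RiskA.txt", "Risk text\n"), ("RiskB.txt", "\n\n"), ("RiskC.txt", "")]
    for name, text in samples:
        with open(os.path.join(tmp, name), 'w', encoding='utf-8') as f:
            f.write(text)
    empty_files = find_empty_files(tmp)

print(empty_files)
```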


---

### 2.1. Cleaning exceptional formatting characters

We created functions for a minimal cleaning of the text: deleting problematic characters and residue left over from the scraping process. Note that this is not the main cleansing of the data, only a very basic treatment of problematic characters and of certain markers that tend to appear on websites. Without this step we ran into many problems later on, which is why cleaning at this stage is so critical.

***Check: if a char is valid in this format***

In [6]:
def is_valid_char(c):
    return c.isdigit() or c.isalpha() or c in ('\n', '', ' ')

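***Get a clean line: drop a leading character that is neither a digit nor a letter***

In [7]:
def get_clean_line(line):
    # Checking the length first avoids an IndexError on an empty string
    if len(line) > 1 and not line[0].isdigit() and not line[0].isalpha():
        line = line[1:]  # drop the single leading non-alphanumeric character
    return line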
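
`get_clean_line` only drops one leading character, and `is_valid_char` is not called by the other cells shown here. One way the character check could be applied to a whole line is sketched below. This is an assumption about a possible use, not the notebook's method: `keep_valid_chars` is hypothetical, and `is_valid_char` is restated in a slightly simplified form.

```python
def is_valid_char(c):
    # Digits, letters, newlines and spaces are kept.
    return c.isdigit() or c.isalpha() or c in ('\n', ' ')


def keep_valid_chars(line):
    """Return the line with every character that fails is_valid_char removed."""
    return ''.join(c for c in line if is_valid_char(c))


# Invented sample line starting with a bullet character.
cleaned = keep_valid_chars("\u2022 Item 1A. Risk Factors (cont'd)\n")
print(repr(cleaned))
```

Note that such a filter also removes periods and apostrophes, which may or may not be desirable for later analysis.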

---

### 2.2. Creating "Risk Files" and saving them locally

"Risk Files" are the new files containing the validated text of section "1A. Risk Factors".

Having created the functions for trimming the required text, we now create functions that cut each file individually and save the result locally by writing a new file after each trimming. The new files have the same names as the original files, prefixed with the word Risk, and contain the critical part for processing. This lets us distinguish between the files before and after the change.

***Make a buffer of words from risk section***



In the next block we create a function that collects the lines of the Risk Section into a list.

In [8]:
def get_risk_buffer(file_buffer_to_read):
    cur_buffer_to_write = [] 
    in_risk_paragraph = False
    for line in file_buffer_to_read:
        if not in_risk_paragraph and in_risk_paragraph_line(line):
            in_risk_paragraph = True                    
        elif in_risk_paragraph and is_risk_paragraph_ended(line):
            break
        if in_risk_paragraph:           
            clean_line = get_clean_line(line)
            cur_buffer_to_write.append(clean_line)
    return cur_buffer_to_write
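
To show the whole extraction on a small input, the sketch below condenses the start check, the end check and the buffering into one standalone function. `extract_section`, its keyword arguments and the sample document are hypothetical; the sketch leaves out `get_clean_line` and the global counters for brevity.

```python
def extract_section(lines, start_keywords, end_keywords):
    """Collect lines from the second occurrence of a start keyword up to an end keyword."""
    seen = set()
    section = []
    inside = False
    for line in lines:
        if not inside:
            for word in start_keywords:
                if word in line:
                    if word in seen:
                        inside = True  # second occurrence opens the section
                    seen.add(word)
                    break
        elif any(word in line for word in end_keywords):
            break  # an end keyword closes the section
        if inside:
            section.append(line)
    return section


# Invented sample document with a table of contents and a body.
document = [
    "Item 1A. Risk Factors\n",
    "Item 1B. Unresolved Staff Comments\n",
    "Item 1A. Risk Factors\n",
    "We face intense competition.\n",
    "Cyber attacks could harm us.\n",
    "Item 1B. Unresolved Staff Comments\n",
    "None.\n",
]
section = extract_section(document, ["1A."], ["Item 1B"])
print(section)
```

The result contains the section heading and the two body lines, and stops before the "Item 1B" heading.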

***Write a buffer to a new file : a given file name + 'Risk'***

This function writes the stored Risk Section to a local file with the same name as the original, prefixed with Risk.

In [9]:
def write_list_buffer_to_file(cur_buffer_to_write, file_name, new_dir_with_outcome):
    path = 'results\\' + new_dir_with_outcome + '\\' + 'Risk' + file_name
    # The with-statement closes the file even if writing fails
    with open(path, 'w', encoding='utf-8') as file_to_write:
        for line in cur_buffer_to_write:
            file_to_write.write(line + "\n")

***Create a risk file and assign it to a directory***

Here we define the function that performs these steps generically for each file.

In [10]:
def create_risk_file(file_buffer_to_read, file_name):
    global start_key_word_counters 
    start_key_word_counters = [0] * (len(start_key_words))
    cur_buffer_to_write = get_risk_buffer(file_buffer_to_read)
    write_list_buffer_to_file(cur_buffer_to_write, file_name, new_dir_with_outcome)

***Create all risk files for Data files***

Here we cut and recreate the files for all of the files in the Data folder.

In [11]:
def create_all_risk_files():
    for file_name in file_names:
        with open(dir_with_data  + '\\' + file_name, 'r', encoding='utf-8') as f:
            file_buffer_to_read = f.readlines()
        create_risk_file(file_buffer_to_read, file_name)

***Create a directory with a given name***

As the last step before calling the functions, we need a dedicated folder for the new files so that they are easy to use. We therefore created another function that makes a new directory with a meaningful name matching the folder name of the original data. The goal of these folders is quick and easy access to the various files during the process. More importantly, the original data stays intact and never has to be changed: everything we created can always be deleted to start from scratch.

In [12]:
def create_dir(dir_name):
    # exist_ok=True lets the notebook be re-run without a FileExistsError
    os.makedirs(dir_name, exist_ok=True)
    project_directories.append(dir_name)

#### Creating all the risk files in the needed directory

Finally, with all the required functions defined, we call them to create the new Risk Section files inside a new folder dedicated to them.

In [13]:
create_dir('results')
file_names = os.listdir(dir_with_data)
create_dir('results\\' + new_dir_with_outcome)
create_all_risk_files()

The notebook folder now holds a new folder containing all the parts that interest us for the data analysis. These data are not yet clean and still contain unnecessary characters and information, but they are the critical parts that the data analysis in this project relies on. In the next steps we show how we continued processing the data before the analysis and how we dealt with various problems in the original data.

---

### 2.3. Creating a corpus from the Risk Files

Now that we have the important sections, we create a corpus from them.

In [14]:
corpus_root = 'results\\' + new_dir_with_outcome 
risk_files_corpus = PlaintextCorpusReader(corpus_root, '.*') 

---

---

## 3. Conclusion

In Part 1. Data Gathering we reviewed the three databases we use for this project. We also described the data collection process, in which we managed to ***(1)*** download the 10-K files using an RPA Bot and ***(2)*** extract the specific section "1A. Risk Factors".

Finally, we managed to create around 700 workable files in total (both groups together), each containing "Item 1A. Risk Factors". Please see 1.4 Description- control group for further information.

Please note: from now on we will refer to the report files as the observations of the data set.