# Introduction

Three parts for each component:
1. Introduction
2. Example
3. Graded assignment

How many components?
1. Data extraction: web scraping
2. Data cleaning: Python
3. Entity recognition: JSON output of Koko results
4. Data transformation: from JSON to dataframe

Difficulty level:
1. From trivial to non-trivial tasks; this is a graduate course.

# 1. Data acquisition & cleaning

Task: extract the text of aviation incidents from websites  
Tools: BeautifulSoup, Scrapy  
Grading metrics: the autograder looks for a given set of sentences in the uploaded txt file.

We first use BeautifulSoup to extract text information from the aircraft incident website.  
BeautifulSoup has been pre-installed for you.

Here we use Python's own html.parser.  
You can definitely try other parsers, as illustrated [here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/).
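
As a minimal sketch of how the parser is chosen, the snippet below parses a small made-up page (not the assignment website) with `html.parser`. The sample HTML and the variable names are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# A tiny made-up page, used only for illustration.
sample_html = """
<html>
  <head><title>Sample incident list</title></head>
  <body>
    <h1>1935</h1>
    <p>A <b>Boeing 247D</b> crashes near Silver Crown, Wyoming.</p>
  </body>
</html>
"""

# The second argument names the parser. "html.parser" ships with Python;
# "lxml" and "html5lib" are alternatives that must be installed separately.
sample_soup = BeautifulSoup(sample_html, "html.parser")

page_title = sample_soup.title.string      # text inside <title>
first_paragraph = sample_soup.p.get_text() # text of the first <p>, tags dropped

print(page_title)
print(first_paragraph)
```

Swapping `"html.parser"` for another installed parser changes how malformed markup is repaired, so results can differ on messy pages.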

In [4]:
from bs4 import BeautifulSoup

input_html = "./data/aircraft_incidents.htm"
with open(input_html, "r") as ifile:
    soup = BeautifulSoup(ifile, "html.parser")
    
soup.title.string

'List of accidents and incidents involving commercial aircraft - Wikipedia'

**Assignment 1:** Use BeautifulSoup to scrape the website at \[server url\] and generate a text file containing all the text on the site. The generated file should not contain any HTML tags from the scraped page.

In [11]:
# Write your code here #

fname = "aviation_incidents.txt"

The following code checks whether the generated text file satisfies the requirement.
(We should run a stricter check after the assignment is submitted.)

In [28]:
import re

with open(fname, "r") as ifile:
    # Check whether any tag survived in the generated file
    doc = ifile.read()
    result = re.search(r"<.*>", doc)
    if result:
        # group(0) is the matched tag text; result.string would be the whole file
        print("Error: \n{}".format(result.group(0)))
    else:
        print("No tag found!")

Error: 
<html></html>
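
The check above stops at the first match. A slightly stricter variant might list every tag-like substring instead. The sketch below uses only the standard library; `find_tags` and `TAG_PATTERN` are hypothetical names, and the pattern is a heuristic rather than a full HTML parser.

```python
import re

# Matches an opening, closing, or self-closing tag such as <p>, </div>, <br/>.
# A letter must follow "<" or "</" immediately, so plain text like
# "a < b > c" is not reported.
TAG_PATTERN = re.compile(r"</?[A-Za-z][^<>]*>")

def find_tags(text):
    """Return every tag-like substring found in text."""
    return TAG_PATTERN.findall(text)

dirty = "<html></html>\nNothing special."
clean = "On October 7, 1935, United Airlines Trip 4 crashes."

print(find_tags(dirty))
print(find_tags(clean))
```

An empty list means no tag-like text was found, which is what a correct submission should produce.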



After the web page is scraped, you may find that the extracted data is not very organized.  
So the next step is to clean the extracted data, so that it can be used for further analysis.

* Remove uninteresting content, such as the "main page" and "Donate to Wikipedia" navigation text.
* Decoding?
* Apostrophe elimination
* Punctuation removal
* Spell check
* 
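
A few of the steps listed above can be sketched with the standard library alone. This is an illustration under assumptions: "decoding" is read here as decoding HTML entities, the boilerplate set is a made-up sample, and `clean_lines` is a hypothetical helper. Spell checking is left out because it needs a third-party package.

```python
import html
import string

# Navigation strings to drop; this set is illustrative, not exhaustive.
BOILERPLATE = {"main page", "donate to wikipedia"}

# Translation table that deletes every ASCII punctuation character.
# The apostrophe is part of string.punctuation, so it is removed too.
STRIP_PUNCT = str.maketrans("", "", string.punctuation)

def clean_lines(lines):
    """Yield cleaned lines, skipping blanks and known boilerplate."""
    for line in lines:
        text = html.unescape(line).strip()  # e.g. "&amp;" becomes "&"
        if not text or text.lower() in BOILERPLATE:
            continue
        # Delete punctuation, then collapse any doubled spaces left behind.
        yield " ".join(text.translate(STRIP_PUNCT).split())

raw = [
    "Main page",
    "Donate to Wikipedia",
    "",
    "The aircraft's crew &amp; passengers survived.",
]
cleaned = list(clean_lines(raw))
print(cleaned)
```

Punctuation removal also strips the commas and periods in dates and sentences, so it may be worth applying only when later steps do not rely on sentence structure.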

Task: clean the extracted text file so that every incident line includes its year.  
Tools: Python  
Grading metrics: the autograder checks the format of each line.
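
One possible shape for this step is sketched below. It assumes the scraped text puts each year on its own line, followed by entries of the form "October 7 – description"; that layout, the output format, and the name `attach_years` are assumptions, so adjust the patterns to the actual file.

```python
import re

# A line holding only a four-digit year, e.g. "1935".
YEAR_LINE = re.compile(r"^\d{4}$")
# Assumed entry shape: "<Month> <day> <dash> <description>", where the
# dash is an en dash (\u2013) or a plain hyphen.
ENTRY_LINE = re.compile(r"^([A-Z][a-z]+ \d{1,2})\s*[\u2013-]\s*(.+)$")

def attach_years(lines):
    """Rewrite each entry as 'On <Month> <day>, <year>, <description>'.

    Lines that are neither a year heading nor an entry are dropped.
    """
    current_year = None
    out = []
    for line in lines:
        line = line.strip()
        if YEAR_LINE.match(line):
            current_year = line
            continue
        match = ENTRY_LINE.match(line)
        if match and current_year:
            out.append("On {}, {}, {}".format(
                match.group(1), current_year, match.group(2)))
    return out

sample = [
    "1935",
    "October 7 \u2013 United Airlines Trip 4 crashes near Silver Crown, Wyoming.",
    "1936",
    "December 27 \u2013 United Airlines Trip 34 crashes at Rice Canyon.",
]
with_years = attach_years(sample)
for line in with_years:
    print(line)
```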

# 3. Entity recognition

Task: extract airlines and aircraft types  
Tools: Koko  
Grading metrics: the autograder examines the JSON output files

We can try extracting airline names from the corpus to see which airline has the most incidents.  
A number of tools can extract entities from text, including NLTK, spaCy, and Google NLP.  
Today we introduce an entity extraction framework called Koko, which lets the user specify properties of the extracted entities declaratively.

In [9]:
with open('./koko_queries/airlines_v1.koko', 'r') as file:
    print(file.read())

extract "Ents" x from "/Users/chen/Research/Code/BigGorilla-assignments/Koko/data/aviation_lists_cleaned.txt" if
	(str(x) contains "Airlines" {0.1}) or
	(str(x) contains "Air" {0.1})
with threshold 0.0



This query tells Koko to extract entities "x" from the cleaned aviation corpus if the text of "x" contains either "Airlines" or "Air".

The weight in each "if" condition (e.g., {0.1} for (str(x) contains "Airlines")) represents the importance of the pattern specified in the condition.
Any appearance of an entity in the corpus that matches the pattern is considered a piece of evidence.
Each such piece of evidence increments the entity's score by the condition's weight.

For example, the sentence "On October 7, 1935, United Airlines Trip 4, a Boeing 247D, crashes near Silver Crown, Wyoming" is considered evidence for "United Airlines" based on the first condition, and 0.1 is added to the score of "United Airlines".
In Koko, the score of an entity is at most 1.

Finally, we can specify a threshold in Koko queries.
Only entities scoring higher than the threshold are returned in the results.
For simplicity, I set the threshold to zero here, which shows all entities that have at least one piece of evidence in the corpus.
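
The scoring rule described above can be mimicked in a few lines of plain Python. This is only an illustration of the idea, not Koko's implementation: it uses plain substring matching (how Koko's `contains` treats tokens is not covered here), and the mention list and the name `score_entities` are made up.

```python
# Conditions from the query as (substring, weight) pairs. Illustration only.
CONDITIONS = [("Airlines", 0.1), ("Air", 0.1)]

def score_entities(mentions, conditions, threshold=0.0):
    """Add each matching condition's weight per mention, capped at 1.0."""
    scores = {}
    for mention in mentions:
        for needle, weight in conditions:
            if needle in mention:
                # Every match is one piece of evidence for this entity.
                scores[mention] = min(1.0, scores.get(mention, 0.0) + weight)
    # Keep only entities scoring strictly above the threshold.
    return {name: s for name, s in scores.items() if s > threshold}

# Made-up mentions: one entry per appearance of an entity in the corpus.
mentions = ["United Airlines"] * 27 + ["Air France"] * 2 + ["Boeing 247"]
scores = score_entities(mentions, CONDITIONS)

for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print("{:<20} {:.6f}".format(name, score))
```

With a threshold of zero, any entity with at least one piece of evidence is kept, while an entity that matches no condition never receives a score and is left out.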

If you are interested, check out more tutorials for Koko [here](http://pykoko.readthedocs.io/en/latest/).

Let's run the Koko query now to see the results.

Here I use spaCy as the NLP processor for the corpus. Koko can leverage spaCy's APIs for entity extraction.
The extracted entities are then matched against the conditions in the Koko query to be scored, ranked, and filtered.

spaCy is not the only option. We can also use Koko's default parser or the Google NLP API.

In [8]:
import koko
import spacy

koko.run('./koko_queries/airlines_v1.koko', doc_parser='spacy', verbose_info=True)

Parsed query: extract "/Users/chen/Research/Code/BigGorilla-assignments/Koko/data/aviation_lists_cleaned.txt" Ents from "x" if
	(str(x) contains "Airlines" { 0.10 }) or
	(str(x) contains "Air" { 0.10 })   
with threshold 0.00


Results:

Entity name                    Entity count         Entity score
United Airlines                27                   1.000000
	 On October 7, 1935, United Airlines Trip 4, a Boeing 247D, crashes near Silver Crown, Wyoming, United States, due to pilot error; all 12 on board die.
 

	 On December 27, 1936, United Airlines Trip 34, a Boeing 247, crashes at Rice Canyon (near Newhall, California, United States) due to pilot error, killing all 12 on board.
 

	 On May 29, 1947, United Airlines Flight 521, a Douglas DC-4, crashes on takeoff from LaGuardia Airport, New York, United States, due to pilot error; 42 of 48 on board die.
 

	 On October 24, 1947, United Airlines Flight 608, a Douglas DC-6, crashes near Bryce Canyon Airport, Utah, United States, when

# 4. Data transformation

Task: transform Koko's JSON output into a dataframe  
Tools: Python (read_json)  
Grading metrics: the autograder loads the dataframe and runs some checks, e.g., on its shape.  
Comments: is this part too simple?
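
A minimal sketch of this step with pandas follows. The JSON layout and the key names (`entity_name`, `entity_count`, `entity_score`) are assumptions, since the real schema of Koko's output is not shown here, and the second record is invented.

```python
import io

import pandas as pd

# Hypothetical JSON: a list of records, one per extracted entity.
# The real Koko output schema may differ; the second row is made up.
koko_json = """
[
  {"entity_name": "United Airlines", "entity_count": 27, "entity_score": 1.0},
  {"entity_name": "Example Air", "entity_count": 3, "entity_score": 0.3}
]
"""

# read_json accepts a path or a file-like object; StringIO stands in for
# the output file so the sketch is self-contained.
df = pd.read_json(io.StringIO(koko_json))

print(df.shape)
print(df.sort_values("entity_count", ascending=False))
```

A shape check like the one the autograder might run is then a single comparison against `df.shape`.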