In this exercise, we'll build a mock version of the system described above. We won't use a real database, and some of the components will be simple mocks (we won't trade real money on real trading platforms), but it will help you get a feel for designing a pipeline-based system. We'll build the Scraper, Cleaner, Deduplicator, Analyzer, and DecisionMaker components of that system.

1. Define some example URLs of news articles that might be interesting, as shown in the code below.

In [5]:
uber_url = "https://www.reuters.com/article/us-uber-lawsuit-california/uber-is-sued-over-resistance-to-california-gig-employment-law-idUSKCN1VX1VE"
apple_url = "https://www.reuters.com/article/us-apple-macbook/apple-refreshes-macbook-pro-laptop-with-16-inch-screen-idUSKBN1XN1V8"
apple_url2 = "https://www.reuters.com/article/us-apple-macbook/apple-refreshes-macbook-pro-laptop-with-16-inch-screen-idUSKBN1XN1V8"

article_urls = [uber_url, apple_url, apple_url2]

We define only three URLs, all from a single source. Two of them are identical, which gives the Deduplicator something to remove later. In the last line, we collect all three URLs in a list.

2. Import the `requests` and `string` libraries and the `Counter` class from `collections`, as shown in the code below.

In [4]:
import requests
import string
from collections import Counter

We imported two libraries and one class that we'll use in various components.

3. Define a Scraper class that can fetch news articles and extract the full contents, including HTML, as shown in the code below.

In [None]:
class Scraper:
        
    def fetch_news(self, urls):
        article_contents = []
        for url in urls:
            try:
                contents = requests.get(url).text
                article_contents.append(contents)
            except Exception as e:
                print(e)
        return article_contents

Here we defined a `Scraper` class that can fetch news articles. We give it a list of URLs, and it loops through them. For each URL, it attempts to fetch the page with the `requests` library and appends the page contents to a list. If a request fails, it prints the error and moves on to the next URL. Finally, it returns the list. Note that the Scraper takes in a list of URLs (which we have) and outputs a list containing the contents of each page it managed to fetch.

4. Define an `is_clean` function which we'll use in our `Cleaner` module, as shown in the code below.

In [6]:
def is_clean(word):
    blacklist = {"var", "img", "e", "void"}
    if not word:
        return False
    if word in blacklist:
        return False
    for i, letter in enumerate(word):
        if i > 0 and letter in string.ascii_uppercase:
            return False
        if letter not in string.ascii_letters:
            return False
    return True

This function is outside of the main `Cleaner` module. It looks at a word and decides if it is part of an article or not. We use a very naive method for this. If the word is blank, we discard it. If the word is in our blacklist, we also discard it, as it is probably part of the JavaScript content of the page.

If both of the above tests pass, we check if the word has any uppercase letters that are not the first letter. If it does, it is probably a function name. Finally, we check if all of the letters in the word are part of the English alphabet. If any other characters are present, we discard the word.
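To get a feel for how strict these rules are, the sketch below repeats the `is_clean` definition so that it runs on its own, then applies it to a handful of sample words. The sample words are made up for illustration and are not taken from the articles.

```python
import string

def is_clean(word):
    # Same logic as the function defined in this step, repeated so that
    # this snippet is self-contained.
    blacklist = {"var", "img", "e", "void"}
    if not word:
        return False
    if word in blacklist:
        return False
    for i, letter in enumerate(word):
        if i > 0 and letter in string.ascii_uppercase:
            return False
        if letter not in string.ascii_letters:
            return False
    return True

# Hypothetical sample words, chosen to exercise each rule.
samples = ["Uber", "sued", "var", "", "getElementById", "law.", "MacBook"]
results = {word: is_clean(word) for word in samples}
for word, verdict in results.items():
    print(repr(word), verdict)
```

Only `Uber` and `sued` pass. Note that `law.` and `MacBook` are rejected even though they are real article words, which is the price of such a simple filter.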

5. Define the full Cleaner module which uses the `is_clean` function, as shown in the code below.

In [10]:
class Cleaner:
    
    def clean_articles(self, articles):
        clean_articles = []

        for article in articles:
            clean_words = []
            try:
                for word in article.split(" "):
                    if is_clean(word):
                        clean_words.append(word)
            except Exception as e:
                print(e)
            clean_articles.append(' '.join(clean_words))
        return clean_articles

In this code, we define a `Cleaner` module with a `clean_articles` method. This method takes the list of articles that the Scraper produced and loops through it. For each article, it breaks it into words and keeps only the clean words. It then joins these together again, adds the result to a new list, and finally returns the list of cleaned articles.

6. Create the Deduplicator module as shown in the code below.

In [11]:
class Deduplicator:
    
    def deduplicate_articles(self, articles):
        seen_articles = set()
        deduplicated_articles = []
        for article in articles:
            if hash(article) in seen_articles:
                continue
            else:
                seen_articles.add(hash(article))
                deduplicated_articles.append(article)
                    
        return deduplicated_articles

This module takes in a list of clean articles and checks if any of them are duplicated. It keeps only a single copy of each one and returns a new list, without any duplicates.

7. Create the Analyzer module, as shown in the code below.

In [12]:
class Analyzer:
    good_words = {"unveiled", "available", "faster", "stable"}
    bad_words = {"sued", "defiance", "violation"}

    def extract_entities_and_sentiment(self, articles):
        entity_score_pairs = []
        for article in articles:
            score = 0
            entities = []
            for word in article.split():  # split() never yields empty strings, so word[0] is safe
                if word[0] == word[0].upper():
                    entities.append(word)
                if word.lower() in self.good_words:
                    score += 1
                elif word.lower() in self.bad_words:
                    score -= 1
            main_entities = [i[0] for i in Counter(entities).most_common(2)]
            entity_score_pair = (main_entities, score)
            entity_score_pairs.append(entity_score_pair)
        return entity_score_pairs

The Analyzer module is a bit more complicated than the previous modules because it has two jobs: to extract the entities from an article, and to extract a sentiment score. We first define two very limited sets of 'good words' and 'bad words'. If the article is talking about a new product being unveiled, that is a good sign. If a company is being sued, that is probably bad. We then define a function that loops through each clean article (with duplicates already removed) and looks at every word.

For each word, it checks if the word is an entity (it guesses that it is if it starts with a capital letter). It then checks if the word is regarded as a 'good' or 'bad' word. If it is good, it adds 1 to the score. If it is bad, it subtracts 1. If the word does not appear in either list, it leaves the score as is.

Finally, it finds the two most common entities mentioned in the article, and creates a data structure with both of these entities and the overall score. It returns this as output.
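As a small worked example of this logic, the sketch below applies the same scoring and entity-counting steps to a single sentence. The sentence is a made-up stand-in for a cleaned article, and the code is written as a flat script rather than as the `Analyzer` class so that it can be run on its own.

```python
from collections import Counter

good_words = {"unveiled", "available", "faster", "stable"}
bad_words = {"sued", "defiance", "violation"}

# Hypothetical cleaned article text, made up for this example.
article = ("Apple unveiled a faster Pro laptop and Apple said it is "
           "available after Apple was sued")

score = 0
entities = []
for word in article.split():
    # Guess that a word is an entity if it starts with a capital letter.
    if word[0] == word[0].upper():
        entities.append(word)
    if word.lower() in good_words:
        score += 1
    elif word.lower() in bad_words:
        score -= 1

# Keep the two most frequently mentioned entities.
main_entities = [name for name, count in Counter(entities).most_common(2)]
print(main_entities, score)
```

Three good words and one bad word give a score of 2, and `Apple` (three mentions) and `Pro` (one mention) are the two most common capitalized words.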

8. Now create the Decision Maker module using the following code.

In [14]:
class DecisionMaker:
    target_companies = set(['Apple', 'Uber', 'Google'])
        
    def make_decisions(self, entity_score_pairs):
        decisions = []
        for entities, score in entity_score_pairs:
            if score == 0:
                continue  # A neutral score carries no signal, so place no order.
            for entity in entities:
                if entity in self.target_companies:
                    quantity = abs(score)
                    order = "Buy" if score > 0 else "Sell"
                    decision = (order, quantity, entity)
                    decisions.append(decision)
        return decisions

This module has a set of target companies. These are the companies whose stock we want to trade. It takes as input the entity and score pairs that we created in the Analyzer module and turns these into structured trading decisions. If the score is positive for a given entity, it buys that stock. If it is negative, it sells the stock. The more positive or negative the score is, the more stock it buys or sells. It returns a list of decisions as output.

9. Initialise all components by running the following code.

In [15]:
scraper = Scraper()
cleaner = Cleaner()
deduplicator = Deduplicator()
analyzer = Analyzer()
decision_maker = DecisionMaker()

We created all 5 components, and they are now ready to be tested.

10. Fetch the news articles with the scraper and print out an excerpt, by running the following code.

In [19]:
contents = scraper.fetch_news(article_urls)
contents[0][:500]

'<!--[if !IE]> This has been served from cache <![endif]-->\n<!--[if !IE]> Request served from apache server: produs--i-0a4a08336159d88d2 <![endif]-->\n<!--[if !IE]> Cached on Wed, 26 Feb 2020 23:01:28 GMT and will expire on Wed, 26 Feb 2020 23:16:19 GMT <![endif]-->\n<!--[if !IE]> token: f9fd82a6-e004-4871-85e1-63089475bceb <![endif]-->\n<!--[if !IE]> App Server /produs--i-0655f4557687834a5/ <![endif]-->\n\n<!doctype html><html lang="en" data-edition="BETAUS">\n    <head>\n\n    <title>\n                U'

We ran our Scraper and output the first 500 characters of the first article. We can see that it fetched content, but the result is messy and full of HTML tags and other information that is not part of the article.

11. Pass these articles to the cleaner for cleaning, as in the following code.

In [22]:
clean_articles = cleaner.clean_articles(contents)
clean_articles[0][:500]

'This has been served from cache Request served from apache Cached on Feb and will expire on Feb App Server Uber is sued over resistance to California employment law Segment snippet included if Page hiding snippet Data Layer Object Declaration New Google Tag Manager new End Google Tag Manager new driver for Uber has sued the company for misclassifying its drivers as independent hours after California legislators voted to help thousands of those workers and enjoy the benefits of produced in Proces'

We ran our cleaner and output the first 500 characters of the first article. We can see that a lot of the junk is removed. It isn't perfect, as there is still some content in the beginning that is not from the article, but cleaning is a tricky task and at least we can see that the real content appears near the beginning.

12. Check how many articles we have, run the deduplicator and then check the count of the articles again, as in the code below.

In [25]:
print(len(clean_articles))
deduplicated = deduplicator.deduplicate_articles(clean_articles)
print(len(deduplicated))

3
2


We printed out the length of our `clean_articles` and noted that we have 3, one for each of our original URLs. We then ran our deduplicator, which removed the duplicate article, leaving us with the text of 2 articles.

13. Run our analyzer on our clean deduplicated articles, as shown in the code below.

In [27]:
entity_score_pairs = analyzer.extract_entities_and_sentiment(deduplicated)
print(entity_score_pairs)

[(['Uber', 'California'], -18), (['Pro', 'Apple'], 16)]


We ran the analyzer on our articles. We can see that it figured out that the first article was mainly about Uber and California and that it had a negative sentiment. The second article was mainly about Apple and "Pro" (the article talks a lot about the new MacBook Pro), and has a positive sentiment.

14. Pass this information to our Decision Maker to create trade instructions, as shown in the code below.

In [30]:
decisions = decision_maker.make_decisions(entity_score_pairs)
print(decisions)

[('Sell', 18, 'Uber'), ('Buy', 16, 'Apple')]


We created two decisions from our entity and sentiment pairs. The decision maker wants to sell 18 shares of Uber and buy 16 shares of Apple.

All of our components are very naive and would not work well in a real-world case. The Scraper downloads articles that we give it, but can't find them for itself. The cleaner doesn't even attempt to parse HTML, and keeps a lot of content that is not relevant. It also discards a lot of "real" words: those with punctuation, brand names with capitals in the middle of a word, and more.
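One way a cleaner could parse the HTML instead of filtering words is sketched below, using `html.parser` from Python's standard library. This is a different technique from the one used in the exercise, and the names `TextExtractor` and `html_to_text` as well as the sample page are assumptions made for illustration. It collects visible text and skips anything inside `<script>` and `<style>` tags.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""
    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-blank text that is not inside a skipped tag.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

# Hypothetical miniature page, made up for this example.
sample_html = (
    "<html><head><title>Uber is sued</title>"
    "<script>var e = 1;</script></head>"
    "<body><p>A driver has sued Uber.</p></body></html>"
)
text = html_to_text(sample_html)
print(text)
```

Because the parser understands tags, punctuation and mixed-case brand names survive, while the JavaScript is dropped without needing a blacklist. Real news pages would still need extra work to separate the article body from navigation and adverts.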

Our deduplicator only deals with exact duplicates, but in real cases there are often small differences between articles that are almost the same. Hashing the full text treats these near-duplicates as distinct articles, so it is not a good strategy here. Our Analyzer uses some hand-picked word lists that are relevant mainly to the articles that we chose, and it has a very naive entity extractor, relying only on capital letters.
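A minimal sketch of near-duplicate detection is shown below. It swaps hashing for Jaccard similarity over each article's set of words, which is only one of several possible approaches. The function names, the sample headlines, and the threshold of 0.8 are all assumptions chosen for illustration.

```python
def jaccard_similarity(text_a, text_b):
    """Share of distinct words that two texts have in common (0.0 to 1.0)."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    if not words_a and not words_b:
        return 1.0  # Two empty texts are treated as identical.
    return len(words_a & words_b) / len(words_a | words_b)

def deduplicate_near(articles, threshold=0.8):
    """Keep an article only if it is not too similar to one already kept."""
    kept = []
    for article in articles:
        if all(jaccard_similarity(article, other) < threshold for other in kept):
            kept.append(article)
    return kept

# Hypothetical headlines: the first two differ by a single word.
articles = [
    "Apple refreshes MacBook Pro laptop with larger screen",
    "Apple refreshes MacBook Pro laptop with a larger screen",
    "Uber is sued over resistance to California employment law",
]
unique = deduplicate_near(articles)
print(len(unique))
```

The first two headlines share 8 of 9 distinct words, so the second is dropped and two articles remain. Each article is compared with every article kept so far, so this sketch would need a smarter index to scale to large collections.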

Finally, our decision maker does not take information from all articles into account: it creates a separate order for each article. If there is one very positive article on Apple and 10 slightly negative ones, it might still decide to buy more stock than it sells.
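One possible way to combine articles is sketched below: sum the scores per company and issue a single net order for each. The function name `aggregate_decisions` and the sample scores are assumptions for illustration. Summing is only the simplest choice, and it yields the same net position as the separate orders would, so a real system might prefer an average or a vote across articles.

```python
from collections import defaultdict

def aggregate_decisions(entity_score_pairs, target_companies):
    """Sum scores per target company, then create one net order each."""
    totals = defaultdict(int)
    for entities, score in entity_score_pairs:
        for entity in entities:
            if entity in target_companies:
                totals[entity] += score

    decisions = []
    for entity, total in totals.items():
        if total == 0:
            continue  # Positive and negative news cancel out: no order.
        order = "Buy" if total > 0 else "Sell"
        decisions.append((order, abs(total), entity))
    return decisions

# Hypothetical analyzer output: three Apple articles and one Uber article.
pairs = [
    (["Apple", "Pro"], 16),
    (["Apple"], -1),
    (["Apple"], -1),
    (["Uber", "California"], -18),
]
decisions = aggregate_decisions(pairs, {"Apple", "Uber", "Google"})
print(decisions)
```

The three Apple articles collapse into one order to buy 14 shares, and the single Uber article becomes an order to sell 18.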

With all of the above in mind, we can still see that the system is both very modular and pipeline-based. Each component is responsible only for a single aspect of the entire system, and any of them can be improved without affecting the others, as long as the input and output formats remain the same. The output of each component is fed as input into another component, leaving us with a very neat pipeline.
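The idea of feeding each output into the next component can be sketched as a small helper that applies a list of stages in order. The helper `run_pipeline` and the three stand-in stages below are hypothetical and much simpler than the components in this exercise. They only illustrate the chaining.

```python
from functools import reduce

def run_pipeline(data, stages):
    """Pass `data` through each stage in order, feeding outputs forward."""
    return reduce(lambda value, stage: stage(value), stages, data)

# Hypothetical stand-in stages: each takes a list and returns a list.
def strip_whitespace(articles):
    return [article.strip() for article in articles]

def drop_duplicates(articles):
    # dict.fromkeys keeps the first occurrence and preserves order.
    return list(dict.fromkeys(articles))

def count_words(articles):
    return [len(article.split()) for article in articles]

raw = ["  Uber is sued ", "Apple unveiled a laptop", "Apple unveiled a laptop"]
result = run_pipeline(raw, [strip_whitespace, drop_duplicates, count_words])
print(result)
```

With the objects from this exercise, the list of stages would instead be `scraper.fetch_news`, `cleaner.clean_articles`, `deduplicator.deduplicate_articles`, `analyzer.extract_entities_and_sentiment`, and `decision_maker.make_decisions`, starting from `article_urls`. Swapping in an improved component then means changing one entry in that list.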

This is great for maintaining and understanding the system, but also good for reproducing results. Often in machine learning systems, reproducibility is important, and with a structured pipeline whose components are deterministic, you can always feed the same data in to get the same results out.