# FSX Proof of Concept 
*by Graham Lim, Data Science Consultant for FSX*

In this notebook, we will be demonstrating our **preliminary Proof of Concept ("POC")** for our **FSX Risk Decisioning Platform (the "Platform")**.
This will be written in `python`, with explanatory notes given to explain each step of our proprietary process.

## Concept Summary
Conceptually, the Platform is a supervised machine learning (ML) media sentiment classifier that automatically mines news articles dealing with food supply risk and classifies them by predicted risk label. It will use statistical ML models such as those from `scikit-learn`, as well as Deep Learning/Neural Network models such as Google's open-source `BERT` Transformer.

Based on that risk labelling, an automated text report with visualizations is generated with insights. Initially, this will employ standard template-based code. Future versions will incorporate unsupervised text generation algorithms (e.g. `GPT2`) and/or commercial text-generation software (e.g. `Arria NLG`).

**Disclaimer**: Our POC has been designed for the sake of demonstrating capability and conceptual proof only, and is not strictly optimized for typical Machine Learning performance metrics (e.g. `accuracy`, `precision`, `recall`, `f1 score` etc.) due to the small sample dataset used here.

## POC Contents:
1. **Webscraping**: prototype code that scrapes a sample of news articles about food supply risk from the web using `Selenium` and `BeautifulSoup`, two open-source scraping libraries. Future versions will scrape the web as comprehensively as possible and produce a far larger, statistically significant dataset.


2. **Cleaning and Labelling**: prototype code that transforms the sample scraped news data into a machine-readable DataFrame saved in .csv format. We use the popular `Pandas` library for this.

    We initially manually label the risk profiles of each news article in our sample data, for the sake of being able to train our prototype model. Note again that future versions will have a far larger and statistically significant dataset.


3. **Preprocessing and Modelling**: The text that we scraped has to be pre-processed into a machine-readable format through steps such as tokenization, stop-word removal, and lemmatization. We then run models from `scikit-learn`, or a custom Neural Network such as BERT, on this data to classify news text into a risk category, e.g. `high`, `medium`, `low`, or `neutral`.

    For the sake of simplicity in this POC we will use `risky`, `neutral` and `stable` as our preliminary labels.
    

4. **Report Generator**: our report-generation algorithm is intended to provide actionable insights based on that risk assessment. In our prototype here, the report-generator will provide the following information:

    * Reiterate the risk label
    
    * Describe which keywords in the article triggered our label to be e.g. risky
    
    * Describe which entities are being discussed in the article
    
    * Based on the risk label, the algorithm **recommends** via text that other suppliers of the food item in question be reviewed.
    

## 1. Webscraping

You will need to run the following pip install commands in terminal or cmd line:

* `pip install bs4` (for BeautifulSoup)
* `pip install selenium` (for Selenium)
* `pip install webdriver-manager` (for the automated Selenium web driver to work)
* `pip install lxml` (for the `lxml` parser that BeautifulSoup uses in the scraping code)

In [1]:
#Standard Python DS imports:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
#set column size to be larger
pd.set_option("display.max_colwidth", 1000)

We have to use `Selenium` because the articles are spread across many paginated webpages rather than shown on a single page, and driving a browser makes them easier to scrape.

Hence, we will import `Selenium` and the related `WebDriver Manager` tool to run a Chrome instance within Selenium that will scrape our sample food supply news data.

In [3]:
#Selenium and WebDriver Manager imports:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

import time
from selenium.webdriver.common.keys import Keys

For our sample news feed, we are taking all articles tagged with the keyword `Rice` from `www.foodnavigator-asia.com`.

In [4]:
base_url = "https://www.foodnavigator-asia.com/Trends/Supply-chain?page="

Here is the scraping code for this particular website. Future versions of this code will utilize a larger dataset aggregated from other news sites.
For the initial phase of development, this would essentially involve a large-scale media analysis undertaking to manually label data first in order to train our supervised labelling model.

In [13]:
#This code will scrape data from the URL in question
# driver = webdriver.Chrome(ChromeDriverManager().install())

def getPages(url):
    driver = webdriver.Chrome("C:/Users/User/Desktop/FYP/chromedriver.exe") 
    driver.get(url)    
    numPages = driver.find_element(By.XPATH, "/html/body/div[2]/div/main/div[1]/div/ul/li[7]/a").text
    driver.close()
    return numPages



intro_list = []
title_list = []
date_list = []
post_url_list = []
    
def getArticles(target_url):     
    numPages = int(getPages(target_url))
    numPages += 1

    for i in range(1,numPages):
        driver = webdriver.Chrome("C:/Users/User/Desktop/FYP/chromedriver.exe")
        url = target_url + str(i)
        driver.get(url)
#         time.sleep(15)
#         driver_body = driver.find_element_by_tag_name('body')
        driver_body = WebDriverWait(driver,20).until(EC.visibility_of_all_elements_located((By.TAG_NAME,"body")))

        html = ""
        html = driver.page_source
        soup = BeautifulSoup(html, 'lxml')

        intro_elems = driver.find_elements(By.CLASS_NAME, "Teaser-intro")

        for intro in intro_elems:
            intro_list.append(intro.text)

        title_elems = driver.find_elements(By.CLASS_NAME, "Teaser-title")

        for title in title_elems:
            title_list.append(title.text)

        date_elems = driver.find_elements(By.CLASS_NAME, "Teaser-date")
        for date in date_elems:
            date_list.append(date.text)

        linksdiv = soup.find_all('h3', {'class': 'Teaser-title'})
        for linkdiv in linksdiv:
            post_url_list.append('www.foodnavigator-asia.com'+(linkdiv.find('a')['href']))

        driver.close()
    return 
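The scraper above prepends the bare host name to each `href`, so the stored links have no scheme and every `href` is assumed to be site-relative. A minimal sketch of a more defensive alternative, using the standard library's `urljoin`, is shown here. The `SITE_ROOT` constant and the `absolute_url` helper are hypothetical names introduced for illustration, and the `https` scheme is taken from `base_url`.

```python
from urllib.parse import urljoin

# Assumed site root (hypothetical constant); the scheme is included so links can be requested directly.
SITE_ROOT = "https://www.foodnavigator-asia.com/"

def absolute_url(href, root=SITE_ROOT):
    """Resolve a relative or already-absolute href against the site root."""
    return urljoin(root, href)

# A site-relative link is resolved against the root; an absolute link is returned unchanged.
print(absolute_url("/Article/2004/03/17/China-Vietnam-food-trade-boost"))
print(absolute_url("https://example.com/some-article"))
```

Storing full URLs would make it easier to fetch the complete article text in later versions.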

In [14]:
getArticles(base_url)

In [15]:
#checking scraped length
print (len(intro_list))
print (len(title_list))
print (len(date_list))
print (len(post_url_list))

2889
2889
2889
2889


In [16]:
#We then convert these lists into a labelled pandas DataFrame and check its `shape`:

df = pd.DataFrame({'date': date_list,
                   'title': title_list,
                   'intro': intro_list,
                   'url': post_url_list
                  })

df.shape

(2889, 4)

This is what our initially scraped news data looks like. In future, our algorithm will use not only the `title` and `intro` text of each news article but also its full text, to capture more information.

In [17]:
# df.head()
df.tail(1)

| | date | title | intro | url |
|---|---|---|---|---|
| 2888 | 17-Mar-2004 | China-Vietnam food trade boost | Trade in food and beverage products is expected to grow significantly on the back of an agreement made between China and Vietnam which will end tariffs on nearly 500 seafood and agricultural products during the course of the next... | www.foodnavigator-asia.com/Article/2004/03/17/China-Vietnam-food-trade-boost |


In [20]:
df.to_csv("csv_data/food_navigator.csv")

# --- break ---

## 2. Cleaning and Labelling

Now that our data has been scraped, we have to re-format it and label the risk rating of each of our sample news feed articles in a way that our machine learning models will understand. For example, we'll initially have to combine the `title` text and `intro` text of each news article:

In [None]:
df = pd.read_csv("../data/df_raw.csv", index_col = 0)

In [None]:
#checking for null values

df.isnull().sum()

In [None]:
#combine title and intro text in one

df["combined"] = df["title"] + ". " + df["intro"]

df["combined"]
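One caveat with the concatenation above: in pandas, adding a string to a missing value yields a missing value, so any row lacking a `title` or an `intro` ends up with an empty `combined` field. A minimal plain-Python sketch of a more forgiving join is shown here; `combine_fields` is a hypothetical helper, not part of the original notebook, and it could be applied row by row if the null check reveals gaps.

```python
def combine_fields(title, intro, sep=". "):
    """Join title and intro, skipping whichever is missing or blank."""
    # Non-string values (None, float NaN) and whitespace-only strings are ignored.
    parts = [p.strip() for p in (title, intro) if isinstance(p, str) and p.strip()]
    return sep.join(parts)

print(combine_fields("China-Vietnam food trade boost", "Trade in food is expected to grow"))
print(combine_fields("China-Vietnam food trade boost", None))
```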

In [None]:
#we create a default risk rating of neutral first for all news articles
df["risk_rating"]="neutral"

In [None]:
df.head()

In [None]:
#manually annotate as risky, neutral or stabilizing to Asia rice food supply

#going through 10 rows at a time to annotate

#all news is neutral unless otherwise annotated as risky or stabilizing


risky_rows = [2, 3, 7, 13, 17, 18, 20, 37, 43, 49, 53, 54, 58, 68, 71]
stabilizing_rows = [6, 8, 9, 10, 11, 12, 14, 21, 24, 25, 27, 29, 30, 32, 42, 
                    46, 47, 48, 51, 55, 57, 61, 64, 65, 70, 73, 74]

In [None]:
len(risky_rows)

In [None]:
len(stabilizing_rows)

In [None]:
df[["combined"]][60:75]

In [None]:
#we replace our placeholder "neutral" rating in our manual annotations of risky rows 
for i in risky_rows:
    df.at[i, "risk_rating"] = "risky"

In [None]:
for i in stabilizing_rows:
    df.at[i, "risk_rating"] = "stable"

In [None]:
#risky ratings are in, looks ok
df.head(10)

In [None]:
df["risk_rating"].value_counts()

In [None]:
stable_ratings = df["combined"][df["risk_rating"]=="stable"]

In [None]:
df[df["risk_rating"]=="stable"]

In [None]:
df.loc[stabilizing_rows, "risk_rating"] = "stable"

We now label the risk rating in this initial `training dataset` so that our machine learning models can learn which text ought to be classified under which rating, and can do so automatically in the future.

Again, for simplicity's sake, we use simpler labels like `risky`, `neutral`, and `stable`; future versions will have more nuanced risk labelling.
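Because the annotations are kept as hand-typed lists of row numbers, it is easy to list the same row under two labels or to point past the end of the data. The following is a small sketch of a sanity check that could be run before writing the labels; `build_labels` is a hypothetical helper and the row numbers in the usage line are made up for illustration.

```python
def build_labels(n_rows, risky_rows, stabilizing_rows, default="neutral"):
    """Return one label per row, rejecting overlapping or out-of-range annotations."""
    overlap = set(risky_rows) & set(stabilizing_rows)
    if overlap:
        raise ValueError(f"rows annotated twice: {sorted(overlap)}")
    out_of_range = [i for i in list(risky_rows) + list(stabilizing_rows) if not 0 <= i < n_rows]
    if out_of_range:
        raise ValueError(f"row numbers outside the data: {sorted(out_of_range)}")

    labels = [default] * n_rows
    for i in risky_rows:
        labels[i] = "risky"
    for i in stabilizing_rows:
        labels[i] = "stable"
    return labels

# Illustrative row numbers only
print(build_labels(6, risky_rows=[2, 3], stabilizing_rows=[5]))
```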

In [None]:
df.head(10)

In [None]:
df["risk_rating"].value_counts()

In [None]:
pwd

In [None]:
df.to_csv("../data/df_ratings_raw.csv")

## 3A. Pre-processing our Data 

In [None]:
df = pd.read_csv("../data/df_ratings_raw.csv", index_col = 0)

We will require a number of pre-processing tools that we import into our python scripts. The purpose of these tools is to further convert the dataset into a machine-readable format, by splitting our data into a training, validation and test set, as well as utilizing Natural Language Processing to convert our text into ML-compatible Word Vectors.

The tools we use are:
* Preprocessing tools from `scikit-learn`, which is a commercially-used open-source Python-based machine-learning library (https://scikit-learn.org/stable/), and
* Preprocessing tools from `spaCy`, which is an "Industrial-Strength" open-source Python-based Natural Language Processing library (https://spacy.io).

In [None]:
#sklearn modelling tools
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.model_selection import cross_val_score

#for multilabel classification and oversampling
from skmultilearn.problem_transform import BinaryRelevance
from sklearn.multiclass import OneVsRestClassifier
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import label_binarize

#for dimensionality reduction (TruncatedSVD is a PCA-like method that works on sparse tf-idf matrices)
from sklearn.decomposition import TruncatedSVD

#for word vectorization
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer, TfidfVectorizer

#our dummy baseline
from sklearn.dummy import DummyClassifier

#advanced supervised classification models
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn.svm import SVC
from xgboost import XGBClassifier

from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import GaussianNB

#classification metrics imports 
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_curve, auc, confusion_matrix
from numpy import interp  # scipy.interp is deprecated; numpy.interp is the drop-in equivalent
from itertools import cycle

from sklearn.metrics import balanced_accuracy_score
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score, hinge_loss
from sklearn import metrics

In [None]:
#pre-process text for EDA and later modelling too using spacy and string libraries

import re
import string

import spacy
import spacy.cli
from spacy import displacy
from spacy.lang.en import English
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.matcher import Matcher
from spacy.pipeline import EntityRuler
from spacy.tokens import Doc

from collections import Counter

Now that we've imported our tools, we write code using `spaCy` to split the text into individual words (`tokenization`), the first step towards converting it into the numerical features our models need.

We also convert all of the words in our dataset into their lemmatized form (base word-format, e.g. the base format of "disappointed" is "disappoint"), and remove punctuation, stop-words and numbers.

Again, all of these steps are necessary to allow our models to "read" the language and predict what risk scoring should be assigned to any given news article. 

Here is the pre-processing code:

In [None]:
#stopwords removal

# Create our list of punctuation marks
punctuations = string.punctuation

# Load the small English pipeline (tokenizer, tagger, lemmatizer, NER) and spaCy's stop-word list
nlp = spacy.load('en_core_web_sm')
stop_words = STOP_WORDS

# Creating our tokenizer function
def spacy_tokenizer(sentence):
    
    # Replace every non-letter character (digits, punctuation) with a space
    cleaned = re.sub("[^a-zA-Z]", " ", sentence)
    
    # Run the loaded pipeline to get tokens with linguistic annotations.
    # A blank English() pipeline has no lemmatizer in spaCy v3, so the loaded model is used.
    doc = nlp(cleaned)

    # Lemmatize and lowercase each token ("-PRON-" is the spaCy v2 placeholder lemma for pronouns)
    tokens = [ word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in doc ]

    # Remove stop words, punctuation and empty strings
    tokens = [ word for word in tokens if word and word not in stop_words and word not in punctuations ]
    
    # Return the cleaned tokens joined back into a single string
    return " ".join(tokens)

In [None]:
df["bag_cleaned_text"]=df["combined"].apply(spacy_tokenizer)

In [None]:
df.head(7)

Our algorithm also maps each risk category to a numerical score using the dictionary below. scikit-learn's multiclass classifiers can use these values as targets, and they will later feed the risk-weighted scoring in our report-writing algorithm, which is explained later:

In [None]:
df["risk_rating_numerical"] = df["risk_rating"].copy()
print(df["risk_rating_numerical"])

In [None]:
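#numerical risk rating score impact
#the result must be assigned back, otherwise the column keeps its text labels
df["risk_rating_numerical"] = df["risk_rating_numerical"].map({'neutral':0, 
                           'stable':1, 
                           'risky':-1})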

In the final part of our pre-processing, we transform our text into individual word (unigram) and two-word phrase (bigram) tokens, each weighted by a tf-idf score: tf (term frequency) counts how often a term appears in a document, and idf (inverse document frequency) down-weights terms that appear in many documents, so that terms distinctive to a document score highest.
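To make the weighting concrete, here is a minimal plain-Python sketch of tf-idf over unigrams and bigrams. It follows the smoothed formula that `TfidfVectorizer` uses by default, `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, but it does not replicate the `min_df` and `max_df` filtering. The three toy documents and the helper names `ngrams` and `tfidf` are made up for illustration.

```python
import math
from collections import Counter

def ngrams(tokens, n_min=1, n_max=2):
    """Yield all n-grams of length n_min..n_max as space-joined strings."""
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    counts = [Counter(ngrams(doc.split())) for doc in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for c in counts for term in c)
    weights = []
    for c in counts:
        raw = {t: tf * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
               for t, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        weights.append({t: w / norm for t, w in raw.items()})
    return weights

# Toy, already-cleaned documents (illustrative only)
docs = ["rice export ban", "rice harvest grow", "wheat export grow"]
for term, weight in sorted(tfidf(docs)[0].items(), key=lambda kv: -kv[1]):
    print(f"{term:12s} {weight:.3f}")
```

In the first document, "ban" appears in only one document and therefore receives a higher weight than "rice", which appears in two.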

In [None]:
#Creating the features (tf-idf weights) for the processed text

texts = df['bag_cleaned_text'].astype('str')

tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 2), 
                                   min_df = 2, 
                                   max_df = .95)

X = tfidf_vectorizer.fit_transform(texts) #features
y = df['risk_rating_numerical'].values #target

print (X.shape)
print(y.shape)

## 3B. Modelling: Multiclass Text Classification using `scikit-learn`
Here, we're going to train our data using the following `scikit-learn` machine learning models: 
* Random Forest, 
* XGBoost, 
* SVC, and 
* K-Nearest Neighbours.

The data is meant to be split into train, validation and test sets. In this POC we only score our models on the validation set to gauge how well each one generalizes, and we compare them against sklearn's own `DummyClassifier` as our baseline model.
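A three-way split that preserves the class proportions can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the notebook's method: the function name, the 15% validation and test fractions, and the balanced toy labels are all hypothetical.

```python
import random
from collections import defaultdict

def stratified_three_way_split(labels, val_frac=0.15, test_frac=0.15, seed=3):
    """Return (train_idx, val_idx, test_idx), splitting each class proportionally."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * val_frac)
        n_test = round(len(indices) * test_frac)
        val.extend(indices[:n_val])
        test.extend(indices[n_val:n_val + n_test])
        train.extend(indices[n_val + n_test:])
    return sorted(train), sorted(val), sorted(test)

# Toy labels (illustrative only): 20 articles per class
labels = ["neutral"] * 20 + ["stable"] * 20 + ["risky"] * 20
train_idx, val_idx, test_idx = stratified_three_way_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))
```

With `scikit-learn`, the same effect is usually achieved by calling `train_test_split` twice with `stratify` set, first to hold out the test set and then to carve the validation set out of the remainder.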

The purpose of all this is to evaluate whether these conventional `scikit-learn` machine learning models predict the right risk label for any given piece of news better than the baseline. We also compare this against an NLP-focused Deep Neural Network known as `BERT`, which is Google's own open-source Deep Learning model: (https://en.wikipedia.org/wiki/BERT_(language_model))

**Once the first phase of our training data has been compiled and labelled, our risk decisioning engine will constantly be evaluated for performance. We will, at all times, be asking ourselves - are the models we are using good at predicting risk correctly?**

In [None]:
#General model evaluation using default parameters on Validation Set

#Creating a dict of the models
model_dict = {'Random Forest': RandomForestClassifier(random_state=3),
              'XGBoost': XGBClassifier(random_state=3),
              'SVC':SVC(),
              'K Nearest Neighbor': KNeighborsClassifier(),
              'Dummy Baseline': DummyClassifier()} 


#Train test split with stratified sampling for evaluation
X_train, X_val, y_train, y_val = train_test_split(X, 
                                                    y, 
                                                    test_size = .3, 
                                                    shuffle = True, 
                                                    stratify = y, 
                                                    random_state = 3)


print("Multiclass Risk Classification - Validation Scores")

#Function to get the scores for each model in a dataframe
def model_score_df(model_dict):   
    
    model_name, ac_score_list, p_score_list, r_score_list, f1_score_list = [], [], [], [], []
    
    for k,v in model_dict.items():   
        model_name.append(k)
        v.fit(X_train, y_train)
        y_pred = v.predict(X_val)
        ac_score_list.append(accuracy_score(y_val, y_pred))
        p_score_list.append(precision_score(y_val, y_pred, average='micro'))
        r_score_list.append(recall_score(y_val, y_pred, average='micro'))
        f1_score_list.append(f1_score(y_val, y_pred, average='micro'))

    #Build the comparison table once, after every model has been scored
    model_comparison_df = pd.DataFrame({'model_name': model_name,
                                        'accuracy_score': ac_score_list,
                                        'precision_score': p_score_list,
                                        'recall_score': r_score_list,
                                        'f1_score': f1_score_list})
    return model_comparison_df.sort_values(by='f1_score', ascending=False)

model_score_df(model_dict)
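One point worth keeping in mind when reading the table above: for single-label multiclass problems, micro-averaged precision, recall and F1 are all equal to accuracy, so the four columns carry the same number. Macro-averaging, which scores each class separately and then takes the mean, is more sensitive to poor performance on the rarer `risky` and `stable` classes. A minimal plain-Python sketch of the difference, using made-up labels and hypothetical helper names:

```python
def f1_per_class(y_true, y_pred):
    """Return {label: F1}, computed one class at a time (one-vs-rest)."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)

def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical imbalanced validation set: mostly neutral (0), one stable (1), one risky (-1)
y_true = [0] * 8 + [1, -1]
y_pred = [0] * 10  # a baseline that always predicts the majority class

print("accuracy:", accuracy(y_true, y_pred))
print("macro F1:", round(macro_f1(y_true, y_pred), 3))
```

Here the always-neutral baseline scores 0.8 on accuracy but only about 0.296 on macro F1, which is why passing `average='macro'` to the `scikit-learn` metrics may give a fairer comparison on imbalanced labels.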


## 3C. BERT Modelling
We will also utilize a Deep Learning Neural Network known as `BERT`, mentioned earlier.

Kaushal Trivedi defines `BERT` as a "multilingual transformer based model that has achieved state-of-the-art results on various NLP tasks. BERT is a bidirectional model that is based on the transformer architecture, it replaces the sequential nature of RNN (LSTM & GRU) with a much faster Attention-based approach. The model is also pre-trained on two unsupervised tasks, masked language modeling and next sentence prediction. This allows us to use a pre-trained BERT model by fine-tuning the same on downstream specific tasks such as sentiment classification, intent detection, question-answering and more." (Source: https://medium.com/huggingface/multi-label-text-classification-using-bert-the-mighty-transformer-69714fa3fb3d)

For our purposes, it is another model on top of `scikit-learn`'s ML models that can help us predict the correct risk label for a given set of news. The code below allows us to use a wrapper called `ktrain` that trains a BERT model on our news risk classifier:

In [None]:
#Please install ktrain on Google Colab: `pip install ktrain`

#BERT ktrain wrapper imports
import ktrain
from ktrain import text

In [None]:
df.head()

In [None]:
X = df["combined"]
y = df["risk_rating"]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

In [None]:
#I have to change our train test split objects into a list of strings.
X_train = X_train.values.tolist()
X_test = X_test.values.tolist()

y_train = y_train.values.tolist()
y_test = y_test.values.tolist()

print("classes to predict")
print(y.value_counts())

In [None]:
#recalling our Y label dictionary
encoding = {'neutral':0, 
            'stable':1, 
            'risky':-1}

# Integer values for each class
y_train = [encoding[x] for x in y_train]
y_test = [encoding[x] for x in y_test]

In [None]:
#we now run BERT-specific preprocessing:
class_names = ['neutral','stable','risky']

In [None]:
(x_train,  y_train), (x_test, y_test), preproc = text.texts_from_array(x_train=X_train, y_train=y_train,
                                                                       x_test=X_test, y_test=y_test,
                                                                       class_names=class_names,
                                                                       preprocess_mode='bert',
                                                                       maxlen=512, 
                                                                       max_features=100000)

In [None]:
model = text.text_classifier('bert', train_data=(x_train, y_train), preproc=preproc)

In [None]:
learner = ktrain.get_learner(model, train_data=(x_train, y_train), 
                             val_data=(x_test, y_test),
                             batch_size=12)

In [None]:
learner.fit_onecycle(2e-5, 10)

In [None]:
#print out a performance report
learner.validate(val_data=(x_test, y_test), class_names=class_names)

Remember that neither `BERT` nor the `scikit-learn` models in this POC are trained on a dataset large enough to optimize the usual ML classification metrics (`accuracy`, `precision`, `recall`, `f1 score`). The production model will still involve a team of media analysts supervising and correcting the initial risk labelling so as to "teach" our AI models what they need to predict. The programming and logic shown in this POC are intended as a showcase of our capability.

### What the output of our Classification/Labelling Models looks like

The modelling phase explained earlier will **generate a .csv file** that will look like the DataFrame below. Since we are still developing the models, we have created a preview of what 2 AI-classified/labelled news articles would look like:

In [None]:
df_report = pd.read_csv("../data/df_report_demo.csv", index_col=0)

In [None]:
df_report

We see that the model will predict the `risk rating (both verbally and numerically)`, the `main_entities` discussed, the `commodity` at risk, and whether or not the risk concerned is on the supply-side or demand side (`is_supply_shock`).

This data, as well as the risk-scoring ramifications based on this, will then be procedurally generated by our text report generator.

## 4. Report Generating Script

Once we have our risk-labelling AI model, we can then create a separate algorithm that generates a risk report on any given news article based on the risk label predicted using our chosen AI model.

The report will contain the following key components:
* NLP Multi-label: What is the `commodity` in question?
* NLP Multi-label: Which `country` is mainly in question? What other `entities` are talked about here?
* NLP Binary-class: Is the shock `demand-based` or `supply-based`?
* NLP Multi-label: Based on the risk label, what is the `new risk score` of the country or countries in question for that commodity? Do we apply `+1`, `0` or `-1`?
    * Update all countries' risk ratings accordingly
* If the risk score of the country in question is lowered, can we recommend other countries with a better risk score for supply of this commodity?

* What does the `AI-summarized text of the entire article` look like?

The report can be generated as a table with an accompanying visualization that uses spaCy's `Named Entity Recognition`, and our production model will be able to generate an individual report for each news article that relates to food supply.
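The recommendation step in the list above can be sketched as a simple filter: drop the affected country, keep suppliers at or above a minimum score, and rank the rest. The function name, the country names and the scores below are hypothetical, and the minimum of `2` is only a demo threshold.

```python
def alternative_suppliers(scores, affected_country, minimum_score=2):
    """Return (country, score) pairs for other suppliers meeting the minimum, best first."""
    candidates = {country: score for country, score in scores.items()
                  if country != affected_country and score >= minimum_score}
    return sorted(candidates.items(), key=lambda item: item[1], reverse=True)

# Hypothetical scores for illustration only
demo_scores = {"country_a": 8.0, "country_b": 1.5, "country_c": 3.2, "country_d": 5.1}
print(alternative_suppliers(demo_scores, affected_country="country_a"))
```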

### Risk Matrices

For the report generator to readjust risk-weighted scores from food suppliers, we will require not only the model-generated .csv file, but we'll also have to create professionally curated `risk matrices` of commodity suppliers worldwide. 

Similar to credit scoring, this will be based on proprietary risk-weighting methods pioneered by FSX.ai's supply chain and risk experts. A score is assigned to individual country suppliers, with the idea being that the higher the number, the more secure that supplier's capacity to supply a particular commodity is, and vice-versa.

For example, because India is now the largest supplier of rice worldwide (~32.5% of worldwide supply), its supply risk score is higher than that of, say, Spain, which only accounts for 0.9% of worldwide rice supply.

It also means that negative news (a risk rating of `-1`) has a larger impact on India's score than on Spain's. In this demo, the `risk penalty multiplier` is the current risk score divided by 10, so India's multiplier is `3.25` and Spain's is `0.09`. Negative news about India's rice supply capacity therefore lowers its score by `3.25`, whereas the same news about Spain lowers its score by only `0.09`.

For the sake of our demo, we have employed this simplistic calculation. We also assigned risk scores and penalties arbitrarily for ease of demonstration. Again, our production model will incorporate the fully articulated risk-scoring calculations based on curated weightages derived from our supply chain and risk experts.
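The arithmetic described above can be checked with a few lines of plain Python. This is a sketch of the demo calculation only: the multiplier is the prior score divided by 10, and the new score is the prior score plus the rating (`-1`, `0` or `+1`) times that multiplier. Treating positive news symmetrically is an assumption here, and `updated_score` is a hypothetical helper name.

```python
def updated_score(prior_score, rating):
    """Apply the demo rule: new score = prior + rating * (prior / 10)."""
    penalty_multiplier = prior_score / 10
    return prior_score + rating * penalty_multiplier

# Export shares used as demo risk scores for India and Spain
for country, prior in {"india": 32.5, "spain": 0.9}.items():
    print(country, prior, "->", round(updated_score(prior, -1), 2))
```

Under this rule, negative news moves India from 32.5 to 29.25 and Spain from 0.9 to 0.81.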

Accordingly, we create here a `sample risk score matrix of rice suppliers` worldwide for demo purposes:

In [None]:
#example matrix based on http://www.worldstopexports.com/rice-exports-country/
supplier_data = [['india', 32.5], ['thailand', 19.2], ['united_states', 8.6], ['vietnam', 6.6], 
                 ['pakistan', 5.6], ['china', 4.8], ['italy', 2.9], 
                 ['myanmar_burma', 2.6], ['cambodia', 2], ['uruguay', 1.7], 
                 ['brazil', 1.7], ['netherlands', 1.7], ['belgium', 1.2],
                 ['paraguay', 1], ['spain', 0.9]]
        
df_rice_supplier_matrix = pd.DataFrame(supplier_data, columns = ["supplier_country", "current_risk_score"])

In [None]:
#applying demo risk penalty multiplier as a new column in the matrix dataframe
df_rice_supplier_matrix['risk_penalty_multiplier'] = df_rice_supplier_matrix['current_risk_score'].map(lambda x: x / 10)

In [None]:
df_rice_supplier_matrix

In [None]:
other_countries = [state for state in df_rice_supplier_matrix["supplier_country"] if state!="thailand"]

In [None]:
df_rice_supplier_matrix[df_rice_supplier_matrix["supplier_country"] != "thailand"]

In [None]:
df_rice_supplier_matrix[(df_rice_supplier_matrix["supplier_country"] != "thailand")
                        & (df_rice_supplier_matrix["current_risk_score"] >= 2.0)]

In [None]:
#backing up this matrix
df_rice_supplier_matrix.to_csv("../data/df_supplier_matrix.csv")

In [None]:
state_prior_score = df_rice_supplier_matrix["current_risk_score"][df_rice_supplier_matrix["supplier_country"] == "thailand"]

In [None]:
state_prior_score.iloc[0]

### Prototype Code for How The Report Generator Works ###

We now write a function that transforms this data into a text report with entity visualization using `spaCy's NER`:

In [None]:
row_0 = df_report.loc[0]

In [None]:
row_0["combined"]

In [None]:
type(row_0["date"])

In [None]:
def report_writer(row):

    #assign values into callable instances
    date = row["date"]
    title = row["title"]
    url = row["url"]
    risk_rating = row["risk_rating"]
    risk_rating_numerical = row["risk_rating_numerical"]
    commodity = row["commodity"]
    main_entities = row["main_entities"]
    supply_shock = row["is_supply_shock"]
    minimum_score = 2
    
    nlp = spacy.load('en_core_web_sm')
    current_text = row['combined']
    doc = nlp(current_text)
    
    print (f"***************************START OF REPORT********************************")
    print ("\n")

    print (f"This is an automatically generated report for:")
    print (f"\n'{title}', \ndated {date}, \nscraped from the url \n{url}.")
    print ("\n")
    print (f"Our FSX.ai Risk Model has identified that the article concerns risk towards the supply of {commodity.title()} by {main_entities.title()}.")
    print("\n")
    
    #get previous score of this country's rice supply risk
    state_prior_score = df_rice_supplier_matrix["current_risk_score"][df_rice_supplier_matrix["supplier_country"] == main_entities].iloc[0]    
    
    print (f"{main_entities.title()}'s prior risk score for {commodity.title()} supply is {state_prior_score}")
    
    #get this country's risk penalty multiplier
    state_penalty = df_rice_supplier_matrix["risk_penalty_multiplier"][df_rice_supplier_matrix["supplier_country"] == main_entities].iloc[0]    

    print (f"{main_entities.title()}'s risk penalty multiplier is {state_penalty}")
    
    #a negative rating (-1) must lower the score, so the weighted rating is added
    new_score = state_prior_score + (risk_rating_numerical*state_penalty)
    
    print ("\n")
    print (f"{main_entities.title()}'s new risk score for {commodity.title()} supply is {new_score}")
        
    print ("\n")

    print (f"Currently, FSX.ai recommends choosing suppliers with a minimum risk score of {minimum_score}.")
    minimum_score_suppliers = df_rice_supplier_matrix[(df_rice_supplier_matrix["supplier_country"] != main_entities)
                                                      & (df_rice_supplier_matrix["current_risk_score"] >= minimum_score)]
    
    print (f"Accordingly, FSX.ai recommends reviewing the increase of imports from other suppliers of {commodity.title()}:")
    print ("\n")
    print (minimum_score_suppliers)
    print ("\n")


    
    print (f"FSX.ai's spaCy NER has also highlighted the following related entities in this article that should be reviewed \nin relation to this news article:")
    print ("\n")
    #with jupyter=True, displacy renders inline and returns None, so it is not wrapped in print
    displacy.render(doc, style='ent', jupyter=True)

    print (f"***************************END OF REPORT********************************")



In [None]:
# # return df_rice_supplier_matrix
#     print (f"Accordingly, FSX.ai recommends reviewing the increase of imports from other suppliers of {commodity.title()}:")
#     print ("\n")
#     print (df_rice_supplier_matrix[df_rice_supplier_matrix["supplier_country"] != main_entities])

In [None]:
report_writer(df_report.loc[0])

### Prototype Article Summarizer 
Finally, the last component of our report addresses this query:
* NLG Text Summarization: What is the `AI-summarized` version of the entire article?

We have prototype code that summarizes entire news articles using `huggingface's T5 Transformer`: https://huggingface.co/transformers/model_doc/t5.html

Let's first use a new example of an entire news article taken from `Channel News Asia`: https://www.channelnewsasia.com/news/asia/thailand-wetland-conservation-boon-reuang-special-economic-zone-12886250.

In [None]:
text = """
DUTTA NAGAR, India: Until late March, Ashish Kumar was helping to make plastic boxes for 
Ferrero Rocher praline chocolates and the plastic spoons tucked inside Kinder Joy eggs to scoop out the 
milky sweet cream inside.

With a diploma in plastic mould technology, the 20-year-old had a foot on his chosen career ladder. His younger brother Aditya chose law, but Ashish had his sights set on plastic.

"I want to start a business of my own," he said, explaining how he wants to recycle plastic to make day-to-day products at his own factory.

India's coronavirus lockdown has thrown those plans into disarray. Educated but unemployed, Ashish Kumar is one of countless people across the globe whose social progress has been halted by the coronavirus that has infected more than 2 million people in India alone, and thrown the economy into reverse. With it, the aspirations of millions are fading.

For years, people in rural India have been gaining prosperity and moving into what economists call a burgeoning middle class of consumers – those who earn more than US$10 a day, by some definitions. This group has been a keystone of plans for economic development in the world's second most populous country.

In the COVID-19 pandemic, India's economy is forecast to shrink by 4.5 per cent this year, according to the International Monetary Fund. At least 400 million Indian workers are at risk of falling deeper into poverty, according to the International Labour Organization (ILO).

Kumar is one of around 131,000 people whom local officials estimate returned from working around India to Gonda, the district in the northern state of Uttar Pradesh that he left last June.

Nationwide, about 10 million people made long, hard journeys back to rural villages they had left. Some have gone back to the cities, but many of those who had been sending back funds are still stuck in the countryside.

Working in a factory in Baramati in the western state of Maharashtra, Kumar was earning 13,000 rupees (US$173) every month, more than twice his father's pay from a job in a grain market near Kumar's home village in Uttar Pradesh, a sprawling agrarian state. Of that, the young man was sending home around 9,000 rupees every month, much of which was helping to fund his younger brother's studies.

No longer. Once a provider for his family, now he has become a financial burden.

Kumar whiles away his time back home in the village of Dutta Nagar, bantering with friends in the muddy courtyard – they jokingly call it their "office" – outside the ramshackle primary school where he studied. In Uttar Pradesh, around 60 million of the state's population of more than 200 million lives in poverty, according to the World Bank.

He said he has applied for several jobs at plastic factories in western Gujarat state and other parts of northern India but has not found work.

"No matter what," he said, sitting near his parent's single-storey home, surrounded by jade green paddy fields. "I need a job."

PLASTIC FOR PRALINES

As a schoolboy, Kumar was obsessed with plastics.

A chance conversation with a cousin who had studied plastic engineering got him hooked, Kumar said, and he started researching. In Dutta Nagar, where there were no Internet connections, that often meant asking one of a handful of locals with a smartphone to Google the opportunities.

Kumar's ambitions were a world removed from his father Ashok's early years. The 47-year-old, who assists with weighing and pricing grain harvests, remembers when the family had neither enough food, nor proper clothes.

A slight man with a weather-beaten face, he never finished high school.

"I thought that the children shouldn't fall into our rut. They should be pushed ahead," he said.

Kumar, who says he has never tasted a Ferrero Rocher praline, finished his diploma in Gujarat last June, and took the train to start work as a technician at an Italian-owned factory 1,500km away from home.

The factory that employed him is run by Dream Plast India, a subsidiary of Gruppo Sunino, an Italian plastics maker with 10 plants around the world.

"The factory was first class," Kumar said. His contract included a monthly contribution from the company into a retirement fund and a bonus. Workers were served one meal every day, the supervisors were friendly, and the salary came on time, he said.

Six days a week, his work typically involved overseeing two machines and a couple of contract workers. At the end of the day, he would relax with a game of badminton or watch wrestling on YouTube.

His income over the past year helped his parents build a proper four-roomed brick home, after decades of living in a tumble-down mud hut where the roof let in heavy monsoon rains. It helped pay the fees for his brother to go to law school in Bahraich, an hour-and-a-half's drive away from their home village.

Then COVID-19 struck.

BROKE IN BARAMATI

Kumar first heard of the coronavirus in early March. When India's lockdown forced Dream Plast India to temporarily shut its plant in Baramati on Mar 21, he had enough cash to wait it out in town.

As the pandemic swept through India, a survey of some 5,000 workers in April and May found 66 per cent of participants had lost their jobs, and 77 per cent of households were consuming less food than before.

Prime Minister Narendra Modi's government announced a 20 trillion rupee package promising free rice, wheat and pulses for millions of people and a programme to provide employment in rural areas.

Even for those with work, trade unions and labour experts say conditions are deteriorating, for migrants particularly.  

In May, India's state governments issued health and safety guidelines for factories as they re-opened after lockdown, which included compulsory face masks, thermal screening, social distancing and frequent sanitisation. Union leaders allege many companies did not follow all protocols and cut corners, but they have not identified Kumar's.

Indian states including Uttar Pradesh and Gujarat said in May they were looking to relax workers' rights, including weakening regulations on wages and working hours, to support industry. That proposal drew criticism from workers' unions and the ILO. The amendments have only taken effect in some states.

Kumar's factory, which reopened in early May, did not respond to a question on measures taken there, but Dream Plast India's managing director Nitin Gupta said in an email the "company takes utmost precautions to adhere to the laws at all times". He declined to elaborate further.

Even so, Kumar and another worker Reuters spoke to said they did not feel safe to return.

Ferrero, the Italian confectioner, said it had audited the plant where Kumar worked in March and found no irregularities, but would further review subsequent months.

Reuters was unable to independently determine what safety measures the factory took.

By early June, Kumar's funds had run out. Even buying food became difficult.

His parents grew increasingly worried. "Whatever little money I had here in the bank, I sent some of that so he could eat," said his father, Ashok. "At that time, I was very scared. The biggest challenge was for him to come home."

India's railway network reopened in early May. On Jun 3, Kumar borrowed money to pay for a 48-hour journey home by train, bus and shared taxi. Then he went into a 14-day quarantine.

On Jun 25, Dream Plast India sent him an email, which was seen by Reuters, asking him to report to work within four days or face termination. Instead, he resigned on Jul 20.

His parents are apprehensive about him leaving home again, although they said they realise that without their elder son's earnings, his younger brother will not be able to finish law school.

Kumar is not ready to give up on his plastics factory.

"I will do it," he said. "No matter what it takes, I will fulfil my dream."
"""

In [None]:
# Adapted from https://huggingface.co/transformers/model_doc/t5.html
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

device = torch.device('cpu')
model = T5ForConditionalGeneration.from_pretrained('t5-small').to(device)
tokenizer = T5Tokenizer.from_pretrained('t5-small')

# Collapse line breaks and repeated whitespace into single spaces,
# so that words on adjacent lines do not run together
preprocess_text = " ".join(text.split())
t5_prepared_Text = "summarize: " + preprocess_text
print ("original text preprocessed: \n", preprocess_text)

# Encode the prompt as a PyTorch tensor of token ids
tokenized_text = tokenizer.encode(t5_prepared_Text, return_tensors="pt").to(device)


# Summarize with beam search
summary_ids = model.generate(tokenized_text,
                             num_beams=4,
                             no_repeat_ngram_size=2,
                             min_length=30,
                             max_length=500,
                             early_stopping=True)

output = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

print ("\n\nSummarized text: \n", output)
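
The `t5-small` tokenizer reports a default maximum input length of 512 tokens, and a full news article such as the one above is likely to exceed it. One possible way to handle long articles is to split the text into chunks, summarize each chunk separately, and join the partial summaries. The sketch below illustrates only that idea and is not part of the Platform's code. `chunk_text`, `summarize_long` and `first_words` are hypothetical helpers, and the 300-word chunk size is an assumed rough proxy for the token limit.

```python
from typing import Callable, List


def chunk_text(text: str, max_words: int = 300) -> List[str]:
    """Split text into consecutive chunks of at most max_words words."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]


def summarize_long(text: str,
                   summarize_fn: Callable[[str], str],
                   max_words: int = 300) -> str:
    """Summarize each chunk with summarize_fn and join the partial summaries."""
    partial = [summarize_fn(chunk) for chunk in chunk_text(text, max_words)]
    return " ".join(s.strip() for s in partial if s.strip())


# Stand-in summarizer for illustration only: keeps the first 5 words of a chunk.
# In the notebook, this would instead wrap the tokenizer and model.generate calls.
def first_words(chunk: str) -> str:
    return " ".join(chunk.split()[:5])


demo = " ".join(f"w{i}" for i in range(12))
print(chunk_text(demo, max_words=5))
print(summarize_long(demo, first_words, max_words=5))
```

Passing the summarizer in as a function keeps the chunking logic independent of any particular model, so the same helper could be reused if the T5 model is later swapped for another summarizer.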

Other Features in Development: Macro events