<a href="https://colab.research.google.com/github/AmirMoghadamFalahi/sample_task/blob/main/model/information_extraction_v0_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

In this notebook, I implement a first idea for extracting information about different companies from autonews.com articles. Examples of such information are partnerships, customers, etc.

For this, we will do the following steps:

1.   Installing and importing required packages and libraries
2.   Making an ElasticSearch document store to store our data in the next steps
3.   Reading the dataset from csv files
4.   Obtaining all company names from our dataset using the SpaCy library and storing them
5.   Converting the data to a format that fits Haystack's requirements
6.   Indexing the data into our ElasticSearch
7.   Using pretrained Hugging Face models to make Haystack's retriever, reader and finder.
8.   Using the company names from step 4 to make some questions about the information we want to extract, and letting Haystack search for the answers.
9.   Checking whether the obtained answers are compatible with our previous knowledge from step 4. For example, if the model predicts some company names that are in partnership with company "X", then these companies should be on our company list from step 4.
10.  Cross-validating the answers against each other and testing the model results.

# Step 1. Installing and importing required packages and libraries

In [1]:
# something seems to be wrong with urllib3, which is one of
# Haystack's package requirements, so I pin it to another version

! pip install urllib3==1.25.10

Collecting urllib3==1.25.10
  Downloading https://files.pythonhosted.org/packages/9f/f0/a391d1463ebb1b233795cabfc0ef38d3db4442339de68f847026199e69d7/urllib3-1.25.10-py2.py3-none-any.whl (127kB)

In [2]:
# Install the latest release of Haystack in your own environment 
!pip install git+https://github.com/deepset-ai/haystack.git

Collecting git+https://github.com/deepset-ai/haystack.git
  Cloning https://github.com/deepset-ai/haystack.git to /tmp/pip-req-build-_d3b5l26
  Running command git clone -q https://github.com/deepset-ai/haystack.git /tmp/pip-req-build-_d3b5l26
Collecting farm==0.4.9
  Downloading https://files.pythonhosted.org/packages/7b/6a/d30bc97eaca322d35979f7a9f8fd8102e53833d3eb5b3bd02add1196ac94/farm-0.4.9-py3-none-any.whl (190kB)
Collecting fastapi
  Downloading https://files.pythonhosted.org/packages/48/65/454fb440d48098845875b5ba8599efafee1efabb97720a584c78674e6d26/fastapi-0.61.1-py3-none-any.whl (48kB)
Collecting uvicorn
  Downloading https://files.pythonhosted.org/packages/30/cc/01cc4cb980dfcf04eb283b6497c7f280928a0b02c68c0f85b6901e7716ae/uvicorn-0.12.2-py3-none-any.whl (45kB)
Collectin

In [3]:
# downloading ElasticSearch and starting it as a background process
! wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.6.2-linux-x86_64.tar.gz -q
! tar -xzf elasticsearch-7.6.2-linux-x86_64.tar.gz
! chown -R daemon:daemon elasticsearch-7.6.2

import os
from subprocess import Popen, PIPE, STDOUT
es_server = Popen(['elasticsearch-7.6.2/bin/elasticsearch'],
                   stdout=PIPE, stderr=STDOUT,
                   preexec_fn=lambda: os.setuid(1)  # as daemon
                  )
# wait until ES has started
! sleep 30
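
A fixed `sleep 30` can be too short on a slow machine and wasteful on a fast one. As a minimal sketch (not part of the original notebook), the wait could instead poll the server until it answers. The helper name `wait_for_http` and its parameters are hypothetical; the address `http://localhost:9200` is the one that appears in the ElasticSearch logs further down.

```python
import time
from urllib.error import URLError
from urllib.request import urlopen


def wait_for_http(url, timeout=60.0, interval=1.0, opener=urlopen):
    """Poll `url` until it answers or `timeout` seconds pass.

    Returns True as soon as a request succeeds, False if the deadline passes.
    `opener` is injectable so the function can be tried without a real server.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Any successful response means the server accepts connections.
            with opener(url, timeout=5):
                return True
        except (URLError, OSError):
            # Connection refused or HTTP error: the server is not ready yet.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

In the notebook this would be called as `wait_for_http("http://localhost:9200")` right after starting the `Popen` process, assuming ElasticSearch listens on its default port.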

In [4]:
# importing required libraries
import numpy as np
import pandas as pd
import time
import json
import zipfile
import itertools
import os
import sys

# if importing the urllib3 package raises an error, comment out line 30 of
# /usr/local/lib/python3.6/dist-packages/botocore/utils.py 
# something seems to be wrong with this package
# you can also track this issue here:
# https://github.com/boto/botocore/issues/2187

from haystack.document_store.elasticsearch import ElasticsearchDocumentStore
from haystack.retriever.sparse import ElasticsearchRetriever
from haystack.retriever.dense import EmbeddingRetriever, DensePassageRetriever
from haystack.reader.farm import FARMReader
from haystack import Finder

import spacy


In [5]:
! python -m spacy download en_core_web_md
import en_core_web_md

Collecting en_core_web_md==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-2.2.5/en_core_web_md-2.2.5.tar.gz (96.4MB)
Building wheels for collected packages: en-core-web-md
  Building wheel for en-core-web-md (setup.py) ... done
  Created wheel for en-core-web-md: filename=en_core_web_md-2.2.5-cp36-none-any.whl size=98051305 sha256=60a1b5843954cea11801254351bb9f4a4a67b6d8b814cecc5a62fb6296d0a4bf
  Stored in directory: /tmp/pip-ephem-wheel-cache-2s2wdu2i/wheels/df/94/ad/f5cf59224cea6b5686ac4fd1ad19c8a07bc026e13c36502d81
Successfully built en-core-web-md
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-2.2.5
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_md')


In [6]:
# cloning the GitHub repo where I stored the dataset

!git clone https://github.com/AmirMoghadamFalahi/sample_task

Cloning into 'sample_task'...
remote: Enumerating objects: 77, done.
remote: Counting objects: 100% (77/77), done.
remote: Compressing objects: 100% (64/64), done.
remote: Total 77 (delta 30), reused 24 (delta 4), pack-reused 0
Unpacking objects: 100% (77/77), done.


In [7]:
# making a class to turn off unnecessary prints
class HiddenPrints:
    def __enter__(self):
        self._original_stdout = sys.stdout
        sys.stdout = open(os.devnull, 'w')

    def __exit__(self, exc_type, exc_val, exc_tb):
        sys.stdout.close()
        sys.stdout = self._original_stdout
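
The `HiddenPrints` class is defined here but not used in the cells shown below. As a sketch of how output can be silenced around a single call, the standard library's `contextlib.redirect_stdout` does the same job; the helper name `run_quietly` is hypothetical. Note that this only affects text written to `sys.stdout`; messages sent through the `logging` module usually go to stderr and would not be hidden.

```python
import contextlib
import io


def run_quietly(func, *args, **kwargs):
    """Call `func` and discard anything it prints to stdout."""
    # redirect_stdout restores the previous stdout even if `func` raises.
    with contextlib.redirect_stdout(io.StringIO()):
        return func(*args, **kwargs)
```

The return value of the wrapped call is passed through unchanged, so a noisy function can be wrapped without changing the surrounding code.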

# Step 2. Making an ElasticSearch document store to store our data in the next steps

In [8]:
document_store = ElasticsearchDocumentStore(host="localhost", username="", 
                                            password="", index="document")

10/21/2020 22:11:37 - INFO - elasticsearch -   PUT http://localhost:9200/document [status:200 request:0.725s]
10/21/2020 22:11:37 - INFO - elasticsearch -   PUT http://localhost:9200/label [status:200 request:0.202s]


# Step 3. Reading the dataset from csv files

In [9]:
dataset_path = 'sample_task/dataset/'
csv_path = dataset_path + 'csv_data/'
dataset_zip_list = os.listdir(dataset_path)

for file_name in dataset_zip_list:
  with zipfile.ZipFile(dataset_path + file_name, 'r') as zip_ref:
      zip_ref.extractall(csv_path)

In [10]:
csv_list = sorted(os.listdir(csv_path))

for i, file_name in enumerate(csv_list):
  # print(i, file_name)

  if i == 0:
    autonews_df = pd.read_csv(csv_path + file_name)
    columns = autonews_df.columns
  else:
    df = pd.read_csv(csv_path + file_name, names=columns)
    autonews_df = pd.concat([autonews_df, df])
    # print(autonews_df.head())
  
autonews_df = autonews_df.reset_index(drop=True)
autonews_df = autonews_df.replace({np.nan: None})

In [11]:
autonews_df.head()

Unnamed: 0,id,article_datetime,article_timestamp,category,access_control,headline,link,got_single,paragraph_dic
0,1,2020-10-10,1602330000.0,Marketing,Subscription Required,Carmaker Honda tilts scale to trucks,/marketing/carmaker-honda-tilts-scale-trucks,True,"{""1"": ""LOS ANGELES — A year ago, the mantra being pushed by Honda executives..."
1,4,2020-10-09,1602240000.0,Suppliers,Subscription Required,Yanfeng system attacks COVID-19 inside cars with UV light,/suppliers/yanfeng-system-attacks-covid-19-inside-cars-uv-light,True,"{""1"": ""Interior supplier Yanfeng has revealed its next step in its effort to..."
2,6,2020-10-10,1602350000.0,Dealers,Subscription Required,Phone calls take new priority in pandemic,/dealers/phone-calls-take-new-priority-pandemic,True,"{""1"": ""More customers are buying cars using a computer today, but one of a d..."
3,7,2020-10-10,1602300000.0,Suppliers,Subscription Required,The new kink in automotive hiring: Amazon,/suppliers/new-kink-automotive-hiring-amazon,True,"{""1"": ""As if it hasn't been hard enough recruiting work forces over the past..."
4,8,2020-10-10,1602280000.0,Cars & Concepts,Subscription Required,"Nissan dumps vans in U.S. and Canada, eyes new commercial sales",/cars-concepts/nissan-dumps-vans-us-and-canada-eyes-new-commercial-sales,True,"{""1"": ""Nissan is shifting strategy on commercial vehicle sales now that it i..."


In [12]:
print('shape of dataframe:', autonews_df.shape)
print('dataframe columns:', autonews_df.columns)

shape of dataframe: (50042, 9)
dataframe columns: Index(['id', 'article_datetime', 'article_timestamp', 'category',
       'access_control', 'headline', 'link', 'got_single', 'paragraph_dic'],
      dtype='object')


# Step 4. Obtaining all company names from our dataset using SpaCy library and storing them

First, we build one text string per article, consisting of the headline followed by all of its paragraphs:

In [13]:
news_lst = []

for i in range(autonews_df.shape[0]):
  dic = json.loads(autonews_df.paragraph_dic[i])
  whole_news = str(autonews_df.headline[i]) + '\n' + '\n'.join(list(dic.values()))
  news_lst.append(whole_news)

In [14]:
# printing the first article as a sample:
print(news_lst[0])

Carmaker Honda tilts scale to trucks
LOS ANGELES — A year ago, the mantra being pushed by Honda executives was that "cars matter" — a compelling message for a brand that derived almost half its volume from sedans and hatchbacks at the time.
But as the market continues its relentless shift toward crossovers and pickups, it turns out that trucks matter more — even at Honda.
The brand is now implementing a major change in strategy to emphasize the rugged, off-road capability of its light trucks to pick up more market share.
Honda estimates the overall U.S. auto market this year has shifted to 76 percent light trucks and 24 percent passenger cars. Honda itself has a mix of 56 percent light trucks to 44 percent cars through the third quarter. So for Honda, light trucks clearly are a big opportunity.
"As the market approaches 80 percent trucks, we have to make sure we play in that pond," said Art St. Cyr, vice president of automobile operations at American Honda Motor Co.
Doing that, he said

In [15]:
# initializing an nlp model from SpaCy pretrained models for NER purpose
nlp = en_core_web_md.load()

# separating ORG names from the texts and getting their occurrence counts

company_name_counts = {}

for i, article in enumerate(news_lst):

  if i % 1000 == 0:
    print(i)

  doc = nlp(article)
  ner_array = np.array([(X.text, X.label_) for X in doc.ents])
  try:
    companies, counts = np.unique(ner_array[np.where(ner_array[:, 1] == 'ORG')[0], 0], return_counts=True)
  except Exception as e:
    continue
  # use a separate index `j` so the article counter `i` is not overwritten
  for j in range(len(companies)):
    if companies[j] not in company_name_counts.keys():
      company_name_counts[companies[j]] = counts[j]
    else:
      company_name_counts[companies[j]] += counts[j]

  # just to check first 10000 articles, it could be commented to check the 
  #   whole dataset
  if i >= 10000:
    break


0
1000
2000
3000
4000
5000
6000
7000
8000
9000
10000


In [16]:
len(company_name_counts.items())

19713

We have extracted more than 19K names, which is a lot. Some of them don't seem to belong to companies, such as: 'Honda Performance Development', 'ISeeCars.com', 'NBA Finals', 'the Miami Heat', 'Yanfeng Technology Chief Technology'

I exclude the names that occur 10 times or fewer across the articles, although this might lose some real company names.
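
The counting and thresholding can also be expressed with `collections.Counter`. The following is a minimal sketch, not the code used in this notebook: it assumes each article has already been reduced to a list of `(text, label)` pairs, as produced from `doc.ents` above, and the function names are hypothetical.

```python
from collections import Counter


def count_org_mentions(entities_per_article):
    """Sum ORG mentions over articles.

    Each article is a list of (text, label) pairs taken from the NER output.
    """
    counts = Counter()
    for entities in entities_per_article:
        counts.update(text for text, label in entities if label == "ORG")
    return counts


def frequent_names(counts, min_count=11):
    """Keep names seen at least `min_count` times, most frequent first.

    The default of 11 matches the `> 10` filter used in this notebook.
    """
    return [(name, n) for name, n in counts.most_common() if n >= min_count]
```

Because `Counter.update` handles missing keys, no explicit "is the name already in the dictionary" check is needed, and an article without entities simply contributes nothing.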

#### **CAUTION**

A better way is to use other data sources on the Internet to check which of the names truly belong to a company and which don't.
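
As a rough sketch of what such a check could look like, the candidates can be compared against a reference list after a simple normalization. Everything here is an assumption for illustration: the reference list would have to come from an external source, the suffix list is hypothetical and incomplete, and the function names are made up.

```python
import re

# Hypothetical list of legal suffixes; extend it to match the names in the data.
_SUFFIXES = r"(?:\s+(?:co|corp|inc|ltd|ag|sa|group)\.?)+$"


def normalize_name(name):
    """Lower-case a name and strip trailing suffixes such as 'Corp.' or 'Co.'."""
    cleaned = re.sub(r"\s+", " ", name).strip().lower()
    return re.sub(_SUFFIXES, "", cleaned)


def split_by_reference(candidates, reference_names):
    """Return (confirmed, unconfirmed) candidates by normalized exact match."""
    reference = {normalize_name(n) for n in reference_names}
    confirmed = [c for c in candidates if normalize_name(c) in reference]
    unconfirmed = [c for c in candidates if normalize_name(c) not in reference]
    return confirmed, unconfirmed
```

The same normalization could also help merge variants of one company that NER reports separately, such as 'Mazda Motor Corp.' and 'Mazda Motor', although abbreviations like 'VW' versus 'Volkswagen' would need an explicit alias table.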

In [17]:
sorted_company_name_counts = {k: int(v) for k, v in 
                              sorted(company_name_counts.items(), 
                                     key=lambda item: -item[1])}

In [18]:
company_name_counts_array = np.hstack((np.array(list(sorted_company_name_counts.keys())).reshape(-1, 1), 
                                       np.array(list(sorted_company_name_counts.values())).reshape(-1, 1)))
company_name_counts_array = company_name_counts_array[np.where(company_name_counts_array[:, 1].astype(int) > 10)[0]]

In [19]:
company_names_df = pd.DataFrame(company_name_counts_array, 
                                columns=['company_name', 'occurance_count'])

In [20]:
print(company_names_df.shape)

(1391, 2)


In [21]:
company_names_df.head(20)

Unnamed: 0,company_name,occurance_count
0,Ford,5715
1,Tesla,5367
2,GM,4950
3,Toyota,3940
4,Nissan,3117
5,BMW,2949
6,VW,2307
7,EV,2289
8,Automotive News,2072
9,Trump,2055


--------------------------------------------------------------------------------
A better practice is to store these company names in a database so that they are available at any time. Here we just keep them in memory as a dataframe.
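
A minimal sketch of such storage, using the standard library's `sqlite3` so that no server is required. The table name, column names and function names are assumptions, not something this notebook defines.

```python
import sqlite3
from contextlib import closing


def save_company_counts(db_path, name_counts):
    """Write (name, count) pairs into a `company_names` table.

    Existing rows with the same name are replaced.
    """
    with closing(sqlite3.connect(db_path)) as conn:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS company_names ("
                "company_name TEXT PRIMARY KEY, "
                "occurrence_count INTEGER NOT NULL)"
            )
            conn.executemany(
                "INSERT OR REPLACE INTO company_names VALUES (?, ?)",
                [(name, int(count)) for name, count in name_counts],
            )


def load_company_counts(db_path):
    """Read all rows back, most frequent first."""
    with closing(sqlite3.connect(db_path)) as conn:
        return conn.execute(
            "SELECT company_name, occurrence_count FROM company_names "
            "ORDER BY occurrence_count DESC, company_name"
        ).fetchall()
```

With the dataframe built above, the rows could be passed as `company_names_df.values.tolist()`; the `int(count)` conversion matters because the counts were stored as strings when the NumPy arrays were stacked.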



# Step 5. Converting the data to a format that fits Haystack's requirements

Here I used a function from the previous notebooks. It makes a list of Haystack-ready dictionaries to be inserted into ElasticSearch.

In [22]:
def prepare_data_haystack(dataset):

  final_dicts = []

  for i, dic in enumerate(dataset.paragraph_dic):
    # json.loads no longer accepts an `encoding` argument (removed in Python 3.9)
    dic = json.loads(dic)
    txt = ('\n'.join(dic.values()))
    dic = {'text': txt, 'meta': {'category': dataset.category[i], 'headline': dataset.headline[i], 
                                'datetime': dataset.article_datetime[i], 'id': dataset.id[i]}}
    final_dicts.append(dic)

  return final_dicts

haystack_dics = prepare_data_haystack(dataset=autonews_df)
print(haystack_dics[0])

{'text': 'LOS ANGELES — A year ago, the mantra being pushed by Honda executives was that "cars matter" — a compelling message for a brand that derived almost half its volume from sedans and hatchbacks at the time.\nBut as the market continues its relentless shift toward crossovers and pickups, it turns out that trucks matter more — even at Honda.\nThe brand is now implementing a major change in strategy to emphasize the rugged, off-road capability of its light trucks to pick up more market share.\nHonda estimates the overall U.S. auto market this year has shifted to 76 percent light trucks and 24 percent passenger cars. Honda itself has a mix of 56 percent light trucks to 44 percent cars through the third quarter. So for Honda, light trucks clearly are a big opportunity.\n"As the market approaches 80 percent trucks, we have to make sure we play in that pond," said Art St. Cyr, vice president of automobile operations at American Honda Motor Co.\nDoing that, he said, will mean adjusting 

# Step 6. Indexing the data into our ElasticSearch

In [23]:
document_store.write_documents(haystack_dics)

10/21/2020 22:23:13 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:2.014s]
10/21/2020 22:23:14 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:1.075s]
10/21/2020 22:23:15 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:1.032s]
10/21/2020 22:23:16 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:1.027s]
10/21/2020 22:23:17 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:1.029s]
10/21/2020 22:23:19 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:1.001s]
10/21/2020 22:23:20 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:0.999s]
10/21/2020 22:23:21 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:0.996s]


# Step 7. Using pretrained models of Hugging Face to make Haystack's retriever, reader and finder.

Here I used `ElasticsearchRetriever`, which seems to be good enough and is also fast.

Other alternatives, such as embedding models, could be used here.

The best approach is to try several of the available models and compare their results.


In [24]:
retriever = ElasticsearchRetriever(document_store=document_store)
# retriever = DensePassageRetriever(document_store=document_store,
#                                   query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
#                                   passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
#                                   use_gpu=True,
#                                   embed_title=True,
#                                   max_seq_len=256,
#                                   batch_size=16,
#                                   remove_sep_tok_from_untitled_passages=True)

Likewise, several Hugging Face models could be used for the reader.

In [25]:
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)

10/21/2020 22:30:27 - INFO - farm.utils -   device: cuda n_gpu: 1, distributed training: False, automatic mixed precision training: None
10/21/2020 22:30:28 - INFO - farm.infer -   Could not find `deepset/roberta-base-squad2` locally. Try to download from model hub ...
10/21/2020 22:30:28 - INFO - filelock -   Lock 139937637762664 acquired on /root/.cache/torch/transformers/f7d4b9379a9c487fa03ccf3d8e00058faa9d664cf01fc03409138246f48760da.c6288e0f84ec797ba5c525c923a5bbc479b47c761aded9734a5f6a473b044c8d.lock


10/21/2020 22:30:28 - INFO - filelock -   Lock 139937637762664 released on /root/.cache/torch/transformers/f7d4b9379a9c487fa03ccf3d8e00058faa9d664cf01fc03409138246f48760da.c6288e0f84ec797ba5c525c923a5bbc479b47c761aded9734a5f6a473b044c8d.lock
10/21/2020 22:30:28 - INFO - filelock -   Lock 139937637761712 acquired on /root/.cache/torch/transformers/8c0c8b6371111ac5fbc176aefcf9dbe129db7be654c569b8375dd3712fc4dc67.d045adc91e17ecdf7dc3eeff4c875df94bdf2eb749d72b3ae47ae93f8e85213c.lock





10/21/2020 22:30:35 - INFO - filelock -   Lock 139937637761712 released on /root/.cache/torch/transformers/8c0c8b6371111ac5fbc176aefcf9dbe129db7be654c569b8375dd3712fc4dc67.d045adc91e17ecdf7dc3eeff4c875df94bdf2eb749d72b3ae47ae93f8e85213c.lock





	 We guess it's an *ENGLISH* model ... 
	 If not: Init the language model by supplying the 'language' param.
10/21/2020 22:31:01 - INFO - filelock -   Lock 139937617357232 acquired on /root/.cache/torch/transformers/1e3af82648d7190d959a9d76d727ef629b1ca51b3da6ad04039122453cb56307.6a4061e8fc00057d21d80413635a86fdcf55b6e7594ad9e25257d2f99a02f4be.lock


10/21/2020 22:31:01 - INFO - filelock -   Lock 139937617357232 released on /root/.cache/torch/transformers/1e3af82648d7190d959a9d76d727ef629b1ca51b3da6ad04039122453cb56307.6a4061e8fc00057d21d80413635a86fdcf55b6e7594ad9e25257d2f99a02f4be.lock
10/21/2020 22:31:01 - INFO - filelock -   Lock 139937617357232 acquired on /root/.cache/torch/transformers/b901c69e8e7da4a24c635ad81d016d274f174261f4f5c144e43f4b00e242c3b0.70bec105b4158ed9a1747fea67a43f5dee97855c64d62b6ec3742f4cfdb5feda.lock





10/21/2020 22:31:01 - INFO - filelock -   Lock 139937617357232 released on /root/.cache/torch/transformers/b901c69e8e7da4a24c635ad81d016d274f174261f4f5c144e43f4b00e242c3b0.70bec105b4158ed9a1747fea67a43f5dee97855c64d62b6ec3742f4cfdb5feda.lock





10/21/2020 22:31:01 - INFO - filelock -   Lock 139937617357232 acquired on /root/.cache/torch/transformers/2d9b03b59a8af464bf4238025a3cf0e5a340b9d0ba77400011e23c130b452510.16f949018cf247a2ea7465a74ca9a292212875e5fd72f969e0807011e7f192e4.lock


10/21/2020 22:31:01 - INFO - filelock -   Lock 139937617357232 released on /root/.cache/torch/transformers/2d9b03b59a8af464bf4238025a3cf0e5a340b9d0ba77400011e23c130b452510.16f949018cf247a2ea7465a74ca9a292212875e5fd72f969e0807011e7f192e4.lock





10/21/2020 22:31:01 - INFO - filelock -   Lock 139937639109352 acquired on /root/.cache/torch/transformers/507984f2e28c7dfed5db9a20acd68beb969c7f2833abc9e582e967fa0291f3dc.100c88dbe27dbd73822c575274ade4eb2427596ac56e96769249b7512341654d.lock


10/21/2020 22:31:01 - INFO - filelock -   Lock 139937639109352 released on /root/.cache/torch/transformers/507984f2e28c7dfed5db9a20acd68beb969c7f2833abc9e582e967fa0291f3dc.100c88dbe27dbd73822c575274ade4eb2427596ac56e96769249b7512341654d.lock
10/21/2020 22:31:02 - INFO - farm.utils -   device: cuda n_gpu: 1, distributed training: False, automatic mixed precision training: None
10/21/2020 22:31:02 - INFO - farm.infer -   Got ya 1 parallel workers to do inference ...
10/21/2020 22:31:02 - INFO - farm.infer -    0 
10/21/2020 22:31:02 - INFO - farm.infer -   /w\
10/21/2020 22:31:02 - INFO - farm.infer -   /'\
10/21/2020 22:31:02 - INFO - farm.infer -   


In [26]:
finder = Finder(reader, retriever)

# Step 8. Using the company names from step 4 to make some questions about the information we want to extract, and letting Haystack search for the answers.

In [27]:
# making a function that generates some questions about partnerships of companies
def get_q_list(company_name: str):
  q_list = []

  q_list.append("which companies are partners of " + company_name)
  q_list.append("what companies are in partnership with " + company_name)
  q_list.append("what companies are " + company_name + "'s business partners?")

  return q_list

### **CAUTION**

For simplicity, I ran the pipeline for only the first 20 companies:

In [28]:
answer_dict_list = []
num_of_companies_to_extract_partners = 20
TOP_K_RETRIEVER_DEFAULT = 20
TOP_K_READER_DEFAULT = 10

for i in range(num_of_companies_to_extract_partners):

  questions = get_q_list(company_names_df.company_name[i])

  company = company_names_df.company_name[i]
  answers = []
  for q in questions:
    prediction = finder.get_answers(question=q, 
                                    top_k_retriever=TOP_K_RETRIEVER_DEFAULT, 
                                    top_k_reader=TOP_K_READER_DEFAULT)
    answers.append([prediction['answers'][k]['answer'] 
                    for k in range(len(prediction['answers']))])
  answers = list(itertools.chain.from_iterable(answers))

  answer_dict_list.append({'company': company_names_df.company_name[i],
                           'answers': answers})
  

10/21/2020 22:31:02 - INFO - elasticsearch -   POST http://localhost:9200/document/_search [status:200 request:0.259s]
10/21/2020 22:31:02 - INFO - haystack.retriever.sparse -   Got 20 candidates from retriever
10/21/2020 22:31:02 - INFO - haystack.finder -   Reader is looking for detailed answer in 73054 chars ...
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  2.22 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  5.24 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 15.02 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  7.08 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 13.95 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 13.71 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 10.42 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 10.22 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  6.52 Batches/s]
Inferencing Samples: 100%|

# Step 9. Checking whether obtained answers are compatible with our previous knowledge from step 4

In [29]:
# making a function that gets a list of answers, extracts unique company names,
#   validates them against the company name dataframe, and returns the partners
def get_partners(answers, model, company_names_df, target_company_name):
  
  partners = []
  for answer in answers:
    doc = model(answer)
    ner_array = np.array([(X.text, X.label_) for X in doc.ents])
    try:
      companies = np.unique(ner_array[np.where(ner_array[:, 1] == 'ORG')[0], 0])
    except Exception as e:
      continue
    for company in companies:
      if company not in partners \
         and company in list(company_names_df.company_name) \
         and company != target_company_name:
          partners.append(company)

  return partners


In [30]:
for i in range(len(answer_dict_list)):

  partners = get_partners(answer_dict_list[i]['answers'], nlp, company_names_df, 
                          answer_dict_list[i]['company'])
  answer_dict_list[i]['partners'] = partners

In [31]:
i=3
print('company:', answer_dict_list[i]['company'])
print('partners:', answer_dict_list[i]['partners'])

company: Toyota
partners: ['Amazon', 'Mazda', 'Uber', 'FAW Group', 'Guangzhou Automobile Group', 'Mazda Motor Corp.', 'GAC', 'Dongfeng Motor Corp.', 'Guangzhou Automobile Group Co.', 'Denso Corp.', 'EVs', 'Toyota Motor Corp.', 'Fuji Heavy Industries', 'Subaru', 'Suzuki', 'Panasonic', 'Apple Inc.', 'Mitsubishi', 'PSA', 'Peugeot', 'Volkswagen Group']


:) Seems it's working, we just need to test the results

# Step 10. Cross-validating the answers against each other and testing the model results.

First, let's make a dataframe from the list of dictionaries.

In [32]:
final_df = pd.DataFrame(answer_dict_list)
final_df.head()

Unnamed: 0,company,answers,partners
0,Ford,"[Changan Automobile Group and Jiangling Motors Group, Changan Automobile Gro...","[Changan Automobile Co., Ford Motor Co., Volkswagen Group, Ford Motor, VW, V..."
1,Tesla,"[Panasonic, South Korea’s LG Chem and China’s Contemporary Amperex Technolog...","[Contemporary Amperex Technology Co., LG Chem, Panasonic, Panasonic Corp., D..."
2,GM,"[Glympse and iHeartRadio, ExxonMobil and MasterCard, ride-hailing companies ...","[Honda, Chongqing Changan Automobile Co., SAIC, SAIC Motor Corp., Changan Au..."
3,Toyota,"[Amazon, Didi Chuxing, Pizza Hut, Uber and Mazda, Guangzhou Automobile Group...","[Amazon, Mazda, Uber, FAW Group, Guangzhou Automobile Group, Mazda Motor Cor..."
4,Nissan,"[Renault and Mitsubishi Motors, Renault and Mitsubishi, Renault SA and Mitsu...","[Mitsubishi Motors, Renault, Mitsubishi, Renault SA, Mitsubishi Motors Corp...."


Now, for each company, we iterate through its obtained partners and ask the same questions about each partner's partners (just like in the previous steps). If the target company is among the partner's partners, we count it as a 'true positive'; if it is not, we count it as a 'false positive'.
Finally, we measure precision as:


```
# precision = count of true positives / (count of true positives + count of false positives)
```
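
To make the reciprocity check concrete, here is a small self-contained sketch that works on a plain dictionary of predicted partners instead of the Haystack pipeline. The function name and the toy data in the usage note are hypothetical. Note that this measures how consistent the predictions are with each other, not precision against a ground truth. As a check on the arithmetic, the Toyota row in the results table below has 13 true positives and 8 false positives, which gives 13 / 21 ≈ 0.619.

```python
def reciprocal_precision(company, partners_by_company):
    """Return (tp, fp, precision) for one company using the reciprocity check.

    A predicted partner counts as a true positive when `company` also appears
    in that partner's own predicted partner list, otherwise as a false positive.
    """
    tp = fp = 0
    for partner in partners_by_company.get(company, []):
        if company in partners_by_company.get(partner, []):
            tp += 1
        else:
            fp += 1
    total = tp + fp
    # With no predicted partners, precision is undefined; report None.
    precision = tp / total if total else None
    return tp, fp, precision
```

For example, with `{"A": ["B", "C", "D"], "B": ["A"], "C": ["B"], "D": ["A", "C"]}`, company "A" gets two true positives (B and D list it back) and one false positive (C does not).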



In [97]:
def tp_fp_count_calc(company_name, partners, finder_model, company_names_df, 
                     final_df):

  tp = 0
  fp = 0
  
  for partner in partners:
    
    # to decrease computation:
    if len(final_df[final_df['company'] == partner]) != 0:

      partners_of_partner = final_df[final_df['company'] == partner]['partners'].values[0]

    else:

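      questions = get_q_list(partner)

      answers = []
      for q in questions:
        # use the finder passed in as an argument instead of the global one
        prediction = finder_model.get_answers(question=q, 
                                              top_k_retriever=TOP_K_RETRIEVER_DEFAULT, 
                                              top_k_reader=TOP_K_READER_DEFAULT)
        answers.append([prediction['answers'][k]['answer']
                        for k in range(len(prediction['answers']))])
      answers = list(itertools.chain.from_iterable(answers))
      # exclude the partner itself (not the answer list) from its own partners
      partners_of_partner = get_partners(answers, nlp, company_names_df, partner)
      # cache the result so the same partner is not queried again;
      # DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
      new_row = pd.DataFrame([{'company': partner, 'answers': answers, 
                               'partners': partners_of_partner}])
      final_df = pd.concat([final_df, new_row], ignore_index=True)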

    if company_name in partners_of_partner:
      tp += 1
    else:
      fp += 1

  return tp, fp, final_df

In [103]:
tp_lst, fp_lst, precision_lst = [], [], []

for i in range(final_df.shape[0]):

  tp, fp, final_df = tp_fp_count_calc(final_df.iloc[i, 0], final_df.iloc[i, 2], 
                                      finder, company_names_df, final_df)
  # avoid a ZeroDivisionError when a company has no validated partners
  precision = tp / (tp + fp) if (tp + fp) > 0 else np.nan

  tp_lst.append(tp)
  fp_lst.append(fp)
  precision_lst.append(precision)

['Ford', 'PSA', 'Peugeot', 'Suzuki', 'Ford Motor Co.', 'Mazda Motor Corp.', 'Changan Automobile Co.', 'Suzuki Motor Corp.', 'Changan', 'GM', 'Mazda', 'FAW', 'Volkswagen']
['Changan Automobile Co.', 'Chongqing Changan Automobile Co.', 'Daimler', 'Jiangling Motors Corp.', 'BAIC Group', 'Hyundai Motor Co.', 'Mazda Motor Corp.', 'BMW Group', 'Ford Motor Co.', 'VW Group', 'BMW AG', 'Delphi Automotive', 'Fiat Chrysler Automobiles', 'Intel Corp.', 'Lyft', 'Ford', 'VW', 'Volkswagen', 'Ford Motor', 'Magna International', 'Baojun', 'Nissan']
['FAW', 'SAIC', 'Audi', 'Porsche', 'Volkswagen Group', 'VW', 'AC', 'BMW', 'Daimler', 'EVs', 'Ford', 'VW Group', 'China FAW Group', 'SAIC Motor Corp.', 'Volkswagen', 'Tata Motors', 'Ford Motor Co.', 'Jianghuai Automobile Co.']
['Chongqing Changan Automobile Co.', 'Changan Automobile Co.', 'Audi', 'Porsche', "Volkswagen Group's", 'Ford Motor Co.', 'Lyft', 'BAIC Group', 'Hyundai Motor Co.', 'Ford', 'VW Group', 'Volkswagen Group', 'Magna International', 'Ford Mo

In [104]:
final_df.shape

(164, 3)

In [106]:
# take an explicit copy so adding columns does not trigger SettingWithCopyWarning
final_df_original = final_df.iloc[:20, :].copy()
final_df_original['tp'] = np.array(tp_lst)
final_df_original['fp'] = np.array(fp_lst)
final_df_original['precision'] = np.array(precision_lst)


In [107]:
final_df_original

Unnamed: 0,company,answers,partners,tp,fp,precision
0,Ford,"[Changan Automobile Group and Jiangling Motors Group, Changan Automobile Gro...","[Changan Automobile Co., Ford Motor Co., Volkswagen Group, Ford Motor, VW, V...",10,0,1.0
1,Tesla,"[Panasonic, South Korea’s LG Chem and China’s Contemporary Amperex Technolog...","[Contemporary Amperex Technology Co., LG Chem, Panasonic, Panasonic Corp., D...",6,7,0.461538
2,GM,"[Glympse and iHeartRadio, ExxonMobil and MasterCard, ride-hailing companies ...","[Honda, Chongqing Changan Automobile Co., SAIC, SAIC Motor Corp., Changan Au...",8,8,0.5
3,Toyota,"[Amazon, Didi Chuxing, Pizza Hut, Uber and Mazda, Guangzhou Automobile Group...","[Amazon, Mazda, Uber, FAW Group, Guangzhou Automobile Group, Mazda Motor Cor...",13,8,0.619048
4,Nissan,"[Renault and Mitsubishi Motors, Renault and Mitsubishi, Renault SA and Mitsu...","[Mitsubishi Motors, Renault, Mitsubishi, Renault SA, Mitsubishi Motors Corp....",7,0,1.0
5,BMW,"[BMW and Daimler, BMW AG, for example, has partnered with Intel Corp., Delph...","[Daimler, BMW AG, Delphi Automotive, Fiat Chrysler Automobiles, Intel Corp.,...",11,8,0.578947
6,VW,"[two of China's largest automakers, SAIC and FAW, SAIC and FAW, state-owned ...","[FAW, SAIC, FAW Group, SAIC Motor, Daimler, GM, Ford, Volkswagen, Byton, Hyu...",7,6,0.538462
7,EV,"[VW Group, Ford Motor Co. and BMW Group, VW's venture with Anhui Jianghuai A...","[BMW Group, Ford Motor Co., VW Group, VW, Hyundai, Jaguar, Land Rover, Mahin...",4,32,0.111111
8,Automotive News,"[traditional automakers, large suppliers and Silicon Valley companies, Adver...","[Cox Automotive, Omnicom, WPP, PSA, Suzuki, Geely, Volvo, Volvo Car Corp., Z...",0,23,0.0
9,Trump,"[China and Europe, Canada's and Mexico, Mexico and Canada, Canada and Mexico...","[the European Union, BMW, Daimler, Volkswagen Group, Ford Motor Co., Fiat, F...",0,13,0.0


In [108]:
final_df_original.to_csv('final_df_original.csv')

In [109]:
final_df.to_csv('final_df.csv')

# Future improvements

Here is a list of things we can do to improve the results:



1.   The first thing to do is to search the Internet for ways to enrich our datasets, especially the dataset in which we stored the company names. It is a major part of our pipeline, because we validate our results against it several times, and better quality in this data should improve our results.
Also, it would be better to store this data in a database instead of a dataframe, so we don't need to rebuild it every time we run the pipeline.
2.   In this notebook, we used only a sample of 50k articles from the autonews.com website. Adding articles from other sources might help the retriever select better articles, and our precision might therefore increase.
3.   In this notebook, we used only `ElasticsearchRetriever` (BM25) as our retriever, but other models, such as embedding models, need to be tested. They might improve our pipeline results.
4.   There are also many other NLP models that could be used for the reader. Some of them might increase the precision or the speed of our pipeline.
5.   Although the cross-validation idea isn't bad, ground-truth information about partnerships between companies would help a lot in measuring the precision of our pipeline more accurately.
6.   Furthermore, I ran the pipeline for just 20 companies, so the results are unreliable and should be tested on a bigger sample.
7.   Finally, it would be better to store the results in a database so they are easily accessible at any time, but since we are using an interactive environment and don't have access to servers, I just saved the results as csv files.
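
For item 5, if a ground-truth list of partners were available for a company, precision and recall could be computed directly instead of relying on the reciprocity check. The sketch below assumes such a list exists; the function name is hypothetical and no real ground-truth data is implied.

```python
def precision_recall(predicted, ground_truth):
    """Compare predicted partners with a ground-truth list for one company.

    Returns (precision, recall); a value is None when it is undefined,
    i.e. when there are no predictions or no ground-truth partners.
    """
    predicted, ground_truth = set(predicted), set(ground_truth)
    tp = len(predicted & ground_truth)
    precision = tp / len(predicted) if predicted else None
    recall = tp / len(ground_truth) if ground_truth else None
    return precision, recall
```

Unlike the reciprocity check, this also yields recall, that is, the share of known partners the pipeline actually found. Name variants such as 'Ford' and 'Ford Motor Co.' would need to be merged first, otherwise they count as different companies.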

