# A High-Level Code Journey through the RB News NLP Pipeline
#### **ESADE MIBA 2023 Capstone Team G03:**
- Florian Blaser
- John Bergmann
- Quique Mendez
- Michael Merheb
- Jingshi Zhang
---

## Setup
Clone our Git repository at https://github.com/CAPSTONE-MIBA-G03/MIBA-2023-CAPSTONE-RB-NLP.git and install the required dependencies in your Python environment (we recommend Python 3.10.11):

Using venv:

```bash
$ python -m venv venv
$ source ./venv/bin/activate
$ pip install -r requirements.txt
```

Using pyenv:

```bash
$ pyenv virtualenv 3.10.11 RB_NLP
$ pyenv activate RB_NLP
$ pip install -r requirements.txt
```

In [7]:
# Internal imports
from pipeline_executor import PipelineExecutor
from nlp_analysis.word_wizard import WordWizard

# External imports
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from pprint import pprint

# Aesthetic settings
import warnings
warnings.filterwarnings("ignore")

## ETL

The `pipeline_executor.py` module is a wrapper that combines the following three ETL steps of the project into one seamless execution:

- `link_extractor.py`: A performant, multi-threaded web-scraping module that uses a selection of search engines (Google, Bing, Yahoo) to gather a large variety of news article URLs, which are used downstream to extract a corpus of news data.

- `content_extractor.py`: A performant, multi-threaded web-scraping module that generalizes well and extracts news text content from a variety of sources. It falls back to a second extraction approach if the first one fails.

- `content_cleaner.py`: A functional module used to preprocess the extracted news text. It takes the length of the extracted text into account, cleans punctuation, and uses regex patterns to filter out links, e-mails, phone numbers, and other phrases that indicate unusable text (e.g., "Are you a robot?").
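
The kind of cleaning described in the last bullet can be illustrated with a minimal sketch. The patterns, the blocked phrases, the `MIN_LENGTH` threshold, and the `clean_paragraph` helper are all assumptions made for illustration; they are not taken from `content_cleaner.py`.

```python
import re
from typing import Optional

# Hypothetical patterns; the real module may use different ones.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")
BLOCK_PHRASES = ("are you a robot", "enable javascript")  # assumed examples
MIN_LENGTH = 40  # assumed minimum number of characters for usable text


def clean_paragraph(text: str) -> Optional[str]:
    """Return cleaned text, or None if the paragraph looks unusable."""
    lowered = text.lower()
    if any(phrase in lowered for phrase in BLOCK_PHRASES):
        return None
    # Replace links, e-mail addresses, and phone-like digit runs with a space
    for pattern in (URL_RE, EMAIL_RE, PHONE_RE):
        text = pattern.sub(" ", text)
    # Collapse the whitespace left behind by the substitutions
    text = re.sub(r"\s+", " ", text).strip()
    return text if len(text) >= MIN_LENGTH else None
```

Note that the simple phone pattern also matches other long digit runs (for example two adjacent years), so a production version would likely need stricter rules.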

Using the `execute()` method, the user can specify the search query, optionally limit the number of retrieved news articles, and choose whether to overwrite any existing files.
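
To make the effect of the overwrite option concrete, here is a minimal sketch of query-based caching. The helpers `cache_path` and `load_or_scrape`, the hashed file name, and the CSV extension are hypothetical; the actual storage layout used by `PipelineExecutor` may differ.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cache_path(query: str, cache_dir: Path) -> Path:
    """Map a search query to a stable file name inside cache_dir."""
    digest = hashlib.sha256(query.encode("utf-8")).hexdigest()[:16]
    return cache_dir / f"{digest}.csv"


def load_or_scrape(
    query: str,
    scrape: Callable[[str], str],
    cache_dir: Path,
    overwrite: bool = False,
) -> str:
    """Return cached results for query, calling scrape only when needed."""
    path = cache_path(query, cache_dir)
    if path.exists() and not overwrite:
        # Cache hit: skip the expensive scraping step entirely
        return path.read_text(encoding="utf-8")
    data = scrape(query)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_text(data, encoding="utf-8")
    return data
```

With this design, a repeated query costs one file read, and passing `overwrite=True` forces a fresh scrape that replaces the stored file.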

Usability- and performance-oriented design choices:

| Feature | Rationale | Description |
| --- | --- | --- |
| Dynamic thread assignment | Performance | The number of assigned threads depends on the number of tasks, CPU cores, and free memory the machine has access to. |
| Anti-scraping measure circumvention | Usability / Performance | To make the web scraper appear more human, the following has been implemented: timeouts, user agents (HTTP headers), and HTTP sessions (which keep the connection between client and server open). |
| Logging | Usability | Logging has been implemented in the ETL process so that users can monitor the code and debug it easily if necessary. |
| Direct Requests | Performance | So far, headless browsers have not been deemed necessary because of their noticeable performance impact. However, tools such as Selenium or Playwright are easy to integrate thanks to the object-oriented design of the ETL process. |
| Caching | Performance | If the user has already searched for a specific query, the locally saved data is used instead of running the entire web scraper again (this behavior can be overridden). |
| Search Query Behavior | Usability | The user can search for any news topic. This includes search-engine-specific functionality such as boolean operators and quotes to find exact matches. |
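
As an illustration of the dynamic thread assignment listed in the table, the following sketch sizes a thread pool from the number of tasks and CPU cores. The function names, the `threads_per_core` factor, and the `hard_cap` are assumptions; the free-memory check mentioned in the table is left out because it would need a third-party package.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def pick_thread_count(n_tasks: int, threads_per_core: int = 4, hard_cap: int = 64) -> int:
    """Choose a worker count bounded by the workload and the CPU count."""
    if n_tasks <= 0:
        return 1
    cores = os.cpu_count() or 1  # cpu_count() may return None
    return max(1, min(n_tasks, cores * threads_per_core, hard_cap))


def run_concurrently(func, items):
    """Apply func to every item using a dynamically sized thread pool."""
    items = list(items)
    with ThreadPoolExecutor(max_workers=pick_thread_count(len(items))) as pool:
        # pool.map preserves the input order of the results
        return list(pool.map(func, items))
```

Web scraping is mostly I/O-bound, which is why a sketch like this allows more than one thread per core.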

In [8]:
topic = "'Quantum Computing' AND 'Research'"
pipe = PipelineExecutor()
quantum_research = pipe.execute(query=topic, max_articles=None, overwrite=False)

In [9]:
quantum_research

|  | article_index | engine | link | source | title | description | body | paragraph |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | Bing | https://phys.org/news/2023-06-quantum-boost-di... | Phys.org | Quantum computing could get boost from discove... | Researchers at North Carolina State University... | This article has been reviewed according to Sc... | Researchers at North Carolina State University... |
| 1 | 0 | Bing | https://phys.org/news/2023-06-quantum-boost-di... | Phys.org | Quantum computing could get boost from discove... | Researchers at North Carolina State University... | This article has been reviewed according to Sc... | "The discovery of Q-silicon having robust room... |
| 2 | 0 | Bing | https://phys.org/news/2023-06-quantum-boost-di... | Phys.org | Quantum computing could get boost from discove... | Researchers at North Carolina State University... | This article has been reviewed according to Sc... | The NC State researchers showed that laser mel... |
| 3 | 0 | Bing | https://phys.org/news/2023-06-quantum-boost-di... | Phys.org | Quantum computing could get boost from discove... | Researchers at North Carolina State University... | This article has been reviewed according to Sc... | "This discovery of Q-silicon stands to revolut... |
| 4 | 1 | Yahoo | https://scitechdaily.com/a-major-quantum-compu... | SciTechDaily | A Major Quantum Computing Leap With a Magnetic... | A University of Washington-led team has made a... | A University of Washington-led team has made a... | A University of Washington-led team has made a... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3643 | 413 | Google | https://newsroom.ibm.com/2023-03-24-IBM-and-Fu... | IBM Newsroom | IBM and Fundación Ikerbasque partner to launch... | Fundación Ikerbasque, the Basque Foundation fo... | San Sebastian, March 24th, 2023. Fundación Ike... | IBM and Ikerbasque will also collaborate to de... |
| 3644 | 413 | Google | https://newsroom.ibm.com/2023-03-24-IBM-and-Fu... | IBM Newsroom | IBM and Fundación Ikerbasque partner to launch... | Fundación Ikerbasque, the Basque Foundation fo... | San Sebastian, March 24th, 2023. Fundación Ike... | “It is very risky to say what the future of qu... |
| 3645 | 413 | Google | https://newsroom.ibm.com/2023-03-24-IBM-and-Fu... | IBM Newsroom | IBM and Fundación Ikerbasque partner to launch... | Fundación Ikerbasque, the Basque Foundation fo... | San Sebastian, March 24th, 2023. Fundación Ike... | “The IBM-Euskadi Quantum Computational Center ... |
| 3646 | 413 | Google | https://newsroom.ibm.com/2023-03-24-IBM-and-Fu... | IBM Newsroom | IBM and Fundación Ikerbasque partner to launch... | Fundación Ikerbasque, the Basque Foundation fo... | San Sebastian, March 24th, 2023. Fundación Ike... | The IBM-Euskadi Quantum Computational Center i... |


## NLP

The `word_wizard.py` module is a performant, feature-rich NLP module that performs various operations on the dataframe returned by the ETL pipeline. When initializing a `WordWizard` object, the user can choose whether to analyze the data at the level of entire articles or of individual paragraphs. For more precise analysis, paragraphs are the default.

Usability- and performance-oriented design choices:

| Feature | Rationale | Description |
| --- | --- | --- |
| Pretrained Models | Performance / Usability | Especially in tasks that are universally similar, such as language, using pretrained models often proves beneficial. With the exception of some unsupervised learning approaches, the NLP pipe relies on a variety of pretrained models. This lets us benefit from strong models trained on datasets so large that entire datacenters are needed, while running on everyday computer systems. |
| GPU acceleration | Performance | The NLP pipeline automatically detects and selects the most powerful device available for deep learning inference. The user can override this for each method, and the pipeline defaults to the CPU if no GPU is detected. It currently supports NVIDIA's Compute Unified Device Architecture (CUDA) and Apple's Metal Performance Shaders (MPS) software frameworks. |
| Lean Models | Performance | With computational complexity in mind, most WordWizard methods let the user choose between a heavy (usually more accurate) and a lean (faster, but potentially less accurate) model. By default, the WordWizard prefers lean models. |
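
The device selection described in the table could look roughly like the sketch below. It assumes PyTorch as the deep learning framework, which the text above does not state explicitly, and `pick_device` is a hypothetical helper rather than part of `word_wizard.py`.

```python
from typing import Optional


def pick_device(preferred: Optional[str] = None) -> str:
    """Return 'cuda', 'mps', or 'cpu', honouring an explicit user choice."""
    if preferred is not None:
        return preferred  # user override, e.g. device="cpu"
    try:
        import torch  # assumed framework; treated as optional in this sketch
    except ImportError:
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)  # absent in older PyTorch builds
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"
```

Passing `device=None`, as in the `WordWizard(...)` call further down, would correspond to the automatic branch of such a helper.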

General NLP Pipe Roadmap:

**1. Create Embeddings:** Use either `create_sentence_embeddings()` or `create_word_embeddings()`. Word embeddings offer more in-depth word analysis and a very fine-grained representation of text. However, they often struggle with polysemy (words with multiple meanings, e.g. Apple the fruit or Apple the company?) and inherently lose contextual information. Sentence embeddings offer better contextual representations and appear to perform better, but they lose specific, in-depth word-level information. Ultimately, the choice depends on the use case, and either one may offer better results.

**2. Create Clusters:** Clustering is done using the `cluster_embeddings()` method and serves the purpose of identifying and combining common news topics.

**3. Any of the following:**

- `summarize_medoids()`: Summarizes the text whose embedding lies closest to the center of each cluster (the medoid).
- `find_sentiment()`: Calculates a sentiment score for each piece of news data in the WordWizard.
- `entitiy_recognition()`: Extracts the most common entities in the news data corpus.
- `topic_modelling()`: Identifies meta-topics in the news corpus.
- `reduce_demensionality()`: Reduces the dimensionality of the embeddings for easier downstream analysis in tasks such as visualization.

**4. Perform further analysis using the enhanced dataframe from the WordWizard object**
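
To make the notion of a cluster medoid from step 3 concrete, here is a minimal pure-Python sketch that picks, for each cluster, the point closest to the cluster center. The function `find_medoids` and the toy 2D points are illustrative assumptions; the WordWizard works on high-dimensional embeddings and may compute medoids differently.

```python
import math
from typing import Dict, List, Sequence


def find_medoids(points: Sequence[Sequence[float]], labels: Sequence[int]) -> Dict[int, int]:
    """Map each cluster label to the index of the point nearest its centroid."""
    # Group point indices by their cluster label
    clusters: Dict[int, List[int]] = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    medoids: Dict[int, int] = {}
    for label, members in clusters.items():
        dim = len(points[members[0]])
        # Centroid = coordinate-wise mean of all points in the cluster
        centroid = [sum(points[i][d] for i in members) / len(members) for d in range(dim)]
        # Medoid (as used here) = the actual data point closest to that centroid
        medoids[label] = min(members, key=lambda i: math.dist(points[i], centroid))
    return medoids


points = [[0, 0], [1, 0], [2, 0], [10, 10], [10, 12], [10, 11]]
labels = [0, 0, 0, 1, 1, 1]
print(find_medoids(points, labels))  # {0: 1, 1: 5}
```

Because a medoid is a real data point rather than an averaged vector, its text can be summarized directly as a representative of the cluster.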

In [10]:
wizard = WordWizard(quantum_research, device=None, interest="paragraph")

In [11]:
wizard \
.create_sentence_embeddings() \
.cluster_embeddings() \
.entitiy_recognition() \
.summarize_medoids(lean=True) \
.find_sentiment(lean=False) \
.topic_modelling() \
.reduce_demensionality()

Batches:   0%|          | 0/114 [00:00<?, ?it/s]

Extracting organizations from paragraph:   0%|          | 0/5 [00:00<?, ?it/s]

Creating summaries for cluster medoids based on paragraph:   0%|          | 0/10 [00:00<?, ?it/s]

Calculating sentiment using paragraph:   0%|          | 0/3422 [00:00<?, ?it/s]

<nlp_analysis.word_wizard.WordWizard at 0x3d5de3970>
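
The chained call above returns a `WordWizard` object, which is consistent with each method returning the instance itself. The toy class below sketches that fluent pattern; `ToyWizard` and its methods are purely illustrative, and whether `WordWizard` is implemented exactly this way is an assumption.

```python
from typing import List


class ToyWizard:
    """Toy class whose methods return self so that calls can be chained."""

    def __init__(self) -> None:
        self.steps: List[str] = []

    def embed(self) -> "ToyWizard":
        self.steps.append("embed")
        return self  # returning the instance enables chaining

    def cluster(self) -> "ToyWizard":
        self.steps.append("cluster")
        return self


pipeline = ToyWizard().embed().cluster()
print(pipeline.steps)  # ['embed', 'cluster']
```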

In [12]:
wizard_copy = wizard.df.copy()
wizard_copy.head()

|  | article_index | engine | link | source | title | description | body | paragraph | sentences | paragraph_sentence_embeddings | paragraph_sentence_embeddings_clusters | paragraph_sentence_embeddings_clusters_medoids | paragraph_clusters_sentence_embeddings_NER | paragraph_sentence_embeddings_clusters_medoids_summaries | paragraph_sentiment | topics | paragraph_reduced_dimensions_word_embeddings |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | Bing | https://phys.org/news/2023-06-quantum-boost-di... | Phys.org | Quantum computing could get boost from discove... | Researchers at North Carolina State University... | This article has been reviewed according to Sc... | Researchers at North Carolina State University... | [Researchers at North Carolina State Universit... | [-0.1020563393831253, 0.032378360629081726, -0... | 3 | False | ['IBM', 'Google', 'Microsoft', 'Quantum comput... |  | 1.0 | [(computer, 0.01395662766611499), (quantum com... | [8.750580787658691, -3.6687560081481934] |
| 1 | 0 | Bing | https://phys.org/news/2023-06-quantum-boost-di... | Phys.org | Quantum computing could get boost from discove... | Researchers at North Carolina State University... | This article has been reviewed according to Sc... | "The discovery of Q-silicon having robust room... | ["The discovery of Q-silicon having robust roo... | [-0.11736508458852768, 0.01704520918428898, -0... | 4 | False | ['Google', 'Non-Abelian anyons have the unique... |  | 1.0 | [(material, 0.01322349626579495), (property, 0... | [8.768294334411621, -3.7024354934692383] |
| 2 | 0 | Bing | https://phys.org/news/2023-06-quantum-boost-di... | Phys.org | Quantum computing could get boost from discove... | Researchers at North Carolina State University... | This article has been reviewed according to Sc... | The NC State researchers showed that laser mel... | [The NC State researchers showed that laser me... | [-0.08565413951873779, -0.032770633697509766, ... | 4 | False | ['Google', 'Non-Abelian anyons have the unique... |  | 1.0 | [(material, 0.01322349626579495), (property, 0... | [8.799320220947266, -3.7150118350982666] |
| 3 | 0 | Bing | https://phys.org/news/2023-06-quantum-boost-di... | Phys.org | Quantum computing could get boost from discove... | Researchers at North Carolina State University... | This article has been reviewed according to Sc... | "This discovery of Q-silicon stands to revolut... | ["This discovery of Q-silicon stands to revolu... | [-0.09404293447732925, 0.08271421492099762, -0... | 3 | False | ['IBM', 'Google', 'Microsoft', 'Quantum comput... |  | 1.0 | [(computer, 0.01395662766611499), (quantum com... | [8.753495216369629, -3.7127583026885986] |
| 4 | 1 | Yahoo | https://scitechdaily.com/a-major-quantum-compu... | SciTechDaily | A Major Quantum Computing Leap With a Magnetic... | A University of Washington-led team has made a... | A University of Washington-led team has made a... | A University of Washington-led team has made a... | [A University of Washington-led team has made ... | [-0.03416478633880615, 0.029943140223622322, -... | 3 | False | ['IBM', 'Google', 'Microsoft', 'Quantum comput... |  | 1.0 | [(computer, 0.01395662766611499), (quantum com... | [8.196556091308594, -6.83024787902832] |


## Downstream Analysis to generate Insights

### Prepare Data for Visualization

In [13]:
# Topics come as a list of (topic, score) tuples. Keep only the first 5 tuples and round each score to 3 decimal places
wizard_copy["topics"] = wizard_copy["topics"].apply(lambda x: [(topic, round(score, 3)) for topic, score in x[:5]])
# The reduced 2D coordinates live in the last column; grab it once before adding new columns
reduced_dims = wizard_copy.iloc[:, -1]
wizard_copy['x'] = reduced_dims.apply(lambda point: point[0])
wizard_copy['y'] = reduced_dims.apply(lambda point: point[1])
# Count cluster size
wizard_copy['cluster_size'] = wizard_copy.groupby('paragraph_sentence_embeddings_clusters')['paragraph_sentence_embeddings_clusters'].transform('count')
# Aggregate cluster sentiment
wizard_copy['cluster_sentiment'] = wizard_copy.groupby('paragraph_sentence_embeddings_clusters')['paragraph_sentiment'].transform('mean')
# Keep only medoids
wizard_copy = wizard_copy[wizard_copy['paragraph_sentence_embeddings_clusters_medoids'] == True]
# Keep only cluster_size, x, y, cluster_sentiment, paragraph_clusters_sentence_embeddings_NER, topics, and paragraph
ner_col = [col for col in wizard_copy.columns if col.endswith('NER')][0]
wizard_copy = wizard_copy[['cluster_size', 'x', 'y', 'cluster_sentiment', ner_col, 'topics', 'paragraph']]
wizard_copy.columns = ['size', 'x', 'y', 'sentiment', 'entities', 'topics', 'paragraph']

In [14]:
wizard_copy

|  | size | x | y | sentiment | entities | topics | paragraph |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 541 | 982 | 8.374166 | 0.976195 | 0.961303 | ['IBM', 'Microsoft', 'Intel', 'Google', 'NVIDIA'] | [(technology, 0.009), (research, 0.009), (ibm,... | “Today's quantum computers are novel, scientif... |
| 758 | 398 | 7.410124 | 4.066543 | 0.967337 | ['Photonic Integrated Circuit', 'IBM', 'Jun (T... | [(market, 0.035), (global, 0.023), (global qua... | Global Quantum Computing Market (2023-2030) re... |
| 1489 | 1006 | 8.135829 | -2.241421 | 0.868787 | ['IBM', 'Google', 'Microsoft', 'Quantum comput... | [(computer, 0.014), (quantum computer, 0.014),... | Quantum computing could revolutionize our worl... |
| 1608 | 539 | 8.118448 | -4.555763 | 0.93692 | ['Google', 'Non-Abelian anyons have the unique... | [(material, 0.013), (property, 0.01), (particl... | While these particles had never been observed ... |
| 1618 | 398 | 7.448356 | 4.045999 | 0.967337 | ['Photonic Integrated Circuit', 'IBM', 'Jun (T... | [(market, 0.035), (global, 0.023), (global qua... | The recent flows and therefore the growth oppo... |
| 1750 | 723 | 9.684717 | 4.228974 | 0.892116 | ['IBM', 'Moderna', 'Cleveland Clinic', 'Micros... | [(ai, 0.006), (data, 0.006), (research, 0.006)... | Our research spans a multitude of industries i... |
| 1757 | 723 | 9.682897 | 3.754164 | 0.892116 | ['IBM', 'Moderna', 'Cleveland Clinic', 'Micros... | [(ai, 0.006), (data, 0.006), (research, 0.006)... | They are able to make well-calibrated decision... |
| 1827 | 1006 | 8.414718 | -1.236494 | 0.868787 | ['IBM', 'Google', 'Microsoft', 'Quantum comput... | [(computer, 0.014), (quantum computer, 0.014),... | Ultimately, quantum computers excel at solving... |
| 2982 | 982 | 7.82146 | 1.102334 | 0.961303 | ['IBM', 'Microsoft', 'Intel', 'Google', 'NVIDIA'] | [(technology, 0.009), (research, 0.009), (ibm,... | Despite these challenges, there has been signi... |
| 3012 | 539 | 7.685882 | -3.451177 | 0.93692 | ['Google', 'Non-Abelian anyons have the unique... | [(material, 0.013), (property, 0.01), (particl... | After completing their master’s, they pivoted ... |


### Visualization based on Topic and Sentiment

In [24]:
fig = px.scatter(
    wizard_copy,
    x="x",
    y="y",
    size="size",
    color="sentiment",
    hover_name=wizard_copy["paragraph"].str.wrap(150).apply(lambda x: x.replace("\n", "<br>")),
    hover_data=["topics", "entities"],
    color_continuous_scale=px.colors.sequential.Viridis,
    title="Quantum Computing Research",
    width=1500,
    height=1000,
)

fig.update_layout(title_x=0.5, title_font_size=30)

fig.show()

In [16]:
wizard.df["paragraph_sentiment"].value_counts()

paragraph_sentiment
1.0    3353
0.0     295
Name: count, dtype: int64
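
The counts above are heavily skewed towards the label `1.0`. A short calculation turns them into shares, which also helps explain why the per-cluster `sentiment` means in the medoid table all sit between roughly 0.87 and 0.97: the mean of a 0/1 label is simply the fraction of paragraphs labelled `1.0`. The numbers are copied from the output above; the variable names are illustrative.

```python
counts = {1.0: 3353, 0.0: 295}  # values copied from the value_counts() output
total = sum(counts.values())

# Share of each label among all scored paragraphs
shares = {label: count / total for label, count in counts.items()}
for label, share in shares.items():
    print(f"sentiment={label}: {share:.1%}")
```

This gives about 91.9% for label `1.0` and 8.1% for label `0.0`. In pandas, `value_counts(normalize=True)` returns the same shares directly.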

### Summarization

In [17]:
summaries = wizard.df.loc[(~wizard.df["paragraph_sentence_embeddings_clusters_medoids_summaries"].isna()), "paragraph_sentence_embeddings_clusters_medoids_summaries"]
for summary in summaries:
    pprint(summary)
    print()

("“Today's quantum computers are novel, scientific tools that can be used to "
 'model problems that are extremely difficult, and perhaps impossible, for '
 'classical systems,” said Daro Gil, Senior Vice President and Director of IBM '
 'Research.')

('The Quantum Computing market research report provides the newest industry '
 'data and industry future trends, allowing you to identify the products and '
 'end users driving revenue growth and profitability.')

('Quantum computing could revolutionize our world.<n>For specific and crucial '
 'tasks, it promises to be exponentially faster than the zero-or-one binary '
 'technology.<n>But developing quantum computers hinges on building a stable '
 'network of qubits.')

('Particles have never been observed in nature.<n>They were created in a '
 'particular type of semiconductor device.<n>They were used to build the '
 'components for a robust quantum computer.')

('This latest report provides worldwide Quantum Computing market predictions
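
Some of the summaries above contain literal `<n>` tokens between sentences. If these should not appear in a report, a small post-processing step can remove them. The helper `tidy_summary` is a hypothetical addition and not part of the WordWizard; it assumes the token only acts as a sentence separator.

```python
def tidy_summary(summary: str, separator: str = "<n>") -> str:
    """Replace separator tokens with single spaces and trim whitespace."""
    parts = [part.strip() for part in summary.split(separator)]
    # Drop empty fragments that result from leading or doubled separators
    return " ".join(part for part in parts if part)


raw = "Quantum computing could revolutionize our world.<n>But developing quantum computers hinges on building a stable network of qubits."
print(tidy_summary(raw))
```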

### There is always more to explore...