# Homework 8 | spaCy EntityRuler

This homework uses pandas and spaCy's EntityRuler to extract company names and stock symbols from text, with match patterns generated automatically from a stock dataset.

In [45]:
# Import necessary libraries
import pandas as pd
import spacy
from spacy.pipeline import EntityRuler

## Load Dataset

In [46]:
# Define the path to the TSV dataset file
file_path = "stocks-1.tsv"

# Read the TSV file into a pandas DataFrame
df = pd.read_csv(file_path, sep="\t")
df.head()

| Unnamed: 0 | Symbol | CompanyName | Industry | MarketCap |
|---|---|---|---|---|
| 0 | A | Agilent Technologies | Life Sciences Tools & Services | 53.65B |
| 1 | AA | Alcoa | Metals & Mining | 9.25B |
| 2 | AAC | Ares Acquisition | Shell Companies | 1.22B |
| 3 | AACG | ATA Creativity Global | Diversified Consumer Services | 90.35M |
| 4 | AADI | Aadi Bioscience | Pharmaceuticals | 104.85M |
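
The `Unnamed: 0` column in the preview is most likely a saved row index that was read back as data. The sketch below shows one way to load such a file more cleanly. It assumes the first column of the TSV is an unnamed index, and it uses a small in-memory sample instead of `stocks-1.tsv`. The first two rows are copied from the preview; the third row, with ticker `NA`, is hypothetical and is included only because pandas converts the string `NA` to a missing value by default, which would silently drop such a ticker.

```python
import io

import pandas as pd

# In-memory stand-in for the TSV file (an assumption, not the real dataset).
# The last row is hypothetical and only illustrates the "NA" ticker issue.
sample_tsv = (
    "\tSymbol\tCompanyName\tIndustry\tMarketCap\n"
    "0\tA\tAgilent Technologies\tLife Sciences Tools & Services\t53.65B\n"
    "1\tAA\tAlcoa\tMetals & Mining\t9.25B\n"
    "2\tNA\tHypothetical Co\tExample Industry\t1.00B\n"
)

# index_col=0 uses the first column as the row index instead of data.
# keep_default_na=False stops pandas from turning the string "NA" into NaN.
sample_df = pd.read_csv(
    io.StringIO(sample_tsv),
    sep="\t",
    index_col=0,
    keep_default_na=False,
)

print(list(sample_df.columns))
print(sample_df["Symbol"].tolist())
```

With `keep_default_na=False`, empty cells arrive as empty strings rather than `NaN`, so they would need to be filtered explicitly instead of with `dropna()`.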


## Create Patterns Automatically from the Dataset

In [47]:
# Extract unique company names
unique_companies = df["CompanyName"].dropna().unique()
# Extract unique stock symbols
unique_symbols = df["Symbol"].dropna().unique()

# Create patterns for company names with label 'COMPANY'
company_patterns = [
    {"label": "COMPANY", "pattern": name} for name in unique_companies
]

# Create patterns for stock symbols with label 'STOCK'
symbol_patterns = [
    {"label": "STOCK", "pattern": symbol} for symbol in unique_symbols
]

# Combine all patterns into one list
all_patterns = company_patterns + symbol_patterns
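
To make the shape of the generated patterns concrete, the following sketch applies the same steps to a two-row stand-in for the DataFrame (rows copied from the preview above) and prints the resulting list. The helper name `build_patterns` is hypothetical and is introduced here only for illustration.

```python
import pandas as pd

# Two rows copied from the dataset preview; a stand-in for the full DataFrame.
sample_df = pd.DataFrame(
    {
        "Symbol": ["A", "AA"],
        "CompanyName": ["Agilent Technologies", "Alcoa"],
    }
)


def build_patterns(frame):
    """Return EntityRuler phrase patterns for company names and tickers.

    Hypothetical helper: it wraps the list comprehensions from the cell above.
    """
    companies = [
        {"label": "COMPANY", "pattern": name}
        for name in frame["CompanyName"].dropna().unique()
    ]
    symbols = [
        {"label": "STOCK", "pattern": symbol}
        for symbol in frame["Symbol"].dropna().unique()
    ]
    return companies + symbols


sample_patterns = build_patterns(sample_df)
for pattern in sample_patterns:
    print(pattern)
```

Each entry is a plain dictionary with a `label` and a string `pattern`. A string pattern is matched as an exact phrase, so `"Alcoa"` matches the token `Alcoa` but not `alcoa` or `ALCOA`.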

## Build spaCy Pipeline and Add EntityRuler

In [48]:
# Initialize a blank English spaCy pipeline
nlp = spacy.blank("en")

# Add EntityRuler component to the spaCy pipeline
ruler = nlp.add_pipe("entity_ruler")

# Add the company and stock symbol patterns to the ruler
ruler.add_patterns(all_patterns)
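
The pipeline above labels names and tickers independently, so nothing ties `Alcoa` to `AA`. One possible extension, shown here as a sketch rather than as part of the homework, is the optional `id` key that EntityRuler patterns accept. The sketch assumes the ticker is a suitable shared identifier; the names `nlp_demo`, `ruler_demo`, and the test sentence are made up for illustration.

```python
import spacy

# Separate blank pipeline so the sketch does not touch the notebook's `nlp`.
nlp_demo = spacy.blank("en")
ruler_demo = nlp_demo.add_pipe("entity_ruler")

# Both patterns share the same id (the ticker), linking name and symbol.
ruler_demo.add_patterns(
    [
        {"label": "COMPANY", "pattern": "Alcoa", "id": "AA"},
        {"label": "STOCK", "pattern": "AA", "id": "AA"},
    ]
)

# The blank pipeline contains only the ruler.
print(nlp_demo.pipe_names)

# Illustrative sentence, not taken from the homework paragraphs.
doc_demo = nlp_demo("Alcoa (AA) gained 0.8%.")
found = [(ent.text, ent.label_, ent.ent_id_) for ent in doc_demo.ents]
print(found)
```

Reading `ent.ent_id_` then gives the same identifier for a company name and its ticker, which makes it possible to group mentions of one company downstream.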

## Sample Paragraphs for Testing

In [49]:
# Define sample paragraphs for testing entity recognition
paragraphs = [
    """Helmerich & Payne (HP) saw its stock rise by 1.5%, fueled by optimistic forecasts in the Energy Equipment & Services sector. In contrast, Check-Cap (CHEK) faced a decline of 2.3% following its announcement of increased costs related to supply chain disruptions.

Meanwhile, Vallon Pharmaceuticals (VLON) gained 0.8% after strong quarterly earnings, outperforming its peers in the Biotechnology space. Sequans Communications (SQNS) also recorded a modest increase of 0.5%, reflecting investors' confidence in its ability to navigate challenges in the Semiconductors & Semiconductor Equipment industry.""",


    """Aemetis (AMTX) saw its stock rise by 1.5%, fueled by optimistic forecasts in the Oil, Gas & Consumable Fuels sector. In contrast, Ferro Corporation (FOE) faced a decline of 2.3% following its announcement of increased costs related to supply chain disruptions.

Meanwhile, RingCentral (RNG) gained 0.8% after strong quarterly earnings, outperforming its peers in the Software space. ACI Worldwide (ACIW) also recorded a modest increase of 0.5%, reflecting investors' confidence in its ability to navigate challenges in the Software industry.""",


    """On a mixed trading day, Par Pacific Holdings (PARR) saw its stock rise by 1.5%, fueled by optimistic forecasts in the Oil, Gas & Consumable Fuels sector. In contrast, Nano Dimension (NNDM) faced a decline of 2.3% following its announcement of increased costs related to supply chain disruptions.

Meanwhile, Beyond Meat (BYND) gained 0.8% after strong quarterly earnings, outperforming its peers in the Food Products space. Apollo Investment (AINV) also recorded a modest increase of 0.5%, reflecting investors' confidence in its ability to navigate challenges in the Capital Markets industry."""
]

## Apply EntityRuler and Extract Entities

In [50]:
# Loop through each paragraph and apply the pipeline
for i, text in enumerate(paragraphs, 1):
    # Process the text with spaCy
    doc = nlp(text)
    # Print recognized entities for each paragraph
    print(f"Paragraph {i} Entities:")
    for ent in doc.ents:
        print(f"  {ent.text} ({ent.label_})")
    print("\n" + "-" * 70 + "\n")

Paragraph 1 Entities:
  Helmerich & Payne (COMPANY)
  HP (STOCK)
  Check-Cap (COMPANY)
  CHEK (STOCK)
  Vallon Pharmaceuticals (COMPANY)
  VLON (STOCK)
  Sequans Communications (COMPANY)
  SQNS (STOCK)

----------------------------------------------------------------------

Paragraph 2 Entities:
  Aemetis (COMPANY)
  AMTX (STOCK)
  Ferro Corporation (COMPANY)
  FOE (STOCK)
  RingCentral (COMPANY)
  RNG (STOCK)
  ACI Worldwide (COMPANY)
  ACIW (STOCK)

----------------------------------------------------------------------

Paragraph 3 Entities:
  Par Pacific Holdings (COMPANY)
  PARR (STOCK)
  Nano Dimension (COMPANY)
  NNDM (STOCK)
  Beyond Meat (COMPANY)
  BYND (STOCK)
  Apollo Investment (COMPANY)
  AINV (STOCK)

----------------------------------------------------------------------



The EntityRuler identified every company name and stock symbol in the three test paragraphs: 12 companies and 12 tickers, with no spurious matches. In Paragraph 1 it recognized Helmerich & Payne (HP), Check-Cap (CHEK), Vallon Pharmaceuticals (VLON), and Sequans Communications (SQNS). In Paragraph 2 it extracted Aemetis (AMTX), Ferro Corporation (FOE), RingCentral (RNG), and ACI Worldwide (ACIW). In Paragraph 3 it identified Par Pacific Holdings (PARR), Nano Dimension (NNDM), Beyond Meat (BYND), and Apollo Investment (AINV).

Each company name was labeled "COMPANY" and each ticker "STOCK", so the patterns generated automatically from the dataset were sufficient for these examples. Because the ruler matches exact, case-sensitive strings, this result shows that names and tickers written as they appear in the dataset are found. It does not show how the ruler handles spelling variants or short tickers that coincide with ordinary words.
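
One limitation of exact string matching is worth checking: the dataset contains the single-letter ticker `A` (Agilent Technologies), which is also an ordinary English word when capitalized at the start of a sentence. The sketch below reproduces that case with a made-up sentence and then applies one possible mitigation, keeping a STOCK entity only when it sits directly inside parentheses. The helper `in_parentheses` and the filtering rule are assumptions for illustration, not part of the homework solution.

```python
import spacy

nlp_check = spacy.blank("en")
ruler_check = nlp_check.add_pipe("entity_ruler")
ruler_check.add_patterns(
    [
        {"label": "COMPANY", "pattern": "Agilent Technologies"},
        {"label": "STOCK", "pattern": "A"},
    ]
)

# Hypothetical sentence: the first "A" is an article, the second a ticker.
doc_check = nlp_check("A rally lifted Agilent Technologies (A) by 1.5%.")


def in_parentheses(doc, ent):
    """Return True if the entity is directly wrapped in "(" and ")" tokens."""
    return (
        ent.start > 0
        and ent.end < len(doc)
        and doc[ent.start - 1].text == "("
        and doc[ent.end].text == ")"
    )


# Raw ruler output, including the article "A" mislabeled as a ticker.
raw = [(ent.text, ent.label_) for ent in doc_check.ents]

# Keep all COMPANY entities, but STOCK entities only inside parentheses.
filtered = [
    (ent.text, ent.label_)
    for ent in doc_check.ents
    if ent.label_ != "STOCK" or in_parentheses(doc_check, ent)
]

print(raw)
print(filtered)
```

The parenthesis rule fits the "Company (TICKER)" style of the test paragraphs but would discard tickers written without parentheses, so it is a trade-off rather than a general fix.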