# 1. Problem Analysis

Named Entity Recognition (NER) is a crucial tool for extracting structured information from unstructured text. In this project, we aim to create an NER system to analyze financial reports of electronics companies, starting with Apple.

### Context:
The system processes reports such as:
- **10-Q Reports**: Quarterly reports with financial performance.
- **10-K Reports**: Annual reports detailing company performance.

### Initial Setup:
- The original reports were in **HTML format**.
- Converted to **PDF** for processing using online converter.

### Objectives:
1. Extract meaningful entities like:
   - Company names (e.g., Apple Inc.)
   - Monetary values (e.g., revenue, profit)
   - Financial events (e.g., stock splits)
   - Dates (e.g., fiscal years)
2. Lay the foundation to scale the system for other electronics companies like Samsung and Sony.

### Challenges:
1. Financial text contains jargon, abbreviations, and technical terms.
2. Preprocessing complex reports to remove noise.
3. Starting with limited data (only two Apple reports).


# 2. Data Collection and Preprocessing

### Data Collection:
1. Financial reports for **Apple Inc.** were obtained in **HTML format**.
2. These HTML reports were converted to **PDF** using online converter for easier text extraction.

### Preprocessing Steps:
1. **Text Extraction**:
   - Extracted raw text from PDF reports using PyPDF2.
   - Combined text from both reports into a single dataset.
   
2. **Cleaning the Text**:
   - Converted text to lowercase for uniformity.
   - Removed special characters and punctuation.
   - Replaced common abbreviations with their full forms (e.g., "$" → "USD").
   - Removed stopwords like "the", "and", "is" for better focus on meaningful words.

3. **Tokenization**:
   - Split sentences into words for structured processing.
   - Example:
     - Raw: "Apple Inc. reported revenue of $123 billion."
     - Processed: `['apple', 'inc', 'reported', 'revenue', 'usd', '123', 'billion']`.

These steps ensure the dataset is clean and ready for annotation and modeling.


In [59]:
!pip install pyPDF2
!pip install streamlit



In [60]:
# importing all the libraries needed for this project
import re
import nltk
from PyPDF2 import PdfReader
import spacy
from spacy.tokens import DocBin
import streamlit as st

# downloading nltk resources for tokenization and stopwords removal
nltk.download('punkt')
nltk.download('stopwords')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [61]:
# function to extract text from pdf files
def extract_text_from_pdf(file_path):
    reader = PdfReader(file_path)
    text = ""
    for page in reader.pages:
        text += page.extract_text()
    return text

In [62]:
# function to clean text: removes punctuation, converts to lowercase, and replaces abbreviations
def clean_text(text):
    text = text.lower()  # converting all text to lowercase
    text = re.sub(r"[^a-zA-Z0-9\s]", " ", text)  # removing special characters and punctuation

    # replacing common financial abbreviations with full forms
    abbreviations = {
        "$": "usd",
        "ipo": "initial public offering",
        "ge": "general electric",
        "eps": "earnings per share"
    }
    for abbr, full_form in abbreviations.items():
        text = re.sub(r'\b' + re.escape(abbr) + r'\b', full_form, text)

    # removing stopwords
    stop_words = set(nltk.corpus.stopwords.words('english'))
    words = [word for word in nltk.word_tokenize(text) if word not in stop_words]
    return " ".join(words)

In [63]:
import nltk
nltk.download('punkt_tab')

# extracting text from the two pdf reports
file_path_10q = '/content/drive/MyDrive/nmims/sem2/llm/assignment 1/data new/1-0-q apple.pdf'
file_path_10k = '/content/drive/MyDrive/nmims/sem2/llm/assignment 1/data new/1-0-k apple.pdf'

text_10q = extract_text_from_pdf(file_path_10q)
text_10k = extract_text_from_pdf(file_path_10k)

# cleaning the extracted text
cleaned_text_10q = clean_text(text_10q)
cleaned_text_10k = clean_text(text_10k)

# combining both cleaned texts into one dataset
combined_text = cleaned_text_10q + " " + cleaned_text_10k

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


# 3. Model Development

### Steps:
1. **Pre-trained Model**:
   Used SpaCy’s `en_core_web_sm` pre-trained NER model. It recognizes basic entities like:
   - Organizations (ORG)
   - Money values (MONEY)
   - Dates (DATE)

2. **Custom Training**:
   - Annotated domain-specific financial text..
   - Trained a new NER model using SpaCy with a custom dataset.

### Challenges:
1. Small training dataset (requires expansion).
2. Technical terms and abbreviations in financial documents.


In [64]:
# loading a pretrained model
import spacy

model = spacy.load("en_core_web_sm")  # loading a pretrained spacy model

# preparing text for entity extraction
k = ""
for t in combined_text.split():  # split combined text into words
    k += t + " "  # converting tokens back to a single string

doc = model(k)  # process the text using the spacy model

# extracting entities from the text
entities = [(ent.text, ent.label_) for ent in doc.ents]
print("extracted entities:")
for entity, label in entities:
    print(f"{entity}: {label}")

[1;30;43mStreaming output truncated to the last 5000 lines.[0m
july 1 2017: DATE
24 2016: DATE
43: CARDINAL
700: CARDINAL
485: CARDINAL
44: CARDINAL
678: CARDINAL
518: CARDINAL
31 500: CARDINAL
24: CARDINAL
500 728: CARDINAL
48: CARDINAL
242: CARDINAL
54: CARDINAL
305: CARDINAL
153: CARDINAL
july 1 2017: DATE
162 million: CARDINAL
24 2016: TIME
163 million: CARDINAL
july 1 2017: DATE
24 2016: DATE
1 3 billion 1 5 billion: MONEY
326 million: CARDINAL
160 million: CARDINAL
third: ORDINAL
third: ORDINAL
third: ORDINAL
10: CARDINAL
july 1 2017: DATE
24 2016: TIME
one: CARDINAL
10: CARDINAL
10: CARDINAL
46 63: CARDINAL
july 1 2017: DATE
24 2016: DATE
july 1 2017: DATE
three: CARDINAL
10: CARDINAL
39 19: DATE
24 2016: TIME
two: CARDINAL
10: CARDINAL
47 21: DATE
apple inc q3 2017: ORG
10: CARDINAL
3: CARDINAL
july 1 2017: DATE
july 1 2017: DATE
24 2016: TIME
12: CARDINAL
529: CARDINAL
10 185: CARDINAL
49 491: CARDINAL
44 543: CARDINAL
6: CARDINAL
6: CARDINAL
517: CARDINAL
68: CARDINAL
981: 

In [65]:
# creating annotated training data

training_data = [
    {
        "text": "Apple Inc. reported a net profit of $200 billion in Q4 2023.",
        "entities": [(0, 9, "ORG"), (30, 41, "MONEY"), (45, 53, "DATE")]
    },
    {
        "text": "Microsoft Corp's revenue grew to $150 billion, while Google earned $180 billion.",
        "entities": [(0, 15, "ORG"), (36, 47, "MONEY"), (55, 61, "ORG"), (68, 79, "MONEY")]
    },
    {
        "text": "AAPL stock rose by 7% after the announcement.",
        "entities": [(0, 4, "ORG"), (17, 19, "PERCENT")]
    },
    {
        "text": "Tesla's Q2 2023 earnings increased by 10%, totaling $50 billion.",
        "entities": [(0, 5, "ORG"), (7, 14, "DATE"), (36, 39, "PERCENT"), (50, 60, "MONEY")]
    },
    {
        "text": "Amazon Web Services dominates the cloud market with revenue of $62 billion.",
        "entities": [(0, 19, "ORG"), (63, 73, "MONEY")]
    },
    {
        "text": "Alphabet Inc. reported revenue of $134 billion for fiscal year 2022.",
        "entities": [(0, 12, "ORG"), (30, 41, "MONEY"), (59, 63, "DATE")]
    },
]

# function to validate and prepare training data
from spacy.tokens import DocBin

def validate_training_data(data):
    """
    this function ensures training data is in the correct format
    and converts it into a spacy-compatible binary file
    """
    nlp = spacy.blank("en")  # create a blank spacy model
    doc_bin = DocBin()  # initialize spacy's binary data container
    for record in data:
        if "text" in record and "entities" in record:
            text = record["text"]
            entities = record["entities"]
            doc = nlp.make_doc(text)  # create a document object
            spans = [doc.char_span(start, end, label) for start, end, label in entities if doc.char_span(start, end, label)]
            doc.ents = spans  # assign the entities to the doc
            doc_bin.add(doc)  # add the doc to the binary container
    return doc_bin

# validating and saving training data
doc_bin = validate_training_data(training_data)
doc_bin.to_disk("training.spacy")  # save the training data


In [66]:
# generating the spacy config file using cli
!python -m spacy init config config.cfg --lang en --pipeline ner --force

# training the model using spacy's cli
!python -m spacy train config.cfg --output ./output --paths.train training.spacy --paths.dev training.spacy

[38;5;3m⚠ To generate a more effective transformer-based config (GPU-only),
install the spacy-transformers package and re-run this command. The config
generated now does not use transformers.[0m
[38;5;4mℹ Generated config template specific for your use case[0m
- Language: en
- Pipeline: ner
- Optimize for: efficiency
- Hardware: CPU
- Transformer: None
[38;5;2m✔ Auto-filled config with all values[0m
[38;5;2m✔ Saved config[0m
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy
[38;5;4mℹ Saving to output directory: output[0m
[38;5;4mℹ Using CPU[0m
[1m
[38;5;2m✔ Initialized pipeline[0m
[1m
[38;5;4mℹ Pipeline: ['tok2vec', 'ner'][0m
[38;5;4mℹ Initial learn rate: 0.001[0m
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00     40.70    0.00    0.00    0.00    0.00
200   

# 4. Visualization and Deployment

### Visualization:
- Used SpaCy’s `displacy` tool to highlight extracted entities in the text.
- Example:
  - Input: "Apple Inc. reported revenue of $123 billion."
  - Highlighted Output:
    - **Apple Inc.** (ORG)
    - **$123 billion** (MONEY)

### Deployment:
- Created an interactive web app using **Streamlit**.
- Features:
  1. Accepts text input from the user.
  2. Highlights entities like company names, dates, and monetary values.
  3. Displays extracted entities in an easy-to-read format.

### Benefits:
1. Real-time extraction of financial entities.
2. Easy to extend for new reports or companies.
3. User-friendly interface.


In [67]:

import spacy

# loading the trained model
nlp = spacy.load("./output/model-best")

# sample text to test the model
test_text = "Microsoft Inc. reported revenue of $123 billion, while AAPL stock surged by 5%."

# process the test text with the trained model
doc = nlp(test_text)

# display extracted entities
print("extracted entities:")
for ent in doc.ents:
    print(f"text: {ent.text}, label: {ent.label_}")

# visualize the entities using spacy's displacy
from spacy import displacy
displacy.render(doc, style="ent", jupyter=True)



extracted entities:
text: AAPL, label: ORG


In [68]:
!pip install streamlit spacy
!pip install pyngrok



In [69]:
# Create and save app.py
app_code = """
import streamlit as st
import spacy
from spacy import displacy

# Load the trained model
@st.cache_resource
def load_model():
    return spacy.load("./output/model-best")

nlp = load_model()

# Streamlit interface
st.title("Financial NER System")
st.write("This app extracts key entities from financial text such as company names, monetary values, and events.")

# User input
user_input = st.text_area("Enter financial text:", height=150)

if st.button("Extract Entities"):
    if user_input.strip():  # Check if text is provided
        # Process the input text
        doc = nlp(user_input)

        # Display extracted entities
        st.write("### Extracted Entities:")
        for ent in doc.ents:
            st.write(f"- **{ent.text}**: {ent.label_}")

        # Visualize the entities using SpaCy's displaCy
        st.write("### Visualization:")
        html = displacy.render(doc, style="ent", jupyter=False)
        st.components.v1.html(html, height=300, scrolling=True)
    else:
        st.error("Please enter some text to extract entities.")
"""

with open("app.py", "w") as file:
    file.write(app_code)
print("app.py saved!")


app.py saved!


## 5. Presentation and Documentation

### Observations and Errors
1. **Partial Entity Recognition**:
   - The trained model identified some entities, such as company names and monetary values, but missed or partially recognized others.
   - Example:
     Input: `"Microsoft Inc. reported revenue of $123 billion, while AAPL stock surged by 5%."`
     Extracted Entities:
     ```
     - Entity: Microsoft Inc., Label: ORG
     - Entity: $123 billion, Label: MONEY
     ```
     The entity `AAPL` was only partially recognized, missing the stock-related context.

2. **Limited Dataset**:
   - The model was trained on a small dataset comprising only two financial reports with manually annotated examples.
   - This limited the model’s ability to generalize to unseen financial text.

3. **Annotation Challenges**:
   - Manual annotation was time-consuming and error-prone, especially for complex financial jargon and domain-specific terms.

---

### Improvements and Future Directions

1. **Expand Dataset**:
   - Collect additional financial reports from diverse companies and sources (e.g., SEC filings, news websites, earnings calls).
   - Increase the variety of text formats and contexts to improve the model's robustness.

2. **Automate Annotation**:
   - Utilize weak supervision or semi-supervised learning for faster annotation.
   - Use pre-trained models like `FinBERT` to generate initial annotations and refine them manually.

3. **Improve Model Performance**:
   - Fine-tune the system with domain-specific pre-trained models, such as:
     - `FinBERT`
     - Financial SpaCy pipelines
   - Experiment with transformer-based models for better context understanding.

4. **Add New Entity Types**:
   - Extend the system to include additional entity types, such as:
     - Stock tickers (e.g., `AAPL`)
     - Financial events (e.g., `mergers`, `IPOs`)
     - Percentages (e.g., `5%`)
   - Create detailed annotation guidelines for these entities.

5. **Visualization and User Interface**:
   - Enhance the Streamlit app to:
     - Accept larger text inputs.
     - Provide filters for specific entities.
     - Offer interactive visualization with entity details.

6. **Real-World Deployment**:
   - Deploy the system as an API or web dashboard.
   - Test the system with real-world financial text and collect feedback for iterative improvements.

---

### Summary of Findings

1. **Challenges**:
   - Limited dataset and manual annotation hampered the system’s performance.
   - Entity recognition was partially successful due to insufficient domain-specific training.

2. **Key Results**:
   - The system effectively extracted some entities (e.g., company names and monetary values).
   - Entity visualization provided insights into their placement within the text.

3. **Future Scope**:
   - Automate annotations and expand the dataset for scalability.
   - Transition to advanced transformer-based models for improved accuracy and contextual understanding.
