### Using a GPT4All model to extract key features from a PDF

The offline model first has to be downloaded from the Hugging Face website. I uploaded a copy of that model to my Google Drive so that it can be used in Colab.

In [2]:
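# Mount Google Drive so the /content/drive paths used below are available.
# Uncomment these lines if Drive is not already mounted in your Colab session.
# from google.colab import drive
# drive.mount('/content/drive')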

### Unzip model

In [None]:
!unzip "/content/drive/MyDrive/GPT4All_Models/mistral-7b-instruct-v0.2.Q5_K_M.zip" -d "/content/GPT4All_Models"


Archive:  /content/drive/MyDrive/GPT4All_Models/mistral-7b-instruct-v0.2.Q5_K_M.zip
  inflating: /content/GPT4All_Models/mistral-7b-instruct-v0.2.Q5_K_M.gguf  
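
The same extraction can be done from Python rather than the shell, which may be useful outside Colab. The following is a minimal sketch using only the standard library; the helper name `unzip_model` is hypothetical and not part of the original notebook.

```python
import zipfile
from pathlib import Path

def unzip_model(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir and return the extracted paths."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)  # create the target folder if needed
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return [dest / name for name in archive.namelist()]

# Example call with the paths used in this notebook (assumes Drive is mounted):
# unzip_model(
#     "/content/drive/MyDrive/GPT4All_Models/mistral-7b-instruct-v0.2.Q5_K_M.zip",
#     "/content/GPT4All_Models",
# )
```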


### Load the model and process the PDF file to extract the desired data

In [None]:
import os

MODEL_PATH = "/content/GPT4All_Models/"

# List files in the directory
print(os.listdir(MODEL_PATH))


['mistral-7b-instruct-v0.2.Q5_K_M.gguf']
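
Instead of reading the directory listing by eye, the check can fail early when no model file is present. This is a small sketch under the assumption that models are stored as `.gguf` files, as in the listing above; `find_gguf_models` is a hypothetical helper name.

```python
from pathlib import Path

def find_gguf_models(model_dir):
    """Return the sorted .gguf files in model_dir; raise if there are none."""
    models = sorted(Path(model_dir).glob("*.gguf"))
    if not models:
        raise FileNotFoundError(f"No .gguf model files found in {model_dir}")
    return models

# Example call with the directory used in this notebook:
# print(find_gguf_models("/content/GPT4All_Models/"))
```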


### Install GPT4All and load the model

In [None]:
!pip install gpt4all

Collecting gpt4all
  Downloading gpt4all-2.8.2-py3-none-manylinux1_x86_64.whl.metadata (4.8 kB)
Downloading gpt4all-2.8.2-py3-none-manylinux1_x86_64.whl (121.6 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 121.6/121.6 MB 6.5 MB/s eta 0:00:00
Installing collected packages: gpt4all
Successfully installed gpt4all-2.8.2


In [None]:
from gpt4all import GPT4All

MODEL_FILE = "/content/GPT4All_Models/mistral-7b-instruct-v0.2.Q5_K_M.gguf"

# Load the model
model = GPT4All(MODEL_FILE)
print("✅ Model loaded successfully!")


✅ Model loaded successfully!


### Set the PDF path

In [9]:
PDF_PATH = "/content/drive/MyDrive/GPT4All_Models/Fluid AI MAil PDF (1).pdf"


In [10]:
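# Caution: the PyPI package named "fitz" is an unrelated project, not PyMuPDF.
# The `import fitz` used later is provided by PyMuPDF, installed in a following cell.
!pip install fitz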

Collecting fitz
  Downloading fitz-0.0.1.dev2-py2.py3-none-any.whl.metadata (816 bytes)
Collecting configobj (from fitz)
  Downloading configobj-5.0.9-py2.py3-none-any.whl.metadata (3.2 kB)
Collecting configparser (from fitz)
  Downloading configparser-7.2.0-py3-none-any.whl.metadata (5.5 kB)
Collecting nipype (from fitz)
  Downloading nipype-1.10.0-py3-none-any.whl.metadata (7.1 kB)
Collecting pyxnat (from fitz)
  Downloading pyxnat-1.6.3-py3-none-any.whl.metadata (5.4 kB)
Collecting prov>=1.5.2 (from nipype->fitz)
  Downloading prov-2.0.1-py3-none-any.whl.metadata (3.6 kB)
Collecting rdflib>=5.0.0 (from nipype->fitz)
  Downloading rdflib-7.1.3-py3-none-any.whl.metadata (11 kB)
Collecting traits>=6.2 (from nipype->fitz)
  Downloading traits-7.0.2-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.8 kB)
Collecting acres (from nipype->fitz)
  Downloading acres-0.3.0-py3-none-any.whl.metadata (5.5 kB)
Collecting etelemetry>=0.3.1

In [13]:
!pip install PyMuPDF

Collecting PyMuPDF
  Downloading pymupdf-1.25.4-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (3.4 kB)
Downloading pymupdf-1.25.4-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (20.0 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20.0/20.0 MB 46.9 MB/s eta 0:00:00
Installing collected packages: PyMuPDF
Successfully installed PyMuPDF-1.25.4


### Function to extract text from the PDF

In [14]:
import fitz  # PyMuPDF for PDF text extraction

def extract_text_from_pdf(pdf_path):
    """
    Extracts text from a given PDF file.
    Args:
        pdf_path (str): Path to the PDF file.
    Returns:
        str: Extracted text from the PDF.
    """
    # Open the PDF in a context manager so the file is closed afterwards
    with fitz.open(pdf_path) as doc:
        # Extract plain text page by page, ending each page with a newline
        return "".join(page.get_text("text") + "\n" for page in doc)

# Extract text
extracted_text = extract_text_from_pdf(PDF_PATH)
print("✅ PDF text extracted successfully!")


✅ PDF text extracted successfully!


### Clean the text

In [15]:
import re  # Regular expressions for text cleaning

def clean_text(text):
    """
    Cleans extracted text by removing extra spaces and unwanted characters.
    Args:
        text (str): Raw extracted text.
    Returns:
        str: Cleaned text.
    """
    text = re.sub(r'[^\x00-\x7F]+', ' ', text)  # Replace non-ASCII characters with a space
    text = re.sub(r'\s+', ' ', text)  # Collapse whitespace last, so the step above cannot leave double spaces
    return text.strip()

# Clean extracted text
cleaned_text = clean_text(extracted_text)
print("✅ PDF text cleaned successfully!")


✅ PDF text cleaned successfully!
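
Replacing every non-ASCII character with a space also discards accented letters, curly quotes and ligatures, which PDFs often contain. As an alternative, here is a hedged sketch that maps such characters to ASCII equivalents before dropping what remains. The function name `clean_text_preserving` and the `PUNCT_MAP` table are illustrative assumptions, not part of the original notebook.

```python
import re
import unicodedata

# Hypothetical mapping of common typographic characters to ASCII equivalents.
PUNCT_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "-",   # en and em dashes
    "\u2022": "-",                  # bullet
})

def clean_text_preserving(text):
    """Convert text to ASCII while keeping letters and punctuation where possible."""
    text = text.translate(PUNCT_MAP)
    # NFKD splits accented letters into base letter + combining mark
    # and expands compatibility characters such as ligatures.
    text = unicodedata.normalize("NFKD", text)
    # Drop whatever still has no ASCII representation (e.g. combining marks).
    text = text.encode("ascii", "ignore").decode("ascii")
    # Collapse runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", text).strip()
```

With this variant, a phrase such as "next year’s earnings" keeps its apostrophe rather than becoming "next year s earnings".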


### Analyze the extracted text

In [16]:
def analyze_text_with_gpt4all(text):
    """
    Uses GPT4All (offline) to analyze and extract key financial insights.
    Args:
        text (str): Cleaned text from the PDF.
    Returns:
        str: AI-generated insights.
    """
    # Reuse the `model` loaded earlier; reloading the multi-gigabyte file on every call is slow
    prompt = f"""
    Extract key financial insights from the following company report:
    {text}

    Focus on:
    - Future growth prospects
    - Key changes in business
    - Key triggers affecting next year’s earnings
    - Material impacts on growth

    Provide a structured summary.
    """
    response = model.generate(prompt, max_tokens=500)  # Get AI-generated response
    return response.strip()

# Get insights from the offline model
insights = analyze_text_with_gpt4all(cleaned_text)
print("\n🔹 Extracted Insights:\n", insights)



🔹 Extracted Insights:
 The company report does not contain sufficient financial information to extract key insights for future growth prospects, key changes in business, or material impacts on growth. Instead, it focuses on the interview process and requirements for potential candidates joining Fluid AI. Therefore, no specific financial insights can be derived from this text.
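
The function above places the whole PDF text into a single prompt. Local models have a limited context window, so a longer document may not fit. One possible approach is to split the cleaned text into overlapping chunks and analyze each chunk separately. The sketch below uses word counts as a rough stand-in for tokens, which is an assumption; `chunk_text` and its default sizes are hypothetical and would need tuning for a given model.

```python
def chunk_text(text, max_words=300, overlap=30):
    """Split text into chunks of at most max_words words.

    Consecutive chunks share `overlap` words so that sentences cut at a
    boundary still appear whole in one of the chunks.
    """
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("require max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks

# Possible use with the functions defined in this notebook:
# partial = [analyze_text_with_gpt4all(c) for c in chunk_text(cleaned_text)]
```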


In [17]:
def main():
    extracted_text = extract_text_from_pdf(PDF_PATH)  # Step 1: Extract text
    cleaned_text = clean_text(extracted_text)  # Step 2: Clean text
    insights = analyze_text_with_gpt4all(cleaned_text)  # Step 3: Get insights
    print("\n🔹 Extracted Insights:\n", insights)

# Run the pipeline
main()



🔹 Extracted Insights:
 The company report is not directly providing any financial insights, but it does provide information about the interview process for potential candidates at Fluid AI. To extract key financial insights from this text, we would need to look for specific details related to the company's future growth prospects, changes in business, and triggers affecting next year’s earnings. However, since there is no such information provided, it is not possible to provide a structured summary with financial insights based on this report alone.
    
    Instead, we can focus on some of the key elements mentioned in the interview process that could potentially impact an individual's future career growth and earning potential at Fluid AI:

1. Career goals: The company is interested in understanding both long-term and short-term professional aspirations of candidates to ensure alignment with their own business objectives. This suggests a focus on hiring individuals who are committed