<a href="https://colab.research.google.com/github/SanyamSwami123/make-more-series-andrej-karpathy/blob/main/ExtractText_FromPDF_Using_Langchain.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Explanation:** LangChain is a framework for developing applications that integrate with large language models (LLMs) and related language tooling. It provides abstractions and utilities for building workflows that involve text generation, document analysis, and more.

**In short:** LangChain is a framework for integrating and managing LLMs within workflows and applications.

In [5]:
#1. install required packages
!pip install langchain pdfplumber transformers langchain_huggingface langchain-community faiss-cpu

Collecting langchain_huggingface
  Downloading langchain_huggingface-0.1.0-py3-none-any.whl.metadata (1.3 kB)
Collecting sentence-transformers>=2.6.0 (from langchain_huggingface)
  Downloading sentence_transformers-3.1.0-py3-none-any.whl.metadata (23 kB)
Downloading langchain_huggingface-0.1.0-py3-none-any.whl (20 kB)
Downloading sentence_transformers-3.1.0-py3-none-any.whl (249 kB)
Installing collected packages: sentence-transformers, langchain_huggingface
Successfully installed langchain_huggingface-0.1.0 sentence-transformers-3.1.0


In [52]:
import pdfplumber
from transformers import pipeline, GPT2Tokenizer, GPT2LMHeadModel
from langchain.prompts import PromptTemplate

# Extract text from PDF
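def extract_text_from_pdf(pdf_path):
    """Return the text of every page in the PDF, separated by newlines."""
    page_texts = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # extract_text() can return None for pages without a text layer
            page_texts.append(page.extract_text() or "")
    # Join with newlines so words at page boundaries do not run together
    return "\n".join(page_texts)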

# Define the prompt template
# prompt_template = PromptTemplate(
#     input_variables=["text"],
#     template="""
#     Extract the following details from the W-2 form text:

#     - Employee's social security number
#     - Employer identification number
#     - Wages, tips, other compensation
#     - Federal income tax withheld
#     - Social security wages
#     - Social security tax withheld
#     - Medicare wages and tips
#     - Medicare tax withheld
#     - Employee's first name and initial
#     - Employee's last name
#     - Employer's name, address, and ZIP code



#     Output format:
#     - Employee's social security number:
#     - Employer identification number:
#     - Wages, tips, other compensation:
#     - Federal income tax withheld:
#     - Social security wages:
#     - Social security tax withheld:
#     - Medicare wages and tips:
#     - Medicare tax withheld:
#     - Employee's first name and initial:
#     - Employee's last name:
#     - Employer's name, address, and ZIP code:
#     """
# )

prompt_template = PromptTemplate(
    input_variables=["text"],
    template="""
    From the following W-2 form text, extract the Employee's first name and initial and last name.

    Output format:
    - Employee's first name and initial:
    - Last name:
    """
)
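# Note: the active template above has no {text} placeholder, so the PDF text
# passed to format() is never inserted into the prompt. To send the text to
# the model, add a "Text: {text}" section to the template.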
# Initialize the text generation pipeline from Hugging Face
model_name = "gpt2"  # You can use another model if needed
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)
text_generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

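def generate_text(prompt):
    # Tokenize the prompt, truncating it to 512 tokens so that the prompt and
    # the generated text together fit in GPT-2's 1024-token context window
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
    # max_length counts prompt tokens plus generated tokens
    outputs = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=1024,
        num_return_sequences=1,
        pad_token_id=tokenizer.eos_token_id,
    )
    # The decoded output starts with the prompt itself, followed by the generated text
    return tokenizer.decode(outputs[0], skip_special_tokens=True)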

# Extract information using the model
def extract_information_from_text(text):
    try:
        # Generate text using the model
        prompt = prompt_template.format(text=text)
        response = generate_text(prompt)
        return response
    except Exception as e:
        print(f"Error during text generation: {e}")
        return None

# Example usage
pdf_path = "/content/W2_XL_input_clean_2999.pdf"  # Replace with your PDF file path
text = extract_text_from_pdf(pdf_path)
extracted_info = extract_information_from_text(text)
print(extracted_info)



    From the following W-2 form text, extract the Employee's first name and initial and last name.

    Output format:
    - Employee's first name and initial: 
    - Last name:
     - Company:

----------------------------------------------------


The following text is from (W-2 format):

------------------------------------------------------------------------------


1.1.1.3 Description

------------------------------

This is a W-2 Form T-90 which forms the employee's first name and final name. The form is a copy of the Employee's W-1 Form T-90. The type of input is a two-line double quoted list, first and last names and then additional company and company, and final name (using the name of employee).


1.1.1.4 Description

---------------


1.1.1.5 Description

---------------


1.1.1.6 Description

-------------------------------

This form forms the employee's last name and first company and company, and final name of the worker with the company for the other companies.


1.2. 

## What we are doing:
1. **Reading a PDF file:**
- We have a file, such as a book or a form, and we need to look inside it and read the words.
- This part of the code takes the file, opens it, and pulls out all the text.
2. **Preparing the text for the model:**
- Imagine we want to ask a robot to find specific information in the text we just read. We need to give the robot a special question, or "prompt", that tells it what to look for.
- This prompt has a placeholder where we put our text.
3. **Using a smart robot to answer questions:**
- We have a smart robot (in our case, a model from **Hugging Face**) that can read text and answer questions based on it. We use this robot to generate answers from the text.
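
To make the placeholder idea in step 2 concrete, here is a minimal sketch that uses Python's built-in `str.format` as a stand-in for the prompt template (an assumption made to keep the example dependency-free; for a simple `{text}` placeholder, LangChain's `PromptTemplate.format(text=...)` fills it in the same way). The template strings, the `build_prompt` helper, and the sample text are all hypothetical. The sketch also shows that a template without the placeholder silently drops the text.

```python
# Hypothetical templates, for illustration only
TEMPLATE_WITH_PLACEHOLDER = (
    "From the following W-2 form text, extract the employee's name.\n\n"
    "Text:\n{text}\n\n"
    "Output format:\n"
    "- Employee's first name and initial:\n"
    "- Last name:\n"
)
TEMPLATE_WITHOUT_PLACEHOLDER = (
    "From the following W-2 form text, extract the employee's name.\n"
)


def build_prompt(template: str, text: str) -> str:
    """Fill the {text} placeholder in the template with the document text."""
    # Extra keyword arguments are ignored when the template has no {text}
    return template.format(text=text)


sample_text = "Employee: Jane Q Doe"  # made-up stand-in for extracted PDF text

prompt_ok = build_prompt(TEMPLATE_WITH_PLACEHOLDER, sample_text)
prompt_missing = build_prompt(TEMPLATE_WITHOUT_PLACEHOLDER, sample_text)

print(sample_text in prompt_ok)       # the text reaches the prompt
print(sample_text in prompt_missing)  # the text is silently dropped
```

If the model's answer seems unrelated to the document, checking whether the formatted prompt actually contains the document text is a quick first diagnostic.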


## Detailed steps:
1. Extracting text from a PDF file.
- `extract_text_from_pdf`: opens the PDF file, reads every page, and collects all the text from those pages.
- Steps: `open` > `read each page` > `collect and combine the text`
2. Creating a prompt template.
- `What it does`: sets up a template that tells the robot exactly what details to look for in the text we give it.
- `How it works`:
  - Define a template with a placeholder where our actual text will go.
  - The template outlines what details we want to extract.
3. Preparing the robot (model) to answer questions.
- `What it does`: sets up a smart robot that can generate text based on the prompt we give it.
- `How it works`:
  - Load the robot model and its "language" skills (the tokenizer).
  - Prepare it to read and generate text.
4. Generating answers with the robot.
- `What it does`: sends the prompt to the robot and gets back an answer.
- `How it works`:
  - Convert the prompt into a format the robot understands (tokens).
  - Generate a response from the robot.
  - Convert the response back into readable text.
5. Putting it all together.
- `extract_information_from_text`: uses the prompt template and the robot to extract information from the text we got from the PDF.
- `How it works`:
  - Format the text with the prompt template.
  - Generate and return the information using the robot.
6. Using everything.
- `What it does`: reads the PDF, extracts the text, and gets the information we need.
- `How it works`:
  - Read the PDF file.
  - Extract information from the text using the robot.
  - Print the result.
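
The way the steps above fit together can be sketched without loading a PDF or a model. In this hedged example the model is replaced by a plain function passed in as an argument, and the page texts are made up; `PROMPT`, `extract_information`, `echo_model`, and `failing_model` are hypothetical names used only for illustration, not part of the notebook or of any library.

```python
from typing import Callable, Optional

# Hypothetical template with a {text} placeholder
PROMPT = "Extract the employee's name from this W-2 text.\n\nText:\n{text}\n"


def extract_information(text: str, generate: Callable[[str], str]) -> Optional[str]:
    """Format the prompt with the document text and pass it to a text generator."""
    try:
        prompt = PROMPT.format(text=text)
        return generate(prompt)
    except Exception as exc:
        # Mirror the notebook's behavior: report the error and return None
        print(f"Error during text generation: {exc}")
        return None


def echo_model(prompt: str) -> str:
    """Stand-in for the model: returns the prompt followed by a fixed marker."""
    return prompt + "[generated text would follow here]"


def failing_model(prompt: str) -> str:
    """Stand-in for a model call that fails."""
    raise RuntimeError("model unavailable")


# Made-up page texts standing in for what a PDF reader would return
pages = ["Employee: Jane Q", "Doe"]
text = "\n".join(page or "" for page in pages)

result = extract_information(text, echo_model)
failed = extract_information(text, failing_model)
print(result)
print(failed)
```

Passing the generator in as a function keeps the prompt-building and error-handling logic testable on its own; in the notebook, the `generate_text` function plays that role.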