<a href="https://colab.research.google.com/github/DevendraTomar/llm-learnings/blob/main/PDF_Extraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Overview

This notebook demonstrates how to extract information about used cars from a PDF document using LangChain and Groq. It leverages the PyPDFLoader to load the document, ChatGroq as the language model, and a structured output schema (VehicleDetails) to organize the extracted data.

# Load Secrets

In [None]:
from google.colab import userdata
import os
# Only the Groq key is used in this notebook; it is read from Colab's secret store.
os.environ["GROQ_API_KEY"]=userdata.get('GROQ_API_KEY')
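
The cell above only works inside Colab, because `google.colab.userdata` is not available elsewhere. As a minimal sketch for other environments, the hypothetical helper `load_secret` below (not part of the original notebook) reuses a value that is already set and otherwise asks for it interactively:

```python
import os
from getpass import getpass


def load_secret(name, environ=os.environ, ask=getpass):
    """Return the secret `name`, asking for it only when it is not already set."""
    value = environ.get(name)
    if not value:
        # Prompt without echoing the key, then cache it for later cells.
        value = ask(f"Enter {name}: ")
        environ[name] = value
    return value
```

Calling `load_secret("GROQ_API_KEY")` once before creating the model would then leave the key in `os.environ`, which is where `ChatGroq` looks for it.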

# Install Dependencies

In [None]:
!pip install -qU pypdf langchain_community langchain-groq




# Load External Data Using Document Loader

In [None]:
from langchain_community.document_loaders import PyPDFLoader


In [None]:
file_path="/content/cars24.pdf"
loader = PyPDFLoader(file_path)


In [None]:
doc=loader.load()

In [None]:
doc[0].page_content

'Know your car\nRegularly Serviced\nServiced every 10,000 km atauthorised service centre\nCity Driven\nCars driven for shorter trips incities\nReg Year\nAug 2010\nMake Year\n2010\nRegNumber\nBR01-AX6549\nEngineCapacity\n1197 cc\nInsurance \nComprehensive,Valid till Aug -2024\nSpare key\nYes\nTransmission\nManual\nKM Driven\n44,637 km\nOwnership\n1st owner\nFuel Type\nPetrol\nInspection Report\nWe aim to\nprovide our\ncustomers\nwith a reliable\ndrive. Every\ncar we sell is\nrefurbished by\nexperts at our\nMega\nRefurbishment\nLabs.\nNon Accidental Non Tampered Non Flooded 140 Quality Checks\nOVERVIEW •  12 EXTERIOR •  11 INTERIOR •  11 IMPERFECTIONS •  18\nImperfections 18\nMinor cosmetic imperfections are not repaired as they do not affectperformance, and reduces the cost of ownership\nRepainted Parts 13\nSome parts have been repainted for better aesthetics. However, we assurethe car is non-accidental\nPerfect Parts\nThoroughly tested and ready for the road as per CARS24 Quality Promi
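
In the extracted text, each label (for example `Reg Year`) is followed on the next line by its value. The sketch below pairs labels with values without using a model, which may be useful as a quick sanity check on the LLM output later. The helper `pair_labels` and the abridged `sample` string are illustrative assumptions, not part of the original notebook:

```python
def pair_labels(text, labels):
    """Map each known label to the line that follows it in newline-separated text."""
    lines = [line.strip() for line in text.split("\n")]
    found = {}
    for i, line in enumerate(lines[:-1]):
        # Keep only the first occurrence of each label.
        if line in labels and line not in found:
            found[line] = lines[i + 1]
    return found


# Abridged excerpt of the page content shown above.
sample = "Reg Year\nAug 2010\nMake Year\n2010\nRegNumber\nBR01-AX6549\nTransmission\nManual"
pairs = pair_labels(sample, {"Reg Year", "Make Year", "RegNumber", "Transmission"})
print(pairs)
```

This only works while the label list is known in advance and the layout stays label-then-value, which is why the notebook relies on the language model for the actual extraction.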

# Extract Content

In [None]:
# ChatGroq reads the API key from the GROQ_API_KEY environment variable set above
from langchain_groq import ChatGroq

llm = ChatGroq(model="llama-3.3-70b-versatile")


In [None]:
# Plain string template; {doc} is filled in with str.format() before the call
prompt = """You are an expert in used cars. You will be given a document from a user who is selling a used car.
Extract the vehicle attributes from the following document:

{doc}"""



In [None]:
from typing import Optional
from pydantic import BaseModel, Field

class VehicleDetails(BaseModel):
    reg_year: str = Field(..., description="The year in which the vehicle was officially registered.")
    make_year: Optional[str] = Field(None, description="The year the vehicle was manufactured or made by the manufacturer.")
    reg_number: str = Field(..., description="The unique registration number assigned to the vehicle by the registration authority.")
    engine_capacity: Optional[str] = Field(None, description="The displacement of the vehicle's engine, typically measured in liters (L) or cubic centimeters (cc).")
    cylinders: Optional[str] = Field(None, description="The number of cylinders present in the vehicle's engine.")
    insurance: Optional[str] = Field(None, description="Indicates the vehicle's insurance status, which could include details like the policy number or expiration date.")
    spare_key: Optional[str] = Field(None, description="Indicates whether the vehicle comes with a spare key.")
    transmission: Optional[str] = Field(None, description="The type of transmission the vehicle uses, such as automatic or manual.")
    km_driven: Optional[int] = Field(None, description="The total number of kilometers the vehicle has been driven, reflecting its usage and wear.")
    ownership: Optional[str] = Field(None, description="The number of owners the vehicle has had, which can indicate the vehicle's history.")
    fuel_type: Optional[str] = Field(None, description="The type of fuel the vehicle uses, such as petrol, diesel, or electric.")
    price: Optional[str] = Field(None, description="The price of the vehicle, which could be the current value or the listed sale price.")
    emi: Optional[str] = Field(None, description="The equated monthly installment (EMI) for financing the vehicle, if applicable.")
    no_seats: Optional[str] = Field(None, description="The number of seats available in the vehicle for passengers.")
    boot_space: Optional[str] = Field(None, description="The available boot or trunk space in the vehicle, usually measured in liters.")
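
`km_driven` is the only field typed as `int`, while the source document writes the reading as `44,637 km`. The outputs in this notebook show it both as `44637` and as `44,637 km`. As a minimal sketch, the hypothetical helper `parse_km` below normalizes either form. It assumes the reading is a plain digit string with optional separators and units, so a value such as `44.6k km` would not be handled correctly:

```python
import re


def parse_km(value):
    """Convert a reading such as '44,637 km' to an int; pass ints and None through."""
    if value is None or isinstance(value, int):
        return value
    # Drop everything that is not a digit (thousands separators, units, spaces).
    digits = re.sub(r"[^\d]", "", str(value))
    return int(digits) if digits else None


print(parse_km("44,637 km"))
```

In Pydantic v2, a function like this could be attached to the field with `field_validator("km_driven", mode="before")`, so the conversion runs before type validation.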


In [None]:
resp=llm.with_structured_output(VehicleDetails).invoke(prompt.format(doc=doc[0].page_content))
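
The call above sends only the first page, `doc[0]`. If the attributes are spread over several pages, one option is to concatenate the page texts before formatting the prompt. The sketch below uses a hypothetical helper `join_pages` and stand-in objects in place of the `Document` instances returned by `loader.load()`. It assumes the combined text still fits in the model's context window:

```python
from types import SimpleNamespace


def join_pages(pages, separator="\n\n"):
    """Concatenate the text of every loaded page into one string."""
    return separator.join(page.page_content for page in pages)


# Stand-ins for the Document objects returned by loader.load().
pages = [
    SimpleNamespace(page_content="Reg Year\nAug 2010"),
    SimpleNamespace(page_content="Fuel Type\nPetrol"),
]
full_text = join_pages(pages)
print(full_text)
```

With the real data, this would be used as `prompt.format(doc=join_pages(doc))`.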

In [None]:
# model_dump() replaces the deprecated dict() in Pydantic v2
for k, v in resp.model_dump().items():
    print(k, v)

reg_year 2010
make_year 2010
reg_number BR01-AX6549
engine_capacity 1197 cc
cylinders None
insurance Comprehensive, Valid till Aug-2024
spare_key Yes
transmission Manual
km_driven 44637
ownership 1st owner
fuel_type Petrol
price ₹1.48 Lakh
emi ₹2,893/month
no_seats None
boot_space None


# Load PySpark

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

# Create the SparkSession once, with Hive support.
# getOrCreate() returns any session that already exists, so static settings
# such as Hive support would not be applied to a session built earlier.
spark = (
    SparkSession.builder
    .appName("VehicleDetails")
    .enableHiveSupport()
    .getOrCreate()
)

# Define the schema for the VehicleDetails table
vehicle_details_schema = StructType([
    StructField("reg_year", StringType(), True),            # Year of registration (Optional)
    StructField("make_year", StringType(), True),           # Year of manufacture (Optional)
    StructField("reg_number", StringType(), True),          # Registration number (Optional)
    StructField("engine_capacity", StringType(), True),     # Engine capacity (Optional)
    StructField("cylinders", StringType(), True),           # Number of cylinders (Optional)
    StructField("insurance", StringType(), True),           # Insurance details (Optional)
    StructField("spare_key", StringType(), True),           # Spare key available (Optional)
    StructField("transmission", StringType(), True),        # Transmission type (Optional)
    StructField("km_driven", StringType(), True),           # Kilometers driven (Optional)
    StructField("ownership", StringType(), True),           # Ownership history (Optional)
    StructField("fuel_type", StringType(), True),           # Fuel type (Optional)
    StructField("price", StringType(), True),               # Price (Optional)
    StructField("emi", StringType(), True),                 # EMI details (Optional)
    StructField("no_seats", StringType(), True),            # Number of seats (Optional)
    StructField("boot_space", StringType(), True)           # Boot space (Optional)
])
# model_dump() replaces the deprecated dict() in Pydantic v2
df = spark.createDataFrame([resp.model_dump()], schema=vehicle_details_schema)


In [None]:
df.show()

+--------+---------+-----------+---------------+---------+--------------------+---------+------------+---------+---------+---------+----------+------------+--------+----------+
|reg_year|make_year| reg_number|engine_capacity|cylinders|           insurance|spare_key|transmission|km_driven|ownership|fuel_type|     price|         emi|no_seats|boot_space|
+--------+---------+-----------+---------------+---------+--------------------+---------+------------+---------+---------+---------+----------+------------+--------+----------+
|    2010|     2010|BR01-AX6549|        1197 cc|     NULL|Comprehensive, Va...|      Yes|      Manual|44,637 km|1st owner|   Petrol|₹1.48 Lakh|₹2,893/month|    NULL|      NULL|
+--------+---------+-----------+---------------+---------+--------------------+---------+------------+---------+---------+---------+----------+------------+--------+----------+
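
The Spark schema declares every column as `StringType`, while `VehicleDetails` types `km_driven` as `Optional[int]`. As a minimal sketch, the hypothetical helper `stringify_row` below makes the conversion explicit before the row is handed to `spark.createDataFrame`, rather than relying on implicit handling. The sample record is illustrative only:

```python
def stringify_row(record):
    """Cast every non-null value to str so the row matches an all-string schema."""
    return {key: None if value is None else str(value) for key, value in record.items()}


row = stringify_row({"reg_year": "2010", "km_driven": 44637, "cylinders": None})
print(row)
```

With the real model, the row would be built as `stringify_row(resp.model_dump())`. An alternative design would be to declare `km_driven` as an integer column in the schema and keep the numeric type.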

