<a href="https://colab.research.google.com/github/RakshithVellulla/ModelI-O-Company-Profile-Extraction-using-LangChain-Cohere/blob/main/ModelI_O.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Step 1: Install required packages for the LangChain exercise

# Install OpenAI SDK (for interacting with OpenAI models)
!pip install openai

# Install core LangChain library (Model I/O, output parsers, prompts, etc.)
!pip install langchain

# Install LangChain community integrations (Wikipedia loader lives here)
!pip install langchain-community

# Install Cohere integration for LangChain (used if Cohere models are selected)
!pip install langchain-cohere

# Install Wikipedia Python package (used by Wikipedia document loader)
!pip install wikipedia




In [None]:
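# Step 2: Load the Cohere API key from Colab secrets

import os
from google.colab import userdata

# "ModelI/O" is the name of the Colab secret that stores the Cohere API key
os.environ["COHERE_API_KEY"] = userdata.get("ModelI/O")
print("COHERE_API_KEY loaded:", "COHERE_API_KEY" in os.environ)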


COHERE_API_KEY loaded: True
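
The cell above only works inside Colab, because `google.colab` is not importable elsewhere. A minimal sketch of a more portable lookup, assuming the key may already be exported as an environment variable; the helper name `load_api_key` is hypothetical and not part of the notebook:

```python
import os


def load_api_key(secret_name, env_var="COHERE_API_KEY"):
    """Return an API key from the environment or, failing that, Colab secrets.

    `load_api_key` is a hypothetical helper for illustration only.
    Returns None when the key cannot be found in either place.
    """
    # Prefer a key that is already exported in the environment.
    value = os.environ.get(env_var)
    if value:
        return value
    try:
        # Only importable inside a Colab runtime.
        from google.colab import userdata
    except ImportError:
        return None
    try:
        return userdata.get(secret_name)
    except Exception:
        # Assumption: a missing or inaccessible secret surfaces as an exception.
        return None


key = load_api_key("ModelI/O")
if key:
    os.environ["COHERE_API_KEY"] = key
print("COHERE_API_KEY loaded:", bool(key))
```

Unlike the membership check in the original cell, `bool(key)` reports `False` when no usable key was found.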


In [None]:
# Step 3: Initialize Cohere language model (correct versioned model)

from langchain_cohere import ChatCohere

llm_cohere = ChatCohere(
    model="command-r-08-2024",  # Versioned Cohere model (required)
    temperature=0.3             # Low temperature for stable, parseable output
)

# Sanity test
response = llm_cohere.invoke("Say hello in one short sentence.")
print(response)


content='Hello there!' additional_kwargs={'id': '172fdb34-1641-472a-8fc0-0e18862d335c', 'finish_reason': 'COMPLETE', 'content': 'Hello there!', 'token_count': {'input_tokens': 207.0, 'output_tokens': 3.0}} response_metadata={'id': '172fdb34-1641-472a-8fc0-0e18862d335c', 'finish_reason': 'COMPLETE', 'content': 'Hello there!', 'token_count': {'input_tokens': 207.0, 'output_tokens': 3.0}} id='lc_run--019bfb94-51a4-7680-bf71-11bac7c2be0a-0' tool_calls=[] invalid_tool_calls=[] usage_metadata={'input_tokens': 207, 'output_tokens': 3, 'total_tokens': 210}


In [None]:
# Step 4: Load company data from Wikipedia

# Import the Wikipedia document loader
from langchain_community.document_loaders import WikipediaLoader

# Ask user for company name (can hardcode for testing)
company_name = input("Enter company name: ")

# Initialize WikipediaLoader
# load_max_docs=1 ensures we only fetch the main article
wiki_loader = WikipediaLoader(
    query=company_name,
    load_max_docs=1
)

# Load Wikipedia content
documents = wiki_loader.load()

# Basic verification
print(f"Number of documents loaded: {len(documents)}")
print("Metadata of first document:", documents[0].metadata)


Enter company name: google
Number of documents loaded: 1
Metadata of first document: {'title': 'Google', 'summary': 'Google LLC ( , GOO-gəl) is an American multinational technology corporation focused on information technology, online advertising, search engine technology, email, cloud computing, software, quantum computing, e-commerce, consumer electronics, and artificial intelligence (AI). It has been referred to as "the most powerful company in the world" by BBC, and is one of the world\'s most valuable brands. Google\'s parent company Alphabet Inc. has been described as a Big Tech company. \nGoogle was founded on September 4, 1998, by American computer scientists Larry Page and Sergey Brin. Together, they own about 14% of its publicly listed shares and control 56% of its stockholder voting power through super-voting stock. The company went public via an initial public offering (IPO) in 2004. In 2015, Google was reorganized as a wholly owned subsidiary of Alphabet Inc. Google is Al
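
`documents[0]` raises an `IndexError` when the query matches no article. A small guard, sketched here with a stand-in object rather than real loader output (the helper name `first_page_text` and the `max_chars` cap are illustrative assumptions), fails with a clearer message and can trim very long pages before they reach the prompt:

```python
from types import SimpleNamespace


def first_page_text(documents, max_chars=None):
    """Return the text of the first document, optionally truncated.

    Hypothetical helper: raises ValueError instead of IndexError
    when the loader returned nothing.
    """
    if not documents:
        raise ValueError("No Wikipedia article found for this query.")
    text = documents[0].page_content
    return text if max_chars is None else text[:max_chars]


# Stand-in for loader output: any object with a `page_content` attribute works.
fake_docs = [SimpleNamespace(page_content="Google LLC is an American company.")]
print(first_page_text(fake_docs, max_chars=10))
```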

In [None]:
# Step 5.1: Define output schema using Pydantic

from pydantic import BaseModel, Field
from typing import List, Optional


5 - Define Output Parsing Schema:
Use Pydantic to define the schema for the desired output and create a custom output parser.

In [None]:
class CompanyProfile(BaseModel):
    """
    Schema that defines the exact structure
    we expect from the LLM output.
    """

    company_name: str = Field(
        description="Official name of the company"
    )

    # default=None makes these fields truly optional: the model may omit them
    founder: Optional[str] = Field(
        default=None,
        description="Founder or founders of the company"
    )

    founded_date: Optional[str] = Field(
        default=None,
        description="Year or full date when the company was founded"
    )

    revenue: Optional[str] = Field(
        default=None,
        description="Latest known revenue of the company"
    )

    employees: Optional[str] = Field(
        default=None,
        description="Approximate number of employees"
    )

    summary: List[str] = Field(
        description="A concise 4-line summary of the company"
    )



In [None]:
# Step 5.2: Create a Pydantic output parser (correct for current LangChain)

from langchain_core.output_parsers import PydanticOutputParser

parser = PydanticOutputParser(
    pydantic_object=CompanyProfile
)


In [None]:
print(PydanticOutputParser)


<class 'langchain_core.output_parsers.pydantic.PydanticOutputParser'>


6 - Create a Prompt Template:

In [None]:
# Step 6.1: Import the prompt template class

from langchain_core.prompts import PromptTemplate


In [None]:
# Step 6.2: Create the prompt template for company extraction

prompt = PromptTemplate(
    template="""
You are an information extraction assistant.

Using the Wikipedia content below, extract the following details about the company:
- company_name
- founder
- founded_date
- revenue
- employees
- summary (exactly 4 concise lines)

Wikipedia Content:
{company_text}

{format_instructions}
""",
    input_variables=["company_text"],
    partial_variables={
        "format_instructions": parser.get_format_instructions()
    }
)
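
`partial_variables` pre-fills `format_instructions` when the template is built, so callers only supply `company_text` at invoke time. The same idea in plain Python, as a rough analogy using `str.format` and `functools.partial` (this is not how LangChain implements it, and the shortened template and instruction string are placeholders):

```python
from functools import partial

# Shortened placeholder template with the same two variables as the notebook's prompt.
TEMPLATE = (
    "Extract company details from the text below.\n\n"
    "Wikipedia Content:\n{company_text}\n\n"
    "{format_instructions}\n"
)

# Bind the instructions once, up front (placeholder text, not the parser's real output).
render = partial(TEMPLATE.format, format_instructions="Return a JSON object.")

# Later, only the remaining variable is needed.
prompt_text = render(company_text="Google was founded in 1998.")
print(prompt_text)
```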


7 - Initialize the Language Model Chain:
Set up the language model chain by piping the defined prompt into the language model (OpenAI or Cohere; this notebook uses Cohere).

In [None]:
# Step 7.1: Create the language model chain using runnable syntax

chain = prompt | llm_cohere
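
The `|` operator composes the two steps so that the prompt's output becomes the model's input. A toy illustration of that idea using Python's `__or__`; the `Step` class is invented for this sketch and greatly simplifies LangChain's runnable interface:

```python
class Step:
    """Toy stand-in for a runnable: wraps a function and supports `|`."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other):
        # Run self first, then feed its result to `other`.
        return Step(lambda value: other.invoke(self.invoke(value)))


# Pretend prompt: turns the input dict into a prompt string.
fill_prompt = Step(lambda inputs: f"Summarize: {inputs['company_text']}")
# Pretend LLM: just upper-cases whatever prompt it receives.
fake_model = Step(lambda prompt_text: prompt_text.upper())

toy_chain = fill_prompt | fake_model
print(toy_chain.invoke({"company_text": "Google"}))
```

Appending the parser as a third step (`prompt | llm_cohere | parser`) is a common LangChain pattern that would return a `CompanyProfile` directly; this notebook instead parses in a separate cell so the raw output can be inspected first.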


In [None]:
# Step 7.2: Execute the chain with Wikipedia text

raw_output = chain.invoke(
    {"company_text": documents[0].page_content}
)

print(raw_output)


content='```json\n{\n    "company_name": "Google LLC",\n    "founder": "Larry Page and Sergey Brin",\n    "founded_date": "September 4, 1998",\n    "revenue": null,\n    "employees": null,\n    "summary": [\n        "Google is an American tech giant, offering a wide range of services and products.",\n        "Its core services include search, email, cloud computing, and artificial intelligence.",\n        "The company has expanded into various industries, from consumer electronics to quantum computing.",\n        "Google\'s influence and market share make it one of the most powerful and valuable brands globally."\n    ]\n}\n```' additional_kwargs={'id': '06da6d0f-49b2-4daf-8c70-5c04aed4a74b', 'finish_reason': 'COMPLETE', 'content': '```json\n{\n    "company_name": "Google LLC",\n    "founder": "Larry Page and Sergey Brin",\n    "founded_date": "September 4, 1998",\n    "revenue": null,\n    "employees": null,\n    "summary": [\n        "Google is an American tech giant, offering a wide

8 - Invoke the Chain and Fetch Results:
Invoke the chain with the Wikipedia page content (the format instructions are already bound to the prompt as a partial variable), then parse the response and print the results.
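
The model sometimes wraps its JSON in a Markdown code fence, as in the Step 7.2 output, and sometimes returns bare JSON. A rough standard-library sketch of handling both cases before validation; `extract_json` is a hypothetical helper, and the notebook itself relies on `parser.parse` for this step:

```python
import json
import re

FENCE = "`" * 3  # three backticks, built programmatically for readability
# Optional "json" language tag, then the payload, then the closing fence.
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE, re.DOTALL)


def extract_json(text):
    """Parse JSON from model output, with or without a Markdown code fence."""
    match = FENCED_BLOCK.search(text)
    payload = match.group(1) if match else text
    return json.loads(payload.strip())


# Shortened sample payloads modeled on the notebook's outputs.
fenced = FENCE + 'json\n{"company_name": "Google LLC", "revenue": null}\n' + FENCE
bare = '{"company_name": "Google LLC", "revenue": null}'

print(extract_json(fenced))
print(extract_json(bare))
```

Both calls yield the same dictionary, with JSON `null` mapped to Python `None`, which is what the `Optional[str]` fields in `CompanyProfile` accept.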

In [None]:
# Step 8.1: Invoke the chain with Wikipedia content

raw_output = chain.invoke(
    {"company_text": documents[0].page_content}
)

print("Raw LLM output:\n", raw_output)


Raw LLM output:
 content='{\n    "company_name": "Google LLC",\n    "founder": "Larry Page and Sergey Brin",\n    "founded_date": "September 4, 1998",\n    "revenue": "Not specified in the provided text",\n    "employees": "Not mentioned",\n    "summary": [\n        "Google LLC is an American tech giant with a diverse portfolio.",\n        "Its core services include search, advertising, and cloud computing.",\n        "The company has expanded into various industries, from email to AI.",\n        "Google\'s parent, Alphabet Inc., owns and controls its operations."\n    ]\n}' additional_kwargs={'id': '4ed9f8d8-4efa-4b58-adcd-4bc9ec92fc0e', 'finish_reason': 'COMPLETE', 'content': '{\n    "company_name": "Google LLC",\n    "founder": "Larry Page and Sergey Brin",\n    "founded_date": "September 4, 1998",\n    "revenue": "Not specified in the provided text",\n    "employees": "Not mentioned",\n    "summary": [\n        "Google LLC is an American tech giant with a diverse portfolio.",\n    

In [None]:
# Step 8.2: Parse the raw output into a Pydantic object

company_profile = parser.parse(raw_output.content)



In [None]:
# Step 8.1: Invoke the chain
raw_output = chain.invoke(
    {"company_text": documents[0].page_content}
)

# Inspect raw output (optional but useful)
print("Raw LLM output type:", type(raw_output))
print("Raw LLM output content:\n", raw_output.content)

# Step 8.2: Parse the raw text output
company_profile = parser.parse(raw_output.content)

# Step 8.3: Print structured result
print("\n--- Company Profile ---")
print(f"Company Name : {company_profile.company_name}")
print(f"Founder      : {company_profile.founder}")
print(f"Founded Date : {company_profile.founded_date}")
print(f"Revenue      : {company_profile.revenue}")
print(f"Employees    : {company_profile.employees}")

print("\nSummary:")
for i, line in enumerate(company_profile.summary, start=1):
    print(f"{i}. {line}")


Raw LLM output type: <class 'langchain_core.messages.ai.AIMessage'>
Raw LLM output content:
 {
    "company_name": "Google LLC",
    "founder": "Larry Page and Sergey Brin",
    "founded_date": "September 4, 1998",
    "revenue": "Unknown, but as of January 2022, Google was ranked second by Forbes as one of the most valuable brands.",
    "employees": "Unknown, but as of 2015, Google had a large number of employees, and it is the largest subsidiary of Alphabet Inc.",
    "summary": [
        "Google LLC is an American multinational technology corporation.",
        "It offers a wide range of products and services, including search engines, email, cloud computing, and AI.",
        "The company was founded by Larry Page and Sergey Brin in 1998 and has since become a dominant force in the tech industry.",
        "Google's parent company, Alphabet Inc., owns and controls a significant portion of its shares and voting power."
    ]
}

--- Company Profile ---
Company Name : Google LLC
Foun