# Example: Extract email addresses from a webpage

Let's extract information from webpages using Unstructured.

https://python.langchain.com/docs/integrations/document_loaders/unstructured_file/#overview

In [1]:
import os
from dotenv import load_dotenv
load_dotenv()

True

In [7]:
from langchain_community.document_loaders import UnstructuredURLLoader

loader = UnstructuredURLLoader(
    urls=["https://maths.du.ac.in/faculty-profile/"]
)
docs = loader.load()


In [8]:
full_doc = "\n\n".join(doc.page_content for doc in docs)
print(full_doc)

DU





Menu Close

Home

About

Welcome

History

Campus

Seminar & Lecture Rooms

Research Scholar Room

Computer Lab

Committee Room

Library

Disabled-Friendly Campus

Clean and Green Campus

Ranking

Committees

Accomplishments

Department

Faculty

Students

Annual Reports

Gallery

Redressal Mechanisms

Brochure

Contact Us

People

Faculty Profile

Post-Doc Fellows

Ph.D. Scholars

M.Phil. Scholars

Supporting Staff

Tutors

Former Faculty

Former HODs

Research

Research Areas

Publications

Books Authored

Research Grants

Collaborations

Research Supervision

M.Phil. Awarded

Ph.D. Awarded

Academics

M.Sc. Programme

Ph.D. Programme

U.G. Curriculum

Academic Calendar

Time Tables

Examination and Results

Admissions

M.Sc. Admissions

Ph.D. Admissions

Resources

Library@DU

Forms

Useful Links

Previous Year Papers

News & Events @ Outside DU

Opportunities

Placement

Scholarships/Fellowships/Internships

Ad-hoc Panel

Events

Ph.D. Seminars

Colloquia/Workshops

Co-Curr

## No need for chunking and splitting

Since this is a very small page, we can pass the content of the entire page as context. There is no need for splitting and retrieval in this case.
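
Whether a page counts as "small" depends on the model's context window. The sketch below is a rough size check. The four-characters-per-token ratio and the 8,000-token budget are assumptions for illustration, not properties of the model used here, and `estimate_tokens` and `fits_in_context` are hypothetical helpers.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; the ratio is an assumed heuristic."""
    return int(len(text) / chars_per_token) + 1


def fits_in_context(text: str, budget_tokens: int = 8000) -> bool:
    """Return True if the estimated token count is within the assumed budget."""
    return estimate_tokens(text) <= budget_tokens


# Stand-in for full_doc so the sketch runs on its own.
sample = "Faculty Profile\n\n" * 100
print(estimate_tokens(sample), fits_in_context(sample))
```

In the notebook, `fits_in_context(full_doc)` could be called before deciding whether to skip chunking. If it returns `False`, splitting and retrieval become worth considering.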

In [9]:
#question = "Tell me about Randheer Singh?"
question = "Make a list of all email address."

In [10]:
# RAG prompt template
prompt_template = """You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: {question} 
Context: {context} 
Answer:"""

# Have the LLM read the prompt, analyse the page content, and generate a response

from langchain.chat_models import init_chat_model
model = init_chat_model("llama-3.3-70b-versatile", model_provider="groq")


response = model.invoke(prompt_template.format(
    context=full_doc,
    question=question))
    
print(response.content)

Here is the list of email addresses: 
1. tarukd@gmail.com
2. cslalitha1@gmail.com
3. rdasmsu@gmail.com
4. vambethkar@gmail.com, vambethkar@maths.du.ac.in
5. sachi_srivastava@yahoo.com, ssrivastava@maths.du.ac.in
6. lalit@maths.du.ac.in, lkumarvashisht@gmail.com
7. arvindpatelmath09@gmail.com
8. agaur@maths.du.ac.in
9. hemantksingh@maths.du.ac.in
10. rjain@maths.du.ac.in
11. sachinambariya@gmail.com
12. ag.nikita@gmail.com
13. azothansanga26@yahoo.com
14. anupama.panigrahi@gmail.com
15. pratimarai5@gmail.com
16. surendraiitr8@gmail.com
17. randheernsit@gmail.com
18. sumitnagpal.du@gmail.com
19. moghadma@gmail.com
20. ashrsdma@gmail.com
21. rkpanda@maths.du.ac.in
22. anuj.bshn@gmail.com
23. mrigendra154@gmail.com
24. akumar@maths.du.ac.in
25. head@maths.du.ac.in


In [11]:
from IPython.display import Markdown
Markdown(response.content)

Here is the list of email addresses: 
1. tarukd@gmail.com
2. cslalitha1@gmail.com
3. rdasmsu@gmail.com
4. vambethkar@gmail.com, vambethkar@maths.du.ac.in
5. sachi_srivastava@yahoo.com, ssrivastava@maths.du.ac.in
6. lalit@maths.du.ac.in, lkumarvashisht@gmail.com
7. arvindpatelmath09@gmail.com
8. agaur@maths.du.ac.in
9. hemantksingh@maths.du.ac.in
10. rjain@maths.du.ac.in
11. sachinambariya@gmail.com
12. ag.nikita@gmail.com
13. azothansanga26@yahoo.com
14. anupama.panigrahi@gmail.com
15. pratimarai5@gmail.com
16. surendraiitr8@gmail.com
17. randheernsit@gmail.com
18. sumitnagpal.du@gmail.com
19. moghadma@gmail.com
20. ashrsdma@gmail.com
21. rkpanda@maths.du.ac.in
22. anuj.bshn@gmail.com
23. mrigendra154@gmail.com
24. akumar@maths.du.ac.in
25. head@maths.du.ac.in
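
An LLM can silently drop or alter an address, so it may be worth cross-checking the answer against the page text. The sketch below uses a deliberately simple regular expression (an assumption; it does not cover every valid address form), and `find_emails` and `missing_from_answer` are hypothetical helper names.

```python
import re

# Simple pattern for common address shapes; an assumption, not a full RFC 5322 parser.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def find_emails(text: str) -> list[str]:
    """Return unique addresses in order of first appearance (case-insensitive)."""
    seen: dict[str, str] = {}
    for match in EMAIL_RE.findall(text):
        seen.setdefault(match.lower(), match)
    return list(seen.values())


def missing_from_answer(page_text: str, answer_text: str) -> list[str]:
    """List addresses found in the page text but absent from the model's answer."""
    in_answer = {email.lower() for email in find_emails(answer_text)}
    return [e for e in find_emails(page_text) if e.lower() not in in_answer]


# Stand-in strings so the sketch runs without the notebook state.
page = "Write to head@maths.du.ac.in or agaur@maths.du.ac.in."
answer = "1. head@maths.du.ac.in"
print(missing_from_answer(page, answer))
```

In the notebook, `missing_from_answer(full_doc, response.content)` would list any address that the pattern finds on the page but not in the model's answer.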

In [13]:
# Re-prompting on the same data to produce output in a different format.
# RAG prompt template
prompt_template = """You are an assistant and you are required to present the information in a desired format. Use the following pieces of context. The context contains a list of professors with their names, email addresses, designations, etc. Understand the task carefully and respond.
Context: {context} 
Task: Extract the name, designation, and email ID of each professor and return them in this format:
**Name** *emailID*

"""

# Have the LLM read the prompt, analyse the page content, and generate a response

from langchain.chat_models import init_chat_model
model = init_chat_model("llama-3.3-70b-versatile", model_provider="groq")


# This template has no {question} placeholder, so only the context is passed.
response = model.invoke(prompt_template.format(
    context=full_doc))
    
from IPython.display import Markdown
Markdown(response.content)

Here is the list of professors in the desired format:


1. **Prof. Tarun Kumar Das** *tarukd@gmail.com*
2. **Prof. C.S. Lalitha** *cslalitha1@gmail.com*
3. **Prof. Ruchi Das** *rdasmsu@gmail.com*
4. **Prof. Vusala Ambethkar** *vambethkar@gmail.com, vambethkar@maths.du.ac.in*
5. **Prof. Sachi Srivastava** *sachi_srivastava@yahoo.com; ssrivastava@maths.du.ac.in*
6. **Prof. Lalit Kumar** *lalit@maths.du.ac.in, lkumarvashisht@gmail.com*
7. **Prof. Arvind Patel** *arvindpatelmath09@gmail.com*
8. **Prof. Atul Gaur** *agaur@maths.du.ac.in*
9. **Prof. Hemant Kumar Singh** *hemantksingh@maths.du.ac.in*
10. **Prof. Ranjana Jain** *rjain@maths.du.ac.in*
11. **Prof. Sachin Kumar** *sachinambariya@gmail.com*
12. **Prof. Nikita Agarwal** *ag.nikita@gmail.com*
13. **Dr. A. Zothansanga** *azothansanga26@yahoo.com*
14. **Dr. Anupama Panigrahi** *anupama.panigrahi@gmail.com*
15. **Dr. Pratima Rai** *pratimarai5@gmail.com*
16. **Dr. Surendra Kumar** *surendraiitr8@gmail.com*
17. **Dr. Randheer Singh** *randheernsit@gmail.com*
18. **Dr. Sumit Nagpal** *sumitnagpal.du@gmail.com*
19. **Dr. Sandeep Kumar Mogha** *moghadma@gmail.com*
20. **Dr. Ashok Kumar** *ashrsdma@gmail.com*
21. **Dr. Ratikanta Panda** *rkpanda@maths.du.ac.in*
22. **Dr. Anuj Bishnoi** *anuj.bshn@gmail.com*
23. **Dr. Mrigendra Singh Kushwaha** *mrigendra154@gmail.com*
24. **Prof. Ajay Kumar** *akumar@maths.du.ac.in*
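
If the formatted answer needs to be used programmatically, it can be parsed back into records. This is a minimal sketch, assuming the model followed the `**Name** *emailID*` format exactly and separated multiple addresses with commas or semicolons. `parse_entries` is a hypothetical helper.

```python
import re

# Matches "**Name** *email(s)*"; assumes the model kept the requested format.
ENTRY_RE = re.compile(r"\*\*(?P<name>.+?)\*\*\s+\*(?P<emails>[^*]+)\*")


def parse_entries(markdown_text: str) -> list[dict]:
    """Turn formatted lines into a list of {'name': ..., 'emails': [...]} records."""
    entries = []
    for match in ENTRY_RE.finditer(markdown_text):
        # Addresses may be separated by commas or semicolons.
        emails = [e.strip() for e in re.split(r"[,;]", match.group("emails")) if e.strip()]
        entries.append({"name": match.group("name").strip(), "emails": emails})
    return entries


# Two lines copied from the output above, used as sample input.
sample = (
    "4. **Prof. Vusala Ambethkar** *vambethkar@gmail.com, vambethkar@maths.du.ac.in*\n"
    "5. **Prof. Sachi Srivastava** *sachi_srivastava@yahoo.com; ssrivastava@maths.du.ac.in*"
)
for entry in parse_entries(sample):
    print(entry["name"], entry["emails"])
```

In the notebook, `parse_entries(response.content)` would apply the same parsing to the full answer. Lines that deviate from the format are skipped, not repaired.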

# Exercise: Try to extract something else from a different webpage

In [15]:
from langchain_community.document_loaders import UnstructuredURLLoader

loader = UnstructuredURLLoader(
    urls=["https://timesofindia.indiatimes.com/sports/nba/top-stories/fans-give-credit-for-luka-doncics-fitness-to-fiancee-anamaria-goltes-as-she-shares-picture-from-kitchen-garden/articleshow/121675609.cms"]
)
docs = loader.load()

In [16]:
full_doc = "\n\n".join(doc.page_content for doc in docs)
print(full_doc)

Edition

IN

IN

US

English

English

हिन्दी

मराठी

ಕನ್ನಡ

தமிழ்

বাংলা

മലയാളം

తెలుగు

ગુજરાતી

TOI logo

Sign In

TOI

Today's ePaper

News

Sports News

NBA News

Fans give credit for Luka Doncic’s fitness to fiancee Anamaria Goltes as she shares picture from kitchen garden

Trending

Who is Nikhil Sosale

IND Tour ENG

PM Modi

Yash Dayal

D Gukesh

Virat Kohli

Gautam Gambhir

Bengaluru Stampede

India vs England

Rohit Sharma

Who is Nikhil Sosale

IND Tour ENG

PM Modi

Yash Dayal

D Gukesh

Virat Kohli

Gautam Gambhir

Bengaluru Stampede

India vs England

Rohit Sharma

Who is Nikhil Sosale

IND Tour ENG

PM Modi

Yash Dayal

D Gukesh

Virat Kohli

Gautam Gambhir

Bengaluru Stampede

India vs England

Rohit Sharma

Fans give credit for Luka Doncic’s fitness to fiancee Anamaria Goltes as she shares picture from kitchen garden

TOI Sports Desk / TIMESOFINDIA.COM / Jun 06, 2025, 17:23 IST

Share

AA

Text Size

Small

Medium

Large

Luka Doncic is looking fitter during the offs

In [17]:
question = "give me the summary for this article."

In [18]:
# RAG prompt template
prompt_template = """You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: {question} 
Context: {context} 
Answer:"""

# Have the LLM read the prompt, analyse the page content, and generate a response

from langchain.chat_models import init_chat_model
model = init_chat_model("llama-3.3-70b-versatile", model_provider="groq")


response = model.invoke(prompt_template.format(
    context=full_doc,
    question=question))
    
print(response.content)

Luka Doncic's fans are giving credit to his fiancee, Anamaria Goltes, for his improved fitness. Goltes shared a picture of their kitchen garden on Instagram, showcasing the fresh veggies they are growing, which fans believe is contributing to Doncic's slimmer look. Doncic was recently spotted at a Real Madrid game in Spain, looking fitter during the offseason.
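
The format-and-invoke steps above are repeated for each question. One way to reduce the repetition is a small helper. `ask` and `FakeModel` are hypothetical names, and `FakeModel` is only a stand-in so the sketch runs without an API key. The sketch assumes the model object exposes `invoke(...)` and returns an object with a `content` attribute, as the chat model in this notebook does.

```python
from types import SimpleNamespace

PROMPT_TEMPLATE = """You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: {question}
Context: {context}
Answer:"""


def ask(model, question: str, context: str) -> str:
    """Fill the template, send it to the model, and return the text of the reply."""
    prompt = PROMPT_TEMPLATE.format(question=question, context=context)
    return model.invoke(prompt).content


class FakeModel:
    """Stand-in for a chat model: echoes the prompt back instead of calling an API."""

    def invoke(self, prompt: str):
        return SimpleNamespace(content=prompt)


reply = ask(FakeModel(), "give me the summary for this article.", "page text goes here")
print(reply)
```

In the notebook, the call would be `ask(model, question, full_doc)` with the `model` created by `init_chat_model`.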
