<a href="https://colab.research.google.com/github/Praveengovianalytics/GenAI_Notebooks/blob/main/GenAI_and_Webscrapping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Webscraping Made Simple with GenAI 🌐

### Leveraging Internet Signals for Strategic Decision-Making 🚀

Signals from the internet can strengthen existing machine learning models and support informed, strategic business decisions. Here we extend the conventional web scraping process and make it much simpler by driving an LLM with prompts.

## Introducing Ollama 🛠️

**Ollama** is an open-source tool designed for the local execution of large language models (LLMs) such as Llama, Phi, and more. This tool allows developers and researchers to run and customize these models directly on their own systems without the need for cloud services. With features like GPU acceleration and easy model management, Ollama is perfect for those who require efficient, local processing of LLMs.

## What is ScrapeGraphAI? 📊

**ScrapeGraphAI** is a Python library that revolutionizes web scraping. By combining the power of LLMs and direct graph logic, it enables the creation of sophisticated scraping pipelines for websites, documents, and XML files. Simply specify what information you need, and ScrapeGraphAI handles the rest!

## Key Features:

- **Simple Installation:** Quick setup allows you to start right away with commands like `!pip install scrapegraphai`.
- **GPU Support:** Checks and utilizes available GPU resources to enhance processing capabilities.
- **Interactive Examples:** Practical examples and use cases like extracting mobile plans from websites or listing articles from news feeds demonstrate the utility and flexibility of Ollama and ScrapeGraphAI.

## Use Cases 📱

1. **Mobile Plan Comparison:** Automatically gather and compare mobile phone plans and pricing from various providers.
2. **News Digest Creation:** Efficiently compile and organize news articles from popular platforms for easy consumption.

With these tools, you can extract web data by describing what you need in a prompt, which opens up new avenues for data-driven strategies in your business.



# 1. Ollama (GenAI - LLM) software setup

In [1]:
!pip3 install scrapegraphai==0.9.0b7 --upgrade -q
!pip install langchain-community -q

https://github.com/ollama/ollama


In [2]:
!curl -fsSL https://ollama.com/install.sh | sh

>>> Downloading ollama...
############################################################################################# 100.0%
>>> Installing ollama to /usr/local/bin...
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.


#### Verify the GPU availability

In [3]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Fri May 10 06:11:54 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   55C    P8              10W /  70W |      3MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
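The same check can be written without IPython's `!` syntax, which helps when the notebook is exported to a plain script. This is a minimal sketch using only the standard library; the helper name `gpu_summary` is an assumption introduced here, not part of the original notebook.

```python
import shutil
import subprocess


def gpu_summary():
    """Return the nvidia-smi report as text, or None if no usable GPU is found."""
    # nvidia-smi is absent on machines without the NVIDIA driver installed.
    if shutil.which("nvidia-smi") is None:
        return None
    try:
        completed = subprocess.run(
            ["nvidia-smi"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    # A non-zero exit code means the tool could not talk to a GPU.
    if completed.returncode != 0:
        return None
    return completed.stdout


report = gpu_summary()
print(report if report is not None else "Not connected to a GPU")
```

Relying on the exit code instead of searching the output for the word `failed` keeps the check independent of the exact error message.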
                                                                    

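## Open a terminal and run the command below
Use the terminal in Colab to run the following command. `ollama serve &` starts the Ollama server in the background, and `ollama run llama3:8b` downloads the model if needed and loads it.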

```ollama serve & ollama run llama3:8b```
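Before moving on, it can help to confirm from the notebook that the server is actually listening. The sketch below assumes the default address reported by the installer (`127.0.0.1:11434`) and that the server answers a plain HTTP GET on its root path with status 200; the function name `ollama_is_up` is hypothetical.

```python
import urllib.error
import urllib.request


def ollama_is_up(base_url="http://127.0.0.1:11434", timeout=2.0):
    """Return True if an HTTP server answers at base_url with status 200."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error, or malformed URL.
        return False


print("Ollama reachable:", ollama_is_up())
```

If this prints `False`, the terminal command has probably not been started yet or is still initializing.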

## Test Ollama with Llama 3 8B

In [4]:

from langchain_community.llms import Ollama
llm = Ollama(model="llama3:8b")

In [5]:
llm.invoke("Write a proposal email for digital transformation.")

'Subject: Proposal for Digital Transformation Initiative\n\nDear [Decision Maker\'s Name],\n\nI hope this email finds you well. As we continue to navigate the ever-evolving landscape of technology, I am excited to propose a comprehensive digital transformation initiative that will enable our organization to stay ahead of the curve and drive growth.\n\nAs you may be aware, the world is rapidly becoming more digital. The COVID-19 pandemic has accelerated this trend, with many organizations forced to adopt remote work arrangements and rely on digital channels for communication and commerce. In light of these developments, I believe it is essential that we take a proactive approach to digital transformation, leveraging technology to enhance our operations, improve customer experiences, and increase competitiveness.\n\nOur proposal, titled "Digital Transformation: Unlocking New Possibilities," aims to achieve the following key objectives:\n\n1. **Enhance Operational Efficiency**: Implement 

# 2. Install Playwright for web scraping

In [8]:
!playwright install

Downloading Chromium 124.0.6367.29 (playwright build v1112) from https://playwright.azureedge.net/builds/chromium/1112/chromium-linux.zip

# 3. Let's start web scraping

In [None]:
!ollama pull mistral
!ollama pull nomic-embed-text
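To verify that both pulls succeeded, the local model list can be read from the Ollama HTTP API. The sketch below assumes the server exposes `GET /api/tags` returning a JSON object with a `models` list whose entries carry a `name` field; the helper names and the sample payload are hypothetical and only illustrate the check.

```python
import json
import urllib.request

REQUIRED = ("mistral", "nomic-embed-text")


def local_model_names(tags_payload):
    """Extract model names from the JSON returned by GET /api/tags."""
    return [entry["name"] for entry in tags_payload.get("models", [])]


def missing_models(names, required=REQUIRED):
    """Return required models with no local match, ignoring the ':tag' suffix."""
    bases = {name.split(":", 1)[0] for name in names}
    return [model for model in required if model not in bases]


def fetch_tags(base_url="http://localhost:11434", timeout=5.0):
    """Query the running server; call this only once Ollama is up."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as response:
        return json.load(response)


# Hypothetical payload standing in for a real fetch_tags() response.
sample = {"models": [{"name": "mistral:latest"}]}
print(missing_models(local_model_names(sample)))  # → ['nomic-embed-text']
```

Against a live server, `missing_models(local_model_names(fetch_tags()))` should return an empty list once both models are pulled.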

In [37]:
import nest_asyncio
import pandas as pd
nest_asyncio.apply()

## Use case 1: Get the list of mobile plans from Simba

In [43]:
from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {
        "model": "ollama/mistral",
        "temperature": 0,
        "format": "json",  # Ollama needs the format to be specified explicitly
        "base_url": "http://localhost:11434",  # set Ollama URL
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "base_url": "http://localhost:11434",  # set Ollama URL
    }
}

smart_scraper_graph = SmartScraperGraph(
    prompt="get the list of mobile plans and prices",
    # also accepts a string with the already downloaded HTML code
    source="https://simba.sg/",
    config=graph_config
)
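Since the same configuration dictionary is reused for every scrape in this notebook, it can be produced by a small helper. This is a sketch only: `make_ollama_config` is a hypothetical name, and the keys simply mirror the `graph_config` shown above.

```python
def make_ollama_config(
    llm_model="mistral",
    embed_model="nomic-embed-text",
    base_url="http://localhost:11434",
):
    """Build a graph_config dict with the same keys as the cell above."""
    return {
        "llm": {
            "model": f"ollama/{llm_model}",
            "temperature": 0,
            "format": "json",  # Ollama needs the format to be specified explicitly
            "base_url": base_url,
        },
        "embeddings": {
            "model": f"ollama/{embed_model}",
            "base_url": base_url,
        },
    }


graph_config = make_ollama_config()
print(graph_config["llm"]["model"])  # → ollama/mistral
```

Keeping the base URL in one place makes it easier to point every graph at a different Ollama host later.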

In [44]:
result = smart_scraper_graph.run()
df = pd.DataFrame(result)
df

| | mobile_plans |
|---|---|
| 0 | {'name': 'SIM-Only 200GB', 'data': '200GB', 'p... |
| 1 | {'name': 'SIM-Only 100GB', 'data': '100GB', 'p... |
| 2 | {'name': 'SuperRoam 50GB (30D)', 'data': '50GB... |
| 3 | {'name': 'SuperRoam 50GB (90D)', 'data': '50GB... |
| 4 | {'name': 'SuperRoam MY', 'data': '130GB', 'pri... |
| 5 | {'name': 'SuperRoam Max', 'data': '100GB', 'pr... |
| 6 | {'name': 'Seniors', 'data': '20GB', 'price': '... |
| 7 | {'name': 'Business 50GB', 'data': '50GB', 'pri... |
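In the output above each row holds a whole dictionary in a single column, because the result is a dict with one key (`mobile_plans`) wrapping a list of records. A minimal sketch for unwrapping that list follows; `records_from_result` is a hypothetical helper, and the sample reuses only the `name` and `data` values visible above (the price values are truncated in the output, so they are omitted rather than guessed).

```python
def records_from_result(result):
    """Return the list of record dicts wrapped inside a scraper result."""
    # Typical shape: {"mobile_plans": [ {...}, {...} ]}
    if isinstance(result, dict) and len(result) == 1:
        (value,) = result.values()
        if isinstance(value, list):
            return value
    if isinstance(result, list):
        return result
    # Fall back to a single-row table for any other shape.
    return [result]


# Sample limited to the fields readable in the output above.
sample_result = {
    "mobile_plans": [
        {"name": "SIM-Only 200GB", "data": "200GB"},
        {"name": "SIM-Only 100GB", "data": "100GB"},
    ]
}

records = records_from_result(sample_result)
print(records[0]["name"])  # → SIM-Only 200GB
```

Passing `records` to `pd.DataFrame(records)` then gives one column per field instead of one column of dictionaries.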


## Use case 2: Get the news feed from Google News

In [55]:
from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {
        "model": "ollama/mistral",
        "temperature": 0,
        "format": "json",  # Ollama needs the format to be specified explicitly
        "base_url": "http://localhost:11434",  # set Ollama URL
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "base_url": "http://localhost:11434",  # set Ollama URL
    }
}

smart_scraper_graph = SmartScraperGraph(
    prompt="List out todays news",
    # also accepts a string with the already downloaded HTML code
    source="https://news.google.com/",
    config=graph_config
)

In [56]:
result = smart_scraper_graph.run()
print(result)
# Creating a DataFrame
df = pd.DataFrame(result)
df



| | news |
|---|---|
| 0 | {'source': 'The Hill', 'title': 'Stormy Daniel... |
| 1 | {'source': 'CNN', 'title': 'Stormy Daniels wra... |
| 2 | {'source': 'The Washington Post', 'title': 'Ne... |
| 3 | {'source': 'Fox News', 'title': 'NY v. Trump: ... |
| 4 | {'source': 'Brooke Singman', 'title': 'Full Co... |
| 5 | {'source': 'Fox News', 'title': 'GOP governor ... |
| 6 | {'source': 'Al Jazeera English', 'title': ''We... |
| 7 | {'source': 'The New York Times', 'title': 'Opi... |
| 8 | {'source': 'CNN', 'title': 'Florida sheriff re... |
| 9 | {'source': 'The New York Times', 'title': 'Isr... |
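As a first step toward a news digest, the scraped items can be grouped by outlet. The sketch below is an illustration under stated assumptions: `articles_per_source` is a hypothetical helper, and the sample contains only the `source` values visible in the output above, since the titles are truncated there.

```python
from collections import Counter

# Source names copied from the output above; titles are truncated there,
# so they are left out of this sample rather than guessed.
sample_result = {
    "news": [
        {"source": "The Hill"},
        {"source": "CNN"},
        {"source": "The Washington Post"},
        {"source": "Fox News"},
        {"source": "Brooke Singman"},
        {"source": "Fox News"},
        {"source": "Al Jazeera English"},
        {"source": "The New York Times"},
        {"source": "CNN"},
        {"source": "The New York Times"},
    ]
}


def articles_per_source(result, key="news"):
    """Count scraped articles per source name."""
    return Counter(item.get("source", "unknown") for item in result.get(key, []))


counts = articles_per_source(sample_result)
for source, count in counts.most_common():
    print(f"{source}: {count}")
```

Note that one entry lists a person's name as the source, which suggests the model sometimes confuses author and outlet; a real digest would likely need a cleanup step for such rows.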


# Conclusion 🌟

In this notebook, we explored the powerful capabilities of Ollama and ScrapeGraphAI, demonstrating how these tools can revolutionize the way we extract and utilize web data. Through practical examples and clear, step-by-step instructions, we showcased how these tools can be employed to gather strategic data from the web efficiently.


Ollama offers a flexible and robust platform for running large language models locally, allowing for significant customization and control, which is crucial for businesses looking to operate independently of cloud services. On the other hand, ScrapeGraphAI simplifies the process of scraping complex web data, making it accessible even to those with minimal technical expertise.

# Key Takeaways:

*   Enhanced Data Accessibility: By automating the extraction of data from various web sources, businesses can gain insights faster and more reliably.

*   Cost-Effective Solutions: Running models locally reduces reliance on cloud platforms, potentially lowering operational costs.

*   Ease of Use: The intuitive setup and user-friendly interfaces of these tools make advanced data scraping and model management accessible to a broader audience.