## Ollama Web Scraper

### Some goodies to import first for our code

In [23]:
import os
import requests
from dotenv import load_dotenv
from bs4 import BeautifulSoup
from openai import OpenAI

We need to run the command below once to set up our llama model (uncomment it first).

In [None]:
# !ollama pull llama3.2


In [None]:

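from openai import OpenAI

# Ollama exposes an OpenAI-compatible API locally; the client requires an api_key, but Ollama ignores it
openai = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')

model = "llama3.2"

while True:
    user_input = input("Type a prompt for the llm (or 'exit' to quit): ")
    if user_input == "exit":
        break  # stop here so "exit" is never sent to the model

    response = openai.chat.completions.create(
        model = model,
        # "user" is the role for messages typed by the person chatting
        messages = [{"role": "user", "content": user_input}]
    )

    print(response.choices[0].message.content)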


#### Stuff I Learnt

```python
openai.chat.completions.create(
    model = model,
    messages = [...]
)
```
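Generates a chat completion from the given model and the message history. Each message is a dict with a "role", which says who is speaking ("system", "user" or "assistant"), and a "content", which holds what they said.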
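
To make the "history" part concrete, here is a minimal sketch of how the messages list could grow over a conversation. The helper name `add_turn` and the sample texts are made up for illustration, and no model is called.

```python
# Hypothetical helper (not part of the openai library): appends one message to the history.
def add_turn(history, role, content):
    allowed = {"system", "user", "assistant"}
    if role not in allowed:
        raise ValueError(f"unexpected role: {role!r}")
    history.append({"role": role, "content": content})
    return history

history = []
add_turn(history, "system", "You are a helpful assistant.")
add_turn(history, "user", "What is web scraping?")
# In a real chat loop this text would come from response.choices[0].message.content
add_turn(history, "assistant", "Collecting data from web pages with code.")
add_turn(history, "user", "Which Python library helps with that?")

for message in history:
    print(f'{message["role"]:>9}: {message["content"]}')
```

Passing the whole list each time is what lets the model "remember" earlier turns, since each call only sees the messages it is given.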



### Creating A Website Class

Here we use the BeautifulSoup library, which is used for web scraping.


In [13]:

headers = {
 "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"
}

class Website:
    def __init__(self, url):
        self.url = url
        response = requests.get(url, headers=headers)
        soup = BeautifulSoup(response.content, 'html.parser')
        self.title = soup.title.string if soup.title else "No title found"
        for irrelevant in soup.body(["script", "style", "img", "input"]):
            irrelevant.decompose()
        self.text = soup.body.get_text(separator="\n", strip=True)

#### Stuff I Learnt

```python
requests.get(url, headers=headers)
```
I came across a familiar method, requests.get. What's different is that it takes a "headers" argument. Why?
requests.get makes an HTTP GET request. Browsers normally send a User-Agent header describing the browser and device, and many sites expect to see one. Without it, requests sends its own default User-Agent, which makes it pretty obvious this is a web scraping bot. So we send a browser-like header to make it less likely that simple anti-scraping checks block us.
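
A quick way to see the difference is to compare the default User-Agent that requests would send with the one we set. This sketch only prepares the request and never sends it, so no network is needed. The example.com URL and the shortened User-Agent string are placeholders.

```python
import requests

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}  # shortened placeholder

# What requests sends when we don't pass any headers
default_agent = requests.utils.default_user_agent()
print(default_agent)  # starts with "python-requests/"

# Build the request without sending it, then inspect its headers
prepared = requests.Request("GET", "https://example.com", headers=headers).prepare()
print(prepared.headers["User-Agent"])
```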

```python
    BeautifulSoup(response.content, 'html.parser')
```

response.content gives us the raw bytes of the response body, which here is the HTML of the webpage.

```python

..."html.parser")
```
'html.parser' selects Python's built-in HTML parser, which converts the raw HTML into a structured BeautifulSoup object.
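
To try the parsing step without hitting a real site, here is a small sketch that feeds BeautifulSoup a hard-coded HTML string (made up for this example) instead of response.content.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for response.content (bytes, like the real thing)
raw_html = b"<html><head><title>My Page</title></head><body><h1>Hello</h1><p>Some text</p></body></html>"

soup = BeautifulSoup(raw_html, "html.parser")

# Same lookups the Website class does
title = soup.title.string if soup.title else "No title found"
text = soup.body.get_text(separator="\n", strip=True)

print(title)
print(text)
```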

```python
for irrelevant in soup.body(["script", "style", "img", "input"]):
    irrelevant.decompose()
```
Removes elements in HTML that we don't want. 

```soup.body()```
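Calling a tag like a function is shorthand for find_all(), so this searches inside the <body> tag of the HTML document.
The list holds the tag names we want to match; we loop over every match and call irrelevant.decompose() on it.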

```irrelevant.decompose()```
Removes the element from the tree completely and destroys it. There is also an ```.extract()``` method, which detaches the element but returns it, so it stays in memory and can still be used.
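
Here is a small sketch of the difference, run on a made-up HTML snippet. It also checks that calling soup.body([...]) finds the same elements as soup.body.find_all([...]).

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration
html = "<body><p>Keep me</p><script>track()</script><style>p {color: red}</style></body>"
soup = BeautifulSoup(html, "html.parser")

# Calling a tag is shorthand for find_all() on that tag
same_matches = soup.body(["script", "style"]) == soup.body.find_all(["script", "style"])

# extract() detaches the element and hands it back to us
script = soup.body.script.extract()
script_text = script.string

# decompose() detaches the element and destroys it; it returns None
result = soup.body.style.decompose()

remaining = soup.body.get_text(strip=True)
print(same_matches, script_text, result, remaining)
```

For cleaning a page before summarizing, decompose() is the natural choice since we never need the removed tags again.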



### Prompts

Prompts are necessary as they tell the model what to do.

```system_prompt``` gives the model its identity

```user_prompt``` gives the user's input

In [28]:
system_prompt = "You are an assistant that analyzes the contents of a website \
and provides a short summary, ignoring text that might be navigation related."


In [16]:
def user_prompt_for(website):
    user_prompt = f"You are looking at a website titled {website.title}"
    user_prompt += "\nThe contents of this website are as follows; \
please provide a short summary of this website in markdown. \
If it includes news or announcements, then summarize these too.\n\n"
    user_prompt += website.text
    return user_prompt

In [2]:
def messages_for(website):
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content":user_prompt_for(website)}
    ]

Here we build the messages list: one message for the system and one for the user.
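
To see what this list looks like without scraping anything, here is a minimal sketch that uses a fake website object. The shortened `system_prompt` and `user_prompt_for` below are condensed stand-ins for the notebook's versions, and `fake_site` is made up for illustration.

```python
from types import SimpleNamespace

# Condensed stand-ins for the notebook's system_prompt and user_prompt_for
system_prompt = "You are an assistant that summarizes websites."

def user_prompt_for(website):
    return f"You are looking at a website titled {website.title}\n\n{website.text}"

def messages_for(website):
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt_for(website)},
    ]

# Fake website: any object with .title and .text works here
fake_site = SimpleNamespace(title="Test Page", text="Hello world")

messages = messages_for(fake_site)
for message in messages:
    print(message["role"], "->", message["content"][:60])
```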

#### Web Scraping Time!
This is a nice function that runs everything when we pass in our URL

In [18]:
def summarize(url):
    website = Website(url)
    response = openai.chat.completions.create(
        model = "llama3.2",
        messages = messages_for(website)
    )
    return response.choices[0].message.content

I tested it on my ResNet50 project!

In [29]:
summarize("http://localhost:8000/")

'# Summary of Recognize Object Website\n\n## Purpose\nRecognize Object is a platform that appears to be designed to identify objects through machine learning models. The website showcases the tool\'s capabilities, particularly its use of Google\'s QuickDraw dataset.\n\n### ResNet Drawing Guesser Tool\nThe site describes a "ResNet Drawing Guesser" made from Google\'s quickdraw dataset, intended for object recognition.\n\nNo news or announcements were present in the provided content.'