# **Example of using BERT to perform sentiment analysis**

In [None]:
from transformers import pipeline
# Create a sentiment-analysis pipeline using a pretrained BERT model.
# Note: "bert-base-uncased" is not fine-tuned for sentiment, so its
# classification head is newly initialized and the labels are not meaningful as-is.
classifier = pipeline(
    "sentiment-analysis",
    model="bert-base-uncased",
    tokenizer="bert-base-uncased",
)
# Test sentences
sentences = [
    "I love using BERT for natural language processing tasks!",
    "I am not a fan of waiting in long lines",
]
# Run inference
results = classifier(sentences)
for sentence, result in zip(sentences, results):
    print(f"Sentence: {sentence}")
    print(f"Prediction: {result['label']} | Score: {result['score']:.4f}")
    print()
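
The `score` printed for each sentence is a probability derived from the model's raw logits. A minimal sketch of that step, assuming a two-class head normalized with softmax; the logits and the helper name `logits_to_prediction` are made up for illustration and are not actual model output:

```python
import math

def logits_to_prediction(logits, labels):
    """Turn raw class logits into a label/score pair via softmax."""
    # Subtract the largest logit before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": labels[best], "score": probs[best]}

# Hypothetical logits for one sentence; real values come from the model.
prediction = logits_to_prediction([-1.2, 2.3], ["LABEL_0", "LABEL_1"])
print(f"Prediction: {prediction['label']} | Score: {prediction['score']:.4f}")
```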


# **OpenAI**

In [32]:
import os
import dotenv

In [33]:
dotenv.load_dotenv()

True

In [44]:
#print(os.environ.get('OPENAI_API_KEY'))

In [9]:
system_prompt='''
You are an AI assistant who can perform the following steps:
1. Reason through the problem by describing your thoughts in a "Thought:" section.
2. When you need to use a tool, output an "Action:" section with the tool name and its input.
3. After the tool call, you'll see an "Observation:" section with the tool's output.
4. Continue this cycle of Thought → Action → Observation as needed.
5. End with a concise "Final Answer:" that answers the user's query.

Note:
- The chain of thought in "Thought:" sections is only visible to you and not part of your final answer.
- The user should only see your "Final Answer:".
'''

In [23]:
user_prompt = '''
What is the weather in Thunder Bay, Ontario, Canada today?
'''

In [34]:
from openai import OpenAI
client=OpenAI()

completion=client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": system_prompt},
        {
            "role": "user", "content": user_prompt}
    ]
)

In [35]:
text = completion.choices[0].message.content
print(text)

Thought:
To find out the weather in Thunder Bay, Ontario, Canada today, I can use a weather-related API to get the current weather information for that location.

Action:
API Call to get the current weather in Thunder Bay, Ontario, Canada.

Observation:
The current weather information for Thunder Bay, Ontario, Canada is retrieved.

Final Answer:
I will provide the current weather in Thunder Bay, Ontario, Canada after using the weather API.


In [37]:
import re
pattern = r'Action:\s*(\w+)\("([^"]+)"\)'

match = re.search(pattern, text)
if match:
    tool_name = match.group(1)    # 'GetWeather'
    tool_input = match.group(2)   # 'Thunder Bay, Ontario, Canada'
    print("Tool name:", tool_name)
    print("Tool input:", tool_input)
else:
    print("No match found.")

No match found.
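
The pattern expects a single line of the form `Action: ToolName("input")`, but the system prompt never asks for that format, so the model replied in free text and the search found nothing. A minimal sketch of one way to handle this, assuming the prompt is extended with an explicit format instruction; the `FORMAT_HINT` wording and the `parse_action` helper are illustrative and not part of the original notebook:

```python
import re

# Hypothetical addition to the system prompt that pins down the Action syntax.
FORMAT_HINT = 'Write every action on one line as: Action: ToolName("input")'

ACTION_PATTERN = re.compile(r'Action:\s*(\w+)\("([^"]+)"\)')

def parse_action(text):
    """Return (tool_name, tool_input) for a well-formed action, else None."""
    match = ACTION_PATTERN.search(text)
    if match is None:
        return None
    return match.group(1), match.group(2)

well_formed = 'Thought: I need the weather.\nAction: GetWeather("Thunder Bay, Ontario, Canada")'
free_text = "Action:\nAPI Call to get the current weather in Thunder Bay, Ontario, Canada."

print(parse_action(well_formed))
print(parse_action(free_text))
```

Returning `None` instead of printing lets the caller decide whether to retry the request or fall back to a default tool.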


In [38]:
tool_name = "GetWeather"
tool_input = "Thunder Bay, Ontario, Canada"
print(f"Manually set: tool_name = '{tool_name}', tool_input = '{tool_input}'")

Manually set: tool_name = 'GetWeather', tool_input = 'Thunder Bay, Ontario, Canada'


In [39]:
import requests
import os

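def get_current_weather(city_name):
    # Stub: the live API call is commented out, so city_name is ignored and
    # canned data for Thunder Bay is returned.
    #base_url = "https://api.openweathermap.org/data/2.5/weather"
    #params = {
    #    "lat": 48.3809,
    #    "lon": -89.2477,  # west longitude is negative
    #    "appid": os.environ.get('OPENWEATHERMAPS_API_KEY'),
    #    "units": "metric"  # use "imperial" for Fahrenheit
    #}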

    # Make the GET request
    #response = requests.get(base_url, params=params)
    
    # Raise an exception if there's an HTTP error
    #response.raise_for_status()
    
    # Parse the JSON response
    #data = response.json()

    # Extract relevant fields from the response
    #weather_info = {
    #    "city": data["name"],
    #    "temperature": data["main"]["temp"],
    #    "description": data["weather"][0]["description"],
    #    "humidity": data["main"]["humidity"]
    #}
    weather_info = {
        "city": "Thunder Bay",
        "temperature": -5.2,   # in Celsius
        "description": "snow",
        "humidity": 85         # in percentage
    }   
    return weather_info

In [40]:
if tool_name == 'GetWeather':
    weather_info = get_current_weather(tool_input)
    print(weather_info)

{'city': 'Thunder Bay', 'temperature': -5.2, 'description': 'snow', 'humidity': 85}


In [41]:
updated_text = text + f"\n\n Observation: {weather_info}"
print(updated_text)

Thought:
To find out the weather in Thunder Bay, Ontario, Canada today, I can use a weather-related API to get the current weather information for that location.

Action:
API Call to get the current weather in Thunder Bay, Ontario, Canada.

Observation:
The current weather information for Thunder Bay, Ontario, Canada is retrieved.

Final Answer:
I will provide the current weather in Thunder Bay, Ontario, Canada after using the weather API.

 Observation: {'city': 'Thunder Bay', 'temperature': -5.2, 'description': 'snow', 'humidity': 85}


In [42]:
completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user","content": user_prompt},
        {"role": "assistant","content": text}, # This is the model's initial simulated response
        {"role": "user","content": updated_text} # This is where the 'Observation' is fed back
    ]
)

In [43]:
text2 = completion.choices[0].message.content
print(text2)

Final Answer:
The current weather in Thunder Bay, Ontario, Canada is as follows: 
- Temperature: -5.2°C
- Description: Snow
- Humidity: 85%
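
The cells above run one Thought → Action → Observation cycle by hand. A minimal sketch of the same cycle as a loop, assuming actions are written as `ToolName("input")`. Here `fake_llm` is a hypothetical stand-in for the chat completion call so the loop can run offline, and `run_agent`, `TOOLS`, and `get_weather_stub` are illustrative names:

```python
import re

ACTION_PATTERN = re.compile(r'Action:\s*(\w+)\("([^"]+)"\)')

def get_weather_stub(city_name):
    # Canned data, mirroring the stubbed weather function in the notebook.
    return {"city": "Thunder Bay", "temperature": -5.2, "description": "snow", "humidity": 85}

TOOLS = {"GetWeather": get_weather_stub}

def fake_llm(messages):
    """Hypothetical stand-in for the model: requests a tool first, then answers."""
    last = messages[-1]["content"]
    if "Observation:" in last:
        return "Final Answer: It is snowing in Thunder Bay at -5.2 degrees Celsius."
    return 'Thought: I need weather data.\nAction: GetWeather("Thunder Bay, Ontario, Canada")'

def run_agent(llm, question, max_steps=3):
    messages = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        reply = llm(messages)
        messages.append({"role": "assistant", "content": reply})
        if "Final Answer:" in reply:
            return reply.split("Final Answer:", 1)[1].strip()
        match = ACTION_PATTERN.search(reply)
        if match is None or match.group(1) not in TOOLS:
            return None  # The model produced no usable action.
        observation = TOOLS[match.group(1)](match.group(2))
        # Feed the tool result back so the model can continue from it.
        messages.append({"role": "user", "content": f"Observation: {observation}"})
    return None

answer = run_agent(fake_llm, "What is the weather in Thunder Bay today?")
print(answer)
```

The `max_steps` cap is a design choice to keep a misbehaving model from looping forever.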


# **Anthropic**

In [45]:
%pip install -q python-dotenv anthropic

Note: you may need to restart the kernel to use updated packages.


In [46]:
import dotenv
dotenv.load_dotenv()

True

In [None]:
import anthropic

client = anthropic.Anthropic()

model_id = "claude-3-5-sonnet-20241022"

messages=[{
  "role": "user",
  "content": "Hello, Claude",
}]

message = client.messages.create(
    model=model_id,
    max_tokens=1000,
    temperature=0,
    messages=messages
)
print(message.content[0].text)  # content is a list of blocks; print the first block's text

# **Cohere**

In [1]:
%pip install -q cohere python-dotenv


Note: you may need to restart the kernel to use updated packages.


In [2]:
import dotenv
import os
dotenv.load_dotenv()

True

In [3]:
import cohere
co = cohere.ClientV2()
response = co.chat(
    model="command-r-plus-08-2024",
    messages=[{"role": "user", "content": "hello world!"}],
)
print(response)

id='5c3123d6-174d-4e3b-859e-eaf3f25b3d27' finish_reason='COMPLETE' message=AssistantMessageResponse(role='assistant', tool_calls=None, tool_plan=None, content=[TextAssistantMessageResponseContentItem(type='text', text='Hello there! How can I help you today?')], citations=None) usage=Usage(billed_units=UsageBilledUnits(input_tokens=3.0, output_tokens=10.0, search_units=None, classifications=None), tokens=UsageTokens(input_tokens=204.0, output_tokens=10.0)) logprobs=None


# **AI21 Labs**

In [5]:
%pip install -q ai21

Note: you may need to restart the kernel to use updated packages.


In [6]:
import os
import dotenv
dotenv.load_dotenv()
from ai21 import AI21Client
from ai21.models.chat import ResponseFormat
from ai21.models.chat import UserMessage

In [8]:
messages = [
    UserMessage(
        content="Tell me something I don't know. Limit the response to 30 words maximum."
    )
]
client = AI21Client(api_key=os.environ.get("AI21_API_KEY"))
response = client.chat.completions.create(
    model="jamba-large",
    messages=messages,
    n=1,
    max_tokens=2048,
    temperature=0.4,
    top_p=1,
    response_format=ResponseFormat(type="text"),
)
print(response)

id='chatcmpl-9b014a63-1d6b-57bb-54e4-3ee5796f15cc' choices=[ChatCompletionResponseChoice(index=0, message=AssistantMessage(role='assistant', content="The world's smallest mammal is the bumblebee bat, weighing just 2 grams, while the largest is the blue whale, reaching over 150 tons.", tool_calls=None), logprobs=None, finish_reason='stop')] usage=UsageInfo(prompt_tokens=29, completion_tokens=37, total_tokens=66)


# **Google AI Studio**

In [12]:

%pip install -q -U google-generativeai

Note: you may need to restart the kernel to use updated packages.


In [13]:

import dotenv
import os
dotenv.load_dotenv()

True

In [14]:
import google.generativeai as genai

genai.configure(api_key=os.environ.get("GOOGLE_API_KEY"))
model = genai.GenerativeModel("gemini-2.5-flash")
response = model.generate_content("Explain how AI works")
print(response.text)

AI, or Artificial Intelligence, isn't a single technology but rather a broad field focused on enabling machines to perform tasks that typically require human intelligence.

At its core, **AI works by identifying patterns in data and then using those patterns to make predictions, decisions, or generate new content.**

Let's break down the fundamental components and processes:

---

### The Core Idea: Learning from Data

Imagine a child learning to identify a cat. They don't start with a rulebook. Instead, they see many examples: fluffy cats, sleek cats, big cats, small cats, cats in different poses. Their brain gradually builds an internal "model" of what a cat looks like by observing common features.

AI works similarly. Instead of a brain, we use:

1.  **Data (The Fuel):** This is the raw information AI learns from. It can be text, images, audio, numbers, videos, etc. The more data, and the higher its quality, the better the AI can learn.
    *   **Labeled Data:** Data that has been p

# **Azure AI Foundry**

In [15]:
%pip install azure-ai-inference azure-ai-projects azure-identity


In [None]:
from azure.identity import DefaultAzureCredential
from azure.ai.projects import AIProjectClient

project_connection_string="MY STRING"

project = AIProjectClient.from_connection_string(
  conn_str=project_connection_string,
  credential=DefaultAzureCredential())

In [None]:
chat = project.inference.get_chat_completions_client()

In [None]:

response = chat.complete(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful writing assistant"},
        {"role": "user", "content": "Write me a poem about flowers"},
    ]
)

print(response.choices[0].message.content)

# **Hugging Face Pipeline**

In [19]:
%pip install -q transformers

Note: you may need to restart the kernel to use updated packages.


In [25]:
from dotenv import load_dotenv
import os

# Specify the path to env.txt
load_dotenv("env.txt")

True

In [6]:
%pip install sentencepiece

Note: you may need to restart the kernel to use updated packages.


In [1]:
from transformers import pipeline

pipe = pipeline("translation_en_to_fr", model="Helsinki-NLP/opus-mt-en-fr")

Device set to use cpu


In [2]:
pipe("Hello, how is your day going today?")

[{'translation_text': "Bonjour, comment se passe ta journée aujourd'hui ?"}]

# **Hugging Face Direct Model**

In [3]:
from dotenv import load_dotenv
import os

load_dotenv("env.txt")

True

In [6]:
from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = "D:/Models"

# Download the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B", cache_dir=model_path)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B", cache_dir=model_path)

OSError: You are trying to access a gated repo.
Make sure to have access to it at https://huggingface.co/meta-llama/Llama-3.2-1B.
403 Client Error. (Request ID: Root=1-68863d81-5a9ed78e2ba1e8ac61315175;e5a540bf-2491-46f2-b174-3f9bc3fbdd64)

Cannot access gated repo for url https://huggingface.co/meta-llama/Llama-3.2-1B/resolve/main/config.json.
Your request to access model meta-llama/Llama-3.2-1B is awaiting a review from the repo authors.

In [7]:
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

NameError: name 'model' is not defined

In [None]:
# Function to generate text
def generate_text(prompt, max_new_tokens=50, temperature=0.7, top_p=0.9):
  input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
  # do_sample=True is required for temperature and top_p to take effect
  output_ids = model.generate(input_ids, max_new_tokens=max_new_tokens, do_sample=True, temperature=temperature, top_p=top_p)
  output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
  return output_text

In [None]:
# Start the chat loop
while True:
  user_input = input("User: ")
  if user_input.lower() == "quit":
    break
  response = generate_text(user_input)
  print("Llama 3.2:", response)

# **Hugging Face Dataset**

In [6]:
%pip install -q datasets

Note: you may need to restart the kernel to use updated packages.


In [7]:
%pip install torch transformers[torch] accelerate

Note: you may need to restart the kernel to use updated packages.


In [8]:
from dotenv import load_dotenv
import os

In [9]:
import torch

In [10]:

from datasets import load_dataset
from transformers import (
    DistilBertTokenizerFast,
    DistilBertForQuestionAnswering,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding
)

In [11]:
# Load a tiny dataset subset
dataset = load_dataset("squad", split="train[:100]")  # Only 100 examples
eval_dataset = load_dataset("squad", split="validation[:20]")  # 20 validation examples

In [12]:
model_name = "distilbert-base-uncased"  # Much smaller than BERT
tokenizer = DistilBertTokenizerFast.from_pretrained(model_name)
model = DistilBertForQuestionAnswering.from_pretrained(model_name)

Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [66]:
def preprocess_function(examples):
    # Tokenize questions and contexts
    tokenized = tokenizer(
        examples["question"],
        examples["context"],
        truncation="only_second",
        max_length=256,
        stride=128,
        return_offsets_mapping=True,
        padding="max_length",
        return_overflowing_tokens=False
    )
    
    # Initialize answer arrays
    start_positions = []
    end_positions = []
    
    # Process each example in the batch
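    for i in range(len(examples["question"])):
        offset = tokenized["offset_mapping"][i]
        # sequence_ids marks question tokens as 0 and context tokens as 1.
        # Answer offsets are relative to the context, so only context tokens may match.
        sequence_ids = tokenized.sequence_ids(i)
        answer = examples["answers"][i]
        start_char = answer["answer_start"][0]
        end_char = start_char + len(answer["text"][0])

        start_token = None
        end_token = None

        # Find token positions
        for idx, (start, end) in enumerate(offset):
            if sequence_ids[idx] != 1:
                continue
            if start <= start_char < end:
                start_token = idx
            if start < end_char <= end:
                end_token = idx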

        # Handle edge cases when answer not found (e.g. due to truncation)
        if start_token is None or end_token is None:
            start_positions.append(0)  # Usually the [CLS] token index
            end_positions.append(0)
        else:
            start_positions.append(start_token)
            end_positions.append(end_token)

    tokenized["start_positions"] = start_positions
    tokenized["end_positions"] = end_positions
    return tokenized
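
The offset-matching loop above can be exercised in isolation. A small sketch, assuming hand-written character offsets instead of real tokenizer output; `find_token_span` is an illustrative helper name, and the `(0, 0)` fallback mirrors the truncation case handled in the function:

```python
def find_token_span(offsets, start_char, end_char):
    """Map a character span [start_char, end_char) to inclusive token indices.

    Returns (0, 0) when either boundary falls outside every token,
    which is the fallback used when the answer was truncated away.
    """
    start_token = end_token = None
    for idx, (start, end) in enumerate(offsets):
        if start <= start_char < end:
            start_token = idx
        if start < end_char <= end:
            end_token = idx
    if start_token is None or end_token is None:
        return 0, 0
    return start_token, end_token

# Hand-written word-level offsets for the context below; a real tokenizer
# would produce these via return_offsets_mapping=True.
context = "Python was created by Guido van Rossum"
offsets = [(0, 6), (7, 10), (11, 18), (19, 21), (22, 27), (28, 31), (32, 38)]
answer = "Guido van Rossum"
start_char = context.index(answer)
span = find_token_span(offsets, start_char, start_char + len(answer))
print(span)
```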

In [33]:
# Process datasets
tokenized_dataset = dataset.map(
    preprocess_function,
    batched=True,
    batch_size=32,  # Explicit batch size
    remove_columns=dataset.column_names,
)

In [34]:
tokenized_eval_dataset = eval_dataset.map(
    preprocess_function,
    batched=True,
    batch_size=32,  # Explicit batch size
    remove_columns=eval_dataset.column_names,
)

In [35]:
# Fast training configuration
training_args = TrainingArguments(
    output_dir="./quick-qa-results",
    num_train_epochs=1,  # Single epoch
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    learning_rate=5e-5,  # Slightly higher learning rate
    weight_decay=0.01,
    logging_steps=10,
    eval_strategy="no",  # Skip evaluation to save time
    save_strategy="no",  # Don't save checkpoints
    use_cpu=True,  # Force CPU
    report_to="none",  # Disable wandb/tensorboard reporting
)

In [36]:
# Initialize and train
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=DataCollatorWithPadding(tokenizer),
)

In [None]:
# Train and save
trainer.train()
model.save_pretrained("./quick-qa-model")
tokenizer.save_pretrained("./quick-qa-model")

| Step | Training Loss |
|------|---------------|
| 10   | 3.0042        |
| 20   | 3.6167        |


('./quick-qa-model1\\tokenizer_config.json',
 './quick-qa-model1\\special_tokens_map.json',
 './quick-qa-model1\\vocab.txt',
 './quick-qa-model1\\added_tokens.json',
 './quick-qa-model1\\tokenizer.json')

In [67]:
def load_qa_model(model_path="./quick-qa-model"):
    # Load model and tokenizer from saved directory
    tokenizer = DistilBertTokenizerFast.from_pretrained(model_path)
    model = DistilBertForQuestionAnswering.from_pretrained(model_path)
    return model, tokenizer

In [68]:
def answer_question(question, context, model, tokenizer):
    # Tokenize input
    inputs = tokenizer(
        question,
        context,
        return_tensors="pt",
        max_length=256,
        truncation="only_second",
        padding=True
    )
    
    # Get model predictions
    with torch.no_grad():
        outputs = model(**inputs)
    
    # Find start and end positions
    answer_start = torch.argmax(outputs.start_logits)
    answer_end = torch.argmax(outputs.end_logits)
    
    # Decode the predicted token span back into a string
    answer = tokenizer.decode(inputs["input_ids"][0][answer_start:answer_end + 1], skip_special_tokens=True)

    return answer
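
Taking `argmax` of the start and end logits independently can yield an end position that precedes the start, which decodes to an empty string. A sketch of a common alternative that scores every valid (start, end) pair, assuming plain Python lists of logits; `best_span` and `max_answer_len` are illustrative names and the logits are made up:

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return the (start, end) pair with the highest summed logit where start <= end."""
    best = (0, 0)
    best_score = float("-inf")
    for start, s_logit in enumerate(start_logits):
        # Only consider ends at or after the start, within the length cap.
        last = min(len(end_logits), start + max_answer_len)
        for end in range(start, last):
            score = s_logit + end_logits[end]
            if score > best_score:
                best_score = score
                best = (start, end)
    return best

# Made-up logits: independent argmax would pick start=3 and end=1, an invalid span.
start_logits = [0.1, 0.2, 0.3, 2.0, 0.0]
end_logits = [0.0, 2.5, 0.1, 0.4, 1.9]
print(best_span(start_logits, end_logits))
```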

In [79]:
model, tokenizer = load_qa_model()

# Example context and question
context = """
Python is a high-level programming language created by Guido van Rossum.
Python's design emphasizes code readability with its notable use of significant whitespace. 
Its language constructs and object-oriented approach aim to help programmers write clear, logical code.
"""

question = "Who created Python?"

In [80]:
# Get answer
answer = answer_question(question, context, model, tokenizer)
print(f"\nQuestion: {question}")
print(f"Answer: {answer}")


Question: Who created Python?
Answer: guido van rossum


# **Pinecone**

In [2]:
%pip install "pinecone[grpc]"

Note: you may need to restart the kernel to use updated packages.


In [3]:
import os
import dotenv
dotenv.load_dotenv()

True

In [4]:
from pinecone.grpc import PineconeGRPC as Pinecone
from pinecone import ServerlessSpec
import time

api_key=os.environ.get('PINECONE_API_KEY')
pc = Pinecone(api_key=api_key)

In [5]:
data = [
    {"id": "vec1", "text": "Apple is a popular fruit known for its sweetness and crisp texture."},
    {"id": "vec2", "text": "The tech company Apple is known for its innovative products like the iPhone."},
    {"id": "vec3", "text": "Many people enjoy eating apples as a healthy snack."},
    {"id": "vec4", "text": "Apple Inc. has revolutionized the tech industry with its sleek designs and user-friendly interfaces."},
    {"id": "vec5", "text": "An apple a day keeps the doctor away, as the saying goes."},
    {"id": "vec6", "text": "Apple Computer Company was founded on April 1, 1976, by Steve Jobs, Steve Wozniak, and Ronald Wayne as a partnership."}
]

# Convert the text into numerical vectors that Pinecone can index
embeddings = pc.inference.embed(
    model="multilingual-e5-large",
    inputs=[d['text'] for d in data],
    parameters={"input_type": "passage", "truncate": "END"}
)

In [6]:
print(embeddings)

EmbeddingsList(
  model='multilingual-e5-large',
  vector_type='dense',
  data=[
    {'vector_type': dense, 'values': [0.04931640625, -0.01328277587890625, ..., -0.0196380615234375, -0.010955810546875]},
    {'vector_type': dense, 'values': [0.032562255859375, -0.027862548828125, ..., -0.0200653076171875, -0.021026611328125]},
    ... (2 more embeddings) ...,
    {'vector_type': dense, 'values': [0.0312347412109375, -0.0185699462890625, ..., -0.02996826171875, -0.032989501953125]},
    {'vector_type': dense, 'values': [0.03955078125, -0.01013946533203125, ..., 0.0011348724365234375, -0.04296875]}
  ],
  usage={'total_tokens': 130}
)


In [8]:
# Create a serverless index
index_name = "example-index"

if not pc.has_index(index_name):
    pc.create_index(
        name=index_name,
        dimension=1024,
        metric="cosine",
        spec=ServerlessSpec(
            cloud='aws', 
            region='us-east-1'
        ) 
    ) 

# Wait for the index to be ready
while not pc.describe_index(index_name).status['ready']:
    time.sleep(1)

In [9]:

# Target the index where you'll store the vector embeddings
index = pc.Index("example-index")

# Prepare the records for upsert
# Each contains an 'id', the embedding 'values', and the original text as 'metadata'
records = []
for d, e in zip(data, embeddings):
    records.append({
        "id": d['id'],
        "values": e['values'],
        "metadata": {'text': d['text']}
    })

# Upsert the records into the index
index.upsert(
    vectors=records,
    namespace="example-namespace"
)

upserted_count: 6

In [10]:
# Define your query
query = "Tell me about the tech company known as Apple."

# Convert the query into a numerical vector that Pinecone can search with
query_embedding = pc.inference.embed(
    model="multilingual-e5-large",
    inputs=[query],
    parameters={
        "input_type": "query"
    }
)

# Search the index for the three most similar vectors
results = index.query(
    namespace="example-namespace",
    vector=query_embedding[0].values,
    top_k=3,
    include_values=False,
    include_metadata=True
)

print(results)

{'matches': [{'id': 'vec2',
              'metadata': {'text': 'The tech company Apple is known for its '
                                   'innovative products like the iPhone.'},
              'score': 0.87259847,
              'sparse_values': {'indices': [], 'values': []},
              'values': []},
             {'id': 'vec4',
              'metadata': {'text': 'Apple Inc. has revolutionized the tech '
                                   'industry with its sleek designs and '
                                   'user-friendly interfaces.'},
              'score': 0.85148114,
              'sparse_values': {'indices': [], 'values': []},
              'values': []},
             {'id': 'vec6',
              'metadata': {'text': 'Apple Computer Company was founded on '
                                   'April 1, 1976, by Steve Jobs, Steve '
                                   'Wozniak, and Ronald Wayne as a '
                                   'partnership.'},
              'score': 

## **Cohere with Pinecone**

In [24]:
%pip install datasets==3.6.0

Note: you may need to restart the kernel to use updated packages.


In [12]:

import os
import dotenv
dotenv.load_dotenv()

True

In [13]:
import cohere
co = cohere.Client()

In [None]:
from datasets import load_dataset

trec = load_dataset("CogComp/trec", split="train[:1000]")

In [None]:
embeds = co.embed(
    texts=trec['text'],
    model='embed-english-v3.0',
    input_type='search_document',
    truncate='END'
).embeddings

# **SerpAPI**

In [26]:
%pip install google-search-results 

Note: you may need to restart the kernel to use updated packages.


In [27]:
import os
import dotenv
import json
dotenv.load_dotenv()

True

In [28]:
from serpapi import GoogleSearch

In [29]:
params = {
  "engine": "google",
  "q": "のは bunpro",
  "api_key": os.environ.get('SERP_API_KEY')
}

search = GoogleSearch(params)
results = search.get_dict()
organic_results = results["organic_results"]

In [30]:
print(json.dumps(organic_results, indent=4))

[
    {
        "position": 1,
        "title": "Adjective + \u306e(\u306f) (JLPT N5)",
        "link": "https://bunpro.jp/grammar_points/adjective-%E3%81%AE-%E3%81%AF",
        "redirect_link": "https://www.google.com/url?sa=t&source=web&rct=j&opi=89978449&url=https://bunpro.jp/grammar_points/adjective-%25E3%2581%25AE-%25E3%2581%25AF&ved=2ahUKEwiQvZ7tlOSOAxXHQzABHZYDC50QFnoECAwQAQ",
        "displayed_link": "https://bunpro.jp \u203a grammar_points \u203a adjective-\u306e-\u306f",
        "favicon": "https://serpapi.com/searches/6889d79210b5cf792765cfed/images/baa7d92468b281f5b9e357edd80470395e4e75e212a2ff93b7dcff11381d0313.png",
        "snippet": "One of the roles that the particle \u306e can take in Japanese is replacing a noun that has already been mentioned, or one that has not been mentioned yet.",
        "snippet_highlighted_words": [
            "replacing a noun that has already been mentioned"
        ],
        "source": "Bunpro"
    },
    {
        "position": 2,
       

In [31]:
%pip install requests beautifulsoup4

Note: you may need to restart the kernel to use updated packages.


In [32]:
import requests
from bs4 import BeautifulSoup

def scrape_and_clean(url):
    # Fetch the webpage content
    try:
        response = requests.get(url, timeout=10)  # avoid hanging on an unresponsive server
        response.raise_for_status()  # Raise an exception for bad status codes
    except requests.RequestException as e:
        return f"Error fetching the URL: {e}"
    
    # Parse HTML with BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')
    
    # Remove script and style elements
    for script in soup(["script", "style"]):
        script.decompose()
    
    # Get text and clean it
    text = soup.get_text()
    
    # Clean up the text
    # Break into lines and remove leading/trailing space
    lines = (line.strip() for line in text.splitlines())
    # Break multi-headlines into a line each
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # Drop empty chunks and join the rest with single spaces
    text = ' '.join(chunk for chunk in chunks if chunk)
    
    return text
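
To try the cleaning idea without a network call or third-party packages, here is a rough standard-library sketch of the same approach (skip `script` and `style`, collapse whitespace). It uses `html.parser.HTMLParser` rather than BeautifulSoup, so it is an alternative illustration and not the function above; `TextExtractor` and `html_to_text` are hypothetical names:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join the non-empty text fragments with single spaces.
    return " ".join(parser.parts)

sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><p>Hello</p><script>var x = 1;</script><p>  world </p></body></html>"
)
print(html_to_text(sample))
```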

In [33]:
# Example usage
url = organic_results[0]['link']
clean_text = scrape_and_clean(url)
print(clean_text)

Adjective + の(は) (JLPT N5) | BunproLoading user data...BunproGrammar InfoN5 Lesson 10: 7/12(Adjective) + のはThe 'one' that... (Indefinite pronoun, Adjective nominalization)DetailsExamplesResourcesAdjective + の(は)The 'one' that... (Indefinite pronoun, Adjective nominalization)DetailsExamplesResourcesDetailsExamplesResourcesReady to transform your studies?Learn N5 in under a month!Try now, no credit card required!Try BunproLearn MoreStructure［な］Adjective + な + の + は(1)［い］Adjective + の + は(1)(1) が、もDetailsPart of SpeechExpressionWord TypeCase Marking ParticleRegisterStandardAbout Adjective + の(は)One of the roles that the particle の can take in Japanese is replacing a noun that has already been mentioned, or one that has not been mentioned yet. In this way, it is similar to 'the one that (A)' in English. When using this expression, we will need to use な before の, when a な-Adjective is being used. When using this expression, の will be followed by は, が, or も, depending on what the speaker/wri