# Text Classification

* Text Classification with Representation Models
* Text Classification with Generative Models

### The Sentiment of Movie Reviews

In [21]:
from datasets import load_dataset

In [22]:
data = load_dataset("rotten_tomatoes")
data

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
})

In [23]:
data["train"][0, -1]

{'text': ['the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .',
  'things really get weird , though not particularly scary : the movie is all portent and no content .'],
 'label': [1, 0]}

### Text Classification with Representation Models

 **Using a task-specific model**

In [28]:
from transformers import pipeline

model_path = "cardiffnlp/twitter-roberta-base-sentiment-latest"

pipe = pipeline(
    model = model_path,
    tokenizer = model_path,
    return_all_scores=True,
    device = 'cuda:0'
)

Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Device set to use cuda:0


In [29]:
import numpy as np
from tqdm import tqdm
from transformers.pipelines.pt_utils import KeyDataset

In [35]:
KeyDataset(data["test"], "text")

<transformers.pipelines.pt_utils.KeyDataset at 0x244a0417ce0>

In [30]:
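y_pred = []
for output in tqdm(pipe(KeyDataset(data["test"], "text")), total = len(data["test"])):
    # This checkpoint scores three labels in order: 0 = negative, 1 = neutral, 2 = positive.
    # (The report printed below was recorded with index 1 here, so a rerun may give different numbers.)
    negative_score = output[0]["score"]
    positive_score = output[2]["score"]
    # Ignore the neutral score and keep whichever of negative/positive is larger.
    assignment = np.argmax([negative_score, positive_score])
    y_pred.append(assignment)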


100%|██████████| 1066/1066 [00:11<00:00, 95.37it/s] 
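
The loop above reduces each prediction to a binary label by comparing two of the scores the pipeline returns. A minimal sketch of the same reduction that looks scores up by label name instead of list position; the label names `negative`, `neutral`, and `positive` and the example scores are assumptions for illustration, so check `pipe.model.config.id2label` for the names your checkpoint actually uses.

```python
def binary_assignment(scores):
    """Return 0 (negative) or 1 (positive) from a list of {'label', 'score'} dicts."""
    by_label = {item["label"].lower(): item["score"] for item in scores}
    # Compare only negative vs. positive; any neutral score is ignored.
    # A tie resolves to 0, as np.argmax would on a two-element list.
    return int(by_label["positive"] > by_label["negative"])


# Hypothetical scores for one review (not real model output).
example = [
    {"label": "negative", "score": 0.10},
    {"label": "neutral", "score": 0.60},
    {"label": "positive", "score": 0.30},
]

print(binary_assignment(example))  # 1
```

Looking scores up by name keeps the mapping correct even if the pipeline returns the entries in a different order.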


In [36]:
from sklearn.metrics import classification_report

def evaluate_performance(y_true, y_pred):
    """Create and print the classification report."""
    performance = classification_report(
        y_true, y_pred, target_names=["Negative Review", "Positive Review"]
    )

    print(performance)

evaluate_performance(data["test"]["label"], y_pred)

                 precision    recall  f1-score   support

Negative Review       0.81      0.69      0.75       533
Positive Review       0.73      0.84      0.78       533

       accuracy                           0.77      1066
      macro avg       0.77      0.77      0.77      1066
   weighted avg       0.77      0.77      0.77      1066
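
The f1-score column is the harmonic mean of precision and recall for each class. A small check, assuming only the rounded precision and recall values printed in the report above; because the inputs are already rounded, agreement to two decimals is not guaranteed in general, although it holds for these two rows.

```python
def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Rounded values copied from the report above.
negative_f1 = f1_score_from(0.81, 0.69)
positive_f1 = f1_score_from(0.73, 0.84)

print(round(negative_f1, 2))  # 0.75
print(round(positive_f1, 2))  # 0.78
```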



### Classification tasks that leverage embeddings

 **Supervised learning**

In [37]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

train_embeddings = model.encode(data["train"]["text"], show_progress_bar=True)
test_embeddings = model.encode(data["test"]["text"], show_progress_bar=True)
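
Once the embeddings exist, the supervised step is to fit a lightweight classifier on them. A minimal sketch, assuming scikit-learn's `LogisticRegression` as the classifier (the notebook does not say which one it uses); the helper name `train_embedding_classifier` is hypothetical, and the random arrays below are synthetic stand-ins so the block runs without downloading a model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def train_embedding_classifier(train_embeddings, train_labels, seed=42):
    """Fit a logistic regression on fixed sentence embeddings."""
    clf = LogisticRegression(random_state=seed, max_iter=1000)
    clf.fit(train_embeddings, train_labels)
    return clf


# Synthetic stand-ins for real embeddings: two well-separated 8-dim clusters.
rng = np.random.default_rng(0)
demo_train = np.vstack([
    rng.normal(-2.0, 0.5, size=(50, 8)),  # class 0
    rng.normal(2.0, 0.5, size=(50, 8)),   # class 1
])
demo_labels = np.array([0] * 50 + [1] * 50)

demo_clf = train_embedding_classifier(demo_train, demo_labels)
demo_pred = demo_clf.predict(np.array([[-2.0] * 8, [2.0] * 8]))
print(demo_pred)
```

In the notebook itself, the same helper would be called as `train_embedding_classifier(train_embeddings, data["train"]["label"])`, and the result of `clf.predict(test_embeddings)` could be passed to `evaluate_performance`.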


### Model I/O: Loading Quantised Models with LangChain



In [2]:
!pip install llama-cpp-python

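
LangChain's `LlamaCpp` wrapper takes `model_path`, which must point to a GGUF file on local disk, so the weights have to be fetched before the model can be opened. A minimal sketch, assuming the `huggingface_hub` package is installed; the file name `Phi-3-mini-4k-instruct-fp16.gguf`, the `models` directory, and both helper functions are assumptions for illustration, so check the repository's file list for the quantization you want.

```python
from pathlib import Path

REPO_ID = "microsoft/Phi-3-mini-4k-instruct-gguf"
FILENAME = "Phi-3-mini-4k-instruct-fp16.gguf"  # assumed file name; verify on the Hub


def fetch_gguf(repo_id=REPO_ID, filename=FILENAME, local_dir="models"):
    """Download one GGUF file from the Hugging Face Hub and return its local path."""
    # Imported lazily so the rest of the block works without huggingface_hub.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=repo_id, filename=filename, local_dir=local_dir)


def looks_like_local_gguf(path):
    """True only if `path` is an existing file with a .gguf extension."""
    candidate = Path(path)
    return candidate.suffix == ".gguf" and candidate.is_file()


# A bare repo id is not a local file, so it fails the check.
print(looks_like_local_gguf(REPO_ID))
```

Calling `fetch_gguf()` once and passing its return value as `model_path` keeps the repo id and the on-disk path clearly separate.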

In [16]:
from langchain_community.llms import LlamaCpp

# LlamaCpp needs a path to a GGUF file on local disk, not a Hugging Face repo id.
# Assumption: the file below was downloaded from microsoft/Phi-3-mini-4k-instruct-gguf
# into the working directory; adjust the name to the file you actually have.
llm = LlamaCpp(
    model_path = "Phi-3-mini-4k-instruct-fp16.gguf",
    n_gpu_layers = -1,
    max_tokens = 500,
    n_ctx = 2048,
    seed = 42,
    verbose = False
)

In [4]:
llm.invoke("Hi! My name is Manas. What is 1 + 1?")

In [99]:
import os
# from langchain_openai import ChatOpenAI
from langchain_google_genai import ChatGoogleGenerativeAI
from dotenv import load_dotenv

# Read GOOGLE_API_KEY from a local .env file into the environment.
load_dotenv()

# openai_llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

gemini_llm = ChatGoogleGenerativeAI(
    model="gemini-1.5-pro",               # e.g. "gemini-1.5-flash" or "gemini-pro"
    temperature=0,
    api_key=os.getenv("GOOGLE_API_KEY")
)

In [100]:
react_template = """Answer the following questions as best you can. You have access to the following tools:
    {tools}

    Use the following format:

    Question: the input question you must answer
    Thought: you should always think about what to do
    Action: the action to take, should be one of [{tool_names}]
    Action Input: the input to the action
    Observation: the result of the action
    ... (this Thought/Action/Action Input/Observation can repeat N times)
    Thought: I now know the final answer
    Final Answer: the final answer to the original input question

    Begin!

    Question: {input}
    Thought: {agent_scratchpad}

 """


In [101]:
from langchain_core.prompts import PromptTemplate

In [102]:
prompt = PromptTemplate(
    template=react_template,
    input_variables=["tools", "tool_names", "input", "agent_scratchpad"]
)
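
The four input variables are plain placeholders that get substituted into the template text before it is sent to the model. A minimal sketch of that substitution using Python's built-in `str.format` on a shortened template; the tool descriptions, the shortened template, and the example question are assumptions for illustration, and this is not LangChain's own rendering code.

```python
# Shortened stand-in for the ReAct template, with the same four placeholders.
mini_template = (
    "You have access to the following tools:\n{tools}\n\n"
    "Action: the action to take, should be one of [{tool_names}]\n\n"
    "Question: {input}\n"
    "Thought: {agent_scratchpad}"
)

# Hypothetical tool names and descriptions.
demo_tools = {
    "Calculator": "Useful for when you need to answer questions about math.",
    "duckduck": "A web search engine.",
}

filled = mini_template.format(
    tools="\n".join(f"{name}: {desc}" for name, desc in demo_tools.items()),
    tool_names=", ".join(demo_tools),
    input="What is 1 + 1?",
    agent_scratchpad="",  # empty on the first step; grows as the agent acts
)

print(filled)
```

The scratchpad slot is where earlier Thought/Action/Observation text would be appended on later steps, so the model sees its own history each time.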

In [103]:
from langchain_core.tools import Tool
from langchain_community.agent_toolkits.load_tools import load_tools
from langchain_community.tools import DuckDuckGoSearchResults

In [104]:
search = DuckDuckGoSearchResults()

search_tool = Tool(
    name="duckduck",
    description="A web search engine. Use this as a search engine for general queries.",
    func=search.run,
)

In [105]:
tools = load_tools(["llm-math"], llm=gemini_llm)
tools.append(search_tool)

In [106]:
from langchain.agents import AgentExecutor, create_react_agent


In [107]:
agent = create_react_agent(
    llm=gemini_llm,
    tools=tools,
    prompt=prompt
)

agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    verbose=True,
    handle_parsing_errors=True
)
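
At each step the executor has to decide, from the model's raw text, whether to call a tool or stop. A simplified sketch of that decision, assuming the `Action:` / `Action Input:` / `Final Answer:` markers from the prompt format; the function `parse_step` is hypothetical and is not LangChain's actual output parser.

```python
import re


def parse_step(text):
    """Classify one model reply as a tool call or a final answer."""
    final = re.search(r"Final Answer:\s*(.*)", text, re.DOTALL)
    if final:
        return ("finish", final.group(1).strip())

    action = re.search(r"Action:\s*(.*?)\s*\n\s*Action Input:\s*(.*)", text, re.DOTALL)
    if action:
        tool_name = action.group(1).strip()
        tool_input = action.group(2).strip().strip('"')
        return ("action", tool_name, tool_input)

    # Neither marker found: the reply does not follow the expected format.
    raise ValueError("Could not parse model output")


step = parse_step(
    "Thought: I need the price.\n"
    "Action: duckduck\n"
    'Action Input: "price of 13-inch macbook pro"'
)
print(step)  # ('action', 'duckduck', 'price of 13-inch macbook pro')
```

A reply that matches neither pattern raises an error here; in the notebook, `handle_parsing_errors=True` asks the executor to recover from malformed replies instead of stopping.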

In [108]:
agent_executor.invoke(
    {
        "input": "What is the current price of a MacBook Pro in USD? How much would it cost in EUR if the exchange rate is 0.85 EUR for 1 USD."
    }
)

> Entering new AgentExecutor chain...
Question: What is the current price of a MacBook Pro in USD? How much would it cost in EUR if the exchange rate is 0.85 EUR for 1 USD.
Thoughts: I need to find the current price of a MacBook Pro in USD. Then I can convert that price to EUR using the given exchange rate.  I can use DuckDuckGo to search for the price. Since there are various MacBook Pro models (different screen sizes, processors, storage), I'll try to find a base model price.
Action: duckduck
Action Input: "price of 13-inch macbook pro"
snippet: The base 13-inch configuration with a 256GB SSD started at $1,299, matching the retail entry price of the Intel 13-inch MacBook Pro. The 13-inch MacBook Pro is the last one with a Touch Bar., title: Best MacBook Pro Deals for April 2025 | Save up to $1,600 - AppleInsider, link: https://appleinsider.com/deals/best-macbook-pro-deals, snippet: Retail price starts at $1,299; M1 MacBook Pro 13-inch (Late 2020

{'input': 'What is the current price of a MacBook Pro in USD? How much would it cost in EUR if the exchange rate is 0.85 EUR for 1 USD.',
 'output': 'A base 13-inch MacBook Pro costs approximately $1299.  This is equal to approximately 1104.15 EUR.'}
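
The conversion in the final answer is a single multiplication, so it is easy to verify by hand. A quick check using the price and exchange rate from the run above, with `Decimal` chosen here to avoid binary floating-point rounding in a currency amount.

```python
from decimal import Decimal

price_usd = Decimal("1299")      # base price reported by the agent
rate = Decimal("0.85")           # exchange rate stated in the question

converted = price_usd * rate
print(converted)  # 1104.15
```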

In [None]:
import os
import google.generativeai as genai

# 1) Point the client at your API key.
genai.configure(api_key=os.getenv("GOOGLE_API_KEY"))

# 2) Print each model's name and the generation methods it supports.
for m in genai.list_models():
    methods = getattr(m, "supported_generation_methods", None)
    print(f"{m.name}: {methods}")

models/chat-bison-001: ['generateMessage', 'countMessageTokens']
models/text-bison-001: ['generateText', 'countTextTokens', 'createTunedTextModel']
models/embedding-gecko-001: ['embedText', 'countTextTokens']
models/gemini-1.0-pro-vision-latest: ['generateContent', 'countTokens']
models/gemini-pro-vision: ['generateContent', 'countTokens']
models/gemini-1.5-pro-latest: ['generateContent', 'countTokens']
models/gemini-1.5-pro-001: ['generateContent', 'countTokens', 'createCachedContent']
models/gemini-1.5-pro-002: ['generateContent', 'countTokens', 'createCachedContent']
models/gemini-1.5-pro: ['generateContent', 'countTokens']
models/gemini-1.5-flash-latest: ['generateContent', 'countTokens']
models/gemini-1.5-flash-001: ['generateContent', 'countTokens', 'createCachedContent']
models/gemini-1.5-flash-001-tuning: ['generateContent', 'countTokens', 'createTunedModel']
models/gemini-1.5-flash: ['generateContent', 'countTokens']
models/gemini-1.5-flash-002: ['generateContent', 'countToken