# M.A.D Post Labs (LLM Module)

In this lab, you will implement a basic LLM-powered application. There are **three parts** that you need to complete:
<br>
* OpenAI's API and ReAct Agent
* LLM Applications
* Sentiment Analysis
</br>

**Instructions**

* Completing all tasks will earn you **100 points.**
* You can navigate through your tasks by looking for the **TODO, TO DO, fill in the blank, _** text in this lab.
* Please **keep the \<TODO_OUTPUT_BEGIN\> tag** for grading purposes.
* Any case of plagiarism will result in a **zero score for all plagiarized submissions**.

# Prerequisites

Before we start, please fill in your information. This information will be used in the grading system.

In [None]:
### ***** RUN THIS CELL *****

AUTHOR_NAME = "RUCHIDA"
AUTHOR_SURNAME = "PITHAKSIRIPAN"
AUTHOR_NICKNAME = "FAI"

### ***** RUN THIS CELL *****

You need to install the following libraries to assist in completing the labs.

In [None]:
!pip install -q requests==2.32.3
!pip install -q openai==1.41.1
!pip install -q langchain==0.2.2
!pip install -q langchain_community==0.2.3
!pip install -q langchain-openai==0.1.8
!pip install -q langgraph==0.0.64

You also need an OpenAI account with **valid API keys**.

1) If you have not registered before, you can do so at https://platform.openai.com/login?launch

2) Then, the API key can be generated from https://platform.openai.com/api-keys

Note: For first-time use, there will be a $5 free credit from OpenAI (valid for 30 days).

# Part 1: OpenAI's API and ReAct Agent [35 Points]

## Task 1: Calling OpenAI APIs [20 Points]

In this task, you will write code that calls the OpenAI APIs.

**Before we begin, ensure the ‘OPENAI_API_KEY’ is added in the Secrets tab (key icon on the left).**

<img src ="https://raw.githubusercontent.com/jrkns/cloud_imgs/main/210824_colab_secret.png" width="500" height="300">

#### TODO-1: Send your first OpenAI API request. [2.5 Points]

Ensure that you correctly generate and configure the OpenAI API key by running the code below. The output **should display your first and last name** for grading purposes. If it doesn't, **please set the variables and rerun the cell in the prerequisite section above**.

In [None]:
### Make sure OPENAI_API_KEY is added in the Secrets tab
from google.colab import userdata
from openai import OpenAI
import json

################ DON'T CHANGE THE CODE IN THIS BLOCK ################
#
### TODO-1: Sent first OpenAI API request. [2.5 Points]
client = OpenAI(api_key=userdata.get("OPENAI_API_KEY"))
completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": f"My name is \"{AUTHOR_NAME} {AUTHOR_SURNAME}\", What is my name?"}
    ],
    temperature=0,
)
#
print("<TODO_1_OUTPUT_BEGIN>")
print(completion)
print(completion.choices[0].message)
print("<TODO_1_OUTPUT_END>")
#
################ DON'T CHANGE THE CODE IN THIS BLOCK ################

<TODO_1_OUTPUT_BEGIN>
ChatCompletion(id='chatcmpl-AIUeEIF8VqGqdHTRfqgAhT7POOcbc', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='Your name is Ruchida Pithaksiripan.', refusal=None, role='assistant', function_call=None, tool_calls=None))], created=1728971078, model='gpt-4o-mini-2024-07-18', object='chat.completion', service_tier=None, system_fingerprint='fp_e2bde53e6e', usage=CompletionUsage(completion_tokens=12, prompt_tokens=27, total_tokens=39, prompt_tokens_details={'cached_tokens': 0}, completion_tokens_details={'reasoning_tokens': 0}))
ChatCompletionMessage(content='Your name is Ruchida Pithaksiripan.', refusal=None, role='assistant', function_call=None, tool_calls=None)
<TODO_1_OUTPUT_END>


#### TODO-2: Look at the EXPECTED-OUTPUT and try to guess the prompts. [5 Points]

Your task is to guess the original prompt that generated the contents below. The **output may differ** from the expected output in details (e.g., location names), but the **schema of the output should match**.

**Hint:** Consider passing additional parameters to OpenAI to ensure the JSON output is valid. https://platform.openai.com/docs/guides/structured-outputs/json-mode

In [None]:
### TODO-2: Look at the EXPECTED-OUTPUT and try to guess the prompts. [5 Points]
###         The output may vary in details, but the main components should match.
TODO_2_PROMPTS = """
    Generate a JSON object with a key 'trip' that contains a list of dictionaries.
    Each dictionary represents a destination and consists of the following keys:
    - 'name' (the name of the place)
    - 'station' (the nearest BTS/MRT station, including any additional transport details like taking a ferry)
    - 'time' (morning or afternoon)
    Using a one day trip information in csv format below
    <data>
    Chatuchak Weekend Market, Mo Chit, morning
    Jim Thompson House, National Stadium, afternoon
    Wat Arun (Temple of Dawn), Sathorn (Taksin) - then take a ferry, morning
    Asiatique The Riverfront, Sathorn (Taksin) - then take a ferry, afternoon
    </data>
"""

client = OpenAI(api_key=userdata.get("OPENAI_API_KEY"))
completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": TODO_2_PROMPTS}
    ],
    # temperature=?,
    response_format={
        "type": "json_object"
    }

)

### EXPECTED-OUTPUT (may vary):
# <TODO_2_OUTPUT_BEGIN>
# {'trip': [{'name': 'Chatuchak Weekend Market',
#    'station': 'Mo Chit',
#    'time': 'morning'},
#   {'name': 'Jim Thompson House',
#    'station': 'National Stadium',
#    'time': 'afternoon'},
#   {'name': 'Wat Arun (Temple of Dawn)',
#    'station': 'Sathorn (Taksin) - then take a ferry',
#    'time': 'morning'},
#   {'name': 'Asiatique The Riverfront',
#    'station': 'Sathorn (Taksin) - then take a ferry',
#    'time': 'afternoon'}]}
# <TODO_2_OUTPUT_END>

################ DON'T CHANGE THE CODE IN THIS BLOCK ################
print("<TODO_2_OUTPUT_BEGIN>")
print(json.loads(completion.choices[0].message.content))
print("<TODO_2_OUTPUT_END>")
################ DON'T CHANGE THE CODE IN THIS BLOCK ################

<TODO_2_OUTPUT_BEGIN>
{'trip': [{'name': 'Chatuchak Weekend Market', 'station': 'Mo Chit', 'time': 'morning'}, {'name': 'Jim Thompson House', 'station': 'National Stadium', 'time': 'afternoon'}, {'name': 'Wat Arun (Temple of Dawn)', 'station': 'Sathorn (Taksin) - then take a ferry', 'time': 'morning'}, {'name': 'Asiatique The Riverfront', 'station': 'Sathorn (Taksin) - then take a ferry', 'time': 'afternoon'}]}
<TODO_2_OUTPUT_END>
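
Because only the schema has to match, it can help to check the parsed result structurally rather than by eye. The following is a minimal sketch, assuming the three keys shown in the expected output; `check_trip_schema` and `REQUIRED_KEYS` are hypothetical names and are not part of the grading code.

```python
from typing import Any, List

# Keys every destination is expected to carry, per the EXPECTED-OUTPUT above.
REQUIRED_KEYS = {"name", "station", "time"}

def check_trip_schema(payload: Any) -> List[str]:
    """Return a list of problems; an empty list means the schema matches."""
    if not isinstance(payload, dict) or "trip" not in payload:
        return ["top-level object must be a dict with a 'trip' key"]
    if not isinstance(payload["trip"], list):
        return ["'trip' must be a list"]
    problems = []
    for i, stop in enumerate(payload["trip"]):
        if not isinstance(stop, dict):
            problems.append(f"trip[{i}] is not a dict")
            continue
        missing = REQUIRED_KEYS - set(stop)
        if missing:
            problems.append(f"trip[{i}] is missing {sorted(missing)}")
    return problems

# Sample data copied from the expected output (first entry only).
sample = {"trip": [{"name": "Chatuchak Weekend Market",
                    "station": "Mo Chit",
                    "time": "morning"}]}
print(check_trip_schema(sample))
```

In the notebook, the same function could be applied to `json.loads(completion.choices[0].message.content)`.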


#### TODO-3: Look at the EXPECTED-OUTPUT and try to guess the prompts. [5 Points]

Your task is to guess the original prompt that generated the contents below. The **output may differ** from the expected output in details (e.g., location names), but the **schema of the output should match**.

**Hint:** Consider passing additional parameters to OpenAI to ensure the JSON output is valid. https://platform.openai.com/docs/guides/structured-outputs/json-mode

In [None]:
### TODO-3: Look at the EXPECTED-OUTPUT and try to guess the prompts. [5 Points]
###         The output may vary in details, but the main components should match.
TODO_3_PROMPTS = """
    You are an assistant that extracts structured information from customer feedback.
For each input message, return a JSON object with the following fields:
- 'name': The person's name if mentioned, otherwise 'unknown'.
- 'sentiment': 'positive' if the feedback is overall positive, and 'negative' if the feedback is negative.
- 'products': A list of products mentioned in the feedback (e.g., 'credit_card', 'debit_card').

For example, if the input is: "Hi, I’m Fai. Your credit card is amazing."
Your response should be:
{"name": "Fai", "sentiment": "positive", "products": ["credit_card"]}

If the input is: "Your debit card was not working, so I couldn’t use it at the theatre yesterday :("
Your response should be:
{"name": "unknown", "sentiment": "negative", "products": ["debit_card"]}
"""

client = OpenAI(api_key=userdata.get("OPENAI_API_KEY"))

def secret_logic(input_string):
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": TODO_3_PROMPTS},
            {"role": "user", "content": f"text: {input_string}\noutput:"}
        ],
        temperature=0,
        # JSON mode keeps the reply parseable by json.loads below
        response_format={"type": "json_object"},
    )
    return json.loads(completion.choices[0].message.content)

### EXPECTED-OUTPUT:
# <TODO_3_OUTPUT_BEGIN>
# INPUT: Hi, I’m Bob. Your credit card is amazing.
# OUTPUT: {'name': 'Bob', 'sentiment': 'positive', 'products': ['credit_card']}
# INPUT: Your debit card was not working, so I couldn’t use it at the airport yesterday :(
# OUTPUT: {'name': 'unknown', 'sentiment': 'negative', 'products': ['debit_card']}
# INPUT: Last week, my wife Alice lost her wallet, and both her debit and credit cards are gone. Your customer service responded quickly to freeze those cards. Thank you again.
# OUTPUT: {'name': 'Alice', 'sentiment': 'positive', 'products': ['debit_card', 'credit_card']}
# <TODO_3_OUTPUT_END>

################ DON'T CHANGE THE CODE IN THIS BLOCK ################
inpts = [
    "Hi, I’m Bob. Your credit card is amazing.",
    "Your debit card was not working, so I couldn’t use it at the airport yesterday :(",
    "Last week, my wife Alice lost her wallet, and both her debit and credit cards are gone. Your customer service responded quickly to freeze those cards. Thank you again."
]
#
print("<TODO_3_OUTPUT_BEGIN>")
for inpt in inpts:
    print("INPUT:", inpt)
    print("OUTPUT:", secret_logic(inpt))
print("<TODO_3_OUTPUT_END>")
################ DON'T CHANGE THE CODE IN THIS BLOCK ################

<TODO_3_OUTPUT_BEGIN>
INPUT: Hi, I’m Bob. Your credit card is amazing.
OUTPUT: {'name': 'Bob', 'sentiment': 'positive', 'products': ['credit_card']}
INPUT: Your debit card was not working, so I couldn’t use it at the airport yesterday :(
OUTPUT: {'name': 'unknown', 'sentiment': 'negative', 'products': ['debit_card']}
INPUT: Last week, my wife Alice lost her wallet, and both her debit and credit cards are gone. Your customer service responded quickly to freeze those cards. Thank you again.
OUTPUT: {'name': 'Alice', 'sentiment': 'positive', 'products': ['debit_card', 'credit_card']}
<TODO_3_OUTPUT_END>


#### TODO-4: Look at the EXPECTED-OUTPUT and try to guess the prompts. [5 Points]

Your task is to guess the original prompt that generated the contents below. **The output should match perfectly.**

In [None]:
### TODO-4: Look at the EXPECTED-OUTPUT and try to guess the prompts. [5 Points]
###         The output should match the expected output exactly.
TODO_4_PROMPTS = """
    Write a Christmas tree using the $ symbol.
    Only respond with the tree.
    The size should have be half of the original tree, if the original have odd line, plus one more line.
"""

client = OpenAI(api_key=userdata.get("OPENAI_API_KEY"))
completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": "Write a Christmas tree using the * symbol. Only respond with the tree."},
        {"role": "assistant", "content": "```\n      *      \n     ***     \n    *****    \n   *******   \n  *********  \n *********** \n*************\n      |      \n```"},
        {"role": "user", "content": TODO_4_PROMPTS},
    ],
    temperature=0,
)

### BEFORE-OUTPUT:
# ```
#       *
#      ***
#     *****
#    *******
#   *********
#  ***********
# *************
#       |
# ```

### EXPECTED-OUTPUT:
# <TODO_4_OUTPUT_BEGIN>
# ```
#       $
#      $$$
#     $$$$$
#    $$$$$$$
#       |
# ```
# <TODO_4_OUTPUT_END>

################ DON'T CHANGE THE CODE IN THIS BLOCK ################
print("<TODO_4_OUTPUT_BEGIN>")
print(completion.choices[0].message.content)
print("<TODO_4_OUTPUT_END>")
################ DON'T CHANGE THE CODE IN THIS BLOCK ################

<TODO_4_OUTPUT_BEGIN>
```
    $    
   $$$   
  $$$$$  
 $$$$$$$ 
    |    
```
<TODO_4_OUTPUT_END>
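
The expected output suggests the sizing rule: the original tree has 7 foliage rows, and the target has 4, which is half of 7 rounded up. A small sketch for rendering a reference tree locally can make that arithmetic concrete. `render_tree` is a hypothetical helper written for this illustration; it does not call the API and is not part of the lab.

```python
import math
from typing import Optional

def render_tree(rows: int, symbol: str = "*", width: Optional[int] = None) -> str:
    """Return a centered text tree with `rows` foliage rows and a one-line trunk."""
    if width is None:
        width = 2 * rows - 1  # width of the widest foliage row
    lines = [(symbol * (2 * i + 1)).center(width) for i in range(rows)]
    lines.append("|".center(width))
    # Trailing spaces are dropped so comparisons ignore right-hand padding.
    return "\n".join(line.rstrip() for line in lines)

original_rows = 7
half_rows = math.ceil(original_rows / 2)  # 7 rows -> 4 rows
print(render_tree(half_rows, "$"))
```

Passing `width=13` keeps the smaller tree centered under the original 13-character-wide tree, which is how the expected output is aligned.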


#### TODO-5: Write a code that includes prompts and Vision API calls. [2.5 Points]


**Working with images**

In this task, you will implement code to use the OpenAI Vision API to identify **legs present in the following image.** **The output may vary.**

Image: https://takabb.com/public/images/1638819994.png

**Hint:** For more details about the Vision API, please visit https://platform.openai.com/docs/guides/vision

In [None]:
### TODO-5: Write a code that includes prompts and Vision API calls. [2.5 Points]

### EXPECTED-OUTPUT (may vary):
# <TODO_5_OUTPUT_BEGIN>
# The image features two large centipedes, each with visible legs.
# Each centipede has 21 pairs of legs, making a total of 42 legs per centipede.
# Since there are two centipedes, the total number of centipede legs in the image is 42 + 42 = 84 legs.
# <TODO_5_OUTPUT_END>

print("<TODO_5_OUTPUT_BEGIN>")

# <Write your code here>
response = client.chat.completions.create(
  model="gpt-4o-mini",
  messages=[
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": """
          Analyze the following image and describe How many large centipedes and how many legs they have
          using the example format below.
          Example Answer:
          The image features one centipedes, each with visible legs.
          Each centipede has 2 pairs of legs, making a total of 4 legs per centipede.
          Since there are one centipedes, the total number of centipede legs in the image is 4 legs.
          Answer:
          """
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://takabb.com/public/images/1638819994.png",
          },
        },
      ],
    }
  ],
  max_tokens=300,
)

print(response.choices[0].message.content)


print("<TODO_5_OUTPUT_END>")

<TODO_5_OUTPUT_BEGIN>
The image features five large centipedes, each with visible legs. Each centipede has 15 pairs of legs, making a total of 30 legs per centipede. Since there are five centipedes, the total number of centipede legs in the image is 150 legs.
<TODO_5_OUTPUT_END>


## Task 2: Building ReAct Agent with Langchain [15 Points]

In this task, you are required to implement code that defines tools for the LLM to use. The tools include:
* Loan limit calculator (TODO-6)
* KBTG's job search (TODO-7)


#### TODO-6: Write loan limit calculator function with the following logic [5 Points]

* income <= 0 , loan_limit = 0
* 0 < income < 15000 , loan_limit = 1 x income
* 15000 <= income < 50000, loan_limit = 3 x income
* 50000 <= income, loan_limit = 5 x income

In [None]:
### TODO-6: Write loan limit calculator function with the following logic [5 Points]
### income <= 0 , loan_limit = 0
### 0 < income < 15000 , loan_limit = 1 x income
### 15000 <= income < 50000, loan_limit = 3 x income
### 50000 <= income, loan_limit = 5 x income

from langchain_core.tools import tool

@tool
def calculate_loan_limit(monthly_income: int):
    """calculate loan limit from monthly income"""
    if monthly_income <= 0:
        return 0
    elif monthly_income < 15000:
        return monthly_income
    elif monthly_income < 50000:
        return 3 * monthly_income
    else:
        return 5 * monthly_income

### EXPECTED-OUTPUT:
# <TODO_6_OUTPUT_BEGIN>
# INPUT: -999
# OUTPUT: 0
# INPUT: 0
# OUTPUT: 0
# INPUT: 14500
# OUTPUT: 14500
# INPUT: 15000
# OUTPUT: 45000
# INPUT: 50000
# OUTPUT: 250000
# INPUT: 100000
# OUTPUT: 500000
# <TODO_6_OUTPUT_END>

################ DON'T CHANGE THE CODE IN THIS BLOCK ################
print("<TODO_6_OUTPUT_BEGIN>")
income_lists = [-999, 0, 14500, 15000, 50000, 100000]
for income in income_lists:
    print("INPUT:", income)
    print("OUTPUT:", int(calculate_loan_limit({"monthly_income": income})))
print("<TODO_6_OUTPUT_END>")
################ DON'T CHANGE THE CODE IN THIS BLOCK ################

<TODO_6_OUTPUT_BEGIN>
INPUT: -999
OUTPUT: 0
INPUT: 0
OUTPUT: 0
INPUT: 14500
OUTPUT: 14500
INPUT: 15000
OUTPUT: 45000
INPUT: 50000
OUTPUT: 250000
INPUT: 100000
OUTPUT: 500000
<TODO_6_OUTPUT_END>




#### TODO-7: Extract the remaining fields. (job_title, job_location) [5 Points]

In this task, the majority of the code is provided. This code makes a POST request to www.kbtg.tech to retrieve job openings based on the given keywords. Your task is to extract the remaining job information, including the job title and job location.


**Hint:** An example for extracting job_company_name is provided. You might need to **log the raw response** from the API to complete the code.
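
Deeply nested paths like the one used for `job_company_name` raise `KeyError` or `IndexError` as soon as one level is missing. As a sketch of a more forgiving lookup, the helper below walks a path and falls back to a default. `dig` is a hypothetical helper, and the sample record is made up: it mirrors only the field names that appear in the code of this task, not the real API response.

```python
from typing import Any

def dig(obj: Any, *path: Any, default: Any = None) -> Any:
    """Follow `path` (dict keys or list indexes) through nested data."""
    for step in path:
        try:
            obj = obj[step]
        except (KeyError, IndexError, TypeError):
            return default
    return obj

# Hypothetical record; the real response contains many more fields.
sample = {
    "jobRequisition": {
        "legalEntity_obj": {"name": "KASIKORN SOFT"},
        "jobReqLocale": {"results": [{"externalTitle": "Business Analyst (Wealth)"}]},
        "cust_WorkLocation": {"results": []},  # simulates a missing location
    }
}

title = dig(sample, "jobRequisition", "jobReqLocale", "results", 0, "externalTitle")
location = dig(sample, "jobRequisition", "cust_WorkLocation", "results", 0,
               "picklistLabels", "results", 0, "label", default="unknown")
print(title, "|", location)
```

To find the right path in the first place, printing one raw record with `json.dumps(record, indent=2)` is usually enough.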

In [None]:
import json
import re
import requests

from langchain_core.tools import tool

@tool
def search_kbtg_jobs(keyword: str):
    """Search KBTG job openings by keyword."""

    query = re.sub(r'\s+', '+', keyword)
    res = requests.post(
        "https://www.kbtg.tech/career-factor",
        headers={'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'},
        data=f"page=1&limit=3&keyword={query}"
    )
    res = res.content.decode("utf-8")
    res = json.loads(res)["data"]

    job_search_results = []
    for job_data in res:
        ### TODO-7: Extract the remaining fields. (job_title, job_location) [5 Points]
        job_company_name = job_data["jobRequisition"]["legalEntity_obj"]["name"]
        job_title = job_data["jobRequisition"]["jobReqLocale"]["results"][0]["externalTitle"]
        job_location = job_data["jobRequisition"]["cust_WorkLocation"]["results"][0]["picklistLabels"]["results"][0]["label"]
        job_search_results.append({"title": job_title, "company": job_company_name, "location": job_location})
    return job_search_results

### EXPECTED-OUTPUT (may vary):
# <TODO_7_OUTPUT_BEGIN>
# [{'title': 'Business Analyst (Wealth)',
#   'company': 'KASIKORN SOFT',
#   'location': 'KBTG Building'},
#  {'title': 'Advanced Business Analyst (Digital Channel)',
#   'company': 'KASIKORN SOFT',
#   'location': 'KBTG Building'},
#  {'title': 'Business Analyst (Non-Mobile)',
#   'company': 'KASIKORN SOFT',
#   'location': 'KBTG Building'}]
# <TODO_7_OUTPUT_END>

################ DON'T CHANGE THE CODE IN THIS BLOCK ################
print("<TODO_7_OUTPUT_BEGIN>")
print(search_kbtg_jobs("Analyst"))
print("<TODO_7_OUTPUT_END>")
################ DON'T CHANGE THE CODE IN THIS BLOCK ################

<TODO_7_OUTPUT_BEGIN>
[{'title': 'Advanced Business Analyst. (KSO3BI)', 'company': 'KASIKORN SOFT', 'location': 'KBTG Building, Muang Thong Thani'}, {'title': 'Business Analyst (Credit Card)', 'company': 'KASIKORN SOFT', 'location': 'KBTG Building, Muang Thong Thani'}, {'title': 'Senior Business Analyst (OFSAA) [Hybrid]', 'company': 'KASIKORN SOFT', 'location': 'KBTG Building, Muang Thong Thani'}]
<TODO_7_OUTPUT_END>


In [None]:
from google.colab import userdata

from langchain_core.messages import HumanMessage, SystemMessage
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent

model = ChatOpenAI(openai_api_key=userdata.get("OPENAI_API_KEY"),
                   model="gpt-4o-mini", temperature=0.0)

agent_executor = create_react_agent(model, [search_kbtg_jobs, calculate_loan_limit])

def log_lc_response(response):
    for msg in response['messages']:
        msg = msg.to_json()
        if msg["id"][-1] == "HumanMessage":
            print("HUMAN:", msg["kwargs"]["content"])
        elif msg["id"][-1] == "AIMessage":
            if "additional_kwargs" in msg["kwargs"]:
                print("TOOL-REQ::", msg["kwargs"]["additional_kwargs"]["tool_calls"][0]["function"])
            else:
                print("AI:", msg["kwargs"]["content"])
        elif msg["id"][-1] == "ToolMessage":
            print("TOOL-RES:", msg["kwargs"]["content"])

#### TODO-8: Test agent with tools (calculate_loan_limit) [2 Points]

The code is already provided. Run this cell to validate that TODO-6 and TODO-7 are correct.

In [None]:
### TODO-8: Test agent with tools (calculate_loan_limit) [2 Points]

### EXPECTED-OUTPUT:
# <TODO_8_OUTPUT_BEGIN>
# HUMAN: If I have a monthly income of 30,000, what loan amount can I get?
# TOOL-REQ:: {'arguments': '{"monthly_income":30000}', 'name': 'calculate_loan_limit'}
# TOOL-RES: 90000
# AI: With a monthly income of 30,000, you can get a loan amount of up to 90,000.
# <TODO_8_OUTPUT_END>

################ DON'T CHANGE THE CODE IN THIS BLOCK ################
response = agent_executor.invoke(
    {"messages": [HumanMessage(content= "If I have a monthly income of 30,000, what loan amount can I get?")]}
)
#
print("<TODO_8_OUTPUT_BEGIN>")
log_lc_response(response)
print("<TODO_8_OUTPUT_END>")
################ DON'T CHANGE THE CODE IN THIS BLOCK ################

<TODO_8_OUTPUT_BEGIN>
HUMAN: If I have a monthly income of 30,000, what loan amount can I get?
TOOL-REQ:: {'arguments': '{"monthly_income":30000}', 'name': 'calculate_loan_limit'}
TOOL-RES: 90000
AI: With a monthly income of 30,000, you can get a loan amount of up to 90,000.
<TODO_8_OUTPUT_END>
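
The `TOOL-REQ` line in the log is the model's request: a tool name plus its arguments encoded as a JSON string. The sketch below shows roughly how such a request maps onto a function call. It is a simplified illustration, not LangGraph's actual implementation; `loan_limit`, `TOOLS`, and `dispatch` are hypothetical names, and `loan_limit` is a plain-Python stand-in that repeats the tiers from TODO-6.

```python
import json

def loan_limit(monthly_income: int) -> int:
    """Plain-Python stand-in for calculate_loan_limit (same tiers as TODO-6)."""
    if monthly_income <= 0:
        return 0
    if monthly_income < 15000:
        return monthly_income
    if monthly_income < 50000:
        return 3 * monthly_income
    return 5 * monthly_income

# Hypothetical registry mapping tool names to callables.
TOOLS = {"calculate_loan_limit": loan_limit}

def dispatch(tool_request: dict):
    """Look up the requested tool and call it with the decoded arguments."""
    func = TOOLS[tool_request["name"]]
    kwargs = json.loads(tool_request["arguments"])
    return func(**kwargs)

# Request copied from the TOOL-REQ line in the log above.
request = {"arguments": '{"monthly_income":30000}', "name": "calculate_loan_limit"}
print(dispatch(request))
```

The returned value corresponds to the `TOOL-RES` line, which the agent passes back to the model so it can write the final answer.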


#### TODO-9: Test agent with tools (search_kbtg_jobs) [2 Points]

The code is already provided. Run this cell to validate that TODO-6 and TODO-7 are correct.

In [None]:
### TODO-9: Test agent with tools (search_kbtg_jobs) [2 Points]

### EXPECTED-OUTPUT:
# <TODO_9_OUTPUT_BEGIN>
# HUMAN: Hello, Do you have Analyst Job opening?
# TOOL-REQ:: {'arguments': '{"keyword":"Analyst"}', 'name': 'search_kbtg_jobs'}
# TOOL-RES: [{"title": "Business Analyst (Wealth)", "company": "KASIKORN SOFT", "location": "KBTG Building"}, {"title": "Advanced Business Analyst (Digital Channel)", "company": "KASIKORN SOFT", "location": "KBTG Building"}, {"title": "Business Analyst (Non-Mobile)", "company": "KASIKORN SOFT", "location": "KBTG Building"}]
# AI: Yes, there are several Analyst job openings available:
#
# 1. **Business Analyst (Wealth)**
#    - Company: KASIKORN SOFT
#    - Location: KBTG Building
#
# 2. **Advanced Business Analyst (Digital Channel)**
#    - Company: KASIKORN SOFT
#    - Location: KBTG Building
#
# 3. **Business Analyst (Non-Mobile)**
#    - Company: KASIKORN SOFT
#    - Location: KBTG Building
#
# If you need more information about any specific position, feel free to ask!
# <TODO_9_OUTPUT_END>

################ DON'T CHANGE THE CODE IN THIS BLOCK ################
response = agent_executor.invoke(
    {"messages": [HumanMessage(content= "Hello, Do you have Analyst job opening?")]}
)
#
print("<TODO_9_OUTPUT_BEGIN>")
log_lc_response(response)
print("<TODO_9_OUTPUT_END>")
################ DON'T CHANGE THE CODE IN THIS BLOCK ################

<TODO_9_OUTPUT_BEGIN>
HUMAN: Hello, Do you have Analyst job opening?
TOOL-REQ:: {'arguments': '{"keyword":"Analyst"}', 'name': 'search_kbtg_jobs'}
TOOL-RES: [{"title": "Advanced Business Analyst. (KSO3BI)", "company": "KASIKORN SOFT", "location": "KBTG Building, Muang Thong Thani"}, {"title": "Business Analyst (Credit Card)", "company": "KASIKORN SOFT", "location": "KBTG Building, Muang Thong Thani"}, {"title": "Senior Business Analyst (OFSAA) [Hybrid]", "company": "KASIKORN SOFT", "location": "KBTG Building, Muang Thong Thani"}]
AI: Yes, there are several Analyst job openings available:

1. **Advanced Business Analyst (KSO3BI)**
   - **Company:** KASIKORN SOFT
   - **Location:** KBTG Building, Muang Thong Thani

2. **Business Analyst (Credit Card)**
   - **Company:** KASIKORN SOFT
   - **Location:** KBTG Building, Muang Thong Thani

3. **Senior Business Analyst (OFSAA) [Hybrid]**
   - **Company:** KASIKORN SOFT
   - **Location:** KBTG Building, Muang Thong Thani

If you need more info

#### TODO-10: Test agent without tools in general cases. [1 Point]

The code is already provided. Run this cell to validate that TODO-6 and TODO-7 are correct.

In [None]:
### TODO-10: Test agent without tools in general cases. [1 Point]

### EXPECTED-OUTPUT:
# <TODO_10_OUTPUT_BEGIN>
# HUMAN: Howdy?
# AI: Hello! How can I assist you today?
# <TODO_10_OUTPUT_END>

################ DON'T CHANGE THE CODE IN THIS BLOCK ################
response = agent_executor.invoke(
    {"messages": [HumanMessage(content= "Howdy?")]}
)
#
print("<TODO_10_OUTPUT_BEGIN>")
log_lc_response(response)
print("<TODO_10_OUTPUT_END>")
################ DON'T CHANGE THE CODE IN THIS BLOCK ################

<TODO_10_OUTPUT_BEGIN>
HUMAN: Howdy?
AI: Hello! How can I assist you today?
<TODO_10_OUTPUT_END>


# Part 2: LLM Applications [35 Points]

## Task 1: LLM Application Laboratory

<h1> Laboratory Overview </h1>

<h2> Scenario </h2>

You manage a platform that hosts fine food reviews from various sources. To improve search functionality and enhance user experience, you need to categorize these reviews, analyze sentiment, and identify key topics such as taste, packaging, and freshness.

All data is stored in a CSV file and should be loaded into a Pandas `DataFrame` for processing.

<h2> Objective </h2>

The objective of this lab is to provide learners with hands-on experience in categorizing and analyzing food reviews using LangChain and OpenAI embeddings. Learners will:

- Define and implement a structured tagging schema for fine food reviews.
- Automatically categorize reviews and identify key topics using LangChain.
- Perform semantic search using text embeddings and cosine similarity.

By the end of this lab, learners will be able to build systems that enhance the discoverability and organization of food reviews through automated tagging and advanced search techniques.


<h2> Tasks </h2>

1. **Set up Environment**:

    - Install and import all necessary libraries for LangChain Tagging and Semantic Search
    - Load and pre-process data

2. **Create a `chain` object for tagging food reviews**:

    The `chain` should be composed of a `prompt_template` for the Tagging-LLM and an `llm` using the `gpt-3.5-turbo` model (or any other **OpenAI GPT** model) with a structured output format based on the defined `schema`.
   
    The required tagging fields in the `schema` are as follows:

    - `food_category`: A **string** representing the food category (e.g., Candy, Chips, Tea).
    - `sentiment`: A **string** indicating the sentiment of the review (Positive, Negative, Neutral).
    - `topic_of_concern`: A **list of strings** identifying the main aspects discussed in the review (e.g., Taste, Freshness, Packaging).

3. **Apply tagging function to review data**:

    Add each tagging field as a new column.

4. **Define Utility Functions**:
   - One function should be designed to vectorize/embed text data.
   - The second function should compare the cosine similarity between text embeddings.

5. **Apply your functions to answer the provided questions**:
   - Embed the text in the `tag` columns
   - Analyze and apply semantic search to answer the given question
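
As a rough sketch of what steps 4 and 5 involve, cosine similarity and a similarity-based ranking can be written in a few lines. The example uses only the standard library, and the toy 3-dimensional vectors stand in for real embeddings, which would come from an embedding model and have far more dimensions. `cosine_similarity` and `rank_by_similarity` are illustrative names, not the required function signatures for this lab.

```python
import math
from typing import Dict, List, Sequence, Tuple

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_by_similarity(query: Sequence[float],
                       corpus: Dict[str, Sequence[float]]) -> List[Tuple[str, float]]:
    """Return (key, score) pairs sorted from most to least similar to `query`."""
    scored = [(key, cosine_similarity(query, vec)) for key, vec in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy vectors, made up for illustration only.
toy_corpus = {
    "tea": [1.0, 0.0, 0.0],
    "coffee": [0.8, 0.6, 0.0],
    "chips": [0.0, 0.0, 1.0],
}
query_vector = [1.0, 0.1, 0.0]

ranking = rank_by_similarity(query_vector, toy_corpus)
print([key for key, _ in ranking])
```

With real data, the same ranking would typically be done with NumPy over the embedded `tag` columns instead of Python lists.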

<h2> Grading </h2>

This task will be evaluated based on two key outputs:

1. The `schema` implemented for the LLM tagging task.
2. Your responses to Questions 5.1 through 5.3.

The final evaluation will be based on **the output generated in the last cell** of this task, so please ensure you run it before submitting.

## 1. Set up Environment

### 1.1 Install and import all necessary libraries for LangChain Tagging and Semantic Search

LangChain and LLM-related Module

In [None]:
import openai
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

from langchain_core.pydantic_v1 import BaseModel, Field
from typing import List

Data Manipulation Module

In [None]:
import pandas as pd
import numpy as np
from tqdm import tqdm
tqdm.pandas()

### 1.2 Load and pre-process data

<h2> Data Explanation </h2>

The dataset used in this lab is sourced from the [Amazon Fine Food Reviews](https://www.kaggle.com/datasets/snap/amazon-fine-food-reviews) dataset on Kaggle. This dataset contains over 500,000 reviews of fine foods from Amazon, including details such as product information, review text, rating, and helpfulness scores.

In this lab, we randomly sample 100 rows for the tagging and semantic search tasks, and use only a subset of the original columns.

**Data in this Lab:**
- **`review_id`**: Unique identifier for each review.
- **`product_id`**: Unique identifier for the product.
- **`user_id`**: Unique identifier for the user.
- **`review_score`**: The rating given by the user (from 1 to 5).
- **`review_dt`**: The datetime for the review.
- **`review_title`**: A title of the review.
- **`review_text`**: The main content of the review.

**Usage:**
The tagging task focuses on two columns: `review_title` and `review_text`. The goal is to automatically tag reviews with meaningful labels and enable semantic search for better insights and recommendations.

**Source:**
You can find more details about the dataset [here](https://www.kaggle.com/datasets/snap/amazon-fine-food-reviews).


<h2> Data Acquisition </h2>

Run the `Downloading Cell` and the `Preprocessing Cell` to load and pre-process the data. If both succeed, the dataset will be stored in the variable `df`. **DO NOT CHANGE THE CODE!!!**

*Note:* If the `Downloading Cell` fails (the dataset cannot be acquired), follow these instructions:
1. Download the dataset directly from [here](https://www.kaggle.com/datasets/snap/amazon-fine-food-reviews).
2. Unzip the file and upload the file named `Reviews.csv` into Colab's File Explorer.
3. Rerun the `Preprocessing Cell` and check the variable `df`.

In [None]:
# Downloading Cell

!kaggle datasets download -d snap/amazon-fine-food-reviews
!unzip amazon-fine-food-reviews.zip
!rm -r amazon-fine-food-reviews.zip database.sqlite hashes.txt

Dataset URL: https://www.kaggle.com/datasets/snap/amazon-fine-food-reviews
License(s): CC0-1.0
Downloading amazon-fine-food-reviews.zip to /content
 95% 229M/242M [00:01<00:00, 223MB/s]
100% 242M/242M [00:01<00:00, 222MB/s]
Archive:  amazon-fine-food-reviews.zip
replace Reviews.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: n
  inflating: database.sqlite         
  inflating: hashes.txt              


In [None]:
# Preprocessing Cell

# Read CSV
df = pd.read_csv('Reviews.csv')

# Rename columns
df = df.rename(columns={
    'Text': 'review_text',
    'Summary': 'review_title',
    'Id': 'review_id',
    'ProductId': 'product_id',
    'UserId': 'user_id',
    'Score': 'review_score'
})

# Convert second to Datetime
df['review_dt'] = pd.to_datetime(df['Time'], unit='s')

# Drop irrelevant columns
df = df.drop(['ProfileName', 'HelpfulnessNumerator', 'HelpfulnessDenominator', 'Time'], axis=1)

# Sample for 100 rows
# random_state should be 42. DO NOT CHANGE THE SEED!!!. This will affect your final grading.
df = df.sample(100, random_state=42).reset_index(drop=True)

In [None]:
df.head()

|   | review_id | product_id | user_id | review_score | review_title | review_text | review_dt |
|---|---|---|---|---|---|---|---|
| 0 | 165257 | B000EVG8J2 | A1L01D2BD3RKVO | 5 | Crunchy & Good Gluten-Free Sandwich Cookies! | Having tried a couple of other brands of glute... | 2010-03-10 |
| 1 | 231466 | B0000BXJIS | A3U62RE5XZDP0G | 5 | great kitty treats | My cat loves these treats. If ever I can't fin... | 2011-03-01 |
| 2 | 427828 | B008FHUFAU | AOXC0JQQZGGB6 | 3 | COFFEE TASTE | A little less than I expected.  It tends to ha... | 2008-10-15 |
| 3 | 433955 | B006BXV14E | A3PWPNZVMNX3PA | 2 | So the Mini-Wheats were too big? | First there was Frosted Mini-Wheats, in origin... | 2012-04-25 |
| 4 | 70261 | B007I7Z3Z0 | A1XNZ7PCE45KK7 | 5 | Great Taste . . . | and I want to congratulate the graphic artist ... | 2012-04-18 |


## 2. Create a `chain` object for tagging food reviews

### 2.1 Set up API Key and GPT Model

In [None]:
OPENAI_API_KEY = userdata.get("OPENAI_API_KEY")
GPT_MODEL = "gpt-3.5-turbo"

### 2.2 Construct `prompt_template` and `llm` object

#### 2.2.1 Complete the Code for a Review Tagging Prompt Template

**Task:**
Define a `template` for an LLM in a review tagging task.
The output should follow the structure defined in the schema.
The `template` will receive two arguments, `title` and `text`, taken from the columns `review_title` and `review_text`; both are passed to the LLM for tagging.

**Instructions:**
- Replace `<Fill in your prompt here>` with the text that adheres to the schema requirements.
- Ensure the placeholders `{title}` and `{text}` are correctly referenced in the prompt.

In [None]:
template = """

    Please read the following review title and review text, then provide relevant tags based on the content.
    The tags should capture the key topics, sentiment, or any important themes related to the review. Provide between 3 and 5 tags.
    Focus on aspects such as product features, user experience, or any specific points mentioned by the reviewer.

    **Title**: {title}
    **Review**: {review}
"""

tagging_prompt = ChatPromptTemplate.from_template(template)
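
`ChatPromptTemplate.from_template` uses f-string-style placeholders by default, so every `{name}` in the template has to match a key in the input dictionary. The sketch below mimics that substitution with plain `str.format`; the shortened template is a stand-in for the real prompt, and the sample title and review text are taken from the data preview above.

```python
# Shortened stand-in for the tagging template (illustrative only).
demo_template = "**Title**: {title}\n**Review**: {review}"

# Keys must match the placeholder names exactly.
demo_input = {"title": "great kitty treats", "review": "My cat loves these treats."}
filled = demo_template.format(**demo_input)
print(filled)

# With plain str.format, a mismatched key (e.g. "text" instead of "review")
# raises KeyError because the {review} placeholder is left unfilled.
try:
    demo_template.format(title="t", text="r")
    mismatch_detected = False
except KeyError:
    mismatch_detected = True
print(mismatch_detected)
```

Checking the placeholder names against the dictionary built in `tag_review` is a cheap way to catch this kind of mismatch before spending API credits.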

#### 2.2.2 Define an LLM Object with Specific Parameters

**Task:**
Use the `ChatOpenAI` class to initialize an `llm` object.
This should be created by passing only the following three arguments:
- `openai_api_key`: `OPENAI_API_KEY`
- `model`: `GPT_MODEL`
- `temperature`: Set to `0` for deterministic output

**Instructions:**
Replace the placeholder below with the correct initialization using the ChatOpenAI class.

In [None]:
llm = ChatOpenAI(
    openai_api_key=OPENAI_API_KEY,
    model=GPT_MODEL,
    temperature=0
)

#### 2.2.3 Complete the `FoodTaggingSchema` Class Definition **[20 points]**

**Task:**  
Define a dictionary-based schema, `foodTaggingSchema` (titled `FoodTaggingSchema`), for tagging food reviews. Your job is to complete the schema by filling in the types and descriptions according to the criteria below:

- `food_category`: A **string** representing the food category (e.g., Candy, Chips, Tea, Coffee). The category should **not be too specific** (e.g., brand names) and **not too general** (e.g., "drinks", "meal", "food").
- `sentiment`: A **string** indicating the sentiment of the review (Positive, Negative, Neutral).
- `topic_of_concern`: A **list of strings** identifying the key aspects discussed in the review (e.g., Taste, Freshness, Packaging). The list should contain a maximum of 3 items.

*Note:* For the `food_category` field, there may be cases where the product being reviewed is **not food** or is **unidentifiable**. Ensure your prompt accounts for these scenarios:
- Use **`NONE_FOOD`** when the product reviewed is not food.
- Use **`UNIDENTIFY`** when the category cannot be determined.

**Instructions:**

- Complete the schema dictionary by defining the fields with the correct types and descriptions.
- Ensure the descriptions are detailed and meaningful for each field.


In [None]:
foodTaggingSchema = {
    "title": "FoodTaggingSchema",
    "type": "object",
    "description": "Schema for tagging food reviews based on food category, sentiment, and topics of concern.",
    "properties": {
        "food_category": {
            "type": "string", # fill in the blank
            "description": """

            A string representing the food category (e.g., Candy, Chips, Tea, Coffee).
            The category should not be too specific (e.g., brand names) and not too general
            (e.g., "drinks", "meal", "food"). Use 'NONE_FOOD' if the product is not food,
            and 'UNIDENTIFY' if the category cannot be determined.

            """
        },
        "sentiment": {
            "type": "string",
            "description": """

            A string indicating the sentiment of the review. It can be 'positive', 'negative',
            or 'neutral'. The sentiment should reflect the overall tone and satisfaction level
            expressed by the reviewer about the food product.

            """,
            "enum": ["positive", "negative", "neutral"], # fill in 2 more choices for sentiment analysis
            "default": "neutral"
        },
        "topic_of_concern": {
            "type": "array",
            "description": """

            A list of strings identifying the key aspects discussed in the review (e.g., Taste,
            Freshness, Packaging). These should reflect the main concerns or positive points
            the reviewer highlights. The list should contain a maximum of 3 items.

            """,
            "items": {
                "type": "string" # fill in type of element in the array
            },
            "maxItems": 3 # fill in the blank
        }
    },
    "required": ["food_category", "sentiment", "topic_of_concern"]
}

#### 2.2.4 Complete the code to define a structured output LLM



In [None]:
llm_structured = llm.with_structured_output(foodTaggingSchema) # fill in the blank

#### 2.2.5 Define `chain` object for Review Tagging

In [None]:
chain = tagging_prompt | llm_structured

### 2.3 Implement a Function to Tag a Review

**Task:**
Define a function named `tag_review` that:
- Accepts two parameters: `title` and `review`.
- Prepares an input dictionary containing these two arguments.
- Invokes a predefined chain using the `input` dictionary.
- Extracts and returns the following information from the response:
  - `food_category`: The category of food being reviewed.
  - `sentiment`: The sentiment of the review (e.g., Positive, Negative, Neutral).
  - `topic_of_concern`: A list of topics that the review is concerned with.

**Instructions:**
1. Replace `chain._(input)` with the correct function or method call that processes the input.
2. Ensure the return statement accurately extracts the `food_category`, `sentiment`, and `topic_of_concern` fields from the response **respectively**.


In [None]:
def tag_review(title, review):
    input = {
        "title": title,
        "review": review
    }

    # Feed the input using `invoke` method
    res = chain.invoke(input) # fill in the blank

    # Extract and return the required fields from the response
    return res['food_category'], res['sentiment'], res['topic_of_concern'] # fill in the blank
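
A structured-output chain is guided by the schema, but the returned values are not guaranteed to respect every constraint; the `value_counts` output in section 5.1, for instance, contains a `mixed` label that is not in the `enum`. The helper below is a hypothetical post-processing step and not part of the lab: the name `normalize_tags` and its fallback choices (mapping unknown sentiments to the schema default, truncating extra topics) are assumptions made for illustration.

```python
ALLOWED_SENTIMENTS = {"positive", "negative", "neutral"}
MAX_TOPICS = 3

def normalize_tags(res):
    """Hypothetical helper: coerce a tagging result into the schema's limits."""
    category = res.get("food_category") or "UNIDENTIFY"
    sentiment = str(res.get("sentiment", "neutral")).lower()
    if sentiment not in ALLOWED_SENTIMENTS:
        sentiment = "neutral"  # assumed fallback: the schema's default value
    # Keep at most MAX_TOPICS topics, preserving the model's order.
    topics = list(res.get("topic_of_concern") or [])[:MAX_TOPICS]
    return category, sentiment, topics

example = {
    "food_category": "Coffee",
    "sentiment": "Mixed",
    "topic_of_concern": ["Flavor", "Preference", "Quality", "Addictive"],
}
print(normalize_tags(example))
```

Whether an out-of-schema value should be remapped, dropped, or sent back to the model is a design decision; this sketch only shows where such a check could sit.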

## 3. Apply the `tag_review` Function to a DataFrame

Apply the tag_review function to tag reviews in a DataFrame `df` by extracting and labeling the following fields:

- food_category
- sentiment
- topic_of_concern

This process may take some time and will use your OpenAI API credits.


In [None]:
df[['food_category', 'sentiment', 'topic_of_concern']] = (
    df
    .progress_apply(
        lambda x: tag_review(x['review_title'], x['review_text']),
        axis=1,
        result_type='expand'
    )
)

100%|██████████| 100/100 [01:05<00:00,  1.53it/s]


## 4. Define Utility Functions

In this section, we will create utility functions to support semantic search. A `client` object and a `get_embedding` function are already provided; `get_embedding` takes input text and calls OpenAI's `text-embedding-3-large` model to produce a vector representation. Your task is to implement two additional functions that are essential for performing semantic search.

In [None]:
# Initiate API connections
client = openai.OpenAI(api_key=OPENAI_API_KEY)

# Define Embedding function
def get_embedding(text):
    response = client.embeddings.create(
        input=text,
        model='text-embedding-3-large'
    )
    return response.data[0].embedding

### 4.1 Implement a Function to Calculate Cosine Similarity

Task:
Complete the `cosine_similarity` function to calculate the cosine similarity between two vectors `vec1` and `vec2`.
The cosine similarity is defined as the dot product of the vectors divided by the product of their magnitudes.

Formula:
Cosine Similarity = (vec1 · vec2) / (||vec1|| * ||vec2||)

Instructions:
1. Replace the placeholders (`_`) with the correct NumPy function calls.
2. Ensure that the dot product and norms are calculated correctly.

In [None]:
def cosine_similarity(vec1, vec2):
    # Dot product divided by the product of the two vector norms
    similarity = np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2)) # fill in the blank
    return similarity
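
To make the formula concrete, here is a dependency-free sketch of the same computation on small hand-written vectors. The function name `cosine_similarity_plain` and the example vectors are assumptions for illustration; the lab itself uses NumPy.

```python
import math

def cosine_similarity_plain(vec1, vec2):
    """Dependency-free version of the formula: dot(a, b) / (|a| * |b|)."""
    dot = sum(a * b for a, b in zip(vec1, vec2))
    norm1 = math.sqrt(sum(a * a for a in vec1))
    norm2 = math.sqrt(sum(b * b for b in vec2))
    return dot / (norm1 * norm2)

# Parallel vectors score close to 1, perpendicular ones 0, opposite ones -1.
same_direction = cosine_similarity_plain([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity_plain([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity_plain([1.0, 0.0], [-1.0, 0.0])
print(same_direction, orthogonal, opposite)
```

Because the score depends only on direction, scaling a vector (as in the first pair) does not change the result beyond floating-point rounding.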

### 4.2 Implement a Function for Semantic Search

Task:
Complete the `semantic_search` function, which performs a semantic search based on cosine similarity.
The function should:
- Convert the query into an embedding.
- Compute the similarity between the query embedding and the embeddings in the specified DataFrame column.
- Sort the DataFrame based on the similarity scores.

Instructions:
1. Fill in the blanks (`_`) with the appropriate function calls to convert the query and column values into embeddings and calculate cosine similarity.
2. Ensure the DataFrame is sorted correctly based on the similarity scores.

Parameter Explanation:
- `query`: A string representing the search query. This will be converted into an embedding.
- `df`: A Pandas DataFrame containing the data to be searched.
- `col`: The name of the column in `df` that contains the precomputed embeddings to be compared with the query embedding.

In [None]:
def semantic_search(query, df, col):
    query_emb = get_embedding(query) # fill in the blank
    df['similarity'] = df[col].apply(lambda x: cosine_similarity(query_emb, x)) # fill in the blank
    return df.sort_values('similarity', ascending=False)
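
The same ranking logic can be sketched without pandas or an API call. Everything here is an assumption for illustration: the three-dimensional vectors stand in for real embeddings, and `toy_rows`, `cosine`, and `rank_by_similarity` are hypothetical names.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hand-written toy vectors standing in for embeddings (not real model output).
toy_rows = [
    {"label": "Coffee", "emb": [0.9, 0.1, 0.0]},
    {"label": "Pet Treats", "emb": [0.0, 0.2, 0.9]},
    {"label": "Tea", "emb": [0.8, 0.3, 0.1]},
]

def rank_by_similarity(query_emb, rows):
    """Score every row against the query and return new rows sorted best-first."""
    scored = [{**row, "similarity": cosine(query_emb, row["emb"])} for row in rows]
    return sorted(scored, key=lambda row: row["similarity"], reverse=True)

query_emb = [1.0, 0.2, 0.0]  # pretend embedding of the query text
ranked = rank_by_similarity(query_emb, toy_rows)
print([row["label"] for row in ranked])  # ['Coffee', 'Tea', 'Pet Treats']
```

Unlike `semantic_search`, which writes a `similarity` column into the DataFrame it receives, this sketch builds new dictionaries and leaves its input untouched.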

## 5. Apply your functions to answer the provided questions
   
Use the created functions and the tagged data to perform semantic searches and derive insights based on the questions given.

### 5.1 How many positive/negative/neutral reviews are there in this data? **[5 points]**

*Hint:* `value_counts` method

In [None]:
# Count values of `sentiment` column
df['sentiment'].value_counts() # fill in the blank

sentiment,count
positive,73
negative,22
neutral,4
mixed,1


In [None]:
# @title Answer to 5.1 (please run the cell after filling-in the answer) {"run":"auto","vertical-output":true,"display-mode":"form"}
positive_reviews = 73 # @param {"type":"integer"}
negative_reviews = 22 # @param {"type":"integer"}
neutral_reviews = 4 # @param {"type":"integer"}



### 5.2 List 3 `review_id` values whose reviews mention 'Healthy Product' or related terms. **[5 points]**

#### 5.2.1 Concatenate the `topic_of_concern` column into a single string

- Use `, ` as the delimiter
- Save the concatenated string into a new column, `toc_concat`

*Hint:* Use the `join` method

In [None]:
df['toc_concat'] = df['topic_of_concern'].progress_apply(lambda x: ', '.join(x)) # fill in the blank

100%|██████████| 100/100 [00:00<00:00, 150819.99it/s]


#### 5.2.2 Apply Embedding to the concatenated column

- Embed each value in the `toc_concat` column
- Save the resulting embeddings into a new column, `toc_emb`

In [None]:
df['toc_emb'] = df['toc_concat'].progress_apply(lambda x: get_embedding(x)) # fill in the blank

100%|██████████| 100/100 [00:35<00:00,  2.83it/s]


#### 5.2.3 Compute the similarity between the embedded column and the keyword `Healthy Product`

- Define `Healthy Product` as the `query` for searching
- Use the semantic search function to get the similarity score between `query` and `toc_emb`

In [None]:
query = 'Healthy Product'
df_result_hp = semantic_search(df=df, query=query, col='toc_emb')

#### 5.2.4 Analyze the search result from similarity score

- Inspect the rows with the highest similarity scores and check whether their original `topic_of_concern` values relate to **Healthy Product**
- If so, paste their `review_id` values into the following form to answer the question

In [None]:
df_result_hp.sort_values('similarity', ascending=False).head(3)

index,review_id,product_id,user_id,review_score,review_title,review_text,review_dt,food_category,sentiment,topic_of_concern,toc_concat,toc_emb,similarity
75,567964,B0030VJ8YU,A12WA0NDY18R6W,5,Yummy organic baby food,This baby food is both convenient and healthy....,2011-04-28,Baby Food,positive,"[Convenience, Healthiness, Quality]","Convenience, Healthiness, Quality","[0.010342562571167946, 0.02029174566268921, -0...",0.540965
43,422062,B002OVICJO,A136BYILR08J32,5,Liv-a-littles chicken treats,If you're looking for a yummy treat for your c...,2008-05-26,Pet Treats,positive,"[Quality, Preference, Healthiness]","Quality, Preference, Healthiness","[0.006347076501697302, 0.008854907006025314, -...",0.513153
24,32044,B0062A87HA,A3C89YOH5NT8D3,5,Great deal!,I was so happy to see this deal. I recently ha...,2011-02-22,Pumpkin Product,positive,"[Organic, Affordable, Recipe Usage]","Organic, Affordable, Recipe Usage","[-0.02851654589176178, -0.017710639163851738, ...",0.508778


In [None]:
# @title Answer to 5.2 (please run the cell after filling-in the answer){"run":"auto","vertical-output":true,"display-mode":"form"}
first_id = "567964" # @param {"type":"string"}
second_id = "422062" # @param {"type":"string"}
third_id = "32044" # @param {"type":"string"}


### 5.3 Which drink appears most often among the 10 rows with the highest similarity to the word `Beverages`? **[5 points]**

#### 5.3.1 Apply Embedding to the `food_category`

- Save the embedded value into new column, `food_cat_emb`

In [None]:
df['food_cat_emb'] = df['food_category'].progress_apply(lambda x: get_embedding(x)) # fill in the blank

100%|██████████| 100/100 [00:29<00:00,  3.39it/s]


#### 5.3.2 Compute the similarity between the embedded column and the keyword `Beverages`

- Define `Beverages` as the `query` for searching
- Use the semantic search function to get the similarity score between `query` and the column `food_cat_emb`

In [None]:
query = 'Beverages' # fill in the blank
df_result_bev = semantic_search(query=query, df=df, col='food_cat_emb') # fill in the blank

#### 5.3.3 Analyze the top 10 search result from similarity score

- Look at the 10 rows with the highest `similarity` scores and check their food categories

In [None]:
df_result_bev.sort_values('similarity', ascending=False).head(10) # fill in the blank

index,review_id,product_id,user_id,review_score,review_title,review_text,review_dt,food_category,sentiment,topic_of_concern,toc_concat,toc_emb,similarity,food_cat_emb
46,136889,B001M1DUDU,A2BFA2D3MC3LL2,5,Good packaging - no spills!,I like this product enough that I used to buy ...,2012-03-22,Beverages,positive,"[Packaging, Quality, Price]","Packaging, Quality, Price","[0.0021961783058941364, 0.01707911677658558, -...",0.999999,"[0.014680768363177776, -0.013876223936676979, ..."
55,447100,B003VN9536,A2T135XHAH82S4,5,excellent product,"We love Crystal Light Fruit Punch. Granted, t...",2012-09-08,Beverages,positive,"[Flavor, Value for Money, Quantity]","Flavor, Value for Money, Quantity","[0.011758225969970226, -0.00013294033124111593...",0.999999,"[0.014680768363177776, -0.013876223936676979, ..."
77,360547,B009M2LRTA,AB6Y2PL1WSFTS,3,Love this beverage...but don't have it shipped...,My family loves this soda. We sometimes buy i...,2011-12-07,Soda,negative,"[Shipping Quality, Taste Alteration, Product Q...","Shipping Quality, Taste Alteration, Product Qu...","[-0.0007932085427455604, -0.0010302709415555, ...",0.519349,"[0.001338779809884727, -0.03389210253953934, -..."
96,536639,B0002LDAHC,A6SANB5L4DFDU,5,Flavorful Coffee,I grew up with this can around my house -my pa...,2011-12-18,Coffee,positive,"[Flavor, Preference, Quality, Addictive]","Flavor, Preference, Quality, Addictive","[0.022912200540304184, -0.014770646579563618, ...",0.506418,"[0.009030983783304691, -0.038445279002189636, ..."
66,236606,B008YA1NWC,A3Q82OOSCTLLMR,5,So yummy,To be honest I'm not a huge Green Mountain Roa...,2011-05-20,Coffee,positive,"[Flavor, Availability, Tropical Island Experie...","Flavor, Availability, Tropical Island Experience","[-0.004776034504175186, -0.02954932302236557, ...",0.506418,"[0.009030983783304691, -0.038445279002189636, ..."
85,162306,B001EYUE5C,A1UO0OMCFXQ1TG,5,Delicious,These used to almost never be in stock. Inven...,2012-08-11,Coffee,positive,"[Flavor, Inventory, Availability]","Flavor, Inventory, Availability","[-0.02655716799199581, -0.015021787025034428, ...",0.506418,"[0.009030983783304691, -0.038445279002189636, ..."
50,246569,B002D4DY8G,A3EFSLEMHNPP6A,2,Artificial flavor...?!,"With a legendary name like Gevalia, I was disa...",2009-12-29,Coffee,negative,"[Artificial Flavors, Packaging, Taste]","Artificial Flavors, Packaging, Taste","[0.0056574055925011635, -0.01980714499950409, ...",0.506414,"[0.008989663794636726, -0.038406770676374435, ..."
28,146985,B005GRCWDU,A1QVLJ260F6SZD,5,Best coffee EVER!,I've tried several brands and flavors of groun...,2012-02-18,Coffee,positive,"[Flavor, Value for Money, Brand Preference]","Flavor, Value for Money, Brand Preference","[0.02213113196194172, -0.0009594275034032762, ...",0.506348,"[0.00895767193287611, -0.038405921310186386, 0..."
2,427828,B008FHUFAU,AOXC0JQQZGGB6,3,COFFEE TASTE,A little less than I expected. It tends to ha...,2008-10-15,Coffee,negative,"[Taste, Expectation, Company Favorite]","Taste, Expectation, Company Favorite","[0.008754176087677479, -0.033111512660980225, ...",0.506348,"[0.00895767193287611, -0.038405921310186386, 0..."
98,251504,B001E5DZYS,A33OH380L3DW0Z,5,Excellent Flavor,This coffee has a full bodied flavor without b...,2010-09-01,Coffee,positive,"[Flavor, Strength, Repurchase]","Flavor, Strength, Repurchase","[0.02001223899424076, -0.011058501899242401, -...",0.506348,"[0.00895767193287611, -0.038405921310186386, 0..."
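
Counting categories by eye becomes error-prone as the table grows. One hedged way to tally category labels is `collections.Counter`; the list below is a made-up stand-in for ten top-ranked rows, not the lab's result.

```python
from collections import Counter

# Hypothetical categories of ten top-ranked rows (not copied from the lab output).
top_categories = ["Tea", "Juice", "Tea", "Soda", "Tea", "Juice", "Tea", "Water", "Tea", "Soda"]

counts = Counter(top_categories)
most_common_label, most_common_count = counts.most_common(1)[0]
print(most_common_label, most_common_count)  # Tea 5
```

With the lab's DataFrame, a comparable tally could be obtained with `df_result_bev.head(10)['food_category'].value_counts()`.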


In [None]:
# @title Answer to 5.3 (please run the cell after filling-in the answer){"run":"auto","vertical-output":true}
most_common_beverage = "B001M1DUDU" # @param {"type":"string"}


## Answer for grading

Run all the following cells to print the answer out. This will contribute to a total of **35 points**. **Make sure to submit the notebook with all the answers printed out.**

If a `NameError` is raised, run the answer form cell for that question first.

### 2.2.3 Tagging Schema Answer [20 points]

In [None]:
print("<TODO_SCHEMA_FILL_OUTPUT_BEGIN>")
print(foodTaggingSchema['properties']['food_category']['type'])
print("<ANS_SPLIT>")
print('|'.join(foodTaggingSchema['properties']['sentiment']['enum']))
print("<ANS_SPLIT>")
print(foodTaggingSchema['properties']['topic_of_concern']['items']['type'])
print("<ANS_SPLIT>")
print(foodTaggingSchema['properties']['topic_of_concern']['maxItems'])
print("<TODO_SCHEMA_FILL_OUTPUT_END>")
print("<TODO_SCHEMA_OUTPUT_BEGIN>")
print(foodTaggingSchema)
print("<TODO_SCHEMA_OUTPUT_END>")

<TODO_SCHEMA_FILL_OUTPUT_BEGIN>
string
<ANS_SPLIT>
positive|negative|neutral
<ANS_SPLIT>
string
<ANS_SPLIT>
3
<TODO_SCHEMA_FILL_OUTPUT_END>
<TODO_SCHEMA_OUTPUT_BEGIN>
{'title': 'FoodTaggingSchema', 'type': 'object', 'description': 'Schema for tagging food reviews based on food category, sentiment, and topics of concern.', 'properties': {'food_category': {'type': 'string', 'description': '\n\n            A string representing the food category (e.g., Candy, Chips, Tea, Coffee).\n            The category should not be too specific (e.g., brand names) and not too general\n            (e.g., "drinks", "meal", "food"). Use \'NONE_FOOD\' if the product is not food,\n            and \'UNIDENTIFY\' if the category cannot be determined.\n\n            '}, 'sentiment': {'type': 'string', 'description': "\n\n            A string indicating the sentiment of the review. It can be 'positive', 'negative',\n            or 'neutral'. The sentiment should reflect the overall tone and satisfaction level\

### 5.1 Answer [5 points]

In [None]:
print("<TODO_5_1_OUTPUT_BEGIN>")
print(positive_reviews)
print("<ANS_SPLIT>")
print(negative_reviews)
print("<ANS_SPLIT>")
print(neutral_reviews)
print("TODO_5_1_OUTPUT_END")

<TODO_5_1_OUTPUT_BEGIN>
73
<ANS_SPLIT>
22
<ANS_SPLIT>
4
TODO_5_1_OUTPUT_END


### 5.2 Answer [5 points]

In [None]:
print("<TODO_5_2_OUTPUT_BEGIN>")
print(first_id)
print("<ANS_SPLIT>")
print(second_id)
print("<ANS_SPLIT>")
print(third_id)
print("<TODO_5_2_OUTPUT_BEGIN>")

<TODO_5_2_OUTPUT_BEGIN>
567964
<ANS_SPLIT>
422062
<ANS_SPLIT>
32044
<TODO_5_2_OUTPUT_BEGIN>


### 5.3 Answer [5 points]

In [None]:
print("<TODO_5_3_OUTPUT_BEGIN>")
print(most_common_beverage)
print("<TODO_5_3_OUTPUT_BEGIN>")

<TODO_5_3_OUTPUT_BEGIN>
B001M1DUDU
<TODO_5_3_OUTPUT_BEGIN>


# Part 3: Sentiment Analysis [30 Points]

#### Sentiment analysis is the process of determining the emotional tone behind a body of text, typically classifying it as positive, negative, or neutral. It often involves preprocessing and cleaning the text data to remove noise, followed by vectorizing the text to transform it into a numerical format suitable for analysis. These steps are crucial for building accurate models that can effectively analyze sentiment in large datasets.

#### In this task, you need to complete four sub-task functions related to the Sentiment Analysis process covered in class. You can use a sanity check to verify the correctness of your functions before submitting.

#### **Make sure that you run all sanity check cells before submitting**

In [None]:
import math
import numpy as np
import spacy
import nltk
import re
from nltk.sentiment.vader import SentimentIntensityAnalyzer

In [None]:
!python -m spacy download en_core_web_lg

Collecting en-core-web-lg==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.7.1/en_core_web_lg-3.7.1-py3-none-any.whl (587.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 587.7/587.7 MB 3.0 MB/s eta 0:00:00
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-3.7.1
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [None]:
### Initialize SpaCy and NLTK
nlp = spacy.load("en_core_web_lg")

nltk.download('vader_lexicon')
sid = SentimentIntensityAnalyzer()

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...


## Task 1: Clean & Preprocess text with REGEX [7.5 Points]

#### Cleaning and preprocessing text with regex in Python involves using regular expressions to identify and remove unwanted elements from the text, such as special characters, numbers, or extra spaces. This process is essential for preparing raw text data for analysis by normalizing the text, reducing noise, and ensuring consistency. By applying regex patterns, you can efficiently clean the text, making it more suitable for tasks like tokenization, vectorization, and sentiment analysis.

In [None]:
### TO DO: Use regex to remove digits (0-9) and these 3 symbols (, . !) from text_input
### Hint: re.sub() replaces every occurrence of a regex pattern with a specified replacement string. You can also use the sanity check for debugging.
### Expected Output: Text without digits (0-9) and these 3 symbols (, . !)

def regex_clean(text_input):
  ### BEGIN YOUR ANSWER
  pattern = r'[0-9,.!]'
  cleaned_text = re.sub(pattern, '', text_input)
  cleaned_text = re.sub(r'\s+', ' ', cleaned_text).strip()
  return cleaned_text
  # raise NotImplementedError
  ### END YOUR ANSWER

In [None]:
### ***** PLEASE RUN AND DO NOT CHANGE ANYTHING IN THIS CELL ******
### Sanity Check: regex_clean

regex_clean_1 = "Hello, world0123!"
regex_clean_2 = "1.I won lottery.!!!"

assert regex_clean(regex_clean_1) == "Hello world", "regex_clean_1: Wrong Output"
print("regex_clean_1: Passed!")

assert regex_clean(regex_clean_2) == "I won lottery", "regex_clean_2: Wrong Output"
print("regex_clean_2: Passed!")

regex_clean_1: Passed!
regex_clean_2: Passed!
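
The two substitutions in `regex_clean` can be traced step by step. This is a small sketch on a made-up sentence (the sample string is an assumption, not one of the graded test cases); it shows why the whitespace clean-up is useful after the characters are removed.

```python
import re

sample = "Room 101, floor 3!"

# Step 1: the character class [0-9,.!] matches any single digit or one of , . !
# (inside a character class, "." is a literal dot, not a wildcard).
no_digits_or_symbols = re.sub(r"[0-9,.!]", "", sample)

# Step 2: collapse runs of whitespace left behind by the removals, then trim.
collapsed = re.sub(r"\s+", " ", no_digits_or_symbols).strip()

print(repr(no_digits_or_symbols))  # 'Room  floor '
print(repr(collapsed))             # 'Room floor'
```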


## Task 2: Vectorize text using SpaCy [7.5 Points]

#### Vectorizing text using SpaCy involves converting text data into numerical representations, called vectors, that capture the meaning and context of the words or phrases. SpaCy provides powerful tools like pre-trained word embeddings that transform text into dense vectors, which reflect semantic similarities and differences. These vectors can then be used for various natural language processing tasks, such as sentiment analysis, text classification, or clustering, enabling machine learning models to understand and process text data more effectively.

In [None]:
### TO DO: Use SpaCy "en_core_web_lg" to vectorize the text_input and save value as vector_output
### Hint: You can call the variable 'nlp' in the function.
### Expected Output: Vector of the text_input

def spacy_vectorize(text_input):
  ### BEGIN YOUR ANSWER
  # Reuse the global `nlp` pipeline loaded earlier; reloading the model on every call is slow.
  doc = nlp(text_input)
  vector_output = doc.vector
  return vector_output
  # raise NotImplementedError
  ### END YOUR ANSWER
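
By default, spaCy's `Doc.vector` is the average of the document's token vectors. The sketch below reproduces that averaging on hand-written toy vectors; the numbers are made up for illustration, and real `en_core_web_lg` vectors have 300 dimensions rather than 3.

```python
# Toy 3-dimensional "token vectors" (made-up numbers, not spaCy output).
token_vectors = {
    "Hello": [1.0, 2.0, 3.0],
    "World": [3.0, 0.0, 1.0],
}

def average_vectors(vectors):
    """Element-wise mean of equally sized vectors."""
    count = len(vectors)
    return [sum(component) / count for component in zip(*vectors)]

doc_vector = average_vectors(list(token_vectors.values()))
print(doc_vector)  # [2.0, 1.0, 2.0]
```

Because of the averaging, every token contributes equally to the result, including function words such as "is" or "a" when they have vectors.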

In [None]:
### ***** PLEASE RUN AND DO NOT CHANGE ANYTHING IN THIS CELL ******
### Sanity Check: spacy_vectorize

spacy_vectorize_1 = "Hello World"
spacy_vectorize_2 = "This is a cat."

assert np.allclose(spacy_vectorize(spacy_vectorize_1), [ 1.9091499 , -0.45174998,  6.0755    , -2.40725   ,  1.6430551 ,
        1.22375   , -0.16766998, -2.6821    ,  2.07438   ,  3.67405   ,
        1.06575   ,  1.355615  , -2.1858    ,  3.25365   , -0.96880007,
       -0.84369993,  3.050255  ,  1.929425  ,  0.266375  , -1.4181999 ,
        2.23065   ,  0.13395   ,  0.342965  , -1.3394501 ,  0.47084004,
        4.08997   , -4.8279247 ,  0.82527   , -0.23190004, -0.78024995,
       -0.64109   ,  1.39115   ,  1.01615   ,  2.80055   , -3.33231   ,
        2.213785  , -0.04525006, -0.45984995, -2.32922   , -2.417888  ,
        0.20391501,  1.33049   , -0.563985  ,  1.6517899 , -2.4878802 ,
       -3.03185   , -4.56585   ,  1.74856   , -1.357615  , -3.959625  ,
       -0.69825   , -2.26605   , -0.72382504, -2.1701    ,  1.7167001 ,
        1.4455001 , -0.5869755 , -0.85075   , -2.0085502 ,  0.3819    ,
        1.27522   ,  2.0743499 , -0.825005  ,  3.4989    ,  2.49142   ,
       -0.873825  , -2.3759599 ,  0.15710002,  4.71345   , -0.96141493,
        0.97319996,  2.310585  , -0.8499249 , -1.6558999 , -1.2397499 ,
       -3.4123    , -0.290385  ,  1.2902601 ,  4.49415   , -0.531245  ,
       -0.30260015, -0.5022    , -1.4202    ,  2.66235   ,  2.96855   ,
       -2.3820899 , -0.51495004, -3.87338   , -2.63205   , -4.5396    ,
       -0.962265  ,  3.0814    , -1.27605   , -1.2235    , -4.5558    ,
        1.517935  , -0.57454   , -1.419395  , -1.28969   ,  1.01775   ,
        1.5958949 , -0.93831   ,  0.29061398,  1.4059498 , -0.40600002,
        0.8541    , -0.68939996, -0.8835    ,  0.232621  ,  0.258035  ,
        2.1965249 ,  1.03335   , -1.72347   ,  0.54244006,  0.31351   ,
       -2.8723998 ,  0.831015  , -1.2166994 , -0.57189   ,  2.9512    ,
       -1.379975  ,  0.05383   ,  3.03583   ,  0.05925   ,  3.7099    ,
        1.7822    ,  0.90089995, -0.93752   , -0.21735   , -0.35956   ,
        0.37635005,  3.199625  , -0.17940009, -1.82325   , -0.800265  ,
       -2.078645  ,  1.5732349 , -0.98045003, -2.45435   ,  3.3913498 ,
        0.816275  ,  1.681535  ,  1.073505  ,  1.8717201 , -2.70605   ,
       -0.81369   , -4.704305  , -0.44801   , -1.9599199 ,  0.331795  ,
        1.54905   ,  0.448628  , -3.3199    ,  0.476015  , -1.8448    ,
       -0.20492001,  2.029295  ,  0.07160002,  2.6872551 , -1.6288    ,
        1.235345  ,  1.317705  , -1.9263599 , -0.38435   , -3.817655  ,
       -1.144415  , -2.076225  , -1.4848001 ,  1.5587001 , -3.9872901 ,
       -1.8787    , -0.4929    , -3.1566    ,  2.332205  , -0.27949995,
       -0.89615506, -0.972535  ,  0.66885   , -0.097805  , -1.4099    ,
        0.54473996,  2.80335   , -2.785755  , -1.03169   ,  0.14777   ,
        4.7942    , -0.43314993, -1.8141501 ,  2.64845   ,  1.3605    ,
       -1.07244   ,  3.8125    ,  0.4349    ,  3.1717026 ,  1.961475  ,
       -0.32695   , -1.58219   ,  2.23975   ,  1.095645  ,  3.9083    ,
        1.88625   , -0.103234  , -0.88906   , -0.5008501 ,  5.5821753 ,
       -0.35721502, -0.81245005,  1.1631999 , -2.726985  , -0.3377    ,
        1.5778999 ,  2.3084202 , -0.77995   , -1.0862    , -0.85148996,
        1.3574849 , -1.22574   ,  0.48182   ,  2.32545   , -3.6890001 ,
       -0.553385  ,  0.51283497, -2.75875   , -1.10565   , -6.53037   ,
       -0.86206496,  0.32465   ,  2.204345  , -2.70668   ,  1.98566   ,
       -1.339125  ,  1.17286   , -0.297028  ,  1.8505    , -0.36110997,
        1.0040851 , -1.047425  ,  0.82317   ,  0.478855  ,  0.33419997,
        2.25532   ,  0.473225  ,  1.1447845 , -0.43174   , -0.54912496,
       -0.94259   , -1.265725  , -1.479     , -1.01386   ,  0.2139035 ,
       -1.92481   , -4.812345  , -1.598245  , -1.266505  , -0.6402    ,
       -3.97715   ,  0.97725004, -4.9416847 ,  1.13679   ,  2.8118    ,
        0.829285  , -4.5960503 ,  1.8581    , -0.70976   , -1.4105451 ,
       -3.03318   , -2.25385   , -0.45599997, -1.1853501 ,  1.188738  ,
        0.31739998,  0.855105  ,  1.8125999 ,  1.8716501 , -1.5424001 ,
       -0.26905   , -1.5464    ,  0.50355005,  0.10933   ,  2.1044002 ,
        0.82834995,  0.517655  ,  2.0570998 , -3.6235    , -0.42475003,
        0.8547085 ,  3.8453999 , -0.36260003,  0.8638995 ,  3.07476   ,
        1.7916199 , -0.07744998,  0.73646003,  1.40165   , -1.0778465 ,
       -3.5363998 ,  0.75514895, -1.166235  , -3.014495  , -1.0041499 ]) == True, "spacy_vectorize_1: Wrong Output"
print("spacy_vectorize_1: Passed!")

assert np.allclose(spacy_vectorize(spacy_vectorize_2), [-5.1525068e-01,  3.3550200e+00, -3.1942058e+00, -3.7386200e+00,
        6.2822609e+00, -5.1502001e-01,  3.1933394e-01,  2.0876498e+00,
        1.4574479e+00,  1.1849389e-01,  1.0310520e+01, -2.3752013e-01,
       -1.0822041e+00,  8.7533605e-01,  2.1434102e+00,  2.7598598e+00,
        1.7363818e+00,  2.4873600e+00,  9.3288791e-01, -8.0104399e-01,
        7.9022008e-01,  1.4068760e+00, -3.2576861e+00, -1.4497462e-01,
       -3.5770001e+00, -2.9040198e+00, -2.8258998e+00, -2.1282201e+00,
       -3.2122798e+00,  4.4936083e-02, -9.3070203e-01, -4.8406202e-01,
       -3.2052238e+00, -8.0821002e-01, -4.3008800e+00,  9.2272198e-01,
        8.8425463e-01,  4.6460000e-01,  5.6586399e+00, -5.5996007e-01,
       -1.9406722e+00,  4.9976605e-01,  1.9287260e+00, -4.2426005e-01,
       -2.4993002e+00,  2.3217678e+00,  3.4900203e+00, -3.3036518e+00,
       -2.2066395e+00, -5.9411001e-01, -6.2006605e-01,  1.6725420e+00,
        3.4256377e+00, -3.1799202e+00,  5.1342005e-01, -3.2105997e-01,
        2.9841797e+00,  8.2749999e-01, -9.0579844e-01,  6.0143000e-01,
        1.4841321e+00, -1.7128880e+00, -3.8300015e-02, -5.5789995e-01,
       -2.7657757e+00, -8.8112041e-02, -3.4344602e+00, -6.1846356e+00,
        2.9752400e+00,  2.4428980e+00, -4.2204332e-02,  3.0471997e+00,
       -4.6080999e+00, -2.0597601e-01, -4.3766794e-01,  1.3226600e+00,
       -2.7763600e+00,  2.7712102e+00, -2.7255280e+00, -7.9127997e-01,
       -5.3278599e+00, -2.5735700e+00,  6.4166194e-01, -1.3083364e+00,
        2.3878360e+00, -1.5716702e+00,  2.2275715e+00,  5.7804799e-01,
        3.7506242e+00, -2.9679279e+00, -8.5738802e-01, -4.6682458e+00,
        3.9121425e+00, -7.4809999e+00,  6.8826199e-01, -4.5307999e+00,
       -4.4429404e-01,  8.1760027e-02, -4.2538404e-01, -4.9222360e+00,
        7.6188183e-01,  7.7960002e-01,  6.2873201e+00,  6.4020529e+00,
        2.1447120e+00,  4.6194186e+00,  7.8504041e-02,  2.3271642e+00,
       -3.5233657e+00,  2.9579301e+00, -2.3646080e+00,  4.1450725e+00,
       -5.2480614e-01,  4.8736520e+00,  5.7585800e-01, -1.2049781e+00,
       -2.6510799e+00,  2.4680258e-03,  2.7278199e+00, -6.6711998e-01,
       -1.2842820e+00, -2.0238199e+00, -1.5410120e+00,  5.8798203e+00,
       -1.9881001e+00, -6.8943801e+00,  3.4059811e-02, -6.2057996e-01,
        2.7027278e+00, -2.8297079e+00, -2.8847060e+00, -1.2029999e+00,
        5.8111439e+00, -7.5403605e+00, -4.8309737e-01,  8.7368393e-01,
        2.2249398e+00, -4.3276601e+00,  5.3056002e+00, -2.8657200e+00,
       -1.8121380e+00, -2.4238411e-01,  3.0637522e+00,  2.2479560e+00,
        1.4982580e+00,  1.1447401e+00, -3.6077163e+00, -3.3542195e-01,
       -3.4776998e-01,  3.6801600e+00,  2.5131402e+00,  5.4348302e+00,
        1.0040540e+00,  2.4681199e+00,  2.4649601e-01,  2.4969239e+00,
        5.1022220e+00,  3.4736018e+00,  2.8122789e-01,  3.5790208e-01,
       -1.9040579e+00, -2.0078850e+00,  1.5827520e+00,  2.6213200e+00,
       -4.8154202e+00, -3.1938400e+00, -3.7035301e+00,  5.6587391e+00,
        2.6762402e+00, -6.9629952e-02,  3.8428657e+00, -8.7008762e-01,
        3.2447200e+00,  1.1510401e+00,  1.2628801e-01, -3.8476224e+00,
        6.0685790e-01, -2.0399571e-03, -3.7674537e+00, -7.1674001e-01,
       -1.0998760e+00,  1.5341840e+00,  1.6909180e+00, -3.6566799e+00,
       -3.3844619e+00, -7.9374206e-01, -3.7961879e+00, -3.4601181e+00,
       -9.8252803e-01,  3.4114003e+00, -2.1496696e+00, -7.1959990e-01,
       -2.2053800e+00, -3.7677798e+00,  2.7276399e+00, -1.6924605e-01,
       -3.0168099e+00, -3.6999942e-03,  1.2836021e+00, -3.8780200e+00,
        5.0803035e-01, -9.0892601e-01, -3.5374198e+00, -3.6654201e+00,
        1.5899861e+00,  1.8346682e+00, -6.6831002e+00,  1.5564820e+00,
       -3.7149956e+00, -3.7630200e+00,  4.5250001e+00,  2.0020201e+00,
       -4.0279999e-01,  2.5854721e+00,  1.1243280e+00,  1.7673798e-01,
       -8.1888998e-01, -2.1178601e+00, -2.4210601e+00,  4.5965600e-01,
       -2.7663758e+00, -2.4932561e+00,  1.5002759e+00, -3.0449599e-01,
       -3.7684008e-01, -7.2592008e-01,  2.7756200e+00, -2.7147999e-01,
        4.4642601e+00, -1.6031481e+00, -3.9000988e-03, -9.0072603e+00,
        1.7786318e+00,  4.6200242e+00, -4.9403801e+00,  3.1929597e-01,
       -1.3880199e+00,  9.0721129e-03,  1.3808960e+00,  1.5602281e+00,
        1.7240620e+00,  6.1998796e-01,  3.5996997e+00,  2.3757000e+00,
       -4.1756921e+00, -2.2395339e+00, -3.8822072e+00,  2.0884180e+00,
       -1.7062393e-01,  3.8123124e+00,  2.4698810e-01,  4.3285400e-01,
       -5.1090002e+00, -4.3618999e+00, -2.2364192e+00, -1.0938101e+00,
        3.9004002e+00,  5.5234003e+00,  7.7447397e-01, -2.3161778e+00,
       -1.2803020e+00,  2.8454998e+00,  2.4919000e+00,  6.7252998e+00,
       -1.9188640e+00, -1.0170660e+00, -2.4665399e+00,  7.8406394e-01,
       -3.7017677e+00,  1.0613760e+00,  5.0155802e+00, -1.6745160e+00,
       -5.1563793e-01, -2.8973603e+00,  2.0842299e+00,  8.4648001e-01,
        2.1587601e+00,  2.6430061e+00, -5.7316399e-01,  1.3787849e+00,
        1.5152200e-01, -3.9843597e+00, -4.6193199e+00,  1.4414209e+00,
        8.4691324e+00, -1.7156000e+00,  2.9018404e+00,  1.5891964e-02,
       -2.0408401e+00,  3.0835021e+00,  1.3132802e+00,  9.1946000e-01,
        5.1444678e+00,  5.1536405e-01, -1.0884761e+00,  3.9948399e+00,
       -7.9524004e-01,  1.5913141e+00, -3.6228924e+00,  2.8538461e+00]) == True, "spacy_vectorize_2: Wrong Output"
print("spacy_vectorize_2: Passed!")

spacy_vectorize_1: Passed!
spacy_vectorize_2: Passed!


## Task 3: Find text similarity using SpaCy [7.5 Points]

#### SpaCy provides a built-in method, `Doc.similarity`, that computes the cosine similarity between two texts, that is, the cosine of the angle between their vector representations. The score ranges from -1 to 1, and a higher score means the texts are more similar in meaning, which makes it useful for tasks like document comparison and clustering.
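
To make the measure concrete, here is a minimal pure-Python sketch of cosine similarity on plain lists of numbers. It is not SpaCy's implementation; the helper name `cosine_similarity` and the choice to return 0.0 for a zero-length vector are assumptions made only for this illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (illustrative helper)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Assumption for this sketch: a zero vector has no direction, so report 0.0.
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

same_direction = cosine_similarity([1.0, 0.0], [1.0, 0.0])   # 1.0
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])       # 0.0
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])        # -1.0
print(same_direction, orthogonal, opposite)
```

Vectors pointing the same way score 1, perpendicular vectors score 0, and opposite vectors score -1, which is why the score ranges from -1 to 1.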

In [None]:
### TO DO: Use SpaCy "en_core_web_lg" to find cosine similarity of 2 texts
### Hint: You can call the variable 'nlp' in the function.
### Expected Output: Cosine similarity score of the two texts (a float)

def spacy_similarity(text_input_1, text_input_2):
  ### BEGIN YOUR ANSWER
  # Reuse the global 'nlp' pipeline (see the hint above) instead of reloading the model on every call
  doc1 = nlp(text_input_1)
  doc2 = nlp(text_input_2)
  similarity_score = doc1.similarity(doc2)
  return similarity_score
  # raise NotImplementedError
  ### END YOUR ANSWER

In [None]:
### ***** PLEASE RUN AND DO NOT CHANGE ANYTHING IN THIS CELL ******
### Sanity Check: spacy_similarity

spacy_similarity_1_a = "Hello World"
spacy_similarity_1_b = "This is a cat."
spacy_similarity_2_a = "Water"
spacy_similarity_2_b = "Drinking"

assert math.isclose(spacy_similarity(spacy_similarity_1_a, spacy_similarity_1_b), 0.011677389356738732) == True, "spacy_similarity_1: Wrong Output"
print("spacy_similarity_1: Passed!")

assert math.isclose(spacy_similarity(spacy_similarity_2_a, spacy_similarity_2_b), 0.4627892310102833) == True, "spacy_similarity_2: Wrong Output"
print("spacy_similarity_2: Passed!")

spacy_similarity_1: Passed!
spacy_similarity_2: Passed!


## Task 4: Analyze text sentiment using NLTK "VADER" [7.5 Points]

#### VADER (Valence Aware Dictionary and sEntiment Reasoner) in NLTK is a tool for analyzing text sentiment, designed to capture both the polarity (positive, negative, neutral) and the intensity of sentiment expressed in social media and other informal text. VADER uses a pre-built lexicon that assigns a sentiment score to each word, which lets it handle slang, emojis, and negations. Because it is rule-based, it needs no training and runs quickly, making it well suited to applications like social media monitoring and opinion mining.
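
As a rough sketch of how the polarity can be read from VADER's output, the snippet below maps the `compound` score to a label. The helper name `label_from_compound` is hypothetical, and the ±0.05 cut-offs are a commonly used convention assumed here, not something this lab requires. The two input dictionaries are the expected outputs from the sanity check in this task.

```python
def label_from_compound(scores, threshold=0.05):
    """Map a VADER score dict to a polarity label (illustrative helper).

    Assumption: 'threshold' follows the commonly used +/-0.05 convention.
    """
    compound = scores["compound"]
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Score dictionaries taken from the sanity check in this task.
marvelous = {'neg': 0.0, 'neu': 0.339, 'pos': 0.661, 'compound': 0.5994}
bad_movie = {'neg': 0.492, 'neu': 0.508, 'pos': 0.0, 'compound': -0.4404}

print(label_from_compound(marvelous))  # positive
print(label_from_compound(bad_movie))  # negative
```

The `neg`, `neu`, and `pos` values describe the proportion of the text in each category, while `compound` summarizes the overall sentiment in a single number between -1 and 1.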

In [None]:
### TO DO: Use NLTK "VADER" to analyze sentiment in text data
### Hint: You can call the variable 'sid' in the function.
### Expected Output: NLTK VADER sentiment output format like -> {'neg': 0.0, 'neu': 0.484, 'pos': 0.516, 'compound': 0.4927}

def nltk_sentimentanalysis(text_input):
  ### BEGIN YOUR ANSWER
  # 'sid' is the VADER analyzer referred to in the hint above
  sentiment_scores = sid.polarity_scores(text_input)
  return sentiment_scores
  ### END YOUR ANSWER

In [None]:
### ***** PLEASE RUN AND DO NOT CHANGE ANYTHING IN THIS CELL ******
### Sanity Check: nltk_sentimentanalysis

nltk_sentimentanalysis_1 = "It is marvelous."
nltk_sentimentanalysis_2 = "This movie is suck."

assert nltk_sentimentanalysis(nltk_sentimentanalysis_1) == {'neg': 0.0, 'neu': 0.339, 'pos': 0.661, 'compound': 0.5994}, "nltk_sentimentanalysis_1: Wrong Output"
print("nltk_sentimentanalysis_1: Passed!")

assert nltk_sentimentanalysis(nltk_sentimentanalysis_2) == {'neg': 0.492, 'neu': 0.508, 'pos': 0.0, 'compound': -0.4404}, "nltk_sentimentanalysis_2: Wrong Output"
print("nltk_sentimentanalysis_2: Passed!")

nltk_sentimentanalysis_1: Passed!
nltk_sentimentanalysis_2: Passed!
