### **Exercise: Sentiment Analysis and Key Insights Extraction from Ford Car Reviews**

### **Problem Statement:**
You have been provided with a dataset containing Ford car reviews. Using LangChain and the concepts you’ve learned, complete the following tasks:

1. **Sentiment Analysis**: Analyze the sentiment of each review, categorize it as positive, neutral, or negative, and store the result.
2. **Key Insights Extraction**: Extract key pieces of information from each review, such as the pros and cons mentioned, and the specific features the reviewer liked or disliked (e.g., vehicle performance, comfort, price).

You will build a LangChain-based solution that leverages language models to automatically extract this information and provide a structured summary of the reviews. 

---
### **Steps to Solve:**

#### **Step 1: Load the Dataset**
- The dataset file is named `ford_car_reviews.csv` and is sourced from Kaggle: [Edmunds Consumer Car Ratings and Reviews](https://www.kaggle.com/datasets/ankkur13/edmundsconsumer-car-ratings-and-reviews).
- For this exercise, **limit the data to the first 25 records**. This can be achieved by using `df.head(25)` or `df.iloc[:25]` when loading the data into a DataFrame.
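
A minimal sketch of this loading step, assuming pandas is installed. The real CSV is not bundled with this text, so the example reads a small in-memory stand-in; the `SAMPLE_CSV` rows and the `load_reviews` helper are hypothetical and exist only for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for ford_car_reviews.csv (40 made-up rows).
SAMPLE_CSV = "Vehicle_Title,Review,Rating\n" + "".join(
    f"Car {i},Review text {i},{(i % 5) + 1}\n" for i in range(40)
)


def load_reviews(source, limit: int = 25) -> pd.DataFrame:
    """Read at most `limit` rows from a CSV path or file-like object."""
    # nrows stops parsing after `limit` data rows, so the rest is never loaded.
    return pd.read_csv(source, nrows=limit)


reviews = load_reviews(io.StringIO(SAMPLE_CSV))
print(reviews.shape)
```

With the real file, `load_reviews("ford_car_reviews.csv")` should give the same rows as reading the whole file and then calling `df.head(25)`.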

#### **Step 2: Define the Sentiment Analysis Task**
- Use LangChain to create a pipeline to classify the sentiment of each review.
- Define a prompt that instructs the model to classify the sentiment. For example:
  - "Given the following car review, classify the sentiment as positive, neutral, or negative."
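
A model may not reply with exactly one of the three labels; it might add punctuation, change the case, or wrap the label in a sentence. The sketch below shows one way to map a free-form reply onto the allowed labels before storing it. `normalize_sentiment` is a hypothetical helper, and falling back to `"neutral"` for unrecognized replies is an assumption, not a requirement of the exercise.

```python
import re

ALLOWED_SENTIMENTS = ("positive", "neutral", "negative")


def normalize_sentiment(raw_reply: str, default: str = "neutral") -> str:
    """Map a free-form model reply onto one of the three allowed labels."""
    # Find the first allowed label that appears as a whole word.
    pattern = r"\b(" + "|".join(ALLOWED_SENTIMENTS) + r")\b"
    match = re.search(pattern, raw_reply.lower())
    return match.group(1) if match else default


print(normalize_sentiment("Sentiment: NEGATIVE"))
```

This simple keyword match does not understand negation (for example "not positive"), so it works best when the prompt asks the model to answer with the label only.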

#### **Step 3: Key Insights Extraction**
- Use LangChain to create a pipeline to extract pros, cons, and notable features from each review. Define prompts such as:
  - "What are the pros and cons of the vehicle described in the following review?"
  - "What specific features of the vehicle does the reviewer like or dislike?"
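
Extraction prompts like these are easiest to consume when the model is asked to answer in JSON, but replies often include extra text around the JSON object. The following is a standard-library sketch of pulling the first JSON object out of such a reply. `extract_json_object` is a hypothetical helper and only one possible approach; LangChain's `JsonOutputParser` is an alternative.

```python
import json


def extract_json_object(reply: str) -> dict:
    """Return the first JSON object embedded in `reply`, or {} if none parses."""
    decoder = json.JSONDecoder()
    # Try each "{" as a possible start; raw_decode tolerates trailing text.
    for start, char in enumerate(reply):
        if char != "{":
            continue
        try:
            parsed, _ = decoder.raw_decode(reply[start:])
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed
    return {}


reply = 'Sure! {"Pros": ["Great mileage"], "Cons": []} Hope that helps.'
print(extract_json_object(reply))
```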

#### **Step 4: Update the DataFrame with New Information**
- Run the pipeline for each review and collect the sentiment and insights.
- Once the analysis and extraction are complete, update the original DataFrame with additional columns to include:
  - Sentiment (positive, neutral, negative)
  - Pros
  - Cons
  - Liked_Features
  - Disliked_Features
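
One way to attach these columns is sketched below, assuming pandas is installed. `fake_analyze` is a hypothetical stand-in for the LLM call so that the example runs offline, and `add_analysis_columns` is an illustrative helper name rather than part of any library.

```python
import pandas as pd

NEW_COLUMNS = ["Sentiment", "Pros", "Cons", "Liked_Features", "Disliked_Features"]


def fake_analyze(review_text: str) -> dict:
    """Hypothetical stand-in for the LLM call; returns fixed-shape results."""
    liked = ["mileage"] if "mileage" in review_text.lower() else []
    return {
        "Sentiment": "Positive" if liked else "Neutral",
        "Pros": ", ".join(liked) or "None mentioned",
        "Cons": "None mentioned",
        "Liked_Features": liked,
        "Disliked_Features": [],
    }


def add_analysis_columns(df: pd.DataFrame, analyze, text_column: str = "Review") -> pd.DataFrame:
    """Return a copy of `df` with one new column per analysis field."""
    results = [analyze(text) for text in df[text_column]]
    # Build the new columns on the same index so rows stay aligned.
    analysis_df = pd.DataFrame(results, index=df.index, columns=NEW_COLUMNS)
    return pd.concat([df, analysis_df], axis=1)


df = pd.DataFrame({"Review": ["Great mileage.", "It is a car."]})
enriched = add_analysis_columns(df, fake_analyze)
print(enriched[["Review", "Sentiment", "Pros"]])
```

Collecting all results first and concatenating once keeps the original DataFrame unchanged until every review has been processed.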

---

### **Example Output:**

```json
{
  "Review_Date": "03/07/13",
  "Vehicle_Title": "2006 Ford Mustang Coupe",
  "Review_Text": "With the expected arrival of our 6th child...",
  "Rating": 4.125,
  "Sentiment": "Positive",
  "Pros": "Good driving experience, Large seating capacity, Great options",
  "Cons": "None mentioned",
  "Liked_Features": ["Driving experience", "Seating capacity", "Options available"],
  "Disliked_Features": []
}
```
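
In this example `Pros` and `Cons` are plain strings while `Liked_Features` and `Disliked_Features` are lists. A model may return any of these fields as a string, a list, or not at all, so a small formatting step helps. The sketch below uses a hypothetical helper, `to_display_string`, to produce the string form shown above.

```python
def to_display_string(value, empty_text: str = "None mentioned") -> str:
    """Render a model field (string, list, or missing) as one readable string."""
    if value is None:
        return empty_text
    if isinstance(value, str):
        # Keep a plain string as-is; joining it would split it into characters.
        return value.strip() or empty_text
    items = [str(item).strip() for item in value if str(item).strip()]
    return ", ".join(items) if items else empty_text


print(to_display_string(["Good driving experience", "Great options"]))
```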

In [1]:
# Importing built-in libraries for system operations and data handling
import os  # For operating system interactions like path and environment management
import json  # To handle JSON data for parsing and serialization
import getpass  # To securely manage sensitive input like passwords

# Importing the Pandas library for data manipulation and analysis
import pandas as pd

# Importing the dotenv module to load environment variables from a .env file
from dotenv import load_dotenv

# Importing the LangChain components used in this exercise
from langchain_groq import ChatGroq  # Chat model wrapper for LLMs hosted on Groq
from langchain_core.prompts import ChatPromptTemplate  # For creating structured prompts
from langchain_core.output_parsers import JsonOutputParser  # For parsing model output into JSON


In [3]:
def load_api_key(env_file: str = ".env") -> None:
    """
    Load the API key from the specified .env file, or prompt the user to input it manually if not found.

    :param env_file: Path to the .env file containing the GROQ_API_KEY.
    """
    # Load environment variables from the specified .env file, overriding any existing values
    load_dotenv(env_file, override=True)
    
    # Check whether GROQ_API_KEY is set to a non-empty value in the environment
    if not os.environ.get("GROQ_API_KEY"):
        # Prompt the user to enter the API key securely if it is missing or empty
        os.environ["GROQ_API_KEY"] = getpass.getpass("GROQ API Key: ")


In [5]:
# Function to configure and return a ChatGroq LLM instance
def get_llm_model(model_id: str = "llama3-8b-8192", temperature: float = 0.0) -> ChatGroq:
    """
    Create an instance of the ChatGroq model with specific parameters.

    :param model_id: The identifier for the desired LLM model (default is "llama3-8b-8192").
    :param temperature: A float controlling the randomness of output (default is 0.0 for deterministic results).
    :return: An instance of ChatGroq configured with the given parameters.
    """
    # Instantiate and return the ChatGroq model using the provided settings
    return ChatGroq(model_name=model_id, temperature=temperature)


In [7]:
# Define the system message template for sentiment analysis and feature extraction
SYSTEM_MESSAGE_TEMPLATE = """
Given the following car review, perform the following tasks:
1. Classify the sentiment as Positive, Neutral, or Negative.
2. Identify the following details:
   - Pros: List the positive aspects of the vehicle mentioned in the review.
   - Cons: List the negative aspects of the vehicle mentioned in the review.
   - Liked Features: Highlight specific features the reviewer liked (if mentioned).
   - Disliked Features: Highlight specific features the reviewer disliked (if mentioned).
Respond with a JSON object containing these fields: 
Sentiment, Pros, Cons, Liked_Features, Disliked_Features.
"""

# Create a chat prompt template using the system message and user input
# (from_messages is the documented constructor and works across langchain_core versions)
chat_prompt_template = ChatPromptTemplate.from_messages([
    ("system", SYSTEM_MESSAGE_TEMPLATE),  # Add system instructions
    ("human", "{user_input}"),            # Insert user input dynamically
])


In [9]:
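from langchain_core.exceptions import OutputParserException  # Raised by JsonOutputParser on invalid JSON


def analyze_review(llm: ChatGroq, review_text: str) -> dict:
    """
    Analyzes a single review using the provided LLM instance.
    Returns a structured dictionary containing sentiment, pros, cons, etc.

    :param llm: Instance of ChatGroq or similar LLM.
    :param review_text: The text of the review to analyze.
    :return: A dictionary with keys "Sentiment", "Pros", "Cons",
             "Liked_Features", and "Disliked_Features".
    """
    # Fallback values, used when parsing fails or the model omits a field
    fallback = {
        "Sentiment": "Neutral",
        "Pros": "Could not extract",
        "Cons": "Could not extract",
        "Liked_Features": [],
        "Disliked_Features": []
    }

    # Build role-tagged chat messages (system + human) instead of one flat string
    messages = chat_prompt_template.format_messages(user_input=review_text)

    # Invoke the LLM
    response = llm.invoke(messages)

    # Parse the JSON response; the parser raises OutputParserException
    # (not json.JSONDecodeError) when the reply is not valid JSON
    try:
        parsed = JsonOutputParser().invoke(response)
    except OutputParserException:
        return fallback

    # Guard against non-object JSON and fill in any fields the model left out
    if not isinstance(parsed, dict):
        return fallback
    return {**fallback, **parsed}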

In [11]:
def main(csv_file_path: str = "ford_car_reviews.csv") -> None:
    """
    Reads a CSV file of Ford car reviews, analyzes each review, and prints the results.

    :param csv_file_path: Path to the CSV file containing the reviews.
    """
    # 1. Load environment variables and API key
    load_api_key(".env")

    # 2. Instantiate the LLM
    model = get_llm_model(model_id="llama3-8b-8192", temperature=0.0)

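    # 3. Read the CSV file into a DataFrame and keep the first 25 reviews
    df = pd.read_csv(csv_file_path, engine="python")
    # Drop the leftover index column; errors="ignore" avoids a KeyError if it is absent
    df = df.drop(columns="Unnamed: 0", errors="ignore")
    df = df.iloc[:25]

    # 4. Process and transform output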
    for _, row in df.iterrows():
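        # Extract columns; missing (NaN) text fields become "" so .strip() is safe later
        review_date    = "" if pd.isna(row.get("Review_Date")) else str(row["Review_Date"])      # e.g., " on 08/12/17 06:06 AM (PDT)"
        vehicle_title  = "" if pd.isna(row.get("Vehicle_Title")) else str(row["Vehicle_Title"])  # e.g., "2006 Ford Mustang Coupe ..."
        review_text    = "" if pd.isna(row.get("Review")) else str(row["Review"])                # The full text of the review
        rating         = row.get("Rating", 0.0)                                                  # e.g., 3.0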

        # Analyze the review for sentiment, pros, cons, etc.
        analysis = analyze_review(model, review_text)

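        # Convert lists to comma-separated strings and keep plain strings as-is
        # (joining a string would split it into characters);
        # use "None mentioned" if the field is missing or empty
        pros, cons = analysis.get("Pros"), analysis.get("Cons")
        pros_str = (pros if isinstance(pros, str) else ", ".join(map(str, pros or []))) or "None mentioned"
        cons_str = (cons if isinstance(cons, str) else ", ".join(map(str, cons or []))) or "None mentioned"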

        # Construct the final dictionary in the desired structure
        # (.get with defaults avoids a KeyError if the model omitted a field)
        final_output = {
            "Review_Date": review_date.strip(),   # or transform to a different date format if needed
            "Vehicle_Title": vehicle_title,
            "Review_Text": review_text.strip(),
            "Rating": float(rating),              # ensure it's a float
            "Sentiment": analysis.get("Sentiment", "Neutral"),
            "Pros": pros_str,
            "Cons": cons_str,
            "Liked_Features": analysis.get("Liked_Features", []),
            "Disliked_Features": analysis.get("Disliked_Features", [])
        }

        # Print out the final output in JSON form
        print(json.dumps(final_output, indent=2))
        print()

In [15]:
if __name__ == "__main__":
    main()

{
  "Review_Date": "on 06/06/18 14:19 PM (PDT)",
  "Vehicle_Title": "2006 Ford Mustang Coupe GT Premium 2dr Coupe (4.6L 8cyl 5M)",
  "Review_Text": "Doesn\u2019t disappoint",
  "Rating": 5.0,
  "Sentiment": "Positive",
  "Pros": "None mentioned",
  "Cons": "None mentioned",
  "Liked_Features": [],
  "Disliked_Features": []
}

{
  "Review_Date": "on 08/12/17 06:06 AM (PDT)",
  "Vehicle_Title": "2006 Ford Mustang Coupe V6 Standard 2dr Coupe (4.0L 6cyl 5M)",
  "Review_Text": "I bought mine 4/17 with 98K. Have been wanting a V6 5-sp, '05-'09 vintage for years. The engine is fine. Sounds good. Great mileage. Good power. I pride myself on smooth take-off and gear changes, but this is the orneriest transmission I've ever used! The difference between idle and 4000 rpm is about 1/8\" on the gas pedal, so starting-out without either stalling or way over-revving takes a LOT of finesse. Gear changes are very difficult to master smoothly without lurching. The ride is very harsh with a lot of road n