**🧠 Section 1: Chatbot Layer (Ollama + RAG)**

Handles:
Excel reading ➜ chunking ➜ FAISS + embeddings
User query ➜ relevant vehicle data ➜ llama2 via Ollama

***1. Excel Preprocessing***

Goal: Load Excel → Clean → Convert to chunks.

In [1]:
import pandas as pd

# Load Excel file
df = pd.read_excel("vehicles_augmented.xlsx")

# Clean data: Drop rows with missing values
df.dropna(inplace=True)

# Convert each row to a formatted text chunk
chunks = []
for _, row in df.iterrows():
    text = f"""
    ID: {row['id']}
    Brand: {row['brand']}
    Model: {row['model']}
    Type: {row['type']}
    Category: {row['category']}
    Price: {row['price']}
    Year: {row['year']}
    Fuel Type: {row['fuel_type']}
    Mileage: {row['mileage']}
    Engine Capacity: {row['engine_capacity']}
    Fuel Tank Capacity: {row['fuel_tank_capacity']}
    Seat Capacity: {row['seat_capacity']}
    Transmission: {row['transmission']}
    Safety Rating: {row['safety_rating']}
    Maintenance Cost: {row['maintenance_cost']}
    After Sales Service: {row['after_sales_service']}
    Financing Options: {row['financing_options']}
    Insurance Info: {row['insurance_info']}
    Additional Features: {row['additional_features']}
    Warranty: {row['warranty']}
    Seller: {row['seller_name']} - {row['seller_contact']} - {row['seller_location']}
    Make Country: {row['make_country']}
    Imported From: {row['imported_from']}
    """
    chunks.append(text)

# Display the first few chunks for verification
for chunk in chunks[:5]:
    print(chunk)


    ID: 1
    Brand: Toyota
    Model: Yaris
    Type: Sedan
    Category: Car
    Price: 3400000
    Year: 2021
    Fuel Type: Petrol
    Mileage: 17.0
    Engine Capacity: 1496
    Fuel Tank Capacity: 37
    Seat Capacity: 5
    Transmission: Automatic
    Safety Rating: 3-Star
    Maintenance Cost: Low
    After Sales Service: Average
    Financing Options: Available
    Insurance Info: Standard
    Additional Features: Bluetooth|Backup Camera|AC
    Warranty: 1 year
    Seller: CarMart - 94724290758 - Negombo
    Make Country: UK
    Imported From: India
    

    ID: 2
    Brand: Honda
    Model: City
    Type: Sedan
    Category: Car
    Price: 3200000
    Year: 2020
    Fuel Type: Petrol
    Mileage: 16.5
    Engine Capacity: 1498
    Fuel Tank Capacity: 52
    Seat Capacity: 4
    Transmission: Automatic
    Safety Rating: 3-Star
    Maintenance Cost: High
    After Sales Service: Good
    Financing Options: Not Available
    Insurance Info: Standard
    Additional Features: B
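
The cell above drops every row that has any missing value and hard-codes one f-string line per column. As a hedged alternative, the sketch below renders a record from a label-to-column mapping and skips only the missing fields. `FIELDS`, `is_missing`, and `row_to_chunk` are hypothetical names, and the mapping shown is a small subset chosen for illustration.

```python
import math

# Hypothetical subset of the label -> column mapping used in the cell above.
FIELDS = {
    "ID": "id",
    "Brand": "brand",
    "Model": "model",
    "Price": "price",
    "Fuel Type": "fuel_type",
}


def is_missing(value):
    """Treat None and float NaN (pandas' usual missing marker) as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def row_to_chunk(row, fields=FIELDS):
    """Render one record as 'Label: value' lines, skipping missing fields."""
    lines = [
        f"{label}: {row[column]}"
        for label, column in fields.items()
        if column in row and not is_missing(row[column])
    ]
    return "\n".join(lines)


# The fuel type is deliberately missing to show that the field is skipped.
example_row = {"id": 1, "brand": "Toyota", "model": "Yaris",
               "price": 3400000, "fuel_type": float("nan")}
chunk = row_to_chunk(example_row)
print(chunk)
```

Because the lines are joined without the f-string's leading indentation, the resulting chunks also avoid the four-space prefix visible in the printed output above.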

***2. Vectorization (Embeddings)***

Goal: Convert each text chunk to a vector using the sentence-transformers model all-MiniLM-L6-v2. (The LangChain HuggingFaceEmbeddings wrapper around the same model is used later, in step 4.)

In [None]:
from sentence_transformers import SentenceTransformer

# Initialize the embedding model
embedding_model = SentenceTransformer('all-MiniLM-L6-v2') 

# Convert chunks to vectors
batch_size = 10
vectors = []
for i in range(0, len(chunks), batch_size):
    batch = chunks[i:i + batch_size]
    batch_vectors = embedding_model.encode(batch, convert_to_numpy=True) 
    vectors.extend(batch_vectors)

# Display the first few vectors for verification
for i, vector in enumerate(vectors[:5]):
    print(f"Vector {i+1}: {vector}")



Vector 1: [-2.32640225e-02  3.60298762e-03  4.47034128e-02  3.77844684e-02
 -1.56907737e-02 -3.50220711e-03  3.27010117e-02  4.79714982e-02
 -7.85652827e-03 -1.56165799e-02 -1.23091843e-02 -5.45936152e-02
 -2.73304712e-02 -9.38871596e-03 -4.97785546e-02 -7.16260448e-02
  5.64700104e-02 -1.02733769e-01 -7.63648655e-03 -8.12262446e-02
 -9.30972397e-03 -3.45906080e-03  6.83350787e-02  1.23466004e-03
 -1.61448028e-02 -6.75552562e-02  5.37051633e-02  7.88248554e-02
 -3.71149778e-02 -8.44119936e-02 -2.64703594e-02  4.62266058e-02
  5.46249188e-02  1.77648440e-02  4.00014855e-02 -3.89881842e-02
  3.04867141e-02 -5.47068417e-02 -4.96079922e-02  6.13481039e-03
 -3.12971626e-03 -8.45726207e-02 -6.05935194e-02  9.68690142e-02
 -6.21492974e-03 -8.95419624e-03 -1.36363015e-01  8.11367389e-03
  7.65117183e-02  7.76581839e-03 -1.14013210e-01  4.85016704e-02
 -1.89929195e-02 -2.00088583e-02  5.70729524e-02 -4.35707532e-02
 -6.03918023e-02 -4.04813774e-02 -2.57157963e-02 -1.26265977e-02
  4.95856032e-0

***3. Store Embeddings in FAISS***

In [4]:
import faiss
import numpy as np

# Convert vectors to a numpy array
vectors_np = np.array(vectors).astype('float32')  

# Initialize a FAISS index
index = faiss.IndexFlatL2(vectors_np.shape[1])  

# Add vectors to the FAISS index
index.add(vectors_np)

# Verify the number of vectors in the index
print(f"Number of vectors in the index: {index.ntotal}")

Number of vectors in the index: 1995
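
`IndexFlatL2` performs an exact, exhaustive comparison of the query against every stored vector and ranks by squared L2 distance. The pure-Python sketch below illustrates that idea on toy two-dimensional vectors. It is not the FAISS implementation, and `squared_l2`, `flat_l2_search`, and the toy data are made up for the example.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_l2_search(stored, query, k):
    """Return (distance, position) pairs for the k stored vectors nearest to query."""
    scored = sorted(
        (squared_l2(vec, query), pos) for pos, vec in enumerate(stored)
    )
    return scored[:k]


# Toy vectors standing in for the 384-dimensional embeddings.
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
hits = flat_l2_search(toy_vectors, query=[0.9, 0.1], k=2)
for distance, position in hits:
    print(f"position={position} squared_distance={distance:.2f}")
```

With the real index, the equivalent lookup would be `distances, ids = index.search(query_vectors, k)`, where `query_vectors` is a 2-D float32 array produced by the same embedding model.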


***4. Query Handling (RAG using LangChain)***

In [9]:
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.schema import Document

# Initialize the embedding model using LangChain's HuggingFaceEmbeddings
embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

# Create FAISS vector store
docs = [Document(page_content=chunk) for chunk in chunks]
vector_store = FAISS.from_documents(docs, embedding=embedding_model)

# Retrieve top results based on a query
retriever = vector_store.as_retriever()
query = "Show me electric SUVs under 5 million"
relevant_docs = retriever.get_relevant_documents(query)

# Display the retrieved documents
for i, doc in enumerate(relevant_docs):
    print(f"Document {i+1}:\n{doc.page_content}\n")



Document 1:

    ID: 1909
    Brand: Nissan
    Model: Sunny
    Type: SUV
    Category: SUV
    Price: 13985647
    Year: 2023
    Fuel Type: Electric
    Mileage: 22.8
    Engine Capacity: 2405
    Fuel Tank Capacity: 45
    Seat Capacity: 7
    Transmission: Manual
    Safety Rating: 3-Star
    Maintenance Cost: Medium
    After Sales Service: Average
    Financing Options: Not Available
    Insurance Info: Comprehensive
    Additional Features: Backup Camera|Leather Seats|AC
    Warranty: 5 years
    Seller: SriLankaCars - 30245506031 - Jaffna
    Make Country: USA
    Imported From: South Korea
    

Document 2:

    ID: 1692
    Brand: Tesla
    Model: Model S
    Type: SUV
    Category: SUV
    Price: 13482908
    Year: 2018
    Fuel Type: Electric
    Mileage: 16.9
    Engine Capacity: 1291
    Fuel Tank Capacity: 80
    Seat Capacity: 7
    Transmission: Automatic
    Safety Rating: 4-Star
    Maintenance Cost: Medium
    After Sales Service: Good
    Financing Options: Availa
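
In the output above, the query asks for vehicles under 5 million, yet the first two retrieved records are priced at 13985647 and 13482908. Embedding similarity matches on wording and does not enforce numeric limits. One possible mitigation, sketched below under the assumption that chunks keep the `Label: value` layout, is to parse the retrieved text back into fields and apply the hard constraints in code. `parse_chunk` and `matches` are hypothetical helpers, and the second record is invented for the example.

```python
def parse_chunk(text):
    """Turn 'Label: value' lines back into a dict of strings."""
    record = {}
    for line in text.splitlines():
        label, sep, value = line.strip().partition(":")
        if sep:  # ignore lines without a colon, such as blank lines
            record[label.strip()] = value.strip()
    return record


def matches(record, fuel_type, vehicle_type, max_price):
    """Check the structured constraints that similarity search cannot enforce."""
    return (
        record.get("Fuel Type") == fuel_type
        and record.get("Type") == vehicle_type
        and float(record.get("Price", "inf")) <= max_price
    )


retrieved = [
    # Fields copied from Document 1 in the output above.
    "ID: 1909\nType: SUV\nPrice: 13985647\nFuel Type: Electric",
    # Made-up record that does satisfy the price limit.
    "ID: 9001\nType: SUV\nPrice: 4800000\nFuel Type: Electric",
]
records = [parse_chunk(text) for text in retrieved]
kept = [r for r in records if matches(r, "Electric", "SUV", 5_000_000)]
print(kept)
```

Filtering the DataFrame on `price`, `fuel_type`, and `type` before building the vector store for a given request would be another way to reach the same effect.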

***5. LLM Interaction with Ollama (llama2:latest)***

In [10]:
from langchain.llms import Ollama

# Initialize the LLM model
llm = Ollama(model="llama2")

# Combine the query and relevant documents into a single prompt
prompt = query + "\n\n" + "\n".join([d.page_content for d in relevant_docs])

# Invoke the LLM with the prompt
response = llm.invoke(prompt)

# Display the response
print("LLM Response:")
print(response)

  llm = Ollama(model="llama2")


LLM Response:
Sure, here are some electric SUVs under $5 million:

1. Nissan Sunny (ID: 1909) - $13985647
	* Year: 2023
	* Fuel Type: Electric
	* Mileage: 22.8
	* Engine Capacity: 2405
	* Fuel Tank Capacity: 45
	* Seat Capacity: 7
	* Transmission: Manual
	* Safety Rating: 3-Star
	* Maintenance Cost: Medium
	* After Sales Service: Average
	* Financing Options: Not Available
	* Insurance Info: Comprehensive
	* Additional Features: Backup Camera|Leather Seats|AC
	* Warranty: 5 years
	* Seller: SriLankaCars - 30245506031 - Jaffna
2. Tesla Model S (ID: 1692) - $13482908
	* Year: 2018
	* Fuel Type: Electric
	* Mileage: 16.9
	* Engine Capacity: 1291
	* Fuel Tank Capacity: 80
	* Seat Capacity: 7
	* Transmission: Automatic
	* Safety Rating: 4-Star
	* Maintenance Cost: Medium
	* After Sales Service: Good
	* Financing Options: Available
	* Insurance Info: Comprehensive
	* Additional Features: Infotainment|GPS
	* Warranty: 1 year
	* Seller: CarMart - 94288988408 - Kandy
3. Tesla Model 3 (ID: 538) 
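
The response above lists vehicles priced above 13 million as being "under $5 million" and adds a dollar sign that does not appear in the records. The prompt in the cell is just the query followed by the raw chunks, with no instructions. A more explicit template might reduce this kind of answer, although no prompt guarantees it. The sketch below is one possible wording; `build_prompt` is a hypothetical helper and the instruction text is an assumption, not something taken from the notebook.

```python
def build_prompt(question, context_chunks):
    """Assemble an instruction-led prompt from a question and retrieved chunks."""
    context = "\n\n".join(chunk.strip() for chunk in context_chunks)
    return (
        "Answer the question using only the vehicle records below. "
        "Report prices exactly as written and do not add a currency symbol. "
        "If no record satisfies every condition in the question, say so.\n\n"
        f"Records:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Toy chunk standing in for the retrieved documents.
example_prompt = build_prompt(
    "Which records are electric?",
    ["  ID: 1\n  Fuel Type: Electric  "],
)
print(example_prompt)
```

In the notebook this would be called as `llm.invoke(build_prompt(query, [d.page_content for d in relevant_docs]))`.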

***6. Sample Queries for Testing***

In [11]:
# Example query
query = "Show me electric vehicles with a range above 300 km"

# Retrieve relevant documents
relevant_docs = retriever.get_relevant_documents(query)

# Combine the query and relevant documents into a single prompt
prompt = query + "\n\n" + "\n".join([d.page_content for d in relevant_docs])

# Invoke the LLM with the prompt
response = llm.invoke(prompt)

# Display the response
print("LLM Response:")
print(response)

LLM Response:
Sure! Here are some electric vehicles with a range above 300 km:

1. Tesla Model 3 (ID: 1758) - With a range of 326 km, this SUV is one of the best options for those looking for an electric vehicle with a long range. The price is around LKR 11 million, and it has a 4-star safety rating, medium maintenance cost, and excellent after-sales service.
2. Tesla Model S (ID: 700) - This van has a range of 386 km and is priced around LKR 9.8 million. It has a 3-star safety rating, high maintenance cost, and average after-sales service.
3. Tesla Model X (ID: 1692) - With a range of 507 km, this SUV is one of the longest-range electric vehicles available in Sri Lanka. It's priced around LKR 13.4 million and has a 4-star safety rating, medium maintenance cost, and good after-sales service.

Note: The prices mentioned are approximate and may vary depending on various factors such as location, taxes, and customs duties.


In [12]:
# Example query
query = "Show me vehicles with a 5-star safety rating"

# Retrieve relevant documents
relevant_docs = retriever.get_relevant_documents(query)

# Combine the query and relevant documents into a single prompt
prompt = query + "\n\n" + "\n".join([d.page_content for d in relevant_docs])

# Invoke the LLM with the prompt
response = llm.invoke(prompt)

# Display the response
print("LLM Response:")
print(response)

LLM Response:
Certainly! Based on the information provided, here are the vehicles with a 5-star safety rating:

1. ID: 565 - Mercedes E-Class SUV (2023) - Safety Rating: 5-Star
2. ID: 1055 - Mercedes E-Class (2020) - Safety Rating: 5-Star
3. ID: 754 - Mercedes C-Class (2018) - Safety Rating: 5-Star

These vehicles have received a 5-star safety rating, indicating that they have met the highest standards for safety and crashworthiness. It's worth noting that safety ratings can vary by model year and trim level, so it's important to check the specific safety ratings of any vehicle you're considering purchasing.


***7. Sample Query for Comparison Testing***

In [13]:
from langchain.llms import Ollama
import pandas as pd

# Initialize the LLM model
llm = Ollama(model="llama2")

# Example query for comparing two vehicles
query = "Compare the Tesla Model S and BMW i8 in terms of price, mileage, and safety rating."

# Retrieve relevant documents
relevant_docs = retriever.get_relevant_documents(query)

# Combine the query and relevant documents into a single prompt
prompt = (
    query
    + "\n\n"
    + "Please provide the comparison in a structured table format with columns: Attribute, Tesla Model S, BMW i8.\n\n"
    + "\n".join([d.page_content for d in relevant_docs])
)

# Invoke the LLM with the prompt
response = llm.invoke(prompt)

# NOTE: `response` is not parsed in this cell. The rows below are hard-coded
# placeholder values, not figures from the dataset or from the LLM output.
data = [
    {"Attribute": "Price", "Tesla Model S": "$80,000", "BMW i8": "$140,000"},
    {"Attribute": "Mileage", "Tesla Model S": "402 miles", "BMW i8": "330 miles"},
    {"Attribute": "Safety Rating", "Tesla Model S": "5 stars", "BMW i8": "4 stars"}
]
# Use a separate name so the vehicle DataFrame `df` is not overwritten
comparison_df = pd.DataFrame(data)
print("\nComparison Table:")
print(comparison_df)


Comparison Table:
       Attribute Tesla Model S     BMW i8
0          Price       $80,000   $140,000
1        Mileage     402 miles  330 miles
2  Safety Rating       5 stars    4 stars
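
The table printed above comes from values written directly in the cell, not from the model's answer. If the LLM does return a pipe-delimited table as the prompt requests, it could be parsed along the lines of the sketch below. `parse_markdown_table` is a hypothetical helper, the sample response is invented, and real model output may not follow this layout at all, so an empty result needs handling.

```python
def parse_markdown_table(text):
    """Extract rows from pipe-delimited table lines found in text."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not (line.startswith("|") and line.endswith("|")):
            continue  # not a table line
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        # Skip the header separator row, e.g. |---|:---:|
        if all(cell and set(cell) <= set("-: ") for cell in cells):
            continue
        rows.append(cells)
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


# Invented response with placeholder names and values.
sample_response = """Here is the comparison:

| Attribute | Vehicle A | Vehicle B |
|---|---|---|
| Price | 100 | 200 |
| Seat Capacity | 5 | 7 |
"""
table_rows = parse_markdown_table(sample_response)
print(table_rows)
```

The resulting list of dicts can be passed straight to `pd.DataFrame(table_rows)`.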


**📊 Section 2: Training Local ML Models for Backend**

Handles:

✅ 1. vehicle_filter.pkl – For filtering by user attributes

✅ 2. vehicle_comparator.pkl – For comparing two vehicles

✅ 3. price_emi_model.pkl – For predicting price and EMI

✅ 4. recommendation_model.pkl – Top-K similar vehicle recommender

***Models***

In [18]:
import pandas as pd

# Load your dataset (update path as needed)
df = pd.read_excel("vehicles_augmented.xlsx")
df.dropna(inplace=True) 

print(df.head())

   id   brand         model       type category    price  year fuel_type  \
0   1  Toyota         Yaris      Sedan      Car  3400000  2021    Petrol   
1   2   Honda          City      Sedan      Car  3200000  2020    Petrol   
2   3  Toyota       Corolla      Sedan      Car  3450000  2019    Petrol   
3   4  Toyota  Land Cruiser  Hatchback      Car  2750088  2019    Hybrid   
4   5  Toyota       Corolla  Hatchback      Car  5952113  2021    Petrol   

   mileage  engine_capacity  ...  after_sales_service  financing_options  \
0     17.0             1496  ...              Average          Available   
1     16.5             1498  ...                 Good      Not Available   
2     18.0             1798  ...            Excellent      Not Available   
3     22.0             1333  ...            Excellent      Not Available   
4     18.0             1304  ...            Excellent      Not Available   

  insurance_info             additional_features warranty   seller_name  \
0       Sta

In [19]:
print(df.columns)

Index(['id', 'brand', 'model', 'type', 'category', 'price', 'year',
       'fuel_type', 'mileage', 'engine_capacity', 'fuel_tank_capacity',
       'seat_capacity', 'transmission', 'safety_rating', 'maintenance_cost',
       'after_sales_service', 'financing_options', 'insurance_info',
       'additional_features', 'warranty', 'seller_name', 'seller_contact',
       'seller_location', 'make_country', 'imported_from'],
      dtype='object')


**1️⃣ Vehicle Filter Model – vehicle_filter.pkl**

In [20]:
import pickle
from sklearn.preprocessing import LabelEncoder

# Select relevant columns for filtering
filter_columns = ['brand', 'model', 'fuel_type', 'transmission', 'year', 'mileage']
df_filter = df[filter_columns].copy()

# Encode categorical features
encoders = {}
for col in df_filter.select_dtypes(include='object'): 
    le = LabelEncoder()
    df_filter[col] = le.fit_transform(df_filter[col]) 
    encoders[col] = le 

# Save encoded data and encoders to a .pkl file
with open("vehicle_filter.pkl", "wb") as f:
    pickle.dump({'data': df_filter, 'encoders': encoders}, f)

print("✅ vehicle_filter.pkl created.")

✅ vehicle_filter.pkl created.
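
The pickle stores integer codes in place of the original strings, so a filter on a value such as "Petrol" has to go through the saved encoder first. `LabelEncoder` assigns codes in sorted order of the distinct labels. The pure-Python sketch below mimics that mapping on made-up values to show the lookup; `fit_label_codes` is a hypothetical stand-in, not part of scikit-learn.

```python
def fit_label_codes(values):
    """Map each distinct label to an integer code, in sorted label order."""
    return {label: code for code, label in enumerate(sorted(set(values)))}


# Made-up column values for illustration.
fuel_types = ["Petrol", "Hybrid", "Electric", "Petrol"]
codes = fit_label_codes(fuel_types)
encoded = [codes[value] for value in fuel_types]

# Translate the user's choice into its code, then filter on the encoded column.
wanted = codes["Petrol"]
matching_rows = [i for i, code in enumerate(encoded) if code == wanted]
print(codes, encoded, matching_rows)
```

With the real pickle, the same lookup would be `encoders['fuel_type'].transform(['Petrol'])[0]`, followed by a comparison against `data['fuel_type']`.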


**2️⃣ Vehicle Comparison – vehicle_comparator.pkl**

In [23]:
import pickle

import pandas as pd

# Define the columns required for comparison (use correct column names)
comparison_columns = ['brand', 'model', 'fuel_type', 'transmission', 'year', 'mileage', 'price', 'safety_rating', 'engine_capacity', 'fuel_tank_capacity', 'seat_capacity', 'maintenance_cost', 'after_sales_service', 'financing_options', 'insurance_info', 'additional_features', 'warranty']

# Ensure all required columns exist in the DataFrame
missing_columns = [col for col in comparison_columns if col not in df.columns]
if missing_columns:
    raise KeyError(f"The following columns are missing from the DataFrame: {missing_columns}")

# Select relevant columns for comparison
df_comparator = df[comparison_columns].copy()

# Save the comparison data to a .pkl file
with open("vehicle_comparator.pkl", "wb") as f:
    pickle.dump(df_comparator, f)

print("✅ vehicle_comparator.pkl created.")

✅ vehicle_comparator.pkl created.


**3️⃣ Price + EMI Prediction – price_emi_model.pkl**

In [24]:
import pandas as pd
import pickle
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Use features to predict price
required_columns = ['year', 'mileage', 'engine_capacity', 'transmission', 'fuel_type', 'price']
missing_columns = [col for col in required_columns if col not in df.columns]
if missing_columns:
    raise KeyError(f"The following columns are missing from the DataFrame: {missing_columns}")

price_df = df[required_columns].dropna()

# Separate features (X) and target (y)
X = price_df.drop("price", axis=1)
y = price_df["price"]

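# Encode categorical variables, keeping each fitted encoder so the same
# mapping can be reused when encoding inputs at prediction time
price_encoders = {}
for col in X.select_dtypes(include='object'):
    price_encoders[col] = LabelEncoder().fit(X[col])
    X[col] = price_encoders[col].transform(X[col])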

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the Random Forest Regressor
reg = RandomForestRegressor(random_state=42)
reg.fit(X_train, y_train)

# Save the trained model to a .pkl file
with open("price_emi_model.pkl", "wb") as f:
    pickle.dump(reg, f)

print("✅ price_emi_model.pkl created.")

✅ price_emi_model.pkl created.
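
The cell above trains a price regressor only; no EMI is computed. An EMI can be derived from a predicted price with the standard amortization formula, EMI = P · r · (1 + r)^n / ((1 + r)^n − 1), where P is the financed amount, r the monthly interest rate, and n the number of monthly instalments. The sketch below is one way to write it. `monthly_emi` is a hypothetical helper, and the interest rate and tenure are inputs the notebook does not define.

```python
def monthly_emi(principal, annual_rate_percent, months):
    """Equated monthly instalment for a loan repaid in equal monthly payments."""
    if months <= 0:
        raise ValueError("months must be positive")
    monthly_rate = annual_rate_percent / 100 / 12
    if monthly_rate == 0:
        # Interest-free loan: the principal is simply split evenly.
        return principal / months
    growth = (1 + monthly_rate) ** months
    return principal * monthly_rate * growth / (growth - 1)


# Illustrative figures only: 100000 financed at 12% per year over 12 months.
emi = monthly_emi(100_000, 12, 12)
print(f"Monthly instalment: {emi:.2f}")
```

In the notebook, `principal` would be the output of `reg.predict(...)` for the encoded vehicle features, minus any down payment.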


**4️⃣ Top-K Recommendation – recommendation_model.pkl**

In [25]:
import pandas as pd
import pickle
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import LabelEncoder

# Define the columns required for recommendation
required_columns = ['brand', 'fuel_type', 'transmission', 'year', 'mileage']
missing_columns = [col for col in required_columns if col not in df.columns]
if missing_columns:
    raise KeyError(f"The following columns are missing from the DataFrame: {missing_columns}")

# Use features for similarity-based recommendation
rec_df = df[required_columns].copy()

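# Encode categorical variables, keeping the fitted encoders so that
# query vehicles can be encoded with the same mapping later
rec_encoders = {}
for col in rec_df.select_dtypes(include='object'):
    rec_encoders[col] = LabelEncoder().fit(rec_df[col])
    rec_df[col] = rec_encoders[col].transform(rec_df[col])

# k-NN model
knn = NearestNeighbors(n_neighbors=5, metric='cosine')
knn.fit(rec_df)

# Save model, encoded dataset, and encoders
with open("recommendation_model.pkl", "wb") as f:
    pickle.dump({'model': knn, 'data': rec_df, 'encoders': rec_encoders}, f)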

print("✅ recommendation_model.pkl created.")

✅ recommendation_model.pkl created.
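
One caveat with cosine distance on these columns: `year` is roughly 2000 while the other features are small numbers, so every row points in almost the same direction and all vehicles look nearly identical. The pure-Python sketch below shows the effect on made-up rows and how standardizing each column changes it. `cosine_distance` and `standardize` are hypothetical helpers written for the example; in scikit-learn, `StandardScaler` would be one way to apply the same rescaling before fitting the k-NN model.

```python
import math


def cosine_distance(a, b):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def standardize(rows):
    """Rescale each column to zero mean and unit variance."""
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in col) / len(col)) or 1.0
        for col, m in zip(columns, means)
    ]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


# Made-up rows: [brand_code, fuel_code, transmission_code, year, mileage]
rows = [
    [0, 0, 0, 2018, 12.0],
    [9, 3, 1, 2023, 25.0],
    [1, 0, 0, 2019, 13.0],
]
raw = cosine_distance(rows[0], rows[1])
scaled_rows = standardize(rows)
scaled = cosine_distance(scaled_rows[0], scaled_rows[1])
print(f"raw={raw:.6f} scaled={scaled:.3f}")
```

Label-encoded brand codes also impose an arbitrary ordering on brands, so one-hot encoding may be worth considering for the categorical columns.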
