# Creating a Natural Language to SQL System using Llama Index

[Creating a Natural Language to SQL System using Llama Index](https://medium.aiplanet.com/creating-a-natural-language-to-sql-system-using-llama-index-10373e665bbb)

## SETUP

In [12]:
import openai
import os
from dotenv import load_dotenv, find_dotenv

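_ = load_dotenv(find_dotenv())  # load variables from a local .env file, if one exists

# Fail early with a clear message if the key is missing
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
if not OPENAI_API_KEY:
    raise RuntimeError("OPENAI_API_KEY is not set")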

In [None]:
from sqlalchemy import create_engine
from llama_index.core import SQLDatabase
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core.indices.struct_store.sql_query import NLSQLTableQueryEngine
from llama_index.core.tools.query_engine import QueryEngineTool
from llama_index.agent.openai import OpenAIAgent
import pandas as pd

## Load the CSV file to pandas dataframe

In [29]:
# load airline reviews from github
df = pd.read_csv(
    "https://raw.githubusercontent.com/Ibrahim-Maiga/Sentiment-Analysis-for-Airline-Reviews/refs/heads/main/Data/Airline_review.csv"
)
df.shape

(23171, 20)

In [30]:
df.sample(2)

,Unnamed: 0,Airline Name,Overall_Rating,Review_Title,Review Date,Verified,Review,Aircraft,Type Of Traveller,Seat Type,Route,Date Flown,Seat Comfort,Cabin Staff Service,Food & Beverages,Ground Service,Inflight Entertainment,Wifi & Connectivity,Value For Money,Recommended
22655,22655,Winair,1,"""Winair is a terrible airline""",8th May 2022,False,We flew on Winair from St. Maarten to St. Eu...,,Couple Leisure,Economy Class,St. Maarten to St. Eustatius,April 2022,1.0,1.0,,1.0,,,1.0,no
14968,14968,Nepal Airlines,7,"""Hospitality was great from your staff""",12th April 2021,True,I had heard lots about this airline. Somehow ...,,Solo Leisure,Economy Class,Kathmandu to Dubai,March 2021,3.0,4.0,3.0,4.0,3.0,2.0,3.0,yes


## Drop the unwanted columns


In [31]:
# drop the "Unnamed: 0" column
df = df.drop(columns=["Unnamed: 0"])
df.sample(2)

,Airline Name,Overall_Rating,Review_Title,Review Date,Verified,Review,Aircraft,Type Of Traveller,Seat Type,Route,Date Flown,Seat Comfort,Cabin Staff Service,Food & Beverages,Ground Service,Inflight Entertainment,Wifi & Connectivity,Value For Money,Recommended
19328,Sunwing Airlines,4,"""I don't know what is going on with Sunwing""",11th November 2022,True,First the good - the airplane staff are excel...,Boeing 737-800,Couple Leisure,Economy Class,Edmonton to Cancun,November 2022,3.0,5.0,1.0,4.0,1.0,1.0,2.0,no
22076,Viva Air,1,"""This was a terrible experience""",16th March 2021,True,This was a terrible experience. We had to re...,,Family Leisure,Economy Class,Medellin to Bogota,March 2021,,,,1.0,,,1.0,no


## Check the percentage of null values in each column

In [27]:
# percentage of missing values, shown only for columns that have any
null_counts = df.isnull().sum()
null_counts[null_counts > 0] / len(df) * 100

Aircraft                  69.233093
Type Of Traveller         16.132234
Seat Type                  4.730050
Route                     16.520651
Date Flown                16.201286
Seat Comfort              17.931898
Cabin Staff Service       18.385050
Food & Beverages          37.421777
Ground Service            20.685339
Inflight Entertainment    53.264857
Wifi & Connectivity       74.450822
Value For Money            4.600578
dtype: float64

## Drop rows with null values

In [32]:
temp = df.dropna()  # keep only rows with no missing values in any column
temp.shape

(1667, 19)

## Create a SQLAlchemy engine with SQLite and load the dataframe into a SQL table


In [40]:
engine = create_engine("sqlite://", echo=False)  # in-memory SQLite database
temp.to_sql("reviews", con=engine)  # also writes the dataframe index as a column

1667
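
Several column names in this dataset contain spaces and symbols (`Type Of Traveller`, `Food & Beverages`), so they must be quoted in SQLite and can trip up generated SQL. One optional step, not part of the original notebook, is to normalize the names before calling `to_sql`. The helper below is a minimal sketch; `to_sql_identifier` is a hypothetical name introduced here for illustration.

```python
import re


def to_sql_identifier(name: str) -> str:
    """Turn a free-form column label into a lowercase snake_case identifier."""
    # Spell out "&" so "Food & Beverages" keeps its meaning.
    name = name.replace("&", " and ")
    # Collapse every run of non-alphanumeric characters into one underscore.
    name = re.sub(r"[^0-9a-zA-Z]+", "_", name)
    return name.strip("_").lower()


# A few of the dataset's column names, used here only as sample input.
columns = ["Airline Name", "Overall_Rating", "Type Of Traveller", "Food & Beverages"]
renamed = {col: to_sql_identifier(col) for col in columns}
print(renamed)
```

With pandas this could be applied as `temp = temp.rename(columns=to_sql_identifier)` before `temp.to_sql(...)`. Any tool description or question given to the agent would then need to refer to the new names.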

## Wrap the engine in a SQLDatabase object


In [42]:
reviews_db = SQLDatabase(engine=engine, include_tables=["reviews"])

reviews_db.engine

Engine(sqlite://)


## Instantiate the LLM and configure Settings


In [43]:
llm = OpenAI(model="gpt-4o-mini", temperature=0)

embedding = OpenAIEmbedding()

In [46]:
from llama_index.core import Settings

Settings.llm = llm
Settings.embed_model = embedding
Settings.chunk_size = 500

## Create the Natural Language to SQL table query engine for the added SQL table


In [48]:
sql_query_engine = NLSQLTableQueryEngine(
    sql_database=reviews_db,
    tables=["reviews"],
)


## Write a detailed description of the NLSQLTableQueryEngine tool for the agent


In [50]:
description = "Provides information about airline reviews from the reviews table. Use a detailed plain text question as input to the tool."
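
The tool calls recorded later in this notebook guess column names such as `Rating` and `Class`, which do not exist in the table. One possible mitigation is to list the real column names in the description. The sketch below assumes a hypothetical helper, `build_tool_description`; it is an illustration and not part of the LlamaIndex API.

```python
def build_tool_description(table: str, columns: list[str]) -> str:
    """Compose a tool description that names the table and its columns."""
    # Double-quote each name so labels containing spaces stay unambiguous.
    quoted = ", ".join(f'"{col}"' for col in columns)
    return (
        f"Provides information about airline reviews from the {table} table. "
        f"The table has these columns: {quoted}. "
        "Use a detailed plain text question as input to the tool."
    )


# A few of the dataset's column names, used here only as sample input.
example_columns = ["Airline Name", "Overall_Rating", "Review", "Seat Type"]
description_with_schema = build_tool_description("reviews", example_columns)
print(description_with_schema)
```

In the notebook, `build_tool_description("reviews", list(temp.columns))` could take the place of the hand-written string. Whether this improves the generated queries would need to be checked against the agent's verbose output.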


## Wrap the query engine and description in a QueryEngineTool and create the agent


In [60]:
sql_tool = QueryEngineTool.from_defaults(
    query_engine=sql_query_engine,
    name="sql_query",
    description=description,
)

agent = OpenAIAgent.from_tools(tools=[sql_tool], verbose=True)

## Generate the response


In [61]:
# question1
response = agent.chat("What are the top 10 bad ReviewBody?")

Added user message to memory: What are the top 10 bad ReviewBody?
=== Calling Function ===
Calling function: sql_query with args: {"input":"SELECT ReviewBody FROM reviews ORDER BY Rating ASC LIMIT 10;"}
Got output: The reviews highlight a range of experiences with various airlines, showcasing both positive and negative aspects of air travel. 

1. **Negative Experiences**: Several reviewers expressed dissatisfaction, particularly with Adria Airways and Aer Lingus. One passenger described a "very bad experience" with rerouted and canceled flights, while another noted poor customer service and issues with lost luggage on Aer Lingus. 

2. **Positive Experiences**: In contrast, Aegean Airlines received high praise for its helpful and courteous cabin crew, efficient check-in procedures, and overall excellent service. Reviewers appreciated the professionalism of the staff and the quality of the onboard experience.

3. **Mixed Reviews**: Some travelers shared mixed feelings, noting that while 

In [63]:
# question2
response = agent.chat("What is the highest overall rating provided by travellers?")

Added user message to memory: What is the highest overall rating provided by travellers?
=== Calling Function ===
Calling function: sql_query with args: {"input":"SELECT MAX(Rating) FROM reviews;"}
Got output: The highest rating in the reviews is 9.



In [64]:
# question3
response = agent.chat(
    "How many Business class travellers have given overall ratings greater than 3?"
)

Added user message to memory: How many Business class travellers have given overall ratings greater than 3?
=== Calling Function ===
Calling function: sql_query with args: {"input":"SELECT COUNT(*) FROM reviews WHERE Class = 'Business' AND Rating > 3;"}
Got output: It seems there was an error in the SQL query due to the use of spaces in the column name "Type Of Traveller." To correct this, you should use underscores or enclose the column name in quotes. Here’s the corrected SQL query:

```sql
SELECT COUNT(*) FROM reviews WHERE Overall_Rating > 3 AND "Type Of Traveller" = 'Business';
```

This query will count the number of reviews where the overall rating is greater than 3 and the type of traveler is 'Business'.

=== Calling Function ===
Calling function: sql_query with args: {"input":"SELECT COUNT(*) FROM reviews WHERE \"Type Of Traveller\" = 'Business' AND Rating > 3;"}
Got output: There are 118 reviews from business travelers that have a rating greater than 3.
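
Because the agent's SQL is generated, its answers are worth checking against a query written by hand. In the last exchange the question asks about Business class, which corresponds to the `Seat Type` column, while the query shown in the log filters on `Type Of Traveller`. The sketch below uses the standard-library `sqlite3` module and four made-up rows (not taken from the dataset; the value `Business Class` is an assumption) to show that the two filters can count different things.

```python
import sqlite3

# Made-up rows for illustration only; they are not taken from the dataset.
rows = [
    ("Business Class", "Solo Leisure", 8),
    ("Business Class", "Couple Leisure", 6),
    ("Business Class", "Business", 2),
    ("Economy Class", "Business", 7),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE reviews ("Seat Type" TEXT, "Type Of Traveller" TEXT, "Overall_Rating" INTEGER)'
)
conn.executemany("INSERT INTO reviews VALUES (?, ?, ?)", rows)

# Reviews written from a Business class seat with a rating above 3.
by_seat = conn.execute(
    'SELECT COUNT(*) FROM reviews WHERE "Seat Type" = ? AND "Overall_Rating" > 3',
    ("Business Class",),
).fetchone()[0]

# Reviews written by business travellers with a rating above 3.
by_traveller = conn.execute(
    'SELECT COUNT(*) FROM reviews WHERE "Type Of Traveller" = ? AND "Overall_Rating" > 3',
    ("Business",),
).fetchone()[0]

print(by_seat, by_traveller)
conn.close()
```

Against the notebook's database, the same kind of check could be run with `pd.read_sql(query, engine)`. If `Overall_Rating` happened to be loaded as text rather than as a number, comparisons such as `> 3` would be lexicographic, so inspecting `temp.dtypes` first is worthwhile.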

