# CITIES INFORMATION
This notebook shows how Graph RAG performs when the pipeline combines structured and unstructured data.
It covers:
  - how to ingest different kinds of data (online PDF files, tabular CSV data, etc.)
  - how to instantiate the LLM and Langchain
  - how to connect to a Neo4j instance in order to show the generated graph
  - how to query the Neo4j graph
  - how to use the prompts to query the LLM

## STEP 0 - Imports

In [2]:
import os
import re
import dotenv
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.graphs import Neo4jGraph
from langchain_experimental.graph_transformers import LLMGraphTransformer
from langchain_groq import ChatGroq
from sklearn.model_selection import train_test_split


## Imports for the Entity Resolution step (Ditto-style matching; data preparation uses deepmatcher)
import nltk
import csv
import pandas as pd
import torch
import torchtext
import deepmatcher as dm

In [3]:
dotenv.load_dotenv()

NEO4J_URI = os.environ["NEO4J_URI"]
NEO4J_USERNAME = os.environ["NEO4J_USERNAME"]
NEO4J_PASSWORD = os.environ["NEO4J_PASSWORD"]
GROQ_API_KEY = os.environ["GROQ_API_KEY"]
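Indexing `os.environ` directly raises a `KeyError` on the first missing variable only. As a minimal sketch, a helper could report every missing variable at once. `require_env` is a hypothetical name, not part of `dotenv`, and the sample values below are placeholders.

```python
import os

def require_env(names, environ=os.environ):
    """Return the requested variables, or fail listing all that are missing."""
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in names}

# Placeholder environment used only for illustration
fake_env = {"NEO4J_URI": "bolt://localhost:7687", "NEO4J_USERNAME": "neo4j"}
settings = require_env(["NEO4J_URI", "NEO4J_USERNAME"], environ=fake_env)
```

In the notebook itself the helper would be called without the `environ` argument, after `dotenv.load_dotenv()`.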









<br><br><br><br><br><br><br>
## STEP 1 - Documents and table ingestion
This phase provides the dataset that feeds the knowledge graph. We ingest PDF files containing information about air pollution in cities, plus a table with data on most of the cities in the world.

### Step 1.1 - Documents ingestion

In [4]:
urls = [
    #"https://www.iqair.com/dl/2023_World_Air_Quality_Report.pdf",
    "https://www.istat.it/en/files/2011/01/qualita_aria_EN.pdf?title=Air+quality+in+European+cities+-+22+Jun+2010+-+qualita_aria_EN.pdf",

]

csvFilePath = "./worldcities.csv"

In [5]:
headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36"}
docs = [PyPDFLoader(url, headers=headers).load() for url in urls]
docs_list = [item for sublist in docs for item in sublist]

### Step 1.2 - Documents' text splitting

In [6]:
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(chunk_size=7500, chunk_overlap=100)
doc_splits = text_splitter.split_documents(docs_list)
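To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sliding-window sketch. It is an illustration only: it counts list items (for example words) rather than tiktoken tokens, and it ignores the separator-based logic of `CharacterTextSplitter`, so `split_with_overlap` is a hypothetical helper and not the LangChain implementation.

```python
def split_with_overlap(words, chunk_size, chunk_overlap):
    """Split a list into windows of at most chunk_size items,
    where consecutive windows share chunk_overlap items."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # the last window already reaches the end of the input
    return chunks

words = [f"w{i}" for i in range(10)]
chunks = split_with_overlap(words, chunk_size=4, chunk_overlap=1)
```

With these toy values, each window starts on the last item of the previous one, which is the effect the overlap is meant to have on chunk boundaries.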

### Step 1.3 - Table
This phase is postponed to **"STEP 3 - Neo4j and graph insertion"**, since the **worldcities.csv** file must be uploaded directly into the graph DB.









<br><br><br><br><br><br><br>
## STEP 2 - Large Language Model and Graph generation from Documents
This is one of the most important steps: here we instantiate the LLM (Large Language Model) that extracts entities and relationships from the documents. These become the nodes (entities) and edges (relationships) of the graph used in GraphRAG.

In [None]:
llm = ChatGroq(
    groq_api_key=GROQ_API_KEY,
    model_name="llama3-70b-8192")
llm_transformer=LLMGraphTransformer(llm=llm)
graph_documents=llm_transformer.convert_to_graph_documents(doc_splits)
# graph_documents









<br><br><br><br><br><br><br>
## STEP 3 - Neo4j and graph insertion
Now it is time to save the generated graph into a persistent Neo4j instance.

### Step 3.1 - Initialize Neo4j connection

In [7]:
database_name = "progettotesi"

graph=Neo4jGraph(
    url=NEO4J_URI,
    username=NEO4J_USERNAME,
    password=NEO4J_PASSWORD,
    database=database_name
)

In [None]:
# The CSV was loaded manually to save time.
# Steps to load it manually:
#   - copy the file "worldcities.csv" into the "import" folder of the Neo4j project
#   - go to the "bin" folder inside the Neo4j project's folder
#   - run the following command in the Neo4j terminal:
#
#     neo4j-admin database import full progettotesi --delimiter=";" --array-delimiter="U+007C" --nodes=import/worldcities.csv


# graph.query(
#     query = "LOAD CSV WITH HEADERS "
#     "FROM 'file:///C:/Users/Gabri/OneDrive/Documenti/Universit%C3%A0/Tesi/RAG/RAG%20terzo/worldcities.csv' as row "
#     "MERGE("
#         "m:City{"
#             "id: row.city_ascii, "
#             "latitude: row.lat, "
#             "longitude: row.lng, "
#             "country: row.country, "
#             "iso2: row.iso2, "
#             "iso3: row.iso3, "
#             "administrative_name: row.admin_name, "
#             "capital: row.capital, "
#             "population: row.population"
#         "}"
#     ")")

In [None]:
def flatten(xss):
    return [x for xs in xss for x in xs]

# Add entities (nodes) and relationships (edges) into the graph
nodes_as_dict = flatten([list({'id': node.id, 'type': node.type} for node in doc.nodes) for doc in graph_documents])

edges_as_dict = flatten([list(
    {
        'type': rel.type,
        'source':{
            'id': rel.source.id,
            'type': rel.source.type
         },
        'target': {
            'id': rel.target.id,
            'type': rel.target.type
        }
    } for rel in doc.relationships) for doc in graph_documents])
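The same entity can be extracted from several chunks, so the flattened node list may contain repeated entries. `MERGE` keeps repeated writes from creating duplicates, but removing repeats first reduces the number of queries sent to Neo4j. A minimal sketch, where `dedupe_nodes` and the sample data are hypothetical:

```python
def dedupe_nodes(nodes):
    """Drop repeated (id, type) pairs while keeping first-seen order."""
    seen = set()
    unique = []
    for node in nodes:
        key = (node["id"], node["type"])
        if key not in seen:
            seen.add(key)
            unique.append(node)
    return unique

# Illustrative data shaped like the entries of nodes_as_dict
sample_nodes = [
    {"id": "Rome", "type": "City"},
    {"id": "Italy", "type": "Country"},
    {"id": "Rome", "type": "City"},
]
unique_nodes = dedupe_nodes(sample_nodes)
```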

In [None]:
for node in nodes_as_dict:
    node_type = re.sub('[^A-Za-z0-9]+', '', node["type"])

    query = f"""
    MERGE (n:{node_type} {"{city_ascii: $id}" if node['type'] == "City" else "{id: $id}"})
    SET n.type = $type, n.updated = True
    """
    graph.query(query=query, params={"id": node["id"], "type": node["type"]})
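The node label is interpolated into the query text rather than passed as a parameter, which is why it is stripped to letters and digits first. The sketch below shows what that regular expression does to a few labels and adds a guard for the case where nothing is left. The `sanitize_label` helper and the `"Entity"` fallback are assumptions for illustration, not part of the notebook.

```python
import re

def sanitize_label(raw_type, fallback="Entity"):
    """Keep only ASCII letters and digits; use the fallback if nothing remains."""
    label = re.sub(r"[^A-Za-z0-9]+", "", raw_type)
    return label or fallback

# Hypothetical node types as an LLM might return them
examples = {raw: sanitize_label(raw) for raw in ["City", "Air Pollutant", "PM-10", "***"]}
```

Without a guard, a type made only of punctuation would produce an empty label and a malformed `MERGE (n: ...)` clause.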

In [None]:
for relationship in edges_as_dict:
    source_id = relationship["source"]["id"]
    source_type = re.sub('[^A-Za-z0-9]+', '', relationship["source"]["type"])

    target_id = relationship["target"]["id"]
    target_type = re.sub('[^A-Za-z0-9]+', '', relationship["target"]["type"])

    rel_type = re.sub('[^A-Za-z0-9]+', '', relationship["type"])

    print(source_id + "(" + source_type + ")" + " -[" + relationship["type"] + "]-> " + target_id + "(" + target_type + ")")

    condition = "(a.type = \"City\" and a.country IS NOT NULL and b.type = \"Country\" and a.country <> b.id) or (a.type = \"Country\" and b.type = \"City\" and b.country IS NOT NULL and a.id <> b.country)"
    query = f"""
        MATCH (a:{source_type} {"{city_ascii: $source_id}" if source_type == "City" else "{id: $source_id}"})
        MATCH (b:{target_type} {"{city_ascii: $target_id}" if target_type == "City" else "{id: $target_id}"})
        CALL apoc.do.when(
            {condition},
            'RETURN null',
            'MERGE (a)-[r:{rel_type}]->(b) return a, r, b',
            {{a: a, b: b}}
        )
        YIELD value
        return value
    """


    graph.query(
        query=query,
        params={
            "source_id": source_id,
            "target_id": target_id
        }
    )
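The `apoc.do.when` guard skips an extracted City/Country relationship when the country already stored on the City node disagrees with the Country node's `id`. The same predicate can be sketched in plain Python, which makes it easy to test in isolation. Nodes are modelled here as dictionaries, which is an assumption, and Cypher's `null` comparison semantics are only approximated.

```python
def contradicts_csv(a, b):
    """Mirror of the Cypher guard: True when a City/Country pair disagrees
    with the country recorded on the City node."""
    def mismatch(city, country):
        return (
            city.get("type") == "City"
            and city.get("country") is not None
            and country.get("type") == "Country"
            and city["country"] != country.get("id")
        )
    # The guard applies in both directions: City -> Country and Country -> City
    return mismatch(a, b) or mismatch(b, a)

# Hypothetical nodes for illustration
rome = {"type": "City", "country": "Italy"}
italy = {"type": "Country", "id": "Italy"}
france = {"type": "Country", "id": "France"}
```

Here `contradicts_csv(rome, france)` is true, so that relationship would be skipped, while `contradicts_csv(rome, italy)` is false and the relationship would be merged.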









<br><br><br><br><br><br><br>
## STEP 4 - Entity matching
Some of the nodes and relationships created are duplicates, or use different names for the same concept. An "Entity matching" phase is therefore required to reconcile these entities and merge them into one.

For this goal, we'll use **Ditto**, a deep learning-based Entity Matching system that leverages pre-trained language models (like BERT) to improve accuracy in identifying and matching similar entities across different datasets.

### Step 4.1 - Prepare dataframe for DeepMatcher

In [None]:
# citiesQueryResult = graph.query(query = "MATCH (n:City) RETURN n")
# citiesDf = [item['n'] for item in citiesQueryResult]

In [None]:
# PROFILE
# MATCH (a:City)
# WITH COLLECT(a) AS cities
# UNWIND cities AS a
# UNWIND cities AS b
# WITH
#     a,
#     b,
#     apoc.text.levenshteinSimilarity(a.city_ascii, b.city_ascii) AS city_ascii_sim,
#     apoc.text.levenshteinSimilarity(a.country, b.country) AS country_sim
# return a.city_ascii, a.country, b.city_ascii, b.country, city_ascii_sim, country_sim
# LIMIT 1000000
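The commented query scores every pair of cities with `apoc.text.levenshteinSimilarity`. A pure-Python sketch of a normalized Levenshtein similarity follows, assuming the common definition `1 - distance / max(len)`. APOC's exact normalization may differ, so treat the values as an approximation of what the query returns.

```python
def levenshtein_distance(s, t):
    """Edit distance computed with dynamic programming over two rows."""
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]

def levenshtein_similarity(s, t):
    """Similarity in [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(s), len(t))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(s, t) / longest

similarity = levenshtein_similarity("manila", "maniago")
```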

In [13]:
goldenDataset = pd.read_csv("neo4jCitiesGoldenDataset.csv", sep=";", index_col=False)
train_val_set, test_set = train_test_split(goldenDataset, test_size=0.2, random_state=42)
train_set, val_set = train_test_split(train_val_set, test_size=0.25, random_state=42)
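The two calls above produce a roughly 60/20/20 split: the first holds out 20% for testing, and the second takes 25% of the remaining 80%, which is 20% of the original data. The arithmetic can be checked with a standard-library sketch. `three_way_split` is a hypothetical helper that does not reproduce scikit-learn's shuffling, so the selected rows differ from those of `train_test_split`.

```python
import math
import random

def three_way_split(items, test_size=0.2, val_size=0.25, seed=42):
    """Shuffle, hold out test_size for testing, then val_size of the rest for validation."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_test = math.ceil(len(items) * test_size)
    test, rest = items[:n_test], items[n_test:]
    n_val = math.ceil(len(rest) * val_size)
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test

train_ids, val_ids, test_ids = three_way_split(range(1000))
```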

In [14]:
os.makedirs("dataset", exist_ok=True)  # to_csv does not create missing directories
train_set.to_csv(os.path.join("dataset", "train_set.csv"), sep=",", index=True, index_label="id")
val_set.to_csv(os.path.join("dataset", "val_set.csv"), sep=",", index=True, index_label="id")
test_set.to_csv(os.path.join("dataset", "test_set.csv"), sep=",", index=True, index_label="id")

In [15]:
train, validation, test = \
    dm.data.process(
        path='dataset',
        train="train_set.csv",
        validation="val_set.csv",
        test="test_set.csv",
        use_magellan_convention=False,
        label_attr='is_matching',
        left_prefix="a.",
        right_prefix="b.",
        ignore_columns=['city_ascii_sim', 'country_sim'],
        embeddings_cache_path=".vector_cache",
        embeddings='fasttext.en.bin'
    )


Reading and processing data from "dataset\train_set.csv"
0% [###                           ] 100% | ETA: 00:03:47
KeyboardInterrupt



In [12]:
train_table = train.get_raw_table()
train_table

| Unnamed: 0 | id | a.city_ascii | a.country | b.city_ascii | b.country | is_matching |
|---|---|---|---|---|---|---|
| 0 | 876 | cairo | egypt | cardito | italy | 0 |
| 1 | 326 | shanghai | china | shanwei | china | 0 |
| 2 | 381 | new york | united states | newport | united states | 0 |
| 3 | 853 | manila | philippines | maniago | italy | 0 |
| 4 | 311 | manila | philippines | marialva | brazil | 0 |
| ... | ... | ... | ... | ... | ... | ... |
| 595 | 118 | karachi | pakistan | dalachi | china | 0 |
| 596 | 334 | shanghai | china | sangli | india | 0 |
| 597 | 409 | shenzhen | china | shepshed | united kingdom | 0 |
| 598 | 225 | manila | philippines | laila | india | 0 |


### Step 4.2 - Prepare data from Neo4j

In [None]:
def base_serialization_logic(row, outfile):
    # Serialize one record as "COL <name> VAL <value>" tokens, skipping bookkeeping fields
    serialized_row = ""
    for column, value in row.items():
        if column not in ("type", "updated"):
            serialized_row += f"COL {column} VAL {value} "
    serialized_row = serialized_row.strip()  # Remove the trailing extra space
    outfile.write(serialized_row + "\n")

def serialize_csv(input_file, output_file):
    with open(input_file, 'r', newline='', encoding='utf-8') as csvfile:
        reader = csv.DictReader(csvfile)
        with open(output_file, 'w', encoding='utf-8') as outfile:
            for row in reader:
                base_serialization_logic(row, outfile)

def serialize_dataframe(df, output_file):
    with open(output_file, 'w', encoding='utf-8') as outfile:
        for _, row in df.iterrows():
            base_serialization_logic(row, outfile)
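The serializer above turns each record into a flat string of `COL <name> VAL <value>` tokens. The sketch below shows the output for one hypothetical record and how two serialized records plus a label could be joined into a single tab-separated line. The tab-separated pair layout is what Ditto's input format looks like to my knowledge, so verify it against the Ditto version in use. `serialize_record` and `serialize_pair` are illustrative names.

```python
def serialize_record(record, skip=("type", "updated")):
    """Serialize one record as 'COL <name> VAL <value>' tokens, skipping bookkeeping fields."""
    return " ".join(
        f"COL {column} VAL {value}"
        for column, value in record.items()
        if column not in skip
    )

def serialize_pair(left, right, label):
    """Join two serialized records and a label with tabs (assumed pair layout)."""
    return "\t".join([serialize_record(left), serialize_record(right), str(label)])

# Hypothetical records shaped like City nodes
left = {"city_ascii": "manila", "country": "philippines", "type": "City"}
right = {"city_ascii": "maniago", "country": "italy", "updated": True}
line = serialize_pair(left, right, 0)
```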

In [None]:
output_file = 'serialized_cities.txt'  # Name of the serialized output file
# citiesDf comes from the (currently commented-out) query cell in Step 4.1; enable and run it first
serialize_dataframe(pd.DataFrame(citiesDf), output_file)