py2neo is one of the principal libraries for working with Neo4j from Python. It wraps Neo4j’s HTTP and Bolt protocols and provides a clean and simple interface to execute Cypher queries, handle transactions, and manage the database.

Step 1: Import Necessary Libraries
importing the necessary Python libraries. Here, we'll use pandas for data manipulation and py2neo for interacting with the Neo4j database.

In [7]:
# Import necessary libraries
import pandas as pd
from py2neo import Graph
from py2neo import Graph, Node, Relationship



Step 2: Connect to Neo4j
Set up a connection to your Neo4j database:

In [8]:
from py2neo import Graph

# Replace 'your_new_password' with the password you just set
graph = Graph("bolt://localhost:7687", auth=("neo4j", "12345678"))

# Test the connection
print(graph.run("RETURN 'Connection Successful!' AS message").data())


[{'message': 'Connection Successful!'}]


Step 3: Load and Analyze CSV Data

Importing pandas: First, we import the pandas library, which is essential for data manipulation in Python.
    Setting the file path: We store the path to your CSV file in a variable called file_path. Since the CSV file is in the same directory as your Jupyter notebook, you only need to specify the filename.
    Reading the CSV file: We use pd.read_csv() to read the CSV file into a DataFrame. This function converts the structured CSV file into a pandas DataFrame, which allows for more complex data manipulations.
    Printing the data: df.head() prints the first few rows of the DataFrame, giving you a snapshot of the data structure. df.info() provides a concise summary of the DataFrame, showing types of columns, non-null values, and memory usage.

In [5]:
import pandas as pd

# Load the CSV data into a DataFrame
file_path = '2016_-_Cities_Emissions_Reduction_Targets_20240207.csv'
df = pd.read_csv(file_path)

# Display the initial data structure
print("Initial Data:")
print(df.head())
print(df.info())

# 1. Handling Missing Values
# Check for missing values in each column
print("\nMissing Values Before Cleaning:")
print(df.isna().sum())

# Fill numeric columns with median values
df['Baseline emissions (metric tonnes CO2e)'] = df['Baseline emissions (metric tonnes CO2e)'].fillna(df['Baseline emissions (metric tonnes CO2e)'].median())
df['Percentage reduction target'] = df['Percentage reduction target'].fillna(df['Percentage reduction target'].median())

# Fill categorical columns with a placeholder value
df['City Short Name'] = df['City Short Name'].fillna('Unknown City')
df['Country'] = df['Country'].fillna('Unknown Country')
df['Organisation'] = df['Organisation'].fillna('Unknown Organisation')

# Convert 'Reporting Year' and 'Target date' to datetime format and fill with a default value if they remain NaN
df['Reporting Year'] = pd.to_datetime(df['Reporting Year'], format='%Y', errors='coerce').dt.year
df['Target date'] = pd.to_datetime(df['Target date'], format='%Y', errors='coerce').dt.year

df['Reporting Year'] = df['Reporting Year'].fillna(0)
df['Target date'] = df['Target date'].fillna(0)

# 2. Data Type Conversions
# Convert 'Reporting Year' to datetime format if not done yet
df['Reporting Year'] = pd.to_datetime(df['Reporting Year'], format='%Y', errors='coerce').dt.year

# Convert 'Target date' to datetime format if it's not NaN
df['Target date'] = pd.to_datetime(df['Target date'], format='%Y', errors='coerce').dt.year

# 3. Standardizing Formats
# Standardize text in 'City Short Name' by capitalizing
df['City Short Name'] = df['City Short Name'].str.title()

# 4. Removing Duplicates
# Remove duplicate rows, if any
df.drop_duplicates(inplace=True)

# 5. Trimming and Cleaning Strings
# Strip leading/trailing spaces from all string columns
df['Organisation'] = df['Organisation'].str.strip()

# 6. Consolidating Categories
# Consolidate similar country names if needed
df['Country'] = df['Country'].replace({'Usa': 'USA', 'Uk': 'United Kingdom'})

# 7. Validating Data
# Ensure that 'Percentage reduction target' is non-negative
df = df[df['Percentage reduction target'] >= 0]

# 8. Creating Unique Identifiers
# Create a unique identifier for each row if the 'Account No' is not unique
df['id'] = pd.factorize(df['Account No'])[0] + 1

# 9. Exporting Clean Data
# Export the cleaned data to a new CSV file
df.to_csv('cleaned_data_Neo4j_2016_-_Cities_Emissions_Reduction_Targets_20240207.csv', index=False)

# Display the cleaned data structure and info
print("\nCleaned Data:")
print(df.head())
print(df.info())

# Check for missing values after cleaning
print("\nMissing Values After Cleaning:")
print(df.isna().sum())


Initial Data:
             Organisation  Account No  Country      City Short Name  C40  \
0           Odder Kommune       58796  Denmark        Odder Kommune  NaN   
1        Comune di Napoli       36158    Italy               Napoli  NaN   
2     Egedal Municipality       62855  Denmark  Egedal Municipality  NaN   
3            Yilan County       61753   Taiwan               Yilan   NaN   
4  City of Emeryville, CA       61790      USA       Emeryville, CA  NaN   

   Reporting Year Sector              Target boundary Baseline year  \
0            2016  Total                          NaN          2010   
1            2016  Total                          NaN          2005   
2            2016  Total                          NaN          2009   
3            2016  Total                          NaN          2009   
4            2016  Total  Overall community emissions          2004   

   Baseline emissions (metric tonnes CO2e)  Percentage reduction target  \
0                          

Explanation of the Code

Loading Data: Loads the CSV file into a DataFrame using pandas.
    
Handling Missing Values: Checks for and handles missing data by filling or removing based on the column importance.
    
Data Type Conversions: Converts columns to appropriate data types, like converting years to datetime objects.
    
Standardizing Formats: Ensures consistent formatting in text fields and date representations.
    
Removing Duplicates: Removes duplicate entries to ensure the uniqueness of data.
    
Trimming and Cleaning Strings: Cleans up string data by trimming spaces and correcting formats.
    
Consolidating Categories: Combines similar categories to reduce redundancy and simplify analysis.
    
Validating Data: Applies checks to ensure data meets logical constraints (e.g., non-negative percentages).
    
Creating Unique Identifiers: Generates unique identifiers for nodes in Neo4j, ensuring each node can be distinctly referenced.
                                                         
Exporting Clean Data: Saves the cleaned DataFrame back to a CSV file, ready for importing into Neo4j.

Step 4: Define Functions to Create Nodes and Relationships

In [16]:
# Function to create city nodes
def create_city_nodes(df):
    for index, row in df.iterrows():
        city = Node("City", name=row['City Short Name'], country=row['Country'], reporting_year=row['Reporting Year'])
        graph.merge(city, "City", "name")

# Function to create emissions reduction target relationships
def create_emissions_reduction_relationships(df):
    for index, row in df.iterrows():
        city_node = graph.nodes.match("City", name=row['City Short Name']).first()
        if city_node:
            reduction_target = Node("ReductionTarget", target=row['Percentage reduction target'], baseline_emissions=row['Baseline emissions (metric tonnes CO2e)'], target_year=row['Target date'])
            relationship = Relationship(city_node, "HAS_TARGET", reduction_target)
            graph.create(relationship)

# Create nodes and relationships in Neo4j
create_city_nodes(df)
create_emissions_reduction_relationships(df)


Step 5: Insert cleaned_data_Neo4j_2016_-_Cities_Emissions_Reduction_Targets_20240207.csv into Neo4j 

In [17]:
from py2neo import Graph, Node, Relationship

# Create nodes and relationships in Neo4j
create_city_nodes(df)
create_emissions_reduction_relationships(df)

# Verify by running a query to see the inserted nodes and relationships
query = """
MATCH (c:City)-[:HAS_TARGET]->(t:ReductionTarget)
RETURN c.name AS city, c.country AS country, c.reporting_year AS year, t.target AS reduction_target, t.baseline_emissions AS baseline_emissions, t.target_year AS target_year
ORDER BY c.name
"""
results = graph.run(query).to_data_frame()
print(results)


                city    country  year  reduction_target  baseline_emissions  \
0     Aarhus Kommune    Denmark  2016             100.0           2100000.0   
1     Aarhus Kommune    Denmark  2016             100.0           2100000.0   
2     Aarhus Kommune    Denmark  2016             100.0           2100000.0   
3     Aarhus Kommune    Denmark  2016             100.0           2100000.0   
4           Adelaide  Australia  2016              35.0            411000.0   
...              ...        ...   ...               ...                 ...   
1115         Yonkers        USA  2016              20.0            844276.0   
1116        Zaragoza      Spain  2016              20.0           1237553.0   
1117        Zaragoza      Spain  2016              20.0           1237553.0   
1118        Zaragoza      Spain  2016              20.0           1237553.0   
1119        Zaragoza      Spain  2016              20.0           1237553.0   

      target_year  
0          2030.0  
1          

Step 1: Load and Inspect the CSV

First, let's load the CSV file and display its column names to ensure we use the correct ones.

In [10]:
import pandas as pd

# Load the CSV data into a DataFrame
file_path = '2016_-_Citywide_GHG_Emissions_20240207.csv'
df = pd.read_csv(file_path)

# Display the initial data structure and column names
print("Initial Data:")
print(df.head())
print("\nData Info:")
print(df.info())
print("\nColumn Names:")
print(df.columns)


Initial Data:
   Account Number            City Name         Country City Short Name  C40  \
0           35894    Ville de Montreal          Canada        Montreal  NaN   
1           35898   Greater Manchester  United Kingdom      Manchester  NaN   
2           54128         City of Reno             USA            Reno  NaN   
3           35879  City of Minneapolis             USA     Minneapolis  NaN   
4           50558   City of London, ON          Canada      London, ON  NaN   

   Reporting Year        Measurement Year  \
0            2016  12/31/2009 12:00:00 AM   
1            2016  12/31/2013 12:00:00 AM   
2            2016  12/31/2014 12:00:00 AM   
3            2016  12/31/2014 12:00:00 AM   
4            2016  12/31/2014 12:00:00 AM   

                                            Boundary  \
0  Other: The regional entity that constitutes th...   
1                                A metropolitan area   
2      Administrative boundary of a local government   
3      Administr

Step 2: Load and Clean the Data

In [11]:
import pandas as pd

# Load the CSV data into a DataFrame
file_path = '2016_-_Citywide_GHG_Emissions_20240207.csv'
df = pd.read_csv(file_path)

# Display the initial data structure
print("Initial Data:")
print(df.head())
print("\nData Info:")
print(df.info())
print("\nColumn Names:")
print(df.columns)

# Handling Missing Values
print("\nMissing Values Before Cleaning:")
print(df.isna().sum())

# Fill numeric columns with median values
df['Total City-wide Emissions (metric tonnes CO2e)'] = df['Total City-wide Emissions (metric tonnes CO2e)'].fillna(df['Total City-wide Emissions (metric tonnes CO2e)'].median())

# Fill categorical columns with a placeholder value
df['City Short Name'] = df['City Short Name'].fillna('Unknown City')
df['Country'] = df['Country'].fillna('Unknown Country')
# Assuming 'Organisation' refers to 'Primary Methodology' or 'Methodology Details'
df['Primary Methodology'] = df['Primary Methodology'].fillna('Unknown Methodology')

# Convert 'Reporting Year' to datetime format and fill with a default value if they remain NaN
df['Reporting Year'] = pd.to_datetime(df['Reporting Year'], format='%Y', errors='coerce').dt.year
df['Reporting Year'] = df['Reporting Year'].fillna(0).astype(int)

# Data Type Conversions
df['Reporting Year'] = pd.to_datetime(df['Reporting Year'], format='%Y', errors='coerce').dt.year

# Standardizing Formats
df['City Short Name'] = df['City Short Name'].str.title()

# Removing Duplicates
df.drop_duplicates(inplace=True)

# Trimming and Cleaning Strings
df['Primary Methodology'] = df['Primary Methodology'].str.strip()

# Consolidating Categories
df['Country'] = df['Country'].replace({'Usa': 'USA', 'Uk': 'United Kingdom'})

# Validating Data
df = df[df['Total City-wide Emissions (metric tonnes CO2e)'] >= 0]

# Creating Unique Identifiers
df['id'] = pd.factorize(df['Account Number'])[0] + 1

# Exporting Clean Data
df.to_csv('cleaned_data_Neo4j_2016_-_Citywide_GHG_Emissions_20240207.csv', index=False)

# Display the cleaned data structure and info
print("\nCleaned Data:")
print(df.head())
print(df.info())

# Check for missing values after cleaning
print("\nMissing Values After Cleaning:")
print(df.isna().sum())


Initial Data:
   Account Number            City Name         Country City Short Name  C40  \
0           35894    Ville de Montreal          Canada        Montreal  NaN   
1           35898   Greater Manchester  United Kingdom      Manchester  NaN   
2           54128         City of Reno             USA            Reno  NaN   
3           35879  City of Minneapolis             USA     Minneapolis  NaN   
4           50558   City of London, ON          Canada      London, ON  NaN   

   Reporting Year        Measurement Year  \
0            2016  12/31/2009 12:00:00 AM   
1            2016  12/31/2013 12:00:00 AM   
2            2016  12/31/2014 12:00:00 AM   
3            2016  12/31/2014 12:00:00 AM   
4            2016  12/31/2014 12:00:00 AM   

                                            Boundary  \
0  Other: The regional entity that constitutes th...   
1                                A metropolitan area   
2      Administrative boundary of a local government   
3      Administr

Step 3: Define Functions to Create Nodes and Relationships in Neo4j

In [12]:
from py2neo import Graph, Node, Relationship

# Connect to Neo4j
graph = Graph("bolt://localhost:7687", auth=("neo4j", "12345678"))

# Function to create city nodes
def create_city_nodes(df):
    for index, row in df.iterrows():
        city = Node("City", name=row['City Short Name'], country=row['Country'], methodology=row['Primary Methodology'], year=row['Reporting Year'])
        graph.merge(city, "City", "name")

# Function to create GHG emissions relationships
def create_ghg_emissions_relationships(df):
    for index, row in df.iterrows():
        city_node = graph.nodes.match("City", name=row['City Short Name']).first()
        if city_node:
            ghg_emission = Node("GHGEmission", total_emissions=row['Total City-wide Emissions (metric tonnes CO2e)'], year=row['Reporting Year'])
            relationship = Relationship(city_node, "HAS_GHG_EMISSION", ghg_emission)
            graph.create(relationship)

# Create nodes and relationships in Neo4j
create_city_nodes(df)
create_ghg_emissions_relationships(df)


Step 4: Verify Data in Neo4j

In [13]:
query = """
MATCH (c:City)-[:HAS_GHG_EMISSION]->(g:GHGEmission)
RETURN c.name AS city, c.country AS country, c.year AS year, g.total_emissions AS total_emissions, g.year AS emission_year
ORDER BY c.name
"""
results = graph.run(query).to_data_frame()
print(results)


               city    country  year  total_emissions  emission_year
0    Aarhus Kommune    Denmark  2016          1900.00           2016
1         Abington         USA  2016        615224.00           2016
2       Addis Ababa   Ethiopia  2016       3708292.00           2016
3          Adelaide  Australia  2016        486541.00           2016
4          Ajax, On     Canada  2016        538836.00           2016
..              ...        ...   ...              ...            ...
182          Yilan      Taiwan  2016       8180556.68           2016
183        Yokohama      Japan  2016      21950000.00           2016
184         Yonkers        USA  2016       1248650.00           2016
185        Zaragoza      Spain  2016       1785603.75           2016
186          Águeda   Portugal  2016        258882.00           2016

[187 rows x 5 columns]


Step 1: Load and Inspect the CSV

In [14]:
import pandas as pd

# Load the CSV data into a DataFrame
file_path = '2017_-_Cities_Community_Wide_Emissions.csv'
df = pd.read_csv(file_path)

# Display the initial data structure and column names
print("Initial Data:")
print(df.head())
print("\nData Info:")
print(df.info())
print("\nColumn Names:")
print(df.columns)


Initial Data:
   Account number                     Organization                 City  \
0           49363  Nelson Mandela Bay Municipality  Nelson Mandela Bay    
1           31171           Ayuntamiento de Madrid               Madrid   
2            3417                    New York City        New York City   
3           59537               City of Denton, TX           Denton, TX   
4           35894                Ville de Montreal             Montreal   

        Country         Region  C40  Access  Reporting year  \
0  South Africa         Africa  NaN  Public            2017   
1         Spain         Europe  C40  Public            2017   
2           USA  North America  C40  Public            2017   
3           USA  North America  NaN  Public            2017   
4        Canada  North America  C40  Public            2017   

           Accounting year                                       Boundary  \
0  2013-07-01 - 2014-06-30                            A metropolitan area   
1 

Step 2: Load and Clean the Data

In [15]:
import pandas as pd

# Load the CSV data into a DataFrame
file_path = '2017_-_Cities_Community_Wide_Emissions.csv'
df = pd.read_csv(file_path)

# Display the initial data structure
print("Initial Data:")
print(df.head())
print("\nData Info:")
print(df.info())
print("\nColumn Names:")
print(df.columns)

# Handling Missing Values
print("\nMissing Values Before Cleaning:")
print(df.isna().sum())

# Fill numeric columns with median values
df['Total emissions (metric tonnes CO2e)'] = df['Total emissions (metric tonnes CO2e)'].fillna(df['Total emissions (metric tonnes CO2e)'].median())

# Fill categorical columns with a placeholder value
df['City'] = df['City'].fillna('Unknown City')
df['Country'] = df['Country'].fillna('Unknown Country')
df['Organization'] = df['Organization'].fillna('Unknown Organisation')

# Convert 'Reporting year' to datetime format and fill with a default value if they remain NaN
df['Reporting year'] = pd.to_datetime(df['Reporting year'], format='%Y', errors='coerce').dt.year
df['Reporting year'] = df['Reporting year'].fillna(0).astype(int)

# Data Type Conversions
df['Reporting year'] = pd.to_datetime(df['Reporting year'], format='%Y', errors='coerce').dt.year

# Standardizing Formats
df['City'] = df['City'].str.title()

# Removing Duplicates
df.drop_duplicates(inplace=True)

# Trimming and Cleaning Strings
df['Organization'] = df['Organization'].str.strip()

# Consolidating Categories
df['Country'] = df['Country'].replace({'Usa': 'USA', 'Uk': 'United Kingdom'})

# Validating Data
df = df[df['Total emissions (metric tonnes CO2e)'] >= 0]

# Creating Unique Identifiers
df['id'] = pd.factorize(df['Account number'])[0] + 1

# Exporting Clean Data
df.to_csv('cleaned_data_Neo4j_2017_-_Cities_Community_Wide_Emissions.csv', index=False)

# Display the cleaned data structure and info
print("\nCleaned Data:")
print(df.head())
print(df.info())

# Check for missing values after cleaning
print("\nMissing Values After Cleaning:")
print(df.isna().sum())


Initial Data:
   Account number                     Organization                 City  \
0           49363  Nelson Mandela Bay Municipality  Nelson Mandela Bay    
1           31171           Ayuntamiento de Madrid               Madrid   
2            3417                    New York City        New York City   
3           59537               City of Denton, TX           Denton, TX   
4           35894                Ville de Montreal             Montreal   

        Country         Region  C40  Access  Reporting year  \
0  South Africa         Africa  NaN  Public            2017   
1         Spain         Europe  C40  Public            2017   
2           USA  North America  C40  Public            2017   
3           USA  North America  NaN  Public            2017   
4        Canada  North America  C40  Public            2017   

           Accounting year                                       Boundary  \
0  2013-07-01 - 2014-06-30                            A metropolitan area   
1 

Step 3: Define Functions to Create Nodes and Relationships in Neo4j

In [16]:
from py2neo import Graph, Node, Relationship

# Connect to Neo4j
graph = Graph("bolt://localhost:7687", auth=("neo4j", "12345678"))

# Function to create city nodes
def create_city_nodes(df):
    for index, row in df.iterrows():
        city = Node("City", name=row['City'], country=row['Country'], organisation=row['Organization'], year=row['Reporting year'])
        graph.merge(city, "City", "name")

# Function to create GHG emissions relationships
def create_ghg_emissions_relationships(df):
    for index, row in df.iterrows():
        city_node = graph.nodes.match("City", name=row['City']).first()
        if city_node:
            ghg_emission = Node("GHGEmission", total_emissions=row['Total emissions (metric tonnes CO2e)'], year=row['Reporting year'])
            relationship = Relationship(city_node, "HAS_GHG_EMISSION", ghg_emission)
            graph.create(relationship)

# Create nodes and relationships in Neo4j
create_city_nodes(df)
create_ghg_emissions_relationships(df)


Step 4: Verify Data in Neo4j

In [None]:
query = """
MATCH (c:City)-[:HAS_GHG_EMISSION]->(g:GHGEmission)
RETURN c.name AS city, c.country AS country, c.year AS year, g.total_emissions AS total_emissions, g.year AS emission_year
ORDER BY c.name
"""
results = graph.run(query).to_data_frame()
print(results)


Step 1: Load and Inspect the CSV

First, let's load the CSV file and display its column names to ensure we use the correct ones.

In [18]:
import pandas as pd

# Load the CSV data into a DataFrame
file_path = '2017_-_Cities_Emissions_Reduction_Targets_20240207.csv'
df = pd.read_csv(file_path)

# Display the initial data structure and column names
print("Initial Data:")
print(df.head())
print("\nData Info:")
print(df.info())
print("\nColumn Names:")
print(df.columns)


Initial Data:
   Account No                   Organisation              City  \
0       54408                 Aarhus Kommune            Aarhus   
1       63616  Abasan Al-Kabira Municipality  Abasan Al-Kabira   
2       63616  Abasan Al-Kabira Municipality  Abasan Al-Kabira   
3        1499        Ajuntament de Barcelona         Barcelona   
4        1499        Ajuntament de Barcelona         Barcelona   

              Country               Region  Access  C40  Reporting year  \
0             Denmark               Europe  Public  NaN            2017   
1  State of Palestine  South and West Asia  Public  NaN            2017   
2  State of Palestine  South and West Asia  Public  NaN            2017   
3               Spain               Europe  Public  C40            2017   
4               Spain               Europe  Public  C40            2017   

    Type of target     Sector  ... Baseline emissions (metric tonnes CO2e)  \
0  Absolute target        NaN  ...                          

Step 2: Load and Clean the Data

In [19]:
import pandas as pd

# Load the CSV data into a DataFrame
file_path = '2017_-_Cities_Emissions_Reduction_Targets_20240207.csv'
df = pd.read_csv(file_path)

# Display the initial data structure
print("Initial Data:")
print(df.head())
print("\nData Info:")
print(df.info())
print("\nColumn Names:")
print(df.columns)

# Handling Missing Values
print("\nMissing Values Before Cleaning:")
print(df.isna().sum())

# Fill numeric columns with median values
df['Baseline emissions (metric tonnes CO2e)'] = df['Baseline emissions (metric tonnes CO2e)'].fillna(df['Baseline emissions (metric tonnes CO2e)'].median())
df['Percentage reduction target'] = df['Percentage reduction target'].fillna(df['Percentage reduction target'].median())

# Fill categorical columns with a placeholder value
df['City'] = df['City'].fillna('Unknown City')
df['Country'] = df['Country'].fillna('Unknown Country')
df['Organisation'] = df['Organisation'].fillna('Unknown Organisation')

# Convert 'Reporting year' and 'Target date' to datetime format and fill with a default value if they remain NaN
df['Reporting year'] = pd.to_datetime(df['Reporting year'], format='%Y', errors='coerce').dt.year
df['Target date'] = pd.to_datetime(df['Target date'], format='%Y', errors='coerce').dt.year

df['Reporting year'] = df['Reporting year'].fillna(0).astype(int)
df['Target date'] = df['Target date'].fillna(0).astype(int)

# Data Type Conversions
df['Reporting year'] = pd.to_datetime(df['Reporting year'], format='%Y', errors='coerce').dt.year
df['Target date'] = pd.to_datetime(df['Target date'], format='%Y', errors='coerce').dt.year

# Standardizing Formats
df['City'] = df['City'].str.title()

# Removing Duplicates
df.drop_duplicates(inplace=True)

# Trimming and Cleaning Strings
df['Organisation'] = df['Organisation'].str.strip()

# Consolidating Categories
df['Country'] = df['Country'].replace({'Usa': 'USA', 'Uk': 'United Kingdom'})

# Validating Data
df = df[df['Percentage reduction target'] >= 0]

# Creating Unique Identifiers
df['id'] = pd.factorize(df['Account No'])[0] + 1

# Exporting Clean Data
df.to_csv('cleaned_data_Neo4j_2017_-_Cities_Emissions_Reduction_Targets_20240207.csv', index=False)

# Display the cleaned data structure and info
print("\nCleaned Data:")
print(df.head())
print(df.info())

# Check for missing values after cleaning
print("\nMissing Values After Cleaning:")
print(df.isna().sum())


Initial Data:
   Account No                   Organisation              City  \
0       54408                 Aarhus Kommune            Aarhus   
1       63616  Abasan Al-Kabira Municipality  Abasan Al-Kabira   
2       63616  Abasan Al-Kabira Municipality  Abasan Al-Kabira   
3        1499        Ajuntament de Barcelona         Barcelona   
4        1499        Ajuntament de Barcelona         Barcelona   

              Country               Region  Access  C40  Reporting year  \
0             Denmark               Europe  Public  NaN            2017   
1  State of Palestine  South and West Asia  Public  NaN            2017   
2  State of Palestine  South and West Asia  Public  NaN            2017   
3               Spain               Europe  Public  C40            2017   
4               Spain               Europe  Public  C40            2017   

    Type of target     Sector  ... Baseline emissions (metric tonnes CO2e)  \
0  Absolute target        NaN  ...                          

Step 3: Define Functions to Create Nodes and Relationships in Neo4j

In [20]:
from py2neo import Graph, Node, Relationship

# Connect to Neo4j
graph = Graph("bolt://localhost:7687", auth=("neo4j", "12345678"))

# Function to create city nodes
def create_city_nodes(df):
    for index, row in df.iterrows():
        city = Node("City", name=row['City'], country=row['Country'], organisation=row['Organisation'], year=row['Reporting year'])
        graph.merge(city, "City", "name")

# Function to create emissions reduction target relationships
def create_emissions_reduction_relationships(df):
    for index, row in df.iterrows():
        city_node = graph.nodes.match("City", name=row['City']).first()
        if city_node:
            reduction_target = Node("ReductionTarget", target=row['Percentage reduction target'], baseline_emissions=row['Baseline emissions (metric tonnes CO2e)'], target_year=row['Target date'])
            relationship = Relationship(city_node, "HAS_TARGET", reduction_target)
            graph.create(relationship)

# Create nodes and relationships in Neo4j
create_city_nodes(df)
create_emissions_reduction_relationships(df)


Step 4: Verify Data in Neo4j

In [21]:
query = """
MATCH (c:City)-[:HAS_TARGET]->(t:ReductionTarget)
RETURN c.name AS city, c.country AS country, c.year AS year, t.target AS reduction_target, t.baseline_emissions AS baseline_emissions, t.target_year AS target_year
ORDER BY c.name
"""
results = graph.run(query).to_data_frame()
print(results)


                 city             country  year  reduction_target  \
0              Aarhus             Denmark  2017             100.0   
1    Abasan Al-Kabira  State of Palestine  2017               6.0   
2    Abasan Al-Kabira  State of Palestine  2017              19.0   
3            Adelaide           Australia  2017              35.0   
4            Adelaide           Australia  2017             100.0   
..                ...                 ...   ...               ...   
398          Zaragoza               Spain  2017              20.0   
399            Zürich         Switzerland  2017              82.0   
400            Zürich         Switzerland  2017              55.0   
401            Zürich         Switzerland  2017              28.0   
402        Ærøskøbing             Denmark  2017              10.0   

     baseline_emissions  target_year  
0             1484767.0       2030.0  
1                6893.0       2020.0  
2               18320.0       2020.0  
3             1

Step 1: Load and Inspect the CSV

First, let's load the CSV file and display its column names to ensure we use the correct ones.

In [22]:
import pandas as pd

# Load the CSV data into a DataFrame
file_path = '2023_Cities_Climate_Risk_and_Vulnerability_Assessments_20240207.csv'
df = pd.read_csv(file_path)

# Display the initial data structure and column names
print("Initial Data:")
print(df.head())
print("\nData Info:")
print(df.info())
print("\nColumn Names:")
print(df.columns)


Initial Data:
  Questionnaire  Organization Number                Organization Name  \
0   Cities 2023               840926      Prefeitura de Serra Talhada   
1   Cities 2023                51075                 City of Shenzhen   
2   Cities 2023               863190                            Renca   
3   Cities 2023               930366  Municipalidad Distrital de Yura   
4   Cities 2023                60236          Trelleborg Municipality   

         City Country/Area     CDP Region  C40 City  GCoM City  Access  \
0         NaN       Brazil  Latin America     False       True  public   
1    Shenzhen        China      East Asia      True      False  public   
2         NaN        Chile  Latin America     False      False  public   
3         NaN         Peru  Latin America     False       True  public   
4  Trelleborg       Sweden         Europe     False       True  public   

            Assessment attachment and/or direct link  \
0  https://drive.google.com/file/d/19DMxxK532I

Step 2: Load and Clean the Data

In [23]:
import pandas as pd

# Load the CSV data into a DataFrame
file_path = '2023_Cities_Climate_Risk_and_Vulnerability_Assessments_20240207.csv'
df = pd.read_csv(file_path)

# Display the initial data structure
print("Initial Data:")
print(df.head())
print("\nData Info:")
print(df.info())
print("\nColumn Names:")
print(df.columns)

# Handling Missing Values
print("\nMissing Values Before Cleaning:")
print(df.isna().sum())

# Fill numeric columns with median values
df['Year of publication or approval'] = df['Year of publication or approval'].fillna(df['Year of publication or approval'].median())
df['Population'] = df['Population'].fillna(df['Population'].median())
df['Population Year'] = df['Population Year'].fillna(df['Population Year'].median())

# Fill categorical columns with a placeholder value
df['City'] = df['City'].fillna('Unknown City')
df['Country/Area'] = df['Country/Area'].fillna('Unknown Country')
df['Organization Name'] = df['Organization Name'].fillna('Unknown Organisation')

# Convert 'Year of publication or approval' to datetime format and fill with a default value if they remain NaN
df['Year of publication or approval'] = pd.to_datetime(df['Year of publication or approval'], format='%Y', errors='coerce').dt.year
df['Year of publication or approval'] = df['Year of publication or approval'].fillna(0).astype(int)

# Data Type Conversions
df['Year of publication or approval'] = pd.to_datetime(df['Year of publication or approval'], format='%Y', errors='coerce').dt.year

# Standardizing Formats
df['City'] = df['City'].str.title()

# Removing Duplicates
df.drop_duplicates(inplace=True)

# Trimming and Cleaning Strings
df['Organization Name'] = df['Organization Name'].str.strip()

# Consolidating Categories
df['Country/Area'] = df['Country/Area'].replace({'Usa': 'USA', 'Uk': 'United Kingdom'})

# Validating Data
df = df[df['Population'] >= 0]

# Creating Unique Identifiers
df['id'] = pd.factorize(df['Organization Number'])[0] + 1

# Exporting Clean Data
df.to_csv('cleaned_data_Neo4j_2023_Cities_Climate_Risk_and_Vulnerability_Assessments_20240207.csv', index=False)

# Display the cleaned data structure and info
print("\nCleaned Data:")
print(df.head())
print(df.info())

# Check for missing values after cleaning
print("\nMissing Values After Cleaning:")
print(df.isna().sum())


Initial Data:
  Questionnaire  Organization Number                Organization Name  \
0   Cities 2023               840926      Prefeitura de Serra Talhada   
1   Cities 2023                51075                 City of Shenzhen   
2   Cities 2023               863190                            Renca   
3   Cities 2023               930366  Municipalidad Distrital de Yura   
4   Cities 2023                60236          Trelleborg Municipality   

         City Country/Area     CDP Region  C40 City  GCoM City  Access  \
0         NaN       Brazil  Latin America     False       True  public   
1    Shenzhen        China      East Asia      True      False  public   
2         NaN        Chile  Latin America     False      False  public   
3         NaN         Peru  Latin America     False       True  public   
4  Trelleborg       Sweden         Europe     False       True  public   

            Assessment attachment and/or direct link  \
0  https://drive.google.com/file/d/19DMxxK532I

Step 3: Define Functions to Create Nodes and Relationships in Neo4j

In [24]:
from py2neo import Graph, Node, Relationship

# Connect to Neo4j
graph = Graph("bolt://localhost:7687", auth=("neo4j", "12345678"))

# Function to create city nodes
def create_city_nodes(df):
    for index, row in df.iterrows():
        city = Node("City", name=row['City'], country=row['Country/Area'], organisation=row['Organization Name'], year=row['Year of publication or approval'])
        graph.merge(city, "City", "name")

# Function to create risk assessment relationships
def create_risk_assessment_relationships(df):
    for index, row in df.iterrows():
        city_node = graph.nodes.match("City", name=row['City']).first()
        if city_node:
            risk_assessment = Node("RiskAssessment", publication_year=row['Year of publication or approval'], population=row['Population'], population_year=row['Population Year'])
            relationship = Relationship(city_node, "HAS_RISK_ASSESSMENT", risk_assessment)
            graph.create(relationship)

# Create nodes and relationships in Neo4j
create_city_nodes(df)
create_risk_assessment_relationships(df)


Step 4: Verify Data in Neo4j

In [25]:
query = """
MATCH (c:City)-[:HAS_RISK_ASSESSMENT]->(r:RiskAssessment)
RETURN c.name AS city, c.country AS country, c.year AS year, r.publication_year AS publication_year, r.population AS population, r.population_year AS population_year
ORDER BY c.name
"""
results = graph.run(query).to_data_frame()
print(results)


                  city             country  year  publication_year  \
0               Aarhus             Denmark  2020              2020   
1     Abasan Al-Kabira  State of Palestine  2019              2019   
2              Abidjan       Côte d'Ivoire  2017              2017   
3              Abidjan       Côte d'Ivoire  2017              2016   
4                Accra               Ghana  2017              2017   
...                ...                 ...   ...               ...   
1365            Águeda            Portugal  2009              2015   
1366            Águeda            Portugal  2009              2018   
1367            Águeda            Portugal  2009              2022   
1368             Åseda              Sweden  2019              2019   
1369             Évora            Portugal  2017              2017   

      population  population_year  
0         362235           2023.0  
1          35000           2022.0  
2         692583           2021.0  
3        611064