py2neo is one of the principal libraries for working with Neo4j from Python. It wraps Neo4j’s HTTP and Bolt protocols and provides a clean and simple interface to execute Cypher queries, handle transactions, and manage the database.

Step 1: Import Necessary Libraries
importing the necessary Python libraries. Here, we'll use pandas for data manipulation and py2neo for interacting with the Neo4j database.

In [7]:
# Import necessary libraries
import pandas as pd
from py2neo import Graph
from py2neo import Graph, Node, Relationship



Step 2: Connect to Neo4j
Set up a connection to your Neo4j database:

In [8]:
from py2neo import Graph

# Replace 'your_new_password' with the password you just set
graph = Graph("bolt://localhost:7687", auth=("neo4j", "12345678"))

# Test the connection
print(graph.run("RETURN 'Connection Successful!' AS message").data())


[{'message': 'Connection Successful!'}]


Step 3: Load and Analyze CSV Data

Importing pandas: First, we import the pandas library, which is essential for data manipulation in Python.
    Setting the file path: We store the path to your CSV file in a variable called file_path. Since the CSV file is in the same directory as your Jupyter notebook, you only need to specify the filename.
    Reading the CSV file: We use pd.read_csv() to read the CSV file into a DataFrame. This function converts the structured CSV file into a pandas DataFrame, which allows for more complex data manipulations.
    Printing the data: df.head() prints the first few rows of the DataFrame, giving you a snapshot of the data structure. df.info() provides a concise summary of the DataFrame, showing types of columns, non-null values, and memory usage.

In [13]:
import pandas as pd

# Load the CSV data into a DataFrame
file_path = '2016_-_Cities_Emissions_Reduction_Targets_20240207.csv'
df = pd.read_csv(file_path)

# Display the initial data structure
print("Initial Data:")
print(df.head())
print(df.info())

# 1. Handling Missing Values
# Check for missing values in each column
print("\nMissing Values Before Cleaning:")
print(df.isna().sum())

# Fill numeric columns with median values
df['Baseline emissions (metric tonnes CO2e)'] = df['Baseline emissions (metric tonnes CO2e)'].fillna(df['Baseline emissions (metric tonnes CO2e)'].median())
df['Percentage reduction target'] = df['Percentage reduction target'].fillna(df['Percentage reduction target'].median())

# Fill categorical columns with a placeholder value
df['City Short Name'] = df['City Short Name'].fillna('Unknown City')
df['Country'] = df['Country'].fillna('Unknown Country')
df['Organisation'] = df['Organisation'].fillna('Unknown Organisation')

# Convert 'Reporting Year' and 'Target date' to datetime format and fill with a default value if they remain NaN
df['Reporting Year'] = pd.to_datetime(df['Reporting Year'], format='%Y', errors='coerce').dt.year
df['Target date'] = pd.to_datetime(df['Target date'], format='%Y', errors='coerce').dt.year

df['Reporting Year'] = df['Reporting Year'].fillna(0)
df['Target date'] = df['Target date'].fillna(0)

# 2. Data Type Conversions
# Convert 'Reporting Year' to datetime format if not done yet
df['Reporting Year'] = pd.to_datetime(df['Reporting Year'], format='%Y', errors='coerce').dt.year

# Convert 'Target date' to datetime format if it's not NaN
df['Target date'] = pd.to_datetime(df['Target date'], format='%Y', errors='coerce').dt.year

# 3. Standardizing Formats
# Standardize text in 'City Short Name' by capitalizing
df['City Short Name'] = df['City Short Name'].str.title()

# 4. Removing Duplicates
# Remove duplicate rows, if any
df.drop_duplicates(inplace=True)

# 5. Trimming and Cleaning Strings
# Strip leading/trailing spaces from all string columns
df['Organisation'] = df['Organisation'].str.strip()

# 6. Consolidating Categories
# Consolidate similar country names if needed
df['Country'] = df['Country'].replace({'Usa': 'USA', 'Uk': 'United Kingdom'})

# 7. Validating Data
# Ensure that 'Percentage reduction target' is non-negative
df = df[df['Percentage reduction target'] >= 0]

# 8. Creating Unique Identifiers
# Create a unique identifier for each row if the 'Account No' is not unique
df['id'] = pd.factorize(df['Account No'])[0] + 1

# 9. Exporting Clean Data
# Export the cleaned data to a new CSV file
df.to_csv('cleaned_data_Neo4j_2016_-_Cities_Emissions_Reduction_Targets_20240207.csv', index=False)

# Display the cleaned data structure and info
print("\nCleaned Data:")
print(df.head())
print(df.info())

# Check for missing values after cleaning
print("\nMissing Values After Cleaning:")
print(df.isna().sum())


Initial Data:
             Organisation  Account No  Country      City Short Name  C40  \
0           Odder Kommune       58796  Denmark        Odder Kommune  NaN   
1        Comune di Napoli       36158    Italy               Napoli  NaN   
2     Egedal Municipality       62855  Denmark  Egedal Municipality  NaN   
3            Yilan County       61753   Taiwan               Yilan   NaN   
4  City of Emeryville, CA       61790      USA       Emeryville, CA  NaN   

   Reporting Year Sector              Target boundary Baseline year  \
0            2016  Total                          NaN          2010   
1            2016  Total                          NaN          2005   
2            2016  Total                          NaN          2009   
3            2016  Total                          NaN          2009   
4            2016  Total  Overall community emissions          2004   

   Baseline emissions (metric tonnes CO2e)  Percentage reduction target  \
0                          

Explanation of the Code

Loading Data: Loads the CSV file into a DataFrame using pandas.
    
Handling Missing Values: Checks for and handles missing data by filling or removing based on the column importance.
    
Data Type Conversions: Converts columns to appropriate data types, like converting years to datetime objects.
    
Standardizing Formats: Ensures consistent formatting in text fields and date representations.
    
Removing Duplicates: Removes duplicate entries to ensure the uniqueness of data.
    
Trimming and Cleaning Strings: Cleans up string data by trimming spaces and correcting formats.
    
Consolidating Categories: Combines similar categories to reduce redundancy and simplify analysis.
    
Validating Data: Applies checks to ensure data meets logical constraints (e.g., non-negative percentages).
    
Creating Unique Identifiers: Generates unique identifiers for nodes in Neo4j, ensuring each node can be distinctly referenced.
                                                         
Exporting Clean Data: Saves the cleaned DataFrame back to a CSV file, ready for importing into Neo4j.

Step 4: Define Functions to Create Nodes and Relationships

In [16]:
# Function to create city nodes
def create_city_nodes(df):
    for index, row in df.iterrows():
        city = Node("City", name=row['City Short Name'], country=row['Country'], reporting_year=row['Reporting Year'])
        graph.merge(city, "City", "name")

# Function to create emissions reduction target relationships
def create_emissions_reduction_relationships(df):
    for index, row in df.iterrows():
        city_node = graph.nodes.match("City", name=row['City Short Name']).first()
        if city_node:
            reduction_target = Node("ReductionTarget", target=row['Percentage reduction target'], baseline_emissions=row['Baseline emissions (metric tonnes CO2e)'], target_year=row['Target date'])
            relationship = Relationship(city_node, "HAS_TARGET", reduction_target)
            graph.create(relationship)

# Create nodes and relationships in Neo4j
create_city_nodes(df)
create_emissions_reduction_relationships(df)


Step 5: Insert cleaned_data_Neo4j_2016_-_Cities_Emissions_Reduction_Targets_20240207.csv into Neo4j 

In [17]:
from py2neo import Graph, Node, Relationship

# Create nodes and relationships in Neo4j
create_city_nodes(df)
create_emissions_reduction_relationships(df)

# Verify by running a query to see the inserted nodes and relationships
query = """
MATCH (c:City)-[:HAS_TARGET]->(t:ReductionTarget)
RETURN c.name AS city, c.country AS country, c.reporting_year AS year, t.target AS reduction_target, t.baseline_emissions AS baseline_emissions, t.target_year AS target_year
ORDER BY c.name
"""
results = graph.run(query).to_data_frame()
print(results)


                city    country  year  reduction_target  baseline_emissions  \
0     Aarhus Kommune    Denmark  2016             100.0           2100000.0   
1     Aarhus Kommune    Denmark  2016             100.0           2100000.0   
2     Aarhus Kommune    Denmark  2016             100.0           2100000.0   
3     Aarhus Kommune    Denmark  2016             100.0           2100000.0   
4           Adelaide  Australia  2016              35.0            411000.0   
...              ...        ...   ...               ...                 ...   
1115         Yonkers        USA  2016              20.0            844276.0   
1116        Zaragoza      Spain  2016              20.0           1237553.0   
1117        Zaragoza      Spain  2016              20.0           1237553.0   
1118        Zaragoza      Spain  2016              20.0           1237553.0   
1119        Zaragoza      Spain  2016              20.0           1237553.0   

      target_year  
0          2030.0  
1          