In [1]:
!pip install pymongo pandas



2016_-_Cities_Emissions_Reduction_Targets_20240207.csv

Step 1: Import Necessary Libraries
We need to import libraries for handling data (pandas) and connecting to MongoDB (pymongo).

In [2]:
# Import necessary libraries
import pandas as pd
from pymongo import MongoClient


Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),
(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)
but was not found to be installed on your system.
If this would cause problems for you,
please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466
        
  import pandas as pd


Explanation:

pandas: A powerful library for data manipulation and analysis. We'll use it to load and clean the CSV data.

pymongo: A library to interact with MongoDB from Python. We'll use it to connect to our MongoDB instance and perform database operations.

Step 2: Connect to MongoDB

Set up a connection to your MongoDB database named EksamensProjekt_DB.

In [5]:
# Connect to MongoDB
client = MongoClient("mongodb://localhost:27017/")  # Replace with your actual connection string
db = client['EksamensProjekt_DB']
collection = db['cities_emissions']

# Verify connection by listing databases
print(client.list_database_names())


['EksamensProjekt_DB', 'Mongo', 'OLA2_DB', 'admin', 'config', 'local', 'somesite']


Explanation:

MongoClient: This class is used to create a connection to a MongoDB server. The connection string mongodb://localhost:27017/ specifies that the server is running locally on port 27017.

db: Selects the EksamensProjekt_DB database. MongoDB will create this database when we first store data in it.

collection: Selects the cities_emissions collection within the database. This collection will store our cleaned data.

Step 3: Load and Clean CSV Data

Load the CSV data into a pandas DataFrame and clean it.

In [9]:
# Load the CSV data into a DataFrame
file_path = '2016_-_Cities_Emissions_Reduction_Targets_20240207.csv'
df = pd.read_csv(file_path)

# Display the initial data structure
print("Initial Data:")
print(df.head())
print(df.info())

# Handling Missing Values
print("\nMissing Values Before Cleaning:")
print(df.isna().sum())

# Fill numeric columns with median values
df['Baseline emissions (metric tonnes CO2e)'] = df['Baseline emissions (metric tonnes CO2e)'].fillna(df['Baseline emissions (metric tonnes CO2e)'].median())
df['Percentage reduction target'] = df['Percentage reduction target'].fillna(df['Percentage reduction target'].median())

# Fill categorical columns with a placeholder value
df['City Short Name'] = df['City Short Name'].fillna('Unknown City')
df['Country'] = df['Country'].fillna('Unknown Country')
df['Organisation'] = df['Organisation'].fillna('Unknown Organisation')

# Convert 'Reporting Year' and 'Target date' to datetime format and fill with a default value if they remain NaN
df['Reporting Year'] = pd.to_datetime(df['Reporting Year'], format='%Y', errors='coerce').dt.year
df['Target date'] = pd.to_datetime(df['Target date'], format='%Y', errors='coerce').dt.year

df['Reporting Year'] = df['Reporting Year'].fillna(0)
df['Target date'] = df['Target date'].fillna(0)

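# Data Type Conversions
# Note: this repeats the year conversion above. A 0 placeholder is not a valid
# four-digit %Y year, so errors='coerce' can turn it back into NaN
# (compare 'target_year': nan in the Step 5 output).
df['Reporting Year'] = pd.to_datetime(df['Reporting Year'], format='%Y', errors='coerce').dt.year
df['Target date'] = pd.to_datetime(df['Target date'], format='%Y', errors='coerce').dt.year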

# Standardizing Formats
df['City Short Name'] = df['City Short Name'].str.title()

# Removing Duplicates
df.drop_duplicates(inplace=True)

# Trimming and Cleaning Strings
df['Organisation'] = df['Organisation'].str.strip()

# Consolidating Categories
df['Country'] = df['Country'].replace({'Usa': 'USA', 'Uk': 'United Kingdom'})

# Validating Data
df = df[df['Percentage reduction target'] >= 0]

# Creating Unique Identifiers
df['id'] = pd.factorize(df['Account No'])[0] + 1

# Exporting Clean Data
# Export the cleaned data to a new CSV file
cleaned_file_path = 'cleaned_data_2016_MongoDB_-_Cities_Emissions_Reduction_Targets_20240207.csv'
df.to_csv(cleaned_file_path, index=False)

# Display the cleaned data structure and info
print("\nCleaned Data:")
print(df.head())
print(df.info())

print("\nMissing Values After Cleaning:")
print(df.isna().sum())


Initial Data:
             Organisation  Account No  Country      City Short Name  C40  \
0           Odder Kommune       58796  Denmark        Odder Kommune  NaN   
1        Comune di Napoli       36158    Italy               Napoli  NaN   
2     Egedal Municipality       62855  Denmark  Egedal Municipality  NaN   
3            Yilan County       61753   Taiwan               Yilan   NaN   
4  City of Emeryville, CA       61790      USA       Emeryville, CA  NaN   

   Reporting Year Sector              Target boundary Baseline year  \
0            2016  Total                          NaN          2010   
1            2016  Total                          NaN          2005   
2            2016  Total                          NaN          2009   
3            2016  Total                          NaN          2009   
4            2016  Total  Overall community emissions          2004   

   Baseline emissions (metric tonnes CO2e)  Percentage reduction target  \
0                          

Explanation:

Load CSV: Read the CSV file into a DataFrame using pd.read_csv.

Display Initial Data: Print the first few rows and the structure of the DataFrame to understand its content.

Handle Missing Values:
Fill numeric columns with median values to avoid skewing the data.
Fill categorical columns with a placeholder value to maintain consistency.

Data Type Conversions: Parse the 'Reporting Year' and 'Target date' columns with pd.to_datetime and keep only the year, so both columns hold plain year values.

Standardize Formats: Standardize text formats (e.g., capitalizing city names) to maintain consistency.

Remove Duplicates: Remove any duplicate rows to ensure each record is unique.

Trim and Clean Strings: Remove extra spaces from string values.

Consolidate Categories: Replace similar categories (e.g., 'Usa' to 'USA') for consistency.

Validate Data: Ensure all percentage reduction targets are non-negative.

Create Unique Identifiers: Assign an integer id to each distinct 'Account No' value with pd.factorize. Rows that share an account number receive the same id.
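
To make the identifier step concrete, the following pure-Python sketch mimics what pd.factorize(...)[0] + 1 does: each distinct value gets the next integer in order of first appearance. The helper name first_seen_ids is hypothetical and used only for illustration. The first three account numbers come from the sample output above; the repeated fourth value is added to show that rows sharing an account number share an id.

```python
def first_seen_ids(values):
    """Assign 1-based integer ids to values in order of first appearance."""
    seen = {}
    ids = []
    for value in values:
        if value not in seen:
            # A new distinct value gets the next free id
            seen[value] = len(seen) + 1
        ids.append(seen[value])
    return ids

# The last entry repeats 36158 on purpose (illustrative data)
account_numbers = [58796, 36158, 62855, 36158]
print(first_seen_ids(account_numbers))  # [1, 2, 3, 2]
```

If one id per row is needed instead, the DataFrame index (or a running counter) would be a more suitable source than the account number.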

Step 4: Insert Data into MongoDB

Define a function to insert the cleaned data into MongoDB.

In [6]:
# Function to insert city documents
def insert_city_documents(df):
    for index, row in df.iterrows():
        city_document = {
            "name": row['City Short Name'],
            "country": row['Country'],
            "reporting_year": row['Reporting Year'],
            "baseline_emissions": row['Baseline emissions (metric tonnes CO2e)'],
            "percentage_reduction_target": row['Percentage reduction target'],
            "target_year": row['Target date'],
            "organisation": row['Organisation'],
            "id": row['id']
        }
        collection.insert_one(city_document)

# Insert city documents into MongoDB
insert_city_documents(df)


Explanation:

Function Definition: Define a function insert_city_documents that takes a DataFrame as its argument and writes to the collection object created in Step 2.

Iterate Over Rows: Loop through each row in the DataFrame.

Create Document: For each row, create a dictionary (document) containing the relevant fields.
    
Insert Document: Use collection.insert_one to insert each document into the MongoDB collection.
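
Inserting one document per call means one round trip to the server per row. As a hedged alternative, the documents can be collected first and sent in batches with pymongo's insert_many. The sketch below assumes that collection is a pymongo collection (any object with an insert_many method will do); the helper name insert_in_batches and the default batch size of 500 are illustrative choices, not part of the original notebook.

```python
def insert_in_batches(collection, documents, batch_size=500):
    """Send documents to the collection in chunks and return how many were sent."""
    sent = 0
    for start in range(0, len(documents), batch_size):
        batch = documents[start:start + batch_size]
        collection.insert_many(batch)  # one server call per batch
        sent += len(batch)
    return sent
```

A list of dictionaries shaped like city_document can be passed as documents, for example by appending each dictionary to a list inside the loop instead of inserting it immediately.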

Step 5: Verify Data Insertion

Query MongoDB to verify the data insertion.

In [7]:
# Verify by retrieving one document
print("One document from MongoDB:")
print(collection.find_one())

# Query to see all inserted documents (sample)
for doc in collection.find().limit(5):
    print(doc)


One document from MongoDB:
{'_id': ObjectId('665c8389b791c25edf693d0b'), 'name': 'Odder Kommune', 'country': 'Denmark', 'reporting_year': 2016, 'baseline_emissions': 6136.0, 'percentage_reduction_target': 2.0, 'target_year': nan, 'organisation': 'Odder Kommune', 'id': 1}
{'_id': ObjectId('665c8389b791c25edf693d0b'), 'name': 'Odder Kommune', 'country': 'Denmark', 'reporting_year': 2016, 'baseline_emissions': 6136.0, 'percentage_reduction_target': 2.0, 'target_year': nan, 'organisation': 'Odder Kommune', 'id': 1}
{'_id': ObjectId('665c8389b791c25edf693d0c'), 'name': 'Napoli', 'country': 'Italy', 'reporting_year': 2016, 'baseline_emissions': 2913434.0, 'percentage_reduction_target': 25.0, 'target_year': 2020.0, 'organisation': 'Comune di Napoli', 'id': 2}
{'_id': ObjectId('665c8389b791c25edf693d0d'), 'name': 'Egedal Municipality', 'country': 'Denmark', 'reporting_year': 2016, 'baseline_emissions': 268000.0, 'percentage_reduction_target': 7.0, 'target_year': 2020.0, 'organisation': 'Egedal

Explanation:

Retrieve One Document: Use collection.find_one() to retrieve and print one document from the collection to verify the insertion.

Sample of Inserted Documents: Use collection.find().limit(5) to retrieve and print a sample of five documents to ensure the data has been inserted correctly.
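
The first sample document above contains 'target_year': nan, which is a floating-point NaN stored in MongoDB. If a null is preferred for missing values, each document can be cleaned before insertion. This is a minimal sketch under that assumption; the helper name nan_to_none is hypothetical, and it relies on pymongo encoding Python's None as a BSON null.

```python
import math

def nan_to_none(document):
    """Return a copy of the document with float NaN values replaced by None."""
    return {
        key: None if isinstance(value, float) and math.isnan(value) else value
        for key, value in document.items()
    }

doc = {"name": "Odder Kommune", "target_year": float("nan"), "id": 1}
print(nan_to_none(doc))
# {'name': 'Odder Kommune', 'target_year': None, 'id': 1}
```

Applying nan_to_none(city_document) just before the insert call would store null instead of NaN, which is easier to filter on in queries.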

2016_-_Citywide_GHG_Emissions_20240207.csv

Step 1: Import Necessary Libraries

First, we need to import the necessary Python libraries:

In [10]:
# Import necessary libraries
import pandas as pd
from pymongo import MongoClient


Explanation:

pandas: A powerful library for data manipulation and analysis. We'll use it to load and clean the CSV data.

pymongo: A library to interact with MongoDB from Python. We'll use it to connect to our MongoDB instance and perform database operations.

Step 2: Connect to MongoDB

Set up a connection to your MongoDB database named EksamensProjekt_DB.

Explanation:

MongoClient: This class is used to create a connection to a MongoDB server. The connection string mongodb://localhost:27017/ specifies that the server is running locally on port 27017.

db: Selects the EksamensProjekt_DB database. MongoDB will create this database when we first store data in it.

Step 3: Load and Clean New CSV Data

Load the new CSV data into a pandas DataFrame and clean it.

In [12]:
# Load the new CSV data into a DataFrame
new_file_path = '2016_-_Citywide_GHG_Emissions_20240207.csv'
new_df = pd.read_csv(new_file_path)

# Display the initial data structure
print("Initial Data:")
print(new_df.head())
print(new_df.info())

# Display the column names
print("Column Names:")
print(new_df.columns)

# Handling Missing Values
print("\nMissing Values Before Cleaning:")
print(new_df.isna().sum())

# Fill numeric columns with median values
emissions_column = 'Total City-wide Emissions (metric tonnes CO2e)'  # Correct column name
new_df[emissions_column] = new_df[emissions_column].fillna(new_df[emissions_column].median())

# Fill categorical columns with a placeholder value
new_df['City Name'] = new_df['City Name'].fillna('Unknown City')
new_df['Country'] = new_df['Country'].fillna('Unknown Country')

# Convert 'Reporting Year' to datetime format and fill with a default value if they remain NaN
new_df['Reporting Year'] = pd.to_datetime(new_df['Reporting Year'], format='%Y', errors='coerce').dt.year
new_df['Reporting Year'] = new_df['Reporting Year'].fillna(0)

# Data Type Conversions
new_df['Reporting Year'] = pd.to_datetime(new_df['Reporting Year'], format='%Y', errors='coerce').dt.year

# Standardizing Formats
new_df['City Name'] = new_df['City Name'].str.title()

# Removing Duplicates
new_df.drop_duplicates(inplace=True)

# Trimming and Cleaning Strings
new_df['City Name'] = new_df['City Name'].str.strip()

# Consolidating Categories
new_df['Country'] = new_df['Country'].replace({'Usa': 'USA', 'Uk': 'United Kingdom'})

# Validating Data
new_df = new_df[new_df[emissions_column] >= 0]

# Creating Unique Identifiers
new_df['id'] = pd.factorize(new_df['City Name'] + new_df['Reporting Year'].astype(str))[0] + 1

# Exporting Clean Data
cleaned_new_file_path = 'cleaned_data_2016_MongoDB_-_Citywide_GHG_Emissions_20240207.csv'
new_df.to_csv(cleaned_new_file_path, index=False)

# Display the cleaned data structure and info
print("\nCleaned Data:")
print(new_df.head())
print(new_df.info())

print("\nMissing Values After Cleaning:")
print(new_df.isna().sum())



Initial Data:
   Account Number            City Name         Country City Short Name  C40  \
0           35894    Ville de Montreal          Canada        Montreal  NaN   
1           35898   Greater Manchester  United Kingdom      Manchester  NaN   
2           54128         City of Reno             USA            Reno  NaN   
3           35879  City of Minneapolis             USA     Minneapolis  NaN   
4           50558   City of London, ON          Canada      London, ON  NaN   

   Reporting Year        Measurement Year  \
0            2016  12/31/2009 12:00:00 AM   
1            2016  12/31/2013 12:00:00 AM   
2            2016  12/31/2014 12:00:00 AM   
3            2016  12/31/2014 12:00:00 AM   
4            2016  12/31/2014 12:00:00 AM   

                                            Boundary  \
0  Other: The regional entity that constitutes th...   
1                                A metropolitan area   
2      Administrative boundary of a local government   
3      Administr

Step 4: Create a New Collection and Insert Data into MongoDB

In [13]:
# Define the new collection name
new_collection_name = 'citywide_ghg_emissions'
new_collection = db[new_collection_name]

# Function to insert city documents
def insert_citywide_ghg_documents(df, collection):
    for index, row in df.iterrows():
        city_document = {
            "city": row['City Name'],
            "country": row['Country'],
            "reporting_year": row['Reporting Year'],
            "total_ghg_emissions": row[emissions_column],
            "id": row['id']
        }
        collection.insert_one(city_document)

# Insert city documents into MongoDB
insert_citywide_ghg_documents(new_df, new_collection)

# Verify by retrieving one document
print("One document from MongoDB:")
print(new_collection.find_one())

# Query to see all inserted documents (sample)
for doc in new_collection.find().limit(5):
    print(doc)


One document from MongoDB:
{'_id': ObjectId('665c8f39b791c25edf693e23'), 'city': 'Ville De Montreal', 'country': 'Canada', 'reporting_year': 2016, 'total_ghg_emissions': 13722942.0, 'id': 1}
{'_id': ObjectId('665c8f39b791c25edf693e23'), 'city': 'Ville De Montreal', 'country': 'Canada', 'reporting_year': 2016, 'total_ghg_emissions': 13722942.0, 'id': 1}
{'_id': ObjectId('665c8f39b791c25edf693e24'), 'city': 'Greater Manchester', 'country': 'United Kingdom', 'reporting_year': 2016, 'total_ghg_emissions': 14889318.0, 'id': 2}
{'_id': ObjectId('665c8f39b791c25edf693e25'), 'city': 'City Of Reno', 'country': 'USA', 'reporting_year': 2016, 'total_ghg_emissions': 4437665.0, 'id': 3}
{'_id': ObjectId('665c8f39b791c25edf693e26'), 'city': 'City Of Minneapolis', 'country': 'USA', 'reporting_year': 2016, 'total_ghg_emissions': 4794708.0, 'id': 4}
{'_id': ObjectId('665c8f39b791c25edf693e27'), 'city': 'City Of London, On', 'country': 'Canada', 'reporting_year': 2016, 'total_ghg_emissions': 3070000.0, 

Explanation:

Load and Inspect CSV: Load the CSV file and inspect the column names to identify the correct names.

Handle Missing Values: Fill missing numeric and categorical values, convert date columns, and clean text fields.

Export Clean Data: Save the cleaned data to a new CSV file.

Create Collection and Insert Data: Define a new collection and insert the cleaned data into MongoDB.

Verify Data Insertion: Print some documents from the collection to verify the data insertion.
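
Running the insertion cell a second time adds every document again, because insert_one never checks for existing data. One hedged way to make the step repeatable is to upsert on the id field with pymongo's replace_one(filter, replacement, upsert=True). The helper name upsert_documents is an assumption for illustration, and the approach only works if id really identifies a single document in the collection.

```python
def upsert_documents(collection, documents, key="id"):
    """Replace the document with a matching key, or insert it if none exists."""
    for document in documents:
        # upsert=True inserts the document when the filter matches nothing
        collection.replace_one({key: document[key]}, document, upsert=True)
    return len(documents)
```

With this in place, re-running the cell updates the existing documents instead of duplicating them.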

2017_-_Cities_Community_Wide_Emissions.csv

Step 1: Import Necessary Libraries

First, we need to import the necessary Python libraries (pandas and MongoClient from pymongo, as in the previous sections).

Step 2: Connect to MongoDB

Set up a connection to your MongoDB database named EksamensProjekt_DB.

In [14]:
# Connect to MongoDB
client = MongoClient("mongodb://localhost:27017/")  # Replace with your actual connection string
db = client['EksamensProjekt_DB']

# Verify connection by listing databases
print(client.list_database_names())


['EksamensProjekt_DB', 'Mongo', 'OLA2_DB', 'admin', 'config', 'local', 'somesite']


Step 3: Load and Inspect New CSV Data

Load the new CSV data into a pandas DataFrame and inspect the column names to identify the correct name for the emissions column.

Explanation:

Load CSV: Read the CSV file into a DataFrame using pd.read_csv.

Display Initial Data: Print the first few rows and the structure of the DataFrame to understand its content.

Display Column Names: Print the column names to identify the correct name for the emissions column.

In [16]:
# Load the new CSV data into a DataFrame
new_file_path = '2017_-_Cities_Community_Wide_Emissions.csv'
new_df = pd.read_csv(new_file_path)

# Display the initial data structure
print("Initial Data:")
print(new_df.head())
print(new_df.info())

# Display the column names
print("Column Names:")
print(new_df.columns)


Initial Data:
   Account number                     Organization                 City  \
0           49363  Nelson Mandela Bay Municipality  Nelson Mandela Bay    
1           31171           Ayuntamiento de Madrid               Madrid   
2            3417                    New York City        New York City   
3           59537               City of Denton, TX           Denton, TX   
4           35894                Ville de Montreal             Montreal   

        Country         Region  C40  Access  Reporting year  \
0  South Africa         Africa  NaN  Public            2017   
1         Spain         Europe  C40  Public            2017   
2           USA  North America  C40  Public            2017   
3           USA  North America  NaN  Public            2017   
4        Canada  North America  C40  Public            2017   

           Accounting year                                       Boundary  \
0  2013-07-01 - 2014-06-30                            A metropolitan area   
1 

Step 4: Update the Data Cleaning Script

Once you have identified the correct column names, you can proceed with the data cleaning and insertion process.

Explanation:

Handle Missing Values:
Fill Numeric Columns: Fill missing values in numeric columns with the median value to avoid skewing the data.
Fill Categorical Columns: Fill missing values in categorical columns with placeholder values ('Unknown City', 'Unknown Country').

Convert 'Reporting Year' to Datetime: Convert the 'Reporting year' column to datetime format and fill any remaining NaN values with a default value.

Standardize Formats: Standardize the format of the 'City' column to title case for consistency.

Remove Duplicates: Remove any duplicate rows to ensure each record is unique.

Trim and Clean Strings: Trim any leading or trailing spaces from string values.

Consolidate Categories: Replace similar categories (e.g., 'Usa' to 'USA') for consistency.

Validate Data: Ensure all emissions values are non-negative.

Create Unique Identifiers: Generate a unique identifier for each row based on the combination of 'City' and 'Reporting year'.

Export Cleaned Data: Save the cleaned DataFrame to a new CSV file for future use.
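
The median fill can be illustrated without pandas. The sketch below computes the median of the known values and substitutes it for the missing ones, which mirrors fillna(column.median()) because pandas skips missing values when it computes a median. The function name fill_with_median and the sample numbers are made up for illustration.

```python
from statistics import median

def fill_with_median(values):
    """Replace None entries with the median of the non-missing entries."""
    known = [v for v in values if v is not None]
    if not known:
        return list(values)  # nothing to compute a median from
    fill = median(known)
    return [fill if v is None else v for v in values]

print(fill_with_median([10.0, None, 30.0, 20.0]))  # [10.0, 20.0, 30.0, 20.0]
```

The median is less sensitive to a few very large emitters than the mean, which is the usual reason for choosing it as the fill value.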

In [17]:
# Load the new CSV data into a DataFrame
new_file_path = '2017_-_Cities_Community_Wide_Emissions.csv'
new_df = pd.read_csv(new_file_path)

# Display the initial data structure
print("Initial Data:")
print(new_df.head())
print(new_df.info())

# Display the column names
print("Column Names:")
print(new_df.columns)

# Handling Missing Values
print("\nMissing Values Before Cleaning:")
print(new_df.isna().sum())

# Fill numeric columns with median values
emissions_column = 'Total emissions (metric tonnes CO2e)'  # Correct column name
new_df[emissions_column] = new_df[emissions_column].fillna(new_df[emissions_column].median())

# Fill categorical columns with a placeholder value
new_df['City'] = new_df['City'].fillna('Unknown City')
new_df['Country'] = new_df['Country'].fillna('Unknown Country')

# Convert 'Reporting year' to datetime format and fill with a default value if they remain NaN
new_df['Reporting year'] = pd.to_datetime(new_df['Reporting year'], format='%Y', errors='coerce').dt.year
new_df['Reporting year'] = new_df['Reporting year'].fillna(0)

# Data Type Conversions
new_df['Reporting year'] = pd.to_datetime(new_df['Reporting year'], format='%Y', errors='coerce').dt.year

# Standardizing Formats
new_df['City'] = new_df['City'].str.title()

# Removing Duplicates
new_df.drop_duplicates(inplace=True)

# Trimming and Cleaning Strings
new_df['City'] = new_df['City'].str.strip()

# Consolidating Categories
new_df['Country'] = new_df['Country'].replace({'Usa': 'USA', 'Uk': 'United Kingdom'})

# Validating Data
new_df = new_df[new_df[emissions_column] >= 0]

# Creating Unique Identifiers
new_df['id'] = pd.factorize(new_df['City'] + new_df['Reporting year'].astype(str))[0] + 1

# Exporting Clean Data
cleaned_new_file_path = 'cleaned_data_2017_MongoDB_-_Cities_Community_Wide_Emissions.csv'
new_df.to_csv(cleaned_new_file_path, index=False)

# Display the cleaned data structure and info
print("\nCleaned Data:")
print(new_df.head())
print(new_df.info())

print("\nMissing Values After Cleaning:")
print(new_df.isna().sum())


Initial Data:
   Account number                     Organization                 City  \
0           49363  Nelson Mandela Bay Municipality  Nelson Mandela Bay    
1           31171           Ayuntamiento de Madrid               Madrid   
2            3417                    New York City        New York City   
3           59537               City of Denton, TX           Denton, TX   
4           35894                Ville de Montreal             Montreal   

        Country         Region  C40  Access  Reporting year  \
0  South Africa         Africa  NaN  Public            2017   
1         Spain         Europe  C40  Public            2017   
2           USA  North America  C40  Public            2017   
3           USA  North America  NaN  Public            2017   
4        Canada  North America  C40  Public            2017   

           Accounting year                                       Boundary  \
0  2013-07-01 - 2014-06-30                            A metropolitan area   
1 

Step 5: Create a New Collection and Insert Data into MongoDB

Explanation:

Define New Collection Name: Specify the name of the new collection to store the data.

Function Definition: Define a function insert_community_wide_emissions_documents that takes a DataFrame and a collection as arguments.

Iterate Over Rows: Loop through each row in the DataFrame.

Create Document: For each row, create a dictionary (document) containing the relevant fields.

Insert Document: Use collection.insert_one to insert each document into the MongoDB collection.

Verify Data Insertion: Print some documents from the collection to verify the data insertion.

In [18]:
# Define the new collection name
new_collection_name = 'community_wide_emissions_2017'
new_collection = db[new_collection_name]

# Function to insert city documents
def insert_community_wide_emissions_documents(df, collection):
    for index, row in df.iterrows():
        city_document = {
            "city": row['City'],
            "country": row['Country'],
            "reporting_year": row['Reporting year'],
            "total_emissions": row[emissions_column],
            "id": row['id']
        }
        collection.insert_one(city_document)

# Insert city documents into MongoDB
insert_community_wide_emissions_documents(new_df, new_collection)

# Verify by retrieving one document
print("One document from MongoDB:")
print(new_collection.find_one())

# Query to see all inserted documents (sample)
for doc in new_collection.find().limit(5):
    print(doc)


One document from MongoDB:
{'_id': ObjectId('665cd614b791c25edf693edf'), 'city': 'Nelson Mandela Bay', 'country': 'South Africa', 'reporting_year': 2017, 'total_emissions': 12232310.0, 'id': 1}
{'_id': ObjectId('665cd614b791c25edf693edf'), 'city': 'Nelson Mandela Bay', 'country': 'South Africa', 'reporting_year': 2017, 'total_emissions': 12232310.0, 'id': 1}
{'_id': ObjectId('665cd614b791c25edf693ee0'), 'city': 'Madrid', 'country': 'Spain', 'reporting_year': 2017, 'total_emissions': 9236196.0, 'id': 2}
{'_id': ObjectId('665cd614b791c25edf693ee1'), 'city': 'New York City', 'country': 'USA', 'reporting_year': 2017, 'total_emissions': 52042186.0, 'id': 3}
{'_id': ObjectId('665cd614b791c25edf693ee2'), 'city': 'Denton, Tx', 'country': 'USA', 'reporting_year': 2017, 'total_emissions': 1604007.0, 'id': 4}
{'_id': ObjectId('665cd614b791c25edf693ee3'), 'city': 'Montreal', 'country': 'Canada', 'reporting_year': 2017, 'total_emissions': 10197555.0, 'id': 5}


2017_-_Cities_Emissions_Reduction_Targets_20240207.csv

Step 1: Import Necessary Libraries

In [None]:
# Import necessary libraries
import pandas as pd
from pymongo import MongoClient


Step 2: Connect to MongoDB

In [None]:
# Connect to MongoDB
client = MongoClient("mongodb://localhost:27017/")  # Replace with your actual connection string
db = client['EksamensProjekt_DB']

# Verify connection by listing databases
print(client.list_database_names())


Step 3: Load and Inspect New CSV Data

Load the new CSV data into a pandas DataFrame and inspect the column names to identify the correct name for the emissions reduction targets.

Explanation:

Load CSV: Read the CSV file into a DataFrame using pd.read_csv.

Display Initial Data: Print the first few rows and the structure of the DataFrame to understand its content.

Display Column Names: Print the column names to identify the correct name for the emissions reduction targets.

In [19]:
# Load the new CSV data into a DataFrame
new_file_path = '2017_-_Cities_Emissions_Reduction_Targets_20240207.csv'
new_df = pd.read_csv(new_file_path)

# Display the initial data structure
print("Initial Data:")
print(new_df.head())
print(new_df.info())

# Display the column names
print("Column Names:")
print(new_df.columns)


Initial Data:
   Account No                   Organisation              City  \
0       54408                 Aarhus Kommune            Aarhus   
1       63616  Abasan Al-Kabira Municipality  Abasan Al-Kabira   
2       63616  Abasan Al-Kabira Municipality  Abasan Al-Kabira   
3        1499        Ajuntament de Barcelona         Barcelona   
4        1499        Ajuntament de Barcelona         Barcelona   

              Country               Region  Access  C40  Reporting year  \
0             Denmark               Europe  Public  NaN            2017   
1  State of Palestine  South and West Asia  Public  NaN            2017   
2  State of Palestine  South and West Asia  Public  NaN            2017   
3               Spain               Europe  Public  C40            2017   
4               Spain               Europe  Public  C40            2017   

    Type of target     Sector  ... Baseline emissions (metric tonnes CO2e)  \
0  Absolute target        NaN  ...                          

Step 4: Data Cleaning

Explanation:

Handle Missing Values:
Fill Numeric Columns: Fill missing values in numeric columns with the median value to avoid skewing the data.
Fill Categorical Columns: Fill missing values in categorical columns with placeholder values ('Unknown City', 'Unknown Country').

Convert 'Reporting Year' and 'Target date' to Datetime: Convert these columns to datetime format and fill any remaining NaN values with a default value.

Standardize Formats: Standardize the format of the 'City' column to title case for consistency.

Remove Duplicates: Remove any duplicate rows to ensure each record is unique.

Trim and Clean Strings: Trim any leading or trailing spaces from string values.

Consolidate Categories: Replace similar categories (e.g., 'Usa' to 'USA') for consistency.

Validate Data: Ensure all emissions reduction target values are non-negative.

Create Unique Identifiers: Generate a unique identifier for each row based on the combination of 'City' and 'Reporting year'.

Export Cleaned Data: Save the cleaned DataFrame to a new CSV file for future use.
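
For the year columns, a plain integer conversion with an explicit default is one possible alternative to parsing with pd.to_datetime. The sketch below is not the notebook's method; the helper name parse_year, the default of 0, and the accepted range of 1 to 9999 are assumptions chosen for illustration.

```python
import math

def parse_year(value, default=0):
    """Return value as an integer year, or default when it is missing or invalid."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return default
    try:
        year = int(float(value))
    except (TypeError, ValueError):
        return default
    # Reject values outside a plausible four-digit year range
    return year if 1 <= year <= 9999 else default

print(parse_year("2020"))        # 2020
print(parse_year(2030.0))        # 2030
print(parse_year(float("nan")))  # 0
print(parse_year("n/a"))         # 0
```

Applied to a column (for example with Series.map), this yields integer years and keeps the default in place for missing entries.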

In [20]:
# Handling Missing Values
print("\nMissing Values Before Cleaning:")
print(new_df.isna().sum())

# Fill numeric columns with median values
emissions_column = 'Baseline emissions (metric tonnes CO2e)'  # Correct column name
reduction_column = 'Percentage reduction target'  # Correct column name
new_df[emissions_column] = new_df[emissions_column].fillna(new_df[emissions_column].median())
new_df[reduction_column] = new_df[reduction_column].fillna(new_df[reduction_column].median())

# Fill categorical columns with a placeholder value
new_df['City'] = new_df['City'].fillna('Unknown City')
new_df['Country'] = new_df['Country'].fillna('Unknown Country')

# Convert 'Reporting Year' and 'Target date' to datetime format and fill with a default value if they remain NaN
new_df['Reporting year'] = pd.to_datetime(new_df['Reporting year'], format='%Y', errors='coerce').dt.year
new_df['Target date'] = pd.to_datetime(new_df['Target date'], format='%Y', errors='coerce').dt.year

new_df['Reporting year'] = new_df['Reporting year'].fillna(0)
new_df['Target date'] = new_df['Target date'].fillna(0)

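# Data Type Conversions
# Note: this repeats the year conversion above. A 0 placeholder is not a valid
# four-digit %Y year, so errors='coerce' can turn it back into NaN.
new_df['Reporting year'] = pd.to_datetime(new_df['Reporting year'], format='%Y', errors='coerce').dt.year
new_df['Target date'] = pd.to_datetime(new_df['Target date'], format='%Y', errors='coerce').dt.year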

# Standardizing Formats
new_df['City'] = new_df['City'].str.title()

# Removing Duplicates
new_df.drop_duplicates(inplace=True)

# Trimming and Cleaning Strings
new_df['City'] = new_df['City'].str.strip()

# Consolidating Categories
new_df['Country'] = new_df['Country'].replace({'Usa': 'USA', 'Uk': 'United Kingdom'})

# Validating Data
new_df = new_df[new_df[reduction_column] >= 0]

# Creating Unique Identifiers
new_df['id'] = pd.factorize(new_df['City'] + new_df['Reporting year'].astype(str))[0] + 1

# Exporting Clean Data
cleaned_new_file_path = 'cleaned_data_2017_MongoDB_-_Cities_Emissions_Reduction_Targets_20240207.csv'
new_df.to_csv(cleaned_new_file_path, index=False)

# Display the cleaned data structure and info
print("\nCleaned Data:")
print(new_df.head())
print(new_df.info())

print("\nMissing Values After Cleaning:")
print(new_df.isna().sum())



Missing Values Before Cleaning:
Account No                                                                              0
Organisation                                                                            0
City                                                                                    2
Country                                                                                 2
Region                                                                                  2
Access                                                                                  0
C40                                                                                   312
Reporting year                                                                          0
Type of target                                                                          0
Sector                                                                                108
Baseline year                                                      

Step 5: Create a New Collection and Insert Data into MongoDB

Explanation:

Define New Collection Name: Specify the name of the new collection to store the data.

Function Definition: Define a function insert_emissions_reduction_documents that takes a DataFrame and a collection as arguments.

Iterate Over Rows: Loop through each row in the DataFrame.

Create Document: For each row, create a dictionary (document) containing the relevant fields.

Insert Document: Use collection.insert_one to insert each document into the MongoDB collection.

Verify Data Insertion: Print some documents from the collection to verify the data insertion.

In [21]:
# Define the new collection name
new_collection_name = 'emissions_reduction_targets_2017'
new_collection = db[new_collection_name]

# Function to insert city documents
def insert_emissions_reduction_documents(df, collection):
    for index, row in df.iterrows():
        city_document = {
            "city": row['City'],
            "country": row['Country'],
            "reporting_year": row['Reporting year'],
            "baseline_emissions": row[emissions_column],
            "percentage_reduction_target": row[reduction_column],
            "target_year": row['Target date'],
            "id": row['id']
        }
        collection.insert_one(city_document)

# Insert city documents into MongoDB
insert_emissions_reduction_documents(new_df, new_collection)

# Verify by retrieving one document
print("One document from MongoDB:")
print(new_collection.find_one())

# Print a sample of up to five documents from the collection
for doc in new_collection.find().limit(5):
    print(doc)


One document from MongoDB:
{'_id': ObjectId('665cd801b791c25edf693fc4'), 'city': 'Aarhus', 'country': 'Denmark', 'reporting_year': 2017, 'baseline_emissions': 1484767.0, 'percentage_reduction_target': 100.0, 'target_year': 2030.0, 'id': 1}
{'_id': ObjectId('665cd801b791c25edf693fc4'), 'city': 'Aarhus', 'country': 'Denmark', 'reporting_year': 2017, 'baseline_emissions': 1484767.0, 'percentage_reduction_target': 100.0, 'target_year': 2030.0, 'id': 1}
{'_id': ObjectId('665cd801b791c25edf693fc5'), 'city': 'Abasan Al-Kabira', 'country': 'State of Palestine', 'reporting_year': 2017, 'baseline_emissions': 18320.0, 'percentage_reduction_target': 19.0, 'target_year': 2020.0, 'id': 2}
{'_id': ObjectId('665cd801b791c25edf693fc6'), 'city': 'Abasan Al-Kabira', 'country': 'State of Palestine', 'reporting_year': 2017, 'baseline_emissions': 6893.0, 'percentage_reduction_target': 6.0, 'target_year': 2020.0, 'id': 2}
{'_id': ObjectId('665cd801b791c25edf693fc7'), 'city': 'Barcelona', 'country': 'Spain', 
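
The loop above calls insert_one once per row, which means one round trip to the server for every document. A possible alternative, shown here only as a sketch, is to build all documents first and hand them to insert_many in a single call. The helper names (build_documents, insert_in_bulk), the InMemoryCollection stand-in, and the two column names in the sample rows are assumptions made for illustration; the stand-in exists only so the sketch runs without a MongoDB server.

```python
# Sketch only: build every document first, then insert them with one call.
# The two column-name arguments stand in for the notebook's emissions_column
# and reduction_column variables; the names used in sample_rows are placeholders.

def build_documents(rows, emissions_column, reduction_column):
    """Map row dictionaries to the document shape used in this notebook."""
    return [
        {
            "city": row["City"],
            "country": row["Country"],
            "reporting_year": row["Reporting year"],
            "baseline_emissions": row[emissions_column],
            "percentage_reduction_target": row[reduction_column],
            "target_year": row["Target date"],
            "id": row["id"],
        }
        for row in rows
    ]


def insert_in_bulk(rows, collection, emissions_column, reduction_column):
    """Insert all rows with a single insert_many call and return the count."""
    documents = build_documents(rows, emissions_column, reduction_column)
    if documents:  # pymongo's insert_many rejects an empty list
        collection.insert_many(documents)
    return len(documents)


class InMemoryCollection:
    """Hypothetical stand-in that exposes only insert_many, for a dry run."""

    def __init__(self):
        self.documents = []

    def insert_many(self, documents):
        self.documents.extend(documents)


# Two rows taken from the sample output above; the column names are placeholders.
sample_rows = [
    {"City": "Aarhus", "Country": "Denmark", "Reporting year": 2017,
     "Baseline emissions": 1484767.0, "Reduction target (%)": 100.0,
     "Target date": 2030.0, "id": 1},
    {"City": "Abasan Al-Kabira", "Country": "State of Palestine",
     "Reporting year": 2017, "Baseline emissions": 18320.0,
     "Reduction target (%)": 19.0, "Target date": 2020.0, "id": 2},
]

dry_run = InMemoryCollection()
inserted = insert_in_bulk(sample_rows, dry_run,
                          "Baseline emissions", "Reduction target (%)")
print(inserted, "documents prepared")
```

Against the real database, the same function could be called as insert_in_bulk(new_df.to_dict("records"), new_collection, emissions_column, reduction_column). Note that re-running either version of the insert cell adds the same documents again, because nothing in the documents is declared unique.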

2023_Cities_Climate_Risk_and_Vulnerability_Assessments_20240207.csv

Step 1: Import Necessary Libraries

Explanation:
    pandas: A powerful library for data manipulation and analysis. We'll use it to load and clean the CSV data.
    pymongo: A library to interact with MongoDB from Python. We'll use it to connect to our MongoDB instance and perform database operations.

In [4]:
# Import necessary libraries
import pandas as pd
from pymongo import MongoClient


Step 2: Connect to MongoDB

Explanation:

MongoClient: This class is used to create a connection to a MongoDB server. The connection string mongodb://localhost:27017/ specifies that the server is running locally on port 27017.

db: Selects the EksamensProjekt_DB database. MongoDB will create this database when we first store data in it.

In [5]:
# Connect to MongoDB
client = MongoClient("mongodb://localhost:27017/")  # Replace with your actual connection string
db = client['EksamensProjekt_DB']

# Verify connection by listing databases
print(client.list_database_names())


['EksamensProjekt_DB', 'Mongo', 'OLA2_DB', 'admin', 'config', 'local', 'somesite']


Step 3: Load and Inspect New CSV Data

Load the new CSV data into a pandas DataFrame and inspect the column names to identify the correct name for the climate risk and vulnerability assessments.

Explanation:
    Load CSV: Read the CSV file into a DataFrame using pd.read_csv.
    Display Initial Data: Print the first few rows and the structure of the DataFrame to understand its contents.
    Display Column Names: Print the column names to identify the correct name for the climate risk and vulnerability assessments.

In [6]:
# Load the new CSV data into a DataFrame
new_file_path = '2023_Cities_Climate_Risk_and_Vulnerability_Assessments_20240207.csv'
new_df = pd.read_csv(new_file_path)

# Display the initial data structure
print("Initial Data:")
print(new_df.head())
print(new_df.info())

# Display the column names
print("Column Names:")
print(new_df.columns)


Initial Data:
  Questionnaire  Organization Number                Organization Name  \
0   Cities 2023               840926      Prefeitura de Serra Talhada   
1   Cities 2023                51075                 City of Shenzhen   
2   Cities 2023               863190                            Renca   
3   Cities 2023               930366  Municipalidad Distrital de Yura   
4   Cities 2023                60236          Trelleborg Municipality   

         City Country/Area     CDP Region  C40 City  GCoM City  Access  \
0         NaN       Brazil  Latin America     False       True  public   
1    Shenzhen        China      East Asia      True      False  public   
2         NaN        Chile  Latin America     False      False  public   
3         NaN         Peru  Latin America     False       True  public   
4  Trelleborg       Sweden         Europe     False       True  public   

            Assessment attachment and/or direct link  \
0  https://drive.google.com/file/d/19DMxxK532I

Step 4: Data Cleaning 

Explanation:
    Handle Missing Values:
        Fill Numeric Columns: Fill missing values in numeric columns with the median value to avoid skewing the data.
        Fill Categorical Columns: Fill missing values in categorical columns with placeholder values ('Unknown City', 'Unknown Country').
        Fill Link, Location, and Text Columns: Fill missing links, locations, and free-text fields with explicit placeholders ('No Link Provided', 'No Location Data', 'Information not available').
    Convert 'Year of publication or approval' to Datetime: Convert this column to datetime format, keep only the year, and fill any remaining NaN values with 0.
    Standardize Formats: Standardize the format of the 'City' column to title case for consistency.
    Remove Duplicates: Remove any duplicate rows to ensure each record is unique.
    Trim and Clean Strings: Trim any leading or trailing spaces from string values.
    Consolidate Categories: Replace similar categories (e.g., 'Usa' to 'USA') for consistency.
    Validate Data: Keep only rows whose numeric values are non-negative.
    Create Unique Identifiers: Generate an identifier from the combination of 'City' and 'Year of publication or approval'. Rows that share both values receive the same id.
    Export Cleaned Data: Save the cleaned DataFrame to a new CSV file for future use.
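
As a small illustration of the missing-value steps listed above, the following sketch fills a numeric column with its median and a categorical column with a placeholder. The toy DataFrame is an assumption built from a few population figures that appear in this notebook's sample output; it is not the real dataset.

```python
import pandas as pd

# Toy frame for illustration only; the real data comes from the CSV file.
toy = pd.DataFrame({
    "City": ["Shenzhen", None, "Trelleborg", None],
    "Population": [17662000.0, 92228.0, 46649.0, None],
})

# The median ignores NaN and is not pulled upward by the one very large city,
# whereas the mean of the same three values would be several million.
median_population = toy["Population"].median()
toy["Population"] = toy["Population"].fillna(median_population)

# Categorical gaps get an explicit placeholder instead of a guessed value.
toy["City"] = toy["City"].fillna("Unknown City")

print(toy)
```

The median is used rather than the mean because a single large value (Shenzhen here) would otherwise dominate the fill value for every missing row.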

In [7]:
import pandas as pd
import numpy as np


# Load your dataset
new_df = pd.read_csv('2023_Cities_Climate_Risk_and_Vulnerability_Assessments_20240207.csv')  # Replace with the actual file path

# Handling Missing Values
print("\nMissing Values Before Cleaning:")
print(new_df.isna().sum())

# Fill numeric columns with median values
numeric_columns = ['Population', 'Population Year', 'Year of publication or approval']  # Replace with the actual column names if different
for column in numeric_columns:
    new_df[column] = new_df[column].fillna(new_df[column].median())

# Fill categorical columns with a placeholder value
new_df['City'] = new_df['City'].fillna('Unknown City')
new_df['Country/Area'] = new_df['Country/Area'].fillna('Unknown Country')

# Convert 'Year of publication or approval' to a numeric year; values that cannot be parsed become 0
new_df['Year of publication or approval'] = pd.to_datetime(new_df['Year of publication or approval'], format='%Y', errors='coerce').dt.year
new_df['Year of publication or approval'] = new_df['Year of publication or approval'].fillna(0)

# Standardizing Formats
new_df['City'] = new_df['City'].str.title()

# Removing Duplicates
new_df.drop_duplicates(inplace=True)

# Trimming and Cleaning Strings
new_df['City'] = new_df['City'].str.strip()

# Consolidating Categories
new_df['Country/Area'] = new_df['Country/Area'].replace({'Usa': 'USA', 'Uk': 'United Kingdom'})

# Validating Data
for column in numeric_columns:
    new_df = new_df[new_df[column] >= 0]

# Handling Missing URLs and Location Data
new_df['Assessment attachment and/or direct link'] = new_df['Assessment attachment and/or direct link'].fillna('No Link Provided')
new_df['City Location'] = new_df['City Location'].fillna('No Location Data')

# Handling Missing Text Data
text_columns = ['Factors considered in assessment', 'Primary author(s) of assessment']
for column in text_columns:
    new_df[column] = new_df[column].fillna('Information not available')

# Creating Unique Identifiers
new_df['id'] = pd.factorize(new_df['City'] + new_df['Year of publication or approval'].astype(str))[0] + 1

# Exporting Clean Data
cleaned_new_file_path = 'cleaned_data_2023_MongoDB_Cities_Climate_Risk_and_Vulnerability_Assessments.csv'
new_df.to_csv(cleaned_new_file_path, index=False)

# Display the cleaned data structure and info
print("\nCleaned Data:")
print(new_df.head())
print(new_df.info())

print("\nMissing Values After Cleaning:")
print(new_df.isna().sum())



Missing Values Before Cleaning:
Questionnaire                                                         0
Organization Number                                                   0
Organization Name                                                     0
City                                                                439
Country/Area                                                          0
CDP Region                                                            0
C40 City                                                              0
GCoM City                                                             0
Access                                                                0
Assessment attachment and/or direct link                             37
Confirm attachment/link provided                                      0
Boundary of assessment relative to jurisdiction boundary             12
Year of publication or approval                                      26
Factors considered in assessmen

Step 5: Create a New Collection and Insert Data into MongoDB

Explanation:
    Define New Collection Name: Specify the name of the new collection to store the data.
    Function Definition: Define a function insert_climate_risk_documents that takes a DataFrame and a collection as arguments.
    Iterate Over Rows: Loop through each row in the DataFrame.
    Create Document: For each row, create a dictionary (document) containing the relevant fields.
    Insert Document: Use collection.insert_one to insert each document into the MongoDB collection.

In [8]:
# Define the new collection name
new_collection_name = 'climate_risk_assessments_2023'
new_collection = db[new_collection_name]

# Function to insert city documents
def insert_climate_risk_documents(df, collection):
    for index, row in df.iterrows():
        city_document = {
            "city": row['City'],
            "country": row['Country/Area'],
            "assessment_year": row['Year of publication or approval'],
            "population": row['Population'],
            "population_year": row['Population Year'],
            "city_location": row['City Location'],
            "id": row['id']
        }
        collection.insert_one(city_document)

# Insert city documents into MongoDB
insert_climate_risk_documents(new_df, new_collection)

# Verify by retrieving one document
print("One document from MongoDB:")
print(new_collection.find_one())

# Print a sample of up to five documents from the collection
for doc in new_collection.find().limit(5):
    print(doc)


One document from MongoDB:
{'_id': ObjectId('665cdaa9b791c25edf694158'), 'city': 'Unknown City', 'country': 'Brazil', 'assessment_year': 2022, 'population': 92228, 'population_year': 2023.0, 'city_location': nan, 'id': 1}
{'_id': ObjectId('665cdaa9b791c25edf694158'), 'city': 'Unknown City', 'country': 'Brazil', 'assessment_year': 2022, 'population': 92228, 'population_year': 2023.0, 'city_location': nan, 'id': 1}
{'_id': ObjectId('665cdaa9b791c25edf694159'), 'city': 'Shenzhen', 'country': 'China', 'assessment_year': 2022, 'population': 17662000, 'population_year': 2022.0, 'city_location': 'POINT (113.813 22.9175)', 'id': 2}
{'_id': ObjectId('665cdaa9b791c25edf69415a'), 'city': 'Unknown City', 'country': 'Chile', 'assessment_year': 2019, 'population': 162517, 'population_year': 2022.0, 'city_location': nan, 'id': 3}
{'_id': ObjectId('665cdaa9b791c25edf69415b'), 'city': 'Unknown City', 'country': 'Peru', 'assessment_year': 2022, 'population': 33346, 'population_year': 2017.0, 'city_locat
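
Several of the documents printed above carry 'city_location': nan. A float NaN is stored in MongoDB as a numeric NaN, not as null, so a query such as {"city_location": None} would not match those documents. One way to handle this, sketched below with a hypothetical helper named nan_to_none, is to replace NaN values with None before inserting. The sample document reuses values from the output above.

```python
import math


def nan_to_none(document):
    """Return a copy of `document` with float NaN values replaced by None."""
    return {
        key: None if isinstance(value, float) and math.isnan(value) else value
        for key, value in document.items()
    }


# Document shaped like the first one printed above.
raw_document = {
    "city": "Unknown City",
    "country": "Brazil",
    "assessment_year": 2022,
    "population": 92228,
    "population_year": 2023.0,
    "city_location": float("nan"),
    "id": 1,
}

clean_document = nan_to_none(raw_document)
print(clean_document)
```

In the insert function, the call would wrap the dictionary just before insertion, for example collection.insert_one(nan_to_none(city_document)).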

In [9]:
from pymongo import MongoClient

# Connect to MongoDB
client = MongoClient('mongodb://localhost:27017/')
db = client['EksamensProjekt_DB']
collection = db['climate_risk_assessments_2023']

# Verify data
documents = collection.find().limit(5)  # Limit to 5 documents for display
for doc in documents:
    print(doc)


{'_id': ObjectId('665cdaa9b791c25edf694158'), 'city': 'Unknown City', 'country': 'Brazil', 'assessment_year': 2022, 'population': 92228, 'population_year': 2023.0, 'city_location': nan, 'id': 1}
{'_id': ObjectId('665cdaa9b791c25edf694159'), 'city': 'Shenzhen', 'country': 'China', 'assessment_year': 2022, 'population': 17662000, 'population_year': 2022.0, 'city_location': 'POINT (113.813 22.9175)', 'id': 2}
{'_id': ObjectId('665cdaa9b791c25edf69415a'), 'city': 'Unknown City', 'country': 'Chile', 'assessment_year': 2019, 'population': 162517, 'population_year': 2022.0, 'city_location': nan, 'id': 3}
{'_id': ObjectId('665cdaa9b791c25edf69415b'), 'city': 'Unknown City', 'country': 'Peru', 'assessment_year': 2022, 'population': 33346, 'population_year': 2017.0, 'city_location': nan, 'id': 1}
{'_id': ObjectId('665cdaa9b791c25edf69415c'), 'city': 'Trelleborg', 'country': 'Sweden', 'assessment_year': 2018, 'population': 46649, 'population_year': 2022.0, 'city_location': 'POINT (13.1569 55.3751
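
In the output above, the Brazil and Peru documents both have 'id': 1. This follows from how the id is built: pd.factorize assigns one code per distinct key, and both rows have the key 'Unknown City' plus 2022. The sketch below reproduces that behaviour on a toy frame taken from the first four documents and shows one possible way to get distinct ids, by adding the country to the key. The wider key is an assumption for illustration, not the notebook's method.

```python
import pandas as pd

# First four documents from the output above, reduced to the key columns.
toy = pd.DataFrame({
    "City": ["Unknown City", "Shenzhen", "Unknown City", "Unknown City"],
    "Country/Area": ["Brazil", "China", "Chile", "Peru"],
    "Year of publication or approval": [2022, 2022, 2019, 2022],
})

year_text = toy["Year of publication or approval"].astype(str)

# Same key as the notebook: city + year. Equal keys share one id.
toy["id"] = pd.factorize(toy["City"] + year_text)[0] + 1

# Hypothetical wider key: city + country + year, with a separator so that
# adjacent fields cannot run together.
wide_key = toy["City"] + "|" + toy["Country/Area"] + "|" + year_text
toy["id_with_country"] = pd.factorize(wide_key)[0] + 1

print(toy[["City", "Country/Area", "id", "id_with_country"]])
```

Even the wider key is only unique as long as no two rows share city, country, and year; if a strictly per-row identifier is needed, MongoDB's own _id field already provides one.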

In [10]:
!pip install flask


