## Purpose

This notebook builds a CSV file containing the preprocessed house and condo sale data from 2016-2021 provided by the City of Winnipeg.

The CSV will have 2 columns:

1. ```PropertyType``` - to specify whether it is a house or condo
2. ```PropertyDetails``` - a JSON string that has the house/condo information

This CSV file will then be uploaded to an Azure Postgres database.
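To make the target layout concrete, here is a minimal sketch that builds one row of each property type. The records are made up for illustration, and the `Sale Price` field is an assumption. The real column names come from the input CSV files.

```python
import json
import pandas as pd

# Hypothetical records; the real fields come from the input CSV files
house = {'Roll Number': 1000000001, 'Sale Year': 2017, 'Sale Price': 250000}
condo = {'Roll Number': 1000000002, 'Sale Year': 2019, 'Sale Price': 180000}

# One row per property: its type plus all of its details as a JSON string
example = pd.DataFrame({
    'PropertyType': ['house', 'condo'],
    'PropertyDetails': [json.dumps(house), json.dumps(condo)],
})

print(example)
```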

### Input

This notebook requires 2 CSV files as input that will be read by Pandas:

1. ```House sale data``` concatenated with its corresponding tax assessment information
2. ```Condo sale data``` concatenated with its corresponding tax assessment information
   
You can change the location where Pandas will fetch the CSV file from, but currently it retrieves the file from a separate ```output``` folder.

### Output

This notebook will output a CSV file, with both house & condo sale data, in the same ```output``` folder, which can then be uploaded to the Postgres database.

In [1]:
import pandas as pd

In [2]:
# Read the CSV files from their respective directory

house_data = pd.read_csv('../datasets/output/house_sale_with_tax.csv')
condo_data = pd.read_csv('../datasets/output/condo_sale_with_tax.csv')

# Empty dataframe with 2 columns that will be written out at the end of this notebook

output = pd.DataFrame(columns=['PropertyType', 'PropertyDetails'])


In [3]:
def add_rows(input_df, output, prop_type):
    """
    Purpose:
    Iterates over every row in the given dataframe, converts each row into a JSON string,
    and appends it as a new row of the output dataframe.

    Parameters:
    input_df - dataframe from which to read the information
    output - dataframe to which the JSON strings will be added
    prop_type - string specifying whether the input dataframe holds 'house' or 'condo' data

    Return:
    output - dataframe that holds all the JSON strings in its PropertyDetails column
    """

    # Iterate over all rows of the input dataframe
    for index, row in input_df.iterrows():

        # JSON string that holds the information from a row
        prop_details = row.to_json()

        # Create a new row with 2 columns, giving it the proper type string and the JSON string
        new_row = pd.DataFrame({'PropertyType': [prop_type], 'PropertyDetails': [prop_details]})

        # Concatenate the above row with the output dataframe while ignoring the existing indices of the 2 dataframes
        output = pd.concat([output, new_row], ignore_index=True)

    return output
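Concatenating inside the loop copies the growing output dataframe on every iteration, which can become slow for large inputs. As a possible alternative (a sketch, not the approach this notebook uses), the JSON strings can be collected in a list first and concatenated once. The function name `add_rows_once` and the sample data are hypothetical.

```python
import pandas as pd

def add_rows_once(input_df, output, prop_type):
    # Serialize every row to a JSON string, as the loop-based version does
    details = [row.to_json() for _, row in input_df.iterrows()]

    # Build all new rows in one dataframe instead of one dataframe per row
    new_rows = pd.DataFrame({
        'PropertyType': [prop_type] * len(details),
        'PropertyDetails': details,
    })

    # A single concatenation replaces the per-row concatenations
    return pd.concat([output, new_rows], ignore_index=True)

# Small made-up sample to exercise the function
sample = pd.DataFrame({'Roll Number': [1, 2], 'Sale Year': [2016, 2017]})
empty = pd.DataFrame(columns=['PropertyType', 'PropertyDetails'])
result = add_rows_once(sample, empty, 'house')
print(result)
```

It takes the same three positional arguments as `add_rows`, so the calls in the next cell would not need to change.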

In [4]:
# Call the above function on the 2 input dataframes and append the results to the same output dataframe

output = add_rows(house_data, output, 'house')
output = add_rows(condo_data, output, 'condo')

In [5]:
# Check the structure of the output dataframe

output.head()

|   | PropertyType | PropertyDetails |
|---|--------------|-----------------|
| 0 | house | {"Roll Number":9000096000,"Sale Year":2017,"Sa... |
| 1 | house | {"Roll Number":9000096500,"Sale Year":2016,"Sa... |
| 2 | house | {"Roll Number":9000240000,"Sale Year":2017,"Sa... |
| 3 | house | {"Roll Number":9000109000,"Sale Year":2017,"Sa... |
| 4 | house | {"Roll Number":9000227500,"Sale Year":2016,"Sa... |


In [6]:
# Also check that no rows are missing in the output dataframe

print("Check if all entries of house and condo were added to output: " + str(len(output) == (len(house_data) + len(condo_data))))

Check if all entries of house and condo were added to output: True
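The total row count alone would not reveal rows tagged with the wrong type. A slightly stricter check, sketched here on small made-up dataframes that stand in for the real ones, compares the count per `PropertyType` against each input.

```python
import pandas as pd

# Made-up stand-ins for the real input and output dataframes
house_data = pd.DataFrame({'Roll Number': [1, 2, 3]})
condo_data = pd.DataFrame({'Roll Number': [4, 5]})
output = pd.DataFrame({
    'PropertyType': ['house'] * 3 + ['condo'] * 2,
    'PropertyDetails': ['{}'] * 5,
})

# Count the rows per property type and compare with each input's length
counts = output['PropertyType'].value_counts()
houses_ok = counts.get('house', 0) == len(house_data)
condos_ok = counts.get('condo', 0) == len(condo_data)

print("House rows match:", houses_ok)
print("Condo rows match:", condos_ok)
```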


In [7]:
# Output the dataframe as a CSV file
# We also set index to False so that the CSV doesn't include an extra column with the indices of each row

output.to_csv('../datasets/output/upload_postgres.csv', index=False)
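Before uploading, it can be worth confirming that the JSON strings survive the trip through CSV quoting. The sketch below writes a one-row dataframe to an in-memory buffer instead of the real file (an assumption made to keep the example self-contained), reads it back, and parses the `PropertyDetails` value.

```python
import io
import json
import pandas as pd

# One-row stand-in for the real output dataframe
output = pd.DataFrame({
    'PropertyType': ['house'],
    'PropertyDetails': ['{"Roll Number":9000096000,"Sale Year":2017}'],
})

# Write to an in-memory buffer rather than a file on disk
buffer = io.StringIO()
output.to_csv(buffer, index=False)
buffer.seek(0)

# Read the CSV back and parse the JSON string into a dictionary
restored = pd.read_csv(buffer)
details = json.loads(restored.loc[0, 'PropertyDetails'])

print(details['Sale Year'])
```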