# This is part 1 of this project...

In this part, we will extract the dataset, import the required libraries, and load the data for a first look.

Extract the dataset.

In [42]:
import zipfile

with zipfile.ZipFile("dataset.zip", "r") as zip_ref:
    zip_ref.extractall("dataset")
print("Unzipped dataset.zip to the 'dataset' folder.")

Unzipped dataset.zip to the 'dataset' folder.
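
Before extracting, it can help to check what the archive actually contains. The snippet below is a minimal sketch of that idea: `list_archive` is a hypothetical helper (not part of the original notebook), and the demo builds a throwaway archive so it runs without `dataset.zip`.

```python
import tempfile
import zipfile
from pathlib import Path


def list_archive(zip_path):
    """Return the member names of a zip archive without extracting it."""
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        return zip_ref.namelist()


# Demo on a temporary archive (an assumption for illustration only).
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = Path(tmp) / "demo.zip"
    with zipfile.ZipFile(demo_zip, "w") as zf:
        zf.writestr("example.csv", "a,b\n1,2\n")
    members = list_archive(demo_zip)

print(members)
```

In the notebook itself, the same helper could be called as `list_archive("dataset.zip")` to confirm the CSV is present before calling `extractall`.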


Import the libraries we'll use for data handling.

In [43]:
import numpy as np
import pandas as pd

Load the CSV file and display the first 10 rows to verify successful import.

In [44]:
df = pd.read_csv("dataset/Real_Estate_Sales_2001-2022_GL.csv", low_memory=False)
df.head(10)

,Serial Number,List Year,Date Recorded,Town,Address,Assessed Value,Sale Amount,Sales Ratio,Property Type,Residential Type,Non Use Code,Assessor Remarks,OPM remarks,Location
0,2020177,2020,04/14/2021,Ansonia,323 BEAVER ST,133000.0,248400.0,0.5354,Residential,Single Family,,,,POINT (-73.06822 41.35014)
1,2020225,2020,05/26/2021,Ansonia,152 JACKSON ST,110500.0,239900.0,0.4606,Residential,Three Family,,,,
2,2020348,2020,09/13/2021,Ansonia,230 WAKELEE AVE,150500.0,325000.0,0.463,Commercial,,,,,
3,2020090,2020,12/14/2020,Ansonia,57 PLATT ST,127400.0,202500.0,0.6291,Residential,Two Family,,,,
4,210288,2021,06/20/2022,Avon,12 BYRON DRIVE,179990.0,362500.0,0.4965,Residential,Condo,,,,POINT (-72.879115982 41.773452988)
5,200500,2020,09/07/2021,Avon,245 NEW ROAD,217640.0,400000.0,0.5441,Residential,Single Family,,,,
6,200121,2020,12/15/2020,Avon,63 NORTHGATE,528490.0,775000.0,0.6819,Residential,Single Family,,,,POINT (-72.89675 41.79445)
7,20058,2020,06/01/2021,Barkhamsted,46 RATLUM MTN RD,203530.0,415000.0,0.490434,Residential,Single Family,,"2003 COLONIAL, 2140 SFLA, 2.99 AC",,
8,200046,2020,01/25/2021,Beacon Falls,34 LASKY ROAD,158030.0,243000.0,0.6503,Residential,Single Family,,,,
9,200016,2020,11/13/2020,Beacon Falls,9 AVON COURT,65590.0,100000.0,0.6559,Residential,Condo,,,,
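
The preview shows many empty cells, for example in `Assessor Remarks` and `Location`. One way to quantify this is the share of missing values per column. The sketch below uses four rows and five columns copied from the preview above (the names `sample_csv`, `sample`, and `missing_share` are illustrative); on the full data the same expression would be applied to `df`.

```python
import io

import pandas as pd

# Four rows from the preview above, restricted to five columns.
sample_csv = io.StringIO(
    "Town,Property Type,Residential Type,Assessor Remarks,Location\n"
    "Ansonia,Residential,Single Family,,POINT (-73.06822 41.35014)\n"
    "Ansonia,Commercial,,,\n"
    "Avon,Residential,Condo,,POINT (-72.879115982 41.773452988)\n"
    'Barkhamsted,Residential,Single Family,"2003 COLONIAL, 2140 SFLA, 2.99 AC",\n'
)
sample = pd.read_csv(sample_csv)

# Fraction of missing values in each column (empty CSV fields become NaN).
missing_share = sample.isna().mean()
print(missing_share)
```

These fractions describe only the four sample rows and say nothing about the full dataset.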


# Part 1 – Building and Exploring a Tabular Dataset

In this part, we will build and prepare a tabular dataset for a regression problem using a real estate sales dataset. We will document the source, select relevant columns, split the data into training and test subsets, and save these subsets for later use.

## Problem Type

The problem addressed is a **regression** task: we aim to predict the sale price of a property based on its features.

## Data Source and Description

The data comes from a CSV file of real estate sales (`Real_Estate_Sales_2001-2022_GL.csv`). Each row represents one transaction, and the columns describe the list year, recording date, town, address, assessed value, sale amount, property type, residential type, and more. We will select 8 relevant columns covering at least 3 different data types (integer, real-valued, and categorical).

## Loading and Selecting Relevant Columns

We reload the data, keep the 8 selected columns, and drop rows with a missing `Sale Amount`, since that column is the regression target.

In [None]:
# Load the data
df = pd.read_csv("dataset/Real_Estate_Sales_2001-2022_GL.csv", low_memory=False)

# Select 8 relevant columns
selected_columns = [
    "List Year",        # numeric (int)
    "Date Recorded",    # date stored as text (str)
    "Town",             # categorical (str)
    "Address",          # text (str)
    "Assessed Value",   # numeric (float)
    "Sale Amount",      # numeric (float) - target
    "Property Type",    # categorical (str)
    "Residential Type"  # categorical (str)
]
df = df[selected_columns]
df = df.dropna(subset=["Sale Amount"])  # drop rows with NaN in the target column
df.head(10)

,List Year,Date Recorded,Town,Address,Assessed Value,Sale Amount,Property Type,Residential Type
0,2020,04/14/2021,Ansonia,323 BEAVER ST,133000.0,248400.0,Residential,Single Family
1,2020,05/26/2021,Ansonia,152 JACKSON ST,110500.0,239900.0,Residential,Three Family
2,2020,09/13/2021,Ansonia,230 WAKELEE AVE,150500.0,325000.0,Commercial,
3,2020,12/14/2020,Ansonia,57 PLATT ST,127400.0,202500.0,Residential,Two Family
4,2021,06/20/2022,Avon,12 BYRON DRIVE,179990.0,362500.0,Residential,Condo
5,2020,09/07/2021,Avon,245 NEW ROAD,217640.0,400000.0,Residential,Single Family
6,2020,12/15/2020,Avon,63 NORTHGATE,528490.0,775000.0,Residential,Single Family
7,2020,06/01/2021,Barkhamsted,46 RATLUM MTN RD,203530.0,415000.0,Residential,Single Family
8,2020,01/25/2021,Beacon Falls,34 LASKY ROAD,158030.0,243000.0,Residential,Single Family
9,2020,11/13/2020,Beacon Falls,9 AVON COURT,65590.0,100000.0,Residential,Condo
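
To check the data types of the selected columns, and to turn `Date Recorded` into a real date, something like the following could be used. This is a sketch on three rows copied from the preview above. The `MM/DD/YYYY` format is an assumption inferred from values such as `04/14/2021`, and the names `sample` and `recorded` are illustrative.

```python
import io

import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype

# Three rows from the preview above, restricted to five columns.
sample = pd.read_csv(io.StringIO(
    "List Year,Date Recorded,Town,Assessed Value,Sale Amount\n"
    "2020,04/14/2021,Ansonia,133000.0,248400.0\n"
    "2021,06/20/2022,Avon,179990.0,362500.0\n"
    "2020,11/13/2020,Beacon Falls,65590.0,100000.0\n"
))

# Inspect the inferred dtype of every column.
print(sample.dtypes)

# Parse the date strings; the format is assumed to be month/day/year.
recorded = pd.to_datetime(sample["Date Recorded"], format="%m/%d/%Y")
print(recorded.dt.year.tolist())
```

A parsed date makes it possible to derive features such as the year or month of the sale, which a plain string column does not offer directly.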


## Splitting into Training and Test Subsets

We randomly split the data into 75% for training and 25% for testing, fixing `random_state` so that the split is reproducible.

In [46]:
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(df, test_size=0.25, random_state=42)
print(f"Train set: {len(train_df)} rows")
print(f"Test set: {len(test_df)} rows")

Train set: 823221 rows
Test set: 274408 rows
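
The printed counts can be checked by hand. The arithmetic below assumes the test count is obtained by rounding `test_size * n_rows` up and that the remaining rows go to training, which is consistent with the numbers printed above.

```python
import math

n_rows = 823221 + 274408  # total rows, taken from the printed counts
test_size = 0.25

n_test = math.ceil(test_size * n_rows)  # 25% of the rows, rounded up
n_train = n_rows - n_test               # the remainder goes to training

print(n_rows, n_train, n_test)
```

Because the total is not divisible by 4, the test set ends up slightly above an exact quarter of the rows.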


## Saving the Subsets

We export the subsets to separate CSV files.

In [47]:
train_df.to_csv("dataset/train.csv", index=False)
test_df.to_csv("dataset/test.csv", index=False)
print("Saved train.csv and test.csv in the dataset folder.")

Saved train.csv and test.csv in the dataset folder.
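
A quick round-trip check can confirm that a saved file reads back with the same shape and columns. The sketch below uses a tiny illustrative frame and a temporary directory rather than the real `train_df`; the names `frame`, `out_path`, and `reloaded` are assumptions for the example.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Small stand-in for train_df, using two rows from the preview.
frame = pd.DataFrame({
    "Town": ["Ansonia", "Avon"],
    "Sale Amount": [248400.0, 362500.0],
})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "train.csv"
    frame.to_csv(out_path, index=False)  # index=False avoids an extra index column
    reloaded = pd.read_csv(out_path)

same_shape = reloaded.shape == frame.shape
same_columns = list(reloaded.columns) == list(frame.columns)
print(same_shape, same_columns)
```

Writing with `index=False`, as the notebook does, is what keeps the reloaded file from gaining an unnamed index column.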
