In [1]:
!pip install pandas



In [2]:
# imports
import pandas as pd

# **Assignment 3 - Dataset 1: Regression**

Group: #45

Isaac Lafond - 300191954

**Introduction:**

This notebook ... (what it is and how to use it)

**Dataset Description:**

This dataset, by Den Kuznetz, contains features that can affect the value of a property, for the purpose of predicting property value from these features. Its attributes include square footage, the number of bedrooms, bathrooms and floors, the year the property was built, whether the property has a garden or pool, the size of the garage, the location score, and the distance from the city center. Size values such as square footage and garage size are measured in square meters, and the distance to the center is measured in kilometers. The garden and pool indicators are 1 for yes and 0 for no. Finally, the location score ranges from 0 to 10 and indicates the quality of the neighbourhood (higher = better).

## **Regression Empirical Study**

In [100]:
df = pd.read_csv("https://raw.githubusercontent.com/IsaacLafond/CSI-4142---Fundamentals-of-Data-Science/main/Assignment%203/datasets/real_estate_dataset.csv")

### **a) Cleaning the data**

Here we apply checks for the presence of errors that would require cleaning/imputation.

True = all values pass the check

False = not all values pass the check

In [145]:
# 1 - Data type check

# iterate over each column in the dataframe df and ensure that each value in the column is of the correct data type
for column in df.columns:
  column_type = df[column].dtype
  print(f"{column}: {df[column].apply(lambda x: type(x) == column_type).all()}")

ID: True
Square_Feet: True
Num_Bedrooms: True
Num_Bathrooms: True
Num_Floors: True
Year_Built: True
Has_Garden: True
Has_Pool: True
Garage_Size: True
Location_Score: True
Distance_to_Center: True
Price: True
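
A dtype-level variant of this check can be sketched with the helpers in `pandas.api.types`. The three-row `sample` frame and the `expected_kinds` mapping below are hypothetical stand-ins (only the column names come from the dataset), so this is an illustration under those assumptions rather than the check used above.

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype

# Hypothetical three-row sample that reuses column names from the dataset.
sample = pd.DataFrame({
    "ID": [1, 2, 3],
    "Square_Feet": [143.6, 287.7, 232.9],
    "Has_Pool": [0, 1, 0],
})

# Assumed expectation for each column: integer or floating point.
expected_kinds = {"ID": "int", "Square_Feet": "float", "Has_Pool": "int"}
checkers = {"int": is_integer_dtype, "float": is_float_dtype}

# True when the column's dtype belongs to the expected family.
dtype_report = {
    column: bool(checkers[kind](sample[column]))
    for column, kind in expected_kinds.items()
}
print(dtype_report)
```

Checking the column dtype once avoids calling a Python function per value, at the cost of not inspecting individual cells in `object` columns.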


In [146]:
# 2 - Range check
# Create a NumRange class with a min and a max; the default range is unbounded
# define __contains__ so that an "in" check returns True when min <= x <= max
class NumRange:
  def __init__(self, min=float('-inf'), max=float('inf')):
    self.min = min
    self.max = max
  def __contains__(self, x):
    return self.min <= x <= self.max

# parameters = column name: expected range
params = {
    'ID': NumRange(1, 500),
    'Square_Feet': NumRange(),
    'Num_Bedrooms': NumRange(),
    'Num_Bathrooms': NumRange(),
    'Num_Floors': NumRange(),
    'Year_Built': NumRange(0, 2025),
    'Has_Garden': NumRange(0, 1),
    'Has_Pool': NumRange(0, 1),
    'Garage_Size': NumRange(),
    'Location_Score': NumRange(0, 10),
    'Distance_to_Center': NumRange(),
    'Price': NumRange()
}

# iterate over each column and verify that each value in the column is within the given range in params
for column in df.columns:
  print(f"{column}: {df[column].apply(lambda x: x in params[column]).all()}")

ID: True
Square_Feet: True
Num_Bedrooms: True
Num_Bathrooms: True
Num_Floors: True
Year_Built: True
Has_Garden: True
Has_Pool: True
Garage_Size: True
Location_Score: True
Distance_to_Center: True
Price: True
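
The same inclusive range test can also be expressed with `Series.between`, which is vectorized. The `sample` frame below is hypothetical (one year is deliberately out of range), and `bounds` copies two of the limits from the `params` dictionary above; treat it as a sketch, not a replacement for the cell.

```python
import pandas as pd

# Hypothetical sample; the last year is intentionally above the upper bound.
sample = pd.DataFrame({
    "Year_Built": [1967, 1949, 2030],
    "Location_Score": [8.3, 6.1, 2.9],
})

# Inclusive (low, high) bounds, mirroring two entries of params.
bounds = {"Year_Built": (0, 2025), "Location_Score": (0, 10)}

# between() is inclusive on both ends by default.
range_report = {
    column: bool(sample[column].between(low, high).all())
    for column, (low, high) in bounds.items()
}
print(range_report)
```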


In [103]:
# 3 - Format check
# raw strings are used so that \d is not treated as a string escape sequence
# regex for a non-negative integer
integer_regex = r'^\d+$'
# regex for a non-negative floating point number
float_regex = r'^\d+\.\d+$'
# regex for a binary integer, i.e. only 0 or 1
binary_regex = r'^(0|1)$'
# regex for a year written as YYYY (the 2025 upper bound is enforced by the range check, not by this pattern)
year_regex = r'^\d{4}$'

# parameters = column name: expected pattern
params = {
    'ID': integer_regex,
    'Square_Feet': float_regex,
    'Num_Bedrooms': integer_regex,
    'Num_Bathrooms': integer_regex,
    'Num_Floors': integer_regex,
    'Year_Built': year_regex,
    'Has_Garden': binary_regex,
    'Has_Pool': binary_regex,
    'Garage_Size': integer_regex,
    'Location_Score': float_regex,
    'Distance_to_Center': float_regex,
    'Price': float_regex
}

# iterate over each column in the dataframe and ensure each value matches the corresponding regex in params
for column in df.columns:
  # check if each value in the column matches the regex in params
  print(f"{column}: {df[column].astype(str).str.match(params[column]).all()}")

ID: True
Square_Feet: True
Num_Bedrooms: True
Num_Bathrooms: True
Num_Floors: True
Year_Built: True
Has_Garden: True
Has_Pool: True
Garage_Size: True
Location_Score: True
Distance_to_Center: True
Price: True


In [105]:
# 4 - Consistency check

"""
In this dataset none of the columns are logically linked.
Therefore a consistency check is not needed.
"""
pass

In [108]:
# 5 - Uniqueness check
# check for uniqueness in columns that act as identifiers
# This dataset does not have emails or usernames; ID is the only identifier,
# therefore ID should have 500 unique values

# define a placeholder class that compares equal to any number (used for columns with no uniqueness constraint)
class AnyNum:
  def __eq__(self, x):
    return True
any_num = AnyNum()

# parameter = column name: number of expected unique values
params = {
    'ID': 500,
    'Square_Feet': any_num,
    'Num_Bedrooms': any_num,
    'Num_Bathrooms': any_num,
    'Num_Floors': any_num,
    'Year_Built': any_num,
    'Has_Garden': any_num,
    'Has_Pool': any_num,
    'Garage_Size': any_num,
    'Location_Score': any_num,
    'Distance_to_Center': any_num,
    'Price': any_num
}

# iterate over each column and print whether its number of unique values matches the expected count
for column in df.columns:
  print(f"{column}: {df[column].nunique() == params[column]}")

ID: True
Square_Feet: True
Num_Bedrooms: True
Num_Bathrooms: True
Num_Floors: True
Year_Built: True
Has_Garden: True
Has_Pool: True
Garage_Size: True
Location_Score: True
Distance_to_Center: True
Price: True
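
For an identifier column, `Series.is_unique` together with `Series.duplicated` gives the same answer and also shows which values repeat. The four-row `sample` frame below is hypothetical and contains a deliberately repeated ID.

```python
import pandas as pd

# Hypothetical sample in which ID 2 appears twice.
sample = pd.DataFrame({"ID": [1, 2, 2, 4], "Num_Floors": [1, 1, 2, 3]})

# True only when every ID value occurs exactly once.
id_is_unique = sample["ID"].is_unique

# keep=False marks every occurrence of a repeated value, not just the later ones.
repeated_ids = sample.loc[sample["ID"].duplicated(keep=False), "ID"].tolist()
print(id_is_unique, repeated_ids)
```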


In [149]:
# 6 - Presence check

# iterate over each column and verify that none of its values are missing
for column in df.columns:
  print(f"{column}: {df[column].notnull().all()}")

ID: True
Square_Feet: True
Num_Bedrooms: True
Num_Bathrooms: True
Num_Floors: True
Year_Built: True
Has_Garden: True
Has_Pool: True
Garage_Size: True
Location_Score: True
Distance_to_Center: True
Price: True


In [173]:
# 7 - Length check
# define a length range class to verify if length is in expected range
class LengthRange:
  def __init__(self, min=0, max=float('inf')):
    self.min = min
    self.max = max
  def __contains__(self, x):
    return self.min <= len(x) <= self.max
any_length = LengthRange()

# parameters = column name: expected value length
params = {
    'ID': LengthRange(1, 3),
    'Square_Feet': any_length,
    'Num_Bedrooms': any_length,
    'Num_Bathrooms': any_length,
    'Num_Floors': any_length,
    'Year_Built': LengthRange(4, 4),
    'Has_Garden': LengthRange(1, 1),
    'Has_Pool': LengthRange(1, 1),
    'Garage_Size': any_length,
    'Location_Score': any_length,
    'Distance_to_Center': any_length,
    'Price': any_length
}

# iterate over each column and verify that each value in the column is within the given length range in params
for column in df.columns:
  print(f"{column}: {df[column].astype(str).apply(lambda x: x in params[column]).all()}")

ID: True
Square_Feet: True
Num_Bedrooms: True
Num_Bathrooms: True
Num_Floors: True
Year_Built: True
Has_Garden: True
Has_Pool: True
Garage_Size: True
Location_Score: True
Distance_to_Center: True
Price: True


In [175]:
# 8 - Look-up check
# this check verifies that all the values are in the allowed set of values for that column
# for this test the only columns with a defined list of acceptable values are Has_Garden and Has_Pool (and IDs from 1-500)
# define a placeholder list that reports every value as contained
class AnyList:
  def __contains__(self, x):
    return True
any_list = AnyList()

# parameters = column name: list of accepted values
params = {
    'ID': range(1, 501),
    'Square_Feet': any_list,
    'Num_Bedrooms': any_list,
    'Num_Bathrooms': any_list,
    'Num_Floors': any_list,
    'Year_Built': any_list,
    'Has_Garden': [0, 1],
    'Has_Pool': [0, 1],
    'Garage_Size': any_list,
    'Location_Score': any_list,
    'Distance_to_Center': any_list,
    'Price': any_list
}

# iterate over each column and verify each of its values are in the list of acceptable values
for column in df.columns:
  print(f"{column}: {df[column].apply(lambda x: x in params[column]).all()}")

ID: True
Square_Feet: True
Num_Bedrooms: True
Num_Bathrooms: True
Num_Floors: True
Year_Built: True
Has_Garden: True
Has_Pool: True
Garage_Size: True
Location_Score: True
Distance_to_Center: True
Price: True
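
For columns that do have a fixed set of accepted values, the membership test can be written with `Series.isin`. The `sample` frame below is hypothetical and includes one invalid `Has_Pool` value to show a failing case.

```python
import pandas as pd

# Hypothetical sample; Has_Pool contains a 2, which is not an accepted value.
sample = pd.DataFrame({"Has_Garden": [0, 1, 1], "Has_Pool": [0, 2, 1]})

# Accepted values per column, as in the look-up parameters for the binary flags.
allowed = {"Has_Garden": [0, 1], "Has_Pool": [0, 1]}

# True when every value in the column belongs to its accepted list.
lookup_report = {
    column: bool(sample[column].isin(values).all())
    for column, values in allowed.items()
}
print(lookup_report)
```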


In [None]:
# 9 - Exact duplicate check
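
One way an exact duplicate check could be written is with `DataFrame.duplicated`, which flags rows that repeat an earlier row in every column. The `sample` frame and its values below are hypothetical; this is a sketch, not the notebook's final implementation.

```python
import pandas as pd

# Hypothetical sample in which the third row repeats the second one exactly.
sample = pd.DataFrame({
    "ID": [1, 2, 2],
    "Square_Feet": [143.6, 287.7, 287.7],
    "Price": [602134.8, 591425.1, 591425.1],
})

# keep="first" leaves the first occurrence unflagged and marks later copies.
exact_duplicates = sample[sample.duplicated(keep="first")]

# Following the convention of the other checks: True = no exact duplicates.
no_exact_duplicates = exact_duplicates.empty
print(no_exact_duplicates)
```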

In [None]:
# 10 - Near duplicate check
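
A near duplicate check depends on how "near" is defined. The sketch below assumes one possible definition: two rows are near duplicates when they agree on every column except `ID`. The `sample` frame is hypothetical, and other definitions (for example, tolerances on numeric columns) would need different code.

```python
import pandas as pd

# Hypothetical sample: rows with ID 1 and ID 3 differ only in their ID.
sample = pd.DataFrame({
    "ID": [1, 2, 3],
    "Square_Feet": [143.6, 287.7, 143.6],
    "Num_Bedrooms": [1, 1, 1],
    "Price": [602134.8, 591425.1, 602134.8],
})

# Compare rows on every column except the identifier (assumed definition).
feature_columns = [column for column in sample.columns if column != "ID"]

# keep=False flags all members of each group of matching rows.
near_duplicate_mask = sample.duplicated(subset=feature_columns, keep=False)
near_duplicates = sample[near_duplicate_mask]
print(near_duplicates["ID"].tolist())
```

Because the IDs differ, these rows are not caught by an exact duplicate check over all columns.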

### **b) Categorical feature encoding**

### **c) EDA and Outlier detection**

### **d) Predictive analysis: Linear Regression**

### **e) Feature Engineering**

### **f) Empirical study**

### **g) Result analysis**

## **Conclusion**

In conclusion, ...

## **References**

[1] ...