## Exploratory Data Analysis
### Load the training data

In [1]:
import pandas as pd

In [2]:
# Read the training data
X_train = pd.read_csv('cs5228-2022-semester-1-final-project/train.csv')

### Initial feature removal
Drop the columns that are unlikely to help predict property prices.

In [3]:
# Drop the columns that are not useful in predicting prices (before EDA, based on the data description alone)
columns_to_drop = ['listing_id', 'title', 'address', 'property_name', 'available_unit_types', 'total_num_units', 'property_details_url']
X_train = X_train.drop(columns=columns_to_drop)
X_train.head()

Unnamed: 0,property_type,tenure,built_year,num_beds,num_baths,size_sqft,floor_level,furnishing,lat,lng,elevation,subzone,planning_area,price
0,hdb 4 rooms,,1988.0,3.0,2.0,1115,,unspecified,1.414399,103.837196,0,yishun south,yishun,514500.0
1,hdb,99-year leasehold,1992.0,4.0,2.0,1575,,unspecified,1.372597,103.875625,0,serangoon north,serangoon,995400.0
2,condo,freehold,2022.0,4.0,6.0,3070,low,partial,1.298773,103.895798,0,mountbatten,marine parade,8485000.0
3,Condo,freehold,2023.0,3.0,2.0,958,,partial,1.312364,103.803271,0,farrer court,bukit timah,2626000.0
4,condo,99-year leasehold,2026.0,2.0,1.0,732,,unspecified,1.273959,103.843635,0,anson,downtown core,1764000.0


In [4]:
print(f'There are {X_train.shape[0]} records in the training set.')

There are 20254 records in the training set.


### Univariate analysis
#### Numerical variables
The summary statistics of the numerical variables can be printed using the `describe()` method. Several problems are observed:
- Several columns have missing values (see the Missing values section below).
- `built_year` should not exceed 2022, yet at least a quarter of the records have a value of 2023 or later (the 75th percentile is 2023), and the maximum is 2028.
- `size_sqft` should not be zero, but the minimum is 0. The maximum (1,496,000) is unreasonably large, indicating outliers.
- For Singapore, (`lat`, `lng`) should be around (1.35, 103.8), but some records lie far outside this range (latitude up to 69.49, longitude down to -77.07).
- The column `elevation` is zero for all records, so it carries no information.
- The target variable `price` contains dirty records, with values of zero and impossibly high prices (the maximum is about 3.9e10).

In [5]:
# Print a summary table for numeric variables
X_train.describe()

Unnamed: 0,built_year,num_beds,num_baths,size_sqft,lat,lng,elevation,price
count,19332.0,20174.0,19820.0,20254.0,20254.0,20254.0,20254.0,20254.0
mean,2010.833695,3.122931,2.643542,1854.364,1.434282,103.855356,0.0,5228263.0
std,15.822803,1.281658,1.473835,13543.43,1.558472,3.593441,0.0,277974800.0
min,1963.0,1.0,1.0,0.0,1.239621,-77.065364,0.0,0.0
25%,2000.0,2.0,2.0,807.0,1.307329,103.806576,0.0,819000.0
50%,2017.0,3.0,2.0,1119.0,1.329266,103.841552,0.0,1680000.0
75%,2023.0,4.0,3.0,1528.0,1.372461,103.881514,0.0,3242400.0
max,2028.0,10.0,10.0,1496000.0,69.486768,121.023232,0.0,39242430000.0
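
The problems listed above can also be quantified directly. The following is a minimal sketch, not part of the original notebook: the helper name `count_suspect_rows` and the bounding box used for Singapore are assumptions chosen for illustration, and the small `demo` frame stands in for `X_train`.

```python
import pandas as pd

def count_suspect_rows(df: pd.DataFrame) -> pd.Series:
    """Count the rows affected by each data-quality problem (hypothetical helper)."""
    # Assumed rough bounding box for Singapore; adjust as needed
    in_singapore = df["lat"].between(1.1, 1.5) & df["lng"].between(103.5, 104.2)
    return pd.Series({
        "built_year > 2022": (df["built_year"] > 2022).sum(),  # NaN compares as False
        "size_sqft == 0": (df["size_sqft"] == 0).sum(),
        "outside Singapore": (~in_singapore).sum(),
        "elevation != 0": (df["elevation"] != 0).sum(),
        "price == 0": (df["price"] == 0).sum(),
    })

# Made-up rows for illustration only; in the notebook, pass X_train instead
demo = pd.DataFrame({
    "built_year": [1988.0, 2026.0, None],
    "size_sqft": [1115, 0, 958],
    "lat": [1.414399, 69.486768, 1.312364],
    "lng": [103.837196, 121.023232, 103.803271],
    "elevation": [0, 0, 0],
    "price": [514500.0, 0.0, 2626000.0],
})
print(count_suspect_rows(demo))
```

Counting the affected rows first helps decide whether a problem is better handled by dropping records or by correcting them.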


#### Categorical variables
The code below prints the summary table for categorical variables. Several problems are observed:
- problem 1...
- problem 2...

In [6]:
# Print a summary table for categorical variables
X_train.describe(include='object')

Unnamed: 0,property_type,tenure,floor_level,furnishing,subzone,planning_area
count,20254,18531,3508,20254,20141,20141
unique,39,11,17,5,244,43
top,condo,99-year leasehold,high,unspecified,moulmein,bukit timah
freq,7905,11407,1674,14716,656,1320
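
The sample rows printed earlier contain both `condo` and `Condo`, as well as `hdb` and `hdb 4 rooms`, which suggests that `property_type` mixes letter cases and levels of detail. A minimal sketch for normalising case and whitespace before counting categories is given below; the helper name `normalise_text_columns` is hypothetical, and the `demo` frame is made up to stand in for `X_train`.

```python
import pandas as pd

def normalise_text_columns(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Return a copy with the given text columns stripped and lower-cased."""
    out = df.copy()
    for col in columns:
        # The .str accessor leaves missing values as missing
        out[col] = out[col].str.strip().str.lower()
    return out

# Made-up values for illustration only; in the notebook, pass X_train instead
demo = pd.DataFrame({"property_type": ["condo", "Condo", "hdb", "hdb 4 rooms", None]})
clean = normalise_text_columns(demo, ["property_type"])

# value_counts shows how many distinct categories remain after normalisation
print(clean["property_type"].value_counts(dropna=False))
```

Normalising case only merges spelling variants. Deciding whether `hdb 4 rooms` should be folded into `hdb` is a separate modelling choice.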


#### Missing values
The output below shows the proportion of missing values in each column:
- For `floor_level`, around 83% of the training records are missing.
- For `tenure`, around 8.5% of the training records are missing.

In [7]:
print('Portion of empty values for each column:')
print(
    (X_train.shape[0] - X_train.describe(include='all').loc['count']) / X_train.shape[0])


Portion of empty values for each column:
property_type         0.0
tenure            0.08507
built_year       0.045522
num_beds          0.00395
num_baths        0.021428
size_sqft             0.0
floor_level        0.8268
furnishing            0.0
lat                   0.0
lng                   0.0
elevation             0.0
subzone          0.005579
planning_area    0.005579
price                 0.0
Name: count, dtype: object
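
The same proportions can be obtained more directly with `isna().mean()`, which also returns a float-typed result instead of `object`. This is a minimal sketch of that alternative, with a hypothetical helper name `missing_fraction` and a made-up `demo` frame in place of `X_train`.

```python
import pandas as pd

def missing_fraction(df: pd.DataFrame) -> pd.Series:
    """Fraction of missing values per column, largest first."""
    # isna() gives a boolean frame; the mean of booleans is the fraction of True values
    return df.isna().mean().sort_values(ascending=False)

# Made-up values for illustration only; in the notebook, pass X_train instead
demo = pd.DataFrame({
    "floor_level": [None, None, None, "low"],
    "tenure": ["freehold", None, "freehold", "freehold"],
    "price": [1.0, 2.0, 3.0, 4.0],
})
print(missing_fraction(demo))
```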


### Data cleaning
The data cleaning steps can be added here.
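
As a starting point, the sketch below shows one possible way to act on the problems found in the univariate analysis. It is an illustration under stated assumptions, not the notebook's actual cleaning procedure: the function name `basic_clean`, the Singapore bounding box, and the choice to drop (rather than correct) invalid rows are all assumptions. Upper limits for `price` and `size_sqft` are deliberately left out because the source does not specify them.

```python
import pandas as pd

def basic_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Drop the constant column and rows with clearly invalid values (sketch)."""
    # elevation is zero for every record, so it is removed if present
    out = df.drop(columns=["elevation"], errors="ignore")
    # Keep rows with a positive price and a positive floor area
    valid = (out["price"] > 0) & (out["size_sqft"] > 0)
    # Assumed rough bounding box for Singapore; adjust as needed
    valid &= out["lat"].between(1.1, 1.5) & out["lng"].between(103.5, 104.2)
    return out[valid].reset_index(drop=True)

# Made-up rows for illustration only; in the notebook, pass X_train instead
demo = pd.DataFrame({
    "size_sqft": [1115, 0, 958],
    "lat": [1.414399, 1.372597, 69.486768],
    "lng": [103.837196, 103.875625, 121.023232],
    "elevation": [0, 0, 0],
    "price": [514500.0, 995400.0, 2626000.0],
})
cleaned = basic_clean(demo)
print(cleaned)
```

Dropping rows is the simplest option. If too many records are lost, correcting values (for example, looking up coordinates from `subzone`) may be preferable.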