# <b><font color="#002FBC">Level 1 (Task2)(Basic)</font></b>


<b><font size="6" color="#04356A">|1_1|Data_Collection_&_Web_Scraping</font></b>
* **Author**: Karl Christiaan Schmutz
* **Date**: 12-June-2025


**Objectives:** 

 - Handle missing data (e.g., imputation, removal).
 - Detect and remove outliers.
 - Convert categorical variables into numerical format
    using one-hot encoding or label encoding.
 - Normalize or standardize numerical data.
 - <b><font size="3" color="#220135">Tools: Python, pandas, scikit-learn.</font></b>
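
The objectives mention both one-hot encoding and label encoding. As a minimal sketch of the difference, using a toy column that is not taken from the dataset, pandas alone can produce both forms:

```python
import pandas as pd

# Toy data for illustration only (not rows from the house dataset).
toy = pd.DataFrame({"area_type": ["Plot", "Built-up", "Plot", "Carpet"]})

# One-hot encoding: one 0/1 indicator column per category.
one_hot = pd.get_dummies(toy["area_type"], prefix="area_type", dtype=int)

# Label encoding: a single integer code per category
# (codes follow the sorted order of the category names).
label_codes = toy["area_type"].astype("category").cat.codes

print(one_hot)
print(label_codes.tolist())
```

One-hot encoding avoids implying an order between categories but adds one column per distinct value; label encoding stays compact but introduces an arbitrary numeric order.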

#### Installing the libraries needed:

In [1]:
pip install category_encoders

Note: you may need to restart the kernel to use updated packages.


In [2]:
pip install pandas

Note: you may need to restart the kernel to use updated packages.


In [3]:
pip install scikit-learn

Note: you may need to restart the kernel to use updated packages.


#### Choose a dataset:

##### About the dataset:

Welcome to the House Price Prediction Challenge, in which you will test your regression skills by designing an algorithm to accurately predict house prices in India. The Bangalore house dataset was used. Buyers are concerned not just with the size (square feet) of a house; various other factors also play a key role in deciding the price of a house or property.

As extracted from Kaggle: https://www.kaggle.com/datasets/saipavansaketh/pune-house-data/data

#### Clean and preprocess a raw dataset:

In [6]:
#SimpleImputer is used below; a KNN imputer can work nicely when the missing data follows more sophisticated relationships

import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import PowerTransformer
from sklearn.preprocessing import StandardScaler
from category_encoders import OneHotEncoder
from sklearn.ensemble import IsolationForest


#Examine what the dataset looks like.
#===============================================#
df = pd.read_csv('./data/Bangalore _house_data.csv')
#Check missing values, data types and shape of the data
display(df.info())
#===============================================#
#Handle missing data (e.g., imputation, removal).
#===============================================#
# Define columns: numeric and categorical
numeric_data = df.select_dtypes(include=['int64', 'float64', 'int32', 'float32']).columns.tolist()
numeric_features = list(numeric_data)
categorical_features = df.select_dtypes(include=['object', 'category']).columns.tolist()
#-------#
print(f'Numeric Features    : {numeric_features}')
print(f'Categorical Features: {categorical_features}')
#-------#

# Imputers for each type
numeric_imputer = SimpleImputer(strategy='mean')
categorical_imputer = SimpleImputer(strategy='most_frequent')

# Combine using ColumnTransformer
try:
    # Create preprocessing pipeline
    preprocessor = ColumnTransformer(
        transformers=[
            ('numeric_transform', numeric_imputer, numeric_features),
            ('categorical_transform', categorical_imputer, categorical_features)
        ],
        remainder='passthrough'
    )
    
    # Apply transformations
    imputed_array = preprocessor.fit_transform(df)
    
    # Convert back to DataFrame with proper column names
    feature_names = (numeric_features + categorical_features + 
                    [col for col in df.columns if col not in numeric_features + categorical_features])
    imputed_data = pd.DataFrame(imputed_array, columns=feature_names, index=df.index)
    
except Exception as e:
    print(f"Error in preprocessing: {e}")
    raise  # the steps below need imputed_data, so stop here
# Restore the original column order (the ColumnTransformer output is all object dtype)
df_imputed = pd.DataFrame(imputed_data, columns=df.columns)
print(df_imputed.info())
display(df_imputed.head(10))
#===============================================#
#Normalize or standardize numerical data
#===============================================#
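# Check the skewness of each numeric column
for col in numeric_data:
    print(f'{col} = {df[col].skew()}')
# Yeo-Johnson power transform to reduce skew (it also standardizes by default)
pt = PowerTransformer(method='yeo-johnson')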

df_transformed = df_imputed.copy()
df_transformed[numeric_data] = pt.fit_transform(df_imputed[numeric_data])
display(df_transformed.describe().T)
df_transformed[categorical_features] = df_imputed[categorical_features]
# Standardize the data (zero mean, unit variance)
scaler = StandardScaler()
standard_df = df_transformed.copy()
standard = scaler.fit_transform(df_transformed[numeric_data])
standard_df[numeric_data] = pd.DataFrame(standard, columns=numeric_data, index=df_transformed.index)
standard_df[categorical_features] = df_transformed[categorical_features]
display(standard_df.head(10))
#===============================================#
#Convert categorical variables into numerical format using one-hot encoding or label encoding
#===============================================#
# instantiate the encoder and specify columns to encode
ohe = OneHotEncoder(
    use_cat_names=True, 
    cols=categorical_features
)

# Transform data
encoded_df = standard_df.copy()
encoded = ohe.fit_transform(standard_df[categorical_features])
encoded_df = pd.concat([encoded, encoded_df[numeric_data]], axis=1)

display(encoded_df.head(10))
#===============================================#
#Detect and remove outliers
#===============================================#
iso = IsolationForest(contamination=0.01, random_state=42)
outliers = iso.fit_predict(encoded_df[numeric_data])
df_cleaned = encoded_df[outliers == 1]
display(df_cleaned.info())
# Drop any rows that still contain null values
df_cleaned = df_cleaned.dropna()
#===============================================#



#----------------------------------------------#

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13320 entries, 0 to 13319
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   area_type     13320 non-null  object 
 1   availability  13320 non-null  object 
 2   location      13319 non-null  object 
 3   size          13304 non-null  object 
 4   society       7818 non-null   object 
 5   total_sqft    13320 non-null  object 
 6   bath          13247 non-null  float64
 7   balcony       12711 non-null  float64
 8   price         13320 non-null  float64
dtypes: float64(3), object(6)
memory usage: 936.7+ KB


None

Numeric Features    : ['bath', 'balcony', 'price']
Categorical Features: ['area_type', 'availability', 'location', 'size', 'society', 'total_sqft']
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13320 entries, 0 to 13319
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   area_type     13320 non-null  object
 1   availability  13320 non-null  object
 2   location      13320 non-null  object
 3   size          13320 non-null  object
 4   society       13320 non-null  object
 5   total_sqft    13320 non-null  object
 6   bath          13320 non-null  object
 7   balcony       13320 non-null  object
 8   price         13320 non-null  object
dtypes: object(9)
memory usage: 936.7+ KB
None
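
The summary above lists every column, including `bath`, `balcony` and `price`, as `object`. This is what happens when numeric and text columns pass through one array together. A minimal sketch of casting the numeric columns back, using a toy array as a stand-in for the imputed output (the values and the two column names are only illustrative):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the mixed-type array produced by the imputation step.
mixed = np.array([[2.0, "Plot Area"], [3.0, "Built-up Area"]], dtype=object)
restored = pd.DataFrame(mixed, columns=["bath", "area_type"])
print(restored.dtypes)  # dtypes before the cast

# Cast only the numeric columns back to float; text columns are left alone.
restored = restored.astype({"bath": "float64"})
print(restored.dtypes)  # dtypes after the cast
```

In the notebook, the same idea would apply to the columns held in `numeric_data`.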


| | area_type | availability | location | size | society | total_sqft | bath | balcony | price |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Super built-up Area | 19-Dec | Electronic City Phase II | 2 BHK | Coomee | 1056 | 2.0 | 1.0 | 39.07 |
| 1 | Plot Area | Ready To Move | Chikka Tirupathi | 4 Bedroom | Theanmp | 2600 | 5.0 | 3.0 | 120.0 |
| 2 | Built-up Area | Ready To Move | Uttarahalli | 3 BHK | GrrvaGr | 1440 | 2.0 | 3.0 | 62.0 |
| 3 | Super built-up Area | Ready To Move | Lingadheeranahalli | 3 BHK | Soiewre | 1521 | 3.0 | 1.0 | 95.0 |
| 4 | Super built-up Area | Ready To Move | Kothanur | 2 BHK | GrrvaGr | 1200 | 2.0 | 1.0 | 51.0 |
| 5 | Super built-up Area | Ready To Move | Whitefield | 2 BHK | DuenaTa | 1170 | 2.0 | 1.0 | 38.0 |
| 6 | Super built-up Area | 18-May | Old Airport Road | 4 BHK | Jaades | 2732 | 4.0 | 1.584376 | 204.0 |
| 7 | Super built-up Area | Ready To Move | Rajaji Nagar | 4 BHK | Brway G | 3300 | 4.0 | 1.584376 | 600.0 |
| 8 | Super built-up Area | Ready To Move | Marathahalli | 3 BHK | GrrvaGr | 1310 | 3.0 | 1.0 | 63.25 |
| 9 | Plot Area | Ready To Move | Gandhi Bazar | 6 Bedroom | GrrvaGr | 1020 | 6.0 | 1.584376 | 370.0 |


bath = 4.227696763299001
balcony = 0.005856767469113565
price = 8.064468821273252


| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| bath | 13320.0 | -9.863315e-16 | 1.000038 | -2.429114 | -0.497969 | -0.497969 | 0.548367 | 4.057441 |
| balcony | 13320.0 | -3.734083e-18 | 1.000038 | -2.020617 | -0.724799 | 0.526973 | 0.526973 | 1.750972 |
| price | 13320.0 | -4.726283e-16 | 1.000038 | -4.762217 | -0.644736 | -0.058272 | 0.654835 | 3.247668 |
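
The summary above already shows a mean near 0 and a standard deviation of 1.000038 for every numeric column, before `StandardScaler` is applied. That is consistent with `PowerTransformer` standardizing its output by default (`standardize=True`); the small excess over 1 matches pandas reporting the sample standard deviation for 13320 rows. A minimal sketch on synthetic skewed data (the data here is generated, not taken from the dataset):

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Synthetic right-skewed data for illustration only.
rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=(500, 1))

# standardize=True is the default; it is spelled out here for clarity.
pt_demo = PowerTransformer(method="yeo-johnson", standardize=True)
transformed = pt_demo.fit_transform(skewed)

# Mean and (population) standard deviation after the transform.
print(transformed.mean(), transformed.std())
```

Under this reading, the later `StandardScaler` step leaves the values essentially unchanged, which agrees with the identical numbers in the outputs before and after scaling.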


| | area_type | availability | location | size | society | total_sqft | bath | balcony | price |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Super built-up Area | 19-Dec | Electronic City Phase II | 2 BHK | Coomee | 1056 | -0.497969 | -0.724799 | -1.081698 |
| 1 | Plot Area | Ready To Move | Chikka Tirupathi | 4 Bedroom | Theanmp | 2600 | 1.679122 | 1.750972 | 0.654835 |
| 2 | Built-up Area | Ready To Move | Uttarahalli | 3 BHK | GrrvaGr | 1440 | -0.497969 | 1.750972 | -0.290501 |
| 3 | Super built-up Area | Ready To Move | Lingadheeranahalli | 3 BHK | Soiewre | 1521 | 0.548367 | -0.724799 | 0.343509 |
| 4 | Super built-up Area | Ready To Move | Kothanur | 2 BHK | GrrvaGr | 1200 | -0.497969 | -0.724799 | -0.611104 |
| 5 | Super built-up Area | Ready To Move | Whitefield | 2 BHK | DuenaTa | 1170 | -0.497969 | -0.724799 | -1.133023 |
| 6 | Super built-up Area | 18-May | Old Airport Road | 4 BHK | Jaades | 2732 | 1.214153 | 0.010699 | 1.278274 |
| 7 | Super built-up Area | Ready To Move | Rajaji Nagar | 4 BHK | Brway G | 3300 | 1.214153 | 0.010699 | 2.248916 |
| 8 | Super built-up Area | Ready To Move | Marathahalli | 3 BHK | GrrvaGr | 1310 | 0.548367 | -0.724799 | -0.25885 |
| 9 | Plot Area | Ready To Move | Gandhi Bazar | 6 Bedroom | GrrvaGr | 1020 | 2.024274 | 0.010699 | 1.857427 |


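| | area_type_Super built-up Area | area_type_Plot Area | area_type_Built-up Area | area_type_Carpet Area | availability_19-Dec | availability_Ready To Move | availability_18-May | availability_18-Feb | availability_18-Nov | availability_20-Dec | ... | total_sqft_250 | total_sqft_2395 | total_sqft_1020 - 1130 | total_sqft_2758 | total_sqft_1133 - 1384 | total_sqft_774 | total_sqft_4689 | bath | balcony | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -0.497969 | -0.724799 | -1.081698 |
| 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1.679122 | 1.750972 | 0.654835 |
| 2 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -0.497969 | 1.750972 | -0.290501 |
| 3 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.548367 | -0.724799 | 0.343509 |
| 4 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -0.497969 | -0.724799 | -0.611104 |
| 5 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -0.497969 | -0.724799 | -1.133023 |
| 6 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1.214153 | 0.010699 | 1.278274 |
| 7 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1.214153 | 0.010699 | 2.248916 |
| 8 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.548367 | -0.724799 | -0.25885 |
| 9 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2.024274 | 0.010699 | 1.857427 |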
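
The header above contains columns such as `total_sqft_250` and `total_sqft_1133 - 1384`: because `total_sqft` is stored as text, it was one-hot encoded value by value instead of being treated as a number. A minimal sketch of parsing it first is shown here. The helper name `sqft_to_float` is hypothetical, and taking the midpoint of a range is an assumption, not something the notebook prescribes; any other format is left as missing.

```python
import math
import pandas as pd

def sqft_to_float(value):
    """Hypothetical helper: parse '1056' or a range like '1133 - 1384'."""
    parts = [p.strip() for p in str(value).split("-")]
    try:
        numbers = [float(p) for p in parts]
    except ValueError:
        return math.nan  # unrecognized format: leave as missing
    if len(numbers) == 1:
        return numbers[0]
    if len(numbers) == 2:
        return (numbers[0] + numbers[1]) / 2  # assumption: use the midpoint
    return math.nan

# Toy values in the two formats visible in the output above.
toy_sqft = pd.Series(["1056", "1133 - 1384", "2600"])
parsed = toy_sqft.map(sqft_to_float)
print(parsed.tolist())
```

With `total_sqft` numeric, it would be picked up by the numeric branch of the preprocessing and no longer contribute thousands of indicator columns.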


<class 'pandas.core.frame.DataFrame'>
Index: 13186 entries, 0 to 13319
Columns: 6229 entries, area_type_Super built-up  Area to price
dtypes: float64(3), int64(6226)
memory usage: 626.7 MB


None
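
The row count drops from 13320 to 13186, that is 134 rows, which is in line with `contamination=0.01`. `IsolationForest.fit_predict` labels each row `1` (inlier) or `-1` (outlier), and the notebook keeps only the rows labelled `1`. A minimal sketch of that behaviour on synthetic data (the data and the planted extreme points are illustrative only):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data: 1000 points, the first 10 shifted far from the rest.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 2))
X[:10] += 8.0

# contamination sets the share of rows expected to be flagged as outliers.
iso_demo = IsolationForest(contamination=0.01, random_state=42)
labels = iso_demo.fit_predict(X)  # 1 = inlier, -1 = outlier

kept = X[labels == 1]
print(len(X) - len(kept), "rows flagged as outliers")
```

Because `contamination` fixes the flagged share in advance, roughly 1% of rows are removed whether or not they are genuinely anomalous, so the value is worth checking against the data.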