'''
Use-Case: House Price Prediction
Dataset: melb_data.csv

Perform the following tasks:
Load the data in dataframe (Pandas)
Handle inappropriate data
Handle the missing data
Handle the categorical data
'''

In [3]:
import pandas as pd

data = pd.read_csv("melb_data.csv")
print("Initial Data:\n", data.head())

if 'Address' in data.columns:
    data.drop(['Address'], axis=1, inplace=True)
if 'SellerG' in data.columns:
    data.drop(['SellerG'], axis=1, inplace=True)

data = data.dropna(axis=1, thresh=len(data)*0.5)

for col in data.columns:
    if data[col].dtype == 'object':
        data[col].fillna(data[col].mode()[0], inplace=True)
    else:
        data[col].fillna(data[col].median(), inplace=True)

data = pd.get_dummies(data, drop_first=True)

print("\nProcessed Data:\n", data.head())
print("\nData Types:\n", data.dtypes)


Initial Data:
    Unnamed: 0      Suburb           Address  Rooms Type      Price Method  \
0           1  Abbotsford      85 Turner St      2    h  1480000.0      S   
1           2  Abbotsford   25 Bloomburg St      2    h  1035000.0      S   
2           4  Abbotsford      5 Charles St      3    h  1465000.0     SP   
3           5  Abbotsford  40 Federation La      3    h   850000.0     PI   
4           6  Abbotsford       55a Park St      4    h  1600000.0     VB   

  SellerG       Date  Distance  ...  Bathroom  Car  Landsize  BuildingArea  \
0  Biggin  3/12/2016       2.5  ...       1.0  1.0     202.0           NaN   
1  Biggin  4/02/2016       2.5  ...       1.0  0.0     156.0          79.0   
2  Biggin  4/03/2017       2.5  ...       2.0  0.0     134.0         150.0   
3  Biggin  4/03/2017       2.5  ...       2.0  1.0      94.0           NaN   
4  Nelson  4/06/2016       2.5  ...       1.0  2.0     120.0         142.0   

   YearBuilt  CouncilArea  Lattitude Longtitude      

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  data[col].fillna(data[col].median(), inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  data[col].fillna(data[col].mode()[0], inplace=True)
