# Notebook for the AAA course
## **Authors**: BRETECHE Youenn & YAKOUBOV Anas
## Presentation
This notebook is part of the AAA course. The goal is to predict house prices in Melbourne using a dataset from Kaggle. The dataset provides several features describing each property; we use them to train a Linear Regression model and then evaluate it.

This notebook is divided into several parts in order to follow the steps needed before training the model.

For the moment, this notebook only cleans the data and makes a first attempt at applying a Linear Regression model to it. The next steps will be to try different models and compare them to see which one performs best on this dataset.


### Imports

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_validate
from sklearn.impute import SimpleImputer
from pandas.plotting import scatter_matrix
import seaborn as sns
import matplotlib.pyplot as plt



### Load the dataset

In [None]:
housing = pd.read_csv("dataset.csv")
target_name = "Price"
data = housing.drop(columns=target_name)
target = housing[target_name]

print(f"Dataset size: {data.shape}")

### Columns description
**Rooms**: Number of rooms

**Price**: Price in dollars

**Method**: S - property sold; SP - property sold prior; PI - property passed in; PN - sold prior not disclosed; SN - sold not disclosed; NB - no bid; VB - vendor bid; W - withdrawn prior to auction; SA - sold after auction; SS - sold after auction price not disclosed; N/A - price or highest bid not available.

**Type**: br - bedroom(s); h - house, cottage, villa, semi, terrace; u - unit, duplex; t - townhouse; dev site - development site; o res - other residential.

**SellerG**: Real Estate Agent

**Date**: Date sold

**Distance**: Distance from CBD

**Regionname**: General Region (West, North West, North, North East, etc.)

**Propertycount**: Number of properties that exist in the suburb.

**Bedroom2**: Scraped # of Bedrooms (from different source)

**Bathroom**: Number of Bathrooms

**Car**: Number of carspots

**Landsize**: Land Size

**BuildingArea**: Building Size

**CouncilArea**: Governing council for the area

In [None]:
numerical_features = ["Rooms", "Distance", "Propertycount", "Bedroom2", "Bathroom", "Car", "Landsize", "BuildingArea"]

categorical_features = ["Type", "SellerG", "Regionname", "CouncilArea"]

data = data[numerical_features + categorical_features]
data.head()

### Data preprocessing
BuildingArea has missing values, so we check what percentage of its values is missing.

In [None]:
data["BuildingArea"].isna().mean() * 100

We can see that 47% of the values are missing. If we dropped the rows with missing values, we would lose almost half of the data, so we remove the column instead.
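
The same check can be run on every column at once. The sketch below uses a small made-up frame rather than the real dataset, and the 40% cut-off (`threshold`) is a hypothetical value chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the housing data (values are made up).
toy = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "BuildingArea": [np.nan, 120.0, np.nan, 95.0],
    "Car": [1.0, np.nan, 2.0, 2.0],
})

# Percentage of missing values per column, largest first.
missing_pct = toy.isna().mean().mul(100).sort_values(ascending=False)

# Hypothetical cut-off: columns above it are candidates for removal.
threshold = 40.0
to_drop = missing_pct[missing_pct > threshold].index.tolist()

print(missing_pct)
print("Candidates for removal:", to_drop)
```

Columns below the cut-off, such as Car here, are better candidates for imputation than for removal.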

In [None]:
data = data.drop(columns=["BuildingArea"])
numerical_features.remove("BuildingArea")

data.head()

Let's visualize the distribution of each numerical feature

In [None]:
data.hist(bins=50, figsize=(20, 15))

Let's visualize the correlation matrix using a heatmap

In [None]:
plt.figure(figsize=(10, 10))

data_with_target = data.copy()
data_with_target[target_name] = target

sns.heatmap(data_with_target[numerical_features + [target_name]].corr(), annot=True)
plt.show()

We can see that Rooms and Bedroom2 are highly correlated with each other, so they carry largely redundant information. Propertycount and Distance are negatively correlated with Price, so they might be useful predictors. Price is also highly correlated with Rooms and Bedroom2, which makes sense: the more rooms a house has, the more expensive it tends to be.
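
Reading a heatmap by eye can be complemented with a ranked list of correlations with the target. This is a minimal sketch on made-up values, not the real dataset; the column names mirror the notebook's, and the ranking by absolute value is our own choice.

```python
import pandas as pd

# Toy data (made up) with the same column names as the notebook.
toy = pd.DataFrame({
    "Rooms": [1, 2, 3, 4, 5],
    "Bedroom2": [1, 2, 3, 4, 4],
    "Distance": [20.0, 15.0, 12.0, 6.0, 3.0],
    "Price": [300_000, 450_000, 600_000, 900_000, 1_200_000],
})

# Pearson correlation of every feature with the target,
# ranked by absolute strength so negative correlations are not buried.
corr_with_price = (
    toy.corr()["Price"]
    .drop("Price")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(corr_with_price)
```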


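Let's also visualize the scatter matrix of the numerical features and the target.
It will help us see the distribution of the data and the pairwise relationships.

In [None]:
# Include the target so that relationships with Price are visible as well
scatter_matrix(data_with_target[numerical_features + [target_name]], figsize=(20, 20))
plt.show()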

We can see the same results as before: Rooms and Bedroom2 are highly correlated, and Propertycount and Distance are negatively correlated with Price. We can also see that Landsize has a long-tailed distribution: most values are small and a few are very large, which suggests outliers.


To check this, we look at the summary statistics of Landsize

In [None]:
data[["Landsize"]].describe()

Removing the outliers would make the distribution easier to read, but we would lose data, so we keep them.
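
To get a feel for how many rows such a removal would cost, we can count the values flagged by the common 1.5 × IQR rule. This is only a sketch on made-up land sizes; the rule itself is an assumption and not something the notebook applies.

```python
import pandas as pd

# Made-up land sizes with one extreme value, mimicking a long tail.
landsize = pd.Series([0, 150, 300, 450, 500, 600, 650, 700, 40_000], name="Landsize")

# Interquartile range and the usual upper fence (Q3 + 1.5 * IQR).
q1, q3 = landsize.quantile([0.25, 0.75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr

# Boolean mask of the values above the fence.
is_outlier = landsize > upper_fence
print(f"{is_outlier.sum()} of {len(landsize)} values lie above {upper_fence:.0f}")
```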

Next, we check the percentage of missing values in every column

In [None]:
data.isna().mean() * 100

We can see that the Car column has missing values. We fill them with the most frequent value (the mode)

In [None]:
data["Car"].value_counts()

data["Car"] = data["Car"].fillna(data["Car"].mode()[0])

data.isna().mean() * 100
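
Filling the column directly uses the mode of the whole dataset, including rows that later end up in the test set. A possible alternative, sketched here on made-up values, is to learn the mode on the training data only with `SimpleImputer` and reuse it on new data. The `train` and `test` frames are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical train/test frames with missing Car values.
train = pd.DataFrame({"Car": [1.0, 2.0, 2.0, np.nan]})
test = pd.DataFrame({"Car": [np.nan, 1.0]})

# The mode is learned on the training data only...
imputer = SimpleImputer(strategy="most_frequent")
train_filled = imputer.fit_transform(train)

# ...and reused as-is on the test data.
test_filled = imputer.transform(test)

print("Learned fill value:", imputer.statistics_[0])
print(test_filled.ravel())
```

Placed inside a `Pipeline`, the imputer is refitted on each training fold during cross-validation, so no information from the held-out fold is used.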

## Conclusion
Now that the data is cleaned, we can apply transformations to it and train the model.
The following sections do this, but for now they are only a draft: this version of the notebook focuses on cleaning the data. You can take a look, but don't expect much from it.

## [DRAFT] Applying transformations, training the model and evaluating it

### Pipeline creation
We create a pipeline to apply the transformations to the data

In [None]:
numerical_features_transformer = Pipeline(steps=[
    # ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler())
], verbose=True)

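categorical_features_transformer = Pipeline(steps=[
    # ("imputer", SimpleImputer(strategy="most_frequent")),
    # ("encoder", OneHotEncoder())
    # Categories not seen during fit (e.g. a rare SellerG) are encoded as -1 instead of raising an error
    ("encoder", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1))
], verbose=True)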

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numerical_features_transformer, numerical_features),
        ("cat", categorical_features_transformer, categorical_features)
    ]
)

# We also need to define the estimator. Price is a continuous target, so we need a regressor
# (LogisticRegression is a classifier and is not suitable here).
# regressor = LinearRegression()
regressor = DecisionTreeRegressor()

model = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("regressor", regressor)
])

# Now we have a pipeline that can apply the transformations to the data
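
For a linear model, one-hot encoding is usually a better fit than ordinal codes, because ordinal codes impose an arbitrary order on the categories. The sketch below shows this variant on a tiny made-up frame. The names `X_toy`, `y_toy`, `preprocessor_toy` and `model_toy` are hypothetical, and `handle_unknown="ignore"` is our assumption for dealing with categories that only appear at prediction time.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny made-up sample with one numerical and one categorical feature.
X_toy = pd.DataFrame({"Rooms": [2, 3, 4, 3], "Type": ["h", "u", "h", "t"]})
y_toy = pd.Series([500_000, 350_000, 900_000, 600_000])

preprocessor_toy = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["Rooms"]),
    # Unknown categories become an all-zero row instead of raising an error.
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Type"]),
])

model_toy = Pipeline(steps=[
    ("preprocessor", preprocessor_toy),
    ("regressor", LinearRegression()),
])
model_toy.fit(X_toy, y_toy)

# "dev site" was never seen during fit, yet prediction still works.
X_new = pd.DataFrame({"Rooms": [3], "Type": ["dev site"]})
pred = model_toy.predict(X_new)
print(pred)
```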

### Training and testing sets
Splitting the data into training and testing sets

In [None]:
data.shape

In [None]:
target.shape

In [None]:
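test_size = 0.2  # 20% of the data will be used for testing
# random_state fixes the shuffle so that the split is reproducible (42 is an arbitrary choice)
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=test_size, random_state=42)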
print(f"Training set size: {X_train.shape}")
print(f"Testing set size: {X_test.shape}")
print(f"Training target size: {y_train.shape}")
print(f"Testing target size: {y_test.shape}")
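
The split is random, so scores change from run to run unless a seed is fixed. A minimal sketch on made-up arrays shows that two calls with the same `random_state` return the same test rows. The seed value 0 is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 samples with 2 features each.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Two calls with the same seed produce identical splits.
split_a = train_test_split(X, y, test_size=0.2, random_state=0)
split_b = train_test_split(X, y, test_size=0.2, random_state=0)

X_test_a, X_test_b = split_a[1], split_b[1]
print(np.array_equal(X_test_a, X_test_b))
print(X_test_a.shape)
```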

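### Applying transformations
Fitting the pipeline learns the transformations on the training set and trains the model on the transformed data. The test set is only transformed later, when we call `predict`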

In [None]:
model.fit(X_train, y_train)

#### Evaluating the model

In [None]:
X_train.head()

In [None]:
y_train.head()

In [None]:
from sklearn.model_selection import GridSearchCV

# param_grid = [
#     {"preprocessor__num__imputer__strategy": ["mean"]},
#     # {"preprocessor__cat__imputer__strategy": ["most_frequent"]}
# ]
#
# cv = GridSearchCV(model, param_grid, cv=5, scoring="neg_mean_squared_error",
#                            verbose=2, n_jobs=8)
#
# cv.fit(X_train, y_train)

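# 5-fold cross-validation on the training set. With no `scoring` argument,
# the estimator's default score is used (R² for regressors).
cv = cross_validate(model, X_train, y_train, cv=5, return_estimator=True, n_jobs=5, verbose=2)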
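
The commented-out grid search could later be used to tune the model itself. The following sketch runs on synthetic data from `make_regression`, not on the housing data; the step name `regressor` and the candidate depths are assumptions made for this example.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem standing in for the housing data.
X_syn, y_syn = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("regressor", DecisionTreeRegressor(random_state=0)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
param_grid = {"regressor__max_depth": [2, 4, 8, None]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X_syn, y_syn)

print(search.best_params_)
print(search.best_score_)  # negated MSE: closer to 0 is better
```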

In [None]:
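# cv is a dict with the fit and score times, the fitted estimators and the test score of each fold
# (cv.best_params_ only exists on a fitted GridSearchCV object)
cv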
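
The raw dictionary is hard to read, so it helps to summarize the folds. This sketch again uses synthetic data; the two metrics and `return_train_score=True` are our choices for illustration, not settings taken from the cell above.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem standing in for the housing data.
X_syn, y_syn = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

scores = cross_validate(
    DecisionTreeRegressor(random_state=0),
    X_syn,
    y_syn,
    cv=5,
    scoring=("r2", "neg_root_mean_squared_error"),
    return_train_score=True,
)

# The RMSE scorer is negated by scikit-learn, so we flip the sign back.
rmse = -scores["test_neg_root_mean_squared_error"]
print(f"Test RMSE: {rmse.mean():.1f} +/- {rmse.std():.1f}")
print(f"Train R2: {scores['train_r2'].mean():.3f}")
print(f"Test R2: {scores['test_r2'].mean():.3f}")
```

A large gap between the train and test scores, which is typical for an unconstrained decision tree, is a sign of overfitting.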


In [None]:
# Let's evaluate the model on the test set
from sklearn.metrics import mean_squared_error

# Transform the test set using the pipeline
# X_test_transformed = model.named_steps["preprocessor"].transform(X_test)

y_pred = model.predict(X_test)
mean_squared_error(y_test, y_pred)
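
The mean squared error is expressed in squared dollars, which is hard to interpret. Taking its square root (RMSE) or using the mean absolute error (MAE) gives values in dollars. The prices below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up true prices and predictions.
y_true = np.array([500_000, 750_000, 1_000_000])
y_hat = np.array([450_000, 800_000, 1_100_000])

mse = mean_squared_error(y_true, y_hat)   # squared dollars
rmse = np.sqrt(mse)                       # dollars
mae = mean_absolute_error(y_true, y_hat)  # dollars

print(f"MSE:  {mse:.0f}")
print(f"RMSE: {rmse:.0f}")
print(f"MAE:  {mae:.0f}")
```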


In [None]:
print(y_test.head())

In [None]:
print(model.predict(X_test.head()))