<html>
<div>
  <img src="https://www.engineersgarage.com/wp-content/uploads/2021/11/TCH36-01-scaled.jpg" width=360px width=auto style="vertical-align: middle;">
  <span style="font-family: Georgia; font-size:30px; color: white;"> <br/> University of Tehran <br/> AI_CA5 <br/> Spring 02 </span>
</div>
<span style="font-family: Georgia; font-size:15pt; color: white; vertical-align: middle;"> low_mist - std id: 810100186 </span>
</html>

In this notebook we learn the basics of machine learning and try to predict house prices.

## Problem Description
In this problem we learn the basics of machine learning in order to predict house prices. We first implement linear regression without any library, and then repeat the task with Scikit-Learn.

## Dataset
The `house_data.csv` file contains data about houses and their prices in one of the cities of Washington, D.C. in years 2014 and 2015. 

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import math
from copy import deepcopy
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import ConfusionMatrixDisplay
import category_encoders as cat_enc
from mlxtend.evaluate import bias_variance_decomp

DATASET_PATH = "assets/house_data.csv"

In [None]:
df = pd.read_csv(DATASET_PATH)
pd.set_option("display.max_columns", None)
df.head(10)

## Part 1. Analysis of the dataset
### Q1-1. Describe dataset using info and describe methods.

In [None]:
df.info()

The `info` method prints a summary of the dataframe: its index, its columns, and their data types.

The output starts with the pandas DataFrame class and shows that there are 21613 entries in the dataframe.  
There are 26 columns, and for each of them the column's name, its non-null count, and its data type are shown.  
The non-null count is the number of rows that have a value in that column.  
At the end, the number of columns per data type and the dataframe's memory usage are shown.

In [None]:
df.describe()

The `describe` method shows summary statistics for the numeric columns of the dataframe.

Each table row reports a statistic of the corresponding column's data:  

- count: The number of non-null values.
- mean: The average of the values.
- std: The standard deviation of the values.
- min: The minimum value.
- 25%: The first quartile of the column's data.
- 50%: The median of the column's data.
- 75%: The third quartile of the column's data.
- max: The maximum value.

### Q1-2. For each feature show number and proportion of missing values.

In [None]:
def missing_values(df: pd.DataFrame) -> pd.DataFrame:
    # Count the NaN values per column and express them as a percentage of all rows.
    nan_values_count = df.isna().sum()
    nan_values_percent = nan_values_count / len(df) * 100
    nan_values = pd.concat([nan_values_count, nan_values_percent], axis=1, keys=["Missing", "Percentage"])
    # Every column is returned because the report needs all of them;
    # use nan_values[nan_values["Missing"] != 0] to list only columns with missing values.
    return nan_values

missing_values(df)

### Q1-3. Plotting the correlation graph between the features. Which features are most correlated with target?

In [None]:
plt.figure(figsize=(20, 20))
sns.heatmap(df.corr(numeric_only=True), annot=True, fmt=".3f", cmap="Blues", linewidths=1, square=True)
plt.show()

To see which features are most correlated with the outcome, we can simply use the `price` column of `df.corr()`.

In [None]:
price_corr = df.corr(numeric_only=True)["price"].drop("price")
price_corr = price_corr[abs(price_corr) > 0.31].sort_values(ascending=False)
display(price_corr)

As we can see, the square-footage features, namely `sqft_living`, `sqft_above` and `sqft_living15`, have the highest correlation with the outcome.
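To make the numbers in the heatmap concrete, here is a minimal sketch of how a Pearson correlation coefficient (the default method of `df.corr()`) is computed. The helper `pearson_r` and the toy values are made up for illustration and are not part of the notebook's pipeline.

```python
import math

def pearson_r(x, y):
    """Pearson correlation: co-variation divided by the product of the spreads."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    cov = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    var_x = sum((a - x_bar) ** 2 for a in x)
    var_y = sum((b - y_bar) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Toy data (assumed values, not taken from house_data.csv).
sqft = [1000.0, 1500.0, 2000.0, 2500.0]
price_up = [200.0, 290.0, 410.0, 500.0]    # rises with sqft
price_down = [500.0, 410.0, 290.0, 200.0]  # falls with sqft

r_up = pearson_r(sqft, price_up)
r_down = pearson_r(sqft, price_down)
print(r_up, r_down)
```

A value close to 1 means the feature and the price rise together almost linearly, a value close to -1 means one falls as the other rises, and a value near 0 means there is no linear relationship.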

### Q1-4. Plot unique values for each feature of the last part.

In [None]:
NUM_OF_INTERVALS = 20
df_backup = deepcopy(df)

sns.set(rc={'figure.figsize':(11.7,8.27)})
sns.color_palette("rocket", as_cmap=True)

for col in price_corr.index.to_list():
    if df[col].quantile(0.9) - df[col].quantile(0.1) < 20:
        sns.countplot(x=col, data=df)
        plt.show()
    else:
        intervals = pd.interval_range(start=df[col].quantile(0.1), end=df[col].quantile(0.9), periods=NUM_OF_INTERVALS + 1)
        interval_tuples = [(interval.left, interval.right) for interval in intervals]
        bins = pd.IntervalIndex.from_tuples(interval_tuples)
        df[col] = pd.cut(df[col], bins)
        ax = sns.countplot(x=col, data=df)
        ax.set_xticklabels([f'{int(np.mean(interval))}' for interval in interval_tuples])
        plt.show()


### Q1-5. Plotting the relationship between the features using hexbin and scatter plots.

In [None]:
df = deepcopy(df_backup)

def plot_corr_scatter_hexbin(col):
    fig, axs = plt.subplots(ncols=2, figsize=(12, 4))
    plt.suptitle(col)
    axs[0].scatter(df[col], df["price"])
    axs[1].hexbin(df[col], df["price"], gridsize=20, cmap="Blues")
    plt.subplots_adjust(wspace=0.3)
    plt.show()

for col in price_corr.index:
    plot_corr_scatter_hexbin(col)

### Q1-6. Use other methods to analyze the data.

In [None]:
def plot_corr_joint(col):
    sns.jointplot(x=col, y="price", data=df, kind="hex")

for col in price_corr.index:
    plot_corr_joint(col)

## Part 2. Preprocessing

First we need to mark some invalid values as missing, such as negative values for the number of bathrooms and so forth.

In [None]:
negative_columns = ["bedrooms", "bathrooms", "sqft_living", "grade"]
df[negative_columns] = np.where(df[negative_columns] < 0, np.nan, df[negative_columns])

### Q2-1. How to handle missing values.
Missing values in machine learning projects can be a significant hindrance to accurate results. However, there are multiple ways to deal with them during the preprocessing stage to minimize their impact. 

---

**Imputation** is the process of replacing missing values with substituted values.
  
One such technique is mean imputation, where missing values are replaced by the mean of the non-missing data. Similarly, the median or the mode of the non-missing data can be used as a substitute. Alternatively, missing values can be replaced with the value of the previous or the next observation, called forward-fill and backward-fill, respectively. Other options are to predict the missing value or simply to fill in a random value.

  - *Filling with Mean:*  
    Using the average to fill missing values is simple, and the mean is a reasonable representative of the data as a whole.  
    However, it does not make sense for every column, for example categorical ones, where a mean cannot be computed.
  - *Filling with Median:*  
    Outliers can distort the mean.  
    In such cases, it may be better to use the median, which is robust to outliers.
  - *Filling with Mode:*  
    Mean and median do not work with categorical data.  
    Using the mode can be an alternative for such data.
  - *Random Fill:*  
    In this method we fill using random values, mostly in the range of the column's minimum and maximum data, or between the categories.
  - *Predicting:*  
    A more advanced method is to have a way of predicting what the missing value should be based on the other properties of the row.
  - *Forward-fill and Backward-fill:*  
    In this method we fill the missing values with the previous or the next observation, respectively.

A more advanced approach is model-based imputation, such as k-NN imputation or MICE (Multiple Imputation by Chained Equations). The former fills each missing value from the k nearest neighbors of the row, while the latter fits regression models on the observed data and uses them to impute the missing values.
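The simple strategies above can be sketched in plain Python so that the fill values are easy to verify by hand. The column `floors` below is toy data, and the helpers `fill_constant` and `forward_fill` are hypothetical names used only for this illustration; on a DataFrame the same is usually done with pandas `fillna`, `ffill` and `bfill`.

```python
from statistics import mean, median, mode

# Toy column with two missing entries (None); the values are made up.
floors = [1.0, 2.0, None, 1.0, 3.0, None, 1.0]
observed = [v for v in floors if v is not None]

def fill_constant(values, fill_value):
    """Replace every missing entry with the same value."""
    return [fill_value if v is None else v for v in values]

def forward_fill(values):
    """Replace each missing entry with the last observed value before it."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled

filled_mean = fill_constant(floors, mean(observed))
filled_median = fill_constant(floors, median(observed))
filled_mode = fill_constant(floors, mode(observed))
filled_ffill = forward_fill(floors)
# Backward-fill is forward-fill applied to the reversed sequence.
filled_bfill = forward_fill(floors[::-1])[::-1]

print(filled_mean, filled_median, filled_ffill, filled_bfill, sep="\n")
```

Note that the mean produces a fractional number of floors here, while the median and the mode produce values that already occur in the column.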

---

**Dropping**: Another option is to remove any observations that contain missing values. This choice should be made with careful consideration, as it can reduce the size of the data and impact its representativeness. There are two main ways: dropping columns and dropping rows.  

  - *Dropping Columns:*  
    In this method we remove any column that has missing values in it.  
    This is usually not wanted because we potentially lose a lot of data.  
    It should only be considered for columns that have too many missing values; in such cases it is actually the better choice, because there is not enough data to fill the column with good precision.
  - *Dropping Rows:*  
    Works similarly to dropping columns.  
    If we remove every row that has a missing value, a single column that is entirely missing is enough to remove all of the rows.  
    It should therefore only be considered for rows that have most of their properties missing.
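As a rough sketch of threshold-based dropping, the snippet below keeps rows with at most a given number of missing values and drops columns whose share of missing values is too high. The table, the helper names and the thresholds are assumptions chosen for illustration; with pandas, `DataFrame.dropna` and its `thresh` parameter cover the same idea.

```python
# Toy table: each row is a dict, None marks a missing value (made-up data).
rows = [
    {"bedrooms": 3, "floors": 1.0, "yr_built": 1990},
    {"bedrooms": None, "floors": None, "yr_built": None},
    {"bedrooms": 2, "floors": None, "yr_built": 2005},
    {"bedrooms": 4, "floors": 2.0, "yr_built": None},
]

def drop_sparse_rows(rows, max_missing):
    """Keep only rows with at most `max_missing` missing values."""
    return [r for r in rows if sum(v is None for v in r.values()) <= max_missing]

def drop_sparse_columns(rows, max_missing_share):
    """Drop columns whose share of missing values exceeds `max_missing_share`."""
    columns = list(rows[0].keys())
    keep = [c for c in columns
            if sum(r[c] is None for r in rows) / len(rows) <= max_missing_share]
    return [{c: r[c] for c in keep} for r in rows]

kept_rows = drop_sparse_rows(rows, max_missing=1)
kept_columns = drop_sparse_columns(rows, max_missing_share=0.4)
print(len(kept_rows), list(kept_columns[0].keys()))
```

Only the fully empty row is removed by the row rule, whereas the column rule removes `floors` and `yr_built`, each of which is missing in half of the rows.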

---

Finally, in some cases where missing data is limited, one could choose to ignore these values altogether and proceed with the analysis. However, before doing that, it is essential to determine whether the missingness is random or non-random. 

These are just a few possible ways to handle missing data during the preprocessing stage of a machine learning project. Depending on the specific case, there may be other methods that could be more effective.

### Q2-2. Handling missing values

In [None]:
missing_values(df)

As we can see, `yr_built`, `sqft_living` and `floors` have the most missing values.  
We fill with the median so that the fill value is not affected by outliers and, for integer-valued columns, is usually a whole number rather than a fraction.

In [None]:
df.fillna(df.median(numeric_only=True), inplace=True)
missing_values(df)

As an alternative, we drop the rows that have more than two NaN values and then use `KNNImputer` to fill the remaining missing values.

In [None]:
missing = df_backup[df_backup.isna().sum(axis=1) > 2]
df_imputed = deepcopy(df_backup)
df_imputed.drop(missing.index, inplace=True)
df_imputed = df_imputed.drop(["date", "location", "style"], axis=1)
df_imputed.reset_index(drop=True, inplace=True)
imputer = KNNImputer(n_neighbors=5)
imputed = imputer.fit_transform(df_imputed)
imputed = pd.DataFrame(imputed, columns=df_imputed.columns)
imputed[negative_columns] = np.where(imputed[negative_columns] < 0, np.nan, imputed[negative_columns])

In [None]:
imputed.describe()

In [None]:
df.describe()

As shown above, both the `mean` and `std` of the features remain almost the same as in the original data after filling the missing values with `KNNImputer`, so it is a valid option here.   

Filling with the `mean` keeps the `mean` of each column unchanged but lowers its `std`, since every filled value sits exactly at the mean. Filling with the `median` is more robust when outliers are present. We continue with the dataframe obtained from median filling, since we need the non-numerical columns too, and these were dropped before running `KNNImputer`.

### Q2-3. Normalization and Standardization, should we use them?

Normalization means scaling the values of the features to a fixed range, for example [0, 1] or [-1, 1]. This method works best when the data has no outliers and lies in a bounded range, because a single extreme value compresses all the other values. We can use `MinMaxScaler` to do this. Normalization matters most for algorithms that are sensitive to the scale of the features, such as `KNN` (which is based on distance) or `Neural Networks`; for other algorithms it is often unnecessary. Below is the formula for `MinMaxScaler`:

$$X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}}$$

Standardization means scaling the values of the features to have a mean of 0 and a standard deviation of 1. This method is useful when we have features with different means and standard deviations. We can use `StandardScaler` to do this. Below is the formula for `StandardScaler`:

$$X_{std} = \frac{X - \mu}{\sigma}$$  

To answer when each of them is needed, I quote from [this link](https://towardsai.net/p/data-science/how-when-and-why-should-you-normalize-standardize-rescale-your-data-3f083def38ff):  
**Normalization** is a good technique to use when you do not know the distribution of your data or when you know the distribution is not Gaussian (a bell curve). Normalization is useful when your data has varying scales and the algorithm you are using does not make assumptions about the distribution of your data, such as k-nearest neighbors and artificial neural networks.  
**Standardization** assumes that your data has a Gaussian (bell curve) distribution. This does not strictly have to be true, but the technique is more effective if your attribute distribution is Gaussian. Standardization is useful when your data has varying scales and the algorithm you are using does make assumptions about your data having a Gaussian distribution, such as linear regression, logistic regression, and linear discriminant analysis.
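Both formulas can be checked on a handful of numbers. This is a minimal sketch in plain Python with made-up values; `min_max_scale` and `standardize` are hypothetical helper names. The population standard deviation (`pstdev`) is used because that is what `StandardScaler` divides by.

```python
from statistics import mean, pstdev

# Made-up living-area values in square feet.
sqft = [800.0, 1200.0, 1500.0, 2000.0, 3500.0]

def min_max_scale(values):
    """X_norm = (X - X_min) / (X_max - X_min), mapping the values into [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """X_std = (X - mu) / sigma, giving mean 0 and standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

normalized = min_max_scale(sqft)
standardized = standardize(sqft)
print(normalized)
print(standardized)
```

After min-max scaling the smallest value becomes 0 and the largest becomes 1, so the single large house at 3500 pushes the other values toward the lower half of the range. After standardization the values are centered at 0, and each one is expressed in units of standard deviations.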


In [None]:
class DataScaler:
    def __init__(self, df: pd.DataFrame):
        self.df = df
        # Unscaled copy of the numeric columns: scaler input and source for restoring excluded columns.
        self.numeric_cols = df.select_dtypes(include="number")
        self.scaler_std = StandardScaler()
        self.scaler_norm = MinMaxScaler()

    def standardization(self, exclude_cols=None):
        # None instead of [] avoids a shared mutable default argument.
        self.df[self.numeric_cols.columns] = self.scaler_std.fit_transform(self.numeric_cols)
        if exclude_cols:  # restore the original values of the excluded columns
            self.df[exclude_cols] = self.numeric_cols[exclude_cols]

    def normalization(self, exclude_cols=None):
        self.df[self.numeric_cols.columns] = self.scaler_norm.fit_transform(self.numeric_cols)
        if exclude_cols:  # restore the original values of the excluded columns
            self.df[exclude_cols] = self.numeric_cols[exclude_cols]

scalar = DataScaler(df)


Now we check the distribution of features to decide between normalizing and standardizing.

In [None]:
scalar.df.hist(bins=20, figsize=(20,15))
plt.show()

Since the important features (i.e. the ones that are highly correlated with the price) are mostly close to normally distributed, we use standardization.

In [None]:
scalar.standardization(exclude_cols=["price"])
scalar.df.describe()

In [None]:
scalar.df.hist(bins=20, figsize=(20,15))
plt.show()

### Q2-4. Categorical values and encoding.

There are many ways to encode the categorical features. Some of them are as follows:

* `Label Encoding`: Assign a number to each category.  
Each possible value of a categorical feature is substituted with a corresponding number. This method is useful when the categories have an order. While label encoding is very simple, it is not ideal for unordered categories, because the numbers then imply an order and distances that do not exist.  
  > Category 1: 0  
  > Category 2: 1  
  > Category 3: 2

* `One-Hot Encoding`: Create a new feature for each category.
This method is useful when the categories don't have an order. It is the most useful method for algorithms that use the distance between the data points, such as `KNN`. In this method, an additional feature is added for each categorical value and is marked 0 or 1. While this encoding represents unordered categories faithfully, it adds many new binary features, which use more memory and can slow training down.
    > Category 1: 1, 0, 0  
    > Category 2: 0, 1, 0  
    > Category 3: 0, 0, 1


* `Binary Encoding`: Encode the categories using binary numbers.
This method is useful when the categories don't have an order. It is somewhat similar to `One-Hot Encoding`, but it needs fewer columns (about $\log_2$ of the number of categories) and therefore less memory.
    > Category 1: 00  
    > Category 2: 01  
    > Category 3: 10

* `Frequency Encoding`: Encode the categories using the frequency of the categories.
This method is useful when the categories don't have an order. 
    > Category 1: 0.5  
    > Category 2: 0.25  
    > Category 3: 0.25

* `Target Encoding`: Replace each categorical value with the mean of a target variable.
To do this, the data is grouped by each categorical value, and the average of a chosen target variable is calculated for each group. If the target is numerical, the categorical values are replaced with their corresponding average of the target. If the target is categorical, the values are replaced with their corresponding probability of the target. This method is useful when the categories don't have an order. 
    > Category 1: 0.5  
    > Category 2: 0.25  
    > Category 3: 0.75
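The encodings listed above can be illustrated on a tiny column in plain Python. The `style` and `price` values are made up, and the target encoding shown here is the plain per-category mean, without the smoothing that a library encoder may apply.

```python
from collections import Counter, defaultdict

# Made-up categorical column and numeric target.
style = ["modern", "classic", "modern", "rustic", "modern", "classic"]
price = [300.0, 200.0, 340.0, 150.0, 320.0, 220.0]

categories = sorted(set(style))  # ['classic', 'modern', 'rustic']

# Label encoding: one integer per category.
label_map = {c: i for i, c in enumerate(categories)}
label_encoded = [label_map[s] for s in style]

# One-hot encoding: one 0/1 indicator per category.
one_hot_encoded = [[int(s == c) for c in categories] for s in style]

# Frequency encoding: share of rows that belong to the category.
counts = Counter(style)
frequency_encoded = [counts[s] / len(style) for s in style]

# Target encoding: mean of the target within each category.
totals = defaultdict(float)
for s, p in zip(style, price):
    totals[s] += p
target_map = {c: totals[c] / counts[c] for c in categories}
target_encoded = [target_map[s] for s in style]

print(label_encoded, one_hot_encoded[0], frequency_encoded[0], target_map, sep="\n")
```

Label, frequency and target encoding each keep a single column, while one-hot encoding produces one column per category.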

In [None]:
class CategoricalEncoder:
    def __init__(self, df: pd.DataFrame):
        self.df = df
        self.cat_cols = df.select_dtypes(include=["category", "object"])
        
        self.encoders = {
            "label": cat_enc.OrdinalEncoder(cols=self.cat_cols.columns),
            "one-hot": cat_enc.OneHotEncoder(cols=self.cat_cols.columns, use_cat_names=True),
            "target": cat_enc.TargetEncoder(cols=self.cat_cols.columns, min_samples_leaf=2, smoothing=1.1),
            "frequency": cat_enc.CountEncoder(cols=self.cat_cols.columns),
            "binary": cat_enc.BinaryEncoder(cols=self.cat_cols.columns),
        }

    def encode(self, mode: str, target: str = None):
        if mode != "target":
            self.df[self.cat_cols.columns] = self.encoders[mode].fit_transform(self.cat_cols)
        else:
            self.df[self.cat_cols.columns] = self.encoders[mode].fit_transform(self.cat_cols, self.df[target])

encoder = CategoricalEncoder(df)
encoder.encode(mode="label")
display(df)

###  Q2-5. Removing columns.

Some columns, like id and Unnamed, are unique per row and won't help us; others, like longitude, latitude and zipcode, are not crucial in determining the price, so we drop them. We also drop the features that have a low correlation with the target.

In [None]:
df = df[price_corr.index.union(["price"])]
df.head(10)

### Q2-6. Splitting the dataset into train and test sets.
There are several ways to split the dataset into train and test sets. Some of them are as follows:

- Randomly split the dataset into train and test sets
    - This is the most common method, but it has a drawback: with a purely random split, the train and test sets may not have the same distribution.   
- Split the dataset based on the time
    - This method is useful when we have a time series dataset, which is not the case here.  
- Split the dataset based on the target
    - This method (a stratified split) is useful when we have an imbalanced dataset.
- Cross-validation 
    - Splits the data into *k* parts. At each iteration one part is used as the test data and the rest as the training data, so every part is used for testing exactly once. This is called *k-fold cross-validation*.
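To show how the folds of k-fold cross-validation are formed, here is a minimal index-based sketch. The function `k_fold_indices` is a hypothetical helper written for this illustration, and it does not shuffle the rows; in practice the rows are usually shuffled first, and scikit-learn provides `sklearn.model_selection.KFold` for this purpose.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

folds = list(k_fold_indices(n_samples=10, k=3))
for train, test in folds:
    print("train:", train, "test:", test)
```

Each row index appears in exactly one test fold, so averaging the test scores over the folds uses every row for evaluation once.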

Here we use the first method. There are also several common split ratios; we use 80-20 here.

In [None]:
class DataSplitter:
    def __init__(self, df: pd.DataFrame, train_percent: float = 0.8):
        self.data = df[df.columns.difference(["price"])]
        self.outcome_data = df["price"]
        self.__split(train_percent)

    def __split(self, train_percent: float):
        train_feat, test_feat, train_out, test_out = train_test_split(
                                                    self.data, self.outcome_data, train_size=train_percent, random_state=1)
        self.data_train = train_feat
        self.data_test = test_feat
        self.outcome_train = train_out
        self.outcome_test = test_out
        
dataSplitter = DataSplitter(df)


### Q2-7. Validation set.
In machine learning, a validation set is a subset of the data, held out from training, that is used to evaluate a model while it is being developed. It is typically used to tune the hyperparameters of the model and to estimate its generalization error before the test data is touched.

The generalization error is the difference between the performance of the model on the training data and its performance on new, unseen data. To estimate it, we typically split the data into three sets: a training set to fit the model, a validation set to tune the hyperparameters and monitor generalization, and a test set to evaluate the final performance.

For models that are trained iteratively, the model is evaluated on the validation set after each epoch to see how well it generalizes to new data, which lets us adjust the hyperparameters as needed. Once the hyperparameters have been tuned, the final model is evaluated on the test set. The test set provides an unbiased estimate of the generalization error, since it has not been used during the training or validation process.
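A three-way split can be sketched with shuffled row indices. The function name, the 60-20-20 proportions and the seed below are assumptions chosen for illustration; with scikit-learn, a similar result is commonly obtained by calling `train_test_split` twice.

```python
import random

def train_val_test_split(n_samples, val_share=0.2, test_share=0.2, seed=1):
    """Shuffle row indices and cut them into train, validation and test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_test = int(n_samples * test_share)
    n_val = int(n_samples * val_share)
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test

train_idx, val_idx, test_idx = train_val_test_split(n_samples=100)
print(len(train_idx), len(val_idx), len(test_idx))
```

The three index lists are disjoint, so the test rows influence neither the fitted parameters nor the chosen hyperparameters.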

In [None]:
df.describe()

## Part 3. Training, Testing and Evaluating the Models.

### **Phase 1.** Linear Regression
Main form of the simple linear regression function: 
$$f(x) = \alpha x + \beta$$

Here we want to find the slope ($\alpha$) and the intercept ($\beta$) that minimize the RSS function, by setting its derivatives to zero:

- step 1: Compute the RSS of the training data  

$$ RSS = \Sigma (y_i - (\hat{\beta} + \hat{\alpha} * x_i) )^2 $$

Where $\hat{\alpha}$ is the estimated value of the slope coefficient $\alpha$ and $\hat{\beta}$ is the estimated value of the constant term $\beta$

- step 2: Compute the derivatives of the RSS function with respect to $\alpha$ and $\beta$, and set them equal to 0 to find the desired parameters

$$ \frac{\partial RSS}{\partial \beta} = \Sigma (-2 y_i + 2 \hat{\beta} + 2 \hat{\alpha} * x_i) = 0$$
$$ \to \hat{\beta} = \bar{y} - \hat{\alpha} \bar{x} \to (1)$$


$$ \frac{\partial RSS}{\partial \alpha} = \Sigma (-2 x_i y_i + 2 \hat{\beta} x_i + 2\hat{\alpha} x_i ^ 2) = 0 \to (2)$$

Here $\bar{x}$ and $\bar{y}$ are the means of the $x_i$ and the $y_i$, and $n$ is the number of samples. Substituting (1) into (2) and using $\Sigma x_i = n \bar{x}$ gives:

$$ (1) , (2) \to \hat{\alpha} = \frac{\Sigma x_i y_i - n \bar{x} \bar{y}}{\Sigma x_i^2 - n \bar{x}^2} = \frac{\Sigma{(x_i - \bar{x})(y_i - \bar{y})}}{\Sigma{(x_i - \bar{x})^2}} $$
$$ \hat{\beta} = \bar{y} - \hat{\alpha} \bar{x}$$
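As a sanity check of the closed-form solution, the sketch below fits a line to five made-up points and verifies numerically that both derivatives of the RSS vanish at the fitted parameters. The helper `fit_line` is a hypothetical name; `alpha` is the slope and `beta` the intercept, as in $f(x) = \alpha x + \beta$.

```python
def fit_line(x, y):
    """Return (alpha, beta): the slope and intercept that minimize the RSS."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    alpha = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    beta = y_bar - alpha * x_bar
    return alpha, beta

# Made-up data that does not lie exactly on a line.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]
alpha, beta = fit_line(x, y)

# Both first-order conditions should hold up to rounding error.
residuals = [yi - (beta + alpha * xi) for xi, yi in zip(x, y)]
sum_residuals = sum(residuals)                                  # from dRSS/dbeta = 0
sum_x_residuals = sum(xi * ri for xi, ri in zip(x, residuals))  # from dRSS/dalpha = 0
print(alpha, beta, sum_residuals, sum_x_residuals)
```

For these points the slope is 1.97 and the intercept is 0.09, and both sums are zero up to floating-point error, which is exactly what setting the two derivatives to zero requires.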



Based on the formula above, complete this function to compute the parameters of a simple linear regression

In [None]:
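def simple_linear_regression(input_feature, output):
    # inputs are expected to be pandas Series or NumPy arrays of equal length
    n = len(input_feature)

    # compute the sum of input_feature and output
    sum_x, sum_y = input_feature.sum(), output.sum()

    # compute the product of the output and the input_feature and its sum
    sum_xy = (input_feature * output).sum()

    # compute the squared value of the input_feature and its sum
    sum_xx = (input_feature ** 2).sum()

    # use the formula for the slope (the centered-sum formula, written with raw sums)
    slope = (sum_xy - sum_x * sum_y / n) / (sum_xx - sum_x ** 2 / n)

    # use the formula for the intercept
    intercept = sum_y / n - slope * sum_x / n

    return (intercept, slope)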

Now complete this function to predict the value of given data based on the calculated intercept and slope

In [None]:
def get_regression_predictions(input_feature, intercept, slope):
    # calculate the predicted values:
    predicted_values = intercept + slope * input_feature

    return predicted_values

Now that we have a model and can make predictions, let's evaluate our model using the Root Mean Square Error (RMSE). RMSE is the square root of the mean of the squared residuals, where a residual is simply the difference between the predicted output and the true output.

Complete the following function to compute the RMSE of a simple linear regression model given the predicted values and the true output:

In [None]:
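def get_root_mean_square_error(predicted_values, output):
    # Compute the residuals (since we are squaring it doesn't matter which order you subtract)
    residuals = predicted_values - output

    # square the residuals and add them up
    sum_squared_residuals = (residuals ** 2).sum()

    # find the mean of the above phrase
    mean_squared_error = sum_squared_residuals / len(output)

    # calculate the root
    RMSE = math.sqrt(mean_squared_error)

    return RMSE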

As you might have guessed, the RMSE has no upper bound, so it is hard to tell from it alone how well the model fits the data. Instead, we use the R2 score. The R2 score is calculated by comparing the sum of the squared differences between the actual and predicted values of the dependent variable to the total sum of squared differences between the actual values and their mean. Mathematically, the R2 score formula is as follows:

$$R^2 = 1 - \frac{SSres}{SStot} = 1 - \frac{\sum_{i=1}^{n} (y_{i,true} - y_{i,pred})^2}{\sum_{i=1}^{n} (y_{i,true} - \bar{y}_{true})^2} $$
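A small worked example with made-up prices shows how RMSE and the R2 score relate. All values and variable names below are assumptions for illustration only.

```python
import math

# Made-up true prices and model predictions.
y_true = [200.0, 300.0, 400.0, 500.0]
y_pred = [220.0, 280.0, 390.0, 510.0]

# RMSE: root of the mean squared residual.
residuals = [t - p for t, p in zip(y_true, y_pred)]
ss_res = sum(r ** 2 for r in residuals)
rmse = math.sqrt(ss_res / len(y_true))

# R2: one minus the ratio of residual to total sum of squares.
y_mean = sum(y_true) / len(y_true)
ss_tot = sum((t - y_mean) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

# Predicting the mean for every sample gives SSres == SStot, hence R2 = 0.
baseline_ss_res = sum((t - y_mean) ** 2 for t in y_true)
r2_baseline = 1 - baseline_ss_res / ss_tot

print(rmse, r2, r2_baseline)
```

The RMSE is expressed in the units of the price, so its size depends on the scale of the data, while the R2 score of 0.98 says that the predictions explain 98% of the variance of the true prices regardless of that scale.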

In this step, complete the following function to calculate the R2 score given the predicted values and the true output:

In [None]:
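def get_r2_score(predicted_values, output):
    # then compute the residuals (since we are squaring it doesn't matter which order you subtract)
    residuals = output - predicted_values

    # square the residuals and add them up -> SSres
    SSres = (residuals ** 2).sum()

    # compute the SStot
    SStot = ((output - output.mean()) ** 2).sum()

    # compute the R2 score value
    R2_score = 1 - SSres / SStot

    return R2_score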

Now calculate the fitness of the model and explain the outputs

In [None]:
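designated_feature_list = ["sqft_living", "yr_built", "grade", "zipcode"]

for feature in designated_feature_list:
    # Assumption: features removed in Q2-5 are no longer in the split data, so they are skipped.
    if feature not in dataSplitter.data_train.columns:
        print(f"{feature}: not in the preprocessed data, skipped")
        continue
    # Fit on the training split, then score on the held-out test split.
    intercept, slope = simple_linear_regression(dataSplitter.data_train[feature], dataSplitter.outcome_train)
    predictions = get_regression_predictions(dataSplitter.data_test[feature], intercept, slope)
    rmse = get_root_mean_square_error(predictions, dataSplitter.outcome_test)
    r2 = get_r2_score(predictions, dataSplitter.outcome_test)
    print(f"{feature}: RMSE = {rmse:.2f}, R2 = {r2:.3f}")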