# Hands-on Session 2: Regression


## House Price Prediction
In this hands-on session, we will learn how to predict house prices from a set of attributes of houses on sale. The dataset contains 13 attributes/features:

1	**Id**: To count the records.

2	**MSSubClass**: Identifies the type of dwelling involved in the sale.

3	**MSZoning**: Identifies the general zoning classification of the sale.

4	**LotArea**: Lot size in square feet.

5	**LotConfig**:	Configuration of the lot

6	**BldgType**:	Type of dwelling

7	**OverallCond**:	Rates the overall condition of the house

8	**YearBuilt**:	Original construction year

9	**YearRemodAdd**:	Remodel date (same as construction date if no remodeling or additions).

10	**Exterior1st**:	Exterior covering on house

11	**BsmtFinSF2**: Type 2 finished square feet.

12	**TotalBsmtSF**: Total square feet of basement area

13	**SalePrice**: To be predicted

## Importing Libraries and Dataset
First, we import the libraries used in this session: pandas for loading and handling the data, and matplotlib and seaborn for plotting.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

dataset = pd.read_excel("HousePricePrediction.xlsx")

# Printing first 5 records of the dataset
print(dataset.head(5))

We can use the shape attribute to show the dimensions of the dataset (number of rows, number of columns):

In [None]:
dataset.shape

Next, we summarise the numerical columns and then the categorical columns:

In [None]:
dataset.describe()

In [None]:
dataset.select_dtypes('object').describe()

## Data Preprocessing
Now, we categorize the features depending on their datatype (int, float, object) and then compute the number of items.

In [None]:
obj = (dataset.dtypes == 'object')
object_cols = list(obj[obj].index)
print("Categorical variables:",len(object_cols))

int_ = (dataset.dtypes == 'int')
num_cols = list(int_[int_].index)
print("Integer variables:",len(num_cols))

fl = (dataset.dtypes == 'float')
fl_cols = list(fl[fl].index)
print("Float variables:",len(fl_cols))

## Exploratory Data Analysis (EDA)
EDA refers to the deep analysis of data that allows us to discover different patterns and spot anomalies. Before making inferences from data it is essential to examine and explore the dataset.

First, we will visualise a **heatmap** using the seaborn library. The heatmap shows the correlation between the numerical attributes in the dataset and makes the dataset easier to understand visually.

In [None]:
plt.figure(figsize=(12, 6))
sns.heatmap(dataset.corr(numeric_only=True),
            cmap = 'BrBG',
            fmt = '.2f',
            linewidths = 2,
            annot = True)

Next, we draw a **barplot** to analyze the different categorical features.

In [None]:
unique_values = []
for col in object_cols:
  unique_values.append(dataset[col].unique().size)
plt.figure(figsize=(10,6))
plt.title('No. Unique values of Categorical Features')
plt.xticks(rotation=90)
sns.barplot(x=object_cols,y=unique_values)
plt.show()

The plot shows that *Exterior1st* has around 16 unique categories and the other features have around 6 unique categories. To find out the actual count of each category, we can plot a bar graph for each of the four features separately.

In [None]:
plt.figure(figsize=(20,5))
plt.title('Categorical Features: Distribution')
plt.xticks([],rotation=90),plt.yticks([])
index = 1

for col in object_cols:
    y = dataset[col].value_counts()
    plt.subplot(1, 4, index)
    plt.xticks(rotation=90)
    sns.barplot(x=list(y.index), y=y)
    index += 1
plt.show()


## Data Cleaning
Data cleaning is a very important step to improve the data by removing incorrect, corrupted or irrelevant entries before we perform prediction with the dataset.

Unlike the Iris dataset we explored in Hands-on Session 1, not all attributes are important and relevant for model training. So, we can drop these non-useful attributes/columns before training. In addition, it is quite common for a dataset to contain missing values. There are 2 approaches to dealing with empty/null values:

1) Deleting the column/row (if the feature or record is not very important).

2) Filling the empty slots with mean/mode/0/NA/etc. (depending on the dataset requirement).
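
The two approaches above can be sketched on a small toy table. This is only an illustration: the values below are made up and are not taken from the house price dataset, and the choice of mean for the numerical column and mode for the categorical column is one common option among several.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration (not from HousePricePrediction.xlsx)
toy = pd.DataFrame({
    "LotArea": [8450.0, np.nan, 11250.0, 9550.0],
    "BldgType": ["1Fam", "1Fam", None, "Duplex"],
})

# Approach 1: delete every row that contains at least one missing value
dropped = toy.dropna()

# Approach 2: fill numerical gaps with the column mean and
# categorical gaps with the most frequent value (the mode)
filled = toy.copy()
filled["LotArea"] = filled["LotArea"].fillna(filled["LotArea"].mean())
filled["BldgType"] = filled["BldgType"].fillna(filled["BldgType"].mode()[0])

print(dropped.shape)                 # rows that survive approach 1
print(filled.isnull().sum().sum())   # missing values left after approach 2
```

Dropping keeps only complete records, while filling keeps every record at the cost of introducing estimated values.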

First, we can drop the *Id* Column that will not be used for prediction.

In [None]:
dataset.drop(['Id'],
             axis=1,
             inplace=True)

Next, we fill the empty SalePrice values with the column mean, so that these records are kept and the overall mean of SalePrice stays unchanged.

In [None]:
dataset['SalePrice'] = dataset['SalePrice'].fillna(
  dataset['SalePrice'].mean())

Since there are very few empty records, we drop records with null values.

In [None]:
new_dataset = dataset.dropna()

After performing the data cleaning, we check whether any feature in the new dataframe still has null values.



In [None]:
new_dataset.isnull().sum()

## OneHotEncoder – For Categorical Features
One-hot encoding is a way to convert categorical data into binary vectors: each category becomes its own column, which holds 1 if the record belongs to that category and 0 otherwise. By using OneHotEncoder, columns of the object datatype are converted into numerical (0/1) columns.

Before we apply one-hot encoding, we first find all the features which have the object datatype.

In [None]:
from sklearn.preprocessing import OneHotEncoder

s = (new_dataset.dtypes == 'object')
object_cols = list(s[s].index)
print("Categorical variables:")
print(object_cols)
print('No. of. categorical features: ',
      len(object_cols))

Then, we can apply OneHotEncoding to the whole list.

In [None]:
OH_encoder = OneHotEncoder(sparse_output=False)
OH_cols = pd.DataFrame(OH_encoder.fit_transform(new_dataset[object_cols]))
OH_cols.index = new_dataset.index
OH_cols.columns = OH_encoder.get_feature_names_out()
# df_final = new_dataset.drop(object_cols, axis=1)
# df_final = pd.concat([df_final, OH_cols], axis=1)

In [None]:
OH_cols.head()

## Normalization - For Numerical Features

In [None]:
from sklearn.preprocessing import StandardScaler #z-score normalization
df_year = new_dataset[['YearBuilt','YearRemodAdd']]
df_numerical = new_dataset.drop(object_cols, axis=1).drop(['YearBuilt','YearRemodAdd','SalePrice'],axis=1).copy()
scaler = StandardScaler()
df_num_norm = pd.DataFrame(scaler.fit_transform(df_numerical),columns=df_numerical.columns)
df_num_norm.index = new_dataset.index
df_final = pd.concat([df_num_norm,df_year, OH_cols], axis=1)
display(df_final)
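
For reference, z-score normalization subtracts the column mean from each value and divides by the column standard deviation. The sketch below checks this against StandardScaler on a toy column; the numbers are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy column, made up for illustration
values = np.array([[1.0], [2.0], [3.0], [4.0]])

scaled = StandardScaler().fit_transform(values)

# The same computation by hand, using the population
# standard deviation (NumPy's default, ddof=0)
manual = (values - values.mean(axis=0)) / values.std(axis=0)

print(np.allclose(scaled, manual))
```

After scaling, the column has a mean of 0 and a standard deviation of 1.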

## Splitting Dataset into Training and Testing
We split the data into X and Y: Y is the SalePrice column and the remaining columns are X. We then keep 80% of the records for training and hold out 20% for validation.

In [None]:
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

X = df_final#.drop(['SalePrice'], axis=1)
Y = new_dataset['SalePrice']

# Split the training set into
# training and validation set
X_train, X_valid, Y_train, Y_valid = train_test_split(
    X, Y, train_size=0.8, test_size=0.2, random_state=0)

## Model Training and Evaluation
As the house price is a continuous value, we will use regression models to predict it. We will train the prediction model with 3 popular machine learning algorithms and compare their results:

1) SVM-Support Vector Machine
2) Random Forest Regressor
3) Linear Regressor

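To measure the error, we will use the mean absolute error (**mean_absolute_error**) normalised by the range of the training target values, together with the R-squared score (**r2_score**). Both functions can be imported from the sklearn.metrics module.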
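
The model cells in this session report the mean absolute error divided by the range of the target values, plus the R-squared score. The sketch below writes both out on toy numbers (made up for illustration). For simplicity it takes the range from the same toy array, whereas the cells in this session take it from the training targets.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Toy targets and predictions, made up for illustration
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 310.0, 390.0])

# Mean absolute error, then normalised by the target range
mae = mean_absolute_error(y_true, y_pred)
nmae = mae / (y_true.max() - y_true.min())

# R-squared by hand: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"Normalised MAE: {nmae}")
print(f"R-squared: {r2_manual}")
```

A lower normalised MAE and an R-squared closer to 1 both indicate a better fit.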

## SVM – Support Vector Machine
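SVM can be used for both regression and classification. For regression (SVR), it fits a hyperplane in the n-dimensional feature space so that as many training points as possible lie within a margin of width epsilon around it.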

In [None]:
from sklearn import svm
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score


# Note: gamma has no effect when kernel="linear"
model_SVR = svm.SVR(kernel="linear", C=10, epsilon = 0.1, gamma = 0.1)
model_SVR.fit(X_train,Y_train)
Y_pred = model_SVR.predict(X_valid)

# Calculate MAE and normalise it by the range of the training targets
mae = mean_absolute_error(Y_valid, Y_pred)
range_y = max(Y_train) - min(Y_train)
nmae = mae / range_y
print(f"Normalised MAE: {nmae}")

# Calculate R-squared
r2 = r2_score(Y_valid, Y_pred)
print(f"R-squared: {r2}")

In [None]:
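from sklearn.decomposition import PCA

plt.figure(figsize=(10, 6))

# Project the validation features onto their first principal component
# so that they can be shown on a single x-axis
pca = PCA(n_components=1)
X_pca = pca.fit_transform(X_valid)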
# Plot the true values
plt.scatter(X_pca, Y_valid, color='blue', label='True values')

# Plot the predicted values
plt.scatter(X_pca, Y_pred, color='red', label='Predicted values')
# Add title and labels
plt.title('SVM Regression: Predictions vs Real Values')
plt.xlabel('Feature')
plt.ylabel('Target Value')
plt.legend()

# Show the plot
plt.show()

## Random Forest Regression
Random Forest is an ensemble technique that combines multiple decision trees and can be used for both regression and classification tasks.
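
To make the ensemble idea concrete, the sketch below fits a small forest on synthetic data and checks that the forest's prediction is the average of the predictions of its individual trees. The data, the number of trees and the random seeds are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, generated for illustration only
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 2))
y = 3.0 * X[:, 0] + rng.normal(0, 1, size=50)

forest = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

# Predict with every tree separately, then average across trees
tree_preds = np.stack([tree.predict(X) for tree in forest.estimators_])
averaged = tree_preds.mean(axis=0)

print(np.allclose(averaged, forest.predict(X)))
```

Averaging many trees, each trained on a different bootstrap sample, tends to reduce the variance of a single decision tree.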

In [None]:
from sklearn.ensemble import RandomForestRegressor

model_RFR = RandomForestRegressor(n_estimators=100)
model_RFR.fit(X_train, Y_train)
Y_pred = model_RFR.predict(X_valid)

#Calculate MAE
mae = mean_absolute_error(Y_valid, Y_pred)
range_y = max(Y_train) - min(Y_train)
nmae = mae / range_y
print(f"Normalised MAE: {nmae}")

# Calculate R-squared
r2 = r2_score(Y_valid, Y_pred)
print(f"R-squared: {r2}")

In [None]:
plt.figure(figsize=(10, 6))

pca = PCA(n_components=1)
X_pca = pca.fit_transform(X_valid)
# Plot the true values
plt.scatter(X_pca, Y_valid, color='blue', label='True values')

# Plot the predicted values
plt.scatter(X_pca, Y_pred, color='red', label='Predicted values')
# Add title and labels
plt.title('RF Regression: Predictions vs Real Values')
plt.xlabel('Feature')
plt.ylabel('Target Value')
plt.legend()

# Show the plot
plt.show()

## Linear Regression
Linear Regression predicts the dependent output value as a linear combination of the given independent features. For example, here we have to predict SalePrice depending on features like MSSubClass, YearBuilt, BldgType, Exterior1st etc.
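
As a minimal sketch of what the fitted model computes, the example below fits a linear regression on toy data and reproduces its predictions from the learned coefficients and intercept. The data is made up so that the true relationship is known.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data, made up so that y = 2*x1 + 5*x2 + 1 exactly
X = np.array([[1.0, 0.0], [2.0, 1.0], [3.0, 0.0], [4.0, 1.0]])
y = 2.0 * X[:, 0] + 5.0 * X[:, 1] + 1.0

model = LinearRegression().fit(X, y)

# A prediction is the weighted sum of the features plus the intercept
manual = X @ model.coef_ + model.intercept_

print(model.coef_, model.intercept_)
print(np.allclose(manual, model.predict(X)))
```

Each coefficient tells how much the prediction changes when the corresponding feature increases by one unit and the others stay fixed.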

In [None]:
from sklearn.linear_model import LinearRegression

model_LR = LinearRegression()
model_LR.fit(X_train, Y_train)
Y_pred = model_LR.predict(X_valid)

#Calculate MAE
mae = mean_absolute_error(Y_valid, Y_pred)
range_y = max(Y_train) - min(Y_train)
nmae = mae / range_y
print(f"Normalised MAE: {nmae}")

# Calculate R-squared
r2 = r2_score(Y_valid, Y_pred)
print(f"R-squared: {r2}")

In [None]:
plt.figure(figsize=(10, 6))

pca = PCA(n_components=1)
X_pca = pca.fit_transform(X_valid)
# Plot the true values
plt.scatter(X_pca, Y_valid, color='blue', label='True values')

# Plot the predicted values
plt.scatter(X_pca, Y_pred, color='red', label='Predicted values')
# Add title and labels
plt.title('Linear Regression: Predictions vs Real Values')
plt.xlabel('Feature')
plt.ylabel('Target Value')
plt.legend()

# Show the plot
plt.show()

### Save model, encoder, scaler, and more...



In [None]:
import pickle
with open("regressor.pkl", "wb") as f:
    pickle.dump(model_LR, f)
with open("onehot.pkl", "wb") as f:
    pickle.dump(OH_encoder, f)
with open("regress_norm.pkl", "wb") as f:
    pickle.dump(scaler, f)

In [None]:
min_values = {}
max_values = {}
categories = {}

# Loop through the DataFrame columns
for col in dataset.columns:
    if dataset[col].dtype in ['int64', 'float64']:  # Numerical columns
        min_values[col] = dataset[col].min()
        max_values[col] = dataset[col].max()
    elif dataset[col].dtype.name == 'object':  # Categorical columns
        categories[col] = dataset[col].astype('category').cat.categories.tolist()

# Print the results
print("Min values:", min_values)
print("Max values:", max_values)
print("Categories:", categories)

# Combine the min, max, and categories into a dictionary
min_max_data = {
    'Min values': min_values,
    'Max values': max_values,
    'Categories': categories
}

# Save to a Pickle file
with open('min_max_and_categories.pkl', 'wb') as f:
    pickle.dump(min_max_data, f)

print("Data saved to min_max_and_categories.pkl")
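
Saved objects can be read back with pickle.load. The sketch below shows a round trip on a toy model rather than on the files written above: the data, the temporary folder and the file name toy_regressor.pkl are assumptions made for illustration.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Toy model, made up for illustration
X = np.array([[0.0], [1.0], [2.0]])
y = np.array([1.0, 3.0, 5.0])
model = LinearRegression().fit(X, y)

# Save to a temporary location (hypothetical file name)
path = os.path.join(tempfile.mkdtemp(), "toy_regressor.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

# Load it back and confirm that it predicts the same values
with open(path, "rb") as f:
    restored = pickle.load(f)

print(np.allclose(restored.predict(X), model.predict(X)))
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that was used to save them.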


## Take Home Exercise
Work on the Life Satisfaction data and GDP per capita data, which have been provided for you. The CSV files were obtained publicly from:
* (http://stats.oecd.org/index.aspx?DataSetCode=BLI) (Life Satisfaction data from OECD)
* (http://goo.gl/j1MSKe) (GDP per capita data from IMF)

**Goal**: We want to find out if money makes people happy. To take it further: If we knew the GDP per capita of a certain country, can we predict its level of life satisfaction?

Try the following:
* Join the data tables and sort by GDP per capita.
* Plot the data for selected countries to visualize its distribution. Find out if there's any trend.
* Model "Life satisfaction" as a function of "GDP per capita".
* Find the parameters of the model through training.
* Test out the model by evaluating it with some data.