<a href="https://colab.research.google.com/github/2403A51L42/23CSBTB52/blob/main/LA4_HousePrice.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [3]:
import numpy as np            # Importing the NumPy library for numerical operations
import pandas as pd           # Importing the pandas library for data manipulation and analysis
import matplotlib.pyplot as plt  # Importing matplotlib's pyplot for data visualization
import seaborn as sns         # Importing seaborn for statistical data visualization


In [7]:
from google.colab import drive       # Importing the drive module from Google Colab to interact with Google Drive
drive.mount('/content/drive')        # Mounting Google Drive to the '/content/drive' path to access its files


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [8]:
file_path = '/content/drive/MyDrive/USA_Housing.csv'

In [9]:
df = pd.read_csv(file_path)    # Reading a CSV file from the specified 'file_path' into a pandas DataFrame
df.head()                      # Displaying the first 5 rows of the DataFrame to get a quick overview of the data


FileNotFoundError: [Errno 2] No such file or directory: '/content/drive/MyDrive/USA_Housing.csv'
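
The `FileNotFoundError` above means no file exists at that path in the mounted Drive. One way to avoid the traceback is to check a few candidate locations before calling `pd.read_csv`. The following is a minimal sketch using only the standard library; the helper name `first_existing` and the second candidate path are hypothetical and not part of the original notebook.

```python
from pathlib import Path

def first_existing(candidates):
    """Return the first candidate path that exists as a file, or None."""
    for candidate in candidates:
        path = Path(candidate)
        if path.is_file():
            return path
    return None

# Hypothetical candidate locations; adjust to wherever the CSV was uploaded.
candidates = [
    '/content/drive/MyDrive/USA_Housing.csv',
    '/content/USA_Housing.csv',
]
found = first_existing(candidates)
print("CSV found at:", found)  # None when no candidate exists
```

If `found` is not `None`, it can be passed to `pd.read_csv` in place of `file_path`.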

In [None]:
df.info(verbose=True)    # Provides a summary of the DataFrame, including data types, non-null counts, and memory usage
                         # 'verbose=True' ensures detailed output for each column


In [None]:
df.describe(percentiles=[0.1, 0.25, 0.5, 0.75, 0.9])  # Generates summary statistics for numeric columns
                                                      # 'percentiles' specifies custom percentiles (10%, 25%, 50%, 75%, 90%)


In [None]:
df.columns    # Returns the column names of the DataFrame as an Index object


In [None]:
sns.pairplot(df)    # Creates pairwise scatter plots for each numeric variable in the DataFrame using Seaborn
                    # Helps in visualizing relationships between variables and identifying patterns or correlations


In [None]:
df['Price'].plot.hist(bins=25, figsize=(8, 4))  # Creates a histogram of the 'Price' column with 25 bins
                                                # The 'figsize' parameter adjusts the size of the plot to 8x4 inches


In [None]:
df['Price'].plot.density()  # Plots a kernel density estimate (KDE) for the 'Price' column
                            # Provides a smoothed estimate of the distribution of the data


In [None]:
# Drop specific columns by name (e.g., 'Address')
df_cleaned = df.drop(columns=['Address'])  # Replace 'Address' with the actual column name


In [None]:
# Now compute the correlation
df_cleaned.corr()

**A heatmap is a data visualization technique** that displays data in a matrix format where individual values are represented by colors.

 This allows you to **easily visualize the magnitude** of values across a matrix or grid.

In [None]:
plt.figure(figsize=(10, 7))                    # Creates a new figure with a size of 10x7 inches
sns.heatmap(df_cleaned.corr(), annot=True, linewidths=2)  # Generates a heatmap of the correlation matrix for 'df_cleaned'
                                                         # 'annot=True' displays the correlation coefficients in each cell
                                                         # 'linewidths=2' sets the width (in points) of the lines that separate the cells


In [10]:
l_column = list(df.columns)    # Converts the column names of the DataFrame into a list
len_feature = len(l_column)   # Calculates the number of columns in the DataFrame (length of the column list)
l_column                      # Displays the list of column names


NameError: name 'df' is not defined

**Put all the numerical features in X and Price in y;
ignore Address, which is a string and cannot be used in linear regression**

In [None]:
X = df[l_column[0:len_feature-2]]    # Selects all columns from the DataFrame except the last two columns for features (X)
y = df[l_column[len_feature-2]]      # Selects the second-to-last column as the target variable (y)
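
The positional slicing above relies on `Price` and `Address` being the last two columns. A sketch of the same selection by dtype and column name, which does not depend on column order, is shown here on a tiny made-up frame; the feature names `Rooms` and `Area Population` and all values are assumptions for illustration only.

```python
import pandas as pd

# Tiny made-up frame with the same kind of layout: numeric features,
# a numeric target, and a string column.
toy = pd.DataFrame({
    'Rooms': [5.0, 6.5, 7.2],
    'Area Population': [23000.0, 40100.0, 36800.0],
    'Price': [1.05e6, 1.50e6, 1.26e6],
    'Address': ['A St', 'B Ave', 'C Rd'],
})

numeric = toy.select_dtypes(include='number')  # drops the string column
X_toy = numeric.drop(columns=['Price'])        # every numeric column except the target
y_toy = numeric['Price']                       # target selected by name

print(list(X_toy.columns))
print(y_toy.shape)
```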


In [None]:
print("Feature set size:", X.shape)    # Prints the dimensions of the feature set (X), showing the number of rows and columns
print("Variable set size:", y.shape)   # Prints the dimensions of the target variable (y), showing the number of rows (typically a single column)


In [None]:
X.head()    # Displays the first 5 rows of the feature set (X) to provide a quick overview


**Import train_test_split function from scikit-learn**

In [None]:
from sklearn.model_selection import train_test_split  # Imports the train_test_split function from scikit-learn for splitting data into training and testing sets


**Create X and y train and test splits in one command using
a split ratio and a random seed**

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
                            test_size=0.3, random_state=123)
                            # Splits the feature set (X) and target variable (y) into training and testing sets
                            # test_size=0.3 specifies that 30% of the data will be used for testing, and 70% for training
                            # random_state=123 ensures reproducibility by setting a fixed seed for random number generation
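
To make the roles of `test_size` and `random_state` concrete, here is a NumPy-only sketch of a seeded shuffle split. It illustrates the idea and is not scikit-learn's implementation; the helper name `shuffled_split` and the rounding rule are assumptions.

```python
import numpy as np

def shuffled_split(n_samples, test_size, seed):
    """Return (train_idx, test_idx) from a seeded shuffle of the row indices."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_test = int(round(test_size * n_samples))  # assumption: round to the nearest row
    return order[n_test:], order[:n_test]

train_idx, test_idx = shuffled_split(10, test_size=0.3, seed=123)
again_train, again_test = shuffled_split(10, test_size=0.3, seed=123)

print(len(train_idx), len(test_idx))         # 7 training rows, 3 test rows
print(np.array_equal(test_idx, again_test))  # same seed, same split
```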


**Check the size and shape of train/test splits (it should be in the ratio as per test_size parameter above)**

In [None]:
print("Training feature set size:", X_train.shape)    # Prints the dimensions of the training feature set (X_train)
print("Test feature set size:", X_test.shape)          # Prints the dimensions of the test feature set (X_test)
print("Training variable set size:", y_train.shape)    # Prints the dimensions of the training target variable (y_train)
print("Test variable set size:", y_test.shape)         # Prints the dimensions of the test target variable (y_test)


**Import linear regression model estimator from scikit-learn and instantiate**

In [None]:
from sklearn.linear_model import LinearRegression  # Imports the LinearRegression class from scikit-learn for linear regression modeling
from sklearn import metrics                        # Imports the metrics module from scikit-learn for evaluating model performance


In [None]:
lm = LinearRegression() # Creating a Linear Regression object 'lm'

In [None]:
lm.fit(X_train, y_train)  # Fits the model in place on the training data; fit() returns the same 'lm' object, so no reassignment is needed

**Check the intercept and coefficients and put them in a DataFrame**

In [None]:
print("The intercept term of the linear model:", lm.intercept_)  # Prints the intercept term of the linear regression model 'lm'


In [None]:
print("The coefficients of the linear model:", lm.coef_)  # Prints the coefficients of the linear regression model 'lm'


In [None]:
cdf = pd.DataFrame(data=lm.coef_, index=X_train.columns, columns=["Coefficients"])
# Creating a DataFrame for the coefficients of the linear model, with feature names as the index and "Coefficients" as the column name
cdf  # Displaying the DataFrame with the coefficients of the linear model

**Calculation of standard errors and t-statistic for the coefficients**

In [None]:
n = X_train.shape[0]  # Number of training samples
k = X_train.shape[1]  # Number of features
dfN = n - k - 1       # Residual degrees of freedom (k slopes plus one intercept)

train_pred = lm.predict(X_train)  # Predicting values using the training feature set
train_error = np.square(train_pred - y_train)  # Squared residuals on the training set
sum_error = np.sum(train_error)  # Residual sum of squares

se = [0.0] * k  # One standard error per feature (no hard-coded length)

for i in range(k):
    col = X_train.iloc[:, i]  # The i-th feature column
    r = sum_error / dfN  # Estimated residual variance
    r = r / np.sum(np.square(col - col.mean()))
    # Dividing by the feature's sum of squared deviations from its mean
    # (single-predictor formula; it ignores correlation between features)
    se[i] = np.sqrt(r)  # Approximate standard error for this coefficient

cdf['Standard Error'] = se  # Adding the standard errors to the DataFrame
cdf['t-statistic'] = cdf['Coefficients'] / cdf['Standard Error']  # Calculating the t-statistic for each coefficient

cdf  # Displaying the DataFrame with coefficients, standard errors, and t-statistics
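
The loop above applies the single-predictor formula to each feature, which ignores correlation between features. For comparison, here is a sketch of the standard multiple-regression computation, where the standard error of coefficient j is the square root of the j-th diagonal entry of σ²(AᵀA)⁻¹ and A is the design matrix with an intercept column. It runs on synthetic data; every name and value in it is an assumption for illustration, not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 3
X = rng.normal(size=(n, k))                    # synthetic features
true_beta = np.array([2.0, -1.0, 0.5])
y = 4.0 + X @ true_beta + rng.normal(scale=0.3, size=n)

A = np.column_stack([np.ones(n), X])           # design matrix with intercept column
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)

residuals = y - A @ beta_hat
sigma2 = residuals @ residuals / (n - k - 1)   # unbiased residual variance
cov = sigma2 * np.linalg.inv(A.T @ A)          # covariance matrix of the estimates
std_err = np.sqrt(np.diag(cov))                # intercept first, then one per feature
t_stat = beta_hat / std_err

print(np.round(beta_hat, 2))
print(np.round(std_err, 3))
```

When the features are nearly uncorrelated, the two approaches should give similar standard errors; they diverge as the correlation between features grows.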


In [None]:
print("Therefore, features arranged in the order of importance for predicting the house price\n", '-'*90, sep='')
# Prints a header indicating that features are being arranged by importance, followed by a line of dashes for separation

l = list(cdf['t-statistic'].abs().sort_values(ascending=False).index)
# Sorts the features by the absolute value of their t-statistic (largest first), so strongly negative effects also rank high

print(' > \n'.join(l))
# Prints the list of feature names, each separated by ' > ' and a newline


In [None]:
l = list(cdf.index)  # Extracts the feature names from the DataFrame index into a list

from matplotlib import gridspec  # Imports gridspec for creating complex grid layouts in matplotlib
fig = plt.figure(figsize=(18, 10))  # Creates a new figure with a size of 18x10 inches
gs = gridspec.GridSpec(2, 3)  # Defines a grid layout with 2 rows and 3 columns for subplots

# Creating one scatter plot per feature (the first five) against 'Price' within the grid layout
for i, feature in enumerate(l[:5]):
    ax = plt.subplot(gs[i])
    ax.scatter(df[feature], df['Price'])  # Plots the feature against 'Price'
    ax.set_title(feature + " vs. Price", fontdict={'fontsize': 20})  # Sets the title of the subplot


**R-square of the model fit**

In [None]:
print("R-squared value of this fit:", round(metrics.r2_score(y_train, train_pred), 3))
# Calculates and prints the R-squared value (coefficient of determination) for the model's fit on the training data
# 'metrics.r2_score' computes the R-squared value, and 'round(..., 3)' rounds the result to three decimal places


Prediction, error estimate, and regression evaluation metrics

**Prediction using the lm model**

In [None]:
predictions = lm.predict(X_test)  # Predicts the target values for the test feature set (X_test) using the trained linear model

print("Type of the predicted object:", type(predictions))
# Prints the type of the 'predictions' object to show it is a NumPy array or similar

print("Size of the predicted object:", predictions.shape)
# Prints the size (shape) of the 'predictions' object to show the number of predicted values


**Scatter plot of predicted price and y_test set to see if the data fall on a 45 degree straight line**



In [None]:
plt.figure(figsize=(10, 7))  # Creates a new figure with a size of 10x7 inches
plt.title("Actual vs. predicted house prices", fontsize=25)  # Sets the title of the plot with a font size of 25
plt.xlabel("Actual test set house prices", fontsize=18)  # Labels the x-axis with a font size of 18
plt.ylabel("Predicted house prices", fontsize=18)  # Labels the y-axis with a font size of 18
plt.scatter(x=y_test, y=predictions)  # Creates a scatter plot of actual test set house prices (x) vs. predicted house prices (y)


**Plotting a histogram of the residuals, i.e. the prediction errors (expect an approximately normal distribution)**

In [None]:
plt.figure(figsize=(10, 7))  # Creates a new figure with a size of 10x7 inches
plt.title("Histogram of residuals to check for normality", fontsize=25)  # Sets the title of the plot with a font size of 25
plt.xlabel("Residuals", fontsize=18)  # Labels the x-axis with a font size of 18
plt.ylabel("Count", fontsize=18)  # Labels the y-axis with a font size of 18 (histplot shows counts by default)
sns.histplot(y_test - predictions, kde=True)  # Creates a histogram of the residuals (differences between actual and predicted values)
                                              # 'kde=True' adds a Kernel Density Estimate (KDE) to visualize the distribution more smoothly


**Scatter plot of residuals and predicted values (Homoscedasticity)**

Homoscedasticity is a key assumption of linear regression: the variance of the error term is constant across all values of the independent variables. In other words, the spread of the noise around the fitted relationship does not change as the predictor values change. In the plot below, this shows up as residuals scattered in a band of roughly constant width.



In [None]:
plt.figure(figsize=(10, 7))  # Creates a new figure with a size of 10x7 inches
plt.title("Residuals vs. predicted values plot (Homoscedasticity)\n", fontsize=25)
# Sets the title of the plot with a font size of 25 and adds a newline for separation
plt.xlabel("Predicted house prices", fontsize=18)  # Labels the x-axis with a font size of 18
plt.ylabel("Residuals", fontsize=18)  # Labels the y-axis with a font size of 18
plt.scatter(x=predictions, y=y_test - predictions)
# Creates a scatter plot of predicted house prices (x) vs. residuals (differences between actual and predicted values)
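
As a rough numeric companion to the plot, one can compare the residual variance in the lower and upper halves of the predicted values. The sketch below is a heuristic on synthetic data, not a formal statistical test; the helper name `variance_ratio` and all values are assumptions.

```python
import numpy as np

def variance_ratio(predicted, residuals):
    """Residual variance for the upper half of predictions over the lower half."""
    order = np.argsort(predicted)
    half = len(order) // 2
    low = residuals[order[:half]]
    high = residuals[order[half:]]
    return np.var(high, ddof=1) / np.var(low, ddof=1)

rng = np.random.default_rng(1)
pred = rng.uniform(1.0, 10.0, size=2000)
constant_noise = rng.normal(scale=1.0, size=2000)  # homoscedastic residuals
growing_noise = rng.normal(size=2000) * pred       # spread grows with the prediction

print(round(variance_ratio(pred, constant_noise), 2))  # close to 1
print(round(variance_ratio(pred, growing_noise), 2))   # well above 1
```

A ratio near 1 is consistent with constant variance; a ratio far from 1 suggests the residual spread changes with the predicted value.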


**Regression evaluation metrics**

In [None]:
print("Mean absolute error (MAE):", metrics.mean_absolute_error(y_test, predictions))
# Calculates and prints the Mean Absolute Error (MAE) between the actual and predicted values for the test set

print("Mean square error (MSE):", metrics.mean_squared_error(y_test, predictions))
# Calculates and prints the Mean Squared Error (MSE) between the actual and predicted values for the test set

print("Root mean square error (RMSE):", np.sqrt(metrics.mean_squared_error(y_test, predictions)))
# Calculates and prints the Root Mean Squared Error (RMSE) between the actual and predicted values for the test set
# RMSE is the square root of the MSE
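
To show what these three metrics compute, here is a hand calculation with NumPy on four made-up values (the arrays are assumptions for illustration).

```python
import numpy as np

actual = np.array([3.0, 5.0, 2.0, 7.0])     # made-up values
predicted = np.array([2.0, 5.0, 4.0, 8.0])

errors = actual - predicted                  # [1, 0, -2, -1]
mae = np.mean(np.abs(errors))                # (1 + 0 + 2 + 1) / 4 = 1.0
mse = np.mean(errors ** 2)                   # (1 + 0 + 4 + 1) / 4 = 1.5
rmse = np.sqrt(mse)                          # square root of the MSE

print(mae, mse, rmse)
```

Because RMSE is in the same units as the target, it is usually easier to interpret than MSE when the target is a price.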


**R-square value**

In [None]:
print("R-squared value of predictions:", round(metrics.r2_score(y_test, predictions), 3))
# Calculates and prints the R-squared value (coefficient of determination) for the model's predictions on the test set
# 'metrics.r2_score' computes the R-squared value, and 'round(..., 3)' rounds the result to three decimal places
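
The R-squared value can also be written out directly as one minus the ratio of the residual sum of squares to the total sum of squares. A small sketch on made-up numbers (the arrays are assumptions for illustration):

```python
import numpy as np

actual = np.array([1.0, 2.0, 3.0, 4.0])         # made-up values
predicted = np.array([1.1, 1.9, 3.2, 3.8])

ss_res = np.sum((actual - predicted) ** 2)       # residual sum of squares
ss_tot = np.sum((actual - actual.mean()) ** 2)   # total sum of squares around the mean
r2 = 1.0 - ss_res / ss_tot

print(round(r2, 3))  # 0.98
```

A model that always predicts the mean of `actual` would score 0 under this definition, and a perfect model would score 1.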


In [None]:
# Compute the smallest value of predictions/6000 and the largest value of predictions/12000
# (named pred_min/pred_max so that Python's built-in min() and max() are not shadowed)
pred_min = np.min(predictions / 6000)
pred_max = np.max(predictions / 12000)
print(pred_min, pred_max)

In [None]:
# Compute the Min-Max value for Price=100
L = (100 - pred_min) / (pred_max - pred_min)
# Applies the Min-Max formula to a price of 100, using 'pred_min' and 'pred_max' from the previous cell as the bounds
# (these come from the scaled predictions, not from the 'Price' column itself)

print(L)  # Prints the normalized value for Price=100

plt.hist(L)  # L is a single number, so this histogram contains one bar
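
For reference, Min-Max normalization is usually applied to a whole array, using that array's own minimum and maximum, so that the smallest value maps to 0 and the largest to 1. A minimal sketch on made-up prices follows; the helper name `min_max_scale` and the values are assumptions.

```python
import numpy as np

def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    if hi == lo:
        raise ValueError("all values are equal; min-max scaling is undefined")
    return (values - lo) / (hi - lo)

prices = np.array([100.0, 150.0, 200.0, 300.0])  # made-up prices
scaled = min_max_scale(prices)
print(scaled)
```

Passing an array like `scaled` to `plt.hist` would give a histogram with more than one bar.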
