# HW03 - Simple Regression with Real Estate Data

## Simple Linear Regression
Simple linear regression is a statistical method for modeling the relationship between a dependent variable and a single independent variable. It is one of the most commonly used regression techniques.
Mathematically, it can be expressed as:
$$y = \beta_0 + \beta_1 x + \epsilon$$
where:
- $y$ is the dependent variable (response variable)
- $x$ is the independent variable (predictor variable)
- $\beta_0$ is the y-intercept of the regression line
- $\beta_1$ is the slope of the regression line
- $\epsilon$ is the error term, which the residuals of a fitted model estimate
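
As a minimal sketch of how the two coefficients can be estimated, ordinary least squares has a closed-form solution for one predictor: the slope is the covariance of $x$ and $y$ divided by the variance of $x$, and the intercept follows from the means. The data below are synthetic and the variable names are illustrative assumptions, not part of the assignment.

```python
import numpy as np

# Synthetic data (an assumption for illustration): y = 2 + 3x plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 2.0 + 3.0 * x + rng.normal(0, 1, size=200)

# Closed-form OLS estimates for a single predictor
beta_1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta_0 = y.mean() - beta_1 * x.mean()

# Residuals are the observed values minus the fitted values
residuals = y - (beta_0 + beta_1 * x)

print(f"intercept = {beta_0:.3f}, slope = {beta_1:.3f}")
```

The estimates should land close to the true values of 2 and 3 used to generate the data, and the residuals of a model with an intercept average to zero.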

There are several assumptions that need to be met for simple linear regression to be valid:
1. **Linearity**: The relationship between the independent and dependent variable is linear.
2. **Independence**: The observations are independent of each other.
3. **Homoscedasticity**: The residuals (errors) have constant variance across all levels of the independent variable.
4. **Normality**: The residuals are normally distributed.

If these assumptions are not met, the results of the regression analysis may be invalid or misleading.

In order to check these assumptions, we can use various diagnostic plots and statistical tests.
1. **Linearity**: We can use scatter plots to visualize the relationship between the independent and dependent variables.
2. **Independence**: We can check the independence of observations by examining the data collection process and ensuring that there is no autocorrelation.  
3. **Homoscedasticity**: We can use residual plots to check for constant variance of the residuals.
4. **Normality**: We can use Q-Q plots or statistical tests like the Shapiro-Wilk test to check if the residuals are normally distributed.

If the assumptions are violated, we may need to consider alternative regression methods or transformations of the data. If they are met, we will focus on the accuracy of the model, which can be evaluated using metrics such as R-squared, adjusted R-squared, and root mean squared error (RMSE).
1. **R-squared**: This metric indicates the proportion of variance in the dependent variable that can be explained by the independent variable. It ranges from 0 to 1, with higher values indicating a better fit.
2. **Adjusted R-squared**: This metric adjusts the R-squared value for the number of predictors in the model, providing a more accurate measure of model fit when multiple independent variables are used.
3. **Root Mean Squared Error (RMSE)**: This metric measures the average magnitude of the residuals, providing an indication of how well the model predicts the dependent variable. Lower RMSE values indicate a better fit.
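
The three metrics can be computed directly from the observed and predicted values. The sketch below uses a hypothetical helper, `regression_metrics`, and toy numbers chosen only for illustration. It divides the squared error by $n$ when computing RMSE; some libraries divide by the residual degrees of freedom instead, which gives a slightly larger value.

```python
import numpy as np

def regression_metrics(y_true, y_pred, n_predictors=1):
    """Return R-squared, adjusted R-squared, and RMSE (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = y_true.size
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    r2 = 1.0 - ss_res / ss_tot
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)
    rmse = np.sqrt(ss_res / n)
    return r2, adj_r2, rmse

# Toy values (assumed for illustration only)
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 6.0, 6.0, 10.0]
r2, adj_r2, rmse = regression_metrics(y_true, y_pred)
print(f"R2 = {r2:.3f}, adjusted R2 = {adj_r2:.3f}, RMSE = {rmse:.3f}")
# prints: R2 = 0.800, adjusted R2 = 0.700, RMSE = 1.000
```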

While R-squared and adjusted R-squared are useful for evaluating the overall fit of the model, RMSE is particularly useful for assessing prediction accuracy because it is expressed in the same units as the dependent variable. A high R-squared does not necessarily mean that the model predicts accurately, since the model may be overfitting the data. For that reason, RMSE should be reported alongside R-squared rather than relying on R-squared alone.
For this assignment, we will focus on building a simple linear regression model using the real estate data and evaluating its performance using the metrics mentioned above.

**Importing Libraries and Loading Data**

We will be using the following libraries:
1. `pandas`: for data manipulation and analysis
2. `numpy`: for numerical operations
3. `matplotlib.pyplot`: for data visualization
4. `seaborn`: for statistical data visualization
5. `statsmodels.api`: for statistical modeling and regression analysis
6. `statsmodels`: for statistical tests and diagnostics

After importing the libraries, we will load the real estate data from a CSV file using `pandas`. The dataset contains information about various properties, including their prices and features.




In [3]:
# Importing Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels


# print(plt.style.available)
# From the list of available styles, 'fivethirtyeight' was chosen for its clean and modern look
plt.style.use('fivethirtyeight')



In [6]:
# Importing the dataset
df_realestate = pd.read_csv('data/Real Estate Data - Week 3.csv', index_col=0, header=0)


## Section 5a: Create the X and y Datasets and Simple Regression

Task at hand:
- Create the X and y datasets, where X is `Living Area Above Grade` and y is `Sale Price`.
- Create a simple linear regression model using `statsmodels` and name it `reg`.
- Use X to generate predictions from the model and name them `pred`.
- Use the model (`reg`) to compute the residuals and name them `resid`.
- Show the summary of the results.

In [7]:
# Create the X and y datasets
X = df_realestate[['Living Area Above Grade']]
y = df_realestate['Sale Price']

# Add a constant to the independent variable
X = sm.add_constant(X)

# Create the simple linear regression model
reg = sm.OLS(y, X).fit()

# Use the model to predict
pred = reg.predict(X)

# Get the residuals from the fitted model (equal to y - pred)
resid = reg.resid

# Show the summary of the results
print(reg.summary())

                            OLS Regression Results                            
Dep. Variable:             Sale Price   R-squared:                       0.463
Model:                            OLS   Adj. R-squared:                  0.463
Method:                 Least Squares   F-statistic:                     1199.
Date:                Tue, 03 Jun 2025   Prob (F-statistic):          7.76e-190
Time:                        17:02:54   Log-Likelihood:                -17076.
No. Observations:                1390   AIC:                         3.416e+04
Df Residuals:                    1388   BIC:                         3.417e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                              coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------
const                    3

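From the summary, the fitted model is:
$$\hat{y} \approx 33690 + 96.93 \times \text{Living Area Above Grade}$$
where $\hat{y}$ is the predicted `Sale Price`. Each additional unit of living area is associated with an increase of about 96.93 in the predicted sale price, and the R-squared of 0.463 means the model explains about 46% of the variance in sale price.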



## Section 5b: Create a Regression Table
Task at hand:
- Create a regression table that lets us view the X and y variables alongside the predicted values and the residuals.
- Write a code block that builds a table named `df_reg` containing the following columns:
  - `Living Area Above Grade`
  - `Sale Price`
- Concatenate `df_reg`, `pred`, and `resid`, keep the name `df_reg` for the final dataset, and print its head.
- Rename the new columns to `Sale Price Predicted` and `Residuals`.

In [9]:
# Create a regression table
df_reg = pd.concat([X['Living Area Above Grade'], y, pred, resid], axis=1)
df_reg.columns = ['Living Area Above Grade', 'Sale Price', 'Sale Price Predicted', 'Residuals']

# Display the first few rows of the regression table
print(df_reg.head(10))

    Living Area Above Grade  Sale Price  Sale Price Predicted      Residuals
Id                                                                          
1                      1710      208500         199432.115140    9067.884860
2                      1262      181500         156009.099394   25490.900606
3                      1786      223500         206798.519597   16701.480403
4                      1717      140000         200110.599761  -60110.599761
5                      2198      250000         246732.185862    3267.814138
6                      1362      143000         165701.736837  -22701.736837
7                      1694      307000         197881.293149  109118.706851
8                      2090      200000         236264.137424  -36264.137424
9                      1774      129900         205635.403103  -75735.403103
10                     1077      118000         138077.720124  -20077.720124
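
As a quick sanity check on the table, the sketch below copies the first three printed rows into a small DataFrame (the name `rows` is illustrative). It confirms that each residual equals the sale price minus the prediction, then recovers the slope and intercept from two predicted values, since all predictions lie on the fitted line.

```python
import numpy as np
import pandas as pd

# First three rows copied from the printed regression table
rows = pd.DataFrame(
    {
        "Living Area Above Grade": [1710, 1262, 1786],
        "Sale Price": [208500, 181500, 223500],
        "Sale Price Predicted": [199432.115140, 156009.099394, 206798.519597],
        "Residuals": [9067.884860, 25490.900606, 16701.480403],
    },
    index=pd.Index([1, 2, 3], name="Id"),
)

# Residuals should equal observed minus predicted
recomputed = rows["Sale Price"] - rows["Sale Price Predicted"]
residuals_match = np.allclose(recomputed, rows["Residuals"], atol=1e-4)

# Predictions lie on a straight line, so two rows recover slope and intercept
x = rows["Living Area Above Grade"].to_numpy(dtype=float)
p = rows["Sale Price Predicted"].to_numpy()
slope = (p[1] - p[0]) / (x[1] - x[0])
intercept = p[0] - slope * x[0]

# The third row should fall on the same line
third_on_line = np.isclose(intercept + slope * x[2], p[2], atol=1e-3)

print(residuals_match, third_on_line)
print(f"slope = {slope:.2f}, intercept = {intercept:.0f}")
```

The recovered slope is about 96.93 and the intercept about 33688, which agrees with the rounded coefficients reported for the fitted model.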


## Section 5c: Create a Regression Plot (regplot)
Task at hand:
- Create a regression plot to visualize the relationship between `Living Area Above Grade` and `Sale Price`.

The plot needs to include the following:
- x is `Living Area Above Grade`
- y is `Sale Price`
- data is `df_reg`
- `scatter_kws = {'color': 'green', 'alpha': 0.15, 's': 50}`
- `line_kws = {'color': 'black'}`
- Set the title to `Regression Fit plot for Sale Price and Living Area` with font size 18 pt, centered.
- Set the xlabel to `Living Area Above Grade` with font size 14 pt, centered.
- Set the ylabel to `Sale Price` with font size 14 pt, centered.

In [None]:
# Create a regression plot
plt.figure(figsize=(10, 6))
sns.regplot(x='Living Area Above Grade', y='Sale Price', data=df_reg,
            scatter_kws={'color': 'green', 'alpha': 0.15, 's': 50},
            line_kws={'color': 'black'})

# Add title and labels
plt.title('Regression Fit plot for Sale Price and Living Area', fontsize=18, loc='center')
plt.xlabel('Living Area Above Grade', fontsize=14, labelpad=10)
plt.ylabel('Sale Price', fontsize=14, labelpad=10)

# Show the plot
plt.show()