# Proof of Independence of Errors

The purpose of this notebook is to check if the residuals (errors) from the regression model are independent of each other.

The 'Independence of Errors' assumption in regression analysis states that the residuals from the model (the differences between the observed values and the values predicted by the model) should not be correlated with each other. This is crucial because correlated errors can indicate that information about the behavior of the system is missing from the model, and they make the usual OLS standard errors unreliable.

Several methods can be employed to verify the independence of errors, including:

- Durbin-Watson Test
- Ljung-Box Test
- Breusch-Godfrey Test
- Runs Test

In this notebook, I will perform the **Durbin-Watson Test**.

The **Durbin-Watson** statistic tests for the presence of first-order (lag-1) autocorrelation in the residuals from a regression analysis. The value of the test statistic ranges from 0 to 4, where a value of 2 suggests no autocorrelation, values less than 2 suggest positive autocorrelation, and values greater than 2 suggest negative autocorrelation.
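
As a rough illustration of what the statistic measures, the sketch below computes it directly from its textbook definition, $d = \sum_{t=2}^{n}(e_t - e_{t-1})^2 / \sum_{t=1}^{n} e_t^2$, where $e_t$ are the residuals. The helper name `durbin_watson_manual` and the toy residual arrays are my own assumptions for illustration; the analysis in this notebook relies on `durbin_watson` from statsmodels instead.

```python
import numpy as np

def durbin_watson_manual(resid):
    """Hypothetical helper: Durbin-Watson statistic from its textbook definition."""
    resid = np.asarray(resid, dtype=float)
    # Numerator: squared differences between consecutive residuals
    squared_diffs = np.diff(resid) ** 2
    # Denominator: sum of squared residuals
    return float(squared_diffs.sum() / (resid ** 2).sum())

# Toy residuals (assumed data, not taken from the housing dataset)
rng = np.random.default_rng(0)
independent = rng.normal(size=5000)        # no autocorrelation
random_walk = np.cumsum(independent)       # strong positive autocorrelation
alternating = np.tile([1.0, -1.0], 2500)   # strong negative autocorrelation

for name, resid in [("independent", independent),
                    ("random walk", random_walk),
                    ("alternating", alternating)]:
    print(f"{name:12s} DW = {durbin_watson_manual(resid):.3f}")
```

With these inputs the three values should land near 2, near 0, and near 4 respectively, matching the interpretation given above.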

In [20]:
import pandas as pd 
import statsmodels.api as sm

from statsmodels.stats.stattools import durbin_watson


In [8]:
# Read the cleaned dataset file into a dataframe 
df = pd.read_csv('../data/cleaned_df.zip', compression='zip', index_col=0)
df.head()

Unnamed: 0,location,bath,balcony,price,House_size,new_total_sqft
2,Uttarahalli,2.0,3.0,62.0,3.0,1440.0
3,Lingadheeranahalli,3.0,1.0,95.0,3.0,1521.0
4,Kothanur,2.0,1.0,51.0,2.0,1200.0
8,Marathahalli,3.0,1.0,63.25,3.0,1310.0
10,Whitefield,2.0,2.0,70.0,3.0,1800.0


In [9]:
df.shape

(6996, 6)

### Feature Engineering

Find the number of unique values in the categorical column `location`.

In [10]:
len(df['location'].unique())

849

Seeing that the categorical column `location` has over **800** unique values, using `One-Hot Encoding` on this column will increase the dimensionality of the dataset. High dimensionality can lead to the *curse of dimensionality*, where models have a hard time learning patterns due to the vast feature space.

To prevent this, I'll create a `group_location` function that groups every category representing less than a set threshold of the dataset (default 0.01, i.e. 1%) into an `Other` category.

In [11]:
def group_location(threshold=0.01):
    '''
    Group rare locations into the general category 'Other'.

    Every unique location whose share of the rows/observations does not meet
    the set threshold is replaced with 'Other'. Note that this updates the
    `location` column of the global dataframe `df`.

    Input:
    threshold - float between 0 and 1 (share of rows; default 0.01, i.e. 1%)

    Return:
    The result of value_counts() on the updated location column, i.e. the
    remaining unique categories and the number of rows in each.
    '''
    # Share of rows that each location represents
    counts = df['location'].value_counts(normalize=True)

    # Get the categories that represent less than the set threshold
    other_categories = counts[counts < threshold].index

    # Replace these categories with 'Other'
    df['location'] = df['location'].replace(other_categories, 'Other')

    return df['location'].value_counts()

In [12]:
group_location()

location
Other                    4971
Whitefield                314
Sarjapur  Road            222
Kanakpura Road            166
Electronic City           155
Uttarahalli               149
Thanisandra               136
Yelahanka                 133
Raja Rajeshwari Nagar     130
Marathahalli              116
Bannerghatta Road         108
Haralur Road              103
7th Phase JP Nagar        102
Hebbal                    101
Hennur Road                90
Name: count, dtype: int64
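
Because `group_location` reads and overwrites the global `df`, it is harder to reuse or test in isolation. A minimal sketch of a side-effect-free variant is shown below. The name `group_rare_categories`, its parameters, and the toy series are hypothetical and only illustrate the idea; they are not part of the original analysis.

```python
import pandas as pd

def group_rare_categories(series, threshold=0.01, other_label="Other"):
    """Hypothetical helper: return a copy of `series` with rare categories merged."""
    # Share of rows that each category represents
    shares = series.value_counts(normalize=True)
    rare = shares[shares < threshold].index
    # Keep values that are not rare; replace the rest with `other_label`
    return series.where(~series.isin(rare), other_label)

# Toy data (assumed): shares are 0.6, 0.3 and 0.1
toy_locations = pd.Series(["Whitefield"] * 6 + ["Hebbal"] * 3 + ["Kothanur"])
grouped = group_rare_categories(toy_locations, threshold=0.2)
print(grouped.value_counts())
```

In this notebook, such a helper could be applied as `df['location'] = group_rare_categories(df['location'])`, which leaves the decision to overwrite the column with the caller.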

#### Encode the categorical column

In [13]:
df_encoded = pd.get_dummies(df, drop_first=True)
df_encoded.head()

Unnamed: 0,bath,balcony,price,House_size,new_total_sqft,location_Bannerghatta Road,location_Electronic City,location_Haralur Road,location_Hebbal,location_Hennur Road,location_Kanakpura Road,location_Marathahalli,location_Other,location_Raja Rajeshwari Nagar,location_Sarjapur Road,location_Thanisandra,location_Uttarahalli,location_Whitefield,location_Yelahanka
2,2.0,3.0,62.0,3.0,1440.0,False,False,False,False,False,False,False,False,False,False,False,True,False,False
3,3.0,1.0,95.0,3.0,1521.0,False,False,False,False,False,False,False,True,False,False,False,False,False,False
4,2.0,1.0,51.0,2.0,1200.0,False,False,False,False,False,False,False,True,False,False,False,False,False,False
8,3.0,1.0,63.25,3.0,1310.0,False,False,False,False,False,False,True,False,False,False,False,False,False,False
10,2.0,2.0,70.0,3.0,1800.0,False,False,False,False,False,False,False,False,False,False,False,False,True,False


In [14]:
# Convert the boolean dummy columns to integers (0/1)
encoded_cols = [col for col in df_encoded.columns if col.startswith('location_')]
df_encoded[encoded_cols] = df_encoded[encoded_cols].astype(int)

df_encoded.head()

Unnamed: 0,bath,balcony,price,House_size,new_total_sqft,location_Bannerghatta Road,location_Electronic City,location_Haralur Road,location_Hebbal,location_Hennur Road,location_Kanakpura Road,location_Marathahalli,location_Other,location_Raja Rajeshwari Nagar,location_Sarjapur Road,location_Thanisandra,location_Uttarahalli,location_Whitefield,location_Yelahanka
2,2.0,3.0,62.0,3.0,1440.0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
3,3.0,1.0,95.0,3.0,1521.0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
4,2.0,1.0,51.0,2.0,1200.0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
8,3.0,1.0,63.25,3.0,1310.0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
10,2.0,2.0,70.0,3.0,1800.0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
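
The two steps above (encoding, then casting the boolean dummies to integers) can likely be combined, since `pd.get_dummies` accepts a `dtype` argument. The sketch below uses a small made-up dataframe named `toy`, not the housing data, to show the idea.

```python
import pandas as pd

# Toy dataframe (assumed values) with one numeric and one categorical column
toy = pd.DataFrame({
    "price": [62.0, 95.0, 70.0],
    "location": ["Hebbal", "Other", "Whitefield"],
})

# dtype=int makes the dummy columns 0/1 integers instead of booleans;
# drop_first=True drops the first category ("Hebbal") to avoid redundancy
toy_encoded = pd.get_dummies(toy, drop_first=True, dtype=int)
print(toy_encoded)
```

The exact integer width (`int32` or `int64`) depends on the platform, as with `astype(int)`.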


In [15]:
df_encoded.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6996 entries, 2 to 13317
Data columns (total 19 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   bath                            6996 non-null   float64
 1   balcony                         6996 non-null   float64
 2   price                           6996 non-null   float64
 3   House_size                      6996 non-null   float64
 4   new_total_sqft                  6996 non-null   float64
 5   location_Bannerghatta Road      6996 non-null   int32  
 6   location_Electronic City        6996 non-null   int32  
 7   location_Haralur Road           6996 non-null   int32  
 8   location_Hebbal                 6996 non-null   int32  
 9   location_Hennur Road            6996 non-null   int32  
 10  location_Kanakpura Road         6996 non-null   int32  
 11  location_Marathahalli           6996 non-null   int32  
 12  location_Other                  6996 n

In [16]:
df_encoded.shape

(6996, 19)
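
The 19 columns can be sanity-checked with a little arithmetic, which also shows how much the grouping step reduced the feature space. The counts below are read off the outputs above. The variable names are illustrative, and the "without grouping" figure assumes the 849 unique values contain no missing entries.

```python
n_numeric = 5             # bath, balcony, price, House_size, new_total_sqft
n_locations_raw = 849     # unique locations before grouping
n_locations_grouped = 15  # 14 frequent locations + 'Other'

# drop_first=True drops one dummy per categorical column
cols_without_grouping = n_numeric + (n_locations_raw - 1)
cols_with_grouping = n_numeric + (n_locations_grouped - 1)

print("Columns without grouping:", cols_without_grouping)  # 853
print("Columns with grouping:", cols_with_grouping)        # 19
```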

In [17]:
X = df_encoded.drop('price', axis=1)
y = df_encoded['price']


# Add a constant term for the intercept
X = sm.add_constant(X)  


# Fit the OLS model that includes an intercept term
model = sm.OLS(y, X).fit()

In [18]:
# Compute and print the Durbin-Watson statistic
dw_stat = durbin_watson(model.resid)
print("Durbin-Watson statistic:", dw_stat)

Durbin-Watson statistic: 2.038444923584805


#### Interpreting the Durbin-Watson Statistic:
- A Durbin-Watson statistic close to 2.0 suggests no autocorrelation.
- Values approaching 0 indicate positive autocorrelation.
- Values approaching 4 indicate negative autocorrelation.
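
A common rule of thumb links the statistic to the lag-1 autocorrelation of the residuals: $d \approx 2(1 - \hat{\rho})$. Applying it to the value printed above gives a rough sense of how small the implied correlation is. This is only an approximation, and the variable names are illustrative.

```python
dw_stat = 2.038444923584805  # value printed by the notebook

# Approximation: d ≈ 2 * (1 - rho_hat)  =>  rho_hat ≈ 1 - d / 2
rho_hat = 1 - dw_stat / 2
print(f"Implied lag-1 autocorrelation: {rho_hat:.4f}")  # -0.0192
```

An implied correlation of roughly -0.02 is very close to zero, in line with a statistic close to 2.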

**The `Durbin-Watson` statistic of about 2.04 is very close to 2, so the test gives no evidence of first-order autocorrelation in the residuals of the model. Therefore, I fail to reject the null hypothesis of no autocorrelation between the residuals.**

**This result is consistent with the residuals from the regression model being independent of each other, which is one of the critical assumptions of OLS regression.**

