## What the "Normal Distribution of Errors" Assumption Means

The assumption states that the errors (residuals) of a statistical model or measurement process follow a normal (Gaussian) distribution. Many statistical methods rely on it; in linear regression, for example, the usual t-tests and confidence intervals for the coefficients are derived under this assumption.

### Steps to follow

##### Feature Engineering

Create a `group_location` function that groups the categories that each represent less than a set threshold of the dataset (default 0.01, i.e. 1%) into an `Other` category.

##### Fit a Statistical Model:

Fit an appropriate statistical model to the dataset, for example a linear regression model.
    
##### Compute Residuals:

Calculate the residuals by subtracting the values predicted by the model from the observed values (residual = observed - predicted). Residuals represent the errors of the model, and their distribution is what we're interested in.
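
As a minimal sketch of this step, the snippet below fits a least-squares line with NumPy and computes the residuals as observed minus predicted. The data are synthetic and the variable names are illustrative assumptions; they are not taken from the dataset used later in this notebook.

```python
import numpy as np

# Synthetic data (an assumption for illustration only)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 3.0 + 2.0 * x + rng.normal(0, 1.5, size=200)

# Design matrix with an intercept column, then ordinary least squares
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

fitted = X @ beta
residuals = y - fitted  # observed minus predicted

# With an intercept in the model, OLS residuals average to (numerically) zero
print(residuals.mean())
```

The sign convention matters mainly for reading skewness: flipping it mirrors the histogram of residuals.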

##### Visual Inspection:

Plot a histogram of the residuals to visually inspect their distribution. The histogram should provide an initial indication of whether the residuals approximately follow a normal distribution. 

##### Quantile-Quantile (Q-Q) Plot:

Create a Q-Q plot (Quantile-Quantile plot) of the residuals against a theoretical normal distribution. In a Q-Q plot, if the residuals follow a normal distribution, the points should fall approximately along a straight line. Deviations from a straight line suggest departures from normality.
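
To make the "straight line" criterion concrete, here is a small sketch using `scipy.stats.probplot`, which returns the Q-Q points together with a least-squares line and its correlation coefficient `r`. The two samples are synthetic and chosen purely for illustration; an `r` close to 1 corresponds to points lying near a straight line.

```python
import numpy as np
from scipy import stats

# Synthetic samples (assumptions for illustration only)
rng = np.random.default_rng(1)
normal_sample = rng.normal(loc=0.0, scale=2.0, size=500)
skewed_sample = rng.exponential(scale=2.0, size=500)

# probplot returns ((theoretical quantiles, ordered sample values),
# (slope, intercept, r)) where r measures how close the points are to a line
_, (_, _, r_normal) = stats.probplot(normal_sample, dist="norm")
_, (_, _, r_skewed) = stats.probplot(skewed_sample, dist="norm")

print("r for normal sample:", r_normal)
print("r for skewed sample:", r_skewed)
```

The coefficient is only a rough summary; the plot itself shows where the departures occur (tails, skew, outliers).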

##### Statistical Tests:

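Perform formal statistical tests to assess the normality of the residuals. Common tests include the Shapiro-Wilk test, the Kolmogorov-Smirnov test, and the Anderson-Darling test. The null hypothesis in each case is that the data come from a normal distribution. The Shapiro-Wilk and Kolmogorov-Smirnov tests return p-values, while `scipy.stats.anderson` returns a test statistic to compare against critical values at several significance levels. A p-value above the chosen significance level means that normality cannot be rejected; it does not prove that the data are normal.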
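
The decision rule can be sketched as follows, using `scipy.stats.shapiro` on a deliberately skewed synthetic sample. The sample, the seed, and the significance level `alpha` are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic, clearly non-normal sample (an assumption for illustration only)
rng = np.random.default_rng(2)
skewed = rng.exponential(scale=1.0, size=300)

stat, p_value = stats.shapiro(skewed)

alpha = 0.05  # chosen significance level
reject_normality = p_value < alpha

print("W statistic:", stat, "p-value:", p_value)
print("Reject the null hypothesis of normality:", reject_normality)
```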

##### Interpret Results:

Based on the visual inspection and statistical tests, interpret the results. If the residuals exhibit a symmetric distribution around zero with no significant departures from normality according to both visual and statistical assessments, then the assumption of normal distribution of errors is considered met.

##### Report Findings:

Finally, report findings, including any limitations or assumptions made during the analysis.

In [1]:
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import scipy.stats as stats

In [3]:
data = 'cleaned_df.zip'

In [None]:
df = pd.read_csv(data, index_col=0)

In [None]:
df.head()

In [None]:
df.shape

### Feature Engineering

Find the number of unique values in the categorical columns

In [None]:
len(df['location'].unique())

Since the categorical column `location` has over 800 unique values, one-hot encoding it would add hundreds of columns to the dataset. Such high dimensionality can lead to the curse of dimensionality, where models have a hard time learning patterns because the feature space is so large relative to the data.

To prevent this, I'll create a `group_location` function that groups the categories that each represent less than a set threshold of the dataset (default 0.01, i.e. 1%) into an `Other` category.

In [None]:
def group_location(threshold=0.01):
    '''
    Group rare locations into the general category 'Other'.

    Any location whose share of the rows in the global DataFrame `df` is below
    the threshold is replaced with 'Other'. Note that `df` is modified in place.

    Input:
    threshold - float between 0 and 1 (minimum share of rows a location needs
                to keep its own category)

    Return:
    The value counts of the location column after grouping
    '''
    # Share of rows for each unique location
    counts = df['location'].value_counts(normalize=True)

    # Get the categories that represent less than the set threshold
    other_categories = counts[counts < threshold].index

    # Replace these categories with 'Other'
    df['location'] = df['location'].replace(list(other_categories), 'Other')

    return df['location'].value_counts()


In [None]:
group_location()
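
To see the grouping on a small example, the sketch below applies the same idea to a toy Series. The helper `group_rare` and the toy data are hypothetical and exist only for this illustration; unlike `group_location`, the helper takes the Series as an argument and returns a new one instead of modifying a global DataFrame.

```python
import pandas as pd

def group_rare(series, threshold=0.01, other="Other"):
    """Return a copy of `series` with categories below `threshold` share replaced by `other`."""
    shares = series.value_counts(normalize=True)
    rare = shares[shares < threshold].index
    # Keep values that are not rare; replace the rest with the `other` label
    return series.where(~series.isin(rare), other)

# Toy data: A has 60% of the rows, B 30%, C 10%
toy = pd.Series(["A"] * 6 + ["B"] * 3 + ["C"])

# With a 20% threshold, only C falls below it and is grouped into 'Other'
grouped = group_rare(toy, threshold=0.2)
print(grouped.value_counts())
```

Passing the data in explicitly makes the function easier to test and reuse, at the cost of one extra assignment at the call site.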

#### Encode the categorical column

In [None]:
df_encoded = pd.get_dummies(df, drop_first=True)
df_encoded.head()

In [None]:
encoded_cols = [col for col in df_encoded.columns if col.startswith('location_')]

# Convert the boolean dummy columns to 0/1 integers
df_encoded[encoded_cols] = df_encoded[encoded_cols].astype(int)

df_encoded.head()

In [None]:
df_encoded.info()

### Fit a Statistical Model

In [None]:
X = df_encoded.drop('price', axis=1)
y = df_encoded['price']


# Add a constant term for the intercept
X = sm.add_constant(X)  


# Fit the OLS model that includes an intercept term
model = sm.OLS(y, X).fit()

### Compute Residuals

In [None]:
residuals = model.resid
residuals

### Visual Inspection

In [None]:
plt.hist(residuals, bins=30, edgecolor='black')
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.title('Histogram of Residuals')
plt.grid(True)
plt.show()

### Quantile-Quantile (Q-Q) Plot

In [None]:
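# line='s' draws a reference line scaled to the residuals' mean and standard deviation;
# line='45' is only appropriate when the residuals have been standardized
sm.qqplot(residuals, line='s')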
plt.title('Q-Q Plot of Residuals')
plt.xlabel('Theoretical Quantiles')
plt.ylabel('Sample Quantiles')
plt.grid(True)
plt.show()

### Statistical Tests

In [None]:
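# Draw a random sample of 30 residuals for the tests.
# Note: no random_state is set, so the sample (and the test results) change on every run.
res = residuals.sample(30)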
res

In [None]:

# Shapiro-Wilk test
shapiro_test_stat, shapiro_p_value = stats.shapiro(res)
print("Shapiro-Wilk Test - Test Statistic:", shapiro_test_stat, " p-value:", shapiro_p_value)

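# Kolmogorov-Smirnov test
# A bare 'norm' compares against the standard normal N(0, 1); pass the sample's
# mean and standard deviation so the residuals are compared on their own scale.
# (Estimating these from the data makes the p-value conservative.)
ks_test_stat, ks_p_value = stats.kstest(res, 'norm', args=(res.mean(), res.std()))
print("Kolmogorov-Smirnov Test - Test Statistic:", ks_test_stat, " p-value:", ks_p_value)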

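# Anderson-Darling test (returns critical values instead of a p-value)
anderson_test_stat, anderson_critical_values, anderson_significance_levels = stats.anderson(res)
print("Anderson-Darling Test - Test Statistic:", anderson_test_stat)
print("Anderson-Darling Test - Critical Values:", anderson_critical_values)
print("Anderson-Darling Test - Significance Levels (%):", anderson_significance_levels)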

### Interpret Results

#### Histogram and Q-Q Plot Observations: 
The histogram displays a slight left skew, indicating a deviation from a perfectly symmetrical distribution. Similarly, the points in the Q-Q plot deviate from a straight line, suggesting departures from normality in the distribution of the residuals.
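
A visual impression of skew can be backed by a number. The sketch below uses `scipy.stats.skew` on synthetic samples (assumptions for illustration, not the notebook's residuals): negative values indicate a longer left tail, and values near zero indicate symmetry.

```python
import numpy as np
from scipy import stats

# Synthetic samples (assumptions for illustration only)
rng = np.random.default_rng(3)
left_skewed = -rng.exponential(scale=1.0, size=1000)  # long tail to the left
symmetric = rng.normal(size=1000)

skew_left = stats.skew(left_skewed)
skew_sym = stats.skew(symmetric)

print("Skewness of left-skewed sample:", skew_left)
print("Skewness of symmetric sample:", skew_sym)
```

The same call could be applied to the model residuals to check whether the skew seen in the histogram is small or substantial.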

#### Statistical Test Results:

##### Shapiro-Wilk Test:
The p-value (0.0677) from the Shapiro-Wilk test is marginally above the conventional significance level of 0.05, so we fail to reject the null hypothesis of normality. However, this result should be interpreted cautiously, considering its proximity to the significance threshold.
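##### Kolmogorov-Smirnov Test:
The very low p-value (9.11e-07) from the Kolmogorov-Smirnov test indicates a significant departure from the reference distribution. Note, however, that `stats.kstest(res, 'norm')` compares the sample against the standard normal distribution (mean 0, standard deviation 1). If the residuals are on the scale of `price`, which is likely here, a result this extreme mostly reflects the mismatch in scale rather than the shape of the distribution, so it should not be read as strong evidence against normality unless the residuals are standardized first.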
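
The effect of scale on the Kolmogorov-Smirnov test can be illustrated with a short sketch. The data below are synthetic, drawn from a normal distribution with a large standard deviation (an assumption chosen to mimic residuals on a price scale); the only difference between the two calls is whether the sample is standardized first.

```python
import numpy as np
from scipy import stats

# Synthetic, genuinely normal sample on a large scale (assumption for illustration)
rng = np.random.default_rng(4)
raw = rng.normal(loc=0.0, scale=1000.0, size=30)

# Compared against N(0, 1) as-is: rejected because of scale, not shape
_, p_raw = stats.kstest(raw, "norm")

# Standardize with the sample mean and standard deviation, then compare again
standardized = (raw - raw.mean()) / raw.std(ddof=1)
_, p_std = stats.kstest(standardized, "norm")

print("p-value, raw sample:", p_raw)
print("p-value, standardized sample:", p_std)
```

Because the mean and standard deviation are estimated from the same sample, the second p-value is conservative; it is shown here only to contrast with the first.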
##### Anderson-Darling Test: 
The test statistic (0.6919) falls below the critical value at the 5% significance level, suggesting no significant departure from normality according to the Anderson-Darling test.
#### Interpretation of Results: 
Taken together, the visual inspection and the statistical tests give mixed evidence. The Shapiro-Wilk test does not reject normality, but only marginally. The Anderson-Darling test does not reject it at the 5% significance level. The Kolmogorov-Smirnov test rejects it strongly, although that test is sensitive to the location and scale of the reference distribution it is given. All three tests were run on a random sample of only 30 residuals, so their power to detect departures from normality is limited. Given the slight skew observed in the histogram and the deviations in the Q-Q plot, caution is warranted in interpreting the results.

### Report Findings

##### Findings: 
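The analysis suggests that the assumption of normality in the residuals may not hold perfectly. This could mean that the regression model does not fully capture the structure of the data, and that inference based on normal errors (such as p-values and confidence intervals for the coefficients) should be interpreted with care.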
##### Limitations: 
It's important to acknowledge several limitations in this analysis. Firstly, the outcome of normality tests depends on sample size: the tests here were run on a random sample of 30 residuals, so they have limited power and their results can change from one sample to the next. Secondly, the dataset may have characteristics that standard statistical tests do not fully capture. Additionally, while visual inspection is informative, it is subjective and may vary depending on individual interpretation. Lastly, the conclusions depend on the chosen significance level and on the assumption that the observations are independent.