# Housing Price Prediction

In this notebook, we'll load, explore, and clean a housing dataset, then fit a linear regression model with both the scikit-learn and statsmodels libraries.

## 1. Import Libraries
Let's start by importing the libraries for data manipulation and visualization. The modeling libraries are imported in the sections that use them.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## 2. Load Data
Load the housing data from a CSV file and inspect its basic structure.

In [None]:
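housing = pd.read_csv('housing.csv')
# Only the last expression of a cell is displayed automatically, so print the shape explicitly
print(housing.shape)
housing.head()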

## 3. Data Exploration
Explore the data to understand its structure, check for missing values, and view distributions of categorical variables.

In [None]:
housing.info()

In [None]:
housing["ocean_proximity"].value_counts()
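
The missing-value check can also be made explicit instead of being read off the `info()` output. A minimal sketch on a made-up miniature table (the `toy` DataFrame and its values are assumptions for illustration, not rows from `housing.csv`):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the housing table; the values are invented.
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, np.nan],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND", "INLAND"],
})

# Count missing entries per column and keep only the columns that have gaps.
missing = toy.isna().sum()
missing = missing[missing > 0]
print(missing)
```

Applied to the real data, the same two lines would be `housing.isna().sum()` followed by the filter.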

## 4. Descriptive Statistics and Visualization
Generate summary statistics and visualize the data using histograms and scatter plots.

In [None]:
housing.describe()

In [None]:
housing.hist(bins=50, figsize=(20,15))

In [None]:
housing.plot(kind="scatter", x="longitude", y="latitude",
             alpha=0.1, s=housing["population"] / 100, label="population",
             c="median_house_value", cmap=plt.get_cmap("jet"))

## 5. Data Cleaning
Inspect the correlations among the numeric columns, drop rows with missing values, and convert the categorical variable into dummy variables.

In [None]:
non_numeric_columns = housing.select_dtypes(exclude=['float64', 'int64']).columns
print(non_numeric_columns)

In [None]:
housing_numeric = housing.drop('ocean_proximity', axis=1)
corr_matrix = housing_numeric.corr()
print(corr_matrix)

In [None]:
# corr_matrix was computed in the previous cell; rank the features by their correlation with the target
corr_matrix["median_house_value"].sort_values(ascending=False)

In [None]:
housing_na = housing.dropna(subset=["total_bedrooms"])
housing_na.shape

In [None]:
dummies = pd.get_dummies(housing_na.ocean_proximity)
dummies.head()

In [None]:
housing_na_dummies = pd.concat([housing_na, dummies], axis='columns')
# Drop the original text column, plus one dummy ('ISLAND') to serve as the reference category
housing_clean = housing_na_dummies.drop(['ocean_proximity', 'ISLAND'], axis='columns')
housing_clean.head()
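
One dummy column is dropped because a full set of dummies always sums to 1 in every row, which duplicates the intercept column of a linear model and makes the design matrix perfectly collinear. A small sketch of that property, using an invented four-row column rather than the real data:

```python
import pandas as pd

# Hypothetical category column; the values mimic ocean_proximity labels.
toy = pd.DataFrame({"ocean_proximity": ["INLAND", "ISLAND", "NEAR BAY", "INLAND"]})

# One indicator column per category (dtype=int gives 0/1 instead of booleans).
full = pd.get_dummies(toy["ocean_proximity"], dtype=int)
row_sums = full.sum(axis=1).tolist()

# Dropping one category turns it into the baseline the others are compared to.
reduced = full.drop(columns="ISLAND")

print(row_sums)
print(reduced.columns.tolist())
```

With `ISLAND` removed, each remaining coefficient is read as the difference from an island district, all else equal.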

In [None]:
X = housing_clean.drop(['median_house_value'], axis='columns')
y = housing_clean['median_house_value']
X.head()

## 6. Train-Test Split
Split the data into training and testing sets.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.2, 
                                                    random_state=1984)

## 7. Train Linear Regression Model
Fit a linear regression model using scikit-learn and evaluate its performance.

In [None]:
from sklearn.linear_model import LinearRegression
OLS = LinearRegression()
OLS.fit(X_train, y_train)

# Display the intercept, the coefficients, and the R-squared on the training data
print("The intercept is " + str(OLS.intercept_))
print("The coefficients are " + str(OLS.coef_))
print("The training R-squared value is " + str(OLS.score(X_train, y_train)))

In [None]:
# Predicting with OLS
y_pred = OLS.predict(X_test)
performance = pd.DataFrame({'PREDICTIONS': y_pred, 'ACTUAL VALUES': y_test})
performance['error'] = performance['ACTUAL VALUES'] - performance['PREDICTIONS']
performance.head()
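
The per-row errors can be condensed into summary metrics on the held-out set. The following is a minimal sketch on synthetic data (the generated `X`, `y`, and the `model` name are assumptions for illustration); with the notebook's variables, the same two metric calls would take `y_test` and `y_pred`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression problem: a known linear signal plus small Gaussian noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 3.0 + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# RMSE is in the units of the target; R^2 here is measured on unseen data.
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"test RMSE: {rmse:.3f}, test R^2: {r2:.3f}")
```

Comparing the test R-squared with the training value printed earlier gives a rough sense of whether the model overfits.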

## 8. Plot Residuals
Plot the residuals of the first 50 test observations to see how far the predictions fall from the actual values.

In [None]:
performance.reset_index(drop=True, inplace=True)  # discard the original row labels
performance.reset_index(inplace=True)  # expose the new 0..n-1 index as an 'index' column
fig = plt.figure(figsize=(10,5))
plt.bar('index', 'error', data=performance[:50], color='black', width=0.3)
plt.ylabel('residuals')
plt.xlabel('observations')
plt.show()

## 9. OLS Model with Statsmodels
Fit the OLS model using the statsmodels library and display the summary.

In [None]:
import statsmodels.api as sm
X_train = sm.add_constant(X_train)
X_train.head()
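
The constant is needed because `sm.OLS` does not add an intercept on its own; the column of ones is what lets the fit estimate one. A NumPy-only sketch of the idea, on noise-free synthetic data (all names and values here are assumptions for illustration):

```python
import numpy as np

# Synthetic data with a known intercept (5.0) and slopes (1.5, -2.0), no noise.
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 2))
y = 5.0 + x @ np.array([1.5, -2.0])

# Prepend a column of ones; its coefficient is the intercept.
X_const = np.column_stack([np.ones(len(x)), x])
beta, *_ = np.linalg.lstsq(X_const, y, rcond=None)
print(beta)
```

Without the extra column, the fitted plane would be forced through the origin and the slope estimates would absorb the missing intercept.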

In [None]:
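# Recent pandas versions return boolean dummy columns; cast them to integers
# so that statsmodels receives a purely numeric design matrix
bool_columns = ['<1H OCEAN', 'INLAND', 'NEAR BAY', 'NEAR OCEAN']
for col in bool_columns:
    X_train[col] = X_train[col].astype(int)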

In [None]:
# X_train already contains the constant added by sm.add_constant, so it is not added again.
# Passing the DataFrame directly keeps the column names in the summary table.
nicer_OLS = sm.OLS(y_train, X_train).fit()
print(nicer_OLS.summary())