# Loading the data and preliminary exploration

This notebook is a full analysis of the California housing prices dataset. The ultimate goal is to predict housing prices in different districts based on demographic information for those districts. I will use the Pandas library to load the data and do a preliminary analysis.

1. Data Acquisition: We start by downloading the California housing data, which consists of longitude, latitude, housing_median_age, total_rooms, total_bedrooms, population, households, median_income, median_house_value, and ocean_proximity.
2.	Data Preprocessing: Before proceeding with analysis or modeling, it is important to preprocess the data. This step involves addressing missing values and outliers present in the dataset.
3. Exploratory Data Analysis (EDA): Following the data preprocessing stage, it is essential to perform exploratory data analysis (EDA) to gain insights into the data's distribution, examine the relationships among different features, and see how these features relate to the dependent variable, the median house value.
4.	Model Building and Training: Once the exploratory data analysis (EDA) is complete, the next step is to separate the response variable (or label), in this case, the median house value, from the predictor variables. This allows me to focus on building a regression model to predict the median house value using the available features.
5. Model Evaluation: Finally, I will evaluate the performance of the model using the test set.


# Domain Knowledge

In the context of the California housing dataset, it's important to understand the significance of the housing features and the role they might play in predicting the median house value. Let's delve deeper into these features:

1. **longitude**: A measure of how far west a house is; values are negative in California, so a more negative value is farther west
2. **latitude**: A measure of how far north a house is; a higher value is farther north
3. **housing_median_age**: Median age of a house within a block; a lower number is a newer building
4. **total_rooms**: Total number of rooms within a block
5. **total_bedrooms**: Total number of bedrooms within a block
6. **population**: Total number of people residing within a block
7. **households**: Total number of households, a group of people residing within a home unit, for a block
8. **median_income**: Median income for households within a block of houses (measured in tens of thousands of US Dollars)
9. **median_house_value**: Median house value for households within a block (measured in US Dollars)
10. **ocean_proximity**: Location of the house relative to the ocean/sea


# Abstract

This study presents an in-depth analysis of the California housing dataset to predict housing prices in different districts based on demographic information for those districts. Our methodology comprised a two-step process: first, a meticulous Exploratory Data Analysis (EDA), and second, the application of a Random Forest algorithm to predict the outcome.


This algorithm works in four steps:

1. Select random samples from the training set.
2. Construct a decision tree for each sample.
3. Obtain a prediction from every decision tree.
4. Combine the predictions into the final result: for regression, as here, by averaging them; for classification, by selecting the most voted class.
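
The four steps above can be sketched by hand on synthetic data. This is only an illustration under stated assumptions, not how `RandomForestRegressor` is implemented internally: the toy data and the helper names `fit_toy_forest` and `predict_toy_forest` are hypothetical, and `max_features=1` is an arbitrary choice to add per-split feature randomness.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic regression data (hypothetical, for illustration only)
X_toy = rng.uniform(0, 10, size=(200, 2))
y_toy = 3.0 * X_toy[:, 0] + np.sin(X_toy[:, 1]) + rng.normal(0, 0.5, size=200)

def fit_toy_forest(X, y, n_trees=10):
    """Fit n_trees decision trees, each on a bootstrap sample of the rows."""
    trees = []
    for seed in range(n_trees):
        # Step 1: draw a bootstrap sample (rows sampled with replacement)
        idx = rng.integers(0, len(X), size=len(X))
        # Step 2: fit one tree on that sample
        tree = DecisionTreeRegressor(max_features=1, random_state=seed)
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def predict_toy_forest(trees, X):
    # Step 3: collect one prediction per tree
    per_tree = np.array([tree.predict(X) for tree in trees])
    # Step 4: average the per-tree predictions (regression)
    return per_tree.mean(axis=0)

toy_trees = fit_toy_forest(X_toy, y_toy)
toy_preds = predict_toy_forest(toy_trees, X_toy[:5])
print(toy_preds)
```

Averaging many trees trained on different samples tends to reduce the variance of a single deep tree, which is the motivation for using a forest rather than one tree.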

The EDA revealed that median income is the factor most strongly associated with the median house value, while other variables such as longitude, latitude, housing_median_age, total_rooms, total_bedrooms, population, and households showed weaker correlation.

We proceed by training and cross-validating different models (Linear Regression, Decision Tree, and Random Forest) and selecting the most promising one among them.

From the models we trained and cross-validated, we were able to conclude that Random Forest regression was the most reliable model.


In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

%matplotlib inline
import matplotlib.pyplot as plt

In [None]:
housing = pd.read_csv("/kaggle/input/housing/housing.csv")

In [None]:
housing.describe()

In [None]:
housing.loc[:,"latitude":"total_rooms"]

In [None]:
housing[["latitude", "longitude"]]

In [None]:
housing.head()

In [None]:
housing["ocean_proximity"].describe()

In [None]:
housing["ocean_proximity"].value_counts()

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)

In [None]:
train_set

In [None]:
train_set.describe()

In [None]:
test_set

In [None]:
test_set.describe()

**Visualising data**



In [None]:
plt.plot(housing["longitude"], housing["latitude"], ".")

In [None]:
housing.hist(bins=50, figsize=(20,15))

# Correlations



In [None]:
# ocean_proximity is text, so restrict the correlation to numeric columns
corr_matrix = housing.corr(numeric_only=True)
corr_matrix

In [None]:
corr_matrix["median_house_value"]

In [None]:
import seaborn as sns
sns.heatmap(corr_matrix)

In [None]:
# Graph II: correlation of each feature with the target
target_corr = corr_matrix["median_house_value"].drop("median_house_value")

# Sort correlation values in descending order
target_corr_sorted = target_corr.sort_values(ascending=False)

# Create a heatmap of the correlations with the target column
sns.set(font_scale=0.8)
sns.set_style("white")
sns.heatmap(target_corr_sorted.to_frame(), cmap="coolwarm", annot=True, fmt='.2f')
plt.title('Correlation with median house value')
plt.show()

In [None]:
housing.total_rooms

# Adding Columns
Adding ratio columns such as the "per-household" quantities normalizes the raw block totals by the size of the block. These new columns can be a more meaningful representation of the data than the raw values and may capture patterns or relationships that the totals obscure.

In [None]:
housing["rooms_per_household"] = housing["total_rooms"] / housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"] / housing["total_rooms"]
housing["population_per_household"] = housing["population"] / housing["households"]

In [None]:
housing

In [None]:
housing.describe()

In [None]:
# Graph II, repeated with the new columns
import seaborn as sns

# ocean_proximity is text, so restrict the correlation to numeric columns
corr_matrix = housing.corr(numeric_only=True)
target_corr = corr_matrix["median_house_value"].drop("median_house_value")

# Sort correlation values in descending order
target_corr_sorted = target_corr.sort_values(ascending=False)

# Create a heatmap of the correlations with the target column
sns.set(font_scale=0.8)
sns.set_style("white")
sns.heatmap(target_corr_sorted.to_frame(), cmap="coolwarm", annot=True, fmt='.2f')
plt.title('Correlation with median house value')
plt.show()

# Preparing the data for ML

We want to predict the median house value using the other variables, so we separate the response variable (or label) from the predictor variables.

**CREATING TEST DATA**

I proceed by splitting the dataset randomly into a training set and a test set, so as to avoid any accidental bias. The test_size=0.2 argument to the function sets the fraction of the data that is held out for testing. Typical splits are around 80/20 or 70/30.


In [None]:
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)

In [None]:
new_housing = train_set.drop("median_house_value", axis=1)
housing_labels = train_set["median_house_value"].copy()

In [None]:
housing_labels

For the purposes of this prediction, it makes sense to replace the missing values from the dataset with the median of those values in that column. This is called **imputation**. We do this using a **SimpleImputer** object. But before using it, we will drop the one non-numerical column from the dataset.
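
As a small illustration of median imputation, the sketch below fits a `SimpleImputer` on a toy training column and then reuses the learned median on a toy test column. The frames `toy_train` and `toy_test` are hypothetical stand-ins, not part of the housing data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frames standing in for the training and test sets
toy_train = pd.DataFrame({"total_bedrooms": [100.0, np.nan, 300.0, 500.0]})
toy_test = pd.DataFrame({"total_bedrooms": [np.nan, 250.0]})

toy_imputer = SimpleImputer(strategy="median")
toy_imputer.fit(toy_train)                     # learns the median of the training column

train_filled = toy_imputer.transform(toy_train)
test_filled = toy_imputer.transform(toy_test)  # reuses the training median

print(toy_imputer.statistics_)   # [300.]
print(test_filled.ravel())       # [300. 250.]
```

Fitting only on the training data and calling `transform` on the test data keeps information from the test set out of the preprocessing step.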

In [None]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="median")
housing_num = new_housing.drop("ocean_proximity", axis=1)

In [None]:
housing_num

In [None]:
X = imputer.fit_transform(housing_num)
housing_tr = pd.DataFrame(X, columns=housing_num.columns,
                          index=housing_num.index)


In [None]:
housing_tr

In [None]:
housing_tr.describe()


In [None]:
housing_num.describe()

# Fitting a model

## Linear regression

We use linear regression to predict the median house value. We import the required functionality from the scikit-learn library and fit a linear regression model to the training data.


In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np

In [None]:
lin_reg = LinearRegression()
lin_reg.fit(housing_tr, housing_labels)

In [None]:
import statsmodels.api as sm

# Add a constant term to the independent variables
w = sm.add_constant(housing_tr)

# Create an OLS (Ordinary Least Squares) model
ols_model = sm.OLS(housing_labels, w)

# Fit the OLS model
ols_results = ols_model.fit()

# Print the model summary
print(ols_results.summary())

And that is it! The fitting is done. We then inspect the coefficients.

In [None]:
lin_reg.coef_

In [None]:
housing_tr[:5]

In [None]:
some_data = housing_tr[:5]
some_labels = housing_labels[:5]

print("predictions:", lin_reg.predict(some_data))
print("data:       ", list(some_labels))

We then proceed to compare the predictions over the whole training set to the actual values of the median house value, and compute the root mean square error (RMSE) as a provisional measure of the accuracy of the prediction.

The code below calculates the mean squared error (MSE) between the actual median house values (housing_labels) and the predicted values (housing_predictions), then takes its square root. The MSE is the average squared difference between the predicted and actual values.

Notice that we are not yet looking at the test data. We are just doing some rough validation using the training set.
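
For reference, with $n$ observations, actual values $y_i$ and predictions $\hat{y}_i$, the RMSE is $\sqrt{\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2}$. The short sketch below works this out by hand with NumPy; the three values are made up for illustration only.

```python
import numpy as np

# Hypothetical actual and predicted house values, for illustration only
y_true = np.array([100_000.0, 200_000.0, 300_000.0])
y_pred = np.array([110_000.0, 190_000.0, 330_000.0])

errors = y_pred - y_true           # [10000., -10000., 30000.]
toy_mse = np.mean(errors ** 2)     # average squared error, in dollars squared
toy_rmse = np.sqrt(toy_mse)        # back in dollars, about 19,149
print(toy_rmse)
```

Because of the square root, the RMSE is in the same units as the target (US dollars here), which makes it easier to interpret than the MSE.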

In [None]:
housing_predictions = lin_reg.predict(housing_tr)
lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)
lin_rmse

## Decision tree regression

We now try fitting other kinds of models to our data to see how well they predict it. Let us see how a decision tree model fares.

In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

In [None]:
tree_reg = DecisionTreeRegressor()
tree_reg.fit(housing_tr, housing_labels)

In [None]:
housing_predictions = tree_reg.predict(housing_tr)
tree_mse = mean_squared_error(housing_labels, housing_predictions)
tree_rmse = np.sqrt(tree_mse)
tree_rmse

The error is 0! Have we found the perfect model? No, this is a sure sign of overfitting: the model has learned the training data too well and may not generalize to new, unseen data. Remember, we are only comparing predictions to data within the training set; we have not used the test set yet. To validate the model, we need to evaluate it on data it was not trained on, which we do later.
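
The effect can be reproduced on synthetic data. In the sketch below (toy data and variable names are hypothetical), an unconstrained decision tree memorizes the rows it was trained on but shows a clearly non-zero error on held-out rows.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)

# Noisy linear relationship (hypothetical data, for illustration only)
X_toy = rng.uniform(0, 10, size=(500, 1))
y_toy = 2.0 * X_toy[:, 0] + rng.normal(0, 1.0, size=500)

X_fit, X_val, y_fit, y_val = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

# No depth limit: the tree keeps splitting until every training row is fit exactly
toy_tree = DecisionTreeRegressor(random_state=42).fit(X_fit, y_fit)

train_rmse = np.sqrt(mean_squared_error(y_fit, toy_tree.predict(X_fit)))
val_rmse = np.sqrt(mean_squared_error(y_val, toy_tree.predict(X_val)))
print(f"train RMSE: {train_rmse:.3f}, validation RMSE: {val_rmse:.3f}")
```

The gap between the two numbers, rather than the training error alone, is what signals overfitting.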

## Random forest regression 

Let's try a random forest regression.

In [None]:
forest_reg = RandomForestRegressor(n_estimators=10)
forest_reg.fit(housing_tr, housing_labels)

In [None]:
housing_predictions = forest_reg.predict(housing_tr)
forest_mse = mean_squared_error(housing_labels, housing_predictions)
forest_rmse = np.sqrt(forest_mse)
forest_rmse

# Cross-validation
We use the scikit-learn library to do cross-validation.
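
To make clear what 10-fold cross-validation does, here is a hand-rolled sketch on synthetic data using `KFold`. The toy data is hypothetical, and `shuffle=True` is an assumption made for this example; it is roughly the loop that `cross_val_score` runs for us.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)

# Hypothetical regression data, for illustration only
X_toy = rng.normal(size=(100, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.1, size=100)

kfold = KFold(n_splits=10, shuffle=True, random_state=0)
fold_rmse = []
for train_idx, val_idx in kfold.split(X_toy):
    # Train on nine folds, evaluate on the one held-out fold
    model = LinearRegression().fit(X_toy[train_idx], y_toy[train_idx])
    preds = model.predict(X_toy[val_idx])
    fold_rmse.append(np.sqrt(mean_squared_error(y_toy[val_idx], preds)))

fold_rmse = np.array(fold_rmse)
print("mean:", fold_rmse.mean(), "stddev:", fold_rmse.std())
```

scikit-learn scorers follow a "greater is better" convention, which is why the scoring name is `neg_mean_squared_error` and the scores are negated before taking the square root.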

In [None]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(tree_reg, housing_tr, housing_labels,
                         scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)

In [None]:
def display_val_scores(scores):
    print("scores:", scores)
    print("mean:  ", scores.mean())
    print("stddev:", scores.std())

In [None]:
display_val_scores(tree_rmse_scores)

In [None]:
scores = cross_val_score(lin_reg, housing_tr, housing_labels,
                         scoring="neg_mean_squared_error", cv=10)
lin_rmse_scores = np.sqrt(-scores)


In [None]:
display_val_scores(lin_rmse_scores)

In [None]:
scores = cross_val_score(forest_reg, housing_tr, housing_labels,
                         scoring="neg_mean_squared_error", cv=10)
forest_rmse_scores = np.sqrt(-scores)

In [None]:
display_val_scores(forest_rmse_scores)

In this case, the mean RMSE of 53498.81943012941 is the average prediction error of the random forest model across the folds, and the standard deviation of 1726.490702709601 shows how much the RMSE varies from fold to fold. The exact values change between runs because the forest is not seeded. Overall, the random forest model seems to perform reasonably well, with an average RMSE of around 53,000 and a moderate standard deviation, so we have chosen it over the other models.

# Comparing to test data

Now that we have chosen the best model, let us see how it performs on the test set.

In [None]:
final_model = forest_reg

X_test = test_set.drop("median_house_value", axis=1)
X_test = X_test.drop("ocean_proximity", axis=1)
# Reuse the medians learned from the training set; do not refit on the test set
X_tr = imputer.transform(X_test)
X_test_tr = pd.DataFrame(X_tr, columns=X_test.columns,
                         index=X_test.index)

y_test = test_set["median_house_value"].copy()

final_predictions = final_model.predict(X_test_tr)

final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)

In [None]:
final_rmse

The root mean square error is basically the same as the one we found through cross-validation, suggesting that our training data is representative.