<a href="https://colab.research.google.com/github/Orpita3/Deliverable-9/blob/main/Deliverable_9.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Problem Definition:**

For this assignment, I have chosen the California Housing Prices dataset available on Kaggle. Each row describes a housing district (a census block group) in California rather than an individual house, with attributes such as location, number of rooms, population, median income, and median house value. The goal is to predict the median house value from the other attributes. Because the target variable is continuous, this is a regression problem.

**Data Splitting:**

The first step is to separate the data into a response and a feature data frame. The target variable is the median house value, and the remaining columns are used as features. Before splitting, missing values in `total_bedrooms` are filled with the column median, and categorical columns are one-hot encoded with `pd.get_dummies`. The encoded data is then split into two parts: a training set (80% of the rows) used to fit the model, and a testing set (20%) used to evaluate its performance on data it has not seen.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Load the dataset
housing_df = pd.read_csv('housing.csv')

# Split the data into response and feature data frames
X = housing_df.drop('median_house_value', axis=1)
y = housing_df['median_house_value']

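# Data preprocessing: fill missing bedroom counts with the column median.
# Assigning the result back avoids the chained inplace fillna, which is
# deprecated in recent pandas versions and may not modify X at all.
X['total_bedrooms'] = X['total_bedrooms'].fillna(X['total_bedrooms'].median())
# One-hot encode the categorical columns
X_encoded = pd.get_dummies(X)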

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.2, random_state=42)
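
The cell above computes the median of `total_bedrooms` on the full dataset, so the test rows contribute a little to the preprocessing. A stricter variant, sketched here on made-up data, splits first and learns the median from the training rows only. The `toy` frame and its values are hypothetical stand-ins, not part of the housing file.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny synthetic stand-in for the housing features (values are made up).
toy = pd.DataFrame({
    "total_bedrooms": [100.0, np.nan, 300.0, 400.0, np.nan, 600.0, 700.0, 800.0],
    "median_income": [2.5, 3.1, 4.0, 5.2, 1.8, 6.3, 3.7, 4.4],
})
toy_target = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])

# Split first, so the test rows play no part in the preprocessing statistics.
toy_train, toy_test, _, _ = train_test_split(
    toy, toy_target, test_size=0.25, random_state=42
)
toy_train = toy_train.copy()
toy_test = toy_test.copy()

# Learn the median from the training rows only, then apply it to both sets.
train_median = toy_train["total_bedrooms"].median()
toy_train["total_bedrooms"] = toy_train["total_bedrooms"].fillna(train_median)
toy_test["total_bedrooms"] = toy_test["total_bedrooms"].fillna(train_median)

print("Median learned from training rows:", train_median)
print("Missing values left in train:", int(toy_train.isna().sum().sum()))
print("Missing values left in test:", int(toy_test.isna().sum().sum()))
```

On a dataset of this size the difference in the imputed value is likely small, but the pattern keeps the test set fully unseen.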


**Model Selection, Model Fitting, and Model Evaluation:**

For this problem, I have chosen to use the Random Forest Regressor model. The Random Forest Regressor is an ensemble learning method that trains many decision trees on bootstrap samples of the training data and averages their predictions, which usually reduces the variance of a single tree. It handles non-linear relationships without feature scaling, which makes it a common baseline for tabular regression problems. We will train the model on the training set and then generate predictions on the testing set. We will evaluate the model by calculating the mean squared error (MSE) and the coefficient of determination (R-squared) between the predicted and actual values of the target variable.
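
To make the "combines multiple decision trees" idea concrete, the sketch below fits a small forest on synthetic data and checks that its prediction matches the plain average of its individual trees' predictions. The data comes from `make_regression`, and the names `X_demo`, `y_demo`, and `forest` are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data, unrelated to the housing dataset.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

# A small forest so the per-tree loop stays cheap.
forest = RandomForestRegressor(n_estimators=10, random_state=0)
forest.fit(X_demo, y_demo)

# Predict with every fitted tree, then average across trees by hand.
tree_preds = np.stack([tree.predict(X_demo[:5]) for tree in forest.estimators_])
manual_average = tree_preds.mean(axis=0)

# The forest's own prediction should agree with the manual average.
forest_pred = forest.predict(X_demo[:5])
print("Forest matches tree average:", np.allclose(manual_average, forest_pred))
```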

In [None]:
# Initialize the Random Forest Regressor model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)

# Fit the model to the training data
rf_model.fit(X_train, y_train)

# Generate predictions on the testing data
y_pred = rf_model.predict(X_test)

# Evaluate the model's performance on the testing data
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error:", mse)
print("R-squared:", r2)


Mean Squared Error: 2398820115.3845725
R-squared: 0.8169411111174801
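
The MSE is expressed in squared units of the target, which makes it hard to read directly. A minimal sketch of converting the value printed above into a root mean squared error (RMSE), which is in the same units as the median house value:

```python
import math

# MSE copied from the output of the evaluation cell above.
mse_reported = 2398820115.3845725

# RMSE is the square root of the MSE, in the units of the target.
rmse = math.sqrt(mse_reported)
print(f"RMSE: {rmse:,.0f}")
```

This gives an RMSE of roughly 49,000, so a typical prediction error is on the order of tens of thousands in the units of `median_house_value`.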


**Final Remarks:** (Conclusion)

After fitting the Random Forest Regressor model and evaluating its performance, we can conclude that it is a reasonable model for predicting median house values from the given attributes. The model achieved an MSE of about 2.40 × 10^9 on the testing set, which corresponds to a root mean squared error of roughly 49,000 in the units of the target, and an R-squared value of about 0.82, meaning it explains around 82% of the variance in the test-set house values. Further analysis could improve the model's performance, for example feature engineering, hyperparameter tuning, or trying different machine learning models. Overall, this problem demonstrates the application of regression analysis to predicting continuous values.
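
As one possible direction for the feature engineering mentioned above, the sketch below derives per-household and per-room ratios from raw counts. Only `total_bedrooms` appears in the code earlier in this notebook. The column names `total_rooms`, `population`, and `households` are assumed to exist in the dataset, and the helper `add_ratio_features` and the sample values are hypothetical.

```python
import pandas as pd

def add_ratio_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with ratio features added (hypothetical helper)."""
    out = df.copy()
    # Assumed column names; adjust them to match the actual file.
    out["rooms_per_household"] = out["total_rooms"] / out["households"]
    out["bedrooms_per_room"] = out["total_bedrooms"] / out["total_rooms"]
    out["population_per_household"] = out["population"] / out["households"]
    return out

# Two illustrative rows with made-up counts.
sample = pd.DataFrame({
    "total_rooms": [880.0, 7099.0],
    "total_bedrooms": [129.0, 1106.0],
    "population": [322.0, 2401.0],
    "households": [126.0, 1138.0],
})

engineered = add_ratio_features(sample)
print(engineered.round(3))
```

Whether such ratios actually improve the R-squared would need to be checked by refitting the model and comparing scores on the same test split.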