In [15]:
# Import necessary libraries
import pandas as pd
import numpy as np

#1. Load the California Housing dataset from sklearn.datasets.
#Loading Data Set 
from sklearn.datasets import fetch_california_housing

# Load the housing dataset
housing = fetch_california_housing()

#2. Create a Pandas DataFrame for the features and a Series for the target variable (med_house_value). 
#Making the Data Frame
X = pd.DataFrame(housing.data, columns=housing.feature_names) 
y = pd.Series(housing.target, name='med_house_value')


#3. Perform an initial exploration of the dataset: 
print(X.head())  # Display the first five rows of the dataset
print("Feature Names:", X.columns.tolist())  # Print feature names
print("\nMissing Values:\n", X.isnull().sum())  # Check for missing values
print(X.describe())  # Generate summary statistics (mean, min, max, etc.)

   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \
0  8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88   
1  8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86   
2  7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85   
3  5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85   
4  3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85   

   Longitude  
0    -122.23  
1    -122.22  
2    -122.24  
3    -122.25  
4    -122.25  
Feature Names: ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude']

Missing Values:
 MedInc        0
HouseAge      0
AveRooms      0
AveBedrms     0
Population    0
AveOccup      0
Latitude      0
Longitude     0
dtype: int64
             MedInc      HouseAge      AveRooms     AveBedrms    Population  \
count  20640.000000  20640.000000  20640.000000  20640.000000  20640.000000   
mean       3.870671   

In [16]:
#Linear Regression 
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, root_mean_squared_error, r2_score

#4. Split the dataset into training and test sets (80% training, 20% testing). 
X_train_raw, X_test_raw, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#5. Train a linear regression model on the unscaled data using sklearn.linear_model.LinearRegression. 
lin_reg_raw = LinearRegression()
lin_reg_raw.fit(X_train_raw, y_train)

#6. Make predictions on the test set.
y_pred_raw = lin_reg_raw.predict(X_test_raw)

#7. Evaluate model performance using the following metrics:
mse_raw = mean_squared_error(y_test, y_pred_raw) # Mean Squared Error (MSE)
rmse_raw = root_mean_squared_error(y_test, y_pred_raw) # Root Mean Squared Error (RMSE)
r2_raw = r2_score(y_test, y_pred_raw) #R² Score

print("Unscaled Data Model:")
print(f"Mean Squared Error: {mse_raw:.2f}")
print(f"Root Mean Squared Error: {rmse_raw:.2f}")
print(f"R² Score: {r2_raw:.2f}")

#Determining Feature Impact 
# Pair each feature name with its fitted coefficient
feature_impact = pd.Series(lin_reg_raw.coef_, index=X.columns)

# Sort features by absolute coefficient value (largest magnitude first; the sign is discarded)
strongest_features = feature_impact.abs().sort_values(ascending=False)

# Print the most influential features
print("\nFeatures with the strongest impact on predictions:")
print(strongest_features)

Unscaled Data Model:
Mean Squared Error: 0.56
Root Mean Squared Error: 0.75
R² Score: 0.58

Features with the strongest impact on predictions:
AveBedrms     0.783145
MedInc        0.448675
Longitude     0.433708
Latitude      0.419792
AveRooms      0.123323
HouseAge      0.009724
AveOccup      0.003526
Population    0.000002
dtype: float64


8. Interpretation Questions
The R² score (coefficient of determination) measures the proportion of variance in the target variable that the model's predictions explain. A value of 1 means the model explains all of the variance; a value of 0 means it does no better than always predicting the mean of y. R² is not bounded below: on held-out data it can be negative, which means the model performs worse than that mean-based baseline. In general, a value closer to 1 indicates a better fit, while a lower value indicates that the model struggles to explain the variability in the data and is less useful for prediction.
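As a small illustration of the definition above, R² can be computed by hand as 1 - SS_res / SS_tot. The helper name `r2_manual` and the toy arrays are made up for this sketch; they are not part of the assignment data.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """R² = 1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Toy target values (hypothetical, for illustration only)
y_true = np.array([1.0, 2.0, 3.0, 4.0])

r2_perfect = r2_manual(y_true, y_true)                    # perfect predictions
r2_mean = r2_manual(y_true, np.full(4, y_true.mean()))    # always predict the mean
r2_bad = r2_manual(y_true, y_true[::-1])                  # predictions in reversed order

print(r2_perfect, r2_mean, r2_bad)
```

The three cases correspond to a perfect model, the mean-only baseline, and a model that does worse than the baseline.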

According to the model's coefficients, average bedrooms (0.78), median income (0.45), and longitude and latitude (0.43 and 0.42, respectively) have the largest absolute coefficients and therefore appear to have the strongest influence on the predictions. The remaining four features, average rooms (0.12), house age (0.01), average occupancy (0.004), and population (0.000002), have smaller coefficients. Because the model was fit on unscaled data, these magnitudes also reflect each feature's units and range, so the ranking should be read with caution.
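One caveat when ranking features by raw coefficients is that the size of a coefficient depends on the scale of its feature. The sketch below uses synthetic data (not the housing dataset) and a hypothetical helper `ols_coefs` to show how the ranking can flip once the features are standardized.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x_small = rng.normal(0.0, 1.0, n)       # feature with a narrow spread
x_large = rng.normal(0.0, 1000.0, n)    # feature with a wide spread
y = 0.5 * x_small + 0.002 * x_large + rng.normal(0.0, 0.1, n)

def ols_coefs(X, y):
    """Least-squares slopes, fitted with an intercept (the intercept is dropped)."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta[1:]

X = np.column_stack([x_small, x_large])
raw = ols_coefs(X, y)                             # coefficients per raw unit

X_std = (X - X.mean(axis=0)) / X.std(axis=0)      # zero mean, unit variance
std = ols_coefs(X_std, y)                         # coefficients per standard deviation

print("raw:", raw)
print("standardized:", std)
```

On the raw scale `x_small` has the larger coefficient, but per standard deviation `x_large` moves the prediction more. The same effect may apply to features such as Population, whose values are in the hundreds or thousands.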

The predicted values align only moderately with the actual values. The target is expressed in units of $100,000, so an RMSE of 0.75 corresponds to a typical prediction error of roughly $75,000, which is a substantial error for a house price. A higher R² value indicates a better fit, and 0.58 leaves considerable room for improvement: although the model does better than simply predicting the mean house value, about 42% of the variation in house prices remains unexplained. In practice, predictions from this model are not highly reliable and could lead to substantial errors.
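To make the error metrics more concrete, the short sketch below relates the reported MSE to the RMSE and converts it to dollars. It assumes the rounded MSE of 0.56 printed above and uses the fact that this dataset's target is expressed in units of $100,000.

```python
import math

mse = 0.56                       # rounded MSE from the unscaled model's output
rmse = math.sqrt(mse)            # RMSE is the square root of MSE
rmse_dollars = rmse * 100_000    # the target is in units of $100,000

print(f"RMSE = {rmse:.2f} target units, about ${rmse_dollars:,.0f}")
```

Because the MSE here is already rounded, the result is only approximate, but it matches the reported RMSE of 0.75.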


In [17]:
#13. Select three features from the dataset to build a simplified model. Explain your choice.
#Selected average bedrooms, median income, and average rooms because they have large absolute coefficients in the full model. 
#Note: longitude and latitude have larger coefficients than average rooms but are likely most informative when used together. 
features = ['AveBedrms', 'MedInc', 'AveRooms']
X_selected = X[features]

#14. Train a new linear regression model using only these three features.
X_train, X_test, y_train, y_test = train_test_split(X_selected, y, test_size=0.2, random_state=42)

simplified_model = LinearRegression()
simplified_model.fit(X_train, y_train)

y_pred_simplified = simplified_model.predict(X_test) # Predictions

#15. Evaluate the performance of this simplified model and compare it to the full model.
mse_simplified = mean_squared_error(y_test, y_pred_simplified)
rmse_simplified = root_mean_squared_error(y_test, y_pred_simplified)
r2_simplified = r2_score(y_test, y_pred_simplified)

print("Simplified Data Model:")
print(f"Mean Squared Error: {mse_simplified:.2f}")
print(f"Root Mean Squared Error: {rmse_simplified:.2f}")
print(f"R² Score: {r2_simplified:.2f}")


Simplified Data Model:
Mean Squared Error: 0.68
Root Mean Squared Error: 0.82
R² Score: 0.48


16. How does the simplified model compare to the full model?

The simplified model performs worse than the full model. The R² value for the simplified model is 0.48, while the full model achieves 0.58, suggesting that the full model explains more of the variation in the target variable (house prices). A higher R² value indicates a better fit, so the full model’s R² score reflects a more accurate representation of the data.

The error metrics tell the same story. The simplified model has an MSE of 0.68 and an RMSE of 0.82, compared with an MSE of 0.56 and an RMSE of 0.75 for the full model, so both its average squared error and its typical prediction error are higher.

Overall, the simplified model captures less of the structure in the data and makes less accurate predictions than the full model. Dropping the other five features, including latitude and longitude, removes information the full model relies on.
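The comparison above can be reproduced in miniature. The following is a minimal sketch on synthetic data, with a hypothetical helper `fit_and_score`, that fits a full and a reduced linear model on the same train/test split and compares their test RMSE and R². It is not the housing dataset, and the numbers it prints are not the assignment's results.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000
X = rng.normal(size=(n, 4))
# Three of the four synthetic features carry signal; the fourth is pure noise.
y = 1.0 * X[:, 0] + 0.5 * X[:, 1] + 0.8 * X[:, 2] + rng.normal(0.0, 0.5, n)

# One shared 80/20 split so both models are judged on the same test rows
idx = rng.permutation(n)
train, test = idx[:800], idx[800:]

def fit_and_score(cols):
    """Fit OLS on the training rows using `cols`; return (RMSE, R²) on the test rows."""
    A_train = np.column_stack([np.ones(len(train)), X[np.ix_(train, cols)]])
    A_test = np.column_stack([np.ones(len(test)), X[np.ix_(test, cols)]])
    beta, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
    resid = y[test] - A_test @ beta
    rmse = np.sqrt(np.mean(resid ** 2))
    r2 = 1.0 - np.sum(resid ** 2) / np.sum((y[test] - y[test].mean()) ** 2)
    return rmse, r2

rmse_full, r2_full = fit_and_score([0, 1, 2, 3])   # all features
rmse_small, r2_small = fit_and_score([0, 1])       # drops an informative feature

print(f"full:    RMSE={rmse_full:.2f}, R²={r2_full:.2f}")
print(f"reduced: RMSE={rmse_small:.2f}, R²={r2_small:.2f}")
```

Using one split for both models keeps the comparison fair, which is also why the notebook reuses `random_state=42` for both calls to `train_test_split`.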


Would you use this simplified model in practice? Why or why not?

I would recommend against using the simplified model in practice, as it performs worse on every key metric: R² drops by 0.10 (0.58 to 0.48), MSE rises by 0.12 (0.56 to 0.68), and RMSE rises by 0.07 (0.75 to 0.82). This makes it a less reliable predictor of house prices than the full model.

However, there may be situations where a simplified model is necessary, such as when dealing with very large datasets or limited computational resources. In those cases, the trade-off is clear: while the simplified model may run faster and be more efficient, it will come at the cost of higher error rates and less accurate predictions. 

Choosing to use a simplified model in this context would require accepting these shortcomings. The decision would need to be based on the specific needs of the situation, weighing the trade-off between accuracy and computational efficiency. If the company can tolerate more imprecise predictions, or if the model is being used for initial exploratory analysis rather than high-stakes decision-making, the simplified model might still be a viable option. Ultimately, it’s a decision that needs to be carefully considered based on the priorities and constraints of the project.

