# Practical Lab 1 - Univariate Linear Regression on the California Housing Prices Dataset

### Step:1 Framing the Problem - Describe the goal of this report 

The goal of this report is to examine how well individual variables in the California Housing Prices Dataset explain median house value. Three univariate linear regressions are fitted, using median income, population, and number of households in turn as the single predictor, and compared by Mean Squared Error (MSE) and Mean Absolute Error (MAE). The comparison shows which of these variables is most useful for understanding housing prices, which may help buyers, sellers, and policymakers concerned with housing affordability.

### Step:2 Getting the Data - hyperlink to the source and load into Pandas

In [25]:
import pandas as pd
import numpy as np

# Load the dataset from a local copy of housing.csv
csv_path = 'C:/Users/pallo/OneDrive/Desktop/conestoga/FMLab/housing.csv'
data = pd.read_csv(csv_path)

# Display the first few rows of the dataset
print(data.head())


   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  \
0    -122.23     37.88                41.0        880.0           129.0   
1    -122.22     37.86                21.0       7099.0          1106.0   
2    -122.24     37.85                52.0       1467.0           190.0   
3    -122.25     37.85                52.0       1274.0           235.0   
4    -122.25     37.85                52.0       1627.0           280.0   

   population  households  median_income  median_house_value ocean_proximity  
0       322.0       126.0         8.3252            452600.0        NEAR BAY  
1      2401.0      1138.0         8.3014            358500.0        NEAR BAY  
2       496.0       177.0         7.2574            352100.0        NEAR BAY  
3       558.0       219.0         5.6431            341300.0        NEAR BAY  
4       565.0       259.0         3.8462            342200.0        NEAR BAY  


### Step:3 Exploratory Data Analysis

In [26]:
import pandas as pd

# Load the dataset
data = pd.read_csv('C:/Users/pallo/OneDrive/Desktop/conestoga/FMLab/housing.csv')

# Display summary statistics
statistics = data.describe()
print(statistics)


          longitude      latitude  housing_median_age   total_rooms  \
count  20640.000000  20640.000000        20640.000000  20640.000000   
mean    -119.569704     35.631861           28.639486   2635.763081   
std        2.003532      2.135952           12.585558   2181.615252   
min     -124.350000     32.540000            1.000000      2.000000   
25%     -121.800000     33.930000           18.000000   1447.750000   
50%     -118.490000     34.260000           29.000000   2127.000000   
75%     -118.010000     37.710000           37.000000   3148.000000   
max     -114.310000     41.950000           52.000000  39320.000000   

       total_bedrooms    population    households  median_income  \
count    20433.000000  20640.000000  20640.000000   20640.000000   
mean       537.870553   1425.476744    499.539680       3.870671   
std        421.385070   1132.462122    382.329753       1.899822   
min          1.000000      3.000000      1.000000       0.499900   
25%        296.00000

#### Median House Value:

Meaning: Represents the median value of houses in a given area.

Range: Roughly $15,000 to about $500,000, reflecting diverse housing markets across California. The upper end is capped in this dataset, so the most expensive areas share the same top value.

Characteristics: This variable is the dependent variable in the analysis and is critical for understanding housing affordability.

#### Median Income:

Meaning: Indicates the median income of residents in the area, expressed in tens of thousands of dollars.

Range: From about 0.5 (the minimum in the summary statistics is 0.4999) to about 15, that is, roughly $5,000 to $150,000, with a mean of about 3.87.

Characteristics: Strongly correlated with median house value compared with the other predictors, suggesting that higher income levels are associated with more expensive homes.

#### Population:

Meaning: Represents the total number of residents in the area.

Range: From as few as 3 residents (the minimum in the summary statistics) to tens of thousands, with a mean of about 1,425. Each row describes a small area rather than a whole city, so values never reach the millions.

Characteristics: Shows only a weak relationship with house values (the fitted slope in Step 4 is about -2.51), indicating that population size on its own says little about housing prices.

#### Number of Households:

Meaning: Total count of households in the area.

Range: From a single household (the minimum in the summary statistics) to several thousand, with a mean of about 500.

Characteristics: Reflects the residential makeup of the area. It shows only a weak positive relationship with median house value (the fitted slope in Step 4 is about 19.87), so other factors, such as the type of housing (e.g., single-family homes vs. multi-family units), likely matter more for prices.
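
The ranges and relationships described above can be checked directly rather than estimated by eye. The sketch below is one possible way to do that; the helper name `describe_against_target` and the small `toy` frame are made up for illustration and are not part of the original notebook. On the real data, the same call could be made with `data` in place of `toy`.

```python
import pandas as pd

def describe_against_target(df, target, columns):
    """Return min, max and Pearson correlation with `target` for each column."""
    rows = []
    for col in columns:
        rows.append({
            "variable": col,
            "min": df[col].min(),
            "max": df[col].max(),
            "corr_with_target": df[col].corr(df[target]),
        })
    return pd.DataFrame(rows).set_index("variable")

# Tiny made-up frame, only to show the call; these are NOT values from housing.csv.
toy = pd.DataFrame({
    "median_income": [1.0, 2.0, 3.0, 4.0],
    "population": [900.0, 300.0, 1200.0, 600.0],
    "median_house_value": [100000.0, 200000.0, 300000.0, 400000.0],
})

summary = describe_against_target(
    toy, "median_house_value", ["median_income", "population"]
)
print(summary)
```

A correlation near 1 or -1 indicates a strong linear relationship, while a value near 0 indicates that a straight line explains very little.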

### Step:4 Run three linear regressions (fitting)

In [42]:
import pandas as pd
import numpy as np

# Load the dataset
data = pd.read_csv('C:/Users/pallo/OneDrive/Desktop/conestoga/FMLab/housing.csv')

# Define the dependent variable
y = data['median_house_value'].values

# Initialize a list to store results
results = []

# Function to calculate regression coefficients (intercept and slope)
def linear_regression(X, y):
    # Add a constant term for intercept
    X = np.c_[np.ones(X.shape[0]), X]  # Adds a column of ones to X
    # Calculate coefficients (intercept and slope)
    coefficients = np.linalg.inv(X.T @ X) @ X.T @ y
    return coefficients

# 1. Median House Value vs. Median Income
X_income = data[['median_income']].values
coefficients_income = linear_regression(X_income, y)
intercept_income, slope_income = coefficients_income
results.append(['Median Income', intercept_income, slope_income])

# 2. Median House Value vs. Population
X_population = data[['population']].values
coefficients_population = linear_regression(X_population, y)
intercept_population, slope_population = coefficients_population
results.append(['Population', intercept_population, slope_population])

# 3. Median House Value vs. Number of Households
X_households = data[['households']].values
coefficients_households = linear_regression(X_households, y)
intercept_households, slope_households = coefficients_households
results.append(['Number of Households', intercept_households, slope_households])

# Create a DataFrame for results
results_df = pd.DataFrame(results, columns=['Model', 'Intercept', 'Slope'])

# Display the results
print(results_df)


                  Model      Intercept         Slope
0         Median Income   45085.576703  41793.849202
1            Population  210436.262076     -2.511753
2  Number of Households  196928.577162     19.872775
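
As a sanity check on the normal-equation fit used above, the same line can be obtained in two other ways: the closed-form univariate formulas (slope = cov(x, y) / var(x)) and `np.linalg.lstsq`, which avoids explicitly inverting `X.T @ X`. The sketch below uses synthetic data and hypothetical helper names (`fit_line_closed_form`, `fit_line_lstsq`); it is an illustration, not the notebook's own method.

```python
import numpy as np

def fit_line_closed_form(x, y):
    """Univariate OLS: slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x)."""
    x = np.asarray(x, dtype=float).ravel()
    y = np.asarray(y, dtype=float).ravel()
    x_centered = x - x.mean()
    slope = np.sum(x_centered * (y - y.mean())) / np.sum(x_centered ** 2)
    intercept = y.mean() - slope * x.mean()
    return intercept, slope

def fit_line_lstsq(x, y):
    """Same fit via a least-squares solver instead of an explicit matrix inverse."""
    x = np.asarray(x, dtype=float).ravel()
    y = np.asarray(y, dtype=float).ravel()
    design = np.column_stack([np.ones_like(x), x])  # column of ones for the intercept
    coefficients, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefficients[0], coefficients[1]

# Synthetic data (assumption for illustration): true intercept 3, true slope 2.
rng = np.random.default_rng(0)
x_demo = rng.uniform(0, 10, size=200)
y_demo = 3.0 + 2.0 * x_demo + rng.normal(0, 0.5, size=200)

closed = fit_line_closed_form(x_demo, y_demo)
lstsq = fit_line_lstsq(x_demo, y_demo)
print(closed, lstsq)
```

Both functions should agree with each other to floating-point precision; applied to a column of the housing data, they should also reproduce the intercept and slope in the table above.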


### Step:5 In a single table for all three linear regressions, provide per regression model

In [43]:
import pandas as pd
import numpy as np

# Load the dataset
data = pd.read_csv('C:/Users/pallo/OneDrive/Desktop/conestoga/FMLab/housing.csv')

# Define the dependent variable
y = data['median_house_value'].values

# Initialize a list to store results
results = []

# Function to calculate regression metrics using numpy
def regression_metrics(X, y):
    # Add a constant term for intercept
    X = np.c_[np.ones(X.shape[0]), X]  # Adds a column of ones to X
    # Calculate coefficients (intercept and slope)
    coefficients = np.linalg.inv(X.T @ X) @ X.T @ y
    intercept = coefficients[0]
    slope = coefficients[1]
    
    # Predictions
    predictions = X @ coefficients
    
    # Calculate MSE and MAE
    mse = np.mean((y - predictions) ** 2)
    mae = np.mean(np.abs(y - predictions))
    
    return intercept, slope, mse, mae

# 1. Median House Value vs. Median Income
X_income = data[['median_income']].values
intercept_income, slope_income, mse_income, mae_income = regression_metrics(X_income, y)
results.append(['Median Income', intercept_income, slope_income, mse_income, mae_income])

# 2. Median House Value vs. Population
X_population = data[['population']].values
intercept_population, slope_population, mse_population, mae_population = regression_metrics(X_population, y)
results.append(['Population', intercept_population, slope_population, mse_population, mae_population])

# 3. Median House Value vs. Number of Households
X_households = data[['households']].values
intercept_households, slope_households, mse_households, mae_households = regression_metrics(X_households, y)
results.append(['Number of Households', intercept_households, slope_households, mse_households, mae_households])

# Create a DataFrame for results
results_df = pd.DataFrame(results, columns=['Model', 'Intercept', 'Slope', 'Mean Squared Error', 'Mean Absolute Error'])

# Display the results
print(results_df)


                  Model      Intercept         Slope  Mean Squared Error  \
0         Median Income   45085.576703  41793.849202        7.011312e+09   
1            Population  210436.262076     -2.511753        1.330741e+10   
2  Number of Households  196928.577162     19.872775        1.325778e+10   

   Mean Absolute Error  
0         62625.933791  
1         91153.820095  
2         90802.743243  
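
MSE values on the order of 1e10 are hard to interpret, because they are in squared dollars. Two derived quantities are often easier to read: the root mean squared error (RMSE), which is back in dollars, and the coefficient of determination R², which compares the model with simply predicting the mean. The helper below is a minimal sketch with a hypothetical name (`rmse_and_r2`), demonstrated on made-up numbers.

```python
import numpy as np

def rmse_and_r2(y_true, y_pred):
    """Return (RMSE, R^2) for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    residual_ss = np.sum((y_true - y_pred) ** 2)        # unexplained variation
    total_ss = np.sum((y_true - y_true.mean()) ** 2)    # variation around the mean
    rmse = np.sqrt(residual_ss / y_true.size)
    r2 = 1.0 - residual_ss / total_ss
    return rmse, r2

# Made-up values for illustration only.
y_demo = np.array([1.0, 2.0, 3.0, 4.0])
perfect = rmse_and_r2(y_demo, y_demo)                        # predictions equal the data
baseline = rmse_and_r2(y_demo, np.full(4, y_demo.mean()))    # always predict the mean
print(perfect, baseline)
```

An R² close to 0 means the model does little better than predicting the mean house value for every row, while a value close to 1 means most of the variation is explained.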


### Step:6 Plotting the data, the regression line, and the fitted parameters

In [52]:
!pip install plotly



In [53]:
import pandas as pd
import numpy as np
import plotly.graph_objects as go

# Load the dataset
data = pd.read_csv('C:/Users/pallo/OneDrive/Desktop/conestoga/FMLab/housing.csv')

# Define the dependent variable
y = data['median_house_value'].values

# Linear regression function
def linear_regression(X, y):
    X = np.c_[np.ones(X.shape[0]), X]
    coefficients = np.linalg.inv(X.T @ X) @ X.T @ y
    return coefficients

# Calculate MSE and MAE
def calculate_errors(y_true, y_pred):
    mse = np.mean((y_true - y_pred) ** 2)
    mae = np.mean(np.abs(y_true - y_pred))
    return mse, mae

# Function to create plots
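def create_plot(x, y, intercept, slope, model_name):
    # Flatten x from shape (n, 1) to (n,). Otherwise y - y_pred broadcasts
    # to an (n, n) array, which gives wrong MSE/MAE values and uses a lot of memory.
    x = x.ravel()
    y_pred = intercept + slope * x
    mse, mae = calculate_errors(y, y_pred)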
    
    # Create scatter and line plot
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=x.flatten(), y=y, mode='markers', name='Data Points', marker=dict(color='blue', opacity=0.5)))
    fig.add_trace(go.Scatter(x=x.flatten(), y=y_pred, mode='lines', name='Regression Line', line=dict(color='red')))
    
    # Add text box with parameters
    text = f'Intercept: {intercept:.2f}<br>Slope: {slope:.2f}<br>MSE: {mse:.2f}<br>MAE: {mae:.2f}'
    fig.add_annotation(x=0.1, y=0.9, xref="paper", yref="paper", text=text, showarrow=False, bgcolor="white")
    
    fig.update_layout(title=f'Median House Value vs. {model_name}',
                      xaxis_title=model_name,
                      yaxis_title='Median House Value')
    
    fig.show()

# 1. Median House Value vs. Median Income
X_income = data[['median_income']].values
coefficients_income = linear_regression(X_income, y)
create_plot(X_income, y, *coefficients_income, 'Median Income')

# 2. Median House Value vs. Population
X_population = data[['population']].values
coefficients_population = linear_regression(X_population, y)
create_plot(X_population, y, *coefficients_population, 'Population')

# 3. Median House Value vs. Number of Households
X_households = data[['households']].values
coefficients_households = linear_regression(X_households, y)
create_plot(X_households, y, *coefficients_households, 'Number of Households')
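
A straight line is fully determined by two points, so the regression trace does not need one predicted value per row. The sketch below shows one optional way to compute only the two endpoints; the helper name `regression_line_endpoints` is hypothetical. The resulting arrays could then be passed as `x` and `y` to a `go.Scatter(..., mode='lines')` trace in place of the full prediction vector.

```python
import numpy as np

def regression_line_endpoints(x, intercept, slope):
    """Return the two (x, y) endpoints of the fitted line over the range of x."""
    x = np.asarray(x, dtype=float).ravel()      # accepts shape (n,) or (n, 1)
    x_line = np.array([x.min(), x.max()])
    y_line = intercept + slope * x_line
    return x_line, y_line

# Made-up inputs for illustration only.
x_line, y_line = regression_line_endpoints(
    [[3.0], [1.0], [2.0]], intercept=10.0, slope=2.0
)
print(x_line, y_line)
```

Drawing two points instead of thousands keeps the figure lighter and avoids the line being traced back and forth over itself in unsorted data order.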


### Step:7 Summary: provide a conclusion. Compare the models in terms of their goodness-of-fit, and add additional insights you observed 

In this analysis of the California Housing Prices dataset, we explored three models to predict median house value based on different independent variables: median income, population, and number of households. Each model was evaluated for its goodness-of-fit using Mean Squared Error (MSE) and Mean Absolute Error (MAE) metrics.

### Model Comparisons

#### Median Income vs. Median House Value:

Intercept and Slope: The fitted line has an intercept of about 45,086 and a slope of about 41,794 per one-unit increase in median income, a positive relationship in which higher median income goes with higher house values.

Goodness-of-Fit: This model has the lowest MSE (about 7.01e9) and the lowest MAE (about 62,626) of the three, so it provides the best fit for predicting house values. Its MSE is roughly 47% below that of the population model, which supports the notion that income is a significant driver of housing prices.

#### Population vs. Median House Value:

Intercept and Slope: Unlike the income model, the fitted slope is slightly negative (about -2.51 per additional resident, with an intercept of about 210,436), so the line is close to flat. Population size alone says little about price.

Goodness-of-Fit: The MSE (about 1.33e10) and MAE (about 91,154) are the highest of the three models and far above those of the income model, indicating much larger prediction errors. This suggests that whatever effect population has on housing prices is dominated by factors not captured by this model.


#### Number of Households vs. Median House Value:

Intercept and Slope: The fitted slope is positive (about 19.87 per additional household, with an intercept of about 196,929), but the relationship is much weaker than for income.

Goodness-of-Fit: With an MSE of about 1.33e10 and an MAE of about 90,803, this model fits only marginally better than the population model and far worse than the income model. The relationship may be diluted by factors such as household income levels and the types of housing available.
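
The ordering of the three models can also be read off programmatically from the Step 5 table. The sketch below re-enters the reported MSE and MAE values by hand (an assumption made so the snippet runs without `housing.csv`), adds RMSE for readability, and sorts by MSE.

```python
import pandas as pd

# Values copied from the Step 5 results table.
reported = pd.DataFrame({
    "Model": ["Median Income", "Population", "Number of Households"],
    "MSE": [7.011312e9, 1.330741e10, 1.325778e10],
    "MAE": [62625.933791, 91153.820095, 90802.743243],
})

# RMSE is in dollars, which makes it easier to compare with MAE.
reported["RMSE"] = reported["MSE"] ** 0.5

# Lower error means a better fit, so the best model comes first.
ranked = reported.sort_values("MSE").reset_index(drop=True)
print(ranked)
```

Sorting by MAE instead of MSE gives the same ordering for these three models.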