# Building a Multiple Linear Regression Model

## Objective

- Describe how to build a multiple linear regression model with Scikit-learn's LinearRegression class, using all independent variables that correlate with price.
- Write a Python code snippet that uses Scikit-learn's cross_val_score function to assess the model's performance with cross-validation.

## Question to Answer
What metrics would you use to evaluate this model (e.g., R-squared, MSE, Mean Absolute Error (MAE) with metrics.mean_absolute_error())?


In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Load the saved Dataset
df_one_hot = pd.read_excel(r'C:\Users\hp\OneDrive\Desktop\Data_Analyst\Loctech_House_Prices\Dataset\df_one_hot.xlsx')

In [4]:
# Filter numeric columns for correlation calculation
numeric_columns = df_one_hot.select_dtypes(include=['number'])

# Compute correlation matrix
correlation_matrix = numeric_columns.corr()

print(correlation_matrix)

# Filter variables with correlation > 0.03 (absolute value)
relevant_features = correlation_matrix['Price ($)'][abs(correlation_matrix['Price ($)']) > 0.03].index.tolist()

# Exclude 'Price ($)' itself
relevant_features.remove('Price ($)')
print("Relevant features:", relevant_features)


                      House_ID  Size (sq ft)  Bedrooms  Bathrooms       Age  \
House_ID              1.000000     -0.018601 -0.165449   0.054599 -0.067746   
Size (sq ft)         -0.018601      1.000000  0.008606  -0.054507  0.006126   
Bedrooms             -0.165449      0.008606  1.000000   0.174625 -0.105081   
Bathrooms             0.054599     -0.054507  0.174625   1.000000 -0.054055   
Age                  -0.067746      0.006126 -0.105081  -0.054055  1.000000   
Price ($)            -0.065263      0.000857  0.038449   0.098669  0.123575   
Location_Label        0.093738     -0.085844  0.065511  -0.108575  0.118278   
Location_Chicago     -0.088826     -0.117016 -0.074574   0.090011  0.014768   
Location_Houston     -0.081348      0.247550  0.015582   0.041545 -0.139262   
Location_Los Angeles  0.033596     -0.016910  0.069220   0.077993 -0.011992   
Location_New York     0.215510     -0.062339 -0.146771  -0.267931  0.028491   
Location_Phoenix     -0.071970     -0.088829  0.1241
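
The filter above keeps every numeric column whose absolute correlation with the price exceeds 0.03. By that rule an identifier such as House_ID (correlation -0.065 with price) is kept as a feature, although a row ID says nothing about the house itself. One possible guard is to drop such columns before filtering. The sketch below uses a small synthetic frame; the data, the `exclude` argument and the `select_features` helper are illustrative assumptions and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

def select_features(df, target, threshold=0.03, exclude=()):
    """Return numeric columns whose |correlation| with `target` exceeds `threshold`.

    Columns listed in `exclude` (for example row identifiers) are dropped first.
    """
    numeric = df.select_dtypes(include="number").drop(columns=list(exclude))
    corr_with_target = numeric.corr()[target].drop(target)
    return corr_with_target[corr_with_target.abs() > threshold].index.tolist()

# Synthetic stand-in for the housing data (values are made up for illustration)
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "House_ID": np.arange(n),
    "Bedrooms": rng.integers(1, 6, size=n),
    "Age": rng.integers(0, 80, size=n),
})
demo["Price ($)"] = 50_000 * demo["Bedrooms"] + rng.normal(0, 20_000, size=n)

features = select_features(demo, "Price ($)", exclude=["House_ID"])
print(features)
```

With the notebook's data, the equivalent call would pass `df_one_hot` and the same target column name.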

In [None]:
# Select relevant features
X = df_one_hot[relevant_features]
y = df_one_hot['Price ($)']

# Split into training and testing sets (80-20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [7]:
# Initialize and fit the model on the training set
model = LinearRegression()
model.fit(X_train, y_train)

# Display model coefficients
print("Intercept:", model.intercept_)
print("Coefficients:", model.coef_)


Intercept: 430615.7188356812
Coefficients: [    413.65314809   19593.49006105   24955.14012841    3924.51059281
  -27125.01400532    8200.70606863   91259.24533787 -131550.39653681
  -16354.75320917   48445.19833948]


In [8]:
# Evaluate the model

# Predict on the test set
y_pred = model.predict(X_test)

# Calculate metrics
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Print evaluation metrics
print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"Mean Absolute Error (MAE): {mae:.2f}")
print(f"R-squared: {r2:.2f}")


Mean Squared Error (MSE): 65104157153.51
Mean Absolute Error (MAE): 231236.33
R-squared: 0.01
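
An R-squared of 0.01 means the model explains about 1% of the variance in the test-set prices, which is barely better than predicting the mean price for every house. MSE is expressed in squared dollars, so its square root (RMSE, roughly $255,000 here) is easier to read next to the MAE. The sketch below shows both checks on synthetic data, using scikit-learn's `DummyRegressor` as the mean-price baseline. The data and the `report` helper are illustrative assumptions, not the notebook's own.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: one informative feature plus noise (illustrative only)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 3))
y_demo = 100_000 * X_demo[:, 0] + rng.normal(scale=50_000, size=300) + 400_000

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

def report(estimator):
    """Fit on the training split and return test-set MAE, RMSE and R-squared."""
    estimator.fit(X_tr, y_tr)
    pred = estimator.predict(X_te)
    rmse = np.sqrt(mean_squared_error(y_te, pred))  # back in the target's units
    return {"MAE": mean_absolute_error(y_te, pred), "RMSE": rmse, "R2": r2_score(y_te, pred)}

baseline = report(DummyRegressor(strategy="mean"))  # always predicts the training mean
linear = report(LinearRegression())

print("Baseline:", baseline)
print("Linear:  ", linear)
```

If the linear model's MAE and RMSE are close to the baseline's, as the R-squared of 0.01 suggests for the housing data, the selected features carry little linear signal about price.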


- Cross-Validation

In [10]:
from sklearn.model_selection import cross_val_score

# Cross-validation for the multiple linear regression model
cv_scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')
mean_mse = -cv_scores.mean()
std_mse = cv_scores.std()

print(f"Mean MSE: {mean_mse}, Standard Deviation: {std_mse}")

Mean MSE: 78856144538.79152, Standard Deviation: 22128236553.74583
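
The cross-validated MSE (about 7.9e10) is higher than the single hold-out estimate (about 6.5e10), and the spread across folds is large. `cross_val_score` reports one metric per call. A possible extension, sketched here on synthetic data, uses `cross_validate` to collect R-squared, MAE and RMSE in one pass and a shuffled `KFold` so that fold membership does not depend on row order. The data and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic regression data standing in for X and y (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5, 0.0]) + rng.normal(scale=1.0, size=200)

# Shuffle before splitting so fold membership does not depend on row order
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scoring = {
    "r2": "r2",
    "mae": "neg_mean_absolute_error",
    "rmse": "neg_root_mean_squared_error",
}
results = cross_validate(LinearRegression(), X_demo, y_demo, cv=cv, scoring=scoring)

# Error scorers are negated by scikit-learn so that higher is always better
mae_per_fold = -results["test_mae"]
rmse_per_fold = -results["test_rmse"]
print(f"R-squared: {results['test_r2'].mean():.3f} +/- {results['test_r2'].std():.3f}")
print(f"MAE:  {mae_per_fold.mean():.3f} +/- {mae_per_fold.std():.3f}")
print(f"RMSE: {rmse_per_fold.mean():.3f} +/- {rmse_per_fold.std():.3f}")
```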


- Coefficient Interpretation

In [11]:
# Coefficients of the fitted model, labelled by feature
coefficients = pd.DataFrame(model.coef_, index=relevant_features, columns=['Coefficient'])
print(coefficients)


                        Coefficient
House_ID                 413.653148
Bedrooms               19593.490061
Bathrooms              24955.140128
Age                     3924.510593
Location_Label        -27125.014005
Location_Chicago        8200.706069
Location_Houston       91259.245338
Location_Los Angeles -131550.396537
Location_New York     -16354.753209
Location_Phoenix       48445.198339
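
Each coefficient above is the change in predicted price for a one-unit increase in that feature with the others held fixed, so the magnitudes are not directly comparable: one more bedroom and one more year of age are very different steps. One common way to put them on a shared scale is to standardize the features before fitting. The sketch below does this on synthetic data with `StandardScaler` in a pipeline; the data and names are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (illustrative only)
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "Bedrooms": rng.integers(1, 6, size=300).astype(float),
    "Age": rng.integers(0, 100, size=300).astype(float),
})
price = 20_000 * demo["Bedrooms"] + 4_000 * demo["Age"] + rng.normal(scale=10_000, size=300)

# Scale each feature to zero mean and unit variance before fitting
pipeline = make_pipeline(StandardScaler(), LinearRegression())
pipeline.fit(demo, price)

# Coefficients now read as "price change per one standard deviation of the feature"
standardized = pd.Series(
    pipeline.named_steps["linearregression"].coef_,
    index=demo.columns,
    name="Coefficient per 1 SD",
).sort_values(key=np.abs, ascending=False)
print(standardized)
```

In this synthetic example the raw per-unit effect of Bedrooms is larger than that of Age, yet Age ranks first after standardization because it varies over a much wider range.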


- Predictive Equation

In [12]:
# Equation
intercept = model.intercept_
equation = "Price ($) = " + " + ".join([f"{coef:.2f}*{feature}" for coef, feature in zip(model.coef_, relevant_features)]) + f" + {intercept:.2f}"
print(equation)


Price ($) = 413.65*House_ID + 19593.49*Bedrooms + 24955.14*Bathrooms + 3924.51*Age + -27125.01*Location_Label + 8200.71*Location_Chicago + 91259.25*Location_Houston + -131550.40*Location_Los Angeles + -16354.75*Location_New York + 48445.20*Location_Phoenix + 430615.72
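
The printed equation is only a string. To confirm that it describes what the model computes, the intercept plus the coefficient-weighted sum of the features can be evaluated by hand and compared with `predict()`. A minimal sketch on synthetic data follows; the data and the `demo_model` name are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data standing in for the housing features (illustrative only)
rng = np.random.default_rng(7)
X_demo = rng.normal(size=(50, 3))
y_demo = 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + 5.0 + rng.normal(scale=0.1, size=50)

demo_model = LinearRegression().fit(X_demo, y_demo)

# Evaluate the equation by hand: intercept + sum(coefficient * feature value)
manual = demo_model.intercept_ + X_demo @ demo_model.coef_
matches = np.allclose(manual, demo_model.predict(X_demo))
print("Equation reproduces predict():", matches)
```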
