# UA2_S1. Multivariate analysis - Multiple regression

**Exercise 1: Multiple Linear Regression Analysis on Property Data**

**Objective:**    
In this exercise, you will analyze the impact of **size (m²)** and **age (years)** of a property on its price using multiple linear regression. You will manually compute the regression coefficients, interpret statistical results, and visualize the data with an interactive 3D plot.

**Instructions:**    

1. **Data Loading:**
   - Download the file **"Datos_de_Precio_de_Propiedades.csv"** and ensure the file path is correct.
   - Load the data into a pandas DataFrame.

2. **Data Processing:**
   - Round all numerical values to the nearest integer.
   - Configure pandas to display numbers without decimals and without thousand separators.

3. **Multiple Linear Regression Calculation:**
   - Define the independent variables **(Tamaño_m2 and Edad_años)** and the dependent variable **(Precio)**.
   - Add a column of ones to include the intercept term in the model.
   - Use the **least squares equation** to calculate the regression coefficients manually.

4. **Model Evaluation:**
   - Compute **predictions** and **residuals**.
   - Determine the **correlation coefficient R** (obtained here through the coefficient of determination R²) to measure the goodness of fit.
   - Calculate the **p-values** of the coefficients to assess their statistical significance.

5. **Results Visualization:**
   - Plot the data in an **interactive 3D graph** using Plotly.
   - Include both **real data points** and the **fitted regression surface**.

6. **Analysis and Interpretation:**
   - Write a brief **conclusion** interpreting the relationship between price, size, and age of the properties.
   - Reflect on the importance of each variable in predicting the price based on the obtained results.
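
For reference, steps 3 and 4 rely on the standard ordinary least squares formulas, written here as a compact summary rather than a derivation. The symbols follow the variable names used in the code cells of this notebook: $X$ is the design matrix including the column of ones, $y$ the price vector, $n$ the number of rows, and $p$ the number of estimated parameters (intercept included).

$$
\hat{\beta} = (X^{\top}X)^{-1}X^{\top}y,
\qquad
\hat{y} = X\hat{\beta},
\qquad
e = y - \hat{y}
$$

$$
R^2 = 1 - \frac{SS_{res}}{SS_{tot}}
    = 1 - \frac{\sum_i e_i^2}{\sum_i (y_i - \bar{y})^2},
\qquad
\hat{\sigma}^2 = \frac{SS_{res}}{n - p}
$$

$$
\widehat{\mathrm{Var}}(\hat{\beta}) = \hat{\sigma}^2 (X^{\top}X)^{-1},
\qquad
t_j = \frac{\hat{\beta}_j}{SE(\hat{\beta}_j)},
\qquad
p_j = 2\,P\left(T_{n-p} > |t_j|\right)
$$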

**Data Loading:**

In [16]:
import pandas as pd
import numpy as np
import scipy.stats as stats
import plotly.graph_objects as go



In [3]:
df = pd.read_csv('Datos_de_Precio_de_Propiedades.csv')
display(df)

Unnamed: 0,Precio,Tamaño_m2,Edad_años
0,125748.835254,106.181018,1.597155
1,186770.557020,192.607146,13.091798
2,170310.852290,159.799091,6.972764
3,138701.239225,139.798773,10.662843
4,100014.337458,73.402796,18.243763
...,...,...,...
95,131269.706362,124.069339,7.634982
96,148192.745081,128.409924,14.793158
97,118287.096904,114.131153,18.045095
98,53659.492574,53.812869,17.854642


**Data Processing:**

In [24]:
# Round all numerical values to the nearest integer.
df = df.round(0).astype(int)

In [5]:
# Configure pandas to display numbers without decimals and without thousand separators.
pd.options.display.float_format = '{:.0f}'.format  # No ',' in the format spec, so no thousand separators
pd.options.display.max_columns = None  # Show all columns

df.head(5)

Unnamed: 0,Precio,Tamaño_m2,Edad_años
0,125749,106,2
1,186771,193,13
2,170311,160,7
3,138701,140,11
4,100014,73,18


**Multiple Linear Regression Calculation:**

In [8]:
# Define the independent variables (Tamaño_m2 and Edad_años)
X = df[['Tamaño_m2', 'Edad_años']].values  # Convert to a NumPy array
X = np.c_[np.ones(X.shape[0]), X]  # Prepend a column of ones for the intercept

# Define the dependent variable
y = df['Precio'].values.reshape(-1, 1)  # Ensure y is a column vector

In [9]:
# Calculate the regression coefficients manually (normal equation)
beta = np.linalg.inv(X.T @ X) @ X.T @ y

print('The regression coefficients are:')
print(beta)

The regression coefficients are:
[[49866.59981524]
 [  777.21831193]
 [-1121.41504732]]
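
As a sanity check on the manual formula, the same coefficients can be obtained without forming the explicit inverse. The sketch below is illustrative only: it uses synthetic stand-in data (not the property dataset), and the names `beta_normal`, `beta_solve`, and `beta_lstsq` are introduced here for the comparison.

```python
import numpy as np

# Synthetic stand-in data (NOT the property dataset): 50 rows, 2 predictors
rng = np.random.default_rng(0)
size = rng.uniform(50, 200, 50)
age = rng.uniform(0, 20, 50)
price = 50_000 + 800 * size - 1_000 * age + rng.normal(0, 5_000, 50)

X = np.c_[np.ones(size.shape[0]), size, age]  # Intercept column + predictors
y = price.reshape(-1, 1)

# Normal equation with an explicit inverse, as in the cell above
beta_normal = np.linalg.inv(X.T @ X) @ X.T @ y

# Same linear system solved without computing the inverse
beta_solve = np.linalg.solve(X.T @ X, X.T @ y)

# NumPy's built-in least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_normal, beta_solve))
print(np.allclose(beta_normal, beta_lstsq))
```

All three routes solve the same least squares problem. `np.linalg.solve` and `np.linalg.lstsq` are generally the more numerically robust choices when the predictors are strongly correlated.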


**Model Evaluation:**

In [10]:
# Compute predictions
y_pred = X @ beta

# Compute residuals
residuals = y - y_pred


In [11]:

print("First 5 Predictions:\n", y_pred[:5])
print("First 5 Residuals:\n", residuals[:5])

First 5 Predictions:
 [[130601.35812935]
 [184883.06160033]
 [166246.01771385]
 [146563.29315295]
 [ 86457.76671198]]
First 5 Residuals:
 [[-4852.52287552]
 [ 1887.49541945]
 [ 4064.83457583]
 [-7862.05392776]
 [13556.57074566]]


In [None]:
# Compute R-squared (coefficient of determination)
SS_res = np.sum(residuals**2)  # Sum of squared residuals
SS_tot = np.sum((y - np.mean(y))**2)  # Total sum of squares
R_squared = 1 - (SS_res / SS_tot)


In [None]:

print("R-squared:", R_squared)

R-squared: 0.9293535634215089
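
The instructions ask for the correlation coefficient R, while the cells here compute R². For a least squares fit that includes an intercept, the multiple correlation coefficient R is the square root of R² and equals the Pearson correlation between the observed and predicted values. For the R² reported in this notebook, that gives R of about 0.964. The sketch below checks the identity on synthetic stand-in data; `R_from_r2` and `R_from_corr` are names introduced for this illustration.

```python
import numpy as np

# Synthetic stand-in data (NOT the property dataset)
rng = np.random.default_rng(1)
x1 = rng.uniform(50, 200, 40)
x2 = rng.uniform(0, 20, 40)
y = 50_000 + 800 * x1 - 1_000 * x2 + rng.normal(0, 5_000, 40)

X = np.c_[np.ones(40), x1, x2]  # Design matrix with intercept
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_pred = X @ beta

SS_res = np.sum((y - y_pred) ** 2)
SS_tot = np.sum((y - y.mean()) ** 2)
R_squared = 1 - SS_res / SS_tot

R_from_r2 = np.sqrt(R_squared)              # R as the square root of R-squared
R_from_corr = np.corrcoef(y, y_pred)[0, 1]  # R as corr(observed, predicted)

print(R_from_r2, R_from_corr)
```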

In [14]:
# Compute standard error of coefficients
n, p = X.shape  # n observations, p parameters (intercept + 2 predictors)
sigma_squared = SS_res / (n - p)  # Unbiased estimate of the residual variance
var_beta = np.linalg.inv(X.T @ X) * sigma_squared  # Covariance matrix of the coefficients
SE_beta = np.sqrt(np.diag(var_beta))  # Standard errors

# Compute t-statistic
t_stats = beta.flatten() / SE_beta

# Compute p-values
p_values = 2 * (1 - stats.t.cdf(np.abs(t_stats), df=n - p))

In [15]:
# Display results
print("Coefficients:", beta.flatten())
print("Standard Errors:", SE_beta)
print("t-statistics:", t_stats)
print("p-values:", p_values)

Coefficients: [49866.59981524   777.21831193 -1121.41504732]
Standard Errors: [3461.93830003   22.25131259  178.29218752]
t-statistics: [14.4042428  34.92909952 -6.28975988]
p-values: [0.00000000e+00 0.00000000e+00 9.14882636e-09]
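
The first two p-values print as exactly zero because `1 - cdf` loses all precision once the CDF rounds to 1.0 in floating point. A possible alternative is the survival function `stats.t.sf`, which computes the upper tail directly. The sketch below reuses the t-statistics printed above; the degrees of freedom value is an assumption (100 rows and 3 parameters), since the exact row count is not visible in the output.

```python
from scipy import stats

t_stats = [14.4042428, 34.92909952, -6.28975988]  # t-statistics printed above
dof = 97  # ASSUMPTION: n - p with n = 100 rows and p = 3 parameters

p_cdf_values = []
p_sf_values = []
for t in t_stats:
    p_cdf = 2 * (1 - stats.t.cdf(abs(t), df=dof))  # Formula used in the notebook
    p_sf = 2 * stats.t.sf(abs(t), df=dof)          # Upper tail computed directly
    p_cdf_values.append(p_cdf)
    p_sf_values.append(p_sf)
    print(f"t = {t:8.3f}   1 - cdf: {p_cdf:.3e}   sf: {p_sf:.3e}")
```

Both formulas agree for moderate t-statistics. They differ only when the p-value falls below machine precision, where the survival function still returns a small positive number.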


**Results Visualization:**

In [17]:
# Create a grid of values
Tamaño_range = np.linspace(df['Tamaño_m2'].min(), df['Tamaño_m2'].max(), 30)
Edad_range = np.linspace(df['Edad_años'].min(), df['Edad_años'].max(), 30)
Tamaño_grid, Edad_grid = np.meshgrid(Tamaño_range, Edad_range)

# Compute predicted prices for the surface
X_surface = np.c_[np.ones(Tamaño_grid.ravel().shape), Tamaño_grid.ravel(), Edad_grid.ravel()]
y_surface = X_surface @ beta  # Predict using regression equation
y_surface = y_surface.reshape(Tamaño_grid.shape)  # Reshape for plotting
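
The reshape step works because `np.meshgrid` returns arrays whose rows follow the second input and whose columns follow the first, which is also the layout a surface plot expects for `z` when `x` and `y` are given as 1D ranges. The toy sketch below uses a tiny grid and hypothetical coefficients so the shapes and one value can be checked by hand.

```python
import numpy as np

x_range = np.linspace(0, 3, 4)   # 4 values along x (plays the role of size)
y_range = np.linspace(0, 10, 3)  # 3 values along y (plays the role of age)
x_grid, y_grid = np.meshgrid(x_range, y_range)  # Both have shape (3, 4)

beta = np.array([[1.0], [2.0], [0.5]])  # HYPOTHETICAL coefficients: intercept, x, y

# Flatten the grid into a design matrix, predict, then restore the grid shape
X_surface = np.c_[np.ones(x_grid.size), x_grid.ravel(), y_grid.ravel()]
z = (X_surface @ beta).reshape(x_grid.shape)

print(z.shape)  # (3, 4): rows follow y_range, columns follow x_range
print(z[2, 3])  # 1 + 2*3 + 0.5*10 = 12.0
```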


In [19]:
# Create 3D scatter plot of real data
scatter = go.Scatter3d(
    x=df['Tamaño_m2'], 
    y=df['Edad_años'], 
    z=df['Precio'], 
    mode='markers', 
    marker=dict(size=5, color='blue', opacity=0.8), 
    name='Actual Data'
)

# Create surface plot for regression plane
surface = go.Surface(
    x=Tamaño_range, 
    y=Edad_range, 
    z=y_surface, 
    colorscale='viridis', 
    opacity=0.7, 
    name='Regression Surface'
)

In [20]:
# Combine both plots
fig = go.Figure(data=[scatter, surface])

In [None]:
# Customize layout
fig.update_layout(
    title="3D Regression Model: Precio vs Tamaño & Edad",
    scene=dict(
        xaxis_title="Tamaño_m2",
        yaxis_title="Edad_años",
        zaxis_title="Precio"
    )
)



**CONCLUSION**


**Downward trend**

- Price falls as age increases and as size decreases, so older and smaller properties tend to have lower prices.

**Good alignment with the regression plane**

- The fitted surface is a plane and the points lie close to it, indicating a strong linear relationship between the variables. The R² of about 0.93 is high: the model explains roughly 93% of the variance in price using size and age alone.

**Point distribution**

- The blue points represent the actual data and align quite well with the fitted plane. Any points that lie far from the surface would suggest that factors not included in the model also affect the price.

**FINAL CONCLUSION**

- Tamaño_m2 has a positive effect on price: each additional m² adds about 777 to the predicted price, holding age constant.
- Edad_años has a negative effect: each additional year lowers the predicted price by about 1,121, holding size constant.
- Both coefficients have p-values far below 0.05, so both effects are statistically significant.
- The model fits the data well, as most of the points lie close to the regression surface.

The final conclusion is that the model is reliable for predicting prices from size and age alone, within the range of the observed data. It could be improved by including more variables such as location, number of rooms, or construction quality.
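
To make the interpretation concrete, the fitted equation can be evaluated for a single property. The sketch below plugs in the full-dataset coefficients printed earlier; `predict_price` is a hypothetical helper introduced for illustration, and the example property (100 m², 10 years old) is arbitrary.

```python
import numpy as np

# Coefficients printed for the full dataset: intercept, Tamaño_m2, Edad_años
beta = np.array([49866.59981524, 777.21831193, -1121.41504732])

def predict_price(size_m2, age_years):
    """Hypothetical helper: evaluate the fitted plane at one point."""
    return float(np.array([1.0, size_m2, age_years]) @ beta)

price = predict_price(100, 10)
print(round(price, 2))  # 49866.60 + 777.22*100 - 1121.42*10 = 116374.28
```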

---
---

**Exercise 2:**    

Perform a **Multiple Linear Regression** analysis to predict property prices based on **size (m²)** and **age (years)** using only the **first 5 records** from the dataset. Implement the model manually using the **least squares** formula, calculate the regression coefficients, evaluate statistical significance, and visualize the results in an interactive 3D plot.  

Then, **compare the R-value and p-values** obtained from the **subset of 5 records** with those calculated using the **full dataset** to analyze how sample size affects the regression results.

---
---

**Submission:**
- Create a new folder in your GitHub repository called UA2_S1 and store the notebook there.
- Submit this **Notebook** (UA2_S1) with the implemented and well-commented code.
- Include a **screenshot** of the generated plots.
- Attach a **short report** with your conclusions on the obtained results.
- Don't forget to commit and push at the end of the session.

In [26]:
# Keep only the first 5 records (this overwrites the full DataFrame)
df = df.head(5)

print(df)

   Precio  Tamaño_m2  Edad_años
0  125749        106          2
1  186771        193         13
2  170311        160          7
3  138701        140         11
4  100014         73         18


In [27]:
# Define the independent variables (Tamaño_m2 and Edad_años)
X = df[['Tamaño_m2', 'Edad_años']].values  # Convert to a NumPy array
X = np.c_[np.ones(X.shape[0]), X]  # Prepend a column of ones for the intercept

# Define the dependent variable
y = df['Precio'].values.reshape(-1, 1)  # Ensure y is a column vector

In [28]:
# Calculate the regression coefficients manually (normal equation)
beta = np.linalg.inv(X.T @ X) @ X.T @ y

print('The regression coefficients are:')
print(beta)

The regression coefficients are:
[[49875.65858296]
 [  726.56362768]
 [ -315.35393561]]


In [29]:
# Compute predictions
y_pred = X @ beta

# Compute residuals
residuals = y - y_pred


In [30]:

print("First 5 Predictions:\n", y_pred[:5])
print("First 5 Residuals:\n", residuals[:5])

First 5 Predictions:
 [[126260.69524588]
 [186002.83756237]
 [163918.36146258]
 [148125.67316652]
 [ 97238.43256265]]
First 5 Residuals:
 [[ -511.69524588]
 [  768.16243763]
 [ 6392.63853742]
 [-9424.67316652]
 [ 2775.56743735]]


In [31]:
# Compute R-squared (coefficient of determination)
SS_res = np.sum(residuals**2)  # Sum of squared residuals
SS_tot = np.sum((y - np.mean(y))**2)  # Total sum of squares
R_squared = 1 - (SS_res / SS_tot)


In [32]:

print("R-squared:", R_squared)

R-squared: 0.971300970652438


In [33]:
# Compute standard error of coefficients
n, p = X.shape  # n observations, p parameters (intercept + 2 predictors)
sigma_squared = SS_res / (n - p)  # Unbiased estimate of the residual variance
var_beta = np.linalg.inv(X.T @ X) * sigma_squared  # Covariance matrix of the coefficients
SE_beta = np.sqrt(np.diag(var_beta))  # Standard errors

# Compute t-statistic
t_stats = beta.flatten() / SE_beta

# Compute p-values
p_values = 2 * (1 - stats.t.cdf(np.abs(t_stats), df=n - p))

In [34]:
# Display results
print("Coefficients:", beta.flatten())
print("Standard Errors:", SE_beta)
print("t-statistics:", t_stats)
print("p-values:", p_values)

Coefficients: [49875.65858296   726.56362768  -315.35393561]
Standard Errors: [15311.0203871     90.0505781    693.12762146]
t-statistics: [ 3.25750063  8.06839493 -0.4549724 ]
p-values: [0.08271477 0.01501609 0.6937445 ]
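
With 5 records and 3 parameters, only 2 degrees of freedom remain, which makes the t-distribution very heavy-tailed. The sketch below reproduces the p-values above from the printed t-statistics and compares the two-sided 5% critical value for 2 degrees of freedom with the one for a large sample. The value 97 used for the full dataset is an assumption (100 rows, 3 parameters).

```python
from scipy import stats

t_stats = [3.25750063, 8.06839493, -0.4549724]  # t-statistics printed above
dof = 5 - 3  # 5 records, 3 estimated parameters

# Two-sided p-values from the upper tail of the t-distribution
p_values = [2 * stats.t.sf(abs(t), df=dof) for t in t_stats]

# Two-sided 5% critical values: |t| must exceed these to be significant
t_crit_small = stats.t.ppf(0.975, df=dof)  # About 4.30 with 2 degrees of freedom
t_crit_large = stats.t.ppf(0.975, df=97)   # ASSUMED dof for the full dataset, about 1.98

print(p_values)
print(t_crit_small, t_crit_large)
```

An intercept t-statistic of 3.26 would be clearly significant in a large sample, but it falls short of the critical value of about 4.30 that applies here.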


In [35]:
# Create a grid of values
Tamaño_range = np.linspace(df['Tamaño_m2'].min(), df['Tamaño_m2'].max(), 30)
Edad_range = np.linspace(df['Edad_años'].min(), df['Edad_años'].max(), 30)
Tamaño_grid, Edad_grid = np.meshgrid(Tamaño_range, Edad_range)

# Compute predicted prices for the surface
X_surface = np.c_[np.ones(Tamaño_grid.ravel().shape), Tamaño_grid.ravel(), Edad_grid.ravel()]
y_surface = X_surface @ beta  # Predict using regression equation
y_surface = y_surface.reshape(Tamaño_grid.shape)  # Reshape for plotting


In [36]:
# Create 3D scatter plot of real data
scatter = go.Scatter3d(
    x=df['Tamaño_m2'], 
    y=df['Edad_años'], 
    z=df['Precio'], 
    mode='markers', 
    marker=dict(size=5, color='blue', opacity=0.8), 
    name='Actual Data'
)

# Create surface plot for regression plane
surface = go.Surface(
    x=Tamaño_range, 
    y=Edad_range, 
    z=y_surface, 
    colorscale='viridis', 
    opacity=0.7, 
    name='Regression Surface'
)

In [37]:
# Combine both plots
fig = go.Figure(data=[scatter, surface])

In [38]:
# Customize layout
fig.update_layout(
    title="3D Regression Model: Precio vs Tamaño & Edad",
    scene=dict(
        xaxis_title="Tamaño_m2",
        yaxis_title="Edad_años",
        zaxis_title="Precio"
    )
)

**FULL DATASET**

R-squared: 0.9293535634215089

Coefficients: [49866.59981524   777.21831193 -1121.41504732]

Standard Errors: [3461.93830003   22.25131259  178.29218752]

t-statistics: [14.4042428  34.92909952 -6.28975988]

p-values: [0.00000000e+00 0.00000000e+00 9.14882636e-09]

**5 RECORDS**

R-squared: 0.971300970652438

Coefficients: [49875.65858296   726.56362768  -315.35393561]

Standard Errors: [15311.0203871     90.0505781    693.12762146]

t-statistics: [ 3.25750063  8.06839493 -0.4549724 ]

p-values: [0.08271477 0.01501609 0.6937445 ]
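
One way to read the two result blocks side by side is to compare the standard errors and check which coefficients remain significant. The sketch below uses only the numbers printed above; the 0.05 significance level and the variable names are assumptions made for this illustration.

```python
import numpy as np

names = ["Intercept", "Tamaño_m2", "Edad_años"]

# Values copied from the two result blocks above
se_full = np.array([3461.93830003, 22.25131259, 178.29218752])
se_five = np.array([15311.0203871, 90.0505781, 693.12762146])
p_full = np.array([0.00000000e+00, 0.00000000e+00, 9.14882636e-09])
p_five = np.array([0.08271477, 0.01501609, 0.6937445])

alpha = 0.05  # ASSUMED significance level

# How many times larger each standard error is with only 5 records
se_ratio = se_five / se_full

for name, ratio, pf, p5 in zip(names, se_ratio, p_full, p_five):
    print(f"{name:10s} SE ratio: {ratio:4.2f}   "
          f"significant (full): {pf < alpha}   significant (5 records): {p5 < alpha}")
```

The standard errors are roughly four times larger with 5 records, and only the Tamaño_m2 coefficient stays below the 0.05 threshold. The higher R² of the 5-record fit therefore does not indicate a better model: with 3 parameters and 5 points the plane is left with only 2 degrees of freedom, so a close fit is expected.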