# Interaction Effects

This exercise uses the Boston housing dataset, a popular choice for regression problems. It guides you through creating an interaction term between two predictors and analyzing its impact on a linear regression model that predicts housing prices.

Goal: Investigate how the interaction between average number of rooms per dwelling (RM) and proportion of owner-occupied units built before 1940 (AGE) affects the median value of owner-occupied homes (MEDV).

Dataset:
The Boston housing dataset contains information collected by the U.S. Census Service concerning housing in the area of Boston, Massachusetts. It has 506 entries with 13 features and one target variable (MEDV: median value of owner-occupied homes, in $1000s).

In [None]:
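# Load the Dataset: Use scikit-learn to load the Boston housing dataset.
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this import requires scikit-learn < 1.2.
from sklearn.datasets import load_boston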

# complete the code: load the dataset...

boston = 

boston.DESCR

In [None]:
# Prepare the Data: Extract the RM (average number of rooms per dwelling) and AGE (proportion of owner-occupied units built prior to 1940) features. 
# Also, prepare the target variable MEDV (median value of owner-occupied homes).
import pandas as pd

df = pd.DataFrame(boston.data, columns=boston.feature_names)
df['MEDV'] = boston.target
df.head()

Create an Interaction Term: Generate a new feature that is the product of RM and AGE to explore the potential interaction effect between these two variables on MEDV.

In [None]:
# Hint: You can create a new column in a DataFrame by assigning the product of two columns like this:
# `DataFrame['new_column_name'] = DataFrame['column1'] * DataFrame['column2']`

# complete the code: create the new feature called RM_AGE that is the product of RM and AGE...


df.head()

In [None]:
# Split the Data: Divide your data into training and testing sets using train_test_split.

from sklearn.model_selection import train_test_split

# complete the code...


Model Training: Train two linear regression models - one with and one without the interaction term. 

Use LinearRegression from scikit-learn.

In [None]:
from sklearn.linear_model import LinearRegression


# complete the code...



Evaluate the Models: Compare the performance of the two models using the R² score on the test set. Discuss the impact of including the interaction term on the model's performance.

In [None]:
from sklearn.metrics import r2_score


# complete the code...



## How to detect interaction effects?

Exploratory Data Analysis (EDA) is a crucial first step in analyzing a dataset, as it helps uncover underlying patterns, spot anomalies, and form hypotheses, including potential interaction effects between variables. 
Let's perform EDA on the Boston housing dataset, focusing on uncovering potential interactions.

Identify non-linear relationships and potential interactions by visualizing how pairs of features relate to the target.
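A numeric counterpart to this visual check is to split the data by a second variable and compare the fitted slopes in each group. The sketch below uses synthetic data with made-up names (`feature`, `moderator`, `target`) and builds the interaction in by construction; it only illustrates the idea and is not a formal test.

```python
import numpy as np

# Synthetic data: the slope of target on feature grows with the moderator
rng = np.random.default_rng(1)
n = 400
feature = rng.uniform(0, 10, n)
moderator = rng.uniform(0, 1, n)
target = (1 + 4 * moderator) * feature + rng.normal(0, 1, n)

# Split at the median of the moderator and fit a straight line in each half
low = moderator < np.median(moderator)
slope_low = np.polyfit(feature[low], target[low], 1)[0]
slope_high = np.polyfit(feature[~low], target[~low], 1)[0]

print(f"slope when moderator is low:  {slope_low:.2f}")
print(f"slope when moderator is high: {slope_high:.2f}")
```

Clearly different slopes across the groups suggest that the effect of the feature depends on the moderator, which is what an interaction term models.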

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Select a few features based on correlation analysis for detailed scatter plots
features = ['RM', 'LSTAT', 'TAX', 'INDUS']
target = 'MEDV'

for feature in features:
    plt.figure(figsize=(8, 5))
    sns.scatterplot(data=df, x=feature, y=target, hue='AGE', palette='coolwarm')
    plt.title(f'{feature} vs. {target} by AGE')
    plt.show()


## Correlation Analysis

Look for pairs of features that both have strong correlations with the target variable (MEDV). Their interaction might enhance the model’s prediction capability.
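The same screening can be done numerically by ranking features by the absolute value of their correlation with the target. The sketch below uses a small synthetic DataFrame (`demo`, with invented columns `a`, `b`, `c`) rather than the Boston data. Strong individual correlations only nominate candidates; they do not show that an interaction exists.

```python
import numpy as np
import pandas as pd

# Synthetic data: "a" and "b" drive the target, "c" is pure noise
rng = np.random.default_rng(2)
n = 300
demo = pd.DataFrame({
    "a": rng.normal(size=n),
    "b": rng.normal(size=n),
    "c": rng.normal(size=n),
})
demo["target"] = 3 * demo["a"] - 2 * demo["b"] + rng.normal(scale=0.5, size=n)

# Correlation of every feature with the target, ranked by magnitude
corr_with_target = demo.corr()["target"].drop("target")
ranked = corr_with_target.abs().sort_values(ascending=False)
top_two = ranked.index[:2].tolist()

print(ranked)
print("Candidate pair for an interaction term:", top_two)
```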

In [None]:
from sklearn.datasets import load_boston
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

#boston = load_boston()
#df = pd.DataFrame(boston.data, columns=boston.feature_names)
#df['MEDV'] = boston.target


plt.figure(figsize=(14, 10))
sns.heatmap(df.corr(), annot=True, fmt=".2f", cmap='coolwarm')
plt.show()



Based on this analysis, build a linear regression model that includes an interaction between two features of your choice (chosen from the visual analysis above). Compare its R² with that of the baseline model.

In [None]:

## complete the code

from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
import pandas as pd

# Load the Boston housing dataset
#boston = load_boston()
#df = pd.DataFrame(boston.data, columns=boston.feature_names)
#df['MEDV'] = boston.target

X['new_feature'] = 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

new_model_with_interaction = LinearRegression()

# Train a model with the interaction term
new_model_with_interaction.fit(X_train, y_train)

# Evaluate Model:



## Feature Transformation

## Step 1: Conduct a Standard Regression Analysis on the Raw Data

In [None]:
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Load the dataset
boston = load_boston()
df = pd.DataFrame(boston.data, columns=boston.feature_names)
df['MEDV'] = boston.target

# Visualizing the relationship between LSTAT and MEDV
plt.scatter(df['LSTAT'], df['MEDV'], alpha=0.5)
plt.xlabel('LSTAT (% lower status of the population)')
plt.ylabel('MEDV (Median value of owner-occupied homes)')
plt.title('Relationship between LSTAT and MEDV')
plt.show()

# Splitting the dataset
X = df[['LSTAT']]
y = df['MEDV']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Linear regression on original LSTAT
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)


# Evaluating the model
r2_raw = r2_score(y_test, y_pred)
print(f"R² score (original data): {r2_raw}")


## Step 2: Construct a Residual Plot

In [None]:
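# residplot regresses y on x and plots the residuals of that fit,
# so pass the predictions as x and the observed values as y
sns.residplot(x=y_pred, y=y_test, lowess=True, line_kws={'color': 'red', 'lw': 1})
plt.xlabel('Predicted Values')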
plt.ylabel('Residuals')
plt.title('Residual Plot for Raw Data')
plt.show()


The residuals show a non-random pattern, so the data may need a transformation.
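One way to put a number on "not random" is to check whether the residuals of a straight-line fit still correlate with a curved function of the predictor. The sketch below uses synthetic data that is curved by construction; the variable names and the logarithmic relationship are assumptions made for the example, not properties of the Boston data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with a curved (logarithmic) relationship
rng = np.random.default_rng(3)
x = rng.uniform(1, 40, 300)
y = 50 - 12 * np.log(x) + rng.normal(0, 1, 300)

# Fit a straight line and compute its residuals
X = x.reshape(-1, 1)
model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

# Least-squares residuals are uncorrelated with x itself, so compare them
# with a curved function of x instead (here, the centered square).
x_centered_sq = (x - x.mean()) ** 2
curvature_corr = np.corrcoef(x_centered_sq, residuals)[0, 1]

print(f"corr(residuals, x):            {np.corrcoef(x, residuals)[0, 1]:.3f}")
print(f"corr(residuals, centered x^2): {curvature_corr:.3f}")
```

A correlation far from zero in the second line indicates leftover curvature that a transformation or an added polynomial term might capture.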

## Step 3: Choose a Transformation Method and Apply It

In [None]:
# Independent variable and its quadratic transformation
# Work on a copy so pandas does not raise a SettingWithCopyWarning
X = X.copy()
X['LSTAT^2'] = X['LSTAT'] ** 2

In [None]:
X.head()

## Step 4: Conduct a Regression Analysis Using the Transformed Variables

In [None]:


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Linear regression using the original and transformed LSTAT
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)

# Evaluate the model
r2_score_transformed = r2_score(y_test, y_pred)
print(f"R² score (with quadratic transformation): {r2_score_transformed}")




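# residplot regresses y on x and plots the residuals of that fit,
# so pass the predictions as x and the observed values as y
sns.residplot(x=y_pred, y=y_test, lowess=True, line_kws={'color': 'red', 'lw': 1})
plt.xlabel('Predicted Values')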
plt.ylabel('Residuals')
plt.title('Residual Plot for Transformed Data')
plt.show()




In [None]:
# the plot pattern is not random

In [None]:
# Comparing R² scores
print(f"R² before transformation: {r2_raw}")
print(f"R² after transformation: {r2_score_transformed}")

if r2_score_transformed > r2_raw:
    print("The transformation was successful!")
else:
    print("The transformation did not improve the model. Try a different transformation method.")

    

# Box-Cox Transformation

The Box-Cox transformation is a more general approach that can handle many types of skewness. It requires positive data and finds a parameter λ (lambda) to best normalize the data.
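As a small illustration, the transformation is defined as (x^λ − 1) / λ for λ ≠ 0 and log(x) for λ = 0. The sketch below applies `scipy.stats.boxcox` to synthetic right-skewed data (a log-normal sample chosen for the example, not the Boston data), checks the result against that formula, and compares skewness before and after.

```python
import numpy as np
from scipy.stats import boxcox, skew

# Synthetic, strictly positive, right-skewed sample
rng = np.random.default_rng(4)
x = rng.lognormal(mean=1.0, sigma=0.8, size=1000)

# With no lmbda argument, boxcox estimates lambda by maximum likelihood
x_bc, lam = boxcox(x)

# Apply the Box-Cox definition by hand for comparison
manual = np.log(x) if lam == 0 else (x ** lam - 1) / lam

print(f"fitted lambda: {lam:.3f}")
print(f"skewness before: {skew(x):.2f}, after: {skew(x_bc):.2f}")
print("matches the formula:", np.allclose(x_bc, manual))
```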

In [None]:
from scipy.stats import boxcox

help(boxcox)

In [None]:
from sklearn.datasets import load_boston
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import boxcox

# Load the dataset
boston = load_boston()
df = pd.DataFrame(boston.data, columns=boston.feature_names)
df['MEDV'] = boston.target

# Focus on the 'LSTAT' feature and the target 'MEDV'
X = df['LSTAT']
y = df['MEDV']


plt.figure(figsize=(8, 6))
plt.scatter(df['LSTAT'], df['MEDV'])
plt.xlabel('% Lower Status of the Population (LSTAT)')
plt.ylabel('Median Value of Owner-occupied Homes (MEDV)')
plt.title('Original Relationship between LSTAT and MEDV')
plt.show()


df['Log_LSTAT'] = np.log(df['LSTAT'])

plt.figure(figsize=(8, 6))
plt.scatter(df['Log_LSTAT'], df['MEDV'])
plt.xlabel('Log of LSTAT')
plt.ylabel('MEDV')
plt.title('Relationship between Log of LSTAT and MEDV')
plt.show()



df['Sqrt_LSTAT'] = np.sqrt(df['LSTAT'])

plt.figure(figsize=(8, 6))
plt.scatter(df['Sqrt_LSTAT'], df['MEDV'])
plt.xlabel('Square Root of LSTAT')
plt.ylabel('MEDV')
plt.title('Relationship between Square Root of LSTAT and MEDV')
plt.show()


# Box-Cox Transformation
df['BoxCox_LSTAT'], fitted_lambda_lstat = boxcox(df['LSTAT'])

plt.figure(figsize=(8, 6))
plt.scatter(df['BoxCox_LSTAT'], df['MEDV'])
plt.xlabel('Box-Cox Transformed LSTAT')
plt.ylabel('MEDV')
plt.title('Relationship between Box-Cox Transformed LSTAT and MEDV')
plt.show()

print(f"Fitted Lambda for Box-Cox Transformation of LSTAT: {fitted_lambda_lstat}")



In [None]:
df.head()

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Using the Box-Cox transformed LSTAT for modeling
X_transformed = df[['BoxCox_LSTAT']]
y = df['MEDV']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X_transformed, y, test_size=0.2, random_state=42)

# Initialize the model
model = LinearRegression()

# Fit the model
model.fit(X_train, y_train)

# Predict on the testing set
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")


In [None]:
### predicting new value:

In [None]:
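new_LSTAT_values = np.array([15.0])  # Example new data
# Reuse the lambda fitted above instead of hard-coding its value
new_LSTAT_values_transformed = boxcox(new_LSTAT_values, lmbda=fitted_lambda_lstat)
# Keep the column name used during training so scikit-learn does not warn about feature names
new_X = pd.DataFrame({'BoxCox_LSTAT': new_LSTAT_values_transformed})
predicted_MEDV = model.predict(new_X)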
predicted_MEDV

# Conclusion