**Project Title**: Predicting House Prices Using Linear Regression

**Objective**: To introduce students to supervised learning, focusing on linear regression, by guiding them through a project that predicts house prices based on a variety of features.

# **Data Exploration and Preprocessing**

**Task 2: Data Preprocessing**

In [65]:
# Import necessary libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the dataset
# Assumes the CSV file 'BostonHousing.csv' is stored at the path below
df = pd.read_csv('drive/MyDrive/Datasets/BostonHousing.csv')

# Identify missing values
missing_values = df.isnull().sum()
print("Missing Values:")
print(missing_values)

# Impute missing values
# For simplicity, fill each numerical column with its own mean.
# Assigning the result back avoids relying on inplace modification.
df = df.fillna(df.mean(numeric_only=True))

# Verify that there are no missing values left
print("\nMissing Values after Imputation:")
print(df.isnull().sum())

Missing Values:
crim       0
zn         0
indus      0
chas       0
nox        0
rm         0
age        0
dis        0
rad        0
tax        0
ptratio    0
b          0
lstat      0
medv       0
dtype: int64

Missing Values after Imputation:
crim       0
zn         0
indus      0
chas       0
nox        0
rm         0
age        0
dis        0
rad        0
tax        0
ptratio    0
b          0
lstat      0
medv       0
dtype: int64
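
The output above shows no missing values in any column, so the `fillna` call leaves this dataset unchanged. To see what mean imputation does when gaps are present, here is a minimal sketch on a toy frame. The column names and values are hypothetical and are not taken from the housing data.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical column names and one missing value per column
toy = pd.DataFrame({
    "rooms": [6.0, np.nan, 7.0, 5.0],
    "age": [60.0, 80.0, np.nan, 40.0],
})

# Column means are computed with NaN values skipped (the pandas default)
col_means = toy.mean(numeric_only=True)

# Passing a Series to fillna fills each column with its own mean
toy_imputed = toy.fillna(col_means)

print(toy_imputed)
```

Each gap is replaced by the mean of the observed values in the same column, so the column means stay the same while the variance shrinks slightly.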


In [66]:
# Columns to check for outliers
# (the box-plot code below is commented out; uncomment it to display the plots)
features = ['crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age', 'dis', 'rad', 'tax', 'ptratio', 'b', 'lstat', 'medv']

# # Create subplots for each feature to identify outliers
# fig, axes = plt.subplots(nrows=7, ncols=2, figsize=(15, 30))
# axes = axes.flatten()

# for i, feature in enumerate(features):
#     sns.boxplot(x=df[feature], ax=axes[i])
#     axes[i].set_title(f'Boxplot of {feature}')
#     axes[i].set_xlabel(feature)

# # Adjust layout
# plt.tight_layout()
# plt.show()

# Handle outliers by capping them at the 1st and 99th percentiles
for feature in features:
    lower_percentile = df[feature].quantile(0.01)
    upper_percentile = df[feature].quantile(0.99)
    df[feature] = df[feature].clip(lower=lower_percentile, upper=upper_percentile)

# Verify outliers are handled
print("\nSummary Statistics after handling outliers:")
print(df.describe())

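# Encode categorical variables
# The only categorical variable in this dataset is 'chas', which is already coded as 0/1.
# With drop_first=True, get_dummies keeps a single indicator column named 'chas_1'.
df = pd.get_dummies(df, columns=['chas'], drop_first=True)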

# Verify encoding
print("\nData after encoding categorical variables:")
print(df.head())

Summary Statistics after handling outliers:
             crim          zn       indus        chas         nox          rm  \
count  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000   
mean     3.375175   11.304348   11.118875    0.069170    0.554770    6.287106   
std      6.908970   23.112644    6.809112    0.253994    0.115773    0.678876   
min      0.013610    0.000000    1.253500    0.000000    0.398000    4.524450   
25%      0.082045    0.000000    5.190000    0.000000    0.449000    5.885500   
50%      0.256510    0.000000    9.690000    0.000000    0.538000    6.208500   
75%      3.677083   12.500000   18.100000    0.000000    0.624000    6.623500   
max     41.370330   90.000000   25.650000    1.000000    0.871000    8.335000   

              age         dis         rad         tax     ptratio           b  \
count  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000   
mean    68.584506    3.778529    9.549407  407.794466   18.45474
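
The percentile capping used in the cell above can be easier to follow on a small example. The sketch below applies the same `quantile` and `clip` calls to a hypothetical series with one extreme value; the data is made up for illustration.

```python
import pandas as pd

# Hypothetical values: 99 ordinary points plus one extreme value
s = pd.Series(list(range(1, 100)) + [10_000], dtype=float)

# Same bounds as in the notebook: 1st and 99th percentiles
lower = s.quantile(0.01)
upper = s.quantile(0.99)

# Values outside [lower, upper] are replaced by the nearest bound
capped = s.clip(lower=lower, upper=upper)

print("before:", s.min(), s.max())
print("after: ", capped.min(), capped.max())
```

Capping keeps every row, unlike dropping outliers, so the row count is unchanged. Note that the notebook's loop also caps the target `medv`, which changes the values the model is trained to predict.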

In [67]:
# Normalize and standardize the numerical features
# (pandas was already imported as pd in the first cell)
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Separate features and target variable
X = df.drop(columns=['medv'])
y = df['medv']

# Normalize features to the [0, 1] range (min-max scaling)
scaler = MinMaxScaler()
X_normalized = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)

print("\nNormalized Features:")
print(X_normalized.head())

# Standardize features to zero mean and unit variance
scaler = StandardScaler()
X_standardized = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)

print("\nStandardized Features:")
print(X_standardized.head())


Normalized Features:
       crim   zn     indus       nox        rm       age       dis       rad  \
0  0.000000  0.2  0.043305  0.295983  0.538124  0.627369  0.359703  0.000000   
1  0.000331  0.0  0.238415  0.150106  0.497710  0.774066  0.469118  0.043478   
2  0.000331  0.0  0.238415  0.150106  0.698206  0.583467  0.469118  0.043478   
3  0.000454  0.0  0.037977  0.126850  0.649132  0.419638  0.605729  0.086957   
4  0.001341  0.0  0.037977  0.126850  0.688234  0.509583  0.605729  0.086957   

        tax   ptratio         b     lstat  chas_1  
0  0.225941  0.280488  1.000000  0.067568     0.0  
1  0.112971  0.585366  1.000000  0.201608     0.0  
2  0.112971  0.585366  0.989569  0.036958     0.0  
3  0.071130  0.695122  0.994182  0.001837     0.0  
4  0.071130  0.695122  1.000000  0.078845     0.0  

Standardized Features:
       crim        zn     indus       nox        rm       age       dis  \
0 -0.487032  0.289983 -1.294970 -0.144997  0.424494 -0.120448  0.151891   
1 -0.485047
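
The cell above fits both scalers on the full feature matrix before any train/test split. A common alternative is to fit the scaler on the training rows only and reuse those statistics for the test rows, so that no information from the test set leaks into preprocessing. The following is a minimal sketch of that pattern on synthetic data; the `_demo` names are hypothetical and the numbers are not the housing data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and target (not the housing data)
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y_demo = rng.normal(size=100)

# Split first, then scale
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

scaler_demo = StandardScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)  # statistics come from training rows only
X_te_scaled = scaler_demo.transform(X_te)      # test rows reuse the training statistics

print(X_tr_scaled.mean(axis=0).round(6))
print(X_tr_scaled.std(axis=0).round(6))
```

The scaled training columns have mean 0 and standard deviation 1 by construction. The scaled test columns will be close to that but not exact, which is expected.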

In [68]:
# Split the data into training and testing sets
from sklearn.model_selection import train_test_split

# X and y were defined in the previous cell:
#   X = df.drop(columns=['medv']), y = df['medv']
# Note: this splits the unscaled X, not X_normalized or X_standardized.
# Hold out 20% of the rows for testing; random_state makes the shuffle reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Save the training and testing sets to the drive
train_data = pd.concat([X_train, y_train], axis=1)
test_data = pd.concat([X_test, y_test], axis=1)

train_data.to_csv('drive/MyDrive/Datasets/BostonHousing_train.csv', index=False)
test_data.to_csv('drive/MyDrive/Datasets/BostonHousing_test.csv', index=False)

print("Training and testing data saved successfully.")

Training and testing data saved successfully.


In [69]:
# Reload the saved training and testing datasets (optional)
train_data = pd.read_csv('drive/MyDrive/Datasets/BostonHousing_train.csv')
test_data = pd.read_csv('drive/MyDrive/Datasets/BostonHousing_test.csv')

In [70]:
# Check the shapes of the training and testing datasets
print("Size of the training dataset:", train_data.shape)
print("Size of the testing dataset:", test_data.shape)

Size of the training dataset: (404, 14)
Size of the testing dataset: (102, 14)
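
The row counts above follow from the 506 rows and `test_size=0.2`. As a quick check, the sketch below reproduces the arithmetic, assuming the test count is rounded up and the training set receives the remaining rows, which is consistent with the shapes printed above.

```python
import math

n_samples = 506   # row count reported by df.describe()
test_size = 0.2   # fraction passed to train_test_split

# 0.2 * 506 = 101.2, rounded up to a whole number of rows
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 404 102
```

The 14 columns in each saved file are the 13 feature columns plus the target `medv`.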
