# Machine Learning Beginner Project: Linear Regression

## Project Title: Predicting House Prices Using Linear Regression
#### Objective: To introduce students to supervised learning, focusing on linear regression, by guiding them through a project that predicts house prices based on a variety of features.

#### Dataset: We'll use the "Boston Housing Dataset" from the UCI Machine Learning Repository. It describes housing in the Boston area with features such as the average number of rooms per dwelling (RM), the proportion of owner-occupied units built before 1940 (AGE), and the per-capita crime rate (CRIM). The target variable is the median value of owner-occupied homes (MEDV), in thousands of dollars.

### Data Exploration and Preprocessing

Task 1: Data Exploration

Notebook: notebooks/EDA.ipynb

Steps:

- Load the dataset.
- Explore the data structure, types, and summary statistics.
- Visualize relationships between features and the target variable.
- Identify missing values and outliers.
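The exploration steps above can be sketched roughly as follows. This is a minimal illustration on a small synthetic DataFrame: the column names `RM` and `MEDV` mirror the dataset, but the values are made up, and the 1.5 * IQR rule in the hypothetical helper `iqr_outlier_mask` is only one common outlier heuristic.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the housing data (values are illustrative only)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "RM": rng.normal(6.0, 0.5, size=100),
    "MEDV": rng.normal(22.0, 5.0, size=100),
})
df.loc[0, "MEDV"] = 200.0   # inject an obvious outlier
df.loc[1, "RM"] = np.nan    # inject a missing value

# Summary statistics and missing-value counts per column
print(df.describe())
missing_counts = df.isna().sum()
print(missing_counts)


def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


outliers = iqr_outlier_mask(df["MEDV"])
print(f"MEDV values flagged as outliers: {int(outliers.sum())}")
```

Flagged rows are candidates for inspection rather than automatic removal; whether to drop, cap, or keep them depends on what they represent.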

Task 2: Data Preprocessing

Notebook: notebooks/Data_Preprocessing.ipynb

Steps:

- Handle missing values and outliers.
- Encode categorical variables.
- Normalize/standardize numerical features.
- Split the data into training and testing sets.

Script: scripts/data_preprocessing.py
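One way to combine these preprocessing steps is scikit-learn's `ColumnTransformer`. The sketch below uses a synthetic frame with hypothetical columns (`rooms`, `zone`, `price`), because the Boston data itself is entirely numeric (`CHAS` is already coded 0/1) and would not exercise the encoding step. It splits first and fits the transformers on the training set only, so no information from the test set leaks into the scaling.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data with one numeric and one categorical feature (illustrative only)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "rooms": rng.normal(6.0, 0.7, size=101),
    "zone": rng.choice(["A", "B", "C"], size=101),
    "price": rng.normal(22.0, 5.0, size=101),
})
df.loc[3, "rooms"] = np.nan  # inject a missing value

# Simplest missing-value strategy: drop incomplete rows
df = df.dropna()

X = df[["rooms", "zone"]]
y = df["price"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Standardize numeric columns and one-hot encode categorical ones
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["rooms"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["zone"]),
])
X_train_prepared = preprocess.fit_transform(X_train)  # fit on training data only
X_test_prepared = preprocess.transform(X_test)        # reuse the fitted statistics

print(X_train_prepared.shape, X_test_prepared.shape)
```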

### Model Building and Training
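Task 3: Model Training

Notebook: notebooks/Model_Training.ipynb

Steps:

- Choose appropriate features for the model.
- Train a linear regression model.
- Perform hyperparameter tuning if applicable. Ordinary least squares has no hyperparameters to tune; this step applies to regularized variants such as Ridge or Lasso.

Script: scripts/train_model.py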
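As a rough illustration of the tuning step: plain `LinearRegression` has nothing to tune, so the sketch below swaps in `Ridge` (a regularized linear model, not part of the original project) and searches over its `alpha` parameter with `GridSearchCV`. The data comes from `make_regression`, and the `alpha` grid is an arbitrary assumption.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression problem (illustrative only)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Baseline: ordinary least squares, no hyperparameters
baseline = LinearRegression().fit(X_train, y_train)
baseline_r2 = baseline.score(X_test, y_test)
print(f"LinearRegression test R^2: {baseline_r2:.3f}")

# Tuned alternative: 5-fold cross-validated search over the Ridge penalty strength
alpha_grid = [0.01, 0.1, 1.0, 10.0]
search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": alpha_grid},
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X_train, y_train)

ridge_r2 = search.best_estimator_.score(X_test, y_test)
print(f"Best alpha: {search.best_params_['alpha']}")
print(f"Ridge test R^2: {ridge_r2:.3f}")
```

The search uses only the training split, so the test set stays untouched until the final comparison.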

### Model Evaluation
Task 4: Model Evaluation

Notebook: notebooks/Model_Evaluation.ipynb

Steps:

- Evaluate the model using metrics such as Mean Squared Error (MSE) and R-squared.
- Plot residuals to check the assumptions of linear regression.
- Compare model performance with different feature sets or preprocessing steps.

Script: scripts/evaluate_model.py
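The evaluation steps could look something like the sketch below, which computes the metrics and the residuals on synthetic data from `make_regression`. Plotting is left out to keep the example short; the `residuals` array is what you would scatter against `y_pred` to look for curvature or a funnel shape.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression problem (illustrative only)
X, y = make_regression(n_samples=200, n_features=3, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# Error metrics on the held-out test set
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)  # same units as the target
r2 = r2_score(y_test, y_pred)
print(f"MSE: {mse:.3f}  RMSE: {rmse:.3f}  R^2: {r2:.3f}")

# Residuals: actual minus predicted
residuals = y_test - y_pred
train_residuals = y_train - model.predict(X_train)

# With an intercept, least-squares residuals average to zero on the training data
print(f"Mean training residual: {train_residuals.mean():.2e}")
print(f"Mean test residual: {residuals.mean():.3f}")
```

A residual plot with no visible pattern around zero is consistent with the linearity and constant-variance assumptions; a clear trend suggests a missing feature or transformation.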

### Feature Engineering and Improvement
Task 5: Feature Engineering

Notebook: notebooks/Feature_Engineering.ipynb

Steps:

- Create new features that might improve model performance.
- Test different feature combinations.
- Evaluate the impact of new features on model performance.
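To make the "evaluate the impact" step concrete, here is a minimal sketch that compares a baseline model with one that adds squared and interaction terms via `PolynomialFeatures`. The data is synthetic and deliberately constructed so that the target depends on a squared feature; on real data the gain may be smaller or absent.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data: the target depends linearly on x0 and quadratically on x1
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 2))
y = 2.0 * X[:, 0] + X[:, 1] ** 2 + rng.normal(0, 0.5, size=200)

# Baseline: raw features only
baseline = LinearRegression()

# Engineered: add squares and pairwise products of the raw features
engineered = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)

# Compare with 5-fold cross-validated R^2 so both models see the same folds
baseline_r2 = cross_val_score(baseline, X, y, cv=5, scoring="r2").mean()
engineered_r2 = cross_val_score(engineered, X, y, cv=5, scoring="r2").mean()

print(f"Baseline CV R^2:   {baseline_r2:.3f}")
print(f"Engineered CV R^2: {engineered_r2:.3f}")
```

Putting the feature construction inside the pipeline means it is re-fit within each fold, which keeps the comparison fair.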

In [1]:
# Install packages
!pip install pandas numpy matplotlib seaborn scikit-learn statsmodels



In [3]:
# Step 1: Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler
import statsmodels.api as sm

In [9]:
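# Step 2: Load and explore the data
# The delimiter and names arguments assume a whitespace-delimited file with no header row
columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
data = pd.read_csv('data/BostonHousing.csv', delimiter=r"\s+", names=columns)

# Handle missing values (none in this dataset, but included for completeness)
data = data.dropna()

# Explore the data
print(data.head())
data.info()  # info() prints its report directly and returns None
print(data.describe())

# Visualize pairwise correlations between all columns
plt.figure(figsize=(12, 10))
sns.heatmap(data.corr(), annot=True, fmt='.2f', cmap='coolwarm')
plt.show()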

In [None]:


# Step 3: Separate features and target, then split the data
# 'X' contains all features except 'MEDV', which is the target variable
X = data.drop('MEDV', axis=1)  # Features
y = data['MEDV']               # Target variable

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

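# Step 4: Scale the features (fit the scaler on the training set only)
# Plain least squares does not need scaling, but it makes coefficients comparable and matters for regularized models
# Note: fit_transform/transform return NumPy arrays, so the column names are lost from here on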
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Step 5: Train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Step 6: Make predictions
y_pred = model.predict(X_test)

# Step 7: Evaluate the model
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Mean Absolute Error: {mae}')
print(f'Mean Squared Error: {mse}')
print(f'R-squared: {r2}')

# Plotting actual vs predicted values
plt.scatter(y_test, y_pred)
plt.xlabel('Actual Prices')
plt.ylabel('Predicted Prices')
plt.title('Actual vs Predicted House Prices')
plt.show()



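### How to Check and Handle Multicollinearity

Here’s how to check for multicollinearity by calculating the variance inflation factor (VIF) of each feature. A VIF above roughly 5 to 10 is commonly read as a sign of problematic collinearity. Typical remedies are dropping or combining the affected features, or switching to a regularized model.

Python code: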
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

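# The scaled X_train is a NumPy array, so restore the column names first
X_train_df = pd.DataFrame(X_train, columns=X.columns)

# Add a constant (intercept) column, which the VIF calculation expects
X_train_const = add_constant(X_train_df)

# Calculate VIF for each column (the 'const' row can be ignored)
vif_data = pd.DataFrame()
vif_data["Feature"] = X_train_const.columns
vif_data["VIF"] = [variance_inflation_factor(X_train_const.values, i) for i in range(X_train_const.shape[1])]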

print(vif_data)