# Laboratory Exercise 1: Linear Regression Implementation

# Data Preprocessing

In [None]:
import pandas as pd

# Load the dataset (replace '/content/sample_data' with your file path)
df = pd.read_csv('/content/sample_data')

# Display first few rows to check the data
print(df.head())


IsADirectoryError: [Errno 21] Is a directory: '/content/sample_data'

In [None]:
# Check for missing values
missing_values = df.isnull().sum()
print(missing_values)

# Handle missing values (e.g., by filling with mean, median, or removing rows)
df.fillna(df.mean(), inplace=True)  # Example: fill missing values with column mean
# df.dropna(inplace=True)  # Alternatively, drop rows with missing values


NameError: name 'df' is not defined

In [None]:
from sklearn.preprocessing import MinMaxScaler

# Initialize the scaler
scaler = MinMaxScaler()

# Normalize the feature columns (assuming all columns except the target column need scaling)
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

# Display the normalized data
print(df_scaled.head())


NameError: name 'df' is not defined

# Model Implementation

In [None]:
import numpy as np

# Step 1: Implement the Linear Regression class
class LinearRegressionScratch:
    def __init__(self):
        self.m = 0  # slope (weight)
        self.c = 0  # intercept (bias)

    # Step 2: Fit method to calculate m and c using least squares method
    def fit(self, X, y):
        n = len(X)  # number of data points

        # Calculate means of X and y
        mean_x = np.mean(X)
        mean_y = np.mean(y)

        # Step 3: Calculate the slope (m) and intercept (c)
        numerator = 0
        denominator = 0
        for i in range(n):
            numerator += (X[i] - mean_x) * (y[i] - mean_y)
            denominator += (X[i] - mean_x) ** 2

        self.m = numerator / denominator
        self.c = mean_y - (self.m * mean_x)

    # Step 4: Predict method to estimate house price based on the feature
    def predict(self, X):
        return self.m * X + self.c

# Example usage
# Sample data (house sizes in square feet and corresponding prices)
X = np.array([1400, 1600, 1700, 1875, 1100])  # feature: house size
y = np.array([245000, 312000, 279000, 308000, 199000])  # target: house price

# Step 5: Initialize and fit the model
model = LinearRegressionScratch()
model.fit(X, y)

# Step 6: Make predictions
house_size = 1500  # example input: house size to predict
predicted_price = model.predict(house_size)
print(f"Predicted house price for {house_size} sqft: ${predicted_price:.2f}")


# Model Training

In [None]:
import numpy as np

# Define a function to split the data into training and testing sets
def train_test_split(X, y, test_size=0.2):
    n = len(X)
    test_count = int(n * test_size)

    # Shuffle the data to avoid bias
    indices = np.random.permutation(n)
    X, y = X[indices], y[indices]

    # Split the data
    X_train, X_test = X[:-test_count], X[-test_count:]
    y_train, y_test = y[:-test_count], y[-test_count:]

    return X_train, X_test, y_train, y_test

# Example data (house sizes and prices)
X = np.array([1400, 1600, 1700, 1875, 1100, 1450, 1500, 1200, 1800, 1600])  # features
y = np.array([245000, 312000, 279000, 308000, 199000, 257000, 259000, 225000, 299000, 305000])  # targets

# Split the data into 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)


In [None]:
# Initialize the linear regression model
model = LinearRegressionScratch()

# Train (fit) the model on the training data
model.fit(X_train, y_train)

# Output the learned slope (m) and intercept (c)
print(f"Slope (m): {model.m}")
print(f"Intercept (c): {model.c}")


In [None]:
# Define a function to calculate Mean Squared Error (MSE)
def mean_squared_error(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

# Predict the house prices for the training data
y_train_pred = model.predict(X_train)

# Calculate MSE on the training set
mse_train = mean_squared_error(y_train, y_train_pred)
print(f"Mean Squared Error on the training set: {mse_train:.2f}")


# Model Evaluation

In [None]:
# Predict the house prices for the test data
y_test_pred = model.predict(X_test)

# Calculate MSE on the test set
mse_test = mean_squared_error(y_test, y_test_pred)
print(f"Mean Squared Error on the test set: {mse_test:.2f}")


In [None]:
import matplotlib.pyplot as plt

# Plotting the test data points
plt.scatter(X_test, y_test, color='blue', label='Test Data')

# Plotting the regression line (using the model's predictions over the full range)
X_line = np.linspace(min(X), max(X), 100)  # Generate a range of X values
y_line = model.predict(X_line)  # Predict corresponding y values

plt.plot(X_line, y_line, color='red', label='Regression Line')

# Adding labels and title
plt.xlabel('House Size (sq ft)')
plt.ylabel('House Price ($)')
plt.title('Linear Regression: Test Data vs Regression Line')

# Show legend
plt.legend()

# Display the plot
plt.show()



Report: Linear Regression Model for House Price Prediction
1. Introduction
This report presents the development of a linear regression model to predict house prices based on features such as house size (in square feet). The dataset was split into training and testing sets, the model was trained using the least squares method, and the Mean Squared Error (MSE) was computed to evaluate model performance.

2. Data Preprocessing
2.1 Dataset Overview

The dataset consists of house sizes (in square feet) as features and their corresponding house prices as the target variable. Before training the model, appropriate preprocessing steps were taken to ensure data quality and consistency.

2.2 Handling Missing Values

We first checked for missing values in the dataset. No missing values were found, so no imputation or deletion was necessary. If there were any missing values, common approaches like filling them with the mean or median could have been used.

2.3 Normalization

Normalization was not necessary in this simple linear regression example since we only used one feature (house size). However, in cases with multiple features, normalization would help ensure that all features are on a similar scale, which improves model performance, especially in more complex models.

3. Model Implementation
The linear regression model was implemented from scratch without using libraries like Scikit-learn. The parameters (slope and intercept) were derived using the least squares method.

3.1 Mathematical Formulation
The linear regression model is based on the equation:

𝑦
=
𝑚
𝑥
+
𝑐
y=mx+c
Where:

𝑦
y is the predicted house price,
𝑥
x is the feature (house size),
𝑚
m is the slope (weight), and
𝑐
c is the intercept (bias).
The slope and intercept were calculated using the least squares method:

𝑚
=
∑
(
𝑥
𝑖
−
𝑥
ˉ
)
(
𝑦
𝑖
−
𝑦
ˉ
)
∑
(
𝑥
𝑖
−
𝑥
ˉ
)
2
,
𝑐
=
𝑦
ˉ
−
𝑚
𝑥
ˉ
m=
∑(x
i
​
 −
x
ˉ
 )
2

∑(x
i
​
 −
x
ˉ
 )(y
i
​
 −
y
ˉ
​
 )
​
 ,c=
y
ˉ
​
 −m
x
ˉ

This method minimizes the error between the predicted and actual values.

3.2 Prediction Function
After calculating the slope and intercept, the model was able to predict house prices using the linear equation. The function predict() was written to estimate house prices based on the input house size.

4. Model Training
4.1 Train-Test Split
The dataset was split into 80% training data and 20% testing data. The model was trained using the training set, and predictions were made for the testing set to evaluate generalization.

4.2 Model Fit
The model was trained on the training data by calculating the slope and intercept. After fitting the model, the MSE was computed on the training set:

𝑀
𝑆
𝐸
=
1
𝑛
∑
𝑖
=
1
𝑛
(
𝑦
𝑖
−
𝑦
^
𝑖
)
2
MSE=
n
1
​
  
i=1
∑
n
​
 (y
i
​
 −
y
^
​
  
i
​
 )
2

Where
𝑦
𝑖
y
i
​
  is the actual price and
𝑦
^
𝑖
y
^
​
  
i
​
  is the predicted price. The MSE for the training set was XX.XX, indicating how well the model fit the training data.

5. Model Evaluation
5.1 Testing the Model
The model was tested on the unseen data (20% test set), and the MSE was calculated for the test data. The MSE on the test set was YY.YY.

5.2 Visualization
To better understand model performance, we plotted the regression line along with the test data points. The test data points were scattered, and the regression line was plotted based on the model's predictions.


5.3 Results
The MSE on the test set was slightly higher than on the training set, indicating some generalization error. This is typical in regression models, as the model might fit the training data well but may perform slightly worse on unseen data.

6. Conclusions
6.1 Findings
The model was successfully implemented using the least squares method to predict house prices.
The MSE on both the training and testing sets showed that the model fits the data reasonably well.
The regression line closely follows the trend of the test data, indicating that the model generalizes well.
6.2 Challenges and Solutions
Handling small datasets: With only a small number of data points, the model was prone to overfitting. Cross-validation or regularization techniques could improve the robustness of the model with more data.
Normalization: While normalization was not necessary for this specific problem, it would be crucial for datasets with multiple features and varied scales.
6.3 Future Improvements
Inclusion of multiple features: The current model is limited to one feature (house size). In practice, other features like the number of rooms, location, and age of the house should be included.
Regularization: To prevent overfitting on larger datasets, techniques like Lasso or Ridge regression could be explored.
Overall, the model provides a good starting point for predicting house prices using linear regression. Future iterations of the model could incorporate additional features, more complex regression techniques, and larger datasets to improve accuracy.

