<a href="https://colab.research.google.com/github/Lakshaykumarr28/Prasunet_ML_01/blob/main/Linear_Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!git clone https://github.com/Lakshaykumarr28/Prasunet_ML_01

Importing all the necessary libraries

In [19]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split, cross_val_score

Reading the Training and Testing data

In [4]:
training_data = pd.read_csv("/content/Prasunet_ML_01/Dataset_ML_01/train.csv")
testing_data = pd.read_csv("/content/Prasunet_ML_01/Dataset_ML_01/test.csv")

# Exploring the Dataset

Printing information about the columns

In [7]:
training_data.info()

Describing the data

In [5]:
training_data.describe()

# Data Preprocessing

Finding the columns with NULL values

In [11]:
null_values = training_data.isnull().sum()

Listing the columns that contain NULL values

In [10]:
null_columns = training_data.columns[training_data.isnull().any()]

print("NULL columns are:")
for col in null_columns:
  print(col)

Printing the number of NULL values in each of those columns

In [13]:
print("Columns with NULL or missing values, and their counts:")
null_counts = null_values[null_values > 0]
print(null_counts)
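
The cells above only report where values are missing; nothing is filled in yet. A minimal sketch of one common way to handle them, assuming median imputation for numeric columns and a placeholder label for text columns. The toy DataFrame, its column names, and its values are invented for illustration and are not taken from `training_data`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for training_data; names and values are hypothetical.
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, 70.0],
    "GarageType": ["Attchd", None, "Detchd", "Attchd"],
})

# Numeric columns: fill gaps with the column median.
numeric_cols = toy.select_dtypes(include="number").columns
toy[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].median())

# Non-numeric columns: fill gaps with an explicit placeholder label.
other_cols = toy.columns.difference(numeric_cols)
toy[other_cols] = toy[other_cols].fillna("Missing")

# Every count should now be zero.
print(toy.isnull().sum())
```

Whether median imputation is appropriate depends on why the values are missing, so this is a starting point rather than a recommendation for this dataset.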

Checking for duplicate rows and dropping any that are found

In [15]:
repeating_rows = training_data.duplicated().sum()
print(f"Number of duplicated rows: {repeating_rows}")
training_data.drop_duplicates(inplace = True)

# Selecting Columns as Features for Training and Testing

In [17]:
features = training_data[['TotalBsmtSF', 'BedroomAbvGr', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath']]
target = training_data[['SalePrice']]


Splitting the Dataset for Training and Testing

In [20]:
X = training_data[['TotalBsmtSF', 'BedroomAbvGr', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath']]
y = training_data['SalePrice']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Creating a DataFrame with only the selected features and the target

In [23]:
df = pd.concat([features, target], axis = 1)
df.head()

Creating a correlation matrix and plotting it as a heatmap

In [28]:
# the correlation matrix
correlation_matrix = df.corr()

# heatmap for the correlation matrix
plt.figure(figsize=(8, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='Greens', fmt=".2f")
plt.title("Correlation Heatmap: Features vs. Target")
plt.show()
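
The heatmap shows every pairwise correlation, but for feature selection the column of interest is the one against `SalePrice`. A small sketch of how that column could be pulled out and ranked by absolute value. The toy frame reuses two of the column names from above with invented numbers, so the ranking it prints says nothing about the real data.

```python
import pandas as pd

# Toy stand-in for df; the numbers are invented for illustration.
toy_df = pd.DataFrame({
    "TotalBsmtSF": [800, 1000, 1200, 1500, 1700],
    "BedroomAbvGr": [3, 2, 4, 3, 3],
    "SalePrice": [120000, 150000, 175000, 210000, 240000],
})

# Pearson correlation of each feature with the target,
# sorted so the strongest relationship (by magnitude) comes first.
target_corr = (
    toy_df.corr()["SalePrice"]
    .drop("SalePrice")
    .sort_values(key=abs, ascending=False)
)
print(target_corr)
```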

Checking for Missing values in the selected features

In [29]:
missing_values = X.isnull().sum()
print(missing_values)

# The Linear Regression Model

In [31]:
model = LinearRegression()

Fitting the model

In [32]:
model.fit(X_train, y_train)
print(model)
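
`LinearRegression` fits ordinary least squares: it picks the coefficients and intercept that minimize the sum of squared residuals. A rough sketch of the same idea using NumPy directly, on synthetic data whose true coefficients are chosen here purely for illustration. This illustrates the objective, not how scikit-learn implements it internally.

```python
import numpy as np

# Synthetic data: y = 3*x1 - 2*x2 + 5 with no noise, so the fit is exact.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 2))
y_toy = 3 * X_toy[:, 0] - 2 * X_toy[:, 1] + 5

# Append a column of ones so the last weight acts as the intercept.
X_design = np.column_stack([X_toy, np.ones(len(X_toy))])

# Solve min_w ||X_design @ w - y||^2.
w, *_ = np.linalg.lstsq(X_design, y_toy, rcond=None)
coefficients, intercept = w[:2], w[2]
print(coefficients, intercept)
```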

Predicting on the Testing data

In [36]:
y_pred = model.predict(X_test)

# Calculating the Mean Squared Error (scikit-learn expects y_true first, then y_pred)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
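
MSE is in squared dollars, which is hard to read on its own. Two companion metrics are often reported alongside it: RMSE (the square root of MSE, in the same units as the price) and R², which `r2_score` computes and which is imported above but not used. A sketch of how all three are computed, using invented actual and predicted prices:

```python
import numpy as np

# Invented actual and predicted prices, for illustration only.
actual = np.array([200000.0, 150000.0, 300000.0, 250000.0])
predicted = np.array([210000.0, 140000.0, 280000.0, 260000.0])

errors = actual - predicted

# Mean squared error and its square root.
mse_toy = np.mean(errors ** 2)
rmse_toy = np.sqrt(mse_toy)

# R^2 = 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r2_toy = 1 - ss_res / ss_tot

print(f"MSE: {mse_toy:.0f}, RMSE: {rmse_toy:.2f}, R^2: {r2_toy:.3f}")
```

On the notebook's own variables, the equivalent calls would be `np.sqrt(mse)` and `r2_score(y_test, y_pred)`.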


Printing cross-validation scores (R² per fold, the default scoring for a regressor)

In [42]:
scores = cross_val_score(model, X, y, cv=5)
print(f'Cross-Validation R² Scores: {scores}')
print(f'Mean CV R² Score: {scores.mean()}')
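
With no `scoring` argument, `cross_val_score` falls back to the estimator's own `score` method, which for `LinearRegression` is R². To get an error in the target's units instead, a scoring string can be passed. The sketch below uses synthetic data (invented coefficients and noise level) and the built-in `"neg_root_mean_squared_error"` scorer. scikit-learn negates error metrics so that higher is always better, so the sign is flipped back afterwards.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression problem; coefficients and noise are made up.
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(100, 3))
y_toy = X_toy @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Default scoring for a regressor: R^2 per fold.
r2_scores = cross_val_score(LinearRegression(), X_toy, y_toy, cv=5)

# RMSE per fold: the scorer returns negative values, so negate them.
neg_rmse = cross_val_score(
    LinearRegression(), X_toy, y_toy, cv=5,
    scoring="neg_root_mean_squared_error",
)
rmse_scores = -neg_rmse

print(f"Mean R^2: {r2_scores.mean():.3f}")
print(f"Mean RMSE: {rmse_scores.mean():.3f}")
```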

Plotting Actual vs. Predicted Prices

In [44]:
plt.scatter(y_test, y_pred, c='green')
plt.xlabel("Actual Prices")
plt.ylabel("Predicted Prices")
plt.title("Actual Prices vs. Predicted Prices")
plt.show()

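Predicting the price of a hypothetical new house

In [45]:
# Build the demo house as a DataFrame with the training column names,
# so scikit-learn does not warn about missing feature names.
demo_house = pd.DataFrame([[1500, 4, 2, 0, 3, 1]], columns=X.columns)
predicted_price = model.predict(demo_house)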
print(f"Predicted Price for the New House: ${predicted_price[0]:.2f}")
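
A prediction from a linear model is just the intercept plus each feature value multiplied by its learned coefficient. The sketch below checks this on a small synthetic model (the data and the true relationship `y = 10 + 2*x1 + 3*x2` are invented here); the same `coef_` and `intercept_` attributes are available on the fitted `model` above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data following y = 10 + 2*x1 + 3*x2.
X_toy = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y_toy = 10.0 + 2.0 * X_toy[:, 0] + 3.0 * X_toy[:, 1]

toy_model = LinearRegression().fit(X_toy, y_toy)

# Predict one new row two ways: via predict() and by hand.
new_row = np.array([[6.0, 2.0]])
by_predict = toy_model.predict(new_row)[0]
by_hand = toy_model.intercept_ + new_row[0] @ toy_model.coef_

print(by_predict, by_hand)
```

Printing `dict(zip(X.columns, model.coef_))` on the notebook's model would show how much each selected feature contributes per unit.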