**Question 1:** Use the **House Prices dataset** from Kaggle to build a **linear regression model** to predict house prices.

● Load the dataset

● Preprocess the dataset

● Build and train a linear regression model

● Evaluate its performance using Mean Squared Error (MSE)

In [3]:
# Importing the pandas library for data manipulation and analysis
import pandas as pd

# Importing the numpy library for numerical operations
import numpy as np

# Importing the seaborn library for statistical data visualization
import seaborn as sns

# Importing the matplotlib library's pyplot module for plotting
import matplotlib.pyplot as plt


In [4]:
# Reading the house price dataset from a CSV file and loading it into a pandas DataFrame
house_price_train_data = pd.read_csv("/content/train.csv")


In [5]:
# Displaying the shape (number of rows and columns) of the house price DataFrame
house_price_train_data.shape


(1460, 81)

In [6]:
# Displaying the first few rows of the house price DataFrame to get a glimpse of the data
house_price_train_data.head()


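| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |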


In [7]:
# Providing a concise summary of the house price DataFrame including the column data types and non-null values
house_price_train_data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallC

In [8]:
# Checking for missing values in the house price DataFrame and summing up the number of missing values for each column
house_price_train_data.isnull().sum()


Id                 0
MSSubClass         0
MSZoning           0
LotFrontage      259
LotArea            0
                ... 
MoSold             0
YrSold             0
SaleType           0
SaleCondition      0
SalePrice          0
Length: 81, dtype: int64
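
Because the summary above is truncated, most of the columns with gaps are hidden. One way to see only the affected columns is to filter the counts and sort them. The sketch below uses a small made-up frame (`toy`, with invented values) rather than the real dataset; the same two lines should apply to `house_price_train_data`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for house_price_train_data (values are made up)
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, np.nan],
    "LotArea": [8450, 9600, 11250, 9550],
    "Alley": [np.nan, np.nan, np.nan, "Grvl"],
})

# Count missing values per column, keep only columns that have any,
# and sort so the worst-affected columns come first
missing_counts = toy.isnull().sum()
missing_counts = missing_counts[missing_counts > 0].sort_values(ascending=False)
print(missing_counts)
```

Columns without any missing values (here `LotArea`) drop out of the result, which keeps the listing short enough to read in full.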

In [9]:
# Handling missing values
# Fill missing numerical values with the median of the column
numerical_features = house_price_train_data.select_dtypes(include=[np.number]).columns
house_price_train_data[numerical_features] = house_price_train_data[numerical_features].fillna(house_price_train_data[numerical_features].median())

# Fill missing categorical values with the mode of the column
categorical_features = house_price_train_data.select_dtypes(include=[object]).columns
house_price_train_data[categorical_features] = house_price_train_data[categorical_features].fillna(house_price_train_data[categorical_features].mode().iloc[0])

# Encoding categorical variables using one-hot encoding
train_df = pd.get_dummies(house_price_train_data)

# Separate the target variable (SalePrice) from the features
x = train_df.drop('SalePrice', axis=1)
y = train_df['SalePrice']

# Split the data into training and testing sets
from sklearn.model_selection import train_test_split
x_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
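
The cell above computes medians and modes on the full dataset before splitting, so the test rows influence the imputation values. A possible alternative, shown here only as a sketch on synthetic data, is to split first and wrap imputation, encoding, and the regressor in a scikit-learn `Pipeline`, so every statistic is learned from the training rows alone. The frame `toy`, its three feature columns, and all `toy_`-prefixed names are assumptions made for illustration and are not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the housing data (values are made up)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "LotArea": rng.integers(5000, 15000, size=n).astype(float),
    "LotFrontage": rng.integers(40, 100, size=n).astype(float),
    "Street": rng.choice(["Pave", "Grvl"], size=n),
})
toy["SalePrice"] = (10 * toy["LotArea"] + 500 * toy["LotFrontage"]
                    + rng.normal(0, 1000, size=n))

# Knock out some feature values to imitate the gaps in the real data
toy.loc[toy.index[::10], "LotFrontage"] = np.nan
toy.loc[toy.index[::15], "Street"] = np.nan

# Split BEFORE any imputation or encoding
toy_X = toy.drop(columns="SalePrice")
toy_y = toy["SalePrice"]
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42
)

numeric_cols = ["LotArea", "LotFrontage"]
categorical_cols = ["Street"]

# Median imputation for numeric columns; mode imputation plus
# one-hot encoding for categorical columns
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

toy_model = Pipeline([
    ("preprocess", preprocess),
    ("regress", LinearRegression()),
])

# fit() learns imputation values and categories from the training rows only
toy_model.fit(toy_X_train, toy_y_train)
toy_pred = toy_model.predict(toy_X_test)
toy_mse = mean_squared_error(toy_y_test, toy_pred)
print("Toy MSE:", toy_mse)
```

`handle_unknown="ignore"` makes the encoder tolerate categories that appear only in the test rows, which `pd.get_dummies` applied separately to each split would not handle.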


In [31]:
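# Inspecting the full (un-encoded) feature matrix and target.
# Separate names are used so the x_train / y_train split created earlier is not
# overwritten: the raw frame still contains string columns that LinearRegression
# cannot fit, and it also includes the rows held out for testing.
x_full = house_price_train_data.drop(['SalePrice'], axis=1)
y_full = house_price_train_data['SalePrice']
print("Full features :", x_full.shape)
print("Full target :", y_full.shape)

Full features : (1460, 80)
Full target : (1460,)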


In [10]:
# Importing the LinearRegression class from the sklearn.linear_model module
from sklearn.linear_model import LinearRegression


In [11]:
# Creating an instance of the LinearRegression model
lin_reg = LinearRegression()

# Fitting the model to the training data, where x_train is the features and y_train is the target variable
lin_reg.fit(x_train, y_train)


In [13]:
from sklearn.metrics import mean_squared_error
# Predict on the test set
y_pred = lin_reg.predict(X_test)
# Calculate Mean Square Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print("Mean Square Error:", mse)


Mean Square Error: 873707802.614269
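
MSE is expressed in squared price units, which makes the figure above hard to judge by eye. Taking its square root gives the Root Mean Squared Error (RMSE), which is in the same units as `SalePrice`. The short sketch below simply reuses the MSE value printed above; it is not an additional result from the notebook.

```python
import math

# MSE value copied from the output above
mse = 873707802.614269

# RMSE is the square root of MSE, so it is in the same units as SalePrice
rmse = math.sqrt(mse)
print(f"Root Mean Squared Error: {rmse:.2f}")
```

This works out to roughly 29,600 in the target's units, which can be compared directly against typical sale prices in the dataset.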
