Task_01 => Implement a linear regression model to predict house prices from square footage and the number of bedrooms and bathrooms.
Dataset link => 'https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data'

To perform linear regression in Python, we typically use a combination of libraries that facilitate data manipulation, analysis, and modeling. Here are some of the most commonly used libraries:

1. NumPy: Provides support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.

2. Pandas: Offers data structures and data analysis tools. It is particularly useful for handling and manipulating tabular data.

3. Matplotlib: A plotting library used for creating static, interactive, and animated visualizations in Python.

4. Seaborn: A statistical data visualization library based on Matplotlib, providing a high-level interface for drawing attractive and informative statistical graphics.

5. Scikit-learn: A machine learning library that provides simple and efficient tools for data mining and data analysis. It is the go-to library for implementing machine learning algorithms, including linear regression.

6. Statsmodels: Provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests and statistical data exploration.


In [124]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score

Downloading and loading the dataset

In [125]:
train_data  = pd.read_csv('train.csv')
test_data  = pd.read_csv('test.csv')

In [126]:
# Print the shapes of the datasets
print("Training data shape:", train_data.shape)
print("Test data shape:", test_data.shape)

Training data shape: (1460, 81)
Test data shape: (1459, 80)


=> The training set has 81 columns, which is a lot; to predict house prices we only need a few of them (square footage, number of bedrooms, number of bathrooms).
- Those columns are not labeled under exactly those names, so we use the listed features that correspond most closely to them and have a practical effect on the price.

Data Preprocessing:
 - Handle Missing Values


=> The first and most obvious preprocessing option would be to drop every row that has a missing value, but that operation would leave us with an empty dataset.
- Therefore the next option is to impute the missing values and then scale the features; because the columns (features) used are of both numerical and categorical type, we use pipelines.
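
A minimal sketch of why dropping rows fails while imputation does not, using a small made-up DataFrame. The column names mirror the dataset, but the values are invented for illustration: when one column is missing in every row, `dropna()` removes everything, whereas filling in only the columns we plan to use keeps every row.

```python
import numpy as np
import pandas as pd

# Toy data (invented values): 'Alley' is missing in every row, similar to the
# sparsely populated columns in train.csv.
toy = pd.DataFrame({
    'GrLivArea': [1710.0, 1262.0, np.nan, 1717.0],
    'FullBath': [2, 2, 2, 1],
    'Alley': [np.nan, np.nan, np.nan, np.nan],
})

# Dropping every row that contains a missing value leaves nothing.
dropped = toy.dropna()
print("Rows after dropna():", len(dropped))

# Imputing only the columns we intend to use keeps all rows.
used = toy[['GrLivArea', 'FullBath']]
imputed = used.fillna(used.mean())
print("Rows after mean imputation:", len(imputed))
```

Here `fillna(used.mean())` plays the role that `SimpleImputer(strategy='mean')` plays inside a pipeline.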


In [127]:
# Define numerical and categorical columns based on the column names in your dataset
numerical_cols = ['GrLivArea', 'BedroomAbvGr', 'FullBath', 'LotArea']
categorical_cols = ['MSSubClass', 'MSZoning', 'Neighborhood']


# Define preprocessing pipelines for numerical and categorical data
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),  # Fill missing values with mean
    ('scaler', StandardScaler())  # Scale the features
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),  # Fill missing values with most frequent value
    ('onehot', OneHotEncoder(handle_unknown='ignore'))  # One-hot encode categorical features
])

# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

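# Apply preprocessing pipeline to both training and test data
# (fit on the training data only, then reuse the fitted transformers on the test data).
# Note: the model further down is fit on the raw selected columns, not on these processed arrays.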
train_data_processed = preprocessor.fit_transform(train_data)
test_data_processed = preprocessor.transform(test_data)
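
The processed arrays above are not passed to the model later in this notebook. One possible way to make the preprocessing take effect is to chain the preprocessor and the regressor in a single `Pipeline`. The sketch below shows the idea on synthetic data; the names `toy_train`, `toy_target`, `toy_preprocessor`, and `full_model` are hypothetical, and the values are invented, so it is an illustration rather than the notebook's actual method.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for train.csv (invented values, reduced column set).
rng = np.random.default_rng(0)
n = 60
toy_train = pd.DataFrame({
    'GrLivArea': rng.uniform(800, 3000, n),
    'FullBath': rng.integers(1, 4, n).astype(float),
    'MSZoning': rng.choice(['RL', 'RM', 'FV'], n),
})
toy_train.loc[0, 'GrLivArea'] = np.nan  # simulate a missing value
toy_target = 100 * toy_train['GrLivArea'].fillna(1500) + rng.normal(0, 5000, n)

toy_preprocessor = ColumnTransformer(transformers=[
    ('num', Pipeline([('imputer', SimpleImputer(strategy='mean')),
                      ('scaler', StandardScaler())]), ['GrLivArea', 'FullBath']),
    ('cat', Pipeline([('imputer', SimpleImputer(strategy='most_frequent')),
                      ('onehot', OneHotEncoder(handle_unknown='ignore'))]), ['MSZoning']),
])

# Chaining the preprocessor and the regressor means fit() and predict()
# apply imputation, scaling, and encoding automatically.
full_model = Pipeline([('preprocess', toy_preprocessor),
                       ('regressor', LinearRegression())])
full_model.fit(toy_train, toy_target)

# A category not seen during fit is encoded as all zeros (handle_unknown='ignore').
new_house = pd.DataFrame({'GrLivArea': [1500.0], 'FullBath': [2.0],
                          'MSZoning': ['C (all)']})
toy_prediction = full_model.predict(new_house)
print(toy_prediction)
```

With this layout the same object can be passed to `cross_val_score`, so the imputer and scaler are refit inside each fold instead of seeing the whole training set.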


 - Select Relevant Features


In [129]:
# 'GrLivArea' (above-ground living area in square feet), 'BedroomAbvGr' (number of bedrooms),
# 'FullBath' and 'HalfBath' (number of bathrooms), and 'SalePrice' (target variable)
# Selecting features and target variable for training
X_train = train_data[['GrLivArea', 'BedroomAbvGr', 'FullBath', 'HalfBath']]
y_train = train_data['SalePrice']

# Selecting features for testing
X_test = test_data[['GrLivArea', 'BedroomAbvGr', 'FullBath', 'HalfBath']]

 - Feature Scaling

Normally the data would be split into training and testing sets at this point, but here we already have two separate .csv files for training and testing; hence:

Performing Linear Regression

Create and train the model

In [130]:
model = LinearRegression()
model.fit(X_train, y_train)

Make predictions

In [131]:
y_pred = model.predict(X_test)
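
Because test.csv has one column fewer than train.csv (it carries no 'SalePrice'), these predictions cannot be scored directly. Besides cross-validation, a common alternative is to hold out part of the training data with `train_test_split`. The sketch below uses synthetic data, and the names `X_fit`, `X_val`, and `holdout_model` are hypothetical; it illustrates the technique and is not part of the original workflow.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data (invented): price grows roughly linearly with living area.
rng = np.random.default_rng(42)
X = rng.uniform(500, 4000, size=(200, 1))
y = 120 * X[:, 0] + 20000 + rng.normal(0, 10000, 200)

# Keep 20% of the labeled rows aside for validation.
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42)

holdout_model = LinearRegression().fit(X_fit, y_fit)
val_r2 = r2_score(y_val, holdout_model.predict(X_val))
print(f"Validation R^2: {val_r2:.3f}")
```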

Evaluate the model

In [132]:
# Perform cross-validation on the training data
mse_scores = -cross_val_score(model, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
r2_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='r2')

# Calculate the mean of the scores
mean_mse = mse_scores.mean()
mean_r2 = r2_scores.mean()

print(f'Mean Squared Error: {mean_mse}')
print(f'R^2 Score: {mean_r2}')

Mean Squared Error: 2676474166.1980286
R^2 Score: 0.575836397456419
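
The MSE is expressed in squared price units, which makes it hard to read. Taking its square root gives a root mean squared error (RMSE) in the same units as 'SalePrice'. A small sketch using the value printed above (note that this is the square root of the mean fold MSE, which is not identical to the mean of the per-fold RMSE values):

```python
import math

mean_mse = 2676474166.1980286  # cross-validated MSE reported above
rmse = math.sqrt(mean_mse)     # same units as SalePrice
print(f"RMSE: {rmse:,.0f}")
```

This works out to an error on the order of 52 thousand per house, which fits the moderate R^2 of about 0.58.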


Save the predictions

In [133]:
# Saving the predictions to a CSV file
predictions_df = pd.DataFrame({'Predicted SalePrice': y_pred})
predictions_df.to_csv('predictions.csv', index=False)

print("Predictions saved to predictions.csv")


Predictions saved to predictions.csv
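
The file written above holds a single column of predictions with no identifier, so a row cannot be traced back to a house without the original row order. A hedged sketch of pairing each prediction with its 'Id', assuming the two-column `Id,SalePrice` layout used by the competition's submission file; the values and the names `toy_test`, `toy_pred`, and `submission` are invented for illustration.

```python
import io

import numpy as np
import pandas as pd

# Invented stand-ins for test_data and y_pred.
toy_test = pd.DataFrame({'Id': [1461, 1462, 1463]})
toy_pred = np.array([120000.4, 155250.0, 180999.9])

# Pair each prediction with the Id of the row it belongs to.
submission = pd.DataFrame({'Id': toy_test['Id'], 'SalePrice': toy_pred})

# Write to an in-memory buffer here; a file name such as 'submission.csv'
# would work in the same way.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
print(buffer.getvalue())
```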


In [134]:
# Round the predicted prices to one decimal point
rounded_pred = [round(pred, 1) for pred in y_pred]

In [135]:
# Read the original test data
test_data = pd.read_csv('test.csv')

# Add the predicted prices as a new column to the test data
test_data['Predicted_SalePrice'] = rounded_pred

# Save the test data with predicted prices to a new CSV file
# (file name matches the message printed below)
test_data.to_csv('test_with_rounded_predictions.csv', index=False)

print("Test data with rounded predicted prices saved to test_with_rounded_predictions.csv")


Test data with rounded predicted prices saved to test_with_rounded_predictions.csv
