# Machine Learning intern at Prasunet Company

## Author: Pothumanchi Bhargav Narendra Raju
### Prasunet_ML_01

## Task 01: House Price Prediction using Linear Regression

### Project Overview:

This project aims to predict house prices using a linear regression model. The dataset used for this project is from the Kaggle competition "House Prices: Advanced Regression Techniques." The goal is to provide accurate predictions based on various features of the houses, such as square footage, the number of bedrooms, and the number of bathrooms.

### Dataset:

The dataset can be found on Kaggle: [House Prices - Advanced Regression Techniques](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data). It contains comprehensive data on houses, including numerous features that can potentially affect house prices.

### Features Used:

For this project, the primary features used for prediction are:

- **Square Footage** (`GrLivArea`): The above-ground living area of the house in square feet.
- **Number of Bedrooms** (`BedroomAbvGr`): The number of bedrooms above ground level.
- **Number of Bathrooms** (`FullBath`): The number of full bathrooms above ground level.

These features were selected based on their significance in influencing house prices.

### Model:

A linear regression model was used to predict house prices. The model was trained on 80% of the training data and evaluated on the remaining 20% hold-out split, which gives an estimate of how well it generalizes to unseen houses.
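
A single hold-out split yields one performance estimate. K-fold cross-validation averages over several splits and is usually more stable. The sketch below is an illustration only: it runs on synthetic data (the coefficients and noise level are made up, not taken from `train.csv`), and the names `demo_X`, `demo_y`, and `scores` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the three features. These values are invented
# for illustration and are NOT drawn from the Kaggle dataset.
rng = np.random.default_rng(0)
n = 200
demo_X = np.column_stack([
    rng.uniform(500, 4000, n),  # living area in square feet
    rng.integers(1, 6, n),      # bedrooms
    rng.integers(1, 4, n),      # full bathrooms
])
demo_y = (
    50_000
    + 100 * demo_X[:, 0]
    - 5_000 * demo_X[:, 1]
    + 20_000 * demo_X[:, 2]
    + rng.normal(0, 10_000, n)  # noise
)

# 5-fold cross-validation: each fold serves once as the validation set.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LinearRegression(), demo_X, demo_y, cv=cv, scoring="r2")

print("R^2 per fold:", scores)
print("Mean R^2:", scores.mean())
```

To apply the same check to the real data, `demo_X` and `demo_y` would be replaced by the `features` and `target` frames built in the notebook.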

### Predictions:

The model was used to predict house prices on the test data provided by Kaggle. The final predictions are stored in the `HousePricePrediction.csv` file.

### Repository Structure:

- `train.csv`: The training dataset, including the `SalePrice` target column.
- `test.csv`: The test dataset, without the `SalePrice` column.
- `HousePricePrediction.csv`: The final predictions generated by the model.


In [None]:
%pip install pandas numpy scikit-learn




In [None]:
import pandas as pd

df = pd.read_csv('train.csv')

print(df.head())


   Id  MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape  \
0   1          60       RL         65.0     8450   Pave   NaN      Reg   
1   2          20       RL         80.0     9600   Pave   NaN      Reg   
2   3          60       RL         68.0    11250   Pave   NaN      IR1   
3   4          70       RL         60.0     9550   Pave   NaN      IR1   
4   5          60       RL         84.0    14260   Pave   NaN      IR1   

  LandContour Utilities  ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold  \
0         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      2   
1         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      5   
2         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      9   
3         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      2   
4         Lvl    AllPub  ...        0    NaN   NaN         NaN       0     12   

  YrSold  SaleType  SaleCondition  SalePrice  
0   2008        WD   

In [None]:
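# Select the three predictors and the target column
features = df[['GrLivArea', 'BedroomAbvGr', 'FullBath']]
target = df['SalePrice']

# Count missing values per feature
print(features.isnull().sum())

# Fill any missing values with the column mean (a no-op here, since none are missing)
features = features.fillna(features.mean())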

GrLivArea       0
BedroomAbvGr    0
FullBath        0
dtype: int64


In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)
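
To make the effect of `test_size=0.2` and `random_state=42` concrete, here is a small sketch on placeholder data. It assumes 1460 rows, which is the size of the competition's `train.csv` but is not stated in this notebook; the `demo_` names are hypothetical and chosen so they do not overwrite the notebook's variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Assumption: 1460 rows, matching the Kaggle train.csv for this competition.
demo_X = np.arange(1460).reshape(-1, 1)
demo_y = np.arange(1460)

demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=42
)
print(len(demo_X_train), len(demo_X_test))  # 1168 292

# Fixing random_state makes the split reproducible across runs.
_, demo_X_test_again, _, _ = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=42
)
print((demo_X_test == demo_X_test_again).all())  # True
```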


In [None]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
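
A fitted `LinearRegression` exposes its learned weights through `coef_` and `intercept_`, which can help interpret the model. The sketch below uses a tiny, noise-free, made-up dataset (not the Kaggle data) so that the recovered coefficients can be compared with the ones used to generate it; `demo_model` is a hypothetical name.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data generated as: price = 20000 + 100 * area + 5000 * bedrooms
demo_X = np.array(
    [[1000, 2], [1500, 3], [2000, 3], [2500, 4], [1200, 1]], dtype=float
)
demo_y = 20_000 + 100 * demo_X[:, 0] + 5_000 * demo_X[:, 1]

demo_model = LinearRegression().fit(demo_X, demo_y)

# Each coefficient is the predicted price change per unit of that feature,
# holding the other feature fixed.
for name, coef in zip(["area", "bedrooms"], demo_model.coef_):
    print(f"{name}: {coef:.2f}")
print(f"intercept: {demo_model.intercept_:.2f}")
```

Running the same loop over `model.coef_` with the notebook's three column names would show how much each feature contributes to the predicted `SalePrice`.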


In [None]:
from sklearn.metrics import mean_squared_error, r2_score
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Mean Squared Error: {mse}')
print(f'R^2 Score: {r2}')


Mean Squared Error: 2806426667.247853
R^2 Score: 0.6341189942328371
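
The mean squared error is expressed in squared dollars, which is hard to read. Taking its square root gives the root mean squared error (RMSE) in the same unit as `SalePrice`. A short worked calculation using the value printed above:

```python
import math

mse = 2806426667.247853  # value printed by the evaluation cell
rmse = math.sqrt(mse)

print(f"RMSE: {rmse:,.0f}")  # RMSE: 52,976
```

So a typical prediction error on the hold-out split is roughly 53,000 dollars, and the R^2 of about 0.63 means the three features explain around 63% of the variance in sale prices on that split.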


In [None]:
# Full pipeline from the cells above, consolidated into a single cell
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

df = pd.read_csv('train.csv')

features = df[['GrLivArea', 'BedroomAbvGr', 'FullBath']]
target = df['SalePrice']

print(features.isnull().sum())
features = features.fillna(features.mean())
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Mean Squared Error: {mse}')
print(f'R^2 Score: {r2}')


GrLivArea       0
BedroomAbvGr    0
FullBath        0
dtype: int64
Mean Squared Error: 2806426667.247853
R^2 Score: 0.6341189942328371


In [None]:
import pandas as pd

# Load the Kaggle test set (it has no SalePrice column)
df_test = pd.read_csv('test.csv')

test_features = df_test[['GrLivArea', 'BedroomAbvGr', 'FullBath']]

# Impute with the training-set means so test data is treated like the training data
test_features = test_features.fillna(features.mean())

# Reuse the model fitted in the earlier cell
predictions = model.predict(test_features)

# Write the submission under the file name listed in the repository structure
submission_df = pd.DataFrame({'Id': df_test['Id'], 'SalePrice': predictions})
submission_df.to_csv('HousePricePrediction.csv', index=False)

print("Predictions saved to HousePricePrediction.csv")


Predictions saved to HousePricePrediction.csv
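
Before uploading, it can be worth sanity-checking the submission frame, since a plain linear model can in principle produce non-positive prices. The helper below is a hypothetical sketch, not part of the original notebook, and the two small frames are made-up examples.

```python
import pandas as pd


def check_submission(df: pd.DataFrame) -> list[str]:
    """Return a list of problems found in a submission frame (hypothetical helper)."""
    problems = []
    if list(df.columns) != ["Id", "SalePrice"]:
        # Without the expected columns the remaining checks cannot run.
        problems.append("columns must be exactly ['Id', 'SalePrice']")
        return problems
    if df["Id"].duplicated().any():
        problems.append("duplicate Id values")
    if df["SalePrice"].isnull().any():
        problems.append("missing SalePrice values")
    if (df["SalePrice"] <= 0).any():
        problems.append("non-positive SalePrice values")
    return problems


# Made-up example frames, not real predictions.
good = pd.DataFrame({"Id": [1461, 1462], "SalePrice": [120000.0, 155000.5]})
bad = pd.DataFrame({"Id": [1461, 1461], "SalePrice": [120000.0, -5.0]})

print(check_submission(good))  # []
print(check_submission(bad))   # ['duplicate Id values', 'non-positive SalePrice values']
```

In the notebook, the same function could be called on `submission_df` right before `to_csv`.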
