In [212]:
#Importing the two basic libraries
import numpy as np 
import pandas as pd 

In [213]:
#Loading the training data
train_data = pd.read_csv('../input/train.csv')

**Taking a look at our training data**

In [214]:
#This is what our training data looks like
print(train_data.head())

   Id  MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape  \
0   1          60       RL         65.0     8450   Pave   NaN      Reg   
1   2          20       RL         80.0     9600   Pave   NaN      Reg   
2   3          60       RL         68.0    11250   Pave   NaN      IR1   
3   4          70       RL         60.0     9550   Pave   NaN      IR1   
4   5          60       RL         84.0    14260   Pave   NaN      IR1   

  LandContour Utilities    ...     PoolArea PoolQC Fence MiscFeature MiscVal  \
0         Lvl    AllPub    ...            0    NaN   NaN         NaN       0   
1         Lvl    AllPub    ...            0    NaN   NaN         NaN       0   
2         Lvl    AllPub    ...            0    NaN   NaN         NaN       0   
3         Lvl    AllPub    ...            0    NaN   NaN         NaN       0   
4         Lvl    AllPub    ...            0    NaN   NaN         NaN       0   

  MoSold YrSold  SaleType  SaleCondition  SalePrice  
0      2   2008     

**Handling all the missing values in the training data to make our dataset ready to use**

In [215]:
total = train_data.isnull().sum().sort_values(ascending=False)
percent = (train_data.isnull().sum()/train_data.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
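To make the missing-value summary easier to follow, here is a minimal sketch of the same computation on a small made-up DataFrame. The frame `toy` and its column names are assumptions for illustration only and are not part of the housing data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 'A' has 1 missing value, 'B' has 3, 'C' has none
toy = pd.DataFrame({
    'A': [1.0, np.nan, 3.0, 4.0],
    'B': [np.nan, np.nan, np.nan, 1.0],
    'C': [1, 2, 3, 4],
})

# Count of missing values per column, largest first
total = toy.isnull().sum().sort_values(ascending=False)
# Fraction of rows that are missing per column, largest first
percent = (toy.isnull().sum() / len(toy)).sort_values(ascending=False)
# Side-by-side summary table, aligned on the column names
missing = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
print(missing)
```

Dividing by `len(toy)` gives the same result as dividing by `toy.isnull().count()`, since both equal the number of rows.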

In [216]:
#Dropping every column that has more than one missing value
train_data = train_data.drop(columns=missing_data[missing_data['Total'] > 1].index)
#Dropping the row(s) where 'Electrical' is missing, so the column itself can be kept
train_data = train_data.drop(train_data.loc[train_data['Electrical'].isnull()].index)
#train_data.isnull().sum().max() #just to check that there's no missing data left
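The two-step cleanup (drop sparse columns, then drop the few remaining incomplete rows) can be sketched on a small made-up frame. The values below are invented for illustration; only the column names echo the housing data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 'Alley' is mostly missing, 'Electrical' has one gap
toy = pd.DataFrame({
    'Alley': [np.nan, np.nan, 'Grvl', np.nan],
    'Electrical': ['SBrkr', np.nan, 'SBrkr', 'FuseA'],
    'LotArea': [8450, 9600, 11250, 9550],
})

missing_counts = toy.isnull().sum()

# Step 1: drop columns with more than one missing value
cols_to_drop = missing_counts[missing_counts > 1].index
cleaned = toy.drop(columns=cols_to_drop)

# Step 2: drop the rows where 'Electrical' is still missing
cleaned = cleaned.drop(cleaned.loc[cleaned['Electrical'].isnull()].index)

# No missing values should remain
print(cleaned.isnull().sum().max())
```

Using the `columns=` keyword avoids passing the axis positionally, which recent pandas versions no longer accept.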

**Many of the remaining columns are not needed for our predictions. To reduce computation, we will use only the following columns as predictors:**

In [217]:
predictors = ['OverallQual','TotalBsmtSF','2ndFlrSF','GarageArea','YearBuilt','GrLivArea']

In [218]:
#Extracting only important columns from the training data
X = train_data[predictors]

In [219]:
#Using SalePrice as the target variable y
y = train_data.SalePrice

**Creating a Linear Regression model**

In [220]:
#Importing the required modules from scikit-learn
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

**Splitting the data into train and validation sets to train and test the model**

In [221]:
train_X, val_X, train_y, val_y = train_test_split(X, y,random_state = 0)
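Because no `test_size` is passed, `train_test_split` falls back to its default of holding out 25% of the rows for validation. A minimal sketch on made-up arrays (the names `X_demo` and `y_demo` are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 samples with 2 features each
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20)

# Same call pattern as above: default 75/25 split, fixed seed for reproducibility
demo_train_X, demo_val_X, demo_train_y, demo_val_y = train_test_split(
    X_demo, y_demo, random_state=0
)

print(demo_train_X.shape, demo_val_X.shape)  # (15, 2) (5, 2)
```

Fixing `random_state` makes the shuffle repeatable, so the same rows land in the validation set on every run.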

In [222]:
#Fitting the Linear Regression model on the training split and predicting on the validation split
training_model = LinearRegression()
training_model.fit(train_X,train_y)
predicted_prices = training_model.predict(val_X)
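To illustrate what `fit` learns, here is a minimal sketch on synthetic, noise-free data where the true relationship is known. The data and the coefficients 3, 2 and 5 are assumptions chosen for the example, not values from the housing model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data generated from y = 3*x0 + 2*x1 + 5 (no noise)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(50, 2))
y_demo = 3.0 * X_demo[:, 0] + 2.0 * X_demo[:, 1] + 5.0

demo_model = LinearRegression().fit(X_demo, y_demo)

# The fitted weights and intercept should recover the generating values
print(demo_model.coef_, demo_model.intercept_)
```

On the housing model, the same attributes (`training_model.coef_` and `training_model.intercept_`) show how much each predictor contributes to the predicted price.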

**Comparing the predicted values with the original sale prices gives us an idea of how well our model performs:**

In [223]:
final_values = pd.DataFrame({'Original Value': val_y, 'Predicted Prices': predicted_prices.round()})

In [224]:
#Visualizing and comparing the final predicted values to original prices
print(final_values.head(4))

      Original Value  Predicted Prices
1420          179900          176048.0
494            91300           87737.0
1412           90000           82446.0
569           135960          133787.0


In [225]:
#Measuring the quality of our predicted prices with the R^2 score (coefficient of determination), referred to as "accuracy" below
from sklearn.metrics import r2_score

In [226]:
accuracy = r2_score(val_y,predicted_prices.round())*100

In [227]:
print("Accuracy is: ", accuracy)

Accuracy is:  80.61131603367528
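The figure printed above is the R² score (coefficient of determination) multiplied by 100, not a classification accuracy. As a rough sketch of what `r2_score` computes, the snippet below reproduces it by hand on the four rows printed earlier. These four rows are only an illustrative subset, so the result will differ from the score on the full validation set.

```python
import numpy as np
from sklearn.metrics import r2_score

# The four validation rows shown in the comparison table above
y_true = np.array([179900, 91300, 90000, 135960], dtype=float)
y_pred = np.array([176048, 87737, 82446, 133787], dtype=float)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# Should agree with scikit-learn's implementation
print(r2_manual, r2_score(y_true, y_pred))
```

An R² of 1 means perfect predictions, while 0 means the model does no better than always predicting the mean price.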


**This score (an R² of about 80.6%, reported above as accuracy) was achieved with a basic linear regression model. It can likely be improved further by using more advanced regression models such as XGBoost.**
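As a hedged sketch of that idea, the snippet below compares a linear model with a gradient-boosted one. It uses scikit-learn's `GradientBoostingRegressor` as a stand-in rather than XGBoost itself, and it runs on synthetic data that is deliberately nonlinear, so it says nothing about how much the housing score would actually change.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Hypothetical nonlinear data (assumption for illustration only)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(500, 2))
y_demo = X_demo[:, 0] ** 2 + 10 * np.sin(X_demo[:, 1]) + rng.normal(0, 1, size=500)

tr_X, va_X, tr_y, va_y = train_test_split(X_demo, y_demo, random_state=0)

# Fit both models on the same training split
linear = LinearRegression().fit(tr_X, tr_y)
boosted = GradientBoostingRegressor(random_state=0).fit(tr_X, tr_y)

# Compare R^2 on the held-out validation split
linear_r2 = r2_score(va_y, linear.predict(va_X))
boosted_r2 = r2_score(va_y, boosted.predict(va_X))
print(linear_r2, boosted_r2)
```

To try this on the housing data, the same two `fit` and `predict` calls could be applied to `train_X`, `train_y` and `val_X` from the cells above.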