# Simple Linear Regression

By the end of this notebook you'll know how to import a dataset, set up separate variables for features and target, create a train-test split, fit a basic linear regression model, and calculate the R2 score.

The first thing we need to do is to import some relevant libraries.
## Import Libraries

In [1]:
import pandas as pd
import numpy as np

`pandas` is a Python library for data manipulation and analysis.

`NumPy` is a Python library for mathematical operations, mainly on multi-dimensional arrays.

## Import Dataset
We'll be using the `Ames_Housing_Sales.csv` dataset.

The Ames housing dataset describes features of houses sold in Ames, Iowa, between 2006 and 2010. The goal is to use the training data to predict the sale prices of the houses in the testing data.

In [2]:
data = pd.read_csv('../datasets/Ames_Housing_Sales.csv')
data.head()  # returns the first 5 rows of the dataset

Unnamed: 0,1stFlrSF,2ndFlrSF,3SsnPorch,Alley,BedroomAbvGr,BldgType,BsmtCond,BsmtExposure,BsmtFinSF1,BsmtFinSF2,...,ScreenPorch,Street,TotRmsAbvGrd,TotalBsmtSF,Utilities,WoodDeckSF,YearBuilt,YearRemodAdd,YrSold,SalePrice
0,856.0,854.0,0.0,,3,1Fam,TA,No,706.0,0.0,...,0.0,Pave,8,856.0,AllPub,0.0,2003,2003,2008,208500.0
1,1262.0,0.0,0.0,,3,1Fam,TA,Gd,978.0,0.0,...,0.0,Pave,6,1262.0,AllPub,298.0,1976,1976,2007,181500.0
2,920.0,866.0,0.0,,3,1Fam,TA,Mn,486.0,0.0,...,0.0,Pave,6,920.0,AllPub,0.0,2001,2002,2008,223500.0
3,961.0,756.0,0.0,,3,1Fam,Gd,No,216.0,0.0,...,0.0,Pave,7,756.0,AllPub,0.0,1915,1970,2006,140000.0
4,1145.0,1053.0,0.0,,4,1Fam,TA,Av,655.0,0.0,...,0.0,Pave,9,1145.0,AllPub,192.0,2000,2000,2008,250000.0


In [3]:
# to see how many entries (observations) and columns the dataset contains, print its shape
print(data.shape)

(1379, 80)


In [4]:
# count how many features there are of each data type (object, float64, int64)
data.dtypes.value_counts()

object     43
float64    21
int64      16
dtype: int64

The __Ames_Housing_Sales__ dataset contains 43 categorical (`object`) columns and 37 numerical columns (21 `float64` and 16 `int64`).

In [5]:
# to examine each of the columns
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1379 entries, 0 to 1378
Data columns (total 80 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   1stFlrSF       1379 non-null   float64
 1   2ndFlrSF       1379 non-null   float64
 2   3SsnPorch      1379 non-null   float64
 3   Alley          1379 non-null   object 
 4   BedroomAbvGr   1379 non-null   int64  
 5   BldgType       1379 non-null   object 
 6   BsmtCond       1379 non-null   object 
 7   BsmtExposure   1379 non-null   object 
 8   BsmtFinSF1     1379 non-null   float64
 9   BsmtFinSF2     1379 non-null   float64
 10  BsmtFinType1   1379 non-null   object 
 11  BsmtFinType2   1379 non-null   object 
 12  BsmtFullBath   1379 non-null   int64  
 13  BsmtHalfBath   1379 non-null   int64  
 14  BsmtQual       1379 non-null   object 
 15  BsmtUnfSF      1379 non-null   float64
 16  CentralAir     1379 non-null   object 
 17  Condition1     1379 non-null   object 
 18  Conditio

In [6]:
# to get the total sum of all missing values
data.isnull().sum().sum()

0
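
The total above is a single number, so it can't show which columns would be affected if a dataset did have gaps. As a minimal sketch, the same `isnull().sum()` call without the second `.sum()` gives a per-column breakdown. The `toy` frame below is a hypothetical stand-in, not part of the Ames data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few deliberately missing entries
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": ["x", None, None],
})

# One .sum() counts missing values per column
missing_per_column = toy.isnull().sum()

# A second .sum() collapses the per-column counts into a grand total
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print("total:", total_missing)
```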

In [7]:
# to get descriptive statistics of the dataset
data.describe()

Unnamed: 0,1stFlrSF,2ndFlrSF,3SsnPorch,BedroomAbvGr,BsmtFinSF1,BsmtFinSF2,BsmtFullBath,BsmtHalfBath,BsmtUnfSF,EnclosedPorch,...,OverallQual,PoolArea,ScreenPorch,TotRmsAbvGrd,TotalBsmtSF,WoodDeckSF,YearBuilt,YearRemodAdd,YrSold,SalePrice
count,1379.0,1379.0,1379.0,1379.0,1379.0,1379.0,1379.0,1379.0,1379.0,1379.0,...,1379.0,1379.0,1379.0,1379.0,1379.0,1379.0,1379.0,1379.0,1379.0,1379.0
mean,1177.129804,353.424946,3.609862,2.86512,455.57868,48.102248,0.430747,0.058738,570.765047,21.039159,...,6.187092,2.920957,15.945613,6.552574,1074.445975,97.456853,1972.958666,1985.435098,2007.812183,185479.51124
std,387.014961,439.553171,30.154682,0.783961,459.691379,164.324665,0.514052,0.238285,443.677845,60.535107,...,1.34578,41.335545,57.249593,1.589821,436.371874,126.699192,29.379883,20.444852,1.330221,79023.8906
min,438.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,2.0,0.0,0.0,3.0,0.0,0.0,1880.0,1950.0,2006.0,35311.0
25%,894.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,228.0,0.0,...,5.0,0.0,0.0,5.0,810.0,0.0,1955.0,1968.0,2007.0,134000.0
50%,1098.0,0.0,0.0,3.0,400.0,0.0,0.0,0.0,476.0,0.0,...,6.0,0.0,0.0,6.0,1008.0,0.0,1976.0,1994.0,2008.0,167500.0
75%,1414.0,738.5,0.0,3.0,732.0,0.0,1.0,0.0,811.0,0.0,...,7.0,0.0,0.0,7.0,1314.0,171.0,2001.0,2004.0,2009.0,217750.0
max,4692.0,2065.0,508.0,6.0,5644.0,1474.0,2.0,2.0,2336.0,552.0,...,10.0,738.0,480.0,12.0,6110.0,857.0,2010.0,2010.0,2010.0,755000.0


## Set up separate variables for features and target

Before that, since we can't fit categorical variables into a regression model directly, we'll drop them from the DataFrame.

In [8]:
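# np.object was removed in NumPy 1.24; the builtin `object` refers to the same dtype
categorical_mask = data.dtypes == object            # True for categorical (object) columns
categorical_cols = data.columns[categorical_mask]   # names of the categorical columns
data = data.drop(categorical_cols, axis=1)          # keep only the numerical columns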
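
An alternative way to get the same result is `DataFrame.select_dtypes`, which keeps columns by type instead of dropping them by name. The sketch below uses a small hypothetical frame (`toy`) with a few column names borrowed from the dataset; it is an illustration, not the notebook's own cell.

```python
import pandas as pd

# Hypothetical toy frame: one categorical column, three numerical ones
toy = pd.DataFrame({
    "BldgType": ["1Fam", "1Fam", "Duplex"],
    "1stFlrSF": [856.0, 1262.0, 920.0],
    "BedroomAbvGr": [3, 3, 4],
    "SalePrice": [208500.0, 181500.0, 223500.0],
})

# Keep only integer and float columns; the text column is left out
numeric_only = toy.select_dtypes(include="number")

print(numeric_only.columns.tolist())
# → ['1stFlrSF', 'BedroomAbvGr', 'SalePrice']
```

Selecting by `"number"` avoids having to name the dtype of the text columns at all.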

In the __Ames_Housing_Sales__ dataset, `SalePrice` is our target (dependent) variable.

In [9]:
target_col = "SalePrice"

X = data.drop(target_col, axis=1)
y = data[target_col]

Now that we have the feature (X) and target (y) data ready to go, we're nearly ready to fit and evaluate a baseline model using our current feature set. We'll need to create a train/test split before we fit and score the model.

## Create Train Test Splits

To do so, we need to import a function called `train_test_split` from scikit-learn.

In [10]:
# import train_test_split
from sklearn.model_selection import train_test_split

We've split our data into X and y, so now we can pass them to the `train_test_split` function along with `test_size`. The function returns four variables, in this order: `X_train`, `X_test`, `y_train`, and `y_test`.

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3) # splits train and test in 7:3 ratio
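
Because `train_test_split` shuffles the rows randomly, each run of the cell above produces a different split, and therefore slightly different scores. A minimal sketch of how to make the split repeatable is to pass `random_state`. The arrays and the seed value `42` below are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical demo data: 10 samples, 2 features
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Same seed, same split: calling twice gives identical partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

print(X_tr.shape, X_te.shape)        # 7 training rows, 3 test rows
print(np.array_equal(X_tr, X_tr2))   # True: the split is reproducible
```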

## Fit a basic Linear Regression Model

Import the class containing the regression method from `sklearn.linear_model`.

In [12]:
from sklearn.linear_model import LinearRegression

Create an instance of the class.

In [13]:
LR = LinearRegression()

Fit the instance on the training data, then predict the expected values.

In [14]:
LR = LR.fit(X_train, y_train)
# Predicting
y_train_pred = LR.predict(X_train) # using train data
y_test_pred = LR.predict(X_test) # using test data
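
To see what fitting actually learns, it can help to try it on data where the answer is known. The sketch below fits `LinearRegression` to hypothetical points generated from y = 2x + 1 and reads the learned slope and intercept back from the `coef_` and `intercept_` attributes. The data and variable names are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data lying exactly on the line y = 2x + 1
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = 2.0 * X_demo[:, 0] + 1.0

model = LinearRegression().fit(X_demo, y_demo)

slope = model.coef_[0]        # one coefficient per feature
intercept = model.intercept_  # the constant term

# Prediction for an unseen input x = 4
prediction = model.predict(np.array([[4.0]]))[0]

print(slope, intercept, prediction)
```

On the housing data the same attributes hold one coefficient per numerical feature.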

## Calculate R2 Score

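The R2 score (coefficient of determination) measures the proportion of the variance in the target that the model explains. A score of 1 means perfect predictions, and a score of 0 means the model does no better than always predicting the mean of the target.

Import the `r2_score` function from `sklearn.metrics`.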

In [15]:
from sklearn.metrics import r2_score

In [16]:
# calculating r2 score
score_train = r2_score(y_train.values, y_train_pred) # using train data
score_test = r2_score(y_test.values, y_test_pred) # using test data

print("-------------------R2 Scores-----------------")
print("Predicting with train data i.e. on seen data : ", score_train)
print("Predicting with test data i.e. on unseen data : ", score_test)

-------------------R2 Scores-----------------
Predicting with train data i.e. on seen data :  0.8387684736812978
Predicting with test data i.e. on unseen data :  0.6474581398991139
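
For intuition about what these numbers mean, R2 can be computed by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below checks that formula against `r2_score` on a small set of hypothetical values, not on the housing data.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical true values and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5])

# Residual sum of squares: error left over after the model's predictions
ss_res = np.sum((y_true - y_pred) ** 2)

# Total sum of squares: error of always predicting the mean
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1.0 - ss_res / ss_tot
r2_sklearn = r2_score(y_true, y_pred)

print(r2_manual, r2_sklearn)  # both are 0.9625 up to floating-point rounding
```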
