# Simple Linear Regression with Sklearn

Real estate is an example that nearly every regression course covers: it is easy to understand, and there is almost always a clear causal relationship to be found.

The data is located in the file: 'real_estate_price_size.csv'.
I am going to create a simple linear regression using this data.

Objectives:
- Create a scatter plot (with or without a regression line)
- Calculate the R-squared
- Display the intercept and coefficient(s)
- Use the model to predict the price of an apartment with a size of 750 sq. ft.

Note: In this exercise, the dependent variable is 'price', while the independent variable is 'size'.

In [1]:
# For this project, we will need NumPy, pandas, matplotlib and seaborn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

# and of course the actual regression (machine learning) module
from sklearn.linear_model import LinearRegression

# Loading the data

In [2]:
# We start by loading the data (adjust the path to wherever the CSV file is stored)
data = pd.read_csv('29588378-real-estate-price-size-year.csv')

# Let's explore the top 5 rows of the df
data.head()

|   | price | size | year |
|---|---|---|---|
| 0 | 234314.144 | 643.09 | 2015 |
| 1 | 228581.528 | 656.22 | 2009 |
| 2 | 281626.336 | 487.29 | 2018 |
| 3 | 401255.608 | 1504.75 | 2015 |
| 4 | 458674.256 | 1275.46 | 2009 |


# Creating the regression

## Declaring the dependent and independent variables

In [3]:
# There is a single independent variable: 'size'
x = data['size']

# and a single dependent variable: 'price'
y = data['price']

# Often it is useful to check the shapes of the features
x.shape

(100,)

In [4]:
y.shape


(100,)

In [5]:
# In order to feed x to sklearn, it should be a 2D array (a matrix)
# Therefore, we must reshape it 
# Note that this will not be needed when we've got more than 1 feature (as the inputs will be a 2D array by default)

# Passing -1 lets NumPy infer the number of rows (100 here), so the row count is not hard-coded
x_matrix = x.values.reshape(-1,1)

# Check the shape just in case
x_matrix.shape

(100, 1)
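
As a small illustration of the reshaping step, the sketch below compares the 1D column with two ways of getting a 2D input: `reshape(-1, 1)` as used above, and double-bracket selection, which returns a one-column DataFrame. The tiny `df` here is a hypothetical stand-in built from three of the 'size' values shown in `data.head()`; it is not the full dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the 'size' column (three values from data.head())
df = pd.DataFrame({"size": [643.09, 656.22, 487.29]})

x = df["size"]                        # 1D Series, shape (3,)
x_reshaped = x.values.reshape(-1, 1)  # 2D NumPy array, shape (3, 1)
x_frame = df[["size"]]                # 2D one-column DataFrame, shape (3, 1)

print(x.shape, x_reshaped.shape, x_frame.shape)
```

Both 2D forms hold the same values; the DataFrame version additionally carries the column name.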

# Regression

In [6]:
# We start by creating a linear regression object
reg = LinearRegression()

In [7]:
# The whole learning process boils down to fitting the regression
# Note that the first argument is the independent variable and the second is the dependent one (unlike with StatsModels)
reg.fit(x_matrix,y)

LinearRegression()

# R-Squared

In [8]:
# To get the R-squared in sklearn, we call the score() method
reg.score(x_matrix,y)

0.7447391865847586
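
To make the number above less of a black box, here is a minimal sketch that computes R-squared by hand as 1 - SS_res / SS_tot and compares it with `score()`. It runs on synthetic data generated only for illustration (the names `size`, `price` and `model` are assumptions, not the notebook's variables), so the value will differ from the one above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data for illustration only (not the real estate dataset)
rng = np.random.default_rng(0)
size = rng.uniform(400, 1600, 100).reshape(-1, 1)
price = 100_000 + 220 * size.ravel() + rng.normal(0, 40_000, 100)

model = LinearRegression().fit(size, price)
predicted = model.predict(size)

ss_res = np.sum((price - predicted) ** 2)     # residual sum of squares
ss_tot = np.sum((price - price.mean()) ** 2)  # total sum of squares
r_squared = 1 - ss_res / ss_tot

# The manual value should agree with the built-in score
print(np.isclose(r_squared, model.score(size, price)))
```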

# Coefficients

In [9]:
# Getting the coefficients of the regression
# Note that the output is an array, as we usually expect several coefficients
reg.coef_

array([223.17874259])

# Intercept

In [10]:
# Getting the intercept of the regression
# Note that the result is a float as we usually expect a single value
reg.intercept_

101912.60180122897
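
For a single feature, the coefficient and intercept have a simple closed form: slope = cov(x, y) / var(x) and intercept = mean(y) - slope * mean(x). The sketch below applies these formulas to synthetic data (an assumption for illustration, not the real estate dataset) and checks them against `np.polyfit`, which solves the same least-squares problem.

```python
import numpy as np

# Synthetic data for illustration only
rng = np.random.default_rng(1)
x = rng.uniform(400, 1600, 100)
y = 100_000 + 220 * x + rng.normal(0, 40_000, 100)

# Closed-form ordinary least squares for one feature
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

# np.polyfit with degree 1 returns [slope, intercept]
ref_slope, ref_intercept = np.polyfit(x, y, 1)
print(np.allclose([slope, intercept], [ref_slope, ref_intercept]))
```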

# Making Predictions

In [12]:
# There is a dedicated method should we want to predict values
# Note that the result is an array, as we can predict more than one value at a time
reg.predict([[750],[890]])

array([269296.65874718, 300541.68271043])
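
The predictions are just the fitted line evaluated at the new sizes: price = intercept + coefficient * size. As a quick sanity check, the sketch below redoes that arithmetic with the values copied from the `reg.intercept_` and `reg.coef_` outputs above. Because the coefficient is copied at its displayed precision, the results can differ from the array above in the last few decimals.

```python
# Values copied from the reg.intercept_ and reg.coef_ outputs above
intercept = 101912.60180122897
coef = 223.17874259

# price = intercept + coef * size for each apartment size
predictions = {size: intercept + coef * size for size in (750, 890)}

for size, price in predictions.items():
    print(size, round(price, 2))
```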

In [17]:
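# We can also pass a DataFrame as input
# (newer sklearn versions may warn about feature names, because reg was fitted on a NumPy array)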
new_data = pd.DataFrame(data=[[750],[890]],columns=['size'])
new_data

|   | size |
|---|---|
| 0 | 750 |
| 1 | 890 |


In [18]:
reg.predict(new_data)



array([269296.65874718, 300541.68271043])

In [19]:
new_data['predicted house cost']=reg.predict(new_data)
new_data



|   | size | predicted house cost |
|---|---|---|
| 0 | 750 | 269296.658747 |
| 1 | 890 | 300541.68271 |
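
# Scatter plot

The first objective, a scatter plot with a regression line, can be sketched as follows. `plot_regression` is a hypothetical helper introduced here for illustration, and the demo runs on synthetic stand-in data so that the block is self-contained; in the notebook you would pass `x_matrix`, `y` and `reg` instead.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression


def plot_regression(x_matrix, y, model):
    """Scatter the observations and overlay the fitted line (hypothetical helper)."""
    fig, ax = plt.subplots()
    ax.scatter(np.ravel(x_matrix), y)
    # Two points are enough to draw a straight line
    x_line = np.array([[np.min(x_matrix)], [np.max(x_matrix)]])
    ax.plot(x_line.ravel(), model.predict(x_line), color="orange", linewidth=3)
    ax.set_xlabel("Size", fontsize=20)
    ax.set_ylabel("Price", fontsize=20)
    return ax


# Synthetic stand-in data; in the notebook, use x_matrix, y and reg instead
rng = np.random.default_rng(2)
demo_x = rng.uniform(400, 1600, 100).reshape(-1, 1)
demo_y = 100_000 + 220 * demo_x.ravel() + rng.normal(0, 40_000, 100)
demo_model = LinearRegression().fit(demo_x, demo_y)

ax = plot_regression(demo_x, demo_y, demo_model)
```

In a notebook the figure renders automatically at the end of the cell; in a script, call `plt.show()` afterwards.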
