# Understanding Linear Regression

Before looking at what linear regression is, let us first learn some key terms that will be used extensively throughout this activity.

1. Target: It is the feature that we want to predict.
2. Error: It is the difference between the actual and the predicted value. It is also referred to as a residual.
3. Cost Function: It defines the total error of the model. It is this function that the model tries to minimize. In the case of Linear Regression, it is the mean of the squared errors over all the records in the dataset, popularly known as the **mean squared error**.
4. Gradient Descent: It is an iterative algorithm that tries to find the set of coefficients which minimizes the Cost Function.
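
To make the terms above concrete, here is a minimal sketch that computes the errors and the mean squared error by hand. The four actual and predicted prices are made-up numbers used only for illustration; they do not come from the dataset used later.

```python
# Hypothetical actual and predicted prices (in thousands) for four houses
actual = [200.0, 150.0, 320.0, 275.0]
predicted = [210.0, 140.0, 300.0, 275.0]

# Error (residual) of each record: actual value minus predicted value
errors = [a - p for a, p in zip(actual, predicted)]

# Cost function: the mean of the squared errors
mse = sum(e ** 2 for e in errors) / len(errors)

print("Errors:", errors)
print("Mean squared error:", mse)
```

Squaring keeps positive and negative errors from cancelling each other out, which is why the cost uses squared errors rather than the raw ones.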

---

Linear Regression tries to fit a straight line to the data. The line it looks for is the one which minimizes the cost function or, put differently, explains the maximum variance of the target.

The equation of a straight line is **y = mx + c**, where **m** is the slope (the coefficient of **x**) and **c** is the intercept. This equation holds only when we have one independent feature. As the number of features increases, the number of x's increases and so does the number of slopes.
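
As a rough illustration of how the equation grows with the number of features, the sketch below evaluates it for one and for two features. The helper `predict` and all the numbers in it are hypothetical and exist only for this example.

```python
def predict(features, slopes, intercept):
    """Return m1*x1 + m2*x2 + ... + c for a single record."""
    return sum(m * x for m, x in zip(slopes, features)) + intercept

# One feature: y = 2*x + 1
one_feature = predict([3.0], slopes=[2.0], intercept=1.0)

# Two features: y = 2*x1 + 0.5*x2 + 1 (one slope per feature, still one intercept)
two_features = predict([3.0, 4.0], slopes=[2.0, 0.5], intercept=1.0)

print(one_feature, two_features)
```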

The whole process of Linear Regression can be summarized as:
1. Randomly initialise the slopes/coefficients.
2. Use Gradient Descent:

>Calculate the cost.<br>
>Calculate the slope (derivative) of the cost function.<br>
>Move in the direction in which the cost function decreases.<br>
>Update the parameters.<br>
>Repeat the above steps until the minimum value of the cost function is found.
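
The steps above can be sketched in plain Python for a single feature. This is only an illustration under stated assumptions: the data points, the learning rate of 0.05 and the fixed count of 2000 iterations are all made up for the example, and a seeded random generator is used so the run is repeatable.

```python
import random

# Hypothetical data generated from the line y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
n = len(xs)

def cost(m, c):
    """Mean squared error of the line y = m*x + c on the data."""
    return sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys)) / n

# Step 1: randomly initialise the slope and the intercept
rng = random.Random(0)
m, c = rng.uniform(-1, 1), rng.uniform(-1, 1)
initial_cost = cost(m, c)

# Step 2: gradient descent
learning_rate = 0.05  # assumed step size
for _ in range(2000):  # assumed number of iterations
    # Derivatives of the cost with respect to m and c
    grad_m = sum(-2 * x * (y - (m * x + c)) for x, y in zip(xs, ys)) / n
    grad_c = sum(-2 * (y - (m * x + c)) for x, y in zip(xs, ys)) / n
    # Move against the gradient, i.e. towards lower cost
    m -= learning_rate * grad_m
    c -= learning_rate * grad_c

final_cost = cost(m, c)
print("slope:", round(m, 4), "intercept:", round(c, 4))
```

A practical implementation would usually stop once the cost stops improving instead of running a fixed number of iterations.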

This was just a quick recap of what Linear Regression does. If you want to study this algorithm in detail, I have added links to further material under **Additional Resources** at the end of the activity, so you can have a look at those as well.

For now, enough of theory. Let's begin with implementing Linear Regression and see what it can do.

---

## Importing Libraries

We start by importing some libraries which will be required in the future.

In [None]:
import pandas as pd #to handle the dataset
import matplotlib.pyplot as plt #to draw plots
import seaborn as sns #built on matplotlib, draws more visually appealing plots
sns.set()

## Loading the Data

Tasks:
* read the csv file from the path: '../../data/data_cleaned.csv' using pandas.
* look at top 5 rows of the dataset.

In [None]:
# read the file
#display top 5 rows

<details>
<summary>Solution</summary>
<p>
    
```python
data = pd.read_csv('../../data/data_cleaned.csv')
data.head()
```
    
</p>
</details>

Here, we have to predict the price of a house based on the other features. Hence, price will be our target or dependent feature, whereas the others will be the independent features.

Now, look at the shape of the data using **shape** attribute of the pandas DataFrame.

In [None]:
#have a look at the shape of data

<details>
<summary>Solution</summary>
<p>
    
```python
data.shape
```
    
</p>
</details>

We can see that our data has **4600 records and 7 columns**.

## Data Analysis

Data analysis is usually done beforehand to see if there is anything in the data which can affect the performance of the model. Here, we will draw a pairplot using the seaborn library to visualize the whole dataset at once. A pairplot plots every pair of features present in the data against each other. This way we get a compact yet broad view of the data.

* Call the *pairplot* function from the seaborn library for the whole dataset.

In [None]:
#plot a pairplot

<details>
<summary>Solution</summary>
<p>
    
```python
sns.pairplot(data)
```
    
</p>
</details>

There are some conclusions that can be drawn from the above pairplot. These are:
1. In the last row, we can see 2-3 points that are unusually higher than the rest of the data. These are typically known as **outliers**. But for the sake of simplicity, we will not go into the details of handling them.
2. We can see a linear relationship between sqft_living and bathrooms, which seems logical.
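
A relationship that looks linear in a plot can also be checked numerically with a correlation coefficient. The sketch below does this on a small made-up DataFrame named `toy`; the values are hypothetical, and on the real dataset the same call would be made on `data` instead.

```python
import pandas as pd

# Hypothetical mini-dataset; the real values live in data_cleaned.csv
toy = pd.DataFrame({
    "sqft_living": [800, 1200, 1500, 2000, 2600, 3100],
    "bathrooms": [1.0, 1.5, 2.0, 2.5, 3.0, 3.5],
})

# Pearson correlation: values close to 1 suggest a strong positive linear relationship
corr = toy["sqft_living"].corr(toy["bathrooms"])
print("Correlation:", round(corr, 3))
```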

## Train and Test Splits

Now, we will divide our data into train and test splits. We will train our model on the train dataset and then test our model's performance on the test dataset. But first, let us separate our dependent variable from the independent variables.
* store all the independent variables in a variable **X**.
* store the dependent variable in a variable **y**.

In [None]:
#store independent variables in X
#store dependent variable in y

<details>
<summary>Solution</summary>
<p>
    
```python
X = data.drop('price', axis = 1) #independent features
y = data.price #dependent feature
```
    
</p>
</details>

Now, split them into training and test datasets in the ratio of 8:2 using the **train_test_split** function. There should be 4 variables, namely **X_train, X_test, y_train and y_test**.

In [None]:
#import the function
from sklearn.model_selection import train_test_split

#call the train_test_split function

<details>
<summary>Solution</summary>
<p>
    
```python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.2, random_state= 42, shuffle= True)
```
    
</p>
</details>
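
To see what `test_size` and `random_state` do, here is a small sketch on made-up data with 10 records. The arrays `X_toy` and `y_toy` are hypothetical and are not part of the activity's dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 records with 2 features each
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# test_size=0.2 sends 20% of the records (2 of 10) to the test split
split_a = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)

# The same random_state gives exactly the same split on every run
split_b = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print("Reproducible:", same_split)
```

Fixing `random_state` is useful in an activity like this one because everyone who runs the notebook gets the same train and test records.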


The shapes of these variables can be checked as follows:

In [None]:
print("X_train: ", X_train.shape)
print('y_train: ', y_train.shape)
print('X_test: ', X_test.shape)
print('y_test: ', y_test.shape)

## Fitting a Model

* Initialize and fit the model to training data

In [None]:
#import the model
from sklearn.linear_model import LinearRegression

#initialize the model

#fit the model to the training data

<details>
<summary>Solution</summary>
<p>
    
```python
model = LinearRegression()
model.fit(X_train, y_train)
```
    
</p>
</details>
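
After fitting, the learned slopes and intercept can be read from the model's `coef_` and `intercept_` attributes. The sketch below fits a separate model, `toy_model`, on made-up points that lie exactly on the line y = 2x + 1, so the recovered values are easy to check. Note that, as far as I know, scikit-learn's `LinearRegression` solves the least squares problem directly instead of iterating with gradient descent, but the fitted line minimizes the same cost.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data lying exactly on the line y = 2x + 1
x_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y_toy = 2 * x_toy.ravel() + 1

toy_model = LinearRegression()
toy_model.fit(x_toy, y_toy)

slope = toy_model.coef_[0]         # one coefficient per feature
intercept = toy_model.intercept_   # the constant term c
print("slope:", round(slope, 4), "intercept:", round(intercept, 4))
```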

## Evaluate a model

Now, evaluate the model by calculating the mean squared error of its predictions on:
* the training data
* the test data

In [None]:
#import the evaluation metric
from sklearn.metrics import mean_squared_error

#calculate error on training data

#calculate error on testing data

#print the errors

<details>
<summary>Solution</summary>
<p>
    
```python
train_error = mean_squared_error(y_train, model.predict(X_train))
test_error = mean_squared_error(y_test, model.predict(X_test))
print('Training Error: ', train_error)
print('Testing Error: ', test_error)
```
    
</p>
</details>
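
The mean squared error is expressed in squared units of the target, which makes it hard to read as a price. One common way to bring it back to the target's units is to take its square root. The sketch below shows this on three made-up prices, which are hypothetical and unrelated to the dataset.

```python
import math
from sklearn.metrics import mean_squared_error

# Hypothetical actual and predicted prices for three houses
actual = [200000.0, 150000.0, 320000.0]
predicted = [210000.0, 140000.0, 300000.0]

mse = mean_squared_error(actual, predicted)  # in squared price units
rmse = math.sqrt(mse)                        # back in price units

print("MSE: ", mse)
print("RMSE:", round(rmse, 2))
```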

## Make Predictions

In [None]:
bedrooms = int(input("Bedrooms: "))
bathrooms = float(input('Bathrooms: '))
sqft_living = float(input('sqft_living: '))
sqft_lot = float(input('sqft_lot: '))
floors = float(input('Floors: '))
condition = int(input('Condition: '))

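#build a one-row DataFrame so that the inputs carry the training column names
#(assumes the columns of X_train are in the same order as the inputs above)
features = pd.DataFrame([[bedrooms, bathrooms, sqft_living, sqft_lot, floors, condition]], columns=X_train.columns)
predicted_price = model.predict(features)

print("Expected Price: ", predicted_price[0])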

**Congratulations!!! You have just completed your first activity and are now ready to dive deeper into the concepts of regression.**
But, before moving on, have a look at the resources in the next section to have a more solid understanding of the theoretical concepts behind Linear Regression.

## Additional Resources

1. Linear Regression Concepts: https://www.youtube.com/watch?v=nk2CQITm_eo&list=PLblh5JKOoLUICTaGLRoHQDuF_7q2GfuJF&index=9
2. Data Analysis Introduction: https://towardsdatascience.com/exploratory-data-analysis-8fc1cb20fd15
3. In-Depth Data Analysis: https://www.analyticsvidhya.com/blog/2016/01/guide-data-exploration/