**Aim:** Write a program to implement Linear Regression using any appropriate dataset.


**Theory:** Linear regression is a supervised machine learning algorithm that models the target as a linear function of the inputs, so the predicted output is continuous. It is used to predict values within a continuous range (e.g. sales, price) rather than to classify them into categories (e.g. cat, dog). There are two main types:

Simple regression

Simple linear regression uses the traditional slope-intercept form, where m (the slope) and b (the intercept) are the parameters our algorithm will try to “learn” to produce the most accurate predictions. x represents our input data and y represents our prediction.

y=mx+b
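
To make the roles of m and b concrete, the following minimal sketch scores two candidate lines against a toy dataset using mean squared error. The helper names `predict` and `mean_squared_error` and the sample values are illustrative assumptions, not part of the notebook below.

```python
def predict(x, m, b):
    """Slope-intercept prediction: y = m*x + b."""
    return m * x + b


def mean_squared_error(xs, ys, m, b):
    """Average squared gap between predictions and observed values."""
    return sum((predict(x, m, b) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Toy data generated from y = 2x + 1 (illustrative values only)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

good_fit = mean_squared_error(xs, ys, m=2.0, b=1.0)  # the generating line
poor_fit = mean_squared_error(xs, ys, m=1.0, b=0.0)  # a worse candidate
print(good_fit, poor_fit)
```

"Learning" m and b means finding the pair that makes this error as small as possible.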

Multivariable regression

A more complex, multi-variable linear equation might look like this, where w represents the coefficients, or weights, our model will try to learn.

f(x, y, z) = w1·x + w2·y + w3·z

The variables x, y, z represent the attributes, or distinct pieces of information, we have about each observation. For sales predictions, these attributes might include a company’s advertising spend on radio, TV, and newspapers.

Sales = w1·Radio + w2·TV + w3·News
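
As a rough illustration of the multivariable form, a prediction is just a weighted sum of the attributes. The weights and spend figures below are hypothetical placeholders, and `predict_sales` is an assumed helper name; like the equation above, the sketch omits an intercept term.

```python
def predict_sales(weights, features):
    """Weighted sum w1*x1 + w2*x2 + ... (no intercept, as in the equation above)."""
    if len(weights) != len(features):
        raise ValueError("weights and features must have the same length")
    return sum(w * x for w, x in zip(weights, features))


# Hypothetical learned weights for radio, TV, and newspaper spend
weights = [0.05, 0.2, 0.01]
# Hypothetical advertising spend for one observation
spend = [100.0, 50.0, 200.0]

sales = predict_sales(weights, spend)
print(sales)
```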

**Code:**

Dataset used: **Wine Quality Dataset**<br>
Importing libraries

In [103]:
import pandas as pd
import numpy as np
import math

In [81]:
df=pd.read_csv('winequality-white.csv',sep=';')
df.head()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.0 | 0.27 | 0.36 | 20.7 | 0.045 | 45.0 | 170.0 | 1.001 | 3.0 | 0.45 | 8.8 | 6 |
| 1 | 6.3 | 0.3 | 0.34 | 1.6 | 0.049 | 14.0 | 132.0 | 0.994 | 3.3 | 0.49 | 9.5 | 6 |
| 2 | 8.1 | 0.28 | 0.4 | 6.9 | 0.05 | 30.0 | 97.0 | 0.9951 | 3.26 | 0.44 | 10.1 | 6 |
| 3 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.4 | 9.9 | 6 |
| 4 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.4 | 9.9 | 6 |


In [82]:
df['quality'].value_counts()

6    2198
5    1457
7     880
8     175
4     163
3      20
9       5
Name: quality, dtype: int64

In [83]:
df=df.sample(frac=1)
df

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4259 | 3.8 | 0.310 | 0.02 | 11.10 | 0.036 | 20.0 | 114.0 | 0.99248 | 3.75 | 0.44 | 12.4 | 6 |
| 1087 | 7.0 | 0.240 | 0.32 | 1.30 | 0.037 | 39.0 | 123.0 | 0.99200 | 3.17 | 0.42 | 11.2 | 8 |
| 1393 | 5.7 | 0.135 | 0.30 | 4.60 | 0.042 | 19.0 | 101.0 | 0.99460 | 3.31 | 0.42 | 9.3 | 6 |
| 2003 | 7.4 | 0.300 | 0.22 | 5.25 | 0.053 | 33.0 | 180.0 | 0.99260 | 3.13 | 0.45 | 11.6 | 6 |
| 4761 | 6.2 | 0.150 | 0.27 | 11.00 | 0.035 | 46.0 | 116.0 | 0.99602 | 3.12 | 0.38 | 9.1 | 6 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4856 | 7.1 | 0.230 | 0.39 | 13.70 | 0.058 | 26.0 | 172.0 | 0.99755 | 2.90 | 0.46 | 9.0 | 6 |
| 4839 | 5.2 | 0.405 | 0.15 | 1.45 | 0.038 | 10.0 | 44.0 | 0.99125 | 3.52 | 0.40 | 11.6 | 4 |
| 2057 | 6.7 | 0.310 | 0.31 | 4.90 | 0.031 | 20.0 | 151.0 | 0.99260 | 3.36 | 0.82 | 12.0 | 7 |
| 2726 | 5.3 | 0.200 | 0.31 | 3.60 | 0.036 | 22.0 | 91.0 | 0.99278 | 3.41 | 0.50 | 9.8 | 6 |


In [84]:
df['fixed acidity'].value_counts()

6.80     308
6.60     290
6.40     280
6.90     241
6.70     236
        ... 
4.50       1
14.20      1
11.80      1
3.90       1
6.45       1
Name: fixed acidity, Length: 68, dtype: int64

# Linear Regression implementation


In [85]:
def mean(values):
    return sum(values) / float(len(values))

In [86]:
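def variance(values, mean_value):
    # Sum of squared deviations, not divided by n: the 1/n factor
    # cancels against the unnormalized covariance() when computing b1.
    return sum((x - mean_value)**2 for x in values)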

In [87]:
def covariance(x, mean_x, y, mean_y):
    # Sum of products of deviations (also not divided by n)
    covar = 0.0
    for i in range(len(x)):
        covar += (x[i] - mean_x) * (y[i] - mean_y)
    return covar

In [88]:
def coefficients(dataset):
    x = [row[0] for row in dataset]   # predictor: first column (fixed acidity)
    y = [row[-1] for row in dataset]  # target: last column (quality)
    x_mean, y_mean = mean(x), mean(y)
    b1 = covariance(x, x_mean, y, y_mean) / variance(x, x_mean)
    b0 = y_mean - b1 * x_mean
    return [b0, b1]
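
The coefficients above follow the closed-form least-squares estimates b1 = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)² and b0 = ȳ - b1·x̄. As a sanity check, this sketch applies the same formulas to points that lie exactly on y = 2x + 1. `fit_line` is a hypothetical compact helper written for this check, not the notebook's function, and the data is made up.

```python
def fit_line(xs, ys):
    """Closed-form simple least squares; returns (intercept, slope)."""
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    # Unnormalized covariance and variance: the 1/n factors cancel in the ratio
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    if s_xx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    slope = s_xy / s_xx
    intercept = y_mean - slope * x_mean
    return intercept, slope


# Points lying exactly on y = 2x + 1 (illustrative values only)
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

b0, b1 = fit_line(xs, ys)
print(b0, b1)
```

On noise-free data the fit recovers the generating line, which is a cheap way to catch sign or indexing mistakes before running on the wine data.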

In [89]:
def simple_linear_regression(train, test):
    predictions = list()
    b0, b1 = coefficients(train)
    for row in test:
        yhat = b0 + b1 * row[0]
        predictions.append(yhat)
    return predictions

In [90]:
df=df.to_numpy()


In [91]:
df

array([[ 3.8  ,  0.31 ,  0.02 , ...,  0.44 , 12.4  ,  6.   ],
       [ 7.   ,  0.24 ,  0.32 , ...,  0.42 , 11.2  ,  8.   ],
       [ 5.7  ,  0.135,  0.3  , ...,  0.42 ,  9.3  ,  6.   ],
       ...,
       [ 6.7  ,  0.31 ,  0.31 , ...,  0.82 , 12.   ,  7.   ],
       [ 5.3  ,  0.2  ,  0.31 , ...,  0.5  ,  9.8  ,  6.   ],
       [ 6.8  ,  0.21 ,  0.62 , ...,  0.59 , 10.2  ,  5.   ]])

In [92]:
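split = int(0.8 * len(df))  # 80/20 split; the rows were shuffled earlier
x_train = df[:split]  # each row still includes the target (quality) in its last column
x_test = df[split:]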

In [93]:
print('Training Length: {}'.format(len(x_train)))
print('Test Length: {}'.format(len(x_test)))

Training Length: 3918
Test Length: 980
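
Because the shuffle above is unseeded, the rows landing in each partition change on every run. A minimal sketch of a reproducible alternative using only the standard library follows; `shuffled_split` and its default seed are assumptions for illustration, and the integers stand in for dataset rows.

```python
import random


def shuffled_split(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of rows with a fixed seed, then cut it into train/test parts."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # private RNG, so global random state is untouched
    cut = int(train_fraction * len(rows))
    return rows[:cut], rows[cut:]


# 4898 placeholder rows, matching the dataset size reported above
train_rows, test_rows = shuffled_split(range(4898))
print(len(train_rows), len(test_rows))  # 3918 980
```

With pandas, passing `random_state` to `df.sample` has a similar effect.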


In [94]:
predictions=simple_linear_regression(x_train, x_test)

In [96]:
predictions

[5.81910243133131,
 5.7495694501344765,
 5.737980619935004,
 6.0045237145228665,
 5.7495694501344765,
 5.981346054123922,
 5.911813072927088,
 5.981346054123922,
 5.807513601131838,
 6.0045237145228665,
 5.807513601131838,
 5.911813072927088,
 5.865457752129199,
 5.9002242427276155,
 6.0277013749218105,
 5.888635412528144,
 5.830691261530783,
 5.8770465823286715,
 5.911813072927088,
 5.865457752129199,
 6.016112544722338,
 5.853868921929727,
 5.830691261530783,
 5.888635412528144,
 5.853868921929727,
 5.795924770932365,
 5.830691261530783,
 6.016112544722338,
 5.992934884323394,
 5.958168393724977,
 5.92340190312656,
 5.92340190312656,
 5.795924770932365,
 5.981346054123922,
 5.934990733326033,
 5.934990733326033,
 5.934990733326033,
 5.81910243133131,
 5.9002242427276155,
 5.737980619935004,
 5.969757223924449,
 5.8770465823286715,
 5.958168393724977,
 5.958168393724977,
 6.016112544722338,
 5.772747110533421,
 5.981346054123922,
 5.888635412528144,
 5.958168393724977,
 5.830691261530

In [100]:
actual=x_test[:,-1]
actual

array([5., 6., 6., 3., 6., 5., 6., 6., 6., 7., 5., 7., 4., 6., 7., 7., 7.,
       7., 6., 7., 6., 7., 5., 5., 7., 5., 6., 6., 5., 4., 6., 8., 5., 5.,
       5., 5., 6., 6., 5., 6., 5., 6., 5., 6., 6., 6., 5., 6., 5., 6., 5.,
       5., 5., 5., 6., 6., 6., 6., 6., 6., 7., 6., 6., 6., 6., 6., 6., 5.,
       6., 6., 5., 7., 6., 5., 6., 5., 4., 6., 5., 7., 6., 6., 5., 6., 6.,
       7., 6., 5., 7., 8., 6., 4., 6., 7., 6., 6., 7., 6., 5., 5., 7., 7.,
       7., 7., 6., 7., 6., 5., 5., 7., 5., 5., 5., 5., 5., 7., 7., 6., 5.,
       6., 7., 5., 6., 7., 7., 5., 7., 7., 7., 6., 6., 5., 7., 6., 5., 5.,
       5., 6., 4., 6., 6., 5., 7., 7., 6., 6., 7., 5., 5., 5., 6., 6., 6.,
       7., 6., 7., 6., 6., 7., 6., 7., 7., 6., 5., 6., 6., 6., 6., 6., 6.,
       5., 7., 6., 8., 5., 5., 5., 5., 6., 6., 5., 7., 6., 6., 7., 5., 5.,
       5., 6., 7., 5., 6., 5., 7., 5., 6., 5., 6., 6., 6., 5., 6., 6., 6.,
       7., 7., 5., 4., 5., 7., 6., 6., 5., 6., 5., 6., 4., 5., 6., 6., 8.,
       5., 6., 7., 5., 6.

In [105]:
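squared_error = 0.0  # not named `sum`, which would shadow the built-in used by mean()
for i in range(len(actual)):
    squared_error += (actual[i] - predictions[i])**2

ans = math.sqrt(squared_error / len(actual))

print("RMSE : {}".format(ans))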

RMSE : 0.877032274133051
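
An RMSE value is easier to interpret next to a naive baseline that always predicts the mean of the training targets. The sketch below shows one way to set up that comparison. The helper names `rmse` and `mean_baseline` are assumptions, and the toy numbers are made up rather than taken from the wine data.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mean_baseline(train_targets, n_predictions):
    """Predict the training-set mean for every test row."""
    avg = sum(train_targets) / len(train_targets)
    return [avg] * n_predictions


# Toy values for illustration only
toy_train_targets = [5.0, 6.0, 6.0, 7.0]
toy_actual = [5.0, 6.0, 7.0]
toy_model_predictions = [5.5, 6.0, 6.5]

baseline_rmse = rmse(toy_actual, mean_baseline(toy_train_targets, len(toy_actual)))
model_rmse = rmse(toy_actual, toy_model_predictions)
print(baseline_rmse, model_rmse)
```

If a fitted model's RMSE is not clearly below the baseline's, the chosen predictor carries little information about the target.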
