In [None]:
Linear regression is used to estimate real values (cost of houses, number of calls, total sales, etc.) based on continuous variable(s). 
Here, we establish a relationship between the independent and dependent variables by fitting a best-fit line. 
This best-fit line is known as the regression line and is represented by the linear equation Y = a * X + b.

The best way to understand linear regression is to relive a childhood experience. 
Say you ask a child in fifth grade to arrange the people in their class in increasing order of weight, 
without asking them their weights. What do you think the child will do?
The child would likely look at (visually analyze) the height and build of each person and arrange them using a 
combination of these visible parameters. 
This is linear regression in real life! 
The child has figured out that height and build are correlated with weight by a relationship 
that looks like the equation above.

In this equation: 

X – Independent variable
Y – Dependent variable
a – Slope
b – Intercept

The coefficients a and b are obtained by minimizing the sum of the squared vertical distances (residuals) 
between the data points and the regression line.
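
As a minimal sketch of that minimization for a single independent variable, the standard closed-form least-squares formulas can be written with NumPy. The toy data and the helper name `fit_line` are assumptions made for illustration; they do not come from the text above.

```python
import numpy as np

# Hypothetical toy data (illustration only): the points lie exactly on Y = 2*X + 1.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])


def fit_line(x, y):
    """Return slope a and intercept b that minimize sum((y - (a*x + b))**2)."""
    x_mean, y_mean = x.mean(), y.mean()
    # Slope: co-variation of x and y divided by the variation of x.
    a = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # Intercept: the fitted line passes through the point of means.
    b = y_mean - a * x_mean
    return a, b


a, b = fit_line(x, y)

# Sum of squared differences between the data points and the fitted line.
sse = np.sum((y - (a * x + b)) ** 2)
print(f"a = {a:.3f}, b = {b:.3f}, SSE = {sse:.3f}")
```

Because the toy points lie exactly on a line, the recovered slope and intercept match the generating values and the sum of squared differences is zero; with noisy data the same formulas give the line with the smallest possible sum.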

Linear regression is mainly of two types: 
    Simple Linear Regression and Multiple Linear Regression. 
    Simple Linear Regression is characterized by one independent variable. 
    Multiple Linear Regression (as the name suggests) is characterized by multiple (more than one) independent variables. 
    While finding the best fit, you can also fit a curve instead of a straight line; 
    this is known as polynomial or curvilinear regression.
    
    NB: Linear regression is both a statistical model and a supervised machine learning algorithm.
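
The difference between multiple and polynomial regression can be sketched on synthetic data. The data and variable names below are assumptions made for illustration. The polynomial case is handled here by expanding the input with scikit-learn's `PolynomialFeatures` and then fitting an ordinary `LinearRegression`, which is one common approach and not the only one.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Multiple linear regression: two independent variables (synthetic, noise-free).
X_multi = rng.uniform(0, 10, size=(50, 2))
y_multi = 3.0 * X_multi[:, 0] - 2.0 * X_multi[:, 1] + 5.0
multi_model = LinearRegression().fit(X_multi, y_multi)
print("multiple  :", multi_model.coef_, multi_model.intercept_)

# Polynomial regression: expand x into the columns [x, x^2],
# then fit a linear model on the expanded features.
x = np.linspace(-3, 3, 30).reshape(-1, 1)
y_poly = 0.5 * x[:, 0] ** 2 - x[:, 0] + 2.0
x_expanded = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x)
poly_model = LinearRegression().fit(x_expanded, y_poly)
print("polynomial:", poly_model.coef_, poly_model.intercept_)
```

In both cases the model stays linear in its coefficients; only the set of input columns changes.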

In [None]:
'''
The following code demonstrates Linear Regression
Created by: ANALYTICS VIDHYA
'''

# importing required libraries
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# read the train and test dataset
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')

print(train_data.head())

# shape of the dataset
print('\nShape of training data :',train_data.shape)
print('\nShape of testing data :',test_data.shape)

# Now, we need to predict the target variable for the test data
# target variable - Item_Outlet_Sales

# separate the independent and target variables on the training data
train_x = train_data.drop(columns=['Item_Outlet_Sales'])
train_y = train_data['Item_Outlet_Sales']

# separate the independent and target variables on the test data
test_x = test_data.drop(columns=['Item_Outlet_Sales'])
test_y = test_data['Item_Outlet_Sales']

'''
Create the object of the Linear Regression model
You can also add other parameters and test your code here
Some parameters are : fit_intercept and positive
(the older normalize parameter was removed in scikit-learn 1.2)
Documentation of sklearn LinearRegression: 

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

'''
model = LinearRegression()

# fit the model with the training data
model.fit(train_x,train_y)

# coefficients of the trained model
print('\n Coefficient of model :', model.coef_)

# intercept of the model
print('\n Intercept of model :', model.intercept_)

# predict the target on the training dataset
predict_train = model.predict(train_x)
print('\n Item_Outlet_Sales on training data', predict_train)

# Root Mean Squared Error on training dataset
rmse_train = mean_squared_error(train_y,predict_train)**(0.5)
print('\n RMSE on train dataset : ', rmse_train)

# predict the target on the testing dataset
predict_test = model.predict(test_x)
print('\n Item_Outlet_Sales on test data',predict_test) 

# Root Mean Squared Error on testing dataset
rmse_test = mean_squared_error(test_y,predict_test)**(0.5)
print('\n RMSE on test dataset : ', rmse_test)
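
To see what the `fit_intercept` parameter mentioned above changes, here is a small sketch on synthetic data. The data is an assumption made for illustration and is unrelated to `train.csv`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (illustration only): the points lie exactly on y = 2*x + 10.
X = np.arange(1, 11, dtype=float).reshape(-1, 1)
y = 2.0 * X[:, 0] + 10.0

# Default behaviour: both the slope and the intercept are estimated.
with_intercept = LinearRegression(fit_intercept=True).fit(X, y)

# fit_intercept=False forces the fitted line through the origin.
through_origin = LinearRegression(fit_intercept=False).fit(X, y)

print("fit_intercept=True :", with_intercept.coef_, with_intercept.intercept_)
print("fit_intercept=False:", through_origin.coef_, through_origin.intercept_)
```

With the intercept disabled, the slope has to absorb the offset in the data, so the estimated coefficient moves away from 2. This is usually only appropriate when the data is already centered or the relationship is known to pass through zero.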

In [None]:
#Linear Regression using pima-indians-diabetes dataset
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression #used to perform linear and polynomial regression and make predictions accordingly.
from sklearn.metrics import mean_squared_error

column_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']

pima = pd.read_csv('pima-indians-diabetes.csv', header=None, names=column_names)

# split dataset into features and target variable
feature_columns = ['pregnant', 'insulin', 'bmi', 'age', 'glucose', 'bp', 'pedigree']
X = pima[feature_columns] # Features
y = pima.label # Target variable (a binary 0/1 label, so predictions are continuous scores, not classes)

# split X and y into training and testing sets: 75% train data, 25% test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)


# create a model & fit it
# (equivalent to: model = LinearRegression(); model.fit(X_train, y_train))
model = LinearRegression().fit(X_train, y_train)

print(pima.head())

# shape of the dataset
print('\n Shape of dataset :', pima.shape)

# coefficients of the trained model
print('\n Model Coefficient :', model.coef_, sep='\n')
# intercept of the model
print('\n Model Intercept: ', model.intercept_)

# predict the target on the train dataset
predict_train = model.predict(X_train)
print('\n Prediction on Training data \n', predict_train, sep='\n')

# predict the target on the test dataset
predict_test = model.predict(X_test)
print('\n Prediction on Test data \n', predict_test, sep='\n')

# Root Mean Squared Error on training dataset
rmse_train = mean_squared_error(y_train,predict_train)**(0.5)
print('\n RMSE on train dataset : ', rmse_train)

# Root Mean Squared Error on testing dataset
rmse_test = mean_squared_error(y_test,predict_test)**(0.5)
print('\n RMSE on test dataset :', rmse_test)
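
The RMSE values printed above come from taking the square root of `mean_squared_error`. The sketch below computes the same quantity by hand on a few hypothetical numbers (an assumption made for illustration) to show what the metric measures.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical true values and predictions, chosen for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

# RMSE by hand: square the errors, average them, take the square root.
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

# The same quantity, computed the way the cells above do it.
rmse_sklearn = mean_squared_error(y_true, y_pred) ** 0.5

print("manual  RMSE:", rmse_manual)
print("sklearn RMSE:", rmse_sklearn)
```

RMSE is expressed in the same units as the target variable, which makes it easier to interpret than the raw mean squared error.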