# IMPLEMENTATION OF SIMPLE LINEAR REGRESSION

## SIMPLE LINEAR REGRESSION

Simple linear regression is a statistical method for finding the relationship between two variables and making predictions. The independent variable, which is used to predict the dependent variable, is denoted x. The dependent variable, i.e. the outcome or output, is denoted y.

**PROBLEM STATEMENT**


Analysing the relationship between 'TV advertising' and 'sales' using a simple linear regression model.

In [1]:
#Setting up the Kaggle API credentials
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/

The syntax of the command is incorrect.
'cp' is not recognized as an internal or external command,
operable program or batch file.


In [2]:
!kaggle datasets download -d ashydv/advertising-dataset

'kaggle' is not recognized as an internal or external command,
operable program or batch file.


In [3]:
# Unzipping the zipped dataset file
import zipfile
zip_ref = zipfile.ZipFile('/content/advertising-dataset.zip')
zip_ref.extractall('/content')
zip_ref.close()

FileNotFoundError: [Errno 2] No such file or directory: '/content/advertising-dataset.zip'

In [None]:
# Importing the Libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

In [None]:
# Loading the dataset
df = pd.read_csv('advertising.csv')
data = df  # later cells refer to the same DataFrame as `data`
df.head()

**Understanding the Data**


In [None]:
df.shape

**Data Insights**

Our data has 200 rows and 4 columns.

In [None]:
df.info()

**Data Insights**


*   None of the columns contains null entries.

*   All the columns are of numerical type.



In [None]:
df.describe()

**Data Insights**
`describe()` summarises each numerical column, including:
* Mean values
* Standard Deviation
* Minimum Values
* Maximum Values

## Data Cleaning

### Checking for Null Values

In [None]:
data.isnull().sum()

Data Insights
* As stated earlier there are no null values in our dataset.

### Outlier Analysis

In [None]:
fig, axes = plt.subplots(3, figsize=(5, 5))
sns.set(font_scale=.8)
box_color = sns.color_palette('Set2')[0]
# One horizontal box plot per advertising channel
sns.boxplot(data=data, x='TV', ax=axes[0], color=box_color)

sns.boxplot(data=data, x='Radio', ax=axes[1], color=box_color)

sns.boxplot(data=data, x='Newspaper', ax=axes[2], color=box_color)

plt.tight_layout()
plt.show()

Data Insights:
* TV budgets are spread over a wide range.
* Radio budgets are also widely spread.
* Newspaper budgets vary less than TV and Radio budgets.

In [None]:
sns.boxplot(data=data, x='Sales', color=sns.color_palette('Set2')[0])
plt.title('Box Plot of Sales')
plt.show()

### BIVARIATE ANALYSIS USING SCATTER PLOT

In [None]:
plt.figure(figsize = (17,9))
plt.title("Relationship between the TV budget and Sales")
sns.scatterplot(data=data, x='TV', y='Sales')
plt.show()

#### Pairplot

In [None]:
sns.pairplot(data, x_vars=['TV', 'Newspaper', 'Radio'], y_vars='Sales', height=4, aspect=1, kind='scatter')
plt.show()

### UNIVARIATE ANALYSIS USING HISTOGRAM

In [None]:
sns.set(rc={'figure.figsize':(11.7,8.27)})
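# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the replacement
sns.histplot(data['Sales'], bins=30, kde=True)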
plt.show()

**Data Insights**
* This plot shows the distribution of the Sales column, whose values are approximately normally distributed.

### CHECKING CORRELATION


In [None]:
sns.heatmap(data.corr(), cmap="YlGnBu", annot = True)
plt.show()

Data Insights:
* TV is the best choice of feature variable, as it is more strongly correlated with Sales than Radio or Newspaper.
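The idea of ranking candidate features by their correlation with the target can be sketched without the heatmap. The snippet below is only an illustration on synthetic data: the arrays `tv`, `unrelated` and `sales` are made up here and are not the advertising dataset.

```python
import numpy as np

# Synthetic data, generated purely for illustration (NOT the advertising dataset)
rng = np.random.default_rng(0)
tv = rng.uniform(0, 300, size=200)            # feature that drives the target
unrelated = rng.uniform(0, 100, size=200)     # feature with no effect on the target
sales = 7.0 + 0.05 * tv + rng.normal(0, 1.0, size=200)

# Pearson correlation of each candidate feature with the target
r_tv = np.corrcoef(tv, sales)[0, 1]
r_unrelated = np.corrcoef(unrelated, sales)[0, 1]

print("corr(tv, sales)        =", r_tv)
print("corr(unrelated, sales) =", r_unrelated)
```

The feature whose correlation is largest in absolute value is the natural first candidate for a single-feature model.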

## MODEL BUILDING

**PERFORMING SIMPLE LINEAR REGRESSION**

We have,

    y = m*x + c

* x is the independent variable (TV)
* y is the dependent variable (Sales)

Therefore we can write,


    Sales = m*TV + c

* m is the model coefficient (the slope of the line).
* c is the intercept (the predicted Sales when TV is 0).
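For a single feature, the least-squares values of m and c have a closed form: m is the covariance of x and y divided by the variance of x, and c follows from the means. The sketch below shows that calculation with NumPy; `fit_line` is a hypothetical helper name and the toy data are made up, so this is an illustration of the formula rather than what scikit-learn runs internally.

```python
import numpy as np

def fit_line(x, y):
    """Return (m, c) minimising the sum of squared residuals for y ~ m*x + c."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # slope = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean)^2)
    m = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # the fitted line always passes through the point (x_mean, y_mean)
    c = y_mean - m * x_mean
    return m, c

# Toy data (made up) lying exactly on y = 2x + 1
x_demo = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_demo = 2.0 * x_demo + 1.0
m_demo, c_demo = fit_line(x_demo, y_demo)
print(m_demo, c_demo)
```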

### Preparing X and Y

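* scikit-learn estimators expect the feature matrix X to be two-dimensional (n_samples x n_features) and the response y to be one-dimensional.
* Both can be NumPy arrays or pandas objects, since pandas is built on NumPy. A single column selected as a Series is one-dimensional, so it has to be reshaped before fitting.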

In [None]:
X = data['TV']
X.head()

In [None]:
y = data['Sales']
y.head()

### Splitting Data into Training and Testing Sets

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, random_state=0)

In [None]:
print(type(X_train))
print(type(X_test))
print(type(y_train))
print(type(y_test))

Data Insights
* The generated training and testing sets are pandas Series.

In [None]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

Data Insights:
* The training data for TV and Sales has 150 observations.
* The testing data for TV and sales has 50 observations.
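The 150/50 split follows directly from taking 75% of 200 rows. As a rough illustration of what a shuffled split does, the sketch below shuffles row indices and cuts them at the 75% mark. `split_indices` is a hypothetical helper written for this note; it is not scikit-learn's implementation and will not select the same rows as `train_test_split`.

```python
import numpy as np

def split_indices(n_samples, train_fraction=0.75, seed=0):
    """Shuffle the row indices and cut them into a training and a testing part."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    n_train = int(round(n_samples * train_fraction))
    return shuffled[:n_train], shuffled[n_train:]

# 200 rows, as in the dataset described above
train_idx, test_idx = split_indices(200)
print(len(train_idx), len(test_idx))
```

Every row lands in exactly one of the two parts, so the test set is never seen during fitting.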

In [None]:
X_train

In [None]:
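import numpy as np
# Convert each Series to a NumPy array and reshape it to (n_samples, 1);
# indexing a Series with [:, np.newaxis] is not supported by recent pandas versions
X_train = X_train.to_numpy().reshape(-1, 1)
X_test = X_test.to_numpy().reshape(-1, 1)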

Here we transform the input data into a 2D array in which each row represents a sample and each column represents a feature. We have only one feature, but scikit-learn's convention is that the input (independent variable) is always a 2D array, so the single column is reshaped to shape (n_samples, 1).
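The effect of the reshape is easiest to see on a tiny array. The values below are made up for illustration; the point is only the change in shape from `(n,)` to `(n, 1)`.

```python
import numpy as np

# Made-up values standing in for a single feature column
tv_1d = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# -1 lets NumPy infer the number of rows; 1 is the number of feature columns
tv_2d = tv_1d.reshape(-1, 1)

# Adding a new axis on a NumPy array gives the same result
same_2d = tv_1d[:, np.newaxis]

print(tv_1d.shape)
print(tv_2d.shape)
```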


In [None]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

In [None]:
print(type(X_train))
print(type(X_test))
print(type(y_train))
print(type(y_test))

Data Insights:
* X_train has become an array of 150 rows and 1 feature column, whereas X_test has 50 rows and 1 feature column.
* y_train and y_test are still one-dimensional Series, of 150 and 50 elements respectively, holding the target variable Sales associated with X_train and X_test.

### Performing Linear Regression

In [None]:
# importing LinearRegression from sklearn
from sklearn.linear_model import LinearRegression

# Representing LinearRegression as lr(Creating LinearRegression Object)
lr = LinearRegression()

# Fit the model using lr.fit()
lr.fit(X_train, y_train)

In [None]:
lr.score(X_train, y_train)

Calculating the Coefficients of the Simple Linear Regression Model

In [None]:
print("The Intercept is", lr.intercept_)
print("The Slope of the regression equation is ",lr.coef_)

Therefore the simple linear regression equation can be represented as,


      Sales = 7.13 + 0.055 * TV
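Once the intercept and slope are known, a prediction is a single line of arithmetic. The sketch below assumes the rounded values 7.13 (intercept) and 0.055 (slope) quoted in this section, combined in the additive form Sales = m*TV + c. `predict_sales` is a hypothetical helper, and its results differ slightly from `lr.predict` because of the rounding.

```python
INTERCEPT = 7.13   # rounded intercept (c) quoted in the text
SLOPE = 0.055      # rounded TV coefficient (m) quoted in the text

def predict_sales(tv_budget):
    """Predicted Sales for a given TV budget, using Sales = m*TV + c."""
    return INTERCEPT + SLOPE * tv_budget

# Each extra unit of TV budget adds SLOPE units of predicted Sales
print(round(predict_sales(100.0), 2))
print(round(predict_sales(101.0) - predict_sales(100.0), 3))
```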

Predictions

In [None]:
# Making predictions on the testing set which returns an ndarray
y_pred = lr.predict(X_test)

In [None]:
y_pred.shape

In [None]:
y_test.shape

**COMPUTING RMSE and R2 SCORE**

**RMSE**
* Root Mean Square Error (RMSE) is a standard way to measure the error of a model in predicting quantitative data.

* It measures how well a regression line fits the data points; lower values mean a closer fit.

* It is the square root of the mean squared prediction error, which can be read roughly as the standard deviation of the errors made when predicting on a dataset.
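The definition can be written out in a few lines of NumPy. The sketch below uses made-up numbers, and `rmse_by_hand` is a hypothetical name chosen to avoid clashing with variables used elsewhere in the notebook.

```python
import numpy as np

def rmse_by_hand(y_true, y_hat):
    """Square root of the mean of the squared prediction errors."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_hat) ** 2)))

# Made-up values: the errors are -1, 1, -2 and 2
actual = [10.0, 12.0, 15.0, 20.0]
predicted = [11.0, 11.0, 17.0, 18.0]
print(rmse_by_hand(actual, predicted))
```

Because the errors are squared before averaging, the two errors of size 2 contribute more to the result than the two errors of size 1.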






In [None]:
# Actual vs Predicted
import matplotlib.pyplot as plt
c = list(range(1, len(y_test) + 1))    # 1-based index for the test observations
fig = plt.figure()
plt.plot(c,y_test, color="blue", linewidth=2, linestyle="-", label="Actual")
plt.plot(c,y_pred, color="red",  linewidth=2, linestyle="-", label="Predicted")
fig.suptitle('Actual and Predicted', fontsize=20)              # Plot heading
plt.xlabel('Index', fontsize=18)                               # X-label
plt.ylabel('Sales', fontsize=16)
plt.legend()

**Data Insights**:

From the plot we can infer that the predicted values do not deviate much from the actual values.

Plotting the Error

In [None]:
# Error terms
c = list(range(1, len(y_test) + 1))    # 1-based index for the test observations
fig = plt.figure()
plt.plot(c,y_test-y_pred, color="blue", linewidth=2, linestyle="-")
fig.suptitle('Error Terms', fontsize=20)              # Plot heading
plt.xlabel('Index', fontsize=18)                      # X-label
plt.ylabel('ytest-ypred', fontsize=16)

Data Insights:
This plot shows the difference between the actual and predicted values.
Most of the errors lie in the range -2 to 2, with a few larger deviations in either direction, so the model's predictions should be reasonably accurate.

### Model Evaluation

In [None]:
from sklearn.metrics import mean_squared_error, r2_score
mse = mean_squared_error(y_test, y_pred)
print('Mean_Squared_Error :' ,mse)

In [None]:
rmse = np.sqrt(mse)
print('Root_Mean_Squared_Error :' ,rmse)

The RMSE is 2.45, which means a typical prediction is off by roughly 2.45 units of the target variable. Whether that is small depends on the scale of Sales, so the next cell compares it with the range of Sales in the test set.

In [None]:
# rmse was computed in the previous cell (2.451544497150294 in this run)
sales_range = y_test.max() - y_test.min()  # range of 'Sales' in the test set

# Calculate the ratio of RMSE to the range of 'Sales'
RMSE_to_sales_range_ratio = rmse / sales_range

# Print the ratio for comparison
print("RMSE-to-Sales-Range Ratio:", RMSE_to_sales_range_ratio)


The ratio is close to 0.1, i.e. the typical error is about a tenth of the range of Sales in the test set, which indicates reasonably good predictive accuracy.

## Inference from the R2 Score


In [None]:
from sklearn.metrics import r2_score
r_squared = r2_score(y_test, y_pred)
print('R2 Score :',r_squared)

### R2 Score

 * R2 (the coefficient of determination) measures how much of the variability of the dependent variable is explained by the independent variable(s).

 * A higher R2 score indicates a better fit of the model to the data.

 * On the training data of a least-squares fit with an intercept it lies between 0 and 1; on test data it can be negative when the model predicts worse than the mean of the target.

 * 0 indicates that the model does not explain any of the variability in the target variable.
 * 1 indicates that the model perfectly explains all the variability in the target variable.

 * In our case, an R2 score of 0.8053611644334993 suggests that approximately 80.54% of the variance in 'Sales' is explained by TV, the only independent variable in the model.
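The score can be reproduced by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses made-up numbers, and `r2_by_hand` is a hypothetical helper that follows the same definition as `r2_score`.

```python
import numpy as np

def r2_by_hand(y_true, y_hat):
    """R2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y_true - y_hat) ** 2)           # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return float(1.0 - ss_res / ss_tot)

actual = np.array([1.0, 2.0, 3.0, 4.0])   # made-up target values

r2_perfect = r2_by_hand(actual, actual)                      # predictions match exactly
r2_mean = r2_by_hand(actual, np.full(4, actual.mean()))      # always predict the mean
r2_reversed = r2_by_hand(actual, actual[::-1])               # worse than predicting the mean
print(r2_perfect, r2_mean, r2_reversed)
```

The third call illustrates that predictions worse than simply using the mean of the target give a negative score.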





## Plotting the Regression Line

In [None]:
plt.plot(X_test,y_pred, color='red')
plt.scatter(data['TV'],data['Sales'])
plt.title('TV vs Sales')
plt.xlabel('TV')
plt.ylabel('Sales')
plt.show()

**CONCLUSION**

In this lab, we used simple linear regression to model the relationship between the predictor variable (TV advertising expenditure) and the target variable (sales). The fitted model achieved a Root Mean Squared Error (RMSE) of 2.45 and an R-squared (R2) score of 0.81 on the test set. The regression line therefore captures most of the variation in sales and gives a useful description of how sales change with TV advertising expenditure.