Introduction:

Linear regression is a method used to find the relationship between two variables. Imagine you have a set of points on a graph and you want to draw a straight line that best represents the overall trend of these points. This line helps you predict values based on past data. In simple terms, linear regression helps you "fit" a straight line through data to make predictions or understand trends.

Definition:

Linear regression is a statistical technique used to model the relationship between a dependent variable Y and one or more independent variables X. 
The simplest case is simple linear regression, where there is only one independent variable.
The relationship is modeled by the equation of a straight line:
Y=mX+b

Where:
Y is the dependent variable (the one you want to predict or explain).
X is the independent variable (the one you use to make predictions).
m is the slope of the line (shows how much Y changes for each unit change in X).
b is the intercept of the line (the value of Y when X=0).
The goal of linear regression is to find the values of m (slope) and b (intercept) that give the line with the smallest difference between the predicted values of Y and the actual values in the dataset. This difference is usually measured as the sum of squared differences (ordinary least squares).
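
To make "best line" concrete, the sketch below scores two candidate lines by their sum of squared differences. It is only an illustration: the helper name sum_squared_error, the four data points, and both candidate lines are made up for this example and are not taken from the salary dataset.

```python
def sum_squared_error(xs, ys, m, b):
    """Sum of squared vertical distances between the line Y = mX + b and the data."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))

# Illustrative data (assumed for this example); the points lie exactly on Y = 2X + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

error_good = sum_squared_error(xs, ys, m=2, b=1)  # the line that generated the data
error_poor = sum_squared_error(xs, ys, m=1, b=3)  # an arbitrary alternative line

print(error_good)  # 0
print(error_poor)  # 6
```

The line with the smaller score fits the data better. Linear regression looks for the m and b that make this score as small as possible.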


Algorithm:

1) Input Data: Given a set of data points (x1, y1), (x2, y2), ..., (xn, yn).
2) Compute Sums: Calculate the total of X values, Y values, products of X and Y, and squares of X values.
3) Calculate Slope (m): Use the formula:
   m = (n * sum(XY) - sum(X) * sum(Y)) / (n * sum(X^2) - (sum(X))^2)
4) Calculate Intercept (b): Use the formula:
   b = (sum(Y) - m * sum(X)) / n
5) Construct Regression Line: The regression equation is Y=mX+b.
6) Prediction: For a new X, predict Y using Y=mX+b.
7) Evaluate the Model: Measure model accuracy with metrics like Mean Squared Error (MSE) or R-squared.
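
Steps 1 to 6 can be sketched in plain Python as follows. This is a minimal illustration of the formulas above, not the code scikit-learn runs internally. The function names fit_line and predict and the five data points are assumptions made for this example.

```python
def fit_line(xs, ys):
    """Return (m, b) for the least-squares line through the points (xs[i], ys[i])."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    # Step 2: the four sums used by the formulas.
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # Step 3: slope. The denominator is zero when every x value is the same.
    denominator = n * sum_x2 - sum_x ** 2
    if denominator == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    m = (n * sum_xy - sum_x * sum_y) / denominator
    # Step 4: intercept.
    b = (sum_y - m * sum_x) / n
    return m, b

def predict(x, m, b):
    """Steps 5 and 6: evaluate the regression line Y = mX + b at a new x."""
    return m * x + b

# Step 1: illustrative data (assumed); the points lie exactly on Y = 2X + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

m, b = fit_line(xs, ys)
print(m, b)              # 2.0 1.0
print(predict(6, m, b))  # 13.0
```

Because the example points lie exactly on a line, the fit recovers that line. With noisy data, the same formulas return the line with the smallest sum of squared differences.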

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

In [2]:
df = pd.read_csv(r"C:\Users\iamni\OneDrive\Desktop\Salary_Data.csv") #pandas 

In [3]:
df.head() #pandas 

   YearsExperience  Salary
0              1.1   39343
1              1.3   46205
2              1.5   37731
3              2.0   43525
4              2.2   39891


In [4]:
df.shape #pandas 

(30, 2)

In [5]:
df.size #pandas 

60

In [6]:
df.info() #pandas 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 2 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   YearsExperience  30 non-null     float64
 1   Salary           30 non-null     int64  
dtypes: float64(1), int64(1)
memory usage: 612.0 bytes


In [7]:
df.describe() #pandas 

       YearsExperience        Salary
count        30.0          30.0
mean          5.313333  76003.0
std           2.837888  27414.429785
min           1.1       37731.0
25%           3.2       56720.75
50%           4.7       65237.0
75%           7.7      100544.75
max          10.5      122391.0


In [8]:
# .values is a pandas attribute that returns the column as a NumPy array
# .reshape(-1,1) is a NumPy array method that turns it into a 2-D array with one column
x = df['YearsExperience'].values.reshape(-1,1)
y = df['Salary']
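
scikit-learn estimators expect the features as a 2-D array of shape (n_samples, n_features), which is why the single column is reshaped. A small sketch of the effect, using the five YearsExperience values shown by df.head(); the variable names years and features are chosen for this example only:

```python
import numpy as np

# The five YearsExperience values from df.head(), as a 1-D array.
years = np.array([1.1, 1.3, 1.5, 2.0, 2.2])
print(years.shape)     # (5,)

# -1 lets NumPy infer the number of rows; 1 fixes the number of columns.
features = years.reshape(-1, 1)
print(features.shape)  # (5, 1)
```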

In [9]:
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size = 0.3, random_state = 10)
#sklearn (train_test_split is a function from the sklearn.model_selection module of the scikit-learn library)
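
A sketch of what this split does, assuming a made-up 30-row stand-in for the dataset (x_demo and y_demo are illustrative names, not the salary data). With 30 rows and test_size = 0.3, 9 rows go to the test set and 21 to the training set, and reusing the same random_state reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative stand-in for a 30-row dataset (assumed): x = 0..29, y = 2x + 1.
x_demo = np.arange(30).reshape(-1, 1)
y_demo = 2 * x_demo.ravel() + 1

first = train_test_split(x_demo, y_demo, test_size=0.3, random_state=10)
second = train_test_split(x_demo, y_demo, test_size=0.3, random_state=10)

x_tr, x_te, y_tr, y_te = first
print(len(x_tr), len(x_te))  # 21 9

# The same random_state selects the same rows for the test set.
print(np.array_equal(x_te, second[1]))  # True
```

Fixing random_state makes the reported score repeatable. A different seed selects different test rows and will generally give a slightly different score.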

In [10]:
model = LinearRegression()
model.fit(x_train, y_train)
x_test_pred = model.predict(x_test)
#sklearn library used
#sklearn is the library, linear_model is the module and LinearRegression is the class
#.fit() and .predict() are methods of the LinearRegression object
#x_test_pred holds the salaries predicted for x_test
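
After fitting, the slope m and intercept b of the regression line are available as the coef_ and intercept_ attributes of the model. The sketch below shows this on made-up data that lies exactly on Y = 2X + 1. The names x_demo, y_demo and demo_model are illustrative and separate from the notebook's own variables.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative data (assumed): five points on the line Y = 2X + 1.
x_demo = np.array([1.0, 2.0, 3.0, 4.0, 5.0]).reshape(-1, 1)
y_demo = 2.0 * x_demo.ravel() + 1.0

demo_model = LinearRegression().fit(x_demo, y_demo)

slope = float(demo_model.coef_[0])         # m: one coefficient per feature column
intercept = float(demo_model.intercept_)   # b
print(round(slope, 6), round(intercept, 6))  # 2.0 1.0

# Predicting for a new X applies Y = mX + b.
new_prediction = float(demo_model.predict(np.array([[6.0]]))[0])
print(round(new_prediction, 6))  # 13.0
```

Running the same two attribute lookups on the notebook's model would give the fitted slope and intercept for the salary data.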

In [11]:
r2 = r2_score(y_test, x_test_pred) #r2_score is a function from the metrics module of the sklearn library; it returns R-squared, not classification accuracy
r2

0.9647278344670828
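
For reference, R-squared is 1 minus the ratio of the residual sum of squares to the total sum of squares. A value close to 1 means the line explains most of the variation in Y. The sketch below computes it by hand, together with the Mean Squared Error mentioned in the algorithm. The function names mse and r_squared and the four sample values are assumptions for illustration, not the notebook's test data.

```python
def mse(y_true, y_pred):
    """Mean Squared Error: the average of the squared prediction errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """R-squared = 1 - SS_res / SS_tot (undefined if every y_true is identical)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total variation
    return 1 - ss_res / ss_tot

# Illustrative actual and predicted values (assumed).
y_true = [3, 5, 7, 9]
y_pred = [4, 5, 6, 9]

print(mse(y_true, y_pred))        # 0.5
print(r_squared(y_true, y_pred))  # 0.9
```

MSE is expressed in squared units of Y, so it depends on the scale of the data. R-squared has no units, which makes it easier to compare across datasets.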