Model Creation Phases
Data collection, data preprocessing, input/output split, train/test split, model training, evaluation metrics, and saving the best model.

The first cell imports the Pandas library, a popular open-source Python package for data manipulation and analysis.
Pandas provides data structures such as DataFrames (tabular data) and Series (1-dimensional data) that are useful for handling and processing structured data (like spreadsheets or database tables).
The alias pd lets us write pd instead of pandas every time, which is shorter and is the common convention.

In [7]:
import pandas as pd 

Data Collection - Collect the data from the end user, typically in a formal meeting.

pd.read_csv():
read_csv() is a function from the Pandas library that is used to read data from a CSV (Comma-Separated Values) file into a DataFrame.
A DataFrame is a 2-dimensional data structure in Pandas, similar to a table or spreadsheet, with labeled rows and columns.
"Salary_Data.csv":
This is the file name (or file path) of the CSV file that contains the data you want to load.
A bare file name is looked up in the current working directory; if the file is elsewhere, provide the full path (e.g., "C:/path/to/your/file/Salary_Data.csv").
dataSet:
The result of pd.read_csv() (the loaded data) is assigned to the dataSet variable.
Now, dataSet is a DataFrame that contains the contents of the Salary_Data.csv file.

In [8]:
dataSet = pd.read_csv("Salary_Data.csv")
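As a minimal sketch of what read_csv() does, the example below parses CSV text held in memory instead of a file on disk, so it runs without Salary_Data.csv. The three rows are copied from the dataset output shown in this notebook; the names csv_text and sample are illustrative assumptions.

```python
import io
import pandas as pd

# CSV text standing in for the contents of a file (hypothetical sample).
csv_text = "Years_of_Experience,Salary\n7.5,57844\n19.0,100539\n14.6,84203\n"

# read_csv() accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

print(type(sample).__name__)   # DataFrame
print(sample.shape)            # (number of rows, number of columns)
print(list(sample.columns))    # column labels taken from the header row
```

The first line of the CSV text becomes the column labels, and each following line becomes one row.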

Preprocessed Data - Put the data collected from the client into a proper format.

The DataFrame stores your CSV file's data in a structured format for easy access and manipulation. To see what the dataset contains, evaluate the variable name on its own in a cell; the notebook displays it, much as print() would.

In [7]:
dataSet

   Years_of_Experience  Salary
0                  7.5   57844
1                 19.0  100539
2                 14.6   84203
3                 12.0   74551
4                  3.1   41509
5                  3.1   41509
6                  1.2   34455
7                 17.3   94227
8                 12.0   74551
9                 14.2   82718


Split Input and Output
Here the input is Years of Experience and the output is Salary. First, separate the input data from the dataset. The input is the independent variable.

dataSet[["Years_of_Experience"]]:
This line extracts the Years_of_Experience column from the dataSet DataFrame.
Double square brackets [["..."]] are used to extract it as a DataFrame, not as a single Series.
As a result, independent will be a new DataFrame containing only the Years_of_Experience column.
independent:
This new DataFrame will store the values of the Years_of_Experience column as independent variable(s), which are typically the features or inputs for machine learning models.

In [11]:
independent = dataSet[["Years_of_Experience"]]
independent

   Years_of_Experience
0                  7.5
1                 19.0
2                 14.6
3                 12.0
4                  3.1
5                  3.1
6                  1.2
7                 17.3
8                 12.0
9                 14.2
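The difference between single and double square brackets can be checked directly. The sketch below uses a small made-up frame (df is a hypothetical name, with the same column names as the notebook) and prints the type and shape of each selection.

```python
import pandas as pd

# Small hypothetical frame with the same column names as the notebook.
df = pd.DataFrame({"Years_of_Experience": [7.5, 19.0], "Salary": [57844, 100539]})

as_series = df["Years_of_Experience"]     # single brackets: 1-D Series
as_frame = df[["Years_of_Experience"]]    # double brackets: 2-D DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```

scikit-learn estimators expect the input features as a 2-D structure of shape (n_samples, n_features), which is why the input column is selected with double brackets.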


In [None]:
#Split the output data from DataSet - Output - Dependent Variable

dataSet[["Salary"]]:
This line extracts the Salary column from the dataSet DataFrame.
Double square brackets [["..."]] are used to extract it as a DataFrame, not as a Series.
As a result, dependent will be a new DataFrame containing only the Salary column.
dependent:
This new DataFrame represents the target (y) or dependent variable, which is typically the output or label in machine learning models.
In a regression problem, like predicting salaries based on years of experience, the dependent variable is the Salary.

In [10]:
dependent = dataSet[["Salary"]]
dependent

   Salary
0   57844
1  100539
2   84203
3   74551
4   41509
5   41509
6   34455
7   94227
8   74551
9   82718


In [None]:
# Split Train and Test data

In [12]:
from sklearn.model_selection import train_test_split
train_test_split(independent,dependent,test_size=0.30,random_state=0)

[    Years_of_Experience
 34                 19.3
 32                  1.3
 26                  4.0
 30                 12.2
 8                  12.0
 13                  4.2
 5                   3.1
 17                 10.5
 14                  3.6
 31                  3.4
 24                  9.1
 1                  19.0
 12                 16.6
 6                   1.2
 23                  7.3
 4                   3.1
 18                  8.6
 21                  2.8
 19                  5.8
 9                  14.2
 7                  17.3
 33                 19.0
 3                  12.0
 0                   7.5,
     Years_of_Experience
 29                  0.9
 20                 12.2
 16                  6.1
 28                 11.8
 22                  5.8
 15                  3.7
 10                  0.4
 2                  14.6
 11                 19.4
 27                 10.3
 25                 15.7,
     Salary
 34  101653
 32   34826
 26   44850
 30   75293
 8    74551
 

In [None]:
#The output above contains both the train and test data. For convenience, we unpack it into four variables: input train data, input test data, output train data, and output test data.

train_test_split() is a function from the scikit-learn library that helps you split your dataset into training and testing sets. 
This is essential for evaluating how well a machine learning model performs on unseen data.
In this example, the independent variable (Years of Experience) is the feature used to predict the dependent variable (Salary).

This function splits the input dataset into four subsets:
x_train: Training data (independent variable)
x_test: Testing data (independent variable)
y_train: Training target (dependent variable)
y_test: Testing target (dependent variable)

Parameters Passed to train_test_split():

independent: This contains the independent variable (like Years_of_Experience).
dependent: This contains the dependent variable (like Salary).
test_size=0.30: This specifies that 30% of the data should be used for testing, and the remaining 70% will be used for training.
random_state=0: This ensures the same random split occurs every time the code is executed (for reproducibility). If not set, the split may vary every time.

Training Set (x_train, y_train):

The training set is used to train the model (the model learns patterns from this data).

Testing Set (x_test, y_test):

The testing set is used to evaluate the model’s performance on new/unseen data.

In [13]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(independent,dependent,test_size=0.30,random_state=0)
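The effect of test_size and random_state can be seen on stand-in data. The sketch below assumes a dataset of 35 rows, which matches the 24 training rows and 11 test rows visible in the earlier output; the values and the names X, y, x_tr, and x_te are made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data with 35 rows (index 0..34).
X = pd.DataFrame({"Years_of_Experience": [i * 0.5 for i in range(35)]})
y = pd.DataFrame({"Salary": [30000 + 1000 * i for i in range(35)]})

x_tr, x_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=0)
print(len(x_tr), len(x_te))   # sizes of the training and test subsets

# Repeating the call with the same random_state selects the same rows.
x_tr2, x_te2, _, _ = train_test_split(X, y, test_size=0.30, random_state=0)
print(list(x_te.index) == list(x_te2.index))
```

Because 30% of 35 is 10.5, the test set size is rounded up to 11 rows and the remaining 24 rows go to training.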

In [None]:
# Train the Model

LinearRegression() is a class from the sklearn.linear_model module used to create a linear regression model.
In simple terms, linear regression is a method used to find the relationship between a dependent variable (like Salary) and one or more independent variables (like Years_of_Experience).
The goal is to fit a straight line that best predicts the target variable (dependent variable) using the independent variables.

from sklearn.linear_model import LinearRegression: Imports the LinearRegression class from the sklearn library.
regressor = LinearRegression(): Creates an instance (object) of the LinearRegression model.
This object will be used to fit the data and make predictions.

regressor.fit(x_train, y_train):

fit() is a method used to train the model using the training data.
x_train: Independent variable(s) (Years of Experience in this case).
y_train: Dependent variable (Salary in this case).
The model learns the relationship between the input (x_train) and output (y_train) data and calculates the best-fit line (slope and intercept).

In [14]:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_train,y_train)

How Linear Regression Works:
The model tries to fit a line to the data such that:

y = wx + c
y: Predicted output (Salary)
x: Independent variable (Years of Experience)
w: Weight (coefficient of x)
c: Intercept (constant term)

What Happens When fit() Is Called?
Training: The model finds the best values for:
Coefficients (w): This shows how much the dependent variable (salary) changes with a one-unit change in the independent variable (experience).
Intercept (c): This is the predicted value of the dependent variable when the independent variable is 0.
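For a single input variable, the least-squares values of w and c have a simple closed form. The sketch below computes them in plain Python for made-up points that lie exactly on y = 2x + 1. The helper fit_line is hypothetical and is not scikit-learn's internal implementation, although for one input variable it should arrive at the same least-squares line.

```python
def fit_line(xs, ys):
    """Return (w, c) minimizing the squared error of y = w*x + c."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    w = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    c = y_mean - w * x_mean
    return w, c

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]   # made-up points on y = 2x + 1
w, c = fit_line(xs, ys)
print(w, c)
```

The recovered slope and intercept match the line the points were generated from, which is what fit() does with the training data.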

Find w - the slope or weight

In [28]:
weight = regressor.coef_
weight

array([[3712.5936588]])

In [None]:
Find c - the bias or intercept

In [29]:
bias = regressor.intercept_
bias

array([29999.62152809])

In [None]:
#Evaluation Metrics - Now we use R2
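#An R2 value near 1 means the model fits well; a value near 0 means it does no better than always predicting the mean salary. R2 can even be negative for a model that does worse than that.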

predict:
regressor.predict(x_test):
After training the model using the fit() function, this method is used to make predictions based on unseen data (in this case, x_test).
The model uses the coefficients and intercept it learned during training to compute predictions for the x_test values.
The result, y_pred, contains the predicted salary values.

r2_score():
This function from sklearn.metrics calculates the R-squared value, which measures how well the model fits the data.
R² (coefficient of determination) explains the proportion of variance in the dependent variable (Salary) that can be predicted from the independent variable (Years of Experience).

In [15]:
y_pred = regressor.predict(x_test)
from sklearn.metrics import r2_score
r_square = r2_score(y_test,y_pred)

In [16]:
r_square  # a value near 1 indicates a good fit

0.9999999998013439
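The R² value can also be computed by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses made-up numbers and a hypothetical helper r_squared to show how the score behaves for good, mean-only, and poor predictions.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    # Squared error of the predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Squared error of always predicting the mean.
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0]
print(r_squared(y_true, [2.5, 5.0, 7.5]))   # close predictions
print(r_squared(y_true, [5.0, 5.0, 5.0]))   # always predicting the mean
print(r_squared(y_true, [7.0, 5.0, 3.0]))   # worse than predicting the mean
```

The three calls give 0.9375, 0.0, and -3.0 respectively, so a score close to 1 means the predictions leave very little of the variation unexplained.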

In [None]:
# Save the Best model

pickle: pickle is a Python library used to serialize and deserialize Python objects.
Serialization: Converting a Python object (like a trained model) into a byte stream, so it can be saved to a file.
Deserialization: Loading the saved byte stream back into its original Python object.
This is especially useful when you need to save a trained machine learning model and reuse it later without retraining.

import pickle: Imports the pickle module, which allows us to save (serialize) and load (deserialize) Python objects.
filename = "finalized_model_of_linear_regession.sav": Sets the name of the file where the serialized model will be saved.
.sav: Used here as a file extension to indicate that the file contains a saved model (technically, any extension can be used).

In [17]:
import pickle
filename = "finalized_model_of_linear_regession.sav"

wb: Opens the file in write-binary mode.
pickle.dump(): Serializes the trained model and writes it to the specified file (filename).

In [18]:
with open(filename, 'wb') as f:  # the file is closed automatically
    pickle.dump(regressor, f)

To verify that the model was saved correctly, load it back from the file and use it to make a prediction.
rb: Opens the file in read-binary mode.
pickle.load(): Deserializes the saved byte stream back into the original model.

In [19]:
with open("finalized_model_of_linear_regession.sav", 'rb') as f:  # the file is closed automatically
    load_the_model_for_view = pickle.load(f)
result = load_the_model_for_view.predict([[15]])  # predict salary for 15 years of experience
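The save and load round trip can be tried on its own with any picklable object. The sketch below is an illustration under assumptions: a plain dictionary stands in for the trained model, and the file name example_model.sav and the temporary directory are hypothetical.

```python
import os
import pickle
import tempfile

# Stand-in for a trained model: any picklable Python object works.
params = {"coef": 3712.5936588, "intercept": 29999.62152809}

# Hypothetical file name inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "example_model.sav")

# The with statement closes the file even if dump() raises an error.
with open(path, "wb") as f:
    pickle.dump(params, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == params)   # the loaded object equals the saved one
```

Note that pickle.load() can execute arbitrary code while loading, so only load model files that come from a source you trust.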



Predicted salary for 15 years of experience

In [20]:
result

array([[85688.52641012]])
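This prediction can be checked by hand with y = wx + c, using the weight and bias printed earlier in the notebook. The sketch below simply repeats that arithmetic; the variable names years and salary are illustrative.

```python
w = 3712.5936588      # regressor.coef_ as printed earlier
c = 29999.62152809    # regressor.intercept_ as printed earlier

years = 15
salary = w * years + c   # y = wx + c
print(round(salary, 2))  # 85688.53
```

The hand-computed value agrees with the loaded model's output, which suggests the model was saved and restored intact.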