# Polynomial Regression

**Understanding the Dataset:**

Imagine we’re part of an HR department looking to hire a promising candidate who appears to be a great fit for our company. After successfully navigating the interview process, the candidate says yes to the offer, but then comes the inevitable question: What is your salary expectation?

The candidate, being highly experienced and advanced in their career, states their expectation as $160,000 per year. When we ask why they are expecting such a high salary, they respond confidently: That’s what I earned at my previous company, so I expect at least the same here.

Now, as HR professionals, we need to determine whether this claim is accurate or an exaggeration. To do this, we’ll use a polynomial regression model to predict the candidate’s previous salary based on their position.

**The Dataset**

To make this prediction, we need data. Let’s say we collected salary information from reliable online sources like Glassdoor, showing the typical salaries for various positions in the candidate’s previous company—from Business Analyst to CEO. Additionally, by reviewing the candidate’s LinkedIn profile, we confirmed that they held the position of Region Manager for two years.

In our dataset, the salary for a Region Manager is listed as $150,000, while the next higher position (e.g., VP) earns $200,000. Since the candidate has significant experience as a Region Manager, their actual salary is likely somewhere between these two figures. For our analysis, we’ll assign their position a level of 6.5 (between levels 6 and 7).

**The Goal**

By training our polynomial regression model on this dataset, we will predict the salary for a position at level 6.5. Comparing this prediction to the candidate’s claimed $160,000 will allow us to determine whether their claim is truthful or a bluff.

Let’s dive in and build the model to find out!

## Importing the libraries

```python
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
```

## Importing the dataset

```python
dataset = pd.read_csv('Position_Salaries.csv')
X = dataset.iloc[:, 1:-1].values
y = dataset.iloc[:, -1].values
```

The first column is redundant: it simply labels the second column, so we exclude it from the matrix of features X.

Next, we’re going to skip the step of splitting the dataset into a training set and a test set. The reason is that we want to use the entire dataset to maximise the accuracy of our model when predicting the salary for the position level between 6 and 7. By utilising all available data points, we make sure the model captures the full range of patterns and relationships within the dataset.

## Training the Linear Regression model on the whole dataset
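
The code for this step is not shown here, so the following is a minimal sketch of how the linear model could be trained. The name `lin_reg` is an assumption, and the inline arrays are stand-in values assumed to mirror Position_Salaries.csv (only the level 6 and level 7 salaries are stated in the text above).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Assumed stand-in for Position_Salaries.csv: position levels 1-10 and
# illustrative salaries (only levels 6 and 7 are confirmed by the text).
X = np.arange(1, 11).reshape(-1, 1)  # 2D feature matrix, one column
y = np.array([45000, 50000, 60000, 80000, 110000,
              150000, 200000, 300000, 500000, 1000000])

# Fit an ordinary least-squares line on the whole dataset (no train/test split).
lin_reg = LinearRegression()
lin_reg.fit(X, y)

print("slope:", lin_reg.coef_[0])
print("intercept:", lin_reg.intercept_)
```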

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

## Training the Polynomial Regression model on the whole dataset

Using the PolynomialFeatures class from Scikit-learn’s preprocessing module, we’ll transform the feature X into a matrix that includes $X^1, X^2, \dots, X^n$, where n is the degree of the polynomial we choose. In other words, we create new features by raising the position levels to higher powers (2, 3, ..., n). By default, the class also adds a constant column of ones ($X^0$).

**Transform the Feature Matrix:**

We start with our current matrix of features X, which contains only one feature: the position levels (x1). We need to transform it into a new matrix of features containing:

${x_1}$: the original feature.

${x_1}^2$: the square of the feature (for n=2).

${x_1}^3$, ${x_1}^4$: higher powers if we decide on n=3 or n=4.

To achieve this, we use the fit_transform method from the PolynomialFeatures object we created earlier. This method transforms X into a new feature matrix with all polynomial combinations up to the specified degree.
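
The transformation and the fit on the transformed features might look like the sketch below. The names `poly_reg`, `X_poly` and `lin_reg_2` are assumptions, as are the inline stand-in values used in place of Position_Salaries.csv.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Assumed stand-in for Position_Salaries.csv (illustrative values).
X = np.arange(1, 11).reshape(-1, 1)
y = np.array([45000, 50000, 60000, 80000, 110000,
              150000, 200000, 300000, 500000, 1000000])

# Build the polynomial feature matrix for n = 2.
poly_reg = PolynomialFeatures(degree=2)
X_poly = poly_reg.fit_transform(X)  # columns: 1, x1, x1^2

# Polynomial regression is a linear regression on the transformed features.
lin_reg_2 = LinearRegression()
lin_reg_2.fit(X_poly, y)

print(X_poly[:3])  # first three rows of the new feature matrix
```

Note that the model itself is still a LinearRegression object; only the feature matrix changes.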

## Visualising the Linear Regression results
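
A plot along these lines could be produced as follows. This is a sketch under assumptions: the inline arrays stand in for Position_Salaries.csv, and the colours and labels are illustrative choices.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Assumed stand-in for Position_Salaries.csv (illustrative values).
X = np.arange(1, 11).reshape(-1, 1)
y = np.array([45000, 50000, 60000, 80000, 110000,
              150000, 200000, 300000, 500000, 1000000])

lin_reg = LinearRegression().fit(X, y)

# Actual salaries as points, model predictions as a straight line.
fig, ax = plt.subplots()
ax.scatter(X, y, color="red", label="Actual salaries")
ax.plot(X, lin_reg.predict(X), color="blue", label="Linear regression")
ax.set_xlabel("Position level")
ax.set_ylabel("Salary")
ax.legend()
plt.show()
```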

It’s clear that the linear regression model is not well-suited for this dataset. While linear regression works well when the relationship between features and the target variable is approximately linear, that’s not the case here. For many position levels, the model’s predictions are significantly off from the actual salaries.

For instance:

- At several points, the predicted salaries are far higher or lower than the real values.
- While the model fits a couple of points reasonably well, the majority of predictions deviate greatly from the actual data.

This discrepancy demonstrates why linear regression is not ideal for this dataset. If we were to use this model to determine whether the candidate's claimed salary is truthful, it could lead to errors, such as offering a salary far higher than necessary, which would be poor negotiation strategy.

This example highlights the limitations of linear regression for datasets with non-linear patterns. Next, we’ll move on to the polynomial regression model, which is far better suited for capturing the complex relationship between position levels and salaries. Let’s visualise the results to see the improvement!

## Visualising the Polynomial Regression results
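
The plot for n=2 could be drawn with a sketch like this one. The variable names and the inline stand-in values for Position_Salaries.csv are assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Assumed stand-in for Position_Salaries.csv (illustrative values).
X = np.arange(1, 11).reshape(-1, 1)
y = np.array([45000, 50000, 60000, 80000, 110000,
              150000, 200000, 300000, 500000, 1000000])

poly_reg = PolynomialFeatures(degree=2)
lin_reg_2 = LinearRegression().fit(poly_reg.fit_transform(X), y)

# Predictions must be made on the transformed features, not on X directly.
fig, ax = plt.subplots()
ax.scatter(X, y, color="red", label="Actual salaries")
ax.plot(X, lin_reg_2.predict(poly_reg.transform(X)), color="blue",
        label="Polynomial regression (n=2)")
ax.set_xlabel("Position level")
ax.set_ylabel("Salary")
ax.legend()
plt.show()
```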

The issue is resolved, as the predictions on the blue curve now align much more closely with the actual salaries. This improvement is achieved with just n=2. However, by increasing the polynomial degree to n=3 or n=4, the results will improve even further. Let’s demonstrate this by retraining the polynomial regression model with n=4.

With n=4, the polynomial regression equation becomes:

$\text{Salary} = b_0 + b_1 {x_1} + b_2 {x_1}^2 + b_3 {x_1}^3 + b_4 {x_1}^4$, where $x_1$ is the position level.

After retraining the model on the dataset, we can visualise the updated results. As expected, the polynomial regression model now fits the dataset almost perfectly. While this indicates overfitting, it is acceptable in this specific case because our goal is to achieve precise predictions for position levels between 6 and 7.
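
One way to check the improvement numerically, rather than by eye, is to compare the R² score on the training data for n=2 and n=4. This is a sketch on assumed stand-in values for Position_Salaries.csv; the `scores` dictionary is a hypothetical helper.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Assumed stand-in for Position_Salaries.csv (illustrative values).
X = np.arange(1, 11).reshape(-1, 1)
y = np.array([45000, 50000, 60000, 80000, 110000,
              150000, 200000, 300000, 500000, 1000000])

scores = {}
for degree in (2, 4):
    poly_reg = PolynomialFeatures(degree=degree)
    X_poly = poly_reg.fit_transform(X)
    model = LinearRegression().fit(X_poly, y)
    # R^2 on the training data: 1.0 would be a perfect fit.
    scores[degree] = model.score(X_poly, y)

for degree, r2 in scores.items():
    print(f"n={degree}: training R^2 = {r2:.4f}")
```

Because the score is computed on the same points used for training, a value close to 1 shows how tightly the curve follows the data, not how well it would generalise.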


## Visualising the Polynomial Regression results (for higher resolution and smoother curve)


To further improve the visualisation, we’ll refine the curve. Previously, the graph used only the integer values of position levels (e.g., 1, 2, 3, etc.), resulting in a less smooth curve. Instead, we’ll increase the resolution by using smaller intervals (e.g., 1.0, 1.1, 1.2, etc.), creating a denser set of x-coordinates. This produces a much smoother and more aesthetically pleasing curve.
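
A denser grid of x-coordinates can be built with NumPy's arange, as in the sketch below. The name `X_grid`, the 0.1 step and the inline stand-in values for Position_Salaries.csv are assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Assumed stand-in for Position_Salaries.csv (illustrative values).
X = np.arange(1, 11).reshape(-1, 1)
y = np.array([45000, 50000, 60000, 80000, 110000,
              150000, 200000, 300000, 500000, 1000000])

poly_reg = PolynomialFeatures(degree=4)
lin_reg_2 = LinearRegression().fit(poly_reg.fit_transform(X), y)

# Levels from the lowest to the highest in steps of 0.1, reshaped into a
# single-column 2D array so it can be transformed like X.
X_grid = np.arange(X.min(), X.max(), 0.1).reshape(-1, 1)

fig, ax = plt.subplots()
ax.scatter(X, y, color="red", label="Actual salaries")
ax.plot(X_grid, lin_reg_2.predict(poly_reg.transform(X_grid)), color="blue",
        label="Polynomial regression (n=4)")
ax.set_xlabel("Position level")
ax.set_ylabel("Salary")
ax.legend()
plt.show()
```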

The final graph illustrates a perfectly fitted polynomial regression model. While it is overfitted to this dataset, this is fine for our purpose here, as it allows for highly accurate predictions, helping us determine whether the candidate’s salary claim is truthful or a bluff.

This method is mainly for demonstration purposes, as most real-world datasets involve multiple features, making such visualisations impractical. However, in this case, it beautifully highlights the power of polynomial regression. Enjoy the results!

## Predicting a new result with Linear Regression

**Why Use a Two-Dimensional Array?**

In Python, square brackets build lists, and the number of nested bracket pairs determines the number of dimensions. A single pair of square brackets creates a one-dimensional list or vector. For example:

- [6.5] creates a list with one value, which is not the format the predict method expects.

To create a two-dimensional array (even if it contains only one value), we use double square brackets:

- [[6.5]] creates an array with one row and one column, which matches the format the predict method expects.

Each level of brackets corresponds to a dimension:

- The outer pair encloses the list of rows.
- Each inner pair encloses one row, and the values inside it are the columns.

For example:

- [[6.5, 5]] creates an array with one row and two columns.
- [[6.5, 5], [2, 3]] creates an array with two rows and two columns.
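
The shapes described above, and the linear prediction for level 6.5, can be checked with a short sketch. The names `lin_reg` and `linear_prediction` are assumptions, and the inline arrays are stand-in values assumed to mirror Position_Salaries.csv.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Assumed stand-in for Position_Salaries.csv (illustrative values).
X = np.arange(1, 11).reshape(-1, 1)
y = np.array([45000, 50000, 60000, 80000, 110000,
              150000, 200000, 300000, 500000, 1000000])
lin_reg = LinearRegression().fit(X, y)

# Number of bracket levels = number of dimensions.
print(np.array([6.5]).shape)                 # (1,)
print(np.array([[6.5]]).shape)               # (1, 1)
print(np.array([[6.5, 5]]).shape)            # (1, 2)
print(np.array([[6.5, 5], [2, 3]]).shape)    # (2, 2)

# predict expects a 2D input: one row per sample, one column per feature.
linear_prediction = lin_reg.predict([[6.5]])
print(linear_prediction)

# A 1D input is rejected by scikit-learn's input validation.
rejected_1d = False
try:
    lin_reg.predict([6.5])
except ValueError:
    rejected_1d = True
print("1D input rejected:", rejected_1d)
```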


## Predicting a new result with Polynomial Regression
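
For the polynomial model, the new value must go through the same feature transformation before it is passed to predict. The sketch below assumes n=4, the names `poly_reg`, `lin_reg_2` and `poly_prediction`, and stand-in values in place of Position_Salaries.csv.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Assumed stand-in for Position_Salaries.csv (illustrative values).
X = np.arange(1, 11).reshape(-1, 1)
y = np.array([45000, 50000, 60000, 80000, 110000,
              150000, 200000, 300000, 500000, 1000000])

poly_reg = PolynomialFeatures(degree=4)
lin_reg_2 = LinearRegression().fit(poly_reg.fit_transform(X), y)

# Transform [[6.5]] into [1, 6.5, 6.5^2, 6.5^3, 6.5^4] and then predict.
poly_prediction = lin_reg_2.predict(poly_reg.transform([[6.5]]))
print(poly_prediction)
```

The printed value is the figure to compare against the candidate's claimed $160,000.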