Polynomial Regression is a statistical method used to model the relationship between a dependent variable $y$ and an independent variable $x$ as an $n^{th}$-degree polynomial. While Linear Regression fits a straight line, Polynomial Regression fits a curve, making it the right choice when the data shows a non-linear pattern.
For a polynomial of degree $n$, the model is:

$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_n x^n$$

Where:

- $\beta_0$ : The Intercept.
- $\beta_1, \beta_2, \dots, \beta_n$ : The Coefficients for each power of $x$.
- $n$ : The Degree (1 = Line, 2 = Parabola/Curve, 3 = S-shape).
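As a concrete illustration, the formula can be evaluated directly in Python. The coefficient values below are made up purely for demonstration:

```python
def predict(x, coeffs):
    """Evaluate y = coeffs[0] + coeffs[1]*x + coeffs[2]*x**2 + ..."""
    return sum(b * x**i for i, b in enumerate(coeffs))

# Hypothetical coefficients: beta_0 = 2.0, beta_1 = 0.5, beta_2 = 1.0
print(predict(3, [2.0, 0.5, 1.0]))  # 2.0 + 0.5*3 + 1.0*9 = 12.5
```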
In Polynomial Regression, the Degree ($n$) controls how flexible the curve is:
| Degree | Shape | Bends | Description |
|---|---|---|---|
| 1 (Linear) | Straight Line | 0 | A rigid straight line. It can only go up or down at a constant rate. |
| 2 (Quadratic) | Parabola (U-Shape) | 1 | It can change direction once (e.g., go up then down, like a ball thrown in the air). |
| 3 (Cubic) | S-Shape | 2 | It can change direction twice (e.g., go up, then down, then back up). |
| $n$ (Higher) | Wavy / Complex | $n-1$ | A very wiggly line that can twist and turn as many times as needed. |
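The table above can be sketched numerically: a degree-1 fit cannot bend to follow a parabola, while a degree-2 fit matches it exactly. NumPy is an assumption here (the script later in this document uses Scikit-Learn), and the sample points are invented:

```python
import numpy as np

# Five points on the parabola y = (x - 2)^2: one bend, so degree 2 suffices.
x = np.array([0, 1, 2, 3, 4], dtype=float)
y = np.array([4, 1, 0, 1, 4], dtype=float)

for degree in (1, 2, 3):
    coeffs = np.polyfit(x, y, degree)                # least-squares fit
    sse = np.sum((np.polyval(coeffs, x) - y) ** 2)   # sum of squared errors
    print(f"degree {degree}: SSE = {sse:.4f}")
# The straight line (degree 1, 0 bends) misses badly; degree 2 fits exactly.
```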
The goal of Machine Learning is not to connect every single dot. The goal is to find the general pattern so we can predict new data accurately.
- Degree 2 (Quadratic): Best when the data has a simple curve or a single peak/valley.
  - Example: Fuel Efficiency vs. Speed. (Efficiency goes up as you speed up, peaks around 60 mph, then drops as you go faster.)
- Degree 3 (Cubic): Best when the pattern is more complex, with multiple fluctuations.
  - Example: Electricity Usage over a Day. (Low at night, high in the morning, dips in the afternoon, high again in the evening.)
- Very high degrees (10+): Almost never.
⚠️ The Danger of High Degrees (Overfitting): Imagine a model with Degree 20. It has 19 bends available. It is so flexible that it will wiggle frantically to pass through every single data point perfectly. While it gets 100% accuracy on the training data, it fails miserably on new data because it learned the "noise" instead of the actual pattern. This is called Overfitting.
Objective: Find the best-fit curve for the following data points which follow a non-linear trend.
Data Points: $(1, 1), (2, 4), (3, 9), (4, 15)$

Since the data curves upwards, we use a Quadratic Equation (Degree 2):

$$y = \beta_0 + \beta_1 x + \beta_2 x^2$$
To solve for the coefficients ($\beta_0, \beta_1, \beta_2$) by least squares, we first compute the summations needed for the normal equations:

| $x$ | $y$ | $xy$ | $x^2$ | $x^2 y$ | $x^3$ | $x^4$ |
|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | 4 | 8 | 4 | 16 | 8 | 16 |
| 3 | 9 | 27 | 9 | 81 | 27 | 81 |
| 4 | 15 | 60 | 16 | 240 | 64 | 256 |
| $\sum x = 10$ | $\sum y = 29$ | $\sum xy = 96$ | $\sum x^2 = 30$ | $\sum x^2 y = 338$ | $\sum x^3 = 100$ | $\sum x^4 = 354$ |
We arrange these sums into the matrix equation $A\boldsymbol{\beta} = B$:

$$\begin{bmatrix} n & \sum x & \sum x^2 \\ \sum x & \sum x^2 & \sum x^3 \\ \sum x^2 & \sum x^3 & \sum x^4 \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{bmatrix} = \begin{bmatrix} \sum y \\ \sum xy \\ \sum x^2 y \end{bmatrix}$$

Substituting the values from our table (with $n = 4$ data points):

$$\begin{bmatrix} 4 & 10 & 30 \\ 10 & 30 & 100 \\ 30 & 100 & 354 \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{bmatrix} = \begin{bmatrix} 29 \\ 96 \\ 338 \end{bmatrix}$$

By solving the matrix equation (calculating $\boldsymbol{\beta} = A^{-1}B$), we get $\beta_0 = -0.75$, $\beta_1 = 0.95$, $\beta_2 = 0.75$.

The Final Best-Fit Equation:

$$y = -0.75 + 0.95x + 0.75x^2$$
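As a cross-check, the same system of sums can be solved numerically. NumPy is an assumption here; the worked example itself is plain algebra:

```python
import numpy as np

# Normal-equation system assembled from the table of sums (n = 4 points).
A = np.array([[4,   10,  30],
              [10,  30, 100],
              [30, 100, 354]], dtype=float)
B = np.array([29, 96, 338], dtype=float)

beta = np.linalg.solve(A, B)  # [beta_0, beta_1, beta_2]
print(beta)                   # -> [-0.75  0.95  0.75]
```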
Follow these steps to set up the environment and run the Polynomial Regression model on your local machine.
Ensure you have Python installed. You will need the following libraries:
```bash
pip install pandas matplotlib scikit-learn
```
Keep both files in the same directory:
- `polynomial_regression.py` (The main logic)
- `predict.csv` (The training dataset)
Open your terminal or command prompt, navigate to the folder, and run:
```bash
python polynomial_regression.py
```
Once the script runs, it will ask for input:
- Enter Hours: The program will prompt you to enter the number of study hours (e.g., `5`).
- Output: It will display the predicted marks in the console.
- Visualization: A graph will pop up showing the relationship between Study Hours and Marks.
Here is the breakdown of how the model was developed to predict Student Marks based on Study Hours.
We use Pandas to load the dataset.
- Action: Read `predict.csv` into a DataFrame.
- Dataset columns: `Hours` (Independent Variable) and `Marks` (Dependent Variable).
- Reshaping: The input `Hours` is reshaped into a 2D array (`[[...]]`) because Scikit-Learn expects a matrix format.
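The reshaping step looks like this in NumPy (a minimal sketch; the values are made up):

```python
import numpy as np

hours = np.array([1, 2, 3, 4, 5])   # shape (5,): a flat 1-D array
X = hours.reshape(-1, 1)            # shape (5, 1): 5 samples, 1 feature
print(X.shape)  # -> (5, 1)
```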
The relationship between study hours and marks isn't a straight line (marks plateau as hours increase). To fit this curve, we use PolynomialFeatures.
- Logic: We convert the single feature `Hours` into a polynomial set: `Hours`, `Hours^2`.
- Code Concept:

  ```python
  poly = PolynomialFeatures(degree=2)
  x_poly = poly.fit_transform(x)
  ```
- Before: `[5]` (just the hours)
- After: `[1, 5, 25]` (a constant bias term, the hours, and the hours squared). This allows the model to learn the curved pattern.
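Running the transform on a single sample shows exactly what the model receives. Note that Scikit-Learn's default `include_bias=True` also prepends a constant 1 column:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.array([[5.0]])                 # one sample, one feature: 5 hours
poly = PolynomialFeatures(degree=2)
x_poly = poly.fit_transform(x)
print(x_poly)  # -> [[ 1.  5. 25.]]  (bias, hours, hours^2)
```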
We fit a standard Linear Regression model on the transformed data.
- Math: The model learns how both the raw hours and the squared hours affect the marks ($y = \beta_0 + \beta_1 \cdot \text{Hours} + \beta_2 \cdot \text{Hours}^2$).
When the user enters a value (e.g., 8 hours):
- Transform: The code converts `8` into `[1, 8, 64]`.
- Predict: The model calculates the likely marks using these values.
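The loading, transforming, training, and prediction steps can be condensed into a self-contained sketch. The hours and marks below are hypothetical stand-ins for the contents of `predict.csv`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical stand-in for predict.csv: marks plateau as hours increase.
hours = np.array([1, 2, 3, 4, 5, 6, 7, 8]).reshape(-1, 1)
marks = np.array([20, 35, 50, 62, 72, 80, 85, 88])

poly = PolynomialFeatures(degree=2)
x_poly = poly.fit_transform(hours)        # columns: 1, hours, hours^2

model = LinearRegression().fit(x_poly, marks)

# Predicting a new value uses the same transform-then-predict steps.
new_x = poly.transform([[8]])             # -> [[1., 8., 64.]]
print(model.predict(new_x))
```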
Finally, we use Matplotlib to visualize the result.
- Scatter Plot: Shows the actual student data (Hours vs Marks).
- Curve Line (Red): Shows the polynomial regression curve, demonstrating how the model fits the data better than a straight line.