Linear regression is one of the simplest and most widely used machine learning algorithms. It aims to model the relationship between a dependent variable (target) and one or more independent variables (features) by fitting a linear equation to observed data. In this example, we'll use `scikit-learn` to perform a simple linear regression.
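
For a single feature, the linear equation and the least-squares criterion that ordinary linear regression minimizes can be written as below. The symbols $\beta_0$ (intercept), $\beta_1$ (slope), and $n$ (number of observations) are notation assumed here for illustration.

$$
y \approx \beta_0 + \beta_1 X,
\qquad
(\hat{\beta}_0, \hat{\beta}_1) = \arg\min_{\beta_0,\, \beta_1} \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 X_i \right)^2
$$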

# Importing necessary libraries

To begin, we need to import the necessary libraries. We'll use `scikit-learn` for the linear regression model, `numpy` to generate some synthetic data, and `matplotlib` to visualize the results.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

In this example, we import `LinearRegression` from `sklearn.linear_model` to build the regression model. We also import `train_test_split` to split our dataset into training and testing sets, and `mean_squared_error` to evaluate the model.

# Generating synthetic data

Next, let's create a synthetic dataset. We'll generate a simple linear relationship between `X` (the independent variable) and `y` (the dependent variable) and add some noise to simulate real-world data.

In [None]:
# Generate random data
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)  # y = 4 + 3X + noise

# Plot the data
plt.scatter(X, y)
plt.title("Synthetic Data for Linear Regression")
plt.xlabel("X")
plt.ylabel("y")
plt.show()

Here, we use `numpy` to create an array `X` of 100 random points between 0 and 2. The corresponding `y` values are generated using the linear equation $y = 4 + 3X$ with some added noise. The scatter plot gives us a visual representation of the data we'll use for our regression model.

# Splitting the data into training and testing sets

Before training the model, it's important to split the dataset into training and testing sets. This allows us to evaluate the model on data it hasn't seen during training, providing a more realistic assessment of its performance.

In [None]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

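Using `train_test_split`, we hold out 20% of the data for testing and train on the remaining 80%. A held-out test set does not prevent overfitting by itself, but it lets us detect it by measuring how well the model generalizes to unseen data. Setting `random_state=42` makes the split reproducible.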
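
As a quick sanity check, the sizes of the resulting arrays can be inspected. The sketch below recreates the synthetic data so that it runs on its own; the `split_shapes` dictionary is a name introduced here purely for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Recreate the synthetic data from the earlier cell
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

# Same split as above: 20% of the rows go to the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Collect the shapes so the 80/20 split is easy to verify (illustrative helper)
split_shapes = {
    "X_train": X_train.shape,
    "X_test": X_test.shape,
    "y_train": y_train.shape,
    "y_test": y_test.shape,
}
print(split_shapes)
```

With 100 rows and `test_size=0.2`, 80 rows end up in the training set and 20 in the test set.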

# Training the linear regression model

Now that we have our data ready, we can train the linear regression model using the training set. The model will attempt to learn the linear relationship between `X_train` and `y_train`.

In [None]:
# Create and train the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

We initialize the `LinearRegression` model and use the `fit()` method to train it on the training data. `fit()` estimates the slope (stored in `model.coef_`) and the intercept (stored in `model.intercept_`) that minimize the sum of squared residuals on the training data.
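
The fitted parameters can be inspected and cross-checked against a direct least-squares solve. This is a minimal sketch that recreates the data so it is self-contained; the design matrix `A` and the vector `theta` are names assumed here for illustration, and `np.linalg.lstsq` is used only as an independent check, not as a description of how `scikit-learn` works internally.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Recreate the data and the split from the earlier cells
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)

# y is 2-D here, so coef_ has shape (1, 1) and intercept_ has shape (1,)
slope = model.coef_[0, 0]
intercept = model.intercept_[0]
print(f"Slope: {slope:.3f}, Intercept: {intercept:.3f}")

# Independent check: solve the same least-squares problem with NumPy.
# A column of ones is prepended so the first entry of theta is the intercept.
A = np.hstack([np.ones((X_train.shape[0], 1)), X_train])
theta, *_ = np.linalg.lstsq(A, y_train, rcond=None)
print(f"lstsq intercept: {theta[0, 0]:.3f}, lstsq slope: {theta[1, 0]:.3f}")
```

Both approaches should agree to numerical precision, and the estimates should land near the true values of 3 (slope) and 4 (intercept) used to generate the data, though not exactly on them because of the noise.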

# Making predictions on the test set

Once the model is trained, we can use it to make predictions on the test set. This will allow us to see how well the model performs on new, unseen data.

In [None]:
# Make predictions on the test set
y_pred = model.predict(X_test)

Here, we use the `predict()` method to generate predictions on the test data (`X_test`). These predictions will be compared to the actual values (`y_test`) to measure the model's prediction error.
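
For a fitted linear model, `predict()` amounts to evaluating the learned linear equation. The sketch below recreates the earlier steps and reproduces the predictions by hand; `manual_pred` is a name introduced here for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Recreate the data, the split, and the fitted model from the earlier cells
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)

y_pred = model.predict(X_test)

# Evaluate intercept + slope * X by hand; coef_ has shape (1, 1) here
manual_pred = model.intercept_ + X_test @ model.coef_.T

print(np.allclose(manual_pred, y_pred))
```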

# Evaluating the model's performance

To assess the quality of our linear regression model, we'll compute the mean squared error (MSE) on the test set. This metric measures how close the predicted values are to the actual values, with a lower MSE indicating better performance.

In [None]:
# Evaluate the model's performance
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")

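The `mean_squared_error()` function computes the average of the squared differences between the predicted and actual values, so it is expressed in squared units of `y`. Because the noise added to `y` has variance 1, a test MSE in the neighborhood of 1 suggests the model has captured the underlying relationship about as well as the data allow.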
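
To make the definition concrete, the same quantity can be computed directly with `numpy`. This is a self-contained sketch that repeats the earlier steps; `manual_mse` and `rmse` are names introduced here, and the root mean squared error is included only as a convenient way to read the error in the original units of `y`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Recreate the data, the split, and the fitted model from the earlier cells
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# MSE from scikit-learn
mse = mean_squared_error(y_test, y_pred)

# MSE by hand: mean of the squared residuals
manual_mse = np.mean((y_test - y_pred) ** 2)

# Root mean squared error, in the same units as y
rmse = np.sqrt(mse)

print(f"sklearn MSE: {mse:.4f}")
print(f"manual MSE:  {manual_mse:.4f}")
print(f"RMSE:        {rmse:.4f}")
```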

# Visualizing the model's predictions

Finally, we can visualize the model’s predictions by plotting the regression line along with the original data points. This helps us see how well the model fits the data.

In [None]:
# Plot the regression line and the test data
plt.scatter(X_test, y_test, label="Test data")
plt.plot(X_test, y_pred, color='red', label="Regression line")
plt.title("Linear Regression: Test Data vs Predictions")
plt.xlabel("X")
plt.ylabel("y")
plt.legend()
plt.show()

In this plot, the blue points represent the actual test data, while the red line represents the model's predictions. Ideally, the regression line should follow the trend of the test data, showing that the model has learned the underlying relationship.

# Summary

In this example, we walked through the process of building a simple linear regression model using `scikit-learn`. We generated synthetic data, split it into training and testing sets, and trained a linear regression model. We then evaluated the model's performance using mean squared error and visualized the results. Linear regression is a foundational algorithm that introduces key concepts such as training/testing splits, model fitting, and performance evaluation, which are central to machine learning.
