In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

In [None]:
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target

print("Dataset Info:")
print(f"Shape: {df.shape}")
print(f"\nMissing values: {df.isnull().sum().sum()}")
print("\nBasic statistics:")
print(df.describe())

# Correlations between the four measurements; iloc[:, :-1] excludes the class label column.
plt.figure(figsize=(8, 6))
sns.heatmap(df.iloc[:, :-1].corr(), annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Feature Correlation Matrix')
plt.show()

In [None]:
# Predict petal length from the other three measurements; the class label is dropped as well.
X = df.drop(['petal length (cm)', 'target'], axis=1)
y = df['petal length (cm)']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit the scaler on the training split only, then reuse its statistics on the test split.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"Training samples: {X_train.shape[0]}")
print(f"Testing samples: {X_test.shape[0]}")

In [None]:
model_no_pca = LinearRegression()
model_no_pca.fit(X_train_scaled, y_train)
y_pred_no_pca = model_no_pca.predict(X_test_scaled)

r2_no_pca = r2_score(y_test, y_pred_no_pca)
rmse_no_pca = np.sqrt(mean_squared_error(y_test, y_pred_no_pca))
mae_no_pca = mean_absolute_error(y_test, y_pred_no_pca)

print("=== Model WITHOUT PCA ===")
print(f"R² Score: {r2_no_pca:.4f}")
print(f"RMSE: {rmse_no_pca:.4f}")
print(f"MAE: {mae_no_pca:.4f}")

In [None]:
# Learn the two principal directions from the training data, then project both splits.
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

print(f"Explained variance by each component: {pca.explained_variance_ratio_}")
print(f"Total variance retained: {pca.explained_variance_ratio_.sum():.4f}")

plt.figure(figsize=(8, 5))
plt.bar(['PC1', 'PC2'], pca.explained_variance_ratio_, color=['steelblue', 'coral'])
plt.ylabel('Variance Explained')
plt.title('Variance Explained by Principal Components')
plt.show()

In [None]:
model_with_pca = LinearRegression()
model_with_pca.fit(X_train_pca, y_train)
y_pred_pca = model_with_pca.predict(X_test_pca)

r2_pca = r2_score(y_test, y_pred_pca)
rmse_pca = np.sqrt(mean_squared_error(y_test, y_pred_pca))
mae_pca = mean_absolute_error(y_test, y_pred_pca)

print("\n=== Model WITH PCA ===")
print(f"R² Score: {r2_pca:.4f}")
print(f"RMSE: {rmse_pca:.4f}")
print(f"MAE: {mae_pca:.4f}")

In [None]:
comparison = pd.DataFrame({
    'Metric': ['R²', 'RMSE', 'MAE'],
    'Without PCA': [r2_no_pca, rmse_no_pca, mae_no_pca],
    'With PCA': [r2_pca, rmse_pca, mae_pca]
})

print("\n=== Performance Comparison ===")
print(comparison.to_string(index=False))

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
metrics = ['R²', 'RMSE', 'MAE']
colors = ['steelblue', 'coral']

for i, metric in enumerate(metrics):
    values = [comparison.loc[i, 'Without PCA'], comparison.loc[i, 'With PCA']]
    axes[i].bar(['Without PCA', 'With PCA'], values, color=colors)
    axes[i].set_title(f'{metric} Comparison')
    axes[i].set_ylabel(metric)
    axes[i].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

Good afternoon, ma’am. This program applies Linear Regression and Principal Component Analysis (PCA) to the Iris dataset to study how PCA affects model performance.

First, I import all the required Python libraries. I use numpy and pandas for data handling, matplotlib.pyplot and seaborn for visualizations, and from sklearn, I import load_iris for the dataset, train_test_split for dividing the data, StandardScaler for scaling, PCA for dimensionality reduction, LinearRegression for the model, and three performance metrics — mean_squared_error, mean_absolute_error, and r2_score.

Next, I load the built-in Iris dataset using load_iris() and store it in a DataFrame. I assign the four feature columns — sepal length, sepal width, petal length, and petal width — and add a fifth column named target, which contains numeric class labels representing the three iris species.
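To make the numeric labels easier to read, I could also map them to species names. This is only a small sketch: the species column is a helper I am adding for illustration and is not part of the program above.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target

# 'species' is a hypothetical helper column, added only for this illustration.
# from_codes maps each integer label (0, 1, 2) to the matching name in target_names.
df['species'] = pd.Categorical.from_codes(iris.target, categories=iris.target_names)

# Count how many rows belong to each species.
species_counts = df['species'].value_counts()
print(species_counts)
```

Each of the three species should appear the same number of times, which confirms the classes are balanced.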

I print the dataset information including its shape, total number of missing values, and descriptive statistics like mean, minimum, and maximum values using df.describe(). This helps in understanding the overall data quality and feature distributions.

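Then, I plot a heatmap of feature correlations using sns.heatmap, leaving out the target column. This shows how strongly each feature is related to the others. For example, petal length and petal width are highly correlated, which suggests that petal width will be a strong predictor of petal length. The remaining features are also correlated with one another, and that redundancy is what PCA later handles by combining them into principal components.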
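If I want the strongest relationships as numbers rather than colors, I can rank the feature pairs by absolute correlation. This is a minimal sketch; the names upper, pairs, and petal_corr are my own and do not appear in the program.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
corr = df.corr()

# Keep only the upper triangle so that each feature pair appears once.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna()

# Order the pairs by the absolute strength of their correlation.
pairs = pairs.reindex(pairs.abs().sort_values(ascending=False).index)
print(pairs.round(2))

# The single value discussed in the heatmap.
petal_corr = corr.loc['petal length (cm)', 'petal width (cm)']
print(f"petal length vs petal width: {petal_corr:.2f}")
```

The pair at the top of this list is the one that stands out most in the heatmap.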

After that, I prepare the features (X) and the target (y). Here, I set the target variable as petal length (cm) because I want to predict it using the remaining features. So I drop the petal length (cm) column and also the target column from X, keeping only sepal length, sepal width, and petal width as predictors.

Next, I split the dataset into training and testing sets using train_test_split with a test size of 30% and a random state of 42 for reproducibility. I print the number of samples in both sets.
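As a quick check of what the random state does, I can run the same split twice and confirm that I get identical rows. This sketch uses the full feature matrix for simplicity, since the split sizes depend only on the number of rows; the variable names ending in _again are my own.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First split: 30% of the 150 rows go to the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Second split with the same random_state, to check reproducibility.
X_train_again, X_test_again, _, _ = train_test_split(
    X, y, test_size=0.3, random_state=42
)

same_split = bool((X_test == X_test_again).all())
print(f"Training samples: {X_train.shape[0]}")
print(f"Testing samples: {X_test.shape[0]}")
print(f"Same test rows on both runs: {same_split}")
```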

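Before training the model, I scale the features using StandardScaler(). Scaling matters mainly for PCA, which looks for the directions of largest variance and would otherwise be dominated by whichever feature has the largest numeric range. Ordinary linear regression gives the same predictions with or without scaling, but scaled inputs make the coefficients comparable. I fit the scaler on the training data only, so the training set ends up with zero mean and unit variance, and I apply the same transformation to the test set so that no test information leaks into preprocessing.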
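To see what scaling does and does not change, I can run a small check. This is a sketch that assumes plain ordinary least squares with an intercept, as in the program; the names pred_raw, pred_scaled, and same_predictions are my own.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X = iris.data[:, [0, 1, 3]]  # sepal length, sepal width, petal width
y = iris.data[:, 2]          # petal length
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fit on the training split only, then apply the same statistics to both splits.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Training columns are exactly standardized; test columns are only approximately so.
train_means = X_train_scaled.mean(axis=0)
train_stds = X_train_scaled.std(axis=0)
test_means = X_test_scaled.mean(axis=0)
print("train means:", train_means.round(6))
print("train stds: ", train_stds.round(6))
print("test means: ", test_means.round(3))

# Ordinary least squares predictions do not depend on this rescaling.
pred_raw = LinearRegression().fit(X_train, y_train).predict(X_test)
pred_scaled = LinearRegression().fit(X_train_scaled, y_train).predict(X_test_scaled)
same_predictions = np.allclose(pred_raw, pred_scaled)
print("Same predictions with and without scaling:", same_predictions)
```

So in this program the scaling step is there for the PCA stage; the regression without PCA would score the same on unscaled inputs.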

Now, I train the Linear Regression model without PCA. I create a model object model_no_pca, fit it using the scaled training data, and predict on the test data. Then I calculate three key metrics:

R² Score, which shows the proportion of variance in the target that the model explains,

RMSE (Root Mean Squared Error), which is the square root of the average squared error, so it is in the same units as the target and penalizes large errors more heavily, and

MAE (Mean Absolute Error), which measures the average absolute difference between predicted and actual values.

I print these results under “Model WITHOUT PCA.”
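To show what these three metrics compute, I can reproduce them by hand with numpy and compare against sklearn. The y_true and y_pred values below are made up for illustration and are not results from the program.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual and predicted petal lengths, for illustration only.
y_true = np.array([1.4, 4.5, 5.1, 1.5, 4.0])
y_pred = np.array([1.6, 4.2, 5.4, 1.3, 4.1])

residuals = y_true - y_pred

# MAE: average size of the errors, ignoring their sign.
mae = np.mean(np.abs(residuals))

# RMSE: square root of the average squared error.
rmse = np.sqrt(np.mean(residuals ** 2))

# R²: one minus the ratio of residual variation to total variation around the mean.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"R²: {r2:.4f}, RMSE: {rmse:.4f}, MAE: {mae:.4f}")
```

RMSE is never smaller than MAE on the same predictions, so a large gap between the two points to a few large errors.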

Next, I apply PCA (Principal Component Analysis) to reduce the three scaled predictors to two components using PCA(n_components=2). PCA transforms correlated features into new, uncorrelated ones called principal components while preserving as much variance as possible. I fit PCA on the training data only and then transform both the training and test sets. I print the explained variance ratio for each component and the total variance retained, which tells me what share of the variance in the scaled predictors is preserved after dimensionality reduction.
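Two of these claims can be verified directly: that the components are uncorrelated, and that the explained variance ratio comes from the eigenvalues of the covariance matrix. This is a sketch under the same split and scaling as the program; the names Z, off_diagonal, and manual_ratio are my own.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X = iris.data[:, [0, 1, 3]]  # sepal length, sepal width, petal width
y = iris.data[:, 2]          # petal length
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
X_train_scaled = StandardScaler().fit_transform(X_train)

pca = PCA(n_components=2).fit(X_train_scaled)
Z = pca.transform(X_train_scaled)

# The off-diagonal covariance between PC1 and PC2 should be numerically zero.
off_diagonal = np.cov(Z, rowvar=False)[0, 1]
print(f"Covariance between PC1 and PC2: {off_diagonal:.2e}")

# Eigenvalues of the feature covariance matrix, largest first.
eigenvalues = np.linalg.eigvalsh(np.cov(X_train_scaled, rowvar=False))[::-1]
manual_ratio = eigenvalues[:2] / eigenvalues.sum()
print("From eigenvalues:", manual_ratio.round(4))
print("From sklearn:    ", pca.explained_variance_ratio_.round(4))
```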

Then, I plot a bar chart to visualize the variance captured by each of the two components (PC1 and PC2). This helps confirm whether two components capture most of the information or not.
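One way to judge whether two components are enough is to fit PCA with all components and look at the cumulative variance. This is a sketch, not part of the program; pca_full and cumulative are names I am introducing here.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X = iris.data[:, [0, 1, 3]]  # sepal length, sepal width, petal width
y = iris.data[:, 2]          # petal length
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
X_train_scaled = StandardScaler().fit_transform(X_train)

# With n_components left unset, PCA keeps all three components.
pca_full = PCA().fit(X_train_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

for k, share in enumerate(cumulative, start=1):
    print(f"First {k} component(s): {share:.4f} of the variance")

# The second entry matches the total retained by PCA(n_components=2).
pca_two = PCA(n_components=2).fit(X_train_scaled)
two_component_total = pca_two.explained_variance_ratio_.sum()
```

The share left over after the second entry is what the two-component model discards.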

After applying PCA, I train another Linear Regression model on the reduced data. I fit the model, predict the test data, and calculate the same three metrics — R², RMSE, and MAE — this time “WITH PCA.”

To clearly compare both models, I store the results in a comparison DataFrame showing each metric for “Without PCA” and “With PCA.” I print this table, which makes it easy to see whether PCA improved or reduced the model’s performance.
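To interpret any gap in that table, it helps to know that PCA by itself does not change a linear regression; only dropping components does. The sketch below checks this under the same split and scaling as the program. The helper fit_predict is hypothetical and written only for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X = iris.data[:, [0, 1, 3]]  # sepal length, sepal width, petal width
y = iris.data[:, 2]          # petal length
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)


def fit_predict(n_components):
    """Hypothetical helper: PCA then regression; returns test predictions and training R²."""
    pca = PCA(n_components=n_components)
    Z_train = pca.fit_transform(X_train_scaled)
    Z_test = pca.transform(X_test_scaled)
    model = LinearRegression().fit(Z_train, y_train)
    return model.predict(Z_test), model.score(Z_train, y_train)


# Baseline: regression on the scaled features with no PCA.
pred_no_pca = LinearRegression().fit(X_train_scaled, y_train).predict(X_test_scaled)

pred_three, train_r2_three = fit_predict(3)  # keeps every component
pred_two, train_r2_two = fit_predict(2)      # drops the smallest one

# Keeping all components only rotates the features, so predictions match the baseline.
full_pca_matches = np.allclose(pred_three, pred_no_pca)
print("3-component PCA matches no-PCA predictions:", full_pca_matches)
print(f"Training R² with 3 components: {train_r2_three:.4f}")
print(f"Training R² with 2 components: {train_r2_two:.4f}")
```

So any difference between the two columns comes from the one component that was dropped, and on the training data the two-component fit cannot be better than the full one.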

Finally, I create a figure with three bar charts side by side, one each for R², RMSE, and MAE, comparing both models visually. The blue bars represent the results without PCA, and the coral bars show the results with PCA. A higher bar is better for R², while a lower bar is better for RMSE and MAE. This visualization gives a quick performance comparison between the two approaches.

In summary, this program uses the Iris dataset to demonstrate how PCA can reduce data dimensionality, here from three predictors to two, while retaining most of the variance. It shows how Linear Regression performs before and after PCA, and compares the results using standard metrics. This helps us understand whether PCA improves model performance, simplifies computation, or causes some information loss. Because PCA chooses its components without looking at the target, a small drop in accuracy after reduction would not be surprising. Overall, the program combines data preprocessing, dimensionality reduction, regression modeling, and result visualization in a complete machine learning workflow.