📈 What Is Linear Regression?
Linear regression models how a dependent variable (for example, sales) changes as an independent variable (for example, advertising spend) changes, by fitting a straight line to the data.

Equation Format
$$
y = mx + b
$$

y is the target variable

x is the input feature

m is the slope (effect of x on y)

b is the intercept (where the line crosses the y-axis, i.e. the value of y when x = 0)
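
As a minimal sketch of the equation above, the snippet below recovers m and b from a handful of made-up points using NumPy's `np.polyfit`. The data and variable names are illustrative assumptions and are not taken from the housing dataset used later.

```python
import numpy as np

# Made-up points that lie exactly on the line y = 2x + 1 (illustrative data only)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# A degree-1 polynomial fit returns the slope first, then the intercept
m, b = np.polyfit(x, y, deg=1)

print(f"slope m = {m:.2f}, intercept b = {b:.2f}")
```

Because the points sit exactly on a line, the fitted slope and intercept match the values used to generate them; with noisy data they would only be approximations.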

In [0]:
# Load the dataset
from sklearn.datasets import fetch_california_housing
import pandas as pd

What’s happening:

* fetch_california_housing() loads a dataset with housing features and prices.

* housing.data contains the features (like income, age, rooms).

* housing.target contains the house prices.

* We create a pandas DataFrame df with all this data.

In [0]:
housing = fetch_california_housing()
df = pd.DataFrame(housing.data, columns=housing.feature_names)
df['target'] = housing.target


In [0]:
display(df.head())
print(df.shape)

X contains all columns except 'target'.

y is the column we want to predict.

We split the data into training (80%) and testing (20%) sets.

test_size=0.2 means 20% of the data is used for testing, and 80% for training.

random_state=42 ensures the split is reproducible (same result every time).
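
To make the effect of `test_size` and `random_state` concrete, here is a small sketch on synthetic arrays. The `_demo` names and the array shapes are assumptions chosen for illustration, so the notebook's own `X` and `y` are left untouched.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data: 100 rows, 3 features (assumed shapes)
X_demo = np.arange(300).reshape(100, 3)
y_demo = np.arange(100)

# Same settings as described above: 20% test, fixed seed
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same random_state yields the identical split
_, X_te_again, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(np.array_equal(X_te, X_te_again))
```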


In [0]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Split data
X = df.drop('target', axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model
model = LinearRegression()
model.fit(X_train, y_train)


R², or coefficient of determination, measures how well your model’s predictions match the actual data.

🔹 Intuition:
It tells you how much of the variation in the target variable (house prices) is explained by your model.

It’s like asking: “How much better is my model than always predicting the average price?”

🎯 R² Score Range

| R² Value | Meaning |
|----------|---------|
| 1.0 | Perfect prediction (model explains all variation) |
| 0.0 | Model explains none of the variation |
| < 0.0 | Model is worse than just predicting the average |

🧮 Formula (Conceptual)
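$$
R^2 = 1 - \frac{\text{SSE}}{\text{TSS}} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
$$

Sum of Squared Errors (SSE): Sum of the squared differences between actual and predicted values.

Total Sum of Squares (TSS): Sum of the squared differences between the actual values and their mean, i.e. the total variation in the actual values.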
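
The R² formula can be checked by hand on a few numbers. The sketch below uses made-up values (an assumption for illustration) and compares the manual result with scikit-learn's `r2_score`; it also shows that always predicting the mean gives a score of 0.

```python
import numpy as np
from sklearn.metrics import r2_score

# Small made-up example (values are illustrative only)
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_demo = np.array([2.5, 5.5, 7.0, 8.0])

sse = np.sum((y_true_demo - y_pred_demo) ** 2)         # unexplained variation
tss = np.sum((y_true_demo - y_true_demo.mean()) ** 2)  # total variation
r2_manual = 1 - sse / tss

print(r2_manual, r2_score(y_true_demo, y_pred_demo))

# Baseline: predicting the mean for every sample explains none of the variation
mean_pred = np.full_like(y_true_demo, y_true_demo.mean())
r2_mean = r2_score(y_true_demo, mean_pred)
print(r2_mean)
```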

What the training cell above does:

Creates a linear regression model.

Fits it to the training data so it learns the relationship between features and house prices.

In [0]:
y_pred = model.predict(X_test)

# RMSE is the square root of MSE; the squared=False argument is deprecated in recent scikit-learn releases
rmse = mean_squared_error(y_test, y_pred) ** 0.5
r2 = r2_score(y_test, y_pred)

print(f"Root Mean Squared Error: {rmse:.2f}")
print(f"R² Score: {r2:.3f}")


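y_pred holds the predicted prices for the test set.

RMSE is the typical size of a prediction error, in the same units as the target (for this dataset, hundreds of thousands of dollars), so lower is better.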

R² shows how well the model explains the variance in the data (closer to 1 is better).
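
For intuition about RMSE, the sketch below computes it step by step on made-up numbers (an illustrative assumption) and compares it with the mean absolute error, which weights large misses less heavily.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up actual and predicted values (illustrative only)
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_demo = np.array([2.0, 5.0, 8.0, 11.0])

errors = y_true_demo - y_pred_demo
rmse_manual = np.sqrt(np.mean(errors ** 2))  # square, average, then take the root
mae_manual = np.mean(np.abs(errors))         # plain average of absolute errors

# Same value obtained from scikit-learn's MSE
rmse_sklearn = np.sqrt(mean_squared_error(y_true_demo, y_pred_demo))

print(f"RMSE = {rmse_manual:.4f}, MAE = {mae_manual:.4f}")
```

RMSE comes out larger than MAE here because squaring gives the single error of 2 more weight than the smaller ones.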

In [0]:
import matplotlib.pyplot as plt

plt.scatter(y_test, y_pred, alpha=0.5)
plt.xlabel("Actual Values")
plt.ylabel("Predicted Values")
plt.title("Linear Regression Predictions")
plt.show()


In [0]:
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score

ridge = Ridge()
# Candidate regularization strengths, sorted and with the duplicate 0.1 entry removed
param = {'alpha': [1e-10, 1e-5, 1e-3, 1e-2, 0.1, 0.2, 0.4, 0.5, 1, 2, 3]}

ridgeregressor = GridSearchCV(ridge, param, scoring='neg_mean_squared_error', cv=5)

ridgeregressor.fit(X_train, y_train)




🧠 What Is Ridge Regression?
Ridge Regression is a type of linear regression that adds L2 regularization to the cost function. It helps prevent overfitting by discouraging the model from assigning large weights to any one feature.

Ridge Cost Function:
$$
J(\theta) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \alpha \sum_{j=1}^{p} \theta_j^2
$$

First term: Regular Mean Squared Error (how wrong the predictions are).

Second term: L2 penalty (sum of squared coefficients).

α (alpha): Regularization strength.

If α = 0: Ridge becomes regular linear regression.

If α is large: Coefficients shrink more, reducing model complexity.
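
The shrinking effect of α can be seen on a small synthetic problem. The sketch below is an illustration under assumed data (random features and made-up true weights), not the housing model. Note that scikit-learn's `Ridge` uses the sum of squared errors rather than the mean in its objective, so its `alpha` values are on a different scale from the formula above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic data with known weights (assumed for illustration)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([4.0, -3.0, 2.0]) + rng.normal(scale=0.5, size=50)

# Size of the coefficient vector for plain least squares
ols_norm = np.linalg.norm(LinearRegression().fit(X_demo, y_demo).coef_)
print(f"OLS coefficient norm = {ols_norm:.4f}")

# The coefficient norm shrinks as alpha grows
norms = {}
for alpha in [0.01, 1.0, 100.0, 10000.0]:
    coef = Ridge(alpha=alpha).fit(X_demo, y_demo).coef_
    norms[alpha] = np.linalg.norm(coef)
    print(f"alpha={alpha:>8}: coefficient norm = {norms[alpha]:.4f}")
```

With a very small alpha the result is close to ordinary least squares, and with a very large alpha the coefficients approach zero.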


In [0]:
print(ridgeregressor.best_score_)
print(ridgeregressor.best_params_)


In [0]:
ridge = Ridge(alpha=3.0)
ridge.fit(X_train, y_train)
y_pred_ridge = ridge.predict(X_test)

r2_ridge = r2_score(y_test, y_pred_ridge)
print(f"R² Score (Ridge): {r2_ridge:.3f}")

In [0]:
plt.scatter(y_test, y_pred_ridge, alpha=0.7)
plt.xlabel("Actual Values")
plt.ylabel("Predicted Values")
plt.title("Ridge Regression Predictions")
plt.show()

In [0]:
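from sklearn.linear_model import Lasso

lasso = Lasso()  # estimator to tune; reuses the alpha grid `param` from the Ridge search
lassoregressor = GridSearchCV(lasso, param, scoring='neg_mean_squared_error', cv=5)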

lassoregressor.fit(X_train, y_train)

🧠 What Is Lasso Regression?
Lasso stands for Least Absolute Shrinkage and Selection Operator. It’s a type of linear regression that adds L1 regularization to the cost function.

Unlike Ridge, which shrinks coefficients toward zero but rarely makes them exactly zero, Lasso can shrink some coefficients all the way to zero, effectively removing those features from the model. This makes it useful for feature selection.

Lasso Cost Function:

$$
J(\theta) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \alpha \sum_{j=1}^{p} |\theta_j|
$$

First term: Mean Squared Error (how wrong the predictions are).

Second term: L1 penalty (sum of absolute values of coefficients).

α (alpha): Regularization strength.

Larger α → more aggressive shrinking.

Smaller α → behaves more like regular linear regression.
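
The feature-selection behaviour can be sketched on synthetic data where only two of five features matter. The data, the true weights, and the choice of `alpha=0.5` are assumptions made for illustration. scikit-learn's `Lasso` also scales the error term by 1/(2n), so its `alpha` is not numerically identical to the α in the formula above.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
# Only the first two features influence the target; the other three are pure noise
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

lasso_demo = Lasso(alpha=0.5).fit(X_demo, y_demo)
ridge_demo = Ridge(alpha=0.5).fit(X_demo, y_demo)

# Lasso sets the irrelevant coefficients to exactly zero; Ridge only makes them small
print("Lasso coefficients:", np.round(lasso_demo.coef_, 3))
print("Ridge coefficients:", np.round(ridge_demo.coef_, 3))
```

The Lasso coefficients for the two relevant features are also pulled toward zero, which is the price paid for the sparsity.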

In [0]:
print(lassoregressor.best_score_)
print(lassoregressor.best_params_) 

In [0]:
lasso = Lasso(alpha=1e-05)
lasso.fit(X_train, y_train)
y_pred_lasso = lasso.predict(X_test)

r2_lasso = r2_score(y_test, y_pred_lasso)
print(f"R² Score (Lasso): {r2_lasso:.3f}")


In [0]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(X_train, y_train)
y_pred_poly = poly_model.predict(X_test)

r2_poly = r2_score(y_test, y_pred_poly)
print(f"R² Score (Polynomial): {r2_poly:.3f}")


🧠 What Is Polynomial Regression?
Polynomial Regression is an extension of Linear Regression that allows the model to fit non-linear relationships between the features and the target variable.

🔍 Why Use It?
Linear regression fits a straight line.

Polynomial regression fits a curve by adding powers of the input features.

🧮 Mathematical Form
For a single feature $x$, a polynomial regression of degree 2 looks like:

$$
\hat{y} = \theta_0 + \theta_1 x + \theta_2 x^2
$$
For degree 3:
$$
\hat{y} = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3
$$

And so on.
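
The degree-2 form can be reproduced with `PolynomialFeatures` on a single feature. The sketch below first shows the expanded columns and then fits noise-free made-up data generated from y = 1 + 2x + 3x². The data and the `_demo` names are illustrative assumptions, and `include_bias=False` is used so that the intercept is handled by `LinearRegression` alone.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Expansion of a single feature x into the columns [1, x, x^2]
expanded = PolynomialFeatures(degree=2).fit_transform(np.array([[2.0], [3.0]]))
print(expanded)

# Noise-free made-up data from y = 1 + 2x + 3x^2
x_demo = np.linspace(-3, 3, 50).reshape(-1, 1)
y_demo = 1 + 2 * x_demo.ravel() + 3 * x_demo.ravel() ** 2

quad_model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)
quad_model.fit(x_demo, y_demo)

# make_pipeline names each step after its lowercased class name
lin = quad_model.named_steps["linearregression"]
print(lin.intercept_, lin.coef_)
```

The fitted intercept and coefficients correspond to θ₀, θ₁ and θ₂ in the degree-2 equation above.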

In [0]:
display(y_test,y_pred_poly)

In [0]:
plt.scatter(y_test, y_pred_poly, alpha=0.7)
plt.xlabel("Actual Values")
plt.ylabel("Predicted Values")
plt.title("Polynomial Regression Predictions")
plt.show()

In [0]:
# #Initialize Spark Session
# spark= SparkSession.builder.appName("Linear Regression").getOrCreate()

# #Fetch the dataset
# california_housing = fetch_california_housing()

# data= [tuple(list(row)+[target]) for row,target in zip(california_housing.data,california_housing.target)]

# #define column names
# columns = list(california_housing.feature_names)+["target"]

# c_schema = StructType([
#     StructField("MedInc", DoubleType(), True),
#     StructField("HouseAge", DoubleType(), True),
#     StructField("AveRooms", DoubleType(), True),
#     StructField("AveBedrms", DoubleType(), True),
#     StructField("Population", DoubleType(), True),
#     StructField("AveOccup", DoubleType(), True),
#     StructField("Latitude", DoubleType(), True),
#     StructField("Longitude", DoubleType(), True),
#     StructField("target", DoubleType(), True)
# ])

In [0]:
# spark_df = spark.createDataFrame(data,schema= c_schema)

In [0]:

# from pyspark.ml.connect.regression import LinearRegression



# # Initialize the Linear Regression model without arguments
# lr = LinearRegression()

# # Set the parameters using setParams
# lr.setParams(featuresCol='features', labelCol='target')

# # Fit the model
# lr_model = lr.fit(spark_df)

# # Make predictions
# predictions = lr_model.transform(spark_df)

# # Display predictions
# display(predictions)
