<a href="https://colab.research.google.com/github/PravinV001/Python/blob/Class_N_Colab/In_Class_Linear_Regression_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!gdown 1UpLnYA48Vy_lGUMMLG-uQE1gf_Je12Lh

Downloading...
From: https://drive.google.com/uc?id=1UpLnYA48Vy_lGUMMLG-uQE1gf_Je12Lh
To: /content/cars24-car-price-clean.csv


# Difference between `fit`, `transform`, and `fit_transform`

## 1. `fit`

The `fit` method learns or computes properties of the training data that are needed later for transformation or prediction. It does not modify the data it is given. For a machine learning model, this is where the learning happens; for a transformer, this is where statistics such as a column mean are computed and stored on the object.

#### SimpleImputer example with `fit`:
Imagine we have some data with missing values and we want to fill them with the mean value of the column. The `fit` method, when applied on this data, will compute the mean for each column and store it inside the imputer object.

In [None]:
from sklearn.impute import SimpleImputer
import numpy as np

In [None]:
data = [[1, 2],
        [np.nan, 3],
        [7, 6]]


imputer = SimpleImputer(strategy='mean')

imputer.fit(data)

# The imputer has now computed and stored the mean for each column.
print("Means:", imputer.statistics_)

Means: [4.         3.66666667]


In [None]:
data

[[1, 2], [nan, 3], [7, 6]]
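The same pattern holds for other transformers. As a minimal sketch, assuming `StandardScaler` as a second example (it is not used elsewhere in this notebook), `fit` stores the learned statistics in attributes ending with an underscore and leaves the input untouched:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Small illustrative array (hypothetical values, not from the cars24 data)
sample = np.array([[1.0, 2.0],
                   [3.0, 4.0],
                   [5.0, 6.0]])
sample_before = sample.copy()

scaler = StandardScaler()
scaler.fit(sample)  # only computes and stores statistics

# Learned properties live on the fitted object
print("Means:", scaler.mean_)
print("Scales:", scaler.scale_)

# The input array itself is unchanged by fit
print("Input unchanged:", np.array_equal(sample, sample_before))
```

Nothing is scaled until `transform` is called on the fitted object.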


## 2. `transform`
The `transform` method takes data and returns a transformed copy, using the properties learned during `fit`. The input itself is left unchanged. What the transformation does depends on the transformer: an imputer fills missing values, a scaler rescales columns, and so on.

#### SimpleImputer example with `transform`:
Using the previously `fit` imputer, you can now call `transform` on new data to fill missing values with the computed mean.


In [None]:
new_data = [[np.nan, 10],
            [5, np.nan],
            [6, 7]]

imputer.transform(new_data)

array([[ 4.        , 10.        ],
       [ 5.        ,  3.66666667],
       [ 6.        ,  7.        ]])

In [None]:
data

[[1, 2], [nan, 3], [7, 6]]
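Because `transform` relies on what `fit` stored, calling it on an object that has never been fitted should fail. The sketch below illustrates this with a fresh imputer (the name `unfitted` is only for illustration); scikit-learn raises `NotFittedError` in this situation:

```python
import numpy as np
from sklearn.exceptions import NotFittedError
from sklearn.impute import SimpleImputer

# A brand-new imputer that has not seen any data yet
unfitted = SimpleImputer(strategy="mean")

try:
    unfitted.transform([[np.nan, 10.0], [5.0, np.nan]])
    raised = False
except NotFittedError as err:
    # No means have been stored, so there is nothing to fill with
    raised = True
    print("transform failed:", type(err).__name__)
```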



## 3. `fit_transform`
The `fit_transform` method combines `fit` and `transform` in one call, which is convenient and, for some transformers, more efficient than calling the two separately. Use it on the training data only. For the test data, call `transform` on the already-fitted object: calling `fit_transform` on the test set as well would recompute the statistics from test data, which causes data leakage and applies a different transformation to train and test.

#### SimpleImputer example with `fit_transform`:
Here, the imputer will calculate the mean values and then immediately use them to fill the missing values.



In [None]:
data = [[1, 2],
        [np.nan, 3],
        [9, 100]]

imputer = SimpleImputer(strategy='mean')
filled_data = imputer.fit_transform(data)

print("Means:", imputer.statistics_)

Means: [ 5. 35.]


In [None]:
filled_data

array([[  1.,   2.],
       [  5.,   3.],
       [  9., 100.]])
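To make the leakage caution concrete, here is a minimal sketch of the usual train/test pattern. The arrays `X_train_demo` and `X_test_demo` are small made-up examples, not part of the dataset used later:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical train and test splits with missing values
X_train_demo = np.array([[1.0, 2.0],
                         [np.nan, 3.0],
                         [9.0, 100.0]])
X_test_demo = np.array([[np.nan, 10.0],
                        [5.0, np.nan]])

imputer_demo = SimpleImputer(strategy="mean")

# Learn the column means from the training data and fill it in one step
X_train_filled = imputer_demo.fit_transform(X_train_demo)

# Reuse the training means on the test data; do NOT refit here
X_test_filled = imputer_demo.transform(X_test_demo)

print("Training means:", imputer_demo.statistics_)
print(X_test_filled)
```

The missing test values are filled with 5 and 35, the means learned from the training rows, so no information from the test set influences the preprocessing.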

In [None]:
import pandas as pd
import numpy as np

# Create a dummy dataset with 2 numerical columns and 5 rows
dummy_data = {
    'Numerical_Col_1': np.random.randint(1, 100, 5),
    'Numerical_Col_2': np.random.rand(5) * 100
}
df_dummy = pd.DataFrame(dummy_data)

# Display the first 5 rows of the dummy DataFrame
display(df_dummy.head())

# Multicollinearity

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


In [None]:
df = pd.read_csv('cars24-car-price-clean.csv')
df.head()

Unnamed: 0,selling_price,year,km_driven,mileage,engine,max_power,age,make,model,Individual,Trustmark Dealer,Diesel,Electric,LPG,Petrol,Manual,5,>5
0,-1.111046,-0.801317,1.195828,0.045745,-1.310754,-1.15778,0.801317,-0.433854,-1.125683,1.248892,-0.098382,-0.985275,-0.020095,-0.056917,1.024622,0.495818,0.444503,-0.424728
1,-0.223944,0.45003,-0.737872,-0.140402,-0.537456,-0.360203,-0.45003,-0.327501,-0.333227,1.248892,-0.098382,-0.985275,-0.020095,-0.056917,1.024622,0.495818,0.444503,-0.424728
2,-0.915058,-1.42699,0.035608,-0.582501,-0.537456,-0.404885,1.42699,-0.327501,-0.789807,1.248892,-0.098382,-0.985275,-0.020095,-0.056917,1.024622,0.495818,0.444503,-0.424728
3,-0.892365,-0.801317,-0.409143,0.32962,-0.921213,-0.693085,0.801317,-0.433854,-0.905265,1.248892,-0.098382,-0.985275,-0.020095,-0.056917,1.024622,0.495818,0.444503,-0.424728
4,-0.182683,0.137194,-0.544502,0.760085,0.042999,0.010435,-0.137194,-0.246579,-0.013096,-0.80071,-0.098382,1.014945,-0.020095,-0.056917,-0.97597,0.495818,0.444503,-0.424728


In [None]:
X=df.drop('selling_price', axis=1)
y=df[['selling_price']]


In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
y_train = np.array(y_train)


print("Shape of X_train: ", X_train.shape)

print("Shape of y_train: ", y_train.shape)


print("Shape of X_test: ", X_test.shape)

print("Shape of y_test: ", y_test.shape)


Shape of X_train:  (15856, 17)
Shape of y_train:  (15856, 1)
Shape of X_test:  (3964, 17)
Shape of y_test:  (3964, 1)


Linear regression with `statsmodels`

In [None]:
# lin = LinearRegression()


# lin.fit(X_train, y_train)


In [None]:
import statsmodels.api as sm

X_sm = sm.add_constant(X_train)

model = sm.OLS(y_train, X_sm)  # OLS = ordinary least squares, the standard linear regression fit
results = model.fit()

print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.940
Model:                            OLS   Adj. R-squared:                  0.940
Method:                 Least Squares   F-statistic:                 1.562e+04
Date:                Mon, 12 Jan 2026   Prob (F-statistic):               0.00
Time:                        17:06:32   Log-Likelihood:                -149.33
No. Observations:               15856   AIC:                             332.7
Df Residuals:                   15839   BIC:                             463.1
Df Model:                          16                                         
Covariance Type:            nonrobust                                         
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
const                0.0004      0.002  

In [None]:
X_sm

Unnamed: 0,const,year,km_driven,mileage,engine,max_power,age,make,model,Individual,Trustmark Dealer,Diesel,Electric,LPG,Petrol,Manual,5,>5
14764,1.0,-0.488480,-0.003066,0.448288,-0.215410,-0.244030,0.488480,0.724475,0.548445,-0.800710,-0.098382,1.014945,-0.020095,-0.056917,-0.975970,0.495818,0.444503,-0.424728
9589,1.0,0.762867,-0.157762,2.070095,-0.439107,-0.538485,-0.762867,-0.433854,-0.254506,1.248892,-0.098382,1.014945,-0.020095,-0.056917,-0.975970,0.495818,0.444503,-0.424728
6502,1.0,0.450030,-0.235110,-1.894839,2.467028,1.639323,-0.450030,0.724475,1.944427,1.248892,-0.098382,1.014945,-0.020095,-0.056917,-0.975970,-2.016868,-2.249703,2.354446
15778,1.0,0.762867,-0.679861,0.273775,1.001426,2.007951,-0.762867,3.360773,3.010306,-0.800710,-0.098382,1.014945,-0.020095,-0.056917,-0.975970,-2.016868,0.444503,-0.424728
18565,1.0,1.075704,-0.815220,-0.582501,0.222343,0.517801,-1.075704,-0.327501,1.052258,-0.800710,-0.098382,-0.985275,-0.020095,-0.056917,1.024622,-2.016868,0.444503,-0.424728
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8385,1.0,0.762867,0.035608,0.511113,0.042999,0.018702,-0.762867,-0.246579,-0.139049,-0.800710,-0.098382,1.014945,-0.020095,-0.056917,-0.975970,0.495818,0.444503,-0.424728
2743,1.0,-0.488480,-0.157762,-1.087426,0.235842,0.153642,0.488480,-0.240799,-0.789807,1.248892,-0.098382,-0.985275,-0.020095,-0.056917,1.024622,0.495818,0.444503,-0.424728
9165,1.0,1.388540,-0.931242,-0.663941,0.949359,0.957920,-1.388540,-0.258140,3.010306,-0.800710,-0.098382,1.014945,-0.020095,-0.056917,-0.975970,-2.016868,0.444503,-0.424728
18958,1.0,0.450030,0.383674,-0.533638,-0.537456,0.120130,-0.450030,-0.258140,0.085567,-0.800710,-0.098382,-0.985275,-0.020095,-0.056917,1.024622,-2.016868,0.444503,-0.424728


In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor


vif = pd.DataFrame()
X_t = pd.DataFrame(X_train, columns=X_train.columns)
X_t.head()


Unnamed: 0,year,km_driven,mileage,engine,max_power,age,make,model,Individual,Trustmark Dealer,Diesel,Electric,LPG,Petrol,Manual,5,>5
14764,-0.48848,-0.003066,0.448288,-0.21541,-0.24403,0.48848,0.724475,0.548445,-0.80071,-0.098382,1.014945,-0.020095,-0.056917,-0.97597,0.495818,0.444503,-0.424728
9589,0.762867,-0.157762,2.070095,-0.439107,-0.538485,-0.762867,-0.433854,-0.254506,1.248892,-0.098382,1.014945,-0.020095,-0.056917,-0.97597,0.495818,0.444503,-0.424728
6502,0.45003,-0.23511,-1.894839,2.467028,1.639323,-0.45003,0.724475,1.944427,1.248892,-0.098382,1.014945,-0.020095,-0.056917,-0.97597,-2.016868,-2.249703,2.354446
15778,0.762867,-0.679861,0.273775,1.001426,2.007951,-0.762867,3.360773,3.010306,-0.80071,-0.098382,1.014945,-0.020095,-0.056917,-0.97597,-2.016868,0.444503,-0.424728
18565,1.075704,-0.81522,-0.582501,0.222343,0.517801,-1.075704,-0.327501,1.052258,-0.80071,-0.098382,-0.985275,-0.020095,-0.056917,1.024622,-2.016868,0.444503,-0.424728


In [None]:
vif['Features'] = X_t.columns
vif

Unnamed: 0,Features
0,year
1,km_driven
2,mileage
3,engine
4,max_power
5,age
6,make
7,model
8,Individual
9,Trustmark Dealer


In [None]:
X_t.shape[1]

17

In [None]:
for i in range(X_t.shape[1]):
  print(variance_inflation_factor(X_t.values, i))



inf
1.3210580673146801
3.19249135048289
6.249951505381583
4.999695054503285
inf
3.2165690096438606
5.769832008074061
1.0933166661280747
1.0200010690942567
16.95602626368789
1.2280529691670257
1.2395788668906436
17.872424949371236
1.7919621931037664
12.153167330579864
13.387359478692712


In [None]:
vif['Features'] = X_t.columns
vif['VIF'] = [variance_inflation_factor(X_t.values, i) for i in range(X_t.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif



Unnamed: 0,Features,VIF
0,year,inf
5,age,inf
13,Petrol,17.87
10,Diesel,16.96
16,>5,13.39
15,5,12.15
3,engine,6.25
7,model,5.77
4,max_power,5.0
6,make,3.22
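The `inf` values for `year` and `age` follow from the definition VIF = 1 / (1 - R²), where R² comes from regressing one feature on all the others. In the `df.head()` output above, `age` is the exact negative of `year`, so each is perfectly predicted by the other and R² equals 1. The sketch below reproduces this on synthetic data. The helper `manual_vif` is hypothetical and uses scikit-learn's `LinearRegression` with an intercept, whereas `variance_inflation_factor` in statsmodels, to my understanding, regresses on the columns exactly as given, so the numbers can differ slightly on real data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def manual_vif(X, i):
    """VIF of column i: regress it on the remaining columns, then 1 / (1 - R^2)."""
    others = np.delete(X, i, axis=1)
    target = X[:, i]
    r2 = LinearRegression().fit(others, target).score(others, target)
    # A perfect fit means the column is fully redundant: VIF is infinite
    return np.inf if np.isclose(r2, 1.0) else 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)

# Third column is the exact negative of the first, like age versus year
X_demo = np.column_stack([a, b, -a])

vifs = [manual_vif(X_demo, i) for i in range(X_demo.shape[1])]
print(vifs)
```

The first and third columns get an infinite VIF, while the independent middle column stays close to 1. Dropping either one of a perfectly collinear pair is enough to remove the problem.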


In [None]:
# Remove 'year' feature and recompute VIF
X_train_no_year = X_train.drop('year', axis=1)


vif_no_year = pd.DataFrame()
X_t_no_year = pd.DataFrame(X_train_no_year, columns=X_train_no_year.columns)
vif_no_year['Features'] = X_t_no_year.columns
vif_no_year['VIF'] = [variance_inflation_factor(X_t_no_year.values, i) for i in range(X_t_no_year.shape[1])]
vif_no_year['VIF'] = round(vif_no_year['VIF'], 2)
vif_no_year = vif_no_year.sort_values(by="VIF", ascending=False)
vif_no_year

Unnamed: 0,Features,VIF
12,Petrol,17.87
9,Diesel,16.96
15,>5,13.39
14,5,12.15
2,engine,6.25
6,model,5.77
3,max_power,5.0
5,make,3.22
1,mileage,3.19
4,age,1.96


In [None]:
from sklearn.datasets import fetch_california_housing
import pandas as pd

# Load the California Housing dataset
housing = fetch_california_housing(as_frame=True)

# Create a DataFrame from the features and the target variable
df = housing.frame

# Display the first 5 rows of the dataset
print("First 5 rows of the California Housing dataset:")
display(df.head())

# Display information about the dataset to see data types and non-null counts
print("\nDataset Information:")
df.info()

# Display a description of the features
print("\nDataset Description:")
print(housing.DESCR)

First 5 rows of the California Housing dataset:


Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseVal
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422



Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   MedInc       20640 non-null  float64
 1   HouseAge     20640 non-null  float64
 2   AveRooms     20640 non-null  float64
 3   AveBedrms    20640 non-null  float64
 4   Population   20640 non-null  float64
 5   AveOccup     20640 non-null  float64
 6   Latitude     20640 non-null  float64
 7   Longitude    20640 non-null  float64
 8   MedHouseVal  20640 non-null  float64
dtypes: float64(9)
memory usage: 1.4 MB

Dataset Description:
.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

:Number of Instances: 20640

:Number of Attributes: 8 numeric, predictive attributes and the target

:Attribute Information:
    - MedInc        median income in block group
    - HouseAge      median house age in block group
 

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df[df.columns.difference(['MedHouseVal'])],
                                                    df['MedHouseVal'],
                                                    test_size=0.2)


In [None]:
from sklearn.linear_model import LinearRegression

# Initialize the Linear Regression model
model = LinearRegression()

# Train the model
model.fit(X_train, y_train)

print("Linear Regression model trained successfully.")



Linear Regression model trained successfully.


In [None]:
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.2f}")
print(f"R-squared (R2): {r2:.2f}")

Mean Squared Error (MSE): 0.51
Root Mean Squared Error (RMSE): 0.71
R-squared (R2): 0.61


In [None]:
X_train_with_random = X_train.copy()
X_train_with_random['RandomFeature1'] = np.random.randn(len(X_train))
X_train_with_random['RandomFeature2'] = np.random.randn(len(X_train))

X_test_with_random = X_test.copy()
X_test_with_random['RandomFeature1'] = np.random.randn(len(X_test))
X_test_with_random['RandomFeature2'] = np.random.randn(len(X_test))

In [None]:

# Train a new model with the two random features included
model_with_random = LinearRegression()
model_with_random.fit(X_train_with_random, y_train)

# Make predictions and calculate the R2 score on the test set
y_pred_with_random = model_with_random.predict(X_test_with_random)
r2_with_random = r2_score(y_test, y_pred_with_random)

# Compare the test R2 scores
print(f"Original R2 score (8 features): {r2:.4f}")
print(f"R2 score with random features (10 features): {r2_with_random:.4f}")
print(f"Difference: {r2_with_random - r2:.4f}")

# Only training R2 is guaranteed not to decrease when features are added;
# test R2 can move in either direction.
if r2_with_random > r2:
    print("\nTest R2 went up marginally, even though the added features are pure noise.")
else:
    print("\nTest R2 did not increase; the noise features added nothing useful.")

Original R2 score (8 features): 0.6067
R2 score with random features (10 features): 0.6067
Difference: 0.0000

Test R2 went up marginally, even though the added features are pure noise.
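The comparison above uses test-set R². The guarantee that R² never decreases when features are added applies to the training data, which is why adjusted R² penalizes the feature count: adjusted R² = 1 - (1 - R²)(n - 1) / (n - p - 1), with n samples and p features. The sketch below is an illustration on synthetic data, and the helper `adjusted_r2` is a hypothetical name, not a scikit-learn function:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R^2: penalizes R^2 for the number of features used."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

rng = np.random.default_rng(42)
n = 200

# Three informative features plus Gaussian noise in the target
X_base = rng.normal(size=(n, 3))
y_demo = X_base @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=n)

# Same data with two pure-noise columns appended
X_noise = np.column_stack([X_base, rng.normal(size=(n, 2))])

# Training R^2 for both models (score() on the data used for fitting)
r2_base = LinearRegression().fit(X_base, y_demo).score(X_base, y_demo)
r2_noise = LinearRegression().fit(X_noise, y_demo).score(X_noise, y_demo)

print(f"Train R2, 3 features: {r2_base:.4f}  adjusted: {adjusted_r2(r2_base, n, 3):.4f}")
print(f"Train R2, 5 features: {r2_noise:.4f}  adjusted: {adjusted_r2(r2_noise, n, 5):.4f}")
```

Training R² with the noise columns is at least as high as without them, while adjusted R² is always below plain R² and only rises if the new features explain more than chance would.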
