# **The tasks:**

In this task, I invite you to train two models on the selected dataset to predict the target column. The models are the following:

- the LinearRegression from sklearn.
- the Lin_reg implementation offered in SMLH.

The tasks:

- Create a Jupyter notebook with clean code.
- Study the correlation between features, find the subset of features with the highest correlation with the target column, and try to explain from a business point of view why the correlation is so strong.
- Create a second dataset with the columns that have an absolute correlation between 0.5 and 0.8 with the target column.
- Split the data into 2 subsets using the train_test_split function from sklearn.
- Train a sklearn Linear Regression model on the data provided to you.
- Train a from-scratch implementation of Linear Regression on the train subset.
- Test the models on the test set from the initial set of data; as the error metric, use the model's score function for the sklearn model.
- Split the data with the selected columns into 2 subsets using the train_test_split function from sklearn.
- Train a sklearn Linear Regression model on the data with selected columns (train subset).
- Train a from-scratch implementation of Linear Regression on the train subset.
- Test the models on the test set from the initial set of data; as the error metric, use the model's score function for the sklearn model.
- Interpret the results you get by comparing the errors of the models you created.
- Comment your code.

In [23]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression


In [3]:
dt=pd.read_csv("Dataset.csv")

In [4]:
dt.describe()

,Unnamed: 0,Id,year,price,distance_travelled(kms),brand_rank,car_age,distance below 30k km,new and less used,inv_car_price,inv_car_dist,inv_car_age,inv_brand,std_invprice,std_invdistance_travelled,std_invrank,best_buy1,best_buy2
count,1725.0,1725.0,1725.0,1725.0,1725.0,1725.0,1725.0,1725.0,1725.0,1725.0,1725.0,1725.0,1725.0,1725.0,1725.0,1725.0,1725.0,1725.0
mean,862.0,862.0,2015.390725,1494837.0,53848.256232,15.731014,5.609275,0.269565,0.209275,1.416237e-06,4.1e-05,inf,0.18781,0.084623,0.013809,0.177658,88.962902,32.537208
std,498.108924,498.108924,3.207504,1671658.0,44725.541963,12.951122,3.207504,0.443863,0.406909,1.291449e-06,0.00011,,0.254849,0.08106,0.038689,0.258034,188.95069,158.662274
min,0.0,0.0,1990.0,62500.0,350.0,1.0,0.0,0.0,0.0,6.802721e-08,1e-06,0.032258,0.012346,0.0,0.0,0.0,0.0,0.0
25%,431.0,431.0,2013.0,545000.0,29000.0,5.0,3.0,0.0,0.0,5.479452e-07,1.4e-05,0.125,0.041667,0.030123,0.004524,0.029687,14.237358,0.0
50%,862.0,862.0,2016.0,875000.0,49000.0,14.0,5.0,0.0,0.0,1.142857e-06,2e-05,0.2,0.071429,0.067464,0.006703,0.059821,36.716166,0.0
75%,1293.0,1293.0,2018.0,1825000.0,70500.0,24.0,8.0,1.0,0.0,1.834862e-06,3.4e-05,0.333333,0.2,0.110899,0.011631,0.19,90.776658,0.0
max,1724.0,1724.0,2021.0,14700000.0,790000.0,81.0,31.0,1.0,1.0,1.6e-05,0.002857,inf,1.0,1.0,1.0,1.0,2477.51764,2477.51764


Study the correlation between features, find the subset of features with the highest correlation with the target column, and try to explain from a business point of view why the correlation is so strong.

I keep only the numeric columns; all non-numeric columns are dropped.


In [5]:
numeric_dt = dt.select_dtypes(include=['number'])

In [9]:
correlations = numeric_dt.corr()['price'].drop('price').sort_values(key=abs, ascending=False)
print(correlations)

std_invprice                -0.517723
inv_car_price               -0.517723
year                         0.288483
car_age                     -0.288483
inv_car_age                  0.267973
new and less used            0.219786
distance below 30k km        0.212197
std_invrank                  0.185660
inv_brand                    0.185660
brand_rank                  -0.164591
distance_travelled(kms)     -0.137351
best_buy1                   -0.106855
Unnamed: 0                  -0.105696
Id                          -0.105696
std_invdistance_travelled    0.081735
inv_car_dist                 0.081735
best_buy2                    0.008077
Name: price, dtype: float64
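
One thing worth checking before interpreting these numbers: in the `describe()` output, the minimum of `inv_car_price` (6.802721e-08) equals 1/14700000 and its maximum (1.6e-05) equals 1/62500, which are the reciprocals of the maximum and minimum `price`. This suggests that `inv_car_price` is computed from the target itself, and `std_invprice`, which has exactly the same correlation, appears to be a rescaled copy of it. If so, their correlation with `price` reflects how the columns were built rather than a business effect. The sketch below shows one way to test this on a small made-up frame; the toy values and the helper name `is_reciprocal_of` are assumptions for illustration, not part of the dataset.

```python
import numpy as np
import pandas as pd

def is_reciprocal_of(candidate: pd.Series, target: pd.Series) -> bool:
    """Return True if candidate equals 1 / target on every row (up to rounding)."""
    return bool(np.allclose(candidate.to_numpy(), 1.0 / target.to_numpy(),
                            rtol=1e-9, atol=0.0))

# Toy data (made up): one column is derived from the target, one is not.
toy = pd.DataFrame({"price": [62500.0, 875000.0, 14700000.0],
                    "car_age": [12.0, 5.0, 1.0]})
toy["inv_car_price"] = 1.0 / toy["price"]

leaky = is_reciprocal_of(toy["inv_car_price"], toy["price"])
not_leaky = is_reciprocal_of(toy["car_age"], toy["price"])
print(leaky, not_leaky)
```

On the real data the same call would be `is_reciprocal_of(dt["inv_car_price"], dt["price"])`; a `True` result would be a reason to exclude the column from the features.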


Create a second set of data with the columns that have an absolute correlation between 0.5 and 0.8 with the target column.

In [10]:
high_corr_features = correlations[abs(correlations) > 0.8].index.tolist()
moderate_corr_features = correlations[(abs(correlations) >= 0.5) & (abs(correlations) <= 0.8)].index.tolist()

In [13]:
dt_high_corr = numeric_dt[['price'] + high_corr_features]
dt_moderate_corr = numeric_dt[['price'] + moderate_corr_features]
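
Judging by the printed correlations, no feature has an absolute correlation above 0.8, so `high_corr_features` is empty and `dt_high_corr` holds only `price`; the moderate band contains `std_invprice` and `inv_car_price`. The sketch below reproduces that selection on values copied from the output; the helper name `select_by_corr_band` is an assumption for illustration.

```python
import pandas as pd

def select_by_corr_band(corr: pd.Series, low: float, high: float) -> list:
    """Names whose absolute correlation lies in the closed interval [low, high]."""
    strength = corr.abs()
    return corr[(strength >= low) & (strength <= high)].index.tolist()

# Values copied from the printed correlations (subset of the columns).
corr = pd.Series({"std_invprice": -0.517723, "inv_car_price": -0.517723,
                  "year": 0.288483, "car_age": -0.288483})

moderate = select_by_corr_band(corr, 0.5, 0.8)
high = corr[corr.abs() > 0.8].index.tolist()
print(moderate)
print(high)
```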

I clean the data by replacing infinite values with NaN and then dropping every row that contains a NaN.

In [14]:
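def clean_data(X, y):
    """Drop rows of X that contain inf or NaN and keep y aligned with them."""
    # Treat infinite values as missing
    X = X.replace([np.inf, -np.inf], np.nan)
    # Index labels of the rows that have no missing values
    valid_idx = X.dropna().index
    return X.loc[valid_idx], y.loc[valid_idx]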
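
The `describe()` output lists `inf` as the mean and maximum of `inv_car_age`, so at least one row holds an infinite value (most likely where `car_age` is 0). A minimal usage sketch on made-up values shows how `clean_data` treats such rows; the toy frame is an assumption for illustration.

```python
import numpy as np
import pandas as pd

def clean_data(X, y):
    # Same logic as the cell above: inf -> NaN, then keep only complete rows.
    X = X.replace([np.inf, -np.inf], np.nan)
    valid_idx = X.dropna().index
    return X.loc[valid_idx], y.loc[valid_idx]

# Toy frame (made up): row 1 has an infinite value, row 2 a missing one.
X_toy = pd.DataFrame({"inv_car_age": [0.5, np.inf, 0.2, 0.125],
                      "brand_rank": [3.0, 7.0, np.nan, 1.0]})
y_toy = pd.Series([900000.0, 450000.0, 700000.0, 1200000.0])

X_clean, y_clean = clean_data(X_toy, y_toy)
print(X_clean.index.tolist())  # index labels of the rows that survive
```

Because rows are removed rather than imputed, the number of samples available for training shrinks by the number of affected rows.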

In [15]:
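class LinearRegressionScratch:
    """Linear regression fitted with the normal equation."""

    def fit(self, X, y):
        # Prepend a column of ones so that theta[0] is the intercept
        X_b = np.c_[np.ones((X.shape[0], 1)), X]
        # Normal equation: theta = (X_b^T X_b)^(-1) X_b^T y
        self.theta = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y

    def predict(self, X):
        X_b = np.c_[np.ones((X.shape[0], 1)), X]
        return X_b @ self.theta

    def score(self, X, y):
        # R² = 1 - (residual sum of squares) / (total sum of squares)
        y_pred = self.predict(X)
        u = ((y - y_pred) ** 2).sum()
        v = ((y - y.mean()) ** 2).sum()
        return 1 - u / v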
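
The closed-form fit above inverts `X_b.T @ X_b`, which requires that matrix to be invertible. In this dataset several columns seem to carry the same information (`Unnamed: 0` and `Id` have identical statistics, and `year` and `car_age` have correlations of equal magnitude and opposite sign with `price`), so the matrix can be singular or close to it. A hedged alternative is to solve the least-squares problem with `np.linalg.lstsq`, which returns the minimum-norm solution even when columns are collinear. The class name `LinearRegressionLstsq` and the toy data are assumptions for illustration.

```python
import numpy as np

class LinearRegressionLstsq:
    """Ordinary least squares solved with np.linalg.lstsq (no explicit inverse)."""

    def fit(self, X, y):
        X_b = np.c_[np.ones((X.shape[0], 1)), X]  # prepend intercept column
        # lstsq tolerates a rank-deficient X_b and returns the minimum-norm solution
        self.theta, *_ = np.linalg.lstsq(X_b, y, rcond=None)
        return self

    def predict(self, X):
        X_b = np.c_[np.ones((X.shape[0], 1)), X]
        return X_b @ self.theta

    def score(self, X, y):
        residual = ((y - self.predict(X)) ** 2).sum()
        total = ((y - y.mean()) ** 2).sum()
        return 1 - residual / total  # coefficient of determination R²

# Toy data (made up): the second column duplicates the first,
# so X_b.T @ X_b is singular and cannot be inverted reliably.
rng = np.random.default_rng(0)
x = rng.normal(size=50)
X_toy = np.column_stack([x, x])
y_toy = 3.0 + 2.0 * x

model = LinearRegressionLstsq().fit(X_toy, y_toy)
r2 = model.score(X_toy, y_toy)
print(round(r2, 6))
```

The interface matches `LinearRegressionScratch`, so the class can be swapped in without changing the training and scoring cells.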

Full dataset

In [21]:
X_full = numeric_dt.drop(columns=['price'])
y_full = numeric_dt['price']
X_full, y_full = clean_data(X_full, y_full)

In [24]:
X_train, X_test, y_train, y_test = train_test_split(X_full, y_full, test_size=0.2, random_state=42)

In [25]:
model_sklearn = LinearRegression().fit(X_train, y_train)
print("Full Data - sklearn R²:", model_sklearn.score(X_test, y_test))

Full Data - sklearn R²: 0.35263346000642215


From-scratch model

In [32]:
model_scratch = LinearRegressionScratch()
model_scratch.fit(X_train.values, y_train.values)
print("Full Data - scratch R²:", model_scratch.score(X_test.values, y_test.values))

Full Data - scratch R²: 0.35263346000642104


Correlated features

In [33]:
X_mod = dt_moderate_corr.drop(columns=['price'])
y_mod = dt_moderate_corr['price']
X_mod, y_mod = clean_data(X_mod, y_mod)

X_train_mod, X_test_mod, y_train_mod, y_test_mod = train_test_split(X_mod, y_mod, test_size=0.2, random_state=42)

In [34]:
# sklearn
model_sklearn_mod = LinearRegression().fit(X_train_mod, y_train_mod)
print("Moderate Corr - sklearn R²:", model_sklearn_mod.score(X_test_mod, y_test_mod))

Moderate Corr - sklearn R²: 0.24782534492611974


In [35]:
# from scratch
model_scratch_mod = LinearRegressionScratch()
model_scratch_mod.fit(X_train_mod.values, y_train_mod.values)
print("Moderate Corr - scratch R²:", model_scratch_mod.score(X_test_mod.values, y_test_mod.values))

Moderate Corr - scratch R²: 0.24782534492611918
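
Comparing the printed scores: the sklearn model and the from-scratch model agree to about 12 decimal places on both feature sets, and the test R² drops from about 0.353 on all numeric columns to about 0.248 on the two moderately correlated columns. R² alone does not show the size of the error in price units, so a second metric such as RMSE can help with the interpretation. The sketch below computes both with NumPy; the helper name `regression_report` and the toy values are assumptions for illustration.

```python
import numpy as np

def regression_report(y_true, y_pred):
    """Return R² and RMSE for one set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residual = ((y_true - y_pred) ** 2).sum()        # residual sum of squares
    total = ((y_true - y_true.mean()) ** 2).sum()    # total sum of squares
    return {"r2": 1 - residual / total,
            "rmse": float(np.sqrt(residual / len(y_true)))}

# Toy values (made up), standing in for y_test and model.predict(X_test).
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 6.0]
report = regression_report(y_true, y_pred)
print(report)
```

In the notebook the call would look like `regression_report(y_test, model_sklearn.predict(X_test))`, repeated for each of the four models.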
