Given the dataset in data/auto.csv, whose features are not expressed in SI units, build a model that estimates vehicle fuel consumption (the mpg variable) from the other variables.

#### Load data

In [22]:
import numpy as np 
import pandas as pd
# Read the file 
auto_df = pd.read_csv("auto.csv")
auto_df.head()

Unnamed: 0,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,mpg
0,8,307.0,130.0,3504.0,12.0,70,1,18.0
1,8,350.0,165.0,3693.0,11.5,70,1,15.0
2,8,318.0,150.0,3436.0,11.0,70,1,18.0
3,8,304.0,150.0,3433.0,12.0,70,1,16.0
4,8,302.0,140.0,3449.0,10.5,70,1,17.0


In [23]:
# Isolate the target variable (mpg) and keep the remaining columns as explanatory variables
target_variable = auto_df[['mpg']]
explanatory = auto_df.drop('mpg', axis=1)

#### VIF - Variance Inflation Factor

Multicollinearity in a set of independent features negatively affects the models built from them. One way to address this is the VIF, which quantifies the intensity of multicollinearity. The VIF of a feature increases as its collinearity with the other features increases. VIF values above 5 are considered high, and values above 10 are considered very high.
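
For reference, the standard definition behind these thresholds can be written out. Here $R_j^2$ denotes the coefficient of determination obtained when feature $j$ is regressed on all the other features (the symbol names are chosen for this note):

$$
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}
$$

Under this definition, $\mathrm{VIF}_j = 5$ corresponds to $R_j^2 = 0.8$ and $\mathrm{VIF}_j = 10$ corresponds to $R_j^2 = 0.9$, so a VIF of 10 means that 90% of the variance of the feature is explained by the remaining features.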

The VIF calculation can be implemented like this:

In [24]:
def calculateVIF(data) :
    from sklearn.linear_model import LinearRegression
    import pandas as pd
    
    features = list(data.columns)
    num_features = len(features)
    
    # Create the model and the result dataframe
    model = LinearRegression()
    result = pd.DataFrame(index = ['VIF'], columns = features)
    result = result.fillna(0)
    
    # For each feature
    for ite in range(num_features) :
        x_features = features[:]
        y_feature  = features[ite]
        # Remove the feature from the predictors (it is the dependent variable of this auxiliary regression)
        x_features.remove(y_feature)
        
        x = data[x_features]
        y = data[y_feature]
        
        # Fit the model 
        model.fit(x, y)
        # Calculate VIF
        result[y_feature] = 1 / (1 - model.score(x, y))
    
    return result

In [30]:
feature_names = list(auto_df.columns)
feature_names.remove('mpg')
feature_names

['cylinders',
 'displacement',
 'horsepower',
 'weight',
 'acceleration',
 'model_year',
 'origin']

In [35]:
features = auto_df[feature_names]
target = auto_df['mpg']
explanatory

Unnamed: 0,cylinders,displacement,horsepower,weight,acceleration,model_year,origin
0,8,307.0,130.0,3504.0,12.0,70,1
1,8,350.0,165.0,3693.0,11.5,70,1
2,8,318.0,150.0,3436.0,11.0,70,1
3,8,304.0,150.0,3433.0,12.0,70,1
4,8,302.0,140.0,3449.0,10.5,70,1
...,...,...,...,...,...,...,...
387,4,140.0,86.0,2790.0,15.6,82,1
388,4,97.0,52.0,2130.0,24.6,82,2
389,4,135.0,84.0,2295.0,11.6,82,1
390,4,120.0,79.0,2625.0,18.6,82,1


In [38]:
VIF = calculateVIF(explanatory)
VIF

Unnamed: 0,cylinders,displacement,horsepower,weight,acceleration,model_year,origin
VIF,10.737535,21.836792,9.943693,10.83126,2.625806,1.244952,1.772386
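
As a sanity check on a regression-based VIF, the same values can be obtained from the diagonal of the inverse correlation matrix of the features. The sketch below illustrates this on synthetic data, not on auto.csv; the function names `vif_from_correlation` and `vif_from_regression` and the generated columns are assumptions made for this example.

```python
import numpy as np

def vif_from_correlation(X):
    """VIF of each column of X (rows are samples), via the inverse correlation matrix."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))

def vif_from_regression(X):
    """VIF of each column of X, via 1 / (1 - R^2) of an auxiliary least-squares fit."""
    n, p = X.shape
    vifs = []
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        # Add an intercept column, as LinearRegression does by default
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1 - resid.var() / y.var()
        vifs.append(1 / (1 - r2))
    return np.array(vifs)

# Synthetic data: b is almost a copy of a, c is independent of both
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = a + 0.1 * rng.normal(size=200)
c = rng.normal(size=200)
X = np.column_stack([a, b, c])

vif_corr = vif_from_correlation(X)
vif_reg = vif_from_regression(X)
print(vif_corr)
print(vif_reg)
```

Both approaches should agree up to floating-point error. The two nearly identical columns get a large VIF, while the independent column stays close to 1.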


Once the VIF function has been defined, we need a procedure that removes the features with a high VIF. The procedure is the following:

In [45]:
# data    -> DataFrame with the candidate features
# max_VIF -> VIF threshold; features keep being removed while the maximum VIF exceeds it
def selectDataUsingVIF(data, max_VIF = 5) :
    # Copy data
    result = data.copy(deep = True)
    
    VIF = calculateVIF(result)
    
    # While the VIF value is bigger than max_VIF:
    while VIF.values.max() > max_VIF :
        # Get the column of the feature which gets the maximum VIF
        col_max = np.where(VIF == VIF.values.max())[1][0]
        
        # Remove this feature from the data
        features = list(result.columns)
        features.remove(features[col_max])
        result = result[features]
        
        # Again, calculate VIF
        VIF = calculateVIF(result)

    # Return the result
    return result

In [47]:
selected_features = selectDataUsingVIF(explanatory)
calculateVIF(selected_features)

Unnamed: 0,cylinders,acceleration,model_year,origin
VIF,1.999959,1.384478,1.159429,1.495041


Once the features with high multicollinearity have been removed, we build a model with the selected features. The model is a linear regression. To check that the model does not overfit, we hold out 25% of the data as a test set and compare the training and test metrics.

In [50]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, median_absolute_error
from sklearn.model_selection import train_test_split



# 75% train, 25% test (the train_test_split defaults)
x_train, x_test, y_train, y_test = train_test_split(selected_features, target,
                                                    # train_size, to change the default value
                                                    # test_size, to change the default value
                                                    random_state = 1) # fixed seed so the split is the same on every run

# Create a model and fit it with the training data
model = LinearRegression()
model.fit(x_train, y_train)

# Predict on the test data
pred = model.predict(x_test)

# The sklearn metrics expect (y_true, y_pred) in that order
print("The model's metrics are:")
print('Training R^2 = ', model.score(x_train, y_train))
print('Testing R^2 = ', model.score(x_test, y_test))
print('MSE = ', mean_squared_error(y_test, pred))
print('MAE = ', mean_absolute_error(y_test, pred))
print('MedianAE = ', median_absolute_error(y_test, pred))

The model's metrics are:
Training R^2 =  0.7238460585048392
Testing R^2 =  0.7631629667618446
MSE =  16.892920987161908
MAE =  3.005021600236145
MedianAE =  2.2242538147356044


The test R^2 (0.763) is slightly higher than the training R^2 (0.724), so the model shows no sign of overfitting, and the test MAE is about 3 mpg. With that, we have a model that estimates the vehicle consumption.
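
A single train/test split gives one estimate of the generalization error. If a cross-validated estimate is wanted, a minimal sketch with `cross_val_score` could look like the following. It runs on synthetic data so that it is self-contained; the arrays `X` and `y`, the coefficients, and the choice of 5 folds are assumptions made for this example, not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression problem standing in for the selected features and mpg
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=200)

# 5-fold cross-validation: each fold is used once as the validation set
cv = KFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")

print("R^2 per fold:", scores)
print("Mean R^2:", scores.mean(), "Std:", scores.std())
```

With the notebook's data, `selected_features` and `target` would take the place of `X` and `y`. A small spread between folds would support the conclusion that the model is not overfitting.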