# Hackathon 2: Fuel consumption model

Fuel consumption is the amount of fuel a vehicle uses to travel a given distance. Transportation companies and professionals seek to optimize it in order to maximize profits and operate more efficiently. Predicting fuel consumption from only a few features helps managers choose routes and transport times that improve fuel economy. The raw dataset is available on Kaggle:

https://www.kaggle.com/yiorgos1973/fuelconsumption

The database (file fuelConsumptionhack2.csv) contains 4491 records with the following information about a fleet of light trucks:

Payload:  Describes the loaded cargo.

Reliability: Probability of this route to be on time according to historical data.

Season: Describes the weather conditions. 0 is good weather and 1 is bad weather.

Net: A discrete quantitative variable that describes the quality of the road network. 1 is bad quality, 2 is mediocre and 3 is good quality.

LoadValue: This is the value of the cargo of the light truck

TransmissionType: Truck can have automatic or manual transmission.

Fuel: The dependent variable. It describes the fuel consumption of a truck on this route (l/100km).

Your aim is to understand how fuel consumption can be explained by the other recorded features. You will use pandas and numpy to read the dataset.


In [84]:
import pandas as pd
data = pd.read_csv("fuelConsumptionhack2.csv",sep=";")
Pay = data["Payload"].values
Rel = data["Reliability"].values
Sea = data["Season"].values
Net = data["Net"].values
LV = data["LoadValue"].values
TT = data["TransmissionType"].values
Fuel = data["Fuel"].values

# Hypothesis testing
1) We start our analysis by checking whether the average consumption is above 15 l/100km. To do this, perform a one-sided test at a confidence level of 95% (alpha=5%). State the tested hypotheses clearly and report the statistic, the critical value and the p-value.

In [85]:
import numpy as np
import scipy.stats as sc

X    = Fuel[:]
n    = len(X)

# we get the statistics
Stat=sc.describe(X)
print("Mean:",Stat.mean,", Variance:", Stat.variance)

print("H0 : mu(Fuel) = 15\nH1 : mu(Fuel) > 15\n")

# we calculate T(X)
Tx= (Stat.mean-15)/np.sqrt(Stat.variance/n)

# we compare it to the percentiles of a t distribution
alpha = 0.05
t_u   = sc.t.ppf(q=1-alpha,df=n-1)
print("T(X): ",Tx)
print("Critical value: ",t_u)
print("We see that T(X) is way above the critical value, hence we strongly reject H0")

#The p-value is the upper-tail probability of T(X) (one-sided test, so no absolute value)
pval = 1-sc.t.cdf(Tx,df=n-1)
print("p-value:",pval, "< alpha (5%) => We confirm the rejection of H0")


Mean: 16.508159869000547 , Variance: 3.4078219088079567
H0 : mu(Fuel) = 15
H1 : mu(Fuel) > 15

T(X):  54.76172806719549
Critical value:  1.6451929157669274
We see that T(X) is way above the critical value, hence we strongly reject H0
p-value: 0.0 < alpha (5%) => We confirm the rejection of H0
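
The manual computation above can be cross-checked against SciPy's built-in one-sample test. The sketch below is only an illustration: it runs on synthetic values (the array `x` is a hypothetical stand-in for the Fuel column, not the real data) and assumes SciPy 1.6 or later, where `ttest_1samp` accepts the `alternative` argument.

```python
import numpy as np
import scipy.stats as sc

# Synthetic stand-in for the Fuel column (hypothetical values, not the real data)
rng = np.random.default_rng(0)
x = rng.normal(loc=15.2, scale=1.8, size=500)
mu0 = 15.0

# Manual statistic, built the same way as in the cell above
n = len(x)
t_manual = (x.mean() - mu0) / np.sqrt(x.var(ddof=1) / n)
# Upper-tail probability for H1: mu > mu0 (no absolute value for a one-sided test)
p_manual = sc.t.sf(t_manual, df=n - 1)

# Same test delegated to SciPy
check = sc.ttest_1samp(x, popmean=mu0, alternative="greater")

print("manual:", t_manual, p_manual)
print("scipy :", check.statistic, check.pvalue)
```

Both lines should print the same statistic and p-value, which confirms that the hand-written formula implements the usual one-sided Student test.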


2) We continue with an analysis of the effect of the transmission type on the average fuel consumption. Under the assumption of equal variances, test whether the average consumption of automatic vehicles is the same as that of vehicles with a manual transmission (use a two-sided test and a confidence level of 95%, i.e. \alpha=5%). Report the statistic, the critical values and the p-value.

In [86]:
#dividing the set into two subsets based on the transmission type (boolean masks)
TT_auto=Fuel[TT=='automatic']
TT_man=Fuel[TT=='manual']
n1=len(TT_auto)
n2=len(TT_man)

#we get the statistics
X1=sc.describe(TT_auto)
X2=sc.describe(TT_man)
print('Automatic:',"Mean:",X1.mean,", Variance:",X1.variance )
print('Manual:',"   Mean:",X2.mean,", Variance:", X2.variance )

print("H0 : mu(Auto)-mu(Manual) = 0\nH1 : mu(Auto)-mu(Manual) is different from 0\n")

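#We find the pooled variance (sample-size weighted average; for large samples it is
#very close to the exact ((n1-1)*s1^2+(n2-1)*s2^2)/(n1+n2-2))
Var_pool=X1.variance*n1/(n1+n2)+X2.variance*n2/(n1+n2)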
print("Assumption : variance is the same for manual and automatic and equal to",Var_pool,"\n")

#we find the value of T(X1,X2)
alpha=0.05
Txy=(X1.mean-X2.mean)/np.sqrt(Var_pool*(1/n1+1/n2))
print("T(X1,X2):",Txy)
#n1+n2-2 is large, so normal quantiles are used in place of Student t quantiles
t_l = sc.norm.ppf(alpha/2)
t_u = sc.norm.ppf(1-alpha/2)
print("Critical values:",[t_l,t_u])

print("We see that T(X1,X2) is outside of the critical values interval, hence we reject H0")

#The p-value is
pval = 2*(1-sc.norm.cdf(np.abs(Txy)))
print("p-value:",pval, "< alpha (5%) => We confirm the rejection of H0")

Automatic: Mean: 16.42050764687532 , Variance: 3.2239692302077048
Manual:    Mean: 16.555655702471977 , Variance: 3.5021653593719293
H0 : mu(Auto)-mu(Manual) = 0
H1 : mu(Auto)-mu(Manual) is different from 0

Assumption : variance is the same for manual and automatic and equal to 3.4043973451386083 

T(X1,X2): -2.3439983931797244
Critical values: [-1.9599639845400545, 1.959963984540054]
We see that T(X1,X2) is outside of the critical values interval, hence we reject H0
p-value: 0.019078253589527083 < alpha (5%) => We confirm the rejection of H0
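
The cell above uses a sample-size weighted average of the two variances and normal quantiles, which is a large-sample shortcut. As a hedged illustration, the sketch below compares that shortcut with the textbook pooled variance and with `scipy.stats.ttest_ind(equal_var=True)`. The two samples are synthetic and their sizes are arbitrary assumptions, not the real automatic/manual groups.

```python
import numpy as np
import scipy.stats as sc

rng = np.random.default_rng(1)
auto = rng.normal(16.4, 1.8, size=300)     # hypothetical "automatic" sample
manual = rng.normal(16.6, 1.9, size=550)   # hypothetical "manual" sample
n1, n2 = len(auto), len(manual)
s1, s2 = auto.var(ddof=1), manual.var(ddof=1)

# Weighted average of the variances, as in the cell above
var_approx = s1 * n1 / (n1 + n2) + s2 * n2 / (n1 + n2)
# Textbook pooled variance with n1 + n2 - 2 degrees of freedom
var_pooled = ((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2)

# Exact two-sample Student statistic and two-sided p-value
t_stat = (auto.mean() - manual.mean()) / np.sqrt(var_pooled * (1 / n1 + 1 / n2))
p_val = 2 * sc.t.sf(abs(t_stat), df=n1 + n2 - 2)

# Reference implementation
ref = sc.ttest_ind(auto, manual, equal_var=True)

print("variances:", var_approx, var_pooled)
print("manual   :", t_stat, p_val)
print("scipy    :", ref.statistic, ref.pvalue)
```

With several hundred observations per group the two variance estimates differ only in the third or fourth decimal, so the shortcut does not change the decision of the test.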


3) The test in question 2 is performed under the assumption of equal variances. Test this assumption at a confidence level of 95% (two-sided test, \alpha=5%). Report the value of the statistic and the critical values (the p-value is not requested).

In [87]:
print('Automatic:',"Variance:", X1.variance )
print('Manual:',"Variance:", X2.variance )
print("H0 : sigma(Auto) = sigma(Manual)\nH1 : sigma(Auto) is different from sigma(Manual)\n")

#the test statistic T(X1,X2) is the ratio of the sample variances (F-distributed under H0)
alpha=0.05
Txy=X1.variance/X2.variance
print("T(X1,X2):",Txy)
t_l = sc.f.ppf(alpha/2,dfn=n1-1,dfd=n2-1)
t_u = sc.f.ppf(1-alpha/2,dfn=n1-1,dfd=n2-1)
print("Critical values:",[t_l,t_u])

print("We see that T(X1,X2) is inside of the critical values interval, hence we do not reject H0")

Automatic: Variance: 3.2239692302077048
Manual: Variance: 3.5021653593719293
H0 : sigma(Auto) = sigma(Manual)
H1 : sigma(Auto) is different from sigma(Manual)

T(X1,X2): 0.9205645363318551
Critical values: [0.9164663405212443, 1.0899135647470608]
We see that T(X1,X2) is inside of the critical values interval, hence we do not reject H0
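
Although the p-value is not requested, it can be obtained from the same F distribution. The helper below is a minimal sketch: `variance_ratio_test` is a hypothetical name introduced for illustration, the samples are synthetic, and the two-sided p-value is taken as twice the smaller tail probability, which is one common convention.

```python
import numpy as np
import scipy.stats as sc

def variance_ratio_test(a, b, alpha=0.05):
    """Two-sided F test of H0: var(a) = var(b), assuming normal samples."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    dfn, dfd = len(a) - 1, len(b) - 1
    f_stat = a.var(ddof=1) / b.var(ddof=1)
    crit = (sc.f.ppf(alpha / 2, dfn, dfd), sc.f.ppf(1 - alpha / 2, dfn, dfd))
    # Two-sided p-value: twice the smaller of the two tail probabilities
    p_val = min(1.0, 2 * min(sc.f.cdf(f_stat, dfn, dfd), sc.f.sf(f_stat, dfn, dfd)))
    return f_stat, crit, p_val

rng = np.random.default_rng(2)
# Two samples drawn with the same variance, then two with clearly different variances
same = variance_ratio_test(rng.normal(0, 1.8, 300), rng.normal(0, 1.8, 500))
diff = variance_ratio_test(rng.normal(0, 1.0, 200), rng.normal(0, 2.0, 200))

for label, (f_stat, crit, p_val) in (("same", same), ("diff", diff)):
    print(label, "F =", f_stat, "critical values =", crit, "p-value =", p_val)
```

By construction, the statistic lies inside the critical interval exactly when the p-value is at least alpha, so both ways of reading the test lead to the same decision.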


4) Under the assumption of equal variances, test whether the average consumption is the same in bad and good weather conditions (use a two-sided test and a confidence level of 95%, i.e. \alpha=5%). Report the statistic, the critical values and the p-value.

In [88]:
#dividing the set into two subsets based on the weather (boolean masks)
Good_W=Fuel[Sea==0]
Bad_W=Fuel[Sea==1]
n1=len(Good_W)
n2=len(Bad_W)

#we get the statistics
X1=sc.describe(Good_W)
X2=sc.describe(Bad_W)
print('Good weather:',"Mean:",X1.mean,", Variance:",X1.variance )
print('Bad weather:'," Mean:",X2.mean,", Variance:", X2.variance )

print("H0 : mu(Good W.)-mu(Bad W.) = 0\nH1 : mu(Good W.)-mu(Bad W.) is different from 0\n")

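#We find the pooled variance (sample-size weighted average; for large samples it is
#very close to the exact ((n1-1)*s1^2+(n2-1)*s2^2)/(n1+n2-2))
Var_pool=X1.variance*n1/(n1+n2)+X2.variance*n2/(n1+n2)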
print("Assumption : variance is the same for good and bad weather and equal to ",Var_pool,"\n")

#we find the value of T(X1,X2)
alpha=0.05
Txy=(X1.mean-X2.mean)/np.sqrt(Var_pool*(1/n1+1/n2))
print("T(X1,X2):",Txy)
#n1+n2-2 is large, so normal quantiles are used in place of Student t quantiles
t_l = sc.norm.ppf(alpha/2)
t_u = sc.norm.ppf(1-alpha/2)
print("Critical values:",[t_l,t_u])

print("We see that T(X1,X2) is far outside of the critical values interval, hence we strongly reject H0")

#The p-value is
pval = 2*(1-sc.norm.cdf(np.abs(Txy)))
print("p-value:",pval, "< alpha (5%) => We confirm the rejection of H0")

Good weather: Mean: 16.053005795731895 , Variance: 2.3596204431511048
Bad weather:  Mean: 19.165650013686328 , Variance: 1.256410059476164
H0 : mu(Good W.)-mu(Bad W.) = 0
H1 : mu(Good W.)-mu(Bad W.) is different from 0

Assumption : variance is the same for good and bad weather and equal to  2.198300785444798 

T(X1,X2): -49.72098227948306
Critical values: [-1.9599639845400545, 1.959963984540054]
We see that T(X1,X2) is far outside of the critical values interval, hence we strongly reject H0
p-value: 0.0 < alpha (5%) => We confirm the rejection of H0
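
The sample variances reported above for good and bad weather (about 2.36 and 1.26) differ noticeably, so the equal-variance assumption imposed by the question is debatable. As a hedged side check, the sketch below shows Welch's version of the test, which does not pool the variances. The data are synthetic and the group sizes are arbitrary assumptions.

```python
import numpy as np
import scipy.stats as sc

rng = np.random.default_rng(3)
good = rng.normal(16.0, np.sqrt(2.36), size=600)  # hypothetical good-weather sample
bad = rng.normal(19.2, np.sqrt(1.26), size=100)   # hypothetical bad-weather sample
n1, n2 = len(good), len(bad)

# Variance of each sample mean, kept separate instead of pooled
v1 = good.var(ddof=1) / n1
v2 = bad.var(ddof=1) / n2

t_welch = (good.mean() - bad.mean()) / np.sqrt(v1 + v2)
# Welch-Satterthwaite approximation of the degrees of freedom
df_welch = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
p_welch = 2 * sc.t.sf(abs(t_welch), df=df_welch)

# Reference implementation
ref_welch = sc.ttest_ind(good, bad, equal_var=False)

print("manual:", t_welch, df_welch, p_welch)
print("scipy :", ref_welch.statistic, ref_welch.pvalue)
```

When the difference between the means is as large as in the real data, both versions of the test reject H0, so the conclusion does not hinge on the equal-variance assumption.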


# Linear regression
5) The next step is to perform a linear regression of the variable “Fuel consumption” on all other explanatory variables.
i.	Report the F statistic and interpret it.
ii.	What does R2 measure?
iii.	Analyze the t-statistics and p-values of the regression coefficients. Are all coefficients significant at 5%? Use the library statsmodels.api. The function OLS accepts a pandas dataframe (use .drop() to remove columns).

In [89]:
import scipy.stats as sc
import numpy as np
import statsmodels.formula.api as sm
import matplotlib.pyplot as plt

#we declare TransmissionType as a categorical variable so that the formula encodes it as a 0/1 dummy
data['TransmissionType'] = pd.Categorical(data['TransmissionType'])

#we calculate our regression with the ols function
res=sm.ols(formula='Fuel ~ Payload + Reliability + Season + Net + LoadValue + TransmissionType', data=data).fit()

#We calculate the f critical value for k=6 (6 explanatory variables)
print("F_(6,n-7,95%) (Critical value for F):",sc.f.ppf(1-0.05,dfn=6,dfd=len(Fuel)-7))
#we print the results
print("F_statistics :", res.fvalue,"; f p-value =", res.f_pvalue,"\n")
print("R^2 :", res.rsquared, "; adjusted :",res.rsquared_adj,'\n')
print("AIC :", res.aic)
print("BIC :", res.bic)
print("Log-likelihood :", res.llf,'\n')
print(res.summary().tables[1])

F_(6,n-7,95%) (Critical value for F): 2.100608396123668
F_statistics : 3777.1416828075644 ; f p-value = 0.0 

R^2 : 0.8347627990136768 ; adjusted : 0.8345417951782069 

AIC : 10183.252075445338
BIC : 10228.12400680883
Log-likelihood : -5084.626037722669 

                                 coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------------
Intercept                     26.5017      0.332     79.843      0.000      25.851      27.152
TransmissionType[T.manual]     0.0295      0.023      1.255      0.210      -0.017       0.076
Payload                        0.0035      0.001      3.638      0.000       0.002       0.005
Reliability                   -0.0956      0.003    -31.429      0.000      -0.102      -0.090
Season                         0.6444      0.040     16.167      0.000       0.566       0.723
Net                           -1.8143      0.028    -64.480      0.000      -1.

<strong>Answer :</strong> The F statistic is the ratio of the variance explained by the regression to the unexplained (residual) variance. It tests the hypothesis H0 that all slope coefficients are zero. If F is below the critical value, H0 cannot be rejected, i.e. there is no evidence of a linear relationship between X and Y. Here F = 3777.1 is far above the critical value F_(6,n-7,95%)=2.1 (5% significance level), so the regression is globally significant. <br />
R squared, the coefficient of determination, is the share of the variance of Fuel that is explained by the regression. Its value is between 0 and 1; the closer to 1, the better the fit. Here its value is 0.835, so the regression fits the data well.<br />
A coefficient is not significant at 5% when its p-value (P>|t|) is above 5%. Here, this is the case for the TransmissionType and LoadValue coefficients, so we remove these variables from the model.
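
To make the link between the F statistic and R^2 concrete, the sketch below fits an ordinary least squares model on synthetic data with NumPy and computes F in two equivalent ways. The data, the sample size `n` and the number of regressors `k` are assumptions chosen for illustration; they are not the fuel dataset.

```python
import numpy as np
import scipy.stats as sc

rng = np.random.default_rng(4)
n, k = 200, 3                                   # hypothetical sample size and number of regressors
X = rng.normal(size=(n, k))
y = 1.0 + X @ np.array([0.5, -1.0, 0.0]) + rng.normal(scale=0.8, size=n)

A = np.column_stack([np.ones(n), X])            # design matrix with an intercept column
beta, *_ = np.linalg.lstsq(A, y, rcond=None)    # OLS coefficients
resid = y - A @ beta

ss_res = resid @ resid                          # unexplained (residual) sum of squares
ss_tot = ((y - y.mean()) ** 2).sum()            # total sum of squares
r2 = 1 - ss_res / ss_tot

# F as explained variance over unexplained variance ...
f_from_ss = ((ss_tot - ss_res) / k) / (ss_res / (n - k - 1))
# ... and the same quantity written with R^2
f_from_r2 = (r2 / k) / ((1 - r2) / (n - k - 1))

f_crit = sc.f.ppf(0.95, dfn=k, dfd=n - k - 1)
p_val = sc.f.sf(f_from_r2, k, n - k - 1)

print("R^2:", r2)
print("F:", f_from_ss, f_from_r2, "critical value:", f_crit, "p-value:", p_val)
```

The two expressions agree up to rounding, which shows that a high R^2 with few regressors and many observations necessarily produces a large F.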

6) Remove the non-significant variables and rerun the regression. Compare the log-likelihood, AIC and BIC (the AIC and BIC are not explained in the course; search the internet for explanations).

In [90]:
#we calculate our regression with the ols function
est=sm.ols(formula='Fuel ~ Payload + Reliability + Season + Net', data=data).fit()
#We calculate the f critical value for k=4 (4 explanatory variables)
print("F_(4,n-5,95%) (Critical value for F):",sc.f.ppf(1-0.05,dfn=4,dfd=len(Fuel)-5))
#we print the results of the reduced model (est), not those of the first model (res)
print("F_statistics :", est.fvalue,"; f p-value =", est.f_pvalue,"\n")
print("R^2 :", est.rsquared, "; adjusted :",est.rsquared_adj,'\n')
print("AIC :", est.aic)
print("BIC :", est.bic)
print("Log-likelihood :", est.llf,'\n')
print(est.summary().tables[1])

                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept      26.5270      0.331     80.035      0.000      25.877      27.177
Payload         0.0037      0.001      3.971      0.000       0.002       0.006
Reliability    -0.0976      0.003    -37.364      0.000      -0.103      -0.092
Season          0.6439      0.040     16.177      0.000       0.566       0.722
Net            -1.8151      0.028    -64.523      0.000      -1.870      -1.760


<strong>Answer :</strong> Removing the two non-significant variables leaves the R^2 practically unchanged, and the F statistic rises because the same explained variance is now shared among fewer regressors. <br />
AIC (Akaike information criterion) and BIC (Bayesian information criterion) both penalise models according to their number of parameters k; the BIC penalty also grows with the sample size n. Both are computed from the log-likelihood LL: AIC = 2k - 2LL and BIC = ln(n)k - 2LL.<br />
The log-likelihood must be maximized, whereas the AIC and the BIC must be minimized. Since the second model is nested in the first, its log-likelihood cannot be higher; it is almost unchanged because the removed variables are not significant. Its penalty is smaller, however, so the second model has a lower BIC while its AIC stays roughly the same. We conclude that the second model should be preferred to the first one.
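
The two formulas can be checked against the figures printed for the first model. The sketch below is a minimal illustration: the function names `aic` and `bic` are hypothetical helpers, `k_full = 7` assumes that the intercept is counted together with the six slopes, and `n = 4491` is the record count quoted in the introduction.

```python
import math

def aic(llf, k):
    """Akaike information criterion for log-likelihood llf and k parameters."""
    return 2 * k - 2 * llf

def bic(llf, k, n):
    """Bayesian information criterion; the penalty grows with the sample size n."""
    return k * math.log(n) - 2 * llf

# Log-likelihood reported for the first model (6 regressors + intercept)
llf_full = -5084.626037722669
k_full = 7

aic_full = aic(llf_full, k_full)
print("AIC of the first model:", aic_full)   # close to the reported 10183.2521

# Penalty saved by dropping two parameters
n = 4491
aic_saving = 2 * 2
bic_saving = 2 * math.log(n)
print("penalty saved in AIC:", aic_saving)
print("penalty saved in BIC:", bic_saving)
```

The reduced model improves a criterion whenever twice its loss of log-likelihood is smaller than the corresponding saving, which is much easier to achieve for the BIC than for the AIC at this sample size.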

7) We have treated categorical variables (Season, Net, TransmissionType) as continuous explanatory variables. For binary variables such as Season or TransmissionType (coded as 0-1), the regression coefficient (\beta) represents the marginal impact of the Season or TransmissionType on fuel consumption. For categorical variables with more than 2 levels, like “Net”, the \beta is not clearly interpretable. Imagine that we code Net as 1=mediocre, 2=good, 3=bad: we would obtain a totally different value for the \beta, because the result depends on an arbitrary numeric coding of the categories. In practice, a non-binary categorical variable cannot be entered in the regression equation as is. We must recode it into several binary variables. For example, the variable NET is removed and split into 2 binary variables BAD and MEDIOCRE as follows:

If net=1 (bad quality): we replace it by BAD=1 , MEDIOCRE=0
If net=2 (mediocre quality): we replace it by BAD=0 , MEDIOCRE=1
If net=3 (good quality): we replace it by BAD=0 , MEDIOCRE=0

We do not add a binary variable for roads of good quality because it generates identifiability problems with the intercept beta_0. In general, a categorical variable with n levels is recoded as n-1 binary variables.
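
The identifiability problem can be seen directly on the design matrix. In the small sketch below (the `net` codes are made-up values used only for illustration), adding a third dummy for good roads makes the dummy columns sum to the intercept column, so the matrix loses full column rank.

```python
import numpy as np

net = np.array([1, 2, 3, 3, 2, 1, 3, 2])   # hypothetical road-quality codes
ones = np.ones(len(net))                    # intercept column
bad = (net == 1).astype(float)
med = (net == 2).astype(float)
good = (net == 3).astype(float)

X_ok = np.column_stack([ones, bad, med])          # n-1 dummies: good roads are the reference
X_trap = np.column_stack([ones, bad, med, good])  # one dummy per level plus the intercept

rank_ok = np.linalg.matrix_rank(X_ok)
rank_trap = np.linalg.matrix_rank(X_trap)
print("columns:", X_ok.shape[1], "rank:", rank_ok)
print("columns:", X_trap.shape[1], "rank:", rank_trap)
```

The first matrix has three columns and rank 3, whereas the second has four columns but still rank 3, so its coefficients cannot be determined uniquely.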

Recode the variable NET into two binary variables and rerun the regression. Compare the log-likelihood, AIC and BIC of this model with the previous ones (use drop() and insert() to remove/add columns to a dataframe).


In [91]:
#transforming the Net array into two binary arrays
BAD=(Net==1).astype(int)
MED=(Net==2).astype(int)

#inserting the two arrays in the dataset
data.insert(5,"BadRoad",BAD)
data.insert(5,"MediocreRoad",MED)

#we calculate our regression with the ols function
est=sm.ols(formula='Fuel ~ Payload + Reliability + Season + BadRoad + MediocreRoad', data=data).fit()
#We calculate the f critical value for k=5 (5 explanatory variables)
print("F_(5,n-6,95%) (Critical value for F):",sc.f.ppf(1-0.05,dfn=5,dfd=len(Fuel)-6))
#we print the results of the recoded model (est), not those of the first model (res)
print("F_statistics :", est.fvalue,"; f p-value =", est.f_pvalue,"\n")
print("R^2 :", est.rsquared, "; adjusted :",est.rsquared_adj,'\n')
print("AIC :", est.aic)
print("BIC :", est.bic)
print("Log-likelihood :", est.llf,'\n')
print(est.summary().tables[1])

                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept       21.0255      0.349     60.159      0.000      20.340      21.711
Payload          0.0037      0.001      3.978      0.000       0.002       0.006
Reliability     -0.0969      0.003    -36.737      0.000      -0.102      -0.092
Season           0.6679      0.042     15.778      0.000       0.585       0.751
BadRoad          3.5465      0.075     46.992      0.000       3.399       3.694
MediocreRoad     1.8330      0.030     60.825      0.000       1.774       1.892


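<strong>Answer :</strong> The recoded model contains the previous one as a special case: treating Net as a number forces the effect of a bad road to be exactly twice that of a mediocre road, whereas the BadRoad and MediocreRoad coefficients are estimated freely. Its log-likelihood is therefore at least as high as that of the previous model, at the cost of one extra parameter that the AIC and the BIC penalise. <br />
Here the BadRoad coefficient (3.5465) is close to twice the MediocreRoad coefficient (1.8330), so the effect of road quality is nearly linear and the three criteria should change only slightly. The main benefit of the recoding is interpretability: each coefficient directly gives the marginal impact of road degradation on fuel consumption, relative to a good road.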
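
As a hedged illustration of how the two codings of a three-level variable relate, the sketch below fits both on synthetic data with NumPy. Because Net = 3 - 2*BAD - MEDIOCRE, the numeric coding is a restricted version of the dummy coding, so the dummy-coded fit can never have a lower log-likelihood. The data, the effect sizes and the helper `gaussian_llf` are assumptions made for this example.

```python
import numpy as np

def gaussian_llf(y, X):
    """Maximised Gaussian log-likelihood of an OLS fit of y on X."""
    n = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    return -0.5 * n * (np.log(2 * np.pi * rss / n) + 1)

rng = np.random.default_rng(7)
n = 1000
net = rng.integers(1, 4, size=n)            # hypothetical codes 1, 2, 3
# Effect of road quality deliberately not linear in the code (3.0, 2.0, 0.0)
effect = np.where(net == 1, 3.0, np.where(net == 2, 2.0, 0.0))
y = 16.0 + effect + rng.normal(scale=1.0, size=n)

ones = np.ones(n)
X_lin = np.column_stack([ones, net]).astype(float)                 # Net as a number
X_dum = np.column_stack([ones, net == 1, net == 2]).astype(float)  # two dummies

ll_lin = gaussian_llf(y, X_lin)
ll_dum = gaussian_llf(y, X_dum)
aic_lin = 2 * 2 - 2 * ll_lin   # intercept + 1 slope
aic_dum = 2 * 3 - 2 * ll_dum   # intercept + 2 dummies

print("log-likelihood:", ll_lin, ll_dum)
print("AIC:", aic_lin, aic_dum)
```

Whether the extra parameter pays off in terms of AIC or BIC depends on how far the true effect is from linear, which is why the criteria of the fitted model itself should be read before concluding.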