# Towards Reducing Pollution in Kigali
### Predict fuel consumption in liters/100 km for each car in the city

Introduction

Pollution is a big issue in the city of Kigali. Policy makers in the city want to take action and deploy some measures to address this problem. You have been hired as a machine learning expert to analyze some data and help them make good decisions.

Cars that consume more fuel pollute more. As a first step, we want to estimate how much fuel each individual car consumes every 100 km. The provided dataset concerns city-cycle fuel consumption in liters per 100 kilometers (target).

The aim of this homework is to help you apply the skills that you have learned so far to a real dataset. This involves learning what data means, how to handle and visualize data, training, cross validation, prediction, testing your model, etc.

Description of covariates
This dataset has 3 multi-valued discrete and 4 continuous covariates, plus one string identifier. The target, fuel consumption, is also continuous.


    1. cylinders:     multi-valued discrete
    2. displacement:  continuous
    3. horsepower:    continuous
    4. weight:        continuous
    5. acceleration:  continuous
    6. model year:    multi-valued discrete
    7. origin:        multi-valued discrete
    8. car name:      string (unique for each instance)
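
As a minimal sketch of how these column groups could be inspected with pandas, the snippet below builds a two-row toy frame with the same column names. The values are made up for illustration and are not taken from the competition data.

```python
import pandas as pd

# Column groups as listed in the description above
discrete_cols = ["cylinders", "model year", "origin"]
continuous_cols = ["displacement", "horsepower", "weight", "acceleration"]

# Toy rows (hypothetical values) using the dataset's column names
toy = pd.DataFrame({
    "cylinders": [4, 8],
    "displacement": [97.0, 350.0],
    "horsepower": [88.0, 165.0],
    "weight": [2130.0, 3693.0],
    "acceleration": [14.5, 11.5],
    "model year": [71, 70],
    "origin": [3, 1],
    "car name": ["toy car a", "toy car b"],
})

# Number of distinct values per discrete column
print(toy[discrete_cols].nunique())
# Summary statistics for the continuous columns
print(toy[continuous_cols].describe())
```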

In [None]:
# as usual, let us load all the necessary libraries
import numpy as np  # numerical computation with arrays
import pandas as pd # library to manipulate datasets using dataframes
import scipy as sp  # scientific computing library (includes statistics)


# below sklearn libraries for different models
from sklearn.tree import DecisionTreeClassifier as DecisionTree
from sklearn.ensemble import RandomForestClassifier as RandomForest
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn.impute import KNNImputer
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import MinMaxScaler

#import libraries for implementing neural networks
import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD, Adam
from keras.regularizers import l2

# plot 
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
#Additional Libraries

import seaborn as sns
from scipy.stats import zscore
from sklearn.svm import SVR
from sklearn.svm import LinearSVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

# Loading Competition Data
Run all the steps below to obtain the data

In [None]:
# Read training data
train_data = pd.read_csv('https://raw.githubusercontent.com/onefishy/Rwanda-course-2020/master/Competition_data/train.csv',na_values='?') # read in the data as a DataFrame
train_data.head(3) # show the first 3 rows of the dataset

In [None]:
# Read test data 
test_data = pd.read_csv('https://raw.githubusercontent.com/onefishy/Rwanda-course-2020/master/Competition_data/test.csv') # read in the data as a DataFrame
test_data.head(3) # show the first 3 rows of the dataset

# Now we are done with downloading data! 
* Try building a model inside this notebook by creating additional cells below with code to specify and fit the model
* If you are fitting large neural nets, make sure this Google Colab notebook is running on GPUs
* Check Edit --> Notebook settings --> Hardware accelerator: GPU

In [None]:
# Start building a model here

## Data Cleaning

In [None]:
train_data['weight'] = train_data['weight'].abs()  # replace negative values in the weight column with their absolute values
train_data['horsepower'] = pd.to_numeric(train_data['horsepower'], errors='coerce')  # convert the horsepower column from object to numeric
train_data = train_data.drop(['car name'], axis=1)  # drop the car name column from the training data: it is unique per row, so it does not help prediction
test_data = test_data.drop(['car name'], axis=1)  # drop the car name column from the test data for the same reason

## Check for Null values

In [None]:
train_data.isna().sum() #check for null values in the dataset
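
The null count above only detects missing values that pandas has parsed as `NaN`. The sketch below, which uses a small inline CSV with made-up values in place of the real file, shows how the `na_values='?'` argument used when loading the training data affects that count.

```python
import io
import pandas as pd

# Inline CSV standing in for the real file (hypothetical values)
csv_text = "horsepower,weight\n130,3504\n?,2587\n95,2372\n"

raw = pd.read_csv(io.StringIO(csv_text))                    # "?" is kept as a string
parsed = pd.read_csv(io.StringIO(csv_text), na_values="?")  # "?" is parsed as NaN

print(raw.isna().sum())     # no missing values detected
print(parsed.isna().sum())  # one missing value in horsepower
```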

## Imputing Null Values with a Linear Regression Model
Since our data shows a linear pattern, we apply a linear regression imputer that learns from the non-null data points to fill in the null values.

In [None]:
# Separate the rows with null values from the complete rows
null_data = train_data[train_data.isnull().any(axis=1)]  # rows that contain at least one null value
train_data2 = train_data.dropna()  # complete rows, obtained by dropping every row with a null value

x_imp = train_data2.drop(["horsepower"], axis=1)  # horsepower is the column with null values, so the remaining columns serve as features
y_imp = train_data2.horsepower  # horsepower is the label to be predicted

X_train, X_test, Y_train, Y_test = train_test_split(x_imp, y_imp, test_size=0.3, random_state=42)  # split the complete rows into training and held-out parts

lr = LinearRegression()  # instantiate the linear regression model
lr.fit(X_train, Y_train)  # fit the model on the training data

# Features of the rows whose horsepower is missing (the original null_data is left untouched)
null_data2 = null_data.drop("horsepower", axis=1)

# Predict the missing horsepower value of every such row in one call
lr_predictions = lr.predict(null_data2)

# Print each prediction
for a, pred in enumerate(lr_predictions):
    print("Prediction {}".format(a + 1))
    print(pred)
    print("---------------")

null_index = train_data[train_data["horsepower"].isna()].index

# Write the predicted values into the original dataset
# (.loc avoids chained assignment, which may silently fail to update train_data)
train_data.loc[null_index, "horsepower"] = lr_predictions

# Print the number of missing values after imputation
print("Missing Values: {}".format(train_data.isnull().sum().sum()))
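
The imputation idea can also be packaged as a small reusable function. The following is a minimal sketch on synthetic data, not the notebook's exact procedure: the helper name `impute_with_regression` is hypothetical, it fits on all complete rows rather than on a 70% split, and it assumes the target column is the only one with missing values.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def impute_with_regression(df, target):
    """Fill NaNs in df[target] using a linear model fit on the complete rows.

    Assumes `target` is the only column with missing values.
    """
    df = df.copy()
    missing = df[target].isna()
    features = df.columns.drop(target)
    model = LinearRegression().fit(df.loc[~missing, features], df.loc[~missing, target])
    df.loc[missing, target] = model.predict(df.loc[missing, features])
    return df

# Synthetic data: horsepower is an exact linear function of weight
rng = np.random.default_rng(0)
toy = pd.DataFrame({"weight": rng.uniform(1500, 4500, size=50)})
toy["horsepower"] = 0.04 * toy["weight"] - 10
toy.loc[[3, 17], "horsepower"] = np.nan  # knock out two values

filled = impute_with_regression(toy, "horsepower")
print(filled.loc[[3, 17]])
print("Missing values:", filled.isna().sum().sum())
```

Because the synthetic relationship is exactly linear, the imputed values match the formula used to generate the data. On real data the imputed values are only estimates.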

# Remove Outliers

In [None]:
low = .05  # lower quantile
high = .95  # upper quantile

# Step 1: compute the 5th and 95th percentiles of each column in the dataset
quantile_df = train_data.quantile([low, high])

# Step 2: perform outlier removal for the numerical columns
features = ['displacement', 'horsepower', 'weight', 'acceleration', 'fuel (L/100km)']

# Filter cumulatively so that every feature's bounds are applied, not only the last one
train_data_rm = train_data
for i in features:
    train_data_rm = train_data_rm[(train_data_rm[i] > quantile_df.loc[low, i]) & (train_data_rm[i] < quantile_df.loc[high, i])]
    print('Number of rows after outlier removal: {}'.format(train_data_rm.shape[0]))

train_data = train_data_rm
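
As an alternative sketch, a quantile filter over several columns can be written without a loop by building one boolean mask. The data below is synthetic and the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Synthetic data standing in for two numerical columns
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "weight": rng.normal(3000, 500, size=200),
    "acceleration": rng.normal(15, 2, size=200),
})

low, high = 0.05, 0.95
bounds = toy.quantile([low, high])

# Keep a row only if every column lies strictly between its 5% and 95% quantiles
inside = (toy > bounds.loc[low]) & (toy < bounds.loc[high])
trimmed = toy[inside.all(axis=1)]

print("Rows before:", len(toy))
print("Rows after: ", len(trimmed))
```

Note that trimming every column at the 5% and 95% quantiles removes rows even when the data contains no true outliers, so the number of remaining rows is worth checking.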

### Assign X and Y

In [None]:
# Split the columns into x (training covariates) and y (output label)

x = train_data.drop(columns=['fuel (L/100km)'])#remove output label 

y = train_data['fuel (L/100km)']# assign output label for y

# One-hot Encoding

In [None]:
# Encode all categorical columns in the training set
x["origin"] = x["origin"].astype(str)
x["cylinders"] = x["cylinders"].astype(str)
x = pd.get_dummies(x)


# Encode all categorical columns in the testing set
test_data["origin"] = test_data["origin"].astype(str) 
test_data["cylinders"] = test_data["cylinders"].astype(str)
test_data = pd.get_dummies(test_data)

# The testing set contains no car with 5 cylinders, so get_dummies creates no cylinders_5 column for it.
# Insert that column, filled with 0, at the same position as in the training set so that both sets have identical columns.
test_data.insert(loc=7, column='cylinders_5', value = 0)
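
Inserting a column at a fixed position works only as long as the column layout is known in advance. A more general option, sketched here on toy frames with made-up values, is to reindex the encoded test set against the encoded training columns.

```python
import pandas as pd

# Toy frames (hypothetical values): the test set has no 5-cylinder car
train = pd.DataFrame({"weight": [2000, 2800, 3500], "cylinders": ["4", "5", "8"]})
test = pd.DataFrame({"weight": [2100, 3600], "cylinders": ["4", "8"]})

train_enc = pd.get_dummies(train)
test_enc = pd.get_dummies(test)

# Give the test set exactly the training columns, in the same order;
# dummy columns absent from the test set are filled with 0
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(test_enc.columns))
```

Reindexing also drops any column that appears only in the test set, which keeps the two frames aligned in both directions.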

## Split the dataset

In [None]:
#Split our data for training and testing 

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=0)


In [None]:
# Check the shapes of the training and testing data
print("Shape of Training data is:",x_train.shape)
print("Shape of Testing data is:",x_test.shape)
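
A single train/test split gives one estimate of the error. Since the introduction mentions cross validation, here is a hedged sketch of a 5-fold estimate with scikit-learn. The data is synthetic (generated with `make_regression`) and the model choice is only for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data standing in for the encoded training set
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

model = LinearRegression()

# scikit-learn maximizes scores, so RMSE is reported as a negative number
neg_rmse = cross_val_score(model, X_demo, y_demo, cv=5,
                           scoring="neg_root_mean_squared_error")
cv_rmse = -neg_rmse

print("RMSE per fold:", cv_rmse)
print("Mean RMSE:", cv_rmse.mean())
```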



---


# MODELING

---



# Model 1
># Support Vector Regression (SVR)

In [None]:
#Support Vector Machine

# Step 1: Instantiate and fit the Support Vector Regressor
svm_regr = make_pipeline(StandardScaler(),SVR(C=8,epsilon=2e-1))
svm_regr.fit(x_train,y_train)
  
# Step 2: Predict label on training set
y_train_pred = svm_regr.predict(x_train)
# Step 3: Compute RMSE on training set 
print('RMSE on Training Data :', np.sqrt(mean_squared_error(y_train, y_train_pred)))
# Step 4: Predict label on test set
y_test_pred = (svm_regr.predict(x_test))
# Step 5: Compute RMSE on test set 
print('RMSE on Testing Data  :', np.sqrt(mean_squared_error(y_test, y_test_pred)),'\n')

# Save your predictions to a DataFrame
prediction = svm_regr.predict(test_data)
my_submission = pd.DataFrame(prediction)
my_submission=my_submission.rename(columns={0: "predictions"})
my_submission.to_csv('my_submission_svm.csv',index=True,index_label='id')
#files.download('my_submission_svm.csv')
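
The values of `C` and `epsilon` strongly affect an SVR. One way to choose them is a cross-validated grid search, sketched below on synthetic data. The grid values are arbitrary assumptions for illustration, not the values used to tune the model above.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic regression data standing in for the encoded training set
X_demo, y_demo = make_regression(n_samples=150, n_features=8, noise=5.0, random_state=0)

pipe = make_pipeline(StandardScaler(), SVR())

# make_pipeline names each step after its lowercased class, hence the "svr__" prefix
param_grid = {"svr__C": [1, 8, 64], "svr__epsilon": [0.05, 0.2, 0.5]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="neg_root_mean_squared_error")
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best CV RMSE:", -search.best_score_)
```

Keeping the scaler inside the pipeline means it is refit on each training fold, so no information from the validation fold leaks into the scaling.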


# Model 2
># Linear Regression Model

In [None]:
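# Step 1: Instantiate the linear regression model and fit it on the training split only,
# so that the test RMSE below is computed on data the model has not seen
liner_model=LinearRegression()
liner_model.fit(x_train,y_train)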

# Step 2: Predict label on training set
y_train_pred = liner_model.predict(x_train)

# Step 3: Compute RMSE on training set 
print('RMSE on Training Data:', np.sqrt(mean_squared_error(y_train, y_train_pred)))

# Step 4: Predict label on test set
y_test_pred = liner_model.predict(x_test)

# Step 5: Compute RMSE on test set 
print('RMSE on Testing Data: ', np.sqrt(mean_squared_error(y_test, y_test_pred)))

# Save your predictions to a DataFrame
liner_pred=liner_model.predict(test_data)
my_submission = pd.DataFrame(liner_pred)
my_submission=my_submission.rename(columns={0: "predictions"})
my_submission.to_csv('my_submission_liner.csv',index=True,index_label='id')
#files.download('my_submission_liner.csv')

# Model 3
> # Random Forest

In [None]:
# Step 1: Instantiate the Random Forest Model and fit it on the training split
rand_regr = RandomForestRegressor(max_depth=3, random_state=0)
rand_regr.fit(x_train,y_train)

# Step 2: Predict label on training set
y_train_pred = rand_regr.predict(x_train)
# Step 3: Compute RMSE on training set 
print('RMSE on Training Data:', np.sqrt(mean_squared_error(y_train, y_train_pred)))

# Step 4: Predict label on test set
y_test_pred = (rand_regr.predict(x_test))
# Step 5: Compute RMSE on test set 
print('RMSE on Testing Data: ', np.sqrt(mean_squared_error(y_test, y_test_pred)))

# Save your predictions to a DataFrame
my_submission = pd.DataFrame(rand_regr.predict(test_data))
my_submission=my_submission.rename(columns={0: "predictions"})
my_submission.to_csv('my_submission_random.csv',index=True,index_label='id')
#files.download('my_submission_random.csv')
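
A fitted random forest also exposes impurity-based feature importances, which can hint at which covariates drive the predictions. The sketch below uses synthetic data from `make_regression` with placeholder feature names, so its ranking says nothing about the car dataset.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: with shuffle=False, only the first 3 of the 8 features carry signal
X_demo, y_demo = make_regression(n_samples=300, n_features=8, n_informative=3,
                                 random_state=0, shuffle=False)
names = ["f{}".format(i) for i in range(X_demo.shape[1])]  # placeholder names

forest = RandomForestRegressor(max_depth=3, random_state=0)
forest.fit(X_demo, y_demo)

# Impurity-based importances are non-negative and sum to 1
importances = pd.Series(forest.feature_importances_, index=names).sort_values(ascending=False)
print(importances)
```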

# Model 4
# Decision Tree Model


In [None]:
# Step1: Instantiate Decision Tree Model

dec_regr = DecisionTreeRegressor(max_depth=6)
dec_regr.fit(x_train,y_train)

# Step 2: Predict label on training set
y_train_pred = dec_regr.predict(x_train)
# Step 3: Compute RMSE on training set 
print('RMSE on Training Data:', np.sqrt(mean_squared_error(y_train, y_train_pred)))
# Step 4: Predict label on test set
y_test_pred = dec_regr.predict(x_test)
# Step 5: Compute RMSE on test set 
print('RMSE on Testing Data: ', np.sqrt(mean_squared_error(y_test, y_test_pred)))

# Save your predictions to a DataFrame
my_submission = pd.DataFrame(dec_regr.predict(test_data))
my_submission=my_submission.rename(columns={0: "predictions"})
my_submission.to_csv('my_submission_decison.csv',index=True,index_label='id')
#files.download('my_submission_decison.csv')
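
The four model cells repeat the same fit, predict, and RMSE steps. A small helper makes the comparison less error-prone. This is a sketch on synthetic data: the function name `rmse_report` is hypothetical and only two of the models are shown.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

def rmse_report(model, X_tr, y_tr, X_te, y_te):
    """Fit on the training split and return (train RMSE, test RMSE)."""
    model.fit(X_tr, y_tr)
    train_rmse = np.sqrt(mean_squared_error(y_tr, model.predict(X_tr)))
    test_rmse = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))
    return train_rmse, test_rmse

# Synthetic data standing in for the encoded training set
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.33, random_state=0)

candidates = {
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(max_depth=6, random_state=0),
}
results = {name: rmse_report(m, X_tr, y_tr, X_te, y_te) for name, m in candidates.items()}

for name, (tr, te) in results.items():
    print("{:<8s} train RMSE {:.2f} | test RMSE {:.2f}".format(name, tr, te))
```

A large gap between the train and test RMSE of a model is a sign of overfitting, which is worth checking before picking a final model.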

# Once you have generated predictions on the holdout and test data, create a submission file
Your submission file should follow the Kaggle submission template.


In [None]:
# Here's a sample submission file
sample_submission = pd.read_csv('https://raw.githubusercontent.com/onefishy/Rwanda-course-2020/master/Competition_data/sampleSubmission.csv') # read in the data as a DataFrame
sample_submission.head() # show the first 5 rows of the dataset

---


## Of the models above, we selected Support Vector Regression for the final submission

---



In [None]:
# Save your predictions to a DataFrame
my_submission = pd.DataFrame(svm_regr.predict(test_data))
my_submission=my_submission.rename(columns={0: "predictions"})
my_submission.to_csv('my_submission.csv',index=True,index_label='id')

In [None]:
# Check that your submission looks the same as the sample submission
pd.read_csv('my_submission.csv')
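
The layout check can also be done programmatically. The sketch below writes placeholder predictions (made-up values) to an in-memory buffer with the same `to_csv` arguments as above and reads them back. The expected column names `id` and `predictions` are those produced by the code in this notebook.

```python
import io
import numpy as np
import pandas as pd

# Placeholder predictions (hypothetical values)
prediction = np.array([8.1, 11.4, 6.7])

submission = pd.DataFrame({"predictions": prediction})

# Write to an in-memory buffer instead of disk, using the same to_csv arguments
buffer = io.StringIO()
submission.to_csv(buffer, index=True, index_label="id")
csv_text = buffer.getvalue()
print(csv_text)

# Read it back and check the layout: an "id" column followed by "predictions"
check = pd.read_csv(io.StringIO(csv_text))
print(list(check.columns))
```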

# Save and download the submission file and upload to the Kaggle website

Then download `my_submission.csv` by running the following line and submit it to the [Kaggle](https://www.kaggle.com/t/b9bc778c9e8842d28c5526f578e6c348) competition website.

In [None]:
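# In Google Colab, uncomment the two lines below to download the file
#from google.colab import files
#files.download('my_submission.csv')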