Building digital maps is challenging, and keeping them up to date in an ever-changing world is even more so. Various machine learning techniques help us detect road signs and other changes in the real world and process them to update maps.

The problem presented here concerns the step that follows detecting a sign on a road: identifying each road geometry to which the sign applies. While this sounds simple, signs at junctions make it considerably harder.

For example, suppose a sign is detected from a four-camera setup on a vehicle. The closest sighting of the sign may be in the right-facing camera, at a sharp angle relative to the direction of the car on which the camera set is mounted. The next step in updating the map is to identify the exact road on which this sign should be placed or applied.

At a + junction, when a sign is detected by the right camera, it is hard to tell whether the sign applies to the straight road or to the right-side road unless you consider parameters such as the aspect ratio of the sign's bounding box.

For example, a sign detected by the front camera has its natural aspect ratio when it actually faces the front of the car. When the same sign is detected by a right-side camera at a sharp angle from the front, however, its bounding box is skewed, which hints that although the sign was detected on the right, it still faces the front of the car.
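
The skew described above can be illustrated with a simple foreshortening model. This is only a sketch under stated assumptions: it treats the sign as a flat rectangle, ignores perspective and lens distortion, and assumes the apparent width shrinks with the cosine of the angle between the sign's normal and the camera's line of sight. The function name `apparent_aspect_ratio` is hypothetical and is not part of the dataset or the code that follows.

```python
import math

def apparent_aspect_ratio(true_ratio: float, view_angle_deg: float) -> float:
    """Approximate bounding-box width/height of a flat sign viewed off-axis.

    Assumption: only the width is foreshortened, by cos(view angle).
    """
    return true_ratio * math.cos(math.radians(view_angle_deg))

# A square sign (true ratio 1.0) seen head-on and at increasingly sharp angles.
for angle in (0, 30, 60, 80):
    print(angle, round(apparent_aspect_ratio(1.0, angle), 3))
```

Under this approximation, a narrow bounding box seen by a side camera suggests the sign is turned away from that camera, which is the hint the aspect ratio feature is meant to carry.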

The dataset provided here records the camera that detected the sign, the angle of the sign with respect to the front of the vehicle in degrees, the sign's reported bounding box aspect ratio (width/height), the sign's width and height, and the target feature Sign Facing, which is the direction the sign actually faces.

The goal is to predict where the sign actually faces with respect to the vehicle, given the inputs above.

### Predict the Road Sign

## Importing the libraries
import numpy as np  # linear algebra
import matplotlib.pyplot as plt  # basic plotting
import pandas as pd  # DataFrame operations
import seaborn as sns  # advanced plotting operations


## Reading the datasets
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

## Previewing datasets
train.info()

test.info()


# Print summary statistics of each column
train.describe() # --> Summary statistics of numerical features + Target
train.describe(include = ['O']) # --> Summary statistics of categorical features
test.describe()
test.describe(include = ['O'])

# Create numpy arrays for train and test datasets
# (column 0 is skipped in both; the last train column is the target)
X_train = train.iloc[:, 1:-1].values
y_train = train.iloc[:, -1].values
X_test = test.iloc[:, 1:].values

# Encoding categorical variables
# One-hot encode column 0 (the only categorical feature); drop = 'first' avoids the dummy variable trap.
# OneHotEncoder(categorical_features = ...) was removed in scikit-learn 0.22, so ColumnTransformer is used instead.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
encoder_X = ColumnTransformer([('categorical', OneHotEncoder(drop = 'first'), [0])],
                              remainder = 'passthrough', sparse_threshold = 0)
X_train = encoder_X.fit_transform(X_train).astype(float)
X_test = encoder_X.transform(X_test).astype(float)
# Encode the target labels as integers
labelencoder_y = LabelEncoder()
y_train = labelencoder_y.fit_transform(y_train)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)


# Model 1 : Random Forest Classification
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 501, criterion = "entropy", n_jobs = -1)
classifier.fit(X_train, y_train)


# Model 2 : SVM Classification
from sklearn.svm import SVC
classifier2 = SVC(kernel = 'linear', probability = True, C = 10)
classifier2.fit(X_train, y_train)

# Model 3 : XGBoost
from xgboost import XGBClassifier
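# eval_metric is set on the estimator: passing it to fit() has been deprecated
# since XGBoost 1.6 and is rejected by later releases
classifier3 = XGBClassifier(max_depth = 3, learning_rate = 0.05,
                            n_estimators = 501, objective = "multi:softprob",
                            gamma = 0, base_score = 0.5, reg_lambda = 10, subsample = 0.7,
                            colsample_bytree = 0.8, eval_metric = "mlogloss")

classifier3.fit(X_train, y_train)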


## Using Grid Search and cross-validation to find the best model and the best parameters
from sklearn.model_selection import GridSearchCV
parameters = [{'n_estimators' : [101, 501, 780],
               'max_depth' : [3, 6],
               'learning_rate' : [0.05, 0.1],
               'base_score' : [0.7, 0.8]}
             ]

grid_search = GridSearchCV(estimator = classifier3, 
                           param_grid = parameters,
                           scoring = "neg_log_loss",
                           cv = 10, n_jobs = -1)
grid_search = grid_search.fit(X_train, y_train)
best_metric = grid_search.best_score_
best_params = grid_search.best_params_
grid_search.cv_results_ # See all scores (grid_scores_ was removed in scikit-learn 0.20)


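# Predicting the Test Set results with the best model found by the grid search
# (GridSearchCV refits it on the full training set by default)
y_pred = grid_search.best_estimator_.predict_proba(X_test)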

# Clipping output probabilities to get better log loss score
y_pred_clipped = np.clip(y_pred, 0.005, 0.995)


# Writing the clipped results to a csv file
np.savetxt('results01.csv', y_pred_clipped, fmt = '%.6f')
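
The clipping step above can be motivated with a small standalone example. This is a minimal sketch, not part of the original pipeline: the helper names `clip_and_renormalize` and `log_loss_single` are hypothetical, the four-class probability vector is made up for illustration, and the renormalization step is an added assumption (the code above only clips). It shows that a confidently wrong prediction is penalized less by log loss once its probabilities are clipped to the same [0.005, 0.995] range.

```python
import math

def clip_and_renormalize(probs, low=0.005, high=0.995):
    """Clip each probability into [low, high], then rescale so the row sums to 1."""
    clipped = [min(max(p, low), high) for p in probs]
    total = sum(clipped)
    return [p / total for p in clipped]

def log_loss_single(probs, true_index):
    """Log loss contribution of one sample: -ln(probability of the true class)."""
    return -math.log(probs[true_index])

# Made-up prediction: the model is almost certain of class 0, but class 1 is correct.
confident_wrong = [0.999, 0.0005, 0.0003, 0.0002]
true_index = 1

raw_loss = log_loss_single(confident_wrong, true_index)
clipped_loss = log_loss_single(clip_and_renormalize(confident_wrong), true_index)
print(raw_loss, clipped_loss)  # the clipped loss is the smaller of the two
```

The trade-off is that clipping also slightly raises the loss on confident correct predictions, so it only helps when the model is occasionally overconfident.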