**Import Libraries**

In [1]:
import numpy as np 
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier 
from sklearn.feature_selection import SelectKBest
from sklearn import metrics

import warnings
warnings.filterwarnings('ignore')

**Load Data**

In [2]:
full_train_data = np.loadtxt("train.csv", dtype = "int", delimiter = ",", skiprows=1)
feature_names = np.loadtxt("train.csv", dtype = "str", delimiter = ",", max_rows=1) #read only the header row
test_data = np.loadtxt("test.csv", dtype = "int", delimiter = ",", skiprows=1)

**Prepare Data**

In [3]:
full_train_labels = full_train_data[:,-1] #last column holds the label
full_train_data = full_train_data[:,1:-1] #drop ID (first column) and label (last column)
test_ids = test_data[:,0]
test_data = test_data[:,1:] #drop ID

**Split into Train and Dev Data**

In [5]:
np.random.seed(58230)
shuffle = np.random.permutation(np.arange(full_train_data.shape[0]))
full_train_data, full_train_labels = full_train_data[shuffle], full_train_labels[shuffle]

train_data, train_labels = full_train_data[:14120], full_train_labels[:14120]
dev_data, dev_labels = full_train_data[14120:], full_train_labels[14120:]

**Modeling**

Our ensemble combines a random forest with a k-nearest neighbors classifier. If the random forest's highest predicted class probability exceeds a threshold (0.6 in our runs), the random forest prediction is used; otherwise, the k-nearest neighbors prediction is used. We chose to combine these two models because our initial analysis showed both performing well out of the box.

In [8]:
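#Feature selection
def my_featureselection(num_features, fit_data, fit_labels, transform_data):
    # Keep the num_features best-scoring features (SelectKBest uses the ANOVA F-test by default)
    selection = SelectKBest(k=num_features)
    # Fit the selector on fit_data only, then keep the same columns in transform_data
    top_train = selection.fit_transform(fit_data,fit_labels)
    top_test = selection.transform(transform_data)
    return [top_train, top_test]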

#Ensemble Model
def rf_then_knn(model_rf,model_knn,proba_threshold,test_data,train_data=train_data,train_labels=train_labels):
    model_rf.fit(train_data,train_labels)
    rf_test_preds = model_rf.predict(test_data)
    rf_test_pred_proba = model_rf.predict_proba(test_data)
    
    top27_train_data,top27_test_data = my_featureselection(27,train_data,train_labels,test_data)
    model_knn.fit(top27_train_data,train_labels)
    knn_test_preds = model_knn.predict(top27_test_data)
    
    # Use the random forest label where its top class probability clears the threshold,
    # and fall back to the k-nearest neighbors label everywhere else
    rf_is_confident = np.max(rf_test_pred_proba, axis=1) > proba_threshold
    test_preds = np.where(rf_is_confident, rf_test_preds, knn_test_preds)
    return test_preds
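
To see the selection rule in isolation, here is a minimal sketch on made-up arrays. The helper name `combine_by_confidence` and the toy probabilities and labels are assumptions for illustration only; they are not part of the notebook's data or results.

```python
import numpy as np

def combine_by_confidence(rf_proba, rf_preds, knn_preds, threshold):
    """Keep the forest's label where its top class probability exceeds threshold."""
    rf_is_confident = rf_proba.max(axis=1) > threshold
    return np.where(rf_is_confident, rf_preds, knn_preds)

# Toy inputs (hypothetical): 3 examples, 3 classes
rf_proba = np.array([[0.70, 0.20, 0.10],   # confident, keep the forest label
                     [0.40, 0.35, 0.25],   # not confident, fall back to k-NN
                     [0.10, 0.10, 0.80]])  # confident, keep the forest label
rf_preds = np.array([1, 1, 3])
knn_preds = np.array([2, 2, 2])

combined = combine_by_confidence(rf_proba, rf_preds, knn_preds, 0.6)
print(combined)  # [1 2 3]
```

Only the second example falls back to the k-nearest neighbors label, because its highest forest probability (0.40) does not exceed the 0.6 threshold.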

**Run Model on Dev Data**

In [9]:
model_knn = KNeighborsClassifier(n_neighbors=1,metric='braycurtis')
model_rf = RandomForestClassifier(n_estimators=100,max_features=25,random_state=2)

dev_preds = rf_then_knn(model_rf,model_knn,.6, dev_data)
accuracy = metrics.accuracy_score(dev_labels,dev_preds)
print(accuracy)

0.892


**Run Model on Test Data**

In [10]:
model_knn = KNeighborsClassifier(n_neighbors=1,metric='braycurtis')
model_rf = RandomForestClassifier(n_estimators=100,max_features=25,random_state=2)

test_preds = rf_then_knn(model_rf,model_knn,.6, test_data)

After submitting to Kaggle, we found that the ensemble's accuracy on the test set was around 75%. We also submitted our baseline model (k-nearest neighbors with default settings) to Kaggle, where it reached an accuracy of 64%.

**Submit Results**

In [11]:
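def preds_to_csv(preds,ids):
    # Write one row per test example with the columns Id and Cover_Type
    dat_submit = pd.DataFrame({"Id": np.asarray(ids), "Cover_Type": np.asarray(preds)})
    dat_submit.to_csv("predictions.csv",index=False)

# Write the test-set predictions from the previous cell to predictions.csv
preds_to_csv(test_preds,test_ids)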