In [3]:
%run ../src/init.py
%matplotlib inline

#### Part One: Extract Data & Create Dataframes

##### Original Madelon:
These datasets were harvested from the UCI website with curl -o.

In [5]:
train_data = pd.read_csv('/home/jovyan/project_3/data/madelon_train_data', \
                         delimiter = ' ', header = None).drop(500, axis = 1)
label_data = pd.read_csv('/home/jovyan/project_3/data/madelon_train_labels', \
                         delimiter = ' ', header = None)[0]
validate_data = pd.read_csv('/home/jovyan/project_3/data/validate_data', \
                         delimiter = ' ', header = None).drop(500, axis =1)
validate_labels = pd.read_csv('/home/jovyan/project_3/data/validate_labels', \
                         delimiter = ' ', header = None)[0]

##### A recreation of Madelon featuring 1000 columns and 200k rows.
A .3% sample of the database was drawn first and run through Douglas' method for identifying significant features, described in Part Three below. Querying only those significant columns made it possible to retrieve the complete set of 200k rows rather than a sample.

The original sampling was performed by retrieving the _id column_ through a call to the SQL database, then taking a random sample of IDs, which were put into a list and inserted into the final query. This method can be found in the random_data_sampling() function in the second cell below.

In [4]:
con = pg2.connect(host='34.211.227.227',
          dbname='postgres',
          user='postgres')
cur = con.cursor(cursor_factory=RealDictCursor)
sql = 'SELECT feat_257, feat_269, feat_308, feat_315, feat_336, feat_341, feat_395, feat_504, \
feat_526, feat_639, feat_681, feat_701, feat_724, feat_736, feat_769, feat_808, feat_829, \
feat_867, feat_920, feat_956, _id from madelon;'
cur.execute(sql)
results = cur.fetchall()
con.close()
df_sample_data = pd.DataFrame(results)
df_sample_data.set_index('_id', inplace=True)

In [7]:
def random_data_sampling():
    con = pg2.connect(host='34.211.227.227',
              dbname='postgres',
              user='postgres')
    cur = con.cursor(cursor_factory=RealDictCursor)
    cur.execute('SELECT _id FROM madelon;')
    results = cur.fetchall()
    con.close()
    df_sample_data = pd.DataFrame(results)

    selection = list(df_sample_data.sample(frac=0.005, replace=False)['_id'].values)
    selection = [int(i) for i in selection]

    con = pg2.connect(host='34.211.227.227',
              dbname='postgres',
              user='postgres')
    cur = con.cursor(cursor_factory=RealDictCursor)
    sql = 'SELECT * from madelon WHERE _id IN %(selection)s'
    cur.execute(sql, {
        'selection': tuple(selection),
    })
    results = cur.fetchall()
    con.close()
    df_sample_data = pd.DataFrame(results)
    df_sample_data.set_index('_id', inplace=True)

    return df_sample_data
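
The two-step sampling used in random_data_sampling() can be tried without database access. The sketch below is only an illustration: it swaps PostgreSQL for an in-memory SQLite table from the standard library, and the single feat_000 column, the row count, and the sample_rows name are made up for the example. Only the madelon table name and the _id column come from the code above.

```python
import random
import sqlite3

def sample_rows(con, frac=0.005, seed=None):
    """Two-step sampling: fetch ids, sample them client-side, fetch matching rows."""
    cur = con.cursor()
    cur.execute('SELECT _id FROM madelon')
    ids = [row[0] for row in cur.fetchall()]

    rng = random.Random(seed)
    k = max(1, round(len(ids) * frac))
    selection = rng.sample(ids, k)  # sampling without replacement

    # One placeholder per id keeps the query parameterized.
    placeholders = ', '.join('?' for _ in selection)
    cur.execute(f'SELECT * FROM madelon WHERE _id IN ({placeholders})', selection)
    return cur.fetchall()

# Hypothetical stand-in table: 1000 rows with a single feature column.
con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE madelon (_id INTEGER PRIMARY KEY, feat_000 REAL)')
con.executemany('INSERT INTO madelon VALUES (?, ?)',
                [(i, i * 0.5) for i in range(1000)])
rows = sample_rows(con, frac=0.005, seed=0)
con.close()
print(len(rows))
```

Sampling the ids on the client keeps the first query cheap, since only one column is transferred before the full rows are requested.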

#### Part Two - Benchmarking
Each of the following model classes was fit on the complete, unfiltered original Madelon set:
- logistic regression
- decision tree
- k nearest neighbors
- support vector classifier

Each model was then scored for its performance on both the train and test (validation) datasets.

Note: the same was not done for the expanded set, but the random_data_sampling function should suffice to provide an unfiltered set of manageable size.
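
As a rough illustration of this benchmarking step, the sketch below fits the same four model classes on synthetic data and records train and test accuracy for each. The data comes from make_classification as a stand-in for Madelon, and the sample size, feature counts, and random seeds are assumptions chosen only to keep the example quick to run.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: a few informative features among many noisy ones.
X, y = make_classification(n_samples=600, n_features=50, n_informative=5,
                           n_redundant=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    'logistic regression': LogisticRegression(max_iter=1000),
    'decision tree': DecisionTreeClassifier(random_state=0),
    'k nearest neighbors': KNeighborsClassifier(),
    'support vector classifier': SVC(),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Store (train accuracy, test accuracy) for each model class.
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f'{name}: train={scores[name][0]:.3f}, test={scores[name][1]:.3f}')
```

A large gap between the two numbers for a model, such as a fully grown decision tree reaching a train accuracy of 1.0, is the overfitting signal this kind of benchmark is meant to expose.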

##### Pipe & Grid Search Function:

In [7]:
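def gs_pipe(clf, gs_params = {'clf__C':[100000000000000000000000]}, data = train_data, target = label_data):
    #Scale the features, then fit the classifier.
    pipe = Pipeline([
        ('scaler', MinMaxScaler(feature_range=(0.00001, 1))),
        ('clf', clf)
    ])
    if gs_params is None:
        #No grid supplied: fit the bare classifier, without scaling or search.
        gs = clf
    else:
        #Note: make_scorer defaults to greater_is_better=True, so this search
        #ranks the parameter sets with the highest log loss first.
        lgls = make_scorer(log_loss)
        gs = GridSearchCV(pipe, gs_params, cv=10, scoring=lgls)
    gs.fit(data, target)
    return gs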

##### Model Creation & Fitting:

In [8]:
lgrg = LogisticRegression(C=10000000000000000000000)
lgrg_naive = lgrg.fit(train_data, label_data)
lgrg_gs = gs_pipe(LogisticRegression())

In [9]:
dtc = DecisionTreeClassifier()
dtc_naive = dtc.fit(train_data, label_data)
dtc_params = {
    'clf__max_leaf_nodes':[None, 2, 5],
}
dtc_gs = gs_pipe(dtc, gs_params=dtc_params)

In [10]:
knn = KNeighborsClassifier()
knn_naive = knn.fit(train_data, label_data)
knn_params = {
    'clf__leaf_size':[10,20,30],
}
knn_gs = gs_pipe(knn, gs_params=knn_params)

In [11]:
svc = SVC(C=1000000000000000)
svc_naive = svc.fit(train_data, label_data)
svc_params = {}
svc_gs = gs_pipe(svc)

##### Scoring:

In [12]:
[i.score(train_data, label_data) for i in [lgrg_naive, dtc_naive, knn_naive, svc_naive]]

[0.745, 1.0, 0.82650000000000001, 1.0]

In [13]:
[i.score(validate_data, validate_labels) for i in [lgrg_naive, dtc_naive, knn_naive, svc_naive]]

[0.58999999999999997, 0.73999999999999999, 0.69166666666666665, 0.5]

In [14]:
#[i.best_score_ for i in [lgrg_gs, dtc_gs, knn_gs, svc_gs]]

[15.577170862117516,
 13.107634356926374,
 14.972705893523507,
 14.022909532599218]

#### Part Three - Significant Feature Identification
Feature selection methods were built using three different techniques:
- Two methods relying upon the known independence of the noisy features in the Madelon dataset, which were generated by a random generator and then scaled to roughly match the significant features. Both of these methods selected the same features:
    - A method developed by a student, Douglas Brodtman (?), that uses a single .corr() matrix to determine which features are correlated with others. This method is arguably preferable due to the efficient implementation of the well-refined .corr() method.
    - A method developed by a teacher, Joshua Cook (?), that attempts to predict a single feature at a time using all other features, with the features that are predicted least well being treated as noise. This method is arguably inferior due to the lengthy time required to create, fit, and score a model for every feature.
- A simple SelectFromModel method based upon RidgeClassification. This method returned features different from those of the above methods, and those features performed poorly.
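
The correlation-based idea can be illustrated on made-up data. The sketch below is not the notebook's data or a definitive implementation: the column names, row count, and noise scale are assumptions, and only the 0.5 threshold mirrors the function defined in the next cell. Independent noise columns correlate only with themselves, while columns built from a shared signal correlate strongly with each other.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows = 500
base = rng.normal(size=n_rows)

# Three "significant" columns derived from one shared signal...
toy = pd.DataFrame({
    'sig_a': base + 0.1 * rng.normal(size=n_rows),
    'sig_b': 2 * base + 0.1 * rng.normal(size=n_rows),
    'sig_c': -base + 0.1 * rng.normal(size=n_rows),
})
# ...plus five independent noise columns.
for i in range(5):
    toy[f'noise_{i}'] = rng.normal(size=n_rows)

corr = toy.corr().abs()
# Every column correlates perfectly with itself, so require more than one hit.
n_strong = (corr > 0.5).sum()
selected = list(corr.columns[n_strong > 1])
print(selected)
```

Because a single correlation matrix covers every pair of columns at once, this approach avoids fitting one model per feature.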

In [15]:
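def douglas_method(train_data = train_data):
    train_data_corr = train_data.corr()
    #Keep features whose |correlation| exceeds 0.5 with at least one other feature
    #(the count includes each feature's correlation of 1 with itself).
    half_corrs = train_data_corr[(train_data_corr[abs(train_data_corr)>0.5]).count() > 1].index
    return half_corrs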

In [20]:
def Josh_method(data = train_data):
    score = []; collist = []
    for col in data:
        X_train, X_test, y_train, y_test = train_test_split(\
            data.drop(col, axis=1), \
            data[col], random_state = 42)
        
        model = LinearRegression()
        model.fit(X_train, y_train)
        score.append(model.score(X_test, y_test))
        collist.append(col)
    collist = np.array(collist); score = np.array(score)
    return collist[score > abs(sum(score)/len(score))]

In [17]:
def select_k_method(train_data = train_data, label_data = label_data):
    selector = SelectKBest(k = 20)
    selector.fit_transform(train_data, label_data)
    km = train_data.columns[selector.get_support()]
    return km

In [None]:
#dm = douglas_method(train_data);jm = Josh_method(); km = select_k_method()
#km, jm, dm

In [24]:
#all(jm == dm), all(jm == km)

#jmkm_match = [i for i in km if i in jm]; kmdm_match = [i for i in km if i in dm];
#jmkm_match == kmdm_match

#all(jm == dm)

True

#### Part Four - Testing Model Pipelines
Several first steps were implemented in an attempt to find the most relevant (that is, non-redundant) features among the significant features:
- SelectFromModel.
- Dimension reduction in the form of PCA (a single implementation), used to achieve the same effect.
- (Implemented but not used:) An expansion upon Douglas' .corr method. It builds on his observation that the significant features follow patterns, because they were created as linear combinations and duplicates of the original features, and uses those patterns to reduce the Madelon dataset to a manageable size.

The results of these were then fed into various models:
- RandomForest, which did not require any feature selection to perform well.
- K Means, which should have found the vertices of the original features but performed horribly.
- K Nearest Neighbors, which performed quite well.

##### Expanded Douglas Method:

In [16]:
#Utility Functions:
def make_corr_df(train_data = train_data, method = douglas_method): #creates a second-tier corr
    half_corrs = method(train_data)
    corr_df = train_data[half_corrs].corr()
    corr_mask = corr_df[(corr_df > .95) & (corr_df != 1)].notnull()
    return corr_mask

def bin_significant_columns(train_data = train_data, method = make_corr_df): 
    #make bins of patterns.
    data = method(train_data)
    label_bins = list([col] + list(data.columns[data[col]]) \
                      for col in data.columns)
    label_bins = list(set(tuple(sorted(bins)) for bins in label_bins))
    return label_bins

def flatten_bins(bins = bin_significant_columns(train_data)): 
    #In order to re-flatten them if necessary
    return [i for j in bins for i in j]

def make_combos_list(label_bins = bin_significant_columns(), place = 0, num = 5): 
    #Make a combination from one column in each bin (the first by default).
    #Note: While this method has now been altered to allow for different indexes to be 
    #selected such was not implemented originally.
    #The index wraps around (place % len) so that it stays valid for short bins.
    return list(combinations([binn[place % len(binn)] 
                              for binn in label_bins], num))

#Brute Force Method for finding out which of the bins provides the most informative features.
#Note: It takes a really long time.
def brute_corr(data = train_data[douglas_method(train_data)], label_data = label_data, \
               model = LinearRegression()):
    #Score every combination of columns on the training data.
    x = [(i, model.fit(data[list(i)], label_data).score(data[list(i)], label_data))\
            for i in make_combos_list()]
    #Keep the combination (not the score) with the highest score.
    comb, curr = max(x, key=lambda pair: pair[1])
    return list(comb)

##### Implementation of Other Methods:

In [21]:
#Random Forest Method: Performed third best.
def rf_method(data = train_data[douglas_method(train_data)], label_data = label_data):
    rfclf = RandomForestClassifier()
    #Use fit (a classifier has nothing to transform) on the data passed in.
    rfclf.fit(data, label_data)
    return rfclf

#K-Means Method: Performed horribly both with SelectFromModel and with PCA. 
#I probably could have done something differently.
def k_means_method(train_data = train_data[douglas_method(train_data)], label_data = label_data):
    pipe = Pipeline([('scaler', MinMaxScaler(feature_range=(0.00001, 1))),\
              ('sfm', SelectFromModel(RidgeClassifier())),\
              ('kms', KMeans(n_clusters = 32))])
    pipe.fit(train_data[douglas_method(train_data)], label_data)
    return pipe

#K Nearest Neighbors Method w/ Select From Model: Performed the Best.
def knn_method(train_data = train_data[douglas_method(train_data)], label_data = label_data):
    sfm = Pipeline([('scaler', MinMaxScaler(feature_range=(0.00001, 1))),\
          ('sfm', SelectFromModel(RidgeClassifier())),\
          ('knn', KNeighborsClassifier())])
    sfm.fit(train_data[douglas_method(train_data)], label_data)
    return sfm

#K-Nearest Neighbors w/ PCA Method: Performed the Second Best.
def pca_method(model = KNeighborsClassifier(), data = train_data[douglas_method(train_data)], label_data = label_data):
    pca = IncrementalPCA()
    pipe = Pipeline(steps=[('scaler',MinMaxScaler(feature_range=(0.00001, 1))), \
                       ('pca', pca), ('knn', model)])
    pipe.fit(data, label_data)
    return pipe

##### Scoring:

In [59]:
rf_method().score(validate_data[douglas_method(validate_data)], validate_labels)



0.8716666666666667

In [60]:
k_means_method().score(validate_data[douglas_method(validate_data)], validate_labels)

-32.498858803435837

In [61]:
knn_method().score(validate_data[douglas_method(validate_data)], validate_labels)

0.92000000000000004

In [62]:
pca_method().score(validate_data[douglas_method(validate_data)], validate_labels)

0.91666666666666663

In [25]:
pca_method(model =KMeans(n_clusters = 32)).score(validate_data[douglas_method(validate_data)], validate_labels)

-72.739976624898077
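
The negative values returned for the KMeans pipelines are on a different scale from the classifier accuracies. In scikit-learn, KMeans.score returns the negative of the K-means objective (the sum of squared distances from each sample to its nearest centroid), and a Pipeline's score method delegates to its final step. The sketch below checks this on made-up data; the array shape, cluster count, and seed are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# Recompute the objective by hand: squared distance from every sample
# to every centroid, keeping the nearest centroid for each sample.
centers = km.cluster_centers_
sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
objective = sq_dist.min(axis=1).sum()

# The two printed values should agree up to floating-point error.
print(km.score(X), -objective)
```

A score such as -32.5 therefore measures cluster tightness rather than classification accuracy, so it cannot be compared directly with values like 0.92.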

#### Summary:
By all appearances, Douglas' method was effective at detecting significant features. K Nearest Neighbors fed by a SelectFromModel using a RidgeClassifier was the best pipeline. RandomForest, however, performed quite well without the use of SelectFromModel.

It is expected that trial and error with KMeans, and with different classifiers for SelectFromModel to draw features from, will result in a pipeline with lower error rates.