# ASSIGNMENT options

- Replicate the lesson code. [Do it "the hard way" or with the "Benjamin Franklin method."](https://docs.google.com/document/d/1ubOw9B3Hfip27hF2ZFnW3a3z9xAgrUDRReOEo-FHCVs/edit)
- Apply the lesson to other datasets you've worked with before, and compare results.
- Choose how to split the Bank Marketing dataset. Train and validate baseline models.
- Get weather data for your own area and calculate both baselines.  _"One (persistence) predicts that the weather tomorrow is going to be whatever it was today. The other (climatology) predicts whatever the average historical weather has been on this day from prior years."_ What is the mean absolute error for each baseline? What if you average the two together? 
- When would this notebook's pipelines fail? How could you fix them? Add more [preprocessing](https://scikit-learn.org/stable/modules/preprocessing.html) and [imputation](https://scikit-learn.org/stable/modules/impute.html) to your [pipelines](https://scikit-learn.org/stable/modules/compose.html) with scikit-learn.
- [This example from scikit-learn documentation](https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html) demonstrates its improved `OneHotEncoder` and new `ColumnTransformer` objects, which can replace functionality from third-party libraries like category_encoders and sklearn-pandas. Adapt this example, which uses Titanic data, to work with another dataset.
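
The two weather baselines in the list above can be sketched as follows. This is a minimal illustration, not a reference solution: the `temps` array is synthetic (a seasonal sine curve plus noise), the six-year span and 365-day years are simplifying assumptions, and the `mae` helper is a hypothetical stand-in for `mean_absolute_error`.

```python
import numpy as np

# ASSUMPTION: synthetic daily temperatures stand in for real local weather data.
rng = np.random.default_rng(0)
n_years, days = 6, 365  # leap days ignored for simplicity
day_of_year = np.arange(days)
seasonal = 15 + 10 * np.sin(2 * np.pi * (day_of_year - 100) / days)
temps = seasonal + rng.normal(0, 3, size=(n_years, days))  # shape (year, day)


def mae(y_true, y_pred):
    """Mean absolute error (hypothetical helper, same idea as sklearn's)."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))


# Score only the final year, so climatology sees strictly prior years.
actual = temps[-1, 1:]                      # days 1..364 of the last year
persistence = temps[-1, :-1]                # "tomorrow equals today"
climatology = temps[:-1, 1:].mean(axis=0)   # same calendar day, prior years
blend = (persistence + climatology) / 2     # simple average of the two

results = {
    "persistence": mae(actual, persistence),
    "climatology": mae(actual, climatology),
    "blend": mae(actual, blend),
}
for name, value in results.items():
    print(f"{name:12s} MAE = {value:.2f}")
```

Real weather is autocorrelated from day to day, which the independent noise used here is not, so the ranking of the three baselines on real data may well differ from what this toy series shows.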


In [56]:
import pandas as pd
import numpy as np 
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_absolute_error, accuracy_score, log_loss
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.decomposition import PCA
from sklearn.cluster import FeatureAgglomeration
import category_encoders as ce
from numpy.testing import assert_almost_equal
from functools import reduce
import matplotlib.pyplot as plt

In [99]:
# bank = pd.read_csv('bank-additional/bank-additional-full.csv', sep=';')

# X = bank.drop(columns='y')
# y = bank['y'] == 'yes'

# X.head()

In [16]:
# # Many columns need one-hot encoding.
# # The encoder in our pipeline handles that automatically.

# pipeline = make_pipeline(
#     ce.OneHotEncoder(use_cat_names=True),
#     StandardScaler(), 
#     LogisticRegression()
# )

# X_train, X_test, y_train, y_test = train_test_split(X,y)

In [98]:

# pipeline.fit(X_train, y_train)
# y_pred = pipeline.predict(X_test)
# accuracy_score(y_test, y_pred)

In [18]:
# scores = cross_val_score(pipeline, X_train, y_train, cv=10) 
#scores.mean()

In [13]:
## KAGGLE bioresponse https://www.kaggle.com/c/bioresponse#Evaluation

train_url = 'kag-bioresponse/train.csv'
test_url = 'kag-bioresponse/test.csv' ## ignoring this: it doesn't have 'Activity'

df_ = pd.read_csv(train_url)
#df_test = pd.read_csv(test_url) # doesn't have 'Activity'
assert all([x==0 for x in df_.isna().sum().values])
assert all([pd.api.types.is_numeric_dtype(df_[feat]) for feat in df_.columns])
dependent='Activity'

X_train, X_test, y_train, y_test = train_test_split(df_.drop(dependent, axis=1), 
                                                    df_[dependent], 
                                                    train_size=0.8, test_size=0.2)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

# df_.dtypes.value_counts()

# #df_.describe()

# df_.mean().array.mean()

(3000, 1776) (751, 1776) (3000,) (751,)


In [32]:

pipeline1 = make_pipeline(
    #ce.OneHotEncoder(use_cat_names=True), # no categoricals. 
    #StandardScaler(), # means are very much near zero anyway, we don't really need this. 
    LogisticRegression()
)


pipeline1.fit(X_train, y_train)
y_pred = pipeline1.predict(X_test)
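#accuracy_score(y_test, y_pred)
# NOTE: log_loss expects predicted probabilities. y_pred holds hard 0/1 labels from
# predict(), which inflates the loss; pipeline1.predict_proba(X_test) is the intended input.
log_loss(y_test, y_pred)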

7.5885119410403075
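
A loss this large is what hard 0/1 labels tend to produce, because every wrong label is clipped to a probability near zero and costs about 34.5 on its own. The sketch below illustrates the effect with a hand-written binary log loss. It is not scikit-learn's implementation; the `eps=1e-15` clipping value and the four example predictions are assumptions chosen for illustration.

```python
import numpy as np


def binary_log_loss(y_true, p_positive, eps=1e-15):
    """Mean negative log-likelihood for binary labels.

    Illustrative only; eps=1e-15 is an assumed clipping value.
    """
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_positive, dtype=float), eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))


y_true = np.array([1, 0, 1, 0])
proba = np.array([0.9, 0.2, 0.4, 0.1])  # hypothetical predicted P(class 1)
hard = (proba >= 0.5).astype(int)       # what predict() would return here

loss_from_proba = binary_log_loss(y_true, proba)
loss_from_hard = binary_log_loss(y_true, hard)

print(f"probabilities: {loss_from_proba:.3f}")
print(f"hard labels:   {loss_from_hard:.3f}")
```

One wrong label out of four already pushes the hard-label loss above 8, while the same predictions scored as probabilities stay well below 1. Read the same way, a value near 7.59 would correspond to roughly 22% of the test rows being misclassified, assuming that clipping value.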

In [33]:
scores = cross_val_score(pipeline1, X_train, y_train, cv=10, scoring='neg_log_loss') 

scores#.mean() # don't run this a lot because it is EXPENSIVE

# when test_size=0.15, scores.mean() is better than accuracy_score... but when test_size=0.2, scores.mean() is WORSE.



array([-0.59238295, -0.64245771, -0.61530716, -0.71322103, -0.71528252,
       -0.60554242, -0.63123414, -0.68629595, -0.67113009, -0.68777146])
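
The fold scores above are easier to read once negated and summarized. The sketch below copies the printed array and reports the mean and spread. The sample standard deviation (`ddof=1`) is one reasonable choice here, not something the notebook prescribes.

```python
import numpy as np

# Fold scores copied from the cross_val_score output above (scoring='neg_log_loss').
scores = np.array([-0.59238295, -0.64245771, -0.61530716, -0.71322103, -0.71528252,
                   -0.60554242, -0.63123414, -0.68629595, -0.67113009, -0.68777146])

fold_losses = -scores                 # flip the sign: lower log loss is better
mean_loss = fold_losses.mean()
std_loss = fold_losses.std(ddof=1)    # sample standard deviation across folds

print(f"log loss: {mean_loss:.3f} +/- {std_loss:.3f} over {len(fold_losses)} folds")
```

Because `neg_log_loss` scoring is computed from predicted probabilities, this mean of about 0.656 is not directly comparable to a `log_loss` value computed from the hard labels returned by `predict()`.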

In [None]:

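# # Imports this example needs (not in the import cell above):
# from sklearn.compose import ColumnTransformer
# from sklearn.impute import SimpleImputer
# from sklearn.preprocessing import OneHotEncoder

# # We create the preprocessing pipelines for both numeric and categorical data.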
# numeric_features = ['age', 'fare']
# numeric_transformer = Pipeline(steps=[
#     ('imputer', SimpleImputer(strategy='median')),
#     ('scaler', StandardScaler())])

# categorical_features = ['embarked', 'sex', 'pclass']
# categorical_transformer = Pipeline(steps=[
#     ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
#     ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# preprocessor = ColumnTransformer(
#     transformers=[
#         ('num', numeric_transformer, numeric_features),
#         ('cat', categorical_transformer, categorical_features)])

# # Append classifier to preprocessing pipeline.
# # Now we have a full prediction pipeline.
# clf = Pipeline(steps=[('preprocessor', preprocessor),
#                       ('classifier', LogisticRegression(solver='lbfgs'))])



In [24]:
# let's try dimensionality reduction
# we'll split the features into two groups: ints and floats.

# a commenter here https://stats.stackexchange.com/questions/159705/would-pca-work-for-boolean-binary-data-types 
# said that a "cosine"-ish version of PCA is more appropriate for 0/1 data.

# We'll use the heuristic that, for N = #observations and M = #features, we want N > 5M

0.08813569643625678

In [86]:
ints = df_.drop(dependent, axis=1).select_dtypes(include='int')
floats = df_.drop(dependent, axis=1).select_dtypes(include='float')

print(ints.shape, floats.shape)

P = 5.15
n = int(np.divide(df_.shape[0], P)/ 2)

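# NOTE: newer scikit-learn releases spell this loss 'log_loss'; 'log' is the older name.
# NOTE: both pipelines below share this one estimator instance, so fitting one pipeline
# refits the final step of the other. Predict with a pipeline before fitting the other.
LogiR = ('logistic_regression', SGDClassifier(loss='log', tol=np.exp(-P/2), max_iter=1234))

estimators_c = [('PCA', PCA(n_components=n)), 
                LogiR]

pipe_c = Pipeline(steps=estimators_c)

# NOTE: newer scikit-learn releases rename FeatureAgglomeration's 'affinity' argument to 'metric'.
estimators_d = [('FA', FeatureAgglomeration(n_clusters=n, affinity='cosine', linkage='complete')), 
                LogiR]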

pipe_d = Pipeline(steps=estimators_d)



(3751, 834) (3751, 942)


In [95]:
ints_train = X_train.select_dtypes(include='int')
ints_test = X_test.select_dtypes(include='int')
floats_train = X_train.select_dtypes(include='float')
floats_test = X_test.select_dtypes(include='float')

print(ints_train.shape, ints_test.shape, floats_train.shape, floats_test.shape)

pipe_c.fit(floats_train, y_train)
y_pred_c = pipe_c.predict_proba(floats_test)

pipe_d.fit(ints_train, y_train)
y_pred_d = pipe_d.predict_proba(ints_test)


# log_loss(y_test, y_pred_c), log_loss(y_test, y_pred_d)

y_pred_d.shape, y_pred_c.shape, y_test.shape, df_.shape

y_pred_c, y_pred_d

(3000, 834) (751, 834) (3000, 942) (751, 942)


(array([[0.89458424, 0.10541576],
        [0.12677808, 0.87322192],
        [0.97135242, 0.02864758],
        ...,
        [0.01759458, 0.98240542],
        [0.99018714, 0.00981286],
        [0.07848058, 0.92151942]]), array([[9.96911189e-01, 3.08881077e-03],
        [1.65793779e-02, 9.83420622e-01],
        [8.14470414e-01, 1.85529586e-01],
        ...,
        [7.04256899e-05, 9.99929574e-01],
        [9.99988555e-01, 1.14454296e-05],
        [1.93662214e-08, 9.99999981e-01]]))

In [96]:
print(log_loss(y_test, y_pred_c), log_loss(y_test, y_pred_d))

0.5982614362581822 1.7561588475122367


In [97]:
scores_c = cross_val_score(pipe_c, floats_train, y_train, cv=10, scoring='neg_log_loss') 

#scores_d = cross_val_score(pipe_d, ints_train, y_train, cv=5, scoring='neg_log_loss') 

-scores_c

## ok so when I 
## get rid of the discretes

array([0.66115672, 0.63633499, 0.69340228, 0.75470902, 0.76206379,
       0.67908168, 0.74948106, 0.70610053, 0.6469453 , 0.71163227])