# Assignment
- Learn about the mathematics of Logistic Regression by watching Aaron Gallant's [video #1](https://www.youtube.com/watch?v=pREaWFli-5I) (12 minutes) & [video #2](https://www.youtube.com/watch?v=bDQgVt4hFgY) (9 minutes).
- Start a clean notebook.
- Do train/validate/test split with the Tanzania Waterpumps data.
- Begin to explore and clean the data. For ideas, refer to [The Quartz guide to bad data](https://github.com/Quartz/bad-data-guide),  a "reference to problems seen in real-world data along with suggestions on how to resolve them." One of the issues is ["Zeros replace missing values."](https://github.com/Quartz/bad-data-guide#zeros-replace-missing-values)
- Select different numeric and categorical features. 
- Do one-hot encoding. (Remember it may not work with high cardinality categoricals.)
- Scale features.
- Use scikit-learn for logistic regression.
- Get your validation accuracy score.
- Get and plot your coefficients.
- Submit your predictions to our Kaggle competition.
- Commit your notebook to your fork of the GitHub repo.

## Stretch Goals
- Begin to visualize the data.
- Try different [scikit-learn scalers](https://scikit-learn.org/stable/modules/preprocessing.html)
- Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html):

> Pipeline can be used to chain multiple estimators into one. This is useful as there is often a fixed sequence of steps in processing the data, for example feature selection, normalization and classification. Pipeline serves multiple purposes here:

> - **Convenience and encapsulation.** You only have to call fit and predict once on your data to fit a whole sequence of estimators.
> - **Joint parameter selection.** You can grid search over parameters of all estimators in the pipeline at once.
> - **Safety.** Pipelines help avoid leaking statistics from your test data into the trained model in cross-validation, by ensuring that the same samples are used to train the transformers and predictors.

In [1]:
import pandas as pd
import category_encoders as ce
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
import numpy as np
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
pd.set_option('display.float_format', '{:.2f}'.format)

In [None]:
# !kaggle competitions download -c ds4-predictive-modeling-challenge

In [None]:
# !unzip test_features.csv.zip
# !unzip train_features.csv.zip
# !unzip train_labels.csv.zip

In [None]:
# !ls

In [2]:
# csv files were saved by kaggle with no read or write permissions?
train_features = pd.read_csv('train_features.csv')
train_labels = pd.read_csv('train_labels.csv')
test_features = pd.read_csv('test_features.csv')
train_features.shape, train_labels.shape, test_features.shape

((59400, 40), (59400, 2), (14358, 40))

In [7]:
train_features.describe(include='all')


Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,basin,subvillage,region,region_code,district_code,lga,ward,population,public_meeting,recorded_by,scheme_management,scheme_name,permit,construction_year,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,payment_type,water_quality,quality_group,quantity,source,source_type,source_class,waterpoint_type,waterpoint_type_group,year_recorded
count,59400.0,59400.0,59400,59400,59400.0,59400,59400.0,59400.0,59400,59400.0,59400,59400,59400,59400.0,59400.0,59400,59400,59400.0,59400,59400,59400,59400,59400,59400.0,59400,59400,59400,59400,59400,59400,59400,59400,59400,59400,59400,59400,59400,59400,59400,59400.0
unique,,,356,1898,,2146,,,37400,,9,19288,21,,,125,2092,,3,1,13,2697,3,,18,13,7,12,5,7,7,8,6,5,10,7,3,7,6,
top,,,2011-03-15 00:00:00,Government Of Tanzania,,DWE,,,none,,Lake Victoria,Madukani,Iringa,,,Njombe,Igosi,,True,GeoData Consultants Ltd,VWC,MISSING,True,,gravity,gravity,gravity,vwc,user-group,never pay,never pay,soft,good,enough,spring,spring,groundwater,communal standpipe,communal standpipe,
freq,,,572,9084,,17402,,,3563,,10248,508,5294,,,2503,307,,51011,59400,36793,28166,38852,,26780,26780,26780,40507,52490,25348,25348,50818,50818,33186,17021,17021,45794,28522,34625,
first,,,2002-10-14 00:00:00,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
last,,,2013-12-03 00:00:00,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
mean,37115.13,317.65,,,668.3,,34.08,-5.89,,0.47,,,,15.3,5.63,,,179.91,,,,,,1300.65,,,,,,,,,,,,,,,,2011.92
std,21453.13,2997.57,,,693.12,,6.57,2.77,,12.24,,,,17.59,9.63,,,471.48,,,,,,951.62,,,,,,,,,,,,,,,,0.96
min,0.0,0.0,,,-90.0,,0.0,-11.65,,0.0,,,,1.0,0.0,,,0.0,,,,,,0.0,,,,,,,,,,,,,,,,2002.0
25%,18519.75,0.0,,,0.0,,33.09,-8.54,,0.0,,,,5.0,2.0,,,0.0,,,,,,0.0,,,,,,,,,,,,,,,,2011.0


In [None]:
train_labels['status_group']

In [3]:
def wrangle(X):
    X = X.copy()
    X['latitude'] = X['latitude'].replace(-2e-08, 0)
    cols_with_zeros = ['latitude', 'longitude', 'construction_year']
    for col in cols_with_zeros:
        X[col] = X[col].replace(0, np.nan)
        X[col] = X[col].fillna(X[col].mean())
        
        X['date_recorded'] = pd.to_datetime(X['date_recorded'], infer_datetime_format=True)
        
        X['year_recorded'] = X['date_recorded'].dt.year
        
        X = X.drop(columns='quantity_group')
        
        categoricals = X.select_dtypes(exclude='number').columns
        for col in categoricals:
            X[col] = X[col].fillna('MISSING')
        return X
    
train_features = wrangle(train_features)
test_features = wrangle(test_features)

In [None]:
# train_features = train_features.drop(['recorded_by'], axis=1)
dummied_features = pd.get_dummies(train_features, columns=['management_group', 'payment_type', 'source_type', 'water_quality'],
               prefix=['mgmt', 'payment', 'source', 'quality'])

In [6]:
train_features.describe(exclude='number').T.sort_values(by='unique')

Unnamed: 0,count,unique,top,freq,first,last
recorded_by,59400,1,GeoData Consultants Ltd,59400,,
public_meeting,59400,3,True,51011,,
source_class,59400,3,groundwater,45794,,
permit,59400,3,True,38852,,
quantity,59400,5,enough,33186,,
management_group,59400,5,user-group,52490,,
quality_group,59400,6,good,50818,,
waterpoint_type_group,59400,6,communal standpipe,34625,,
payment_type,59400,7,never pay,25348,,
payment,59400,7,never pay,25348,,


In [4]:


# Returns X_train, X_val, y_train, y_val
def quick_split(X, y):
    X_train = X
    y_train = y

    return train_test_split(
        X_train, y_train, train_size=0.80, test_size=0.20,
        stratify=y_train)

In [5]:
def fit_predict_score(X, y, X_val, y_val):
    model = LogisticRegression(solver='lbfgs', multi_class='auto', max_iter=20000)
    model.fit(X, y)
    y_pred = model.predict(X_val)
    print(model.coef_)
#     sample_submission = pd.read_csv('sample_submission.csv')
#     submission = sample_submission.copy()
#     submission['status_group'] = y_pred
#     submission.to_csv('whaeck-submission.csv', index=False)
    return accuracy_score(y_val, y_pred)

In [None]:
def fit_submission(X, y, ):
    pass

In [None]:
dummied_features = pd.get_dummies(train_features, columns=['management_group', 'quality_group', 'waterpoint_type', 'extraction_type_group', 'basin', 'public_meeting', 'permit'],
               prefix=['mgmt', 'quality', 'waterpoint', 'extraction', 'basin', 'public', 'permit'])

In [None]:
# dummied_features['public_meeting']
# dummied_features['public_meeting'] = dummied_features['public_meeting'].astype('category')
# dummied_features['permit'] = dummied_features['permit'].astype('category')
# dummied_features['public_meeting_cat'] = dummied_features['public_meeting'].cat.codes
# dummied_features['permit_cat'] = dummied_features['permit'].cat.codes

In [None]:
dummied_features = dummied_features.drop('id', axis=1)

In [None]:
X_train, X_val, y_train, y_val = quick_split(dummied_features.select_dtypes('number'), train_labels['status_group'])
fit_predict_score(X_train, y_train, X_val, y_val)

In [32]:
encoder = ce.OneHotEncoder(use_cat_names=True)
categorical_features = ['management_group', 'payment_type', 'source_class', 'quality_group', 'quantity', 'waterpoint_type_group', 'extraction_type_group', 'basin', 'source', 'region', 'public_meeting', 'permit', 'installer', 'ward']
numeric_features = train_features.select_dtypes('number').columns.drop('id').tolist()
features = categorical_features + numeric_features
encoded = encoder.fit_transform(train_features[features])

In [33]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(encoded)

  return self.partial_fit(X, y)
  return self.fit(X, **fit_params).transform(X)


In [25]:
train_encoded

Unnamed: 0,status_group_functional,status_group_non functional,status_group_functional needs repair,status_group_-1
0,1,0,0,0
1,1,0,0,0
2,1,0,0,0
3,0,1,0,0
4,1,0,0,0
5,1,0,0,0
6,0,1,0,0
7,0,1,0,0
8,0,1,0,0
9,1,0,0,0


In [36]:
model = LogisticRegression(solver='liblinear', multi_class='auto', max_iter=20000)
model.fit(X_train, y_train)
y_pred = model.predict(X_val)
print(model.coef_)
accuracy_score(y_val, y_pred)

[[-0.10780434  0.02206037  0.1095423  ...  0.16826403 -0.00527729
   0.03517714]
 [ 0.07433923 -0.0468618  -0.14611615 ...  0.0623914  -0.41700571
   0.07459708]
 [ 0.10086716  0.00112203 -0.0733342  ... -0.22757231  0.07096654
  -0.08687419]]


0.7687710437710438

In [None]:
y_pred = model.predict(X_train)
accuracy_score(y_train, y_pred)

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 40), dpi=1200)
coefficients = pd.Series(model.coef_[0], encoded.columns)
coefficients.sort_values().plot.barh();

In [41]:
categorical_features = ['management_group', 'payment_type', 'source_class', 'quality_group', 'quantity', 'waterpoint_type_group', 'extraction_type_group', 'basin', 'source', 'region', 'public_meeting', 'permit', 'installer', 'ward']
numeric_features = test_features.select_dtypes('number').columns.drop('id').tolist()
features = categorical_features + numeric_features
test_encoded = encoder.transform(test_features[features])

X_test_scaled = scaler.transform(test_encoded)

  


In [42]:
y_pred = model.predict(X_test_scaled)
submission = pd.read_csv('sample_submission.csv')
submission['status_group'] = y_pred
submission.to_csv('submission-05.csv', index=False)

In [37]:
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_estimators=800, max_depth=40, n_jobs=-1)
rfc.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=40, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=800, n_jobs=-1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [38]:
y_pred = rfc.predict(X_val)
accuracy_score(y_val, y_pred)

0.8136363636363636

In [43]:
y_pred = rfc.predict(X_test_scaled)
submission = pd.read_csv('sample_submission.csv')
submission['status_group'] = y_pred
submission.to_csv('submission-06.csv', index=False)

In [47]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(train_labels['status_group'])
train_encoded = le.transform(train_labels['status_group'])

X_train, X_val, y_train, y_val = quick_split(X_scaled, train_encoded)
# fit_predict_score(X_train, y_train, X_val, y_val)

In [51]:
import xgboost as xgb
import os
os.environ['KMP_DUPLICATE_LIB_OK']='True'

clf = xgb.XGBClassifier(learning_rate = 0.2,
                        reg_alpha = 0.1,
                        n_estimators=1200,
                        max_depth=5,
                        min_chld_weight=1,
                        gamma=0,
                        subsample=0.8,
                        objective='multi:softprob',
                        nthread=12,
                        scale_pos_weight=1
                       )
dtrain = xgb.DMatrix(X_train)
dtest = xgb.DMatrix(X_val)
bst = clf.fit(X_train, y_train, verbose=False)
preds = bst.predict(X_val)
accuracy_score(y_val, preds)

0.8037878787878788

# Pipelines
## Isn't it enough to know I ruined a pony to make a gift for you?

In [None]:
from sklearn.pipeline import Pipeline