# Assignment
- Learn about the mathematics of Logistic Regression by watching Aaron Gallant's [video #1](https://www.youtube.com/watch?v=pREaWFli-5I) (12 minutes) & [video #2](https://www.youtube.com/watch?v=bDQgVt4hFgY) (9 minutes).
- Start a clean notebook.
- Do train/validate/test split with the Tanzania Waterpumps data.
- Begin to explore and clean the data. For ideas, refer to [The Quartz guide to bad data](https://github.com/Quartz/bad-data-guide),  a "reference to problems seen in real-world data along with suggestions on how to resolve them." One of the issues is ["Zeros replace missing values."](https://github.com/Quartz/bad-data-guide#zeros-replace-missing-values)
- Select different numeric and categorical features. 
- Do one-hot encoding. (Remember it may not work with high cardinality categoricals.)
- Scale features.
- Use scikit-learn for logistic regression.
- Get your validation accuracy score.
- Get and plot your coefficients.
- Submit your predictions to our Kaggle competition.
- Commit your notebook to your fork of the GitHub repo.
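
As a companion to the videos in the first bullet, here is a minimal sketch of the two functions behind logistic regression: the sigmoid for the binary case and the softmax for the multiclass case. The function names and sample scores are illustrative and do not come from the assignment.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(scores):
    """Turn one score per class into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


p = sigmoid(0.0)                    # a score of 0 maps to probability 0.5
probs = softmax([2.0, 1.0, 0.1])    # made-up scores for three classes
```

A fitted model computes the scores as a weighted sum of the features plus an intercept. The predicted class is the one with the largest probability.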

## Stretch Goals
- Begin to visualize the data.
- Try different [scikit-learn scalers](https://scikit-learn.org/stable/modules/preprocessing.html)
- Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html):

> Pipeline can be used to chain multiple estimators into one. This is useful as there is often a fixed sequence of steps in processing the data, for example feature selection, normalization and classification. Pipeline serves multiple purposes here:

> - **Convenience and encapsulation.** You only have to call fit and predict once on your data to fit a whole sequence of estimators.
> - **Joint parameter selection.** You can grid search over parameters of all estimators in the pipeline at once.
> - **Safety.** Pipelines help avoid leaking statistics from your test data into the trained model in cross-validation, by ensuring that the same samples are used to train the transformers and predictors.
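
The quoted description can be sketched on a toy frame as follows. This is only an illustration: it uses scikit-learn's own `OneHotEncoder` inside a `ColumnTransformer` rather than `category_encoders`, and the column values and labels are made up.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the waterpump data (values are invented).
X_toy = pd.DataFrame({
    'quantity': ['enough', 'dry', 'enough', 'dry', 'enough', 'dry'],
    'gps_height': [1200, 10, 900, 0, 1500, 30],
})
y_toy = ['functional', 'non functional', 'functional',
         'non functional', 'functional', 'non functional']

# One-hot encode the categorical column, scale the numeric one.
preprocess = ColumnTransformer([
    ('onehot', OneHotEncoder(handle_unknown='ignore'), ['quantity']),
    ('scale', StandardScaler(), ['gps_height']),
])

# A single fit call trains the transformers and the classifier together.
pipeline = make_pipeline(preprocess, LogisticRegression(max_iter=1000))
pipeline.fit(X_toy, y_toy)
toy_predictions = pipeline.predict(X_toy)
toy_accuracy = pipeline.score(X_toy, y_toy)
```

Because the encoder and scaler are fit inside `pipeline.fit`, calling `pipeline.score` or `pipeline.predict` on validation or test data reuses the training statistics automatically.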


## Load and split data

In [1]:
!pip install category_encoders

Collecting category_encoders
  Downloading https://files.pythonhosted.org/packages/6e/a1/f7a22f144f33be78afeb06bfa78478e8284a64263a3c09b1ef54e673841e/category_encoders-2.0.0-py2.py3-none-any.whl (87kB)
Installing collected packages: category-encoders
Successfully installed category-encoders-2.0.0


In [0]:
import numpy as np
import pandas as pd
from math import sqrt
import pandas_profiling
import category_encoders as ce
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

In [0]:
LOCAL = '../data/tanzania/'
WEB = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Linear-Models/master/data/tanzania/'

train_features = pd.read_csv(WEB + 'train_features.csv')
train_labels = pd.read_csv(WEB + 'train_labels.csv')
test_features = pd.read_csv(WEB + 'test_features.csv')
sample_submission = pd.read_csv(WEB + 'sample_submission.csv')

assert train_features.shape == (59400, 40)
assert train_labels.shape == (59400, 2)
assert test_features.shape == (14358, 40)
assert sample_submission.shape == (14358, 2)

In [4]:
train_features.sample(1)

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,basin,subvillage,region,region_code,district_code,lga,ward,population,public_meeting,recorded_by,scheme_management,scheme_name,permit,construction_year,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group
35800,5301,3000.0,2011-03-17,Roman,2060,DWE,34.737243,-9.711602,none,0,Lake Nyasa,Ulyalya,Iringa,11,5,Ludewa,Madilu,133,True,GeoData Consultants Ltd,VWC,Roman,False,2000,gravity,gravity,gravity,vwc,user-group,pay annually,annually,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe


## Best Features for Logistic Regression

In [5]:
train_features.describe()

Unnamed: 0,id,amount_tsh,gps_height,longitude,latitude,num_private,region_code,district_code,population,construction_year
count,59400.0,59400.0,59400.0,59400.0,59400.0,59400.0,59400.0,59400.0,59400.0,59400.0
mean,37115.131768,317.650385,668.297239,34.077427,-5.706033,0.474141,15.297003,5.629747,179.909983,1300.652475
std,21453.128371,2997.574558,693.11635,6.567432,2.946019,12.23623,17.587406,9.633649,471.482176,951.620547
min,0.0,0.0,-90.0,0.0,-11.64944,0.0,1.0,0.0,0.0,0.0
25%,18519.75,0.0,0.0,33.090347,-8.540621,0.0,5.0,2.0,0.0,0.0
50%,37061.5,0.0,369.0,34.908743,-5.021597,0.0,12.0,3.0,25.0,1986.0
75%,55656.5,20.0,1319.25,37.178387,-3.326156,0.0,17.0,5.0,215.0,2004.0
max,74247.0,350000.0,2770.0,40.345193,-2e-08,1776.0,99.0,80.0,30500.0,2013.0
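
The summary above shows a minimum of 0 for `longitude` and `construction_year`, and a `latitude` maximum of -2e-08. Tanzania lies well away from longitude 0, and a construction year of 0 is implausible, so these look like the "zeros replace missing values" problem from the Quartz guide. A minimal sketch of one way to mark them as missing, on a made-up frame (the helper name `zeros_to_nan` is hypothetical):

```python
import numpy as np
import pandas as pd


def zeros_to_nan(df, columns):
    """Return a copy of df where 0 in the given columns becomes NaN."""
    df = df.copy()
    for col in columns:
        df[col] = df[col].replace(0, np.nan)
    return df


# Invented rows that mimic the suspicious values in the summary.
toy = pd.DataFrame({
    'longitude': [34.7, 0.0, 36.1],
    'latitude': [-9.7, -2e-08, -3.3],
    'construction_year': [2000, 0, 1986],
})

# Treat the near-zero latitude as zero before replacing.
toy['latitude'] = toy['latitude'].replace(-2e-08, 0)

cleaned = zeros_to_nan(toy, ['longitude', 'latitude', 'construction_year'])
missing_counts = cleaned.isna().sum()
```

`LogisticRegression` does not accept NaN values, so a step like this would need to be followed by imputation (or by dropping the affected columns) before fitting.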


In [0]:
# len(train_features['region_code'].value_counts().index)
# .query('unique <= 21')
# 11,17,12,3,5,18,19

In [7]:
train_features.describe(exclude='number').sort_values(by='unique', axis=1)

Unnamed: 0,recorded_by,public_meeting,permit,source_class,management_group,quantity_group,quantity,waterpoint_type_group,quality_group,payment_type,source_type,waterpoint_type,extraction_type_class,payment,water_quality,basin,source,scheme_management,management,extraction_type_group,extraction_type,region,lga,date_recorded,funder,ward,installer,scheme_name,subvillage,wpt_name
count,59400,56066,56344,59400,59400,59400,59400,59400,59400,59400,59400,59400,59400,59400,59400,59400,59400,55523,59400,59400,59400,59400,59400,59400,55765,59400,55745,31234,59029,59400
unique,1,2,2,3,5,5,5,6,6,7,7,7,7,7,8,9,10,12,12,13,18,21,125,356,1897,2092,2145,2696,19287,37400
top,GeoData Consultants Ltd,True,True,groundwater,user-group,enough,enough,communal standpipe,good,never pay,spring,communal standpipe,gravity,never pay,soft,Lake Victoria,spring,VWC,vwc,gravity,gravity,Iringa,Njombe,2011-03-15,Government Of Tanzania,Igosi,DWE,K,Madukani,none
freq,59400,51011,38852,45794,52490,33186,33186,34625,50818,25348,17021,28522,26780,25348,50818,10248,17021,36793,40507,26780,26780,5294,2503,572,9084,307,17402,682,508,3563


In [8]:
train_features['quantity'].value_counts()

enough          33186
insufficient    15129
dry              6246
seasonal         4050
unknown           789
Name: quantity, dtype: int64
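
Counts alone do not show whether a category is related to the target. One way to explore that is a row-normalized cross-tabulation of a feature against the label. The sketch below uses invented rows; on the real data it assumes the rows of `train_features` and `train_labels` are in the same order.

```python
import pandas as pd

# Invented rows: a categorical feature next to the target label.
toy = pd.DataFrame({
    'quantity': ['enough', 'enough', 'dry', 'dry', 'enough', 'dry'],
    'status_group': ['functional', 'functional', 'non functional',
                     'non functional', 'non functional', 'non functional'],
})

# normalize='index' turns each row into proportions that sum to 1.
rates = pd.crosstab(toy['quantity'], toy['status_group'], normalize='index')
```

On the real data the equivalent call would be `pd.crosstab(train_features['quantity'], train_labels['status_group'], normalize='index')`.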

In [0]:
# feature engineer
# train_features['enough_source'] = (train_features['quantity_group']=='enough')|(train_features['quantity']=='enough')
# train_features['shallow_well_source'] = (train_features['source']=='shallow well')|(train_features['source_type']=='shallow well')
# train_features['spring_source'] = (train_features['source']=='spring')|(train_features['source_type']=='spring')
# train_features['communal_standpipe_waterpoint'] = (train_features['waterpoint_type']=='communal standpipe')&(train_features['waterpoint_type_group']=='communal standpipe')
# train_features['handpump'] = (train_features['extraction_type_class']=='handpump')|(train_features['waterpoint_type']=='hand pump')|(train_features['waterpoint_type_group']=='hand pump')
# train_features['never_pay'] = (train_features['payment']=='never pay')|(train_features['payment_type']=='never pay')
# train_features['soft_good_quality'] = (train_features['water_quality']=='soft')|(train_features['quality_group']=='good')
# train_features['Victoria_basin'] = train_features['basin']=='Lake Victoria'
# train_features['2_source'] = (train_features['source']=='spring')|(train_features['source']=='shallow well')
# train_features['vwc_management'] = (train_features['management']=='vwc')|(train_features['scheme_management']=='VWC')
# train_features['gravity_extraction'] = (train_features['extraction_type']=='gravity')|(train_features['extraction_type_group']=='gravity')|(train_features['extraction_type_class']=='gravity')
# train_features['3_region'] = (train_features['region']=='Iringa')|(train_features['region']=='Shinyanga')|(train_features['region']=='Mbeya')
# train_features['region_code_11'] = train_features['region_code']==11
# train_features['region_code_17'] = train_features['region_code']==17
# train_features['region_code_12'] = train_features['region_code']==12
# train_features['Njombe_lga'] = train_features['lga']=='Njombe'
train_features['gov_funder'] = train_features['funder']=='Government Of Tanzania'
train_features['Igosi_ward'] = train_features['ward']=='Igosi'
train_features['Imalinyi_ward'] = train_features['ward']=='Imalinyi'
train_features['Siha_ward'] = train_features['ward']=='Siha'
train_features['gov_installer'] = train_features['installer']=='Government'
train_features['dwe_installer'] = train_features['installer']=='DWE'
train_features['none_scheme_name'] = train_features['scheme_name']=='None'
train_features['gov_scheme_name'] = train_features['scheme_name']=='Government'
train_features['Madukani_subvillage'] = train_features['subvillage']=='Madukani'
train_features['Shuleni_subvillage'] = train_features['subvillage']=='Shuleni'
train_features['Majengo_subvillage'] = train_features['subvillage']=='Majengo'   
train_features['none_wpt'] = train_features['wpt_name']=='none'
train_features['Shuleni_wpt'] = train_features['wpt_name']=='Shuleni'
train_features['dwe'] = (train_features['installer']=='DWE')&(train_features['funder']=='Dwe')
train_features['gov_install_fund'] = (train_features['installer']=='Government')&(train_features['funder']=='Government Of Tanzania')
train_features['gov'] = (train_features['installer']=='Government')&(train_features['funder']=='Government Of Tanzania')&(train_features['scheme_name']=='Government')

In [0]:
test_features['gov_funder'] = test_features['funder']=='Government Of Tanzania'
test_features['Igosi_ward'] = test_features['ward']=='Igosi'
test_features['Imalinyi_ward'] = test_features['ward']=='Imalinyi'
test_features['Siha_ward'] = test_features['ward']=='Siha'
test_features['gov_installer'] = test_features['installer']=='Government'
test_features['dwe_installer'] = test_features['installer']=='DWE'
test_features['none_scheme_name'] = test_features['scheme_name']=='None'
test_features['gov_scheme_name'] = test_features['scheme_name']=='Government'
test_features['Madukani_subvillage'] = test_features['subvillage']=='Madukani'
test_features['Shuleni_subvillage'] = test_features['subvillage']=='Shuleni'
test_features['Majengo_subvillage'] = test_features['subvillage']=='Majengo'   
test_features['none_wpt'] = test_features['wpt_name']=='none'
test_features['Shuleni_wpt'] = test_features['wpt_name']=='Shuleni'
test_features['dwe'] = (test_features['installer']=='DWE')&(test_features['funder']=='Dwe')
test_features['gov_install_fund'] = (test_features['installer']=='Government')&(test_features['funder']=='Government Of Tanzania')
test_features['gov'] = (test_features['installer']=='Government')&(test_features['funder']=='Government Of Tanzania')&(test_features['scheme_name']=='Government')
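
The two cells above repeat the same assignments for the training and test frames, which makes it easy for them to drift apart. A minimal sketch of one alternative is to wrap the logic in a function and call it on each frame. The helper name `add_flag_features` is hypothetical, only three of the flags are shown, and the toy rows are invented.

```python
import pandas as pd


def add_flag_features(df):
    """Return a copy of df with a few boolean indicator columns added."""
    df = df.copy()
    df['gov_funder'] = df['funder'] == 'Government Of Tanzania'
    df['dwe_installer'] = df['installer'] == 'DWE'
    df['gov_install_fund'] = (
        (df['installer'] == 'Government')
        & (df['funder'] == 'Government Of Tanzania')
    )
    return df


# Invented rows; a missing funder compares as False.
toy = pd.DataFrame({
    'funder': ['Government Of Tanzania', 'Roman', None],
    'installer': ['Government', 'DWE', 'DWE'],
})
flagged = add_flag_features(toy)
```

In the notebook this would be used as `train_features = add_flag_features(train_features)` and `test_features = add_flag_features(test_features)`, so both frames always receive identical columns.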

In [11]:
# train_test_split
X_train = train_features
y_train = train_labels['status_group']

X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, train_size=0.8, test_size=.2,
    stratify=y_train, random_state=42)

X_train.shape, X_val.shape, y_train.shape, y_val.shape

((47520, 56), (11880, 56), (47520,), (11880,))
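
A majority-class baseline gives a reference point for any validation accuracy computed on this split. The sketch below uses only the standard library and invented labels; the function name is hypothetical.

```python
from collections import Counter


def majority_baseline_accuracy(y_train, y_val):
    """Predict the most common training label for every validation row."""
    majority_class, _ = Counter(y_train).most_common(1)[0]
    hits = sum(1 for label in y_val if label == majority_class)
    return majority_class, hits / len(y_val)


# Invented labels standing in for y_train and y_val.
toy_train = ['functional'] * 5 + ['non functional'] * 3
toy_val = ['functional', 'non functional', 'functional', 'functional']

majority, baseline = majority_baseline_accuracy(toy_train, toy_val)
```

The function iterates over its inputs, so it should also accept the `y_train` and `y_val` Series produced by the split.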

In [12]:
X_val.columns

Index(['id', 'amount_tsh', 'date_recorded', 'funder', 'gps_height',
       'installer', 'longitude', 'latitude', 'wpt_name', 'num_private',
       'basin', 'subvillage', 'region', 'region_code', 'district_code', 'lga',
       'ward', 'population', 'public_meeting', 'recorded_by',
       'scheme_management', 'scheme_name', 'permit', 'construction_year',
       'extraction_type', 'extraction_type_group', 'extraction_type_class',
       'management', 'management_group', 'payment', 'payment_type',
       'water_quality', 'quality_group', 'quantity', 'quantity_group',
       'source', 'source_type', 'source_class', 'waterpoint_type',
       'waterpoint_type_group', 'gov_funder', 'Igosi_ward', 'Imalinyi_ward',
       'Siha_ward', 'gov_installer', 'dwe_installer', 'none_scheme_name',
       'gov_scheme_name', 'Madukani_subvillage', 'Shuleni_subvillage',
       'Majengo_subvillage', 'none_wpt', 'Shuleni_wpt', 'dwe',
       'gov_install_fund', 'gov'],
      dtype='object')

In [0]:
# categorical codes
# df.interest_level = pd.Categorical(df.interest_level)
# df['interest_code'] = df.interest_level.cat.codes

In [14]:
categorical_features = X_train.describe(exclude='number').T.query('unique <= 400').index.drop(['recorded_by']).tolist()
numeric_features = X_train.select_dtypes('number').columns.drop(['id']).tolist()
features = categorical_features + numeric_features

X_train_subset = X_train[features]
X_val_subset = X_val[features]

encoder = ce.OneHotEncoder(use_cat_names=True)
X_train_encoded = encoder.fit_transform(X_train_subset)
X_val_encoded = encoder.transform(X_val_subset)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_encoded)
X_val_scaled = scaler.transform(X_val_encoded)

model = LogisticRegression(solver='lbfgs', multi_class='auto', max_iter=1000)
model.fit(X_train_scaled, y_train)
score = model.score(X_val_scaled, y_val)
print('Validation Accuracy: {:5.4f}%'.format(score*100))

Validation Accuracy: 75.5724%


In [0]:
# Validation Accuracy: 76.6751% 'unique <= 2000'
# Validation Accuracy: 75.5724% 'unique <= 400'
# Validation Accuracy: 75.0673% 'unique <= 200'

In [0]:
# Validation Accuracy: 75.5724% 'unique <= 400'
categorical_features = X_train.describe(exclude='number').T.query('unique <= 400').index.drop(['recorded_by']).tolist()
numeric_features = X_train.select_dtypes('number').columns.drop(['id']).tolist()
features = categorical_features + numeric_features

# longitude, district_code, water_quality
# 0: 73.7710
# 1: longitude - 
# 2: district_code - 
# 3: water_quality - 

In [0]:
# Drop one feature at a time, refit, and record any feature whose
# removal beats the best validation score seen so far.
new_f_score = []
for feature in features:
  f_drop = features.copy()
  f_drop.remove(feature)
  X_train_subset = X_train[f_drop]
  X_val_subset = X_val[f_drop]

  encoder = ce.OneHotEncoder(use_cat_names=True)
  X_train_encoded = encoder.fit_transform(X_train_subset)
  X_val_encoded = encoder.transform(X_val_subset)

  scaler = StandardScaler()
  X_train_scaled = scaler.fit_transform(X_train_encoded)
  X_val_scaled = scaler.transform(X_val_encoded)

  model_drop = LogisticRegression(solver='lbfgs', multi_class='auto', max_iter=1000)
  model_drop.fit(X_train_scaled, y_train)

  # Score once and reuse the result instead of rescoring three times.
  drop_score = model_drop.score(X_val_scaled, y_val)
  print(feature, drop_score*100)
  if drop_score > score:
    score = drop_score
    new_f_score = [feature, score]
new_f_score

date_recorded 75.07575757575758
basin 75.53030303030303


In [0]:
categorical_features = X_train.describe(exclude='number').T.query('unique <= 3000').index.drop(['recorded_by']).tolist()
numeric_features = X_train.select_dtypes('number').columns.drop(['id']).tolist()
features = categorical_features + numeric_features

X_train_subset = X_train[features]
X_val_subset = X_val[features]

encoder = ce.OneHotEncoder(use_cat_names=True)
X_train_encoded = encoder.fit_transform(X_train_subset)
X_val_encoded = encoder.transform(X_val_subset)

In [0]:
# scaler = StandardScaler()
# X_train_scaled = scaler.fit_transform(X_train_encoded)
# X_val_scaled = scaler.transform(X_val_encoded)

In [18]:
model = LogisticRegression(solver='lbfgs', multi_class='auto', max_iter=1000)
model.fit(X_train_encoded, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

### Get & plot coefficients

In [0]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.figure(figsize=(30, 30))

# model.coef_ has one row per class; row 0 belongs to model.classes_[0].
coefficients = pd.Series(model.coef_[0], X_train_encoded.columns)
coefficients.sort_values().plot.barh();
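
For a multiclass target, `coef_` holds one row of coefficients per class, so plotting `coef_[0]` shows only the first class. A minimal sketch of labeling every row by class, using a small invented dataset rather than the encoded waterpump features:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Invented, already-numeric features with three target classes.
X_toy = pd.DataFrame({
    'gps_height': [0.0, 0.1, 1.0, 1.1, 2.0, 2.1],
    'population': [1.0, 0.9, 0.0, 0.1, 0.5, 0.4],
})
y_toy = ['functional', 'functional',
         'functional needs repair', 'functional needs repair',
         'non functional', 'non functional']

toy_model = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

# Rows are classes (in the order of classes_), columns are features.
coef_table = pd.DataFrame(toy_model.coef_,
                          index=toy_model.classes_,
                          columns=X_toy.columns)
```

A single class can then be plotted with something like `coef_table.loc['non functional'].sort_values().plot.barh()`.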

## Submit to predictive modeling competition


### Write submission CSV file

The format for the submission file is simply the row id and the predicted label (for an example, see `sample_submission.csv` on the data download page).

For example, if you just predicted that all the waterpoints were functional you would have the following predictions:

<pre>id,status_group
50785,functional
51630,functional
17168,functional
45559,functional
49871,functional
</pre>

Your code to generate a submission file may look like this: 
<pre># estimator is your scikit-learn estimator, which you've fit on X_train

# X_test is your pandas dataframe or numpy array, 
# with the same number of rows, in the same order, as test_features.csv, 
# and the same number of columns, in the same order, as X_train

y_pred = estimator.predict(X_test)


# Makes a dataframe with two columns, id and status_group, 
# and writes to a csv file, without the index

sample_submission = pd.read_csv('sample_submission.csv')
submission = sample_submission.copy()
submission['status_group'] = y_pred
submission.to_csv('your-submission-filename.csv', index=False)
</pre>

In [13]:
X_test_subset = test_features[features]
X_test_encoded = encoder.transform(X_test_subset)
# X_test_scaled = scaler.transform(X_test_encoded)
all(X_test_encoded.columns == X_train_encoded.columns)

True

In [0]:
# y_pred = model.predict(X_test_scaled)
y_pred = model.predict(X_test_encoded)
submission = sample_submission.copy()
submission['status_group'] = y_pred
submission.to_csv('submission-01.csv', index=False)

In [20]:
!head submission-01.csv

id,status_group
50785,non functional
51630,non functional
17168,non functional
45559,non functional
49871,functional
52449,functional
24806,functional
28965,non functional
36301,functional


### Send submission CSV file to Kaggle

#### Option 1. Kaggle web UI
 
Go to our Kaggle InClass competition webpage. Use the blue **Submit Predictions** button to upload your CSV file.


#### Option 2. Kaggle API

Use the Kaggle API to upload your CSV file.
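
One possible way to do this from a notebook is to call the Kaggle command-line tool, as sketched below. This assumes the `kaggle` package is installed and API credentials are configured. The competition name is a placeholder, since the actual competition slug is not given here, and the helper name and message text are hypothetical.

```python
import subprocess


def build_submit_command(competition, csv_path, message):
    """Build the argument list for the Kaggle CLI submit command."""
    return ['kaggle', 'competitions', 'submit',
            '-c', competition, '-f', csv_path, '-m', message]


# 'your-competition-name' is a placeholder, not the real competition slug.
command = build_submit_command(
    'your-competition-name',
    'submission-01.csv',
    'Logistic regression with one-hot encoding',
)

# Uncomment to submit (requires the kaggle package and API credentials).
# subprocess.run(command, check=True)
```

Passing the command as a list avoids shell quoting problems with the message text.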