
### Features

Your goal is to predict the operating condition of a waterpoint for each record in the dataset. You are provided the following set of information about the waterpoints:

- `amount_tsh` : Total static head (amount of water available to the waterpoint)
- `date_recorded` : The date the row was entered
- `funder` : Who funded the well
- `gps_height` : Altitude of the well
- `installer` : Organization that installed the well
- `longitude` : GPS coordinate
- `latitude` : GPS coordinate
- `wpt_name` : Name of the waterpoint if there is one
- `num_private` :  
- `basin` : Geographic water basin
- `subvillage` : Geographic location
- `region` : Geographic location
- `region_code` : Geographic location (coded)
- `district_code` : Geographic location (coded)
- `lga` : Geographic location
- `ward` : Geographic location
- `population` : Population around the well
- `public_meeting` : True/False
- `recorded_by` : Group entering this row of data
- `scheme_management` : Who operates the waterpoint
- `scheme_name` : Who operates the waterpoint
- `permit` : If the waterpoint is permitted
- `construction_year` : Year the waterpoint was constructed
- `extraction_type` : The kind of extraction the waterpoint uses
- `extraction_type_group` : The kind of extraction the waterpoint uses
- `extraction_type_class` : The kind of extraction the waterpoint uses
- `management` : How the waterpoint is managed
- `management_group` : How the waterpoint is managed
- `payment` : What the water costs
- `payment_type` : What the water costs
- `water_quality` : The quality of the water
- `quality_group` : The quality of the water
- `quantity` : The quantity of water
- `quantity_group` : The quantity of water
- `source` : The source of the water
- `source_type` : The source of the water
- `source_class` : The source of the water
- `waterpoint_type` : The kind of waterpoint
- `waterpoint_type_group` : The kind of waterpoint

### Labels

There are three possible values:

- `functional` : the waterpoint is operational and there are no repairs needed
- `functional needs repair` : the waterpoint is operational, but needs repairs
- `non functional` : the waterpoint is not operational
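
With three labels, it helps to know how often each one occurs before modeling, because the share of the most common label is the accuracy of a model that always predicts it. The sketch below uses a small toy frame as a stand-in for `train_labels.csv`; the rows are illustrative, not real data, and the `id` and `status_group` column names are assumed from the submission format.

```python
import pandas as pd

# Toy stand-in for train_labels.csv (illustrative rows, not real data).
labels = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "status_group": [
        "functional", "functional", "functional",
        "non functional", "non functional",
        "functional needs repair",
    ],
})

# Share of each class in the labels.
class_share = labels["status_group"].value_counts(normalize=True)

# Always predicting the most common label scores exactly that label's share.
majority_label = class_share.idxmax()
baseline_accuracy = class_share.max()

print(class_share)
print(majority_label, baseline_accuracy)
```

Any model worth submitting should beat `baseline_accuracy` on the validation set.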

In [3]:
from google.colab import files
uploaded = files.upload()

Saving train_labels.csv to train_labels.csv


In [4]:
!pip install category_encoders



In [0]:
import category_encoders as ce

In [0]:
import pandas as pd

In [0]:
train_features_file = 'train_features.csv'
train_labels_file = 'train_labels.csv'
test_features_file = 'test_features.csv'

In [0]:
train_feat = pd.read_csv(train_features_file)
train_label = pd.read_csv(train_labels_file)
test_feat = pd.read_csv(test_features_file)

In [29]:
train_feat.construction_year.value_counts(dropna=False)


0       20709
2010     2645
2008     2613
2009     2533
2000     2091
2007     1587
2006     1471
2003     1286
2011     1256
2004     1123
2012     1084
2002     1075
1978     1037
1995     1014
2005     1011
1999      979
1998      966
1990      954
1985      945
1980      811
1996      811
1984      779
1982      744
1994      738
1972      708
1974      676
1997      644
1992      640
1993      608
2001      540
1988      521
1983      488
1975      437
1986      434
1976      414
1970      411
1991      324
1989      316
1987      302
1981      238
1977      202
1979      192
1973      184
2013      176
1971      145
1960      102
1967       88
1963       85
1968       77
1969       59
1964       40
1962       30
1961       21
1965       19
1966       17
Name: construction_year, dtype: int64
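
The 20709 rows with `construction_year` equal to 0 are very likely missing values stored as zeros rather than real years. A minimal sketch of one way to handle that, assuming 0 means "not recorded", is shown on a toy frame; the indicator column name `construction_year_missing` is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_feat (illustrative values only).
df = pd.DataFrame({"construction_year": [0, 2010, 1978, 0, 2008]})

# Assumption: a year of 0 means the year was not recorded.
df["construction_year"] = df["construction_year"].replace(0, np.nan)

# Hypothetical indicator column, so a model can still use "year unknown".
df["construction_year_missing"] = df["construction_year"].isna()

n_missing = int(df["construction_year"].isna().sum())
print(n_missing)
print(df)
```

Logistic regression in scikit-learn does not accept NaN, so the missing years would still need to be imputed (for example with the median) before fitting.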

In [11]:
train_feat.describe(exclude='number').T

| | count | unique | top | freq |
|---|---|---|---|---|
| date_recorded | 59400 | 356 | 2011-03-15 | 572 |
| funder | 55765 | 1897 | Government Of Tanzania | 9084 |
| installer | 55745 | 2145 | DWE | 17402 |
| wpt_name | 59400 | 37400 | none | 3563 |
| basin | 59400 | 9 | Lake Victoria | 10248 |
| subvillage | 59029 | 19287 | Madukani | 508 |
| region | 59400 | 21 | Iringa | 5294 |
| lga | 59400 | 125 | Njombe | 2503 |
| ward | 59400 | 2092 | Igosi | 307 |
| public_meeting | 56066 | 2 | True | 51011 |


In [12]:
train_feat['extraction_type'].value_counts(dropna=False)

gravity                      26780
nira/tanira                   8154
other                         6430
submersible                   4764
swn 80                        3670
mono                          2865
india mark ii                 2400
afridev                       1770
ksb                           1415
other - rope pump              451
other - swn 81                 229
windmill                       117
india mark iii                  98
cemo                            90
other - play pump               85
walimi                          48
climax                          32
other - mkulima/shinyanga        2
Name: extraction_type, dtype: int64

In [13]:
train_feat['extraction_type_group'].value_counts(dropna=False)

gravity            26780
nira/tanira         8154
other               6430
submersible         6179
swn 80              3670
mono                2865
india mark ii       2400
afridev             1770
rope pump            451
other handpump       364
other motorpump      122
wind-powered         117
india mark iii        98
Name: extraction_type_group, dtype: int64

In [14]:
train_feat['extraction_type_class'].value_counts(dropna=False)

gravity         26780
handpump        16456
other            6430
submersible      6179
motorpump        2987
rope pump         451
wind-powered      117
Name: extraction_type_class, dtype: int64
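
The three counts above suggest that these columns describe one hierarchy at three levels of detail: for example, the 4764 `submersible` and 1415 `ksb` rows in `extraction_type` add up to the 6179 `submersible` rows in `extraction_type_group`. The sketch below shows one way to check that nesting; the helper `is_nested` is hypothetical and the rows are toy data, so the real check would run on `train_feat`.

```python
import pandas as pd

# Toy rows (illustrative only); the real check would use train_feat.
df = pd.DataFrame({
    "extraction_type": ["ksb", "submersible", "swn 80", "gravity", "ksb"],
    "extraction_type_group": ["submersible", "submersible", "swn 80", "gravity", "submersible"],
    "extraction_type_class": ["submersible", "submersible", "handpump", "gravity", "submersible"],
})


def is_nested(frame, fine, coarse):
    """Return True if every value of `fine` maps to exactly one value of `coarse`."""
    return bool((frame.groupby(fine)[coarse].nunique() == 1).all())


type_in_group = is_nested(df, "extraction_type", "extraction_type_group")
group_in_class = is_nested(df, "extraction_type_group", "extraction_type_class")
class_in_type = is_nested(df, "extraction_type_class", "extraction_type")

print(type_in_group, group_in_class, class_in_type)
```

If the nesting holds on the real data, the three columns carry overlapping information, and keeping only one level is a reasonable way to limit the number of one-hot columns.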

In [0]:
train_merged = train_feat.merge(train_label)
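
Called without `on=`, `merge` joins on every column name the two frames share. A slightly more defensive sketch names the key and asks pandas to verify that it is unique on both sides. It assumes the shared key is `id`, as in the submission format, and uses toy frames rather than the real files.

```python
import pandas as pd

# Toy stand-ins for train_feat and train_label (illustrative rows only).
features = pd.DataFrame({
    "id": [1, 2, 3],
    "basin": ["Pangani", "Internal", "Lake Nyasa"],
})
labels = pd.DataFrame({
    "id": [3, 1, 2],
    "status_group": ["functional", "non functional", "functional"],
})

# validate="one_to_one" raises MergeError if `id` is duplicated in either frame.
merged = features.merge(labels, on="id", how="inner", validate="one_to_one")

# An inner join silently drops unmatched rows, so confirm none were lost.
rows_preserved = len(merged) == len(features)
print(merged)
print(rows_preserved)
```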

In [16]:
train_merged.groupby('extraction_type_class')['status_group'].value_counts(normalize=True)

extraction_type_class  status_group           
gravity                functional                 0.599253
                       non functional             0.299888
                       functional needs repair    0.100859
handpump               functional                 0.630469
                       non functional             0.309067
                       functional needs repair    0.060464
motorpump              non functional             0.573820
                       functional                 0.379980
                       functional needs repair    0.046200
other                  non functional             0.807932
                       functional                 0.160031
                       functional needs repair    0.032037
rope pump              functional                 0.649667
                       non functional             0.312639
                       functional needs repair    0.037694
submersible            functional                 0.538760
         

In [18]:
train_merged.groupby('payment_type')['status_group'].value_counts(normalize=True)


payment_type  status_group           
annually      functional                 0.752334
              non functional             0.179846
              functional needs repair    0.067820
monthly       functional                 0.660482
              non functional             0.227831
              functional needs repair    0.111687
never pay     non functional             0.475856
              functional                 0.448911
              functional needs repair    0.075233
on failure    functional                 0.620593
              non functional             0.308636
              functional needs repair    0.070772
other         functional                 0.579696
              non functional             0.308349
              functional needs repair    0.111954
per bucket    functional                 0.677796
              non functional             0.276683
              functional needs repair    0.045520
unknown       non functional             0.514527
            

In [19]:
train_merged.groupby('waterpoint_type_group')['status_group'].value_counts(normalize=True)

waterpoint_type_group  status_group           
cattle trough          functional                 0.724138
                       non functional             0.258621
                       functional needs repair    0.017241
communal standpipe     functional                 0.576491
                       non functional             0.339523
                       functional needs repair    0.083986
dam                    functional                 0.857143
                       non functional             0.142857
hand pump              functional                 0.617852
                       non functional             0.323307
                       functional needs repair    0.058840
improved spring        functional                 0.718112
                       non functional             0.173469
                       functional needs repair    0.108418
other                  non functional             0.822414
                       functional                 0.131661
         

In [21]:
train_merged.groupby('source_type')['status_group'].value_counts(normalize=True)

source_type           status_group           
borehole              functional                 0.495355
                      non functional             0.462131
                      functional needs repair    0.042514
dam                   non functional             0.577744
                      functional                 0.385671
                      functional needs repair    0.036585
other                 functional                 0.568345
                      non functional             0.413669
                      functional needs repair    0.017986
rainwater harvesting  functional                 0.603922
                      non functional             0.259259
                      functional needs repair    0.136819
river/lake            functional                 0.542257
                      non functional             0.338923
                      functional needs repair    0.118820
shallow well          functional                 0.494769
                      non 

In [24]:
train_merged.groupby('basin')['status_group'].value_counts(normalize=True)


basin                    status_group           
Internal                 functional                 0.575723
                         non functional             0.352730
                         functional needs repair    0.071548
Lake Nyasa               functional                 0.653687
                         non functional             0.297148
                         functional needs repair    0.049164
Lake Rukwa               non functional             0.482478
                         functional                 0.407498
                         functional needs repair    0.110024
Lake Tanganyika          functional                 0.483053
                         non functional             0.401586
                         functional needs repair    0.115361
Lake Victoria            functional                 0.497658
                         non functional             0.405835
                         functional needs repair    0.096507
Pangani                  functional 

In [26]:
train_merged.groupby('management_group')['status_group'].value_counts(normalize=True)


management_group  status_group           
commercial        functional                 0.614349
                  non functional             0.353491
                  functional needs repair    0.032161
other             functional                 0.559915
                  non functional             0.380700
                  functional needs repair    0.059385
parastatal        functional                 0.576923
                  non functional             0.303733
                  functional needs repair    0.119344
unknown           non functional             0.552585
                  functional                 0.399287
                  functional needs repair    0.048128
user-group        functional                 0.538236
                  non functional             0.387350
                  functional needs repair    0.074414
Name: status_group, dtype: float64
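
The long, stacked output of `groupby(...).value_counts(normalize=True)` is easy to truncate and hard to compare across categories. As an alternative sketch, `pd.crosstab` with `normalize="index"` gives the same row-wise proportions as one wide table, with one row per category and one column per label. The frame below is toy data, not the real `train_merged`.

```python
import pandas as pd

# Toy stand-in for train_merged (illustrative rows only).
df = pd.DataFrame({
    "management_group": [
        "commercial", "commercial",
        "user-group", "user-group", "user-group", "user-group",
        "unknown",
    ],
    "status_group": [
        "functional", "non functional",
        "functional", "functional", "functional", "non functional",
        "non functional",
    ],
})

# normalize="index" makes each row sum to 1, i.e. proportions within a category.
rates = pd.crosstab(df["management_group"], df["status_group"], normalize="index")
print(rates.round(3))
```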

In [27]:
train_merged.groupby('quality_group')['status_group'].value_counts(normalize=True)


quality_group  status_group           
colored        functional                 0.502041
               non functional             0.387755
               functional needs repair    0.110204
fluoride       functional                 0.723502
               non functional             0.216590
               functional needs repair    0.059908
good           functional                 0.565941
               non functional             0.357236
               functional needs repair    0.076823
milky          functional                 0.544776
               non functional             0.437811
               functional needs repair    0.017413
salty          non functional             0.482002
               functional                 0.460828
               functional needs repair    0.057170
unknown        non functional             0.840618
               functional                 0.140725
               functional needs repair    0.018657
Name: status_group, dtype: float64

In [30]:
train_merged.groupby('construction_year')['status_group'].value_counts(normalize=True)


construction_year  status_group           
0                  functional                 0.509682
                   non functional             0.403931
                   functional needs repair    0.086388
1960               non functional             0.705882
                   functional                 0.235294
                   functional needs repair    0.058824
1961               non functional             0.761905
                   functional needs repair    0.142857
                   functional                 0.095238
1962               non functional             0.733333
                   functional                 0.233333
                   functional needs repair    0.033333
1963               non functional             0.564706
                   functional                 0.364706
                   functional needs repair    0.070588
1964               non functional             0.800000
                   functional                 0.175000
                   fun

## Submit to predictive modeling competition


### Write submission CSV file

The submission file contains just the row id and the predicted label. For an example, see `sample_submission.csv` on the data download page.

For example, if you just predicted that all the waterpoints were functional you would have the following predictions:

<pre>id,status_group
50785,functional
51630,functional
17168,functional
45559,functional
49871,functional
</pre>

Your code to generate a submission file may look like this: 
<pre># estimator is your scikit-learn estimator, which you've fit on X_train

# X_test is your pandas dataframe or numpy array, 
# with the same number of rows, in the same order, as test_features.csv, 
# and the same number of columns, in the same order, as X_train

y_pred = estimator.predict(X_test)


# Makes a dataframe with two columns, id and status_group, 
# and writes to a csv file, without the index

sample_submission = pd.read_csv('sample_submission.csv')
submission = sample_submission.copy()
submission['status_group'] = y_pred
submission.to_csv('your-submission-filename.csv', index=False)
</pre>

### Send submission CSV file to Kaggle

#### Option 1. Kaggle web UI
 
Go to our Kaggle InClass competition webpage. Use the blue **Submit Predictions** button to upload your CSV file.


#### Option 2. Kaggle API

Use the Kaggle API to upload your CSV file.

# Assignment
- Learn about the mathematics of Logistic Regression by watching Aaron Gallant's [video #1](https://www.youtube.com/watch?v=pREaWFli-5I) (12 minutes) & [video #2](https://www.youtube.com/watch?v=bDQgVt4hFgY) (9 minutes).
- Start a clean notebook.
- Do a train/validate/test split with the Tanzania Waterpumps data.
- Begin to explore and clean the data. For ideas, refer to [The Quartz guide to bad data](https://github.com/Quartz/bad-data-guide),  a "reference to problems seen in real-world data along with suggestions on how to resolve them." One of the issues is ["Zeros replace missing values."](https://github.com/Quartz/bad-data-guide#zeros-replace-missing-values)
- Select different numeric and categorical features. 
- Do one-hot encoding. (Remember that it may not work well with high-cardinality categoricals, because every unique value becomes its own column.)
- Scale features.
- Use scikit-learn for logistic regression.
- Get your validation accuracy score.
- Get and plot your coefficients.
- Submit your predictions to our Kaggle competition.
- Commit your notebook to your fork of the GitHub repo.
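
The modeling steps in the list above (split, one-hot encode, scale, fit, score, inspect coefficients) can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not a reference solution: the data is synthetic with an artificially simple label rule, only two features are used, and `pd.get_dummies` stands in for the `category_encoders` one-hot encoder imported earlier in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the waterpump data (not real rows).
rng = np.random.default_rng(0)
n = 300
toy = pd.DataFrame({
    "gps_height": rng.normal(1000, 300, n),
    "quantity": rng.choice(["enough", "insufficient", "dry"], n),
})
# Artificial rule so the example has a learnable signal.
toy["status_group"] = np.where(toy["quantity"] == "dry", "non functional", "functional")

X = toy[["gps_height", "quantity"]]
y = toy["status_group"]
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# One-hot encode, then align validation columns to the training columns.
X_train_enc = pd.get_dummies(X_train, columns=["quantity"], dtype=float)
X_val_enc = pd.get_dummies(X_val, columns=["quantity"], dtype=float).reindex(
    columns=X_train_enc.columns, fill_value=0.0
)

# Fit the scaler on the training split only, then reuse it for validation.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_enc)
X_val_scaled = scaler.transform(X_val_enc)

model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)
val_accuracy = accuracy_score(y_val, model.predict(X_val_scaled))

# With two classes, positive coefficients push toward model.classes_[1].
coefficients = pd.Series(model.coef_[0], index=X_train_enc.columns).sort_values()
print(val_accuracy)
print(coefficients)
```

Calling `coefficients.plot.barh()` would draw the coefficients if matplotlib is installed. With the real three-class labels, `model.coef_` has one row per class, so each class would get its own series.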

## Stretch Goals
- Begin to visualize the data.
- Try different [scikit-learn scalers](https://scikit-learn.org/stable/modules/preprocessing.html)
- Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html):

> Pipeline can be used to chain multiple estimators into one. This is useful as there is often a fixed sequence of steps in processing the data, for example feature selection, normalization and classification. Pipeline serves multiple purposes here:

> - **Convenience and encapsulation.** You only have to call fit and predict once on your data to fit a whole sequence of estimators.
> - **Joint parameter selection.** You can grid search over parameters of all estimators in the pipeline at once.
> - **Safety.** Pipelines help avoid leaking statistics from your test data into the trained model in cross-validation, by ensuring that the same samples are used to train the transformers and predictors.
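
As a rough illustration of the points quoted above, the sketch below chains scaling, one-hot encoding, and logistic regression into one estimator and cross-validates it, so the scaler and encoder are refit on each training fold only. The data is synthetic with an artificial label rule, and the two columns are just examples borrowed from the feature list.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the waterpump data (not real rows).
rng = np.random.default_rng(1)
n = 200
X = pd.DataFrame({
    "population": rng.integers(0, 500, n).astype(float),
    "payment_type": rng.choice(["annually", "monthly", "never pay"], n),
})
# Artificial rule so the example has a learnable signal.
y = np.where(X["payment_type"] == "never pay", "non functional", "functional")

# Scale the numeric column and one-hot encode the categorical column.
preprocess = make_column_transformer(
    (StandardScaler(), ["population"]),
    (OneHotEncoder(handle_unknown="ignore"), ["payment_type"]),
)
pipeline = make_pipeline(preprocess, LogisticRegression(max_iter=1000))

# Each fold fits the preprocessing and the model on that fold's training part only.
scores = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy")
mean_score = scores.mean()
print(mean_score)
```

After `pipeline.fit(X_train, y_train)`, a single `pipeline.predict(X_test)` call applies the same preprocessing to the test features before predicting.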
