<a href="https://colab.research.google.com/github/AdrianduPlessis/DS-Unit-2-Regression-Classification/blob/master/module4/assignment_regression_classification_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science, Unit 2: Predictive Modeling

# Regression & Classification, Module 4

## Assignment

- [ ] Watch Aaron Gallant's [video #1](https://www.youtube.com/watch?v=pREaWFli-5I) (12 minutes) & [video #2](https://www.youtube.com/watch?v=bDQgVt4hFgY) (9 minutes) to learn about the mathematics of Logistic Regression.
- [ ] Do train/validate/test split with the Tanzania Waterpumps data.
- [ ] Do one-hot encoding. (Remember it may not work with high cardinality categoricals.)
- [ ] Use scikit-learn for logistic regression.
- [ ] Get your validation accuracy score.
- [ ] Get and plot your coefficients.
- [ ] Submit your predictions to our Kaggle competition. (Go to our Kaggle InClass competition webpage. Use the blue **Submit Predictions** button to upload your CSV file. Or you can use the Kaggle API to submit your predictions.)
- [ ] Commit your notebook to your fork of the GitHub repo.

> [Do Not Copy-Paste.](https://docs.google.com/document/d/1ubOw9B3Hfip27hF2ZFnW3a3z9xAgrUDRReOEo-FHCVs/edit) You must type each of these exercises in, manually. If you copy and paste, you might as well not even do them. The point of these exercises is to train your hands, your brain, and your mind in how to read, write, and see code. If you copy-paste, you are cheating yourself out of the effectiveness of the lessons.


## Stretch Goals

### Doing
- [ ] Add your own stretch goal(s)!
- [ ] Clean the data. For ideas, refer to [The Quartz guide to bad data](https://github.com/Quartz/bad-data-guide), a "reference to problems seen in real-world data along with suggestions on how to resolve them." One of the issues is ["Zeros replace missing values."](https://github.com/Quartz/bad-data-guide#zeros-replace-missing-values)
- [ ] Make exploratory visualizations.
- [ ] Do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html).
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).
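
For the "Zeros replace missing values" issue, a minimal sketch could look like the following. The toy values are made up, and the choice of columns rests on an assumption you should check against the real data: that a zero in `longitude` or `construction_year` means "not recorded" rather than a true zero.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for a few waterpump columns (values are made up)
df = pd.DataFrame({
    'longitude': [32.03, 34.68, 0.0],
    'construction_year': [0, 1978, 2010],
    'num_private': [0, 0, 41],
})

# Assumption: a zero in these columns means "not recorded", not a real zero
cols_with_zeros = ['longitude', 'construction_year']
cleaned = df.copy()
cleaned[cols_with_zeros] = cleaned[cols_with_zeros].replace(0, np.nan)

# Count how many values were treated as missing in each column
missing_counts = cleaned[cols_with_zeros].isna().sum()
print(missing_counts)
```

scikit-learn's `LogisticRegression` does not accept `NaN` values, so an imputation step would be needed before fitting.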


#### Exploratory visualizations

Visualize the relationships between feature(s) and target. I recommend you do this with your training set, after splitting your data. 

For this problem, you may want to create a new column to represent the target as a number, 0 or 1. For example:

```python
train['functional'] = (train['status_group']=='functional').astype(int)
```



You can try [Seaborn "Categorical estimate" plots](https://seaborn.pydata.org/tutorial/categorical.html) for features with reasonably few unique values. (With too many unique values, the plot is unreadable.)

- Categorical features. (If there are too many unique values, you can replace less frequent values with "OTHER.")
- Numeric features. (If there are too many unique values, you can [bin with pandas cut / qcut functions](https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html?highlight=qcut#discretization-and-quantiling).)
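
As a rough illustration of the binning idea, the sketch below uses `pd.qcut` on a made-up `population` column and then compares the share of functional pumps per bin. The toy data and the choice of four bins are assumptions for the example only.

```python
import pandas as pd

# Toy data: made-up population values with a 0/1 target column
toy = pd.DataFrame({
    'population': [0, 98, 700, 380, 100, 1, 258, 3832],
    'functional': [1, 1, 0, 0, 1, 1, 1, 0],
})

# Split population into 4 bins holding (roughly) equal numbers of rows
toy['population_bin'] = pd.qcut(toy['population'], q=4)

# Share of functional pumps within each bin
rate_by_bin = toy.groupby('population_bin', observed=True)['functional'].mean()
print(rate_by_bin)
```

If a real column contains many repeated values (for example, many zeros), `pd.qcut` may raise an error about duplicate bin edges; passing `duplicates='drop'` is one way around that.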

You can try [Seaborn linear model plots](https://seaborn.pydata.org/tutorial/regression.html) with numeric features. For this problem, you may want to use the parameter `logistic=True`.

You do _not_ need to use Seaborn, but it's nice because it includes confidence intervals to visualize uncertainty.

#### High-cardinality categoricals

This code from the previous assignment demonstrates how to replace less frequent values with 'OTHER':

```python
# Reduce cardinality for NEIGHBORHOOD feature ...

# Get a list of the top 10 neighborhoods
top10 = train['NEIGHBORHOOD'].value_counts()[:10].index

# At locations where the neighborhood is NOT in the top 10,
# replace the neighborhood with 'OTHER'
train.loc[~train['NEIGHBORHOOD'].isin(top10), 'NEIGHBORHOOD'] = 'OTHER'
test.loc[~test['NEIGHBORHOOD'].isin(top10), 'NEIGHBORHOOD'] = 'OTHER'
```
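
The same idea can be wrapped in a small function so that one list of top categories, learned from the training set only, is reused for every other split. This is only a sketch: `keep_top_categories` is a hypothetical helper name, and the toy Series stand in for real columns.

```python
import pandas as pd

def keep_top_categories(train_col, other_cols, n=10, other_label='OTHER'):
    """Hypothetical helper: keep the n most frequent training categories."""
    # Learn the top categories from the training column only
    top = train_col.value_counts().index[:n]

    def reduce(col):
        # where() keeps values in `top` and replaces everything else
        return col.where(col.isin(top), other_label)

    return reduce(train_col), [reduce(col) for col in other_cols]

# Toy columns (made up) to show the behavior
train_col = pd.Series(['a', 'a', 'a', 'b', 'b', 'c'])
val_col = pd.Series(['a', 'c', 'd'])
train_reduced, (val_reduced,) = keep_top_categories(train_col, [val_col], n=2)
print(train_reduced.tolist(), val_reduced.tolist())
```

Categories that never appear in the training set (like `'d'` here) also end up as `'OTHER'`, which keeps the encoded columns consistent across splits.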

#### Pipelines

The [scikit-learn User Guide](https://scikit-learn.org/stable/modules/compose.html) explains why pipelines are useful and demonstrates how to use them:

> Pipeline can be used to chain multiple estimators into one. This is useful as there is often a fixed sequence of steps in processing the data, for example feature selection, normalization and classification. Pipeline serves multiple purposes here:
> - **Convenience and encapsulation.** You only have to call fit and predict once on your data to fit a whole sequence of estimators.
> - **Joint parameter selection.** You can grid search over parameters of all estimators in the pipeline at once.
> - **Safety.** Pipelines help avoid leaking statistics from your test data into the trained model in cross-validation, by ensuring that the same samples are used to train the transformers and predictors.
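
A minimal sketch of such a pipeline is shown below, assuming scikit-learn's own `OneHotEncoder` and `StandardScaler` rather than the `category_encoders` package used in this notebook. The toy data and the step names (`'onehot'`, `'scale'`, `'preprocess'`, `'model'`) are made up for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (made up): one categorical and one numeric feature
X = pd.DataFrame({
    'source_class': ['groundwater', 'surface', 'groundwater',
                     'surface', 'groundwater', 'surface'],
    'gps_height': [0, 1728, 23, 1154, 53, 1344],
})
y = pd.Series(['functional', 'non functional', 'functional',
               'non functional', 'functional', 'non functional'])

# Encode the categorical column and scale the numeric one
preprocess = ColumnTransformer([
    ('onehot', OneHotEncoder(handle_unknown='ignore'), ['source_class']),
    ('scale', StandardScaler(), ['gps_height']),
])

# Chain preprocessing and the classifier into a single estimator
pipeline = Pipeline([
    ('preprocess', preprocess),
    ('model', LogisticRegression(solver='lbfgs')),
])

# One fit call trains every step; score and predict reuse the fitted steps
pipeline.fit(X, y)
train_accuracy = pipeline.score(X, y)
print(train_accuracy)
```

Because the encoder and scaler are fitted inside `pipeline.fit`, calling `pipeline.predict` on validation or test data applies the training-set statistics without refitting them.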

### Reading
- [ ] [How (and why) to create a good validation set](https://www.fast.ai/2017/11/13/validation-sets/)
- [ ] [Always start with a stupid model, no exceptions](https://blog.insightdatascience.com/always-start-with-a-stupid-model-no-exceptions-3a22314b9aaa)
- [ ] [Statistical Modeling: The Two Cultures](https://projecteuclid.org/download/pdf_1/euclid.ss/1009213726)
- [ ] [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf), Chapters 1-3, for more math & theory, but in an accessible, readable way (without an excessive amount of formulas or academic pre-requisites).



In [2]:
# If you're in Colab...
import os, sys
in_colab = 'google.colab' in sys.modules

if in_colab:
    # Install required python packages:
    # category_encoders, version >= 2.0
    # pandas-profiling, version >= 2.0
    # plotly, version >= 4.0
    !pip install --upgrade category_encoders pandas-profiling plotly
    
    # Pull files from Github repo
    os.chdir('/content')
    !git init .
    !git remote add origin https://github.com/LambdaSchool/DS-Unit-2-Regression-Classification.git
    !git pull origin master
    
    # Change into directory for module
    os.chdir('module4')

Requirement already up-to-date: category_encoders in /usr/local/lib/python3.6/dist-packages (2.0.0)
Requirement already up-to-date: pandas-profiling in /usr/local/lib/python3.6/dist-packages (2.3.0)
Requirement already up-to-date: plotly in /usr/local/lib/python3.6/dist-packages (4.1.1)
Reinitialized existing Git repository in /content/.git/
fatal: remote origin already exists.
From https://github.com/LambdaSchool/DS-Unit-2-Regression-Classification
 * branch            master     -> FETCH_HEAD
Already up to date.


In [0]:
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [0]:
import pandas as pd

train_features = pd.read_csv('../data/tanzania/train_features.csv')
train_labels = pd.read_csv('../data/tanzania/train_labels.csv')
test_features = pd.read_csv('../data/tanzania/test_features.csv')
sample_submission = pd.read_csv('../data/tanzania/sample_submission.csv')

assert train_features.shape == (59400, 40)
assert train_labels.shape == (59400, 2)
assert test_features.shape == (14358, 40)
assert sample_submission.shape == (14358, 2)

In [5]:
# Assignment checklist:
# - Do train/validate/test split with the Tanzania Waterpumps data.
# - Do one-hot encoding. (Remember it may not work with high cardinality categoricals.)
# - Use scikit-learn for logistic regression.
# - Get your validation accuracy score.
# - Get and plot your coefficients.
# - Submit your predictions to our Kaggle competition. (Go to our Kaggle InClass competition webpage. Use the blue Submit Predictions button to upload your CSV file. Or you can use the Kaggle API to submit your predictions.)
# - Commit your notebook to your fork of the GitHub repo.

In [0]:
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(train_features, train_labels, test_size=0.2, random_state=1)
X_test = test_features

In [7]:
X_train.head()

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,basin,subvillage,region,region_code,district_code,lga,ward,population,public_meeting,recorded_by,scheme_management,scheme_name,permit,construction_year,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group
1159,3948,0.0,2012-10-28,World Vision,0,World vision,32.02503,-3.346417,Busabaga B,0,Lake Tanganyika,Busabaga,Shinyanga,17,5,Bukombe,Ilolangulu,0,False,GeoData Consultants Ltd,WUG,,True,0,nira/tanira,nira/tanira,handpump,wug,user-group,unknown,unknown,soft,good,enough,enough,shallow well,shallow well,groundwater,hand pump,hand pump
26556,42528,500.0,2011-06-04,Unicef,1728,DWE,34.681545,-9.111633,Kwa Ngavatula,0,Rufiji,Isupilo,Iringa,11,4,Njombe,Mdandu,98,True,GeoData Consultants Ltd,WUA,wanging'ombe water supply s,True,1978,gravity,gravity,gravity,wua,user-group,pay monthly,monthly,soft,good,enough,enough,river,river/lake,surface,communal standpipe,communal standpipe
50388,2401,0.0,2011-03-11,Private Individual,23,Amboni Plantation,38.885967,-5.613441,Kwa Welfare,41,Pangani,Estate,Tanga,4,5,Pangani,Kipumbwi,700,True,GeoData Consultants Ltd,Private operator,Koronani Borehole,False,1975,submersible,submersible,submersible,private operator,commercial,never pay,never pay,soft,good,enough,enough,machine dbh,borehole,groundwater,communal standpipe,communal standpipe
40428,66494,0.0,2012-11-12,Ministry Of Water,1154,Hesawa,33.879265,-1.397925,Kwa Tatu Warioba,0,Lake Victoria,Majengo,Mara,20,6,Rorya,Kyang'ombe,380,True,GeoData Consultants Ltd,VWC,Mradi wa maji Komuge,False,1987,submersible,submersible,submersible,vwc,user-group,unknown,unknown,soft,good,enough,enough,lake,river/lake,surface,communal standpipe multiple,communal standpipe
12301,11228,100.0,2013-03-16,0,53,0,39.088609,-6.974433,Kwa Mubanga,0,Wami / Ruvu,Mtaa Wa Nyeburu,Dar es Salaam,7,2,Ilala,Chanika,100,,GeoData Consultants Ltd,Private operator,,False,2010,submersible,submersible,submersible,private operator,commercial,pay per bucket,per bucket,soft,good,enough,enough,hand dtw,borehole,groundwater,communal standpipe,communal standpipe


In [19]:
# Find the cardinality of the non-numeric columns (fewest unique values first)
X_train.describe(exclude = 'number').T.sort_values(by = 'unique')

Unnamed: 0,count,unique,top,freq
recorded_by,47520,1,GeoData Consultants Ltd,47520
public_meeting,44880,2,True,40831
permit,45090,2,True,31034
source_class,47520,3,groundwater,36678
management_group,47520,5,user-group,41996
quantity_group,47520,5,enough,26550
quantity,47520,5,enough,26550
waterpoint_type_group,47520,6,communal standpipe,27721
quality_group,47520,6,good,40624
payment_type,47520,7,never pay,20341


In [0]:
# Low-cardinality categorical columns to one-hot encode
categorical_features = ['source_class', 'source_type', 'extraction_type_class']
# All numeric columns except the row identifier
numeric_features = X_train.select_dtypes('number').columns.drop('id').tolist()
features = categorical_features + numeric_features

In [25]:
X_train[features].head(5)

Unnamed: 0,amount_tsh,gps_height,longitude,latitude,num_private,region_code,district_code,population,construction_year
1159,0.0,0,32.02503,-3.346417,0,17,5,0,0
26556,500.0,1728,34.681545,-9.111633,0,11,4,98,1978
50388,0.0,23,38.885967,-5.613441,41,4,5,700,1975
40428,0.0,1154,33.879265,-1.397925,0,20,6,380,1987
12301,100.0,53,39.088609,-6.974433,0,7,2,100,2010


In [56]:
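# In Colab the setup cell above already installs category_encoders;
# this line only matters when running elsewhere
!pip install category_encoders
import category_encoders as ce

# One-hot encode the categorical (string) columns; numeric columns pass through.
# Fit the encoder on the training set only, then reuse it on the validation set.
encoder = ce.OneHotEncoder(use_cat_names=True)
X_train_encoded = encoder.fit_transform(X_train[features])
X_val_encoded = encoder.transform(X_val[features])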



In [58]:
X_val_encoded.head()

Unnamed: 0,source_class_groundwater,source_class_surface,source_class_unknown,source_type_shallow well,source_type_river/lake,source_type_borehole,source_type_spring,source_type_rainwater harvesting,source_type_other,source_type_dam,extraction_type_class_handpump,extraction_type_class_gravity,extraction_type_class_submersible,extraction_type_class_other,extraction_type_class_motorpump,extraction_type_class_rope pump,extraction_type_class_wind-powered,amount_tsh,gps_height,longitude,latitude,num_private,region_code,district_code,population,construction_year
50088,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0.0,0,33.980298,-9.428765,0,12,3,0,0
14032,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0.0,1344,34.655756,-1.613606,0,20,2,3832,1994
45629,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,2000.0,374,37.365199,-11.441057,0,10,1,1,2007
55020,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0.0,0,33.802596,-9.268848,0,12,4,0,0
6407,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,20.0,1772,35.555071,-3.786036,0,21,3,258,2012


In [59]:
y_val.head()

Unnamed: 0,id,status_group
50088,54338,functional
14032,32599,non functional
45629,17420,functional
55020,12048,functional
6407,47314,functional


In [64]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Instantiate and fit the model on the encoded training features
log_reg = LogisticRegression(solver='lbfgs')
log_reg.fit(X_train_encoded, y_train['status_group'])

# Predict on the validation set and compute accuracy
y_pred = log_reg.predict(X_val_encoded)
accuracy_score(y_val['status_group'], y_pred)



0.6158249158249158
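
For the "Get and plot your coefficients" item, one possible approach is sketched below on synthetic data; the feature names and random values are made up. With more than two classes, `coef_` holds one row of coefficients per class, so the sketch labels rows with `classes_` and columns with the feature names.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic three-class problem (random features, made-up names)
rng = np.random.RandomState(0)
X = pd.DataFrame(rng.normal(size=(90, 3)),
                 columns=['feature_a', 'feature_b', 'feature_c'])
y = np.repeat(['functional', 'functional needs repair', 'non functional'], 30)

model = LogisticRegression(solver='lbfgs').fit(X, y)

# One row of coefficients per class, one column per feature
coefficients = pd.DataFrame(model.coef_, index=model.classes_, columns=X.columns)

# Coefficients for a single class, sorted for easier reading
functional_coefs = coefficients.loc['functional'].sort_values()
print(functional_coefs)
```

With matplotlib installed, `functional_coefs.plot.barh()` would draw these as a horizontal bar chart. In this notebook the column names would come from `X_train_encoded.columns`. Coefficients of unscaled features are not directly comparable in magnitude.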

In [55]:
# score() reports mean accuracy; it needs the encoded validation features and
# the label column, and should match the accuracy_score result above
log_reg.score(X_val_encoded, y_val['status_group'])
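
To prepare a Kaggle submission, the predictions need to be paired with the test `id` values and written to CSV. The sketch below assumes the submission file uses the columns `id` and `status_group` (the same columns as the labels file; check `sample_submission` to confirm), and the ids and predictions are made up.

```python
import pandas as pd

# Made-up ids and predictions standing in for the real test set
test_ids = pd.Series([101, 102, 103], name='id')
test_predictions = ['functional', 'non functional', 'functional']

# Assumed submission format: one row per test id with its predicted label
submission = pd.DataFrame({'id': test_ids, 'status_group': test_predictions})

# to_csv returns the CSV text when no path is given;
# pass a filename instead to write a file for upload
csv_text = submission.to_csv(index=False)
print(csv_text)
```

In this notebook, the predictions would come from encoding `X_test[features]` with the already-fitted `encoder` and passing the result to `log_reg.predict`.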