Lambda School Data Science, Unit 2: Predictive Modeling

# Regression & Classification, Module 4

## Assignment

- [x] Watch Aaron's [video #1](https://www.youtube.com/watch?v=pREaWFli-5I) (12 minutes) & [video #2](https://www.youtube.com/watch?v=bDQgVt4hFgY) (9 minutes) to learn about the mathematics of Logistic Regression.
- [x] [Sign up for a Kaggle account](https://www.kaggle.com/), if you don’t already have one. Go to our Kaggle InClass competition website. You will be given the URL in Slack. Go to the Rules page. Accept the rules of the competition.
- [x] Do train/validate/test split with the Tanzania Waterpumps data.
- [x] Begin with baselines for classification.
- [x] Use scikit-learn for logistic regression.
- [x] Get your validation accuracy score.
- [ ] Submit your predictions to our Kaggle competition. (Go to our Kaggle InClass competition webpage. Use the blue **Submit Predictions** button to upload your CSV file. Or you can use the Kaggle API to submit your predictions.)
- [x] Commit your notebook to your fork of the GitHub repo.

---

## Stretch Goals

- [ ] Add your own stretch goal(s)!
- [ ] Clean the data. For ideas, refer to [The Quartz guide to bad data](https://github.com/Quartz/bad-data-guide), a "reference to problems seen in real-world data along with suggestions on how to resolve them." One of the issues is ["Zeros replace missing values."](https://github.com/Quartz/bad-data-guide#zeros-replace-missing-values)
- [ ] Make exploratory visualizations.
- [ ] Do one-hot encoding. For example, you could try `quantity`, `basin`, `extraction_type_class`, and more. (But remember it may not work with high cardinality categoricals.)
- [ ] Do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html).
- [ ] Get and plot your coefficients.
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).

---

## Data Dictionary 

### Features

Your goal is to predict the operating condition of a waterpoint for each record in the dataset. You are provided the following set of information about the waterpoints:

- `amount_tsh` : Total static head (amount water available to waterpoint)
- `date_recorded` : The date the row was entered
- `funder` : Who funded the well
- `gps_height` : Altitude of the well
- `installer` : Organization that installed the well
- `longitude` : GPS coordinate
- `latitude` : GPS coordinate
- `wpt_name` : Name of the waterpoint if there is one
- `num_private` :  
- `basin` : Geographic water basin
- `subvillage` : Geographic location
- `region` : Geographic location
- `region_code` : Geographic location (coded)
- `district_code` : Geographic location (coded)
- `lga` : Geographic location
- `ward` : Geographic location
- `population` : Population around the well
- `public_meeting` : True/False
- `recorded_by` : Group entering this row of data
- `scheme_management` : Who operates the waterpoint
- `scheme_name` : Who operates the waterpoint
- `permit` : If the waterpoint is permitted
- `construction_year` : Year the waterpoint was constructed
- `extraction_type` : The kind of extraction the waterpoint uses
- `extraction_type_group` : The kind of extraction the waterpoint uses
- `extraction_type_class` : The kind of extraction the waterpoint uses
- `management` : How the waterpoint is managed
- `management_group` : How the waterpoint is managed
- `payment` : What the water costs
- `payment_type` : What the water costs
- `water_quality` : The quality of the water
- `quality_group` : The quality of the water
- `quantity` : The quantity of water
- `quantity_group` : The quantity of water
- `source` : The source of the water
- `source_type` : The source of the water
- `source_class` : The source of the water
- `waterpoint_type` : The kind of waterpoint
- `waterpoint_type_group` : The kind of waterpoint

### Labels

There are three possible values:

- `functional` : the waterpoint is operational and there are no repairs needed
- `functional needs repair` : the waterpoint is operational, but needs repairs
- `non functional` : the waterpoint is not operational

--- 

## Generate a submission

Your code to generate a submission file may look like this:

```python
# estimator is your model or pipeline, which you've fit on X_train

# X_test is your pandas dataframe or numpy array, 
# with the same number of rows, in the same order, as test_features.csv, 
# and the same number of columns, in the same order, as X_train

y_pred = estimator.predict(X_test)


# Makes a dataframe with two columns, id and status_group, 
# and writes to a csv file, without the index

sample_submission = pd.read_csv('sample_submission.csv')
submission = sample_submission.copy()
submission['status_group'] = y_pred
submission.to_csv('your-submission-filename.csv', index=False)
```

If you're working locally, the csv file is saved in the same directory as your notebook.

If you're using Google Colab, you can use this code to download your submission csv file.

```python
from google.colab import files
files.download('your-submission-filename.csv')
```

---

In [1]:
# Musketeer imports
import pandas as pd
import numpy as np

In [2]:
# Logistic Regression Imports
import category_encoders as ce
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV

In [3]:
# Set pandas display options to allow for more columns and rows
pd.set_option("display.max_columns", 100)
pd.set_option("display.max_rows", 500)

---

#### Read the Tanzania Waterpumps data

- train_features.csv : the training set features
- train_labels.csv : the training set labels
- test_features.csv : the test set features
- sample_submission.csv : a sample submission file in the correct format

In [4]:
# Read the Tanzania Waterpumps data
train_features = pd.read_csv('../data/waterpumps/train_features.csv')
train_labels = pd.read_csv('../data/waterpumps/train_labels.csv')
test_features = pd.read_csv('../data/waterpumps/test_features.csv')
sample_submission = pd.read_csv('../data/waterpumps/sample_submission.csv')

assert train_features.shape == (59400, 40)
assert train_labels.shape == (59400, 2)
assert test_features.shape == (14358, 40)
assert sample_submission.shape == (14358, 2)

---

## 214 Assignment

In [5]:
train_features.head()

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,basin,subvillage,region,region_code,district_code,lga,ward,population,public_meeting,recorded_by,scheme_management,scheme_name,permit,construction_year,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group
0,69572,6000.0,2011-03-14,Roman,1390,Roman,34.938093,-9.856322,none,0,Lake Nyasa,Mnyusi B,Iringa,11,5,Ludewa,Mundindi,109,True,GeoData Consultants Ltd,VWC,Roman,False,1999,gravity,gravity,gravity,vwc,user-group,pay annually,annually,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe
1,8776,0.0,2013-03-06,Grumeti,1399,GRUMETI,34.698766,-2.147466,Zahanati,0,Lake Victoria,Nyamara,Mara,20,2,Serengeti,Natta,280,,GeoData Consultants Ltd,Other,,True,2010,gravity,gravity,gravity,wug,user-group,never pay,never pay,soft,good,insufficient,insufficient,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe
2,34310,25.0,2013-02-25,Lottery Club,686,World vision,37.460664,-3.821329,Kwa Mahundi,0,Pangani,Majengo,Manyara,21,4,Simanjiro,Ngorika,250,True,GeoData Consultants Ltd,VWC,Nyumba ya mungu pipe scheme,True,2009,gravity,gravity,gravity,vwc,user-group,pay per bucket,per bucket,soft,good,enough,enough,dam,dam,surface,communal standpipe multiple,communal standpipe
3,67743,0.0,2013-01-28,Unicef,263,UNICEF,38.486161,-11.155298,Zahanati Ya Nanyumbu,0,Ruvuma / Southern Coast,Mahakamani,Mtwara,90,63,Nanyumbu,Nanyumbu,58,True,GeoData Consultants Ltd,VWC,,True,1986,submersible,submersible,submersible,vwc,user-group,never pay,never pay,soft,good,dry,dry,machine dbh,borehole,groundwater,communal standpipe multiple,communal standpipe
4,19728,0.0,2011-07-13,Action In A,0,Artisan,31.130847,-1.825359,Shuleni,0,Lake Victoria,Kyanyamisa,Kagera,18,1,Karagwe,Nyakasimbi,0,True,GeoData Consultants Ltd,,,True,0,gravity,gravity,gravity,other,other,never pay,never pay,soft,good,seasonal,seasonal,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe


In [6]:
# Look at datatypes
train_features.dtypes

id                         int64
amount_tsh               float64
date_recorded             object
funder                    object
gps_height                 int64
installer                 object
longitude                float64
latitude                 float64
wpt_name                  object
num_private                int64
basin                     object
subvillage                object
region                    object
region_code                int64
district_code              int64
lga                       object
ward                      object
population                 int64
public_meeting            object
recorded_by               object
scheme_management         object
scheme_name               object
permit                    object
construction_year          int64
extraction_type           object
extraction_type_group     object
extraction_type_class     object
management                object
management_group          object
payment                   object
payment_ty

In [7]:
# Look at non-numeric / categorical features
train_features.select_dtypes(exclude='number').describe().T.sort_values(by='unique')

| | count | unique | top | freq |
|---|---|---|---|---|
| recorded_by | 59400 | 1 | GeoData Consultants Ltd | 59400 |
| public_meeting | 56066 | 2 | True | 51011 |
| permit | 56344 | 2 | True | 38852 |
| source_class | 59400 | 3 | groundwater | 45794 |
| management_group | 59400 | 5 | user-group | 52490 |
| quantity_group | 59400 | 5 | enough | 33186 |
| quantity | 59400 | 5 | enough | 33186 |
| waterpoint_type_group | 59400 | 6 | communal standpipe | 34625 |
| quality_group | 59400 | 6 | good | 50818 |
| payment_type | 59400 | 7 | never pay | 25348 |


---

#### Some ideas based on initial data exploration

- Drop redundant features 
  - [ ] extraction_type / extraction_type_group
  - [ ] quantity / quantity_group
  - [ ] source / etc.
  - [ ] waterpoint_type
- Drop high/low-cardinality
  - [ ] recorded_by
  - [ ] wpt_name up to date_recorded
- Datatypes
  - [ ] Find / replace null values that were not read in as NaN
  - [ ] Convert date to datetime

---

#### Do train/validate/test split with the Tanzania Waterpumps data.

In [8]:
# Split up the train data into train and validation
X_train, X_valid, y_train, y_valid = train_test_split(train_features, train_labels, random_state=92)
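
The split above relies on the default `test_size` of 0.25 and does not stratify, so class proportions can differ slightly between the train and validation sets. A possible alternative, sketched here on made-up data, is to pass `stratify=` so that each split keeps roughly the same class balance. The toy labels and the names `X_tr` and `X_val` are illustrative only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with a similar kind of class imbalance (values are made up).
X = pd.DataFrame({"feature": range(100)})
y = pd.Series(
    ["functional"] * 54
    + ["non functional"] * 38
    + ["functional needs repair"] * 8
)

# stratify=y keeps the class proportions (nearly) equal in both splits.
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=92
)

print(y_tr.value_counts(normalize=True))
print(y_val.value_counts(normalize=True))
```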

In [9]:
# Look at shape of each
data_list = {
    "X_train": X_train,
    "X_valid": X_valid,
    "y_train": y_train,
    "y_valid": y_valid,
}

def shape_printer():
    for d in data_list:
        print(f"{d}")
        print(f"{data_list[d].shape[0]} rows")
        print(f"{data_list[d].shape[1]} columns")
        print()

shape_printer()

X_train
44550 rows
40 columns

X_valid
14850 rows
40 columns

y_train
44550 rows
2 columns

y_valid
14850 rows
2 columns



In [10]:
# Check for null values in X_train
X_train.isnull().sum()  # Looks like "scheme_name" should be dropped

id                           0
amount_tsh                   0
date_recorded                0
funder                    2742
gps_height                   0
installer                 2762
longitude                    0
latitude                     0
wpt_name                     0
num_private                  0
basin                        0
subvillage                 283
region                       0
region_code                  0
district_code                0
lga                          0
ward                         0
population                   0
public_meeting            2516
recorded_by                  0
scheme_management         2899
scheme_name              21054
permit                    2298
construction_year            0
extraction_type              0
extraction_type_group        0
extraction_type_class        0
management                   0
management_group             0
payment                      0
payment_type                 0
water_quality                0
quality_

In [11]:
# def drop_scheme_name_and_na(data):
#     """
#     Returns dataframe without "scheme_name" column or rows with null values.
#     """
#     return data.drop("scheme_name", axis=1).dropna()

In [12]:
# # Drop nulls from X_train and see resulting shape
# X_train = drop_scheme_name_and_na(X_train)
# print(X_train.shape)

In [13]:
# # Drop nulls from X_valid and see resulting shape
# X_valid = drop_scheme_name_and_na(X_valid)
# print(X_valid.shape)

In [14]:
# # Confirm null values have been dropped
# X_train.isnull().sum()

In [15]:
# # Confirm null values have been dropped
# X_valid.isnull().sum()

To drop the same rows from the targets (that is, keep only the rows whose features have no null values), filter `y_train` and `y_valid` by the index of the filtered `X_train` and `X_valid`.

In [16]:
# def filter_by_index(data, index):
#     """
#     Returns dataframe with data filtered by index.
#     """
#     return data[data.index.isin(index.index)]

In [17]:
# # Filter the target data by the features indexes that did not have null values
# y_train = filter_by_index(y_train, X_train)
# y_valid = filter_by_index(y_valid, X_valid)

In [18]:
# # Make sure it worked as expected
# print(y_train.shape)
# print(y_valid.shape)
# assert y_train.shape[0] == 36192
# assert y_valid.shape[0] == 12096

---

#### Begin with baselines for classification.

In [19]:
# Get baseline accuracy score using the mode
y_train["status_group"].value_counts()  # The majority class : functional (good!)

functional                 24302
non functional             17046
functional needs repair     3202
Name: status_group, dtype: int64

In [20]:
# Predict the majority class for every row of the training target
y_pred_base = pd.Series(["functional"] * len(y_train))
y_pred_base.head()

0    functional
1    functional
2    functional
3    functional
4    functional
dtype: object

In [21]:
y_train["status_group"].head()

12399        functional
15969    non functional
22944        functional
54034        functional
53053    non functional
Name: status_group, dtype: object

In [22]:
# Calculate accuracy score of baseline model for y_train
accuracy_score(y_train["status_group"], y_pred_base)

0.5454994388327722

> train accuracy score baseline of 0.5455

In [23]:
# Predict the majority class for every row of the validation target
y_valid_pred_base = pd.Series(["functional"] * len(y_valid))
accuracy_score(y_valid["status_group"], y_valid_pred_base)  # accuracy of baseline model for y_valid

0.5358249158249159

> validation accuracy score baseline of 0.5358
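
As a cross-check on the manual baseline, scikit-learn's `DummyClassifier` with `strategy="most_frequent"` does the same thing: it always predicts the majority class. The sketch below rebuilds a label array from the training class counts shown earlier, so it is a stand-in for `y_train["status_group"]` and not the notebook's own code.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Rebuild a label array from the training class counts printed earlier.
y = np.array(
    ["functional"] * 24302
    + ["non functional"] * 17046
    + ["functional needs repair"] * 3202
)
X = np.zeros((len(y), 1))  # DummyClassifier ignores the features

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)

# score() returns accuracy, which here equals the share of the majority class.
print(round(baseline.score(X, y), 4))
```

Because the dummy model is a regular estimator, it can be scored on the validation set with the same `score()` call as any later model.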

---

#### Use scikit-learn for logistic regression.

In [24]:
X_train.describe()

| | id | amount_tsh | gps_height | longitude | latitude | num_private | region_code | district_code | population | construction_year |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 44550.0 | 44550.0 | 44550.0 | 44550.0 | 44550.0 | 44550.0 | 44550.0 | 44550.0 | 44550.0 | 44550.0 |
| mean | 37101.600875 | 310.214273 | 667.017172 | 34.066813 | -5.70974 | 0.533333 | 15.264938 | 5.618653 | 178.384512 | 1299.1633 |
| std | 21483.996827 | 2577.857256 | 692.295216 | 6.609923 | 2.946951 | 13.876726 | 17.59509 | 9.62405 | 473.816595 | 952.09594 |
| min | 1.0 | 0.0 | -90.0 | 0.0 | -11.64944 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 25% | 18474.25 | 0.0 | 0.0 | 33.098899 | -8.54129 | 0.0 | 5.0 | 2.0 | 0.0 | 0.0 |
| 50% | 36988.5 | 0.0 | 368.5 | 34.916438 | -5.029749 | 0.0 | 12.0 | 3.0 | 25.0 | 1986.0 |
| 75% | 55724.75 | 20.0 | 1318.0 | 37.177841 | -3.328119 | 0.0 | 17.0 | 5.0 | 210.0 | 2004.0 |
| max | 74247.0 | 250000.0 | 2770.0 | 40.345193 | -2e-08 | 1776.0 | 99.0 | 80.0 | 30500.0 | 2013.0 |


In [25]:
X_train.head()

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,basin,subvillage,region,region_code,district_code,lga,ward,population,public_meeting,recorded_by,scheme_management,scheme_name,permit,construction_year,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group
12399,51067,50.0,2013-02-06,Private Individual,-32,MTUWASA,40.200577,-10.282941,Kwa Mzee Deki,0,Ruvuma / Southern Coast,Madaba,Mtwara,9,5,Mtwara Urban,Likombe,120,False,GeoData Consultants Ltd,Water authority,Madaba,True,2006,submersible,submersible,submersible,private operator,commercial,pay per bucket,per bucket,soft,good,seasonal,seasonal,machine dbh,borehole,groundwater,communal standpipe,communal standpipe
15969,29559,0.0,2013-02-20,District Council,984,District Council,37.655014,-3.916111,Kwa Seif Masali,0,Pangani,Kijijini,Kilimanjaro,3,2,Mwanga,Lembeni,400,True,GeoData Consultants Ltd,WUA,Ngulu water supply,False,1994,gravity,gravity,gravity,wua,user-group,never pay,never pay,soft,good,insufficient,insufficient,spring,spring,groundwater,other,other
22944,59915,0.0,2013-05-03,World Vision,450,World Vision,38.061698,-4.579853,Kwa Kabonda,0,Pangani,Majengo,Kilimanjaro,3,3,Same,Bendera,210,True,GeoData Consultants Ltd,,,,1994,gravity,gravity,gravity,water authority,commercial,never pay,never pay,soft,good,insufficient,insufficient,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe
54034,28479,5.0,2013-03-13,Germany Republi,875,CES,37.177423,-3.380977,Lerai Sekondari,0,Pangani,Kambiyanyuki,Kilimanjaro,3,5,Hai,Hai Urban,320,True,GeoData Consultants Ltd,Water Board,Uroki-Bomang'ombe water sup,True,1999,gravity,gravity,gravity,water board,user-group,pay per bucket,per bucket,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe
53053,21489,50.0,2011-03-22,District Council,380,Handeni Trunk Main(,38.479124,-5.49543,Kwa Baditu,38,Pangani,Tridep,Tanga,4,6,Handeni,Kabuku,250,True,GeoData Consultants Ltd,VWC,Handeni Trunk Main(H,True,2005,submersible,submersible,submersible,vwc,user-group,pay per bucket,per bucket,soft,good,enough,enough,river,river/lake,surface,communal standpipe,communal standpipe


In [26]:
# Look at non-numeric / categorical features
X_train.select_dtypes(exclude='number').describe().T.sort_values(by='unique')

| | count | unique | top | freq |
|---|---|---|---|---|
| recorded_by | 44550 | 1 | GeoData Consultants Ltd | 44550 |
| public_meeting | 42034 | 2 | True | 38230 |
| permit | 42252 | 2 | True | 29141 |
| source_class | 44550 | 3 | groundwater | 34375 |
| management_group | 44550 | 5 | user-group | 39395 |
| quantity_group | 44550 | 5 | enough | 24870 |
| quantity | 44550 | 5 | enough | 24870 |
| waterpoint_type_group | 44550 | 6 | communal standpipe | 26013 |
| quality_group | 44550 | 6 | good | 38106 |
| payment_type | 44550 | 7 | never pay | 18983 |


In [27]:
# f_classif suits a categorical target; f_regression assumes a continuous one
from sklearn.feature_selection import f_classif, SelectKBest

In [28]:
# Convert date_recorded to datetime
# Work on explicit copies so the assignment does not trigger SettingWithCopyWarning
X_train = X_train.copy()
X_valid = X_valid.copy()
X_train["date_recorded"] = pd.to_datetime(X_train["date_recorded"], format="%Y-%m-%d")
X_valid["date_recorded"] = pd.to_datetime(X_valid["date_recorded"], format="%Y-%m-%d")


In [29]:
# # Drop the high-cardinality or redundant columns before encoding
# to_drop = [
#     "lga",
#     "region",
#     "funder",
#     "installer",
#     "ward",
#     "subvillage",
#     "wpt_name",
#     "date_recorded",
#     "recorded_by",
#     "extraction_type",
#     "management",
#     "scheme_management",
#     "public_meeting",
#     "quality_group",
#     "source_class",
#     "payment_type",
#     "waterpoint_type_group",
#     "quantity_group",
#     "management_group",
#     "basin",
# ]

# X_train_drop = X_train.drop(to_drop, axis=1)
# X_valid_drop = X_valid.drop(to_drop, axis=1)

In [30]:
features = [
    "permit",
    "source_class",
    "payment",
    "waterpoint_type",
    "basin",
    "extraction_type_class",
    "water_quality",
    "quantity",
    "scheme_management",
]

X_train_feat = X_train[features]
X_valid_feat = X_valid[features]

In [31]:
# Instantiate encoder object
encoder = ce.OneHotEncoder(use_cat_names=True)

# Fit / transform X with encoder
X_train_feat_encoded = encoder.fit_transform(X_train_feat)
X_valid_feat_encoded = encoder.transform(X_valid_feat)

In [32]:
# # Instantiate selector (f_classif from sklearn.feature_selection, since the target is categorical)
# selector = SelectKBest(score_func=f_classif, k=10)

# # Fit the selector and transform the data
# X_train_select = selector.fit_transform(X_train_feat_encoded, y_train["status_group"])
# X_valid_select = selector.transform(X_valid_feat_encoded)
# # X_train_select.shape, X_valid_select.shape

In [33]:
# Instantiate imputer object
imputer = SimpleImputer()

# Impute the data to fill in null values
X_train_feat_imputed = imputer.fit_transform(X_train_feat_encoded)
X_valid_feat_imputed = imputer.transform(X_valid_feat_encoded)

In [38]:
# Feature scaling
scaler = StandardScaler()  # Instantiate scaler object

# Fit scaler to train data
X_train_scaled = scaler.fit_transform(X_train_feat_imputed)
X_valid_scaled = scaler.transform(X_valid_feat_imputed)

In [39]:
# Instantiate model object
log_reg = LogisticRegression(solver="lbfgs", multi_class="multinomial", n_jobs=-1, random_state=92)

In [40]:
# Fit the model to the training data
log_reg.fit(X_train_scaled, y_train["status_group"])

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='multinomial', n_jobs=-1, penalty='l2',
                   random_state=92, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
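
For the "get and plot your coefficients" stretch goal, a fitted `LogisticRegression` exposes `coef_` (one row per class, one column per feature) and `classes_`. The sketch below fits a model on random toy data and wraps the coefficients in a labeled DataFrame. The feature names `feature_a` and `feature_b` are placeholders; in this notebook the column names of `X_train_feat_encoded` would presumably take their place.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy stand-ins: 3 classes, 2 features (values are random, not real data).
rng = np.random.default_rng(92)
X = rng.normal(size=(90, 2))
y = np.repeat(["functional", "functional needs repair", "non functional"], 30)

model = LogisticRegression(max_iter=1000).fit(X, y)

# coef_ has one row per class and one column per feature.
coefficients = pd.DataFrame(
    model.coef_, index=model.classes_, columns=["feature_a", "feature_b"]
)
print(coefficients)
```

From there, something like `coefficients.T.plot.barh()` should chart the values per feature, assuming matplotlib is installed.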

---

#### Get your validation accuracy score.

In [41]:
print('Validation Accuracy:', log_reg.score(X_valid_scaled, y_valid["status_group"]))  # Calculate accuracy score of fitted model

Validation Accuracy: 0.7184511784511785


In [56]:
from sklearn.metrics import f1_score

In [57]:
# Get y_valid predictions
y_pred_valid = log_reg.predict(X_valid_scaled)

In [63]:
print("F1 Score:", f1_score(y_valid["status_group"], y_pred_valid, average="weighted"))

F1 Score: 0.6863169688351966
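
A weighted F1 score can hide weak performance on a small class, because each class contributes in proportion to its size. The sketch below uses made-up labels in which the minority class is never predicted and compares per-class recall, macro F1, and weighted F1. It is an illustration of the metrics, not a result from this notebook's model.

```python
from sklearn.metrics import f1_score, recall_score

labels = ["functional", "functional needs repair", "non functional"]

# Toy labels (made up): the minority class is never predicted.
y_true = ["functional"] * 5 + ["non functional"] * 3 + ["functional needs repair"] * 2
y_pred = ["functional"] * 5 + ["non functional"] * 3 + ["functional"] * 2

# average=None returns one recall value per class, in the order of `labels`.
per_class_recall = recall_score(
    y_true, y_pred, labels=labels, average=None, zero_division=0
)
# Macro averaging weights every class equally; weighted averaging uses class sizes.
macro_f1 = f1_score(y_true, y_pred, labels=labels, average="macro", zero_division=0)
weighted_f1 = f1_score(y_true, y_pred, labels=labels, average="weighted", zero_division=0)

print(dict(zip(labels, per_class_recall)))
print("macro F1:", macro_f1)
print("weighted F1:", weighted_f1)
```

In this toy case the macro score is noticeably lower than the weighted one, since the class with zero recall counts as a full third of the macro average.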


---

#### Submit your predictions to our Kaggle competition.

- `estimator` is your model or pipeline, which you've fit on `X_train`.
- `X_test` is your pandas DataFrame or NumPy array that has:
  1. The same number of rows, in the same order, as `test_features.csv`
  2. The same number of columns, in the same order, as `X_train_scaled` (the transformed `X_train`)

In [42]:
# Transform the test data to match the train data
X_test_feat = test_features[features]  # Use the same features
X_test_encoded = encoder.transform(X_test_feat)  # Same encoder instance
X_test_imputed = imputer.transform(X_test_encoded)  # Same imputer instance
X_test_scaled = scaler.transform(X_test_imputed)  # Same scaler instance
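
Reusing the same encoder, imputer, and scaler instances by hand works, but a scikit-learn pipeline (one of the stretch goals) bundles those steps so that a single `fit` and `predict` apply them consistently. The sketch below is a minimal version on made-up rows. It assumes scikit-learn's own `OneHotEncoder` in place of `category_encoders`, and it imputes before encoding, which differs from the order used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy rows (made up), including one missing categorical value.
train = pd.DataFrame({
    "quantity": ["enough", "dry", "enough", "dry", "seasonal", "enough"],
    "source_class": ["groundwater", "surface", "groundwater", np.nan, "surface", "groundwater"],
})
y_train = ["functional", "non functional", "functional",
           "non functional", "non functional", "functional"]
test = pd.DataFrame({
    "quantity": ["dry", "enough"],
    "source_class": ["surface", "groundwater"],
})

pipeline = make_pipeline(
    SimpleImputer(strategy="most_frequent"),  # fill missing categories
    OneHotEncoder(handle_unknown="ignore"),   # one-hot encode, tolerate unseen values
    StandardScaler(with_mean=False),          # with_mean=False keeps sparse input sparse
    LogisticRegression(max_iter=1000),
)

# fit() learns every step from the training data; predict() reuses them on test.
pipeline.fit(train, y_train)
predictions = pipeline.predict(test)
print(predictions)
```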

In [54]:
# X_test is your pandas dataframe or numpy array, 
# with the same number of rows, in the same order, as test_features.csv, 
print(X_test_scaled.shape, test_features.shape)
assert X_test_scaled.shape[0] == test_features.shape[0]

# and the same number of columns, in the same order, as X_train / X_train_scaled
print(X_test_scaled.shape, X_train_scaled.shape)
assert X_test_scaled.shape[1] == X_train_scaled.shape[1]

In [50]:
# Make the predictions based on test features
y_pred_test = log_reg.predict(X_test_scaled)
y_pred_series = pd.Series(y_pred_test)

In [52]:
y_pred_series.value_counts()

functional                 10257
non functional              4046
functional needs repair       55
dtype: int64

In [65]:
# Makes a dataframe with two columns, id and status_group, 
# and writes to a csv file, without the index

# sample_submission = pd.read_csv('sample_submission.csv')  # Already read in at the top of this notebook

submission = sample_submission.copy()
submission['status_group'] = y_pred_series

In [66]:
# Write submission to csv
submission.to_csv('tobias-reaper-waterpump-submission.csv', index=False)

In [None]:
# TODO: connect to Kaggle API to submit my predictions

#### Commit your notebook to your fork of the GitHub repo.