Lambda School Data Science, Unit 2: Predictive Modeling

# Regression & Classification, Module 4


## Assignment

- [ ] Watch Aaron's [video #1](https://www.youtube.com/watch?v=pREaWFli-5I) (12 minutes) & [video #2](https://www.youtube.com/watch?v=bDQgVt4hFgY) (9 minutes) to learn about the mathematics of Logistic Regression.
- [ ] [Sign up for a Kaggle account](https://www.kaggle.com/), if you don’t already have one. Go to our Kaggle InClass competition website. You will be given the URL in Slack. Go to the Rules page. Accept the rules of the competition.
- [ ] Do train/validate/test split with the Tanzania Waterpumps data.
- [ ] Begin with baselines for classification.
- [ ] Use scikit-learn for logistic regression.
- [ ] Get your validation accuracy score.
- [ ] Submit your predictions to our Kaggle competition. (Go to our Kaggle InClass competition webpage. Use the blue **Submit Predictions** button to upload your CSV file. Or you can use the Kaggle API to submit your predictions.)
- [ ] Commit your notebook to your fork of the GitHub repo.
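
The split, baseline, logistic regression, and validation-accuracy steps in the checklist can be sketched end to end as follows. This is a minimal illustration on synthetic data, not the waterpumps data: the feature matrix, the three integer classes, and the 25% validation size are all assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the waterpumps data):
# 300 rows, 2 numeric features, 3 classes labeled 0, 1, 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = np.digitize(X[:, 0] + X[:, 1], bins=[-0.5, 0.5])

# Hold out 25% of the rows for validation; stratify keeps class proportions similar.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Baseline: always predict the most frequent training class.
majority_class = np.bincount(y_train).argmax()
baseline_acc = accuracy_score(y_val, np.full(len(y_val), majority_class))

# Logistic regression, scored on the held-out validation rows.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
val_acc = accuracy_score(y_val, model.predict(X_val))

print(f"baseline accuracy:   {baseline_acc:.3f}")
print(f"validation accuracy: {val_acc:.3f}")
```

Comparing the two numbers shows whether the model learned anything beyond the class frequencies.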

---


## Stretch Goals

- [ ] Add your own stretch goal(s)!
- [ ] Clean the data. For ideas, refer to [The Quartz guide to bad data](https://github.com/Quartz/bad-data-guide), a "reference to problems seen in real-world data along with suggestions on how to resolve them." One of the issues is ["Zeros replace missing values."](https://github.com/Quartz/bad-data-guide#zeros-replace-missing-values)
- [ ] Make exploratory visualizations.
- [ ] Do one-hot encoding. For example, you could try `quantity`, `basin`, `extraction_type_class`, and more. (But remember that one-hot encoding may not work well with high-cardinality categoricals, because each distinct value becomes its own column.)
- [ ] Do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html).
- [ ] Get and plot your coefficients.
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).
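
Several of these stretch goals (one-hot encoding, feature scaling, pipelines) fit together in one scikit-learn object. The sketch below is one possible arrangement, assuming a toy frame with made-up rows; only the column names `quantity` and `gps_height` come from the data dictionary, and the two-class target is a simplification for the example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy rows (made up for illustration); column names follow the data dictionary.
X = pd.DataFrame({
    'quantity': ['enough', 'insufficient', 'dry', 'enough', 'dry', 'insufficient'],
    'gps_height': [1200, 300, 0, 1500, 50, 400],
})
y = ['functional', 'functional', 'non functional',
     'functional', 'non functional', 'functional']

# One-hot encode the categorical column and standardize the numeric one.
preprocess = ColumnTransformer([
    ('onehot', OneHotEncoder(handle_unknown='ignore'), ['quantity']),
    ('scale', StandardScaler(), ['gps_height']),
])
pipeline = make_pipeline(preprocess, LogisticRegression(max_iter=1000))
pipeline.fit(X, y)

# A category unseen during fit is encoded as all zeros instead of raising.
X_new = pd.DataFrame({'quantity': ['seasonal'], 'gps_height': [800]})
print(pipeline.predict(X_new))

# Three one-hot columns plus one scaled column.
print(pipeline[:-1].transform(X).shape)
```

Keeping the preprocessing inside the pipeline means the same transformations are applied to the training, validation, and test frames without repeating code.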

---

## Data Dictionary 

### Features

Your goal is to predict the operating condition of a waterpoint for each record in the dataset. You are provided the following set of information about the waterpoints:

- `amount_tsh` : Total static head (amount of water available to waterpoint)
- `date_recorded` : The date the row was entered
- `funder` : Who funded the well
- `gps_height` : Altitude of the well
- `installer` : Organization that installed the well
- `longitude` : GPS coordinate
- `latitude` : GPS coordinate
- `wpt_name` : Name of the waterpoint if there is one
- `num_private` :  
- `basin` : Geographic water basin
- `subvillage` : Geographic location
- `region` : Geographic location
- `region_code` : Geographic location (coded)
- `district_code` : Geographic location (coded)
- `lga` : Geographic location
- `ward` : Geographic location
- `population` : Population around the well
- `public_meeting` : True/False
- `recorded_by` : Group entering this row of data
- `scheme_management` : Who operates the waterpoint
- `scheme_name` : Who operates the waterpoint
- `permit` : If the waterpoint is permitted
- `construction_year` : Year the waterpoint was constructed
- `extraction_type` : The kind of extraction the waterpoint uses
- `extraction_type_group` : The kind of extraction the waterpoint uses
- `extraction_type_class` : The kind of extraction the waterpoint uses
- `management` : How the waterpoint is managed
- `management_group` : How the waterpoint is managed
- `payment` : What the water costs
- `payment_type` : What the water costs
- `water_quality` : The quality of the water
- `quality_group` : The quality of the water
- `quantity` : The quantity of water
- `quantity_group` : The quantity of water
- `source` : The source of the water
- `source_type` : The source of the water
- `source_class` : The source of the water
- `waterpoint_type` : The kind of waterpoint
- `waterpoint_type_group` : The kind of waterpoint

### Labels

There are three possible values:

- `functional` : the waterpoint is operational and there are no repairs needed
- `functional needs repair` : the waterpoint is operational, but needs repairs
- `non functional` : the waterpoint is not operational

--- 

## Generate a submission

Your code to generate a submission file may look like this:

```python
# estimator is your model or pipeline, which you've fit on X_train

# X_test is your pandas dataframe or numpy array, 
# with the same number of rows, in the same order, as test_features.csv, 
# and the same number of columns, in the same order, as X_train

y_pred = estimator.predict(X_test)


# Makes a dataframe with two columns, id and status_group, 
# and writes to a csv file, without the index

sample_submission = pd.read_csv('sample_submission.csv')
submission = sample_submission.copy()
submission['status_group'] = y_pred
submission.to_csv('your-submission-filename.csv', index=False)
```

If you're working locally, the csv file is saved in the same directory as your notebook.

If you're using Google Colab, you can use this code to download your submission csv file.

```python
from google.colab import files
files.download('your-submission-filename.csv')
```

---

In [2]:
!pip install kaggle


In [32]:
# import os, sys
# in_colab = 'google.colab' in sys.modules

# # If you're in Colab...
# if in_colab:
#     # Pull files from Github repo
#     os.chdir('/content')
#     !git init .
#     !git remote add origin https://github.com/LambdaSchool/DS-Unit-2-Regression-Classification.git
#     !git pull origin master
    
#     # Install required python packages
#     !pip install -r requirements.txt
    
#     # Change into directory for module
#     os.chdir('module4')

In [33]:
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [34]:
# # Read the Tanzania Waterpumps data
# # train_features.csv : the training set features
# # train_labels.csv : the training set labels
# # test_features.csv : the test set features
# # sample_submission.csv : a sample submission file in the correct format
    
# import pandas as pd

# train_features = pd.read_csv('../data/waterpumps/train_features.csv')
# train_labels = pd.read_csv('../data/waterpumps/train_labels.csv')
# test_features = pd.read_csv('../data/waterpumps/test_features.csv')
# sample_submission = pd.read_csv('../data/waterpumps/sample_submission.csv')

# assert train_features.shape == (59400, 40)
# assert train_labels.shape == (59400, 2)
# assert test_features.shape == (14358, 40)
# assert sample_submission.shape == (14358, 2)

In [35]:
!kaggle competitions download -c ds8-predictive-modeling-challenge

ds8-predictive-modeling-challenge.zip: Skipping, found more recently modified local copy (use --force to force download)


In [37]:
import pandas as pd
import zipfile

zf = zipfile.ZipFile('ds8-predictive-modeling-challenge.zip')

# available files in the container
print(zf.namelist())

# zipped_df = pd.read_csv('ds8-predictive-modeling-challenge.zip')

['sample_submission.csv', 'test_features.csv', 'train_features.csv', 'train_labels.csv']


In [38]:
train_features = pd.read_csv(zf.open('train_features.csv'))
train_labels = pd.read_csv(zf.open('train_labels.csv'))
test_features = pd.read_csv(zf.open('test_features.csv'))
sample_submission = pd.read_csv(zf.open('sample_submission.csv'))

assert train_features.shape == (59400, 40)
assert train_labels.shape == (59400, 2)
assert test_features.shape == (14358, 40)
assert sample_submission.shape == (14358, 2)

In [39]:
from sklearn.model_selection import train_test_split

# Encode the three status_group labels as integers
label_codes = {'functional': 2, 'functional needs repair': 1, 'non functional': 0}
train_labels['status_group'] = train_labels['status_group'].map(label_codes).astype(int)
sample_submission['status_group'] = sample_submission['status_group'].map(label_codes).astype(int)

# Split features and labels in a single call so their rows stay aligned
my_train_features, my_val_features, my_train_labels, my_val_labels = train_test_split(
    train_features, train_labels, random_state=7)
my_train_features.shape, my_val_features.shape, my_train_labels.shape, my_val_labels.shape

((44550, 40), (14850, 40), (44550, 2), (14850, 2))
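
One way to confirm that feature rows and label rows still line up after the split is to compare their `id` columns. The sketch below uses two small made-up frames in place of `train_features` and `train_labels`; the `stratify` argument is an optional extra, not something the cell above uses.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frames standing in for train_features and train_labels (values are made up).
features = pd.DataFrame({'id': range(100, 120), 'gps_height': range(20)})
labels = pd.DataFrame({'id': range(100, 120), 'status_group': [0, 2] * 10})

# Passing both frames in one call applies the same shuffle to each.
X_tr, X_va, y_tr, y_va = train_test_split(
    features, labels, random_state=7, stratify=labels['status_group']
)

# The ids should match row for row in both partitions.
assert (X_tr['id'].values == y_tr['id'].values).all()
assert (X_va['id'].values == y_va['id'].values).all()

print(X_tr.shape, X_va.shape)
```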

In [40]:
my_train_features.head(2), my_val_features.head(2), my_train_labels.head(2), my_val_labels.head(2)

(          id  amount_tsh date_recorded                  funder  gps_height  \
 15679  26812        50.0    2013-01-26  Government Of Tanzania         328   
 23720  42639         0.0    2011-03-25  Government Of Tanzania         478   
 
                        installer  longitude   latitude        wpt_name  \
 15679  District Water Department  38.556341 -10.195002         Tankini   
 23720                        RWE  38.241950  -4.963292  Banda La Mbuzi   
 
        num_private  ... payment_type water_quality quality_group  \
 15679            0  ...   per bucket         salty         salty   
 23720            0  ...    never pay          soft          good   
 
            quantity  quantity_group       source source_type  source_class  \
 15679  insufficient    insufficient  machine dbh    borehole   groundwater   
 23720        enough          enough       spring      spring   groundwater   
 
                    waterpoint_type waterpoint_type_group  
 15679  communal standpipe

In [41]:
import numpy as np

In [42]:
my_train_features.describe()

| | id | amount_tsh | gps_height | longitude | latitude | num_private | region_code | district_code | population | construction_year |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 44550.0 | 44550.0 | 44550.0 | 44550.0 | 44550.0 | 44550.0 | 44550.0 | 44550.0 | 44550.0 | 44550.0 |
| mean | 37159.076992 | 315.546327 | 667.794815 | 34.05368 | -5.696881 | 0.451291 | 15.303389 | 5.621549 | 180.245275 | 1298.214613 |
| std | 21461.767613 | 3219.954427 | 693.280188 | 6.584583 | 2.94224 | 10.432547 | 17.556668 | 9.623465 | 481.499965 | 952.434441 |
| min | 1.0 | 0.0 | -90.0 | 0.0 | -11.64944 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 25% | 18560.25 | 0.0 | 0.0 | 33.076323 | -8.524419 | 0.0 | 5.0 | 2.0 | 0.0 | 0.0 |
| 50% | 37047.5 | 0.0 | 367.5 | 34.887432 | -5.019593 | 0.0 | 12.0 | 3.0 | 25.0 | 1986.0 |
| 75% | 55703.75 | 20.0 | 1321.0 | 37.154279 | -3.325172 | 0.0 | 17.0 | 5.0 | 210.0 | 2004.0 |
| max | 74247.0 | 350000.0 | 2770.0 | 40.345193 | -2e-08 | 1402.0 | 99.0 | 80.0 | 30500.0 | 2013.0 |
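
The summary shows a minimum `longitude` of 0.0, a maximum `latitude` of -2e-08, and a minimum `construction_year` of 0. Treating those as placeholders for missing values is an assumption, in the spirit of the "zeros replace missing values" issue in the Quartz guide. Under that assumption, a minimal sketch on a toy frame could replace them with NaN like this; the `1e-6` tolerance is a hypothetical choice.

```python
import pandas as pd

# Toy rows; the suspicious values are taken from the describe() output above.
df = pd.DataFrame({
    'longitude': [38.556341, 0.0, 38.241950],
    'latitude': [-10.195002, -2e-08, -4.963292],
    'construction_year': [1986, 0, 2004],
})

cleaned = df.copy()
for col in ['longitude', 'latitude', 'construction_year']:
    # Treat exact zeros and near-zero values as missing (assumed tolerance).
    near_zero = cleaned[col].abs() < 1e-6
    cleaned[col] = cleaned[col].mask(near_zero)

print(cleaned.isnull().sum())
```

Once the placeholders are NaN, statistics such as the median construction year are no longer pulled toward zero.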


In [43]:
my_train_features.describe(exclude=[np.number]).T

| | count | unique | top | freq |
|---|---|---|---|---|
| date_recorded | 44550 | 348 | 2011-03-15 | 430 |
| funder | 41833 | 1667 | Government Of Tanzania | 6819 |
| installer | 41813 | 1869 | DWE | 13045 |
| wpt_name | 44550 | 29034 | none | 2652 |
| basin | 44550 | 9 | Lake Victoria | 7739 |
| subvillage | 44264 | 16692 | Shuleni | 388 |
| region | 44550 | 21 | Iringa | 4001 |
| lga | 44550 | 124 | Njombe | 1908 |
| ward | 44550 | 2080 | Igosi | 232 |
| public_meeting | 42057 | 2 | True | 38309 |
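
The `unique` column suggests which categoricals are practical to one-hot encode: `basin` has 9 distinct values, while `wpt_name` has 29034. A minimal sketch of selecting columns by cardinality, assuming a toy frame with made-up rows and a hypothetical threshold:

```python
import pandas as pd

# Toy frame with made-up rows; column names follow the data dictionary.
df = pd.DataFrame({
    'basin': ['Lake Victoria', 'Pangani', 'Lake Victoria', 'Rufiji'],
    'wpt_name': ['Tankini', 'Banda La Mbuzi', 'none', 'Shuleni'],
    'gps_height': [328, 478, 0, 1200],
})

max_cardinality = 3  # hypothetical threshold; tune it for the real data

# Count distinct values in each non-numeric column.
cardinality = df.select_dtypes(exclude='number').nunique()
low_cardinality = cardinality[cardinality <= max_cardinality].index.tolist()
print(low_cardinality)

# One-hot encode only the low-cardinality columns.
encoded = pd.get_dummies(df[low_cardinality])
print(encoded.columns.tolist())
```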


In [44]:
my_train_labels.describe(include='all')

| | id | status_group |
|---|---|---|
| count | 44550.0 | 44550.0 |
| mean | 37159.076992 | 1.161302 |
| std | 21461.767613 | 0.949497 |
| min | 1.0 | 0.0 |
| 25% | 18560.25 | 0.0 |
| 50% | 37047.5 | 2.0 |
| 75% | 55703.75 | 2.0 |
| max | 74247.0 | 2.0 |


In [45]:
my_train_features.isnull().sum()

id                           0
amount_tsh                   0
date_recorded                0
funder                    2717
gps_height                   0
installer                 2737
longitude                    0
latitude                     0
wpt_name                     0
num_private                  0
basin                        0
subvillage                 286
region                       0
region_code                  0
district_code                0
lga                          0
ward                         0
population                   0
public_meeting            2493
recorded_by                  0
scheme_management         2952
scheme_name              21150
permit                    2295
construction_year            0
extraction_type              0
extraction_type_group        0
extraction_type_class        0
management                   0
management_group             0
payment                      0
payment_type                 0
water_quality                0
quality_
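
Several categorical columns have missing values (for example, 2717 in `funder` and 21150 in `scheme_name`). One simple option, sketched here on a toy frame, is to fill them with an explicit placeholder category so that a later one-hot encoder can treat "missing" as its own level. The placeholder string `'MISSING'` is an arbitrary choice for this example.

```python
import numpy as np
import pandas as pd

# Toy rows (made up); NaN marks a missing value.
df = pd.DataFrame({
    'funder': ['Government Of Tanzania', np.nan, 'Government Of Tanzania'],
    'permit': [True, np.nan, False],
})

# Share of missing values per column.
missing_share = df.isnull().mean()
print(missing_share)

# Fill the categorical column with an explicit placeholder category.
filled = df.copy()
filled['funder'] = filled['funder'].fillna('MISSING')
print(filled['funder'].value_counts())
```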

In [46]:
target = 'status_group'
y_train = my_train_labels[target]
y_train.value_counts(normalize=True)

2    0.544422
0    0.383120
1    0.072458
Name: status_group, dtype: float64

In [47]:
y_train.mode()[0]

2

In [48]:
majority_class = y_train.mode()[0]
y_pred = [majority_class] * len(y_train)

In [49]:
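# Mean absolute difference between the integer label codes.
# This is not the misclassification rate (that is 1 - accuracy), because a
# 'non functional' row (code 0) counts as 2 when the prediction is code 2.
sum(abs(y_pred - y_train)) / len(y_train)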

0.8386980920314253
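
The value above is larger than the share of wrong predictions because the label codes are treated as numbers. A small worked example with made-up labels shows the difference between the mean absolute difference and the misclassification rate:

```python
import numpy as np

# Five made-up encoded labels and a majority-class prediction of 2 for each.
y_true = np.array([2, 0, 1, 2, 0])
y_pred = np.full_like(y_true, 2)

# Mean absolute difference of the codes: (0 + 2 + 1 + 0 + 2) / 5
mean_abs_diff = np.abs(y_pred - y_true).mean()

# Misclassification rate: 3 wrong out of 5
error_rate = (y_pred != y_true).mean()

print(mean_abs_diff, error_rate)
```

Only the second number equals one minus the accuracy.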

In [50]:
from sklearn.metrics import accuracy_score
baseline_prediction = accuracy_score(y_train, y_pred)
baseline_prediction

0.5444219977553311

In [51]:
y_val = my_val_labels[target]
y_pred = [majority_class] * len(y_val)
accuracy_score(y_val, y_pred)  # scikit-learn's argument order is (y_true, y_pred)

0.539057239057239
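
The same majority-class baseline can also be expressed with scikit-learn's `DummyClassifier`, which gives it the usual `fit`/`predict` interface. This is a sketch on made-up labels; the feature values are placeholders because the baseline ignores them.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up encoded labels; features are zeros because the baseline never uses them.
X_train = np.zeros((6, 1))
y_train = [2, 2, 2, 0, 0, 1]
X_val = np.zeros((4, 1))
y_val = [2, 0, 2, 1]

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_train, y_train)

val_pred = baseline.predict(X_val)
acc = accuracy_score(y_val, val_pred)
print(val_pred, acc)
```

Because it is a fitted estimator, this baseline can be dropped into the submission code in place of a real model.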

In [52]:
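# baseline_prediction is an accuracy score, not a fitted estimator,
# so predict the majority class for every test row instead
y_pred = [majority_class] * len(test_features)
code_to_label = {2: 'functional', 1: 'functional needs repair', 0: 'non functional'}
submission = pd.read_csv(zf.open('sample_submission.csv'))  # read from the zip; the CSV was never extracted
submission['status_group'] = [code_to_label[code] for code in y_pred]  # back to the original label strings
submission.to_csv('your-submission-filename.csv', index=False)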