<table style="width:100%; border: 0px solid black;">
    <tr style="width: 100%; border: 0px solid black;">
        <td style="width:75%; border: 0px solid black;">
            <a href="http://www.drivendata.org">
                <img src="https://s3.amazonaws.com/drivendata.org/kif-example/img/dd.png" />
            </a>
        </td>
    </tr>
</table>

# Data Science is Software
---------
## Developer #lifehacks for the Jupyter Data Scientist

### Section 3:  Refactoring for reusability

In [160]:
from __future__ import print_function
%matplotlib inline


import os

import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

PROJ_ROOT = os.path.join(os.pardir, os.pardir)

## Use debugging tools throughout!

Don't forget all the fun debugging tools we covered while you work on these exercises. 

 - `%debug`
 - `%pdb`
 - `import q;q.d()`
 - And (if necessary) `%prun`


## Exercise 1

You'll notice that our dataset actually has two different files, `pumps_train_values.csv` and `pumps_train_labels.csv`. We want to load both of these together in a single `DataFrame` for our exploratory analysis. Create a function that:
 - reads both of the CSV files
 - uses the `id` column as the index
 - parses the dates in the `date_recorded` column
 - joins the labels and the training set on the `id`
 - returns the complete dataframe

In [161]:
def load_pumps_data(values_path, labels_path):
    # id is the first column in both files, so index_col=0 makes it the index
    values = pd.read_csv(values_path, index_col=0, parse_dates=['date_recorded'])
    labels = pd.read_csv(labels_path, index_col=0)
    # join the labels onto the training values using the shared id index
    return values.join(labels)
    
    
values = os.path.join(PROJ_ROOT, "data", "raw", "pumps_train_values.csv")
labels = os.path.join(PROJ_ROOT, "data", "raw", "pumps_train_labels.csv")

df = load_pumps_data(values, labels)
assert df.shape == (59400, 40)
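
The loader can also be checked without the full dataset. The sketch below is a minimal, self-contained illustration that feeds two tiny in-memory CSVs (made-up rows, not real pump records) through the same steps; it passes `index_col="id"` by name, which is an assumption equivalent to `index_col=0` only when `id` is the first column.

```python
import io

import pandas as pd


def load_pumps_data(values_path, labels_path):
    # Read both files, index them by id, and parse the recording date.
    values = pd.read_csv(values_path, index_col="id", parse_dates=["date_recorded"])
    labels = pd.read_csv(labels_path, index_col="id")
    # join aligns rows on the shared id index, regardless of row order.
    return values.join(labels)


# Hypothetical toy data: the label rows are deliberately in a different order.
values_csv = io.StringIO(
    "id,date_recorded,amount_tsh\n"
    "1,2011-03-14,6000.0\n"
    "2,2013-03-06,0.0\n"
)
labels_csv = io.StringIO(
    "id,status_group\n"
    "2,functional\n"
    "1,non functional\n"
)

toy = load_pumps_data(values_csv, labels_csv)
print(toy)
```

Because `read_csv` accepts file-like objects as well as paths, the same function works on both the toy buffers and the real files.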

## Exercise 2

Now that we've loaded our data, we want to do some pre-processing before we model. From inspection of the data, we've noticed that there are some numeric values that are probably not valid that we want to replace.

 - Select the relevant columns for modeling. For the purposes of this exercise, we'll select:
        useful_columns = ['amount_tsh',
                      'gps_height',
                      'longitude',
                      'latitude',
                      'region',
                      'population',
                      'construction_year',
                      'extraction_type_class',
                      'management_group',
                      'quality_group',
                      'source_type',
                      'waterpoint_type',
                      'status_group']

 - Replace `longitude` and `population` values of 0 with the mean for that region.
       zero_is_bad_value = ['longitude', 'population']
       
 - Replace the latitude where it is -2E-8 (a different bad value) with the mean for that region.
       other_bad_value = ['latitude']
      
 - Replace construction_year less than 1000 with the mean construction year.
 - Convert object type (i.e., string) variables to categoricals.
 - Convert the label column into a categorical variable
 

A skeleton for this work is below where `clean_raw_data` will call `replace_value_with_grouped_mean` internally. 

**Copy and paste the skeleton below into a Python file called `preprocess.py` in `src/features/`. Import and autoreload the functions from that file to run tests on your changes in this notebook.**

In [162]:
def clean_raw_data(df):
    """ Takes a dataframe and performs four steps:
            - Selects columns for modeling
            - For numeric variables, replaces 0 values with mean for that region
            - Fills invalid construction_year values with the mean construction_year
            - Converts strings to categorical variables
            
        :param df: A raw dataframe that has been read into pandas
        :returns: A dataframe with the preprocessing performed.
    """
    pass
    
def replace_value_with_grouped_mean(df, value, column, to_groupby):
    """ For a given numeric value (e.g., 0) in a particular column, take the
        mean of column (excluding value) grouped by to_groupby and return that
        column with the value replaced by that mean.

        :param df: The dataframe to operate on.
        :param value: The value in column that should be replaced.
        :param column: The column in which replacements need to be made.
        :param to_groupby: Groupby this variable and take the mean of column.
                           Replace value with the group's mean.
        :returns: The data frame with the invalid values replaced
    """
    pass
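
One possible way to fill in `replace_value_with_grouped_mean` is sketched below. It is only an illustration under stated assumptions, not the reference solution: the function name carries a `_sketch` suffix to mark it as hypothetical, it returns the whole dataframe as a copy, and it matches the bad value with exact equality. Groups that contain no valid rows end up as `NaN`, which would need separate handling.

```python
import pandas as pd


def replace_value_with_grouped_mean_sketch(df, value, column, to_groupby):
    """Hypothetical sketch: return a copy of df in which `value` in `column`
    is replaced by the mean of the valid rows in the same group."""
    out = df.copy()

    # Turn the bad value into NaN so that it is excluded from the means.
    valid = out[column].where(out[column] != value)

    # Per-row group mean, computed from the valid rows only.
    group_means = valid.groupby(out[to_groupby]).transform("mean")

    # Fill the gaps with the group mean; all-invalid groups stay NaN.
    out[column] = valid.fillna(group_means)
    return out


# Made-up toy data, not taken from the pumps dataset.
toy = pd.DataFrame({
    "region": ["A", "A", "B", "B"],
    "population": [0, 10, 20, 40],
})
result = replace_value_with_grouped_mean_sketch(toy, 0, "population", "region")
print(result["population"].tolist())  # [10.0, 10.0, 20.0, 40.0]
```

Masking with `where` before grouping keeps the bad value out of the mean, which is what the docstring's "excluding value" asks for.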


In [163]:
%load_ext autoreload
%autoreload 1

import os
import sys

src_dir = os.path.join(PROJ_ROOT, 'src')
sys.path.append(src_dir)

%aimport features.preprocess
from features.preprocess import clean_raw_data
cleaned_df = clean_raw_data(df)


# verify construction year
assert (cleaned_df.construction_year > 1000).all()

# verify filled in other values
for numeric_col in ["population", "longitude", "latitude"]:
    assert (cleaned_df[numeric_col] != 0).all()
    
# verify the types are in the expected types
assert (cleaned_df.dtypes
                  .astype(str)
                  .isin(["int32", "int64", "float64", "category"])).all()

# check some actual values
#assert cleaned_df.latitude.mean() == -5.970642969008563
#assert cleaned_df.longitude.mean() == 35.14119354200863
#assert cleaned_df.population.mean() == 277.3070009774711

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
ilo2


AssertionError: 

In [None]:
cleaned_df.info()
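
The dtype assertion above expects only integer, float, and category columns. A minimal sketch of the two remaining cleaning steps (string columns to categoricals, and invalid construction years to the mean of the valid years) might look like the following. The helper names `strings_to_categoricals` and `fill_invalid_years` are hypothetical and not part of the skeleton, and the toy frame is made up.

```python
import pandas as pd


def strings_to_categoricals(df):
    """Hypothetical helper: convert string/object columns to category dtype."""
    out = df.copy()
    for col in out.columns:
        is_text = (pd.api.types.is_object_dtype(out[col])
                   or pd.api.types.is_string_dtype(out[col]))
        if is_text:
            out[col] = out[col].astype("category")
    return out


def fill_invalid_years(df, column="construction_year", min_valid=1000):
    """Hypothetical helper: replace years below min_valid with the mean
    of the valid years."""
    out = df.copy()
    # Cast to float so a fractional mean can be stored.
    out[column] = out[column].astype(float)
    invalid = out[column] < min_valid
    out.loc[invalid, column] = out.loc[~invalid, column].mean()
    return out


# Made-up toy data for illustration only.
toy = pd.DataFrame({
    "region": ["A", "B", "A"],
    "construction_year": [0, 1990, 2010],
})
cleaned = fill_invalid_years(strings_to_categoricals(toy))
print(cleaned.dtypes)
```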

## Exercise 3

Now that we've got a feature matrix, let's train a model! Add a function as defined below to **`src/models/train_model.py`**.

The function should use [`sklearn.linear_model.LogisticRegression`](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) to train a logistic regression model. For a dataframe with categorical variables, `pd.get_dummies` produces a one-hot encoding that can be passed to `sklearn`.

The `LogisticRegression` class in `sklearn` handles multiclass models automatically, so there is no need to use `get_dummies` on `status_group`.

Finally, this function should return a [GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) object that has been run with the following parameters for a logistic regression model:

    params = {'C': [0.1, 1, 10]}

In [164]:
def logistic(df):
    """ Trains a multinomial logistic regression model to predict the
        status of a water pump given characteristics about the pump.
    
        :param df: The dataframe with the features and the label.
        :returns: A trained GridSearchCV classifier
    """
    pass

In [165]:
%aimport models.train_model
from models.train_model import logistic

In [166]:
%%time
clf = logistic(cleaned_df)

assert clf.best_score_ > 0.5

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

Wall time: 11.6 s



In [167]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

y = cleaned_df['status_group']
X = pd.get_dummies(cleaned_df.drop('status_group', axis=1))

lr = LogisticRegression()
params = {'C': [0.1, 1, 10]}

clf = GridSearchCV(lr, params, cv=3)
clf.fit(X, y)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

GridSearchCV(cv=3, error_score=nan,
             estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
                                          fit_intercept=True,
                                          intercept_scaling=1, l1_ratio=None,
                                          max_iter=100, multi_class='auto',
                                          n_jobs=None, penalty='l2',
                                          random_state=None, solver='lbfgs',
                                          tol=0.0001, verbose=0,
                                          warm_start=False),
             iid='deprecated', n_jobs=None, param_grid={'C': [0.1, 1, 10]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)
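
The convergence warning suggests either raising `max_iter` or scaling the data. The sketch below does both by wrapping a `StandardScaler` and the logistic regression in a pipeline before the grid search. It is an illustration under assumptions rather than the expected solution: the function name `logistic_scaled`, the `max_iter=1000` setting, and the synthetic toy frame are all hypothetical choices.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def logistic_scaled(df, label="status_group"):
    """Hypothetical variant of logistic(): scale features, then grid search C."""
    y = df[label]
    # One-hot encode categoricals; dtype=float keeps the matrix purely numeric.
    X = pd.get_dummies(df.drop(label, axis=1), dtype=float)

    pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    # Inside a pipeline, parameters are addressed as <step name>__<parameter>.
    params = {"logisticregression__C": [0.1, 1, 10]}

    clf = GridSearchCV(pipeline, params, cv=3)
    clf.fit(X, y)
    return clf


# Synthetic toy data with features on very different scales.
rng = np.random.RandomState(0)
toy = pd.DataFrame({
    "x1": rng.normal(size=90),
    "x2": rng.normal(size=90) * 1000,
    "kind": pd.Categorical(rng.choice(["a", "b"], size=90)),
})
toy["status_group"] = np.where(toy["x1"] > 0, "functional", "non functional")

toy_clf = logistic_scaled(toy)
print(toy_clf.best_params_)
```

Putting the scaler inside the pipeline means it is refit on each cross-validation training fold, so the held-out fold does not influence the scaling.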

In [168]:
import numpy as np
cleaned_df['population']#[cleaned_df['population'].isnull()]

id
69572     109.0
8776      280.0
34310     250.0
67743      58.0
19728       0.0
9944        1.0
19816       0.0
54551       0.0
53934       0.0
46144       0.0
49056     345.0
50409     250.0
36957       0.0
50495       1.0
53752       0.0
61848     200.0
48451      35.0
58155      50.0
34169    1000.0
18274       1.0
48375       4.0
6091        0.0
58500     350.0
37862     210.0
51058     156.0
22308     140.0
55012     260.0
20145       0.0
19685       1.0
69124       1.0
          ...  
14796       1.0
20387       0.0
29940       0.0
15233      96.0
49651       0.0
50998     609.0
34716       1.0
43986       0.0
38067      36.0
58255       0.0
30647      50.0
67885     360.0
47002       1.0
44616     800.0
72148       0.0
34473     200.0
34952    1000.0
26640     100.0
72559     500.0
30410    1500.0
13677     150.0
44885     210.0
40607       0.0
48348       0.0
11164      89.0
60739     125.0
27263      56.0
37057       0.0
31282       0.0
26348     150.0
Name: population, Len

In [169]:
X = pd.get_dummies(cleaned_df.drop('status_group', axis=1))
y.max()

2

In [171]:
invalid_mask = (cleaned_df['population'].isnull())

# get the mean without the invalid value
means_by_group = (cleaned_df[~invalid_mask]
    .groupby('region')['population']
    .mean())
means_by_group
# get an array of the means for all of the data
#means_array = means_by_group[cleaned_df['region'].values].values

# assign the invalid values to means
#cleaned_df.loc[invalid_mask, 'population'] = means_array[invalid_mask]
#cleaned_df['population']

region
Arusha           262.239104
Dar es Salaam    240.843478
Dodoma             0.000000
Iringa            94.304307
Kagera             0.000000
Kigoma           500.241832
Kilimanjaro      105.747888
Lindi            364.404916
Manyara          317.778269
Mara             538.794312
Mbeya              0.000000
Morogoro         264.625562
Mtwara           267.441618
Mwanza            65.315925
Pwani            349.486148
Rukwa            365.795907
Ruvuma           199.019318
Shinyanga         14.100963
Singida          279.122312
Tabora             0.000000
Tanga            246.753828
Name: population, dtype: float64
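
Several regions above have a mean population of exactly 0. Assuming population is never negative, that implies every non-null value in those regions is 0, so a grouped mean that excludes zeros has nothing to average there. A small sketch for listing such groups follows; the helper name `groups_with_only_bad_values` and the toy data are hypothetical.

```python
import pandas as pd


def groups_with_only_bad_values(df, value, column, to_groupby):
    """Hypothetical helper: list the groups in which every row of `column`
    equals the bad `value`."""
    is_bad = df[column] == value
    # all() is True only for groups without a single valid row.
    all_bad = is_bad.groupby(df[to_groupby]).all()
    return sorted(all_bad[all_bad].index)


# Made-up toy data: region A has no valid population values at all.
toy = pd.DataFrame({
    "region": ["A", "A", "B", "B"],
    "population": [0, 0, 0, 5],
})
print(groups_with_only_bad_values(toy, 0, "population", "region"))  # ['A']
```

For groups like these, one option is to fall back to the overall mean of the valid rows, although that is a modeling decision rather than something the exercise specifies.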

In [172]:
# Just for fun, let's profile the whole stack and see what's slowest!
%prun logistic(clean_raw_data(load_pumps_data(values, labels)))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  df[column] = le.fit_transform(df[column])


ilo2


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
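
Outside of IPython, where the `%prun` magic is not available, a similar profile can be collected with the standard library's `cProfile` and `pstats` modules. The sketch below is a minimal illustration; the helper name `profile_call` is hypothetical, and it profiles the built-in `sum` rather than the full pipeline.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, **kwargs):
    """Hypothetical helper: run func under cProfile and return its result
    together with a text report of the 10 most expensive calls."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)

    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time, which includes time spent in sub-calls.
    stats.sort_stats("cumulative").print_stats(10)
    return result, stream.getvalue()


total, report = profile_call(sum, range(1000))
print(report)
```

Sorting by cumulative time tends to surface the high-level steps (loading, cleaning, fitting) that dominate the run, which is usually the first thing worth knowing.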


In [173]:
clf.best_score_

0.6360437710437711