# Assignment
- Learn about the mathematics of Logistic Regression by watching Aaron Gallant's [video #1](https://www.youtube.com/watch?v=pREaWFli-5I) (12 minutes) & [video #2](https://www.youtube.com/watch?v=bDQgVt4hFgY) (9 minutes).
- Start a clean notebook.
- Do train/validate/test split with the Tanzania Waterpumps data.
- Begin to explore and clean the data. For ideas, refer to [The Quartz guide to bad data](https://github.com/Quartz/bad-data-guide), a "reference to problems seen in real-world data along with suggestions on how to resolve them." One of the issues is ["Zeros replace missing values."](https://github.com/Quartz/bad-data-guide#zeros-replace-missing-values)
- Select different numeric and categorical features.
- Do one-hot encoding. (Remember that it may not work well with high-cardinality categoricals.)
- Scale features.
- Use scikit-learn for logistic regression.
- Get your validation accuracy score.
- Get and plot your coefficients.
- Submit your predictions to our Kaggle competition.
- Commit your notebook to your fork of the GitHub repo.
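
The encoding, scaling, fitting, scoring, and coefficient steps above can be sketched end to end on a small synthetic table. This is a minimal illustration, not the assignment solution: the `toy` DataFrame, its values, and the rule that generates `toy_labels` are all made up for the example, and only the column names echo the waterpump data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the waterpump data (hypothetical values).
rng = np.random.default_rng(0)
n = 300
toy = pd.DataFrame({
    'gps_height': rng.normal(700, 300, n),
    'population': rng.integers(0, 500, n).astype(float),
    'quantity': rng.choice(['enough', 'insufficient', 'dry'], n),
})
# Made-up rule so the model has something to learn: dry pumps do not work.
toy_labels = np.where(toy['quantity'] == 'dry', 'non functional', 'functional')

# One-hot encode the low-cardinality categorical column as 0.0/1.0 floats.
toy_X = pd.get_dummies(toy, columns=['quantity'], prefix=['quantity'], dtype=float)

toy_X_train, toy_X_val, toy_y_train, toy_y_val = train_test_split(
    toy_X, toy_labels, test_size=0.20, stratify=toy_labels, random_state=42)

# Fit the scaler on the training split only, then reuse it on validation.
scaler = StandardScaler()
toy_X_train_scaled = scaler.fit_transform(toy_X_train)
toy_X_val_scaled = scaler.transform(toy_X_val)

toy_model = LogisticRegression(max_iter=1000)
toy_model.fit(toy_X_train_scaled, toy_y_train)
toy_accuracy = accuracy_score(toy_y_val, toy_model.predict(toy_X_val_scaled))

# Binary problem: coef_ has a single row; label each weight with its column.
toy_coefficients = pd.Series(toy_model.coef_[0], index=toy_X.columns)
print(toy_accuracy)
print(toy_coefficients.sort_values())
```

With matplotlib installed, `toy_coefficients.sort_values().plot.barh()` would chart the weights. On the real three-class target, `coef_` has one row per class, so each class would need its own Series.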

## Stretch Goals
- Begin to visualize the data.
- Try different [scikit-learn scalers](https://scikit-learn.org/stable/modules/preprocessing.html).
- Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html):

> Pipeline can be used to chain multiple estimators into one. This is useful as there is often a fixed sequence of steps in processing the data, for example feature selection, normalization and classification. Pipeline serves multiple purposes here:

> - **Convenience and encapsulation.** You only have to call fit and predict once on your data to fit a whole sequence of estimators.
> - **Joint parameter selection.** You can grid search over parameters of all estimators in the pipeline at once.
> - **Safety.** Pipelines help avoid leaking statistics from your test data into the trained model in cross-validation, by ensuring that the same samples are used to train the transformers and predictors.
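
The three points in the quote can be seen in a short sketch. It assumes synthetic data from `make_classification` in place of the waterpump features, and the names `demo_X`, `demo_y`, `demo_pipeline`, and `demo_search` are placeholders chosen for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data standing in for numeric waterpump features.
demo_X, demo_y = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0)

# Convenience: scaling and classification chained into one estimator.
demo_pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Joint parameter selection: make_pipeline names each step after its
# lowercased class, so C is addressed as 'logisticregression__C'.
demo_search = GridSearchCV(
    demo_pipeline,
    param_grid={'logisticregression__C': [0.1, 1.0, 10.0]},
    cv=3)

# Safety: within each fold the scaler is fit on that fold's training part only.
demo_search.fit(demo_X, demo_y)

print(demo_search.best_params_)
print(demo_search.best_score_)
```

After fitting, `demo_search.predict(...)` applies the scaler and the best model in one call, so the same object could be used on validation or test features.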

In [1]:
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
pd.set_option('display.float_format', '{:.2f}'.format)

In [2]:
# !kaggle competitions download -c ds4-predictive-modeling-challenge

In [3]:
# !unzip test_features.csv.zip
# !unzip train_features.csv.zip
# !unzip train_labels.csv.zip

In [4]:
# !ls

In [5]:
# csv files were saved by kaggle with no read or write permissions?
train_features = pd.read_csv('train_features.csv')
train_labels = pd.read_csv('train_labels.csv')
test_features = pd.read_csv('test_features.csv')
train_features.shape, train_labels.shape, test_features.shape

((59400, 40), (59400, 2), (14358, 40))

In [6]:
train_features.describe(include='all')


Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,basin,subvillage,region,region_code,district_code,lga,ward,population,public_meeting,recorded_by,scheme_management,scheme_name,permit,construction_year,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group
count,59400.0,59400.0,59400,55765,59400.0,55745,59400.0,59400.0,59400,59400.0,59400,59029,59400,59400.0,59400.0,59400,59400,59400.0,56066,59400,55523,31234,56344,59400.0,59400,59400,59400,59400,59400,59400,59400,59400,59400,59400,59400,59400,59400,59400,59400,59400
unique,,,356,1897,,2145,,,37400,,9,19287,21,,,125,2092,,2,1,12,2696,2,,18,13,7,12,5,7,7,8,6,5,5,10,7,3,7,6
top,,,2011-03-15,Government Of Tanzania,,DWE,,,none,,Lake Victoria,Madukani,Iringa,,,Njombe,Igosi,,True,GeoData Consultants Ltd,VWC,K,True,,gravity,gravity,gravity,vwc,user-group,never pay,never pay,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe
freq,,,572,9084,,17402,,,3563,,10248,508,5294,,,2503,307,,51011,59400,36793,682,38852,,26780,26780,26780,40507,52490,25348,25348,50818,50818,33186,33186,17021,17021,45794,28522,34625
mean,37115.13,317.65,,,668.3,,34.08,-5.71,,0.47,,,,15.3,5.63,,,179.91,,,,,,1300.65,,,,,,,,,,,,,,,,
std,21453.13,2997.57,,,693.12,,6.57,2.95,,12.24,,,,17.59,9.63,,,471.48,,,,,,951.62,,,,,,,,,,,,,,,,
min,0.0,0.0,,,-90.0,,0.0,-11.65,,0.0,,,,1.0,0.0,,,0.0,,,,,,0.0,,,,,,,,,,,,,,,,
25%,18519.75,0.0,,,0.0,,33.09,-8.54,,0.0,,,,5.0,2.0,,,0.0,,,,,,0.0,,,,,,,,,,,,,,,,
50%,37061.5,0.0,,,369.0,,34.91,-5.02,,0.0,,,,12.0,3.0,,,25.0,,,,,,1986.0,,,,,,,,,,,,,,,,
75%,55656.5,20.0,,,1319.25,,37.18,-3.33,,0.0,,,,17.0,5.0,,,215.0,,,,,,2004.0,,,,,,,,,,,,,,,,


In [7]:
train_labels['status_group']

0                     functional
1                     functional
2                     functional
3                 non functional
4                     functional
5                     functional
6                 non functional
7                 non functional
8                 non functional
9                     functional
10                    functional
11                    functional
12                    functional
13                    functional
14                    functional
15                    functional
16                non functional
17                non functional
18       functional needs repair
19                    functional
20                    functional
21                    functional
22       functional needs repair
23                    functional
24                    functional
25       functional needs repair
26                    functional
27                    functional
28                non functional
29                    functional
30        

In [8]:
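def return_mean_if_zero(data, column):
    # A longitude of 0 is a placeholder for a missing value.
    # Caveat: column.mean() still includes those zeros, so the fill value is pulled toward 0.
    if data == 0:
        return column.mean()
    else:
        return data

def not_the_right_way(data, column):
    # Missing latitudes are recorded as -2e-08. Compare with a tolerance,
    # because exact float equality can fail after CSV parsing.
    if abs(data + 2e-08) < 1e-09:
        return column.mean()
    else:
        return data
    
train_features['longitude'] = train_features['longitude'].apply(return_mean_if_zero, args=(train_features['longitude'],))
train_features['latitude'] = train_features['latitude'].apply(not_the_right_way, args=(train_features['latitude'],))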
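
An alternative to the row-by-row functions above is to mark the placeholders as missing first and impute afterwards, so that the placeholders do not feed into the mean. The sketch below uses a four-row toy frame with made-up coordinates; `toy_coords` and the `1e-6` tolerance are assumptions for illustration, not values taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy coordinates (hypothetical values): 0.0 and -2e-08 are the placeholders.
toy_coords = pd.DataFrame({
    'longitude': [33.0, 0.0, 35.0, 37.0],
    'latitude': [-8.0, -2e-08, -4.0, -6.0],
})

# Treat anything within a small tolerance of zero as missing (NaN).
for col in ['longitude', 'latitude']:
    near_zero = np.isclose(toy_coords[col], 0.0, atol=1e-6)
    toy_coords[col] = toy_coords[col].mask(near_zero)

# mean() skips NaN by default, so the fill value ignores the placeholders.
toy_coords = toy_coords.fillna(toy_coords.mean())
print(toy_coords)
```

The same pattern would apply to other columns where the summary statistics suggest zeros stand in for missing values, such as `construction_year`, though whether a mean is a sensible fill there is a separate judgment call.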

In [9]:
# train_features = train_features.drop(['recorded_by'], axis=1)
dummied_features = pd.get_dummies(train_features, columns=['management_group', 'payment_type', 'source_type', 'water_quality'],
               prefix=['mgmt', 'payment', 'source', 'quality'])

In [10]:
train_features.describe(exclude='number').T.sort_values(by='unique')

Unnamed: 0,count,unique,top,freq
recorded_by,59400,1,GeoData Consultants Ltd,59400
public_meeting,56066,2,True,51011
permit,56344,2,True,38852
source_class,59400,3,groundwater,45794
management_group,59400,5,user-group,52490
quantity_group,59400,5,enough,33186
quantity,59400,5,enough,33186
waterpoint_type_group,59400,6,communal standpipe,34625
quality_group,59400,6,good,50818
payment_type,59400,7,never pay,25348
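
Columns near the top of this sorted table are cheap to one-hot encode, while columns such as `funder` and `installer` have thousands of distinct values. One common workaround, sketched here under assumptions, is to keep only the most frequent values and lump the rest together. `keep_top_categories` is a hypothetical helper written for this example, not a pandas function, and the toy funder names are made up.

```python
import pandas as pd

def keep_top_categories(series, n=10, other_label='OTHER'):
    """Keep the n most frequent values of a column and relabel the rest.

    Hypothetical helper for illustration. Missing values stay missing.
    """
    # value_counts() ignores NaN, so missing values never enter the top list.
    top = series.value_counts().nlargest(n).index
    # where() keeps entries whose condition is True and replaces the others.
    return series.where(series.isin(top) | series.isna(), other_label)

toy_funder = pd.Series(
    ['Gov', 'Gov', 'Gov', 'Danida', 'Danida', 'Unicef', 'Rwssp', None])
reduced = keep_top_categories(toy_funder, n=2)
print(reduced.tolist())
```

The reduced column can then go through `pd.get_dummies` like the low-cardinality ones, producing at most `n + 1` indicator columns.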


In [11]:
from sklearn.model_selection import train_test_split

# Returns X_train, X_val, y_train, y_val
def quick_split(X, y):
    # 80/20 split, stratified so each class keeps its share in both parts.
    return train_test_split(
        X, y, train_size=0.80, test_size=0.20,
        stratify=y)

In [13]:
from sklearn.linear_model import LogisticRegression

NameError: name 'X_train_numeric' is not defined

In [14]:
from sklearn.metrics import accuracy_score

NameError: name 'X_val_numeric' is not defined

In [44]:
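def fit_predict_score(X, y, X_val, y_val, X_test=None):
    # Fit logistic regression and return accuracy on the validation split.
    model = LogisticRegression(solver='saga', multi_class='auto', max_iter=20000)
    model.fit(X, y)
    y_pred = model.predict(X_val)
    # Validation predictions do not line up with the rows of sample_submission.csv,
    # so a submission is written only when test-set features are passed in.
    if X_test is not None:
        submission = pd.read_csv('sample_submission.csv')
        submission['status_group'] = model.predict(X_test)
        submission.to_csv('whaeck-submission.csv', index=False)
    return accuracy_score(y_val, y_pred)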

In [41]:
dummied_features = pd.get_dummies(train_features, columns=['management_group', 'water_quality', 'waterpoint_type', 'extraction_type', 'basin', 'region'],
               prefix=['mgmt', 'quality', 'waterpoint', 'extraction', 'basin', 'region'])

In [42]:
# Convert the boolean column to integer category codes; missing values (NaN) get the code -1.
dummied_features['public_meeting'] = dummied_features['public_meeting'].astype('category')
# dummied_features['permit'] = dummied_features['permit'].astype('category')
dummied_features['public_meeting_cat'] = dummied_features['public_meeting'].cat.codes
# dummied_features['permit_cat'] = dummied_features['permit'].cat.codes

In [43]:
dummied_features = dummied_features.drop('id', axis=1)

In [None]:
X_train, X_val, y_train, y_val = quick_split(dummied_features.select_dtypes('number'), train_labels['status_group'])
fit_predict_score(X_train, y_train, X_val, y_val)

In [None]:
dummied_features = pd.get_dummies(train_features, columns=['management_group', 'water_quality', 'waterpoint_type', 'extraction_type', 'basin', 'region'],
               prefix=['mgmt', 'quality', 'waterpoint', 'extraction', 'basin', 'region'])
dummied_features['public_meeting'] = dummied_features['public_meeting'].astype('category')
dummied_features['permit'] = dummied_features['permit'].astype('category')
dummied_features['public_meeting_cat'] = dummied_features['public_meeting'].cat.codes
dummied_features['permit_cat'] = dummied_features['permit'].cat.codes

In [None]:
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_estimators=150, max_depth=5)