# Assignment
- Learn about the mathematics of Logistic Regression by watching Aaron Gallant's [video #1](https://www.youtube.com/watch?v=pREaWFli-5I) (12 minutes) & [video #2](https://www.youtube.com/watch?v=bDQgVt4hFgY) (9 minutes).
- Start a clean notebook.
- Do a train/validate/test split with the Tanzania Waterpumps data.
- Begin to explore and clean the data. For ideas, refer to [The Quartz guide to bad data](https://github.com/Quartz/bad-data-guide), a "reference to problems seen in real-world data along with suggestions on how to resolve them." One of the issues is ["Zeros replace missing values."](https://github.com/Quartz/bad-data-guide#zeros-replace-missing-values)
- Select different numeric and categorical features.
- Do one-hot encoding. (Remember that it may not work well with high-cardinality categoricals.)
- Scale features.
- Use scikit-learn for logistic regression.
- Get your validation accuracy score.
- Get and plot your coefficients.
- Submit your predictions to our Kaggle competition.
- Commit your notebook to your fork of the GitHub repo.
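
The one-hot encoding step above can be sketched as follows. This is a minimal illustration on a made-up frame, not the assignment's solution: the toy values and the `max_cardinality` cutoff are assumptions, and only the column names come from the waterpump data.

```python
import pandas as pd

# Toy frame standing in for the waterpump features (values are made up).
df = pd.DataFrame({
    'quantity': ['enough', 'dry', 'enough', 'seasonal'],
    'wpt_name': ['A', 'B', 'C', 'D'],  # high cardinality: one value per row
    'gps_height': [13, 1320, 1796, 0],
})

max_cardinality = 3  # hypothetical cutoff; tune it for the real data

# Keep only categorical columns with few distinct values.
categorical = df.select_dtypes(exclude='number')
low_cardinality = [
    col for col in categorical.columns
    if categorical[col].nunique() <= max_cardinality
]

# One-hot encode the low-cardinality columns; numeric columns pass through.
encoded = pd.get_dummies(df[low_cardinality + ['gps_height']],
                         columns=low_cardinality)
```

Here `wpt_name` is skipped because encoding it would add one column per distinct name, which is the high-cardinality problem the assignment warns about.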

## Stretch Goals
- Begin to visualize the data.
- Try different [scikit-learn scalers](https://scikit-learn.org/stable/modules/preprocessing.html).
- Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html):

> Pipeline can be used to chain multiple estimators into one. This is useful as there is often a fixed sequence of steps in processing the data, for example feature selection, normalization and classification. Pipeline serves multiple purposes here:

> - **Convenience and encapsulation.** You only have to call fit and predict once on your data to fit a whole sequence of estimators.
> - **Joint parameter selection.** You can grid search over parameters of all estimators in the pipeline at once.
> - **Safety.** Pipelines help avoid leaking statistics from your test data into the trained model in cross-validation, by ensuring that the same samples are used to train the transformers and predictors.
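
As a rough sketch of the pipeline idea quoted above, the block below chains a scaler and a logistic regression with `make_pipeline`. The data is synthetic and the variable names (`X_tr`, `X_va`, `val_accuracy`) are placeholders, so treat it as an illustration rather than the notebook's actual model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: two features on very different scales.
rng = np.random.RandomState(0)
X = np.column_stack([rng.normal(0, 1, 200), rng.normal(1000, 300, 200)])
y = (X[:, 0] > 0).astype(int)

X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# fit() learns the scaling statistics from the training split only;
# score() reuses them on the validation split before predicting.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipeline.fit(X_tr, y_tr)
val_accuracy = pipeline.score(X_va, y_va)
```

Because the scaler lives inside the pipeline, there is no separate `transform` call on the validation data to forget or to fit by mistake.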

In [42]:
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
pd.set_option('display.float_format', '{:.2f}'.format)

In [2]:
!kaggle competitions download -c ds4-predictive-modeling-challenge

Downloading train_features.csv.zip to /Users/whaeck/Code/DS-Unit-2-Classification-1/module1-logistic-regression
Downloading train_labels.csv.zip to /Users/whaeck/Code/DS-Unit-2-Classification-1/module1-logistic-regression
Downloading test_features.csv.zip to /Users/whaeck/Code/DS-Unit-2-Classification-1/module1-logistic-regression
Downloading sample_submission.csv to /Users/whaeck/Code/DS-Unit-2-Classification-1/module1-logistic-regression

In [4]:
!unzip test_features.csv.zip
!unzip train_features.csv.zip
!unzip train_labels.csv.zip

Logistic-regression.ipynb
README.md
logistic_regression_categorical_encoding.ipynb
sample_submission.csv
test_features.csv.zip
train_features.csv.zip
train_labels.csv.zip
Archive:  test_features.csv.zip
  inflating: test_features.csv       
Archive:  train_features.csv.zip
  inflating: train_features.csv      
Archive:  train_labels.csv.zip
  inflating: train_labels.csv        


In [5]:
!ls

Logistic-regression.ipynb
README.md
logistic_regression_categorical_encoding.ipynb
sample_submission.csv
test_features.csv
test_features.csv.zip
train_features.csv
train_features.csv.zip
train_labels.csv
train_labels.csv.zip


In [49]:
# csv files were saved by kaggle with no read or write permissions?
train_features = pd.read_csv('train_features.csv')
train_labels = pd.read_csv('train_labels.csv')
test_features = pd.read_csv('test_features.csv')
train_features.shape, train_labels.shape, test_features.shape

((59400, 40), (59400, 2), (14358, 40))

In [56]:
# train_features.describe(include='all')
# Rows with longitude 0 look like missing GPS readings; check their latitude.
train_features.loc[train_features['longitude'] == 0, 'latitude']

21      -0.00
53      -0.00
168     -0.00
177     -0.00
253     -0.00
256     -0.00
285     -0.00
301     -0.00
306     -0.00
321     -0.00
323     -0.00
326     -0.00
346     -0.00
370     -0.00
433     -0.00
659     -0.00
678     -0.00
697     -0.00
720     -0.00
733     -0.00
753     -0.00
755     -0.00
798     -0.00
839     -0.00
911     -0.00
939     -0.00
960     -0.00
965     -0.00
971     -0.00
992     -0.00
1054    -0.00
1079    -0.00
1122    -0.00
1168    -0.00
1191    -0.00
1208    -0.00
1217    -0.00
1240    -0.00
1250    -0.00
1252    -0.00
1303    -0.00
1333    -0.00
1334    -0.00
1424    -0.00
1449    -0.00
1454    -0.00
1463    -0.00
1502    -0.00
1525    -0.00
1549    -0.00
1561    -0.00
1571    -0.00
1600    -0.00
1606    -0.00
1611    -0.00
1669    -0.00
1699    -0.00
1738    -0.00
1810    -0.00
1830    -0.00
1847    -0.00
1873    -0.00
1918    -0.00
1930    -0.00
1936    -0.00
1937    -0.00
1966    -0.00
1969    -0.00
1989    -0.00
2001    -0.00
2033    -0.00
2058  

In [38]:
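def return_mean_if_zero(value, nonzero_mean):
    """Return nonzero_mean for a placeholder zero, otherwise the value itself."""
    if value == 0:
        return nonzero_mean
    return value

# Longitude 0 stands in for a missing reading, so impute it with the mean
# of the real (nonzero) values. Computing the mean once, outside apply(),
# keeps the zeros from dragging it down and avoids recomputing it per row.
longitude = train_features['longitude']
train_features['longitude'] = longitude.apply(
    return_mean_if_zero, args=(longitude[longitude != 0].mean(),))
train_features['latitude']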

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,basin,subvillage,region,region_code,district_code,lga,ward,population,public_meeting,recorded_by,scheme_management,scheme_name,permit,construction_year,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group
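
A vectorized alternative to the row-by-row imputation above is sketched below on a made-up four-row frame. Treating the near-zero latitude as missing whenever longitude is 0 is an assumption based on the rows inspected earlier, and `toy` and `missing` are illustrative names.

```python
import numpy as np
import pandas as pd

# Made-up coordinates; the second row mimics a "zeros replace missing values" record.
toy = pd.DataFrame({
    'longitude': [39.2, 0.0, 34.8, 37.6],
    'latitude': [-7.4, -2e-08, -8.9, -3.7],
})

# Mark placeholder coordinates as missing (assumption: longitude 0 means no GPS fix).
missing = toy['longitude'] == 0
toy.loc[missing, ['longitude', 'latitude']] = np.nan

# mean() skips NaN, so the imputed value is not dragged toward zero.
toy = toy.fillna(toy.mean())
```

With a train/validation split, the means would normally be computed on the training rows only and then reused for the validation and test rows.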


In [11]:
train_labels.head()

Unnamed: 0,id,status_group
0,69572,functional
1,8776,functional
2,34310,functional
3,67743,non functional
4,19728,functional


In [14]:
from sklearn.model_selection import train_test_split

X_train = train_features
y_train = train_labels['status_group']

# Hold out 20% for validation; stratify so both splits keep the same class balance.
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, train_size=0.80, test_size=0.20,
    stratify=y_train)
X_train.shape, X_val.shape, y_train.shape, y_val.shape

((47520, 40), (11880, 40), (47520,), (11880,))

In [22]:
X_train.describe(include='all')

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,...,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group
count,47520.0,47520.0,47520,44643,47520.0,44633,47520.0,47520.0,47520,47520.0,...,47520,47520,47520,47520,47520,47520,47520,47520,47520,47520
unique,,,344,1691,,1922,,,30696,,...,7,8,6,5,5,10,7,3,7,6
top,,,2011-03-15,Government Of Tanzania,,DWE,,,none,,...,never pay,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe
freq,,,464,7282,,13899,,,2886,,...,20373,40714,40714,26590,26590,13646,13646,36685,22788,27627
mean,37069.306965,319.317577,,,669.851999,,34.08974,-5.702976,,0.521633,...,,,,,,,,,,
std,21461.281958,3169.512039,,,693.274487,,6.518793,2.942036,,13.525332,...,,,,,,,,,,
min,2.0,0.0,,,-90.0,,0.0,-11.64944,,0.0,...,,,,,,,,,,
25%,18492.5,0.0,,,0.0,,33.078674,-8.523152,,0.0,...,,,,,,,,,,
50%,37013.5,0.0,,,371.0,,34.906774,-5.018231,,0.0,...,,,,,,,,,,
75%,55625.25,20.0,,,1321.0,,37.179061,-3.32587,,0.0,...,,,,,,,,,,


In [18]:
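# First baseline: numeric columns only, so no encoding is needed yet.
# Note that this also keeps id, region_code and district_code, which are
# identifiers or codes rather than measurements.
X_train_numeric = X_train.select_dtypes('number')
X_val_numeric = X_val.select_dtypes('number')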

In [20]:
X_train_numeric.head()

Unnamed: 0,id,amount_tsh,gps_height,longitude,latitude,num_private,region_code,district_code,population,construction_year
27724,1790,0.0,13,39.218499,-7.441986,0,6,4,240,0
17889,38695,300.0,1320,37.664901,-3.696613,0,3,2,24,2007
56087,25682,0.0,1796,34.829436,-8.989202,0,11,4,250,1976
256,33500,0.0,0,0.0,-2e-08,0,19,6,0,0
6917,1169,200.0,324,37.869601,-6.889944,0,5,2,350,1996


In [26]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(solver='lbfgs', multi_class='auto', max_iter=10000)
model.fit(X_train_numeric, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=10000, multi_class='auto',
          n_jobs=None, penalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False)

In [29]:
from sklearn.metrics import accuracy_score
y_pred = model.predict(X_val_numeric)
accuracy_score(y_val, y_pred)

0.5535353535353535
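
To put a validation accuracy like the one above in context, it can help to compare it with the accuracy of always predicting the most common class. The sketch below uses only the five labels shown by `train_labels.head()`, so the resulting number says nothing about the real class balance; on the actual data the same two lines would be run on `y_train`.

```python
import pandas as pd

# The five labels shown by train_labels.head(); not representative of the full data.
y = pd.Series(['functional', 'functional', 'functional',
               'non functional', 'functional'])

# Majority-class baseline: predict the most frequent label for every row.
majority_class = y.mode()[0]
baseline_accuracy = (y == majority_class).mean()
```

A model is only adding value if its validation accuracy beats this baseline.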

In [30]:
def fit_predict_score(X, y, X_val, y_val):
    """Fit a logistic regression on (X, y) and return its accuracy on (X_val, y_val)."""
    model = LogisticRegression(solver='lbfgs', multi_class='auto', max_iter=10000)
    model.fit(X, y)
    y_pred = model.predict(X_val)
    return accuracy_score(y_val, y_pred)
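
For the "get and plot your coefficients" step, one possible approach is to put `model.coef_` into a labeled DataFrame. The sketch below runs on synthetic data: the feature values, class offsets and the names `X_toy` and `coefficients` are assumptions, and the labels are toy stand-ins for `status_group`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data; only 'gps_height' carries signal here.
rng = np.random.RandomState(0)
labels = np.repeat(
    ['functional', 'functional needs repair', 'non functional'], 30)
offsets = np.repeat([0.0, 2.0, 4.0], 30)
X_toy = pd.DataFrame({
    'gps_height': rng.normal(0, 1, 90) + offsets,
    'population': rng.normal(0, 1, 90),
})

# Scale first so coefficient magnitudes are comparable across features.
X_scaled = StandardScaler().fit_transform(X_toy)
model = LogisticRegression(max_iter=1000).fit(X_scaled, labels)

# One row per class, one column per feature.
coefficients = pd.DataFrame(
    model.coef_, index=model.classes_, columns=X_toy.columns)
```

If matplotlib is installed, `coefficients.T.plot.barh()` draws one bar group per feature.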