<a href="https://colab.research.google.com/github/davidanagy/DS-Unit-2-Regression-Classification/blob/master/module4/assignment_regression_classification_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science, Unit 2: Predictive Modeling

# Regression & Classification, Module 4


## Assignment

- [ ] Watch Aaron's [video #1](https://www.youtube.com/watch?v=pREaWFli-5I) (12 minutes) & [video #2](https://www.youtube.com/watch?v=bDQgVt4hFgY) (9 minutes) to learn about the mathematics of Logistic Regression.
- [ ] [Sign up for a Kaggle account](https://www.kaggle.com/), if you don’t already have one. Go to our Kaggle InClass competition website. You will be given the URL in Slack. Go to the Rules page. Accept the rules of the competition.
- [ ] Do train/validate/test split with the Tanzania Waterpumps data.
- [ ] Begin with baselines for classification.
- [ ] Use scikit-learn for logistic regression.
- [ ] Get your validation accuracy score.
- [ ] Submit your predictions to our Kaggle competition. (Go to our Kaggle InClass competition webpage. Use the blue **Submit Predictions** button to upload your CSV file. Or you can use the Kaggle API to submit your predictions.)
- [ ] Commit your notebook to your fork of the GitHub repo.

---


## Stretch Goals

- [ ] Add your own stretch goal(s) !
- [ ] Clean the data. For ideas, refer to [The Quartz guide to bad data](https://github.com/Quartz/bad-data-guide),  a "reference to problems seen in real-world data along with suggestions on how to resolve them." One of the issues is ["Zeros replace missing values."](https://github.com/Quartz/bad-data-guide#zeros-replace-missing-values)
- [ ] Make exploratory visualizations.
- [ ] Do one-hot encoding. For example, you could try `quantity`, `basin`, `extraction_type_class`, and more. (But remember it may not work with high cardinality categoricals.)
- [ ] Do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html).
- [ ] Get and plot your coefficients.
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).

---

## Data Dictionary 

### Features

Your goal is to predict the operating condition of a waterpoint for each record in the dataset. You are provided the following set of information about the waterpoints:

- `amount_tsh` : Total static head (amount water available to waterpoint)
- `date_recorded` : The date the row was entered
- `funder` : Who funded the well
- `gps_height` : Altitude of the well
- `installer` : Organization that installed the well
- `longitude` : GPS coordinate
- `latitude` : GPS coordinate
- `wpt_name` : Name of the waterpoint if there is one
- `num_private` :  
- `basin` : Geographic water basin
- `subvillage` : Geographic location
- `region` : Geographic location
- `region_code` : Geographic location (coded)
- `district_code` : Geographic location (coded)
- `lga` : Geographic location
- `ward` : Geographic location
- `population` : Population around the well
- `public_meeting` : True/False
- `recorded_by` : Group entering this row of data
- `scheme_management` : Who operates the waterpoint
- `scheme_name` : Who operates the waterpoint
- `permit` : If the waterpoint is permitted
- `construction_year` : Year the waterpoint was constructed
- `extraction_type` : The kind of extraction the waterpoint uses
- `extraction_type_group` : The kind of extraction the waterpoint uses
- `extraction_type_class` : The kind of extraction the waterpoint uses
- `management` : How the waterpoint is managed
- `management_group` : How the waterpoint is managed
- `payment` : What the water costs
- `payment_type` : What the water costs
- `water_quality` : The quality of the water
- `quality_group` : The quality of the water
- `quantity` : The quantity of water
- `quantity_group` : The quantity of water
- `source` : The source of the water
- `source_type` : The source of the water
- `source_class` : The source of the water
- `waterpoint_type` : The kind of waterpoint
- `waterpoint_type_group` : The kind of waterpoint

### Labels

There are three possible values:

- `functional` : the waterpoint is operational and there are no repairs needed
- `functional needs repair` : the waterpoint is operational, but needs repairs
- `non functional` : the waterpoint is not operational

--- 

## Generate a submission

Your code to generate a submission file may look like this:

```python
# estimator is your model or pipeline, which you've fit on X_train

# X_test is your pandas dataframe or numpy array, 
# with the same number of rows, in the same order, as test_features.csv, 
# and the same number of columns, in the same order, as X_train

y_pred = estimator.predict(X_test)


# Makes a dataframe with two columns, id and status_group, 
# and writes to a csv file, without the index

sample_submission = pd.read_csv('sample_submission.csv')
submission = sample_submission.copy()
submission['status_group'] = y_pred
submission.to_csv('your-submission-filename.csv', index=False)
```

If you're working locally, the csv file is saved in the same directory as your notebook.

If you're using Google Colab, you can use this code to download your submission csv file.

```python
from google.colab import files
files.download('your-submission-filename.csv')
```

---

In [172]:
import os, sys
in_colab = 'google.colab' in sys.modules

# If you're in Colab...
if in_colab:
    # Pull files from Github repo
    os.chdir('/content')
    !git init .
    !git remote add origin https://github.com/LambdaSchool/DS-Unit-2-Regression-Classification.git
    !git pull origin master
    
    # Install required python packages
    !pip install -r requirements.txt
    
    # Change into directory for module
    os.chdir('module4')

Reinitialized existing Git repository in /content/.git/
fatal: remote origin already exists.
From https://github.com/LambdaSchool/DS-Unit-2-Regression-Classification
 * branch            master     -> FETCH_HEAD
Already up to date.


In [0]:
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [0]:
# Read the Tanzania Waterpumps data
# train_features.csv : the training set features
# train_labels.csv : the training set labels
# test_features.csv : the test set features
# sample_submission.csv : a sample submission file in the correct format
    
import pandas as pd

train_features = pd.read_csv('../data/waterpumps/train_features.csv')
train_labels = pd.read_csv('../data/waterpumps/train_labels.csv')
test_features = pd.read_csv('../data/waterpumps/test_features.csv')
sample_submission = pd.read_csv('../data/waterpumps/sample_submission.csv')

assert train_features.shape == (59400, 40)
assert train_labels.shape == (59400, 2)
assert test_features.shape == (14358, 40)
assert sample_submission.shape == (14358, 2)

In [0]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [176]:
df = train_features
df.head()

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,basin,subvillage,region,region_code,district_code,lga,ward,population,public_meeting,recorded_by,scheme_management,scheme_name,permit,construction_year,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group
0,69572,6000.0,2011-03-14,Roman,1390,Roman,34.938093,-9.856322,none,0,Lake Nyasa,Mnyusi B,Iringa,11,5,Ludewa,Mundindi,109,True,GeoData Consultants Ltd,VWC,Roman,False,1999,gravity,gravity,gravity,vwc,user-group,pay annually,annually,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe
1,8776,0.0,2013-03-06,Grumeti,1399,GRUMETI,34.698766,-2.147466,Zahanati,0,Lake Victoria,Nyamara,Mara,20,2,Serengeti,Natta,280,,GeoData Consultants Ltd,Other,,True,2010,gravity,gravity,gravity,wug,user-group,never pay,never pay,soft,good,insufficient,insufficient,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe
2,34310,25.0,2013-02-25,Lottery Club,686,World vision,37.460664,-3.821329,Kwa Mahundi,0,Pangani,Majengo,Manyara,21,4,Simanjiro,Ngorika,250,True,GeoData Consultants Ltd,VWC,Nyumba ya mungu pipe scheme,True,2009,gravity,gravity,gravity,vwc,user-group,pay per bucket,per bucket,soft,good,enough,enough,dam,dam,surface,communal standpipe multiple,communal standpipe
3,67743,0.0,2013-01-28,Unicef,263,UNICEF,38.486161,-11.155298,Zahanati Ya Nanyumbu,0,Ruvuma / Southern Coast,Mahakamani,Mtwara,90,63,Nanyumbu,Nanyumbu,58,True,GeoData Consultants Ltd,VWC,,True,1986,submersible,submersible,submersible,vwc,user-group,never pay,never pay,soft,good,dry,dry,machine dbh,borehole,groundwater,communal standpipe multiple,communal standpipe
4,19728,0.0,2011-07-13,Action In A,0,Artisan,31.130847,-1.825359,Shuleni,0,Lake Victoria,Kyanyamisa,Kagera,18,1,Karagwe,Nyakasimbi,0,True,GeoData Consultants Ltd,,,True,0,gravity,gravity,gravity,other,other,never pay,never pay,soft,good,seasonal,seasonal,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe


In [177]:
# Dropping this temporarily because I want to replace the zeroes in all the *other* columns with NaN.

df = df.drop('num_private', axis=1)
df.head()

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,basin,subvillage,region,region_code,district_code,lga,ward,population,public_meeting,recorded_by,scheme_management,scheme_name,permit,construction_year,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group
0,69572,6000.0,2011-03-14,Roman,1390,Roman,34.938093,-9.856322,none,Lake Nyasa,Mnyusi B,Iringa,11,5,Ludewa,Mundindi,109,True,GeoData Consultants Ltd,VWC,Roman,False,1999,gravity,gravity,gravity,vwc,user-group,pay annually,annually,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe
1,8776,0.0,2013-03-06,Grumeti,1399,GRUMETI,34.698766,-2.147466,Zahanati,Lake Victoria,Nyamara,Mara,20,2,Serengeti,Natta,280,,GeoData Consultants Ltd,Other,,True,2010,gravity,gravity,gravity,wug,user-group,never pay,never pay,soft,good,insufficient,insufficient,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe
2,34310,25.0,2013-02-25,Lottery Club,686,World vision,37.460664,-3.821329,Kwa Mahundi,Pangani,Majengo,Manyara,21,4,Simanjiro,Ngorika,250,True,GeoData Consultants Ltd,VWC,Nyumba ya mungu pipe scheme,True,2009,gravity,gravity,gravity,vwc,user-group,pay per bucket,per bucket,soft,good,enough,enough,dam,dam,surface,communal standpipe multiple,communal standpipe
3,67743,0.0,2013-01-28,Unicef,263,UNICEF,38.486161,-11.155298,Zahanati Ya Nanyumbu,Ruvuma / Southern Coast,Mahakamani,Mtwara,90,63,Nanyumbu,Nanyumbu,58,True,GeoData Consultants Ltd,VWC,,True,1986,submersible,submersible,submersible,vwc,user-group,never pay,never pay,soft,good,dry,dry,machine dbh,borehole,groundwater,communal standpipe multiple,communal standpipe
4,19728,0.0,2011-07-13,Action In A,0,Artisan,31.130847,-1.825359,Shuleni,Lake Victoria,Kyanyamisa,Kagera,18,1,Karagwe,Nyakasimbi,0,True,GeoData Consultants Ltd,,,True,0,gravity,gravity,gravity,other,other,never pay,never pay,soft,good,seasonal,seasonal,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe


In [178]:
df.dtypes

id                         int64
amount_tsh               float64
date_recorded             object
funder                    object
gps_height                 int64
installer                 object
longitude                float64
latitude                 float64
wpt_name                  object
basin                     object
subvillage                object
region                    object
region_code                int64
district_code              int64
lga                       object
ward                      object
population                 int64
public_meeting            object
recorded_by               object
scheme_management         object
scheme_name               object
permit                    object
construction_year          int64
extraction_type           object
extraction_type_group     object
extraction_type_class     object
management                object
management_group          object
payment                   object
payment_type              object
water_qual

In [179]:
df['region_code'] = df['region_code'].astype(str)
df['district_code'] = df['district_code'].astype(str)
df['id'] = df['id'].astype(str)
df.dtypes

id                        object
amount_tsh               float64
date_recorded             object
funder                    object
gps_height                 int64
installer                 object
longitude                float64
latitude                 float64
wpt_name                  object
basin                     object
subvillage                object
region                    object
region_code               object
district_code             object
lga                       object
ward                      object
population                 int64
public_meeting            object
recorded_by               object
scheme_management         object
scheme_name               object
permit                    object
construction_year          int64
extraction_type           object
extraction_type_group     object
extraction_type_class     object
management                object
management_group          object
payment                   object
payment_type              object
water_qual

In [180]:
numerics = list(df.select_dtypes(exclude=['object']).columns)

for name in list(df.columns):
  if name not in numerics:
    continue
  else:
    df[name] = df[name].where(df[name] != 0)
    
df.head()

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,basin,subvillage,region,region_code,district_code,lga,ward,population,public_meeting,recorded_by,scheme_management,scheme_name,permit,construction_year,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group
0,69572,6000.0,2011-03-14,Roman,1390.0,Roman,34.938093,-9.856322,none,Lake Nyasa,Mnyusi B,Iringa,11,5,Ludewa,Mundindi,109.0,True,GeoData Consultants Ltd,VWC,Roman,False,1999.0,gravity,gravity,gravity,vwc,user-group,pay annually,annually,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe
1,8776,,2013-03-06,Grumeti,1399.0,GRUMETI,34.698766,-2.147466,Zahanati,Lake Victoria,Nyamara,Mara,20,2,Serengeti,Natta,280.0,,GeoData Consultants Ltd,Other,,True,2010.0,gravity,gravity,gravity,wug,user-group,never pay,never pay,soft,good,insufficient,insufficient,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe
2,34310,25.0,2013-02-25,Lottery Club,686.0,World vision,37.460664,-3.821329,Kwa Mahundi,Pangani,Majengo,Manyara,21,4,Simanjiro,Ngorika,250.0,True,GeoData Consultants Ltd,VWC,Nyumba ya mungu pipe scheme,True,2009.0,gravity,gravity,gravity,vwc,user-group,pay per bucket,per bucket,soft,good,enough,enough,dam,dam,surface,communal standpipe multiple,communal standpipe
3,67743,,2013-01-28,Unicef,263.0,UNICEF,38.486161,-11.155298,Zahanati Ya Nanyumbu,Ruvuma / Southern Coast,Mahakamani,Mtwara,90,63,Nanyumbu,Nanyumbu,58.0,True,GeoData Consultants Ltd,VWC,,True,1986.0,submersible,submersible,submersible,vwc,user-group,never pay,never pay,soft,good,dry,dry,machine dbh,borehole,groundwater,communal standpipe multiple,communal standpipe
4,19728,,2011-07-13,Action In A,,Artisan,31.130847,-1.825359,Shuleni,Lake Victoria,Kyanyamisa,Kagera,18,1,Karagwe,Nyakasimbi,,True,GeoData Consultants Ltd,,,True,,gravity,gravity,gravity,other,other,never pay,never pay,soft,good,seasonal,seasonal,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe


In [0]:
df['num_private'] = train_features['num_private']

In [182]:
df.isnull().sum()

id                           0
amount_tsh               41639
date_recorded                0
funder                    3635
gps_height               20438
installer                 3655
longitude                 1812
latitude                     0
wpt_name                     0
basin                        0
subvillage                 371
region                       0
region_code                  0
district_code                0
lga                          0
ward                         0
population               21381
public_meeting            3334
recorded_by                  0
scheme_management         3877
scheme_name              28166
permit                    3056
construction_year        20709
extraction_type              0
extraction_type_group        0
extraction_type_class        0
management                   0
management_group             0
payment                      0
payment_type                 0
water_quality                0
quality_group                0
quantity

In [183]:
df.describe()

Unnamed: 0,amount_tsh,gps_height,longitude,latitude,population,construction_year,num_private
count,17761.0,38962.0,57588.0,59400.0,38019.0,38691.0,59400.0
mean,1062.351942,1018.860839,35.149669,-5.706033,281.087167,1996.814686,0.474141
std,5409.34494,612.566092,2.607428,2.946019,564.68766,12.472045,12.23623
min,0.2,-90.0,29.607122,-11.64944,1.0,1960.0,0.0
25%,50.0,393.0,33.2851,-8.540621,40.0,1987.0,0.0
50%,250.0,1167.0,35.005943,-5.021597,150.0,2000.0,0.0
75%,1000.0,1498.0,37.233712,-3.326156,324.0,2008.0,0.0
max,350000.0,2770.0,40.345193,-2e-08,30500.0,2013.0,1776.0


In [184]:
df.describe(exclude=np.number)

Unnamed: 0,id,date_recorded,funder,installer,wpt_name,basin,subvillage,region,region_code,district_code,lga,ward,public_meeting,recorded_by,scheme_management,scheme_name,permit,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group
count,59400,59400,55765,55745,59400,59400,59029,59400,59400,59400,59400,59400,56066,59400,55523,31234,56344,59400,59400,59400,59400,59400,59400,59400,59400,59400,59400,59400,59400,59400,59400,59400,59400
unique,59400,356,1897,2145,37400,9,19287,21,27,20,125,2092,2,1,12,2696,2,18,13,7,12,5,7,7,8,6,5,5,10,7,3,7,6
top,53712,2011-03-15,Government Of Tanzania,DWE,none,Lake Victoria,Madukani,Iringa,11,1,Njombe,Igosi,True,GeoData Consultants Ltd,VWC,K,True,gravity,gravity,gravity,vwc,user-group,never pay,never pay,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe
freq,1,572,9084,17402,3563,10248,508,5294,5300,12203,2503,307,51011,59400,36793,682,38852,26780,26780,26780,40507,52490,25348,25348,50818,50818,33186,33186,17021,17021,45794,28522,34625


In [185]:
df['date_recorded'] = pd.to_datetime(df['date_recorded'], infer_datetime_format=True)
df.head()

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,basin,subvillage,region,region_code,district_code,lga,ward,population,public_meeting,recorded_by,scheme_management,scheme_name,permit,construction_year,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group,num_private
0,69572,6000.0,2011-03-14,Roman,1390.0,Roman,34.938093,-9.856322,none,Lake Nyasa,Mnyusi B,Iringa,11,5,Ludewa,Mundindi,109.0,True,GeoData Consultants Ltd,VWC,Roman,False,1999.0,gravity,gravity,gravity,vwc,user-group,pay annually,annually,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe,0
1,8776,,2013-03-06,Grumeti,1399.0,GRUMETI,34.698766,-2.147466,Zahanati,Lake Victoria,Nyamara,Mara,20,2,Serengeti,Natta,280.0,,GeoData Consultants Ltd,Other,,True,2010.0,gravity,gravity,gravity,wug,user-group,never pay,never pay,soft,good,insufficient,insufficient,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe,0
2,34310,25.0,2013-02-25,Lottery Club,686.0,World vision,37.460664,-3.821329,Kwa Mahundi,Pangani,Majengo,Manyara,21,4,Simanjiro,Ngorika,250.0,True,GeoData Consultants Ltd,VWC,Nyumba ya mungu pipe scheme,True,2009.0,gravity,gravity,gravity,vwc,user-group,pay per bucket,per bucket,soft,good,enough,enough,dam,dam,surface,communal standpipe multiple,communal standpipe,0
3,67743,,2013-01-28,Unicef,263.0,UNICEF,38.486161,-11.155298,Zahanati Ya Nanyumbu,Ruvuma / Southern Coast,Mahakamani,Mtwara,90,63,Nanyumbu,Nanyumbu,58.0,True,GeoData Consultants Ltd,VWC,,True,1986.0,submersible,submersible,submersible,vwc,user-group,never pay,never pay,soft,good,dry,dry,machine dbh,borehole,groundwater,communal standpipe multiple,communal standpipe,0
4,19728,,2011-07-13,Action In A,,Artisan,31.130847,-1.825359,Shuleni,Lake Victoria,Kyanyamisa,Kagera,18,1,Karagwe,Nyakasimbi,,True,GeoData Consultants Ltd,,,True,,gravity,gravity,gravity,other,other,never pay,never pay,soft,good,seasonal,seasonal,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe,0


In [186]:
df.dtypes

id                               object
amount_tsh                      float64
date_recorded            datetime64[ns]
funder                           object
gps_height                      float64
installer                        object
longitude                       float64
latitude                        float64
wpt_name                         object
basin                            object
subvillage                       object
region                           object
region_code                      object
district_code                    object
lga                              object
ward                             object
population                      float64
public_meeting                   object
recorded_by                      object
scheme_management                object
scheme_name                      object
permit                           object
construction_year               float64
extraction_type                  object
extraction_type_group            object


In [187]:
df['date_recorded'].describe()

count                   59400
unique                    356
top       2011-03-15 00:00:00
freq                      572
first     2002-10-14 00:00:00
last      2013-12-03 00:00:00
Name: date_recorded, dtype: object

In [188]:
# I don't think the day matters much, but the year and month might.

df['year_month_recorded'] = df['date_recorded'].dt.year + df['date_recorded'].dt.month / 100
df.head()

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,basin,subvillage,region,region_code,district_code,lga,ward,population,public_meeting,recorded_by,scheme_management,scheme_name,permit,construction_year,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group,num_private,year_month_recorded
0,69572,6000.0,2011-03-14,Roman,1390.0,Roman,34.938093,-9.856322,none,Lake Nyasa,Mnyusi B,Iringa,11,5,Ludewa,Mundindi,109.0,True,GeoData Consultants Ltd,VWC,Roman,False,1999.0,gravity,gravity,gravity,vwc,user-group,pay annually,annually,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe,0,2011.03
1,8776,,2013-03-06,Grumeti,1399.0,GRUMETI,34.698766,-2.147466,Zahanati,Lake Victoria,Nyamara,Mara,20,2,Serengeti,Natta,280.0,,GeoData Consultants Ltd,Other,,True,2010.0,gravity,gravity,gravity,wug,user-group,never pay,never pay,soft,good,insufficient,insufficient,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe,0,2013.03
2,34310,25.0,2013-02-25,Lottery Club,686.0,World vision,37.460664,-3.821329,Kwa Mahundi,Pangani,Majengo,Manyara,21,4,Simanjiro,Ngorika,250.0,True,GeoData Consultants Ltd,VWC,Nyumba ya mungu pipe scheme,True,2009.0,gravity,gravity,gravity,vwc,user-group,pay per bucket,per bucket,soft,good,enough,enough,dam,dam,surface,communal standpipe multiple,communal standpipe,0,2013.02
3,67743,,2013-01-28,Unicef,263.0,UNICEF,38.486161,-11.155298,Zahanati Ya Nanyumbu,Ruvuma / Southern Coast,Mahakamani,Mtwara,90,63,Nanyumbu,Nanyumbu,58.0,True,GeoData Consultants Ltd,VWC,,True,1986.0,submersible,submersible,submersible,vwc,user-group,never pay,never pay,soft,good,dry,dry,machine dbh,borehole,groundwater,communal standpipe multiple,communal standpipe,0,2013.01
4,19728,,2011-07-13,Action In A,,Artisan,31.130847,-1.825359,Shuleni,Lake Victoria,Kyanyamisa,Kagera,18,1,Karagwe,Nyakasimbi,,True,GeoData Consultants Ltd,,,True,,gravity,gravity,gravity,other,other,never pay,never pay,soft,good,seasonal,seasonal,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe,0,2011.07


Let's go through this one column at a time.

amount_tsh: A majority of these values are NaNs (zeroes in the original dataset). I think it's best to just delete the column.

date_recorded: This probably won't be very predictive, but I can just convert to datetime and then year/month, like I did above.

funder: Lots of unique values. Let's do the "top 10" thing. Not very many NaNs so it's fine for them to be grouped under "Other."

gps_height: This one is difficult. There are a lot of NaNs but they aren't the majority. I think I'd like to keep the data and replace the NaNs with the mean.

installer: Same as "funder."

longitude: Due to the way longitude works, I don't think replacing with the mean or median makes much sense. There aren't that many NaNs relatively, so I'll just remove those rows.

latitude: Fine as-is

wpt_name: Way too high cardinality to be useful. I'll drop.

basin: Low cardinality! One-hot encode.

subvillage: Very high cardinality, and also some NaNs. I'll drop.

region: One-hot encode.

region_code: One-hot encode.

district_code: One-hot encode.

lga: A bit redundant with some of the other columns, but probably still worth to top-10.

ward: High cardinality, and frequency is very low. It's also probably not significantly more useful than the other locations columns here. I'll drop.

population: A lot of NaNs, but I'd like to keep this data. I'll replace with the median (std way too high for the mean).

public_meeting: Vast majority is True, so I'll replace NaNs with True.

recorded_by: Redudant column. Drop.

scheme_management: "VWC" is very common, so I don't think it'll be a huge problem to replace NaNs with that. Then one-hot encode.

scheme_name: I don't think this is very useful data, and there's a lot of NaNs. I'll dorp.

permit: True is the majority, so I'll replace NaNs with True.

construction_year: This one is tricky. There's a lot of NaNs, but I'd like to keep this data; it seems like it should be very relevant, since an older well is more likely to need repairs. I have misgivings, but I'll replace NaNs with the median (2000).

extraction_type: One-hot encode.

extraction_type_group: One-hot encode.

extraction_type_class: One-hot encode.

management: One-hot encode.

management_group: One-hot encode.

payment: One-hot encode.

payment_type: Looks the exact same as 'payment.' Drop.

water_quality: One-hot encode.

quality_group: One-hot encode.

quantity: One-hot encode.

quantity_group: One-hot encode.

source: One-hot encode.

source_type: One-hot encode.

source_class: One-hot encode.

waterpoint_type: One-hot encode.

waterpoint_type_group: One-hot encode.

num_private: I have no idea what this is...but I guess I'll keep??

In [0]:
import category_encoders as ce

def engineer_features(X):
  Y = X
  Y = Y.drop('num_private', axis=1)
  Y['region_code'] = Y['region_code'].astype(str)
  Y['district_code'] = Y['district_code'].astype(str)
  Y['id'] = Y['id'].astype(str)
  
  median_pop = np.median(Y['population'])
  median_year = np.median(Y['construction_year'])

  numerics = list(Y.select_dtypes(exclude=['object']).columns)
  for name in list(Y.columns):
    if name not in numerics:
      continue
    else:
      Y[name] = Y[name].where(Y[name] != 0)

  if Y.shape[0] == train_features.shape[0]:
    Y['num_private'] = train_features['num_private']
  else:
    Y['num_private'] = test_features['num_private']
  
  Y['date_recorded'] = pd.to_datetime(Y['date_recorded'], infer_datetime_format=True)
  Y['year_month_rec'] = Y['date_recorded'].dt.year + Y['date_recorded'].dt.month / 100

  top10_a = Y['funder'].value_counts()[:10].index
  Y.loc[~Y['funder'].isin(top10_a), 'funder'] = 'Other'

  Y['gps_height'] = Y['gps_height'].fillna(np.mean(Y['gps_height']))

  top10_b = Y['installer'].value_counts()[:10].index
  Y.loc[~Y['installer'].isin(top10_b), 'installer'] = 'Other'
  
  Y['population'] = Y['population'].fillna(median_pop)
  Y['public_meeting'] = Y['public_meeting'].fillna(True)
  Y['scheme_management'] = Y['scheme_management'].fillna('VWC')
  Y['permit'] = Y['permit'].fillna(True)
  Y['construction_year'] = Y['construction_year'].fillna(median_year)
  
  Y = Y.drop(['amount_tsh', 'date_recorded', 'wpt_name', 'subvillage', 'ward', 'recorded_by', 'scheme_name', 'payment_type'], axis=1)
  
  Y = Y.dropna()
  
  temp = Y.drop('id', axis=1)
  
  encoder = ce.OneHotEncoder(use_cat_names=True)
  Z = encoder.fit_transform(temp)
  
  Z['id'] = Y['id']

  return Z

In [190]:
df2 = engineer_features(train_features)
df2.head()

Unnamed: 0,funder_Other,funder_Unicef,funder_Rwssp,funder_Danida,funder_World Vision,funder_Hesawa,funder_Government Of Tanzania,funder_District Council,funder_Kkkt,funder_Tasaf,funder_World Bank,gps_height,installer_Other,installer_DWE,installer_DANIDA,installer_Central government,installer_Commu,installer_KKKT,installer_RWE,installer_Government,installer_Hesawa,installer_0,installer_TCRS,longitude,latitude,basin_Lake Nyasa,basin_Lake Victoria,basin_Pangani,basin_Ruvuma / Southern Coast,basin_Internal,basin_Lake Tanganyika,basin_Wami / Ruvu,basin_Rufiji,basin_Lake Rukwa,region_Iringa,region_Mara,region_Manyara,region_Mtwara,region_Kagera,region_Tanga,...,quantity_group_insufficient,quantity_group_dry,quantity_group_seasonal,quantity_group_unknown,source_spring,source_rainwater harvesting,source_dam,source_machine dbh,source_other,source_shallow well,source_river,source_hand dtw,source_lake,source_unknown,source_type_spring,source_type_rainwater harvesting,source_type_dam,source_type_borehole,source_type_other,source_type_shallow well,source_type_river/lake,source_class_groundwater,source_class_surface,source_class_unknown,waterpoint_type_communal standpipe,waterpoint_type_communal standpipe multiple,waterpoint_type_hand pump,waterpoint_type_other,waterpoint_type_improved spring,waterpoint_type_cattle trough,waterpoint_type_dam,waterpoint_type_group_communal standpipe,waterpoint_type_group_hand pump,waterpoint_type_group_other,waterpoint_type_group_improved spring,waterpoint_type_group_cattle trough,waterpoint_type_group_dam,num_private,year_month_rec,id
0,1,0,0,0,0,0,0,0,0,0,0,1390.0,1,0,0,0,0,0,0,0,0,0,0,34.938093,-9.856322,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,2011.03,69572
1,1,0,0,0,0,0,0,0,0,0,0,1399.0,1,0,0,0,0,0,0,0,0,0,0,34.698766,-2.147466,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,...,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,2013.03,8776
2,1,0,0,0,0,0,0,0,0,0,0,686.0,1,0,0,0,0,0,0,0,0,0,0,37.460664,-3.821329,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,2013.02,34310
3,0,1,0,0,0,0,0,0,0,0,0,263.0,1,0,0,0,0,0,0,0,0,0,0,38.486161,-11.155298,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,...,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,2013.01,67743
4,1,0,0,0,0,0,0,0,0,0,0,1018.860839,1,0,0,0,0,0,0,0,0,0,0,31.130847,-1.825359,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,...,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,2011.07,19728


In [191]:
train_labels.head()

Unnamed: 0,id,status_group
0,69572,functional
1,8776,functional
2,34310,functional
3,67743,non functional
4,19728,functional


In [192]:
df3 = train_labels
df3['id'] = df3['id'].astype(str)
df4 = pd.merge(df2, df3, on='id')
df4.head()

Unnamed: 0,funder_Other,funder_Unicef,funder_Rwssp,funder_Danida,funder_World Vision,funder_Hesawa,funder_Government Of Tanzania,funder_District Council,funder_Kkkt,funder_Tasaf,funder_World Bank,gps_height,installer_Other,installer_DWE,installer_DANIDA,installer_Central government,installer_Commu,installer_KKKT,installer_RWE,installer_Government,installer_Hesawa,installer_0,installer_TCRS,longitude,latitude,basin_Lake Nyasa,basin_Lake Victoria,basin_Pangani,basin_Ruvuma / Southern Coast,basin_Internal,basin_Lake Tanganyika,basin_Wami / Ruvu,basin_Rufiji,basin_Lake Rukwa,region_Iringa,region_Mara,region_Manyara,region_Mtwara,region_Kagera,region_Tanga,...,quantity_group_dry,quantity_group_seasonal,quantity_group_unknown,source_spring,source_rainwater harvesting,source_dam,source_machine dbh,source_other,source_shallow well,source_river,source_hand dtw,source_lake,source_unknown,source_type_spring,source_type_rainwater harvesting,source_type_dam,source_type_borehole,source_type_other,source_type_shallow well,source_type_river/lake,source_class_groundwater,source_class_surface,source_class_unknown,waterpoint_type_communal standpipe,waterpoint_type_communal standpipe multiple,waterpoint_type_hand pump,waterpoint_type_other,waterpoint_type_improved spring,waterpoint_type_cattle trough,waterpoint_type_dam,waterpoint_type_group_communal standpipe,waterpoint_type_group_hand pump,waterpoint_type_group_other,waterpoint_type_group_improved spring,waterpoint_type_group_cattle trough,waterpoint_type_group_dam,num_private,year_month_rec,id,status_group
0,1,0,0,0,0,0,0,0,0,0,0,1390.0,1,0,0,0,0,0,0,0,0,0,0,34.938093,-9.856322,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,2011.03,69572,functional
1,1,0,0,0,0,0,0,0,0,0,0,1399.0,1,0,0,0,0,0,0,0,0,0,0,34.698766,-2.147466,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,2013.03,8776,functional
2,1,0,0,0,0,0,0,0,0,0,0,686.0,1,0,0,0,0,0,0,0,0,0,0,37.460664,-3.821329,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,2013.02,34310,functional
3,0,1,0,0,0,0,0,0,0,0,0,263.0,1,0,0,0,0,0,0,0,0,0,0,38.486161,-11.155298,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,...,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,2013.01,67743,non functional
4,1,0,0,0,0,0,0,0,0,0,0,1018.860839,1,0,0,0,0,0,0,0,0,0,0,31.130847,-1.825359,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,...,0,1,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,2011.07,19728,functional


In [193]:
df4['status_group'].value_counts()

functional                 31389
non functional             22268
functional needs repair     3931
Name: status_group, dtype: int64

In [194]:
df4['status_group'] = df4['status_group'].astype('category')
df4['status_group'] = df4['status_group'].cat.codes
df4['status_group'].value_counts()

0    31389
2    22268
1     3931
Name: status_group, dtype: int64

In [195]:
df4['status_group'] = df4['status_group'].replace({0: 2, 2: 0})
df4['status_group'].value_counts()

2    31389
0    22268
1     3931
Name: status_group, dtype: int64

In [196]:
from sklearn.model_selection import train_test_split

train, val = train_test_split(df4, random_state=100)
train.head()

Unnamed: 0,funder_Other,funder_Unicef,funder_Rwssp,funder_Danida,funder_World Vision,funder_Hesawa,funder_Government Of Tanzania,funder_District Council,funder_Kkkt,funder_Tasaf,funder_World Bank,gps_height,installer_Other,installer_DWE,installer_DANIDA,installer_Central government,installer_Commu,installer_KKKT,installer_RWE,installer_Government,installer_Hesawa,installer_0,installer_TCRS,longitude,latitude,basin_Lake Nyasa,basin_Lake Victoria,basin_Pangani,basin_Ruvuma / Southern Coast,basin_Internal,basin_Lake Tanganyika,basin_Wami / Ruvu,basin_Rufiji,basin_Lake Rukwa,region_Iringa,region_Mara,region_Manyara,region_Mtwara,region_Kagera,region_Tanga,...,quantity_group_dry,quantity_group_seasonal,quantity_group_unknown,source_spring,source_rainwater harvesting,source_dam,source_machine dbh,source_other,source_shallow well,source_river,source_hand dtw,source_lake,source_unknown,source_type_spring,source_type_rainwater harvesting,source_type_dam,source_type_borehole,source_type_other,source_type_shallow well,source_type_river/lake,source_class_groundwater,source_class_surface,source_class_unknown,waterpoint_type_communal standpipe,waterpoint_type_communal standpipe multiple,waterpoint_type_hand pump,waterpoint_type_other,waterpoint_type_improved spring,waterpoint_type_cattle trough,waterpoint_type_dam,waterpoint_type_group_communal standpipe,waterpoint_type_group_hand pump,waterpoint_type_group_other,waterpoint_type_group_improved spring,waterpoint_type_group_cattle trough,waterpoint_type_group_dam,num_private,year_month_rec,id,status_group
22037,0,0,0,1,0,0,0,0,0,0,0,2085.0,0,0,1,0,0,0,0,0,0,0,0,34.57175,-9.348966,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,2011.03,52537,2
7041,1,0,0,0,0,0,0,0,0,0,0,1530.0,1,0,0,0,0,0,0,0,0,0,0,34.701872,-4.309835,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,2013.01,39241,2
42161,1,0,0,0,0,0,0,0,0,0,0,1673.0,1,0,0,0,0,0,0,0,0,0,0,35.258896,-8.102491,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,2011.02,18930,2
32446,1,0,0,0,0,0,0,0,0,0,0,1018.860839,1,0,0,0,0,0,0,0,0,0,0,33.550109,-3.582926,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,2012.1,12220,1
2228,0,0,0,0,0,0,1,0,0,0,0,1654.0,1,0,0,0,0,0,0,0,0,0,0,37.58631,-3.184001,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,2013.02,9760,0


In [0]:
X_train = train.drop('status_group', axis=1)
y_train = train['status_group']
X_val = val.drop('status_group', axis=1)
y_val = val['status_group']

In [0]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)

In [199]:
from sklearn.feature_selection import f_regression, SelectKBest

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_absolute_error

for k in range(1, 10):
  
  selector = SelectKBest(score_func=f_regression, k=k)
  X_train_selected = selector.fit_transform(X_train_scaled, y_train)
  X_val_selected = selector.transform(X_val_scaled)
  
  model = LogisticRegression(solver='lbfgs')
  model.fit(X_train_selected, y_train)
  score = model.score(X_val_selected, y_val)
  
  print(k, score)



1 0.6400639022018476




2 0.6400639022018476




3 0.6925053830659165




4 0.6925053830659165




5 0.6938945613669515




6 0.6938945613669515




7 0.6938945613669515




8 0.6938945613669515




9 0.6938945613669515


In [200]:
model = LogisticRegression(solver='lbfgs')
model.fit(X_train_scaled, y_train)
model.score(X_val_scaled, y_val)



0.7518927554351601

In [201]:
test_features.head()

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,basin,subvillage,region,region_code,district_code,lga,ward,population,public_meeting,recorded_by,scheme_management,scheme_name,permit,construction_year,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group
0,50785,0.0,2013-02-04,Dmdd,1996,DMDD,35.290799,-4.059696,Dinamu Secondary School,0,Internal,Magoma,Manyara,21,3,Mbulu,Bashay,321,True,GeoData Consultants Ltd,Parastatal,,True,2012,other,other,other,parastatal,parastatal,never pay,never pay,soft,good,seasonal,seasonal,rainwater harvesting,rainwater harvesting,surface,other,other
1,51630,0.0,2013-02-04,Government Of Tanzania,1569,DWE,36.656709,-3.309214,Kimnyak,0,Pangani,Kimnyak,Arusha,2,2,Arusha Rural,Kimnyaki,300,True,GeoData Consultants Ltd,VWC,TPRI pipe line,True,2000,gravity,gravity,gravity,vwc,user-group,never pay,never pay,soft,good,insufficient,insufficient,spring,spring,groundwater,communal standpipe,communal standpipe
2,17168,0.0,2013-02-01,,1567,,34.767863,-5.004344,Puma Secondary,0,Internal,Msatu,Singida,13,2,Singida Rural,Puma,500,True,GeoData Consultants Ltd,VWC,P,,2010,other,other,other,vwc,user-group,never pay,never pay,soft,good,insufficient,insufficient,rainwater harvesting,rainwater harvesting,surface,other,other
3,45559,0.0,2013-01-22,Finn Water,267,FINN WATER,38.058046,-9.418672,Kwa Mzee Pange,0,Ruvuma / Southern Coast,Kipindimbi,Lindi,80,43,Liwale,Mkutano,250,,GeoData Consultants Ltd,VWC,,True,1987,other,other,other,vwc,user-group,unknown,unknown,soft,good,dry,dry,shallow well,shallow well,groundwater,other,other
4,49871,500.0,2013-03-27,Bruder,1260,BRUDER,35.006123,-10.950412,Kwa Mzee Turuka,0,Ruvuma / Southern Coast,Losonga,Ruvuma,10,3,Mbinga,Mbinga Urban,60,,GeoData Consultants Ltd,Water Board,BRUDER,True,2000,gravity,gravity,gravity,water board,user-group,pay monthly,monthly,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe


In [202]:
sample_submission.head()

Unnamed: 0,id,status_group
0,50785,functional
1,51630,functional
2,17168,functional
3,45559,functional
4,49871,functional


In [203]:
sample_submission.shape

(14358, 2)

In [204]:
X_test = engineer_features(test_features)

X_test.shape, X_train.shape, X_val.shape

((13922, 360), (43191, 364), (14397, 364))

In [205]:
drop_names = []
for name in X_train.columns.tolist():
  if (name in X_test.columns.tolist()) == False:
    drop_names.append(name)
  else:
    continue

X_train = X_train.drop(drop_names, axis=1)

drop_names2 = []
for name in X_test.columns.tolist():
  if (name in X_train.columns.tolist()) == False:
    drop_names2.append(name)
  else:
    continue
    
X_test = X_test.drop(drop_names2, axis=1)
X_train.shape, X_test.shape

((43191, 358), (13922, 358))

In [206]:
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = LogisticRegression(solver='lbfgs')
model.fit(X_train_scaled, y_train)
y_test = model.predict(X_test_scaled)

submission = pd.DataFrame(y_test)
submission.head()



Unnamed: 0,0
0,2
1,2
2,2
3,0
4,2


In [207]:
submission['id'] = X_test['id']
submission.head()

Unnamed: 0,0,id
0,2,50785
1,2,51630
2,2,17168
3,0,45559
4,2,49871


In [208]:
submission.columns = ['status_group', 'id']
submission.head()

Unnamed: 0,status_group,id
0,2,50785
1,2,51630
2,2,17168
3,0,45559
4,2,49871


In [209]:
submission = submission[['id', 'status_group']]
submission.head()

Unnamed: 0,id,status_group
0,50785,2
1,51630,2
2,17168,2
3,45559,0
4,49871,2


In [210]:
temp = test_features[['id', 'latitude']]
temp['id'] = temp['id'].astype(str)
temp.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


Unnamed: 0,id,latitude
0,50785,-4.059696
1,51630,-3.309214
2,17168,-5.004344
3,45559,-9.418672
4,49871,-10.950412


In [211]:
submission = pd.merge(temp, submission, on='id', how='left')
print(submission.shape)
submission.head()

(14358, 3)


Unnamed: 0,id,latitude,status_group
0,50785,-4.059696,2.0
1,51630,-3.309214,2.0
2,17168,-5.004344,2.0
3,45559,-9.418672,0.0
4,49871,-10.950412,2.0


In [212]:
submission = submission.drop('latitude', axis=1)
submission.head()

Unnamed: 0,id,status_group
0,50785,2.0
1,51630,2.0
2,17168,2.0
3,45559,0.0
4,49871,2.0


In [213]:
submission.isnull().sum()

id                0
status_group    858
dtype: int64

In [214]:
submission = submission.fillna(2)
submission.isnull().sum()

id              0
status_group    0
dtype: int64

In [215]:
submission = submission.replace({'status_group': {2.0: 'functional', 0.0: 'non functional', 1.0: 'functional needs repair'}})
submission.head()

Unnamed: 0,id,status_group
0,50785,functional
1,51630,functional
2,17168,functional
3,45559,non functional
4,49871,functional


In [0]:
submission.to_csv('water-submission-01.csv', index = None, header=True)

In [217]:
selector = SelectKBest(score_func=f_regression, k=10)
X_train_selected = selector.fit_transform(X_train_scaled, y_train)
  
model = LogisticRegression(solver='lbfgs')
model.fit(X_train_selected, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [218]:
all_names = X_train.columns
selected_mask = selector.get_support()
selected_names = all_names[selected_mask]
selected_names

Index(['construction_year', 'extraction_type_other',
       'extraction_type_group_other', 'extraction_type_class_other',
       'quantity_enough', 'quantity_dry', 'quantity_group_enough',
       'quantity_group_dry', 'waterpoint_type_other',
       'waterpoint_type_group_other'],
      dtype='object')

In [219]:
drop_names3 = []
for name in X_test.columns.tolist():
  if (name in selected_names.tolist()) == False:
    drop_names3.append(name)
  else:
    continue
    
X_test_selected = X_test.drop(drop_names3, axis=1)
X_train_selected.shape, X_test_selected.shape

((43191, 10), (13922, 10))

In [220]:
X_train_scaled = scaler.fit_transform(X_train_selected)
X_test_scaled = scaler.transform(X_test_selected)

model = LogisticRegression(solver='lbfgs')
model.fit(X_train_scaled, y_train)
y_test = model.predict(X_test_scaled)

submission2 = pd.DataFrame(y_test)
submission2.head()



Unnamed: 0,0
0,2
1,2
2,2
3,2
4,2


In [221]:
submission2['id'] = X_test['id']
submission2.columns = ['status_group', 'id']
submission2 = submission2[['id', 'status_group']]
temp = test_features[['id', 'latitude']]
temp['id'] = temp['id'].astype(str)
submission2 = pd.merge(temp, submission2, on='id', how='left')
submission2 = submission2.drop('latitude', axis=1)
submission2 = submission2.fillna(2)
submission2 = submission2.replace({'status_group': {2.0: 'functional', 0.0: 'non functional', 1.0: 'functional needs repair'}})
submission2.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """


Unnamed: 0,id,status_group
0,50785,functional
1,51630,functional
2,17168,functional
3,45559,functional
4,49871,functional


In [222]:
sample_submission.head()

Unnamed: 0,id,status_group
0,50785,functional
1,51630,functional
2,17168,functional
3,45559,functional
4,49871,functional


In [0]:
submission2.to_csv('water-submission-02.csv', index = None, header=True)

In [224]:
submission2.shape, sample_submission.shape

((14358, 2), (14358, 2))

In [225]:
submission2.isnull().sum()

id              0
status_group    0
dtype: int64