<a href="https://colab.research.google.com/github/medinadiegoeverardo/DS-Unit-2-Regression-Classification/blob/master/module4/4_medinadiego_assignment_regression_classification_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science, Unit 2: Predictive Modeling

# Regression & Classification, Module 4


## Assignment

- [ ] Watch Aaron's [video #1](https://www.youtube.com/watch?v=pREaWFli-5I) (12 minutes) & [video #2](https://www.youtube.com/watch?v=bDQgVt4hFgY) (9 minutes) to learn about the mathematics of Logistic Regression.
- [ ] [Sign up for a Kaggle account](https://www.kaggle.com/), if you don’t already have one. Go to our Kaggle InClass competition website. You will be given the URL in Slack. Go to the Rules page. Accept the rules of the competition.
- [ ] Do train/validate/test split with the Tanzania Waterpumps data.
- [ ] Begin with baselines for classification.
- [ ] Use scikit-learn for logistic regression.
- [ ] Get your validation accuracy score.
- [ ] Submit your predictions to our Kaggle competition. (Go to our Kaggle InClass competition webpage. Use the blue **Submit Predictions** button to upload your CSV file. Or you can use the Kaggle API to submit your predictions.)
- [ ] Commit your notebook to your fork of the GitHub repo.

---


## Stretch Goals

- [ ] Add your own stretch goal(s) !
- [ ] Clean the data. For ideas, refer to [The Quartz guide to bad data](https://github.com/Quartz/bad-data-guide),  a "reference to problems seen in real-world data along with suggestions on how to resolve them." One of the issues is ["Zeros replace missing values."](https://github.com/Quartz/bad-data-guide#zeros-replace-missing-values)
- [ ] Make exploratory visualizations.
- [ ] Do one-hot encoding. For example, you could try `quantity`, `basin`, `extraction_type_class`, and more. (But remember it may not work with high cardinality categoricals.)
- [ ] Do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html).
- [ ] Get and plot your coefficients.
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).

---

## Data Dictionary 

### Features

Your goal is to predict the operating condition of a waterpoint for each record in the dataset. You are provided the following set of information about the waterpoints:

- `amount_tsh` : Total static head (amount water available to waterpoint)
- `date_recorded` : The date the row was entered
- `funder` : Who funded the well
- `gps_height` : Altitude of the well
- `installer` : Organization that installed the well
- `longitude` : GPS coordinate
- `latitude` : GPS coordinate
- `wpt_name` : Name of the waterpoint if there is one
- `num_private` :  
- `basin` : Geographic water basin
- `subvillage` : Geographic location
- `region` : Geographic location
- `region_code` : Geographic location (coded)
- `district_code` : Geographic location (coded)
- `lga` : Geographic location
- `ward` : Geographic location
- `population` : Population around the well
- `public_meeting` : True/False
- `recorded_by` : Group entering this row of data
- `scheme_management` : Who operates the waterpoint
- `scheme_name` : Who operates the waterpoint
- `permit` : If the waterpoint is permitted
- `construction_year` : Year the waterpoint was constructed
- `extraction_type` : The kind of extraction the waterpoint uses
- `extraction_type_group` : The kind of extraction the waterpoint uses
- `extraction_type_class` : The kind of extraction the waterpoint uses
- `management` : How the waterpoint is managed
- `management_group` : How the waterpoint is managed
- `payment` : What the water costs
- `payment_type` : What the water costs
- `water_quality` : The quality of the water
- `quality_group` : The quality of the water
- `quantity` : The quantity of water
- `quantity_group` : The quantity of water
- `source` : The source of the water
- `source_type` : The source of the water
- `source_class` : The source of the water
- `waterpoint_type` : The kind of waterpoint
- `waterpoint_type_group` : The kind of waterpoint

### Labels

There are three possible values:

- `functional` : the waterpoint is operational and there are no repairs needed
- `functional needs repair` : the waterpoint is operational, but needs repairs
- `non functional` : the waterpoint is not operational

--- 

## Generate a submission

Your code to generate a submission file may look like this:

```python
# estimator is your model or pipeline, which you've fit on X_train

# X_test is your pandas dataframe or numpy array, 
# with the same number of rows, in the same order, as test_features.csv, 
# and the same number of columns, in the same order, as X_train

y_pred = estimator.predict(X_test)


# Makes a dataframe with two columns, id and status_group, 
# and writes to a csv file, without the index

sample_submission = pd.read_csv('sample_submission.csv')
submission = sample_submission.copy()
submission['status_group'] = y_pred
submission.to_csv('your-submission-filename.csv', index=False)
```

If you're working locally, the csv file is saved in the same directory as your notebook.

If you're using Google Colab, you can use this code to download your submission csv file.

```python
from google.colab import files
files.download('your-submission-filename.csv')
```

---

In [0]:
import os, sys
in_colab = 'google.colab' in sys.modules

# If you're in Colab...
if in_colab:
    # Pull files from Github repo
    os.chdir('/content')
    !git init .
    !git remote add origin https://github.com/LambdaSchool/DS-Unit-2-Regression-Classification.git
    !git pull origin master
    
    # Install required python packages
    !pip install -r requirements.txt
    
    # Change into directory for module
    os.chdir('module4')

Reinitialized existing Git repository in /content/.git/
fatal: remote origin already exists.
From https://github.com/LambdaSchool/DS-Unit-2-Regression-Classification
 * branch            master     -> FETCH_HEAD
Already up to date.


In [0]:
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [0]:
# Read the Tanzania Waterpumps data
# train_features.csv : the training set features
# train_labels.csv : the training set labels
# test_features.csv : the test set features
# sample_submission.csv : a sample submission file in the correct format
    
import pandas as pd

train_features = pd.read_csv('../data/waterpumps/train_features.csv')
train_labels = pd.read_csv('../data/waterpumps/train_labels.csv')
test_features = pd.read_csv('../data/waterpumps/test_features.csv')
sample_submission = pd.read_csv('../data/waterpumps/sample_submission.csv')

assert train_features.shape == (59400, 40)
assert train_labels.shape == (59400, 2)
assert test_features.shape == (14358, 40)
assert sample_submission.shape == (14358, 2)

In [0]:
train_features.head()

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,basin,subvillage,region,region_code,district_code,lga,ward,population,public_meeting,recorded_by,scheme_management,scheme_name,permit,construction_year,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group
0,69572,6000.0,2011-03-14,Roman,1390,Roman,34.938093,-9.856322,none,0,Lake Nyasa,Mnyusi B,Iringa,11,5,Ludewa,Mundindi,109,True,GeoData Consultants Ltd,VWC,Roman,False,1999,gravity,gravity,gravity,vwc,user-group,pay annually,annually,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe
1,8776,0.0,2013-03-06,Grumeti,1399,GRUMETI,34.698766,-2.147466,Zahanati,0,Lake Victoria,Nyamara,Mara,20,2,Serengeti,Natta,280,,GeoData Consultants Ltd,Other,,True,2010,gravity,gravity,gravity,wug,user-group,never pay,never pay,soft,good,insufficient,insufficient,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe
2,34310,25.0,2013-02-25,Lottery Club,686,World vision,37.460664,-3.821329,Kwa Mahundi,0,Pangani,Majengo,Manyara,21,4,Simanjiro,Ngorika,250,True,GeoData Consultants Ltd,VWC,Nyumba ya mungu pipe scheme,True,2009,gravity,gravity,gravity,vwc,user-group,pay per bucket,per bucket,soft,good,enough,enough,dam,dam,surface,communal standpipe multiple,communal standpipe
3,67743,0.0,2013-01-28,Unicef,263,UNICEF,38.486161,-11.155298,Zahanati Ya Nanyumbu,0,Ruvuma / Southern Coast,Mahakamani,Mtwara,90,63,Nanyumbu,Nanyumbu,58,True,GeoData Consultants Ltd,VWC,,True,1986,submersible,submersible,submersible,vwc,user-group,never pay,never pay,soft,good,dry,dry,machine dbh,borehole,groundwater,communal standpipe multiple,communal standpipe
4,19728,0.0,2011-07-13,Action In A,0,Artisan,31.130847,-1.825359,Shuleni,0,Lake Victoria,Kyanyamisa,Kagera,18,1,Karagwe,Nyakasimbi,0,True,GeoData Consultants Ltd,,,True,0,gravity,gravity,gravity,other,other,never pay,never pay,soft,good,seasonal,seasonal,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe


In [0]:
train_labels.head() # undo later

Unnamed: 0,id,status_group
0,69572,0
1,8776,0
2,34310,0
3,67743,1
4,19728,0


In [0]:
mode_map = {'functional': 0, 'non functional': 1, 'functional needs repair': 2}
train_labels['status_group'] = train_labels['status_group'].replace(mode_map)

# train_labels.head()
# functional: 0
# non functional: 1
# functional needs repair: 2

In [0]:
from sklearn.model_selection import train_test_split

x_training, x_validation = train_test_split(train_features, random_state=10)

In [0]:
# train_labels.status_group.value_counts(normalize=True)

### Adding y_variable to training and validation

In [0]:
# splitting train_labels to add them to x_training and x_validation!

In [0]:
training_y_labels, validation_y_labels = train_test_split(train_labels, random_state=10)

In [0]:
x_training = x_training.merge(training_y_labels, on='id')
x_validation = x_validation.merge(validation_y_labels, on='id')

In [0]:
x_training.head()

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,basin,subvillage,region,region_code,district_code,lga,ward,population,public_meeting,recorded_by,scheme_management,scheme_name,permit,construction_year,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group,status_group
0,25077,1000.0,2011-03-23,Dhv,370,DWE,36.514517,-8.549106,Kwa Chapanga,0,Rufiji,Magereza,Morogoro,5,4,Ulanga,Iragua,290,True,GeoData Consultants Ltd,,,True,2003,swn 80,swn 80,handpump,vwc,user-group,pay monthly,monthly,soft,good,enough,enough,shallow well,shallow well,groundwater,hand pump,hand pump,0
1,73674,0.0,2011-08-06,,0,,31.788171,-1.365555,Kirombe 'A',0,Lake Victoria,Kigazi,Kagera,18,6,Bukoba Urban,Kitendaguru,0,True,GeoData Consultants Ltd,VWC,,False,0,gravity,gravity,gravity,vwc,user-group,never pay,never pay,soft,good,enough,enough,spring,spring,groundwater,improved spring,improved spring,0
2,10731,0.0,2011-07-31,Ridep,0,RIDEP,33.019772,-3.059272,Kwa Hene,0,Lake Victoria,Mission,Mwanza,19,7,Missungwi,Buhingo,0,,GeoData Consultants Ltd,VWC,,False,0,other - swn 81,other handpump,handpump,vwc,user-group,never pay,never pay,soft,good,enough,enough,machine dbh,borehole,groundwater,hand pump,hand pump,0
3,3668,500.0,2013-01-15,Norad,896,RWE,29.657789,-4.817105,Kwa Mzee Juma Chobaliko,0,Lake Tanganyika,Bigabilo A,Kigoma,16,3,Kigoma Rural,Kagongo,566,True,GeoData Consultants Ltd,VWC,Mkongoro One,True,1985,gravity,gravity,gravity,vwc,user-group,pay monthly,monthly,soft,good,insufficient,insufficient,river,river/lake,surface,communal standpipe multiple,communal standpipe,2
4,29190,50.0,2013-02-23,Amref,236,Amref,39.736233,-10.548224,Kwa Mselemu,0,Ruvuma / Southern Coast,Bara,Mtwara,99,1,Mtwara Rural,Njengwa,120,True,GeoData Consultants Ltd,VWC,Nalunga water supply,True,2011,mono,mono,motorpump,vwc,user-group,pay per bucket,per bucket,salty,salty,enough,enough,machine dbh,borehole,groundwater,communal standpipe multiple,communal standpipe,0


In [0]:
# both have the same shape as their corresponding sets

In [0]:
print(training_y_labels.shape)
print(validation_y_labels.shape)

(44550, 2)
(14850, 2)


In [0]:
print(x_training.shape)
print(x_validation.shape)

(44550, 41)
(14850, 41)


### Baselines

In [0]:
y_train = x_training['status_group']
y_train.mode()

0    0
dtype: int64

In [0]:
from sklearn.metrics import mean_absolute_error

# baseline
majority_class = y_train.mode()[0]
y_pred = [majority_class] * len(y_train) # both have to be in Series
mae = mean_absolute_error(y_train, y_pred) # for it to work
print('MAE. Not accuracy metric: ' + str(mae))

MAE. Not accuracy metric: 0.5322334455667789


In [0]:
x_validation.head()

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,basin,subvillage,region,region_code,district_code,lga,ward,population,public_meeting,recorded_by,scheme_management,scheme_name,permit,construction_year,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group,status_group
0,9678,0.0,2013-02-01,Dwsp,0,DWE,0.0,-2e-08,Shule Ya Msingi Itubukilo,0,Lake Victoria,Itubukilo A,Shinyanga,17,1,Bariadi,Mbita,0,,GeoData Consultants Ltd,WUG,,False,0,nira/tanira,nira/tanira,handpump,wug,user-group,unknown,unknown,soft,good,enough,enough,shallow well,shallow well,groundwater,hand pump,hand pump,0
1,43144,1000.0,2013-03-07,Stantons,950,World Vision,35.733914,-3.810443,Kisese B,0,Internal,Kisese B,Manyara,21,1,Babati,Magara,245,True,GeoData Consultants Ltd,VWC,,True,2012,nira/tanira,nira/tanira,handpump,vwc,user-group,pay when scheme fails,on failure,soft,good,enough,enough,hand dtw,borehole,groundwater,hand pump,hand pump,0
2,70940,0.0,2013-01-22,Dwsp,0,DWE,0.0,-2e-08,Ngwande,0,Lake Victoria,Madukani,Shinyanga,17,1,Bariadi,Kinang'weli,0,,GeoData Consultants Ltd,WUG,,False,0,other,other,other,wug,user-group,unknown,unknown,soft,good,enough,enough,shallow well,shallow well,groundwater,other,other,1
3,66151,20.0,2011-02-01,Po,441,Po,37.121775,-6.693746,Kwa Vilore,0,Wami / Ruvu,Batini B,Morogoro,5,1,Kilosa,Chanzuru,105,True,GeoData Consultants Ltd,VWC,,True,1976,swn 80,swn 80,handpump,vwc,user-group,pay per bucket,per bucket,soft,good,insufficient,insufficient,shallow well,shallow well,groundwater,hand pump,hand pump,1
4,9248,0.0,2011-07-05,Jica,0,DWE,31.156491,-1.379942,Kwa Gabriel,0,Lake Victoria,Kishegeshe A,Kagera,18,1,Karagwe,Kihanga,0,True,GeoData Consultants Ltd,VWC,Katanda Water Sup,True,0,gravity,gravity,gravity,other,other,pay annually,annually,soft,good,enough,enough,river,river/lake,surface,communal standpipe,communal standpipe,1


In [0]:
from sklearn.metrics import accuracy_score
acurracy_s = accuracy_score(y_train, y_pred)
print('Training accuracy score: ', str(acurracy_s))

# how much does it differ from validation dataset?
y_val = x_validation['status_group']
majority_class_2 = y_val.mode()
y_predict = [majority_class_2] * len(y_val)
ac_v = accuracy_score(y_val, y_predict)
print('Validation accuracy score: ', str(ac_v))

Training accuracy score:  0.5410998877665545
Validation accuracy score:  0.549023569023569


### Linear Reg

In [0]:
features = ['id', 'construction_year', 'longitude',	'latitude']
x_training[features].isnull().sum()
# 0 null values, no need to use imputer for lin reg?

id                   0
construction_year    0
longitude            0
latitude             0
dtype: int64

In [0]:
from sklearn.linear_model import LinearRegression

model_linear = LinearRegression()
# not using imputer, encoder here

# reminder!
# y_train = x_training['status_group']
# y_val = x_validation['status_group']
features = ['id', 'construction_year', 'longitude',	'latitude']
x_train = x_training[features] # these for now
x_val = x_validation[features]

# In Lecture, training feature was used (population)
# y_train = y_training['status_group']
# y_validation = y_validation['status_group']

model_linear.fit(x_train, y_train) # not train_labels
model_linear.predict(x_val)


array([0.75202098, 0.50055821, 0.75596936, ..., 0.5042989 , 0.48424282,
       0.53355029])

In [0]:
model_linear.coef_

array([ 6.44506340e-08, -2.26385370e-05, -6.14692664e-03, -3.03975278e-03])

### Logistic, Imputing, and encoding

In [0]:
x_training.describe(include=['O']).T

Unnamed: 0,count,unique,top,freq
date_recorded,44550,347,2011-03-15,426
funder,41811,1629,Government Of Tanzania,6798
installer,41794,1840,DWE,13020
wpt_name,44550,28953,none,2708
basin,44550,9,Lake Victoria,7746
subvillage,44261,16658,Majengo,387
region,44550,21,Iringa,3984
lga,44550,125,Njombe,1861
ward,44550,2080,Igosi,228
public_meeting,42021,2,True,38204


In [0]:
# reducing training column cardinality

date_recorded_top = x_training['date_recorded'].value_counts()[:50].index
x_training.loc[~x_training['date_recorded'].isin(date_recorded_top), 'date_recorded'] = 'N/A'

funder_top = x_training['funder'].value_counts()[:50].index
x_training.loc[~x_training['funder'].isin(funder_top), 'funder'] = 'N/A'

ward_top = x_training['ward'].value_counts()[:50].index
x_training.loc[~x_training['ward'].isin(ward_top), 'ward'] = 'N/A'

installer_top = x_training['installer'].value_counts()[:50].index
x_training.loc[~x_training['installer'].isin(installer_top), 'installer'] = 'N/A'

scheme_name_top = x_training['scheme_name'].value_counts()[:50].index
x_training.loc[~x_training['scheme_name'].isin(scheme_name_top), 'scheme_name'] = 'N/A'

In [0]:
# reducing validation column cardinality

funder_top_v = x_validation['funder'].value_counts()[:50].index
x_validation.loc[~x_validation['funder'].isin(funder_top_v), 'funder'] = 'N/A'

funder_top_v = x_validation['funder'].value_counts()[:50].index
x_validation.loc[~x_validation['funder'].isin(funder_top_v), 'funder'] = 'N/A'

ward_top_v = x_validation['ward'].value_counts()[:50].index
x_validation.loc[~x_validation['ward'].isin(ward_top), 'ward'] = 'N/A'

installer_top_v = x_validation['installer'].value_counts()[:50].index
x_validation.loc[~x_validation['installer'].isin(installer_top_v), 'installer'] = 'N/A'

scheme_name_top_v = x_validation['scheme_name'].value_counts()[:50].index
x_validation.loc[~x_validation['scheme_name'].isin(scheme_name_top_v), 'scheme_name'] = 'N/A'

In [0]:
# dropping those with extremely high cardinality 

to_drop = ['wpt_name', 'subvillage']
x_training = x_training.drop(to_drop, axis=1)
x_validation = x_validation.drop(to_drop, axis=1)

In [0]:
# tried to write a function..

# def changing_cardinality(column, number_of_card, placeholder):
#   filtered = df[column].value_counts()[:number_of_card].index
#   changed = [df.loc[~df[column].isin(filtered), column] == placeholder]
#   return changed
# x_training['date_recorded'] = x_training['date_recorded'].apply(changing_cardinality('date_recorded', 50, 'N/A'))

In [0]:
x_training.describe(include=['O'])

Unnamed: 0,date_recorded,funder,installer,basin,region,lga,ward,public_meeting,recorded_by,scheme_management,scheme_name,permit,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group
count,44550.0,44550.0,44550.0,44550,44550,44550,44550.0,42021,44550,41632,44550.0,42238,44550,44550,44550,44550,44550,44550,44550,44550,44550,44550,44550,44550,44550,44550,44550,44550
unique,51.0,51.0,51.0,9,21,125,51.0,2,1,12,51.0,2,18,13,7,12,5,7,7,8,6,5,5,10,7,3,7,6
top,,,,Lake Victoria,Iringa,Njombe,,True,GeoData Consultants Ltd,VWC,,True,gravity,gravity,gravity,vwc,user-group,never pay,never pay,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe
freq,29338.0,15996.0,16069.0,7746,3984,1861,39094.0,38204,44550,27593,37785.0,29105,20084,20084,20084,30318,39340,18963,18963,38054,38054,24869,24869,12754,12754,34320,21387,25971


In [0]:
# all_columns = x_training.describe(include=['O'])
# total_col_list = list(all_columns.columns)

total_col_list = ['date_recorded', 'funder', 'installer', 'region', 'ward', 'recorded_by', 'scheme_management','scheme_name', 'water_quality', 'quality_group', 'quantity', 
                  'quantity_group', 'source', 'source_type', 'source_class', 'waterpoint_type', 'waterpoint_type_group']

# numerics
numerics = ['longitude', 'latitude', 'region_code', 'district_code', 'population', 'construction_year']
total_col_list.extend(numerics)

In [0]:
# total_col_list

In [0]:
from sklearn.linear_model import LogisticRegression
import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler

# model_log = LogisticRegressionCV(cv=5, n_jobs=-1, random_state=42)
model_log = LogisticRegression(solver='lbfgs', multi_class='auto', max_iter=300)
imputer = SimpleImputer()
encoder = ce.OneHotEncoder(use_cat_names=True)
scaler = StandardScaler()

# reminder!
# y_train = x_training['status_group']
# y_val = x_validation['status_group']
features = total_col_list
x_train = x_training[features]
x_val = x_validation[features]

x_train_encoded = encoder.fit_transform(x_train)
x_train_imputed = imputer.fit_transform(x_train_encoded)
x_train_scaled = scaler.fit_transform(x_train_imputed)

x_val_encoded = encoder.transform(x_val)
x_val_imputed = imputer.transform(x_val_encoded)
x_val_scaled = scaler.transform(x_val_imputed)

model_log.fit(x_train_scaled, y_train)
print('Validation accuracy score', model_log.score(x_val_scaled, y_val))

Validation accuracy score 0.7288215488215488


In [0]:
X_test = test_features[features]
X_test_encoded = encoder.transform(X_test)
X_test_imputed = imputer.transform(X_test_encoded)
X_test_scaled = scaler.transform(X_test_imputed)

y_pred = model_log.predict(X_test_scaled)
print(y_pred)

[0 0 0 ... 0 0 1]


In [0]:
submission = sample_submission.copy()
submission['status_group'] = y_pred

In [0]:
mode_map = {0: 'functional', 1: 'non functional', 2: 'functional needs repair'}
submission['status_group'] = submission['status_group'].replace(mode_map)

In [0]:
submission.to_csv('medinadiegokaggle.csv', index=False)
from google.colab import files
files.download('medinadiegokaggle.csv')