Lambda School Data Science

*Unit 2, Sprint 2, Module 1*

---

# Decision Trees

## Assignment
- [ ] [Sign up for a Kaggle account](https://www.kaggle.com/), if you don’t already have one. Go to our Kaggle InClass competition website. You will be given the URL in Slack. Go to the Rules page. Accept the rules of the competition. Notice that the Rules page also has instructions for the Submission process. The Data page has feature definitions.
- [ ] Do train/validate/test split with the Tanzania Waterpumps data.
- [ ] Begin with baselines for classification.
- [ ] Select features. Use a scikit-learn pipeline to encode categoricals, impute missing values, and fit a decision tree classifier.
- [ ] Get your validation accuracy score.
- [ ] Get and plot your feature importances.
- [ ] Submit your predictions to our Kaggle competition. (Go to our Kaggle InClass competition webpage. Use the blue **Submit Predictions** button to upload your CSV file. Or you can use the Kaggle API to submit your predictions.)
- [ ] Commit your notebook to your fork of the GitHub repo.


## Stretch Goals

### Reading

- A Visual Introduction to Machine Learning
  - [Part 1: A Decision Tree](http://www.r2d3.us/visual-intro-to-machine-learning-part-1/)
  - [Part 2: Bias and Variance](http://www.r2d3.us/visual-intro-to-machine-learning-part-2/)
- [Decision Trees: Advantages & Disadvantages](https://christophm.github.io/interpretable-ml-book/tree.html#advantages-2)
- [How a Russian mathematician constructed a decision tree — by hand — to solve a medical problem](http://fastml.com/how-a-russian-mathematician-constructed-a-decision-tree-by-hand-to-solve-a-medical-problem/)
- [How decision trees work](https://brohrer.github.io/how_decision_trees_work.html)
- [Let’s Write a Decision Tree Classifier from Scratch](https://www.youtube.com/watch?v=LDRbO9a6XPU) — _Don’t worry about understanding the code, just get introduced to the concepts. This 10 minute video has excellent diagrams and explanations._
- [Random Forests for Complete Beginners: The definitive guide to Random Forests and Decision Trees](https://victorzhou.com/blog/intro-to-random-forests/)


### Doing
- [ ] Add your own stretch goal(s)!
- [ ] Define a function to wrangle train, validate, and test sets in the same way. Clean outliers and engineer features. (For example, [what columns have zeros and shouldn't?](https://github.com/Quartz/bad-data-guide#zeros-replace-missing-values) What columns are duplicates, or nearly duplicates? Can you extract the year from date_recorded? Can you engineer new features, such as the number of years from waterpump construction to waterpump inspection?)
- [ ] Try other [scikit-learn imputers](https://scikit-learn.org/stable/modules/impute.html).
- [ ] Make exploratory visualizations and share on Slack.
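
One way to approach the wrangling stretch goal is a single function that is applied identically to the train, validate, and test sets. The sketch below is only an illustration: the function name `wrangle` and the new columns `year_recorded` and `years_in_service` are hypothetical, and it assumes that a value of exactly 0 in `longitude` or `construction_year` stands for a missing value rather than a real measurement.

```python
import numpy as np
import pandas as pd


def wrangle(df):
    """Apply the same cleaning steps to any of the train/validate/test frames."""
    df = df.copy()  # avoid mutating the caller's DataFrame

    # Assumption: a 0 in these columns is a placeholder for "missing".
    for col in ['longitude', 'construction_year']:
        df[col] = df[col].replace(0, np.nan)

    # Extract the year in which the waterpump was inspected.
    df['date_recorded'] = pd.to_datetime(df['date_recorded'])
    df['year_recorded'] = df['date_recorded'].dt.year

    # Years from construction to inspection (NaN if construction_year is missing).
    df['years_in_service'] = df['year_recorded'] - df['construction_year']
    return df


# Tiny made-up frame, for illustration only.
demo = pd.DataFrame({
    'date_recorded': ['2011-03-14', '2013-03-06'],
    'longitude': [34.938093, 0.0],
    'construction_year': [1999, 0],
})
cleaned = wrangle(demo)
print(cleaned[['year_recorded', 'years_in_service']])
```

Because the function returns a cleaned copy, the same call can be repeated for each split, for example `train = wrangle(train)`, `val = wrangle(val)`, and `test = wrangle(test)`.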


#### Exploratory visualizations

Visualize the relationships between feature(s) and target. I recommend you do this with your training set, after splitting your data. 

For this problem, you may want to create a new column to represent the target as a number, 0 or 1. For example:

```python
train['functional'] = (train['status_group']=='functional').astype(int)
```



You can try [Seaborn "Categorical estimate" plots](https://seaborn.pydata.org/tutorial/categorical.html) for features with reasonably few unique values. (With too many unique values, the plot is unreadable.)

- Categorical features. (If there are too many unique values, you can replace less frequent values with "OTHER.")
- Numeric features. (If there are too many unique values, you can [bin with pandas cut / qcut functions](https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html?highlight=qcut#discretization-and-quantiling).)
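
The quantity that a categorical estimate plot draws is the mean of the 0/1 target within each group. The sketch below computes those means directly with pandas on a small made-up sample (the frame `train_demo` and the column `gps_height_bin` are hypothetical), once for a categorical feature and once for a numeric feature binned with `pd.cut`.

```python
import pandas as pd

# Made-up sample standing in for the training set.
train_demo = pd.DataFrame({
    'quantity': ['enough', 'enough', 'dry', 'dry', 'seasonal', 'enough'],
    'gps_height': [1390, 1399, 263, 0, 0, 686],
    'functional': [1, 1, 0, 0, 1, 1],
})

# Categorical feature: share of functional pumps per category.
rate_by_quantity = train_demo.groupby('quantity')['functional'].mean()
print(rate_by_quantity)

# Numeric feature: cut into 3 equal-width bins, then compare the bins.
train_demo['gps_height_bin'] = pd.cut(train_demo['gps_height'], bins=3)
rate_by_height = (
    train_demo.groupby('gps_height_bin', observed=True)['functional'].mean()
)
print(rate_by_height)
```

Passing the same column names to a Seaborn bar plot (`x='quantity'`, `y='functional'`) should show these group means with confidence intervals added.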

You can try [Seaborn linear model plots](https://seaborn.pydata.org/tutorial/regression.html) with numeric features. For this classification problem, you may want to use the parameter `logistic=True`, but it can be slow.

You do _not_ need to use Seaborn, but it's nice because it includes confidence intervals to visualize uncertainty.

#### High-cardinality categoricals

This code from a previous assignment demonstrates how to replace less frequent values with 'OTHER':

```python
# Reduce cardinality for NEIGHBORHOOD feature ...

# Get a list of the top 10 neighborhoods
top10 = train['NEIGHBORHOOD'].value_counts()[:10].index

# At locations where the neighborhood is NOT in the top 10,
# replace the neighborhood with 'OTHER'
train.loc[~train['NEIGHBORHOOD'].isin(top10), 'NEIGHBORHOOD'] = 'OTHER'
test.loc[~test['NEIGHBORHOOD'].isin(top10), 'NEIGHBORHOOD'] = 'OTHER'
```
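
The same idea can be wrapped in a small helper so that the list of frequent values is learned from the training set only and then applied to every split. This is a sketch, not part of the original assignment: the function name `keep_top_categories` and the demo frames are hypothetical, and `funder` is used only as an example of a high-cardinality column.

```python
import pandas as pd


def keep_top_categories(reference, frames, column, n=10, other='OTHER'):
    """Replace values outside the top-n of reference[column] with `other`.

    The top-n list is learned from `reference` only (normally the training
    set) and then applied to every frame in `frames`.
    """
    top = reference[column].value_counts().head(n).index
    result = []
    for df in frames:
        df = df.copy()  # leave the caller's DataFrame untouched
        df.loc[~df[column].isin(top), column] = other
        result.append(df)
    return result


# Made-up data, for illustration only.
train_demo = pd.DataFrame({'funder': ['A', 'A', 'A', 'B', 'B', 'C']})
val_demo = pd.DataFrame({'funder': ['A', 'C', 'D']})

train_small, val_small = keep_top_categories(
    train_demo, [train_demo, val_demo], column='funder', n=2
)
print(train_small['funder'].tolist())
print(val_small['funder'].tolist())
```

Note that missing values are also replaced with `'OTHER'`, because `isin` returns `False` for NaN.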


In [1]:
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Kaggle-Challenge/master/data/'
    !pip install category_encoders==2.*
    !pip install pandas-profiling==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [8]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Raw CSV locations in the forked course repository
test_features_url = 'https://raw.githubusercontent.com/TerrenceAm22/DS-Unit-2-Kaggle-Challenge/master/data/waterpumps/test_features.csv'
train_features_url = 'https://raw.githubusercontent.com/TerrenceAm22/DS-Unit-2-Kaggle-Challenge/master/data/waterpumps/train_features.csv'
train_labels_url = 'https://raw.githubusercontent.com/TerrenceAm22/DS-Unit-2-Kaggle-Challenge/master/data/waterpumps/train_labels.csv'

# Merge the training features with their labels on the shared column
train = pd.merge(pd.read_csv(train_features_url),
                 pd.read_csv(train_labels_url))
test = pd.read_csv(test_features_url)
#sample_submission = pd.read_csv('/DS-Unit-2-Kaggle-Challenge/module1-decision-trees/sample_submission.csv')

train.shape, test.shape

((59400, 41), (14358, 40))

In [9]:
#from pandas_profiling import ProfileReport
#profile = ProfileReport(train, minimal=True).to_notebook_iframe()

#profile

In [10]:
# Importing the dataset for the assignment in the next few cells

#from google.colab import files
#train = files.upload()

In [11]:
#train = pd.read_csv('train_features.csv')
#train.head()


In [12]:
#from google.colab import files
#test = files.upload()

In [13]:
#test = pd.read_csv('test_features.csv')
#test.head()

In [14]:
#from google.colab import files
#train_labels = files.upload()

In [15]:
#train_labels = pd.read_csv('train_labels.csv')
#train_labels.head()

In [16]:
# Checking head of dataset

train.head(20)

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,...,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group,status_group
0,69572,6000.0,2011-03-14,Roman,1390,Roman,34.938093,-9.856322,none,0,...,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe,functional
1,8776,0.0,2013-03-06,Grumeti,1399,GRUMETI,34.698766,-2.147466,Zahanati,0,...,soft,good,insufficient,insufficient,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe,functional
2,34310,25.0,2013-02-25,Lottery Club,686,World vision,37.460664,-3.821329,Kwa Mahundi,0,...,soft,good,enough,enough,dam,dam,surface,communal standpipe multiple,communal standpipe,functional
3,67743,0.0,2013-01-28,Unicef,263,UNICEF,38.486161,-11.155298,Zahanati Ya Nanyumbu,0,...,soft,good,dry,dry,machine dbh,borehole,groundwater,communal standpipe multiple,communal standpipe,non functional
4,19728,0.0,2011-07-13,Action In A,0,Artisan,31.130847,-1.825359,Shuleni,0,...,soft,good,seasonal,seasonal,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe,functional
5,9944,20.0,2011-03-13,Mkinga Distric Coun,0,DWE,39.172796,-4.765587,Tajiri,0,...,salty,salty,enough,enough,other,other,unknown,communal standpipe multiple,communal standpipe,functional
6,19816,0.0,2012-10-01,Dwsp,0,DWSP,33.36241,-3.766365,Kwa Ngomho,0,...,soft,good,enough,enough,machine dbh,borehole,groundwater,hand pump,hand pump,non functional
7,54551,0.0,2012-10-09,Rwssp,0,DWE,32.620617,-4.226198,Tushirikiane,0,...,milky,milky,enough,enough,shallow well,shallow well,groundwater,hand pump,hand pump,non functional
8,53934,0.0,2012-11-03,Wateraid,0,Water Aid,32.7111,-5.146712,Kwa Ramadhan Musa,0,...,salty,salty,seasonal,seasonal,machine dbh,borehole,groundwater,hand pump,hand pump,non functional
9,46144,0.0,2011-08-03,Isingiro Ho,0,Artisan,30.626991,-1.257051,Kwapeto,0,...,soft,good,enough,enough,shallow well,shallow well,groundwater,hand pump,hand pump,functional


In [17]:
# Checking the values and frequency
train.describe(include='all')

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,...,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group,status_group
count,59400.0,59400.0,59400,55765,59400.0,55745,59400.0,59400.0,59400,59400.0,...,59400,59400,59400,59400,59400,59400,59400,59400,59400,59400
unique,,,356,1897,,2145,,,37400,,...,8,6,5,5,10,7,3,7,6,3
top,,,2011-03-15,Government Of Tanzania,,DWE,,,none,,...,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe,functional
freq,,,572,9084,,17402,,,3563,,...,50818,50818,33186,33186,17021,17021,45794,28522,34625,32259
mean,37115.131768,317.650385,,,668.297239,,34.077427,-5.706033,,0.474141,...,,,,,,,,,,
std,21453.128371,2997.574558,,,693.11635,,6.567432,2.946019,,12.23623,...,,,,,,,,,,
min,0.0,0.0,,,-90.0,,0.0,-11.64944,,0.0,...,,,,,,,,,,
25%,18519.75,0.0,,,0.0,,33.090347,-8.540621,,0.0,...,,,,,,,,,,
50%,37061.5,0.0,,,369.0,,34.908743,-5.021597,,0.0,...,,,,,,,,,,
75%,55656.5,20.0,,,1319.25,,37.178387,-3.326156,,0.0,...,,,,,,,,,,


In [18]:
# Performing the train/validation split (the test set is already separate)

from sklearn.model_selection import train_test_split

train, val = train_test_split(train, random_state=40)
train.shape, test.shape, val.shape



((44550, 41), (14358, 40), (14850, 41))

In [24]:
# Getting baselines for classification: the class frequencies of the target
target = 'status_group'
y_train = train[target]
y_train.value_counts(normalize=True)








functional                 0.542290
non functional             0.385769
functional needs repair    0.071942
Name: status_group, dtype: float64
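
The frequencies above imply a majority-class baseline: always predicting `functional` would be right for about 54% of the training rows. The sketch below shows how that baseline accuracy could be computed for a validation set. It uses small made-up label series (`y_train_demo` and `y_val_demo` are hypothetical) rather than the real data.

```python
import pandas as pd

# Made-up labels standing in for y_train and y_val.
y_train_demo = pd.Series(
    ['functional'] * 5
    + ['non functional'] * 4
    + ['functional needs repair']
)
y_val_demo = pd.Series(
    ['functional', 'functional', 'non functional', 'functional needs repair']
)

# The baseline always predicts the most frequent training class.
majority_class = y_train_demo.mode()[0]

# Accuracy of that constant prediction on the validation labels.
baseline_accuracy = (y_val_demo == majority_class).mean()
print(majority_class, baseline_accuracy)
```

Any model worth keeping should score above this number on the validation set.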

In [20]:
train.describe(exclude='number').T.sort_values(by='unique').unique.sum()

54266

In [27]:
# Selecting features and importing necessary libraries
import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assigning a dataframe with all train columns except the target & ID
train_features = train.drop(columns=['status_group', 'id'])
#train_features.head()

# Make a list of all numeric features
numeric_features = train_features.select_dtypes(include='number').columns.tolist()


# Get the cardinality (number of unique values) of each non-numeric feature
cardinality = train_features.select_dtypes(exclude='number').nunique()


# Getting a list of all categorical features with cardinality <= 50
categorical_features = cardinality[cardinality <= 50].index.tolist()

features = numeric_features + categorical_features
print(features)






['amount_tsh', 'gps_height', 'longitude', 'latitude', 'num_private', 'region_code', 'district_code', 'population', 'construction_year', 'basin', 'region', 'public_meeting', 'recorded_by', 'scheme_management', 'permit', 'extraction_type', 'extraction_type_group', 'extraction_type_class', 'management', 'management_group', 'payment', 'payment_type', 'water_quality', 'quality_group', 'quantity', 'quantity_group', 'source', 'source_type', 'source_class', 'waterpoint_type', 'waterpoint_type_group']


In [28]:
# Arranging data into X features matrix and y target vector
X_train = train[features]
y_train = train[target]
X_val = val[features]
y_val = val[target]
X_test = test[features]

In [None]:
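# Manual (non-pipeline) version of encode / impute / scale / fit with
# logistic regression, including the validation accuracy score.
# The individual steps are left commented out.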

encoder = ce.OneHotEncoder(use_cat_names=True)
imputer = SimpleImputer()
scaler = StandardScaler()
model = LogisticRegression(max_iter=1)

#X_train_encoded = encoder.fit_transform(X_train)
#X_train_imputed = imputer.fit_transform(X_train_encoded)
#X_train_scaled = scaler.fit_transform(X_train_imputed)
#model.fit(X_train_scaled, y_train)

#X_val_encoded = encoder.transform(X_val)
#X_val_imputed = imputer.transform(X_val_encoded)
#X_val_scaled = scaler.transform(X_val_imputed)
#print('Validation Accuracy', model.score(X_val_scaled, y_val))

#X_test_encoded = encoder.transform(X_test)
#X_test_imputed = imputer.transform(X_test_encoded)
#X_test_scaled = scaler.transform(X_test_imputed)
#y_pred = model.predict(X_test_scaled)







In [29]:
# Using Pipeline

from sklearn.tree import DecisionTreeClassifier



pipeline = make_pipeline(
    ce.OneHotEncoder(use_cat_names=True), 
    SimpleImputer(strategy='mean'), 
    DecisionTreeClassifier(random_state=42)
)

# Fit on train
pipeline.fit(X_train, y_train)

# Score on train, val
print('Train Accuracy', pipeline.score(X_train, y_train))
print('Validation Accuracy', pipeline.score(X_val, y_val))

# Predict on test
y_pred = pipeline.predict(X_test)



Train Accuracy 0.9954657687991021
Validation Accuracy 0.7548821548821549
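
The assignment also asks for feature importances, which can be read from the fitted tree inside the pipeline. The following is a minimal sketch on made-up numeric data (`X_demo`, `y_demo`, and `demo_pipeline` are hypothetical names). It omits the one-hot encoder so that it stays self-contained, so the feature names are simply the input column names.

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Made-up numeric data, for illustration only.
X_demo = pd.DataFrame({
    'gps_height': [1390, 1399, 686, 263, 0, 0, 0, 0],
    'amount_tsh': [6000.0, 0.0, 25.0, 0.0, 0.0, 20.0, 0.0, None],
})
y_demo = ['functional', 'functional', 'functional', 'non functional',
          'functional', 'functional', 'non functional', 'non functional']

demo_pipeline = make_pipeline(
    SimpleImputer(strategy='mean'),
    DecisionTreeClassifier(random_state=42),
)
demo_pipeline.fit(X_demo, y_demo)

# make_pipeline names each step after its lowercased class name.
tree = demo_pipeline.named_steps['decisiontreeclassifier']

# Pair each importance with its column name and sort for plotting.
importances = pd.Series(tree.feature_importances_, index=X_demo.columns)
importances = importances.sort_values()
print(importances)
```

With the real pipeline, the tree sees the one-hot encoded columns, so the index would need to come from the encoder output instead, for example the columns of `pipeline.named_steps['onehotencoder'].transform(X_val)`. Calling `importances.plot.barh()` then gives a horizontal bar chart if matplotlib is installed.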


In [30]:
# Build the submission from a copy of the id column so that adding
# the predictions does not trigger pandas' SettingWithCopyWarning
y_pred = pipeline.predict(X_test)
submission = test[['id']].copy()
submission['status_group'] = y_pred
submission.head()



Unnamed: 0,id,status_group
0,50785,non functional
1,51630,functional needs repair
2,17168,functional
3,45559,non functional
4,49871,functional


In [31]:
submission.to_csv('submission.csv', index=False)
