In [1]:
import pandas as pd
data_set = pd.read_csv("water_potability.csv")
data_set.describe()

statistic,ph,Hardness,Solids,Chloramines,Sulfate,Conductivity,Organic_carbon,Trihalomethanes,Turbidity,Potability
count,2785.0,3276.0,3276.0,3276.0,2495.0,3276.0,3276.0,3114.0,3276.0,3276.0
mean,7.080795,196.369496,22014.092526,7.122277,333.775777,426.205111,14.28497,66.396293,3.966786,0.39011
std,1.59432,32.879761,8768.570828,1.583085,41.41684,80.824064,3.308162,16.175008,0.780382,0.487849
min,0.0,47.432,320.942611,0.352,129.0,181.483754,2.2,0.738,1.45,0.0
25%,6.093092,176.850538,15666.690297,6.127421,307.699498,365.734414,12.065801,55.844536,3.439711,0.0
50%,7.036752,196.967627,20927.833607,7.130299,333.073546,421.884968,14.218338,66.622485,3.955028,0.0
75%,8.062066,216.667456,27332.762127,8.114887,359.95017,481.792304,16.557652,77.337473,4.50032,1.0
max,14.0,323.124,61227.196008,13.127,481.030642,753.34262,28.3,124.0,6.739,1.0


In [2]:
#printing columns
data_set.columns

Index(['ph', 'Hardness', 'Solids', 'Chloramines', 'Sulfate', 'Conductivity',
       'Organic_carbon', 'Trihalomethanes', 'Turbidity', 'Potability'],
      dtype='object')

In [3]:
# The ph, Sulfate and Trihalomethanes columns have missing values
# (their counts in describe() above are below 3276).
# We take the simplest option for now and drop the rows that contain them.

# dropna drops missing values (think of na as "not available")
data_set = data_set.dropna(axis=0)
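
Before dropping rows, it can help to count how many values each column is missing. The sketch below uses a small made-up DataFrame (the name `toy` and its values are illustrative assumptions, not rows from water_potability.csv) to show `isnull().sum()` followed by `dropna`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real CSV
toy = pd.DataFrame({
    "ph": [7.1, np.nan, 6.8, 8.2],
    "Sulfate": [330.0, 310.5, np.nan, 356.9],
    "Turbidity": [3.9, 4.1, 2.6, 4.6],
})

# Count the missing values in each column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Drop every row that has at least one missing value
toy_clean = toy.dropna(axis=0)
print(len(toy), "rows before,", len(toy_clean), "rows after")
```

Dropping rows is simple but discards data: in the toy frame, two of the four rows are lost because each has a single missing value.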

There are many ways to select a subset of your data. The Pandas course covers these in more depth, 
but we will focus on two approaches for now.

Dot notation, which we use to select the "prediction target"
Selecting with a column list, which we use to select the "features"

"Selecting The Prediction Target"

You can pull out a variable with dot-notation. This single column is stored in a Series, 
which is broadly like a DataFrame with only a single column of data.

We'll use dot notation to select the column we want to predict, which is called the prediction target. 
By convention, the prediction target is called y. So the code we need to save the pH values from the water potability data is

In [4]:
y = data_set.ph
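
As a minimal sketch of what dot notation returns, the snippet below builds a made-up DataFrame (`toy` and its values are assumptions for illustration) and compares dot notation with the equivalent bracket notation.

```python
import pandas as pd

# Hypothetical toy data; the notebook itself uses data_set
toy = pd.DataFrame({"ph": [7.1, 6.8, 8.2], "Hardness": [204.9, 129.4, 224.2]})

y_dot = toy.ph          # dot notation
y_bracket = toy["ph"]   # bracket notation selects the same column

print(type(y_dot).__name__)      # the class name of the selected column
print(y_dot.equals(y_bracket))   # whether both selections hold the same data
```

Dot notation only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute; bracket notation works for any column name.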

Choosing "Features"
The columns that are inputted into our model (and later used to make predictions) are called "features." In our case, those would be the columns used to predict the water's pH. Sometimes, you will use all columns except the target as features. Other times you'll be better off with fewer features.

For now, we'll build a model with only a few features. Later on you'll see how to iterate and compare models built with different features.

We select multiple features by providing a list of column names inside brackets. Each item in that list should be a string (with quotes).
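
A quick sketch of how list-based selection behaves, using a made-up DataFrame (`toy`, `toy_features` and the values are illustrative assumptions): selecting with a list always gives back a DataFrame, even when the list has one item.

```python
import pandas as pd

# Hypothetical toy data for illustration
toy = pd.DataFrame({
    "ph": [7.1, 6.8, 8.2],
    "Hardness": [204.9, 129.4, 224.2],
    "Turbidity": [3.9, 4.1, 2.6],
})

toy_features = ["Hardness", "Turbidity"]
X_toy = toy[toy_features]    # list of names gives a DataFrame
single = toy[["Hardness"]]   # a one-item list still gives a DataFrame

print(type(X_toy).__name__, X_toy.shape)
print(type(single).__name__, single.shape)
```

This matters later because scikit-learn expects the features as a two-dimensional table, which a DataFrame provides.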

Here is an example:

In [5]:
data_features = ["Hardness", "Chloramines", "Sulfate", "Potability","Organic_carbon", "Trihalomethanes", "Turbidity"]

In [6]:
X = data_set[data_features]
X.describe()

statistic,Hardness,Chloramines,Sulfate,Potability,Organic_carbon,Trihalomethanes,Turbidity
count,2011.0,2011.0,2011.0,2011.0,2011.0,2011.0,2011.0
mean,195.968072,7.134338,333.224672,0.403282,14.357709,66.400859,3.969729
std,32.635085,1.58482,41.205172,0.490678,3.324959,16.077109,0.780346
min,73.492234,1.390871,129.0,0.0,2.2,8.577013,1.45
25%,176.744938,6.138895,307.632511,0.0,12.124105,55.952664,3.442915
50%,197.191839,7.143907,332.232177,0.0,14.322019,66.542198,3.968177
75%,216.44107,8.109726,359.330555,1.0,16.683049,77.291925,4.514175
max,317.338124,13.127,481.030642,1.0,27.006707,124.0,6.494749


In [7]:
X.head()

index,Hardness,Chloramines,Sulfate,Potability,Organic_carbon,Trihalomethanes,Turbidity
3,214.373394,8.059332,356.886136,0,18.436524,100.341674,4.628771
4,181.101509,6.5466,310.135738,0,11.558279,31.997993,4.075075
5,188.313324,7.544869,326.678363,0,8.399735,54.917862,2.559708
6,248.071735,7.513408,393.663396,0,13.789695,84.603556,2.672989
7,203.361523,4.563009,303.309771,0,12.363817,62.798309,4.401425


Visually checking your data with these commands is an important part of a data scientist's job. You'll frequently find surprises in the dataset that deserve further inspection.

# Building Your Model
You will use the scikit-learn library to create your models. When coding, this library is written as sklearn, as you will see in the sample code. Scikit-learn is easily the most popular library for modeling the types of data typically stored in DataFrames.

The steps to building and using a model are:

Define: What type of model will it be? A decision tree? Some other type of model? Some other parameters of the model type are specified too.
Fit: Capture patterns from provided data. This is the heart of modeling.
Predict: Just what it sounds like
Evaluate: Determine how accurate the model's predictions are.
Here is an example of defining a decision tree model with scikit-learn and fitting it with the features and target variable.

In [8]:
from sklearn.tree import DecisionTreeRegressor
# Define model. Specify a number for random_state to ensure same results each run
data_model = DecisionTreeRegressor(random_state = 1)
# Fit model
data_model.fit(X, y)

DecisionTreeRegressor(random_state=1)

Many machine learning models allow some randomness in model training. Specifying a number for random_state ensures you get the same results in each run. This is considered good practice. You can use any number, and model quality won't depend meaningfully on exactly which value you choose.
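
The effect of random_state can be checked directly. The sketch below trains two trees with the same seed on synthetic data (the data and all names here are assumptions for illustration) and compares their predictions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, made up for this illustration only
rng = np.random.RandomState(0)
X_toy = rng.rand(200, 3)
y_toy = 2.0 * X_toy[:, 0] + rng.normal(scale=0.1, size=200)

# Two models with the same seed, fitted on the same data
model_a = DecisionTreeRegressor(random_state=1).fit(X_toy, y_toy)
model_b = DecisionTreeRegressor(random_state=1).fit(X_toy, y_toy)

# Compare their predictions on a few unseen rows
X_new = rng.rand(5, 3)
same = np.array_equal(model_a.predict(X_new), model_b.predict(X_new))
print("Identical predictions:", same)
```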

We now have a fitted model that we can use to make predictions.

In practice, you'll want to make predictions for new water samples rather than the samples we already have pH measurements for. But we'll make predictions for the first few rows of the training data to see how the predict function works.

In [9]:
print("Making predictions for the pH of the following 5 water samples:")
print(X.head())
print("The Predictions are: ")
print(data_model.predict(X.head()))

Making predictions for the pH of the following 5 water samples:
     Hardness  Chloramines     Sulfate  Potability  Organic_carbon  \
3  214.373394     8.059332  356.886136           0       18.436524   
4  181.101509     6.546600  310.135738           0       11.558279   
5  188.313324     7.544869  326.678363           0        8.399735   
6  248.071735     7.513408  393.663396           0       13.789695   
7  203.361523     4.563009  303.309771           0       12.363817   

   Trihalomethanes  Turbidity  
3       100.341674   4.628771  
4        31.997993   4.075075  
5        54.917862   2.559708  
6        84.603556   2.672989  
7        62.798309   4.401425  
The Predictions are: 
[ 8.31676588  9.09222346  5.58408664 10.22386216  8.63584872]


In [13]:
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
# Split data into training and validation data, for both features and target.
# The split is based on a random number generator. Supplying a numeric value to
# the random_state argument guarantees we get the same split every time we
# run this script.
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 0)
# Define model. No random_state is set here, so the fitted tree and the
# MAE below can differ slightly from run to run.
data_model = DecisionTreeRegressor()
data_model.fit(train_X, train_y)
# Get predicted pH values on the validation data
val_predictions = data_model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))

1.6349467317978486


In [12]:
from sklearn.metrics import mean_absolute_error
mean_absolute_error(val_y, val_predictions)

1.630073449301821
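
The mean absolute error (MAE) reported above is the average of the absolute differences between actual and predicted values. As a small sketch with made-up numbers (the arrays below are hypothetical, not outputs of the notebook's model), the manual calculation can be compared with scikit-learn's function:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical actual and predicted pH values
actual = np.array([7.0, 6.5, 8.2, 7.4])
predicted = np.array([7.3, 6.0, 8.0, 9.0])

# MAE by hand: average of |actual - predicted|
manual_mae = np.mean(np.abs(actual - predicted))

# MAE from scikit-learn
library_mae = mean_absolute_error(actual, predicted)

print(manual_mae, library_mae)
```

Here the four absolute errors are 0.3, 0.5, 0.2 and 1.6, so the MAE is 0.65: on average, a prediction is off by 0.65 pH units.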

In practice, it's not uncommon for a tree to have 10 splits between the top level (all water samples) and a leaf. As the tree gets deeper, the dataset gets sliced up into leaves with fewer samples. If a tree has only 1 split, it divides the data into 2 groups. If each group is split again, we get 4 groups of samples. Splitting each of those again creates 8 groups. If we keep doubling the number of groups by adding more splits at each level, we'll have 2^10 groups of samples by the time we get to the 10th level. That's 1024 leaves.
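
The doubling argument can be checked with a few lines of code. The sketch below prints the maximum leaf count for several depths and then fits a depth-limited tree on random data (the data and names are assumptions for illustration) to confirm it stays within that bound.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Each extra level can at most double the number of groups
for depth in (1, 2, 3, 10):
    print("depth", depth, "-> at most", 2 ** depth, "leaves")

# Random data, made up for this illustration
rng = np.random.RandomState(0)
X_toy = rng.rand(5000, 3)
y_toy = rng.rand(5000)

# A tree limited to 10 levels cannot exceed 2**10 leaves
tree = DecisionTreeRegressor(max_depth=10, random_state=0).fit(X_toy, y_toy)
n_leaves = tree.get_n_leaves()
print("leaves in fitted tree:", n_leaves)
```

A fitted tree often has fewer leaves than the bound, because not every branch is split all the way down.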

When we divide the samples amongst many leaves, we also have fewer samples in each leaf. Leaves with very few samples will make predictions that are quite close to those samples' actual values, but they may make very unreliable predictions for new data (because each prediction is based on only a few samples).

This is a phenomenon called overfitting, where a model matches the training data almost perfectly but does poorly on validation and other new data. On the flip side, if we make our tree very shallow, it doesn't divide the samples into very distinct groups.

At an extreme, if a tree divides the samples into only 2 or 4 groups, each group still has a wide variety of samples. The resulting predictions may be far off for most samples, even in the training data (and they will be bad in validation too, for the same reason). When a model fails to capture important distinctions and patterns in the data, so that it performs poorly even on training data, that is called underfitting.
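
Both failure modes can be seen by comparing training and validation error for trees of different sizes. This is a sketch on synthetic data (the data, the variable names and the chosen sizes 2, 20 and 5000 are assumptions for illustration, not the notebook's dataset).

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a smooth signal plus noise
rng = np.random.RandomState(0)
X_toy = rng.rand(600, 3)
y_toy = 7.0 + np.sin(6.0 * X_toy[:, 0]) + rng.normal(scale=0.3, size=600)

tr_X, va_X, tr_y, va_y = train_test_split(X_toy, y_toy, random_state=0)

# Map each tree size to its (training MAE, validation MAE)
results = {}
for leaves in [2, 20, 5000]:
    model = DecisionTreeRegressor(max_leaf_nodes=leaves, random_state=0)
    model.fit(tr_X, tr_y)
    train_mae = mean_absolute_error(tr_y, model.predict(tr_X))
    val_mae = mean_absolute_error(va_y, model.predict(va_X))
    results[leaves] = (train_mae, val_mae)
    print("leaves: %d  train MAE: %.3f  validation MAE: %.3f" % (leaves, train_mae, val_mae))
```

The 2-leaf tree has a high error even on its own training data (underfitting), while the largest tree fits the training data essentially perfectly yet still makes errors on validation data (overfitting). A middle setting often validates best, though that is not guaranteed for every dataset.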

In [20]:
def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    # max_leaf_nodes comes first to match the calls below. It is passed by
    # keyword: scikit-learn 1.0+ rejects positional arguments here.
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state = 0)
    model.fit(train_X, train_y)
    pred_vals = model.predict(val_X)
    # Return the MAE so the caller can compare tree sizes
    return mean_absolute_error(val_y, pred_vals)

In [23]:
# compare MAE with differing values of max_leaf_nodes
for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %f" %(max_leaf_nodes, my_mae))

Conclusion
Here's the takeaway: Models can suffer from either:

Overfitting: capturing spurious patterns that won't recur in the future, leading to less accurate predictions, or
Underfitting: failing to capture relevant patterns, again leading to less accurate predictions.
We use validation data, which isn't used in model training, to measure a candidate model's accuracy. This lets us try many candidate models and keep the best one.
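
Keeping the best candidate can be sketched with a dictionary that maps each setting to its validation error. The numbers below are made up for illustration and are not results from this notebook.

```python
# Hypothetical validation MAEs keyed by max_leaf_nodes (made-up numbers)
candidate_scores = {5: 1.31, 50: 1.24, 500: 1.58, 5000: 1.66}

# min() walks over the keys and compares them by their stored score
best_size = min(candidate_scores, key=candidate_scores.get)
print(best_size, candidate_scores[best_size])
```

Passing `key=candidate_scores.get` makes `min` return the key with the lowest score rather than the smallest key.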

In [25]:
# Or another method: collect the scores in a dictionary
scores = {}
for i in [5, 50, 500, 5000]:
    scores[i] = get_mae(i, train_X, val_X, train_y, val_y)
# Store the best value of max_leaf_nodes (it will be 5, 50, 500 or 5000)
best_tree_size = min(scores, key=scores.get)

In [26]:
# Or with a dictionary comprehension
scores = {leaf_size: get_mae(leaf_size, train_X, val_X, train_y, val_y) for leaf_size in [5, 50, 500, 5000]}
best_tree_size = min(scores, key=scores.get)
