# Predicting Income with Random Forests

In this project, we will use a dataset of census information from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/census%20income

Using this census data with a random forest, we will try to predict whether a person earns more than $50,000 a year.

In [14]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier

There’s a small problem with our data that is easy to miss: every string has an extra space at the start. For example, the first row’s native-country is " United-States", but we want it to be "United-States". This happens because the file adult.data has a space after each comma. To fix this, we can pass the parameter delimiter = ", " to read_csv(). Because this delimiter is longer than one character, pandas has to use its Python parsing engine, so we also pass engine = 'python'.

In [39]:
income_data = pd.read_csv('adult.data', delimiter = ", ",
                          names=['age', 'workclass', 'fnlwgt', 'education', 'education-num', 
                                 'marital-status', 'occupation', 'relationship', 'race', 'sex', 
                                 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 
                                 'income'],
                         engine='python')
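As a small illustration of the whitespace issue, the sketch below parses two made-up rows (not taken from adult.data) from an in-memory string. It compares the default parse with the delimiter = ", " approach used above and with the skipinitialspace=True option of read_csv(), which is an alternative this project does not use.

```python
import io

import pandas as pd

# Two made-up rows that mimic the "comma followed by a space" layout.
raw = "39, State-gov, United-States\n50, Private, Cuba\n"
cols = ["age", "workclass", "native-country"]

# Default parsing keeps the leading space in every string field.
naive = pd.read_csv(io.StringIO(raw), names=cols)

# The approach used in this project: treat ", " as the delimiter.
by_delimiter = pd.read_csv(
    io.StringIO(raw), names=cols, delimiter=", ", engine="python"
)

# Alternative (not the project's choice): skip the space after each comma.
by_flag = pd.read_csv(io.StringIO(raw), names=cols, skipinitialspace=True)

# repr() makes a leading space visible in the printed value.
print(repr(naive["native-country"][0]))
print(repr(by_delimiter["native-country"][0]))
print(repr(by_flag["native-country"][0]))
```

Either fix should give clean strings; the second keeps the default parsing engine.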

Take a look at one of the rows of the data we’ve imported. **Print income_data.iloc[0]** to see the first row in its entirety. Did this person make more than $50,000? What is the name of the column that contains that information?

In [40]:
print(income_data.iloc[0])

age                          39
workclass             State-gov
fnlwgt                    77516
education             Bachelors
education-num                13
marital-status    Never-married
occupation         Adm-clerical
relationship      Not-in-family
race                      White
sex                        Male
capital-gain               2174
capital-loss                  0
hours-per-week               40
native-country    United-States
income                    <=50K
Name: 0, dtype: object


Now that we have our data imported into a DataFrame, we can begin putting it in a format that our Random Forest can work with. To do this, we need to separate the labels from the rest of the data.

For this project, the labels are in the column called "income". Create a variable named labels that contains only the column "income" from the income_data DataFrame.

We’ll also want to pick which columns to use when trying to predict income. For now, let’s select "age", "capital-gain", "capital-loss", "hours-per-week", and "sex". Create a new variable named data that contains only those columns. 

In [41]:
labels = income_data['income']
data = income_data[['age', 'capital-gain', 'capital-loss', 'hours-per-week', 'sex']]

In [42]:
print(labels)

0        <=50K
1        <=50K
2        <=50K
3        <=50K
4        <=50K
         ...  
32556    <=50K
32557     >50K
32558    <=50K
32559    <=50K
32560     >50K
Name: income, Length: 32561, dtype: object


Finally, we want to split our data and labels into a training set and a test set. We’ll use the training set to build the random forest, and the test set to see how accurate it is. Use the train_test_split() function to do this.

train_test_split() returns four values — name them train_data, test_data, train_labels, and test_labels. When calling train_test_split(), it should take three arguments — data, labels and random_state = 1.

In [43]:
train_data, test_data, train_labels, test_labels = train_test_split(data, labels, random_state=1)
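To see what this call does, here is a minimal sketch on a made-up 100-row dataset (the names toy_data and toy_labels are hypothetical). It relies on train_test_split() holding out 25% of the rows by default and on random_state making the shuffle repeatable.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows with one feature and an alternating 0/1 label.
toy_data = pd.DataFrame({"x": range(100)})
toy_labels = pd.Series([i % 2 for i in range(100)])

# Same arguments as in the project: data, labels, and a fixed random_state.
first = train_test_split(toy_data, toy_labels, random_state=1)
second = train_test_split(toy_data, toy_labels, random_state=1)

toy_train, toy_test, toy_train_labels, toy_test_labels = first

# With the default test_size, a quarter of the rows go to the test set.
print(len(toy_train), len(toy_test))

# Reusing the same random_state selects the same rows again.
print(toy_train.index.equals(second[0].index))
```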

We’re now ready to use this data to build and test our random forest. First, create a RandomForestClassifier and name it forest. When you create the random forest, use the parameter random_state = 1.

In [44]:
forest = RandomForestClassifier(random_state=1)

Next, we need to fit the model. We want to use the training data and training labels to train the random forest.

Call forest’s .fit() method using train_data and train_labels as parameters. When you run your code, there should be an error!

In [45]:
forest.fit(train_data, train_labels)

ValueError: could not convert string to float: 'Female'

There **seems to be a problem** with using the column "sex" when training the random forest.

In that column, there are values like "Male" and "Female". scikit-learn’s random forests **can’t use columns that contain strings**; every feature has to be numeric, such as an integer or a float.
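One way to catch this kind of problem before calling .fit() is to list the columns that are not numeric. The sketch below does this on a small made-up DataFrame (toy is a hypothetical stand-in for the selected columns) using the select_dtypes() method.

```python
import pandas as pd

# Made-up rows shaped like a few of the selected columns.
toy = pd.DataFrame({
    "age": [39, 50, 38],
    "hours-per-week": [40, 13, 40],
    "sex": ["Male", "Male", "Female"],
})

# Keep only the columns whose dtype is not numeric.
non_numeric_columns = toy.select_dtypes(exclude="number").columns.tolist()
print(non_numeric_columns)
```

Any column listed here would need to be converted to numbers or left out of the training data.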

If we take every row and make every "Male" a 0 and every "Female" a 1, we could then use the column in our random forest. Before creating the data variable, use this line of code:

$\texttt{income_data["sex-int"] = income_data["sex"].apply(lambda row: 0 if row == "Male" else 1)}$

This creates a new column called "sex-int" in the income_data DataFrame. Every row in that new column contains a 0 if the row’s "sex" column contained "Male" and a 1 otherwise.

In [46]:
income_data["sex-int"] = income_data["sex"].apply(lambda row: 0 if row == "Male" else 1)
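The same encoding can also be written without a lambda. The sketch below, on a made-up column, compares the .apply() version used above with a vectorized comparison; the column name "sex-int-vectorized" is hypothetical and only there to hold the second result.

```python
import pandas as pd

# Made-up values standing in for the "sex" column.
toy = pd.DataFrame({"sex": ["Male", "Female", "Female", "Male"]})

# The approach from this project: 0 for "Male", 1 for anything else.
toy["sex-int"] = toy["sex"].apply(lambda value: 0 if value == "Male" else 1)

# Vectorized equivalent: the comparison gives True/False, cast to 1/0.
toy["sex-int-vectorized"] = (toy["sex"] != "Male").astype(int)

print(toy)
```

Both columns should hold the same values. On a large DataFrame the vectorized form is usually faster, though that is not measured here.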

In [48]:
labels = income_data['income']
data = income_data[['age', 'capital-gain', 'capital-loss', 'hours-per-week', 'sex-int']]
train_data, test_data, train_labels, test_labels = train_test_split(data, labels, random_state=1)
forest = RandomForestClassifier(random_state=1)
forest.fit(train_data, train_labels)

RandomForestClassifier(random_state=1)

We can now test its accuracy. Call forest’s .score() method using test_data and test_labels as parameters. Print the result.

In [51]:
print(forest.score(test_data, test_labels))

0.8272939442328953
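For a classifier, .score() reports accuracy: the fraction of test rows whose predicted label matches the true label. The sketch below checks this on synthetic data from make_classification() rather than on the census data, so its numbers are unrelated to the result above. The names toy_forest and manual_accuracy are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic two-class data, used only to illustrate what .score() computes.
X, y = make_classification(n_samples=200, n_features=5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

toy_forest = RandomForestClassifier(n_estimators=20, random_state=1)
toy_forest.fit(X_train, y_train)

# Accuracy by hand: the share of predictions that equal the true labels.
predictions = toy_forest.predict(X_test)
manual_accuracy = np.mean(predictions == y_test)

# Accuracy as reported by the classifier itself.
score = toy_forest.score(X_test, y_test)

print(manual_accuracy, score)
```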


There are a couple of other columns that use strings that might be useful to use. Let’s try transforming the values in the "native-country" column.

We should first take a look at the different values that exist in the column. Print income_data["native-country"].value_counts().

Since the majority of the data comes from "United-States", it might make sense to make a column where every row that contains "United-States" becomes a 0 and any other country becomes a 1. Use the syntax from creating the "sex-int" column to create a "country-int" column.

When mapping strings to numbers like this, it is important to make sure that the numeric order is meaningful. For example, it wouldn’t make much sense to map "United-States" to 0, "Germany" to 1, and "Mexico" to 2. If we did this, we would be saying that Mexico is more similar to Germany than it is to the United States.

However, if a column had values like "low", "medium", and "high", mapping those values to 0, 1, and 2 would make sense because the categories themselves have a natural order.
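The two cases can be sketched on made-up data. The "risk" column below is hypothetical and stands in for an ordered category. For the unordered "country" column, the sketch uses one-hot encoding with pd.get_dummies(). This technique is not used in this project; it is shown only as one common way to avoid implying an order.

```python
import pandas as pd

# Made-up rows: "risk" has a natural order, "country" does not.
toy = pd.DataFrame({
    "risk": ["low", "high", "medium", "low"],
    "country": ["United-States", "Germany", "Mexico", "United-States"],
})

# Ordered categories: an explicit mapping keeps low < medium < high.
risk_order = {"low": 0, "medium": 1, "high": 2}
toy["risk-int"] = toy["risk"].map(risk_order)

# Unordered categories: one 0/1 column per country, so no order is implied.
country_flags = pd.get_dummies(toy["country"], prefix="country", dtype=int)

print(toy)
print(country_flags)
```

The 0/1 "country-int" column in this project is the simplest version of this idea: a single flag for "United-States" versus everything else.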

In [54]:
# print(income_data["native-country"].value_counts())

income_data["country-int"] = income_data["native-country"].apply(lambda row: 0 if row == "United-States" else 1)

In [56]:
labels = income_data['income']
data = income_data[['age', 'capital-gain', 'capital-loss', 'hours-per-week', 'sex-int', 'country-int']]
train_data, test_data, train_labels, test_labels = train_test_split(data, labels, random_state=1)
forest = RandomForestClassifier(random_state=1)
forest.fit(train_data, train_labels)

RandomForestClassifier(random_state=1)

In [58]:
print(forest.score(test_data, test_labels))

0.8225033779633951


Now that you’ve gotten the hang of transforming, adding, and removing columns from your training data, it’s time to explore on your own to try to make the best classifier possible.

As you play around with the data, here are some ideas that you might want to try:

- Create a tree.DecisionTreeClassifier, train it, test it using the same data, and compare the results to the random forest. When does the random forest do better than the single tree? When does a single tree do just as well as the forest?
- After calling .fit() on the forest, print forest.feature_importances_. This will show you a list of numbers where each number corresponds to the relevance of a column from the training data. Which features tend to be more relevant?
- Use some of the other columns that use continuous variables, or transform columns that use strings!
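
As a starting point for the first two ideas above, here is a minimal sketch on synthetic data from make_classification(), so the scores it prints say nothing about the census data. The names single_tree and toy_forest are hypothetical. To try it on this project, swap in train_data, test_data, train_labels, and test_labels.

```python
import numpy as np
from sklearn import tree
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the census features and labels.
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=3, random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Train a single decision tree and a random forest on the same split.
single_tree = tree.DecisionTreeClassifier(random_state=1)
single_tree.fit(X_train, y_train)
toy_forest = RandomForestClassifier(random_state=1)
toy_forest.fit(X_train, y_train)

# Compare test accuracy for the two models.
tree_score = single_tree.score(X_test, y_test)
forest_score = toy_forest.score(X_test, y_test)
print("tree:", tree_score, "forest:", forest_score)

# One importance value per feature column; the values sum to 1.
importances = toy_forest.feature_importances_
for column_index, importance in enumerate(importances):
    print(column_index, round(importance, 3))
```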
