# Predicting Earnings with Random Forests

In [1]:
# Mount the drive for file upload
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier

Since the first row of our file contains the column names, we pass the argument header = 0 to read_csv().

In [3]:
directory = '/content/drive/MyDrive/Colab Notebooks/Codacademy Machine Learning/Decision Trees/Random Forest/earning.csv'
income_data = pd.read_csv(directory, header = 0)

To inspect a single row of the data:


```
income_data.iloc[0]
```



In [25]:
print(income_data.columns)

Index(['age', 'workclass', 'fnlwgt', 'education', 'education-num',
       'marital-status', 'occupation', 'relationship', 'race', 'sex',
       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
       'income', 'sex-int', 'native-country-new'],
      dtype='object')


In [4]:
income_data.iloc[0]

age                            39
 workclass              State-gov
 fnlwgt                     77516
 education              Bachelors
 education-num                 13
 marital-status     Never-married
 occupation          Adm-clerical
 relationship       Not-in-family
 race                       White
 sex                         Male
 capital-gain                2174
 capital-loss                   0
 hours-per-week                40
 native-country     United-States
 income                     <=50K
Name: 0, dtype: object

The "income" column holds what we want to predict: whether a person earns more than 50K.

Note that every string has an extra space at the start. For example, the first row's native-country is " United-States", but we want it to be "United-States". This happens because the CSV file has a space after each comma. To fix this, we can pass the parameter delimiter = ", " to read_csv().

In [6]:
income_data = pd.read_csv(directory, header = 0, delimiter = ", ")
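A multi-character delimiter such as ", " makes pandas fall back to its slower Python parsing engine and emit a warning. A possible alternative, sketched here on a small invented in-memory sample rather than the real file, is the `skipinitialspace` parameter of `read_csv()`. The sample rows and the names `csv_text` and `sample` are assumptions made only for illustration.

```python
import io

import pandas as pd

# Invented two-row sample mimicking a CSV with a space after each comma.
csv_text = "age, sex, native-country\n39, Male, United-States\n28, Female, Cuba\n"

# skipinitialspace=True drops the spaces that follow each delimiter,
# so the default comma delimiter can still be used.
sample = pd.read_csv(io.StringIO(csv_text), header=0, skipinitialspace=True)

print(list(sample.columns))
print(repr(sample["native-country"].iloc[0]))
```

Both the column names and the string values come back without the leading space.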



Now we can start building the Random Forest model. In this case, our label is the "income" column.

In [7]:
labels = income_data[["income"]]
print(labels)

      income
0      <=50K
1      <=50K
2      <=50K
3      <=50K
4      <=50K
...      ...
32556  <=50K
32557   >50K
32558  <=50K
32559  <=50K
32560   >50K

[32561 rows x 1 columns]
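The output above has one column because double brackets select a DataFrame rather than a Series. scikit-learn classifiers generally expect a 1-D label array and may warn when given a single-column frame. The sketch below, on an invented three-row frame named `toy`, only illustrates the difference in shape; it is not a change to the notebook's cells.

```python
import pandas as pd

# Invented three-row frame standing in for income_data.
toy = pd.DataFrame({"age": [39, 50, 38], "income": ["<=50K", ">50K", "<=50K"]})

labels_frame = toy[["income"]]   # double brackets: DataFrame with shape (3, 1)
labels_series = toy["income"]    # single brackets: Series with shape (3,)

print(labels_frame.shape, labels_series.shape)
```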


Now we can pick which features from the dataset we want to use to train our model. 

In [8]:
data = income_data[["age","capital-gain","capital-loss","hours-per-week","sex"]]

Since we have our data and labels, we should split them into training and testing sets. 

In [9]:
train_data, test_data, train_labels, test_labels = train_test_split(data, labels, random_state=1)
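As a rough illustration of what this call does, the sketch below runs `train_test_split` on an invented 100-row frame. The names `toy_data` and `toy_labels` are hypothetical. With no `test_size` given, scikit-learn holds out 25% of the rows, and a fixed `random_state` makes the split repeatable.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented data: 100 rows, two numeric features, alternating labels.
toy_data = pd.DataFrame({"age": range(100), "hours-per-week": range(100, 200)})
toy_labels = pd.Series([i % 2 for i in range(100)], name="income")

# Two calls with the same random_state produce the same partition.
a_train, a_test, a_train_y, a_test_y = train_test_split(toy_data, toy_labels, random_state=1)
b_train, b_test, _, _ = train_test_split(toy_data, toy_labels, random_state=1)

print(len(a_train), len(a_test))
print(a_test.index.equals(b_test.index))
```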

We are ready to create the Random Forest classifier using scikit-learn!

In [10]:
forest = RandomForestClassifier(random_state=1)

In [11]:
forest.fit(train_data, train_labels)

ValueError: ignored

The fit fails because the "sex" column contains strings such as "Male" and "Female", which scikit-learn cannot convert to floats. We will fix this later; for now, we remove the "sex" feature.
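The failure can be reproduced on a small invented frame. In this sketch the names `toy` and `toy_labels` are hypothetical, and the exact error message may vary between scikit-learn versions.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Invented four-row frame with one numeric and one string feature.
toy = pd.DataFrame({"age": [39, 50, 38, 53], "sex": ["Male", "Male", "Female", "Female"]})
toy_labels = [0, 1, 0, 1]

fit_failed = False
try:
    RandomForestClassifier(random_state=1).fit(toy, toy_labels)
except ValueError as err:
    # The string column cannot be converted to a numeric array.
    fit_failed = True
    print("fit failed:", err)
```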

In [12]:
data = income_data[["age","capital-gain","capital-loss","hours-per-week"]]

In [13]:
train_data, test_data, train_labels, test_labels = train_test_split(data, labels, random_state=1)

In [14]:
forest = RandomForestClassifier(random_state=1)

In [15]:
forest.fit(train_data, train_labels)



RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=1, verbose=0,
                       warm_start=False)

In [16]:
print(forest.score(test_data, test_labels))

0.8222577078982926
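For a classifier, `score()` reports mean accuracy on the data it is given. The sketch below checks this on invented synthetic data by comparing `score()` with `accuracy_score` from `sklearn.metrics`. The arrays `X` and `y` and the name `toy_forest` are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Invented data: 200 rows, 3 features, label depends on the first two.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
toy_forest = RandomForestClassifier(random_state=1).fit(X_train, y_train)

# score() and a manual accuracy computation should agree.
score = toy_forest.score(X_test, y_test)
manual = accuracy_score(y_test, toy_forest.predict(X_test))
print(score, manual)
```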


**Changing Column Types from String to Float or Int**

Recall that the problem was that this column contained strings. If we transformed those strings into integers, we could use this data!

If we take every row and make every "Male" a 0 and every "Female" a 1, we could then use the column in our random forest. Before creating the data variable, use this line of code:


```
income_data["sex-int"] = income_data["sex"].apply(lambda row: 0 if row == "Male" else 1)
```
This creates a new column called "sex-int" in the income_data DataFrame. Every row in that new column contains a 0 if the row’s "sex" column contained "Male" and a 1 otherwise.


In [17]:
income_data["sex-int"] = income_data["sex"].apply(lambda row: 0 if row == "Male" else 1)
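The same 0/1 column can also be built without `apply()`. The sketch below, on an invented four-row frame named `toy`, compares the lambda version with a vectorized comparison. Note that in both versions any value other than "Male", including a missing value, becomes 1.

```python
import pandas as pd

# Invented sample standing in for the "sex" column.
toy = pd.DataFrame({"sex": ["Male", "Female", "Female", "Male"]})

# Row-by-row version, as in the notebook.
via_apply = toy["sex"].apply(lambda value: 0 if value == "Male" else 1)

# Vectorized version: compare the whole column, then cast booleans to ints.
via_compare = (toy["sex"] != "Male").astype(int)

print(via_apply.tolist())
print(via_apply.tolist() == via_compare.tolist())
```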

In [18]:
data = income_data[["age","capital-gain","capital-loss","hours-per-week", "sex-int"]]
train_data, test_data, train_labels, test_labels = train_test_split(data, labels, random_state=1)
forest = RandomForestClassifier(random_state=1)
forest.fit(train_data, train_labels)
print(forest.score(test_data, test_labels))



0.8272939442328953


Let's see if there are other string features we can convert to numbers. One candidate is "native-country". Note that:

When mapping strings to numbers, it is important that the numeric order makes sense. For example, it wouldn't make much sense to map "United-States" to 0, "Germany" to 1, and "Mexico" to 2. If we did this, we would be saying that Mexico is more similar to Germany than it is to the United States.

However, if a column held values like "low", "medium", and "high", mapping them to 0, 1, and 2 would make sense because the values themselves are ordered.
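The two cases can be sketched side by side on an invented three-row frame. An explicit dictionary keeps the order of an ordered column, while one-hot encoding with `pd.get_dummies` (a technique this notebook does not use) represents an unordered column without implying any order. The column "level" and the names `toy`, `level_order`, and `country_dummies` are hypothetical.

```python
import pandas as pd

# Invented sample with one ordered and one unordered string column.
toy = pd.DataFrame({
    "level": ["low", "high", "medium"],
    "native-country": ["United-States", "Germany", "Mexico"],
})

# Ordered categories: an explicit mapping preserves low < medium < high.
level_order = {"low": 0, "medium": 1, "high": 2}
toy["level-int"] = toy["level"].map(level_order)

# Unordered categories: one indicator column per value, no implied order.
country_dummies = pd.get_dummies(toy["native-country"], prefix="country")

print(toy["level-int"].tolist())
print(sorted(country_dummies.columns))
```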

For simplicity, we use a binary encoding. The counts below show that "United-States" is by far the most common value, so we map "United-States" to 0 and every other value to 1.

In [20]:
income_data["native-country"].value_counts()

United-States                 29170
Mexico                          643
?                               583
Philippines                     198
Germany                         137
Canada                          121
Puerto-Rico                     114
El-Salvador                     106
India                           100
Cuba                             95
England                          90
Jamaica                          81
South                            80
China                            75
Italy                            73
Dominican-Republic               70
Vietnam                          67
Guatemala                        64
Japan                            62
Poland                           60
Columbia                         59
Taiwan                           51
Haiti                            44
Iran                             43
Portugal                         37
Nicaragua                        34
Peru                             31
Greece                      

In [21]:
income_data["native-country-new"] = income_data["native-country"].apply(lambda row: 0 if row == "United-States" else 1)

Note that: 
After calling .fit() on the forest, we can print forest.feature_importances_, which will show us a list of numbers where each number corresponds to the relevance of a column from the training data.
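The raw array is easier to read when paired with the column names. The sketch below does this on invented synthetic data, where the label depends only on a column called "signal". The names `toy`, `toy_labels`, and `toy_forest` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Invented data: the label is determined by "signal"; "noise" is unrelated.
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "signal": rng.normal(size=300),
    "noise": rng.normal(size=300),
})
toy_labels = (toy["signal"] > 0).astype(int)

toy_forest = RandomForestClassifier(random_state=1).fit(toy, toy_labels)

# feature_importances_ follows the column order of the training data.
importances = pd.Series(toy_forest.feature_importances_, index=toy.columns)
print(importances.sort_values(ascending=False))
```

The importances are normalized, so they sum to 1 across all features.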

In [23]:
data = income_data[["age","capital-gain","capital-loss","hours-per-week", "sex-int","native-country-new"]]
train_data, test_data, train_labels, test_labels = train_test_split(data, labels, random_state=1)
forest = RandomForestClassifier(random_state=1)
forest.fit(train_data, train_labels)
print(forest.feature_importances_)
print(forest.score(test_data, test_labels))



[0.31351874 0.29270793 0.11745457 0.20309968 0.0643516  0.00886747]
0.8225033779633951


The accuracy does not improve: it drops slightly, from about 0.827 to about 0.823, and the importance of "native-country-new" is below 0.01.

I suspect that education and education-num are important features, so let's look at those next.

In [26]:
income_data["education"].value_counts()

HS-grad         10501
Some-college     7291
Bachelors        5355
Masters          1723
Assoc-voc        1382
11th             1175
Assoc-acdm       1067
10th              933
7th-8th           646
Prof-school       576
9th               514
12th              433
Doctorate         413
5th-6th           333
1st-4th           168
Preschool          51
Name: education, dtype: int64

The education feature has many categories, which makes it hard to choose the right numerical representation, so I'll skip it for now and use education-num instead.

In [27]:
data = income_data[["age","capital-gain","capital-loss","hours-per-week", "sex-int","native-country-new", "education-num"]]
train_data, test_data, train_labels, test_labels = train_test_split(data, labels, random_state=1)
forest = RandomForestClassifier(random_state=1)
forest.fit(train_data, train_labels)
print(forest.feature_importances_)
print(forest.score(test_data, test_labels))



[0.31394643 0.19624199 0.07090807 0.17322542 0.05047075 0.01118603
 0.18402131]
0.8390861073578184


# Lessons learnt

Through this exercise, I learned how to use the scikit-learn library to build Random Forests for classification. I also learned how to convert data into values scikit-learn can work with and how to determine which features matter most for a correct classification.