## Predicting income using [Random Forests](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
### You can download this dataset from [Kaggle](https://www.kaggle.com/uciml/adult-census-income)
**The prediction task is to determine whether a person makes over $50K a year.**

In [2]:
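from fastai.tabular import *  # fastai v1 API; the star import also provides pd, np and math
from pathlib import Path      # explicit import for the Path used below
from sklearn.ensemble import RandomForestClassifier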

In [3]:
PATH = Path('data')
PATH.ls()

[PosixPath('data/adult.csv')]

In [4]:
df_raw = pd.read_csv(PATH/"adult.csv")

In [15]:
df_raw.head()

| | age | workclass | fnlwgt | education | education.num | marital.status | occupation | relationship | race | sex | capital.gain | capital.loss | hours.per.week | native.country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 90 | ? | 77053 | HS-grad | 9 | Widowed | ? | Not-in-family | White | Female | 0 | 4356 | 40 | United-States | <=50K |
| 1 | 82 | Private | 132870 | HS-grad | 9 | Widowed | Exec-managerial | Not-in-family | White | Female | 0 | 4356 | 18 | United-States | <=50K |
| 2 | 66 | ? | 186061 | Some-college | 10 | Widowed | ? | Unmarried | Black | Female | 0 | 4356 | 40 | United-States | <=50K |
| 3 | 54 | Private | 140359 | 7th-8th | 4 | Divorced | Machine-op-inspct | Unmarried | White | Female | 0 | 3900 | 40 | United-States | <=50K |
| 4 | 41 | Private | 264663 | Some-college | 10 | Separated | Prof-specialty | Own-child | White | Female | 0 | 3900 | 40 | United-States | <=50K |


In [5]:
dep_var = 'income'
cat_var = ['workclass','education','marital.status','occupation','relationship','race','sex','native.country']
cont_var = ['education.num', 'hours.per.week', 'age', 'capital.loss', 'fnlwgt', 'capital.gain']
proc = [FillMissing,Categorify,Normalize]
valid_idx = range((len(df_raw) - 6000),len(df_raw))

* We use [Fastai](https://docs.fast.ai/tabular.data.html) to preprocess the tabular data, then train a [random forest classifier from sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) on the result.
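As a rough illustration of what the preprocessing step produces, the sketch below mimics the effect of `Categorify` and `Normalize` with plain pandas on a made-up three-row frame. It is not fastai's implementation; the toy values and the names `toy`, `codes`, and `normed` are assumptions introduced only for this example.

```python
import pandas as pd

# Hypothetical toy frame with one categorical and one continuous column.
toy = pd.DataFrame({
    "workclass": ["Private", "?", "Private"],
    "age": [30.0, 40.0, 50.0],
})

# Categorify-like step: map each distinct string to an integer code.
# pandas sorts the categories, so each string gets a stable code.
codes = toy["workclass"].astype("category").cat.codes

# Normalize-like step: subtract the mean and divide by the standard deviation.
normed = (toy["age"] - toy["age"].mean()) / toy["age"].std()

print(codes.tolist())
print(normed.tolist())
```

Note that `?` is an ordinary string in the CSV, so in this sketch it becomes its own category rather than being treated as a missing value.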

In [6]:
data = (TabularList.from_df(df_raw,cat_names=cat_var,cont_names=cont_var,procs=proc)
                    .split_by_idx(valid_idx)
                    .label_from_df(cols=dep_var)
                    .databunch())

In [7]:
(data.train_ds.x.conts.shape,data.train_ds.x.codes.shape)

((26561, 6), (26561, 8))

In [8]:
x_train = np.concatenate((data.train_ds.x.conts,data.train_ds.x.codes),axis=1)

In [9]:
y_train = to_data(list(data.train_ds.y))

In [10]:
x_valid = np.concatenate((data.valid_ds.x.conts,data.valid_ds.x.codes),axis=1)
x_valid.shape

(6000, 14)

In [11]:
y_valid = to_data(list(data.valid_ds.y))

In [12]:
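# With 0/1 class labels, RMSE is the square root of the misclassification rate.
def rmse(x,y): return math.sqrt(((x-y)**2).mean())

def print_score(m):
    # Prints [train RMSE, valid RMSE, train accuracy, valid accuracy, (OOB score if present)]
    res = [rmse(m.predict(x_train), y_train), rmse(m.predict(x_valid), y_valid),
                m.score(x_train, y_train), m.score(x_valid, y_valid)]
    if hasattr(m, 'oob_score_'): res.append(m.oob_score_)
    print(res)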
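`print_score` appends `oob_score_` only when the model has that attribute, and scikit-learn sets it only when the forest is constructed with `oob_score=True`. The sketch below shows this on synthetic data; `make_classification` and the name `m_oob` are stand-ins chosen for illustration, not part of this notebook's pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption: not the census data used here).
X, y = make_classification(n_samples=500, n_features=14, random_state=0)

# oob_score=True scores each sample using only trees that did not see it
# in their bootstrap sample, giving a validation-like estimate for free.
m_oob = RandomForestClassifier(n_estimators=100, min_samples_leaf=5,
                               max_features='log2', oob_score=True,
                               random_state=0)
m_oob.fit(X, y)

print(hasattr(m_oob, 'oob_score_'))
print(m_oob.oob_score_)
```

With a model built this way, `print_score` would print a fifth number after the four scores.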

In [13]:
m = RandomForestClassifier(n_jobs=-1, n_estimators=100, min_samples_leaf=5, max_features='log2')
m.fit(x_train,y_train)
print_score(m)

[0.3174101743476624, 0.38297084310253526, 0.8992507812205865, 0.8533333333333334]
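The first two numbers above are RMSE values and the last two are accuracies. Because the labels are 0/1 classes, each RMSE is just the square root of the corresponding error rate, so it adds no information beyond accuracy. A minimal pure-Python check, using made-up predictions (the lists `pred` and `true` are hypothetical):

```python
import math

def rmse(pred, true):
    """Root mean squared error over two equal-length sequences."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))

def accuracy(pred, true):
    """Fraction of positions where prediction and label agree."""
    return sum(p == t for p, t in zip(pred, true)) / len(true)

# Hypothetical 0/1 predictions and labels, made up for illustration.
pred = [0, 1, 1, 0, 1, 0, 0, 1]
true = [0, 1, 0, 0, 1, 1, 0, 1]

# Each wrong 0/1 prediction contributes exactly 1 to the squared error,
# so the mean squared error equals the error rate (1 - accuracy).
assert math.isclose(rmse(pred, true), math.sqrt(1 - accuracy(pred, true)))

# The same relation links the reported validation accuracy and RMSE.
assert math.isclose(math.sqrt(1 - 0.8533333333333334),
                    0.38297084310253526, rel_tol=1e-6)
```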


In [14]:
print("Validation accuracy:{:.3f}".format(m.score(x_valid,y_valid)))

Validation accuracy:0.853
