# Income Prediction in Python using scikit-learn and pandas

In this notebook, we use [scikit-learn](https://scikit-learn.org/) and [pandas](https://pandas.pydata.org/) to train a [logistic regression](https://en.wikipedia.org/wiki/Logistic_regression) model and a linear [support vector machine classifier](https://en.wikipedia.org/wiki/Support_vector_machine) (SVC) on [1994 census data](https://www.cs.toronto.edu/~delve/data/adult/adultDetail.html) to predict whether a person's annual income exceeds $50K. It establishes a baseline against which the [APLearn implementation](https://github.com/BobMcDear/aplearn/blob/main/examples/adults/python.ipynb) can be compared and is not intended as a tutorial, although the code is annotated for clarity. Please see the APLearn notebook before proceeding.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import accuracy_score


# Loads data.
df = pd.read_csv('adult.csv')

# One-hot encodes categorical columns.
cat_cols = ['workclass','education','marital-status','occupation',
            'relationship','race','gender','native-country']
df = pd.get_dummies(df, columns=cat_cols)

# Converts target variable from string into integer.
df['income'] = (df['income'] == '>50K').astype(int)

# Splits data into training and validation sets.
train, val = train_test_split(df, test_size=0.2, shuffle=True)

# Separates independent and dependent variables.
X_t, y_t = train.drop(columns=['income']), train['income']
X_v, y_v = val.drop(columns=['income']), val['income']

# Normalizes features.
scaler = StandardScaler()
X_t = scaler.fit_transform(X_t)
X_v = scaler.transform(X_v)

# Creates logistic regression model, trains, makes predictions, and computes accuracy.
# C is the inverse of the regularization strength (here 1 / 0.01).
log_reg = LogisticRegression(C=1 / 0.01)
log_reg.fit(X_t, y_t)
log_reg_y_hat = log_reg.predict(X_v)
print('Logistic regression', accuracy_score(y_v, log_reg_y_hat))

# Creates linear SVC (hinge loss minimized by SGD), trains, makes predictions, and computes accuracy.
lin_svc = SGDClassifier(loss='hinge', learning_rate='constant', alpha=0.01, eta0=0.01)
lin_svc.fit(X_t, y_t)
lin_svc_y_hat = lin_svc.predict(X_v)
print('SVC:', accuracy_score(y_v, lin_svc_y_hat))

Logistic regression 0.8546422356433616
SVC: 0.8265943289998976
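
The accuracies above will differ slightly between runs, since neither `train_test_split` nor `SGDClassifier` is seeded in the cell above. As a minimal sketch of how the comparison could be made repeatable, the following pins both with `random_state` and wraps scaling and model in a pipeline. It assumes synthetic data from `make_classification` in place of `adult.csv`; the helper `run_baselines` and the seed values are hypothetical and not part of the original notebook or of APLearn, so the scores it prints are not comparable to those above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def run_baselines(seed):
    # Synthetic stand-in for the census data (assumption: not the real dataset).
    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

    # Seeding the split fixes which rows land in the validation set.
    X_t, X_v, y_t, y_v = train_test_split(
        X, y, test_size=0.2, shuffle=True, random_state=seed
    )

    # Same hyperparameters as the cell above; only random_state is added.
    models = {
        'Logistic regression': LogisticRegression(C=1 / 0.01),
        'SVC': SGDClassifier(loss='hinge', learning_rate='constant',
                             alpha=0.01, eta0=0.01, random_state=seed),
    }

    scores = {}
    for name, model in models.items():
        # The pipeline fits the scaler on the training data only.
        pipe = make_pipeline(StandardScaler(), model)
        pipe.fit(X_t, y_t)
        scores[name] = accuracy_score(y_v, pipe.predict(X_v))
    return scores


first = run_baselines(seed=0)
second = run_baselines(seed=0)
print(first)
print('Repeatable:', first == second)
```

Running the helper twice with the same seed should give identical scores, while varying the seed gives a rough sense of how much of the gap between the two models is due to the particular split.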
