# **Mini-Kaggle Project 2: Adult Income Classification**

**Importing Training data**

In [48]:
import pandas as pd
import numpy as np

#importing the data frame
df = pd.read_csv("/kaggle/input/adult-income-classifacation/train.csv")

#creating a copy of the data frame
df_no_null = df.copy()
df_no_null.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income,id
0,78,Private,111189,7th-8th,4,Never-married,Machine-op-inspct,Not-in-family,White,Female,0,0,35,Dominican-Republic,0,26052
1,49,Self-emp-inc,122066,Some-college,10,Divorced,Sales,Not-in-family,White,Male,0,0,25,United-States,0,47049
2,62,Self-emp-not-inc,168682,7th-8th,4,Married-civ-spouse,Sales,Husband,White,Male,0,0,5,United-States,0,33915
3,18,Private,110230,10th,6,Never-married,Other-service,Own-child,White,Male,0,0,11,United-States,0,22132
4,40,Private,373050,12th,8,Married-civ-spouse,Other-service,Husband,White,Male,0,0,40,?,0,46452


# **Data Exploration**

In [49]:
#Converting all question marks in the data frame to null
df_no_null.replace('?', pd.NA, inplace=True)

#Locating which columns have null values and how many
print(df_no_null.isnull().sum())

age                   0
workclass          2230
fnlwgt                0
education             0
educational-num       0
marital-status        0
occupation         2238
relationship          0
race                  0
gender                0
capital-gain          0
capital-loss          0
hours-per-week        0
native-country      687
income                0
id                    0
dtype: int64


All of the categorical variables excluding education seem to be easily classified as nominal (label) data. I will look at the unique values of the education column to decide whether to treat it as label values or ordinal values.

In [50]:
#Creating a list of all the unique values of education
df_no_null['education'].unique()

array(['7th-8th', 'Some-college', '10th', '12th', '5th-6th', 'HS-grad',
       '11th', 'Masters', 'Bachelors', 'Assoc-voc', '9th', 'Prof-school',
       '1st-4th', 'Assoc-acdm', 'Doctorate', 'Preschool'], dtype=object)

I am unsure how to rank some of the values in this list, so I will also use label encoding for the education variable.
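For comparison, an ordinal encoding of education could look like the sketch below. The ordering in `education_order` is my own assumption about how the levels rank, and `toy` is a made-up frame, not the real training data. In the preview rows above, the educational-num column appears to follow a similar ranking (7th-8th is 4 and Some-college is 10), so it may already carry this information.

```python
import pandas as pd

# Assumed low-to-high ordering of the education levels (an illustration,
# not something the dataset states explicitly).
education_order = [
    "Preschool", "1st-4th", "5th-6th", "7th-8th", "9th", "10th", "11th",
    "12th", "HS-grad", "Some-college", "Assoc-voc", "Assoc-acdm",
    "Bachelors", "Masters", "Prof-school", "Doctorate",
]

# Toy frame standing in for the real training data.
toy = pd.DataFrame({"education": ["7th-8th", "Some-college", "10th", "Doctorate"]})

# An ordered categorical maps each level to its position in the list.
ordered = pd.Categorical(toy["education"], categories=education_order, ordered=True)
toy["education_rank"] = ordered.codes
print(toy)
```

Unlike label encoding, which numbers the categories alphabetically, this keeps the numeric codes in the same order as the assumed ranking.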

# **Data Preprocessing**

All of the missing values are in the categorical variables workclass, occupation, and native-country. I am going to fill each missing value with the most common category in its column. I am doing this so I can retain as much data as possible, which should give slightly more accurate output than deleting all rows with missing values.

In [51]:
#Finding the most common category for each column to fill in the missing values
df_no_null['workclass'] = df_no_null['workclass'].fillna(df_no_null['workclass'].mode()[0])
df_no_null['occupation'] = df_no_null['occupation'].fillna(df_no_null['occupation'].mode()[0])
df_no_null['native-country'] = df_no_null['native-country'].fillna(df_no_null['native-country'].mode()[0])

#confirming that there are not any more missing values
print(df_no_null.isnull().sum())

age                0
workclass          0
fnlwgt             0
education          0
educational-num    0
marital-status     0
occupation         0
relationship       0
race               0
gender             0
capital-gain       0
capital-loss       0
hours-per-week     0
native-country     0
income             0
id                 0
dtype: int64


The categorical variables workclass, marital-status, occupation, relationship, race, gender, native-country, and education are all strings in the data frame. I will use label encoding for all of these variables to prepare the data for the machine learning models.

In [52]:
#Importing the label encoder
from sklearn.preprocessing import LabelEncoder

#Creating a loop to find all the object data type columns and label encode their values
label_encoders = {}
for column in df_no_null.columns:
    if df_no_null[column].dtype == 'object':
        label_encoders[column] = LabelEncoder()
        df_no_null.loc[:, column] = label_encoders[column].fit_transform(df_no_null[column])


In [53]:
#confirming that the label encoding worked
df_no_null.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income,id
0,78,3,111189,5,4,4,6,1,4,0,0,0,35,5,0,26052
1,49,4,122066,15,10,0,11,1,4,1,0,0,25,38,0,47049
2,62,5,168682,5,4,2,11,0,4,1,0,0,5,38,0,33915
3,18,3,110230,0,6,4,7,3,4,1,0,0,11,38,0,22132
4,40,3,373050,2,8,2,7,0,4,1,0,0,40,38,0,46452


**Preparing to split the data**

Creating a Series for the target variable

In [55]:
#Selecting just the income column to use as the target variable.
df_Yvalues=df_no_null['income']
df_Yvalues.head()

0    0
1    0
2    0
3    0
4    0
Name: income, dtype: int64

Creating a data frame for the feature variables

In [56]:
#Copying the df_no_null data frame
df_Xvalues = df_no_null

#Dropping the income variable from the new data frame
df_Xvalues = df_Xvalues.drop("income", axis=1)

#confirming the previous steps were done correctly
df_Xvalues.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,id
0,78,3,111189,5,4,4,6,1,4,0,0,0,35,5,26052
1,49,4,122066,15,10,0,11,1,4,1,0,0,25,38,47049
2,62,5,168682,5,4,2,11,0,4,1,0,0,5,38,33915
3,18,3,110230,0,6,4,7,3,4,1,0,0,11,38,22132
4,40,3,373050,2,8,2,7,0,4,1,0,0,40,38,46452


Converting the target variable and the feature variables into arrays

In [57]:
X=df_Xvalues.values
y=df_Yvalues.values

Splitting the data using 80% to train and 20% to test. The data set has roughly 39,000 rows, so 20% still leaves a large enough sample to test with.

In [58]:
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=5)

# Display the shapes of the resulting sets
print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)


Shape of X_train: (31258, 15)
Shape of X_test: (7815, 15)
Shape of y_train: (31258,)
Shape of y_test: (7815,)
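One option I did not use above is a stratified split, which keeps the share of each income class the same in the training and test sets. The sketch below uses made-up labels (`X_toy`, `y_toy`) with an assumed 80/20 class imbalance; the real class balance of the income column is not shown in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels: 80 zeros and 20 ones (illustrative, not the real data).
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy asks for the same class proportions in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=5, stratify=y_toy
)

print("Share of class 1 in train:", y_tr.mean())
print("Share of class 1 in test:", y_te.mean())
```

With an imbalanced target, this avoids a test set that happens to contain unusually few examples of the rarer class.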


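Standardizing the features so that each one has zero mean and unit variance. This puts features with very different ranges (for example, fnlwgt and age) on a common scale. Standardizing does not remove outliers, but it keeps large-valued features from dominating the scale-sensitive models.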

In [60]:
from sklearn.preprocessing import StandardScaler

stdsc = StandardScaler()
X_train_std = stdsc.fit_transform(X_train)
X_test_std = stdsc.transform(X_test)
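To make the standardization step concrete, here is a small sketch on made-up numbers (the `train` and `test` arrays are hypothetical). It shows that the scaler learns the mean and standard deviation from the training rows only and then applies the same shift and scale to the test rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two toy features on very different scales (illustrative values).
train = np.array([[20.0, 1000.0], [40.0, 3000.0], [60.0, 5000.0]])
test = np.array([[40.0, 3000.0], [80.0, 7000.0]])

scaler = StandardScaler()
train_std = scaler.fit_transform(train)  # learns mean and std from train only
test_std = scaler.transform(test)        # reuses the training mean and std

print("Train means after scaling:", train_std.mean(axis=0))
print("Train stds after scaling:", train_std.std(axis=0))
print("Scaled test rows:")
print(test_std)
```

The first test row equals the training mean, so it maps to zeros. The second row lies outside the training range and maps to a value well above 1, which shows that extreme values stay extreme after scaling.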

# **Modeling the data**

Creating a Logistic Regression model and testing the accuracy

In [61]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(random_state=5)
lr.fit(X_train_std, y_train)
print('Accuracy: %.3f' % lr.score(X_test_std, y_test))

Accuracy: 0.825


Logistic regression is a widely used classification model that can be used to predict a binary outcome based on one or more predictor variables. The target variable for this data set is binary, so it is suitable. Logistic regression models the probability of the outcome being a certain value. Logistic regression is sensitive to outliers. The accuracy of this model is 82.5%, which will be compared to the other models. I used the standardized features so the solver converges faster and so features with large ranges do not dominate the fit.
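As a sketch of how the probability is modeled: for a binary target, the predicted probability of class 1 is the sigmoid of a weighted sum of the features. The example below checks this against scikit-learn on synthetic data from `make_classification` (the data and the names `X_toy`, `y_toy`, and `model` are illustrative, not the income data).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (not the income data).
X_toy, y_toy = make_classification(n_samples=200, n_features=4, random_state=5)
model = LogisticRegression(random_state=5).fit(X_toy, y_toy)

# Linear score z = w . x + b, then the sigmoid 1 / (1 + exp(-z)).
z = X_toy @ model.coef_.ravel() + model.intercept_[0]
manual_proba = 1.0 / (1.0 + np.exp(-z))

# Probability of class 1 as reported by scikit-learn.
sklearn_proba = model.predict_proba(X_toy)[:, 1]
print("Manual and library probabilities agree:",
      np.allclose(manual_proba, sklearn_proba))
```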

Creating an SVM model and testing the accuracy

In [63]:
from sklearn.svm import SVC


svm = SVC(C=.06, random_state=5)

# Fit the model
svm.fit(X_train_std, y_train)

print('Accuracy: %.3f' % svm.score(X_test_std, y_test))

Accuracy: 0.837


SVM is used for classification. It aims to find a decision boundary that maximizes the margin between the classes. The margin maximization objective helps reduce the risk of overfitting. SVM is sensitive to noise in the dataset; outliers can significantly affect the position of the decision boundary. I used the standardized features so the model converges faster and so features with large ranges do not dominate the distance calculations. The accuracy of this model is 83.7%, so I would pick it over logistic regression for slightly more accuracy if the data set is not too large and plenty of computational power is available.

Creating a decision tree and testing the accuracy

In [65]:
from sklearn.tree import DecisionTreeClassifier
tree_model = DecisionTreeClassifier(max_depth=7,min_samples_split=7,random_state=5)
tree_model.fit(X_train, y_train)
print('Accuracy: %.3f' % tree_model.score(X_test, y_test))

Accuracy: 0.852


Decision trees can be used for classification and are flexible, interpretable models, but they may suffer from overfitting and instability. They are simple and take less computational power. Decision trees can implicitly rank features based on their importance for prediction. I trained the tree on the unstandardized features because tree splits do not depend on feature scale. The accuracy of this decision tree is 85.2%.
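The feature ranking mentioned above can be read from the `feature_importances_` attribute of a fitted tree. Here is a minimal sketch on synthetic data where only the first feature determines the label; the feature names are made up for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: only the first column determines the label.
rng = np.random.default_rng(5)
X_toy = rng.normal(size=(300, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)

toy_tree = DecisionTreeClassifier(max_depth=3, random_state=5).fit(X_toy, y_toy)

# Importances sum to 1; larger values mean the feature drove more of the splits.
feature_names = ["signal", "noise_a", "noise_b"]  # hypothetical names
for name, importance in zip(feature_names, toy_tree.feature_importances_):
    print(f"{name}: {importance:.3f}")
```

The same attribute is available on `tree_model` above, where it could be paired with `df_Xvalues.columns` to see which income features the tree relied on.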

Creating a KNN model and testing the accuracy

In [66]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=10)
knn.fit(X_train_std, y_train)
print('Accuracy: %.3f' % knn.score(X_test_std, y_test))

Accuracy: 0.828


The KNN model works by storing all of the training data points and predicting the class of a new data point from the majority class of its K nearest neighbors. It is computationally intensive at prediction time, but it does not make any assumptions about the distribution of the data. The accuracy of the model is 82.8%.
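The neighbor-voting idea can be sketched in a few lines of NumPy. The function `knn_predict` and the toy points are hypothetical and only illustrate the technique. `KNeighborsClassifier` uses more efficient neighbor searches, but the prediction rule with default settings is the same majority vote.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the class of x_new by majority vote of its k nearest neighbors."""
    # Euclidean distance from x_new to every stored training point.
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training points.
    nearest = np.argsort(distances)[:k]
    # Count the labels among those neighbors and return the most common one.
    votes = np.bincount(y_train[nearest])
    return int(votes.argmax())

# Two well-separated toy clusters.
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([0.3, 0.3])))  # near the first cluster
print(knn_predict(X_toy, y_toy, np.array([4.8, 5.0])))  # near the second cluster
```

Because every prediction computes distances to all stored points, the cost grows with the size of the training set, which is why KNN is slow on large data.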

Creating a random forest model and testing the accuracy

In [67]:
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=140, criterion='entropy',random_state=5)
forest.fit(X_train, y_train)
print('Accuracy: %.3f' % forest.score(X_test, y_test))

Accuracy: 0.857


The random forest model constructs multiple decision trees during training and outputs the class that most of the individual trees predict (the mode of their predictions). It is typically more accurate than an individual decision tree but is slower to train. Combining many trees reduces overfitting. The accuracy is 85.7%, my highest accuracy.
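As a rough illustration of how the trees are combined: scikit-learn's `RandomForestClassifier` averages the class probabilities predicted by the individual trees and picks the class with the highest average, which in most cases matches a majority vote. The sketch below checks this on synthetic data; `forest_toy` and the data are illustrative, not the model trained above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data and a small forest (not the income data or the model above).
X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=5)
forest_toy = RandomForestClassifier(n_estimators=25, random_state=5).fit(X_toy, y_toy)

# Collect each tree's class probabilities and average them across trees.
tree_probas = np.stack([t.predict_proba(X_toy) for t in forest_toy.estimators_])
averaged = tree_probas.mean(axis=0)

print("Averaged tree probabilities match the forest:",
      np.allclose(averaged, forest_toy.predict_proba(X_toy)))

# The predicted class is the one with the highest averaged probability.
manual_prediction = forest_toy.classes_[averaged.argmax(axis=1)]
print("Share of matching predictions:",
      (manual_prediction == forest_toy.predict(X_toy)).mean())
```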

# **Submitting predictions**

Import the test data

In [68]:
#Import the test set as a data frame
test_data = pd.read_csv('/kaggle/input/adult-income-classifacation/test.csv')
test_data.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,id
0,31,Private,224234,HS-grad,9,Never-married,Transport-moving,Own-child,Black,Male,0,0,40,United-States,392
1,25,Private,149486,HS-grad,9,Never-married,Machine-op-inspct,Unmarried,Black,Male,0,0,40,United-States,1900
2,36,Self-emp-not-inc,343721,Doctorate,16,Never-married,Prof-specialty,Not-in-family,White,Male,0,0,30,?,24507
3,26,?,131777,Bachelors,13,Married-civ-spouse,?,Husband,White,Male,0,2002,40,United-States,32817
4,30,Local-gov,44566,Bachelors,13,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,40,United-States,47893


Filling in the missing values with the mode of their respective columns

In [69]:
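#Note: the '?' placeholders in test_data were not converted to nulls first,
#so fillna leaves them unchanged and '?' is later encoded as its own category
test_data['workclass'] = test_data['workclass'].fillna(test_data['workclass'].mode()[0])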
test_data['occupation'] = test_data['occupation'].fillna(test_data['occupation'].mode()[0])
test_data['native-country'] = test_data['native-country'].fillna(test_data['native-country'].mode()[0])
print(test_data.isnull().sum())
print(test_data.shape)

age                0
workclass          0
fnlwgt             0
education          0
educational-num    0
marital-status     0
occupation         0
relationship       0
race               0
gender             0
capital-gain       0
capital-loss       0
hours-per-week     0
native-country     0
id                 0
dtype: int64
(9769, 15)


Label encoding all of the categorical variables that are not integers

In [70]:
label_encoders_test_data = {}
for column in test_data.columns:
    if test_data[column].dtype == 'object':
        label_encoders_test_data[column] = LabelEncoder()
        test_data.loc[:, column] = label_encoders_test_data[column].fit_transform(test_data[column])
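Fitting new encoders on the test data can assign different integer codes than the encoders fitted on the training data. In the printed rows of this notebook, Private is encoded as 3 in the training frame but as 4 in the test array, which would happen if the unconverted '?' took code 0 in the test data. A more consistent approach would be to reuse the training-time mode and encoder, as in this sketch on made-up frames (`train_toy` and `test_toy` are hypothetical).

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-ins for the training and test frames (illustrative values).
train_toy = pd.DataFrame({"workclass": ["Private", "Local-gov", "Private", "Self-emp-inc"]})
test_toy = pd.DataFrame({"workclass": ["?", "Self-emp-inc", "Private"]})

# Fit the encoder and compute the mode on the training data only.
encoder = LabelEncoder().fit(train_toy["workclass"])
train_mode = train_toy["workclass"].mode()[0]

# Convert '?' to null, fill with the training mode, then reuse the fitted encoder.
test_toy["workclass"] = test_toy["workclass"].replace("?", pd.NA).fillna(train_mode)
test_toy["workclass_code"] = encoder.transform(test_toy["workclass"])

print({str(label): code for code, label in enumerate(encoder.classes_)})
print(test_toy)
```

In the notebook itself, this would mean calling `label_encoders[column].transform(...)` on each test column instead of fitting new encoders. Note that `transform` raises a `ValueError` for categories that never appeared in the training data, so those would need separate handling.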


Converting the feature variables into an array

In [71]:
X_array_test_data = test_data.values
print(X_array_test_data)

[[31 4 224234 ... 40 38 392]
 [25 4 149486 ... 40 38 1900]
 [36 6 343721 ... 30 0 24507]
 ...
 [55 2 254949 ... 40 38 34782]
 [40 4 135056 ... 45 38 23538]
 [45 4 205644 ... 26 38 23097]]


**I selected the random forest model because it was my most accurate model and it inherently reduces overfitting. It can handle complex relationships between features. I used the random forest model to predict the income values for the hidden data and created a data frame of the predicted values.**

In [72]:
predicted_values = forest.predict(X_array_test_data)
income = pd.DataFrame(predicted_values, columns=['income'])
print(income)

      income
0          0
1          0
2          0
3          0
4          0
...      ...
9764       0
9765       0
9766       0
9767       0
9768       0

[9769 rows x 1 columns]


I combined the id column from the test_data data frame and the income column from the income data frame to make a data frame for submission. Then I exported the submission data frame to a csv file.

In [73]:
submission = pd.DataFrame({
    'id': test_data['id'],
    'income': income['income']
})

#exporting data frame for submission
#submission.to_csv('submission.csv', index=False)