# INDUCTIVE / DEDUCTIVE REASONING
Inductive reasoning is the process of drawing general inferences from a pattern noticed in specific, empirical data gathered through observation (a dataset, an experiment, etc.). This is in contrast to deductive reasoning, which begins with a general theory or premise and applies it to a specific case. Essentially, inductive reasoning hunts for a probable theory that explains the observations, while deductive reasoning derives conclusions that must hold if the premises are true.

An example of the former is noticing that the clock tower in your neighborhood has rung exactly at noon every day, without fail, and concluding that the bell probably always rings at noon. An example of the latter is starting from the premise that all mammals have multiple organs, adding the premise that dogs are mammals, and concluding that dogs must have multiple organs.

# PREPROCESSING

In [144]:
import pandas as pd
from pandas import DataFrame, Series
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import ConfusionMatrixDisplay, classification_report

In [150]:
adult_df = pd.read_csv('adult.data', delimiter = ", ", engine = 'python') #Had to manually enter column names, so I set a custom delimiter (multi-character delimiters require the python engine)

adult_df.head(20)


Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,fiftyk
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
5,37,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,<=50K
6,49,Private,160187,9th,5,Married-spouse-absent,Other-service,Not-in-family,Black,Female,0,0,16,Jamaica,<=50K
7,52,Self-emp-not-inc,209642,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,45,United-States,>50K
8,31,Private,45781,Masters,14,Never-married,Prof-specialty,Not-in-family,White,Female,14084,0,50,United-States,>50K
9,42,Private,159449,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,5178,0,40,United-States,>50K
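
The UCI Adult files conventionally mark missing entries with a `?` rather than leaving them empty. If that also holds for this hand-edited copy of `adult.data` (an assumption), `isna()`-based cleaning will not see those entries. A minimal sketch of reading them as nulls with `na_values`; the sample rows and the names `sample`, `sample_df`, and `missing_per_column` are made up for illustration:

```python
import io
import pandas as pd

# Hypothetical three-row sample in the same ", "-separated layout
sample = io.StringIO(
    "age, workclass, fiftyk\n"
    "39, State-gov, <=50K\n"
    "54, ?, >50K\n"
    "28, Private, <=50K\n"
)

# na_values tells pandas to read "?" as NaN;
# engine="python" is needed for a multi-character separator
sample_df = pd.read_csv(sample, delimiter=", ", engine="python", na_values="?")

# Count the nulls per column after parsing
missing_per_column = sample_df.isna().sum()
print(missing_per_column)
```

With the placeholder parsed as NaN, a mode-based fill like the one in the next cell would then apply to those rows as well.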


In [146]:
#1. Cleaning up null values
#2. Data cleaning (dashes, odd characters, etc)
#3. Remove extreme outliers
#4. One-Hot encoding
#5. Convert Categorical values to numerical (sometimes that is the same as 4)
#6. Standardization/Normalization
#7. Deal with multicollinearity (can be caused by 4)

# For miscellaneous null cleaning
def null_sub(df: DataFrame = adult_df) -> DataFrame:
    for col in df.columns:
        if df[col].dtype in ['int64', 'float64']: # Numerical columns
            median = df[col].median()
            df[col] = df[col].fillna(median)
        elif df[col].dtype == 'object': # Categorical columns
            if not df[col].isna().all():
                mode = df[col].mode().iloc[0]
                df[col] = df[col].fillna(mode)
    return df

adult_df = null_sub()

# Manually encode the 'sex' column
def std_sex(col: Series = adult_df['sex']) -> Series:
    m = {'Female' : 0, 'Male' : 1}
    col = col.map(m)
    # Unmapped or missing values are NaN after map(); fill them with the mode, since for a
    # binary column the most common value is the likeliest guess for an unknown entry
    col = col.fillna(col.mode().iloc[0])
    return col

adult_df['sex'] = std_sex()

# Manually encode the 'fiftyk' column
def std_fiftyk(col: Series = adult_df['fiftyk']) -> Series:
    m = {'<=50K' : 0, '>50K' : 1}
    col = col.map(m)
    # Unmapped or missing values are NaN after map(); fill them with the mode, since for a
    # binary column the most common value is the likeliest guess for an unknown entry
    col = col.fillna(col.mode().iloc[0])
    return col

adult_df['fiftyk'] = std_fiftyk()

# Clean up categorical data before encoding to handle whitespace and odd characters
def clean_char(df: DataFrame = adult_df) -> DataFrame:
    for col in df.columns:
        if df[col].dtype == 'object': # Categorical columns
            df[col] = df[col].str.strip()
            df[col] = df[col].str.replace('-', ' ')
    return df

adult_df = clean_char()

# One hot encode all categorical string columns
def encode(df: DataFrame = adult_df) -> DataFrame:
    for col in df.columns:
        if df[col].dtype == 'object':
            temp = pd.get_dummies(df[col], drop_first = True)
            df = df.drop(columns = [col])
            df = df.join(temp)
    return df

adult_df = encode()


adult_df.head(20)

Unnamed: 0,age,fnlwgt,education-num,sex,capital-gain,capital-loss,hours-per-week,fiftyk,Federal gov,Local gov,...,Portugal,Puerto Rico,Scotland,South,Taiwan,Thailand,Trinadad&Tobago,United States,Vietnam,Yugoslavia
0,39,77516,13,1,2174,0,40,0,False,False,...,False,False,False,False,False,False,False,True,False,False
1,50,83311,13,1,0,0,13,0,False,False,...,False,False,False,False,False,False,False,True,False,False
2,38,215646,9,1,0,0,40,0,False,False,...,False,False,False,False,False,False,False,True,False,False
3,53,234721,7,1,0,0,40,0,False,False,...,False,False,False,False,False,False,False,True,False,False
4,28,338409,13,0,0,0,40,0,False,False,...,False,False,False,False,False,False,False,False,False,False
5,37,284582,14,0,0,0,40,0,False,False,...,False,False,False,False,False,False,False,True,False,False
6,49,160187,5,0,0,0,16,0,False,False,...,False,False,False,False,False,False,False,False,False,False
7,52,209642,9,1,0,0,45,1,False,False,...,False,False,False,False,False,False,False,True,False,False
8,31,45781,14,0,14084,0,50,1,False,False,...,False,False,False,False,False,False,False,True,False,False
9,42,159449,13,1,5178,0,40,1,False,False,...,False,False,False,False,False,False,False,True,False,False
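
Step 6 of the checklist (standardization) is not applied in the cell above, and `StandardScaler` is imported but unused. Tree-based models do not need scaled inputs, but for other model types a minimal sketch, assuming the scaler is fit on the training rows only, might look like this. The toy frame and the names `toy`, `scaler`, `train_scaled`, and `test_scaled` are hypothetical:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric columns standing in for the Adult features
toy = pd.DataFrame({
    "age": [39, 50, 38, 53, 28, 37, 49, 52],
    "hours-per-week": [40, 13, 40, 40, 40, 40, 16, 45],
})

toy_train, toy_test = train_test_split(toy, test_size=0.25, random_state=42)

scaler = StandardScaler()
# Fit on the training rows only so test-set statistics do not leak into the transform
train_scaled = scaler.fit_transform(toy_train)
# Reuse the training mean and standard deviation for the test rows
test_scaled = scaler.transform(toy_test)

print(train_scaled.mean(axis=0))  # approximately 0 for each column
print(train_scaled.std(axis=0))   # approximately 1 for each column
```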


# DECISION TREE MODEL

In [147]:
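X = adult_df.drop('fiftyk', axis = 1) # Note: X still includes 'Unnamed: 0', the row index saved in the file, which is not a real feature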
y = adult_df['fiftyk']

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)

# Find the tree depth with the best test accuracy
# Note: scoring on the test set here makes the final test metrics slightly optimistic
def optimalDepth(X_train, X_test, y_train, y_test) -> int:
    maxScore = 0
    maxDepth = 1

    for d in range(1, 10): # Depths 1 through 9
        model = tree.DecisionTreeClassifier(max_depth = d, random_state = 42)
        model.fit(X_train, y_train)
        currScore = model.score(X_test, y_test) # Mean accuracy on the test set
        if (currScore > maxScore):
            maxScore = currScore
            maxDepth = d
    print(f"Chosen tree depth: {maxDepth}")
    return maxDepth

# Establish depth
depth = optimalDepth(X_train, X_test, y_train, y_test)

# Train the tree model using the optimal depth found above
model = tree.DecisionTreeClassifier(max_depth = depth, random_state = 42)
model.fit(X_train, y_train)

# Predict
treePred = model.predict(X_test)

# Classification Report
class_report = classification_report(y_test, treePred)
print(class_report)

Chosen tree depth: 7
              precision    recall  f1-score   support

           0       0.88      0.95      0.91      7455
           1       0.78      0.57      0.66      2314

    accuracy                           0.86      9769
   macro avg       0.83      0.76      0.79      9769
weighted avg       0.85      0.86      0.85      9769



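To pick the tree depth, I tried depths 1 through 9 (range(1, 10) excludes 10), since we do not want an overly complicated decision tree, and kept the depth with the highest test accuracy, which was 7. Because the depth was selected on the same test set used for the classification report, the reported scores are likely slightly optimistic.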
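
One way to choose the depth without consulting the test set is cross-validation on the training rows. The following is a sketch under stated assumptions, not the method used above: the data is synthetic (`make_classification`), and `cv_depth`, `X_demo`, and `chosen_depth` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; in the notebook this would be X and y from adult_df
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

def cv_depth(X_train, y_train, depths=range(1, 10), folds=5) -> int:
    """Return the depth with the highest mean cross-validated accuracy."""
    best_depth, best_score = None, -1.0
    for d in depths:
        clf = DecisionTreeClassifier(max_depth=d, random_state=42)
        # Each fold is scored on training rows the model did not see while fitting
        score = cross_val_score(clf, X_train, y_train, cv=folds).mean()
        if score > best_score:
            best_depth, best_score = d, score
    return best_depth

chosen_depth = cv_depth(X_tr, y_tr)
final_model = DecisionTreeClassifier(max_depth=chosen_depth, random_state=42).fit(X_tr, y_tr)

# The test set is touched only once, for the final estimate
test_accuracy = final_model.score(X_te, y_te)
print(chosen_depth, test_accuracy)
```

The trade-off is extra training time (one fit per fold per depth) in exchange for a test score that was not used during selection.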

# RANDOM FOREST MODEL

In [148]:
X = adult_df.drop('fiftyk', axis = 1)
y = adult_df['fiftyk']

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)

# Find the number of estimators with the best test accuracy
def optimalEstimators(X_train, X_test, y_train, y_test) -> int:
    maxScore = 0
    maxEstimators = 100

    for n in range(100, 150): # 100 through 149 estimators
        model = RandomForestClassifier(n_estimators = n, max_depth = depth, random_state = 42)
        model.fit(X_train, y_train)
        currScore = model.score(X_test, y_test) # Mean accuracy on the test set
        if (currScore > maxScore):
            maxScore = currScore
            maxEstimators = n
    print(f"Chosen estimators: {maxEstimators}")
    return maxEstimators

# Establish number of estimators
estimators = optimalEstimators(X_train, X_test, y_train, y_test)

# Train the forest using the number of estimators and tree depth found above
model = RandomForestClassifier(n_estimators = estimators, max_depth = depth, random_state = 42)
model.fit(X_train, y_train)

# Predict
forestPred = model.predict(X_test)

# Classification Report
class_report = classification_report(y_test, forestPred)
print(class_report)

Chosen estimators: 149
              precision    recall  f1-score   support

           0       0.86      0.96      0.91      7455
           1       0.81      0.48      0.61      2314

    accuracy                           0.85      9769
   macro avg       0.83      0.72      0.76      9769
weighted avg       0.85      0.85      0.84      9769



As with the decision tree above, I chose a range of estimator counts to test (100 through 149) and wrote a function that keeps the count with the highest test accuracy. I then trained a random forest with that count, reusing the tree depth found previously. The chosen value, 149, is the largest one tried, so a wider range might score slightly higher.
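
Random forests also offer an out-of-bag (OOB) estimate: each tree is fit on a bootstrap sample, so the rows it never saw can score it without a separate validation set. A minimal sketch of comparing estimator counts that way, on synthetic data; the candidate counts, `max_depth=7`, and the names `oob_scores` and `best_n` are illustrative assumptions rather than the approach used above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; in the notebook this would be X_train and y_train
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)

oob_scores = {}
for n in (50, 100, 150):
    forest = RandomForestClassifier(
        n_estimators=n, max_depth=7, oob_score=True, random_state=42
    )
    forest.fit(X_demo, y_demo)
    # oob_score_ is the accuracy on rows left out of each tree's bootstrap sample
    oob_scores[n] = forest.oob_score_

# Keep the count with the highest out-of-bag accuracy
best_n = max(oob_scores, key=oob_scores.get)
print(oob_scores, best_n)
```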

# XGBOOST MODEL

In [149]:
X = adult_df.drop('fiftyk', axis = 1)
y = adult_df['fiftyk']

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)

# Build the XGBoost model, reusing the number of estimators and tree depth chosen above
model = XGBClassifier(n_estimators = estimators, max_depth = depth, random_state = 42)
model.fit(X_train, y_train)

# Predict
xgPred = model.predict(X_test)

# Classification Report
class_report = classification_report(y_test, xgPred)
print(class_report)

              precision    recall  f1-score   support

           0       0.90      0.93      0.91      7455
           1       0.74      0.66      0.70      2314

    accuracy                           0.86      9769
   macro avg       0.82      0.79      0.81      9769
weighted avg       0.86      0.86      0.86      9769



I configured the XGBoost model by reusing the number of estimators and the tree depth chosen for the models above, rather than tuning it separately. This is cheap and works well on this dataset, but those values were selected for other model types and against this particular test set, so the configuration may overfit and would need retuning for other datasets.
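
A rough way to check the overfitting concern is to compare training and test accuracy; a large gap suggests the model is memorizing the training rows. The sketch below uses scikit-learn's `GradientBoostingClassifier` as a stand-in on synthetic data, and `accuracy_gap` is a hypothetical helper. Any scikit-learn-compatible classifier, such as the `XGBClassifier` above, could be passed in the same way.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

def accuracy_gap(model, X_train, X_test, y_train, y_test) -> float:
    """Fit the model and return train accuracy minus test accuracy."""
    model.fit(X_train, y_train)
    return model.score(X_train, y_train) - model.score(X_test, y_test)

# Synthetic stand-in data; in the notebook this would be the Adult split
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

# Stand-in booster with illustrative hyperparameters
booster = GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=42)
gap = accuracy_gap(booster, X_tr, X_te, y_tr, y_te)
print(f"train - test accuracy: {gap:.3f}")
```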