## Project Proposal

### Introduction
For our project, we chose a fairly straightforward classification topic. Given information about a person, such as their education, age, and occupation, we want to predict whether their income is greater than $50,000.

We will be using the **Adult** dataset from https://archive.ics.uci.edu/dataset/2/adult, which is based on the 1994 census. As loaded, the data has 32,560 rows, each representing a single person, and 15 columns, each describing one attribute of that person. The columns are:
 - age
 - workclass: self-employed, private, etc.
 - fnlwgt: final weight (the census estimate of how many people in the population this record represents)
 - education: the highest level of education achieved
 - education-num: the highest level of education achieved (numerical)
 - marital-status: married, single, etc.
 - occupation: general type of occupation (sales, services, etc.)
 - relationship: wife, husband, own-child, etc.
 - race: White, Black, Asian, etc.
 - sex: biological sex (Male, Female)
 - capital-gain
 - capital-loss
 - hours-per-week: hours at work each week
 - native-country: country of origin
 - income: <=50K, >50K

### Preliminary Data Analysis

In [73]:
# Loading libraries
library(tidyverse)
library(repr)
library(tidymodels)

In [70]:
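# Reading the downloaded CSV file.
# Note: the file has no header row, so read_csv() uses the first record as
# column names; that record is therefore not among the 32,560 rows loaded.
adult <- read_csv("data/adult.csv")

# Adding column names
colnames(adult) <- c('age', 'workclass', 'fnlwgt', 'education', 'education_num',
                     'marital_status', 'occupation', 'relationship', 'race', 'sex',
                     'capital_gain', 'capital_loss', 'hours_per_week', 'native_country', 'income')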
slice(adult, 1:4)

[1mRows: [22m[34m32560[39m [1mColumns: [22m[34m15[39m
[36m──[39m [1mColumn specification[22m [36m────────────────────────────────────────────────────────[39m
[1mDelimiter:[22m ","
[31mchr[39m (9): State-gov, Bachelors, Never-married, Adm-clerical, Not-in-family, W...
[32mdbl[39m (6): 39, 77516, 13, 2174, 0, 40

[36mℹ[39m Use `spec()` to retrieve the full column specification for this data.
[36mℹ[39m Specify the column types or set `show_col_types = FALSE` to quiet this message.


| age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `<dbl>` | `<chr>` | `<dbl>` | `<chr>` | `<dbl>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<chr>` | `<chr>` |
| 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |

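The column specification message above lists data values such as `State-gov` and `39` as column names, which suggests the file has no header row. The following is a minimal sketch of how `read_csv()` behaves in that situation, using two records copied from the preview above as inline text. The names `raw_text`, `adult_cols`, `guessed_header`, and `named_cols` are hypothetical and used only for illustration; the assumption is that the real file has the same header-less layout.

```r
library(readr)

# Two records from the preview above, standing in for a header-less file
raw_text <- paste(
  "50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K",
  "38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K",
  sep = "\n"
)

adult_cols <- c("age", "workclass", "fnlwgt", "education", "education_num",
                "marital_status", "occupation", "relationship", "race", "sex",
                "capital_gain", "capital_loss", "hours_per_week",
                "native_country", "income")

# Default behaviour: the first line is consumed as the header
guessed_header <- read_csv(I(raw_text), show_col_types = FALSE)

# Supplying col_names tells read_csv() that every line is data
named_cols <- read_csv(I(raw_text), col_names = adult_cols, show_col_types = FALSE)

nrow(guessed_header)  # 1 row: the first record became column names
nrow(named_cols)      # 2 rows: both records are kept
```

If the same holds for `data/adult.csv`, passing `col_names` directly to `read_csv()` would keep the first record and make the separate `colnames()` step unnecessary.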

In [71]:
# Many of the predictors are categorical, so we convert them to integer codes (one code per factor level)
adult_numerical <- adult |>
    select(workclass, education, marital_status, occupation, relationship, race, sex, native_country) |>
    mutate(across(workclass:native_country, as.factor)) |>
    sapply(unclass) 
adult_final <- adult |>
    select(age, fnlwgt, education_num, capital_gain, capital_loss, hours_per_week, income)
adult_final <- cbind(adult_final, adult_numerical) |>
    mutate(income = as_factor(income))

# Getting the training data
set.seed(3456) 

# Randomly assign 75% of the data to the training set, stratified by income
adult_split <- initial_split(adult_final, prop = 0.75, strata = income)  
adult_train <- training(adult_split)   
adult_test <- testing(adult_split)

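To make the two steps in the cell above concrete, here is a small sketch on made-up data. It shows how `as.factor()` followed by an integer conversion assigns codes (levels are sorted alphabetically, so the codes carry no real ordering), and how `strata = income` keeps the class proportions similar in the training and testing sets. The `toy` data frame and every name derived from it are hypothetical, not part of the project data.

```r
library(tidyverse)
library(tidymodels)

set.seed(3456)

# Hypothetical toy data: 1000 rows, 24% of them in the ">50K" class
toy <- tibble(
  workclass = rep(c("Private", "State-gov", "Private", "Federal-gov"), times = 250),
  income = factor(rep(c("<=50K", ">50K"), times = c(760, 240)))
)

# Factor levels are sorted alphabetically, so the integer codes are
# Federal-gov = 1, Private = 2, State-gov = 3
toy <- toy |>
  mutate(workclass_code = as.integer(as.factor(workclass)))

distinct(toy, workclass, workclass_code) |>
  arrange(workclass_code)

# Stratified 75/25 split: each income class is sampled separately
toy_split <- initial_split(toy, prop = 0.75, strata = income)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# Share of ">50K" rows in each set; both should be close to 0.24
train_share <- mean(toy_train$income == ">50K")
test_share <- mean(toy_test$income == ">50K")
```

Because the integer codes reflect alphabetical order only, distances between them (for example, between codes 1 and 3) do not correspond to any meaningful difference between categories.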
In [72]:
# Beginning data analysis:
# We will take the mean of our numerical predictors and the mode of our categorical predictors

analysis_mean <- adult_train |>
    select(age, fnlwgt, capital_gain, capital_loss, hours_per_week) |>
    map_df(mean, na.rm = TRUE)

getmode <- function(v) {
   uniqv <- unique(v)
   uniqv[which.max(tabulate(match(v, uniqv)))]
}

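# Note: this uses the full `adult` table (not only adult_train), which still
# holds the original category labels rather than integer codes
analysis_mode <- adult |>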
    select(workclass, education, marital_status, occupation, relationship, race, sex, native_country) |>
    map_df(getmode)

# Mean value of our numerical predictors
analysis_mean
# Most frequent value of our categorical predictors 
analysis_mode

| age | fnlwgt | capital_gain | capital_loss | hours_per_week |
|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 38.60363 | 189183.4 | 1136.137 | 88.29805 | 40.44252 |


| workclass | education | marital_status | occupation | relationship | race | sex | native_country |
|---|---|---|---|---|---|---|---|
| `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` |
| Private | HS-grad | Married-civ-spouse | Prof-specialty | Husband | White | Male | United-States |

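As a quick illustration of how the `getmode()` helper above behaves, the sketch below applies it to two small made-up vectors. One detail worth knowing is that when two values are tied for most frequent, `which.max()` picks the first maximum, so the value that appears first in the input is returned. The names `tied_mode` and `numeric_mode` are hypothetical.

```r
# Same helper as in the cell above
getmode <- function(v) {
  uniqv <- unique(v)                            # distinct values, in order of first appearance
  uniqv[which.max(tabulate(match(v, uniqv)))]   # value with the highest count
}

# Tie between "HS-grad" and "Bachelors": the first value seen is returned
tied_mode <- getmode(c("HS-grad", "Bachelors", "Bachelors", "HS-grad"))
tied_mode     # "HS-grad"

# Works for numeric vectors as well
numeric_mode <- getmode(c(1, 2, 2, 3))
numeric_mode  # 2
```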

### Methods

Our question is a classification problem, so we plan to use the K-nearest neighbors (KNN) model in tidymodels. Because we converted every column to numerical values, any column can serve as a predictor. We want to use the majority of the available columns as predictors because most of them are fairly relevant to income. However, there are some variables that we will either drop or modify:
1. The relationship and marital_status columns are largely redundant, so we will use only marital_status.
2. Workclass is not very relevant to income, so we will not use it as a predictor.
3. Marital status records whether a person is married to a civilian or to a member of the armed forces. This distinction isn't important to income, so we will convert all such values to just "married".
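
The plan above can be sketched roughly as follows, on a tiny made-up data set rather than the real training data. The sketch collapses the two married categories into one (item 3), integer-codes the result in the same way as the earlier preprocessing cell, and fits a KNN classifier through a tidymodels workflow. Several things here are assumptions for illustration only: the `toy_train` data, the level name `Married-AF-spouse` for the armed-forces category, the choice of `neighbors = 3`, and the particular predictors used. The `kknn` package must be installed for the engine to work.

```r
library(tidyverse)
library(tidymodels)

set.seed(3456)

# Hypothetical toy data standing in for adult_train
toy_train <- tibble(
  age = c(25, 38, 52, 46, 29, 61, 33, 44, 57, 23, 40, 36),
  hours_per_week = c(40, 45, 60, 50, 20, 40, 38, 55, 45, 30, 42, 40),
  marital_status = c("Never-married", "Married-civ-spouse", "Married-civ-spouse",
                     "Married-AF-spouse", "Never-married", "Divorced",
                     "Never-married", "Married-civ-spouse", "Married-AF-spouse",
                     "Never-married", "Divorced", "Married-civ-spouse"),
  income = factor(c("<=50K", ">50K", ">50K", ">50K", "<=50K", "<=50K",
                    "<=50K", ">50K", ">50K", "<=50K", "<=50K", ">50K"),
                  levels = c("<=50K", ">50K"))
)

# Item 3: treat civilian and armed-forces spouses as a single "married" category,
# then integer-code the column as in the preprocessing cell
toy_train <- toy_train |>
  mutate(
    marital_status = if_else(
      marital_status %in% c("Married-civ-spouse", "Married-AF-spouse"),
      "married",
      marital_status
    ),
    marital_code = as.integer(as.factor(marital_status))
  )

# Scale and center predictors so no single variable dominates the distance
knn_recipe <- recipe(income ~ age + hours_per_week + marital_code, data = toy_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# KNN classifier; neighbors = 3 is an arbitrary value chosen for this sketch
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 3) |>
  set_engine("kknn") |>
  set_mode("classification")

knn_fit <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec) |>
  fit(data = toy_train)

# Predicted classes are returned in a .pred_class column
toy_predictions <- predict(knn_fit, toy_train)
```

In the actual analysis, the number of neighbors would more sensibly be chosen by cross-validation on the training set than fixed in advance.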