# COGS 118A - Project Checkpoint

# Names
- Galen Ng
- Regan Yang
- Andy Chen

# Abstract 
The goal of this project is to predict whether a person with a given set of attributes makes more or less than $50,000 a year. The data is a Kaggle dataset containing features that relate to an individual's income, such as gender, race, occupation, and other information that can be used to classify a person. Some features, such as race or occupation, do not have numerical values, so we will one-hot encode them. We will perform exploratory data analysis on these features and look for sources of bias that we can exclude, such as a feature that encodes higher education as a larger number. Once that is done, we will use the cleaned dataset to classify people as making more or less than $50,000 annually. Model performance is measured by accuracy and F1-score.

# Background

According to the United States government<a name="incomeImportance"></a>[<sup>[1]</sup>](#importance-income), accurate survey data about income is hard to come by. There are many issues in income reporting, such as difficulty understanding income questions, differing interpretations of those questions, and other factors that make reported income less reliable than one might think. Accurate income data matters because income is an important factor in determining health-related quality of life<a name="incomeHealth"></a>[<sup>[2]</sup>](#health-income). Furthermore, once a person earns about $50k, they can be classified as middle class<a name="midIncome"></a>[<sup>[3]</sup>](#mid-income).

# Problem Statement
Given a set of initial circumstances (age, workclass, education, marital status, occupation, relationship, race, sex, capital gain/loss, native country), does a person make more than $50k a year or not? To answer this question, we can develop an ML model that takes these circumstances as input and produces a binary classification. We will make our results replicable by clearly documenting our steps. To make the problem measurable, we use accuracy and F1-score as performance metrics. Accuracy is the ratio of correct predictions to total predictions. Because accuracy can be misleading when the classes are imbalanced, we also report F1-score, the harmonic mean of precision and recall.
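To illustrate why accuracy alone can mislead on imbalanced data, here is a minimal sketch that computes both metrics from confusion counts. The helper `accuracy_and_f1` and the toy labels are made up for illustration and are not taken from the project data.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Compute accuracy and F1 for binary labels from confusion counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, f1


# Toy, made-up labels: 8 negatives and 2 positives (imbalanced).
y_true = [0] * 8 + [1] * 2
# A degenerate "model" that always predicts the majority class.
y_always_negative = [0] * 10

acc, f1 = accuracy_and_f1(y_true, y_always_negative)
print(acc, f1)  # 0.8 0.0
```

The always-negative predictor reaches 80% accuracy while never finding a positive example, which the F1-score of 0 makes visible.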

# Data
Dataset: https://www.kaggle.com/datasets/lodetomasi1995/income-classification

In [None]:
import pandas as pd

data = pd.read_csv("income_evaluation.csv")
data.shape
# 32,561 observations, 15 variables

Variable information:
- age: Age of participant (Integer value)
- workclass: Work class (type of employer) of participant (Categorical value)
- fnlwgt: Final weight, i.e. the number of people in the total population that the census estimates this participant represents (Integer value)
- education: Education level of participant (Categorical value: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool)
- education-num: Education level of participant in terms of a number (Integer value 1-16)
- marital-status: Marital status of participant (Categorical value: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse)
- occupation: Occupation of participant (Categorical value: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces)
- relationship: Relationship status of participant (Categorical value: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried)
- race: Race of participant (Categorical value: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black)
- sex: Sex of participant (Categorical value: Female, Male)
- capital-gain: Capital gain of participant (Integer value)
- capital-loss: Capital loss of participant (Integer value)
- hours-per-week: Work hours per week of participant (Continuous value)
- native-country: Native country of participant (Categorical value: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands)
- Income: Income level of participant (Categorical value: >50K, <=50K)
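Before modeling, it can help to check the class balance of the target. The sketch below uses a small made-up frame rather than the real file, and assumes that column names and string values may carry leading whitespace, which is worth verifying on the actual CSV. The helper `tidy` and the toy rows are hypothetical.

```python
import pandas as pd

# Hypothetical rows that imitate the dataset's layout; not real census records.
toy = pd.DataFrame({
    " age": [39, 50, 38, 53],
    " sex": [" Male", " Male", " Female", " Male"],
    " income": [" <=50K", " >50K", " <=50K", " <=50K"],
})


def tidy(df):
    """Strip surrounding whitespace from column names and string values."""
    out = df.copy()
    out.columns = out.columns.str.strip()
    for col in out.columns:
        if pd.api.types.is_string_dtype(out[col]):
            out[col] = out[col].str.strip()
    return out


clean = tidy(toy)

# Share of each income class; a strong skew is a reason to report F1 as well as accuracy.
class_share = clean["income"].value_counts(normalize=True)
print(class_share)
```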

Transformations/Cleaning to be done:
- One-hot encode all variables with categorical values
- Drop "education-num" (avoids the potential bias of higher education being represented by a larger number)
- Standardize the "age", "fnlwgt", and "hours-per-week" columns (express each value as the number of standard deviations from the column mean)
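The cleaning steps above can be sketched with scikit-learn's `ColumnTransformer`. This is one possible approach under stated assumptions, not necessarily the final pipeline: the toy rows are made up, and only two categorical columns are shown for brevity.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows using a subset of the dataset's column names.
toy = pd.DataFrame({
    "age": [25, 38, 52, 41],
    "fnlwgt": [226802, 89814, 336951, 160323],
    "hours-per-week": [40, 50, 40, 30],
    "education": ["11th", "HS-grad", "Bachelors", "HS-grad"],
    "education-num": [7, 9, 13, 9],
    "sex": ["Male", "Male", "Female", "Female"],
})

numeric_cols = ["age", "fnlwgt", "hours-per-week"]
categorical_cols = ["education", "sex"]

preprocess = ColumnTransformer(
    transformers=[
        # Standardize: subtract the column mean, divide by the standard deviation.
        ("num", StandardScaler(), numeric_cols),
        # One-hot encode; categories unseen at fit time become all-zero rows.
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    remainder="drop",
    sparse_threshold=0,  # always return a dense array
)

# Drop the ordinal education code before encoding.
features = toy.drop(columns=["education-num"])
X = preprocess.fit_transform(features)
print(X.shape)  # 3 numeric + 3 education + 2 sex columns
```

Fitting the transformer on the training split only, then applying it to the test split, keeps test-set statistics from leaking into the standardization.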

# Proposed Solution

Since we are classifying whether a participant's income is above or below the $50k threshold, we can use logistic regression or an SVM classifier, each with different regularization strengths. Logistic regression outputs the probability that a data point belongs to the positive class, while an SVM outputs a signed distance from the decision boundary; in both cases, thresholding the output gives a binary classification. A solution can be tested by fitting on a training split, evaluating on a held-out test split, and repeating over different splits. We chose logistic regression because it provides probability-based outcomes, is computationally simple, and tends to be robust to noise when regularized. Similarly, an SVM can limit overfitting via regularization, which helps with the issues that arise from higher-dimensional data such as one-hot encoded features.
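A minimal sketch of comparing the two classifiers across regularization strengths is shown below. It uses synthetic data from `make_classification` as a stand-in for the preprocessed features, and the grid of `C` values is an arbitrary choice for illustration. In scikit-learn, `C` is the inverse of the regularization strength, so smaller values regularize more.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic, imbalanced stand-in data; not the census dataset.
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=5,
    weights=[0.75, 0.25], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

test_accuracy = {}
for C in (0.01, 1.0, 100.0):  # illustrative grid, small C = strong regularization
    models = {
        "logreg": LogisticRegression(C=C, max_iter=1000),
        "svm": LinearSVC(C=C, max_iter=10000),
    }
    for name, model in models.items():
        model.fit(X_train, y_train)
        test_accuracy[(name, C)] = model.score(X_test, y_test)

for key, acc in sorted(test_accuracy.items()):
    print(key, round(acc, 3))
```

In practice the regularization strength would be chosen on validation data rather than on the test split, so that the test score remains an unbiased estimate.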

# Evaluation Metrics

One evaluation metric we can use is accuracy. Since we are making binary decisions, we want to know how often our model makes the correct decision for the data points it is given. We also look at the F1-score, which combines precision and recall and is more informative than accuracy when the classes are imbalanced. To estimate both metrics, we can use k-fold cross-validation: the data is split into k folds, and each fold is used once as the validation set while the model is trained on the remaining k-1 folds. Averaging over the folds gives a performance estimate that depends less on any single train-test split, and so better reflects how the model will perform on new data.
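The evaluation procedure can be sketched with scikit-learn's `cross_validate`, which reports several metrics per fold. The synthetic data, the choice of k = 5, and the use of stratified folds are assumptions made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data; not the census dataset.
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.75, 0.25], random_state=0
)

# Scaling lives inside the pipeline so it is refit on each training fold.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Stratified folds keep the class ratio roughly equal across folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(model, X, y, cv=cv, scoring=["accuracy", "f1"])

mean_accuracy = scores["test_accuracy"].mean()
mean_f1 = scores["test_f1"].mean()
print(f"accuracy: {mean_accuracy:.3f}, F1: {mean_f1:.3f}")
```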

# Preliminary results


# Ethics & Privacy

The data is sourced from the 1994 Census database<a name="uci"></a>[<sup>[4]</sup>](#ucinote), so the participants consented to giving their information and the data is anonymized. Further, the binary income classification (above or below the $50k threshold) does not expose a participant's actual income. There is potential for bias in this dataset, as certain areas of the US are known to be more or less wealthy, and there is no mention of where within the US the data was collected. Representing variables such as education level with one-hot encodings avoids imposing a numeric ordering on them, although it does not remove any real association between higher education and income that is present in the data.

# Team Expectations 

* *Respond to messages on Discord in a timely (preferably no more than 24 hrs) manner*
* *Do the work assigned to you*
* *Come to and participate in meetings*

# Project Timeline Proposal

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 5/16  |  4PM |  Come up with ideas for Project Proposal  | Work on Project Proposal | 
| 5/24  |  4PM | Read feedback (TA+peer) | Discuss and update work based on feedback; Decide on work split | 
| 5/28  | 5PM  | Finish individual work split | Proofread + finalize checkpoint |
| 6/7 | 4PM | Read feedback | Discuss and update work based on feedback; Decide on work split  |
| 6/12  | 4PM  | Finalize individual work split | Discuss/edit project code + writing; Complete project |

# Footnotes
<a name="importance-income"></a>1.[^](#incomeImportance): https://www.census.gov/content/dam/Census/library/working-papers/1997/adrm/sm97-05.pdf<br> 
<a name="health-income"></a>2.[^](#incomeHealth): https://equityhealthj.biomedcentral.com/articles/10.1186/s12939-019-0942-1<br>
<a name="mid-income"></a>3.[^](#midIncome): https://money.usnews.com/money/personal-finance/family-finance/articles/where-do-i-fall-in-the-american-economic-class-system<br>
<a name="ucinote"></a>4.[^](#uci): http://archive.ics.uci.edu/ml/datasets/Adult