# COGS 118A - Project Proposal

# Project Description

You will design and execute a machine learning project. There are a few constraints on the nature of the allowed project. 
- The problem addressed will not be a "toy problem" or a common teaching dataset such as mtcars, iris, or Palmer penguins.
- The dataset will have >1k observations and >5 variables. I'd prefer more like >10k observations and >10 variables. A general rule is that if you have >100x more observations than variables, your solution will likely generalize a lot better. The goal of training a supervised machine learning model is to learn the underlying pattern in a dataset in order to generalize well to unseen data, so choosing a large dataset is very important.

- The project will include a model selection and/or feature selection component where you will be looking for the best setup to maximize the performance of your ML system.
- You will evaluate the performance of your ML system using more than one appropriate metric
- You will be writing a report describing and discussing these accomplishments


Feel free to delete this description section when you hand in your proposal.

### Peer Review

You will all have an opportunity to look at the Project Proposals of other groups to fuel your creativity and get more ideas for how you can improve your own projects. 

Both the project proposal and project checkpoint will have peer review.

# Names

Hopefully your team is at least this good. Obviously you should replace these with your names.

- Galen Ng
- Regan Yang
- Andy Chen

# Abstract 
This section should be short and clearly stated. It should be a single paragraph <200 words.  It should summarize: 
- what your goal/problem is
- what the data used represents and how they are measured
- what you will be doing with the data
- how performance/success will be measured

# Background

According to the United States government<a name="incomeImportance"></a>[<sup>[1]</sup>](#importance-income), accurate survey data about income is hard to come by. Reporting suffers from several issues, such as respondents having difficulty understanding income questions or interpreting them differently, which make self-reported income less reliable than one might think. Accurate income data matters because income is an important factor in determining health-related quality of life<a name="incomeHealth"></a>[<sup>[2]</sup>](#health-income).

# Problem Statement

Given a person's circumstances (age, workclass, education, marital status, occupation, relationship, race, sex, capital gain/loss, native country), can we predict whether they make at least $50k a year or less than $50k a year? This is a binary classification problem.

# Data

Dataset: https://www.kaggle.com/datasets/lodetomasi1995/income-classification

```python
import pandas as pd

# Load the Kaggle income classification CSV into a DataFrame.
data = pd.read_csv("income_evaluation.csv")

# Shape is (32561, 15): 32,561 observations, 15 variables.
data.shape
```

Variable information:
- age: Age of participant (Integer value)
- workclass: Work class (type of employer) of participant (Categorical value)
- fnlwgt: Final weight, i.e. the number of people in the total population that the census estimates this participant represents (Integer value)
- education: Education level of participant (Categorical value: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool)
- education-num: Education level of participant in terms of a number (Integer value 1-16)
- marital-status: Marital status of participant (Categorical value: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse)
- occupation: Occupation of participant (Categorical value: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces)
- relationship: Relationship status of participant (Categorical value: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried)
- race: Race of participant (Categorical value: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black)
- sex: Sex of participant (Categorical value: Female, Male)
- capital-gain: Capital gain of participant (Integer value)
- capital-loss: Capital loss of participant (Integer value)
- hours-per-week: Work hours per week of participant (Integer value)
- native-country: Native country of participant (Categorical value: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands)
- Income: Income level of participant (Categorical value: >= 50k, < 50k)

Transformations/Cleaning to be done:
- One-hot encode all categorical variables
- Drop "education-num" (it duplicates "education", and keeping it would impose an ordering in which higher education is a higher number)
- Standardize the "age", "fnlwgt", and "hours-per-week" columns (express each value as the number of SDs away from the column mean)
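
The cleaning steps above can be sketched with pandas as follows. This is a minimal illustration and not the final pipeline: the `sample` rows are made-up values, the `preprocess` helper and the `NUMERIC_TO_SCALE` constant are hypothetical names, and the sketch assumes the income label has already been separated from the features so that it is not one-hot encoded.

```python
import pandas as pd

# Made-up rows using a few of the dataset's column names (illustrative values only).
sample = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "fnlwgt": [77516, 83311, 215646, 234721],
    "education": ["Bachelors", "Bachelors", "HS-grad", "11th"],
    "education-num": [13, 13, 9, 7],
    "hours-per-week": [40, 13, 40, 40],
    "sex": ["Male", "Male", "Male", "Female"],
})

# Columns the proposal lists for standardization.
NUMERIC_TO_SCALE = ["age", "fnlwgt", "hours-per-week"]


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy of df; the input frame is left unchanged."""
    # Drop the numeric duplicate of the "education" column.
    out = df.drop(columns=["education-num"])

    # Standardize: subtract the column mean, divide by the column SD.
    # ddof=0 (population SD) is an arbitrary choice for this sketch.
    scaled = out[NUMERIC_TO_SCALE]
    out[NUMERIC_TO_SCALE] = (scaled - scaled.mean()) / scaled.std(ddof=0)

    # One-hot encode every remaining non-numeric (categorical) column.
    categorical = out.select_dtypes(exclude="number").columns.tolist()
    return pd.get_dummies(out, columns=categorical)


processed = preprocess(sample)
print(processed.columns.tolist())
```

In the actual project the mean and SD would presumably be computed on the training split only and then reused for the test split, so that no test-set statistics leak into training.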

# Proposed Solution

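Since we are trying to classify whether a participant has an income >= 50k or < 50k, we can use a logistic regression or SVM classifier, each trained with several different regularization strengths. Both are binary classifiers: logistic regression outputs a probability for each class, while an SVM outputs a signed distance from the decision boundary that is thresholded to produce a label. A solution can be tested by splitting the data into training and test sets, fitting each model and regularization strength on the training data, and comparing their performance on the held-out test data.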
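
One way this comparison could look with scikit-learn is sketched below. Several things here are assumptions rather than decisions made in this proposal: the data is a synthetic stand-in from `make_classification`, the SVM uses a linear kernel, the grid of `C` values is arbitrary, and the regularization strength is chosen by 5-fold cross-validation on the training split.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the preprocessed features and the binary income label.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# In scikit-learn, C is the inverse regularization strength
# (smaller C means stronger regularization).
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "linear_svm": SVC(kernel="linear"),
}

results = {}
for name, model in candidates.items():
    # Pick C by cross-validation on the training split only.
    search = GridSearchCV(model, param_grid, cv=5, scoring="accuracy")
    search.fit(X_train, y_train)
    results[name] = {
        "best_C": search.best_params_["C"],
        "cv_accuracy": search.best_score_,
        # The refit best model is scored once on the held-out test split.
        "test_accuracy": search.score(X_test, y_test),
    }

for name, summary in results.items():
    print(name, summary)
```

Keeping the test split out of the hyperparameter search means the reported test accuracy is not biased by the choice of `C`.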

# Evaluation Metrics

One evaluation metric we can use is accuracy, the fraction of data points the model classifies correctly. As we are making binary decisions, we want to know how often our model makes the correct decision for the data points it's given.
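
As a small worked example, accuracy can be computed directly from true and predicted labels. The label lists below are made up for illustration, and the helper names `accuracy` and `majority_baseline_accuracy` are hypothetical. The baseline (always predicting the most common class) is included only as a reference point for interpreting an accuracy score; it is an addition for illustration, not a metric this proposal has committed to.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def majority_baseline_accuracy(y_true):
    """Accuracy obtained by always predicting the most common label."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)


# Made-up labels: 1 = income at or above the threshold, 0 = below it.
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 1, 0, 0, 1, 0]

print(accuracy(y_true, y_pred))            # 6 of 8 correct -> 0.75
print(majority_baseline_accuracy(y_true))  # 5 of 8 are class 0 -> 0.625
```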

# Ethics & Privacy

The data is sourced from the 1994 Census database<a name="uci"></a>[<sup>[3]</sup>](#ucinote), so the participants consented to giving their information and the data is anonymized. Further, the classification of >=50k and <50k does not expose what a participant's actual income is. There is potential for bias in this dataset, as certain areas of the US are known to be more or less wealthy, and there is no mention of where within the US the data was collected. One-hot encoding variables such as education level also avoids imposing an artificial numeric ordering on the categories, although it does not remove any real association between education and income.

# Team Expectations 

Put things here that cement how you will interact/communicate as a team, how you will handle conflict and difficulty, how you will handle making decisions and setting goals/schedule, how much work you expect from each other, how you will handle deadlines, etc...
* *Respond to messages on Discord in a timely manner (preferably within 24 hours)*
* *Do the work assigned to you*
* *Come to and participate in meetings*

# Project Timeline Proposal

Replace this with something meaningful that is appropriate for your needs. It doesn't have to be something that fits this format.  It doesn't have to be set in stone... "no battle plan survives contact with the enemy". But you need a battle plan nonetheless, and you need to keep it updated so you understand what you are trying to accomplish, who's responsible for what, and what the expected due dates are for each item.

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 5/16  |  4PM |  Come up with ideas for Project Proposal  | Work on Project Proposal | 
| 5/24  |  4PM | Read feedback (TA+peer) | Discuss and update work based on feedback; Decide on work split | 
| 5/29  | 4PM  | Finish individual work split | Proofread + finalize checkpoint |
| 6/7 | 4PM | Read feedback | Discuss and update work based on feedback; Decide on work split  |
| 6/12  | 4PM  | Finalize individual work split | Discuss/edit project code + writing; Complete project |

# Footnotes
<a name="importance-income"></a>1.[^](#incomeImportance): https://www.census.gov/content/dam/Census/library/working-papers/1997/adrm/sm97-05.pdf<br> 
<a name="health-income"></a>2.[^](#incomeHealth): https://equityhealthj.biomedcentral.com/articles/10.1186/s12939-019-0942-1<br>
<a name="ucinote"></a>3.[^](#uci): http://archive.ics.uci.edu/ml/datasets/Adult
