## Predicting Income Bracket with UCI Adult Dataset
JonPaul Ferzacca

Introduction to Data Science

Professor Sepideh Goodarzy

## Project Overview
The objective of this project is to use demographic data to predict income level. The specific task is to predict whether an individual earns more than 50K a year from factors such as age, workclass, education, and marital status, among others.

This problem is fundamentally a binary classification task. Classification is a subcategory of supervised learning where the aim is to predict the categorical class labels of new instances, based on past observations. In this case, the classes are binary - either an individual earns more than 50K a year or they do not.

The motivation behind this project lies in its potential real-world implications. Accurate income predictions can be immensely beneficial to a variety of sectors. For instance, financial institutions can utilize these predictions to determine credit-worthiness or loan eligibility. Similarly, businesses can better understand their customers, allowing for more effective targeting and segmentation. In a societal context, this kind of model can also help policymakers in understanding the dynamics of income distribution and formulating data-driven policies.

Ultimately, our goal is to develop a robust and accurate model that can effectively classify individuals based on their income level, thereby providing a tool that can aid in a variety of economic and social applications. 

The data for this project has been sourced from the UCI Machine Learning Repository's Adult Dataset. (https://archive.ics.uci.edu/dataset/2/adult)

## Data Overview
The dataset for this project is derived from the 1994 Census database and was extracted by Barry Becker. The extraction kept only records that met a fixed set of conditions: individuals had to be above 16 years of age (AAGE > 16), have an adjusted gross income exceeding 100 (AGI > 100), have a final record weight of more than 1 (AFNLWGT > 1), and work more than 0 hours per week (HRSWK > 0).
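
The four extraction conditions amount to a row filter. The following is a minimal sketch, assuming a hypothetical raw census table whose columns are named after the variables in the conditions (`AAGE`, `AGI`, `AFNLWGT`, `HRSWK`). The rows are invented for illustration and are not census data.

```python
import pandas as pd

# Hypothetical raw records; column names mirror the extraction criteria
# and the values are made up for illustration only.
raw = pd.DataFrame({
    "AAGE":    [15, 34, 52, 41],
    "AGI":     [500, 90, 42000, 31000],
    "AFNLWGT": [1200, 800, 1500, 1],
    "HRSWK":   [10, 40, 45, 38],
})

# A record is kept only if it satisfies all four conditions at once.
mask = (
    (raw["AAGE"] > 16)
    & (raw["AGI"] > 100)
    & (raw["AFNLWGT"] > 1)
    & (raw["HRSWK"] > 0)
)
extracted = raw[mask]
print(extracted)
```

In this toy table only the third record passes every condition. The others each fail exactly one: age, income, and record weight respectively.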


**Key features include:**

**Age:** This is a continuous variable representing the age of the individual.

**Workclass:** This categorical variable represents the employment type of the individual and includes categories such as Private, Self-Employed (Not Incorporated), Self-Employed (Incorporated), Federal Government, Local Government, State Government, Without Pay, and Never Worked.

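**FNLWGT:** This is a continuous variable holding the final record weight, an estimate of how many people in the population the record represents.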

**Education:** This categorical variable represents the highest level of education attained by an individual. Categories include Bachelors, Some College, 11th Grade, High School Graduate, Professional School, Associate Degree (Academic), Associate Degree (Vocational), 9th Grade, 7th-8th Grade, 12th Grade, Masters, 1st-4th Grade, 10th Grade, Doctorate, 5th-6th Grade, and Preschool.

**Education-num:** This is a numeric variable that encodes the education level as an integer and roughly tracks the number of educational years completed.

**Marital-status:** This categorical variable represents the marital status of the individual. Categories include Married Civilian Spouse, Divorced, Never Married, Separated, Widowed, Married Spouse Absent, and Married AF Spouse.

**Occupation:** This categorical variable represents the individual's occupation and includes categories such as Tech Support, Craft Repair, Other Service, Sales, Executive Managerial, Professional Specialty, Handlers Cleaners, Machine Operator Inspector, Administrative Clerical, Farming Fishing, Transport Moving, Private House Service, Protective Service, and Armed Forces.

**Relationship:** This categorical variable represents the individual's role in the family. Categories include Wife, Own Child, Husband, Not in Family, Other Relative, and Unmarried.

**Race:** This categorical variable represents the individual's race. Categories include White, Asian-Pacific Islander, American Indian-Eskimo, Other, and Black.

**Sex:** This categorical variable represents the individual's sex and includes categories Female and Male.

**Capital-gain:** This is a continuous variable representing the individual's capital gains.

**Capital-loss:** This is a continuous variable representing the individual's capital losses.

**Hours-per-week:** This is a continuous variable representing the number of hours the individual works per week.

**Native-country:** This categorical variable represents the individual's country of origin. It includes United States, Cambodia, England, Puerto Rico, Canada, Germany, Outlying US (Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican Republic, Laos, Ecuador, Taiwan, Haiti, Colombia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El Salvador, Trinidad & Tobago, Peru, Hong Kong, and the Netherlands.

The target variable is also categorical, representing whether or not an individual earns over 50K a year.

The data is housed in a single table from one source, so no merging of multiple sources is needed. Although the extraction conditions screen out unusable records, some categorical fields still contain missing values, which are addressed during data cleaning.

**Reference:**
Becker, B. (1996). 1994 Census database. Retrieved from https://archive.ics.uci.edu/ml/datasets/adult

The data in this set is tabular, with each row representing an individual record and each column representing a specific attribute of that individual. The raw file contains 32,561 rows and 15 columns: 14 features plus the income target. The features are a mix of categorical and continuous variables.
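
A quick way to confirm the table's shape and its mix of variable types is to inspect the column dtypes. This is a small sketch on an invented three-row frame with a subset of the columns. The same calls would apply to the full DataFrame once it is loaded.

```python
import pandas as pd

# Invented sample rows with a subset of the dataset's columns.
sample = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["State-gov", "Self-emp-not-inc", "Private"],
    "hours-per-week": [40, 45, 40],
    "income": ["<=50K", "<=50K", "<=50K"],
})

# shape gives (rows, columns); select_dtypes splits columns by type.
n_rows, n_cols = sample.shape
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print(n_rows, n_cols)
print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
```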

## Data Cleaning and EDA

In [21]:
# Read in adult.data
import pandas as pd
column_names = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 
                'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 
                'hours-per-week', 'native-country', 'income']

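# Fields are separated by a comma followed by one whitespace character;
# a raw string keeps '\s' as a regex token, and '?' is read as missing (NaN)
df = pd.read_csv('adult.data', names=column_names, sep=r',\s', na_values=["?"], engine='python')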

# Remove Duplicates
df = df.drop_duplicates()

# Print Example of Data
print(df.head())  # prints the first 5 lines of the dataframe

   age         workclass  fnlwgt  education  education-num  \
0   39         State-gov   77516  Bachelors             13   
1   50  Self-emp-not-inc   83311  Bachelors             13   
2   38           Private  215646    HS-grad              9   
3   53           Private  234721       11th              7   
4   28           Private  338409  Bachelors             13   

       marital-status         occupation   relationship   race     sex  \
0       Never-married       Adm-clerical  Not-in-family  White    Male   
1  Married-civ-spouse    Exec-managerial        Husband  White    Male   
2            Divorced  Handlers-cleaners  Not-in-family  White    Male   
3  Married-civ-spouse  Handlers-cleaners        Husband  Black    Male   
4  Married-civ-spouse     Prof-specialty           Wife  Black  Female   

   capital-gain  capital-loss  hours-per-week native-country income  
0          2174             0              40  United-States  <=50K  
1             0             0             

**Identify missing values:** The data was loaded into a Pandas DataFrame, with '?' entries read as missing values, and the missing values in each column were counted. The features 'workclass', 'occupation', and 'native-country' contain missing values. Depending on the goal of the analysis, these can be replaced with the most common value in the respective column.
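
Replacing missing entries with the most common value (mode imputation) could look like the sketch below. It runs on an invented toy frame rather than the real data, and the choice of columns and the `filled` name are illustrative assumptions, not the project's final approach.

```python
import numpy as np
import pandas as pd

# Invented toy frame with gaps in two categorical columns.
toy = pd.DataFrame({
    "workclass": ["Private", np.nan, "Private", "State-gov"],
    "occupation": ["Sales", "Sales", np.nan, "Tech-support"],
})

filled = toy.copy()
for col in ["workclass", "occupation"]:
    # mode() ignores NaN by default; iloc[0] picks the first mode if there is a tie.
    most_common = filled[col].mode().iloc[0]
    filled[col] = filled[col].fillna(most_common)

print(filled)
```

Mode imputation keeps every row but pulls the column further toward its dominant category. Dropping the affected rows with `dropna` is an alternative when that bias is a concern.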

**Removing duplicates:** The dataset was checked for duplicate entries, which can bias the analysis and machine learning models. Any duplicates found were removed from the DataFrame. 
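
Before dropping duplicates, it can be useful to count how many rows would be removed. The following is a minimal sketch on an invented toy frame.

```python
import pandas as pd

# Invented toy frame in which the second row repeats the first.
toy = pd.DataFrame({
    "age": [25, 25, 40],
    "education": ["HS-grad", "HS-grad", "Masters"],
})

# duplicated() flags every repeat of an earlier row (the first copy is kept).
n_duplicates = int(toy.duplicated().sum())
deduped = toy.drop_duplicates().reset_index(drop=True)

print(f"duplicates found: {n_duplicates}, rows remaining: {len(deduped)}")
```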

In [23]:
# Remove Duplicates
df = df.drop_duplicates()

# Check for missing values:
df.isnull().sum()

age                  0
workclass         1836
fnlwgt               0
education            0
education-num        0
marital-status       0
occupation        1843
relationship         0
race                 0
sex                  0
capital-gain         0
capital-loss         0
hours-per-week       0
native-country     582
income               0
dtype: int64

**Feature transformation:** The 'income' column was transformed from categorical to binary for easier analysis and modeling. This process, known as encoding, is often necessary for machine learning algorithms as they typically require numerical input.

In [4]:
# Convert the 'income' column to binary
df['income'] = df['income'].map({'<=50K': 0, '>50K': 1})
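
One caveat with `map` is that any label missing from the dictionary silently becomes NaN. The sketch below shows a defensive variant on invented labels. The variant spellings (stray whitespace, a trailing period) are assumptions for illustration and may not occur in the file used here.

```python
import pandas as pd

# Invented labels, two of which deviate from the expected spelling.
labels = pd.Series(["<=50K", ">50K", ">50K.", " <=50K"])
mapping = {"<=50K": 0, ">50K": 1}

# Mapping the raw labels leaves the two variants as NaN.
naive_unmapped = int(labels.map(mapping).isna().sum())

# Normalise first: strip surrounding whitespace and any trailing period.
cleaned = labels.str.strip().str.rstrip(".")
encoded = cleaned.map(mapping)

# Fail loudly if anything is still unmapped, then store as integers.
if encoded.isna().any():
    raise ValueError("unexpected income label found")
encoded = encoded.astype(int)

print(naive_unmapped, encoded.tolist())
```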

**Check for Outliers**: Lastly, the 'age' column was inspected for outliers, which are extreme values that can skew the analysis. The Interquartile Range (IQR) method was used to identify these outliers. The IQR is a measure of statistical dispersion and is calculated as the difference between the 75th and 25th percentiles. Any age that was below the first quartile minus 1.5 times the IQR or above the third quartile plus 1.5 times the IQR was considered an outlier. The number of outliers was then printed to the console. This step is critical as outliers can have a significant impact on the results of data analysis and statistical modeling.

In [17]:
# Identify outliers in the 'age' column using the IQR method
# Calculate the IQR as the difference between the 75th and 25th percentiles
Q1 = df['age'].quantile(0.25)
Q3 = df['age'].quantile(0.75)
IQR = Q3 - Q1

# Define the range for outliers
lower_bound = Q1 - 1.5*IQR
upper_bound = Q3 + 1.5*IQR

# Count outliers
outlier_age = df[(df['age'] < lower_bound) | (df['age'] > upper_bound)]
print(f'Number of outliers in the age column: {len(outlier_age)}')


Number of outliers in the age column: 0
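
The IQR rule can be wrapped in a small reusable function so that the same check applies to any numeric column. This is a sketch only: the `iqr_outliers` name and its parameters are hypothetical, and the ages are invented to make the arithmetic easy to follow.

```python
import pandas as pd

def iqr_outliers(series, k=1.5, lower_q=0.25, upper_q=0.75):
    """Return the values lying more than k spreads outside the two quantiles."""
    q_low = series.quantile(lower_q)
    q_high = series.quantile(upper_q)
    spread = q_high - q_low
    lower_bound = q_low - k * spread
    upper_bound = q_high + k * spread
    return series[(series < lower_bound) | (series > upper_bound)]

# Invented ages: Q1 = 24 and Q3 = 32, so IQR = 8 and the bounds are 12 and 44.
ages = pd.Series([20, 22, 24, 26, 28, 30, 32, 34, 95])
outliers = iqr_outliers(ages)
print(outliers.tolist())
```

With these values only 95 falls outside the interval from 12 to 44. Passing different `lower_q` and `upper_q` values widens or narrows the spread that the bounds are built from.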


In [20]:
# Repeat the check for 'education-num', using the 15th and 85th percentiles
# in place of the quartiles, so the spread here is wider than a true IQR
Q1 = df['education-num'].quantile(0.15)
Q3 = df['education-num'].quantile(0.85)
IQR = Q3 - Q1

# Define the range for outliers
lower_bound = Q1 - 1.5*IQR
upper_bound = Q3 + 1.5*IQR

# Count outliers
outliers_education = df[(df['education-num'] < lower_bound) | (df['education-num'] > upper_bound)]
print(f'Number of outliers in the education-num column: {len(outliers_education)}')

Number of outliers in the education-num column: 191


Through these steps, a cleaner and more analysis-ready dataset was achieved. The cleaning process was tailored to the specific characteristics and needs of the Adult dataset, and the decisions made during cleaning were grounded in the pursuit of robust and valid analytical results. In subsequent parts of this project, further cleaning steps may be undertaken as necessary, and a more in-depth exploratory data analysis will be conducted.