# ASSIGNMENT 3

https://archive.ics.uci.edu/dataset/2/adult

**Background:**

Predicting income levels is a fundamental task in socio-economic studies, aiding in policy formulation and resource allocation. Machine learning techniques, particularly binary classification, have proven effective in modeling and predicting income categories based on various demographic and employment-related features.

**Project Description:**

This project aims to develop a machine learning model to predict whether an individual's annual income exceeds $50,000, using the Adult Census Income Dataset from the UCI Machine Learning Repository. The full dataset comprises 48,842 records (32,561 in `adult.data` and 16,281 in `adult.test`) with 14 attributes, including continuous variables (e.g., age, hours per week), categorical variables (e.g., education, occupation), and an ordinal variable (`education-num`, a numeric encoding of education level). The target variable is binary, indicating whether income is '>50K' or '<=50K'.

**The project will involve:**

*Data Preprocessing:* Handling missing values, encoding categorical variables, and normalizing continuous features.

*Exploratory Data Analysis (EDA):* Analyzing feature distributions and relationships to understand their impact on the target variable.

*Model Development:* Implementing and comparing various binary classification algorithms, such as Logistic Regression, Decision Trees, and Random Forests.

*Model Evaluation:* Assessing model performance using appropriate metrics and selecting the best-performing model.

**Performance Metric:**

The primary performance metric for evaluating the models will be Accuracy, the proportion of correct predictions. Because roughly three quarters of the records belong to the '<=50K' class, a model that always predicts the majority class already reaches about 76% accuracy, so Precision, Recall, and the F1-Score will also be reported to give a fuller picture under class imbalance.
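As a minimal sketch of how these four metrics differ, the snippet below scores a made-up set of predictions with scikit-learn. The labels are invented for illustration (1 stands for '>50K', 0 for '<=50K') and are not results from this project.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Toy labels, made up for illustration: 1 = '>50K', 0 = '<=50K'
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / all = 7 / 10
precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 2 / 3
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 2 / 4
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In this toy case accuracy looks acceptable while recall shows that half of the positive class is missed, which is the kind of gap the additional metrics are meant to expose.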

This project will enhance skills in data preprocessing, exploratory data analysis, and the implementation and evaluation of binary classification models using Python's machine learning libraries.

# ASSIGNMENT 3 TASK 2

In conducting an Exploratory Data Analysis (EDA) on the Adult Census Income Dataset, we aim to uncover patterns and relationships within the data that can inform our predictive modeling efforts. Below are five unique questions we intend to explore:

1. **How does educational attainment influence income levels?**
   - We will examine the distribution of income across different education levels to determine if higher educational qualifications correlate with higher income brackets.

2. **Is there a significant relationship between marital status and income?**
   - By analyzing income distributions across various marital statuses, we aim to identify any notable differences in income levels among single, married, divorced, and widowed individuals.

3. **How does race impact income distribution?**
   - We will assess the income levels across different racial groups to identify any disparities and understand the extent of income inequality, if present.

4. **What is the effect of gender on income levels?**
   - This analysis will focus on comparing income distributions between male and female individuals to identify any existing gender pay gaps within the dataset.

5. **Does the number of hours worked per week correlate with higher income?**
   - We will investigate whether individuals working more hours per week tend to have higher incomes, and if there is a threshold beyond which additional hours do not significantly impact income levels.

By exploring these questions, we aim to gain a deeper understanding of the factors influencing income, which will inform the development of our predictive models.
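Most of the questions above reduce to comparing the share of '>50K' earners across the levels of one categorical column. A minimal sketch with `pandas.crosstab` follows; the six rows are invented for illustration and only the column names come from the dataset.

```python
import pandas as pd

# Tiny made-up sample; only the column names match the real dataset
df = pd.DataFrame({
    "education": ["Bachelors", "Bachelors", "HS-grad", "HS-grad", "HS-grad", "Masters"],
    "income": [">50K", "<=50K", "<=50K", "<=50K", ">50K", ">50K"],
})

# normalize="index" turns each row into proportions that sum to 1,
# so groups of different sizes can be compared directly
share = pd.crosstab(df["education"], df["income"], normalize="index")
print(share)
```

Replacing `"education"` with `"marital-status"`, `"race"`, or `"sex"` would give the corresponding table for the other questions.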

# ASSIGNMENT 3 TASK 3

Based on the exploratory data analysis (EDA) of the Adult Census Income Dataset, we propose the following feature engineering strategies to enhance the predictive performance of our model:

**1. Handling Missing Values:**

- **Columns Affected:** `workclass`, `occupation`, `native-country`
- **Strategy:** Impute missing values using the most frequent category (mode) within each column. This keeps every row in the dataset, at the cost of slightly inflating the share of the most common category.
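A minimal pandas sketch of mode imputation, using a made-up four-row frame. In a real workflow the modes would presumably be computed on the training split only and then reused for the test split.

```python
import numpy as np
import pandas as pd

# Made-up rows with one missing value per column
df = pd.DataFrame({
    "workclass": ["Private", np.nan, "Private", "Self-emp-inc"],
    "occupation": ["Sales", "Sales", np.nan, "Tech-support"],
    "native-country": ["United-States", "United-States", "Mexico", np.nan],
})

cols = ["workclass", "occupation", "native-country"]

# mode() returns one row per tied mode; iloc[0] picks the first mode of each column
modes = df[cols].mode().iloc[0]

# fillna with a Series fills each column with the value stored under its name
df[cols] = df[cols].fillna(modes)
print(df)
```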

**2. Encoding Categorical Variables:**

- **Columns:** `workclass`, `education`, `marital-status`, `occupation`, `relationship`, `race`, `sex`, `native-country`
- **Strategy:** Apply one-hot encoding to transform categorical variables into numerical format, facilitating their use in machine learning algorithms.
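One way to apply one-hot encoding is `pandas.get_dummies`, sketched here on a made-up three-row frame with two of the listed columns. Each category becomes its own 0/1 column, and numeric columns pass through unchanged.

```python
import pandas as pd

# Made-up rows; "age" is numeric and is left untouched by the encoding
df = pd.DataFrame({
    "sex": ["Male", "Female", "Male"],
    "race": ["White", "Black", "White"],
    "age": [39, 50, 38],
})

# Each listed column is replaced by one indicator column per category
encoded = pd.get_dummies(df, columns=["sex", "race"], dtype=int)
print(encoded)
```

For linear models, passing `drop_first=True` removes one indicator per column and avoids perfectly collinear features; whether that is needed here depends on the model chosen.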

**3. Feature Transformation:**

- **Combining `capital-gain` and `capital-loss`:** Create a new feature `net-capital` by subtracting `capital-loss` from `capital-gain`. This simplifies the representation of an individual's net capital income.
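The combined feature is a single column subtraction. The sketch below uses made-up values; note that the result is negative for records with a loss and no gain.

```python
import pandas as pd

# Made-up values for the two capital columns
df = pd.DataFrame({
    "capital-gain": [2174, 0, 0, 5178],
    "capital-loss": [0, 0, 1902, 0],
})

# Net capital income: gains minus losses, computed row by row
df["net-capital"] = df["capital-gain"] - df["capital-loss"]
print(df)
```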

**4. Creating New Features:**

- **Age Binning:** Segment the `age` variable into bins representing different life stages (e.g., 17-25, 26-35, 36-45, etc.) to capture non-linear relationships between age and income.
- **Hours-Per-Week Binning:** Categorize `hours-per-week` into bins (e.g., 1-20, 21-40, 41-60, etc.) to account for varying work schedules and their potential impact on income.
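Both binning steps can be sketched with `pandas.cut`. The bin edges beyond the ones listed above (46-55, 56-65, 66+ for age and 61+ for hours) are assumptions chosen for illustration, as are the sample values.

```python
import pandas as pd

# Made-up sample values
df = pd.DataFrame({
    "age": [17, 25, 26, 40, 67, 90],
    "hours-per-week": [10, 20, 40, 45, 60, 99],
})

# pd.cut uses right-closed intervals by default: (16, 25], (25, 35], ...
age_edges = [16, 25, 35, 45, 55, 65, 100]          # edges above 45 are assumed
age_labels = ["17-25", "26-35", "36-45", "46-55", "56-65", "66+"]
df["age-bin"] = pd.cut(df["age"], bins=age_edges, labels=age_labels)

hours_edges = [0, 20, 40, 60, 100]                 # the top bin is assumed
hours_labels = ["1-20", "21-40", "41-60", "61+"]
df["hours-bin"] = pd.cut(df["hours-per-week"], bins=hours_edges, labels=hours_labels)

print(df)
```

The binned columns are categorical, so they would need the same encoding as the other categorical features before modeling.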

**5. Feature Scaling:**

- **Continuous Variables:** Standardize continuous variables such as `age`, `education-num`, `net-capital`, and `hours-per-week` to ensure they contribute proportionately to the model and improve convergence during training.
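A minimal sketch of standardization with scikit-learn's `StandardScaler`, using made-up values for the four columns. The scaler is fitted on the training rows only and then reused on the test rows, so no test-set statistics leak into training.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up rows; columns are age, education-num, net-capital, hours-per-week
X_train = np.array([
    [25, 9, 0, 40],
    [38, 13, 2174, 50],
    [52, 10, -1902, 30],
], dtype=float)
X_test = np.array([[30.0, 10.0, 0.0, 45.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std from training data
X_test_scaled = scaler.transform(X_test)        # apply the same mean and std to test data

print(X_train_scaled.mean(axis=0))  # approximately 0 for every column
print(X_train_scaled.std(axis=0))   # approximately 1 for every column
```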

**6. Addressing Class Imbalance:**

- **Target Variable (`income`):** Apply the Synthetic Minority Over-sampling Technique (SMOTE) or adjust class weights during model training to reduce the effect of class imbalance. Any resampling should be applied to the training split only, so that the test set keeps the original class distribution.
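Of the two options, class weighting needs no extra dependency, so it is the one sketched here (SMOTE lives in the separate `imbalanced-learn` package). The 75/25 label split and the two random features are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Made-up imbalanced labels: 75 negatives ('<=50K') and 25 positives ('>50K')
y = np.array([0] * 75 + [1] * 25)

# "balanced" weights are n_samples / (n_classes * class_count)
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
print(weights)  # the minority class receives the larger weight

# Synthetic features whose mean is shifted by one unit for the positive class
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) + y[:, None]

# Passing class_weight="balanced" applies the same weighting inside the loss
model = LogisticRegression(class_weight="balanced").fit(X, y)
```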

**7. Dropping Irrelevant or Redundant Features:**

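- **`fnlwgt`:** Consider dropping the `fnlwgt` (final weight) feature. It is a sampling weight assigned by the Census Bureau that estimates how many people in the population a record represents, not an attribute of the individual, so its relevance to income prediction is unclear. Further analysis is needed to determine its impact on model performance.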

By implementing these feature engineering strategies, we aim to enhance the dataset's representation and improve the performance of our predictive models. Each step is designed to address specific insights gained from the EDA, ensuring that the features fed into the model are informative and relevant.
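The imputation, encoding, and scaling steps can be chained so that they are fitted on training data only. The sketch below is one possible arrangement using scikit-learn's `ColumnTransformer` and `Pipeline` on synthetic data; the generated rows, the target rule, the choice of Logistic Regression, and the `random_state` values are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the real data (values are random, not from the dataset)
rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({
    "age": rng.integers(17, 80, size=n),
    "hours-per-week": rng.integers(1, 99, size=n),
    "workclass": rng.choice(["Private", "Self-emp-inc", "Local-gov"], size=n),
})
df["workclass"] = df["workclass"].astype(object)
df.loc[df.index[::25], "workclass"] = np.nan               # inject some missing values
y = (df["age"] + df["hours-per-week"] > 95).astype(int)    # made-up target rule

numeric = ["age", "hours-per-week"]
categorical = ["workclass"]

# Numeric columns are standardized; categorical columns are imputed, then one-hot encoded
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
])

clf = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])

# 80/20 split, stratified so both splits keep the same class balance
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, stratify=y, random_state=42
)

clf.fit(X_train, y_train)              # every preprocessing step is fitted on training rows only
test_accuracy = clf.score(X_test, y_test)
print(f"test accuracy: {test_accuracy:.3f}")
```

Keeping the preprocessing inside the pipeline means the same object can later be passed to cross-validation without refitting the imputer or scaler on held-out rows.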

In [None]:
"""
# Assignment 3: Income Prediction using the Adult Census Income Dataset

## Task 1: Refine Data Dictionary
Below is the refined data dictionary including additional details on categorical encodings.

| Column Name      | Data Type | Description |
|-----------------|-----------|-------------|
| age             | int       | Age of the individual |
| workclass       | category  | Type of employer (e.g., Private, Self-emp, Government) |
| fnlwgt          | int       | Final weight computed by Census Bureau |
| education       | category  | Education level (e.g., Bachelors, HS-grad) |
| education-num   | int       | Numeric representation of education |
| marital-status  | category  | Marital status (e.g., Never-married, Married, Divorced) |
| occupation      | category  | Type of job role (e.g., Tech-support, Sales) |
| relationship    | category  | Relationship in household (e.g., Husband, Wife) |
| race           | category  | Ethnicity of the individual |
| sex            | category  | Gender (Male/Female) |
| capital-gain   | int       | Capital gains from investments |
| capital-loss   | int       | Capital losses from investments |
| hours-per-week | int       | Number of work hours per week |
| native-country | category  | Country of origin |
| income        | category  | Target variable: <=50K or >50K |

---

## Task 2: Basic EDA

### Questions to Explore:
1. How does educational attainment influence income levels?
2. Is there a significant relationship between marital status and income?
3. How does race impact income distribution?
4. What is the effect of gender on income levels?
5. Does the number of hours worked per week correlate with higher income?

---

## EDA Code Implementation
"""

# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
column_names = ["age", "workclass", "fnlwgt", "education", "education-num", "marital-status", "occupation",
                "relationship", "race", "sex", "capital-gain", "capital-loss", "hours-per-week", "native-country", "income"]
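# Fields are separated by ", ", so skipinitialspace strips the leading blank
# from every value and missing entries are read as a bare "?"
data = pd.read_csv(url, names=column_names, na_values="?", skipinitialspace=True)

# Basic data exploration
# data.info() prints its own summary and returns None, so call it on its own line
print("Dataset Info:")
data.info()
print("\nFirst 5 Rows:\n", data.head())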

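# Visualizing income distribution by education
# Normalize within each education level so the bars show proportions, not counts
edu_income = data.groupby("education")["income"].value_counts(normalize=True).unstack()
# Create the axes once and hand them to pandas; a separate plt.figure() call
# would leave an empty extra figure behind
fig, ax = plt.subplots(figsize=(12, 6))
edu_income.plot(kind='bar', stacked=True, ax=ax)
plt.title("Income Distribution by Education Level")
plt.xlabel("Education Level")
plt.ylabel("Proportion")
plt.xticks(rotation=45)
plt.show()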

# Gender impact on income
plt.figure(figsize=(8, 5))
sns.countplot(x='sex', hue='income', data=data)
plt.title("Gender and Income Levels")
plt.show()

# Correlation between hours-per-week and income
plt.figure(figsize=(8, 5))
sns.boxplot(x='income', y='hours-per-week', data=data)
plt.title("Work Hours vs. Income")
plt.show()

"""
## Findings from EDA
1. Higher education levels correlate with a greater proportion of individuals earning >50K.
2. Married individuals appear to have higher income levels compared to single or divorced individuals.
3. There are observable income disparities among different racial groups.
4. A gender pay gap exists, with males having a higher proportion of >50K earners.
5. More working hours generally correlate with higher income, but there is a saturation point beyond which extra hours do not significantly increase income.
"""

"""
## Feature Engineering Plan

1. **Handling Missing Values:**
   - Impute missing values in `workclass`, `occupation`, and `native-country` with the most frequent category.

2. **Encoding Categorical Variables:**
   - Apply One-Hot Encoding to `workclass`, `education`, `marital-status`, `occupation`, `relationship`, `race`, `sex`, and `native-country`.

3. **Feature Transformation:**
   - Create `net-capital` as `capital-gain - capital-loss`.

4. **Creating New Features:**
   - Segment `age` and `hours-per-week` into bins.

5. **Feature Scaling:**
   - Standardize `age`, `education-num`, `net-capital`, and `hours-per-week`.

6. **Addressing Class Imbalance:**
   - Apply SMOTE or class weight adjustments.

7. **Dropping Irrelevant Features:**
   - Consider removing `fnlwgt`.
"""

"""
## Train-Test Split Plan
- Train/Test Split: 80% training, 20% testing.
- If needed, a golden holdout set of 10% from training for final validation.
"""

"""
## Initial Pipeline
- **Imputer:** Fill missing values in categorical variables.
- **Encoders:** One-Hot Encoding for categorical variables.
- **Scaler:** Standardize continuous features.
"""

"""
## Model Fitting and Evaluation Assumptions
1. Features like education, occupation, and hours-per-week will significantly impact income prediction.
2. Random Forest or Logistic Regression will likely perform well.
3. Class imbalance may impact precision and recall, requiring balancing strategies.
"""

