# Lab Assignment One - Exploring Table Data

**Business Understanding (1.5 points total)**

- **In your own words, give an overview of the dataset.**
    - The Census Income dataset includes 48,842 records with 14 features, covering both categorical and continuous variables. It is restricted to individuals over 16 with an adjusted gross income above 100 and a positive number of working hours per week. The dataset also contains missing values. The data focuses on a subset of the population most relevant for the study of income levels.
- **Describe the purpose of the data set you selected (i.e., why and how was this data collected in the first place?).**
    - I selected this dataset because I am interested in socio-economic analysis. Because it provides a rich source of socio-economic information, it is well suited to examining income distribution and the factors that influence economic status.
- **What is the prediction task for your data and why are other third parties interested in the result?**
    - This dataset was collected to analyze and predict the income level of individuals based on various demographic and employment factors. The primary prediction task is to determine whether an individual earns more than $50,000 per year. Third parties, including government agencies, policymakers, and businesses, are interested in this dataset because it can help shape welfare programs and tax policies and support the study of market trends and consumer behavior.
- **Once you begin modeling, how well would your prediction algorithm need to perform to be considered useful to these third parties? Be specific and use your own words to describe the aspects of the data.**
    - For this prediction algorithm to be useful to third parties, it needs to classify individuals into the correct income category more reliably than a trivial baseline. Because roughly three quarters of the records fall in the <=50K class, always predicting <=50K already yields about 76% accuracy, so accuracy alone is not a sufficient measure. Precision, recall, and F1-score for the >50K class should also be reported. In addition, model interpretability adds value for third parties seeking to understand the underlying factors that influence income levels.
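
To make the metrics named above concrete, here is a minimal sketch in plain Python. The labels are hypothetical and purely illustrative (they are not results from this dataset), and the helper names `precision_recall_f1` and `accuracy` are my own. The sketch compares a model against a majority-class baseline that always predicts `<=50K`.

```python
def precision_recall_f1(y_true, y_pred, positive=">50K"):
    """Return (precision, recall, F1) for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical labels for eight people (illustrative only)
y_true = ["<=50K"] * 6 + [">50K", ">50K"]
baseline_pred = ["<=50K"] * 8  # always predict the majority class
model_pred = ["<=50K"] * 5 + [">50K", ">50K", "<=50K"]

baseline_scores = (accuracy(y_true, baseline_pred),
                   *precision_recall_f1(y_true, baseline_pred))
model_scores = (accuracy(y_true, model_pred),
                *precision_recall_f1(y_true, model_pred))

print("baseline (acc, prec, rec, f1):", baseline_scores)
print("model    (acc, prec, rec, f1):", model_scores)
```

In this toy case both predictors have the same accuracy, but only the model has a non-zero F1 for the `>50K` class, which is why the per-class metrics matter for an imbalanced target.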

**Data Understanding (3 points total)**

- **[1.5 points]**
    - **Load the dataset and appropriately define data types.**
        - `Integer` — for numerical values
        - `Categorical` — for data representing categories or groups
        - `Binary` — for the target variable “income”
    - **Discuss the attributes collected in the dataset / What data type should be used to represent each data attribute?**
        1. **Age**
            - Description: Years since birth
            - Data Type: Integer
        2. **Workclass**
            - Description: Classification of employment sector - includes Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
            - Data Type: Categorical
        3. **Final Weight (fnlwgt)**
            - Description: Weight assigned by the Census Bureau, reflecting the number of people the observation represents.
            - Data Type: Integer
        4. **Education**
            - Description: Highest level of education achieved - includes Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
            - Data Type: Categorical
        5. **Education Number (education-num)**
            - Description: Numeric (ordinal) encoding of the education level, roughly corresponding to the number of years of education completed.
            - Data Type: Integer
        6. **Marital Status**
            - Description: Marital status of the individual - includes Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
            - Data Type: Categorical
        7. **Occupation**
            - Description: Type of occupation - includes Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
            - Data Type: Categorical
        8. **Relationship**
            - Description: Role of the individual within a family - includes Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
            - Data Type: Categorical
        9. **Race**
            - Description: Race of the individual - includes White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
            - Data Type: Categorical
        10. **Sex**
            - Description: Biological sex of the individual - Female or Male.
            - Data Type: Binary
        11. **Capital Gain**
            - Description: Income from investment sources, apart from wages/salary.
            - Data Type: Integer
        12. **Capital Loss**
            - Description: Losses from investment sources.
            - Data Type: Integer
        13. **Hours per Week**
            - Description: Number of hours worked per week.
            - Data Type: Integer
        14. **Native Country**
            - Description: Country of origin of the individual - includes United-States, Cambodia, England, Puerto-Rico, etc.
            - Data Type: Categorical
        15. **Income**
            - Description: Indicates if income exceeds $50K/year - categories are >50K and <=50K.
            - Data Type: Binary
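
The type assignments above could be applied in pandas roughly as follows. This is a sketch on a small toy frame, not the loaded dataset: the rows are illustrative, and the column names (for example `education-num` and `hours-per-week`) are assumed to match those returned by the loader. Only a few of the 15 attributes are shown.

```python
import pandas as pd

# Toy rows standing in for the real data (values are illustrative only)
toy = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["State-gov", "Self-emp-not-inc", "Private"],
    "education-num": [13, 13, 9],
    "sex": ["Male", "Male", "Female"],
    "hours-per-week": [40, 13, 40],
    "income": ["<=50K", "<=50K", ">50K"],
})

# Map each attribute to the data type chosen in the list above
dtype_map = {
    "age": "int64",
    "education-num": "int64",
    "hours-per-week": "int64",
    "workclass": "category",
    "sex": "category",
}
typed = toy.astype(dtype_map)

# Binary target: True when income exceeds $50K
typed["income"] = typed["income"].eq(">50K")

print(typed.dtypes)
```

Before encoding the real target this way, it is worth inspecting the label spellings (for example with `value_counts()`), since any variant other than the exact string `>50K` would be mapped to `False`.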
- **[1.5 points]**
    - **Verify data quality:**
    - **Explain any missing values or duplicate data.**
        - Using the output of `print(census_income.variables)`, I identified workclass, occupation, and native-country as the features that contain missing values. In general, missing values in these features can result from data-entry errors or a respondent's refusal to disclose. More specifically:
            - **Workclass**: Missing values may be due to unemployment or informal employment.
            - **Occupation**: Gaps could arise from unemployment, retirement or non-traditional occupations.
            - **Native-country**: Missing data might result from privacy concerns or uncertain origin.
    - **Visualize entries that are missing/complete for different attributes.**
    - **Are those mistakes? Why do these quality issues exist in the data?**
    - **How do you deal with these problems?**
    - **Give justifications for your methods (elimination or imputation).**
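
The quality checks listed above can be sketched on a toy frame as shown here. The rows are illustrative only, and the sketch assumes that unknown entries may appear either as `NaN` or as a `"?"` placeholder; that assumption should be verified against the loaded data. The `"Unknown"` fill label is my own choice for illustration, not a method prescribed by the dataset.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for the real data (values are illustrative only)
toy = pd.DataFrame({
    "age": [39, 50, 38, 39],
    "workclass": ["State-gov", "?", "Private", "State-gov"],
    "occupation": ["Adm-clerical", np.nan, "Sales", "Adm-clerical"],
    "native-country": ["United-States", "United-States", "?", "United-States"],
})

# Assumption: "?" marks an unknown value, so treat it as missing
clean = toy.replace("?", np.nan)

# Count missing entries per attribute and fully duplicated rows
missing_per_column = clean.isna().sum()
n_duplicates = int(clean.duplicated().sum())

# Option 1 (elimination): drop every row that has a missing value
dropped = clean.dropna()

# Option 2 (imputation): keep the rows and label the gaps explicitly
imputed = clean.fillna({
    "workclass": "Unknown",
    "occupation": "Unknown",
    "native-country": "Unknown",
})

print(missing_per_column)
print("duplicate rows:", n_duplicates)
```

Elimination is simple but discards whole records, while an explicit `"Unknown"` category keeps every row and lets a later model treat missingness itself as a signal. Which is preferable depends on how much data is affected.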

- **Data Visualization (4.5 points total)**
    - **[2 points]**
        - Visualize basic feature distributions.
        - That is, plot the dynamic range and exploratory distribution plots (like boxplots, histograms, kernel density estimation) to better understand the data.
        - Describe anything meaningful or potentially useful you discover from these visualizations.
        - These may also help to understand what data is missing or needs imputation. **Note**: You can also use data from other sources to bolster visualizations.
        - Visualize at least five plots, at least one categorical.
    - **[2.5 points]**
        - Ask three interesting questions that are relevant to your dataset and explore visuals that help answer these questions.
        - Use whichever visualization method is appropriate for your data.
        - **Important:** Interpret the implications for each visualization.
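
As a starting point for the distribution plots described above, the following sketch draws one continuous and one categorical feature side by side. It uses a toy frame with illustrative values rather than the loaded dataset, and the column names `age` and `workclass` are assumed to match the real ones.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy rows standing in for the real data (values are illustrative only)
toy = pd.DataFrame({
    "age": [25, 38, 28, 44, 18, 34, 29, 63, 24, 55],
    "workclass": ["Private", "Private", "Local-gov", "Private", "Private",
                  "Private", "State-gov", "Self-emp-not-inc", "Private",
                  "Private"],
})

fig, (ax_num, ax_cat) = plt.subplots(1, 2, figsize=(10, 4))

# Continuous feature: a histogram shows the range and shape of the distribution
ax_num.hist(toy["age"], bins=5, edgecolor="black")
ax_num.set(title="Age distribution", xlabel="age", ylabel="count")

# Categorical feature: a bar chart of how often each category occurs
workclass_counts = toy["workclass"].value_counts()
workclass_counts.plot.bar(ax=ax_cat)
ax_cat.set(title="Workclass frequency", xlabel="workclass", ylabel="count")

fig.tight_layout()
plt.show()
```

The same two-panel pattern can be repeated for other attributes, for example `hours-per-week` for the histogram and `education` for the bar chart.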

- **Exceptional Work (1 point total)**
  - **[0.4 points]**
    - The overall quality of the report as a coherent, useful, and polished product will be reflected here.
    - Criteria:
      - Does it make sense overall?
      - Do your visualizations answer the questions you put forth in your business analysis?
      - Do you properly and consistently cite sources and annotate changes made to base code?
      - Do you provide specific reasons for your assumptions?
      - Do subsequent questions follow naturally from initial exploration?

  - **[0.6 points] Additional analysis:**
    - **5000 level students**:
      - You have free rein to provide any additional analyses.
    - **7000 level students**:
      - Implement dimensionality reduction using uniform manifold approximation and projection (UMAP), then visualize and interpret the results.
    - Explanation of UMAP dimensionality reduction methods:
      - You may be interested in the following information:
        - [UMAP on GitHub](https://github.com/lmcinnes/umap)
        - [Understanding UMAP](https://pair-code.github.io/understanding-umap/)


In [None]:
# Core data-handling and plotting libraries
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno

# Record library versions for reproducibility
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")
print(f"Matplotlib version: {matplotlib.__version__}")
print(f"Seaborn version: {sns.__version__}")
print(f"Missingno version: {msno.__version__}")

In [None]:
from ucimlrepo import fetch_ucirepo

# fetch the Census Income dataset (UCI repository id 20)
census_income = fetch_ucirepo(id=20)

# access data
X = census_income.data.features
y = census_income.data.targets

# access metadata
print(census_income.metadata.uci_id)
print(census_income.metadata.num_instances)
print(census_income.metadata.additional_info.summary)

# access variable info in tabular format
print(census_income.variables)

In [None]:
# Combine the features and the target into a single DataFrame
# (X and y are the DataFrames returned by fetch_ucirepo above)
df = pd.concat([X, y], axis=1)
df.head()

In [None]:
df.describe()