# PES University, Bangalore
Established under Karnataka Act No. 16 of 2013

## UE22AM343AB4 - Advanced Data Analytics 

Designed by Nischal H S

### Student Details
- Name : **Student Name**
- SRN : **SRN**

# ADA Worksheet Part B

## Adult Census Income Cleaning and Analysis

### Introduction

As a data scientist intern at the U.S. Census Bureau, you've been assigned to clean and analyze the Adult Census Income dataset. Your task is to prepare the data for a machine learning model that will predict whether an individual's annual income exceeds $50,000. This analysis will inform government policies on education, employment, and economic development.
The dataset contains various demographic and socioeconomic factors, but it requires careful preprocessing to ensure accurate results. Your work will involve handling missing data, encoding categorical variables, and performing exploratory data analysis.

First, let's import some of the necessary libraries and load the data.
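
As a rough illustration of the loading step, the sketch below reads a few made-up rows laid out like `adult.data` (no header row, fields separated by a comma and a space). The rows, `sample_columns`, and `sample_df` are assumptions for illustration only; `skipinitialspace=True` is one way to keep the leading space out of the string values, so that missing entries read as `"?"` rather than `" ?"`.

```python
import io

import pandas as pd

# Made-up rows in the same layout as adult.data: no header row,
# and every comma is followed by a space.
sample = io.StringIO(
    "39, State-gov, 77516, Bachelors, >50K\n"
    "50, ?, 83311, HS-grad, <=50K\n"
    "38, Private, 215646, ?, <=50K\n"
)
sample_columns = ["age", "workclass", "fnlwgt", "education", "income"]

# names= supplies the column labels; skipinitialspace=True strips the
# space that follows each comma.
sample_df = pd.read_csv(sample, names=sample_columns, skipinitialspace=True)
print(sample_df)
```

The same keyword arguments should carry over when `url` and `column_names` are passed instead of the in-memory sample.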

In [None]:
# Run this cell to install the required packages directly into the notebook's kernel
%pip install pandas numpy matplotlib seaborn scikit-learn scipy imbalanced-learn

In [None]:
# Note: You will likely need to look up syntax, parameters, and functions for some of these libraries, so the documentation is linked below. Most of these pages have a search bar to help you find what you're looking for.
# pandas: https://pandas.pydata.org/docs/user_guide/index.html
# numpy: https://numpy.org/doc/stable/user/index.html
# matplotlib: https://matplotlib.org/stable/contents.html
# seaborn: https://seaborn.pydata.org/tutorial.html




import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
column_names = ["age", "workclass", "fnlwgt", "education", "education-num", "marital-status", "occupation", "relationship", "race", "sex", "capital-gain", "capital-loss", "hours-per-week", "native-country", "income"]

# TODO: Load the data into a pandas DataFrame named 'df'
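# Hint: Use pd.read_csv() with the url and names=column_names (the file has no header row)
# Fields in the raw file are separated by ", ", so consider skipinitialspace=True to avoid values such as " ?"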


# Display the first few rows and basic information about the dataset
print(df.head())
df.info()  # info() prints its summary directly and returns None, so no print() is needed

## Question 1: Missing Values

Examine the dataset for missing values. In this dataset, missing values are represented as "?".

a) How many missing values are there in each column?

b) What percentage of the dataset is missing?
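
One possible way to approach both parts is sketched below on a small made-up DataFrame (`toy` and its values are assumptions, not rows from the dataset). Because "percentage of the dataset" can be read in two ways, the sketch computes both the share of missing cells and the share of rows with at least one missing value.

```python
import pandas as pd

# Made-up data in which "?" marks a missing value.
toy = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "workclass": ["State-gov", "?", "Private", "?"],
    "occupation": ["Adm-clerical", "?", "Sales", "Tech-support"],
})

# Boolean mask: True wherever a cell equals "?".
is_missing = toy.isin(["?"])

# a) Missing values per column.
missing_per_column = is_missing.sum()

# b) Percentage of all cells that are missing, and percentage of rows
#    that contain at least one missing value.
pct_cells_missing = 100 * is_missing.to_numpy().sum() / is_missing.size
pct_rows_missing = 100 * is_missing.any(axis=1).mean()

print(missing_per_column)
print(pct_cells_missing, pct_rows_missing)
```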

In [None]:
# TODO: Count the number of missing values in each column
# Hint: Use df.isin() with the right parameter

# TODO: Calculate the percentage of missing values in the entire dataset

## Question 2: Handling Missing Values

Choose an appropriate strategy to handle missing values in the 'workclass' and 'occupation' columns.

a) Explain your chosen strategy and why you think it's appropriate.

b) Implement your strategy.
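
As one illustration (not the only valid strategy), the sketch below imputes `"?"` with the most frequent value among rows that share the same value in another column, and falls back to the overall mode when a group has no observed values. The helper `impute_group_mode`, the choice of `education` as the grouping column, and the toy data are all assumptions made for this example.

```python
import pandas as pd

# Made-up data: "?" marks a missing occupation.
toy = pd.DataFrame({
    "education": ["Bachelors", "Bachelors", "Bachelors",
                  "HS-grad", "HS-grad", "HS-grad"],
    "occupation": ["Prof-specialty", "Prof-specialty", "?",
                   "Craft-repair", "?", "Craft-repair"],
})


def impute_group_mode(frame, target, by):
    """Hypothetical helper: fill "?" in `target` with the mode of its `by` group."""
    out = frame.copy()
    # Turn "?" into NaN so that mode() and fillna() ignore and fill it.
    out[target] = out[target].mask(out[target] == "?")
    global_mode = out[target].mode().iloc[0]

    def fill(group):
        modes = group.mode()
        # Fall back to the overall mode if the group has no observed value.
        return group.fillna(modes.iloc[0] if not modes.empty else global_mode)

    out[target] = out.groupby(by)[target].transform(fill)
    return out


imputed = impute_group_mode(toy, target="occupation", by="education")
print(imputed)
```

Whether imputation is appropriate at all depends on why the values are missing, so dropping rows or keeping "missing" as its own category may be equally defensible.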

In [None]:
# TODO: Implement your chosen strategy for handling missing values
# This might involve imputation, removal, or other techniques. Think carefully about the type of missingness and how it should be handled

def handle_missing_workclass(df):
    """
    Handle missing values in the 'workclass' column and return the updated DataFrame.
    """
    # TODO: Implement your strategy here
    return df

def handle_missing_occupation(df):
    """
    Handle missing values in the 'occupation' column and return the updated DataFrame.
    """
    # TODO: Implement your strategy here
    return df

# Call the functions to update the DataFrame
df = handle_missing_workclass(df)
df = handle_missing_occupation(df)

# Hint: You could use the mode() method to impute missing values
# Can you go one step further and take the other columns into account when deciding what value to impute in a row?

## Question 3: Categorical Variables

Identify all categorical variables in the dataset.

a) List the categorical variables and their unique values.

b) Are there any categorical variables that have an unusually high number of categories? How might you handle this?
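
The sketch below shows one way to list categorical columns and, for a high-cardinality column, to fold rare categories into a single label. It selects non-numeric columns, which is assumed here to coincide with the object-dtype columns mentioned in the hint; the helper `collapse_rare`, the 15% threshold, the `"Other"` label, and the toy data are all assumptions.

```python
import pandas as pd

# Made-up data with one binary and one many-valued categorical column.
toy = pd.DataFrame({
    "age": [25, 38, 28, 44, 18, 34, 29, 63, 24, 55],
    "sex": ["Male", "Female"] * 5,
    "native-country": ["United-States"] * 8 + ["Mexico", "India"],
})

# a) Non-numeric columns and their unique values.
categorical_columns = toy.select_dtypes(exclude="number").columns.tolist()
for column in categorical_columns:
    print(column, toy[column].nunique(), toy[column].unique())


# b) Hypothetical helper: replace categories below a frequency share
#    with a single label.
def collapse_rare(series, min_share=0.15, other_label="Other"):
    shares = series.value_counts(normalize=True)
    rare = shares[shares < min_share].index
    return series.where(~series.isin(rare), other_label)


collapsed = collapse_rare(toy["native-country"])
print(collapsed.value_counts())
```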

In [None]:
# TODO: Identify and list categorical variables
# Hint: Use the select_dtypes() method to identify columns with object dtype

# TODO: Display unique values for each categorical variable
# Hint: You can use the unique() method to get the unique values for each column

## Question 4: Encoding Categorical Variables

Choose appropriate encoding techniques for the categorical variables. You may use different techniques for different variables based on their characteristics.

a) Explain your choice of encoding technique for each categorical variable.

b) Implement the encoding.

In [None]:
# TODO: Implement encoding for categorical variables
# This might involve one-hot encoding, label encoding, or other techniques

def encode_binary_categorical(df, column):
    """
    Encode a binary categorical variable using label encoding.
    """
    # Hint: Use the LabelEncoder class from sklearn.preprocessing
    pass

def encode_multi_categorical(df, columns):
    """
    Encode multi-class categorical variables using one-hot encoding.
    """
    # Hint: Use the OneHotEncoder class from sklearn.preprocessing
    pass

## Question 5: Numerical Variables

Analyze the numerical variables in the dataset.

a) Create histograms for each numerical variable.

b) Identify any variables that appear to be skewed. How might you handle this skewness?
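
To make part b) concrete, the sketch below measures skewness before and after a `log1p` transform on a made-up, heavily right-skewed column. The values are assumptions chosen to resemble a column that is mostly zero with a few large entries.

```python
import numpy as np
import pandas as pd

# Made-up right-skewed data: mostly zeros with a few large values.
toy = pd.DataFrame({"capital-gain": [0, 0, 0, 0, 0, 0, 0, 100, 1000, 10000]})

skew_before = toy["capital-gain"].skew()

# log1p computes log(1 + x), so zeros stay at zero instead of
# becoming negative infinity.
transformed = np.log1p(toy["capital-gain"])
skew_after = transformed.skew()

print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```

A transform of this kind compresses the right tail, but a column dominated by zeros will usually stay skewed to some degree, so it is worth checking the histogram again afterwards.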

In [None]:
# TODO: Create histograms for numerical variables
# Hint: Use the histplot() function from the seaborn library

# TODO: Identify skewed variables and suggest transformations
# Hint: You can use the skew() method to identify skewed variables
# For transformations, you could consider using the np.log1p() function
# Verify the effectiveness of your transformation

def transform_skewed_variables(df, skewed_vars):
    """
    Apply transformation to skewed numerical variables.
    """
    pass

## Question 6: Outlier Detection

Implement a method to detect outliers in the 'capital-gain' and 'capital-loss' columns.

a) What method did you choose and why?

b) How many outliers did you detect?

c) Propose a strategy for handling these outliers.
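
The sketch below illustrates one combination of the techniques hinted at in this question: z-scores for detection and upper-tail winsorization for handling. The threshold of 3, the helper name `winsorize_upper`, the decision to cap only the upper tail, and the toy data are all assumptions.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Made-up data: mostly zeros, a few moderate values, one extreme value.
toy = pd.DataFrame({"capital-gain": [0] * 15 + [2000, 3000, 4000, 5000, 99999]})

# Detection: flag values more than 3 standard deviations from the mean.
z_scores = np.abs(stats.zscore(toy["capital-gain"].to_numpy()))
outlier_mask = z_scores > 3
n_outliers = int(outlier_mask.sum())
print("outliers detected:", n_outliers)


def winsorize_upper(frame, columns, quantile=0.95):
    """Hypothetical helper: cap values above the given quantile at that quantile."""
    out = frame.copy()
    for column in columns:
        cap = out[column].quantile(quantile)
        out[column] = np.where(out[column] > cap, cap, out[column])
    return out


capped = winsorize_upper(toy, ["capital-gain"])
print(capped["capital-gain"].max())
```

Z-scores assume a roughly symmetric distribution, so for columns that are mostly zero an IQR-based rule may flag a very different set of points.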

In [None]:
# TODO: Implement outlier detection for 'capital-gain' and 'capital-loss'
# Hint: Use the zscore() function from the scipy.stats module to detect outliers

# TODO: Visualize outliers
# Hint: A certain kind of plot is usually used to visualise outliers. Use the seaborn library

def handle_outliers(df, columns, quantile=0.95):
    """
    Handle outliers using winsorization.
    """
    # Hint: Use the np.where() function to apply the winsorization
    pass

## Question 7: Correlation Analysis

Perform a correlation analysis on the numerical variables.

a) Create a heatmap of the correlation matrix.

b) Identify any highly correlated pairs of features. How might this impact a machine learning model?
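
For part b), one way to combine `where()` and `stack()` is sketched below: keep only the upper triangle of the correlation matrix so each pair appears once, then filter by absolute correlation. The column names, the values, and the 0.8 threshold are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up numeric data: feature_b closely tracks feature_a, feature_c does not.
toy = pd.DataFrame({
    "feature_a": [1, 2, 3, 4, 5, 6],
    "feature_b": [2, 4, 6, 8, 10, 13],
    "feature_c": [5, 1, 4, 2, 6, 3],
})

corr = toy.corr()

# Keep the strict upper triangle so each pair is listed once and the
# diagonal (always 1.0) is excluded.
upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper_mask).stack()

# Pairs whose absolute correlation exceeds the threshold.
high = pairs[pairs.abs() > 0.8]
print(high)
```

The full matrix `corr` is what would be passed to `sns.heatmap()` for part a).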

In [None]:
# TODO: Compute correlation matrix
# Hint: Use the corr() method on the DataFrame

# TODO: Create a heatmap
# Hint: Use the heatmap() function from the seaborn library

# TODO: Identify and discuss highly correlated pairs
# Hint: You can use the where() and stack() methods to identify the highly correlated pairs

## Question 8: Class Imbalance

Investigate whether there is a class imbalance in the target variable ('income').

a) Calculate the proportion of each class in the target variable.
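
A minimal sketch on made-up labels is shown below; `value_counts(normalize=True)` returns each class's share of the rows. The toy values are assumptions and say nothing about the real class balance.

```python
import pandas as pd

# Made-up target column.
toy = pd.DataFrame({"income": ["<=50K", "<=50K", ">50K", "<=50K"]})

# Share of rows belonging to each class.
proportions = toy["income"].value_counts(normalize=True)
print(proportions)
```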

In [None]:
# TODO: Calculate and display class proportions

## Question 9: Data Scaling

Implement feature scaling on the numerical variables.

a) Choose a scaling method (e.g., StandardScaler, MinMaxScaler) and explain your choice.

b) Apply the scaling to the numerical features.
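
As an illustration of part b), the sketch below applies `StandardScaler` to the numeric columns of a made-up DataFrame and leaves the other columns untouched. The choice of `StandardScaler` over `MinMaxScaler` is only an example, and the data and variable names are assumptions.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up data with two numeric columns and one categorical column.
toy = pd.DataFrame({
    "age": [25, 38, 28, 44, 52],
    "hours-per-week": [40, 50, 40, 60, 35],
    "sex": ["Male", "Female", "Female", "Male", "Female"],
})

numeric_columns = toy.select_dtypes(include="number").columns

# StandardScaler centers each column at 0 and rescales it to unit variance.
scaler = StandardScaler()
scaled_df = pd.DataFrame(
    scaler.fit_transform(toy[numeric_columns]),
    columns=numeric_columns,
    index=toy.index,
)

# Recombine the scaled numeric columns with the untouched ones.
scaled = toy.drop(columns=numeric_columns).join(scaled_df)
print(scaled)
```

When the data is later split for modeling, the scaler is normally fit on the training portion only and then applied to the test portion.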

In [None]:
# TODO: Choose and implement a scaling method
# Hint: The Scaler classes can be found in sklearn.preprocessing module

## Question 10: Exploratory Data Analysis

Perform exploratory data analysis to gain insights into the relationship between features and the target variable.

a) Create at least three different types of plots that reveal interesting patterns or relationships in the data, for instance a scatter plot, a box plot, or a histogram.

b) Explain your findings from each plot.

In [None]:
# TODO: Create at least three informative plots
# Example:
plt.figure(figsize=(12, 6))
sns.scatterplot(data=df, x='age', y='hours-per-week', hue='income')
plt.title('Scatter Plot: Age vs. Hours per Week by Income')
plt.show()

# Create 2 more such plots of your own
# TODO: Explain insights gained from each plot