<p style="font-family: Cambria; text-align: center; font-size: 48px;"> Initial Exploration of the Adult Dataset

<p style="font-family: Cambria; font-size: 22px;"><b> 1) The command below installs or upgrades the ucimlrepo Python package, which provides direct access to datasets from the UCI Machine Learning Repository.

> pip3 → The package installer for Python 3

> install → Installs the specified package

> -U → Short for --upgrade; upgrades the package to the newest available version if it is already installed

<p style="font-family: Cambria; font-size: 18px;"><b> ucimlrepo → A Python library that allows us to easily load UCI datasets into Python

By installing this package, we can download and load the Adult Census Income dataset directly into our Python environment without manually downloading files. This makes data extraction easier, faster, and more reproducible.

In [2]:
!pip3 install -U ucimlrepo

Collecting ucimlrepo
  Downloading ucimlrepo-0.0.7-py3-none-any.whl.metadata (5.5 kB)
Downloading ucimlrepo-0.0.7-py3-none-any.whl (8.0 kB)
Installing collected packages: ucimlrepo
Successfully installed ucimlrepo-0.0.7
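
To confirm the installation from inside Python, the installed version can be looked up with the standard library. This is a minimal sketch; the helper name installed_version is an assumption made for illustration and is not part of ucimlrepo.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# installed_version is a hypothetical helper, not a ucimlrepo function
version = installed_version("ucimlrepo")
print("ucimlrepo version:", version if version else "not installed")
```

If the lookup reports "not installed", the notebook kernel is probably running in a different environment from the one pip3 installed into.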


<p style="font-family: Cambria; font-size: 22px;"><b> 2) The line below imports two functions from the ucimlrepo library:

fetch_ucirepo
>>This function is used to download and load a specific dataset from the UCI Machine Learning Repository directly into Python. In our case, it allows us to fetch the Adult Census Income dataset easily.

list_available_datasets
>>This function helps us view all datasets available in the UCI repository through the library, which is useful when exploring or selecting datasets for analysis.

By importing these functions, we prepare our environment to access, explore, and load UCI datasets efficiently without manual file handling.

In [71]:
from ucimlrepo import fetch_ucirepo, list_available_datasets

# check which datasets can be imported
list_available_datasets()


-------------------------------------
The following datasets are available:
-------------------------------------
Dataset Name                                                                            ID    
------------                                                                            --    
Abalone                                                                                 1     
Adult                                                                                   2     
Annealing                                                                               3     
Audiology (Standardized)                                                                8     
Auto MPG                                                                                9     
Automobile                                                                              10    
Balance Scale                                                                           12    
Balloons                       

<p style="font-family: Cambria; font-size: 22px;"><b> 3) The code below fetches the Adult Census Income dataset (dataset ID = 2) from the UCI repository and stores it in the variable adult.

>> X contains all the input features (independent variables) such as age, education, occupation, and hours worked per week.

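>> y contains the target variable, which is the income class (<=50K or >50K). In this copy of the data some labels carry a trailing period (<=50K. and >50K.), as the preview of y shows.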

This separation is essential for building Machine Learning models.

The code also prints the dataset metadata and the variable details, which support data understanding and preprocessing.


In [61]:
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
adult = fetch_ucirepo(id=2) 
  
# data (as pandas dataframes) 
X = adult.data.features 
y = adult.data.targets 
  
# metadata 
print(adult.metadata) 
  
# variable information 
print(adult.variables) 

{'uci_id': 2, 'name': 'Adult', 'repository_url': 'https://archive.ics.uci.edu/dataset/2/adult', 'data_url': 'https://archive.ics.uci.edu/static/public/2/data.csv', 'abstract': 'Predict whether annual income of an individual exceeds $50K/yr based on census data. Also known as "Census Income" dataset. ', 'area': 'Social Science', 'tasks': ['Classification'], 'characteristics': ['Multivariate'], 'num_instances': 48842, 'num_features': 14, 'feature_types': ['Categorical', 'Integer'], 'demographics': ['Age', 'Income', 'Education Level', 'Other', 'Race', 'Sex'], 'target_col': ['income'], 'index_col': None, 'has_missing_values': 'yes', 'missing_values_symbol': 'NaN', 'year_of_dataset_creation': 1996, 'last_updated': 'Tue Sep 24 2024', 'dataset_doi': '10.24432/C5XW20', 'creators': ['Barry Becker', 'Ronny Kohavi'], 'intro_paper': None, 'additional_info': {'summary': "Extraction was done by Barry Becker from the 1994 Census database.  A set of reasonably clean records was extracted using the fol

In [22]:
X

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48837,39,Private,215419,Bachelors,13,Divorced,Prof-specialty,Not-in-family,White,Female,0,0,36,United-States
48838,64,,321403,HS-grad,9,Widowed,,Other-relative,Black,Male,0,0,40,United-States
48839,38,Private,374983,Bachelors,13,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,50,United-States
48840,44,Private,83891,Bachelors,13,Divorced,Adm-clerical,Own-child,Asian-Pac-Islander,Male,5455,0,40,United-States


In [24]:
y

Unnamed: 0,income
0,<=50K
1,<=50K
2,<=50K
3,<=50K
4,<=50K
...,...
48837,<=50K.
48838,<=50K.
48839,<=50K.
48840,<=50K.
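
The preview of y shows labels both with and without a trailing period (<=50K and <=50K.). Left as they are, these would count as separate classes. Below is a minimal sketch of one way to normalize them. The frame y_demo is a small hand-made stand-in for adult.data.targets, not the real data.

```python
import pandas as pd

# Hypothetical stand-in for adult.data.targets with both label styles
y_demo = pd.DataFrame({"income": ["<=50K", ">50K", "<=50K.", ">50K."]})

# Strip the trailing period so each class has a single spelling
y_clean = y_demo["income"].str.rstrip(".")

print(y_clean.value_counts())
```

Running value_counts() on the real target column before and after this step is a quick way to check whether the cleanup is needed.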


<p style="font-family: Cambria; font-size: 22px;"><b> 4) The code below combines the feature dataset (X) and the target variable (y) into a single DataFrame called df.

X contains all the input features (age, education, occupation, etc.).

y contains the income label (<=50K or >50K).

The join() method adds the column of y to the columns of X, aligning rows on the index, so that each person’s features are correctly matched with their income label.

Having a single DataFrame makes it easier to:

>Perform exploratory data analysis (EDA)

>Check for missing values

>Create visualizations

>Apply statistical tests

This step merges features and target into one DataFrame, making data exploration and analysis easier.

In [30]:
df = X.join(y)
df

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48837,39,Private,215419,Bachelors,13,Divorced,Prof-specialty,Not-in-family,White,Female,0,0,36,United-States,<=50K.
48838,64,,321403,HS-grad,9,Widowed,,Other-relative,Black,Male,0,0,40,United-States,<=50K.
48839,38,Private,374983,Bachelors,13,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,50,United-States,<=50K.
48840,44,Private,83891,Bachelors,13,Divorced,Adm-clerical,Own-child,Asian-Pac-Islander,Male,5455,0,40,United-States,<=50K.
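
The index alignment done by join() can be seen on a toy example. The sketch below uses two small hand-made frames (features and labels are hypothetical names) whose rows are deliberately stored in a different order.

```python
import pandas as pd

# Hypothetical miniature versions of X and y; labels are stored out of order
features = pd.DataFrame({"age": [39, 50, 38]}, index=[0, 1, 2])
labels = pd.DataFrame({"income": [">50K", "<=50K", "<=50K"]}, index=[2, 0, 1])

# join() matches rows by index value, not by position
combined = features.join(labels)
print(combined)
```

Because matching is by index, the row with index 2 receives the label that was stored first in labels. If the two frames did not share an index, the unmatched rows would get NaN in the joined column.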


<p style="font-family: Cambria; font-size: 22px;"><b> 5) The command below displays the first five rows of the combined dataset (df).

>It helps us quickly preview the data.

>We can verify that the features and target variable are merged correctly.

>It allows us to check for any obvious issues such as incorrect values, formatting problems, or unexpected categories.

>This is usually the first step in exploratory data analysis (EDA).

In [32]:
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


<p style="font-family: Cambria; font-size: 22px;"><b> Alternative way of loading the dataset using Pandas

<p style="font-family: Cambria; font-size: 22px;"><b> Here, we manually define the column names for the dataset.

The original adult.data file does not include headers, so assigning column names ensures the data is readable and well-structured.

>read_csv() reads the raw data file

>names=columns assigns column names to the dataset

>sep=', ' specifies that values are separated by a comma and space

>engine='python' is needed because a multi-character separator is treated as a regular expression, which the default C parser does not support

This approach allows us to load the dataset directly from a local file, giving us full control over file paths and column naming.

This code loads the Adult Census dataset from a local file using Pandas, manually assigning column names for easier analysis.

<p style="font-family: Cambria; font-size: 22px;"><b> Why Use This Method?

Useful when working with local datasets

Provides full control over column names

Does not depend on the ucimlrepo package or on internet access once the file has been downloaded

In [12]:
import pandas as pd

# Load dataset (assuming adult.data file)
columns = [
    'age', 'workclass', 'fnlwgt', 'education', 'education-num',
    'marital-status', 'occupation', 'relationship', 'race',
    'sex', 'capital-gain', 'capital-loss', 'hours-per-week',
    'native-country', 'income'
]

df = pd.read_csv('/Users/akhilamaheedhara/Downloads/adult/adult.data', names=columns, sep=', ', engine='python')
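
As a possible variation on the call above, the sketch below keeps the default parser and lets pandas skip the space after each comma with skipinitialspace=True, while na_values="?" turns the "?" placeholder into NaN at load time. The two data rows and the shortened column list are made up for illustration and are read from memory rather than from the real adult.data file.

```python
import io

import pandas as pd

# Two made-up rows in the same "comma + space" layout as adult.data
raw = io.StringIO(
    "39, State-gov, 77516, <=50K\n"
    "54, ?, 140359, <=50K\n"
)
cols = ["age", "workclass", "fnlwgt", "income"]  # shortened, hypothetical column list

demo = pd.read_csv(
    raw,
    names=cols,              # the file has no header row
    skipinitialspace=True,   # drop the space that follows each comma
    na_values="?",           # treat the "?" placeholder as missing
)
print(demo)
```

With this approach the missing entries show up directly in isnull() counts, without a separate replacement step.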

<p style="font-family: Cambria; font-size: 22px;"><b> 6) This command displays the first five rows of the dataset stored in df.

> It helps us confirm that the data was loaded correctly from the local file.

> We can quickly verify that column names are assigned properly.

> It allows us to spot any immediate issues such as extra spaces, incorrect values, or unexpected categories (like ? for missing values).

> This step gives us a quick snapshot of the dataset before moving on to deeper analysis and cleaning.

In [14]:
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


<p style="font-family: Cambria; font-size: 22px;"><b> 7) This command displays one random row from the dataset.

>> It helps us inspect the data randomly, rather than always looking at the first few rows.

>> Useful for spotting hidden issues such as inconsistent values, unusual categories, or formatting problems.

>> Gives a more representative view of the dataset than head().

<p style="font-family: Cambria; font-size: 22px;"><b>  Why This Is Important

>> A sampled row, read together with the df.info() summary in the next step, helps identify categorical vs numerical variables

>> df.info() shows whether there are missing values (columns with fewer non-null entries)

>> Helps decide what data cleaning and encoding steps are needed

>> Confirms whether data types are correctly assigned

<p style="font-family: Cambria; font-size: 22px;"><b>  Typical Findings in This Dataset

>> Numerical columns like age, hours-per-week, capital-gain

>> Categorical columns stored as object (e.g., workclass, education)

>> Missing data appears as NaN and also as the placeholder "?" (e.g., in workclass)

In [34]:
df.sample()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
43119,31,Private,103642,HS-grad,9,Married-civ-spouse,Transport-moving,Husband,White,Male,0,0,45,United-States,>50K.
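
Because sample() picks rows at random, each run shows a different row. The sketch below uses a small made-up frame to show how n and random_state can be used to draw several rows in a repeatable way. The value 42 is an arbitrary seed.

```python
import pandas as pd

# Small made-up frame standing in for df
demo = pd.DataFrame({
    "age": list(range(20, 30)),
    "hours-per-week": [40] * 10,
})

# Same seed, same rows: useful when the notebook is re-run or shared
first_draw = demo.sample(n=3, random_state=42)
second_draw = demo.sample(n=3, random_state=42)

print(first_draw)
```

Without random_state the two draws would usually differ, which is fine for casual inspection but makes written observations harder to reproduce.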


<p style="font-family: Cambria; font-size: 22px;"><b> 8) This command provides a summary of the dataset structure.

It tells us:

> Number of rows and columns in the dataset

> Column names

> Data types of each variable (integer, float, object/categorical)

> Non-null counts for each column

> Memory usage of the DataFrame

In [36]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             48842 non-null  int64 
 1   workclass       47879 non-null  object
 2   fnlwgt          48842 non-null  int64 
 3   education       48842 non-null  object
 4   education-num   48842 non-null  int64 
 5   marital-status  48842 non-null  object
 6   occupation      47876 non-null  object
 7   relationship    48842 non-null  object
 8   race            48842 non-null  object
 9   sex             48842 non-null  object
 10  capital-gain    48842 non-null  int64 
 11  capital-loss    48842 non-null  int64 
 12  hours-per-week  48842 non-null  int64 
 13  native-country  48568 non-null  object
 14  income          48842 non-null  object
dtypes: int64(6), object(9)
memory usage: 5.6+ MB
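
The data types reported by info() can also be used in code to split the columns into numerical and categorical groups. The sketch below does this with select_dtypes() on a small made-up frame. The names numeric_cols and categorical_cols are chosen here for illustration.

```python
import pandas as pd

# Small made-up frame with the same mix of dtypes as the Adult data
demo = pd.DataFrame({
    "age": [39, 50],
    "workclass": ["State-gov", "Private"],
    "hours-per-week": [40, 13],
    "income": ["<=50K", "<=50K"],
})

# Numeric dtypes (int64, float64, ...) versus everything else
numeric_cols = demo.select_dtypes(include="number").columns.tolist()
categorical_cols = demo.select_dtypes(exclude="number").columns.tolist()

print("Numerical:", numeric_cols)
print("Categorical:", categorical_cols)
```

These two lists are a convenient starting point when different encoding or scaling steps are applied to each group later.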


<p style="font-family: Cambria; font-size: 22px;"><b> 9) This command returns the dimensions of the dataset in the format (rows, columns).

In other words:

> The first number is the total number of records (rows).

> The second number is the total number of features plus the target variable (columns).

  >> Example for Adult Census Dataset:

                  df.shape → (48842, 15)

                  48,842 rows (individual records)

                  15 columns (14 features + 1 target column income)
        

<p style="font-family: Cambria; font-size: 22px;"><b> Why It’s Important

> Confirms that the dataset has loaded completely

> Helps verify the number of features for modeling

> Provides context for train-test split and sampling

In [40]:
df.shape

(48842, 15)
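
Since shape is a plain tuple, it can be unpacked into separate variables. The short sketch below does this on a made-up three-row frame and derives the feature count, assuming exactly one target column.

```python
import pandas as pd

# Made-up frame: two features plus the income target
demo = pd.DataFrame({
    "age": [39, 50, 38],
    "education-num": [13, 13, 9],
    "income": ["<=50K", "<=50K", ">50K"],
})

n_rows, n_cols = demo.shape
n_features = n_cols - 1  # assumes exactly one target column

print(f"{n_rows} rows, {n_features} features + 1 target")
```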

In [42]:
df.describe()

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week
count,48842.0,48842.0,48842.0,48842.0,48842.0,48842.0
mean,38.643585,189664.1,10.078089,1079.067626,87.502314,40.422382
std,13.71051,105604.0,2.570973,7452.019058,403.004552,12.391444
min,17.0,12285.0,1.0,0.0,0.0,1.0
25%,28.0,117550.5,9.0,0.0,0.0,40.0
50%,37.0,178144.5,10.0,0.0,0.0,40.0
75%,48.0,237642.0,12.0,0.0,0.0,45.0
max,90.0,1490400.0,16.0,99999.0,4356.0,99.0


In [44]:
df.isnull().sum()

age                 0
workclass         963
fnlwgt              0
education           0
education-num       0
marital-status      0
occupation        966
relationship        0
race                0
sex                 0
capital-gain        0
capital-loss        0
hours-per-week      0
native-country    274
income              0
dtype: int64
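
isnull() only counts true NaN entries, so values stored as the "?" placeholder are not included in the totals above. The sketch below shows one way to fold them in, using a small made-up frame. The counts it prints are for the toy data only.

```python
import numpy as np
import pandas as pd

# Made-up frame mixing real NaN values with the "?" placeholder
demo = pd.DataFrame({
    "workclass": ["Private", "?", np.nan, "State-gov"],
    "occupation": ["Sales", "?", np.nan, "?"],
    "age": [25, 38, 64, 41],
})

before = demo.isnull().sum()

# Convert the placeholder to NaN so both kinds of missing value are counted
cleaned = demo.replace("?", np.nan)
after = cleaned.isnull().sum()

print(pd.DataFrame({"before": before, "after": after}))
```

Comparing the two columns of the printed table shows how many missing entries were hidden behind the placeholder.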

In [46]:
for col in df.columns:
    print(f"Unique values in '{col}':")
    print(df[col].unique())
    print("-" * 40)

Unique values in 'age':
[39 50 38 53 28 37 49 52 31 42 30 23 32 40 34 25 43 54 35 59 56 19 20 45
 22 48 21 24 57 44 41 29 18 47 46 36 79 27 67 33 76 17 55 61 70 64 71 68
 66 51 58 26 60 90 75 65 77 62 63 80 72 74 69 73 81 78 88 82 83 84 85 86
 87 89]
----------------------------------------
Unique values in 'workclass':
['State-gov' 'Self-emp-not-inc' 'Private' 'Federal-gov' 'Local-gov' '?'
 'Self-emp-inc' 'Without-pay' 'Never-worked' nan]
----------------------------------------
Unique values in 'fnlwgt':
[ 77516  83311 215646 ... 173449  89686 350977]
----------------------------------------
Unique values in 'education':
['Bachelors' 'HS-grad' '11th' 'Masters' '9th' 'Some-college' 'Assoc-acdm'
 'Assoc-voc' '7th-8th' 'Doctorate' 'Prof-school' '5th-6th' '10th'
 '1st-4th' 'Preschool' '12th']
----------------------------------------
Unique values in 'education-num':
[13  9  7 14  5 10 12 11  4 16 15  3  6  2  1  8]
----------------------------------------
Unique values in 'marital-status