# Exercises in Data Transformation and Exploratory Data Analysis

This notebook contains the exercises for the class of February 5, 2025, in the course Data & Things at Roskilde University.

In [49]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Exercise 3

Do an exploratory data analysis of the Adult dataset. The cell below loads the dataset from the UCI Machine Learning Repository into a pandas dataframe called `adult_data`. It requires that you have installed the package `ucimlrepo`. (Otherwise, the dataset is available on Moodle for this class.)

In [1]:
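from ucimlrepo import fetch_ucirepo

# Fetch the Adult dataset (UCI id=2): features and target come separately.
adult = fetch_ucirepo(id=2)
X = adult.data.features
y = adult.data.targets

# Copy first so the income column is added to our own dataframe,
# not to the features object returned by ucimlrepo.
adult_data = X.copy()
adult_data["income"] = y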

adult_data

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48837,39,Private,215419,Bachelors,13,Divorced,Prof-specialty,Not-in-family,White,Female,0,0,36,United-States,<=50K.
48838,64,,321403,HS-grad,9,Widowed,,Other-relative,Black,Male,0,0,40,United-States,<=50K.
48839,38,Private,374983,Bachelors,13,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,50,United-States,<=50K.
48840,44,Private,83891,Bachelors,13,Divorced,Adm-clerical,Own-child,Asian-Pac-Islander,Male,5455,0,40,United-States,<=50K.


In [None]:
# The income values need cleaning, because some entries end with a "." while others do not.
# Income should be either ">50K" or "<=50K", but before cleaning the column
# contained four labels: ['<=50K', '>50K', '<=50K.', '>50K.']
adult_data['income'] = adult_data['income'].str.replace('.', '', regex=False)
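
One way to confirm that the cleaning worked is to compare the distinct labels before and after. The sketch below uses a small hypothetical series, `income_demo`, as a stand-in for `adult_data['income']`; its values are illustrative and not taken from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for adult_data['income'] with the four raw labels.
income_demo = pd.Series(['<=50K', '>50K', '<=50K.', '>50K.', '<=50K.'])

# Same cleaning step as in the cell above: remove the literal ".".
# regex=False keeps "." from being read as the regex wildcard.
income_clean = income_demo.str.replace('.', '', regex=False)

labels_before = sorted(income_demo.unique())
labels_after = sorted(income_clean.unique())
print("Before:", labels_before)
print("After: ", labels_after)
```

On the real dataframe, the same check would be `adult_data['income'].unique()` before and after the cleaning cell.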

You need to explain what the data is about, which variables the dataset contains, and what their data types are. Moreover, for each individual variable you should investigate and explain its distribution/variation through visualization and descriptive statistics.

Finally, you should investigate and explain the variation/correlation between pairs of variables. Here it is enough to investigate three pairs of variables: one where both variables are categorical, one where both variables are numeric, and one where one of the variables is categorical and the other is numeric.
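
The three kinds of variable pairs can each be summarized with a different pandas tool. The sketch below is one possible approach, not a required method: it uses a small hypothetical dataframe, `demo`, whose column names mimic the Adult dataset but whose values are made up for illustration.

```python
import pandas as pd

# Hypothetical sample that mimics a few columns of adult_data.
demo = pd.DataFrame({
    'sex': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female'],
    'income': ['<=50K', '<=50K', '>50K', '>50K', '<=50K', '<=50K'],
    'age': [25, 32, 47, 51, 38, 29],
    'hours-per-week': [40, 35, 50, 45, 40, 30],
})

# Categorical vs. categorical: contingency table with row-wise proportions.
cat_cat = pd.crosstab(demo['sex'], demo['income'], normalize='index')

# Numeric vs. numeric: Pearson correlation coefficient.
num_num = demo['age'].corr(demo['hours-per-week'])

# Categorical vs. numeric: descriptive statistics of the numeric variable per group.
cat_num = demo.groupby('income')['age'].describe()

print(cat_cat, num_num, cat_num, sep='\n\n')
```

Each summary has a natural plot to go with it, for example a grouped bar chart, a scatter plot, and a box plot, respectively.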

### Explain what the data is about
<li>Explain what the data is about</li>
<p>The data contains demographic information about people and whether their income is above or below 50,000 dollars a year.
<br>This allows us to predict whether a person earns >50K or <=50K, based on attributes such as age, education, and occupation.
</p><hr>
<li>Explain variables</li>
<p>The dataset contains 15 columns and 48842 entries (rows).
<br>Categorical variables include: (workclass, education, marital-status, occupation, relationship, race, native-country).
<br>Numeric variables include: (age, fnlwgt, education-num, capital-gain, capital-loss, hours-per-week).
<br>Binary variables include: (sex, income).</p>
<p>All numeric variables are int64. <br>The categorical and binary variables are stored as objects (strings).</p>
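
The grouping into numeric, categorical, and binary variables can also be derived programmatically. The following is a minimal sketch on a hypothetical three-column dataframe, `demo`; the rule "a text column with exactly two distinct values is binary" is an assumption made here for illustration.

```python
import pandas as pd

# Hypothetical miniature of adult_data with one column of each kind.
demo = pd.DataFrame({
    'age': [39, 50, 38],
    'workclass': ['State-gov', 'Self-emp-not-inc', 'Private'],
    'sex': ['Male', 'Male', 'Female'],
})

# Numeric columns (int64 in the Adult dataset).
numeric_cols = demo.select_dtypes(include='number').columns.tolist()

# Everything else is stored as text.
text_cols = demo.select_dtypes(exclude='number').columns.tolist()

# Assumption: a text column with exactly two distinct values counts as binary.
binary_cols = [col for col in text_cols if demo[col].nunique() == 2]

print(numeric_cols, text_cols, binary_cols)
```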


In [None]:
# How I found which variables the dataset contains.
adult_data.columns

Index(['age', 'workclass', 'fnlwgt', 'education', 'education-num',
       'marital-status', 'occupation', 'relationship', 'race', 'sex',
       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
       'income'],
      dtype='object')

In [None]:
# How I found what their data type is.
adult_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             48842 non-null  int64 
 1   workclass       47879 non-null  object
 2   fnlwgt          48842 non-null  int64 
 3   education       48842 non-null  object
 4   education-num   48842 non-null  int64 
 5   marital-status  48842 non-null  object
 6   occupation      47876 non-null  object
 7   relationship    48842 non-null  object
 8   race            48842 non-null  object
 9   sex             48842 non-null  object
 10  capital-gain    48842 non-null  int64 
 11  capital-loss    48842 non-null  int64 
 12  hours-per-week  48842 non-null  int64 
 13  native-country  48568 non-null  object
 14  income          48842 non-null  object
dtypes: int64(6), object(9)
memory usage: 5.6+ MB


In [None]:
# Summary of the categorical columns: non-null count, number of unique values,
# most frequent value (top) and how often it occurs (freq).
adult_data.describe(include='object')

Unnamed: 0,workclass,education,marital-status,occupation,relationship,race,sex,native-country,income
count,47879,48842,48842,47876,48842,48842,48842,48568,48842
unique,9,16,7,15,6,5,2,42,2
top,Private,HS-grad,Married-civ-spouse,Prof-specialty,Husband,White,Male,United-States,<=50K
freq,33906,15784,22379,6172,19716,41762,32650,43832,37155


In [None]:
# Here I am printing the unique values in each categorical column.
for column in ['workclass', 'education', 'relationship','marital-status', 'occupation','race', 'native-country','sex', 'income']:
    print(f"Unique values in {column}:")
    print(adult_data[column].unique().tolist())
    print()


Unique values in workclass:
['State-gov', 'Self-emp-not-inc', 'Private', 'Federal-gov', 'Local-gov', '?', 'Self-emp-inc', 'Without-pay', 'Never-worked', nan]

Unique values in education:
['Bachelors', 'HS-grad', '11th', 'Masters', '9th', 'Some-college', 'Assoc-acdm', 'Assoc-voc', '7th-8th', 'Doctorate', 'Prof-school', '5th-6th', '10th', '1st-4th', 'Preschool', '12th']

Unique values in relationship:
['Not-in-family', 'Husband', 'Wife', 'Own-child', 'Unmarried', 'Other-relative']

Unique values in marital-status:
['Never-married', 'Married-civ-spouse', 'Divorced', 'Married-spouse-absent', 'Separated', 'Married-AF-spouse', 'Widowed']

Unique values in occupation:
['Adm-clerical', 'Exec-managerial', 'Handlers-cleaners', 'Prof-specialty', 'Other-service', 'Sales', 'Craft-repair', 'Transport-moving', 'Farming-fishing', 'Machine-op-inspct', 'Tech-support', '?', 'Protective-serv', 'Armed-Forces', 'Priv-house-serv', nan]

Unique values in race:
['White', 'Black', 'Asian-Pac-Islander', 'Amer-In
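
The lists above show that missing values in `workclass` and `occupation` appear in two forms: as the string `'?'` and as `nan`. One option, sketched below on a hypothetical series `workclass_demo`, is to convert `'?'` to `NaN` so that pandas treats both forms as missing. Whether this is the right treatment for the analysis is a judgment call.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for adult_data['workclass'] with both missing-value forms.
workclass_demo = pd.Series(['Private', '?', 'State-gov', np.nan, 'Private'])

# Turn the "?" placeholder into a real missing value.
workclass_clean = workclass_demo.replace('?', np.nan)

n_missing_before = int(workclass_demo.isna().sum())
n_missing_after = int(workclass_clean.isna().sum())
print(n_missing_before, n_missing_after)
```

Applied to the whole dataframe, the equivalent call would be `adult_data.replace('?', np.nan)`.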

In [14]:
adult_data.describe(include='all')

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
count,48842.0,47879,48842.0,48842,48842.0,48842,47876,48842,48842,48842,48842.0,48842.0,48842.0,48568,48842
unique,,9,,16,,7,15,6,5,2,,,,42,4
top,,Private,,HS-grad,,Married-civ-spouse,Prof-specialty,Husband,White,Male,,,,United-States,<=50K
freq,,33906,,15784,,22379,6172,19716,41762,32650,,,,43832,24720
mean,38.643585,,189664.1,,10.078089,,,,,,1079.067626,87.502314,40.422382,,
std,13.71051,,105604.0,,2.570973,,,,,,7452.019058,403.004552,12.391444,,
min,17.0,,12285.0,,1.0,,,,,,0.0,0.0,1.0,,
25%,28.0,,117550.5,,9.0,,,,,,0.0,0.0,40.0,,
50%,37.0,,178144.5,,10.0,,,,,,0.0,0.0,40.0,,
75%,48.0,,237642.0,,12.0,,,,,,0.0,0.0,45.0,,
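
In the table above, the 25%, 50%, and 75% quantiles of `capital-gain` and `capital-loss` are all 0, so most rows hold a zero and the mean is driven by a minority of non-zero values. A way to describe such a variable is to report the share of zeros and summarize the non-zero part separately. The sketch below uses a hypothetical series, `gain_demo`, with made-up values.

```python
import pandas as pd

# Hypothetical stand-in for adult_data['capital-gain']: mostly zeros, a few large values.
gain_demo = pd.Series([0, 0, 0, 0, 0, 0, 0, 2174, 5455, 20000])

# Share of rows where the value is exactly zero.
share_zero = (gain_demo == 0).mean()

# Descriptive statistics for the non-zero values only.
nonzero_summary = gain_demo[gain_demo > 0].describe()

print(f"Share of zeros: {share_zero:.0%}")
print(nonzero_summary)
```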


In [50]:
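def make_percent(values):
    """Return a table with the count and percentage of each distinct value."""
    counts = values.value_counts()
    percentages = values.value_counts(normalize=True) * 100
    return pd.DataFrame({'Count': counts, 'Percentage': percentages.round(2)})

# Extra exploration: the mean age and the distribution of workclass.
# Note that value_counts() skips missing values (NaN) by default.
print("Average age:", adult_data['age'].mean())
print(make_percent(adult_data['workclass']))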



Average age: 38.64358543876172
                  Count  Percentage
workclass                          
Private           33906       70.82
Self-emp-not-inc   3862        8.07
Local-gov          3136        6.55
State-gov          1981        4.14
?                  1836        3.83
Self-emp-inc       1695        3.54
Federal-gov        1432        2.99
Without-pay          21        0.04
Never-worked         10        0.02
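
The percentages above are relative to the non-missing rows only, because `value_counts()` drops `NaN` by default. If missing values should show up as their own row, a variant like the one below could be used. The name `make_percent_with_missing` and the demo series are hypothetical and only serve as an illustration.

```python
import numpy as np
import pandas as pd

def make_percent_with_missing(values):
    """Count and percentage of each distinct value, with NaN as its own row."""
    counts = values.value_counts(dropna=False)
    # Derive the percentages from the counts so both columns share one index.
    percentages = counts / counts.sum() * 100
    return pd.DataFrame({'Count': counts, 'Percentage': percentages.round(2)})

# Hypothetical stand-in for adult_data['workclass'].
workclass_demo = pd.Series(['Private', 'Private', 'Local-gov', np.nan])
table = make_percent_with_missing(workclass_demo)
print(table)
```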
