Descriptive Statistics - Measures of Central Tendency and Variability
Perform the following operations on any open-source dataset (e.g., data.csv).
1. Provide summary statistics (mean, median, minimum, maximum, standard deviation) for
the numeric variables of a dataset (age, income, etc.), grouped by one of its qualitative
(categorical) variables. For example, if the categorical variable is age group and the
quantitative variable is income, provide summary statistics of income grouped by age
group. Create a list that contains a numeric value for each response to the categorical
variable.

In [7]:
import pandas as pd

# Load the dataset
df = pd.read_csv('adult.csv')

# Display the first 5 rows to understand the structure
print("Dataset Head:")
print(df.head())

Dataset Head:
   age  workclass  fnlwgt     education  educational-num      marital-status  \
0   25    Private  226802          11th                7       Never-married   
1   38    Private   89814       HS-grad                9  Married-civ-spouse   
2   28  Local-gov  336951    Assoc-acdm               12  Married-civ-spouse   
3   44    Private  160323  Some-college               10  Married-civ-spouse   
4   18          ?  103497  Some-college               10       Never-married   

          occupation relationship   race  gender  capital-gain  capital-loss  \
0  Machine-op-inspct    Own-child  Black    Male             0             0   
1    Farming-fishing      Husband  White    Male             0             0   
2    Protective-serv      Husband  White    Male             0             0   
3  Machine-op-inspct      Husband  Black    Male          7688             0   
4                  ?    Own-child  White  Female             0             0   

   hours-per-week native
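
The head output above shows '?' in the workclass and occupation columns, which appears to be a placeholder for missing entries. A minimal sketch of one way to count such placeholders, assuming a made-up two-row CSV (the `csv_text` string and the `toy` frame are illustrative only, not read from adult.csv):

```python
import io
import pandas as pd

# Hypothetical two-row CSV that mimics the '?' placeholder seen in the output above
csv_text = "age,workclass\n25,Private\n18,?\n"

# na_values tells read_csv to treat '?' as a missing value (NaN)
toy = pd.read_csv(io.StringIO(csv_text), na_values='?')

# Count missing entries per column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

Whether to treat '?' as missing depends on the analysis. The grouped statistics in this notebook use only 'age' and 'income', and the rows shown above have no placeholders in 'age'.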

In [8]:
# Group by the categorical variable 'income' and compute summary
# statistics for the numeric variable 'age'
summary_stats = df.groupby('income')['age'].agg(['mean', 'median', 'min', 'max', 'std'])

print("\nSummary Statistics of Age grouped by Income:")
print(summary_stats)


Summary Statistics of Age grouped by Income:
             mean  median  min  max        std
income                                        
<=50K   36.872184    34.0   17   90  14.104118
>50K    44.275178    43.0   19   90  10.558983
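
The same grouping can cover several numeric columns in one call. A minimal sketch, assuming a small made-up DataFrame (the `toy` frame and its values are illustrative only, not taken from adult.csv):

```python
import pandas as pd

# Hypothetical toy data, for illustration only (not from adult.csv)
toy = pd.DataFrame({
    'income': ['<=50K', '<=50K', '>50K', '>50K'],
    'age': [20, 30, 40, 50],
    'hours-per-week': [30, 40, 45, 55],
})

# Apply the same five statistics to every selected numeric column
stats = ['mean', 'median', 'min', 'max', 'std']
toy_summary = toy.groupby('income')[['age', 'hours-per-week']].agg(stats)

# The result has two-level column labels: (column name, statistic)
print(toy_summary)
```

Note that pandas computes `std` as the sample standard deviation (ddof=1) by default, so it divides by n - 1 rather than n.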


In [9]:
# 1. Create a unique list of categories
categories = df['income'].unique().tolist()
print(f"\nUnique categories in 'income': {categories}")

# 2. Create a numeric mapping (e.g., <=50K -> 0, >50K -> 1)
# Convert 'income' to a categorical type once and reuse it below;
# its integer codes become a new numeric column
income_cat = df['income'].astype('category')
df['income_numeric'] = income_cat.cat.codes

# 3. Create a list that contains the numeric value for each response in the dataset
numeric_list = df['income_numeric'].tolist()

# Display the first 10 entries of the numeric list
print("\nFirst 10 numeric values for the 'income' variable:")
print(numeric_list[:10])

# To see which number corresponds to which label:
mapping = dict(enumerate(income_cat.cat.categories))
print("\nMapping (Code: Label):")
print(mapping)


Unique categories in 'income': ['<=50K', '>50K']

First 10 numeric values for the 'income' variable:
[0, 0, 1, 1, 0, 0, 0, 1, 0, 0]

Mapping (Code: Label):
{0: '<=50K', 1: '>50K'}
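
The codes above are assigned by pandas from the sorted category labels. If the numeric values need to be chosen by hand, an explicit dictionary is one alternative. A minimal sketch, assuming a made-up `responses` series and a hypothetical `label_to_code` mapping (neither comes from the notebook above):

```python
import pandas as pd

# Hypothetical toy responses, for illustration only; None stands for a missing answer
responses = pd.Series(['<=50K', '>50K', '<=50K', None])

# Hand-written mapping (an assumption: pick whatever codes suit the analysis)
label_to_code = {'<=50K': 0, '>50K': 1}
explicit_codes = responses.map(label_to_code)

# cat.codes marks missing values with -1, whereas map() leaves them as NaN
category_codes = responses.astype('category').cat.codes.tolist()

print(explicit_codes.tolist())
print(category_codes)
```

The difference matters mainly when the column has missing responses: a -1 code can slip into later arithmetic unnoticed, while NaN is skipped by most pandas aggregations.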
