# Adult Dataset: Classification Project

**Goal:** Predict whether an individual's annual income exceeds $50,000  
**Source:** Barry Becker, 1994 US Census database  
**Dataset:** 48,842 individuals · 14 features · 1 binary target variable  
**DOI :** [10.24432/C5XW20](https://doi.org/10.24432/C5XW20)

---
## 1. Loading the data

In [1]:
# libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pprint import pprint

In [2]:
# 'paper' is a seaborn plotting context; pass it by keyword for clarity
sns.set_theme(context='paper')

In [3]:
from ucimlrepo import fetch_ucirepo 

# fetch dataset 
adult = fetch_ucirepo(id=2) 

In [4]:
# metadata
pprint(adult.metadata) 

{'abstract': 'Predict whether annual income of an individual exceeds $50K/yr '
             'based on census data. Also known as "Census Income" dataset. ',
 'additional_info': {'citation': None,
                     'funded_by': None,
                     'instances_represent': None,
                     'preprocessing_description': None,
                     'purpose': None,
                     'recommended_data_splits': None,
                     'sensitive_data': None,
                     'summary': 'Extraction was done by Barry Becker from the '
                                '1994 Census database.  A set of reasonably '
                                'clean records was extracted using the '
                                'following conditions: ((AAGE>16) && (AGI>100) '
                                '&& (AFNLWGT>1)&& (HRSWK>0))\n'
                                '\n'
                                'Prediction task is to determine whether a '
                               

In [5]:
# variable descriptions
adult.variables

| | name | role | type | demographic | description | units | missing_values |
|---|---|---|---|---|---|---|---|
| 0 | age | Feature | Integer | Age | | | no |
| 1 | workclass | Feature | Categorical | Income | Private, Self-emp-not-inc, Self-emp-inc, Feder... | | yes |
| 2 | fnlwgt | Feature | Integer | | | | no |
| 3 | education | Feature | Categorical | Education Level | Bachelors, Some-college, 11th, HS-grad, Prof-... | | no |
| 4 | education-num | Feature | Integer | Education Level | | | no |
| 5 | marital-status | Feature | Categorical | Other | Married-civ-spouse, Divorced, Never-married, S... | | no |
| 6 | occupation | Feature | Categorical | Other | Tech-support, Craft-repair, Other-service, Sal... | | yes |
| 7 | relationship | Feature | Categorical | Other | Wife, Own-child, Husband, Not-in-family, Other... | | no |
| 8 | race | Feature | Categorical | Race | White, Asian-Pac-Islander, Amer-Indian-Eskimo,... | | no |
| 9 | sex | Feature | Binary | Sex | Female, Male. | | no |


In [53]:
# full dataframe
df = adult.data.original
print(f'Full dataset: {df.shape[0]:_} rows × {df.shape[1]} columns')

Full dataset: 48_842 rows × 15 columns


In [54]:
# column names
cols = adult.data.headers.tolist()
target = ['income']
features = [col for col in cols if col not in target]

In [55]:
# preview of the first 5 rows
df.head()

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


## 2. Exploratory Analysis

### 2.1 Variable types

In [56]:
# variable names and types
df.info()

<class 'pandas.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype   
---  ------          --------------  -----   
 0   age             48842 non-null  int64   
 1   workclass       47879 non-null  category
 2   fnlwgt          48842 non-null  int64   
 3   education       48842 non-null  category
 4   education-num   48842 non-null  int64   
 5   marital-status  48842 non-null  category
 6   occupation      47876 non-null  category
 7   relationship    48842 non-null  category
 8   race            48842 non-null  category
 9   sex             48842 non-null  category
 10  capital-gain    48842 non-null  int64   
 11  capital-loss    48842 non-null  int64   
 12  hours-per-week  48842 non-null  int64   
 13  native-country  48568 non-null  category
 14  income          48842 non-null  category
dtypes: category(9), int64(6)
memory usage: 2.7 MB


In [57]:
# integer-typed columns are the numeric variables
num_features = list(df.select_dtypes(include='number'))
num_features

['age',
 'fnlwgt',
 'education-num',
 'capital-gain',
 'capital-loss',
 'hours-per-week']

In [58]:
df[num_features].head()

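| | age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week |
|---|---|---|---|---|---|---|
| 0 | 39 | 77516 | 13 | 2174 | 0 | 40 |
| 1 | 50 | 83311 | 13 | 0 | 0 | 13 |
| 2 | 38 | 215646 | 9 | 0 | 0 | 40 |
| 3 | 53 | 234721 | 7 | 0 | 0 | 40 |
| 4 | 28 | 338409 | 13 | 0 | 0 | 40 |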


In [59]:
# the remaining (string-valued) columns should be categorical variables
cat_features = list(df.select_dtypes(exclude='number'))
cat_features

['workclass',
 'education',
 'marital-status',
 'occupation',
 'relationship',
 'race',
 'sex',
 'native-country',
 'income']

In [60]:
df[cat_features].head()

| | workclass | education | marital-status | occupation | relationship | race | sex | native-country | income |
|---|---|---|---|---|---|---|---|---|---|
| 0 | State-gov | Bachelors | Never-married | Adm-clerical | Not-in-family | White | Male | United-States | <=50K |
| 1 | Self-emp-not-inc | Bachelors | Married-civ-spouse | Exec-managerial | Husband | White | Male | United-States | <=50K |
| 2 | Private | HS-grad | Divorced | Handlers-cleaners | Not-in-family | White | Male | United-States | <=50K |
| 3 | Private | 11th | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | United-States | <=50K |
| 4 | Private | Bachelors | Married-civ-spouse | Prof-specialty | Wife | Black | Female | Cuba | <=50K |


In [61]:
# convert these columns to the category dtype
df[cat_features] = df[cat_features].astype('category')

### 2.2 Variable descriptions

In [62]:
# summary statistics of the numeric variables
df.describe() \
  .round() \
  .astype(int)

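| | age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week |
|---|---|---|---|---|---|---|
| count | 48842 | 48842 | 48842 | 48842 | 48842 | 48842 |
| mean | 39 | 189664 | 10 | 1079 | 88 | 40 |
| std | 14 | 105604 | 3 | 7452 | 403 | 12 |
| min | 17 | 12285 | 1 | 0 | 0 | 1 |
| 25% | 28 | 117550 | 9 | 0 | 0 | 40 |
| 50% | 37 | 178144 | 10 | 0 | 0 | 40 |
| 75% | 48 | 237642 | 12 | 0 | 0 | 45 |
| max | 90 | 1490400 | 16 | 99999 | 4356 | 99 |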


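Remarks:
- What is the `fnlwgt` variable for, and is it useful for prediction?
- The 75th percentile of `capital-gain` and of `capital-loss` is 0, so strictly positive values account for at most 25% of the rows for each variable.
- At least half of the individuals work 40 to 45 hours per week (the interquartile range of `hours-per-week`); are the other values outliers?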
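
The last two remarks can be checked numerically. The sketch below is only an illustration: `toy` is a hypothetical frame with invented values that reuses the dataset's column names, and on the real data it would be replaced by `df`.

```python
import pandas as pd

# Hypothetical toy data reusing the dataset's column names (values are invented).
toy = pd.DataFrame({
    'capital-gain':   [0, 0, 0, 2174, 0, 0, 0, 5000],
    'capital-loss':   [0, 0, 1902, 0, 0, 0, 0, 0],
    'hours-per-week': [40, 13, 40, 45, 40, 60, 42, 40],
})

# Share of rows with a strictly positive value, per column.
nonzero_share = (toy[['capital-gain', 'capital-loss']] > 0).mean()

# Share of rows working 40 to 45 hours per week (bounds included).
share_40_45 = toy['hours-per-week'].between(40, 45).mean()

print(nonzero_share.round(3))
print(f'40 to 45 hours: {share_40_45:.1%}')
```

Taking the mean of a boolean mask gives the proportion of `True` values directly, which avoids dividing counts by `df.shape[0]` by hand.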

In [42]:
print('Unique values of the categorical variables:')
for cat_feature in cat_features:
    print(f'\t- {cat_feature} : {df[cat_feature].cat.categories.to_list()}')

Unique values of the categorical variables:
	- workclass : ['?', 'Federal-gov', 'Local-gov', 'Never-worked', 'Private', 'Self-emp-inc', 'Self-emp-not-inc', 'State-gov', 'Without-pay']
	- education : ['10th', '11th', '12th', '1st-4th', '5th-6th', '7th-8th', '9th', 'Assoc-acdm', 'Assoc-voc', 'Bachelors', 'Doctorate', 'HS-grad', 'Masters', 'Preschool', 'Prof-school', 'Some-college']
	- marital-status : ['Divorced', 'Married-AF-spouse', 'Married-civ-spouse', 'Married-spouse-absent', 'Never-married', 'Separated', 'Widowed']
	- occupation : ['?', 'Adm-clerical', 'Armed-Forces', 'Craft-repair', 'Exec-managerial', 'Farming-fishing', 'Handlers-cleaners', 'Machine-op-inspct', 'Other-service', 'Priv-house-serv', 'Prof-specialty', 'Protective-serv', 'Sales', 'Tech-support', 'Transport-moving']
	- relationship : ['Husband', 'Not-in-family', 'Other-relative', 'Own-child', 'Unmarried', 'Wife']
	- race : ['Amer-Indian-Eskimo', 'Asian-Pac-Islander', 'Black', 'Other', 'White']
	- sex : ['Female', 'Mal

Remarks:
- Caution: missing values are encoded as `?`.
- Is `education` redundant with `education-num`? If so, `education-num` is the one to keep, since it preserves an ordering.
- Does `native-country` carry more information than a variable that groups countries by geographic and cultural region?
- Are `workclass` and `occupation` related and redundant? If so, which one carries more information?
- Same question for `marital-status` and `relationship`.
- The target variable should be cleaned (extra `.`) and encoded as 0 for `<=50K` and 1 for `>50K`?
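
As an illustration of the last remark, here is a minimal sketch of cleaning and encoding the target. It assumes the extra `.` appears at the end of some labels; the `income` series below is a hypothetical sample, not the real column.

```python
import pandas as pd

# Hypothetical sample of raw labels, assuming the extra '.' is a trailing character.
income = pd.Series(['<=50K', '>50K', '<=50K.', '>50K.'], name='income')

# Strip the trailing '.' so both spellings collapse to a single label.
income_clean = income.str.rstrip('.')

# Encode the target: 0 for '<=50K', 1 for '>50K'.
income_encoded = income_clean.map({'<=50K': 0, '>50K': 1})

print(income_encoded.tolist())
```

On the real `df`, where `income` has the category dtype, one option is to convert the column with `.astype(str)` before cleaning, so the result does not keep the old category labels.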

### 2.3 Handling missing values

In [67]:
# Replace '?' with NaN
df.replace('?', np.nan, inplace=True)
missing = df.isna().sum()
print('Missing values:')
print(missing[missing > 0])

Missing values:
workclass         2799
occupation        2809
native-country     857
dtype: int64


In [70]:
# proportion of missing values
(missing[missing > 0] / df.shape[0]).round(3)

workclass         0.057
occupation        0.058
native-country    0.018
dtype: float64

These missing values are confined to the categorical variables `workclass`, `occupation` and `native-country`, where they account for 1.8% to 5.8% of the rows.
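
With proportions this small, two common follow-ups are to drop the incomplete rows or to keep them under an explicit label. The sketch below shows both on a hypothetical frame `toy` with invented values; the label `Unknown` is also an assumption, and the notebook itself does not choose between the two options.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data using '?' as the missing-value marker.
toy = pd.DataFrame({
    'workclass':      ['Private', '?', 'State-gov', 'Private'],
    'occupation':     ['Sales', '?', 'Adm-clerical', '?'],
    'native-country': ['Cuba', 'United-States', '?', 'United-States'],
})

# Replace the marker with NaN (returns a new frame, no in-place change).
toy = toy.replace('?', np.nan)

# Option 1: drop every row that has at least one missing value.
dropped = toy.dropna()

# Option 2: keep all rows and make missingness an explicit label.
filled = toy.fillna('Unknown')

print(f'rows kept after dropna: {len(dropped)} / {len(toy)}')
print(filled['occupation'].value_counts())
```

On columns with the category dtype, the new label has to be registered first, for example with `cat.add_categories`, before `fillna` can use it.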

### 2.4 Cleaning

In [72]:
# to review: fnlwgt, `education` and `education-num` (and `native-country`?)
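
One way to approach this review, sketched under assumptions: check whether `education` and `education-num` map one-to-one, and drop the redundant column if they do. The frame `toy` is hypothetical with invented values, and dropping `fnlwgt` is shown only as one possible choice, not a conclusion reached in this notebook.

```python
import pandas as pd

# Hypothetical toy data reusing the dataset's column names (values are invented).
toy = pd.DataFrame({
    'education':     ['Bachelors', 'HS-grad', '11th', 'Bachelors', 'HS-grad'],
    'education-num': [13, 9, 7, 13, 9],
    'fnlwgt':        [77516, 83311, 215646, 234721, 338409],
})

# Each label should map to exactly one number, and each number to one label.
nums_per_label = toy.groupby('education')['education-num'].nunique()
labels_per_num = toy.groupby('education-num')['education'].nunique()
one_to_one = bool((nums_per_label == 1).all() and (labels_per_num == 1).all())

# If the mapping is one-to-one, 'education' carries no extra information.
redundant = ['education'] if one_to_one else []
cleaned = toy.drop(columns=redundant + ['fnlwgt'])

print(one_to_one, list(cleaned.columns))
```

The same check could be applied to other suspected pairs, such as `marital-status` and `relationship`, although those are unlikely to be strictly one-to-one.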