# Regression Dataset for Household Income Analysis

## Description

This synthetic dataset simulates various demographic and socioeconomic factors that influence annual household income. It can be used for exploratory data analysis, predictive modeling, and understanding the relationships between different features and income levels.

Features:

* Age: Age of the primary household member (18 to 70 years).

* Education Level: Highest education level attained (High School, Bachelor's, Master's, Doctorate).

* Occupation: Type of occupation (Healthcare, Education, Technology, Finance, Others).

* Number of Dependents: Number of dependents in the household (0 to 5).

* Location: Residential location (Urban, Suburban, Rural).

* Work Experience: Years of work experience (0 to 50 years).

* Marital Status: Marital status of the primary household member (Single, Married, Divorced).

* Employment Status: Employment status of the primary household member (Full-time, Part-time, Self-employed).

* Household Size: Total number of individuals living in the household (1 to 7).

* Homeownership Status: Homeownership status (Own, Rent).

* Type of Housing: Type of housing (Apartment, Single-family home, Townhouse).

* Gender: Gender of the primary household member (Male, Female).

* Primary Mode of Transportation: Primary mode of transportation used by the household member (Car, Public transit, Biking, Walking).

* Annual Household Income: Actual annual household income in USD, derived from a combination of the features above with added noise.

In [1]:
import pandas as pd

In [5]:
# Load the dataset
df = pd.read_csv("datasets/household_income.csv")
df.head()

| | Age | Education_Level | Occupation | Number_of_Dependents | Location | Work_Experience | Marital_Status | Employment_Status | Household_Size | Homeownership_Status | Type_of_Housing | Gender | Primary_Mode_of_Transportation | Income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 56 | Master's | Technology | 5 | Urban | 21 | Married | Full-time | 7 | Own | Apartment | Male | Public transit | 72510 |
| 1 | 69 | High School | Finance | 0 | Urban | 4 | Single | Full-time | 7 | Own | Apartment | Male | Biking | 75462 |
| 2 | 46 | Bachelor's | Technology | 1 | Urban | 1 | Single | Full-time | 7 | Own | Single-family home | Female | Car | 71748 |
| 3 | 32 | High School | Others | 2 | Urban | 32 | Married | Full-time | 1 | Own | Apartment | Female | Car | 74520 |
| 4 | 60 | Bachelor's | Finance | 3 | Urban | 15 | Married | Self-employed | 4 | Own | Townhouse | Male | Walking | 640210 |


In [28]:
df.shape

(10000, 14)

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column                          Non-Null Count  Dtype 
---  ------                          --------------  ----- 
 0   Age                             10000 non-null  int64 
 1   Education_Level                 10000 non-null  object
 2   Occupation                      10000 non-null  object
 3   Number_of_Dependents            10000 non-null  int64 
 4   Location                        10000 non-null  object
 5   Work_Experience                 10000 non-null  int64 
 6   Marital_Status                  10000 non-null  object
 7   Employment_Status               10000 non-null  object
 8   Household_Size                  10000 non-null  int64 
 9   Homeownership_Status            10000 non-null  object
 10  Type_of_Housing                 10000 non-null  object
 11  Gender                          10000 non-null  object
 12  Primary_Mode_of_Transportation  10000 non-null  object
 13  Income                          10000 non-null  int64 

In [6]:
# Summary statistics for the numeric columns
df.describe()

| | Age | Number_of_Dependents | Work_Experience | Household_Size | Income |
|---|---|---|---|---|---|
| count | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 |
| mean | 44.0217 | 2.527 | 24.8588 | 3.9896 | 816838.2 |
| std | 15.203998 | 1.713991 | 14.652622 | 2.010496 | 1821089.0 |
| min | 18.0 | 0.0 | 0.0 | 1.0 | 31044.0 |
| 25% | 31.0 | 1.0 | 12.0 | 2.0 | 68446.0 |
| 50% | 44.0 | 3.0 | 25.0 | 4.0 | 72943.0 |
| 75% | 57.0 | 4.0 | 37.0 | 6.0 | 350667.5 |
| max | 70.0 | 5.0 | 50.0 | 7.0 | 9992571.0 |


In [7]:
# Summary statistics for the non-numeric columns
df.describe(exclude="number")

| | Education_Level | Occupation | Location | Marital_Status | Employment_Status | Homeownership_Status | Type_of_Housing | Gender | Primary_Mode_of_Transportation |
|---|---|---|---|---|---|---|---|---|---|
| count | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 |
| unique | 4 | 5 | 3 | 3 | 3 | 2 | 3 | 2 | 4 |
| top | Bachelor's | Healthcare | Urban | Married | Full-time | Own | Single-family home | Male | Public transit |
| freq | 4058 | 3035 | 7037 | 5136 | 5004 | 6018 | 4055 | 5123 | 4047 |


## Data cleaning

The file appears consistent with the expected data types and value ranges, and it contains no null values.
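This kind of consistency check can also be done programmatically. The following is a minimal sketch, assuming the column names shown in `df.info()` and the ranges and categories listed in the description. The helper name `find_violations` and the small `sample` frame are hypothetical and only illustrate the idea; the third sample row is deliberately invalid.

```python
import pandas as pd

# Documented numeric ranges (inclusive), taken from the feature description.
NUMERIC_RANGES = {
    "Age": (18, 70),
    "Number_of_Dependents": (0, 5),
    "Work_Experience": (0, 50),
    "Household_Size": (1, 7),
}

# Documented categories for two of the categorical columns (extend as needed).
CATEGORIES = {
    "Location": {"Urban", "Suburban", "Rural"},
    "Homeownership_Status": {"Own", "Rent"},
}


def find_violations(df: pd.DataFrame) -> dict:
    """Count, per column, the values that are null, out of range, or unknown."""
    problems = {}
    for col, (low, high) in NUMERIC_RANGES.items():
        # between() is False for NaN, so nulls are counted as violations too.
        bad = int((~df[col].between(low, high)).sum())
        if bad:
            problems[col] = bad
    for col, allowed in CATEGORIES.items():
        bad = int((~df[col].isin(allowed)).sum())
        if bad:
            problems[col] = bad
    return problems


# Hypothetical sample: the last row has an invalid age and location.
sample = pd.DataFrame({
    "Age": [56, 69, 17],
    "Number_of_Dependents": [5, 0, 1],
    "Work_Experience": [21, 4, 1],
    "Household_Size": [7, 7, 7],
    "Location": ["Urban", "Urban", "Mars"],
    "Homeownership_Status": ["Own", "Own", "Own"],
})
violations = find_violations(sample)
print(violations)
```

Calling `find_violations(df)` on the loaded dataset should return an empty dictionary if the data really matches the documented ranges and categories.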

## Preliminary analysis

### Income by education level

In [11]:
df[["Education_Level","Income"]].groupby("Education_Level").describe()

Summary of `Income` by `Education_Level`:

| Education_Level | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Bachelor's | 4058.0 | 812335.910793 | 1819901.0 | 31044.0 | 68377.0 | 72888.0 | 367942.5 | 9992571.0 |
| Doctorate | 501.0 | 628710.652695 | 1628288.0 | 32517.0 | 67538.0 | 71346.0 | 75392.0 | 9859518.0 |
| High School | 2959.0 | 868667.401487 | 1868058.0 | 31137.0 | 68677.0 | 73452.0 | 444736.0 | 9904254.0 |
| Master's | 2482.0 | 800383.425866 | 1801176.0 | 31199.0 | 68390.0 | 72747.0 | 315053.25 | 9979438.0 |


In [26]:
# Cross-check the standard deviation of Income for the Bachelor's group
df.loc[df["Education_Level"] == "Bachelor's", "Income"].std()

1819901.1151043514
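The value above agrees with the `std` reported for Bachelor's in the grouped summary (1819901.0 after rounding). A minimal sketch of the same cross-check on a tiny, made-up frame follows; the `toy` values are illustrative and not taken from the dataset.

```python
import pandas as pd

# Made-up data, used only to show that both routes give the same result.
toy = pd.DataFrame({
    "Education_Level": ["Bachelor's", "Bachelor's", "Master's", "Master's"],
    "Income": [60000, 80000, 70000, 90000],
})

# Route 1: filter the rows first, then take the sample standard deviation.
filtered_std = toy.loc[toy["Education_Level"] == "Bachelor's", "Income"].std()

# Route 2: group by education level and take the standard deviation per group.
grouped_std = toy.groupby("Education_Level")["Income"].std()

same = abs(filtered_std - grouped_std["Bachelor's"]) < 1e-9
print(same)
```

Both routes use the sample standard deviation (`ddof=1`), which is the pandas default.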

### Income by occupation

In [12]:
df[["Occupation","Income"]].groupby("Occupation").describe()

Summary of `Income` by `Occupation`:

| Occupation | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Education | 1512.0 | 920816.752646 | 1948415.0 | 31530.0 | 68717.0 | 73317.0 | 553278.25 | 9871131.0 |
| Finance | 1525.0 | 706152.669508 | 1675412.0 | 31199.0 | 68126.0 | 72228.0 | 185614.0 | 9829436.0 |
| Healthcare | 3035.0 | 799238.763097 | 1796588.0 | 31137.0 | 68657.0 | 73146.0 | 372317.5 | 9992571.0 |
| Others | 1521.0 | 828970.392505 | 1788429.0 | 31044.0 | 68404.0 | 73428.0 | 412832.0 | 9968165.0 |
| Technology | 2407.0 | 836173.786041 | 1874548.0 | 31212.0 | 68215.0 | 72650.0 | 289959.5 | 9922858.0 |
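
In every occupation group the mean is roughly ten times the median, which suggests that a small number of very large incomes dominates the mean. One way to put both statistics side by side is a named aggregation. The sketch below is only an illustration on a small, made-up frame; the `toy` values and the `mean_to_median` column name are assumptions, not part of the dataset.

```python
import pandas as pd

# Made-up data: one very large income in the Finance group.
toy = pd.DataFrame({
    "Occupation": ["Finance", "Finance", "Finance",
                   "Education", "Education", "Education"],
    "Income": [70000, 72000, 900000, 68000, 74000, 75000],
})

# Count, median, and mean of Income per occupation, ordered by median.
summary = (
    toy.groupby("Occupation")["Income"]
    .agg(count="count", median="median", mean="mean")
    .sort_values("median", ascending=False)
)

# A ratio well above 1 indicates that large values pull the mean upward.
summary["mean_to_median"] = summary["mean"] / summary["median"]
print(summary)
```

Replacing `toy` with `df` applies the same summary to the full dataset.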
