# Project Income Prediction

#### **Index:**
0) [Introduction](#0)
1) [Packages and Functions](#1)
2) [Tidying Data](#2)
3) [Exploratory Data Analysis](#3)
4) [Modeling](#5)
5) [Final Validation](#6)

### 0) Introduction <a name="0"></a>

The objective of this notebook is to understand the dataset, find patterns, and develop a model to predict customers' income. This is a study notebook, meant to improve my data analysis and modeling skills.

*This dataset is part of the Data Scientist Course from EBAC*

### 1) Packages and Functions <a name="1"></a>


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf

In [11]:
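def categoricals_plot(df:pd.DataFrame, var:str, ax1, ax2) -> None:
    """
    Plot the category counts of `var` on ax1 and, on ax2, the mean income per
    category as bars with the median income per category as a red line.
    """
    # Cast to str so boolean columns are treated as categories.
    data = df[[var, 'income']].assign(**{var: df[var].astype(str)})
    dist_data = data[var].value_counts().reset_index()
    dist_data.columns = [var, 'count']
    # Use one category order in both panels so the rows line up.
    order = dist_data[var].tolist()

    sns.barplot(
        x=dist_data['count'],
        y=dist_data[var],
        order=order,
        ax=ax1
    )
    ax1.set_title(f'Distribution of {var}')
    ax1.set_ylabel(var)
    ax1.set_xlabel('')

    sns.barplot(
        data=data, y=var, x='income', order=order, label='Mean', ax=ax2)
    # Compute one median per category. Passing the raw rows to sns.lineplot
    # with x='income' aggregates at every distinct income value, which can be
    # very slow and does not yield a per-category median.
    medians = data.groupby(var)['income'].median().reindex(order)
    ax2.plot(
        medians.to_numpy(), range(len(order)), label='Median',
        color='red', marker='o'
    )
    ax2.set_title(f'Income per {var}')
    ax2.legend()
    ax2.set_ylabel(var)
    ax2.set_xlabel('')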

In [3]:
# Save the default display limits before overriding them.
pd_max_columns = pd.options.display.max_columns
pd_max_rows = pd.options.display.max_rows

pd.options.display.max_columns = None
pd.options.display.max_rows = None
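
The saved values above are presumably kept so the defaults can be restored once the wide output is no longer needed. A minimal sketch of that round trip follows; the names `saved_max_columns` and `saved_max_rows` are illustrative, not taken from the notebook.

```python
import pandas as pd

# Remember the current settings before changing them.
saved_max_columns = pd.get_option("display.max_columns")
saved_max_rows = pd.get_option("display.max_rows")

# Show every column and every row.
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

# ... exploration with untruncated output goes here ...

# Put the saved settings back.
pd.set_option("display.max_columns", saved_max_columns)
pd.set_option("display.max_rows", saved_max_rows)

# Alternative: limit the change to a single block.
with pd.option_context("display.max_columns", None, "display.max_rows", None):
    pass  # display calls placed here see the temporary settings
```

The `option_context` form may be preferable when only one or two cells need the full output, since the defaults come back automatically when the block ends.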

### 2) Tidying Data <a name="2"></a>

Let's take a look at the dataset.

In [4]:
# Forward slashes work on Windows, Linux, and macOS alike.
df_raw = pd.read_csv('./data/previsao_de_renda.csv')

In [5]:
df_raw.head()

| |Unnamed: 0|data_ref|id_cliente|sexo|posse_de_veiculo|posse_de_imovel|qtd_filhos|tipo_renda|educacao|estado_civil|tipo_residencia|idade|tempo_emprego|qt_pessoas_residencia|renda|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|0|0|2015-01-01|15056|F|False|True|0|Empresário|Secundário|Solteiro|Casa|26|6.60274|1.0|8060.34|
|1|1|2015-01-01|9968|M|True|True|0|Assalariado|Superior completo|Casado|Casa|28|7.183562|2.0|1852.15|
|2|2|2015-01-01|4312|F|True|True|0|Empresário|Superior completo|Casado|Casa|35|0.838356|2.0|2253.89|
|3|3|2015-01-01|10639|F|False|True|1|Servidor público|Superior completo|Casado|Casa|30|4.846575|3.0|6600.77|
|4|4|2015-01-01|7064|M|True|False|0|Assalariado|Secundário|Solteiro|Governamental|33|4.293151|1.0|6475.97|


In [6]:
df_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15000 entries, 0 to 14999
Data columns (total 15 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Unnamed: 0             15000 non-null  int64  
 1   data_ref               15000 non-null  object 
 2   id_cliente             15000 non-null  int64  
 3   sexo                   15000 non-null  object 
 4   posse_de_veiculo       15000 non-null  bool   
 5   posse_de_imovel        15000 non-null  bool   
 6   qtd_filhos             15000 non-null  int64  
 7   tipo_renda             15000 non-null  object 
 8   educacao               15000 non-null  object 
 9   estado_civil           15000 non-null  object 
 10  tipo_residencia        15000 non-null  object 
 11  idade                  15000 non-null  int64  
 12  tempo_emprego          12427 non-null  float64
 13  qt_pessoas_residencia  15000 non-null  float64
 14  renda                  15000 non-null  float64
dtypes: bool(2), float64(3), int64(4), object(6)


|Column                |Rename             |Data Type|Description                                       |
|----------------------|-------------------|---------|--------------------------------------------------|
| Unnamed: 0           |                   |int      | Appears to be an index column; it will be removed |
| data_ref             |                   |date     | The reference date                               |
| id_cliente           |                   |int      | Customer ID                                      |
| sexo                 |sex                |str      | Customer's sex: <br> • M = Male <br> • F = Female|
| posse_de_veiculo     |car_owner          |bool     | Indicates if the customer owns a vehicle         |
| posse_de_imovel      |house_owner        |bool     | Indicates if the customer owns a house           |
| qtd_filhos           |n_children         |int      | Number of children the customer has              |
| tipo_renda           |income_type        |str      | Type of income: <br> • Empresário = Businessperson <br> • Assalariado = Salaried <br> • Servidor público = Public servant <br> • Pensionista = Pensioner <br> • Bolsista = Scholarship recipient      |
| educacao             |education          |str      | Education level: <br> • Primário = Primary school <br> • Secundário = Secondary school <br> • Superior incompleto = Incomplete higher education <br> • Superior completo = Higher education <br> • Pós-graduação = Postgraduate education |
| estado_civil         |marital_state      |str      | Marital status: <br> • Solteiro = Single <br> • União = Civil union <br> • Casado = Married <br> • Separado = Separated <br> • Viúvo = Widowed                                                    |
| tipo_residencia      |residence_type     |str      | Type of residence: <br> • Casa = House <br> • Governamental = Government housing <br> • Com os pais = Living with parents <br> • Aluguel = Rented <br> • Comunitário = Communal housing <br> • Estúdio = Studio apartment |
| idade                |age                |int      | Customer's age (in years)                        |
| tempo_emprego        |job_tenure         |float    | Customer's tenure at current job (in years).     |
| qt_pessoas_residencia|people_in_household|float    | Number of people in the customer's household     |
| renda                |income             |float    | Income in BRL (Brazilian Real).                  |

'Unnamed: 0', 'data_ref' and 'id_cliente' won't help predict income, so let's remove them. We'll also rename the remaining columns to English.

In [12]:
renamed_columns = {'sexo': 'sex',
                   'posse_de_veiculo': 'car_owner',
                   'posse_de_imovel': 'house_owner',
                   'qtd_filhos': 'n_children',
                   'tipo_renda': 'income_type',
                   'educacao': 'education',
                   'estado_civil': 'marital_state',
                   'tipo_residencia': 'residence_type',
                   'idade': 'age',
                   'tempo_emprego': 'job_tenure',
                   'qt_pessoas_residencia': 'people_in_household',
                   'renda': 'income'}
df = (df_raw
      .drop(['id_cliente', 'Unnamed: 0', 'data_ref'], axis=1)
      .rename(columns=renamed_columns))

Let's check whether there are missing values.

In [8]:
df.isna().sum()

sex                       0
car_owner                 0
house_owner               0
n_children                0
income_type               0
education                 0
marital_state             0
residence_type            0
age                       0
job_tenure             2573
people_in_household       0
income                    0
dtype: int64

Only 'job_tenure' has missing values (2,573 of 15,000 rows, about 17%). For now we'll keep it as is; later we'll check whether this variable needs further treatment.
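
One way to decide on that treatment is to check whether the gaps concentrate in particular groups. The sketch below computes the share of missing 'job_tenure' per 'income_type' on a small made-up frame; `toy` and its values are purely illustrative, and whether the real data shows any such pattern is an open question, not a finding.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `df`; the values are invented for illustration.
toy = pd.DataFrame({
    "income_type": ["Assalariado", "Pensionista", "Pensionista",
                    "Empresário", "Assalariado"],
    "job_tenure": [4.3, np.nan, np.nan, 6.6, np.nan],
})

# Share of rows with missing job_tenure within each income_type.
missing_share = (
    toy["job_tenure"].isna()
    .groupby(toy["income_type"])
    .mean()
    .sort_values(ascending=False)
)
print(missing_share)
```

If the share turned out to be close to 1 for some category and close to 0 elsewhere, the missing values would likely carry meaning of their own, and a flag column might serve better than a plain imputation.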

In [9]:
df.head()

| |sex|car_owner|house_owner|n_children|income_type|education|marital_state|residence_type|age|job_tenure|people_in_household|income|
|---|---|---|---|---|---|---|---|---|---|---|---|---|
|0|F|False|True|0|Empresário|Secundário|Solteiro|Casa|26|6.60274|1.0|8060.34|
|1|M|True|True|0|Assalariado|Superior completo|Casado|Casa|28|7.183562|2.0|1852.15|
|2|F|True|True|0|Empresário|Superior completo|Casado|Casa|35|0.838356|2.0|2253.89|
|3|F|False|True|1|Servidor público|Superior completo|Casado|Casa|30|4.846575|3.0|6600.77|
|4|M|True|False|0|Assalariado|Secundário|Solteiro|Governamental|33|4.293151|1.0|6475.97|


### 3) Exploratory Data Analysis <a name="3"></a>

The dataset contains only 11 explanatory variables (plus the target, 'income'), so we can check them one by one.

In [13]:
categorical_var = ['sex', 'car_owner', 'house_owner', 'income_type',
                   'marital_state', 'residence_type']

fig, axes = plt.subplots(
    nrows=len(categorical_var),
    ncols=2,
    figsize=(10, 3 * len(categorical_var)))

for i, var in enumerate(categorical_var):
    ax1 = axes[i, 0]
    ax2 = axes[i, 1]
    categoricals_plot(df, var, ax1, ax2)

plt.tight_layout()
plt.show()
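
The loop above covers the categorical columns only. For the numeric ones, a quick summary plus a rank correlation with income could serve as a first pass. This is a minimal sketch on synthetic data: `toy`, `numeric_var`, and the generated values are assumptions made for illustration and say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for `df`; income is built to depend on job_tenure.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "age": rng.integers(22, 68, size=200),
    "job_tenure": rng.uniform(0, 30, size=200),
    "n_children": rng.integers(0, 4, size=200),
})
toy["income"] = 1000 + 150 * toy["job_tenure"] + rng.normal(0, 500, size=200)

numeric_var = ["age", "job_tenure", "n_children"]

# Count, mean, spread, and quartiles, one row per variable.
summary = toy[numeric_var].describe().T

# Spearman correlation computed as Pearson correlation on ranks,
# which is less sensitive to the long right tail typical of income.
spearman = toy[numeric_var].rank().corrwith(toy["income"].rank())

print(summary)
print(spearman.sort_values(ascending=False))
```

On the real data, the same two calls would run on `df` with the renamed numeric columns.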



### 4) Modeling <a name="5"></a>

### 5) Final Validation <a name="6"></a>