# Introduction
___
This file handles the data treatment for `cs_bisnode_panel.csv`.

### Data treatment objectives:
* Remove the columns ['COGS', 'finished_prod', 'net_dom_sales', 'net_exp_sales', 'wages', 'D'], since they have a considerable share of missing data ✅
* Remove the records from 2016 ✅
* Create a column for the response variable (a company is considered to have ceased operating if it was active in year X but reported no sales in year X + 2) `# work on this`
* Filter to work only with companies from 2012 ✅
* Use np.where to fix `sales` < 0: negative values can simply be replaced with 0 ✅
* Create a new column with `sales` on a logarithmic scale ✅
* The `sales` variable is heavily skewed. Is it worth creating new columns that hold the log of this column?
* Does the same apply to the other variables?
* Create new columns, such as company age (computed as `year` minus `founded_year`). Handle missing values carefully; np.where can help a lot here.
* Filter the data to keep companies with revenue below 10 million euros and above 1,000 euros
* Always back up any variable-treatment decision with descriptive statistics and supporting plots.
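
The company-age and revenue-filter objectives could be approached along these lines. This is a minimal sketch on made-up rows, not the notebook's final treatment: it assumes that revenue corresponds to the `sales` column, that a founding year later than the observation year is a data error, and the `toy`, `age` and `filtered` names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for the real panel; the values are made up.
toy = pd.DataFrame({
    "year": [2012, 2012, 2012, 2012],
    "founded_year": [1990.0, np.nan, 2013.0, 2008.0],
    "sales": [62751.85, 500.0, 2_000_000.0, 15_000_000.0],
})

# Age = year - founded_year; keep NaN when the founding year is missing
# or later than the observation year (treated here as a data error).
age = toy["year"] - toy["founded_year"]
toy["age"] = np.where(toy["founded_year"].isna() | (age < 0), np.nan, age)

# Keep revenue strictly between 1,000 and 10 million euros
# (assumption: revenue is the `sales` column).
filtered = toy[(toy["sales"] > 1_000) & (toy["sales"] < 10_000_000)]
print(filtered)
```

Leaving problematic ages as NaN, rather than filling them with 0, keeps the missing information visible for later imputation decisions.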


# Data import and treatment

In [56]:
import pandas as pd
import numpy as np

In [57]:
import missingno as msno

In [58]:
df = pd.read_csv("cs_bisnode_panel.csv")
df.head()

Unnamed: 0,comp_id,begin,end,COGS,amort,curr_assets,curr_liab,extra_exp,extra_inc,extra_profit_loss,...,gender,origin,nace_main,ind2,ind,urban_m,region_m,founded_date,exit_date,labor_avg
0,1001034.0,2005-01-01,2005-12-31,,692.59259,7266.666504,7574.074219,0.0,0.0,0.0,...,mix,Domestic,5630.0,56.0,3.0,1,Central,1990-11-19,,
1,1001034.0,2006-01-01,2006-12-31,,603.703674,13122.222656,12211.111328,0.0,0.0,0.0,...,mix,Domestic,5630.0,56.0,3.0,1,Central,1990-11-19,,
2,1001034.0,2007-01-01,2007-12-31,,425.925934,8196.295898,7800.0,0.0,0.0,0.0,...,mix,Domestic,5630.0,56.0,3.0,1,Central,1990-11-19,,
3,1001034.0,2008-01-01,2008-12-31,,300.0,8485.185547,7781.481445,0.0,0.0,0.0,...,mix,Domestic,5630.0,56.0,3.0,1,Central,1990-11-19,,
4,1001034.0,2009-01-01,2009-12-31,,207.40741,5137.037109,15300.0,0.0,0.0,0.0,...,mix,Domestic,5630.0,56.0,3.0,1,Central,1990-11-19,,0.083333


In [59]:
#msno.matrix(df)

## Dropping the columns `'COGS', 'finished_prod', 'net_dom_sales', 'net_exp_sales', 'wages', 'D'` due to their high rate of missing values
___
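
One way to back the drop decision with numbers is to compute the share of missing values per column. The sketch below uses a small made-up frame in place of the real `df`; the `0.5` cutoff and the `missing_share` and `high_missing` names are illustrative assumptions, not values taken from the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame; in the notebook this would be the `df` loaded from the CSV.
toy = pd.DataFrame({
    "COGS": [np.nan, np.nan, np.nan, 10.0],
    "sales": [1.0, 2.0, np.nan, 4.0],
    "amort": [1.0, 2.0, 3.0, 4.0],
})

# Share of missing values per column, highest first.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)

# Columns above a chosen cutoff (0.5 is an arbitrary, illustrative value).
threshold = 0.5
high_missing = missing_share[missing_share > threshold].index.tolist()
print(high_missing)
```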

In [60]:
df.columns

Index(['comp_id', 'begin', 'end', 'COGS', 'amort', 'curr_assets', 'curr_liab',
       'extra_exp', 'extra_inc', 'extra_profit_loss', 'finished_prod',
       'fixed_assets', 'inc_bef_tax', 'intang_assets', 'inventories',
       'liq_assets', 'material_exp', 'net_dom_sales', 'net_exp_sales',
       'personnel_exp', 'profit_loss_year', 'sales', 'share_eq',
       'subscribed_cap', 'tang_assets', 'wages', 'D', 'balsheet_flag',
       'balsheet_length', 'balsheet_notfullyear', 'year', 'founded_year',
       'exit_year', 'ceo_count', 'foreign', 'female', 'birth_year',
       'inoffice_days', 'gender', 'origin', 'nace_main', 'ind2', 'ind',
       'urban_m', 'region_m', 'founded_date', 'exit_date', 'labor_avg'],
      dtype='object')

In [61]:
cols_to_rm = ['COGS', 'finished_prod', 'net_dom_sales','net_exp_sales', 'wages', 'D']

df = df.drop(columns = cols_to_rm)

In [62]:
df.columns

Index(['comp_id', 'begin', 'end', 'amort', 'curr_assets', 'curr_liab',
       'extra_exp', 'extra_inc', 'extra_profit_loss', 'fixed_assets',
       'inc_bef_tax', 'intang_assets', 'inventories', 'liq_assets',
       'material_exp', 'personnel_exp', 'profit_loss_year', 'sales',
       'share_eq', 'subscribed_cap', 'tang_assets', 'balsheet_flag',
       'balsheet_length', 'balsheet_notfullyear', 'year', 'founded_year',
       'exit_year', 'ceo_count', 'foreign', 'female', 'birth_year',
       'inoffice_days', 'gender', 'origin', 'nace_main', 'ind2', 'ind',
       'urban_m', 'region_m', 'founded_date', 'exit_date', 'labor_avg'],
      dtype='object')

## Removing data from 2016
___

### Converting the columns to datetime format:

In [63]:
df["begin"] = pd.to_datetime(df['begin'])
df["end"] = pd.to_datetime(df['end'])

In [64]:
# drop the rows whose year is 2016 or later
df = df[df["begin"] < "2016"]
df.head()

Unnamed: 0,comp_id,begin,end,amort,curr_assets,curr_liab,extra_exp,extra_inc,extra_profit_loss,fixed_assets,...,gender,origin,nace_main,ind2,ind,urban_m,region_m,founded_date,exit_date,labor_avg
0,1001034.0,2005-01-01,2005-12-31,692.59259,7266.666504,7574.074219,0.0,0.0,0.0,1229.629639,...,mix,Domestic,5630.0,56.0,3.0,1,Central,1990-11-19,,
1,1001034.0,2006-01-01,2006-12-31,603.703674,13122.222656,12211.111328,0.0,0.0,0.0,725.925903,...,mix,Domestic,5630.0,56.0,3.0,1,Central,1990-11-19,,
2,1001034.0,2007-01-01,2007-12-31,425.925934,8196.295898,7800.0,0.0,0.0,0.0,1322.222168,...,mix,Domestic,5630.0,56.0,3.0,1,Central,1990-11-19,,
3,1001034.0,2008-01-01,2008-12-31,300.0,8485.185547,7781.481445,0.0,0.0,0.0,1022.222229,...,mix,Domestic,5630.0,56.0,3.0,1,Central,1990-11-19,,
4,1001034.0,2009-01-01,2009-12-31,207.40741,5137.037109,15300.0,0.0,0.0,0.0,814.814819,...,mix,Domestic,5630.0,56.0,3.0,1,Central,1990-11-19,,0.083333


In [65]:
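df['operates_within_2_years'] = None  # or pd.NA for null values

def calcula_variavel_resposta(df):
    """Return 1 where the company has positive sales in year X + 2, else 0.

    Assumption: a missing X + 2 record counts as not operating.
    """
    ano = df["begin"].dt.year
    # (comp_id, year) pairs that reported positive sales
    vendas = df.loc[df["sales"] > 0, ["comp_id"]].assign(ano=ano)
    ativos = pd.MultiIndex.from_frame(vendas)
    # look up each row's (comp_id, year + 2) pair among them
    chave = pd.MultiIndex.from_arrays([df["comp_id"], ano + 2])
    return pd.Series(chave.isin(ativos).astype(int), index=df.index)

#df['operates_within_2_years'] = calcula_variavel_resposta(df)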
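
The response-variable definition from the objectives (active in year X, no sales in X + 2) can be illustrated on a tiny panel. This is only a sketch under stated assumptions: "active" is taken to mean `sales > 0`, a missing X + 2 record counts as no sales, each `(comp_id, year)` pair appears at most once, and the function name `flag_exit_within_2_years` and the `exited` column are hypothetical.

```python
import pandas as pd

def flag_exit_within_2_years(panel: pd.DataFrame) -> pd.Series:
    """1 if active in year X (sales > 0) but no positive sales in X + 2; else 0."""
    # Sales two years ahead: shift the year key back by 2 and join on it.
    future = panel[["comp_id", "year", "sales"]].copy()
    future["year"] = future["year"] - 2
    future = future.rename(columns={"sales": "sales_t2"})
    # Assumes unique (comp_id, year) pairs, so the left join keeps row order.
    merged = panel.merge(future, on=["comp_id", "year"], how="left")
    active_now = merged["sales"] > 0
    # A missing X + 2 record counts as "no sales" (an assumption).
    sells_later = merged["sales_t2"].fillna(0) > 0
    flag = (active_now & ~sells_later).astype(int)
    return pd.Series(flag.to_numpy(), index=panel.index)

toy = pd.DataFrame({
    "comp_id": [1, 1, 1, 2, 2, 2],
    "year":    [2012, 2013, 2014, 2012, 2013, 2014],
    "sales":   [100.0, 120.0, 90.0, 50.0, 10.0, 0.0],
})
toy["exited"] = flag_exit_within_2_years(toy)
print(toy)
```

For the last two years of any panel, year X + 2 falls outside the data, so the flag is not meaningful there. Restricting the analysis to 2012 sidesteps this, since the data run through 2015.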

## Filtering to work only with companies from 2012
___

In [66]:
empresas_2012 = df[df['begin'] == "2012"]

In [67]:
empresas_2012.head()

Unnamed: 0,comp_id,begin,end,amort,curr_assets,curr_liab,extra_exp,extra_inc,extra_profit_loss,fixed_assets,...,origin,nace_main,ind2,ind,urban_m,region_m,founded_date,exit_date,labor_avg,operates_within_2_years
7,1001034.0,2012-01-01,2012-12-31,140.740738,148.148148,21429.628906,0.0,0.0,0.0,340.740753,...,Domestic,5630.0,56.0,3.0,1,Central,1990-11-19,,0.083333,
14,1001541.0,2012-01-01,2012-12-31,481.481476,9629.629883,1303.703735,0.0,0.0,0.0,190566.671875,...,Domestic,5610.0,56.0,3.0,3,Central,2008-02-24,,,
23,1002029.0,2012-01-01,2012-12-31,14929.629883,203885.1875,120444.453125,0.0,0.0,0.0,23459.259766,...,Domestic,2711.0,27.0,2.0,3,East,2006-07-03,,0.458333,
35,1003200.0,2012-01-01,2012-12-31,25.925926,22.222221,10996.295898,0.0,0.0,0.0,0.0,...,Domestic,5630.0,56.0,3.0,1,Central,2003-10-21,2014-08-09,,
48,1007261.0,2012-01-01,2012-12-31,0.0,255.555557,9207.407227,0.0,0.0,0.0,0.0,...,Domestic,5610.0,56.0,3.0,1,Central,2010-08-26,2015-11-19,0.083333,
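
A caveat on the filter above: when a datetime column is compared with the string `"2012"`, pandas parses the string as the timestamp 2012-01-01, so `==` keeps only periods that begin exactly on that date. The sketch below, on made-up dates, contrasts this with a comparison on the year component; whether the real data contain periods starting mid-year is not shown here, so treat this as a possible check rather than a required fix.

```python
import pandas as pd

toy = pd.DataFrame({
    "begin": pd.to_datetime(["2012-01-01", "2012-03-15", "2013-01-01"]),
})

# "2012" is parsed as the timestamp 2012-01-01, so equality is an exact match.
exact = toy[toy["begin"] == "2012"]

# Comparing the year component keeps every row whose period begins in 2012.
by_year = toy[toy["begin"].dt.year == 2012]

print(len(exact), len(by_year))
```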


## Handling inconsistencies
___

### Adjusting the `sales` column

In [71]:
df['sales'] = np.where(df['sales'] < 0, 0, df['sales'])
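
To make the behavior of this replacement concrete, the sketch below applies it to a few made-up values, including a missing one, and compares it with `Series.clip(lower=0)`, which is an alternative, not what the notebook uses.

```python
import numpy as np
import pandas as pd

sales = pd.Series([-50.0, 0.0, 120.0, np.nan])

# np.where, as used in the notebook: negatives become 0, and NaN stays NaN
# because NaN < 0 evaluates to False.
with_where = pd.Series(np.where(sales < 0, 0, sales))

# Series.clip with a lower bound gives the same result.
with_clip = sales.clip(lower=0)

print(with_where.tolist())
```

Either way, missing sales remain missing and still need their own treatment.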

### Checking the skewness of the `sales` column

In [86]:
df["sales"].skew()

17.01936234124427

Since the skewness is well above zero, the `sales` column has strong positive (right) skew.

Creating a new column with `sales` on a logarithmic scale. The code uses `np.log1p`, that is log(1 + x), so zero sales map to 0 instead of negative infinity:

In [77]:
df["sales_log"] = np.log1p(df['sales'])
df[["sales","sales_log"]].head()

Unnamed: 0,sales,sales_log
0,62751.851562,11.046959
1,64625.925781,11.076386
2,65100.0,11.083695
3,78085.1875,11.265568
4,45388.890625,10.723045


In [87]:
df["sales_log"].skew()

-1.1803398049124354

The log-scaled `sales` column now has negative skew, but the value is much closer to zero, indicating a large reduction in asymmetry.
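
The same pattern can be reproduced on synthetic data. The sketch below draws made-up log-normal values plus a few zeros (this is not the Bisnode data) and compares the skewness before and after `np.log1p`. In this toy sample the zeros form a left tail on the log scale, which is one plausible, unverified reason for a negative sign like the one observed above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed "sales": log-normal draws plus a few zeros.
values = pd.Series(
    np.concatenate([rng.lognormal(mean=10, sigma=1.5, size=5000), np.zeros(100)])
)

raw_skew = values.skew()
log_skew = np.log1p(values).skew()
print(raw_skew, log_skew)
```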

In [95]:
numeric_cols = df.select_dtypes(include=[np.number])

skew_values = numeric_cols.skew()

print(skew_values.sort_values(ascending = False))

profit_loss_year        361.736330
extra_profit_loss       345.601934
curr_liab               331.375606
fixed_assets            320.083873
extra_inc               294.741550
inc_bef_tax             252.591880
share_eq                239.752788
subscribed_cap          234.131177
intang_assets           233.810421
curr_assets             227.664994
amort                   194.574757
liq_assets              143.056617
extra_exp               140.176754
tang_assets              61.379289
inventories              36.506161
personnel_exp            25.127166
material_exp             18.936654
labor_avg                17.826571
sales                    17.019362
balsheet_flag             8.341014
balsheet_notfullyear      2.599322
ceo_count                 2.587079
foreign                   2.408827
female                    1.106005
inoffice_days             0.790504
comp_id                   0.647234
urban_m                  -0.128703
founded_year             -0.130346
year                