<a href="https://colab.research.google.com/github/PauloMarvin/ETL-CVM/blob/data-analysis/03_refined_data_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Installing the *Spark* dependencies

In [1]:
!apt-get update -qq
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-3.1.2/spark-3.1.2-bin-hadoop2.7.tgz
!tar xf spark-3.1.2-bin-hadoop2.7.tgz
!pip install -q findspark

Importing `os` and setting the environment variables

In [2]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.2-bin-hadoop2.7"

In [3]:
import findspark
findspark.init()

Importing *SparkSession*

In [4]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master('local[*]') \
    .appName("Iniciando com Spark") \
    .getOrCreate()

In [5]:
spark

Importing libraries

In [6]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import plotly.express as px
import seaborn as sns

In [7]:
sns.set()

# Loading the data into the *dataframe*

In [8]:
df = spark.read.csv(
    '/content/drive/MyDrive/cvm_2000_2022.csv',
    header= True, inferSchema= True
)

In [9]:
df.show(5)

+---+--------+------------------+----------+--------+---------+-------------+---------+--------+--------+
|_c0|TP_FUNDO|        CNPJ_FUNDO| DT_COMPTC|VL_TOTAL| VL_QUOTA|VL_PATRIM_LIQ|CAPTC_DIA|RESG_DIA|NR_COTST|
+---+--------+------------------+----------+--------+---------+-------------+---------+--------+--------+
|  0|   FITVM|01.465.738/0001-97|2000-01-03|    null|2.5972313|    2588000.0|  91000.0|     0.0|    null|
|  1|   FITVM|01.465.738/0001-97|2000-01-04|    null|2.4597298|    2543000.0|      0.0| 17000.0|    null|
|  2|   FITVM|01.465.738/0001-97|2000-01-05|    null|2.4035701|    2467000.0|  44000.0|  2000.0|    null|
|  3|   FITVM|01.465.738/0001-97|2000-01-06|    null|2.4648947|    2573000.0|  21000.0|     0.0|    null|
|  4|   FITVM|01.465.738/0001-97|2000-01-07|    null|2.5131104|    2646000.0|   3000.0|     0.0|    null|
+---+--------+------------------+----------+--------+---------+-------------+---------+--------+--------+
only showing top 5 rows



## Renaming the `_c0` column

In [10]:
df = df.withColumnRenamed('_c0', 'index')

In [11]:
df.show(5)

+-----+--------+------------------+----------+--------+---------+-------------+---------+--------+--------+
|index|TP_FUNDO|        CNPJ_FUNDO| DT_COMPTC|VL_TOTAL| VL_QUOTA|VL_PATRIM_LIQ|CAPTC_DIA|RESG_DIA|NR_COTST|
+-----+--------+------------------+----------+--------+---------+-------------+---------+--------+--------+
|    0|   FITVM|01.465.738/0001-97|2000-01-03|    null|2.5972313|    2588000.0|  91000.0|     0.0|    null|
|    1|   FITVM|01.465.738/0001-97|2000-01-04|    null|2.4597298|    2543000.0|      0.0| 17000.0|    null|
|    2|   FITVM|01.465.738/0001-97|2000-01-05|    null|2.4035701|    2467000.0|  44000.0|  2000.0|    null|
|    3|   FITVM|01.465.738/0001-97|2000-01-06|    null|2.4648947|    2573000.0|  21000.0|     0.0|    null|
|    4|   FITVM|01.465.738/0001-97|2000-01-07|    null|2.5131104|    2646000.0|   3000.0|     0.0|    null|
+-----+--------+------------------+----------+--------+---------+-------------+---------+--------+--------+
only showing top 5 rows



## Dimensions of the *dataframe*

In [12]:
print(f'Há no dataframe {len(df.columns)} colunas.')
print(f'E {df.select("TP_FUNDO").count()} linhas.')

Há no dataframe 10 colunas.
E 57676391 linhas.


## *Schema* of the *dataframe*

In [13]:
df.printSchema()

root
 |-- index: integer (nullable = true)
 |-- TP_FUNDO: string (nullable = true)
 |-- CNPJ_FUNDO: string (nullable = true)
 |-- DT_COMPTC: string (nullable = true)
 |-- VL_TOTAL: double (nullable = true)
 |-- VL_QUOTA: double (nullable = true)
 |-- VL_PATRIM_LIQ: double (nullable = true)
 |-- CAPTC_DIA: double (nullable = true)
 |-- RESG_DIA: double (nullable = true)
 |-- NR_COTST: double (nullable = true)



# Deriving new columns from the `DT_COMPTC` column

### `MONTH` and `YEAR` columns

In [14]:
import pyspark.sql.functions as f

In [15]:
df = df.withColumn('MONTH', f.month('DT_COMPTC'))\
       .withColumn('YEAR', f.year('DT_COMPTC'))

In [16]:
df.show(2)

+-----+--------+------------------+----------+--------+---------+-------------+---------+--------+--------+-----+----+
|index|TP_FUNDO|        CNPJ_FUNDO| DT_COMPTC|VL_TOTAL| VL_QUOTA|VL_PATRIM_LIQ|CAPTC_DIA|RESG_DIA|NR_COTST|MONTH|YEAR|
+-----+--------+------------------+----------+--------+---------+-------------+---------+--------+--------+-----+----+
|    0|   FITVM|01.465.738/0001-97|2000-01-03|    null|2.5972313|    2588000.0|  91000.0|     0.0|    null|    1|2000|
|    1|   FITVM|01.465.738/0001-97|2000-01-04|    null|2.4597298|    2543000.0|      0.0| 17000.0|    null|    1|2000|
+-----+--------+------------------+----------+--------+---------+-------------+---------+--------+--------+-----+----+
only showing top 2 rows



### `MONTH_OF_YEAR` column

In [17]:
df = df.withColumn('MONTH_OF_YEAR', f.date_format(df['DT_COMPTC'], 'yyyy-MM'))

In [18]:
df.show(2)

+-----+--------+------------------+----------+--------+---------+-------------+---------+--------+--------+-----+----+-------------+
|index|TP_FUNDO|        CNPJ_FUNDO| DT_COMPTC|VL_TOTAL| VL_QUOTA|VL_PATRIM_LIQ|CAPTC_DIA|RESG_DIA|NR_COTST|MONTH|YEAR|MONTH_OF_YEAR|
+-----+--------+------------------+----------+--------+---------+-------------+---------+--------+--------+-----+----+-------------+
|    0|   FITVM|01.465.738/0001-97|2000-01-03|    null|2.5972313|    2588000.0|  91000.0|     0.0|    null|    1|2000|      2000-01|
|    1|   FITVM|01.465.738/0001-97|2000-01-04|    null|2.4597298|    2543000.0|      0.0| 17000.0|    null|    1|2000|      2000-01|
+-----+--------+------------------+----------+--------+---------+-------------+---------+--------+--------+-----+----+-------------+
only showing top 2 rows
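
As a cross-check of the three columns derived so far, here is a minimal plain-Python sketch (not Spark) that parses the same `yyyy-MM-dd` strings with `datetime`. The helper name `derive_date_parts` is hypothetical, and the sketch assumes every `DT_COMPTC` value is a well-formed date string like the ones in the rows shown above.

```python
from datetime import datetime

def derive_date_parts(dt_comptc: str) -> dict:
    """Return MONTH, YEAR and MONTH_OF_YEAR for a 'yyyy-MM-dd' string."""
    parsed = datetime.strptime(dt_comptc, "%Y-%m-%d")
    return {
        "MONTH": parsed.month,
        "YEAR": parsed.year,
        # Zero-padded 'yyyy-MM', so string order matches chronological order.
        "MONTH_OF_YEAR": parsed.strftime("%Y-%m"),
    }

parts = derive_date_parts("2000-01-03")
print(parts)

# Made-up dates, only to show that sorting the labels as strings is chronological.
labels = [derive_date_parts(d)["MONTH_OF_YEAR"] for d in ("2003-10-01", "2000-01-03", "2003-09-08")]
print(sorted(labels))
```

Because the label is zero-padded, sorting by `MONTH_OF_YEAR` as a string also sorts the months chronologically.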



### `QUARTER` column

In [19]:
df = df.withColumn('QUARTER', f.concat(f.year('DT_COMPTC'), f.quarter('DT_COMPTC')))

In [20]:
df.show(2)

+-----+--------+------------------+----------+--------+---------+-------------+---------+--------+--------+-----+----+-------------+-------+
|index|TP_FUNDO|        CNPJ_FUNDO| DT_COMPTC|VL_TOTAL| VL_QUOTA|VL_PATRIM_LIQ|CAPTC_DIA|RESG_DIA|NR_COTST|MONTH|YEAR|MONTH_OF_YEAR|QUARTER|
+-----+--------+------------------+----------+--------+---------+-------------+---------+--------+--------+-----+----+-------------+-------+
|    0|   FITVM|01.465.738/0001-97|2000-01-03|    null|2.5972313|    2588000.0|  91000.0|     0.0|    null|    1|2000|      2000-01|  20001|
|    1|   FITVM|01.465.738/0001-97|2000-01-04|    null|2.4597298|    2543000.0|      0.0| 17000.0|    null|    1|2000|      2000-01|  20001|
+-----+--------+------------------+----------+--------+---------+-------------+---------+--------+--------+-----+----+-------------+-------+
only showing top 2 rows
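
The `QUARTER` value is the year followed by the quarter number, so `20001` reads as "year 2000, quarter 1". The plain-Python sketch below (not Spark) builds the same label and splits it back. Both helper names, `quarter_label` and `split_quarter_label`, are hypothetical.

```python
from datetime import date

def quarter_label(day: date) -> str:
    """Concatenate the year and the quarter number (1-4), e.g. '20033'."""
    quarter = (day.month - 1) // 3 + 1
    return f"{day.year}{quarter}"

def split_quarter_label(label: str) -> tuple:
    """Split a label such as '20033' back into (year, quarter)."""
    return int(label[:-1]), int(label[-1])

label = quarter_label(date(2003, 9, 8))
print(label, split_quarter_label(label))
```

A separator (for example `2003-Q3`) would make the label easier to read on a chart axis. That would be a design change, not what the notebook does.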



# Exploratory analysis

## Amount of missing data

In [21]:
df.select([f.count(f.when(f.isnull(c), c)).alias(c) for c in df.columns]).show()

+-----+--------+----------+---------+--------+--------+-------------+---------+--------+--------+-----+----+-------------+-------+
|index|TP_FUNDO|CNPJ_FUNDO|DT_COMPTC|VL_TOTAL|VL_QUOTA|VL_PATRIM_LIQ|CAPTC_DIA|RESG_DIA|NR_COTST|MONTH|YEAR|MONTH_OF_YEAR|QUARTER|
+-----+--------+----------+---------+--------+--------+-------------+---------+--------+--------+-----+----+-------------+-------+
|    0|40860852|         0|        0|  366012|       0|            0|        0|       0|  365408|    0|   0|            0|      0|
+-----+--------+----------+---------+--------+--------+-------------+---------+--------+--------+-----+----+-------------+-------+



In [22]:
df.agg(*[f.mean(f.when(f.col(c).isNull(), 1).otherwise(0)).alias(c) for c in df.columns]).show()

+-----+------------------+----------+---------+-------------------+--------+-------------+---------+--------+--------------------+-----+----+-------------+-------+
|index|          TP_FUNDO|CNPJ_FUNDO|DT_COMPTC|           VL_TOTAL|VL_QUOTA|VL_PATRIM_LIQ|CAPTC_DIA|RESG_DIA|            NR_COTST|MONTH|YEAR|MONTH_OF_YEAR|QUARTER|
+-----+------------------+----------+---------+-------------------+--------+-------------+---------+--------+--------------------+-----+----+-------------+-------+
|  0.0|0.7084502218594086|       0.0|      0.0|0.00634595878233782|     0.0|          0.0|      0.0|     0.0|0.006335486559830...|  0.0| 0.0|          0.0|    0.0|
+-----+------------------+----------+---------+-------------------+--------+-------------+---------+--------+--------------------+-----+----+-------------+-------+



The `TP_FUNDO` column has a very large amount of missing data: missing values account for approximately 71% of the column (40,860,852 of 57,676,391 rows).
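
The second query above computes the share of nulls as the mean of a 0/1 "is null" indicator. A minimal plain-Python sketch of that idea follows. The helper `null_fraction` and the sample list are made up for illustration; the last lines redo the arithmetic with the null count and row count printed earlier.

```python
def null_fraction(values):
    """Share of None entries: the mean of a 0/1 'is null' indicator."""
    indicators = [1 if v is None else 0 for v in values]
    return sum(indicators) / len(indicators)

# Tiny made-up column, for illustration only.
sample = ["FI", None, None, "FITVM"]
print(null_fraction(sample))

# Figures taken from the outputs above: TP_FUNDO null count and total row count.
tp_fundo_share = 40860852 / 57676391
print(f"{tp_fundo_share:.1%}")
```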

## `TP_FUNDO` column

In [23]:
count_tp_fundo = df.groupBy('TP_FUNDO').count().toPandas().sort_values('count', ascending= False)

In [24]:
count_tp_fundo

Unnamed: 0,TP_FUNDO,count
2,,40860852
0,FI,15266406
5,FIF,500219
10,FACFIF,492624
9,FITVM,349153
4,FMP-FGTS,107997
7,FIC-FITVM,67022
8,FIEX,13624
1,FAPI,9395
11,FMP-FGTS CL,8226
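
The group-and-count step can be sketched in plain Python with `collections.Counter`. The list below is made up, and `None` stands in for a missing `TP_FUNDO`. It is an illustration of the counting logic, not a substitute for the Spark query.

```python
from collections import Counter

# Made-up fund types; None stands in for a missing TP_FUNDO.
fund_types = [None, "FI", None, "FIF", "FI", None]

counts = Counter(fund_types).most_common()  # sorted by count, descending
for fund_type, count in counts:
    print(fund_type, count)
```

As in the table above, the missing value is counted as a group of its own and is not dropped.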


## `VL_TOTAL` column

### Descriptive statistics

In [25]:
df.describe('VL_TOTAL').show(truncate= False)

+-------+---------------------+
|summary|VL_TOTAL             |
+-------+---------------------+
|count  |57310379             |
|mean   |3.530323818966721E8  |
|stddev |2.9902593485264565E10|
|min    |-5.24478822941659E12 |
|max    |2.15700309082785E14  |
+-------+---------------------+
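
The `count` of 57,310,379 is the total of 57,676,391 rows minus the 366,012 nulls in `VL_TOTAL`, so the summary covers non-null values only. The sketch below reproduces the same five statistics in plain Python on made-up numbers. It assumes the `stddev` reported by Spark is the sample (n - 1) standard deviation. The helper name `describe` is only illustrative.

```python
import statistics

def describe(values):
    """count, mean, sample stddev, min and max over the non-null values."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "mean": statistics.mean(present),
        "stddev": statistics.stdev(present),  # sample (n - 1) standard deviation
        "min": min(present),
        "max": max(present),
    }

# Made-up values with one missing entry.
summary = describe([2.0, None, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(summary)
```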



### Daily mean of `VL_TOTAL`

In [26]:
avg_vl_total_per_date = df.groupBy('DT_COMPTC').agg(f.avg('VL_TOTAL')).sort('DT_COMPTC').toPandas()
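
A minimal plain-Python sketch of this group-and-average step, under the assumption that nulls are skipped when averaging, so a date whose values are all missing gets no mean. The helper `mean_by_key` and the sample rows are made up for illustration.

```python
from collections import defaultdict

def mean_by_key(rows):
    """Average the value per key, skipping None values; keys come back sorted."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum, non-null count]
    for key, value in rows:
        bucket = totals[key]  # register the key even when the value is None
        if value is not None:
            bucket[0] += value
            bucket[1] += 1
    return {
        key: (total / n if n else None)
        for key, (total, n) in sorted(totals.items())
    }

# Made-up (DT_COMPTC, VL_TOTAL) pairs, for illustration only.
rows = [
    ("2000-01-04", 10.0),
    ("2000-01-03", None),
    ("2000-01-04", 30.0),
    ("2000-01-05", 7.0),
]
daily_mean = mean_by_key(rows)
print(daily_mean)
```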

To check for outliers, we plot the mean of `VL_TOTAL` on each date:

In [27]:
fig = px.line(
    data_frame= avg_vl_total_per_date,
    x= 'DT_COMPTC',
    y= 'avg(VL_TOTAL)',
    markers= True,
    title= 'Média diária da variável VL_TOTAL'
)
fig.update_layout(title= {'x': 0.5}, xaxis_title= 'Dia', yaxis_title= 'Média do VL_TOTAL')
fig.show()

### *Boxplot* of the daily mean of `VL_TOTAL`

In [28]:
fig = px.box(avg_vl_total_per_date['avg(VL_TOTAL)'], title= 'Boxplot da variável VL_TOTAL', orientation= 'h')
fig.update_layout(title= {'x': 0.5}, yaxis_title= '', xaxis_title= 'Valor')
fig.show()

### Removing *outliers* from the `VL_TOTAL` variable

In [29]:
# approxQuantile takes, in order: the column, the list of quantiles and the relative error.
# It returns a list of values, hence the [0] to extract the single value.
Q1 = df.approxQuantile('VL_TOTAL', [0.25], 0.01)[0]
Q3 = df.approxQuantile('VL_TOTAL', [0.75], 0.01)[0]
IIQ = Q3 - Q1  # interquartile range

In [30]:
inferior = Q1 - (1.5 * IIQ)
superior = Q3 + (1.5 * IIQ)

In [31]:
# Keep the rows inside the fences; rows where VL_TOTAL is exactly zero are also removed.
vl_total_without_outliers = df.where((df['VL_TOTAL'] >= inferior) & (df['VL_TOTAL'] <= superior) & (df['VL_TOTAL'] != 0))

In [32]:
f'Número de linhas: {vl_total_without_outliers.count()}'

'Número de linhas: 48691791'
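
The filter above is the usual interquartile-range rule: keep values between Q1 - 1.5·IQR and Q3 + 1.5·IQR. The sketch below shows that rule in plain Python on made-up numbers. It uses exact quartiles from `statistics.quantiles`, whereas the notebook uses Spark's approximate `approxQuantile`, so the two are not expected to agree digit for digit. The helper name `iqr_fences` is hypothetical.

```python
import statistics

def iqr_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up values with one obvious outlier and one zero.
values = [0, 2, 3, 4, 5, 6, 7, 8, 100]
lower, upper = iqr_fences(values)

# Same three conditions as the notebook filter: inside the fences and non-zero.
kept = [v for v in values if lower <= v <= upper and v != 0]
print(lower, upper, kept)
```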


### Analyses after removing the *outliers*

#### Descriptive statistics

In [33]:
vl_total_without_outliers.describe('VL_TOTAL').show()

+-------+-------------------+
|summary|           VL_TOTAL|
+-------+-------------------+
|  count|           48691791|
|   mean|5.585767112200188E7|
| stddev|6.742780835377653E7|
|    min|    -1.6804662665E8|
|    max|     3.1408188281E8|
+-------+-------------------+



#### Measures of central tendency for the `VL_TOTAL` variable

##### Daily mean

In [34]:
avg_vl_total_per_date = vl_total_without_outliers.groupBy('DT_COMPTC').agg(f.avg('VL_TOTAL')).sort('DT_COMPTC').toPandas()

In [35]:
fig = px.line(
    data_frame= avg_vl_total_per_date,
    x= 'DT_COMPTC',
    y= 'avg(VL_TOTAL)',
    markers= True,
    title= 'Média diária do VL_TOTAL'
)
fig.update_layout(title= {'x': 0.5}, yaxis_title= 'Média do VL_TOTAL', xaxis_title= 'Dia')
fig.show()


##### Monthly mean

In [36]:
avg_vl_total_per_month = vl_total_without_outliers.groupBy('MONTH_OF_YEAR').agg(f.avg('VL_TOTAL')).orderBy('MONTH_OF_YEAR').toPandas()

In [37]:
fig = px.line(
    data_frame= avg_vl_total_per_month,
    x= 'MONTH_OF_YEAR',
    y= 'avg(VL_TOTAL)',
    markers= True,
    title= 'Média mensal do VL_TOTAL'
)
fig.update_layout(title= {'x': 0.5}, yaxis_title= 'Média do VL_TOTAL', xaxis_title= 'Mês')
fig.show()

##### Quarterly mean

In [38]:
avg_vl_total_per_quarter = vl_total_without_outliers.groupBy('QUARTER').agg(f.avg('VL_TOTAL')).orderBy('QUARTER').toPandas()

In [39]:
fig = px.line(
    data_frame= avg_vl_total_per_quarter,
    x= 'QUARTER',
    y= 'avg(VL_TOTAL)',
    markers= True,
    title= 'Média trimestral do VL_TOTAL'
)
fig.update_layout(title= {'x': 0.5}, yaxis_title= 'Média do VL_TOTAL', xaxis_title= 'Trimestre')
fig.show()

**The last digit of each x-axis label is the quarter**. So 20011 is the 1st quarter of 2001.

* Inspecting the 2003 rows, sorted by `VL_TOTAL` in descending order, to examine the 3rd quarter of 2003:

In [40]:
df.where(df['YEAR'] == 2003).sort('VL_TOTAL', ascending= False).show(50)

+------+--------+------------------+----------+--------------+----------+---------------+----------+---------+--------+-----+----+-------------+-------+
| index|TP_FUNDO|        CNPJ_FUNDO| DT_COMPTC|      VL_TOTAL|  VL_QUOTA|  VL_PATRIM_LIQ| CAPTC_DIA| RESG_DIA|NR_COTST|MONTH|YEAR|MONTH_OF_YEAR|QUARTER|
+------+--------+------------------+----------+--------------+----------+---------------+----------+---------+--------+-----+----+-------------+-------+
|252380|   FITVM|04.194.710/0001-50|2003-03-13|1.2822113135E9|6.72423481|1.28215750616E9|       0.0|      0.0|     0.0|    3|2003|      2003-03|  20031|
|193134|   FITVM|02.295.843/0001-98|2003-09-08|6.3224587307E8| 32.587326| 6.3224587307E8|       0.0|      0.0|    28.0|    9|2003|      2003-09|  20033|
|193143|   FITVM|02.295.843/0001-98|2003-09-19|6.2930472625E8| 32.435732| 6.2930472625E8|       0.0|      0.0|    28.0|    9|2003|      2003-09|  20033|
|193135|   FITVM|02.295.843/0001-98|2003-09-09|6.2789283446E8|  32.36296| 6.278928

Of the 50 largest `VL_TOTAL` values in 2003, **most are from the 3rd quarter of 2003**.
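
One way to check a claim like this is to take the top rows and count them per quarter instead of reading them off by eye. The plain-Python sketch below does so with only the four rows visible in the output above, so its counts say nothing about the full top 50. The helper `quarter_of` is hypothetical.

```python
from collections import Counter

# (VL_TOTAL, DT_COMPTC) for the four rows visible in the output above.
rows = [
    (1.2822113135e9, "2003-03-13"),
    (6.3224587307e8, "2003-09-08"),
    (6.2930472625e8, "2003-09-19"),
    (6.2789283446e8, "2003-09-09"),
]

def quarter_of(dt_comptc: str) -> int:
    """Quarter number (1-4) of a 'yyyy-MM-dd' string."""
    month = int(dt_comptc[5:7])
    return (month - 1) // 3 + 1

top = sorted(rows, reverse=True)[:50]  # largest VL_TOTAL first
per_quarter = Counter(quarter_of(day) for _, day in top)
print(per_quarter.most_common())
```

In Spark, the same check could presumably be done by limiting the sorted dataframe to 50 rows and then grouping by `QUARTER` and counting.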

##### Yearly mean

In [41]:
avg_vl_total_per_year = vl_total_without_outliers.groupBy('YEAR').agg(f.avg('VL_TOTAL')).orderBy('YEAR').toPandas()

In [42]:
fig = px.line(
    data_frame= avg_vl_total_per_year,
    x= 'YEAR',
    y= 'avg(VL_TOTAL)',
    markers= True,
    title= 'Média anual do VL_TOTAL'
)
fig.update_layout(title= {'x': 0.5}, yaxis_title= 'Média do VL_TOTAL', xaxis_title= 'Ano')
fig.show()

## `VL_QUOTA` column

### Descriptive statistics

In [43]:
df.describe('VL_QUOTA').show(truncate= False)

+-------+----------------------+
|summary|VL_QUOTA              |
+-------+----------------------+
|count  |57676391              |
|mean   |203815.64916116075    |
|stddev |1.4761400027915974E9  |
|min    |-1.2335447365092747E11|
|max    |1.1206620286175E13    |
+-------+----------------------+



### Daily mean of `VL_QUOTA`

In [44]:
avg_vl_quota_per_date = df.groupBy('DT_COMPTC').agg(f.avg('VL_QUOTA')).orderBy('DT_COMPTC').toPandas()

In [45]:
fig = px.line(
    data_frame= avg_vl_quota_per_date,
    x= 'DT_COMPTC',
    y= 'avg(VL_QUOTA)',
    markers= True,
    title= 'Média diária da variável VL_QUOTA'
)
fig.update_layout(title= {'x': 0.5}, xaxis_title= 'Dia', yaxis_title= 'Média do VL_QUOTA')
fig.show()

### *Boxplot* of the daily mean of `VL_QUOTA`

In [46]:
fig = px.box(
    data_frame= avg_vl_quota_per_date,
    x= 'avg(VL_QUOTA)', orientation= 'h',
    title= 'Boxplot da média diária do VL_QUOTA'
)
fig.update_layout(title= {'x':0.5}, yaxis_title= 'avg(VL_QUOTA)', xaxis_title= 'Valor')
fig.show()

### Removing *outliers* from the `VL_QUOTA` variable

In [47]:
Q1 = df.approxQuantile('VL_QUOTA', [0.25], 0.01)[0]
Q3 = df.approxQuantile('VL_QUOTA', [0.75], 0.01)[0]

In [48]:
IIQ = Q3 - Q1

In [49]:
inferior = Q1 - (1.5 * IIQ)
superior = Q3 + (1.5 * IIQ)

In [50]:
vl_quota_without_outliers = df.where((df['VL_QUOTA'] >= inferior) & (df['VL_QUOTA'] <= superior) & (df['VL_QUOTA'] != 0))

In [51]:
f'Número de linhas: {vl_quota_without_outliers.count()}'

'Número de linhas: 44978627'

### Analyses after removing the *outliers*

#### Descriptive statistics

In [52]:
vl_quota_without_outliers.describe('VL_QUOTA').show(truncate= False)

+-------+-----------------+
|summary|VL_QUOTA         |
+-------+-----------------+
|count  |44978627         |
|mean   |5.576381620359825|
|stddev |8.653925355547019|
|min    |-32.45214728     |
|max    |57.631632        |
+-------+-----------------+



#### Measures of central tendency for `VL_QUOTA`

##### Daily mean

In [53]:
avg_vl_quota_per_date = vl_quota_without_outliers.groupBy('DT_COMPTC').agg(f.avg('VL_QUOTA')).orderBy('DT_COMPTC').toPandas()

In [54]:
fig = px.line(
    data_frame= avg_vl_quota_per_date,
    x= 'DT_COMPTC',
    y= 'avg(VL_QUOTA)',
    markers= True,
    title= 'Média diária do VL_QUOTA'
)
fig.update_layout(title= {'x': 0.5}, yaxis_title= 'Média do VL_QUOTA', xaxis_title= 'Dia')
fig.show()

##### Monthly mean

In [55]:
avg_vl_quota_per_month = vl_quota_without_outliers.groupBy('MONTH_OF_YEAR').agg(f.avg('VL_QUOTA')).orderBy('MONTH_OF_YEAR').toPandas()

In [56]:
fig = px.line(
    data_frame= avg_vl_quota_per_month,
    x= 'MONTH_OF_YEAR',
    y= 'avg(VL_QUOTA)',
    markers= True,
    title= 'Média mensal do VL_QUOTA'
)
fig.update_layout(title= {'x': 0.5}, yaxis_title= 'Média do VL_QUOTA', xaxis_title= 'Mês')
fig.show()

##### Quarterly mean

In [57]:
avg_vl_quota_per_quarter = vl_quota_without_outliers.groupBy('QUARTER').agg(f.avg('VL_QUOTA')).orderBy('QUARTER').toPandas()

In [58]:
fig = px.line(
    data_frame= avg_vl_quota_per_quarter,
    x= 'QUARTER',
    y= 'avg(VL_QUOTA)',
    markers= True,
    title= 'Média trimestral do VL_QUOTA'
)
fig.update_layout(title= {'x': 0.5}, yaxis_title= 'Média do VL_QUOTA', xaxis_title= 'Trimestre')
fig.show()

##### Yearly mean

In [59]:
avg_vl_quota_per_year = vl_quota_without_outliers.groupBy('YEAR').agg(f.avg('VL_QUOTA')).orderBy('YEAR').toPandas()

In [60]:
fig = px.line(
    data_frame= avg_vl_quota_per_year,
    x= 'YEAR',
    y= 'avg(VL_QUOTA)',
    markers= True,
    title= 'Média anual do VL_QUOTA'
)
fig.update_layout(title= {'x': 0.5}, yaxis_title= 'Média do VL_QUOTA', xaxis_title= 'Ano')
fig.show()

## `VL_PATRIM_LIQ` column

### Descriptive statistics

In [61]:
df.describe('VL_PATRIM_LIQ').show(truncate= False)

+-------+--------------------+
|summary|VL_PATRIM_LIQ       |
+-------+--------------------+
|count  |57676391            |
|mean   |3.4370241993529755E8|
|stddev |2.328960989443165E9 |
|min    |-4.03064163365E9    |
|max    |3.43999901324E12    |
+-------+--------------------+



### Daily mean of `VL_PATRIM_LIQ`

In [62]:
avg_vl_patrim_per_date = df.groupBy('DT_COMPTC').agg(f.avg('VL_PATRIM_LIQ')).orderBy('DT_COMPTC').toPandas()

In [63]:
fig = px.line(
    data_frame= avg_vl_patrim_per_date,
    x= 'DT_COMPTC',
    y= 'avg(VL_PATRIM_LIQ)',
    markers= True,
    title= 'Média diária do VL_PATRIM_LIQ'
)
fig.update_layout(title= {'x': 0.5}, yaxis_title= 'Média do VL_PATRIM_LIQ', xaxis_title= 'Dia')
fig.show()

### *Boxplot* of the daily means of `VL_PATRIM_LIQ`

In [64]:
fig = px.box(
    data_frame= avg_vl_patrim_per_date,
    x= 'avg(VL_PATRIM_LIQ)', orientation= 'h',
    title= 'Boxplot das médias diárias do VL_PATRIM_LIQ'
)
fig.update_layout(title= {'x': 0.5}, yaxis_title= 'avg(VL_PATRIM_LIQ)', xaxis_title= 'Valor')
fig.show()

### Removing *outliers* from the `VL_PATRIM_LIQ` variable

In [67]:
Q1 = df.approxQuantile('VL_PATRIM_LIQ', [0.25], 0.01)[0]
Q3 = df.approxQuantile('VL_PATRIM_LIQ', [0.75], 0.01)[0]

In [68]:
IIQ = Q3 - Q1

In [69]:
inferior = Q1 - (1.5 * IIQ)
superior = Q3 + (1.5 * IIQ)

In [70]:
vl_patrim_without_outliers = df.where((df['VL_PATRIM_LIQ'] >= inferior) & (df['VL_PATRIM_LIQ'] <= superior))

### Analyses after removing the *outliers*

#### Descriptive statistics

In [73]:
vl_patrim_without_outliers.describe('VL_PATRIM_LIQ').show(truncate= False)

+-------+--------------------+
|summary|VL_PATRIM_LIQ       |
+-------+--------------------+
|count  |49441841            |
|mean   |5.495694617574076E7 |
|stddev |6.6863157685337506E7|
|min    |-8.873415595E7      |
|max    |3.1230715398E8      |
+-------+--------------------+



#### Measures of central tendency for `VL_PATRIM_LIQ`

##### Monthly mean

In [78]:
avg_vl_patrim_per_month = vl_patrim_without_outliers.groupBy('MONTH_OF_YEAR').agg(f.avg('VL_PATRIM_LIQ')).orderBy('MONTH_OF_YEAR').toPandas()

In [79]:
fig = px.line(
    data_frame= avg_vl_patrim_per_month,
    x= 'MONTH_OF_YEAR',
    y= 'avg(VL_PATRIM_LIQ)',
    markers= True,
    title= 'Médias mensais do VL_PATRIM_LIQ entre 2000 e 2022'
)
fig.update_layout(title= {'x': 0.5}, yaxis_title= 'Média do VL_PATRIM_LIQ', xaxis_title= 'Mês')
fig.show()

##### Quarterly mean

In [80]:
avg_vl_patrim_per_quarter = vl_patrim_without_outliers.groupBy('QUARTER').agg(f.avg('VL_PATRIM_LIQ')).orderBy('QUARTER').toPandas()

In [86]:
fig = px.line(
    data_frame= avg_vl_patrim_per_quarter,
    x= 'QUARTER',
    y= 'avg(VL_PATRIM_LIQ)',
    markers= True,
    title= 'Médias trimestrais do VL_PATRIM_LIQ entre 2000 e 2022',
    color_discrete_sequence= ['red']
)
fig.update_layout(title= {'x': 0.5}, yaxis_title= 'Média do VL_PATRIM_LIQ', xaxis_title= 'Trimestre')
fig.show()

##### Yearly mean

In [87]:
avg_vl_patrim_per_year = vl_patrim_without_outliers.groupBy('YEAR').agg(f.avg('VL_PATRIM_LIQ')).orderBy('YEAR').toPandas()

In [89]:
fig = px.line(
    data_frame= avg_vl_patrim_per_year,
    x= 'YEAR',
    y= 'avg(VL_PATRIM_LIQ)',
    markers= True,
    color_discrete_sequence= ['green'],
    title= 'Médias anuais do VL_PATRIM_LIQ entre 2000 e 2022'
)
fig.update_layout(title= {'x': 0.5}, yaxis_title= 'Média do VL_PATRIM_LIQ', xaxis_title= 'Ano')
fig.show()

## `CAPTC_DIA` column

### Descriptive statistics

In [91]:
df.describe('CAPTC_DIA').show(truncate= False)

+-------+--------------------+
|summary|CAPTC_DIA           |
+-------+--------------------+
|count  |57676391            |
|mean   |2196087.6704268623  |
|stddev |1.9921808190415087E8|
|min    |-828000.0           |
|max    |8.3226001E11        |
+-------+--------------------+



### Daily mean of `CAPTC_DIA`

In [92]:
avg_captc_per_date = df.groupBy('DT_COMPTC').agg(f.avg('CAPTC_DIA')).orderBy('DT_COMPTC').toPandas()

In [93]:
fig = px.line(
    data_frame= avg_captc_per_date,
    x= 'DT_COMPTC',
    y= 'avg(CAPTC_DIA)',
    markers= True,
    title= 'Médias diárias da CAPTC_DIA entre 2000 e 2022'
)
fig.show()

### *Boxplot* of the daily mean of `CAPTC_DIA`

In [96]:
fig = px.box(
    data_frame= avg_captc_per_date,
    x= 'avg(CAPTC_DIA)', orientation= 'h',
    title= 'Boxplot das médias diárias da CAPTC_DIA'
)
fig.update_layout(title= {'x': 0.5}, yaxis_title= 'CAPTC_DIA', xaxis_title= 'Valor')
fig.show()

### Removing *outliers* from the `CAPTC_DIA` variable

In [110]:
Q1 = df.approxQuantile('CAPTC_DIA', [0.25], 0.01)[0]
Q3 = df.approxQuantile('CAPTC_DIA', [0.75], 0.01)[0]

In [111]:
IIQ = Q3 - Q1

In [112]:
IIQ

0.0

In [99]:
inferior = Q1 - (1.5 * IIQ)
superior = Q3 + (1.5 * IIQ)

In [100]:
captc_without_outliers = df.where((df['CAPTC_DIA'] >= inferior) & (df['CAPTC_DIA'] <= superior))

### Analyses after removing the *outliers*

#### Descriptive statistics

In [101]:
captc_without_outliers.describe('CAPTC_DIA').show(truncate= False)

+-------+---------+
|summary|CAPTC_DIA|
+-------+---------+
|count  |45913815 |
|mean   |0.0      |
|stddev |0.0      |
|min    |0.0      |
|max    |0.0      |
+-------+---------+
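
The all-zero summary follows directly from `IIQ` being 0.0. When Q1 and Q3 coincide, both fences collapse onto that single value, so the filter keeps only the rows where `CAPTC_DIA` is exactly 0 (45,913,815 of 57,676,391 rows here) and every mean computed afterwards is 0. The plain-Python sketch below reproduces the effect on made-up numbers. The helper `iqr_fences` and the closing guard are hypothetical, not part of the notebook.

```python
import statistics

def iqr_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up daily inflows: 80% of the entries are exactly zero.
inflows = [0.0] * 8 + [5000.0, 91000.0]
lower, upper = iqr_fences(inflows)
kept = [v for v in inflows if lower <= v <= upper]
print(lower, upper, kept)

# Hypothetical guard: flag the degenerate case instead of filtering blindly.
if lower == upper:
    print("IQR is zero: the fences keep only the value", lower)
```

For a column dominated by zeros, the 1.5·IQR rule therefore discards every non-zero observation. A different criterion, such as applying the rule to the non-zero values only, may be more informative. That is a suggestion, not something the notebook does.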



#### Measures of central tendency for `CAPTC_DIA`

##### Monthly mean

In [103]:
avg_captc_per_month = captc_without_outliers.groupBy('MONTH_OF_YEAR').agg(f.avg('CAPTC_DIA')).orderBy('MONTH_OF_YEAR').toPandas()

In [105]:
avg_captc_per_month

Unnamed: 0,MONTH_OF_YEAR,avg(CAPTC_DIA)
0,2000-01,0.0
1,2000-02,0.0
2,2000-03,0.0
3,2000-04,0.0
4,2000-05,0.0
...,...,...
272,2022-09,0.0
273,2022-10,0.0
274,2022-11,0.0
275,2022-12,0.0


In [104]:
fig = px.line(
    data_frame= avg_captc_per_month,
    x= 'MONTH_OF_YEAR',
    y= 'avg(CAPTC_DIA)',
    markers= True,
    title= 'Médias mensais da CAPTC_DIA entre 2000 e 2022'
)
fig.update_layout(title= {'x': 0.5}, yaxis_title= 'Média da CAPTC_DIA', xaxis_title= 'Mês')
fig.show()

##### Quarterly mean

In [None]:
avg_captc_per_quarter = 