## Apiux & SII: Strength between tax entities related to contamination probability, defined on the basis of IVA (VAT).
## Henry Vega (henrry.vega@api-ux.com)
## Data scientist

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark import SparkContext, SparkConf
import pyspark
import pandas as pd
import warnings
warnings.filterwarnings('ignore', category=DeprecationWarning)

from pyspark_dist_explore import hist
import matplotlib.pyplot as plt
from pyspark.sql.types import StringType,TimestampType

In [2]:
spark = SparkSession.builder \
  .appName("Test")  \
  .config("spark.kerberos.access.hadoopFileSystems","abfs://data@datalakesii.dfs.core.windows.net/") \
  .config("spark.executor.memory", "16g") \
  .config("spark.driver.memory", "12g")\
  .config("spark.executor.cores", "2") \
  .config("spark.driver.maxResultSize", "12g") \
  .getOrCreate()
sc = spark.sparkContext
sc.setLogLevel('ERROR')
# Disable the vectorized Parquet reader and read/write legacy dates and
# INT96 timestamps as-is (no calendar rebasing).
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
spark.conf.set("spark.sql.parquet.int96RebaseModeInRead", "CORRECTED")
spark.conf.set("spark.sql.parquet.int96RebaseModeInWrite", "CORRECTED")
spark.conf.set("spark.sql.parquet.datetimeRebaseModeInRead", "CORRECTED")
spark.conf.set("spark.sql.parquet.datetimeRebaseModeInWrite", "CORRECTED")

Setting spark.hadoop.yarn.resourcemanager.principal to manuel.barrientos
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).

First, we read the data for the corresponding commercial arcs.

In [3]:
spark.read.parquet("abfs://data@datalakesii.dfs.core.windows.net/DatosOrigen/LibSDF/JBA_ARCOS_E").createOrReplaceTempView("comercial")
#spark.sql("SELECT count(*) from comercial where Monto_IVA<=0").show()
#spark.sql("SELECT count(*) from comercial where Monto_IVA>0").show()

                                                                                

Based on the counts from the queries above, 0.98% of all Monto_IVA records are zero or negative. Let us now compare the amounts.

In [4]:
spark.sql("SELECT sum(Monto_IVA) from comercial where Monto_IVA<=0").show()
spark.sql("SELECT sum(Monto_IVA) from comercial where Monto_IVA>0").show()

                                                                                

+--------------+
|sum(Monto_IVA)|
+--------------+
| -213127757609|
+--------------+





+--------------+
|sum(Monto_IVA)|
+--------------+
|71319972519491|
+--------------+



                                                                                

In absolute terms, only about 0.3% of the total amount recorded as IVA is negative. Given the nature of the data, where applying a credit note can leave a negative IVA remainder, it is sensible to exclude these values. Next, we look at the distribution of IVA in a frequency histogram, restricted to positive amounts below 1e+6.
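The share of negative IVA can be checked directly from the two sums printed by cell In [4]. The following is a plain-Python sketch, not part of the original notebook: the function name `negative_share` is hypothetical, and it assumes the share is taken over the sum of absolute amounts.

```python
def negative_share(negative_sum, positive_sum):
    """Fraction of the total absolute amount contributed by non-positive values."""
    negative_abs = abs(negative_sum)
    return negative_abs / (negative_abs + positive_sum)


# Sums copied from the output of In [4]
neg_sum = -213127757609
pos_sum = 71319972519491

share_pct = 100 * negative_share(neg_sum, pos_sum)
print(f"{share_pct:.3f}% of the absolute IVA amount is non-positive")
```

The result lands just under 0.3%, which supports dropping the non-positive rows before building the fractions.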

In [5]:
df=spark.sql("SELECT Monto_IVA FROM comercial where Monto_IVA>0 and Monto_IVA<1e+6")

In [6]:
#fig, ax = plt.subplots()
#hist(ax, df.select('Monto_IVA'), bins = 30, color=['blue'])

Now we compute the IVA fraction Fi for each taxpayer A that has issued tax documents to taxpayer B: the IVA of each A-to-B record divided by the total IVA received by B.
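To make the fraction concrete, here is a minimal plain-Python sketch on made-up edges. The tuples and single-letter identifiers are illustrative assumptions, not real data. Each edge's IVA is divided by the total IVA received by the same receiver, and the result is rounded to 4 decimals as in the SQL.

```python
from collections import defaultdict

# Made-up edges: (emisor, receptor, Monto_IVA)
edges = [("A", "X", 30.0), ("B", "X", 70.0), ("C", "Y", 50.0)]

# Total IVA received by each receptor
total_iva = defaultdict(float)
for _, receptor, monto in edges:
    total_iva[receptor] += monto

# Fi for each edge: its IVA as a fraction of the receptor's total
fi = {(e, r): round(m / total_iva[r], 4) for e, r, m in edges}
print(fi)
```

A receiver with a single issuer gets Fi = 1.0 on that edge, which is why so many rows in the sorted output have that value.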

In [7]:
# Keep only positive-IVA edges: issuer (PARU_RUT_E0) -> receiver (PARU_RUT_E2)
spark.sql("SELECT PARU_RUT_E0, PARU_RUT_E2, Monto_IVA FROM comercial where Monto_IVA>0 order by PARU_RUT_E2 asc").createOrReplaceTempView("comercial")
# Total IVA received by each receiver
spark.sql("SELECT PARU_RUT_E2, sum(Monto_IVA) as Total_IVA FROM comercial group by PARU_RUT_E2 order by PARU_RUT_E2 asc").createOrReplaceTempView("comercial_aux")
# Fi: each edge's IVA as a fraction of the receiver's total, rounded to 4 decimals
spark.sql("SELECT PARU_RUT_E0,comercial.PARU_RUT_E2,ROUND(Monto_IVA/Total_IVA,4) as Fi from comercial left join comercial_aux on comercial.PARU_RUT_E2= comercial_aux.PARU_RUT_E2").createOrReplaceTempView("comercial")

In [8]:
spark.sql("SELECT PARU_RUT_E0 as emisor, PARU_RUT_E2 as receptor,  Fi from comercial where Fi>0 order by Fi desc").show()



+--------------------+--------------------+---+
|              emisor|            receptor| Fi|
+--------------------+--------------------+---+
|Hug0GU2sG4iH4o6T8...|+/3jdSccPuuuyOyMm...|1.0|
|PN/RY90u52SAb4xuL...|+/wlbQUam5sirf2KJ...|1.0|
|LHczQGu7eZm8Uz8xB...|+/4MEgPVGHWQD7qcl...|1.0|
|cORc2mRWcEkFyhdcT...|++T7htiq0jRXP9+PQ...|1.0|
|41vxG5znQHxYqDibA...|+/5bMXTXblS4yiAe8...|1.0|
|FC/NIGw5S6P8v12CD...|++785jmZpIf7ZArAQ...|1.0|
|FElTabHJ05zcz8O99...|+/5bMwgy0lffGupXS...|1.0|
|h7Ke396V/N0SZ3Ugz...|++Bs1wGrBL10Udxhy...|1.0|
|SnAmjXkltQdXjUKta...|+/95qISDYVbC48IDE...|1.0|
|4Nk5x2vxWNlS5EhsS...|++P3AJ5oxSkntAyhU...|1.0|
|wl6oyRFcSm7kSUA89...|+/AOivhaOcu9l0xEj...|1.0|
|06wNfUqi5lvHzFE7S...|++ZB+RpgWO0Wb78VT...|1.0|
|N7y1530qJu5s9WE1H...|+/D6XXYg5JWKf0P94...|1.0|
|uchkk/snGaBO/AUJ8...|++eQDcmb33TTx4Bpc...|1.0|
|g+qp0UCX8YmXLTG2d...|+/JUA8rWxvXEB+1hy...|1.0|
|us2I09F9yl4w1DgLr...|++jIIlxJ2ixeCSJEb...|1.0|
|Pai2tvkvsPI/rqYAP...|+/LEQXbiZX2vHWrj6...|1.0|
|bWhM0M+YtIVOifuHM...|++osxTC4ikn9fe7NB.

                                                                                

Finally, we save the data so it can be used later in the propagation step.

In [9]:
iva=spark.sql("SELECT PARU_RUT_E0 as emisor, PARU_RUT_E2 as receptor, Fi from comercial")


In [10]:
iva.show()



+--------------------+--------------------+------+
|              emisor|            receptor|    Fi|
+--------------------+--------------------+------+
|rU7FdwEbyB1XAocGE...|1U5hfY5uQDpbhrRy/...|   1.0|
|Zs/v9ZB+hj/x7OXqA...|1U5j0tJs+UJ75asYp...|   1.0|
|Pn1oFMN6vJsn1Lm2K...|Gbyrl9kwVKCYvLeiQ...|0.0241|
|ytU/ETAx1hyw04lpY...|Gbyrl9kwVKCYvLeiQ...|0.0028|
|JXO34xoSYuLlbDcwN...|Gbyrl9kwVKCYvLeiQ...|5.0E-4|
|LPHdq1N4SJaDbcG3b...|Gbyrl9kwVKCYvLeiQ...|0.0661|
|MWgBYRE6Rkroy0ChE...|Gbyrl9kwVKCYvLeiQ...|0.6129|
|MbbOzrtpiosPK/ylg...|Gbyrl9kwVKCYvLeiQ...|0.0432|
|Mj6tu3QlvL4d7ACws...|Gbyrl9kwVKCYvLeiQ...|0.0056|
|+zbL4FzFU1Pol+FYW...|Gbyrl9kwVKCYvLeiQ...|0.0316|
|0NP1I6RR9byiXgQN7...|Gbyrl9kwVKCYvLeiQ...|0.0467|
|2Lh8b7/HIJxnDZN5i...|Gbyrl9kwVKCYvLeiQ...|0.0035|
|3yo9Mh0h0fKcbKk5t...|Gbyrl9kwVKCYvLeiQ...|0.0028|
|D62TtWrZ9g9vUC0ct...|Gbyrl9kwVKCYvLeiQ...|9.0E-4|
|EaRgEM8xjLfbDimK8...|Gbyrl9kwVKCYvLeiQ...|0.0015|
|FL5lmkRkFqgRFXL/r...|Gbyrl9kwVKCYvLeiQ...|0.0014|
|GfR2x0Ls0uVIKj6/4...|Gbyrl9kwV

                                                                                

In [11]:
iva.count()

                                                                                

42059697

In [12]:
ivaPandas=iva.toPandas()


                                                                                

In [13]:
ivaPandas.to_csv('/home/cdsw/data/processed/fuerza_iva.csv', index=False)
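As an illustration of how the saved file might be consumed later, the sketch below reads a CSV with the same three columns (`emisor`, `receptor`, `Fi`) and groups incoming edges by receiver. The helper name `load_strengths`, the grouping by receiver, and the sample rows are assumptions made for this example. The notebook does not show how the propagation step actually loads the file.

```python
import csv
import io
from collections import defaultdict


def load_strengths(csv_file):
    """Group (emisor, Fi) pairs by receptor from a CSV with columns emisor, receptor, Fi."""
    incoming = defaultdict(list)
    for row in csv.DictReader(csv_file):
        incoming[row["receptor"]].append((row["emisor"], float(row["Fi"])))
    return dict(incoming)


# Made-up rows standing in for the contents of fuerza_iva.csv
sample = io.StringIO("emisor,receptor,Fi\nA,X,0.3\nB,X,0.7\nC,Y,1.0\n")
incoming = load_strengths(sample)
print(incoming)
```

With the real file, the same function could be called on `open('/home/cdsw/data/processed/fuerza_iva.csv', newline='')`. For roughly 42 million rows, a streaming or chunked reader may be preferable to holding everything in memory.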