# eEDB-011-2024-3

## Activity 3 – Ingestion and ETL with a programming language (Python + Spark)

- Use the Python programming language for data ingestion and processing. The entire process must run on Spark.
    - Additional packages may be used
    - Data processing must not be done via SQL
- Ingest all the datasets into an open-source relational database. Any database may be used; some suggestions:
    - MySQL
    - PostgreSQL
    - ClickHouse
- Generate a final table with the processed and joined data.
    - Data processing must be done with Python + Spark
- Add the following processing layers, either inside the database itself or on local disk. The Delivery layer must also be available as a final table inside the relational database:
    - RAW – free data format
    - Trusted – data in Parquet, ORC, or Avro format (Parquet recommended)
    - Delivery – data in Parquet, ORC, or Avro format (Parquet recommended)

- **Group 02**:
    - Aline Bini
    - Ana Lívia Franco
    - Ana Priss
    - João Squinelato
    - Marcelo Pena
    - Thais Siqueira

- [GitHub](https://github.com/Squinelato/eEDB-011-2024-3 "eEDB-011-2024-3")

```Data Ingestion | August 2024```

## To Do

- raw 
- trusted
- delivery

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, concat, lower, lpad, regexp_replace, sha1, udf
from pyspark.sql.types import StringType, FloatType, IntegerType
from unidecode import unidecode

import os

In [6]:
spark = SparkSession.builder.getOrCreate()

In [10]:
# spark.conf.set('spark.rpc.askTimeout', '600s') 
# spark.conf.set('spark.executor.heartbeatInterval', "120s") 
spark.conf.set('spark.executor.timeout', "1200s") 
spark.conf.set('spark.sql.broadcastTimeout', "1200")

In [11]:
spark

---
## **Raw**

### **Banks file**

In [12]:
banks_csv_path = '../Fonte de Dados/Bancos/EnquadramentoInicia_v2.tsv'
rwzd_bank = spark.read.csv(banks_csv_path, sep='\t', encoding='utf8', header=True)
rwzd_bank.show(100, truncate=False)

+--------+--------+--------------------------------------------------------------------------------------------------------------------------+
|Segmento|CNPJ    |Nome                                                                                                                      |
+--------+--------+--------------------------------------------------------------------------------------------------------------------------+
|S1      |0       |BANCO DO BRASIL - PRUDENCIAL                                                                                              |
|S1      |60746948|BRADESCO - PRUDENCIAL                                                                                                     |
|S1      |30306294|BTG PACTUAL - PRUDENCIAL                                                                                                  |
|S1      |360305  |CAIXA ECONOMICA FEDERAL - PRUDENCIAL                                                                                      |

Inspecting the data schema

In [13]:
rwzd_bank.printSchema()

root
 |-- Segmento: string (nullable = true)
 |-- CNPJ: string (nullable = true)
 |-- Nome: string (nullable = true)



Counting the number of rows

In [14]:
rwzd_bank.count()

1474

Saving the data to the _raw_ layer in Parquet format

In [15]:
# raw_bank_path = './raw/bank/'
# rwzd_bank.write.parquet(raw_bank_path, mode="append")

### **Employees file**

Locating all files that contain employee data

In [16]:
employee_dir = '../Fonte de Dados/Empregados/'
employee_files = os.listdir(employee_dir)
employee_paths = list(map(lambda file: os.path.join(employee_dir, file), employee_files))
employee_paths

['../Fonte de Dados/Empregados/glassdoor_consolidado_join_match_less_v2.csv',
 '../Fonte de Dados/Empregados/glassdoor_consolidado_join_match_v2.csv']
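
`os.listdir` returns every entry in the directory, in arbitrary order, so a stray non-CSV file or a subdirectory would also be handed to the reader. The following is a minimal sketch of a stricter listing, assuming only `.csv` files should be read. The helper name `list_csv_paths` is hypothetical and not part of the notebook.

```python
from pathlib import Path


def list_csv_paths(directory: str) -> list[str]:
    """Return the sorted paths of the .csv files directly inside `directory`."""
    return sorted(
        str(path)
        for path in Path(directory).iterdir()
        # Skip subdirectories and anything without a .csv extension
        if path.is_file() and path.suffix.lower() == ".csv"
    )
```

Under that assumption, `list_csv_paths(employee_dir)` could replace the `os.listdir` and `map` pair above; sorting also makes the read order reproducible between runs.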

Reading all employee files as a single dataset

In [17]:
rwzd_employee = spark.read.csv(employee_paths, sep='|', encoding='utf8', header=True)
rwzd_employee.show()

+--------------------+-------------+-------------+--------------+--------------+--------------------+---------------------+----------------+--------------------+--------------------+--------------------+-----+-----------------+----------------------+-----------------+--------------+------------------------+-------------------------+---------------------------------+----------------------------------+--------+--------------------+-------------+
|       employer_name|reviews_count|culture_count|salaries_count|benefits_count|    employer-website|employer-headquarters|employer-founded|   employer-industry|    employer-revenue|                 url|Geral|Cultura e valores|Diversidade e inclusão|Qualidade de vida|Alta liderança|Remuneração e benefícios|Oportunidades de carreira|Recomendam para outras pessoas(%)|Perspectiva positiva da empresa(%)|Segmento|                Nome|match_percent|
+--------------------+-------------+-------------+--------------+--------------+--------------------+---

Removing duplicates based on the bank name and segment

In [18]:
rwzd_employee = rwzd_employee.dropDuplicates(['Nome', 'Segmento'])

Inspecting the data schema

In [19]:
rwzd_employee.printSchema()

root
 |-- employer_name: string (nullable = true)
 |-- reviews_count: string (nullable = true)
 |-- culture_count: string (nullable = true)
 |-- salaries_count: string (nullable = true)
 |-- benefits_count: string (nullable = true)
 |-- employer-website: string (nullable = true)
 |-- employer-headquarters: string (nullable = true)
 |-- employer-founded: string (nullable = true)
 |-- employer-industry: string (nullable = true)
 |-- employer-revenue: string (nullable = true)
 |-- url: string (nullable = true)
 |-- Geral: string (nullable = true)
 |-- Cultura e valores: string (nullable = true)
 |-- Diversidade e inclusão: string (nullable = true)
 |-- Qualidade de vida: string (nullable = true)
 |-- Alta liderança: string (nullable = true)
 |-- Remuneração e benefícios: string (nullable = true)
 |-- Oportunidades de carreira: string (nullable = true)
 |-- Recomendam para outras pessoas(%): string (nullable = true)
 |-- Perspectiva positiva da empresa(%): string (nullable = true)
 |-- Seg

Counting the number of rows

In [20]:
rwzd_employee.count()

37

Saving the data to the _raw_ layer in Parquet format

In [21]:
# raw_employee_path = './raw/employee/'
# rwzd_employee.write.parquet(raw_employee_path, mode="append")

### **Claims**

Locating all files that contain complaint data

In [22]:
claim_dir = '../Fonte de Dados/Reclamações/'
claim_files = os.listdir(claim_dir)
claim_paths = list(map(lambda file: os.path.join(claim_dir, file), claim_files))
claim_paths

['../Fonte de Dados/Reclamações/2021_tri_01.csv',
 '../Fonte de Dados/Reclamações/2021_tri_02.csv',
 '../Fonte de Dados/Reclamações/2021_tri_03.csv',
 '../Fonte de Dados/Reclamações/2021_tri_04.csv',
 '../Fonte de Dados/Reclamações/2022_tri_01.csv',
 '../Fonte de Dados/Reclamações/2022_tri_02_nao_ha_dados.csv',
 '../Fonte de Dados/Reclamações/2022_tri_03.csv',
 '../Fonte de Dados/Reclamações/2022_tri_04.csv']

In [23]:
rwzd_claim = spark.read.csv(claim_paths, sep=';', encoding='latin1', header=True)
rwzd_claim.show()

+----+---------+--------------------+----------------+--------+----------------------+------+-----------------------------------------------+--------------------------------------------+---------------------------------------+-------------------------------+----------------------------------------+----------------------------+----------------------------+----+
| Ano|Trimestre|           Categoria|            Tipo| CNPJ IF|Instituição financeira|Índice|Quantidade de reclamações reguladas procedentes|Quantidade de reclamações reguladas - outras|Quantidade de reclamações não reguladas|Quantidade total de reclamações|Quantidade total de clientes  CCS e SCR|Quantidade de clientes  CCS|Quantidade de clientes  SCR|_c14|
+----+---------+--------------------+----------------+--------+----------------------+------+-----------------------------------------------+--------------------------------------------+---------------------------------------+-------------------------------+----------------

Inspecting the data schema

In [24]:
rwzd_claim.printSchema()

root
 |-- Ano: string (nullable = true)
 |-- Trimestre: string (nullable = true)
 |-- Categoria: string (nullable = true)
 |-- Tipo: string (nullable = true)
 |-- CNPJ IF: string (nullable = true)
 |-- Instituição financeira: string (nullable = true)
 |-- Índice: string (nullable = true)
 |-- Quantidade de reclamações reguladas procedentes: string (nullable = true)
 |-- Quantidade de reclamações reguladas - outras: string (nullable = true)
 |-- Quantidade de reclamações não reguladas: string (nullable = true)
 |-- Quantidade total de reclamações: string (nullable = true)
 |-- Quantidade total de clientes  CCS e SCR: string (nullable = true)
 |-- Quantidade de clientes  CCS: string (nullable = true)
 |-- Quantidade de clientes  SCR: string (nullable = true)
 |-- _c14: string (nullable = true)



Removing an unnecessary column

In [25]:
rwzd_claim = rwzd_claim.drop('_c14')
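
Spark names header fields that have no name `_cN`, so `_c14` points to a fifteenth field without a header. One plausible cause, assumed here and not verified against the files, is a trailing `;` at the end of each line. This small pure-Python sketch uses made-up sample text to show how a trailing delimiter produces an extra empty field:

```python
import csv
import io

# Hypothetical two-column sample in which every line ends with a trailing ';'
sample = "Ano;Trimestre;\n2021;1;\n"

rows = list(csv.reader(io.StringIO(sample), delimiter=";"))
header, first_row = rows[0], rows[1]

# The trailing ';' yields one extra, empty field on each line
print(header)     # ['Ano', 'Trimestre', '']
print(first_row)  # ['2021', '1', '']
```

If that is what happens in the complaint files, the extra column carries no data and dropping it loses nothing.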

Counting the number of rows

In [26]:
rwzd_claim.count()

918

Saving the data to the _raw_ layer in Parquet format

In [27]:
# raw_claim_path = './raw/claim/'
# rwzd_claim.write.parquet(raw_claim_path, mode="append")

---
## **Trusted**

### **Banks**

In [28]:
trzd_bank = rwzd_bank

Applying a few transformations to improve data quality:

1 - Renaming the _dataframe_ columns to English, in _snake case_

In [29]:
trzd_bank = trzd_bank.withColumnsRenamed({
    'Segmento': 'segment',
    'CNPJ': 'cnpj',
    'Nome': 'financial_institution_name'
})

2 - So that the values in the _cnpj_ column match the expected format, incomplete values are left-padded with zeros until they reach 8 digits

In [30]:
trzd_bank = trzd_bank.withColumn('cnpj', lpad(col('cnpj'), 8, '0'))
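
To make the padding rule concrete, here is a pure-Python approximation of what `lpad` does to a single value. The function name `lpad_like` is hypothetical. Note that Spark's `lpad` also shortens values longer than the target length, which the sketch mirrors.

```python
def lpad_like(value: str, length: int, pad: str) -> str:
    """Approximate Spark's lpad for a non-empty pad string."""
    if len(value) >= length:
        # Values longer than `length` are cut down to `length` characters
        return value[:length]
    missing = length - len(value)
    # Repeat the pad string and keep only as many characters as are missing
    return (pad * missing)[:missing] + value


print(lpad_like("360305", 8, "0"))     # 00360305
print(lpad_like("0", 8, "0"))          # 00000000
print(lpad_like("123456789", 8, "0"))  # 12345678
```

The last call shows the truncation case: a value with more than 8 digits would silently lose its trailing digits, so it may be worth checking the maximum length of the column before padding.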

In [31]:
trzd_bank = trzd_bank.withColumn('sk_cnpj_segment', sha1(concat(col('cnpj'), col('segment'))))

In [32]:
trzd_bank.show(truncate=False)

+-------+--------+---------------------------------------------+----------------------------------------+
|segment|cnpj    |financial_institution_name                   |sk_cnpj_segment                         |
+-------+--------+---------------------------------------------+----------------------------------------+
|S1     |00000000|BANCO DO BRASIL - PRUDENCIAL                 |d9be4941c63af670d63086dcb0849a997e062a64|
|S1     |60746948|BRADESCO - PRUDENCIAL                        |2f59bebc86f6b4dd5628043cac240c6e6a6f651d|
|S1     |30306294|BTG PACTUAL - PRUDENCIAL                     |110ffbf347c0837827daf658f4f97950ebe9f741|
|S1     |00360305|CAIXA ECONOMICA FEDERAL - PRUDENCIAL         |587e4dca521e8c4426371a29658419ecb0087262|
|S1     |60872504|ITAU - PRUDENCIAL                            |4b6e5b20d113f3415704fd6a998f2e4500340b18|
|S1     |90400888|SANTANDER - PRUDENCIAL                       |75ddb1c713e8da4daab944522b8ee5dda29c2983|
|S2     |92702067|BANRISUL - PRUDENCIAL       
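
The `sk_cnpj_segment` column is the SHA-1 hash of `cnpj` concatenated with `segment`. The sketch below is a pure-Python equivalent for a single row, assuming the hash is taken over the UTF-8 bytes of the concatenated string. The function name `surrogate_key` and its `sep` parameter are hypothetical. Like Spark's `concat`, it returns a null key when either input is null.

```python
import hashlib
from typing import Optional


def surrogate_key(cnpj: Optional[str], segment: Optional[str], sep: str = "") -> Optional[str]:
    """SHA-1 hex digest of cnpj + sep + segment, or None if either part is None."""
    if cnpj is None or segment is None:
        return None
    return hashlib.sha1(f"{cnpj}{sep}{segment}".encode("utf-8")).hexdigest()


# Without a separator, different pairs can produce the same concatenated string
print(surrogate_key("1", "23") == surrogate_key("12", "3"))            # True
print(surrogate_key("1", "23", "|") == surrogate_key("12", "3", "|"))  # False
```

Because `cnpj` is padded to a fixed 8 characters here, the collision shown above should not occur for this table; a separator mainly matters when the first part has variable length.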

In [33]:
# TODO: save as Parquet

### **Employees**

In [34]:
trzd_employee = rwzd_employee

In [35]:
trzd_employee = trzd_employee.withColumnsRenamed({
    'employer-website': 'employer_website',
    'employer-headquarters': 'employer_headquarters',
    'employer-founded': 'employer_founded',
    'employer-industry': 'employer_industry',
    'employer-revenue': 'employer_revenue',
    'Geral': 'general_score',
    'Cultura e valores': 'culture_values_score',
    'Diversidade e inclusão': 'diversity_inclusion_score',
    'Qualidade de vida': 'life_quality_score',
    'Alta liderança': 'senior_leadership_score',
    'Remuneração e benefícios': 'compensation_benefits_score',
    'Oportunidades de carreira': 'career_opportunities_score',
    'Recomendam para outras pessoas(%)': 'recommendation_score',
    'Perspectiva positiva da empresa(%)': 'company_positive_score',
    'Segmento': 'segment',
    'Nome': 'financial_institution_name'
})

In [36]:
trzd_employee = trzd_employee.withColumn('employer_founded', regexp_replace(col('employer_founded'), r'\..*', ''))
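
The pattern `\..*` matches the first dot and everything after it, so the replacement strips a decimal suffix from the founding year. Here is a pure-Python sketch of the same substitution; the sample values are made up and the function name `strip_decimal_suffix` is hypothetical.

```python
import re


def strip_decimal_suffix(value: str) -> str:
    """Drop the first '.' and everything that follows it."""
    return re.sub(r"\..*", "", value)


print(strip_decimal_suffix("1808.0"))  # 1808
print(strip_decimal_suffix("1943"))    # 1943
```

Values without a dot pass through unchanged, so the transformation is safe to apply to the whole column.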

In [37]:
def unaccent_new(word: str) -> str:
    # Pass nulls through unchanged; unidecode raises an error on None
    return unidecode(word) if word is not None else None

unaccent = udf(unaccent_new, StringType())

In [38]:
# unaccent = udf(lambda word: unidecode(word))

trzd_employee = trzd_employee \
    .withColumn('sk_financial_institution_name', unaccent(col('financial_institution_name'))) \
    .withColumn('sk_financial_institution_name', lower(col('sk_financial_institution_name'))) \
    .withColumn('sk_financial_institution_name', regexp_replace(col('sk_financial_institution_name'), r' ', '')) \
    .withColumn('sk_financial_institution_name', sha1(col('sk_financial_institution_name')))
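
The chain above normalizes the institution name (remove accents, lowercase, drop spaces) and then hashes it. The sketch below reproduces those steps for a single string in pure Python. It swaps `unidecode` for the standard library's `unicodedata`, which is an assumption made to keep the example dependency-free: characters that do not decompose to ASCII are dropped rather than transliterated, so results can differ from `unidecode` for some inputs. The function names are hypothetical.

```python
import hashlib
import unicodedata


def normalize_name(name: str) -> str:
    """Strip accents, lowercase, and remove spaces."""
    # NFKD splits accented letters into a base letter plus combining marks
    decomposed = unicodedata.normalize("NFKD", name)
    # Encoding to ASCII with 'ignore' discards the combining marks
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return ascii_only.lower().replace(" ", "")


def name_key(name: str) -> str:
    """SHA-1 hex digest of the normalized name."""
    return hashlib.sha1(normalize_name(name).encode("utf-8")).hexdigest()


print(normalize_name("Itaú Unibanco"))  # itauunibanco
```

Two spellings that differ only in accents, case, or spacing end up with the same key, which is what makes the column usable for joining on names.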

In [39]:
trzd_employee.show(truncate=False)

Py4JJavaError: An error occurred while calling o130.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 21.0 failed 1 times, most recent failure: Lost task 0.0 in stage 21.0 (TID 23) (NITRO5 executor driver): org.apache.spark.SparkException: Python worker failed to connect back.
	at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:192)
	at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:109)
	at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:124)
	at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:166)
	at org.apache.spark.sql.execution.python.BatchEvalPythonExec.evaluate(BatchEvalPythonExec.scala:82)
	at org.apache.spark.sql.execution.python.EvalPythonExec.$anonfun$doExecute$2(EvalPythonExec.scala:131)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:853)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:853)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
	at org.apache.spark.scheduler.Task.run(Task.scala:139)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.lang.Thread.run(Unknown Source)
Caused by: java.net.SocketTimeoutException: Accept timed out
	at java.net.DualStackPlainSocketImpl.waitForNewConnection(Native Method)
	at java.net.DualStackPlainSocketImpl.socketAccept(Unknown Source)
	at java.net.AbstractPlainSocketImpl.accept(Unknown Source)
	at java.net.PlainSocketImpl.accept(Unknown Source)
	at java.net.ServerSocket.implAccept(Unknown Source)
	at java.net.ServerSocket.accept(Unknown Source)
	at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:179)
	... 25 more

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2785)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2721)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2720)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2720)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1206)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1206)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1206)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2984)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2923)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2912)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:971)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2263)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2284)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2303)
	at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:530)
	at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:483)
	at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:61)
	at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$executeCollect$1(AdaptiveSparkPlanExec.scala:354)
	at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.withFinalPlanUpdate(AdaptiveSparkPlanExec.scala:382)
	at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeCollect(AdaptiveSparkPlanExec.scala:354)
	at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:4177)
	at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:3161)
	at org.apache.spark.sql.Dataset.$anonfun$withAction$2(Dataset.scala:4167)
	at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:526)
	at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:4165)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:118)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:195)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:103)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:4165)
	at org.apache.spark.sql.Dataset.head(Dataset.scala:3161)
	at org.apache.spark.sql.Dataset.take(Dataset.scala:3382)
	at org.apache.spark.sql.Dataset.getRows(Dataset.scala:284)
	at org.apache.spark.sql.Dataset.showString(Dataset.scala:323)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
	at java.lang.reflect.Method.invoke(Unknown Source)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.spark.SparkException: Python worker failed to connect back.
	at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:192)
	at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:109)
	at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:124)
	at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:166)
	at org.apache.spark.sql.execution.python.BatchEvalPythonExec.evaluate(BatchEvalPythonExec.scala:82)
	at org.apache.spark.sql.execution.python.EvalPythonExec.$anonfun$doExecute$2(EvalPythonExec.scala:131)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:853)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:853)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
	at org.apache.spark.scheduler.Task.run(Task.scala:139)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	... 1 more
Caused by: java.net.SocketTimeoutException: Accept timed out
	at java.net.DualStackPlainSocketImpl.waitForNewConnection(Native Method)
	at java.net.DualStackPlainSocketImpl.socketAccept(Unknown Source)
	at java.net.AbstractPlainSocketImpl.accept(Unknown Source)
	at java.net.PlainSocketImpl.accept(Unknown Source)
	at java.net.ServerSocket.implAccept(Unknown Source)
	at java.net.ServerSocket.accept(Unknown Source)
	at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:179)
	... 25 more
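
The trace ends in `Python worker failed to connect back`, caused by a socket accept timeout: the JVM started a Python worker to evaluate the UDF and never received its connection. A frequently reported cause, assumed here and not confirmed for this machine, is that Spark launches a different or missing Python interpreter. A minimal sketch of one common workaround is to pin the interpreter before the `SparkSession` is created:

```python
import os
import sys

# Point both the driver and the workers at the interpreter running this notebook.
# These variables must be set before the SparkSession (and its JVM) is created.
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable
```

For the setting to take effect, the kernel has to be restarted and this cell run before `SparkSession.builder.getOrCreate()`. If the error persists, the cause lies elsewhere, for example in a firewall blocking the local socket.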
