
Reading CoNLL dataset on AWS EMR 6.5 fails #13033

Closed
ethnhll opened this issue Nov 3, 2022 · 2 comments · Fixed by #13035 or #13036
Comments

@ethnhll

ethnhll commented Nov 3, 2022

Description

When trying to load a CoNLL file into a Spark DataFrame using CoNLL.readDataset(spark, "file:///path/to/file") on AWS EMR 6.5, an error is raised that seems to indicate that the EMR build of Spark is not supported.

An error was encountered:
invalid literal for int() with base 10: '312-amzn-1'
Traceback (most recent call last):
  File "/home/hadoop/spark_venv/lib/python3.7/site-packages/sparknlp/training/conll.py", line 142, in readDataset
    dataframe = self.getDataFrame(spark, jdf)
  File "/home/hadoop/spark_venv/lib/python3.7/site-packages/sparknlp/internal/extended_java_wrapper.py", line 60, in getDataFrame
    if self.spark_version() >= 330:
  File "/home/hadoop/spark_venv/lib/python3.7/site-packages/sparknlp/internal/extended_java_wrapper.py", line 57, in spark_version
    return int("".join(spark_version))
ValueError: invalid literal for int() with base 10: '312-amzn-1'

Expected Behavior

A CoNLL file should load into a dataframe successfully.

Current Behavior

An error is reported:

An error was encountered:
invalid literal for int() with base 10: '312-amzn-1'
Traceback (most recent call last):
  File "/home/hadoop/spark_venv/lib/python3.7/site-packages/sparknlp/training/conll.py", line 142, in readDataset
    dataframe = self.getDataFrame(spark, jdf)
  File "/home/hadoop/spark_venv/lib/python3.7/site-packages/sparknlp/internal/extended_java_wrapper.py", line 60, in getDataFrame
    if self.spark_version() >= 330:
  File "/home/hadoop/spark_venv/lib/python3.7/site-packages/sparknlp/internal/extended_java_wrapper.py", line 57, in spark_version
    return int("".join(spark_version))
ValueError: invalid literal for int() with base 10: '312-amzn-1'

Steps to Reproduce

from sparknlp.training import CoNLL
# spark is an active SparkSession with Spark NLP available
trainingData = CoNLL().readDataset(spark, "file:///path/to/file.conll")
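
The failing conversion can be reproduced in isolation, without a CoNLL file at all. A minimal sketch, assuming spark is the active SparkSession on EMR 6.5 and that the version string is split on dots before being joined (which is what the '312-amzn-1' literal in the traceback suggests):

version = spark.version                # "3.1.2-amzn-1" on this EMR release, not "3.1.2"
joined = "".join(version.split("."))   # -> "312-amzn-1"
int(joined)                            # ValueError: invalid literal for int() with base 10: '312-amzn-1'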

Context

I am not able to load a training dataset to train an NER model.

Your Environment

Spark NLP version

4.0.2

Java version

openjdk version "1.8.0_342"
OpenJDK Runtime Environment Corretto-8.342.07.4 (build 1.8.0_342-b07)
OpenJDK 64-Bit Server VM Corretto-8.342.07.4 (build 25.342-b07, mixed mode)

Setup and installation (Pypi, Conda, Maven, etc.):

Pypi

Operating System and version:

sh-4.2$ cat /etc/os-release

NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"


sh-4.2$ lsb_release -a

LSB Version:    :core-4.1-amd64:core-4.1-noarch:cxx-4.1-amd64:cxx-4.1-noarch:desktop-4.1-amd64:desktop-4.1-noarch:languages-4.1-amd64:languages-4.1-noarch:printing-4.1-amd64:printing-4.1-noarch
Distributor ID: Amazon
Description:    Amazon Linux release 2 (Karoo)
Release:        2
Codename:       Karoo

sh-4.2$ uname -r

4.14.252-195.483.amzn2.x86_64
@maziyarpanahi
Member

Thanks @ethnhll for reporting this. It's the method we use to get the Apache Spark version so we can decide what to use inside CoNLL(). (Spark 3.3.x constructs the DataFrame differently internally.)

For some reason, the Spark version on Amazon (or this specific EMR release) is not a plain all-digit number like 3.1.2; it has a string suffix as well. I will make sure we account for that so it won't fail.
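
A more tolerant parse would keep only the leading digits of each dotted component before joining, so vendor suffixes such as "-amzn-1" are ignored. A minimal sketch of that idea (the helper name is illustrative only; it is not necessarily the exact change that lands in #13035):

import re

def spark_version_as_int(version: str) -> int:
    # keep only the leading digits of each dotted component,
    # dropping vendor suffixes such as "-amzn-1"
    parts = []
    for component in version.split("."):
        match = re.match(r"\d+", component)
        if match:
            parts.append(match.group(0))
    return int("".join(parts))

spark_version_as_int("3.1.2-amzn-1")  # -> 312
spark_version_as_int("3.3.0")         # -> 330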

@maziyarpanahi
Member

I just released spark-nlp==4.2.3-rc1 to solve this issue. All you need to do is set 4.2.3-rc1 as the spark-nlp version inside your EMR bootstrap.sh file.

This fix will officially be part of the 4.2.3 release next week: #13035
