Spark patch #139
base: main
Conversation
spark/.gitkeep
Outdated
@@ -0,0 +1 @@
looks like this file could be removed
StructField("RefererHash", LongType, nullable = false),
StructField("URLHash", LongType, nullable = false),
StructField("CLID", IntegerType, nullable = false))
)
there is no way to create an index, right?
Yes, there is no index support:
https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/sql/types/StructField.html
val timeElapsed = (end - start) / 1000000
println(s"Query $itr | Time: $timeElapsed ms")
itr += 1
})
Please upload the results.
Uploaded in the log.txt file.
spark/benchmark.sh
Outdated
# For Spark 3.0.1 installation:
# wget --continue https://downloads.apache.org/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz
# tar -xzf spark-3.0.1-bin-hadoop2.7.tgz
tar -xzf spark*
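A minimal sketch of the suggested change, using the URL and version from the quoted script (note: old Spark releases tend to move off downloads.apache.org to archive.apache.org, so the URL is an assumption that may need adjusting):

```shell
# Download the Spark archive (version as in the quoted script).
wget --continue https://downloads.apache.org/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz
# Extract with a wildcard, as the reviewer suggests, so the command
# does not hard-code the exact archive name across version bumps.
tar -xzf spark-*.tgz
```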
spark/benchmark.sh
Outdated
wget --continue 'https://datasets.clickhouse.com/hits_compatible/hits.tsv.gz'
#gzip -d hits.tsv.gz
chmod 777 ~ hits.tsv
$HADOOP_HOME/bin/hdfs dfs -put hits.tsv /
But how do I set this variable?
$ echo $HADOOP_HOME
I cannot find it:
find spark-3.5.0-bin-hadoop3 -name hdfs
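For reference (a hedged sketch, not taken from the PR): the `spark-3.5.0-bin-hadoop3` tarball bundles Hadoop client jars but does not ship the `hdfs` command itself, so `HADOOP_HOME` has to point at a separately unpacked Hadoop distribution. The version and path below are assumptions:

```shell
# Assumed: a standalone Hadoop tarball unpacked under $HOME
# (hadoop-3.3.6 is a hypothetical example version).
export HADOOP_HOME="$HOME/hadoop-3.3.6"
export PATH="$HADOOP_HOME/bin:$PATH"
# After this, the $HADOOP_HOME/bin/hdfs path used by the quoted script resolves.
echo "$HADOOP_HOME"
```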
Added Spark & HDFS deployment details to the benchmark.sh script. Added an example log.txt file from an HPC environment.
The script
Should be:
Updated Spark & HDFS directory creation.
I started editing your script to make it self-sufficient, but after fixing the errors, it does not work.
Then:
PS. The current version of the script is:
@alexey-milovidov We assume that a passwordless ssh connection is set up on localhost (in other words, if we use …). Please clarify the following details:
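If that passwordless-ssh assumption needs to be made explicit in the script, a common sketch is the following (default key paths; this step is labeled an assumption, since the PR itself does not show it):

```shell
# Sketch: generate a key without a passphrase (skipped if one already exists)
# and authorize it for localhost, which Hadoop's start-dfs.sh relies on.
mkdir -p ~/.ssh
[ -f ~/.ssh/id_rsa ] || ssh-keygen -t rsa -N '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
```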
I do it this way: create a fresh VM on AWS and run the commands one by one.
@DoubleMindy, let's continue.
Added full HDFS deployment; on a fresh VM there is no problem putting the file.
Sorry, but the script still does not reproduce. I'm copy-pasting the commands one by one and getting this:
We need a reproducible script to install Spark. It should run by itself.