TiSpark with HDFS

This article introduces how to use TiSpark with HDFS.

Set up

create TiDB Table

CREATE TABLE `test`.`tidb` (
  `id` int(11) NOT NULL, 
  `name` varchar(255) NOT NULL, 
  PRIMARY KEY (`id`)
);

create data.csv

take 0x01 delimiter and \n newline as example

2^Ashi
3^Ayu

load the CSV to hdfs

hdfs dfs -put data.csv /

start spark-shell

./bin/spark-shell --jars tispark-assembly-3.0-2.5.1.jar

Write to TiDB from HDFS

you can also use Spark JDBC DataSource to do it, see here

import org.apache.spark.sql.types._

// read from hdfs
val schema = new StructType().add("id",IntegerType).add("name",StringType)
val df = spark.read.format("csv").option("delimiter","\u0001").schema(schema).load("hdfs://${ip:port}/data.csv")

// write to tidb
val tidbOptions = Map(
  "tidb.addr" -> "${ip}",
  "tidb.password" -> "",
  "tidb.port" -> "4000",
  "tidb.user" -> "root"
)
df.write.format("tidb").options(tidbOptions).option("database", "test").option("table", "tidb").mode("append").save()

// if we want to replace
df.write.format("tidb").options(tidbOptions).option("database", "test").option("table", "tidb").option("replace","true").mode("append").save()

Write to HDFS from TiDB

you can also use Spark JDBC DataSource to do it, see here

// read from tidb
val df = spark.sql("select * from tidb_catalog.test.tidb")

// write to hdfs
df.write.format("csv").option("delimiter","\u0001").save("hdfs://${ip:port}/save")

zip file

Spark support read from the zipped file in Hadoop directly, take gz file for example:

val schema = new StructType().add("id",IntegerType).add("name",StringType)
val df = spark.read.format("csv").option("delimiter","\u0001").schema(schema).load("hdfs://${ip:port}/data.csv.gz")

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TiSpark with HDFS

Set up

Write to TiDB from HDFS

Write to HDFS from TiDB

zip file

Clone this wiki locally