# Problem description

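When Spark writes a CSV and reads it back, a schema whose field order differs from the file causes problems, even when the header is kept (`header=True`).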

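When Spark reads a CSV with an explicit schema, it assigns the DataFrame's column names and types from that schema by position. It does not use the header to reorder the fields.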
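The difference between positional and name-based matching can be sketched in plain Python, without Spark. This is only an analogy: `csv.reader` rows labelled by position mimic how the schema is applied here, while `csv.DictReader` looks values up by header name. The sample text and the helper names `read_by_position` and `read_by_header` are made up for illustration.

```python
import csv
import io

CSV_TEXT = "col_a,col_b\n1,2\n2,3\n"
SCHEMA_NAMES = ["col_b", "col_a"]  # same fields as the file, different order


def read_by_position(text, names):
    """Skip the header and label each value by its position."""
    rows = list(csv.reader(io.StringIO(text)))
    return [dict(zip(names, row)) for row in rows[1:]]


def read_by_header(text, names):
    """Look each value up by header name, then emit it in the requested order."""
    reader = csv.DictReader(io.StringIO(text))
    return [{name: row[name] for name in names} for row in reader]


print(read_by_position(CSV_TEXT, SCHEMA_NAMES))  # col_b receives col_a's values
print(read_by_header(CSV_TEXT, SCHEMA_NAMES))    # values stay with their own names
```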

In [1]:
import datetime

import pyspark
from pyspark.sql import SparkSession, SQLContext
from pyspark.sql.types import *
from pyspark.sql import functions as F
from pyspark.sql.functions import col

spark = SparkSession \
    .builder \
    .appName("example") \
    .getOrCreate()
sc = spark.sparkContext
sql_ctx = SQLContext(sc)

Define the schema used for writing

In [2]:
schema = StructType([
    StructField('col_a', StringType(), True),
    StructField('col_b', IntegerType(), True),
])

In [3]:
df = sql_ctx.createDataFrame([
    ('1', 2),
    ('2', 3),
], schema)

In [4]:
df.printSchema()
df.show()

root
 |-- col_a: string (nullable = true)
 |-- col_b: integer (nullable = true)

+-----+-----+
|col_a|col_b|
+-----+-----+
|    1|    2|
|    2|    3|
+-----+-----+



Write it to HDFS

In [5]:
df.write.mode('overwrite').csv('/tmp/spark_test.csv', header=True)

Define a second schema with the two fields in swapped order

In [6]:
schema2 = StructType([
    StructField('col_b', IntegerType(), True),
    StructField('col_a', StringType(), True),
])

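Reading with the second schema relabels the columns by position, even with `header=True`: the values stay in file order, so `col_b` now holds the values written as `col_a`, and vice versa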

In [7]:
df2 = sql_ctx.read.csv('/tmp/spark_test.csv', header=True, schema=schema2)
df2.printSchema()
df2.show()

root
 |-- col_b: integer (nullable = true)
 |-- col_a: string (nullable = true)

+-----+-----+
|col_b|col_a|
+-----+-----+
|    1|    2|
|    2|    3|
+-----+-----+
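One possible workaround, as a minimal sketch rather than a definitive fix: read the header first, reorder the schema fields to match the file, then select the columns back into the desired order. The helpers `fields_in_header_order` and `read_csv_by_header` are hypothetical names introduced here. The sketch assumes the file's header contains exactly the schema's field names, and it costs an extra read of the header.

```python
def fields_in_header_order(header, fields):
    """Return `fields` reordered to match `header`; fail loudly on a name mismatch."""
    by_name = {f.name: f for f in fields}
    if len(header) != len(fields) or set(by_name) != set(header):
        raise ValueError(
            f"header {list(header)} does not match schema fields {sorted(by_name)}"
        )
    return [by_name[name] for name in header]


def read_csv_by_header(spark, path, schema):
    """Hypothetical helper: values follow the header names, columns follow `schema` order."""
    from pyspark.sql.types import StructType  # imported lazily; requires pyspark

    # Without an explicit schema, header=True takes the column names from the file.
    header = spark.read.csv(path, header=True).columns
    file_schema = StructType(fields_in_header_order(header, schema.fields))
    df = spark.read.csv(path, header=True, schema=file_schema)
    # Put the columns back in the order the caller asked for.
    return df.select(*schema.fieldNames())
```

With the notebook's objects this would be called as `read_csv_by_header(spark, '/tmp/spark_test.csv', schema2)`. Separately, Spark 2.4 and later accept an `enforceSchema` option on the CSV reader. Setting `enforceSchema=False` together with `header=True` makes Spark validate the header names against the schema instead of ignoring them, which should surface the mismatch rather than silently mislabelling columns.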

