### I/O in PySpark

File input/output (I/O) operations are an integral part of many software activities.

A data scientist deals with many types of files, including text files, comma-separated values (CSV) files, JavaScript Object Notation (JSON) files, and many more. The Hadoop Distributed File System (HDFS) is a widely used distributed file system for storing such files.

### Read a Simple Text File
#### Problem

You want to read a simple text file.
#### Solution
You have a simple text file named shakespearePlays.txt. The file content is as follows:

    Love’s Labour’s Lost

    A Midsummer Night’s Dream

    Much Ado About Nothing

    As You Like It

The shakespearePlays.txt file has four lines. You want to read this file by using PySpark. After reading the file, you want to calculate the following:

    Total number of lines in the file

    Total number of characters in the file

To read a simple file, you can use two functions: textFile() and wholeTextFiles(). These two functions are defined on our SparkContext object.

The textFile() method reads a text file and returns an RDD of lines. Like a transformation, textFile() is lazy: it does not read the data until the first action is called. As a result, if the file is not available when textFile() is run, no error is thrown at that point. The error is thrown when the first action is called, because only then does Spark actually try to read the file.
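The lazy behavior described above can be sketched as follows. The helper name `try_count` is hypothetical, and `sc` is assumed to be an existing SparkContext; the exact exception type depends on the environment, so the sketch catches a generic `Exception`.

```python
def try_count(sc, path):
    """Return (line_count, None) on success or (None, error_message) on failure."""
    # Defining the RDD is lazy: Spark only records the path here, so a
    # missing file is normally not detected at this point.
    lines = sc.textFile(path)
    try:
        # count() is an action, so this is where the file is actually read.
        return lines.count(), None
    except Exception as err:  # PySpark typically raises a Py4JJavaError here
        return None, str(err)
```

Calling `try_count(sc, '/user/no_such_file.txt')` (a made-up path) would be expected to return `None` together with an error message, while the `sc.textFile()` line itself completes without complaint.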

Another method, wholeTextFiles(), works in a similar way to textFile(), except that it reads each file as a key/value pair. The file name is the key, and the file content is the value associated with that key.
#### How It Works

Let’s see how these built-in methods work.
#### Reading a Text File by Using the textFile() Function

The textFile() function takes three inputs. The first input is the path of the file to read. The second argument, minPartitions, sets the minimum number of data partitions in the RDD. The third argument is use_unicode, which defaults to True. If use_unicode is False, the lines are kept as UTF-8 encoded strings instead of being decoded to unicode, which is faster and smaller. Here is the textFile() function in use:

In [2]:
from pyspark import SparkContext
sc = SparkContext()

In [3]:
testfile = sc.textFile('/user/shakespear.txt', 2)
testfile.collect()

                                                                                

['    Love’s Labour’s Lost',
 '',
 '    A Midsummer Night’s Dream',
 '',
 '    Much Ado About Nothing',
 '',
 '    As You Like It']
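The two quantities asked for in the problem statement, the total number of lines and the total number of characters, can be computed from the RDD of lines. This is a minimal sketch; the function name `summarize_lines` is hypothetical, and note that it counts blank lines as lines and does not count newline characters.

```python
def summarize_lines(lines):
    """Return (total_lines, total_characters) for an RDD of text lines."""
    total_lines = lines.count()          # action: number of elements in the RDD
    total_chars = lines.map(len).sum()   # length of each line, then add them up
    return total_lines, total_chars
```

For the seven-element RDD collected above, `summarize_lines(testfile)` would give 7 lines and 97 characters (24 + 29 + 26 + 18, with the blank lines contributing 0).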

#### Reading a Text File by Using wholeTextFiles()

The wholeTextFiles() function also takes three inputs. The first input is the path of the file or directory to read. The second argument, minPartitions, sets the minimum number of data partitions in the RDD. The third argument is use_unicode, which defaults to True. If use_unicode is False, the content is kept as UTF-8 encoded strings instead of being decoded to unicode. Let’s read the same text file, this time by using the wholeTextFiles() function:

In [12]:
testfile = sc.wholeTextFiles('/user/shakespear.txt', 2)
testfile.collect()

                                                                                

[('hdfs://localhost:9000/user/shakespear.txt',
  '    Love’s Labour’s Lost\n\n    A Midsummer Night’s Dream\n\n    Much Ado About Nothing\n\n    As You Like It\n')]

In [13]:
testfile.keys().collect()

['hdfs://localhost:9000/user/shakespear.txt']
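Because wholeTextFiles() keeps each file's content as a single value, per-file statistics are easy to compute with mapValues(). The sketch below counts the lines in each file; the names `count_lines` and `lines_per_file` are hypothetical, and `sc` is assumed to be an existing SparkContext. Note that `str.splitlines()` also breaks on a few less common separators, so the count can differ slightly from what textFile() reports for unusual input.

```python
def count_lines(content):
    """Number of lines in one file's content (a trailing newline adds no line)."""
    return len(content.splitlines())


def lines_per_file(sc, path):
    """Return an RDD of (file_name, line_count) pairs for the files at path."""
    pairs = sc.wholeTextFiles(path)       # RDD of (file_name, file_content)
    return pairs.mapValues(count_lines)   # keep the key, transform the value
```

For example, `lines_per_file(sc, '/user/shakespear.txt').collect()` would pair the file's URI with its line count.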

### Write an RDD to a Simple Text File
#### Problem

You want to write an RDD to a simple text file.
#### Solution

Suppose you have calculated the number of characters in each line as the RDD playDataLineLength. Now you want to save it in a text file.

We can save an RDD as a text file by using the saveAsTextFile() function. This method is defined on the RDD, not on SparkContext, as was the case for the textFile() and wholeTextFiles() functions. You have to provide the output directory; a file name is not required. The directory you provide must not already exist; otherwise, the write operation will fail. Because the RDD exists in partitions, PySpark writes the partitions in parallel, producing one output file per partition.

The saveAsTextFile() function takes two inputs. The first input is path, which is the path of the directory where the RDD has to be saved. The second argument is compressionCodecClass, an optional argument with a default value of None. We can pass the class name of a compression codec, such as org.apache.hadoop.io.compress.GzipCodec, to compress the output files and thereby reduce the storage they need.
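As a brief illustration of the compressionCodecClass argument, the sketch below saves an RDD with Gzip compression. The helper name `save_gzipped` and the constant `GZIP_CODEC` are hypothetical; the codec class name is the standard Hadoop Gzip codec.

```python
# Fully qualified class name of Hadoop's Gzip codec.
GZIP_CODEC = "org.apache.hadoop.io.compress.GzipCodec"


def save_gzipped(rdd, output_dir):
    """Write rdd to output_dir as Gzip-compressed text part files."""
    # output_dir must not exist yet, exactly as for an uncompressed save.
    rdd.saveAsTextFile(output_dir, compressionCodecClass=GZIP_CODEC)
```

With this codec the part files get a `.gz` extension, and textFile() can read such a directory back without any extra arguments.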

#### How It Works

Let’s start with the code for counting the number of characters in each line, provided here for the sake of clarity.

#### Counting the Number of Characters on Each Line

Let’s read the file and count the number of characters per line:

In [14]:
testfile = sc.textFile('/user/shakespear.txt', 2)
data = testfile.map(lambda line: len(line))
data.collect()

[24, 0, 29, 0, 26, 0, 18]

#### Saving the RDD to a File

Now that we have the RDD of character counts, we want to save it to an output directory (here, /user/output_data/shake):


In [15]:
data.saveAsTextFile('/user/output_data/shake')
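To check what was written, the output directory can be read back with textFile(). Since saveAsTextFile() stores each element in its string form, the values come back as strings and need converting if numbers are wanted. A minimal sketch, with the hypothetical helper name `load_line_lengths` and `sc` assumed to be an existing SparkContext:

```python
def load_line_lengths(sc, saved_dir):
    """Read a directory written by saveAsTextFile() back as an RDD of ints."""
    # textFile() on a directory reads every part file inside it.
    saved = sc.textFile(saved_dir)
    # Each line holds one number as text, so convert it back to int.
    return saved.map(int)
```

For the directory saved above, `load_line_lengths(sc, '/user/output_data/shake').collect()` would be expected to return the same counts that were saved.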



### Read a Directory
#### Problem

You want to read a directory.
#### Solution

In a directory, there are many files. You want to read the directory (all files at once).

Reading many files from a directory at once is a very common task. To read a directory, we use the textFile() function or the wholeTextFiles() function. The textFile() function reads every file in the directory and merges their lines into a single RDD. In contrast, the wholeTextFiles() function reads the files as key/value pairs, with the name of each file as the key and the content of the file as the value.

You are provided with a directory that contains several small text files. In the session shown here, the directory is /user/test_data/, and it contains three files named test1.txt, test2.txt, and test3.txt.

Your job is to read all these files from the directory in one go.

#### How It Works

We will use both functions, textFile() and wholeTextFiles(), one at a time, to read the directory noted previously. 
#### Reading a Directory by Using textFile()

In previous recipes, we provided the absolute file path as the input to the textFile() function in order to read the file. The best part of the textFile() function is that just by changing the path input, we can change what this function reads. To read all files from a directory, we provide the absolute path of the directory as input to textFile(). The following code reads all the files in the /user/test_data/ directory by using the textFile() function:

In [18]:
data = sc.textFile('/user/test_data/', 2)
data.collect()

['hi there\\n how are you',
 'okay \\n i am good \\n what about you',
 'i am coool\\n thanks for asking']

The output is very clear. The textFile() function has read all the files in the directory and merged the content in the files. It has created an RDD of the merged data.
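Besides a single file or a directory, the path given to textFile() can also contain wildcards, and several paths can be combined in one comma-separated string. The sketch below builds such a string; the function name `read_text_sources` is hypothetical, and `sc` is assumed to be an existing SparkContext.

```python
def read_text_sources(sc, *paths, min_partitions=2):
    """Read several files, directories, or wildcard patterns into one RDD of lines."""
    # textFile() accepts a comma-separated list of paths, and each path
    # may be a file, a directory, or a pattern such as /user/test_data/*.txt.
    return sc.textFile(",".join(paths), min_partitions)
```

For example, `read_text_sources(sc, '/user/test_data/*.txt')` would read only the `.txt` files in that directory.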
#### Reading a Directory by Using wholeTextFiles()

Let’s now read the same set of files by using the wholeTextFiles() function. As we did with textFile(), here we also provide the path of the directory as one of the inputs to the wholeTextFiles() function. The following code shows the use of the wholeTextFiles() function to read a directory:

In [19]:
data2 = sc.wholeTextFiles('/user/test_data/', 2)
data2.collect()

[('hdfs://localhost:9000/user/test_data/test1.txt',
  'hi there\\n how are you\n'),
 ('hdfs://localhost:9000/user/test_data/test2.txt',
  'okay \\n i am good \\n what about you\n'),
 ('hdfs://localhost:9000/user/test_data/test3.txt',
  'i am coool\\n thanks for asking\n')]
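The keys returned by wholeTextFiles() are full URIs. When only the file name matters, the keys can be shortened with a small helper, as in this sketch. The names `base_name` and `contents_by_file_name` are hypothetical.

```python
import posixpath
from urllib.parse import urlparse


def base_name(uri):
    """Return the last path component of a file URI or plain path."""
    return posixpath.basename(urlparse(uri).path)


def contents_by_file_name(pairs):
    """Map an RDD of (uri, content) pairs to (file_name, content) pairs."""
    return pairs.map(lambda kv: (base_name(kv[0]), kv[1]))
```

Applied to the RDD above, `contents_by_file_name(data2).keys().collect()` would list the file names without the `hdfs://localhost:9000/user/test_data/` prefix.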

### Read Data from HDFS
#### Problem

You want to read a file from HDFS by using PySpark.
#### Solution

HDFS is well suited to storing large files. You are given the file filamentData.csv in HDFS. This file is under the bookData directory, which is under the root directory of HDFS. You want to read this file by using PySpark.

To read a file from HDFS, we first need to know the value of the fs.default.name property (named fs.defaultFS in newer Hadoop releases) from the core-site.xml file, which is in the Hadoop configuration directory. For us, the value of fs.default.name is hdfs://localhost:9746, so the full path of our file in HDFS is hdfs://localhost:9746/bookData/filamentData.csv.

We can use the textFile() function to read the file from HDFS by using the full path of the file.
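Looking up the file system URI and building the full path can be sketched in plain Python. This assumes the usual core-site.xml layout of `<property>` elements with `<name>` and `<value>` children; the function names `default_fs` and `hdfs_path` are hypothetical. Newer Hadoop releases call the property fs.defaultFS, so the sketch checks that name first and falls back to fs.default.name.

```python
import xml.etree.ElementTree as ET


def default_fs(core_site_xml):
    """Return the default file system URI from the text of core-site.xml, or None."""
    root = ET.fromstring(core_site_xml)
    props = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip()
        props[name] = value
    # Prefer the newer property name, then the older one.
    return props.get("fs.defaultFS") or props.get("fs.default.name")


def hdfs_path(fs_uri, file_path):
    """Join a file system URI and an absolute path with exactly one slash."""
    return fs_uri.rstrip("/") + "/" + file_path.lstrip("/")
```

With the value from the text, `hdfs_path('hdfs://localhost:9746', '/bookData/filamentData.csv')` gives `hdfs://localhost:9746/bookData/filamentData.csv`, which can be passed directly to textFile().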
#### How It Works

Reading a file from HDFS is as easy as reading data from a local machine. In the following code, we pass a full HDFS path to the textFile() function. In this session, the NameNode is at hdfs://localhost:9000 and the file read is /user/test_data/test3.txt:

In [20]:
data = sc.textFile('hdfs://localhost:9000/user/test_data/test3.txt', 2)
data.collect()

['i am coool\\n thanks for asking']

### Save RDD Data to HDFS
#### Problem

You want to save RDD data to HDFS.
#### Solution

RDD data can be saved to HDFS by using the saveAsTextFile() function. In Recipe 6-2, we saved the RDD in a file on the local file system. We are going to save the same RDD, playDataLineLength, to HDFS.

Similar to the way we worked with the textFile() function, we have to provide the full path of the file, including the NameNode URI, to saveAsTextFile() to write an RDD to HDFS.
#### How It Works

The following code line counts the number of characters in each line and saves the result to HDFS in one step.

#### Counting the Number of Characters on Each Line

In [22]:
data.map(lambda line: len(line)).saveAsTextFile('hdfs://localhost:9000/user/test_data/output_data/')

#### Saving an RDD to HDFS

The saveAsTextFile() call in the previous code line writes the RDD to the /user/test_data/output_data/ directory in HDFS. Remember that the output directory must not exist before the data is saved; otherwise, the saveAsTextFile() function throws an error. Now we are going to investigate the saved data. An output directory contains one part file per RDD partition (part-00000, part-00001, and so on) plus a _SUCCESS file. You can list the contents of an output directory with the HDFS ls command, and display the data of an individual file on the console with the HDFS cat command:

hdfs dfs -ls /user/output_data/

### Write Data to a Sequential File
#### Problem

You want to write data into a sequential file.
#### Solution
We often want to save the results of PySpark processing to a sequence file. You have an RDD of subject data, as shown in Table 6-2, and you want to write it to a sequence file.

#### How It Works

In this recipe, we are first going to create an RDD and then save it into a sequence file.

In [23]:
subjectsData = [('si1','Python'),
                 ('si3','Java'),
                 ('si1','Java'),
                 ('si2','Python'),
                 ('si3','Ruby'),
                 ('si4','C++'),
                 ('si5','C'),
                 ('si4','Python'),
                 ('si2','Java')]

In [24]:
subjectDataRDD = sc.parallelize(subjectsData, 2)
subjectDataRDD.collect()

[('si1', 'Python'),
 ('si3', 'Java'),
 ('si1', 'Java'),
 ('si2', 'Python'),
 ('si3', 'Ruby'),
 ('si4', 'C++'),
 ('si5', 'C'),
 ('si4', 'Python'),
 ('si2', 'Java')]

In [25]:
subjectDataRDD.saveAsSequenceFile('hdfs://localhost:9000/user/test_data/output_data/sequence')

### Read Data from a Sequential File
#### Problem

You want to read data from a sequential file.
#### Solution

A sequential file (a Hadoop SequenceFile) stores data as key/value pairs in a binary format. This is a commonly used file format in Hadoop. The keys and values are Hadoop Writable types.
We have data in a sequential file in HDFS, in the sequenceFileToRead directory inside the root directory. The file inside that directory holds the data in Table 6-1.

Table 6-1. Sequential File Data

We can read the sequence file by using the sequenceFile() method defined in the SparkContext class.
#### How It Works

The sequenceFile() function takes many arguments; here are the most important ones. The first argument is path, which is the path of the sequential file. The second argument is keyClass, which indicates the key class of the data in the sequence file. The argument valueClass represents the data type of the values. Remember that the key and value classes must be Hadoop Writable types, such as org.apache.hadoop.io.Text.

In [28]:
subjectData = sc.sequenceFile('hdfs://localhost:9000/user/test_data/output_data/sequence/')
subjectData.collect()

[('si1', 'Python'),
 ('si3', 'Java'),
 ('si1', 'Java'),
 ('si2', 'Python'),
 ('si3', 'Ruby'),
 ('si4', 'C++'),
 ('si5', 'C'),
 ('si4', 'Python'),
 ('si2', 'Java')]
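The keyClass and valueClass arguments can also be passed explicitly. The sketch below does so for string keys and values, which are stored as Hadoop Text, and then groups the subjects by student ID. The names `read_subjects` and `subjects_per_student` are hypothetical, and `sc` is assumed to be an existing SparkContext.

```python
# Hadoop Writable class used for string keys and values.
TEXT_WRITABLE = "org.apache.hadoop.io.Text"


def read_subjects(sc, path):
    """Read a sequence file of (student_id, subject) string pairs."""
    return sc.sequenceFile(path, keyClass=TEXT_WRITABLE, valueClass=TEXT_WRITABLE)


def subjects_per_student(pairs):
    """Group an RDD of (student_id, subject) pairs into (student_id, sorted subjects)."""
    # groupByKey() gathers all values for a key; sorted() turns them into a list.
    return pairs.groupByKey().mapValues(sorted)
```

For the data above, `subjects_per_student(subjectData).collect()` would contain, for example, the pair `('si1', ['Java', 'Python'])`; the order of the pairs themselves is not guaranteed.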

### Read a CSV File

In [55]:
data = sc.textFile('/user/alcohol-consumption.csv')
data.collect()

                                                                                

['country,total_consumption,recorded_consumption,unrecorded_consumption,beer_percentage,wine_percentage,spirits_percentage,other_percentage,2020_projection,2025_projection',
 'Estonia,16.9,15.8,1.1,32.7,7.4,50.3,9.6,11.5,11.9',
 'Lithuania,15.0,13.8,1.2,43.6,7.3,37.1,12.1,14.4,13.9',
 'Czech Republic,14.3,12.4,1.4,53.3,21.3,25.4,0.0,11.2,11.4',
 'Seychelles,13.8,12.4,1.4,68.9,22.4,6.3,2.5,10.4,10.6',
 'Germany,13.4,11.3,1.4,52.6,28.4,18.9,0.0,12.8,12.6',
 'Nigeria,13.4,9.6,3.8,7.9,0.4,0.6,91.1,13.0,12.5',
 'Ireland,13.0,11.3,1.4,47.0,28.0,18.8,6.2,13.5,13.9',
 'Moldova,13.0,11.5,1.4,35.4,44.6,20.0,0.0,12.6,12.4',
 'Latvia,12.9,11.1,1.9,42.8,11.1,40.0,6.1,14.0,15.1',
 'Bulgaria,12.7,11.4,1.3,38.8,17.2,42.9,1.2,13.0,13.4',
 'France,12.6,11.8,1.5,18.8,58.8,20.7,1.7,12.3,12.1',
 'Romania,12.6,10.4,2.2,55.6,28.1,16.4,0.0,13.2,13.8',
 'Slovenia,12.6,10.8,1.8,41.4,50.6,8.0,0.0,11.6,10.6',
 'Portugal,12.3,10.6,2.1,26.1,61.5,7.7,4.7,11.8,11.0',
 'Luxembourg,12.3,10.6,2.1,26.1,61.5,7.7,4.7,11.8,

In [57]:
data.map(lambda line: line.split(',')).map(lambda fields: (fields[0], fields[1:])).collect()

[('country',
  ['total_consumption',
   'recorded_consumption',
   'unrecorded_consumption',
   'beer_percentage',
   'wine_percentage',
   'spirits_percentage',
   'other_percentage',
   '2020_projection',
   '2025_projection']),
 ('Estonia',
  ['16.9', '15.8', '1.1', '32.7', '7.4', '50.3', '9.6', '11.5', '11.9']),
 ('Lithuania',
  ['15.0', '13.8', '1.2', '43.6', '7.3', '37.1', '12.1', '14.4', '13.9']),
 ('Czech Republic',
  ['14.3', '12.4', '1.4', '53.3', '21.3', '25.4', '0.0', '11.2', '11.4']),
 ('Seychelles',
  ['13.8', '12.4', '1.4', '68.9', '22.4', '6.3', '2.5', '10.4', '10.6']),
 ('Germany',
  ['13.4', '11.3', '1.4', '52.6', '28.4', '18.9', '0.0', '12.8', '12.6']),
 ('Nigeria',
  ['13.4', '9.6', '3.8', '7.9', '0.4', '0.6', '91.1', '13.0', '12.5']),
 ('Ireland',
  ['13.0', '11.3', '1.4', '47.0', '28.0', '18.8', '6.2', '13.5', '13.9']),
 ('Moldova',
  ['13.0', '11.5', '1.4', '35.4', '44.6', '20.0', '0.0', '12.6', '12.4']),
 ('Latvia',
  ['12.9', '11.1', '1.9', '42.8', '11.1', '40.
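Splitting on ',' works for this file, but it keeps the header row in the data, leaves every number as a string, and would break on a quoted field that contains a comma. A slightly more careful sketch uses Python's csv module. The names `parse_row` and `parse_csv` are hypothetical, and the sketch assumes that every column after the first is numeric and that no field contains a line break.

```python
import csv


def parse_row(line):
    """Parse one CSV data line into (country, [float, ...])."""
    # csv.reader handles quoted fields such as "Korea, South" correctly.
    fields = next(csv.reader([line]))
    return fields[0], [float(value) for value in fields[1:]]


def parse_csv(lines):
    """Drop the header line from an RDD of CSV lines and parse the rest."""
    header = lines.first()  # action: fetch the first line (the column names)
    return lines.filter(lambda line: line != header).map(parse_row)
```

For the RDD read above, `parse_csv(data).first()` would be expected to return `('Estonia', [16.9, 15.8, 1.1, 32.7, 7.4, 50.3, 9.6, 11.5, 11.9])`.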