In [0]:
from pyspark import SparkContext
sc = SparkContext("local", "First App")

### File I/O in PySpark

File input/output (I/O) operations are an integral part of most software, and of data work in particular.
As a data scientist, you have to deal with many types of files, including text files, comma-separated
values (CSV) files, JavaScript Object Notation (JSON) files, and many more.

**Read a Simple Text File**

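To read a simple text file, you can use two functions: **textFile()** and **wholeTextFiles()**.
Both are defined on our SparkContext object.
The textFile() method reads a text file and returns an RDD with one element per line, while wholeTextFiles() returns an RDD of (file path, file content) pairs.
The optional second argument of textFile() sets the minimum number of partitions.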

**Example:**

In [5]:
from google.colab import files
uploaded = files.upload()

Saving test.txt to test.txt


In [9]:
testData = sc.textFile('/content/test.txt', 2)

testDataList = testData.collect()

testDataList[0:4]  # show the first four lines

['scikit-image is a collection of algorithms for image processing.',
 'It includes algorithms for ',
 'segmentation, geometric transformations',
 'color space manipulation, analysis, filtering, morphology,']

**Counting the Number of Lines in a File**

We can count the number of lines in our file by using the count() function, which returns the number of elements in the RDD.

In [11]:
testData.count()

5

**Counting the Number of Characters on Each Line**

To calculate the total number of characters in our file, we can calculate the number of
characters in each line and then sum them. To get the number of characters in
each line, we can map the len() function over the RDD.

In [12]:
lineLength = testData.map(lambda x : len(x))

lineLength.collect()

[64, 27, 39, 58, 27]

In [13]:
# total characters sum of all lines
totalCharacters = lineLength.sum()

totalCharacters   

215

### Write Data to a Text File

We can save an RDD as a text file by using the **saveAsTextFile()** function. Unlike
textFile(), which is defined on SparkContext, this method is defined on the RDD itself.
You have to provide the output directory, and that directory must not already exist.
A file name is not required, because Spark writes one part file per partition into the directory.

**Example:**


In [0]:
lineLength.saveAsTextFile('/home/pysparkbook/savedRDD')

### Read a Directory

**Reading a Directory by Using textFile()**

To read all files in a directory, we pass the absolute path of the directory to textFile(). The resulting RDD contains the lines of every file in that directory.

In [0]:
# read the content of all files in the directory mdir

readDirectory = sc.textFile('/home/pysparkbook/pysparkData/mdir',4)