# Reading raw files

Python offers both standard-library and third-party modules for reading many types of files into Python variables.

In [6]:
import os

# Read the file using standard Python libraries
file_path = os.path.join(os.getcwd(), "datasets", "Spark-Course-Description.txt")
with open(file_path, 'r') as fh:
    filedata = fh.read()

# Print the first 200 characters of the file
print("Data read from file : ", filedata[0:200])

Data read from file :  In order to construct data pipelines and networks that stream, process, and store data, data engineers and data-science DevOps specialists must understand how to combine multiple big data technologies
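
As an alternative to building the path by hand, the same read can be sketched with pathlib and an explicit encoding. This is a minimal sketch, not the course's own code: the file name sample-description.txt and its contents are hypothetical stand-ins, written to a temporary directory so the example runs without the datasets folder.

```python
import tempfile
from pathlib import Path

# Hypothetical stand-in text; the real notebook reads
# datasets/Spark-Course-Description.txt instead.
sample_text = "Apache Spark works with other big data technologies. " * 10

with tempfile.TemporaryDirectory() as tmp_dir:
    # Path objects join segments with "/", which avoids separator mistakes
    sample_path = Path(tmp_dir) / "sample-description.txt"
    sample_path.write_text(sample_text, encoding="utf-8")

    # read_text opens, reads, and closes the file in one call
    filedata = sample_path.read_text(encoding="utf-8")

# Print the first 50 characters of the file
print("Data read from file : ", filedata[:50])
```

Stating the encoding explicitly keeps the result the same across platforms, since the default used by open() depends on the system locale.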


# Reading using NLTK Corpus reader

Read the same text file using an NLTK corpus reader.

NLTK provides several CorpusReader classes, each suited to a different type of data source. PlaintextCorpusReader, used here, handles plain-text files.


In [11]:
import nltk
nltk.download('punkt')

from nltk.corpus.reader.plaintext import PlaintextCorpusReader

# Read the file into a corpus. Passing a file-name pattern instead of a single name reads an entire directory
corpus = PlaintextCorpusReader(os.getcwd(),"datasets/Spark-Course-Description.txt")

#Print raw contents of the corpus
print(corpus.raw())

In order to construct data pipelines and networks that stream, process, and store data, data engineers and data-science DevOps specialists must understand how to combine multiple big data technologies. In this course, discover how to build big data pipelines around Apache Spark. Join Kumaran Ponnambalam as he takes you through how to make Apache Spark work with other big data technologies. He covers the basics of Apache Kafka Connect and how to integrate it with Spark for real-time streaming. In addition, he demonstrates how to use the various technologies to construct an end-to-end project that solves a real-world business problem.


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\phumlani\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
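
The code comment above mentions that the same reader can load an entire directory. A minimal sketch of that, assuming the second argument is given as a regular expression that matches file names: the two files course-a.txt and course-b.txt and their contents are hypothetical, created in a temporary directory so the example is self-contained. Only words() is used here, which does not depend on the punkt download.

```python
import tempfile
from pathlib import Path

from nltk.corpus.reader.plaintext import PlaintextCorpusReader

with tempfile.TemporaryDirectory() as tmp_dir:
    # Two hypothetical files standing in for a datasets directory
    Path(tmp_dir, "course-a.txt").write_text("Spark streams data.", encoding="utf-8")
    Path(tmp_dir, "course-b.txt").write_text("Kafka feeds Spark.", encoding="utf-8")

    # The second argument is a regular expression matched against file names
    dir_corpus = PlaintextCorpusReader(tmp_dir, r".*\.txt")

    # List the matched files, then tokenize each one separately
    file_ids = dir_corpus.fileids()
    words_by_file = {fid: list(dir_corpus.words(fid)) for fid in file_ids}

print("Files in this corpus : ", file_ids)
print("Words per file : ", words_by_file)
```

Passing a file ID to words() restricts the result to that file; calling it with no argument returns the words of every file in the corpus.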


# Exploring the Corpus

The corpus reader provides methods to extract words, sentences, and paragraphs from the corpus.

In [12]:
# Extract the file IDs from the corpus
print("Files in this corpus : ", corpus.fileids())

#Extract paragraphs from the corpus
paragraphs = corpus.paras()
print("\n Total paragraphs in this corpus : ", len(paragraphs))

#Extract sentences from the corpus
sentences = corpus.sents()
print("\n Total sentences in this corpus : ", len(sentences))
print("\n The first sentence : ", sentences[0])

#Extract words from the corpus
print("\n Words in this corpus : ", corpus.words())



Files in this corpus :  ['datasets/Spark-Course-Description.txt']

 Total paragraphs in this corpus :  1

 Total sentences in this corpus :  5

 The first sentence :  ['In', 'order', 'to', 'construct', 'data', 'pipelines', 'and', 'networks', 'that', 'stream', ',', 'process', ',', 'and', 'store', 'data', ',', 'data', 'engineers', 'and', 'data', '-', 'science', 'DevOps', 'specialists', 'must', 'understand', 'how', 'to', 'combine', 'multiple', 'big', 'data', 'technologies', '.']

 Words in this corpus :  ['In', 'order', 'to', 'construct', 'data', 'pipelines', ...]


# Analyze the Corpus

The NLTK library provides a number of functions for analyzing distributions and aggregates of the data in the corpus.

In [13]:
#Find the frequency distribution of words in the corpus
course_freq_dist = nltk.FreqDist(corpus.words())

#Print most commonly used words
print("Top 10 words in the corpus : ", course_freq_dist.most_common(10))

# Find the frequency of a specific word
print("\n Distribution for \"Spark\" : ",course_freq_dist.get("Spark"))

Top 10 words in the corpus :  [('to', 8), ('data', 7), (',', 5), ('-', 5), ('how', 5), ('.', 5), ('and', 4), ('In', 3), ('big', 3), ('technologies', 3)]

 Distribution for "Spark" :  3
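
The top-10 list above includes punctuation such as ',' and '-', and it counts 'In' separately from any lowercase 'in'. One possible way to get a cleaner distribution is to filter and lowercase the tokens before counting. This is a sketch under assumptions: the token list below is a hypothetical sample shaped like the output of corpus.words(), not the actual corpus, and keeping only alphabetic tokens is one of several reasonable filters.

```python
import nltk

# Hypothetical token list shaped like the output of corpus.words()
tokens = ["In", "order", "to", "build", "data", "pipelines", ",", "data",
          "engineers", "combine", "big", "data", "tools", "in", "Spark", "."]

# Keep alphabetic tokens only and lowercase them so "In" and "in" merge
clean_tokens = [tok.lower() for tok in tokens if tok.isalpha()]

# FreqDist counts how often each remaining token occurs
clean_freq_dist = nltk.FreqDist(clean_tokens)

print("Top 2 words : ", clean_freq_dist.most_common(2))
print("Frequency of \"spark\" : ", clean_freq_dist["spark"])
```

Because the tokens are lowercased, lookups must use the lowercase form as well: "spark" rather than "Spark".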
