## 02_01 Reading Raw Files

Python's standard library and third-party packages can read many file types into Python variables. The cell below uses the built-in `open()` function to load a plain-text file into a single string.

In [2]:
import os

#Read the file using the standard Python library
with open(os.path.join(os.getcwd(), "data_science.txt"), 'r') as fh:
    filedata = fh.read()

In [None]:
#Print first 200 characters in the file
print("Data read from file : ", filedata[0:200] )

Data read from file :  Data science is the study of data to extract meaningful insights for business. It is a multidisciplinary approach that combines principles and practices from the fields of mathematics, statistics, art
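
As a variation on the cell above, the sketch below reads a text file with `pathlib` and an explicit encoding, so the result does not depend on the platform default. The temporary file, its name, and its contents are stand-ins created only for this illustration; they are not the course's `data_science.txt`.

```python
import tempfile
from pathlib import Path

# Hypothetical sample text; a stand-in for the real data file
sample_text = "Data science is the study of data.\n\nIt combines several fields."

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "sample.txt"
    sample_path.write_text(sample_text, encoding="utf-8")

    # read_text opens, reads, and closes the file in one call
    filedata_sketch = sample_path.read_text(encoding="utf-8")

# Slicing works on the resulting string just as in the cell above
print("First 12 characters : ", filedata_sketch[:12])
```

Passing `encoding` explicitly is a design choice rather than a requirement; it mainly matters when the file contains non-ASCII characters.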


## 02_02 Reading using NLTK CorpusReader

Read the same text file using an NLTK corpus reader.

NLTK provides several corpus reader classes, one for each kind of data source. Details are available at http://www.nltk.org/howto/corpus.html


In [None]:
#Install nltk first, e.g. run "pip install nltk" from the Anaconda prompt
import nltk
#Download the punkt package, which later commands such as sentence extraction rely on
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:
from nltk.corpus.reader.plaintext import PlaintextCorpusReader

In [4]:
#Read the file into a corpus. The same class can read an entire directory
#when the second argument is a list of file names or a regular expression
corpus=PlaintextCorpusReader(os.getcwd(),"data_science.txt")

#Print raw contents of the corpus
print(corpus.raw())

Data science is the study of data to extract meaningful insights for business. It is a multidisciplinary approach that combines principles and practices from the fields of mathematics, statistics, artificial intelligence, and computer engineering to analyze large amounts of data. This analysis helps data scientists to ask and answer questions like what happened, why it happened, what will happen, and what can be done with the results.

Data science is important because it combines tools, methods, and technology to generate meaning from data. Modern organizations are inundated with data; there is a proliferation of devices that can automatically collect and store information. Online systems and payment portals capture more data in the fields of e-commerce, medicine, finance, and every other aspect of human life. We have text, audio, video, and image data available in vast quantities.  
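
The comment in the cell above notes that the same reader can load a whole directory. A minimal sketch of that, assuming `nltk` is installed: the second argument to `PlaintextCorpusReader` can be a regular expression matched against file names. The temporary directory, the two file names, and their contents are made up for this illustration.

```python
import os
import tempfile
from nltk.corpus.reader.plaintext import PlaintextCorpusReader

# Hypothetical files, created only for this example
sample_files = {
    "first.txt": "Data science studies data.",
    "second.txt": "Text is one kind of data.",
}

with tempfile.TemporaryDirectory() as tmp_dir:
    for name, text in sample_files.items():
        with open(os.path.join(tmp_dir, name), "w", encoding="utf-8") as fh:
            fh.write(text)

    # A regular expression selects every .txt file under the root directory
    multi_corpus = PlaintextCorpusReader(tmp_dir, r".*\.txt")

    # Materialize results while the temporary files still exist
    file_ids = multi_corpus.fileids()
    first_words = list(multi_corpus.words("first.txt"))

print("Files in this corpus : ", file_ids)
print("Words in first.txt : ", first_words)
```

Methods such as `words()` and `raw()` accept a file ID, so each file can still be inspected on its own after the directory is loaded.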


## 02_03 Exploring the Corpus

The corpus reader provides methods to extract words, sentences, and paragraphs from the corpus.

In [10]:
#Extract the file IDs from the corpus
print("Files in this corpus : ", corpus.fileids())

Files in this corpus :  ['data_science.txt']


In [11]:
#Extract paragraphs from the corpus
paragraphs=corpus.paras()
print("\n Total paragraphs in this corpus : ", len(paragraphs))


 Total paragraphs in this corpus :  2


In [12]:
#Extract sentences from the corpus
sentences=corpus.sents()
print("\n Total sentences in this corpus : ", len(sentences))
print("\n The first sentence : ", sentences[0])


 Total sentences in this corpus :  7

 The first sentence :  ['Data', 'science', 'is', 'the', 'study', 'of', 'data', 'to', 'extract', 'meaningful', 'insights', 'for', 'business', '.']


In [13]:
#Extract words from the corpus
print("\n Words in this corpus : ",corpus.words() )


 Words in this corpus :  ['Data', 'science', 'is', 'the', 'study', 'of', 'data', ...]
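
To give a feel for what the reader does when it splits text, here is a rough standard-library approximation. It assumes paragraphs are separated by blank lines and that words are either runs of word characters or runs of punctuation, which mirrors the pattern used by NLTK's `WordPunctTokenizer`. Sentence splitting is left out because NLTK relies on the trained punkt model for it. The sample text and variable names are illustrative only.

```python
import re

# Illustrative sample, not the course data file
sample_text = (
    "Data science is the study of data. It is multidisciplinary.\n"
    "\n"
    "We have text, audio, and video data."
)

# Paragraphs: blocks separated by one or more blank lines
paragraphs_sketch = [p for p in re.split(r"\n\s*\n", sample_text) if p.strip()]

# Words: runs of word characters, or runs of punctuation
words_sketch = re.findall(r"\w+|[^\w\s]+", sample_text)

print("Total paragraphs : ", len(paragraphs_sketch))
print("First words : ", words_sketch[:8])
```

This is only a sketch of the idea; the corpus reader returns lazy views over the file rather than plain lists, which is why printing `corpus.words()` shows a truncated list ending in `...`.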


## 02_04 Analyze the Corpus

NLTK provides functions to compute distributions and aggregate statistics, such as word frequencies, over the data in the corpus.

In [14]:
#Find the frequency distribution of words in the corpus
course_freq_dist=nltk.FreqDist(corpus.words())

In [15]:
#Print most commonly used words
print("Top 10 words in the corpus : ", course_freq_dist.most_common(10))

Top 10 words in the corpus :  [(',', 14), ('and', 9), ('data', 7), ('.', 7), ('of', 6), ('is', 4), ('the', 4), ('to', 4), ('what', 3), ('Data', 2)]


In [17]:
#Find the frequency of a specific word (lookups are case-sensitive)
print("\n Frequency of \"data\" : ",course_freq_dist.get("data"))


 Frequency of "data" :  7
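
The top-10 output above shows that `data` and `Data` are counted separately and that punctuation tokens rank among the most frequent entries. One possible way to handle this, sketched here with the standard library's `collections.Counter` (which `nltk.FreqDist` subclasses), is to lowercase tokens and drop non-alphabetic ones before counting. The token list is the first sentence from the earlier output, used as a small stand-in for `corpus.words()`.

```python
from collections import Counter

# First sentence of the corpus, copied from the earlier output
tokens = ['Data', 'science', 'is', 'the', 'study', 'of', 'data', 'to',
          'extract', 'meaningful', 'insights', 'for', 'business', '.']

# Lowercase every token and keep only purely alphabetic ones
normalized = [t.lower() for t in tokens if t.isalpha()]
normalized_freq = Counter(normalized)

print("Count for 'data' : ", normalized_freq["data"])
print("Top 3 words : ", normalized_freq.most_common(3))
```

Whether to normalize case depends on the task; for some analyses, such as spotting proper nouns, the original capitalization is worth keeping.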
