# Question 3: NLP

Feed the paragraph below into your favourite data analytics tool and answer the following questions:
- What is the probability of the word “data” occurring in each line?
- What is the distribution of distinct word counts across all the lines?
- What is the probability of the word “analytics” occurring after the word “data”?

### Step 1: Import the necessary libraries

In [1]:
from nltk.tokenize import sent_tokenize, word_tokenize,MWETokenizer
from nltk import FreqDist
import nltk
import string
import re
from collections import Counter

### Step 2: Analyze the text and collect the necessary information

In [2]:
para_text = """... As a term, data analytics predominantly refers to an assortment of applications, from basic business
... intelligence (BI), reporting and online analytical processing (OLAP) to various forms of advanced
... analytics. In that sense, it's similar in nature to business analytics, another umbrella term for
... approaches to analyzing data -- with the difference that the latter is oriented to business uses, while
... data analytics has a broader focus. The expansive view of the term isn't universal, though: In some 
... cases, people use data analytics specifically to mean advanced analytics, treating BI as a separate
... category. Data analytics initiatives can help businesses increase revenues, improve operational
... efficiency, optimize marketing campaigns and customer service efforts, respond more quickly to
... emerging market trends and gain a competitive edge over rivals -- all with the ultimate goal of
... boosting business performance. Depending on the particular application, the data that's analyzed
... can consist of either historical records or new information that has been processed for real-time
... analytics uses. In addition, it can come from a mix of internal systems and external data sources. At
... a high level, data analytics methodologies include exploratory data analysis (EDA), which aims to find
... patterns and relationships in data, and confirmatory data analysis (CDA), which applies statistical
... techniques to determine whether hypotheses about a data set are true or false. EDA is often
... compared to detective work, while CDA is akin to the work of a judge or jury during a court trial -- a
... distinction first drawn by statistician John W. Tukey in his 1977 book Exploratory Data Analysis. Data
... analytics can also be separated into quantitative data analysis and qualitative data analysis. The
... former involves analysis of numerical data with quantifiable variables that can be compared or
... measured statistically. The qualitative approach is more interpretive -- it focuses on understanding
... the content of non-numerical data like text, images, audio and video, including common phrases,
... themes and points of view."""


In [3]:
new_text = para_text.replace("... ", "\n")

In [4]:
nlines = new_text.count('\n')
print('There are ', nlines, ' lines detected in the paragraph')

There are  22  lines detected in the paragraph


In [5]:
text = re.findall(r'\w+', new_text.lower())  # raw string avoids the invalid-escape warning for \w
nword = Counter(text)
nletter = Counter(''.join(text))
sumWords = sum(nword.values())
sumLetters = sum(nletter.values()) 
print('There are a total of ', sumWords,' words in the paragraph')
print('There are a total of ', sumLetters,' letters in the paragraph')

There are a total of  320  words in the paragraph
There are a total of  1728  letters in the paragraph


### Step 3: Detecting the distribution of distinct word counts across all the lines

In [6]:
words = new_text.split()
fdist1 = FreqDist(words)
print(fdist1.most_common())

[('data', 14), ('to', 11), ('a', 10), ('of', 10), ('and', 9), ('the', 8), ('analytics', 7), ('can', 5), ('business', 4), ('that', 4), ('--', 4), ('is', 4), ('or', 4), ('analysis', 4), ('In', 3), ('in', 3), ('with', 3), ('The', 3), ('Data', 3), ('from', 2), ('advanced', 2), ('analytics,', 2), ('term', 2), ('for', 2), ('while', 2), ('has', 2), ('more', 2), ('on', 2), ('it', 2), ('which', 2), ('compared', 2), ('be', 2), ('qualitative', 2), ('As', 1), ('term,', 1), ('predominantly', 1), ('refers', 1), ('an', 1), ('assortment', 1), ('applications,', 1), ('basic', 1), ('intelligence', 1), ('(BI),', 1), ('reporting', 1), ('online', 1), ('analytical', 1), ('processing', 1), ('(OLAP)', 1), ('various', 1), ('forms', 1), ('analytics.', 1), ('sense,', 1), ("it's", 1), ('similar', 1), ('nature', 1), ('another', 1), ('umbrella', 1), ('approaches', 1), ('analyzing', 1), ('difference', 1), ('latter', 1), ('oriented', 1), ('uses,', 1), ('broader', 1), ('focus.', 1), ('expansive', 1), ('view', 1), ("isn
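The list above counts whitespace-split tokens over the whole paragraph, so case and punctuation variants such as 'data', 'Data' and 'analytics,' end up as separate keys, and nothing is counted per line. One possible way to get the distribution of distinct word counts across lines is sketched below. It assumes words are case-insensitive `\w+` tokens and that blank lines are ignored; the helper name `distinct_words_per_line` and the sample text are hypothetical and only illustrate the idea.

```python
import re
from collections import Counter

def distinct_words_per_line(text):
    """Return the number of distinct words on each non-empty line."""
    counts = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        tokens = re.findall(r"\w+", line.lower())
        counts.append(len(set(tokens)))
    return counts

# Hypothetical sample; in the notebook this would be new_text
sample = "Data analytics uses data\nto find patterns in data\n\nand data analytics"
per_line = distinct_words_per_line(sample)

# Distribution: how many lines have a given number of distinct words
distribution = Counter(per_line)
print(per_line)
print(sorted(distribution.items()))
```

Calling `distinct_words_per_line(new_text)` in the notebook should give one count per line, and the `Counter` over those counts is the distribution itself.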

### Step 4: Finding the probability of word occurrences

The question asks for the probability of:
- the word “data” occurring in each line
- the word “analytics” occurring after the word “data”

In [7]:
# Case-insensitive substring counts over the whole paragraph
data_freqcy = re.findall('(?i)data', new_text)
data_analytics_freqcy = re.findall('(?i)data analytics', new_text)
analytics_freqcy = re.findall('(?i)analytics', new_text)
print('The number of occurrences of "data" in the paragraph is', len(data_freqcy))
print('The number of occurrences of "data analytics" in the paragraph is', len(data_analytics_freqcy))
print('The number of occurrences of "analytics" in the paragraph is', len(analytics_freqcy))

prob_data_freqcy = len(data_freqcy)/sumWords
# Occurrences divided by the line count: an average per line, not the fraction of lines containing "data"
prob_data_freqcy_line = len(data_freqcy)/nlines
# Divides by the "analytics" count, so this is the share of "analytics" occurrences preceded by "data";
# P("analytics" | "data") would divide by len(data_freqcy) instead
prob_data_analytics_freqcy = len(data_analytics_freqcy)/len(analytics_freqcy)

print('The probability of the word “data” occurring in the paragraph is', prob_data_freqcy)
print('The average number of occurrences of the word “data” per line is', prob_data_freqcy_line)
print('The proportion of “analytics” occurrences that directly follow the word “data” is', prob_data_analytics_freqcy)


The number of occurrences of "data" in the paragraph is 18
The number of occurrences of "data analytics" in the paragraph is 5
The number of occurrences of "analytics" in the paragraph is 10
The probability of the word “data” occurring in the paragraph is 0.05625
The average number of occurrences of the word “data” per line is 0.8181818181818182
The proportion of “analytics” occurrences that directly follow the word “data” is 0.5
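Read as a conditional probability, the chance that “analytics” follows “data” is the number of “data analytics” pairs divided by the number of occurrences of “data”, not of “analytics”. A minimal sketch of that calculation follows. It assumes the whole text is tokenized at once with `\w+`, so a pair split across a line break still counts; `next_word_probability` and the sample text are hypothetical.

```python
import re

def next_word_probability(text, first, second):
    """Estimate P(next word == second | current word == first) from bigram counts."""
    tokens = re.findall(r"\w+", text.lower())
    first_count = tokens.count(first)
    if first_count == 0:
        return 0.0
    # Count adjacent (first, second) pairs across the whole token stream
    pair_count = sum(
        1 for a, b in zip(tokens, tokens[1:]) if a == first and b == second
    )
    return pair_count / first_count

# Hypothetical sample; note the pair split across the line break
sample = "Data analytics helps. Raw data is messy, but data\nanalytics finds patterns in data."
p = next_word_probability(sample, "data", "analytics")
print(p)
```

In the notebook, `next_word_probability(new_text, "data", "analytics")` would apply the same estimate to the paragraph.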


In [8]:
# For each line, report the line number and how many times the exact token 'data' appears
# (case-sensitive and whitespace-split, so 'Data' and 'data,' are not counted).
data = list(filter(None, new_text.splitlines()))
for counter, line in enumerate(data, start=1):
    freqs = nltk.FreqDist(line.split())
    if 'data' in freqs:
        print('In line ', counter, ' there is ', freqs['data'], ' number of the word "data" present')


In line  1  there is  1  number of the word "data" present
In line  4  there is  1  number of the word "data" present
In line  5  there is  1  number of the word "data" present
In line  6  there is  1  number of the word "data" present
In line  10  there is  1  number of the word "data" present
In line  12  there is  1  number of the word "data" present
In line  13  there is  2  number of the word "data" present
In line  14  there is  1  number of the word "data" present
In line  15  there is  1  number of the word "data" present
In line  18  there is  2  number of the word "data" present
In line  19  there is  1  number of the word "data" present
In line  21  there is  1  number of the word "data" present
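The loop above compares whitespace-split tokens with the exact string 'data', so lines where the word appears only as 'Data' or 'data,' are not listed. If the first question is read as "what fraction of lines contain the word at least once", one possible sketch is shown below. It assumes a case-insensitive whole-word match and ignores blank lines; `fraction_of_lines_with` and the sample text are hypothetical.

```python
import re

def fraction_of_lines_with(text, word):
    """Fraction of non-empty lines containing `word` as a whole word, ignoring case."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    hits = sum(1 for ln in lines if pattern.search(ln))
    return hits / len(lines)

# Hypothetical sample; in the notebook this would be new_text
sample = (
    "Data analytics is broad.\n"
    "It covers BI and OLAP.\n"
    "patterns in data, and metadata\n"
    "no match here"
)
frac = fraction_of_lines_with(sample, "data")
print(frac)
```

The word-boundary pattern keeps 'metadata' from counting as a match while still accepting 'Data' and 'data,'.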


### Extra Findings 

In [9]:
# Display the relative frequency (probability) of each word in the paragraph
def tokenize(s):
    return re.findall(r'\w+', s)

def word_freq(s):
    text = tokenize(s.lower())
    c = Counter(text)           # count the words
    d = Counter(''.join(text))  # count all letters
    return (dict(c), dict(d))   # return a tuple of counted words and letters

words, letters = word_freq(new_text) # count and get dicts with counts

sumWords = sum(words.values())
print(sumWords) # sum total words
sumLetters = sum(letters.values())   # sum total letters

# print the probability of 'data' without modifying the counts,
# so the conversion loop below divides each count only once
if 'data' in words:
    print("Probability of '{}': {}".format('data', words['data']/sumWords))

# update the counts to probabilities:
for w in words:
    words[w] = words[w]/sumWords

print(words)


320
Probability of 'data': 0.05625
{'as': 0.00625, 'a': 0.03125, 'term': 0.009375, 'data': 0.05625, 'analytics': 0.03125, 'predominantly': 0.003125, 'refers': 0.003125, 'to': 0.034375, 'an': 0.003125, 'assortment': 0.003125, 'of': 0.03125, 'applications': 0.003125, 'from': 0.00625, 'basic': 0.003125, 'business': 0.0125, 'intelligence': 0.003125, 'bi': 0.00625, 'reporting': 0.003125, 'and': 0.028125, 'online': 0.003125, 'analytical': 0.003125, 'processing': 0.003125, 'olap': 0.003125, 'various': 0.003125, 'forms': 0.003125, 'advanced': 0.00625, 'in': 0.01875, 'that': 0.015625, 'sense': 0.003125, 'it': 0.009375, 's': 0.00625, 'similar': 0.003125, 'nature': 0.003125, 'another': 0.003125, 'umbrella': 0.003125, 'for': 0.00625, 'approaches': 0.003125, 'analyzing': 0.003125, 'with': 0.009375, 'the': 0.034375, 'difference': 0.003125, 'latter': 0.003125, 'is': 0.0125, 'oriented': 0.003125, 'uses': 0.00625, 'while': 0.00625, 'has': 0.00625, 'broader': 0.003125, 'focus': 0.003125, 'ex