In [2]:
import re
from collections import Counter
import numpy as np

In [4]:
paragraph = """
As a term, data analytics predominantly refers to an assortment of applications, from basic business 
intelligence (BI), reporting and online analytical processing (OLAP) to various forms of advanced 
analytics. In that sense, it's similar in nature to business analytics, another umbrella term for 
approaches to analyzing data -- with the difference that the latter is oriented to business uses, while 
data analytics has a broader focus. The expansive view of the term isn't universal, though: In some 
cases, people use data analytics specifically to mean advanced analytics, treating BI as a separate 
category.  Data analytics initiatives can help businesses increase revenues, improve operational 
efficiency, optimize marketing campaigns and customer service efforts, respond more quickly to 
emerging market trends and gain a competitive edge over rivals -- all with the ultimate goal of 
boosting business performance. Depending on the particular application, the data that's analyzed 
can consist of either historical records or new information that has been processed for real-time 
analytics uses. In addition, it can come from a mix of internal systems and external data sources.  At 
a high level, data analytics methodologies include exploratory data analysis (EDA), which aims to find 
patterns and relationships in data, and confirmatory data analysis (CDA), which applies statistical 
techniques to determine whether hypotheses about a data set are true or false. EDA is often 
compared to detective work, while CDA is akin to the work of a judge or jury during a court trial -- a 
distinction first drawn by statistician John W. Tukey in his 1977 book Exploratory Data Analysis.  Data 
analytics can also be separated into quantitative data analysis and qualitative data analysis. The 
former involves analysis of numerical data with quantifiable variables that can be compared or 
measured statistically. The qualitative approach is more interpretive -- it focuses on understanding 
the content of non-numerical data like text, images, audio and video, including common phrases, 
themes and points of view.
"""


In [6]:
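# Clean the paragraph by removing punctuation and converting to lowercase.
# Apostrophes and hyphens are stripped too, so "it's" becomes "its" and
# "real-time" becomes "realtime".
clean_paragraph = re.sub(r'[^\w\s]', '', paragraph.lower())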

# Split the paragraph into lines
lines = clean_paragraph.split('\n')

# Calculate the probability of the word "data" occurring in each line
word_probabilities = []
for line in lines:
    words = line.split()
    word_count = len(words)
    if word_count == 0:
        continue  # Skip lines with no words
    data_count = sum(1 for word in words if word == 'data')
    probability = data_count / word_count
    word_probabilities.append(probability)

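# Calculate the distribution of distinct word counts across all the lines.
# Note: unlike the loop above, this does not skip blank lines, so the empty
# first and last lines of the string are tallied as a distinct word count of 0.
distinct_word_counts = [len(set(line.split())) for line in lines]
word_count_distribution = Counter(distinct_word_counts)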

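# Calculate the probability of the word "analytics" occurring after the word "data".
# Pairs are only counted within a line: a "data" that ends a line has no
# successor on that line, so it is excluded from both counts.
data_analytics_count = 0
data_count = 0  # reused name: now counts "data" followed by another word on the same line
for line in lines:
    words = line.split()
    for i in range(len(words) - 1):
        if words[i] == 'data':
            data_count += 1
            if words[i+1] == 'analytics':
                data_analytics_count += 1
# Guard against division by zero in case "data" never precedes another word
probability_analytics_after_data = (
    data_analytics_count / data_count if data_count else 0.0
)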

# Print the results
print("Probability of the word 'data' occurring in each line:")
for i, probability in enumerate(word_probabilities):
    print(f"Line {i+1}: {probability}")

print("\nDistribution of distinct word counts across all the lines:")
for count, frequency in word_count_distribution.items():
    print(f"Distinct word count: {count}, Frequency: {frequency}")

print("\nProbability of the word 'analytics' occurring after the word 'data':")
print(probability_analytics_after_data)
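
Note: the loop above only looks at word pairs inside a single line, so a "data" at the end of one line followed by "analytics" at the start of the next is not counted. The line breaks in the paragraph are just text wrapping, so it may be more natural to count pairs over the whole word sequence. The following is a minimal sketch of that variant, not a replacement for the cell above. The names `next_word_probability` and `sample` are hypothetical and introduced only for illustration; the cleaning regex is the same one used in the cell.

```python
import re


def next_word_probability(text, first, second):
    """Estimate how often `second` directly follows `first` in `text`.

    Pairs are counted over the whole token stream, so line breaks
    between the two words do not matter.
    """
    # Same cleaning as in the cell above: lowercase, then drop punctuation
    tokens = re.sub(r'[^\w\s]', '', text.lower()).split()

    first_total = 0  # occurrences of `first` that have a following word
    pair_total = 0   # occurrences of `first` immediately followed by `second`
    for current_word, next_word in zip(tokens, tokens[1:]):
        if current_word == first:
            first_total += 1
            if next_word == second:
                pair_total += 1

    # Return None rather than dividing by zero when `first` never
    # appears with a following word
    if first_total == 0:
        return None
    return pair_total / first_total


# Hypothetical sample text in which one "data analytics" pair spans a line break
sample = "Data\nanalytics is broad. Data analysis is narrower; data analytics covers both."
print(next_word_probability(sample, "data", "analytics"))
```

Calling `next_word_probability(paragraph, 'data', 'analytics')` would apply the same idea to the paragraph defined earlier. The result can differ from the per-line figure, because pairs that straddle a line break are then included.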

Probability of the word 'data' occurring in each line:
Line 1: 0.06666666666666667
Line 2: 0.0
Line 3: 0.0
Line 4: 0.0625
Line 5: 0.058823529411764705
Line 6: 0.06666666666666667
Line 7: 0.09090909090909091
Line 8: 0.0
Line 9: 0.0
Line 10: 0.08333333333333333
Line 11: 0.0
Line 12: 0.05555555555555555
Line 13: 0.13333333333333333
Line 14: 0.15384615384615385
Line 15: 0.0625
Line 16: 0.0
Line 17: 0.125
Line 18: 0.14285714285714285
Line 19: 0.07142857142857142
Line 20: 0.0
Line 21: 0.07142857142857142
Line 22: 0.0

Distribution of distinct word counts across all the lines:
Distinct word count: 0, Frequency: 2
Distinct word count: 15, Frequency: 3
Distinct word count: 13, Frequency: 2
Distinct word count: 14, Frequency: 5
Distinct word count: 16, Frequency: 3
Distinct word count: 11, Frequency: 3
Distinct word count: 12, Frequency: 3
Distinct word count: 18, Frequency: 1
Distinct word count: 17, Frequency: 1
Distinct word count: 5, Frequency: 1

Probability of the word 'analytics' occurrin