In [20]:
import re
import glob
from tqdm.notebook import tqdm
import pandas as pd

In [21]:
with open("data/parsed/paragraph1.5.md", 'r') as f:
    title = f.readline()
    content = f.read()


In [22]:
# The dot is escaped so it matches a literal "." and the number group is not repeated
title_pattern = re.compile(r"^#[\s\*]+(?P<number>\d{1,2}\.\d{1,2})(?P<title>.+)$")
match = title_pattern.match(title)
match.group("number"), match.group("title").strip()

('1.5', 'The Definition of Probability')

In [23]:
elem_pattern = re.compile(
    r"""
    <<(?P<type>\w+)\s+(?P<number>[\d\.]+)>>
    (?P<content>.*?)
    <</(?P=type)\s+(?P=number)>>
    """,
    re.VERBOSE | re.DOTALL | re.MULTILINE,
)
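
A quick sanity check of the element pattern on a made-up snippet may help here. The sample text below is hypothetical and only mimics the `<<Type N>> ... <</Type N>>` tags the pattern expects. The backreferences `(?P=type)` and `(?P=number)` mean a closing tag only counts when it repeats the type and number of the opening tag.

```python
import re

elem_pattern = re.compile(
    r"""
    <<(?P<type>\w+)\s+(?P<number>[\d\.]+)>>
    (?P<content>.*?)
    <</(?P=type)\s+(?P=number)>>
    """,
    re.VERBOSE | re.DOTALL,
)

# Hypothetical sample, not taken from the parsed files.
sample = (
    "<<Axiom 1>>\nFor every event $A$, $\\Pr(A) \\ge 0$.\n<</Axiom 1>>\n"
    "<<Example 1.5.1>>\nSome text.\n<</Example 1.5.1>>\n"
)

found = [
    (m.group("type"), m.group("number"), m.group("content").strip())
    for m in elem_pattern.finditer(sample)
]

# A closing tag with a different number is not accepted as the end of the element.
mismatched = elem_pattern.search("<<Axiom 1>>text<</Axiom 2>>")
```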

In [26]:
element_list = list()
for file in tqdm(glob.glob("data/parsed/*.md")):
    with open(file, 'r') as f:
        title = f.readline()
        content = f.read()
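    # match the heading line; skip files whose first line is not a numbered heading
    title_match = title_pattern.match(title)
    if title_match is None:
        continue
    matches = elem_pattern.finditer(content)

    # save one row per element: (type, number, content, paragraph number)
    for match in matches: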
        element_list.append((
            match.group("type"),
            match.group("number"),
            match.group("content").strip(),
            title_match.group("number"),
        ))

  0%|          | 0/107 [00:00<?, ?it/s]

In [None]:
df["paragraph_number"].str.split(".", expand=True)

Unnamed: 0,0,1
1633,1,12
1630,1,12
1631,1,12
1632,1,12
1634,1,12
...,...,...
1211,9,8
1209,9,8
1216,9,8
1210,9,8


In [33]:
df = pd.DataFrame(element_list, columns=["type", "number", "content", "paragraph_number"])
df[["section", "paragraph"]] = df["paragraph_number"].str.split(".", expand=True).astype(int)
df.sort_values(by=["section", "paragraph"], inplace=True)
df

Unnamed: 0,type,number,content,paragraph_number,section,paragraph
1968,Definition,1.3.1,**Experiment and Event.** An experiment is any...,1.3,1,3
177,Definition,1.4.1,**Sample Space.** The collection of all possib...,1.4,1,4
178,Example,1.4.1,**Rolling a Die.** When a six-sided die is rol...,1.4,1,4
179,Definition,1.4.2,**Containment.** It is said that a set $A$ is ...,1.4,1,4
180,Theorem,1.4.1,"Let $A$, $B$, and $C$ be events. Then $A \subs...",1.4,1,4
...,...,...,...,...,...,...
44,Exercise,12,12. Consider again the situation described in ...,12.7,12,7
45,Exercise,13,13. Suppose that our data comprise a set of pa...,12.7,12,7
46,Exercise,14,14. Use the simulation scheme developed in Exe...,12.7,12,7
47,Exercise,15,"15. In Sec. 7.4, we introduced Bayes estimator...",12.7,12,7
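
Splitting `paragraph_number` into integer columns matters for the sort: compared as strings, "1.10" sorts before "1.2". A small pure-Python illustration with made-up numbers (the helper `numeric_key` is hypothetical and not part of the notebook):

```python
numbers = ["1.2", "1.10", "12.1", "1.9", "2.1"]

# Plain string sort compares character by character.
as_strings = sorted(numbers)


def numeric_key(number):
    """Turn '1.10' into (1, 10) so parts compare as integers."""
    return tuple(int(part) for part in number.split("."))


# Sorting on integer parts gives the order of the book.
as_numbers = sorted(numbers, key=numeric_key)
```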


In [34]:
df.type.unique()

array(['Definition', 'Example', 'Theorem', 'Figure', 'Condition',
       'Exercise', 'Axiom', 'Table', 'Property', 'Corollary', 'Equation',
       'Assumption'], dtype=object)

In [38]:
df[df.type == "Exercise"].groupby(["section", "paragraph"]).aggregate({"type": "count"})

Unnamed: 0_level_0,Unnamed: 1_level_0,type
section,paragraph,Unnamed: 2_level_1
1,4,14
1,5,14
1,6,8
1,7,11
1,8,19
...,...,...
12,3,19
12,4,17
12,5,16
12,6,12


In [107]:
matches[3][2].strip()

'**Rolling a Die.** In Example 1.4.1, for each subset $A$ of $S = \\{1, 2, 3, 4, 5, 6\\}$, let $\\Pr(A)$ be the number of elements of $A$ divided by 6. It is trivial to see that this satisfies the first two axioms. There are only finitely many distinct collections of nonempty disjoint events. It is not difficult to see that Axiom 3 is also satisfied by this example.'

In [103]:
matches = re.findall(elem_pattern, content)
matches

[('Axiom', '1', '\nFor every event $A$, $\\Pr(A) \\ge 0$.\n'),
 ('Axiom', '2', '\n$\\Pr(S) = 1$.\n'),
 ('Axiom',
  '3',
  '\nFor every infinite sequence of disjoint events $A_1, A_2, \\ldots$,\n$$\n\\Pr \\left(\\bigcup_{i=1}^{\\infty} A_{i}\\right)=\\sum_{i=1}^{\\infty} \\operatorname{Pr}\\left(A_{i}\\right) .\n$$\n'),
 ('Example',
  '1.5.1',
  '\n**Rolling a Die.** In Example 1.4.1, for each subset $A$ of $S = \\{1, 2, 3, 4, 5, 6\\}$, let $\\Pr(A)$ be the number of elements of $A$ divided by 6. It is trivial to see that this satisfies the first two axioms. There are only finitely many distinct collections of nonempty disjoint events. It is not difficult to see that Axiom 3 is also satisfied by this example.\n'),
 ('Example',
  '1.5.2',
  '\n**A Loaded Die.** In Example 1.5.1, there are other choices for the probabilities of events. For example, if we believe that the die is loaded, we might believe that some sides have different probabilities of turning up. To be specific, suppose tha

In [104]:
pd.DataFrame(matches, columns=["type", "number", "content"])

Unnamed: 0,type,number,content
0,Axiom,1,"\nFor every event $A$, $\Pr(A) \ge 0$.\n"
1,Axiom,2,\n$\Pr(S) = 1$.\n
2,Axiom,3,\nFor every infinite sequence of disjoint even...
3,Example,1.5.1,"\n**Rolling a Die.** In Example 1.4.1, for eac..."
4,Example,1.5.2,"\n**A Loaded Die.** In Example 1.5.1, there ar..."
5,Definition,1.5.1,"\n**Probability.** A probability measure, or s..."
6,Theorem,1.5.1,\n$\Pr(\emptyset) = 0$.\n**Proof** Consider th...
7,Theorem,1.5.2,\nFor every finite sequence of $n$ disjoint ev...
8,Theorem,1.5.3,"\nFor every event $A$, $\Pr(A^c) = 1 - \Pr(A)$..."
9,Theorem,1.5.4,"\nIf $A \subset B$, then $\Pr(A) \le \Pr(B)$.\..."


In [None]:
with open("data/parsed/paragraph1.5.md", 'r') as f:
    title = f.readline()
    content = f.read()

paragraph_pattern = re.compile(
    r"""
    ^\#{1,2}\s\**
    (?P<number>\d{1,2}\.\d{1,2})
    \s
    (?P<title>.+?)$
    (?P<contents>(?:.*?)?)
    (?P<exercises>(?:(?:^\#{3}\sExercises|(?<=Supplementary\sExercises\s\s)^\d\.).*?)?)
    (?=^\#{1,2}\s\**\d{1,2}\.\d{1,2}|\Z)
    """,
    re.VERBOSE | re.DOTALL | re.MULTILINE,
)

In [85]:
with open("data/section7.md", "r") as f:
    content = f.read()

In [86]:
content

'# ESTIMATION\n## 7.1 Statistical Inference\nRecall our various clinical trial examples. What would we say is the probability that a future patient will respond successfully to treatment after we observe the results from a collection of other patients? This is the kind of question that statistical inference is designed to address. In general, statistical inference consists of making probabilistic statements about unknown quantities. For example, we can compute means, variances, quantiles, probabilities, and some other quantities yet to be introduced concerning unobserved random variables and unknown parameters of distributions. Our goal will be to say what we have learned about the unknown quantities after observing some data that we believe contain relevant information. Here are some other examples of questions that statistical inference can try to answer. What can we say about whether a machine is functioning properly after we observe some of its output? In a civil lawsuit, what can 

In [87]:
import re

# Match digits ONLY if preceded by a dollar sign
text = "Price: $100, Quantity: 50, Cost: $200"
pattern = r"(?<=\$)\d+"

matches = re.findall(pattern, text)
print(matches)  # ['100', '200'] - only digits after $

['100', '200']
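
The same lookbehind idea is what the exercises branch of `paragraph_pattern` relies on. The sketch below, on a made-up string, combines the lookbehind with `re.MULTILINE` so that `^` can match at the start of a line. It assumes the heading is followed by exactly two whitespace characters (here a blank line), because Python lookbehinds must have a fixed width.

```python
import re

# Same lookbehind as in paragraph_pattern, written without re.VERBOSE.
first_item = re.compile(r"(?<=Supplementary\sExercises\s\s)^\d\.", re.MULTILINE)

# Made-up text: a heading, a blank line, then numbered exercises.
sample = "## 1.12 Supplementary Exercises\n\n1. First\n2. Second\n"

# Only the item directly after the heading satisfies the lookbehind.
hits = [(m.start(), m.group()) for m in first_item.finditer(sample)]
```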


In [88]:
paragraph_pattern = re.compile(
    r"""
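    ^\#{1,2}\s\**                          # heading marker, optional bold asterisks
    (?P<number>\d{1,2}\.\d{1,2})           # paragraph number, e.g. 7.1
    \s
    (?P<title>.+?)$                        # title: the rest of the heading line
    (?P<contents>(?:.*?)?)                 # body text, matched lazily
    (?P<exercises>(?:(?:^\#{3}\sExercises|(?<=Supplementary\sExercises\s\s)^\d\.).*?)?)  # optional exercises block
    (?=^\#{1,2}\s\**\d{1,2}\.\d{1,2}|\Z)   # stop before the next numbered heading or at the end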
    """,
    re.VERBOSE | re.DOTALL | re.MULTILINE,
)

In [89]:
matches = paragraph_pattern.findall(content)
matches

[('7.1',
  'Statistical Inference',
  '\nRecall our various clinical trial examples. What would we say is the probability that a future patient will respond successfully to treatment after we observe the results from a collection of other patients? This is the kind of question that statistical inference is designed to address. In general, statistical inference consists of making probabilistic statements about unknown quantities. For example, we can compute means, variances, quantiles, probabilities, and some other quantities yet to be introduced concerning unobserved random variables and unknown parameters of distributions. Our goal will be to say what we have learned about the unknown quantities after observing some data that we believe contain relevant information. Here are some other examples of questions that statistical inference can try to answer. What can we say about whether a machine is functioning properly after we observe some of its output? In a civil lawsuit, what can we s

In [90]:
import pandas as pd
pd.DataFrame(matches, columns=["number", "title", "contents", "exercises"])

Unnamed: 0,number,title,contents,exercises
0,7.1,Statistical Inference,\nRecall our various clinical trial examples. ...,### Exercises\n1. Identify the components of t...
1,7.2,Prior and Posterior Distributions,\nThe distribution of a parameter before obser...,### Exercises\n1. Consider again the situation...
2,7.3,Conjugate Prior Distributions,\nFor each of the most popular statistical mod...,### Exercises\n1. Consider again the situation...
3,7.4,Bayes Estimators,\nAn estimator of a parameter is some function...,"### Exercises\n1. In a clinical trial, let the..."
4,7.5,Maximum Likelihood Estimators,\nMaximum likelihood estimation is a method fo...,"### Exercises\n1. Let $x_1, \dots, x_n$ be dis..."
5,7.6,Properties of Maximum Likelihood Estimators,"\nIn this section, we explore several properti...","### Exercises\n1. Suppose that $X_1, \dots, X_..."


In [91]:
a = [1, 2, 3]
b = [4, 5, 6]
a.extend(b)
a

[1, 2, 3, 4, 5, 6]

In [26]:
matches = []

for i in range(1, 13):
    with open(f"data/section{i}.md", "r") as f:
        content = f.read()
    matches.extend(paragraph_pattern.findall(content))

In [21]:
with open("data/section2.md", "r") as f:
    content = f.read()

match = paragraph_pattern.findall(content)
match

[]
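
An empty result like this may mean that the headings in the file do not have the shape the pattern expects. One way to investigate, sketched here with a hypothetical helper `heading_lines` and a made-up sample string, is to list the lines that start with `#` and compare them with the pattern by eye.

```python
def heading_lines(text, limit=10):
    """Return up to `limit` (line_number, line) pairs for lines starting with '#'."""
    found = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        if line.startswith("#"):
            found.append((line_number, line))
            if len(found) == limit:
                break
    return found


# Made-up text; in the notebook this would be the `content` read from the file.
sample = "# TITLE\nintro\n### 2.1 A Heading\nbody\n### Exercises\n"
headings = heading_lines(sample)
```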

In [27]:
df = pd.DataFrame(matches, columns=["number", "title", "contents", "exercises"])
df

Unnamed: 0,number,title,contents,exercises
0,1.1,The History of Probability,\n\nThe use of probability to measure uncertai...,
1,1.2,Interpretations of Probability,\n\nThis section describes three common operat...,
2,1.3,Experiments and Events,\n\nProbability will be the way that we quanti...,
3,1.4,Set Theory,\n\nThis section develops the formal mathemati...,### Exercises\n\n1. Suppose that $A \subset B...
4,1.5,The Definition of Probability,\n\nWe begin with the mathematical definition ...,### Exercises\n\n1. One ball is to be selecte...
5,1.6,Finite Sample Spaces,\n\nThe simplest experiments in which to deter...,### Exercises\n\n1. If two balanced dice are ...
6,1.7,Counting Methods,"\n\nIn simple sample spaces, one way to calcul...",### Exercises\n\n1. Each year starts on one o...
7,1.8,Combinatorial Methods,\n\nMany problems of counting the number of ou...,### Exercises\n\n1. Two pollsters will canvas...
8,1.9,Multinomial Coefficients,\n\nWe learn how to count the number of ways t...,### Exercises\n\n1. Three pollsters will canv...
9,1.10,The Probability of a Union of Events,\n\nThe axioms of probability tell us directly...,### Exercises\n\n1. Three players are each de...


In [8]:
matches[5][3]

'### Exercises\n\n1.  If two balanced dice are rolled, what is the probability that the sum of the two numbers that appear will be odd?\n2.  If two balanced dice are rolled, what is the probability that the sum of the two numbers that appear will be even?\n3.  If two balanced dice are rolled, what is the probability that the difference between the two numbers that appear will be less than 3?\n4.  A school contains students in grades 1, 2, 3, 4, 5, and 6. Grades 2, 3, 4, 5, and 6 all contain the same number of students, but there are twice this number in grade 1. If a student is selected at random from a list of all the students in the school, what is the probability that she will be in grade 3?\n5.  For the conditions of Exercise 4, what is the probability that the selected student will be in an odd-numbered grade?\n6.  If three fair coins are tossed, what is the probability that all three faces will be the same?\n7.  Consider the setup of Example 1.6.4 on page 23. This time, assume th

In [9]:
for number, title, contents, exercises in matches:
    print(f"{number}: {title}. Exercises: {exercises}, Contents:{contents[:30]} ... {contents[-30:]}")

1.1: The History of Probability. Exercises: , Contents:

The use of probability to me ... nhall, and Schaeffer (2008).


1.2: Interpretations of Probability. Exercises: , Contents:

This section describes three ... f effective experimentation.


1.3: Experiments and Events. Exercises: , Contents:

Probability will be the way  ... tical theory of probability.


1.4: Set Theory. Exercises: ### Exercises

1.  Suppose that $A \subset B$. Show that $B^c \subset A^c$.
2.  Prove the distributive properties in Theorem 1.4.10.
3.  Prove De Morgan's laws (Theorem 1.4.9).
4.  Prove Theorem 1.4.11.
5.  For every collection of events $A_i$ ($i \in I$), show that
    $$\left(\bigcup_{i \in I} A_i\right)^c = \bigcap_{i \in I} A_i^c \quad \text{and} \quad \left(\bigcap_{i \in I} A_i\right)^c = \bigcup_{i \in I} A_i^c.$$
6.  Suppose that one card is to be selected from a deck of 20 cards that contains 10 red cards numbered from 1 to 10 and 10 blue cards numbered from 1 to 10. Let $A$ be the event that 

In [None]:
paragraph_pattern = re.compile(
    r"""
    ^\#\s(?P<number>\d{1,2}\.\d{1,2})
    \s
    (?P<title>.+?)$
    (?P<contents>(?:.*?)?)
    (?P<exercises>(?:(?:^\#{3}\sExercises).*?)?)
    (?=^\#\s\d{1,2}\.\d{1,2}|\Z)
    """,
    re.VERBOSE | re.DOTALL | re.MULTILINE,
)
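
Because these patterns are compiled with `re.VERBOSE`, unescaped whitespace inside the pattern is ignored, so the space between a heading's `#` and its number has to be written as `\s` (or an escaped space). A minimal sketch on a made-up heading line:

```python
import re

heading = "# 1.4 Set Theory"

# The literal space after \# is dropped in verbose mode, so this expects "#1.4".
literal_space = re.compile(r"^\# (?P<number>\d{1,2}\.\d{1,2})", re.VERBOSE)

# \s survives verbose mode and matches the space in the heading.
escaped_space = re.compile(r"^\#\s(?P<number>\d{1,2}\.\d{1,2})", re.VERBOSE)

no_match = literal_space.match(heading)
ok_match = escaped_space.match(heading)
number = ok_match.group("number")
```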