# Data Extraction from PDF/LaTeX Files

In [4]:
import pandas as pd
import numpy as np
import regex as re

In [29]:
with open('data/Book3-latex/page_387.tex') as f:
    lines = f.readlines()
lines

['\n',
 '\n',
 '#### Bilinear Forms\n',
 '\n',
 'The reader should find it easy to verify that the matrix of \\(f\\) in the latter ordered basis has the block form\n',
 '\n',
 '\\[\\left[\\begin{array}{cc}0&J\\\\ -J&0\\end{array}\\right]\\]\n',
 '\n',
 'where \\(J\\) is the \\(k\\bigtimes k\\) matrix\n',
 '\n',
 '\\[\\left[\\begin{array}{cccc}0&\\cdots&0&1\\\\ 0&\\cdots&1&0\\\\ \\vdots&\\cdots&\\vdots&\\vdots\\\\ 1&\\cdots&0&0\\end{array}\\right].\\]\n',
 '\n',
 '#### Exercises\n',
 '\n',
 '1. Let \\(V\\) be a vector space over a field \\(F\\). Show that the set of all skew-symmetric bilinear forms on \\(V\\) is a subspace of \\(L(V,\\,V,\\,F)\\).\n',
 '2. Find all skew-symmetric bilinear forms on \\(R^{3}\\).\n',
 '3. Find a basis for the space of all skew-symmetric bilinear forms on \\(R^{n}\\).\n',
 '4. Let \\(f\\) be a symmetric bilinear form on \\(C^{n}\\) and \\(g\\) a skew-symmetric bilinear form on \\(C^{n}\\). Suppose \\(f+g=0\\). Show that \\(f=g=0\\).\n',
 '5. Let \\(V\\) be

In [64]:
lines[14]

'### Exercises\n'

In [71]:
re.match(r'.*Exercises\n', lines[14])

<regex.Match object; span=(0, 14), match='### Exercises\n'>
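
To see which heading styles each pattern accepts, here is a small check comparing the loose pattern from the cell above with the stricter one used in the extraction loop. It uses the standard-library `re` module instead of `regex` (an assumption made so the snippet has no third-party dependency), and the sample headings are illustrative, not read from the book files.

```python
import re  # stdlib stand-in for `regex` (assumption, for portability)

LOOSE = re.compile(r'.*Exercises\n')         # pattern from the quick check
STRICT = re.compile(r'.+\**Exercises\**\n')  # pattern used in the extraction loop

# Illustrative headings: markdown heading, bold heading, bare heading
samples = ['#### Exercises\n', '**Exercises**\n', 'Exercises\n']

# Map each heading to (loose matches?, strict matches?)
results = {s: (bool(LOOSE.match(s)), bool(STRICT.match(s))) for s in samples}

for heading, (loose, strict) in results.items():
    print(f'{heading!r:22} loose={loose!s:5} strict={strict}')
```

The bare `Exercises` heading matches only the loose pattern, because `.+` requires at least one character before the word. The bold heading matches only the strict one, because the loose pattern expects the newline directly after `Exercises`.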

In [30]:
# find all the questions

questions = []
for i in range(len(lines)):
    if re.match(r'.+\**Exercises\**\n', lines[i]):
        j = i + 1
        question = ""
        while j < len(lines) and not re.match(r'\d+\.\d.+', lines[j]):
            if re.match(r'^\d+\.\s.+', lines[j]) or re.match(r'^\*\*\d+\.\*\*\s.+', lines[j]):
                questions.append(question)
                question = ""
            question += lines[j]
            j += 1
        questions.append(question)
        break
    
# remove all the empty strings from the list
questions = list(filter(None, questions))
questions

['\n',
 '1. Let \\(V\\) be a vector space over a field \\(F\\). Show that the set of all skew-symmetric bilinear forms on \\(V\\) is a subspace of \\(L(V,\\,V,\\,F)\\).\n',
 '2. Find all skew-symmetric bilinear forms on \\(R^{3}\\).\n',
 '3. Find a basis for the space of all skew-symmetric bilinear forms on \\(R^{n}\\).\n',
 '4. Let \\(f\\) be a symmetric bilinear form on \\(C^{n}\\) and \\(g\\) a skew-symmetric bilinear form on \\(C^{n}\\). Suppose \\(f+g=0\\). Show that \\(f=g=0\\).\n',
 '5. Let \\(V\\) be an \\(n\\)-dimensional vector space over a subfield \\(F\\) of \\(C\\). Prove the following. 1. The equation \\((P\\!f)(\\alpha,\\beta)=\\frac{1}{2}f(\\alpha,\\beta)-\\frac{1}{2}f(\\beta,\\alpha)\\) defines a linear operator \\(P\\) on \\(L(V,\\,V,\\,F)\\). 2. \\(P^{\\intercal}=P_{i}\\), i.e., \\(P\\) is a projection. 3. rank \\(P=\\frac{n(n-1)}{2}\\); nullity \\(P=\\frac{n(n+1)}{2}\\). 4. If \\(U\\) is a linear operator on \\(V\\), the equation \\((U\\!f)(\\alpha,\\beta)=f(U\\alph

In [31]:
extract_questions(range(387,389))

['\n',
 'Let \\(V\\) be a vector space over a field \\(F\\). Show that the set of all skew-symmetric bilinear forms on \\(V\\) is a subspace of \\(L(V,\\,V,\\,F)\\).\n',
 'Find all skew-symmetric bilinear forms on \\(R^{3}\\).\n',
 'Find a basis for the space of all skew-symmetric bilinear forms on \\(R^{n}\\).\n',
 'Let \\(f\\) be a symmetric bilinear form on \\(C^{n}\\) and \\(g\\) a skew-symmetric bilinear form on \\(C^{n}\\). Suppose \\(f+g=0\\). Show that \\(f=g=0\\).\n',
 'Let \\(V\\) be an \\(n\\)-dimensional vector space over a subfield \\(F\\) of \\(C\\). Prove the following. 1. The equation \\((P\\!f)(\\alpha,\\beta)=\\frac{1}{2}f(\\alpha,\\beta)-\\frac{1}{2}f(\\beta,\\alpha)\\) defines a linear operator \\(P\\) on \\(L(V,\\,V,\\,F)\\). 2. \\(P^{\\intercal}=P_{i}\\), i.e., \\(P\\) is a projection. 3. rank \\(P=\\frac{n(n-1)}{2}\\); nullity \\(P=\\frac{n(n+1)}{2}\\). 4. If \\(U\\) is a linear operator on \\(V\\), the equation \\((U\\!f)(\\alpha,\\beta)=f(U\\alpha,\\,U\\beta)\\

## This could be done while reading the file as well.

Reading the file line by line avoids storing huge amounts of data in a list.

In [4]:
questions = []

with open('data/book3/page_14.tex') as f:
    line = f.readline()
    while line:
        if line == 'Exercises\n':
            line = f.readline()
            question = ""
            while line and not re.match(r'\d+\.\d.+', line):
                if re.match(r'^\d+\.\s.+', line):
                    questions.append(question)
                    question = ""
                question += line
                line = f.readline()
            questions.append(question)
        line = f.readline()

questions = list(filter(None, questions))
questions

['1. Verify that the set of complex numbers described in Example 4 is a subfield of $C$.\n',
 '2. Let $F$ be the field of complex numbers. Are the following two systems of linear equations equivalent? If so, express each equation in each system as a linear combination of the equations in the other system.\n$$\n\\begin{array}{rlrl}\nx_1-x_2 & =0 & 3 x_1+x_2 & =0 \\\\\n2 x_1+x_2 & =0 & x_1+x_2 & =0\n\\end{array}\n$$\n',
 '3. Test the following systems of equations as in Exercise 2.\n$$\n\\begin{aligned}\n-x_1+x_2+4 x_3 & =0 & x_1 & -x_3=0 \\\\\nx_1+3 x_2+8 x_3 & =0 & & x_2+3 x_3=0 \\\\\n{ }_2^1 x_1+x_2+\\frac{5}{2} x_3 & =0 & &\n\\end{aligned}\n$$\n',
 '4. Test the following systems as in Exercise 2.\n$$\n\\begin{array}{rlr}\n2 x_1+(-1+i) x_2+x_4 & =0 & \\left(1+\\frac{i}{2}\\right) x_1+8 x_2-i x_3-x_4=0 \\\\\n3 x_2-2 i x_3+5 x_4 & =0 & \\frac{2}{3} x_1-\\frac{1}{2} x_2+x_3+7 x_4=0\n\\end{array}\n$$\n',
 '5. Let $F$ be a set which contains exactly two elements, 0 and 1 . Define an additi

# Now, let's do the same for the entire Book 3

### The topics covered:

- Linear Equations

- Vector Spaces

- Linear Transformations

- Polynomials

- Determinants

- Elementary Canonical Forms

- The Rational and Jordan Forms

- Inner Product Spaces

- Operators on Inner Product Spaces

- Bilinear Forms

In [5]:
topics = ['Linear Equations', 'Vector Spaces', 'Linear Transformations', 'Polynomials', 'Determinants', 'Elementary Canonical Forms', 'The Rational and Jordan Forms', 'Inner Product Spaces', 'Operators on Inner Product Spaces', 'Bilinear Forms']

In [6]:
with open('data/Book3-latex/page_15.tex') as f:
    lines = f.readlines()
lines

['\n',
 '\n',
 '### 3 Matrices and Elementary\n',
 '\n',
 '_Row Operations_\n',
 '\n',
 "One cannot fail to notice that in forming linear combinations of linear equations there is no need to continue writing the 'unknowns' \\(x_{i}\\), \\(\\ldots\\), \\(x_{n}\\), since one actually computes only with the coefficients \\(A_{ij}\\) and the scalars \\(y_{i}\\). We shall now abbreviate the system (1-1) by\n",
 '\n',
 '\\[AX=Y\\]\n',
 '\n',
 'where\n',
 '\n',
 '\\[A=\\begin{bmatrix}A_{11}&\\cdots&A_{1n}\\\\ \\vdots&&\\vdots\\\\ A_{m1}&\\cdots&A_{mn}\\end{bmatrix}\\]\n',
 '\n',
 '\\[X=\\begin{bmatrix}x_{1}\\\\ \\vdots\\\\ x_{n}\\end{bmatrix}\\quad\\text{and}\\quad Y=\\begin{bmatrix}y_{1}\\\\ \\vdots\\\\ y_{m}\\end{bmatrix}.\\]\n',
 '\n',
 'We call \\(A\\) the **matrix of coefficients** of the system. Strictly speaking, the rectangular array displayed above is not a matrix, but is a representation of a matrix. An \\(m\\times n\\)**matrix over the field \\(F\\)** is a function \\(A\\) from the

In [7]:
with open('data/Book3-latex/page_37.tex') as f:
    lines = f.readlines()
lines

['\n',
 '\n',
 '## Chapter 2 Vector Spaces\n',
 '\n',
 '### 2.1 Vector Spaces\n',
 '\n',
 "In various parts of mathematics, one is confronted with a set, such that it is both meaningful and interesting to deal with 'linear combinations' of the objects in that set. For example, in our study of linear equations we found it quite natural to consider linear combinations of the rows of a matrix. It is likely that the reader has studied calculus and has dealt there with linear combinations of functions; certainly this is so if he has studied differential equations. Perhaps the reader has had some experience with vectors in three-dimensional Euclidean space, and in particular, with linear combinations of such vectors.\n",
 '\n',
 "Loosely speaking, linear alcgbra is that branch of mathematics which treats the common properties of algebraic systems which consist of a set, together with a reasonable notion of a 'linear combination' of elements in the set. In this section we shall define the mat

In [17]:
def extract_questions(pageNos):
    questions = []
    for pageNo in pageNos:
        with open(f'data/Book3-latex/page_{pageNo}.tex') as f:
            line = f.readline()
            while line:
                if re.match(r'.+\**Exercises\**\n', line):
                    line = f.readline()
                    question = ""
                    while line and re.match(r'^#.+', line) is None:  # TODO: handle questions that roll over to the next page
                        if re.match(r'^\d+\.\s.+', line) or re.match(r'^\*\*\d+\.\*\*\s.+', line):
                            questions.append(question)
                            question = ""
                        question += line
                        line = f.readline()
                    questions.append(question)
                line = f.readline()
    questions = list(filter(None, questions))
    # remove questions that are just new line characters
    questions = [question for question in questions if question != '\n']
    # replace all new line characters in each question with a space
    questions = [re.sub(r'\n', ' ', question) for question in questions]
    # remove the question number from each question
    questions = [re.sub(r'^\d+\.\s', '', question) for question in questions]
    return questions

In [3]:
def extract_questions2(pageNos):
    questions = []
    lines = []
    for pageNo in pageNos:
        with open(f'data/Book3-latex/page_{pageNo}.tex') as f:
            lines += f.readlines()
    
    for i in range(len(lines)):
        if lines[i] == 'Exercises\n':
            j = i + 1
            question = ""
            while j < len(lines) and not re.match(r'\#\#.+', lines[j]):
                if re.match(r'^\d+\.\s.+', lines[j]):
                    questions.append(question)
                    question = ""
                question += lines[j]
                j += 1
            questions.append(question)
            break
    questions = list(filter(None, questions))
    # remove the question number from each question
    questions = [re.sub(r'^\d+\.\s', '', question) for question in questions]
    return questions
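
The two functions above trade off differently: `extract_questions` streams each page but resets at every file boundary, while `extract_questions2` joins the pages but stops after the first exercise block. A minimal sketch that combines both ideas is shown below. The name `iter_questions`, the three patterns, and the in-memory `pages` are hypothetical; the sketch also keeps the question number and drops any text between the heading and the first numbered item.

```python
import re
from itertools import chain

# Hypothetical patterns, stricter than the ones used above
EXERCISES = re.compile(r'^(?:#+\s*)?\**Exercises\**\s*$')
HEADING = re.compile(r'^#')
NUMBERED = re.compile(r'^(?:\*\*)?\d+\.(?:\*\*)?\s')

def iter_questions(pages):
    """Yield questions one at a time from an iterable of per-page line iterables."""
    in_block = False
    current = []
    # chain.from_iterable hides page boundaries, so an open question
    # simply continues with the first lines of the next page
    for line in chain.from_iterable(pages):
        if EXERCISES.match(line) or HEADING.match(line):
            if current:
                yield ' '.join(current)
                current = []
            # an Exercises heading opens a block, any other heading closes it
            # (limitation: a running head at the top of a page also closes it)
            in_block = bool(EXERCISES.match(line))
        elif in_block and line.strip():
            if NUMBERED.match(line):
                if current:
                    yield ' '.join(current)
                current = [line.strip()]
            elif current:
                current.append(line.strip())
    if current:
        yield ' '.join(current)

# Made-up pages: question 2 rolls over from the first page to the second
pages = [
    ['#### Exercises\n', '\n', '1. First question\n', '2. Second question starts\n'],
    ['and continues on the next page.\n', '3. Third\n', '### 10.4 Next section\n', 'Body text.\n'],
]
extracted_demo = list(iter_questions(pages))
print(extracted_demo)
```

With real files, each element of `pages` could be an open file object, since a file iterates over its lines.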

In [18]:
def extract_n_concat(topic, start, end, df):
    questions = extract_questions(list(range(start, end)))
    # questions = extract_questions2(list(range(start, end)))
    df = pd.concat([df, pd.DataFrame({'Topic': [topic]*len(questions), 'Question': questions})])
    return df

In [19]:
book3 = pd.DataFrame(columns=['Topic', 'Question'])

book3 = extract_n_concat('Linear Equations', 11, 37, book3)
book3 = extract_n_concat('Vector Spaces', 37, 76, book3)
book3 = extract_n_concat('Linear Transformations', 76, 126, book3)
book3 = extract_n_concat('Polynomials', 126, 149, book3)
book3 = extract_n_concat('Determinants', 149, 190, book3)
book3 = extract_n_concat('Elementary Canonical Forms', 190, 236, book3)
book3 = extract_n_concat('The Rational and Jordan Forms', 236, 279, book3)
book3 = extract_n_concat('Inner Product Spaces', 279, 328, book3)
book3 = extract_n_concat('Operators on Inner Product Spaces', 328, 368, book3)
book3 = extract_n_concat('Bilinear Forms', 368, 395, book3)

In [20]:
book3

Unnamed: 0,Topic,Question
0,Linear Equations,Find all solutions to the following system of ...
1,Linear Equations,Find a row-reduced echelon matrix which is row...
2,Linear Equations,Let \[A=\begin{bmatrix}1&2&1&0\\ -1&0&3&5\\ 1&...
3,Linear Equations,"Do Exercise 1, but with \[A=\begin{bmatrix}2&\..."
4,Linear Equations,For each of the two matrices \[\begin{bmatrix}...
...,...,...
19,Bilinear Forms,"**6.** Find a matrix in \(O(3,C)\) whose first..."
20,Bilinear Forms,**7.** Let \(V\) be the space of all \(n\times...
21,Bilinear Forms,**8.** Let \(X\) be an \(n\times 1\) matrix ov...
22,Bilinear Forms,**9.** Let \(V\) be the space of all \(n\times...
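
The last rows above still carry markers such as `**6.**`. The splitting step recognises bold question numbers, but the clean-up pattern `^\d+\.\s` only strips plain ones. A minimal sketch of a pattern that handles both forms follows; `LEADING_NUMBER` and `strip_number` are hypothetical names, and the sample strings are shortened for illustration.

```python
import re

# Optional bold markers around the number, then at least one space
LEADING_NUMBER = re.compile(r'^\s*(?:\*\*)?\d+\.(?:\*\*)?\s+')

def strip_number(question):
    """Remove a leading '1. ' or '**1.** ' style question number, if present."""
    return LEADING_NUMBER.sub('', question, count=1)

samples = [
    '1. Let \\(V\\) be a vector space.',
    '**6.** Find a matrix in \\(O(3,C)\\).',
    'No number here 2. stays.',
]
cleaned = [strip_number(q) for q in samples]
print(cleaned)
```

Because the pattern is anchored at the start of the string, numbers inside a question (such as sub-item labels) are left untouched.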


In [21]:
book3.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 127 entries, 0 to 23
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Topic     127 non-null    object
 1   Question  127 non-null    object
dtypes: object(2)
memory usage: 3.0+ KB
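
The index above runs "0 to 23" for 127 entries because each `pd.concat` call keeps the per-chapter row labels, so the labels repeat. One way to get a clean 0..n-1 index is to collect the per-chapter frames in a list and concatenate once with `ignore_index=True`. This is a sketch: `build_frame` and `fake_extract` are hypothetical names, and the page ranges are shortened for the demo.

```python
import pandas as pd

def build_frame(chapters, extract):
    """Build one DataFrame from (topic, start, end) tuples.

    `extract` is any callable mapping a range of page numbers to a list of
    question strings (a stand-in for extract_questions).
    """
    frames = []
    for topic, start, end in chapters:
        questions = extract(range(start, end))
        frames.append(pd.DataFrame({'Topic': [topic] * len(questions),
                                    'Question': questions}))
    # a single concat at the end; ignore_index renumbers the rows 0..n-1
    return pd.concat(frames, ignore_index=True)

def fake_extract(pages):
    # made-up extractor: one placeholder question per page
    return [f'question from page {p}' for p in pages]

demo = build_frame([('Linear Equations', 11, 13), ('Vector Spaces', 37, 40)],
                   fake_extract)
print(demo.index)
```

Concatenating once also avoids copying the growing frame on every chapter.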


In [22]:
book3.to_csv('data/book3.csv', index=False)

In [28]:
extracted = pd.read_csv('data/book3_final.csv')
extracted

Unnamed: 0,Topic,Question
0,Linear Equations,Find all solutions to the following system of ...
1,Linear Equations,Find a row-reduced echelon matrix which is row...
2,Linear Equations,Let \[A=\begin{bmatrix}1&2&1&0\\ -1&0&3&5\\ 1&...
3,Linear Equations,"Do Exercise 1, but with \[A=\begin{bmatrix}2&\..."
4,Linear Equations,For each of the two matrices \[\begin{bmatrix}...
...,...,...
114,Bilinear Forms,Let \(X\) be an \(n\times 1\) matrix over \(C\...
115,Bilinear Forms,Let \(V\) be the space of all \(n\times 1\) ma...
116,Bilinear Forms,Let \(S\) be any set of \(n\times n\) matrices...
117,Bilinear Forms,"Let \(F\) be a subfield of \(C\), \(V\) a fini..."


In [29]:
extracted.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119 entries, 0 to 118
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Topic     119 non-null    object
 1   Question  119 non-null    object
dtypes: object(2)
memory usage: 2.0+ KB


In [30]:
# cast both columns to str; pandas still reports the dtype as object
extracted['Topic'] = extracted['Topic'].astype(str)
extracted['Question'] = extracted['Question'].astype(str)

extracted.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119 entries, 0 to 118
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Topic     119 non-null    object
 1   Question  119 non-null    object
dtypes: object(2)
memory usage: 2.0+ KB


In [36]:
# count the elements in column 'Question' that are not strings
extracted['Question'].apply(lambda x: type(x) != str).sum()

0

In [39]:
# same cast via apply(str); the dtype is still reported as object
extracted['Question'] = extracted['Question'].apply(lambda x: str(x))
extracted['Topic'] = extracted['Topic'].apply(lambda x: str(x))

extracted.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119 entries, 0 to 118
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Topic     119 non-null    object
 1   Question  119 non-null    object
dtypes: object(2)
memory usage: 2.0+ KB
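
Both casts leave `info()` reporting `object` in the outputs above, because the values are stored as plain Python `str` objects in an object column. If a dedicated string dtype is wanted, `astype('string')` (available since pandas 1.0) is one option. A minimal sketch on a made-up Series:

```python
import pandas as pd

# Made-up data standing in for the 'Topic' column
s = pd.Series(['Linear Equations', 'Vector Spaces'])

as_string = s.astype('string')  # opt in to pandas' dedicated string dtype
print(as_string.dtype)

# The individual values are still ordinary Python str objects
all_str = all(isinstance(v, str) for v in as_string)
print(all_str)
```

Whether the extra dtype is worth it depends on the downstream use; writing to CSV produces the same file either way.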


In [38]:
extracted

Unnamed: 0,Topic,Question
0,Linear Equations,Find all solutions to the following system of ...
1,Linear Equations,Find a row-reduced echelon matrix which is row...
2,Linear Equations,Let \[A=\begin{bmatrix}1&2&1&0\\ -1&0&3&5\\ 1&...
3,Linear Equations,"Do Exercise 1, but with \[A=\begin{bmatrix}2&\..."
4,Linear Equations,For each of the two matrices \[\begin{bmatrix}...
...,...,...
114,Bilinear Forms,Let \(X\) be an \(n\times 1\) matrix over \(C\...
115,Bilinear Forms,Let \(V\) be the space of all \(n\times 1\) ma...
116,Bilinear Forms,Let \(S\) be any set of \(n\times n\) matrices...
117,Bilinear Forms,"Let \(F\) be a subfield of \(C\), \(V\) a fini..."


In [40]:
extracted.to_csv('data/book3_final2.csv', index=False)

---