# Who Wrote the Gospels?

The Christian Bible (in the Protestant canon) consists of 66 books. Four of these books (Matthew, Mark, Luke, and John) tell the life of Jesus; these four books are collectively known as the Gospels.

In this activity, you will use data science to rediscover a fact about the Gospels that has long been known to religion scholars, without reading the text!

## Exercise 0. Reading in the Data

The text of the four gospels is stored on the web in four files:
- Matthew: https://datasci112.stanford.edu/data/gospels/matthew.txt
- Mark: https://datasci112.stanford.edu/data/gospels/mark.txt
- Luke: https://datasci112.stanford.edu/data/gospels/luke.txt
- John: https://datasci112.stanford.edu/data/gospels/john.txt

Read the four texts into a list called `corpus`.

In [5]:
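# assumes the four files listed above have been downloaded into a local data/ directory
data_dir = "data/"
gospel_files = ["matthew.txt", "mark.txt", "luke.txt", "john.txt"]

In [6]:
from enum import Enum

class Gospel(Enum):
    """Index of each gospel in `corpus`, in the same order as `gospel_files`."""
    MATTHEW = 0
    MARK = 1
    LUKE = 2
    JOHN = 3

# read each file into the corpus, preserving the order of gospel_files
corpus = []
for file in gospel_files:
    with open(data_dir + file, "r") as f:
        corpus.append(f.read())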

In [7]:
print(corpus[Gospel.MATTHEW.value])

 The book of the generation of Jesus Christ, the son of David, the
son of Abraham.

 Abraham begat Isaac; and Isaac begat Jacob; and Jacob begat Judas
and his brethren;  And Judas begat Phares and Zara of Thamar; and
Phares begat Esrom; and Esrom begat Aram;  And Aram begat Aminadab;
and Aminadab begat Naasson; and Naasson begat Salmon;  And Salmon
begat Booz of Rachab; and Booz begat Obed of Ruth; and Obed begat
Jesse;  And Jesse begat David the king; and David the king begat
Solomon of her that had been the wife of Urias;  And Solomon begat
Roboam; and Roboam begat Abia; and Abia begat Asa;  And Asa begat
Josaphat; and Josaphat begat Joram; and Joram begat Ozias;  And
Ozias begat Joatham; and Joatham begat Achaz; and Achaz begat Ezekias;
 And Ezekias begat Manasses; and Manasses begat Amon; and Amon
begat Josias;  And Josias begat Jechonias and his brethren, about
the time they were carried away to Babylon:  And after they were
brought to Babylon, Jechonias begat Salathiel; and Salat

## Exercise 1. Term-Frequency Matrix

Construct the term-frequency matrix for this corpus, and calculate the Euclidean distances between all pairs of gospels. Which two gospels are most similar? Most different?

In [8]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import euclidean_distances

# create a count vectorizer
vectorizer = CountVectorizer(token_pattern=r"\w+")

# fit the vectorizer to the corpus
tf = vectorizer.fit_transform(corpus)

# get euclidean distances
dist = euclidean_distances(tf)

# Most similar pair of gospels: Matthew and Luke
# Most different pair of gospels: Luke and John
dist

array([[   0.        ,  925.11728986,  639.32464367,  993.62769688],
       [ 925.11728986,    0.        , 1243.08406795,  834.08512755],
       [ 639.32464367, 1243.08406795,    0.        , 1357.69289606],
       [ 993.62769688,  834.08512755, 1357.69289606,    0.        ]])
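
To make the two steps concrete, here is a minimal from-scratch sketch on a made-up three-sentence corpus. The helper names `term_frequency_matrix` and `euclidean_distance` and the toy sentences are assumptions for illustration; the tokenization imitates `token_pattern=r"\w+"` with lowercasing and is not scikit-learn's actual implementation.

```python
import math
import re
from collections import Counter

def term_frequency_matrix(docs):
    """Return (vocabulary, rows), where rows[i][j] counts vocabulary[j] in docs[i]."""
    # Lowercase, then split on runs of word characters (hypothetical stand-in
    # for what the vectorizer does with token_pattern=r"\w+").
    counts = [Counter(re.findall(r"\w+", doc.lower())) for doc in docs]
    vocabulary = sorted(set().union(*counts))
    rows = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, rows

def euclidean_distance(u, v):
    """Straight-line distance between two count vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Toy corpus, invented for this sketch.
toy_docs = ["the cat sat", "the cat sat on the mat", "a dog barked"]
vocabulary, rows = term_frequency_matrix(toy_docs)

print(vocabulary)
for row in rows:
    print(row)
print(euclidean_distance(rows[0], rows[1]))
```

Each row is one document and each column is one word of the shared vocabulary, which is the same layout as the 4-row matrix built for the gospels.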

## Exercise 2. Vector Space Model

In the vector space model, we treat the documents as vectors instead of points. To measure the distance between two vectors, we use cosine distance.
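
For reference, cosine distance is commonly defined as one minus the cosine of the angle between the two vectors. The symbols ${\bf u}$, ${\bf v}$, and $\theta$ are notation introduced here for illustration:
$$ \text{cosine distance}({\bf u}, {\bf v}) = 1 - \cos\theta = 1 - \frac{{\bf u} \cdot {\bf v}}{||{\bf u}|| \, ||{\bf v}||}. $$
It equals 0 when the two vectors point in the same direction, regardless of their lengths.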

Calculate the cosine distances between all pairs of gospels. What do you notice now? How does your conclusion compare with the one you obtained above?

In [9]:
# rebuild the term-frequency matrix; the cosine distances are computed in the next cell
from sklearn.feature_extraction.text import CountVectorizer

# create term frequency matrix using CountVectorizer
vectorizer = CountVectorizer(token_pattern=r"\w+")
tf_matrix = vectorizer.fit_transform(corpus)
tf_matrix.todense()

matrix([[254,   0,   1, ...,   2,   0,   2],
        [150,   0,   0, ...,   0,   0,   0],
        [329,   1,   2, ...,   0,   1,   1],
        [163,   0,   0, ...,   0,   0,   0]], shape=(4, 3452))

In [10]:
# calculate the cosine distances
from sklearn.metrics.pairwise import cosine_distances
cosine_distances(tf_matrix)

# Most similar pair of gospels: Mark and Luke
# Most different pair of gospels: Mark and John

array([[0.        , 0.02348578, 0.01533848, 0.05390046],
       [0.02348578, 0.        , 0.0146853 , 0.08256973],
       [0.01533848, 0.0146853 , 0.        , 0.06469943],
       [0.05390046, 0.08256973, 0.06469943, 0.        ]])
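
Rather than reading the closest and farthest pairs off the matrices by eye, they can be picked out programmatically. The sketch below copies the two distance matrices printed above; the helper name `extreme_pairs` is hypothetical and introduced only for this illustration.

```python
from itertools import combinations

names = ["Matthew", "Mark", "Luke", "John"]

# Euclidean distances on raw counts (copied from the Exercise 1 output).
euclidean = [
    [0.0, 925.11728986, 639.32464367, 993.62769688],
    [925.11728986, 0.0, 1243.08406795, 834.08512755],
    [639.32464367, 1243.08406795, 0.0, 1357.69289606],
    [993.62769688, 834.08512755, 1357.69289606, 0.0],
]

# Cosine distances on raw counts (copied from the Exercise 2 output).
cosine = [
    [0.0, 0.02348578, 0.01533848, 0.05390046],
    [0.02348578, 0.0, 0.0146853, 0.08256973],
    [0.01533848, 0.0146853, 0.0, 0.06469943],
    [0.05390046, 0.08256973, 0.06469943, 0.0],
]

def extreme_pairs(dist, labels):
    """Return (closest pair, farthest pair) over all pairs of distinct documents."""
    pairs = list(combinations(range(len(labels)), 2))
    closest = min(pairs, key=lambda p: dist[p[0]][p[1]])
    farthest = max(pairs, key=lambda p: dist[p[0]][p[1]])
    return (
        tuple(labels[i] for i in closest),
        tuple(labels[i] for i in farthest),
    )

print("Euclidean:", extreme_pairs(euclidean, names))
print("Cosine:   ", extreme_pairs(cosine, names))
```

Only the upper triangle is scanned, so the zero diagonal never counts as the "closest" pair.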

## Exercise 2a. (optional)

Notice that Euclidean distance and cosine distance reached different conclusions about which Gospels were most similar. This is because longer documents have more words; Euclidean distance is sensitive to this, while cosine distance is not.
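
A small sketch of this length effect, using an invented count vector and a second "document" that is the first one repeated twice. The vectors and helper names are assumptions for illustration only.

```python
import math

def vector_length(v):
    return math.sqrt(sum(x * x for x in v))

def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_distance(u, v):
    """One minus the cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 - dot / (vector_length(u) * vector_length(v))

# Hypothetical word counts for a short document.
short_doc = [3, 1, 0, 2]
# The same text repeated twice: identical word proportions, double the counts.
long_doc = [2 * count for count in short_doc]

print(euclidean_distance(short_doc, long_doc))  # clearly greater than 0
print(cosine_distance(short_doc, long_doc))     # 0, up to floating-point rounding
```

The two vectors point in the same direction, so cosine distance treats them as identical, while Euclidean distance grows with the difference in length.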

In fact, if we just _normalize_ the word counts in each row before calculating Euclidean distance, then Euclidean distance and cosine distance will reach the exact same conclusion!

- Normalizing a vector means scaling the vector so that its length is 1.
- To do this, we divide the vector by its length.
$$ {\bf v} \leftarrow \frac{{\bf v}}{||{\bf v}||}. $$
- This can be done in Pandas or using `sklearn.preprocessing.Normalizer`.
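
The steps above can be sketched in plain Python as follows. The function names are hypothetical, and leaving an all-zero vector unchanged is a choice made for this sketch to avoid dividing by zero.

```python
import math

def vector_length(v):
    """Euclidean (L2) length of a vector."""
    return math.sqrt(sum(x * x for x in v))

def l2_normalize(v):
    """Scale v so that its Euclidean length is 1."""
    length = vector_length(v)
    if length == 0:
        return list(v)  # nothing to scale; avoids division by zero
    return [x / length for x in v]

unit = l2_normalize([3, 4])
print(unit)                 # the vector (3, 4) has length 5
print(vector_length(unit))  # 1, up to floating-point rounding
```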

In [11]:
from sklearn.preprocessing import normalize

# scale each row (one gospel per row) to unit L2 length
tf_normalized = normalize(tf_matrix, norm='l2')

## 📚 L1 Normalization vs L2 Normalization (by Copilot)

A concise summary of the concepts behind L1 and L2 normalization, both widely used in machine learning and deep learning, and of how they differ.

---

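### 📐 Basic Concepts

#### ✅ L1 Normalization (Manhattan Norm)
- Measures the size of a vector as the **sum of the absolute values** of its elements
- Formula:  
  $$ ||x||_1 = \sum_{i=1}^{n} |x_i| $$
- Dividing the vector by this size gives the **relative proportion of each element**

#### ✅ L2 Normalization (Euclidean Norm)
- Measures the size of a vector as the **square root of the sum of squares** of its elements
- Formula:  
  $$ ||x||_2 = \sqrt{\sum_{i=1}^{n} x_i^2} $$
- Dividing the vector by this size **converts it to a unit vector**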

---

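### ⚖️ Comparison

| Aspect                  | L1 Normalization                                        | L2 Normalization                                  |
|-------------------------|---------------------------------------------------------|---------------------------------------------------|
| Computation             | Sum of absolute values                                  | Square root of the sum of squares                 |
| Formula complexity      | Simple                                                  | Somewhat more complex                             |
| Sensitivity to outliers | Less sensitive                                          | More sensitive                                    |
| Sparsity effect         | Yes, when used as a penalty (pushes values toward 0)    | No (values stay spread out evenly)                |
| Example use             | L1 Regularization (Lasso)                               | L2 Regularization (Ridge)                         |
| Geometric shape         | Diamond (a square rotated 45° in 2D)                    | Sphere (a circle in 2D)                           |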

---

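### 🧠 Choosing Between Them in Practice

- 🔹 **L1**: robust to outliers and induces sparsity → **well suited to feature selection**
- 🔹 **L2**: smooth and differentiable → **well suited to general model training**
- 🔹 Using both together → **Elastic Net**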


In [None]:
# get the euclidean distances of the normalized term frequency matrix
from sklearn.metrics.pairwise import euclidean_distances
euclidean_distances(tf_normalized)

# Most similar pair of gospels: Mark and Luke
# Most different pair of gospels: Mark and John

array([[0.        , 0.21672924, 0.17514841, 0.32833051],
       [0.21672924, 0.        , 0.17137853, 0.40637355],
       [0.17514841, 0.17137853, 0.        , 0.35972052],
       [0.32833051, 0.40637355, 0.35972052, 0.        ]])

In [None]:
# get the cosine distances of the normalized term frequency matrix
from sklearn.metrics.pairwise import cosine_distances
cosine_distances(tf_normalized)

# Most similar pair of gospels: Mark and Luke
# Most different pair of gospels: Mark and John

array([[0.        , 0.02348578, 0.01533848, 0.05390046],
       [0.02348578, 0.        , 0.0146853 , 0.08256973],
       [0.01533848, 0.0146853 , 0.        , 0.06469943],
       [0.05390046, 0.08256973, 0.06469943, 0.        ]])

In MATH 51, you will learn how to show why normalizing the vectors makes Euclidean distance agree with cosine distance.

In fact, there's a very simple relationship between the Euclidean distance between the normalized vectors and the cosine distance:
$$ \text{Euclidean distance (between the normalized vectors)} = \sqrt{2 \cdot \text{cosine distance}}. $$
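
As a quick numerical check of this relationship, the sketch below plugs in the off-diagonal values printed in the two outputs above. The list names are introduced here for illustration.

```python
import math

# Off-diagonal entries copied from the outputs above, in the order
# (Matthew, Mark), (Matthew, Luke), (Matthew, John),
# (Mark, Luke), (Mark, John), (Luke, John).
cosine = [0.02348578, 0.01533848, 0.05390046,
          0.0146853, 0.08256973, 0.06469943]
euclidean_normalized = [0.21672924, 0.17514841, 0.32833051,
                        0.17137853, 0.40637355, 0.35972052]

for c, e in zip(cosine, euclidean_normalized):
    predicted = math.sqrt(2 * c)
    print(f"sqrt(2 * {c:.8f}) = {predicted:.8f}   reported: {e:.8f}")

# Largest disagreement between the formula and the reported distances.
max_gap = max(abs(math.sqrt(2 * c) - e)
              for c, e in zip(cosine, euclidean_normalized))
print(max_gap)
```

Any remaining gap comes from the printed values being rounded to eight decimal places.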

If you've taken MATH 51 or MATH 104, see if you can show this!
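
One possible derivation, sketched with ${\bf u}$ and ${\bf v}$ denoting the two normalized vectors and $\theta$ the angle between them (notation assumed for this sketch). Expanding the squared distance as a dot product and using $||{\bf u}|| = ||{\bf v}|| = 1$:
$$ ||{\bf u} - {\bf v}||^2 = ({\bf u} - {\bf v}) \cdot ({\bf u} - {\bf v}) = ||{\bf u}||^2 - 2\,{\bf u} \cdot {\bf v} + ||{\bf v}||^2 = 2 - 2\,{\bf u} \cdot {\bf v}. $$
For unit vectors, ${\bf u} \cdot {\bf v} = \cos\theta$, so the cosine distance is $1 - {\bf u} \cdot {\bf v}$ and therefore
$$ ||{\bf u} - {\bf v}||^2 = 2\,(1 - {\bf u} \cdot {\bf v}) = 2 \cdot \text{cosine distance}. $$
Taking the square root of both sides gives the relationship above. Because the square root is an increasing function, both distances rank the pairs of gospels in the same order.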

In [13]:
# I'll be back

## Exercise 3. tf-idf

We should probably upweight rare words and downweight common words. Construct the tf-idf matrix, and calculate the cosine distances between all pairs of gospels. Do your conclusions change compared to Exercise 2?
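
To illustrate the weighting idea, here is a from-scratch sketch on an invented three-sentence corpus. It assumes the smoothed formula $\text{idf}(t) = \ln\frac{1 + n}{1 + \text{df}(t)} + 1$ followed by L2 normalization of each row, which is what scikit-learn documents as the default behavior of `TfidfVectorizer`; the function name `tfidf_rows` and the toy sentences are hypothetical.

```python
import math
import re
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, idf, rows) using smoothed idf and L2-normalized rows."""
    counts = [Counter(re.findall(r"\w+", doc.lower())) for doc in docs]
    vocabulary = sorted(set().union(*counts))
    n = len(docs)
    # Document frequency: in how many documents does each word appear?
    df = {w: sum(1 for c in counts if w in c) for w in vocabulary}
    # Smoothed idf (assumed formula, see the lead-in): rare words get larger weights.
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in vocabulary}
    rows = []
    for c in counts:
        row = [c[w] * idf[w] for w in vocabulary]
        length = math.sqrt(sum(x * x for x in row)) or 1.0  # guard for empty docs
        rows.append([x / length for x in row])
    return vocabulary, idf, rows

# Toy corpus, invented for this sketch.
toy_docs = ["the cat sat", "the dog sat", "the bird flew"]
vocabulary, idf, rows = tfidf_rows(toy_docs)

for word in vocabulary:
    print(f"{word:>5}: idf = {idf[word]:.4f}")
```

Here "the" appears in every document and gets the smallest weight, while words found in a single document get the largest. With only four gospels, document frequency can take just four values, so the reweighting is fairly coarse.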

In [None]:
# construct the tf-idf matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances
tfidf_vectorizer = TfidfVectorizer(token_pattern=r"\w+")
tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)

# get the cosine distances of the tf-idf matrix
cosine_distances(tfidf_matrix)

# Most similar pair of gospels: Mark and Luke
# Most different pair of gospels: Mark and John

array([[0.        , 0.02442428, 0.01631817, 0.05541234],
       [0.02442428, 0.        , 0.01533438, 0.08365147],
       [0.01631817, 0.01533438, 0.        , 0.06577457],
       [0.05541234, 0.08365147, 0.06577457, 0.        ]])

## Further Reading

You have just discovered a phenomenon known to critical Biblical scholars as the [Synoptic problem](https://en.wikipedia.org/wiki/Synoptic_Gospels)!