In [1]:
import pandas as pd
import numpy as np
import textwrap
import nltk
from nltk.corpus import stopwords
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer, PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
nltk.download("punkt")
nltk.download("stopwords")

[nltk_data] Downloading package punkt to C:\Users\Deepam
[nltk_data]     Shah\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to C:\Users\Deepam
[nltk_data]     Shah\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [3]:
import pathlib

root_dir = pathlib.Path(r"D:\Deepam\bbc")

data = []

for category_dir in root_dir.iterdir():
    if category_dir.is_dir():
        label = category_dir.name

        for file_path in category_dir.glob("*.txt"):
            with open(file_path, "r", encoding="latin-1") as f:
                text = f.read()
                data.append({"text": text, "labels": label})

df = pd.DataFrame(data)

In [4]:
df.head()

| | text | labels |
|---|---|---|
| 0 | Ad sales boost Time Warner profit\n\nQuarterly... | business |
| 1 | Dollar gains on Greenspan speech\n\nThe dollar... | business |
| 2 | Yukos unit buyer faces loan claim\n\nThe owner... | business |
| 3 | High fuel prices hit BA's profits\n\nBritish A... | business |
| 4 | Pernod takeover talk lifts Domecq\n\nShares in... | business |


In [5]:
doc = df[df.labels == 'business']['text'].sample(random_state=42)

In [6]:
def wrap(x):
    return textwrap.fill(x, replace_whitespace=False, fix_sentence_endings=True)

In [7]:
print(wrap(doc.iloc[0]))

Christmas sales worst since 1981

UK retail sales fell in December,
failing to meet expectations and making it by some counts the worst
Christmas since 1981.

Retail sales dropped by 1% on the month in
December, after a 0.6% rise in November, the Office for National
Statistics (ONS) said.  The ONS revised the annual 2004 rate of growth
down from the 5.9% estimated in November to 3.2%. A number of
retailers have already reported poor figures for December.  Clothing
retailers and non-specialist stores were the worst hit with only
internet retailers showing any significant growth, according to the
ONS.

The last time retailers endured a tougher Christmas was 23 years
previously, when sales plunged 1.7%.

The ONS echoed an earlier
caution from Bank of England governor Mervyn King not to read too much
into the poor December figures.  Some analysts put a positive gloss on
the figures, pointing out that the non-seasonally-adjusted figures
showed a performance comparable with 2003. The Novembe

In [8]:
sents = nltk.sent_tokenize(doc.iloc[0].split("\n", 1)[1])

In [9]:
featurizer = TfidfVectorizer(
    stop_words=stopwords.words('english'),
    norm='l1',
)

In [10]:
X = featurizer.fit_transform(sents)

In [12]:
from sklearn.metrics.pairwise import cosine_similarity

In [13]:
# compute similarity matrix
S = cosine_similarity(X)


# Cosine Similarity: Step-by-Step Example using Scikit-Learn

## 🔹 1. Import Required Libraries

```python
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
```

---

## 🔹 2. Define a Simple TF-IDF-like Matrix `X`

```python
X = np.array([
    [1, 2, 0],  # Sentence 1
    [2, 1, 0],  # Sentence 2
    [0, 1, 3],  # Sentence 3
])
```

This simulates a TF-IDF matrix for 3 sentences (rows) with 3 terms (columns).

---

## 🔹 3. Use Scikit-learn to Compute Cosine Similarity

```python
S = cosine_similarity(X)
print(np.round(S, 2))
```

### Output:

```
[[1.   0.8  0.28]
 [0.8  1.   0.14]
 [0.28 0.14 1.  ]]
```

This produces a **3×3 similarity matrix**, where `S[i][j]` is the cosine similarity between sentence `i` and sentence `j`.

---

## 🔹 4. Manually Verify `S[0][1]` (Sentence 1 vs. Sentence 2)

### Vectors:
- **A** = `[1, 2, 0]`
- **B** = `[2, 1, 0]`

---

### 🔸 Cosine Similarity Formula:

$$
\text{cosine}(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|}
$$

---

### 🔸 Dot Product:

```
A · B = (1×2) + (2×1) + (0×0) = 4
```

---

### 🔸 Norms:

$$
\|A\| = \sqrt{1^2 + 2^2 + 0^2} = \sqrt{5}, \qquad
\|B\| = \sqrt{2^2 + 1^2 + 0^2} = \sqrt{5}
$$

---

### 🔸 Final Cosine Similarity:


$$
\text{cosine}(A, B) = \frac{4}{\sqrt{5} \cdot \sqrt{5}} = \frac{4}{5} = 0.8
$$
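
The hand calculation can be cross-checked with plain NumPy. The sketch below assumes the same toy matrix `X` from step 2; the names `X_unit` and `S_manual` are introduced here for illustration only. It scales each row to unit length and takes dot products, which is one way to arrive at the same quantity `cosine_similarity` reports.

```python
import numpy as np

X = np.array([
    [1, 2, 0],  # Sentence 1
    [2, 1, 0],  # Sentence 2
    [0, 1, 3],  # Sentence 3
], dtype=float)

# scale every row to unit Euclidean length
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)

# dot products of unit vectors are cosine similarities
S_manual = X_unit @ X_unit.T

print(np.round(S_manual, 2))
```

The entry `S_manual[0, 1]` should agree with the 0.8 derived by hand, and the diagonal should be all 1s because every vector is perfectly similar to itself.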





In [21]:
S.shape # this implies we have 17 sentences

(17, 17)

In [15]:
len(sents)

17

In [16]:
# normalize similarity matrix
S /= S.sum(axis=1, keepdims=True)

In [19]:
S[0].sum()

1.0
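
A small sketch of what the row normalization does, using a made-up 3×3 similarity matrix (`S_toy` and `P` are illustrative names, not part of the notebook). It shows why `keepdims=True` matters and that the result is no longer symmetric, since each row is divided by a different sum.

```python
import numpy as np

# toy symmetric similarity matrix (values are illustrative only)
S_toy = np.array([
    [1.0, 0.8, 0.2],
    [0.8, 1.0, 0.1],
    [0.2, 0.1, 1.0],
])

# keepdims=True keeps the row sums as a column of shape (3, 1),
# so the division broadcasts across each row
row_sums = S_toy.sum(axis=1, keepdims=True)
P = S_toy / row_sums

print(P.sum(axis=1))         # each entry is 1 up to floating-point error
print(np.allclose(P, P.T))   # False: normalizing rows breaks symmetry
```

After this step each row can be read as a probability distribution over "next sentences", which is what makes the matrix usable as a Markov transition matrix.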

In [20]:
# uniform transition matrix
U = np.ones_like(S) / len(S)

* Step-by-Step Explanation:

1. `np.ones_like(S)`
* This creates a matrix filled with 1s that has the same shape and dtype as matrix `S`.
* If `S` is a 3×3 matrix, `np.ones_like(S)` will be:
```
[[1. 1. 1.]
 [1. 1. 1.]
 [1. 1. 1.]]
```

2. `len(S)`
* `S` is a square matrix (cosine similarity between sentences).
* `len(S)` gives the number of rows (or sentences), say N.

3. Division: `np.ones_like(S) / len(S)`
* Each 1 in the matrix is divided by N.
* So if `len(S) = 3`, the resulting matrix `U` becomes:
```
[[1/3  1/3  1/3]
 [1/3  1/3  1/3]
 [1/3  1/3  1/3]]
```

In [22]:
U[0].sum()

1.0

In [23]:
# smoothed similarity matrix
factor = 0.15
S = (1 - factor) * S + factor * U

🎲 Suppose:
You have 3 sentences (n = 3), and your row-normalized similarity matrix S is:

```python
S = [[0.5, 0.3, 0.2],
     [0.1, 0.6, 0.3],
     [0.4, 0.4, 0.2]]
```

Each row sums to 1 ✅

And your uniform matrix U = 1/3 everywhere:

```python
U = [[1/3, 1/3, 1/3],
     [1/3, 1/3, 1/3],
     [1/3, 1/3, 1/3]]
```

Let’s say factor = 0.15, then (1 - factor) = 0.85

🧮 Apply Formula:

```
S_new = 0.85 * S + 0.15 * U
```

Take the first row as an example:

```
S_new[0] = 0.85 * [0.5, 0.3, 0.2] + 0.15 * [1/3, 1/3, 1/3]
         = [0.425, 0.255, 0.17] + [0.05, 0.05, 0.05]
         = [0.475, 0.305, 0.22]
```

Now sum the row:

```
0.475 + 0.305 + 0.22 = 1.0
```

✅ Row still sums to 1. (If you round 1/3 to 0.33 before multiplying, the sum comes out as 0.9985; that gap is caused by the rounding, not by the formula.)

🧠 Why This Works (Mathematically)
You’re using:

```
S_new = (1 - d) * S + d * U
```

Since both S and U are row-stochastic matrices (rows sum to 1), any convex combination like this will also have rows summing to 1:

```
sum(row) = (1 - d) * 1 + d * 1 = 1
```
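
The same worked example can be run directly. This is a minimal sketch with the toy values from the explanation; `S_toy`, `U_toy`, and `d` are illustrative names. Because NumPy uses the exact floating-point value of 1/3 rather than the rounded 0.33, the rows sum to 1 up to machine precision.

```python
import numpy as np

S_toy = np.array([
    [0.5, 0.3, 0.2],
    [0.1, 0.6, 0.3],
    [0.4, 0.4, 0.2],
])
U_toy = np.ones_like(S_toy) / len(S_toy)  # every entry is 1/3

d = 0.15
S_new = (1 - d) * S_toy + d * U_toy

print(S_new[0])            # first row of the smoothed matrix
print(S_new.sum(axis=1))   # every row still sums to 1
```

A side effect worth noting: every entry of `S_new` is at least `d / n`, so no transition has zero probability, which is what guarantees a unique stationary distribution in the later steps.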

In [24]:
S[0].sum()

0.9999999999999999

In [25]:
# find the limiting / stationary distribution
eigenvals, eigenvecs = np.linalg.eig(S.T)

In [26]:
eigenvals

array([1.        , 0.24245466, 0.72108199, 0.67644122, 0.34790129,
       0.34417302, 0.3866884 , 0.40333562, 0.41608572, 0.44238593,
       0.63909999, 0.62556792, 0.58922572, 0.57452382, 0.48511399,
       0.51329157, 0.52975372])

In [27]:
eigenvecs[:,0]

array([-0.24206557, -0.27051337, -0.2213806 , -0.28613638, -0.25065894,
       -0.2499217 , -0.279622  , -0.21515455, -0.2226665 , -0.22745415,
       -0.2059112 , -0.20959727, -0.23526242, -0.24203809, -0.23663025,
       -0.2940483 , -0.20865607])

In [28]:
eigenvecs[:,0].dot(S)

array([-0.24206557, -0.27051337, -0.2213806 , -0.28613638, -0.25065894,
       -0.2499217 , -0.279622  , -0.21515455, -0.2226665 , -0.22745415,
       -0.2059112 , -0.20959727, -0.23526242, -0.24203809, -0.23663025,
       -0.2940483 , -0.20865607])

In [29]:
eigenvecs[:,0] / eigenvecs[:, 0].sum()

array([0.05907327, 0.06601563, 0.05402535, 0.06982824, 0.06117038,
       0.06099047, 0.06823848, 0.05250595, 0.05433915, 0.05550753,
       0.05025022, 0.05114976, 0.05741304, 0.05906657, 0.05774684,
       0.07175905, 0.05092007])

In [31]:
limiting_dist = np.ones(len(S)) / len(S)
threshold = 1e-8
delta = float('inf')
iters = 0
while delta > threshold:
    iters += 1

    # Markov transition
    p = limiting_dist.dot(S)

    # compute change in limiting distribution
    delta = np.abs(p - limiting_dist).sum()

    # update limiting distribution
    limiting_dist = p

print(iters)

41


In [32]:
limiting_dist

array([0.05907327, 0.06601563, 0.05402534, 0.06982824, 0.06117038,
       0.06099047, 0.06823848, 0.05250595, 0.05433915, 0.05550753,
       0.05025022, 0.05114977, 0.05741304, 0.05906657, 0.05774685,
       0.07175905, 0.05092008])

In [33]:
limiting_dist.sum()

0.9999999999999977

In [34]:
np.abs(eigenvecs[:,0] / eigenvecs[:,0].sum() - limiting_dist).sum()

1.9964739139677334e-08
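
The agreement between the two routes can be reproduced on a small matrix. The sketch below is self-contained and uses a made-up 3×3 row-stochastic matrix `P` (an assumption for illustration, not the notebook's 17×17 matrix). Unlike the cells above, it picks the eigenvalue closest to 1 explicitly, because `np.linalg.eig` does not promise any particular ordering of eigenvalues.

```python
import numpy as np

# toy row-stochastic transition matrix (each row sums to 1)
P = np.array([
    [0.475, 0.305, 0.220],
    [0.135, 0.560, 0.305],
    [0.390, 0.390, 0.220],
])

# eigenvector route: left eigenvectors of P are eigenvectors of P.T
vals, vecs = np.linalg.eig(P.T)
k = np.argmin(np.abs(vals - 1))      # index of the eigenvalue closest to 1
pi_eig = np.real(vecs[:, k])
pi_eig = pi_eig / pi_eig.sum()       # rescale so the entries sum to 1

# power-iteration route: repeatedly apply the transition matrix
pi_pow = np.ones(len(P)) / len(P)
for _ in range(1000):
    nxt = pi_pow.dot(P)
    converged = np.abs(nxt - pi_pow).sum() < 1e-12
    pi_pow = nxt
    if converged:
        break

print(pi_eig)
print(pi_pow)
```

Dividing by the sum also removes the arbitrary sign of the eigenvector, which is why the all-negative vector seen earlier turns into a valid probability distribution.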

In [35]:
scores = limiting_dist

In [36]:
sort_idx = np.argsort(-scores)

In [37]:
# Many options for how to choose which sentences to include:

# 1) top N sentences
# 2) top N words or characters
# 3) top X% sentences or top X% words
# 4) sentences with scores > average score
# 5) sentences with scores > factor * average score

# You also don't have to sort by score: printing the selected
# sentences in their original document order may read better.

print("Generated summary:")
for i in sort_idx[:5]:
    print(wrap("%.2f: %s" % (scores[i], sents[i])))

Generated summary:
0.07: "The retail sales figures are very weak, but as Bank of England
governor Mervyn King indicated last night, you don't really get an
accurate impression of Christmas trading until about Easter," said Mr
Shaw.
0.07: A number of retailers have already reported poor figures for
December.
0.07: The ONS echoed an earlier caution from Bank of England governor
Mervyn King not to read too much into the poor December figures.
0.07: Retail sales dropped by 1% on the month in December, after a
0.6% rise in November, the Office for National Statistics (ONS) said.
0.06: Clothing retailers and non-specialist stores were the worst hit
with only internet retailers showing any significant growth, according
to the ONS.
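
As one possible take on option 5 from the comments above, here is a sketch that keeps sentences scoring above `factor * average` and returns them in document order. The helper `select_above_average` and the `toy_scores` values are hypothetical and exist only for this illustration.

```python
import numpy as np

def select_above_average(scores, factor=1.0):
    """Indices of sentences scoring above factor * mean, in document order.

    Hypothetical helper for illustration; not part of the notebook above.
    """
    scores = np.asarray(scores)
    # flatnonzero returns ascending indices, i.e. the original sentence order
    return np.flatnonzero(scores > factor * scores.mean())

toy_scores = np.array([0.10, 0.30, 0.15, 0.25, 0.20])  # made-up scores
idx = select_above_average(toy_scores, factor=1.1)
print(idx)
```

With real data one would pass the `scores` array from the cell above and then print `sents[i]` for each returned index; the summary length then adapts to the document instead of being fixed at five sentences.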


In [38]:
doc.iloc[0].split("\n")[0]

'Christmas sales worst since 1981'

In [39]:
def summarize(text, factor = 0.15):
    # extract sentences
    sents = nltk.sent_tokenize(text)

    # perform tf-idf
    featurizer = TfidfVectorizer(
        stop_words=stopwords.words('english'),
        norm='l1')
    X = featurizer.fit_transform(sents)

    # compute similarity matrix
    S = cosine_similarity(X)

    # normalize similarity matrix
    S /= S.sum(axis=1, keepdims=True)

    # uniform transition matrix
    U = np.ones_like(S) / len(S)

    # smoothed similarity matrix
    S = (1 - factor) * S + factor * U

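    # find the limiting / stationary distribution
    eigenvals, eigenvecs = np.linalg.eig(S.T)

    # compute scores from the eigenvector whose eigenvalue is 1
    # (the largest one; np.linalg.eig does not guarantee it comes first)
    v = np.real(eigenvecs[:, np.argmax(np.real(eigenvals))])
    scores = v / v.sum()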

    # sort the scores
    sort_idx = np.argsort(-scores)

    # print summary
    for i in sort_idx[:5]:
        print(wrap("%.2f:%s" % (scores[i], sents[i])))

In [40]:
doc = df[df.labels == 'entertainment']['text'].sample(random_state=123)
summarize(doc.iloc[0].split("\n", 1)[1])

0.11:Goodrem, Green Day and the Black Eyed Peas took home two awards
each.
0.10:As well as best female, Goodrem also took home the Pepsi Viewers
Choice Award, whilst Green Day bagged the prize for best rock video
for American Idiot.
0.10:Other winners included Green Day, voted best group, and the Black
Eyed Peas.
0.10:The Black Eyed Peas won awards for best R 'n' B video and sexiest
video, both for Hey Mama.
0.10:Local singer and songwriter Missy Higgins took the title of
breakthrough artist of the year, with Australian Idol winner Guy
Sebastian taking the honours for best pop video.


In [41]:
doc.iloc[0].split("\n")[0]

'Goodrem wins top female MTV prize'