In [1]:
!pip install pandas



1. Data Loading and Inspection

Load the tennis articles dataset from the CSV file using pandas.
Explore the dataset using .head() and .info() to understand its structure.
Drop the article_title column to simplify the dataset.

In [6]:
# Mount Google Drive (or upload directly)
from google.colab import drive
drive.mount('/content/drive')

import pandas as pd

df = pd.read_csv("/content/tennis_articles.csv", encoding='latin1')
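# Drop the title column; only the article text is needed downstream
df = df.drop(columns=["article_title"])
print(df.head())
df.info()  # info() prints its report itself and returns None, so no print() is needed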


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
   article_id                                       article_text  \
0           1  Maria Sharapova has basically no friends as te...   
1           2  BASEL, Switzerland (AP)  Roger Federer advanc...   
2           3  Roger Federer has revealed that organisers of ...   
3           4  Kei Nishikori will try to end his long losing ...   
4           5  Federer, 37, first broke through on tour over ...   

                                              source  
0  https://www.tennisworldusa.org/tennis/news/Mar...  
1  http://www.tennis.com/pro-game/2018/10/copil-s...  
2  https://scroll.in/field/899938/tennis-roger-fe...  
3  http://www.tennis.com/pro-game/2018/10/nishiko...  
4  https://www.express.co.uk/sport/tennis/1036101...  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 3 columns):
 #   Column        Non-Null Coun

2. Sentence Tokenization

Use nltk.sent_tokenize() to split the article_text into individual sentences.
Flatten the resulting list of sentence lists into a single list of all sentences.

In [7]:
!pip install nltk



In [8]:
import nltk
from nltk.tokenize import sent_tokenize

In [10]:
# Download the punkt tokenizer model
nltk.download('punkt_tab')

# Apply sent_tokenize to each article text and create a list of lists
sentences_lists = df['article_text'].apply(sent_tokenize).tolist()

# Flatten the list of lists into a single list of sentences
all_sentences = [sentence for sublist in sentences_lists for sentence in sublist]

# Example output
print(all_sentences[:10])  # print first 10 sentences

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


['Maria Sharapova has basically no friends as tennis players on the WTA Tour.', "The Russian player has no problems in openly speaking about it and in a recent interview she said: 'I don't really hide any feelings too much.", 'I think everyone knows this is my job here.', "When I'm on the courts or when I'm on the court playing, I'm a competitor and I want to beat every single person whether they're in the locker room or across the net.", "So I'm not the one to strike up a conversation about the weather and know that in the next few minutes I have to go and try to win a tennis match.", "I'm a pretty competitive girl.", "I say my hellos, but I'm not sending any players flowers as well.", "Uhm, I'm not really friendly or close to many players.", "I have not a lot of friends away from the courts.'", 'When she said she is not really close to a lot of players, is that something strategic that she is doing?']
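
The flattening step can also be written with `itertools.chain.from_iterable`, which yields the same order as the nested comprehension. The sketch below uses a small hand-written list (`sentences_lists_demo`, a hypothetical stand-in for the output of `sent_tokenize`) so it runs without NLTK.

```python
from itertools import chain

# Hypothetical stand-in for the per-article sentence lists produced by sent_tokenize.
sentences_lists_demo = [
    ["First article, sentence one.", "First article, sentence two."],
    ["Second article, only sentence."],
]

# chain.from_iterable walks each inner list in order, like the nested comprehension.
all_sentences_demo = list(chain.from_iterable(sentences_lists_demo))

print(len(all_sentences_demo))
print(all_sentences_demo)
```

Either form keeps sentences grouped by article and in their original order, which matters later when the summary is re-sorted by sentence index.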


3. Download and Load GloVe Word Embeddings

Download the pre-trained GloVe vectors (e.g., glove.6B.100d.txt).
Load the embeddings into a Python dictionary where each word maps to its 100-dimensional vector.

In [11]:
import os
import requests
import zipfile
import numpy as np

In [14]:
# Step 1: Download GloVe embeddings (if not already downloaded)
glove_zip_url = "http://nlp.stanford.edu/data/glove.6B.zip"
glove_zip_path = "/content/glove.6B.zip"
glove_folder = "/content/glove.6B"

if not os.path.exists(glove_zip_path):
    print("Downloading GloVe embeddings...")
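    # Stream to disk in chunks so the large archive is never held fully in memory
    with requests.get(glove_zip_url, stream=True) as r:
        r.raise_for_status()  # fail early on HTTP errors instead of saving an error page
        with open(glove_zip_path, "wb") as f:
            for chunk in r.iter_content(chunk_size=1 << 20):
                f.write(chunk)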
else:
    print("GloVe zip already downloaded.")

# Step 2: Extract the zip (if not already extracted)
if not os.path.exists(glove_folder):
    print("Extracting GloVe embeddings...")
    with zipfile.ZipFile(glove_zip_path, 'r') as zip_ref:
        # Extract into glove_folder so the existence check above works on re-runs
        zip_ref.extractall(glove_folder)
else:
    print("GloVe folder already extracted.")

# Step 3: Load the 100-dimensional embeddings into a dictionary
# (join with a relative file name; an absolute second argument would discard glove_folder)
glove_path = os.path.join(glove_folder, "glove.6B.100d.txt")

embeddings_index = {}
with open(glove_path, 'r', encoding='utf8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        vector = np.array(values[1:], dtype='float32')
        embeddings_index[word] = vector

print(f"Loaded {len(embeddings_index)} word vectors.")


GloVe zip already downloaded.
Extracting GloVe embeddings...
Loaded 400000 word vectors.
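
To make the file layout concrete, the sketch below parses two made-up lines in the same "token followed by floats" layout that the loading loop expects. The helper name `load_vectors` and the 3-component vectors are assumptions for illustration; the real file has 100 components per line.

```python
import io

# Two made-up lines in the GloVe text layout: a token followed by its float components.
# Real glove.6B.100d.txt lines carry 100 components; 3 are used here for brevity.
fake_glove_file = io.StringIO(
    "tennis 0.1 -0.2 0.3\n"
    "court 0.4 0.5 -0.6\n"
)

def load_vectors(handle):
    """Parse 'word v1 v2 ...' lines into a dict of word -> list of floats."""
    index = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        index[parts[0]] = [float(x) for x in parts[1:]]
    return index

demo_index = load_vectors(fake_glove_file)
print(demo_index["tennis"])
print(len(demo_index))
```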


4. Text Cleaning and Normalization

Remove punctuation, special characters, and numbers using regex.
Convert all sentences to lowercase to avoid case-sensitive mismatch.
Remove stop words using nltk.corpus.stopwords to reduce noise in the data.

In [16]:
import re
from nltk.corpus import stopwords

In [17]:
# Download stopwords if not already done
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

def clean_text(text):
    # Remove punctuation, special characters, and numbers
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Convert to lowercase
    text = text.lower()
    # Tokenize by splitting on whitespace
    words = text.split()
    # Remove stop words
    filtered_words = [word for word in words if word not in stop_words]
    # Join back into cleaned sentence
    return ' '.join(filtered_words)

# Example: clean all sentences in your list `all_sentences` from previous step
cleaned_sentences = [clean_text(sentence) for sentence in all_sentences]

# Show some cleaned sentences
print(cleaned_sentences[:10])


['maria sharapova basically friends tennis players wta tour', 'russian player problems openly speaking recent interview said dont really hide feelings much', 'think everyone knows job', 'im courts im court playing im competitor want beat every single person whether theyre locker room across net', 'im one strike conversation weather know next minutes go try win tennis match', 'im pretty competitive girl', 'say hellos im sending players flowers well', 'uhm im really friendly close many players', 'lot friends away courts', 'said really close lot players something strategic']


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
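
The cleaned output above contains tokens such as "dont" and "im". A likely reason is that the regex deletes the apostrophe before stop-word filtering, so a contraction no longer matches a stop-word entry such as "don't". The sketch below reproduces this on one sentence; `clean_text_demo` and the tiny `demo_stop_words` set are hypothetical stand-ins for `clean_text` and NLTK's English list.

```python
import re

# Small hypothetical stop-word set; the notebook uses NLTK's full English list.
demo_stop_words = {"i", "any", "don't"}

def clean_text_demo(text, stop_words):
    # Same steps as clean_text: strip non-letters, lowercase, split, filter.
    text = re.sub(r"[^a-zA-Z\s]", "", text).lower()
    return " ".join(w for w in text.split() if w not in stop_words)

# "don't" becomes "dont" before filtering, so it is not recognised as a stop word.
print(clean_text_demo("I don't really hide any feelings.", demo_stop_words))
```

If these leftover tokens turn out to matter, one option is to filter stop words before removing punctuation; whether that improves the summary here has not been tested.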


5. Sentence Vectorization

For each cleaned sentence:
Split into words.
Replace each word with its GloVe vector (use a zero-vector if the word is not in the embedding).
Compute the average of all word vectors in the sentence.
Store all resulting sentence vectors in a list.

In [18]:
embedding_dim = 100  # GloVe 100d vectors

def sentence_to_vector(sentence, embeddings_index, embedding_dim=100):
    words = sentence.split()
    if not words:
        # Empty sentence => return zero vector
        return np.zeros(embedding_dim)

    vectors = []
    for word in words:
        vec = embeddings_index.get(word)
        if vec is not None:
            vectors.append(vec)
        else:
            # Word not found in GloVe => use zero vector
            vectors.append(np.zeros(embedding_dim))
    # Average word vectors
    return np.mean(vectors, axis=0)

# Apply to all cleaned sentences
sentence_vectors = [sentence_to_vector(sent, embeddings_index, embedding_dim) for sent in cleaned_sentences]

# Example: shape of first sentence vector
print(sentence_vectors[0].shape)  # (100,)


(100,)
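
The averaging step can be checked by hand on a toy vocabulary. The sketch below uses hypothetical 2-dimensional embeddings and a plain-Python helper (`average_vector`, not part of the notebook) to show what a zero vector for an unknown word does to the mean.

```python
# Hypothetical 2-dimensional embeddings; the notebook uses 100-dimensional GloVe vectors.
toy_embeddings = {"tennis": [1.0, 3.0], "match": [3.0, 1.0]}
toy_dim = 2

def average_vector(sentence, embeddings, dim):
    words = sentence.split()
    if not words:
        return [0.0] * dim
    vectors = [embeddings.get(w, [0.0] * dim) for w in words]
    # Component-wise mean over all words, including zero vectors for unknown words.
    return [sum(component) / len(vectors) for component in zip(*vectors)]

known_only = average_vector("tennis match", toy_embeddings, toy_dim)
with_unknown = average_vector("tennis match zzz", toy_embeddings, toy_dim)
print(known_only)
print(with_unknown)
```

The unknown word shrinks the average toward zero but does not change its direction. Because cosine similarity ignores vector length, this shrinkage does not affect the similarities computed in the next step.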


6. Similarity Matrix Construction

Initialize an empty matrix of size (number of sentences × number of sentences).
Compute pairwise cosine similarity between sentence vectors.
Fill in the matrix such that each cell represents the similarity between two sentences.

In [19]:
from sklearn.metrics.pairwise import cosine_similarity

In [20]:
# Stack sentence vectors into a 2D array (if not already)
X = np.vstack(sentence_vectors)  # shape (num_sentences, 100)

# Compute cosine similarity matrix (num_sentences x num_sentences)
similarity_matrix = cosine_similarity(X)

print(similarity_matrix.shape)  # should be (num_sentences, num_sentences)

# Example: similarity between sentence 0 and sentence 1
print(similarity_matrix[0, 1])


(130, 130)
0.6426970974554298
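
For reference, the cosine similarity of two vectors is their dot product divided by the product of their lengths. The plain-Python sketch below computes it for a few toy vectors; the helper name `cosine` and the choice to return 0.0 for an all-zero vector are assumptions made for this illustration.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

print(cosine([1.0, 0.0], [1.0, 0.0]))  # same direction
print(cosine([1.0, 0.0], [0.0, 1.0]))  # orthogonal
print(cosine([3.0, 4.0], [6.0, 8.0]))  # same direction, different length
```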


7. Graph Construction and Sentence Ranking

Convert the similarity matrix into a graph using networkx.
Apply the PageRank algorithm to score the importance of each sentence.

In [25]:
import networkx as nx

# similarity_matrix is your (num_sentences x num_sentences) numpy array

# Create a graph from the similarity matrix
# We'll use a weighted undirected graph, ignoring self-similarity (diagonal)
np.fill_diagonal(similarity_matrix, 0)  # Remove self-loops by zeroing diagonal

G = nx.from_numpy_array(similarity_matrix)

# Apply PageRank (weights are the edge weights)
pagerank_scores = nx.pagerank(G, weight='weight')
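
To make the ranking step less of a black box, here is a minimal power-iteration sketch of weighted PageRank in plain Python. It is not networkx's implementation: the function name `pagerank_sketch`, the fixed iteration count, and the toy 3×3 matrix are assumptions for illustration, and the sketch assumes every sentence has at least one positive-weight edge. The damping factor 0.85 matches the default `alpha` of `nx.pagerank`.

```python
def pagerank_sketch(weights, damping=0.85, iterations=100):
    n = len(weights)
    # Each sentence distributes its score to neighbours in proportion to edge weight.
    row_sums = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for j in range(n):
            incoming = sum(
                scores[i] * weights[i][j] / row_sums[i]
                for i in range(n)
                if row_sums[i] > 0
            )
            # Mix the score flowing in over edges with a uniform "teleport" share.
            new_scores.append((1.0 - damping) / n + damping * incoming)
        scores = new_scores
    return scores

# Toy symmetric similarity matrix with a zeroed diagonal (hypothetical values).
toy_similarity = [
    [0.0, 0.9, 0.1],
    [0.9, 0.0, 0.8],
    [0.1, 0.8, 0.0],
]

toy_scores = pagerank_sketch(toy_similarity)
print(toy_scores)
```

In this toy graph, sentence 1 is strongly similar to both other sentences, so it receives the highest score.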

8. Summarization

Sort all sentences based on their PageRank scores in descending order.
Extract the top N sentences (e.g., 10) as the final summary.
Print or return the summarized sentences.
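
The selection step can be sketched on toy data first. The scores and sentence labels below are hypothetical; `heapq.nlargest` is used as one way to pick the top N indices without sorting every score.

```python
import heapq

# Hypothetical PageRank scores keyed by sentence index.
toy_scores = {0: 0.10, 1: 0.35, 2: 0.05, 3: 0.30, 4: 0.20}
toy_sentences = ["s0", "s1", "s2", "s3", "s4"]

top_n = 3
# nlargest picks the top_n indices by score without sorting the whole dict.
top_indices = heapq.nlargest(top_n, toy_scores, key=toy_scores.get)
# Restore document order so the summary reads in the original sequence.
top_indices.sort()

print([toy_sentences[i] for i in top_indices])
```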

In [26]:
# Sort sentences by PageRank score descending
ranked_sentences = sorted(pagerank_scores.items(), key=lambda x: x[1], reverse=True)

N = 10  # number of sentences for summary

# Extract top N sentence indices (keep their original order if you want a coherent summary)
top_sentence_indices = [idx for idx, score in ranked_sentences[:N]]
top_sentence_indices.sort()  # optional: sort to keep original order in the text

# Get the original sentences (before cleaning, for better readability)
summary_sentences = [all_sentences[i] for i in top_sentence_indices]

print("Summary:")
for sent in summary_sentences:
    print("-", sent)


Summary:
- So I'm not the one to strike up a conversation about the weather and know that in the next few minutes I have to go and try to win a tennis match.
- Speaking at the Swiss Indoors tournament where he will play in Sundays final against Romanian qualifier Marius Copil, the world number three said that given the impossibly short time frame to make a decision, he opted out of any commitment.
- Major players feel that a big event in late November combined with one in January before the Australian Open will mean too much tennis and too little rest.
- Federer said earlier this month in Shanghai in that his chances of playing the Davis Cup were all but non-existent.
- He used his first break point to close out the first set before going up 3-0 in the second and wrapping up the win on his first match point.
- I felt like the best weeks that I had to get to know players when I was playing were the Fed Cup weeks or the Olympic weeks, not necessarily during the tournaments.
- I just f