# 1st test

In [None]:
import nltk
from nltk.tokenize import sent_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
nltk.download('punkt')

# This cell uses nltk to split the input text into sentences, and scikit-learn to
# build TF-IDF features and fit a K-means model over the sentence vectors.
# summarize() takes a text and the number of sentences to include in the summary,
# and is meant to return the list of sentences that make up the summary.


def summarize(text, num_sentences):
    # Preprocess the text data
    text = text.lower()
    sentences = sent_tokenize(text)
    print(len(sentences)) 

    # Next, check if the num_sentences argument is larger than the number of sentences
    if num_sentences > len(sentences):
        # If it is, set num_sentences to the length of the sentences list
        num_sentences = len(sentences)          

    # Extract features from the text data
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(sentences)

    # Cluster the sentence vectors; each cluster is meant to stand for one topic
    from sklearn.cluster import KMeans
    kmeans = KMeans(n_clusters=num_sentences, random_state=0).fit(X)
    clusters = kmeans.cluster_centers_
    print(clusters.shape)  # (n_clusters, n_features), not (n_clusters, n_sentences)

    # BUG: argsort over the first cluster center ranks vocabulary terms (columns of X),
    # so these are feature indices, not sentence indices
    summary_indices = kmeans.cluster_centers_.argsort()[:, ::-1][0, :num_sentences]
    print(summary_indices)

    # Feature indices can exceed the number of sentences, which triggers this error
    for i in summary_indices:
        if i >= len(sentences):
            raise IndexError('Invalid index: {}'.format(i))

    # NOTE: i-1 is also off by one; a sentence index should be used as sentences[i]
    summary = [sentences[i-1] for i in summary_indices if i < len(sentences)]

    return summary

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
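
The `TfidfVectorizer` used in `summarize()` turns each sentence into a weighted term vector. As a rough illustration of what those weights look like, here is a minimal standard-library sketch that assumes scikit-learn's default settings (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2-normalised rows, tokens of two or more word characters). The function name `tfidf_rows` and the two example documents are placeholders for illustration, not part of the notebook.

```python
import math
import re
from collections import Counter

# Tokens of two or more word characters (assumed to mirror TfidfVectorizer's default)
TOKEN_RE = re.compile(r"(?u)\b\w\w+\b")

def tfidf_rows(docs):
    """Return (idf, rows): idf per term and one L2-normalised weight dict per document."""
    tokenized = [TOKEN_RE.findall(d.lower()) for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term occurs
    df = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed IDF, so terms present in every document still get weight 1
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        weights = {t: tf[t] * idf[t] for t in tf}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return idf, rows

docs = ["The car is driven on the road", "The truck is driven on the highway"]
idf, rows = tfidf_rows(docs)
print(idf["the"], idf["car"])
```

Terms shared by both documents (such as "the") get the minimum IDF of 1, while terms unique to one document (such as "car") get a larger weight.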


In [None]:
text = '''Understanding TF-IDF for Machine Learning. A gentle introduction to term frequency-inverse document frequency. TF-IDF stands for term frequency-inverse document frequency and it is a measure, used in the fields of information retrieval (IR) and machine learning, that can quantify the importance or relevance of string representations (words, phrases, lemmas, etc)  in a document amongst a collection of documents (also known as a corpus).
TF-IDF can be broken down into two parts TF (term frequency) and IDF (inverse document frequency). What is TF (term frequency)?
Term frequency works by looking at the frequency of a particular term you are concerned with relative to the document. There are multiple measures, or ways, of defining frequency: Number of times the word appears in a document (raw count).
Term frequency adjusted for the length of the document (raw count of occurences divided by number of words in the document).
Logarithmically scaled frequency (e.g. log(1 + raw count)).
Boolean frequency (e.g. 1 if the term occurs, or 0 if the term does not occur, in the document).
What is IDF (inverse document frequency)?
Inverse document frequency looks at how common (or uncommon) a word is amongst the corpus. IDF is calculated as follows where t is the term (word) we are looking to measure the commonness of and N is the number of documents (d) in the corpus (D).. The denominator is simply the number of documents in which the term, t, appears in. 

IDF algorithm: idf(t, D) = log( N / count(d ∈ D : t ∈ d) )
Note: It can be possible for a term to not appear in the corpus at all, which can result in a divide-by-zero error. One way to handle this is to take the existing count and add 1. Thus making the denominator (1 + count). An example of how the  popular library scikit-learn handles this can be seen below.

Image with the sci-kit learn IDF algo, IDF(t) = log( (1+n) / (1 + df(t))) + 1 vs Standard notation idf algo IDF(t) = log(n/df(t))

The reason we need IDF is to help correct for words like “of”, “as”, “the”, etc. since they appear frequently in an English corpus. Thus by taking inverse document frequency, we can minimize the weighting of frequent terms while making infrequent terms have a higher impact.

Finally IDFs can also be pulled from either a background corpus, which corrects for sampling bias, or the dataset being used in the experiment at hand.

Putting it together: TF-IDF
To summarize the key intuition motivating TF-IDF is the importance of a term is inversely related to its frequency across documents.TF gives us information on how often a term appears in a document and IDF gives us information about the relative rarity of a term in the collection of documents. By multiplying these values together we can get our final TF-IDF value.

Image with full tf-idf algo: tf-idf(t,d,D) = tf(t,d) x idf(t,d)


The higher the TF-IDF score the more important or relevant the term is; as a term gets less relevant, its TF-IDF score will approach 0.

Where to use TF-IDF
As we can see, TF-IDF can be a very handy metric for determining how important a term is in a document. But how is TF-IDF used? There are three main applications for TF-IDF. These are in machine learning, information retrieval, and text summarization/keyword extraction.

Using TF-IDF in machine learning & natural language processing
Machine learning algorithms often use numerical data, so when dealing with textual data or any natural language processing (NLP) task, a sub-field of ML/AI dealing with text, that data first needs to be converted to a vector of numerical data by a process known as vectorization. TF-IDF vectorization involves calculating the TF-IDF score for every word in your corpus relative to that document and then putting that information into a vector (see image below using example documents “A” and “B”). Thus each document in your corpus would have its own vector, and the vector would have a TF-IDF score for every single word in the entire collection of documents. Once you have these vectors you can apply them to various use cases such as seeing if two documents are similar by comparing their TF-IDF vector using cosine similarity.
Image with each step of the tf-idf algo broken down with numbers and examples
A = “The car is driven on the road”; B = “The truck is driven on the highway” Image from freeCodeCamp - How to process textual data using TF-IDF in Python 

Using TF-IDF in information retrieval
TF-IDF also has use cases in the field of information retrieval, with one common example being search engines. Since TF-IDF can tell you about the relevant importance of a term based upon a document, a search engine can use TF-IDF to help rank search results based on relevance, with results which are more relevant to the user having higher TF-IDF scores.
Using TF-IDF in text summarization & keyword extraction
Since TF-IDF weights words based on relevance, one can use this technique to determine that the words with the highest relevance are the most important. This can be used to help summarize articles more efficiently or to simply determine keywords (or even tags) for a document.

Vectors & Word Embeddings: TF-IDF vs Word2Vec vs Bag-of-words vs BERT
As discussed above, TF-IDF can be used to vectorize text into a format more agreeable for ML & NLP techniques. However while it is a popular NLP algorithm it is not the only one out there.
Bag of Words
Bag of Words (BoW) simply counts the frequency of words in a document. Thus the vector for a document has the frequency of each word in the corpus for that document.  The key difference between bag of words and TF-IDF is that the former does not incorporate any sort of inverse document frequency (IDF)  and is only a frequency count (TF).

Word2Vec
Word2Vec is an algorithm that uses shallow 2-layer, not deep, neural networks to ingest a corpus and produce sets of vectors. Some key differences between TF-IDF and word2vec is that TF-IDF is a statistical measure that we can apply to terms in a document and then use that to form a vector whereas word2vec will produce a vector for a term and then more work may need to be done to convert that set of vectors into a singular vector or other format. Additionally TF-IDF does not take into consideration the context of the words in the corpus whereas word2vec does.

BERT - Bidirectional Encoder Representations from Transformers
BERT is an ML/NLP technique developed by Google that uses a transformer based ML model to  convert phrases, words, etc into vectors. Key differences between TF-IDF and BERT are as follows: TF-IDF does not take into account the semantic meaning or context of the words whereas BERT does. Also BERT uses deep neural networks as part of its architecture, meaning that it can be much more computationally expensive than TF-IDF which has no such requirements. 

Pros and cons of using TF-IDF
Pros of using TF-IDF
The biggest advantages of TF-IDF come from how simple and easy to use it is. It is simple to calculate, it is computationally cheap, and it is a simple starting point for similarity calculations (via TF-IDF vectorization + cosine similarity).
Cons of using TF-IDF
Something to be aware of is that TF-IDF cannot help carry semantic meaning. It considers the importance of the words due to how it weighs them, but it cannot necessarily derive the contexts of the words and understand importance that way.
Also as mentioned above, like BoW, TF-IDF ignores word order and thus compound nouns like “Queen of England” will not be considered as a “single unit”. This also extends to situations like negation with “not pay the bill” vs “pay the bill”, where the order makes a big difference. In both cases using NER tools and underscores, “queen_of_england” or “not_pay” are ways to handle treating the phrase as a single unit.
Another disadvantage is that it can suffer from memory-inefficiency since TF-IDF can suffer from the curse of dimensionality. Recall that the length of TF-IDF vectors is equal to the size of the vocabulary. In some classification contexts this may not be an issue but in other contexts like clustering this can be unwieldy as the number of documents increases. Thus looking into some of the above named alternatives (BERT, Word2Vec) may be necessary.
Conclusion
TF-IDF (Term Frequency - Inverse Document Frequency) is a handy algorithm that uses the frequency of words to determine how relevant those words are to a given document. It’s a relatively simple but intuitive approach to weighting words, allowing it to act as a great jumping off point for a variety of tasks. This includes building search engines, summarizing documents, or other tasks in the information retrieval and machine learning domains.'''

In [None]:
summarize(text, 5)

61
(5, 435)
[222 212 390 299 218]


IndexError: ignored
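
The printed shape `(5, 435)` shows that `cluster_centers_` has one row per cluster and one column per vocabulary term. The `argsort` call in `summarize()` therefore returns term indices (up to 434), not sentence indices (there are only 61 sentences), which is what raises the `IndexError`. One possible fix is to pick, for each cluster center, the sentence vector that lies closest to it. The sketch below shows that step with NumPy only; the helper name `nearest_sentence_per_center` and the toy arrays are hypothetical, and in the notebook the inputs would be `X.toarray()` and `kmeans.cluster_centers_`.

```python
import numpy as np

def nearest_sentence_per_center(X, centers):
    """Return, for each cluster center, the index of the closest row of X."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # Squared Euclidean distances, shape (n_clusters, n_sentences)
    diffs = centers[:, None, :] - X[None, :, :]
    dists = (diffs ** 2).sum(axis=2)
    # argmin over sentences, so every index is < X.shape[0]
    return dists.argmin(axis=1)

# Toy data: 4 "sentences" with 3 features each, and 2 cluster centers
X_toy = np.array([[1.0, 0.0, 0.0],
                  [0.9, 0.1, 0.0],
                  [0.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
centers_toy = np.array([[0.98, 0.02, 0.0],
                        [0.0, 0.9, 0.1]])
idx = nearest_sentence_per_center(X_toy, centers_toy)
print(idx)
```

Sorting the resulting indices before indexing into `sentences` would keep the summary in document order.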


# Imported module
---



In [None]:
pip install sumy

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sumy
  Downloading sumy-0.11.0-py2.py3-none-any.whl (97 kB)
     |████████████████████████████████| 97 kB 1.2 MB/s
Collecting breadability>=0.1.20
  Downloading breadability-0.1.20.tar.gz (32 kB)
Collecting pycountry>=18.2.23
  Downloading pycountry-22.3.5.tar.gz (10.1 MB)
     |████████████████████████████████| 10.1 MB 168 kB/s
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Collecting docopt<0.7,>=0.6.1
  Downloading docopt-0.6.2.tar.gz (25 kB)
Building wheels for collected packages: breadability, docopt, pycountry
  Building wheel for breadability (setup.py) ... done
  Created wheel for breadability: filename=breadability-0.1.20-py2.py3-none-any.whl size=21714 sha256=08e8ec150033db685d3e62d25c871341d88679771c3a8ae36de5200afecdfaad
  Stor

In [None]:
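from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from collections import defaultdict
from string import punctuation

# NOTE: stopwords.words() raises LookupError unless the 'stopwords' corpus has been
# fetched with nltk.download('stopwords'); this notebook only does that later, in Test 2.

def summarizee(text, n):
  """Return the n sentences whose words are most frequent in the text."""
  sents = sent_tokenize(text)
  assert n <= len(sents)
  # Word frequencies over the whole text, ignoring stopwords and punctuation
  word_sent = word_tokenize(text.lower())
  _stopwords = set(stopwords.words('english') + list(punctuation))

  word_sent=[word for word in word_sent if word not in _stopwords]
  freq = FreqDist(word_sent)

  # Score each sentence by the summed frequency of its words
  ranking = defaultdict(int)

  for i, sent in enumerate(sents):
    for w in word_tokenize(sent.lower()):
      if w in freq:
        ranking[i] += freq[w]

  # Keep the indices of the n highest-scoring sentences
  sents_idx = rank(ranking, n)
  return [sents[j] for j in sents_idx]

def rank(ranking, n):
  """Return the n keys of `ranking` with the largest values."""
  return sorted(ranking, key=ranking.get, reverse=True)[:n]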


In [None]:
summarizee(text,5)

LookupError: ignored

# Test 2: with GloVe pre-trained word representations

In [None]:
import pandas as pd
import numpy as np
import nltk
nltk.download('punkt')
import re
from nltk.corpus import stopwords

In [None]:
df = pd.read_csv("tennis.csv")
df.head()


In [None]:
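import random
# randrange excludes len(df), so i is always a valid row position
# (randint(0, len(df)) could return len(df), which is out of range)
i = random.randrange(len(df))
df['article_text'].iloc[i]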

In [None]:
from nltk.tokenize import sent_tokenize
sentences = []
for s in df['article_text']:
    sentences.append(sent_tokenize(s))

sentences = [y for x in sentences for y in x]

## Download GloVe file

In [None]:
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove*.zip

In [None]:
#get the path
!ls
!pwd

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

In [None]:
# adding GloVe: Global Vectors for Word Representation at https://nlp.stanford.edu/projects/glove/
# GloVe is an unsupervised learning algorithm for obtaining vector representations for words.
# Training is performed on aggregated global word-word co-occurrence statistics from a corpus, 
# and the resulting representations showcase interesting linear substructures of the word vector space.
import nltk
nltk.download('stopwords')

# Map each word to its 300-dimensional GloVe vector
word_embeddings = {}
with open('/content/glove.6B.300d.txt', encoding='utf8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        word_embeddings[word] = coefs

# Keep letters only; regex=True is needed because pandas >= 2.0 treats the pattern literally by default
clean_sentences = pd.Series(sentences).str.replace("[^a-zA-Z]", " ", regex=True)
clean_sentences = [s.lower() for s in clean_sentences]
stop_words = set(stopwords.words('english'))

def remove_stopwords(words):
    """Join the words of a tokenized sentence, dropping English stopwords."""
    return " ".join(w for w in words if w not in stop_words)

clean_sentences = [remove_stopwords(r.split()) for r in clean_sentences]

In [None]:
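# Represent each sentence as the mean of its word vectors;
# words missing from GloVe contribute a zero vector
sentence_vectors = []
for sentence in clean_sentences:
    words = sentence.split()
    if words:
        # The 0.001 in the denominator is a small smoothing constant
        v = sum(word_embeddings.get(w, np.zeros((300,))) for w in words) / (len(words) + 0.001)
    else:
        v = np.zeros((300,))
    sentence_vectors.append(v)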

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# Pairwise cosine similarity between all sentence vectors, computed in one call
# instead of one cosine_similarity call per pair
sim_mat = cosine_similarity(np.vstack(sentence_vectors))
# A sentence should not vote for itself, so zero the diagonal
np.fill_diagonal(sim_mat, 0)

In [None]:
import networkx as nx

nx_graph = nx.from_numpy_array(sim_mat)
scores = nx.pagerank(nx_graph)
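
To make the ranking step less of a black box, here is a minimal power-iteration sketch of PageRank over a similarity matrix. It is an illustration under stated assumptions (non-negative, symmetric similarities and a damping factor of 0.85), not a reproduction of networkx's implementation; the function name `pagerank_power_iteration` and the toy matrix are hypothetical.

```python
import numpy as np

def pagerank_power_iteration(sim, damping=0.85, tol=1e-6, max_iter=100):
    """Approximate PageRank scores for a non-negative similarity matrix."""
    sim = np.asarray(sim, dtype=float)
    n = sim.shape[0]
    row_sums = sim.sum(axis=1, keepdims=True)
    safe_sums = np.where(row_sums > 0, row_sums, 1.0)
    # Row-normalise into transition probabilities; isolated rows jump uniformly
    transition = np.where(row_sums > 0, sim / safe_sums, 1.0 / n)
    scores = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new_scores = (1 - damping) / n + damping * (transition.T @ scores)
        if np.abs(new_scores - scores).sum() < tol:
            return new_scores
        scores = new_scores
    return scores

# Toy similarity matrix for three sentences; sentence 1 is similar to both others
toy_sim = np.array([[0.0, 0.9, 0.1],
                    [0.9, 0.0, 0.8],
                    [0.1, 0.8, 0.0]])
toy_scores = pagerank_power_iteration(toy_sim)
print(toy_scores)
```

The sentence that is most similar to the others ends up with the highest score, which is the intuition behind using PageRank for extractive summarization.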

In [None]:
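from termcolor import colored

# Rank every sentence in the corpus by its PageRank score, highest first
ranked_sentences = sorted(((scores[i], s) for i, s in enumerate(sentences)), reverse=True)

# randrange excludes len(df), so i is always a valid row position
i = random.randrange(len(df))
print(colored("ARTICLE:".center(50), 'yellow'))
print('\n')
print(colored(df['article_text'].iloc[i], 'blue'))
print('\n')
print(colored("SUMMARY:".center(50), 'green'))
print('\n')
# The sentences were pooled from all articles, so the summary is corpus-wide:
# print the top-ranked sentences (top_n = 5 is an arbitrary choice)
top_n = 5
for _, sentence in ranked_sentences[:top_n]:
    print(colored(sentence, 'cyan'))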

# Test

In [None]:
# Import the necessary libraries
import re
import string
from collections import Counter

def get_important_sentences(text, num_sentences):
  # Split the text into sentences
  sentences = re.split(r'[.!?]', text)
  # Remove punctuation from each sentence (case is left unchanged)
  sentences = [re.sub(r'[^\w\s]', '', sentence) for sentence in sentences]

  # Count the frequency of each word in the text
  word_counts = Counter([word for sentence in sentences for word in sentence.split()])
  # Calculate the importance of each sentence by summing the
  # word frequencies of the words in the sentence
  sentence_importance = [sum([word_counts[word] for word in sentence.split()]) for sentence in sentences]

  # Sort the sentences by their importance
  important_sentences = [sentence for _, sentence in sorted(zip(sentence_importance, sentences), reverse=True)]

  # Return the most important num_sentences sentences
  return important_sentences[:num_sentences]


In [None]:
get_important_sentences(text, 2)

[' Some key differences between TFIDF and word2vec is that TFIDF is a statistical measure that we can apply to terms in a document and then use that to form a vector whereas word2vec will produce a vector for a term and then more work may need to be done to convert that set of vectors into a singular vector or other format',
 ' TFIDF stands for term frequencyinverse document frequency and it is a measure used in the fields of information retrieval IR and machine learning that can quantify the importance or relevance of string representations words phrases lemmas etc  in a document amongst a collection of documents also known as a corpus']
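
Because each score is a plain sum of word counts, long sentences tend to win, and both sentences returned above are among the longest in the text. One possible refinement, sketched here as an assumption rather than as part of the original approach, is to drop common stopwords and divide each score by the number of remaining words. The function name `rank_sentences_normalized`, the tiny `STOPWORDS` set, and the sample text are all hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not NLTK's list)
STOPWORDS = {"the", "a", "an", "of", "and", "in", "is", "to", "that", "it", "for", "on"}

def rank_sentences_normalized(text, num_sentences):
    """Rank sentences by average (not summed) word frequency."""
    raw = [s.strip() for s in re.split(r'[.!?]', text) if s.strip()]
    tokenized = [[w for w in re.findall(r'\w+', s.lower()) if w not in STOPWORDS]
                 for s in raw]
    counts = Counter(w for toks in tokenized for w in toks)
    scored = []
    for sentence, toks in zip(raw, tokenized):
        # Average frequency per word removes the advantage of long sentences
        score = sum(counts[w] for w in toks) / len(toks) if toks else 0.0
        scored.append((score, sentence))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sentence for _, sentence in scored[:num_sentences]]

sample = ("Cats purr. Cats purr when they are happy and they nap on warm "
          "windows all day long. Dogs bark.")
top = rank_sentences_normalized(sample, 1)
print(top)
```

With summed counts the long middle sentence would rank first; with the average, the short sentence made up entirely of frequent words does.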