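$$\mathrm{TF}_{ij} = \frac{n_{ij}}{\sum_k n_{kj}}$$

where $n_{ij}$ is the number of times term $i$ occurs in document $j$.

$$\mathrm{IDF}_i = \log\frac{N}{\mathrm{DF}_i}$$

where $N$ is the number of documents and $\mathrm{DF}_i$ is the document frequency of term $i$.

$$\text{TF-IDF}_{ij} = \mathrm{TF}_{ij} \times \mathrm{IDF}_i$$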
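
As a quick check of the formulas above, the following sketch computes TF, IDF, and TF-IDF by hand for a tiny made-up corpus. The corpus, the helper name `tf_idf`, the whitespace tokenization, and the choice of natural logarithm are assumptions for illustration. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ from this plain version.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Uses TF = count of term in doc / total terms in doc
    and IDF = log(N / document frequency of term).
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical toy corpus, not taken from the notebook's text.
toy_docs = ["air pollution harms health", "water pollution harms fish", "noise pollution"]
toy_weights = tf_idf(toy_docs)
print(toy_weights[0])
```

A term that appears in every document, such as "pollution" here, gets an IDF of log(1) = 0 and therefore contributes nothing to the score.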

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from spacy.lang.en import English
import numpy as np



**spaCy model for sentence tokenization**

In [None]:
nlp = English()
nlp.add_pipe('sentencizer')

<spacy.pipeline.sentencizer.Sentencizer at 0x7f84efb02140>

In [None]:

text_corpus = """
Global pollution refers to the presence of harmful substances in the environment that cause negative impacts on human health and the ecosystem. It affects both the natural and built environment, and can be caused by various human activities such as industrialization, transportation, and agriculture.

There are several forms of pollution that exist, including air, water, soil, and noise pollution. Air pollution is caused by emissions from factories, power plants, and vehicles. It results in the release of harmful chemicals such as sulfur dioxide, nitrogen oxides, and particulate matter into the air, which can cause respiratory problems, cardiovascular disease, and other health issues.

Water pollution occurs when chemicals and waste products are released into bodies of water, causing harm to aquatic life and threatening human health. Sources of water pollution include industrial effluent, agricultural runoff, and sewage. Soil pollution is caused by the accumulation of hazardous chemicals and waste products in the ground, which can result in the contamination of crops and other food sources.

Noise pollution is caused by excessive noise from various sources, including transportation, construction, and industrial activities. It can cause hearing damage and disrupt the balance of wildlife populations.

Global pollution is a major threat to human health and the environment, and it is essential that steps are taken to reduce its impact. This can be achieved through the adoption of sustainable practices in industry, transportation, and agriculture, as well as the implementation of environmental regulations and policies aimed at reducing pollution and protecting the environment.

In conclusion, global pollution is a complex and pressing issue that requires the cooperation and efforts of individuals, organizations, and governments worldwide to effectively address.




"""

**spaCy document for sentence-level tokenization**

In [None]:
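# Collapse newlines and repeated whitespace into single spaces so that
# consecutive paragraphs are not glued together (e.g. "agriculture.There")
doc = nlp(" ".join(text_corpus.split()))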
sentences = [sent.text.strip() for sent in doc.sents]

In [None]:
print("Sentences are: \n", sentences)

Sentences are: 
 ['Global pollution refers to the presence of harmful substances in the environment that cause negative impacts on human health and the ecosystem.', 'It affects both the natural and built environment, and can be caused by various human activities such as industrialization, transportation, and agriculture.', 'There are several forms of pollution that exist, including air, water, soil, and noise pollution.', 'Air pollution is caused by emissions from factories, power plants, and vehicles.', 'It results in the release of harmful chemicals such as sulfur dioxide, nitrogen oxides, and particulate matter into the air, which can cause respiratory problems, cardiovascular disease, and other health issues.', 'Water pollution occurs when chemicals and waste products are released into bodies of water, causing harm to aquatic life and threatening human health.', 'Sources of water pollution include industrial effluent, agricultural runoff, and sewage.', 'Soil pollution is caused by 

**Sentence organizer**

In [None]:
# Map each sentence to its position in the original text
sentence_organizer = {sentence: index for index, sentence in enumerate(sentences)}

In [None]:
print("Our sentence organizer: \n", sentence_organizer)

Our sentence organizer: 
 {'Global pollution refers to the presence of harmful substances in the environment that cause negative impacts on human health and the ecosystem.': 0, 'It affects both the natural and built environment, and can be caused by various human activities such as industrialization, transportation, and agriculture.': 1, 'There are several forms of pollution that exist, including air, water, soil, and noise pollution.': 2, 'Air pollution is caused by emissions from factories, power plants, and vehicles.': 3, 'It results in the release of harmful chemicals such as sulfur dioxide, nitrogen oxides, and particulate matter into the air, which can cause respiratory problems, cardiovascular disease, and other health issues.': 4, 'Water pollution occurs when chemicals and waste products are released into bodies of water, causing harm to aquatic life and threatening human health.': 5, 'Sources of water pollution include industrial effluent, agricultural runoff, and sewage.': 6,

**Creating TF-IDF model**

In [None]:
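tf_idf_vectorizer = TfidfVectorizer(min_df=2,  # ignore terms found in fewer than 2 sentences
                                    max_features=None,
                                    strip_accents='unicode',
                                    analyzer='word',
                                    token_pattern=r'\w{1,}',
                                    ngram_range=(1, 3),  # unigrams, bigrams and trigrams
                                    use_idf=True, smooth_idf=True,
                                    sublinear_tf=True,  # use 1 + log(tf) instead of raw tf
                                    stop_words='english')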
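
To make two of these settings concrete, the sketch below mimics in plain Python what `ngram_range=(1, 3)` and `min_df=2` do: build all 1- to 3-word n-grams per document, then keep only those that occur in at least two documents. The helper names `word_ngrams` and `build_vocabulary` and the toy corpus are hypothetical, and the whitespace tokenization is a simplification. scikit-learn additionally applies the token pattern, accent stripping, and stop-word removal before forming n-grams.

```python
from collections import Counter

def word_ngrams(tokens, min_n=1, max_n=3):
    """Return all contiguous word n-grams with min_n <= n <= max_n."""
    return [
        " ".join(tokens[start:start + n])
        for n in range(min_n, max_n + 1)
        for start in range(len(tokens) - n + 1)
    ]

def build_vocabulary(docs, min_df=2):
    """Keep n-grams that appear in at least min_df documents."""
    df = Counter()
    for doc in docs:
        # Use a set so each n-gram counts once per document.
        df.update(set(word_ngrams(doc.lower().split())))
    return sorted(term for term, count in df.items() if count >= min_df)

# Hypothetical toy corpus, not taken from the notebook's text.
toy_docs = ["air pollution harms health", "water pollution harms fish", "noise pollution"]
toy_vocab = build_vocabulary(toy_docs)
print(toy_vocab)  # ['harms', 'pollution', 'pollution harms']
```

With only a handful of short sentences, `min_df=2` removes most bigrams and trigrams, so the surviving vocabulary is dominated by words and phrases that recur across sentences.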

In [None]:
tf_idf_vectorizer.fit(sentences)



In [None]:
sentence_vectors = tf_idf_vectorizer.transform(sentences)

In [None]:
# Score each sentence by summing the TF-IDF weights in its row
sentence_scores = np.array(sentence_vectors.sum(axis=1)).ravel()

print(len(sentences) == len(sentence_scores))

True
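
The score above is the row sum of the TF-IDF matrix, so a sentence containing more in-vocabulary terms tends to score higher. The sketch below reproduces that row sum on a small hand-written matrix, next to a mean-over-nonzero-entries variant that is one possible way to reduce the length bias. The matrix values and the helper names `row_sum_scores` and `mean_nonzero_scores` are made up for illustration, and the variant is not part of the original notebook.

```python
def row_sum_scores(matrix):
    """Score each row (sentence) by the sum of its weights."""
    return [sum(row) for row in matrix]

def mean_nonzero_scores(matrix):
    """Score each row by the mean of its nonzero weights (0.0 for an all-zero row)."""
    scores = []
    for row in matrix:
        nonzero = [w for w in row if w != 0]
        scores.append(sum(nonzero) / len(nonzero) if nonzero else 0.0)
    return scores

# Hypothetical weights; the first two rows have unit L2 norm,
# like rows produced with the vectorizer's default normalization.
toy_matrix = [
    [0.5, 0.5, 0.5, 0.5],  # long sentence, four modest weights
    [0.0, 0.0, 1.0, 0.0],  # short sentence, one strong weight
    [0.0, 0.0, 0.0, 0.0],  # no in-vocabulary terms
]
print(row_sum_scores(toy_matrix))       # [2.0, 1.0, 0.0]
print(mean_nonzero_scores(toy_matrix))  # [0.5, 1.0, 0.0]
```

The two scorings rank the first two rows in opposite order, which shows how much the choice of aggregation can change which sentences reach the summary.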


In [None]:
# Getting top-n sentences
N = 5
top_n_sentences = [sentences[ind] for ind in np.argsort(sentence_scores, axis=0)[::-1][:N]]

In [None]:
mapped_top_n_sentences = [(sentence,sentence_organizer[sentence]) for sentence in top_n_sentences]
print("Top-N sentences with their original index: \n")
for element in mapped_top_n_sentences:
    print(element)

# Ordering our top-n sentences in their original ordering
mapped_top_n_sentences = sorted(mapped_top_n_sentences, key = lambda x: x[1])
ordered_scored_sentences = [element[0] for element in mapped_top_n_sentences]

# Our final summary
summary = " ".join(ordered_scored_sentences)

Top-N sentences with their original index: 

('Water pollution occurs when chemicals and waste products are released into bodies of water, causing harm to aquatic life and threatening human health.', 5)
('Soil pollution is caused by the accumulation of hazardous chemicals and waste products in the ground, which can result in the contamination of crops and other food sources.', 7)
('Noise pollution is caused by excessive noise from various sources, including transportation, construction, and industrial activities.', 8)
('Global pollution refers to the presence of harmful substances in the environment that cause negative impacts on human health and the ecosystem.', 0)
('It affects both the natural and built environment, and can be caused by various human activities such as industrialization, transportation, and agriculture.', 1)


In [None]:
print("Summary: \n", summary)

Summary: 
 Global pollution refers to the presence of harmful substances in the environment that cause negative impacts on human health and the ecosystem. It affects both the natural and built environment, and can be caused by various human activities such as industrialization, transportation, and agriculture. Water pollution occurs when chemicals and waste products are released into bodies of water, causing harm to aquatic life and threatening human health. Soil pollution is caused by the accumulation of hazardous chemicals and waste products in the ground, which can result in the contamination of crops and other food sources. Noise pollution is caused by excessive noise from various sources, including transportation, construction, and industrial activities.
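
The selection and reordering steps above can also be sketched as a single helper that works on sentence indices directly. This avoids the sentence-to-index dictionary, which would map duplicate sentences to a single position. The function name `summarize` and the toy inputs are hypothetical, and ties between equal scores may be broken differently than with `np.argsort`.

```python
def summarize(sentences, scores, n=5):
    """Return the n highest-scoring sentences joined in their original order."""
    # Rank sentence indices by score, highest first, and keep the top n.
    top_indices = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:n]
    # Restore document order by sorting the kept indices.
    return " ".join(sentences[i] for i in sorted(top_indices))

# Hypothetical toy inputs, not taken from the notebook's text.
toy_sentences = ["A first.", "B second.", "C third.", "D fourth."]
toy_scores = [0.2, 0.9, 0.1, 0.7]
toy_summary = summarize(toy_sentences, toy_scores, n=2)
print(toy_summary)  # B second. D fourth.
```

In the notebook's terms, a call such as `summarize(sentences, sentence_scores, n=N)` would play the role of the last three cells.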
