In [1]:
!pip install spacy

In [2]:
import spacy

In [3]:
# If the model is not installed yet, run `!python -m spacy download en_core_web_sm` first.
nlp = spacy.load('en_core_web_sm')

In [15]:
text = "Toxic discussions on open-source GitHub projects tend to involve entitlement, subtle insults, and arrogance, according to an academic study. That contrasts with the toxic behavior – typically bad language, hate speech, and harassment – found on other corners of the web. Whether that seems obvious or not, it's an interesting point to consider because, for one thing, it means technical and non-technical methods to detect and curb toxic behavior on one part of the internet may not therefore work well on GitHub, and if you're involved in communities on the code-hosting giant, you may find this research useful in combating trolls and unacceptable conduct. It may also mean systems intended to automatically detect and report toxicity in open-source projects, or at least ones on GitHub, may need to be developed specifically for that task due to their unique nature."

In [16]:
doc = nlp(text)

In [17]:
sentences = list(doc.sents)
print(len(sentences))


4


In [18]:
for sentence in sentences:
  print(sentence)

Toxic discussions on open-source GitHub projects tend to involve entitlement, subtle insults, and arrogance, according to an academic study.
That contrasts with the toxic behavior – typically bad language, hate speech, and harassment – found on other corners of the web.
Whether that seems obvious or not, it's an interesting point to consider because, for one thing, it means technical and non-technical methods to detect and curb toxic behavior on one part of the internet may not therefore work well on GitHub, and if you're involved in communities on the code-hosting giant, you may find this research useful in combating trolls and unacceptable conduct.
It may also mean systems intended to automatically detect and report toxicity in open-source projects, or at least ones on GitHub, may need to be developed specifically for that task due to their unique nature.


In [10]:
# The output below comes from an earlier run on the crypto text quoted at the end of this notebook.
# token.idx is the character offset of the token within the original text.
for token in doc:
  print("Token: {}, index: {}, lemmatized_token: {}".format(token, token.idx, token.lemma_))


Token: Whilst, index: 0, lemmatized_token: whilst
Token: the, index: 7, lemmatized_token: the
Token: crypto, index: 11, lemmatized_token: crypto
Token: markets, index: 18, lemmatized_token: market
Token: have, index: 26, lemmatized_token: have
Token: taken, index: 31, lemmatized_token: take
Token: a, index: 37, lemmatized_token: a
Token: pretty, index: 39, lemmatized_token: pretty
Token: public, index: 46, lemmatized_token: public
Token: downturn, index: 53, lemmatized_token: downturn
Token: of, index: 62, lemmatized_token: of
Token: late, index: 65, lemmatized_token: late
Token: ,, index: 69, lemmatized_token: ,
Token: this, index: 71, lemmatized_token: this
Token: is, index: 76, lemmatized_token: be
Token: a, index: 79, lemmatized_token: a
Token: great, index: 81, lemmatized_token: great
Token: time, index: 87, lemmatized_token: time
Token: to, index: 92, lemmatized_token: to
Token: experiment, index: 95, lemmatized_token: experiment
Token: and

In [19]:
from collections import Counter

# Keep only content tokens: drop stop words and punctuation.
words = [token.text for token in doc if not token.is_stop and not token.is_punct]
word_freq = Counter(words)

# The five most frequent words (counting is case-sensitive).
common_words = word_freq.most_common(5)

print(common_words)
print("Unique words-------------------------")
# Words that occur exactly once.
unique_words = [word for word, freq in word_freq.items() if freq == 1]
print(unique_words)

[('GitHub', 3), ('open', 2), ('source', 2), ('projects', 2), ('toxic', 2)]
Unique words-------------------------
['Toxic', 'discussions', 'tend', 'involve', 'entitlement', 'subtle', 'insults', 'arrogance', 'according', 'academic', 'study', 'contrasts', 'typically', 'bad', 'language', 'hate', 'speech', 'harassment', 'found', 'corners', 'web', 'obvious', 'interesting', 'point', 'consider', 'thing', 'means', 'non', 'methods', 'curb', 'internet', 'work', 'involved', 'communities', 'code', 'hosting', 'giant', 'find', 'research', 'useful', 'combating', 'trolls', 'unacceptable', 'conduct', 'mean', 'systems', 'intended', 'automatically', 'report', 'toxicity', 'ones', 'need', 'developed', 'specifically', 'task', 'unique', 'nature']
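
The counts above are case-sensitive, which is why 'Toxic' shows up among the unique words while 'toxic' is counted twice. Below is a minimal sketch of case-folded counting. It assumes plain strings in place of spaCy tokens, and the helper name `count_case_insensitive` and the sample list are illustrative, not part of the notebook.

```python
from collections import Counter

def count_case_insensitive(words):
    """Count words after lowercasing so 'Toxic' and 'toxic' share one entry."""
    return Counter(word.lower() for word in words)

# Hypothetical sample standing in for the filtered token texts.
sample = ["Toxic", "discussions", "toxic", "behavior", "toxic", "behavior", "GitHub"]
freq = count_case_insensitive(sample)
print(freq.most_common(2))
```

With real spaCy tokens, the same effect should be achievable by collecting `token.lower_` instead of `token.text`.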


In [12]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation
from collections import Counter
from heapq import nlargest

In [20]:
keyword = []
stopwords = set(STOP_WORDS)
pos_tag = ['PROPN', 'ADJ', 'NOUN', 'VERB']
# 1. Collect content words: skip stop words and punctuation, keep selected parts of speech.
for token in doc:
    if token.text in stopwords or token.text in punctuation:
        continue
    if token.pos_ in pos_tag:
        keyword.append(token.text)
print(keyword)
freq_word = Counter(keyword)
print(freq_word.most_common(5))
print("2. Sentence Strength =====================================")
# 2. Score each sentence by summing the frequencies of the keywords it contains.
sent_strength = {}
for sent in doc.sents:
    for word in sent:
        if word.text in freq_word:
            sent_strength[sent] = sent_strength.get(sent, 0) + freq_word[word.text]
print(sent_strength)

['Toxic', 'discussions', 'open', 'source', 'GitHub', 'projects', 'tend', 'involve', 'entitlement', 'subtle', 'insults', 'arrogance', 'according', 'academic', 'study', 'contrasts', 'toxic', 'behavior', 'bad', 'language', 'hate', 'speech', 'harassment', 'found', 'corners', 'web', 'obvious', 'interesting', 'point', 'consider', 'thing', 'means', 'technical', 'non', 'technical', 'methods', 'detect', 'curb', 'toxic', 'behavior', 'internet', 'work', 'GitHub', 'involved', 'communities', 'code', 'hosting', 'giant', 'find', 'research', 'useful', 'combating', 'trolls', 'unacceptable', 'conduct', 'mean', 'systems', 'intended', 'detect', 'report', 'toxicity', 'open', 'source', 'projects', 'ones', 'GitHub', 'need', 'developed', 'task', 'unique', 'nature']
[('GitHub', 3), ('open', 2), ('source', 2), ('projects', 2), ('toxic', 2)]
{Toxic discussions on open-source GitHub projects tend to involve entitlement, subtle insults, and arrogance, according to an academic study.: 20, That contrasts with the to
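
Each sentence's strength is a raw sum of keyword frequencies, so a long sentence may score higher simply because it contains more keywords. One possible variation, which this notebook does not use, divides the sum by the sentence length. The sketch below assumes plain word lists in place of spaCy sentence spans; the function name `sentence_scores`, the `normalize` flag, and the toy data are all hypothetical.

```python
from collections import Counter

def sentence_scores(sentences, freq_word, normalize=False):
    """Score each sentence (a list of word strings) by summed keyword frequency."""
    scores = {}
    for i, words in enumerate(sentences):
        total = sum(freq_word[w] for w in words if w in freq_word)
        if normalize and words:
            # Divide by sentence length so long sentences are not favored.
            total = total / len(words)
        scores[i] = total
    return scores

# Hypothetical toy data: a short sentence and a long one sharing the same keywords.
sentences = [
    ["GitHub", "projects", "toxic"],
    ["GitHub", "projects", "toxic", "GitHub", "the", "of", "and", "a", "on", "in"],
]
freq_word = Counter({"GitHub": 3, "projects": 2, "toxic": 2})

raw = sentence_scores(sentences, freq_word)
normalized = sentence_scores(sentences, freq_word, normalize=True)
print(raw)
print(normalized)
```

In this toy case the long sentence wins on the raw sum, while the short one wins once scores are divided by length. Which behavior is preferable depends on the kind of summary wanted.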

In [21]:
# Pick the three highest-scoring sentences (returned in descending score order).
summarized_sentences = nlargest(3, sent_strength, key=sent_strength.get)
print(summarized_sentences)

[Whether that seems obvious or not, it's an interesting point to consider because, for one thing, it means technical and non-technical methods to detect and curb toxic behavior on one part of the internet may not therefore work well on GitHub, and if you're involved in communities on the code-hosting giant, you may find this research useful in combating trolls and unacceptable conduct., It may also mean systems intended to automatically detect and report toxicity in open-source projects, or at least ones on GitHub, may need to be developed specifically for that task due to their unique nature., Toxic discussions on open-source GitHub projects tend to involve entitlement, subtle insults, and arrogance, according to an academic study.]
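
`nlargest` returns sentences ordered by score, not by their position in the text, so the printed list starts with the third sentence of the original. A minimal sketch of restoring document order before joining the result into one string follows. It assumes plain strings and a parallel list of scores; the helper `top_sentences_in_order` and the sample values are hypothetical, not the notebook's actual data.

```python
from heapq import nlargest

def top_sentences_in_order(sentences, scores, n):
    """Pick the n highest-scoring sentences, then return them in original order."""
    top_indices = nlargest(n, range(len(sentences)), key=lambda i: scores[i])
    return [sentences[i] for i in sorted(top_indices)]

# Hypothetical sentences and scores.
sentences = ["First point.", "Minor aside.", "Key finding.", "Closing remark."]
scores = [20, 5, 40, 12]

summary = " ".join(top_sentences_in_order(sentences, scores, n=3))
print(summary)
```

With spaCy spans, sorting the selected sentences by `sent.start` and joining `sent.text` should give the same kind of result.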


**Input sentence:**
Whilst the crypto markets have taken a pretty public downturn of late, this is a great time to experiment and build your own web3 projects. Ether, and almost all other cryptocurrencies, are cheaper to get your hands on, gas fees (the transaction fees paid when using blockchain networks) are much lower than their 2021 highs and now is your chance to learn a new skill that could be very valuable going forward.

**Output summary:**
Ether, and almost all other cryptocurrencies, are cheaper to get your hands on, gas fees (the transaction fees paid when using blockchain networks) are much lower than their 2021 highs and now is your chance to learn a new skill that could be very valuable going forward., Whilst the crypto markets have taken a pretty public downturn of late, this is a great time to experiment and build your own web3 projects.
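
The input here has only two sentences, so asking for the three highest-scoring ones returns both of them and the summary is the full input in score order. One way to avoid this is to derive the number of sentences from the input length rather than fixing it at three. The sketch below is an assumption, not the notebook's method; the helper `summary_size` and its `divisor` parameter are hypothetical.

```python
def summary_size(num_sentences, divisor=3):
    """Keep roughly one sentence in every `divisor`, but at least one."""
    if num_sentences <= 0:
        return 0
    return max(1, num_sentences // divisor)

# Sentence counts matching the two examples in this notebook, plus a longer text.
for count in (2, 4, 10):
    print(count, "->", summary_size(count))
```

The returned value could then be passed as the first argument to `nlargest` in place of the fixed `3`.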