1.1. Extract all Twitter handles from the following text. A Twitter handle is the single word that appears after https://twitter.com/ and it contains only alphanumeric characters (A-Z, a-z, 0-9) and the underscore _.

In [None]:
import re

text = """
Follow our leader Elon musk on twitter here: https://twitter.com/elonmusk, more information
on Tesla's products can be found at https://www.tesla.com/. Also here are leading influencers
for tesla related news,
https://twitter.com/teslarati
https://twitter.com/dummy_tesla
https://twitter.com/dummy_2_tesla
"""
sentence = r'https://twitter\.com/([A-Za-z0-9_]+)'
twitter = re.findall(sentence, text)
print(twitter)


['elonmusk', 'teslarati', 'dummy_tesla', 'dummy_2_tesla']
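The same pattern can be wrapped in a small reusable function. This is a minimal sketch: the names `HANDLE_PATTERN` and `extract_handles` and the sample string are assumptions made for illustration, not part of the exercise.

```python
import re

# Same pattern as in the cell above, compiled once for reuse.
# HANDLE_PATTERN and extract_handles are hypothetical names.
HANDLE_PATTERN = re.compile(r'https://twitter\.com/([A-Za-z0-9_]+)')

def extract_handles(text):
    """Return every handle that follows https://twitter.com/ in text."""
    return HANDLE_PATTERN.findall(text)

sample = "See https://twitter.com/elonmusk, and https://twitter.com/dummy_2_tesla."
handles = extract_handles(sample)
print(handles)  # ['elonmusk', 'dummy_2_tesla']
```

The character class stops at the first character that is not a letter, digit or underscore, so the trailing comma and period are not captured.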


1.2. Extract the concentration risk types. Each one is the text that appears after "Concentration of Risk:". In the example below, your regex should extract these two strings:

(1) Credit Risk

(2) Supply Risk

In [None]:
import re

text = '''
Concentration of Risk: Credit Risk
Financial instruments that potentially subject us to a concentration of credit risk consist of cash, cash equivalents, marketable securities,
restricted cash, accounts receivable, convertible note hedges, and interest rate swaps. Our cash balances are primarily invested in money market funds
or on deposit at high credit quality financial institutions in the U.S. These deposits are typically in excess of insured limits. As of September 30, 2021
and December 31, 2020, no entity represented 10% or more of our total accounts receivable balance. The risk of concentration for our convertible note
hedges and interest rate swaps is mitigated by transacting with several highly-rated multinational banks.
Concentration of Risk: Supply Risk
We are dependent on our suppliers, including single source suppliers, and the inability of these suppliers to deliver necessary components of our
products in a timely manner at prices, quality levels and volumes acceptable to us, or our inability to efficiently manage these components from these
suppliers, could have a material adverse effect on our business, prospects, financial condition and operating results.
'''

pattern = r'Concentration of Risk: (.*?)\n'
concentration_risks = re.findall(pattern, text)
print(concentration_risks)


['Credit Risk', 'Supply Risk']
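The pattern above depends on a newline following every heading, so a heading on the very last line of a text without a trailing newline would be missed. One possible alternative, sketched here on a shortened sample string that is an assumption for illustration, anchors on line boundaries with `re.MULTILINE` instead.

```python
import re

# Shortened sample; note there is no newline after the last heading.
sample = "Concentration of Risk: Credit Risk\nSome details.\nConcentration of Risk: Supply Risk"

# Under re.MULTILINE, ^ and $ match at the start and end of each line,
# so the pattern no longer needs a literal \n after the risk type.
pattern = r'^Concentration of Risk: (.+)$'
risks = re.findall(pattern, sample, flags=re.MULTILINE)
print(risks)  # ['Credit Risk', 'Supply Risk']
```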


1.3. Companies in Europe report their financial numbers on a semi-annual basis, so you can have a document like this. To extract both quarterly and semi-annual periods, you can use a regex as shown below.

In [None]:
import re
text = '''
Tesla's gross cost of operating lease vehicles in FY2021 Q1 was $4.85 billion.
BMW's gross cost of operating vehicles in FY2021 S1 was $8 billion.
'''

pattern = r'FY(\d{4} (?:Q[1-4]|S[1-2]))'
matches = re.findall(pattern, text)

print("Matches:", matches)


Matches: ['2021 Q1', '2021 S1']
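If the year and the period are needed separately, a variant with two capturing groups is one option. This is a sketch on a shortened sample string (an assumption for illustration); with more than one group, `re.findall` returns tuples.

```python
import re

sample = "FY2021 Q1 was $4.85 billion. FY2021 S1 was $8 billion."

# Two capturing groups: the four-digit year, then the quarter or half-year.
pattern = r'FY(\d{4}) (Q[1-4]|S[1-2])'
periods = re.findall(pattern, sample)
print(periods)  # [('2021', 'Q1'), ('2021', 'S1')]
```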


2.1. Sentence & Word Tokenization in spaCy

In [None]:
import spacy
nlp = spacy.load("en_core_web_sm")

d = nlp("This is NLP based Goggle Collab notebook ")
for sentence in d.sents:
    print(sentence)
print("------------------------------------------")
for sentence in d.sents:
    for word in sentence:
        print(word)

This is NLP based Goggle Collab notebook
------------------------------------------
This
is
NLP
based
Goggle
Collab
notebook


2.2. Sentence & Word Tokenization in NLTK

In [None]:
from nltk.tokenize import sent_tokenize
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')
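# A notebook cell displays only the value of its last expression,
# so the sent_tokenize result is computed here but not shown in the output.
sent_tokenize("This is NLP based Goggle Collab notebook")
word_tokenize("This is NLP based Goggle Collab notebook")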


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


['This', 'is', 'NLP', 'based', 'Goggle', 'Collab', 'notebook']

3. Collecting dataset websites from a book paragraph

In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")

text = '''
Look for data to help you address the question. Governments are good
sources because data from public research is often freely available. Good
places to start include http://www.data.gov/, and http://www.science.
gov/, and in the United Kingdom, http://data.gov.uk/.
Two of my favorite data sets are the General Social Survey at http://www3.norc.org/gss+website/,
and the European Social Survey at http://www.europeansocialsurvey.org/.
'''

doc = nlp(text)
data_websites = [token.text for token in doc if token.like_url]

print("Data Websites:")
for website in data_websites:
    print(website)


Data Websites:
http://www.data.gov/
http://www.science
http://data.gov.uk/.
http://www3.norc.org/gss+website/
http://www.europeansocialsurvey.org/.
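Some of the tokens in the output above still carry the sentence punctuation that followed the URL. A minimal post-processing sketch, using three token strings copied from that output, strips trailing periods and commas. It does not repair `http://www.science`, which is cut short because the URL is split across two lines in the source text.

```python
# Token strings copied from the output above.
raw_urls = [
    "http://www.data.gov/",
    "http://data.gov.uk/.",
    "http://www.europeansocialsurvey.org/.",
]

# Remove sentence punctuation attached to the end of a URL token.
clean_urls = [url.rstrip(".,") for url in raw_urls]
print(clean_urls)
# ['http://www.data.gov/', 'http://data.gov.uk/', 'http://www.europeansocialsurvey.org/']
```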


4.1. Get all the proper nouns from a given text as a list and count them.

In [None]:
import spacy

text = '''Ravi and Raju are the best friends from school days. They wanted to go for a world tour and
visit famous cities like Paris, London, Dubai, Rome etc. They also called their another friend Mohan to take part in this world tour.
They started their journey from Hyderabad and spent the next 3 months travelling all the wonderful cities in the world and cherish happy moments!
'''

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)

proper_nouns = [token.text for token in doc if token.pos_ == "PROPN"]

print("Proper Nouns:", proper_nouns)
print("Count:", len(proper_nouns))


Proper Nouns: ['Raju', 'Paris', 'London', 'Dubai', 'Rome', 'Mohan', 'Hyderabad']
Count: 7


4.2. Get all company names from a given text and count them.

In [None]:
import spacy

text = '''The Top 5 companies in USA are Tesla, Walmart, Amazon, Microsoft, Google and the top 5 companies in
India are Infosys, Reliance, HDFC Bank, Hindustan Unilever and Bharti Airtel'''

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)

company_names = [ent.text for ent in doc.ents if ent.label_ == 'ORG']

print("Company Names:", company_names)
print("Count:", len(company_names))


Company Names: ['Tesla', 'Walmart', 'Amazon', 'Microsoft', 'Google', 'Infosys', 'Reliance', 'HDFC Bank', 'Hindustan Unilever', 'Bharti']
Count: 10


5.1. Stemming in NLTK

In [None]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["eating", "eats", "eat", "ate", "adjustable", "rafting", "ability", "meeting"]

stems = [(word, stemmer.stem(word)) for word in words]

for original, stemmed in stems:
    print(original, ",", stemmed)


eating , eat
eats , eat
eat , eat
ate , ate
adjustable , adjust
rafting , raft
ability , abil
meeting , meet


5.2. Lemmatization in spaCy

In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")
text = "The quick brown foxes are jumping over the lazy dogs"
doc = nlp(text)

for token in doc:
    print(f"{token.text} | {token.lemma_}")


The | the
quick | quick
brown | brown
foxes | fox
are | be
jumping | jump
over | over
the | the
lazy | lazy
dogs | dog


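5.3. Customizing the lemmatizer

In [None]:
# The attribute_ruler pipe is where custom token attributes (such as lemmas) can be registered.
# No custom rule is added in this cell, so the lemmas below match those in 5.2.
ar = nlp.get_pipe('attribute_ruler')
doc = nlp("The quick brown foxes are jumping over the lazy dogs")
for token in doc:
    print(token.text, "|", token.lemma_)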

The | the
quick | quick
brown | brown
foxes | fox
are | be
jumping | jump
over | over
the | the
lazy | lazy
dogs | dog


6.1. Convert the given text into its base form using both stemming and lemmatization.

In [None]:
from nltk.stem import PorterStemmer
import spacy

stemmer = PorterStemmer()
nlp = spacy.load("en_core_web_sm")

text = """Latha is very multi talented girl. She is good at many skills like dancing, running, singing, playing.
She also likes eating Pav Bhagi. She has a habit of fishing and swimming too. Besides all this, she is wonderful at cooking too.
"""
doc = nlp(text)
# text.split() keeps punctuation attached, so tokens such as "dancing," are not stemmed
stemmed_text_nltk = ' '.join([stemmer.stem(word) for word in text.split()])
lemmatized_text_spacy = ' '.join([token.lemma_ for token in doc])

print("\nStemmed Text (NLTK):\n", stemmed_text_nltk)
print("\nLemmatized Text (spaCy):\n", lemmatized_text_spacy)



Stemmed Text (NLTK):
 latha is veri multi talent girl. she is good at mani skill like dancing, running, singing, playing. she also like eat pav bhagi. she ha a habit of fish and swim too. besid all this, she is wonder at cook too.

Lemmatized Text (spaCy):
 Latha be very multi talented girl . she be good at many skill like dancing , running , singing , play . 
 she also like eat Pav Bhagi . she have a habit of fishing and swim too . besides all this , she be wonderful at cook too . 
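In the stemmed output above, words such as "dancing," and "running," come through unchanged because whitespace splitting leaves the comma attached to the word. One way to separate punctuation before stemming is a small regex tokenizer. This is a sketch using only the standard library, not the tokenizer that NLTK or spaCy use.

```python
import re

sample = "She is good at dancing, running, singing, playing."

# \w+ matches a run of word characters; [^\w\s] matches one punctuation mark.
tokens = re.findall(r"\w+|[^\w\s]", sample)
print(tokens)
# ['She', 'is', 'good', 'at', 'dancing', ',', 'running', ',', 'singing', ',', 'playing', '.']
```

Each word token can then be passed to the stemmer on its own, without trailing punctuation.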



7.1. POS tags

In [None]:
nlp = spacy.load("en_core_web_sm")
doc = nlp("Elon flew to mars yesterday. He carried biryani masala with him")

for token in doc:
    print(token," | ", token.pos_, " | ", spacy.explain(token.pos_))

Elon  |  PROPN  |  proper noun
flew  |  VERB  |  verb
to  |  ADP  |  adposition
mars  |  NOUN  |  noun
yesterday  |  NOUN  |  noun
.  |  PUNCT  |  punctuation
He  |  PRON  |  pronoun
carried  |  VERB  |  verb
biryani  |  ADJ  |  adjective
masala  |  NOUN  |  noun
with  |  ADP  |  adposition
him  |  PRON  |  pronoun


7.2. Removing all SPACE, PUNCT and X tokens from text (the is_alpha filter used here also drops numbers and symbols)

In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")

text = """Microsoft Corp. today announced the following results for the quarter ended December 31, 2021, as compared to the corresponding period of last fiscal year:

· Revenue was $51.7 billion and increased 20%
· Operating income was $22.2 billion and increased 24%
· Net income was $18.8 billion and increased 21%
· Diluted earnings per share was $2.48 and increased 22%
“Digital technology is the most malleable resource at the world’s disposal to overcome constraints and reimagine everyday work and life,” said Satya Nadella, chairman and chief executive officer of Microsoft. “As tech as a percentage of global GDP continues to increase, we are innovating and investing across diverse and growing markets, with a common underlying technology stack and an operating model that reinforces a common strategy, culture, and sense of purpose.”
“Solid commercial execution, represented by strong bookings growth driven by long-term Azure commitments, increased Microsoft Cloud revenue to $22.1 billion, up 32% year over year” said Amy Hood, executive vice president and chief financial officer of Microsoft."""


doc = nlp(text)
filtered_tokens = [token.text for token in doc if token.is_alpha]
cleaned_text = ' '.join(filtered_tokens)
print(filtered_tokens[:10])
print("\nCleaned Text:\n", cleaned_text)


['Microsoft', 'today', 'announced', 'the', 'following', 'results', 'for', 'the', 'quarter', 'ended']

Cleaned Text:
 Microsoft today announced the following results for the quarter ended December as compared to the corresponding period of last fiscal year Revenue was billion and increased Operating income was billion and increased Net income was billion and increased Diluted earnings per share was and increased Digital technology is the most malleable resource at the world disposal to overcome constraints and reimagine everyday work and life said Satya Nadella chairman and chief executive officer of Microsoft As tech as a percentage of global GDP continues to increase we are innovating and investing across diverse and growing markets with a common underlying technology stack and an operating model that reinforces a common strategy culture and sense of purpose Solid commercial execution represented by strong bookings growth driven by long term Azure commitments increased Microsoft Cloud

In [None]:
count = doc.count_by(spacy.attrs.POS)
for k,v in count.items():
    print(doc.vocab[k].text, "|",v)

PROPN | 13
NOUN | 46
VERB | 23
DET | 9
ADP | 16
NUM | 16
PUNCT | 27
SCONJ | 1
ADJ | 21
SPACE | 6
AUX | 6
SYM | 5
CCONJ | 12
ADV | 3
PART | 3
PRON | 2


8.1. Extract all the geographical names (cities, countries, states) from a given text (NER)

In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")

text = """Kiran wants to know the famous foods in each state of India. So, he opened Google and searched for this question. Google showed that
in Delhi it is Chaat, in Gujarat it is Dal Dhokli, in Tamilnadu it is Pongal, in Andhra Pradesh it is Biryani, in Assam it is Papaya Khar,
in Bihar it is Litti Chowkha and so on for all other states."""

doc = nlp(text)
geographical_entities = [ent.text for ent in doc.ents if ent.label_ == 'GPE']

print("Geographical Entities (GPE):", geographical_entities)


Geographical Entities (GPE): ['India', 'Delhi', 'Gujarat', 'Tamilnadu', 'Pongal', 'Andhra', 'Assam', 'Bihar']


8.2. Extract all the birth dates of cricketers in the given text

In [None]:
import spacy
nlp = spacy.load("en_core_web_sm")

text = """Sachin Tendulkar was born on 24 April 1973, Virat Kohli was born on 5 November 1988, Dhoni was born on 7 July 1981
and finally Ricky Ponting was born on 19 December 1974."""

doc = nlp(text)
birth_dates = [ent.text for ent in doc.ents if ent.label_ == 'DATE']

print("Birth Dates of Cricketers:", birth_dates)
print("Count:", len(birth_dates))


Birth Dates of Cricketers: ['24 April 1973', '5 November 1988', '7 July 1981', '19 December 1974']
Count: 4


10.1. Stop words: count the number of stop words in a given text.
Print the percentage of stop word tokens compared to all tokens in the text.

In [None]:
import spacy
nlp = spacy.load("en_core_web_sm")
text = """This is an example text to count the number of stop words. It contains some common stop words like 'the', 'is', 'and', etc."""

doc = nlp(text)
stop_word_count = sum([1 for token in doc if token.is_stop])
total_tokens = len(doc)
percentage_stop_words = (stop_word_count / total_tokens) * 100

# Print the results
print("Number of Stop Words:", stop_word_count)
print("Total Tokens:", total_tokens)
print("Percentage of Stop Words:", percentage_stop_words)


Number of Stop Words: 11
Total Tokens: 34
Percentage of Stop Words: 32.35294117647059
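The percentage calculation itself does not depend on spaCy. The sketch below reproduces it with plain Python on a whitespace-split sentence. The `STOP_WORDS` set and the `stop_word_percentage` helper are assumptions for illustration, and the set is far smaller than spaCy's built-in list, so the numbers differ from the output above.

```python
# Hypothetical, deliberately tiny stop-word set for illustration only.
STOP_WORDS = {"this", "is", "an", "to", "the", "of", "it", "some", "and"}

def stop_word_percentage(tokens):
    """Return (stop_word_count, percentage of tokens that are stop words)."""
    if not tokens:
        return 0, 0.0
    count = sum(1 for t in tokens if t.lower() in STOP_WORDS)
    return count, count / len(tokens) * 100

tokens = "This is an example text to count the number of stop words".split()
count, pct = stop_word_percentage(tokens)
print(count, pct)  # 6 50.0
```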


10.2. spaCy's default implementation treats "not" as a stop word, but in some scenarios removing "not" completely changes the meaning of the text.

In [None]:
def preprocess(text):
    doc = nlp(text)
    no_stop_words = ' '.join([token.text for token in doc if not token.is_stop])
    return no_stop_words

nlp.vocab['not'].is_stop = False

positive_text = preprocess('this is a good movie')
negative_text = preprocess('this is not a good movie')

print("Transformed Text 1:", positive_text)
print("Transformed Text 2:", negative_text)


Transformed Text 1: good movie
Transformed Text 2: not good movie


10.3. From a given text, output the most frequently used token after removing all stop word tokens and punctuation.

In [None]:
import spacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")

text1 = """The India men's national cricket team, also known as the Men in Blue, represents India in men's international cricket.
It is governed by the Board of Control for Cricket in India (BCCI), and is a Full Member of the International Cricket Council (ICC) with Test,
One Day International (ODI) and Twenty20 International (T20I) status. Cricket was introduced to India by British sailors in the 18th century, and the
first cricket club was established in 1792. India's national cricket team played its first Test match on 25 June 1932 at Lord's, becoming the sixth team to be
granted test cricket status."""

text = text1.lower()
doc = nlp(text)
filtered_tokens = [token.text.lower() for token in doc if not token.is_stop and not token.is_punct]

most_frequent_token = Counter(filtered_tokens).most_common(1)
result = most_frequent_token[0][0] if most_frequent_token else None

print("Most Frequently Used Token:", result)


Most Frequently Used Token: cricket


11.1. Word vectors in spaCy

In [None]:
import spacy
doc = nlp("dog cat banana kem")

for token in doc:
    print(token.text, "Vector:", token.has_vector, "OOV:", token.is_oov)

dog Vector: True OOV: True
cat Vector: True OOV: True
banana Vector: True OOV: True
kem Vector: True OOV: True


In [None]:
doc.vector.shape

(96,)

In [None]:
base_token = nlp("bread")
doc = nlp("bread sandwich burger car tiger human wheat")
for token in doc:
    print(f"{token.text} <-> {base_token.text}:", token.similarity(base_token))

bread <-> bread: 0.4868319181235256
sandwich <-> bread: 0.26677175290569444
burger <-> bread: 0.2403758463416168
car <-> bread: 0.21232416060990508
tiger <-> bread: 0.416205187946548
human <-> bread: 0.17155441719759282
wheat <-> bread: 0.6217515964505436




In [None]:
def print_similarity(base_word, words_to_compare):
    base_token = nlp(base_word)
    doc = nlp(words_to_compare)
    for token in doc:
        print(f"{token.text} <-> {base_token.text}: ", token.similarity(base_token))
print_similarity("iphone", "apple samsung iphone dog kitten")

apple <-> iphone:  0.26182551438999324
samsung <-> iphone:  0.05472672518057898
iphone <-> iphone:  0.30389348669405114
dog <-> iphone:  0.4053742132890232
kitten <-> iphone:  0.34180785111852907




12.1. Word Vectors Overview Using the Gensim Library

In [None]:
import gensim.downloader as api
# This is a huge model (~1.6 gb) and it will take some time to load

wv = api.load('word2vec-google-news-300')



In [None]:
wv.similarity(w1="great", w2="good")

0.729151

In [None]:

wv.most_similar("good")

[('great', 0.7291510105133057),
 ('bad', 0.7190051078796387),
 ('terrific', 0.6889115571975708),
 ('decent', 0.6837348341941833),
 ('nice', 0.6836092472076416),
 ('excellent', 0.644292950630188),
 ('fantastic', 0.6407778263092041),
 ('better', 0.6120728850364685),
 ('solid', 0.5806034803390503),
 ('lousy', 0.576420247554779)]

In [None]:
wv.most_similar("dog")


[('dogs', 0.8680489659309387),
 ('puppy', 0.8106428384780884),
 ('pit_bull', 0.780396044254303),
 ('pooch', 0.7627376914024353),
 ('cat', 0.7609457969665527),
 ('golden_retriever', 0.7500901818275452),
 ('German_shepherd', 0.7465174198150635),
 ('Rottweiler', 0.7437615394592285),
 ('beagle', 0.7418621778488159),
 ('pup', 0.740691065788269)]

In [None]:
wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=5)

[('queen', 0.7118193507194519),
 ('monarch', 0.6189674139022827),
 ('princess', 0.5902431011199951),
 ('crown_prince', 0.5499460697174072),
 ('prince', 0.5377321839332581)]

In [None]:
wv.most_similar(positive=['france', 'berlin'], negative=['paris'], topn=5)

[('germany', 0.5094343423843384),
 ('european', 0.48650455474853516),
 ('german', 0.4714890420436859),
 ('austria', 0.46964022517204285),
 ('swedish', 0.4645182490348816)]
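The `positive` / `negative` queries above can be read as vector arithmetic: add the positive vectors, subtract the negative ones, and look for the nearest remaining word. The sketch below illustrates that idea on toy 2-dimensional vectors. The vectors and the `analogy` helper are assumptions for illustration. Gensim also normalizes the vectors before combining them, which this simplified version skips.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Toy vectors invented for illustration: axis 0 ~ "royalty", axis 1 ~ "gender".
vectors = {
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.2],
}

def analogy(positive, negative, vectors):
    """Return the word closest (by cosine) to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for w in positive:
        target = [t + v for t, v in zip(target, vectors[w])]
    for w in negative:
        target = [t - v for t, v in zip(target, vectors[w])]
    # The query words themselves are excluded from the candidates.
    exclude = set(positive) | set(negative)
    candidates = [w for w in vectors if w not in exclude]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

result = analogy(["king", "woman"], ["man"], vectors)
print(result)  # queen
```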

In [None]:
wv.doesnt_match(["facebook", "cat", "google", "microsoft"])

'cat'

In [None]:

glv = api.load("glove-twitter-25")
glv.most_similar("good")



[('too', 0.9648017287254333),
 ('day', 0.9533665180206299),
 ('well', 0.9503170847892761),
 ('nice', 0.9438973665237427),
 ('better', 0.9425962567329407),
 ('fun', 0.9418926239013672),
 ('much', 0.9413353800773621),
 ('this', 0.9387555122375488),
 ('hope', 0.9383506774902344),
 ('great', 0.9378516674041748)]

In [None]:
glv.doesnt_match("breakfast cereal dinner lunch".split())

'cereal'

In [None]:
glv.doesnt_match("facebook cat google microsoft".split())


'cat'

In [None]:
glv.doesnt_match("banana grapes orange human".split())


'human'
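Roughly speaking, an odd-one-out query of this kind can be answered by averaging the unit-length vectors of all the words and returning the word that is least aligned with that average. The sketch below shows this idea on toy 2-dimensional vectors. The vectors and the `odd_one_out` helper are assumptions for illustration, not Gensim's actual code.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def odd_one_out(words, vectors):
    """Return the word whose unit vector is least aligned with the mean unit vector."""
    units = {w: unit(vectors[w]) for w in words}
    dim = len(next(iter(units.values())))
    mean = [sum(u[i] for u in units.values()) / len(units) for i in range(dim)]
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))

# Toy vectors invented for illustration only.
toy = {
    "banana": [1.0, 0.1],
    "grapes": [0.9, 0.2],
    "orange": [1.0, 0.0],
    "human":  [0.1, 1.0],
}
odd = odd_one_out(["banana", "grapes", "orange", "human"], toy)
print(odd)  # human
```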