<a href="https://colab.research.google.com/github/ITU-Business-Analytics-Team/Business_Analytics_for_Professionals/blob/main/Part%20I%20%3A%20Methods%20%26%20Technologies%20for%20Business%20Analytics/Chapter%207%3A%20Text%20Analytics/7_7_Advanced_Topics_in_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **7.	Advanced Topics in NLP**
### 7.1	Similarity

In [None]:
!pip install textdistance
import textdistance




#### Hamming Distance
The Hamming distance is the number of positions at which two strings of the same length have different characters.

In [None]:
textdistance.hamming('test', 'text')

1
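
The count can be reproduced by hand. The following is a minimal sketch, not the implementation used by `textdistance`; the helper name `hamming_distance` is an assumption made for illustration.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("this sketch only handles strings of equal length")
    # Compare the strings position by position and count the mismatches
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))

print(hamming_distance("test", "text"))  # 1: only the third character differs
```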

#### Jaccard Index
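The Jaccard index is the size of the intersection of two collections of tokens divided by the size of their union. The tokens are often words; when plain strings are passed, as here, `textdistance` compares characters and counts repeated characters.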

In [None]:
textdistance.jaccard('test', 'text')

0.6
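
The value 0.6 can be checked with a short sketch that treats each string as a bag (multiset) of characters. The function name `jaccard_multiset` is hypothetical, and the code only illustrates the formula; it is not taken from `textdistance`.

```python
from collections import Counter

def jaccard_multiset(a: str, b: str) -> float:
    """Jaccard index over characters, counting repeated characters."""
    count_a, count_b = Counter(a), Counter(b)
    intersection = sum((count_a & count_b).values())  # minimum count per character
    union = sum((count_a | count_b).values())         # maximum count per character
    return intersection / union if union else 1.0

# 'test' and 'text' share t, t, e (3 characters); the union is t, t, e, s, x (5 characters)
print(jaccard_multiset("test", "text"))  # 0.6
```

With plain sets instead of bags the result would be 2/4 = 0.5, because the repeated `t` would be counted only once.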

#### Levenshtein Distance
The Levenshtein distance is the minimum number of single-character insertions, deletions, and substitutions needed to convert the first text into the second.

In [None]:
textdistance.levenshtein('test', 'text')

1
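
The definition translates into a standard dynamic-programming routine. The sketch below is one common way to compute it, shown under the assumption that all three edit operations cost 1; `levenshtein_distance` is an illustrative name, not a `textdistance` function.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and substitutions."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, ch_b in enumerate(b, start=1):
            substitution_cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,                      # delete ch_a
                current[j - 1] + 1,                   # insert ch_b
                previous[j - 1] + substitution_cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]

print(levenshtein_distance("test", "text"))       # 1
print(levenshtein_distance("kitten", "sitting"))  # 3
```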

#### Cosine Similarity
Cosine similarity is the cosine of the angle between two vectors that represent the texts, for example vectors of character or word counts.

In [None]:
textdistance.cosine('test', 'text')

0.75
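
The result 0.75 is consistent with an overlap-based formula (shared characters divided by the geometric mean of the two lengths) rather than with the geometric cosine of character-count vectors. This is an observation about the output above, not a statement about the library's internals. The sketch below computes both variants for comparison; both function names are hypothetical.

```python
import math
from collections import Counter

def overlap_cosine(a: str, b: str) -> float:
    """Shared characters divided by the geometric mean of the lengths (non-empty strings)."""
    shared = sum((Counter(a) & Counter(b)).values())
    return shared / math.sqrt(len(a) * len(b))

def vector_cosine(a: str, b: str) -> float:
    """Cosine of the angle between character-count vectors (non-empty strings)."""
    count_a, count_b = Counter(a), Counter(b)
    dot = sum(count_a[ch] * count_b[ch] for ch in count_a)
    norm_a = math.sqrt(sum(v * v for v in count_a.values()))
    norm_b = math.sqrt(sum(v * v for v in count_b.values()))
    return dot / (norm_a * norm_b)

print(overlap_cosine("test", "text"))           # 0.75
print(round(vector_cosine("test", "text"), 3))  # 0.833
```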

### 7.2	POS Tagging
Part-of-speech (POS) tagging is an important part of natural language processing: each word is labeled with its word class, such as noun, verb, or adjective.

In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
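# Download the tokenizer and tagger models
# (recent NLTK releases may also require 'punkt_tab' and 'averaged_perceptron_tagger_eng')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
# Split the text into sentences
sentences = sent_tokenize("This book has written by academic members from Istanbul Technical University in Turkey")
# Tag the words of each sentence; tags keeps the result for the last sentence
for sentence in sentences:
    words = nltk.word_tokenize(sentence)
    # Drop stop words (the check is case-sensitive, so 'This' is kept)
    words = [w for w in words if w not in stop_words]
    # Assign a POS tag to each remaining word
    tags = nltk.pos_tag(words)
tags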

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


[('This', 'DT'),
 ('book', 'NN'),
 ('written', 'VBN'),
 ('academic', 'JJ'),
 ('members', 'NNS'),
 ('Istanbul', 'NNP'),
 ('Technical', 'NNP'),
 ('University', 'NNP'),
 ('Turkey', 'NNP')]
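
Once the text is tagged, the tags can be used to filter or summarize the words. The sketch below copies the tagged list from the output above and relies on the Penn Treebank convention that all noun tags start with `NN`; the variable names are chosen for illustration.

```python
from collections import Counter

# Tagged tokens copied from the output above
tags = [('This', 'DT'), ('book', 'NN'), ('written', 'VBN'), ('academic', 'JJ'),
        ('members', 'NNS'), ('Istanbul', 'NNP'), ('Technical', 'NNP'),
        ('University', 'NNP'), ('Turkey', 'NNP')]

# How often each tag occurs
tag_counts = Counter(tag for _, tag in tags)
# Common and proper nouns (NN, NNS, NNP, NNPS)
nouns = [word for word, tag in tags if tag.startswith('NN')]

print(tag_counts.most_common(1))  # [('NNP', 4)]
print(nouns)
```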

In [None]:
nltk.download('tagsets')
nltk.help.upenn_tagset()

[nltk_data] Downloading package tagsets to /root/nltk_data...
[nltk_data]   Package tagsets is already up-to-date!
$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutiha

### 7.3	Named Entity Recognition
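Named entity recognition (NER) finds the spans of text that refer to real-world entities and assigns each one a category, such as person (PERSON), organization (ORG), or geopolitical entity (GPE). The two libraries used below do not always agree: for the same sentence, the NLTK chunker labels "Istanbul Technical University" as PERSON, while spaCy labels it as ORG.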

In [None]:
import nltk
#Uncomment if you need necessary modules
#nltk.download('averaged_perceptron_tagger')
#nltk.download('maxent_ne_chunker')
#nltk.download('words')
from nltk import ne_chunk
from nltk import word_tokenize

In [None]:
nltk.download('maxent_ne_chunker')
nltk.download('words')
print(ne_chunk(nltk.pos_tag(word_tokenize("This book has written by academic members from Istanbul Technical University in Turkey")), binary=False))

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!
(S
  This/DT
  book/NN
  has/VBZ
  written/VBN
  by/IN
  academic/JJ
  members/NNS
  from/IN
  (PERSON Istanbul/NNP Technical/NNP University/NNP)
  in/IN
  (GPE Turkey/NNP))


In [None]:
#Uncomment if you need necessary libraries
#!pip install spacy
#!python -m spacy download en_core_web_sm
import spacy

In [None]:
nlp = spacy.load('en_core_web_sm')
# Read/create a sentence
doc = nlp(u'This book has written by academic members from Istanbul Technical University in Turkey')
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Istanbul Technical University 47 76 ORG
Turkey 80 86 GPE
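
The numbers printed next to each entity are character offsets into the original sentence (`start_char` inclusive, `end_char` exclusive). The sketch below uses the offsets from the output above to slice the entities back out of the text and group them by label, without needing spaCy itself; the `entities` and `by_label` names are illustrative.

```python
sentence = 'This book has written by academic members from Istanbul Technical University in Turkey'

# (start_char, end_char, label) triples copied from the spaCy output above
entities = [(47, 76, 'ORG'), (80, 86, 'GPE')]

# Group the entity texts by label
by_label = {}
for start, end, label in entities:
    by_label.setdefault(label, []).append(sentence[start:end])

print(by_label)  # {'ORG': ['Istanbul Technical University'], 'GPE': ['Turkey']}
```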


### 7.4	Clustering

In [None]:
import string
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import wordnet
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
# Read the data from a shared Google Drive file
url = 'https://drive.google.com/file/d/1ll7mM_-gc-8wqrqYoI2ojvv9hVtCut2R/view?usp=sharing'
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
cluster_df = pd.read_csv(path)
cluster_df[~cluster_df['Consumer complaint narrative'].isna()].head()

Unnamed: 0.1,Unnamed: 0,Date received,Product,Sub-product,Issue,Sub-issue,Consumer complaint narrative,Company public response,Company,State,ZIP code,Tags,Consumer consent provided?,Submitted via,Date sent to company,Company response to consumer,Timely response?,Consumer disputed?,Complaint ID
1,623002,08/07/2016,Student loan,Federal student loan servicing,Dealing with my lender or servicer,Having problems with customer service,Enrolled in paperless billing. Was not billed ...,,ACS Education Services,NY,110XX,,Consent provided,Web,08/07/2016,Closed with explanation,Yes,No,2049664
12,226883,11/11/2015,Debt collection,"Other (i.e. phone, health club, etc.)",Cont'd attempts collect debt not owed,Debt was paid,We lost our home ( of almost 14 years ) to for...,,"Diversified Consultants, Inc.",WI,539XX,,Consent provided,Web,11/11/2015,Closed with explanation,Yes,No,1650657
28,602022,09/20/2016,Credit reporting,,Credit reporting company's investigation,No notice of investigation status/result,Trans Union had a glitch in their system that ...,Company has responded to the consumer and the ...,"TransUnion Intermediate Holdings, Inc.",TX,770XX,,Consent provided,Web,09/20/2016,Closed with non-monetary relief,Yes,,2121069
32,378090,02/23/2016,Debt collection,Credit card,Disclosure verification of debt,Not disclosed as an attempt to collect,received notice from collection agency that I ...,,Synchrony Financial,NY,139XX,Older American,Consent provided,Web,02/25/2016,Closed with monetary relief,Yes,No,1799253
37,365525,11/02/2015,Credit card,,Closing/Cancelling account,,I have had a Citi Bank Credit card for several...,Company chooses not to provide a public response,Citibank,LA,703XX,,Consent provided,Web,11/02/2015,Closed with explanation,Yes,No,1634875


In [None]:
#Select only complaint text data and take samples
cl_data = cluster_df[['Consumer complaint narrative']]
cl_data = cl_data[pd.notnull(cl_data['Consumer complaint narrative'])]
cl_data['Consumer complaint narrative'] = cl_data['Consumer complaint narrative'].str.replace('XXXX','')
complaints = cl_data['Consumer complaint narrative'].tolist()

In [None]:
# Load Stop Words
stop = stopwords.words('english')

# Load Lemmatizer
lem = wordnet.WordNetLemmatizer()

In [None]:
def preprocess_text(text):
    # Remove punctuations
    for c in string.punctuation:
        text= text.replace(c,"")
        
    ## Tokenize (convert from string to list) and convert to lowercase
    lst_text = [word.lower() for word in text.split()]
            
    ## Remove Stopwords
    lst_text = [word for word in lst_text if word not in stop] 
    
    ## Lemmatisation (convert the word into root word)
    lst_text = [lem.lemmatize(word) for word in lst_text] 
    
    return lst_text

In [None]:
vectorizer = TfidfVectorizer(max_features=10000, use_idf=True, tokenizer=preprocess_text, ngram_range=(1,2), max_df=0.9, min_df=0.1)

In [None]:
#fit the vectorizer to data
nltk.download('wordnet')
tfidf_matrix = vectorizer.fit_transform(complaints)
terms = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
print(tfidf_matrix.shape)

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
(1688, 89)
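
The shape means 1688 complaints and only 89 terms: `min_df=0.1` discards every term that appears in fewer than 10% of the complaints, and `max_df=0.9` discards terms that appear in more than 90% of them. The sketch below imitates that document-frequency filter on a toy corpus. The function name and the toy documents are hypothetical, and the inclusive boundaries are an assumption about how the thresholds are applied.

```python
def filter_by_document_frequency(docs, min_df=0.1, max_df=0.9):
    """Keep terms whose share of documents lies between min_df and max_df."""
    n_docs = len(docs)
    doc_freq = {}
    for tokens in docs:
        for term in set(tokens):  # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return sorted(term for term, df in doc_freq.items()
                  if min_df * n_docs <= df <= max_df * n_docs)

# Toy corpus of already tokenized documents (made up for this example)
toy_docs = [
    ['credit', 'card', 'charge'],
    ['credit', 'report', 'error'],
    ['credit', 'loan', 'payment'],
    ['credit', 'card', 'payment'],
]

# 'credit' is in every document (too common); 'charge', 'report', 'error', 'loan' are too rare
print(filter_by_document_frequency(toy_docs, min_df=0.5, max_df=0.9))  # ['card', 'payment']
```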


In [None]:
#Import Kmeans
from sklearn.cluster import KMeans
# Define number of clusters
num_clusters = 5
#Running clustering algorithm
km = KMeans(n_clusters=num_clusters)
km.fit(tfidf_matrix)

#Final clusters
clusters = km.labels_.tolist()
complaints_data = { 'complaints': complaints, 'cluster': clusters }
frame = pd.DataFrame(complaints_data, index = [clusters] , columns = ['cluster'])

# Number of complaints per cluster
frame['cluster'].value_counts()

0    673
4    337
3    272
2    232
1    174
Name: cluster, dtype: int64
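
To make the clustering step less of a black box, the sketch below runs the basic k-means loop (assign each point to the nearest centroid, then move each centroid to the mean of its points) on a few 2-D points. It is a simplified illustration with hand-picked starting centroids, not what scikit-learn does internally; the function name and the toy data are hypothetical.

```python
def kmeans(points, initial_centroids, n_iter=10):
    """Basic k-means: alternate between assignment and centroid update."""
    centroids = list(initial_centroids)
    labels = []
    for _ in range(n_iter):
        # Assignment step: index of the closest centroid (squared Euclidean distance)
        labels = [
            min(range(len(centroids)),
                key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])))
            for point in points
        ]
        # Update step: move each centroid to the mean of its assigned points
        for k in range(len(centroids)):
            members = [pt for pt, lab in zip(points, labels) if lab == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids

# Two well-separated groups of toy points
toy_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
labels, centroids = kmeans(toy_points, [(0.0, 0.0), (10.0, 10.0)])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Because scikit-learn's `KMeans` starts from randomly chosen centroids, the cluster numbering and sizes shown above can differ between runs unless the `random_state` parameter is set.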

In [None]:
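topn_features = 5
# Sort the term indices of each centroid by weight, highest first
# (argsort is ascending, so the column order is reversed)
ordered_centroids = km.cluster_centers_.argsort()[:, ::-1]

# Print the highest-weighted terms of each cluster
# (clusters are printed 1-based, so CLUSTER #1 corresponds to label 0 in km.labels_)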
for cluster_num in range(num_clusters):
    key_features = [terms[index] for index in ordered_centroids[cluster_num, :topn_features]]
    print('CLUSTER #'+str(cluster_num+1))
    print('Key Features:', key_features)

CLUSTER #1
Key Features: ['account', 'bank', 'payment', 'call', 'nt']
CLUSTER #2
Key Features: ['card', 'credit card', 'credit', 'account', 'charge']
CLUSTER #3
Key Features: ['debt', 'collection', 'company', 'credit', 'account']
CLUSTER #4
Key Features: ['credit', 'report', 'credit report', 'account', 'information']
CLUSTER #5
Key Features: ['loan', 'mortgage', 'payment', 'home', 'month']


### 7.5	Disambiguating
Words can have different meanings in different contexts, which creates the problem of word sense disambiguation. The Lesk algorithm is a well-known approach to this problem: it chooses the sense whose dictionary definition shares the most words with the context in which the word appears.

In [None]:
from nltk.wsd import lesk
from nltk import word_tokenize
# sample text and word to disambiguate
samples = [('The fruits of the cherry tree have ripened', 'n'),
('Finally graduating, he reaped the fruits of his work', 'n')]
# perform word sense disambiguation
word = 'fruit'
for sentence, pos_tag in samples:
    word_syn = lesk(word_tokenize(sentence.lower()), word, pos_tag)
    print('Sentence:', sentence)
    print('Word synset:', word_syn)
    print('Corresponding definition:', word_syn.definition())
    print()

Sentence: The fruits of the cherry tree have ripened
Word synset: Synset('fruit.n.01')
Corresponding definition: the ripened reproductive body of a seed plant

Sentence: Finally graduating, he reaped the fruits of his work
Word synset: Synset('fruit.n.03')
Corresponding definition: the consequence of some effort or action
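
The core idea, counting the words shared by the context and each candidate definition, fits in a few lines. The sketch below is a simplified variant and not NLTK's `lesk` function. The two glosses are hypothetical paraphrases written for this example, not WordNet definitions, and the sense names are made up.

```python
import string

def simplified_lesk(context_sentence, glosses):
    """Return the sense whose gloss shares the most words with the context."""
    strip_punct = str.maketrans('', '', string.punctuation)
    context = set(context_sentence.lower().translate(strip_punct).split())
    best_sense, best_overlap = None, -1
    for sense, gloss in glosses.items():
        overlap = len(context & set(gloss.lower().split()))
        if overlap > best_overlap:  # ties keep the sense listed first
            best_sense, best_overlap = sense, overlap
    return best_sense

# Hypothetical glosses written for this example (not WordNet definitions)
toy_glosses = {
    'fruit_plant': 'ripened reproductive body of a seed plant such as a cherry or apple tree',
    'fruit_result': 'consequence or reward of some effort action or work',
}

print(simplified_lesk('The fruits of the cherry tree have ripened', toy_glosses))            # fruit_plant
print(simplified_lesk('Finally graduating, he reaped the fruits of his work', toy_glosses))  # fruit_result
```

The outcome depends heavily on the wording of the glosses: with short definitions the overlaps are often tied, which is one reason practical systems extend the glosses with example sentences or related senses.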



### 7.6	Language Identification and Translation
In a globalized world, companies acquire customers in many countries and need to follow news from many locations, so the text they analyze may be written in many different languages. Language identification and translation are therefore important steps in text analysis.

In [None]:
!pip install googletrans==4.0.0-rc1



In [None]:
from googletrans import Translator
translator = Translator()
text = "Merhaba, Dünya"
from_lang = translator.detect(text)
from_lang.lang

'tr'

In [None]:
print(translator.translate(text, src=from_lang.lang, dest='en').text)

Hello World
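
`googletrans` calls an online service, so the cells above need a network connection. To show what language identification does in principle, the sketch below guesses the language by counting how many words of the text appear in a small list of common words per language. The word lists and the function name are hypothetical and far too small for real use; actual detectors are trained on large corpora.

```python
import string

# Hypothetical mini word lists, chosen only for this example
COMMON_WORDS = {
    'en': {'the', 'and', 'is', 'hello', 'world', 'of'},
    'tr': {'ve', 'bir', 'bu', 'merhaba', 'dünya', 'için'},
}

def guess_language(text):
    """Return the language code whose word list overlaps most with the text, or None."""
    strip_punct = str.maketrans('', '', string.punctuation)
    words = set(text.lower().translate(strip_punct).split())
    scores = {lang: len(words & vocab) for lang, vocab in COMMON_WORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(guess_language("Merhaba, Dünya"))  # tr
print(guess_language("Hello World"))     # en
```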
