# TF-IDF Analysis

TF-IDF (term frequency-inverse document frequency) analysis is a statistical technique used in natural language processing and information retrieval to determine the importance of a word in a document or corpus. It is a way to measure how relevant a word is to a document in a collection of documents.

TF-IDF analysis assigns a weight to each word in a document based on how frequently it appears in the document (term frequency) and how rare it is in the entire corpus (inverse document frequency). The weight assigned to a word increases proportionally with its frequency in the document, but is offset by the rarity of the word in the corpus. This means that words that appear frequently in a document but also appear frequently in many other documents in the corpus are given a lower weight, while words that appear less frequently in the corpus but frequently in a particular document are given a higher weight.

The output of TF-IDF analysis is a numerical representation of each document that captures the importance of each word in that document. This can be used for various tasks such as text classification, clustering, and information retrieval.

## Table of Contents
* [Connect to Database ](#Connect-to-database)
* [Import Datasets](#Import-Dataset)
* [Remove Stopwords](#Remove-stopwords)
* [Lemmatization](#Lemmatization)

## Connect to Database

In [4]:
import mysql.connector
import pandas as pd

#creds = ["username","password","juliehaegh","ninG20&19rea","3306"] 
creds = ["juliehaegh","ninG20&19rea","172.20.20.4","hgo",3306]

In [5]:
#Connection to the database
host = creds[2]
user = creds[0]
password = creds[1]
database = creds[3]
port = creds[4]
mydb = mysql.connector.connect(host=host, user=user, database=database, port=port, password=password, auth_plugin='mysql_native_password')
mycursor = mydb.cursor()

#Safecheck to guarantee that the connection worked
mycursor.execute('SHOW TABLES;')
print(f"Tables: {mycursor.fetchall()}")
print(mydb.connection_id) #it'll give connection_id,if got connected

Tables: [('ConsultaUrgencia_doentespedidosconsultaNeurologia2012',), ('consultaneurologia2012',), ('consultaneurologia201216anon_true',)]
987


## Import Datasets

In [7]:
# Import Alert P1 dataset
alertP1 = pd.read_sql("""SELECT * FROM ConsultaUrgencia_doentespedidosconsultaNeurologia2012""",mydb)

# Import SClinic
SClinic = pd.read_sql("""SELECT * FROM consultaneurologia201216anon_true""",mydb)



In [10]:
# Import librariers 
import matplotlib.pyplot as plt
import numpy as np
from wordcloud import WordCloud, STOPWORDS
import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from unidecode import unidecode
import re
from collections import Counter
import nltk
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/juliehaegh/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/juliehaegh/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Remove Stopwords

In [52]:
# Get rid of special characters and transform Texto column to Latin words
SClinic['Texto'] = SClinic['Texto'].apply(lambda x: unidecode(x))

#The re.sub function is used to substitute all digits (\d) with an empty string
SClinic['Texto'] = SClinic['Texto'].apply(lambda x: re.sub(r'\d', '', x))

# Remove all names in Texto variable
# This function uses a regular expression to find all words in the text that start with a 
# capital letter (\b[A-Z][a-z]+\b), which are assumed to be names
text = SClinic['Texto'] 

# remove all hyphens from the text
text = text.replace('-', '')

def remove_names(text):
    # Find all words that start with a capital letter
    names = re.findall(r'\b[A-Z][a-z]+\b', text)
    
    # Replace the names with an empty string
    for name in names:
        text = text.replace(name, '')
        
    return text

In [53]:
# Create an empty list to store the text
text_list = []

# Loop through the 'text' column
for text in text.str.lower(): # Transform every word to lower case
    text_list.append(text)

# Print the list of text
print(text_list)

['utente de  anos, refere tremor desde ha  anos; esquecimentos,dificuldade em reter a informacao do momento e quedas frequentes. com hta, diabetes ultimas analises -// - hb ac-,; hdl ct-, ct-, tg-, hb-,. creat-,; microalbuminuria-, ; creat-, fez tac ce-// - acentuacao dos ventriculos cerebrais. leucoencefalopatia micro-angiopatica cronica. acentuacao dos sulcos fronto-parietal,sobretudo a esq, ligeiras hipodensidades perifericas cerebelosas calcificacao da foice inter-hemisferica, porcao sup-anterior. medicada com diamicron , metformina+sitagliptina (+) xdia; irbesartan+hctz(+,)em jejum; omeprazol-mg; sinvastatina mg, tromalyt ,cp ao almoco.', 'avaliacao neurologica para avaliacao da toma de anti-epilepticos e reajustamento terapeutico o doente nao tem nota de internamento em  c enfarte isquemico cerebelosoe vermiani da picateve ataxia da marcha em  e ficou medicado c lamotrigina - + e tryptizol agora refere ligeiros desiquilibrios e o exame neurologico sumario parece completamente nor

In [62]:
# Download the Portuguese stop words
nltk.download('stopwords')
nltk.download('punkt')

# Get the Portuguese stop words
stop_words = set(stopwords.words('portuguese'))

# Manually remove stopwords
stop_words.update(['-//','.', ',','(',')',':','-','?','+','/',';','2','1','drª','``','','3','desde','anos','doente','consulta','alterações','se',"''",'cerca','refere','hgo','utente','vossa','s','...','ainda','c','filha','costa','dr.','pereira','ja','--','p','dr','h','n','>','q','//','..','b','++','%','//','-','+++/','=','+++/'])

# Create a new list to store the filtered text
filtered_text = []

# Loop through the text_list and remove the stop words
for text in text_list:
    words = word_tokenize(text)
    words = [word for word in words if word.lower() not in stop_words]
    filtered_text.append(" ".join(words))

# Print the filtered text
print(filtered_text)

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/juliehaegh/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/juliehaegh/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


['tremor ha esquecimentos dificuldade reter informacao momento quedas frequentes hta diabetes ultimas analises hb ac- hdl ct- ct- tg- hb- creat- microalbuminuria- creat- fez tac ce-// acentuacao ventriculos cerebrais leucoencefalopatia micro-angiopatica cronica acentuacao sulcos fronto-parietal sobretudo esq ligeiras hipodensidades perifericas cerebelosas calcificacao foice inter-hemisferica porcao sup-anterior medicada diamicron metformina+sitagliptina xdia irbesartan+hctz jejum omeprazol-mg sinvastatina mg tromalyt cp almoco', 'avaliacao neurologica avaliacao toma anti-epilepticos reajustamento terapeutico nao nota internamento enfarte isquemico cerebelosoe vermiani picateve ataxia marcha ficou medicado lamotrigina tryptizol agora ligeiros desiquilibrios exame neurologico sumario parece completamente normal', 'cefaleias', 'sexo feminino idade aparentemente saidavel historia cefaleia occipital intensa incapacitante meses evolucao exame objectivo alteracoes relevantes fez tac ce revela

In [55]:
# Save the filtered text as a new column to the dataframe
SClinic['filtered_text'] = filtered_text

## Lemmatization

Lemmatization is a text normalization technique used in Natural Language Processing (NLP), that switches any kind of a word to its base root mode. Lemmatization is responsible for grouping different inflected forms of words into the root form, having the same meaning

In [56]:
# Define function for lemmatization
def spacy_lemmatizer(df):
    import spacy
    import pt_core_news_md
    nlp = pt_core_news_md.load()

    doclist = list(nlp.pipe(df))

    docs=[]
    for i, doc in enumerate(doclist):
        docs.append(' '.join([listitem.lemma_ for listitem in doc]))
        
    return docs

In [57]:
# create an empty list to store the words
word_list = []

# loop through each row of the "text_column" column
for index, row in SClinic.iterrows():
    
    # split the text into individual words using whitespace as a delimiter
    words = row['filtered_text'].split()
    # add the words to the word list
    word_list.extend(words)

# print the word list
print(word_list)

['tremor', 'ha', 'esquecimentos', 'dificuldade', 'reter', 'informacao', 'momento', 'quedas', 'frequentes', 'hta', 'diabetes', 'ultimas', 'analises', 'hb', 'ac-', 'hdl', 'ct-', 'ct-', 'tg-', 'hb-', 'creat-', 'microalbuminuria-', 'creat-', 'fez', 'tac', 'ce-//', 'acentuacao', 'ventriculos', 'cerebrais', 'leucoencefalopatia', 'micro-angiopatica', 'cronica', 'acentuacao', 'sulcos', 'fronto-parietal', 'sobretudo', 'esq', 'ligeiras', 'hipodensidades', 'perifericas', 'cerebelosas', 'calcificacao', 'foice', 'inter-hemisferica', 'porcao', 'sup-anterior', 'medicada', 'diamicron', 'metformina+sitagliptina', 'xdia', 'irbesartan+hctz', 'jejum', 'omeprazol-mg', 'sinvastatina', 'mg', 'tromalyt', 'cp', 'almoco', 'avaliacao', 'neurologica', 'avaliacao', 'toma', 'anti-epilepticos', 'reajustamento', 'terapeutico', 'nao', 'nota', 'internamento', 'enfarte', 'isquemico', 'cerebelosoe', 'vermiani', 'picateve', 'ataxia', 'marcha', 'ficou', 'medicado', 'lamotrigina', 'tryptizol', 'agora', 'ligeiros', 'desiquil

In [61]:
# create an empty list to store the words
word_list = []

# loop through each row of the "text_column" column
for index, row in SClinic.iterrows():
    
    # split the text into individual words using whitespace as a delimiter
    words = row['filtered_text'].split()
    
    # remove hyphens from the words and add them to the word list
    word_list.extend([word.replace('-', '') for word in words])
    # remove slash from the words and ass them to the list
    word_list.extend([word.replace('/', '') for word in words])
    

# print the cleaned word list
print(word_list)

['tremor', 'ha', 'esquecimentos', 'dificuldade', 'reter', 'informacao', 'momento', 'quedas', 'frequentes', 'hta', 'diabetes', 'ultimas', 'analises', 'hb', 'ac', 'hdl', 'ct', 'ct', 'tg', 'hb', 'creat', 'microalbuminuria', 'creat', 'fez', 'tac', 'ce//', 'acentuacao', 'ventriculos', 'cerebrais', 'leucoencefalopatia', 'microangiopatica', 'cronica', 'acentuacao', 'sulcos', 'frontoparietal', 'sobretudo', 'esq', 'ligeiras', 'hipodensidades', 'perifericas', 'cerebelosas', 'calcificacao', 'foice', 'interhemisferica', 'porcao', 'supanterior', 'medicada', 'diamicron', 'metformina+sitagliptina', 'xdia', 'irbesartan+hctz', 'jejum', 'omeprazolmg', 'sinvastatina', 'mg', 'tromalyt', 'cp', 'almoco', 'tremor', 'ha', 'esquecimentos', 'dificuldade', 'reter', 'informacao', 'momento', 'quedas', 'frequentes', 'hta', 'diabetes', 'ultimas', 'analises', 'hb', 'ac-', 'hdl', 'ct-', 'ct-', 'tg-', 'hb-', 'creat-', 'microalbuminuria-', 'creat-', 'fez', 'tac', 'ce-', 'acentuacao', 'ventriculos', 'cerebrais', 'leucoen

In [27]:
Lemma = spacy_lemmatizer(word_list) # Call lemmatizer function

# print length of word_list and compare the count after doing lemmatization
from collections import Counter

items = Counter(Lemma).keys()
print('The number of words after lemmatization:',len(items))

items2 = Counter(word_list).keys()
print('The number of words before lemmatization:',len(items2))

The number of words after lemmatization: 9223
The number of words before lemmatization: 11355


In [28]:
# apply the spacy_lemmatizer function to each row in the 'text' column
SClinic['text_lemmatized'] = spacy_lemmatizer(SClinic['filtered_text'])

# drop rows with empty strings
SClinic_filtered = SClinic[['text_lemmatized','filtered_text']].replace('', pd.NA).dropna()
SClinic_filtered = pd.DataFrame(SClinic_filtered)
SClinic_filtered

Unnamed: 0,text_lemmatized,filtered_text
0,tremor ha esquecimento dificuldade reter infor...,tremor ha esquecimentos dificuldade reter info...
1,avaliacao neurologico avaliacao tomar anti-epi...,avaliacao neurologica avaliacao toma anti-epil...
2,cefaleia,cefaleias
3,sexo feminino idade aparentemente saidavel his...,sexo feminino idade aparentemente saidavel his...
4,relatorio clinico,relatorio clinico
...,...,...
1777,estenose carotidea assintomatico bilateral < d...,estenose carotidea assintomatica bilateral < d...
1778,operar patologia benigno mama direito apresent...,operada patologia benigna mama direita apresen...
1779,perda conhecimento elipotomia tce traumatismo ...,perda conhecimento elipotomia tce traumatismo ...
1780,historia lipotimer maio eeg- -escassas anomali...,historia lipotimia maio eeg- -escassas anomali...


## Calculate distance between words using 'Jarowynkler'

In natural language processing and text mining, the distance between words refers to the measure of how dissimilar or different two words are in terms of their spelling, meaning, or context. It is used to compare two words or to quantify the similarity between them.

These distance metrics are useful in many natural language processing tasks, such as spell checking, text classification, clustering, and information retrieval, among others. They enable us to quantify the similarity or dissimilarity between words or texts and to make data-driven decisions based on this information.

In [69]:
import Levenshtein

# create an empty list to store the distances
distances = []

# loop through each pair of adjacent words in the word list
for i in range(len(word_list)-1):
    
    # calculate the Levenshtein distance between the current word and the next word
    distance = Levenshtein.distance(word_list[i], word_list[i+1])
    
    # add the distance to the list of distances
    distances.append(distance)

# print the list of distances
print(len(distances))

144349


In [72]:
import Levenshtein

# create an empty list to store the distances
distances = []

# loop through each pair of adjacent words in the word list
for i in range(len(word_list)-1):
    
    # calculate the Levenshtein distance between the current word and the next word
    distance = Levenshtein.distance(word_list[i], word_list[i+1])
    
    # add the distance to the list of distances
    distances.append(distance)

# merge adjacent words with a distance lower than 1
merged_word_list = []
i = 0
while i < len(word_list):
    if i == len(word_list) - 1:
        # last word in the list, add it to the merged word list
        merged_word_list.append(word_list[i])
        i += 1
    elif distances[i] < 1:
        # merge the current and next word into a single word and add it to the merged word list
        merged_word_list.append(word_list[i] + word_list[i+1])
        i += 2
    else:
        # add the current word to the merged word list
        merged_word_list.append(word_list[i])
        i += 1

# print the merged word list
print(merged_word_list)

['tremor', 'ha', 'esquecimentos', 'dificuldade', 'reter', 'informacao', 'momento', 'quedas', 'frequentes', 'hta', 'diabetes', 'ultimas', 'analises', 'hb', 'ac', 'hdl', 'ctct', 'tg', 'hb', 'creat', 'microalbuminuria', 'creat', 'fez', 'tac', 'ce//', 'acentuacao', 'ventriculos', 'cerebrais', 'leucoencefalopatia', 'microangiopatica', 'cronica', 'acentuacao', 'sulcos', 'frontoparietal', 'sobretudo', 'esq', 'ligeiras', 'hipodensidades', 'perifericas', 'cerebelosas', 'calcificacao', 'foice', 'interhemisferica', 'porcao', 'supanterior', 'medicada', 'diamicron', 'metformina+sitagliptina', 'xdia', 'irbesartan+hctz', 'jejum', 'omeprazolmg', 'sinvastatina', 'mg', 'tromalyt', 'cp', 'almoco', 'tremor', 'ha', 'esquecimentos', 'dificuldade', 'reter', 'informacao', 'momento', 'quedas', 'frequentes', 'hta', 'diabetes', 'ultimas', 'analises', 'hb', 'ac-', 'hdl', 'ct-ct-', 'tg-', 'hb-', 'creat-', 'microalbuminuria-', 'creat-', 'fez', 'tac', 'ce-', 'acentuacao', 'ventriculos', 'cerebrais', 'leucoencefalopa

In [29]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# create a CountVectorizer object
count_vectorizer = CountVectorizer()

# fit and transform the text data
count_matrix = count_vectorizer.fit_transform(SClinic_filtered['text_lemmatized'])

# create a TfidfVectorizer object
tfidf_vectorizer = TfidfVectorizer()

# fit and transform the text data
tfidf_matrix = tfidf_vectorizer.fit_transform(SClinic_filtered['text_lemmatized'])

# print the document-term matrix for CountVectorizer
print("Count Vectorizer:\n")
print(pd.DataFrame(count_matrix.toarray(), columns=count_vectorizer.get_feature_names_out()))

# print the document-term matrix for TfidfVectorizer
print("\nTF-IDF Vectorizer:\n")
print(pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out()))

Count Vectorizer:

      2e  aa  aacentuar  aas  aat  ab  abaixo  abandonar  abandono  abcesso  \
0      0   0          0    0    0   0       0          0         0        0   
1      0   0          0    0    0   0       0          0         0        0   
2      0   0          0    0    0   0       0          0         0        0   
3      0   0          0    0    0   0       0          0         0        0   
4      0   0          0    0    0   0       0          0         0        0   
...   ..  ..        ...  ...  ...  ..     ...        ...       ...      ...   
1763   0   0          0    0    0   0       0          0         0        0   
1764   0   0          0    0    0   0       0          0         0        0   
1765   0   0          0    0    0   0       0          0         0        0   
1766   0   0          0    0    0   0       0          0         0        0   
1767   0   0          0    0    0   0       0          0         0        0   

      ...  zonegram  zonegran  z

In [30]:
# create a CountVectorizer object
count_vectorizer = CountVectorizer()

# fit and transform the text data
count_matrix = count_vectorizer.fit_transform(SClinic_filtered['text_lemmatized'])

# create a dataframe for CountVectorizer output
count_df = pd.DataFrame(count_matrix.toarray(), columns=count_vectorizer.get_feature_names_out())

# create a TfidfVectorizer object
tfidf_vectorizer = TfidfVectorizer()

# fit and transform the text data
tfidf_matrix = tfidf_vectorizer.fit_transform(SClinic_filtered['text_lemmatized'])

# create a dataframe for TfidfVectorizer output
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

# print the document-term matrix for CountVectorizer
print("Count Vectorizer:\n")
count_df

# print the document-term matrix for TfidfVectorizer
print("\nTF-IDF Vectorizer:\n")
tfidf_df

Count Vectorizer:


TF-IDF Vectorizer:



Unnamed: 0,2e,aa,aacentuar,aas,aat,ab,abaixo,abandonar,abandono,abcesso,...,zonegram,zonegran,zonisamer,zoster,zotepino,zumbido,zumbir,zyloric,zyprexa,zyprexo
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1763,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1764,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1765,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1766,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
