# TF-IDF Analysis

TF-IDF (term frequency-inverse document frequency) is a statistical technique used in natural language processing and information retrieval to measure how important a word is to a document within a collection of documents (a corpus).

TF-IDF assigns a weight to each word in a document based on how frequently it appears in that document (term frequency) and how rare it is across the corpus (inverse document frequency). The weight increases with the word's frequency in the document and is offset by how common the word is in the corpus. Words that are frequent in a document but also appear in many other documents therefore receive a lower weight, while words that are frequent in a particular document but rare in the rest of the corpus receive a higher weight.
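
As a minimal sketch of this weighting, assuming the plain textbook definition `tf * log(N / df)` and a hypothetical three-document toy corpus (not taken from the dataset used below). Libraries such as scikit-learn apply a smoothed IDF and normalize each document vector, so their numbers will differ from this sketch.

```python
import math
from collections import Counter

# Hypothetical toy corpus, for illustration only
corpus = [
    "dor cabeca dor",
    "dor tremor",
    "epilepsia tremor",
]
docs = [doc.split() for doc in corpus]
n_docs = len(docs)

# Document frequency: number of documents that contain each term
df = Counter(term for doc in docs for term in set(doc))

def tf_idf(doc):
    """Return {term: tf * idf} with tf = count / len(doc) and idf = log(N / df)."""
    counts = Counter(doc)
    return {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in counts.items()}

weights = [tf_idf(doc) for doc in docs]
for text, w in zip(corpus, weights):
    print(text, "->", {t: round(v, 3) for t, v in w.items()})
```

In the first document, `dor` occurs twice but also appears in another document, while `cabeca` occurs once and nowhere else, so `cabeca` ends up with the higher weight.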

The output of TF-IDF analysis is a numerical vector for each document, with one weight per word in the vocabulary. These vectors can be used for tasks such as text classification, clustering, and information retrieval.

## Table of Contents
* [Connect to Database](#Connect-to-Database)
* [Import Datasets](#Import-Datasets)
* [Remove Stopwords](#Remove-Stopwords)
* [Lemmatization](#Lemmatization)

## Connect to Database

In [1]:
import mysql.connector
import pandas as pd

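import os

# Read the credentials from environment variables instead of hardcoding them
# (the variable names DB_USER and DB_PASSWORD are illustrative)
# Order: user, password, host, database, port
creds = [os.environ["DB_USER"], os.environ["DB_PASSWORD"], "172.20.20.4", "hgo", 3306]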

In [2]:
#Connection to the database
host = creds[2]
user = creds[0]
password = creds[1]
database = creds[3]
port = creds[4]
mydb = mysql.connector.connect(host=host, user=user, database=database, port=port, password=password, auth_plugin='mysql_native_password')
mycursor = mydb.cursor()

# Sanity check to confirm that the connection works
mycursor.execute('SHOW TABLES;')
print(f"Tables: {mycursor.fetchall()}")
print(mydb.connection_id)  # prints the connection ID if the connection succeeded

Tables: [('ConsultaUrgencia_doentespedidosconsultaNeurologia2012',), ('consultaneurologia2012',), ('consultaneurologia201216anon_true',), ('hgo_data_032023',)]
420


## Import Datasets

In [3]:
# Import the SClinic dataset
SClinic = pd.read_sql("""SELECT * FROM ConsultaUrgencia_doentespedidosconsultaNeurologia2012""",mydb)

# Import the Alert P1 dataset
AlertP1 = pd.read_sql("""SELECT * FROM consultaneurologia201216anon_true""",mydb)

# Replace all NaN with 0
AlertP1 = AlertP1.fillna(0)

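# Add result column: the refusal-reason codes below count as accepted, all others as refused
accepted_codes = {0, 14, 25, 20, 53, 8, 12}
AlertP1['result'] = ['Accepted' if x in accepted_codes else 'Refused' for x in AlertP1['COD_MOTIVO_RECUSA']]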



In [4]:
# Create a new column with accepted and rejected cases
#SClinic['Accepted/Rejected'] = SClinic['COD_MOTIVO_RECUSA'].apply(lambda x: 'Accepted' if x == 0 else 'Rejected')
#SClinic = SClinic[(SClinic['Texto']!='') & (SClinic['Accepted/Rejected']=='Accepted')].iloc[887:987]
#SClinic = SClinic[SClinic['Texto']!='']
#SClinic

In [5]:
import math

# Split data into train and test
AlertP1_sorted = AlertP1[AlertP1['Texto']!=''].sort_values(by='DATA_RECEPCAO')

# calculate the index for the split
split_index = math.ceil(0.8 * len(AlertP1_sorted))

# split the data frame into train and test sets (copies, so that columns can be added safely later)
train_set = AlertP1_sorted.iloc[:split_index].copy()
test_set = AlertP1_sorted.iloc[split_index:].copy()
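
The chronological split above can be illustrated without a database. This is a small sketch on hypothetical records (the dates and texts are made up): rows are sorted by reception date, the first 80% (rounded up) go to the train set, and the most recent rows form the test set.

```python
import math

# Hypothetical records: (reception date as ISO string, text)
records = [
    ("2012-03-01", "c"), ("2012-01-15", "a"), ("2012-05-20", "e"),
    ("2012-02-10", "b"), ("2012-04-05", "d"), ("2012-07-30", "g"),
    ("2012-06-11", "f"),
]

# Sort chronologically so that the test set only holds the most recent rows
records_sorted = sorted(records, key=lambda r: r[0])

# Index of the first test row: 80% of the rows, rounded up
split_index = math.ceil(0.8 * len(records_sorted))

train = records_sorted[:split_index]
test = records_sorted[split_index:]
print(len(train), "train rows,", len(test), "test rows")
```

Splitting by date rather than at random means that every test row was received after every train row.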

In [6]:
# Import libraries
import re
import warnings
from collections import Counter

import matplotlib.pyplot as plt
import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from unidecode import unidecode
from wordcloud import WordCloud, STOPWORDS

nltk.download('stopwords')
nltk.download('punkt')
warnings.filterwarnings(action="ignore")

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/juliehaegh/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/juliehaegh/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Remove Stopwords

In [7]:
# Transliterate accented and special characters in the Texto column to plain ASCII
train_set['Texto'] = train_set['Texto'].apply(lambda x: unidecode(x))

# Use re.sub to replace all digits (\d) with an empty string
train_set['Texto'] = train_set['Texto'].apply(lambda x: re.sub(r'\d', '', x))

text = train_set['Texto'] 

# Note: Series.replace only matches whole cell values, so this line does not strip
# hyphens inside a text; text.str.replace('-', '', regex=False) would do that
text = text.replace('-', '')

# Helper to remove names from a text: words that start with a capital letter
# followed by lowercase letters (\b[A-Z][a-z]+\b) are assumed to be names.
# It is defined here but not called in the cells shown.
def remove_names(text):
    # Find all words that start with a capital letter
    names = re.findall(r'\b[A-Z][a-z]+\b', text)
    
    # Replace the names with an empty string
    for name in names:
        text = text.replace(name, '')
        
    return text
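
To make the behaviour of these cleaning steps concrete, here is a small sketch on a hypothetical sample sentence (not taken from the dataset). It reuses the digit pattern and the `remove_names` logic from the cell above.

```python
import re

def remove_names(text):
    # Same pattern as above: a capital letter followed by lowercase letters
    names = re.findall(r'\b[A-Z][a-z]+\b', text)
    for name in names:
        text = text.replace(name, '')
    return text

# Hypothetical sample sentence, for illustration only
sample = "Doente observada pela Dra Maria Costa em 2012, TAC normal"

no_digits = re.sub(r'\d', '', sample)   # strip all digits
cleaned = remove_names(no_digits)       # strip capitalized words
print(repr(cleaned))
```

The pattern cannot tell names from other capitalized words: the sentence-initial `Doente` is removed along with `Maria` and `Costa`, while the all-caps `TAC` is kept because the pattern requires lowercase letters after the first capital.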

In [8]:
# Build a list with the lower-cased text of every row
text_list = text.str.lower().tolist()

# Print the list of text
#print(text_list)

In [9]:
# Download the Portuguese stop words
nltk.download('stopwords')
nltk.download('punkt')

# Get the Portuguese stop words
stop_words = set(stopwords.words('portuguese'))

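# Add custom stop words (punctuation, abbreviations, frequent domain words and surnames).
# Note: the text was transliterated with unidecode, so accented entries such as
# 'alterações' never match; the unaccented form would be needed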
stop_words.update(['-//','.', ',','(',')',':','-','?','+','/',';','2','1','drª','``','','3','desde','anos','doente','consulta','alterações','se',"''",'cerca','refere','hgo','utente','vossa','s','...','ainda','c','filha','costa','dr.','pereira','ja','--','p','dr','h','n','>','q','//','..','b','++','%','//','-','+++/','=','+++/'])

# Create a new list to store the filtered text
filtered_text = []

# Loop through the text_list and remove the stop words
for text in text_list:
    words = word_tokenize(text)
    words = [word for word in words if word.lower() not in stop_words]
    filtered_text.append(" ".join(words))

# Print the filtered text
#print(filtered_text)

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/juliehaegh/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/juliehaegh/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
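
One detail of the filtering step is worth a sketch: the text has already been transliterated to ASCII, so a stop word written with accents (such as `'alterações'` in the list above) can never match. The example below is only an illustration. It swaps `unidecode` for the standard-library `unicodedata` module and uses a hypothetical two-word stop list and sample text.

```python
import unicodedata

def strip_accents(text):
    # Stand-in for unidecode: decompose characters and drop the combining marks
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Hypothetical stop list with one accented entry
stop_words = {"alterações", "desde"}

# The text is transliterated and lower-cased before filtering
tokens = strip_accents("Alterações desde ontem").lower().split()

# Filtering against the accented list leaves 'alteracoes' in place
kept = [t for t in tokens if t not in stop_words]

# Normalizing the stop list the same way as the text fixes the mismatch
normalized_stop_words = {strip_accents(w) for w in stop_words}
kept_fixed = [t for t in tokens if t not in normalized_stop_words]

print(kept, kept_fixed)
```

Normalizing the stop list with the same function as the text keeps the two in step, whichever transliteration is used.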


In [10]:
# Save the filtered text as a new column to the dataframe
train_set['filtered_text'] = filtered_text

## Lemmatization

Lemmatization is a text normalization technique used in natural language processing (NLP) that reduces a word to its base form (lemma). It groups the different inflected forms of a word under one form with the same meaning.
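
To illustrate the idea only, the sketch below uses a hypothetical hand-written lookup table in place of a real lemmatizer (the notebook itself uses spaCy's Portuguese model). It shows how mapping inflected forms to a shared lemma reduces the number of distinct words.

```python
# Hypothetical lookup table standing in for a real lemmatizer
LEMMAS = {
    "duplicada": "duplicar",
    "duplicado": "duplicar",
    "dada": "dar",
    "dado": "dar",
}

def toy_lemmatize(words):
    # Words missing from the table are returned unchanged
    return [LEMMAS.get(word, word) for word in words]

words = ["referencia", "duplicada", "duplicado", "dada", "dado", "alta"]
lemmas = toy_lemmatize(words)

print(len(set(words)), "distinct words ->", len(set(lemmas)), "distinct lemmas")
```

A statistical lemmatizer derives these mappings from a trained model instead of a fixed table, so its output can contain occasional mistakes.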

In [11]:
# Define function for lemmatization
def spacy_lemmatizer(texts):
    """Return one string per input text, with each token replaced by its lemma."""
    import spacy
    import pt_core_news_md

    # Load the medium-sized Portuguese spaCy model
    nlp = pt_core_news_md.load()

    # nlp.pipe processes the texts as a stream instead of one call per text
    return [' '.join(token.lemma_ for token in doc) for doc in nlp.pipe(texts)]

In [12]:
# create an empty list to store the words
word_list = []

# loop through each row of the 'filtered_text' column
for index, row in train_set.iterrows():
    
    # split the text into individual words using whitespace as a delimiter
    words = row['filtered_text'].split()
    # add the words to the word list
    word_list.extend(words)

# print the word list
#print(word_list)

In [13]:
# create an empty list to store the words
word_list = []

# loop through each row of the 'filtered_text' column
for index, row in train_set.iterrows():
    
    # split the text into individual words using whitespace as a delimiter
    words = row['filtered_text'].split()
    
    # add every word with hyphens removed
    word_list.extend([word.replace('-', '') for word in words])
    # add every word again with slashes removed (each word is therefore added twice)
    word_list.extend([word.replace('/', '') for word in words])
    

# print the cleaned word list
#print(word_list)

In [14]:
Lemma = spacy_lemmatizer(word_list) # Call lemmatizer function

# print length of word_list and compare the count after doing lemmatization
from collections import Counter

items = Counter(Lemma).keys()
print('The number of words after lemmatization:',len(items))

items2 = Counter(word_list).keys()
print('The number of words before lemmatization:',len(items2))

The number of words after lemmatization: 8793
The number of words before lemmatization: 10712


In [17]:
# apply the spacy_lemmatizer function to the 'filtered_text' column
train_set['text_lemmatized'] = spacy_lemmatizer(train_set['filtered_text'])

# drop rows with empty strings
train_set_filtered = train_set[['text_lemmatized','filtered_text','result']].replace('', pd.NA).dropna()
train_set_filtered

| | text_lemmatized | filtered_text | result |
|---|---|---|---|
| 1540 | dor lapso efoi-le dar alto qualquer justificac... | dor lapso foi-lhe dada alta qualquer justifica... | Refused |
| 525 | relatorio clinico | relatorio clinico | Refused |
| 121 | homem ap dm gamapatia monoclonal igm dca arter... | homem ap dm gamapatia monoclonal igm dca arter... | Refused |
| 168 | mulher dor ponto lingua sensacao repuxamento l... | mulher dor ponta lingua sensacao repuxamento l... | Refused |
| 1154 | epilepsia | epilepsia | Accepted |
| ... | ... | ... | ... |
| 1619 | justificacao optimizacao diagnosticar terapeut... | justificacao optimizacao diagnostica terapeuti... | Accepted |
| 902 | referencia duplicar | referencia duplicada | Refused |
| 901 | referencia duplicar | referencia duplicada | Accepted |
| 1105 | historia actual problema saude resolver parkin... | historia actual problema saude resolver parkin... | Accepted |


In [20]:
# Encode the result as a binary label: 1 for accepted, 0 for refused
train_set_filtered['accepted/rejected'] = train_set_filtered['result'].apply(lambda x: 1 if x == 'Accepted' else 0)

In [27]:
# Import Sklearn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Split the dataset into train and test sets with 20% test size
X_train, X_test, y_train, y_test = train_test_split(train_set_filtered['filtered_text'], train_set_filtered['accepted/rejected'], test_size=0.2, random_state=42)

# Create a TF-IDF vectorizer object with max_df = 0.8 and min_df = 5
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, min_df=5)

# Fit and transform the training data into a sparse matrix of TF-IDF features
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)

# Transform the test data into a sparse matrix of TF-IDF features
X_test_tfidf = tfidf_vectorizer.transform(X_test)

# Use the TF scores for future predictions
tf_vectorizer = TfidfVectorizer(use_idf=False, max_df=0.8, min_df=5)
X_tf = tf_vectorizer.fit_transform(train_set_filtered['filtered_text'])

# Combine the TF-IDF scores with the remaining columns for logistic regression.
# The TF-IDF frames reuse the original row index so that concat aligns the rows;
# 'result' is dropped as well because it encodes the target label
other_cols = train_set_filtered.drop(['filtered_text', 'accepted/rejected', 'result'], axis=1)
X_train_tfidf_lr = pd.concat([pd.DataFrame(X_train_tfidf.toarray(), index=X_train.index), other_cols.loc[X_train.index]], axis=1)
X_test_tfidf_lr = pd.concat([pd.DataFrame(X_test_tfidf.toarray(), index=X_test.index), other_cols.loc[X_test.index]], axis=1)

max_df is either a float in the range [0.0, 1.0] or an integer, and sets the maximum document frequency a term may have. A float is read as a proportion of the documents in the corpus, an integer as an absolute number of documents. Terms that appear in more documents than this threshold are ignored because they are considered too common to be informative. In the code above, max_df=0.8 means that terms appearing in more than 80% of the documents are ignored.

min_df follows the same convention and sets the minimum document frequency. Terms that appear in fewer documents than this threshold are ignored because they are considered too rare to be informative. In the code above, min_df=5 means that terms appearing in fewer than 5 documents are ignored.
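
The effect of the two thresholds can be sketched in plain Python. The function below is a hypothetical helper, not part of scikit-learn. It only mimics the document-frequency filtering on a made-up five-document corpus and computes no TF-IDF weights.

```python
from collections import Counter

def filter_vocabulary(docs, min_df=1, max_df=1.0):
    """Keep the terms whose document frequency lies within [min_df, max_df].

    An int is read as an absolute number of documents,
    a float as a proportion of the documents.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(term for doc in docs for term in set(doc.split()))
    low = min_df if isinstance(min_df, int) else min_df * n_docs
    high = max_df if isinstance(max_df, int) else max_df * n_docs
    return sorted(term for term, n in df.items() if low <= n <= high)

# Hypothetical toy corpus, for illustration only
docs = [
    "relatorio clinico",
    "relatorio epilepsia",
    "relatorio tremor",
    "relatorio tremor cefaleia",
    "relatorio demencia",
]

vocabulary = filter_vocabulary(docs, min_df=2, max_df=0.8)
print(vocabulary)
```

Here `relatorio` appears in all five documents and is dropped by `max_df=0.8`, the terms that appear in a single document are dropped by `min_df=2`, and only `tremor` remains.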

In [28]:
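# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0
feature_names = tfidf_vectorizer.get_feature_names_out()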
len(feature_names)

1738

In [30]:
# Get the names of the features
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0)
feature_names = tfidf_vectorizer.get_feature_names_out()

# Get the mean TF-IDF score of each feature over the train set
train_tfidf = tfidf_vectorizer.transform(X_train)
train_tfidf_scores = np.mean(train_tfidf.toarray(), axis=0)

# Get the mean TF-IDF score of each feature over the test set
test_tfidf = tfidf_vectorizer.transform(X_test)
test_tfidf_scores = np.mean(test_tfidf.toarray(), axis=0)

# Get the top 10 words with the highest scores for the train set
top_train_indices = np.argsort(train_tfidf_scores)[::-1][:10]
top_train_words = [feature_names[i] for i in top_train_indices]
top_train_scores = [train_tfidf_scores[i] for i in top_train_indices]

# Get the top 10 words with the highest scores for the test set
top_test_indices = np.argsort(test_tfidf_scores)[::-1][:10]
top_test_words = [feature_names[i] for i in top_test_indices]
top_test_scores = [test_tfidf_scores[i] for i in top_test_indices]

print("Top 10 words with the highest scores for the train set:")
for word, score in zip(top_train_words, top_train_scores):
    print(f"{word}: {score:.4f}")
    
print("\nTop 10 words with the highest scores for the test set:")
for word, score in zip(top_test_words, top_test_scores):
    print(f"{word}: {score:.4f}")


## Test

In [31]:
# Split the data into train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(train_set_filtered['filtered_text'], train_set_filtered['result'], test_size=0.2, random_state=42)

# Create the TF-IDF vectorizer object
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, min_df=5)

# Fit and transform the vectorizer on the train set
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)

# Extract the top 10 words with the highest mean scores for the train set
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0)
train_scores = np.asarray(X_train_tfidf.mean(axis=0)).ravel().tolist()
train_features = tfidf_vectorizer.get_feature_names_out()
top_train_indices = np.argsort(train_scores)[::-1][:10]
top_train_words = [train_features[i] for i in top_train_indices]
top_train_scores = [train_scores[i] for i in top_train_indices]

# Transform the test set with the vectorizer fitted on the train set
X_test_tfidf = tfidf_vectorizer.transform(X_test)

# Extract the top 10 words with the highest mean scores for the test set
test_scores = np.asarray(X_test_tfidf.mean(axis=0)).ravel().tolist()
test_features = tfidf_vectorizer.get_feature_names_out()
top_test_indices = np.argsort(test_scores)[::-1][:10]
top_test_words = [test_features[i] for i in top_test_indices]
top_test_scores = [test_scores[i] for i in top_test_indices]

# Print the results
print("Top 10 words with the highest scores for the train set:")
for word, score in zip(top_train_words, top_train_scores):
    print(f"{word}: {score:.4f}")

print("\nTop 10 words with the highest scores for the test set:")
for word, score in zip(top_test_words, top_test_scores):
    print(f"{word}: {score:.4f}")


Top 10 words with the highest scores for the train set:
relatorio: 0.0662
clinico: 0.0640
nao: 0.0308
epilepsia: 0.0282
ha: 0.0257
alteracoes: 0.0255
mg: 0.0237
neurologia: 0.0223
demencia: 0.0194
se: 0.0191

Top 10 words with the highest scores for the test set:
relatorio: 0.0588
clinico: 0.0554
epilepsia: 0.0350
alteracoes: 0.0350
nao: 0.0269
tremor: 0.0243
ha: 0.0240
informacao: 0.0228
memoria: 0.0227
neurologia: 0.0215
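
The ranking above averages each term's TF-IDF weight over all documents and keeps the highest means. As a dependency-free sketch of that step, assuming a small hypothetical TF-IDF matrix (the numbers are made up and not taken from the results above):

```python
# Hypothetical TF-IDF matrix: one row per document, one column per term
terms = ["clinico", "epilepsia", "relatorio", "tremor"]
matrix = [
    [0.7, 0.0, 0.7, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.6, 0.0, 0.6, 0.5],
]

# Mean weight of each term over all documents (column-wise mean)
n_docs = len(matrix)
mean_scores = [sum(row[j] for row in matrix) / n_docs for j in range(len(terms))]

# Indices of the k highest mean scores, best first
k = 2
top_indices = sorted(range(len(terms)), key=lambda j: mean_scores[j], reverse=True)[:k]
top_terms = [terms[j] for j in top_indices]

for j in top_indices:
    print(f"{terms[j]}: {mean_scores[j]:.4f}")
```

Because the mean runs over every document, a term that occurs in many documents can rank high even when its weight is moderate in each one.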

