# TF-IDF Analysis

TF-IDF (term frequency-inverse document frequency) analysis is a statistical technique used in natural language processing and information retrieval to determine the importance of a word in a document or corpus. It is a way to measure how relevant a word is to a document in a collection of documents.

TF-IDF analysis assigns a weight to each word in a document based on how frequently it appears in the document (term frequency) and how rare it is in the entire corpus (inverse document frequency). The weight assigned to a word increases proportionally with its frequency in the document, but is offset by the rarity of the word in the corpus. This means that words that appear frequently in a document but also appear frequently in many other documents in the corpus are given a lower weight, while words that appear less frequently in the corpus but frequently in a particular document are given a higher weight.

The output of TF-IDF analysis is a numerical representation of each document that captures the importance of each word in that document. This can be used for various tasks such as text classification, clustering, and information retrieval.

## Table of Contents
* [Connect to Database ](#Connect-to-database)
* [Import Datasets](#Import-Dataset)
* [Remove Stopwords](#Remove-stopwords)
* [Lemmatization](#Lemmatization)

In [1]:
import mysql.connector
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from pandas import Timestamp
from IPython.display import display
from Functions.connection.connection import *
from Functions.AlertP1.data_cleaning import *
from Functions.AlertP1.features import *
from Functions.analysis.step_analysis import *
from Functions.AlertP1.dummy_features import *
from Functions.Models.decision_tree import *
from Functions.Models.Logistic_regression import *
from Functions.Models.evaluation import *
import spacy
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from numpy import argsort
from Functions.AlertP1.data_cleaning import *
from Functions.AlertP1.features import *
from Functions.analysis.step_analysis import *
from Functions.AlertP1.dummy_features import *
from Functions.Models.evaluation import *
from Functions.NLP.alertp1_nlp import *
from Functions.NLP.data_with_nlp import *
from Functions.Pipelines import pipeline_NLP as NLP
#from Functions.Pipelines import *
from gensim.models import Word2Vec
import re
import nltk
import gensim

## Connect to Database

In [2]:
import mysql.connector
import pandas as pd

#creds = ["username","password","juliehaegh","ninG20&19rea","3306"] 
creds = ["juliehaegh","ninG20&19rea","172.20.20.4","hgo",3306]

In [3]:
#Connection to the database
host = creds[2]
user = creds[0]
password = creds[1]
database = creds[3]
port = creds[4]
mydb = mysql.connector.connect(host=host, user=user, database=database, port=port, password=password, auth_plugin='mysql_native_password')
mycursor = mydb.cursor()

#Safecheck to guarantee that the connection worked
mycursor.execute('SHOW TABLES;')
print(f"Tables: {mycursor.fetchall()}")
print(mydb.connection_id) #it'll give connection_id,if got connected

Tables: [('ConsultaUrgencia_doentespedidosconsultaNeurologia2012',), ('consultaneurologia2012',), ('consultaneurologia201216anon_true',), ('hgo_data_032023',)]
394


## Import Datasets

In [4]:
# Import Alert P1 dataset
SClinic = pd.read_sql("""SELECT * FROM ConsultaUrgencia_doentespedidosconsultaNeurologia2012""",mydb)

# Import SClinic
AlertP1 = pd.read_sql("""SELECT * FROM consultaneurologia201216anon_true""",mydb)

# Replace all NaN with 0
AlertP1 = AlertP1.fillna(0)

# Add result column
AlertP1['result'] = ['Accepted' if x in [0,14,25,20,53,8,12,12] else 'Refused' for x in AlertP1['COD_MOTIVO_RECUSA']]

  SClinic = pd.read_sql("""SELECT * FROM ConsultaUrgencia_doentespedidosconsultaNeurologia2012""",mydb)
  AlertP1 = pd.read_sql("""SELECT * FROM consultaneurologia201216anon_true""",mydb)


In [5]:
data = NLP.pre_process(AlertP1)
data

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  alertP1['PROVENIENCIA'][alertP1['PROVENIENCIA']=='']='unknown'
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  alertP1['CTH_PRIOR'][alertP1['CTH_PRIOR']=='']='unknown'
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  alertP1['COD_UNID_SAUDE_PROV'][alertP1['COD_UNID_SAUDE_PROV']==3151401]=3151400
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#r

Unnamed: 0,ID_DOENTE,PROCESSO,COD_REFERENCIA,COD_PZ,COD_UNID_SAUDE_PROV,UNID_PROV,TIPO_UNID,COD_CTH_PRIOR,CTH_PRIOR,COD_MOTIVO_RECUSA,...,unknown,Other specialities,2,3+,HOSP,UCSP,USF A,USF B,outro,not accepted before
630,EGBZZB,VLNMEEM,LOHHHSHT,SSDBHEA,3150502,CHARNECA DA CAPARICA,CS/USF,,unknown,0,...,0,0,0,0,0,0,1,0,0,1
1537,FGSEDD,MIVCNDB,LOHHLSTU,,0,,OUTRA,,unknown,7,...,1,0,0,0,0,0,0,0,0,1
985,BSEZF,LLCBVJI,LOHHLTRS,SCCABHA,3150571,USF SOBREDA-CS COSTA CAPARICA,CS/USF,,unknown,53,...,0,1,0,0,0,0,0,1,0,1
1103,ESSSBD,LLDDNEN,LOHHLTSH,SCCBZCA,3150571,USF SOBREDA-CS COSTA CAPARICA,CS/USF,,unknown,0,...,0,1,0,0,0,0,0,1,0,1
752,DECZCS,VLEVCVE,LOHHLTOU,SCCDEGG,3151672,USF AMORA SAUDAVEL,CS/USF,,unknown,0,...,0,0,0,0,0,0,0,1,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1606,FFDSBH,MILLLCJ,LSHUVULH,ZESCCSSF,3152401,UCSP CORROIOS,CS/USF,2,Prioritário,0,...,0,0,0,0,0,1,0,0,0,1
1335,EBBCSA,"I,IJME+LI",LSHVHLLH,ZESBAHAD,3150571,USF SOBREDA-CS COSTA CAPARICA,CS/USF,3,Normal,0,...,0,0,0,0,0,0,0,1,0,1
1177,ZHZFA,ILBIBCL,LSHVHLHP,ZESBFBCB,3152400,CORROIOS (SEDE),CS/USF,3,Normal,0,...,0,0,0,1,0,1,0,0,0,0
1332,FDSADA,IIJMMNJ,LSHVHSUR,ZESGGDFH,3150572,USF MONTE DA CAPARICA,CS/USF,,unknown,7,...,0,0,0,0,0,0,0,1,0,1


In [6]:
import math

# Split data into train and test
AlertP1_sorted = data[data['clean_text']!=''].sort_values(by='DATA_RECEPCAO')

# calculate the index for the split
split_index = math.ceil(0.8 * len(AlertP1_sorted))

# split the data frame into test and train sets
train_set = AlertP1_sorted.iloc[:split_index]
test_set = AlertP1_sorted.iloc[split_index:]

In [None]:
import re
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Remove numbers from the clean text using regular expressions
train_set['clean_text'] = train_set['clean_text'].apply(lambda x: re.sub(r'\d+', '', x))
test_set['clean_text'] = test_set['clean_text'].apply(lambda x: re.sub(r'\d+', '', x))

# Concatenate train and test sets
combined_set = pd.concat([train_set, test_set], ignore_index=True)

# Create a TfidfVectorizer object with desired parameters
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, min_df=5, max_features=20)

# Fit and transform the text data for combined set
tfidf_matrix_combined = tfidf_vectorizer.fit_transform(combined_set['clean_text'])

# Create a dataframe for TfidfVectorizer output with top 20 words as columns for combined set
tfidf_df_combined = pd.DataFrame(tfidf_matrix_combined.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

# Save the TF-IDF dataframe for combined set to a CSV file
tfidf_df_combined.to_csv('tf-idf_combined.csv', index=False)

# Print the document-term matrix for TfidfVectorizer on combined set
print("TF-IDF Vectorizer - Combined Set:\n")
tfidf_df_combined

In [7]:
import re
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

def append_tfidf_to_dataframe(data, text_column):
    # Remove numbers from the clean text using regular expressions
    data['clean_text'] = data[text_column].apply(lambda x: re.sub(r'\d+', '', x))

    # Create a TfidfVectorizer object with desired parameters
    tfidf_vectorizer = TfidfVectorizer(max_df=0.8, min_df=5, max_features=20)

    # Fit and transform the text data for combined set
    tfidf_matrix_combined = tfidf_vectorizer.fit_transform(data['clean_text'])

    # Create a dataframe for TfidfVectorizer output with top 20 words as columns for combined set
    tfidf_df_combined = pd.DataFrame(tfidf_matrix_combined.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

    # Append the TF-IDF dataframe to the original dataframe
    data = pd.concat([data, tfidf_df_combined], axis=1)

    return data

In [8]:
# Assuming your original dataframe is named 'original_df' and the text column is 'text'
result_df = append_tfidf_to_dataframe(data, 'clean_text')

result_df.head()

Unnamed: 0,ID_DOENTE,PROCESSO,COD_REFERENCIA,COD_PZ,COD_UNID_SAUDE_PROV,UNID_PROV,TIPO_UNID,COD_CTH_PRIOR,CTH_PRIOR,COD_MOTIVO_RECUSA,...,hta,medicar,medicina,mg,quadro,realizar,se,tac,ter,tremor
630,EGBZZB,VLNMEEM,LOHHHSHT,SSDBHEA,3150502.0,CHARNECA DA CAPARICA,CS/USF,,unknown,0.0,...,0.855377,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1537,FGSEDD,MIVCNDB,LOHHLSTU,,0.0,,OUTRA,,unknown,7.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.706724,0.0,0.0,0.0
985,BSEZF,LLCBVJI,LOHHLTRS,SCCABHA,3150571.0,USF SOBREDA-CS COSTA CAPARICA,CS/USF,,unknown,53.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1103,ESSSBD,LLDDNEN,LOHHLTSH,SCCBZCA,3150571.0,USF SOBREDA-CS COSTA CAPARICA,CS/USF,,unknown,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
752,DECZCS,VLEVCVE,LOHHLTOU,SCCDEGG,3151672.0,USF AMORA SAUDAVEL,CS/USF,,unknown,0.0,...,0.0,0.108525,0.0,0.266871,0.368769,0.0,0.34225,0.0,0.481288,0.0


## If the word is present, return 1, if the word is not present, return 0

In [None]:
# Remove numbers from the clean text using regular expressions
train_set['clean_text'] = train_set['clean_text'].apply(lambda x: re.sub(r'\d+', '', x))
test_set['clean_text'] = test_set['clean_text'].apply(lambda x: re.sub(r'\d+', '', x))

# Concatenate train and test sets
combined_set = pd.concat([train_set, test_set], ignore_index=True)

# Create a TfidfVectorizer object with desired parameters
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, min_df=5, max_features=20)

# Fit and transform the text data for combined set
tfidf_matrix_combined = tfidf_vectorizer.fit_transform(combined_set['clean_text'])

# Create a dataframe for TfidfVectorizer output with top 20 words as columns for combined set
tfidf_df_combined = pd.DataFrame(tfidf_matrix_combined.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

# Convert the TF-IDF values to binary (1 if non-zero, 0 otherwise)
tfidf_df_combined = tfidf_df_combined.apply(np.sign)

# Save the TF-IDF dataframe for combined set to a CSV file
tfidf_df_combined.to_csv('tf-idf_combined.csv', index=False)

# Print the document-term matrix for TfidfVectorizer on combined set
print("TF-IDF Vectorizer - Combined Set:\n")
tfidf_df_combined

In [9]:
def append_tfidf_to_dataframe(data, text_column):
    # Remove numbers from the clean text using regular expressions
    data['clean_text'] = data[text_column].apply(lambda x: re.sub(r'\d+', '', x))

    # Create a TfidfVectorizer object with desired parameters
    tfidf_vectorizer = TfidfVectorizer(max_df=0.8, min_df=5, max_features=20, binary=True)

    # Fit and transform the text data
    tfidf_matrix = tfidf_vectorizer.fit_transform(data['clean_text'])

    # Create a dataframe for TfidfVectorizer output with top 20 words as columns
    tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

    # Convert the TF-IDF values to binary (1 if non-zero, 0 otherwise)
    tfidf_df = tfidf_df.apply(np.sign)

    # Append the TF-IDF dataframe to the original dataframe
    data = pd.concat([data, tfidf_df], axis=1)

    return data

In [10]:
# Assuming your original dataframe is named 'original_df' and the text column is 'text'
result_df = append_tfidf_to_dataframe(data, 'clean_text')

result_df.head()

Unnamed: 0,ID_DOENTE,PROCESSO,COD_REFERENCIA,COD_PZ,COD_UNID_SAUDE_PROV,UNID_PROV,TIPO_UNID,COD_CTH_PRIOR,CTH_PRIOR,COD_MOTIVO_RECUSA,...,hta,medicar,mês,neurologio,quadro,se,seguir,sintomatologia,tac,ter
630,EGBZZB,VLNMEEM,LOHHHSHT,SSDBHEA,3150502.0,CHARNECA DA CAPARICA,CS/USF,,unknown,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1537,FGSEDD,MIVCNDB,LOHHLSTU,,0.0,,OUTRA,,unknown,7.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
985,BSEZF,LLCBVJI,LOHHLTRS,SCCABHA,3150571.0,USF SOBREDA-CS COSTA CAPARICA,CS/USF,,unknown,53.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1103,ESSSBD,LLDDNEN,LOHHLTSH,SCCBZCA,3150571.0,USF SOBREDA-CS COSTA CAPARICA,CS/USF,,unknown,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
752,DECZCS,VLEVCVE,LOHHLTOU,SCCDEGG,3151672.0,USF AMORA SAUDAVEL,CS/USF,,unknown,0.0,...,0.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0
