# Environment Setup

## Setting Environment Variables

In [None]:
isAtGoogleColab = False

## Installing Libraries

In [None]:
%pip install spacy
%pip install scikit-learn
%pip install keras
%pip install pandas
%pip install numpy
%pip install matplotlib
%pip install seaborn
%pip install nltk
%pip install google-colab

## Importing Libraries

In [None]:

import pandas as pd
import spacy
import spacy.cli
import re
from io import StringIO
import matplotlib.pyplot as plt


## Downloading the spaCy Model

In [None]:
# Download the large English model used by the preprocessing functions below
spacy.cli.download('en_core_web_lg')

## Importing Training Data

In [None]:
if isAtGoogleColab:
    df = pd.read_csv('/content/classification-labeled.csv', encoding='latin1', delimiter=';')
else:
    df = pd.read_csv('../data/classification-labeled.csv', encoding='latin1', delimiter=';')
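
The same Colab/local switch is repeated in the export step, so it could be wrapped in a small helper. The sketch below is only an illustration: `resolve_data_path` is a hypothetical name that does not exist in the notebook, and the two base folders are simply copied from the cell above.

```python
from pathlib import Path

def resolve_data_path(is_at_google_colab, filename='classification-labeled.csv'):
    """Hypothetical helper: build the CSV path for a Colab run or a local run."""
    # Base folders copied from the loading cell; adjust them to your own layout.
    base_dir = Path('/content') if is_at_google_colab else Path('..') / 'data'
    return base_dir / filename

colab_path = resolve_data_path(True)
local_path = resolve_data_path(False)
```

The returned `Path` can be passed directly to `pd.read_csv`, together with the `encoding` and `delimiter` arguments used above.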

# Preprocessing

## Preprocessing Functions

### Removing Links from Comments

The `extract_links(comment)` function finds every URL in a text and removes it, returning a tuple with the list of links found and the cleaned text. `extract_links_df(df_recived)` applies `extract_links` to each entry of the DataFrame's 'comment' column, storing the extracted links in a new 'links' column and overwriting 'comment' with the cleaned text. Finally, the DataFrame is reduced to the columns 'id', 'comment', 'links' and 'sentiment', so the input must already contain 'id' and 'sentiment'.

In [None]:
def extract_links(comment):
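    """
    Extract the links from a comment
    Args:
      comment: a string that may contain URLs

    Returns:
      A tuple (links, comment): the list of URLs found and
      the comment with those URLs removed and surrounding whitespace stripped
    """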
    url_regex = r'https?://\S+'
    links = re.findall(url_regex, comment)
    comment = re.sub(url_regex, '', comment)
    return links, comment.strip()

def extract_links_df(df_recived):
    """
        Extract the links from df['comment']
        Args:
        Dataframe with the columns 'id', 'comment' (strings) and 'sentiment'

        Returns:
        The modified dataframe, with a new 'links' column and the links removed from 'comment'
    """
    df_recived['links'], df_recived['comment'] = zip(*df_recived['comment'].apply(extract_links))
    df_recived = df_recived[['id','comment', 'links', 'sentiment']]
    return df_recived

# Apply it to the training data
df = extract_links_df(df)
df
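
One caveat of the `https?://\S+` pattern is that it captures everything up to the next whitespace, so punctuation glued to a URL ends up inside the link. The following is a minimal sketch of one way to handle that, assuming trailing punctuation should not belong to the link. `extract_links_trimmed` and `TRAILING_PUNCTUATION` are hypothetical names, not part of the notebook.

```python
import re

URL_REGEX = re.compile(r'https?://\S+')
# Assumption: these characters never legitimately end a link in the comments.
TRAILING_PUNCTUATION = '.,;:!?)"\''

def extract_links_trimmed(comment):
    """Hypothetical variant of extract_links that trims punctuation glued to a URL."""
    links = [match.rstrip(TRAILING_PUNCTUATION) for match in URL_REGEX.findall(comment)]
    cleaned = URL_REGEX.sub('', comment)
    # Collapse the whitespace runs left behind by the removed links.
    cleaned = re.sub(r'\s+', ' ', cleaned).strip()
    return links, cleaned

links, cleaned = extract_links_trimmed(
    "Leak reveals all https://t.co/hjsUSc6AVZ, read it now"
)
```

Note that the punctuation is removed from the comment together with the link, because the substitution still uses the full match.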

### Converting Letters to Lowercase

The `to_lower(text)` function converts every uppercase letter of a string to lowercase and returns the modified string. `to_lower_df(df_lower)` applies `to_lower` to the 'comment' column of a DataFrame and returns the DataFrame with that column lowercased.

In [None]:
def to_lower(text):
    """
        Transform uppercase letters from a string into lowercase
        Args:
        A string

        Returns:
        A string
    """
    return text.lower()


def to_lower_df(df_lower):
    """
        Transform uppercase letters from a dataframe into lowercase
        Args:
        Dataframe with column 'comment'

        Returns:
        The modified dataframe
    """
    df_lower['comment'] = df_lower['comment'].apply(to_lower)
    return df_lower
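
Because `to_lower` calls `text.lower()` directly, a missing comment (`None` or `NaN`) would raise an `AttributeError`. As a hedged alternative, pandas' built-in `Series.str.lower()` lowercases the strings and leaves missing values untouched. The function name `to_lower_vectorized` is hypothetical, and working on a copy is a design choice of this sketch, not of the notebook.

```python
import pandas as pd

def to_lower_vectorized(df_lower):
    """Hypothetical variant of to_lower_df based on the vectorized string accessor."""
    # Work on a copy so the caller's dataframe is left unchanged.
    df_out = df_lower.copy()
    df_out['comment'] = df_out['comment'].str.lower()
    return df_out

sample = pd.DataFrame({'comment': ['HELLO World', None]})
lowered = to_lower_vectorized(sample)
```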

### Tokenization

The `tokenize_and_pre_processing(text)` function tokenizes a string with the spaCy language model and returns a list of lemmas. It keeps only tokens that are part of a named entity or whose part of speech is noun, verb, adjective or adverb, which discards most stop words (articles, prepositions, pronouns) without consulting a stop-word list. The companion function `tokenize_and_pre_processing_df(df_tok)` applies it to the 'comment' column of a DataFrame, stores the result in the 'pos_tokens' column, and returns the updated DataFrame.

In [None]:
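# Load the model once at definition time; loading it inside the function
# would repeat the expensive load for every row of the dataframe.
nlp = spacy.load('en_core_web_lg')

def tokenize_and_pre_processing(text):
    """
        Tokenize and lemmatize a string, keeping named entities, nouns, verbs, adjectives and adverbs
        Args:
        A string

        Returns:
        A list with the lemmas of the tokens that were kept
    """
    doc = nlp(text)
    lemmas = [token.lemma_ for token in doc if (token.ent_type_ != '') or (token.pos_ in ['NOUN', 'VERB', 'ADJ', 'ADV'])]
    return lemmas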

def tokenize_and_pre_processing_df(df_tok):
    """
        Tokenize and lemmatize the 'comment' column of a dataframe
        Args:
        Dataframe with a column named 'comment', which contains strings

        Returns:
        The modified dataframe, with the lemmas stored in the 'pos_tokens' column
    """
    df_tok['pos_tokens'] = df_tok['comment'].apply(tokenize_and_pre_processing)
    return df_tok
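
For larger datasets, spaCy's `Language.pipe` processes texts as a stream instead of one call per row. The sketch below shows the idea under stated assumptions: `lemmatize_comments` is a hypothetical name, and `StubPipeline` is a stand-in that only mimics the attributes used here so the example runs without downloading a model. With a real model you would pass the object returned by `spacy.load('en_core_web_lg')`.

```python
from types import SimpleNamespace

KEPT_POS = {'NOUN', 'VERB', 'ADJ', 'ADV'}

def lemmatize_comments(comments, nlp):
    """Hypothetical batch variant: run all comments through nlp.pipe in one pass."""
    results = []
    # nlp.pipe yields one document per input text, in the same order.
    for doc in nlp.pipe(comments):
        results.append([
            token.lemma_ for token in doc
            if token.ent_type_ != '' or token.pos_ in KEPT_POS
        ])
    return results

class StubPipeline:
    """Stand-in for a loaded spaCy model (illustration only).

    It tags every word as a NOUN and uses the lowercased word as its lemma.
    """
    def pipe(self, texts):
        for text in texts:
            yield [
                SimpleNamespace(lemma_=word.lower(), pos_='NOUN', ent_type_='')
                for word in text.split()
            ]

lemmas = lemmatize_comments(['Uber files', 'Radio Cars'], StubPipeline())
```

In the notebook, the result could be assigned with `df_tok['pos_tokens'] = lemmatize_comments(df_tok['comment'].tolist(), nlp)`.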

### Removing Stop Words

The `filter_tokens(tokens)` function keeps only the tokens of a spaCy token list that belong to a named entity or are nouns, verbs, adjectives or adverbs, and returns the filtered list. `filter_tokens_df(df_filter)` applies that filter to the 'pos_tokens' column of a DataFrame and returns the DataFrame with the column updated. Both functions expect 'pos_tokens' to hold spaCy `Token` objects, not the lemma strings produced by `tokenize_and_pre_processing_df`.

In [None]:
def filter_tokens(tokens):
    """
        Keep only named entities, nouns, verbs, adjectives and adverbs from a spacy tokens list
        Args:
        Spacy tokens list

        Returns:
        A filtered list with the remaining tokens
    """
    filtered_tokens = []
    for token in tokens:
        if (token.ent_type_ != '') or (token.pos_ in ['NOUN', 'VERB', 'ADJ', 'ADV']):
            filtered_tokens.append(token)
    return filtered_tokens

def filter_tokens_df(df_filter):
    """
        Keep only named entities, nouns, verbs, adjectives and adverbs in the 'pos_tokens' column of a dataframe
        Args:
        Dataframe with a column named 'pos_tokens', which contains a list of spacy tokens

        Returns:
        The modified dataframe
    """
    df_filter['pos_tokens'] = df_filter['pos_tokens'].apply(filter_tokens)
    return df_filter
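
The filter above relies on part-of-speech tags only, so a stop word tagged as an adverb still passes. If an explicit stop-word check is wanted, spaCy tokens expose an `is_stop` attribute that can be added to the condition. This is a sketch, not the notebook's method: `filter_tokens_without_stop_words` and `make_token` are hypothetical names, and the tokens are hand-built stand-ins whose tags and stop-word flags are set manually for illustration.

```python
from types import SimpleNamespace

KEPT_POS = {'NOUN', 'VERB', 'ADJ', 'ADV'}

def filter_tokens_without_stop_words(tokens):
    """Hypothetical variant of filter_tokens that also drops tokens flagged as stop words."""
    return [
        token for token in tokens
        if not token.is_stop
        and (token.ent_type_ != '' or token.pos_ in KEPT_POS)
    ]

def make_token(text, pos, is_stop, ent_type=''):
    """Build a stand-in object with the token attributes the filter reads."""
    return SimpleNamespace(text=text, pos_=pos, is_stop=is_stop, ent_type_=ent_type)

tokens = [
    make_token('the', 'DET', True),
    make_token('driver', 'NOUN', False),
    make_token('arrived', 'VERB', False),
    make_token('very', 'ADV', True),
    make_token('late', 'ADV', False),
]
kept = [token.text for token in filter_tokens_without_stop_words(tokens)]
```

One consequence of this variant is that a stop word inside a named entity is dropped as well, which may or may not be desirable.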

### Lemmatization

The `lemmatize(tokens)` function extracts the lemma of each token in a list of spaCy tokens and returns the list of lemmas. `lemmatize_df(df_lemmatize)` applies it to the 'pos_tokens' column of a DataFrame, which must contain lists of spaCy tokens, replacing the tokens with their lemmas, and returns the modified DataFrame.

In [None]:
def lemmatize(tokens):
    """
        Retain only the lemma from the tokens
        Args:
        Spacy tokens list

        Returns:
        A list with the lemma (a string) of each token
    """
    lemmas = [token.lemma_ for token in tokens]
    return lemmas

def lemmatize_df(df_lemmatize):
    """
        Lemmatize the dataframe's column named 'pos_tokens'
        Args:
        Dataframe with a column named 'pos_tokens', which contains a list of spacy tokens

        Returns:
        The modified dataframe
    """
    df_lemmatize['pos_tokens'] = df_lemmatize['pos_tokens'].apply(lemmatize)
    return df_lemmatize

## Tests

### CSV Tests

The `csv_data` string holds a small CSV in text form, with the semicolon-separated columns 'id', 'comment' and 'sentiment'. `pd.read_csv` reads it through `StringIO`, which wraps the string in a file-like object, so the data is loaded into the `df_teste` DataFrame without touching the disk.

In [None]:
# CSV to test
csv_data = """id;comment;sentiment
1;That, my friend, is why The Mighty Swift Radio Cars of Stalybridge retain my costume.;0
2;via The Guardian  Guardian front page, Monday 11 July 2022 - The #Uber files: Leak reveals secret lobbying operation to conquer the world  https://t.co/hjsUSc6AVZ;-1
"""

# Loading the test csv
df_teste = pd.read_csv(StringIO(csv_data), sep=';')

### Library Tests

In [None]:
def test_load_csv():
    """
        Tests the loading of a CSV file into a pandas DataFrame.
        This function attempts to load a CSV file named 'data.csv' from the current directory
        and asserts that the loaded DataFrame is not empty.

        Args:
        None: The function does not require external arguments.

        Returns:
        None: The function only prints a message when the check fails.

        Raises:
        Nothing: every exception, including the AssertionError raised for an empty DataFrame,
        a FileNotFoundError or a parsing error, is caught and reported with a printed message.
    """
    try:
        df = pd.read_csv('data.csv')
        assert not df.empty, "DataFrame is empty" 
    except Exception as e:
        print(f"Failed to load or parse CSV: {e}") 

test_load_csv()

In [None]:

def test_simple_plot():
    """
        Tests the functionality of matplotlib to plot a simple graph.
        This function plots y = x^2 for x in [1, 2, 3, 4, 5], sets the title and the axis labels,
        and closes the figure with 'plt.close()' without displaying it, so that the
        figure's resources are released.

        Args:
        None: The function does not require external arguments.

        Returns:
        None: The function prints a success message when plotting completes.

        Raises:
        Nothing: any exception raised while plotting is caught and printed.
    """
    try:
        x = [1, 2, 3, 4, 5]
        y = [xi**2 for xi in x]
        plt.plot(x, y)
        plt.title("Test Plot")
        plt.xlabel("X axis")
        plt.ylabel("Y axis")
        plt.close() 
        print("Plotting test passed successfully.")
    except Exception as e:
        print(f"Failed to plot: {e}")

test_simple_plot()

In [None]:
import nltk
from nltk.tokenize import word_tokenize

def test_tokenization():
    """
        Tests the tokenization functionality of the NLTK library.
        This function downloads the tokenizer data if it is not already present,
        tokenizes a predefined sentence, and asserts the expected number of tokens.

        Args:
        None: The function does not require external arguments.

        Returns:
        None: The function prints a success message when the test passes.

        Raises:
        Nothing: any exception, including a failed assertion or a download error, is caught and printed.

        'word_tokenize' splits punctuation into separate tokens, so the test sentence yields 8 tokens:
        'Hello', ',', 'this', 'is', 'a', 'test', 'sentence', '.'.
    """
    try:
        nltk.download('punkt', quiet=True)
        # Assumption: recent NLTK releases read the tokenizer data from 'punkt_tab' instead of 'punkt'
        nltk.download('punkt_tab', quiet=True)
        test_sentence = "Hello, this is a test sentence."
        tokens = word_tokenize(test_sentence)
        assert len(tokens) == 8, f"Expected 8 tokens, got {len(tokens)}"
        print("Tokenization test passed successfully.")
    except Exception as e:
        print(f"Tokenization test failed: {e}")

test_tokenization()

### Function Tests

In [None]:
def test_extract_links():
    """
        Tests the extract_links_df function to ensure that it correctly removes links from the 'comment' column.
        Args:
        None: The test does not require external arguments as the input data is defined within the function.

        Returns:
        None: The test does not return a value but uses assertions to ensure the correctness of the function being tested.

        The function first creates a pandas DataFrame 'entrada' with the columns that 'extract_links_df' requires
        ('id', 'comment' and 'sentiment'); one of the comments contains a hyperlink.
        'resultado' stores the output of 'extract_links_df'.
        Finally, the function asserts that the comments come back without the link and without surrounding whitespace,
        and that each removed link is stored in the 'links' column.
    """
    entrada = pd.DataFrame({
        "id": [1, 2],
        "comment": ["Check this out: http://example.com", "No link here"],
        "sentiment": [0, 0],
    })
    resultado = extract_links_df(entrada)
    assert resultado["comment"].tolist() == ["Check this out:", "No link here"], "Links were not removed correctly"
    assert resultado["links"].tolist() == [["http://example.com"], []], "Links were not extracted correctly"

In [None]:
def test_to_lower():
    """
        Tests the to_lower_df function to verify if it successfully converts all letters in strings within a DataFrame column to lowercase.
        Args:
        None: The test does not require external arguments as the input data is defined within the function.

        Returns:
        None: The test does not return a value but uses assertions to ensure the functionality of the function being tested.

        The function begins by creating a pandas DataFrame 'entrada' with some sample text where some are in uppercase.
        It then defines what the expected result should look like in the DataFrame 'esperado', which contains the same text but entirely in lowercase.
        'resultado' stores the output from the function 'to_lower_df', which is expected to convert all uppercase letters in the 'entrada' DataFrame to lowercase.
        Finally, the function asserts that 'resultado' matches 'esperado'. If there is a mismatch, it raises an assertion error with the message "Conversion to lowercase failed".
    """
    entrada = pd.DataFrame({"comment": ["HELLO WORLD", "Already lowercase"]})
    esperado = pd.DataFrame({"comment": ["hello world", "already lowercase"]})
    resultado = to_lower_df(entrada)
    assert resultado.equals(esperado), "Conversion to lowercase failed"

In [None]:
def test_tokenize_and_pre_processing():
    """
        Tests the tokenize_and_pre_processing_df function to verify that it tokenizes the comments
        in a DataFrame and stores the result in the 'pos_tokens' column.
        Args:
        None: The test does not require external arguments as the input data is defined within the function.

        Returns:
        None: The test does not return a value but uses assertions to ensure the functionality of the function being tested.

        This function starts by creating a pandas DataFrame 'entrada' with sample text comments.
        'resultado' captures the output from 'tokenize_and_pre_processing_df'.
        Because the exact lemmas depend on the spaCy model's tagging, the assertions check the structure of the output:
        the 'pos_tokens' column must exist and hold one list of strings per comment.
    """
    entrada = pd.DataFrame({"comment": ["hello world", "testing pytest"]})
    resultado = tokenize_and_pre_processing_df(entrada)
    assert "pos_tokens" in resultado.columns, "Column 'pos_tokens' was not created"
    assert len(resultado) == 2, "The number of rows changed during pre-processing"
    assert all(
        isinstance(tokens, list) and all(isinstance(token, str) for token in tokens)
        for tokens in resultado["pos_tokens"]
    ), "Tokenization or pre-processing did not function as expected"

## Pipeline

### Building the Pipeline

The `preprocess_pipeline` function gathers all preprocessing steps in one place, so they can be applied to any DataFrame with a single call: it removes links, lowercases the comments, and then tokenizes and lemmatizes them.

In [None]:
def preprocess_pipeline(df):
    """
        The pipeline of the pre_processing
        Args:
        Dataframe with a column called 'comment'

        Returns:
        The modified dataframe
    """
    # Download the spaCy model only if it is not installed yet
    if not spacy.util.is_package('en_core_web_lg'):
        spacy.cli.download('en_core_web_lg')
    remove_links = extract_links_df(df)

    text_lower = to_lower_df(remove_links)

    pos_tokens = tokenize_and_pre_processing_df(text_lower)

    return pos_tokens
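
Since every step takes a DataFrame and returns a DataFrame, the chain can also be written as a list of steps applied in order with `DataFrame.pipe`. The sketch below uses two simplified stand-in steps so it runs on its own. `run_steps`, `strip_whitespace` and `lowercase` are hypothetical names, not functions from the notebook.

```python
import pandas as pd

def strip_whitespace(df):
    """Hypothetical step: trim surrounding whitespace from 'comment'."""
    df = df.copy()
    df['comment'] = df['comment'].str.strip()
    return df

def lowercase(df):
    """Hypothetical step: lowercase 'comment'."""
    df = df.copy()
    df['comment'] = df['comment'].str.lower()
    return df

def run_steps(df, steps):
    """Apply each step in order and return the final dataframe."""
    for step in steps:
        df = df.pipe(step)
    return df

sample = pd.DataFrame({'comment': ['  Hello WORLD  ', 'Uber Files']})
result = run_steps(sample, [strip_whitespace, lowercase])
```

With the notebook's own functions, the list would be `[extract_links_df, to_lower_df, tokenize_and_pre_processing_df]`, which makes it easy to add, remove or reorder steps.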

## Pipeline Diagram

The pipeline diagram is in the project documentation, in topic '5.5' of section '5' (preprocessing).


## Exporting Data

The code below runs the pipeline on the labeled data and exports the resulting CSV file to the location set in `export_path`.

In [None]:
if isAtGoogleColab:
    df = pd.read_csv('/content/classification-labeled.csv', encoding='latin1', delimiter=';')
    export_path = '/content/processed_classification-labeled.csv'
else:
    df = pd.read_csv('./data/classification-labeled.csv', encoding='latin1', delimiter=';')
    export_path = './data/processed_classification-labeled.csv'

processed_text = preprocess_pipeline(df)

columns_to_keep = ['id', 'comment', 'links', 'sentiment', 'pos_tokens']  

# Create a new dataframe with only the columns to export
df_final = processed_text[columns_to_keep]
df_final.to_csv(export_path, index=False)

print(f"Processed DataFrame saved to: {export_path}")
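
One detail to keep in mind when reading the exported file back: CSV has no list type, so list columns such as 'links' and 'pos_tokens' are written as their string representation. A minimal sketch of restoring them, assuming the lists contain only plain strings, is to pass `ast.literal_eval` as a converter to `pd.read_csv`. The in-memory buffer and the sample rows below are illustrative only.

```python
import ast
from io import StringIO

import pandas as pd

exported = pd.DataFrame({
    'id': [1, 2],
    'pos_tokens': [['friend', 'retain', 'costume'], []],
})

# to_csv writes each list as text, e.g. "['friend', 'retain', 'costume']".
buffer = StringIO()
exported.to_csv(buffer, index=False)

# Without a converter the column comes back as plain strings.
buffer.seek(0)
as_text = pd.read_csv(buffer)

# ast.literal_eval parses the text back into a Python list.
buffer.seek(0)
restored = pd.read_csv(buffer, converters={'pos_tokens': ast.literal_eval})
```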