# Utils Alba Garcia

Utility functions for cleaning the data and extracting features.

In [13]:
pip install unidecode

Note: you may need to restart the kernel to use updated packages.


In [65]:
import sys
import string
import nltk
import sklearn
import numpy as np
from typing import Iterable
import pandas as pd
import scipy
from sklearn import *
import os
import re
import unidecode
from nltk.corpus import stopwords

In [4]:
train_df = pd.read_csv("./quora_train_data.csv")
train_df

| Unnamed: 0 | id | qid1 | qid2 | question1 | question2 | is_duplicate |
|---|---|---|---|---|---|---|
| 0 | 346692 | 38482 | 10706 | Why do I get easily bored with everything? | Why do I get bored with things so quickly and ... | 1 |
| 1 | 327668 | 454117 | 345117 | How do I study for Honeywell company recruitment? | How do I study for Honeywell company recruitme... | 1 |
| 2 | 272993 | 391373 | 391374 | Which search engine algorithm is Quora using? | Why is Quora not using reliable search engine? | 0 |
| 3 | 54070 | 82673 | 95496 | How can I smartly cut myself? | Can someone who thinks about suicide for 7 yea... | 0 |
| 4 | 46450 | 38384 | 72436 | How do I see who is viewing my Instagram videos? | Can one tell who viewed my Instagram videos? | 1 |
| ... | ... | ... | ... | ... | ... | ... |
| 323427 | 192476 | 292119 | 292120 | Is it okay to use a laptop while it is chargin... | Is it OK to use your phone while charging? | 0 |
| 323428 | 17730 | 33641 | 33642 | How can dogs understand human language? | Can dogs understand the human language? | 0 |
| 323429 | 28030 | 52012 | 52013 | What's your favourite lotion? | What's your favourite skin lotion? | 1 |
| 323430 | 277869 | 397054 | 120852 | How does one become a hedge fund manager? | What should I do to become a hedge fund manager? | 1 |


## Cleaning the data

Next, we define some functions for preprocessing the data, most of them based on **regex**.
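
Two character classes do most of the work in the functions that follow: `[^\w\s]` (anything that is neither a word character nor whitespace) and `[^a-zA-Z0-9]` (anything outside unaccented ASCII letters and digits). The short sketch below uses only the standard `re` module and a made-up sample string to show how differently the two treat accented letters and underscores.

```python
import re

sample = "höw_are you?"  # made-up string, not taken from the dataset

# For str patterns in Python 3, \w is Unicode-aware: it matches accented
# letters, digits and the underscore, so only "?" is removed here.
no_punct = re.sub(r"[^\w\s]", "", sample)

# The explicit ASCII class treats "ö", "_", " " and "?" as special characters
# and replaces each of them with a space.
ascii_only = re.sub(r"[^a-zA-Z0-9]", " ", sample)

print(repr(no_punct))    # 'höw_are you'
print(repr(ascii_only))  # 'h w are you '
```

The second pattern is the stricter one, which is why removing accents first is usually a good idea when it is used.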

In [75]:
# convert a sentence to lower case
def lowercase_sentence(sentence):
    """
    Args:
    sentence (str): The input sentence to be put in lower case.

    Returns:
    str: The input sentence in lower case.
    """
    new_sentence = sentence.lower()
    return new_sentence


# remove punctuation from a sentence
def remove_punctuation(sentence):
    """
    Args:
    sentence (str): The input sentence to remove punctuation.

    Returns:
    str: The sentence without punctuation symbols.
    """
    # matches anything that is neither a word character nor whitespace (includes '?');
    # accented letters, digits and underscores count as word characters and are kept
    new_sentence = re.sub(r'[^\w\s]', '', sentence)
    return new_sentence


# remove accents
def remove_accents(sentence):
    '''
    Args:
      sentence (str): The input sentence from which accents are removed
    Returns:
      str : The sentence without accents
    '''
    new_sentence = unidecode.unidecode(sentence) # transliterates to plain ASCII
    return new_sentence


# replace special characters (punctuation marks, accented letters, symbols) with spaces,
# keeping only unaccented letters and, optionally, digits
def remove_special_characters(sentence, numeric = False):
    """
    Args:
    sentence (str): The input sentence to remove non-alphanumeric characters.
    numeric (bool): if True, digits are also removed

    Returns:
    str: The sentence with every non-alphanumeric character replaced by a space.
    """
    if numeric:
        new_sentence = re.sub(r'[^a-zA-Z]', ' ', sentence) # matches non-alpha characters
    else:
        new_sentence = re.sub(r'[^a-zA-Z0-9]', ' ', sentence) # matches non-alphanumeric characters
    return new_sentence


# remove stop words
def remove_stopwords(sentence):
    """
    Args:
    sentence (str): The input sentence from which stop words will be removed.

    Returns:
    str: The input sentence with stop words removed.
    """
    stop_words = set(stopwords.words('english')) # predefined stop words in English
    
    words = nltk.word_tokenize(sentence)
    filtered_words = [word for word in words if word.lower() not in stop_words]
    new_sentence = ' '.join(filtered_words)
    return new_sentence


# Normalize spaces - Replace all consecutive whitespace characters in the text string with a single space.
def normalize_spaces(sentence):
    '''
    Args:
      sentence (str): The input sentence to normalize
    Returns:
      str: The final sentence normalized 
    '''
    new_sentence = re.sub(r'\s+', ' ', sentence)
    return new_sentence
    

### Examples

For tokenizing we just use **nltk.word_tokenize**.

In [84]:
# Tokenize text
sentence = 'This is an example sentence to test the given tokenizer.'
print(f"From => {sentence} -> {nltk.word_tokenize(sentence)}")

From => This is an example sentence to test the given tokenizer. -> ['This', 'is', 'an', 'example', 'sentence', 'to', 'test', 'the', 'given', 'tokenizer', '.']


In [78]:
sentence = "Hello, höw are you doing? I hope everything is \\ going well! Lét's meet at 3:00 PM. (It's raining outside.)"
print("Original sentence:\n\t", sentence)

# All lowercase
new_sentence = lowercase_sentence(sentence)
print("\nIn lower case:\n\t", new_sentence)

# Remove punctuation
new_sentence = remove_punctuation(sentence)
print("\nWithout punctuation symbols:\n\t", new_sentence)

# Remove accents
new_sentence = remove_accents(sentence)
print("\nWithout accents:\n\t", new_sentence)

# Remove special characters
new_sentence = remove_special_characters(sentence)
print("\nWithout special characters:\n\t", new_sentence)

# Normalize spaces
norm_sentence = normalize_spaces(new_sentence)
print("\nNormalized spaces after removing special characters:\n\t", norm_sentence)

# Remove stop words
new_sentence = remove_stopwords(sentence)
print("\nWithout stop words:\n\t", new_sentence)


Original sentence:
	 Hello, höw are you doing? I hope everything is \ going well! Lét's meet at 3:00 PM. (It's raining outside.)

In lower case:
	 hello, höw are you doing? i hope everything is \ going well! lét's meet at 3:00 pm. (it's raining outside.)

Without punctuation symbols:
	 Hello höw are you doing I hope everything is  going well Léts meet at 300 PM Its raining outside

Without accents:
	 Hello, how are you doing? I hope everything is \ going well! Let's meet at 3:00 PM. (It's raining outside.)

Without special characters:
	 Hello  h w are you doing  I hope everything is   going well  L t s meet at 3 00 PM   It s raining outside  

Normalized spaces after removing special characters:
	 Hello h w are you doing I hope everything is going well L t s meet at 3 00 PM It s raining outside 

Without stop words:
	 Hello , höw ? hope everything \ going well ! Lét 's meet 3:00 PM . ( 's raining outside . )
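
The output above shows why the order of the steps matters: applied to the raw sentence, `remove_special_characters` turns "höw" into "h w", because the accented letter is not in `[a-zA-Z0-9]`. A minimal sketch of one possible ordering follows. It is self-contained, so it does not call the functions defined earlier. `strip_accents` and `clean_sentence` are hypothetical names, and the standard `unicodedata` module stands in for `unidecode` here (it only drops combining marks, whereas `unidecode` transliterates more broadly).

```python
import re
import unicodedata


def strip_accents(text):
    # NFKD splits "ö" into "o" plus a combining diaeresis; the marks are then dropped.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def clean_sentence(sentence):
    text = strip_accents(sentence)           # 1. accents first, so letters survive step 3
    text = text.lower()                      # 2. lower case
    text = re.sub(r"[^a-z0-9]", " ", text)   # 3. special characters -> spaces
    return re.sub(r"\s+", " ", text).strip() # 4. collapse repeated whitespace


cleaned = clean_sentence("Hello, höw are you doing? Lét's meet at 3:00 PM.")
print(cleaned)  # hello how are you doing let s meet at 3 00 pm
```

Whether stop words should also be removed depends on the task, so that step is left out of the sketch.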


## Text features

### Stemming and lemmatization

Extracting basic text features: **stemming** and **lemmatization**.

**Choosing a stemmer**: the choice between PorterStemmer and LancasterStemmer depends on the task and on the characteristics of the text. PorterStemmer is the more conservative of the two and keeps stems closer to the original words. LancasterStemmer is more aggressive and produces shorter stems. It is usually worth trying both on your data and keeping the one that performs better for your particular task.
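
One quick way to compare the two stemmers is to run both over a handful of words and read the results side by side. The sketch below assumes `nltk` is installed; the word list is illustrative and not taken from the dataset.

```python
from nltk.stem import LancasterStemmer, PorterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

# Illustrative words; swap in tokens from your own data.
words = ["running", "maximum", "everything", "raining", "outside"]

# One row per word: (original, Porter stem, Lancaster stem)
rows = [(word, porter.stem(word), lancaster.stem(word)) for word in words]

for word, porter_stem, lancaster_stem in rows:
    print(f"{word:<12} porter={porter_stem:<12} lancaster={lancaster_stem}")
```

Rows where the two columns differ are the ones to inspect when deciding which stemmer suits the task.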

In [49]:
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import LancasterStemmer
from nltk.stem import WordNetLemmatizer

porter = PorterStemmer()
lancaster = LancasterStemmer()
wordnet_lemmatizer = WordNetLemmatizer()

In [63]:
# stemming (using both methods) -> strips word endings, may return a non-existing word
def stem(sentence, type_porter = True):
    '''
    Args:
      sentence (str): The input sentence for stemming
      type_porter (bool): if True we use the Porter method, if False, the Lancaster method
    Returns:
      str: The final sentence stemmed
    '''
    token_words = word_tokenize(sentence)
    stemmer = porter if type_porter else lancaster # pick the stemmer once
    # join the stems with single spaces (no trailing space)
    return " ".join(stemmer.stem(word) for word in token_words)


# lemmatization (using wordnet_lemmatizer.lemmatize(w)) -> reduces each word to its base form (a valid word)
def lemma(sentence):
    '''
    Args:
      sentence (str): The input sentence for lemmatization
    Returns:
      str: The final sentence lemmatized
    '''
    token_words = word_tokenize(sentence)
    # lemmatize() treats every token as a noun by default (pos="n"),
    # so verb forms such as "doing" are not reduced
    return " ".join(wordnet_lemmatizer.lemmatize(word) for word in token_words)
    

### Examples

In [64]:
sentence = "Hello, höw are you doing? I hope everything is \\ going well! Lét's meet at 3:00 PM. (It's raining outside.)"
print("Original sentence:\n\t", sentence)

# stemming
new_sentence = stem(sentence)
print("\nStemmed sentence:\n\t", new_sentence)

# lemmatization
new_sentence = lemma(sentence)
print("\nLemmatized sentence:\n\t", new_sentence)

Original sentence:
	 Hello, höw are you doing? I hope everything is \ going well! Lét's meet at 3:00 PM. (It's raining outside.)

Stemmed sentence:
	 hello , höw are you do ? i hope everyth is \ go well ! lét 's meet at 3:00 pm . ( it 's rain outsid . )

Lemmatized sentence:
	 Hello , höw are you doing ? I hope everything is \ going well ! Lét 's meet at 3:00 PM . ( It 's raining outside . ) 


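**Observation:** lemmatization changes almost nothing here. `WordNetLemmatizer.lemmatize` treats every token as a noun unless a part of speech is given, so verb forms such as "doing", "going" and "raining" stay as they are. Passing `pos="v"` would reduce them to "do", "go" and "rain".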

### Other text features

Extracting **other interesting text features**: the number of words, the number of words two sentences have in common, whether the first word is the same, whether the last word is the same, and the number of words that occupy the same position in both sentences.

In [103]:
# Number of words in a sentence
def number_words(sentence):
    '''
    Args:
      sentence (str): The input sentence to count the number of words
      
    Returns:
      int : The number of tokens in the given text (word_tokenize also counts punctuation marks as tokens)
    '''
    return len(word_tokenize(sentence))


# Number of common words between two sentences
def number_common_words(s1, s2):
    '''
    Args:
      s1 (str): First sentence
      s2 (str): Second sentence
    
    Returns:
      int: The number of distinct words that appear in both sentences (case-sensitive)
    '''
    # Tokenize
    tokens1 = set(word_tokenize(s1))
    tokens2 = set(word_tokenize(s2))
    
    common = tokens1 & tokens2 # set of common tokens
    return len(common)


# Number of common words in the same position
def number_common_words_2(s1, s2):
    """
    Args:
      s1 (str): The first input sentence.
      s2 (str): The second input sentence.

    Returns:
      int: The number of common words at the same position in both sentences.
    """
    # Tokenize
    tokens1 = word_tokenize(s1)
    tokens2 = word_tokenize(s2)

    min_length = min(len(tokens1), len(tokens2))

    # Common words at the same position
    common_count = 0
    for i in range(min_length):
        if tokens1[i].lower() == tokens2[i].lower():
            common_count += 1

    return common_count


# If the first word of two sentences is equal
def first_word_equal(s1, s2):
    """
    Args:
      s1 (str): First sentence
      s2 (str): Second sentence
    Returns:
      int: 1 if the first words of the two questions are equal (ignoring case), 0 otherwise.
    """
    # Tokenize
    tokens1 = word_tokenize(s1)
    tokens2 = word_tokenize(s2)
    
    # An empty sentence has no first word, so there is nothing to compare
    if not tokens1 or not tokens2:
        return 0
    
    if tokens1[0].lower() == tokens2[0].lower():
        return 1
    
    return 0


# If the last word of two sentences is equal
def last_word_equal(s1, s2):
    """
    Args:
      s1 (str): First sentence
      s2 (str): Second sentence
    Returns:
      int: 1 if the last words of the two questions are equal (ignoring case), 0 otherwise.
    """
    # Tokenize: word_tokenize makes punctuation its own token, so for a sentence
    # ending in '.' or '?' the last token is that punctuation mark
    tokens1 = word_tokenize(s1)
    tokens2 = word_tokenize(s2)
    
    # An empty sentence has no last word, so there is nothing to compare
    if not tokens1 or not tokens2:
        return 0
    
    if tokens1[-1].lower() == tokens2[-1].lower():
        return 1
    
    return 0

### Examples

In [104]:
sentence1 = 'This is an example sentence to test the count of words'
sentence2 = 'This is a second example to test the count of common words'
sentence3 = 'Another different sentence'
print("Original sentence 1:\n\t", sentence1)
print("Original sentence 2:\n\t", sentence2)
print("Original sentence 3:\n\t", sentence3)

# number of words
k1 = number_words(sentence1)
k2 = number_words(sentence2)
print("\nNumber of words of sentence 1:", k1)
print("\nNumber of words of sentence 2:", k2)

# number of common words
k3 = number_common_words(sentence1, sentence2)
print("\nNumber of common words between sentence 1 and sentence 2:", k3)

k4 = number_common_words(sentence1, sentence3)
print("\nNumber of common words between sentence 1 and sentence 3:", k4)

# number of common words in the same position
k3 = number_common_words_2(sentence1, sentence2)
print("\nNumber of common words in the same position between sentence 1 and sentence 2:", k3)

k4 = number_common_words_2(sentence1, sentence3)
print("\nNumber of common words in the same position between sentence 1 and sentence 3:", k4)

# first and last words equal
print("\nComparing the first words of sentence1 and sentence2:", first_word_equal(sentence1,sentence2))
print("\nComparing the first words of sentence1 and sentence3:", first_word_equal(sentence1,sentence3))

print("\nComparing the last words of sentence1 and sentence2:", last_word_equal(sentence1,sentence2))
print("\nComparing the last words of sentence1 and sentence3:", last_word_equal(sentence1,sentence3))

Original sentence 1:
	 This is an example sentence to test the count of words
Original sentence 2:
	 This is a second example to test the count of common words
Original sentence 3:
	 Another different sentence

Number of words of sentence 1: 11

Number of words of sentence 2: 12

Number of common words between sentence 1 and sentence 2: 9

Number of common words between sentence 1 and sentence 3: 1

Number of common words in the same position between sentence 1 and sentence 2: 7

Number of common words in the same position between sentence 1 and sentence 3: 0

Comparing the first words of sentence1 and sentence2: 1

Comparing the first words of sentence1 and sentence3: 0

Comparing the last words of sentence1 and sentence2: 1

Comparing the last words of sentence1 and sentence3: 0
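
The features above are computed one at a time, but for a question pair it is often handier to collect them in a single record. The sketch below is one way to do that and is self-contained, so it makes a few assumptions: `tokenize` and `pair_features` are hypothetical names, and a lowercase regex tokenizer stands in for `nltk.word_tokenize`, so punctuation is dropped and the common-word count ignores case (unlike `number_common_words`).

```python
import re


def tokenize(sentence):
    # Assumption: lowercase regex tokenizer used instead of nltk.word_tokenize.
    return re.findall(r"\w+", sentence.lower())


def pair_features(s1, s2):
    t1, t2 = tokenize(s1), tokenize(s2)
    both_non_empty = bool(t1) and bool(t2)  # guards the [0] and [-1] lookups
    return {
        "n_words_1": len(t1),
        "n_words_2": len(t2),
        "n_common": len(set(t1) & set(t2)),
        # zip stops at the shorter sentence, like min_length in the loop version
        "n_common_same_position": sum(a == b for a, b in zip(t1, t2)),
        "first_equal": int(both_non_empty and t1[0] == t2[0]),
        "last_equal": int(both_non_empty and t1[-1] == t2[-1]),
    }


features = pair_features(
    "This is an example sentence to test the count of words",
    "This is a second example to test the count of common words",
)
print(features)
```

For sentence 1 and sentence 2 this gives the same counts as the example output above (11 and 12 words, 9 common words, 7 in the same position, first and last words equal).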
