# Text column analysis

It is important to be able to group rows (data) by labels. For this purpose, one must make sure that the labels are correct: not null, not misspelled, and not written in several different ways. More subtly, one should also check that two different labels do not actually refer to the same concept.

_<u>Ex</u>: sod and sodium both refer to sodium, $mm^3$ and $\mu L$ both refer to microliters_.
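
Once such synonyms have been identified by hand, one possible way to act on them is a small lookup table from each variant to a canonical label. The sketch below is only an illustration: the `CANONICAL` table and the `canonicalize` helper are hypothetical names that do not appear elsewhere in this notebook, and the table itself has to come from domain knowledge.

```python
# Hypothetical lookup table: every known variant maps to one canonical label.
CANONICAL = {
    "sod": "sodium",
    "sodium": "sodium",
    "mm^3": "microliter",
    "ul": "microliter",
}

def canonicalize(label):
    """Return the canonical label, or the cleaned label itself when it is unknown."""
    key = label.strip().lower()
    return CANONICAL.get(key, key)

labels = ["Sod", "sodium", "mm^3", "uL", "potassium"]
canonical_labels = [canonicalize(label) for label in labels]
print(canonical_labels)
```

Unknown labels are passed through unchanged (apart from trimming and lower-casing), so they remain visible for a manual check.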

Data processing therefore includes a label-checking step.

To help with this step, I have written three functions that flag errors and potential typos that should be taken into account before applying ML algorithms.

These functions are written generically, so I was able to apply them to my data without modifying the functions themselves.

**For all three functions, the input data should be passed as a dataframe. The relevant column should contain strings.**

In [None]:
import pandas as pd
from thefuzz import fuzz # for computing levenshtein ratios
import numpy as np
import math
from collections import Counter # for letter distribution in a string

## Errors in letter capitalization

Checking the letter case is one of the first steps of label checking, and algorithms often require all labels to share the same capitalization. Rows whose labels differ only by capitalization should therefore be grouped under the same label.

<u>The **countCapitalization** function:</u>

The function takes as input:
- a dataframe containing the relevant feature column
- a column name

It returns a dictionary with the following fields:
- `initial_count` : the number of distinct elements (of the input)
- `decap_count` : the number of distinct elements once all capital letters are converted to lower case
- `diff_distinct_decap` : the difference between the two previous counts, i.e. the number of distinct elements that disappear once capitalization is ignored
- `list_df_diff_cap` : a list of all the elements that appear several times with different capitalizations, each capitalization being listed

In [None]:
def countCapitalization(input_dataframe, feature_name) :
    
    # first we check that the feature is a column of the dataframe
    if feature_name not in input_dataframe.columns :
        return("WRONG INPUT: the feature name is not a column in the dataframe")
    
    # we remove null values and keep the distinct values of the feature
    distinct_values = input_dataframe[feature_name].dropna().unique()
    count_distinct = len(distinct_values)
    
    # now we de-capitalize all letters
    decapitalized = pd.Series([value.lower() for value in distinct_values], dtype=object)
        
    dataframe_counts = decapitalized.value_counts() # distinct decapitalized values and their counts
    count_distinct_decap = len(dataframe_counts)
    diff_distinct_decap = count_distinct - count_distinct_decap # difference of distinct values number when removing capitalization
    
    # set of all de-capitalized values that have a count superior to 1
    err_low_cap = set(dataframe_counts[dataframe_counts > 1].index)
            
    # list of all values in the initial dataframe that appear with different capitalizations
    list_df_diff_cap = [value for value in distinct_values if value.lower() in err_low_cap]
            
    return({"initial_count":count_distinct, "decap_count":count_distinct_decap, "diff_distinct_decap":diff_distinct_decap, "list_df_diff_cap":list_df_diff_cap})  
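
As a cross-check of the idea, independent of the function above, the same grouping can be sketched in plain Python: distinct labels are bucketed under their lower-case form, and any bucket holding more than one spelling is reported. The helper name `capitalization_variants` and the sample labels are illustrative assumptions, not part of the notebook.

```python
from collections import defaultdict

def capitalization_variants(labels):
    """Map each lower-cased label to its spellings, keeping only labels with several spellings."""
    groups = defaultdict(set)
    for label in labels:
        if label is None:  # skip null values
            continue
        groups[label.lower()].add(label)
    # keep only the lower-cased labels that were written in more than one way
    return {key: sorted(variants) for key, variants in groups.items() if len(variants) > 1}

labels = ["Sodium", "sodium", "SODIUM", "Potassium", None, "potassium", "Calcium"]
variants = capitalization_variants(labels)
print(variants)
```

Returning the spellings grouped by lower-case key makes it easy to see which rows would be merged if all labels were de-capitalized.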

## Extra spaces

Extra spaces in a label can be hard to detect visually, but two labels with a different number of spaces are treated as different strings. Although it is straightforward to remove leading and trailing spaces (with the _.strip_ method in Python), and doing so should be part of any cleaning procedure, spaces in the middle of a label must also be taken into account (_for example, "9 %" and "9%" are two different strings, and .strip will not help_).
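
The limitation of `.strip` can be seen on a few sample labels. The sketch below is a plain-Python illustration under assumed sample data; the helper name `space_variants` is hypothetical. It first strips the labels, then groups them by their space-free form.

```python
from collections import defaultdict

labels = [" 9 %", "9%", "9 %", "12 %"]

# strip() only removes leading and trailing whitespace,
# so "9 %" and "9%" remain two different strings
stripped = {label.strip() for label in labels}

def space_variants(labels):
    """Group labels by their form without any space, keeping groups with several spellings."""
    groups = defaultdict(set)
    for label in labels:
        groups[label.replace(" ", "")].add(label)
    return {key: sorted(variants) for key, variants in groups.items() if len(variants) > 1}

variants = space_variants(labels)
print(len(stripped), variants)
```

Removing every space is a deliberately aggressive key: it also merges labels whose words are run together, so the reported groups still need a manual look.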

<u>The **countExtraSpaces** function</u>:

The function takes as input:
- a dataframe containing the relevant feature column
- a column name

It returns a dictionary with the following fields:
- `initial_count` : the number of distinct values (of the input). Note that it takes capitalization into account
- `nospace_count` : the number of distinct values once all spaces are removed (leading, trailing but also in the middle)
- `diff_distinct_nosp` : the difference between the two previous counts, i.e. the number of distinct values that disappear once spaces are ignored
- `list_df_diff_nosp` : a list of all the values that appear several times with a different number of spaces, each variant being listed

In [None]:
def countExtraSpaces(input_dataframe, feature_name) :
    
    # first we check that the feature is a column of the dataframe
    if feature_name not in input_dataframe.columns :
        return("WRONG INPUT: the feature name is not a column in the dataframe")
        
    # we remove null values and keep the distinct values of the feature
    distinct_values = input_dataframe[feature_name].dropna().unique()
    count_distinct = len(distinct_values)
    
    # we store the values without spaces
    values_nospace = pd.Series([value.replace(" ", "") for value in distinct_values], dtype=object)
        
    dataframe_counts = values_nospace.value_counts() # distinct values with removed spaces and their counts
    count_distinct_nosp = len(dataframe_counts)
    diff_distinct_nosp = count_distinct - count_distinct_nosp # difference of distinct values number when removing spaces
    
    # set of all space-removed values that have a count superior to 1
    err_nospace = set(dataframe_counts[dataframe_counts > 1].index)
            
    # list of all values in the initial dataframe that appear with a different number of spaces
    list_df_diff_nosp = [value for value in distinct_values if value.replace(" ", "") in err_nospace]
            
    return({"initial_count":count_distinct, "nospace_count":count_distinct_nosp, "diff_distinct_nosp":diff_distinct_nosp, "list_df_diff_nosp":list_df_diff_nosp})  

## Typo detection

For the purpose of typo detection, I have written several functions.

<u>The **checkInput** function:</u>

Takes as input a dataframe and a feature name, and returns a tuple:
- `(True, None)` if the feature name is a column of the dataframe
- `(False, error message)` otherwise

<u>The **normalizeDataframe** function:</u>

The function takes as input a dataframe and a feature name, and returns a new dataframe of one column, containing the distinct decapitalized values in the feature. 

This function is used in other functions to flag errors (other than capitalization) without taking letter case into account (_for example, if we did not de-capitalize everything, the analysis of letter distribution would make a distinction between capital and lowercase letters_). 
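
The effect of letter case on a letter distribution can be checked directly with `collections.Counter`. This is a small illustration on an assumed pair of labels, not part of the functions of this notebook.

```python
from collections import Counter

raw_1, raw_2 = "Sodium", "sodium"

# "S" and "s" are counted as two different characters
case_sensitive_equal = Counter(raw_1) == Counter(raw_2)

# after de-capitalization both labels have the same letter distribution
case_insensitive_equal = Counter(raw_1.lower()) == Counter(raw_2.lower())

print(case_sensitive_equal, case_insensitive_equal)
```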

In [None]:
def checkInput(dataframe, feature_name) :
    
    if feature_name not in dataframe.columns :
        return((False, "ERROR: the feature name is not a column in the dataframe"))
    
    else :
        return ((True, None))
    
def normalizeDataframe(dataframe, feature_name) :
    
    # first we remove null values (before the string conversion, otherwise NaN would become "nan")
    values = dataframe[feature_name].dropna()
    
    # then we de-capitalize all letters, without modifying the input dataframe
    values = values.astype(str).str.lower()
        
    # finally we return the distinct values as a one-column dataframe
    return pd.DataFrame({feature_name : values.unique()})

<u>The **LevenshteinRatios** function:</u>

It takes as input two strings, and returns a dictionary with the following fields:
- `simple_ratio` $r_s$
- `token_sort_ratio` $r_{sort}$
- `token_set_ratio` $r_{set}$

_I did not include the_ partial_ratio _in the computed Levenshtein indices, because it is not relevant to label distinction. The partial ratio reflects the resemblance of <u>substrings</u> of the given strings, which is not of interest here: if two strings should be flagged as typos, they have similar lengths and a high partial ratio, but the other ratios already reflect that. Conversely, a string might be a substring of another on purpose, both being different labels. This gives a partial ratio of 100 even though the labels are distinct, which would be misleading_.  

The <mark>Levenshtein</mark> distance is a string metric (Vladimir Levenshtein, 1965) that measures the difference between two sequences. It corresponds to the minimum number of edits (each being an "insertion", a "deletion" or a "substitution" of a single character) necessary to change one string into the other. The Levenshtein similarity ratios are based on this distance:
- the simple ratio : compares the two strings as given, so the order of the words matters
- the token sort ratio : splits each string into tokens (words) and sorts them before comparing, so the order of the words does not matter
- the token set ratio : it is similar to the token sort ratio, but it separates the tokens common to both strings from the remaining ones before computing similarity, so repeated or extra words weigh less
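
To make the definition concrete, here is a minimal pure-Python sketch of the Levenshtein distance, together with the token-sorting idea. It only illustrates the underlying notions: the ratios returned by `thefuzz` are normalized scores between 0 and 100, and this sketch does not claim to reproduce them. The names `levenshtein_distance` and `sorted_tokens` are hypothetical helpers.

```python
def levenshtein_distance(s1, s2):
    """Minimum number of insertions, deletions and substitutions turning s1 into s2."""
    # previous[j] holds the distance between the current prefix of s1 and s2[:j]
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution (or match)
        previous = current
    return previous[-1]

def sorted_tokens(s):
    """Lower-case a string, split it on whitespace and sort its tokens alphabetically."""
    return " ".join(sorted(s.lower().split()))

typo_distance = levenshtein_distance("cogito ergo sum", "cogito ergo sam")
same_tokens = (sorted_tokens("Well done is better than well said")
               == sorted_tokens("Well said is better than well done"))
print(typo_distance, same_tokens)
```

The second check shows why a token-sorted comparison treats the two "Well done" quotes as identical, although they differ as raw strings.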

In [None]:
def LevenshteinRatios(s1, s2) :
    
    ratios = {}
    ratios["simple_ratio"] = fuzz.ratio(s1, s2)
    ratios["token_sort_ratio"] = fuzz.token_sort_ratio(s1, s2)
    ratios["token_set_ratio"] = fuzz.token_set_ratio(s1, s2)
    
    return(ratios)

<u>The **detectTypos** function:</u>

The function takes as input:
- the input dataframe
- the feature name of the column we are interested in
- the Levenshtein threshold  $t_{levenshtein}$ : minimum value of the average of the three computed Levenshtein ratios from which two strings are flagged as potential typos
- the length threshold  $t_{length}$ : maximum length difference allowed between two strings to consider them as similar enough. It should be given in percent (_for example, if the length threshold for strings $s_1$ and $s_2$ is 10, where $s_2$ has the maximum length, we will not consider $s_1$ as a potential typo if the length of $s_1$ is less than 90% of the length of $s_2$_).
- the distance threshold  $t_{distance}$ : maximum normalized Euclidean distance between the letter distributions of two strings for which they are flagged as potential typos.

The last three parameters are given default values.
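
The length threshold can be made concrete with a small stand-alone sketch of the same test as the one used in **detectTypos**. The helper name `too_different_in_length` is hypothetical and only serves the illustration.

```python
def too_different_in_length(s1, s2, length_threshold=10):
    """True when the shorter string is below (100 - length_threshold) % of the longer one."""
    n1, n2 = len(s1), len(s2)
    return min(n1, n2) < (100 - length_threshold) * max(n1, n2) / 100

# same length: the pair is kept for typo detection
kept = not too_different_in_length("cogito ergo sum", "cogito ergo sam")

# 3 characters against 6: the pair is skipped with the default threshold
skipped = too_different_in_length("sod", "sodium")

print(kept, skipped)
```

A consequence worth keeping in mind is that abbreviations such as "sod" for "sodium" are never compared with the default threshold, so they cannot be flagged by this function.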

The input dataframe can have several columns, since only the feature column is kept by the **normalizeDataframe** function.

The function returns a dataframe, with the string combinations flagged as potential typos:
- the two strings
- the simple ratio 
- the token sort ratio
- the token set ratio
- the distance (normalized Euclidean distance between the letter distributions)

Note that the returned flagged typos should be manually checked.

In **detectTypos**, we iterate over all string combinations. In the loop, given two strings $s_1$ and $s_2$, the typo detection is as follows:

Let $n_1$ and $n_2$ be the lengths of $s_1$ and $s_2$. We take the two strings in lower case. Then,
- we skip the iteration if 
<center> $ min(n_1, n_2) < \frac{100-t_{length}}{100} \times max(n_1, n_2)$ </center> 

_(the length difference is too large for the strings to be the same string with typos)_

- we compute the average 
<center>$avg_{ratio}=\frac{r_s + r_{sort} + r_{set}}{3}$ </center>

If $avg_{ratio} \geq t_{levenshtein}$, then we flag the two strings as potential typos

- We compute $L_1$={$l_a^1, l_b^1, ...$} and $L_2$={$l_a^2, l_b^2, ...$}, the number of occurrences of each character (including spaces and punctuation) present in $s_1$ and $s_2$ respectively ($\forall$ letter $i$ and $j=1,2$ : $l_i^j \in \mathbb{N}$). 

Then, we compute $d_1 = \sqrt{\sum_{l \in L_1}l^2}$ and $d_2 = \sqrt{\sum_{l \in L_2}l^2}$, and the set $L_{diff}$ where $\forall l_{letter} \in L_{diff}$ : 

$$ l_{letter} = 
          \begin{cases} |l_{letter}^1 - l_{letter}^2| & \text{if } l_{letter} \in L_1 \text{ and } l_{letter} \in L_2  
          \\ l_{letter}^1 & \text{if } l_{letter} \in L_1 \text{ and } l_{letter} \notin L_2
          \\ l_{letter}^2 & \text{if } l_{letter} \notin L_1 \text{ and } l_{letter} \in L_2 
          \end{cases} $$
          
We compute the normalized distance $d = \frac{\sqrt{\sum_{l \in L_{diff}}l^2}}{d_1 + d_2}$. If $d \leq t_{distance}$, then we flag the two strings as potential typos.
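
The letter-distribution distance can be worked through on one pair of strings. The sketch below is a stand-alone illustration that normalizes by $d_1 + d_2$, as the **detectTypos** code does; the helper name `letter_distance` is hypothetical, and both strings are assumed to be non-empty.

```python
import math
from collections import Counter

def letter_distance(s1, s2):
    """Euclidean distance between character counts, divided by the sum of both norms."""
    counts1, counts2 = Counter(s1), Counter(s2)
    d1 = math.sqrt(sum(c * c for c in counts1.values()))
    d2 = math.sqrt(sum(c * c for c in counts2.values()))
    # a Counter returns 0 for a missing character, which covers the three cases at once
    diff = [abs(counts1[ch] - counts2[ch]) for ch in set(counts1) | set(counts2)]
    return math.sqrt(sum(x * x for x in diff)) / (d1 + d2)

d = letter_distance("cogito ergo sum", "cogito ergo sam")
print(d)
```

For this pair only "u" and "a" differ, each by one occurrence, so the numerator is $\sqrt{2}$. Both strings have a norm of 5, which gives $d = \sqrt{2}/10 \approx 0.141$, just below the default distance threshold of 0.15.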

In [46]:
# here we define the constants for typo detection (same values as the defaults of detectTypos)
levenshtein_threshold = 90
length_threshold = 10 # IN PERCENT
distance_threshold = 0.15

In [None]:
def detectTypos(input_dataframe, feature_name, levenshtein_threshold=90, length_threshold=10, distance_threshold=0.15) :
    
    # first we check the input
    check_input = checkInput(input_dataframe, feature_name)
    if not check_input[0] :
        return (check_input[1])
    
    # then we remove null values and capitalization. we only keep distinct values 
    dataframe = normalizeDataframe(input_dataframe, feature_name)
    
    n = len(dataframe)
    values = dataframe.values
        
    # we create the return dataframe to store the results (typo suspicion)
    typo_suspicion = pd.DataFrame(columns=["s1", "s2", "simple_ratio", "token_sort_ratio", "token_set_ratio", "distance"])
    
    for i in range(n-1) : 
        for j in range(i+1, n) :
            
            s1 = values[i][0].lower()
            s2 = values[j][0].lower()
            
            n1, n2 = len(s1), len(s2)
            
            # first if the length is too different we go to next iteration
            if min(n1, n2) < (100-length_threshold)*max(n1, n2)/100 :
                continue # we eliminate the possibility of similar strings and we skip this iteration 
            
            # PART 1: LEVENSHTEIN RATIOS
            
            r_levenshtein = LevenshteinRatios(s1, s2)
            avg_lev = sum(r_levenshtein.values())/len(r_levenshtein.values())
            
            # PART 2: LETTER DISTRIBUTION - this includes spaces
            
            counts1 = Counter(s1)
            counts2 = Counter(s2)
            letters1 = set(counts1.keys())
            letters2 = set(counts2.keys())
            
            d1 = np.linalg.norm(list(counts1.values()))
            d2 = np.linalg.norm(list(counts2.values()))
            
            letters_intersect = letters1.intersection(letters2)
            letters_union = letters1.union(letters2)
            
            letters_diff = {}
            
            for letter in letters_union :
                if letter in letters_intersect : # is in both sentence
                    letters_diff[letter] = np.abs(counts1[letter]-counts2[letter])
                elif letter in letters1 :
                    letters_diff[letter] = counts1[letter]
                else :
                    letters_diff[letter] = counts2[letter]
            
            dist = np.linalg.norm(list(letters_diff.values()))
            dist2 = dist/(d1+d2)
            
            if avg_lev >= levenshtein_threshold or dist2 <= distance_threshold :
                row_typo = {"s1": s1, "s2": s2, 
                            "simple_ratio":r_levenshtein["simple_ratio"], 
                            "token_sort_ratio": r_levenshtein["token_sort_ratio"], 
                            "token_set_ratio": r_levenshtein["token_set_ratio"], 
                            "distance": dist2}
                typo_suspicion.loc[len(typo_suspicion)] = row_typo 
        
    return(typo_suspicion)  

**<u>Examples</u>**

In [None]:
data2 = [["If at first you don't succeed, try, try, try again."], 
        ["If at first you don't succeed try try try again"], 
        ["i have always depended on the kindness of strangers,"], 
        ["I have always depended on the kindness of strangers."], 
        ["I have always depended the kindness of strangers."],
         ["I have always depended the kindness of stranger."],
        ["I hav alwaïz dipended on ze kindnesse of strangeurs."]]
df_ex2 = pd.DataFrame(data2, columns=['quote'])
detectTypos(df_ex2, "quote")

In [None]:
data = [["If at first you don't succeed, try, try, try again."], 
        ["If at first you don't succeed try try try again"], 
        ["i have always depended on the kindness of strangers"], 
        ["I have always depended on the kindness of strangers."], 
        ["I have always depended the kindness of strangers."], 
        ["Cogito ergo sum"], 
        ["Cogito ergo sam"], 
        ["Well done is better than well said"], 
        ["Well said is better than well done"], 
        ["Nobody puts Baby in a corner."], 
        ["If at first you don't succeed, try, try again."]]

# Create the pandas DataFrame
df_ex = pd.DataFrame(data, columns=['quote'])

d = detectTypos(df_ex, "quote")

# Outlier detection

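The function **textStatistics** flags the values of a string feature whose length lies far outside the lengths observed in the rest of the column. It uses the interquartile range (IQR) rule: a length is an outlier if it is below $Q_1 - 1.5 \times IQR$ or above $Q_3 + 1.5 \times IQR$.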

In [None]:
import matplotlib.pyplot as plt # for the histogram of string lengths

def textStatistics(dataframe, feature_name, lb = 25, ub = 75) :
    
    if feature_name not in dataframe.columns :
        return((False, "ERROR: the feature name is not a column in the dataframe"))
    
    # length of each non-null value of the feature
    lengths = [len(feature) for feature in dataframe[feature_name].dropna()]
    
    avg = np.average(lengths)
    minimum = min(lengths)
    maximum = max(lengths)
    
    print(avg, minimum, maximum)
    
    lengths_array = np.array(lengths)

    q1 = np.percentile(lengths_array, lb) # lb-th percentile
    q3 = np.percentile(lengths_array, ub) # ub-th percentile
    iqr = q3 - q1 # interquartile range (with the default lb and ub)

    # Define the outlier boundaries
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr

    outliers = lengths_array[(lengths_array < lower_bound) | (lengths_array > upper_bound)]
    
    # histogram of the string lengths
    plt.hist(lengths, bins=20)

    return("Outliers:", outliers.tolist())
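
The IQR rule can also be tried on a handful of labels without pandas or numpy. The sketch below is an assumption-laden illustration: the helper name `length_outliers` and the sample labels are made up, and it relies on `statistics.quantiles` with `method="inclusive"`, which should interpolate quartiles the same way as the default of `np.percentile`. Unlike **textStatistics**, it returns the outlying strings rather than their lengths.

```python
import statistics

def length_outliers(strings, whisker=1.5):
    """Return the strings whose length lies outside [Q1 - whisker*IQR, Q3 + whisker*IQR]."""
    lengths = [len(s) for s in strings]
    q1, _, q3 = statistics.quantiles(lengths, n=4, method="inclusive")
    iqr = q3 - q1  # interquartile range
    lower, upper = q1 - whisker * iqr, q3 + whisker * iqr
    return [s for s in strings if not lower <= len(s) <= upper]

labels = ["sodium", "potassium", "calcium", "chloride", "glucose",
          "see the attached laboratory report for details"]
flagged = length_outliers(labels)
print(flagged)
```

Here the free-text comment is the only value flagged, which is the kind of entry that often signals a label typed into the wrong column.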

The function **outliers(dataframe, feature_name)** takes as input a dataframe and the name of a numerical feature, and returns the values that lie outside the bounds $Q_1 - 1.5 \times IQR$ and $Q_3 + 1.5 \times IQR$ of the feature's distribution.

In [None]:
def outliers(dataframe, feature_name, lb = 25, ub = 75) :
    
    if feature_name not in dataframe.columns :
        return((False, "ERROR: the feature name is not a column in the dataframe"))
    
    # we use a numpy array (without null values) to allow boolean indexing
    data = dataframe[feature_name].dropna().to_numpy()
    
    q1 = np.percentile(data, lb) # lb-th percentile
    q3 = np.percentile(data, ub) # ub-th percentile
    iqr = q3 - q1 # interquartile range (with the default lb and ub)
    
    # Define the outlier boundaries
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    
    # values below the lower bound or above the upper bound
    outlier_values = data[ (data < lower_bound) | (data > upper_bound) ]
    
    return("Outliers:", outlier_values.tolist())