## Uncommon libraries for preprocessing text
---

In [1]:
# pip install pandarallel

import pandas as pd
from pandarallel import pandarallel
pandarallel.initialize()

INFO: Pandarallel will run on 2 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
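
The log above says the work is split across 2 workers. As a rough illustration of that split-apply-concatenate idea, here is a minimal standard-library sketch. It is not pandarallel's implementation: it uses threads rather than processes to stay short, and `chunked` and `parallel_map` are hypothetical helper names introduced only for this example.

```python
from concurrent.futures import ThreadPoolExecutor

def chunked(items, n_chunks):
    """Split items into at most n_chunks contiguous, near-equal slices."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        if end > start:
            chunks.append(items[start:end])
        start = end
    return chunks

def parallel_map(func, items, n_workers=2):
    """Apply func to every item, one chunk per worker, preserving input order."""
    chunks = chunked(list(items), n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # pool.map yields results in the same order as the submitted chunks
        parts = pool.map(lambda chunk: [func(x) for x in chunk], chunks)
    return [value for part in parts for value in part]

reviews = ["Nice hotel", "OK, nothing special", "Great stay"]
lowered = parallel_map(str.lower, reviews, n_workers=2)
print(lowered)
```

In the notebook itself, `parallel_apply` plays the role of `parallel_map`, with each cleaning function passed in place of `str.lower`.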


In [10]:
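df = pd.read_csv('tripadvisor_hotel_reviews.csv')
# Take an explicit copy so that columns can be added later without a SettingWithCopyWarning
df_reduced = df.head(10).copy()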
df_reduced

| | Review | Rating |
|---|---|---|
| 0 | nice hotel expensive parking got good deal sta... | 4 |
| 1 | ok nothing special charge diamond member hilto... | 2 |
| 2 | nice rooms not 4* experience hotel monaco seat... | 3 |
| 3 | unique, great stay, wonderful time hotel monac... | 5 |
| 4 | great stay great stay, went seahawk game aweso... | 5 |
| 5 | love monaco staff husband stayed hotel crazy w... | 5 |
| 6 | cozy stay rainy city, husband spent 7 nights m... | 5 |
| 7 | excellent staff, housekeeping quality hotel ch... | 4 |
| 8 | hotel stayed hotel monaco cruise, rooms genero... | 5 |
| 9 | excellent stayed hotel monaco past w/e delight... | 5 |


##### Clean text
---

In [3]:
# pip install clean-text
# More information about it here: https://pypi.org/project/clean-text/

from cleantext import clean

def cleaner_ct(text):
    # Collapse whitespace and line breaks, and replace punctuation with a space
    return clean(
        text,
        normalize_whitespace=True,
        no_line_breaks=True,
        no_currency_symbols=True,
        no_punct=True,
        replace_with_punct=' ',
    )

df_reduced['cleaner_ct']=df_reduced['Review'].parallel_apply(cleaner_ct)

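
To make the effect of these options easier to picture, the following standard-library sketch approximates the visible behaviour: punctuation replaced by a space and whitespace collapsed. It is an assumption-laden stand-in, not clean-text's own code; `approx_clean` is a hypothetical name, the lowercasing step is assumed, and currency and Unicode handling are left out.

```python
import re
import string

# Map every ASCII punctuation character to a single space
PUNCT_TO_SPACE = str.maketrans({ch: " " for ch in string.punctuation})

def approx_clean(text):
    """Rough approximation: lowercase, punctuation -> space, collapse whitespace."""
    text = text.lower()                 # assumption: lowercasing is wanted
    text = text.translate(PUNCT_TO_SPACE)
    text = re.sub(r"\s+", " ", text)    # also removes tabs and line breaks
    return text.strip()

sample = "Nice hotel,\nexpensive parking!  Non-existent view."
cleaned = approx_clean(sample)
print(cleaned)
```

Replacing punctuation with a space rather than deleting it keeps hyphenated words such as "non-existent" as two tokens instead of fusing them into one.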


##### Text Hammer
---

In [4]:
# pip install text-hammer
# More information about it here: https://pypi.org/project/text-hammer/

import text_hammer as th

def get_clean_th(x):
    #x = str(x).lower().replace('\\', '').replace('_', ' ')
    x = th.cont_exp(x)
    x = th.remove_emails(x)
    x = th.remove_urls(x)
    x = th.remove_html_tags(x)
    x = th.remove_rt(x)
    x = th.remove_accented_chars(x)
    x = th.remove_special_chars(x)
    return x

df_reduced['cleaner_th']=df_reduced['Review'].parallel_apply(get_clean_th)



##### Py text data clean
---

In [5]:
# pip install py-text-data-clean
# More information about it here: https://pypi.org/project/py-text-data-clean/

from pytextdataclean import textclean as tc

def cleaner_pdc(text):
    result = tc.text_clean(data=[text])
    return result

df_reduced['cleaner_pdc']=df_reduced['Review'].parallel_apply(cleaner_pdc)

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/diegoarcosdelasheras/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!




##### Text Cleaner
---

In [6]:
# pip install text-cleaner-fdelgados
# pip install text-cleaner-emagister
# More information about it here: https://pypi.org/project/text-cleaner-emagister/#installation

In [7]:
from textcleaner import TextCleaner

def cleaner_tc(text):
    cleaner = TextCleaner()
    return cleaner.clean(text)

df_reduced['cleaner_tc']=df_reduced['Review'].parallel_apply(cleaner_tc)



##### Textify
---

In [8]:
# pip install textify
# More information about it here: https://pypi.org/project/textify/

from textify import TextCleaner

def cleaner_tf(text):
    doc = TextCleaner()
    doc.text = text
    return doc.clean_text()

df_reduced['cleaner_tf']=df_reduced['Review'].parallel_apply(cleaner_tf)



## *COMPARISON*
---
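
Beyond reading the outputs next to each other, one way to compare cleaners is to list which whitespace-separated tokens changed between the original and the cleaned text. The sketch below uses only the standard library; `token_changes` is a hypothetical helper, and the sample strings are shortened fragments of the first review.

```python
from collections import Counter

def token_changes(original, cleaned):
    """Return (dropped, added) token counts between original and cleaned text."""
    before = Counter(original.split())
    after = Counter(cleaned.split())
    # Counter subtraction keeps only tokens with a positive remaining count
    return dict(before - after), dict(after - before)

original = "nice size, bed comfortable non-existent view"
cleaned = "nice size bed comfortable non existent view"
dropped, added = token_changes(original, cleaned)
print("dropped:", dropped)
print("added:  ", added)
```

On the notebook's data, the same helper could be called with `row['Review']` and any of the `cleaner_*` columns to see at a glance how aggressive each library is.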

In [9]:
# Show the first review as cleaned by each library, each followed by the original text
first_row = df_reduced.iloc[0]
cleaners = [
    ('Clean text', 'cleaner_ct'),
    ('Text Hammer', 'cleaner_th'),
    ('Py text data clean', 'cleaner_pdc'),
    ('Text Cleaner', 'cleaner_tc'),
    ('Textify', 'cleaner_tf'),
]
for name, column in cleaners:
    print(f'%%%%%%%%%%% {name} library %%%%%%%%%%%')
    print(first_row[column])
    print('\n')
    print('----Original---------')
    print(first_row['Review'])
    print('*************************************************\n')

%%%%%%%%%%% Clean text library %%%%%%%%%%%
nice hotel expensive parking got good deal stay hotel anniversary arrived late evening took advice previous reviews did valet parking check quick easy little disappointed non existent view room room clean nice size bed comfortable woke stiff neck high pillows not soundproof like heard music room night morning loud bangs doors opening closing hear people talking hallway maybe just noisy neighbors aveda bath products nice did not goldfish stay nice touch taken advantage staying longer location great walking distance shopping overall nice experience having pay 40 parking night


----Original---------
nice hotel expensive parking got good deal stay hotel anniversary, arrived late evening took advice previous reviews did valet parking, check quick easy, little disappointed non-existent view room room clean nice size, bed comfortable woke stiff neck high pillows, not soundproof like heard music room night morning loud bangs doors opening closing hea