In [1]:
%load_ext autoreload
%autoreload 2

In this notebook we try to reduce the time needed to correct spelling mistakes by using parallel computing.

### Importing the libraries and the dataset

In [2]:
import os
os.chdir("..")

import pandas as pd
from Preprocessors.ReviewPreprocessor import ReviewPreprocessor
import numpy as np
import re
import concurrent.futures
from spelling_correction import spelling_correction
from time import time
import spacy
nlp = spacy.load("en_core_web_sm")

data = pd.read_csv("data/trip_advisor_data_chunk_10000k.csv", encoding="utf-16")
data.head()

| Unnamed: 0 | hotel_url | author | date | rating | title | review |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | Hotel_Review-g194775-d1121769-Reviews-Hotel_Ba... | Lagaiuzza | 2016-01-01T00:00:00 | 5.0 | Baltic, what else? | We have spent in this hotel our summer holiday... |
| 1 | Hotel_Review-g194775-d1121769-Reviews-Hotel_Ba... | ashleyn763 | 2014-10-01T00:00:00 | 5.0 | Excellent in every way! | I visited Hotel Baltic with my husband for som... |
| 2 | Hotel_Review-g194775-d1121769-Reviews-Hotel_Ba... | DavideMauro | 2014-08-01T00:00:00 | 5.0 | The house of your family's holiday | I've travelled quite a numbers of hotels but t... |
| 3 | Hotel_Review-g303503-d1735469-Reviews-Pousada_... | TwoMonkeysTravel | 2017-03-01T00:00:00 | 5.0 | Natural Luxury | The property is surrounded by trees, which are... |
| 4 | Hotel_Review-g303503-d1735469-Reviews-Pousada_... | analuizade | 2016-09-01T00:00:00 | 5.0 | Very cozy! | I had a very pleasant stay at this hotel! All ... |


In [11]:
preprocessor = ReviewPreprocessor(data["review"], nlp)
start = time()
# Remove tags from the raw reviews.
data["cleaned_data"] = preprocessor.remove_tags()
print(f"remove useless features duration: {time()-start}s")
# Lowercase the reviews. `start` is not reset, so this duration is measured from the first step.
data["cleaned_data"] = preprocessor.lowercase_transformation()
print(f"transform reviews to lowercase duration : {time()-start}s")

remove useless features duration: 0.24351859092712402s
transform reviews to lowercase duration : 0.24351859092712402s
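
Because both durations above are measured from the same `start`, they do not isolate the cost of each step. The following is a minimal sketch of per-step timing, not part of the project: the `timed` helper, the `durations` dictionary, and the sample reviews are all hypothetical, and the regular expression is only a stand-in for whatever `remove_tags` actually does.

```python
import re
from contextlib import contextmanager
from time import perf_counter

durations = {}  # hypothetical store: step label -> elapsed seconds

@contextmanager
def timed(label):
    """Time the enclosed block on its own and record the result."""
    start = perf_counter()
    try:
        yield
    finally:
        durations[label] = perf_counter() - start
        print(f"{label}: {durations[label]:.6f}s")

# Toy reviews standing in for data["review"].
reviews = ["<p>Great Hotel</p>", "Very <b>Clean</b> rooms"]

with timed("remove tags"):
    no_tags = [re.sub(r"<[^>]+>", "", text) for text in reviews]

with timed("lowercase"):
    lowered = [text.lower() for text in no_tags]
```

Each `with` block starts its own timer, so the second figure no longer includes the first step.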


#### Spelling correction without parallel computing

In [10]:
normal_correction_result = preprocessor.spelling_correction()

3000it [10:16,  4.87it/s]


Without parallel computing, correcting the spelling of 3,000 reviews takes about 10 minutes (10:16 according to the progress bar, i.e. roughly 4.87 reviews per second). This long duration is due to each review being split into words, which are then corrected one by one.
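
To illustrate why word-by-word correction is expensive, here is a toy sketch. It is not the project's `spelling_correction` implementation, which is not shown in this notebook: the vocabulary, `correct_word`, and `correct_review` are hypothetical, and the standard library's `difflib` stands in for a real spell checker. The point is only that every word triggers its own lookup, so the cost grows with the total number of words.

```python
from difflib import get_close_matches

# Hypothetical vocabulary; a real corrector would use a full dictionary.
VOCABULARY = ["the", "hotel", "was", "clean", "room", "very"]

def correct_word(word):
    """Return the closest vocabulary word, or the word itself if none is close enough."""
    if word in VOCABULARY:
        return word
    matches = get_close_matches(word, VOCABULARY, n=1, cutoff=0.8)
    return matches[0] if matches else word

def correct_review(review):
    """Split a review into words and correct each word independently."""
    return " ".join(correct_word(word) for word in review.split())

print(correct_review("the hotl was very clen"))
```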

#### Spelling correction with parallel computing

To apply parallel computing to the 3,000 reviews, we split them into 4 equal chunks of 750 reviews each. We then create 4 processes, each of which corrects one chunk.
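
As a rough illustration of the splitting step, the sketch below cuts a list into contiguous chunks whose sizes differ by at most one. `split_evenly` is a hypothetical pure-Python helper written for this example, and the placeholder reviews are made up; the notebook itself relies on `np.array_split`.

```python
def split_evenly(items, n_chunks):
    """Split a list into n_chunks contiguous parts whose sizes differ by at most one."""
    base, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks receive one additional item.
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks

reviews = [f"review {i}" for i in range(3000)]  # placeholder data
chunks = split_evenly(reviews, 4)
print([len(chunk) for chunk in chunks])
```

With 3,000 items and 4 chunks, every chunk holds 750 items, which matches the split described above.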

In [12]:
splitted_df = np.array_split(data["cleaned_data"], 4)

In [13]:
df_results = []

In [14]:
start = time()
# One worker process per chunk; each process corrects its chunk independently.
with concurrent.futures.ProcessPoolExecutor(max_workers=4) as executor:
    results = [executor.submit(spelling_correction, df) for df in splitted_df]
    # as_completed yields futures in completion order, not submission order.
    for result in concurrent.futures.as_completed(results):
        try:
            df_results.append(result.result())
        except Exception as ex:
            print(str(ex))
end = time()
print(f"duration : {end-start}s <=> {(end-start)/60}min")

duration : 349.4241273403168s <=> 5.8237354556719465min


The correction time dropped from about 10 min 16 s to about 5 min 49 s, a reduction of roughly 43%, or a speedup of about 1.8x with 4 processes.
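
The figures can be checked with a short calculation. The sequential time is read from the progress bar (10:16), so it is approximate; the variable names are chosen for this example only.

```python
sequential_s = 10 * 60 + 16        # 10:16 read from the progress bar (approximate)
parallel_s = 349.4241273403168     # measured with 4 worker processes
n_workers = 4

speedup = sequential_s / parallel_s        # how many times faster the parallel run is
efficiency = speedup / n_workers           # fraction of the ideal 4x speedup achieved
reduction = 1 - parallel_s / sequential_s  # fraction of wall-clock time saved

print(f"speedup: {speedup:.2f}x")
print(f"efficiency: {efficiency:.0%}")
print(f"time saved: {reduction:.0%}")
```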

In [16]:
# Merge the per-chunk results into a single Series (chunks may be in completion order).
r = pd.concat(df_results)

**Comparing the correction results with and without parallel computing**

In [20]:
are_same = True
# Compare by index label, since the parallel chunks may have been collected out of order.
for i in range(0,3000):
    if r.loc[i] != normal_correction_result.loc[i]:
        are_same = False
        break

In [22]:
are_same

True

The two correction results are identical.
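
The same check can be written without an explicit loop. This is a sketch on toy data, assuming, as in the notebook, that each chunk keeps its original index labels; the `sequential` and `parallel` Series here are made up for illustration.

```python
import pandas as pd

sequential = pd.Series(["good hotel", "clean room", "nice staff", "great view"])

# Simulate chunks collected out of order, as can happen with as_completed.
chunks = [sequential.iloc[2:], sequential.iloc[:2]]
parallel = pd.concat(chunks)

# Restore the original order by index label, then compare values and index in one call.
are_same = parallel.sort_index().equals(sequential)
print(are_same)
```

Sorting by index first matters: without it, `equals` would report a mismatch whenever the chunks finish in a different order from the one in which they were submitted.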