# Customer Reviews Translation Analysis

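This notebook loads the translated customer reviews dataset (French insurance reviews in `avis` with English translations in `avis_en`), removes sparse columns and incomplete rows, and computes basic text-length statistics.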

In [1]:
import pandas as pd
import numpy as np
import glob
import os


In [2]:
# Load all Excel files
data_path = 'Project2/TraductionAvisClients'
all_files = glob.glob(os.path.join(data_path, '*.xlsx'))

# Create empty list to store dataframes
dfs = []

# Read each Excel file
for file in all_files:
    df = pd.read_excel(file)
    dfs.append(df)

# Combine all dataframes
combined_df = pd.concat(dfs, ignore_index=True)
print(f'Total number of reviews: {len(combined_df)}')

Total number of reviews: 34435
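
Once the files are concatenated, nothing records which file a row came from. The sketch below shows one possible way to tag each row with its source file before combining. `combine_frames` and the `source_file` column are hypothetical names, and the small frames are made-up stand-ins for what `pd.read_excel` would return.

```python
import os
import pandas as pd


def combine_frames(frames_by_path):
    """Concatenate per-file dataframes, tagging each row with its source file."""
    if not frames_by_path:
        # pd.concat raises on an empty list; fail with a clearer message.
        raise ValueError("no input files found")
    tagged = [
        df.assign(source_file=os.path.basename(path))
        for path, df in sorted(frames_by_path.items())  # stable file order
    ]
    return pd.concat(tagged, ignore_index=True)


# Made-up rows standing in for the contents of two Excel files.
frames = {
    "data/part_2.xlsx": pd.DataFrame({"note": [5.0], "avis": ["très bien"]}),
    "data/part_1.xlsx": pd.DataFrame({"note": [None, 1.0], "avis": ["a", "b"]}),
}
combined = combine_frames(frames)
print(combined["source_file"].value_counts().to_dict())
```

With the real data, the mapping could be built as `{p: pd.read_excel(p) for p in all_files}`.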


## Basic Data Analysis

In [3]:
# Display basic information about the dataset
print("\nDataset Info:")
combined_df.info()

print("\nFirst few rows:")
combined_df.head()


Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34435 entries, 0 to 34434
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   note              24104 non-null  float64
 1   auteur            34434 non-null  object 
 2   avis              34435 non-null  object 
 3   assureur          34435 non-null  object 
 4   produit           34435 non-null  object 
 5   type              34435 non-null  object 
 6   date_publication  34435 non-null  object 
 7   date_exp          34435 non-null  object 
 8   avis_en           34433 non-null  object 
 9   avis_cor          435 non-null    object 
 10  avis_cor_en       431 non-null    object 
dtypes: float64(1), object(10)
memory usage: 2.9+ MB

First few rows:


| | note | auteur | avis | assureur | produit | type | date_publication | date_exp | avis_en | avis_cor | avis_cor_en |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | | estelle-51227 | j'ai quitté mon ancien contrat d'assurance che... | Néoliane Santé | sante | test | 12/01/2017 | 01/01/2017 | I left my former insurance contract at General... | | |
| 1 | | leadum-51107 | j'ai souscrit à cette mutuelle l'année dernier... | Néoliane Santé | sante | test | 09/01/2017 | 01/01/2017 | I subscribed to this mutual a year last year a... | | |
| 2 | | enora-49520 | Impossible d'avoir le bon service , ils raccro... | Néoliane Santé | sante | test | 24/11/2016 | 01/11/2016 | Impossible to have the right service, they han... | | |
| 3 | | bea-139295 | Génération est une mutuelle très chère pour un... | Génération | sante | test | 09/11/2021 | 01/11/2021 | Generation is a very expensive mutual for a re... | | |
| 4 | | anna-139192 | je viens d apprendre que je suis radié... j ap... | Génération | sante | test | 08/11/2021 | 01/11/2021 | I just learned that I am struck off ... I call... | | |


In [4]:
# Drop unnecessary columns
combined_df = combined_df.drop(['avis_cor', 'avis_cor_en'], axis=1)
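
The two dropped columns are almost empty: the `info()` output above lists 435 and 431 non-null values out of 34,435 rows. As a rough sketch, the share of missing values per column can be computed directly and used to pick columns to drop. The frame and the 0.5 cutoff below are made up for illustration and are not taken from the notebook.

```python
import pandas as pd

# Made-up frame mimicking a mostly empty correction column.
toy = pd.DataFrame({
    "avis": ["a", "b", "c", "d"],
    "avis_cor": [None, None, None, "d (corrigé)"],
})

# Fraction of missing values per column, highest first.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)

# Hypothetical cutoff: drop columns that are more than half empty.
threshold = 0.5
sparse_cols = missing_share[missing_share > threshold].index.tolist()
toy = toy.drop(columns=sparse_cols)
print(toy.columns.tolist())
```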



In [5]:
# Remove rows with any null value (almost all of them are rows without a `note` rating)
print("Shape before removing null values:", combined_df.shape)
combined_df = combined_df.dropna()
print("Shape after removing null values:", combined_df.shape)


Shape before removing null values: (34435, 9)
Shape after removing null values: (24102, 9)
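
The shapes above show 10,333 rows removed, and the earlier `info()` output shows that 10,331 of the nulls are in `note`, so the step mostly discards unrated reviews. A possible way to check which rows are affected, and to restrict the filter to the rating column, is sketched below on made-up data. Whether the notebook should use `subset` is an assumption, not something the original states.

```python
import pandas as pd

# Made-up rows: two unrated test reviews and one train review missing its author.
toy = pd.DataFrame({
    "note": [None, None, 4.0, 1.0],
    "auteur": ["a-1", "b-2", None, "d-4"],
    "type": ["test", "test", "train", "train"],
})

# Rows that a plain dropna() would discard, broken down by split.
has_null = toy.isna().any(axis=1)
print(toy.loc[has_null, "type"].value_counts())

# Alternative: only require the rating, keeping rows with other gaps.
rated = toy.dropna(subset=["note"])
print(rated.shape)
```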


In [6]:
# Re-inspect the dataset after dropping the sparse columns and the rows with nulls
print("\nDataset Info:")
combined_df.info()

print("\nFirst few rows:")
combined_df.head()



Dataset Info:
<class 'pandas.core.frame.DataFrame'>
Index: 24102 entries, 2000 to 34434
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   note              24102 non-null  float64
 1   auteur            24102 non-null  object 
 2   avis              24102 non-null  object 
 3   assureur          24102 non-null  object 
 4   produit           24102 non-null  object 
 5   type              24102 non-null  object 
 6   date_publication  24102 non-null  object 
 7   date_exp          24102 non-null  object 
 8   avis_en           24102 non-null  object 
dtypes: float64(1), object(8)
memory usage: 1.8+ MB

First few rows:


| | note | auteur | avis | assureur | produit | type | date_publication | date_exp | avis_en |
|---|---|---|---|---|---|---|---|---|---|
| 2000 | 5.0 | claire-m-130353 | les prix au top, la facilité d'inscription et ... | Direct Assurance | auto | train | 31/08/2021 | 01/08/2021 | top prices, ease of registration and clear ser... |
| 2001 | 2.0 | tontonlouis-90075 | je n'ai pas les moyens d'attendre 3 à 4 semain... | Cegema Assurances | sante | train | 30/05/2020 | 01/05/2020 | I cannot afford to wait 3 to 4 weeks for reimb... |
| 2002 | 1.0 | fred78-132197 | Je voulais assurer une Tesla Modèle 3 LR ...\n... | MACIF | auto | train | 10/09/2021 | 01/09/2021 | I wanted to ensure a Tesla Model 3 LR ...\nThe... |
| 2003 | 1.0 | sud-70690 | Je suis en arrêt de travail depuis nov 2017 ,a... | Cardif | credit | train | 07/02/2019 | 01/02/2019 | I am on work stoppage since Nov 2017, assured ... |
| 2004 | 1.0 | fofi-80683 | inadmissible... je leur ai réglé un trop perçu... | Harmonie Mutuelle | sante | train | 04/11/2019 | 01/11/2019 | Inadmissible ... I set them too perceived foll... |
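
Both date columns are still stored as `object` (strings). A minimal sketch of converting them to datetimes follows, assuming every value uses the day/month/year layout seen in the preview rows. The toy frame, including its deliberately invalid entry, is made up for illustration.

```python
import pandas as pd

# Made-up values in the same dd/mm/yyyy layout as the preview.
toy = pd.DataFrame({
    "date_publication": ["31/08/2021", "30/05/2020", "not a date"],
    "date_exp": ["01/08/2021", "01/05/2020", "01/09/2021"],
})

for col in ["date_publication", "date_exp"]:
    # errors="coerce" turns unparseable strings into NaT instead of raising.
    toy[col] = pd.to_datetime(toy[col], format="%d/%m/%Y", errors="coerce")

print(toy.dtypes)
print(toy["date_publication"].isna().sum(), "unparseable publication dates")
```

Passing an explicit `format` avoids the month/day ambiguity of dates such as `09/01/2017`.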


## Text Analysis

In [7]:
# Character and word counts of the original French reviews (`avis`)
combined_df['text_length'] = combined_df['avis'].str.len()
combined_df['word_count'] = combined_df['avis'].str.split().str.len()

print("Text length statistics:")
print(combined_df['text_length'].describe())

print("\nWord count statistics:")
print(combined_df['word_count'].describe())

Text length statistics:
count    24102.000000
mean       348.502282
std        393.467358
min          3.000000
25%        162.000000
50%        201.000000
75%        385.000000
max       8770.000000
Name: text_length, dtype: float64

Word count statistics:
count    24102.000000
mean        59.017924
std         67.975280
min          1.000000
25%         27.000000
50%         34.000000
75%         66.000000
max       1469.000000
Name: word_count, dtype: float64
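
In both summaries the mean sits well above the median (about 349 vs. 201 characters, 59 vs. 34 words), which points to a right-skewed distribution with a few very long reviews. One possible follow-up, sketched here on made-up reviews, is to compare the median word count per rating. The example says nothing about what the real data would show.

```python
import pandas as pd

# Made-up reviews; the real notebook would use combined_df instead.
toy = pd.DataFrame({
    "note": [1.0, 1.0, 5.0, 5.0],
    "avis": [
        "service client injoignable depuis trois semaines",
        "aucun remboursement malgré plusieurs relances par courrier",
        "très satisfait",
        "prix correct",
    ],
})
toy["word_count"] = toy["avis"].str.split().str.len()

# The median is less sensitive than the mean to a few very long reviews.
by_note = toy.groupby("note")["word_count"].agg(["count", "median"])
print(by_note)
```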
