# Customer Reviews Translation Analysis

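This notebook loads a dataset of French insurance customer reviews with English translations, combines the per-file Excel exports into a single DataFrame, and runs basic exploratory checks on it.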

In [22]:
import pandas as pd
import numpy as np
import glob
import os

In [23]:
# Load all Excel files
data_path = '/Users/alexs/PycharmProjects/MachineLearningForNLP/Project2/TraductionAvisClients'
all_files = glob.glob(os.path.join(data_path, '*.xlsx'))
print(all_files)

['/Users/alexs/PycharmProjects/MachineLearningForNLP/Project2/TraductionAvisClients/avis_33_traduit.xlsx', '/Users/alexs/PycharmProjects/MachineLearningForNLP/Project2/TraductionAvisClients/avis_34_traduit.xlsx', '/Users/alexs/PycharmProjects/MachineLearningForNLP/Project2/TraductionAvisClients/avis_21_traduit.xlsx', '/Users/alexs/PycharmProjects/MachineLearningForNLP/Project2/TraductionAvisClients/avis_26_traduit.xlsx', '/Users/alexs/PycharmProjects/MachineLearningForNLP/Project2/TraductionAvisClients/avis_10_traduit.xlsx', '/Users/alexs/PycharmProjects/MachineLearningForNLP/Project2/TraductionAvisClients/avis_17_traduit.xlsx', '/Users/alexs/PycharmProjects/MachineLearningForNLP/Project2/TraductionAvisClients/avis_27_traduit.xlsx', '/Users/alexs/PycharmProjects/MachineLearningForNLP/Project2/TraductionAvisClients/avis_20_traduit.xlsx', '/Users/alexs/PycharmProjects/MachineLearningForNLP/Project2/TraductionAvisClients/avis_35_traduit.xlsx', '/Users/alexs/PycharmProjects/MachineLearning

In [24]:
# Read each Excel file into its own DataFrame
dfs = [pd.read_excel(file) for file in all_files]

# Combine all dataframes
combined_df = pd.concat(dfs, ignore_index=True)
print(f'Total number of reviews: {len(combined_df)}')

Total number of reviews: 34435
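The loading step above can also be wrapped in a small helper. The following is a minimal sketch, not the notebook's own code: the function name `load_reviews`, the `reader` parameter, and the added `source_file` column are assumptions introduced for illustration. Sorting the file list makes the row order reproducible, since `glob` does not guarantee any ordering.

```python
from pathlib import Path

import pandas as pd


def load_reviews(data_dir, pattern="*.xlsx", reader=pd.read_excel):
    """Read every matching file in data_dir and stack the rows.

    Hypothetical helper: `reader` is any callable that turns a path
    into a DataFrame (pd.read_excel by default).
    """
    # Sort so that the concatenation order does not depend on the filesystem.
    files = sorted(Path(data_dir).glob(pattern))
    if not files:
        raise FileNotFoundError(f"no files matching {pattern!r} in {data_dir}")

    # Record the originating file name so each row can be traced back.
    frames = [reader(path).assign(source_file=path.name) for path in files]
    return pd.concat(frames, ignore_index=True)
```

Called as `load_reviews(data_path)`, this should produce the same rows as `combined_df`, plus one extra column, in a deterministic order.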


## Basic Data Analysis

In [25]:
# Display basic information about the dataset
print("\nDataset Info:")
combined_df.info()

print("\nFirst few rows:")
combined_df.head()


Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34435 entries, 0 to 34434
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   note              24104 non-null  float64
 1   auteur            34434 non-null  object 
 2   avis              34435 non-null  object 
 3   assureur          34435 non-null  object 
 4   produit           34435 non-null  object 
 5   type              34435 non-null  object 
 6   date_publication  34435 non-null  object 
 7   date_exp          34435 non-null  object 
 8   avis_en           34433 non-null  object 
 9   avis_cor          435 non-null    object 
 10  avis_cor_en       431 non-null    object 
dtypes: float64(1), object(10)
memory usage: 2.9+ MB

First few rows:


| | note | auteur | avis | assureur | produit | type | date_publication | date_exp | avis_en | avis_cor | avis_cor_en |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | | estelle-51227 | j'ai quitté mon ancien contrat d'assurance che... | Néoliane Santé | sante | test | 12/01/2017 | 01/01/2017 | I left my former insurance contract at General... | | |
| 1 | | leadum-51107 | j'ai souscrit à cette mutuelle l'année dernier... | Néoliane Santé | sante | test | 09/01/2017 | 01/01/2017 | I subscribed to this mutual a year last year a... | | |
| 2 | | enora-49520 | Impossible d'avoir le bon service , ils raccro... | Néoliane Santé | sante | test | 24/11/2016 | 01/11/2016 | Impossible to have the right service, they han... | | |
| 3 | | bea-139295 | Génération est une mutuelle très chère pour un... | Génération | sante | test | 09/11/2021 | 01/11/2021 | Generation is a very expensive mutual for a re... | | |
| 4 | | anna-139192 | je viens d apprendre que je suis radié... j ap... | Génération | sante | test | 08/11/2021 | 01/11/2021 | I just learned that I am struck off ... I call... | | |
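The non-null counts printed by `info()` show that `avis_cor` and `avis_cor_en` are filled for only 435 and 431 of the 34,435 rows, and that `note` is missing for 10,331 rows. A compact way to see this per column is sketched below; the helper name `missing_summary` and its output column names are illustrative assumptions.

```python
import pandas as pd


def missing_summary(df):
    """Return missing-value counts and percentages per column, worst first."""
    n_missing = df.isna().sum()
    summary = pd.DataFrame(
        {
            "n_missing": n_missing,
            # Share of rows that are missing, as a percentage with one decimal.
            "pct_missing": (n_missing / len(df) * 100).round(1),
        }
    )
    return summary.sort_values("n_missing", ascending=False)
```

Running `missing_summary(combined_df)` at this point would put the two correction columns at the top, which is one way to decide which columns carry too little data to keep.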


In [26]:
# Drop the correction columns, which are almost entirely empty
# (435 and 431 non-null values out of 34,435 rows)
combined_df = combined_df.drop(columns=['avis_cor', 'avis_cor_en'])

In [27]:
# Re-inspect the dataset to confirm that only the 9 remaining columns are present
print("\nDataset Info:")
combined_df.info()

print("\nFirst few rows:")
combined_df.head()



Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34435 entries, 0 to 34434
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   note              24104 non-null  float64
 1   auteur            34434 non-null  object 
 2   avis              34435 non-null  object 
 3   assureur          34435 non-null  object 
 4   produit           34435 non-null  object 
 5   type              34435 non-null  object 
 6   date_publication  34435 non-null  object 
 7   date_exp          34435 non-null  object 
 8   avis_en           34433 non-null  object 
dtypes: float64(1), object(8)
memory usage: 2.4+ MB

First few rows:


| | note | auteur | avis | assureur | produit | type | date_publication | date_exp | avis_en |
|---|---|---|---|---|---|---|---|---|---|
| 0 | | estelle-51227 | j'ai quitté mon ancien contrat d'assurance che... | Néoliane Santé | sante | test | 12/01/2017 | 01/01/2017 | I left my former insurance contract at General... |
| 1 | | leadum-51107 | j'ai souscrit à cette mutuelle l'année dernier... | Néoliane Santé | sante | test | 09/01/2017 | 01/01/2017 | I subscribed to this mutual a year last year a... |
| 2 | | enora-49520 | Impossible d'avoir le bon service , ils raccro... | Néoliane Santé | sante | test | 24/11/2016 | 01/11/2016 | Impossible to have the right service, they han... |
| 3 | | bea-139295 | Génération est une mutuelle très chère pour un... | Génération | sante | test | 09/11/2021 | 01/11/2021 | Generation is a very expensive mutual for a re... |
| 4 | | anna-139192 | je viens d apprendre que je suis radié... j ap... | Génération | sante | test | 08/11/2021 | 01/11/2021 | I just learned that I am struck off ... I call... |


## Text Analysis

In [28]:
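# Basic text statistics on the original French reviews ('avis'):
# character count and whitespace-separated word count per review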
combined_df['text_length'] = combined_df['avis'].str.len()
combined_df['word_count'] = combined_df['avis'].str.split().str.len()

print("Text length statistics:")
print(combined_df['text_length'].describe())

print("\nWord count statistics:")
print(combined_df['word_count'].describe())

Text length statistics:
count    34435.000000
mean       345.849194
std        385.472878
min          3.000000
25%        161.000000
50%        201.000000
75%        382.000000
max       8770.000000
Name: text_length, dtype: float64

Word count statistics:
count    34435.000000
mean        58.599448
std         66.716029
min          1.000000
25%         27.000000
50%         34.000000
75%         65.000000
max       1469.000000
Name: word_count, dtype: float64
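The statistics above cover only the French `avis` column, which has no missing values. The English `avis_en` column has 34,433 non-null entries, so the same computation there needs to tolerate missing text. A minimal sketch follows; the helper name `add_length_features` and the generated column names are assumptions, and treating a missing review as an empty string is one possible choice among several.

```python
import pandas as pd


def add_length_features(df, text_col, prefix):
    """Add character and word counts for text_col, treating missing text as empty."""
    # Missing reviews become "" so that both counts come out as 0 instead of NaN.
    text = df[text_col].fillna("").astype(str)

    out = df.copy()
    out[f"{prefix}_char_count"] = text.str.len()
    # split() with no argument splits on any run of whitespace.
    out[f"{prefix}_word_count"] = text.str.split().str.len()
    return out
```

For example, `add_length_features(combined_df, "avis_en", "en")` would add `en_char_count` and `en_word_count`, which can then be compared against `text_length` and `word_count` for the French text.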


In [29]:
# Display data types for each column
print("\nColumn data types:")
print(combined_df.dtypes)

# Display unique values and counts for categorical columns
print("\nUnique values in categorical columns:")
print("\nProduct types:")
print(combined_df['produit'].value_counts())
print("\nInsurance companies:")
print(combined_df['assureur'].value_counts())
print("\nType column values:")
print(combined_df['type'].value_counts())

# Display sample values from date columns to understand format
print("\nSample date values:")
print("\ndate_publication samples:")
print(combined_df['date_publication'].head())
print("\ndate_exp samples:")
print(combined_df['date_exp'].head())



Column data types:
note                float64
auteur               object
avis                 object
assureur             object
produit              object
type                 object
date_publication     object
date_exp             object
avis_en              object
text_length           int64
word_count            int64
dtype: object

Unique values in categorical columns:

Product types:
produit
auto                                     20157
sante                                     5002
moto                                      3021
habitation                                2815
prevoyance                                1110
credit                                     908
vie                                        835
animaux                                    523
multirisque-professionnelle                 24
garantie-decennale                          14
assurances-professionnelles                 12
responsabilite-civile-professionnelle       10
flotte-automobile              
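Both date columns are stored as `object`, and the rows displayed earlier (for example `24/11/2016`) suggest day/month/year strings. Assuming every value follows that pattern, the columns could be converted to proper datetimes as sketched below; the helper name `parse_review_dates` is hypothetical, and `errors="coerce"` turns any value that does not match into `NaT` rather than raising.

```python
import pandas as pd


def parse_review_dates(df, columns=("date_publication", "date_exp")):
    """Return a copy of df with the given columns parsed as dd/mm/yyyy dates."""
    out = df.copy()
    for col in columns:
        # An explicit format avoids month/day ambiguity such as 09/01/2017.
        out[col] = pd.to_datetime(out[col], format="%d/%m/%Y", errors="coerce")
    return out
```

After conversion, counting `NaT` values with `isna().sum()` on each column is a quick way to check whether the day/month/year assumption actually holds for the whole dataset.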

In [30]:
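# Split data into train and test sets using the predefined 'type' column;
# .copy() gives independent DataFrames and avoids SettingWithCopyWarning later
train_df = combined_df[combined_df['type'] == 'train'].copy()
test_df = combined_df[combined_df['type'] == 'test'].copy()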

print("\nTrain set shape:", train_df.shape)
print("Test set shape:", test_df.shape)

# Verify the split
print("\nPercentage of data in train set: {:.1f}%".format(len(train_df) / len(combined_df) * 100))
print("Percentage of data in test set: {:.1f}%".format(len(test_df) / len(combined_df) * 100))



Train set shape: (24104, 11)
Test set shape: (10331, 11)

Percentage of data in train set: 70.0%
Percentage of data in test set: 30.0%
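The train set has 24,104 rows, the same number as the non-null count of `note` reported by `info()`. That suggests, but does not prove, that ratings are available only for the train rows. A short check is sketched below; the helper name `label_coverage` and its output column names are illustrative assumptions.

```python
import pandas as pd


def label_coverage(df, split_col="type", label_col="note"):
    """Count total rows and rows with a non-missing label for each split."""
    return (
        df.assign(has_label=df[label_col].notna())
        .groupby(split_col)["has_label"]
        # size counts all rows in the split; sum counts the True values.
        .agg(n_rows="size", n_labeled="sum")
    )
```

If the assumption holds, `label_coverage(combined_df)` would show `n_labeled` equal to `n_rows` for `train` and equal to 0 for `test`, in which case the test set cannot be used directly to score rating predictions.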
