### Overview of Data Cleaning

1. 🔡 Convert All Text to Lowercase
2. 🔢 Remove Numbers
3. ❌ Remove Punctuation
4. 🌪️ Remove Extra Whitespaces
5. 🚫 Remove Stopwords
6. 🧑‍🔬 Perform Lemmatization
7. 📝 Correct Spelling

In [1]:
import pandas as pd

In [5]:
df = pd.read_csv('data/BA_reviews.csv')
df.head()

,Unnamed: 0,reviews
0,0,Not Verified | Very good flight following an ...
1,1,Not Verified | An hour's delay due to late ar...
2,2,✅ Trip Verified | I booked through BA becaus...
3,3,✅ Trip Verified | British airways lost bags ...
4,4,✅ Trip Verified | The check in process and rew...


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  1000 non-null   int64 
 1   reviews     1000 non-null   object
dtypes: int64(1), object(1)
memory usage: 15.8+ KB


In [7]:
df.shape

(1000, 2)

In [8]:
df.isnull().sum()

Unnamed: 0    0
reviews       0
dtype: int64

In [9]:
import re

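# Clean the raw review text
def clean_review(row):
    # Drop the verification tag; the "|" separator is removed by the next step
    row = re.sub(r"✅ Trip Verified|Not Verified", "", row)
    # Remove every character that is not a word character or whitespace
    # (punctuation and emojis, but also apostrophes and slashes)
    row = re.sub(r'[^\w\s]','', row)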
    return row.strip()

df['Cleaned_Review'] = df['reviews'].apply(clean_review)

df['Verified'] = df['reviews'].apply(lambda x: 'verified' if "✅ Trip Verified" in x else 'unverified')
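
To see what `clean_review` does to a single string, here is a minimal sketch on a made-up review (the sample text is an assumption, not a row from the dataset). Because `[^\w\s]` matches every character that is neither a word character nor whitespace, apostrophes and slashes disappear too, which is why the output below contains tokens such as `dont` and `rewardloyalty`.

```python
import re

def clean_review(row):
    # Same logic as the notebook cell above
    row = re.sub(r"✅ Trip Verified|Not Verified", "", row)
    row = re.sub(r"[^\w\s]", "", row)
    return row.strip()

# Made-up sample, for illustration only
sample = "✅ Trip Verified | The reward/loyalty program doesn't work."
cleaned = clean_review(sample)
print(cleaned)
```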

In [11]:
print(df[['Verified', 'Cleaned_Review']])

       Verified                                     Cleaned_Review
0    unverified  Very good flight following an equally good fli...
1    unverified  An hours delay due to late arrival of the inco...
2      verified  I booked through BA because Loganair dont have...
3      verified  British airways lost bags in LHR then found th...
4      verified  The check in process and rewardloyalty program...
..          ...                                                ...
995  unverified  I have often flown British Airways and have co...
996  unverified  Good morning I would like to write a review fo...
997    verified  My flight was cancelled 3 days in a row Was fl...
998    verified  Hong Kong to Copenhagen via London The whole e...
999    verified  London Gatwick to San Jose Costa Rica This was...

[1000 rows x 2 columns]


In [12]:
df.drop(columns=['reviews', 'Unnamed: 0'], inplace=True)
print(df.head())

                                      Cleaned_Review    Verified
0  Very good flight following an equally good fli...  unverified
1  An hours delay due to late arrival of the inco...  unverified
2  I booked through BA because Loganair dont have...    verified
3  British airways lost bags in LHR then found th...    verified
4  The check in process and rewardloyalty program...    verified


In [14]:
# Reorder the columns so 'Verified' comes first
df = df[['Verified', 'Cleaned_Review']]

In [15]:
df = df.rename(columns={'Cleaned_Review': 'Review'})

### Step 1: Convert All Text to Lowercase

In [17]:
# Convert the 'Review' column to lowercase
df['Review'] = df['Review'].str.lower()

# Display the updated DataFrame
print(df.head(7))


     Verified                                             Review
0  unverified  very good flight following an equally good fli...
1  unverified  an hours delay due to late arrival of the inco...
2    verified  i booked through ba because loganair dont have...
3    verified  british airways lost bags in lhr then found th...
4    verified  the check in process and rewardloyalty program...
5    verified  we flew in november 2023 but it took this long...
6    verified  i left for london from johannesburg at 2115 on...


### Step 2: Remove Numbers

In [19]:
import re

# Function to remove numbers
def remove_numbers(text):
    return re.sub(r'\d+', '', text)

# Apply the function to the 'Review' column
df['Review'] = df['Review'].apply(remove_numbers)

# Display the updated DataFrame
print(df.head(7))


     Verified                                             Review
0  unverified  very good flight following an equally good fli...
1  unverified  an hours delay due to late arrival of the inco...
2    verified  i booked through ba because loganair dont have...
3    verified  british airways lost bags in lhr then found th...
4    verified  the check in process and rewardloyalty program...
5    verified  we flew in november  but it took this long to ...
6    verified  i left for london from johannesburg at  on  de...
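
A minimal sketch of the same function on one string, taken from the start of row 5 as printed in Step 1. Deleting the digits leaves the surrounding spaces in place, so a double space remains where `2023` used to be; Step 4 is what collapses it.

```python
import re

def remove_numbers(text):
    # Delete every run of digits; neighbouring spaces are left untouched
    return re.sub(r"\d+", "", text)

sample = "we flew in november 2023 but it took this long"
no_digits = remove_numbers(sample)
print(repr(no_digits))
```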


### Step 3: Remove Punctuation

In [28]:
import string

# Function to remove punctuation
def remove_punctuation(text):
    return text.translate(str.maketrans('', '', string.punctuation))

# Apply the function to the 'Review' column
df['Review'] = df['Review'].apply(remove_punctuation)
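
`string.punctuation` covers ASCII punctuation only. The sketch below, using made-up sample strings, shows that typographic quotes pass through `str.translate` unchanged. In this notebook that probably matters little, since `clean_review` has already removed everything matching `[^\w\s]`, but it is worth knowing if this step is reused on raw text.

```python
import string

# Translation table that deletes every ASCII punctuation character
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_punctuation(text):
    return text.translate(PUNCT_TABLE)

ascii_result = remove_punctuation("hello, world!")
# Curly quotes and apostrophes are not in string.punctuation
unicode_result = remove_punctuation("it’s “fine”")
print(ascii_result)
print(unicode_result)
```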

### Step 4: Remove Extra Whitespaces

In [29]:
# Function to remove extra spaces
def remove_extra_spaces(text):
    return " ".join(text.split())

# Apply the function to the 'Review' column
df['Review'] = df['Review'].apply(remove_extra_spaces)

# Display the updated DataFrame
print(df.head(7))


     Verified                                             Review
0  unverified  good flight following equally good flight rome...
1  unverified  hour delay due late arrival incoming aircraft ...
2    verified  booked ba loganair dont representative manches...
3    verified  british airway lost bag lar found sent cologne...
4    verified  check process rewardloyalty program mess never...
5    verified  flew november took long seek satisfactory resp...
6    verified  left london johannesburg december issue flight...
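
`text.split()` with no arguments splits on any run of whitespace, including tabs and newlines, and drops leading and trailing whitespace, so joining the pieces with a single space normalizes all of them at once. A small sketch with a made-up string:

```python
def remove_extra_spaces(text):
    # split() without arguments treats any whitespace run as one separator
    return " ".join(text.split())

# Made-up input mixing spaces, a tab and newlines
messy = "  left london \t johannesburg\n\n december  "
tidy = remove_extra_spaces(messy)
print(repr(tidy))
```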


### Step 5: Remove Stopwords

In [22]:
import nltk
from nltk.corpus import stopwords

# Download stopwords if not already downloaded
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# Function to remove stopwords
def remove_stopwords(text):
    return " ".join([word for word in text.split() if word not in stop_words])

# Apply the function to the 'Review' column
df['Review'] = df['Review'].apply(remove_stopwords)

# Display the updated DataFrame
print(df.head(7))


     Verified                                             Review
0  unverified  good flight following equally good flight rome...
1  unverified  hours delay due late arrival incoming aircraft...
2    verified  booked ba loganair dont representatives manche...
3    verified  british airways lost bags lhr found sent colog...
4    verified  check process rewardloyalty program mess never...
5    verified  flew november took long seek satisfactory resp...
6    verified  left london johannesburg december issue flight...


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\lucky/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
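
NLTK's English stopword list includes negations such as `not`, so removing stopwords can flip the apparent meaning of a review. If the cleaned text later feeds a sentiment model, it may be worth keeping those words. The sketch below uses a toy stopword set (an assumption, so the example runs without downloading NLTK data) to show the effect of subtracting a small set of negations before filtering.

```python
# Toy stopword set for illustration; the notebook uses stopwords.words('english')
TOY_STOP_WORDS = {"the", "was", "not", "a", "is"}
# Words we may want to keep (an assumption; adjust to the task)
NEGATIONS = {"not"}

def remove_stopwords(text, stop_words):
    return " ".join(word for word in text.split() if word not in stop_words)

sample = "the flight was not good"
dropped = remove_stopwords(sample, TOY_STOP_WORDS)
kept = remove_stopwords(sample, TOY_STOP_WORDS - NEGATIONS)
print(dropped)
print(kept)
```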


### Step 6: Perform Lemmatization

In [23]:
from nltk.stem import WordNetLemmatizer

# Download WordNet data if not already downloaded
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

# Function to apply lemmatization
def lemmatize_text(text):
    return " ".join([lemmatizer.lemmatize(word) for word in text.split()])

# Apply the function to the 'Review' column
df['Review'] = df['Review'].apply(lemmatize_text)

# Display the updated DataFrame
print(df.head(7))


[nltk_data] Downloading package wordnet to C:\Users\lucky/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


     Verified                                             Review
0  unverified  good flight following equally good flight rome...
1  unverified  hour delay due late arrival incoming aircraft ...
2    verified  booked ba loganair dont representative manches...
3    verified  british airway lost bag lhr found sent cologne...
4    verified  check process rewardloyalty program mess never...
5    verified  flew november took long seek satisfactory resp...
6    verified  left london johannesburg december issue flight...


### Step 7: Correct Spelling

In [27]:
from spellchecker import SpellChecker

# Initialize the SpellChecker
spell = SpellChecker()

# Function to correct spelling
def correct_spelling(text):
    corrected_words = []
    for word in text.split():
        # correction() may return None when it has no candidate; keep the word as-is
        suggestion = spell.correction(word)
        corrected_words.append(suggestion if suggestion is not None else word)
    return " ".join(corrected_words)

df['Review'] = df['Review'].apply(correct_spelling)

# For large datasets (e.g. more than 10,000 rows), swifter can speed this up:
# import swifter
# df['Review'] = df['Review'].swifter.apply(correct_spelling)

# Display the updated DataFrame
print(df.head())


     Verified                                             Review
0  unverified  good flight following equally good flight rome...
1  unverified  hour delay due late arrival incoming aircraft ...
2    verified  booked ba loganair don't representative manche...
3    verified  british airway lost bag lar found sent cologne...
4    verified  check process rewardloyalty program mess never...
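
In the output above, the airport code `lhr` has become `lar`: the spell checker treats domain terms it does not know as typos. One possible mitigation, sketched below, is to skip a set of protected words and to cache each lookup so a repeated word is only corrected once. `build_corrector`, `protected` and `toy_correct` are hypothetical names; `toy_correct` is a stand-in for `spell.correction` so the example runs without the spell-checking package.

```python
from functools import lru_cache

def build_corrector(correct, protected=frozenset()):
    """Wrap a word-level corrector with a skip list and a cache."""
    @lru_cache(maxsize=None)
    def fix(word):
        if word in protected:
            return word  # leave domain terms untouched
        suggestion = correct(word)
        return word if suggestion is None else suggestion

    def correct_text(text):
        return " ".join(fix(word) for word in text.split())

    return correct_text

calls = []

def toy_correct(word):
    # Hypothetical stand-in for spell.correction(word)
    calls.append(word)
    return {"lhr": "lar", "flght": "flight"}.get(word, word)

correct_text = build_corrector(toy_correct, protected=frozenset({"lhr"}))
result = correct_text("lost bag lhr flght flght")
print(result)
print(calls)
```

With the real checker, `correct` would be `spell.correction`, and the protected set would need to be built from the airline and airport names that appear in the reviews.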


In [30]:
df.head()

,Verified,Review
0,unverified,good flight following equally good flight rome...
1,unverified,hour delay due late arrival incoming aircraft ...
2,verified,booked ba loganair dont representative manches...
3,verified,british airway lost bag lar found sent cologne...
4,verified,check process rewardloyalty program mess never...


In [31]:
# Save the cleaned data to a new CSV file
df.to_csv("data/cleaned_data.csv", index=False)
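
The `In [n]` counters suggest Steps 3 and 4 were executed after Step 7, so the printed outputs do not strictly follow the order in the overview. As a rough sketch, the string-level steps can be chained in one function so they always run in the listed order. `clean_text` and its `stop_words` parameter are hypothetical names, the sample sentence is made up, and lemmatization and spelling correction are left out to keep the example dependency-free.

```python
import re
import string

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_text(text, stop_words=frozenset()):
    text = text.lower()                  # Step 1: lowercase
    text = re.sub(r"\d+", "", text)      # Step 2: remove numbers
    text = text.translate(_PUNCT_TABLE)  # Step 3: remove ASCII punctuation
    words = text.split()                 # Step 4: split() discards extra whitespace
    # Step 5: drop stopwords
    return " ".join(word for word in words if word not in stop_words)

sample = "We flew in November 2023, but it took THIS long!"
cleaned = clean_text(sample, stop_words={"we", "in", "but", "it", "this"})
print(cleaned)
```

Applied to the DataFrame, this would be a single `df['Review'].apply(clean_text, stop_words=stop_words)` call instead of one cell per step.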