# Data Cleaning

Now since we have extracted data from the website, it is not cleaned and ready to be analyzed yet. The reviews section will need to be cleaned for punctuations, spellings and other characters.

In [1]:
# Import required library

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os

import re

In [2]:
df = pd.read_csv("BA_reviews.csv", index_col=0)

In [3]:
df.head()

Unnamed: 0,reviews,stars,date,country
0,✅ Trip Verified | After an excellent flight ...,5.0,1st January 2025,United Kingdom
1,✅ Trip Verified | On a recent flight from Cy...,5.0,17th December 2024,United Kingdom
2,✅ Trip Verified | Flight BA 0560 arrived in ...,1.0,17th December 2024,Australia
3,✅ Trip Verified | This was the first time I ...,1.0,14th December 2024,United States
4,✅ Trip Verified | Pretty good flight but sti...,2.0,13th December 2024,United Kingdom


We will also create a column which mentions if the user is verified or not.

In [4]:
df['verified'] = df.reviews.str.contains("Trip Verified")

In [5]:
df['verified']

0        True
1        True
2        True
3        True
4        True
        ...  
3495    False
3496    False
3497    False
3498    False
3499    False
Name: verified, Length: 3500, dtype: bool

## Cleaning Reviews

We will extract the column of reviews into a separate dataframe and clean it for semantic analysis

In [6]:
#for lemmatization of words we will use nltk library

from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
lemma = WordNetLemmatizer()


reviews_data = df.reviews.str.strip("✅ Trip Verified |")

#create an empty list to collect cleaned data corpus
corpus =[]

#loop through each review, remove punctuations, small case it, join it and add it to corpus
for rev in reviews_data:
    rev = re.sub('[^a-zA-Z]',' ', rev)
    rev = rev.lower()
    rev = rev.split()
    rev = [lemma.lemmatize(word) for word in rev if word not in set(stopwords.words("english"))]
    rev = " ".join(rev)
    corpus.append(rev)

In [7]:
# add the corpus to the original dataframe

df['corpus'] = corpus

In [8]:
df.head()

Unnamed: 0,reviews,stars,date,country,verified,corpus
0,✅ Trip Verified | After an excellent flight ...,5.0,1st January 2025,United Kingdom,True,excellent flight cpt lhr return good ba moved ...
1,✅ Trip Verified | On a recent flight from Cy...,5.0,17th December 2024,United Kingdom,True,recent flight cyprus ba second cabin door clos...
2,✅ Trip Verified | Flight BA 0560 arrived in ...,1.0,17th December 2024,Australia,True,flight ba arrived rome december passenger rece...
3,✅ Trip Verified | This was the first time I ...,1.0,14th December 2024,United States,True,first time flew british airway huge disappoint...
4,✅ Trip Verified | Pretty good flight but sti...,2.0,13th December 2024,United Kingdom,True,pretty good flight still small thing improved ...


## Cleaning/Fromat date

In [9]:
df.dtypes

reviews      object
stars       float64
date         object
country      object
verified       bool
corpus       object
dtype: object

In [10]:
# convert the date to datetime format

# Remove ordinal suffixes and convert to datetime
df['date'] = df['date'].str.replace(r'(\d+)(st|nd|rd|th)', r'\1', regex=True)
df['date'] = pd.to_datetime(df['date'])


In [11]:
df.date.head()

0   2025-01-01
1   2024-12-17
2   2024-12-17
3   2024-12-14
4   2024-12-13
Name: date, dtype: datetime64[ns]

## Check for null values

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3500 entries, 0 to 3499
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   reviews   3500 non-null   object        
 1   stars     3498 non-null   float64       
 2   date      3500 non-null   datetime64[ns]
 3   country   3499 non-null   object        
 4   verified  3500 non-null   bool          
 5   corpus    3500 non-null   object        
dtypes: bool(1), datetime64[ns](1), float64(1), object(3)
memory usage: 167.5+ KB


In [12]:
df.isnull().sum()

reviews     0
stars       2
date        0
country     1
verified    0
corpus      0
dtype: int64

Drop rows with null values

In [14]:
df.dropna(inplace=True)

In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3497 entries, 0 to 3499
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   reviews   3497 non-null   object        
 1   stars     3497 non-null   float64       
 2   date      3497 non-null   datetime64[ns]
 3   country   3497 non-null   object        
 4   verified  3497 non-null   bool          
 5   corpus    3497 non-null   object        
dtypes: bool(1), datetime64[ns](1), float64(1), object(3)
memory usage: 167.3+ KB


In [16]:
#resetting the index
df.reset_index(drop=True)

Unnamed: 0,reviews,stars,date,country,verified,corpus
0,✅ Trip Verified | After an excellent flight ...,5.0,2025-01-01,United Kingdom,True,excellent flight cpt lhr return good ba moved ...
1,✅ Trip Verified | On a recent flight from Cy...,5.0,2024-12-17,United Kingdom,True,recent flight cyprus ba second cabin door clos...
2,✅ Trip Verified | Flight BA 0560 arrived in ...,1.0,2024-12-17,Australia,True,flight ba arrived rome december passenger rece...
3,✅ Trip Verified | This was the first time I ...,1.0,2024-12-14,United States,True,first time flew british airway huge disappoint...
4,✅ Trip Verified | Pretty good flight but sti...,2.0,2024-12-13,United Kingdom,True,pretty good flight still small thing improved ...
...,...,...,...,...,...,...
3492,Dallas Fort Worth Texas to London Heathrow BA ...,8.0,2014-11-20,United Kingdom,False,dallas fort worth texas london heathrow ba th ...
3493,LHR-VIE in Club Europe on A320. Plane was an e...,4.0,2014-11-20,Australia,False,lhr vie club europe plane ex bmi aircraft stil...
3494,AMS-LHR-JNBJNB-LHR-AMS all flights on time ver...,3.0,2014-11-20,Belgium,False,am lhr jnbjnb lhr am flight time clean new air...
3495,Travelled to and from India recently in Club W...,5.0,2014-11-20,Ireland,False,avelled india recently club world outbound hyd...


Now data is cleaned and ready for data visualization and data analysis.

In [17]:
# export the cleaned data

df.to_csv("cleaned-BA-reviews.csv")