
**Data Cleaning**

The data extracted from the website is still raw and not yet ready for analysis. The review text needs to be cleaned of punctuation, spelling issues, and other unwanted characters.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os

#regex
import re

In [6]:
cwd = os.getcwd()

df = pd.read_csv(cwd+"/sample_data/BA_reviews (1).csv", index_col=0)

In [7]:
df.head()

Unnamed: 0,reviews,stars,date,country
0,✅ Trip Verified | Manchester to Copenhagen vi...,5.0,1st May 2024,Denmark
1,✅ Trip Verified | I have never seen such disr...,6.0,30th April 2024,United Kingdom
2,✅ Trip Verified | Paid for a 14 hour long fli...,1.0,27th April 2024,Singapore
3,✅ Trip Verified | Very inconsiderate PA annou...,1.0,25th April 2024,United Kingdom
4,"✅ Trip Verified | Absolutely terrible, lost m...",1.0,22nd April 2024,United Kingdom


We will also create a column that indicates whether each review is trip-verified.

In [8]:
df['verified'] = df.reviews.str.contains("Trip Verified")
df['verified']

0        True
1        True
2        True
3        True
4        True
        ...  
3395    False
3396    False
3397    False
3398    False
3399    False
Name: verified, Length: 3400, dtype: bool

**Cleaning Reviews**

We will extract the reviews column into a separate series and clean it for semantic analysis.

In [14]:
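# for lemmatization of words we will use the nltk library
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
lemma = WordNetLemmatizer()

# build the stop-word set once instead of on every loop iteration
stop_words = set(stopwords.words("english"))

# note: str.strip treats its argument as a set of characters to trim from both ends,
# not as a literal prefix, so leading letters such as "Tr" in "Travelled" are also
# removed and the "Not Verified" prefix is left in place
reviews_data = df.reviews.str.strip("✅ Trip Verified |")

# create an empty list to collect the cleaned corpus
corpus = []

# loop through each review: replace non-letters with spaces, lowercase it, split it
# into words, drop stop words, lemmatize the rest, join it and add it to the corpus
for rev in reviews_data:
    rev = re.sub('[^a-zA-Z]', ' ', rev)
    rev = rev.lower()
    rev = rev.split()
    rev = [lemma.lemmatize(word) for word in rev if word not in stop_words]
    rev = " ".join(rev)
    corpus.append(rev)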

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...


In [15]:
# add the corpus to the original dataframe
df['corpus'] = corpus

In [25]:
df.head(20)

Unnamed: 0,reviews,stars,date,country,verified,corpus
0,✅ Trip Verified | Manchester to Copenhagen vi...,5.0,NaT,Denmark,True,manchester copenhagen via london th april airc...
1,✅ Trip Verified | I have never seen such disr...,6.0,2024-04-30,United Kingdom,True,never seen disrespect customer rd time month u...
2,✅ Trip Verified | Paid for a 14 hour long fli...,1.0,2024-04-27,Singapore,True,paid hour long flight ticket includes use flig...
3,✅ Trip Verified | Very inconsiderate PA annou...,1.0,2024-04-25,United Kingdom,True,inconsiderate pa announcement made purser appe...
4,"✅ Trip Verified | Absolutely terrible, lost m...",1.0,NaT,United Kingdom,True,absolutely terrible lost luggage flight delive...
5,✅ Trip Verified | We booked premium economy r...,1.0,2024-04-20,United States,True,booked premium economy round trip phoenix zuri...
6,✅ Trip Verified | We chose Rotterdam and Lond...,1.0,2024-04-12,Netherlands,True,chose rotterdam london city airport convenienc...
7,✅ Trip Verified | The entire booking and ticke...,6.0,2024-04-10,United States,True,entire booking ticketing experience stressful ...
8,Not Verified | British Airways cancelled my ...,1.0,2024-04-10,United States,False,verified british airway cancelled flight le ho...
9,Not Verified | I wanted to write this review s...,1.0,2024-04-07,United States,False,verified wanted write review could give huge t...


**Cleaning/Format date**

In [17]:
df.dtypes

reviews      object
stars       float64
date         object
country      object
verified       bool
corpus       object
dtype: object

In [21]:
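# convert the date column to datetime format
# the format string only matches days written with a "th" suffix (e.g. "30th April 2024");
# dates such as "1st May 2024" or "22nd April 2024" do not match and are coerced to NaT
df['date'] = pd.to_datetime(df['date'], format='%dth %B %Y', errors='coerce')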


In [22]:
df.date.head()

0          NaT
1   2024-04-30
2   2024-04-27
3   2024-04-25
4          NaT
Name: date, dtype: datetime64[ns]
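
The `NaT` values at rows 0 and 4 correspond to "1st May 2024" and "22nd April 2024": the `%dth` format only matches days with a "th" suffix. One possible way to handle every suffix is to remove it before parsing. The sketch below uses only the standard library. `parse_review_date` is a hypothetical helper, and it assumes English month names as in the sample rows.

```python
import re
from datetime import datetime

# Matches an ordinal suffix only when it directly follows a digit,
# so the "st" in "August" is left alone
ORDINAL_RE = re.compile(r"(?<=\d)(?:st|nd|rd|th)\b")


def parse_review_date(text: str) -> datetime:
    """Parse dates such as '1st May 2024' or '30th April 2024'."""
    cleaned = ORDINAL_RE.sub("", text.strip())
    return datetime.strptime(cleaned, "%d %B %Y")


samples = ["1st May 2024", "30th April 2024", "22nd April 2024", "3rd August 2023"]
parsed = [parse_review_date(s) for s in samples]
for raw, value in zip(samples, parsed):
    print(raw, "->", value.date())
```

The same idea should carry over to pandas by applying `df['date'].str.replace(r'(?<=\d)(?:st|nd|rd|th)\b', '', regex=True)` before calling `pd.to_datetime` with `format='%d %B %Y'`.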

**Cleaning ratings with stars**

In [23]:
#check for unique values
df.stars.unique()

array([ 5.,  6.,  1., 10.,  7.,  3.,  4.,  8.,  9.,  2., nan])

In [24]:
df.stars.value_counts()

stars
1.0     837
2.0     395
3.0     392
8.0     331
10.0    277
7.0     271
9.0     259
5.0     240
4.0     228
6.0     168
Name: count, dtype: int64

There are 2 rows with a missing rating (NaN): the counts above sum to 3398 out of 3400 rows. We will drop these rows.

In [26]:
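# intended to drop the rows with a missing rating; stars is a float column, so comparing
# it with the string "NaT" matches no rows and the NaN ratings remain until dropna() below
df.drop(df[df.stars == "NaT"].index, axis=0, inplace=True)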

In [30]:
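# intended to drop the rows where the date is NaT; NaT never compares equal to anything,
# so this also matches no rows and the NaT dates remain until dropna() below
df.drop(df[df.date == "NaT"].index, axis=0, inplace=True)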
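
Missing ratings are stored as NaN and missing dates as NaT, and neither compares equal to a string. A small sketch on a hypothetical toy frame (`toy` and its values are made up for illustration) shows how `isna()` and `dropna(subset=...)` could be used to find and drop such rows:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the reviews frame
toy = pd.DataFrame({
    "stars": [5.0, np.nan, 1.0, 8.0],
    "date": pd.to_datetime(["2024-04-30", "2024-04-27", None, "2024-04-20"]),
})

# Comparing the float column against the string "NaT" selects nothing
matches = int((toy["stars"] == "NaT").sum())

# isna() flags both NaN and NaT
missing_per_column = toy.isna().sum()

# dropna(subset=...) removes rows that are missing a value in the named columns
kept = toy.dropna(subset=["stars", "date"])

print(matches)
print(missing_per_column)
print(kept)
```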

In [33]:
df.tail(30)

Unnamed: 0,reviews,stars,date,country,verified,corpus
3370,First time with BA (a code share flight for JA...,10.0,2014-11-20,Australia,False,first time ba code share flight jal travelled ...
3371,Travelled to and from India recently in Club W...,7.0,2014-11-20,Ireland,False,avelled india recently club world outbound hyd...
3372,BA16 Singapore to London. B777 World Traveller...,9.0,2014-11-20,Singapore,False,ba singapore london b world traveller cabin on...
3373,LHR to DXB on November 6 an overnight relative...,1.0,2014-11-20,United States,False,lhr dxb november overnight relatively short fl...
3374,LHR-LCA in Club Europe. The First class lounge...,7.0,2014-11-20,United Kingdom,False,lhr lca club europe first class lounge fairly ...
3375,52b on upper deck to LAX and 51b back from LAX...,2.0,2014-11-20,United Kingdom,False,b upper deck lax b back lax lhr food flight ok...
3376,LHR – LAX Club World A380 return a week later ...,9.0,2014-11-20,United Kingdom,False,lhr lax club world return week later lax lhr f...
3377,San Francisco to London Heathrow in August - a...,5.0,2014-11-20,United Kingdom,False,san francisco london heathrow august appalling...
3378,SFO-LHR-DXB and return DXB-LHR-DEN outbound in...,5.0,2014-11-20,United States,False,sfo lhr dxb return dxb lhr den outbound premiu...
3379,I travel to and from Singapore on BA in Club w...,,2014-11-20,United Kingdom,False,travel singapore ba club world month first tim...


In [32]:
df.stars.unique()

array([ 5.,  6.,  1., 10.,  7.,  3.,  4.,  8.,  9.,  2., nan])

**Check for null values**

In [34]:
df.isnull().value_counts()

reviews  stars  date   country  verified  corpus
False    False  False  False    False     False     2699
                True   False    False     False      698
         True   False  False    False     False        2
         False  False  True     False     False        1
Name: count, dtype: int64

In [35]:
df.country.isnull().value_counts()

country
False    3399
True        1
Name: count, dtype: int64

In [36]:
# drop every row that has a null in any column; besides the row with a null country,
# this also removes the rows with a NaN rating or a NaT date

df.dropna(inplace=True)


In [37]:
df.isnull().value_counts()

reviews  stars  date   country  verified  corpus
False    False  False  False    False     False     2699
Name: count, dtype: int64

In [38]:
df.head(10)

Unnamed: 0,reviews,stars,date,country,verified,corpus
1,✅ Trip Verified | I have never seen such disr...,6.0,2024-04-30,United Kingdom,True,never seen disrespect customer rd time month u...
2,✅ Trip Verified | Paid for a 14 hour long fli...,1.0,2024-04-27,Singapore,True,paid hour long flight ticket includes use flig...
3,✅ Trip Verified | Very inconsiderate PA annou...,1.0,2024-04-25,United Kingdom,True,inconsiderate pa announcement made purser appe...
5,✅ Trip Verified | We booked premium economy r...,1.0,2024-04-20,United States,True,booked premium economy round trip phoenix zuri...
6,✅ Trip Verified | We chose Rotterdam and Lond...,1.0,2024-04-12,Netherlands,True,chose rotterdam london city airport convenienc...
7,✅ Trip Verified | The entire booking and ticke...,6.0,2024-04-10,United States,True,entire booking ticketing experience stressful ...
8,Not Verified | British Airways cancelled my ...,1.0,2024-04-10,United States,False,verified british airway cancelled flight le ho...
9,Not Verified | I wanted to write this review s...,1.0,2024-04-07,United States,False,verified wanted write review could give huge t...
13,✅ Trip Verified | Starting off at Heathrow Te...,4.0,2024-03-28,United Kingdom,True,starting heathrow terminal check fairly easy f...
14,Not Verified | We have flown this route with ...,8.0,2024-03-28,United Kingdom,False,verified flown route easyjet regularly twenty ...


In [39]:
df.shape

(2699, 6)

In [40]:
# reset the index; reset_index returns a new DataFrame, so assign it back
df = df.reset_index(drop=True)
df

Unnamed: 0,reviews,stars,date,country,verified,corpus
0,✅ Trip Verified | I have never seen such disr...,6.0,2024-04-30,United Kingdom,True,never seen disrespect customer rd time month u...
1,✅ Trip Verified | Paid for a 14 hour long fli...,1.0,2024-04-27,Singapore,True,paid hour long flight ticket includes use flig...
2,✅ Trip Verified | Very inconsiderate PA annou...,1.0,2024-04-25,United Kingdom,True,inconsiderate pa announcement made purser appe...
3,✅ Trip Verified | We booked premium economy r...,1.0,2024-04-20,United States,True,booked premium economy round trip phoenix zuri...
4,✅ Trip Verified | We chose Rotterdam and Lond...,1.0,2024-04-12,Netherlands,True,chose rotterdam london city airport convenienc...
...,...,...,...,...,...,...
2694,LPT to LHR and back found the planes new and i...,7.0,2014-11-12,Portugal,False,lpt lhr back found plane new good condition fa...
2695,London Gatwick - Barbados in premium economy. ...,9.0,2014-11-12,Spain,False,london gatwick barbados premium economy saturd...
2696,LHR-ZRH. A320 was used on this route. I was no...,9.0,2014-11-12,United States,False,lhr zrh used route expecting much since flight...
2697,I flew from MIA-LHR-DXB. The 747 like most of ...,6.0,2014-11-11,United Arab Emirates,False,flew mia lhr dxb like crew well passed sell da...


The data is now cleaned and ready for visualization and analysis.

In [41]:
# export the cleaned data
df.to_csv(cwd + "/cleaned_BA_reviews.csv")
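
A sketch of the final two steps on a hypothetical toy frame: `reset_index` returns a new DataFrame, so its result needs to be assigned, and passing `index=False` to `to_csv` is one option to avoid writing the index as an extra unnamed column. The in-memory buffer stands in for the output file, and the values are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical frame with gaps in the index, as left behind by dropna()
toy = pd.DataFrame({"stars": [6.0, 1.0, 4.0]}, index=[1, 2, 13])

# reset_index does not modify the frame in place unless the result is assigned
toy = toy.reset_index(drop=True)

# Write to an in-memory buffer instead of a file for this illustration
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Reading the text back yields only the data columns
round_trip = pd.read_csv(io.StringIO(buffer.getvalue()))
print(round_trip)
```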