### Data Cleaning 

The scraped data is very messy and needs to be cleaned before it can be analysed.

In [45]:
#import libraries 
import pandas as pd 
import numpy as np

#import regex
import re

In [46]:
# create a dataframe for the data
df = pd.read_csv('BA_reviews.csv', index_col= 0)

In [47]:
df.head()

Unnamed: 0,Reviews,Ratings,Date,Location
0,✅ Trip Verified | Absolutely horrible experie...,5,15th April 2023,(United States)
1,Not Verified | This is the worst airline. Not...,1,14th April 2023,(United Kingdom)
2,✅ Trip Verified | I will never fly British Ai...,1,13th April 2023,(United States)
3,✅ Trip Verified | Worst aircraft I have ever ...,2,12th April 2023,(United Kingdom)
4,✅ Trip Verified | I enjoyed my flight. The bo...,1,11th April 2023,(United Kingdom)


In [48]:
#create a separate column for trip verification
df["Trip_Verification"] = df['Reviews'].str.contains('Trip Verified')

In [49]:
df['Trip_Verification']

0        True
1       False
2        True
3        True
4        True
        ...  
2995    False
2996    False
2997    False
2998    False
2999    False
Name: Trip_Verification, Length: 3000, dtype: bool

In [50]:
df.head()

Unnamed: 0,Reviews,Ratings,Date,Location,Trip_Verification
0,✅ Trip Verified | Absolutely horrible experie...,5,15th April 2023,(United States),True
1,Not Verified | This is the worst airline. Not...,1,14th April 2023,(United Kingdom),False
2,✅ Trip Verified | I will never fly British Ai...,1,13th April 2023,(United States),True
3,✅ Trip Verified | Worst aircraft I have ever ...,2,12th April 2023,(United Kingdom),True
4,✅ Trip Verified | I enjoyed my flight. The bo...,1,11th April 2023,(United Kingdom),True
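
One caveat with `str.contains('Trip Verified')` is that it matches the phrase anywhere in the review, including the body text. The sketch below compares it with a match anchored at the start of the string. The sample reviews and the pattern are assumptions made for illustration, not rows from the dataset.

```python
import pandas as pd

# Hypothetical reviews; the third mentions the phrase in its body only.
reviews = pd.Series([
    "✅ Trip Verified | Smooth flight.",
    "Not Verified | Poor service.",
    "My booking said Trip Verified but the flight was cancelled.",
])

# Matches the phrase anywhere in the text (the approach used above).
anywhere = reviews.str.contains("Trip Verified")

# Assumed stricter rule: the flag must open the review, optionally preceded
# by non-word characters such as the check mark, and be followed by a pipe.
at_start = reviews.str.match(r"\W*Trip Verified\s*\|")
```

If every review in the data starts with its verification flag, both approaches give the same column; the anchored version only matters when the phrase can also occur inside a review.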


### Cleaning Reviews Column

For this step, I clean the review text into a separate column, `Cleaned_Reviews`, to prepare it for semantic analysis.

In [51]:
import nltk
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\l-admin\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\l-admin\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [64]:
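from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

lemma = WordNetLemmatizer()
# build the stop-word set once instead of once per word
stop_words = set(stopwords.words("english"))

# note: str.strip removes any of these characters from both ends, not the literal prefix
df['Cleaned_Reviews'] = df['Reviews'].str.strip("✅ Trip Verified |")
# replace everything that is not a letter with a space
df['Cleaned_Reviews'] = df['Cleaned_Reviews'].apply(lambda x: re.sub('[^a-zA-Z]',' ', x))
df['Cleaned_Reviews'] = df['Cleaned_Reviews'].str.lower()
# split each review into tokens
df['Cleaned_Reviews'] = df['Cleaned_Reviews'].apply(lambda x: x.split())
# drop stop words and lemmatize the remaining tokens
df['Cleaned_Reviews'] = [[lemma.lemmatize(word) for word in review if word not in stop_words] for review in df['Cleaned_Reviews']]
df['Cleaned_Reviews'] = df['Cleaned_Reviews'].apply(lambda x: " ".join(x))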

In [65]:
df.head()

Unnamed: 0,Reviews,Ratings,Date,Location,Trip_Verification,Cleaned_Reviews
0,✅ Trip Verified | Absolutely horrible experie...,5,15th April 2023,(United States),True,absolutely horrible experience booked ticket e...
1,Not Verified | This is the worst airline. Not...,1,14th April 2023,(United Kingdom),False,verified worst airline one thing went right un...
2,✅ Trip Verified | I will never fly British Ai...,1,13th April 2023,(United States),True,never fly british airway start plane hour late...
3,✅ Trip Verified | Worst aircraft I have ever ...,2,12th April 2023,(United Kingdom),True,worst aircraft ever flown seat cramped uncomfo...
4,✅ Trip Verified | I enjoyed my flight. The bo...,1,11th April 2023,(United Kingdom),True,enjoyed flight boarding swift service friendly...
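
Row 1 above still begins with "verified", because `str.strip` treats its argument as a set of characters to trim from both ends rather than as a literal prefix, so "Not Verified |" is left in place. A minimal sketch of a regex-based alternative follows. The sample strings and the `PREFIX` pattern are assumptions for illustration and have not been run against the full dataset.

```python
import re

# Hypothetical sample strings modelled on the review format shown above.
samples = [
    "✅ Trip Verified | Terrible flight, delayed",
    "Not Verified | This is the worst airline.",
    "Regularly travel from BCN to LHR.",
]

# str.strip removes any of the listed characters from both ends, so
# leading and trailing letters of the review itself can be lost too.
stripped = [s.strip("✅ Trip Verified |") for s in samples]

# Assumed pattern: an optional check mark, then "Trip Verified" or
# "Not Verified", then a pipe, all anchored at the start of the string.
PREFIX = re.compile(r"^\s*(?:✅\s*)?(?:Trip Verified|Not Verified)\s*\|\s*")
cleaned = [PREFIX.sub("", s) for s in samples]
```

Here `stripped[0]` comes out as `"ble flight, delay"`, while `cleaned[0]` keeps the full text `"Terrible flight, delayed"`. In pandas the same pattern could be applied with `df['Reviews'].str.replace(pattern, '', regex=True)`.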


### Cleaning / Formatting the Date column

In [68]:
df.dtypes

Reviews              object
Ratings               int64
Date                 object
Location             object
Trip_Verification      bool
Cleaned_Reviews      object
dtype: object

In [69]:
df['Date'] = pd.to_datetime(df['Date'])

In [70]:
df['Date'].head()

0   2023-04-15
1   2023-04-14
2   2023-04-13
3   2023-04-12
4   2023-04-11
Name: Date, dtype: datetime64[ns]
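
`pd.to_datetime` infers the format of strings such as "15th April 2023" here. Where an explicit format is preferred, one option is to drop the ordinal suffix first and then parse with a fixed pattern. The helper below is a hypothetical sketch using only the standard library; it assumes English month names and the default locale.

```python
import re
from datetime import datetime

def parse_review_date(text):
    """Parse dates like '15th April 2023' (hypothetical helper)."""
    # Remove the ordinal suffix that directly follows the day number.
    # The lookbehind keeps words such as "August" intact.
    no_suffix = re.sub(r"(?<=\d)(st|nd|rd|th)\b", "", text.strip())
    # %B expects a full month name in the current locale (assumed English).
    return datetime.strptime(no_suffix, "%d %B %Y")

parsed = parse_review_date("15th April 2023")
```

With the suffix removed, the same fixed pattern can also be passed to pandas as `pd.to_datetime(series, format="%d %B %Y")`.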

### Cleaning Ratings column 

In [75]:
df.Ratings.unique()

array([ 5,  1,  2,  9,  7,  4,  3, 10,  8,  6], dtype=int64)

In [76]:
df.Ratings.value_counts()

1     707
2     352
3     352
8     292
7     250
10    240
9     237
5     218
4     207
6     145
Name: Ratings, dtype: int64
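
The counts above can be sanity-checked with a few lines of plain Python. This is only a sketch: the counts are copied from the `value_counts()` output, and the 1 to 10 rating scale is an assumption based on the values observed.

```python
# Counts copied from the value_counts() output above.
rating_counts = {1: 707, 2: 352, 3: 352, 8: 292, 7: 250,
                 10: 240, 9: 237, 5: 218, 4: 207, 6: 145}

# The counts should add up to the number of reviews.
total_reviews = sum(rating_counts.values())

# Assumed valid range for a rating: 1 to 10 inclusive.
out_of_range = [r for r in rating_counts if not 1 <= r <= 10]

# Share of reviews that gave the lowest rating.
share_of_ones = rating_counts[1] / total_reviews
```

The total is 3000, no rating falls outside the assumed range, and about 23.6% of reviews gave a rating of 1. On the dataframe itself, a similar range check could be written as `df['Ratings'].between(1, 10).all()`.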

### Cleaning the location column 

In [84]:
df['Location'] = df['Location'].apply(lambda x: x.replace('(', '').replace(')', ''))

In [86]:
df.head()

Unnamed: 0,Reviews,Ratings,Date,Location,Trip_Verification,Cleaned_Reviews
0,✅ Trip Verified | Absolutely horrible experie...,5,2023-04-15,United States,True,absolutely horrible experience booked ticket e...
1,Not Verified | This is the worst airline. Not...,1,2023-04-14,United Kingdom,False,verified worst airline one thing went right un...
2,✅ Trip Verified | I will never fly British Ai...,1,2023-04-13,United States,True,never fly british airway start plane hour late...
3,✅ Trip Verified | Worst aircraft I have ever ...,2,2023-04-12,United Kingdom,True,worst aircraft ever flown seat cramped uncomfo...
4,✅ Trip Verified | I enjoyed my flight. The bo...,1,2023-04-11,United Kingdom,True,enjoyed flight boarding swift service friendly...
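
As an alternative to `apply` with a lambda, the parentheses can be removed with the vectorized string accessor. This is a case where `str.strip` with a character set is the right tool, since only the enclosing characters need to go. The sketch below uses a small hypothetical series, including a missing value, which the `.str` methods pass through instead of raising an error.

```python
import pandas as pd

# Hypothetical sample; the None stands in for a missing location.
locations = pd.Series(["(United States)", "(United Kingdom)", None])

# Trim "(" and ")" from both ends; missing values stay missing.
cleaned_locations = locations.str.strip("()")
```

The dataset here has no missing locations, so both approaches give the same result; the difference only shows up if a later scrape returns rows without a location.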


### Check for null values 

In [88]:
df.isnull().sum()

Reviews              0
Ratings              0
Date                 0
Location             0
Trip_Verification    0
Cleaned_Reviews      0
dtype: int64

In [89]:
df.shape

(3000, 6)

In [90]:
#resetting the index (this returns a new DataFrame; assign it back to df to keep the change)
df.reset_index(drop=True)

Unnamed: 0,Reviews,Ratings,Date,Location,Trip_Verification,Cleaned_Reviews
0,✅ Trip Verified | Absolutely horrible experie...,5,2023-04-15,United States,True,absolutely horrible experience booked ticket e...
1,Not Verified | This is the worst airline. Not...,1,2023-04-14,United Kingdom,False,verified worst airline one thing went right un...
2,✅ Trip Verified | I will never fly British Ai...,1,2023-04-13,United States,True,never fly british airway start plane hour late...
3,✅ Trip Verified | Worst aircraft I have ever ...,2,2023-04-12,United Kingdom,True,worst aircraft ever flown seat cramped uncomfo...
4,✅ Trip Verified | I enjoyed my flight. The bo...,1,2023-04-11,United Kingdom,True,enjoyed flight boarding swift service friendly...
...,...,...,...,...,...,...
2995,Regularly travel from BCN to LHR. In addition ...,7,2015-01-19,Spain,False,regularly travel bcn lhr addition booting u ma...
2996,I have not travelled with BA for almost 7 year...,8,2015-01-19,United Kingdom,False,travelled ba almost year result moving asia us...
2997,Flew Miami to Heathrow it was a nightmare the ...,9,2015-01-19,United Kingdom,False,flew miami heathrow nightmare attendant watche...
2998,LHR to HKG in Club - 777-300ER. Lovely newish ...,9,2015-01-19,United Kingdom,False,lhr hkg club er lovely newish plane attentive ...


Now that the dataset has been cleaned, it is ready for visualization.

### Save the cleaned dataset

In [91]:
#save to csv
df.to_csv('Cleaned_BA-reviews.csv')
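
CSV files do not store column types, so the `Date` column comes back as plain text unless it is parsed again on load. The sketch below round-trips a small hypothetical frame through an in-memory buffer to show one way of reloading it; with the real file, the buffer would be replaced by the file name used above.

```python
import io
import pandas as pd

# Hypothetical two-row frame standing in for the cleaned dataset.
frame = pd.DataFrame({
    "Ratings": [5, 1],
    "Date": pd.to_datetime(["2023-04-15", "2023-04-14"]),
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
frame.to_csv(buffer)
buffer.seek(0)

# index_col=0 restores the saved index; parse_dates restores the datetime type.
reloaded = pd.read_csv(buffer, index_col=0, parse_dates=["Date"])
```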