## Data Cleaning Process

The data extracted from the website is still raw and not yet ready for analysis. The reviews column in particular needs to be cleaned of punctuation, spelling variations, and other stray characters.

In [1]:
# Imports necessary libraries for data analysis and visualization

import pandas as pd            # Pandas for data manipulation and analysis
import matplotlib.pyplot as plt    # Matplotlib for creating static, interactive, and animated plots
import seaborn as sns            # Seaborn for statistical data visualization
import os                    # OS module for interacting with the operating system

# Regular expression (regex) library for pattern matching and text manipulation
import re


In [2]:
# Obtain the current working directory
cwd = os.getcwd()

# Read a CSV file named "BA_reviews.csv" located in the current working directory
# Create a Pandas DataFrame 'df' from the CSV data
# Use the first column of the CSV file as the index of the DataFrame
df = pd.read_csv(cwd + "/BA_reviews.csv", index_col=0)


In [3]:
# Display the first few rows of the DataFrame.
# The 'head()' method is used to retrieve the initial rows of the DataFrame for quick inspection.
# By default, it returns the first 5 rows, providing a snapshot of the data's structure and content.
df.head()


Unnamed: 0,reviews,stars,date,country
0,Not Verified | Extremely rude ground service....,5,3rd January 2024,United States
1,✅ Trip Verified | My son and I flew to Geneva...,6,2nd January 2024,China
2,✅ Trip Verified | For the price paid (bought ...,1,29th December 2023,United Kingdom
3,✅ Trip Verified | Flight left on time and arr...,8,29th December 2023,United Kingdom
4,✅ Trip Verified | Very Poor Business class pr...,6,27th December 2023,United Kingdom


We will also create a column that indicates whether each review is marked as trip-verified.

In [4]:
# Add a new column 'verified' to the DataFrame 'df' based on whether the 'reviews' column contains the string "Trip Verified".
# The 'str.contains()' method is used to check if each entry in the 'reviews' column contains the specified substring.
# If the substring "Trip Verified" is found in a particular entry, the corresponding 'verified' column value is set to True; otherwise, it is set to False.
df['verified'] = df.reviews.str.contains("Trip Verified")


In [5]:
# Accessing the 'verified' column in the Pandas DataFrame 'df'
# This code returns the values in the 'verified' column, providing a Series.
df['verified']

0      False
1       True
2       True
3       True
4       True
       ...  
345    False
346     True
347    False
348     True
349     True
Name: verified, Length: 350, dtype: bool

### Cleaning Reviews
We will extract the reviews column into a separate Series and clean it for semantic analysis.

In [6]:
# Import the lemmatizer and the stopword list from NLTK
# (the 'wordnet' and 'stopwords' corpora must be available locally, e.g. via nltk.download)
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

# Initialize the WordNetLemmatizer for lemmatization
lemma = WordNetLemmatizer()

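# Extract the 'reviews' column and remove the leading "✅ Trip Verified |" prefix.
# An anchored regex is used because str.strip() treats its argument as a set of
# characters to trim from both ends, not as a literal prefix.
# Reviews that start with "Not Verified |" are left unchanged by this pattern.
reviews_data = df.reviews.str.replace(r"^\s*✅\s*Trip Verified\s*\|\s*", "", regex=True)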

# Create an empty list to collect the cleaned data corpus
corpus = []

# Build the English stopword set once, instead of rebuilding it for every word
stop_words = set(stopwords.words("english"))

# Loop through each review in the 'reviews_data'
for rev in reviews_data:
    # Remove non-alphabetic characters, convert to lowercase, split into words
    rev = re.sub('[^a-zA-Z]', ' ', rev)
    rev = rev.lower()
    rev = rev.split()

    # Lemmatize each word and remove stopwords
    rev = [lemma.lemmatize(word) for word in rev if word not in stop_words]
    
    # Join the cleaned words back into a string and append to the corpus list
    rev = " ".join(rev)
    corpus.append(rev)
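
The cleaning loop can also be written as a small reusable function, which makes each step easy to check on a single review. The sketch below is illustrative only: `clean_review`, the tiny `DEMO_STOP_WORDS` set, and the identity `lemmatize` default are hypothetical stand-ins so that the example runs without NLTK. In the notebook you would pass the NLTK stopword set and `lemma.lemmatize` instead.

```python
import re

# Hypothetical, deliberately tiny stop-word set used only for this demo.
DEMO_STOP_WORDS = {"the", "was", "and", "a", "to", "i", "my"}


def clean_review(text, stop_words, lemmatize=lambda word: word):
    """Return a cleaned, space-joined version of one review.

    Steps mirror the notebook loop: replace non-letters with spaces, lowercase,
    split on whitespace, drop stop words, then lemmatize what remains.
    """
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    tokens = letters_only.lower().split()
    kept = [lemmatize(tok) for tok in tokens if tok not in stop_words]
    return " ".join(kept)


example = "The crew was rude, and my bag arrived 2 days late!"
cleaned = clean_review(example, DEMO_STOP_WORDS)
print(cleaned)  # crew rude bag arrived days late
```

Passing the stop-word set and the lemmatizer as arguments keeps the function independent of any one library, so the same code can be exercised with toy inputs first.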


In [7]:
# Add the cleaned and lemmatized corpus as a new column named 'corpus' to the original DataFrame 'df'
df['corpus'] = corpus


In [8]:
# Display the first few rows of the DataFrame 'df' using the 'head()' method
df.head()


Unnamed: 0,reviews,stars,date,country,verified,corpus
0,Not Verified | Extremely rude ground service....,5,3rd January 2024,United States,False,verified extremely rude ground service non rev...
1,✅ Trip Verified | My son and I flew to Geneva...,6,2nd January 2024,China,True,son flew geneva last sunday skiing holiday le ...
2,✅ Trip Verified | For the price paid (bought ...,1,29th December 2023,United Kingdom,True,price paid bought sale decent experience altho...
3,✅ Trip Verified | Flight left on time and arr...,8,29th December 2023,United Kingdom,True,flight left time arrived half hour earlier sch...
4,✅ Trip Verified | Very Poor Business class pr...,6,27th December 2023,United Kingdom,True,poor business class product ba even close airl...
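
In the output above, the `corpus` entry for row 0 still starts with `verified`, a leftover from the `Not Verified |` label. Note also that `str.strip` removes any of the given characters from both ends of a string rather than a literal prefix, so it is not a reliable way to drop such labels. A minimal sketch of one way to remove either label with an anchored regular expression follows. `PREFIX_PATTERN` and `strip_verification_label` are hypothetical names, and the pattern assumes the labels appear exactly as in the rows shown.

```python
import re

# Hypothetical pattern: an optional check mark, then "Trip Verified" or
# "Not Verified", then the "|" separator, anchored at the start of the text.
PREFIX_PATTERN = re.compile(r"^\s*(?:✅\s*)?(?:Trip|Not) Verified\s*\|\s*")


def strip_verification_label(text):
    """Remove a leading verification label, leaving the review body intact."""
    return PREFIX_PATTERN.sub("", text, count=1)


print(strip_verification_label("✅ Trip Verified | My son and I flew to Geneva"))
# My son and I flew to Geneva
print(strip_verification_label("Not Verified | Extremely rude ground service"))
# Extremely rude ground service

# For comparison: str.strip treats its argument as a character set, so it also
# eats matching characters from the review body at both ends.
print("✅ Trip Verified | Tired crew, delayed".strip("✅ Trip Verified |"))
# crew, delay
```

With pandas, the same pattern could be applied as `df.reviews.str.replace(PREFIX_PATTERN.pattern, "", regex=True)`.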


### Cleaning/Format date

In [9]:
# Display the data types of each column in the DataFrame 'df'
df.dtypes


reviews     object
stars        int64
date        object
country     object
verified      bool
corpus      object
dtype: object
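
The `date` column is still of type `object` and holds strings such as `3rd January 2024`. One possible way to turn these into real dates is sketched below. `parse_review_date` is a hypothetical helper, and the sketch assumes every value follows the layout seen in the rows shown: day with ordinal suffix, full English month name, four-digit year.

```python
import re
from datetime import datetime

# Matches the ordinal suffix that directly follows the day number.
ORDINAL_SUFFIX = re.compile(r"(?<=\d)(st|nd|rd|th)\b")


def parse_review_date(text):
    """Convert a string like '3rd January 2024' into a datetime.date.

    Note: %B matches month names in the current locale, so this assumes an
    English (or default "C") locale.
    """
    without_suffix = ORDINAL_SUFFIX.sub("", text.strip(), count=1)
    return datetime.strptime(without_suffix, "%d %B %Y").date()


print(parse_review_date("3rd January 2024"))    # 2024-01-03
print(parse_review_date("29th December 2023"))  # 2023-12-29
```

In the notebook this could be applied with something like `df['date'] = pd.to_datetime(df['date'].map(parse_review_date))`, which would give the column a datetime dtype instead of `object`.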

In [10]:
# Check for null values: isnull() flags every cell, and value_counts() counts each distinct row pattern of flags
# A single all-False row pattern means no column contains a null value
df.isnull().value_counts()


reviews  stars  date   country  verified  corpus
False    False  False  False    False     False     350
Name: count, dtype: int64

In [11]:
# Display the number of rows and columns in the DataFrame 'df'
df.shape

(350, 6)

In [12]:
# Reset the index of the DataFrame 'df' and drop the existing index
# reset_index() returns a new DataFrame, so assign it back to keep the change
df = df.reset_index(drop=True)
df


Unnamed: 0,reviews,stars,date,country,verified,corpus
0,Not Verified | Extremely rude ground service....,5,3rd January 2024,United States,False,verified extremely rude ground service non rev...
1,✅ Trip Verified | My son and I flew to Geneva...,6,2nd January 2024,China,True,son flew geneva last sunday skiing holiday le ...
2,✅ Trip Verified | For the price paid (bought ...,1,29th December 2023,United Kingdom,True,price paid bought sale decent experience altho...
3,✅ Trip Verified | Flight left on time and arr...,8,29th December 2023,United Kingdom,True,flight left time arrived half hour earlier sch...
4,✅ Trip Verified | Very Poor Business class pr...,6,27th December 2023,United Kingdom,True,poor business class product ba even close airl...
...,...,...,...,...,...,...
345,Not Verified | This review is for LHR-SYD-LHR....,6,27th December 2023,United Kingdom,False,verified review lhr syd lhr ba ba business cla...
346,✅ Trip Verified | Absolutely pathetic business...,2,23rd December 2023,United States,True,absolutely pathetic business class product ba ...
347,Not Verified | Overall not bad. Staff look ti...,5,21st December 2023,Canada,False,verified overall bad staff look tired overwork...
348,✅ Trip Verified | This was our first flight wi...,3,21st December 2023,Australia,True,first flight british airway year usual fault c...


The data is now cleaned and ready for visualization and analysis.

In [13]:
# Export the cleaned DataFrame 'df' to a CSV file named "cleaned-BA-reviews.csv" in the current working directory
df.to_csv(cwd + "/cleaned-BA-reviews.csv")
