## Data Cleaning

Before analyzing the dataset, we examine its columns to see whether any cleaning is needed.

In [39]:
#Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns 

In [52]:
#Read the file
data=pd.read_csv('British_Airways_reviews.csv')

In [53]:
#Normalize column names: lowercase, with spaces replaced by underscores
data.columns=data.columns.str.lower()
data.columns=data.columns.str.replace(' ','_')
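
The two renaming steps above can also be written as one chain. The snippet below is a minimal sketch on a hypothetical empty frame (the header names are assumptions, not the real headers of the CSV); it also strips leading and trailing whitespace, which the original cell does not do.

```python
import pandas as pd

# Hypothetical frame with untidy headers, standing in for the raw CSV.
sample = pd.DataFrame(columns=["Date", "Review", " Country", "Star Rating "])

# Strip stray whitespace first, then lowercase, then replace inner spaces.
sample.columns = (
    sample.columns.str.strip()
    .str.lower()
    .str.replace(" ", "_", regex=False)
)
print(list(sample.columns))
```

Stripping before replacing avoids names such as `_country` or `star_rating_` when a header carries padding.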

In [42]:
#First 5 rows of dataset
data.head()

Unnamed: 0,date,review,country,star_rating
0,24th July 2023,Not Verified | I booked Premium Economy from I...,United Kingdom,5
1,21st July 2023,✅ Trip Verified | A simple story with an unfor...,Germany,1
2,21st July 2023,✅ Trip Verified | Flight was delayed due to t...,United Kingdom,1
3,20th July 2023,Not Verified | Fast and friendly check in (to...,United Kingdom,4
4,20th July 2023,✅ Trip Verified | I don't understand why Brit...,United Kingdom,8


In [43]:
#Shape of dataset
data.shape

(2450, 4)

In [54]:
#Types of dataset columns
data.dtypes

date           object
review         object
country        object
star_rating     int64
dtype: object

In [55]:
#Convert date column from object type to date type
data['date']=pd.to_datetime(data['date'])
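
The dates in this dataset carry ordinal suffixes ("24th", "21st"). The cell above lets pandas infer the format; the following is a sketch of a more explicit alternative, assuming every value follows the "day month year" pattern shown in the first rows. The names `raw_dates`, `no_suffix`, and `parsed` are illustrative only.

```python
import pandas as pd

# Sample strings in the format shown in the first rows of the dataset.
raw_dates = pd.Series(["24th July 2023", "21st July 2023", "20th July 2023"])

# Drop the ordinal suffix (st, nd, rd, th) that directly follows the day number.
no_suffix = raw_dates.str.replace(r"(?<=\d)(st|nd|rd|th)\b", "", regex=True)

# Parse with an explicit format; values that do not match become NaT instead of raising.
parsed = pd.to_datetime(no_suffix, format="%d %B %Y", errors="coerce")
print(parsed.dt.strftime("%Y-%m-%d").tolist())
```

With `errors="coerce"`, a follow-up `parsed.isna().sum()` shows how many rows failed to parse, which may be easier to audit than relying on format inference.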

## Review column

Next, we clean the review text by removing anything that is not part of the review itself. For example, the "✅ Trip Verified" prefix can be removed from each row where it appears, since it is not relevant to the analysis.

In [58]:
#Let's define a function to clean the review column
def clean_review(review):
    # Remove "Not Verified |" and "✅ Trip Verified |" from the review column
    cleaned_review = review.replace("Not Verified |", "").replace("✅ Trip Verified |", "").strip()
    return cleaned_review

# Apply the clean_review function to the 'review' column and create a new 'cleaned_review' column
data['cleaned_review'] = data['review'].apply(clean_review)

# Reorder the columns to place 'cleaned_review' in the third position
cols = data.columns.tolist()
cols.insert(2, cols.pop(cols.index('cleaned_review')))
data = data[cols]
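
As a possible alternative to the row-by-row `apply`, the same prefixes can be removed with a vectorized, anchored regular expression. This is a sketch, not the method used in this notebook: the sample rows are shortened, hypothetical examples modelled on the output above, and the pattern assumes the tag only matters when it appears at the very start of a review.

```python
import pandas as pd

# Shortened, hypothetical sample rows; the last one has no prefix at all.
sample = pd.DataFrame({"review": [
    "Not Verified | Fast and friendly check in",
    "✅ Trip Verified | Flight was delayed",
    "No prefix here",
]})

# Match the verification tag only at the start of the string, including the pipe.
prefix = r"^\s*(?:✅\s*Trip Verified|Not Verified)\s*\|\s*"

# Vectorized removal; rows without a prefix are left unchanged.
sample["cleaned_review"] = (
    sample["review"].str.replace(prefix, "", regex=True).str.strip()
)
print(sample["cleaned_review"].tolist())
```

Anchoring with `^` means a review that happens to mention "Not Verified |" in its body keeps that text, whereas a plain `str.replace` removes every occurrence.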

In [59]:
#Check the first 5 rows of the dataset again
data.head()

Unnamed: 0,date,review,cleaned_review,country,star_rating
0,2023-07-24,Not Verified | I booked Premium Economy from I...,I booked Premium Economy from INV to LAX (via ...,United Kingdom,5
1,2023-07-21,✅ Trip Verified | A simple story with an unfor...,A simple story with an unfortunate outcome tha...,Germany,1
2,2023-07-21,✅ Trip Verified | Flight was delayed due to t...,Flight was delayed due to the inbound flight a...,United Kingdom,1
3,2023-07-20,Not Verified | Fast and friendly check in (to...,Fast and friendly check in (total contrast to ...,United Kingdom,4
4,2023-07-20,✅ Trip Verified | I don't understand why Brit...,I don't understand why British Airways is clas...,United Kingdom,8


In [50]:
#Check Null Values in Columns
data.isnull().sum()

date              0
review            0
country           0
star_rating       0
cleaned_review    0
dtype: int64

In [63]:
#Check for duplicate rows
data.duplicated().sum()

0
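
Note that `duplicated()` without arguments only flags rows that match in every column. The sketch below, on hypothetical data, shows how the `subset` parameter can narrow the check to the review text alone; whether that is appropriate for this dataset is an assumption worth verifying.

```python
import pandas as pd

# Hypothetical rows: the first two share the same text but differ in rating.
sample = pd.DataFrame({
    "review": ["Great crew", "Great crew", "Late departure"],
    "country": ["Germany", "Germany", "Germany"],
    "star_rating": [9, 8, 2],
})

# Rows identical in every column.
full_row_dupes = int(sample.duplicated().sum())

# Rows whose review text repeats an earlier row, regardless of other columns.
text_dupes = int(sample.duplicated(subset=["review"]).sum())

print(full_row_dupes, text_dupes)
```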

## Create a new CSV file after cleaning the dataset

In [64]:
data.to_csv('Cleaned_BA_reviews.csv', index=False)
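
A CSV file stores dates as plain text, so the datetime type set earlier is not preserved when the file is read back. The following sketch illustrates this with a small hypothetical frame and an in-memory buffer (used here instead of `Cleaned_BA_reviews.csv` so the example is self-contained).

```python
import io

import pandas as pd

# Small hypothetical frame with a datetime column, standing in for the cleaned data.
sample = pd.DataFrame({
    "date": pd.to_datetime(["2023-07-24", "2023-07-21"]),
    "star_rating": [5, 1],
})

# Write to an in-memory buffer rather than to disk.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)

# Reading back without hints leaves the dates as text.
buffer.seek(0)
plain = pd.read_csv(buffer)

# parse_dates asks pandas to convert the named column while reading.
buffer.seek(0)
restored = pd.read_csv(buffer, parse_dates=["date"])

print(plain["date"].dtype, restored["date"].dtype)
```

When loading the cleaned file in a later notebook, passing `parse_dates=['date']` to `pd.read_csv` should avoid repeating the conversion step.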