# Importing Modules

In [46]:
import numpy as np
import pandas as pd
from scipy import stats as st
import nltk

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

Use below code to install all nltk packages in case they are not already installed

In [2]:
#nltk.download()

# Reading Data

In [3]:
df = pd.read_csv('BAreviews.csv')
df.columns = ['serial_number','user_status','heading','reviews']
df.head()

Unnamed: 0,serial_number,user_status,heading,reviews
0,0,\n\nTom Slowbe (United States) 28th September ...,"""Very disappointed""",✅ Trip Verified | The airplanes and the lounge...
1,1,\n\nE Anderson (United Kingdom) 28th September...,"""the service was shockingly bad""",✅ Trip Verified | One of the worst experiences...
2,2,\n\n1 reviews\n\n\n\nC Horden (United Kingdom)...,"""Never again will I fly BA""",✅ Trip Verified | Cancelled our flight last-m...
3,3,\n\nP Cooper (United States) 23rd September 2023,"""never fly this awful airline again""","✅ Trip Verified | I had a flight from Miami, F..."
4,4,\n\nBruce Friedman (United States) 22nd Septem...,"""I’ll never fly with them again""",✅ Trip Verified | We started our day with BA ...


# Data Cleaning

## Data Cleaning for user_status

### Removing new line(\n) from user_status

In [4]:
for i in range(len(df.user_status)):
    df.user_status[i] = df.user_status[i].replace('\n', '')

###### Logic used for further cleaning

In [5]:
text = df['user_status'][465]
text_split = text.split('(')
text_split

['Clifford Oakley ', 'United Kingdom) 16th October 2021']

In [6]:
# removing name since it is not very useful
text_list = text_split[1].split(')')
text_list

['United Kingdom', ' 16th October 2021']

#### Implementing above logic on all data

In [7]:
new_list = []      # contains country and date of review
left_out = {}      #exceptions to the above logic

for i in range(len(df)):
    text = df['user_status'][i]
    text_split = text.split('(')
    
    try:
        text_list = text_split[1].split(')')
        new_list.append(text_list)
    except:
        left_out[i] = text_split

In [8]:
len(new_list)

3658

In [9]:
left_out

{3018: ['S Stevenson 8th April 2015'],
 3322: ['Guillaume Christian 20th October 2014']}

#### Handling left_out data

In [10]:
date1 = str(left_out[3018])[14:-2]
date2 = str(left_out[3322])[22:-2]

# Adding null values since we don't no country
new_list.insert(3018,[np.nan, date1])
new_list.insert(3322, [np.nan, date2])

In [11]:
len(new_list)

3660

#### Separating Country and Date

In [12]:
country = []
date = []
for i in range(len(new_list)):
    country.append(new_list[i][0])
    date.append(new_list[i][1])

#### Adding it to DataFrame

In [13]:
df['country'] = country
df['date'] = date

In [14]:
for i in range(len(df.heading)):
    df.heading[i] = df.heading[i].replace('"', '').lower()

In [15]:
# converting
df['date'] = pd.to_datetime(df['date'])

In [16]:
df.drop('user_status', axis=1, inplace=True)

In [17]:
df.head()

Unnamed: 0,serial_number,heading,reviews,country,date
0,0,very disappointed,✅ Trip Verified | The airplanes and the lounge...,United States,2023-09-28
1,1,the service was shockingly bad,✅ Trip Verified | One of the worst experiences...,United Kingdom,2023-09-28
2,2,never again will i fly ba,✅ Trip Verified | Cancelled our flight last-m...,United Kingdom,2023-09-27
3,3,never fly this awful airline again,"✅ Trip Verified | I had a flight from Miami, F...",United States,2023-09-23
4,4,i’ll never fly with them again,✅ Trip Verified | We started our day with BA ...,United States,2023-09-22


## Data Cleaning for reviews

##### Logic used

In [18]:
text = str(df['reviews'][6]).split('|')
text

['Not Verified ',
 "  Everything was ok until our connecting flight in London, just before take off, we were on the runway, the pilot came on to announce an engine problem. After engineers tried to fix it while we waited on the plane for over an hour we were finally told that we would have to be evacuated and rebooked for another flight but not to worry because a special crew was waiting for us on the ground to help us and set us up in a hotel except that there was no one to help us. In fact everyone refused to help us. It was Saturday almost 8pm and they just wanted to go home. Anyone with a connecting flight couldn't rebook on the application because the app wouldn't disassociate our first flight with the connecting one in London and the staff saw this but still refused to help us. They gave us a phone number but there was only a message that said just to use the app. Finally I got someone on the phone who rebooked us for the next day. Now it's 10:30pm and I have to find a hotel room

#### Implementing logic on all data

In [19]:
split = []
for i in range(len(df)):
    text = str(df['reviews'][i]).split('|')
    split.append(text)

There's an anomaly in 3279th data 

In [20]:
split[3278]

['We travelled economy from Manchester to Toronto via Heathrow The flight from Manchester to Heathrow and return was very good. The seats had plenty of room and it was easy to fit laptop size briefcases under the seat in front and still have room to stretch your legs. The service from the cabin crew was excellent. The leg from Heathrow to Toronto was not as comfortable but it was acceptable. The seats appeared narrower and my case only just fit under the seat leaving little room to stretch my legs. Whilst the cabin crew forgot requests for drinks they apologised and were very friendly and helpful in other aspects. Unfortunately BA were let down by the attendants on the return leg from Toronto to Heathrow. Whilst the aircraft appeared newer and more comfortable I have never experienced such surly behaviour from the cabin crew. We had seen attendants in the opposite aisle offering passengers water from trays they were carrying but the attendants down our aisle did not do this. The lady i

##### Handilng above anamoly

In [21]:
text = split[3278][0] + split[3278][1]

del split[3278]
split.insert(3278, [np.nan, text])

#### Creating separate list for verified and reviews

In [22]:
verified = []
reviews = []
for i in range(len(split)):
    try:
        reviews.append(split[i][1])
        verified.append(split[i][0])
    except:
        reviews.append(split[i][0])
        verified.append(np.nan)

#### Handling emoji and further cleaning of verified list

In [23]:
new_verified = []
for i in range(len(verified)):
    if verified[i] is not np.nan:
        foo = str(verified[i]).split(" ")
        if foo[-2] != 'Unverified':
            bar = str(foo[-3]) + " " + str(foo[-2])
        else:
            bar = foo[-2]
            
        new_verified.append(bar.lower())
    else:
        new_verified.append(np.nan)

#### Further cleaning of review list

In [24]:
new_reviews = []
for i in range(len(reviews)):
    new_reviews.append(reviews[i].lower())   #converting it into lower case

In [25]:
# Adding them into dataframe
df['verified'] = new_verified
df['reviews'] = new_reviews

In [26]:
df.head()

Unnamed: 0,serial_number,heading,reviews,country,date,verified
0,0,very disappointed,"the airplanes and the lounges are worn out, o...",United States,2023-09-28,trip verified
1,1,the service was shockingly bad,one of the worst experiences on the worst air...,United Kingdom,2023-09-28,trip verified
2,2,never again will i fly ba,cancelled our flight last-minute then moved ...,United Kingdom,2023-09-27,trip verified
3,3,never fly this awful airline again,"i had a flight from miami, florida to dublin,...",United States,2023-09-23,trip verified
4,4,i’ll never fly with them again,we started our day with ba in prague. the fl...,United States,2023-09-22,trip verified


In [27]:
df['verified'].unique()

array(['trip verified', 'not verified', nan, 'verified review',
       'unverified'], dtype=object)

In [28]:
# Replacing not verified as verified
df.replace('not verified','unverified', inplace = True)

## Handling Null values

In [29]:
df.isnull().sum()

serial_number       0
heading             0
reviews             0
country             2
date                0
verified         1524
dtype: int64

In [30]:
# Checking most occurrence in Verified field
st.mode(df['verified'])

ModeResult(mode=array(['trip verified'], dtype=object), count=array([1125]))

But we are not sure whether it is verified or not, so we make it unverified

In [31]:
df['verified'].fillna('unverified', inplace=True)

In [32]:
df.head()

Unnamed: 0,serial_number,heading,reviews,country,date,verified
0,0,very disappointed,"the airplanes and the lounges are worn out, o...",United States,2023-09-28,trip verified
1,1,the service was shockingly bad,one of the worst experiences on the worst air...,United Kingdom,2023-09-28,trip verified
2,2,never again will i fly ba,cancelled our flight last-minute then moved ...,United Kingdom,2023-09-27,trip verified
3,3,never fly this awful airline again,"i had a flight from miami, florida to dublin,...",United States,2023-09-23,trip verified
4,4,i’ll never fly with them again,we started our day with ba in prague. the fl...,United States,2023-09-22,trip verified


In [33]:
# Filling null values in country to not specified
df['country'].fillna('not specified', inplace=True)

In [34]:
df.isnull().sum()

serial_number    0
heading          0
reviews          0
country          0
date             0
verified         0
dtype: int64

In [35]:
df.head()

Unnamed: 0,serial_number,heading,reviews,country,date,verified
0,0,very disappointed,"the airplanes and the lounges are worn out, o...",United States,2023-09-28,trip verified
1,1,the service was shockingly bad,one of the worst experiences on the worst air...,United Kingdom,2023-09-28,trip verified
2,2,never again will i fly ba,cancelled our flight last-minute then moved ...,United Kingdom,2023-09-27,trip verified
3,3,never fly this awful airline again,"i had a flight from miami, florida to dublin,...",United States,2023-09-23,trip verified
4,4,i’ll never fly with them again,we started our day with ba in prague. the fl...,United States,2023-09-22,trip verified


## Removing punctuations and converting to lower case

In [36]:
import string
def remove_punctuations(input_list):
  cleaned_list=[]
  for element in input_list:
    cleaned_element=''.join(char.lower() for char in element if char not in string.punctuation)
    cleaned_list.append(cleaned_element)
  return cleaned_list

In [37]:
text_list = remove_punctuations(df['reviews'])
df['reviews'] = text_list

In [38]:
head_list = remove_punctuations(df['heading'])
df['heading'] = head_list

## Removing stop words

Stop words are words which have grammatical significance but does not add much value. Ex- I, for, the, and, etc.

word_tokenize: It is a generic tokenizer that separates words and punctuations. An apostrophe is not considered as punctuation here.

In [39]:
def remove_stopwords(input_list):
    stopwords = nltk.corpus.stopwords.words('english')
    cleaned_list = []
    for element in input_list:
        tokenize = nltk.tokenize.word_tokenize(element)
        cleaned_element = ' '.join(i for i in tokenize if i not in stopwords)
        cleaned_list.append(cleaned_element)
    return cleaned_list

In [40]:
df['heading'] = remove_stopwords(df['heading'])
df['reviews'] = remove_stopwords(df['reviews'])

In [41]:
df.head()

Unnamed: 0,serial_number,heading,reviews,country,date,verified
0,0,disappointed,airplanes lounges worn old broken dallas heath...,United States,2023-09-28,trip verified
1,1,service shockingly bad,one worst experiences worst airline flight del...,United Kingdom,2023-09-28,trip verified
2,2,never fly ba,cancelled flight lastminute moved us onto flig...,United Kingdom,2023-09-27,trip verified
3,3,never fly awful airline,flight miami florida dublin ireland via london...,United States,2023-09-23,trip verified
4,4,’ never fly,started day ba prague flight actually left tim...,United States,2023-09-22,trip verified


## Stemming and Lemmatization

Stemming: A technique that takes the word to its root form. It just removes suffixes from the words. The stemmed word might not be part of the dictionary, i.e it will not necessarily give meaning. There are two main types of stemmer- Porter Stemmer and Snow Ball Stemmer

Lemmatization: Takes the word to its root form called Lemma. It helps to bring words to their dictionary form. It is applied to nouns by default. It is more accurate as it uses more informed analysis to create groups of words with similar meanings based on the context, so it is complex and takes more time. This is used where we need to retain the contextual information.

###### We are going to use lemmatization

In [42]:
def lemmatization(input_list):
    wn = nltk.WordNetLemmatizer()
    lemmatize_list = []
    for element in input_list:
        w = ''.join(wn.lemmatize(word) for word in element)
        lemmatize_list.append(w)
    return lemmatize_list

In [43]:
df.heading = lemmatization(df.heading)
df.reviews = lemmatization(df.reviews)

In [44]:
df.head()

Unnamed: 0,serial_number,heading,reviews,country,date,verified
0,0,disappointed,airplanes lounges worn old broken dallas heath...,United States,2023-09-28,trip verified
1,1,service shockingly bad,one worst experiences worst airline flight del...,United Kingdom,2023-09-28,trip verified
2,2,never fly ba,cancelled flight lastminute moved us onto flig...,United Kingdom,2023-09-27,trip verified
3,3,never fly awful airline,flight miami florida dublin ireland via london...,United States,2023-09-23,trip verified
4,4,’ never fly,started day ba prague flight actually left tim...,United States,2023-09-22,trip verified


## Saving cleaned data

In [45]:
df.to_csv('cleaned_data.csv')
print('Saved!')

Saved!
