# Data Cleaning

- Now we have to clean the extracted data. We need to check the reviews column for spelling, punctuation, symbols, etc.
- After that, we will apply a technique called lemmatization from the NLTK library.
- What is lemmatization?
 #### 'Lemmatization is a text normalization technique used in Natural Language Processing (NLP) that reduces a word to its base (dictionary) form, called a lemma'
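
To make the definition concrete, here is a toy sketch of the idea. The lookup table `TOY_LEMMAS` and the helper `toy_lemmatize` are hypothetical and exist only for illustration; NLTK's `WordNetLemmatizer`, used later in this notebook, relies on the WordNet dictionary rather than a hand-written table.

```python
# Toy illustration of lemmatization: map inflected words to a base form.
# TOY_LEMMAS and toy_lemmatize are hypothetical names, not part of NLTK.
TOY_LEMMAS = {
    "flights": "flight",
    "airways": "airway",
    "members": "member",
    "seats": "seat",
}

def toy_lemmatize(word):
    """Return the base form if the word is in the table, else the lowercased word."""
    return TOY_LEMMAS.get(word.lower(), word.lower())

words = ["Flights", "members", "polite"]
lemmas = [toy_lemmatize(w) for w in words]
print(lemmas)
```

Words missing from the table pass through unchanged apart from lowercasing, which is roughly how a dictionary-based lemmatizer treats words it does not know.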


###### Importing Libraries

In [3]:
import pandas as pd  # data manipulation
import matplotlib.pyplot as plt  # data visualization
import seaborn as sns  # advanced data visualization
import os  # file paths
import re  # regular expressions

##### Importing files

In [5]:
BA_clean= pd.read_csv("BA_reviews.csv",index_col=0)

In [8]:
BA_clean.head(10)  # first 10 rows

Unnamed: 0,reviews,stars,date,country
0,✅ Trip Verified | This flight was one of the ...,\n\t\t\t\t\t\t\t\t\t\t\t\t\t5,21st January 2023,(United Kingdom)
1,Not Verified | It seems that there is a race t...,2,19th January 2023,(United States)
2,Not Verified | As a Spanish born individual l...,3,19th January 2023,(United Kingdom)
3,✅ Trip Verified | A rather empty and quiet fl...,2,18th January 2023,(United Kingdom)
4,✅ Trip Verified | Easy check in and staff mem...,9,17th January 2023,(United Kingdom)
5,✅ Trip Verified | Being a silver flyer and bo...,9,17th January 2023,(United Kingdom)
6,Not Verified | I find BA incredibly tacky and...,1,16th January 2023,(United Kingdom)
7,✅ Trip Verified | Flew ATL to LHR 8th Jan 202...,3,9th January 2023,(United Kingdom)
8,Not Verified | Great thing about British Airw...,4,8th January 2023,(United Kingdom)
9,Not Verified | The staff are friendly. The pla...,5,6th January 2023,(Canada)


In [10]:
BA_clean.tail(10)  # last 10 rows

Unnamed: 0,reviews,stars,date,country
6912,LHR-HKG on Boeing 747 - 23/08/12. Much has bee...,10,29th August 2012,(United Kingdom)
6913,Just got back from Bridgetown Barbados flying ...,7,29th August 2012,(United Kingdom)
6914,LHR-JFK-LAX-LHR. Check in was ok apart from be...,5,29th August 2012,(United Kingdom)
6915,HKG-LHR in New Club World on Boeing 777-300 - ...,2,29th August 2012,(United Kingdom)
6916,YYZ to LHR - July 2012 - I flew overnight in p...,9,29th August 2012,(Canada)
6917,Flew return in CW from LHR to BKK in August 20...,8,29th August 2012,(Ireland)
6918,LHR to HAM. Purser addresses all club passenge...,4,28th August 2012,(United Kingdom)
6919,My son who had worked for British Airways urge...,9,12th October 2011,(United Kingdom)
6920,London City-New York JFK via Shannon on A318 b...,1,11th October 2011,(United States)
6921,SIN-LHR BA12 B747-436 First Class. Old aircraf...,8,9th October 2011,(United Kingdom)


##### Let's check for verified users who have actually travelled with British Airways and flag them in a separate column. That way we can find real insights from travellers who have flown with us.

In [11]:
BA_clean['verified']= BA_clean.reviews.str.contains('Trip Verified')

In [13]:
BA_clean['verified']

0        True
1       False
2       False
3        True
4        True
        ...  
6917    False
6918    False
6919    False
6920    False
6921    False
Name: verified, Length: 6922, dtype: bool
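
The verified flag can also be used to pull the verified reviews into their own dataframe. The sketch below uses a small made-up frame rather than the real file, and `sample` and `verified_reviews` are illustrative names, not ones from this notebook.

```python
import pandas as pd

# Small stand-in dataframe; the review texts here are made up for illustration.
sample = pd.DataFrame({
    "reviews": [
        "✅ Trip Verified | Good flight.",
        "Not Verified | Late departure.",
        "LHR to HAM. Fine service.",
    ]
})

# Same flag as in the notebook: True where the text contains 'Trip Verified'.
sample["verified"] = sample.reviews.str.contains("Trip Verified")

# Boolean indexing keeps only the flagged rows; copy() gives an independent frame.
verified_reviews = sample[sample["verified"]].copy()
print(len(verified_reviews))
```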

#### Now let's clean the data in this dataframe. We will use lemmatization from the NLTK library.

In [25]:
!pip install nltk
import nltk
nltk.download('wordnet')



[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Rakesh\AppData\Roaming\nltk_data...


True

In [26]:
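from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords  # requires the NLTK 'stopwords' corpus to be downloaded
lemma = WordNetLemmatizer()

# note: str.strip treats its argument as a set of characters to trim from both ends,
# not as a literal prefix, so rows starting with 'Not Verified' keep that prefix
reviews_data = BA_clean.reviews.str.strip('✅ Trip Verified |')

# build the stop word set once instead of once per word
stop_words = set(stopwords.words('english'))

# empty list for collecting the cleaned reviews
corpus = []

# for each review: replace non-letters with spaces, lowercase, split into words,
# drop stop words, lemmatize the remaining words, rejoin and append to corpus
for rev in reviews_data:
    rev = re.sub('[^a-zA-Z]', ' ', rev)
    rev = rev.lower()
    rev = rev.split()
    rev = [lemma.lemmatize(word) for word in rev if word not in stop_words]
    rev = " ".join(rev)
    corpus.append(rev)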
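
One detail of the cell above is worth a closer look: `str.strip` removes any of the listed characters from both ends of the string, so it can also eat letters at the end of a review. A possible alternative, sketched here with the standard `re` module, is to remove the status prefix with an anchored pattern. The pattern and the helper `remove_status_prefix` are assumptions based on the two prefixes visible in the sample rows, not code from this notebook.

```python
import re

# Pattern for the two status prefixes seen in the reviews column (an assumption
# based on the sample rows; other prefixes would need to be added).
PREFIX = re.compile(r"^\s*(?:✅\s*Trip Verified|Not Verified)\s*\|\s*")

def remove_status_prefix(text):
    """Drop a leading 'Trip Verified |' or 'Not Verified |' marker, if present."""
    return PREFIX.sub("", text)

raw = "✅ Trip Verified | Delayed"

# str.strip trims any of the given characters from both ends, so the trailing
# 'e' and 'd' of 'Delayed' are removed as well.
stripped = raw.strip("✅ Trip Verified |")

cleaned = remove_status_prefix(raw)
unverified = remove_status_prefix("Not Verified | Great crew")
no_prefix = remove_status_prefix("LHR to HAM. Purser was polite")
print(stripped, "|", cleaned, "|", unverified)
```

On a pandas column, the same compiled pattern could likely be applied with `BA_clean.reviews.str.replace(PREFIX, '', regex=True)`; note that doing so would change the corpus shown in the outputs below, since the word "verified" would no longer appear for unverified reviews.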


In [27]:
# Adding Corpus to our dataframe 
BA_clean['corpus'] = corpus

In [28]:
BA_clean.head()

Unnamed: 0,reviews,stars,date,country,verified,corpus
0,✅ Trip Verified | This flight was one of the ...,\n\t\t\t\t\t\t\t\t\t\t\t\t\t5,21st January 2023,(United Kingdom),True,flight one worst ever life wanted pamper bough...
1,Not Verified | It seems that there is a race t...,2,19th January 2023,(United States),False,verified seems race bottom amongst airline can...
2,Not Verified | As a Spanish born individual l...,3,19th January 2023,(United Kingdom),False,verified spanish born individual living englan...
3,✅ Trip Verified | A rather empty and quiet fl...,2,18th January 2023,(United Kingdom),True,rather empty quiet flight tel aviv friendly ca...
4,✅ Trip Verified | Easy check in and staff mem...,9,17th January 2023,(United Kingdom),True,easy check staff member polite helpful made sp...


##### Now that it has been added, let's format the date and the star ratings

In [29]:
BA_clean.dtypes

reviews     object
stars       object
date        object
country     object
verified      bool
corpus      object
dtype: object

In [37]:
#converting it to datetime format
BA_clean.date = pd.to_datetime(BA_clean.date)

In [38]:
BA_clean.date.head()

0   2023-01-21
1   2023-01-19
2   2023-01-19
3   2023-01-18
4   2023-01-17
Name: date, dtype: datetime64[ns]
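
`pd.to_datetime` infers the format of strings such as "21st January 2023" on its own. If an explicit format is preferred, one option is to remove the ordinal suffix first and then parse with a fixed pattern. The helper `parse_review_date` below is a hypothetical sketch using only the standard library, and it assumes English month names.

```python
import re
from datetime import datetime

def parse_review_date(text):
    """Parse strings such as '21st January 2023' into a datetime."""
    # Remove the ordinal suffix that follows the day number (st, nd, rd, th).
    without_suffix = re.sub(r"(\d+)(?:st|nd|rd|th)", r"\1", text.strip())
    # %B expects the full month name in the current locale (English assumed here).
    return datetime.strptime(without_suffix, "%d %B %Y")

parsed = parse_review_date("21st January 2023")
print(parsed.date())
```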

##### Now it is time to clean the stars (review rating) column

In [41]:
BA_clean.stars.unique()

array(['\n\t\t\t\t\t\t\t\t\t\t\t\t\t5', '2', '3', '9', '1', '4', '5', '8',
       '6', '7', '10', 'None'], dtype=object)

##### The values above contain unwanted newline and tab characters (\n and \t). Let's strip those.

In [42]:
BA_clean.stars= BA_clean.stars.str.strip('\n\t\t\t\t\t\t\t\t\t\t\t\t\t')

In [44]:
BA_clean.stars.value_counts()

1       1491
2        777
3        766
8        705
10       620
7        602
9        595
5        519
4        472
6        365
None      10
Name: stars, dtype: int64

- There are 10 rows with a rating of None, which should be removed.

In [46]:
BA_clean.drop(BA_clean[BA_clean.stars == 'None'].index, axis=0, inplace=True)

In [47]:
BA_clean.stars.unique()

array(['5', '2', '3', '9', '1', '4', '8', '6', '7', '10'], dtype=object)
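
The ratings are still stored as strings at this point (dtype object). If numeric ratings are needed for later analysis, a conversion along these lines may help. This is a sketch on stand-in values, not on the real column, and `stars`, `numeric` and `ratings` are illustrative names.

```python
import pandas as pd

# Stand-in values mirroring the kinds of entries shown above.
stars = pd.Series(["\n\t\t5", "2", "None", "10"])

# Trim whitespace, then convert; errors='coerce' turns 'None' into NaN.
numeric = pd.to_numeric(stars.str.strip(), errors="coerce")

# Drop the unparseable rows and store the ratings as integers.
ratings = numeric.dropna().astype(int)
print(ratings.tolist())
```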

#### Let's check for null values

In [49]:
BA_clean.isnull().value_counts()

reviews  stars  date   country  verified  corpus
False    False  False  False    False     False     6912
dtype: int64

In [50]:
BA_clean.country.isnull().value_counts()

False    6912
Name: country, dtype: int64
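
A per-column count of missing values is often easier to read than `value_counts()` on the boolean frame. The sketch below uses a made-up frame with one missing country to show what such a check reports; the names are illustrative.

```python
import pandas as pd

# Made-up frame with one missing country to show what the check reports.
sample = pd.DataFrame({
    "stars": ["5", "2", "9"],
    "country": ["(United Kingdom)", None, "(Canada)"],
})

# isnull() marks missing cells; sum() counts the True values per column.
missing_per_column = sample.isnull().sum()

# dropna with subset removes only rows missing a value in the listed columns.
complete = sample.dropna(subset=["country"])
print(missing_per_column)
```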

- There are no missing values in the country column, as the counts above show. The drop below is kept as a safeguard and removes nothing here.

In [52]:
BA_clean.drop(BA_clean[BA_clean.country.isnull()].index, axis=0, inplace=True)

In [54]:
BA_clean.shape

(6912, 6)

##### Reset the index

In [55]:
# reset_index returns a new dataframe unless inplace=True is passed
BA_clean.reset_index(drop=True, inplace=True)
BA_clean

Unnamed: 0,reviews,stars,date,country,verified,corpus
0,✅ Trip Verified | This flight was one of the ...,5,2023-01-21,(United Kingdom),True,flight one worst ever life wanted pamper bough...
1,Not Verified | It seems that there is a race t...,2,2023-01-19,(United States),False,verified seems race bottom amongst airline can...
2,Not Verified | As a Spanish born individual l...,3,2023-01-19,(United Kingdom),False,verified spanish born individual living englan...
3,✅ Trip Verified | A rather empty and quiet fl...,2,2023-01-18,(United Kingdom),True,rather empty quiet flight tel aviv friendly ca...
4,✅ Trip Verified | Easy check in and staff mem...,9,2023-01-17,(United Kingdom),True,easy check staff member polite helpful made sp...
...,...,...,...,...,...,...
6907,Flew return in CW from LHR to BKK in August 20...,8,2012-08-29,(Ireland),False,flew return cw lhr bkk august positive flight ...
6908,LHR to HAM. Purser addresses all club passenge...,4,2012-08-28,(United Kingdom),False,lhr ham purser address club passenger name boa...
6909,My son who had worked for British Airways urge...,9,2011-10-12,(United Kingdom),False,son worked british airway urged fly british ai...
6910,London City-New York JFK via Shannon on A318 b...,1,2011-10-11,(United States),False,london city new york jfk via shannon really ni...


#### Data is clean and ready for further analysis and visualization 

In [58]:
# exporting as CSV
cwd = os.getcwd()
BA_clean.to_csv(os.path.join(cwd, "cleaned-BA-reviews.csv"))
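
By default `to_csv` also writes the row index as an unnamed first column, which is why a file saved this way is typically read back with `index_col=0`. If the index is not needed, passing `index=False` leaves it out. The round trip below is a sketch on a made-up frame and a temporary directory; the file name is hypothetical.

```python
import os
import tempfile
import pandas as pd

sample = pd.DataFrame({"stars": [5, 2], "country": ["(Canada)", "(Ireland)"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample-reviews.csv")  # hypothetical file name
    # index=False leaves the row index out of the file.
    sample.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

print(reloaded.columns.tolist())
```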