# Task 1

---

## Web scraping and analysis

This Jupyter notebook includes some code to get you started with web scraping. We will use a package called `BeautifulSoup` to collect the data from the web. Once you've collected your data and saved it into a local `.csv` file you should start with your analysis.

### Scraping data from Skytrax

If you visit https://www.airlinequality.com you can see that there is a lot of data there. For this task, we are only interested in reviews related to British Airways and the airline itself.

If you navigate to https://www.airlinequality.com/airline-reviews/british-airways you will see this data. We can now use Python and `BeautifulSoup` to step through the paginated review listing and collect the text of each review.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd


In [2]:
base_url = "https://www.airlinequality.com/airline-reviews/british-airways" 
pages = 10
page_size = 100

reviews = []

for i in range(1, pages + 1):
    print(f"Scraping page {i}")

    # Create URL to collect links from paginated data
    url = f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"

    # Collect HTML data from this page; fail fast on timeouts or HTTP errors
    response = requests.get(url, timeout=30)
    response.raise_for_status()

    # Parse content
    content = response.content
    parsed_content = BeautifulSoup(content, 'html.parser')
    for para in parsed_content.find_all("div", {"class": "text_content"}):
        reviews.append(para.get_text())
    
    print(f"   ---> {len(reviews)} total reviews")

Scraping page 1
   ---> 100 total reviews
Scraping page 2
   ---> 200 total reviews
Scraping page 3
   ---> 300 total reviews
Scraping page 4
   ---> 400 total reviews
Scraping page 5
   ---> 500 total reviews
Scraping page 6
   ---> 600 total reviews
Scraping page 7
   ---> 700 total reviews
Scraping page 8
   ---> 800 total reviews
Scraping page 9
   ---> 900 total reviews
Scraping page 10
   ---> 1000 total reviews
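The page URL in the loop above hard-codes a percent-encoded query string (`%3A` is the encoding of `:`). As a minimal sketch, the same URL can be built with the standard library's `urllib.parse.urlencode`, which handles the escaping. The helper name `build_page_url` is an assumption for illustration and is not part of the original notebook.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.airlinequality.com/airline-reviews/british-airways"

def build_page_url(page, page_size=100):
    """Return the listing URL for one page of reviews, newest first."""
    # urlencode percent-encodes ":" as "%3A", matching the hand-written URL
    query = urlencode({"sortby": "post_date:Desc", "pagesize": page_size})
    return f"{BASE_URL}/page/{page}/?{query}"

first_page_url = build_page_url(1)
print(first_page_url)
```

Building the query from a dictionary makes it easier to change the sort order or page size later without editing the encoded string by hand.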


In [3]:
df = pd.DataFrame()
df["reviews"] = reviews
df.head()

Unnamed: 0,reviews
0,Not Verified | Worst experience ever. Outbound...
1,✅ Trip Verified | Check in was a shambles at ...
2,✅ Trip Verified | Beyond disgusted with the fa...
3,✅ Trip Verified | On July 19th 2022 I had subm...
4,✅ Trip Verified | I booked the flight on Oct ...


In [4]:
df.to_csv("dataBA_reviews.csv")

Congratulations! You now have your dataset for this task. The loop above collected 1000 reviews by iterating through 10 paginated pages of 100 reviews each. If you want to collect more data, try increasing `pages`.
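One detail of the save step is worth knowing: by default `DataFrame.to_csv` writes the row index as an unnamed first column, and `pd.read_csv` then loads it back as a column called `Unnamed: 0`. This is likely where the `Unnamed: 0` columns in this notebook's outputs come from. The sketch below shows the round trip with and without `index=False`, using an in-memory buffer and two made-up rows instead of the scraped data. The helper `round_trip` is hypothetical.

```python
import io
import pandas as pd

sample = pd.DataFrame({"reviews": ["first review", "second review"]})  # made-up rows

def round_trip(frame, **to_csv_kwargs):
    """Write a frame to CSV in memory and read it straight back."""
    buffer = io.StringIO()
    frame.to_csv(buffer, **to_csv_kwargs)
    buffer.seek(0)
    return pd.read_csv(buffer)

with_index = round_trip(sample)                  # default: the index is written
without_index = round_trip(sample, index=False)  # the index is left out

print(list(with_index.columns))
print(list(without_index.columns))
```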

The next step is to clean this data by removing unnecessary text from each row. For example, the "✅ Trip Verified" and "Not Verified" markers can be removed wherever they appear, as they are not relevant to what we want to investigate.

In [5]:
df = pd.read_csv("dataBA_reviews.csv")
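# regex=False makes "|" a literal character rather than the regex "or" operator
df["reviews"] = df["reviews"].str.replace("✅ Trip Verified |", "", regex=False)
df["reviews"] = df["reviews"].str.replace("Not Verified |", "", regex=False)
df["reviews"] = df["reviews"].str.replace("|", "", regex=False)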




In [6]:
df["reviews"].iloc[2]

' Beyond disgusted with the fact that my baggage has yet to be delivered to me after 5 weeks of emails and calls to BA. Two pieces reported 29th September. BA responses are generic non specific and all attempts to speak to a customer service worker are obstructed. All this from an airline touting its values and claiming yo be one of the best in the world. Disgraceful does not fully describe their customer service.'
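The chained replacements can also be expressed as a single anchored pattern that removes the verification marker only at the start of a review and trims the leftover whitespace (note the leading space in the output above). This is a sketch using Python's `re` module, not the notebook's method. The name `strip_verification` and the sample strings are illustrative.

```python
import re

# Optional "✅", then "Trip Verified" or "Not Verified", then the "|" separator
PREFIX = re.compile(r"^\s*(?:✅\s*)?(?:Trip Verified|Not Verified)\s*\|\s*")

def strip_verification(text):
    """Remove a leading verification marker and surrounding whitespace."""
    return PREFIX.sub("", text).strip()

samples = [
    "✅ Trip Verified |  Check in was a shambles.",
    "Not Verified | Worst experience ever.",
    "No marker here.",
]
cleaned = [strip_verification(s) for s in samples]
print(cleaned)
```

With the DataFrame from this notebook, the function could be applied as `df["reviews"].apply(strip_verification)`. Unlike the chained replacements, it leaves any `|` inside the body of a review untouched.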

In [7]:
import gensim
from gensim.models.ldamulticore import LdaMulticore
from gensim import corpora, models
import pyLDAvis.gensim_models

from nltk.corpus import stopwords
import string
from nltk.stem.wordnet import WordNetLemmatizer


import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
from itertools import chain


### Clean the data

In [8]:
stop = set(stopwords.words('english'))  # NLTK stopword file IDs are lowercase
exclude = set(string.punctuation)
lemma = WordNetLemmatizer()

def clean(text):
    """Lowercase, drop stopwords and punctuation, lemmatize; return a token list."""
    stop_free = ' '.join(word for word in text.lower().split() if word not in stop)
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    normalized = ' '.join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized.split()

In [9]:
df["clean_review"] = df["reviews"].apply(clean)
df['clean_review'].apply(lambda clean_review: " ".join(set(clean_review)))

0      hand had horrible cancelled flight milan food ...
1      counter engaging row poor seat service plane n...
2      non customer 29th touting beyond best fully ba...
3      receipt sent later them requested 2022 lost we...
4      different on in sent authority requested assum...
                             ...                        
995    bottle steward requested board difficult bottl...
996    non took seat thing passenger co choice 2 answ...
997    seat dreadful disappoint airway too caused tim...
998    took complete 20 champagne much concept airway...
999    upper airbus seat would got airway next side s...
Name: clean_review, Length: 1000, dtype: object
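To see what the first two stages of `clean` do without downloading any NLTK data, here is a standard-library sketch that uses a tiny, made-up stopword set and skips lemmatization. The name `clean_stages` is hypothetical, and the results will differ from the notebook's because the real stopword list is much larger.

```python
import string

STOPWORDS = {"the", "was", "a", "at", "in", "and"}  # tiny illustrative list, not NLTK's
PUNCTUATION = set(string.punctuation)

def clean_stages(text):
    """Return the intermediate results of stopword and punctuation removal."""
    # Stage 1: lowercase, split on whitespace, drop stopwords
    stop_free = " ".join(w for w in text.lower().split() if w not in STOPWORDS)
    # Stage 2: drop punctuation characters one by one
    punc_free = "".join(ch for ch in stop_free if ch not in PUNCTUATION)
    # Stage 3 (lemmatization in the notebook) is skipped here; just tokenize
    return stop_free, punc_free, punc_free.split()

stop_free, punc_free, tokens = clean_stages(
    "Check in was a shambles at BWI, and the crew was rude."
)
print(tokens)

# A stopword with punctuation attached ("and,") is not filtered in stage 1
_, _, leftover = clean_stages("Late again, and, worse, rude.")
print(leftover)
```

The second call illustrates a consequence of the stage order: because stopwords are filtered before punctuation is stripped, a stopword followed directly by a comma or full stop survives as a token.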

In [10]:
df

Unnamed: 0.1,Unnamed: 0,reviews,clean_review
0,0,Worst experience ever. Outbound flight was ca...,"[worst, experience, ever, outbound, flight, ca..."
1,1,"Check in was a shambles at BWI, just 3 count...","[check, shamble, bwi, 3, counter, open, full, ..."
2,2,Beyond disgusted with the fact that my baggag...,"[beyond, disgusted, fact, baggage, yet, delive..."
3,3,On July 19th 2022 I had submitted a complaint...,"[july, 19th, 2022, submitted, complaint, form,..."
4,4,"I booked the flight on Oct 8, but have to ca...","[booked, flight, oct, 8, cancel, flight, day, ..."
...,...,...,...
995,995,Singapore to Heathrow. I skipped my meal on b...,"[singapore, heathrow, skipped, meal, board, re..."
996,996,Chicago to Chennai via London. The pilot and...,"[chicago, chennai, via, london, pilot, captain..."
997,997,Flew London Heathrow to Toronto. I am a frequ...,"[flew, london, heathrow, toronto, frequent, tr..."
998,998,I flew British Airways from Heathrow to Hong ...,"[flew, british, airway, heathrow, hong, kong, ..."


In [11]:
dictionary = corpora.Dictionary(df["clean_review"])
print(dictionary.num_nnz)

64181


### Create the document-term matrix

In [12]:
doc_term_matrix = [dictionary.doc2bow(doc) for doc in df['clean_review']]
print(len(doc_term_matrix))

1000
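Conceptually, each row of the document-term matrix is a bag of words: a list of `(token_id, count)` pairs for the distinct tokens in one review. The pure-Python sketch below mimics the shape of what `corpora.Dictionary` and `doc2bow` produce. It is not gensim's implementation, the token lists are made up, and the ids are assigned in first-seen order, which may differ from the ids gensim assigns.

```python
from collections import Counter

docs = [
    ["flight", "late", "flight", "cancelled"],
    ["crew", "friendly", "flight"],
]  # made-up token lists standing in for df["clean_review"]

# Assign each distinct token an integer id in first-seen order
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())

bow_corpus = [to_bow(doc) for doc in docs]
nnz = sum(len(bow) for bow in bow_corpus)  # non-zero entries in the matrix

print(bow_corpus)
print(nnz)
```

The `nnz` total corresponds to what `dictionary.num_nnz` reports: the number of distinct tokens per document, summed over the whole corpus.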


In [13]:
lda = gensim.models.ldamodel.LdaModel

In [57]:
num_topics=4
%time ldamodel = lda(doc_term_matrix,num_topics=num_topics,id2word=dictionary,passes=50,minimum_probability=0)

Wall time: 3min 39s


In [58]:
ldamodel.print_topics(num_topics=num_topics)

[(0,
  '0.022*"flight" + 0.020*"seat" + 0.011*"ba" + 0.011*"food" + 0.010*"service" + 0.009*"class" + 0.009*"crew" + 0.009*"cabin" + 0.008*"good" + 0.008*"london"'),
 (1,
  '0.003*"say" + 0.003*"service" + 0.003*"back" + 0.003*"ba" + 0.003*"airway" + 0.003*"every" + 0.003*"british" + 0.002*"march" + 0.002*"onboard" + 0.002*"time"'),
 (2,
  '0.020*"flight" + 0.013*"ba" + 0.010*"london" + 0.010*"crew" + 0.009*"staff" + 0.008*"time" + 0.008*"heathrow" + 0.007*"service" + 0.006*"british" + 0.006*"airway"'),
 (3,
  '0.033*"flight" + 0.018*"ba" + 0.010*"hour" + 0.010*"customer" + 0.009*"london" + 0.008*"day" + 0.008*"u" + 0.008*"service" + 0.008*"told" + 0.007*"airline"')]

In [59]:
lda_display = pyLDAvis.gensim_models.prepare(ldamodel, doc_term_matrix, dictionary, sort_topics=False, mds="mmds")
pyLDAvis.display(lda_display)


### Find the reviews that belong to each topic

In [60]:
lda_corpus = ldamodel[doc_term_matrix]
[doc for doc in lda_corpus]

[[(0, 0.2971885), (1, 0.008168744), (2, 0.008667092), (3, 0.6859756)],
 [(0, 0.9908685), (1, 0.0029604181), (2, 0.0031071773), (3, 0.0030638508)],
 [(0, 0.0065545407), (1, 0.006211369), (2, 0.0065396135), (3, 0.9806945)],
 [(0, 0.0049912436), (1, 0.30031243), (2, 0.005049187), (3, 0.68964714)],
 [(0, 0.0027405147), (1, 0.0026986354), (2, 0.002758918), (3, 0.9918019)],
 [(0, 0.75435895), (1, 0.0049807695), (2, 0.005203119), (3, 0.23545714)],
 [(0, 0.25630763), (1, 0.1376935), (2, 0.0027438037), (3, 0.6032551)],
 [(0, 0.78835523), (1, 0.2030725), (2, 0.004328638), (3, 0.0042436114)],
 [(0, 0.46722642), (1, 0.0071055507), (2, 0.51845926), (3, 0.007208765)],
 [(0, 0.59420747), (1, 0.0015961988), (2, 0.4025226), (3, 0.0016737024)],
 [(0, 0.9804415), (1, 0.006369965), (2, 0.0066244625), (3, 0.0065640593)],
 [(0, 0.98958516), (1, 0.003362394), (2, 0.0035167811), (3, 0.003535687)],
 [(0, 0.99162656), (1, 0.0026646822), (2, 0.0028275226), (3, 0.00288123)],
 [(0, 0.2491061), (1, 0.0012330948), (

In [61]:
# Flatten every (topic_id, score) pair in the corpus into one list of scores
scores = [score for doc in lda_corpus for _, score in doc]

threshold = sum(scores)/len(scores)
print(threshold)

0.2499999996777624
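The printed threshold is almost exactly 0.25, and that is not a coincidence. Because the model was trained with `minimum_probability=0`, every review gets a score for all four topics, and each review's scores sum to roughly 1. The mean of all scores is therefore about 1 / `num_topics`, whatever the data. A small check with made-up distributions:

```python
# Made-up topic distributions for three documents over four topics
doc_topics = [
    [(0, 0.70), (1, 0.10), (2, 0.10), (3, 0.10)],
    [(0, 0.05), (1, 0.05), (2, 0.60), (3, 0.30)],
    [(0, 0.25), (1, 0.25), (2, 0.25), (3, 0.25)],
]

scores = [score for doc in doc_topics for _, score in doc]
threshold = sum(scores) / len(scores)
num_topics = len(doc_topics[0])

print(threshold, 1 / num_topics)
```

In other words, the threshold used here amounts to asking whether a topic scores above what a uniform spread over the topics would give it.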


In [66]:
cluster1 = [j for i,j in zip(lda_corpus,df.index) if i[0][1] > threshold]
cluster2 = [j for i,j in zip(lda_corpus,df.index) if i[1][1] > threshold]
cluster3 = [j for i,j in zip(lda_corpus,df.index) if i[2][1] > threshold]
cluster4 = [j for i,j in zip(lda_corpus,df.index) if i[3][1] > threshold]
# cluster5 = [j for i,j in zip(lda_corpus,df.index) if i[4][1] > threshold]

In [67]:
print(len(cluster1))
print(len(cluster2))
print(len(cluster3))
print(len(cluster4))
# print(len(cluster5))

645
33
271
370
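The four cluster sizes add up to 1319, more than the 1000 reviews, because a review is placed in every cluster whose topic score exceeds the threshold. If a non-overlapping split is wanted, one option is to assign each review to its single highest-scoring topic. This is an alternative to the notebook's approach, sketched with made-up distributions. The name `dominant_topic` is hypothetical.

```python
from collections import defaultdict

doc_topics = [
    [(0, 0.30), (1, 0.01), (2, 0.01), (3, 0.68)],
    [(0, 0.99), (1, 0.003), (2, 0.003), (3, 0.004)],
    [(0, 0.47), (1, 0.01), (2, 0.51), (3, 0.01)],
]  # made-up values in the same shape as lda_corpus

def dominant_topic(doc):
    """Return the topic id with the highest score for one document."""
    return max(doc, key=lambda pair: pair[1])[0]

# Non-overlapping: each document goes to exactly one topic
clusters = defaultdict(list)
for doc_index, doc in enumerate(doc_topics):
    clusters[dominant_topic(doc)].append(doc_index)

# Threshold-based, as in the notebook: count every membership above 0.25
memberships = sum(1 for doc in doc_topics for _, score in doc if score > 0.25)

print(dict(clusters))
print(memberships)
```

With the threshold rule, the three sample documents produce five memberships. With the dominant-topic rule, each document appears exactly once.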


In [74]:
df.iloc[cluster3]["reviews"].to_list()

['  I am happy to say that this flight was quite good. Except for the second rate Euro business seating, everything was very well done. A friendly, informative cabin crew served a good hot breakfast while being very communicative to folks nervous about Heathrow transit. Well done. Unfortunately, a bus gate was used upon arrival.',
 '  Just a few years ago flying on BA was enjoyable, but times have changed. These days about five hours on board the plane is no fun at all. The Terminal 5 experience still feels classy, and on the way out it felt well staffed and efficient. On board though, it’s just become an experience to be endured. BA’s mean decision to split seats into nicer for the first half of the plane, and nastier for the rest underlines this. There is no entertainment, no magazine. Refreshment purchases are brought to the seat after ordering from an app. There is no attempt to provide fresh or hot food on the day, despite the competition being able to do this. Is it too much to a

In [70]:
# df.iloc[cluster2]['reviews']


In [72]:
df.iloc[cluster4]['reviews'].to_list()

[' Worst experience ever. Outbound flight was cancelled and I was not notified. I was rebooked on a very uncomfortable trip. Inbound flight delayed 1 hour, also not notified. On top of it, they boarded my hand luggage, which was the only bag I had. Extra wait in Milan then. Food is horrible.',
 ' Beyond disgusted with the fact that my baggage has yet to be delivered to me after 5 weeks of emails and calls to BA. Two pieces reported 29th September. BA responses are generic non specific and all attempts to speak to a customer service worker are obstructed. All this from an airline touting its values and claiming yo be one of the best in the world. Disgraceful does not fully describe their customer service.',
 " On July 19th 2022 I had submitted a complaint form with regards to the fact that BA had misplaced our luggage during our wedding trip to Italy and we've lost 2 days and incurred additional expenses in retrieving them, for which I had provided all copy of receipts for. I requested 