# Task 1

---

## Web scraping and analysis

This Jupyter notebook includes some code to get you started with web scraping. We will use a package called `BeautifulSoup` to collect the data from the web. Once you've collected your data and saved it to a local `.csv` file, you can start your analysis.

### Scraping data from Skytrax

If you visit [https://www.airlinequality.com], you can see that there is a lot of data there. For this task, we are only interested in reviews related to British Airways and the airline itself.

If you navigate to [https://www.airlinequality.com/airline-reviews/british-airways], you will see this data. We can use `Python` and `BeautifulSoup` to request each page of the paginated review listing and collect the review text from it.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [3]:
base_url = "https://www.airlinequality.com/airline-reviews/british-airways"
pages = 10
page_size = 100

reviews = []

for i in range(1, pages + 1):

    print(f"Scraping page {i}")

    # Create URL to collect links from paginated data
    url = f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"

    # Collect HTML data from this page; fail early on a stalled connection or an HTTP error
    response = requests.get(url, timeout=30)
    response.raise_for_status()

    # Parse content
    content = response.content
    parsed_content = BeautifulSoup(content, 'html.parser')
    for para in parsed_content.find_all("div", {"class": "text_content"}):
        reviews.append(para.get_text())
    
    print(f"   ---> {len(reviews)} total reviews")

Scraping page 1
   ---> 100 total reviews
Scraping page 2
   ---> 200 total reviews
Scraping page 3
   ---> 300 total reviews
Scraping page 4
   ---> 400 total reviews
Scraping page 5
   ---> 500 total reviews
Scraping page 6
   ---> 600 total reviews
Scraping page 7
   ---> 700 total reviews
Scraping page 8
   ---> 800 total reviews
Scraping page 9
   ---> 900 total reviews
Scraping page 10
   ---> 1000 total reviews
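
The page URL in the loop above is assembled by hand, with the `:` in `post_date:Desc` already percent-encoded as `%3A`. As a minimal sketch, the same URL could be built with the standard library's `urllib.parse.urlencode`, which takes care of the encoding. The helper name `build_page_url` is hypothetical and not part of the original notebook.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.airlinequality.com/airline-reviews/british-airways"


def build_page_url(page: int, page_size: int = 100, base_url: str = BASE_URL) -> str:
    """Return the listing URL for one page of reviews, newest first."""
    # urlencode percent-encodes reserved characters, so ":" becomes "%3A"
    query = urlencode({"sortby": "post_date:Desc", "pagesize": page_size})
    return f"{base_url}/page/{page}/?{query}"


if __name__ == "__main__":
    for page in range(1, 4):
        print(build_page_url(page))
```

Inside the scraping loop, `url = build_page_url(i, page_size)` would then replace the hand-written f-string.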


In [4]:
df = pd.DataFrame()
df["reviews"] = reviews
df.head()

                                             reviews
0  ✅ Trip Verified | Words fail to describe this...
1  ✅ Trip Verified | Absolutely terrible experie...
2  ✅ Trip Verified | BA overbook every flight to ...
3  ✅ Trip Verified | \r\nThe flights were all on...
4  Not Verified | Only the second time flying BA ...


## Data extracted to `BA_reviews.csv`

Around 1020 entries.

Moving forward, here is what we will do:

STEP 1: clean data
- convert into table
- search for certain keywords
- analyze review
- restart STEP 1

STEP 2: analyze data
- analyze sentiment
- analyze emotion
- analyze topic

STEP 3: visualize data and make recommendations


In [17]:
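# Read the CSV with header=None; the file's own header row ("reviews") is then
# read as the first data row, and the file's first column becomes the index
df = pd.read_csv("BA_reviews.csv", header=None, names=['Review'])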

In [19]:
# Remove the "Trip Verified" prefix from the Review column
# (raw string plus regex=True so that \| matches a literal pipe character)
df['Review'] = df['Review'].str.replace(r'✅ Trip Verified \|', '', regex=True)

# Display the resulting tabular data
print(df)

                                                  Review
NaN                                              reviews
0.0    |Wordsfailtodescribethislastawfulflight-babyac...
1.0    |Absolutelyterribleexperience.Theappwouldnotle...
2.0    |BAoverbookeveryflighttomaximisetheirincomewit...
3.0    |\r\nTheflightswereallontime,exceptBelfastfrom...
...                                                  ...
995.0  |WonderfulserviceontheflightfromEdinburgh-Flor...
996.0  |LondonHeathrowtoManchester.ItwasoutrageousofB...
997.0  |LondontoMadrid.Lazyseatallocationhasledtomyhu...
998.0  |Luggagebrokeninto–noexplanation.Firstthegoodp...
999.0  |LondontoTehranbackinAugust2017.Thecabinlooked...

[1001 rows x 1 columns]
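
Each scraped review starts with a verification label such as `✅ Trip Verified |` or `Not Verified |`, as the `df.head()` output earlier shows. The sketch below is one possible way to strip that label with a regular expression while leaving the spaces inside the review text intact. The function name `strip_verification_prefix` is hypothetical, and the pattern assumes only these two labels occur.

```python
import re

# Optional check mark, then "Trip Verified" or "Not Verified", then a pipe.
# The pipe is escaped because a bare "|" means alternation in a regex.
PREFIX_PATTERN = re.compile(r"^\s*(?:✅\s*)?(?:Trip|Not) Verified\s*\|\s*")


def strip_verification_prefix(review: str) -> str:
    """Remove the leading verification label and any surrounding whitespace."""
    return PREFIX_PATTERN.sub("", review).strip()


if __name__ == "__main__":
    print(strip_verification_prefix("Not Verified | Only the second time flying BA"))
```

With pandas, the helper could be applied column-wise, for example `df['Review'].map(strip_verification_prefix)`, provided the column holds only strings.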




In [23]:
# Define the keywords for positive and negative reviews
positive_keywords = ['good', 'excellent', 'satisfactory']
negative_keywords = ['bad', 'terrible', 'poor']

# Create a new column to store the review sentiment
df['Sentiment'] = ''

# Search for positive and negative keywords in the reviews and assign sentiment accordingly
for index, row in df.iterrows():
    review = row['Review'].lower()
    sentiment = ''
    
    # Check for positive keywords
    if any(keyword in review for keyword in positive_keywords):
        sentiment = 'Good'
    
    # Check for negative keywords (a negative match overrides a positive one)
    if any(keyword in review for keyword in negative_keywords):
        sentiment = 'Bad'
    
    df.at[index, 'Sentiment'] = sentiment

# Filter rows where sentiment is not classified as either Good or Bad
unclassified_rows = df[~df['Sentiment'].isin(['Good', 'Bad'])]

# Display the unclassified rows
print(unclassified_rows)


                                                  Review Sentiment
NaN                                              reviews          
3.0    |\r\nTheflightswereallontime,exceptBelfastfrom...          
5.0    |Iwasn'tgoingtobotherreviewingthisflightasIsee...          
6.0    |IbookedbusinessclassticketsformyfiancéandI.Ih...          
8.0    |IamalreadyinPortugalsocontactedthemtodayandth...          
...                                                  ...       ...
987.0  |LondontoFrankfurt.IneedtoflyBAforBusinessfreq...          
992.0  |FlewLondonHeathrowtoHongKongwithBritishAirway...          
995.0  |WonderfulserviceontheflightfromEdinburgh-Flor...          
997.0  |LondontoMadrid.Lazyseatallocationhasledtomyhu...          
999.0  |LondontoTehranbackinAugust2017.Thecabinlooked...          

[522 rows x 2 columns]


In [25]:
# Define the keywords for positive and negative reviews
positive_keywords = ['good', 'excellent', 'satisfactory', 'great', 'awesome']
negative_keywords = ['bad', 'terrible', 'poor', 'horrible', 'awful']

# Create a new column to store the review sentiment
df['Sentiment'] = ''

# Search for positive and negative keywords in the reviews and assign sentiment accordingly
for index, row in df.iterrows():
    review = row['Review'].lower()
    sentiment = ''
    
    # Check for positive keywords
    if any(keyword in review for keyword in positive_keywords):
        sentiment = 'Good'
    
    # Check for negative keywords (a negative match overrides a positive one)
    if any(keyword in review for keyword in negative_keywords):
        sentiment = 'Bad'
    
    df.at[index, 'Sentiment'] = sentiment

# Filter rows where sentiment is not classified as either Good or Bad
unclassified_rows = df[~df['Sentiment'].isin(['Good', 'Bad'])]

# Display the unclassified rows
print(unclassified_rows)


                                                  Review Sentiment
NaN                                              reviews          
3.0    |\r\nTheflightswereallontime,exceptBelfastfrom...          
6.0    |IbookedbusinessclassticketsformyfiancéandI.Ih...          
8.0    |IamalreadyinPortugalsocontactedthemtodayandth...          
12.0   NotVerified|ItravelledwithBritishAirwaysfromSw...          
...                                                  ...       ...
982.0  |GatwicktoBarcelona.Checkinefficientandfriendl...          
992.0  |FlewLondonHeathrowtoHongKongwithBritishAirway...          
995.0  |WonderfulserviceontheflightfromEdinburgh-Flor...          
997.0  |LondontoMadrid.Lazyseatallocationhasledtomyhu...          
999.0  |LondontoTehranbackinAugust2017.Thecabinlooked...          

[444 rows x 2 columns]
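
The two keyword passes above use plain substring checks, so `'bad'` would also match inside a longer word such as `badge`. A hedged alternative is to match whole words with a regular expression. The sketch keeps the same precedence as the loops above (a negative match overrides a positive one) and assumes the review text still contains spaces between words. `classify_by_keywords` is a hypothetical helper name.

```python
import re

POSITIVE_KEYWORDS = ["good", "excellent", "satisfactory", "great", "awesome"]
NEGATIVE_KEYWORDS = ["bad", "terrible", "poor", "horrible", "awful"]


def _word_pattern(keywords):
    """Compile a case-insensitive pattern matching any keyword as a whole word."""
    alternatives = "|".join(re.escape(word) for word in keywords)
    return re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)


POSITIVE_RE = _word_pattern(POSITIVE_KEYWORDS)
NEGATIVE_RE = _word_pattern(NEGATIVE_KEYWORDS)


def classify_by_keywords(review: str) -> str:
    """Return 'Bad', 'Good', or '' (unclassified) for one review."""
    # Negative keywords are checked first because they take precedence
    if NEGATIVE_RE.search(review):
        return "Bad"
    if POSITIVE_RE.search(review):
        return "Good"
    return ""


if __name__ == "__main__":
    print(classify_by_keywords("Good food but awful seats."))
```

Word-boundary matching only helps when words are separated; on text with the spaces removed it would match almost nothing.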


In [31]:
from textblob import TextBlob

# Create a new column to store the review sentiment
df['Sentiment'] = ''

# Perform sentiment analysis on each review
for index, row in df.iterrows():
    review = row['Review']
    sentiment = TextBlob(review).sentiment.polarity
    
    # Classify the sentiment as positive, negative, or neutral
    if sentiment > 0:
        df.at[index, 'Sentiment'] = 'Good'
    elif sentiment < 0:
        df.at[index, 'Sentiment'] = 'Bad'
    else:
        df.at[index, 'Sentiment'] = 'Neutral'

# Filter rows where sentiment is not classified as either Good or Bad
unclassified_rows = df[~df['Sentiment'].isin(['Good', 'Bad'])]

# Display the unclassified rows
print(unclassified_rows)


                                                  Review Sentiment
NaN                                              reviews   Neutral
0.0    |Wordsfailtodescribethislastawfulflight-babyac...   Neutral
1.0    |Absolutelyterribleexperience.Theappwouldnotle...   Neutral
2.0    |BAoverbookeveryflighttomaximisetheirincomewit...   Neutral
3.0    |\r\nTheflightswereallontime,exceptBelfastfrom...   Neutral
...                                                  ...       ...
995.0  |WonderfulserviceontheflightfromEdinburgh-Flor...   Neutral
996.0  |LondonHeathrowtoManchester.ItwasoutrageousofB...   Neutral
997.0  |LondontoMadrid.Lazyseatallocationhasledtomyhu...   Neutral
998.0  |Luggagebrokeninto–noexplanation.Firstthegoodp...   Neutral
999.0  |LondontoTehranbackinAugust2017.Thecabinlooked...   Neutral

[989 rows x 2 columns]
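
In the output above, 989 of the 1001 rows come back as Neutral, including reviews that contain words such as "awful" and "terrible". A likely explanation is that the reviews have lost their spaces by this point, so a word-based scorer no longer finds any words it knows. The sketch below illustrates the effect with a tiny hand-made lexicon. It is not TextBlob's lexicon or algorithm, and `TOY_LEXICON` and `toy_polarity` are hypothetical names.

```python
import re

# Hypothetical mini-lexicon: word -> polarity in [-1, 1]
TOY_LEXICON = {"awful": -1.0, "terrible": -1.0, "wonderful": 1.0, "good": 0.7}


def toy_polarity(text: str) -> float:
    """Average the polarity of the known words in text; 0.0 if none are found."""
    words = re.findall(r"[a-z']+", text.lower())
    scores = [TOY_LEXICON[word] for word in words if word in TOY_LEXICON]
    return sum(scores) / len(scores) if scores else 0.0


spaced = "Words fail to describe this last awful flight"
squashed = spaced.replace(" ", "")

print(toy_polarity(spaced))    # the word "awful" is found
print(toy_polarity(squashed))  # one long token, so nothing is found
```

If the same thing is happening with TextBlob, running the sentiment step on review text that still has its spaces should produce far fewer Neutral rows.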


In [32]:
from gensim.corpora import Dictionary
from gensim.models import LdaModel


# Filter the neutral reviews
neutral_reviews = df[df['Sentiment'] == 'Neutral']['Review']

# Tokenize the reviews
tokenized_reviews = neutral_reviews.apply(lambda x: x.lower().split())

# Create a dictionary and corpus
dictionary = Dictionary(tokenized_reviews)
corpus = [dictionary.doc2bow(review) for review in tokenized_reviews]

# Train the LDA model
num_topics = 5  # Adjust the number of topics as needed
lda_model = LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=10)

# Print the topics and associated keywords
for topic in lda_model.print_topics():
    print(topic)

# Perform topic inference on each review
for i, review in enumerate(tokenized_reviews):
    bow = dictionary.doc2bow(review)
    topics = lda_model.get_document_topics(bow)
    print(f"Review {i+1}: {topics}")


(0, '0.010*"|" + 0.003*"the2-3-2seatinginthefrontsectionofclubworldontheupperdeckfeelsmarginallylesscrampedthan2-4-2,butit\'sshowingitsageincontrasttoothercarriers.afullcabinalsomeansqueuesforthetwoloos,althoughthey\'realotmorespaciousthaninfirst.thedrinksanddinnerservicetookagoodcoupleofhours,buttherevampedclubworldfoodisdefinitelyanimprovement.thefishoptionforthemaincoursewassomeofthebestfoodi\'vehadintheair.thewhitecompanybeddingdoeslittletocushiontheratherhardseats.cabincrewweregenerallygoodandthecsmwasparticularlyvisibleduringtheflight.breakfastwasimprovedwiththeoptiontopre-selectitems.arrivalwasaheadofschedule." + 0.003*"nonresponsiveairline." + 0.003*"ourflightouttodubrovnikwasatthepainfultimeof6:30inthemorningandfromlondongatwick,notaneasyairporttogettoatthathourespeciallyasyouhavetobeattheairportminimum2hrsbeforetobereadytoboard.iunderstandwhythelikesofeasyjet,ryanairorwizzairflyatveryoddtimestoutilisecheaperslotsatairports,butitseemsanoddchoiceforbagivenmostpeoplelikeuswhofly

## Here's what the data above says

Topic 0: This topic discusses various aspects of flying with British Airways, including the seating arrangements, cabin space, meals, bedding, cabin crew, and arrival experience. It also mentions specific flights and airports such as Gatwick to Orlando, London to Frankfurt, and Venice to London City.

Topic 1: This topic focuses on the experience of flying with British Airways in Club Europe (business class). It mentions the service provided by the cabin crew, the availability of Wi-Fi, baggage handling, lounges, boarding process, meals, and overall standards.

Topic 2: This topic describes a negative experience with British Airways, particularly regarding flight delays, cancellations, and the ground handling at the airports. It mentions specific incidents at London Heathrow, waiting for luggage, lounge toilets, boarding, and lack of customer service.

Topic 3: This topic highlights a positive experience with British Airways, specifically in Club Europe. It mentions friendly and helpful staff, efficient check-in, comfortable seats, power ports, in-flight service, meals, and baggage handling. It also includes feedback on baggage handling at Heathrow.

Topic 4: This topic describes a frustrating experience with British Airways, focusing on a flight from London to Amsterdam. It mentions flight delays, airspace closures, cancellations, pilot and cabin crew communication, in-flight service, and ground handling. It also criticizes the handling of the situation by British Airways, including long queues, vouchers for sleeping on the floor, and difficulties with rebooking through the app.
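
The topic strings printed by `print_topics()` above are entire reviews rather than single words, which suggests that `x.lower().split()` returned one token per review because the text no longer contained spaces. As a minimal sketch, assuming the reviews still have their spaces, a tokenizer along these lines would give the topic model word-level tokens. The stop-word list is a small hypothetical sample, not a complete one, and `tokenize_review` is a hypothetical helper.

```python
import re
from typing import List

# Hypothetical, deliberately short stop-word list for illustration only
STOP_WORDS = {"the", "a", "an", "and", "to", "of", "was", "were", "in", "on", "it"}


def tokenize_review(review: str) -> List[str]:
    """Lowercase a review, keep alphabetic words, and drop stop words."""
    words = re.findall(r"[a-z]+", review.lower())
    return [word for word in words if word not in STOP_WORDS]


spaced = "The cabin crew were friendly and the seats were comfortable."
print(tokenize_review(spaced))
# Without spaces, split() yields a single token for the whole review
print(spaced.replace(" ", "").lower().split())
```

The resulting token lists could be passed to `Dictionary` and `doc2bow` in place of the output of `x.lower().split()`.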


In [36]:
import csv

from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Filter the neutral reviews
neutral_reviews = df[df['Sentiment'] == 'Neutral']['Review']

# Tokenize the reviews
tokenized_reviews = neutral_reviews.apply(lambda x: x.lower().split())

# Create a dictionary and corpus
dictionary = Dictionary(tokenized_reviews)
corpus = [dictionary.doc2bow(review) for review in tokenized_reviews]

# Train the LDA model
num_topics = 5  # Adjust the number of topics as needed
lda_model = LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=10)

# Print the topics and associated keywords
for topic in lda_model.print_topics():
    print(topic)

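# Perform topic inference on each review and store the results
review_data = []
for i, review in enumerate(tokenized_reviews):
    bow = dictionary.doc2bow(review)
    # get_document_topics omits low-probability topics, so look up each topic
    # by ID and fill missing ones with 0.0 to keep the CSV columns aligned
    topic_probs = dict(lda_model.get_document_topics(bow))
    row = [i+1] + [topic_probs.get(topic_id, 0.0) for topic_id in range(num_topics)]
    review_data.append(row)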

# Define column names
columns = ["Review ID", "Topic 0", "Topic 1", "Topic 2", "Topic 3", "Topic 4"]

# Save the review data to a CSV file
with open("review_analyzed.csv", "w", newline="") as file:
    writer = csv.writer(file)
    writer.writerow(columns)  # Write the column names
    writer.writerows(review_data)  # Write the review data

print("Review data saved to review_analyzed.csv")


(0, '0.013*"|" + 0.003*"iwouldliketocomplimentbritishairwaysandtheircrewforthecomfortoftheflight,servicefromthecrewandthefoodserved.beforelandingwewereservedwithanafternoonteaplatethatwasacredittothecateringcompany.overallaveryenjoyableflightfromlondongatwicktotampa." + 0.003*"gatwicktobarbadosreturn.inormallytravelonbawithlowexpectationswiththeresultthatiamnottoodisappointedandsometimespleasantlysurprised.theoutboundflightwasaboutparforthecoursewithareasonablycomfortableseatandanokcabincrewthatdidjustaboutenoughtolookafterusandmaketheflightuneventful.however,theflightdidleavebangontimeandmanagedtolandinbarbadosbeforethevirginflightthatwasscheduledtoleavegatwickfiveminutesbeforeus,withtheresultthatwegotthrougharrivalsquickly.thereturnflightaweeklatermanagedtoreinforcemylowexpectationsofbawithoneofthepoorerflightsthatwehaveexperiencedoverthelastfewyears.onceagainwhenyoutravelwesttoeast,thecrew(specificallyoneladywhowaslookingafterourpartofthecabin)somehowfeeltheyhavetorushthrougheverypa