# Task 1

---

## Web scraping and analysis

This Jupyter notebook includes some code to get you started with web scraping. We will use a package called `BeautifulSoup` to collect the data from the web. Once you've collected your data and saved it into a local `.csv` file you should start with your analysis.

### Scraping data from Skytrax

If you visit https://www.airlinequality.com you can see that there is a lot of data there. For this task, we are only interested in reviews related to British Airways and the airline itself.

If you navigate to https://www.airlinequality.com/airline-reviews/british-airways you will see this data. We can now use `Python` and `BeautifulSoup` to collect all the links to the reviews and then collect the text data from each of the individual review links.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [None]:
base_url = "https://www.airlinequality.com/airline-reviews/british-airways"
pages = 15
page_size = 200

reviews = []

for i in range(1, pages + 1):

    print(f"Scraping page {i}")

    # Create URL to collect links from paginated data
    url = f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"

    # Collect HTML data from this page; fail fast on timeouts and HTTP errors
    response = requests.get(url, timeout=30)
    response.raise_for_status()

    # Parse content
    content = response.content
    parsed_content = BeautifulSoup(content, 'html.parser')
    for para in parsed_content.find_all("div", {"class": "text_content"}):
        reviews.append(para.get_text())

    print(f"   ---> {len(reviews)} total reviews")

Scraping page 1
   ---> 200 total reviews
Scraping page 2
   ---> 400 total reviews
Scraping page 3
   ---> 600 total reviews
Scraping page 4
   ---> 800 total reviews
Scraping page 5
   ---> 1000 total reviews
Scraping page 6
   ---> 1200 total reviews
Scraping page 7
   ---> 1400 total reviews
Scraping page 8
   ---> 1600 total reviews
Scraping page 9
   ---> 1800 total reviews
Scraping page 10
   ---> 2000 total reviews
Scraping page 11
   ---> 2200 total reviews
Scraping page 12
   ---> 2400 total reviews
Scraping page 13
   ---> 2600 total reviews
Scraping page 14
   ---> 2800 total reviews
Scraping page 15
   ---> 3000 total reviews
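
The paginated URL used in the loop can also be assembled with the standard library, which takes care of percent-encoding `post_date:Desc`. This is a minimal sketch only; the helper name `build_page_url` is hypothetical and not part of the original notebook.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.airlinequality.com/airline-reviews/british-airways"

def build_page_url(page, page_size=200, base_url=BASE_URL):
    """Return the URL for one page of reviews, newest first (hypothetical helper)."""
    # urlencode percent-encodes the colon in "post_date:Desc" as %3A
    query = urlencode({"sortby": "post_date:Desc", "pagesize": page_size})
    return f"{base_url}/page/{page}/?{query}"

# One URL per page, matching pages = 15 in the loop above
page_urls = [build_page_url(i) for i in range(1, 16)]
```

Building the query from a dictionary keeps the sort order and page size in one place if you later change them.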


In [None]:
data = pd.DataFrame()
data["reviews"] = reviews
data.head()

Unnamed: 0,reviews
0,✅ Trip Verified | Worst service ever. Lost bag...
1,✅ Trip Verified | BA 246 21JAN 2023 Did not a...
2,✅ Trip Verified | Not a great experience. I co...
3,Not Verified | I was excited to fly BA as I'd ...
4,Not Verified | I just want to warn everyone o...


In [None]:
data.to_csv("/content/drive/MyDrive/Forage /British_Airways/Scrapped_data.csv", header=True)

Congratulations! Now you have your dataset for this task! The loop above collected 3000 reviews by iterating through 15 paginated pages of 200 reviews each. If you want to collect more data, try increasing the number of pages!

The next thing that you should do is clean this data to remove any unnecessary text from each of the rows. For example, "✅ Trip Verified" can be removed from each row if it exists, as it's not relevant to what we want to investigate.
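
One possible way to do this cleaning is with a regular expression that strips the verification tag from the start of a review. The sketch below assumes the tags look like the ones in the output above ("✅ Trip Verified |" and "Not Verified |"); the helper name `strip_verification_prefix` and the shortened sample strings are made up for illustration.

```python
import re

# Optional check mark, then "Trip Verified" or "Not Verified", then the "|" separator
_PREFIX = re.compile(r"^\s*(?:✅\s*)?(?:Trip Verified|Not Verified)\s*\|\s*")

def strip_verification_prefix(text):
    """Remove a leading verification tag; reviews without one are left unchanged."""
    return _PREFIX.sub("", text).strip()

# Shortened, illustrative samples (not verbatim rows from the dataset)
samples = [
    "✅ Trip Verified | Worst service ever.",
    "Not Verified |  I just want to warn everyone.",
    "A380 LHR-IAD. After a visit to the lounge.",
]
cleaned = [strip_verification_prefix(s) for s in samples]
```

On the DataFrame, the same function could be applied with `data["reviews"].apply(strip_verification_prefix)`.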

## Data Manipulation

In [None]:
data.head(10)

Unnamed: 0,reviews
0,✅ Trip Verified | Worst service ever. Lost bag...
1,✅ Trip Verified | BA 246 21JAN 2023 Did not a...
2,✅ Trip Verified | Not a great experience. I co...
3,Not Verified | I was excited to fly BA as I'd ...
4,Not Verified | I just want to warn everyone o...
5,Not Verified | Paid for business class travell...
6,✅ Trip Verified | The plane was extremely dir...
7,Not Verified | Overall journey wasn’t bad howe...
8,✅ Trip Verified | Overall very satisfied. Gro...
9,✅ Trip Verified | As always when I fly BA it ...


In [None]:
data.columns

Index(['reviews'], dtype='object')

In [None]:
import nltk
from nltk import word_tokenize
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:
from nltk.corpus import stopwords
import string

In [None]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Tokenize the words and remove stopwords

In [None]:
import nltk
nltk.download('wordnet')


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [None]:
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import string
import matplotlib.pyplot as plt
import seaborn as sns


df = pd.DataFrame(data)

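# Build the stop-word set and lemmatizer once rather than on every call
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(review):
    # Split the review into word tokens
    tokens = word_tokenize(review)
    # Lowercase, keeping only alphanumeric tokens that are not stop words
    tokens = [word.lower() for word in tokens if (word.isalnum() and word.lower() not in stop_words)]
    # Reduce each token to its lemma (e.g. "airways" -> "airway")
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    return ' '.join(tokens)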

# Apply tokenization to the 'reviews' column
df['Tokenized Reviews'] = df['reviews'].apply(preprocess_text)

# Display the DataFrame with tokenized reviews
print(df[['reviews', 'Tokenized Reviews']])


                                                reviews  \
0     ✅ Trip Verified | Worst service ever. Lost bag...   
1     ✅ Trip Verified |  BA 246 21JAN 2023 Did not a...   
2     ✅ Trip Verified | Not a great experience. I co...   
3     Not Verified | I was excited to fly BA as I'd ...   
4     Not Verified |  I just want to warn everyone o...   
...                                                 ...   
2995  A380 LHR-IAD. After a visit to the Concorde Ro...   
2996  Club World: Just flown London Gatwick to Las V...   
2997  LHR-ATL-LHR. Out 3rd May. Back 12th May. First...   
2998  JNB-LHR BA056 May 13 2015. LHR-JNB BA055 May 2...   
2999  LGW-HER. A320 on one of BA's new tourist-orien...   

                                      Tokenized Reviews  
0     trip verified worst service ever lost baggage ...  
1     trip verified ba 246 21jan 2023 appreciate unp...  
2     trip verified great experience could check onl...  
3     verified excited fly ba travelled long haul 25...  
4

In [None]:
df.head(2)

Unnamed: 0,reviews,Tokenized Reviews
0,✅ Trip Verified | Worst service ever. Lost bag...,trip verified worst service ever lost baggage ...
1,✅ Trip Verified | BA 246 21JAN 2023 Did not a...,trip verified ba 246 21jan 2023 appreciate unp...


### Tokenized Data for the reviews

In [None]:
df["Tokenized Reviews"][0]

'trip verified worst service ever lost baggage delayed flight missed connection one helping get back british airway website broken let fill missing report give missing file report number way contact british airway dumbest ever ai chatbot'

In [None]:
df["Tokenized Reviews"][1]

'trip verified ba 246 21jan 2023 appreciate unprofessional attitude pilot flight scheduled departure advised boarding time whole flight full passenger waiting gate board cabin crew board pilot board sao paulo airport duty free branded shopping bag flight still boarding finally push back 40 minute late captain came intercom announce delay due crew hotel airport sorry captain whole plane saw pilot colleague board fifteen minute cabin crew clutching duty free pilot colleague still made time stop'

## Sentiment analysis with NLTK's built-in `SentimentIntensityAnalyzer` (VADER), which scores each review so it can be labeled positive, negative, or neutral

In [None]:
import nltk
nltk.download('vader_lexicon')


[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [None]:
sid = SentimentIntensityAnalyzer()
df['Sentiment Score'] = df['reviews'].apply(lambda x: sid.polarity_scores(x)['compound'])

# Assign sentiment labels
df['Sentiment Label'] = df['Sentiment Score'].apply(lambda score: 'Positive' if score >= 0.05 else ('Negative' if score <= -0.05 else 'Neutral'))
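
The nested conditional in the labeling lambda can be easier to read as a small named function. This sketch reproduces the same ±0.05 cut-offs on the compound score; the function name `label_from_compound` is hypothetical.

```python
def label_from_compound(score, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a sentiment label."""
    if score >= threshold:
        return "Positive"
    if score <= -threshold:
        return "Negative"
    # Scores strictly between -threshold and +threshold count as Neutral
    return "Neutral"

# A few compound scores, three of them taken from the table in this notebook
example_scores = [-0.9648, 0.5013, -0.086, 0.0]
example_labels = [label_from_compound(s) for s in example_scores]
```

It could be used as `df['Sentiment Score'].apply(label_from_compound)`.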


In [None]:
df["Sentiment Label"].head(20)

0     Negative
1     Positive
2     Positive
3     Negative
4     Negative
5     Negative
6     Negative
7     Negative
8     Positive
9     Negative
10    Positive
11    Positive
12    Negative
13    Positive
14    Negative
15    Negative
16    Positive
17    Positive
18    Positive
19    Positive
Name: Sentiment Label, dtype: object

In [None]:
df.head(10)

Unnamed: 0,reviews,Tokenized Reviews,Sentiment Score,Sentiment Label
0,✅ Trip Verified | Worst service ever. Lost bag...,trip verified worst service ever lost baggage ...,-0.9648,Negative
1,✅ Trip Verified | BA 246 21JAN 2023 Did not a...,trip verified ba 246 21jan 2023 appreciate unp...,0.5013,Positive
2,✅ Trip Verified | Not a great experience. I co...,trip verified great experience could check onl...,0.8749,Positive
3,Not Verified | I was excited to fly BA as I'd ...,verified excited fly ba travelled long haul 25...,-0.086,Negative
4,Not Verified | I just want to warn everyone o...,verified want warn everyone worst customer ser...,-0.9385,Negative
5,Not Verified | Paid for business class travell...,verified paid business class travelling cairo ...,-0.9686,Negative
6,✅ Trip Verified | The plane was extremely dir...,trip verified plane extremely dirty chocolate ...,-0.9127,Negative
7,Not Verified | Overall journey wasn’t bad howe...,verified overall journey bad however end bagga...,-0.875,Negative
8,✅ Trip Verified | Overall very satisfied. Gro...,trip verified overall satisfied ground staff m...,0.8724,Positive
9,✅ Trip Verified | As always when I fly BA it ...,trip verified always fly ba total shamble book...,-0.9482,Negative


In [None]:
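negative_count = (df["Sentiment Label"] == "Negative").sum()

print("Number of Negative Reviews:", negative_count)
# Everything that is not Negative is either Positive or Neutral
print("Number of Positive or Neutral Reviews:", len(df) - negative_count)



Number of Negative Reviews: 1315
Number of Positive or Neutral Reviews: 1685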
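
Because the labeling step can also produce `Neutral`, subtracting the negative count from the total lumps Positive and Neutral together. A per-label tally keeps the three categories separate. The sketch below uses the first 20 labels shown earlier in this notebook and the standard library's `Counter`; on the full DataFrame, `df["Sentiment Label"].value_counts()` gives the equivalent breakdown.

```python
from collections import Counter

# First 20 sentiment labels, copied from the output of df["Sentiment Label"].head(20)
first_20_labels = [
    "Negative", "Positive", "Positive", "Negative", "Negative",
    "Negative", "Negative", "Negative", "Positive", "Negative",
    "Positive", "Positive", "Negative", "Positive", "Negative",
    "Negative", "Positive", "Positive", "Positive", "Positive",
]

label_counts = Counter(first_20_labels)

# Counter returns 0 for labels that never occur, so Neutral is safe to look up
for label in ("Positive", "Negative", "Neutral"):
    print(f"Number of {label} Reviews: {label_counts[label]}")
```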


In [None]:
model_df = df[["reviews", "Sentiment Label"]]

In [None]:
model_df.head(10)

Unnamed: 0,reviews,Sentiment Label
0,✅ Trip Verified | Worst service ever. Lost bag...,Negative
1,✅ Trip Verified | BA 246 21JAN 2023 Did not a...,Positive
2,✅ Trip Verified | Not a great experience. I co...,Positive
3,Not Verified | I was excited to fly BA as I'd ...,Negative
4,Not Verified | I just want to warn everyone o...,Negative
5,Not Verified | Paid for business class travell...,Negative
6,✅ Trip Verified | The plane was extremely dir...,Negative
7,Not Verified | Overall journey wasn’t bad howe...,Negative
8,✅ Trip Verified | Overall very satisfied. Gro...,Positive
9,✅ Trip Verified | As always when I fly BA it ...,Negative


# Model Training, Vectorizing, and Calculating Accuracy

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

In [None]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df['Tokenized Reviews'], df['Sentiment Label'], test_size=0.2, random_state=42)


In [None]:
# TF-IDF Vectorization
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)
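
To make the weighting more concrete, here is a small pure-Python sketch of TF-IDF with a smoothed IDF and L2 normalization, which as far as I know matches the defaults of `TfidfVectorizer` (`smooth_idf=True`, `norm='l2'`). It is an illustration only: it splits on whitespace, whereas scikit-learn applies its own tokenization, and the function name and toy documents are made up.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (illustrative sketch)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {t: c * idf[t] for t, c in counts.items()}
        # Scale each document vector to unit Euclidean length
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors

toy_docs = [
    "flight delayed staff rude",
    "flight excellent staff friendly",
    "flight lost baggage",
]
vectors = tfidf_vectors(toy_docs)
```

In the first toy document, "flight" (present in every document) ends up with a lower weight than "staff" (two documents), which in turn is lower than "delayed" (one document).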


In [None]:
X_train_tfidf

<2400x10486 sparse matrix of type '<class 'numpy.float64'>'
	with 168180 stored elements in Compressed Sparse Row format>

In [None]:
X_test_tfidf

<600x10486 sparse matrix of type '<class 'numpy.float64'>'
	with 39327 stored elements in Compressed Sparse Row format>

In [None]:
# Train a Naive Bayes classifier
classifier = MultinomialNB()
classifier.fit(X_train_tfidf, y_train)

In [None]:
y_pred = classifier.predict(X_test_tfidf)

In [None]:
y_pred

array(['Negative', 'Negative', 'Positive', 'Negative', 'Positive',
       'Positive', 'Positive', 'Positive', 'Negative', 'Negative',
       'Negative', 'Positive', 'Positive', 'Negative', 'Positive',
       'Positive', 'Positive', 'Positive', 'Positive', 'Positive',
       'Negative', 'Positive', 'Negative', 'Positive', 'Positive',
       'Positive', 'Positive', 'Negative', 'Negative', 'Positive',
       'Negative', 'Negative', 'Negative', 'Positive', 'Positive',
       'Positive', 'Positive', 'Negative', 'Positive', 'Positive',
       'Negative', 'Negative', 'Negative', 'Negative', 'Positive',
       'Positive', 'Positive', 'Positive', 'Positive', 'Positive',
       'Positive', 'Positive', 'Positive', 'Positive', 'Negative',
       'Positive', 'Positive', 'Positive', 'Positive', 'Positive',
       'Positive', 'Positive', 'Negative', 'Negative', 'Positive',
       'Positive', 'Positive', 'Negative', 'Positive', 'Positive',
       'Positive', 'Positive', 'Positive', 'Negative', 'Negati

In [None]:
print("Accuracy:", accuracy_score(y_test, y_pred) *100, "%")
print("\nClassification Report:\n", classification_report(y_test, y_pred))

Accuracy: 72.66666666666667 %

Classification Report:
               precision    recall  f1-score   support

    Negative       0.77      0.53      0.63       256
     Neutral       0.00      0.00      0.00         8
    Positive       0.71      0.90      0.79       336

    accuracy                           0.73       600
   macro avg       0.49      0.47      0.47       600
weighted avg       0.73      0.73      0.71       600
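
The two average rows can be recomputed from the per-class rows, which helps explain why they differ so much. The macro average weights every class equally, so the all-zero `Neutral` row (which suggests the classifier never predicted that label correctly on the test split) pulls it down; the weighted average scales each class by its support. The sketch below uses the rounded figures from the report, so results can differ from the printed values in the last digit.

```python
# Per-class (precision, recall, f1, support), copied from the report above
report = {
    "Negative": (0.77, 0.53, 0.63, 256),
    "Neutral": (0.00, 0.00, 0.00, 8),
    "Positive": (0.71, 0.90, 0.79, 336),
}

def f1(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

total_support = sum(row[3] for row in report.values())

# Macro average: unweighted mean over classes, for precision, recall, f1
macro = [sum(row[i] for row in report.values()) / len(report) for i in range(3)]

# Weighted average: mean over classes weighted by support
weighted = [
    sum(row[i] * row[3] for row in report.values()) / total_support
    for i in range(3)
]

print("macro avg   :", [round(v, 2) for v in macro])
print("weighted avg:", [round(v, 2) for v in weighted])
```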





## Testing the trained model on new reviews

In [None]:
test_data = [
    'This flight was excellent! The service was outstanding.',
    'I had a terrible experience. The flight was delayed and the staff was rude.',
    'The journey was okay, nothing bad.'
]

# Preprocess the test data
preprocessed_test_data = [preprocess_text(review) for review in test_data]

# Vectorize with the already-fitted vectorizer; a new name avoids overwriting X_test_tfidf
X_new_tfidf = vectorizer.transform(preprocessed_test_data)

predicted_labels = classifier.predict(X_new_tfidf)

# Printing the output
for review, label in zip(test_data, predicted_labels):
    print(f"Review: {review} -->    Predicted Label: {label}")
    print("\n")

Review: This flight was excellent! The service was outstanding. -->    Predicted Label: Positive


Review: I had a terrible experience. The flight was delayed and the staff was rude. -->    Predicted Label: Negative


Review: The journey was okay, nothing bad. -->    Predicted Label: Positive


