<a href="https://colab.research.google.com/gist/Larinwa/5e79f7dd1659d0bdaf4cbf6080777819/getting_started.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# WEB SCRAPING
We will use the `requests` library to download each page and `BeautifulSoup` to extract the review text, then save the result to a local `.csv` file.

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

In [None]:
base_url = "https://www.airlinequality.com/airline-reviews/british-airways"
pages = 10
page_size = 100

reviews = []

for i in range(1, pages + 1):

    print(f"Scraping page {i}")

    # Create URL to collect links from paginated data
    url = f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"

    # Collect HTML data from this page; stop on HTTP errors or a hung connection
    response = requests.get(url, timeout=30)
    response.raise_for_status()

    # Parse content
    content = response.content
    parsed_content = BeautifulSoup(content, 'html.parser')
    for para in parsed_content.find_all("div", {"class": "text_content"}):
        reviews.append(para.get_text())

    print(f"   ---> {len(reviews)} total reviews")

Scraping page 1
   ---> 100 total reviews
Scraping page 2
   ---> 200 total reviews
Scraping page 3
   ---> 300 total reviews
Scraping page 4
   ---> 400 total reviews
Scraping page 5
   ---> 500 total reviews
Scraping page 6
   ---> 600 total reviews
Scraping page 7
   ---> 700 total reviews
Scraping page 8
   ---> 800 total reviews
Scraping page 9
   ---> 900 total reviews
Scraping page 10
   ---> 1000 total reviews


In [None]:
df = pd.DataFrame({"reviews": reviews})
df.head()

Unnamed: 0,reviews
0,✅ Trip Verified | Appalling service with fai...
1,✅ Trip Verified | British Airways charge you f...
2,✅ Trip Verified | What is wrong with you guys?...
3,✅ Trip Verified | We booked two business cla...
4,✅ Trip Verified | I’ve flown with many airline...
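Every row in the preview above starts with a `✅ Trip Verified |` marker before the review text itself. The following is a minimal sketch of how that marker could be separated from the body before analysis. The helper name `split_verification` and the sample string are hypothetical, and the sketch assumes the first `|` in a review always ends the marker, which may not hold for reviews that lack the marker but contain a `|` elsewhere.

```python
def split_verification(review):
    """Split a raw review into (status, body) at the first '|' separator."""
    status, sep, body = review.partition("|")
    if not sep:
        # No separator found: treat the whole string as the review body.
        return "", review.strip()
    # Drop the check mark and surrounding whitespace from the status part.
    return status.replace("✅", "").strip(), body.strip()


# Hypothetical sample in the same shape as the scraped rows
status, body = split_verification("✅ Trip Verified | Sample review text.")
print(status)  # Trip Verified
print(body)    # Sample review text.
```

If this fits the data, it could be applied to the `reviews` column with `df["reviews"].apply(split_verification)` to obtain the two parts for each row.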


In [None]:
import nltk
nltk.download('vader_lexicon')


[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\hp\AppData\Roaming\nltk_data...


True

In [None]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer


# Initialize the sentiment analyzer
sia = SentimentIntensityAnalyzer()

# Label a review Positive, Negative, or Neutral from VADER's compound score (cutoffs at +0.05 and -0.05)
def analyze_sentiment(text):
    sentiment = sia.polarity_scores(text)
    if sentiment['compound'] >= 0.05:
        return 'Positive'
    elif sentiment['compound'] <= -0.05:
        return 'Negative'
    else:
        return 'Neutral'

# Apply the sentiment analysis function to the reviews column
df['Sentiment'] = df['reviews'].apply(analyze_sentiment)

# Preview the updated DataFrame
df.head()

Unnamed: 0,reviews,Sentiment
0,✅ Trip Verified | Appalling service with fai...,Negative
1,✅ Trip Verified | British Airways charge you f...,Positive
2,✅ Trip Verified | What is wrong with you guys?...,Negative
3,✅ Trip Verified | We booked two business cla...,Negative
4,✅ Trip Verified | I’ve flown with many airline...,Positive
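To make the labelling rule easier to inspect, here is a small sketch that isolates the threshold logic from `analyze_sentiment` so it can be tried on plain numbers without loading VADER. The function name `label_from_compound` and the `threshold` parameter are hypothetical. The compound score lies between -1 and 1, and the ±0.05 cutoffs are the ones used in the cell above.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to a sentiment label."""
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    # Scores strictly between -threshold and +threshold count as neutral.
    return "Neutral"


# Boundary values show that both cutoffs are inclusive
for score in (0.6, 0.05, 0.0, -0.05, -0.6):
    print(score, label_from_compound(score))
```

Because both cutoffs are inclusive, only scores strictly inside the narrow band around zero are labelled Neutral, which is one reason few reviews end up in that class.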


In [None]:
df['Sentiment'].value_counts()

Negative    515
Positive    471
Neutral      14
Name: Sentiment, dtype: int64
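The raw counts are easier to compare as percentages. The sketch below recomputes the shares from the counts printed above using plain Python. The variable names are illustrative only. In the notebook itself, `df['Sentiment'].value_counts(normalize=True)` should give the same proportions as fractions.

```python
# Counts copied from the value_counts() output above
counts = {"Negative": 515, "Positive": 471, "Neutral": 14}

total = sum(counts.values())

# Percentage share of each label, rounded to one decimal place
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}

for label, share in shares.items():
    print(f"{label:<9}{share:>5}%")
```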

In [None]:
import os
os.makedirs("data", exist_ok=True)  # create the output directory if it is missing
df.to_csv("data/BA_reviews.csv")

Now we have our dataset for this task! The loop above collected 1,000 reviews by iterating over 10 paginated result pages of 100 reviews each. To collect more data, all we need to do is increase the value of `pages`.
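As a rough illustration of how the page count drives the amount of data requested, the sketch below builds the list of page URLs without sending any requests. The helper `page_urls` is hypothetical. The URL pattern is the one used in the scraping loop, and whether the site actually serves that many pages is not checked here.

```python
BASE_URL = "https://www.airlinequality.com/airline-reviews/british-airways"


def page_urls(pages, page_size=100, base_url=BASE_URL):
    """Return the paginated review URLs for pages 1..pages."""
    return [
        f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"
        for i in range(1, pages + 1)
    ]


# Doubling the page count doubles the number of URLs to fetch
urls = page_urls(pages=20)
print(len(urls))
print(urls[0])
```

With `page_size` left at 100, 20 pages would request up to 2,000 reviews, assuming that many are available.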