# Task 1

---

## Web scraping and analysis

This Jupyter notebook includes some code to get you started with web scraping. We will use a package called `BeautifulSoup` to collect the data from the web. Once you've collected your data and saved it into a local `.csv` file you should start with your analysis.

### Scraping data from Skytrax

If you visit https://www.airlinequality.com you can see that there is a lot of data there. For this task, we are only interested in the reviews of British Airways and in information about the airline itself.

If you navigate to https://www.airlinequality.com/airline-reviews/british-airways you will see this data. We can use Python and `BeautifulSoup` to step through the paginated review listing and collect the review text from each page.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [None]:
base_url = "https://www.airlinequality.com/airline-reviews/british-airways"
pages = 10
page_size = 100

reviews = []

for i in range(1, pages + 1):

    print(f"Scraping page {i}")

    # Create URL to collect links from paginated data
    url = f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"

    # Collect HTML data from this page
    response = requests.get(url)

    # Parse content
    content = response.content
    parsed_content = BeautifulSoup(content, 'html.parser')
    for para in parsed_content.find_all("div", {"class": "text_content"}):
        reviews.append(para.get_text())
    
    print(f"   ---> {len(reviews)} total reviews")

In [None]:
df = pd.DataFrame()
df["reviews"] = reviews
df.head()

In [None]:
df.to_csv("data/BA_reviews.csv")

Congratulations! You now have your dataset for this task. The loop above collected 1000 reviews (10 pages of 100 reviews each) by iterating through the paginated pages on the website. If you want to collect more data, try increasing `pages`.
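
To make the effect of increasing the page count concrete, here is a minimal sketch that builds the listing URLs without making any network requests. It reuses the URL pattern from the scraping cell; `build_page_urls` is a hypothetical helper introduced only for illustration, and the resulting count is an upper bound, since the site may return fewer reviews than requested.

```python
def build_page_urls(base_url, pages, page_size):
    """Return one listing URL per page (hypothetical helper, not part of the notebook)."""
    return [
        f"{base_url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"
        for i in range(1, pages + 1)
    ]


base_url = "https://www.airlinequality.com/airline-reviews/british-airways"
page_size = 100
urls = build_page_urls(base_url, pages=20, page_size=page_size)

# Upper bound on the number of reviews requested across all pages
max_reviews = len(urls) * page_size
print(len(urls), max_reviews)
```

Doubling `pages` from 10 to 20 doubles the number of requests and the maximum number of reviews collected.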

The next step is to clean this data by removing unnecessary text from each row. For example, "✅ Trip Verified" can be removed from each row where it appears, as it is not relevant to what we want to investigate.
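
As a rough sketch of that cleaning step, the snippet below removes the marker with the standard `re` module. The sample rows are made up for illustration, and `strip_verified` is a hypothetical helper name rather than anything defined elsewhere in this notebook.

```python
import re

# Marker named in the text above
VERIFIED = re.compile(r"✅ Trip Verified")


def strip_verified(text):
    """Remove the marker and trim any leftover surrounding whitespace."""
    return VERIFIED.sub("", text).strip()


# Made-up sample rows, for illustration only
sample_reviews = [
    "✅ Trip Verified Friendly crew and an on-time departure.",
    "Long queue at check-in.",
]
cleaned = [strip_verified(review) for review in sample_reviews]
print(cleaned)
```

On a DataFrame, the same function could be applied to the review column with something like `df["reviews"].apply(strip_verified)`.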

In [None]:
df

In [None]:
df_ = pd.read_csv('data/BA_reviews.csv')
df_.head()

In [None]:
df_.columns

In [None]:
# drop() returns a new DataFrame, so assign the result back
df_ = df_.drop(columns="Unnamed: 0")

In [None]:
import nltk  
nltk.download('all')

In [3]:
!pip install wordcloud



In [4]:
import os

#importing pandas module and numpy module for data preprocessing
import pandas as pd
import numpy as np

#importing matplotlib and seaborn for visualization
import matplotlib.pyplot as plt
import seaborn as sns

# to remove stopwords from the input data
from nltk.corpus import stopwords

# termcolor is a Python module for ANSI color formatting of terminal output
from termcolor import colored

# to ignore warnings
from warnings import filterwarnings
filterwarnings('ignore')

# importing wordcloud for creating a word cloud of the data
from wordcloud import WordCloud

#Using Sklearn library for machine learning models and to vectorize the input data
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_extraction.text import TfidfVectorizer,CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve


print(colored("\nLIBRARIES WERE SUCCESSFULLY IMPORTED...", color = "cyan", attrs = ["dark", "bold"]))

LIBRARIES WERE SUCCESSFULLY IMPORTED...


In [5]:
import re

regex = re.compile(r"✅ Trip Verified")

# Remove the "✅ Trip Verified" string from each row, then trim leftover whitespace
df_['reviews'] = df_['reviews'].str.replace(regex, "", regex=True).str.strip()

In [None]:
df_

In [None]:
def tokenize_data(dataset):
    # Split each review into a list of word tokens
    tokenizer = nltk.tokenize.TreebankWordTokenizer()
    dataset["reviews"] = dataset["reviews"].apply(tokenizer.tokenize)
    return dataset

def remove_stop_words(dataset):
    # Lowercase every token and drop English stop words (compared in lowercase)
    stop_words = set(stopwords.words('english'))
    dataset["reviews"] = dataset["reviews"].apply(
        lambda tokens: [token.lower() for token in tokens if token.lower() not in stop_words]
    )
    return dataset

def normalize(dataset):
    # Lemmatize each token and join the tokens back into a single string
    lemmatizer = nltk.stem.WordNetLemmatizer()
    dataset["reviews"] = dataset["reviews"].apply(
        lambda tokens: " ".join(lemmatizer.lemmatize(token) for token in tokens).strip()
    )
    return dataset

def remove_garbage(dataset):
    # Remove the "<br>" and "<url>" markers as whole substrings, then drop punctuation
    # characters. Listing the markers inside the character string would also delete
    # the letters b, r, u and l from every review.
    garbage = set("~`!@#$%^&*()_-+={[}]|\\:;'<,>.?/")
    def clean(text):
        text = text.replace("<br>", "").replace("<url>", "")
        return "".join(char for char in text if char not in garbage)
    dataset["reviews"] = dataset["reviews"].apply(clean)
    return dataset
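
One detail of `remove_garbage` is worth illustrating on its own: a test such as `char not in garbage` looks at single characters, so a multi-character marker like `<br>` placed inside the character string is treated as the separate characters `<`, `b`, `r` and `>`. The stdlib-only sketch below assumes the markers should be removed as whole substrings and only punctuation filtered per character; `MARKERS`, `PUNCTUATION` and `clean_text` are illustrative names, and the sample sentence is made up.

```python
# Multi-character markers, removed as whole substrings
MARKERS = ("<br>", "<url>")
# Single punctuation characters, filtered one by one
PUNCTUATION = set("~`!@#$%^&*()_-+={[}]|\\:;'<,>.?/")


def clean_text(text):
    """Strip the markers first, then drop individual punctuation characters."""
    for marker in MARKERS:
        text = text.replace(marker, "")
    return "".join(ch for ch in text if ch not in PUNCTUATION)


print(clean_text("terrible lounge, <br>but brilliant crew!"))
```

With this split, ordinary words containing the letters b, r, u or l pass through unchanged.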

In [None]:
# Run the pipeline on df_, the copy that had "✅ Trip Verified" removed above
df_ = tokenize_data(df_)
df_ = remove_stop_words(df_)
df_ = normalize(df_)
df_ = remove_garbage(df_)

In [None]:
df_
