## **Project**
---


In this activity, you will scrape the following website: https://quotes.toscrape.com/    
You will have to scrape all the quotes from **every page**. You cannot assume that there are 10 pages, i.e., you cannot use a for loop that iterates 10 times.     
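
One way to avoid hard-coding the page count is to follow the pager's "Next" link until it disappears. The sketch below assumes the link sits inside an `<li class="next">` element, which should be checked against the live page; `next_page_url`, `crawl`, and the `fetch` callable are illustrative names, not part of the assignment.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def next_page_url(html, current_url):
    """Return the absolute URL of the next page, or None on the last page."""
    soup = BeautifulSoup(html, "html.parser")
    # Assumed pager markup: <li class="next"><a href="/page/2/">Next</a></li>
    link = soup.select_one("li.next > a")
    if link is None or not link.get("href"):
        return None
    return urljoin(current_url, link["href"])


def crawl(fetch, start_url):
    """Yield (url, html) for each page, following 'next' links until none remain."""
    url = start_url
    seen = set()  # guards against a pager that loops back on itself
    while url and url not in seen:
        seen.add(url)
        html = fetch(url)
        yield url, html
        url = next_page_url(html, url)
```

With `requests`, `fetch` could be `lambda url: requests.get(url, timeout=10).text`. Passing the fetcher in as an argument keeps the loop testable without network access.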

---   
### **Part 1:**   
Information to scrape: 

<img src="information.png" width="800px">
      
The information to be scraped is as follows:
- Author
- Phrase
- Top 10 tags (first page)
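
The top 10 tags could be read directly from the first page rather than typed in by hand. The following is a minimal sketch, assuming the sidebar wraps each entry in a `<span class="tag-item">` with an `<a class="tag">` inside (an assumption to verify in the browser inspector); `sidebar_top_tags` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup


def sidebar_top_tags(first_page_html, limit=10):
    """Return the tag names listed in the sidebar, in page order."""
    soup = BeautifulSoup(first_page_html, "html.parser")
    # Assumed markup: <span class="tag-item"><a class="tag" href="...">love</a></span>
    # Tags attached to individual quotes are not wrapped in span.tag-item,
    # so this selector skips them.
    links = soup.select("span.tag-item a.tag")
    return [link.get_text(strip=True) for link in links][:limit]
```

The function takes the HTML as a string, so it can be fed `requests.get(...).text` for the first page.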

The idea is to compare the tags of each quote with the top 10 tags. For example, if the top 10 tags are:
* ['love', 'inspirational', 'life', 'humor', 'books', 'reading', 'friendship', 'friends', 'truth', 'simile']

And the quote's tags are:
* ['change', 'simile', 'love']

The tags **'simile'** and **'love'** are in the top 10, so this quote should have the following values:
* **[1,0,0,0,0,0,0,0,0,1]**   

Each position holds a 1 if the quote has the corresponding top 10 tag and a 0 if it does not.
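
The comparison described above can be sketched as a small function. `encode_tags` is an illustrative name; it returns one 0/1 flag per top tag, in the same order as the top 10 list.

```python
def encode_tags(quote_tags, top_tags):
    """Return a 0/1 flag for each top tag, in the order of top_tags."""
    present = set(quote_tags)  # set lookup avoids rescanning the list per tag
    return [1 if tag in present else 0 for tag in top_tags]


top_tags = ['love', 'inspirational', 'life', 'humor', 'books',
            'reading', 'friendship', 'friends', 'truth', 'simile']
flags = encode_tags(['change', 'simile', 'love'], top_tags)
print(flags)  # [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
```

Tags outside the top 10, such as `'change'`, are simply ignored.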

Example of the table to be obtained:

 |quote|author|love|inspirational|life|humor|books|reading|friendship|friends|truth|simile|
 |-----|-----|-----|------|-----------|--------|------|------|------|------|------|------|
  |...|...|...|...|...|...|...|...|...|...|...|...|
 |two way live life one though nothing miracle t...|Albert Einstein|0|1|1|0|0|0|0|0|0|0|
 |...|...|...|...|...|...|...|...|...|...|...|...|

**The `quote` text needs to be cleaned.** Use **lemmatization.**   
Keep in mind that the **tags may not always be the same**, so you will need to **find the top 10 tags yourself; they can differ from the example provided.**
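
One detail of the cleaning step is worth a sketch: deleting punctuation outright can fuse words that are separated only by punctuation (`you...I` becomes `youi`), so replacing punctuation with a space is safer. The helper below is hypothetical and uses only the standard library; the `lemmatize` and `stop_words` parameters are placeholders where NLTK's `WordNetLemmatizer().lemmatize` and `stopwords.words('english')` could be plugged in.

```python
import re
import string

# ASCII punctuation plus curly quotes, ellipsis, and long dashes,
# which string.punctuation does not cover
PUNCT_CHARS = string.punctuation + "\u201c\u201d\u2018\u2019\u2026\u2014\u2013"
PUNCT_RE = re.compile("[" + re.escape(PUNCT_CHARS) + "]")


def clean_text(text, lemmatize=lambda word: word, stop_words=frozenset()):
    """Lowercase, replace punctuation with spaces, drop stop words, lemmatize."""
    spaced = PUNCT_RE.sub(" ", text.lower())
    tokens = spaced.split()  # split() also collapses repeated spaces
    return " ".join(lemmatize(tok) for tok in tokens if tok not in stop_words)
```

Note that this rule also splits contractions such as `don't` into `don` and `t`, so the stop-word list should be chosen with that in mind.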

(The data types of the variables must be correct.)   

**This file should be saved as "data1.csv".** 


---  

### **Part 2:**  

Once we have obtained the table, we need to apply some transformations to it.  

1. Select 3 tags from the top 10 tags (any three you like)
2. Filter the table to keep only the quotes that have at least one of the selected tags   
3. Sort the table by author in ascending order
4. Reset the index of the table
5. **Save the table as "data2.csv"**
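
Steps 2 to 4 can be sketched with pandas as follows, assuming the tag columns hold 0/1 integers as in Part 1. `filter_by_tags` and the toy table are illustrative only.

```python
import pandas as pd


def filter_by_tags(df, selected_tags):
    """Keep rows with at least one selected tag, sorted by author, index reset."""
    # True for rows where any of the selected tag columns equals 1
    mask = df[selected_tags].eq(1).any(axis=1)
    return df[mask].sort_values(by="author").reset_index(drop=True)


# Toy data standing in for the Part 1 table
toy = pd.DataFrame({
    "quote": ["q1", "q2", "q3"],
    "author": ["Mark Twain", "Albert Einstein", "Jane Austen"],
    "love": [1, 0, 0],
    "life": [0, 1, 0],
    "humor": [0, 0, 0],
    "books": [0, 0, 1],
})
result = filter_by_tags(toy, ["love", "life", "humor"])
print(result)
```

Step 5 would then be `result.to_csv("data2.csv", index=False)`.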

--- 

### *Information to be delivered*:

A 📂 folder with the following files:  
 
 <span style="color:salmon">**1. Notebook**</span>.

In addition to the code, you must include the following:
* An explanation of how you cleaned the text
* An explanation of how you collected the links

<span style="color:salmon">**2. The 2 csv files**</span>.

### *Warnings*:
<div style="background-color: lightred; padding: 10px; border: 1px solid #ff0000;">
    <strong>⚠ Warning:</strong> 
        <br>
        1. You may only use the materials covered in class.
        <br>
        2. The code must run without errors.
        <br>
        3. The folder must be named "Name1_Surname1".
</div>


### **1. Libraries**
---

In [7]:
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

In [8]:
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
import unicodedata
import nltk

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/danteschrantz/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/danteschrantz/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/danteschrantz/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/danteschrantz/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [9]:
# import the libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd
from collections import Counter
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from IPython.display import display
import IPython

# Initialize lemmatizer
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

# Base URL (HTTPS, as given in the assignment)
base_url = "https://quotes.toscrape.com"

# Clean the quote
def clean_quote(quote):
    # Remove punctuation and convert to lowercase
    quote = quote.translate(str.maketrans('', '', string.punctuation)).lower()
    # Create the tokens for each individual word
    words = word_tokenize(quote)
    # Remove stop words and lemmatize the words
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    return ' '.join(words)

# Collect all quotes, authors, and tags
quotes_data = []
all_tags = []
page = 1
while True:
    # Request pages one by one instead of assuming there are exactly 10
    response = requests.get(f"{base_url}/page/{page}/", timeout=10)
    # Stop on HTTP errors; an error page would not contain the end marker
    # checked below, so the loop would otherwise never terminate
    response.raise_for_status()
    # A page past the last one contains "No quotes found!", which ends the loop
    if "No quotes found!" in response.text:
        break
    soup = BeautifulSoup(response.text, 'html.parser')
    quotes = soup.find_all('div', class_='quote')
    # Process every quote on the current page
    for quote in quotes:
        # Quote text: <span class="text">, with the surrounding curly quotes stripped
        text = quote.find('span', class_='text').text.strip('“”')
        # Author: <small class="author">
        author = quote.find('small', class_='author').text
        # Tags: every <a class="tag"> inside the quote block
        tags = [tag.text for tag in quote.find_all('a', class_='tag')]
        # Clean the quote with clean_quote, defined above
        cleaned_text = clean_quote(text)
        # Store the cleaned text, author, and tags for this quote
        quotes_data.append([cleaned_text, author, tags])
        # Accumulate tags across all quotes so they can be counted afterwards
        all_tags.extend(tags)
    page += 1

# Count tag frequencies across all scraped quotes and keep the 10 most common
tag_counter = Counter(all_tags)
top_10_tags = [tag for tag, _ in tag_counter.most_common(10)]

# Create the DataFrame and create the binary columns for the top 10 tags
quotesDF = pd.DataFrame(quotes_data, columns=['quote', 'author', 'tags'])
for tag in top_10_tags:
    quotesDF[tag] = quotesDF['tags'].apply(lambda x: 1 if tag in x else 0)

# Drop the tags column
quotesDF.drop(columns=['tags'], inplace=True)

# Save the DataFrame and display it 
quotesDF.to_csv('data1.csv', index=False)
print("Data1")
IPython.display.display_html('<style>table {max-height: 300px; overflow-y: scroll; display: block;}</style>', raw=True)
display(quotesDF)

# Part 2
# Select 3 tags from the top 10 tags
selected_tags = top_10_tags[:3]

# Filter the DataFrame to only include quotes with at least one of the selected tags
filtered_df = quotesDF[(quotesDF[selected_tags] == 1).any(axis=1)]

# Sort by author in ascending order and reset the index
filtered_df = filtered_df.sort_values(by='author').reset_index(drop=True)

# Save the filtered DataFrame and display it
filtered_df.to_csv('data2.csv', index=False)
print("Data2")
IPython.display.display_html('<style>table {max-height: 300px; overflow-y: scroll; display: block;}</style>', raw=True)
display(filtered_df)


Data1


| |quote|author|love|inspirational|life|humor|books|reading|friendship|friends|truth|simile|
|---|---|---|---|---|---|---|---|---|---|---|---|---|
|0|world created process thinking changed without...|Albert Einstein|0|0|0|0|0|0|0|0|0|0|
|1|choice harry show truly far ability|J.K. Rowling|0|0|0|0|0|0|0|0|0|0|
|2|two way live life one though nothing miracle t...|Albert Einstein|0|1|1|0|0|0|0|0|0|0|
|3|person gentleman lady pleasure good novel must...|Jane Austen|0|0|0|1|1|0|0|0|0|0|
|4|imperfection beauty madness genius better abso...|Marilyn Monroe|0|1|0|0|0|0|0|0|0|0|
|...|...|...|...|...|...|...|...|...|...|...|...|...|
|95|never really understand person consider thing ...|Harper Lee|0|0|0|0|0|0|0|0|0|0|
|96|write book want written book difficult grownup...|Madeleine L'Engle|0|0|0|0|1|0|0|0|0|0|
|97|never tell truth people worthy|Mark Twain|0|0|0|0|0|0|0|0|1|0|
|98|person person matter small|Dr. Seuss|0|1|0|0|0|0|0|0|0|0|


Data2


| |quote|author|love|inspirational|life|humor|books|reading|friendship|friends|truth|simile|
|---|---|---|---|---|---|---|---|---|---|---|---|---|
|0|two way live life one though nothing miracle t...|Albert Einstein|0|1|1|0|0|0|0|0|0|0|
|1|life like riding bicycle keep balance must kee...|Albert Einstein|0|0|1|0|0|0|0|0|0|1|
|2|flower every time thought youi could walk gard...|Alfred Tennyson|1|0|0|0|0|0|1|0|0|0|
|3|life happens u making plan|Allen Saunders|0|0|1|0|0|0|0|0|0|0|
|4|better hated loved|André Gide|1|0|1|0|0|0|0|0|0|0|
|5|may first last loved may love love else matter...|Bob Marley|1|0|0|0|0|0|0|0|0|0|
|6|love vulnerable love anything heart wrung poss...|C.S. Lewis|1|0|0|0|0|0|0|0|0|0|
|7|never get cup tea large enough book long enoug...|C.S. Lewis|0|1|0|0|1|1|0|0|0|0|
|8|may gone intended go think ended needed|Douglas Adams|0|0|1|0|0|0|0|0|0|0|
|9|today truer true one alive youer|Dr. Seuss|0|0|1|0|0|0|0|0|0|0|
