<a href="https://colab.research.google.com/github/A4Git/Hyper-Island-AI-BC/blob/main/Webscraping/2_Webscraping_1_workbook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Webscraping project 1 - workbook

## What are the most interesting questions to ask from James Clear's newsletter?


One of my favorite newsletters I subscribe to is the [3-2-1 newsletter](https://jamesclear.com/3-2-1) by James Clear, author of the bestseller [Atomic Habits](https://jamesclear.com/atomic-habits).

![Question](https://i.redd.it/d3nl7ol629h11.jpg)



## The Goal 

**The goal is to scrape the data from all the newsletters into a dataframe and keep only the last part of each newsletter, i.e. the question:**

![q1](https://raw.githubusercontent.com/A4Git/Hyper-Island-AI-BC/main/Webscraping/img/bild.jpeg)


**The end result we are aiming for is a CSV file that looks like this one:**
![Goal](https://raw.githubusercontent.com/A4Git/Hyper-Island-AI-BC/main/Webscraping/img/bild_3.jpeg)



### How to?

![Coding](https://s3.amazonaws.com/rails-camp-tutorials/blog/programming+memes/programming-or-googling.jpg)

### Need help? 

#### There are a couple of ways to get help:
- **read the "docs" of the library you are working with. For the ones we are using today you can find them here:**
    - [ ] **[Requests](https://docs.python-requests.org/en/latest/)**,
    - [ ] **[Beautiful Soup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)**,
    - [ ] **[pandas](https://pandas.pydata.org/docs/reference/io.html)**,
    - [ ] **[requests-html](https://docs.python-requests.org/projects/requests-html/en/latest/)**.
    
- **read a tutorial online:**
    - for example this one from Real Python: **[Beautiful Soup: Build a Web Scraper With Python](https://realpython.com/beautiful-soup-web-scraper-python/)**
    - look up some reference material online: **[Python Requests Module](https://www.w3schools.com/python/module_requests.asp)**
    - find more tutorials online, W3Schools is a great source: **[w3's Pandas Tutorial](https://www.w3schools.com/python/pandas/default.asp), [Scraping Medium with Python & Beautiful Soup](https://www.google.com/search?q=web+scraping+medium&oq=medium+webscra&aqs=chrome.1.69i57j0i10i22i30j69i64.7671j0j1&sourceid=chrome&ie=UTF-8)**.
   


- **If you have spent more than 30 minutes trying to solve one problem, ask for help: first your crew and then me. I will go through the notebook and a solution this Friday, before or after the TDS.**


#### There are some built-in help functions that may be useful to try out as well:
We can take a look at the `os` library with the `dir(x)` and `help(x)` functions.



In [None]:
import os
# dir(object) lists all the attributes and methods available on that object
dir(os.curdir) # for example: os.curdir is the string '.', so this lists the methods available on a string

['__add__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__getnewargs__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__mod__',
 '__mul__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rmod__',
 '__rmul__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'capitalize',
 'casefold',
 'center',
 'count',
 'encode',
 'endswith',
 'expandtabs',
 'find',
 'format',
 'format_map',
 'index',
 'isalnum',
 'isalpha',
 'isascii',
 'isdecimal',
 'isdigit',
 'isidentifier',
 'islower',
 'isnumeric',
 'isprintable',
 'isspace',
 'istitle',
 'isupper',
 'join',
 'ljust',
 'lower',
 'lstrip',
 'maketrans',
 'partition',
 'replace',
 'rfind',
 'rindex',
 'rjust',
 'rpartition',
 'rsplit',
 'rstrip',
 'split',
 'splitlines',
 'startswith',
 'strip',
 'swapcase',
 'title',
 'translate',
 'upper',


In [None]:
# The help function is also useful when you need to find out a bit more about the functions in the libraries we will be using:
help(os.chdir)

Help on built-in function chdir in module posix:

chdir(path)
    Change the current working directory to the specified path.
    
    path may always be specified as a string.
    On some platforms, path may also be specified as an open file descriptor.
      If this functionality is unavailable, using it raises an exception.



In [None]:
type(os.chdir)

builtin_function_or_method
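
The `dir` output above is dominated by double-underscore names that are rarely what you are looking for. A minimal sketch for hiding them follows; `public_names` is a hypothetical helper name made up for this example, not part of any library.

```python
import os


def public_names(obj):
    """Return the attribute names of obj that do not start with an underscore."""
    return [name for name in dir(obj) if not name.startswith("_")]


# os.curdir is the string ".", so these are the public string methods
names = public_names(os.curdir)
print(names[:5])
```

Once a name looks promising, `help(str.strip)` (or any other name from the list) prints its documentation.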

### The Scope

The scope of this notebook is to learn by doing. It's a bare-bones notebook with some code to get you started, some references and the end result we are aiming for. The rest (googling, trying out code, and debugging) is up to you, and that is in my opinion the best way to learn. Whenever possible, **write the code yourself**, even when you find it online, because Ctrl+C / Ctrl+V does not translate into learning.


The outline of this project includes the following steps:

* [ ] Downloading web pages using the requests library
* [ ] Inspecting the HTML source code of a web page
* [ ] Parsing parts of a website using Beautiful Soup
* [ ] Writing parsed information into CSV files

**First:**
1. Get it done (make a draft that runs without errors)
2. Then make it nice (iterate over the draft and try to make it explainable, comment the code, give some context, etc.)


#### Install and import the libraries

If you don't have the libraries installed, just remove the `#` and run the code below:

In [None]:
# For step 1
# !pip install requests -q 
# !pip install pandas -q 
# !pip install beautifulsoup4 -q 

# For step 2 
!pip install requests-html -q  

  Building wheel for fake-useragent (setup.py) ... done
  Building wheel for parse (setup.py) ... done
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
datascience 0.10.6 requires folium==0.2.1, but you have folium 0.8.3 which is incompatible.


In [None]:
# The libraries we will be using today are:
import requests # for downloading the site
from bs4 import BeautifulSoup # for parsing the html download 
import pandas as pd  # for creating a dataframe and exporting it to a .csv file 

### Setup the query and url

In [None]:
url = 'https://jamesclear.com/3-2-1'

# Now let's download the page
html = requests.get(url)

# If we get the status code 200, the request succeeded and we got the page
html.status_code 


200
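
Looking at the status code by eye works for one page, but once several pages are downloaded it helps to stop as soon as something goes wrong. The sketch below shows one way to do that. `ensure_ok` is a hypothetical helper name, and the two `SimpleNamespace` objects are stand-ins for real responses so the example runs without a network connection.

```python
from types import SimpleNamespace


def ensure_ok(response):
    """Return the response if its status code is 200, otherwise raise an error."""
    if response.status_code != 200:
        raise RuntimeError(f"Download failed with status code {response.status_code}")
    return response


# Stand-ins for real responses (anything with a .status_code attribute works)
ok = SimpleNamespace(status_code=200)
missing = SimpleNamespace(status_code=404)

ensure_ok(ok)  # passes silently
try:
    ensure_ok(missing)
except RuntimeError as error:
    print(error)
```

With a real `requests` response, calling `response.raise_for_status()` is a ready-made alternative that raises an exception for 4xx and 5xx status codes.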

### Inspect the website

Now let's take a look at the `html` code of the website by inspecting the site using developer tools. We'll do this to identify what to search for to get the data we want: 

![Html](https://raw.githubusercontent.com/A4Git/Hyper-Island-AI-BC/main/Webscraping/img/bild_2.jpeg)

We are looking for tags like `div`, `p` and `a`, and for the `id` or `class` of the tags that hold the information we want.
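
To get a feeling for what "tags with a `class` or `id`" look like in code, here is a small sketch that lists every opening tag in a piece of HTML together with its `class` and `id`. It deliberately uses Python's standard-library `html.parser` instead of Beautiful Soup, so it runs without any installation. The HTML snippet is made up for illustration (only the class names and the link are taken from this notebook), and `TagLister` is a hypothetical class name.

```python
from html.parser import HTMLParser


class TagLister(HTMLParser):
    """Collect (tag, class, id) for every opening tag in a piece of HTML."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        self.found.append((tag, attributes.get("class"), attributes.get("id")))


# Made-up snippet; the real page structure may differ
sample_html = """
<div class="all-articles__news__post">
  <div class="all-articles__news__date">Mar 17</div>
  <a href="https://jamesclear.com/3-2-1/march-17-2022">3-2-1: Charity</a>
</div>
"""

lister = TagLister()
lister.feed(sample_html)
for tag, css_class, element_id in lister.found:
    print(tag, css_class, element_id)
```

The tag and class combinations printed here are exactly the kind of information you pass to a parser when you search for an element.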


### Extract the html data

Now, based on what we see on this page, I want to extract the following data for each newsletter:
- the article title
- the newsletter date
- the link to the whole newsletter page
- the questions

But before we get there we need to parse the `html` code we got back by downloading the site with `requests`.

In [None]:
# We use Beautiful Soup to read the HTML text
soup = BeautifulSoup(html.text, "html.parser") # "html.parser" is Python's built-in HTML parser
print(soup.prettify())

<!DOCTYPE doctype html>
<!--[if IE 9]><html class="lt-ie10" lang="lang="en-US"" > <![endif]-->
<html lang="en-US">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <link href="https://use.typekit.net/tqf2ebt.css" rel="stylesheet"/>
  <meta content="index, follow, max-image-preview:large, max-snippet:-1, max-video-preview:-1" name="robots">
   <!-- This site is optimized with the Yoast SEO Premium plugin v18.1 (Yoast SEO v18.3) - https://yoast.com/wordpress/plugins/seo/ -->
   <title>
    3-2-1 Thursday newsletter - James Clear
   </title>
   <link href="https://jamesclear.com/3-2-1" rel="canonical">
    <meta content="en_US" property="og:locale">
     <meta content="article" property="og:type">
      <meta content="3-2-1 Thursday newsletter" property="og:title"/>
      <meta content="https://jamesclear.com/3-2-1" property="og:url"/>
      <meta content="James Clear" pro

### Prototype the model with a single record

In [None]:
# Let's try to get the main tag that holds all the links
all_divs = ??

for y in all_divs:
    print(y.text)

2022
2021
2020
2019


In [None]:
# Let's try to get the date (tip: try `.text.strip()` to get only the date)
date = soup.find('div', class_='all-articles__news__date').text.strip()
date

'Mar 17'

In [None]:
# and the title (tip: try `.text` to get only the title)
title = ??
title

'3-2-1: Charity, the true mark of a pro, and how to choose what to read'

In [None]:
# How about the link to the newsletter?
lnk = ??
lnk

'https://jamesclear.com/3-2-1/march-17-2022'

### Generalize the model with a function

Now put all the code into a function (or several).

In [None]:
article_list = []
def get_newsletter_links():
    ??
    return article_list.append(??)

get_newsletter_links()

1

In [None]:
# Now we can try to get the content of each newsletter
def get_page(url):
    """Download a web page and return a beautiful soup doc"""
    # Download the page
    response = requests.get(url)
    
    # Check if download was successful
    if response.status_code ???:
      ???
    # Get the page HTML
    page_content = response.text
    
    # Create a bs4 doc
    soup = ??
    return soup

In [None]:
# Once we get the soup for each newsletter we can try to get the data we need:
def get_newsletter():
    ??

### Putting it all together

In [None]:
# function 1
??
# function 2
??
# function 3
??


In [None]:
# Save the results to a csv named "JC_newsletter.csv"
??

## Now let's try the requests-html library (part 2)
Now let's see how we can make it a bit more efficient with one integrated library, `requests-html`. `requests-html` is written by the creator of `requests` and includes an HTML parser, so we don't need Beautiful Soup to read the `html`.

In [None]:
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('https://jamesclear.com/3-2-1')

In [None]:
# Now let's see all the links
r.html.absolute_links 

{'https://jamesclear.com',
 'https://jamesclear.com/3-2-1',
 'https://jamesclear.com/3-2-1/april-1-2021',
 'https://jamesclear.com/3-2-1/april-15-2021',
 'https://jamesclear.com/3-2-1/april-16-2020',
 'https://jamesclear.com/3-2-1/april-2-2020',
 'https://jamesclear.com/3-2-1/april-22-2021',
 'https://jamesclear.com/3-2-1/april-23-2020',
 'https://jamesclear.com/3-2-1/april-29-2021',
 'https://jamesclear.com/3-2-1/april-30-2020',
 'https://jamesclear.com/3-2-1/april-8-2021',
 'https://jamesclear.com/3-2-1/april-9-2020',
 'https://jamesclear.com/3-2-1/august-12-2021',
 'https://jamesclear.com/3-2-1/august-13-2020',
 'https://jamesclear.com/3-2-1/august-15-2019',
 'https://jamesclear.com/3-2-1/august-19-2021',
 'https://jamesclear.com/3-2-1/august-20-2020',
 'https://jamesclear.com/3-2-1/august-22-2019',
 'https://jamesclear.com/3-2-1/august-26-2021',
 'https://jamesclear.com/3-2-1/august-27-2020',
 'https://jamesclear.com/3-2-1/august-29-2019',
 'https://jamesclear.com/3-2-1/august-5-20

Now we can get all the relevant links with a [list comprehension](https://realpython.com/list-comprehension-python/)

In [None]:
nl_links= [link for link in r.html.absolute_links if "/3-2-1/" in link]
len(nl_links) # we got 137 newsletter links

137
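
Note that `absolute_links` is a set, so the links come back in no particular order. The sketch below repeats the filter on a small hand-made sample (four of the links shown above, standing in for the real set) and adds `sorted()` so the result is the same on every run. The variable names are only for illustration.

```python
# Small sample standing in for r.html.absolute_links, which is a set
sample_links = {
    "https://jamesclear.com",
    "https://jamesclear.com/3-2-1",
    "https://jamesclear.com/3-2-1/april-1-2021",
    "https://jamesclear.com/3-2-1/august-15-2019",
}

# Keep only newsletter pages; sorted() gives the same order on every run
newsletter_links = sorted(link for link in sample_links if "/3-2-1/" in link)
print(newsletter_links)
```

The overview page `https://jamesclear.com/3-2-1` is dropped because it does not contain `/3-2-1/` with a trailing slash.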

In [None]:
# Let's try to get only the title of the newsletter
posts = r.html.find('.all-articles__news__post')  # named posts to avoid shadowing the built-in all()
posts[0].text.strip().replace("3-2-1:", "")

' Charity, the true mark of a pro, and how to choose what to read'

In [None]:
# Now let's get all the newsletters and links
article_list = []

posts = r.html.find('.all-articles__news__post')  # named posts to avoid shadowing the built-in all()
for item in posts: 
    titel = item.find('.all-articles__news__post', first=True).text.replace("3-2-1:", " ")
    link = item.find('.all-articles__news__post', first=True).links
    article = {
        'Titel' : titel,
        'Url' : link
        }
    article_list.append(article)

len(article_list)

137

In [None]:
# Now let's explore how to get the questions from each newsletter,
# but first we should try it with a small sample of the newsletters
test_nl_links = nl_links[:10]
len(test_nl_links), test_nl_links

(10,
 ['https://jamesclear.com/3-2-1/june-3-2021',
  'https://jamesclear.com/3-2-1/december-9-2021',
  'https://jamesclear.com/3-2-1/june-4-2020',
  'https://jamesclear.com/3-2-1/march-3-2022',
  'https://jamesclear.com/3-2-1/december-17-2020',
  'https://jamesclear.com/3-2-1/january-23-2020',
  'https://jamesclear.com/3-2-1/august-19-2021',
  'https://jamesclear.com/3-2-1/january-30-2020',
  'https://jamesclear.com/3-2-1/november-19-2020',
  'https://jamesclear.com/3-2-1/may-13-2021'])

In [None]:
# For now this is what we have in terms of questions:
for nl in test_nl_links:
    get_nl = session.get(nl)
    #url = get_nl.url
    #date= get_nl.url.rsplit("/")[4].replace('-',' ').capitalize()
    #question_header = get_nl.html.find('h2', containing='1 QUESTION', first=True).text,
    question = get_nl.html.find("h2#h-1-question-for-you~p", first=True).full_text.strip()
    print(question)

Do the people around me act the way I wish to act?
Are you playing a game worth winning?
What is the biggest small thing I could do today?
Think of something you struggled with in the last year. If you step back and zoom out, what is one lesson you have learned from the experience?
What is one small thing I could do today that would make a meaningful impact on my future?
Am I tolerating my flaws or improving them?
Six months from now, what you will you wish you had spent time on today?
Will this matter in six months?
What would your closest friend tell you to do?
What is a small pleasure that brings me great joy? Can I enjoy it today?


In [None]:
# Now let's try to put it together in one loop

data = []

for nl in test_nl_links:
    get_nl = session.get(nl)
    title = get_nl.html.find("div.page__header>h1", first=True).text.replace("3-2-1:", "")
    url = get_nl.url
    date= get_nl.url.rsplit("/")[4].replace('-',' ').capitalize()
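    try:
        question = get_nl.html.find("h2#h-1-question-for-you~p", first=True).full_text.strip()
    except AttributeError:
        # find() returns None when the heading is missing, so store None instead of reusing the previous question
        question = None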
    data.append([date, title, question, url])

data

[['June 3 2021',
  ' On growth, all-or-nothing mindsets, and how great art evolves with us',
  'Do the people around me act the way I wish to act?',
  'https://jamesclear.com/3-2-1/june-3-2021'],
 ['December 9 2021',
  ' Active patience, focusing on what you can control, and playing a game worth winning',
  'Are you playing a game worth winning?',
  'https://jamesclear.com/3-2-1/december-9-2021'],
 ['June 4 2020',
  ' On taking action, changing incentives, and belonging',
  'What is the biggest small thing I could do today?',
  'https://jamesclear.com/3-2-1/june-4-2020'],
 ['March 3 2022',
  ' Doubling your intelligence, zooming out, and finding value in anger',
  'Think of something you struggled with in the last year. If you step back and zoom out, what is one lesson you have learned from the experience?',
  'https://jamesclear.com/3-2-1/march-3-2022'],
 ['December 17 2020',
  ' On blame, the purpose of education, and compounding choices',
  'What is one small thing I could do today 
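
When a lookup inside a loop can fail, it matters what the variable holds afterwards: if the failure branch assigns nothing, the variable still carries the value from the previous iteration and a wrong question ends up in the row. The sketch below illustrates the safer pattern on made-up stand-in data (plain dictionaries instead of downloaded pages), so the names and values are assumptions for the example only.

```python
# Made-up stand-ins for downloaded pages; the second one has no question
pages = [
    {"url": "page-1", "question": "Will this matter in six months?"},
    {"url": "page-2"},
]

rows = []
for page in pages:
    try:
        question = page["question"]
    except KeyError:
        # Reset explicitly so the previous page's question is not reused
        question = None
    rows.append([page["url"], question])

print(rows)
```

Rows with `None` are easy to spot and filter later in pandas, for example with `df.dropna()`.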

In [None]:
# Now we can try to get all the questions:

all_newsletter_data = []

for nl in nl_links:
    r = session.get(nl)
    title = r.html.find("div.page__header>h1", first=True).text.replace("3-2-1:", "")
    url = r.url
    date= r.url.rsplit("/")[4].replace('-',' ').capitalize()
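    try:
        question = r.html.find("h2#h-1-question-for-you~p", first=True).full_text.strip()
    except AttributeError:
        # find() returns None when the heading is missing, so store None instead of reusing the previous question
        question = None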
    all_newsletter_data.append([date, title, question, url ]) 

all_newsletter_data


[['June 3 2021',
  ' On growth, all-or-nothing mindsets, and how great art evolves with us',
  'Do the people around me act the way I wish to act?',
  'https://jamesclear.com/3-2-1/june-3-2021'],
 ['December 9 2021',
  ' Active patience, focusing on what you can control, and playing a game worth winning',
  'Are you playing a game worth winning?',
  'https://jamesclear.com/3-2-1/december-9-2021'],
 ['June 4 2020',
  ' On taking action, changing incentives, and belonging',
  'What is the biggest small thing I could do today?',
  'https://jamesclear.com/3-2-1/june-4-2020'],
 ['March 3 2022',
  ' Doubling your intelligence, zooming out, and finding value in anger',
  'Think of something you struggled with in the last year. If you step back and zoom out, what is one lesson you have learned from the experience?',
  'https://jamesclear.com/3-2-1/march-3-2022'],
 ['December 17 2020',
  ' On blame, the purpose of education, and compounding choices',
  'What is one small thing I could do today 

In [None]:
# So now we have all the newsletters and their questions
len(all_newsletter_data)

137

In [None]:
# Now let's make a dataframe from all the data we collected:
df = pd.DataFrame(all_newsletter_data, columns= ["date", "title","1_question_for_you", "link"])
df.head(5)

| | date | title | 1_question_for_you | link |
|---|---|---|---|---|
| 0 | June 3 2021 | On growth, all-or-nothing mindsets, and how g... | Do the people around me act the way I wish to ... | https://jamesclear.com/3-2-1/june-3-2021 |
| 1 | December 9 2021 | Active patience, focusing on what you can con... | Are you playing a game worth winning? | https://jamesclear.com/3-2-1/december-9-2021 |
| 2 | June 4 2020 | On taking action, changing incentives, and be... | What is the biggest small thing I could do today? | https://jamesclear.com/3-2-1/june-4-2020 |
| 3 | March 3 2022 | Doubling your intelligence, zooming out, and ... | Think of something you struggled with in the l... | https://jamesclear.com/3-2-1/march-3-2022 |
| 4 | December 17 2020 | On blame, the purpose of education, and compo... | What is one small thing I could do today that ... | https://jamesclear.com/3-2-1/december-17-2020 |
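
The `date` column holds plain strings such as `June 3 2021`, which sort alphabetically rather than chronologically. One possible way to get real dates is to parse the last part of each link, as in the sketch below. `parse_slug_date` is a hypothetical helper name, and the approach assumes every link ends in a `month-day-year` slug with English month names, as the links shown in this notebook do.

```python
from datetime import datetime


def parse_slug_date(url):
    """Turn the last part of a newsletter URL, e.g. 'june-3-2021', into a date."""
    slug = url.rstrip("/").rsplit("/", 1)[-1]
    return datetime.strptime(slug, "%B-%d-%Y").date()


urls = [
    "https://jamesclear.com/3-2-1/june-3-2021",
    "https://jamesclear.com/3-2-1/december-9-2021",
    "https://jamesclear.com/3-2-1/june-4-2020",
]

# Sort the links from oldest to newest
for url in sorted(urls, key=parse_slug_date):
    print(parse_slug_date(url), url)
```

A link that does not follow this pattern raises a `ValueError`, which is a useful signal that the page needs a closer look.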


### All in one 

In [None]:
def get_links():
    """Extract all the links to the newsletters"""
    r = session.get('https://jamesclear.com/3-2-1')
    nl_links = [link for link in r.html.absolute_links if "/3-2-1/" in link]
    return nl_links

def get_newsletter_questions(nl_links=None):
    """Extract the date, title, question and link for all newsletters"""
    if nl_links is None:
        # No links passed in, so collect them first
        nl_links = get_links()
    all_newsletter_data = []
    for nl in nl_links:
        ra = session.get(nl)
        title = ra.html.find("div.page__header>h1", first=True).text.replace("3-2-1:", "")
        url = ra.url
        date = url.rsplit("/")[4].replace('-',' ').capitalize()
        try:
            question = ra.html.find("h2#h-1-question-for-you~p", first=True).full_text.strip()
        except AttributeError:
            # No question found on this page: store None rather than the previous question
            question = None
        all_newsletter_data.append([date, title, question, url])
    df = pd.DataFrame(all_newsletter_data, columns=["date", "title", "1_question_for_you", "link"])
    return df


**Now we can call the functions to get the data:**

In [None]:
# Keep the results so the dataframe can be saved in the next step
nl_links = get_links()
df = get_newsletter_questions()
nl_links, df

(['https://jamesclear.com/3-2-1/june-3-2021',
  'https://jamesclear.com/3-2-1/december-9-2021',
  'https://jamesclear.com/3-2-1/june-4-2020',
  'https://jamesclear.com/3-2-1/march-3-2022',
  'https://jamesclear.com/3-2-1/december-17-2020',
  'https://jamesclear.com/3-2-1/january-23-2020',
  'https://jamesclear.com/3-2-1/august-19-2021',
  'https://jamesclear.com/3-2-1/january-30-2020',
  'https://jamesclear.com/3-2-1/november-19-2020',
  'https://jamesclear.com/3-2-1/may-13-2021',
  'https://jamesclear.com/3-2-1/september-3-2020',
  'https://jamesclear.com/3-2-1/april-9-2020',
  'https://jamesclear.com/3-2-1/september-9-2021',
  'https://jamesclear.com/3-2-1/august-22-2019',
  'https://jamesclear.com/3-2-1/august-29-2019',
  'https://jamesclear.com/3-2-1/august-20-2020',
  'https://jamesclear.com/3-2-1/september-16-2021',
  'https://jamesclear.com/3-2-1/may-21-2020',
  'https://jamesclear.com/3-2-1/december-16-2021',
  'https://jamesclear.com/3-2-1/october-1-2020',
  'https://jamesclea

**Now let's save the data to a `.csv` file**

In [None]:
df.to_csv('JC_questions.csv', index=False)
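
After saving, it is worth reading the file back once to confirm that the columns and rows look as expected. The sketch below shows the idea with Python's standard-library `csv` module instead of pandas. It writes two rows copied from the output above to a temporary file first, so the file name `sample_questions.csv` and the temporary folder are assumptions made for the demo.

```python
import csv
import os
import tempfile

header = ["date", "title", "1_question_for_you", "link"]
sample_rows = [
    ["June 3 2021",
     "On growth, all-or-nothing mindsets, and how great art evolves with us",
     "Do the people around me act the way I wish to act?",
     "https://jamesclear.com/3-2-1/june-3-2021"],
    ["December 9 2021",
     "Active patience, focusing on what you can control, and playing a game worth winning",
     "Are you playing a game worth winning?",
     "https://jamesclear.com/3-2-1/december-9-2021"],
]

# Write the sample to a temporary file so the demo leaves the working folder untouched
path = os.path.join(tempfile.mkdtemp(), "sample_questions.csv")
with open(path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerows(sample_rows)

# Read it back and check what was stored
with open(path, newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

print(len(rows), "rows")
print(rows[0]["1_question_for_you"])
```

The commas inside the titles survive the round trip because the `csv` module quotes those fields automatically, and `DataFrame.to_csv` does the same.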

**But what if we would like to get all the text from one newsletter?**


In [None]:
# It's easy to get all the data from the newsletter, but it can be a lot of text and not so easy to read in the notebook. 
get_one_nl = session.get(nl_links[5])
h_post = get_one_nl.html.find('h2')
p_post = get_one_nl.html.find('div p')
all_h = ([h.text for h in h_post])
all_p = ([p.text for p in p_post])
all_h, all_p


(['3 IDEAS FROM ME', '2 QUOTES FROM OTHERS', '1 QUESTION FOR YOU', 'Join Me'],
 ['I.',
  '“Focus is the art of knowing what to ignore.”',
  'II.',
  '“On minimalism:',
  'The goal is not to have the least amount of things, but the optimal amount of things.',
  'Two important footnotes:',
  '(1) The optimal amount depends on your goals.\n(2) The optimal amount is almost always less than you think.”',
  'III.',
  '“Reading is like a software update for your brain.',
  "Whenever you learn a new concept or idea, the ‘software' improves. You download new features and fix old bugs.",
  'In this way, reading a good book can give you a new way to view your life experiences. Your past is fixed, but your interpretation of it can change depending on the software you use to analyze it.”',
  'I.',
  'Jeffrey D. Sachs, an economist and author, on money, spending, and status:',
  '“Living doesn’t cost much, but showing off does.”',
  'Source: The Price of Civilization: Reawakening American Virtue and

### Summary 

In summary, we learned to get data from one of the most popular newsletters online, by the author and blogger James Clear. We explored how to get the data with the libraries `requests` and Beautiful Soup, and with the more integrated library `requests-html`. With `requests-html` we could get the job done with fewer than 20 lines of code. 

The result is a list of 137 interesting questions to ponder, each with the date, newsletter title, question and link to the full newsletter. 

For further work it would be interesting to explore getting data via an API, both an official one and via a `.json` call. Furthermore it would be great to get all the data and explore more ways to work with text in Python and pandas. 

**A reminder: this notebook only serves as a learning exercise. Please check the site's robots.txt (site/robots.txt) before you start web scraping to see what, if anything, you are allowed to scrape on the site you are interested in. Scraping lots of data may get you banned, as it increases the bandwidth the site needs and can lead to higher hosting costs, so please be mindful and don't abuse this tool.**

/AG
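
The robots.txt check from the reminder above can also be done in code with Python's standard-library `urllib.robotparser`. The sketch below parses a made-up set of rules so that it runs offline. The rules and the `example.com` addresses are assumptions for illustration and say nothing about what any real site allows.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration; a real check would read the site's own robots.txt
example_rules = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(example_rules)

# Ask whether a generic crawler ("*") may fetch each address
allowed = parser.can_fetch("*", "https://example.com/3-2-1/june-3-2021")
blocked = parser.can_fetch("*", "https://example.com/private/page")
print(allowed, blocked)
```

For a live site, calling `parser.set_url(...)` with the address of the robots.txt file followed by `parser.read()` downloads and parses the real rules instead.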

### Further reading / Resources: 
- [Requests-HTML: The modern way of web scraping](https://medium.com/analytics-vidhya/the-modern-way-of-web-scraping-requests-html-2567ba2554f4)
- [Web Scraping with Python Guide](https://youtu.be/J91bHusPatc) video
- [Slow Web Scraper? Try this with ASYNC and Requests-html](https://youtu.be/8drEB06QjLs) video
- [Python Tutorial: Web Scraping with Requests-HTML](https://youtu.be/a6fIbtFB46g) video