#### BAX 422 Data Design & Representation - Individual Project 1
#### Section 1 Ji Hyun Kim

**Overview**

In this project, you will focus on web scraping with Python, specifically targeting the "free" category in the San Francisco Bay Area section of Craigslist (https://sfbay.craigslist.org/search/zip). Your task is to develop a Python script that scrapes the first 250 item listings, saves the HTML pages of these item detail pages to disk, and then, in a separate process, reads these HTML files from disk to extract and display specific details about each listing.

##### **Part 1: Scraping and Saving HTML Content**

**1. Identify the Target:**

Start with navigating to the “free” section on the Craigslist San Francisco Bay Area site (https://sfbay.craigslist.org/search/zip).

This page lists items that people are giving away for free.

**2. Interact with the Page-Sorting:**

**Observing Changes in the URL after Changing Sorting Order**
- Before: https://sfbay.craigslist.org/search/zip#search=1~gallery~0~0
- After changing the sorting order to "oldest": https://sfbay.craigslist.org/search/zip?sort=dateoldest#search=1~gallery~0~0

→ After changing the sorting order, the URL changes to reflect this new sorting criterion. **?sort=dateoldest** added to the URL.

**Triggering Sorting Change in the URL**

→ We can trigger sorting change by modifying the URL in the browser’s address bar. By changing the `sort` query embedded in the URL, sorting order of the listings can be switched. In this particular case, changing query parameter to `?sort=date` swithes the sorting option to "newest".

- https://sfbay.craigslist.org/search/zip?sort=date#search=1~gallery~0~0

**Explain what type of request is made when you change the sort order (GET or POST)**

→ In this case GET request was used to change the sorting order. The fact that this change can be triggered by modifying the URL directly supports the idea that it's a GET request, as GET requests encode the parameters visibly in the URL. On the other hand, in POST request, the data is not exposed in the URL. 

**Variable Associated with Sorting**

→ The variable in the URL associated with sorting is `sort`. `sort=dateoldest` sorts the listings from oldest to newest, while `sort=date` sorts the listings from newest to oldest.

**3. Interact with the Page-Pagination:**

**Move between pages by only chaning the URL**

- First page: https://sfbay.craigslist.org/search/zip?sort=dateoldest#search=1~gallery~0~0
- Second Page: https://sfbay.craigslist.org/search/zip?sort=dateoldest#search=1~gallery~1~0
- Third Page: https://sfbay.craigslist.org/search/zip?sort=dateoldest#search=1~gallery~2~0

→ We can move between pages by changing the number comes after `gallery~`. First page is mapped with `0`, second page `1`, and third page `2`, and so forth.

**Variable associated with page changes**

→ Changing the number in the URL directly affects which page of listings we view. By increasing or decreasing this number, we can navigate forward or backward through the listing pages.


**4. Fetch Listing URLs:**

Use `requests` to access the first page of the “free” section, ordered “newest” first.

Deploy `BeautifulSoup` to parse the HTML content.

Identify the structure that holds the links to individual listing pages.  What selector do you choose to grab the link?

Can you identify one more possible selection method to retrieve the link to the individual listing?  Explain.

Extract the first 250 unique listing URLs and save them to a list.  Consider the pagination feature of Craigslist to navigate through pages.  Explain your strategy.

Print the list to screen.

In [153]:
# Load Python libraries
from bs4 import BeautifulSoup
import requests
import time

In [154]:
time.sleep(10)

url = "https://sfbay.craigslist.org/search/zip#search=1~gallery~0~0"
headers = {'User-agent': 'Mozilla/5.0'}
page = requests.get(url,headers=headers)
soup = BeautifulSoup(page.text, 'html.parser')
#print(soup)

**First Method** : start from 'a' tag

In [156]:
# Select all 'a's with find_all
links = soup.find_all('a')
#print(links)

In [157]:
# Extract href
urls = [link.get('href') for link in links]
urls = urls[2:] # Remove first two special chracters
urls_250 = urls[0:250] # Select the first 250 listings only
print(urls_250)
len(urls_250)

['https://sfbay.craigslist.org/pen/zip/d/south-san-francisco-free-garden-benches/7715228640.html', 'https://sfbay.craigslist.org/nby/zip/d/santa-rosa-ethan-allen-whitney-sofa/7715228387.html', 'https://sfbay.craigslist.org/pen/zip/d/mountain-view-firewood/7715227930.html', 'https://sfbay.craigslist.org/eby/zip/d/hayward-reef-sand/7713797530.html', 'https://sfbay.craigslist.org/nby/zip/d/novato-crib-with-mattress/7712881922.html', 'https://sfbay.craigslist.org/sfc/zip/d/san-francisco-moving-out-bed-tiny-desk/7715226780.html', 'https://sfbay.craigslist.org/sby/zip/d/san-jose-free-upright-yamaha-piano/7715210611.html', 'https://sfbay.craigslist.org/sby/zip/d/sunnyvale-2006-bosch-dishwasher-works/7715225326.html', 'https://sfbay.craigslist.org/scz/zip/d/santa-cruz-free-outdoor-patio-wood/7712235825.html', 'https://sfbay.craigslist.org/eby/zip/d/fremont-set-of-road-flares-20-minute-red/7715224622.html', 'https://sfbay.craigslist.org/eby/zip/d/albany-free-queen-duvet-insert-and/7715222808.ht

250

In [158]:
# Organize the output
for url in urls_250:
    print(f"{url}\n")

https://sfbay.craigslist.org/pen/zip/d/south-san-francisco-free-garden-benches/7715228640.html

https://sfbay.craigslist.org/nby/zip/d/santa-rosa-ethan-allen-whitney-sofa/7715228387.html

https://sfbay.craigslist.org/pen/zip/d/mountain-view-firewood/7715227930.html

https://sfbay.craigslist.org/eby/zip/d/hayward-reef-sand/7713797530.html

https://sfbay.craigslist.org/nby/zip/d/novato-crib-with-mattress/7712881922.html

https://sfbay.craigslist.org/sfc/zip/d/san-francisco-moving-out-bed-tiny-desk/7715226780.html

https://sfbay.craigslist.org/sby/zip/d/san-jose-free-upright-yamaha-piano/7715210611.html

https://sfbay.craigslist.org/sby/zip/d/sunnyvale-2006-bosch-dishwasher-works/7715225326.html

https://sfbay.craigslist.org/scz/zip/d/santa-cruz-free-outdoor-patio-wood/7712235825.html

https://sfbay.craigslist.org/eby/zip/d/fremont-set-of-road-flares-20-minute-red/7715224622.html

https://sfbay.craigslist.org/eby/zip/d/albany-free-queen-duvet-insert-and/7715222808.html

https://sfbay.crai

**Alternative method**: start from 'ol' tag

In [159]:
# All listings are included in 'ol' object. Find 'ol' tag
ol = soup.find('ol')
print(ol)

<ol class="cl-static-search-results">
<li class="cl-static-hub-links">
<div>see also</div>
</li>
<li class="cl-static-search-result" title="Free garden benches and table">
<a href="https://sfbay.craigslist.org/pen/zip/d/south-san-francisco-free-garden-benches/7715228640.html">
<div class="title">Free garden benches and table</div>
<div class="details">
<div class="price">$0</div>
<div class="location">
                        foster city
                    </div>
</div>
</a>
</li>
<li class="cl-static-search-result" title="Ethan Allen Whitney Sofa">
<a href="https://sfbay.craigslist.org/nby/zip/d/santa-rosa-ethan-allen-whitney-sofa/7715228387.html">
<div class="title">Ethan Allen Whitney Sofa</div>
<div class="details">
<div class="price">$0</div>
<div class="location">
                        Santa Rosa
                    </div>
</div>
</a>
</li>
<li class="cl-static-search-result" title="Firewood">
<a href="https://sfbay.craigslist.org/pen/zip/d/mountain-view-firewood/7715227930.ht

In [160]:
# Find all 'a' tags in the ol selected
links = ol.find_all('a')
urls = [link.get('href') for link in links]
urls_250 = urls[0:250] # Extract the first 250 urls

# Organize the output and print urls
for url in urls_250:
    print(f"{url}\n")

https://sfbay.craigslist.org/pen/zip/d/south-san-francisco-free-garden-benches/7715228640.html

https://sfbay.craigslist.org/nby/zip/d/santa-rosa-ethan-allen-whitney-sofa/7715228387.html

https://sfbay.craigslist.org/pen/zip/d/mountain-view-firewood/7715227930.html

https://sfbay.craigslist.org/eby/zip/d/hayward-reef-sand/7713797530.html

https://sfbay.craigslist.org/nby/zip/d/novato-crib-with-mattress/7712881922.html

https://sfbay.craigslist.org/sfc/zip/d/san-francisco-moving-out-bed-tiny-desk/7715226780.html

https://sfbay.craigslist.org/sby/zip/d/san-jose-free-upright-yamaha-piano/7715210611.html

https://sfbay.craigslist.org/sby/zip/d/sunnyvale-2006-bosch-dishwasher-works/7715225326.html

https://sfbay.craigslist.org/scz/zip/d/santa-cruz-free-outdoor-patio-wood/7712235825.html

https://sfbay.craigslist.org/eby/zip/d/fremont-set-of-road-flares-20-minute-red/7715224622.html

https://sfbay.craigslist.org/eby/zip/d/albany-free-queen-duvet-insert-and/7715222808.html

https://sfbay.crai

**5. Save HTML Pages:**

For each of the 250 listing URLs, use `requests` to fetch the listing page.

Save each HTML content to a separate file on disk.  Use each listing’s ID to organize files in a way that makes them easily identifiable (e.g., save listing ID 7713901653 to file “7713901653.html”).

In [161]:
# Load Python liblaries
from urllib.parse import urlparse
import os

In [162]:
# Set save path to save the html files
#save_path = "/Users/jihyunkim/Documents/craiglist_listings"
save_path = "./craiglist_listings"

# Create the directory into the current working directory if it does not exist
os.makedirs(save_path, exist_ok=True)

#urlparse(url)
#urlparse(url).path.split('/')[-1]
#urlparse(url).path.split('/')[-1].split('.')[0]

# Extract listing number from the end of URL, to be used for file names
def listing_num(url):
    parsed_url = urlparse(url) # Break the URLs in components (addressing scheme, network location, path etc.)
    paths = parsed_url.path.split('/') # Make a list of components in the path 
    listing_num = paths[-1].split('.')[0] # Select the last item in the list, and retreive everything before .html.
    return listing_num

for url in urls_250:
    # Use try ~ except clause to handle errors. 
    try:  
        listing_number = listing_num(url)
        file_path = os.path.join(save_path, f"{listing_number}.html") # Set up the file path where the html files will be stored
        # Fetch the content of the URLs using requests
        response = requests.get(url)

        if response.status_code == 200: # The request is successful only when the status code is 200
            with open(file_path, 'wb') as file: # Set as binary write mode
                file.write(response.content) # Save the content as a file
            print(f"Successfully saved {listing_number}.html") # Print the success message
        else: 
            print(f"Failed to fetch {listing_number}.") # When there's error thenm print out a failure message.
        
        time.sleep(10) # Add 10 seconds delay before the next request
    except Exception as e:
        print(f"An error occurred for {url}: {str(e)}") 


Successfully saved 7715228640.html
Successfully saved 7715228387.html
Successfully saved 7715227930.html
Successfully saved 7713797530.html
Successfully saved 7712881922.html
Successfully saved 7715226780.html
Successfully saved 7715210611.html
Successfully saved 7715225326.html
Successfully saved 7712235825.html
Successfully saved 7715224622.html
Successfully saved 7715222808.html
Successfully saved 7713596495.html
Successfully saved 7714480447.html
Successfully saved 7715221210.html
Successfully saved 7715220928.html
Successfully saved 7715219967.html
Successfully saved 7714384078.html
Successfully saved 7715218330.html
Successfully saved 7715217637.html
Successfully saved 7715216905.html
Successfully saved 7715215063.html
Successfully saved 7715214435.html
Successfully saved 7715214277.html
Successfully saved 7715213849.html
Successfully saved 7715213724.html
Successfully saved 7715213622.html
Successfully saved 7715212074.html
Successfully saved 7715211976.html
Successfully saved 7

##### **Part 2: Parsing and Displaying Information from Saved HTML**

**1. Read Saved HTML Files:**

Write a script that reads each of the saved HTML files from the disk.

In [163]:
files = os.listdir(save_path) # Create a list of files in the folder

for file in files:
    file_path = os.path.join(save_path, file)

    if os.path.isfile(file_path): # Ensure if it's a file
        with open(file_path, 'rb') as file:
            content = file.read()
            print(f"Read {file}") # Print file path
            print(content) # Print content

        print("---") # Divider between files


Read <_io.BufferedReader name='./craiglist_listings/7715222808.html'>
b'<!DOCTYPE html>\n<html>\n<head>\n    \n\t<meta charset="UTF-8">\n\t<meta http-equiv="X-UA-Compatible" content="IE=Edge">\n\t<meta name="viewport" content="width=device-width,initial-scale=1">\n\t<meta property="og:site_name" content="craigslist">\n\t<meta name="twitter:card" content="preview">\n\t<meta property="og:title" content="Free 1 queen duvet insert and 6 pillows, must take all right now - free stuff - craigslist">\n\t<meta name="description" content="PLEASE DO NOT ASK IF AVAILABLE. IF IT IS STILL POSTED, IT IS STILL AVAILABLE. PLEASE LET ME KNOW IF YOU CAN PICK UP RIGHT NOW FROM ALBANY CA AND PROVIDE YOUR NUMBER TO TEXT. THANK YOU. 1 queen duvet...">\n\t<meta property="og:description" content="PLEASE DO NOT ASK IF AVAILABLE. IF IT IS STILL POSTED, IT IS STILL AVAILABLE. PLEASE LET ME KNOW IF YOU CAN PICK UP RIGHT NOW FROM ALBANY CA AND PROVIDE YOUR NUMBER TO TEXT. THANK YOU. 1 queen duvet...">\n\t<meta prop

**2. Extract Information:**

For each HTML file, use `BeautifulSoup` to parse the file content.

Extract and print the following details:

**Title:** The title of the listing.

**URL of first image (if an image exists):**  The URL of the displayed image.  It can be found in the `src` attribute of `<img>`

**Description:** The full description text of the listing.

**Post ID:** Usually found at the bottom of the page or within the page's HTML structure.

**Posted Date:** The date when the listing was originally posted.

**Last Updated Date:** The date when the listing was last updated.

In [164]:
# For loop through each file in the list of files
for file in files: 
    file_path = os.path.join(save_path, file) # Use the same code from the previous question
    if os.path.isfile(file_path):
        with open(file_path, 'rb') as file:
            soup = BeautifulSoup(file.read(), 'html.parser')
            #print(soup)

            title = soup.title.text if soup.title else 'Title not found'  # Extract the title of the listing
            #print(title)

            first_image_url = soup.find('img')['src'] if soup.find('img') else 'Image not found' # Extract the URL of the first image in the listing
            #print(first_image_url)

            description_section = soup.find('section', {'id':'postingbody'}) # Find the section containing the post's description
            if description_section:
                full_description_text = description_section.get_text(separator=" ", strip=True) # Extract the text, replacing newlines with spaces, and strip leading/trailing space
                #print(full_description_text)

                text_to_exclude = 'QR Code Link to This Post' # Remove "QR Code Link to This Post"
                if text_to_exclude in full_description_text:
                    description_text = full_description_text.replace(text_to_exclude," ").strip() 
                else:
                    description_text = full_description_text
            else:
                description_text = 'Description not found'
            #print(description_text)

            post_id = soup.find(text=lambda x: 'post id:' in x.lower())  # Find the post ID 
            post_id = post_id.text.split(':')[-1].strip() if post_id else 'Post ID not found'
            #print(post_id)

            postinginforeveals = soup.find_all('p', class_ = 'postinginfo reveal') # Extract 'p' containing posted dates and updated dates
            for info in postinginforeveals:
                if 'posted:' in info.text.lower(): # posted dates has 'posted:' in the beginning
                    posted_date = info.find('time')['datetime']
                else:posted_date = 'Date not found'

                if 'updated:' in info.text.lower():
                    updated_date = info.find('time')['datetime'] # updated dates has 'updated:' in the beginning
                else:updated_date = 'Date not found' # Some listing doesn't have any updated dates → print "Date not found"
            #print(posted_date)
            #print(updated_date)
            
            # Print the final results
            print(f"Title: {title}")
            print(f"URL of first image: {first_image_url}")
            print(f"Description: {description_text}")
            print(f"Post ID: {post_id}")
            print(f"Posted Date: {posted_date}")
            print(f"Last Updated Date: {updated_date}")
            print("------")



Title: Free 1 queen duvet insert and 6 pillows, must take all right now - free stuff - craigslist
URL of first image: https://images.craigslist.org/00101_5LmfxxIj6tG_0t20CI_600x450.jpg
Description: PLEASE DO NOT ASK IF AVAILABLE. IF IT IS STILL POSTED, IT IS STILL AVAILABLE. PLEASE LET ME KNOW IF YOU CAN PICK UP RIGHT NOW FROM ALBANY CA AND PROVIDE YOUR NUMBER TO TEXT. THANK YOU. 1 queen duvet insert and 6 pillows, must take all right now
Post ID: 7715222808
Posted Date: Date not found
Last Updated Date: 2024-02-06T19:33:33-0800
------
Title: Retro TV - Good for Retro Gaming - free stuff - craigslist
URL of first image: https://images.craigslist.org/00F0F_h3d0GBuVExe_0CI0t2_600x450.jpg
Description: FREE Older  TV  -  Sony Trinitron XBR From the 1990's A perfect monitor for retro gaming - LARGE screen Works well with old gaming systems like  Super Mario Nintendo, Atari, In working condition Includes a Sony remote controller TV overall measures 33" wide, 23"deep, 26" high - with a 32" sc

##### **Part 3: Automating Login on The Old Reader**

In this part, you will focus on automating the process of logging into https://theoldreader.com.  This will involve understanding web authentication mechanisms, managing sessions, and verifying successful login programmatically.  This part is very similar to neopets which we did in class.  Please review the neopets code before attempting this part.

**1. Creating and Verifying a The Old Reader Account:**

Account Creation:  Create an account on https://theoldreader.com.  Use an email address and password that you are comfortable sharing with us.

**Manual Login Verification:** Before automating the login process, ensure you can manually log in to theoldreader.com with your new credentials.  This confirms that your account is active and your credentials are correct.

→ Done

**2. Exploring the Login Mechanism**

`<form>`
- accept-charset="UTF-8" : UTF-8 encoding accepted 
- action="/users/sign_in" : determines where the form data should be sent to be processed. "/users/sign_in" is the URL endpoint on the server where the form data will be submitted
- method="post" : the HTTP method to be used when submitting the form data to the server

`<input>`

**Username/Email:**

- autocapitalize="off" : automatic capitalization of the input disabled

- autocorrect="off" : automatic correction turned off

- autofocus="autofocus" : automatically focuses this input field when the page loads

- class="form-control" : sets the width to 100%, adjusts the padding, border, and font size to provide a uniform appearance for all inputs

- id="user_login" : assigns a unique identifier to the input field

- name="user[login]" : names the input field, which is used to identify the field's data when the form is submitted

- placeholder="Username/Email" : shows an example text within the input field to guide users on what to enter

- size="30" : visible width of the input field, in terms of character space

- spellcheck="false" : spell checking on the input disabled

- type="text": input field is for text input

**Password:**

- name="user[password]": names the input field, which is used to identify the field's data when the form is submitted

- type="password": ensures that the characters typed by the user are replaced by symbols (like asterisks or dots) for privacy

**Sign in button:**

- class="btn btn-primary btn-block" : styling classes for the button. it's a primary button (likely styled with a distinct color) and should span the full width of its container (block-level)

- name="commit" : names the button, which can be used for identifying the button in form submissions

- type="submit" : specifies the button as a submit button

- value="Sign In": text that appears on the button

**3. Analyzing Network Traffic for Login Request**

**Identify the network request made when you submit the login form (GET or POST).  Explain why this method was chosen.**
- "POST" request was made when submitted the login form. It's mainly due to the security features of the "POST"; it sends data in the body of the HTTP request, not in the URL like the "GET" request. Also, the "POST" request is used for actions that change server data, like starting a new session when a user logs in.

**Carefully examine the payload that was submitted to the server during login.  Compare this payload to the `<form>` / `<input>` fields you previously analyzed.  Explain your observation.**
- utf8: ✓ → accept-charset="UTF-8" in the `<form>`. 

- authenticity_token: Q87EK5Z/PZH3mdkIvxH/6qyNl5DlYBifY2SG60cpdII= → name="authenticity_token" in the hidden `<input>` under the `<form>`.

- user[login]: jjikim@ucdavis.edu → matches the name attributes of the `<input>`

- user[password]: 1q2w3e4r → matches the name attributes of the `<input>`

- commit: Sign In → matches the value="Sign In" attribute of the button

The payload submitted to the server includes all the information from the form's input fields.

**4. Automating the Login Process**

Using Python and appropriate libraries like requests, simulate the login process.

Create a session object to maintain your login state across multiple requests.

Prepare a payload with your login credentials and other necessary form data identified from the login page and the network analysis.

Send a POST request to the login form’s action URL to log in, using the session object.

In [165]:
time.sleep(10)
headers = {'User-agent': 'Mozilla/5.0'}

url = 'https://theoldreader.com/users/sign_in'
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')

#print(soup)

In [166]:
time.sleep(10)

# open session allows us to make post requests
session = requests.session() 
# input all information from the payload and make a post request
res = session.post(url, 
                   data = {'utf8' : '✓',
                           'authenticity_token' : 'Q87EK5Z/PZH3mdkIvxH/6qyNl5DlYBifY2SG60cpdII=',
                           'user[login]' : 'jjikim@ucdavis.edu',
                           'user[password]' : '1q2w3e4r',
                           'commit' : 'Sign In'},
                   timeout = 20)

res.ok

True

**5. Verifying Successful Login**

After attempting to log in, inspect the cookies saved in the session object to understand the information The Old Reader stores on your computer.

Use the session object to access https://theoldreader.com.

Verify successful login by checking for the presence of your user information that is only available when logged in.

In [167]:
# Get the cookies
cookies = session.cookies.get_dict()
print(cookies)

{'_new_reader_session': 'BAh7CkkiD3Nlc3Npb25faWQGOgZFVEkiJTkyZjI3YTUxY2QxNDA2YzEzYWUyZjVmYWQ4ZjEyOTU1BjsAVEkiGXdhcmRlbi51c2VyLnVzZXIua2V5BjsAVFsHWwZVOhpNb3BlZDo6QlNPTjo6T2JqZWN0SWQiEaoZxYEvNa9qsS%2FaJkkiIiQyYSQwNSRiTGJuV1AvbTZCN1RxVEdhL202TjNlBjsAVEkiDWxhbmd1YWdlBjsARjoHZW5JIhByZWRpcmVjdF90bwY7AEZJIgYvBjsARkkiEF9jc3JmX3Rva2VuBjsARkkiMXBtUUlnUDhuRHJyYkRxZHVpZUtRK3hUbzEvc1FqTFNxN0loVDFvL0pqaFk9BjsARg%3D%3D--4e116a541f44823a2e4719df9f53f38e3006f2f0', 'i_know_you': 'Jihyun+Kim', 'remember_user_token': 'BAhbB1sGVToaTW9wZWQ6OkJTT046Ok9iamVjdElkIhGqGcWBLzWvarEv2iZJIiIkMmEkMDUkYkxibldQL202QjdUcVRHYS9tNk4zZQY6BkVU--324bb43cf746782420b303194bff5d263e873168', 'signed_at': '1707282839'}


In [168]:
time.sleep(10)

# Get url from Manage Account page. Use the cookies obtain from the login session to remain in-session.
page2 = session.get('https://theoldreader.com/accounts/new', cookies=cookies) 
soup2 = BeautifulSoup(page2.content, 'html.parser') 

print(soup2.prettify()) # Print the entire content of the page

<!DOCTYPE html>
<html>
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <!-- The above 3 meta tags *must* come first in the head; any other head content must come *after* these tags -->
  <link href="//s.theoldreader.com/assets/application-befb06d5a14978388154b51422cef437.css" media="all" rel="stylesheet" type="text/css"/>
  <link href="//s.theoldreader.com/assets/apple-touch-icon-57x57-86fe1176e14af4907a6fecfe5ca7e3f1.png" rel="apple-touch-icon-precomposed" sizes="57x57"/>
  <link href="//s.theoldreader.com/assets/apple-touch-icon-114x114-bae89acc41c93261dd962ea6ade08d22.png" rel="apple-touch-icon-precomposed" sizes="114x114"/>
  <link href="//s.theoldreader.com/assets/apple-touch-icon-72x72-f248503edfa3676f8d58af531aff7e88.png" rel="apple-touch-icon-precomposed" sizes="72x72"/>
  <link href="//s.theoldreader.com/assets/apple-touch-icon-144x144-510415291cae9b46a9ca4ac398

In [169]:
# Check if there's user name in the content
elements = soup2.find_all(title="Jihyun Kim") 
for element in elements:
    print(str(element.parent))
# Login is verified by checking for the presence of the user name("Jihun Kim") in the content.

<li class="dropdown">
<a class="dropdown-toggle" data-hover="dropdown" data-toggle="dropdown" href="#" title="Jihyun Kim">Jihyun Kim  <i class="fa fa-caret-down"></i></a>
<ul class="dropdown-menu">
<li class="dropdown-header">Settings</li>
<li><a data-pjax="" href="/users/edit">Manage Settings</a></li>
<li><a href="https://theoldreader.com/accounts/manage">Manage Account</a></li>
<li><a data-pjax="" href="/subscriptions">Manage Subscriptions</a></li>
<li><a data-pjax="" href="/profile/aa19c5812f35af6ab12fda26">View Profile</a></li>
<li class="divider"></li>
<li class="dropdown-header">Help</li>
<li><a data-pjax="" href="/pages/tour">Product Tour</a></li>
<li><a href="mailto:support@theoldreader.com">Support</a></li>
<li class="divider"></li>
<li><a href="/users/sign_out">Sign Out</a></li>
</ul>
</li>
