
## The second In-class-exercise (09/13/2023, 40 points in total)

Kindly use the provided .ipynb document to write your code or respond to the questions. Avoid generating a new file.
Execute all the cells before your final submission.

This in-class exercise is due tomorrow September 14, 2023 at 11:59 PM. No late submissions will be considered.

The purpose of this exercise is to understand users' information needs, then collect data from different sources for analysis.

Question 1 (10 points): Describe an interesting research question (or practical question, or something innovative) you have in mind. What kind of data should be collected to answer the question? How much data is needed for the analysis? Describe the detailed steps for collecting and saving the data.

**Question:**
How do pricing strategies impact the sales performance of handmade jewelry on Etsy? Specifically, does dynamic pricing based on factors like seasonality, competition, and materials used lead to higher sales and profitability for Etsy sellers?

**Data Collection:**
Data needs to be collected from Etsy:

- Product name
- Product link
- Price
- Ratings

The data will be collected by scraping Etsy search results for handmade dangle earrings, with a target of about 1,000 listings. This involves requesting each results page, parsing the HTML, and extracting the relevant fields.

The requests library fetches each search results page and Beautiful Soup parses it. The code walks the HTML structure, targeting elements by class name for the product name, link, price, and ratings, and saves the collected records to a CSV file for further analysis.

This data will serve as the foundation for understanding how pricing strategies influence the sales performance of handmade jewelry on Etsy.
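
As a rough illustration of the "further analysis" step, the sketch below reads rows in the CSV layout described above (name, link, price, ratings) and summarizes the prices. The sample rows, the `parse_price` helper, and the assumption that prices arrive as strings such as `1,150.50` are all hypothetical; real scraped values may need different cleaning.

```python
import csv
import io
import statistics

def parse_price(raw):
    """Convert a scraped price string such as '1,150.50' to a float, or None if it is not numeric."""
    try:
        return float(raw.replace(",", ""))
    except ValueError:
        return None

# Hypothetical sample standing in for the scraped CSV file
sample_csv = io.StringIO(
    "name,link,price,ratings\n"
    "Silver hoops,https://example.com/1,24.00,4.8\n"
    "Gold studs,https://example.com/2,\"1,150.50\",5.0\n"
    "Bead drops,https://example.com/3,Price not available,No ratings\n"
)

prices = []
for row in csv.DictReader(sample_csv):
    price = parse_price(row["price"])
    if price is not None:  # skip placeholder values such as 'Price not available'
        prices.append(price)

summary = {
    "count": len(prices),
    "mean": statistics.mean(prices),
    "median": statistics.median(prices),
}
print(summary)
```

Rows whose price is a placeholder are skipped rather than counted as zero, so they do not distort the mean.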

Question 2 (10 points): Write Python code to collect the 1000 data samples you discussed above.

In [3]:
import requests
from bs4 import BeautifulSoup
import csv
import time

# URL of the Etsy search results page
base_url = "https://www.etsy.com/search?q=handmade+jewelry+earrings+dangle&ref=s2qit&explicit=1&s2qii=4&s2qit=as&prq=handmade+jewelry"

# Function to scrape product data
def scrape_etsy_products(base_url, num_samples=1000):
    products = []
    page_number = 1

    while len(products) < num_samples:
        # Make a GET request to the Etsy search results page
        url = f"{base_url}&page={page_number}"
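        # Some sites reject the default requests User-Agent, so send a browser-like one
        response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
        if response.status_code != 200:
            # Stop on a blocked or failed request instead of parsing an error page
            break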
        soup = BeautifulSoup(response.text, 'html.parser')

        # Find product listings on the page
        listings = soup.find_all('div', class_='v2-listing-card__info')

        if not listings:
            break

        for listing in listings:
            product = {}

            # Extract product name
            # Match on one class: a multi-class string only matches that exact attribute value
            product_name_elem = listing.find('h3', class_='v2-listing-card__title')
            if product_name_elem:
                product['name'] = product_name_elem.text.strip()
            else:
                product['name'] = 'Name not available'

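            # Extract product link; the anchor may wrap the info block, so also check its ancestors
            product_link_elem = listing.find('a', class_='listing-link') or listing.find_parent('a', class_='listing-link')
            if product_link_elem and 'href' in product_link_elem.attrs:
                href = product_link_elem['href']
                # Prepend the domain only when the href is relative
                product['link'] = href if href.startswith('http') else "https://www.etsy.com" + href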
            else:
                product['link'] = 'Link not available'

            # Extract product price
            price_span = listing.find('span', class_='currency-value')
            if price_span:
                product['price'] = price_span.text.strip()
            else:
                product['price'] = 'Price not available'

            # Extract product ratings if available
            ratings_div = listing.find('div', class_='wt-align-items-center wt-max-height-full wt-display-flex-xs flex-direction-row-xs wt-text-title-small wt-no-wrap')
            if ratings_div:
                product['ratings'] = ratings_div.text.strip()
            else:
                product['ratings'] = 'No ratings'

            products.append(product)

            if len(products) >= num_samples:
                break

        page_number += 1

        # Sleep briefly to avoid overloading the server
        time.sleep(1)

    return products

# Collect 1000 data samples
data_samples = scrape_etsy_products(base_url, num_samples=1000)

# Save data to a CSV file
csv_filename = "etsy_handmade_jewelry_sampledata.csv"
with open(csv_filename, mode='w', newline='', encoding='utf-8') as csv_file:
    fieldnames = ['name', 'link', 'price', 'ratings']
    writer = csv.DictWriter(csv_file, fieldnames=fieldnames)

    writer.writeheader()
    for product in data_samples:
        writer.writerow(product)

print(f"Data saved to {csv_filename}")

Data saved to etsy_handmade_jewelry_sampledata.csv
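
The scraping loop above builds each page URL by appending `&page=N` to the base URL with string formatting. A minimal sketch of an alternative, using only the standard library, is shown below; the `with_page` helper is a hypothetical name, and the sketch assumes the site reads the page number from a `page` query parameter, as the original code does.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def with_page(url, page_number):
    """Return `url` with its `page` query parameter set to `page_number`."""
    parts = urlsplit(url)
    # Keep every existing parameter except an old `page` value
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True) if k != "page"]
    query.append(("page", str(page_number)))
    return urlunsplit(parts._replace(query=urlencode(query)))

print(with_page("https://www.etsy.com/search?q=handmade+jewelry&page=1", 3))
# prints: https://www.etsy.com/search?q=handmade+jewelry&page=3
```

Replacing the parameter instead of appending to the string avoids duplicate `page` values if the base URL already contains one.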


Question 3 (10 points): Write Python code to collect 1000 articles from Google Scholar (https://scholar.google.com/), Microsoft Academic (https://academic.microsoft.com/home), CiteSeerX (https://citeseerx.ist.psu.edu/index), Semantic Scholar (https://www.semanticscholar.org/), or the ACM Digital Library (https://dl.acm.org/) with the keyword "information retrieval". The articles should have been published in the last 10 years (2013-2023).

The following information of the article needs to be collected:

(1) Title

(2) Venue (journal or conference) where the article was published

(3) Year

(4) Authors

(5) Abstract

In [None]:
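import re
import requests
from bs4 import BeautifulSoup
import csv
from datetime import datetime

# Function to scrape data from an article page
def scrape_article(article_url):
    response = requests.get(article_url, timeout=30)
    # Check if the request was successful
    if response.status_code != 200:
        return None

    soup = BeautifulSoup(response.text, 'html.parser')

    def text_of(element):
        # Return stripped text, or an empty string when the element is missing
        return element.text.strip() if element else ''

    # The date text may include a month (e.g. "July 2019"); keep only the four-digit year
    date_text = text_of(soup.find("span", class_="epub-section__date"))
    year_match = re.search(r"\d{4}", date_text)
    if not year_match:
        # Without a year the article cannot be checked against the date range
        return None

    # Extract and format the article data
    return {
        'Title': text_of(soup.find("h1", class_="citation__title")),
        # Join authors into one string so the CSV cell is readable
        'Authors': "; ".join(author.text.strip() for author in soup.select("span.loa__name")),
        'Year': year_match.group(0),
        'Abstract': text_of(soup.find("div", class_="abstract__content")),
        'Venue': text_of(soup.select_one("a.publication-title-link"))
    }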

# Function to scrape ACM articles based on keyword and quantity
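def scrape_acm_articles(keyword, num_articles, page_size=50):
    search_url = "https://dl.acm.org/doSearch"
    articles = []
    page = 0

    # A single request for all results may be capped by the server, so page through them.
    # page_size=50 is an assumed value, not a documented limit.
    while len(articles) < num_articles:
        # Pass the query as params so requests URL-encodes the keyword
        params = {"query": keyword, "FullText": "true", "startPage": page, "pageSize": page_size}
        response = requests.get(search_url, params=params, timeout=30)
        # Stop if the request was not successful
        if response.status_code != 200:
            break

        soup = BeautifulSoup(response.text, 'html.parser')
        # Find all article items in the search results
        search_results = soup.find_all("li", class_="search__item")
        if not search_results:
            break

        for result in search_results:
            article_link = result.find("a", href=True)
            if article_link:
                article_url = "https://dl.acm.org" + article_link['href']
                article_data = scrape_article(article_url)
                if article_data:
                    articles.append(article_data)

        page += 1

    return articles[:num_articles]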

# Main function to scrape and save data
def main():
    keyword = "information retrieval"
    num_articles = 1000
    articles = scrape_acm_articles(keyword, num_articles)

    if articles:
        # Filter articles published in the last 10 years (2013-2023).
        # The scraped date may include a month, so use its last four characters as the year.
        filtered_articles = [
            article for article in articles
            if article['Year'][-4:].isdigit() and 2013 <= int(article['Year'][-4:]) <= 2023
        ]

        if filtered_articles:
            # Save data to a CSV file
            with open('acm_articles.csv', 'w', newline='', encoding='utf-8') as csvfile:
                fieldnames = ['Title', 'Authors', 'Year', 'Abstract', 'Venue']
                writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
                writer.writeheader()
                for article in filtered_articles:
                    writer.writerow(article)
            print(f"Scraped and saved {len(filtered_articles)} articles to 'acm_articles.csv'.")
        else:
            print("No articles found within the specified date range.")
    else:
        print("No articles found.")

if __name__ == "__main__":
    main()
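
The 2013-2023 requirement depends on turning a scraped date string into a year. The sketch below shows one way to do that with a regular expression; the `extract_year` and `in_range` helpers and the sample date strings are hypothetical, since the exact date format on the article pages is not confirmed here.

```python
import re

# Match a standalone four-digit year starting with 19 or 20
YEAR_PATTERN = re.compile(r"\b(19|20)\d{2}\b")

def extract_year(date_text):
    """Return the first four-digit year found in `date_text` as an int, or None."""
    match = YEAR_PATTERN.search(date_text or "")
    return int(match.group(0)) if match else None

def in_range(date_text, first_year=2013, last_year=2023):
    """Return True when the year in `date_text` falls within the inclusive range."""
    year = extract_year(date_text)
    return year is not None and first_year <= year <= last_year

samples = ["July 2019", "13 September 2023", "2012", "", "n.d."]
print([s for s in samples if in_range(s)])
# prints: ['July 2019', '13 September 2023']
```

Strings with no recognizable year are excluded rather than raising an error, so a single malformed record does not stop the run.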


Question 4 (10 points): Write Python code to collect 1000 posts from Twitter, Facebook, or Instagram. You can use hashtags, keywords, user_name, user_id, or other information to collect the data.

The following information needs to be collected:

(1) User_name

(2) Posted time

(3) Text

In [None]:
pip install instaloader



In [None]:
pip install --upgrade win_unicode_console


Note: you may need to restart the kernel to use updated packages.


In [None]:
import instaloader
import csv

# Initialize Instaloader
L = instaloader.Instaloader()

# Target Instagram account
username = "unt"

# Load the profile of the target account
profile = instaloader.Profile.from_username(L.context, username)

# Initialize a list to store the collected data
data = []

# Collect up to 1000 posts
for post in profile.get_posts():
    # Extract the user name, posted time (UTC, ISO 8601), and post text
    user_name = username
    posted_time = post.date_utc.isoformat()
    text = post.caption if post.caption else ""

    # Append data to the list
    data.append([user_name, posted_time, text])

    # Break the loop once 1000 posts are collected
    if len(data) >= 1000:
        break

# Save the collected data to a CSV file
csv_file_name = f'instagram_{username}.csv'
with open(csv_file_name, 'w', newline='', encoding='utf-8') as csv_file:
    csv_writer = csv.writer(csv_file)
    csv_writer.writerow(['User_name', 'Posted_time', 'Text'])
    csv_writer.writerows(data)

print(f"CSV file saved to {csv_file_name}")


CSV file saved to instagram_unt.csv
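
The three required fields (user name, posted time, text) and the 1000-post cap can be sketched without network access as follows. `FakePost` is a hypothetical stand-in for a post object, assumed to carry a naive UTC timestamp and an optional caption; `rows_from_posts` is likewise an illustrative helper, not part of any library.

```python
import csv
import io
from collections import namedtuple
from datetime import datetime, timezone
from itertools import islice

# Hypothetical stand-in for a post object with a UTC timestamp and a caption
FakePost = namedtuple("FakePost", ["date_utc", "caption"])

def rows_from_posts(posts, user_name, limit=1000):
    """Yield [user_name, posted_time, text] rows for at most `limit` posts."""
    # islice stops the iterator after `limit` items, so no manual counter is needed
    for post in islice(posts, limit):
        # Mark the naive timestamp as UTC and format it as ISO 8601
        posted_time = post.date_utc.replace(tzinfo=timezone.utc).isoformat()
        yield [user_name, posted_time, post.caption or ""]

# Five fake posts; only the first three are written because limit=3
posts = (FakePost(datetime(2023, 9, day, 12, 0), f"Post {day}") for day in range(1, 6))
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["User_name", "Posted_time", "Text"])
writer.writerows(rows_from_posts(posts, "unt", limit=3))
print(buffer.getvalue())
```

Storing the full timestamp rather than only the year keeps the posted time usable for later sorting or grouping by date.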
