<a href="https://colab.research.google.com/github/Laasya04299/Laasya_INFO5731_Fall2024/blob/main/Madige_Laasya_Exercise_02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 In-class Exercise 2**

The purpose of this exercise is to understand users' information needs, and then collect data from different sources for analysis by implementing web scraping using Python.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided *.ipynb* document to write your code and respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from the top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions incur a penalty of 10% of the marks for each day of delay, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)
Describe an interesting research question (or a practical question, or something innovative) you have in mind. What kind of data should be collected to answer the question(s)? Specify the amount of data needed for analysis. Provide detailed steps for collecting and saving the data.

In [None]:
# write your answer here
'''
Research Question:
How does the frequency of edits to Wikipedia articles correlate with their quality and relevance over time?

Data Collection:
Data Types Needed:
  Article Metadata: Title, creation date, and last edited date.
  Edit History: Number of edits, timestamps of edits, and edit summaries.
  Article Quality Metrics: Article assessments, references, and citations.
Amount of Data:
  Number of Articles: A sample of 1,000 to 5,000 articles.
  Edit Records: The complete edit history for each article in the sample.
  Quality Assessments: Ratings or quality assessments for each article.

Steps for Collecting and Saving Data:
Step 1: Define the Sample Set
Choose a set of Wikipedia articles that are representative of different topics and quality levels.

Step 2: Extract Article Metadata
Use the Wikipedia API to extract metadata for each article, including titles, creation dates, and last edited dates.

Step 3: Collect Edit History
Retrieve the complete edit history for each article using the Wikipedia API.

Step 4: Gather Article Quality Metrics
Extract quality assessments from Wikipedia’s article assessment pages, or from Wikipedia’s own datasets where available.

Step 5: Store Data
Save the extracted metadata, edit histories and quality metrics in a structured format such as JSON or CSV files.

'''
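
The research question depends on measuring edit frequency from the revision timestamps. A minimal sketch of one way to do that is shown below, assuming timestamps arrive in the ISO 8601 UTC form the Wikipedia API returns (for example `2024-09-01T12:00:00Z`). The function name `edits_per_year` and the sample values are hypothetical and used only for illustration.

```python
from collections import Counter
from datetime import datetime

def edits_per_year(timestamps):
    """Count edits per calendar year from ISO 8601 UTC timestamp strings."""
    years = (datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").year for ts in timestamps)
    # Sort by year so the result reads chronologically.
    return dict(sorted(Counter(years).items()))

# Hypothetical sample timestamps, for illustration only.
sample = ["2022-03-01T10:00:00Z", "2022-07-15T08:30:00Z", "2023-01-02T23:59:59Z"]
print(edits_per_year(sample))  # {2022: 2, 2023: 1}
```

Per-year counts like these could then be compared against each article's quality rating.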



## Question 2 (10 Points)
Write Python code to collect a dataset of 1000 samples related to the question discussed in Question 1.

In [None]:
# write your answer here
import requests
import pandas as pd

# API endpoint
WIKI_API_ENDPOINT = "https://en.wikipedia.org/w/api.php"

def fetch_article_metadata(title):
    params = {
        'action': 'query',
        'prop': 'info',
        'titles': title,
        'format': 'json'
    }
    response = requests.get(WIKI_API_ENDPOINT, params=params)
    data = response.json()
    pages = data['query']['pages']
    for page_id, page_info in pages.items():
        if page_id != '-1':  # Check if the page exists
            return {
                'title': page_info['title'],
                'pageid': page_id,
                'ns': page_info['ns'],
                'lastrevid': page_info['lastrevid'],
                'touched': page_info['touched'],
                'url': f"https://en.wikipedia.org/wiki/{page_info['title']}"
            }
    return None

def fetch_edit_history(title):
    params = {
        'action': 'query',
        'prop': 'revisions',
        'titles': title,
        'rvprop': 'timestamp|user|comment',
        'rvlimit': 'max',
        'format': 'json'
    }
    response = requests.get(WIKI_API_ENDPOINT, params=params)
    data = response.json()
    pages = data['query']['pages']
    for page_id, page_info in pages.items():
        if page_id != '-1':
            revisions = page_info.get('revisions', [])
            return [{
                'timestamp': rev['timestamp'],
                'user': rev.get('user', 'Anonymous'),
                'comment': rev.get('comment', '')
            } for rev in revisions]
    return []

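def fetch_article_titles(num_titles, batch_size=50):
    # The API caps rnlimit per request, so a single call cannot return
    # 1000 titles; request random pages in batches instead.
    titles = []
    while len(titles) < num_titles:
        params = {
            'action': 'query',
            'list': 'random',
            # Restrict to the main (article) namespace; without this,
            # list=random also returns User talk, Talk, and Category pages.
            'rnnamespace': 0,
            'rnlimit': min(batch_size, num_titles - len(titles)),
            'format': 'json'
        }
        response = requests.get(WIKI_API_ENDPOINT, params=params)
        response.raise_for_status()
        data = response.json()
        titles.extend(page['title'] for page in data['query']['random'])
    return titles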

def collect_data(num_samples):
    titles = fetch_article_titles(num_samples)
    collected_data = []

    for i, title in enumerate(titles):
        print(f"Fetching data for article {i+1}/{num_samples}: {title}")

        metadata = fetch_article_metadata(title)
        if metadata:
            edit_history = fetch_edit_history(title)
            collected_data.append({
                **metadata,
                'edit_history': edit_history
            })

    return collected_data

# Collect 1000 samples
data = collect_data(1000)

# Save to CSV
df = pd.DataFrame(data)
df.to_csv('wikipedia_articles_data.csv', index=False)

print("Data collection complete. Saved to wikipedia_articles_data.csv")



Fetching data for article 1/1000: User talk:Clareiet
Fetching data for article 2/1000: User talk:Xx*MCR-ROX*xX
Fetching data for article 3/1000: User talk:Lilone12~enwiki
Fetching data for article 4/1000: User talk:KanziGG
Fetching data for article 5/1000: Category:Mobile phone companies of Bosnia and Herzegovina
Fetching data for article 6/1000: User talk:2601:2C4:C800:E3B0:BC23:D059:C44E:7DB8
Fetching data for article 7/1000: User:Wellboyswhatsthecraic/TWA/Earth/2
Fetching data for article 8/1000: User talk:68.197.19.122
Fetching data for article 9/1000: Aeroflot Flight 3603
Fetching data for article 10/1000: Yardea
Fetching data for article 11/1000: Toby Radford
Fetching data for article 12/1000: User talk:58.106.239.195
Fetching data for article 13/1000: Talk:Fred C. Ainsworth
Fetching data for article 14/1000: Talk:Individual Paralympic Athletes
Fetching data for article 15/1000: User talk:106.66.121.244
Fetching data for article 16/1000: User talk:68.38.131.22
Fetching data for a
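
Note that `fetch_edit_history` issues a single request, so it returns at most one batch of revisions rather than the complete edit history that Question 1 calls for. The MediaWiki API signals further batches with a `continue` object whose fields are merged into the next request. The sketch below follows those tokens. The names `fetch_all_revisions`, `get_json`, and `max_revisions` are hypothetical, and `get_json` is assumed to be any callable that takes a parameter dict and returns the decoded JSON response.

```python
def fetch_all_revisions(title, get_json, max_revisions=5000):
    """Follow 'continue' tokens until the history ends or max_revisions is reached.

    get_json is an assumed helper: it takes the request parameters as a dict
    and returns the decoded JSON response as a dict.
    """
    params = {
        'action': 'query',
        'prop': 'revisions',
        'titles': title,
        'rvprop': 'timestamp|user|comment',
        'rvlimit': 'max',
        'format': 'json',
    }
    revisions = []
    while len(revisions) < max_revisions:
        data = get_json(params)
        for page in data.get('query', {}).get('pages', {}).values():
            revisions.extend(page.get('revisions', []))
        if 'continue' not in data:
            break  # the API reports no further batches
        # Merge the continuation fields (e.g. rvcontinue) into the next request.
        params = {**params, **data['continue']}
    return revisions[:max_revisions]
```

With the notebook's existing imports, a call might look like `fetch_all_revisions("Yardea", lambda p: requests.get(WIKI_API_ENDPOINT, params=p).json())`. The `max_revisions` cap is a design choice to keep heavily edited articles from dominating the run time.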

## Question 3 (10 Points)
Write Python code to collect 1000 articles from Google Scholar (https://scholar.google.com/), Microsoft Academic (https://academic.microsoft.com/home), or CiteSeerX (https://citeseerx.ist.psu.edu/index), or Semantic Scholar (https://www.semanticscholar.org/), or ACM Digital Libraries (https://dl.acm.org/) with the keyword "XYZ". The articles should be published in the last 10 years (2014-2024).

The following information from the article needs to be collected:

(1) Title of the article

(2) Venue/journal/conference where the article was published

(3) Year

(4) Authors

(5) Abstract

In [None]:
# write your answer here
import requests
import time
import pandas as pd

BASE_URL = 'https://api.semanticscholar.org/graph/v1/paper/search'


PARAMS = {
    'query': 'XYZ',
    'limit': 100,
    'year': '2014-2024',
    'fields': 'title,venue,year,authors,abstract',
}

def fetch_papers(offset):
    response = requests.get(BASE_URL, params={**PARAMS, 'offset': offset})
    if response.status_code == 200:
        return response.json()
    else:
        print(f"Failed to retrieve data: Status code {response.status_code}")
        return None

def parse_papers(papers):
    paper_list = []
    for paper in papers:
        # The API returns null for missing fields, so fall back with `or`
        # (dict.get's default applies only when the key is absent).
        title = paper.get('title') or 'N/A'
        venue = paper.get('venue') or 'N/A'
        year = paper.get('year') or 'N/A'
        authors = ', '.join(author.get('name') or 'N/A' for author in paper.get('authors') or [])
        abstract = paper.get('abstract') or 'N/A'
        paper_list.append([title, venue, year, authors, abstract])
    return paper_list

all_papers = []
total_papers = 1000
offset = 0
papers_per_request = 100

while len(all_papers) < total_papers:
    print(f"Fetching data for papers {offset + 1}/{total_papers}...")
    data = fetch_papers(offset)
    # Stop on a failed request or an empty batch; an empty 'data' list
    # would otherwise keep the loop running without adding any papers.
    if data and data.get('data'):
        papers = parse_papers(data['data'])
        all_papers.extend(papers)
        offset += papers_per_request
        time.sleep(2)
    else:
        print("No more papers or failed request. Exiting.")
        break

df = pd.DataFrame(all_papers, columns=['Title', 'Venue', 'Year', 'Authors', 'Abstract'])
df.to_csv('semantic_scholar_articles.csv', index=False)

print(f"Data collection complete. Retrieved {len(all_papers)} papers. Saved to semantic_scholar_articles.csv.")



Fetching data for papers 1/1000...
Fetching data for papers 101/1000...
Failed to retrieve data: Status code 429
No more papers or failed request. Exiting.
Data collection complete. Retrieved 100 papers. Saved to semantic_scholar_articles.csv.
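
The run above stopped after 100 papers because the second request returned status 429 (too many requests) and the loop exits on any failure. One common remedy is to retry with exponential backoff. The sketch below is a minimal version of that idea, not part of the submitted solution. The names `get_with_retry`, `send`, `max_retries`, and `base_delay` are hypothetical, and `send` is assumed to be a zero-argument callable returning a response object with `status_code` and `headers`, as `requests` responses have.

```python
import time

def get_with_retry(send, max_retries=5, base_delay=2.0, sleep=time.sleep):
    """Call send() until it returns a non-429 response or retries run out."""
    for attempt in range(max_retries + 1):
        response = send()
        if response.status_code != 429 or attempt == max_retries:
            return response
        # Honor a numeric Retry-After header if the server sends one;
        # otherwise double the wait on each attempt (2s, 4s, 8s, ...).
        retry_after = response.headers.get('Retry-After')
        if retry_after and retry_after.isdigit():
            delay = float(retry_after)
        else:
            delay = base_delay * 2 ** attempt
        sleep(delay)
```

Inside `fetch_papers`, the request could then be wrapped as `get_with_retry(lambda: requests.get(BASE_URL, params={**PARAMS, 'offset': offset}))`. Whether this is enough to reach 1000 papers depends on the rate limits in force for unauthenticated clients.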


## Question 4A (10 Points)
Develop Python code to collect data from social media platforms like Reddit, Instagram, X (formerly known as Twitter), Facebook, or any other. Use hashtags, keywords, usernames, or user IDs to gather the data.



Ensure that the collected data has more than four columns.


In [None]:
# write your answer here
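import os
import praw
import pandas as pd
import time

def initialize_reddit_client():
    # Read credentials from environment variables rather than hard-coding
    # them in the notebook. The variable names here are placeholders.
    return praw.Reddit(
        client_id=os.environ['REDDIT_CLIENT_ID'],
        client_secret=os.environ['REDDIT_CLIENT_SECRET'],
        user_agent='v1.0'
    )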

def fetch_reddit_data(keyword, num_posts, start_year, end_year):
    reddit = initialize_reddit_client()
    collected_data = []

    for submission in reddit.subreddit('all').search(keyword, sort='new', limit=num_posts):
        try:
            title = submission.title
            author = submission.author.name if submission.author else 'N/A'
            post_url = submission.url
            upvotes = submission.score
            date = submission.created_utc
            year = time.strftime('%Y', time.gmtime(date))

            if start_year <= int(year) <= end_year:
                collected_data.append({
                    'Title': title,
                    'Author': author,
                    'Post URL': post_url,
                    'Upvotes': upvotes,
                    'Date': year
                })

            if len(collected_data) >= num_posts:
                break

        except Exception as e:
            print(f"Error processing post: {e}")

    return collected_data

# Fetch 50 posts with the keyword "Python"
posts = fetch_reddit_data(keyword="Python", num_posts=50, start_year=2014, end_year=2024)

if posts:
    df = pd.DataFrame(posts)
    df.to_csv('reddit_posts_data.csv', index=False)
    print("Data collection complete. Saved to reddit_posts_data.csv")
else:
    print("No data collected. Please check the scraping process and response.")


It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.



Data collection complete. Saved to reddit_posts_data.csv
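
The `Date` column above keeps only the year of each post, although `created_utc` is a full Unix timestamp in seconds. If finer-grained dates are wanted, the conversion could look like the sketch below. The helper name `utc_date` is hypothetical.

```python
from datetime import datetime, timezone

def utc_date(created_utc):
    """Convert a Unix epoch timestamp (seconds) to an ISO 8601 UTC string."""
    moment = datetime.fromtimestamp(created_utc, tz=timezone.utc)
    return moment.strftime('%Y-%m-%dT%H:%M:%SZ')

print(utc_date(0))  # 1970-01-01T00:00:00Z
```

Storing `utc_date(submission.created_utc)` instead of the year alone would keep the year filter possible while preserving the exact posting time.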


## Question 4B (10 Points)
If you encounter challenges with Question-4 web scraping using Python, employ any online tools such as ParseHub or Octoparse for data extraction. Introduce the selected tool, outline the steps for web scraping, and showcase the final output in formats like CSV or Excel.



Upload a document (Word or PDF File) in any shared storage (preferably UNT OneDrive) and add the publicly accessible link in the below code cell.

Please choose only one option for Question 4. If you do both options, we will grade only the first one.

In [None]:
# write your answer here


# Mandatory Question

**Important: Reflective Feedback on Web Scraping and Data Collection**



Please share your thoughts and feedback on the web scraping and data collection exercises you have completed in this assignment. Consider the following points in your response:



Learning Experience: Describe your overall learning experience in working on web scraping tasks. What were the key concepts or techniques you found most beneficial in understanding the process of extracting data from various online sources?



Challenges Encountered: Were there specific difficulties in collecting data from certain websites, and how did you overcome them? If you opted for the non-coding option, share your experience with the chosen tool.



Relevance to Your Field of Study: How might the ability to gather and analyze data from online sources enhance your work or research?

**(no grading of your submission if this question is left unanswered)**

In [None]:
'''
Completing these web scraping tasks was quite helpful: it gave me further insight into APIs, data extraction, and dealing with rate limits. The most useful techniques were parsing web pages with BeautifulSoup and calling APIs with retry mechanisms. The main challenges were rate limits and dynamic content, which forced changes in approach and made robust error handling necessary. Being able to gather and assess data from online sources in real time yields more accurate and comprehensive data, which improves the quality and depth of my research.
'''
