<a href="https://colab.research.google.com/github/Lavanya-INFO5731-Fall2024/Lavanya_INFO5731_Fall2024/blob/main/Nidamanuri_Lavanya_Exercise_2_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 In-class Exercise 2**

The purpose of this exercise is to understand users' information needs, and then collect data from different sources for analysis by implementing web scraping using Python.

**Expectations**:
*   Students are expected to complete the exercise during lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission. , and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)
Describe an interesting research question (or practical question or something innovative) you have in mind, what kind of data should be collected to answer the question(s)? Specify the amount of data needed for analysis. Provide detailed steps for collecting and saving the data.

In [None]:
# write your answer here


### **Research Question:**
**"How does population size correlate with key demographic indicators such as growth rate, population density, and urban population percentage across different countries?"**

### **Objective:**
To analyze the relationship between population size and key demographic indicators (e.g., growth rate, density, and urban population percentage) for all countries, using the most recent data available on the Worldometers website.

### **Data Needed:**
To answer this research question, we need the following types of data:

#### **Country Information:**
- **Fields:** Country name, population size, yearly growth rate, population density (people per square km), urban population percentage, and country rank.
- **Source:** Worldometers website ([https://www.worldometers.info/world-population/population-by-country/](https://www.worldometers.info/world-population/population-by-country/)).

#### **Other Indicators (Optional):**
- **Fields:** Median age, fertility rate, migrants, and life expectancy.
- **Source:** The same Worldometers page may contain these fields if they are present in the table.

### **Amount of Data Needed:**
- All available countries listed on the Worldometers page (approximately 200+ countries).
- Full set of demographic indicators for each country (as listed on the webpage).

### **Steps for Collecting and Saving the Data:**

#### **Collect Country Data:**
- Use Selenium or BeautifulSoup to extract data from the table on the Worldometers page.
- Parse the table to collect fields like population size, growth rate, density, and urban population percentage.

#### **Store the Data:**
- Use pandas to structure the collected data in a DataFrame format.
- Save the cleaned and structured data into a CSV file or a database for further analysis.


## Question 2 (10 Points)
Write Python code to collect a dataset of 1000 samples related to the question discussed in Question 1.

In [None]:
# write your answer here

In [None]:
# Install necessary packages
!pip install requests
!pip install beautifulsoup4
!pip install pandas

import requests
from bs4 import BeautifulSoup
import pandas as pd

def scrape_world_population_data(url):
    # Send a GET request to the webpage
    response = requests.get(url)

    # Parse the webpage content
    soup = BeautifulSoup(response.content, 'html.parser')

    # Find the table containing population data
    table = soup.find('table', {'id': 'example2'})

    # Extract table headers
    headers = [header.text.strip() for header in table.find_all('th')]

    # Extract table rows
    rows = []
    for row in table.find_all('tr')[1:]:  # Skipping the header row
        cols = row.find_all('td')
        row_data = [col.text.strip() for col in cols]
        rows.append(row_data)

    # Convert to DataFrame
    df = pd.DataFrame(rows, columns=headers)

    return df

# URL of the Worldometers page containing the population data
url = 'https://www.worldometers.info/world-population/population-by-country/'
df = scrape_world_population_data(url)

# Save the data to a CSV file
df.to_csv('world_population_by_country.csv', index=False)

print(df.head())


   # Country (or dependency) Population (2024) Yearly Change  Net Change  \
0  1                   India     1,450,935,791        0.89 %  12,866,195   
1  2                   China     1,419,321,278       -0.23 %  -3,263,655   
2  3           United States       345,426,571        0.57 %   1,949,236   
3  4               Indonesia       283,487,931        0.82 %   2,297,864   
4  5                Pakistan       251,269,164        1.52 %   3,764,669   

  Density (P/Km²) Land Area (Km²) Migrants (net) Fert. Rate Med. Age  \
0             488       2,973,190       -630,830        2.0       28   
1             151       9,388,211       -318,992        1.0       40   
2              38       9,147,420      1,286,132        1.6       38   
3             156       1,811,570        -38,469        2.1       30   
4             326         770,880     -1,401,173        3.5       20   

  Urban Pop % World Share  
0        37 %     17.78 %  
1        66 %     17.39 %  
2        82 %      4.23 % 

## Question 3 (10 Points)
Write Python code to collect 1000 articles from Google Scholar (https://scholar.google.com/), Microsoft Academic (https://academic.microsoft.com/home), or CiteSeerX (https://citeseerx.ist.psu.edu/index), or Semantic Scholar (https://www.semanticscholar.org/), or ACM Digital Libraries (https://dl.acm.org/) with the keyword "XYZ". The articles should be published in the last 10 years (2014-2024).

The following information from the article needs to be collected:

(1) Title of the article

(2) Venue/journal/conference being published

(3) Year

(4) Authors

(5) Abstract

In [2]:
import requests
import pandas as pd
import time
def collect_articles_semantic_scholar(query, st_year, ed_year, no_of_articles):
    base_url = "https://api.semanticscholar.org/graph/v1/paper/search"
    headers = {"Accept": "application/json"}
    articles = []
    offset = 0
    limit = 100

    while len(articles) < no_of_articles:
        params = {
            "query": query,
            "offset": offset,
            "limit": min(limit, no_of_articles - len(articles)),
            "fields": "title,venue,year,authors,abstract",
            "yearFilter": f"{st_year}-{ed_year}"
        }

        response = requests.get(base_url, headers=headers, params=params)

#        if response.status_code == 429:
 #           print("Rate limit exceeded, sleeping for 60 seconds...")
  #          time.sleep(60)  # Sleep for 60 seconds before retrying
   #         continue

        if response.status_code != 200:
            print(f"Error: {response.status_code}")
            break

        data = response.json().get('data', [])
        for i in data:
            if len(articles) >= no_of_articles:
                break

            article = {
                'title': i.get('title', 'No title'),
                'venue': i.get('venue', 'No venue'),
                'year': i.get('year', 'No year'),
                'authors': ', '.join([author.get('name', 'Unknown') for author in i.get('authors', [])]),
                'abstract': i.get('abstract', 'No abstract')
            }
            articles.append(article)

        offset += limit

    return articles


query = "XYZ"
st_year = 2014
ed_year = 2024
no_of_articles = 1000
articles = collect_articles_semantic_scholar(query, st_year, ed_year, no_of_articles)

df = pd.DataFrame(articles)
df.to_csv('articles.csv', index=False)

for i, article in enumerate(articles[:5], 1):
    print(f"Article {i}:")
    print(f"Title: {article['title']}")
    print(f"Venue: {article['venue']}")
    print(f"Year: {article['year']}")
    print(f"Authors: {article['authors']}")
    print(f"Abstract: {article['abstract']}\n")

Error: 429


## Question 4A (10 Points)
Develop Python code to collect data from social media platforms like Reddit, Instagram, Twitter (formerly known as X), Facebook, or any other. Use hashtags, keywords, usernames, or user IDs to gather the data.



Ensure that the collected data has more than four columns.


In [None]:
# write your answer here

In [None]:
pip install praw

Collecting praw
  Downloading praw-7.7.1-py3-none-any.whl.metadata (9.8 kB)
Collecting prawcore<3,>=2.1 (from praw)
  Downloading prawcore-2.4.0-py3-none-any.whl.metadata (5.0 kB)
Downloading praw-7.7.1-py3-none-any.whl (191 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m191.0/191.0 kB[0m [31m1.5 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading prawcore-2.4.0-py3-none-any.whl (17 kB)
Installing collected packages: prawcore, praw
Successfully installed praw-7.7.1 prawcore-2.4.0


In [None]:
import praw
import pandas as pd

# Reddit API credentials
client_id = 'wXw8RAzd_hkSmKt8j2Pu2w'
client_secret = 'svWmPzQaiaQ_V8x12H1W56ZRzn8dWQ'
user_agent = 'EmbarrassedCover2033'

In [None]:
# Authenticate to Reddit using PRAW
reddit = praw.Reddit(client_id=client_id, client_secret=client_secret, user_agent=user_agent)

# Define the subreddit and search parameters
subreddit_name = 'stocks'
search_query = 'Apple'

# Function to collect Reddit posts
def collect_reddit_posts(subreddit_name, search_query):
    subreddit = reddit.subreddit(subreddit_name)
    posts_data = []

    # Search for posts in the subreddit
    for submission in subreddit.search(search_query, limit=100):
        post_id = submission.id
        title = submission.title
        author = submission.author.name if submission.author else 'N/A'
        created_at = submission.created_utc
        num_comments = submission.num_comments
        score = submission.score

        posts_data.append([post_id, title, author, created_at, num_comments, score])

    return posts_data

# Fetch and save Reddit posts data
posts_data = collect_reddit_posts(subreddit_name, search_query)
posts_df = pd.DataFrame(posts_data, columns=['Post ID', 'Title', 'Author', 'Created At', 'Number of Comments', 'Score'])
posts_df.to_csv('reddit_posts_data.csv', index=False)

print("Data collection completed and saved to reddit_posts_data.csv.")
print(posts_df.head())

It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.



Data collection completed and saved to reddit_posts_data.csv.
   Post ID                                              Title  \
0  1ej2a5e  Warren Buffett’s Berkshire Hathaway sold nearl...   
1  1bk7xgd                DOJ sues Apple over iPhone monopoly   
2  1ciq1fw  Apple announces largest-ever $110 billion shar...   
3  1b6agqo  Apple hit with more than $1.95 billion EU anti...   
4  1fdcraa  Apple loses EU court battle over 13 billion eu...   

            Author    Created At  Number of Comments  Score  
0  themagicalpanda  1.722688e+09                 543   3422  
1        Puginator  1.711032e+09                 912   2723  
2        Puginator  1.714682e+09                 524   2993  
3        Puginator  1.709558e+09                 414   1685  
4        Puginator  1.725956e+09                 261    862  


## Question 4B (10 Points)
If you encounter challenges with Question-4 web scraping using Python, employ any online tools such as ParseHub or Octoparse for data extraction. Introduce the selected tool, outline the steps for web scraping, and showcase the final output in formats like CSV or Excel.



Upload a document (Word or PDF File) in any shared storage (preferably UNT OneDrive) and add the publicly accessible link in the below code cell.

Please only choose one option for question 4. If you do both options, we will grade only the first one

In [None]:
# write your answer here


# Mandatory Question

**Important: Reflective Feedback on Web Scraping and Data Collection**



Please share your thoughts and feedback on the web scraping and data collection exercises you have completed in this assignment. Consider the following points in your response:



Learning Experience: Describe your overall learning experience in working on web scraping tasks. What were the key concepts or techniques you found most beneficial in understanding the process of extracting data from various online sources?



Challenges Encountered: Were there specific difficulties in collecting data from certain websites, and how did you overcome them? If you opted for the non-coding option, share your experience with the chosen tool.



Relevance to Your Field of Study: How might the ability to gather and analyze data from online sources enhance your work or research?

**(no grading of your submission if this question is left unanswered)**

In [None]:
'''
Write your response here.
'''