
# **INFO5731 In-class Exercise 2**

The purpose of this exercise is to understand users' information needs, and then collect data from different sources for analysis by implementing web scraping using Python.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided *.ipynb* document to write your code and respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions incur a penalty of 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)
Describe an interesting research question (or practical question or something innovative) you have in mind. What kind of data should be collected to answer the question(s)? Specify the amount of data needed for analysis. Provide detailed steps for collecting and saving the data.

Research Question: What are the spatial and temporal patterns of bird migration in a given region, and how do these patterns correlate with environmental factors such as climate and habitat changes?

Data Needed:

*   Bird Migration Data: To understand the migration patterns, you would need data on the number of birds migrating through the region, the species involved, and the timing of their arrival and departure.
*   Environmental Data: This includes temperature, precipitation, vegetation, and other climate variables that can affect bird migration and habitat.
*   Habitat Data: Information on the types of habitats birds use during migration, such as wetlands, forests, or grasslands, and how these habitats change over time.

Amount of Data: The amount of data needed would depend on the size of the region you're studying, the number of bird species involved, and the length of the study period. For a relatively small region (e.g., a single state or province) and a moderate number of bird species (e.g., 50-100), a few years of data might be sufficient. However, for larger regions or more species, you might need data spanning multiple decades.

Steps for Data Collection and Saving:

1. Obtain bird migration data from existing databases or citizen science projects, such as eBird or the Global Flyway Network.
2. Collect environmental data from weather stations or other sources, and use GIS (Geographic Information Systems) software to analyze spatial and temporal patterns.
3. Use remote sensing data (e.g., satellite images) to track changes in habitat over time.
4. Save all data in a secure and organized manner, using standardized file formats and metadata to ensure compatibility and reproducibility.
5. Analyze the data using statistical methods and visualization tools to identify trends and correlations between bird migration patterns and environmental factors.
6. Interpret the results and draw conclusions about how climate and habitat changes are affecting bird migration in the region.
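
The saving step described above can be sketched as follows. This is a minimal illustration, assuming the collected records are a list of dictionaries; the helper name `save_with_metadata`, the field names, and the file layout are hypothetical and not taken from eBird or any other source.

```python
import csv
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def save_with_metadata(records, out_dir, name, source):
    """Write records to <name>.csv and describe them in <name>.meta.json."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Use the keys of the first record as the CSV header
    fieldnames = list(records[0].keys())
    csv_path = out_dir / f"{name}.csv"
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)

    # The metadata file records where the data came from and when it was saved
    metadata = {
        "source": source,
        "saved_at": datetime.now(timezone.utc).isoformat(),
        "num_records": len(records),
        "columns": fieldnames,
    }
    meta_path = out_dir / f"{name}.meta.json"
    meta_path.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return csv_path, meta_path


# Hypothetical example records, for illustration only
sample_records = [
    {"species": "example species A", "date": "2024-01-01", "count": 3},
    {"species": "example species B", "date": "2024-01-02", "count": 5},
]
demo_dir = tempfile.mkdtemp()
csv_path, meta_path = save_with_metadata(
    sample_records, demo_dir, "observations", "illustrative sample"
)
print(csv_path.name, meta_path.name)
```

Keeping the metadata in a separate JSON file leaves the CSV readable by any tool while still recording the source and collection time needed for reproducibility.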

## Question 2 (10 Points)
Write Python code to collect a dataset of 1000 samples related to the question discussed in Question 1.

In [None]:
import requests
import json
import os

# Set up a directory to save data (no error if it already exists)
os.makedirs('bird_migration_data', exist_ok=True)

# Function to save data
def save_data(data, filename):
    with open(filename, 'w') as file:
        json.dump(data, file)

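# Function to collect recent bird observations near a point from the eBird API
def collect_bird_migration_data(num_samples):
    url = 'https://api.ebird.org/v2/data/obs/geo/recent'
    # Replace 'your_token' with your actual token obtained from eBird
    headers = {'X-eBirdApiToken': 'your_token'}
    params = {'lat': 38.9072, 'lng': -77.0369, 'maxResults': num_samples}

    # A timeout keeps the cell from hanging if the server does not respond
    response = requests.get(url, headers=headers, params=params, timeout=30)
    if response.status_code == 200:
        return response.json()
    # Report the failure instead of silently returning None
    print(f"Request failed with status code {response.status_code}")
    return None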

# Collect up to 1000 samples of bird migration data
num_samples = 1000
data = collect_bird_migration_data(num_samples)

# Save the data to a file only if the request succeeded
if data is not None:
    save_data(data, 'bird_migration_data/bird_migration_data.json')
    print(f"Saved {len(data)} observations")


## Question 3 (10 Points)
Write Python code to collect 1000 articles from Google Scholar (https://scholar.google.com/), Microsoft Academic (https://academic.microsoft.com/home), or CiteSeerX (https://citeseerx.ist.psu.edu/index), or Semantic Scholar (https://www.semanticscholar.org/), or ACM Digital Libraries (https://dl.acm.org/) with the keyword "XYZ". The articles should be published in the last 10 years (2014-2024).

The following information from the article needs to be collected:

(1) Title of the article

(2) Venue/journal/conference being published

(3) Year

(4) Authors

(5) Abstract

In [None]:
import requests
from bs4 import BeautifulSoup
import re
import time
import pandas as pd


def get_soup(url):
    """
        params1: url (contains the url of google scholar page)
        return: soup (parsed HTML of the fetched page), or None if the request fails
    """
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'}
    try:
        # pass the headers by keyword; the second positional argument of requests.get is params
        data = requests.get(url, headers=headers, timeout=30)
    except requests.RequestException as ex:
        print(f"Exception occurred as {ex}")
        return None
    if data.status_code != 200:
        print(f"Failed to fetch data with status_code {data.status_code}")
        return None
    # name the parser explicitly so the result does not depend on which parsers are installed
    soup = BeautifulSoup(data.content, "html.parser")
    return soup

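def get_title(title):
    """
      params1: soup_api with fetched title of article
      return: string format of text title of article
    """
    link = title.find("a")
    # some entries may have no link, so fall back to the heading text
    return str(link.text) if link else title.get_text(strip=True)

def get_abstract_url(title):
    """
      params1: soup_api with fetched title of article
      return: string format of url for title of article (empty if there is no link)
    """
    link = title.find("a")
    return str(link.get("href")) if link else ""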


def get_article_info(article):
    """
      params1: soup_api with fetched article info
      return: tuples containing author, year, and published info
    """
    # normalise non-breaking spaces so the " - " separators can be split reliably
    text = str(article.text).replace("\xa0", " ")
    # match a four-digit year; a bare \d+ also matches numbers such as the "20" in "20th"
    match = re.search(r'\b(?:19|20)\d{2}\b', text)
    year = int(match.group()) if match else None
    # split on " - " rather than "-" so hyphenated names stay intact
    parts = text.split(" - ")
    published = parts[-1].strip()
    author = parts[0].strip()
    return author, year, published

def get_tags(soup):
    """
      params1: soup_api with fetched url and parsed data
      return: list of article info, such as titles, authors, year, published, abstract
    """
    # fetched titles and authors of article using findAll by mentioning some tags
    all_titles = soup.findAll("h3", attrs={"class": "gs_rt"})
    all_authors = soup.findAll("div", attrs={"class": "gs_a"})
    all_abstracts = soup.findAll("div", attrs={"class": "gs_rs"})

    authors, year, published = [], [], []

    titles = [get_title(title) for title in all_titles]
    abs_url = [get_abstract_url(title) for title in all_titles]
    abstract = [get_abstract(abstr) for abstr in all_abstracts]

    for author in all_authors:
        auth, yr, publs = get_article_info(author)
        authors.append(auth)
        year.append(yr)
        published.append(publs)

    return titles, authors, year, published, abstract, abs_url


def get_abstract(abstr):
    """
      params1: soup_Api with fetched abstract of article
      return: string format of article abstract by fetching its text
    """
    return str(abstr.text).replace("\n", "")


def fetch_web_data(records):
    """
      params1(records): number of articles, needs to be fetched
      return: dataframe containing total N number of articles.
    """
    year_st, year_end = 2012, 2022
    columns_google = ["Title", "Author", "Year", "Published", "Abstract", "Abstract_UrL"]
    # fetching for 1000 articles
    titles, authors, years, published, abstract, abs_url = [], [], [], [], [], []
    final_data = []
    print("***** BEFORE FETCHING *********")
    # records = 100 # no of articles
    for i in range(0, records, 10):
        url = f"https://scholar.google.com/scholar?start={i}&q=information+retrieval&hl=en&as_sdt=0,44&as_ylo={year_st}&as_yhi={year_end}&as_vis=1"
        soup = get_soup(url)
        if soup is None:
            print(f"Data Not Fetched.... for {i} article page")
            continue
        # titles, authors, year, published, abstract, abs_url
        a, b, c, d, e, f = get_tags(soup)
        titles.extend(a)
        authors.extend(b)
        years.extend(c)
        published.extend(d)
        abstract.extend(e)
        abs_url.extend(f)
        print(f"******* fetched {(i+10)} articles *********")
        # sleep for 5 seconds between requests so the server is not hit with frequent requests
        time.sleep(5)

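    # zip stops at the shortest list, so skipped pages or missing fields cannot raise an IndexError
    for row in zip(titles, authors, years, published, abstract, abs_url):
        final_data.append(list(row))
    final_data = final_data[:records]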

    print("******* AFTER FETCHING ********")
    df = pd.DataFrame(final_data, columns = columns_google)
    print(f"Number of records: {df.shape[0]}")
    return df

# url = "https://scholar.google.com/scholar?start=200&q=information+retrieval&hl=en&as_sdt=0,44&as_ylo=2012&as_yhi=2022&as_vis=1"
# soup = get_soup(url)
# all_titles = soup.findAll("h3", attrs={"class": "gs_rt"})
# all_authors = soup.findAll("div", attrs={"class": "gs_a"})
# all_abstracts = soup.findAll("div", attrs={"class": "gs_rs"})

df = fetch_web_data(int(input("Enter num of articles: ")))
print(f"dimensions of articles: {df.shape}")
df.head()



Enter num of articles: 100
***** BEFORE FETCHING *********
******* fetched 10 articles *********
******* fetched 20 articles *********
******* fetched 30 articles *********
******* fetched 40 articles *********
******* fetched 50 articles *********
******* fetched 60 articles *********
******* fetched 70 articles *********
******* fetched 80 articles *********
******* fetched 90 articles *********
******* fetched 100 articles *********
******* AFTER FETCHING ********
Number of records: 100
dimensions of articles: (100, 6)


|   | Title | Author | Year | Published | Abstract | Abstract_UrL |
|---|---|---|---|---|---|---|
| 0 | Information retrieval as statistical translation | A Berger, J Lafferty | 2017 | dl.acm.org | … There is a large literature on probabilistic... | https://dl.acm.org/doi/abs/10.1145/3130348.313... |
| 1 | A survey of automatic query expansion in infor... | C Carpineto, G Romano | 2012 | dl.acm.org | … information retrieval systems is largely cau... | https://dl.acm.org/doi/abs/10.1145/2071389.207... |
| 2 | A language modeling approach to information re... | JM Ponte, WB Croft | 2017 | dl.acm.org | … models, we have developed an approach to ret... | https://dl.acm.org/doi/pdf/10.1145/3130348.313... |
| 3 | A study of smoothing methods for language mode... | C Zhai, J Lafferty | 2017 | dl.acm.org | … to information retrieval are attractive and ... | https://dl.acm.org/doi/abs/10.1145/3130348.313... |
| 4 | Integrating and evaluating neural word embeddi... | G Zuccon, B Koopman, P Bruza… | 20 | dl.acm.org | … in information retrieval. Specifically, we f... | https://dl.acm.org/doi/abs/10.1145/2838931.283... |
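
The year `20` in the last row above suggests that taking the first run of digits from the byline can pick up other numbers, such as an ordinal in a conference name. A minimal sketch of a stricter parser follows, assuming the byline has the form `authors - venue, year - site`; the function name `parse_byline` and the sample string are hypothetical.

```python
import re


def parse_byline(text):
    """Split a byline of the assumed form 'authors - venue, year - site'."""
    # Replace non-breaking spaces so the ' - ' separators are uniform
    parts = [p.strip() for p in text.replace("\xa0", " ").split(" - ")]
    authors = parts[0]
    site = parts[-1] if len(parts) > 1 else ""
    middle = " - ".join(parts[1:-1]) if len(parts) > 2 else ""

    # Accept only a standalone four-digit year starting with 19 or 20
    match = re.search(r"\b(?:19|20)\d{2}\b", middle)
    year = int(match.group()) if match else None

    # Whatever precedes the year in the middle segment is treated as the venue
    venue = middle[:match.start()].rstrip(" ,") if match else middle
    return {"authors": authors, "venue": venue, "year": year, "site": site}


# Hypothetical byline, for illustration only
sample = "A Author, B Writer\xa0- Proceedings of the 20th Example Conference, 2015 - example.org"
print(parse_byline(sample))
```

Returning `None` when no year is found keeps a malformed byline from raising an exception and makes the gap visible in the resulting table.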


## Question 4A (10 Points)
Develop Python code to collect data from social media platforms like Reddit, Instagram, X (formerly known as Twitter), Facebook, or any other. Use hashtags, keywords, usernames, or user IDs to gather the data.



Ensure that the collected data has more than four columns.


In [None]:
!pip install asyncpraw

import asyncpraw
import pandas as pd

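# Authenticate with the Reddit API; replace the placeholders with your own credentials
reddit = asyncpraw.Reddit(client_id='your_client_id',
                          client_secret='your_client_secret',
                          user_agent='your_user_agent')

# Define the subreddit and keywords you want to search for
subreddit = await reddit.subreddit('all')
keywords = ['Python', 'Data Science', 'Machine Learning']
# search() expects a single query string, so join the keywords with OR
query = ' OR '.join(f'"{keyword}"' for keyword in keywords)

# Define the number of submissions to collect
num_submissions = 1000

# Collect selected fields from submissions matching the query
submissions = []
async for submission in subreddit.search(query, limit=num_submissions):
    submissions.append({
        'id': submission.id,
        'title': submission.title,
        'author': str(submission.author),
        'subreddit': str(submission.subreddit),
        'score': submission.score,
        'num_comments': submission.num_comments,
        'created_utc': submission.created_utc,
        'url': submission.url,
    })

# Close the underlying HTTP session
await reddit.close()

# Convert submissions to a pandas DataFrame
submissions_df = pd.DataFrame(submissions)

# Save the DataFrame to a CSV file
submissions_df.to_csv('reddit_submissions.csv', index=False)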




ResponseException: received 401 HTTP response
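
A 401 response generally indicates that the credentials were rejected. One way to keep real credentials out of the notebook and fail early when they are missing is sketched below; the environment variable names (`REDDIT_CLIENT_ID` and so on) and the helper `load_reddit_credentials` are assumptions for illustration, not part of asyncpraw.

```python
import os


def load_reddit_credentials(env=None):
    """Read Reddit API credentials from environment variables."""
    env = os.environ if env is None else env
    # Hypothetical variable names; choose whatever fits your setup
    names = {
        "client_id": "REDDIT_CLIENT_ID",
        "client_secret": "REDDIT_CLIENT_SECRET",
        "user_agent": "REDDIT_USER_AGENT",
    }
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {key: env[var] for key, var in names.items()}


# Demonstration with a stand-in mapping instead of the real environment
demo_env = {
    "REDDIT_CLIENT_ID": "placeholder_id",
    "REDDIT_CLIENT_SECRET": "placeholder_secret",
    "REDDIT_USER_AGENT": "placeholder_agent",
}
credentials = load_reddit_credentials(demo_env)
print(sorted(credentials))
```

The returned dictionary could then be passed as keyword arguments, for example `asyncpraw.Reddit(**credentials)`.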

## Question 4B (10 Points)
If you encounter challenges with Question-4 web scraping using Python, employ any online tools such as ParseHub or Octoparse for data extraction. Introduce the selected tool, outline the steps for web scraping, and showcase the final output in formats like CSV or Excel.



Upload a document (Word or PDF File) in any shared storage (preferably UNT OneDrive) and add the publicly accessible link in the code cell below.

Please choose only one option for Question 4. If you do both options, we will grade only the first one.

In [None]:
# write your answer here


# Mandatory Question

**Important: Reflective Feedback on Web Scraping and Data Collection**



Please share your thoughts and feedback on the web scraping and data collection exercises you have completed in this assignment. Consider the following points in your response:



Learning Experience: Describe your overall learning experience in working on web scraping tasks. What were the key concepts or techniques you found most beneficial in understanding the process of extracting data from various online sources?



Challenges Encountered: Were there specific difficulties in collecting data from certain websites, and how did you overcome them? If you opted for the non-coding option, share your experience with the chosen tool.



Relevance to Your Field of Study: How might the ability to gather and analyze data from online sources enhance your work or research?

**(no grading of your submission if this question is left unanswered)**

My overall learning experience in working on web scraping tasks has been enlightening. I've gained a deeper understanding of the process of extracting data from various online sources, which is a crucial skill for any professional working with data. One of the key concepts that I found most beneficial was the use of libraries like BeautifulSoup or Scrapy in Python, which provide powerful tools for parsing HTML and XML documents and extracting the information we need. These libraries allow us to navigate the structure of web pages and extract specific elements, such as text, links, and images, making the process of web scraping much more efficient and effective.
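
To illustrate the kind of element extraction described above, here is a minimal sketch. To stay dependency-free it uses Python's standard-library `html.parser` rather than BeautifulSoup, and the HTML snippet and the class name `ElementCollector` are made up for the example.

```python
from html.parser import HTMLParser


class ElementCollector(HTMLParser):
    """Collect visible text, link targets, and image sources from HTML."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.images = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])

    def handle_data(self, data):
        # Keep only non-empty text fragments
        if data.strip():
            self.text.append(data.strip())


# Made-up HTML snippet, for illustration only
html = """
<div>
  <h3><a href="https://example.org/paper">An example title</a></h3>
  <img src="https://example.org/figure.png" alt="figure">
  <p>A short example abstract.</p>
</div>
"""
collector = ElementCollector()
collector.feed(html)
print(collector.links, collector.images, collector.text)
```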

One challenge was collecting data from websites that use JavaScript to load content dynamically. In such cases, the initial HTML response from the server may not contain all the required information, and techniques like headless browsers or reverse engineering may be needed to extract it. Another challenge was websites that use anti-scraping measures, such as rate limiting or IP blocking. In such cases, we may need to use proxies or VPNs to disguise our IP address and avoid being detected as a scraper. Overall, web scraping is a valuable skill that can be applied in many different contexts, and I look forward to developing it further.
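
One common way to cope with rate limiting is to slow down and retry. The sketch below shows retrying with exponential backoff; the function `fetch_with_backoff`, its parameters, and the stand-in `flaky_fetch` are hypothetical, and the `fetch` callable is assumed to return `None` when a request is rejected.

```python
import time


def fetch_with_backoff(fetch, url, max_retries=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) until it returns a result, doubling the wait each time."""
    for attempt in range(max_retries):
        result = fetch(url)
        if result is not None:
            return result
        # Wait base_delay, then 2x, 4x, ... but not after the final attempt
        if attempt < max_retries - 1:
            sleep(base_delay * (2 ** attempt))
    return None


# Stand-in fetcher that fails twice and then succeeds, for illustration only
calls = {"count": 0}


def flaky_fetch(url):
    calls["count"] += 1
    return None if calls["count"] < 3 else f"content of {url}"


# Record the waits in a list instead of actually sleeping
waits = []
page = fetch_with_backoff(flaky_fetch, "https://example.org", sleep=waits.append)
print(page, waits)
```

Passing the `sleep` function as a parameter lets the delays be inspected or skipped during testing, while real use keeps the default `time.sleep`.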

