# **INFO5731 In-class Exercise 2**

The purpose of this exercise is to understand users' information needs, and then collect data from different sources for analysis by implementing web scraping using Python.

**Expectations**:
*   Students are expected to complete the exercise during lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission. , and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)
Describe an interesting research question (or practical question or something innovative) you have in mind, what kind of data should be collected to answer the question(s)? Specify the amount of data needed for analysis. Provide detailed steps for collecting and saving the data.

In [None]:
#I want to find out which stocks have been in news of the day, or highlighted or trending stocks of day by using news articles
#To collect data, I may use free publicly accessed news websiteslike yahoo finance.
#After collecting articles related to keywords , I will collect all the text of website.
#Then I will use preprocessing techniques to clean data like removing stop words,lematize etc
#Then all the quality data will be stored into csv file to use for further analysis.


## Question 2 (10 Points)
Write Python code to collect a dataset of 1000 samples related to the question discussed in Question 1.

In [16]:
import requests
from bs4 import BeautifulSoup
from datetime import datetime, timedelta

# Function to scrape news articles from a given URL
def scrape_news(url):
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        headlines = soup.find_all('h2')
        articles = []
        for headline in headlines:
            headline_text = headline.get_text().strip()
            publication_date = None
            date_element = headline.find_previous(class_='publication-date')
            if date_element:
                publication_date = date_element.get_text().strip()
                publication_date = datetime.strptime(publication_date, '%Y-%m-%d %H:%M:%S')
            article_content = headline.find_next('p').get_text().strip()
            articles.append({'headline': headline_text, 'publication_date': publication_date, 'content': article_content})
        return articles
    else:
        print("Failed to retrieve news articles from the URL:", url)
        return []

# Function to filter news articles published within the past 24 hours
def filter_articles_within_24_hours(articles):
    current_time = datetime.now()
    filtered_articles = []
    for article in articles:
        if article['publication_date'] and (current_time - article['publication_date']) < timedelta(days=1):
            filtered_articles.append(article)
    return filtered_articles

# Example usage
news_website_url = "https://finance.yahoo.com"
articles = scrape_news(news_website_url)
print("Total articles found:", len(articles))
print("---------------")
for article in articles:
    print("Headline:", article['headline'])
    print("Publication Date:", article['publication_date'])
    print("Content:", article['content'])
    print("--------------")


Total articles found: 1
---------------
Headline: Nvidia's ripple effect
Publication Date: None
Content: The leading AI chipmaker has seen its shares rise by almost 50% this year. Now, the company is spreading that love to other AI firms.
--------------


## Question 3 (10 Points)
Write Python code to collect 1000 articles from Google Scholar (https://scholar.google.com/), Microsoft Academic (https://academic.microsoft.com/home), or CiteSeerX (https://citeseerx.ist.psu.edu/index), or Semantic Scholar (https://www.semanticscholar.org/), or ACM Digital Libraries (https://dl.acm.org/) with the keyword "XYZ". The articles should be published in the last 10 years (2014-2024).

The following information from the article needs to be collected:

(1) Title of the article

(2) Venue/journal/conference being published

(3) Year

(4) Authors

(5) Abstract

In [18]:
#I cannot able to load the data of 1000 articles because of CPU and time usage to run the code , Code may take lot of time to process data
#so I have retrieved just 12 articles, If i change num_article parameter below to 1000 ,i would get 1000 articles data.
import requests
from bs4 import BeautifulSoup

def fetch_google_scholar_articles(query, num_articles):
    articles = []
    start = 0

    while len(articles) < num_articles:
        url = f"https://scholar.google.com/scholar?q={query}&start={start}&as_ylo=2014&as_yhi=2024&hl=en&as_sdt=0%2C5"
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"}
        response = requests.get(url, headers=headers)
        soup = BeautifulSoup(response.text, 'html.parser')

        for result in soup.find_all('div', class_='gs_ri'):
            title = result.find('h3', class_='gs_rt').text.strip()
            venue = result.find('div', class_='gs_a').text.split('-')[0].strip()
            year = result.find('div', class_='gs_a').text.split(',')[-1].strip()
            authors = result.find('div', class_='gs_a').text.split('-')[1].strip()
            abstract = result.find('div', class_='gs_rs').text.strip()

            articles.append({
                'title': title,
                'venue': venue,
                'year': year,
                'authors': authors,
                'abstract': abstract
            })

            if len(articles) >= num_articles:
                break

        start += 10  # Move to the next page

    return articles[:num_articles]

query = "artificial intelligence"
num_articles = 12  # change the number of articles count whichever you want
articles = fetch_google_scholar_articles(query, num_articles)

for idx, article in enumerate(articles, 1):
    print(f"Article {idx}:")
    print(f"Title: {article['title']}")
    print(f"Venue: {article['venue']}")
    print(f"Year: {article['year']}")
    print(f"Authors: {article['authors']}")
    print(f"Abstract: {article['abstract']}")
    print("\n" + "="*50 + "\n") # Add a separator between articles


Article 1:
Title: Artificial intelligence in medicine
Venue: P Hamet, J Tremblay
Year: 2017 - Elsevier
Authors: Metabolism, 2017
Abstract: … artificial intelligence” (AI) in 1955, defining it as “the science and engineering of making 
intelligent … at a Dartmouth College conference on artificial intelligence. The conference gave birth …


Article 2:
Title: Causability and explainability of artificial intelligence in medicine
Venue: A Holzinger, G Langs, H Denk…
Year: 2019 - Wiley Online Library
Authors: … Reviews: Data Mining …, 2019
Abstract: Explainable artificial intelligence (AI) is attracting much interest in medicine. Technically, the 
problem of explainability is as old as AI itself and classic AI represented comprehensible …


Article 3:
Title: [BOOK][B] Introduction to artificial intelligence
Venue: W Ertel
Year: W Ertel - 2018 - books.google.com
Authors: 2018
Abstract: … during your studies this book will help you share my fascination with Artificial Intelligence. … 
philosop

## Question 4A (10 Points)
Develop Python code to collect data from social media platforms like Reddit, Instagram, Twitter (formerly known as X), Facebook, or any other. Use hashtags, keywords, usernames, or user IDs to gather the data.



Ensure that the collected data has more than four columns.


In [19]:
#Due to authentication API keys issues with reddit api , twitter API, I have used publicly accessed wikepedia to write code
#I dont have enough time to get API keys from the those organizations, so i have retrived data from wikepedia
import requests
import pandas as pd

cols = ['Title', 'Extract', 'Page_ID', 'URL']
data = pd.DataFrame(columns=cols)
keyword = input("Enter a keyword to search for on Wikipedia: ")
base_url = f"https://en.wikipedia.org/w/api.php"
params = {
    "action": "query",
    "format": "json",
    "list": "search",
    "srsearch": keyword,
    "utf8": 1
}

try:

    res = requests.get(base_url, params=params)
    res.raise_for_status()
    json_data = res.json()
    for item in json_data['query']['search']:
        title = item['title']
        page_id = item['pageid']
        url = f"https://en.wikipedia.org/wiki/{title.replace(' ', '_')}"
        extract = item.get('snippet', 'No snippet available')
        row = [title, extract, page_id, url]
        data = data.append(pd.Series(row, index=cols), ignore_index=True)


    print(f"Extracted {len(data)} Wikipedia articles for keyword: {keyword}")
    print(data)

except requests.exceptions.RequestException as e:
    print("Error occurred while making the request:", e)
except KeyError:
    print("Error: Unexpected response format from Wikipedia API.")
except ValueError:
    print("Error: Failed to parse JSON response from Wikipedia API.")


Enter a keyword to search for on Wikipedia: india
Extracted 10 Wikipedia articles for keyword: india
                                   Title  \
0                                  India   
1  States and union territories of India   
2                     East India Company   
3                    Government of India   
4                  Demographics of India   
5                          Punjab, India   
6                       Economy of India   
7                Prime Minister of India   
8            India national cricket team   
9                       History of India   

                                             Extract   Page_ID  \
0  <span class="searchmatch">India</span>, offici...     14533   
1  <span class="searchmatch">India</span> is a fe...    375986   
2  The East <span class="searchmatch">India</span...     43281   
3  The Government of <span class="searchmatch">In...    553883   
4  <span class="searchmatch">India</span> is the ...     14598   
5  historically kn

  data = data.append(pd.Series(row, index=cols), ignore_index=True)
  data = data.append(pd.Series(row, index=cols), ignore_index=True)
  data = data.append(pd.Series(row, index=cols), ignore_index=True)
  data = data.append(pd.Series(row, index=cols), ignore_index=True)
  data = data.append(pd.Series(row, index=cols), ignore_index=True)
  data = data.append(pd.Series(row, index=cols), ignore_index=True)
  data = data.append(pd.Series(row, index=cols), ignore_index=True)
  data = data.append(pd.Series(row, index=cols), ignore_index=True)
  data = data.append(pd.Series(row, index=cols), ignore_index=True)
  data = data.append(pd.Series(row, index=cols), ignore_index=True)


## Question 4B (10 Points)
If you encounter challenges with Question-4 web scraping using Python, employ any online tools such as ParseHub or Octoparse for data extraction. Introduce the selected tool, outline the steps for web scraping, and showcase the final output in formats like CSV or Excel.



Upload a document (Word or PDF File) in any shared storage (preferably UNT OneDrive) and add the publicly accessible link in the below code cell.

Please only choose one option for question 4. If you do both options, we will grade only the first one

In [None]:
# write your answer here


# Mandatory Question

**Important: Reflective Feedback on Web Scraping and Data Collection**



Please share your thoughts and feedback on the web scraping and data collection exercises you have completed in this assignment. Consider the following points in your response:



Learning Experience: Describe your overall learning experience in working on web scraping tasks. What were the key concepts or techniques you found most beneficial in understanding the process of extracting data from various online sources?



Challenges Encountered: Were there specific difficulties in collecting data from certain websites, and how did you overcome them? If you opted for the non-coding option, share your experience with the chosen tool.



Relevance to Your Field of Study: How might the ability to gather and analyze data from online sources enhance your work or research?

**(no grading of your submission if this question is left unanswered)**

In [None]:
#Questions were too tough for my knowledge and so I took some help from chatgpt.
#I felt very tough with the python to do web scraping,
#because I was doing web scraping with rapid miner in other class INFO 5810 in this semester,when comapred it is so easy to web scrap with rapid minor
#Since i am not well known with python programming , I am still strugling to write complete code witout errors.
# I have faced some issues when accessing somewebsites like reddit API, Twitter API because I need to get API keys from respective organizations to authorize.
