
# **INFO5731 In-class Exercise 2**

The purpose of this exercise is to understand users' information needs, and then collect data from different sources for analysis by implementing web scraping using Python.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided *.ipynb* document to write your code and respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions incur a penalty of 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)
Describe an interesting research question (or practical question or something innovative) you have in mind. What kind of data should be collected to answer the question(s)? Specify the amount of data needed for analysis. Provide detailed steps for collecting and saving the data.

Research Question:
Classify PubMed journal abstracts.

Data Needed:
50,000–100,000 abstracts from PubMed, labeled with relevant classification labels.

Steps:
Retrieve abstracts through PubMed’s E-utilities (the Entrez API) or the Python package Metapub. Preprocess and label the data. Save the data in CSV or JSON format. Store the files in a cloud service and track changes with a version control tool.
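
The retrieval step above can be sketched directly against E-utilities. This is a minimal illustration, not the method used in this notebook: it only builds the request URLs for the documented `esearch.fcgi` and `efetch.fcgi` endpoints, and the helper names (`esearch_url`, `efetch_url`, `batches`) and the batch size are assumptions chosen for the example.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, retmax=100, retstart=0):
    """Build an ESearch URL that returns matching PubMed IDs (PMIDs) as JSON."""
    params = {
        "db": "pubmed",
        "term": term,
        "retmax": retmax,
        "retstart": retstart,
        "retmode": "json",
    }
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def efetch_url(pmids):
    """Build an EFetch URL that returns the abstracts for a batch of PMIDs as XML."""
    params = {
        "db": "pubmed",
        "id": ",".join(str(p) for p in pmids),
        "rettype": "abstract",
        "retmode": "xml",
    }
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


def batches(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]
```

Requesting PMIDs in pages (`retstart`, `retmax`) and fetching abstracts in batches keeps the number of requests low, which matters when the target is tens of thousands of abstracts.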

## Question 2 (10 Points)
Write Python code to collect a dataset of 1000 samples related to the question discussed in Question 1.

In [1]:
# write your answer here
!pip install metapub

Collecting metapub
  Downloading metapub-0.5.12-py2.py3-none-any.whl.metadata (16 kB)
Collecting lxml-html-clean (from metapub)
  Downloading lxml_html_clean-0.2.2-py3-none-any.whl.metadata (1.8 kB)
Collecting eutils (from metapub)
  Downloading eutils-0.6.0-py2.py3-none-any.whl.metadata (7.0 kB)
Collecting habanero (from metapub)
  Downloading habanero-1.2.6-py2.py3-none-any.whl.metadata (14 kB)
Collecting cssselect (from metapub)
  Downloading cssselect-1.2.0-py2.py3-none-any.whl.metadata (2.2 kB)
Collecting unidecode (from metapub)
  Downloading Unidecode-1.3.8-py3-none-any.whl.metadata (13 kB)
Collecting docopt (from metapub)
  Downloading docopt-0.6.2.tar.gz (25 kB)
  Preparing metadata (setup.py) ... done
Collecting coloredlogs (from metapub)
  Downloading coloredlogs-15.0.1-py2.py3-none-any.whl.metadata (12 kB)
Collecting python-Levenshtein (from metapub)
  Downloading python_Levenshtein-0.25.1-py3-none-any.whl.metadata (3.7 kB)
Collecting humanfriendly>=9.1 (from co

In [2]:
from metapub import PubMedFetcher
import pandas as pd
import time

# Initialize the PubMedFetcher
fetcher = PubMedFetcher()

# Function to fetch PubMed abstracts using Metapub
def fetch_pubmed_abstracts_metapub(query, max_results=10):
    pmids = fetcher.pmids_for_query(query, retmax=max_results)

    abstracts = []

    # Loop through each PMID and fetch the abstract and other details
    for pmid in pmids:
        try:
            # Fetch the article details
            article = fetcher.article_by_pmid(pmid)

            # Check if the abstract exists and fetch it
            if article.abstract:
                abstracts.append({"PMID": pmid, "Abstract": article.abstract, "Label": query})

            # Add a delay to avoid overwhelming the API
            time.sleep(0.5)

        except Exception as e:
            print(f"Error fetching data for PMID {pmid}: {e}")

    return abstracts

# Define query
query = "cardiology"  # This can be changed

# Fetch up to 1000 PubMed abstracts using the query
# (articles without an abstract are skipped, so the final count can be lower)
abstracts_data = fetch_pubmed_abstracts_metapub(query, max_results=1000)

# Convert the data into a pandas DataFrame
df = pd.DataFrame(abstracts_data)

# Save the dataset to a CSV file
df.to_csv('pubmed_abstracts_metapub_1000.csv', index=False)

print("Dataset saved to 'pubmed_abstracts_metapub_1000.csv'.")

2024-09-16 01:08:48 f2315b584b6b numexpr.utils[370] INFO NumExpr defaulting to 2 threads.


Dataset saved to 'pubmed_abstracts_metapub_1000.csv'.
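
Because the code above stores the query string as the label, a single query yields a single class, which is not enough for a classification task. One possible way to plan a multi-class collection is sketched below. The helper names and the extra labels used in the usage note are hypothetical; the sketch only shows how a total of 1000 samples could be split across labels and how duplicate PMIDs could be dropped before saving.

```python
def per_label_quota(labels, total):
    """Split `total` samples as evenly as possible across the given labels."""
    base, extra = divmod(total, len(labels))
    # The first `extra` labels receive one additional sample each.
    return {label: base + (1 if i < extra else 0) for i, label in enumerate(labels)}


def dedupe_by_pmid(records):
    """Keep only the first record seen for each PMID, preserving order."""
    seen = set()
    unique = []
    for row in records:
        if row["PMID"] not in seen:
            seen.add(row["PMID"])
            unique.append(row)
    return unique
```

For example, `per_label_quota(["cardiology", "oncology", "neurology"], 1000)` assigns 334, 333, and 333 samples. Each quota can then be passed as `max_results` to `fetch_pubmed_abstracts_metapub`, and the combined, de-duplicated list can be given to `pd.DataFrame` as before.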


## Question 3 (10 Points)
Write Python code to collect 1000 articles from Google Scholar (https://scholar.google.com/), Microsoft Academic (https://academic.microsoft.com/home), or CiteSeerX (https://citeseerx.ist.psu.edu/index), or Semantic Scholar (https://www.semanticscholar.org/), or ACM Digital Libraries (https://dl.acm.org/) with the keyword "XYZ". The articles should be published in the last 10 years (2014-2024).

The following information from the article needs to be collected:

(1) Title of the article

(2) Venue/journal/conference being published

(3) Year

(4) Authors

(5) Abstract

In [3]:
# write your answer here

import requests
from bs4 import BeautifulSoup
import pandas as pd
from datetime import datetime

url = 'https://scholar.google.com/scholar?hl=en&as_sdt=0%2C44&q=chemistry&oq='


response = requests.get(url)
if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')

    end_year = datetime.now().year
    start_year = end_year - 10  # keep only articles from the last 10 years
    data = []
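    # The run recorded below finds no elements with these tag/class names,
    # so the selectors need to be matched to the HTML that is actually returned.
    articles = soup.find_all('div', class_='article')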

    if not articles:
        print("No articles found.")


    for article in articles:
        title_tag = article.find('h2', class_='title')
        title = title_tag.text.strip() if title_tag else 'No title available'
        journal_tag = article.find('span', class_='journal')
        year_tag = article.find('span', class_='year')
        journal = journal_tag.text.strip() if journal_tag else 'No journal/venue available'
        year_text = year_tag.text.strip() if year_tag else 'No year available'


        authors_tag = article.find('span', class_='authors')
        authors = authors_tag.text.strip() if authors_tag else 'No authors available'


        abstract_tag = article.find('p', class_='abstract')
        abstract = abstract_tag.text.strip() if abstract_tag else 'No abstract available'
        try:
            year = int(year_text)
            if start_year <= year <= end_year:
                data.append({
                    'Title': title,
                    'Journal/Venue': journal,
                    'Year': year,
                    'Authors': authors,
                    'Abstract': abstract
                })
        except ValueError:
            # Skip entries whose year is missing or not a number
            continue

    df = pd.DataFrame(data)
    print(df)

else:
    print(f"Failed to retrieve the webpage. Status code: {response.status_code}")





No articles found.
Empty DataFrame
Columns: []
Index: []
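
Since the question also allows Semantic Scholar, one alternative to parsing HTML is its Graph API, which returns the five required fields as JSON. The following is a hedged sketch rather than a tested solution: it assumes the public `paper/search` endpoint with the `query`, `fields`, `year`, `offset`, and `limit` parameters, that unauthenticated requests are allowed but rate-limited, and that relevance search returns at most about 1,000 results. The helper names and the one-second delay are choices made for illustration.

```python
import json
import time
from urllib.parse import urlencode
from urllib.request import urlopen

# Semantic Scholar Graph API paper search endpoint.
S2_SEARCH_URL = "https://api.semanticscholar.org/graph/v1/paper/search"
FIELDS = "title,venue,year,authors,abstract"


def paper_to_row(paper):
    """Flatten one paper record from the API into the five required columns."""
    authors = paper.get("authors") or []
    return {
        "Title": paper.get("title"),
        "Venue": paper.get("venue"),
        "Year": paper.get("year"),
        "Authors": ", ".join(a.get("name", "") for a in authors),
        "Abstract": paper.get("abstract"),
    }


def in_year_range(row, start_year, end_year):
    """Return True only when the row has an integer year inside the range."""
    year = row.get("Year")
    return isinstance(year, int) and start_year <= year <= end_year


def search_papers(keyword, total=1000, page_size=100, start_year=2014, end_year=2024):
    """Page through search results until `total` rows are collected or results run out."""
    rows = []
    offset = 0
    while len(rows) < total:
        params = {
            "query": keyword,
            "fields": FIELDS,
            "year": f"{start_year}-{end_year}",
            "offset": offset,
            "limit": page_size,
        }
        with urlopen(f"{S2_SEARCH_URL}?{urlencode(params)}", timeout=30) as response:
            papers = json.load(response).get("data", [])
        if not papers:
            break
        rows.extend(
            row for row in map(paper_to_row, papers)
            if in_year_range(row, start_year, end_year)
        )
        offset += page_size
        time.sleep(1)  # pause between pages to stay within rate limits
    return rows[:total]
```

The returned list of dictionaries can be passed to `pd.DataFrame` and saved with `to_csv`, in the same way as the PubMed dataset in Question 2.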


## Question 4A (10 Points)
Develop Python code to collect data from social media platforms like Reddit, Instagram, X (formerly known as Twitter), Facebook, or any other. Use hashtags, keywords, usernames, or user IDs to gather the data.



Ensure that the collected data has more than four columns.


In [6]:
!pip install GetOldTweets3



In [7]:
# write your answer here

import pandas as pd
import GetOldTweets3 as got

def twitter_info():
  # Expand each tweet object in the global tweet_df into text, date, and hashtag columns
  tweet_df["tweet_text"] = tweet_df["got_criteria"].apply(lambda x: x.text)
  tweet_df["date"] = tweet_df["got_criteria"].apply(lambda x: x.date)
  tweet_df["hashtags"] = tweet_df["got_criteria"].apply(lambda x: x.hashtags)

keyword = '#Viratkohli'
# GetOldTweets3 expects dates in yyyy-mm-dd format
oldest_date = '2023-09-15'
newest_date = '2024-09-15'
locations = ['Hyderabad']

number_of_tweets = 10

# Build one search criteria object per location
tweetCriteria_list = []
for location in locations:
  try:
    tweetCriteria = (got.manager.TweetCriteria()
                     .setQuerySearch(keyword)
                     .setSince(oldest_date)
                     .setUntil(newest_date)
                     .setNear(location)
                     .setMaxTweets(number_of_tweets))
    tweetCriteria_list.append(tweetCriteria)
  except Exception as e:
    print(f"Could not build criteria for {location}: {e}")
    continue

tweet_dict = {}
for criteria, location in zip(tweetCriteria_list, locations):
    tweets = got.manager.TweetManager.getTweets(criteria)
    tweet_dict[location] = tweets


tweet_df = pd.DataFrame(dict([ (k,pd.Series(v)) for k,v in tweet_dict.items() ]))
tweet_df['tweet_count'] = tweet_df.index
tweet_df = pd.melt(tweet_df, id_vars=["tweet_count"], var_name='City', value_name='got_criteria')
tweet_df = tweet_df.dropna()

twitter_info()
tweet_df = tweet_df.drop("got_criteria", axis=1)
tweet_df.head()

csv_file_path = 'tweets_data.csv'
tweet_df.to_csv(csv_file_path, index=False)

print(tweet_df.head())



2024-09-16 01:23:13 f2315b584b6b root[370] ERROR Internal Python error in the inspect module.
Below is the traceback from this internal error.

2024-09-16 01:23:13 f2315b584b6b root[370] INFO 
Unfortunately, your original traceback can not be constructed.



An error occured during an HTTP request: HTTP Error 403: Forbidden
Try to open in browser: https://twitter.com/search?q=%23Viratkohli%20near%3A%22Hyderabad%22%20within%3A15mi%20since%3A09-15-2023%20until%3A09-15-2024&src=typd
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/GetOldTweets3/manager/TweetManager.py", line 343, in getJsonResponse
    response = opener.open(url)
  File "/usr/lib/python3.10/urllib/request.py", line 525, in open
    response = meth(req, response)
  File "/usr/lib/python3.10/urllib/request.py", line 634, in http_response
    response = self.parent.error(
  File "/usr/lib/python3.10/urllib/request.py", line 563, in error
    return self._call_chain(*args)
  File "/usr/lib/python3.10/urllib/request.py", line 496, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.10/urllib/request.py", line 643, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbi

TypeError: object of type 'NoneType' has no len()
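
The 403 response above means the search request was refused, so no tweets were returned. Since the question also allows Reddit, one possible fallback is Reddit's public JSON search listing. The sketch below is an assumption-laden illustration, not a verified replacement: it assumes the `search.json` endpoint accepts `q` and `limit` without authentication when a descriptive `User-Agent` header is sent, and the user-agent string and helper names are made up for the example. It produces six columns, which satisfies the requirement of more than four.

```python
import json
from datetime import datetime, timezone
from urllib.parse import urlencode
from urllib.request import Request, urlopen

REDDIT_SEARCH_URL = "https://www.reddit.com/search.json"


def listing_to_rows(listing):
    """Flatten a Reddit listing payload into one dictionary per post."""
    rows = []
    for child in listing.get("data", {}).get("children", []):
        post = child.get("data", {})
        created = post.get("created_utc")
        rows.append({
            "title": post.get("title"),
            "author": post.get("author"),
            "subreddit": post.get("subreddit"),
            "score": post.get("score"),
            "num_comments": post.get("num_comments"),
            # Convert the Unix timestamp to an ISO 8601 string in UTC.
            "created": (
                datetime.fromtimestamp(created, tz=timezone.utc).isoformat()
                if created is not None else None
            ),
        })
    return rows


def search_reddit(keyword, limit=25, user_agent="info5731-exercise/0.1"):
    """Query the public search listing and return the flattened rows."""
    url = f"{REDDIT_SEARCH_URL}?{urlencode({'q': keyword, 'limit': limit})}"
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request, timeout=30) as response:
        return listing_to_rows(json.load(response))
```

The rows returned by `search_reddit("Viratkohli")` can be loaded into a DataFrame and written to CSV in the same way as `tweet_df`.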

## Question 4B (10 Points)
If you encounter challenges with Question-4 web scraping using Python, employ any online tools such as ParseHub or Octoparse for data extraction. Introduce the selected tool, outline the steps for web scraping, and showcase the final output in formats like CSV or Excel.



Upload a document (Word or PDF File) in any shared storage (preferably UNT OneDrive) and add the publicly accessible link in the below code cell.

Please choose only one option for Question 4. If you do both options, we will grade only the first one.

In [None]:
# write your answer here


# Mandatory Question

**Important: Reflective Feedback on Web Scraping and Data Collection**



Please share your thoughts and feedback on the web scraping and data collection exercises you have completed in this assignment. Consider the following points in your response:



Learning Experience: Describe your overall learning experience in working on web scraping tasks. What were the key concepts or techniques you found most beneficial in understanding the process of extracting data from various online sources?



Challenges Encountered: Were there specific difficulties in collecting data from certain websites, and how did you overcome them? If you opted for the non-coding option, share your experience with the chosen tool.



Relevance to Your Field of Study: How might the ability to gather and analyze data from online sources enhance your work or research?

**(no grading of your submission if this question is left unanswered)**

In [None]:
'''
Write your response here.

The key concepts I found most beneficial are HTML structure, libraries and tools, and handling dynamic content.
'''