
# **INFO5731 In-class Exercise 2**

The purpose of this exercise is to understand users' information needs, and then collect data from different sources for analysis by implementing web scraping using Python.

**Expectations**:
*   Students are expected to complete the exercise during lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will incur a penalty of 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)
Describe an interesting research question (or a practical question, or something innovative) you have in mind. What kind of data should be collected to answer the question(s)? Specify the amount of data needed for analysis. Provide detailed steps for collecting and saving the data.

In [None]:
Research Question: "How do lifestyle factors such as diet, exercise, and stress levels impact the risk of developing heart disease?"

To answer this question, we would need to collect data on various lifestyle factors, medical history, and heart disease outcomes. Here are the steps for collecting and saving the data:

Identify Relevant Variables: Determine which variables are relevant to the research question. This may include factors such as age, gender, BMI, diet (e.g., consumption of fruits, vegetables, saturated fats), exercise habits (e.g., frequency, duration), smoking status, alcohol consumption, stress levels, family history of heart disease, blood pressure, cholesterol levels, and presence of other medical conditions.

Find Suitable Data Sources: Look for datasets that contain information on the identified variables. These datasets can come from various sources such as health surveys, medical records, clinical trials, or population studies.

Acquire the Data: Download or access the datasets containing the relevant variables. Ensure that the datasets are in a format that can be easily imported into Python, such as CSV or Excel.

Preprocess the Data: Preprocess the datasets by cleaning missing values, standardizing variable names, and formatting data types as needed. Merge or concatenate multiple datasets if necessary.

Select Sample Size: Determine the sample size needed for analysis. Since the research question involves examining the impact of lifestyle factors on heart disease risk, a large sample size may be required to capture a diverse range of individuals with varying lifestyles.

Select Data Analysis Methods: Decide on the appropriate data analysis methods for addressing the research question. This may include descriptive statistics, correlation analysis, regression analysis, or machine learning techniques for predictive modeling.

Save the Data: Once the data preprocessing is complete, save the cleaned and merged dataset to a CSV file for further analysis. Ensure that the file is well-documented and includes information about variable definitions and data sources.
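
The preprocessing and saving steps above can be sketched with pandas. This is a minimal illustration under stated assumptions: the column names (`Age`, `BMI`, `Exercise Hours`, `Heart Disease`), the sample values, and the output filename are all hypothetical, and median imputation is only one of several reasonable ways to handle missing values.

```python
import pandas as pd

# Hypothetical raw records; a real study would load these from a survey or clinical dataset.
raw = pd.DataFrame({
    "Age": [63, 37, None, 56],
    "BMI": [27.1, None, 31.4, 24.8],
    "Exercise Hours": [2.0, 5.5, 0.0, None],
    "Heart Disease": [1, 0, 1, 0],
})

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Standardize column names and fill missing numeric values with the column median."""
    cleaned = df.copy()
    # Standardize variable names: lower-case, underscores instead of spaces.
    cleaned.columns = [c.strip().lower().replace(" ", "_") for c in cleaned.columns]
    # Fill missing values in numeric columns with each column's median.
    numeric_cols = cleaned.select_dtypes(include="number").columns
    cleaned[numeric_cols] = cleaned[numeric_cols].fillna(cleaned[numeric_cols].median())
    return cleaned

cleaned = preprocess(raw)
# Save the cleaned data for later analysis (hypothetical filename).
cleaned.to_csv("lifestyle_heart_clean.csv", index=False)
print(cleaned)
```

In practice the variable definitions and data sources would be documented alongside the saved file, for example in a short data dictionary.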

## Question 2 (10 Points)
Write Python code to collect a dataset of 1000 samples related to the question discussed in Question 1.

In [27]:
import pandas as pd
import requests
from io import StringIO

# URL of the raw dataset file on GitHub
url = "https://raw.githubusercontent.com/hosiajosindra/heart-attack-classification/main/heart.csv"

# Fetch the data from the URL
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    # Read the content of the response into a DataFrame
    heart_disease_data = pd.read_csv(StringIO(response.text))

    # Display the first few rows of the dataset
    print(heart_disease_data.head())

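    # Draw 1000 rows. replace=True lets the sample exceed the size of the source
    # data, but rows then repeat, so this is a resample rather than 1000 unique records.
    sample_data = heart_disease_data.sample(n=1000, replace=True, random_state=42)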

    # Save the sample dataset to a CSV file
    sample_data.to_csv('heart_disease_sample.csv', index=False)

    print("Sample dataset saved successfully.")
else:
    print("Error: Unable to fetch data from the URL")


   age  sex  cp  trtbps  chol  fbs  restecg  thalachh  exng  oldpeak  slp  \
0   63    1   3     145   233    1        0       150     0      2.3    0   
1   37    1   2     130   250    0        1       187     0      3.5    0   
2   41    0   1     130   204    0        0       172     0      1.4    2   
3   56    1   1     120   236    0        1       178     0      0.8    2   
4   57    0   0     120   354    0        1       163     1      0.6    2   

   caa  thall  output  
0    0      1       1  
1    0      2       1  
2    0      2       1  
3    0      2       1  
4    0      2       1  
Sample dataset saved successfully.
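
A note on the sampling step: `DataFrame.sample(..., replace=True)` can return more rows than the source holds, but only by repeating rows. The sketch below uses a small made-up frame (not the heart dataset) to show one way to check how many distinct source rows a resampled file actually contains.

```python
import pandas as pd

# Made-up stand-in for a source dataset that is smaller than the requested sample.
source = pd.DataFrame({"id": range(300), "value": [i % 7 for i in range(300)]})

# Sampling with replacement allows n > len(source), at the cost of repeated rows.
resampled = source.sample(n=1000, replace=True, random_state=42)

n_unique = resampled["id"].nunique()
n_duplicates = len(resampled) - n_unique
print(f"rows: {len(resampled)}, distinct source rows: {n_unique}, repeats: {n_duplicates}")

# Without replacement, the sample can be no larger than the source itself.
max_without_replacement = source.sample(n=len(source), random_state=42)
```

If 1000 distinct records are required, a larger source dataset (or several merged sources) would be needed instead of resampling.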


## Question 3 (10 Points)
Write Python code to collect 1000 articles from Google Scholar (https://scholar.google.com/), Microsoft Academic (https://academic.microsoft.com/home), or CiteSeerX (https://citeseerx.ist.psu.edu/index), or Semantic Scholar (https://www.semanticscholar.org/), or ACM Digital Libraries (https://dl.acm.org/) with the keyword "XYZ". The articles should be published in the last 10 years (2014-2024).

The following information from the article needs to be collected:

(1) Title of the article

(2) Venue/journal/conference being published

(3) Year

(4) Authors

(5) Abstract

In [28]:
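import requests
from bs4 import BeautifulSoup
from urllib.parse import quote_plus
import re
import time

def scrape_articles(keyword, num_articles_to_collect):
    base_url = "https://scholar.google.com"
    # URL-encode the keyword so spaces and special characters survive in the query string
    search_query = f"{base_url}/scholar?q={quote_plus(keyword)}&as_ylo=2014&as_yhi=2024&hl=en&as_sdt=0,5"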

    collected_articles = []
    collected_articles_count = 0

    while collected_articles_count < num_articles_to_collect:
        # Send a GET request to the search URL
        response = requests.get(search_query)

        # Check if the request was successful
        if response.status_code == 200:
            # Parse the HTML content of the page
            soup = BeautifulSoup(response.content, 'html.parser')

            # Find all the article elements on the page
            article_elements = soup.find_all('div', class_='gs_ri')

            for element in article_elements:
                # Extract article details if available
                title = element.find('h3', class_='gs_rt').text.strip()

                venue_element = element.find('div', class_='gs_a')
                if venue_element:
                    venue_match = re.search(r'-\s*(.*?)\s*-', venue_element.text)
                    if venue_match:
                        venue = venue_match.group(1).strip()
                    else:
                        venue = "Venue not available"
                else:
                    venue = "Venue not available"

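                # Guard against a missing byline so one malformed result does not crash the loop
                byline_text = venue_element.text if venue_element else ""
                year_match = re.search(r'\d{4}', byline_text)
                year = year_match.group() if year_match else "Year not available"

                # Note: only authors with linked profiles appear as <a> tags in the byline
                authors_element = venue_element.find_all('a') if venue_element else []
                authors = ', '.join([author.text.strip() for author in authors_element])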

                abstract_element = element.find('div', class_='gs_rs')
                abstract = abstract_element.text.strip() if abstract_element else "Abstract not available"

                # Append article details to the list
                collected_articles.append({
                    'title': title,
                    'venue': venue,
                    'year': year,
                    'authors': authors,
                    'abstract': abstract
                })

                collected_articles_count += 1

                if collected_articles_count == num_articles_to_collect:
                    break

            # Check if there are more pages of results
            next_page_link = soup.find('a', class_='gs_ico gs_ico_nav_next')
            if next_page_link:
                search_query = base_url + next_page_link['href']
            else:
                break
        else:
            print("Error: Unable to retrieve articles")
            break

        # Add a delay to avoid overwhelming the server
        time.sleep(1)

    return collected_articles

# Example usage
search_keyword = "XYZ"
num_articles_to_collect = 1000
articles = scrape_articles(search_keyword, num_articles_to_collect)

# Save the collected articles to a file
with open('collected_articles.txt', 'w', encoding='utf-8') as file:
    for article in articles:
        file.write(f"Title: {article['title']}\n")
        file.write(f"Venue: {article['venue']}\n")
        file.write(f"Year: {article['year']}\n")
        file.write(f"Authors: {article['authors']}\n")
        file.write(f"Abstract: {article['abstract']}\n\n")

# Output the first few articles (fewer than 5 may have been collected)
for i in range(min(5, len(articles))):
    print(f"Article {i+1}:")
    print(f"Title: {articles[i]['title']}")
    print(f"Venue: {articles[i]['venue']}")
    print(f"Year: {articles[i]['year']}")
    print(f"Authors: {articles[i]['authors']}")
    print(f"Abstract: {articles[i]['abstract']}")
    print("\n")


Article 1:
Title: An overview of XYZ new particles
Venue: Chinese Science Bulletin, 2014
Year: 2014
Authors: 
Abstract: … (XYZ\) have been announced by experiments after analyzing various processes. Until now, 
the family of \(XYZ… In general, the observed \(XYZ\) states can be categorized into five groups, …


Article 2:
Title: The XYZ states: experimental and theoretical status and perspectives
Venue: Physics Reports, 2020
Year: 2020
Authors: N Brambilla, S Eidelman
Abstract: The quark model was formulated in 1964 to classify mesons as bound states made of a 
quark–antiquark pair, and baryons as bound states made of three quarks. For a long time all …


Article 3:
Title: [HTML][HTML] XYZ states: An experimental point-of-view
Venue: Reviews in Physics, 2022
Year: 2022
Authors: 
Abstract: Since 2003, a new family of states without a clear theoretical interpretation has been measured 
in the heavy quarkonium spectrum, the so-called X Y Z states. While the nature of these …


Article 4:


## Question 4A (10 Points)
Develop Python code to collect data from social media platforms like Reddit, Instagram, X (formerly known as Twitter), Facebook, or any other. Use hashtags, keywords, usernames, or user IDs to gather the data.



Ensure that the collected data has more than four columns.


In [35]:
# write your answer here
import json
from datetime import datetime
import requests
from bs4 import BeautifulSoup

headers = {
    'authority': 'www.reddit.com',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'accept-language': 'en-US,en;q=0.9',
    'cache-control': 'max-age=0',
    'sec-ch-ua': '"Not A(Brand";v="99", "Google Chrome";v="121", "Chromium";v="121"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'none',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36',
}


query = "XYZ"
url = f'https://www.reddit.com/search/?q={query}'
page = requests.get(url, headers=headers)
print(page.content)
soup = BeautifulSoup(page.content, 'html.parser')

search_params = {
    'source': 'search',
    'action': 'view',
    'noun': 'post',
    'data-testid': 'search-post'
}
faceplate_list = soup.find_all(name="faceplate-tracker", attrs=search_params)

print(f'No. of posts extracted were {len(faceplate_list)}')
print()

for faceplate in faceplate_list:
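    each_post_attr = faceplate.get('data-faceplate-tracking-context')
    if not each_post_attr:
        # Skip elements that carry no tracking context to parse
        continue

    # load json
    each_post_data = json.loads(each_post_attr)

    post_data = each_post_data.get('post', {})
    title = post_data.get('title', '')
    # created_timestamp is in milliseconds; fall back gracefully when it is missing
    date_timestamp = post_data.get('created_timestamp')
    date = (datetime.utcfromtimestamp(date_timestamp / 1000.0).strftime('%Y-%m-%d %H:%M:%S')
            if date_timestamp else 'Date not available')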
    score = post_data.get('score', 0)
    subreddit_name = post_data.get('subreddit_name', '')
    number_comments = post_data.get('number_comments', 0)

    print(80*'#')
    # Print the attributes
    print("Title:", title)
    print("Date:", date)
    print("Score:", score)
    print("Subreddit Name:", subreddit_name)
    print("Number of Comments:", number_comments)
    print(80*'#')
    print()


b"<!doctype html>\n     <html>\n  <head>\n    <title>Blocked</title>\n    <style>\n      body {\n          font: small verdana, arial, helvetica, sans-serif;\n          width: 600px;\n          margin: 0 auto;\n      }\n\n      h1 {\n          height: 40px;\n          background: transparent url(//www.redditstatic.com/reddit.com.header.png) no-repeat scroll top right;\n      }\n    </style>\n  </head>\n  <body>\n    <h1>whoa there, pardner!</h1>\n\n<p>Your request has been blocked due to a network policy.</p>\n\n<p>Try logging in or creating an account <a href=https://www.reddit.com/login/>here</a> to get back to browsing.</p>\n\n<p>If you're running a script or application, please register or sign in with your developer credentials <a href=https://www.reddit.com/wiki/api/>here</a>. Additionally make sure your User-Agent is not empty and is something unique and descriptive and try again. if you're supplying an alternate User-Agent string,\ntry changing back to default as that can somet
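
The blocked response above asks for a unique, descriptive User-Agent and points to Reddit's API. One alternative, sketched below, is to request a JSON listing and flatten it into rows with more than four columns. The assumptions are significant: the `search.json` endpoint and the field names (`title`, `created_utc`, `score`, `subreddit`, `num_comments`, `author`) reflect Reddit's public listing format as I understand it, the User-Agent string is a placeholder, and unauthenticated requests may still be rate-limited or blocked, in which case registered API credentials would be needed. The sample payload is made up for illustration, and the network call is defined but not executed here.

```python
import json
import urllib.parse
import urllib.request
from datetime import datetime, timezone

def fetch_search_listing(query: str, limit: int = 25) -> dict:
    """Fetch a search listing as JSON. Assumed endpoint; may be rate-limited or blocked."""
    params = urllib.parse.urlencode({"q": query, "limit": limit})
    request = urllib.request.Request(
        f"https://www.reddit.com/search.json?{params}",
        # Placeholder User-Agent: replace with something unique to your script.
        headers={"User-Agent": "info5731-exercise-script/0.1 (by u/your_username)"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)

def listing_to_rows(listing: dict) -> list:
    """Flatten a listing payload into one dictionary per post."""
    rows = []
    for child in listing.get("data", {}).get("children", []):
        post = child.get("data", {})
        created = post.get("created_utc")  # assumed to be seconds since the epoch
        rows.append({
            "title": post.get("title", ""),
            "author": post.get("author", ""),
            "subreddit": post.get("subreddit", ""),
            "score": post.get("score", 0),
            "num_comments": post.get("num_comments", 0),
            "date": (
                datetime.fromtimestamp(created, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
                if created is not None else "Date not available"
            ),
        })
    return rows

# Made-up payload in the assumed listing shape, used instead of a live request.
sample_listing = {"data": {"children": [{"data": {
    "title": "Example post about XYZ", "author": "example_user",
    "subreddit": "example", "score": 12, "num_comments": 3,
    "created_utc": 1704067200,
}}]}}

rows = listing_to_rows(sample_listing)
print(rows)
```

With network access, `listing_to_rows(fetch_search_listing("XYZ"))` would produce the same row structure, which can then be passed to `pandas.DataFrame` and saved as CSV.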

## Question 4B (10 Points)
If you encounter challenges with Question-4 web scraping using Python, employ any online tools such as ParseHub or Octoparse for data extraction. Introduce the selected tool, outline the steps for web scraping, and showcase the final output in formats like CSV or Excel.



Upload a document (Word or PDF File) in any shared storage (preferably UNT OneDrive) and add the publicly accessible link in the below code cell.

Please choose only one option for Question 4. If you do both options, we will grade only the first one.

In [None]:
# write your answer here


# Mandatory Question

**Important: Reflective Feedback on Web Scraping and Data Collection**



Please share your thoughts and feedback on the web scraping and data collection exercises you have completed in this assignment. Consider the following points in your response:



Learning Experience: Describe your overall learning experience in working on web scraping tasks. What were the key concepts or techniques you found most beneficial in understanding the process of extracting data from various online sources?



Challenges Encountered: Were there specific difficulties in collecting data from certain websites, and how did you overcome them? If you opted for the non-coding option, share your experience with the chosen tool.



Relevance to Your Field of Study: How might the ability to gather and analyze data from online sources enhance your work or research?

**(no grading of your submission if this question is left unanswered)**

In [37]:
The new concept presented a challenge initially, especially given the prerequisite knowledge required. However, delving deeply into the topic proved to be an enriching learning experience. I made an effort to understand HTML structure, focusing on how to precisely locate the desired elements within it. I also explored how to identify sources of datasets, which broadened my understanding. A few websites did not allow direct scraping and required the use of their APIs instead. Some of the code worked on a few platforms but did not work on Google Colab.
Collecting and analyzing data from online sources has been incredibly beneficial to me as an individual. It has given me access to a wealth of valuable information across different domains, enabling me to stay up to date and make better-informed decisions.