
# **INFO5731 In-class Exercise 2**

The purpose of this exercise is to understand users' information needs, and then collect data from different sources for analysis by implementing web scraping using Python.

**Expectations**:
*   Students are expected to complete the exercise during lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions incur a penalty of 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)
Describe an interesting research question (or a practical question, or something innovative) that you have in mind. What kind of data should be collected to answer it? Specify the amount of data needed for analysis, and provide detailed steps for collecting and saving the data.

In [None]:
# How does the integration of artificial intelligence (AI) algorithms in healthcare diagnostic processes affect patient outcomes and medical professionals' workload?

# 1. Obtain permission and approval from ethics committees or institutional review boards.
# 2. Identify data sources, such as hospitals, clinics, insurance companies, and research databases.
# 3. Collect patient demographic information, medical history, and treatment outcomes from electronic health records.
# 4. Gather health monitoring data from wearable devices and sensors.
# 5. Obtain billing records, insurance claims data, and healthcare expenditure information.
# 6. Capture outputs generated by AI algorithms, including diagnostic predictions and treatment recommendations.
# 7. Track the usage of AI algorithms in clinical decision-making.
# 8. Measure clinical outcomes and patient satisfaction through surveys or interviews.
# 9. Ensure adherence to ethical guidelines, obtain informed consent, and protect patient confidentiality.
# 10. Integrate data from different sources into a centralized database.
# 11. Cleanse and preprocess the data to remove errors and inconsistencies.
# 12. Store collected data securely using databases, data warehouses, or cloud storage.
# 13. Implement access controls and encryption to protect data integrity and confidentiality.
# 14. Document data sources, collection methods, and preprocessing steps.
# 15. Backup data regularly and establish disaster recovery plans.
# 16. Define access policies and permissions for data sharing with authorized individuals or research collaborators.
# 17. Ensure compliance with data sharing agreements and intellectual property rights.
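
Steps 9 and 13 call for protecting patient confidentiality. One possible piece of that, before records are shared with collaborators, is to replace raw patient identifiers with keyed pseudonyms. The sketch below is only an illustration under stated assumptions: the function names, the `patient_id` field, and the example key are hypothetical, and HMAC-SHA256 is one reasonable choice rather than a method prescribed by the steps above.

```python
import hashlib
import hmac


def pseudonymize_id(patient_id, secret_key):
    """Return a stable 16-character pseudonym for a patient identifier."""
    digest = hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def pseudonymize_records(records, secret_key, id_field="patient_id"):
    """Copy each record, replacing its identifier with a pseudonym."""
    return [
        {**record, id_field: pseudonymize_id(record[id_field], secret_key)}
        for record in records
    ]


# Hypothetical key and records, for illustration only
example_key = b"replace-with-a-secret-key"
example_records = [
    {"patient_id": "P0", "diagnosis": "Hypertension"},
    {"patient_id": "P1", "diagnosis": "Diabetes"},
]
protected_records = pseudonymize_records(example_records, example_key)
print(protected_records)
```

Because the pseudonym depends on the secret key, the same patient maps to the same value across data sources (so records can still be joined), while the original identifier cannot be read back from the shared file without the key.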

## Question 2 (10 Points)
Write Python code to collect a dataset of 1000 samples related to the question discussed in Question 1.

In [None]:
import pandas as pd

# Function to collect patient data
def collect_patient_data(num_samples):
    patient_data = []
    for i in range(num_samples):
        # Simulate patient data collection process
        # Replace this with actual data collection code
        patient_info = {
            "patient_id": f"P{i}",
            "age": 30 + i % 50,
            "gender": "Male" if i % 2 == 0 else "Female",
            "diagnosis": "Hypertension" if i % 3 == 0 else "Diabetes",
            "treatment": "Medication" if i % 2 == 0 else "Lifestyle changes",
            "outcome": "Improved" if i % 4 == 0 else "Stable"
        }
        patient_data.append(patient_info)
    return patient_data

# Function to collect AI algorithm performance data
def collect_algorithm_data(num_samples):
    algorithm_data = []
    for i in range(num_samples):
        # Simulate algorithm performance data collection process
        # Replace this with actual data collection code
        algorithm_info = {
            "patient_id": f"P{i}",
            "diagnostic_prediction": "High" if i % 3 == 0 else "Low",
            "treatment_recommendation": "Medication" if i % 2 == 0 else "Lifestyle changes",
            "usage_frequency": i % 5
        }
        algorithm_data.append(algorithm_info)
    return algorithm_data

# Collect patient data
patient_data = collect_patient_data(1000)

# Collect AI algorithm performance data
algorithm_data = collect_algorithm_data(1000)

# Convert data to pandas DataFrame
patient_df = pd.DataFrame(patient_data)
algorithm_df = pd.DataFrame(algorithm_data)

# Merge patient data and algorithm performance data based on patient ID
merged_df = pd.merge(patient_df, algorithm_df, on="patient_id")

# Save merged dataset to CSV file
merged_df.to_csv("healthcare_dataset.csv", index=False)
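
The simulated values above are derived from the loop index with modulo arithmetic, so several columns repeat in lockstep (for example, every even-numbered patient is both "Male" and on "Medication"). If the placeholder data should look less regular, a seeded random generator is one option. This is a sketch under assumptions: `simulate_patients` and its `seed` parameter are hypothetical names, and the category lists simply reuse the values from the code above.

```python
import random


def simulate_patients(num_samples, seed=42):
    """Generate reproducible placeholder patient records."""
    rng = random.Random(seed)  # Local generator so the seed affects only this function
    patients = []
    for i in range(num_samples):
        patients.append({
            "patient_id": f"P{i}",
            "age": rng.randint(30, 79),  # Same age range as 30 + i % 50
            "gender": rng.choice(["Male", "Female"]),
            "diagnosis": rng.choice(["Hypertension", "Diabetes"]),
            "treatment": rng.choice(["Medication", "Lifestyle changes"]),
            "outcome": rng.choice(["Improved", "Stable"]),
        })
    return patients


simulated_patients = simulate_patients(1000)
print(simulated_patients[0])
```

Fixing the seed keeps the dataset identical between runs, which makes the notebook output reproducible. The records are still simulated and say nothing about real patients.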


## Question 3 (10 Points)
Write Python code to collect 1000 articles from Google Scholar (https://scholar.google.com/), Microsoft Academic (https://academic.microsoft.com/home), CiteSeerX (https://citeseerx.ist.psu.edu/index), Semantic Scholar (https://www.semanticscholar.org/), or the ACM Digital Library (https://dl.acm.org/) with the keyword "XYZ". The articles should have been published in the last 10 years (2014-2024).

The following information from the article needs to be collected:

(1) Title of the article

(2) Venue/journal/conference where the article was published

(3) Year

(4) Authors

(5) Abstract

In [None]:
import time
from urllib.parse import quote_plus

import requests
from bs4 import BeautifulSoup

def fetch_google_scholar_articles(keyword, max_articles=1000):
    base_url = "https://scholar.google.com"
    # Browser-like User-Agent; Scholar may reject the default one sent by requests
    headers = {"User-Agent": "Mozilla/5.0"}
    articles = []
    start = 0  # Offset of the first result on the current page

    # Iterate over search result pages until desired number of articles is collected
    while len(articles) < max_articles:
        # Construct the search query URL, URL-encoding the keyword
        url = f"{base_url}/scholar?q={quote_plus(keyword)}&hl=en&as_sdt=0%2C5&as_ylo=2014&as_yhi=2024&start={start}"

        # Send a GET request and stop if the request fails or is blocked
        response = requests.get(url, headers=headers, timeout=30)
        if response.status_code != 200:
            print(f"Request failed with status {response.status_code}; stopping.")
            break

        # Parse the HTML content
        soup = BeautifulSoup(response.text, "html.parser")

        # Extract article elements; an empty page means there are no more
        # results (or a CAPTCHA page was served), so stop instead of looping forever
        article_elements = soup.find_all("div", class_="gs_ri")
        if not article_elements:
            break

        # Iterate over each article element
        for article_element in article_elements:
            try:
                # Extract article information
                title = article_element.find("h3", class_="gs_rt").text.strip()
                # The byline typically reads "authors - venue, year - publisher"
                byline = article_element.find("div", class_="gs_a").text.replace("\xa0", " ")
                parts = [part.strip() for part in byline.split(" - ")]
                authors = parts[0]
                venue, _, year = parts[1].rpartition(",")
                # Scholar shows a short snippet rather than the full abstract; it may be missing
                snippet = article_element.find("div", class_="gs_rs")
                abstract = snippet.text.strip() if snippet else ""

                # Append article information to the list
                articles.append({
                    "title": title,
                    "venue": venue.strip(),
                    "year": year.strip(),
                    "authors": authors,
                    "abstract": abstract
                })

                # If desired number of articles is reached, break the loop
                if len(articles) == max_articles:
                    break
            except (AttributeError, IndexError) as e:
                print(f"Error processing article: {e}")

        # Move to the next page (10 results per page) and pause between requests
        start += 10
        time.sleep(2)

    return articles

# Fetch articles from Google Scholar
articles = fetch_google_scholar_articles("XYZ", max_articles=1000)

# Print the first article to check, if any were collected
if articles:
    print(articles[0])
else:
    print("No articles collected.")
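
Once records shaped like the dictionaries collected above are available, they still need to be checked against the task's constraints and saved. The sketch below shows one way to drop duplicate titles, keep only the years 2014-2024, and write the five required fields to CSV. The function names are hypothetical and the sample records are made up for illustration; they are not real search results.

```python
import csv
import io

FIELDS = ["title", "venue", "year", "authors", "abstract"]


def clean_articles(articles, first_year=2014, last_year=2024):
    """Drop duplicate titles and records outside the requested year range."""
    seen_titles = set()
    cleaned = []
    for article in articles:
        year = str(article.get("year", "")).strip()
        if not year.isdigit() or not first_year <= int(year) <= last_year:
            continue
        # Normalize case and whitespace so near-identical titles match
        key = " ".join(article.get("title", "").lower().split())
        if not key or key in seen_titles:
            continue
        seen_titles.add(key)
        cleaned.append({field: article.get(field, "") for field in FIELDS})
    return cleaned


def write_articles_csv(articles, handle):
    """Write the article records to an open text file or buffer."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(articles)


# Made-up sample records, for illustration only
sample_articles = [
    {"title": "A Study of XYZ", "venue": "Journal A", "year": "2019",
     "authors": "A Author, B Author", "abstract": "Example snippet."},
    {"title": "a study of  xyz", "venue": "Journal A", "year": "2019",
     "authors": "A Author, B Author", "abstract": "Example snippet."},
    {"title": "Older XYZ Work", "venue": "Conference B", "year": "2009",
     "authors": "C Author", "abstract": "Example snippet."},
]
cleaned_articles = clean_articles(sample_articles)
buffer = io.StringIO()
write_articles_csv(cleaned_articles, buffer)
print(buffer.getvalue())
```

To save to disk instead of an in-memory buffer, the same function can be called with a file opened as `open("articles.csv", "w", newline="", encoding="utf-8")`.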

## Question 4A (10 Points)
Develop Python code to collect data from social media platforms such as Reddit, Instagram, X (formerly known as Twitter), Facebook, or any other. Use hashtags, keywords, usernames, or user IDs to gather the data.



Ensure that the collected data has more than four columns.


In [None]:
import requests
import json
import pandas as pd

class SocialMediaDataCollector:
    def __init__(self, api_keys):
        self.api_keys = api_keys

    def fetch_data(self, platform, query):
        if platform.lower() == 'reddit':
            return self.fetch_reddit_data(query)
        elif platform.lower() == 'instagram':
            return self.fetch_instagram_data(query)
        elif platform.lower() == 'twitter':
            return self.fetch_twitter_data(query)
        elif platform.lower() == 'facebook':
            return self.fetch_facebook_data(query)
        else:
            raise ValueError("Unsupported platform")

    def fetch_reddit_data(self, query):
        # Placeholder implementation - Replace with actual Reddit API logic
        return [
            {"USER ID": "reddit_user1", "USERNAME": "RedditUser1", "CREATED AT": "2024-02-15", "PROFILE NAME": "reddit_profile1", "TEXT": "Sample text 1"},
            {"USER ID": "reddit_user2", "USERNAME": "RedditUser2", "CREATED AT": "2024-02-14", "PROFILE NAME": "reddit_profile2", "TEXT": "Sample text 2"}
        ]

    def fetch_instagram_data(self, query):
        # Placeholder implementation - Replace with actual Instagram API logic
        return [
            {"USER ID": "instagram_user1", "USERNAME": "InstagramUser1", "CREATED AT": "2024-02-15", "PROFILE NAME": "instagram_profile1", "TEXT": "Sample text 1"},
            {"USER ID": "instagram_user2", "USERNAME": "InstagramUser2", "CREATED AT": "2024-02-14", "PROFILE NAME": "instagram_profile2", "TEXT": "Sample text 2"}
        ]

    def fetch_twitter_data(self, query):
        # Placeholder implementation - Replace with actual Twitter API logic
        return [
            {"USER ID": "twitter_user1", "USERNAME": "TwitterUser1", "CREATED AT": "2024-02-15", "PROFILE NAME": "twitter_profile1", "TEXT": "Sample text 1"},
            {"USER ID": "twitter_user2", "USERNAME": "TwitterUser2", "CREATED AT": "2024-02-14", "PROFILE NAME": "twitter_profile2", "TEXT": "Sample text 2"}
        ]

    def fetch_facebook_data(self, query):
        # Placeholder implementation - Replace with actual Facebook API logic
        return [
            {"USER ID": "facebook_user1", "USERNAME": "FacebookUser1", "CREATED AT": "2024-02-15", "PROFILE NAME": "facebook_profile1", "TEXT": "Sample text 1"},
            {"USER ID": "facebook_user2", "USERNAME": "FacebookUser2", "CREATED AT": "2024-02-14", "PROFILE NAME": "facebook_profile2", "TEXT": "Sample text 2"}
        ]

# Example usage:
api_keys = {
    'reddit': 'REDDIT_API_KEY',
    'instagram': 'INSTAGRAM_API_KEY',
    'twitter': 'TWITTER_API_KEY',
    'facebook': 'FACEBOOK_API_KEY'
}

data_collector = SocialMediaDataCollector(api_keys)

# Fetch data from Reddit
reddit_data = data_collector.fetch_data('reddit', '#python')

# Fetch data from Instagram
instagram_data = data_collector.fetch_data('instagram', 'python')

# Fetch data from Twitter
twitter_data = data_collector.fetch_data('twitter', 'python')

# Fetch data from Facebook
facebook_data = data_collector.fetch_data('facebook', 'python')

# Combine data into a DataFrame with more than four columns
all_data = pd.concat([
    pd.DataFrame(reddit_data, columns=['USER ID', 'USERNAME', 'CREATED AT', 'PROFILE NAME', 'TEXT']),
    pd.DataFrame(instagram_data, columns=['USER ID', 'USERNAME', 'CREATED AT', 'PROFILE NAME', 'TEXT']),
    pd.DataFrame(twitter_data, columns=['USER ID', 'USERNAME', 'CREATED AT', 'PROFILE NAME', 'TEXT']),
    pd.DataFrame(facebook_data, columns=['USER ID', 'USERNAME', 'CREATED AT', 'PROFILE NAME', 'TEXT'])
], ignore_index=True)

# Display the combined data if not empty
if not all_data.empty:
    print(all_data)
else:
    print("No data available.")
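
The platform methods above return hard-coded placeholder rows. As a hedged illustration of what replacing one of them might involve, the sketch below turns a Reddit-style JSON listing into rows with more than four columns. It rests on assumptions: the `search.json` URL, the listing layout (`data` → `children` → `data`), and the field names reflect Reddit's public JSON listings as commonly documented, the function names are hypothetical, and Reddit may require authentication or throttle unauthenticated requests. The network function is defined but not called here; the sample listing is hand-written, not real Reddit data.

```python
import json
import urllib.parse
import urllib.request
from datetime import datetime, timezone


def fetch_reddit_listing(query, limit=25):
    """Download one page of search results as parsed JSON (network call)."""
    params = urllib.parse.urlencode({"q": query, "limit": limit})
    request = urllib.request.Request(
        f"https://www.reddit.com/search.json?{params}",
        headers={"User-Agent": "info5731-exercise-sketch/0.1"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def listing_to_rows(listing):
    """Flatten a listing into one dictionary per post."""
    rows = []
    for child in listing.get("data", {}).get("children", []):
        post = child.get("data", {})
        created = datetime.fromtimestamp(post.get("created_utc", 0), tz=timezone.utc)
        rows.append({
            "POST ID": post.get("id", ""),
            "USERNAME": post.get("author", ""),
            "CREATED AT": created.strftime("%Y-%m-%d"),
            "SUBREDDIT": post.get("subreddit", ""),
            "TEXT": post.get("title", ""),
            "SCORE": post.get("score", 0),
        })
    return rows


# Hand-written sample in the assumed listing shape (not real Reddit data)
sample_listing = {
    "data": {
        "children": [
            {"data": {"id": "abc123", "author": "example_user",
                      "created_utc": 1707955200, "subreddit": "Python",
                      "title": "Sample text 1", "score": 10}},
        ]
    }
}
rows = listing_to_rows(sample_listing)
print(rows)
```

Keeping the download and the flattening in separate functions means the row-building logic can be checked against a small sample without any network access.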


## Question 4B (10 Points)
If you encounter challenges with Question-4 web scraping using Python, employ any online tools such as ParseHub or Octoparse for data extraction. Introduce the selected tool, outline the steps for web scraping, and showcase the final output in formats like CSV or Excel.



Upload a document (Word or PDF file) to any shared storage (preferably UNT OneDrive) and add the publicly accessible link in the code cell below.

Please choose only one option for Question 4. If you do both options, we will grade only the first one.

In [None]:
# write your answer here


# Mandatory Question

**Important: Reflective Feedback on Web Scraping and Data Collection**



Please share your thoughts and feedback on the web scraping and data collection exercises you have completed in this assignment. Consider the following points in your response:



Learning Experience: Describe your overall learning experience in working on web scraping tasks. What were the key concepts or techniques you found most beneficial in understanding the process of extracting data from various online sources?



Challenges Encountered: Were there specific difficulties in collecting data from certain websites, and how did you overcome them? If you opted for the non-coding option, share your experience with the chosen tool.



Relevance to Your Field of Study: How might the ability to gather and analyze data from online sources enhance your work or research?

**(no grading of your submission if this question is left unanswered)**

In [None]:
# Learning web scraping for the first time was tough. Online resources helped me grasp the basics, though not as thoroughly as I would like. Understanding HTML structure, CSS selectors, and XPath was tricky at first, and I often relied on tutorials and forums to troubleshoot problems and to experiment with scraping tools such as BeautifulSoup and Scrapy. Overall, although learning web scraping had its challenges, it has opened up new possibilities for research and analysis.

# I realized that I still have a lot to learn about these topics; so far I have only a basic understanding of them.