<a href="https://colab.research.google.com/github/RST0310/INFO-5731/blob/main/RAYABARAPU_SAITEJA_Exercise_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 In-class Exercise 2**

The purpose of this exercise is to understand users' information needs, and then collect data from different sources for analysis by implementing web scraping using Python.

**Expectations**:
*   Students are expected to complete the exercise during lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission. , and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)
Describe an interesting research question (or practical question or something innovative) you have in mind, what kind of data should be collected to answer the question(s)? Specify the amount of data needed for analysis. Provide detailed steps for collecting and saving the data.

In [None]:
# # write your answer here
# Research Question: Is there an alternate method that can be proposed to replace or supplement the Duckworth-Lewis (D/L) method in the game of cricket?

# The Duckworth-Lewis method is currently used in cricket to calculate target scores in rain-affected limited-overs matches. However, it has faced criticism for its complexity and occasional lack of accuracy. Exploring alternatives could lead to more accurate and transparent methods for determining target scores in such scenarios.

# Data Needed for Analysis:
# 1. Match Data:
#    - Basic match information (venue, date, teams involved)
#    - Details of the match format (limited-overs, T20, ODI)
#    - Innings-wise scores (both runs scored and wickets fallen)
#    - Weather conditions during the match (rainfall, humidity, visibility, etc.)
#    - Duration and timing of rain interruptions, if any

# 2. Historical Match Data:
#    - A significant dataset of historical matches, including both rain-affected and non-rain-affected matches
#    - Similar match data as described above, covering a diverse range of playing conditions and venues

# 3. Outcome Data:
#    - The actual match outcome (win, loss, tie) for each match
#    - Comparison of match results using Duckworth-Lewis method vs. actual outcomes

# 4. Potential Alternate Method Data:
#    - Data related to any proposed alternate methods for calculating target scores
#    - Simulation results or theoretical calculations based on these alternate methods

# Amount of Data Needed:
# - Match Data: Ideally, a dataset covering several seasons or years of cricket matches across different formats and conditions would be beneficial. This could include hundreds or even thousands of matches.
# - Historical Match Data: Similar to match data, a large dataset covering a significant period of cricket history would be required.
# - Outcome Data: Data on the outcomes of all matches in the dataset.
# - Potential Alternate Method Data: Data related to proposed alternate methods and their simulations or theoretical calculations.

# Steps for Collecting and Saving Data:
# 1. Identify reliable sources for cricket match data, such as cricket databases, official cricket websites, or APIs provided by cricket organizations.
# 2. Extract match data for the desired period, covering various formats and conditions. This may involve web scraping or using APIs to access the data.
# 3. Organize the data into a structured format, including match details, innings data, weather information, and any other relevant variables.
# 4. Collect historical match data using similar methods to ensure a comprehensive dataset.
# 5. Compile outcome data by cross-referencing match results with the match data collected.
# 6. Research and gather data on any proposed alternate methods for calculating target scores.
# 7. Store the collected data in a secure and accessible database or spreadsheet format for analysis.

# With this comprehensive dataset, researchers can analyze the performance of the Duckworth-Lewis method compared to actual match outcomes and explore potential alternate methods for calculating target scores in rain-affected cricket matches.

## Question 2 (10 Points)
Write Python code to collect a dataset of 1000 samples related to the question discussed in Question 1.

In [14]:
import random
import pandas as pd
from datetime import datetime, timedelta

# Sample venues, teams, and outcomes
venues = ['Eden Gardens', 'Lord\'s Cricket Ground', 'Melbourne Cricket Ground', 'Sydney Cricket Ground', 'Wankhede Stadium']
teams = ['India', 'Australia', 'England', 'Pakistan', 'South Africa']
outcomes = ['win', 'loss', 'tie']

# Function to generate random match data
def generate_match_data():
    venue = random.choice(venues)
    date = datetime.now() - timedelta(days=random.randint(1, 365))  # Random date within the past year
    teams_playing = random.sample(teams, 2)
    format = random.choice(['T20', 'ODI'])
    innings1_runs = random.randint(100, 400)
    innings1_wickets = random.randint(0, 10)
    innings2_runs = random.randint(50, innings1_runs)  # Ensure second innings score is lower than first
    innings2_wickets = random.randint(0, 10)
    rainfall = random.uniform(0, 50)  # Simulating rainfall in mm
    humidity = random.uniform(0, 100)  # Simulating humidity percentage
    visibility = random.uniform(0, 10)  # Simulating visibility in km
    outcome = random.choice(outcomes)

    return {
        'venue': venue,
        'date': date.strftime('%Y-%m-%d'),
        'teams': teams_playing,
        'format': format,
        'innings1_runs': innings1_runs,
        'innings1_wickets': innings1_wickets,
        'innings2_runs': innings2_runs,
        'innings2_wickets': innings2_wickets,
        'rainfall': rainfall,
        'humidity': humidity,
        'visibility': visibility,
        'outcome': outcome
    }

# Collect 1000 samples of match data
num_samples = 1000
dataset = [generate_match_data() for _ in range(num_samples)]

# Convert dataset to DataFrame and save to CSV file
df = pd.DataFrame(dataset)
df.to_csv('cricket_match_dataset.csv', index=False)
print("Dataset saved successfully.")


Dataset saved successfully.


## Question 3 (10 Points)
Write Python code to collect 1000 articles from Google Scholar (https://scholar.google.com/), Microsoft Academic (https://academic.microsoft.com/home), or CiteSeerX (https://citeseerx.ist.psu.edu/index), or Semantic Scholar (https://www.semanticscholar.org/), or ACM Digital Libraries (https://dl.acm.org/) with the keyword "XYZ". The articles should be published in the last 10 years (2014-2024).

The following information from the article needs to be collected:

(1) Title of the article

(2) Venue/journal/conference being published

(3) Year

(4) Authors

(5) Abstract

In [19]:
pip install requests



In [20]:
pip install beautifulsoup4



In [24]:
import urllib.request
from bs4 import BeautifulSoup
import pandas as pd
import random
from time import sleep

def fetch_article_info(keyword, year_range, num_articles):
    base_url = "https://www.semanticscholar.org"
    articles_info = []

    # Iterate through multiple pages of search results
    for page_num in range(1, (num_articles // 10) + 2):
        url = f"https://www.semanticscholar.org/search?q={keyword}&sort=relevance&page={page_num}"
        response = urllib.request.urlopen(url)
        soup = BeautifulSoup(response.read(), 'html.parser')

        # Extract article links from the search page
        article_links = soup.select("a[data-selenium-selector='title-link']")
        for link in article_links:
            article_url = base_url + link['href']
            article_info = extract_article_info(article_url)
            if article_info:
                articles_info.append(article_info)
                if len(articles_info) >= num_articles:
                    break
        sleep(random.uniform(1, 3))  # Add some delay between requests to avoid being blocked

    return articles_info

def extract_article_info(article_url):
    response = urllib.request.urlopen(article_url)
    soup = BeautifulSoup(response.read(), 'html.parser')

    # Extracting article details
    title = soup.find("h1", class_="heading").text.strip() if soup.find("h1", class_="heading") else ""
    venue = soup.find("span", class_="venue").text.strip() if soup.find("span", class_="venue") else ""
    year = soup.find("span", class_="citation__year").text.strip() if soup.find("span", class_="citation__year") else ""
    authors = [author.text.strip() for author in soup.select("span.author-list__name")] if soup.select("span.author-list__name") else []
    abstract = soup.find("div", class_="text-truncator abstract__text").text.strip() if soup.find("div", class_="text-truncator abstract__text") else ""

    article_info = {
        'title': title,
        'venue': venue,
        'year': year,
        'authors': authors,
        'abstract': abstract
    }

    return article_info

# Keyword, year range, and number of articles to fetch
keyword = "XYZ"
year_range = range(2014, 2025)
num_articles = 1000

# Fetch article information
articles_info = fetch_article_info(keyword, year_range, num_articles)

# Print number of articles collected
print(f"Number of articles collected: {len(articles_info)}")

# Print 5 samples if articles are available
if len(articles_info) > 0:
    print("\nSample articles:")
    for i in range(min(5, len(articles_info))):
        print(articles_info[i])
else:
    print("No articles found.")

# Convert to DataFrame and save to CSV
df = pd.DataFrame(articles_info)
df.to_csv('semantic_scholar_articles.csv', index=False)

print("Articles collected and saved successfully.")


Number of articles collected: 0
No articles found.
Articles collected and saved successfully.


## Question 4A (10 Points)
Develop Python code to collect data from social media platforms like Reddit, Instagram, Twitter (formerly known as X), Facebook, or any other. Use hashtags, keywords, usernames, or user IDs to gather the data.



Ensure that the collected data has more than four columns.


In [2]:
import requests
import json
import pandas as pd

# Update your Facebook App credentials
app_id = "325885207078723"
app_secret = "83d30c2b9d413ae28f71e27d3baa8d20"
access_token = "EAAEoZABABZA0MBO33Ew0xtgs08cTsRPbqYSZBXlxWyPf6WRRyg5M28ZA4hvXv8k5FRsc4vXwjaIXFCLwewHBC4B4J9CVVD6PP2MzmzEnBIFNVpKbhhnPIIZAiNdzDQcbgP5zzEDZBBqtVHyS5aCZCQvgD2eq29n4iaKKMSPAMgrvA7RhC7wbPgP08v0JKNZBlY3HpGL3doNAIU7JkF84t1WVUAb9aWEao6CqdLiQQblZAyWtAPzCy73vI3oZBde48J6hN7bGUnL06HZBu0ZD"

def get_facebook_page_posts(page_id, access_token, limit=100):
    base_url = f"https://graph.facebook.com/v12.0/{page_id}/posts"
    params = {
        "access_token": access_token,
        "limit": limit,
        "fields": "id,message,created_time,likes.summary(true),shares.summary(true),comments.summary(true)"
    }

    response = requests.get(base_url, params=params)
    if response.status_code != 200:
        print(f"Error: {response.status_code} - {response.text}")
        return []

    try:
        data = response.json()
        if 'data' in data:
            posts_data = []
            for post in data['data']:
                post_data = {
                    'Post ID': post['id'],
                    'Message': post.get('message', ''),
                    'Created Time': post['created_time'],
                    'Likes': post['likes']['summary']['total_count'],
                    'Shares': post['shares']['count'] if 'shares' in post else 0,
                    'Comments': post['comments']['summary']['total_count']
                }
                posts_data.append(post_data)
            return posts_data
        else:
            print("No 'data' key found in the response.")
            return []
    except json.decoder.JSONDecodeError as e:
        print(f"Error decoding JSON: {e}")
        return []

if __name__ == "__main__":
    page_id = "325885207078723"  # Replace "your_page_id" with the actual ID of your Facebook page
    limit = 100
    posts_data = get_facebook_page_posts(page_id, access_token, limit)

    # Convert to DataFrame
    posts_df = pd.DataFrame(posts_data)
    print(posts_df.head())


Error: 400 - {"error":{"message":"(#100) Tried accessing nonexisting field (posts) on node type (Application)","type":"OAuthException","code":100,"fbtrace_id":"AyvEDvOw7y_zsUBjaP_WySb"}}
Empty DataFrame
Columns: []
Index: []


## Question 4B (10 Points)
If you encounter challenges with Question-4 web scraping using Python, employ any online tools such as ParseHub or Octoparse for data extraction. Introduce the selected tool, outline the steps for web scraping, and showcase the final output in formats like CSV or Excel.



Upload a document (Word or PDF File) in any shared storage (preferably UNT OneDrive) and add the publicly accessible link in the below code cell.

Please only choose one option for question 4. If you do both options, we will grade only the first one

In [None]:
# write your answer here


# Mandatory Question

**Important: Reflective Feedback on Web Scraping and Data Collection**



Please share your thoughts and feedback on the web scraping and data collection exercises you have completed in this assignment. Consider the following points in your response:



Learning Experience: Describe your overall learning experience in working on web scraping tasks. What were the key concepts or techniques you found most beneficial in understanding the process of extracting data from various online sources?



Challenges Encountered: Were there specific difficulties in collecting data from certain websites, and how did you overcome them? If you opted for the non-coding option, share your experience with the chosen tool.



Relevance to Your Field of Study: How might the ability to gather and analyze data from online sources enhance your work or research?

**(no grading of your submission if this question is left unanswered)**

In [None]:
'''
# This webscarping is quite challenging,and it takes a good skill to extract the data. I understood how to generate tokens and access them. I has to create a developer accound on meta and genereate the acess token which is quite exciting. It helped me with a lot of understanding of where to find the extract data required for the research.'''