<a href="https://colab.research.google.com/github/ShashankAlluri28/INFO-5731Computational-Methods/blob/main/Alluri_Shashank_In_class_Exercise_2_(1)_(1).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 In-class Exercise 2**

The purpose of this exercise is to understand users' information needs, and then collect data from different sources for analysis by implementing web scraping using Python.

**Expectations**:
*   Students are expected to complete the exercise during lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)
Describe an interesting research question (or practical question or something innovative) you have in mind. What kind of data should be collected to answer the question(s)? Specify the amount of data needed for analysis. Provide detailed steps for collecting and saving the data.

In [None]:
# write your answer here

What impact does weather variability have on retail customers' purchasing decisions?

Data Required:

-> Daily weather information for one or more regions, including temperature, precipitation, humidity, and wind speed.
-> Daily sales data from retail stores in the same region(s).
-> Demographic information for the region(s), to account for socioeconomic influences.
-> At least one year of daily weather data, to identify long-term trends and seasonal variations.
-> At least one year of daily sales data, to track shifts in consumer behaviour over time.


Process of Gathering and Preserving Data:

Weather Data:
a. Determine trustworthy sources, such as commercial weather data providers, government meteorological agencies, or weather APIs.
b. After selecting the region or regions of interest, download daily weather data for the selected time frame, making sure the data format is consistent.
c. Store the meteorological data in an organised file type, like CSV or Excel, with columns denoting various meteorological parameters and each row representing a single day.

Sales Information:
a. Work with retail establishments in the selected region(s) to gain access to their daily sales logs. Verify that data security and confidentiality procedures are followed.
b. Gather daily sales data, such as revenue, product categories, sales volume, and other relevant information, for the designated time frame.
c. Arrange the sales data into a structured format akin to that of the meteorological data, with columns designating pertinent sales metrics and each row denoting a single day.

Demographic Information:
a. Acquire demographic information from reliable sources, such as market research companies, statistical agencies, or government census databases.
b. Choose pertinent demographic factors for the study, such as age distribution, income levels, and population density.

Data Management and Storage:
a. Establish a safe, orderly data repository to house the gathered datasets.
b. To avoid data loss or corruption, back up the data periodically.
c. Verify that ethical standards and data privacy laws are followed at every stage of the data handling procedure.

Data Preprocessing and Data Cleaning:
a. Clean the datasets to find and fix any errors, missing values, or outliers.
b. For consistency, standardise the units of measurement and formats used in various datasets.
c. If required, normalise the data so that variables measured on different scales are comparable.
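
The cleaning and standardisation steps above can be sketched in pandas. This is a minimal illustration, not the pipeline actually used: the column names (`Temperature`, `Temp Unit`, `Total Sales`), the helper `preprocess`, and the three sample rows are all hypothetical. Units are converted first so that interpolation never mixes Fahrenheit and Celsius readings.

```python
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Standardise units, fill gaps, and add z-scores (hypothetical column names)."""
    out = df.sort_values("Date").reset_index(drop=True)

    # Step b: standardise units by converting Fahrenheit readings to Celsius
    is_f = out["Temp Unit"] == "F"
    out.loc[is_f, "Temperature"] = (out.loc[is_f, "Temperature"] - 32) * 5 / 9
    out["Temp Unit"] = "C"

    # Step a: fill missing numeric values by linear interpolation between days
    numeric_cols = ["Temperature", "Total Sales"]
    out[numeric_cols] = out[numeric_cols].interpolate(limit_direction="both")

    # Step c: add z-scores so columns on different scales can be compared
    for col in numeric_cols:
        out[col + " (z)"] = (out[col] - out[col].mean()) / out[col].std()
    return out

# Made-up sample rows: one Fahrenheit reading and one missing temperature
raw = pd.DataFrame({
    "Date": pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-03"]),
    "Temperature": [32.0, None, 10.0],
    "Temp Unit": ["F", "C", "C"],
    "Total Sales": [1000.0, 2000.0, 3000.0],
})
clean = preprocess(raw)
print(clean)
```

Linear interpolation is only one way to handle gaps; dropping incomplete days or flagging them may suit the analysis better.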


## Question 2 (10 Points)
Write Python code to collect a dataset of 1000 samples related to the question discussed in Question 1.

In [3]:
# write your answer here
import pandas as pd
import numpy as np
import random
from datetime import datetime

def get_date_ranges(start_date, end_date):
    return pd.date_range(start=start_date, end=end_date)

# Generate synthetic weather data
def generate_weather_data(start_date, end_date, region):
    dates = get_date_ranges(start_date, end_date)
    weather_data = {
        'Date': dates,
        'Temperature (C)': np.random.randint(-10, 40, size=len(dates)),
        'Precipitation (mm)': np.random.uniform(0, 20, size=len(dates)),
        'Humidity (%)': np.random.randint(30, 90, size=len(dates)),
        'Wind Speed (km/h)': np.random.randint(0, 30, size=len(dates))
    }
    df_weather = pd.DataFrame(weather_data)
    df_weather['Region'] = region
    return df_weather

# Generate synthetic sales data
def generate_sales_data(start_date, end_date, region):
    dates = get_date_ranges(start_date, end_date)
    sales_data = {
        'Date': dates,
        'Total Sales': np.random.randint(1000, 10000, size=len(dates)),
        'Product Category': [random.choice(['Electronics', 'Clothing', 'Groceries']) for _ in range(len(dates))]
    }
    df_sales = pd.DataFrame(sales_data)
    df_sales['Region'] = region
    return df_sales

# Generate synthetic demographic data
def generate_demographic_data(regions):
    demographics = {
        'Region': regions,
        'Population': [random.randint(50000, 1000000) for _ in range(len(regions))],
        'Average Income': [random.randint(20000, 80000) for _ in range(len(regions))]
    }
    df_demographics = pd.DataFrame(demographics)
    return df_demographics

# Define regions (three regions x 365 days = 1095 rows, which meets the 1000-sample requirement)
regions = ['Region A', 'Region B', 'Region C']

# Define time frame
start_date, end_date = datetime(2023, 1, 1), datetime(2023, 12, 31)

# Generate weather data for each region
weather_data = pd.concat([generate_weather_data(start_date, end_date, region) for region in regions], ignore_index=True)

# Generate sales data for each region
sales_data = pd.concat([generate_sales_data(start_date, end_date, region) for region in regions], ignore_index=True)

# Generate demographic data
demographic_data = generate_demographic_data(regions)

# Merge dataframes
df = pd.merge(weather_data, sales_data, on=['Date', 'Region'])
df = pd.merge(df, demographic_data, on='Region')

# Shuffle dataframe
df = df.sample(frac=1).reset_index(drop=True)

# Save dataset to CSV
df.to_csv('retail_purchasing_decision_dataset.csv', index=False)
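
With one year of daily rows per region, the merged table holds 365 rows times the number of regions, so it is worth checking the 1000-sample requirement before relying on the saved file. The sketch below is one way to do that. The helper names `validate_dataset` and `demo_frame`, the set of required columns, and the constant placeholder values are assumptions for illustration.

```python
import pandas as pd

REQUIRED_COLUMNS = {"Date", "Region", "Temperature (C)", "Total Sales"}

def validate_dataset(df, min_rows=1000):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if len(df) < min_rows:
        problems.append(f"only {len(df)} rows, need at least {min_rows}")
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")
    # Each (Date, Region) pair should appear once after the merge
    elif df.duplicated(subset=["Date", "Region"]).any():
        problems.append("duplicate (Date, Region) rows")
    return problems

def demo_frame(regions):
    """Build a placeholder frame with one row per day of 2023 for each region."""
    dates = pd.date_range("2023-01-01", "2023-12-31")
    return pd.DataFrame({
        "Date": list(dates) * len(regions),
        "Region": [r for r in regions for _ in dates],
        "Temperature (C)": 20,
        "Total Sales": 5000,
    })

two_regions = validate_dataset(demo_frame(["Region A", "Region B"]))
three_regions = validate_dataset(demo_frame(["Region A", "Region B", "Region C"]))
print(two_regions)
print(three_regions)
```

Calling `validate_dataset(df)` just before `df.to_csv(...)` would surface a short dataset before it is written to disk.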


## Question 3 (10 Points)
Write Python code to collect 1000 articles from Google Scholar (https://scholar.google.com/), Microsoft Academic (https://academic.microsoft.com/home), or CiteSeerX (https://citeseerx.ist.psu.edu/index), or Semantic Scholar (https://www.semanticscholar.org/), or ACM Digital Libraries (https://dl.acm.org/) with the keyword "XYZ". The articles should be published in the last 10 years (2014-2024).

The following information from the article needs to be collected:

(1) Title of the article

(2) Venue/journal/conference being published

(3) Year

(4) Authors

(5) Abstract

In [4]:
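import requests
from bs4 import BeautifulSoup
from urllib.parse import quote_plus

def scrape_semantic_scholar(keyword, num_articles):

    base_url = "https://www.semanticscholar.org"
    # URL-encode the keyword so spaces and special characters survive in the query string
    url = f"{base_url}/search?q={quote_plus(keyword)}&sort=relevance"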
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"
    }

    articles = []
    count = 0

    while len(articles) < num_articles:
        response = requests.get(url, headers=headers, timeout=30)  # timeout avoids hanging forever
        print("Scraping:", url)  # Print the URL being scraped
        print("Status code:", response.status_code)  # Print the status code

        if response.status_code != 200:
            print("Failed to retrieve data. Status code:", response.status_code)
            break

        soup = BeautifulSoup(response.content, 'html.parser')
        results = soup.find_all("a", {"data-selenium-selector": "title-link"})

        for result in results:
            if len(articles) >= num_articles:
                break

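            article_url = base_url + result["href"]
            article_response = requests.get(article_url, headers=headers, timeout=30)
            article_soup = BeautifulSoup(article_response.content, 'html.parser')

            # find() returns None when a selector does not match, so guard before reading .text
            title_tag = article_soup.find("h1", class_="paper-detail-header__title")
            venue_tag = article_soup.find("div", class_="paper-meta-item")
            year_tag = article_soup.find("span", class_="paper-meta-item__meta-venue")
            abstract_tag = article_soup.find("meta", {"name": "description"})

            title = title_tag.text.strip() if title_tag else ""
            venue = venue_tag.text.strip() if venue_tag else ""
            year = year_tag.text.strip().split()[-1] if year_tag and year_tag.text.strip() else ""
            authors = [author.text.strip() for author in article_soup.find_all("span", class_="author-item")]
            abstract = abstract_tag.get("content", "") if abstract_tag else ""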

            articles.append({
                "Title": title,
                "Venue": venue,
                "Year": year,
                "Authors": authors,
                "Abstract": abstract
            })

            count += 1

        print(f"Scraped {count} articles.")

        next_page = soup.find("a", class_="next-page-link")
        if next_page:
            url = base_url + next_page["href"]
        else:
            break

    return articles

keyword = "XYZ"
num_articles = 1000

articles = scrape_semantic_scholar(keyword, num_articles)

if articles:
    # Saving the collected data to a CSV file
    import csv

    keys = articles[0].keys()
    with open('articles_semantic_scholar.csv', 'w', newline='', encoding='utf-8') as output_file:
        dict_writer = csv.DictWriter(output_file, keys)
        dict_writer.writeheader()
        dict_writer.writerows(articles)

    print("Data saved to articles_semantic_scholar.csv")
else:
    print("No articles were scraped. Please check if the scraping process encountered any issues.")


Scraping: https://www.semanticscholar.org/search?q=XYZ&sort=relevance
Status code: 200
Scraped 0 articles.
No articles were scraped. Please check if the scraping process encountered any issues.
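
The run above returned zero articles even though the status code was 200. A likely cause is that the search page builds its result list in the browser with JavaScript, so the HTML fetched by `requests` contains no `title-link` anchors. One alternative is to query a JSON API instead of parsing HTML. The sketch below assumes the Semantic Scholar Graph API paper-search endpoint and its `query`, `year`, `fields`, `offset`, and `limit` parameters. The endpoint URL, parameter names, limits, and response shape should be checked against the current API documentation. `collect_papers`, `fetch_json`, and `fake_fetch` are hypothetical names, and the demo uses canned placeholder records rather than real results.

```python
import json
import time
import urllib.parse
import urllib.request

# Assumed endpoint and field names; verify against the official API documentation.
API_URL = "https://api.semanticscholar.org/graph/v1/paper/search"
FIELDS = "title,venue,year,authors,abstract"

def fetch_json(url):
    """Download one page of results and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)

def collect_papers(keyword, num_articles, fetch=fetch_json, page_size=100, pause=1.0):
    """Page through search results until num_articles records are collected."""
    articles, offset = [], 0
    while len(articles) < num_articles:
        query = urllib.parse.urlencode({
            "query": keyword, "year": "2014-2024", "fields": FIELDS,
            "offset": offset, "limit": min(page_size, num_articles - len(articles)),
        })
        page = fetch(f"{API_URL}?{query}").get("data", [])
        if not page:
            break  # the API has no more results for this query
        for paper in page:
            articles.append({
                "Title": paper.get("title"),
                "Venue": paper.get("venue"),
                "Year": paper.get("year"),
                "Authors": "; ".join(a.get("name", "") for a in paper.get("authors") or []),
                "Abstract": paper.get("abstract"),
            })
        offset += len(page)
        time.sleep(pause)  # pause between requests to stay within rate limits
    return articles

def fake_fetch(url):
    """Stand-in for the network call: two placeholder records, then an empty page."""
    offset = int(urllib.parse.parse_qs(urllib.parse.urlsplit(url).query)["offset"][0])
    if offset > 0:
        return {"data": []}
    return {"data": [
        {"title": "Placeholder paper 1", "venue": "Placeholder venue", "year": 2020,
         "authors": [{"name": "A. Author"}, {"name": "B. Author"}], "abstract": "Placeholder"},
        {"title": "Placeholder paper 2", "venue": "", "year": 2015, "authors": [], "abstract": None},
    ]}

sample = collect_papers("XYZ", 5, fetch=fake_fetch, pause=0)
print(sample)
```

For a live run, `collect_papers("XYZ", 1000)` would use the real `fetch_json`. The resulting list of dictionaries can be written with the same `csv.DictWriter` code used in the cell above.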


## Question 4A (10 Points)
Develop Python code to collect data from social media platforms like Reddit, Instagram, X (formerly known as Twitter), Facebook, or any other. Use hashtags, keywords, usernames, or user IDs to gather the data.



Ensure that the collected data has more than four columns.


In [5]:
import requests
import pandas as pd

# Instagram Graph API endpoint URLs
BASE_URL = 'https://graph.instagram.com'
ACCESS_TOKEN = 'your_access_token'

# Function to make API requests to Instagram Graph API
def make_api_request(endpoint, params):
    url = f'{BASE_URL}/{endpoint}'
    params['access_token'] = ACCESS_TOKEN
    response = requests.get(url, params=params)
    try:
        response.raise_for_status()
        return response.json()
    except requests.HTTPError as e:
        print(f'HTTP error occurred: {e}')
        print(f'Response content: {response.content}')
        return None

# Function to collect user's media
def collect_user_media(user_id, num_posts):
    media_data = []
    params = {
        'fields': 'id,media_type,media_url,permalink,timestamp,caption',
        'limit': num_posts
    }
    response_data = make_api_request(f'{user_id}/media', params)
    if response_data and 'data' in response_data:
        media_data.extend(response_data['data'])
    return media_data

# Specify user ID and number of posts to collect
user_id = '_shashank_varma'  # NOTE: the endpoint expects 'me' (the token owner) or a numeric user ID rather than a username handle
num_posts = 20

# Collect user's media
media = collect_user_media(user_id, num_posts)

if media:
    # Convert the list of dictionaries into a DataFrame
    df = pd.DataFrame(media)

    # Save the collected data to a CSV file
    df.to_csv('instagram_data.csv', index=False)

    print("Data saved to instagram_data.csv")
else:
    print("Failed to collect Instagram data. Check the error messages for details.")


HTTP error occurred: 400 Client Error: Bad Request for url: https://graph.instagram.com/_shashank_varma/media?fields=id%2Cmedia_type%2Cmedia_url%2Cpermalink%2Ctimestamp%2Ccaption&limit=20&access_token=your_access_token
Response content: b"Sorry, this content isn't available right now"
Failed to collect Instagram data. Check the error messages for details.
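
The cell above asks for all posts in a single call through `limit`. When an API splits results across pages, the usual pattern is to follow the link to the next page until enough items are gathered. The sketch below assumes each JSON response carries its items under `data` (as the cell above expects) and an optional `paging.next` URL, which is common in Graph-style APIs but should be confirmed in the platform documentation. `collect_all_media`, `fetch_json`, and `FAKE_PAGES` are hypothetical names, and the demo uses canned pages instead of a real account.

```python
import json
import urllib.request

def fetch_json(url):
    """Download one page and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)

def collect_all_media(first_url, max_posts, fetch=fetch_json):
    """Follow `paging.next` links until max_posts items are gathered or pages run out."""
    media, url = [], first_url
    while url and len(media) < max_posts:
        payload = fetch(url)
        media.extend(payload.get("data", []))
        # A missing `paging` or `next` key means this was the last page
        url = payload.get("paging", {}).get("next")
    return media[:max_posts]

# Canned pages keyed by URL, standing in for real API responses
FAKE_PAGES = {
    "page-1": {"data": [{"id": "1"}, {"id": "2"}], "paging": {"next": "page-2"}},
    "page-2": {"data": [{"id": "3"}]},
}

posts = collect_all_media("page-1", max_posts=20, fetch=FAKE_PAGES.__getitem__)
first_two = collect_all_media("page-1", max_posts=2, fetch=FAKE_PAGES.__getitem__)
print([p["id"] for p in posts])
```

Passing the fetch function as a parameter keeps the paging logic testable without a valid access token.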


## Question 4B (10 Points)
If you encounter challenges with Question-4 web scraping using Python, employ any online tools such as ParseHub or Octoparse for data extraction. Introduce the selected tool, outline the steps for web scraping, and showcase the final output in formats like CSV or Excel.



Upload a document (Word or PDF File) in any shared storage (preferably UNT OneDrive) and add the publicly accessible link in the below code cell.

Please choose only one option for Question 4. If you do both options, we will grade only the first one.

In [1]:
# write your answer here


# Mandatory Question

**Important: Reflective Feedback on Web Scraping and Data Collection**



Please share your thoughts and feedback on the web scraping and data collection exercises you have completed in this assignment. Consider the following points in your response:



Learning Experience: Describe your overall learning experience in working on web scraping tasks. What were the key concepts or techniques you found most beneficial in understanding the process of extracting data from various online sources?



Challenges Encountered: Were there specific difficulties in collecting data from certain websites, and how did you overcome them? If you opted for the non-coding option, share your experience with the chosen tool.



Relevance to Your Field of Study: How might the ability to gather and analyze data from online sources enhance your work or research?

**(no grading of your submission if this question is left unanswered)**

In [None]:
'''
Write your response here.
'''

In [None]:
Web scraping is a technique for collecting data from websites such as social media platforms, and it is essential in data science, research, and business intelligence. Its key aspects include understanding HTTP requests and responses, handling pagination, and implementing error-handling mechanisms. The most challenging part of web scraping is dealing with dynamic content generated by JavaScript, as traditional web scraping libraries only retrieve the static HTML and do not execute scripts.