<a href="https://colab.research.google.com/github/MMR1318/Maheshreddy_INFO5731_Fall2024/blob/main/Mottakatla_Mahesh_Exercise_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Question 1 (10 Points)
Describe an interesting research question (or practical question or something innovative) you have in mind, what kind of data should be collected to answer the question(s)? Specify the amount of data needed for analysis. Provide detailed steps for collecting and saving the data.

In [None]:
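# Research Question:
# How do weather conditions (temperature, precipitation, major weather events) affect consumer spending across regions?

# Data to be Collected:
# We need to collect the following types of data: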

# 1. Weather Data:
#    - Temperature (°C or °F)
#    - Precipitation (mm or inches)
#    - Humidity (%)
#    - Wind Speed (in km/h or mph)
#    - Major weather events (e.g., storms, droughts)
#    - Source: Weather services or APIs (e.g., NOAA, Weather.com API, OpenWeatherMap API)

# 2. Consumer Spending Data:
#    - Daily/weekly/monthly retail sales
#    - E-commerce sales (if applicable)
#    - Type of expenditure (e.g., food, clothing, electronics)
#    - Geographical regions (city/state/country)
#    - Source: Government economic data, retail databases, financial institutions (e.g., Statista, World Bank, Kaggle)

# 3. Demographic Data:
#    - Population size
#    - Income levels
#    - Employment rates
#    - Age distribution
#    - Source: Census data, demographic surveys, public databases

# Amount of Data:
# - Collect data for at least 2-3 years to capture seasonal variation and major weather events.
# - The dataset should include:
#    - Weather Data: Daily records for all regions.
#    - Spending Data: Monthly or weekly trends in retail categories.
#    - Demographic Data: Latest available demographic data for each region.
# - For large studies, the dataset may consist of millions of rows.

# Steps for Collecting and Saving Data:

# 1. Identify Data Sources:
#    - Choose weather APIs (e.g., OpenWeatherMap) and government sources for spending data.
#    - Identify APIs or databases for demographic data.

# 2. Access the Data:
#    - Use APIs (e.g., OpenWeatherMap) to retrieve daily weather data for the relevant geographic areas.
#    - Gather monthly/weekly consumer spending data from government or retail databases.
#    - Retrieve demographic data from census databases or other national sources.

# 3. Data Collection Process:
#    - Weather Data:
#      - Use Python or another programming language to call weather APIs.
#      - Pass the required date and regions as parameters in the API call.
#      - Store the data in JSON or CSV format for further analysis.
#    - Consumer Spending Data:
#      - Collect retail data from public sources, or use web scraping or financial data provider APIs where no download is available.
#      - Aggregate the data by region and time (weekly/monthly intervals).
#    - Demographic Data:
#      - Use Python libraries like Pandas and BeautifulSoup to scrape or download demographic data for the selected regions.

# 4. Data Cleaning and Preprocessing:
#    - Use Pandas to clean missing or inconsistent data.
#    - Standardize data (e.g., convert all temperatures to °C).
#    - Classify spending data into major retail categories.

# 5. Save the Data:
#    - Store weather, spending, and demographic data in a structured format:
#      - Use CSV files for easier analysis.
#      - For large datasets, consider using a relational database (e.g., SQLite, MySQL) for faster querying.
#      - For scalability, consider using cloud storage (e.g., AWS S3).

# 6. Data Validation:
#    - Ensure data integrity by:
#      - Comparing totals across datasets.
#      - Validating API response codes.
#      - Cross-referencing data from multiple sources (e.g., comparing data from two weather APIs).
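
Steps 4 to 6 above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the pipeline used in this notebook. The record fields (`city`, `date`, `temperature`, `unit`), the `weather` table, the sample values, and the 2-degree tolerance are all assumptions chosen for the example.

```python
import sqlite3

def fahrenheit_to_celsius(temp_f):
    """Convert a Fahrenheit reading to Celsius, rounded to 2 decimals."""
    return round((temp_f - 32) * 5 / 9, 2)

def clean_weather(records):
    """Step 4: drop rows without a temperature and standardize to Celsius."""
    cleaned = []
    for rec in records:
        temp = rec.get("temperature")
        if temp is None:
            continue  # Skip missing readings
        if rec.get("unit") == "F":
            temp = fahrenheit_to_celsius(temp)
        cleaned.append({"city": rec["city"], "date": rec["date"], "temperature_c": temp})
    return cleaned

def save_weather(rows, db_path=":memory:"):
    """Step 5: store the cleaned rows in a SQLite table."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS weather (city TEXT, date TEXT, temperature_c REAL)")
    conn.executemany("INSERT INTO weather VALUES (:city, :date, :temperature_c)", rows)
    conn.commit()
    return conn

def find_disagreements(source_a, source_b, tolerance=2.0):
    """Step 6: list (city, date) keys where two sources differ by more than tolerance."""
    lookup = {(r["city"], r["date"]): r["temperature_c"] for r in source_b}
    return [
        (r["city"], r["date"])
        for r in source_a
        if (r["city"], r["date"]) in lookup
        and abs(r["temperature_c"] - lookup[(r["city"], r["date"])]) > tolerance
    ]

# Hypothetical sample records standing in for API responses
raw = [
    {"city": "Paris", "date": "2024-01-01", "temperature": 41.0, "unit": "F"},
    {"city": "Tokyo", "date": "2024-01-01", "temperature": 8.5, "unit": "C"},
    {"city": "Delhi", "date": "2024-01-01", "temperature": None, "unit": "C"},
]
cleaned = clean_weather(raw)
conn = save_weather(cleaned)
row_count = conn.execute("SELECT COUNT(*) FROM weather").fetchone()[0]

# Hypothetical readings from a second weather source for cross-referencing
second_source = [
    {"city": "Paris", "date": "2024-01-01", "temperature_c": 5.4},
    {"city": "Tokyo", "date": "2024-01-01", "temperature_c": 12.0},
]
mismatches = find_disagreements(cleaned, second_source)
print(row_count, mismatches)
```

Passing a file path instead of `":memory:"` to `save_weather` would persist the table on disk, which suits the larger datasets mentioned in step 5.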


## Question 2 (10 Points)
Write Python code to collect a dataset of 1000 samples related to the question discussed in Question 1.

In [None]:
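import os
import random
import time

import pandas as pd
import requests

# Define API key and endpoint for OpenWeatherMap.
# The key is read from an environment variable so that it is not stored in the notebook.
# The variable name OPENWEATHERMAP_API_KEY is an assumption; set it before running this cell.
API_KEY = os.environ.get("OPENWEATHERMAP_API_KEY", "")
BASE_URL = "https://api.openweathermap.org/data/2.5/weather"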

# List of cities to collect data from (for simplicity, we use a few cities)
cities = ["London", "New York", "Berlin", "Tokyo", "Sydney", "Toronto", "Delhi", "Paris", "Dubai", "Moscow"]

# Function to fetch the current weather for a city from the OpenWeatherMap API
def fetch_weather_data(city):
    params = {
        'q': city,
        'appid': API_KEY,
        'units': 'metric'  # Celsius for temperature, metres per second for wind speed
    }
    try:
        response = requests.get(BASE_URL, params=params, timeout=10)
    except requests.RequestException:
        return None  # Network error: treat as a failed sample
    if response.status_code != 200:
        return None  # e.g. invalid key (401) or rate limit exceeded (429)
    data = response.json()
    return {
        'city': city,
        'temperature': data['main']['temp'],
        'humidity': data['main']['humidity'],
        'wind_speed': data['wind']['speed'],
        'weather_event': data['weather'][0]['description']
    }

# Function to generate random consumer spending data.
# These are synthetic placeholder values, not real observations.
# Note: total_spending is drawn independently and is not the sum of the categories.
def generate_spending_data():
    return {
        'food_spending': round(random.uniform(50, 300), 2),
        'clothing_spending': round(random.uniform(20, 200), 2),
        'electronics_spending': round(random.uniform(100, 1000), 2),
        'total_spending': round(random.uniform(500, 3000), 2)
    }

# Function to generate random demographic data.
# These values are also synthetic and are not tied to the selected city.
def generate_demographic_data():
    return {
        'population_size': random.randint(500000, 20000000),
        'income_level': round(random.uniform(20000, 100000), 2),
        'employment_rate': round(random.uniform(60, 95), 2),
        'age_distribution': random.choice(['18-35', '35-50', '50+'])
    }

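# Initialize an empty list to store the collected data
data = []

# Collect up to 1000 samples. Failed requests are skipped, so the attempt cap
# keeps the loop from running forever if the API keeps rejecting calls.
# Note: each call returns the current weather, so repeated calls for the same
# city within a short period return nearly identical readings.
target_samples = 1000
max_attempts = 1500
attempts = 0

while len(data) < target_samples and attempts < max_attempts:
    attempts += 1
    city = random.choice(cities)

    # Fetch weather data for the city
    weather_data = fetch_weather_data(city)

    if weather_data:
        # Generate consumer spending and demographic data
        spending_data = generate_spending_data()
        demographic_data = generate_demographic_data()

        # Combine all data into a single dictionary
        combined_data = {**weather_data, **spending_data, **demographic_data}
        data.append(combined_data)

    # To avoid hitting API rate limits, sleep for a short time between requests
    # (adjust the delay to the limit of your API plan)
    time.sleep(0.2)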

# Convert the collected data into a Pandas DataFrame
df = pd.DataFrame(data)

# Save the dataset to a CSV file
df.to_csv('weather_consumer_spending_data.csv', index=False)

# Display the first few rows of the dataset
df.head()


| | city | temperature | humidity | wind_speed | weather_event | food_spending | clothing_spending | electronics_spending | total_spending | population_size | income_level | employment_rate | age_distribution |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Dubai | 37.96 | 39 | 6.69 | clear sky | 73.39 | 38.82 | 887.39 | 2296.73 | 17749626 | 21087.68 | 70.24 | 50+ |
| 1 | Paris | 15.53 | 60 | 4.12 | few clouds | 50.9 | 39.47 | 757.95 | 844.66 | 16417172 | 40528.8 | 81.7 | 50+ |
| 2 | Sydney | 13.53 | 75 | 15.43 | scattered clouds | 294.7 | 31.88 | 295.92 | 1429.6 | 13073281 | 35818.2 | 72.71 | 35-50 |
| 3 | Toronto | 16.05 | 96 | 4.12 | mist | 137.91 | 110.47 | 444.94 | 1872.78 | 14049884 | 89368.66 | 90.48 | 18-35 |
| 4 | New York | 19.01 | 88 | 2.57 | clear sky | 270.9 | 168.83 | 265.3 | 2685.16 | 5357384 | 40211.57 | 66.37 | 35-50 |
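
The collection loop above simply skips a sample when a request fails. One common alternative is to retry a failed call with exponential backoff. The sketch below is illustrative only: `call_with_backoff` and `flaky_fetch` are hypothetical names, and the fake fetch function stands in for `fetch_weather_data` so that the example runs without network access.

```python
import time

def call_with_backoff(fetch, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call fetch() until it returns a non-None value or attempts run out.

    The wait doubles after each failure: base_delay, 2*base_delay, 4*base_delay, ...
    """
    for attempt in range(max_attempts):
        result = fetch()
        if result is not None:
            return result
        if attempt < max_attempts - 1:
            sleep(base_delay * (2 ** attempt))
    return None

# Fake fetch that fails twice and then succeeds (stand-in for a real API call)
calls = {"count": 0}

def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        return None
    return {"city": "London", "temperature": 12.3}

waits = []  # Record the requested delays instead of actually sleeping
result = call_with_backoff(flaky_fetch, sleep=waits.append)
print(result, waits)
```

In the notebook, the same helper could wrap a call such as `lambda: fetch_weather_data(city)`, leaving `sleep` at its default so the delays are real.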


## Question 3 (10 Points)
Write Python code to collect 1000 articles from Google Scholar (https://scholar.google.com/), Microsoft Academic (https://academic.microsoft.com/home), or CiteSeerX (https://citeseerx.ist.psu.edu/index), or Semantic Scholar (https://www.semanticscholar.org/), or ACM Digital Libraries (https://dl.acm.org/) with the keyword "XYZ". The articles should be published in the last 10 years (2014-2024).

The following information from the article needs to be collected:

(1) Title of the article

(2) Venue/journal/conference being published

(3) Year

(4) Authors

(5) Abstract

In [None]:
import re
import time

import pandas as pd
import requests
from bs4 import BeautifulSoup

# Browser-like User-Agent header; requests sent with the default one may be rejected
HEADERS = {"User-Agent": "Mozilla/5.0"}

# Function to collect data from Google Scholar
def collect_articles(keyword, start_year, end_year, num_articles=5):
    start = 0
    results = []

    # Create a query string with the provided keyword
    query = keyword.replace(' ', '+')

    while len(results) < num_articles:
        url = (f'https://scholar.google.com/scholar?start={start}&q={query}'
               f'&hl=en&as_sdt=0,5&as_ylo={start_year}&as_yhi={end_year}')
        try:
            response = requests.get(url, headers=HEADERS, timeout=30)
            response.raise_for_status()
        except requests.RequestException as e:
            print(f"An exception occurred: {e}")
            print(f"Failed URL: {url}")
            break

        page = BeautifulSoup(response.text, 'lxml')
        entries = page.find_all("div", attrs={"class": "gs_ri"})
        if not entries:
            # No results on this page: either the last page was reached or the
            # request was blocked (e.g. by a CAPTCHA). Stop instead of looping forever.
            print(f"No results found, stopping at: {url}")
            break

        for entry in entries:
            title = entry.find('h3', attrs={'class': 'gs_rt'})
            author = entry.find('div', attrs={'class': 'gs_a'})
            abst = entry.find('div', attrs={'class': 'gs_rs'})
            cite = entry.find('div', attrs={'class': 'gs_fl'})

            # The author line usually also contains the venue and the year
            byline = author.text if author else ""
            year_match = re.search(r'\b(19|20)\d{2}\b', byline)
            cited_match = re.search(r'Cited by (\d+)', cite.text) if cite else None

            # Collect the required details
            results.append({
                "title": title.text if title else "No title",
                "url": entry.a['href'] if entry.a else "No URL",
                "authors": byline if byline else "No authors",
                "abstract": abst.text if abst else "No abstract",
                "citation": cited_match.group(1) if cited_match else "No citation",
                "year": int(year_match.group()) if year_match else None
            })

            # Break the loop if the required number of articles are collected
            if len(results) >= num_articles:
                break

        time.sleep(5)  # Delay to avoid being blocked by Google Scholar
        start += 10  # Move to the next page

    return results

# Collect 1000 articles with the keyword "XYZ" from 2014 to 2024
keyword = "XYZ"
start_year = 2014
end_year = 2024
num_articles = 1000

articles = collect_articles(keyword, start_year, end_year, num_articles)

# Convert the results to a DataFrame and save to CSV (with a header row)
df = pd.DataFrame(articles)
df.to_csv('google_data.csv', index=False)
print(df.head())


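
The question also asks for the venue, which the code above does not separate from the author line. The sketch below shows one way to split a Google Scholar byline into authors, venue, and year. It assumes the byline has the shape `authors - venue, year - publisher`, which is common but not guaranteed. `parse_byline` and the sample strings are hypothetical.

```python
import re

def parse_byline(byline):
    """Split 'authors - venue, year - publisher' into its parts.

    Parts that cannot be found come back as None (or an empty author list).
    """
    # Normalize non-breaking spaces, which often appear in scraped text
    text = byline.replace("\xa0", " ").strip()
    parts = [p.strip() for p in text.split(" - ")]

    authors = [a.strip() for a in parts[0].split(",") if a.strip()]
    venue, year = None, None
    if len(parts) > 1:
        match = re.search(r"\b(19|20)\d{2}\b", parts[1])
        if match:
            year = int(match.group())
            # Everything before the year, minus the trailing comma, is the venue
            venue = parts[1][:match.start()].rstrip(", ").strip() or None
        else:
            venue = parts[1] or None
    return {"authors": authors, "venue": venue, "year": year}

example = parse_byline("A Smith, B Jones - Journal of Examples, 2019 - example.com")
no_venue = parse_byline("C Lee - 2021 - example.org")
print(example)
print(no_venue)
```

Bylines that Google Scholar truncates with an ellipsis will yield partial author or venue strings, so the parsed fields should be spot-checked against the article pages.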

# **INFO5731 In-class Exercise 2**

The purpose of this exercise is to understand users' information needs, and then collect data from different sources for analysis by implementing web scraping using Python.

**Expectations**:
*   Students are expected to complete the exercise during lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will incur a penalty of 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**


## Question 4A (10 Points)
Develop Python code to collect data from social media platforms like Reddit, Instagram, X (formerly known as Twitter), Facebook, or any other. Use hashtags, keywords, usernames, or user IDs to gather the data.



Ensure that the collected data has more than four columns.


## Question 4B (10 Points)
If you encounter challenges with the Question 4A web scraping using Python, employ an online tool such as ParseHub or Octoparse for data extraction. Introduce the selected tool, outline the steps for web scraping, and showcase the final output in formats like CSV or Excel.



Upload a document (Word or PDF File) in any shared storage (preferably UNT OneDrive) and add the publicly accessible link in the below code cell.

Please choose only one option for Question 4. If you do both options, we will grade only the first one.

In [None]:
# 1. Introduction to Octoparse
# Octoparse is a no-code web scraping tool that allows users to extract data from websites easily.
# It provides a visual interface to configure scraping tasks and export data in various formats such as CSV or Excel.

# 2. Steps for Web Scraping

# A. Setup Octoparse

# Download and Install Octoparse:
# 1. Go to the Octoparse website and download the software.
# 2. Install it on your computer and sign up or log in.

# Create a New Task:
# 1. Open Octoparse and click on + New Task.
# 2. Enter the URL of the Reddit subreddit you want to scrape, e.g., https://www.reddit.com/r/technology/.
# 3. Click Start to load the page in Octoparse's built-in browser.

# B. Configure Scraping Rules

# Point-and-Click Selection:
# 1. Click on the post titles on the Reddit page to select them. Octoparse will highlight similar elements.
# 2. Choose Select all to capture all post titles.
# 3. Click on the author names, post dates, and content to add them as fields. Name these fields as "Post Title," "Username," "Post Date," and "Content."

# Loop Pagination (if needed):
# 1. If you want to scrape multiple pages, click on the Next Page button and choose Loop Click.

# Preview and Adjust:
# 1. Use the Preview tab to check the data extracted so far. Make sure all fields are correctly mapped.

# C. Run the Task

# Run Locally or in the Cloud:
# 1. Click Run. You can choose to run it locally or in Octoparse's cloud (if you have a cloud plan).

# Monitor Progress:
# 1. Check the progress and ensure that the data is being extracted correctly.

# D. Export the Data

# Export Options:
# 1. After the scraping task is complete, go to the Export section.
# 2. Choose CSV or Excel as the export format.
# 3. Click Export and save the file to your computer.


# 3. Document Preparation
# Create a Word or PDF document that includes:
# - An introduction to Octoparse.
# - Detailed steps of the scraping process.
# - Screenshots of the Octoparse setup, including the point-and-click interface, data preview, and export process.
# - A link to the exported CSV or Excel file.

# 4. Upload and Share

# Upload Document:
# 1. Upload the document to a shared storage service like UNT OneDrive, Google Drive, or Dropbox.
# 2. Ensure the document is publicly accessible.

# Generate Shareable Link:
# 1. Obtain a shareable link for the document.

# Link: https://drive.google.com/file/d/15uR17-et9cl6zVJp0Z5CqaJ8IhddAtSM/view?usp=sharing
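
After exporting, it can help to confirm that the file contains the expected fields before sharing it. The sketch below uses the field names chosen in step B ("Post Title", "Username", "Post Date", "Content"). The sample CSV text and the `check_export` helper are hypothetical; a real export would be read from the file that Octoparse saved.

```python
import csv
import io

EXPECTED_FIELDS = ["Post Title", "Username", "Post Date", "Content"]

def check_export(csv_file):
    """Return (row_count, missing_fields, rows_with_empty_title) for an exported CSV."""
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []
    missing = [f for f in EXPECTED_FIELDS if f not in header]
    row_count = 0
    empty_titles = 0
    for row in reader:
        row_count += 1
        if not (row.get("Post Title") or "").strip():
            empty_titles += 1
    return row_count, missing, empty_titles

# Hypothetical export contents standing in for the real file
sample_export = io.StringIO(
    "Post Title,Username,Post Date,Content\n"
    "New chip announced,user_a,2024-09-01,Details about the chip\n"
    ",user_b,2024-09-02,Post with a missing title\n"
)
summary = check_export(sample_export)
print(summary)
```

For a file on disk, the same function can be called inside `with open(path, newline="", encoding="utf-8") as f:`, where `path` is wherever the export was saved.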

# Mandatory Question

**Important: Reflective Feedback on Web Scraping and Data Collection**



Please share your thoughts and feedback on the web scraping and data collection exercises you have completed in this assignment. Consider the following points in your response:



Learning Experience: Describe your overall learning experience in working on web scraping tasks. What were the key concepts or techniques you found most beneficial in understanding the process of extracting data from various online sources?



Challenges Encountered: Were there specific difficulties in collecting data from certain websites, and how did you overcome them? If you opted for the non-coding option, share your experience with the chosen tool.



Relevance to Your Field of Study: How might the ability to gather and analyze data from online sources enhance your work or research?

**(no grading of your submission if this question is left unanswered)**

In [None]:
'''
# Reflective Feedback on Web Scraping and Data Collection

# Learning Experience

# I explored blogs, articles, and other web content while working on the web scraping tasks
# to learn efficient data scraping methods. Here are some key takeaways:

# 1. Understanding Web Scraping Fundamentals:
# - HTML and CSS Selectors:
#   # Learning to identify HTML elements and write CSS selectors proved essential.
#   # This knowledge was useful in pinpointing the particular data to be scraped from web pages.
# - Data Extraction Techniques:
#   # Both using a point-and-click interface such as Octoparse and setting up an XPath or CSS selector by hand were fundamental skills.
#   # These practical methods are vital in data collection.

# 2. Automation and Efficiency:
# - Task Automation:
#   # To handle large numbers of records, data were processed with automation tools such as Octoparse, which demonstrated the concept of automated data collection.
# - Error Handling:
#   # Dealing with obstacles such as CAPTCHAs or data formatting problems taught me the importance of error handling and data checks.

# 3. Data Export and Integration:
# - Export Formats:
#   # Learning to export data in formats such as CSV and Excel was helpful for further data manipulation and for integrating the data into other tools.
# - Data Cleaning:
#   # Cleaning and preprocessing the exported data showed how critical data preparation is to obtaining accurate and reliable results.

# Challenges Encountered

# 1. Website Structure and Scraping Difficulties:
# - Dynamic Content:
#   # Some websites load content dynamically through JavaScript, which made extraction more challenging.
#   # For these, I used tools that can handle AJAX requests or relied on browser automation.
# - CAPTCHAs and Anti-Scraping Mechanisms:
#   # Websites with CAPTCHAs or other anti-scraping technologies proved very difficult to scrape.
#   # For these, I used either headless browsers or proxy servers to regain access.

# 2. Tool-Specific Challenges:
# - Octoparse:
#   # Octoparse is easy to use, though I had problems with pagination settings and with sites that have intricate structures.
#   # The visual interface made some operations easy, but complex workflows needed extra configuration steps.

# 3. Non-Coding Option Experience:
# - Ease of Use:
#   # The second option was a no-code solution based on Octoparse, which anyone can use without prior programming knowledge.
# - Limitations:
#   # Although it is relatively straightforward to use, it struggles with large datasets and frequently changing content, so supplementary methods or tools were needed.

# Relevance to Your Field of Study

# 1. Enhanced Data Collection:
# - Comprehensive Research:
#   # Collecting data from various online sources helps to broaden the dataset available for research.
#   # It is especially useful in disciplines such as market research, sociology, and technology studies.

# 2. Data-Driven Insights:
# - Informed Decision-Making:
#   # Real-time and historical data make well-informed decisions and trend analysis possible.
#   # This matters when planning strategy, analyzing customer trends, or evaluating market competitiveness.

# 3. Skill Development:
# - Technical Proficiency:
#   # Knowledge of web scraping tools and methods improves technical competence.
#   # Technical competence is critical today because data-driven positions are widespread.
#   # It applies to activities such as academic writing, research, business intelligence, and data analytics.

# Overall, working with web scraping and data collection tools has been both enlightening and demanding.
# It showed how intelligence can be extracted from data and why flexibility is necessary when dealing with data.
# These are crucial skills because they ensure that data gathered from the internet can be used in meaningful research across disciplines.

'''
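
As an illustration of the selector idea mentioned in the reflection, the sketch below pulls post titles and authors out of a small, well-formed HTML snippet using the limited XPath support in Python's standard-library `xml.etree.ElementTree`. The snippet and its class names are invented for the example. Real pages are rarely well-formed XML and usually call for a parser such as BeautifulSoup or lxml.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed snippet standing in for a scraped page
html_snippet = """
<div>
  <div class="post">
    <h3 class="title">First post</h3>
    <span class="author">user_a</span>
  </div>
  <div class="post">
    <h3 class="title">Second post</h3>
    <span class="author">user_b</span>
  </div>
  <div class="ad">
    <h3 class="title">Sponsored</h3>
  </div>
</div>
"""

root = ET.fromstring(html_snippet)

posts = []
# XPath: every <div> with class="post" anywhere below the root element
for post in root.findall(".//div[@class='post']"):
    title = post.find("h3[@class='title']")
    author = post.find("span[@class='author']")
    posts.append({
        "title": title.text if title is not None else None,
        "author": author.text if author is not None else None,
    })

print(posts)
```

The `class="ad"` block is skipped because the attribute predicate only matches elements whose class is exactly `post`.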
