<a href="https://colab.research.google.com/github/MikeChastain84/Mike_INFO5731_Fall2024/blob/main/Chastain_Mike_Exercise_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Mike Chastain Assignment 2

# **INFO5731 In-class Exercise 2**

The purpose of this exercise is to understand users' information needs, and then collect data from different sources for analysis by implementing web scraping using Python.

**Expectations**:
*   Students are expected to complete the exercise during lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission. , and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)
Describe an interesting research question (or practical question or something innovative) you have in mind, what kind of data should be collected to answer the question(s)? Specify the amount of data needed for analysis. Provide detailed steps for collecting and saving the data.

In [None]:
# write your answer here
"""
A small residential cleaning company named Uniform Cleaners was created with the goal of creating job opportunities for military spouses.
They are located in Killeen, TX near Fort Cavazos, TX. They want to expand to towns outside of other military installations but need help
selecting the best location.

Research question: Where should Uniform cleaning expand to maximize job opportunities for military spouses and business profitability?

Data Analysis Requirements:
This is a simple project that aims to collect a small amount of data related to the question above.

Data considerations:
- How many military personnel are stationed at different military installations.?
    - The installations with the highest number of military personnel should have higher numbers of military spouses seeking employment.

- How much residential cleaning demand is in each town?
    - Using Beautiful Soup web scraping techniques to determine the nature of residential cleaning demand around different military
      installations.

"""


## Question 2 (10 Points)
Write Python code to collect a dataset of 1000 samples related to the question discussed in Question 1.

In [30]:
# write your answer here
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

# Dictionary of top 10 military installations with the most service members and their locations:
locations = {
    "Fort Liberty": "Fort Liberty, North Carolina",
    "Joint Base San Antonio": "Joint Base San Antonio, Texas",
    "Fort Cavazos": "Fort Cavazos, Texas",
    "Joint Base Lewis-McChord": "Joint Base Lewis-McChord, Washington",
    "Naval Station Norfolk": "Naval Station Norfolk, Virginia",
    "Camp Pendleton": "Camp Pendleton, California",
    "Fort Campbell": "Fort Campbell, Kentucky",
    "Joint Base Elmendorf-Richardson": "Joint Base Elmendorf-Richardson, Alaska",
    "Fort Benning": "Fort Benning, Georgia",
    "Fort Stewart": "Fort Stewart, Georgia"
}

# Function to scrape cleaning service information from Yelp
def fetch_cleaning_data(location, num_samples=200):
    base_url = "https://www.yelp.com/search"
    response = requests.get(base_url)
    soup = BeautifulSoup(response.text, "html.parser")

    businesses = soup.find_all('h4', class_='css-1l5lt1i')  # Extract business names
    cleaning_data = []

    for business in businesses[:num_samples]:
        business_name = business.get_text
        cleaning_data.append({
            "Business Name": business_name,
            "Location": location
        })

    return cleaning_data

# Function to collect data for 1000 samples
def collect_data():
    all_data = []

    for base, location in locations.items():
        print(f"Fetching cleaning data for {location}...")

        cleaning_services = fetch_cleaning_data(location)
        for service in cleaning_services:
            all_data.append({
                'Installation': base,
                'Location': location,
                'Business Name': service['Business Name']
            })

        time.sleep(1) # avoiding overloading the server

    return pd.DataFrame(all_data)

df = collect_data()
df.to_csv('cleaning_expansion_data.csv', index=False)

print("Collected data and saved to 'cleaning_expansion_data.csv'")

Fetching cleaning data for Fort Liberty, North Carolina...
Fetching cleaning data for Joint Base San Antonio, Texas...
Fetching cleaning data for Fort Cavazos, Texas...
Fetching cleaning data for Joint Base Lewis-McChord, Washington...
Fetching cleaning data for Naval Station Norfolk, Virginia...
Fetching cleaning data for Camp Pendleton, California...
Fetching cleaning data for Fort Campbell, Kentucky...
Fetching cleaning data for Joint Base Elmendorf-Richardson, Alaska...
Fetching cleaning data for Fort Benning, Georgia...
Fetching cleaning data for Fort Stewart, Georgia...
Collected data and saved to 'cleaning_expansion_data.csv'


## Question 3 (10 Points)
Write Python code to collect 1000 articles from Google Scholar (https://scholar.google.com/), Microsoft Academic (https://academic.microsoft.com/home), or CiteSeerX (https://citeseerx.ist.psu.edu/index), or Semantic Scholar (https://www.semanticscholar.org/), or ACM Digital Libraries (https://dl.acm.org/) with the keyword "XYZ". The articles should be published in the last 10 years (2014-2024).

The following information from the article needs to be collected:

(1) Title of the article

(2) Venue/journal/conference being published

(3) Year

(4) Authors

(5) Abstract

In [24]:
# I have most of Sunday working on this problem and couldn't figure out why it wouldn't work.
# It turns out, semanticscholar blocks scraping unless you use their API, which I applied for but haven't received a response.
# I couldn't get it to work using Beautiful Soup (from bs4 import BeautifulSoup) and (soup = BeautifulSoup(response.text, "html.parser")),
# I decided to use scholarly and scrape Google Scholarly. I ran into issues with scholarly as well. I think they blocked me. I finally gave
# ...up on this one as well.

!pip install scholarly          # installing scholarly library for scraping Google Scholarly
!pip install --upgrade scholarly # upgrading scholarly library for scraping Google Scholarly
import scholarly                # importing scholarly
import pandas as pd             # importing pandas for data manipulation and analysis

def fetch_articles(query, num_articles=1000):         # a new function to fetch articles; accepts two parameters;
                                                        # ...query and num_articles (defaulted to 1000)

  search_query = scholarly.search_pubs_query(query)   # searches for the query in Google Scholar

  articles = []                                       # creates an empty list to store the articles
  count = 0                                           # initializes a counter to track the number of articles

  while len(articles) < num_articles:                 # a loop until  the number of articles is collected
    try:
      paper = next(search_query)                      # get the next paper from the search results

      # finds the title or "N/A" if missing
      title = paper.bib.get('title', 'N/A')

      # finds the venue/journal or "N/A" if missing
      venue = paper.bib.get('journal', 'N/A')

      # finds the publication year, or "N/A" if missing
      year = paper.bib.get("year", "N/A")

      # finds the authors, or "N/A" if missing
      authors = ", ".join(paper.bib.get("author", []))

      # finds the abstract, or "N/A" if missing
      abstract = paper.bib.get("abstract", "N/A")

      # adds a dictionary with the article details to the articles list
      articles.append({
          "Title": title,
          "Venue": venue,
          "Year": year,
          "Authors": authors,
          "Abstract": abstract
      })

      count += 1                # increments the counter
      if count >= num_articles: # breaks the loop once you reach the desired num of articles
        break

    except StopIteration:       # if no more articles are found, break the loop
      break

  return articles               # returns the list of articles

# example
query = "XYZ"
articles = fetch_articles(query, num_articles)

# save to csv file
df = pd.DataFrame(articles)
df.to_csv("articles.csv", index=False)

print(f"Collected {len(articles)} articles")



AttributeError: module 'scholarly' has no attribute 'search_pubs_query'

## Question 4A (10 Points)
Develop Python code to collect data from social media platforms like Reddit, Instagram, Twitter (formerly known as X), Facebook, or any other. Use hashtags, keywords, usernames, or user IDs to gather the data.



Ensure that the collected data has more than four columns.


In [32]:
# I got close with this one, but I couldn't figure out why I was getting the 403 error. Maybe I'm being blocked by Reddit.

import requests                 # Import the requests library to make HTTP requests
import pandas as pd             # Import pandas for handling and analyzing data

# Function to fetch Reddit data using Pushshift API
def fetch_reddit_data(keyword, num_posts=100):
    base_url = 'https://api.pushshift.io/reddit/search/submission/'  # Base URL for the Pushshift API
    params = {                    # Parameters for the API request
        'q': keyword,             # Keyword to search in Reddit posts
        'size': num_posts,        # Number of posts to retrieve
        'sort': 'desc',           # Sort by newest posts first
        'sort_type': 'created_utc' # Sort based on post creation time
    }

    response = requests.get(base_url, params=params)   # Make the API request

    # Check if the request was successful
    if response.status_code != 200:
        print(f"Error: Received status code {response.status_code}")
        return None

    # Check if the 'data' field is present in the response
    try:
        data = response.json()['data']    # Extract the 'data' field from the JSON response
    except KeyError:
        print("Error: 'data' field not found in the response.")
        print("Full response:", response.json())  # Print full response for debugging
        return None

    posts_data = []             # List to hold the collected posts data

    for post in data:           # Loop through each post in the data
        posts_data.append({
            'Title': post.get('title', 'N/A'),            # Extract post title or use 'N/A' if not available
            'Subreddit': post.get('subreddit', 'N/A'),    # Extract subreddit name or use 'N/A'
            'Username': post.get('author', 'N/A'),        # Extract author's username or use 'N/A'
            'Upvotes': post.get('score', 'N/A'),          # Extract upvote count or use 'N/A'
            'Created_UTC': post.get('created_utc', 'N/A') # Extract timestamp of post creation or use 'N/A'
        })

    df = pd.DataFrame(posts_data)  # Create a DataFrame from the list of posts
    return df                      # Return the DataFrame

# Test the function by searching for the keyword 'python' and retrieving 100 posts
keyword = 'python'
reddit_df = fetch_reddit_data(keyword, 100)

# Check if data was returned before saving
if reddit_df is not None:
    # Save the collected data to a CSV file
    reddit_df.to_csv('reddit_data.csv', index=False)     # Save the DataFrame to a CSV file
    print('Collected Reddit data and saved it to "reddit_data.csv"')
else:
    print("No data was collected.")


Error: Received status code 403
No data was collected.


## Question 4B (10 Points)
If you encounter challenges with Question-4 web scraping using Python, employ any online tools such as ParseHub or Octoparse for data extraction. Introduce the selected tool, outline the steps for web scraping, and showcase the final output in formats like CSV or Excel.



Upload a document (Word or PDF File) in any shared storage (preferably UNT OneDrive) and add the publicly accessible link in the below code cell.

Please only choose one option for question 4. If you do both options, we will grade only the first one

In [None]:
# write your answer here


# Mandatory Question

**Important: Reflective Feedback on Web Scraping and Data Collection**



Please share your thoughts and feedback on the web scraping and data collection exercises you have completed in this assignment. Consider the following points in your response:



Learning Experience: Describe your overall learning experience in working on web scraping tasks. What were the key concepts or techniques you found most beneficial in understanding the process of extracting data from various online sources?



Challenges Encountered: Were there specific difficulties in collecting data from certain websites, and how did you overcome them? If you opted for the non-coding option, share your experience with the chosen tool.



Relevance to Your Field of Study: How might the ability to gather and analyze data from online sources enhance your work or research?

**(no grading of your submission if this question is left unanswered)**

In [None]:
'''
I spent most or Saturday and Sunday trying to understand what I was getting wrong. I went through many versions of these.
I think we should extend the due date on these assignments until the end of Monday. I can't come see the TA on Thursday or
Friday because of either my personal schedule or class schedules.
I work on this stuff over the weekend and if we had Monday to discuss with them I could probably solve these errors before submitting.
'''