# **INFO5731 In-class Exercise 2**

The purpose of this exercise is to understand users' information needs, and then collect data from different sources for analysis by implementing web scraping using Python.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)
Describe an interesting research question (or practical question or something innovative) you have in mind. What kind of data should be collected to answer the question(s)? Specify the amount of data needed for analysis. Provide detailed steps for collecting and saving the data.

In [1]:
'''
How does the implementation of SAP S/4HANA influence operational efficiency and decision-making processes within manufacturing companies?

The objective is to gauge how well the implementation of SAP S/4HANA has improved operational efficiency and decision-making processes in manufacturing companies.

1) Data collection:

- Operational efficiency data:
  Variables to collect include KPIs (key performance indicators),
  process downtime and maintenance records, cost metrics, etc.
- Data sources:
  SAP S/4HANA system reports and dashboards; historical data from pre-SAP implementation periods.
- Amount of data:
  Collect data from multiple manufacturing sites using SAP S/4HANA over a period of 6-12 months.
- Steps for collection:
  Identify and define the KPIs.
  Use SAP S/4HANA's built-in reporting tools to extract the KPIs.
  Obtain equivalent historical data from before the SAP S/4HANA implementation for comparison.


2) Decision-making process data:

- Variables to collect:
  Time taken for key decision-making processes such as order approvals and inventory records.
  Accuracy of decisions and frequency of decision-making errors and adjustments.
  Feedback from the decision makers on the system's support.

- Data sources:
  Collect feedback from at least 20-30 decision makers across different departments.
  Analyze system logs and decision records over 6-12 months.

- Steps for collection:
  Design and conduct surveys and interviews; review documentation and system logs.


3) Data storage and management:

- Database setup:
  Create an RDBMS or data warehouse to store the operational efficiency and decision-making data.
  Ensure proper indexing and categorization of data for easy retrieval and analysis.

- Data security:
  Implement robust data security measures to protect sensitive information, including encryption and access controls.

- Data backup:
  Regularly back up data to secure storage solutions to prevent data loss.

- Data cleaning:
  Clean the data on a regular schedule to remove inconsistencies and errors.

- Data analysis:
  Use statistical analysis tools and techniques to evaluate changes in operational efficiency and decision-making processes.
  Use data visualization tools to present findings and identify trends.

Analysis and insights:

- Compare KPIs
- Evaluate decision making
- Feedback analysis

References:

Kot, P., & Szymanski, J. (2020). The impact of SAP S/4HANA on enterprise resource planning (ERP) systems: A review and research agenda. Journal of Enterprise Information Management. Retrieved from ResearchGate
Peters, L., & Roberts, A. V. (2019). SAP S/4HANA and the future of ERP: A technical and business perspective. Information Systems Frontiers, 21(3), 545-563. https://doi.org/10.1007/s10796-018-9878-1
'''
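
The database setup and KPI comparison steps described in the answer above could be sketched roughly as follows. This is a minimal illustration, assuming an SQLite database stands in for the RDBMS; the table name `kpi_measurements`, the index name, the helper `percent_change`, and all site names and numbers are hypothetical placeholders, not real SAP S/4HANA measurements.

```python
import sqlite3

# In-memory database for illustration only; a real project would use a file or server RDBMS.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE kpi_measurements (
        site TEXT NOT NULL,
        period TEXT NOT NULL CHECK (period IN ('pre', 'post')),
        kpi_name TEXT NOT NULL,
        value REAL NOT NULL
    )
""")
# Index the columns used for retrieval and comparison.
conn.execute("CREATE INDEX idx_kpi_lookup ON kpi_measurements (kpi_name, site, period)")

# Hypothetical sample rows (made-up numbers).
rows = [
    ("Site A", "pre", "cycle_time_hours", 50.0),
    ("Site A", "post", "cycle_time_hours", 45.0),
    ("Site B", "pre", "cycle_time_hours", 60.0),
    ("Site B", "post", "cycle_time_hours", 57.0),
]
conn.executemany("INSERT INTO kpi_measurements VALUES (?, ?, ?, ?)", rows)

def percent_change(connection, kpi_name):
    """Return {site: percent change from 'pre' to 'post'} for one KPI."""
    query = """
        SELECT pre.site, (post.value - pre.value) / pre.value * 100.0
        FROM kpi_measurements AS pre
        JOIN kpi_measurements AS post
          ON pre.site = post.site AND pre.kpi_name = post.kpi_name
        WHERE pre.period = 'pre' AND post.period = 'post' AND pre.kpi_name = ?
        ORDER BY pre.site
    """
    return {site: round(change, 1) for site, change in connection.execute(query, (kpi_name,))}

changes = percent_change(conn, "cycle_time_hours")
print(changes)
```

A negative value means the KPI dropped after the implementation, which is an improvement for a metric such as cycle time. Storing one row per site, period, and KPI keeps the table easy to extend when more KPIs are defined.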

## Question 2 (10 Points)
Write Python code to collect a dataset of 1000 samples related to the question discussed in Question 1.

In [2]:
import pandas as pd
import numpy as np

samples = 1000
np.random.seed(42)  # fixed seed so the generated values are reproducible

# Operational-efficiency variables. All values are simulated from assumed
# distributions; they are placeholders, not measurements from an SAP system.
production_cycle_time = np.random.normal(loc=50, scale=10, size=samples)
inventory_turnover_rate = np.random.normal(loc=8, scale=2, size=samples)
order_fulfillment_accuracy = np.random.uniform(85, 100, size=samples)
production_cost = np.random.normal(loc=50000, scale=10000, size=samples)
operational_cost = np.random.normal(loc=20000, scale=5000, size=samples)

# Decision-making variables (also simulated).
decision_time = np.random.normal(loc=2, scale=0.5, size=samples)
decision_errors = np.random.poisson(lam=1, size=samples)
user_satisfaction = np.random.uniform(1, 10, size=samples)

# Combine the columns into one table with one row per sample.
df = pd.DataFrame({
    'Production_Cycle_Time': production_cycle_time,
    'Inventory_Turnover_Rate': inventory_turnover_rate,
    'Order_Fulfillment_Accuracy': order_fulfillment_accuracy,
    'Production_Cost': production_cost,
    'Operational_Cost': operational_cost,
    'Decision_Time': decision_time,
    'Decision_Errors': decision_errors,
    'User_Satisfaction': user_satisfaction
})

# Save the table as CSV without the row index.
df.to_csv('sap_s4hana_impact_data.csv', index=False)

print(f"Dataset of {samples} samples has been generated and saved to 'sap_s4hana_impact_data.csv'.")


Dataset of 1000 samples has been generated and saved to 'sap_s4hana_impact_data.csv'.
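
Because a normal distribution is unbounded, columns drawn with `np.random.normal` in the cell above can occasionally contain implausible values, such as a negative decision time. A range check before analysis might look like the sketch below. It uses only the standard library; the bounds in `BOUNDS` and the helper name `find_out_of_range` are assumptions chosen for illustration, and the two-row sample stands in for the real CSV file.

```python
import csv
import io

# Assumed plausible ranges per column (illustrative choices, not from the exercise).
BOUNDS = {
    "Order_Fulfillment_Accuracy": (0.0, 100.0),
    "Decision_Time": (0.0, float("inf")),
    "Production_Cost": (0.0, float("inf")),
    "User_Satisfaction": (1.0, 10.0),
}

def find_out_of_range(rows, bounds=BOUNDS):
    """Return (row_number, column, value) for every value outside its bounds."""
    problems = []
    for row_number, row in enumerate(rows, start=1):
        for column, (low, high) in bounds.items():
            value = float(row[column])
            if not low <= value <= high:
                problems.append((row_number, column, value))
    return problems

# Tiny in-memory sample standing in for 'sap_s4hana_impact_data.csv'.
sample_csv = io.StringIO(
    "Order_Fulfillment_Accuracy,Decision_Time,Production_Cost,User_Satisfaction\n"
    "92.5,1.8,48000,7.2\n"
    "97.1,-0.1,51000,3.4\n"
)
problems = find_out_of_range(csv.DictReader(sample_csv))
print(problems)
```

To check the generated file instead of the sample, the same function can be given `csv.DictReader(open('sap_s4hana_impact_data.csv', newline=''))`.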


## Question 3 (10 Points)
Write Python code to collect 1000 articles from Google Scholar (https://scholar.google.com/), Microsoft Academic (https://academic.microsoft.com/home), or CiteSeerX (https://citeseerx.ist.psu.edu/index), or Semantic Scholar (https://www.semanticscholar.org/), or ACM Digital Libraries (https://dl.acm.org/) with the keyword "XYZ". The articles should be published in the last 10 years (2014-2024).

The following information from the article needs to be collected:

(1) Title of the article

(2) Venue/journal/conference being published

(3) Year

(4) Authors

(5) Abstract

In [18]:
import requests
from bs4 import BeautifulSoup

def google_scholar_scraper(query, num_articles=10):
    """Scrape title, byline, and snippet from Google Scholar result pages."""
    base_url = 'https://scholar.google.com/scholar'
    params = {
        'q': query,
        'hl': 'en',
        'as_sdt': '0,5',
        'as_ylo': '2014',  # earliest publication year
        'as_yhi': '2024'   # latest publication year
    }

    articles = []
    # Each result page holds 10 entries; 'start' is the offset of the first one.
    for start in range(0, num_articles, 10):
        params['start'] = start
        response = requests.get(base_url, params=params, timeout=30)

        # A non-200 status may mean the request was rejected, so stop paging.
        if response.status_code != 200:
            break

        soup = BeautifulSoup(response.text, 'html.parser')
        for result in soup.find_all('div', {'class': 'gs_ri'}):
            title_element = result.find('h3')
            # The 'gs_a' line holds authors, venue, and year in a single string.
            byline_element = result.find('div', {'class': 'gs_a'})
            abstract_element = result.find('div', {'class': 'gs_rs'})

            # Guard every lookup so a missing element yields None, not an exception.
            byline = byline_element.text if byline_element else None
            articles.append({
                'title': title_element.text if title_element else None,
                'authors': byline,
                'venue_year': byline,
                'abstract': abstract_element.text if abstract_element else None
            })

    return articles

query = "How does the implementation of SAP S/4HANA influence the issues of operational efficiency and decision-making processes within manufacturing companies?"
num_articles = 10
articles = google_scholar_scraper(query, num_articles)

if articles:
    # Show at most three results; slicing avoids an IndexError on shorter lists.
    for i, article in enumerate(articles[:3], start=1):
        print(f"Article {i}:\n")
        print(f"Title: {article['title']}")
        print(f"Authors: {article['authors']}")
        print(f"Venue/Year: {article['venue_year']}")
        print(f"Abstract: {article['abstract']}")
        print("\n" + "="*50 + "\n")
else:
    print("No articles found for the given query.")

No articles found for the given query.
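
Since the Google Scholar request above returned no results, one alternative among the sources listed in the question is Semantic Scholar, which offers a public search API. The sketch below is a hedged outline rather than a tested collector: it assumes the Graph API paper search endpoint with its `query`, `year`, `fields`, `offset`, and `limit` parameters, and it assumes unauthenticated access is rate limited, so it pauses between pages. The helper names `flatten_paper` and `fetch_papers`, the pause length, and the example record are hypothetical. Only the offline demonstration at the bottom runs when the block is executed; `fetch_papers` is defined but not called, and the API caps how many relevance-ranked results it returns, so the last page may need adjusting.

```python
import json
import time
import urllib.parse
import urllib.request

API_URL = "https://api.semanticscholar.org/graph/v1/paper/search"
FIELDS = "title,venue,year,authors,abstract"

def flatten_paper(paper):
    """Reduce one API record to the five fields the question asks for."""
    return {
        "title": paper.get("title"),
        "venue": paper.get("venue"),
        "year": paper.get("year"),
        "authors": ", ".join(a.get("name", "") for a in paper.get("authors") or []),
        "abstract": paper.get("abstract"),
    }

def fetch_papers(query, total=1000, page_size=100, pause_seconds=3.0):
    """Page through search results until `total` papers are collected."""
    papers = []
    for offset in range(0, total, page_size):
        params = urllib.parse.urlencode({
            "query": query, "year": "2014-2024", "fields": FIELDS,
            "offset": offset, "limit": page_size,
        })
        with urllib.request.urlopen(f"{API_URL}?{params}", timeout=30) as response:
            page = json.load(response)
        batch = page.get("data", [])
        if not batch:
            break  # no more results for this query
        papers.extend(flatten_paper(p) for p in batch)
        time.sleep(pause_seconds)  # assumed pause to stay within rate limits
    return papers[:total]

# Offline demonstration with a hand-written record shaped like an API result.
example = {
    "paperId": "hypothetical-id",
    "title": "An example title",
    "venue": "An example venue",
    "year": 2020,
    "authors": [{"authorId": "1", "name": "A. Author"}, {"authorId": "2", "name": "B. Author"}],
    "abstract": None,
}
flat = flatten_paper(example)
print(flat)
```

A short keyword query such as `"SAP S/4HANA manufacturing"` is likely to match more papers than a full sentence. The resulting list of dictionaries can be passed to `pandas.DataFrame` and saved with `to_csv`, as in Question 2.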


## Question 4A (10 Points)
Develop Python code to collect data from social media platforms like Reddit, Instagram, X (formerly known as Twitter), Facebook, or any other. Use hashtags, keywords, usernames, or user IDs to gather the data.



Ensure that the collected data has more than four columns.


In [22]:
import tweepy
import pandas as pd

# Placeholder credentials: replace each string with the value issued for your
# own developer app. Sending the placeholders as-is is rejected with a 401 error.
CONSUMER_KEY = 'YOUR_CONSUMER_KEY'
CONSUMER_SECRET = 'YOUR_CONSUMER_SECRET'
ACCESS_TOKEN = 'YOUR_ACCESS_TOKEN'
ACCESS_TOKEN_SECRET = 'YOUR_ACCESS_TOKEN_SECRET'

# OAuth 1.0a: the consumer key/secret identify the app, and the access
# token/secret identify the account that authorized it.
auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
api = tweepy.API(auth)

def fetch_tweets(query, num_tweets=10):
    """Search English tweets matching the query and return one dict per tweet."""
    tweets = []
    for tweet in tweepy.Cursor(api.search_tweets, q=query, tweet_mode='extended', lang='en').items(num_tweets):
        tweets.append({
            'Tweet': tweet.full_text,
            'Username': tweet.user.screen_name,
            'Followers': tweet.user.followers_count,
            'Likes': tweet.favorite_count,
            'Retweets': tweet.retweet_count,
            'Date': tweet.created_at
        })
    return tweets

def save_to_csv(tweets, filename='twitter_tweets.csv'):
    """Write the collected tweets to a CSV file with six columns."""
    df = pd.DataFrame(tweets)
    df.to_csv(filename, index=False)
    print(f"Saved {len(tweets)} tweets to {filename}")

tweets = fetch_tweets('data science', num_tweets=10)
save_to_csv(tweets)


Unauthorized: 401 Unauthorized
89 - Invalid or expired token.
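
The 401 response above is what the API returns when the tokens it receives are not valid, and the cell sends the literal `'YOUR_...'` placeholder strings. One common way to supply real credentials without writing them into the notebook is to read them from environment variables. The sketch below shows that idea with the standard library only; the environment variable names and the helper `load_credentials` are assumptions for illustration, and the demonstration uses fake values.

```python
import os

REQUIRED_KEYS = ("CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

def load_credentials(environ=os.environ):
    """Read the four OAuth 1.0a values from environment variables.

    Raises ValueError if a value is missing or is still a 'YOUR_...' placeholder.
    """
    credentials = {}
    for key in REQUIRED_KEYS:
        value = environ.get(key, "")
        if not value or value.startswith("YOUR_"):
            raise ValueError(f"{key} is missing or still a placeholder")
        credentials[key] = value
    return credentials

# Demonstration with fake values; real values are issued for a developer app.
fake_env = {key: f"fake-{key.lower()}" for key in REQUIRED_KEYS}
creds = load_credentials(fake_env)
print(sorted(creds))

# A placeholder value is caught before any request is sent.
try:
    load_credentials({"CONSUMER_KEY": "YOUR_CONSUMER_KEY"})
    placeholder_rejected = False
except ValueError as error:
    placeholder_rejected = True
    print(error)
```

The returned values can then be passed to `tweepy.OAuthHandler` and `set_access_token` in place of the hard-coded constants. Failing early with a clear message makes a missing token easier to diagnose than a 401 from the remote service.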

## Question 4B (10 Points)
If you encounter challenges with Question-4 web scraping using Python, employ any online tools such as ParseHub or Octoparse for data extraction. Introduce the selected tool, outline the steps for web scraping, and showcase the final output in formats like CSV or Excel.



Upload a document (Word or PDF File) in any shared storage (preferably UNT OneDrive) and add the publicly accessible link in the below code cell.

Please choose only one option for Question 4. If you do both options, we will grade only the first one.

In [None]:
# write your answer here
# A subscription to the tool's website is needed to perform this activity.

# Mandatory Question

**Important: Reflective Feedback on Web Scraping and Data Collection**



Please share your thoughts and feedback on the web scraping and data collection exercises you have completed in this assignment. Consider the following points in your response:



Learning Experience: Describe your overall learning experience in working on web scraping tasks. What were the key concepts or techniques you found most beneficial in understanding the process of extracting data from various online sources?



Challenges Encountered: Were there specific difficulties in collecting data from certain websites, and how did you overcome them? If you opted for the non-coding option, share your experience with the chosen tool.



Relevance to Your Field of Study: How might the ability to gather and analyze data from online sources enhance your work or research?

**(no grading of your submission if this question is left unanswered)**

In [23]:
'''
The web scraping and data collection exercises gave me a deeper understanding of data extraction methods, and they highlighted ethical considerations and the importance of tool selection.
These skills apply to academic research and are also useful in a wide range of professional domains.
The experience gave me essential tools and knowledge to manage and analyze data from diverse online sources effectively.
I faced some challenges while working on Question 4: I identified that the issue is with the access tokens.
I need some guidance on the concept of access tokens.
'''