<a href="https://colab.research.google.com/github/ImaduddinAhmedMohammed/ImaduddinAhmed_INFO5731_Spring2024/blob/main/Mohammed_Imad_Exercise_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 In-class Exercise 2**

The purpose of this exercise is to understand users' information needs, and then collect data from different sources for analysis by implementing web scraping using Python.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions incur a penalty of 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)
Describe an interesting research question (or practical question or something innovative) you have in mind. What kind of data should be collected to answer the question(s)? Specify the amount of data needed for analysis. Provide detailed steps for collecting and saving the data.

In [None]:
# write your answer here
"""Research Question: What are the productivity and job satisfaction levels of
employees in the tech industry?

For this research question, the data that needs to be collected is:

Employee Productivity Metrics:
Number of tasks completed per day/week/month
Time taken to complete tasks
Quality of work (e.g., error rates, customer satisfaction ratings)
Self-reported productivity levels (e.g., using surveys)

Employee Job Satisfaction Metrics:
Responses to job satisfaction surveys (e.g., Likert scale questions)
Turnover rates
Attendance records
Feedback from performance reviews or one-on-one meetings
Employee engagement scores
"""



## Question 2 (10 Points)
Write Python code to collect a dataset of 1000 samples related to the question discussed in Question 1.

In [None]:
# write your answer here
import random
import pandas as pd

# Employee Productivity Metrics
def generate_productivity_data(num_employees):
    productivity_data = []
    for _ in range(num_employees):
        tasks_completed = random.randint(20, 100)
        time_taken = round(random.uniform(4, 10), 2)  # Hours
        error_rate = round(random.uniform(0, 5), 2)  # Percentage
        customer_satisfaction = round(random.uniform(3, 5), 2)  # Scale of 1-5
        self_reported_productivity = round(random.uniform(3, 5), 2)  # Scale of 1-5
        productivity_data.append({
            'Tasks Completed': tasks_completed,
            'Time Taken (hours)': time_taken,
            'Error Rate (%)': error_rate,
            'Customer Satisfaction': customer_satisfaction,
            'Self-reported Productivity': self_reported_productivity
        })
    return pd.DataFrame(productivity_data)

# Employee Job Satisfaction Metrics
def generate_satisfaction_data(num_employees):
    satisfaction_data = []
    for _ in range(num_employees):
        satisfaction_score = random.randint(1, 5)
        turnover = random.choice(['Low', 'Medium', 'High'])
        attendance = round(random.uniform(0.7, 1), 2)  # Fraction of days attended (0.7-1.0)
        feedback_score = round(random.uniform(3, 5), 2)  # Scale of 1-5
        engagement_score = round(random.uniform(3, 5), 2)  # Scale of 1-5
        satisfaction_data.append({
            'Satisfaction Score': satisfaction_score,
            'Turnover': turnover,
            'Attendance': attendance,
            'Feedback Score': feedback_score,
            'Engagement Score': engagement_score
        })
    return pd.DataFrame(satisfaction_data)



# Generate dataset
num_employees = 1000
productivity_df = generate_productivity_data(num_employees)
satisfaction_df = generate_satisfaction_data(num_employees)

employee_data = pd.concat([productivity_df, satisfaction_df], axis=1)

print(employee_data)

employee_data.to_csv('employee_data.csv', index=False)


     Tasks Completed  Time Taken (hours)  Error Rate (%)  \
0                 70                8.83            2.52   
1                 20                5.75            1.87   
2                 81                8.16            0.13   
3                 35                7.51            4.40   
4                 85                7.47            0.07   
..               ...                 ...             ...   
995               82                9.13            4.55   
996               47                4.23            1.03   
997               98                6.18            4.43   
998               51                6.78            4.68   
999               23                5.18            5.00   

     Customer Satisfaction  Self-reported Productivity  Satisfaction Score  \
0                     3.29                        4.26                   2   
1                     4.72                        3.19                   4   
2                     4.55                   
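
Because the dataset above is simulated with `random`, every run produces different values. The sketch below shows one way to make the generation reproducible by seeding a dedicated generator. The helper names `generate_rows` and `save_rows`, the seed value, and the reduced set of columns are illustrative assumptions rather than part of the original answer, and the sketch uses only the standard library so it runs without pandas.

```python
import csv
import random


def generate_rows(num_employees, seed=42):
    """Hypothetical helper: simulate employee records reproducibly."""
    rng = random.Random(seed)  # private generator, so global random state is untouched
    rows = []
    for _ in range(num_employees):
        rows.append({
            'Tasks Completed': rng.randint(20, 100),
            'Time Taken (hours)': round(rng.uniform(4, 10), 2),
            'Error Rate (%)': round(rng.uniform(0, 5), 2),
            'Satisfaction Score': rng.randint(1, 5),
            'Attendance': round(rng.uniform(0.7, 1), 2),
        })
    return rows


def save_rows(rows, path):
    """Hypothetical helper: write the simulated records to a CSV file."""
    with open(path, 'w', newline='') as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


rows = generate_rows(1000)
```

Calling `save_rows(rows, 'employee_data_seeded.csv')` would persist the result; the file name is a placeholder. Using the same seed again yields the same 1000 records, which makes any later analysis easier to check.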

## Question 3 (10 Points)
Write Python code to collect 1000 articles from Google Scholar (https://scholar.google.com/), Microsoft Academic (https://academic.microsoft.com/home), or CiteSeerX (https://citeseerx.ist.psu.edu/index), or Semantic Scholar (https://www.semanticscholar.org/), or ACM Digital Libraries (https://dl.acm.org/) with the keyword "XYZ". The articles should be published in the last 10 years (2014-2024).

The following information from the article needs to be collected:

(1) Title of the article

(2) Venue/journal/conference where the article was published

(3) Year

(4) Authors

(5) Abstract

In [None]:
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd
import time

# Scholar frequently answers automated requests with a CAPTCHA or an error page;
# in that case no results are parsed and the DataFrame stays empty.
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}

def parse_result(result):
    # Extract the required fields from one entry on the results page
    title_tag = result.find('h3', {'class': 'gs_rt'})
    meta_tag = result.find('div', {'class': 'gs_a'})
    snippet_tag = result.find('div', {'class': 'gs_rs'})

    # The gs_a line usually reads "authors - venue, year - publisher"
    meta = meta_tag.get_text() if meta_tag else ''
    parts = re.split(r'\s+-\s+', meta)
    authors = [name.strip() for name in parts[0].split(',') if name.strip()]
    venue_year = parts[1] if len(parts) > 1 else ''
    year_match = re.search(r'\b20[12][0-9]\b', venue_year)
    venue = re.sub(r',?\s*\b20[12][0-9]\b', '', venue_year).strip()

    article_info = {
        'Title': title_tag.get_text().strip() if title_tag else '',
        'Venue': venue,
        'Year': year_match.group(0) if year_match else '',
        'Authors': authors,
        # The results page only shows a snippet of the abstract
        'Abstract': snippet_tag.get_text().strip() if snippet_tag else ''
    }
    return article_info

def scrape_google_scholar(keyword, num_articles, year_low=2014, year_high=2024):

    base_url = 'https://scholar.google.com/scholar'
    articles_data = []
    # Each results page holds about 10 entries, so page through with `start`
    for start in range(0, num_articles, 10):
        params = {'q': keyword, 'hl': 'en', 'start': start,
                  'as_ylo': year_low, 'as_yhi': year_high}
        response = requests.get(base_url, headers=HEADERS, params=params, timeout=30)
        if response.status_code != 200:
            print(f"Request failed with status {response.status_code}; stopping.")
            break
        soup = BeautifulSoup(response.text, 'html.parser')

        results = soup.find_all('div', {'class': 'gs_ri'})
        if not results:
            break  # No more results, or the request was blocked
        for result in results:
            try:
                articles_data.append(parse_result(result))
            except Exception as e:
                print(f"An error occurred while processing: {result.text.strip()[:80]}")
                print(e)
                continue
        time.sleep(2)  # Adding a delay to avoid overwhelming the server

    return pd.DataFrame(articles_data[:num_articles])

# Scrape 1000 articles with keyword "XYZ"
keyword = "XYZ"
num_articles = 1000
articles_df = scrape_google_scholar(keyword, num_articles)

# Save the DataFrame to a CSV file
#articles_df.to_csv("google_scholar_articles.csv", index=False)

# Display the DataFrame (pandas truncates long output automatically)
print(articles_df)




Empty DataFrame
Columns: []
Index: []
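
The empty result above is typical when a site rejects automated requests. Since the question also allows Semantic Scholar, the sketch below shows how the same five fields might be collected from a JSON search API instead of parsed HTML. The endpoint URL, the query parameters (`query`, `offset`, `limit`, `fields`, `year`), and the response layout are assumptions based on the public Semantic Scholar Graph API and should be checked against its current documentation; the function names are hypothetical. The service may also cap how many results a single search returns or throttle anonymous clients.

```python
import json
import time
import urllib.parse
import urllib.request

# Assumed search endpoint and field list; verify against the API documentation.
SEARCH_URL = 'https://api.semanticscholar.org/graph/v1/paper/search'
FIELDS = 'title,venue,year,authors,abstract'


def build_search_url(keyword, offset=0, limit=100, years='2014-2024'):
    """Build the URL for one page of search results."""
    query = urllib.parse.urlencode({
        'query': keyword, 'offset': offset, 'limit': limit,
        'fields': FIELDS, 'year': years,
    })
    return f'{SEARCH_URL}?{query}'


def rows_from_payload(payload):
    """Flatten one JSON response page into the five required columns."""
    rows = []
    for paper in payload.get('data', []):
        authors = [a.get('name', '') for a in paper.get('authors') or []]
        rows.append({
            'Title': paper.get('title') or '',
            'Venue': paper.get('venue') or '',
            'Year': paper.get('year'),
            'Authors': '; '.join(authors),
            'Abstract': paper.get('abstract') or '',
        })
    return rows


def fetch_articles(keyword, num_articles=1000, page_size=100):
    """Page through results until enough rows are collected (network access)."""
    rows = []
    for offset in range(0, num_articles, page_size):
        url = build_search_url(keyword, offset, page_size)
        with urllib.request.urlopen(url, timeout=30) as response:
            page = rows_from_payload(json.load(response))
        if not page:
            break  # no further results
        rows.extend(page)
        time.sleep(1)  # stay polite between requests
    return rows[:num_articles]
```

A call such as `pd.DataFrame(fetch_articles("XYZ"))` would then give a table with the same columns as the scraper above. Separating `rows_from_payload` from the network call keeps the parsing logic easy to check against a saved response.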


## Question 4A (10 Points)
Develop Python code to collect data from social media platforms like Reddit, Instagram, X (formerly known as Twitter), Facebook, or any other. Use hashtags, keywords, usernames, or user IDs to gather the data.



Ensure that the collected data has more than four columns.


In [None]:
# write your answer here
# I could not find a way to access the API for free.
import tweepy
import pandas as pd

consumer_key = "YOUR_CONSUMER_KEY"
consumer_secret = "YOUR_CONSUMER_SECRET"
access_token = "YOUR_ACCESS_TOKEN"
access_token_secret = "YOUR_ACCESS_TOKEN_SECRET"

auth = tweepy.OAuth1UserHandler(consumer_key, consumer_secret, access_token, access_token_secret)
api = tweepy.API(auth)

hashtag = "#python"

tweets_data = []
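# Tweepy 4.x renamed API.search to API.search_tweets; the old name raises AttributeError.
# The request still needs valid credentials with search access.
for tweet in tweepy.Cursor(api.search_tweets, q=hashtag, lang="en", tweet_mode="extended").items(100):
    tweet_data = {
        "Tweet ID": tweet.id,
        "Created At": tweet.created_at,
        "Username": tweet.user.screen_name,
        "Text": tweet.full_text,
        "Retweets": tweet.retweet_count,
        "Likes": tweet.favorite_count
    }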
    tweets_data.append(tweet_data)

df = pd.DataFrame(tweets_data)

df.to_csv("twitter_data.csv", index=False)

print(df.head())



AttributeError: 'API' object has no attribute 'search'
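
As an alternative that needs no API keys, the sketch below collects posts from Reddit, which the question also lists as an option. It assumes Reddit's public JSON search listing (`/search.json`) and the usual listing layout (`data.children[].data` with `title`, `author`, `score`, `num_comments`, `created_utc`, `subreddit`); these are assumptions to verify, anonymous requests may be throttled or rejected, and the function names and User-Agent string are placeholders.

```python
import json
import urllib.parse
import urllib.request
from datetime import datetime, timezone


def build_reddit_url(keyword, limit=100):
    """Assumed public search listing URL for a keyword."""
    query = urllib.parse.urlencode({'q': keyword, 'limit': limit, 'sort': 'new'})
    return f'https://www.reddit.com/search.json?{query}'


def rows_from_listing(listing):
    """Flatten a listing payload into rows with more than four columns."""
    rows = []
    for child in listing.get('data', {}).get('children', []):
        post = child.get('data', {})
        created = post.get('created_utc')
        rows.append({
            'Subreddit': post.get('subreddit', ''),
            'Author': post.get('author', ''),
            'Title': post.get('title', ''),
            'Score': post.get('score', 0),
            'Comments': post.get('num_comments', 0),
            'Created (UTC)': (
                datetime.fromtimestamp(created, tz=timezone.utc).isoformat()
                if created is not None else ''
            ),
        })
    return rows


def fetch_posts(keyword, limit=100):
    """Download one page of search results (network access)."""
    request = urllib.request.Request(
        build_reddit_url(keyword, limit),
        headers={'User-Agent': 'info5731-exercise-script/0.1'},  # placeholder value
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return rows_from_listing(json.load(response))
```

`pd.DataFrame(fetch_posts("python")).to_csv("reddit_data.csv", index=False)` would save the result in the same way as the Twitter attempt above; the output file name is illustrative.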

## Question 4B (10 Points)
If you encounter challenges with Question-4 web scraping using Python, employ any online tools such as ParseHub or Octoparse for data extraction. Introduce the selected tool, outline the steps for web scraping, and showcase the final output in formats like CSV or Excel.



Upload a document (Word or PDF file) to any shared storage (preferably UNT OneDrive) and add the publicly accessible link in the code cell below.

Please choose only one option for Question 4. If you do both options, we will grade only the first one.

In [None]:
# write your answer here


# Mandatory Question

**Important: Reflective Feedback on Web Scraping and Data Collection**



Please share your thoughts and feedback on the web scraping and data collection exercises you have completed in this assignment. Consider the following points in your response:



Learning Experience: Describe your overall learning experience in working on web scraping tasks. What were the key concepts or techniques you found most beneficial in understanding the process of extracting data from various online sources?



Challenges Encountered: Were there specific difficulties in collecting data from certain websites, and how did you overcome them? If you opted for the non-coding option, share your experience with the chosen tool.



Relevance to Your Field of Study: How might the ability to gather and analyze data from online sources enhance your work or research?

**(no grading of your submission if this question is left unanswered)**

In [None]:
'''
Write your response here.

Learning Experience:
Working on web scraping tasks has been a journey in which I am learning how to
extract valuable data from the vast amount of information available online.
One of the most eye-opening aspects was understanding how web pages are structured
in HTML and how to navigate them to find the data I need. Although I am
still learning how to do this, I find it both interesting and challenging.
Learning about tools like BeautifulSoup and Selenium was particularly helpful.
BeautifulSoup is simpler than the others, so it was the easiest to understand.

Challenges Encountered:
The main difficulty was scraping data from websites that load content dynamically
using JavaScript. It is hard to pinpoint exactly where the data sits in the page
when using Inspect Element, which in turn makes it difficult to combine features
from several packages at the same time.

Relevance to Your Field of Study:
As a Data Science student, I consider web scraping an essential skill.
Since extracting data and working with it is a core task for data scientists,
knowing how to scrape the web is a necessity.


'''
