## Second In-Class Exercise (09/13/2023, 40 points in total)

Kindly use the provided .ipynb document to write your code or respond to the questions. Avoid generating a new file.
Execute all the cells before your final submission.

This in-class exercise is due tomorrow, September 14, 2023, at 11:59 PM. No late submissions will be considered.

The purpose of this exercise is to understand users' information needs, then collect data from different sources for analysis.

Question 1 (10 points): Describe an interesting research question (or a practical question, or something innovative) that you have in mind. What kind of data should be collected to answer the question(s)? How much data is needed for the analysis? Describe the detailed steps for collecting and saving the data.

In [None]:
# Your answer here (no code for this question; write your answer to the questions above in as much detail as possible):

'''
Please write your answer here:


Consider a research question about the impact of remote work on employee productivity:

**Research Question**:
*"What is the relationship between remote work arrangements and employee productivity, and are there specific factors that significantly influence this relationship?"*

**Data Needed**:
To answer this question, we would need the following data:

1. **Employee Data**: Information about individual employees, including their roles, job functions, work experience, and demographics (e.g., age, gender).

2. **Work Arrangement Data**: Details about employees' work arrangements, including whether they work remotely, their remote work frequency (e.g., full-time, part-time), and the duration of remote work arrangements.

3. **Productivity Metrics**: Quantitative measures of employee productivity, which could include performance metrics (e.g., project completion rates, sales numbers), work hours, and possibly subjective measures like self-assessed productivity.

4. **Additional Factors**: Data on potential factors that could influence productivity, such as access to technology, quality of remote work setup, team communication tools, and the presence of distractions.

**Data Quantity**:
The number of data points needed depends on the scale and scope of the research. Meaningful results require a sufficiently large and diverse dataset: a sample of several hundred to several thousand employees across various industries and roles would be ideal. Larger samples narrow the uncertainty of the estimates and make smaller effects detectable.
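
The "several hundred to several thousand" range can be made more concrete with a rough power calculation. The sketch below uses the standard Fisher z approximation for the sample size needed to detect a Pearson correlation. The target correlations (0.1 and 0.3), the 5% significance level, and the 80% power are illustrative assumptions, and `required_sample_size` is a hypothetical helper name.

```python
import math
from statistics import NormalDist


def required_sample_size(r, alpha=0.05, power=0.80):
    """Approximate n needed to detect a Pearson correlation of size r (two-sided test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    effect = math.atanh(abs(r))                    # Fisher z-transform of the target correlation
    return math.ceil(((z_alpha + z_power) / effect) ** 2 + 3)


for r in (0.1, 0.3):
    print(f"r = {r}: about {required_sample_size(r)} employees")
# r = 0.1: about 783 employees
# r = 0.3: about 85 employees
```

Under these assumptions, a weak relationship between remote work and productivity needs a sample in the high hundreds, which is consistent with the range given above.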

**Steps for Data Collection**:

1. **Employee Data**:
   - Collect data on employees' roles, job functions, work experience, and demographics from HR records, surveys, or databases.
   - Ensure that all data collection complies with relevant privacy regulations.

2. **Work Arrangement Data**:
   - Determine the criteria for categorizing remote work arrangements (e.g., full-time, part-time) and collect this information from HR records or surveys.
   - Record the duration of remote work arrangements.

3. **Productivity Metrics**:
   - Collect quantitative productivity metrics specific to each employee's job role. Ensure these metrics are objective and can be consistently measured.
   - Gather data on work hours, either through time tracking software or self-reporting.

4. **Additional Factors**:
   - Collect data on factors that could influence productivity, such as the quality of remote work setups, access to technology, and the use of team communication tools.
   - Employee surveys and interviews can provide valuable insights into subjective factors like distractions.

5. **Data Storage**:
   - Store the collected data in a secure and structured format, such as a database or spreadsheet, ensuring that it is appropriately anonymized to protect employee privacy.

6. **Data Analysis**:
   - Analyze the data using statistical methods (e.g., regression analysis, correlation) and machine learning techniques to identify relationships and patterns.
   - Interpret the results and draw conclusions regarding the impact of remote work on productivity and the significant influencing factors.

7. **Ethical Considerations**:
   - Ensure that all data collection and analysis processes adhere to ethical standards and legal regulations, particularly with regard to employee privacy.

By following these steps, you can collect, analyze, and draw meaningful insights from the data to answer the research question regarding the impact of remote work on employee productivity.


'''
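
Step 5 of the answer above asks for the stored data to be anonymized. One possible approach, sketched here, replaces each `Employee_ID` with a keyed hash (HMAC-SHA256) before the records are written to CSV. Strictly speaking this is pseudonymization rather than full anonymization, since anyone holding the key can recompute the mapping. The key value, the sample records, and the helper names `pseudonymize` and `write_anonymized_csv` are hypothetical.

```python
import csv
import hashlib
import hmac
import io


def pseudonymize(employee_id, secret_key):
    """Return a stable keyed hash that stands in for the real employee ID."""
    digest = hmac.new(secret_key, str(employee_id).encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]  # shortened for readability


def write_anonymized_csv(records, secret_key, out_file):
    """Write records to out_file with Employee_ID replaced by its pseudonym."""
    writer = csv.DictWriter(out_file, fieldnames=list(records[0].keys()))
    writer.writeheader()
    for record in records:
        row = dict(record)  # copy so the caller's data is not modified
        row["Employee_ID"] = pseudonymize(row["Employee_ID"], secret_key)
        writer.writerow(row)


# Hypothetical key and records, for illustration only
SECRET_KEY = b"replace-with-a-secret-kept-outside-the-dataset"
records = [
    {"Employee_ID": 1, "Role": "Developer", "Remote_Work": "Yes"},
    {"Employee_ID": 2, "Role": "Analyst", "Remote_Work": "No"},
]

buffer = io.StringIO()  # stands in for a real file opened with open(..., "w", newline="")
write_anonymized_csv(records, SECRET_KEY, buffer)
print(buffer.getvalue())
```

Because the hash is keyed and deterministic, the same employee always maps to the same pseudonym, so records from different sources (HR data, surveys) can still be joined without exposing the original ID.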

Question 2 (10 points): Write Python code to collect 1000 data samples you discussed above.

In [2]:
# Your code here (Please add comments in the code):

import pandas as pd
import numpy as np

# Seeded generator so the synthetic dataset is reproducible across runs
rng = np.random.default_rng(42)
n = 1000  # number of synthetic employees to generate

# Draw each column in one vectorized call and build the DataFrame once;
# growing a DataFrame row by row with data.loc[i] is much slower
data = pd.DataFrame({
    'Employee_ID': np.arange(1, n + 1),
    'Role': rng.choice(['Manager', 'Developer', 'Designer', 'Analyst'], size=n, p=[0.2, 0.4, 0.2, 0.2]),
    'Job_Function': rng.choice(['Technical', 'Administrative', 'Sales', 'Customer Support'], size=n, p=[0.4, 0.2, 0.2, 0.2]),
    'Work_Experience': rng.integers(0, 30, size=n),       # 0-29 (upper bound is exclusive)
    'Age': rng.integers(22, 65, size=n),                  # 22-64
    'Gender': rng.choice(['Male', 'Female'], size=n, p=[0.4, 0.6]),
    'Remote_Work': rng.choice(['Yes', 'No'], size=n, p=[0.6, 0.4]),
    'Remote_Work_Frequency': rng.choice(['Full-Time', 'Part-Time'], size=n, p=[0.7, 0.3]),
    'Remote_Work_Duration': rng.integers(0, 12, size=n),  # 0-11
    'Productivity_Score': rng.integers(1, 10, size=n),    # 1-9
    'Work_Hours': rng.uniform(20, 60, size=n),            # uniform in [20, 60)
    'Technology_Access': rng.choice(['High', 'Medium', 'Low'], size=n, p=[0.3, 0.4, 0.3]),
    'Remote_Setup_Quality': rng.choice(['Excellent', 'Good', 'Fair'], size=n, p=[0.4, 0.4, 0.2]),
    'Team_Communication': rng.choice(['High', 'Medium', 'Low'], size=n, p=[0.3, 0.4, 0.3]),
    'Distractions': rng.choice(['High', 'Medium', 'Low'], size=n, p=[0.3, 0.4, 0.3])
})

# Save the synthetic data to a CSV file
data.to_csv('synthetic_employee_data.csv', index=False)

print("Synthetic data for 1000 employees generated and saved.")




Synthetic data for 1000 employees generated and saved.
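
Step 6 of the plan in Question 1 mentions correlation analysis. The sketch below computes a Pearson correlation between a remote-work indicator and a productivity score in plain Python. The two columns are redrawn here with the same probabilities and ranges as the generator above so that the block is self-contained; `pearson_r` is a hypothetical helper, and the seed is arbitrary. Since every synthetic column is drawn independently, the correlation should land close to zero, which is a reminder that this dataset can exercise the pipeline but cannot answer the research question.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(0)  # arbitrary seed, for repeatability only

# 1 = works remotely (probability 0.6), 0 = does not, mirroring Remote_Work
remote = [1 if rng.random() < 0.6 else 0 for _ in range(1000)]
# Integer score from 1 to 9, mirroring Productivity_Score
productivity = [rng.randint(1, 9) for _ in range(1000)]

r = pearson_r(remote, productivity)
print(f"Correlation between remote work and productivity: {r:.3f}")
```

With real data, the same function could be applied to the columns loaded from the CSV file, and a regression with control variables would be the natural next step.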


Question 3 (10 points): Write Python code to collect 1000 articles from Google Scholar (https://scholar.google.com/), Microsoft Academic (https://academic.microsoft.com/home), CiteSeerX (https://citeseerx.ist.psu.edu/index), Semantic Scholar (https://www.semanticscholar.org/), or the ACM Digital Library (https://dl.acm.org/) using the keyword "information retrieval". The articles should have been published in the last 10 years (2013-2023).

The following information needs to be collected for each article:

(1) Title

(2) Venue (journal or conference) where the article was published

(3) Year

(4) Authors

(5) Abstract

In [None]:
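# Your code here (Please add comments in the code):

import re
import time

import pandas as pd
import requests
from bs4 import BeautifulSoup


def parse_byline(byline):
    """Split a Scholar byline of the form 'authors - venue, year - publisher'."""
    parts = [part.strip() for part in byline.replace("\xa0", " ").split(" - ")]
    authors = parts[0]
    middle = parts[1] if len(parts) > 1 else ""
    # The year is the last four-digit number in the middle segment
    years_found = re.findall(r"\b(?:19|20)\d{2}\b", middle)
    year = years_found[-1] if years_found else ""
    # The venue is whatever precedes the trailing year
    venue = re.sub(r",?\s*(?:19|20)\d{2}\s*$", "", middle).strip()
    return authors, venue, year


# Function to scrape article data from Google Scholar
def scrape_google_scholar(query, num_articles=1000, years=(2013, 2023), max_pages=100):
    base_url = "https://scholar.google.com/scholar"

    # Replace the placeholder with a real browser User-Agent string
    headers = {
        "User-Agent": "Your User-Agent String"
    }

    articles = []

    # Scholar lists 10 results per page; max_pages bounds the loop so it always terminates
    for page in range(max_pages):
        params = {
            "q": query,
            "hl": "en",
            "as_sdt": "0,5",
            "as_ylo": years[0],
            "as_yhi": years[1],
            "start": page * 10
        }

        response = requests.get(base_url, params=params, headers=headers, timeout=30)
        if response.status_code != 200:
            print(f"Stopping: HTTP {response.status_code} on page {page}")
            break

        soup = BeautifulSoup(response.text, "html.parser")
        results = soup.find_all("div", {"class": "gs_ri"})
        if not results:
            # Either the listing is exhausted or the request was blocked
            # (for example by a CAPTCHA page); stop instead of looping forever
            print(f"Stopping: no results on page {page}")
            break

        for result in results:
            title_tag = result.find("h3", {"class": "gs_rt"})
            byline_tag = result.find("div", {"class": "gs_a"})
            snippet_tag = result.find("div", {"class": "gs_rs"})
            if title_tag is None or byline_tag is None:
                continue  # skip malformed entries

            authors, venue, year = parse_byline(byline_tag.text)
            articles.append({
                "Title": title_tag.text.strip(),
                "Venue": venue,
                "Year": year,
                "Authors": authors,
                # Scholar shows only a truncated snippet, not the full abstract
                "Abstract": snippet_tag.text.strip() if snippet_tag else ""
            })

            if len(articles) >= num_articles:
                return articles

        time.sleep(1)  # Add a delay to avoid overloading the server

    return articles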

# Scrape 1000 articles on "information retrieval" from Google Scholar
articles_data = scrape_google_scholar("information retrieval", num_articles=1000, years=(2013, 2023))

# Create a DataFrame to store the collected data
articles_df = pd.DataFrame(articles_data)

# Save the data to a CSV file
articles_df.to_csv('information_retrieval_articles.csv', index=False)

print("Collected data for {} articles and saved to 'information_retrieval_articles.csv'.".format(len(articles_df)))
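
Semantic Scholar, one of the sources the question allows, offers a JSON search API that returns full abstracts and structured author lists, so no HTML parsing is needed. The sketch below is a minimal alternative under stated assumptions: the endpoint URL and the `query`, `year`, `fields`, `offset`, and `limit` parameters are written as I recall them from the Semantic Scholar Academic Graph API and should be checked against its documentation. The helper names (`flatten_paper`, `fetch_page`, `collect_papers`) are hypothetical, and errors such as rate limiting are not handled.

```python
import json
import time
import urllib.parse
import urllib.request

API_URL = "https://api.semanticscholar.org/graph/v1/paper/search"
FIELDS = "title,venue,year,authors,abstract"


def flatten_paper(paper):
    """Map one API record to the five fields the exercise asks for."""
    authors = ", ".join(a.get("name", "") for a in paper.get("authors") or [])
    return {
        "Title": paper.get("title") or "",
        "Venue": paper.get("venue") or "",
        "Year": paper.get("year"),
        "Authors": authors,
        "Abstract": paper.get("abstract") or "",
    }


def fetch_page(query, offset, limit=100, years="2013-2023"):
    """Request one page of search results and return the decoded JSON."""
    params = urllib.parse.urlencode({
        "query": query,
        "year": years,
        "fields": FIELDS,
        "offset": offset,
        "limit": limit,
    })
    with urllib.request.urlopen(f"{API_URL}?{params}", timeout=30) as response:
        return json.load(response)


def collect_papers(query, target=1000, fetch=fetch_page, pause=1.0):
    """Page through results until target papers are collected or none remain."""
    papers = []
    offset = 0
    while len(papers) < target:
        batch = fetch(query, offset).get("data") or []
        if not batch:
            break  # no more results
        papers.extend(flatten_paper(p) for p in batch)
        offset += len(batch)
        time.sleep(pause)  # stay well under the rate limit
    return papers[:target]
```

Calling `collect_papers("information retrieval")` and passing the result to `pd.DataFrame(...)` would produce the same five columns as the scraper above. The `fetch` argument exists so the paging logic can be tested with a stub instead of live requests.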



Complete one of the two Question 4 tasks given below.

Question 4 (10 points): Write Python code to collect 1000 posts from Twitter, Facebook, or Instagram. You can use hashtags, keywords, user_name, user_id, or other information to collect the data.

The following information needs to be collected:

(1) User_name

(2) Posted time

(3) Text

In [None]:
# Your code here (Please add comments in the code):
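
One way to approach the Twitter option is the v2 recent-search endpoint, which returns posts together with the author records needed for the user name. The sketch below is only an outline under stated assumptions: the endpoint URL, the `tweet.fields`, `expansions`, `user.fields`, and `next_token` parameters, and the response layout (`data`, `includes.users`, `meta.next_token`) are written as I recall them from the Twitter API v2 documentation. It needs a bearer token with search access, and recent search covers only roughly the last seven days. The helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

SEARCH_URL = "https://api.twitter.com/2/tweets/search/recent"


def flatten_tweets(payload):
    """Turn one API response into rows with User_name, Posted_time, and Text."""
    users = {u["id"]: u["username"] for u in payload.get("includes", {}).get("users", [])}
    rows = []
    for tweet in payload.get("data", []):
        rows.append({
            "User_name": users.get(tweet.get("author_id"), ""),
            "Posted_time": tweet.get("created_at", ""),
            "Text": tweet.get("text", ""),
        })
    return rows


def fetch_tweets(query, bearer_token, next_token=None):
    """Request one page (up to 100 posts) matching the query."""
    params = {
        "query": query,
        "max_results": 100,
        "tweet.fields": "created_at",
        "expansions": "author_id",
        "user.fields": "username",
    }
    if next_token:
        params["next_token"] = next_token
    request = urllib.request.Request(
        f"{SEARCH_URL}?{urllib.parse.urlencode(params)}",
        headers={"Authorization": f"Bearer {bearer_token}"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def collect_posts(query, bearer_token, target=1000, fetch=fetch_tweets):
    """Follow pagination tokens until target posts are collected or pages run out."""
    rows, next_token = [], None
    while len(rows) < target:
        payload = fetch(query, bearer_token, next_token)
        rows.extend(flatten_tweets(payload))
        next_token = payload.get("meta", {}).get("next_token")
        if not next_token:
            break  # last page reached
    return rows[:target]
```

The rows returned by `collect_posts` can be passed to `pd.DataFrame(...)` and saved with `to_csv`, as in the earlier questions.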




Question 4 (10 points):

In this task, you are required to identify and use online tools that scrape data from websites without any coding, with a specific focus on ParseHub. The objective is to gather data and save it as CSV, Excel, or any other suitable file format.

Your write-up must include an introduction to whichever tool you choose, the steps you followed for web scraping, and the final output of the collected data.

Upload the document (Word or PDF file) to the same repository and add its link in the .ipynb file.

In [None]:
# Upload the link to the document here