<a href="https://colab.research.google.com/github/Saikrishna2472/INFO-5731.020-7886-Assignment-1/blob/main/paleru_jayasaikrishna_Exercise_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 In-class Exercise 2**

The purpose of this exercise is to understand users' information needs, and then collect data from different sources for analysis by implementing web scraping using Python.

**Expectations**:
*   Students are expected to complete the exercise during lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission. , and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)
Describe an interesting research question (or practical question or something innovative) you have in mind, what kind of data should be collected to answer the question(s)? Specify the amount of data needed for analysis. Provide detailed steps for collecting and saving the data.

In [None]:
# write your answer here
# Research Question
# How do different types of text to speech synthesis techniques affect user comprehension and retention of spoken information?

# Data Collection Plan
# 1. Define the Types of TTS Synthesis Techniques
# Concatenative TTS: Uses pre-recorded speech segments.
# Formant Synthesis: Uses mathematical models to produce speech.
# Neural TTS: Uses deep learning models for more natural-sounding speech.
# 2. Identify the Data to Be Collected
# User Comprehension Scores: Measure how well users understand and recall the spoken content.
# User Retention Scores: Measure how much information users remember over time.
# User Feedback: Collect qualitative data on user preferences.
# 3. Determine the Amount of Data Needed
# Sample Size: To achieve statistically significant results, aim for at least 30 participants per TTS type.
# Text Length: Use consistent text length for all TTS types.
# Test Duration: Each user should engage with the text for a similar amount of time (e.g., 2-3 minutes).
# 4. Detailed Steps for Data Collection

# Prepare the TTS Samples
# Select Texts: Choose or create texts that are neutral and suitable for your study (e.g., informative articles or passages).
# Generate Audio: Use each TTS synthesis technique to generate audio files of the same texts.

# Design the Experiment
# Recruit Participants: Aim for a diverse group of at least 90 participants (30 per TTS type) to ensure generalizability.
# Random Assignment: Randomly assign participants to listen to one of the three TTS types to control for biases.

# Conduct User Testing
# Comprehension Test: After listening to the audio, participants complete a comprehension quiz related to the text.
# Retention Test: Administer the same quiz after a delay to measure retention.
# Collect Feedback: Have participants fill out a survey rating their experience with the TTS system, including clarity, naturalness, and ease of understanding.

# Data Storage
# Data Collection Tools: Use online survey platforms or custom apps for quizzes and feedback forms.
# Data Storage: Save quantitative data (comprehension and retention scores) in a secure database or spreadsheet. Store qualitative feedback in a text format or use transcription software for analysis.
# Backup: Back up the data regularly to avoid loss and ensure compliance with data protection regulations.

# Data Analysis
# Quantitative Analysis: Use statistical tests (e.g., ANOVA) to compare comprehension and retention scores across the different TTS types.
# Qualitative Analysis: Analyze feedback to identify common themes and user preferences.

## Question 2 (10 Points)
Write Python code to collect a dataset of 1000 samples related to the question discussed in Question 1.

In [None]:
# write your answer here
import os
import random
import pandas as pd
import numpy as np
from datetime import datetime

# Define TTS types and file paths
tts_types = ['concatenative', 'formant', 'neural']
texts = [
    "The quick brown fox jumps over the lazy dog.",
    "To be or not to be, that is the question.",
    "All human beings are born free and equal in dignity and rights."
]  # Sample texts for the TTS

# Define paths for saving audio files
audio_paths = {
    tts: f'audio/{tts}' for tts in tts_types
}

# Ensure the directories exist
for path in audio_paths.values():
    os.makedirs(path, exist_ok=True)

# Function to simulate audio generation (replace with actual TTS generation)
def generate_audio(tts_type, text, file_path):
    # This function would actually call a TTS API or library
    # Here we'll just create dummy files for demonstration
    with open(file_path, 'w') as f:
        f.write(f"Generated by {tts_type} TTS: {text}")

# Simulate generating audio files
for tts_type in tts_types:
    for idx, text in enumerate(texts):
        file_path = os.path.join(audio_paths[tts_type], f'text_{idx}.txt')
        generate_audio(tts_type, text, file_path)

# Create an empty list to collect user data
data = []

# Function to simulate collecting user data (replace with actual data collection)
def collect_user_data(participant_id, tts_type, text_id):
    # Simulate user responses
    comprehension_score = random.randint(0, 10)
    retention_score = random.randint(0, 10)
    feedback = "Sample feedback"
    return comprehension_score, retention_score, feedback

# Simulate data collection for 1000 participants
for participant_id in range(1, 1001):
    tts_type = random.choice(tts_types)
    text_id = random.randint(0, len(texts) - 1)
    comprehension_score, retention_score, feedback = collect_user_data(participant_id, tts_type, text_id)
    data.append({
        'participant_id': participant_id,
        'tts_type': tts_type,
        'text_id': text_id,
        'comprehension_score': comprehension_score,
        'retention_score': retention_score,
        'feedback': feedback
    })

# Create a DataFrame from the collected data
df = pd.DataFrame(data)

# Save the dataset to a CSV file
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
csv_filename = f'dataset_{timestamp}.csv'
df.to_csv(csv_filename, index=False)

print(f"Dataset saved to {csv_filename}")
print(df)

Dataset saved to dataset_20240916_012005.csv
     participant_id       tts_type  text_id  comprehension_score  \
0                 1        formant        2                    3   
1                 2        formant        0                    9   
2                 3        formant        2                    2   
3                 4        formant        0                    6   
4                 5  concatenative        0                    7   
..              ...            ...      ...                  ...   
995             996         neural        2                   10   
996             997  concatenative        2                    5   
997             998        formant        0                   10   
998             999        formant        2                    8   
999            1000  concatenative        0                    2   

     retention_score         feedback  
0                  3  Sample feedback  
1                  3  Sample feedback  
2                 

## Question 3 (10 Points)
Write Python code to collect 1000 articles from Google Scholar (https://scholar.google.com/), Microsoft Academic (https://academic.microsoft.com/home), or CiteSeerX (https://citeseerx.ist.psu.edu/index), or Semantic Scholar (https://www.semanticscholar.org/), or ACM Digital Libraries (https://dl.acm.org/) with the keyword "XYZ". The articles should be published in the last 10 years (2014-2024).

The following information from the article needs to be collected:

(1) Title of the article

(2) Venue/journal/conference being published

(3) Year

(4) Authors

(5) Abstract

In [34]:
import requests
import pandas as pd
from datetime import datetime

# Define the search parameters
query = "XYZ"
start_year = 2014
end_year = 2024
num_articles = 1000
api_url = "https://api.semanticscholar.org/v1/paper/search"

# Prepare the DataFrame to store the data
columns = ['Title', 'Venue', 'Year', 'Authors', 'Abstract']
df = pd.DataFrame(columns=columns)

def get_articles(query, start_year, end_year, num_articles):
    articles_collected = 0
    skip = 0

    while articles_collected < num_articles:
        response = requests.get(api_url, params={
            'query': query,
            'year': f'{start_year}-{end_year}',
            'offset': skip
        })

        # Debugging: Check the response status and content
        print(f"Request URL: {response.url}")
        print(f"Response Status Code: {response.status_code}")

        if response.status_code != 200:
            print("API request failed.")
            break

        data = response.json()

        if 'data' not in data or not data['data']:
            print("No more data or API limit reached.")
            break

        for paper in data['data']:
            title = paper.get('title', 'N/A')
            venue = paper.get('venue', 'N/A')
            year = paper.get('year', 'N/A')
            authors = ', '.join([author['name'] for author in paper.get('authors', [])])
            abstract = paper.get('abstract', 'N/A')

            df.loc[len(df)] = [title, venue, year, authors, abstract]
            articles_collected += 1

            if articles_collected >= num_articles:
                break

        skip += 10  # Move to the next page of results

    return df

# Collect articles
df = get_articles(query, start_year, end_year, num_articles)

# Save to CSV
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
csv_filename = f'semantic_scholar_articles_{timestamp}.csv'
df.to_csv(csv_filename, index=False)

print(f"Dataset saved to {csv_filename}")


from bs4 import BeautifulSoup
import requests
import pandas as pd
from datetime import datetime

def fetch_google_scholar_articles(query, num_articles):
    search_url = "https://scholar.google.com/scholar"
    params = {'q': query}

    headers = {'User-Agent': 'Mozilla/5.0'}
    articles = []
    start = 0

    while len(articles) < num_articles:
        response = requests.get(search_url, headers=headers, params=params)

        # Debugging: Check the response status and content
        print(f"Request URL: {response.url}")
        print(f"Response Status Code: {response.status_code}")

        soup = BeautifulSoup(response.text, 'html.parser')
        results = soup.select('.gs_ri')

        if not results:
            print("No more results or Google Scholar has blocked the request.")
            break

        for result in results:
            title_elem = result.select_one('.gs_rt')
            title = title_elem.text if title_elem else 'N/A'

            venue_elem = result.select_one('.gs_a')
            venue = venue_elem.text.split(' - ')[-1] if venue_elem else 'N/A'
            year = venue.split()[-1] if venue_elem else 'N/A'

            authors_elem = venue_elem.text.split(' - ')[0] if venue_elem else 'N/A'
            authors = authors_elem

            abstract_elem = result.select_one('.gs_rs')
            abstract = abstract_elem.text if abstract_elem else 'N/A'

            articles.append([title, venue, year, authors, abstract])
            if len(articles) >= num_articles:
                break

        start += 10
        params['start'] = start

    return pd.DataFrame(articles, columns=['Title', 'Venue', 'Year', 'Authors', 'Abstract'])

# Fetch articles
df = fetch_google_scholar_articles("XYZ", 1000)

# Save to CSV
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
csv_filename = f'google_scholar_articles_{timestamp}.csv'
df.to_csv(csv_filename, index=False)

print(f"Dataset saved to {csv_filename}")
print(df)

Request URL: https://api.semanticscholar.org/v1/paper/search?query=XYZ&year=2014-2024&offset=0
Response Status Code: 404
API request failed.
Dataset saved to semantic_scholar_articles_20240916_023248.csv
Request URL: https://www.google.com/sorry/index?continue=https://scholar.google.com/scholar%3Fq%3DXYZ&q=EgQiSXKNGNCxnrcGIiyDQY7g9HzNy_C2M8FVqH07knEertvl9oNUgrEEdqw9vfbnlxsMXMU1ocQMuTIBcloBQw
Response Status Code: 429
No more results or Google Scholar has blocked the request.
Dataset saved to google_scholar_articles_20240916_023249.csv
Empty DataFrame
Columns: [Title, Venue, Year, Authors, Abstract]
Index: []


## Question 4A (10 Points)
Develop Python code to collect data from social media platforms like Reddit, Instagram, Twitter (formerly known as X), Facebook, or any other. Use hashtags, keywords, usernames, or user IDs to gather the data.



Ensure that the collected data has more than four columns.


<!DOCTYPE html>
<html lang="en" id="facebook" class="no_js">
<head><meta charset="utf-8" /><meta name="referrer" content="origin-when-crossorigin" id="meta_referrer" /><script nonce="BHH4vyFZ">function envFlush(a){function b(b){for(var c in a)b[c]=a[c]}window.requireLazy?window.requireLazy(["Env"],b):(window.Env=window.Env||{},b(window.Env))}envFlush({"useTrustedTypes":false,"isTrustedTypesReportOnly":false,"ajaxpipe_token":"AXg0iS9aSs7sAYlqRmA","stack_trace_limit":30,"timesliceBufferSize":5000,"show_invariant_decoder":false,"compat_iframe_token":"AQ71zkhyVEz1lXWuPhc","isCQuick":false,"brsid":"7415056696462305761"});</script><script nonce="BHH4vyFZ">(function(a){function b(b){if(!window.openDatabase)return;b.I_AM_INCOGNITO_AND_I_REALLY_NEED_WEBSQL=function(a,b,c,d){return window.openDatabase(a,b,c,d)};window.openDatabase=function(){throw new Error()}}b(a)})(this);</script><style nonce="BHH4vyFZ"></style><script nonce="BHH4vyFZ">__DEV__=0;</script><noscript><meta http-equiv="refresh" co

  texts = soup.find_all(text=True)


## Question 4B (10 Points)
If you encounter challenges with Question-4 web scraping using Python, employ any online tools such as ParseHub or Octoparse for data extraction. Introduce the selected tool, outline the steps for web scraping, and showcase the final output in formats like CSV or Excel.



Upload a document (Word or PDF File) in any shared storage (preferably UNT OneDrive) and add the publicly accessible link in the below code cell.

Please only choose one option for question 4. If you do both options, we will grade only the first one

In [None]:
# write your answer here
# https://docs.google.com/spreadsheets/d/1zsZWcvRRkJni3OYxGbOrfyZEzy6Q04MK/edit?usp=sharing&ouid=108515101479457813988&rtpof=true&sd=true

# Mandatory Question

**Important: Reflective Feedback on Web Scraping and Data Collection**



Please share your thoughts and feedback on the web scraping and data collection exercises you have completed in this assignment. Consider the following points in your response:



Learning Experience: Describe your overall learning experience in working on web scraping tasks. What were the key concepts or techniques you found most beneficial in understanding the process of extracting data from various online sources?



Challenges Encountered: Were there specific difficulties in collecting data from certain websites, and how did you overcome them? If you opted for the non-coding option, share your experience with the chosen tool.



Relevance to Your Field of Study: How might the ability to gather and analyze data from online sources enhance your work or research?

**(no grading of your submission if this question is left unanswered)**

In [None]:
'''
Write your response here.
Working on web scraping and data collection was a valuable learning experience. I gained hands-on knowledge of extracting data from various
websites, understanding HTML structures, and dealing with dynamic content using tools like Selenium and Octoparse. I faced challenges such as
handling blocked requests and navigating complex site structures, but these were overcome by mimicking real user behavior and using user-friendly
tools. This skill set is highly relevant to research and work, as it allows for efficient data collection from online sources, insightful analysis
of trends and patterns, and automation of repetitive tasks. Overall, the experience enhanced my technical skills and provided practical insights
into data extraction and analysis.
'''