<a href="https://colab.research.google.com/github/SwathiNagilla/Swathi_INFO5731_FALL2024/blob/main/Nagilla_Swathi_Exercise_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 In-class Exercise 2**

The purpose of this exercise is to understand users' information needs, and then collect data from different sources for analysis by implementing web scraping using Python.

**Expectations**:
*   Students are expected to complete the exercise during lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission. , and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)
Describe an interesting research question (or practical question or something innovative) you have in mind, what kind of data should be collected to answer the question(s)? Specify the amount of data needed for analysis. Provide detailed steps for collecting and saving the data.

In [None]:
'''
Research Question: How have user reviews for popular tech products changed in the past year.

Data Required:Product Name
              Review Text
              Sentiment (positive/negative)
              Review Date

For data collection, I'll be collecting reviews of 10 top-rated technical products in the last year.
Each product will have at least 100 user reviews, adding up to approximately 1,000 in total. I will
use web scraping tools to collect this data and try to focus on getting review text and usernames.
I'll then check, after gathering the reviews, whether each review is positive or negative using TextBlob or VADER.
Then, sorted data of product name, review text, sentiment score, and username are put into a CSV file for more analysis.

Data Collection Steps:
- Identify one source of tech product reviews.
- Extract reviews for products using Python and BeautifulSoup.
- Do a sentiment analysis of each review using TextBlob.
- Save the data into a CSV file.

'''

## Question 2 (10 Points)
Write Python code to collect a dataset of 1000 samples related to the question discussed in Question 1.

In [4]:
import pandas as pd
import random
from datetime import datetime, timedelta

# Generate random reviews
def random_reviews(num_reviews):
    products = ["Laptop", "Smartphone", "Headphones", "Smartwatch", "Tablet"]
    sentiments = ["Positive", "Negative"]

    reviews = []
    for _ in range(num_reviews):
        #use random function to generate random product from products list
        product = random.choice(products)
        review_text = f"This is a {'good' if random.choice(sentiments) == 'Positive' else 'bad'} review for {product}."
        #use random function to generate random sentiment review
        sentiment = random.choice(sentiments)
        review_date = datetime.now() - timedelta(days=random.randint(0, 365))

        reviews.append({
            "Product Name": product,
            "Review Text": review_text,
            "Sentiment": sentiment,
            "Review Date": review_date.strftime("%Y-%m-%d")
        })

    return reviews

# Save data
num_reviews = 1000
data = random_reviews(num_reviews)
df = pd.DataFrame(data)
df.to_csv('tech_product_reviews.csv', index=False)

# Print the data containing the first 15 rows
print(df.head(15))


   Product Name                            Review Text Sentiment Review Date
0        Laptop      This is a good review for Laptop.  Positive  2024-06-17
1        Laptop       This is a bad review for Laptop.  Positive  2023-11-03
2        Tablet      This is a good review for Tablet.  Negative  2024-07-19
3        Laptop      This is a good review for Laptop.  Negative  2023-12-01
4    Smartwatch   This is a bad review for Smartwatch.  Positive  2024-07-03
5    Smartwatch   This is a bad review for Smartwatch.  Positive  2023-11-11
6        Laptop      This is a good review for Laptop.  Positive  2024-04-13
7    Smartwatch   This is a bad review for Smartwatch.  Positive  2023-11-15
8    Smartwatch  This is a good review for Smartwatch.  Positive  2024-06-14
9    Headphones   This is a bad review for Headphones.  Positive  2024-02-07
10   Smartwatch  This is a good review for Smartwatch.  Negative  2023-12-04
11   Smartwatch   This is a bad review for Smartwatch.  Positive  2024-05-07

## Question 3 (10 Points)
Write Python code to collect 1000 articles from Google Scholar (https://scholar.google.com/), Microsoft Academic (https://academic.microsoft.com/home), or CiteSeerX (https://citeseerx.ist.psu.edu/index), or Semantic Scholar (https://www.semanticscholar.org/), or ACM Digital Libraries (https://dl.acm.org/) with the keyword "XYZ". The articles should be published in the last 10 years (2014-2024).

The following information from the article needs to be collected:

(1) Title of the article

(2) Venue/journal/conference being published

(3) Year

(4) Authors

(5) Abstract

In [3]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

def google_scholar(query, num_articles):
    #url with keyword XYZ from gooogle scholar
    url = "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C44&q=XYZ&btnG="
    parameter = {
        'hl': 'en',
        'q': query,
        'as_ylo': '2014',
        'as_yhi': '2024'
    }

    response = requests.get(url, params=parameter)
    soup = BeautifulSoup(response.content, 'html.parser')

    output = []

    for entry in soup.find_all('div', class_='gs_ri')[:num_articles]:
        title_tag = entry.find('h3', class_='gs_rt')
        authors_tag = entry.find('div', class_='gs_a')
        abstract_tag = entry.find('div', class_='gs_rs')
        #extract title and if nothing is present, return N/A
        title = title_tag.text if title_tag else "N/A"

        # extract year
        if authors_tag:
            parts = authors_tag.text.split()
            #identifying digit
            year = next((part for part in parts if part.isdigit() and len(part) == 4), "N/A")
            authors = ', '.join(part.strip() for part in authors_tag.text.split('-')[0].split(', ')) if authors_tag else "N/A"
        else:
            year = "N/A"
            authors = "N/A"
        # extract abstract and if nothing is present, return N/A
        abstract = abstract_tag.text if abstract_tag else "N/A"
        #append everything
        output.append({'Title': title, 'Authors': authors, 'Year': year, 'Abstract': abstract})
    #return output list
    return output

query = "natural language processing"
output = google_scholar(query, 1000)

data = pd.DataFrame(output)
data = data[['Title', 'Authors', 'Year', 'Abstract']]

# Save to CSV
data.to_csv('google_scholar_articles.csv', index=False)
print(data.head(15))


                                               Title  \
0                        Natural language processing   
1  [HTML][HTML] Natural language processing: stat...   
2            Advances in natural language processing   
3  A primer on neural network models for natural ...   
4  [PDF][PDF] The Stanford CoreNLP natural langua...   
5  Transformers: State-of-the-art natural languag...   
6  [BOOK][B] Neural network methods in natural la...   
7  Allennlp: A deep semantic natural language pro...   
8  Stanza: A Python natural language processing t...   
9  Jumping NLP curves: A review of natural langua...   

                                    Authors  Year  \
0                KR Chowdhary, KR Chowdhary  2020   
1     D Khurana, A Koli, K Khatter, S Singh  2023   
2                  J Hirschberg, CD Manning  2015   
3                                Y Goldberg  2016   
4          CD Manning, M Surdeanu, J Bauer…  2014   
5      T Wolf, L Debut, V Sanh, J Chaumond…  2020   
6           

## Question 4A (10 Points)
Develop Python code to collect data from social media platforms like Reddit, Instagram, Twitter (formerly known as X), Facebook, or any other. Use hashtags, keywords, usernames, or user IDs to gather the data.



Ensure that the collected data has more than four columns.


In [None]:
'''

'''

## Question 4B (10 Points)
If you encounter challenges with Question-4 web scraping using Python, employ any online tools such as ParseHub or Octoparse for data extraction. Introduce the selected tool, outline the steps for web scraping, and showcase the final output in formats like CSV or Excel.



Upload a document (Word or PDF File) in any shared storage (preferably UNT OneDrive) and add the publicly accessible link in the below code cell.

Please only choose one option for question 4. If you do both options, we will grade only the first one

In [None]:
'''
 I used ParseHub to scrape movie data from IMDb.

 Steps:
 1. Download parse hub desktop app
 2. Created a new project in ParseHub and pasted the IMDb URL.
 3. Extracted movie names and URLs using the ParseHub selection tool.
 4. Employed relative search to extract the release year of each movie.
 5. Movie ratings extracted with relative search.
 6. Carried out the data scraping to collect the needed information.
 7. Saved the extracted data in CSV format.
 8. The CSV file was transformed into PDF format. 8. PDF document was uploaded to UNT OneDrive.

 Link-- https://myunt-my.sharepoint.com/:b:/g/personal/swathinagilla_my_unt_edu/EcwnPnvwhh5Iho6e7a_F4X4BRFPyoud02G68szf_UveRWw?e=jzLDXh
'''

# Mandatory Question

**Important: Reflective Feedback on Web Scraping and Data Collection**



Please share your thoughts and feedback on the web scraping and data collection exercises you have completed in this assignment. Consider the following points in your response:



Learning Experience: Describe your overall learning experience in working on web scraping tasks. What were the key concepts or techniques you found most beneficial in understanding the process of extracting data from various online sources?



Challenges Encountered: Were there specific difficulties in collecting data from certain websites, and how did you overcome them? If you opted for the non-coding option, share your experience with the chosen tool.



Relevance to Your Field of Study: How might the ability to gather and analyze data from online sources enhance your work or research?

**(no grading of your submission if this question is left unanswered)**

In [None]:
'''
This assignment was relatively challenging and took much time to finish. I felt that web scraping was complex but enlightening.
The most useful parts of the exercise were how to choose data for extraction or scraping and how to adjust to website structures.
The Google Colab demo class file was pretty helpful; it gave the essential guidance and practical examples that supported my understanding
of the web scraping process.

A big challenge I had was with Question 4A, where the extraction process from social media accounts became complex. I am finding the
first attempts really tricky but finally sought an alternative in ParseHub. While using ParseHub introduced its own complexities, I
were able to surmount these by referring to YouTube tutorials, which facilitated my understanding of the tool's features.
'''