<a href="https://colab.research.google.com/github/180030814-GnaneshwarReddy/GnaneswaraReddy_INFO5731_Fall2024/blob/main/Palem_Gnaneswara_Exercise_02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 In-class Exercise 2**

The purpose of this exercise is to understand users' information needs, and then collect data from different sources for analysis by implementing web scraping using Python.

**Expectations**:
*   Students are expected to complete the exercise during lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will incur a penalty of 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)
Describe an interesting research question (or practical question or something innovative) you have in mind. What kind of data should be collected to answer the question(s)? Specify the amount of data needed for analysis. Provide detailed steps for collecting and saving the data.

In [None]:
# write your answer here
# Question: How does tire pressure affect the fuel efficiency and handling performance of a vehicle under various driving conditions?
# The data that needs to be collected to answer the question:
# Tire Pressure Levels: Underinflated, Recommended, Overinflated
# Driving Conditions: City, Highway, Mixed
# Fuel Efficiency: liters per 100 km
# Braking Distance: meters
# Cornering Speed: km/h

# The amount of data needed
# Tire pressure levels: at least 3 different levels (underinflated, recommended, overinflated)
# Driving conditions: at least 3 different conditions (city, highway, mixed)
# Fuel efficiency: data over 100 km per trial
# Handling performance: 10 to 15 trials of braking distance and cornering speed

# Data Collection steps:
# 1. Vehicle and Equipment: Use a standard vehicle with sensors for fuel efficiency and handling metrics.
# 2. Adjust Tire Pressure: Set tire pressures to desired levels, record using a gauge.
# 3. Fuel Efficiency Tests: Drive under controlled conditions, record data using OBD-II or manual methods.
# 4. Handling Tests: Perform braking and cornering tests, measure with GPS and accelerometers.
# 5. Document Conditions: Record weather, road surface, and traffic for consistency.
# 6. Save Data: Organize data in CSV or Excel, backed up in a database or cloud storage.
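
The collection steps above imply a full grid of tire pressure levels and driving conditions, each repeated for several trials. A minimal sketch of how such a trial log could be laid out before any driving starts is shown below. The value of 12 trials per combination is an assumption picked from the 10 to 15 range stated above, and the `Trial` and `Notes` columns and the file name `trial_log_template.csv` are hypothetical.

```python
import csv
import itertools

# Factor levels taken from the answer above.
TIRE_PRESSURES = ["Underinflated", "Recommended", "Overinflated"]
DRIVING_CONDITIONS = ["City", "Highway", "Mixed"]
TRIALS_PER_CELL = 12  # assumption: any value in the 10-15 range would do


def build_trial_schedule(trials_per_cell=TRIALS_PER_CELL):
    """Return one row per planned trial, with empty measurement fields."""
    schedule = []
    for pressure, condition in itertools.product(TIRE_PRESSURES, DRIVING_CONDITIONS):
        for trial in range(1, trials_per_cell + 1):
            schedule.append({
                "Tire_Pressure": pressure,
                "Driving_Condition": condition,
                "Trial": trial,
                "Fuel_Efficiency_L_100km": "",
                "Braking_Distance_m": "",
                "Cornering_Speed_kmh": "",
                "Notes": "",  # hypothetical column for weather, road surface, traffic
            })
    return schedule


def save_schedule(schedule, path="trial_log_template.csv"):
    """Write the empty trial log to a CSV file (hypothetical file name)."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(schedule[0].keys()))
        writer.writeheader()
        writer.writerows(schedule)


schedule = build_trial_schedule()
print(len(schedule))  # 3 pressures x 3 conditions x 12 trials
```

Calling `save_schedule(schedule)` would write the template to disk, and the measured values can then be filled in row by row as each trial is completed.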

## Question 2 (10 Points)
Write Python code to collect a dataset of 1000 samples related to the question discussed in Question 1.

In [None]:
# write your answer here
import pandas as pd
import numpy as np

np.random.seed(42)
samples = 1000
tire_pressures = np.random.choice(['Underinflated', 'Recommended', 'Overinflated'], size=samples, p=[0.3, 0.4, 0.3])
driving_conditions = np.random.choice(['City', 'Highway', 'Mixed'], size=samples, p=[0.4, 0.3, 0.3])
fuel_efficiency = np.random.normal(
    loc=[9 if tp == 'Underinflated' else 7 if tp == 'Recommended' else 8 for tp in tire_pressures],
    scale=0.5,
    size=samples
)

braking_distance = np.random.normal(
    loc=[40 if tp == 'Underinflated' else 35 if tp == 'Recommended' else 37 for tp in tire_pressures],
    scale=2,
    size=samples
)

cornering_speed = np.random.normal(
    loc=[50 if tp == 'Underinflated' else 55 if tp == 'Recommended' else 52 for tp in tire_pressures],
    scale=3,
    size=samples
)

data = pd.DataFrame({
    'Tire_Pressure': tire_pressures,
    'Driving_Condition': driving_conditions,
    'Fuel_Efficiency_L_100km': fuel_efficiency,
    'Braking_Distance_m': braking_distance,
    'Cornering_Speed_kmh': cornering_speed
})


data.to_csv('vehicle_performance_dataset.csv', index=False)
data.head()


Unnamed: 0,Tire_Pressure,Driving_Condition,Fuel_Efficiency_L_100km,Braking_Distance_m,Cornering_Speed_kmh
0,Recommended,City,6.561009,38.74193,56.04013
1,Overinflated,Highway,7.58656,37.779228,59.53467
2,Overinflated,Mixed,7.886761,35.263415,46.479767
3,Recommended,Mixed,7.183683,36.069258,54.903158
4,Underinflated,Mixed,9.456792,34.728505,51.921629
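
One quick way to check that the simulated values behave as intended is to average a metric per tire pressure level. The sketch below does this with the standard library only, using the five rows shown in the output above. The helper name `mean_by_group` is hypothetical, and five rows are far too few to draw conclusions from; with the full DataFrame, `data.groupby('Tire_Pressure')['Fuel_Efficiency_L_100km'].mean()` should give the equivalent result.

```python
from collections import defaultdict
from statistics import mean

# First five rows of the simulated dataset, copied from the output above.
rows = [
    {"Tire_Pressure": "Recommended", "Fuel_Efficiency_L_100km": 6.561009},
    {"Tire_Pressure": "Overinflated", "Fuel_Efficiency_L_100km": 7.58656},
    {"Tire_Pressure": "Overinflated", "Fuel_Efficiency_L_100km": 7.886761},
    {"Tire_Pressure": "Recommended", "Fuel_Efficiency_L_100km": 7.183683},
    {"Tire_Pressure": "Underinflated", "Fuel_Efficiency_L_100km": 9.456792},
]


def mean_by_group(records, group_key, value_key):
    """Average `value_key` separately for each distinct value of `group_key`."""
    groups = defaultdict(list)
    for record in records:
        groups[record[group_key]].append(record[value_key])
    return {name: mean(values) for name, values in groups.items()}


fuel_means = mean_by_group(rows, "Tire_Pressure", "Fuel_Efficiency_L_100km")
for level, value in sorted(fuel_means.items()):
    print(f"{level}: {value:.2f} L/100 km")
```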


## Question 3 (10 Points)
Write Python code to collect 1000 articles from Google Scholar (https://scholar.google.com/), Microsoft Academic (https://academic.microsoft.com/home), or CiteSeerX (https://citeseerx.ist.psu.edu/index), or Semantic Scholar (https://www.semanticscholar.org/), or ACM Digital Libraries (https://dl.acm.org/) with the keyword "XYZ". The articles should be published in the last 10 years (2014-2024).

The following information from the article needs to be collected:

(1) Title of the article

(2) Venue/journal/conference where the article was published

(3) Year

(4) Authors

(5) Abstract

In [54]:
from selenium import webdriver
from scholarly import scholarly, ProxyGenerator
import pandas as pd
import time
import random

options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

def collect_articles(keyword, num_articles=1000, start_year=2014, end_year=2024):
    pg = ProxyGenerator()
    pg.FreeProxies()
    scholarly.use_proxy(pg)

    articles = []
    search_query = scholarly.search_pubs(keyword)

    for i, article in enumerate(search_query):
        try:
            if len(articles) >= num_articles:
                break

            scholarly.fill(article)
            pub_year = article.get('bib', {}).get('pub_year')

            if pub_year and start_year <= int(pub_year) <= end_year:
                title = article.get('bib', {}).get('title', 'N/A')
                venue = article.get('bib', {}).get('venue', 'N/A')
                year = pub_year
                authors = article.get('bib', {}).get('author', 'N/A')
                abstract = article.get('bib', {}).get('abstract', 'N/A')

                articles.append({
                    'Title': title,
                    'Venue': venue,
                    'Year': year,
                    'Authors': authors,
                    'Abstract': abstract
                })

            time.sleep(random.uniform(5, 20))

        except Exception as e:
            print(f"Error processing article {i}: {e}")

    df = pd.DataFrame(articles)
    df.to_csv('collected_articles.csv', index=False)
    print(f"Collected {len(articles)} articles.")
    return df

df = collect_articles(keyword="XYZ")
print(df.head())

INFO:scholarly:Proxy works! IP address: 34.105.10.49
INFO:scholarly:Proxy works! IP address: 34.105.10.49
INFO:scholarly:Proxy works! IP address: 34.105.10.49
INFO:scholarly:Proxy works! IP address: 34.105.10.49
INFO:scholarly:Getting https://scholar.google.com/scholar?hl=en&q=XYZ&as_vis=0&as_sdt=0,33
INFO:httpx:HTTP Request: GET https://scholar.google.com/scholar?hl=en&q=XYZ&as_vis=0&as_sdt=0,33 "HTTP/1.1 200 OK"
INFO:scholarly:Timeout Exception ReadTimeout while fetching page: ('The read operation timed out',)
INFO:scholarly:Increasing timeout and retrying within same session.
INFO:scholarly:Timeout Exception ConnectTimeout while fetching page: ('_ssl.c:990: The handshake operation timed out',)
INFO:scholarly:Increasing timeout and retrying within same session.
INFO:httpx:HTTP Request: GET https://scholar.google.com/scholar?hl=en&q=XYZ&as_vis=0&as_sdt=0,33 "HTTP/1.1 200 OK"
INFO:scholarly:Exception RemoteProtocolError while fetching page: ('peer closed connection without sending comp

MaxTriesExceededException: Cannot Fetch from Google Scholar.
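
The run above failed because Google Scholar stopped responding through the free proxies. Semantic Scholar, one of the sources the question allows, offers a public search API, so a possible alternative is sketched below. This is not the approach used in the cell above. The endpoint URL, the query parameters (`query`, `offset`, `limit`, `year`, `fields`), and the `data` and `next` response keys reflect my understanding of the Semantic Scholar Graph API and should be checked against its documentation. Rate limits and the maximum number of results per query are not handled here. The `fetch` parameter is a hypothetical hook so the paging logic can be exercised without network access.

```python
import json
import time
import urllib.parse
import urllib.request

# Assumed endpoint and field names; verify against the API documentation.
API_URL = "https://api.semanticscholar.org/graph/v1/paper/search"
FIELDS = "title,venue,year,authors,abstract"


def fetch_page(keyword, offset, limit, year_range):
    """Fetch one page of search results and return the parsed JSON dict."""
    params = urllib.parse.urlencode({
        "query": keyword,
        "offset": offset,
        "limit": limit,
        "year": year_range,
        "fields": FIELDS,
    })
    with urllib.request.urlopen(f"{API_URL}?{params}", timeout=30) as response:
        return json.load(response)


def collect_papers(keyword, target=1000, page_size=100, year_range="2014-2024",
                   fetch=fetch_page, pause_seconds=3.0):
    """Page through the search results until `target` papers are collected."""
    papers = []
    offset = 0
    while len(papers) < target:
        limit = min(page_size, target - len(papers))
        page = fetch(keyword, offset, limit, year_range)
        batch = page.get("data") or []
        if not batch:
            break  # no more results
        for paper in batch:
            authors = paper.get("authors") or []
            papers.append({
                "Title": paper.get("title") or "N/A",
                "Venue": paper.get("venue") or "N/A",
                "Year": paper.get("year"),
                "Authors": ", ".join(a.get("name", "") for a in authors),
                "Abstract": paper.get("abstract") or "N/A",
            })
        if "next" not in page:
            break  # the API reported no further page
        offset = page["next"]
        time.sleep(pause_seconds)  # be polite between requests
    return papers[:target]
```

A call such as `collect_papers("XYZ")` would return a list of dictionaries with the five required fields, which `pd.DataFrame(...)` can turn into a table and save with `to_csv`.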

## Question 4A (10 Points)
Develop Python code to collect data from social media platforms like Reddit, Instagram, X (formerly known as Twitter), Facebook, or any other. Use hashtags, keywords, usernames, or user IDs to gather the data.



Ensure that the collected data has more than four columns.


In [None]:
!pip install ntscraper

Collecting ntscraper
  Downloading ntscraper-0.3.17-py3-none-any.whl.metadata (7.4 kB)
Downloading ntscraper-0.3.17-py3-none-any.whl (12 kB)
Installing collected packages: ntscraper
Successfully installed ntscraper-0.3.17


In [None]:
import pandas as pd
from ntscraper import Nitter

nt = Nitter(0)
profile_info = nt.get_profile_info('ImRo45')  # profile metadata (not used below)


def get_tweets(name, modes, no):
    """Fetch `no` tweets for `name` and return them as a DataFrame."""
    result = nt.get_tweets(name, mode=modes, number=no)
    rows = []
    for tweet in result['tweets']:
        # Keep five fields per tweet so the table has more than four columns.
        rows.append([tweet['date'], tweet['text'], tweet['stats']['likes'],
                     tweet['stats']['comments'], tweet['link']])
    return pd.DataFrame(rows, columns=['Date', 'Text', 'Likes', 'Comments', 'Twitter Link'])


df1 = get_tweets('ImRo45', 'user', 10)
df1.head()


Testing instances: 100%|██████████| 16/16 [00:12<00:00,  1.26it/s]
INFO:root:No instance specified, using random instance https://nitter.lucabased.xyz
INFO:root:No instance specified, using random instance https://nitter.lucabased.xyz
INFO:root:Current stats for ImRo45: 10 tweets, 0 threads...


Unnamed: 0,Date,Text,Likes,Comments,Twitter Link
0,"Aug 28, 2024 · 11:27 AM UTC","Limits? They’ve shattered them. Now, they’re s...",23906,401,https://twitter.com/ImRo45/status/182875629955...
1,"Aug 28, 2024 · 10:39 AM UTC",Heartiest Congratulations @JayShah,40984,669,https://twitter.com/ImRo45/status/182874430942...
2,"Aug 25, 2024 · 7:46 AM UTC",From sharing rooms to sharing lifetime memorie...,114131,1788,https://twitter.com/ImRo45/status/182761354862...
3,"Aug 17, 2024 · 5:31 AM UTC",Excited to be part of the @FITTRwithsquats com...,18461,453,https://twitter.com/ImRo45/status/182468047291...
4,"Jul 30, 2024 · 7:56 PM UTC",Perfect start ✅ Well done Team 🇮🇳👏,121926,850,https://twitter.com/ImRo45/status/181837515835...
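
The `Date` column above holds plain strings such as `Aug 28, 2024 · 11:27 AM UTC`, which cannot be sorted or filtered by time as they are. A minimal sketch of converting them to timezone-aware `datetime` objects follows. The format string is inferred from the five rows shown above and assumes English month names and AM/PM markers; the helper name `parse_tweet_date` is hypothetical.

```python
from datetime import datetime, timezone

# Format inferred from the output above, e.g. "Aug 28, 2024 · 11:27 AM UTC".
NITTER_DATE_FORMAT = "%b %d, %Y · %I:%M %p UTC"


def parse_tweet_date(text):
    """Parse a Nitter-style date string into a UTC-aware datetime."""
    return datetime.strptime(text, NITTER_DATE_FORMAT).replace(tzinfo=timezone.utc)


parsed = parse_tweet_date("Aug 28, 2024 · 11:27 AM UTC")
print(parsed.isoformat())  # 2024-08-28T11:27:00+00:00
```

Applied to the DataFrame, `df1['Date'].apply(parse_tweet_date)` would produce a column that can be sorted chronologically.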


## Question 4B (10 Points)
If you encounter challenges with Question-4 web scraping using Python, employ any online tools such as ParseHub or Octoparse for data extraction. Introduce the selected tool, outline the steps for web scraping, and showcase the final output in formats like CSV or Excel.



Upload a document (Word or PDF File) in any shared storage (preferably UNT OneDrive) and add the publicly accessible link in the below code cell.

Please choose only one option for Question 4. If you do both options, we will grade only the first one.

In [None]:
# write your answer here


# Mandatory Question

**Important: Reflective Feedback on Web Scraping and Data Collection**



Please share your thoughts and feedback on the web scraping and data collection exercises you have completed in this assignment. Consider the following points in your response:



Learning Experience: Describe your overall learning experience in working on web scraping tasks. What were the key concepts or techniques you found most beneficial in understanding the process of extracting data from various online sources?



Challenges Encountered: Were there specific difficulties in collecting data from certain websites, and how did you overcome them? If you opted for the non-coding option, share your experience with the chosen tool.



Relevance to Your Field of Study: How might the ability to gather and analyze data from online sources enhance your work or research?

**(no grading of your submission if this question is left unanswered)**

In [None]:
'''
Write your response here.
With this exercise I learned how to scrape data from websites such as Twitter, Reddit, and Instagram.
The exercise asked us to work on a research question, which made me think critically about how to approach the question,
what data needs to be collected, and how to collect it. It also gave me a solid grasp of how to simulate a dataset.

Challenges Encountered:
The main challenge I faced was the 3rd question. I tried many different ways to implement the code, and I
even tried the code that you gave in the web scraping demo, but it still failed, so I left the question without
the final results.

Relevance to Your Field of Study:
The exercise on web scraping is very helpful for my field of study, as I am a data science major and work with
data day in and day out.

'''