<a href="https://colab.research.google.com/github/AhmedCoolProjects/ESI/blob/main/Text_Mining_Project_Data_Collection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BY AHMED BARGADY

# Introduction 🚀

In this project, we embark on a data odyssey that combines three collection methods: Google News searches, existing online datasets, and web scraping. 🌐💻 Our goal? To curate a diverse and comprehensive text dataset for our text mining project.

Using the GoogleNews package, we search the news across eight categories with four queries each, ensuring a wide spectrum of topics. We also draw on two existing online datasets, a multi-source news articles dataset and a 2020 New York Times articles dataset, to enrich our collection. 🧐💡

Web scraping rounds out the collection: with requests and BeautifulSoup, we extract paper abstracts from an arXiv listing page. 🗞️🔍

The grand finale sees the creation of 12 DataFrames, each capturing a facet of the information landscape. From technology to politics, health to entertainment, these frames lay the foundation. 🌍📊

The climax? A master DataFrame that concatenates all of them into a single `content` column. This consolidated dataset is then saved in CSV format so that others can access and reuse it. 🎶🔗

Let the data exploration begin! 💻🚀

# Variables and package installation

In [40]:
# download some packages
!pip install GoogleNews
!pip install newspaper3k



In [41]:
# for our google search
TOPICS = {
    'Technology': [
        "Artificial Intelligence",
        "Blockchain Technology",
        "Cybersecurity",
        "Latest Tech Innovations",
    ],
    'Health': [
        "Medical Research",
        "Healthcare Technology",
        "Public Health Updates",
        "Medical Breakthroughs",
    ],
    'Science': [
        "Space Exploration",
        "Scientific Discoveries",
        "Climate Change",
        "Biotechnology",
    ],
    'Business and Finance': [
        "Stock Market Updates",
        "Economic Trends",
        "Cryptocurrency News",
        "Global Business News",
    ],
    'Politics': [
        "World Politics",
        "National Politics",
        "Government Policies",
        "International Relations",
    ],
    'Environment': [
        "Environmental Conservation",
        "Renewable Energy",
        "Climate Action",
        "Sustainable Practices",
    ],
    'Entertainment': [
        "Movie Reviews",
        "Celebrity News",
        "Music Industry Updates",
        "TV Show Highlights",
    ],
    'Sports': [
        "Latest Sports Events",
        "Sports Analysis",
        "Athlete Interviews",
        "Upcoming Tournaments",
    ],
}

# For our dataframes
COLUMNS = ["content"]

# for reading content
ARTICLES_DATASET_URL = "https://firebasestorage.googleapis.com/v0/b/esi-school-resources.appspot.com/o/text_mining%2Fproject%2Farticles_data%20(1).csv?alt=media&token=4048d506-e841-4dad-bd41-d85c4ee77aa9"

# for reading content of ny times
ARTICLES_DATASET_NY_TIMES_URL = "https://firebasestorage.googleapis.com/v0/b/esi-school-resources.appspot.com/o/text_mining%2Fproject%2Fnyt-articles-2020.csv?alt=media&token=1ab0db94-6094-4568-903c-cfd571e956ec"

# for web scraping
WEB_SCRAPPING_LINK = "https://arxiv.org/list/hep-ex/2301"

# Data Collection

## Packages

In [42]:
# import used packages
from GoogleNews import GoogleNews
import pandas as pd
import newspaper
import json
import requests
import re
from bs4 import BeautifulSoup, Comment

## Google News Search

In [43]:
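# search Google News for each topic and return the collected headlines
def search_google_news(topics):
  googlenews = GoogleNews(lang='en')

  content = []

  for topic in topics:
    # reset the results stored by the client before each topic
    googlenews.clear()
    # run both query methods of the client for this topic
    googlenews.get_news(topic)
    googlenews.search(topic)

    # request the first results page (the returned list is not used)
    googlenews.page_at(1)

    desc_1 = googlenews.get_texts()

    # request pages 2 to 5 and append the stored headline texts each time
    for i in range(2, 6):
      googlenews.page_at(i)
      desc = googlenews.get_texts()

      desc_1 += desc

    content += desc_1

  # single-column DataFrame holding all headlines
  df = pd.DataFrame(columns = COLUMNS)

  df[COLUMNS[0]] = content

  return df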

In [None]:
df_technology = search_google_news(TOPICS['Technology'])
df_science = search_google_news(TOPICS['Science'])
df_politics = search_google_news(TOPICS['Politics'])
df_health = search_google_news(TOPICS['Health'])
df_business = search_google_news(TOPICS['Business and Finance'])
df_sports = search_google_news(TOPICS['Sports'])
df_environment = search_google_news(TOPICS['Environment'])
df_entertainment = search_google_news(TOPICS['Entertainment'])

In [48]:
df_technology.head()

Unnamed: 0,content
0,Seattle is an artificial intelligence hub in a...
1,Microsoft president says no chance of super-in...
2,An AI Leader's Human-Centered Approach To Arti...
3,Talented humans still working at Sports Illust...
4,How Will Artificial Intelligence Transform the...


In [49]:
print("Technology:", df_technology.shape)
print("Science:", df_science.shape)
print("Politics:", df_politics.shape)
print("Health:", df_health.shape)
print("Business:", df_business.shape)
print("Sports:", df_sports.shape)
print("Environment:", df_environment.shape)
print("Entertainment:", df_entertainment.shape)

Technology: (8912, 1)
Science: (8880, 1)
Politics: (8880, 1)
Health: (8864, 1)
Business: (7004, 1)
Sports: (6400, 1)
Environment: (6400, 1)
Entertainment: (6560, 1)


In [50]:
df_technology['content'][0]

'Seattle is an artificial intelligence hub in a rapidly changing field'
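
The row counts printed above are large for four queries per category. One possible contributor, assuming `get_texts()` hands back the same list object on every call (an assumption about the GoogleNews client that is not verified here), is that `desc_1 += desc` extends a list with itself. The sketch below uses a plain Python list and a hypothetical `FakeClient` class to show the effect, along with a copy-based alternative.

```python
class FakeClient:
    """Hypothetical stand-in for a client that stores texts internally."""

    def __init__(self, texts):
        self._texts = list(texts)

    def get_texts(self):
        # Returns the internal list itself, not a copy (assumed behaviour).
        return self._texts


client = FakeClient(["headline A", "headline B"])

# Pattern used in the loop above: both names point at the same list,
# so each "+=" doubles it in place.
collected = client.get_texts()
for _ in range(4):
    collected += client.get_texts()
aliased_count = len(collected)

# Alternative: take a snapshot once, then drop repeats while keeping order.
client = FakeClient(["headline A", "headline B"])
snapshot = list(client.get_texts())
snapshot += client.get_texts()
unique_texts = list(dict.fromkeys(snapshot))

print(aliased_count)      # 32
print(len(unique_texts))  # 2
```

If the real client behaves this way, checking `df_technology['content'].duplicated().sum()` would show how many rows are repeats.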

## Read Datasets

### Get content of articles from different newspapers and blogs

In [51]:
# called by pandas for each malformed row; returning None skips that row
def handle_bad_lines(line):
    print(f"Skipping line: {line}")

df_articles_data = pd.read_csv(ARTICLES_DATASET_URL, header=0, encoding='utf-8', engine="python", on_bad_lines=handle_bad_lines)
print(df_articles_data.shape)
df_articles_data.head()

(10437, 15)


Unnamed: 0.1,Unnamed: 0,source_id,source_name,author,title,description,url,url_to_image,published_at,content,top_article,engagement_reaction_count,engagement_comment_count,engagement_share_count,engagement_comment_plugin_count
0,0,reuters,Reuters,Reuters Editorial,NTSB says Autopilot engaged in 2018 California...,The National Transportation Safety Board said ...,https://www.reuters.com/article/us-tesla-crash...,https://s4.reutersmedia.net/resources/r/?m=02&...,2019-09-03T16:22:20Z,WASHINGTON (Reuters) - The National Transporta...,0.0,0.0,0.0,2528.0,0.0
1,1,the-irish-times,The Irish Times,Eoin Burke-Kennedy,Unemployment falls to post-crash low of 5.2%,Latest monthly figures reflect continued growt...,https://www.irishtimes.com/business/economy/un...,https://www.irishtimes.com/image-creator/?id=1...,2019-09-03T10:32:28Z,The States jobless rate fell to 5.2 per cent l...,0.0,6.0,10.0,2.0,0.0
2,2,the-irish-times,The Irish Times,Deirdre McQuillan,"Louise Kennedy AW2019: Long coats, sparkling t...",Autumn-winter collection features designer’s g...,https://www.irishtimes.com/\t\t\t\t\t\t\t/life...,https://www.irishtimes.com/image-creator/?id=1...,2019-09-03T14:40:00Z,Louise Kennedy is showing off her autumn-winte...,1.0,,,,
3,3,al-jazeera-english,Al Jazeera English,Al Jazeera,North Korean footballer Han joins Italian gian...,Han is the first North Korean player in the Se...,https://www.aljazeera.com/news/2019/09/north-k...,https://www.aljazeera.com/mritems/Images/2019/...,2019-09-03T17:25:39Z,"Han Kwang Song, the first North Korean footbal...",0.0,0.0,0.0,7.0,0.0
4,4,bbc-news,BBC News,BBC News,UK government lawyer says proroguing parliamen...,"The UK government's lawyer, David Johnston arg...",https://www.bbc.co.uk/news/av/uk-scotland-4956...,https://ichef.bbci.co.uk/news/1024/branded_new...,2019-09-03T14:39:21Z,,0.0,0.0,0.0,0.0,0.0
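
The `handle_bad_lines` callback above relies on how pandas treats a callable passed to `on_bad_lines` together with `engine="python"`: the callable receives the malformed row as a list of strings, and returning `None` skips that row. Here is a small sketch with an in-memory CSV, where the data and the `remember_bad_line` name are invented for illustration:

```python
import io

import pandas as pd

RAW = "a,b\n1,2\n3,4,5\n6,7\n"  # the third line has one field too many

skipped = []


def remember_bad_line(fields):
    # pandas passes the offending row as a list of strings;
    # returning None tells the parser to skip it.
    skipped.append(fields)
    return None


df = pd.read_csv(io.StringIO(RAW), engine="python", on_bad_lines=remember_bad_line)

print(df.shape)
print(skipped)
```

Collecting the skipped rows in a list, rather than only printing them, makes it easy to count how many lines were lost.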


In [65]:
df_articles_content = pd.DataFrame({'content': df_articles_data['content']})
df_articles_desc = pd.DataFrame({'content': df_articles_data['description']})
df_articles_content.head()

Unnamed: 0,content
0,WASHINGTON (Reuters) - The National Transporta...
1,The States jobless rate fell to 5.2 per cent l...
2,Louise Kennedy is showing off her autumn-winte...
3,"Han Kwang Song, the first North Korean footbal..."
4,


In [66]:
df_articles_desc.head()

Unnamed: 0,content
0,The National Transportation Safety Board said ...
1,Latest monthly figures reflect continued growt...
2,Autumn-winter collection features designer’s g...
3,Han is the first North Korean player in the Se...
4,"The UK government's lawyer, David Johnston arg..."


In [67]:
df_articles_desc.shape, df_articles_content.shape

((10437, 1), (10437, 1))

### Get Content from NY Times articles dataset

In [55]:
df_articles_ny_times_data = pd.read_csv(ARTICLES_DATASET_NY_TIMES_URL, header=0, encoding='utf-8', engine="python", on_bad_lines=handle_bad_lines)
print(df_articles_ny_times_data.shape)
df_articles_ny_times_data.head()

(16787, 11)


Unnamed: 0,newsdesk,section,subsection,material,headline,abstract,keywords,word_count,pub_date,n_comments,uniqueID
0,Editorial,Opinion,,Editorial,Protect Veterans From Fraud,Congress could do much more to protect America...,"['Veterans', 'For-Profit Schools', 'Financial ...",680,2020-01-01 00:18:54+00:00,186,nyt://article/69a7090b-9f36-569e-b5ab-b0ba5bb3...
1,Games,Crosswords & Games,,News,‘It’s Green and Slimy’,Christina Iverson and Jeff Chen ring in the Ne...,['Crossword Puzzles'],931,2020-01-01 03:00:10+00:00,257,nyt://article/9edddb54-0aa3-5835-a833-d311a76f...
2,Science,Science,,News,Meteor Showers in 2020 That Will Light Up Nigh...,"All year long, Earth passes through streams of...","['Meteors and Meteorites', 'Space and Astronom...",1057,2020-01-01 05:00:08+00:00,6,nyt://article/04bc90f0-b20b-511c-b5bb-3ce13194...
3,Science,Science,,Interactive Feature,Sync your calendar with the solar system,"Never miss an eclipse, a meteor shower, a rock...","['Space and Astronomy', 'Moon', 'Eclipses', 'S...",0,2020-01-01 05:00:12+00:00,2,nyt://interactive/5b58d876-9351-50af-9b41-a312...
4,Science,Science,,News,"Rocket Launches, Trips to Mars and More 2020 S...",A year full of highs and lows in space just en...,"['Space and Astronomy', 'Private Spaceflight',...",1156,2020-01-01 05:02:38+00:00,25,nyt://article/bd8647b3-8ec6-50aa-95cf-2b81ed12...


In [64]:
df_articles_ny_times_content = pd.DataFrame({'content': df_articles_ny_times_data['abstract']})
df_articles_ny_times_content.head()

Unnamed: 0,content
0,Congress could do much more to protect America...
1,Christina Iverson and Jeff Chen ring in the Ne...
2,"All year long, Earth passes through streams of..."
3,"Never miss an eclipse, a meteor shower, a rock..."
4,A year full of highs and lows in space just en...


## Web Scrape

In [57]:
def get_links(source):
    links = []

    py_page = requests.get(source)
    py_soup = BeautifulSoup(py_page.content, 'lxml')
    # Find all <a> tags with href attributes that match the pattern "/abs/..."
    hrefs = py_soup.find_all('a', href=re.compile(r'^/abs/'))
    # Build the absolute URL from the href attribute of each matching <a> tag
    for link in hrefs:
        links.append("https://arxiv.org" + link.get('href'))

    return links

In [58]:
LINKS_LIST = get_links(WEB_SCRAPPING_LINK)
print(len(LINKS_LIST))
print(LINKS_LIST[0])

26
https://arxiv.org/abs/2301.00222
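
To see what `get_links` matches without making a network request, the sketch below runs the same `re.compile(r'^/abs/')` filter on a small hand-written HTML snippet. The snippet and the `extract_abs_links` name are made up for illustration, the built-in `html.parser` replaces `lxml` so no extra parser is needed, and the `dict.fromkeys` step, which drops repeated links, is an addition that the original function does not have.

```python
import re
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hand-written sample, not real arXiv markup.
SAMPLE_HTML = """
<dl>
  <dt><a href="/abs/2301.00001">arXiv:2301.00001</a> <a href="/pdf/2301.00001">pdf</a></dt>
  <dt><a href="/abs/2301.00002">arXiv:2301.00002</a> <a href="/abs/2301.00002">abs</a></dt>
  <dt><a href="https://example.org/abs/other">external</a></dt>
</dl>
"""


def extract_abs_links(html, base_url="https://arxiv.org"):
    """Return absolute URLs for every <a> whose href starts with /abs/."""
    soup = BeautifulSoup(html, "html.parser")
    anchors = soup.find_all("a", href=re.compile(r"^/abs/"))
    urls = [urljoin(base_url, a["href"]) for a in anchors]
    # Keep the first occurrence of each URL, preserving order.
    return list(dict.fromkeys(urls))


links = extract_abs_links(SAMPLE_HTML)
print(links)
```

The `^` anchor is what excludes the external link: its `href` contains `/abs/` but does not start with it.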




**Let's scrape the content of these 26 pages**

In [59]:
def scrape_content(links):
  contents = []

  for link in links:
    py_page = requests.get(link)
    py_soup = BeautifulSoup(py_page.content, 'lxml')

    # Find the <blockquote> tag whose class list includes "abstract" and extract the text
    content_tag = py_soup.find('blockquote', class_='abstract')
    if content_tag:
        content_text = content_tag.get_text(separator='\n', strip=True)
        contents.append(content_text)
  # create df
  df = pd.DataFrame(columns = COLUMNS)
  df[COLUMNS[0]] = contents

  return df

In [60]:
df_scraped_content = scrape_content(LINKS_LIST)
print(df_scraped_content.shape)

(26, 1)


In [62]:
df_scraped_content.head()

Unnamed: 0,content
0,Abstract:\nMeasurements of coherent charmonium...
1,Abstract:\nHeavy flavour production measuremen...
2,Abstract:\nThe collection of a statistically s...
3,Abstract:\nA set of measurements of azimuthal ...
4,Abstract:\nMeasurements of jet energy scale (J...
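
The scraped rows shown above all start with the label `Abstract:` followed by a line break. A minimal sketch of removing that label and collapsing line breaks before text mining follows; the `clean_abstract` helper and the sample string are hypothetical and not part of the original notebook.

```python
import re


def clean_abstract(text):
    """Drop a leading 'Abstract:' label and collapse whitespace runs."""
    without_label = re.sub(r"^\s*Abstract:\s*", "", text)
    return re.sub(r"\s+", " ", without_label).strip()


sample = "Abstract:\nMeasurements of jet energy scale\nare presented."
cleaned = clean_abstract(sample)
print(cleaned)
```

On the real data this could be applied with `df_scraped_content['content'].map(clean_abstract)`.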


# Combine DataFrames

In [68]:
# List of DataFrames to combine
dataframes_to_combine = [
    df_articles_ny_times_content,
    df_scraped_content,
    df_articles_desc,
    df_articles_content,
    df_technology ,
    df_science,
    df_politics,
    df_health ,
    df_business ,
    df_sports ,
    df_environment ,
    df_entertainment ,
]

# Concatenate the DataFrames
df_final = pd.concat(dataframes_to_combine, ignore_index=True)

# Print the shape and preview the final DataFrame
print(df_final.shape)
df_final.head()

(99587, 1)


Unnamed: 0,content
0,Congress could do much more to protect America...
1,Christina Iverson and Jeff Chen ring in the Ne...
2,"All year long, Earth passes through streams of..."
3,"Never miss an eclipse, a meteor shower, a rock..."
4,A year full of highs and lows in space just en...
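
The combined frame can still hold empty rows (row 4 of `df_articles_content` is empty in its preview) and repeated texts. Below is a minimal sketch of a cleaning pass that uses toy frames in place of the real ones. Whether to drop duplicates is a design choice this notebook does not make, so treat that step as optional.

```python
import pandas as pd

# Toy stand-ins for the real DataFrames.
df_a = pd.DataFrame({"content": ["alpha", None, "beta"]})
df_b = pd.DataFrame({"content": ["beta", "gamma", "  "]})

df_all = pd.concat([df_a, df_b], ignore_index=True)

df_clean = (
    df_all
    .dropna(subset=["content"])                          # remove missing values
    .assign(content=lambda d: d["content"].str.strip())  # trim whitespace
    .loc[lambda d: d["content"] != ""]                   # remove blank strings
    .drop_duplicates(subset=["content"])                 # keep first occurrence
    .reset_index(drop=True)
)

print(df_all.shape, df_clean.shape)
```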


# Save the collected data

In [69]:
# Save df_final to a CSV file
df_final.to_csv('ahmed_bargady_collected_data.csv', index=False)

# Print a message indicating successful saving
print("df_final has been saved to 'ahmed_bargady_collected_data.csv'")

df_final has been saved to 'ahmed_bargady_collected_data.csv'


**You can find the saved file online at: [Ahmed Bargady Collected Data](https://firebasestorage.googleapis.com/v0/b/esi-school-resources.appspot.com/o/text_mining%2Fproject%2Fahmed_bargady_collected_data.csv?alt=media&token=cb47884a-aff3-42bb-9e94-cc325cfa2f1a)**

# Test the deployed dataset

In [70]:
COLLECTED_DATA_LINK = "https://firebasestorage.googleapis.com/v0/b/esi-school-resources.appspot.com/o/text_mining%2Fproject%2Fahmed_bargady_collected_data.csv?alt=media&token=cb47884a-aff3-42bb-9e94-cc325cfa2f1a"

In [71]:
df_collected_data = pd.read_csv(COLLECTED_DATA_LINK, header=0, encoding='utf-8', engine="python", on_bad_lines=handle_bad_lines)
print(df_collected_data.shape)
df_collected_data.head()

(98268, 1)


Unnamed: 0,content
0,Congress could do much more to protect America...
1,Christina Iverson and Jeff Chen ring in the Ne...
2,"All year long, Earth passes through streams of..."
3,"Never miss an eclipse, a meteor shower, a rock..."
4,A year full of highs and lows in space just en...


In [72]:
df_collected_data.tail()

Unnamed: 0,content
98263,UCL Highlights - Ep 1 | Video | Watch TV Show
98264,Best of Max Homa | Saturday highlights | Video...
98265,How to watch Star Wars in order—even the shows
98266,Nottingham Forest 1-1 Brentford | Premier Leag...
98267,The Open | Day Four highlights | Video | Watch...
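
A quick way to check a CSV before uploading it is to write the frame to an in-memory buffer, read it back, and compare the two. The sketch below does this with `io.StringIO` and toy data; the sample rows are invented for illustration.

```python
import io

import pandas as pd

df_out = pd.DataFrame(
    {"content": ["first, with a comma", 'second "quoted"', "third"]}
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
df_out.to_csv(buffer, index=False)
buffer.seek(0)

df_back = pd.read_csv(buffer, header=0)

same_shape = df_out.shape == df_back.shape
same_rows = df_out["content"].tolist() == df_back["content"].tolist()
print(same_shape, same_rows)
```

Running the same comparison between `df_final` and `df_collected_data` would show which rows differ between the saved and the downloaded data.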


# Conclusion 🚀

In this data-driven adventure, we've navigated the digital landscape using Google News searches, online datasets, and web scraping. 💻🌐 Our arsenal included the GoogleNews package, pandas, requests, and BeautifulSoup.

The finale? A collection of 12 DataFrames, each representing a distinct slice of information. From tech and science to politics and entertainment, these frames paint a diverse picture. 📊🌍

But the showstopper is the master DataFrame, which concatenates these frames into a single `content` column of 99,587 rows. Saved as a CSV, it is a gift to researchers, developers, and knowledge enthusiasts. 🎁💾

Here's the key: [**Download Dataset CSV**](https://firebasestorage.googleapis.com/v0/b/esi-school-resources.appspot.com/o/text_mining%2Fproject%2Fahmed_bargady_collected_data.csv?alt=media&token=cb47884a-aff3-42bb-9e94-cc325cfa2f1a). Unleash its potential for innovation and discovery! 🚀🔍

Happy coding, and may your projects soar! 🌟🔗