
### Latest Research/Analysis Topics by Fraym as of April 2021

- We're going to scrape https://fraym.io/analysis/
- We'll get a list of topics. For each topic, we'll get the title, publish date, page URL, and description.
- We'll create a CSV file with the following columns:

  Topic_Name, Publish_date, URL, Description

1. Scrape the list of topics from the Fraym website.
 - Use requests to download the page
 - Use Beautiful Soup 4 to parse and extract information.
 - Convert to a pandas DataFrame

In [298]:
import pandas as pd
import requests
from bs4 import BeautifulSoup


In [299]:
# Download the page
topics_url = 'https://fraym.io/analysis'
response = requests.get(topics_url)

In [300]:
response.status_code

200
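
A status code of 200 means the request succeeded. As a minimal sketch, the check can be folded into the download step so that a failed request raises an error instead of passing silently. The helper name `fetch_html` and the 10-second timeout are assumptions, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its HTML, raising on HTTP errors."""
    # timeout is an assumed value; without one, requests can wait indefinitely
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.text
```

With this helper, `page_contents = fetch_html(topics_url)` would replace the separate download and status-check cells.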

In [301]:
# Check the size of the downloaded HTML (number of characters)
len(response.text)

99904

In [302]:
page_contents = response.text

In [303]:
# Save the page contents to an HTML file (optional, but handy for offline inspection)
with open('webpage.html', 'w', encoding='utf-8') as f:
  f.write(page_contents)

Use Beautiful Soup to parse and extract Information.

In [304]:
from bs4 import BeautifulSoup

doc = BeautifulSoup(page_contents, 'html.parser')


Topic Headings

In [305]:
# Grab the topic headings.
# Each title is an <a> tag wrapped in an <h2> with the class below,
# so we search for those <h2> tags.
selection_class = "fl-post-feed-title"
topic_title_tags = doc.find_all('h2', {'class': selection_class})

In [306]:
# Check the length to see whether the result looks reasonable

len(topic_title_tags)

10

In [307]:
#check first 10 entries

topic_title_tags[:10]

[<h2 class="fl-post-feed-title" itemprop="headline">
 <a href="https://fraym.io/tbethiopia/" rel="bookmark" title="Understanding TB Risk in Pastoralist Communities">Understanding TB Risk in Pastoralist Communities</a>
 </h2>, <h2 class="fl-post-feed-title" itemprop="headline">
 <a href="https://fraym.io/nepalwash/" rel="bookmark" title="Expanding Access to Water Where it Matters Most: The Impact of Climate Change on WASH in Nepal">Expanding Access to Water Where it Matters Most: The Impact of Climate Change on WASH in Nepal</a>
 </h2>, <h2 class="fl-post-feed-title" itemprop="headline">
 <a href="https://fraym.io/pr-genderequality/" rel="bookmark" title="Leading Global Organizations Invest in Community-Level Data to Advance Gender Equality">Leading Global Organizations Invest in Community-Level Data to Advance Gender Equality</a>
 </h2>, <h2 class="fl-post-feed-title" itemprop="headline">
 <a href="https://fraym.io/childmarriage/" rel="bookmark" title="Fighting for a Future Free of Chi

Date of Publishing Topic.

In [308]:
#Topic Dates

date_selector ="fl-post-feed-date"
topic_dateofpublish_tags = doc.find_all('span', {'class':date_selector})

In [309]:
len(topic_dateofpublish_tags)

10

In [310]:
topic_dateofpublish_tags[:5]

[<span class="fl-post-feed-date">
 						March 23, 2021					</span>, <span class="fl-post-feed-date">
 						March 22, 2021					</span>, <span class="fl-post-feed-date">
 						March 8, 2021					</span>, <span class="fl-post-feed-date">
 						March 4, 2021					</span>, <span class="fl-post-feed-date">
 						March 3, 2021					</span>]

Description of Topic.

In [311]:
#Topic Description

desc_selector ="fl-post-feed-text"
topic_desc_tags = doc.find_all('div', {'class':desc_selector})

In [312]:
len(topic_desc_tags)

10

In [313]:
topic_desc_tags[:5]

[<div class="fl-post-feed-text">
 <div class="fl-post-feed-content" itemprop="text">
 <p>Tuberculosis (TB) remains the third leading cause of death among communicable, maternal, neonatal, and nutritional diseases in Ethiopia, despite being preventable and treatable.1 While Ethiopia continues to make steady progress towards eliminating TB within its borders, lack of access to early diagnosis, treatment and HIV co-infections continue to pose significant challenges. Fraym technology mapped communities…</p>
 <a class="fl-post-feed-more" href="https://fraym.io/tbethiopia/" title="Understanding TB Risk in Pastoralist Communities">Read More</a>
 </div>
 </div>, <div class="fl-post-feed-text">
 <div class="fl-post-feed-content" itemprop="text">
 <p>Nepal has made great strides in improving access to water and sanitation, despite suffering a devastating earthquake and remaining one of the poorest nations in the world. Since 1990, access to improved water sources has increased from 46 to 95 perc

Topic URL.


In [314]:
# The enclosing elements can be inspected via .parent, for example:
# topic_title_tags[0].parent.parent

# Get the tag corresponding to the first topic
topic_title_tags[0]


<h2 class="fl-post-feed-title" itemprop="headline">
<a href="https://fraym.io/tbethiopia/" rel="bookmark" title="Understanding TB Risk in Pastoralist Communities">Understanding TB Risk in Pastoralist Communities</a>
</h2>

In [315]:
# Get the links of all topics
links = doc.find_all('a', {'rel': "bookmark"})
link_urls = [link['href'] for link in links]

# Each URL appears more than once in the page, so keep one per topic.
# dict.fromkeys() drops repeats while preserving the original order.
def remove_duplicates(urls):
    return list(dict.fromkeys(urls))

topic_urls_tags = remove_duplicates(link_urls)
len(topic_urls_tags)

10

In [316]:
topic_urls_tags[0]

'https://fraym.io/tbethiopia/'
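
Because each heading tag wraps exactly one link (see the `<h2>` output earlier), the URLs could also be read straight from the heading tags, which avoids the duplicates in the first place. The following is a minimal sketch on an inline HTML sample. The sample markup, including the extra duplicate link, and the helper name `urls_from_headings` are illustrative assumptions.

```python
from bs4 import BeautifulSoup

# Illustrative sample: two headings plus one repeated link outside a heading
sample_html = """
<h2 class="fl-post-feed-title">
 <a href="https://fraym.io/tbethiopia/" rel="bookmark">Understanding TB Risk in Pastoralist Communities</a>
</h2>
<a href="https://fraym.io/tbethiopia/" rel="bookmark">duplicate link elsewhere on the page</a>
<h2 class="fl-post-feed-title">
 <a href="https://fraym.io/nepalwash/" rel="bookmark">Expanding Access to Water Where it Matters Most</a>
</h2>
"""


def urls_from_headings(soup, heading_class="fl-post-feed-title"):
    """Return one URL per heading by reading the href of its first <a> tag."""
    urls = []
    for heading in soup.find_all("h2", class_=heading_class):
        link = heading.find("a", href=True)
        if link is not None:  # skip headings without a link
            urls.append(link["href"])
    return urls


sample_doc = BeautifulSoup(sample_html, "html.parser")
print(urls_from_headings(sample_doc))
# ['https://fraym.io/tbethiopia/', 'https://fraym.io/nepalwash/']
```

Reading the link from its heading also keeps each URL tied to the title it belongs to.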

Organizing Scraped Information.

In [317]:
# Get just the topic titles as plain text
topic_titles = []

for tag in topic_title_tags:
  topic_titles.append(tag.text.strip())
print(topic_titles)

['Understanding TB Risk in Pastoralist Communities', 'Expanding Access to Water Where it Matters Most: The Impact of Climate Change on WASH in Nepal', 'Leading Global Organizations Invest in Community-Level Data to Advance Gender Equality', 'Fighting for a Future Free of Child Marriage – How Hyper-Local Data Can Help Drive Results', 'Fraym Partners with Facebook’s Project 17: Layering Sex-Disaggregated Data Sets Expose Gender Inequalities', 'M&E Series Part 3 of 3: ML-Powered Spatial Data Transforms Your Adaptative Management Techniques', 'International Day of Zero Tolerance for FGM: Applying ML data to target programs', 'M&E Series Part 2 of 3: Adding New Dimensions to Geospatial Impact Evaluations', 'Machine-learning data helps global development organizations monitor & evaluate programs', 'M&E Series Part 1 of 3: Geospatial Data for Baseline Assessments and Contextual Analysis']


In [318]:
# for tag in topic_title_tags:
#   print(tag.text)

In [319]:
#Dates
topic_publishing_dates = []

for tag in topic_dateofpublish_tags:
  topic_publishing_dates.append(tag.text.strip())

topic_publishing_dates

['March 23, 2021',
 'March 22, 2021',
 'March 8, 2021',
 'March 4, 2021',
 'March 3, 2021',
 'February 16, 2021',
 'February 10, 2021',
 'January 12, 2021',
 'December 16, 2020',
 'December 8, 2020']
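
The dates above are still plain strings. If they need to be sorted or filtered, a sketch along these lines could convert them. The helper name `parse_publish_date` is hypothetical, and the format string assumes every date looks like `March 23, 2021` with English month names.

```python
from datetime import datetime


def parse_publish_date(text):
    """Convert a string such as 'March 23, 2021' into a datetime.date."""
    # %B = full month name, %d = day of month, %Y = four-digit year
    return datetime.strptime(text.strip(), "%B %d, %Y").date()


sample_dates = ["March 23, 2021", "March 8, 2021", "December 8, 2020"]
parsed = [parse_publish_date(d) for d in sample_dates]
print(parsed[0].isoformat())
# 2021-03-23
```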

In [320]:
#Descriptions
topic_descs = []
for tag in topic_desc_tags:
  topic_descs.append(tag.text.strip())

topic_descs[0]


'Tuberculosis (TB) remains the third leading cause of death among communicable, maternal, neonatal, and nutritional diseases in Ethiopia, despite being preventable and treatable.1 While Ethiopia continues to make steady progress towards eliminating TB within its borders, lack of access to early diagnosis, treatment and HIV co-infections continue to pose significant challenges. Fraym technology mapped communities…\nRead More'
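
The description above ends with the text of the "Read More" link. One way to drop it, sketched here on an inline sample, is to read only the `<p>` tag inside each description block. The helper name `description_text` is hypothetical and the sample paragraph is placeholder text.

```python
from bs4 import BeautifulSoup

# Inline sample with placeholder text, mirroring the structure shown earlier
sample_html = """
<div class="fl-post-feed-text">
 <div class="fl-post-feed-content" itemprop="text">
  <p>Short description of the topic.</p>
  <a class="fl-post-feed-more" href="https://fraym.io/tbethiopia/">Read More</a>
 </div>
</div>
"""


def description_text(desc_tag):
    """Return only the paragraph text of a description block."""
    paragraph = desc_tag.find("p")
    if paragraph is None:
        # Fall back to the full text when no <p> tag is present
        return desc_tag.get_text().strip()
    return paragraph.get_text().strip()


sample_doc = BeautifulSoup(sample_html, "html.parser")
sample_tags = sample_doc.find_all("div", class_="fl-post-feed-text")
print([description_text(tag) for tag in sample_tags])
# ['Short description of the topic.']
```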

In [321]:
# URLs
# topic_urls_tags already holds plain URL strings, so a copy of the list is enough
topic_urls = list(topic_urls_tags)

topic_urls

['https://fraym.io/tbethiopia/',
 'https://fraym.io/nepalwash/',
 'https://fraym.io/pr-genderequality/',
 'https://fraym.io/childmarriage/',
 'https://fraym.io/facebookproject17/',
 'https://fraym.io/meadaptivemanagement_part3/',
 'https://fraym.io/fgm-mali/',
 'https://fraym.io/meevauluation_part2/',
 'https://fraym.io/pr-mande/',
 'https://fraym.io/mebaseline_part1/']

In [322]:
# Create the data frame from the scraped data
topic_dict = {
    'Topic Title': topic_titles,
    'Date of Publishing': topic_publishing_dates,
    'URL': topic_urls,
    'Topic Description': topic_descs
}

topic_df = pd.DataFrame(topic_dict)
topic_df.shape

(10, 4)
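
The DataFrame above relies on the four lists having the same length and order. A small guard, sketched below with the hypothetical helper `build_records`, makes that assumption explicit by failing early if one of the selectors missed an entry. The dictionary keys are illustrative.

```python
def build_records(titles, dates, urls, descriptions):
    """Combine four parallel lists into one dict per topic."""
    lengths = {len(titles), len(dates), len(urls), len(descriptions)}
    if len(lengths) != 1:
        raise ValueError(
            "Scraped lists differ in length: "
            f"{len(titles)} titles, {len(dates)} dates, "
            f"{len(urls)} URLs, {len(descriptions)} descriptions"
        )
    return [
        {"title": t, "date": d, "url": u, "description": s}
        for t, d, u, s in zip(titles, dates, urls, descriptions)
    ]


records = build_records(
    ["Understanding TB Risk in Pastoralist Communities"],
    ["March 23, 2021"],
    ["https://fraym.io/tbethiopia/"],
    ["Placeholder description"],
)
print(records[0]["url"])
# https://fraym.io/tbethiopia/
```

Passing the resulting list to `pd.DataFrame(...)` gives one row per topic and one column per key.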

### Creating a CSV file for the scraped Information.

In [323]:
# Save to CSV (index=False leaves the row index out of the file)
topic_df.to_csv('Latest_Analysis_topics_by_Fraym.csv', index=False)
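
To confirm the file was written as expected, the CSV can be read back with pandas. The following is a minimal round-trip sketch on a one-row sample. The file name `sample_topics.csv`, the temporary directory, and the column names are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

sample_df = pd.DataFrame(
    {
        "Topic Title": ["Understanding TB Risk in Pastoralist Communities"],
        "Date of Publishing": ["March 23, 2021"],
        "URL": ["https://fraym.io/tbethiopia/"],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "sample_topics.csv")
    # index=False leaves the row index out of the file
    sample_df.to_csv(csv_path, index=False)
    # Read the file back while the temporary directory still exists
    restored_df = pd.read_csv(csv_path)

print(restored_df.shape)
print(list(restored_df.columns))
```

If the shape and column names match the original DataFrame, the export worked as intended.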