# Guidelines for creating a web scraping project
## 1. Pick a website and describe your objective
- Browse through different sites and pick one to scrape. Check the "Project Ideas" section for inspiration.
- Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
- Summarize your project idea and outline your strategy in a Jupyter notebook. Use the "New" button above.

### Outline
- We are going to scrape https://github.com/topics
- We will get a list of topics. For each topic, we will get the repository name, username, stars, and page URL of its repositories.
- We will collect information for 25 repositories per topic.
- For each topic, the 25 repositories will be saved to a CSV file in this format:
```
Repository Name,Username,Stars,Page URL
three.js,mrdoob,71400,https://github.com/mrdoob/three.js
libgdx,libgdx,18500,https://github.com/libgdx/libgdx
```
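
As a rough sketch of how rows in the format above could be produced, the standard library `csv` module can write a list of dictionaries with a fixed header. The function name `repos_to_csv` and the `FIELDNAMES` constant are illustrative choices, not part of the original notebook, and the two rows are the sample rows from the outline.

```python
import csv
import io

# Column order of the target CSV, as described in the outline
FIELDNAMES = ['Repository Name', 'Username', 'Stars', 'Page URL']


def repos_to_csv(repos):
    """Return CSV text for a list of repository dictionaries (hypothetical helper)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES, lineterminator='\n')
    writer.writeheader()
    writer.writerows(repos)
    return buffer.getvalue()


sample_repos = [
    {'Repository Name': 'three.js', 'Username': 'mrdoob', 'Stars': 71400,
     'Page URL': 'https://github.com/mrdoob/three.js'},
    {'Repository Name': 'libgdx', 'Username': 'libgdx', 'Stars': 18500,
     'Page URL': 'https://github.com/libgdx/libgdx'},
]

csv_text = repos_to_csv(sample_repos)
print(csv_text)
```

Writing to an in-memory buffer keeps the sketch free of side effects; passing an open file to `csv.DictWriter` instead would write the same rows to disk.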

### Install Modules

In [29]:
!pip install requests --upgrade --quiet
!pip install beautifulsoup4 --upgrade --quiet

### Import Modules

In [78]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [31]:
pageLink = 'https://github.com/topics'
response = requests.get(pageLink)

In [32]:
response.status_code

200

In [33]:
len(response.text)

129102

## 2. Use the requests library to download web pages
- Inspect the website's HTML source and identify the right URLs to download.
- Download and save web pages locally using the requests library.
- Create a function to automate downloading for different topics/search queries.
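
The download steps listed above could be wrapped in a small function along these lines. This is a minimal sketch: the names `build_topic_url` and `download_page`, the `timeout` value, and the `/topics/<slug>` URL pattern (inferred from the topic links on the page) are assumptions rather than code from the notebook.

```python
import requests

BASE_URL = 'https://github.com'  # site root used throughout this project


def build_topic_url(topic_slug):
    """Return the page URL for a topic slug such as '3d' (assumed URL pattern)."""
    return f'{BASE_URL}/topics/{topic_slug}'


def download_page(url, path, timeout=10):
    """Download url, save the HTML to path, and return the HTML text."""
    response = requests.get(url, timeout=timeout)
    # Raise an exception for 4xx/5xx responses instead of saving an error page
    response.raise_for_status()
    # Write as UTF-8 so non-ASCII characters on the page are preserved
    with open(path, 'w', encoding='utf-8') as f:
        f.write(response.text)
    return response.text
```

A call such as `download_page(build_topic_url('3d'), '3d.html')` would then fetch and save one topic page; looping over several slugs automates the download for different topics.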

In [34]:
pageContent = response.text

In [35]:
pageContent[:1000]

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n\n\n\n  <link crossorigin="anonymous" media="all" integrity="sha512-m3WD6x26QRryZ5Kq9O1ZlHgaTlqab5aFm8gTU6hlvFYQDmA7rKgoq/cZi2n8N3HUpRVlSguxW/h8fXDPvUeS2A==" rel="stylesheet" href="https://github.githubassets.com/assets/frameworks-9b7583eb1dba411af26792aaf4ed5994.css" />\n  \n    <link crossorigin="anonymous" media="all" integrity="sha512-M7v6ZmO8lTwry7a0sQ9SQvpr79m30uXOQ+HLcRT4pzumVyg32ehYGikTLYePuhgC0ovxvIi8ceGV+RoF7KsCjA==" rel="stylesheet" href="https://github.githubassets.com/assets/behaviors-33bbfa6663bc953c2bcbb6b4b10f5242.css" />\n    \n    \

In [36]:
# Pass the encoding explicitly: the platform's default encoding may not be able to write every character on the page
with open('main.html', "w", encoding="utf-8") as f:
    f.write(pageContent)

## 3. Use Beautiful Soup to parse and extract information
- Parse and explore the structure of downloaded web pages using Beautiful Soup.
- Use the right properties and methods to extract the required information.
- Create functions to extract information from the page into lists and dictionaries.
- (Optional) Use a REST API to acquire additional information if required.
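
One possible shape for such an extraction function is sketched below. It runs on a small hand-written HTML fragment that imitates the structure of the topics page, so that it works offline. The function name `extract_topics`, the dictionary keys, and `SAMPLE_HTML` are illustrative assumptions, and the class names are the ones used in the cells of this section.

```python
from bs4 import BeautifulSoup

BASE_URL = 'https://github.com'

# Hand-written fragment imitating one topic entry on the topics page
SAMPLE_HTML = '''
<a class="d-flex no-underline" href="/topics/3d">
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
  <p class="f5 color-text-secondary mb-0 mt-1">
    3D modeling is the process of virtually developing the surface and structure of a 3D object.
  </p>
</a>
'''


def extract_topics(html):
    """Return a list of dicts with the title, description, and URL of each topic."""
    soup = BeautifulSoup(html, 'html.parser')
    topics = []
    for link in soup.find_all('a', class_='d-flex no-underline'):
        title = link.find('p', class_='f3 lh-condensed mb-0 mt-1 Link--primary')
        description = link.find('p', class_='f5 color-text-secondary mb-0 mt-1')
        if title is None or description is None:
            continue  # skip entries that do not have the expected structure
        topics.append({
            'title': title.text.strip(),
            'description': description.text.strip(),
            'url': BASE_URL + link['href'].strip(),
        })
    return topics


topics = extract_topics(SAMPLE_HTML)
print(topics)
```

Reading the title and description from inside each link keeps the three values of one topic together, instead of relying on three separate lists having the same length and order.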

In [37]:
soup = BeautifulSoup(pageContent, 'html.parser')

In [38]:
type(soup)

bs4.BeautifulSoup

In [47]:
topicNameClass = 'f3 lh-condensed mb-0 mt-1 Link--primary'
# Equivalent to soup.find_all('p', class_=topicNameClass):
# a string passed as the second argument is matched against the class attribute
topicNameParagraph = soup.find_all('p', topicNameClass)

In [48]:
len(topicNameParagraph)

30
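
Passing the full class string, as in the cell above, matches only tags whose `class` attribute is exactly that string, so the search can stop matching if the site reorders or adds classes. A possible alternative is a CSS selector via `select`, which matches the listed classes regardless of their order. The sketch below demonstrates the difference on a small made-up HTML fragment, not on the real page.

```python
from bs4 import BeautifulSoup

# Made-up fragment: the same two classes, written in a different order
html = '<p class="f3 lh-condensed">A</p><p class="lh-condensed f3">B</p>'
soup = BeautifulSoup(html, 'html.parser')

# Exact string match: only the tag whose class attribute is 'f3 lh-condensed'
exact_matches = [p.text for p in soup.find_all('p', 'f3 lh-condensed')]

# CSS selector: every <p> that carries both classes, in any order
selector_matches = [p.text for p in soup.select('p.f3.lh-condensed')]

print(exact_matches)
print(selector_matches)
```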

In [52]:
topicNameParagraph[:5]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>]

In [53]:
topicDescriptionClass = 'f5 color-text-secondary mb-0 mt-1'
topicDescriptionParagraph = soup.find_all('p',topicDescriptionClass)

In [55]:
len(topicDescriptionParagraph)

30

In [54]:
topicDescriptionParagraph[:5]

[<p class="f5 color-text-secondary mb-0 mt-1">
               3D modeling is the process of virtually developing the surface and structure of a 3D object.
             </p>,
 <p class="f5 color-text-secondary mb-0 mt-1">
               Ajax is a technique for creating interactive web applications.
             </p>,
 <p class="f5 color-text-secondary mb-0 mt-1">
               Algorithms are self-contained sequences that carry out a variety of tasks.
             </p>,
 <p class="f5 color-text-secondary mb-0 mt-1">
               Amp is a non-blocking concurrency framework for PHP.
             </p>,
 <p class="f5 color-text-secondary mb-0 mt-1">
               Android is an operating system built by Google designed for mobile devices.
             </p>]

In [56]:
topicLinkClass = 'd-flex no-underline'
topicLink = soup.find_all('a',topicLinkClass)

In [57]:
len(topicLink)

30

In [58]:
topicLink[:5]

[<a class="d-flex no-underline" data-ga-click="Explore, go to 3d, location:All featured topics" href="/topics/3d">
 <div class="color-bg-info f4 color-text-tertiary text-bold rounded mr-3 flex-shrink-0 text-center" style="width:64px; height:64px; line-height:64px;">
             #
           </div>
 <div class="d-sm-flex flex-auto">
 <div class="flex-auto">
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
 <p class="f5 color-text-secondary mb-0 mt-1">
               3D modeling is the process of virtually developing the surface and structure of a 3D object.
             </p>
 </div>
 <div class="d-inline-block js-toggler-container starring-container">
 <a aria-label="You must be signed in to star a topic" class="btn btn-sm d-flex flex-items-center" data-ga-click="Explore, click star button when signed out,
         action:topics#index;
         text:Star" href="/login?return_to=%2Ftopics%2F3d" title="You must be signed in to star a topic">
 <svg aria-hidden="true" class="octi

In [61]:
topicLink[0]['href']

'/topics/3d'
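
The `href` above is relative to the site root. One way to turn it into an absolute URL is `urljoin` from the standard library, which handles the slash between the base and the path. The helper name `to_absolute` is an illustrative choice, not part of the notebook.

```python
from urllib.parse import urljoin

BASE_URL = 'https://github.com'


def to_absolute(href, base=BASE_URL):
    """Join a root-relative href with the site base URL."""
    return urljoin(base, href.strip())


absolute_url = to_absolute('/topics/3d')
print(absolute_url)
```

Because `urljoin` resolves the path against the base, the result is the same whether or not the base URL ends with a slash.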

In [64]:
topicDescriptionParagraph[0].text.strip()

'3D modeling is the process of virtually developing the surface and structure of a 3D object.'

In [65]:
topicNames = []
for name in topicNameParagraph:
    topicNames.append(name.text.strip())

In [66]:
topicNames[:5]

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android']

In [67]:
topicDescriptions = []
for description in topicDescriptionParagraph:
    topicDescriptions.append(description.text.strip())

In [69]:
topicDescriptions[:5]

['3D modeling is the process of virtually developing the surface and structure of a 3D object.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency framework for PHP.',
 'Android is an operating system built by Google designed for mobile devices.']

In [73]:
topicLinks = []
for url in topicLink:
    # Each href already starts with '/', so the base URL must not end with one
    topicLinks.append('https://github.com' + url['href'].strip())

In [74]:
topicLinks[:5]

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android']

## 4. Create CSV file(s) with the extracted information
- Create functions for the end-to-end process of downloading, parsing, and saving CSVs.
- Execute the function with different inputs to create a dataset of CSV files.
- Verify the information in the CSV files by reading them back using Pandas.

In [77]:
df1Dict = {
    "Topic Title":topicNames,
    "Topic Description":topicDescriptions,
    "Topic URL":topicLinks,
}

In [79]:
df1 = pd.DataFrame(df1Dict)

In [80]:
df1.head()

,Topic Title,Topic Description,Topic URL
0,3D,3D modeling is the process of virtually develo...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency framework fo...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android


In [82]:
df1.to_csv('topicDataFrame.csv', index=False)
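
To verify a saved file, it can be read back with Pandas and compared with the original DataFrame. The sketch below does this for a two-row sample modeled on the topic data, written to a temporary directory so that it does not touch `topicDataFrame.csv`. The helper name `save_and_reload` and the file name `topics.csv` are assumptions made for illustration.

```python
import os
import tempfile

import pandas as pd


def save_and_reload(df, path):
    """Write df to path as CSV without the index column, then read it back."""
    df.to_csv(path, index=False)
    return pd.read_csv(path)


# Two sample rows modeled on the topic data collected in this notebook
sample = pd.DataFrame({
    'Topic Title': ['3D', 'Ajax'],
    'Topic Description': [
        '3D modeling is the process of virtually developing the surface and structure of a 3D object.',
        'Ajax is a technique for creating interactive web applications.',
    ],
    'Topic URL': ['https://github.com/topics/3d', 'https://github.com/topics/ajax'],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    reloaded = save_and_reload(sample, os.path.join(tmp_dir, 'topics.csv'))

# The reloaded frame should have the same shape and column names as the original
print(reloaded.shape == sample.shape)
print(list(reloaded.columns) == list(sample.columns))
```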