# Guidelines for creating a web scraping project
## 1. Pick a website and describe your objective
- Browse through different sites and pick one to scrape. Check the "Project Ideas" section for inspiration.
- Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
- Summarize your project idea and outline your strategy in a Jupyter notebook. Use the "New" button above.

### Outline
- We are going to scrape https://github.com/topics
- We will get a list of topics. For each topic, we will get the repository name, username, stars, and page URL of its repositories.
- For each topic, we will collect information on 25 repositories.
- For each topic's 25 repositories, we will create a CSV file in this format:
```
Repository Name,Username,Stars,Page URL
three.js,mrdoob,71400,https://github.com/mrdoob/three.js
libgdx,libgdx,18500,https://github.com/libgdx/libgdx
```

### Install Modules

In [2]:
!pip install requests --upgrade --quiet
!pip install beautifulsoup4 --upgrade --quiet
!pip install pandas pandas-profiling --upgrade --quiet

### Import Modules

In [3]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import pandas_profiling as pp 
import json

In [4]:
pageLink = 'https://github.com/topics'
response = requests.get(pageLink)

In [5]:
response.status_code

200

In [6]:
len(response.text)

129143

## 2. Use the requests library to download web pages
- Inspect the website's HTML source and identify the right URLs to download.
- Download and save web pages locally using the requests library.
- Create a function to automate downloading for different topics/search queries.
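
The download-and-save steps listed above could be wrapped in one small helper. The following is a minimal sketch, not the notebook's own code: `download_page`, its `get` parameter, and the `timeout` value are hypothetical names and choices introduced for illustration.

```python
import requests


def download_page(url, path, get=requests.get, timeout=10):
    """Download `url`, save the HTML to `path`, and return the page text.

    `get` defaults to requests.get. It is a parameter only so that the
    helper can be exercised without network access (an assumption of
    this sketch, not something the notebook does).
    """
    response = get(url, timeout=timeout)
    # Fail loudly on 4xx/5xx responses instead of saving an error page.
    response.raise_for_status()
    # An explicit encoding keeps the write independent of the platform default.
    with open(path, "w", encoding="utf-8") as f:
        f.write(response.text)
    return response.text
```

With the names used in this notebook, a call might look like `download_page('https://github.com/topics', 'main.html')`.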

In [7]:
pageContent = response.text

In [8]:
pageContent[:1000]

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n\n\n\n  <link crossorigin="anonymous" media="all" integrity="sha512-yiSzZyQYjjSm73vWjlt23NeW6XsLBACzGZv3ZHwxQ9zgCby0YYjfFxEAXErxlKcQ4ke40vqUiKcvuFn8QZfP1w==" rel="stylesheet" href="https://github.githubassets.com/assets/frameworks-ca24b36724188e34a6ef7bd68e5b76dc.css" />\n  \n    <link crossorigin="anonymous" media="all" integrity="sha512-vvcQc56Qj8Yfd1jyEk2XzB3+mWlykbMgyQQ/wwu4JU+YViiQnG8ItjwPyj3Jx5SjqqqOK/adpzLRqb1Fol1TuA==" rel="stylesheet" href="https://github.githubassets.com/assets/behaviors-bef710739e908fc61f7758f2124d97cc.css" />\n    \n    \

In [9]:
# Save the page locally. The explicit encoding avoids a platform-dependent
# UnicodeEncodeError when the HTML contains non-ASCII characters.
with open('main.html', 'w', encoding='utf-8') as f:
    f.write(pageContent)

## 3. Use Beautiful Soup to parse and extract information
- Parse and explore the structure of downloaded web pages using Beautiful Soup.
- Use the right properties and methods to extract the required information.
- Create functions to extract information from the page into lists and dictionaries.
- (Optional) Use a REST API to acquire additional information if required.
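
One way to extract a page into lists and dictionaries is to walk each topic link and read the paragraphs inside it. The sketch below runs on a hand-written HTML fragment modelled on the topics page; the fragment, the function name `parse_topics`, and the shortened CSS selectors are assumptions made for illustration, and GitHub's real markup changes over time.

```python
from bs4 import BeautifulSoup

# Hand-written sample that imitates two topic cards (not real downloaded HTML).
SAMPLE_HTML = """
<a class="d-flex no-underline" href="/topics/3d">
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
  <p class="f5 color-text-secondary mb-0 mt-1">
    3D modeling is the process of virtually developing the surface and structure of a 3D object.
  </p>
</a>
<a class="d-flex no-underline" href="/topics/ajax">
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>
  <p class="f5 color-text-secondary mb-0 mt-1">
    Ajax is a technique for creating interactive web applications.
  </p>
</a>
"""


def parse_topics(html, base_url="https://github.com"):
    """Return one dictionary per topic card found in `html`."""
    soup = BeautifulSoup(html, "html.parser")
    topics = []
    # CSS selectors match individual classes, so class order does not matter.
    for link in soup.select("a.no-underline"):
        title = link.select_one("p.f3")
        description = link.select_one("p.f5")
        if title is None or description is None:
            continue  # skip links that are not topic cards
        topics.append({
            "title": title.get_text(strip=True),
            "description": description.get_text(strip=True),
            "url": base_url + link["href"],
        })
    return topics


topics = parse_topics(SAMPLE_HTML)
print(topics)
```

Keeping title, description, and URL together in one dictionary per card avoids the three parallel lists drifting out of step if one element is missing from a card.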

In [10]:
soup = BeautifulSoup(pageContent, 'html.parser')

In [11]:
type(soup)

bs4.BeautifulSoup

In [12]:
topicNameClass = 'f3 lh-condensed mb-0 mt-1 Link--primary'
# A string passed as the second positional argument of find_all() is matched
# against the class attribute, so this is equivalent to
# soup.find_all('p', class_=topicNameClass).
topicNameParagraph = soup.find_all('p', topicNameClass)

In [13]:
len(topicNameParagraph)

30

In [14]:
topicNameParagraph[:5]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>]

In [15]:
topicDescriptionClass = 'f5 color-text-secondary mb-0 mt-1'
topicDescriptionParagraph = soup.find_all('p',topicDescriptionClass)

In [16]:
len(topicDescriptionParagraph)

30

In [17]:
topicDescriptionParagraph[:5]

[<p class="f5 color-text-secondary mb-0 mt-1">
               3D modeling is the process of virtually developing the surface and structure of a 3D object.
             </p>,
 <p class="f5 color-text-secondary mb-0 mt-1">
               Ajax is a technique for creating interactive web applications.
             </p>,
 <p class="f5 color-text-secondary mb-0 mt-1">
               Algorithms are self-contained sequences that carry out a variety of tasks.
             </p>,
 <p class="f5 color-text-secondary mb-0 mt-1">
               Amp is a non-blocking concurrency framework for PHP.
             </p>,
 <p class="f5 color-text-secondary mb-0 mt-1">
               Android is an operating system built by Google designed for mobile devices.
             </p>]

In [18]:
topicLinkClass = 'd-flex no-underline'
topicLink = soup.find_all('a',topicLinkClass)

In [19]:
len(topicLink)

30

In [20]:
topicLink[:5]

[<a class="d-flex no-underline" data-ga-click="Explore, go to 3d, location:All featured topics" href="/topics/3d">
 <div class="color-bg-info f4 color-text-tertiary text-bold rounded mr-3 flex-shrink-0 text-center" style="width:64px; height:64px; line-height:64px;">
             #
           </div>
 <div class="d-sm-flex flex-auto">
 <div class="flex-auto">
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
 <p class="f5 color-text-secondary mb-0 mt-1">
               3D modeling is the process of virtually developing the surface and structure of a 3D object.
             </p>
 </div>
 <div class="d-inline-block js-toggler-container starring-container">
 <a aria-label="You must be signed in to star a topic" class="btn btn-sm d-flex flex-items-center" data-ga-click="Explore, click star button when signed out,
         action:topics#index;
         text:Star" href="/login?return_to=%2Ftopics%2F3d" title="You must be signed in to star a topic">
 <svg aria-hidden="true" class="octi

In [21]:
topicLink[0]['href']

'/topics/3d'

In [22]:
topicDescriptionParagraph[0].text.strip()

'3D modeling is the process of virtually developing the surface and structure of a 3D object.'

In [23]:
topicNames = []
for name in topicNameParagraph:
    topicNames.append(name.text.strip())

In [24]:
topicNames[:5]

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android']

In [25]:
topicDescriptions = []
for description in topicDescriptionParagraph:
    topicDescriptions.append(description.text.strip())

In [26]:
topicDescriptions[:5]

['3D modeling is the process of virtually developing the surface and structure of a 3D object.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency framework for PHP.',
 'Android is an operating system built by Google designed for mobile devices.']

In [27]:
baseUrl = 'https://github.com'

In [28]:
topicLinks = []
for url in topicLink:
    topicLinks.append(baseUrl+url['href'].strip())

In [29]:
topicLinks[:5]

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android']
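
String concatenation works here because every `href` starts with `/`. As an alternative sketch, `urllib.parse.urljoin` from the standard library also copes with an `href` that is already absolute. The sample `hrefs` list below is illustrative, not scraped data.

```python
from urllib.parse import urljoin

base_url = "https://github.com"
hrefs = ["/topics/3d", " /topics/ajax ", "https://github.com/topics/android"]

# urljoin resolves relative paths against the base URL and leaves
# absolute URLs unchanged; strip() removes stray whitespace first.
topic_links = [urljoin(base_url, href.strip()) for href in hrefs]
print(topic_links)
```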

## 4. Create CSV file(s) with the extracted information
- Create functions for the end-to-end process of downloading, parsing, and saving CSVs.
- Execute the function with different inputs to create a dataset of CSV files.
- Verify the information in the CSV files by reading them back using Pandas.
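
The verification step can be sketched as a write-then-read round trip. This is a minimal example under stated assumptions: the two-row DataFrame and the file name `topics.csv` are made up for illustration, and a temporary directory is used so that nothing is left on disk.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({
    "Topic Title": ["3D", "Ajax"],
    "Topic URL": [
        "https://github.com/topics/3d",
        "https://github.com/topics/ajax",
    ],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "topics.csv")
    # index=False keeps the row index out of the file.
    df.to_csv(csv_path, index=False)
    df_check = pd.read_csv(csv_path)

# The file read back should have the same columns and row count.
assert list(df_check.columns) == list(df.columns)
assert len(df_check) == len(df)
print(df_check.head())
```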

In [30]:
df1Dict = {
    "Topic Title":topicNames,
    "Topic Description":topicDescriptions,
    "Topic URL":topicLinks,
}

In [31]:
df1 = pd.DataFrame(df1Dict)

In [32]:
df1.head()

Unnamed: 0,Topic Title,Topic Description,Topic URL
0,3D,3D modeling is the process of virtually develo...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency framework fo...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android


In [33]:
df1.to_csv('topicDataFrame.csv', index=False)  # index=False omits the row index column

In [34]:
profile = pp.ProfileReport(df1)
profile.to_file(output_file='output.html')

Summarize dataset: 100%|██████████| 16/16 [00:03<00:00,  5.12it/s, Completed]
Generate report structure: 100%|██████████| 1/1 [00:01<00:00,  1.12s/it]
Render HTML: 100%|██████████| 1/1 [00:00<00:00,  1.64it/s]
Export report to file: 100%|██████████| 1/1 [00:00<00:00,  6.97it/s]


In [35]:
topicLinkPage = topicLinks[29]

In [36]:
topicLinkPage

'https://github.com/topics/cpp'

In [37]:
response1 = requests.get(topicLinkPage)
response1.status_code

200

In [38]:
len(response1.text)

638088

In [39]:
topicInfo = BeautifulSoup(response1.text,'html.parser')

In [40]:
repo_unClass = 'f3 color-text-secondary text-normal lh-condensed'
repo_un = topicInfo.find_all('h1',repo_unClass)


In [41]:
repo_un[0]

<h1 class="f3 color-text-secondary text-normal lh-condensed">
<a data-ga-click="Explore, go to repository owner, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":36260787,"originating_url":"https://github.com/topics/cpp","user_id":null}}' data-hydro-click-hmac="62c5fcd0d58a5057103d91a8bff8d5aa6514b22f1d48ad914bf5800c85feb280" data-view-component="true" href="/CyC2018">
            CyC2018
</a>          /
          <a class="text-bold" data-ga-click="Explore, go to repository, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":121395510,"originating_url":"https://github.com/topics/cpp","user_id":null}}' data-hydro-click-hmac="f4a2f63e271a64f

In [42]:
# username = repo_un[0].find_all('a')
# username[0].text.strip()
# a = repo_un[0].find_all('a')
# a[1]['href']
usernames = []
repoNames = []
repoLinks = []
for ru in repo_un:
    un = ru.find_all('a')
    usernames.append(un[0].text.strip())
    repoNames.append(un[1].text.strip())
    repoLinks.append(baseUrl+un[1]['href'])

In [43]:
len(usernames)

30

In [44]:
usernames[:5]

['CyC2018', 'tuvtran', 'azl397985856', 'x64dbg', 'fffaraz']

In [45]:
len(repoNames)

30

In [46]:
repoNames[:5]

['CS-Notes', 'project-based-learning', 'leetcode', 'x64dbg', 'awesome-cpp']

In [47]:
len(repoLinks)

30

In [48]:
repoLinks[:5]

['https://github.com/CyC2018/CS-Notes',
 'https://github.com/tuvtran/project-based-learning',
 'https://github.com/azl397985856/leetcode',
 'https://github.com/x64dbg/x64dbg',
 'https://github.com/fffaraz/awesome-cpp']

In [49]:
starClass = 'social-count float-none'
star = topicInfo.find_all('a',starClass)

In [50]:
stars = []
for st in star:
    text = st.text.strip()
    if text[-1] == 'k':
        # '50.9k' -> 50900.0
        stars.append(float(text[:-1]) * 1000)
    else:
        # Counts below 1000 carry no suffix and must not be scaled.
        stars.append(float(text))

In [51]:
len(stars)

30

In [52]:
stars[:5]

[132000.0, 50900.0, 42400.0, 36900.0, 31800.0]
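
The star labels arrive as text such as `132k` or, for smaller repositories, a plain number. A small helper makes the conversion explicit and returns whole numbers. `parse_star_count` is a hypothetical name, and the sketch assumes that `k` is the only suffix that appears and that plain numbers may contain thousands separators.

```python
def parse_star_count(label):
    """Convert a star label such as '50.9k' or '950' to an integer count."""
    text = label.strip().lower()
    if text.endswith("k"):
        # round() guards against floating-point artifacts in the multiplication.
        return round(float(text[:-1]) * 1000)
    # Plain counts have no suffix; drop any thousands separators.
    return int(text.replace(",", ""))


print([parse_star_count(s) for s in ["132k", "50.9k", "950"]])
```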

In [53]:
topicPageDfDict = {
    'Repository_Name':repoNames,
    'Username':usernames,
    'Stars':stars,
    'Repository_URL':repoLinks
}
topicDf = pd.DataFrame(topicPageDfDict)

In [54]:
len(topicDf)

30

In [55]:
topicDf[:5]

Unnamed: 0,Repository_Name,Username,Stars,Repository_URL
0,CS-Notes,CyC2018,132000.0,https://github.com/CyC2018/CS-Notes
1,project-based-learning,tuvtran,50900.0,https://github.com/tuvtran/project-based-learning
2,leetcode,azl397985856,42400.0,https://github.com/azl397985856/leetcode
3,x64dbg,x64dbg,36900.0,https://github.com/x64dbg/x64dbg
4,awesome-cpp,fffaraz,31800.0,https://github.com/fffaraz/awesome-cpp


In [56]:
allDf = []

In [57]:
def topicsInfo(tpklnks=topicLinks[0]):
    """Scrape one topic page, save its repositories to a CSV file, and
    append the resulting DataFrame to the global allDf list."""
    response1 = requests.get(tpklnks)
    topicInfo = BeautifulSoup(response1.text, 'html.parser')

    # Each matching <h1> holds two links: the owner and the repository.
    repo_unClass = 'f3 color-text-secondary text-normal lh-condensed'
    repo_un = topicInfo.find_all('h1', repo_unClass)

    usernames = []
    repoNames = []
    repoLinks = []
    stars = []

    for ru in repo_un:
        un = ru.find_all('a')
        usernames.append(un[0].text.strip())
        repoNames.append(un[1].text.strip())
        repoLinks.append(baseUrl + un[1]['href'])

    starClass = 'social-count float-none'
    star = topicInfo.find_all('a', starClass)

    for st in star:
        text = st.text.strip()
        if text[-1] == 'k':
            # '50.9k' -> 50900.0
            stars.append(float(text[:-1]) * 1000)
        else:
            # Counts below 1000 carry no suffix and must not be scaled.
            stars.append(float(text))

    topicPageDfDict = {
        'Repository_Name': repoNames,
        'Username': usernames,
        'Stars': stars,
        'Repository_URL': repoLinks,
    }
    topicDf = pd.DataFrame(topicPageDfDict)

    # tpklnks[25:] keeps the leading slash (e.g. '/cpp'), so the file is written
    # to 'Topics Info/cpp.csv'. The 'Topics Info' directory must already exist.
    outputCsvFile = 'Topics Info' + tpklnks[25:] + '.csv'
    topicDf.to_csv(outputCsvFile, index=False)

    # tpklnks[26:] is the topic name without the slash (e.g. 'cpp').
    topicDf["Topic Class"] = tpklnks[26:]
    allDf.append(topicDf)
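
The slices `tpklnks[25:]` and `tpklnks[26:]` depend on the exact length of `https://github.com/topics/`. A length-independent alternative can be sketched with the standard library; `topic_slug` is a hypothetical helper, not part of the notebook.

```python
from urllib.parse import urlparse


def topic_slug(topic_url):
    """Return the last path segment of a topic URL, e.g. 'cpp'."""
    path = urlparse(topic_url).path
    # rstrip() tolerates a trailing slash; rsplit() keeps the final segment.
    return path.rstrip("/").rsplit("/", 1)[-1]


print(topic_slug("https://github.com/topics/cpp"))
```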

In [58]:
for links in topicLinks:
    topicsInfo(links)
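
The loop above issues one request per topic with no pause, and a single network error stops the whole run. A hedged sketch of a more forgiving loop follows; `crawl`, `fetch`, and `delay_seconds` are hypothetical names, and the one-second default is an arbitrary choice rather than a documented GitHub limit.

```python
import time


def crawl(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between calls.

    Returns (results, failures): two dictionaries keyed by URL.
    """
    results, failures = {}, {}
    for position, url in enumerate(urls):
        if position > 0:
            time.sleep(delay_seconds)  # pause between consecutive requests
        try:
            results[url] = fetch(url)
        except Exception as error:  # record the failure and keep going
            failures[url] = error
    return results, failures
```

With the notebook's names this could be called as `results, failures = crawl(topicLinks, topicsInfo)`, after which `failures` lists the topics worth retrying.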

In [59]:
len(allDf)

30

In [60]:
topicLinks[29]

'https://github.com/topics/cpp'

In [61]:
# Combine the per-topic DataFrames into one.
dfAll = pd.concat(allDf)
dfAll.to_csv('allinfo.csv',index=None)
finalProfile = pp.ProfileReport(dfAll)
finalProfile.to_file(output_file='finalReport.html')

Summarize dataset: 100%|██████████| 19/19 [00:01<00:00, 13.59it/s, Completed]
Generate report structure: 100%|██████████| 1/1 [00:02<00:00,  2.57s/it]
Render HTML: 100%|██████████| 1/1 [00:00<00:00,  2.13it/s]
Export report to file: 100%|██████████| 1/1 [00:00<00:00,  7.48it/s]
