# Guidelines for creating a web scraping project
## 1. Pick a website and describe your objective
- Browse through different sites and pick one to scrape. Check the "Project Ideas" section for inspiration.
- Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
- Summarize your project idea and outline your strategy in a Jupyter notebook. Use the "New" button above.

### Outline
- We are going to scrape https://github.com/topics
- We will get a list of topics. For each topic, we will get the repository name, username, stars, and page URL.
- We will collect information on 25 repositories for each topic.
- For each set of 25 repositories, we will create a CSV file in this format:
```
Repository Name,Username,Stars,Page URL
three.js,mrdoob,71400,https://github.com/mrdoob/three.js
libgdx,libgdx,18500,https://github.com/libgdx/libgdx
```
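
As a minimal sketch of how rows in this format could be produced, the example below uses Python's standard `csv` module. The helper name `rows_to_csv` is hypothetical and not part of the notebook, which relies on pandas for its CSV output later on.

```python
import csv
import io

# Column order taken from the outline above
HEADER = ["Repository Name", "Username", "Stars", "Page URL"]


def rows_to_csv(rows):
    """Return CSV text: the outline's header followed by the given rows."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(rows)
    return buffer.getvalue()


# The two sample rows shown in the outline
sample_rows = [
    ("three.js", "mrdoob", 71400, "https://github.com/mrdoob/three.js"),
    ("libgdx", "libgdx", 18500, "https://github.com/libgdx/libgdx"),
]

print(rows_to_csv(sample_rows))
```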

### Install Modules

In [232]:
!pip install requests --upgrade --quiet
!pip install beautifulsoup4 --upgrade --quiet

### Import Modules

In [233]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import pandas_profiling as pp 

In [234]:
pageLink = 'https://github.com/topics'
response = requests.get(pageLink)

In [235]:
response.status_code

200

In [236]:
len(response.text)

129066

## 2. Use the requests library to download web pages
- Inspect the website's HTML source and identify the right URLs to download.
- Download and save web pages locally using the requests library.
- Create a function to automate downloading for different topics/search queries.
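
The download steps listed above could be wrapped in a small function along these lines. This is only a sketch: the names `topic_url` and `download_page` and the `get` parameter are hypothetical, and the `/topics/<name>` URL pattern is assumed from the topic links seen on the page. The `get` parameter exists so the function can be tried with a stub instead of a live request.

```python
from pathlib import Path

BASE_URL = "https://github.com"


def topic_url(topic):
    """Build a topic page URL, e.g. 'cpp' -> 'https://github.com/topics/cpp'."""
    return f"{BASE_URL}/topics/{topic}"


def download_page(url, out_path, get=None):
    """Download `url`, save the HTML to `out_path`, and return the text.

    `get` is any callable that behaves like `requests.get`; it defaults to
    `requests.get` itself.
    """
    if get is None:
        import requests  # third-party; imported lazily so a stub can replace it
        get = requests.get

    response = get(url)
    if response.status_code != 200:
        raise RuntimeError(f"Failed to load {url} (status {response.status_code})")

    # Write as UTF-8 so the result does not depend on the platform default
    Path(out_path).write_text(response.text, encoding="utf-8")
    return response.text
```

A call such as `download_page(topic_url("cpp"), "cpp.html")` would then fetch and store a single topic page.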

In [237]:
pageContent = response.text

In [238]:
pageContent[:1000]

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n\n\n\n  <link crossorigin="anonymous" media="all" integrity="sha512-yiSzZyQYjjSm73vWjlt23NeW6XsLBACzGZv3ZHwxQ9zgCby0YYjfFxEAXErxlKcQ4ke40vqUiKcvuFn8QZfP1w==" rel="stylesheet" href="https://github.githubassets.com/assets/frameworks-ca24b36724188e34a6ef7bd68e5b76dc.css" />\n  \n    <link crossorigin="anonymous" media="all" integrity="sha512-vvcQc56Qj8Yfd1jyEk2XzB3+mWlykbMgyQQ/wwu4JU+YViiQnG8ItjwPyj3Jx5SjqqqOK/adpzLRqb1Fol1TuA==" rel="stylesheet" href="https://github.githubassets.com/assets/behaviors-bef710739e908fc61f7758f2124d97cc.css" />\n    \n    \

In [239]:
# Specify UTF-8 explicitly so the write does not depend on the platform's default encoding
with open('main.html', "w", encoding="utf-8") as f:
    f.write(pageContent)

## 3. Use Beautiful Soup to parse and extract information
- Parse and explore the structure of downloaded web pages using Beautiful Soup.
- Use the right properties and methods to extract the required information.
- Create functions to extract from the page into lists and dictionaries.
- (Optional) Use a REST API to acquire additional information if required.
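
As a rough sketch of the "extract into lists and dictionaries" step, the function below turns topic markup into a list of dictionaries. The function name `extract_topics` and the `SAMPLE_HTML` snippet are hypothetical; the snippet only imitates the class names used in this notebook, and GitHub's real markup may differ or change over time.

```python
from bs4 import BeautifulSoup

# Class strings as they appear in this notebook
NAME_CLASS = "f3 lh-condensed mb-0 mt-1 Link--primary"
DESCRIPTION_CLASS = "f5 color-text-secondary mb-0 mt-1"
LINK_CLASS = "d-flex no-underline"

# Hypothetical, simplified stand-in for the downloaded topics page
SAMPLE_HTML = """
<a class="d-flex no-underline" href="/topics/3d">
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
  <p class="f5 color-text-secondary mb-0 mt-1">
    3D modeling is the process of virtually developing the surface and structure of a 3D object.
  </p>
</a>
<a class="d-flex no-underline" href="/topics/ajax">
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>
  <p class="f5 color-text-secondary mb-0 mt-1">
    Ajax is a technique for creating interactive web applications.
  </p>
</a>
"""


def extract_topics(html, base_url="https://github.com"):
    """Return one dict per topic with its title, description, and URL."""
    soup = BeautifulSoup(html, "html.parser")
    names = soup.find_all("p", class_=NAME_CLASS)
    descriptions = soup.find_all("p", class_=DESCRIPTION_CLASS)
    links = soup.find_all("a", class_=LINK_CLASS)
    return [
        {
            "title": name.text.strip(),
            "description": description.text.strip(),
            "url": base_url + link["href"].strip(),
        }
        for name, description, link in zip(names, descriptions, links)
    ]


for topic in extract_topics(SAMPLE_HTML):
    print(topic["title"], topic["url"])
```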

In [240]:
soup = BeautifulSoup(pageContent, 'html.parser')

In [241]:
type(soup)

bs4.BeautifulSoup

In [242]:
topicNameClass = 'f3 lh-condensed mb-0 mt-1 Link--primary'
# A string passed as the second argument is treated as the CSS class,
# so this is equivalent to soup.find_all('p', class_=topicNameClass)
topicNameParagraph = soup.find_all('p', topicNameClass)

In [243]:
len(topicNameParagraph)

30

In [244]:
topicNameParagraph[:5]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>]

In [245]:
topicDescriptionClass = 'f5 color-text-secondary mb-0 mt-1'
topicDescriptionParagraph = soup.find_all('p',topicDescriptionClass)

In [246]:
len(topicDescriptionParagraph)

30

In [247]:
topicDescriptionParagraph[:5]

[<p class="f5 color-text-secondary mb-0 mt-1">
               3D modeling is the process of virtually developing the surface and structure of a 3D object.
             </p>,
 <p class="f5 color-text-secondary mb-0 mt-1">
               Ajax is a technique for creating interactive web applications.
             </p>,
 <p class="f5 color-text-secondary mb-0 mt-1">
               Algorithms are self-contained sequences that carry out a variety of tasks.
             </p>,
 <p class="f5 color-text-secondary mb-0 mt-1">
               Amp is a non-blocking concurrency framework for PHP.
             </p>,
 <p class="f5 color-text-secondary mb-0 mt-1">
               Android is an operating system built by Google designed for mobile devices.
             </p>]

In [248]:
topicLinkClass = 'd-flex no-underline'
topicLink = soup.find_all('a',topicLinkClass)

In [249]:
len(topicLink)

30

In [250]:
# topicLink[:5]

In [251]:
topicLink[0]['href']

'/topics/3d'

In [252]:
topicDescriptionParagraph[0].text.strip()

'3D modeling is the process of virtually developing the surface and structure of a 3D object.'

In [253]:
topicNames = []
for name in topicNameParagraph:
    topicNames.append(name.text.strip())

In [254]:
topicNames[:5]

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android']

In [255]:
topicDescriptions = []
for description in topicDescriptionParagraph:
    topicDescriptions.append(description.text.strip())

In [256]:
topicDescriptions[:5]

['3D modeling is the process of virtually developing the surface and structure of a 3D object.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency framework for PHP.',
 'Android is an operating system built by Google designed for mobile devices.']

In [257]:
baseUrl = 'https://github.com'

In [258]:
topicLinks = []
for url in topicLink:
    topicLinks.append(baseUrl+url['href'].strip())

In [259]:
topicLinks[:5]

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android']
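
Relative links can also be resolved with the standard library's `urllib.parse.urljoin`, which copes with both relative and already-absolute `href` values. The sketch below is an alternative to plain string concatenation, not the approach used in this notebook; the `hrefs` list is made-up sample input.

```python
from urllib.parse import urljoin

base_url = "https://github.com"

# Sample href values: two relative paths and one URL that is already absolute
hrefs = ["/topics/3d", "/topics/ajax", "https://github.com/topics/android"]

absolute_links = [urljoin(base_url, href.strip()) for href in hrefs]
print(absolute_links)
```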

## 4. Create CSV file(s) with the extracted information
- Create functions for the end-to-end process of downloading, parsing, and saving CSVs.
- Execute the function with different inputs to create a dataset of CSV files.
- Verify the information in the CSV files by reading them back using Pandas.
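
The verification step could look roughly like the sketch below, which writes a DataFrame to CSV, reads it back with Pandas, and compares the two. The helper name `save_and_verify` and the sample data are hypothetical; the sample rows reuse the values from the outline.

```python
import os
import tempfile

import pandas as pd


def save_and_verify(df, path):
    """Write `df` to CSV without the index, read it back, and compare contents."""
    df.to_csv(path, index=False)
    loaded = pd.read_csv(path)
    same_columns = list(loaded.columns) == list(df.columns)
    same_rows = loaded.to_dict("records") == df.to_dict("records")
    return loaded, same_columns and same_rows


sample = pd.DataFrame(
    {
        "Repository Name": ["three.js", "libgdx"],
        "Username": ["mrdoob", "libgdx"],
        "Stars": [71400, 18500],
        "Page URL": [
            "https://github.com/mrdoob/three.js",
            "https://github.com/libgdx/libgdx",
        ],
    }
)

# A temporary folder keeps the example from leaving files behind
with tempfile.TemporaryDirectory() as folder:
    loaded, ok = save_and_verify(sample, os.path.join(folder, "sample.csv"))

print(ok)
```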

In [260]:
df1Dict = {
    "Topic Title":topicNames,
    "Topic Description":topicDescriptions,
    "Topic URL":topicLinks,
}

In [261]:
df1 = pd.DataFrame(df1Dict)

In [262]:
df1.head()

| | Topic Title | Topic Description | Topic URL |
|---|---|---|---|
| 0 | 3D | 3D modeling is the process of virtually develo... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency framework fo... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |


In [263]:
df1.to_csv('topicDataFrame.csv',index=None)

In [264]:
profile = pp.ProfileReport(df1)
profile.to_file(output_file='output.html')

Summarize dataset: 100%|██████████| 16/16 [00:01<00:00,  8.17it/s, Completed]
Generate report structure: 100%|██████████| 1/1 [00:01<00:00,  1.05s/it]
Render HTML: 100%|██████████| 1/1 [00:00<00:00,  5.54it/s]
Export report to file: 100%|██████████| 1/1 [00:00<00:00,  6.68it/s]


In [265]:
topicLinkPage = topicLinks[29]

In [266]:
topicLinkPage

'https://github.com/topics/cpp'

In [267]:
response1 = requests.get(topicLinkPage)
response1.status_code

200

In [268]:
len(response1.text)

638073

In [269]:
topicInfo = BeautifulSoup(response1.text,'html.parser')

In [270]:
repo_unClass = 'f3 color-text-secondary text-normal lh-condensed'
repo_un = topicInfo.find_all('h1',repo_unClass)


In [271]:
repo_un[0]

<h1 class="f3 color-text-secondary text-normal lh-condensed">
<a data-ga-click="Explore, go to repository owner, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":36260787,"originating_url":"https://github.com/topics/cpp","user_id":null}}' data-hydro-click-hmac="62c5fcd0d58a5057103d91a8bff8d5aa6514b22f1d48ad914bf5800c85feb280" data-view-component="true" href="/CyC2018">
            CyC2018
</a>          /
          <a class="text-bold" data-ga-click="Explore, go to repository, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":121395510,"originating_url":"https://github.com/topics/cpp","user_id":null}}' data-hydro-click-hmac="f4a2f63e271a64f

In [272]:
# Each <h1> holds two <a> tags: the first links to the repository owner,
# the second links to the repository itself (its href is the repository path)
usernames = []
repoNames = []
repoLinks = []
for ru in repo_un:
    un = ru.find_all('a')
    usernames.append(un[0].text.strip())
    repoNames.append(un[1].text.strip())
    repoLinks.append(baseUrl+un[1]['href'])

In [273]:
len(usernames)

30

In [274]:
usernames[:5]

['CyC2018', 'tuvtran', 'azl397985856', 'x64dbg', 'fffaraz']

In [275]:
len(repoNames)

30

In [276]:
repoNames[:5]

['CS-Notes', 'project-based-learning', 'leetcode', 'x64dbg', 'awesome-cpp']

In [277]:
len(repoLinks)

30

In [278]:
repoLinks[:5]

['https://github.com/CyC2018/CS-Notes',
 'https://github.com/tuvtran/project-based-learning',
 'https://github.com/azl397985856/leetcode',
 'https://github.com/x64dbg/x64dbg',
 'https://github.com/fffaraz/awesome-cpp']

In [279]:
starClass = 'social-count float-none'
star = topicInfo.find_all('a',starClass)

In [280]:
stars = []
for st in star:
    text = st.text.strip()
    if text[-1] == 'k':
        # Counts such as '50.9k' are abbreviated thousands
        s = float(text[:-1]) * 1000
    else:
        # Counts without a 'k' suffix are already exact
        s = float(text)
    stars.append(s)

In [281]:
len(stars)

30

In [282]:
stars[:5]

[131000.0, 50900.0, 42400.0, 36800.0, 31800.0]
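
The star-count conversion can be isolated in a small helper, which makes it easy to check on its own. The sketch below assumes counts look like `'131k'`, `'50.9k'`, or a plain number such as `'842'`; the handling of thousands separators is an extra assumption, and the function name `parse_star_count` is hypothetical. Unlike the cell above, it returns integers rather than floats.

```python
def parse_star_count(text):
    """Convert a star label such as '50.9k' or '842' into an integer count."""
    cleaned = text.strip().lower().replace(",", "")
    if cleaned.endswith("k"):
        # '50.9k' -> 50900; round() guards against floating-point noise
        return int(round(float(cleaned[:-1]) * 1000))
    return int(cleaned)


for label in ["131k", "50.9k", " 842 ", "1,234"]:
    print(label, "->", parse_star_count(label))
```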

In [283]:
topicPageDfDict = {
    'Repository_Name':repoNames,
    'Username':usernames,
    'Stars':stars,
    'Repository_URL':repoLinks
}
topicDf = pd.DataFrame(topicPageDfDict)

In [284]:
len(topicDf)

30

In [285]:
topicDf[:5]

| | Repository_Name | Username | Stars | Repository_URL |
|---|---|---|---|---|
| 0 | CS-Notes | CyC2018 | 131000.0 | https://github.com/CyC2018/CS-Notes |
| 1 | project-based-learning | tuvtran | 50900.0 | https://github.com/tuvtran/project-based-learning |
| 2 | leetcode | azl397985856 | 42400.0 | https://github.com/azl397985856/leetcode |
| 3 | x64dbg | x64dbg | 36800.0 | https://github.com/x64dbg/x64dbg |
| 4 | awesome-cpp | fffaraz | 31800.0 | https://github.com/fffaraz/awesome-cpp |


In [286]:
allDf = []

In [287]:
import os

def topicsInfo(tpklnks = topicLinks[0]):
    """Scrape one topic page, save its repositories to a CSV file, and add the DataFrame to allDf."""
    response1 = requests.get(tpklnks)
    topicInfo = BeautifulSoup(response1.text,'html.parser')

    # Each <h1> holds the owner link followed by the repository link
    repo_unClass = 'f3 color-text-secondary text-normal lh-condensed'
    repo_un = topicInfo.find_all('h1',repo_unClass)

    usernames = []
    repoNames = []
    repoLinks = []
    stars = []

    for ru in repo_un:
        un = ru.find_all('a')
        usernames.append(un[0].text.strip())
        repoNames.append(un[1].text.strip())
        repoLinks.append(baseUrl+un[1]['href'])

    starClass = 'social-count float-none'
    star = topicInfo.find_all('a',starClass)

    for st in star:
        text = st.text.strip()
        if text[-1] == 'k':
            # Counts such as '50.9k' are abbreviated thousands
            s = float(text[:-1]) * 1000
        else:
            # Counts without a 'k' suffix are already exact
            s = float(text)
        stars.append(s)

    topicPageDfDict = {
        'Repository_Name':repoNames,
        'Username':usernames,
        'Stars':stars,
        'Repository_URL':repoLinks,
        }
    topicDf = pd.DataFrame(topicPageDfDict)

    # tpklnks[25:] is '/<topic>' (for example '/cpp'), so each CSV file is
    # written into the 'Topics Info' folder, which must exist first
    topicName = tpklnks[25:].lstrip('/')
    os.makedirs('Topics Info', exist_ok=True)
    topicDf.to_csv(os.path.join('Topics Info', topicName+'.csv'),index=None)

    # Tag every row with its topic so the combined dataset stays traceable
    topicDf["Topic Class"] = topicName
    allDf.append(topicDf)
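
The topic name used for the output file can also be derived from the URL path instead of a fixed character offset, so it keeps working if the base URL changes length. This is a minimal sketch; the helper names `topic_slug` and `csv_name_for` and the `topics_info_<topic>.csv` naming scheme are hypothetical.

```python
from urllib.parse import urlparse


def topic_slug(url):
    """Return the last path segment of a topic URL, e.g. 'cpp'."""
    path = urlparse(url).path.rstrip("/")
    return path.rsplit("/", 1)[-1]


def csv_name_for(url):
    """Build a flat CSV file name for a topic URL (naming scheme is illustrative)."""
    return f"topics_info_{topic_slug(url)}.csv"


print(topic_slug("https://github.com/topics/cpp"))
print(csv_name_for("https://github.com/topics/3d"))
```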

In [288]:
for links in topicLinks:
    topicsInfo(links)

In [289]:
len(allDf)

30

In [290]:
topicLinks[29]

'https://github.com/topics/cpp'

In [291]:
# Combine the per-topic DataFrames into a single dataset and save it
dfAll = pd.concat(allDf)
dfAll.to_csv('allinfo.csv',index=None)