# Guidelines for creating a web scraping project
## 1. Pick a website and describe your objective
- Browse through different sites and pick one to scrape. Check the "Project Ideas" section for inspiration.
- Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
- Summarize your project idea and outline your strategy in a Jupyter notebook. Use the "New" button above.

### Outline
- We are going to scrape https://github.com/topics
- We will get a list of topics. For each topic, we will get the repository name, username, stars and page URL of its repositories.
- For each topic, we will collect information on 25 repositories.
- Each set of 25 repositories will be saved as a CSV file in this format:
```
Repository Name,Username,Stars,Page URL
three.js,mrdoob,71400,https://github.com/mrdoob/three.js
libgdx,libgdx,18500,https://github.com/libgdx/libgdx
```
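
As a minimal sketch of the target format, the snippet below writes the two sample rows above with Python's standard `csv` module. The helper name `rows_to_csv` and the `FIELDNAMES` constant are illustrative assumptions, not part of the project code, which uses pandas for the same job later on.

```python
import csv
import io

# Column order taken from the sample output above
FIELDNAMES = ["Repository Name", "Username", "Stars", "Page URL"]


def rows_to_csv(rows):
    """Return a list of row dictionaries as CSV text (hypothetical helper)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


sample_rows = [
    {"Repository Name": "three.js", "Username": "mrdoob", "Stars": 71400,
     "Page URL": "https://github.com/mrdoob/three.js"},
    {"Repository Name": "libgdx", "Username": "libgdx", "Stars": 18500,
     "Page URL": "https://github.com/libgdx/libgdx"},
]
print(rows_to_csv(sample_rows))
```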

### Install Modules

!pip install requests --upgrade --quiet
!pip install beautifulsoup4 --upgrade --quiet
!pip install pandas pandas-profiling --upgrade --quiet

### Import Modules

In [2]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import pandas_profiling as pp 
import os

## 2. Use the requests library to download web pages
- Inspect the website's HTML source and identify the right URLs to download.
- Download and save web pages locally using the requests library.
- Create a function to automate downloading for different topics/search queries.
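
The steps in this list can be sketched as a pair of small functions. This is only one possible shape: the names `topic_page_url` and `download_page`, the `timeout` value, and the output file name in the usage note are assumptions made for illustration.

```python
import requests

BASE_TOPICS_URL = "https://github.com/topics"  # the page scraped in this notebook


def topic_page_url(topic=None):
    """Return the topics index URL, or the URL of one topic page if a name is given."""
    return BASE_TOPICS_URL if topic is None else f"{BASE_TOPICS_URL}/{topic}"


def download_page(url, path, timeout=10):
    """Download `url`, save the HTML to `path`, and return the number of characters saved."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise an HTTPError for 4xx/5xx responses
    with open(path, "w", encoding="utf-8") as f:
        f.write(response.text)
    return len(response.text)
```

With a working connection, `download_page(topic_page_url("3d"), "3d.html")` would save one topic page locally. Passing a `timeout` keeps the call from hanging indefinitely if the server does not answer.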

In [3]:
pageLink = 'https://github.com/topics'
response = requests.get(pageLink)

ConnectionError: HTTPSConnectionPool(host='github.com', port=443): Max retries exceeded with url: /topics (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x000001C2E3BE9520>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed'))
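
The `getaddrinfo failed` message above typically means the host name could not be resolved, which usually points to a missing network connection or a DNS problem rather than a bug in the code. A minimal retry sketch follows; the function name `get_with_retry`, the attempt count, and the delay are assumptions, and the `get` parameter exists only so the function can be exercised without a network.

```python
import time

import requests


def get_with_retry(url, attempts=3, delay=2.0, get=requests.get):
    """Call get(url) up to `attempts` times, sleeping `delay` seconds between tries."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return get(url, timeout=10)
        except requests.exceptions.ConnectionError as error:
            # Remember the failure and try again unless this was the last attempt
            last_error = error
            print(f"Attempt {attempt} of {attempts} failed: {error}")
            if attempt < attempts:
                time.sleep(delay)
    raise last_error
```

If every attempt fails, the last `ConnectionError` is raised again so the notebook stops at the download cell instead of failing later on an undefined `response`.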

In [None]:
response.status_code

In [None]:
len(response.text)

In [None]:
pageContent = response.text

In [None]:
pageContent[:1000]

In [None]:
# Write with an explicit encoding so non-ASCII characters are saved correctly on every platform
with open('main.html', 'w', encoding='utf-8') as f:
    f.write(pageContent)

## 3. Use Beautiful Soup to parse and extract information
- Parse and explore the structure of downloaded web pages using Beautiful Soup.
- Use the right properties and methods to extract the required information.
- Create functions to extract from the page into lists and dictionaries.
- (Optional) Use a REST API to acquire additional information if required.
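
As a rough sketch of the "extract into lists and dictionaries" step, the function below parses a small made-up HTML snippet. The markup, the class names (`topic`, `title`, `desc`) and the function name `extract_topics` are all assumptions for illustration; the real GitHub page uses different classes, which are explored in the cells that follow.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded page; the real class names differ
SAMPLE_HTML = """
<div class="topic"><a href="/topics/3d"><p class="title">3D</p><p class="desc">3D modeling.</p></a></div>
<div class="topic"><a href="/topics/ajax"><p class="title">Ajax</p><p class="desc">Async web apps.</p></a></div>
"""


def extract_topics(html):
    """Return one dictionary per topic block found in `html`."""
    soup = BeautifulSoup(html, "html.parser")
    topics = []
    for block in soup.find_all("div", class_="topic"):
        topics.append({
            "title": block.find("p", class_="title").text.strip(),
            "description": block.find("p", class_="desc").text.strip(),
            "url": block.find("a")["href"],
        })
    return topics


print(extract_topics(SAMPLE_HTML))
```

Collecting each topic as a single dictionary keeps the title, description and URL of one topic together, whereas three parallel lists can silently fall out of step if one element is missing on the page.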

In [None]:
soup = BeautifulSoup(pageContent, 'html.parser')

In [None]:
type(soup)

In [None]:
topicNameClass = 'f3 lh-condensed mb-0 mt-1 Link--primary'
# The second positional argument of find_all() is treated as a CSS class filter,
# so this is equivalent to soup.find_all('p', class_=topicNameClass)
topicNameParagraph = soup.find_all('p', topicNameClass)
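
One caveat worth illustrating: when the class filter is a string containing several class names, Beautiful Soup compares it against the full value of the `class` attribute, so the same classes in a different order do not match. The sketch below uses two made-up paragraphs to contrast this with a CSS selector, which matches on individual classes regardless of order. The HTML here is an assumption, not a copy of the GitHub page.

```python
from bs4 import BeautifulSoup

# Two made-up paragraphs with the same classes in a different order
html = """
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="mb-0 f3 lh-condensed mt-1 Link--primary">Ajax</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Multi-class string: matched against the whole attribute value
exact = soup.find_all("p", "f3 lh-condensed mb-0 mt-1 Link--primary")

# CSS selector: matches any <p> that carries both classes, in any order
flexible = soup.select("p.f3.lh-condensed")

print([p.text for p in exact])
print([p.text for p in flexible])
```

If the site reorders or adds a class, the selector form is more likely to keep working, at the cost of possibly matching more elements than intended.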

In [None]:
len(topicNameParagraph)

In [None]:
topicNameParagraph[:5]

In [None]:
topicDescriptionClass = 'f5 color-text-secondary mb-0 mt-1'
topicDescriptionParagraph = soup.find_all('p',topicDescriptionClass)

In [None]:
len(topicDescriptionParagraph)

In [None]:
topicDescriptionParagraph[:5]

In [None]:
topicLinkClass = 'd-flex no-underline'
topicLink = soup.find_all('a',topicLinkClass)

In [None]:
len(topicLink)

In [None]:
topicLink[:5]

In [None]:
topicLink[0]['href']

In [None]:
topicDescriptionParagraph[0].text.strip()

In [None]:
topicNames = []
for name in topicNameParagraph:
    topicNames.append(name.text.strip())

In [None]:
topicNames[:5]

In [None]:
topicDescriptions = []
for description in topicDescriptionParagraph:
    topicDescriptions.append(description.text.strip())

In [None]:
topicDescriptions[:5]

In [None]:
baseUrl = 'https://github.com'

In [None]:
topicLinks = []
for url in topicLink:
    topicLinks.append(baseUrl+url['href'].strip())

In [None]:
topicLinks[:5]
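
String concatenation works here because every `href` on the page is a path starting with `/`. As an alternative sketch, `urllib.parse.urljoin` from the standard library also copes with links that are already absolute. The helper name `absolute_url` is an assumption for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://github.com"


def absolute_url(href, base=BASE_URL):
    """Resolve a relative or absolute href against the base URL."""
    return urljoin(base, href.strip())


print(absolute_url("/topics/3d"))
print(absolute_url("https://github.com/topics/ajax"))
```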

## 4. Create CSV file(s) with the extracted information
- Create functions for the end-to-end process of downloading, parsing, and saving CSVs.
- Execute the function with different inputs to create a dataset of CSV files.
- Verify the information in the CSV files by reading them back using Pandas.
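
The verification step in the last bullet can be sketched as a round trip: write a DataFrame to CSV, read it back, and compare. The sample rows and the file name `topics.csv` below are assumptions; a temporary folder is used so the example leaves no files behind.

```python
import os
import tempfile

import pandas as pd

# Small made-up sample standing in for the scraped topic data
df = pd.DataFrame({
    "Topic Title": ["3D", "Ajax"],
    "Topic URL": ["https://github.com/topics/3d", "https://github.com/topics/ajax"],
})

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "topics.csv")
    df.to_csv(path, index=False)   # index=False keeps the row index out of the file
    read_back = pd.read_csv(path)

same_columns = list(read_back.columns) == list(df.columns)
same_rows = read_back.to_dict("records") == df.to_dict("records")
print(same_columns, same_rows)
```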

In [None]:
df1Dict = {
    "Topic Title":topicNames,
    "Topic Description":topicDescriptions,
    "Topic URL":topicLinks,
}

In [None]:
df1 = pd.DataFrame(df1Dict)

In [None]:
df1.head()

In [None]:
df1.to_csv('topicDataFrame.csv',index=None)

In [None]:
profile = pp.ProfileReport(df1)
profile.to_file(output_file='output.html')

In [None]:
topicLinkPage = topicLinks[29]

In [None]:
topicLinkPage

In [None]:
response1 = requests.get(topicLinkPage)
response1.status_code

In [None]:
len(response1.text)

In [None]:
topicInfo = BeautifulSoup(response1.text,'html.parser')

In [None]:
repo_unClass = 'f3 color-text-secondary text-normal lh-condensed'
repo_un = topicInfo.find_all('h1',repo_unClass)


In [None]:
repo_un[0]

In [None]:
# Each heading contains two links: the first is the username, the second is the repository
usernames = []
repoNames = []
repoLinks = []
for ru in repo_un:
    un = ru.find_all('a')
    usernames.append(un[0].text.strip())
    repoNames.append(un[1].text.strip())
    repoLinks.append(baseUrl+un[1]['href'])

In [None]:
len(usernames)

In [None]:
usernames[:5]

In [None]:
len(repoNames)

In [None]:
repoNames[:5]

In [None]:
len(repoLinks)

In [None]:
repoLinks[:5]

In [None]:
starClass = 'social-count float-none'
star = topicInfo.find_all('a',starClass)

In [None]:
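stars = []
for st in star:
    text = st.text.strip()
    if text[-1] == 'k':
        # Abbreviated counts such as '71.4k' are given in thousands
        s = float(text[:-1]) * 1000
    else:
        # Plain counts such as '523' are already exact
        s = float(text)
    stars.append(s)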
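
Star counts appear on the page as abbreviated strings such as `71.4k`, so only values ending in `k` should be scaled by 1000. A small helper makes that rule explicit and easy to check on its own. The name `parse_star_count` is an assumption, and returning an integer rather than a float is a design choice of this sketch.

```python
def parse_star_count(text):
    """Convert a star label such as '71.4k' or '523' to an integer count."""
    text = text.strip().replace(",", "").lower()
    if text.endswith("k"):
        # Round to absorb floating-point error before converting to int
        return int(round(float(text[:-1]) * 1000))
    return int(text)


print(parse_star_count("71.4k"))
print(parse_star_count("523"))
```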

In [None]:
len(stars)

In [None]:
stars[:5]

In [None]:
topicPageDfDict = {
    'Repository_Name':repoNames,
    'Username':usernames,
    'Stars':stars,
    'Repository_URL':repoLinks
}
topicDf = pd.DataFrame(topicPageDfDict)

In [None]:
len(topicDf)

In [None]:
topicDf[:5]

In [None]:
allDf = []

In [None]:
def topicsInfo(tpklnks = topicLinks[0]):
    topicLinkPage = tpklnks
    response1 = requests.get(topicLinkPage)
    topicInfo = BeautifulSoup(response1.text,'html.parser')

    repo_unClass = 'f3 color-text-secondary text-normal lh-condensed'
    repo_un = topicInfo.find_all('h1',repo_unClass)

    usernames = []
    repoNames = []
    repoLinks = []
    stars = []
    
    for ru in repo_un:
        un = ru.find_all('a')

        usernames.append(un[0].text.strip())
        repoNames.append(un[1].text.strip())
        repoLinks.append(baseUrl+un[1]['href'])
    
    starClass = 'social-count float-none'
    star = topicInfo.find_all('a',starClass)

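    for st in star:
        text = st.text.strip()
        if text[-1] == 'k':
            # Abbreviated counts such as '71.4k' are given in thousands
            s = float(text[:-1]) * 1000
        else:
            # Plain counts such as '523' are already exact
            s = float(text)
        stars.append(s)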

    topicPageDfDict = {
        'Repository_Name':repoNames,
        'Username':usernames,
        'Stars':stars,
        'Repository_URL':repoLinks,
        }

    topicDf=pd.DataFrame(topicPageDfDict)
    # topicDfs.append(topicDf)

    os.makedirs('Topics Info',exist_ok=True)
    outputCsvFile = 'Topics Info'+tpklnks[25:]+'.csv'
    topicDf.to_csv(outputCsvFile,index=None)
    # outputHtmlFile = 'Topics Report'+tpklnks[25:]+'.html'
    # profile1 = pp.ProfileReport(topicDf)
    # profile1.to_file(output_file=outputHtmlFile)
      
    # topicDf['Topic Name'] = tpklnks[25:]
    # topicsDfs=pd.concat([topicsDfs,topicDf])
    topicDf["Topic Class"]=tpklnks[26:]
    allDf.append(topicDf)
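
The function above derives the output file name by slicing the URL at fixed offsets (`tpklnks[25:]` and `tpklnks[26:]`), which only works while every link starts with exactly `https://github.com/topics`. A sketch of a less position-dependent approach using the standard library is shown below. The helper names `topic_slug` and `csv_path_for` are assumptions for illustration.

```python
import os
from urllib.parse import urlparse


def topic_slug(topic_url):
    """Return the last path segment of a topic URL, e.g. '3d'."""
    return urlparse(topic_url).path.rstrip("/").split("/")[-1]


def csv_path_for(topic_url, folder="Topics Info"):
    """Build the CSV path for a topic inside the output folder."""
    return os.path.join(folder, topic_slug(topic_url) + ".csv")


print(topic_slug("https://github.com/topics/3d"))
print(csv_path_for("https://github.com/topics/3d"))
```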

In [None]:
for links in topicLinks:
    topicsInfo(links)
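
The loop above sends one request per topic in quick succession and stops at the first error. A hedged alternative is sketched below: it pauses between requests and records failures instead of aborting. The function name `scrape_all`, the one-second delay, and the choice to catch every `Exception` are assumptions of this sketch rather than requirements.

```python
import time


def scrape_all(links, scrape_one, delay=1.0):
    """Call scrape_one(link) for each link and return a list of (link, error) failures."""
    failures = []
    for index, link in enumerate(links):
        try:
            scrape_one(link)
        except Exception as error:  # broad on purpose: keep going after any failure
            failures.append((link, error))
        if index < len(links) - 1:
            time.sleep(delay)  # pause between requests to avoid hammering the site
    return failures
```

In this notebook it could be called as `failures = scrape_all(topicLinks, topicsInfo)`, after which `failures` lists any topic pages that need another attempt.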

In [None]:
len(allDf)

In [None]:
topicLinks[29]

In [None]:
# Combine the per-topic DataFrames into one; ignore_index gives the result a fresh 0..n-1 index
dfAll = pd.concat(allDf, ignore_index=True)
dfAll.to_csv('allinfo.csv',index=None)
finalProfile = pp.ProfileReport(dfAll)
finalProfile.to_file(output_file='finalReport.html')