In [595]:
import pyforest

In [596]:
pyforest.lazy_imports()

['import sklearn',
 'import plotly.express as px',
 'from sklearn.linear_model import Ridge',
 'from scipy import stats',
 'from fbprophet import Prophet',
 'from sklearn.ensemble import RandomForestClassifier',
 'from statsmodels.tsa.arima_model import ARIMA',
 'import statsmodels.api as sm',
 'from sklearn.model_selection import RandomizedSearchCV',
 'from openpyxl import load_workbook',
 'from sklearn.model_selection import StratifiedKFold',
 'import seaborn as sns',
 'import altair as alt',
 'from sklearn.linear_model import RidgeCV',
 'from sklearn.decomposition import PCA',
 'from sklearn.preprocessing import OneHotEncoder',
 'from pyspark import SparkContext',
 'from sklearn.preprocessing import PolynomialFeatures',
 'from scipy import signal as sg',
 'from sklearn.preprocessing import LabelEncoder',
 'from sklearn import svm',
 'from sklearn.linear_model import LinearRegression',
 'import pickle',
 'import awswrangler as wr',
 'from sklearn.ensemble import GradientBoostingClass

## Building a Python Web Scraping Project From Scratch


![](https://i.imgur.com/6zM7JBq.png)

Web scraping is the process of extracting and parsing data from websites in an automated fashion using a computer program. It's a useful technique for creating datasets for research and learning. Follow these steps to build a web scraping project from scratch using Python and its ecosystem of libraries:

1. **Pick a website and describe your objective**

    - Browse through different sites and pick one to scrape. Check the "Project Ideas" section for inspiration.
    - Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
    - Summarize your project idea and outline your strategy in a Jupyter notebook.


2. **Use the requests library to download web pages**

    - Inspect the website's HTML source and identify the right URLs to download.
    - Download and save web pages locally using the `requests` library.
    - Create a function to automate downloading for different topics/search queries.


3. **Use Beautiful Soup to parse and extract information**

    - Parse and explore the structure of downloaded web pages using Beautiful Soup.
    - Use the right properties and methods to extract the required information.
    - Create functions to extract from the page into lists and dictionaries.
    - (Optional) Use a REST API to acquire additional information if required.


4. **Create CSV file(s) with the extracted information**

    - Create functions for the end-to-end process of downloading, parsing, and saving CSVs.
    - Execute the function with different inputs to create a dataset of CSV files.
    - Verify the information in the CSV files by reading them back using [Pandas](https://pandas.pydata.org).


5. **Document and share your work**

    - Add proper headings and documentation in your Jupyter notebook.
   




## WE ARE GOING TO SCRAPE https://github.com/topics IN THIS PROJECT



### OUTLINE:
- WE WILL START BY BROWSING THROUGH THE MOST POPULAR TOPICS ON GITHUB'S TOPICS PAGE.
- WE WILL GET THE POPULAR TOPIC TITLE AND TOPIC PAGE URL.
- FOR EACH TOPIC WE WILL GET TOP 25 REPOSITORIES FROM THE TOPIC PAGE.
- FOR EACH REPOSITORY WE WILL GET THE REPO NAME, USER NAME, STARS AND THE URL.
- FOR EACH TOPIC WE WILL CREATE A .CSV EXTENSION FILE.

In [597]:
import warnings
warnings.filterwarnings('ignore')

In [598]:
import requests
url="https://github.com/topics"
response=requests.get(url)

In [599]:
response.status_code

200

In [600]:
contents=response.content

In [601]:
contents[:1000]

b'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n\n\n  <link crossorigin="anonymous" media="all" integrity="sha512-MnwFAmWD4N6ubNtKWD47hA5NsuUbQrUija5wdKekur+Latb6waJ7slYqKr7zqDFy4ndnwMsQEHFHoeZD/KT1MA==" rel="stylesheet" href="https://github.githubassets.com/assets/light-327c05026583e0deae6cdb4a583e3b84.css" /><link crossorigin="anonymous" media="all" integrity="sha512-72v7ZBq0FiO3HXumxDxYd1gzz37KXxGQJKjX2l2INKADbC6YS+91wttp4Ndt6tsrFvcAUQNKPaNupLe

In [602]:
page_content=response.text

In [603]:
len(response.text)

141518

In [604]:
with open('html_page.html', 'w', encoding='utf-8') as hp:
    hp.write(str(page_content))
    #hp.seek(0)
    #print(hp.read())
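
Saving a page to disk, as the cell above does, also makes it possible to avoid downloading the same page again while experimenting. The sketch below is one possible way to do this, not code from this notebook: `cache_path`, `load_or_fetch`, `cache_dir`, and the `fetch` callable are hypothetical names introduced for illustration. In this notebook, `fetch` could be something like `lambda u: requests.get(u).text`.

```python
import hashlib
from pathlib import Path


def cache_path(url, cache_dir):
    """Map a URL to a stable file name inside cache_dir (hypothetical naming scheme)."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    return Path(cache_dir) / f"{digest}.html"


def load_or_fetch(url, cache_dir, fetch):
    """Return the page text for url, calling fetch only if no cached copy exists.

    fetch is any callable that takes a URL and returns the page as a str.
    """
    path = cache_path(url, cache_dir)
    if path.exists():
        return path.read_text(encoding="utf-8")
    text = fetch(url)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
    return text
```

Passing the download step in as a callable keeps the caching logic independent of `requests`, so it can be tried without network access.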

In [605]:
import bs4

In [606]:
soup=bs4.BeautifulSoup(response.content, 'html.parser')

In [607]:
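# Note: the first positional argument of find_all is a tag name, so this string of
# CSS classes matches nothing. Filtering by class needs the class_ keyword instead.
soup.find_all("border border-black-fade color-bg-info f4 color-text-tertiary text-bold rounded flex-shrink-0 text-center mr-3")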

[]
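
The empty list above is expected, because `find_all` treats its first positional argument as a tag name rather than a class. The following is a small self-contained illustration of the difference; the HTML snippet is invented for the example and is not copied from GitHub.

```python
from bs4 import BeautifulSoup

html = """
<div class="card rounded">#</div>
<p class="f3 title">3D</p>
<p class="title f3">Ajax</p>
"""
soup = BeautifulSoup(html, "html.parser")

# A class string passed positionally is treated as a tag name, so nothing matches.
by_name = soup.find_all("card rounded")

# class_ with a single class matches any tag that carries that class.
by_single_class = soup.find_all(class_="title")

# class_ with several space-separated classes is compared against the exact
# attribute string, so the order of the classes matters.
by_exact_string = soup.find_all("p", class_="f3 title")

# A CSS selector requires both classes but ignores their order.
by_selector = soup.select("p.f3.title")
```

This is why the later cells, which pass the full class string through `class_`, depend on the classes appearing in exactly the order used in the page source.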

In [608]:
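# Note: 'data-pjax'=='#js-repo-pjax-container' compares two different strings and
# evaluates to False, so no attribute filter reaches find_all. The output below
# shows tags of every kind coming back. An attribute filter would be written as
# soup.find_all(attrs={'data-pjax': '#js-repo-pjax-container'}).
classs=soup.find_all('data-pjax'=='#js-repo-pjax-container')
# t=classs.find('h1')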

In [609]:
for class1 in  classs:
    print( class1.get("href"))
    #print("Yes!")

None
None
None
https://github.githubassets.com
https://avatars.githubusercontent.com
https://github-cloud.s3.amazonaws.com
https://user-images.githubusercontent.com/
https://github.githubassets.com
https://avatars.githubusercontent.com
https://github.githubassets.com/assets/light-327c05026583e0deae6cdb4a583e3b84.css
https://github.githubassets.com/assets/dark-ef6bfb641ab41623b71d7ba6c43c5877.css
https://github.githubassets.com/assets/frameworks-ccd28d27987dfc140f2b1822bcef5319.css
https://github.githubassets.com/assets/behaviors-dbc725588aaef4eab4c5384e5c470707.css
https://github.githubassets.com/assets/site-95582afbc21b7e3a2eaeddeb803f7606.css
https://github.githubassets.com/assets/explore-0ad4db178c531c74f92df2b51ceb2607.css
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
Non

In [610]:
classs

[<html data-color-mode="auto" data-dark-theme="dark" data-light-theme="light" lang="en">
 <head>
 <meta charset="utf-8"/>
 <link href="https://github.githubassets.com" rel="dns-prefetch"/>
 <link href="https://avatars.githubusercontent.com" rel="dns-prefetch"/>
 <link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
 <link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
 <link crossorigin="" href="https://github.githubassets.com" rel="preconnect"/>
 <link href="https://avatars.githubusercontent.com" rel="preconnect"/>
 <link crossorigin="anonymous" href="https://github.githubassets.com/assets/light-327c05026583e0deae6cdb4a583e3b84.css" integrity="sha512-MnwFAmWD4N6ubNtKWD47hA5NsuUbQrUija5wdKekur+Latb6waJ7slYqKr7zqDFy4ndnwMsQEHFHoeZD/KT1MA==" media="all" rel="stylesheet"><link crossorigin="anonymous" href="https://github.githubassets.com/assets/dark-ef6bfb641ab41623b71d7ba6c43c5877.css" integrity="sha512-72v7ZBq0FiO3HXumxDxYd1gzz37KXxGQJKjX2l2INKA
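
Attribute names that contain a hyphen, such as `data-pjax`, cannot be used as Python keyword arguments, so Beautiful Soup accepts an `attrs` dictionary for them. A minimal sketch on invented markup (the snippet below is an assumption made for the example, not GitHub's actual HTML):

```python
from bs4 import BeautifulSoup

html = """
<a data-pjax="#js-repo-pjax-container" href="/mrdoob/three.js">three.js</a>
<a href="/login">Sign in</a>
<link href="https://example.com/style.css" rel="stylesheet"/>
"""
soup = BeautifulSoup(html, "html.parser")

# Python evaluates the comparison first, so find_all would only ever see a boolean.
comparison = ("data-pjax" == "#js-repo-pjax-container")

# Filter by attribute value using the attrs dictionary.
pjax_links = soup.find_all(attrs={"data-pjax": "#js-repo-pjax-container"})
hrefs = [tag.get("href") for tag in pjax_links]

# href=True keeps only tags that actually have an href, which avoids None values.
tags_with_href = soup.find_all(href=True)
```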

In [611]:
para=soup.find_all('p', class_="f5 color-text-secondary text-center mb-0 mt-1")

In [612]:
for i in para:
    print(i)

<p class="f5 color-text-secondary text-center mb-0 mt-1">PICO-8 is a fantasy console for making, sharing and playing tiny games and other computer programs in Lua.</p>
<p class="f5 color-text-secondary text-center mb-0 mt-1">Flask is a web framework for Python based on the Werkzeug toolkit.</p>
<p class="f5 color-text-secondary text-center mb-0 mt-1">Firebase is a mobile app development platform that provides data analysis and database web services for developers.</p>


In [613]:
len(para)

3

In [614]:
para_All=soup.find_all('p')

In [615]:
len(para_All)

67

In [616]:
para_All[1]

<p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">
        PICO-8
      </p>

In [617]:
para_All[6]['class']

['f5', 'color-text-secondary', 'text-center', 'mb-0', 'mt-1']

In [618]:
para_All[:30]

[<p class="f4 color-text-secondary col-md-6 mx-auto">Browse popular topics on GitHub.</p>,
 <p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">
         PICO-8
       </p>,
 <p class="f5 color-text-secondary text-center mb-0 mt-1">PICO-8 is a fantasy console for making, sharing and playing tiny games and other computer programs in Lua.</p>,
 <p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">
         Flask
       </p>,
 <p class="f5 color-text-secondary text-center mb-0 mt-1">Flask is a web framework for Python based on the Werkzeug toolkit.</p>,
 <p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">
         Firebase
       </p>,
 <p class="f5 color-text-secondary text-center mb-0 mt-1">Firebase is a mobile app development platform that provides data analysis and database web services for developers.</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f5 color-text-secondary mb-0 mt-1">
               3D modeling is the proces

In [619]:
topics=soup.find_all('p',class_="f3 lh-condensed mb-0 mt-1 Link--primary")

In [620]:
len(topics)

30

In [621]:
top=[]
for topic in topics:
    print(topic.string)
    top.append(topic.string)

3D
Ajax
Algorithm
Amp
Android
Angular
Ansible
API
Arduino
ASP.NET
Atom
Awesome Lists
Amazon Web Services
Azure
Babel
Bash
Bitcoin
Bootstrap
Bot
C
Chrome
Chrome extension
Command line interface
Clojure
Code quality
Code review
Compiler
Continuous integration
COVID-19
C++


In [622]:
top

['3D',
 'Ajax',
 'Algorithm',
 'Amp',
 'Android',
 'Angular',
 'Ansible',
 'API',
 'Arduino',
 'ASP.NET',
 'Atom',
 'Awesome Lists',
 'Amazon Web Services',
 'Azure',
 'Babel',
 'Bash',
 'Bitcoin',
 'Bootstrap',
 'Bot',
 'C',
 'Chrome',
 'Chrome extension',
 'Command line interface',
 'Clojure',
 'Code quality',
 'Code review',
 'Compiler',
 'Continuous integration',
 'COVID-19',
 'C++']

In [623]:
description=soup.find_all(class_="f5 color-text-secondary mb-0 mt-1")

In [624]:
len(description)

30

In [625]:
description[0].string

'\n              3D modeling is the process of virtually developing the surface and structure of a 3D object.\n            '

In [626]:
descr=[]
for i, desc in enumerate(description):
    print(i, "\t\t\t\t",desc.string)
    descr.append(desc.string.strip())

0 				 
              3D modeling is the process of virtually developing the surface and structure of a 3D object.
            
1 				 
              Ajax is a technique for creating interactive web applications.
            
2 				 
              Algorithms are self-contained sequences that carry out a variety of tasks.
            
3 				 
              Amp is a non-blocking concurrency framework for PHP.
            
4 				 
              Android is an operating system built by Google designed for mobile devices.
            
5 				 
              Angular is an open source web application platform.
            
6 				 
              Ansible is a simple and powerful automation engine.
            
7 				 
              An API (Application Programming Interface) is a collection of protocols and subroutines for building software.
            
8 				 
              Arduino is an open source hardware and software company and maker community.
            
9 				 
              ASP.NET is 

In [627]:
descr

['3D modeling is the process of virtually developing the surface and structure of a 3D object.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency framework for PHP.',
 'Android is an operating system built by Google designed for mobile devices.',
 'Angular is an open source web application platform.',
 'Ansible is a simple and powerful automation engine.',
 'An API (Application Programming Interface) is a collection of protocols and subroutines for building software.',
 'Arduino is an open source hardware and software company and maker community.',
 'ASP.NET is a web framework for building modern web apps and services.',
 'Atom is a open source text editor built with web technologies.',
 'An awesome list is a list of awesome things curated by the community.',
 'Amazon Web Services provides on-demand cloud computing platforms on a subscription basis.',
 'A
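
`.string` works for these descriptions because each `<p>` holds a single text node. When a tag has more than one child, `.string` is `None`, and calling `.strip()` on it would raise an `AttributeError`. `get_text()` is a more forgiving alternative. A small illustration on invented markup:

```python
from bs4 import BeautifulSoup

html = """
<p class="one">
    Flask is a web framework.
</p>
<p class="two">Built on <b>Werkzeug</b>.</p>
"""
soup = BeautifulSoup(html, "html.parser")
one = soup.find("p", class_="one")
two = soup.find("p", class_="two")

single_child = one.string.strip()   # one text node, so .string is a str
multi_child = two.string            # several children, so .string is None
safe_text = two.get_text().strip()  # joins the text of all descendants
```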

In [628]:
parent0=description[0].parent

In [629]:
description[0]

<p class="f5 color-text-secondary mb-0 mt-1">
              3D modeling is the process of virtually developing the surface and structure of a 3D object.
            </p>

In [630]:
parent0

<div class="flex-auto">
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f5 color-text-secondary mb-0 mt-1">
              3D modeling is the process of virtually developing the surface and structure of a 3D object.
            </p>
</div>

In [631]:
parent0.find_all('p')

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f5 color-text-secondary mb-0 mt-1">
               3D modeling is the process of virtually developing the surface and structure of a 3D object.
             </p>]

In [632]:
parent00=parent0.parent

In [633]:
parent00

<div class="d-sm-flex flex-auto">
<div class="flex-auto">
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f5 color-text-secondary mb-0 mt-1">
              3D modeling is the process of virtually developing the surface and structure of a 3D object.
            </p>
</div>
<div class="d-inline-block js-toggler-container starring-container">
<a aria-label="You must be signed in to star a topic" class="btn btn-sm d-flex flex-items-center" data-ga-click="Explore, click star button when signed out,
        action:topics#index;
        text:Star" href="/login?return_to=%2Ftopics%2F3d" title="You must be signed in to star a topic">
<svg aria-hidden="true" class="octicon octicon-star mr-1" data-view-component="true" height="16" version="1.1" viewbox="0 0 16 16" width="16">
<path d="M8 .25a.75.75 0 01.673.418l1.882 3.815 4.21.612a.75.75 0 01.416 1.279l-3.046 2.97.719 4.192a.75.75 0 01-1.088.791L8 12.347l-3.766 1.98a.75.75 0 01-1.088-.79l.72-4.194L.818 6.374a.75.75 0 01.416-1

In [634]:
for anchor in parent00.find_all('a'):
    print(anchor.get('href'))

/login?return_to=%2Ftopics%2F3d


In [635]:
links=soup.find_all('a')

In [636]:
link=soup.find_all('a',class_='d-flex no-underline')

In [637]:
i=0
e=[]
for l in link:
    print(l.get("href"))
    print(l.parent)
    print("\n\n\n-------------------------------------------------------------------------------------------------------------------\n\n\n")
    e.append(l.get("href"))

/topics/3d
<div class="py-4 border-bottom">
<a class="d-flex no-underline" data-ga-click="Explore, go to 3d, location:All featured topics" href="/topics/3d">
<div class="color-bg-info f4 color-text-tertiary text-bold rounded mr-3 flex-shrink-0 text-center" style="width:64px; height:64px; line-height:64px;">
            #
          </div>
<div class="d-sm-flex flex-auto">
<div class="flex-auto">
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f5 color-text-secondary mb-0 mt-1">
              3D modeling is the process of virtually developing the surface and structure of a 3D object.
            </p>
</div>
<div class="d-inline-block js-toggler-container starring-container">
<a aria-label="You must be signed in to star a topic" class="btn btn-sm d-flex flex-items-center" data-ga-click="Explore, click star button when signed out,
        action:topics#index;
        text:Star" href="/login?return_to=%2Ftopics%2F3d" title="You must be signed in to star a topic">
<svg ar

In [638]:
e

['/topics/3d',
 '/topics/ajax',
 '/topics/algorithm',
 '/topics/amphp',
 '/topics/android',
 '/topics/angular',
 '/topics/ansible',
 '/topics/api',
 '/topics/arduino',
 '/topics/aspnet',
 '/topics/atom',
 '/topics/awesome',
 '/topics/aws',
 '/topics/azure',
 '/topics/babel',
 '/topics/bash',
 '/topics/bitcoin',
 '/topics/bootstrap',
 '/topics/bot',
 '/topics/c',
 '/topics/chrome',
 '/topics/chrome-extension',
 '/topics/cli',
 '/topics/clojure',
 '/topics/code-quality',
 '/topics/code-review',
 '/topics/compiler',
 '/topics/continuous-integration',
 '/topics/covid-19',
 '/topics/cpp']

In [639]:
link[3]['href']

'/topics/amphp'

In [640]:
for i,j in enumerate(e):
    e[i]="https://github.com"+ str(j)
    

In [641]:
type(e)

list

In [642]:
e[0]

'https://github.com/topics/3d'
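
String concatenation works here because every scraped `href` starts with `/`. As an alternative, the standard library's `urllib.parse.urljoin` resolves relative links against a base URL and leaves links that are already absolute unchanged. A short sketch (the third entry in the list is added only to show the absolute case):

```python
from urllib.parse import urljoin

base_url = "https://github.com"
relative_links = ["/topics/3d", "/topics/ajax", "https://github.com/topics/api"]

# urljoin resolves relative paths against the base and keeps absolute URLs as they are.
absolute_links = [urljoin(base_url, href) for href in relative_links]
```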

In [643]:
for i in e[:30]:
    print(i)

https://github.com/topics/3d
https://github.com/topics/ajax
https://github.com/topics/algorithm
https://github.com/topics/amphp
https://github.com/topics/android
https://github.com/topics/angular
https://github.com/topics/ansible
https://github.com/topics/api
https://github.com/topics/arduino
https://github.com/topics/aspnet
https://github.com/topics/atom
https://github.com/topics/awesome
https://github.com/topics/aws
https://github.com/topics/azure
https://github.com/topics/babel
https://github.com/topics/bash
https://github.com/topics/bitcoin
https://github.com/topics/bootstrap
https://github.com/topics/bot
https://github.com/topics/c
https://github.com/topics/chrome
https://github.com/topics/chrome-extension
https://github.com/topics/cli
https://github.com/topics/clojure
https://github.com/topics/code-quality
https://github.com/topics/code-review
https://github.com/topics/compiler
https://github.com/topics/continuous-integration
https://github.com/topics/covid-19
https://github.com/

In [644]:
lin=[]
for i in range(0,30):
    lin.append(e[i])

In [645]:
len(lin)

30

In [646]:
import pandas as pd

In [647]:
df1=pd.DataFrame()

In [648]:
df1['Name of the Topic:']=top

In [649]:
df1['Description of the Topic']= descr

In [650]:
df1

Unnamed: 0,Name of the Topic:,Description of the Topic
0,3D,3D modeling is the process of virtually develo...
1,Ajax,Ajax is a technique for creating interactive w...
2,Algorithm,Algorithms are self-contained sequences that c...
3,Amp,Amp is a non-blocking concurrency framework fo...
4,Android,Android is an operating system built by Google...
5,Angular,Angular is an open source web application plat...
6,Ansible,Ansible is a simple and powerful automation en...
7,API,An API (Application Programming Interface) is ...
8,Arduino,Arduino is an open source hardware and softwar...
9,ASP.NET,ASP.NET is a web framework for building modern...


In [651]:
df1['Associated Links']=lin

In [652]:
df1.isnull().sum()

Name of the Topic:          0
Description of the Topic    0
Associated Links            0
dtype: int64

In [653]:
df1.to_csv("First 30 topics with details.csv")

In [654]:
df1

Unnamed: 0,Name of the Topic:,Description of the Topic,Associated Links
0,3D,3D modeling is the process of virtually develo...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency framework fo...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android
5,Angular,Angular is an open source web application plat...,https://github.com/topics/angular
6,Ansible,Ansible is a simple and powerful automation en...,https://github.com/topics/ansible
7,API,An API (Application Programming Interface) is ...,https://github.com/topics/api
8,Arduino,Arduino is an open source hardware and softwar...,https://github.com/topics/arduino
9,ASP.NET,ASP.NET is a web framework for building modern...,https://github.com/topics/aspnet


In [655]:
df1.iloc[0,:]

Name of the Topic:                                                         3D
Description of the Topic    3D modeling is the process of virtually develo...
Associated Links                                 https://github.com/topics/3d
Name: 0, dtype: object
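
When a DataFrame is written with `to_csv` using the default settings, the row index is saved as an extra first column, which appears as `Unnamed: 0` when the file is read back with `read_csv`. The sketch below shows the round trip on a two-row stand-in for `df1`. The column names are shortened for the example, and an in-memory buffer replaces a real file only to keep it self-contained.

```python
import io

import pandas as pd

sample = pd.DataFrame({
    "title": ["3D", "Ajax"],
    "url": ["https://github.com/topics/3d", "https://github.com/topics/ajax"],
})

# Default settings: the index is written as an extra, unnamed column.
with_index = io.StringIO()
sample.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False: only the real columns are written.
without_index = io.StringIO()
sample.to_csv(without_index, index=False)
without_index.seek(0)
round_trip = pd.read_csv(without_index)
```

Reading the file back like this is also a quick way to verify that the CSV contains what was intended.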

## Getting info out of a topic page:

In [656]:
lin[0]

'https://github.com/topics/3d'

In [657]:
url2=lin[0]

In [658]:
response=requests.get(url2)

In [659]:
response.status_code

200

In [660]:
soup=bs4.BeautifulSoup(response.content, 'html.parser')

In [661]:
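# Note: prettify is referenced here without parentheses, so this prints the bound
# method's repr (which embeds the parsed document) rather than indented HTML.
# print(soup.prettify()) would give the indented version.
print(soup.prettify)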

<bound method Tag.prettify of 
<!DOCTYPE html>

<html data-color-mode="auto" data-dark-theme="dark" data-light-theme="light" lang="en">
<head>
<meta charset="utf-8"/>
<link href="https://github.githubassets.com" rel="dns-prefetch"/>
<link href="https://avatars.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
<link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
<link crossorigin="" href="https://github.githubassets.com" rel="preconnect"/>
<link href="https://avatars.githubusercontent.com" rel="preconnect"/>
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/light-327c05026583e0deae6cdb4a583e3b84.css" integrity="sha512-MnwFAmWD4N6ubNtKWD47hA5NsuUbQrUija5wdKekur+Latb6waJ7slYqKr7zqDFy4ndnwMsQEHFHoeZD/KT1MA==" media="all" rel="stylesheet"><link crossorigin="anonymous" href="https://github.githubassets.com/assets/dark-ef6bfb641ab41623b71d7ba6c43c5877.css" integrity="sha512-72v7Z

In [662]:
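# Note: "class_"=="f3 ..." compares two different strings and evaluates to False,
# so no class filter is applied and every <h3> on the page is returned.
# The keyword form would be:
# soup.find_all("h3", class_="f3 color-text-secondary text-normal lh-condensed")
parent=soup.find_all("h3","class_"=="f3 color-text-secondary text-normal lh-condensed")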

In [663]:
len(parent)

30

In [664]:
owner=[]
name_of_repo=[]
for par in parent:
    aa=par.find_all('a')
    for i,j in enumerate(aa):
        if i==0:
            href=j.get("href")
            
            owner.append(href)
        else:
            href=j.get("href")
            name_of_repo.append(href)
        

In [665]:
name_of_repo

['/mrdoob/three.js',
 '/libgdx/libgdx',
 '/pmndrs/react-three-fiber',
 '/BabylonJS/Babylon.js',
 '/aframevr/aframe',
 '/ssloy/tinyrenderer',
 '/lettier/3d-game-shaders-for-beginners',
 '/FreeCAD/FreeCAD',
 '/metafizzy/zdog',
 '/CesiumGS/cesium',
 '/timzhang642/3D-Machine-Learning',
 '/a1studmuffin/SpaceshipGenerator',
 '/isl-org/Open3D',
 '/spritejs/spritejs',
 '/tensorspace-team/tensorspace',
 '/jagenjo/webglstudio.js',
 '/YadiraF/PRNet',
 '/AaronJackson/vrn',
 '/domlysz/BlenderGIS',
 '/openscad/openscad',
 '/ssloy/tinyraytracer',
 '/mosra/magnum',
 '/google/model-viewer',
 '/blender/blender',
 '/gfxfundamentals/webgl-fundamentals',
 '/cleardusk/3DDFA',
 '/jasonlong/isometric-contributions',
 '/rg3dengine/rg3d',
 '/antvis/L7',
 '/cnr-isti-vclab/meshlab']

In [666]:
parent[0].find_all('a')[0]

<a data-ga-click="Explore, go to repository owner, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" data-view-component="true" href="/mrdoob">
            mrdoob
</a>

In [667]:
parent[0].find_all('a')[0].text.strip()

'mrdoob'

In [668]:
parent[0].text.strip()

'mrdoob\n          /\n          \n            three.js'

In [669]:
name=[]
for i in range(len(parent)):
    d=parent[i].find_all('a')[0].text
    name.append(d.strip())
    

In [670]:
name

['mrdoob',
 'libgdx',
 'pmndrs',
 'BabylonJS',
 'aframevr',
 'ssloy',
 'lettier',
 'FreeCAD',
 'metafizzy',
 'CesiumGS',
 'timzhang642',
 'a1studmuffin',
 'isl-org',
 'spritejs',
 'tensorspace-team',
 'jagenjo',
 'YadiraF',
 'AaronJackson',
 'domlysz',
 'openscad',
 'ssloy',
 'mosra',
 'google',
 'blender',
 'gfxfundamentals',
 'cleardusk',
 'jasonlong',
 'rg3dengine',
 'antvis',
 'cnr-isti-vclab']

In [671]:
repo_name=[]
for i in range(len(parent)):
    d=parent[i].find_all('a')[1].text
    repo_name.append(d.strip())

In [672]:
repo_name

['three.js',
 'libgdx',
 'react-three-fiber',
 'Babylon.js',
 'aframe',
 'tinyrenderer',
 '3d-game-shaders-for-beginners',
 'FreeCAD',
 'zdog',
 'cesium',
 '3D-Machine-Learning',
 'SpaceshipGenerator',
 'Open3D',
 'spritejs',
 'tensorspace',
 'webglstudio.js',
 'PRNet',
 'vrn',
 'BlenderGIS',
 'openscad',
 'tinyraytracer',
 'magnum',
 'model-viewer',
 'blender',
 'webgl-fundamentals',
 '3DDFA',
 'isometric-contributions',
 'rg3d',
 'L7',
 'meshlab']
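
The owner and repository names can also be derived from the repository hrefs collected earlier (for example `/mrdoob/three.js`), which avoids relying on the order of the `<a>` tags. A minimal sketch, where `split_repo_path` is a hypothetical helper that is not part of this notebook:

```python
def split_repo_path(href):
    """Split a repository href such as '/mrdoob/three.js' into (owner, repo)."""
    parts = href.strip("/").split("/")
    if len(parts) != 2:
        raise ValueError(f"Expected '/owner/repo', got {href!r}")
    owner, repo = parts
    return owner, repo


repo_hrefs = ["/mrdoob/three.js", "/libgdx/libgdx", "/isl-org/Open3D"]
pairs = [split_repo_path(h) for h in repo_hrefs]
```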

In [673]:
url1='https://github.com'

In [674]:
type(owner[0])

str

In [675]:
for i, own in enumerate(owner):
    owner[i]=url1+own

In [676]:
owner

['https://github.com/mrdoob',
 'https://github.com/libgdx',
 'https://github.com/pmndrs',
 'https://github.com/BabylonJS',
 'https://github.com/aframevr',
 'https://github.com/ssloy',
 'https://github.com/lettier',
 'https://github.com/FreeCAD',
 'https://github.com/metafizzy',
 'https://github.com/CesiumGS',
 'https://github.com/timzhang642',
 'https://github.com/a1studmuffin',
 'https://github.com/isl-org',
 'https://github.com/spritejs',
 'https://github.com/tensorspace-team',
 'https://github.com/jagenjo',
 'https://github.com/YadiraF',
 'https://github.com/AaronJackson',
 'https://github.com/domlysz',
 'https://github.com/openscad',
 'https://github.com/ssloy',
 'https://github.com/mosra',
 'https://github.com/google',
 'https://github.com/blender',
 'https://github.com/gfxfundamentals',
 'https://github.com/cleardusk',
 'https://github.com/jasonlong',
 'https://github.com/rg3dengine',
 'https://github.com/antvis',
 'https://github.com/cnr-isti-vclab']

In [677]:
# for i,j,k in zip(range(0,len(owner)),owner, name_of_repo):
#     name_of_repo[i]=j+k

In [678]:
for i,j in enumerate(name_of_repo):
    name_of_repo[i]=url1+j

In [679]:
name_of_repo

['https://github.com/mrdoob/three.js',
 'https://github.com/libgdx/libgdx',
 'https://github.com/pmndrs/react-three-fiber',
 'https://github.com/BabylonJS/Babylon.js',
 'https://github.com/aframevr/aframe',
 'https://github.com/ssloy/tinyrenderer',
 'https://github.com/lettier/3d-game-shaders-for-beginners',
 'https://github.com/FreeCAD/FreeCAD',
 'https://github.com/metafizzy/zdog',
 'https://github.com/CesiumGS/cesium',
 'https://github.com/timzhang642/3D-Machine-Learning',
 'https://github.com/a1studmuffin/SpaceshipGenerator',
 'https://github.com/isl-org/Open3D',
 'https://github.com/spritejs/spritejs',
 'https://github.com/tensorspace-team/tensorspace',
 'https://github.com/jagenjo/webglstudio.js',
 'https://github.com/YadiraF/PRNet',
 'https://github.com/AaronJackson/vrn',
 'https://github.com/domlysz/BlenderGIS',
 'https://github.com/openscad/openscad',
 'https://github.com/ssloy/tinyraytracer',
 'https://github.com/mosra/magnum',
 'https://github.com/google/model-viewer',
 'htt

In [680]:
stars=soup.find_all('a',{"class": "social-count float-none"})

In [681]:
len(stars)

30

In [682]:
stars[0]

<a class="social-count float-none" data-ga-click="Explore, go to repository stargazers, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"STARGAZERS","click_visual_representation":"STARGAZERS_NUMBER","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4f3c0fb1ad4e5a9f72ed698531bf27b302fcc5846d9458e33ceeeeb05888b64c" data-view-component="true" href="/mrdoob/three.js/stargazers">
          74.4k
</a>

In [683]:
star=[]
stars[0].text.strip()

'74.4k'

In [684]:
for i in stars:
    # Call strip() so the cleaned text, not the method object, is stored.
    star.append(i.text.strip())


### If we want stars to be in float

In [685]:
star=[]
for i in stars:
    i=i.text.strip()[:-1]
    star.append(float(i)*1000)

In [686]:
star

[74400.0,
 19000.0,
 15100.0,
 14900.0,
 13100.0,
 11300.0,
 11200.0,
 9900.0,
 8700.0,
 7500.0,
 7100.0,
 6900.0,
 5500.0,
 4600.0,
 4500.0,
 4400.0,
 4400.0,
 4400.0,
 4300.0,
 4300.0,
 3900.0,
 3600.0,
 3400.0,
 3400.0,
 3200.0,
 3100.0,
 3000.0,
 2800.0,
 2400.0,
 2400.0]
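
The conversion above drops the last character of every label, so it only works when each count ends in `k`; a repository with, say, `980` stars would be parsed as `98` and then multiplied by 1000. The sketch below is one way to handle both forms. `parse_star_count` is a hypothetical helper, and the `m` suffix and comma handling are assumptions about labels that do not appear in the output above.

```python
def parse_star_count(text):
    """Convert a star label such as '74.4k' or '980' to an integer count."""
    text = text.strip().lower().replace(",", "")
    # Suffix multipliers; 'm' is an assumption, only 'k' appears in this notebook.
    multipliers = {"k": 1_000, "m": 1_000_000}
    if text and text[-1] in multipliers:
        return round(float(text[:-1]) * multipliers[text[-1]])
    return int(text)
```

With this helper, the loop body becomes `star.append(parse_star_count(i.text))`.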

In [687]:
# %%writefile 1.txt
# Yes this is the way it should be done. Oh yeah!

In [688]:
# f=open('1.txt','r+')

In [689]:
# f.read()

In [690]:
# f.seek(0)

## Defining a dataframe to save all of the information collected so far:


In [691]:
df2=pd.DataFrame()

In [692]:
df2['Name of the owner']=name

In [693]:
df2['Name of the Repository']=repo_name

In [694]:
df2['Link to the owner\'s profile']=owner

In [695]:
df2['Link of the most popular repo of this owner']=name_of_repo

In [696]:
df2['Number of stars accumulated over time']=star

In [697]:
_3D_Repos=df2

In [698]:
df2.to_csv('Most Popular Repositories concerning 3D.csv')
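
Since the plan is to write one CSV file per topic, the file name needs to be generated rather than typed by hand. One option is to reuse the last segment of the topic URL (for example `3d` or `cpp`), which is already unique and free of characters such as `+`. A sketch, where `topic_csv_name` is a hypothetical helper that is not part of this notebook:

```python
from urllib.parse import urlparse


def topic_csv_name(topic_url):
    """Build a CSV file name from the last path segment of a topic URL."""
    slug = urlparse(topic_url).path.rstrip("/").split("/")[-1]
    if not slug:
        raise ValueError(f"No topic slug found in {topic_url!r}")
    return f"{slug}.csv"
```

Using the URL segment rather than the display title avoids collisions such as `C` and `C++` both reducing to the same cleaned-up name.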

### Making a function to do this for the rest of the pages

In [699]:
import requests
from bs4 import BeautifulSoup

def get_topics_page():
    """Download the GitHub topics page and return it as a parsed BeautifulSoup document."""
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    # Stop early if the page did not load successfully.
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    doc = BeautifulSoup(response.text, 'html.parser')
    return doc
    

In [700]:
doc = get_topics_page()

In [707]:
def get_topic_titles(doc):
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p', {'class': selection_class})
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles


In [702]:
titles = get_topic_titles(doc)

In [703]:
def get_topic_descs(doc):
    desc_selector = 'f5 color-text-secondary mb-0 mt-1'
    topic_desc_tags = doc.find_all('p', {'class': desc_selector})
    topic_descs = []
    for tag in topic_desc_tags:
        topic_descs.append(tag.text.strip())
    return topic_descs


In [704]:
def get_topic_urls(doc):
    topic_link_tags = doc.find_all('a', {'class': 'd-flex no-underline'})
    topic_urls = []
    base_url = 'https://github.com'
    for tag in topic_link_tags:
        topic_urls.append(base_url + tag['href'])
    return topic_urls
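
The three functions above each return a separate list, and `pd.DataFrame` raises a `ValueError` when the lists in a dictionary have different lengths. If GitHub's markup changes so that one selector matches fewer tags than the others, a small check could report which list is short before the DataFrame is built. A sketch, where `check_aligned` is a hypothetical helper that is not part of this notebook:

```python
def check_aligned(**columns):
    """Raise ValueError unless every named list has the same length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return lengths
```

It could be called as `check_aligned(title=titles, description=descs, url=urls)` before constructing the dictionary.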

### Putting everything together and getting the dataframe

In [705]:
def scrape_topics():
    """Scrape the topics page and return titles, descriptions and URLs as a DataFrame."""
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    # Stop early if the page did not load successfully.
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    doc = BeautifulSoup(response.text, 'html.parser')
    topics_dict = {
        'title': get_topic_titles(doc),
        'description': get_topic_descs(doc),
        'url': get_topic_urls(doc)
    }
    return pd.DataFrame(topics_dict)

In [706]:
scrape_topics()

Unnamed: 0,title,description,url
0,3D,3D modeling is the process of virtually develo...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency framework fo...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android
5,Angular,Angular is an open source web application plat...,https://github.com/topics/angular
6,Ansible,Ansible is a simple and powerful automation en...,https://github.com/topics/ansible
7,API,An API (Application Programming Interface) is ...,https://github.com/topics/api
8,Arduino,Arduino is an open source hardware and softwar...,https://github.com/topics/arduino
9,ASP.NET,ASP.NET is a web framework for building modern...,https://github.com/topics/aspnet
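
The same approach could be extended to the repository listings explored earlier for the 3D topic. The sketch below is one possible consolidation, not code from this notebook. `parse_topic_repos` and `parse_stars` are hypothetical names, and the CSS classes are the ones used in the exploratory cells above, which may no longer match GitHub's current markup. It also assumes that headings and star counts appear in the same order on the page. It takes HTML text rather than a URL so that it can be tried on a saved page.

```python
from bs4 import BeautifulSoup

BASE_URL = "https://github.com"
# Class strings taken from the exploratory cells; they may be out of date.
H3_CLASS = "f3 color-text-secondary text-normal lh-condensed"
STAR_CLASS = "social-count float-none"


def parse_stars(label):
    """Convert labels such as '74.4k' or '980' to an integer."""
    label = label.strip().lower()
    if label.endswith("k"):
        return round(float(label[:-1]) * 1000)
    return int(label)


def parse_topic_repos(html):
    """Return one dict per repository card found in a topic page."""
    doc = BeautifulSoup(html, "html.parser")
    headings = doc.find_all("h3", class_=H3_CLASS)
    star_tags = doc.find_all("a", class_=STAR_CLASS)
    repos = []
    # Pair each heading with the star count in the same position.
    for heading, star_tag in zip(headings, star_tags):
        anchors = heading.find_all("a")
        if len(anchors) < 2:
            continue  # skip headings that are not owner/repo pairs
        owner_tag, repo_tag = anchors[0], anchors[1]
        repos.append({
            "username": owner_tag.text.strip(),
            "repo_name": repo_tag.text.strip(),
            "stars": parse_stars(star_tag.text),
            "repo_url": BASE_URL + repo_tag["href"],
        })
    return repos
```

The returned list of dictionaries can be passed directly to `pd.DataFrame(...)` and written out with `to_csv`, once per topic URL returned by `scrape_topics()`.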
