#Web Scraping Project 

###1. Pick a website and describe your objective

- Browse through different sites and pick one to scrape. Check the "Project Ideas" section for inspiration.
- Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
- Summarize your project idea and outline your strategy in a Jupyter notebook. Use the "New" button above.


###2. Use the requests library to download web pages

- Inspect the website's HTML source and identify the right URLs to download.
- Download and save web pages locally using the requests library.
- Create a function to automate downloading for different topics/search queries.
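
As a minimal sketch of the last two bullets, the helper below saves one page to disk. The names `build_topic_url` and `download_page`, the `get` parameter, and the URL pattern are illustrative assumptions, not part of the project template.

```python
import os


def build_topic_url(topic, base_url="https://github.com/topics"):
    """Build a page URL for a topic slug (assumed URL pattern)."""
    return "{}/{}".format(base_url, topic)


def download_page(url, path, get=None):
    """Download `url`, save the HTML to `path`, and return the path."""
    if get is None:
        # Imported here so a stub `get` can be passed in for offline runs.
        import requests
        get = requests.get

    response = get(url)
    if response.status_code != 200:
        raise RuntimeError(
            "Failed to load {} (status {})".format(url, response.status_code)
        )

    # Create the parent folder if the path has one.
    folder = os.path.dirname(path)
    if folder:
        os.makedirs(folder, exist_ok=True)

    with open(path, "w", encoding="utf-8") as f:
        f.write(response.text)
    return path
```

A call such as `download_page(build_topic_url("3d"), "pages/3d.html")` would then cover one topic. Looping over a list of topic slugs covers the "different topics/search queries" case.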

###3. Use Beautiful Soup to parse and extract information

- Parse and explore the structure of downloaded web pages using Beautiful Soup.
- Use the right properties and methods to extract the required information.
- Create functions to extract information from the page into lists and dictionaries.
- (Optional) Use a REST API to acquire additional information if required.
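
The parsing bullets above can be sketched on a small inline page. The HTML, its class names, the descriptions, and the `extract_topics` function are all made up for illustration. A real page needs the tags and classes found by inspecting its source.

```python
from bs4 import BeautifulSoup

# Invented HTML standing in for a downloaded page.
SAMPLE_HTML = """
<div class="topic">
  <a href="/topics/3d"><p class="title">3D</p></a>
  <p class="desc"> 3D modeling and rendering. </p>
</div>
<div class="topic">
  <a href="/topics/ajax"><p class="title">Ajax</p></a>
  <p class="desc"> Asynchronous web requests. </p>
</div>
"""


def extract_topics(html):
    """Parse HTML and return one dictionary per topic block."""
    doc = BeautifulSoup(html, "html.parser")
    topics = []
    for block in doc.find_all("div", class_="topic"):
        topics.append({
            # get_text(strip=True) drops the surrounding whitespace
            "title": block.find("p", class_="title").get_text(strip=True),
            "description": block.find("p", class_="desc").get_text(strip=True),
            # hrefs on the page are relative, so prepend the site root
            "url": "https://github.com" + block.find("a")["href"],
        })
    return topics


topics = extract_topics(SAMPLE_HTML)
print(topics[0]["title"], topics[0]["url"])
```

Returning a list of dictionaries keeps each record together, and `pd.DataFrame(topics)` accepts that shape directly.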

###4. Create CSV file(s) with the extracted information

- Create functions for the end-to-end process of downloading, parsing, and saving CSVs.
- Execute the function with different inputs to create a dataset of CSV files.
- Verify the information in the CSV files by reading them back using Pandas.
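
A rough sketch of the write-then-verify step, using the two sample rows from the CSV format shown later in this notebook. The `save_and_verify` helper and the temporary file path are assumptions made for the example.

```python
import os
import tempfile

import pandas as pd

rows = [
    {"repo_name": "three.js", "repo_username": "mrdoob",
     "star_count": 86800, "urls": "https://github.com/mrdoob/three.js"},
    {"repo_name": "libgdx", "repo_username": "libgdx",
     "star_count": 20700, "urls": "https://github.com/libgdx/libgdx"},
]


def save_and_verify(rows, path):
    """Write rows to a CSV file, read it back, and return the loaded DataFrame."""
    df = pd.DataFrame(rows)
    df.to_csv(path, index=False)  # index=False keeps the row index out of the file

    loaded = pd.read_csv(path)
    # A cheap sanity check: same columns and the same number of rows.
    if list(loaded.columns) != list(df.columns) or len(loaded) != len(df):
        raise ValueError("CSV at {} does not match the data written".format(path))
    return loaded


path = os.path.join(tempfile.mkdtemp(), "3D.csv")
loaded = save_and_verify(rows, path)
print(loaded.head())
```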

###5. Document and share your work

- Add proper headings and documentation in your Jupyter notebook.
- Publish your Jupyter notebook to your Jovian profile.
- (Optional) Write a blog post about your project and share it online.

---------------------------------------------------------------------------------------------------------------------------------
##Steps to follow in this project 

1. Get the list of topics from the topics page.
2. Get the list of top repos from each individual topic page.
3. For each topic, create a CSV file of its top repos.


Outline:

- We're going to scrape https://github.com/topics
- We'll get a list of topics. For each topic, we'll get the topic title, topic description, and topic page URL.
- For each repository, we'll grab the repo name, repo username, stars and repo URL.
- For each topic, we'll create a CSV file in the following format.


```
Repo Name,Username,Stars,Repo URL
three.js,mrdoob,86800,https://github.com/mrdoob/three.js
libgdx,libgdx,20700,https://github.com/libgdx/libgdx
```

In [1]:
# install required libraries

!pip install requests --upgrade --quiet


In [2]:
!pip install beautifulsoup4 --upgrade --quiet


In [3]:
# import required libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd
import os

###Step 1: Get the list of topics from the topics page

In [4]:
def get_topic_info_df(doc):
    """Extract topic titles, descriptions, and URLs from the parsed topics page."""

    # topic titles
    title_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p', {'class': title_class})
    topic_titles = [tag.text for tag in topic_title_tags]

    # topic descriptions
    description_class = 'f5 color-fg-muted mb-0 mt-1'
    topic_description_tags = doc.find_all('p', {'class': description_class})
    topic_descriptions = [tag.text.strip() for tag in topic_description_tags]

    # topic URLs: the parent of each title tag carries the href to the topic page
    topic_urls = ["https://github.com" + tag.parent.get('href') for tag in topic_title_tags]

    # build the DataFrame from the three parallel lists
    topic_dict = {'title': topic_titles, 'description': topic_descriptions, 'url': topic_urls}
    topic_df = pd.DataFrame(topic_dict)

    return topic_df
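
One caveat on the class lookups above: per the Beautiful Soup documentation, a multi-class string passed to `find_all` is compared against the exact value of the `class` attribute, so a reordered or extra class on GitHub's side would make the lookup return nothing. A possible alternative, sketched here on an invented snippet, is a CSS selector, which only requires that each listed class is present.

```python
from bs4 import BeautifulSoup

# Invented snippet: the title classes appear in a different order than in the code above.
html = '<a href="/topics/3d"><p class="mt-1 f3 lh-condensed mb-0 Link--primary">3D</p></a>'
doc = BeautifulSoup(html, "html.parser")

# CSS selector: every listed class must be present, in any order.
title_tags = doc.select("p.f3.lh-condensed.Link--primary")

titles = [tag.get_text(strip=True) for tag in title_tags]
urls = ["https://github.com" + tag.parent.get("href") for tag in title_tags]
print(titles, urls)
```

Which classes to keep in the selector is a judgment call. Fewer classes tolerate more markup changes but may match unrelated tags.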

In [5]:
# get topic page
def topic_page():
    topic_url = "https://github.com/topics"
    response = requests.get(topic_url)
    # stop early on a failed request instead of parsing an error page
    response.raise_for_status()
    page_contents = response.text
    doc = BeautifulSoup(page_contents, 'html.parser')

    return doc

In [6]:
def scrape_topics_df():
    doc = topic_page()
    return get_topic_info_df(doc)

###Step 2: Get the list of top repos from each topic page

In [7]:
def function_1(topic_url):
    """Return a DataFrame of the top repositories listed on a topic page."""

    # load and parse the topic page
    topic_page = requests.get(topic_url)
    doc = BeautifulSoup(topic_page.text, 'html.parser')

    # dictionary to store info for all top repositories
    dic = {"repo_name": [], "repo_username": [], "star_count": [], "urls": []}

    # each h3 tag contains the repo username, repo name, and repo URL
    parent_class = 'f3 color-fg-muted text-normal lh-condensed'
    h3_tags = doc.find_all('h3', {'class': parent_class})

    # each star tag contains the star count for one repo
    star_class = 'Counter js-social-count'
    star_tag = doc.find_all('span', {'class': star_class})

    # pair each repo heading with its star counter and store the parsed values
    for h3_tag, star in zip(h3_tags, star_tag):
        # function_2 extracts the fields from one h3 tag and one star tag
        repo_info = function_2(h3_tag, star)
        dic['repo_name'].append(repo_info[0])
        dic['repo_username'].append(repo_info[1])
        dic['star_count'].append(repo_info[2])
        dic['urls'].append(repo_info[3])

    # convert dic into a DataFrame
    dic_df = pd.DataFrame(dic)

    return dic_df

def function_2(h3_tags, star_tag):
    """Extract (repo_name, repo_username, stars, url) from one repo's tags."""

    base_url = "https://github.com"

    # the heading holds two links; the second link's href is the repo URL,
    # so the first link is the owner and the second is the repo
    a_tags = h3_tags.find_all('a')
    repo_username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    stars = star_tag.text.strip()
    links = base_url + a_tags[1]['href']

    return repo_name, repo_username, stars, links
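
The sample CSV in the outline lists star counts as plain integers such as 86800, while `function_2` stores the counter text unchanged. Assuming the counter text is abbreviated in a form like `86.8k`, a small converter along these lines could bridge the gap. `parse_star_count` is a hypothetical helper, not part of the code above.

```python
def parse_star_count(stars_str):
    """Convert counter text such as '86.8k' or '1,234' to an integer."""
    text = stars_str.strip().lower().replace(",", "")
    if text.endswith("k"):
        # round() guards against floating-point results like 86800.00000000001
        return round(float(text[:-1]) * 1000)
    return int(text)


print(parse_star_count("86.8k"))
print(parse_star_count(" 512 "))
```

It could be applied to `stars` before returning from `function_2`, so that the `star_count` column is numeric and sortable.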

###Step 3: For each topic, create a CSV file of its top repos

In [8]:
# Put it all together

def scrape_topic(topic_url, path):
    if os.path.exists(path):
        print("The file path {} already exists. Skipping....".format(path))
        return

    topic_repo_df = function_1(topic_url)
    topic_repo_df.to_csv(path, index=False)


def scrape_topic_repos():
    print("Scraping list of topics.")
    topics_df = scrape_topics_df()
    
    os.makedirs("data", exist_ok = True)
    for index, row in topics_df.iterrows():
        print("Scraping top repositories for {}".format(row['title']))
        scrape_topic(row['url'], "data/{}.csv".format(row['title']))

scrape_topic_repos()

Scraping list of topics.
Scraping top repositories for 3D
Scraping top repositories for Ajax
Scraping top repositories for Algorithm
Scraping top repositories for Amp
Scraping top repositories for Android
Scraping top repositories for Angular
Scraping top repositories for Ansible
Scraping top repositories for API
Scraping top repositories for Arduino
Scraping top repositories for ASP.NET
Scraping top repositories for Atom
Scraping top repositories for Awesome Lists
Scraping top repositories for Amazon Web Services
Scraping top repositories for Azure
Scraping top repositories for Babel
Scraping top repositories for Bash
Scraping top repositories for Bitcoin
Scraping top repositories for Bootstrap
Scraping top repositories for Bot
Scraping top repositories for C
Scraping top repositories for Chrome
Scraping top repositories for Chrome extension
Scraping top repositories for Command line interface
Scraping top repositories for Clojure
Scraping top repositories for Code quality
Scraping to
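
`scrape_topic_repos` builds each file name directly from the topic title. That works for the titles in the output above, but a title containing a character such as `/` would be read as a nested path. A hedged sketch of a sanitizing helper follows. `topic_filename` and the `CI/CD` title are hypothetical and used only to illustrate the case.

```python
import re


def topic_filename(title, folder="data"):
    """Return a CSV path for a topic title, replacing characters unsafe in file names."""
    # Replace path separators and characters that some file systems reject.
    safe = re.sub(r'[\\/:*?"<>|]+', "_", title).strip()
    return "{}/{}.csv".format(folder, safe)


print(topic_filename("ASP.NET"))
print(topic_filename("CI/CD"))
```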

In [10]:
!echo "# Web-Scraping-Project" >> README.md
!git init

Reinitialized existing Git repository in /content/.git/


In [14]:
!git config --global user.name "Mynk-kuswa"
!git config --global user.email "mayankkushwah912@gmail.com"

In [16]:
!git add README.md

In [None]:
!git commit -m "first commit"
!git branch -M main
!git remote add origin https://github.com/Mynk-kuswa/Web-Scraping-Project.git
!git push -u origin main