## Web-scraping project on the GitHub topics page

### About Web-scraping
Web scraping is an automated method used to extract large amounts of data from websites quickly and efficiently. This process involves fetching web pages and extracting relevant information, which can then be stored and analyzed for various purposes such as data analysis, research, and machine learning.

### About GitHub (since we will be scraping its topics page)
Since this project is already hosted on GitHub, you probably know about it. Put simply, it's a Disneyland for developers, where projects come to life.

### About the project


This project aims to extract and analyze data from GitHub repositories for various topics using web scraping techniques. By scraping data from GitHub, we can gain insights into the most popular repositories, trends in different fields, and the activity levels of various projects.

#### Objectives
1. Scrape GitHub Repositories: Extract data from GitHub repository pages for specific topics.
2. Data Parsing and Storage: Parse the HTML content and store the extracted data in a structured format.

#### Technologies used
1. Python: All code is written in Python, chosen for its versatility
2. Jupyter Notebook: The environment used to write and run the code

#### Libraries used
1. Pandas: For data manipulation
2. Requests: For sending HTTP requests
3. OS: For interacting with the operating system, since the data is saved as CSV files on the device
4. BeautifulSoup: For parsing HTML and extracting data

## Project workflow:
1. Setting up the environment
2. Web scraping
3. Data parsing and storage

###### Note: Currently, this covers only the topics listed on the first page of the main topics page

### 1. Setting up the environment:
Import all the libraries needed to scrape data from the GitHub topics page.

In [17]:
import pandas as pd
import requests
import os
from bs4 import BeautifulSoup

### 2. Web Scraping:
- First, we send an HTTP GET request to the GitHub topics page URL (served over HTTPS) and check that the status code is 200.
- Then we parse the whole page, relevant or not, with BeautifulSoup and store the result in a variable.

In [18]:
base_url='https://github.com/topics'
resp= requests.get(base_url)
resp.status_code

200

In [32]:
page_contents=resp.text

with open('webpage.html','w',encoding='utf-8') as f:
    f.write(page_contents)

doc=BeautifulSoup(page_contents,'html.parser')

- The star count of each repository is scraped as a string, sometimes abbreviated with a 'k' suffix for thousands. The function below converts that string into an integer.

In [20]:
def parse_star_count(stars_str):
    stars_str=stars_str.strip()
    if stars_str[-1]=='k':
        return int(float(stars_str[:-1])*1000)
    return int(stars_str)
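
The conversion above handles the 'k' suffix only. As a hedged sketch, the hypothetical helper below (`parse_star_count_robust`, not part of the original notebook) also accepts an 'm' suffix and thousands separators, on the assumption that the page might render counts that way. It rounds instead of truncating, so floating-point error cannot shave one off the result.

```python
def parse_star_count_robust(stars_str):
    """Convert a star-count string such as '1.2k' into an integer.

    Hypothetical variant of parse_star_count; the 'm' suffix and the
    comma handling are assumptions about formats the page might use.
    """
    multipliers = {'k': 1_000, 'm': 1_000_000}
    text = stars_str.strip().lower().replace(',', '')
    if not text:
        raise ValueError('empty star count')
    suffix = text[-1]
    if suffix in multipliers:
        # round() guards against float products landing just below the integer
        return round(float(text[:-1]) * multipliers[suffix])
    return int(text)


print(parse_star_count_robust(' 81.2k '))
print(parse_star_count_robust('1,234'))
```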

- The BeautifulSoup library extracts the relevant text from each tag, selected by tag type and class name.

In [21]:
def get_repo_info(h3_tag,star_tag):
    #returns all the required information about a particular repository
    a_tags=h3_tag.find_all('a')
    username=a_tags[0].text.strip()
    repo_name=a_tags[1].text.strip()
    #build the repo url from the link's href, relative to the site root
    repo_url='https://github.com'+a_tags[1]['href']
    stars_count=parse_star_count(star_tag.text.strip())
    return username,repo_name,repo_url,stars_count
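
A repository URL has to be built from the site root plus the link's path, not from the topics page URL. A minimal sketch using the standard library's `urljoin`; the path `/some-user/some-repo` is a made-up placeholder, and the idea that hrefs on the page are site-relative is an assumption about its markup.

```python
from urllib.parse import urljoin

SITE_ROOT = 'https://github.com'  # site root, not the /topics page


def absolute_url(href):
    """Resolve an href found on the page against the site root."""
    return urljoin(SITE_ROOT, href)


# '/some-user/some-repo' is a hypothetical href used for illustration
print(absolute_url('/some-user/some-repo'))
# links that are already absolute pass through unchanged
print(absolute_url('https://github.com/topics/python'))
```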


- Each topic page is downloaded and parsed, and the info for all of its repositories is collected into a dataframe. The functions below perform exactly these tasks.

In [22]:
def get_topic_page(topic_url):
    #download the page
    response=requests.get(topic_url)

    #check response
    if response.status_code!=200:
        raise Exception('Failed to load page {}'.format(topic_url))
    
    #parse html
    topic_doc=BeautifulSoup(response.text,'html.parser')

    return topic_doc




def get_all_topic_repo_info(topic_doc):
    #repo tags
    usernames_and_repo_tag=topic_doc.find_all('h3',{'class':'f3 color-fg-muted text-normal lh-condensed'})

    #stars count tags
    star_tags=topic_doc.find_all('span',{'class':'Counter js-social-count'})

    #using above tags to get repo info using function
    topic_repo_dict={
    'username':[],
    'repo_name':[],
    'repo_url':[],
    'stars':[]
    }

    for i in range(len(usernames_and_repo_tag)):
        repo_info=get_repo_info(usernames_and_repo_tag[i],star_tags[i])
        topic_repo_dict['username'].append(repo_info[0])
        topic_repo_dict['repo_name'].append(repo_info[1])
        topic_repo_dict['repo_url'].append(repo_info[2])
        topic_repo_dict['stars'].append(repo_info[3])
    
    return pd.DataFrame(topic_repo_dict)
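
The loop above pairs heading tags and star tags by index, so it raises an `IndexError` if the page yields fewer star tags than headings. A small sketch of a defensive alternative, using plain strings in place of parsed tags; `pair_strictly` is a hypothetical helper, not part of the notebook.

```python
def pair_strictly(headings, stars):
    """Pair items positionally, refusing lists of different lengths."""
    if len(headings) != len(stars):
        raise ValueError(
            'expected equal counts, got {} headings and {} star tags'
            .format(len(headings), len(stars))
        )
    return list(zip(headings, stars))


# stand-in strings; in the notebook these would be BeautifulSoup tags
pairs = pair_strictly(['repo-a', 'repo-b'], ['1.2k', '57'])
for heading, star in pairs:
    print(heading, star)
```

Failing loudly on a mismatch makes a change in the page layout visible immediately, instead of producing rows where stars belong to the wrong repository.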

- The functions below extract the topic data: 'topics_title', the name of the topic; 'topic_description', its description; and 'topic_url', the URL of the topic page.

In [23]:
#function to extract topic titles from title tags taken from the page
def get_topics_title(doc):
    selection_class='f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_names_p_tags=doc.find_all('p',{'class':selection_class})
    topic_titles=[]
    for tag in topic_names_p_tags:
        topic_titles.append(tag.text)
    
    return topic_titles

#function to extract description from the description tag
#  extracted from the webpage
def get_topic_description(doc):
    topic_desc_class='f5 color-fg-muted mb-0 mt-1'
    topic_description__tags=doc.find_all('p',{'class':topic_desc_class})
    topic_descriptions=[]
    for tag in topic_description__tags:
        topic_descriptions.append(tag.text.strip())
    
    return topic_descriptions
    
#extract individual topic page link
def get_topic_link(doc):
    topic_link_tags=doc.find_all('a',{'class' : 'no-underline flex-grow-0'})
    topic_urls=[]
    base_url='https://github.com'
    for tag in topic_link_tags:
        topic_urls.append(base_url+tag['href'])
    
    return topic_urls 
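
Matching on the full class string, as the functions above do, succeeds only when the attribute text is exactly that string, so a reordered or added class breaks the match. A hedged sketch of an alternative using CSS selectors on a made-up HTML snippet; the snippet and the choice of classes to select on are assumptions, not GitHub's actual markup.

```python
from bs4 import BeautifulSoup

# made-up snippet; the second <p> lists the same classes in another order
html = '''
<p class="f3 lh-condensed">Python</p>
<p class="lh-condensed f3">Rust</p>
'''
doc = BeautifulSoup(html, 'html.parser')

# exact-string match: finds only the tag whose class attribute reads 'f3 lh-condensed'
exact = doc.find_all('p', {'class': 'f3 lh-condensed'})

# CSS selector: finds every <p> carrying both classes, in any order
flexible = doc.select('p.f3.lh-condensed')

print([tag.text for tag in exact])
print([tag.text for tag in flexible])
```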
    

- Using the above functions, the function below creates a dataframe of all topics.

In [28]:
#function that builds a dataframe listing all topics
# present in the github/topics page
def scrape_topics():
    topics_url='https://github.com/topics'
    response=requests.get(topics_url)
    #check response
    if response.status_code!=200:
        raise Exception('Failed to load page {}'.format(topics_url))

    #parse the page just downloaded instead of relying on the global 'doc'
    doc=BeautifulSoup(response.text,'html.parser')

    topics_dict={
        'title':get_topics_title(doc),
        'description':get_topic_description(doc),
        'url':get_topic_link(doc)
    }
    
    return pd.DataFrame(topics_dict)

- Each dataframe is saved as a CSV file in a folder on the device (created with the os library), named after the topic title extracted by 'get_topics_title()'.

In [31]:
def create_csv_of_each_topic(topic_url,file_name):
    #file_name is the full path of the csv, e.g. 'topics_data/3D.csv'
    #skip topics that were already scraped, before downloading anything
    if os.path.exists(file_name):
        print('The file "{}" already exists.'.format(file_name))
        return

    topic_df=get_all_topic_repo_info(get_topic_page(topic_url))
    topic_df.to_csv(file_name,index=False)

def scrape_repo_info_of_topic():
    topics_df =scrape_topics()

    os.makedirs('topics_data',exist_ok=True)

    for index,row in topics_df.iterrows():
        print('scraping top repositories for "{}"'.format(row['title']))
        create_csv_of_each_topic(row['url'],'topics_data/{}.csv'.format(row['title']))
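
Topic titles are used directly as file names above, which could fail if a title ever contained a path separator or another character the file system rejects. A minimal sketch of a sanitizing helper; `topic_csv_path`, its replacement rule, and the title 'CI/CD' are assumptions for illustration, not part of the original notebook.

```python
import os
import re


def topic_csv_path(folder, title):
    """Build a CSV path for a topic, replacing risky characters with '_'."""
    # keep letters, digits, underscore, dash, dot and space; replace the rest
    safe_title = re.sub(r'[^\w\-. ]', '_', title).strip()
    return os.path.join(folder, safe_title + '.csv')


print(topic_csv_path('topics_data', 'ASP.NET'))
# 'CI/CD' is a made-up title containing a path separator
print(topic_csv_path('topics_data', 'CI/CD'))
```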


- Call the final function, 'scrape_repo_info_of_topic()'.

In [26]:
scrape_repo_info_of_topic()

scraping top repositories for "3D"
scraping top repositories for "Ajax"
scraping top repositories for "Algorithm"
scraping top repositories for "Amp"
scraping top repositories for "Android"
scraping top repositories for "Angular"
scraping top repositories for "Ansible"
scraping top repositories for "API"
scraping top repositories for "Arduino"
scraping top repositories for "ASP.NET"
scraping top repositories for "Awesome Lists"
scraping top repositories for "Amazon Web Services"
scraping top repositories for "Azure"
scraping top repositories for "Babel"
scraping top repositories for "Bash"
scraping top repositories for "Bitcoin"
scraping top repositories for "Bootstrap"
scraping top repositories for "Bot"
scraping top repositories for "C"
scraping top repositories for "Chrome"
scraping top repositories for "Chrome extension"
scraping top repositories for "Command-line interface"
scraping top repositories for "Clojure"
scraping top repositories for "Code quality"
scraping top repositori

The folder named 'topics_data' contains the final results: one CSV file per topic, holding the data extracted from that topic's page.