# Data collection

In this notebook, I'll use the **GitHub API** to extract various information from my user profile such as repositories, commits and more. I'll also save this data to **.csv** files so that I can draw insights.

## Importing libraries and defining constants

I'll import various libraries needed for fetching the data.

In [1]:
import json

import numpy as np
import pandas as pd
import requests
from requests.auth import HTTPBasicAuth

I'll load the credentials from a JSON file and create an `authentication` object to pass along with each request.

In [3]:
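with open('credentials.json') as f:
    credentials = json.load(f)
# GitHub's API no longer accepts account passwords, so 'password' should hold a personal access token
authentication = HTTPBasicAuth(credentials['username'], credentials['password'])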
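As an alternative to basic authentication, a personal access token can be sent in the request headers. The block below is a minimal sketch of that approach; the helper name `build_auth_headers` and the token value are hypothetical and not part of this notebook.

```python
def build_auth_headers(token):
    """Return headers for token-based access to the GitHub REST API."""
    return {
        # The token is sent as a bearer credential
        "Authorization": "Bearer " + token,
        # Media type recommended by GitHub for REST API responses
        "Accept": "application/vnd.github+json",
    }


# Placeholder value for illustration only; a real token would come from credentials.json
headers = build_auth_headers("example-token")
print(sorted(headers))
```

These headers could then be passed as `requests.get(url, headers=headers)` in place of the `auth` argument.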

## User information

I'll first extract the user information, such as the name and related URLs, which will be useful later.

In [4]:
data = requests.get('https://api.github.com/users/' + credentials['username'],
                    auth = authentication)
data = data.json()
data

{'login': 'BandhaviC',
 'id': 60414933,
 'node_id': 'MDQ6VXNlcjYwNDE0OTMz',
 'avatar_url': 'https://avatars.githubusercontent.com/u/60414933?v=4',
 'gravatar_id': '',
 'url': 'https://api.github.com/users/BandhaviC',
 'html_url': 'https://github.com/BandhaviC',
 'followers_url': 'https://api.github.com/users/BandhaviC/followers',
 'following_url': 'https://api.github.com/users/BandhaviC/following{/other_user}',
 'gists_url': 'https://api.github.com/users/BandhaviC/gists{/gist_id}',
 'starred_url': 'https://api.github.com/users/BandhaviC/starred{/owner}{/repo}',
 'subscriptions_url': 'https://api.github.com/users/BandhaviC/subscriptions',
 'organizations_url': 'https://api.github.com/users/BandhaviC/orgs',
 'repos_url': 'https://api.github.com/users/BandhaviC/repos',
 'events_url': 'https://api.github.com/users/BandhaviC/events{/privacy}',
 'received_events_url': 'https://api.github.com/users/BandhaviC/received_events',
 'type': 'User',
 'site_admin': False,
 'name': 'Bandhavi Chitturi'

From the JSON output above, I'll extract basic information such as `name`, `location`, `email`, `bio`, `public_repos`, and `public_gists`. I'll also keep some of the URLs handy, including `repos_url`, `gists_url` and `blog`.

In [5]:
print("Information about user {}:\n".format(credentials['username']))
print("Name: {}".format(data['name']))
print("Email: {}".format(data['email']))
print("Location: {}".format(data['location']))
print("Public repos: {}".format(data['public_repos']))
print("Public gists: {}".format(data['public_gists']))
print("About: {}".format(data['bio']))

Information about user BandhaviC:

Name: Bandhavi Chitturi
Email: None
Location: Smyrna, GA
Public repos: 22
Public gists: 0
About: I'm a passionate Data enthusiast! I have experience as a Data Analyst. I like to bring insights from data that helps in predictions and recommendation systems.


## Repositories

Next, I'll fetch the user's repositories. By default, the API returns only 30 repositories per request, so I'll iterate over the pages until all repositories are fetched.

In [6]:
url = data['repos_url']
page_no = 1
repos_data = []
while True:
    response = requests.get(url, auth = authentication)
    response = response.json()
    repos_data = repos_data + response
    repos_fetched = len(response)
    print("Total repositories fetched: {}".format(len(repos_data)))
    # A full page (30 items) means there may be more, so request the next page
    if repos_fetched == 30:
        page_no = page_no + 1
        # Both parts are str; calling .encode() here would mix bytes and str and fail on Python 3
        url = data['repos_url'] + '?page=' + str(page_no)
    else:
        break

Total repositories fetched: 22
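The page-by-page loop can also be written as a reusable helper. The sketch below assumes a hypothetical callable `fetch_page` that takes a 1-based page number and returns that page's items as a list; a fake data source stands in for the API so the block runs without network access.

```python
def fetch_all_pages(fetch_page, per_page=30):
    """Collect items from a paginated source, one page at a time."""
    items = []
    page_no = 1
    while True:
        page = fetch_page(page_no)
        items.extend(page)
        # A short (or empty) page means there is nothing left to fetch
        if len(page) < per_page:
            break
        page_no += 1
    return items


# Stand-in for the API: 65 fake repositories served 30 at a time
fake_repos = ["repo-{}".format(n) for n in range(65)]


def fake_fetch(page_no):
    start = (page_no - 1) * 30
    return fake_repos[start:start + 30]


all_repos = fetch_all_pages(fake_fetch)
print(len(all_repos))  # 65
```

With the real API, `fetch_page` could be something like `lambda page_no: requests.get(url, params={'page': page_no}, auth=authentication).json()`. The API also accepts a `per_page` parameter (up to 100), which would reduce the number of requests.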


I'll first look at a single repository to see what information is available to keep.

In [7]:
repos_data[0]

{'id': 746834741,
 'node_id': 'R_kgDOLIPLNQ',
 'name': '911_Calls_Data_Analysis',
 'full_name': 'BandhaviC/911_Calls_Data_Analysis',
 'private': False,
 'owner': {'login': 'BandhaviC',
  'id': 60414933,
  'node_id': 'MDQ6VXNlcjYwNDE0OTMz',
  'avatar_url': 'https://avatars.githubusercontent.com/u/60414933?v=4',
  'gravatar_id': '',
  'url': 'https://api.github.com/users/BandhaviC',
  'html_url': 'https://github.com/BandhaviC',
  'followers_url': 'https://api.github.com/users/BandhaviC/followers',
  'following_url': 'https://api.github.com/users/BandhaviC/following{/other_user}',
  'gists_url': 'https://api.github.com/users/BandhaviC/gists{/gist_id}',
  'starred_url': 'https://api.github.com/users/BandhaviC/starred{/owner}{/repo}',
  'subscriptions_url': 'https://api.github.com/users/BandhaviC/subscriptions',
  'organizations_url': 'https://api.github.com/users/BandhaviC/orgs',
  'repos_url': 'https://api.github.com/users/BandhaviC/repos',
  'events_url': 'https://api.github.com/users/Ba

There are a number of things that I can keep track of here. I'll select the following:
1. id: Unique id for the repository.
2. name: The name of the repository.
3. description: The description of the repository.
4. created_at: The time and date when the repository was first created.
5. updated_at: The time and date when the repository was last updated.
6. login: Username of the owner of the repository.
7. license: The license type (if any).
8. has_wiki: A boolean that signifies if the repository has a wiki document.
9. forks_count: Total forks of the repository.
10. open_issues_count: Total open issues in the repository.
11. stargazers_count: The total stars on the repository.
12. watchers_count: Total users watching the repository.

I'll also keep track of some URLs for further analysis, including:
1. url: The url of the repository.
2. commits_url: The url for all commits in the repository.
3. languages_url: The url for all languages in the repository.

For the commits URL, I'll remove the trailing template part inside the braces.

In [8]:
repos_information = []
for repo in repos_data:
    # Use a separate name so the user-level `data` dictionary is not overwritten
    repo_info = []
    repo_info.append(repo['id'])
    repo_info.append(repo['name'])
    repo_info.append(repo['description'])
    repo_info.append(repo['created_at'])
    repo_info.append(repo['updated_at'])
    repo_info.append(repo['owner']['login'])
    # `license` is None when the repository has no license
    repo_info.append(repo['license']['name'] if repo['license'] is not None else None)
    repo_info.append(repo['has_wiki'])
    repo_info.append(repo['forks_count'])
    repo_info.append(repo['open_issues_count'])
    repo_info.append(repo['stargazers_count'])
    repo_info.append(repo['watchers_count'])
    repo_info.append(repo['url'])
    # Drop the trailing template part in braces from the commits URL
    repo_info.append(repo['commits_url'].split("{")[0])
    repo_info.append(repo['url'] + '/languages')
    repos_information.append(repo_info)
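The brace-stripping step can be isolated into a small helper. This is a minimal sketch; the function name `strip_uri_template` is hypothetical, and it assumes the template part always comes at the end of the URL, as it does in `commits_url`.

```python
def strip_uri_template(url):
    """Drop a trailing URI-template part such as '{/sha}' from a URL."""
    # Everything from the first opening brace onwards is discarded
    return url.split("{", 1)[0]


commits_url = "https://api.github.com/repos/BandhaviC/Beachbnb/commits{/sha}"
print(strip_uri_template(commits_url))
# https://api.github.com/repos/BandhaviC/Beachbnb/commits
```

URLs without a template part pass through unchanged, so the helper is safe to apply to every URL column.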

In [10]:
repos_df = pd.DataFrame(repos_information, columns = ['Id', 'Name', 'Description', 'Created on', 'Updated on', 
                                                      'Owner', 'License', 'Includes wiki', 'Forks count', 
                                                      'Issues count', 'Stars count', 'Watchers count',
                                                      'Repo URL', 'Commits URL', 'Languages URL'])
repos_df.head(10)

| | Id | Name | Description | Created on | Updated on | Owner | License | Includes wiki | Forks count | Issues count | Stars count | Watchers count | Repo URL | Commits URL | Languages URL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 746834741 | 911_Calls_Data_Analysis | This is a data capstone project of 911 Calls D... | 2024-01-22T19:07:47Z | 2024-01-22T22:12:14Z | BandhaviC | | True | 0 | 0 | 0 | 0 | https://api.github.com/repos/BandhaviC/911_Cal... | https://api.github.com/repos/BandhaviC/911_Cal... | https://api.github.com/repos/BandhaviC/911_Cal... |
| 1 | 702736151 | Apache-Beam-Batch-processing_Dataflow | | 2023-10-09T23:00:51Z | 2023-10-09T23:01:56Z | BandhaviC | MIT License | True | 0 | 0 | 0 | 0 | https://api.github.com/repos/BandhaviC/Apache-... | https://api.github.com/repos/BandhaviC/Apache-... | https://api.github.com/repos/BandhaviC/Apache-... |
| 2 | 741108129 | BandhaviC.github.io | | 2024-01-09T17:58:59Z | 2024-01-10T20:30:36Z | BandhaviC | | False | 0 | 0 | 1 | 1 | https://api.github.com/repos/BandhaviC/Bandhav... | https://api.github.com/repos/BandhaviC/Bandhav... | https://api.github.com/repos/BandhaviC/Bandhav... |
| 3 | 360756417 | Beachbnb | | 2021-04-23T03:52:14Z | 2021-04-23T04:01:32Z | BandhaviC | | True | 0 | 0 | 0 | 0 | https://api.github.com/repos/BandhaviC/Beachbnb | https://api.github.com/repos/BandhaviC/Beachbn... | https://api.github.com/repos/BandhaviC/Beachbn... |
| 4 | 702722284 | Customer-Segmentation-and-Clustering | | 2023-10-09T22:00:28Z | 2023-10-09T22:00:33Z | BandhaviC | | True | 0 | 0 | 0 | 0 | https://api.github.com/repos/BandhaviC/Custome... | https://api.github.com/repos/BandhaviC/Custome... | https://api.github.com/repos/BandhaviC/Custome... |
| 5 | 632156401 | DataStructures_and_algorithms_Python | | 2023-04-24T20:37:49Z | 2023-06-21T20:47:23Z | BandhaviC | | False | 0 | 0 | 0 | 0 | https://api.github.com/repos/BandhaviC/DataStr... | https://api.github.com/repos/BandhaviC/DataStr... | https://api.github.com/repos/BandhaviC/DataStr... |
| 6 | 657332813 | Data_Analysis_of_UberData | | 2023-06-22T20:43:43Z | 2023-06-22T20:43:50Z | BandhaviC | GNU General Public License v3.0 | True | 0 | 0 | 0 | 0 | https://api.github.com/repos/BandhaviC/Data_An... | https://api.github.com/repos/BandhaviC/Data_An... | https://api.github.com/repos/BandhaviC/Data_An... |
| 7 | 639551572 | data_engineering_with_snowpark_python | | 2023-05-11T17:35:31Z | 2023-05-11T17:40:38Z | BandhaviC | Apache License 2.0 | True | 0 | 0 | 0 | 0 | https://api.github.com/repos/BandhaviC/data_en... | https://api.github.com/repos/BandhaviC/data_en... | https://api.github.com/repos/BandhaviC/data_en... |
| 8 | 619930653 | Data_Science-ML | | 2023-03-27T17:38:38Z | 2023-03-27T17:41:09Z | BandhaviC | | True | 0 | 0 | 0 | 0 | https://api.github.com/repos/BandhaviC/Data_Sc... | https://api.github.com/repos/BandhaviC/Data_Sc... | https://api.github.com/repos/BandhaviC/Data_Sc... |
| 9 | 742164526 | E-commerce-Data-Analysis | | 2024-01-11T22:16:09Z | 2024-01-11T22:16:19Z | BandhaviC | | True | 0 | 0 | 0 | 0 | https://api.github.com/repos/BandhaviC/E-comme... | https://api.github.com/repos/BandhaviC/E-comme... | https://api.github.com/repos/BandhaviC/E-comme... |


## Languages

To get the languages of each repository, I'll iterate through every repo's `Languages URL` and fetch the corresponding data. I'll also store the result back in the dataframe.

In [11]:
for i in range(repos_df.shape[0]):
    response = requests.get(repos_df.loc[i, 'Languages URL'], auth = authentication)
    response = response.json()
    print(i, response)
    # The keys are the language names; an empty response gives an empty string
    repos_df.loc[i, 'Languages'] = ', '.join(response.keys())

0 {'Jupyter Notebook': 506555}
1 {'Python': 1927}
2 {'JavaScript': 99440, 'HTML': 1733, 'CSS': 735}
3 {'TypeScript': 36262, 'HTML': 11684, 'SCSS': 8767, 'JavaScript': 1851}
4 {'Jupyter Notebook': 654912}
5 {'Jupyter Notebook': 31817, 'Python': 1284}
6 {'Jupyter Notebook': 202855}
7 {'Python': 26583, 'PLpgSQL': 2274}
8 {'Jupyter Notebook': 40628506, 'HTML': 6312735}
9 {'Jupyter Notebook': 3971306}
10 {'Jupyter Notebook': 184261, 'Python': 52495}
11 {'Jupyter Notebook': 87096}
12 {'CSS': 85076, 'SCSS': 77989, 'HTML': 39074, 'JavaScript': 17779}
13 {'TypeScript': 25928, 'HTML': 12471, 'JavaScript': 4614, 'CSS': 196}
14 {'JavaScript': 7976, 'HTML': 1721, 'CSS': 1240}
15 {'Python': 1019}
16 {'JavaScript': 11401, 'CSS': 10251, 'HTML': 361}
17 {'TypeScript': 49005, 'HTML': 21949, 'SCSS': 4612, 'JavaScript': 2325}
18 {'Jupyter Notebook': 255488}
19 {'Jupyter Notebook': 654653}
20 {'JavaScript': 7843, 'CSS': 713, 'HTML': 400}
21 {'Python': 4099}
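The values in each response are the number of bytes of code per language, so they can also be turned into percentage shares rather than keeping only the names. The block below is a minimal sketch of that idea; the function name `language_shares` is hypothetical and this step is not part of the notebook's pipeline.

```python
def language_shares(languages):
    """Convert a {language: bytes} mapping into {language: percent of total}."""
    total = sum(languages.values())
    if total == 0:
        # An empty response (no detected languages) has no shares
        return {}
    return {name: round(100 * size / total, 1) for name, size in languages.items()}


# Response for one of the repositories listed above
sample = {'JavaScript': 99440, 'HTML': 1733, 'CSS': 735}
print(language_shares(sample))
```

Storing shares alongside the names would make it possible to weight languages by how much of each repository they account for.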


I'll save this data to a .csv file called **repos_info.csv**.

In [12]:
repos_df.to_csv('repos_info.csv', index = False)

## Commits

Next, I'll build a dataset of all the commits made so far. To start, I'll inspect the commits response for a single repository.

In [13]:
response = requests.get(repos_df.loc[0, 'Commits URL'], auth = authentication)
response.json()

[{'sha': '68a0d7ce20251e6e1e9c72062bd22d25e9aa17e3',
  'node_id': 'C_kwDOLIPLNdoAKDY4YTBkN2NlMjAyNTFlNmUxZTljNzIwNjJiZDIyZDI1ZTlhYTE3ZTM',
  'commit': {'author': {'name': 'Bandhavi Chitturi',
    'email': '60414933+BandhaviC@users.noreply.github.com',
    'date': '2024-01-22T22:06:28Z'},
   'committer': {'name': 'GitHub',
    'email': 'noreply@github.com',
    'date': '2024-01-22T22:06:28Z'},
   'message': 'Update README.md',
   'tree': {'sha': '7fc7fb048d5667df187b0abd8906032f489af964',
    'url': 'https://api.github.com/repos/BandhaviC/911_Calls_Data_Analysis/git/trees/7fc7fb048d5667df187b0abd8906032f489af964'},
   'url': 'https://api.github.com/repos/BandhaviC/911_Calls_Data_Analysis/git/commits/68a0d7ce20251e6e1e9c72062bd22d25e9aa17e3',
   'comment_count': 0,
   'verification': {'verified': True,
    'reason': 'valid',
    'signature': '-----BEGIN PGP SIGNATURE-----\n\nwsFcBAABCAAQBQJlrubkCRC1aQ7uu5UhlAAAYc0QAB9Guh9HUEBsqn/7XvIeXP3T\nFrMDFaCM2GEntR4uurDTgBrDtV1MH1aLWG8UW7LsXNIDO98+

I'll save the id (the commit's `sha`), the date and the message of each commit, along with the id of the repository it belongs to.

In [15]:
commits_information = []
for i in range(repos_df.shape[0]):
    url = repos_df.loc[i, 'Commits URL']
    page_no = 1
    while (True):
        response = requests.get(url, auth = authentication)
        response = response.json()
        print("URL: {}, commits: {}".format(url, len(response)))
        for commit in response:
            commit_data = []
            commit_data.append(repos_df.loc[i, 'Id'])
            commit_data.append(commit['sha'])
            commit_data.append(commit['commit']['committer']['date'])
            commit_data.append(commit['commit']['message'])
            commits_information.append(commit_data)
        if (len(response) == 30):
            page_no = page_no + 1
            url = repos_df.loc[i, 'Commits URL'] + '?page=' + str(page_no)
        else:
            break

URL: https://api.github.com/repos/BandhaviC/911_Calls_Data_Analysis/commits, commits: 3
URL: https://api.github.com/repos/BandhaviC/Apache-Beam-Batch-processing_Dataflow/commits, commits: 3
URL: https://api.github.com/repos/BandhaviC/BandhaviC.github.io/commits, commits: 3
URL: https://api.github.com/repos/BandhaviC/Beachbnb/commits, commits: 2
URL: https://api.github.com/repos/BandhaviC/Customer-Segmentation-and-Clustering/commits, commits: 4
URL: https://api.github.com/repos/BandhaviC/DataStructures_and_algorithms_Python/commits, commits: 20
URL: https://api.github.com/repos/BandhaviC/Data_Analysis_of_UberData/commits, commits: 6
URL: https://api.github.com/repos/BandhaviC/data_engineering_with_snowpark_python/commits, commits: 16
URL: https://api.github.com/repos/BandhaviC/Data_Science-ML/commits, commits: 3
URL: https://api.github.com/repos/BandhaviC/E-commerce-Data-Analysis/commits, commits: 1
URL: https://api.github.com/repos/BandhaviC/Google-Cloud-Analytics/commits, commits: 1
U
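The per-commit extraction can be expressed as a small function that flattens one commit object into a row. This is a sketch only: the name `commit_to_row` is hypothetical, and the sample dictionary is a trimmed copy of the response shown earlier, keeping just the fields that are used.

```python
def commit_to_row(repo_id, commit):
    """Flatten one commit object into [repo id, sha, committer date, message]."""
    details = commit['commit']
    return [repo_id, commit['sha'], details['committer']['date'], details['message']]


# Trimmed copy of the first commit returned for 911_Calls_Data_Analysis
sample_commit = {
    'sha': '68a0d7ce20251e6e1e9c72062bd22d25e9aa17e3',
    'commit': {
        'committer': {'name': 'GitHub',
                      'email': 'noreply@github.com',
                      'date': '2024-01-22T22:06:28Z'},
        'message': 'Update README.md',
    },
}

row = commit_to_row(746834741, sample_commit)
print(row)
```

Note that the date comes from the `committer` block, which can differ from the `author` date when a commit is rebased or applied by someone else.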

In [16]:
commits_df = pd.DataFrame(commits_information, columns = ['Repo Id', 'Commit Id', 'Date', 'Message'])
commits_df.head(5)

| | Repo Id | Commit Id | Date | Message |
|---|---|---|---|---|
| 0 | 746834741 | 68a0d7ce20251e6e1e9c72062bd22d25e9aa17e3 | 2024-01-22T22:06:28Z | Update README.md |
| 1 | 746834741 | f2230fc11583c9ac61384434ddb3feb6ae1e701d | 2024-01-22T22:06:10Z | Create README.md |
| 2 | 746834741 | fb33a7dc044411c6740debdb3fc2859c19527f83 | 2024-01-22T22:04:21Z | 911 calls analysis EDA |
| 3 | 702736151 | d4b70f999620be3c1deb9ae690131f032973634b | 2023-10-09T23:04:28Z | Update README.md |
| 4 | 702736151 | acac3958bf93947b7588dd28eccdd6dac46fad93 | 2023-10-09T23:03:39Z | Update README.md |
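The `Date` values are stored as ISO 8601 strings in UTC. For later analysis they may be easier to work with as datetime objects. The block below is a minimal sketch using only the standard library; the function name `parse_github_timestamp` is hypothetical.

```python
from datetime import datetime, timezone


def parse_github_timestamp(value):
    """Parse a timestamp such as '2024-01-22T22:06:28Z' into an aware UTC datetime."""
    # The trailing 'Z' marks UTC, so the timezone is attached explicitly
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


parsed = parse_github_timestamp("2024-01-22T22:06:28Z")
print(parsed.year, parsed.month, parsed.day)  # 2024 1 22
```

On a dataframe, `pd.to_datetime(commits_df['Date'])` should achieve the same for the whole column at once.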


I'll save this data to a .csv file called **commits_info.csv**.

In [17]:
commits_df.to_csv('commits_info.csv', index = False)