gdutils.datamine
==============

``datamine`` is a module in the ``gdutils`` package that provides functions for finding, listing, and mining data.

---

__Examples Setup__

The following commands set up the examples below.

*Note:* The example input files were pulled and converted from the GeoJSON [link](http://d2ad6b4ur7yvpq.cloudfront.net/naturalearth-3.3.0/ne_110m_land.geojson) provided in the [geopandas IO docs](https://geopandas.org/io.html).

In [1]:
# Install ``gdutils`` package
!pip install git+https://github.com/mggg/gdutils.git > /dev/null

In [2]:
import gdutils.datamine as dm # imports the ``datamine`` module

import geopandas as gpd
import pandas as pd

---

Example 1. Get a list of public GitHub repos
---------------------------------------------------


__Example 1.1.__ Get a list of public repos from a GitHub user account

In [3]:
# Ex. 1.1

user_account = 'octocat'
user_repos = dm.list_gh_repos(user_account, 'users') # gets repos
user_repos # renders raw list of repos

[('boysenberry-repo-1', 'https://github.com/octocat/boysenberry-repo-1.git'),
 ('git-consortium', 'https://github.com/octocat/git-consortium.git'),
 ('hello-worId', 'https://github.com/octocat/hello-worId.git'),
 ('Hello-World', 'https://github.com/octocat/Hello-World.git'),
 ('linguist', 'https://github.com/octocat/linguist.git'),
 ('octocat.github.io', 'https://github.com/octocat/octocat.github.io.git'),
 ('Spoon-Knife', 'https://github.com/octocat/Spoon-Knife.git'),
 ('test-repo1', 'https://github.com/octocat/test-repo1.git')]

In [4]:
# prints the repos as an aligned table by unpacking each (name, url) tuple
print('{:20} : {}'.format('repo name', 'repo url'))
print('-------------------------------')

for (repo_name, repo_url) in user_repos:
    print('{:20} : {}'.format(repo_name, repo_url))

repo name            : repo url
-------------------------------
boysenberry-repo-1   : https://github.com/octocat/boysenberry-repo-1.git
git-consortium       : https://github.com/octocat/git-consortium.git
hello-worId          : https://github.com/octocat/hello-worId.git
Hello-World          : https://github.com/octocat/Hello-World.git
linguist             : https://github.com/octocat/linguist.git
octocat.github.io    : https://github.com/octocat/octocat.github.io.git
Spoon-Knife          : https://github.com/octocat/Spoon-Knife.git
test-repo1           : https://github.com/octocat/test-repo1.git


__Example 1.2.__ Get a list of public repos from a GitHub organization account

In [5]:
# Ex. 1.2.

org_account = 'mggg-states'
org_repos = dm.list_gh_repos(org_account, 'orgs')

# prints the repos as an aligned table by unpacking each (name, url) tuple
print('{:20} : {}'.format('repo name', 'repo url'))
print('-------------------------------')

for repo_name, repo_url in org_repos:
    print('{:20} : {}'.format(repo_name, repo_url))

repo name            : repo url
-------------------------------
PA-shapefiles        : https://github.com/mggg-states/PA-shapefiles.git
MA-shapefiles        : https://github.com/mggg-states/MA-shapefiles.git
WI-shapefiles        : https://github.com/mggg-states/WI-shapefiles.git
AK-shapefiles        : https://github.com/mggg-states/AK-shapefiles.git
OH-shapefiles        : https://github.com/mggg-states/OH-shapefiles.git
TX-shapefiles        : https://github.com/mggg-states/TX-shapefiles.git
GA-shapefiles        : https://github.com/mggg-states/GA-shapefiles.git
IL-shapefiles        : https://github.com/mggg-states/IL-shapefiles.git
NC-shapefiles        : https://github.com/mggg-states/NC-shapefiles.git
UT-shapefiles        : https://github.com/mggg-states/UT-shapefiles.git
VA-shapefiles        : https://github.com/mggg-states/VA-shapefiles.git
VT-shapefiles        : https://github.com/mggg-states/VT-shapefiles.git
MI-shapefiles        : https://github.com/mggg-states/MI-shapefiles.git
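
The `(name, url)` tuples shown in Examples 1.1 and 1.2 have the same shape as two fields that the GitHub REST API returns for each repository: `name` and `clone_url`. The sketch below shows one way such a list could be built. It is not the `gdutils` implementation; the helper names `repos_url`, `parse_repos`, and `fetch_repos` are hypothetical, and the sketch reads only the first page of results.

```python
import json
from urllib.request import urlopen

GITHUB_API = "https://api.github.com"


def repos_url(account, account_type):
    """Build the REST endpoint that lists an account's public repos."""
    if account_type not in ("users", "orgs"):
        raise ValueError("account_type must be 'users' or 'orgs'")
    return f"{GITHUB_API}/{account_type}/{account}/repos"


def parse_repos(payload):
    """Turn a decoded GitHub response into (name, clone_url) tuples."""
    return [(repo["name"], repo["clone_url"]) for repo in payload]


def fetch_repos(account, account_type):
    """Fetch one page of repos (network call; pagination is not handled)."""
    with urlopen(repos_url(account, account_type)) as response:
        return parse_repos(json.load(response))
```

Keeping `parse_repos` separate from the network call means the parsing step can be checked against a saved response without hitting the API or its rate limits.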


Example 2. Clone public GitHub repos
---------------------------------------------------------------

__Example 2.1.__ Clone all repositories of a known account

In [6]:
# Ex. 2.1

dm.clone_gh_repos(user_account, 'users')

__Example 2.2.__ Clone specific repositories of a known account

In [7]:
# Ex. 2.2.

dm.clone_gh_repos(org_account, 'orgs', ['AK-shapefiles', 'AZ-shapefiles'])

__Example 2.3.__ Clone specific repos into a given directory

In [8]:
# Ex. 2.3.

dm.clone_gh_repos(org_account, 'orgs', ['CT-shapefiles'], 'outputs/')

__Example 2.4.__ Clone all repos into a given directory

In [9]:
# Ex. 2.4.

dm.clone_gh_repos(user_account, 'users', outpath='outputs/')
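
For readers who want to see what the cloning step amounts to, here is a minimal sketch that turns a list of `(name, url)` tuples into `git clone` invocations. It assumes `git` is on the `PATH`; the function names `clone_commands` and `clone_all` and the `wanted` parameter are hypothetical, and `clone_gh_repos` itself may work differently.

```python
import os
import subprocess


def clone_commands(repos, outpath="."):
    """Build one `git clone` argument list per (name, url) pair."""
    return [
        ["git", "clone", url, os.path.join(outpath, name)]
        for name, url in repos
    ]


def clone_all(repos, outpath=".", wanted=None):
    """Clone the repos, optionally restricted to the names in `wanted`."""
    if wanted is not None:
        repos = [(name, url) for name, url in repos if name in wanted]
    os.makedirs(outpath, exist_ok=True)  # create the target directory if needed
    for command in clone_commands(repos, outpath):
        subprocess.run(command, check=True)  # raises if git exits with an error
```

Building the argument lists in a separate function makes it possible to inspect exactly what would run before any repository is downloaded.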

Example 4. Get a list of local files of specific types
-----------------------------------------------------------

__Example 4.1.__ Recursively list files of a given type starting from the current working directory

In [10]:
# Ex. 4.1.

files_from_cwd = dm.list_files_of_type('.zip')
files_from_cwd

['./AK-shapefiles/AK_precincts.zip',
 './AZ-shapefiles/az_precincts.zip',
 './example-inputs/example.zip',
 './example-inputs/counties.zip',
 './outputs/CT-shapefiles/CT_precincts.zip']

__Example 4.2.__ Recursively list files of a given type starting from a given directory

In [11]:
# Ex. 4.2.

files_from_dir = dm.list_files_of_type('.zip', 'outputs/')
files_from_dir

['outputs/CT-shapefiles/CT_precincts.zip']

__Example 4.3.__ Recursively list files of given types starting from a given directory

In [12]:
# Ex. 4.3.

zips_and_mds = dm.list_files_of_type(['.zip', '.md'], 'outputs/')
zips_and_mds

['outputs/linguist/README.md',
 'outputs/linguist/CONTRIBUTING.md',
 'outputs/linguist/test/fixtures/Data/Modelines/example_smalltalk.md',
 'outputs/linguist/samples/GCC Machine Description/pdp10.md',
 'outputs/linguist/samples/Markdown/tender.md',
 'outputs/linguist/vendor/grammars/Sublime-Inform/README.md',
 'outputs/linguist/vendor/grammars/less.tmbundle/README.md',
 'outputs/test-repo1/2016-02-24-first-post.md',
 'outputs/test-repo1/2016-02-26-sample-post-jekyll.md',
 'outputs/test-repo1/2015-04-12-test-post-last-year.md',
 'outputs/CT-shapefiles/LICENSE.md',
 'outputs/CT-shapefiles/CT_precincts.zip',
 'outputs/CT-shapefiles/README.md',
 'outputs/git-consortium/product-backlog.md',
 'outputs/git-consortium/README.md',
 'outputs/Spoon-Knife/README.md',
 'outputs/boysenberry-repo-1/README.md',
 'outputs/boysenberry-repo-1/READTHIS.md',
 'outputs/hello-worId/README.md']

__Example 4.4.__ Recursively list files of a given type starting from the current working directory, including hidden files

In [13]:
# Ex. 4.4.

files_incl_hidden = dm.list_files_of_type('.zip', exclude_hidden=False)
files_incl_hidden

['./.example-hidden-file.zip',
 './AK-shapefiles/AK_precincts.zip',
 './AZ-shapefiles/az_precincts.zip',
 './example-inputs/example.zip',
 './example-inputs/counties.zip',
 './outputs/CT-shapefiles/CT_precincts.zip']
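
The behavior shown in Examples 4.1 through 4.4 can be approximated with `os.walk`. The sketch below is an illustration under that assumption, not the code in `gdutils`; the name `list_files` is hypothetical, and the order of the returned paths depends on the filesystem.

```python
import os


def list_files(extensions, start=".", exclude_hidden=True):
    """Recursively collect paths under `start` that end in any extension."""
    if isinstance(extensions, str):
        extensions = [extensions]  # accept a single extension or a list
    suffixes = tuple(extensions)
    matches = []
    for dirpath, dirnames, filenames in os.walk(start):
        if exclude_hidden:
            # Prune hidden directories in place so os.walk skips them.
            dirnames[:] = [d for d in dirnames if not d.startswith(".")]
            filenames = [f for f in filenames if not f.startswith(".")]
        for filename in filenames:
            if filename.endswith(suffixes):
                matches.append(os.path.join(dirpath, filename))
    return matches
```

Assigning to `dirnames[:]` rather than rebinding the name is what tells `os.walk` not to descend into the pruned directories.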

Example 5. Get a list of keys from a nested (categorized) dictionary
-------------------------------------------------------------------------------

In [14]:
# Example nested dictionary
example_dict = {
    'category1' : [ # category
        {'key1_1' : 'value1'}, # key-value pair
        {'key1_2' : 2}
    ],
    'category2' : [
        {'key2_1' : True},
        ['key2_2', 'key2_3', 'key2_4'] # list of keys
    ],
    'category3' : [
        ['key3']
    ]
}

__Example 5.1.__ Get a list of keys from a single category

In [15]:
keys = dm.get_keys_by_category(example_dict, 'category2')
keys

['key2_1', 'key2_2', 'key2_3', 'key2_4']

__Example 5.2.__ Get a list of keys from a list of categories

In [16]:
keys = dm.get_keys_by_category(example_dict, ['category1', 'category3'])
keys

['key1_1', 'key1_2', 'key3']
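
The flattening shown in Examples 5.1 and 5.2 can be reproduced in a few lines. This is a sketch inferred from the outputs above rather than the library's own code, and the name `keys_by_category` is hypothetical: each entry under a category contributes its keys if it is a dictionary, or its items if it is a list.

```python
def keys_by_category(nested, categories):
    """Flatten the keys stored under one or more categories."""
    if isinstance(categories, str):
        categories = [categories]  # accept a single category or a list
    keys = []
    for category in categories:
        for entry in nested[category]:
            # A dict contributes its keys; a list contributes its items.
            keys.extend(entry.keys() if isinstance(entry, dict) else entry)
    return keys


example_dict = {
    "category1": [{"key1_1": "value1"}, {"key1_2": 2}],
    "category2": [{"key2_1": True}, ["key2_2", "key2_3", "key2_4"]],
    "category3": [["key3"]],
}

print(keys_by_category(example_dict, "category2"))
# ['key2_1', 'key2_2', 'key2_3', 'key2_4']
```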

Example 6. Remove repos from local filesystem
--------------------------------------------------------

__Example 6.1.__ Remove a specific repository

In [17]:
# Ex. 6.1.

path_to_repo_to_remove = 'outputs/Hello-World'
dm.remove_repos(path_to_repo_to_remove)

__Example 6.2.__ Recursively remove all repos in a directory

In [18]:
# Ex. 6.2.

dm.remove_repos('outputs/')
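
Because removal is destructive, it can help to see which directories would be deleted before deleting them. The sketch below assumes that a "repo" is any directory containing a `.git` subdirectory; that is an assumption about what `remove_repos` targets, and the names `find_repos` and `remove_found_repos` are hypothetical.

```python
import os
import shutil


def find_repos(start):
    """Return directories at or below `start` that contain a `.git` directory."""
    repos = []
    for dirpath, dirnames, _ in os.walk(start):
        if ".git" in dirnames:
            repos.append(dirpath)
            dirnames[:] = []  # do not descend into a repo that was found
    return repos


def remove_found_repos(start):
    """Delete every repo found under `start`, leaving other directories alone."""
    # Collect the full list first so nothing is deleted while walking.
    for repo in find_repos(start):
        shutil.rmtree(repo)
```

Calling `find_repos` on its own acts as a dry run: it lists the directories that `remove_found_repos` would delete without touching the filesystem.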

---

__Examples Cleanup__

The following commands are used to reset and clean up the examples above.

In [19]:
# Remove all cloned repos
dm.remove_repos('.')

In [20]:
# Remove outputs
!rm -r outputs

In [21]:
# Uninstall Package
!pip uninstall -y gdutils

In [22]:
# Reset Jupyter Notebook IPython Kernel
from IPython.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")