# Learn.co CleanUp

There are two reasons to do this:

**A)** You want a nice `github` repo where you can quickly look through all your lessons. Who doesn't want that?

**B)** Your local file structure is a mess. You saved some stuff over here. You saved some stuff over there. Then you went in a new direction with a naming scheme.

`git` has its own idea of nested file structure. It uses _**submodules**_ to reference different repos. The following code automates the process of adding all the lessons you have done to a single `github` repo. 

A side effect of adding a `submodule` to a repo is that `git` clones that repo to your local machine again, creating a set of redundant files. Once you are done with this process, you can choose which files to keep locally. If you want to keep the organizational structure introduced here, feel free. Otherwise, just delete the freshly cloned files.

___

## 1) Fork the Repo and Rename it.

The first step is to fork this repo and rename it. It will act as a table of contents for all of your Learn.co lessons.


<img src = imgs/fork.png style="height: 500px; width:700px; resize:both">

<br>

Now title your repo as you wish.

<img src = imgs/rename.png style="height:500px">

<br>

Now clone this repo to your machine.

## 2) Download the HTML of Learn.co 

While logged in, go to the main landing page of Learn.co and download the HTML of that page. Save the file in the repo you created during Step 1.

<img src=imgs/html_save.png>

<br>

Congratulations, the hard work is done.

## 3) Run Some Cells

In [1]:
import re
import time
import webbrowser
import subprocess, shlex
from bs4 import BeautifulSoup

In [2]:
with open('Learn - Data Science Career v1.1.html','r') as f:  #Make sure this file matches the name of HTML file you just saved.
    html = f.read()
soup = BeautifulSoup(html, "html.parser")

In [3]:
chunk = str(soup('script',{'type' : 'text/javascript'}))
#chunk

The information we want is in a JavaScript block. Uncomment `chunk` above to see what it contains.  
Instead of bringing in any more libraries, we will tackle this problem with our `regex` know-how.

### RegEx

In [4]:
pattern = re.compile(r"learn-co-curriculum/d.*?(?=\")") #make a pattern object
repo_names = pattern.findall(chunk)
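
To get a feel for what the pattern captures, here is a minimal sketch that runs it on a made-up string. The `sample` text is an assumption for illustration only; the real JavaScript block on your page will look different.

```python
import re

# Hypothetical fragment standing in for the page's JavaScript block.
sample = '{"repo":"learn-co-curriculum/dsc-1-01-02-intro","other":"learn-co-curriculum/dsc-2-13-03-linalg-motivation"}'

# Same pattern as above: start at "learn-co-curriculum/d", then lazily
# consume characters until the next double quote (which is not included).
pattern = re.compile(r"learn-co-curriculum/d.*?(?=\")")
matches = pattern.findall(sample)

for match in matches:
    print(match)
```

The lazy `.*?` together with the lookahead `(?=\")` is what stops each match at the closing quote instead of running on to the end of the string.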

Uncomment the cell below to see what we collected.

In [5]:
#repo_names[:10]

## Personalize the URLs

In [6]:
cohort = 'online-ds-pt-100118' #Enter your cohort as a string
github = 'https://github.com/Socjon/' #Enter your github URL. Be sure to include the trailing slash

In [7]:
mod_full_urls = {'mod_1': [],
                 'mod_2': [],
                 'mod_3': [],
                 'mod_4': [],}

for name in repo_names:
    name = name[len('learn-co-curriculum/'):]   #Slice off the prefix; lstrip() strips a set of characters, not a prefix
    
    if name.startswith('dsc-1-') or name.startswith('dsc-00-') or name.startswith('dsc-01-'):
        mod_full_urls['mod_1'].append(github + name + '-' + cohort)
    
    elif name.startswith('dsc-2-'):
        mod_full_urls['mod_2'].append(github + name + '-' + cohort)
        
    elif name.startswith('dsc-3-'):
        mod_full_urls['mod_3'].append(github + name + '-' + cohort)
        
    elif name.startswith('dsc-4-') or name.startswith('dsc-04-'):
        mod_full_urls['mod_4'].append(github + name + '-' + cohort)
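
The same bucketing can be written as a prefix lookup, which may be easier to extend if the naming scheme changes. This is only a sketch: `prefix_to_mod`, `module_for`, and `full_url` are hypothetical names, and the `github` and `cohort` values are placeholders.

```python
# Placeholder values for illustration only.
github = 'https://github.com/your-username/'
cohort = 'your-cohort-id'

# Prefixes used in the cell above, mapped to their module key.
prefix_to_mod = {
    'dsc-1-': 'mod_1', 'dsc-00-': 'mod_1', 'dsc-01-': 'mod_1',
    'dsc-2-': 'mod_2',
    'dsc-3-': 'mod_3',
    'dsc-4-': 'mod_4', 'dsc-04-': 'mod_4',
}

def module_for(name):
    """Return the module key for a lesson name, or None if no prefix matches."""
    for prefix, mod in prefix_to_mod.items():
        if name.startswith(prefix):
            return mod
    return None

def full_url(name):
    """Build the URL of your fork: github + lesson name + '-' + cohort."""
    return github + name + '-' + cohort

example = 'dsc-2-13-03-lingalg-motivation'
print(module_for(example), full_url(example))
```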

## Section Names

I provided a list of section names. Keep in mind this is for the V1.1 curriculum. If you scrape a newer list, be warned of inconsistencies in the naming schemes. Some cleaning is required.

In [8]:
section_list = ['01-getting-started-with-data-science',
 '02-importing-and-statistical-analysis-of-data',
 '03-working-with-pandas',
 '04-data-cleaning-in-pandas',
 '05-sql-and-relational-databases',
 '06-object-oriented-programming',
 '07-oop-continued',
 '08-numpy-and-foundations-of-probability-and-combinatorics',
 '09-statistical-distributions',
 '10-introduction-to-linear-regression',
 '11-multiple-regression-and-model-validation',
 '12-a-complete-data-science-project-using-multiple-regression',
 '13-linear-algebra',
 '14-calculus-cost-function-and-gradient-descent',
 '15-an-introduction-to-orms',
 '16-working-with-json-and-xml',
 '17-accessing-data-through-apis',
 '18-html-css-and-web-scraping',
 '19-distributions-and-sampling',
 '20-hypothesis-and-ab-testing',
 '21-combinatorics-continued-and-maximum-likelihood-estimation',
 '22-bayesian-classification',
 '23-resampling-and-monte-carlo-simulation',
 '24-extensions-to-linear-models',
 '25-time-series-visualization-and-testing-for-trends',
 '26-time-series-modeling',
 '27-distance-metrics-and-k-nearest-neighbors',
 '28-graph-theory',
 '29-introduction-to-logistic-regression',
 '30-logistic-regression-in-depth-mle-and-gradient-descent',
 '31-decision-trees',
 '32-ensemble-methods',
 '33-support-vector-machines',
 '34-dimensionality-reduction-with-pca',
 '35-clustering',
 '36-building-a-machine-learning-pipeline',
 '37-foundations-of-natural-language-processing-nlp',
 '38-big-data-in-pyspark',
 '39-developing-a-recommendation-system-in-pyspark',
 '40-introduction-to-deep-learning',
 '41-multi-layer-perceptrons',
 '42-regularization-and-optimization',
 '43-introduction-to-convolutional-neural-networks',
 '44-convolutional-neural-networks-continued',
 '45-deep-nlp-word-embeddings',
 '46-deep-nlp-sequence-models']

# Interacting with the OS

___
___

Below are two functions. The first function takes a Learn URL, identifies the section it belongs to, and returns the full section name.  

The second function is the workhorse of this notebook. It takes in two arguments: `mod_number` as an integer and `url_dict` which is the dictionary we created earlier. The function then creates the proper directories for both the module and section, then adds the lessons as submodules. Finally it will return a list of URLs of lessons that were not cloned.  

This process works with only one module at a time and takes a few minutes to run. Output is displayed for each URL as it is added.

In [9]:
def section_name(url):   #Takes in the URL and returns the section as a string
    
    name = None          #Resets the name variable
    
    to_remove = len(github) + 4     #Github names are variable. The 4 is removing "dsc-" 
    to_check = url[to_remove:]
    
    if to_check.startswith('0'):    #Checks for the leading zero and removes it, if that is the case
        to_check = to_check[1:]
    
    section_num = to_check[2:4]     #Isolates the section number as a string
    
    for section in section_list:    #Loops through section list
        
        if section.startswith(section_num):
            name = section
            break
        
    if name is None:
        name = 'Project'
    
    return name
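
As an alternative to counting characters, the section number can be pulled out with a regular expression. This is a sketch of a different technique, not the function above; `section_number` and the example URLs are hypothetical.

```python
import re

# "dsc-", a module number with an optional leading zero, "-",
# then the two-digit section number we want to capture.
section_re = re.compile(r'/dsc-0?\d-(\d{2})-')

def section_number(url):
    """Return the two-digit section number in a lesson URL, or None."""
    match = section_re.search(url)
    return match.group(1) if match else None

# Made-up URLs in the same shape as the ones built earlier.
urls = [
    'https://github.com/your-username/dsc-2-13-02-introduction-summary-your-cohort',
    'https://github.com/your-username/dsc-01-03-05-some-lesson-your-cohort',
]
for url in urls:
    print(url.rsplit('/', 1)[1], '->', section_number(url))
```

Because the pattern does not depend on the length of the `github` string, it keeps working if the username changes.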

In [10]:
def add_submodule(mod_number, url_dict):
    
    valid = [1,2,3,4]
    if mod_number not in valid:
        raise ValueError(f"mod_number must be one of {valid}.")
        
    to_fork = []
    mod_name = 'Module_' + str(mod_number)                          
    mod_urls = url_dict['mod_' + str(mod_number)]                 #URLs for the requested module
    
    subprocess.run(f'mkdir {mod_name}', shell=True)               #Makes the module directory
    time.sleep(1)                                                 #Don't use (shell=True) lightly #https://stackoverflow.com/questions/3172470/actual-meaning-of-shell-true-in-subprocess

    for url in mod_urls:                                          #Uses the arguments passed in instead of a hard-coded module
        name = section_name(url)
        subprocess.run(f'mkdir {name}', shell=True, cwd = f'{mod_name}')
        time.sleep(1)
        
        
        command = f'git submodule add {url}'                        #Iterate over the URL dictionary we created before and adds them as submodules.
        kwargs = {}
        kwargs['stdout'] = subprocess.PIPE
        kwargs['stderr'] = subprocess.PIPE
        proc = subprocess.Popen(shlex.split(command), **kwargs, cwd = f'{mod_name}/{name}')
        (stdout_str, stderr_str) = proc.communicate()
        return_code = proc.wait()
        #print (stdout_str)
        #print (stderr_str)
    
    
        to_check = stderr_str.decode('utf-8')                     #Changing the terminal output from bytes to str
        print(to_check)                                           #Prints status updates --optional
        pattern = 'fatal'                                         #'fatal' in the output marks a URL that does not exist yet
        if pattern in to_check:                                   #Also matches at index 0, which find() > 0 would miss
            to_fork.append(url)
                
    commands = ["git commit -m 'adding a submodule'", "git push"]    #Pushing all the changes
    for command in commands:
        subprocess.run(shlex.split(command))                         #run() waits, so the commit finishes before the push starts
    
    print('Go create the following')
    print(to_fork)
    return(to_fork)
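
If you would rather avoid `shell=True` for the directory step, one option is to create the folders from Python directly. The sketch below swaps in `pathlib` for the `mkdir` calls; `make_section_dir` is a hypothetical helper, and it runs inside a temporary directory so it does not touch your repo.

```python
import tempfile
from pathlib import Path

def make_section_dir(root, mod_name, section):
    """Create root/mod_name/section if it is missing and return its path."""
    target = Path(root) / mod_name / section
    target.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    return target

# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    first = make_section_dir(tmp, 'Module_2', '13-linear-algebra')
    second = make_section_dir(tmp, 'Module_2', '13-linear-algebra')  # safe to repeat
    print(first == second, first.is_dir())
```

Since `exist_ok=True` makes the call repeatable, it can be run once per lesson without complaining that the section folder already exists.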

In [11]:
to_fork = add_submodule(2, mod_full_urls)   #Input the module number and the url dictionary.

Cloning into 'C:/Users/J/DS/Flatiron/Module_2/13-linear-algebra/dsc-2-13-02-introduction-summary-online-ds-pt-100118'...
remote: Repository not found.
fatal: repository 'https://github.com/Socjon/dsc-2-13-02-introduction-summary-online-ds-pt-100118/' not found
fatal: clone of 'https://github.com/Socjon/dsc-2-13-02-introduction-summary-online-ds-pt-100118' into submodule path 'C:/Users/J/DS/Flatiron/Module_2/13-linear-algebra/dsc-2-13-02-introduction-summary-online-ds-pt-100118' failed

Cloning into 'C:/Users/J/DS/Flatiron/Module_2/13-linear-algebra/dsc-2-13-03-lingalg-motivation-online-ds-pt-100118'...
remote: Repository not found.
fatal: repository 'https://github.com/Socjon/dsc-2-13-03-lingalg-motivation-online-ds-pt-100118/' not found
fatal: clone of 'https://github.com/Socjon/dsc-2-13-03-lingalg-motivation-online-ds-pt-100118' into submodule path 'C:/Users/J/DS/Flatiron/Module_2/13-linear-algebra/dsc-2-13-03-lingalg-motivation-online-ds-pt-100118' failed

Cloning into 'C:/Users/J/D

## Helper Function
After running the above cell, there may be some Learn lessons that you didn't fork: Section Recaps, Introductions, etc.  

I have included a helper function to open all the URLs that need to be forked before the cell above is run again. Once they have been forked, just rerun the above cell to add them as `submodules`.
   

In [12]:
#to_fork

In [13]:
def let_there_be_tabs(url_list):
    learn = 'https://github.com/learn-co-students/'
    for url in url_list:
        webbrowser.open(url.replace(github, learn))
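
Before opening a pile of browser tabs, it can help to preview which pages the helper would visit. This is a small sketch under assumptions: `learn_urls` is a hypothetical name, and `github` is a placeholder for your own URL.

```python
github = 'https://github.com/your-username/'      # placeholder for your own URL
learn = 'https://github.com/learn-co-students/'   # same base URL the helper uses

def learn_urls(url_list):
    """Swap your github prefix for the learn-co-students prefix, without opening anything."""
    return [url.replace(github, learn) for url in url_list]

# Made-up example of a lesson that has not been forked yet.
missing = ['https://github.com/your-username/dsc-2-13-02-introduction-summary-your-cohort']
for url in learn_urls(missing):
    print(url)
```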

Run the cell below to open a tab for each URL in `to_fork`.

In [14]:
let_there_be_tabs(to_fork)