In [1]:
from jyquickhelper import add_notebook_menu
add_notebook_menu()

## Objectifs de la séance

- Compléter les données de l'API avec celles du webscraping
- Visualiser les données
- Nettoyer les données (regexp et nltk)
- Machine Learning "non-supervisé" : tagging automatique avec un tf-idf
- Amélioration de la prédiction de tags grâce à des méthodes de classifications binaires

## Compléter les données de l'API avec celles du webscraping

### Récupérer les données

Retrouver votre CONSUMER_KEY et votre ACCESS_TOKEN. Modifier les codes ci-dessous.

In [119]:
CONSUMER_KEY = "71268-0c68343bf8cd8c34e9029034"
ACCESS_TOKEN = "47d8c59a-c4f5-32be-f9f0-f3fabb"

On importe de nouveau les données du compte pocket.

In [120]:
import requests
from pprint import pprint

items = {
         "consumer_key":CONSUMER_KEY,
         "access_token":ACCESS_TOKEN,
         "detailsType": "complete"
        }

dict_pocket = requests.post('https://getpocket.com/v3/get', data = items).json()['list']

### En cas de probleme, vous pouvez importer les données à partir du .json

On va donc directement importer les données au format json.

In [123]:
import json
from pprint import pprint

with open('./data_pocket.json') as fp:    
    dict_pocket = json.load(fp)

pprint(dict_pocket)

{'1003565100': {'authors': {'55834601': {'author_id': '55834601',
                                         'item_id': '1003565100',
                                         'name': 'Grafikart.fr',
                                         'url': ''}},
                'excerpt': 'Ionic est un framework qui va vous permettre de '
                           'créer des applications mobiles en utilisant des '
                           'technologies Web. Ionic se base pour cela sur '
                           "d'autres frameworks / technologies qui ont fait "
                           'leurs preuves.  Avant de pouvoir commencer, il '
                           'nous faut évidemment commencer par installer '
                           "l'outil.",
                'favorite': '0',
                'given_title': 'Tutoriel Vidéo Apache Cordova Ionic Framework',
                'given_url': 'https://www.grafikart.fr/tutoriels/cordova/ionic-framework-641',
                'has_image': '0',
      

On créé une DataFrame avec les éléments souhaités (item_id, resolved_url, resolved_title, excerpt, tags).

### Mise en forme des données

In [124]:
dict_to_df = {}

keys = ['resolved_url', 'resolved_title', 'excerpt', 'tags']

for (k,v) in dict_pocket.items():
    dict_to_df[k] = dict(zip(keys, [v[key] for key in keys if key in v]))

In [125]:
import pandas as p
df_pocket = p.DataFrame.from_dict(dict_to_df, orient = "index")
df_pocket.head()

Unnamed: 0,resolved_url,resolved_title,excerpt,tags
1003565100,https://www.grafikart.fr/tutoriels/cordova/ion...,Tutoriel Vidéo Apache CordovaIonic Framework,Ionic est un framework qui va vous permettre d...,"{'mobile app': {'item_id': '1003565100', 'tag'..."
1008275819,http://www.colorhunt.co,Color Hunt,Home Create Likes () About Add To Chrome Faceb...,"{'lewagon': {'item_id': '1008275819', 'tag': '..."
1011618630,https://jakevdp.github.io/blog/2015/08/14/out-...,Out-of-Core Dataframes in Python: Dask and Ope...,"In recent months, a host of new tools and pack...","{'data science': {'item_id': '1011618630', 'ta..."
1014684096,https://blog.dominodatalab.com/ab-testing-with...,A/B Testing with Hierarchical Models in Python,"In this post, I discuss a method for A/B testi...","{'abtest': {'item_id': '1014684096', 'tag': 'a..."
1016233829,https://developer.mozilla.org/en-US/docs/Learn...,Getting started with the Web,Getting started with the Web is a concise seri...,"{'mdn': {'item_id': '1016233829', 'tag': 'mdn'..."


On met en forme la colonne "tags".

In [126]:
df_pocket['tags'] = df_pocket['tags'].apply(lambda x: x.keys() if x==x else x)
df_pocket.rename(columns = {'resolved_url':'url', 'resolved_title': 'title'}, inplace = True)
df_pocket.head()

Unnamed: 0,url,title,excerpt,tags
1003565100,https://www.grafikart.fr/tutoriels/cordova/ion...,Tutoriel Vidéo Apache CordovaIonic Framework,Ionic est un framework qui va vous permettre d...,(mobile app)
1008275819,http://www.colorhunt.co,Color Hunt,Home Create Likes () About Add To Chrome Faceb...,(lewagon)
1011618630,https://jakevdp.github.io/blog/2015/08/14/out-...,Out-of-Core Dataframes in Python: Dask and Ope...,"In recent months, a host of new tools and pack...",(data science)
1014684096,https://blog.dominodatalab.com/ab-testing-with...,A/B Testing with Hierarchical Models in Python,"In this post, I discuss a method for A/B testi...",(abtest)
1016233829,https://developer.mozilla.org/en-US/docs/Learn...,Getting started with the Web,Getting started with the Web is a concise seri...,"(mdn, documentation)"


### Regular expression 

Par [ici](http://www.xavierdupre.fr/app/ensae_teaching_cs/helpsphinx/notebooks/td2a_eco_5d_Travailler_du_texte_les_expressions_regulieres.html#td2aeco5dtravaillerdutextelesexpressionsregulieresrst).

Les expressions régulières (ou regular expression, regexp) servent à extraire des données de texte, ou à valider des données utilisateurs (lorsqu'un utilisateur rempli un formulaire par exemple).

Voici les principaux caractères à connaître.

<table>
    <thead>
        <th> Regexp </th>
        <th> Signification </th>
    </thead>
    <tbody>
        <tr>
            <td colspan=2> __Bases__ </td>
        </tr>
        <tr>
            <td> "a" </td>
            <td> a </td>
        </tr>
        <tr>
            <td colspan=2> __Quantificateurs__ </td>
        </tr>
        <tr>
            <td> "abc?" </td>
            <td> ab suivi par 0 ou 1 c </td>
        </tr>
        <tr>
            <td> "abc*" </td>
            <td> ab suivi par 0..∞ c </td>
        </tr>
        <tr>
            <td> "abc+" </td>
            <td> ab suivi par 1..∞ c </td>
        </tr>
        <tr>
            <td> "abc{3}" </td>
            <td> ab suivi par 3 c </td>
        </tr>
        <tr>
            <td colspan=2> __Groupes__ </td>
        </tr>
        <tr>
            <td> "(abc)+" </td>
            <td> 1..∞ abc </td>
        </tr>
        <tr>
            <td> "(a|b)c" </td>
            <td> ac ou bc </td>
        </tr>
        <tr>
            <td colspan=2> __Intervalles (type 1)__ </td>
        </tr>
        <tr>
            <td> "." </td>
            <td> n'importe quel caractère (un seul) </td>
        </tr>
        <tr>
            <td> "[aB9]" </td>
            <td> a ou B ou 9 </td>
        </tr>
        <tr>
            <td> "[0-9]" </td>
            <td> n'importe quel caractère numérique </td>
        </tr>
        <tr>
            <td> "[a-zA-Z]" </td>
            <td> n'importe quel caractère alphabétique </td>
        </tr>
        <tr>
            <td> "[^a-c]" </td>
            <td> n'importe quel caractère SAUF a, b et c </td>
        </tr>
        <tr>
            <td colspan=2> __Intervalles (type 2)__ </td>
        </tr>
        <tr>
            <td> "\d" </td>
            <td> comme "[0-9]" </td>
        </tr>
        <tr>
            <td> "\w" </td>
            <td> comme "[a-zA-Z0-9_]" </td>
        </tr>
        <tr>
            <td> "\W" </td>
            <td> comme "[^a-zA-Z0-9_]" </td>
        </tr>
        <tr>
            <td> "\s" </td>
            <td> espaces (" ", "\n", "\t", "\r") </td>
        </tr>
        <tr>
            <td> "\S" </td>
            <td> tout ce qui n'est pas un espace </td>
        </tr>
        <tr>
            <td colspan=2> __Ancres__ </td>
        </tr>
        <tr>
            <td> "^abc" </td>
            <td> commence par "abc" </td>
        </tr>
        <tr>
            <td> "^abc" </td>
            <td> commence par "abc" </td>
        </tr>
        <tr>
            <td> "abc$" </td>
            <td> termine par "abc" </td>
        </tr>
    </tbody>
</table>

Pour utiliser les regexp avec python, on fait appel au module "re", de la manière suivante : 

In [134]:
import re
string = "abc abccc abc ab bc"
print(re.compile("\w+").findall(string))

['abc', 'abccc', 'abc', 'ab', 'bc']


### Exercice 1

Extraire le domaine en utilisant des regexp. Le stocker dans un champs du DataFrame. Vous pouvez utiliser ce super outil : https://www.debuggex.com/

### Exercice 1 - correction

In [201]:
import re

def domain_extractor(url): 
    regex = re.compile("^((http|https)\:\/\/)?(www\.)?(.*)(\\.)((?!html|htm|php|asp|md|pdf)[a-z]{2,3})")
    try: 
        match_list = regex.findall(url)[0]
        return ''.join(match_list[-3:])
    except:
        return 'no_match'

In [203]:
df_pocket['domain']=df_pocket['url'].apply(lambda x: domain_extractor(x) )
#[print(d) for d in df_pocket['domain'][df_pocket['domain'].str.contains('/')]]
df_pocket[['url','domain']]

Unnamed: 0,url,domain
1003565100,https://www.grafikart.fr/tutoriels/cordova/ion...,grafikart.fr
1008275819,http://www.colorhunt.co,colorhunt.co
1011618630,https://jakevdp.github.io/blog/2015/08/14/out-...,jakevdp.github.io
1014684096,https://blog.dominodatalab.com/ab-testing-with...,blog.dominodatalab.com
1016233829,https://developer.mozilla.org/en-US/docs/Learn...,developer.mozilla.org
1028949272,http://lewagon.github.io/ui-components/#tabs,lewagon.github.io
1037787031,https://github.com/shakacode/react_on_rails,github.com
1038434596,http://alumni.lewagon.org/,alumni.lewagon.org
1045811499,https://www.youtube.com/watch?v=sleZ-hzrtRY,youtube.com
1055285491,https://ponyfoo.com/articles/es6-promises-in-d...,ponyfoo.com


### Ajout des données issues du webscraping

#### Contenu des pages présentes dans pocket

In [204]:
from bs4 import BeautifulSoup
import numpy as np
import re

def words_htmltag(url):
    print(url)
    try: 
        html = requests.get(url).text
        soup = BeautifulSoup(html, "html5lib")
        words_dict = {}
        for html_tag in ['h1','h2','h3','p', 'article']:
            for e in soup.findAll(html_tag):
                text = e.get_text()
                words = re.split(r'\s+', text)
                # r'\s+' c est une expression reguliere.
                # celle-ci permet de splitter sur " ", "\t", "\n", "\r" meme s il y en a plusieurs d affilee
                words_dict[html_tag]=words 
        return words_dict
    except:
        return "scraper banned"

df_test = df_pocket.ix[2:4]

df_test['html_soup'] = df_test['url'].apply(lambda x: words_htmltag(x))

https://jakevdp.github.io/blog/2015/08/14/out-of-core-dataframes-in-python/
https://blog.dominodatalab.com/ab-testing-with-hierarchical-models-in-python/


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


#### Contenu des pages afférentes

In [205]:
! pip install scrapy

Collecting scrapy
  Downloading Scrapy-1.4.0-py2.py3-none-any.whl (248kB)
[K    100% |████████████████████████████████| 256kB 496kB/s 
[?25hCollecting w3lib>=1.17.0 (from scrapy)
  Downloading w3lib-1.18.0-py2.py3-none-any.whl
Collecting parsel>=1.1 (from scrapy)
  Downloading parsel-1.2.0-py2.py3-none-any.whl
Collecting pyOpenSSL (from scrapy)
  Downloading pyOpenSSL-17.3.0-py2.py3-none-any.whl (51kB)
[K    100% |████████████████████████████████| 51kB 783kB/s 
[?25hCollecting lxml (from scrapy)
  Downloading lxml-4.1.0-cp36-cp36m-manylinux1_x86_64.whl (5.6MB)
[K    100% |████████████████████████████████| 5.6MB 296kB/s 
[?25hCollecting Twisted>=13.1.0 (from scrapy)
  Downloading Twisted-17.9.0.tar.bz2 (3.0MB)
[K    100% |████████████████████████████████| 3.0MB 452kB/s 
[?25hCollecting PyDispatcher>=2.0.5 (from scrapy)
  Downloading PyDispatcher-2.0.5.tar.gz
Collecting cssselect>=0.9 (from scrapy)
  Downloading cssselect-1.0.1-py2.py3-none-any.whl
Collecting queuelib (from scrap

Si vous ne parvenez pas à installer scrapy, aller dans le terminal et taper : `conda install -c conda-forge scrapy`

In [206]:
! rm -rf './scrapy.json' #supprime les eventuelles versions precedentes de scrapy.json

In [207]:
import scrapy
from scrapy.crawler import CrawlerProcess
    
class Content_Spider(scrapy.Spider):
    name = 'pocket_spider'
    start_urls = ['https://pythonprogramming.net/']
    allowed_domains = ['pythonprogramming.net']

    custom_settings = {'FEED_FORMAT' : 'json',
                        'FEED_URI' : 'scrapy.json', 
                       # le nom du fichier json dépend du moment où on lance le code
                        'FEED_EXPORT_ENCODING' : 'utf-8',
                        'DEPTH_LIMIT': 2,
                        'DEPTH_PRIORITY': 1}    
    
    def parse(self, response):
        
        for e in response.css('body'):
            yield {
                "url": response.request.url, # url de depart : sera une clé de jointure pour la suite
                "h1" : e.css("h1::text").extract(),
                "h2" : e.css("h2::text").extract(),
                "h3" : e.css("h3::text").extract(),
                "p" : [s.trip() for s in e.css("p::text").extract()]
            }
                
        for a in response.css('body a::attr(href)').extract():
            if (a is not None):
                yield response.follow(a, callback=self.parse)

ImportError: No module named 'scrapy'

In [12]:
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(Content_Spider)
process.start(stop_after_crawl=True)

2017-10-20 19:09:20 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrapybot)
2017-10-20 19:09:20 [scrapy.utils.log] INFO: Overridden settings: {'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'}
2017-10-20 19:09:20 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2017-10-20 19:09:20 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.ht

/
#
/
/+=1/
/store/python-hoodie/
/community/
/login/
/register/
/
/+=1/
/store/python-hoodie/
/community/
/login/
/register/
/data-analysis-tutorials/
/data-analysis-tutorials/
/robotics-tutorials/
/robotics-tutorials/
/web-development-tutorials/
/web-development-tutorials/
/game-development-tutorials/
/game-development-tutorials/
/python-fundamental-tutorials/
/python-fundamental-tutorials/
/gui-development-tutorials/
/gui-development-tutorials/
#
#
#


2017-10-20 19:09:21 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://pythonprogramming.net/community/> (referer: https://pythonprogramming.net/)
2017-10-20 19:09:22 [scrapy.core.scraper] DEBUG: Scraped from <200 https://pythonprogramming.net/community/>
{'url': 'https://pythonprogramming.net/community/', 'h1': [], 'h2': ['Login', 'Sign up'], 'h3': [], 'p': ['\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t', '\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t', '\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t', '\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t', '\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t', '\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t', '\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t', '\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t', '\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t', '\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t', '\n\t\t\t\t\t\t', ' \xa0\n\t\t\t\t\t\t', '\n\t\t\t\t\t', '\n\t\t\t\t\t\t\t\t', ' \xa0\n\t\t\t\t\t\t\t\t', '\n\t\t\t\t\t\t\t']}
2

/
#
/
/+=1/
/store/python-hoodie/
/community/
/login/
/register/
/
/+=1/
/store/python-hoodie/
/community/
/login/
/register/
/community/post/
/community/795/Cuffmerge SyntaxError invalid syntax/
/community/794/DBSCAN and Optics algorithm/
/community/793/Whos Online with Python Flask/
/community/792/Python Web service to sql database example/
/community/790/How to land rocket stages with Python in Kerbal Space Program/
/community/791/Heart-felt thank you/
/community/785/Get the variable name of a variable as a string/
/community/788/Can I use my laptop to make and train Haar Cascade?/
/community/789/Python query on Group by and Indexing/
/community/787/Multi target classifier/
/community/page/2
#
#
#


2017-10-20 19:09:22 [scrapy.core.scraper] DEBUG: Scraped from <200 https://pythonprogramming.net/robotics-tutorials/>
{'url': 'https://pythonprogramming.net/robotics-tutorials/', 'h1': [], 'h2': ['Login', 'Sign up'], 'h3': [], 'p': ['How to setup and use your Raspberry Pi with various projects.', 'Use the Raspberry Pi along with the GoPiGo to learn about robotics.', 'Build and program your own quadcopter from scratch to take-off.', '\n\t\t\t\t\t\t', ' \xa0\n\t\t\t\t\t\t', '\n\t\t\t\t\t', '\n\t\t\t\t\t\t\t\t', ' \xa0\n\t\t\t\t\t\t\t\t', '\n\t\t\t\t\t\t\t']}
2017-10-20 19:09:22 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://pythonprogramming.net/+=1/> (referer: https://pythonprogramming.net/)
2017-10-20 19:09:22 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://pythonprogramming.net/store/> from <GET https://pythonprogramming.net/store/python-hoodie/>
2017-10-20 19:09:22 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://pythonprogramming.net/py

/
#
/
/+=1/
/store/python-hoodie/
/community/
/login/
/register/
/
/+=1/
/store/python-hoodie/
/community/
/login/
/register/
/introduction-raspberry-pi-tutorials/
/introduction-raspberry-pi-tutorials/
/robotics-raspberry-pi-tutorial-gopigo-introduction/
/robotics-raspberry-pi-tutorial-gopigo-introduction/
/building-quadcopter-tutorial-intro/
/building-quadcopter-tutorial-intro/
#
#
#


2017-10-20 19:09:22 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://pythonprogramming.net/web-development-tutorials/> (referer: https://pythonprogramming.net/)
2017-10-20 19:09:22 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://pythonprogramming.net/gui-development-tutorials/> (referer: https://pythonprogramming.net/)
2017-10-20 19:09:22 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://pythonprogramming.net/register/> (referer: https://pythonprogramming.net/)
2017-10-20 19:09:22 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://pythonprogramming.net/login/> (referer: https://pythonprogramming.net/)
2017-10-20 19:09:22 [scrapy.core.scraper] DEBUG: Scraped from <200 https://pythonprogramming.net/python-fundamental-tutorials/>
{'url': 'https://pythonprogramming.net/python-fundamental-tutorials/', 'h1': [], 'h2': ['Login', 'Sign up'], 'h3': [], 'p': ['Just getting started?', 'Not a problem, learn the basics of programming with Python 3 here!', "Python fundamentals 

/
#
/
/+=1/
/store/python-hoodie/
/community/
/login/
/register/
/
/+=1/
/store/python-hoodie/
/community/
/login/
/register/
/login/
/register/
/
#
/
/+=1/
/store/python-hoodie/
/community/
/login/
/register/
/
/+=1/
/store/python-hoodie/
/community/
/login/
/register/
/introduction-to-python-programming/
/introduction-to-python-programming/
/introduction-intermediate-python-tutorial/
/introduction-intermediate-python-tutorial/
#
#
#
/
#
/
/+=1/
/store/python-hoodie/
/community/
/login/
/register/
/
/+=1/
/store/python-hoodie/
/community/
/login/
/register/
/pygame-python-3-part-1-intro/
/pygame-python-3-part-1-intro/
/opengl-rotating-cube-example-pyopengl-tutorial/
/opengl-rotating-cube-example-pyopengl-tutorial/
/kivy-application-development-tutorial/
/kivy-application-development-tutorial/
#
#
#
/
#
/
/+=1/
/store/python-hoodie/
/community/
/login/
/register/
/
/+=1/
/store/python-hoodie/
/community/
/login/
/register/
/data-analysis-tutorials/
/data-analysis-tutorials/
/robotics-t

2017-10-20 19:09:22 [scrapy.core.scraper] DEBUG: Scraped from <200 https://pythonprogramming.net/community/794/DBSCAN%20and%20Optics%20algorithm/>
{'url': 'https://pythonprogramming.net/community/794/DBSCAN%20and%20Optics%20algorithm/', 'h1': [], 'h2': ['Login', 'Sign up'], 'h3': ['DBSCAN and Optics algorithm'], 'p': ['You must be logged in to post. Please ', ' or ', ' an account.', '\n\t\t\t\t\t\t', ' \xa0\n\t\t\t\t\t\t', '\n\t\t\t\t\t', '\n\t\t\t\t\t\t\t\t', ' \xa0\n\t\t\t\t\t\t\t\t', '\n\t\t\t\t\t\t\t']}
2017-10-20 19:09:22 [scrapy.spidermiddlewares.depth] DEBUG: Ignoring link (depth > 2): https://pythonprogramming.net/ 
2017-10-20 19:09:22 [scrapy.spidermiddlewares.depth] DEBUG: Ignoring link (depth > 2): https://pythonprogramming.net/community/794/DBSCAN%20and%20Optics%20algorithm/ 
2017-10-20 19:09:22 [scrapy.spidermiddlewares.depth] DEBUG: Ignoring link (depth > 2): https://pythonprogramming.net/ 
2017-10-20 19:09:22 [scrapy.spidermiddlewares.depth] DEBUG: Ignoring link (depth >

/store/python-hoodie/
/community/
/login/
/register/
/register
/forgot-password/
#
#
#
/
#
/
/+=1/
/store/python-hoodie/
/community/
/login/
/register/
/
/+=1/
/store/python-hoodie/
/community/
/login/
/register/
/community/
/login/
/register/
#
#
#
/
#
/
/+=1/
/store/python-hoodie/
/community/
/login/
/register/
/
/+=1/
/store/python-hoodie/
/community/
/login/
/register/
/community/
/login/
/register/
#
#
#
/
#
/
/+=1/
/store/python-hoodie/
/community/
/login/
/register/
/
/+=1/
/store/python-hoodie/
/community/
/login/
/register/
/community/
/login/
/register/
#
#
#
/
#
/
/+=1/
/store/python-hoodie/
/community/
/login/
/register/
/
/+=1/
/store/python-hoodie/
/community/
/login/
/register/
/quadcopter-parts-tutorial/?completed=/building-quadcopter-tutorial-intro/
/support-donate/?a=1&t=/building-quadcopter-tutorial-intro/
/quadcopter-parts-tutorial/
/quadcopter-assembly-tutorial/
/quadcopter-esc-calibration-tutorial/
/quadcopter-flight-and-legal-tutorial/
#


2017-10-20 19:09:22 [scrapy.spidermiddlewares.depth] DEBUG: Ignoring link (depth > 2): https://pythonprogramming.net/building-quadcopter-tutorial-intro/ 
2017-10-20 19:09:22 [scrapy.spidermiddlewares.depth] DEBUG: Ignoring link (depth > 2): https://pythonprogramming.net/building-quadcopter-tutorial-intro/ 
2017-10-20 19:09:22 [scrapy.spidermiddlewares.depth] DEBUG: Ignoring link (depth > 2): https://pythonprogramming.net/building-quadcopter-tutorial-intro/ 
2017-10-20 19:09:22 [scrapy.spidermiddlewares.depth] DEBUG: Ignoring link (depth > 2): https://pythonprogramming.net/support-donate/ 
2017-10-20 19:09:22 [scrapy.spidermiddlewares.depth] DEBUG: Ignoring link (depth > 2): https://pythonprogramming.net/consulting/ 
2017-10-20 19:09:22 [scrapy.spidermiddlewares.depth] DEBUG: Ignoring link (depth > 2): https://www.facebook.com/pythonprogramming.net/ 
2017-10-20 19:09:22 [scrapy.spidermiddlewares.depth] DEBUG: Ignoring link (depth > 2): https://twitter.com/sentdex 
2017-10-20 19:09:22 [s

#
#
/support-donate/
/consulting/
https://www.facebook.com/pythonprogramming.net/
https://twitter.com/sentdex
https://plus.google.com/+sentdex
/about/tos/
/about/privacy-policy/
https://xkcd.com/353/
/
#
/
/+=1/
/store/python-hoodie/
/community/
/login/
/register/
/
/+=1/
/store/python-hoodie/
/community/
/login/
/register/
#
#
#
/
#
/
/+=1/
/store/python-hoodie/
/community/
/login/
/register/
/
/+=1/
/store/python-hoodie/
/community/
/login/
/register/
/robot-remote-control-car-with-the-raspberry-pi/
/supplies-needed-raspberry-pi-gopigo-robot/?completed=/robotics-raspberry-pi-tutorial-gopigo-introduction/
/+=1/?a=11&t=/robotics-raspberry-pi-tutorial-gopigo-introduction/
/supplies-needed-raspberry-pi-gopigo-robot/
/robot-programming-basics-tutorial/
/remote-control-robot-programming-tutorial/
/usb-foam-cannon-robot-gopigo-tutorial/
/programming-autonomous-robot-gopigo-tutorial/
/raspberry-pi-camera-opencv-face-detection-tutorial/
#
#
#
/
#
/
/+=1/
/store/python-hoodie/
/community/
/log

2017-10-20 19:09:23 [scrapy.spidermiddlewares.depth] DEBUG: Ignoring link (depth > 2): https://pythonprogramming.net/community/ 
2017-10-20 19:09:23 [scrapy.spidermiddlewares.depth] DEBUG: Ignoring link (depth > 2): https://pythonprogramming.net/login/ 
2017-10-20 19:09:23 [scrapy.spidermiddlewares.depth] DEBUG: Ignoring link (depth > 2): https://pythonprogramming.net/register/ 
2017-10-20 19:09:23 [scrapy.spidermiddlewares.depth] DEBUG: Ignoring link (depth > 2): https://pythonprogramming.net/community/789/Python%20query%20on%20Group%20by%20and%20Indexing/ 
2017-10-20 19:09:23 [scrapy.spidermiddlewares.depth] DEBUG: Ignoring link (depth > 2): https://pythonprogramming.net/community/789/Python%20query%20on%20Group%20by%20and%20Indexing/ 
2017-10-20 19:09:23 [scrapy.spidermiddlewares.depth] DEBUG: Ignoring link (depth > 2): https://pythonprogramming.net/community/789/Python%20query%20on%20Group%20by%20and%20Indexing/ 
2017-10-20 19:09:23 [scrapy.core.engine] DEBUG: Crawled (200) <GET ht

/community/
/login/
/register/
#
#
#
/
#
/
/+=1/
/store/python-hoodie/
/community/
/login/
/register/
/
/+=1/
/store/python-hoodie/
/community/
/login/
/register/
/community/
/login/
/register/
/login/
/register/
#
#
#


2017-10-20 19:09:23 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://pythonprogramming.net/register> (referer: https://pythonprogramming.net/login/)
2017-10-20 19:09:23 [scrapy.core.scraper] DEBUG: Scraped from <200 https://pythonprogramming.net/community/791/Heart-felt%20thank%20you/>
{'url': 'https://pythonprogramming.net/community/791/Heart-felt%20thank%20you/', 'h1': [], 'h2': ['Login', 'Sign up'], 'h3': ['Heart-felt thank you'], 'p': ['You must be logged in to post. Please ', ' or ', ' an account.', '\n\t\t\t\t\t\t', ' \xa0\n\t\t\t\t\t\t', '\n\t\t\t\t\t', '\n\t\t\t\t\t\t\t\t', ' \xa0\n\t\t\t\t\t\t\t\t', '\n\t\t\t\t\t\t\t']}
2017-10-20 19:09:23 [scrapy.spidermiddlewares.depth] DEBUG: Ignoring link (depth > 2): https://pythonprogramming.net/ 
2017-10-20 19:09:23 [scrapy.spidermiddlewares.depth] DEBUG: Ignoring link (depth > 2): https://pythonprogramming.net/community/791/Heart-felt%20thank%20you/ 
2017-10-20 19:09:23 [scrapy.spidermiddlewares.depth] DEBUG: Ignoring link (depth

/
#
/
/+=1/
/store/python-hoodie/
/community/
/login/
/register/
/
/+=1/
/store/python-hoodie/
/community/
/login/
/register/
/community/
/login/
/register/
#
#
#
/
#
/
/+=1/
/store/python-hoodie/
/community/
/login/
/register/
/
/+=1/
/store/python-hoodie/
/community/
/login/
/register/
/community/
/login/
/register/
/login/
/register/
#
#
#
/
#
/
/+=1/
/store/python-hoodie/
/community/
/login/
/register/
/
/+=1/
/store/python-hoodie/
/community/
/login/
/register/
/machine-learning-tutorials/
/machine-learning-tutorials/
/data-analysis-python-pandas-tutorial-introduction/
/data-analysis-python-pandas-tutorial-introduction/
/matplotlib-intro-tutorial/
/matplotlib-intro-tutorial/
/finance-tutorials/
/
#
/
/+=1/
/store/python-hoodie/
/community/
/login/
/register/
/
/+=1/
/store/python-hoodie/
/community/
/login/
/register/
http://www.python.org/
http://pyopengl.sourceforge.net/
http://pygame.org/
http://www.python.org/
http://pyopengl.sourceforge.net/
http://pygame.org/
http://www.lfd.

2017-10-20 19:09:23 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://pythonprogramming.net/about/tos/> from <GET https://pythonprogramming.net/about/tos>
2017-10-20 19:09:23 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://pythonprogramming.net/community/795/Cuffmerge%20SyntaxError%20invalid%20syntax/> (referer: https://pythonprogramming.net/community/)
2017-10-20 19:09:23 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://pythonprogramming.net/community/788/Can%20I%20use%20my%20laptop%20to%20make%20and%20train%20Haar%20Cascade/?/> (referer: https://pythonprogramming.net/community/)
2017-10-20 19:09:23 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://pythonprogramming.net/basic-gui-pyqt-tutorial/> (referer: https://pythonprogramming.net/gui-development-tutorials/)
2017-10-20 19:09:23 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://pythonprogramming.net/tkinter-depth-tutorial-making-actual-program/> (referer: https://pythonprogrammi

/
#
/
/+=1/
/store/python-hoodie/
/community/
/login/
/register/
/
/+=1/
/store/python-hoodie/
/community/
/login/
/register/
#
#
#
/
#
/
/+=1/
/store/python-hoodie/
/community/
/login/
/register/
/
/+=1/
/store/python-hoodie/
/community/
/login/
/register/
/about/tos
/about/privacy-policy
#
#
#
/
#
/
/+=1/
/store/python-hoodie/
/community/
/login/
/register/
/
/+=1/
/store/python-hoodie/
/community/
/login/
/register/
/community/
/login/
/register/
#
#
#
/
#
/
/+=1/
/store/python-hoodie/
/community/
/login/
/register/
/
/+=1/
/store/python-hoodie/
/community/
/login/
/register/
/community/
/login/
/register/
/login/
/register/
#
#
#
/
#
/
/+=1/
/store/python-hoodie/
/community/
/login/
/register/
/
/+=1/
/store/python-hoodie/
/community/
/login/
/register/
http://www.riverbankcomputing.co.uk/software/pyqt/intro
http://www.qt.io/
https://github.com/kenwaldek
https://github.com/kenwaldek/pythonprogramming
http://www.riverbankcomputing.com/software/pyqt/download
/sys-module-python-3/
/+=

2017-10-20 19:09:23 [scrapy.spidermiddlewares.depth] DEBUG: Ignoring link (depth > 2): https://pythonprogramming.net/__str__-__repr__-intermediate-python-tutorial/ 
2017-10-20 19:09:23 [scrapy.spidermiddlewares.depth] DEBUG: Ignoring link (depth > 2): https://pythonprogramming.net/args-kwargs-intermediate-python-tutorial/ 
2017-10-20 19:09:23 [scrapy.spidermiddlewares.depth] DEBUG: Ignoring link (depth > 2): https://pythonprogramming.net/introduction-intermediate-python-tutorial/ 
2017-10-20 19:09:23 [scrapy.spidermiddlewares.depth] DEBUG: Ignoring link (depth > 2): https://pythonprogramming.net/introduction-intermediate-python-tutorial/ 
2017-10-20 19:09:23 [scrapy.spidermiddlewares.depth] DEBUG: Ignoring link (depth > 2): https://pythonprogramming.net/introduction-intermediate-python-tutorial/ 
2017-10-20 19:09:23 [scrapy.spidermiddlewares.depth] DEBUG: Ignoring link (depth > 2): https://pythonprogramming.net/support-donate/ 
2017-10-20 19:09:23 [scrapy.spidermiddlewares.depth] DEBUG

/__str__-__repr__-intermediate-python-tutorial/
/args-kwargs-intermediate-python-tutorial/
#
#
#
/support-donate/
/consulting/
https://www.facebook.com/pythonprogramming.net/
https://twitter.com/sentdex
https://plus.google.com/+sentdex
/about/tos/
/about/privacy-policy/
https://xkcd.com/353/
/
#
/
/+=1/
/store/python-hoodie/
/community/
/login/
/register/
/
/+=1/
/store/python-hoodie/
/community/
/login/
/register/
https://www.digitalocean.com/?refcode=d1c4c6e66979
https://www.linode.com/?r=b417bc672ff52d6c055fd7a56e024939c667b0fd
/+=1/
/basic-flask-website-tutorial/?completed=/practical-flask-introduction/
/+=1/?a=12&t=/practical-flask-introduction/
/basic-flask-website-tutorial/
/bootstrap-jinja-templates-flask/
/website-home-page-flask/
/flask-homepage-improvements/
/flask-homepage-completed/
/flask-user-dashboard/
/flask-content-management-basics/
/flask-error-handling-basics/
/flash-flask-tutorial/
/flask-users-tutorial/
/flask-get-post-requests-handling-tutorial/
/mysql-database-

2017-10-20 19:09:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://pythonprogramming.net/introduction-to-python-programming/>
{'url': 'https://pythonprogramming.net/introduction-to-python-programming/', 'h1': [], 'h2': ['Python 3 Programming Introduction Tutorial', 'Login', 'Sign up'], 'h3': [], 'p': ["Chances are, if you're viewing this page, you're brand new to Python.", 'You might even be new to Programming all-together. Either way, \n\t\tyou have come to the right place, and chosen the right language!', 'Python is very beginner-friendly. The syntax (words and structure) is extremely simple to read and follow, most of which can be understood even if you do not know any programming. Let me show you:', '"print()" is a built-in Python function that will output some text to the ', '.', 'When someone says to "print to the console," they are referring to where information from your program is ouput. This might be a command prompt (CMD.exe), the terminal for Mac/Linux users, or the

/
#
/
/+=1/
/store/python-hoodie/
/community/
/login/
/register/
/
/+=1/
/store/python-hoodie/
/community/
/login/
/register/
https://www.python.org/
https://goo.gl/yMaaTM
/
/register/
https://www.python.org/
https://goo.gl/yMaaTM
/python-tutorial-print-function-strings/?completed=/introduction-to-python-programming/
/dashboard/
/+=1/?a=12&t=/introduction-to-python-programming/
/python-tutorial-print-function-strings/
/math-basics-python-3-beginner-tutorial/
/python-3-variables-tutorial/
/python-3-loop-tutorial/
/loop-python-3-basics-tutorial/
/if-statement-python-3-basics-tutorial/
/else-python-3-tutorial/
/elif-else-python-3-tutorial/
/functions-python-3-basics-tutorial/
/function-parameters-python-3-basics/
/function-parameter-defaults-python-3-basics/
/global-local-variables/
/installing-modules-python-3/
/using-pip-install-for-python-modules/
/common-errors-python-3-basics/
/writing-file-python-3-basics/
/appending-file-python-3-tutorial/
/reading-file-python-3-tutorial/
/classes-

2017-10-20 19:09:24 [scrapy.spidermiddlewares.depth] DEBUG: Ignoring link (depth > 2): https://pythonprogramming.net/community/ 
2017-10-20 19:09:24 [scrapy.spidermiddlewares.depth] DEBUG: Ignoring link (depth > 2): https://pythonprogramming.net/login/ 
2017-10-20 19:09:24 [scrapy.spidermiddlewares.depth] DEBUG: Ignoring link (depth > 2): https://pythonprogramming.net/register/ 
2017-10-20 19:09:24 [scrapy.spidermiddlewares.depth] DEBUG: Ignoring link (depth > 2): https://pythonprogramming.net/community/792/Python%20Web%20service%20to%20sql%20database%20example/ 
2017-10-20 19:09:24 [scrapy.spidermiddlewares.depth] DEBUG: Ignoring link (depth > 2): https://pythonprogramming.net/community/792/Python%20Web%20service%20to%20sql%20database%20example/ 
2017-10-20 19:09:24 [scrapy.spidermiddlewares.depth] DEBUG: Ignoring link (depth > 2): https://pythonprogramming.net/community/792/Python%20Web%20service%20to%20sql%20database%20example/ 
2017-10-20 19:09:24 [scrapy.core.scraper] DEBUG: Scrap


/login/
/register/
#
#
#
/
#
/
/+=1/
/store/python-hoodie/
/community/
/login/
/register/
/
/+=1/
/store/python-hoodie/
/community/
/login/
/register/
/machine-learning-tutorial-python-introduction/
/machine-learning-tutorial-python-introduction
/game-frames-open-cv-python-plays-gta-v/
/game-frames-open-cv-python-plays-gta-v/
/introduction-use-tensorflow-object-detection-api-tutorial/
/introduction-use-tensorflow-object-detection-api-tutorial/
/machine-learning-python-sklearn-intro/
/machine-learning-python-sklearn-intro/
/flat-clustering-machine-learning-python-scikit-learn/
/flat-clustering-machine-learning-python-scikit-learn/
/image-recognition-python/
/image-recognition-python/
#
#
#
/
#
/
/+=1/
/store/python-hoodie/
/community/
/login/
/register/
/
/+=1/
/store/python-hoodie/
/community/
/login/
/register/
/getting-stock-prices-python-programming-for-finance/
/getting-stock-prices-python-programming-for-finance/
/machine-learning-python-sklearn-intro/
/machine-learning-python-sk

2017-10-20 19:09:24 [scrapy.spidermiddlewares.depth] DEBUG: Ignoring link (depth > 2): https://pythonprogramming.net/mpi4py-size-command-mpi/ 
2017-10-20 19:09:24 [scrapy.spidermiddlewares.depth] DEBUG: Ignoring link (depth > 2): https://pythonprogramming.net/sending-receiving-data-messages-mpi4py/ 
2017-10-20 19:09:24 [scrapy.spidermiddlewares.depth] DEBUG: Ignoring link (depth > 2): https://pythonprogramming.net/sending-receiving-messages-nodes-dynamically/ 
2017-10-20 19:09:24 [scrapy.spidermiddlewares.depth] DEBUG: Ignoring link (depth > 2): https://pythonprogramming.net/tagging-messages-mpi-multiple-messages/ 
2017-10-20 19:09:24 [scrapy.spidermiddlewares.depth] DEBUG: Ignoring link (depth > 2): https://pythonprogramming.net/mpi-broadcast-tutorial-mpi4py/ 
2017-10-20 19:09:24 [scrapy.spidermiddlewares.depth] DEBUG: Ignoring link (depth > 2): https://pythonprogramming.net/scatter-gather-mpi-mpi4py-tutorial/ 
2017-10-20 19:09:24 [scrapy.spidermiddlewares.depth] DEBUG: Ignoring link 

/mpi4py-size-command-mpi/
/sending-receiving-data-messages-mpi4py/
/sending-receiving-messages-nodes-dynamically/
/tagging-messages-mpi-multiple-messages/
/mpi-broadcast-tutorial-mpi4py/
/scatter-gather-mpi-mpi4py-tutorial/
/mpi-gather-command-mpi4py-python/
#
#
#
/
#
/
/+=1/
/store/python-hoodie/
/community/
/login/
/register/
/
/+=1/
/store/python-hoodie/
/community/
/login/
/register/
//www.lfd.uci.edu/~gohlke/pythonlibs/#opencv
//pythonprogramming.net/using-pip-install-for-python-modules/
/matplotlib-intro-tutorial/
//www.numpy.org/
/loading-video-python-opencv-tutorial/?completed=/loading-images-python-opencv-tutorial/
/+=1/?a=11&t=/loading-images-python-opencv-tutorial/
/loading-video-python-opencv-tutorial/
/drawing-writing-python-opencv-tutorial/
/image-operations-python-opencv-tutorial/
/image-arithmetics-logic-python-opencv-tutorial/
/thresholding-image-analysis-python-opencv-tutorial/
/color-filter-python-opencv-tutorial/
/blurring-smoothing-python-opencv-tutorial/
/morpholo

In [13]:
with open('./scrapy.json') as f:
    list_scrapy = json.load(f)
pprint(list_scrapy)

[{'h1': [],
  'h2': ['Login', 'Sign up'],
  'h3': [],
  'p': ['Learn how to use Python with Pandas, Matplotlib, and other modules to '
        'gather insights from and about your data.',
        'Control hardware with Python programming and the Raspberry Pi.',
        'How to develop websites with either the Flask or Django frameworks '
        'for Python.',
        "Create your own games with Python's PyGame library, or check out the "
        'multi-platform Kivy.',
        'Learn the basic and intermediate Python fundamentals.',
        'Create software with a user interface using Tkinter, PyQt, or Kivy.',
        '\n\t\t\t\t\t\t',
        ' \xa0\n\t\t\t\t\t\t',
        '\n\t\t\t\t\t',
        '\n\t\t\t\t\t\t\t\t',
        ' \xa0\n\t\t\t\t\t\t\t\t',
        '\n\t\t\t\t\t\t\t'],
  'url': 'https://pythonprogramming.net/'},
 {'h1': [],
  'h2': ['Login', 'Sign up'],
  'h3': [],
  'p': ['\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t',
        '\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\t\t\t\

In [14]:
list_scrapy[0]

{'h1': [],
 'h2': ['Login', 'Sign up'],
 'h3': [],
 'p': ['Learn how to use Python with Pandas, Matplotlib, and other modules to gather insights from and about your data.',
  'Control hardware with Python programming and the Raspberry Pi.',
  'How to develop websites with either the Flask or Django frameworks for Python.',
  "Create your own games with Python's PyGame library, or check out the multi-platform Kivy.",
  'Learn the basic and intermediate Python fundamentals.',
  'Create software with a user interface using Tkinter, PyQt, or Kivy.',
  '\n\t\t\t\t\t\t',
  ' \xa0\n\t\t\t\t\t\t',
  '\n\t\t\t\t\t',
  '\n\t\t\t\t\t\t\t\t',
  ' \xa0\n\t\t\t\t\t\t\t\t',
  '\n\t\t\t\t\t\t\t'],
 'url': 'https://pythonprogramming.net/'}

In [15]:
list_scrapy[1]

{'h1': [],
 'h2': ['Login', 'Sign up'],
 'h3': [],
 'p': ['\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t',
  '\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t',
  '\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t',
  '\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t',
  '\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t',
  '\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t',
  '\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t',
  '\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t',
  '\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t',
  '\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t',
  '\n\t\t\t\t\t\t',
  ' \xa0\n\t\t\t\t\t\t',
  '\n\t\t\t\t\t',
  '\n\t\t\t\t\t\t\t\t',
  ' \xa0\n\t\t\t\t\t\t\t\t',
  '\n\t\t\t\t\t\t\t'],
 'url': 'https://pythonprogramming.net/community/'}

In [16]:
list_scrapy[2]

{'h1': [],
 'h2': ['Login', 'Sign up'],
 'h3': [],
 'p': ['How to setup and use your Raspberry Pi with various projects.',
  'Use the Raspberry Pi along with the GoPiGo to learn about robotics.',
  'Build and program your own quadcopter from scratch to take-off.',
  '\n\t\t\t\t\t\t',
  ' \xa0\n\t\t\t\t\t\t',
  '\n\t\t\t\t\t',
  '\n\t\t\t\t\t\t\t\t',
  ' \xa0\n\t\t\t\t\t\t\t\t',
  '\n\t\t\t\t\t\t\t'],
 'url': 'https://pythonprogramming.net/robotics-tutorials/'}

In [17]:
list_scrapy[3]

{'h1': [],
 'h2': [],
 'h3': [],
 'p': ['Want a video series to go? No problem, as a +=1 subscriber, you can download the videos for entire series.',
  'Work out your problem-solving skills, test your knowledge of the material you are learning and learn actual use-cases for what you learn.',
  '+=1 users are given a direct line of communication for help, discussion, suggestions, and more.',
  'Adsense ads will be no longer shown to you as a subscriber.',
  'The cost for subscription is $9 per month via PayPal.',
  'After subscribing, head to this same page for the +=1 dashboard, or you can head straight to the tutorials for the included quizzes and challenges. Allow for up to 5 minutes for changes to take place, though it should be near-instant.',
  'You must be logged in to subscribe. Please ',
  ' or ',
  ' an account.'],
 'url': 'https://pythonprogramming.net/+=1/'}

## Visualiser les données

## Nettoyer les données (regexp et nltk) 

### Exercice 3

Nettoyer les "\n", "\t", et espaces, extraire seulement les mots.

## Machine Learning "non-supervisé" : tagging automatique avec un tf-idf 

## Amélioration de la prédiction de tags grâce à des méthodes de classifications binaires