# _INSIDE TECH'S HYPE_

This notebook covers the extraction, cleaning, and transformation of the data needed for the project. The project itself is carried out in other files.

First, we import the libraries we will work with: Requests, BeautifulSoup, and Pandas.

In [1]:
from bs4 import BeautifulSoup as bs
import requests
import time
import re
import pandas as pd

With the libraries imported, we start scraping the selected website. The first step is to collect as many links to articles as possible from the URLs of the blog's sections.

In [6]:
# Base URL of the tag pages and the blog sections (tags) to crawl
url = 'https://hackernoon.com/tagged/'
feat_list = ['cryptocurrency','coding','artificial-intelligence',
             'futurist','startups']

# Collect the absolute URL of every article listed on each tag page
lista_articulos = []
for features in feat_list:
    res = requests.get(url+features)
    page = res.content
    soup = bs(page, 'html.parser')
    for article in soup.find_all('div',{'class':'title'}):
        news = article.find('a', href=True)
        if news is None:  # skip title blocks without a link
            continue
        lista_articulos.append('https://hackernoon.com' + news.get('href'))
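The link-collection step can be tried offline on a small HTML fragment. The sketch below is a minimal illustration under assumptions: the `sample_html` markup and the `extract_links` helper are hypothetical, and only the `div.title` / `a[href]` structure is taken from the cell above. It also removes repeated links, since an article listed under several tags would otherwise be collected more than once.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://hackernoon.com"

# Hypothetical markup that mimics the structure targeted above
sample_html = """
<div class="title"><a href="/first-story">First story</a></div>
<div class="title"><a href="/second-story">Second story</a></div>
<div class="title"><span>No link here</span></div>
<div class="title"><a href="/first-story">First story</a></div>
"""

def extract_links(html, base_url=BASE_URL):
    """Return the absolute article URLs found in html, without repeats."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for block in soup.find_all("div", {"class": "title"}):
        anchor = block.find("a", href=True)
        if anchor is None:  # title block without a link
            continue
        links.append(urljoin(base_url, anchor["href"]))
    # dict.fromkeys keeps the first occurrence of each URL, in order
    return list(dict.fromkeys(links))

links = extract_links(sample_html)
print(links)
```

Whether duplicates should be removed at this stage depends on the analysis; the sketch only shows one way to do it while preserving order.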


In [7]:
print(len(lista_articulos))

360


We checked the number of links collected: 360. That is a small number a priori but, given the delivery deadlines, we accept it and continue with the analysis. The next step is to extract the information relevant to the project: each article's text, its title, and the tags it carries.

In [9]:
tag_dict = dict()
title_list = []
text_list = []
counter = 0    # numbers the articles that have no title
counter2 = 1   # suffix appended to repeated titles
counter4 = 1   # key of each article in tag_dict
start = time.time()
for url in lista_articulos:

    page = requests.get(url)
    soup = bs(page.text, 'html.parser')

    # Article body: join the text of every paragraph block
    paragraph = []
    for t in soup.find_all('div',{'class':'paragraph'}):
        paragraph.append(t.text)
    text_list.append(' '.join(paragraph))

    # Title: exactly one entry per article, so the lists stay aligned
    h1 = soup.find('h1', {'class':'title'})
    if h1 is None:
        title = "No title" + str(counter)
        counter += 1
    else:
        title = h1.get_text()
        if title in title_list:
            title = title + str(counter2)
            counter2 += 1
    title_list.append(title)

    # Tags: reset for every article so that a page without a tag block
    # does not inherit the tags of the previous one
    tags_str = []
    for i in soup.find_all('div', {'class':'archive-tags'}):
        tags_str = i.text.split('\n')[1:-1]
    # Each entry is stored as a one-element list that wraps the tag list
    tags = [tags_str]
    counter4 += 1
    tag_dict[str(counter4)] = tags

end = time.time()
print(end-start)
    

311.5912640094757
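The per-article extraction can be isolated in a function and checked without network access. This is a minimal sketch under assumptions: `sample_article` is invented markup and `parse_article` is a hypothetical helper; only the class names (`title`, `paragraph`, `archive-tags`) come from the loop above.

```python
from bs4 import BeautifulSoup

# Hypothetical article markup using the class names from the loop above
sample_article = """
<h1 class="title">Sample title</h1>
<div class="paragraph">First paragraph.</div>
<div class="paragraph">Second paragraph.</div>
<div class="archive-tags">
<a>Bitcoin</a>
<a>Finance</a>
</div>
"""

def parse_article(html):
    """Return (title, text, tags) for one article page."""
    soup = BeautifulSoup(html, "html.parser")

    h1 = soup.find("h1", {"class": "title"})
    title = h1.get_text(strip=True) if h1 is not None else None

    paragraphs = soup.find_all("div", {"class": "paragraph"})
    text = " ".join(p.get_text() for p in paragraphs)

    tags = []
    tag_block = soup.find("div", {"class": "archive-tags"})
    if tag_block is not None:
        # One tag per line; blank lines are discarded
        tags = [t.strip() for t in tag_block.get_text().split("\n") if t.strip()]

    return title, text, tags

title, text, tags = parse_article(sample_article)
empty = parse_article("<p>nothing useful</p>")
print(title, text, tags)
```

Returning a missing title as `None` and missing tags as an empty list makes pages with a different layout easy to spot later, instead of silently reusing values from a previous page.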


In [10]:
print(text_list[0])

Financial markets are chaotic. So chaotic, even, that many economists and investors believe market trends to be the product of ârandom walksâ and that prices cannot be predicted (see generallyÂ Malkiel). But randomness shouldn't be worrisome. In fact, random price movements can be good. Gaussian random walk, an assumption used by an options pricing model called Black-Scholes, treats intervals of an assetâs price over time as independent variables. By doing so, the changes in price over time, or the returns of an asset, are assumed to be normally distributed. Otherwise stated, âIf transactions are fairly uniformly spread across time, and if the number of transactions per day, week, or month is very large, then the Central Limit Theorem leads us to expect that these price changes will have normal or Gaussian distributionsâ (Fama, 399). When an asset's returns are normally distributed, the probabilities of those returns are known. Knowing these probabilities can give investors a

We have extracted the page content into two lists (title_list and text_list) and a dictionary (tag_dict). Looking at the text, we see words containing stray characters such as 'â' and 'Â'. We clean these terms with regular expressions.

In [11]:
# Regex to clean the text format
def regexTitle(s):
    # Remove each 'â' or 'Â' together with any non-whitespace characters attached after it
    lst=[]
    for text in s:
        lst.append(re.sub(r'[âÂ]\S*','', text))
    return lst

title_list=regexTitle(title_list)
text_list=regexTitle(text_list)
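To see what this pattern does, it helps to run it on short samples. The sketch below applies the same substitution to single strings and then shows a possible alternative. The alternative rests on an assumption that is not confirmed by the notebook: that the stray characters come from UTF-8 text decoded as Latin-1. Both `clean` and `repair_mojibake` are hypothetical helper names.

```python
import re

PATTERN = re.compile(r"[âÂ]\S*")

def clean(text):
    """Apply the same substitution as regexTitle to a single string."""
    return PATTERN.sub("", text)

# Marker at the end of a word: only the marker is removed
print(clean("walksâ and generallyÂ Malkiel"))
# Marker at the start of a word: the whole word is removed with it
print(clean("of ârandom walks"))

def repair_mojibake(text):
    """Undo UTF-8 text that was decoded as Latin-1 (assumed cause)."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # leave the text untouched if the round trip fails

# Assumed raw form of garbled quotes: 'â' followed by two invisible characters
garbled = "\u00e2\u0080\u009crandom walks\u00e2\u0080\u009d"
print(repair_mojibake(garbled))
```

The regular expression is simple but drops any word that begins with a marker. If the Latin-1 assumption holds, the round trip recovers the original quotation marks instead of deleting text.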
    

We build a dictionary with the content extracted from Hackernoon, to be converted later into a DataFrame with Pandas.

In [12]:
articles_dict = {"id": lista_articulos, "title": title_list, "text": text_list}


In [13]:
articles = pd.DataFrame.from_dict(articles_dict)
articles['tags'] = tag_dict.values()
articles.shape

(360, 4)

In [14]:
articles.tags

0      ([Bitcoin, Blockchain, Finance, Investment, Cr...
1      ([Venture Capital, Self Custodian Bank, Crypto...
2      ([Cryptocurrency, Ethereum, Blockchain, Machin...
3      ([Scott Stornetta, Cryptocurrency, End User, E...
4      ([Crypto Exchanges, Latest Tech Stories, Hacke...
                             ...                        
355    ([Startup, Product Management, Culture, Startu...
356    ([Product Management, Startups, Startup, Manag...
357    ([Instacart, Technology, Amazon, Predictions, ...
358    ([Instacart, Technology, Amazon, Predictions, ...
359    ([Startup, Management, Agile, Startups, Produc...
Name: tags, Length: 360, dtype: object

As we can see, each article has several tags held in a list. In addition, each value is wrapped in an outer container whose first element is that list. We take the first element to keep only the list of tags, which we will later reshape so that each record carries a single tag.

In [15]:
articles.tags = [e[0] for e in articles.tags]

Some articles yielded no text because their content sat inside a different HTML tag. We drop these records.

In [16]:

articles = articles[articles.text != '']
articles.head()

Unnamed: 0,id,title,text,tags
0,https://hackernoon.com/kurtosis-and-bitcoin-uc...,Kurtosis and Bitcoin: A Quantitative Analysis,"Financial markets are chaotic. So chaotic, eve...","[Bitcoin, Blockchain, Finance, Investment, Cry..."
1,https://hackernoon.com/why-we-invested-in-mult...,"Why We Invested in Multis, The Self Custodian ...",We are excited to announce our investment in M...,"[Venture Capital, Self Custodian Bank, Cryptoc..."
2,https://hackernoon.com/five-fascinating-data-v...,Centralized Crypto Exchanges Explained in 5 Fa...,"Centralized exchanges are, arguably, one of th...","[Cryptocurrency, Ethereum, Blockchain, Machine..."
3,https://hackernoon.com/founding-father-of-bloc...,Founding father of Blockchain Scott Stornetta ...,Dr. Scott Stornetta is the blockchain co-inven...,"[Scott Stornetta, Cryptocurrency, End User, El..."
4,https://hackernoon.com/the-curious-case-of-cry...,The Curious Case of Crypto-Exchanges,Hacker Noon contributors evaluated the industr...,"[Crypto Exchanges, Latest Tech Stories, Hacker..."


Next, we create a function that extracts the tags from the list and turns them into independent records.

In [17]:
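def tagSeparator(df):
    # Expand each tag list into columns, then stack them into one tag per row
    s = df.tags.apply(pd.Series).stack()
    # Drop the inner index level so the index matches the article index again
    s.index=s.index.droplevel(-1)
    s.name = 'Tags'
    return df.join(s)

new_articles= tagSeparator(articles)

new_articles.drop('tags', axis=1, inplace=True)
type(new_articles)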


pandas.core.frame.DataFrame
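The same reshaping, one row per tag, can also be written with `DataFrame.explode`, available in pandas 0.25 and later. The sketch below is an alternative under assumptions, not the function used in this notebook: `toy` is invented data and `tag_separator_explode` is a hypothetical name.

```python
import pandas as pd

# Toy data standing in for the articles frame; the values are illustrative only
toy = pd.DataFrame({
    "title": ["A", "B"],
    "tags": [["Bitcoin", "Finance"], ["Startup"]],
})

def tag_separator_explode(df):
    """Return one row per (article, tag), with the tag in a 'Tags' column."""
    return df.explode("tags").rename(columns={"tags": "Tags"})

exploded = tag_separator_explode(toy)
print(exploded)
```

As with the join above, the article index is repeated once for every tag, so rows that belong to the same article can still be grouped by index.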

In [18]:
new_articles.Tags.unique()

array(['Bitcoin', 'Blockchain', 'Finance', 'Investment', 'Cryptocurrency',
       'Hackernoon Top Story', 'Kurtosis', 'Kurtosis And Bitcoin',
       'Venture Capital', 'Self Custodian Bank', 'Cryptocurrency Bank',
       'Fintech', 'Multis', 'Gnosis', 'Ethereum', 'Machine Learning',
       'Crypto Exchanges', 'Centralized Crypto Exchanges',
       'Data Visualizations', 'Scott Stornetta', 'End User',
       'Electroneum', 'Regulations', 'Latest Tech Stories',
       'Moderated Centralisation', 'Good Company',
       'Hacker Noon Newsletter', 'Decentralized Exchanges',
       'Cryptocurrency Liquidity', 'Trading Volume Cryptocurrency',
       'Cryptocurrency Exchange', 'Craig Wright', 'Binance', 'Quantopian',
       'Crypto', 'Custody', 'Storing Digital Assets', 'Banking',
       'Custodianship Of Assets', 'Digital Asset',
       'Institutional Grade Custody', 'Cold Wallet', 'Hot Wallet',
       'Multi Signature Wallet', 'Tech Newsletter', 'Decentralization',
       'Software Developmen

In [19]:
# Tags to exclude from the dataset
drop_tags = ['Top tech stories', 'Latest', 'Hacker noon awards', 'Tech awards', 'Noonies 2019',
             'Hackernoon medium', 'Medium hackernoon', 'David smooke',
             'Ev williams', 'Hackernoon', 'Medium', 'Killing globalization',
             'Killing globalisation', 'I see the world', 'Wtf Is A Merkle Tree', 'What Is A Merkle Tree',
             'Another tech thought', 'Founders', 'Founder stories',
             'Problemeter', 'Latet tech stories', 'Promiseallsettled', 'Promiseany', ' all encompassing enemy',
             'Be for something', 'Erik Brynjolfsson']

# isin() matches exact, case-sensitive strings. The filtered frame is only
# displayed here; new_articles itself is left unchanged
new_articles[~new_articles.Tags.isin(drop_tags)]



Unnamed: 0,id,title,text,Tags
0,https://hackernoon.com/kurtosis-and-bitcoin-uc...,Kurtosis and Bitcoin: A Quantitative Analysis,"Financial markets are chaotic. So chaotic, eve...",Bitcoin
0,https://hackernoon.com/kurtosis-and-bitcoin-uc...,Kurtosis and Bitcoin: A Quantitative Analysis,"Financial markets are chaotic. So chaotic, eve...",Blockchain
0,https://hackernoon.com/kurtosis-and-bitcoin-uc...,Kurtosis and Bitcoin: A Quantitative Analysis,"Financial markets are chaotic. So chaotic, eve...",Finance
0,https://hackernoon.com/kurtosis-and-bitcoin-uc...,Kurtosis and Bitcoin: A Quantitative Analysis,"Financial markets are chaotic. So chaotic, eve...",Investment
0,https://hackernoon.com/kurtosis-and-bitcoin-uc...,Kurtosis and Bitcoin: A Quantitative Analysis,"Financial markets are chaotic. So chaotic, eve...",Cryptocurrency
0,https://hackernoon.com/kurtosis-and-bitcoin-uc...,Kurtosis and Bitcoin: A Quantitative Analysis,"Financial markets are chaotic. So chaotic, eve...",Hackernoon Top Story
0,https://hackernoon.com/kurtosis-and-bitcoin-uc...,Kurtosis and Bitcoin: A Quantitative Analysis,"Financial markets are chaotic. So chaotic, eve...",Kurtosis
0,https://hackernoon.com/kurtosis-and-bitcoin-uc...,Kurtosis and Bitcoin: A Quantitative Analysis,"Financial markets are chaotic. So chaotic, eve...",Kurtosis And Bitcoin
1,https://hackernoon.com/why-we-invested-in-mult...,"Why We Invested in Multis, The Self Custodian ...",We are excited to announce our investment in M...,Venture Capital
1,https://hackernoon.com/why-we-invested-in-mult...,"Why We Invested in Multis, The Self Custodian ...",We are excited to announce our investment in M...,Self Custodian Bank
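Because `isin` compares exact strings, a tag such as 'Latest Tech Stories' is not removed by an entry written with different capitalization. The sketch below illustrates the difference on invented data; `toy`, `exact`, and `relaxed` are hypothetical names, and the lower-casing step is one possible choice rather than the notebook's method.

```python
import pandas as pd

# Toy data standing in for new_articles; the values are illustrative only
toy = pd.DataFrame({
    "title": ["A", "A", "B", "B"],
    "Tags": ["Bitcoin", "Medium", "Startup", "Latest Tech Stories"],
})
drop_tags = ["Medium", "Latest tech stories"]

# Exact match: 'Latest Tech Stories' survives because the case differs
exact = toy[~toy.Tags.isin(drop_tags)]

# Case-insensitive match: compare lower-cased copies on both sides
lowered = {t.lower() for t in drop_tags}
relaxed = toy[~toy.Tags.str.lower().isin(lowered)]

print(len(toy), len(exact), len(relaxed))
```

In both cases the filter returns a new frame, so the result has to be assigned to a variable for it to reach later cells.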


In [20]:
new_articles.head()

Unnamed: 0,id,title,text,Tags
0,https://hackernoon.com/kurtosis-and-bitcoin-uc...,Kurtosis and Bitcoin: A Quantitative Analysis,"Financial markets are chaotic. So chaotic, eve...",Bitcoin
0,https://hackernoon.com/kurtosis-and-bitcoin-uc...,Kurtosis and Bitcoin: A Quantitative Analysis,"Financial markets are chaotic. So chaotic, eve...",Blockchain
0,https://hackernoon.com/kurtosis-and-bitcoin-uc...,Kurtosis and Bitcoin: A Quantitative Analysis,"Financial markets are chaotic. So chaotic, eve...",Finance
0,https://hackernoon.com/kurtosis-and-bitcoin-uc...,Kurtosis and Bitcoin: A Quantitative Analysis,"Financial markets are chaotic. So chaotic, eve...",Investment
0,https://hackernoon.com/kurtosis-and-bitcoin-uc...,Kurtosis and Bitcoin: A Quantitative Analysis,"Financial markets are chaotic. So chaotic, eve...",Cryptocurrency


In [21]:
new_articles.shape

(1593, 4)

We save the new dataset to a CSV file.

In [22]:
new_articles.to_csv('text_classifier.csv')
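By default, `to_csv` writes the row index as an unnamed first column, which `read_csv` then loads back as 'Unnamed: 0'. The sketch below shows both behaviors on invented data using an in-memory buffer; `toy` and `round_trip` are hypothetical names, and whether the index is worth keeping depends on how the downstream files use it.

```python
import io

import pandas as pd

# Toy frame with a repeated index, as produced by the tag expansion
toy = pd.DataFrame(
    {"title": ["A", "A", "B"], "Tags": ["Bitcoin", "Finance", "Startup"]},
    index=[0, 0, 1],
)

def round_trip(df, **to_csv_kwargs):
    """Write df to an in-memory CSV and read it back."""
    buffer = io.StringIO()
    df.to_csv(buffer, **to_csv_kwargs)
    buffer.seek(0)
    return pd.read_csv(buffer)

with_index = round_trip(toy)
without_index = round_trip(toy, index=False)
print(list(with_index.columns))
print(list(without_index.columns))
```

Keeping the index preserves the link between rows of the same article; passing `index=False` gives a cleaner file when that link is not needed.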