In [1]:
import pandas as pd
import requests  # HTTP requests
from bs4 import BeautifulSoup # web scraping

In [2]:
def get_data(url, parse):
    """
    Download `url` and return its HTML parsed by BeautifulSoup.
    `parse` is the parser name, e.g. 'html.parser'.
    """
    raw_html = requests.get(url).content
    return BeautifulSoup(raw_html, parse)

In [3]:
url = 'https://en.wikipedia.org/wiki/Python_(programming_language)'
raw_html = requests.get(url).content

html = BeautifulSoup(raw_html, 'html.parser')

## Let's count the number of words in this page

#### Find tags directly beneath other tags:

In [5]:
part = html.select('div > p')
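
To see what the `>` child combinator does, here is a minimal sketch on a made-up HTML snippet (the snippet and the variable names are assumptions for illustration, not taken from the Wikipedia page). `div > p` matches only `<p>` tags whose direct parent is a `<div>`, while `div p` also matches paragraphs nested deeper:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, only for illustration
sample = """
<div>
  <p>direct child</p>
  <section><p>nested deeper</p></section>
</div>
"""
soup = BeautifulSoup(sample, "html.parser")

# '>' keeps only <p> tags whose direct parent is a <div>
direct = [p.get_text() for p in soup.select("div > p")]

# a plain space matches <p> tags at any depth below a <div>
any_depth = [p.get_text() for p in soup.select("div p")]

print(direct)
print(any_depth)
```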

In [7]:
word_list = []
for p in part:
    words = p.text.split(' ')
    word_list.extend(words)

In [8]:
count_word = {}
for word in word_list:
    if word in count_word:
        count_word[word] += 1
    else:
        count_word[word] = 1

In [9]:
sorted_keys = sorted(count_word, key=count_word.get, reverse=True)

In [10]:
print(type(sorted_keys))
print(sorted_keys[:10])

<class 'list'>
['and', 'the', 'a', 'Python', 'of', 'to', 'is', 'in', 'as', 'for']


In [11]:
for k in sorted_keys[:10]:
    print('{}: {}'.format(k, count_word[k]))


and: 130
the: 127
a: 98
Python: 83
of: 82
to: 81
is: 76
in: 75
as: 53
for: 40
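
The counting loop and the sort above can be condensed with `collections.Counter` from the standard library. A minimal sketch, using a short made-up sentence instead of the scraped paragraphs (`sample_text` and `sample_counts` are assumptions for illustration):

```python
from collections import Counter

# Made-up text standing in for the scraped paragraphs
sample_text = "the cat and the dog and the bird"

# same split(' ') as in the cells above, so the counting rule is identical
sample_counts = Counter(sample_text.split(' '))

# most_common(n) returns the n most frequent (word, count) pairs
for word, count in sample_counts.most_common(2):
    print('{}: {}'.format(word, count))
```

`Counter` is a `dict` subclass, so lookups such as `sample_counts['the']` work the same way as with `count_word`.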


## Work class

* Go to `https://www.standvirtual.com` and select the `Lotus` car
* Now get the price, description, city, year and km of each car into a dataframe
* Finally, save that dataframe as a CSV file; we will need it later on

In [12]:
requests.get('http://www.cats.com')

<Response [200]>

In [28]:
html = get_data('https://www.standvirtual.com/carros/lotus/?search%5Bfilter_enum_damaged%5D=0&search%5Bnew_used%5D=all', 'html.parser')

#### Go to Stack Overflow, Google or whatever! You will have to search to understand how to get this, and there are many ways to do it

In [29]:
price = html.find_all("div", class_="offer-item__price")

Ok, so we got a BeautifulSoup tag. What the hell do I do with it? Simple: you read the documentation or try the shortcut on Stack Overflow. Whatever floats your boat, as long as you get used to it, because this is web scraping: it's painful, tedious and boring. Eventually it comes with great power.

In [17]:
price[0]

<div class="offer-item__price">
<div class="offer-price ds-price-block">
<div class="price-wrapper-listing">
<div data-autoload="1" data-class="" data-props='{"indicator":"none","url":"https:\/\/www.standvirtual.com\/anuncio\/lotus-evora-launch-edition-ID8OZBlL.html","adId":8082561041,"showTooltip":true,"showLabel":false}' data-test="" data-widget="PriceEvaluation/Display">
</div>
<script type="text/javascript">
    try {
        window.autoStartWidgets && window.autoStartWidgets()
    } catch (error) {
        window.newrelic && window.newrelic.addPageAction("Failed to start widget: " + "PriceEvaluation/Display");
    }
    </script>
<span class="offer-price__number ds-price-number">
<span>55 500</span>
<span class="offer-price__currency ds-price-currency" data-type="price_currency_1">EUR</span>
</span>
</div>
<span class="offer-price__details ds-price-complement" data-type="price_negotiable">
                                                    Negociável                              

In [18]:
price[0].get_text()

'\n\n\n\n\n\n\n55 500\nEUR\n\n\n\n                                                    Negociável                                                                                                                            \n\n'

Oh splendid! Now we have a horrible, horrible string full of stuff we don't need, ready to make our next 10 minutes a nightmare! From here on it's more about string manipulation than about searching.

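Being creative here:
- The first `strip` removes the leading and trailing `\n` characters
- Then I split on two consecutive spaces and keep the first piece, which preserves the beautiful single space inside our number (`55 500`)
- `replace` then removes that single space, and the final split on `\n` drops the `EUR` that follows the number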

In [19]:
int(price[0].get_text().strip('\n\n').split('  ')[0].replace(' ', '').split('\n')[0])

55500
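
A regular expression is another way to pull the number out, and it is a bit less sensitive to how many newlines or spaces surround it. A minimal sketch, assuming the price is the first run of digits and single spaces in the text (`parse_price` is a hypothetical helper, not part of BeautifulSoup, and `sample` is a shortened copy of the string above):

```python
import re

def parse_price(text):
    """Return the first number found in text as an int, or None if there is none."""
    # a digit followed by any mix of digits and spaces, e.g. '55 500'
    match = re.search(r'\d[\d ]*', text)
    if match is None:
        return None
    # drop the thousands separator before converting
    return int(match.group().replace(' ', ''))

sample = '\n\n\n55 500\nEUR\n\n\n      Negociável      \n\n'
print(parse_price(sample))
```

Returning `None` when nothing matches lets the caller decide what to do with a listing that has no price, instead of failing inside `int()`.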

Sweet mammaaaa! Houston, we have a price! Now comes the cool part: do what machines are meant for, loop it!

In [20]:
len(price)

8

In [21]:
prices = []
for p in price:
    text = p.get_text()
    # keep only the digits of the price, e.g. '55 500' -> '55500'
    _append = text.strip('\n').split('  ')[0].replace(' ', '').split('\n')[0]
    prices.append(int(_append))

In [22]:
prices

[55500, 27500, 34900, 31000, 38000, 79900, 20000, 43000]

And there we go! Prices. Notice the architecture is pretty bad: I didn't match the prices with anything, so all I know is that the 55500 corresponds to the first car. It's not a very robust architecture, but it serves the purpose of showing some of the problems you might encounter when scraping the webbbzzz.

Now it's your time, Power Rangers! Go and get us the rest of the info.

In [23]:
df = pd.DataFrame(prices, columns=['price'])
df.to_csv('example_dataframe.csv', index=False)

## Suggestion of a possible solution

In [24]:
# get the different divs that describe each car
html = get_data("https://www.standvirtual.com/carros/lotus/", "html.parser")

items = html.find_all("div", class_=["offer-item__content", "ds-details-container"])
my_cars = {}

for index in range(len(items)):
    # find_all returns an empty list when nothing matches. The class
    # "fsdjfnsdjfsad" is made up on purpose (obviously no class has that
    # name), so the [0] below raises an IndexError. We catch it and
    # fall back to a price of 0.
    try:
        price = items[index].find_all(class_="fsdjfnsdjfsad")[0].span.get_text()
    except IndexError:
        price = 0
    my_cars["car_{}".format(index + 1)] = {
        "price": price,
        "mileage": items[index].find_all(attrs={"data-code": "mileage"})[0].span.get_text(),
        "title": items[index].find_all(class_="offer-title__link")[0].get_text().replace("  ", "").strip()
    }

IndexError: list index out of range

In [25]:
my_cars

{}
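
The traceback above cannot come from the `price` lookup, since that one sits inside the `try`. It has to come from one of the `[0]` lookups for `mileage` or `title`, which raise `IndexError` as soon as a matched `div` lacks that element, and the exception aborts the loop before anything is stored in `my_cars`. One way around it is sketched below, assuming a small helper built on `find()`, which returns `None` instead of raising. `first_text` is a hypothetical helper and the HTML is made up for illustration; it only reuses class names seen in the cells above.

```python
from bs4 import BeautifulSoup

def first_text(tag, default=None, **find_kwargs):
    """Text of the first descendant matching find_kwargs, or default."""
    found = tag.find(**find_kwargs)  # None when nothing matches
    if found is None:
        return default
    return found.get_text(strip=True)

# Hypothetical listing markup, only for illustration
sample = """
<div class="offer-item__content">
  <a class="offer-title__link"> Lotus Evora </a>
  <div class="offer-item__price"><span>55 500</span></div>
</div>
<div class="offer-item__content">
  <a class="offer-title__link"> Lotus Elise </a>
</div>
"""
soup = BeautifulSoup(sample, "html.parser")

# one dict per listing keeps every field attached to its own car
cars = []
for item in soup.find_all("div", class_="offer-item__content"):
    cars.append({
        "title": first_text(item, class_="offer-title__link"),
        "price": first_text(item, class_="offer-item__price"),
    })

print(cars)
```

A list of dicts like `cars` can be passed straight to `pd.DataFrame(cars)`, which keeps each price on the same row as its title and leaves a missing value where a listing had no price.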