In [4]:
from bs4 import BeautifulSoup  # BeautifulSoup module
import urllib.request, urllib.parse, urllib.error
import ssl
import json

# Spidering

The idea is to gather data from the web. First we need to pick a website (and its [URL](https://psdeals.net/tr-store)) to extract the data from. We want the price of a game in the Turkey store. To start, we'll use [this website](https://psdeals.net/tr-store) to build the data.

In this first step, we will analyze how the website is constructed and how we can obtain mainly the price and the name of the game that we search for.


## Step 1: The URL

Let's start with the search. We want a tool that searches for a specific term on the website. Suppose we are interested in the game God of War. The main page looks like this:

<img src="../pics/main page.png" alt="Main Page" width="500"/>

After we run a search, this is the result. Now we need to look at the structure of the URL that the search generated.

<img src="../pics/search-page.png" alt="Search Page" width="500"/>

This means that the URL has a standard search path with parameters added. The base URL is something like `https://psdeals.net/tr-store/search?search_query=`, and the words of the search term are appended to it, separated by a "+" symbol.

So, with that in mind...


In [10]:
# let's create a function to generate URLs of the games that we want to search
def search_games():
    """
    Generates search URLs for the given items.

    Args: 
        None. The games' names are provided by the user through the console.

    Returns:
        List[str]: A list of URLs to search for the specified games.
    """
    items = []
    game = input("Insert a game (or press enter to finish): ")
    while game != "":
        items.append(game)
        game = input("Insert a game (or press enter to finish): ")

    # Generating URLs to search :D
    base_url = "https://psdeals.net/tr-store/search?search_query="
    urls = [base_url + i.replace(" ", "+") for i in items]

    print("You asked for: {}".format(", ".join(items)))
    return urls

urls = search_games()
print(urls)



You asked for: god of war, sekiro, red dead redemption 2
['https://psdeals.net/tr-store/search?search_query=god+of+war', 'https://psdeals.net/tr-store/search?search_query=sekiro', 'https://psdeals.net/tr-store/search?search_query=red+dead+redemption+2']
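Replacing spaces by hand works for simple titles, but a name containing characters such as `&` or `:` would also need escaping. Here is a minimal sketch, assuming the same `search_query` parameter seen above. `build_search_urls` is a hypothetical helper name (not part of the notebook) that takes the names as an argument instead of reading them from the console:

```python
from urllib.parse import urlencode

BASE_URL = "https://psdeals.net/tr-store/search"

def build_search_urls(games):
    """Return one search URL per game name (hypothetical helper)."""
    # urlencode escapes reserved characters and encodes spaces as "+"
    return [f"{BASE_URL}?{urlencode({'search_query': game})}" for game in games]

print(build_search_urls(["god of war", "ratchet & clank"]))
```

For plain titles this should produce the same URLs as the function above; the difference only shows up with special characters.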


## Step 2: The Soup

Now that we have the URLs, we can extract the data from the pages.

For this step, we will use the standard-library `urllib.request` module to fetch the pages and the `BeautifulSoup` library to parse the HTML.

First, we need to install the required libraries. If you don't have them, well... go on! Install them, and then I'll show you the rest, my dear comrade.
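If you are not sure whether the libraries are already there, a quick check like the following may help. It is only a sketch: `missing_packages` is a hypothetical helper, and it assumes the import names `bs4` (BeautifulSoup) and `html5lib` (an optional parser). Anything it reports can usually be installed with `pip install beautifulsoup4 html5lib`.

```python
import importlib.util

def missing_packages(names):
    """Return the names that cannot be imported in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# "bs4" is the import name of BeautifulSoup; "html5lib" is an optional parser
print(missing_packages(["bs4", "html5lib"]))
```

An empty list means everything needed for this step is importable.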

With that being said, let us cook a beautiful Soup!

In [108]:
# Start with only 1 URL
url = urls[0]

# Adding headers to the request: the page could think I am a bot, so let's suppose we're not... Aren't we?
req = urllib.request.Request(url, headers={
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive'
})

# Now we have the html!
html = urllib.request.urlopen(req).read()

# Now let's cook the soup!
soup = BeautifulSoup(html, "html.parser")

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


In [107]:
soup.prettify

<bound method Tag.prettify of :3�}�F�������_-5�3�!{,�:}Ȧ���ʥo�G��h
[... remaining lines of unreadable binary output omitted ...]

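Gotta say, this looks awful... Why is that, you may be wondering? Yes, it is exactly what you were thinking: **the encoding is not the right one!!** But which one should we use? After all, we are working with a Turkish website, so it could be hard, couldn't it? One likely culprit is the `Accept-Encoding` header we sent: it invites the server to compress the response, and `urllib` does not decompress the body for us.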

To solve this problem, people smarter than me have created a solution...
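Before swapping parsers, it may be worth checking whether the body is simply compressed. The sketch below is not the notebook's own method. It assumes the server reports its compression in the `Content-Encoding` response header, and `decode_body` is a hypothetical helper built only on the standard library:

```python
import gzip
import zlib

def decode_body(raw, content_encoding="", charset="utf-8"):
    """Decompress a response body according to Content-Encoding, then decode it to text."""
    encoding = (content_encoding or "").lower()
    if encoding == "gzip":
        raw = gzip.decompress(raw)
    elif encoding == "deflate":
        try:
            raw = zlib.decompress(raw)  # zlib-wrapped deflate
        except zlib.error:
            raw = zlib.decompress(raw, -zlib.MAX_WBITS)  # raw deflate stream
    elif encoding not in ("", "identity"):
        # e.g. "br" (Brotli), which the standard library cannot decompress
        raise ValueError(f"unsupported Content-Encoding: {encoding}")
    return raw.decode(charset, errors="replace")

# Offline check with a locally compressed page
sample = gzip.compress("<html><body>God of War</body></html>".encode("utf-8"))
print(decode_body(sample, "gzip"))
```

With a real response this would be called as `decode_body(resp.read(), resp.headers.get("Content-Encoding"))`, where `resp` is the object returned by `urllib.request.urlopen(req)`. Since Brotli (`br`) is not handled here, removing `br` from the `Accept-Encoding` request header, or dropping that header entirely, is probably the simplest way to get readable HTML.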

In [120]:

html = urllib.request.urlopen(req).read()

In [None]:
html.

In [None]:

soup = BeautifulSoup(html, "html5")

In [115]:
soup.prettify

<bound method Tag.prettify of :3�}�F���������N���0n�`
[... remaining lines of unreadable binary output omitted ...]