# Analysis of the data
In this section, we present our results. First, we download our data:

In [4]:
from download import download_all
from req import RequestHelper
req = RequestHelper(disable_debug_print=True)

In [None]:
# Instantiating the RequestHelper in the previous cell and passing it here
# enables caching of the requests. If we rerun this cell with a higher
# number of calls, requests that were already made are not repeated.
# To see this, run the cell twice with the same batch size: the first run
# takes time to evaluate, while the second is almost instant.
data = download_all(40, req=req)

 - getting Heureka lists of eshops:


100%|██████████| 3/3 [00:04<00:00,  1.65s/it]


 - getting Heureka page details:


100%|██████████| 40/40 [00:21<00:00,  2.54it/s]


Data from Heureka finished in 26.139s.
 - getting data from eshop pages in 30 threads:


100%|██████████| 40/40 [00:04<00:00,  9.07it/s]


Data from Eshop pages finished in 4.467s.
 - getting data from Instagram:


 28%|██▊       | 5/18 [00:02<00:07,  1.72it/s]
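
The caching described in the download cell's comment can be illustrated with a minimal sketch. The `CachingFetcher` class, its `fetch_fn` argument, and `fake_fetch` are hypothetical names invented here; this is not the actual `RequestHelper` implementation, only an illustration of why reusing one helper instance makes a repeated run cheap.

```python
from typing import Callable, Dict


class CachingFetcher:
    """Hypothetical stand-in for a request helper that memoizes responses by URL."""

    def __init__(self, fetch_fn: Callable[[str], str]) -> None:
        self._fetch_fn = fetch_fn          # function that performs the real request
        self._cache: Dict[str, str] = {}   # URL -> response body
        self.network_calls = 0             # how many times fetch_fn was invoked

    def get(self, url: str) -> str:
        # Only call the underlying fetch function on a cache miss.
        if url not in self._cache:
            self.network_calls += 1
            self._cache[url] = self._fetch_fn(url)
        return self._cache[url]


def fake_fetch(url: str) -> str:
    # Placeholder for an HTTP request; returns a deterministic string.
    return f"<html>{url}</html>"


fetcher = CachingFetcher(fake_fetch)
urls = [f"https://example.com/shop/{i}" for i in range(3)]

first_run = [fetcher.get(u) for u in urls]    # three cache misses
second_run = [fetcher.get(u) for u in urls]   # served from the cache
```

Because the cache lives on the instance, it only helps when the same object is passed to every call, which is presumably why `req` is created once in its own cell.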

We inspect the first few rows of the data:

In [None]:
data.head()

In [None]:
data['reviews_positive_ratio'] = data['reviews_positive_count']/data['reviews']
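
If some shops have zero reviews, the plain division above yields `NaN` or `inf` depending on the numerator. The sketch below shows one way to make that explicit by masking zero denominators first. The toy values are invented for illustration; only the column names follow the notebook.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration; column names follow the notebook.
toy = pd.DataFrame({
    "reviews": [200, 0, 50],
    "reviews_positive_count": [180, 0, 45],
})

# Replace zero denominators with NaN so the ratio is NaN rather than inf or 0/0.
denominator = toy["reviews"].replace(0, np.nan)
toy["reviews_positive_ratio"] = toy["reviews_positive_count"] / denominator
```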

In [None]:
from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
plt.style.use('ggplot')

In [None]:
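# Restrict to numeric columns in case `data` also holds text columns,
# which recent pandas versions refuse to correlate.
data.select_dtypes('number').corr().style.background_gradient(cmap='coolwarm', axis=None)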

In [None]:
sns.pairplot(data)
plt.show()

Next, we plot _reviews_ against _instagram\_followers_:

In [None]:
f, ax = plt.subplots(num=None, figsize=(15, 7))
data.plot.scatter('reviews', 'instagram_followers', ax=ax,)
plt.show()

However, there is not much to see. Does log-scaling both axes help?

In [None]:
f, ax = plt.subplots(num=None, figsize=(15, 7))
data.plot.scatter('reviews', 'instagram_followers', ax=ax,)
ax.set_yscale('log')
ax.set_xscale('log')
plt.show()

Even on log scales, there appears to be no stable relationship. What about the correlations?

In [None]:
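# Copy the two columns so that adding new ones does not write into a view of `data`.
sub_data = data[['reviews', 'instagram_followers']].dropna().copy()
# Take logs of the filtered rows. Note that np.log maps zeros to -inf,
# which would distort the affected correlations.
sub_data['log(reviews)'] = np.log(sub_data['reviews'])
sub_data['log(instagram_followers)'] = np.log(sub_data['instagram_followers'])
sub_data.corr().style.background_gradient(cmap='coolwarm', axis=None)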

The correlation is only 5.4% for the untransformed variables, and the best we can achieve is 18%, for _reviews_ vs _log(instagram\_followers)_.
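
To put these numbers in context, the sketch below compares the Pearson correlation before and after a log transform with the Spearman rank correlation, which does not change under monotone transforms such as the logarithm. The data is synthetic (independent, seeded log-normal draws) and only illustrates the mechanics; it says nothing about the Heureka or Instagram data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic, strictly positive, heavy-tailed data (an assumption for illustration only).
demo = pd.DataFrame({
    "reviews": rng.lognormal(mean=5.0, sigma=1.5, size=500),
    "instagram_followers": rng.lognormal(mean=7.0, sigma=2.0, size=500),
})
log_demo = np.log(demo)

pearson_raw = demo.corr(method="pearson").loc["reviews", "instagram_followers"]
pearson_log = log_demo.corr(method="pearson").loc["reviews", "instagram_followers"]

# Spearman works on ranks, so the log transform leaves it unchanged.
spearman_raw = demo.corr(method="spearman").loc["reviews", "instagram_followers"]
spearman_log = log_demo.corr(method="spearman").loc["reviews", "instagram_followers"]

print(f"Pearson raw: {pearson_raw:.3f}, Pearson log: {pearson_log:.3f}")
print(f"Spearman raw: {spearman_raw:.3f}, Spearman log: {spearman_log:.3f}")
```

If the Spearman coefficient on the real data were also small, that would suggest that no monotone rescaling of the two variables, logarithmic or otherwise, would reveal a strong relationship.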

We can also try other variables:

In [None]:
f, ax = plt.subplots(num=2, figsize=(15, 7))
data.plot.scatter('reviews', 'instagram_posts_average_like', ax=ax,)
ax.set_yscale('log')
ax.set_xscale('log')
plt.show()

In [None]:
f, ax = plt.subplots(num=2, figsize=(15, 7))
data.plot.scatter('reviews_positive_ratio', 'instagram_posts_average_like', ax=ax,)
ax.set_yscale('log')
plt.show()

However, the story is similar: there is no clear pattern in the data.