# [RQ4]

In [30]:
from steam_analysis import count_languages,\
                           sort_count,\
                           languages_pie,\
                           print_top_languages,\
                           filter_by_language

## Top Languages
First, we want to check which languages most of the reviews are written in. We can do this by grouping the dataset by its `language` column, counting how many reviews fall in each group with the `size()` method, and finally sorting the counts and slicing the top 3.
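The helpers imported from `steam_analysis` are not shown here, so the following is only a minimal sketch of the group, count, sort and slice steps on toy data. The DataFrame `toy` and its values are hypothetical.

```python
import pandas as pd

# Toy data standing in for the real reviews (hypothetical values).
toy = pd.DataFrame({
    "review_id": list(range(1, 11)),
    "language": ["english"] * 4 + ["schinese"] * 3 + ["russian"] * 2 + ["german"],
})

# size() returns the number of rows (reviews) in each language group.
counts = toy.groupby("language").size()

# Sort by count, largest first, and keep the three most common languages.
top3 = counts.sort_values(ascending=False).head(3)
print(top3)
```

Note that `size()` counts rows per group; the number of distinct languages would instead come from `toy["language"].nunique()`.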

The whole dataset fits in memory if we load only the columns we need for our analysis.

In [None]:
df = pd.read_csv("data/steam_reviews.csv", 
                 usecols = ['review_id', 'language', 'votes_funny', 'votes_helpful'], 
                 header = 'infer')
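As a rough illustration of why `usecols` helps, the sketch below reads a tiny in-memory CSV twice, once in full and once with a column subset, and compares the memory footprint. The CSV rows and the extra `review` text column are hypothetical stand-ins for the real file.

```python
import io
import pandas as pd

# A tiny in-memory CSV standing in for data/steam_reviews.csv (hypothetical rows).
csv_text = (
    "review_id,language,review,votes_funny,votes_helpful\n"
    "1,english,Great game with a long story,0,3\n"
    "2,russian,Not bad at all,1,0\n"
)

full = pd.read_csv(io.StringIO(csv_text))
slim = pd.read_csv(io.StringIO(csv_text),
                   usecols=["review_id", "language", "votes_funny", "votes_helpful"])

# deep=True also counts the bytes held by string columns.
full_bytes = full.memory_usage(deep=True).sum()
slim_bytes = slim.memory_usage(deep=True).sum()
print(full_bytes, slim_bytes)
```

On the real dataset the saving should be far larger, since free-text columns usually dominate memory usage.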

## What are the most common languages?

In [None]:
languages_pie(df['language'])

top_languages = sort_count(count_languages(df))

print_top_languages(top_languages)

### Now let's filter the dataset so it only includes reviews in these languages

How did other users consider these reviews: Funny or Helpful?

In [None]:
filtered_df = filter_by_language(df, [language for language, _ in top_languages])
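The body of `filter_by_language` is not shown; one plausible way to do the same filtering is `Series.isin`. The sketch below uses toy data (hypothetical values) and then compares the share of filtered reviews with at least one 'Funny' vote against the share with at least one 'Helpful' vote.

```python
import pandas as pd

# Toy data standing in for the real reviews (hypothetical values).
toy = pd.DataFrame({
    "language": ["english", "schinese", "russian", "german", "english"],
    "votes_funny": [0, 2, 0, 5, 1],
    "votes_helpful": [3, 0, 1, 4, 0],
})
top = ["english", "schinese", "russian"]

# Keep only the rows whose language is one of the top languages.
filtered = toy[toy["language"].isin(top)]

# Share of the filtered reviews with at least one vote of each kind.
share_funny = (filtered["votes_funny"] > 0).mean()
share_helpful = (filtered["votes_helpful"] > 0).mean()
print(share_funny, share_helpful)
```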

# [RQ5]

# [RQ6]

# [RQ7]

In [None]:
from steam_analysis import compute_prob, format_prob

Let's load only the columns we need, so we can keep memory usage to a minimum.

In [5]:
df = pd.read_csv("data/steam_reviews.csv", 
                 usecols = ['review_id', 'votes_funny', 'weighted_vote_score'], 
                 header = 'infer')

## Weighted Vote Score

We want to know what's the probability of a review having a *WVS* of at least 0.5.

In order to do so, let's take a first look into how these scores are distributed.

In [None]:
df['weighted_vote_score'].describe()

The distribution has a mean of about 0.16, and about 3/4 of the scores are below 0.5.

This tells us that we should expect a low figure for $\mathcal{P}(score \geq 0.5)$.

In order to get a better grasp of this data, we should plot a histogram of the values.

In [None]:
n_bins = 20

plt.hist(df['weighted_vote_score'], bins = n_bins)

The vast majority of reviews have a Weighted Vote Score of exactly 0, so instead of working with the entire dataset, let's focus only on the reviews that have a non-zero score.

In [None]:
wvs = df[df['weighted_vote_score'] > 0]

print(wvs['weighted_vote_score'].describe())

print(wvs['weighted_vote_score'].median())

Only about 1/3 of the reviews have a non-zero score. Among these, the mean is very close to 0.5 and the median is about 0.52, so roughly half of the scores fall on each side of that value. As the updated histogram below shows, the scores look approximately normally distributed, although the right tail is heavier than the left one: a review is more likely to score above 0.5 than below it.

In [None]:
plt.hist(wvs['weighted_vote_score'], bins = n_bins)

To estimate $\mathcal{P}(score \geq 0.5)$ we can sum up the number of elements contained in each bin in the interval $[0.5, 1.0]$ and then divide the value we get by the total number of binned elements.

This is easily done by operating directly on the dataset.

In [None]:
prob_wvs = compute_prob(wvs, 'weighted_vote_score', 0.5)

format_prob(prob_wvs)
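The implementation of `compute_prob` is not shown. As a minimal sketch, assuming it returns the share of rows whose column value is at least the threshold, the snippet below checks on toy data that the direct estimate matches the bin-counting estimate described above. The function `prob_at_least` and the scores are hypothetical.

```python
import numpy as np
import pandas as pd

def prob_at_least(frame, column, threshold):
    """Hypothetical stand-in for compute_prob: share of rows with column >= threshold."""
    return (frame[column] >= threshold).mean()

# Toy scores (hypothetical values).
toy = pd.DataFrame({"weighted_vote_score": [0.10, 0.30, 0.45, 0.55, 0.60, 0.90]})

# Direct estimate from the data.
direct = prob_at_least(toy, "weighted_vote_score", 0.5)

# Same estimate from bin counts; 0.5 must be a bin edge for the two to agree.
edges = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
counts, _ = np.histogram(toy["weighted_vote_score"], bins=edges)
binned = counts[edges[:-1] >= 0.5].sum() / counts.sum()

print(direct, binned)
```

Working on the data directly avoids any dependence on where the bin edges happen to fall.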

Among the reviews with a non-zero score, about 2/3 have a Weighted Vote Score of at least 0.5.

In the original dataset, which also includes the zero scores, the figure would have been only about 1/5 of the reviews.

In [None]:
prob_wvs_orig = compute_prob(df, 'weighted_vote_score', 0.5)

format_prob(prob_wvs_orig)

## Let's take a deeper look into these reviews

We want to study the relationship between a review having a *WVS* greater than or equal to 0.5 and it being rated as 'Funny'.

First, let's compute the probability of a review having $WVS \geq 0.5$ and at least one 'Funny' vote:

$\mathcal{P}(WVS \geq 0.5\: \text{and}\: funny \geq 1)$

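Just like before, we can filter the dataset and then divide the number of reviews that satisfy both conditions by the total number of reviews. We take the non-zero *WVS* subset `wvs` as the total, so that this probability is comparable with `prob_wvs`.

In [None]:
# Each comparison needs its own parentheses: `&` binds more tightly than `>` and `>=`.
wvs_funny = wvs[(wvs['weighted_vote_score'] >= 0.5) & (wvs['votes_funny'] > 0)]

# Joint probability: reviews satisfying both conditions over all reviews in `wvs`.
prob_wvs_funny = len(wvs_funny) / len(wvs)

format_prob(prob_wvs_funny)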

### Are these two events independent?

If the event that a review has $WVS \geq 0.5$ and the event that it has been rated as 'Funny' by at least one user are independent, then we expect

$\mathcal{P}(WVS \geq 0.5\: \text{and}\: funny \geq 1) = \mathcal{P}(WVS \geq 0.5)\cdot\mathcal{P}(funny \geq 1)$

In order to check if this equality holds, let's compute $\mathcal{P}(funny \geq 1)$.

We will compute it on the same subset as before: the reviews with a non-zero *WVS*.

In [None]:
prob_funny = compute_prob(wvs, 'votes_funny', 1)

Let's now compute the difference between the probability of the intersection of these two events, and the product of the probabilities of the two single events.

In [None]:
abs(prob_wvs_funny - prob_wvs * prob_funny)

The difference is too large to be negligible: the two events can't be considered independent.
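The same check can be reproduced end to end on toy data. The sketch below uses hypothetical values and computes the two marginal probabilities, the joint probability and their gap on one common sample, which the comparison needs in order to be meaningful.

```python
import pandas as pd

# Toy reviews, all with a non-zero score (hypothetical values).
toy = pd.DataFrame({
    "weighted_vote_score": [0.2, 0.4, 0.6, 0.7, 0.8, 0.9, 0.3, 0.55],
    "votes_funny":         [0,   0,   1,   2,   0,   3,   0,   1],
})

# Boolean masks for the two events.
high = toy["weighted_vote_score"] >= 0.5
funny = toy["votes_funny"] >= 1

# The mean of a boolean mask is the share of rows where it is True.
p_high = high.mean()
p_funny = funny.mean()
p_joint = (high & funny).mean()

# Under independence the joint probability equals the product of the marginals.
gap = abs(p_joint - p_high * p_funny)
print(p_high, p_funny, p_joint, gap)
```

How large a gap counts as non-negligible depends on the sample size, so with few reviews a small gap can also arise by chance.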