# This notebook is a combination of different notebooks to get to know the dataset

The first step, as always, is to import the required libraries.

In [None]:
# import required libraries

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

Set the plot parameters so that all figures look uniform.

In [None]:
plt.rcParams['figure.figsize'] = 4, 2
plt.rcParams['figure.dpi'] = 150
palette = ['#43948c', '#36a097', '#28aea2', '#1bbbad', '#0dc9b8']
hue_palette = ['#43948c', '#3CB371']

Then we import the dataset

In [None]:
df = pd.read_csv('../data/review_1819_eng.csv')

Let's get a first impression of the dataset.

In [None]:
df.head()

The data consists of a unique review_id, plus a user_id and a business_id that link a review to other reviews of the same business or by the same user.  
More interesting for us is the star rating. Checking the unique values shows that we are dealing with the typical 5-star rating system we would expect from a Yelp dataset.

In [None]:
df.stars.unique()

The next three columns tell us how useful/funny/cool other users found the review.

In [None]:
df.useful.unique()

We can see that these numbers differ greatly, and that at least one review even has a negative number of clicks.
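
Before deciding how to treat such rows, it helps to know how many there are. A minimal sketch on a small made-up frame (the name `toy` and its values are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical stand-in for the `useful` column; the values are invented.
toy = pd.DataFrame({'useful': [0, 3, -1, 12, 0]})

# Boolean mask marking rows with a negative count
negative = toy['useful'] < 0

# Number of affected rows, followed by the rows themselves
n_negative = int(negative.sum())
print(n_negative)
print(toy[negative])
```

On the real data, the same mask applied to `df['useful']` would show whether this is a single odd row or a larger problem.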

To make these numbers easier to work with, we create a new column called clicked, which simply indicates whether a review has been clicked as useful by other users.

In [None]:
df['clicked'] = df['useful'].apply(lambda x: 1 if x >= 1 else 0)
df.sample(5)
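
The same flag can also be built without a row-wise `apply`, by comparing the whole column at once. A small sketch, assuming a made-up frame `toy` (hypothetical values) in place of the real data:

```python
import pandas as pd

# Hypothetical stand-in for the `useful` column; the values are invented.
toy = pd.DataFrame({'useful': [0, 3, -1, 12, 0]})

# Compare the whole column at once, then turn True/False into 1/0
toy['clicked'] = (toy['useful'] >= 1).astype(int)
print(toy)
```

Note that a negative count ends up as 0 (not clicked) in both variants.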

Next comes the text of the review itself, which we will work with later in the cleaning process. For now, let's just add a column with the length of the text to our dataframe, so that we can get a first impression.

In [None]:
# add a new column for the length of the review, to get an impression of the data we're dealing with
df['length'] = df['text'].apply(lambda x: len(x))
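
One thing to keep in mind: `len(x)` raises a `TypeError` if a review text is missing. If that turns out to be a problem, pandas' string accessor is a possible alternative, as it returns a missing value instead of failing. A sketch on invented example texts (the frame `toy` is hypothetical):

```python
import pandas as pd

# Hypothetical review texts, including one missing value
toy = pd.DataFrame({'text': ['Great food!', 'Slow service, cold fries.', None]})

# .str.len() counts characters per row and propagates missing values
toy['length'] = toy['text'].str.len()
print(toy)
```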

For our question, the date of the review, the ids and the number of times a review was clicked no longer matter, so we will drop these columns.

In [None]:
# drop() returns a new DataFrame, so assign the result back to df
df = df.drop(columns=['review_id', 'user_id', 'business_id', 'useful', 'funny', 'cool', 'date', 'year'])

### Let's have a look at the distribution of the star ratings

We start by counting the reviews per star rating.

In [None]:
ax = sns.countplot(data=df, x='stars', palette=palette, zorder=2)
plt.title('Distribution of star ratings')
plt.ylim(0, 1000000)
plt.xlabel('Stars')
plt.ylabel('Count')
plt.bar_label(ax.containers[0], padding=-15, fontsize=8);

We can see that our dataset is strongly biased towards good reviews. More than half of the reviews are 5-star reviews.
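
To put an exact number on such a share instead of reading it off the plot, the relative frequencies can be computed directly. A minimal sketch with invented ratings (the frame `toy` is hypothetical, so the printed shares say nothing about the real dataset):

```python
import pandas as pd

# Hypothetical star ratings; the values are invented.
toy = pd.DataFrame({'stars': [5, 5, 5, 4, 1, 5, 3, 2]})

# Share of each rating, sorted from 1 to 5 stars
share = toy['stars'].value_counts(normalize=True).sort_index()
print(share)
```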

### Let's have a look at the length of the text reviews in relation to the star distribution

In [None]:
ax = sns.barplot(data=df, x='stars', y='length', errorbar=None, palette=palette)
plt.title('Length of reviews by stars')
plt.ylim(0, 800)
plt.xlabel('Stars')
plt.ylabel('Mean of length')
plt.bar_label(ax.containers[0], padding=-15, fontsize=8);

In [None]:
con = ax.containers
print(con)

We can see that reviews with a higher star rating tend to have shorter texts. This makes sense, as a bad review often explains the reasoning behind the bad rating.
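
The bar plot shows the mean, which a few very long reviews can pull upwards. As a cross-check, the mean and the median per rating can be put side by side. A sketch on invented lengths (the frame `toy` is hypothetical):

```python
import pandas as pd

# Hypothetical ratings and review lengths; the values are invented.
toy = pd.DataFrame({
    'stars':  [1, 1, 1, 5, 5, 5],
    'length': [900, 700, 2000, 100, 200, 300],
})

# Mean and median review length per star rating
summary = toy.groupby('stars')['length'].agg(['mean', 'median'])
print(summary)
```

In this toy example the single 2000-character review lifts the 1-star mean well above the median, while both values agree for the 5-star reviews.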

### Finally, let's see if there is a relation between the rating and whether a review was clicked as useful

In [None]:
ax = sns.countplot(data=df, x='stars', hue='clicked', palette=hue_palette, hue_order=[1,0])
plt.title('Clicked reviews by stars')
plt.xlabel('Stars')
plt.ylabel('Number of reviews')
plt.legend(['Clicked', 'Not clicked'])
for p in ax.patches:
    ax.annotate(format(p.get_height()/1000, '.0f')+'K',
                (p.get_x() + p.get_width() / 2., p.get_height()),
                ha = 'center', va = 'center', 
                size=5,
                xytext = (0, -4), 
                textcoords = 'offset points'
                )

We can see that the bad reviews in particular are clicked more often than not, which also seems logical, as people tend to be interested in a good explanation of why a place is considered bad.
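
Because the number of reviews differs so much between ratings, the absolute counts are hard to compare. The share of clicked reviews within each rating can be computed with a row-normalised cross table. A minimal sketch with invented values (the frame `toy` is hypothetical):

```python
import pandas as pd

# Hypothetical ratings and click flags; the values are invented.
toy = pd.DataFrame({
    'stars':   [1, 1, 1, 5, 5, 5, 5],
    'clicked': [1, 1, 0, 0, 0, 0, 1],
})

# normalize='index' makes every row (star rating) sum to 1
rates = pd.crosstab(toy['stars'], toy['clicked'], normalize='index')
print(rates)
```

The column labelled 1 then holds the fraction of clicked reviews per star rating.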

## Okay, now that we got a first impression of the star rating, let's have a look at usefulness, as this will finally be the target of our model

Have a look at the number of reviews that were clicked as useful

In [None]:
# order (not hue_order) puts clicked=1 first, so the tick labels set below match the bars
ax = sns.countplot(data=df, x='clicked', palette=hue_palette, order=[1, 0])
plt.title('Number of clicked reviews')
plt.ylim(0, 1200000)
plt.xlabel('')
plt.ylabel('Count')
plt.xticks([0, 1], labels=['Clicked', 'Not clicked'])
plt.bar_label(ax.containers[0], fmt='%.0f', padding=-15);

We can see that more reviews have not been clicked than have been clicked, but the two groups are not too far apart in size.

Now we can have a look at whether the length of the review makes a difference.

In [None]:
ax = sns.barplot(data=df, x='clicked', y='length', errorbar=None, palette=hue_palette, order=[1,0])
plt.title("Mean length of reviews by 'usefulness'")
plt.ylim(0, 800)
plt.xlabel("'Usefulness'")
plt.ylabel('Mean of length')
plt.xticks([0, 1], labels=['Clicked', 'Not clicked'])
plt.bar_label(ax.containers[0], fmt='%.0f', padding=-15);

We can clearly see that reviews clicked as useful are, on average, longer than reviews that were not clicked.