# Analysis of Ratings & Reviews on NIVEA.de

## Notebook 1: First Data Exploration

### Reading the data

In [43]:
import pandas as pd

reviews = pd.read_csv("reviews.csv", sep=';')

In [44]:
reviews.head(3) # Example of the first 3 rows

,id,date,product_id,product_url,product_title,review_title,review_content,review_score,display_name
0,170031864,2012-02-03T00:00:00Z,80105.0,https://www.nivea.de/produkte/nivea-creme-4005...,NIVEA Creme Regenbogen Edition,Kundenbewertung,Meine Mutter hat zeitlebens diese Nivea-Creme ...,5,Heike
1,170031865,2012-05-31T00:00:00Z,80105.0,https://www.nivea.de/produkte/nivea-creme-4005...,NIVEA Creme Regenbogen Edition,Kundenbewertung,Wirklich unübertroffen! Mein Freund und eine g...,5,Tina
2,170031866,2011-09-27T00:00:00Z,80105.0,https://www.nivea.de/produkte/nivea-creme-4005...,NIVEA Creme Regenbogen Edition,Kundenbewertung,Diese Creme begleitet mich seit meiner Kindhei...,5,Michael N.


### Adjustment of the data types

#### Checking which data types were assigned automatically

In [45]:
reviews.dtypes

id                  int64
date               object
product_id        float64
product_url        object
product_title      object
review_title       object
review_content     object
review_score        int64
display_name       object
dtype: object

#### Adjusting the data types

In [46]:
reviews['id'] = reviews.id.astype('category')
reviews['product_id'] = reviews.product_id.astype('category')
reviews['product_url'] = reviews.product_url.astype('category')
reviews['product_title'] = reviews.product_title.astype('category')
reviews['review_title'] = reviews.review_title.astype(str)
reviews['review_content'] = reviews.review_content.astype(str)
reviews['display_name'] = reviews.display_name.astype(str)
reviews.dtypes

id                category
date                object
product_id        category
product_url       category
product_title     category
review_title        object
review_content      object
review_score         int64
display_name        object
dtype: object

#### Checking the date format

In [47]:
reviews.date

0         2012-02-03T00:00:00Z
1         2012-05-31T00:00:00Z
2         2011-09-27T00:00:00Z
3         2011-07-28T00:00:00Z
4         2013-01-08T00:00:00Z
                  ...         
105807    2021-11-27T11:56:57Z
105808    2021-11-27T12:27:40Z
105809    2021-11-27T14:54:04Z
105810    2021-11-27T14:58:30Z
105811    2021-11-27T16:39:06Z
Name: date, Length: 105812, dtype: object

#### Adjusting the date format

In [48]:
reviews['date'] = pd.to_datetime(reviews['date'], utc=True) # Parse the ISO 8601 timestamps; the trailing "Z" denotes UTC
reviews.date

0        2012-02-03 00:00:00+00:00
1        2012-05-31 00:00:00+00:00
2        2011-09-27 00:00:00+00:00
3        2011-07-28 00:00:00+00:00
4        2013-01-08 00:00:00+00:00
                    ...           
105807   2021-11-27 11:56:57+00:00
105808   2021-11-27 12:27:40+00:00
105809   2021-11-27 14:54:04+00:00
105810   2021-11-27 14:58:30+00:00
105811   2021-11-27 16:39:06+00:00
Name: date, Length: 105812, dtype: datetime64[ns, UTC]
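
The conversion can be checked in isolation on a few values. The following is a minimal sketch that uses two timestamp strings copied from the output above; the variable names `raw` and `parsed` are only illustrative.

```python
import pandas as pd

# Two sample strings copied from the 'date' column shown above
raw = pd.Series(["2012-02-03T00:00:00Z", "2021-11-27T16:39:06Z"])

# utc=True interprets the trailing "Z" as UTC and returns timezone-aware values
parsed = pd.to_datetime(raw, utc=True)

print(parsed.dtype)             # a timezone-aware datetime dtype
print(parsed.dt.year.tolist())  # the year component of each timestamp
```

Once the column is timezone-aware, the `.dt` accessor gives access to components such as year, month, or weekday.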

### Data exploration (univariate)

#### How many different products does the data set contain?

In [49]:
len(pd.unique(reviews['product_id']))

898

In [50]:
len(pd.unique(reviews['product_title']))

807

There are 898 product IDs but only 807 product names, so some product names are associated with several product IDs.
* Explanation: When a product is relaunched, it gets a new ID.
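
One way to find the affected product names is to count the distinct IDs per name. The following is a minimal sketch on a hypothetical miniature table; the product names and IDs in `toy` are made up for illustration and are not taken from `reviews.csv`.

```python
import pandas as pd

# Hypothetical miniature of the reviews table (values are invented)
toy = pd.DataFrame({
    "product_title": ["Creme", "Creme", "Creme", "Deo", "Deo"],
    "product_id": [80105.0, 80105.0, 90211.0, 70001.0, 70001.0],
})

# Count the distinct product IDs behind each product name
ids_per_title = toy.groupby("product_title")["product_id"].nunique()

# Keep only the names that map to more than one ID
relaunched = ids_per_title[ids_per_title > 1]
print(relaunched)
```

Note that `nunique()` ignores missing values by default, whereas `len(pd.unique(...))` counts `NaN` as a value of its own, so the two approaches can differ by one if a column contains missing entries.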

#### Which products occur most frequently in the dataset (are most frequently rated)?

In [51]:
reviews['product_title'].value_counts().nlargest(3)

NIVEA Eau de Toilette                    2694
MagnesiumDry Fresh Floral Deo Roll-On    1488
NIVEA SUN Eau de Toilette                1330
Name: product_title, dtype: int64

#### What is the average rating across all products?

In [52]:
reviews['review_score'].describe()

count    105812.000000
mean          4.637659
std           0.736914
min           1.000000
25%           4.000000
50%           5.000000
75%           5.000000
max           5.000000
Name: review_score, dtype: float64

#### Number of reviews by star rating (5-star rating system)

In [53]:
reviews['review_score'].value_counts().sort_index() 

1     1474
2     1223
3     3993
4    20789
5    78333
Name: review_score, dtype: int64

The distribution is left-skewed: about 74% of all reviews give 5 stars and about 94% give 4 or 5 stars, while only about 2.5% give 1 or 2 stars. This class imbalance may be a challenge for the classification model that we want to train later.
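
The skew can also be quantified. The following is a minimal sketch that rebuilds the score column from the counts shown above, so it runs without `reviews.csv`; the variable names are only illustrative.

```python
import pandas as pd

# Star counts copied from the value_counts() output above
counts = pd.Series({1: 1474, 2: 1223, 3: 3993, 4: 20789, 5: 78333})

# Relative share of each star rating
shares = counts / counts.sum()
print(shares.round(3))

# Rebuild one entry per review, then compute the sample skewness
scores = pd.Series(counts.index.repeat(counts.values))
print(scores.skew())  # a negative value indicates a left-skewed distribution
```

On the full data frame, the same figures should be available directly via `reviews['review_score'].value_counts(normalize=True)` and `reviews['review_score'].skew()`.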

## Practice Tasks

#### What is the average star rating of the 10 most popular products?


In [54]:
# Count the reviews and compute the mean score per product, then keep the 10 most reviewed products
reviews.groupby('product_title').agg({'id': 'count', 'review_score': 'mean'}).nlargest(10, 'id')

product_title,id,review_score
NIVEA Eau de Toilette,2694,4.41314
MagnesiumDry Fresh Floral Deo Roll-On,1488,4.678763
NIVEA SUN Eau de Toilette,1330,4.502256
Natural Balance Aloe Vera Body Lotion,1274,4.77708
Sanfte Rasur Rasierer mit Wechselklingen,1256,4.394904
Sensitive All-In-One Balsam Gesicht & 3-Tage Bart,1031,4.70999
Rosenblüte Gel-Creme Tagespflege,1016,4.75689
Reichhaltige Body Milk,1000,4.61
Haarmilch Regeneration Mildes Shampoo Feines Haar,967,4.598759
Cellular Luminous630® Anti Pigmentflecken Intensiv Serum,964,4.419087
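
If a single figure for the ten products is wanted, the per-product means can be combined. The following is a minimal sketch using the counts and means copied from the table above; the column names `n_reviews` and `mean_score` are chosen here for illustration. Whether a weighted or an unweighted average is the right answer depends on how the question is read.

```python
import pandas as pd

# Review counts and mean scores copied from the top-10 table above
top10 = pd.DataFrame({
    "n_reviews": [2694, 1488, 1330, 1274, 1256, 1031, 1016, 1000, 967, 964],
    "mean_score": [4.41314, 4.678763, 4.502256, 4.77708, 4.394904,
                   4.70999, 4.75689, 4.61, 4.598759, 4.419087],
})

# Weighted average: each product's mean counts in proportion to its number of reviews
weighted = (top10["n_reviews"] * top10["mean_score"]).sum() / top10["n_reviews"].sum()

# Unweighted average: every product counts equally
unweighted = top10["mean_score"].mean()

print(round(weighted, 3), round(unweighted, 3))
```

The weighted variant equals the mean over all individual reviews of these ten products, up to the rounding of the means shown in the table.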


#### Please remove all ratings and reviews that were submitted before 1 January 2019.

In [55]:
reviews_filtered = reviews[reviews['date'] >= "2019-01-01"] # Keep only reviews submitted on or after 1 January 2019
reviews_filtered.date.min() # Check the earliest date that remains after filtering

Timestamp('2019-01-01 00:00:00+0000', tz='UTC')
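
The cut-off can also be written as an explicit timezone-aware timestamp, which makes the boundary behaviour easy to verify. The following is a minimal sketch on three made-up timestamps around the boundary; `cutoff` and `kept` are illustrative names.

```python
import pandas as pd

# Hypothetical timestamps just before, exactly at, and well after the cut-off
dates = pd.to_datetime(
    pd.Series(["2018-12-31T23:59:59Z", "2019-01-01T00:00:00Z", "2021-11-27T16:39:06Z"]),
    utc=True,
)

# Explicit UTC cut-off, matching the timezone of the parsed column
cutoff = pd.Timestamp("2019-01-01", tz="UTC")

# >= keeps a review submitted exactly at midnight on 1 January 2019
kept = dates[dates >= cutoff]
print(len(kept))
print(kept.min())
```

Using `>=` rather than `>` matters here because the earliest remaining timestamp in the real data falls exactly on the boundary.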

#### What does the distribution of the stars look like now? Does the skewness remain?

In [56]:
reviews_filtered.groupby(['review_score']).size()

review_score
1      996
2      885
3     3025
4    16171
5    60608
dtype: int64

* The distribution remains left-skewed: about 74% of the remaining 81,685 reviews still give 5 stars.
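
To compare the two distributions directly, the counts can be converted to shares and placed side by side. The following is a minimal sketch using the counts from the two outputs above; the column names `share_all` and `share_since_2019` are chosen here for illustration.

```python
import pandas as pd

stars = [1, 2, 3, 4, 5]

# Counts copied from the outputs above: all reviews vs. reviews since 1 January 2019
before = pd.Series([1474, 1223, 3993, 20789, 78333], index=stars)
after = pd.Series([996, 885, 3025, 16171, 60608], index=stars)

# Convert counts to relative shares so the two periods are comparable
comparison = pd.DataFrame({
    "share_all": before / before.sum(),
    "share_since_2019": after / after.sum(),
})
print(comparison.round(3))
```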