## Outline of steps
* [step0](#step0): import necessary packages
* [step1](#step1): import dataset as `part1_dataset`
* [step2](#step2): have a look at the dataset and its data type
* [step3](#step3): convert `Review_Date` from object to datetime
* [step4](#step4): have a look at the descriptive statistical summary
* [step5](#step5): save the dataset as `part1_dataset.pickle`

<a id="step0"></a>
## step0: import necessary packages

In [1]:
# import necessary packages
import numpy as np
import pandas as pd
from IPython.display import display # Allows the use of display() for DataFrames
import seaborn as sns
import matplotlib.pyplot as plt
import missingno as msno # module for missing value visualization

# Pretty display for notebooks
%matplotlib inline

<a id="step1"></a>
## step1: import dataset

In [2]:
# import dataset
part1_dataset = pd.read_csv("./Hotel_Reviews.csv")

<a id="step2"></a>
## step2: have a look at the dataset

In [8]:
# have a look at the dataset 
display(part1_dataset.head(n=2))

Unnamed: 0,Hotel_Address,Additional_Number_of_Scoring,Review_Date,Average_Score,Hotel_Name,Reviewer_Nationality,Negative_Review,Review_Total_Negative_Word_Counts,Total_Number_of_Reviews,Positive_Review,Review_Total_Positive_Word_Counts,Total_Number_of_Reviews_Reviewer_Has_Given,Reviewer_Score,Tags,days_since_review,lat,lng
0,s Gravesandestraat 55 Oost 1092 AA Amsterdam ...,194,8/3/2017,7.7,Hotel Arena,Russia,I am so angry that i made this post available...,397,1403,Only the park outside of the hotel was beauti...,11,7,2.9,"[' Leisure trip ', ' Couple ', ' Duplex Double...",0 days,52.360576,4.915968
1,s Gravesandestraat 55 Oost 1092 AA Amsterdam ...,194,8/3/2017,7.7,Hotel Arena,Ireland,No Negative,0,1403,No real complaints the hotel was great great ...,105,7,7.5,"[' Leisure trip ', ' Couple ', ' Duplex Double...",0 days,52.360576,4.915968


In [4]:
# have an overview of all column data type
part1_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 515738 entries, 0 to 515737
Data columns (total 17 columns):
Hotel_Address                                 515738 non-null object
Additional_Number_of_Scoring                  515738 non-null int64
Review_Date                                   515738 non-null object
Average_Score                                 515738 non-null float64
Hotel_Name                                    515738 non-null object
Reviewer_Nationality                          515738 non-null object
Negative_Review                               515738 non-null object
Review_Total_Negative_Word_Counts             515738 non-null int64
Total_Number_of_Reviews                       515738 non-null int64
Positive_Review                               515738 non-null object
Review_Total_Positive_Word_Counts             515738 non-null int64
Total_Number_of_Reviews_Reviewer_Has_Given    515738 non-null int64
Reviewer_Score                                515738 non-null float64

<a id="step3"></a>
## step3: convert the column Review_Date to datetime data type.

In [9]:
# Review_Date is stored as object (string), so convert it to datetime.
part1_dataset.Review_Date = pd.to_datetime(part1_dataset.Review_Date, format="%m/%d/%Y")
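
The conversion above raises an error if any string does not match the `%m/%d/%Y` format. A minimal sketch of a more defensive variant follows. It uses a small made-up frame (`toy`, a hypothetical stand-in for `part1_dataset`) and assumes that turning unparseable values into `NaT` is acceptable, then counts them:

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are illustrative only).
toy = pd.DataFrame({"Review_Date": ["8/3/2017", "7/31/2017", "not a date"]})

# errors="coerce" turns strings that do not match the format into NaT
# instead of raising an exception.
toy["Review_Date"] = pd.to_datetime(
    toy["Review_Date"], format="%m/%d/%Y", errors="coerce"
)

# Count how many values could not be parsed.
n_bad = int(toy["Review_Date"].isna().sum())
print(toy.dtypes)
print("unparseable dates:", n_bad)
```

If the count is zero on the real data, the strict call used in the cell above is sufficient.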

<a id="step4"></a>
## step4: have a look at the descriptive statistical summary

In [11]:
# have a look at the descriptive statistical summary;
# include='all' covers both categorical and numerical features
part1_dataset.describe(include='all')

Unnamed: 0,Hotel_Address,Additional_Number_of_Scoring,Review_Date,Average_Score,Hotel_Name,Reviewer_Nationality,Negative_Review,Review_Total_Negative_Word_Counts,Total_Number_of_Reviews,Positive_Review,Review_Total_Positive_Word_Counts,Total_Number_of_Reviews_Reviewer_Has_Given,Reviewer_Score,Tags,days_since_review,lat,lng
count,515738,515738.0,515738,515738.0,515738,515738,515738,515738.0,515738.0,515738,515738.0,515738.0,515738.0,515738,515738,512470.0,512470.0
unique,1493,,731,,1492,227,330011,,,412601,,,,55242,731,,
top,163 Marsh Wall Docklands Tower Hamlets London ...,,2017-08-02 00:00:00,,Britannia International Hotel Canary Wharf,United Kingdom,No Negative,,,No Positive,,,,"[' Leisure trip ', ' Couple ', ' Double Room '...",1 days,,
freq,4789,,2585,,4789,245246,127890,,,35946,,,,5101,2585,,
first,,,2015-08-04 00:00:00,,,,,,,,,,,,,,
last,,,2017-08-03 00:00:00,,,,,,,,,,,,,,
mean,,498.081836,,8.397487,,,,18.53945,2743.743944,,17.776458,7.166001,8.395077,,,49.442439,2.823803
std,,500.538467,,0.548048,,,,29.690831,2317.464868,,21.804185,11.040228,1.637856,,,3.466325,4.579425
min,,1.0,,5.2,,,,0.0,43.0,,0.0,1.0,2.5,,,41.328376,-0.369758
25%,,169.0,,8.1,,,,2.0,1161.0,,5.0,1.0,7.5,,,48.214662,-0.143372
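
The `count` row above shows 512470 for `lat` and `lng` against 515738 for every other column, which suggests 3268 rows have no coordinates. One way to list such columns directly is sketched below on a made-up frame (`toy` is hypothetical, not the real data):

```python
import numpy as np
import pandas as pd

# Toy frame with missing coordinates in one row (illustrative values).
toy = pd.DataFrame({
    "Hotel_Name": ["A", "B", "C"],
    "lat": [52.36, np.nan, 48.21],
    "lng": [4.92, np.nan, 16.37],
})

# Number of missing values per column, keeping only columns that have any.
missing = toy.isnull().sum()
missing = missing[missing > 0]
print(missing)
```

On the real data the same two lines would be applied to `part1_dataset` instead of `toy`.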


<a id="step5"></a>
## step5: save the dataset as a pickle file for later use

In [6]:
# save the dataset as part1_dataset.pickle
part1_dataset.to_pickle("part1_dataset.pickle")
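
Pickle keeps column dtypes, so the converted `Review_Date` column should come back as datetime without a second conversion. The sketch below illustrates the round trip with `pd.read_pickle` on a made-up frame written to a temporary directory. The names `toy` and `toy.pickle` are hypothetical:

```python
import os
import tempfile

import pandas as pd

# Toy frame with a datetime column and a float column (illustrative values).
toy = pd.DataFrame({
    "Review_Date": pd.to_datetime(["8/3/2017"], format="%m/%d/%Y"),
    "Reviewer_Score": [2.9],
})

# Write to a temporary location and load it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.pickle")
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# The dtypes survive the round trip.
print(restored.dtypes)
```

A later notebook would load the real file the same way, with `pd.read_pickle("part1_dataset.pickle")`.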