# Phase 2: Exploring Data

## Import libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
df = pd.read_csv('./Data/raw_data/AB_NYC_2019.csv')
df.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,,1,365
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.1,1,0


# Question 1: 
How many rows and how many columns?

In [4]:
print(f'The number of rows : {df.shape[0]}')
print(f'The number of columns : {df.shape[1]}')

The number of rows : 48895
The number of columns : 16


# Question 2: 
What is the meaning of each row?

- Each row holds the information for one listing.

# Question 3: 
Are there duplicated rows?

In [5]:
print(df.duplicated().sum())

0
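`df.duplicated()` only flags rows that match an earlier row in every column. As a complementary check, the sketch below counts duplicates on a single key column. It runs on a small made-up frame (`toy` and its values are hypothetical and not taken from the dataset), and it assumes that `id` is meant to be unique per listing.

```python
import pandas as pd

# Hypothetical toy data: two rows share the same id but differ in name
toy = pd.DataFrame({
    "id": [1, 2, 2],
    "name": ["A", "B", "B (edited)"],
})

# Compares every column: no row is an exact copy of another
full_row_dupes = int(toy.duplicated().sum())

# Compares only the `id` column: the second id=2 row is flagged
id_dupes = int(toy.duplicated(subset=["id"]).sum())

print(full_row_dupes, id_dupes)
```

On the real data, the same idea would be `df.duplicated(subset=['id']).sum()`; the result is not shown here because it was not computed in the original notebook.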


# Question 4: 
What is the meaning of each column?

| Column name     | Meaning |
| :---        |    :----:   |
| id      | ID of the listing      |
| name   | The name of the listing        |
| host_id      | ID of the host      |
| host_name   | Name of the host        |
| neighbourhood_group      | The neighbourhood group (for example Brooklyn or Manhattan) that contains the listing       |
| neighbourhood   | The neighbourhood the listing is located in          |
| latitude      | The latitude of the listing (World Geodetic System, WGS84)       |
| longitude   |  The longitude of the listing (World Geodetic System, WGS84)       |
| room_type   | The type of room        |
| price      | The price of the listing      |
| minimum_nights      | The minimum number of nights per booking        |
| number_of_reviews   | The number of reviews for this listing        |
| last_review      | The date of the most recent review for this listing       |
| reviews_per_month   | The average number of reviews for this listing per month    |
| calculated_host_listings_count   | The number of listings the host has in the current scrape       |
| availability_365   | The number of days the listing is available over the next 365 days, as determined by the calendar       |


# Question 5: 
What is the current data type of each column? Are there columns having inappropriate data types?

In [6]:
df.dtypes

id                                  int64
name                               object
host_id                             int64
host_name                          object
neighbourhood_group                object
neighbourhood                      object
latitude                          float64
longitude                         float64
room_type                          object
price                               int64
minimum_nights                      int64
number_of_reviews                   int64
last_review                        object
reviews_per_month                 float64
calculated_host_listings_count      int64
availability_365                    int64
dtype: object

An `object` column can hold values of several Python types, so check the actual type of the values in each column:

In [7]:
def open_object_col(col):
    """Return the set of Python types found among the values of a column."""
    return set(col.apply(type))

In [8]:
df.apply(open_object_col)

id                                                 {<class 'int'>}
name                              {<class 'str'>, <class 'float'>}
host_id                                            {<class 'int'>}
host_name                         {<class 'str'>, <class 'float'>}
neighbourhood_group                                {<class 'str'>}
neighbourhood                                      {<class 'str'>}
latitude                                         {<class 'float'>}
longitude                                        {<class 'float'>}
room_type                                          {<class 'str'>}
price                                              {<class 'int'>}
minimum_nights                                     {<class 'int'>}
number_of_reviews                                  {<class 'int'>}
last_review                       {<class 'str'>, <class 'float'>}
reviews_per_month                                {<class 'float'>}
calculated_host_listings_count                     {<class 'in

- Columns id, host_id, neighbourhood_group, neighbourhood, latitude, longitude, room_type, price, minimum_nights, number_of_reviews, reviews_per_month, calculated_host_listings_count and availability_365 already have an appropriate data type.
- Columns name and host_name contain both str and float values (the floats are missing values), so they need to be converted to string.
- Column last_review needs to be converted to datetime.
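The `float` type reported for text columns comes from missing values: pandas stores a missing entry in an `object` column as `NaN`, which is a Python `float`. The sketch below reproduces this on a small made-up Series (`toy_names` and its values are hypothetical), using the same `apply(type)` check as `open_object_col`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing value, stored as object dtype
toy_names = pd.Series(["Cozy loft", np.nan, "Sunny room"], dtype=object)

# Same check as open_object_col: collect the Python type of every value
types_found = set(toy_names.apply(type))

# Count the missing entries that are responsible for the float type
missing_count = int(toy_names.isna().sum())

print(types_found, missing_count)
```

Running `df[['name', 'host_name', 'last_review']].isna().sum()` on the real data would show how many such entries each column has.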

In [10]:
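# Replace missing names with an empty string so both columns hold only str values
df[['name', 'host_name']] = df[['name', 'host_name']].fillna('')
# Parse review dates; missing values become NaT
df['last_review'] = pd.to_datetime(df['last_review'], format='%Y-%m-%d')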
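To see what this conversion does to missing values, here is a minimal sketch on a made-up two-row frame (`toy` and its values are hypothetical). It uses `fillna('')`, which should behave like the `replace` call for this purpose, and checks that a missing date ends up as `NaT` rather than an empty string.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

# Hypothetical toy data with one missing name and one missing review date
toy = pd.DataFrame({
    "name": pd.Series(["Cozy loft", np.nan], dtype=object),
    "last_review": pd.Series(["2019-05-21", np.nan], dtype=object),
})

# Missing names become empty strings, so the column holds only str values
toy["name"] = toy["name"].fillna("")

# Date strings are parsed into timestamps; the missing entry becomes NaT
toy["last_review"] = pd.to_datetime(toy["last_review"], format="%Y-%m-%d")

print(toy.dtypes)
print(toy["last_review"].isna().sum())
```

One consequence worth keeping in mind: after the conversion, missing names are no longer detected by `isna()`, since an empty string counts as a value, while missing review dates still are.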

In [11]:
df.apply(open_object_col)

id                                                                  {<class 'int'>}
name                                                                {<class 'str'>}
host_id                                                             {<class 'int'>}
host_name                                                           {<class 'str'>}
neighbourhood_group                                                 {<class 'str'>}
neighbourhood                                                       {<class 'str'>}
latitude                                                          {<class 'float'>}
longitude                                                         {<class 'float'>}
room_type                                                           {<class 'str'>}
price                                                               {<class 'int'>}
minimum_nights                                                      {<class 'int'>}
number_of_reviews                                                   {<class 