# Explore here

## Step 1: Problem statement and data collection

Problem statement: What price can an Airbnb host expect to charge, given the prices of existing Airbnb listings that share some characteristics and differ in others?

In [5]:


# Import dataset to start working with it
import pandas as pd

raw_data = pd.read_csv("../data/raw/AB_NYC_2019.csv")

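def split_test_vs_training():
    # Hold out 20% of the rows for testing; the fixed seed makes the split reproducible
    test_frac = 0.20
    test_set  = raw_data.sample(frac=test_frac, random_state=42)
    train_set = raw_data.drop(test_set.index)
    # Return both subsets so the caller can actually use them
    return train_set, test_set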
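
As a rough illustration of how such a split could be used, here is a minimal sketch on toy data. The function name `split_train_test`, its parameters, and the `toy` frame are assumptions made for this example; they are not part of the notebook, which samples from `raw_data` directly.

```python
import pandas as pd


def split_train_test(df, test_frac=0.20, seed=42):
    """Return (train_set, test_set) as two disjoint subsets of df."""
    # Sample the test rows reproducibly, then keep every other row for training.
    test_set = df.sample(frac=test_frac, random_state=seed)
    train_set = df.drop(test_set.index)
    return train_set, test_set


# Hypothetical toy data standing in for the real listings.
toy = pd.DataFrame({"price": range(100, 200, 10), "minimum_nights": range(1, 11)})
train_set, test_set = split_train_test(toy)

print(len(train_set), len(test_set))
```

Passing the frame in as an argument, rather than reading a global, means the same helper would still work after the data has been cleaned and reassigned.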



## Step 2: Exploration and data cleaning

In [6]:
print(f"Raw data shape: {raw_data.shape}")
print(raw_data.info())

Raw data shape: (48895, 16)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48895 entries, 0 to 48894
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              48895 non-null  int64  
 1   name                            48879 non-null  object 
 2   host_id                         48895 non-null  int64  
 3   host_name                       48874 non-null  object 
 4   neighbourhood_group             48895 non-null  object 
 5   neighbourhood                   48895 non-null  object 
 6   latitude                        48895 non-null  float64
 7   longitude                       48895 non-null  float64
 8   room_type                       48895 non-null  object 
 9   price                           48895 non-null  int64  
 10  minimum_nights                  48895 non-null  int64  
 11  number_of_reviews               48895 non-null  int64  
 12  last

The columns with missing data are name, host_name, last_review, and reviews_per_month.

name and host_name should have a negligible effect on price, if any.
last_review and reviews_per_month could plausibly correlate with price indirectly: a listing with more reviews per month likely gets more bookings, more bookings mean less time unbooked, and a listing in that much demand can likely charge a higher price per booking.

Still, we may be able to get away with dropping these columns rather than inventing placeholder values for the missing entries.
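
One way to check both points before deciding is to count the missing values per column and look at the correlation with price. The sketch below does this on a small toy frame; the `toy` data and variable names are assumptions for illustration, and any number computed on five rows says nothing about the full dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; the real notebook would use raw_data instead.
toy = pd.DataFrame({
    "price": [149, 225, 150, 89, 80],
    "number_of_reviews": [9, 45, 0, 270, 9],
    "reviews_per_month": [0.21, 0.38, np.nan, 4.64, 0.10],
})

# Count missing values per column, keeping only columns that have any.
missing_counts = toy.isna().sum()
missing_counts = missing_counts[missing_counts > 0]
print(missing_counts)

# Pearson correlation with price; Series.corr skips rows where either value is NaN.
corr_with_price = toy["reviews_per_month"].corr(toy["price"])
print(f"reviews_per_month vs price: {corr_with_price:.2f}")
```

If the correlation on the real data turned out to be weak, that would support dropping the column instead of imputing it.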

In [7]:
# Count rows that match an earlier row in every column except id
print(f"Count of duplicates found ignoring id: {raw_data.drop('id', axis = 1).duplicated().sum()}")

# Alternative: compare every column except id. Two rows count as the same if all of those other columns match, even when their id values differ.
#raw_data = raw_data.drop_duplicates(subset = raw_data.columns.difference(['id']))

# Compare all columns (subset is omitted). Two rows are duplicates only if every column, including id, matches.
raw_data = raw_data.drop_duplicates()

print(raw_data.shape)
raw_data.head()

Count of duplicates found ignoring id: 0
(48895, 16)


|  | id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2539 | Clean & quiet apt home by the park | 2787 | John | Brooklyn | Kensington | 40.64749 | -73.97237 | Private room | 149 | 1 | 9 | 2018-10-19 | 0.21 | 6 | 365 |
| 1 | 2595 | Skylit Midtown Castle | 2845 | Jennifer | Manhattan | Midtown | 40.75362 | -73.98377 | Entire home/apt | 225 | 1 | 45 | 2019-05-21 | 0.38 | 2 | 355 |
| 2 | 3647 | THE VILLAGE OF HARLEM....NEW YORK ! | 4632 | Elisabeth | Manhattan | Harlem | 40.80902 | -73.9419 | Private room | 150 | 3 | 0 |  |  | 1 | 365 |
| 3 | 3831 | Cozy Entire Floor of Brownstone | 4869 | LisaRoxanne | Brooklyn | Clinton Hill | 40.68514 | -73.95976 | Entire home/apt | 89 | 1 | 270 | 2019-07-05 | 4.64 | 1 | 194 |
| 4 | 5022 | Entire Apt: Spacious Studio/Loft by central park | 7192 | Laura | Manhattan | East Harlem | 40.79851 | -73.94399 | Entire home/apt | 80 | 10 | 9 | 2018-11-19 | 0.1 | 1 | 0 |
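
The two duplicate checks discussed in the cell above can give different answers. The sketch below shows the difference on a toy frame; the `toy` data and variable names are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy listings: rows 0 and 1 differ only in id,
# while rows 2 and 3 match in every column.
toy = pd.DataFrame({
    "id": [1, 2, 3, 3],
    "host_id": [10, 10, 20, 20],
    "price": [100, 100, 80, 80],
})

# Duplicates when id is ignored: rows 1 and 3 each repeat an earlier row.
dupes_ignoring_id = toy.drop("id", axis=1).duplicated().sum()

# Duplicates when every column, including id, must match: only row 3.
dupes_all_columns = toy.duplicated().sum()

# drop_duplicates keeps the first row of each group of matching rows.
columns_except_id = list(toy.columns.difference(["id"]))
deduped_ignoring_id = toy.drop_duplicates(subset=columns_except_id)
deduped_all_columns = toy.drop_duplicates()

print(dupes_ignoring_id, dupes_all_columns)
print(len(deduped_ignoring_id), len(deduped_all_columns))
```

In the notebook the count ignoring id is 0, so both variants leave the data unchanged; the distinction only matters when listings have been re-entered under a new id.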


In [8]:
# remove data considered irrelevant
raw_data.drop(["id", "name", "host_name", "last_review", "reviews_per_month"], axis = 1, inplace = True)
raw_data.head()

|  | host_id | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | calculated_host_listings_count | availability_365 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2787 | Brooklyn | Kensington | 40.64749 | -73.97237 | Private room | 149 | 1 | 9 | 6 | 365 |
| 1 | 2845 | Manhattan | Midtown | 40.75362 | -73.98377 | Entire home/apt | 225 | 1 | 45 | 2 | 355 |
| 2 | 4632 | Manhattan | Harlem | 40.80902 | -73.9419 | Private room | 150 | 3 | 0 | 1 | 365 |
| 3 | 4869 | Brooklyn | Clinton Hill | 40.68514 | -73.95976 | Entire home/apt | 89 | 1 | 270 | 1 | 194 |
| 4 | 7192 | Manhattan | East Harlem | 40.79851 | -73.94399 | Entire home/apt | 80 | 10 | 9 | 1 | 0 |


## Step 3: Analysis of univariate variables
Categorical, non-numeric variables in this dataframe include: neighbourhood_group, neighbourhood, room_type

neighbourhood_group is one of: Manhattan, Brooklyn, Queens, Bronx, Staten Island
neighbourhood takes many more values, listed in the output of the cell below.

In [25]:
categorical_variables = ["neighbourhood_group", "neighbourhood", "room_type"]

for each_categorical_variable in categorical_variables:
    print(f"\n\t{each_categorical_variable}s:")
    # Print every distinct value of the column in alphabetical order
    for each_possible_value in sorted(raw_data[each_categorical_variable].unique()):
        print(each_possible_value)


	neighbourhood_groups:
Bronx
Brooklyn
Manhattan
Queens
Staten Island

	neighbourhoods:
Allerton
Arden Heights
Arrochar
Arverne
Astoria
Bath Beach
Battery Park City
Bay Ridge
Bay Terrace
Bay Terrace, Staten Island
Baychester
Bayside
Bayswater
Bedford-Stuyvesant
Belle Harbor
Bellerose
Belmont
Bensonhurst
Bergen Beach
Boerum Hill
Borough Park
Breezy Point
Briarwood
Brighton Beach
Bronxdale
Brooklyn Heights
Brownsville
Bull's Head
Bushwick
Cambria Heights
Canarsie
Carroll Gardens
Castle Hill
Castleton Corners
Chelsea
Chinatown
City Island
Civic Center
Claremont Village
Clason Point
Clifton
Clinton Hill
Co-op City
Cobble Hill
College Point
Columbia St
Concord
Concourse
Concourse Village
Coney Island
Corona
Crown Heights
Cypress Hills
DUMBO
Ditmars Steinway
Dongan Hills
Douglaston
Downtown Brooklyn
Dyker Heights
East Elmhurst
East Flatbush
East Harlem
East Morrisania
East New York
East Village
Eastchester
Edenwald
Edgemere
Elmhurst
Eltingville
Emerson Hill
Far Rockaway
Fieldston
Financial D
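
Listing every value gets long for a column like neighbourhood. A more compact univariate summary might report the number of distinct values and how often each one occurs. This is a minimal sketch on toy data; the `toy` frame and variable names are assumptions for illustration, not part of the notebook.

```python
import pandas as pd

# Hypothetical toy listings; the real notebook would use raw_data instead.
toy = pd.DataFrame({
    "neighbourhood_group": ["Brooklyn", "Manhattan", "Manhattan", "Brooklyn", "Manhattan"],
    "room_type": ["Private room", "Entire home/apt", "Private room",
                  "Entire home/apt", "Entire home/apt"],
})

for column in ["neighbourhood_group", "room_type"]:
    # nunique gives the number of distinct categories in the column.
    print(f"{column}: {toy[column].nunique()} distinct values")
    # value_counts lists each category with its row count, most frequent first.
    print(toy[column].value_counts().to_string())

group_counts = toy["neighbourhood_group"].value_counts()
```

Counts like these also show whether some categories are too rare to be useful on their own.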

## Step 4: Analysis of multivariate variables
