# AirBnB Build Week Project

## NLP Preprocessing

### [My EDA Notebook](https://github.com/BW-TT-158-AirBnB/Data-Science-Unit3-and-Unit4/blob/main/notebooks/BW4_AirBnB_EDA.ipynb)

### [My Model Notebook](https://github.com/BW-TT-158-AirBnB/Data-Science-Unit3-and-Unit4/blob/main/notebooks/BW4_Airbnb_Model_Notebook.ipynb)


## Links to Raw Cleaned Files

[Text Features Only](https://raw.githubusercontent.com/BW-TT-158-AirBnB/Data-Science-Unit3-and-Unit4/main/data/amsterdam_txt_feat.csv)

[Numeric Features Only](https://raw.githubusercontent.com/BW-TT-158-AirBnB/Data-Science-Unit3-and-Unit4/main/data/amsterdam_num_feat.csv)

[Boolean Features Only](https://raw.githubusercontent.com/BW-TT-158-AirBnB/Data-Science-Unit3-and-Unit4/main/data/amsterdam_bool_feat.csv)

[Cleaned with All Features](https://raw.githubusercontent.com/BW-TT-158-AirBnB/Data-Science-Unit3-and-Unit4/main/data/amsterdam_list_clean.csv)

## Imports

In [1]:
# Install the squarify plotting library
!pip install squarify



In [2]:
# Standard
import pandas as pd
import numpy as np

# Set DataFrame display options
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_colwidth', 300)

# NLP
import re
from nltk.stem import PorterStemmer
import spacy
from spacy.tokenizer import Tokenizer
from collections import Counter

# Plotting
import squarify
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
# Download the large spaCy English model (en_core_web_lg)
!python -m spacy download en_core_web_lg
import en_core_web_lg

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_lg')


In [4]:
# Instantiate the spaCy vocab and tokenizer
nlp = spacy.load('en_core_web_lg')
tokenizer = Tokenizer(nlp.vocab)

## Read in the Data and Explore it Further

In [5]:
# Read in just the text features file
txt_url = 'https://raw.githubusercontent.com/BW-TT-158-AirBnB/Data-Science-Unit3-and-Unit4/main/data/amsterdam_txt_feat.csv'
txt_df = pd.read_csv(txt_url)

# Look at the data
print(txt_df.shape)
txt_df.head()

(18291, 5)


Unnamed: 0,description,neighborhood,property_type,room_type,amenities
0,"Quiet Garden View Room & Super Fast WiFi<br /><br /><b>The space</b><br />I'm renting a bedroom (room overlooking the garden) in my apartment in Amsterdam, <br /><br />The room is located to the east of the city centre in a quiet, typical Amsterdam neighbourhood the ""Indische Buurt"". Amsterdam’s...",Oostelijk Havengebied - Indische Buurt,Private room in apartment,Private room,"[""Hangers"", ""Coffee maker"", ""Paid parking on premises"", ""Long term stays allowed"", ""First aid kit"", ""Bed linens"", ""Lock on bedroom door"", ""Private entrance"", ""Carbon monoxide alarm"", ""Dedicated workspace"", ""Host greets you"", ""Single level home"", ""Extra pillows and blankets"", ""Hot water"", ""Paid p..."
1,"17th century Dutch townhouse in the heart of the city. no public transport needed! Located a stones throw from Rembrandt Square, Dam Square, Leidse Square and Flower Market. Walking distance from Central Station.<br />Comfortable, cosy, lockable studio with comfortable bed and with private bathr...",Centrum-Oost,Private room in townhouse,Private room,"[""Essentials"", ""Bed linens"", ""Hot water"", ""Hangers"", ""Smoke alarm"", ""Heating"", ""Paid parking off premises"", ""Free street parking"", ""Refrigerator"", ""Carbon monoxide alarm"", ""Dedicated workspace"", ""Host greets you"", ""TV"", ""Wifi"", ""Hair dryer"", ""Long term stays allowed"", ""Fire extinguisher""]"
2,"Lovely apt in Centre ( lift & fireplace) near Jordaan<br /><br /><b>The space</b><br />This nicely furnished, newly renovated apt is very sunny, and spacious with 8 windows. The appliances are fairly new. There are two flat screen TVs, washing machine, dryer, dishwasher, laptop computer, Wi-fi,...",Centrum-West,Entire apartment,Entire home/apt,"[""Hangers"", ""Elevator"", ""Cooking basics"", ""Dishes and silverware"", ""Oven"", ""Coffee maker"", ""Long term stays allowed"", ""Bed linens"", ""Dishwasher"", ""Dedicated workspace"", ""Kitchen"", ""Extra pillows and blankets"", ""Hot water"", ""Heating"", ""TV"", ""Cable TV"", ""Hair dryer"", ""Microwave"", ""Essentials"", ""Sm..."
3,"Stylish and romantic houseboat on fantastic historic location with breathtaking view. Wheelhouse, deckhouse and captains room. Central, quiet. Great breakfast, 2 vanMoof design bikes and a Canadian Canoe are included. Just read the reviews on tripadvisor for instance!<br /><br /><b>The space</b...",Centrum-West,Private room in houseboat,Private room,"[""Patio or balcony"", ""Hangers"", ""Dishes and silverware"", ""Luggage dropoff allowed"", ""Breakfast"", ""Coffee maker"", ""Lake access"", ""Long term stays allowed"", ""Private entrance"", ""Waterfront"", ""Carbon monoxide alarm"", ""Dedicated workspace"", ""Hot water"", ""Heating"", ""TV"", ""Hair dryer"", ""Essentials"", ""..."
4,"<b>The space</b><br />In a monumental house right in the center of Amsterdam, we offer two rooms (one single room and one double room) to visitors who want to enjoy the comfort of a home-like accommodation, plus being in the middle of the historic city center, close to all important museums, sho...",Centrum-Oost,Private room in apartment,Private room,"[""Hot water"", ""Essentials"", ""Smoke alarm"", ""Hangers"", ""Heating"", ""Lock on bedroom door"", ""Private entrance"", ""Refrigerator"", ""Carbon monoxide alarm"", ""Host greets you"", ""Shampoo"", ""Wifi"", ""Hair dryer"", ""Long term stays allowed"", ""Fire extinguisher"", ""Dryer""]"


In [6]:
# Check for null values in the dataframe
txt_df.isnull().sum()

description      0
neighborhood     0
property_type    0
room_type        0
amenities        0
dtype: int64

In [7]:
# Check the unique values in the neighborhood feature
txt_df['neighborhood'].value_counts()

De Baarsjes - Oud-West                    3044
De Pijp - Rivierenbuurt                   2269
Centrum-West                              2000
Centrum-Oost                              1560
Westerpark                                1395
Zuid                                      1309
Oud-Oost                                  1181
Bos en Lommer                             1047
Oostelijk Havengebied - Indische Buurt     875
Oud-Noord                                  586
Watergraafsmeer                            530
IJburg - Zeeburgereiland                   423
Slotervaart                                411
Noord-West                                 369
Noord-Oost                                 255
Buitenveldert - Zuidas                     252
Geuzenveld - Slotermeer                    214
De Aker - Nieuw Sloten                     126
Osdorp                                     125
Gaasperdam - Driemond                      121
Bijlmer-Centrum                            103
Bijlmer-Oost 

* This is a multi-class categorical feature; I will probably just one-hot encode it.
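A minimal sketch of what one-hot encoding `neighborhood` could look like, assuming `pd.get_dummies` is used. The `sample_df` frame and the `neighborhood_` column prefix are illustrative assumptions, not part of the original notebook; the neighbourhood names are taken from the counts above.

```python
import pandas as pd

# Toy stand-in for txt_df (assumption): a few neighbourhoods from the counts above
sample_df = pd.DataFrame({
    'neighborhood': ['Centrum-West', 'Zuid', 'Centrum-West', 'Westerpark']
})

# One 0/1 indicator column per distinct neighbourhood
neighborhood_dummies = pd.get_dummies(
    sample_df['neighborhood'], prefix='neighborhood', dtype=int
)

print(neighborhood_dummies.columns.tolist())
print(neighborhood_dummies.sum().to_dict())
```

Each row ends up with exactly one indicator set to 1, so on the full data this would add one column per neighbourhood.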

In [8]:
# Check the unique values in the property_type feature
txt_df['property_type'].value_counts()

Entire apartment                      11393
Private room in apartment              2218
Entire house                           1198
Entire townhouse                        472
Private room in bed and breakfast       344
Private room in house                   339
Entire loft                             261
Entire condominium                      261
Houseboat                               199
Boat                                    198
Private room in townhouse               168
Entire serviced apartment               149
Room in boutique hotel                  135
Private room in houseboat               117
Private room in guest suite             103
Private room in boat                     96
Room in hotel                            90
Private room in condominium              83
Private room in loft                     55
Room in bed and breakfast                49
Entire guest suite                       42
Shared room in apartment                 29
Entire villa                    

* This one has many more unique values, so I will probably run it through a count vectorizer.
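To illustrate what a count vectorizer would produce for `property_type`, here is a hand-rolled sketch that builds the same kind of token-count matrix with `collections.Counter` and pandas. In practice a library class such as scikit-learn's `CountVectorizer` is the usual tool; the simple lowercase-and-split tokenization below is an assumption and may differ from a library's defaults. The `property_types` series is a toy stand-in using values from the counts above.

```python
from collections import Counter

import pandas as pd

# Toy stand-in for txt_df['property_type'] (assumption)
property_types = pd.Series([
    'Entire apartment',
    'Private room in apartment',
    'Private room in houseboat',
    'Houseboat',
])

# Lowercase each value, split on whitespace, and count tokens per listing
token_counts = [Counter(value.lower().split()) for value in property_types]

# Rows are listings, columns are vocabulary terms; absent terms become 0
property_bow = pd.DataFrame(token_counts).fillna(0).astype(int)
property_bow = property_bow[sorted(property_bow.columns)]

print(property_bow)
```

Shared tokens such as `apartment` or `houseboat` then let the model relate, for example, "Private room in houseboat" to "Houseboat", which a plain one-hot encoding of the full string would not.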

In [9]:
# Check the values in the room_type feature
txt_df['room_type'].value_counts()

Entire home/apt    14283
Private room        3831
Hotel room           127
Shared room           50
Name: room_type, dtype: int64

* This one has only 4 unique values, so it can simply be mapped to numeric codes.

* Will need to circle back to this. Not sure if I will have enough time to do NLP, but I will if time allows.
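The numeric mapping mentioned above for `room_type` could be sketched as follows. The integer codes and the `room_type_code` column name are hypothetical choices for illustration; the four category names come from the value counts above.

```python
import pandas as pd

# Toy stand-in for txt_df (assumption)
sample_df = pd.DataFrame({
    'room_type': ['Entire home/apt', 'Private room', 'Hotel room',
                  'Shared room', 'Private room']
})

# Hypothetical integer codes; the ordering here is arbitrary
room_type_codes = {
    'Entire home/apt': 0,
    'Private room': 1,
    'Hotel room': 2,
    'Shared room': 3,
}

# Series.map returns NaN for any value missing from the dictionary
sample_df['room_type_code'] = sample_df['room_type'].map(room_type_codes)

print(sample_df)
```

Integer codes imply an ordering between room types, so if the downstream model is sensitive to that, one-hot encoding these four values may be the safer option.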
