# Airbnb Data Mining Notebook

In this notebook we analyze data from Airbnb, a well-known residential rental application. Specifically, using the data for the Athens area for 3 months of 2019 (February, March and April), we are going to answer the following questions:
* What is the most common room type (room_type) in our data?
* Plot graphs showing the fluctuation of prices over the 3-month period.
* What are the top 5 neighborhoods with the most reviews?
* What is the neighborhood with the most real estate listings?
* How many entries are there per neighborhood and per month?
* Plot the histogram of the neighborhood_group variable.
* What is the most common room type (room_type) in each neighborhood (neighborhood_group)?
* What is the most expensive room type?
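
Several of the questions above reduce to counting or grouping on one column. The following is a minimal sketch on a tiny invented DataFrame, not the notebook's actual solution. It assumes the combined dataset exposes the columns `room_type` and `neighborhood_group` as named above, plus a numeric `price` column (an assumption), and it interprets "most expensive" as "highest mean price", which is only one possible reading.

```python
import pandas as pd

# Toy data, invented purely for illustration; not taken from the Airbnb files.
toy = pd.DataFrame({
    "neighborhood_group": ["A", "A", "A", "B", "B", "B"],
    "room_type": ["Entire home/apt", "Entire home/apt", "Private room",
                  "Private room", "Private room", "Shared room"],
    "price": [80, 100, 40, 35, 45, 20],  # assumed numeric price column
})

# Most common room type overall: count each value, take the most frequent.
most_common_room = toy["room_type"].value_counts().idxmax()

# Most common room type within each neighborhood group.
most_common_per_group = (
    toy.groupby("neighborhood_group")["room_type"]
       .agg(lambda s: s.value_counts().idxmax())
)

# "Most expensive" is read here as highest mean price (an assumption).
priciest_room = toy.groupby("room_type")["price"].mean().idxmax()

print(most_common_room)
print(most_common_per_group)
print(priciest_room)
```

On real data, ties between equally frequent room types would need an explicit rule, since `idxmax` simply returns the first maximum it finds.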

## Import Libraries

In [1]:
# Ignore unnecessary warnings
import warnings
warnings.filterwarnings("ignore")  
# Specialized container datatypes
import collections
# For data visualization
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# For large and multi-dimensional arrays
import numpy as np
# For data manipulation and analysis
import pandas as pd
# Natural language processing library
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer 
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize
from nltk.util import ngrams
# For basic cleaning and data preprocessing 
import re
import string 
# Communicating with operating and file system
import os
# Machine learning library
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
# For generating word clouds
from wordcloud import WordCloud

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/pantelis/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [5]:
"""
Check whether train.csv has already been created.
If it has not, walk the data folder and list the .csv files
from which the dataset will be built.
"""
DATASET = "./data/train.csv"
if os.path.exists(DATASET):
    print(f"{DATASET} already exists")
else:
    for root, _, files in os.walk("./data", topdown=False):
        for file in files:
            if file.endswith(".csv"):
                print(os.path.join(root, file))

./data/april/reviews.csv
./data/april/listings0.csv
./data/april/neighbourhoods.csv
./data/april/listings.csv
./data/april/calendar.csv
./data/april/reviews0.csv
./data/febrouary/reviews.csv
./data/febrouary/listings0.csv
./data/febrouary/neighbourhoods.csv
./data/febrouary/listings.csv
./data/febrouary/calendar.csv
./data/febrouary/reviews0.csv
./data/march/reviews.csv
./data/march/listings0.csv
./data/march/neighbourhoods.csv
./data/march/listings.csv
./data/march/calendar.csv
./data/march/reviews0.csv
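
The cell above only lists the monthly files; it does not yet write `train.csv`. One way to build it is sketched below, under stated assumptions: that the dataset comes from each month's `listings.csv` (the notebook does not say whether `listings.csv` or `listings0.csv` is the intended source), and that the originating folder should be recorded in a new column. The function name `build_dataset` and the column name `month` are hypothetical. The demo runs on throwaway files in a temporary directory, not on the real data.

```python
import os
import tempfile

import pandas as pd


def build_dataset(data_dir, out_path, source_name="listings.csv"):
    """Concatenate every <data_dir>/<month>/<source_name> into one CSV.

    The `month` column (hypothetical name) stores the folder name verbatim.
    """
    frames = []
    for month in sorted(os.listdir(data_dir)):
        path = os.path.join(data_dir, month, source_name)
        if os.path.isfile(path):  # skips plain files and folders without the source
            frame = pd.read_csv(path)
            frame["month"] = month
            frames.append(frame)
    if not frames:
        raise FileNotFoundError(f"no {source_name} found under {data_dir}")
    combined = pd.concat(frames, ignore_index=True)
    combined.to_csv(out_path, index=False)
    return combined


# Demo on invented files so the sketch runs without the Airbnb data.
with tempfile.TemporaryDirectory() as tmp:
    for month, price in [("march", 50), ("april", 60)]:
        os.makedirs(os.path.join(tmp, month))
        pd.DataFrame({"id": [1, 2], "price": [price, price + 5]}).to_csv(
            os.path.join(tmp, month, "listings.csv"), index=False
        )
    demo = build_dataset(tmp, os.path.join(tmp, "train.csv"))
    demo_written = os.path.exists(os.path.join(tmp, "train.csv"))

print(demo)
```

Because folder names are copied as they appear on disk, the values in `month` would match the directory spellings shown in the listing above.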
