## Key Takeaways from Initial Data Analysis
- There doesn't seem to be any pattern in the missing bathroom/bedroom data. I believe it was simply never entered into the Airbnb listing, so I dropped those listings.
- Almost all (15,827 out of 15,835) missing "last_review" and "first_review" values occur because the listing has no reviews. These columns are dates. I cut them, since "host_since" has less missing data and serves essentially the same purpose: understanding how long the Airbnb has been around.
- All null "review_scores_rating" values are caused by listings with no reviews. Therefore, I assigned those listings a "no past ratings" category in that column.
- Almost all of the listings contain images. Analyzing how thumbnail images affect price could make a cool project, but it is most likely not useful for ours.
- Created new columns based on text extracted from the "amenities" column.
- For the 180 NaNs in "host has no profile picture", I decided to assume there was no image and set the value to false.
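
The claims above about why values are missing can be checked directly. The following is a minimal sketch of one way to do that, using a made-up toy frame rather than the real listings data; the helper name `null_explained_by_no_reviews` is hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the listings table; the values are invented for illustration.
toy = pd.DataFrame({
    "number_of_reviews": [0, 3, 0, 12, 5],
    "first_review": [None, "2016-01-02", None, "2015-07-30", None],
    "review_scores_rating": [np.nan, 95.0, np.nan, 88.0, 91.0],
})

def null_explained_by_no_reviews(df, column):
    """Return (nulls that coincide with zero reviews, total nulls) for one column."""
    is_null = df[column].isna()
    no_reviews = df["number_of_reviews"] == 0
    return int((is_null & no_reviews).sum()), int(is_null.sum())

explained_dates = null_explained_by_no_reviews(toy, "first_review")
explained_scores = null_explained_by_no_reviews(toy, "review_scores_rating")
print("first_review:", explained_dates)
print("review_scores_rating:", explained_scores)
```

When the two numbers in a pair match, every null in that column is accounted for by the listing having no reviews; a gap between them points to nulls that need a different explanation.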

In [8]:
# Import Statements
from datetime import datetime
import pandas as pd
import numpy as np
from PIL import Image
import requests

In [9]:
# Import files
train = pd.read_csv("data/archive-2/train.csv", index_col='id')
test = pd.read_csv("data/archive-2/test.csv", index_col='id')

In [10]:
train_before = train.shape

In [11]:
test_before = test.shape

In [12]:
def cleaned_dataframe(df):
    """
    1. Adds feature columns to df
    2. Deals with all null values
    3. Turns ratings into categorical column
    4. Converts "host_since" into a measure of time
    """
    features = ['Wireless Internet', 'Air conditioning', 'Kitchen', 'Heating', 'Family/kid friendly', 'Essentials', 'Hair dryer', 'Iron',
                'Smoke detector', 'Shampoo', 'Hangers', 'Fire extinguisher', 'Laptop friendly workspace', 'First aid kit', 'Indoor fireplace',
                'TV', 'Cable TV', 'Elevator in building']
    
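    # for loop to create one 0/1 indicator column per amenity
    # literal substring match (regex=False); note that 'TV' also matches 'Cable TV'
    for item in features:
        df[item] = np.where(df['amenities'].str.contains(item, regex=False, na=False), 1, 0)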
    
    # drop unnecessary columns, including host_response_rate
    # neighbourhood will be dictated by zip and latitude/longitude
    df = df.drop(columns=['amenities', 'first_review', 'last_review', 'host_response_rate', 'neighbourhood'])
    
    # drop rows with null values in certain columns
    df = df.dropna(axis=0, subset=['bathrooms', 'bedrooms', 'beds'])
    
    # drop rows with no host information or no zip code
    # .copy() so the column assignments below act on an independent frame
    df = df.dropna(axis=0, subset=['host_since', 'host_identity_verified', 'zipcode']).copy()
    
    # Dealing with ratings column
    # Zero isn't a real rating in this column, so it is safe to use as a placeholder
    # Temporarily set the rating of listings with no reviews to 0 so they survive the dropna below
    # and can later be relabeled as their own category
    df['review_scores_rating'] = np.where(df['number_of_reviews'] == 0, 0, df['review_scores_rating'])
    
    # drop remaining 800 rows with no values
    df = df.dropna(axis=0, subset=['review_scores_rating'])
    
    # change reviews into categories
    df['review_scores_rating'] = df['review_scores_rating'].round(-1).astype('int').astype('str')
    
    # reassign 0 ratings as "no past ratings" category
    df['review_scores_rating']=np.where(df['number_of_reviews']==0, 'no past ratings', df['review_scores_rating'])
    
    # Convert "host_since" into the number of days an individual has been a host
    # Vectorized to avoid the row-by-row loop and chained assignment through .iloc
    today = pd.Timestamp.today().normalize()
    host_since = pd.to_datetime(df['host_since'], format='%Y-%m-%d')
    df['host_since'] = (today - host_since).dt.days.astype('int')
    
    # drop columns with low correlation
    df = df.drop(columns=['latitude', 'longitude', 'Smoke detector', 'number_of_reviews', 'Hangers', 'First aid kit', 'Elevator in building', 'Essentials', 'zipcode', 'thumbnail_url', 'description', 'name'])
    return df
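
The rating step in the function is the least obvious part, so here is a small sketch of the same idea on a toy frame: placeholder for listings with no reviews, bucket to the nearest ten, then relabel the placeholder. The data is invented, and `Series.where` is used here in place of the notebook's `np.where` calls; it is an equivalent alternative, not the notebook's own code.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration: one listing with no reviews, three with ratings.
toy = pd.DataFrame({
    "number_of_reviews": [0, 4, 9, 2],
    "review_scores_rating": [np.nan, 93.0, 88.0, 100.0],
})

no_reviews = toy["number_of_reviews"] == 0

# Step 1: give listings with no reviews a placeholder score of 0.
scores = toy["review_scores_rating"].where(~no_reviews, 0)

# Step 2: round to the nearest ten and store the result as a string label.
buckets = scores.round(-1).astype("int").astype("str")

# Step 3: replace the placeholder bucket with an explicit category.
toy["review_scores_rating"] = buckets.where(~no_reviews, "no past ratings")

labels = toy["review_scores_rating"].tolist()
print(labels)
```

The placeholder is needed because `astype("int")` fails on NaN, so the missing ratings have to hold some numeric value until the column has been converted to strings.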

In [13]:
train = cleaned_dataframe(train)

In [14]:
test = cleaned_dataframe(test)

In [15]:
# Statement on cut data
train_after = train.shape
lost_rows = train_before[0]-train_after[0]
print("We cut", lost_rows, "data points when cleaning data leaving",train_after[0],"data points.")

We cut 2353 data points when cleaning data leaving 71758 data points.
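
The total above does not show which cleaning step removed which rows. A minimal sketch of a per-step tally follows, assuming the same sequence of `dropna` calls as the function; the helper `tally_dropped_rows`, the step labels, and the toy data are all hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration.
toy = pd.DataFrame({
    "bathrooms": [1.0, np.nan, 2.0, 1.0, 1.5],
    "bedrooms": [1.0, 1.0, np.nan, 2.0, 1.0],
    "host_since": ["2015-01-01", "2016-05-05", "2014-03-03", None, "2017-08-08"],
})

def tally_dropped_rows(df, steps):
    """Apply each dropna step in order and record how many rows it removes."""
    report = {}
    for label, columns in steps:
        before = len(df)
        df = df.dropna(axis=0, subset=columns)
        report[label] = before - len(df)
    return df, report

steps = [("rooms", ["bathrooms", "bedrooms"]), ("host", ["host_since"])]
cleaned, report = tally_dropped_rows(toy, steps)
print(report)
```

Because the steps run in order, a row missing values in several columns is counted only against the first step that removes it.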


In [None]:
# Export CSV Files

In [20]:
train.to_csv('./data/archive-2/train_2.csv')

In [21]:
test.to_csv('./data/archive-2/test_2.csv')