### Introduction

In this notebook, I clean a raw file of San Francisco listings data from Inside Airbnb.

**Read in libraries**

In [201]:
import numpy as np
import pandas as pd
import swifter
import matplotlib.pyplot as plt
import seaborn as sns

**Set notebook preferences**

In [202]:
#Set pandas preferences
pd.set_option('display.max_columns', 200)
pd.set_option('display.max_rows', 200)

**Read in data**

In [203]:
#Set path to data on local machine
path = r'C:\Users\kishe\Documents\Data Science\Projects\Python\In Progress\Airbnb - San Francisco\Data\01_Raw\SF Airbnb'

#Read in data
df = pd.read_csv(path + '/2020_0519_Aggregated_Listings.csv', dtype={'zipcode': 'category'},
                 parse_dates=['host_since', 'last_review', 'first_review'], index_col=0)


### Data Overview

**Preview Data**

In [None]:
#Display data, print shape
print('Data shape:', df.shape)
display(df.head(3))

**View data description**

In [None]:
#View data description
df.describe().T

## Data Cleaning

### Drop Columns

**Drop mostly homogeneous/redundant columns and columns with only missing values**

In [None]:
df.head()

In [None]:
#Keep cols that do not hold a single repeated value (nunique != 1)
df = df.loc[:,(df.nunique() != 1)]

In [None]:
#Drop columns that contain only missing values
df.dropna(axis =1,how = 'all', inplace = True)

#Drop redundant columns
df.drop(['jurisdiction_names', 'market','state','neighbourhood'], axis = 1, inplace = True)
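
The two filters above complement each other: `nunique()` ignores missing values by default, so an all-missing column reports 0 unique values and survives the `!= 1` test until `dropna` removes it. A minimal sketch on a toy frame (the column names and values are made up for illustration, not taken from the listings file):

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical columns, not the Airbnb data
toy = pd.DataFrame({
    'constant': ['US', 'US', 'US'],            # one unique value
    'varied': [1, 2, 3],                       # several unique values
    'all_missing': [np.nan, np.nan, np.nan],   # nunique() is 0 here
})

# Same filter as in the notebook: removes 'constant' only
kept = toy.loc[:, toy.nunique() != 1]
print(kept.columns.tolist())

# The all-missing column is removed by the second step
cleaned = kept.dropna(axis=1, how='all')
print(cleaned.columns.tolist())
```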

In [None]:
#Inspect cols with <=2 unique values
inspect = df.loc[:, (df.nunique() <=2)].columns.to_list()

#Check
display(df[inspect].head(3))

In [None]:
#Create dictionary for mapping
mapping = {'t':1,'f':0}

#Map 1's and 0's on t's and f's
df[inspect] = df[inspect].apply(lambda x: x.map(mapping, na_action='ignore'))

#Check
display(df[inspect].head(3))
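
One caveat with selecting on `nunique() <= 2`: a column with two unique values that are not `t`/`f` would be mapped to all missing values. A possible guard, sketched here on a toy frame with hypothetical column names, is to map only the columns whose non-missing values are all `t` or `f`:

```python
import pandas as pd

# Toy frame with hypothetical columns, not the Airbnb data
toy = pd.DataFrame({
    'is_superhost': ['t', 'f', None],
    'bed_type': ['Real Bed', 'Futon', 'Real Bed'],  # two unique values, but not a flag
})

mapping = {'t': 1, 'f': 0}

# Keep only columns whose non-missing values are all keys of the mapping
flag_cols = [col for col in toy.columns
             if toy[col].dropna().isin(list(mapping)).all()]

# Map 1's and 0's on t's and f's for the flag columns only
toy[flag_cols] = toy[flag_cols].apply(lambda s: s.map(mapping, na_action='ignore'))
print(flag_cols)
```

Columns that are not true flags, such as `bed_type` in this sketch, are left untouched.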

**Drop columns that contain URL data or pertain to web scraping**

In [None]:
#Subset column headers containing 'url' or 'scrape' and store in drop
drop = list(df.filter(regex='url|scrape').columns)

#Drop the columns listed in drop and check
df.drop(columns=drop, inplace=True)
df.head(1)
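
`DataFrame.filter(regex=...)` matches the pattern anywhere in each column label, so `'url|scrape'` picks up any header containing either substring. A small sketch on an empty toy frame (the headers are illustrative, chosen to resemble typical listing columns):

```python
import pandas as pd

# Empty toy frame; only the column labels matter here (illustrative names)
toy = pd.DataFrame(columns=['listing_url', 'last_scraped', 'calendar_last_scraped',
                            'price', 'picture_url'])

# Labels containing 'url' or 'scrape' anywhere in the name
drop = toy.filter(regex='url|scrape').columns.tolist()

toy = toy.drop(columns=drop)
print(drop)
print(toy.columns.tolist())
```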

### Check for high correlations between features

**Prepare data**

In [None]:
#Create correlation matrix from the numeric columns and capture absolute values of correlations
c = df.select_dtypes(include='number').corr().abs()

#Create a df that stores correlations between features >.9
s = c.unstack()
so = s.sort_values(kind="quicksort").reset_index()
so.columns = ['feat1','feat2','corr']
so = so.loc[ (so.feat1 != so.feat2 )& (so['corr'] > .9)]

#Capture list of features
feats =so.feat1.unique()

#Subset df by cols in feats and create corr
corr= df[feats].corr()
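
Unstacking the full matrix lists every pair twice (once per ordering). An alternative sketch, not the approach used in this notebook, keeps only the upper triangle so each pair appears once; the toy columns `a`, `b`, and `c` are made up for illustration:

```python
import numpy as np
import pandas as pd

# Toy numeric frame: b is an exact multiple of a, c is only loosely related
toy = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [2, 4, 6, 8, 10],
    'c': [5, 3, 4, 1, 2],
})

corr_abs = toy.corr().abs()

# Boolean mask for the strict upper triangle (excludes the diagonal)
mask = np.triu(np.ones(corr_abs.shape, dtype=bool), k=1)

# One entry per unordered pair, then keep correlations above .9
pairs = corr_abs.where(mask).stack()
high = pairs[pairs > 0.9]
print(high)
```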

**Create heatmap**

In [None]:
#Create fig
f, ax = plt.subplots(figsize = (13,13))

#Plot corr as heat map
sns.heatmap(data=corr, annot=True, fmt='.1%', cmap='coolwarm', ax=ax,
            linewidths=1.0, square=True);

**Drop cols with high collinearity**

In [None]:
#Cols with high collinearity
drop = ['calculated_host_listings_count_entire_homes','maximum_nights_avg_ntm', 'maximum_maximum_nights',
        'maximum_minimum_nights','minimum_minimum_nights', 'minimum_nights_avg_ntm', 'host_total_listings_count']

#Drop the columns listed in drop
df.drop(drop, axis=1, inplace = True)

### Clean up object and numeric columns

**Clean up numeric columns**

In [None]:
#Filter cols pertaining to prices, fees, deposits, and rates and assign col names as a list to money_cols
money_cols = df.filter(regex = 'people|deposit|price|fee$|rate').columns.tolist()

#Remove $, commas, and % and set type as numeric for money_cols
df[money_cols] = df[money_cols].replace(r'[$,%]', '', regex=True).astype('float')

#Check
display(df[money_cols].head(3))
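
Inside a character class, `|` is a literal pipe rather than an alternation, so `[$,%]` is enough to strip the three symbols. A quick sketch of the conversion on toy values (the figures are invented; the column names mirror the kind matched by the filter above):

```python
import pandas as pd

# Toy frame with invented values
toy = pd.DataFrame({
    'price': ['$1,250.00', '$85.00', '$0.00'],
    'host_response_rate': ['100%', '95%', '80%'],
})

# Strip $, commas, and % from every cell, then cast to float
cleaned = toy.replace(r'[$,%]', '', regex=True).astype('float')
print(cleaned)
```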

**Clean up object columns**

In [None]:
#Create list of columns to apply cleaning to
objects = df.select_dtypes('object').columns.to_list()

#Check
display(df[objects].head(3))

In [None]:
#Replace punctuation and underscores with spaces (regex=True is required in newer pandas versions)
df[objects] = df[objects].apply(lambda x : x.str.replace(r'[^\w\s]|_', ' ', regex=True))

#Check
display(df[objects].head())
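
Because each punctuation mark becomes a space, the result can contain runs of spaces. The sketch below shows the replacement on toy strings and adds an optional whitespace-collapsing step that is not part of the original cell:

```python
import pandas as pd

# Toy strings, invented for illustration
toy = pd.Series(['Real_Bed', 'Wi-Fi, Kitchen', None])

# Same idea as the cell above: punctuation and underscores become spaces
cleaned = toy.str.replace(r'[^\w\s]|_', ' ', regex=True)

# Optional extra step (an assumption, not in the notebook): collapse repeated whitespace
collapsed = cleaned.str.replace(r'\s+', ' ', regex=True).str.strip()
print(collapsed.tolist())
```

Missing values pass through the `.str` methods unchanged, so no separate handling is needed for them here.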

### Missing Data

In [None]:
#Import missing_calculator
from Missing_Stats import missing_calculator

#Store missing statistics about df
missing =missing_calculator(df)

#Capture index where percentage missing >40% and store list
drop = missing.loc[missing.percentage > 40].index.tolist()
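
`Missing_Stats` is a local module whose source is not shown here. From its usage, it appears to return a frame indexed by column name with a `percentage` column. The function below is a hypothetical stand-in written under that assumption (`missing_calculator_sketch` and `n_missing` are invented names), not the actual implementation:

```python
import numpy as np
import pandas as pd

def missing_calculator_sketch(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in: count and percentage of missing values per column."""
    n_missing = frame.isna().sum()
    stats = pd.DataFrame({
        'n_missing': n_missing,
        'percentage': n_missing / len(frame) * 100,
    })
    # Most incomplete columns first
    return stats.sort_values('percentage', ascending=False)

# Toy frame with invented values
toy = pd.DataFrame({
    'square_feet': [np.nan, np.nan, 800.0, np.nan],
    'price': [100.0, 150.0, np.nan, 90.0],
    'beds': [1, 2, 2, 3],
})

missing = missing_calculator_sketch(toy)

# Same selection as in the notebook: columns with more than 40% missing
drop = missing.loc[missing.percentage > 40].index.tolist()
print(drop)
```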

In [None]:
#Drop the columns in drop (more than 40% missing, such as square_feet), keeping the rest for now
df.drop(drop, axis = 1, inplace = True)

### Write to csv

In [None]:
#Print final shape of df
print('Shape of cleaned data:', df.shape)

#Set path to local machine
path = r'C:\Users\kishe\Documents\Data Science\Projects\Python\In Progress\Airbnb - San Francisco\Data\02_Cleaned'

#Write file
df.to_csv(path + '/2020_0520_Listings_Cleaned.csv')