## Springboard Capstone Project 1

## Airbnb New User Bookings

### Inferential Statistics

#### Gender Preference for Airbnb Bookings

In this section, we apply statistical tools to draw inferences about the data we are dealing with and to discover relationships between the features of our dataset.

To begin, let us check whether there is a gender-based preference for certain countries. In other words, does the gender of a person affect the first country they book an Airbnb in? To answer this question, we have to test the relationship between two categorical variables: Gender and Destination Country. Since the destination country takes more than two values, we will use the Chi-Square test of independence.
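For reference, the statistic behind the chi-square test compares the observed count in each gender-by-country cell with the count expected if the two variables were independent. The symbols below are notation introduced here for illustration ($O_{ij}$ observed count, $E_{ij}$ expected count, $R_i$ row total, $C_j$ column total, $N$ grand total, $r$ rows, $c$ columns):

$$
E_{ij} = \frac{R_i \, C_j}{N}, \qquad
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad
\text{dof} = (r - 1)(c - 1)
$$

A large $\chi^2$ relative to its degrees of freedom means the observed counts sit far from what independence would predict.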

Before we begin, we will make certain assumptions:

We will consider only those users who have listed their gender as male or female. Unknown and other genders are not included in this analysis.

We do not consider users who have never booked an Airbnb (NDF) or who have booked in a country that is not listed as its own class (other).
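The two assumptions above can be sketched as a pair of boolean masks. This is a minimal illustration on a hypothetical toy DataFrame (`toy` and its rows are made up); only the column names and the labels `MALE`, `FEMALE`, `OTHER`, `-unknown-`, `NDF` and `other` come from the dataset used in this notebook.

```python
import pandas as pd

# Hypothetical toy rows that mimic the columns and labels of train_users_2.csv.
toy = pd.DataFrame({
    'id': ['a1', 'b2', 'c3', 'd4', 'e5', 'f6'],
    'gender': ['FEMALE', '-unknown-', 'MALE', 'OTHER', 'MALE', 'FEMALE'],
    'country_destination': ['US', 'US', 'FR', 'FR', 'NDF', 'other'],
})

# Assumption 1: keep only users who listed their gender as male or female.
known_gender = toy['gender'].isin(['MALE', 'FEMALE'])

# Assumption 2: keep only bookings made in a named destination country.
named_country = ~toy['country_destination'].isin(['NDF', 'other'])

toy_inf = toy.loc[known_gender & named_country, ['id', 'gender', 'country_destination']]
print(toy_inf)
```

Using `isin` with an explicit allow-list for gender drops `-unknown-` as well as `OTHER`, because `-unknown-` is stored as an ordinary string rather than as a missing value.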

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
from scipy import stats
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_selection import RFE
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

In [2]:
sns.set_style('whitegrid')
plt.style.use('ggplot')
%matplotlib inline

In [4]:
# Load the data into DataFrames
df_train = pd.read_csv('train_users_2.csv')
df_test = pd.read_csv('test_users.csv')
sessions = pd.read_csv('sessions.csv')
df_agb = pd.read_csv('age_gender_bkts.csv')
countries = pd.read_csv('countries.csv')

In [5]:
df_agb[df_agb['year'].isnull()]
# The result is empty: no row in age_gender_bkts.csv has a null 'year'

Unnamed: 0,age_bucket,country_destination,gender,population_in_thousands,year


In [6]:
df_agb['gender'].value_counts()

male      210
female    210
Name: gender, dtype: int64

The gender can also be turned into a binary variable. Let us represent male with 0 and female with 1. We do this only in case we later need the variable to function as a numerical quantity. There is no immediate need for this step, so it can be skipped.

In [8]:
df_agb['gender'] = df_agb['gender'].apply(lambda x: 0 if x == 'male' else 1)
df_agb['gender'].value_counts()

1    210
0    210
Name: gender, dtype: int64

In [9]:
df_agb['year'].value_counts()

2015.0    420
Name: year, dtype: int64

In [10]:
# 'year' is constant (2015 for every row), so it carries no information
df_agb = df_agb.drop('year', axis=1)
df_agb.head()

Unnamed: 0,age_bucket,country_destination,gender,population_in_thousands
0,100+,AU,0,1.0
1,95-99,AU,0,9.0
2,90-94,AU,0,47.0
3,85-89,AU,0,118.0
4,80-84,AU,0,199.0


In [11]:
# Treat ages above 120 as invalid and replace them with NaN, since the real age of these users is unknown.
df_train['age'] = df_train['age'].apply(lambda x: np.nan if x > 120 else x)

In [12]:
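# Drop NDF / 'other' destinations and gender 'OTHER'. Note that '-unknown-' is a
# string label rather than a null, so notnull() does not remove those rows.
df_inf = df_train[(df_train['country_destination'] != 'NDF') &
                  (df_train['country_destination'] != 'other') &
                  (df_train['gender'] != 'OTHER') &
                  (df_train['gender'].notnull())]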
df_inf = df_inf[['id', 'gender', 'country_destination']]
df_inf.head()

Unnamed: 0,id,gender,country_destination
2,4ft3gnwmtx,FEMALE,US
4,87mebub9p4,-unknown-,US
5,osr2jwljor,-unknown-,US
6,lsw9q7uk0j,FEMALE,US
7,0d01nltbrs,FEMALE,US


In [13]:
df_inf['gender'].value_counts()

FEMALE       28833
-unknown-    25549
MALE         24278
Name: gender, dtype: int64

In [14]:
df_inf['country_destination'].value_counts()

US    62260
FR     5010
IT     2830
GB     2321
ES     2245
CA     1423
DE     1058
NL      759
AU      538
PT      216
Name: country_destination, dtype: int64

### Hypothesis Testing

Null Hypothesis: There is no relationship between country preference and the sex of the customer.

Alternative Hypothesis: There is a relationship between country preference and the sex of the customer.

We set our significance level, $\alpha$, to 0.05.
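To make the decision rule concrete, the sketch below computes the degrees of freedom and the critical value that the test statistic must exceed at the chosen significance level. The helper name `chi2_critical_value` is hypothetical and introduced only for this illustration; `stats.chi2.ppf` is the inverse CDF of the chi-square distribution in SciPy.

```python
from scipy import stats

def chi2_critical_value(n_rows, n_cols, alpha=0.05):
    """Return (dof, critical value) for a chi-square test of independence."""
    dof = (n_rows - 1) * (n_cols - 1)
    # The statistic must exceed the (1 - alpha) quantile to reject at level alpha.
    return dof, stats.chi2.ppf(1 - alpha, dof)

# Three gender labels and ten destination countries appear in the value counts above.
dof, critical = chi2_critical_value(3, 10)
print(dof, round(critical, 2))
```

Comparing the statistic with this critical value is equivalent to comparing the p-value with $\alpha$; the p-value form is usually more convenient because it does not depend on the table size.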

In [22]:
# Contingency table of booking counts: one row per gender, one column per destination
observed = df_inf.pivot_table(values='id', index='gender', columns='country_destination', aggfunc='count')
observed

gender,AU,CA,DE,ES,FR,GB,IT,NL,PT,US
-unknown-,143,491,284,715,1713,758,1040,227,69,20109
FEMALE,207,455,358,853,1962,881,1091,254,78,22694
MALE,188,477,416,677,1335,682,699,278,69,19457


In [16]:
chi2, p, dof, expected = stats.chi2_contingency(observed)
chi2

176.12475432418529

In [17]:
p

5.6057373970830885e-28

#### Result
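The p-value that we have obtained is far below our chosen significance level.

Therefore, we reject the null hypothesis in favor of the alternative hypothesis.

The data suggest a relationship between country preference and the recorded gender of the customer. Note that the observed table includes the `-unknown-` gender category, so this result is not a comparison of male and female users alone.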
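Because the observed table above also contains an `-unknown-` row, it can be useful to repeat the test on the FEMALE and MALE rows alone, which is what the stated assumptions describe. The sketch below is one way to do that, with the counts copied by hand from the observed table; the variable names (`observed_mf`, `chi2_mf`, and so on) are introduced here for illustration.

```python
import numpy as np
from scipy import stats

# Column order: AU, CA, DE, ES, FR, GB, IT, NL, PT, US
# Counts copied from the FEMALE and MALE rows of the observed table above.
female = [207, 455, 358, 853, 1962, 881, 1091, 254, 78, 22694]
male = [188, 477, 416, 677, 1335, 682, 699, 278, 69, 19457]

observed_mf = np.array([female, male])
chi2_mf, p_mf, dof_mf, expected_mf = stats.chi2_contingency(observed_mf)

alpha = 0.05
print(f"chi2 = {chi2_mf:.2f}, dof = {dof_mf}, p = {p_mf:.3g}")
print("reject null" if p_mf < alpha else "fail to reject null")
```

With two rows and ten columns the test has 9 degrees of freedom, so its statistic is not directly comparable with the one reported for the three-row table.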