# Airbnb New User Bookings

## Inferential Statistics

We will be applying statistical tools to gain some inferences and insights into the kind of data we are dealing with and disovering relationships between various features of our dataset.


Does the gender of a person affect the first country s/he books an Airbnb in? To answer this question, we will have to test the relationship between two categorical variables: Gender and Age. 

#We will consider only those users who have enlisted their gender as male or female (Unknown and other genders are not included in this analysis).

#We do not consider users who have never booked an Airbnb or have booked in a country not enlisted as a class (NDF and Other).

In [54]:
%matplotlib inline
import pandas as pd
import numpy as np
from scipy import stats
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_selection import RFE
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression



In [55]:
# Load the data into DataFrames
df_train = pd.read_csv('train_users_2.csv')
df_test = pd.read_csv('test_users.csv')
sessions = pd.read_csv('sessions.csv')
df_agb = pd.read_csv('age_gender_bkts.csv')
countries = pd.read_csv('countries.csv')

In [56]:
df_agb[df_agb['year'].isnull()]
# there's no null value in age_gender_bkts.csv

Unnamed: 0,age_bucket,country_destination,gender,population_in_thousands,year


In [4]:
df_agb['gender'].value_counts()

male      210
female    210
Name: gender, dtype: int64

In [5]:
df_agb['age_bucket'].value_counts()

10-14    20
95-99    20
85-89    20
65-69    20
70-74    20
90-94    20
100+     20
30-34    20
40-44    20
45-49    20
60-64    20
0-4      20
35-39    20
75-79    20
55-59    20
25-29    20
5-9      20
80-84    20
50-54    20
15-19    20
20-24    20
Name: age_bucket, dtype: int64

The gender can also be turned into a categorical binary variable. 
We represent male with 0 and female with 1.
Again, we do this just in case we require this variable to function as a numerical quantity. 

In [6]:
df_agb['gender'] = df_agb['gender'].apply(lambda x: 0 if x == 'male' else 1)
df_agb['gender'].value_counts()

1    210
0    210
Name: gender, dtype: int64

In [7]:
df_agb

Unnamed: 0,age_bucket,country_destination,gender,population_in_thousands,year
0,100+,AU,0,1.0,2015.0
1,95-99,AU,0,9.0,2015.0
2,90-94,AU,0,47.0,2015.0
3,85-89,AU,0,118.0,2015.0
4,80-84,AU,0,199.0,2015.0
...,...,...,...,...,...
415,95-99,US,0,115.0,2015.0
416,90-94,US,0,541.0,2015.0
417,15-19,US,1,10570.0,2015.0
418,85-89,US,0,1441.0,2015.0


In [8]:
df_agb['year'].value_counts()

2015.0    420
Name: year, dtype: int64

In [9]:
df_agb = df_agb.drop('year', axis=1)
df_agb.head()

Unnamed: 0,age_bucket,country_destination,gender,population_in_thousands
0,100+,AU,0,1.0
1,95-99,AU,0,9.0
2,90-94,AU,0,47.0
3,85-89,AU,0,118.0
4,80-84,AU,0,199.0


In [11]:
#We now view training set
df_train = pd.read_csv('train_users_2.csv')
df_train.head()

Unnamed: 0,id,date_account_created,timestamp_first_active,date_first_booking,gender,age,signup_method,signup_flow,language,affiliate_channel,affiliate_provider,first_affiliate_tracked,signup_app,first_device_type,first_browser,country_destination
0,gxn3p5htnn,2010-06-28,20090319043255,,-unknown-,,facebook,0,en,direct,direct,untracked,Web,Mac Desktop,Chrome,NDF
1,820tgsjxq7,2011-05-25,20090523174809,,MALE,38.0,facebook,0,en,seo,google,untracked,Web,Mac Desktop,Chrome,NDF
2,4ft3gnwmtx,2010-09-28,20090609231247,2010-08-02,FEMALE,56.0,basic,3,en,direct,direct,untracked,Web,Windows Desktop,IE,US
3,bjjt8pjhuk,2011-12-05,20091031060129,2012-09-08,FEMALE,42.0,facebook,0,en,direct,direct,untracked,Web,Mac Desktop,Firefox,other
4,87mebub9p4,2010-09-14,20091208061105,2010-02-18,-unknown-,41.0,basic,0,en,direct,direct,untracked,Web,Mac Desktop,Chrome,US


In [13]:
# replace age-values that "above 120 and below 16" with NaN to denote that we do not know the real age of these people.
df_train['age'] = df_train['age'].apply(lambda x: np.nan if x > 120 else x)
df_train['age'] = df_train['age'].apply(lambda x: np.nan if x < 16 else x)

In [14]:
#We exclude unrelated/noised/unknown
df_inf = df_train[(df_train['age'].notnull()) & (df_train['country_destination'] != 'NDF') & (df_train['country_destination'] != 'other') & (df_train['gender'] != 'OTHER') & (df_train['gender'].notnull())]
df_inf = df_inf[['id', 'gender', 'country_destination', 'age']]
df_inf.head()

Unnamed: 0,id,gender,country_destination,age
2,4ft3gnwmtx,FEMALE,US,56.0
4,87mebub9p4,-unknown-,US,41.0
6,lsw9q7uk0j,FEMALE,US,46.0
7,0d01nltbrs,FEMALE,US,47.0
8,a1vcnhxeij,FEMALE,US,50.0


In [15]:
df_inf['country_destination'].value_counts()

US    48232
FR     3679
IT     2016
GB     1760
ES     1692
CA     1066
DE      840
NL      594
AU      434
PT      156
Name: country_destination, dtype: int64

In [16]:
df_inf['gender'].value_counts()

FEMALE       26868
MALE         22941
-unknown-    10660
Name: gender, dtype: int64

In [17]:
df_inf['age'].value_counts()

30.0     3243
31.0     3115
32.0     3080
29.0     3065
28.0     3046
         ... 
96.0        2
111.0       2
16.0        1
92.0        1
113.0       1
Name: age, Length: 97, dtype: int64

## Hypothesis Testing： Is there a significant difference between males and females mean age?

Null Hypothesis: There is no significant difference between males and females.

Alternate Hypothesis: There is a significant difference between males and females mean age.

We will assume our significance level, 
α
 to be 0.05.

In [18]:
df_inf.head()

Unnamed: 0,id,gender,country_destination,age
2,4ft3gnwmtx,FEMALE,US,56.0
4,87mebub9p4,-unknown-,US,41.0
6,lsw9q7uk0j,FEMALE,US,46.0
7,0d01nltbrs,FEMALE,US,47.0
8,a1vcnhxeij,FEMALE,US,50.0


In [19]:
df_inf.country_destination.describe()

count     60469
unique       10
top          US
freq      48232
Name: country_destination, dtype: object

In [20]:
df_inf.gender.describe()

count      60469
unique         3
top       FEMALE
freq       26868
Name: gender, dtype: object

In [21]:
df_inf.country_destination.unique()

array(['US', 'CA', 'FR', 'IT', 'GB', 'ES', 'NL', 'DE', 'AU', 'PT'],
      dtype=object)

In [22]:
df_inf.age.unique()

array([ 56.,  41.,  46.,  47.,  50.,  36.,  33.,  31.,  29.,  30.,  40.,
        26.,  32.,  35.,  37.,  42.,  44.,  34.,  19.,  52.,  57.,  49.,
        54.,  28.,  69.,  43.,  39.,  25.,  65.,  38.,  63.,  18.,  45.,
        60.,  48.,  51.,  61.,  64.,  70.,  67.,  55.,  73., 104.,  66.,
       105.,  68.,  27.,  53.,  58.,  75.,  59.,  79.,  62.,  72.,  24.,
       101.,  98.,  74.,  23.,  87.,  92.,  71.,  21.,  22.,  78.,  86.,
       103.,  81.,  95.,  82.,  77., 107.,  85.,  17., 115.,  83., 110.,
        20., 102.,  91.,  97.,  88., 113.,  93., 106.,  80., 109.,  76.,
        96., 108., 100., 111.,  90.,  99.,  89.,  84.,  16.])

In [23]:
df_inf = pd.DataFrame(df_inf)
df_inf.to_csv('inferential_statistics.csv')

In [24]:

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
%matplotlib inline
from scipy.stats import norm
import scipy.stats
import math


In [30]:
# Male Age
male = df_inf[df_inf.gender == 'MALE']
print(male.shape)
male.head()

(22941, 4)


Unnamed: 0,id,gender,country_destination,age
21,qsibmuz9sx,MALE,US,30.0
58,fp6ndcm5ak,MALE,US,52.0
64,0xvbruzuzz,MALE,CA,35.0
85,6qx6xl5eho,MALE,US,33.0
96,4ktx9s53w5,MALE,US,34.0


In [44]:
male_mean_age = male.age.mean()
print("Average Male age is : " + str(male_mean_age))

male_std_age = male.age.std()
print("std. for male age is : " + str(male_std_age))

Average Male age is : 37.03696438690554
std. for male age is : 12.460498446818018


In [45]:
# Female Age
female = df_inf[df_inf.gender == 'FEMALE']
print(female.shape)
female.head()

(26868, 4)


Unnamed: 0,id,gender,country_destination,age
2,4ft3gnwmtx,FEMALE,US,56.0
6,lsw9q7uk0j,FEMALE,US,46.0
7,0d01nltbrs,FEMALE,US,47.0
8,a1vcnhxeij,FEMALE,US,50.0
10,yuuqmid2rp,FEMALE,US,36.0


In [46]:
female_mean_age = female.age.mean()
print("Average Female age is: " + str(female_mean_age))

female_std_age = male.age.std()
print("std. for Female age is: " + str(male_std_age))

Average Female age is: 36.199717135626024
std. for Female age is: 12.460498446818018


In [47]:
# Diference in mean of Male and Female age

mean_diff = male_mean_age - female_mean_age
print("Mean difference is:"+ str(mean_diff))

# Standard Error Calculation

SE=((male_std_age**2)/22947 + (female_std_age**2)/26874)**0.5
print("Standard error is:", SE)

Mean difference is:0.8372472512795142
Standard error is: 0.11199858127457232


In [49]:
# Use 0.05 Significance level in two sample t-test
t_val=((male_mean_age - female_mean_age)-0)/SE
print(t_val)

# two-sided pvalue = Prob(abs(t)>tt)
p_value = stats.t.sf(np.abs(2.29), 128)*2  
print(p_value)

7.47551658022296
0.023657711289024146


In [53]:
if p_value < 0.05:
    print('Null Hypothesis--"mean of female and male age are the same" will be rejected.')
    print('There is significance difference between male and female mean age.')
else: 
    print('We do not reject the null, "Mean of female and male age are the same" IS TRUE!')

Null Hypothesis--"mean of female and male age are the same" will be rejected.
There is significance difference between male and female mean age.
