### In this kernel, we look at some descriptive statistics for the Berlin Airbnb dataset, starting with univariate analysis.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
print(os.listdir("../input"))

In [None]:
lis = pd.read_csv("../input/listings.csv")
nei = pd.read_csv("../input/neighbourhoods.csv")
rev = pd.read_csv("../input/reviews.csv")
lis_sum = pd.read_csv("../input/listings_summary.csv")
rev_sum = pd.read_csv("../input/reviews_summary.csv")
cal_sum = pd.read_csv("../input/calendar_summary.csv")

In [None]:
lis.head()

## DATA TYPES

The following data types are available in base Python:

    boolean
    integer
    float
    string
    list
    None
    long (Python 2 only; merged into integer in Python 3)
    complex
    object
    
    
    Numerical or Quantitative
       - Discrete
            Integer (int)
       - Continuous
            Float (float)
            
            
    Categorical or Qualitative
            - Nominal
                Boolean (bool)
                String (str)
                None (NoneType)
            - Ordinal
                Only defined by how you use the data
                Often important when creating visuals

The `dtypes` attribute of a pandas DataFrame lists the data type of each column in the dataset.

In [None]:
lis.dtypes
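
Once the dtypes are known, it is often handy to split the columns into numeric and non-numeric groups before deciding which summaries apply. The sketch below uses a tiny made-up DataFrame (`demo` is a hypothetical stand-in, not the real listings file) and pandas' `select_dtypes`.

```python
import pandas as pd

# Hypothetical stand-in for the listings table: column names mirror the
# dataset, but the values are made up for illustration.
demo = pd.DataFrame({
    "neighbourhood_group": ["Mitte", "Pankow", "Mitte"],
    "room_type": ["Private room", "Entire home/apt", "Private room"],
    "price": [60, 90, 45],
    "reviews_per_month": [1.2, 0.4, 3.1],
})

# Numeric columns are candidates for histograms and numerical summaries;
# the remaining (text) columns are candidates for frequency tables.
numeric_cols = demo.select_dtypes(include="number").columns.tolist()
categorical_cols = demo.select_dtypes(exclude="number").columns.tolist()

print(demo.dtypes)
print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
```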

## UNIVARIATE ANALYSIS

Uni- means one and bi- means two: think of a unicycle, which has one wheel, and a bicycle,
which has two. Multi- means many, and in statistics it is often used to mean “more
than two.”

Univariate statistics such as the mean therefore describe characteristics
of one variable, and the bar chart and histogram are examples of univariate
graphic displays.

### Frequency Table - A type of Univariate analysis and a common way to summarize categorical data

In [None]:
ng = lis['neighbourhood_group'].value_counts().reset_index()
ng.columns = ['Neighbourhood_Group', 'Count']
ng['Percent'] = ng['Count']/ng['Count'].sum() * 100
ng

    Friedrichshain-Kreuzberg is the neighbourhood group with the most listings in the dataset, at about 24%.
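
As a side note, the count and percent columns can also be built directly from `value_counts`, using `normalize=True` for the shares. This is a minimal sketch on a made-up Series (`groups` is hypothetical data, not the real column), shown as one possible alternative to the manual division.

```python
import pandas as pd

# Made-up neighbourhood groups for illustration only.
groups = pd.Series(
    ["Mitte", "Pankow", "Mitte", "Lichtenberg", "Mitte", "Pankow"],
    name="neighbourhood_group",
)

# value_counts() gives counts; normalize=True gives fractions that sum to 1.
freq = pd.DataFrame({
    "Count": groups.value_counts(),
    "Percent": groups.value_counts(normalize=True) * 100,
})

print(freq)
```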

### Bar Chart - A type of Univariate analysis and a common way to visualize categorical data


The x-axis shows the name of each neighbourhood group and the y-axis shows its frequency, i.e. the count of listings in that group.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

from matplotlib import rcParams
rcParams['figure.figsize'] = 13, 10

ax = sns.barplot(x="Neighbourhood_Group", y="Count", data=ng)
ax.set_xticklabels(ax.get_xticklabels(), rotation=45)

    The same information as the frequency table, shown here with Count as the measure.

### Pie Chart - A type of Univariate analysis and a not-so-common way to visualize categorical data



Here we will look at the types of rooms available in the listings.

In [None]:
ngp = lis['room_type'].value_counts().reset_index()
ngp.columns = ['room_type', 'Count']
ngp['Percent'] = ngp['Count']/ngp['Count'].sum() * 100

In [None]:
import matplotlib.pyplot as plt

labels = ngp.room_type.tolist()
sizes = ngp['Percent'].tolist()
# Pull out the first (largest) slice; works for any number of room types
explode = [0.1] + [0] * (len(sizes) - 1)

fig1, ax1 = plt.subplots()
ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',
        shadow=True, startangle=45)
ax1.axis('equal')

plt.show()

> ### More than 50% of the rooms listed are private rooms, as seen in the chart above.


> #### NOTE: For fancier plots, check out the Bokeh or Plotly libraries. Neither is used here, as pie charts are not normally recommended.

So, for categorical data:

    - Frequency Tables - Great for Numerical Summaries
    - Bar Charts -  Great for Visualization

### QUANTITATIVE DATA - UNIVARIATE ANALYSIS

Variables that have a numerical value (quantity) on which we can perform mathematical operations.

They are divided into two kinds:
    - Discrete: age in whole years, number of children in a room, etc.
    - Continuous: height, weight, etc.

In [None]:
price = lis['price']
nor = lis['number_of_reviews']

### Histogram - A type of Univariate analysis and a Common way to visualize Quantitative data

#### We will look at the distribution of the price first.

In [None]:
import warnings
warnings.filterwarnings('ignore')
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(12,9))

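# sns.distplot is deprecated; sns.histplot (seaborn >= 0.11) is its replacement.
# bins=50 keeps the bin count bounded despite the long right tail.
sns.histplot(price, kde=False, bins=50, edgecolor='black', ax=ax[0])
sns.histplot(price, kde=True, bins=50, edgecolor='black', ax=ax[1])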

ax[1].set_title('Histogram of Listing Prices With KDE')
ax[1].set_xlim(0,1000)
ax[1].set_ylabel('Frequency')

ax[0].set_ylabel('Frequency')
ax[0].set_xlim(0,1000)
ax[0].set_title('Histogram of Listing Prices Without KDE')

**There are 4 Main aspects to a Histogram:**

**Shape**: Overall appearance of the histogram. It can be symmetric, bell-shaped, left-skewed, right-skewed, etc.
The price histogram above is right-skewed and definitely not normally distributed.

**Center**: Mean or median. The two coincide only when the distribution is symmetric, so they differ here.
In a right-skewed distribution (like our case above), the mean is typically **greater** than the median, whereas in a left-skewed distribution,
the mean is typically **less** than the median.

**Spread**: How far our data spreads, measured by the range, interquartile range (IQR), standard deviation, or variance. For a histogram, it is typically read off as the range (max - min): `price.max() - price.min()`.

**Outliers**: Data points that fall far from the bulk of the data, normally identified using the IQR. We can see that most of the prices are in the 0-200 range, so anything beyond roughly 450-500 can be considered an outlier, and in this example there are a lot of such values.
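
The relationship between shape and center can be checked numerically. The sketch below generates synthetic right-skewed values (a log-normal sample chosen purely for illustration, not the real prices) and compares the mean with the median, along with pandas' sample skewness.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed "prices" drawn from a log-normal distribution.
# These are NOT the listing prices, just a stand-in with a long right tail.
prices = pd.Series(rng.lognormal(mean=4.0, sigma=0.6, size=5000))

mean_price = prices.mean()
median_price = prices.median()
skewness = prices.skew()  # positive values indicate a right skew

print(f"mean={mean_price:.1f}, median={median_price:.1f}, skew={skewness:.2f}")
```

On real data, running the same three calls on `lis['price']` should show whether the mean sits above the median, as the histogram suggests.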


A point to remember here is that although a histogram may look similar to a bar chart, bar charts are for **CATEGORICAL DATA**, while histograms are for **QUANTITATIVE OR NUMERICAL DATA**.

### Numerical Summaries for Quantitative Data - A type of Univariate analysis


They are used alongside graphs to give a first impression of what our data looks like.

#### The Numerical Summaries are usually:
    - Min (the smallest value)
    - 1st Quartile (25% of the values are below this)
    - Median (50% of the values are below this)
    - 3rd Quartile (75% of the values are below this)
    - Max (the largest value)
    

Another statistic we can derive is the interquartile range (IQR), defined as Q3 - Q1.
The IQR is another measure of spread.



We will look at the numerical summary for the price variable using the pandas `describe` method.

In [None]:
lis['price'].describe()

So in the above result,

    25% is our Q1
    50% is our median
    75% is our Q3
    std is our standard deviation, roughly the typical distance of our data points from the mean value.
    count is our sample size (the number of non-missing values).
    Q3 - Q1 gives the IQR, which is sometimes a better indication of where the bulk of our data falls.
    The range is not robust to outliers and may not show where most of our data falls, whereas the IQR does.
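
To make the range-versus-IQR point concrete, here is a small worked sketch on made-up numbers (the `values` Series is hypothetical and contains one deliberately extreme entry). It computes the five-number summary with `quantile`, then the IQR and the range.

```python
import pandas as pd

# Made-up nightly prices with one extreme value; not taken from the dataset.
values = pd.Series([20, 30, 35, 40, 50, 60, 75, 90, 500])

q1 = values.quantile(0.25)
median = values.quantile(0.50)
q3 = values.quantile(0.75)
iqr = q3 - q1
value_range = values.max() - values.min()

print(f"min={values.min()}, Q1={q1}, median={median}, Q3={q3}, max={values.max()}")
print(f"IQR={iqr}, range={value_range}")
```

Here the single value of 500 stretches the range to 480, while the IQR stays at 40 and still describes where the middle half of the data lies.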

### Standard Score (Empirical Rule)

    About 68% of the values are expected to fall within 1 standard deviation of the mean, for a normal distribution.
    Similarly, about 95% of the values are expected to fall within 2 standard deviations of the mean, for a normal distribution.
    Similarly, about 99.7% of the values are expected to fall within 3 standard deviations of the mean, for a normal distribution.
    
    In statistics, the standard score is the signed fractional number of standard deviations by which the value of an observation or data point is above the mean value of what is being observed or measured. Observed values above the mean have positive standard scores, while values below the mean have negative standard scores. Standard Score is a very useful statistic because it (a) allows us to calculate the probability of a score occurring within our normal distribution and (b) enables us to compare two scores that are from different normal distributions.
    
    It is calculated by subtracting the population mean from an individual raw score and then dividing the difference by the population standard deviation. It is a dimensionless quantity. This conversion process is called standardizing or normalizing.

    Standard scores are also called z-values, z-scores, normal scores, and standardized variables.
    Computing a z-score requires knowing the mean and standard deviation of the complete population to which a data point belongs; if one only has a sample of observations from the population, then the analogous computation with sample mean and sample standard deviation yields the t-statistic.
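
The z-score formula and the empirical rule can be checked together on simulated data. This is a minimal sketch assuming a synthetic normal sample (the mean of 100 and standard deviation of 15 are arbitrary choices), not a computation on the Airbnb data.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic, normally distributed sample; parameters are arbitrary.
sample = rng.normal(loc=100, scale=15, size=100_000)

# z = (x - mean) / standard deviation; np.std defaults to the population SD.
z = (sample - sample.mean()) / sample.std()

# Share of observations within 1, 2 and 3 standard deviations of the mean.
within_1 = np.mean(np.abs(z) <= 1)
within_2 = np.mean(np.abs(z) <= 2)
within_3 = np.mean(np.abs(z) <= 3)

print(f"within 1 SD: {within_1:.3f}")
print(f"within 2 SD: {within_2:.3f}")
print(f"within 3 SD: {within_3:.3f}")
```

For a skewed variable such as price, the same shares would not be expected to match 68/95/99.7, since the rule applies to normal distributions.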

### Boxplots for Quantitative Data - A type of Graphical Representation for Univariate analysis


There are alternatives to the box plot, such as swarm plots, strip plots, and violin plots, which are especially useful when the data needs to be grouped by a categorical variable.

A Boxplot is basically a Visual Picture of the 5-number summary we saw above. 

The length of the box is the IQR(Q3 - Q1), a nice measure of spread.

![](http://jukebox.esc13.net/interactiveGlossary/HTML_images/boxPlot_example.png)

#### Let us take the availability of the rooms in a year and Visually check the 5-number summary.

In [None]:
import seaborn as sns
sns.set(style="whitegrid")
ax = sns.boxplot(x=lis["availability_365"])

    - From the plot above, we can see the IQR (the length of the box) to be around 125. If the box starts near 0, as the low median suggests, Q3 is also around 125, meaning about 75% of listings are available for at most roughly 4 months of the year.
    - The distribution is also right-skewed, as we can see a long tail of values to the extreme right.
    - The median is approximately 10.
    - The dots seen on the right end are outliers, meaning values more than 1.5 times the IQR above Q3 (beyond the end of the whisker).
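
The whisker rule behind those dots can be reproduced by hand. The sketch below applies the common 1.5 × IQR fences to a small made-up availability Series (hypothetical values, chosen so that one point falls outside the upper fence).

```python
import pandas as pd

# Made-up availability values (days per year); not the real column.
availability = pd.Series([0, 0, 0, 5, 10, 20, 60, 125, 130, 365])

q1 = availability.quantile(0.25)
q3 = availability.quantile(0.75)
iqr = q3 - q1

# Points beyond these fences are drawn as individual dots in a box plot.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = availability[(availability < lower_fence) | (availability > upper_fence)]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"fences: [{lower_fence}, {upper_fence}]")
print("outliers:", outliers.tolist())
```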

In [None]:
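# regex=False: treat '$' as a literal character, not the regex end-of-string anchor
lis_sum['security_deposit'] = lis_sum['security_deposit'].str.replace('$', '', regex=False)
lis_sum['security_deposit'] = lis_sum['security_deposit'].str.replace(',', '', regex=False)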
lis_sum['security_deposit'] = lis_sum['security_deposit'].fillna(0)
lis_sum['security_deposit'] = lis_sum['security_deposit'].astype(str).astype(float)

In [None]:
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly as py
import plotly.graph_objs as go
init_notebook_mode(connected=True)

#### Another box plot, this time an interactive one, to see security deposits grouped by room type

In [None]:
import plotly.express as px
#tips = px.data.tips()
fig = px.box(lis_sum, x="room_type", y="security_deposit")
fig.show()

In [None]:
lis_sum.groupby('room_type')['security_deposit'].mean()

Just to confirm with numbers: the lower values are so small relative to the maximum that the minimum and the other summary values are hard to see in the plot. Keep in mind that the NaNs were filled with 0s.

But it's clear that both the average and the maximum security deposit are highest for entire homes/apartments.

### Multivariate Analysis

#### Two-way contingency table between Neighbourhood group and Room type

In [None]:
lis_sum_t = lis_sum[['neighbourhood_group_cleansed', 'room_type']]
two_cls = pd.crosstab(lis_sum_t.neighbourhood_group_cleansed, lis_sum_t.room_type)
two_cls

In [None]:
two_cls.plot.bar(stacked=True)
#plt.legend(title='mark')
plt.show()
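
Raw counts in a contingency table are dominated by the largest neighbourhood groups, so row percentages can make the room-type mix easier to compare. This is a minimal sketch on a made-up DataFrame (`demo` is hypothetical), using the `normalize="index"` option of `pd.crosstab`.

```python
import pandas as pd

# Made-up rows for illustration; column names mirror the summary listings file.
demo = pd.DataFrame({
    "neighbourhood_group_cleansed": [
        "Mitte", "Mitte", "Mitte", "Pankow", "Pankow", "Mitte",
    ],
    "room_type": [
        "Private room", "Entire home/apt", "Private room",
        "Entire home/apt", "Entire home/apt", "Shared room",
    ],
})

# normalize="index" divides each row by its total, giving the share of
# each room type within a neighbourhood group.
row_pct = pd.crosstab(
    demo["neighbourhood_group_cleansed"],
    demo["room_type"],
    normalize="index",
) * 100

print(row_pct.round(1))
```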

#### Quantitative variables

#### How is security deposit related to number of reviews? 

In [None]:
from matplotlib import rcParams
rcParams['figure.figsize'] = 12, 9

sns.scatterplot(x="number_of_reviews", y="security_deposit", data=lis_sum)

#### Association types in a scatter plot

* Linear Association: The pattern is a line
* Quadratic Association: The pattern is parabolic
* No association: There is no pattern
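
One caveat when putting a number on these patterns: the Pearson correlation coefficient only measures linear association. The sketch below uses synthetic data (not the listings) to show a perfectly quadratic relationship that still has a correlation of essentially zero, which is why the scatter plot should be inspected as well.

```python
import numpy as np

# Synthetic x values, symmetric around zero.
x = np.linspace(-3, 3, 61)

linear = 2 * x + 1   # perfectly linear pattern
quadratic = x ** 2   # perfectly parabolic pattern

# np.corrcoef returns the 2x2 correlation matrix; [0, 1] is r between the pair.
r_linear = np.corrcoef(x, linear)[0, 1]
r_quadratic = np.corrcoef(x, quadratic)[0, 1]

print(f"r (linear)    = {r_linear:.3f}")
print(f"r (quadratic) = {r_quadratic:.3f}")
```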

#### How does the number of bedrooms relate to the cleanliness review score across room types?

In [None]:
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

lis_sum.head(3)

In [None]:
# FacetGrid's `size` parameter was renamed to `height` in seaborn 0.9
sns.FacetGrid(lis_sum, col="room_type", height=4).map(plt.scatter, "review_scores_cleanliness", "bedrooms").add_legend()

#### How are the listings priced relative to their availability?

In [None]:
g = sns.jointplot(x="price", y="availability_365", kind='kde', data=lis)

g.ax_marg_x.set_xlim(0, 800)
g.ax_marg_y.set_ylim(0, 500)

### More to come