<a href="https://colab.research.google.com/github/mani-github2021/AirBnb_Booking_Analysis/blob/main/AirBnb_Bookings_Analysis_EDA_Submission.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    



##### **Project Type**    - Exploratory Data Analysis
##### **Contribution**    - Individual


# **Project Summary**

Airbnb has changed the way people travel since 2008, offering unique stays and experiences worldwide. Its millions of listings generate a large amount of data: this dataset alone contains around 49,000 observations and 16 columns of mixed data types. Such data supports many uses, including platform safety, business strategy, understanding guest and host behavior, performance evaluation, ad targeting, and the design of new services. Exploring this dataset helps Airbnb learn from its users and improve its services for everyone.

# **GitHub Link -**

https://github.com/mani-github2021/AirBnb_Booking_Analysis

# **Problem Statement**


Analyze the Airbnb NYC 2019 dataset to understand various aspects of Airbnb listings in New York City, such as distribution, pricing, and relationships between different variables, and to provide insights that can help improve business strategies for hosts and Airbnb as a platform.
The analysis involves importing and examining the dataset, understanding the variables, performing data wrangling, and visualizing relationships between variables. Insights gained from the analysis will help Airbnb hosts and the platform optimize their offerings.

#### **Define Your Business Objective?**

To gain insights from the Airbnb NYC 2019 dataset that can help improve host performance, optimize pricing strategies, and enhance guest satisfaction. This includes understanding data distribution, identifying key factors affecting prices, and finding patterns that can be leveraged for better decision-making.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')


### Dataset Loading

In [None]:
# Load Dataset
path='/content/Airbnb NYC 2019.csv'
df = pd.read_csv(path)

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()


#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

In [None]:
# Visualizing the missing values
plt.rcParams['figure.figsize'] = (10,5)
df.isna().sum().plot.bar()
plt.show()

### What did you know about your dataset?

The dataset has 48,895 rows and 16 columns, with missing values in the `name`, `host_name`, `last_review`, and `reviews_per_month` columns.
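
The raw missing-value counts are easier to compare as a share of all rows. A minimal sketch of that idea follows; the few rows below are made up for illustration and are not the real listings, and only the four column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up sample rows for illustration; not the real Airbnb data.
sample = pd.DataFrame({
    "name": ["Cozy loft", None, "Sunny room", "Studio"],
    "host_name": ["Ana", "Ben", None, "Dee"],
    "last_review": ["2019-05-01", None, None, "2019-06-12"],
    "reviews_per_month": [0.5, np.nan, np.nan, 1.2],
})

# Count missing values per column and express them as a share of all rows.
missing = pd.DataFrame({
    "missing_count": sample.isna().sum(),
    "missing_pct": sample.isna().mean() * 100,
})

# Keep only columns that actually have gaps, largest share first.
missing = missing[missing["missing_count"] > 0].sort_values("missing_pct", ascending=False)
print(missing)
```

Applied to the real data, the same two expressions would run on `df` instead of `sample`.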

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
list(df.columns)

In [None]:
# Dataset Describe
df.describe()

### Variables Description

*These are the variables in the dataset*

'State' : State in which the customer is located (categorical).

'Account length' : How long the account has been active.

'Area code' : Area code of the customer.

'International plan' : Whether the customer subscribed to an international plan.

'Voice mail plan' : Whether the customer subscribed to a voice mail plan.

'Number vmail messages' : Number of voice mail messages.

'Total day minutes' : Total minutes of calls during the day.

'Total day calls' : Total number of calls during the day.

'Total day charge' : Total amount charged to the customer for day calls.

'Total eve minutes' : Total minutes of calls during the evening.

'Total eve calls' : Total number of calls during the evening.

'Total eve charge' : Total amount charged to the customer for evening calls.

'Total night minutes' : Total minutes of calls during the night.

'Total night calls' : Total number of calls during the night.

'Total night charge' : Total amount charged to the customer for night calls.

'Total intl minutes' : Total minutes of international calls.

'Total intl calls' : Total number of international calls.

'Total intl charge' : Total amount charged for international calls.

'Customer service calls' : Number of calls made to customer service.

'Churn' : Whether the customer churned or not.



### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
# Uses df (the DataFrame loaded above); dropna=False counts NaN as a value, matching len(unique()).
for col in df.columns:
    print(f"Number of unique values for {col}: {df[col].nunique(dropna=False)}")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Counts of churned and retained customers for each state
df.groupby(['State','Churn'])['Churn'].count().unstack()

In [None]:
# Counts of churned and retained customers for each area code
df.groupby(['Area code','Churn'])['Churn'].count().unstack()

In [None]:
# Counts of churned and retained customers by international plan
df.groupby(['International plan','Churn'])['Churn'].count().unstack()

In [None]:
# Counts of churned and retained customers by voice mail plan
df.groupby(['Voice mail plan','Churn'])['Churn'].count().unstack()
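
The tables above give raw counts. As a hedged alternative, `pd.crosstab` with `normalize="index"` returns the share of churned and retained customers within each category directly. The rows below are synthetic, and the helper name `churn_share_by` is hypothetical.

```python
import pandas as pd

# Synthetic rows for illustration only; column names follow the notebook.
toy = pd.DataFrame({
    "International plan": ["No", "No", "No", "No", "Yes", "Yes"],
    "Churn": [False, False, False, True, True, False],
})

def churn_share_by(frame, column):
    """Share of churned and retained customers within each category of `column`."""
    # Readable labels instead of True/False column headers.
    labels = frame["Churn"].map({True: "churned", False: "retained"})
    return pd.crosstab(frame[column], labels, normalize="index")

shares = churn_share_by(toy, "International plan")
print(shares)
```

Each row of the result sums to 1, which makes groups of very different sizes comparable.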

In [None]:
# Customers without an international plan (copied so new columns can be added safely)
df_churn_intl_no = df[df['International plan'] == 'No'].copy()
# Customers with an international plan
df_churn_intl_yes = df[df['International plan'] == 'Yes'].copy()


In [None]:
# Number of customers without an international plan in each area code
df_churn_intl_no['Area code'].value_counts().reset_index(name='customer count')

In [None]:
# Voicemail message counts among customers without an international plan
df_churn_intl_no['Number vmail messages'].value_counts().reset_index(name='user count')

In [None]:
# Voicemail message counts among customers with an international plan
df_churn_intl_yes['Number vmail messages'].value_counts().reset_index(name='user count')


In [None]:
# Percentage of customers who churned.
# Churn is boolean, so its mean is the churn rate (avoids positional lookups on value_counts).
percentage_of_customer_churn = df["Churn"].mean() * 100
print(percentage_of_customer_churn)

In [None]:
# Average call duration (minutes per call) for customers without an international plan
df_churn_intl_no['day_call_duration'] = df_churn_intl_no['Total day minutes']/df_churn_intl_no['Total day calls']
df_churn_intl_no['eve_call_duration'] = df_churn_intl_no['Total eve minutes']/df_churn_intl_no['Total eve calls']
df_churn_intl_no['night_call_duration'] = df_churn_intl_no['Total night minutes']/df_churn_intl_no['Total night calls']
df_churn_intl_no['intl_call_duration'] = df_churn_intl_no['Total intl minutes']/df_churn_intl_no['Total intl calls']


In [None]:
# Average call duration (minutes per call) for customers with an international plan
df_churn_intl_yes['day_call_duration'] = df_churn_intl_yes['Total day minutes']/df_churn_intl_yes['Total day calls']
df_churn_intl_yes['eve_call_duration'] = df_churn_intl_yes['Total eve minutes']/df_churn_intl_yes['Total eve calls']
df_churn_intl_yes['night_call_duration'] = df_churn_intl_yes['Total night minutes']/df_churn_intl_yes['Total night calls']
df_churn_intl_yes['intl_call_duration'] = df_churn_intl_yes['Total intl minutes']/df_churn_intl_yes['Total intl calls']
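
One caveat for these per-call duration columns: a customer with zero calls in a period makes the denominator zero, so the division produces `inf` or `NaN`. The sketch below shows one possible guard, using a hypothetical helper `minutes_per_call` and synthetic rows.

```python
import numpy as np
import pandas as pd

# Synthetic rows for illustration; the second customer made no day calls.
toy = pd.DataFrame({
    "Total day minutes": [120.0, 0.0, 90.0],
    "Total day calls": [60, 0, 45],
})

def minutes_per_call(minutes, calls):
    """Average call length; rows with zero calls become NaN instead of inf or 0/0."""
    return minutes / calls.replace(0, np.nan)

toy["day_call_duration"] = minutes_per_call(toy["Total day minutes"], toy["Total day calls"])
print(toy)
```

Leaving the undefined rows as `NaN` keeps them out of later means and plots rather than distorting them.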

In [None]:
# Price rate (charge per minute) for customers without an international plan
df_churn_intl_no['day_rate_per_min'] = df_churn_intl_no['Total day charge']/df_churn_intl_no['Total day minutes']
df_churn_intl_no['eve_rate_per_min'] = df_churn_intl_no['Total eve charge']/df_churn_intl_no['Total eve minutes']
df_churn_intl_no['night_rate_per_min'] = df_churn_intl_no['Total night charge']/df_churn_intl_no['Total night minutes']
df_churn_intl_no['intl_rate_per_min'] = df_churn_intl_no['Total intl charge']/df_churn_intl_no['Total intl minutes']

In [None]:
# Price rate (charge per minute) for customers with an international plan
df_churn_intl_yes['day_rate_per_min'] = df_churn_intl_yes['Total day charge']/df_churn_intl_yes['Total day minutes']
df_churn_intl_yes['eve_rate_per_min'] = df_churn_intl_yes['Total eve charge']/df_churn_intl_yes['Total eve minutes']
df_churn_intl_yes['night_rate_per_min'] = df_churn_intl_yes['Total night charge']/df_churn_intl_yes['Total night minutes']
df_churn_intl_yes['intl_rate_per_min'] = df_churn_intl_yes['Total intl charge']/df_churn_intl_yes['Total intl minutes']

In [None]:
df_churn_intl_yes['intl_rate_per_min'].sum()

In [None]:
# Combined minutes, calls, and charges across day, evening, night, and international usage
df['total mins'] = df['Total day minutes']+df['Total eve minutes']+df['Total night minutes']+df['Total intl minutes']
df['total calls'] = df['Total day calls']+df['Total eve calls']+df['Total night calls']+df['Total intl calls']
df['total charges'] = df['Total day charge']+df['Total eve charge']+df['Total night charge']+df['Total intl charge']


### What all manipulations have you done and insights you found?

First, I grouped the data by state to count the churned and retained customers in each state.

Area code 415 has the largest number of customers.

The churn percentage is approximately equal across area codes, at about 14% to 15%.

Area code 510 has the smallest number of customers.

Most customers do not have an international plan.

Few customers have an international plan, and the churn rate is higher among them.

Customers with a voice mail plan are a minority: only 922 customers have one.

About 8.7% of the customers with a voice mail plan churn (80 of 922).

In area code 415, the number of customers without an international plan is the highest (1505).

Overall, 14.5% of customers churn.



## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
plt.rcParams['figure.figsize'] = (15,10)
colors = ['orange','g']
df['Churn'].value_counts().plot.bar(figsize=(24,4),yticks=np.arange(0,3334,300),color= colors)
plt.show()

##### 1. Why did you pick the specific chart?

Bar charts show the frequency counts of values for the different labels of a categorical or nominal variable. Sometimes, bar charts show other statistics, such as percentages. I used this chart to check whether target data is balanced or not.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that roughly 480 customers churned, while about 2,850 did not.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The dataset is imbalanced: fewer than 500 customers churned, while more than 2,800 did not. This imbalance has to be handled before applying a machine learning algorithm, because a model trained on it can favor the majority class.
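
To make the imbalance concern concrete, here is a small sketch of the accuracy a trivial model would reach by always predicting "no churn". It uses the counts reported in the Conclusion (483 churned customers out of 3333). The function name `majority_baseline_accuracy` is hypothetical.

```python
def majority_baseline_accuracy(class_counts):
    """Accuracy of a model that always predicts the most frequent class."""
    total = sum(class_counts.values())
    return max(class_counts.values()) / total

# Counts taken from the Conclusion: 483 churned out of 3333 customers.
counts = {"churn": 483, "no churn": 3333 - 483}

baseline = majority_baseline_accuracy(counts)
print(f"Always predicting 'no churn' is right {baseline:.1%} of the time")
```

Any later classifier has to beat this figure to be useful, and a metric such as recall on the churn class says more than accuracy alone.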

#### Chart - 2

In [None]:
# Churn rate by state: the 10 states with the highest rates
df.groupby(['State'])['Churn'].mean().sort_values(ascending= False).head(10).plot.bar(figsize=(24,4),color=colors)
plt.show()

##### 1. Why did you pick the specific chart?

Bar charts show the frequency counts of values for the different labels of a categorical or nominal variable.

##### 2. What is/are the insight(s) found from the chart?

The chart shows the ten states with the highest churn rates.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

These are the states where the churn rate is highest, so the company should pay more attention to them.
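
A churn rate computed from only a few customers can be noisy, so it may help to show the customer count next to each state's rate. A sketch with synthetic rows follows; the values are made up, and the minimum of 3 customers per state is an assumed threshold.

```python
import pandas as pd

# Synthetic rows for illustration only.
toy = pd.DataFrame({
    "State": ["NJ", "NJ", "NJ", "NJ", "TX", "TX", "TX", "CA"],
    "Churn": [True, True, False, False, True, False, False, True],
})

# Churn rate and customer count per state in one table.
by_state = toy.groupby("State")["Churn"].agg(churn_rate="mean", customers="count")

# Drop states with too few customers to trust the rate (threshold is an assumption).
min_customers = 3
ranked = by_state[by_state["customers"] >= min_customers].sort_values("churn_rate", ascending=False)
print(ranked)
```

In this toy data the single-customer state would top the ranking with a 100% rate, which is the kind of artifact the threshold is meant to filter out.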

#### Chart - 3

In [None]:
# Number of customers from each state
df['State'].value_counts().sort_values(ascending = False).plot.bar(color=colors,figsize=(24,4))
plt.title('Number of customers for each state')
plt.xlabel('States')
plt.ylabel('Number of customers')
plt.yticks(np.arange(0,111,10))
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart shows frequency counts for different labels, so I picked it to compare the number of customers in each state.

##### 2. What is/are the insight(s) found from the chart?

WV has the most customers (more than 100), and CA has the fewest (fewer than 40).

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

WV has the largest number of customers. If they churn, it will be a big loss to the business, so the company should care most about retaining them.

#### Chart - 4

In [None]:
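# Average number of total calls for churned vs. retained customers
# colors is a list (not a set) so each slice gets a predictable color
df.groupby(['Churn'])['total calls'].mean().plot.pie(autopct='%0.02f%%', colors=['r', 'gray'])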
plt.title('Churn vs Average calls')
plt.show()

##### 1. Why did you pick the specific chart?

A pie chart expresses a part-to-whole relationship. I picked it to compare the average number of calls made by churned and retained customers.

##### 2. What is/are the insight(s) found from the chart?

Churned and retained customers make roughly the same number of calls on average: each group accounts for approximately 50% of the pie.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

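No. The average number of calls is nearly the same for churned and retained customers, so this chart does not point to a factor behind churn.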


#### Chart - 5

In [None]:
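# State-wise churn rate
y = df.groupby("State")["Churn"].mean()
# Take the labels from the grouped result so each state lines up with its own rate;
# df["State"].unique() is in order of appearance, while groupby sorts alphabetically.
x = y.index
plt.rcParams["figure.figsize"] = (15, 10)
plt.plot(x, y, color="b", marker="o", linewidth=3, markersize=10)
plt.title("Churn rate by state")
plt.xlabel("States")
plt.ylabel("Churn rate")
plt.show()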

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.



I observed that more customers churn when call duration and call charges are higher, while customers with a voice mail plan churn less. The company could therefore price the voice mail plan below regular call charges, so that customers who make long calls have an incentive to switch to it.

# **Conclusion**

Overall, 14.5% of the customers churned.

The state with the highest number of customers is WV.

Area code 415 has the highest number of customers, more than 1600.

Only 9.7% of customers subscribed to the international plan.

27.7% of customers subscribed to the voice mail plan.

Churn is higher where call duration is high.

Customers who make more calls churn more often.

Higher charges are associated with higher churn.

Customers with fewer voice mail messages churn less often.

Total charges and total minutes are linearly correlated.

The dataset has 3333 rows and 20 columns, with no missing values or duplicate rows.

The total number of churned customers is 483.

The states with the highest churn are NJ and TX, with 18 churned customers each.

Area code 415 has the highest number of churned customers, 236.

137 churned customers had an international plan and 346 did not.

80 churned customers had a voice mail plan and 403 did not.
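
The percentages listed above can be cross-checked from the raw counts. The sketch below uses plain Python; the counts are copied from this notebook's notes (including the 922 voice mail plan customers mentioned in the data wrangling insights), and the helper name `pct` is hypothetical.

```python
# Counts copied from the notes in this notebook.
total_customers = 3333
churned = 483
churned_with_intl, churned_without_intl = 137, 346
voice_plan_customers = 922
churned_with_vmail, churned_without_vmail = 80, 403

def pct(part, whole):
    """Percentage of `whole` represented by `part`, rounded to one decimal."""
    return round(100 * part / whole, 1)

# The per-plan churn counts should each add up to the overall churn count.
assert churned_with_intl + churned_without_intl == churned
assert churned_with_vmail + churned_without_vmail == churned

overall_rate = pct(churned, total_customers)
vmail_share = pct(voice_plan_customers, total_customers)
vmail_churn_rate = pct(churned_with_vmail, voice_plan_customers)
no_vmail_churn_rate = pct(churned_without_vmail, total_customers - voice_plan_customers)

print(overall_rate, vmail_share, vmail_churn_rate, no_vmail_churn_rate)
# prints: 14.5 27.7 8.7 16.7
```

The overall churn rate and the voice mail plan share match the figures stated above, and customers without a voice mail plan churn at roughly twice the rate of those with one.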

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***