![image](https://jetsettingfools.com/wp-content/uploads/2016/09/airbnb_horizontal_lockup_web-1000x449.png)

# First, What Is Airbnb?
* Airbnb is an organization which operates an online marketplace for lodging, primarily homestays for vacation rentals, and tourism activities.

* Based on the [New York City Airbnb Open Data](https://www.kaggle.com/datasets/dgomonov/new-york-city-airbnb-open-data), we will try to:
1. Explore.
2. Analyze.
3. Act.
* The goal is to get insights and conclusions for business recommendations, so let's start the game.

# 1. Explore 
![desert](https://images.pexels.com/photos/847402/pexels-photo-847402.jpeg?auto=compress&cs=tinysrgb&w=1260&h=750&dpr=1)

* Imagine you are looking for treasure in a wide desert. You grab your equipment and start the journey without considering what makes a spot the right one to dig. Would it be easy to find your goal? Surely not: you first have to study and understand the factors that point to the right spot. The desert here is the data. You have to understand every feature in it to know what to do next and how to handle the data efficiently, so that the insights you get lead to conclusions that can be relied on for business recommendations.

* **A. First Step:** As a treasure hunter, your first step is to find a map and study it carefully. The map here is the dataset, so let's take a look at it:

* The dataset has 48895 rows (observations or records) x 16 columns (features or fields).
 
 
 <table>
<tr><td>Column</td><td>Description</td></tr>
<tr><td>id</td><td>listing ID</td></tr>
<tr><td>name </td><td>listing Name </td></tr> 
<tr><td>host_id </td><td>Host ID </td></tr> 
<tr><td>host_name </td><td> Host Name </td></tr> 
<tr><td>neighbourhood_group</td><td>Location</td></tr> 
<tr><td>neighbourhood </td><td> area </td></tr> 
<tr><td>latitude </td><td>latitude coordinates</td></tr> 
<tr><td>longitude </td><td>longitude coordinates </td></tr> 
<tr><td>room_type </td><td>listing space type </td></tr> 
<tr><td> price </td><td>price in dollars</td></tr> 
<tr><td>minimum_nights </td><td> minimum number of nights per booking </td></tr> 
<tr><td>number_of_reviews</td><td>number of reviews</td></tr> 
<tr><td>last_review </td><td>latest review date </td></tr> 
<tr><td>reviews_per_month </td><td>number of reviews per month </td></tr> 
<tr><td>calculated_host_listings_count</td><td>number of listings per host </td></tr> 
<tr><td>availability_365 </td><td>number of days when listing is available for booking</td></tr>     
</table>

* **B. Second Step:** Now that we have a general idea of the map, let's take a deeper look using Python:
 
 

In [None]:
# Import needed packages
import pandas as pd

# Load the data with read_csv and store the DataFrame in the airbnb_df variable
airbnb_df = pd.read_csv('../input/new-york-city-airbnb-open-data/AB_NYC_2019.csv')

# Look at airbnb_df with head() to make sure the DataFrame loaded successfully
# Note that head() returns the first 5 records by default
airbnb_df.head()

In [None]:
# take a look at the dataframe shape
airbnb_df.shape

In [None]:
# take a look at columns datatypes 
airbnb_df.dtypes
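
The shape and dtypes checks above can be combined into one overview table. This is only a minimal sketch on a made-up toy frame (`toy_df` and its values are assumptions for illustration, not rows from the real dataset); the same lines should work on `airbnb_df`.

```python
import pandas as pd

# Toy stand-in for airbnb_df; the values are invented for illustration only
toy_df = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Brooklyn", "Manhattan", "Queens"],
    "price": [150, 80, 200, 60],
    "reviews_per_month": [1.2, None, 0.4, None],
})

# One row per column: data type, number of nulls, number of distinct non-null values
summary = pd.DataFrame({
    "dtype": toy_df.dtypes.astype(str),
    "nulls": toy_df.isnull().sum(),
    "unique": toy_df.nunique(),
})
print(summary)
```

Seeing the null counts next to the dtypes early on helps decide which columns will need cleaning later.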

# 2. Analyze
* Now that you have the general structure of the map (dataset), let's move further and analyze it following the data analysis process life cycle:
* ### Ask -> Prepare and Process -> Predict -> Share -> Act.
## Ask
* Go back to the dataset and take your time extracting questions, then narrow them down to the most important ones that yield useful information.
* Questions: 
1. Get the number of unique values of (host, neighbourhood_group, room_type).
2. Get the (host, neighbourhood_group, room_type) with the biggest number of listings.
3. Get the average price per (neighbourhood_group, room_type).
4. Get the relationships (price and minimum_nights, price and number_of_reviews, price and room_type).
5. Can we predict the price?

In [None]:
# 1. Get the number of unique values of (host, neighbourhood_group, room_type).
print('The number of hosts is', airbnb_df.host_id.nunique(), '\n')
print('The number of unique neighbourhood groups is', airbnb_df.neighbourhood_group.nunique(), airbnb_df.neighbourhood_group.unique(), '\n')
print('The number of unique room types is', airbnb_df.room_type.nunique(), airbnb_df.room_type.unique())

In [None]:
# 2. Get the (host, neighbourhood_group, room_type) with the biggest number of listings.

# calculated_host_listings_count already stores the number of listings per host, so take its maximum
max_number_host = airbnb_df.calculated_host_listings_count.max()

# Keep one row with that count to read off the host's id and name
top_host = airbnb_df[airbnb_df['calculated_host_listings_count'] == max_number_host].head(1)
print('The host with the maximum number of listings is \n', top_host[['host_id','host_name']])
print('\n With number of listings:', max_number_host)

# Count listings per neighbourhood_group; value_counts sorts in descending order, so head(1) is the largest group
max_number_neighbourhood = airbnb_df.neighbourhood_group.value_counts()
print('\n The neighbourhood group with maximum number of listings is', max_number_neighbourhood.head(1))

# Count listings per room_type in the same way
max_number_room_type = airbnb_df.room_type.value_counts()
print('\n The room type with maximum number of listings is', max_number_room_type.head(1))


In [None]:
# 3. Get the average of (prices per neighbourhood_group and room types).
avg_price_room_neighbourhood = airbnb_df.groupby(['neighbourhood_group','room_type'])['price'].mean().round(2)
avg_price_room_neighbourhood
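
The grouped averages above come back as one long Series. As a hedged sketch, the same result can be reshaped with `unstack` into a table with one row per neighbourhood group and one column per room type, which is easier to compare. The `toy_df` frame and its prices are made up for illustration.

```python
import pandas as pd

# Toy stand-in for airbnb_df; the prices are invented for illustration only
toy_df = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Manhattan", "Manhattan",
                            "Brooklyn", "Brooklyn", "Brooklyn"],
    "room_type": ["Entire home/apt", "Private room", "Entire home/apt",
                  "Entire home/apt", "Private room", "Private room"],
    "price": [200, 100, 300, 120, 60, 80],
})

# Mean price per (neighbourhood_group, room_type), then pivot room_type into columns
avg_price_table = (
    toy_df.groupby(["neighbourhood_group", "room_type"])["price"]
    .mean()
    .round(2)
    .unstack("room_type")
)
print(avg_price_table)
```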

In [None]:
# Visualize the neighbourhood_group + room type by mean price 
avg_price_room_neighbourhood.plot.bar()

In [None]:
# 4. Get the relationships ( prices and minimum_nights, prices and number_of_reviews, Prices and room_type).
# As we study relationships between two variables, a scatter plot is a natural first choice
airbnb_df.plot.scatter(x='minimum_nights', y='price')
airbnb_df.plot.scatter(x='number_of_reviews', y='price')
airbnb_df.plot.scatter(x='room_type', y='price')


### Plots findings 
* **First plot:** There is a negative relationship: as the minimum number of nights increases, the price decreases.
* **Second plot:** Low-priced lodgings gained more reviews, which suggests more visitors.
* **Third plot:** Shared rooms have the lowest prices.
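
Reading a relationship off a scatter plot is subjective, so it can help to back it with a number. The sketch below computes the Spearman rank correlation of each numeric column with `price`; a rank-based measure is chosen here as an assumption, because it is less sensitive to a few extreme prices than Pearson. The `toy_df` values are made up, so the printed coefficients say nothing about the real dataset.

```python
import pandas as pd

# Toy stand-in for airbnb_df; the values are invented for illustration only
toy_df = pd.DataFrame({
    "minimum_nights": [1, 2, 3, 7, 30],
    "number_of_reviews": [1, 5, 10, 40, 50],
    "price": [300, 250, 200, 120, 90],
})

# Spearman correlation of every numeric column with price (price itself dropped)
corr_with_price = toy_df.corr(method="spearman")["price"].drop("price")
print(corr_with_price)
```

A coefficient near -1 means the column tends to go up as the price goes down, near +1 means they move together, and near 0 means no monotonic relationship.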

* **5. Can we predict the price?** To answer this question, we first have to prepare and process the data, because we can't evaluate a regression model without this step. The suspense has begun. Ready?

## Prepare and Process

### Prepare

###	Data checkpoints R-O-C-C-C: (Reliable, Original, Comprehensive, Current, Cited)

a. The data is from a **Reliable** source, as it was collected via Airbnb.

b. As it's **internal** data collected by Airbnb, it's **Original data** (first-party data).

c. **Comprehensive**, as the number of observations is almost 49000, which is enough for our study.

d. The data is **Not Current**: it is from 2019, 3 years ago.

e. **Not Sure** whether the data is cited or not.

### Process

* Remove duplicates
* Handle Null values
* Feature selection and encoding of categorical features

In [None]:
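# Remove duplicates
# Print the number of fully duplicated rows first, then drop them
print('Number of duplicated rows:', airbnb_df.duplicated().sum())
airbnb_df.drop_duplicates(inplace=True)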

In [None]:
# Handle Null Values
# Calculate the number of null values per column
airbnb_df.isnull().sum()

In [None]:
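# First make a copy of the dataframe so the original stays untouched (don't forget)
airbnb_df2 = airbnb_df.copy()
# Replace null values only in the reviews_per_month column, as this is the only one we are interested in
# Assign the result back: fillna(inplace=True) on a selected column may not update airbnb_df2 in recent pandas versions
airbnb_df2['reviews_per_month'] = airbnb_df2['reviews_per_month'].fillna(0)
airbnb_df2.reviews_per_month.isnull().sum()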

In [None]:
# Feature selection: drop the identifier, free-text, coordinate, and date columns
# We don't need these columns for our model
airbnb_df2.drop(['id','name','host_name','host_id','latitude','longitude','last_review'], axis=1, inplace=True)


In [None]:
# Check that all is right; note that any null value will break our model
airbnb_df2.isnull().sum()

In [None]:
# Feature encoding
# As we have categorical features, we use the get_dummies function to one-hot encode them
airbnb_df2 = pd.get_dummies(airbnb_df2)
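
To see what `get_dummies` does to a categorical column, here is a small sketch on a made-up frame (`toy_df` and its values are assumptions for illustration). Each distinct category becomes its own indicator column, while numeric columns pass through unchanged.

```python
import pandas as pd

# Toy stand-in for airbnb_df2; the values are invented for illustration only
toy_df = pd.DataFrame({
    "room_type": ["Private room", "Entire home/apt", "Shared room"],
    "minimum_nights": [1, 3, 2],
    "price": [80, 200, 40],
})

# One-hot encode: room_type is replaced by one indicator column per category
encoded = pd.get_dummies(toy_df)
print(encoded.columns.tolist())
print(encoded)
```

Note that a column with many distinct values, such as `neighbourhood`, produces one new column per value, so the encoded frame can become much wider than the original.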


## Predict 

* Split the Data
* Evaluate Regression Models (Linear Regression - Decision Tree - Random Forest)

In [None]:
# Split the data 
from sklearn.model_selection import train_test_split

# X holds the features, y holds the target (price)
X = airbnb_df2.drop('price', axis=1)
y = airbnb_df2.price

# Split the data into training and test datasets (90% / 10%)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=300)

In [None]:
# Evaluate Regression models (linear Regression - Decision Tree - Random Forest)
# Imported Packages
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# 1. Linear Regression 
lin_reg=LinearRegression()
lin_reg.fit(x_train,y_train)
y_pred=lin_reg.predict(x_test)

# Calculate the R2 score (coefficient of determination) for linear regression
r2_score(y_test,y_pred)

In [None]:
# 2. Decision Tree Regression 
DTree=DecisionTreeRegressor(min_samples_leaf=.0001)
DTree.fit(x_train,y_train)
y_predict=DTree.predict(x_test)

# Calculate the R2 score for Decision Tree regression
r2_score(y_test,y_predict)

In [None]:
# 3. Random Forest Regression 
Rf_model = RandomForestRegressor(n_estimators=100, max_depth=7, random_state=0)
Rf_model.fit(x_train,y_train)
y_predict=Rf_model.predict(x_test)

# Calculate the R2 score for Random Forest regression
r2_score(y_test,y_predict)
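
R2 alone does not say how far off a predicted price is in dollars. As a hedged sketch, the cell below reports MAE and RMSE next to R2. It runs on synthetic data generated with NumPy (the coefficients and noise level are arbitrary assumptions), so the numbers are not results for the Airbnb dataset; the same three metric calls could be applied to any of the models above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data: two features, a linear signal, and Gaussian noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = 50 + 20 * X[:, 0] - 5 * X[:, 1] + rng.normal(0, 10, size=200)

# Same split settings as in the notebook
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=300)

model = LinearRegression().fit(x_train, y_train)
y_pred = model.predict(x_test)

r2 = r2_score(y_test, y_pred)                        # share of variance explained
mae = mean_absolute_error(y_test, y_pred)            # average absolute error, in target units
rmse = np.sqrt(mean_squared_error(y_test, y_pred))   # penalizes large errors more than MAE
print(f"R2={r2:.3f}  MAE={mae:.2f}  RMSE={rmse:.2f}")
```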

## Share - Visualization

* Let's now visualize the analysis we did above:
* count numbers for different **neighbourhood_group**
* count numbers for different **room_type**


In [None]:
# Import needed packages - seaborn and matplotlib
import seaborn as sns
import matplotlib.pyplot as plt

# Visualize the number of observations for each neighbourhood_group
sns.countplot(data=airbnb_df, x='neighbourhood_group', palette="plasma")
fig = plt.gcf()
fig.set_size_inches(7,7)
plt.title('Neighbourhood Group')

In [None]:
# Visualize the number of observations for each room_type
sns.countplot(data=airbnb_df, x='room_type', palette="plasma")
fig = plt.gcf()
fig.set_size_inches(7,7)
plt.title('Room Types vs Count')

## Relationship Between Prices and Room types 

In [None]:
plt.figure(figsize=(7,7))
sns.barplot(data=airbnb_df, y='price',x='room_type',palette='plasma')
plt.title('Room Types vs price')

## Map of neighbourhood_group

In [None]:
plt.figure(figsize=(10,6))
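# Pass x and y as keyword arguments; recent seaborn versions no longer accept them positionally
sns.scatterplot(x=airbnb_df.longitude, y=airbnb_df.latitude, hue=airbnb_df.neighbourhood_group)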
plt.ioff()

# 3. Act 
## <font color='orange'> Congratulations, you successfully found the treasure! </font> 
![treasure](https://images.pexels.com/photos/366791/pexels-photo-366791.jpeg?auto=compress&cs=tinysrgb&w=1260&h=750&dpr=1)

### Now you have to hand it over to the government (stakeholders), but organize the findings before you do

### <font color='orange'> Conclusions and Findings: </font> 

1. The number of **unique hosts** in New York City is **37457**.

2. The host with the maximum number of listings is **Sonder (NYC)** with id 219517861 **(327 listings)**.

3. The neighbourhood group with the maximum number of listings is **Manhattan (21661)**.

4. The room type with the maximum number of listings is **Entire home/apt (25409)**.

5. The **highest average price** for **Entire home/apt is in Manhattan, at 249.24 dollars**.

6. The **highest average price for Private room is in Manhattan, at 116.78 dollars**.

7. The **highest average price for Shared room is in Manhattan, at 88.98 dollars**.

8. There is a **negative relationship between the minimum number of nights and the price.**

9. **Low-priced lodgings gained more reviews, which suggests more visitors.**

10. **Shared rooms have the lowest prices.**



* ## Finally, I hope you benefited. See you in the next journey! Huge thanks to [CHIRAG SAMAL](https://www.kaggle.com/code/chirag9073/airbnb-analysis-visualization-and-prediction)