# Final Project : Airbnb Price Prediction
* `Project Name:` Airbnb Open Data
* `Name:` Mohamed Ashraf Saad


# Data Description:

## What is Airbnb?

* Airbnb: “Air Bed and Breakfast”
* A service that lets property owners rent out their spaces to travelers looking for a place to stay.
* Travelers can rent a space for multiple people to share, a shared space with private rooms, or the entire property for themselves.

Since 2008, guests and hosts have used Airbnb to expand their travel possibilities and to experience the world in a more unique, personalized way. Today, Airbnb has become a one-of-a-kind service that is used and recognized around the world. Analysis of the millions of listings on Airbnb is crucial for the company. These listings generate a lot of data that can be analyzed and used for security, business decisions, understanding the behavior and performance of customers and providers (hosts) on the platform, guiding marketing initiatives, implementing innovative additional services, and much more.

Columns in the dataset:

* `id`: Listing id
* `NAME`: Listing name
* `host id`: Host id
* `host_identity_verified`: Whether the host's identity has been verified
* `host name`: Host name
* `neighbourhood group`: The group of neighbourhoods within the city
* `neighbourhood`: The neighbourhood within the neighbourhood group
* `lat`: Latitude of the listing
* `long`: Longitude of the listing
* `country`: Country of the listing
* `country code`: Country code of the listing
* `instant_bookable`: Whether the listing can be booked immediately
* `cancellation_policy`: Policy that applies when a reservation is cancelled
* `room type`: Room type of the listing
* `Construction year`: The year the property was built
* `price`: Price of the listing
* `service fee`: Service fee charged for the listing
* `minimum nights`: Minimum number of nights per stay
* `number of reviews`: Number of reviews for the listing
* `last review`: Date of the last review
* `reviews per month`: Number of reviews per month
* `review rate number`: Review rating of the listing
* `calculated host listings count`: Number of listings calculated for the host
* `availability 365`: Number of days the listing is available during the year
* `house_rules`: House rules of the listing
* `license`: License held for the listing

# ⏬ Import Libraries and explore data

In [None]:
!pip install category_encoders
!pip install folium

In [None]:
import pandas as pd
import numpy as np
import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sns
import folium

In [None]:
df =pd.read_csv("Airbnb_Open_Data.csv" , low_memory=False)
df

In [None]:
# Checking the shape of the dataset
print(f'The shape of Airbnb Dataset is {df.shape}')

In [None]:
# Checking the feature names
print(f' The names of the features present in the dataset are: ')
list(df.columns)

In [None]:
# Replace spaces with underscores in the column names (e.g. 'host id' -> 'host_id')
df.columns = df.columns.str.replace(' ', '_', regex=False)

In [None]:
# Checking for categorical columns
cat_cols = df.select_dtypes(include = 'object').columns
print(' The following are the categorical features in the dataset:')
list(cat_cols)

In [None]:
# Checking for numeric (non-categorical) columns
num_cols = df.select_dtypes(exclude = 'object').columns
print(' The following are the numeric features in the dataset:')
list(num_cols)

In [None]:
# Checking the last 5 rows of the data
df.tail(5)

In [None]:
# Checking the information of the dataset
df.info()

In [None]:
df.describe()

In [None]:
df.describe(include='object')

In [None]:
# Checking for the null values
print(f'The missing values before cleaning the data are:')
df.isnull().sum()

In [None]:
# check for missing values %
df.isnull().mean() * 100

In [None]:
df.neighbourhood_group.value_counts().nlargest(10)

In [None]:
df.neighbourhood.value_counts().nlargest(10)

In [None]:
df.room_type.value_counts().nlargest(10)

In [None]:
df.calculated_host_listings_count.value_counts().nlargest(10)

In [None]:
df.price.value_counts().nlargest(10)

In [None]:
df.host_name.nunique()

In [None]:
df.host_id.nunique()

In [None]:
df.neighbourhood.nunique()

In [None]:
df.neighbourhood_group.nunique()


In [None]:
def ShowDetails():
    """Print null count, unique values and three random samples for each column of df."""
    for col in df.columns:
        print(f'for feature {col}')
        print(f'Number of Nulls is {df[col].isna().sum()}')
        print(f'Number of Unique Values is {df[col].nunique(dropna=False)}')
        print(f'Unique Values is {df[col].unique()}')
        # .iloc indexes by position, so sampling still works after rows have been dropped
        for _ in range(3):
            print(f'Random Value is {df[col].iloc[np.random.randint(df.shape[0])]}')
        print('\n\n==================================\n\n')

In [None]:
ShowDetails()

# ⏬ Data Cleaning

In [None]:
df.shape

In [None]:
df.duplicated().sum()

In [None]:
#Remove duplicated
df.drop_duplicates(inplace=True)


In [None]:
df.shape

In [None]:
# Drop columns where more than 50% of the values are missing
mostly_missing = df.columns[df.isnull().mean() > 0.5]
print("Total columns before dropping : ", len(df.columns), "\n")
print("Columns with more than 50% missing values: ")
print(mostly_missing, "\n")
df = df.drop(columns=mostly_missing)
print("Total columns after dropping:", len(df.columns))
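
The thresholding step above can be tried on a small frame before running it on the full dataset. This is a minimal sketch, assuming a made-up `toy` DataFrame and a hypothetical helper `drop_sparse_columns`; neither is part of the notebook, and the 0.5 threshold simply mirrors the cell above.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(frame, threshold=0.5):
    """Return a copy of `frame` without columns whose share of nulls exceeds `threshold`."""
    null_share = frame.isnull().mean()                  # fraction of missing values per column
    sparse = null_share[null_share > threshold].index   # columns above the threshold
    return frame.drop(columns=sparse)


# Toy data (not taken from the Airbnb file): 'license' is 75% missing, 'price' is 25% missing
toy = pd.DataFrame({
    "price": [100.0, np.nan, 80.0, 120.0],
    "license": [np.nan, np.nan, np.nan, "X-1"],
    "room_type": ["Private room", "Entire home/apt", "Private room", "Shared room"],
})

cleaned = drop_sparse_columns(toy)
print(cleaned.columns.tolist())
```

Only `license` crosses the threshold here, so `price` and `room_type` are kept.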

In [None]:
df

In [None]:
# Convert 'last review' column to datetime type
df['last_review'] = pd.to_datetime(df['last_review'])

In [None]:
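# Convert 'price' column to float type
# regex=False makes '$' a literal character instead of the regex end-of-string anchor
df['price'] = df['price'].str.replace('$', '', regex=False)
df['price'] = df['price'].str.replace(',', '', regex=False)
df['price'] = df['price'].astype(float)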

In [None]:
df['price']


In [None]:
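# Convert 'service_fee' column to float type
# regex=False makes '$' a literal character instead of the regex end-of-string anchor
df['service_fee'] = df['service_fee'].str.replace('$', '', regex=False)
df['service_fee'] = df['service_fee'].str.replace(',', '', regex=False)
df['service_fee'] = df['service_fee'].astype(float)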
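
Since the same clean-up is applied to both `price` and `service_fee`, it could be factored into one function. The sketch below is only an illustration: `currency_to_float` is a hypothetical helper, and the sample strings are made up rather than read from the CSV.

```python
import pandas as pd


def currency_to_float(series):
    """Strip '$', ',' and surrounding spaces from strings such as '$1,018 ' and return floats."""
    cleaned = (
        series.str.replace("$", "", regex=False)   # literal dollar sign, not a regex anchor
        .str.replace(",", "", regex=False)         # thousands separator
        .str.strip()
    )
    # errors="coerce" turns anything that still is not a number into NaN
    return pd.to_numeric(cleaned, errors="coerce")


# Made-up sample values, including a missing entry
sample = pd.Series(["$966 ", "$1,018 ", None])
result = currency_to_float(sample)
print(result.tolist())
```

In the notebook this would be called as `df['price'] = currency_to_float(df['price'])`, and likewise for `service_fee`.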

In [None]:
df['service_fee']

In [None]:
df.info()

In [None]:
df.isna().sum()

In [None]:
# Drop the 'id' column: it is an identifier, not a feature
df = df.drop('id', axis=1)

In [None]:
# Drop the 'host_id' column: it is an identifier, not a feature
df = df.drop('host_id', axis=1)

In [None]:
# Drop rows with null values in these columns
df = df.dropna(subset=['price','NAME','host_identity_verified'])
df = df.dropna(subset=['neighbourhood_group','neighbourhood'])
df = df.dropna(subset=['cancellation_policy'])
df = df.dropna(subset=['instant_bookable', 'host_name','Construction_year'])
df = df.dropna(subset=['lat','long'])

In [None]:
df.isna().sum()

In [None]:
df.duplicated().sum()

In [None]:
df.drop_duplicates(inplace=True)

In [None]:
df.duplicated().sum()

In [None]:
# Drop columns where value counts is equal to 1
for col in df.columns:
    if df[col].value_counts().shape[0] == 1:
        df = df.drop(col, axis=1)
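
The same check can be written with `nunique()`, which ignores nulls by default just as `value_counts()` does in the loop above. A minimal sketch on a made-up `toy` frame (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy data: both country columns hold a single distinct non-null value
toy = pd.DataFrame({
    "country": ["United States", "United States", None],
    "country_code": ["US", "US", "US"],
    "price": [966.0, 142.0, 204.0],
})

# A column with exactly one distinct non-null value carries no information for a model
constant_cols = [c for c in toy.columns if toy[c].nunique() == 1]
reduced = toy.drop(columns=constant_cols)

print(constant_cols)
print(reduced.columns.tolist())
```

Collecting the column names first and dropping them in one call avoids reassigning the frame inside the loop.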

In [None]:
df.shape


In [None]:
print(df.columns)


In [None]:
df.isna().sum()

In [None]:
df[df['service_fee']==0]

In [None]:
# Fill null values in service fee by zeros
df['service_fee'] = df['service_fee'].fillna(0)

In [None]:
df['service_fee'].value_counts()

In [None]:
df['availability_365'].describe()


In [None]:
# Remove rows where 'availability_365' is negative
df = df.drop(df[df['availability_365'] < 0].index)

In [None]:
# Remove rows where 'availability_365' is greater than 365
df = df.drop(df[df['availability_365'] >365 ].index)

In [None]:
df['availability_365'].describe()


In [None]:
df['minimum_nights'].describe()


In [None]:
# Remove rows where 'minimum_nights' is less than 0
df = df.drop(df[df['minimum_nights'] < 0].index)

In [None]:
df['minimum_nights'].describe()


In [None]:
df["neighbourhood_group"].value_counts()

In [None]:
# Remove rows whose neighbourhood group is the misspelled label 'brookln'
df = df[df['neighbourhood_group'] != 'brookln']
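
The cell above drops the affected rows. An alternative, which this notebook does not use, would be to map the label back to its intended spelling so the rows are kept. The sketch below assumes that `'brookln'` is a typo for `'Brooklyn'`; the `corrections` dictionary and the sample series are made up for illustration.

```python
import pandas as pd

# Made-up sample of neighbourhood group labels, one of them misspelled
groups = pd.Series(["Brooklyn", "brookln", "Manhattan", "Queens"], name="neighbourhood_group")

# Hypothetical correction table: keys are misspellings, values are the intended labels
corrections = {"brookln": "Brooklyn"}
fixed = groups.replace(corrections)

print(fixed.value_counts().to_dict())
```

Whether to drop or repair such rows depends on how many there are and on how much the rest of each row can be trusted.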


In [None]:
df["neighbourhood_group"].value_counts()

In [None]:
df.isnull().sum().sort_values()


In [None]:
df.info()

In [None]:
# check for missing values %
df.isnull().mean() * 100

In [None]:
df['last_review'].min(), df['last_review'].max()


In [None]:
df['number_of_reviews'].median()

In [None]:
# Fill missing 'number_of_reviews' values with the column median
df['number_of_reviews'] = df['number_of_reviews'].fillna(df['number_of_reviews'].median())

In [None]:
# check for missing values %
df.isnull().mean() * 100


In [None]:
df['minimum_nights'].median()

In [None]:
# Fill missing 'minimum_nights' values with the column median
df['minimum_nights'] = df['minimum_nights'].fillna(df['minimum_nights'].median())

In [None]:
# check for missing values %
df.isnull().mean() * 100


In [None]:
df['availability_365'].median()

In [None]:
# Fill missing 'availability_365' values with the column median
df['availability_365'] = df['availability_365'].fillna(df['availability_365'].median())

In [None]:
df['calculated_host_listings_count'].median()

In [None]:
# Fill missing 'calculated_host_listings_count' values with the column median
df['calculated_host_listings_count'] = df['calculated_host_listings_count'].fillna(df['calculated_host_listings_count'].median())

In [None]:
df['calculated_host_listings_count'].value_counts()

In [None]:
df['review_rate_number'].median()

In [None]:
# Fill missing 'review_rate_number' values with the column median
df['review_rate_number'] = df['review_rate_number'].fillna(df['review_rate_number'].median())
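
The median imputation cells above all follow the same pattern, so they can be collapsed into one step. This is a minimal sketch on a made-up `toy` frame; the column list `cols_to_impute` is an assumption chosen for the example, not the notebook's full list.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value per column
toy = pd.DataFrame({
    "minimum_nights": [1.0, np.nan, 3.0, 30.0],
    "review_rate_number": [4.0, 3.0, np.nan, 5.0],
})

cols_to_impute = ["minimum_nights", "review_rate_number"]
medians = toy[cols_to_impute].median()   # per-column medians, nulls are skipped
imputed = toy.fillna(medians)            # the Series index is matched against column names

print(imputed)
```

Assigning the result back, rather than calling `fillna(..., inplace=True)` on a selected column, also avoids the chained-assignment warnings that recent pandas versions raise.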

In [None]:
df['review_rate_number'].value_counts()

In [None]:
df.isnull().sum()

In [None]:
df[df['last_review'].isna()]


In [None]:
# Drop the 'last_review' and 'reviews_per_month' columns
df.drop(['last_review', 'reviews_per_month'], axis=1, inplace=True)

In [None]:
df.isnull().sum()

In [None]:
df['minimum_nights'].value_counts().sum()

In [None]:
df['minimum_nights'].gt(30).sum()

In [None]:
# Drop rows where the 'minimum_nights' column is greater than 30 inplace

df.drop(df[df['minimum_nights'] > 30].index, inplace=True)


In [None]:
df.info()

In [None]:
df.shape

In [None]:
numeric_columns = df.select_dtypes(include='number').columns.tolist()
print("Numeric Columns:")
for col in numeric_columns:
    print(col)

In [None]:
df

# ⏬ EDA

In [None]:
#What are the unique values in the 'neighbourhood_group' column?
unique_neighbourhood_groups = df['neighbourhood_group'].unique()
print("Unique neighbourhood groups:", unique_neighbourhood_groups)

In [None]:
#What is the distribution of 'room_type' in the dataset?
room_type_distribution = df['room_type'].value_counts()
print("Room type distribution:\n", room_type_distribution)

In [None]:
#How many listings are available per neighbourhood?
listings_per_neighbourhood = df['neighbourhood'].value_counts().head(10)
print("Listings per neighbourhood:\n", listings_per_neighbourhood)

In [None]:
#What is the distribution of 'minimum_nights' for listings?
minimum_nights_distribution = df['minimum_nights'].value_counts()
print("Minimum nights distribution:\n", minimum_nights_distribution)


In [None]:
#What are the top 10 hosts with the most listings?
top_hosts = df['host_name'].value_counts().head(10)
print("Top 10 hosts with most listings:\n", top_hosts)


In [None]:
#How many unique hosts are there in the dataset?
unique_hosts = df['host_name'].nunique()
print("Number of unique hosts:", unique_hosts)


In [None]:
#What is the average number of reviews per listing?
average_reviews_per_listing = df['number_of_reviews'].mean()
print("Average number of reviews per listing:", average_reviews_per_listing)

In [None]:
#What is the availability distribution throughout the year ('availability_365')?
availability_distribution = df['availability_365'].value_counts()
print("Availability distribution throughout the year:\n", availability_distribution)


In [None]:
# Select the columns whose correlation with 'price' we want to compute
columns_to_correlate = ['NAME', 'host_identity_verified', 'host_name', 'neighbourhood_group',
                        'neighbourhood', 'lat', 'long', 'instant_bookable', 'cancellation_policy',
                        'room_type', 'Construction_year', 'service_fee', 'minimum_nights',
                        'number_of_reviews', 'review_rate_number', 'calculated_host_listings_count',
                        'availability_365']

# Pearson correlation is only defined for numeric data, so keep the numeric columns
price_correlation = df[columns_to_correlate].select_dtypes(include='number').corrwith(df['price'])

# Print the correlation values
print("Correlation between 'price' and other columns:")
print(price_correlation)
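
To see what `corrwith` returns, and why text columns have to be left out, it helps to run it on a frame where the answer is known in advance. The `toy` frame below is made up so that `service_fee` rises and `minimum_nights` falls exactly linearly with `price`.

```python
import pandas as pd

# Toy data with perfectly linear relationships to 'price'
toy = pd.DataFrame({
    "room_type": ["Private room", "Entire home/apt", "Private room", "Entire home/apt"],
    "service_fee": [20.0, 40.0, 60.0, 80.0],
    "minimum_nights": [4.0, 3.0, 2.0, 1.0],
    "price": [100.0, 200.0, 300.0, 400.0],
})

# Keep numeric columns only and leave the target itself out
numeric = toy.select_dtypes(include="number").drop(columns="price")
corr_with_price = numeric.corrwith(toy["price"])

print(corr_with_price)
```

The result is a Series indexed by column name, with `room_type` absent because it is not numeric.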

In [None]:
#What is the correlation between 'number_of_reviews' and 'availability_365'?
correlation_reviews_availability = df['number_of_reviews'].corr(df['availability_365'])
print("Correlation between 'number_of_reviews' and 'availability_365':", correlation_reviews_availability)


In [None]:
#What is the correlation between 'calculated_host_listings_count' and 'price'?
correlation_host_listings_price = df['calculated_host_listings_count'].corr(df['price'])
print("Correlation between 'calculated_host_listings_count' and 'price':", correlation_host_listings_price)

In [None]:
#What is the correlation between 'price' and 'number_of_reviews'?
correlation_price_reviews = df['price'].corr(df['number_of_reviews'])
print("Correlation between 'price' and 'number_of_reviews':", correlation_price_reviews)

In [None]:
#What is the correlation between 'room_type' and 'price'? (if 'room_type' is a categorical variable, you may need to encode it first)
# Assuming 'room_type' is categorical and needs encoding
encoded_room_type = pd.get_dummies(df['room_type'], drop_first=True)
correlation_room_type_price = encoded_room_type.join(df['price']).corr()['price'].drop('price')
print("Correlation between 'room_type' and 'price':\n", correlation_room_type_price)


In [None]:
#What is the correlation between 'service_fee' and 'availability_365'?
correlation_service_fee_availability = df['service_fee'].corr(df['availability_365'])
print("Correlation between 'service_fee' and 'availability_365':", correlation_service_fee_availability)

In [None]:
#What is the average price of listings in each neighbourhood group?
average_price_neighbourhood_group = df.groupby('neighbourhood_group')['price'].mean()
print("Average price by neighbourhood group:\n", average_price_neighbourhood_group)


In [None]:
#What is the average price in the 10 most expensive neighbourhoods?

average_price_neighbourhood = df.groupby('neighbourhood')['price'].mean().sort_values(ascending=False).head(10)
print("Average price in the 10 most expensive neighbourhoods:\n", average_price_neighbourhood)

In [None]:
#What is the average price in the 10 cheapest neighbourhoods?
average_price_neighbourhood = df.groupby('neighbourhood')['price'].mean().sort_values(ascending=True).head(10)
print("Average price in the 10 cheapest neighbourhoods:\n", average_price_neighbourhood)

In [None]:
#What is the distribution of room types within each neighbourhood group?
room_type_distribution = df.groupby(['neighbourhood_group', 'room_type']).size().unstack(fill_value=0)
print("Room type distribution within each neighbourhood group:\n", room_type_distribution)

In [None]:
#What is the average price for each room type?
average_price_room_type = df.groupby('room_type')['price'].mean()
print("Average price for each room type:\n", average_price_room_type)

In [None]:
#What is the average number of listings per host?
average_listings_per_host = df.groupby('host_name').size().mean()
print("Average number of listings per host:", average_listings_per_host)

In [None]:
#What is the average 'price' for listings with 'minimum_nights' less than or equal to 2?
average_price_less_than_2_nights = df[df['minimum_nights'] <= 2]['price'].mean()
print("Average price for listings with minimum nights <= 2:", average_price_less_than_2_nights)

In [None]:
#What is the average 'availability_365' for each 'neighbourhood_group'?
average_availability_neighbourhood_group = df.groupby('neighbourhood_group')['availability_365'].mean()
print("Average availability for each neighbourhood group:\n", average_availability_neighbourhood_group)

In [None]:
#How many listings are there with 'price' greater than $100 and 'minimum_nights' less than 3?
listings_price_above_100_minimum_nights_below_3 = df[(df['price'] > 100) & (df['minimum_nights'] < 3)]
num_listings_price_above_100_minimum_nights_below_3 = len(listings_price_above_100_minimum_nights_below_3)
print("Number of listings with price > $100 and minimum nights < 3:", num_listings_price_above_100_minimum_nights_below_3)

In [None]:
#What is the average 'price' for each 'neighbourhood' within each 'neighbourhood_group'
average_price_neighbourhood_within_group = df.groupby(['neighbourhood_group', 'neighbourhood'])['price'].mean().sort_values(ascending=False).head(10)
print("Average price for each neighbourhood within each neighbourhood group:\n", average_price_neighbourhood_within_group)

In [None]:
#Average price for each neighbourhood_group for each room_type
average_price_neighbourhood_within_room_type=df.groupby(['neighbourhood_group', 'room_type'])['price'].mean().sort_values(ascending=False).head(10)
print("Average price for each neighbourhood_group for each room_type:\n", average_price_neighbourhood_within_room_type)

In [None]:
data = df.groupby(['neighbourhood_group', 'neighbourhood', 'room_type'])['price'].mean().nlargest(10).reset_index()
data

# ⏬ Visualization

## ⬛ Location Analysis

In [None]:
# Listing locations in New York City, coloured by room type
plt.figure(figsize=(12, 8))
sns.scatterplot(x='long', y='lat', data=df, hue='room_type', palette='Set1');

In [None]:
# Listing locations in New York City, coloured by neighbourhood group
plt.figure(figsize=(12, 8))
sns.scatterplot(x='long', y='lat', data=df, hue='neighbourhood_group', palette='Set1');

In [None]:
lat_mean = df['lat'].mean()
long_mean = df['long'].mean()
area_lat = df['lat'].groupby(df['neighbourhood_group']).mean()
area_long = df['long'].groupby(df['neighbourhood_group']).mean()
area_lat_long= pd.concat([area_lat,area_long],axis=1)
area_lat_long = area_lat_long.values.tolist()

In [None]:
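# Mark the mean coordinates of each neighbourhood group on a map of New York City
# (named nyc_map so that the built-in map() is not shadowed)
nyc_map = folium.Map(
    # Center the map on the mean coordinates of all listings
    location=[lat_mean, long_mean],
    zoom_start=10
)
for point in area_lat_long:
    folium.Marker(point, popup=str(point)).add_to(nyc_map)
nyc_map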

In [None]:
# How do prices vary across the city? (listings coloured by price)
fig = px.scatter_mapbox(df,lat="lat",
           lon="long",
           opacity = 0.3,
           hover_name="neighbourhood_group",
           hover_data=["neighbourhood_group", "price"],
           color="price",
           # 'price' is numeric, so a continuous colour scale is the one that applies
           color_continuous_scale=px.colors.sequential.PuBuGn,
           title = "Price comparing to the place",
           template = "plotly_dark",
           zoom=10
           )
fig.update_layout(mapbox_style="open-street-map")
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0},font = dict(size=17,family="Franklin Gothic"))
fig.show()

## ⬛ Pie plots

In [None]:
# Count the occurrences of each neighborhood group
neighborhood_group_counts = df['neighbourhood_group'].value_counts()

# Plotting
plt.figure(figsize=(8, 6))
plt.pie(neighborhood_group_counts, labels=neighborhood_group_counts.index, autopct='%1.1f%%', startangle=140)
plt.title('Distribution of Listings by Neighborhood Group')
plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()

In [None]:
# Count the occurrences of each room type
room_type_counts = df['room_type'].value_counts()

# Plotting
plt.figure(figsize=(8, 6))
plt.pie(room_type_counts, labels=room_type_counts.index, autopct='%1.1f%%', startangle=140)
plt.title('Distribution of Listings by Room Type')
plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()

In [None]:
# Bucket 'availability_365' into ranges; include_lowest=True keeps listings with 0 available days in the first bin
availability_counts = pd.cut(df['availability_365'], bins=[0, 50, 100, 150, 200, 250, 300, 365], include_lowest=True).value_counts()

plt.figure(figsize=(8, 6))
plt.pie(availability_counts, labels=availability_counts.index, autopct='%1.1f%%', startangle=140)
plt.title('Distribution of Listings by Availability')
plt.axis('equal')
plt.show()
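
`pd.cut` builds right-closed intervals by default, so the first bin `(0, 50]` does not contain the value 0 unless `include_lowest=True` is passed. The short sketch below shows the difference on a handful of made-up availability values, using the same bin edges as the cell above.

```python
import pandas as pd

# Made-up availability values, including the edge case 0
availability = pd.Series([0, 10, 50, 51, 365])
bins = [0, 50, 100, 150, 200, 250, 300, 365]

default_bins = pd.cut(availability, bins=bins)                       # first bin is (0, 50]
closed_bins = pd.cut(availability, bins=bins, include_lowest=True)   # first bin also holds 0

print("Unbinned with defaults:", default_bins.isna().sum())
print("Unbinned with include_lowest=True:", closed_bins.isna().sum())
```

Values that fall outside every bin become `NaN` and are silently left out of `value_counts()`, which changes the percentages shown in the pie chart.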

In [None]:
# Count the listings per host name
host_counts = df['host_name'].value_counts()

plt.figure(figsize=(8, 6))
plt.pie(host_counts.head(10), labels=host_counts.head(10).index, autopct='%1.1f%%', startangle=140)
plt.title('Top 10 Hosts with Most Listings')
plt.axis('equal')
plt.show()

In [None]:

# Count how often each 'calculated_host_listings_count' value occurs
calculated_host_listings_count_counts = df['calculated_host_listings_count'].value_counts()

# For better visualization, let's consider only the top 5 most frequent counts
top_counts = calculated_host_listings_count_counts.head(5)

plt.figure(figsize=(8, 6))
plt.pie(top_counts, labels=top_counts.index, autopct='%1.1f%%', startangle=140)
plt.title('Distribution of Listings by Host\'s Calculated Listings Count (Top 5)')
plt.axis('equal')
plt.show()

## ⬛ FacetGrid


In [None]:
#FacetGrid showing the relationship between 'neighbourhood_group' and 'price' for different room_type:

g = sns.FacetGrid(df, col='room_type', height=4, aspect=1.5, palette='Set1')
g.map(sns.barplot, 'neighbourhood_group', 'price', order=df.neighbourhood_group.value_counts().index);

In [None]:
#FacetGrid showing the relationship between 'room_type' and 'price' for different neighbourhood_group:

g = sns.FacetGrid(df, col='neighbourhood_group', height=4, aspect=1.5, palette='Set1')
g.map(sns.barplot, 'room_type', 'price', order=df.room_type.value_counts().index);

## ⬛ Heatmap

In [None]:
# Correlation heatmap for the numeric columns (correlation is undefined for text columns)
plt.figure(figsize=(10, 8))
sns.heatmap(df.select_dtypes(include='number').corr(), annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.title('Correlation Heatmap for All Data')
plt.show()


## ⬛ Countplot & ⬛ Barplot

In [None]:
# Countplot for the top 10 neighborhoods
plt.figure(figsize=(12, 8))
sns.countplot(y='neighbourhood', data=df, order=df['neighbourhood'].value_counts().nlargest(10).index, palette='Set1')
plt.title('Countplot of Top 10 Neighborhoods')
plt.xlabel('Count')
plt.ylabel('Neighborhood')
plt.show()

In [None]:
# Countplot for the top 10 hosts
plt.figure(figsize=(12, 8))
sns.countplot(y='host_name', data=df, order=df['host_name'].value_counts().nlargest(10).index, palette='Set2')
plt.title('Countplot of Top 10 Hosts')
plt.xlabel('Count')
plt.ylabel('Host Name')
plt.show()

In [None]:
# Countplot for the  room types
plt.figure(figsize=(10, 6))
sns.countplot(y='room_type', data=df, order=df['room_type'].value_counts().nlargest(3).index, palette='viridis')
plt.title('Countplot of  Room Types')
plt.xlabel('Count')
plt.ylabel('Room Type')
plt.show()

In [None]:
# Barplot for 'neighbourhood_group' vs 'price' with hue='room_type'
plt.figure(figsize=(12, 8))
sns.barplot(x='neighbourhood_group', y='price', hue='room_type', data=df, order=df['neighbourhood_group'].value_counts().index, palette='Set1')
plt.title('Barplot of Price by Neighbourhood Group with Room Type')
plt.xlabel('Neighbourhood Group')
plt.ylabel('Price')
plt.show()

In [None]:
# Barplot for 'room_type' vs 'price' with hue='neighbourhood_group'
plt.figure(figsize=(12, 8))
sns.barplot(x='room_type', y='price', hue='neighbourhood_group', data=df, palette='Set2')
plt.title('Barplot of Price by Room Type with Neighbourhood Group')
plt.xlabel('Room Type')
plt.ylabel('Price')
plt.show()

In [None]:
# Barplot for 'neighbourhood' vs 'price' with hue='room_type'
plt.figure(figsize=(14, 8))
sns.barplot(x='neighbourhood', y='price', hue='room_type', data=df, order=df['neighbourhood'].value_counts().index[:10], palette='viridis')
plt.title('Barplot of Price by Neighbourhood with Room Type')
plt.xlabel('Neighbourhood')
plt.ylabel('Price')
plt.xticks(rotation=45)
plt.show()

In [None]:
# Which are the top 25 most reviewed neighbourhoods?
top_25_reviewed_neighbourhoods = df.groupby(['neighbourhood'])['number_of_reviews'].sum().sort_values(ascending=False).head(25)
plt.figure(figsize=(10,10))
plt.title("Top 25 most reviewed neighbourhoods")
sns.barplot(x=top_25_reviewed_neighbourhoods.values,y=top_25_reviewed_neighbourhoods.index)
plt.show()


In [None]:
#What is the average availability for each neighbourhood group
availability_per_neighbourhood_group = df.groupby('neighbourhood_group')['availability_365'].mean()
fig = px.bar(availability_per_neighbourhood_group,
            x=availability_per_neighbourhood_group.index,
            y=availability_per_neighbourhood_group.values,
            labels={'x': 'Neighbourhood group', 'y': 'Average availability'},
            text=[str(round(i)) for i in availability_per_neighbourhood_group.values],
            title='Average availability per neighbourhood group',
            color_discrete_sequence=px.colors.sequential.deep,
            template='plotly_dark'
)

fig.update_layout(font=dict(size=20, color='white', family='Avenir'))

fig.show()

In [None]:
# Which neighbourhoods have the highest average ratings?
avg_rating_per_neighbourhood = df.groupby(['neighbourhood'])['review_rate_number'].mean().sort_values(ascending=False)[0:24]
plt.figure(figsize=(8,10))
sns.barplot(x=avg_rating_per_neighbourhood.values, y=avg_rating_per_neighbourhood.index, palette='rocket')
plt.show()

In [None]:
# What is the median price for each room type?
price_per_room_type = df.groupby('room_type')['price'].median()
fig = px.bar(price_per_room_type,
            x=price_per_room_type.index,
            y=price_per_room_type.values,
            labels={'x': 'Room type', 'y': 'Median price'},
            text=['$' + str(int(i)) for i in price_per_room_type.values],
            title='Median price per room type in USD',
            color_discrete_sequence=px.colors.sequential.Bluyl,
            template='plotly_dark'
)

fig.update_layout(font=dict(size=16, color='white', family='Avenir'))

fig.show()

In [None]:
#How many constructions are there for each year
constructions_per_year = df.groupby('Construction_year')['Construction_year'].count()

# Let's plot them using plotly's barplot with value counts
fig = px.bar(constructions_per_year,
            x=constructions_per_year.index,
            y=constructions_per_year.values,
            labels={'x': 'Construction year', 'y': 'Number of constructions'},
            text=[str(i) for i in constructions_per_year.values],
            title='Number of constructions per year',
            template='plotly_dark'
)

fig.update_layout(font=dict(size=20, color='white', family='Avenir'))

fig.show()

In [None]:
# What is the average price for each construction year?
price_per_year = df.groupby('Construction_year')['price'].mean()

fig = px.bar(price_per_year,
            x=price_per_year.index,
            y=price_per_year.values,
            labels={'x': 'Construction year', 'y': 'Average price'},
            text=['$' + str(int(i)) for i in price_per_year.values],
            title='Average price per construction year in USD',
            color_discrete_sequence=px.colors.sequential.RdBu,
            template='plotly_dark'
)

fig.update_layout(font=dict(size=16, color='white', family='Avenir'))

fig.show()

In [None]:
#What are the average service fee for each neighbourhood group
service_fee_per_neighbourhood_group = df.groupby('neighbourhood_group')['service_fee'].mean()
fig = px.bar(service_fee_per_neighbourhood_group,
            x=service_fee_per_neighbourhood_group.index,
            y=service_fee_per_neighbourhood_group.values,
            labels={'x': 'Neighbourhood_group', 'y': 'Average service fee'},
            text=['$' + str(int(i)) for i in service_fee_per_neighbourhood_group.values],
            title='Average service fee per neighbourhood group in USD',
            color_discrete_sequence=px.colors.sequential.Plasma,
            template='plotly_dark'
)

fig.update_layout(font=dict(size=16, color='white', family='Avenir'))

fig.show()

In [None]:
#What are the average review rates for each neighbourhood group
review_rate_per_neighbourhood_group = df.groupby('neighbourhood_group')['review_rate_number'].mean()

# Let's plot them using plotly's barplot with value counts
fig = px.bar(review_rate_per_neighbourhood_group,
            x=review_rate_per_neighbourhood_group.index,
            y=review_rate_per_neighbourhood_group.values,
            labels={'x': 'Neighbourhood group', 'y': 'Average review rate'},
            text=[str(round(i, 2)) for i in review_rate_per_neighbourhood_group.values],
            title='Average review rate per neighbourhood group',
            color_discrete_sequence=px.colors.sequential.algae,
            template='plotly_dark'
)

fig.update_layout(font=dict(size=20, color='white', family='Avenir'))

fig.show()

## ⬛ Pairplot

In [None]:
sns.pairplot(df[['price', 'minimum_nights', 'number_of_reviews', 'calculated_host_listings_count', 'availability_365','Construction_year','service_fee','review_rate_number']]);

## ⬛ Lineplot

In [None]:
# Is a later construction year associated with a higher median price?
price_per_year = df.groupby('Construction_year')['price'].median()
fig = px.line(price_per_year,
            x=price_per_year.index,
            y=price_per_year.values,
            labels={'x': 'Construction year', 'y': 'Median price'},
            text=['$' + str(int(i)) for i in price_per_year.values],
            title='Median price per construction year in USD',
            color_discrete_sequence=px.colors.sequential.Teal_r,
            template='plotly_dark'
)

fig.update_layout(font=dict(size=16, color='white', family='Avenir'))

fig.show()

# ⏬ Preprocessing

In [100]:
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import AdaBoostRegressor, GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, PolynomialFeatures, RobustScaler, StandardScaler
from sklearn.svm import SVR, LinearSVR
from sklearn.tree import DecisionTreeRegressor

from category_encoders import BinaryEncoder
from xgboost import XGBRegressor

In [101]:
df

| | NAME | host_identity_verified | host_name | neighbourhood_group | neighbourhood | lat | long | instant_bookable | cancellation_policy | room_type | Construction_year | price | service_fee | minimum_nights | number_of_reviews | review_rate_number | calculated_host_listings_count | availability_365 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Clean & quiet apt home by the park | unconfirmed | Madaline | Brooklyn | Kensington | 40.64749 | -73.97237 | False | strict | Private room | 2020.0 | 966.0 | 193.0 | 10.0 | 9.0 | 4.0 | 6.0 | 286.0 |
| 1 | Skylit Midtown Castle | verified | Jenna | Manhattan | Midtown | 40.75362 | -73.98377 | False | moderate | Entire home/apt | 2007.0 | 142.0 | 28.0 | 30.0 | 45.0 | 4.0 | 2.0 | 228.0 |
| 4 | Entire Apt: Spacious Studio/Loft by central park | verified | Lyndon | Manhattan | East Harlem | 40.79851 | -73.94399 | False | moderate | Entire home/apt | 2009.0 | 204.0 | 41.0 | 10.0 | 9.0 | 3.0 | 1.0 | 289.0 |
| 8 | Large Furnished Room Near B'way | verified | Evelyn | Manhattan | Hell's Kitchen | 40.76489 | -73.98493 | True | strict | Private room | 2005.0 | 1018.0 | 204.0 | 2.0 | 430.0 | 3.0 | 1.0 | 180.0 |
| 10 | Cute & Cozy Lower East Side 1 bdrm | verified | Miranda | Manhattan | Chinatown | 40.71344 | -73.99037 | False | flexible | Entire home/apt | 2004.0 | 319.0 | 64.0 | 1.0 | 160.0 | 3.0 | 4.0 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 102037 | Bx Apartment | unconfirmed | Vii | Bronx | Olinville | 40.88438 | -73.86397 | True | strict | Private room | 2008.0 | 531.0 | 0.0 | 1.0 | 0.0 | 3.0 | 1.0 | 0.0 |
| 102038 | HUGE BEDROOM LORIMER L TRAIN!!! | unconfirmed | Jose | Brooklyn | Williamsburg | 40.71355 | -73.95003 | True | flexible | Private room | 2016.0 | 570.0 | 0.0 | 28.0 | 17.0 | 1.0 | 6.0 | 229.0 |
| 102039 | Spacious two bedrooms condo in upper Manhattan | verified | Max | Manhattan | Inwood | 40.86461 | -73.92363 | True | moderate | Entire home/apt | 2003.0 | 665.0 | 0.0 | 2.0 | 49.0 | 1.0 | 3.0 | 147.0 |
| 102040 | Room in Queens, NY, near LGA. | verified | Sonia | Queens | East Elmhurst | 40.76245 | -73.87938 | True | strict | Private room | 2022.0 | 982.0 | 196.0 | 1.0 | 239.0 | 2.0 | 2.0 | 361.0 |


In [None]:
# Checking the feature names
print('The names of the features present in the dataset are:')
list(df.columns)

In [None]:
df.info()

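In [None]:
# Columns excluded from modeling; defined here so this cell also runs on a fresh kernel
col = ['host_name','neighbourhood','lat','NAME','long']
# Numeric candidate features
df.drop(col , axis=1 ).drop('price' , axis=1).select_dtypes('number').columns.tolist()

In [None]:
# Categorical candidate features
df.drop(col , axis=1 ).drop('price' , axis=1).select_dtypes('object').columns.tolist()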

In [102]:
col = ['host_name','neighbourhood','lat','NAME','long']
new_df = df.copy()
new_df.drop(col , axis=1 , inplace=True)

X = new_df.drop('price' , axis=1)
y = new_df.price

x_train , x_test , y_train , y_test = train_test_split(X , y , test_size=0.2 , random_state=42)

num = ['minimum_nights','number_of_reviews','calculated_host_listings_count','availability_365','Construction_year','service_fee','review_rate_number']
cat = ['neighbourhood_group','room_type','cancellation_policy','host_identity_verified','instant_bookable']

cat_pipeline = Pipeline ( steps = [("Encoder" , BinaryEncoder(handle_unknown='ignore') )] )
num_pipeline = Pipeline ( steps = [("Scaler" , RobustScaler())] )


Column_Transformer = ColumnTransformer(transformers = [('num' , num_pipeline , num) , ('cat' , cat_pipeline , cat)]).set_output(transform='pandas')


import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

x_train_trans = Column_Transformer.fit_transform(x_train)
x_test_trans = Column_Transformer.transform(x_test)

Column_Transformer
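
The numeric pipeline relies on `RobustScaler`, which by default centers each column on its median and divides by the interquartile range (75th minus 25th percentile), so a few extreme values have less influence than with mean/standard-deviation scaling. A minimal sketch of that arithmetic, using the five `service_fee` values visible in the first rows of the frame above (the variable names are illustrative, not part of the notebook):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# service_fee values from the first five rows shown above
fees = np.array([193.0, 28.0, 41.0, 204.0, 64.0])

# Manual version: (x - median) / (Q3 - Q1)
median = np.median(fees)
q1, q3 = np.percentile(fees, [25, 75])
manual = (fees - median) / (q3 - q1)

# Same computation through scikit-learn (expects a 2-D array)
scaled = RobustScaler().fit_transform(fees.reshape(-1, 1)).ravel()

print("median:", median, "IQR:", q3 - q1)
print("manual :", np.round(manual, 4))
print("sklearn:", np.round(scaled, 4))
```

Both rows of output should agree, since the scaler is doing nothing more than this subtraction and division per column.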

# ⏬ Modeling

## ⬛ Linear Regression

In [103]:
LR = LinearRegression(fit_intercept=False)
LR.fit(x_train_trans , y_train)
y_pred = LR.predict(x_test_trans)

print("r2_score ",r2_score(y_test , y_pred))
print("train_Score ",LR.score(x_train_trans , y_train))
print("test_Score ",LR.score(x_test_trans , y_test))
print("mean_squared_error : ",mean_squared_error(y_test , y_pred))


r2_score  0.9863331207677608
train_Score  0.9892478497083599
test_Score  0.9863331207677608
mean_squared_error :  1502.192160041586


In [None]:
# Perform cross-validation
cross_scores = cross_val_score(LR, x_train_trans, y_train)

# Print the cross-validation scores
print('Scores: ', cross_scores)
print('Mean: ', np.mean(cross_scores))
print('Std: ', np.std(cross_scores))
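
`cross_val_score` is called here with its defaults, which for a regressor means 5 unshuffled folds scored with the estimator's own `score` method (R²). A minimal sketch of making those choices explicit, on synthetic data from `make_regression` rather than the Airbnb features; the shuffled `KFold` and the extra RMSE scoring are illustrative assumptions, not settings used in this notebook:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for x_train_trans / y_train (assumption for illustration)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

# Explicit 5-fold split; shuffling guards against any ordering in the rows
cv = KFold(n_splits=5, shuffle=True, random_state=42)

r2_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=cv, scoring="r2")
# scikit-learn reports errors as negative values so that "higher is better" holds
rmse_scores = -cross_val_score(LinearRegression(), X_demo, y_demo, cv=cv,
                               scoring="neg_root_mean_squared_error")

print("R2 per fold  :", np.round(r2_scores, 4))
print("RMSE per fold:", np.round(rmse_scores, 2))
```

Reporting an error metric next to R² gives a sense of how far off predictions are in price units, which R² alone does not show.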

## ⬛ Decision Tree

In [None]:
# Define the DT regressor
DT = DecisionTreeRegressor()

# Define the hyperparameters grid to search through
param_grid_DT = {
    'max_depth': [20, 30, 10],  # max_depth parameter
    'random_state': [42, 43, 44],  # random_state parameter
    'ccp_alpha': [1.5, 2, 2.5],  # ccp_alpha parameter
    'splitter': [ "best", "random"] #splitter parameter
}

# Initialize GridSearch
grid_search_DT = GridSearchCV(estimator=DT, param_grid=param_grid_DT , cv=3, n_jobs=-1, verbose=2)

# Fit the data to perform the search for best parameters
grid_search_DT.fit(x_train_trans, y_train)

# Output the best parameters found
print("Best Parameters:", grid_search_DT.best_params_)
print("Best Score:", grid_search_DT.best_score_)
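
The cost of this search is the number of parameter combinations times the number of folds, plus one final refit on the full training set. A quick way to count it before launching the search is scikit-learn's `ParameterGrid`, applied here to a copy of the grid above:

```python
from sklearn.model_selection import ParameterGrid

# Copy of the decision-tree grid defined above
param_grid_DT = {
    'max_depth': [20, 30, 10],
    'random_state': [42, 43, 44],
    'ccp_alpha': [1.5, 2, 2.5],
    'splitter': ["best", "random"],
}
cv_folds = 3  # matches cv=3 in GridSearchCV

n_candidates = len(ParameterGrid(param_grid_DT))  # 3 * 3 * 3 * 2 combinations
n_fits = n_candidates * cv_folds                  # one fit per candidate per fold

print(f"{n_candidates} candidates x {cv_folds} folds = {n_fits} fits")
```

Searching over `random_state` triples the work here. It mostly measures seed-to-seed variation rather than a modeling choice, so fixing a single seed is a common alternative.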

In [104]:
# Perform cross-validation
DT = DecisionTreeRegressor(max_depth=10 ,random_state=44 ,ccp_alpha=1.5,splitter="random")
cross_scores_T = cross_val_score(DT, x_train_trans, y_train)

# Print the cross-validation scores
print('Scores: ', cross_scores_T)
print('Mean: ', np.mean(cross_scores_T))
print('Std: ', np.std(cross_scores_T))

Scores:  [0.99693396 0.99454779 0.99694977 0.99547114 0.99664828]
Mean:  0.9961101867180184
Std:  0.0009510531257764309


In [105]:
DT = DecisionTreeRegressor(max_depth=10 ,random_state=44 ,ccp_alpha=1.5,splitter="random")
DT.fit(x_train_trans , y_train)
y_pred_DT = DT.predict(x_test_trans)

print(" r2_score: ", r2_score(y_test , y_pred_DT))
print(" train_Score: ", DT.score(x_train_trans , y_train))
print(" test_Score: ", DT.score(x_test_trans , y_test))
print(" mean_squared_error: ", mean_squared_error(y_test , y_pred_DT))


 r2_score:  0.9949705784684761
 train_Score:  0.9970313778538333
 test_Score:  0.9949705784684761
 mean_squared_error:  552.8078111919947


## ⬛ SVR

In [None]:
# Define the SVR regressor
svr = LinearSVR()

# Define the hyperparameters grid to search through
param_grid_svr = {
    'C': [0.1, 1.0, 10.0],  # Regularization parameter
    'epsilon': [0.1, 0.2, 0.3],  # Epsilon parameter
    'tol': [1e-4, 1e-3, 1e-2]  # Tolerance for stopping criteria
}

# Initialize GridSearch
grid_search_svr = GridSearchCV(estimator=svr, param_grid=param_grid_svr , cv=3, n_jobs=-1, verbose=2)

# Fit the data to perform the search for best parameters
grid_search_svr.fit(x_train_trans, y_train)

# Output the best parameters found
print("Best Parameters:", grid_search_svr.best_params_)
print("Best Score:", grid_search_svr.best_score_)

In [106]:
Lsvr = LinearSVR(C=10 , epsilon=0.3 , tol=0.001)
Lsvr.fit(x_train_trans , y_train)
y_pred_Lsvr = Lsvr.predict(x_test_trans)

print("r2_score ",r2_score(y_test , y_pred_Lsvr))
print("train_Score ",Lsvr.score(x_train_trans , y_train))
print("test_Score ",Lsvr.score(x_test_trans , y_test))
print("mean_squared_error : ",mean_squared_error(y_test , y_pred_Lsvr))

r2_score  0.9862058881158992
train_Score  0.9891674537779689
test_Score  0.9862058881158992
mean_squared_error :  1516.1769102453454




In [None]:
# Perform cross-validation with the same hyperparameters as the fitted model above
Lsvr = LinearSVR(C=10 , epsilon=0.3 , tol=0.001)
cross_scores_Lsvr = cross_val_score(Lsvr, x_train_trans, y_train)

# Print the cross-validation scores
print('Scores: ', cross_scores_Lsvr)
print('Mean: ', np.mean(cross_scores_Lsvr))
print('Std: ', np.std(cross_scores_Lsvr))

## ⬛ XGBRegressor


In [None]:
# Define the XGBoost regressor
XGBR = XGBRegressor()

# Define the hyperparameters grid to search through
param_grid = {
    'max_depth': list(range(1,11)),
    'n_estimators': list(range(1,81,10))}

# Initialize GridSearch with cross-validation
grid_search = GridSearchCV(estimator=XGBR, param_grid=param_grid, cv=3, n_jobs=-1, verbose=2)

# Fit the data to perform the search for best parameters
grid_search.fit(x_train_trans, y_train)

# Output the best parameters found
print("Best Parameters:", grid_search.best_params_)
print("Best Score:", grid_search.best_score_)

In [107]:
XGBR = XGBRegressor(max_depth=6 , n_estimators=6 ,random_state=42)
XGBR.fit(x_train_trans , y_train)
y_pred_XGBR = XGBR.predict(x_test_trans)
print(" r2_score ",r2_score(y_test , y_pred_XGBR))
print(" train_Score ",XGBR.score(x_train_trans , y_train))
print(" test_Score ",XGBR.score(x_test_trans , y_test))
print(" mean_squared_error : ",mean_squared_error(y_test , y_pred_XGBR))

 r2_score  0.9827784025840339
 train_Score  0.9843539312513158
 test_Score  0.9827784025840339
 mean_squared_error :  1892.9082625264439
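
Because MSE is in squared price units, its square root (RMSE) is easier to read against the prices themselves. The sketch below only re-expresses the test MSE values printed above for the four models that were run; no new results are computed:

```python
import math

# Test-set MSE values copied from the outputs above
reported_mse = {
    "LinearRegression": 1502.192160041586,
    "DecisionTreeRegressor": 552.8078111919947,
    "LinearSVR": 1516.1769102453454,
    "XGBRegressor": 1892.9082625264439,
}

# RMSE = sqrt(MSE), expressed in the same units as `price`
reported_rmse = {name: math.sqrt(mse) for name, mse in reported_mse.items()}

# Print the models from lowest to highest test error
for name, rmse in sorted(reported_rmse.items(), key=lambda item: item[1]):
    print(f"{name:<22} RMSE = {rmse:6.2f}")
```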


In [None]:
# Perform cross-validation
XGBR_cross = XGBRegressor(max_depth=7 , n_estimators=80 ,random_state=42)
cv_scores_XGBR = cross_val_score(XGBR_cross, x_train_trans, y_train)

# Print the cross-validation scores
print('Scores: ', cv_scores_XGBR)
print('Mean: ', np.mean(cv_scores_XGBR))
print('Std: ', np.std(cv_scores_XGBR))

## ⬛ Random Forest Regressor

In [None]:
DFR = RandomForestRegressor(n_estimators=100)
DFR.fit(x_train_trans , y_train)
y_pred_DFR = DFR.predict(x_test_trans)

print(" r2_score ",r2_score(y_test , y_pred_DFR))
print(" train_Score ",DFR.score(x_train_trans , y_train))
print(" test_Score ",DFR.score(x_test_trans , y_test))
print(" mean_squared_error : ",mean_squared_error(y_test , y_pred_DFR))

## ⬛ AdaBoostRegressor

In [None]:
ADB = AdaBoostRegressor(n_estimators =100 )
ADB.fit(x_train_trans , y_train)
y_pred_ADB = ADB.predict(x_test_trans)

print(" r2_score ",r2_score(y_test , y_pred_ADB))
print(" train_Score ",ADB.score(x_train_trans , y_train))
print(" test_Score ",ADB.score(x_test_trans , y_test))
print(" mean_squared_error : ",mean_squared_error(y_test , y_pred_ADB))


## ⬛ GradientBoostingRegressor


In [None]:
# Perform cross-validation
GDB = GradientBoostingRegressor(max_depth=20 , n_estimators=100)
cv_scores_GDB = cross_val_score(GDB, x_train_trans, y_train)

# Print the cross-validation scores
print('Scores: ', cv_scores_GDB)
print(' Mean: ', np.mean(cv_scores_GDB))
print(' Std: ', np.std(cv_scores_GDB))

In [None]:
GDB = GradientBoostingRegressor(max_depth=20 , n_estimators=100)
GDB.fit(x_train_trans , y_train)
y_pred_GDB = GDB.predict(x_test_trans)

print(" r2_score ",r2_score(y_test , y_pred_GDB))
print(" train_Score ",GDB.score(x_train_trans , y_train))
print(" test_Score ",GDB.score(x_test_trans , y_test))
print(" mean_squared_error : ",mean_squared_error(y_test , y_pred_GDB))


## ⬛ KNeighborsRegressor

In [None]:
for i in range(1,15):
    KNN = KNeighborsRegressor(n_neighbors=i)
    KNN.fit(x_train_trans , y_train)
    y_pred_KNN = KNN.predict(x_test_trans)
    print("n_neighbors : ",i )
    print("mean_squared_error : ",mean_squared_error(y_test , y_pred_KNN))
    print(" r2_score ",r2_score(y_test , y_pred_KNN))
    print(" train_Score ",KNN.score(x_train_trans , y_train))
    print(" test_Score ",KNN.score(x_test_trans , y_test))
    print("-"*50)

In [None]:

KNN = KNeighborsRegressor(n_neighbors=10)
KNN.fit(x_train_trans , y_train)
y_pred_KNN = KNN.predict(x_test_trans)

print(" r2_score ",r2_score(y_test , y_pred_KNN))
print(" train_Score ",KNN.score(x_train_trans , y_train))
print(" test_Score ",KNN.score(x_test_trans , y_test))
print(" mean_squared_error : ",mean_squared_error(y_test , y_pred_KNN))
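
The loop above compares values of `n_neighbors` by their test-set score, which means the test set also takes part in model selection. One alternative, sketched here on synthetic data (an assumption: `make_regression` stands in for the transformed Airbnb features), is to choose `n_neighbors` by cross-validation on the training split and touch the test split only once:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (assumption for illustration)
X_demo, y_demo = make_regression(n_samples=300, n_features=4, noise=5.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Same candidate range as the loop above: n_neighbors = 1..14
knn_search = GridSearchCV(
    estimator=KNeighborsRegressor(),
    param_grid={"n_neighbors": list(range(1, 15))},
    cv=3,
)
knn_search.fit(X_tr, y_tr)

best_k = knn_search.best_params_["n_neighbors"]
test_r2 = knn_search.score(X_te, y_te)  # evaluated once, after selection

print("Best n_neighbors:", best_k)
print("CV R2           :", knn_search.best_score_)
print("Test R2         :", test_r2)
```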


# ⏬ Test Model

In [108]:
Test = x_test.sample(1)
Test_idx = Test.index

Test_New_Preprocess = Column_Transformer.transform(Test)
Test_New_Predict = XGBR.predict(Test_New_Preprocess)

Actual_Price = df.loc[Test_idx][['price']]
Predicted_Price = Test_New_Predict[0]

print("Actual Price : ", Actual_Price)
print("\nPredicted Price : ", round(Predicted_Price,1))

Actual Price :         price
46998  621.0

Predicted Price :  620.2


# ⏬ Save Model

In [109]:
# Define the model
model=XGBR

In [110]:
import joblib
# Save the model and preprocessor
joblib.dump(model, 'XGBRegressor_new.h5')
joblib.dump(Column_Transformer, 'column_Transformer_New.h5')

['column_Transformer_New.h5']
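
Saving the transformer and the model as two files means the app must load both and keep them in sync. A possible alternative is to wrap them in one `Pipeline` and persist that single object. The sketch below uses a few rows copied from the frame shown earlier, `OneHotEncoder` in place of `BinaryEncoder`, and `DecisionTreeRegressor` in place of `XGBRegressor` so that it runs with scikit-learn alone; the file name `airbnb_pipeline.joblib` is hypothetical:

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler
from sklearn.tree import DecisionTreeRegressor

# Six rows taken from the frame displayed earlier, reduced to two features
demo = pd.DataFrame({
    "service_fee": [193.0, 28.0, 41.0, 204.0, 64.0, 196.0],
    "room_type": ["Private room", "Entire home/apt", "Entire home/apt",
                  "Private room", "Entire home/apt", "Private room"],
    "price": [966.0, 142.0, 204.0, 1018.0, 319.0, 982.0],
})
X_demo, y_demo = demo.drop("price", axis=1), demo["price"]

# Stand-ins (assumptions): OneHotEncoder for BinaryEncoder, a tree for XGBRegressor
preprocess = ColumnTransformer(transformers=[
    ("num", RobustScaler(), ["service_fee"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["room_type"]),
])
full_pipeline = Pipeline(steps=[
    ("preprocess", preprocess),
    ("model", DecisionTreeRegressor(random_state=42)),
])
full_pipeline.fit(X_demo, y_demo)

# Persist and reload a single artifact (hypothetical file name)
path = os.path.join(tempfile.mkdtemp(), "airbnb_pipeline.joblib")
joblib.dump(full_pipeline, path)
reloaded = joblib.load(path)

# The reloaded pipeline accepts raw columns and reproduces the predictions
same = (reloaded.predict(X_demo) == full_pipeline.predict(X_demo)).all()
print("Predictions match after reload:", same)
```

With this layout the app would call `predict` on the raw input frame directly, with no separate `transform` step.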

In [None]:
!pip install streamlit

In [113]:
%%writefile app.py
import streamlit as st
import joblib
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler , OrdinalEncoder , PolynomialFeatures ,RobustScaler
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LinearRegression , Ridge , Lasso , ElasticNet
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error , r2_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor
from xgboost import XGBRegressor
from sklearn.pipeline import Pipeline
from category_encoders import BinaryEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVR

st.set_page_config(layout="wide" , page_title="Airbnb APP")

st.title('Airbnb Price Prediction')

column_1 , column_2 , column_3 = st.columns([70,5,70])
with column_1:
    neighbourhood_group=st.selectbox('neighbourhood_group? ',["Manhattan","Brooklyn","Queens","Bronx","Staten Island"])
    room_type=st.selectbox('room_type? ',["Entire home/apt","Private room","Shared room","Hotel room"])
    cancellation_policy=st.selectbox('cancellation_policy? ',["moderate","strict","flexible"])
    # Numeric options (2022 down to 2003), since Construction_year is scaled as a numeric feature
    Construction_year=st.selectbox('Construction_year? ', list(range(2022, 2002, -1)))
    host_identity_verified=st.radio('host_identity_verified? ',["unconfirmed","verified"])
    instant_bookable=st.radio("instant_bookable",['False','True'])

with column_3:
    minimum_nights =st.slider('minimum_nights? ',1,30,15)
    number_of_reviews =st.slider('number_of_reviews? ',0,1026,513)
    calculated_host_listings_count =st.slider('calculated_host_listings_count? ',1,332,166)
    availability_365 =st.slider('availability_365? ',0,365,0)
    service_fee =st.slider('service_fee? ',0,240,120)
    review_rate_number =st.slider('review_rate_number? ',1,5,1)

New_Date = pd.DataFrame({'neighbourhood_group':[neighbourhood_group],
                         'room_type':[room_type],
                         'cancellation_policy':[cancellation_policy],
                         'host_identity_verified':[host_identity_verified],
                         'Construction_year':[Construction_year],
                         'instant_bookable':[instant_bookable],
                         'minimum_nights':[minimum_nights],
                         'number_of_reviews':[number_of_reviews],
                         'calculated_host_listings_count':[calculated_host_listings_count],
                         'availability_365':[availability_365],
                         'service_fee':[service_fee],
                         'review_rate_number':[review_rate_number]},index=[0])


# File name must match the one written by joblib.dump (case-sensitive on Linux/macOS)
transformer=joblib.load('column_Transformer_New.h5')
model=joblib.load('XGBRegressor_new.h5')

Preprocess = transformer.transform(New_Date)
Predict = model.predict(Preprocess)

st.dataframe(New_Date,width=1200,height=10,use_container_width=True)

if st.button('Predict'):
    st.subheader(round(Predict[0],2))

Overwriting app.py


In [112]:
!streamlit run app.py

^C


In [None]:
!pip install pipreqs

In [115]:
!pipreqs

Please, verify manually the final list of requirements.txt to avoid possible dependency confusions.
INFO: Successfully saved requirements file in D:\Data Science\01-ZeroGrad\00-My Projects ZeroGrad\ML 13- Airbnb\Mido\requirements.txt
