This cell loads the data and imports the relevant libraries

Hypothesis: Impact of Review Scores on Price

Null Hypothesis (H0): Review scores (e.g., cleanliness, location, communication) do not significantly affect the listing price.

Alternative Hypothesis (H1): Higher review scores lead to higher listing prices.

EDA Approach: Calculate correlations between review scores and price. Fit a multiple linear regression model (e.g., price ~ cleanliness + location_score + communication_score).


In [63]:
import pandas as pd
import numpy as np
import matplotlib
matplotlib.use('nbagg') #to change to interactive backend so we can see graphs
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score


file = ('/Users/olisa/Downloads/clean_df.csv') # convert to website link later
data = pd.read_csv(file)

The next step is cleaning the dataframe and removing the columns that are not necessary for the analysis.

In [3]:
print(data.columns)

Index(['id', 'listing_url', 'name', 'description', 'neighborhood_overview',
       'host_id', 'host_url', 'host_name', 'host_since', 'host_location',
       'host_about', 'host_response_time', 'host_response_rate',
       'host_acceptance_rate', 'host_is_superhost', 'host_picture_url',
       'host_listings_count', 'host_total_listings_count',
       'host_verifications', 'host_identity_verified',
       'neighbourhood_cleansed', 'latitude', 'longitude', 'property_type',
       'room_type', 'accommodates', 'bathrooms_text', 'bedrooms', 'beds',
       'amenities', 'price', 'minimum_nights', 'maximum_nights',
       'minimum_nights_avg_ntm', 'maximum_nights_avg_ntm', 'number_of_reviews',
       'number_of_reviews_ltm', 'first_review', 'last_review',
       'review_scores_rating', 'review_scores_accuracy',
       'review_scores_cleanliness', 'review_scores_checkin',
       'review_scores_communication', 'review_scores_location',
       'review_scores_value', 'instant_bookable',
       'ca

In [4]:

relevant_columns = ['price', 'review_scores_rating', 'review_scores_accuracy', 'review_scores_cleanliness', 'review_scores_checkin', 'review_scores_communication', 'review_scores_location','review_scores_value']
subset_data = data[relevant_columns]

# Display the first 5 rows of the new DataFrame
print(subset_data.head())



     price  review_scores_rating  review_scores_accuracy  \
0   $50.00                  4.90                    4.82   
1   $75.00                  4.79                    4.84   
2   $90.00                  4.32                    4.53   
3   $55.00                  4.84                    4.91   
4  $379.00                  4.74                    4.82   

   review_scores_cleanliness  review_scores_checkin  \
0                       4.89                   4.86   
1                       4.88                   4.87   
2                       4.03                   4.72   
3                       4.71                   4.93   
4                       4.69                   4.69   

   review_scores_communication  review_scores_location  review_scores_value  
0                         4.93                    4.75                 4.82  
1                         4.82                    4.93                 4.73  
2                         4.86                    4.72                 4.3

Now let's check for missing values, handle them if needed, and convert any data types that need it

In [27]:
print(subset_data.dtypes)
print(subset_data.isna().sum(axis=0))

price                          float64
review_scores_rating           float64
review_scores_accuracy         float64
review_scores_cleanliness      float64
review_scores_checkin          float64
review_scores_communication    float64
review_scores_location         float64
review_scores_value            float64
dtype: object
price                              0
review_scores_rating           16785
review_scores_accuracy         17807
review_scores_cleanliness      17794
review_scores_checkin          17841
review_scores_communication    17808
review_scores_location         17839
review_scores_value            17842
dtype: int64


We have confirmed that there are missing values, so we will need to handle them. In addition, price is stored as the wrong type, so let's convert it to a numeric one. After that we will proceed with some descriptive statistics

In [28]:
# A large share of listings have no review scores, so imputing them with the mean would distort the analysis. We therefore drop these rows from the dataframe entirely

subset_data = subset_data.dropna()
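
Before dropping rows it can help to see how much data is lost. The following is a minimal sketch on a toy dataframe (the values and the name `toy` are made up for illustration, not taken from the dataset) that reports the share of missing values per column and the number of rows removed by `dropna()`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for subset_data; values are invented for illustration
toy = pd.DataFrame({
    "price": [50.0, 75.0, 90.0, 55.0],
    "review_scores_rating": [4.9, np.nan, 4.3, 4.8],
    "review_scores_value": [4.8, np.nan, np.nan, 4.7],
})

# Fraction of missing values in each column
missing_share = toy.isna().mean()

# Rows lost when every incomplete row is dropped
rows_before = len(toy)
complete = toy.dropna()
rows_dropped = rows_before - len(complete)

print(missing_share)
print(f"dropped {rows_dropped} of {rows_before} rows")
```

Running the same two checks on `subset_data` would show how many listings the analysis excludes because they have no reviews.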

In [30]:
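# Convert 'price' to float: strip the '$' sign (and any thousands separators) as literal characters first
subset_data['price'] = (subset_data['price'].astype(str)
                        .str.replace('$', '', regex=False)
                        .str.replace(',', '', regex=False)
                        .astype(float))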

print(subset_data['price'])
print(subset_data['price'].dtypes)


0         50.0
1         75.0
2         90.0
3         55.0
4        379.0
         ...  
69346     55.0
69347    201.0
69348    246.0
69349    250.0
69350    134.0
Name: price, Length: 51505, dtype: float64
float64
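
A slightly more defensive version of the conversion could look like the sketch below. The helper name `parse_price` and the sample strings are hypothetical; the idea is to strip currency symbols and thousands separators and let `pd.to_numeric` turn anything unparseable into `NaN` instead of raising an error.

```python
import pandas as pd

def parse_price(raw: pd.Series) -> pd.Series:
    """Convert price strings such as '$1,250.00' to floats (hypothetical helper)."""
    # Remove '$' and ',' characters, then coerce unparseable entries to NaN
    cleaned = raw.astype(str).str.replace(r"[$,]", "", regex=True)
    return pd.to_numeric(cleaned, errors="coerce")

# Invented sample values covering the common cases
toy_prices = pd.Series(["$50.00", "$1,250.00", "379.0", "n/a"])
parsed = parse_price(toy_prices)
print(parsed)
```

Coercing to `NaN` makes bad entries visible through `isna()` rather than stopping the notebook at the first malformed value.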


In [43]:
# Mean
mean_price = subset_data['price'].mean()

# Median
median_price = subset_data['price'].median()

# Standard Deviation
std_price = subset_data['price'].std()

# Minimum and Maximum
min_price = subset_data['price'].min()
max_price = subset_data['price'].max()

# Quartiles 
q25_price = subset_data['price'].quantile(0.25)
q75_price = subset_data['price'].quantile(0.75)

summ_stats = [f'mean = {mean_price}', f'median = {median_price}', f'standard deviation = {std_price}', f'min = {min_price}', f'max = {max_price}', f'q25 = {q25_price}', f'q75 = {q75_price}']
for stat in summ_stats:
    print(stat)

mean = 134.02898184642268
median = 96.0
standard deviation = 125.55085686084625
min = 0.0
max = 999.0
q25 = 55.0
q75 = 165.0
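
The quartiles printed above can be turned into the conventional 1.5 x IQR outlier fences, which is the default rule boxplots use to decide which points are drawn as outliers. This is a small worked sketch using the two quartile values from the output above; the variable names for the fences are chosen here for illustration.

```python
# Quartiles copied from the summary statistics printed above
q25_price = 55.0
q75_price = 165.0

# Interquartile range and the conventional 1.5 * IQR fences
iqr = q75_price - q25_price
lower_fence = q25_price - 1.5 * iqr
upper_fence = q75_price + 1.5 * iqr

print(f"IQR = {iqr}")
print(f"lower fence = {lower_fence}, upper fence = {upper_fence}")
```

Under this rule any listing priced above the upper fence counts as a potential outlier, and the maximum of 999.0 lies well beyond it. The lower fence is negative, so no price can fall below it.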


These statistics are not very informative on their own, so let's look at the data more closely

# Now we are going to start with data visualisation

# Boxplot - A boxplot gives a concise summary of a distribution: the median, the quartiles, and potential outliers. Here's how you can create a boxplot for the 'price' variable:

In [64]:

# Boxplot of the price distribution

plt.figure(figsize=(8, 6))
# Pass a single colour: `palette` without `hue` is deprecated in seaborn 0.13+
sns.boxplot(x='price', data=subset_data, color=sns.color_palette('pastel')[0])
plt.title('Boxplot of Price')
plt.xlabel('Price')
plt.show()


In [42]:
# Boxplots comparing each review score with the listing price

review_scores = ['review_scores_rating', 'review_scores_accuracy', 'review_scores_cleanliness', 'review_scores_checkin', 'review_scores_communication', 'review_scores_location','review_scores_value']

for score in review_scores:
    # One figure per score, so the plots are not drawn on top of each other
    plt.figure(figsize=(10, 6))
    # Assign `hue` and hide the legend: `palette` without `hue` is deprecated in seaborn 0.13+
    sns.boxplot(x=score, y='price', hue=score, data=subset_data, palette='pastel', legend=False)
    plt.xlabel(f'Review Score ({score.replace("review_scores_", "")})')
    plt.ylabel('Listing Price')
    plt.title(f'Impact of {score} on Listing Price')
    plt.tight_layout()
    plt.show()


Scatter plots visualize the relationship between each review score and price

In [58]:
for score in review_scores:
    # One figure per score, so the plots are not drawn on top of each other
    plt.figure(figsize=(10, 6))
    # `palette` has no effect without `hue`, so it is omitted here
    sns.scatterplot(x=score, y='price', data=subset_data)
    plt.xlabel(f'Review Score ({score.replace("review_scores_", "")})')
    plt.ylabel('Listing Price')
    plt.title(f'Impact of {score} on Listing Price')
    plt.tight_layout()
    plt.show()


Histograms show us the frequency distribution of the given variable

In [35]:

for score in review_scores:
    # One figure per score, so the histograms are not drawn on top of each other
    plt.figure(figsize=(10, 6))
    # `palette` has no effect without `hue`, so it is omitted here
    sns.histplot(x=score, data=subset_data, bins=10)
    plt.xlabel(f'Review Score ({score.replace("review_scores_", "")})')
    plt.ylabel('Count')
    plt.title(f'Distribution of {score}')
    plt.tight_layout()
    plt.show()




Analysis of summary statistics

Correlation analysis - using Spearman's rank
Spearman's rank correlation measures how consistently two variables (here, price and a review score) rise or fall together, without assuming that the relationship is linear.
The closer the coefficient is to 1 or -1, the stronger this monotonic relationship between the review score and price; values near 0 indicate little or no relationship.


In [57]:

# Library needed for Spearman's rank correlation
from scipy.stats import spearmanr

# Dictionary to store correlation coefficients for each review score
correlation_results = {}



# Iterate over each review score variable
for score in review_scores:
    # Calculate Spearman's rank correlation coefficient between the review score and prices
    spearman_corr, p_value = spearmanr(subset_data[score], subset_data['price'])
    
    # Store the correlation coefficient and p-value in the dictionary
    correlation_results[score] = {'correlation_coefficient': spearman_corr, 'p_value': p_value}

# Print the results
for score, result in correlation_results.items():
    print(f"Spearman's Rank Correlation for {score}:")
    print("Correlation Coefficient:", result['correlation_coefficient'])
    print("P-value:", result['p_value'])
    print()


Spearman's Rank Correlation for review_scores_rating:
Correlation Coefficient: -0.0024953209579192427
P-value: 0.5711945717696489

Spearman's Rank Correlation for review_scores_accuracy:
Correlation Coefficient: -0.03312066723199588
P-value: 5.539954507858322e-14

Spearman's Rank Correlation for review_scores_cleanliness:
Correlation Coefficient: 0.026908509333014967
P-value: 1.0103494962826755e-09

Spearman's Rank Correlation for review_scores_checkin:
Correlation Coefficient: -0.04873188124616945
P-value: 1.8367884502121817e-28

Spearman's Rank Correlation for review_scores_communication:
Correlation Coefficient: -0.04677633043577563
P-value: 2.372086799243635e-26

Spearman's Rank Correlation for review_scores_location:
Correlation Coefficient: 0.13709621530397903
P-value: 1.6054008947546624e-214

Spearman's Rank Correlation for review_scores_value:
Correlation Coefficient: -0.14556720649490376
P-value: 7.153351638502639e-242
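
To make the idea behind the coefficient concrete, the sketch below shows that Spearman's coefficient is Pearson's correlation computed on ranks. The arrays `scores` and `prices` are toy values invented for illustration: price always increases with the score, but not linearly.

```python
import numpy as np
from scipy.stats import pearsonr, rankdata, spearmanr

# Toy data: price rises with the score, but far from linearly
scores = np.array([4.2, 4.5, 4.7, 4.8, 4.9, 5.0])
prices = np.array([40.0, 55.0, 60.0, 90.0, 200.0, 950.0])

# Spearman's coefficient computed directly
rho, p_value = spearmanr(scores, prices)

# The same quantity obtained as Pearson's r on the ranks
r_on_ranks, _ = pearsonr(rankdata(scores), rankdata(prices))

# Pearson's r on the raw values, which is sensitive to the non-linearity
r_raw, _ = pearsonr(scores, prices)

print(f"Spearman rho     = {rho:.3f}")
print(f"Pearson on ranks = {r_on_ranks:.3f}")
print(f"Pearson on raw   = {r_raw:.3f}")
```

Because only the ordering matters, the extreme price in the toy data does not weaken Spearman's coefficient, which is one reason it suits skewed variables such as price.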



In [56]:
# Visualise the Spearman coefficients as a heatmap so that the correlation between price and every review score can be compared at a glance

correlation_coefficients = [[correlation_results[score]['correlation_coefficient'] for score in review_scores]]

plt.figure(figsize=(8, 6))
sns.heatmap(correlation_coefficients, annot=True, cmap='coolwarm', xticklabels=review_scores, yticklabels=['Price'], fmt=".2f")
plt.title("Spearman's Rank Correlation Coefficients between Review Scores and Price")
plt.xlabel('Review Scores')
plt.ylabel('Price')
plt.show()
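
The heatmap above shows a single row (price against each score). If the correlations among the review scores themselves are also of interest, pandas can compute the full Spearman matrix with `DataFrame.corr(method='spearman')`. The sketch below uses a toy dataframe with invented values and only two of the score columns.

```python
import pandas as pd

# Toy stand-in for subset_data; values are invented for illustration
toy = pd.DataFrame({
    "price": [50.0, 75.0, 90.0, 55.0, 379.0, 120.0],
    "review_scores_location": [4.75, 4.93, 4.72, 4.60, 4.95, 4.80],
    "review_scores_value": [4.82, 4.73, 4.30, 4.90, 4.50, 4.60],
})

# Pairwise Spearman correlations between every pair of columns
corr_matrix = toy.corr(method="spearman")
print(corr_matrix.round(2))
```

The resulting square matrix can be passed to `sns.heatmap` in the same way as the single row above. Strong correlations between the scores themselves would be worth knowing about before interpreting regression coefficients.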



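Analysis of Spearman's rank

All seven coefficients are weak (absolute value below 0.15). The strongest associations are with review_scores_location (about 0.14) and review_scores_value (about -0.15), while review_scores_rating shows no significant monotonic relationship with price (p = 0.57). Because the sample contains more than 50,000 listings, even very small coefficients produce tiny p-values, so statistical significance here does not imply a practically important effect.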

Linear regression is a commonly used statistical technique for modeling the relationship between one or more independent variables (predictors) and a continuous dependent variable (outcome).

Multiple linear regression is more appropriate than simple linear regression here because we want to assess the combined effect of several review scores (cleanliness, location, communication, and so on) on listing price. Each coefficient then estimates how price changes with one score while the other scores are held constant, which helps separate their overlapping effects. Note that a plain linear model like this one does not capture interactions between predictors unless interaction terms are added explicitly.


Here are the steps we are going to follow for the linear regression

In [62]:
# Prepare your features (X) and target variable (y)
X = subset_data[['review_scores_rating', 'review_scores_accuracy', 'review_scores_cleanliness',
          'review_scores_checkin', 'review_scores_communication',
          'review_scores_location', 'review_scores_value']]
y = subset_data['price']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f'Mean Squared Error: {mse:.2f}')
print(f'R-squared: {r2:.2f}')

# Interpretation (coefficients)
coefficients = pd.DataFrame({'Feature': X.columns, 'Coefficient': model.coef_})
print(coefficients)

# Set Seaborn color palette to pastel
sns.set_palette("pastel")

# Visualization (actual vs. predicted), drawn on a fresh figure
plt.figure(figsize=(8, 6))
sns.scatterplot(x=y_test, y=y_pred, alpha=0.5)
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], color='red', linestyle='--')
plt.xlabel('Actual Price')
plt.ylabel('Predicted Price')
plt.title('Regression Model: Actual vs. Predicted Price')
plt.show()


Mean Squared Error: 14459.53
R-squared: 0.05
                       Feature  Coefficient
0         review_scores_rating    37.364039
1       review_scores_accuracy   -12.696113
2    review_scores_cleanliness    29.294405
3        review_scores_checkin    -7.249690
4  review_scores_communication   -20.410385
5       review_scores_location    58.724739
6          review_scores_value   -74.597303
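
The two metrics printed above are easier to read in price units. The sketch below is a worked calculation using only the rounded numbers from the output: it converts the MSE to an RMSE and uses the definition R² = 1 - MSE / MSE_baseline to back out the error of a baseline that always predicts the mean test price. Because R² is rounded to two decimals, the baseline figure is approximate.

```python
import math

# Values copied from the output above (R-squared is rounded to two decimals)
mse = 14459.53
r2 = 0.05

# Typical prediction error in price units
rmse = math.sqrt(mse)

# R^2 = 1 - mse / baseline_mse, so the mean-only baseline error can be backed out
implied_baseline_mse = mse / (1 - r2)
implied_baseline_rmse = math.sqrt(implied_baseline_mse)

print(f"model RMSE             = {rmse:.1f}")
print(f"mean-only baseline RMSE = {implied_baseline_rmse:.1f} (approximate)")
```

The model's error is only slightly smaller than that of the mean-only baseline, which is what an R-squared of 0.05 expresses: the review scores explain about 5% of the variance in price on the test set.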


