# Data Cleaning
This notebook documents the steps and reasoning behind the data cleaning.

Start by defining a function called remove_rows_with_missing_ratings that removes the rows with missing values in the rating columns. It should take the dataset as a pandas DataFrame and return the same type.
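A minimal sketch of such a function, assuming the rating columns are the six `*_rating` columns listed in the `df.info()` output of this notebook. The constant `RATING_COLUMNS` and the toy values are illustrative and not part of the dataset.

```python
import numpy as np
import pandas as pd

# Assumption: these are the rating columns, as named in the df.info() output.
RATING_COLUMNS = [
    "Cleanliness_rating", "Accuracy_rating", "Communication_rating",
    "Location_rating", "Check-in_rating", "Value_rating",
]


def remove_rows_with_missing_ratings(df: pd.DataFrame) -> pd.DataFrame:
    """Return a new DataFrame without the rows that lack any rating value."""
    return df.dropna(subset=RATING_COLUMNS)


# Toy frame with made-up values: the second row has no ratings at all.
toy = pd.DataFrame({col: [4.5, np.nan, 4.8] for col in RATING_COLUMNS})
toy["Price_Night"] = [105, 111, 92]

cleaned = remove_rows_with_missing_ratings(toy)
print(cleaned["Price_Night"].tolist())
```

`dropna` returns a new object, so the input frame is left unchanged and the original row index is preserved.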

In [2]:
import pandas as pd
df = pd.read_csv("./airbnb-property-listings/tabular_data/listing.csv")

In [3]:
df.head()

Unnamed: 0,ID,Category,Title,Description,Amenities,Location,guests,beds,bathrooms,Price_Night,Cleanliness_rating,Accuracy_rating,Communication_rating,Location_rating,Check-in_rating,Value_rating,amenities_count,url,bedrooms,Unnamed: 19
0,f9dcbd09-32ac-41d9-a0b1-fdb2793378cf,Treehouses,Red Kite Tree Tent - Ynys Affalon,"['About this space', ""Escape to one of these t...","['What this place offers', 'Bathroom', 'Shampo...",Llandrindod Wells United Kingdom,2,1.0,1.0,105,4.6,4.7,4.3,5.0,4.3,4.3,13.0,https://www.airbnb.co.uk/rooms/26620994?adults...,,
1,1b4736a7-e73e-45bc-a9b5-d3e7fcf652fd,Treehouses,Az Alom Cabin - Treehouse Tree to Nature Cabin,"['About this space', ""Come and spend a romanti...","['What this place offers', 'Bedroom and laundr...",Guyonvelle Grand Est France,3,3.0,0.0,92,4.3,4.7,4.6,4.9,4.7,4.5,8.0,https://www.airbnb.co.uk/rooms/27055498?adults...,1.0,
2,d577bc30-2222-4bef-a35e-a9825642aec4,Treehouses,Cabane Entre Les Pins\n🌲🏕️🌲,"['About this space', 'Rustic cabin between the...","['What this place offers', 'Scenic views', 'Ga...",Duclair Normandie France,4,2.0,1.5,52,4.2,4.6,4.8,4.8,4.8,4.7,51.0,https://www.airbnb.co.uk/rooms/51427108?adults...,1.0,
3,ca9cbfd4-7798-4e8d-8c17-d5a64fba0abc,Treehouses,Tree Top Cabin with log burner & private hot tub,"['About this space', 'The Tree top cabin is si...","['What this place offers', 'Bathroom', 'Hot wa...",Barmouth Wales United Kingdom,2,,1.0,132,4.8,4.9,4.9,4.9,5.0,4.6,23.0,https://www.airbnb.co.uk/rooms/49543851?adults...,,
4,8b2d0f78-16d8-4559-8692-62ebce2a1302,Treehouses,Hanging cabin,"['About this space', 'Feel refreshed at this u...","['What this place offers', 'Heating and coolin...",Wargnies-le-Petit Hauts-de-France France,2,1.0,,111,,,,,,,5.0,https://www.airbnb.co.uk/rooms/50166553?adults...,1.0,


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 988 entries, 0 to 987
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   ID                    988 non-null    object 
 1   Category              988 non-null    object 
 2   Title                 988 non-null    object 
 3   Description           900 non-null    object 
 4   Amenities             988 non-null    object 
 5   Location              988 non-null    object 
 6   guests                969 non-null    object 
 7   beds                  945 non-null    float64
 8   bathrooms             888 non-null    float64
 9   Price_Night           988 non-null    int64  
 10  Cleanliness_rating    890 non-null    float64
 11  Accuracy_rating       890 non-null    float64
 12  Communication_rating  890 non-null    float64
 13  Location_rating       890 non-null    float64
 14  Check-in_rating       890 non-null    float64
 15  Value_rating          8

In [5]:
df.duplicated().sum()

0

In [6]:
df.isna().sum()

ID                        0
Category                  0
Title                     0
Description              88
Amenities                 0
Location                  0
guests                   19
beds                     43
bathrooms               100
Price_Night               0
Cleanliness_rating       98
Accuracy_rating          98
Communication_rating     98
Location_rating          98
Check-in_rating          98
Value_rating             98
amenities_count           0
url                       0
bedrooms                 82
Unnamed: 19             987
dtype: int64

In [7]:
pd.set_option('display.max_rows', 20)
df[["Cleanliness_rating", "Accuracy_rating", "Communication_rating", "Location_rating", "Check-in_rating", "Value_rating"]][df['Cleanliness_rating'].isna()]

Unnamed: 0,Cleanliness_rating,Accuracy_rating,Communication_rating,Location_rating,Check-in_rating,Value_rating
4,,,,,,
9,,,,,,
15,,,,,,
16,,,,,,
17,,,,,,
...,...,...,...,...,...,...
967,,,,,,
969,,,,,,
971,,,,,,
973,,,,,,


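The cleaning is straightforward because the rating columns are always missing together: a row either has all of its ratings or none of them, so filtering on a single rating column is enough.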
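One way to check that the ratings really are missing together is to count the rows that lack some ratings but not all of them. The sketch below uses a small made-up frame in place of the real dataset; `partially_missing` is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd

rating_columns = [
    "Cleanliness_rating", "Accuracy_rating", "Communication_rating",
    "Location_rating", "Check-in_rating", "Value_rating",
]

# Toy frame with made-up values: rows 1 and 3 have no ratings at all.
toy = pd.DataFrame({col: [4.6, np.nan, 4.2, np.nan] for col in rating_columns})

missing = toy[rating_columns].isna()
# A row is partially missing if it lacks at least one rating but not every rating.
partially_missing = missing.any(axis=1) & ~missing.all(axis=1)

# A count of zero means that filtering on one rating column removes
# exactly the rows that have no ratings.
print(int(partially_missing.sum()))
```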

In [8]:
df = df[df['Cleanliness_rating'].notna()]

In [9]:
df.isna().sum()

ID                        0
Category                  0
Title                     0
Description              60
Amenities                 0
Location                  0
guests                   18
beds                     34
bathrooms                79
Price_Night               0
Cleanliness_rating        0
Accuracy_rating           0
Communication_rating      0
Location_rating           0
Check-in_rating           0
Value_rating              0
amenities_count           0
url                       0
bedrooms                 76
Unnamed: 19             889
dtype: int64

In [10]:
df.dtypes

ID                       object
Category                 object
Title                    object
Description              object
Amenities                object
Location                 object
guests                   object
beds                    float64
bathrooms               float64
Price_Night               int64
Cleanliness_rating      float64
Accuracy_rating         float64
Communication_rating    float64
Location_rating         float64
Check-in_rating         float64
Value_rating            float64
amenities_count         float64
url                      object
bedrooms                 object
Unnamed: 19             float64
dtype: object

The "Description" column contains lists of strings. You'll need to define a function called combine_description_strings that combines the list items into a single string.

Unfortunately, pandas doesn't recognise the values as lists, but as strings whose contents are valid Python lists.

Look up how to do this rather than writing a from-scratch parser that turns the string into a list. The lists contain many empty strings, which should be removed: if they are still present when the list elements are joined with a space, the result can contain runs of multiple spaces. The function should take the dataset as a pandas DataFrame and return the same type. It should also remove any records with a missing description and strip the "About this space" prefix that every description starts with.
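A possible shape for this function, sketched with `ast.literal_eval` from the standard library, which parses a string containing a Python literal without executing arbitrary code. This is one option under the assumption that every non-missing description is a well-formed list literal; it is not necessarily the approach used later in the project. The helper `_combine` and the toy data are illustrative.

```python
import ast

import pandas as pd


def combine_description_strings(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows without a description and join each list into one string."""
    df = df.dropna(subset=["Description"]).copy()

    def _combine(raw: str) -> str:
        items = ast.literal_eval(raw)  # string -> list of strings
        # Remove empty entries so the join does not produce double spaces.
        items = [item.strip() for item in items if item.strip()]
        # Drop the prefix that every description starts with.
        if items and items[0] == "About this space":
            items = items[1:]
        return " ".join(items)

    df["Description"] = df["Description"].apply(_combine)
    return df


# Toy data with made-up descriptions; the second row has no description.
toy = pd.DataFrame({
    "Description": [
        "['About this space', 'Rustic cabin.', '', 'Sleeps four.']",
        None,
    ]
})
result = combine_description_strings(toy)
print(result["Description"].tolist())
```

`ast.literal_eval` raises an exception on a malformed literal, so real data may need a guard around the call. Newlines inside an individual list item are left untouched by this sketch.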

The "guests", "beds", "bathrooms", and "bedrooms" columns have empty values for some rows. Don't remove these rows; instead, define a function called set_default_feature_values that replaces the missing entries with the number 1. It should take the dataset as a pandas DataFrame and return the same type.
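A minimal sketch of set_default_feature_values, based on the same `fillna` call used further down in this notebook. The constant `FEATURE_COLUMNS` and the toy values are introduced here for illustration.

```python
import numpy as np
import pandas as pd

FEATURE_COLUMNS = ["guests", "beds", "bathrooms", "bedrooms"]


def set_default_feature_values(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with missing feature values replaced by 1."""
    df = df.copy()
    df[FEATURE_COLUMNS] = df[FEATURE_COLUMNS].fillna(1)
    return df


# Toy frame with made-up values and one gap in every column.
toy = pd.DataFrame({
    "guests": [2.0, np.nan],
    "beds": [np.nan, 3.0],
    "bathrooms": [1.5, np.nan],
    "bedrooms": [np.nan, 2.0],
})
filled = set_default_feature_values(toy)
print(filled)
```

Working on a copy keeps the caller's DataFrame unchanged, which makes the function safe to chain with the other cleaning steps.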

In [11]:
# Remove the rows that have no description.
df = df[df['Description'].notna()]

In [12]:
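import ast

sample = df['Description'].sample()
print(sample)
# Try each candidate clean-up step on one random description.
print(sample.str.replace("'About this space', ", ''))
print(sample.str.replace(" 'The space', 'The space\n", ''))
print(sample.str.replace(r'\n\n', ' '))
print(sample.str.replace(r'\n', ' '))
print(sample.str.replace("''", ""))
# Parse the string into a real list; unlike eval, ast.literal_eval only accepts literals.
print(sample.apply(ast.literal_eval))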

738    ['About this space', 'Free pickup in new TESLA...
Name: Description, dtype: object
738    ['Free pickup in new TESLA! Starting from late...
Name: Description, dtype: object
738    ['About this space', 'Free pickup in new TESLA...
Name: Description, dtype: object
738    ['About this space', 'Free pickup in new TESLA...
Name: Description, dtype: object
738    ['About this space', 'Free pickup in new TESLA...
Name: Description, dtype: object
738    ['About this space', 'Free pickup in new TESLA...
Name: Description, dtype: object
738    [About this space, Free pickup in new TESLA! S...
Name: Description, dtype: object


In [13]:
df = pd.read_csv("./airbnb-property-listings/tabular_data/listing.csv")

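def combine_description_strings(df):
    """Strip boilerplate and newline escapes from the raw Description strings."""
    # Drop the "About this space" prefix that every description starts with.
    df['Description'] = df['Description'].str.replace("'About this space', ", '')
    # Note: this pattern is not a raw string, so its \n is a real newline character.
    df['Description'] = df['Description'].str.replace(" 'The space', 'The space\n", '')
    df['Description'] = df['Description'].str.replace(r'\n\n', ' ')
    df['Description'] = df['Description'].str.replace(r'\n', ' ')
    # Use the .str accessor: Series.replace only matches whole cell values.
    df['Description'] = df['Description'].str.replace("''", "")
    return df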

sample = combine_description_strings(df.sample())
sample

Unnamed: 0,ID,Category,Title,Description,Amenities,Location,guests,beds,bathrooms,Price_Night,Cleanliness_rating,Accuracy_rating,Communication_rating,Location_rating,Check-in_rating,Value_rating,amenities_count,url,bedrooms,Unnamed: 19
142,c4e32eed-c3cd-4d4a-831b-a8e7e720b645,Treehouses,Tradewinds Treehouse,['The Trade Winds Treehouse is simply our most...,"['What this place offers', 'Bathroom', 'Hot wa...",Stanton Kentucky United States,8,5.0,1.0,510,4.7,4.8,4.9,4.9,4.9,4.5,14.0,https://www.airbnb.co.uk/rooms/25002039?adults...,3,


In [14]:
# # Your original text
# original_text = '["Escape to one of these two fabulous Tree Tents... very patchy mobile signal and limited internet connectivity.\']'

# # Clean and transform the text
# cleaned_text = eval(original_text)  # This will convert the text into a list
# cleaned_text = '\n'.join(cleaned_text)  # Join the list elements into a single string

# # Now, 'cleaned_text' contains the transformed text as a string
# print(cleaned_text)


In [15]:
import pandas as pd
df = pd.read_csv("./airbnb-property-listings/tabular_data/clean_tabular_data.csv")
df

Unnamed: 0.1,Unnamed: 0,ID,Category,Title,Description,Amenities,Location,guests,beds,bathrooms,...,Cleanliness_rating,Accuracy_rating,Communication_rating,Location_rating,Check-in_rating,Value_rating,amenities_count,url,bedrooms,Unnamed: 19
0,0,f9dcbd09-32ac-41d9-a0b1-fdb2793378cf,Treehouses,Red Kite Tree Tent - Ynys Affalon,"[""Escape to one of these two fabulous Tree Ten...","['What this place offers', 'Bathroom', 'Shampo...",Llandrindod Wells United Kingdom,2,1.0,1.0,...,4.6,4.7,4.3,5.0,4.3,4.3,13.0,https://www.airbnb.co.uk/rooms/26620994?adults...,1,
1,1,1b4736a7-e73e-45bc-a9b5-d3e7fcf652fd,Treehouses,Az Alom Cabin - Treehouse Tree to Nature Cabin,"[""Come and spend a romantic stay with a couple...","['What this place offers', 'Bedroom and laundr...",Guyonvelle Grand Est France,3,3.0,0.0,...,4.3,4.7,4.6,4.9,4.7,4.5,8.0,https://www.airbnb.co.uk/rooms/27055498?adults...,1,
2,2,d577bc30-2222-4bef-a35e-a9825642aec4,Treehouses,Cabane Entre Les Pins\n🌲🏕️🌲,"['Rustic cabin between the pines, 3 meters hig...","['What this place offers', 'Scenic views', 'Ga...",Duclair Normandie France,4,2.0,1.5,...,4.2,4.6,4.8,4.8,4.8,4.7,51.0,https://www.airbnb.co.uk/rooms/51427108?adults...,1,
3,3,ca9cbfd4-7798-4e8d-8c17-d5a64fba0abc,Treehouses,Tree Top Cabin with log burner & private hot tub,['The Tree top cabin is situated in our peacef...,"['What this place offers', 'Bathroom', 'Hot wa...",Barmouth Wales United Kingdom,2,1.0,1.0,...,4.8,4.9,4.9,4.9,5.0,4.6,23.0,https://www.airbnb.co.uk/rooms/49543851?adults...,1,
4,5,cfe479b9-c8f8-44af-9bc6-46ede9f14bb5,Treehouses,Treehouse near Paris Disney,"['Charming cabin nestled in the leaves, real u...","['What this place offers', 'Bathroom', 'Hair d...",Le Plessis-Feu-Aussoux Île-de-France France,4,3.0,1.0,...,5.0,4.9,5.0,4.7,5.0,4.7,32.0,https://www.airbnb.co.uk/rooms/935398?adults=1...,2,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
884,982,cfbc88da-e88c-415a-b397-108d0948c4ba,Beachfront,Lancing Beach Apartment,['An apartment directly on the beach at Lancin...,"['What this place offers', 'Bathroom', 'Hair d...",Lancing United Kingdom,4,2.0,1.5,...,4.9,5.0,5.0,5.0,4.9,4.8,33.0,https://www.airbnb.co.uk/rooms/12680472?adults...,2,
885,983,4fea5054-f999-4c07-addc-67e4d893deab,Beachfront,Apartment,['Light roomy space with outside garden 5 minu...,"['What this place offers', 'Bathroom', 'Hair d...",Brighton and Hove England United Kingdom,2,1.0,1.0,...,4.8,5.0,4.9,4.9,5.0,4.9,54.0,https://www.airbnb.co.uk/rooms/48565992?adults...,1,
886,984,282118e2-049e-4d9f-b2f2-b47477881b07,Beachfront,Sea front flat with a stunning view!,['This specious two bedroom flat on the sea fr...,"['What this place offers', 'Scenic views', 'Be...",East Sussex England United Kingdom,4,2.0,1.5,...,4.8,5.0,5.0,5.0,5.0,4.8,38.0,https://www.airbnb.co.uk/rooms/49742544?adults...,2,
887,985,9ebf9cec-624e-480e-8704-dffa7cb1fe51,Beachfront,MP713 - Camber Sands Holiday Park - Sleeps 6 +...,"['With all the modern amenities, our contempor...","['What this place offers', 'Bathroom', 'Hot wa...",Camber England United Kingdom,6,3.0,2.0,...,4.7,4.8,5.0,5.0,5.0,4.7,24.0,https://www.airbnb.co.uk/rooms/47777462?adults...,2,


In [16]:
df[['guests', 'beds', 'bathrooms', 'bedrooms']] = df[['guests', 'beds', 'bathrooms', 'bedrooms']].fillna(value=1)
df.isna().sum()


Unnamed: 0           0
ID                   0
Category             0
Title                0
Description         60
                  ... 
Value_rating         0
amenities_count      0
url                  0
bedrooms             0
Unnamed: 19        889
Length: 21, dtype: int64

In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 889 entries, 0 to 888
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Unnamed: 0            889 non-null    int64  
 1   ID                    889 non-null    object 
 2   Category              889 non-null    object 
 3   Title                 889 non-null    object 
 4   Description           829 non-null    object 
 5   Amenities             889 non-null    object 
 6   Location              889 non-null    object 
 7   guests                889 non-null    int64  
 8   beds                  889 non-null    float64
 9   bathrooms             889 non-null    float64
 10  Price_Night           889 non-null    int64  
 11  Cleanliness_rating    889 non-null    float64
 12  Accuracy_rating       889 non-null    float64
 13  Communication_rating  889 non-null    float64
 14  Location_rating       889 non-null    float64
 15  Check-in_rating       8

In [18]:
df.select_dtypes(include='number')

Unnamed: 0.1,Unnamed: 0,guests,beds,bathrooms,Price_Night,Cleanliness_rating,Accuracy_rating,Communication_rating,Location_rating,Check-in_rating,Value_rating,amenities_count,bedrooms,Unnamed: 19
0,0,2,1.0,1.0,105,4.6,4.7,4.3,5.0,4.3,4.3,13.0,1,
1,1,3,3.0,0.0,92,4.3,4.7,4.6,4.9,4.7,4.5,8.0,1,
2,2,4,2.0,1.5,52,4.2,4.6,4.8,4.8,4.8,4.7,51.0,1,
3,3,2,1.0,1.0,132,4.8,4.9,4.9,4.9,5.0,4.6,23.0,1,
4,5,4,3.0,1.0,143,5.0,4.9,5.0,4.7,5.0,4.7,32.0,2,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
884,982,4,2.0,1.5,240,4.9,5.0,5.0,5.0,4.9,4.8,33.0,2,
885,983,2,1.0,1.0,78,4.8,5.0,4.9,4.9,5.0,4.9,54.0,1,
886,984,4,2.0,1.5,113,4.8,5.0,5.0,5.0,5.0,4.8,38.0,2,
887,985,6,3.0,2.0,80,4.7,4.8,5.0,5.0,5.0,4.7,24.0,2,


In [19]:
rating_columns = ['Cleanliness_rating', 'Accuracy_rating', 'Communication_rating', 'Location_rating', 'Check-in_rating', 'Value_rating']
df[rating_columns].isna()

Unnamed: 0,Cleanliness_rating,Accuracy_rating,Communication_rating,Location_rating,Check-in_rating,Value_rating
0,False,False,False,False,False,False
1,False,False,False,False,False,False
2,False,False,False,False,False,False
3,False,False,False,False,False,False
4,False,False,False,False,False,False
...,...,...,...,...,...,...
884,False,False,False,False,False,False
885,False,False,False,False,False,False
886,False,False,False,False,False,False
887,False,False,False,False,False,False


In [20]:
import pandas as pd
df = pd.read_csv("./airbnb-property-listings/tabular_data/clean_tabular_data.csv")

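def combine_description_strings(df):
    """Strip boilerplate and newline escapes from the raw Description strings."""
    # Drop the "About this space" prefix that every description starts with.
    df['Description'] = df['Description'].str.replace("'About this space', ", '')
    # Note: this pattern is not a raw string, so its \n is a real newline character.
    df['Description'] = df['Description'].str.replace(" 'The space', 'The space\n", '')
    df['Description'] = df['Description'].str.replace(r'\n\n', ' ')
    df['Description'] = df['Description'].str.replace(r'\n', ' ')
    # Use the .str accessor: Series.replace only matches whole cell values.
    df['Description'] = df['Description'].str.replace("''", "")
    return df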


In [21]:
df['Description'][0]

'["Escape to one of these two fabulous Tree Tents. Suspended high above the canopy, it’s time to appreciate life from a new perspective. Featured on George Clarke’s Amazing Spaces, these Tree Tents are a feat of aviation technology. Tree Tent comes complete with fire pit, outdoor kitchen and shower with hot water. You’ll discover a comfortable bed and cosy wood burning stove. Part of the Red Kite Estate, along with our barn and its sister tree tent, the first ever built in the UK, Dragon\'s Egg.", \'The space\', \'The space The true joy of this place is how wonderfully simple it is (aviation technology aside). Days are filled with fireside discussions, wildlife watching and stunningly beautiful walks. With the nearest mobile signal a ten minute walk away, it’s a great place to ditch the digital and truly escape. Head over the bridge to your own private deck that happily houses a clever outdoor-kitchen and shower (complete with hot water). It’s the perfect spot to fry up breakfast whils

In [22]:
import pandas as pd
df = pd.read_csv("./airbnb-property-listings/tabular_data/clean_tabular_data.csv")

In [23]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 889 entries, 0 to 888
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Unnamed: 0            889 non-null    int64  
 1   ID                    889 non-null    object 
 2   Category              889 non-null    object 
 3   Title                 889 non-null    object 
 4   Description           829 non-null    object 
 5   Amenities             889 non-null    object 
 6   Location              889 non-null    object 
 7   guests                889 non-null    int64  
 8   beds                  889 non-null    float64
 9   bathrooms             889 non-null    float64
 10  Price_Night           889 non-null    int64  
 11  Cleanliness_rating    889 non-null    float64
 12  Accuracy_rating       889 non-null    float64
 13  Communication_rating  889 non-null    float64
 14  Location_rating       889 non-null    float64
 15  Check-in_rating       8

In [24]:
df['Category'] = df['Category'].astype('category')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 889 entries, 0 to 888
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype   
---  ------                --------------  -----   
 0   Unnamed: 0            889 non-null    int64   
 1   ID                    889 non-null    object  
 2   Category              889 non-null    category
 3   Title                 889 non-null    object  
 4   Description           829 non-null    object  
 5   Amenities             889 non-null    object  
 6   Location              889 non-null    object  
 7   guests                889 non-null    int64   
 8   beds                  889 non-null    float64 
 9   bathrooms             889 non-null    float64 
 10  Price_Night           889 non-null    int64   
 11  Cleanliness_rating    889 non-null    float64 
 12  Accuracy_rating       889 non-null    float64 
 13  Communication_rating  889 non-null    float64 
 14  Location_rating       889 non-null    float64 
 15  Check-

In [25]:
df = pd.get_dummies(df, columns=['Category'], dtype=int, prefix=['Category'])

In [26]:
df

Unnamed: 0.1,Unnamed: 0,ID,Title,Description,Amenities,Location,guests,beds,bathrooms,Price_Night,...,Value_rating,amenities_count,url,bedrooms,Unnamed: 19,Category_Amazing pools,Category_Beachfront,Category_Chalets,Category_Offbeat,Category_Treehouses
0,0,f9dcbd09-32ac-41d9-a0b1-fdb2793378cf,Red Kite Tree Tent - Ynys Affalon,"[""Escape to one of these two fabulous Tree Ten...","['What this place offers', 'Bathroom', 'Shampo...",Llandrindod Wells United Kingdom,2,1.0,1.0,105,...,4.3,13.0,https://www.airbnb.co.uk/rooms/26620994?adults...,1,,0,0,0,0,1
1,1,1b4736a7-e73e-45bc-a9b5-d3e7fcf652fd,Az Alom Cabin - Treehouse Tree to Nature Cabin,"[""Come and spend a romantic stay with a couple...","['What this place offers', 'Bedroom and laundr...",Guyonvelle Grand Est France,3,3.0,0.0,92,...,4.5,8.0,https://www.airbnb.co.uk/rooms/27055498?adults...,1,,0,0,0,0,1
2,2,d577bc30-2222-4bef-a35e-a9825642aec4,Cabane Entre Les Pins\n🌲🏕️🌲,"['Rustic cabin between the pines, 3 meters hig...","['What this place offers', 'Scenic views', 'Ga...",Duclair Normandie France,4,2.0,1.5,52,...,4.7,51.0,https://www.airbnb.co.uk/rooms/51427108?adults...,1,,0,0,0,0,1
3,3,ca9cbfd4-7798-4e8d-8c17-d5a64fba0abc,Tree Top Cabin with log burner & private hot tub,['The Tree top cabin is situated in our peacef...,"['What this place offers', 'Bathroom', 'Hot wa...",Barmouth Wales United Kingdom,2,1.0,1.0,132,...,4.6,23.0,https://www.airbnb.co.uk/rooms/49543851?adults...,1,,0,0,0,0,1
4,5,cfe479b9-c8f8-44af-9bc6-46ede9f14bb5,Treehouse near Paris Disney,"['Charming cabin nestled in the leaves, real u...","['What this place offers', 'Bathroom', 'Hair d...",Le Plessis-Feu-Aussoux Île-de-France France,4,3.0,1.0,143,...,4.7,32.0,https://www.airbnb.co.uk/rooms/935398?adults=1...,2,,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
884,982,cfbc88da-e88c-415a-b397-108d0948c4ba,Lancing Beach Apartment,['An apartment directly on the beach at Lancin...,"['What this place offers', 'Bathroom', 'Hair d...",Lancing United Kingdom,4,2.0,1.5,240,...,4.8,33.0,https://www.airbnb.co.uk/rooms/12680472?adults...,2,,0,1,0,0,0
885,983,4fea5054-f999-4c07-addc-67e4d893deab,Apartment,['Light roomy space with outside garden 5 minu...,"['What this place offers', 'Bathroom', 'Hair d...",Brighton and Hove England United Kingdom,2,1.0,1.0,78,...,4.9,54.0,https://www.airbnb.co.uk/rooms/48565992?adults...,1,,0,1,0,0,0
886,984,282118e2-049e-4d9f-b2f2-b47477881b07,Sea front flat with a stunning view!,['This specious two bedroom flat on the sea fr...,"['What this place offers', 'Scenic views', 'Be...",East Sussex England United Kingdom,4,2.0,1.5,113,...,4.8,38.0,https://www.airbnb.co.uk/rooms/49742544?adults...,2,,0,1,0,0,0
887,985,9ebf9cec-624e-480e-8704-dffa7cb1fe51,MP713 - Camber Sands Holiday Park - Sleeps 6 +...,"['With all the modern amenities, our contempor...","['What this place offers', 'Bathroom', 'Hot wa...",Camber England United Kingdom,6,3.0,2.0,80,...,4.7,24.0,https://www.airbnb.co.uk/rooms/47777462?adults...,2,,0,1,0,0,0


In [27]:
df = pd.read_csv("./airbnb-property-listings/tabular_data/clean_tabular_data.csv")

In [28]:
df.head()

Unnamed: 0.1,Unnamed: 0,ID,Category,Title,Description,Amenities,Location,guests,beds,bathrooms,...,Cleanliness_rating,Accuracy_rating,Communication_rating,Location_rating,Check-in_rating,Value_rating,amenities_count,url,bedrooms,Unnamed: 19
0,0,f9dcbd09-32ac-41d9-a0b1-fdb2793378cf,Treehouses,Red Kite Tree Tent - Ynys Affalon,"[""Escape to one of these two fabulous Tree Ten...","['What this place offers', 'Bathroom', 'Shampo...",Llandrindod Wells United Kingdom,2,1.0,1.0,...,4.6,4.7,4.3,5.0,4.3,4.3,13.0,https://www.airbnb.co.uk/rooms/26620994?adults...,1,
1,1,1b4736a7-e73e-45bc-a9b5-d3e7fcf652fd,Treehouses,Az Alom Cabin - Treehouse Tree to Nature Cabin,"[""Come and spend a romantic stay with a couple...","['What this place offers', 'Bedroom and laundr...",Guyonvelle Grand Est France,3,3.0,0.0,...,4.3,4.7,4.6,4.9,4.7,4.5,8.0,https://www.airbnb.co.uk/rooms/27055498?adults...,1,
2,2,d577bc30-2222-4bef-a35e-a9825642aec4,Treehouses,Cabane Entre Les Pins\n🌲🏕️🌲,"['Rustic cabin between the pines, 3 meters hig...","['What this place offers', 'Scenic views', 'Ga...",Duclair Normandie France,4,2.0,1.5,...,4.2,4.6,4.8,4.8,4.8,4.7,51.0,https://www.airbnb.co.uk/rooms/51427108?adults...,1,
3,3,ca9cbfd4-7798-4e8d-8c17-d5a64fba0abc,Treehouses,Tree Top Cabin with log burner & private hot tub,['The Tree top cabin is situated in our peacef...,"['What this place offers', 'Bathroom', 'Hot wa...",Barmouth Wales United Kingdom,2,1.0,1.0,...,4.8,4.9,4.9,4.9,5.0,4.6,23.0,https://www.airbnb.co.uk/rooms/49543851?adults...,1,
4,5,cfe479b9-c8f8-44af-9bc6-46ede9f14bb5,Treehouses,Treehouse near Paris Disney,"['Charming cabin nestled in the leaves, real u...","['What this place offers', 'Bathroom', 'Hair d...",Le Plessis-Feu-Aussoux Île-de-France France,4,3.0,1.0,...,5.0,4.9,5.0,4.7,5.0,4.7,32.0,https://www.airbnb.co.uk/rooms/935398?adults=1...,2,


In [29]:
df = pd.read_csv("./airbnb-property-listings/tabular_data/clean_tabular_data_transformed.csv")

FileNotFoundError: [Errno 2] No such file or directory: './airbnb-property-listings/tabular_data/clean_tabular_data_transformed.csv'

In [None]:
df.sample(10)

Unnamed: 0.1,Unnamed: 0,ID,Title,Description,Amenities,Location,guests,beds,bathrooms,Price_Night,...,Category_Chalets,Category_Offbeat,Category_Treehouses,Area_Africa,Area_Asia,Area_Australia,Area_Central America,Area_Europe,Area_North America,Area_South America
232,269,84ec7ea4-4e91-4626-973f-e931dd6380d1,Sparrowhawk Luxury Lodge with Sauna & Hot Tub,['COVID-19 Restrictions for England Guests can...,"['What this place offers', 'Bathroom', 'Bath',...",Chinnor Oxfordshire United Kingdom,2,1.0,1.0,235,...,1,0,0,0,0,0,0,1,0,0
414,452,1922ade8-b33d-4eb8-ac66-72b482dfbc8a,Three en-suite bedroom cottage with indoor pool,['This cottage is like a tardis! Very large en...,"['What this place offers', 'Bathroom', 'Hair d...",Cotleigh United Kingdom,8,4.0,3.5,179,...,0,0,0,0,0,0,0,1,0,0
251,288,83425df8-e0fb-4832-955a-24222d8a000e,Ivy shepherds hut with hot tub,"['Set in a clearing in the woods, our holiday ...","['What this place offers', 'Bathroom', 'Hot wa...",Hampshire United Kingdom,2,1.0,1.0,72,...,1,0,0,0,0,0,0,1,0,0
262,299,403f50e5-8847-4984-8441-ca92c0f35032,Self Catering shepherds hut with the deer.,['Self catering shepherds hut with deer in the...,"['What this place offers', 'Bathroom', 'Shampo...",Lodsworth England United Kingdom,2,1.0,1.0,83,...,1,0,0,0,0,0,0,1,0,0
732,808,802d2eda-fb04-4772-818b-d32ed24da8f3,Countryside workshop with sauna,"[""Beehive Encounter Holiday Workshop Of Bees A...","['What this place offers', 'Bathroom', 'Hair d...",Waldachtal Baden-Wurttemberg Germany,6,5.0,1.0,71,...,0,1,0,0,0,0,0,1,0,0
577,640,bada66a8-9df7-4cca-81c8-5d1bde123e4e,Cow Shed: Daisy at Easton Farm Park,['A unique place to stay! The Cow Shed is a mi...,"['What this place offers', 'Scenic views', 'Pa...",Woodbridge United Kingdom,4,2.0,0.0,86,...,0,1,0,0,0,0,0,1,0,0
455,494,c1614d6c-d69a-4196-9d92-c6b38a9c0f3d,Le Clos de Blisse - Juno Lodge,,"['What this place offers', 'Bathroom', 'Bath',...",Vouilly Normandie France,2,1.0,1.0,113,...,0,0,0,0,0,0,0,1,0,0
453,492,dcdd6136-ae0a-4856-8a5f-1e1067361efe,"Nice T2, 4 pers 2 pools 300m beach Cabourg",['Accommodation F2 30m2 facing south located i...,"['What this place offers', 'Scenic views', 'Ma...",Cabourg Normandie France,4,2.0,1.0,44,...,0,0,0,0,0,0,0,1,0,0
488,528,fcdaf5b3-a2aa-4fff-b9e5-25d500aee328,Idyllic Dorset Hideaway,"[""Our traditional English Shepherd's Hut sits ...","['What this place offers', 'Bathroom', 'Shampo...",Dorset England United Kingdom,2,1.0,1.0,85,...,0,0,0,0,0,0,0,1,0,0
340,377,fff65af2-8386-4551-bbe8-4bdead026533,Log Cabin Snowdonia,['*LAST MIN CANCELLATION FOR XMAS TO COVID* Ni...,"['What this place offers', 'Bathroom', 'Hair d...",Trawsfynydd Wales United Kingdom,3,2.0,1.0,69,...,1,0,0,0,0,0,0,1,0,0


In [None]:
df_numeric = df.select_dtypes(include='number')
df_numeric.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 889 entries, 0 to 888
Data columns (total 26 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Unnamed: 0              889 non-null    int64  
 1   guests                  889 non-null    int64  
 2   beds                    889 non-null    float64
 3   bathrooms               889 non-null    float64
 4   Price_Night             889 non-null    int64  
 5   Cleanliness_rating      889 non-null    float64
 6   Accuracy_rating         889 non-null    float64
 7   Communication_rating    889 non-null    float64
 8   Location_rating         889 non-null    float64
 9   Check-in_rating         889 non-null    float64
 10  Value_rating            889 non-null    float64
 11  amenities_count         889 non-null    float64
 12  bedrooms                889 non-null    int64  
 13  Unnamed: 19             0 non-null      float64
 14  Category_Amazing pools  889 non-null    in

## Scaling the "transformed" dataframe
The one-hot-encoded variables should not be passed to the scaler, so only the numerical features are scaled.
Here's a workaround that is later implemented in the regression files.

In [None]:

from sklearn.preprocessing import StandardScaler
import pandas as pd
from tabular_data import database_utils as dbu

data_path = "./airbnb-property-listings/tabular_data/clean_tabular_data_transformed.csv"
# Load the previously cleaned data.
df = pd.read_csv(data_path)
# Define the label and the features.
label = 'Price_Night'
features, labels = dbu.load_airbnb(df, label=label, numeric_only=True)
# List the numerical columns; the one-hot-encoded columns are deliberately left out.
features_to_scale = ['guests', 'beds', 'bathrooms', 'Price_Night', 'Cleanliness_rating',
                     'Accuracy_rating', 'Communication_rating', 'Location_rating',
                     'Check-in_rating', 'Value_rating', 'amenities_count', 'bedrooms']
# Remove the label from the list: there is no need to rescale it.
features_to_scale.remove(label)
features_subset = features[features_to_scale]
print(features_subset)
scaler = StandardScaler()
# Fit the scaler on the numerical columns and transform them.
scaled_features = scaler.fit_transform(features_subset)
# Substitute the scaled features back into the features dataframe.
features[features_to_scale] = scaled_features

     guests  beds  bathrooms  Cleanliness_rating  Accuracy_rating  \
0         2   1.0        1.0                 4.6              4.7   
1         3   3.0        0.0                 4.3              4.7   
2         4   2.0        1.5                 4.2              4.6   
3         2   1.0        1.0                 4.8              4.9   
4         4   3.0        1.0                 5.0              4.9   
..      ...   ...        ...                 ...              ...   
884       4   2.0        1.5                 4.9              5.0   
885       2   1.0        1.0                 4.8              5.0   
886       4   2.0        1.5                 4.8              5.0   
887       6   3.0        2.0                 4.7              4.8   
888       4   2.0        1.0                 4.9              4.9   

     Communication_rating  Location_rating  Check-in_rating  Value_rating  \
0                     4.3              5.0              4.3           4.3   
1                
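The same idea can be reproduced without the project helper or scikit-learn. The sketch below standardises only the listed columns using the formula that `StandardScaler` applies by default (subtract the column mean, divide by the population standard deviation) and leaves a one-hot column untouched. The toy frame and its values are made up for illustration.

```python
import pandas as pd

# Toy frame with made-up values: two numerical features and one one-hot column.
toy = pd.DataFrame({
    "guests": [2, 4, 6],
    "beds": [1.0, 2.0, 3.0],
    "Category_Treehouses": [1, 0, 1],
})
features_to_scale = ["guests", "beds"]

# Cast the columns to float first so the scaled values fit their dtype.
scaled = toy.astype({col: "float64" for col in features_to_scale})
subset = scaled[features_to_scale]
# Standardise: zero mean and unit population standard deviation (ddof=0).
scaled[features_to_scale] = (subset - subset.mean()) / subset.std(ddof=0)

print(scaled)
```

After this step the scaled columns have mean 0 and standard deviation 1, while `Category_Treehouses` still holds only 0 and 1.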

In [None]:
df['Category_Treehouses'].value_counts()

Category_Treehouses
0    682
1    207
Name: count, dtype: int64

In [30]:
df2 = pd.read_csv('test.csv')

In [31]:
df2

Unnamed: 0.1,Unnamed: 0,y_pred,y_test
0,767,4.906016,4.584967
1,830,4.675817,4.330733
2,479,4.620739,4.787492
3,505,4.669929,4.736198
4,172,3.893567,3.784190
...,...,...,...
262,519,5.572439,5.472271
263,425,4.762778,4.948760
264,708,4.891550,4.787492
265,87,4.829193,5.407172


In [33]:
import numpy as np
df2[['y_pred_lin', 'y_test_lin']] = np.exp(df2[['y_pred', 'y_test']])
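`np.exp` is the inverse of the natural logarithm, so this step brings both columns back to the label's original units. A small round-trip check, assuming the labels were transformed with `np.log` before modelling; the values 83.0 and 432.0 are `y_test_lin` entries from this notebook, and the name `prices` is illustrative.

```python
import numpy as np
import pandas as pd

# Two y_test_lin values taken from this notebook's output.
prices = pd.Series([83.0, 432.0], name="Price_Night")

log_prices = np.log(prices)      # scale on which y_test and y_pred are stored
recovered = np.exp(log_prices)   # back to the original units

print(log_prices.round(6).tolist())
print(np.allclose(recovered, prices))
```

The logged values agree with the `y_test` entries 4.418841 and 6.068426 shown for the same rows, which supports the natural-log assumption.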

In [36]:
pd.set_option('display.max_rows', 100)
df2.sample(100)

Unnamed: 0.1,Unnamed: 0,y_pred,y_test,y_pred_lin,y_test_lin
6,695,5.159646,4.418841,174.102762,83.0
35,698,5.480153,6.068426,239.883526,432.0
130,58,3.825627,4.941642,45.861544,140.0
84,66,4.672007,4.51086,106.912135,91.0
44,468,4.62821,4.276666,102.330764,72.0
48,508,4.837102,4.634729,126.103358,103.0
62,161,4.331344,3.78419,76.046389,44.0
248,785,5.196666,5.857933,180.668954,350.0
68,449,4.594726,5.225747,98.961032,186.0
32,760,4.80851,5.236442,122.548837,188.0
