# AutoClean library

AutoClean automates the preprocessing and cleaning of data in Python, so you can save time on your next project.
It handles:
- duplicates
- missing values
- outliers
- encoding
- extraction
    

Documentation can be found here: https://github.com/elisemercury/AutoClean
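
To make the five steps listed above concrete, here is a minimal pandas sketch on a made-up toy frame. It is not AutoClean's implementation: the column names, the median and most-frequent fills, the 1.5 × IQR clipping, and the reading of "extraction" as datetime feature extraction are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the Airbnb data set.
toy = pd.DataFrame({
    "room": ["private", "private", "entire", None, "entire"],
    "price": [100.0, 100.0, np.nan, 120.0, 5000.0],
    "last_review": ["2021-01-05", "2021-01-05", "2020-07-19",
                    "2019-03-02", "2022-11-30"],
})

# 1. Duplicates: drop rows that are exact copies of an earlier row.
clean = toy.drop_duplicates().reset_index(drop=True)

# 2. Missing values: median for numbers, most frequent value for categories.
clean["price"] = clean["price"].fillna(clean["price"].median())
clean["room"] = clean["room"].fillna(clean["room"].mode()[0])

# 3. Outliers: clip values to the 1.5 * IQR fences.
q1 = clean["price"].quantile(0.25)
q3 = clean["price"].quantile(0.75)
iqr = q3 - q1
clean["price"] = clean["price"].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

# 4. Extraction: split a date column into numeric parts.
dates = pd.to_datetime(clean["last_review"])
clean["review_year"] = dates.dt.year
clean["review_month"] = dates.dt.month

# 5. Encoding: one-hot encode the categorical column.
clean = pd.get_dummies(clean, columns=["room"])

print(clean)
```

AutoClean bundles steps of this kind behind a single call, which is what the rest of the notebook exercises.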

Plan of the notebook:
1. Import the essentials: libraries and the data set
    - The same Airbnb data set: https://www.kaggle.com/datasets/arianazmoudeh/airbnbopendata
2. Basic usage on the sliced data set
    - AutoClean()
3. Adjusting parameters for better results
    - mode
    - missing_num
    - outliers
4. Summing up and comparing 
    - ProfileReport()

## Import all essentials

In [1]:
import pandas as pd
from AutoClean import AutoClean
from pandas_profiling import ProfileReport
from IPython.display import display

In [2]:
raw_data = pd.read_csv('Airbnb_Open_Data.csv')

In [3]:
raw_data.columns

Index(['id', 'NAME', 'host id', 'host_identity_verified', 'host name',
       'neighbourhood group', 'neighbourhood', 'lat', 'long', 'country',
       'country code', 'instant_bookable', 'cancellation_policy', 'room type',
       'Construction year', 'price', 'service fee', 'minimum nights',
       'number of reviews', 'last review', 'reviews per month',
       'review rate number', 'calculated host listings count',
       'availability 365', 'house_rules', 'license'],
      dtype='object')

## Basic usage on the sliced data set

In [4]:
df2 = raw_data[['instant_bookable', 'cancellation_policy', 'room type',
        'minimum nights']] 
df3 = raw_data[[ 'number of reviews','last review', 'reviews per month', 'review rate number']] 

In [23]:
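# .copy() makes df2_new independent of raw_data, so the column assignments
# in later cells do not trigger a SettingWithCopyWarning
df2_new = raw_data[['instant_bookable', 'cancellation_policy', 'room type',
       'Construction year', 'price', 'service fee', 'minimum nights',
       'number of reviews', 'reviews per month',
       'review rate number']].copy()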

In [6]:
df1 = raw_data[['NAME', 'host id', 'neighbourhood group', 
                'neighbourhood',  'country',
                'country code']]

In [7]:
df1_old = raw_data[['NAME', 'host id', 'host_identity_verified', 'host name',
       'neighbourhood group', 'neighbourhood', 'lat', 'long', 'country',
       'country code']]

In [5]:
pipeline = AutoClean(df1_old)  # default settings; the output below lists the df1_old columns

AutoClean process completed in 40.497956 seconds
Logfile saved to: c:\Users\ikova\Augumented_Analysis\autoclean.log


In [6]:
result = pipeline.output

In [7]:
result.info()
result.isna().any()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 102058 entries, 0 to 102057
Data columns (total 21 columns):
 #   Column                              Non-Null Count   Dtype  
---  ------                              --------------   -----  
 0   NAME                                102058 non-null  object 
 1   host id                             102058 non-null  Int64  
 2   host_identity_verified              102058 non-null  object 
 3   host name                           102058 non-null  object 
 4   neighbourhood group                 102058 non-null  object 
 5   neighbourhood                       102058 non-null  object 
 6   lat                                 102058 non-null  float64
 7   long                                102058 non-null  float64
 8   country                             102058 non-null  object 
 9   country code                        102058 non-null  object 
 10  country_United States               102058 non-null  Int64  
 11  host_identity_verified_unc

NAME                                  False
host id                               False
host_identity_verified                False
host name                             False
neighbourhood group                   False
neighbourhood                         False
lat                                   False
long                                  False
country                               False
country code                          False
country_United States                 False
host_identity_verified_unconfirmed    False
host_identity_verified_verified       False
neighbourhood group_Bronx             False
neighbourhood group_Brooklyn          False
neighbourhood group_Manhattan         False
neighbourhood group_Queens            False
neighbourhood group_Staten Island     False
neighbourhood group_brookln           False
neighbourhood group_manhatan          False
country code_US                       False
dtype: bool

In [25]:
result['neighbourhood group'].unique()
result['neighbourhood group_Queens'].unique()

<IntegerArray>
[0, 1]
Length: 2, dtype: Int64
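
The encoded column names above include `neighbourhood group_brookln` and `neighbourhood group_manhatan`, so misspelled labels in the raw data each received their own indicator column. One possible way to avoid this is to normalize the labels before encoding. The sketch below uses a toy series and a hand-written correction map; both are assumptions for illustration and not part of AutoClean.

```python
import pandas as pd

# Toy column reproducing the misspellings seen in the encoded column names.
groups = pd.Series(
    ["Brooklyn", "brookln", "Manhattan", "manhatan", "Queens"],
    name="neighbourhood group",
)

# Hypothetical correction map: misspelled label -> intended label.
fixes = {"brookln": "Brooklyn", "manhatan": "Manhattan"}
groups_fixed = groups.replace(fixes)

# One-hot encode the corrected labels; each borough now maps to one column.
encoded = pd.get_dummies(groups_fixed, prefix="neighbourhood group")
print(sorted(encoded.columns))
```

Applied to the real data, the same `replace` call would run on `raw_data['neighbourhood group']` before the frame is passed to AutoClean.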

In [7]:
result.head(5)

Unnamed: 0,NAME,host id,host_identity_verified,host name,neighbourhood group,neighbourhood,lat,long,country,country code,...,neighbourhood group_Brooklyn,neighbourhood group_Manhattan,neighbourhood group_Queens,neighbourhood group_Staten Island,neighbourhood group_brookln,neighbourhood group_manhatan,country_United States,host_identity_verified_unconfirmed,host_identity_verified_verified,country code_US
0,Clean & quiet apt home by the park,80014485718,unconfirmed,Madaline,Brooklyn,Kensington,40.64749,-73.97237,United States,US,...,1,0,0,0,0,0,1,1,0,1
1,Skylit Midtown Castle,52335172823,verified,Jenna,Manhattan,Midtown,40.75362,-73.98377,United States,US,...,0,1,0,0,0,0,1,0,1,1
2,THE VILLAGE OF HARLEM....NEW YORK !,78829239556,unconfirmed,Elise,Manhattan,Harlem,40.80902,-73.9419,United States,US,...,0,1,0,0,0,0,1,1,0,1
3,Calming room in a thrilling city,85098326012,unconfirmed,Garry,Brooklyn,Clinton Hill,40.68514,-73.95976,United States,US,...,1,0,0,0,0,0,1,1,0,1
4,Entire Apt: Spacious Studio/Loft by central park,92037596077,verified,Lyndon,Manhattan,East Harlem,40.79851,-73.94399,United States,US,...,0,1,0,0,0,0,1,0,1,1


In [32]:
pipeline2 = AutoClean(df2)

AutoClean process completed in 0.150993 seconds
Logfile saved to: c:\Users\ikova\Augumented_Analysis\autoclean.log


In [47]:
pipeline3 = AutoClean(df3)

AutoClean process completed in 7.473309 seconds
Logfile saved to: c:\Users\ikova\Augumented_Analysis\autoclean.log


## Adjusting parameters for better results

In [16]:
pipeline_categ_df1 = AutoClean(df1, mode='manual', missing_categ='most_frequent', outliers='winz')  

AutoClean process completed in 0.882163 seconds
Logfile saved to: c:\Users\ikova\Augumented_Analysis\autoclean.log
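
The `outliers='winz'` argument in the call above asks for winsorization, that is, capping extreme values at a boundary instead of dropping the rows. The sketch below shows the idea with IQR-based bounds in plain pandas. The helper `winsorize_iqr`, the factor `k=1.5`, and the toy values are assumptions for illustration and may differ from what AutoClean does internally.

```python
import pandas as pd


def winsorize_iqr(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values to the interval [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return values.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


# Hypothetical values with two obvious outliers at the upper end.
nights = pd.Series([1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 30.0, 365.0],
                   name="minimum nights")
capped = winsorize_iqr(nights)

print(pd.DataFrame({"original": nights, "capped": capped}))
```

Winsorizing keeps the row count unchanged, which matters when the cleaned frame is later compared with the original one column by column.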


In [17]:
manual = pipeline_categ_df1.output
manual.head(5)

Unnamed: 0,NAME,host id,neighbourhood group,neighbourhood,country,country code
0,Clean & quiet apt home by the park,80014485718,Brooklyn,Kensington,United States,US
1,Skylit Midtown Castle,52335172823,Manhattan,Midtown,United States,US
2,THE VILLAGE OF HARLEM....NEW YORK !,78829239556,Manhattan,Harlem,United States,US
3,Home away from home,85098326012,Brooklyn,Clinton Hill,United States,US
4,Entire Apt: Spacious Studio/Loft by central park,92037596077,Manhattan,East Harlem,United States,US


In [18]:
display(df1.isna().sum())
display(manual.isna().sum())

NAME                   250
host id                  0
neighbourhood group     29
neighbourhood           16
country                532
country code           131
dtype: int64

NAME                   0
host id                0
neighbourhood group    0
neighbourhood          0
country                0
country code           0
dtype: int64
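
The counts above drop to zero because `missing_categ='most_frequent'` fills each gap with the most common value of its column. A minimal pandas sketch of that idea follows; the helper `fill_most_frequent` and the toy frame are hypothetical, and AutoClean's own routine may differ in details such as tie-breaking.

```python
import pandas as pd


def fill_most_frequent(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy in which each missing value is replaced by the column mode."""
    filled = frame.copy()
    for col in filled.columns:
        modes = filled[col].mode(dropna=True)
        if not modes.empty:  # skip columns that contain only missing values
            filled[col] = filled[col].fillna(modes.iloc[0])
    return filled


# Hypothetical toy frame with one gap per column.
toy = pd.DataFrame({
    "country": ["United States", None, "United States"],
    "country code": ["US", "US", None],
})
filled = fill_most_frequent(toy)

print(toy.isna().sum())
print(filled.isna().sum())
```

Most-frequent filling suits columns such as `country`, where one value dominates, but for a free-text column such as `NAME` it simply repeats one listing title.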

In [24]:
# Drop the first character of each value (the currency prefix); both columns stay strings
df2_new['price'] = df2_new['price'].str[1:]
df2_new['service fee'] = df2_new['service fee'].str[1:]

In [25]:
df2_new['minimum nights'] = df2_new['minimum nights'].astype(float)
df2_new['number of reviews'] = df2_new['number of reviews'].astype(float)
df2_new['reviews per month'] = df2_new['reviews per month'].astype(float)
df2_new['review rate number'] = df2_new['review rate number'].astype(float)
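
Slicing off the first character leaves `price` and `service fee` as strings, unlike the four columns cast to `float` above. A stricter conversion could look like the sketch below. It assumes the raw values resemble `$1,020 ` (currency symbol, thousands separator, trailing space); that format, the helper name `parse_money`, and the toy values are assumptions and should be checked against the actual file.

```python
import pandas as pd


def parse_money(column: pd.Series) -> pd.Series:
    """Convert strings such as '$1,020 ' to floats; unparseable entries become NaN."""
    # Remove the currency symbol, thousands separators, and whitespace.
    stripped = column.str.replace(r"[$,\s]", "", regex=True)
    return pd.to_numeric(stripped, errors="coerce")


# Hypothetical sample values, including one missing entry.
prices = pd.Series(["$966", "$1,020 ", None, "$142"], name="price")
parsed = parse_money(prices)

print(parsed)
```

With numeric columns, a setting such as `missing_num` would have values it can actually work on.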


In [29]:
pipeline_num_df2 = AutoClean(df2_new, mode='manual',
                            missing_num='knn')  # KNN imputation for the numeric columns

AutoClean process completed in 87.472815 seconds
Logfile saved to: c:\Users\ikova\Augumented_Analysis\autoclean.log


In [31]:
manual_df2 = pipeline_num_df2.output
manual_df2.head(5)

Unnamed: 0,instant_bookable,cancellation_policy,room type,Construction year,price,service fee,minimum nights,number of reviews,reviews per month,review rate number
0,False,strict,Private room,2020,966,193,10,9,0.21,4
1,False,moderate,Entire home/apt,2007,142,28,30,45,0.38,4
2,True,flexible,Private room,2005,620,124,3,0,1.37,5
3,True,moderate,Entire home/apt,2005,368,74,30,270,4.64,4
4,False,moderate,Entire home/apt,2009,204,41,10,9,0.1,3


In [32]:
display(df2_new.isna().sum())
display(manual_df2.isna().sum())

instant_bookable         105
cancellation_policy       76
room type                  0
Construction year        214
price                    247
service fee              273
minimum nights           409
number of reviews        183
reviews per month      15879
review rate number       326
dtype: int64

instant_bookable       105
cancellation_policy     76
room type                0
Construction year        0
price                  247
service fee            273
minimum nights           0
number of reviews        0
reviews per month        0
review rate number       0
dtype: int64
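
The two listings above are easier to read side by side. They also show that `price`, `service fee`, `instant_bookable`, and `cancellation_policy` keep their missing values. A plausible reading is that these columns are not numeric and only `missing_num` was set, though that is an inference from the output rather than something the log confirms. The sketch below builds such a comparison table on toy data; the helper `missing_report` and the sample values are hypothetical.

```python
import numpy as np
import pandas as pd


def missing_report(before: pd.DataFrame, after: pd.DataFrame) -> pd.DataFrame:
    """Tabulate dtype and missing-value counts before and after cleaning."""
    return pd.DataFrame({
        "dtype": before.dtypes.astype(str),
        "missing_before": before.isna().sum(),
        "missing_after": after.isna().sum(),
    })


# Toy stand-ins for df2_new and manual_df2.
before = pd.DataFrame({
    "price": ["966", None, "142"],           # still strings
    "minimum nights": [10.0, np.nan, 30.0],  # numeric
})
after = before.copy()
after["minimum nights"] = after["minimum nights"].fillna(20.0)  # stand-in for imputation

report = missing_report(before, after)
print(report)
```

In the notebook, the call would be `missing_report(df2_new, manual_df2)`.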

## Summing up and comparing

In [12]:
profile_categ = ProfileReport(manual)
profile_categ


In [33]:
profile_num = ProfileReport(manual_df2)
profile_num


In [10]:
profile = ProfileReport(result)
profile
