## Given bugs associated with older versions of the webcrawler (*fixed on Jan 1, 2022*) , we need to clean and transform data that were scraped prior to this: namely, October 27th-Dec 31st, 2021 data.

#### What was this bug?: Previously, the webcrawler did not correctly handle rental listings whose listings were deleted or ended while the webcrawler was accessing each individual rental listing.

#### NB: Namely, the bug would cause a misalignment issue in which scraped rental listings data were not properly aligned with their 'true' listing URLs and listing ids (ie, what they actually should have been). As a result, some of the data are not lined up properly, which would lead to misleading analysis and much more noise in the dataset. Consequently, we need to remove all misaligned data, but save/salvage all of the rental listings data records that line up properly along the given rows for each column.

## How do we correctly keep only the accurate rental listings data, among these Oct to Dec 2021 CSV files?

## First, we need to recursively load in all of these 2021 scraped CSV files.

## 2.) Next, in order to filter out any misaligned data, we need to compare the 'true' listing ids for each given record of data, and compare this with the listing id data that the webcrawler scraped.

## To verify rental listings data that are accurate and not negatively affected by the old misalignment bug,, we need to use a regex pattern to look up the 'true'

### This regex will search for 10 digits (ie, 0-10) in a row since a.) listing ids will *always* be contained within a rental listing URL--and we have previously determined the webcrawler has *never* had any problems in parsing the URL listings correctly. b) In addition, each listing ID is always a 10-digit unique ID.

## 3.) We then need to add 2 columns--'flat' & 'land'-- which were added to the webcrawler program with some additional changes in Jan 2022. 

## 4) Finally, save the cleaned and transformed data as a single CSV in the new 'old_scraped_data' folder, within the scraped_data> sfbay folders within the webcrawler project's CraigslistWebScraper folder. 

## 1.) Import relevant data (after library imports):

In [1]:
# imports-- file processing
import os
import glob

# data analysis libraries & SQL libraries
import numpy as np
import pandas as pd



In [2]:
## import old scraped data, which needs to be cleaned  
def recursively_import_all_CSV_and_concat_to_single_df(parent_direc, fn_regex=r'*.csv'):
    """Recursively search parent directory, and look up all CSV files.
    Then, import all CSV files and concatenate into a single Pandas' df using pd.concat()"""
    path =  parent_direc # specify parent path of directories containing the scraped rental listings CSV data -- NB: use raw text--as in r'path...', or can we use the double-back slashes to escape back-slashes??
    df_concat = pd.concat((pd.read_csv(file) for file in glob.iglob(
        os.path.join(path, '**', fn_regex), 
        recursive=True)), ignore_index=True
        )  # os.path.join helps ensure this concatenation is OS independent

    return df_concat


## specify directory and import data
scraped_data_path = r"D:\\Coding and Code projects\\Python\\craigslist_data_proj\\old_scraped_data"
# import data
df = recursively_import_all_CSV_and_concat_to_single_df(scraped_data_path)
print(f"Sanity check--overview (ie, via .info() method) of the imported scraped data data:: {df.info()}") # sanity check-examine size of dataset, columns, etc.


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23502 entries, 0 to 23501
Data columns (total 47 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   listing_urls             23502 non-null  object 
 1   ids                      23460 non-null  float64
 2   sqft                     18082 non-null  float64
 3   cities                   23449 non-null  object 
 4   prices                   23448 non-null  object 
 5   bedrooms                 23118 non-null  float64
 6   bathrooms                23404 non-null  object 
 7   attr_vars                23440 non-null  object 
 8   listing_descrip          23440 non-null  object 
 9   date_of_webcrawler       23470 non-null  object 
 10  kitchen                  23450 non-null  float64
 11  date_posted              23440 non-null  object 
 12  region                   23502 non-null  object 
 13  sub_region               23502 non-null  object 
 14  cats_OK               

## Next, clean data, with the following steps in mind:

## Clean data based on 3 main things:

###  1.) deduplicate data based on listing ids

### 2.) Remove misaligned data by:
### --firstly a.) using a regex pattern on the listing_urls (ie, rental listing urls)
### to check for the 'true' listing ids.
### Let's call this 'true_listing_ids'. 

"""-- *NB*: given that listing ids are always 10 digits, a regex that can parse the listing ids is
as follows:
<<<
regex_pattern = r"[0-9]{10}"   # search for any series of 10 consecutive digits 
"""
### --then: b.) identifying any rows in which the scraped ids column differ (ie, are not exactly equal to) 
### the 'true' listing ids. 

### 3.) Rename existing cols as needed,  and *add* and parse 3 or 4 additional dummy variable cols such as 'flat' & 'land'


### 1) Deduplicate data:

In [3]:
# 1) deduplicate
def deduplicate_df(df):
    """Remove duplicate rows based on listing ids"""
    return df.drop_duplicates(keep='first', subset = ['ids'])

df = deduplicate_df(df)

In [4]:
# sanity check
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 18440 entries, 0 to 23501
Data columns (total 47 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   listing_urls             18440 non-null  object 
 1   ids                      18439 non-null  float64
 2   sqft                     14433 non-null  float64
 3   cities                   18437 non-null  object 
 4   prices                   18435 non-null  object 
 5   bedrooms                 18245 non-null  float64
 6   bathrooms                18435 non-null  object 
 7   attr_vars                18436 non-null  object 
 8   listing_descrip          18436 non-null  object 
 9   date_of_webcrawler       18440 non-null  object 
 10  kitchen                  18438 non-null  float64
 11  date_posted              18436 non-null  object 
 12  region                   18440 non-null  object 
 13  sub_region               18440 non-null  object 
 14  cats_OK               

### 2) Remove misaligned data

### 2 a) Use regex to parse the 'true' listing ids: 
### ; b) transform data types of ids to match the object data type of the 'true' listing ids:

In [5]:
# 2.) Remove misaligned data:

# 2 a) Use regex to parse the 'true' listing ids:
def regex_create_pandas_col(df, col, regex_pattern):
    """Create new series for DataFrame based on regex pattern results, given DataFrame (df) and Series (ie, col). """
    return df[col].str.extract(regex_pattern) # apply regex pattern to parse data from col and create new column


# specify regex pattern to parse the 'true' listing ids from the listing urls:
regex_pattern = r"([0-9]{10})"  # *NB*: wrap regex pattern in tuple to be able to use Pandas' str.extract() method, which our regex_create_pandas_col() function relies on!

# parse the 'true' listing ids:
df['true_listing_ids'] = regex_create_pandas_col(df, 'listing_urls', regex_pattern)

# sanity check

print(f"True listing ids:\n{df['true_listing_ids']}")

True listing ids:
0        7401005952
1        7401007644
2        7401007110
3        7400999282
4        7397957717
            ...    
23495    7423968538
23496    7423960763
23498    7414126130
23500    7421288799
23501    7423920882
Name: true_listing_ids, Length: 18440, dtype: object


In [6]:
# 2 b) Transform ids col to int, and then object, so that data types of ids and 'true_listing_ids' cols match each other (and without the unneeded 1st decimal points (ie, the '.0' values) )
def transform_col_to_int(df, col):
    return df[col].astype(float).astype('Int64') # use Int64 (or int64) due to the long id values

# convert col data type function
def transform_dtype_of_col(df, col, data_type):
    """Convert col to specified data type"""
    return df[col].astype(data_type)


# 2 b) Transform ids col to object, so that data types of ids and 'true_listing_ids' cols match each other
# convert 'ids' to integer
df['ids'] = transform_col_to_int(df, 'ids')
# convert 'ids' data type to object:
df['ids'] = transform_dtype_of_col(df, 'ids', str)

# sanity check that ids are now of object data type
print(f"Data type of ids:\n{df['ids'].dtypes}")


Data type of ids:
object


## 2 c) Remove any rows in which the scraped ids do not exactly match the 'true' listing ids, thereby routing any rows that were adversely impacted by the misalignment issue:

In [7]:
# 2 c) Remove rows from dataframe by comparing scraped ids vs the 'true' listing ids
def remove_rows_if_cols_not_equal(df, col1, col2):
    """Remove row from given DataFrame if values from col1 and col2 are not equal"""
    df = df.loc[df[col1] == df[col2]]
    return df

# 2 c) Remove rows from dataframe by comparing scraped ids (ie, 'ids') vs the 'true' listing ids (ie, 'true_listing_ids')
df = remove_rows_if_cols_not_equal(df, 'ids', 'true_listing_ids') 

# sanity check
print(f"Data--now that misaligned data have been removed--is:\n{df}")

Data--now that misaligned data have been removed--is:
                                            listing_urls         ids   sqft  \
0      https://sfbay.craigslist.org/eby/apa/d/berkele...  7401005952    NaN   
1      https://sfbay.craigslist.org/eby/apa/d/san-ram...  7401007644  898.0   
2      https://sfbay.craigslist.org/eby/apa/d/fremont...  7401007110  615.0   
3      https://sfbay.craigslist.org/eby/apa/d/pleasan...  7400999282    NaN   
4      https://sfbay.craigslist.org/eby/apa/d/oakland...  7397957717  800.0   
...                                                  ...         ...    ...   
23092  https://sfbay.craigslist.org/sfc/apa/d/san-fra...  7423666098    NaN   
23093  https://sfbay.craigslist.org/sfc/apa/d/san-fra...  7423621283    NaN   
23094  https://sfbay.craigslist.org/sfc/apa/d/san-fra...  7423549135    NaN   
23095  https://sfbay.craigslist.org/sfc/apa/d/san-fra...  7423328327    NaN   
23096  https://sfbay.craigslist.org/sfc/apa/d/san-fra...  7426061928    NaN  

## 3 a.) Rename cols


In [8]:

def rename_cols(df, rename_cols_dict):
    df = df.rename(columns=rename_cols_dict)
    return df

# specify dictionary to specify what cols to rename, and vals for the renamed cols
dict_rename_cols = {
    'apt_type':'apt',
    'in_law_apt_type':'in_law_apt',
    'condo_type':'condo',
    'townhouse_type':'townhouse',
    'cottage_or_cabin_type':'cottage_or_cabin',
    'single_fam_type':'single_fam',
    'duplex_type':'duplex'    
    }


# 3 a) Rename cols:
df = rename_cols(df, dict_rename_cols)

# sanity check on col names:
df[['apt', 'condo', 'single_fam', 'duplex']]

Unnamed: 0,apt,condo,single_fam,duplex
0,0,0,0,0
1,1,0,0,0
2,1,0,0,0
3,1,0,0,0
4,1,0,0,0
...,...,...,...,...
23092,1,0,0,0
23093,1,0,0,0
23094,1,0,0,0
23095,1,0,0,0


## 3 b) Add the 2 new indicator columns to match the newer webcrawler specifications:

In [9]:
# 3 b) Add and parse additional indicator var cols:
# create indicator var using numpy and Pandas' str.contains() based on scraped rental listing attributes and descriptions  
def indicator_vars_from_scraped_data(df, col_to_parse, attr_substr):
    return pd.Series(np.where(df[col_to_parse].str.contains(attr_substr), 1, 0))


# 'flat:
df['flat'] = indicator_vars_from_scraped_data(df, 'attr_vars', 'flat') 

# land
df['land'] = indicator_vars_from_scraped_data(df, 'attr_vars', 'land')

# sanity check

print(f"Value counts of the new 'flat' col:\n{df['flat'].value_counts()}")


Value counts of the new 'flat' col:
0.0    3061
1.0      45
Name: flat, dtype: int64


## 3 c) Move the 2 new indicator cols to the corresponding locations to match the webcrawler specs



In [10]:
## Namely: move 'flat' to index just to right of 'duplex', and move 'land' to right of 'flat'
## Ergo: let's start by looking up index location of 'duplex':

# look up index location of given col
def look_up_index_loc_of_col(df, col):
    """ Return index location of given column, given column name"""
    index_of_col = df.columns.get_loc(col)
    return index_of_col

# index location of 'duplex' col
index_of_duplex = look_up_index_loc_of_col(df,  'duplex')  # look up index location for 'duplex' col)

# Now, determine the new 'flat' col index location by adding 1 to the duplex index loc
flat_new_loc = index_of_duplex + 1 # add 1 to duplex loc to determine the location where we want to move 'flat'


# specify function to move col location for dataframe:
def move_col_loc_for_df_dict(df, col, index_loc_to_move):
    col = df.pop(col)  # sequester given col from each df 
    df.insert(index_loc_to_move, col.name, col)  # move location of given col within df
    return df 


# move 'flat' col:
df = move_col_loc_for_df_dict(df, 'flat', flat_new_loc)

# sanity check
print(f"Ensure 'flat' has been moved to proper location:\n{df.iloc[0, flat_new_loc]}")

Ensure 'flat' has been moved to proper location:
0.0


In [11]:
# move 'land' col:

# Determine 'land' col new index location--NB: Since we want 'land' 1 col to right of 'flat', simply add 1 to the flat_new_loc
land_new_loc = flat_new_loc + 1 # add 1 to flat's (soon-to-be) new location so it is contiguous 1 col to the right


# move 'land' col:
df = move_col_loc_for_df_dict(df, 'land', land_new_loc) 

## 3 d) Remove unneeded cols to match the webcrawler 

In [14]:
def remove_cols(df):
    df = df.drop(columns=['true_listing_ids'])
    return df

df = remove_cols(df)

# sanity check
print(df.columns)

Index(['listing_urls', 'ids', 'sqft', 'cities', 'prices', 'bedrooms',
       'bathrooms', 'attr_vars', 'listing_descrip', 'date_of_webcrawler',
       'kitchen', 'date_posted', 'region', 'sub_region', 'cats_OK', 'dogs_OK',
       'wheelchair_accessible', 'laundry_in_bldg', 'no_laundry',
       'washer_and_dryer', 'washer_and_dryer_hookup', 'laundry_on_site',
       'full_kitchen', 'dishwasher', 'refrigerator', 'oven', 'flooring_carpet',
       'flooring_wood', 'flooring_tile', 'flooring_hardwood', 'flooring_other',
       'apt', 'in_law_apt', 'condo', 'townhouse', 'cottage_or_cabin',
       'single_fam', 'duplex', 'flat', 'land', 'is_furnished',
       'attached_garage', 'detached_garage', 'carport', 'off_street_parking',
       'no_parking', 'EV_charging', 'air_condition', 'no_smoking'],
      dtype='object')


## 4) Export cleaned/transformed data:

In [15]:
# 4.) Export cleaned data to one large CSV file--in the 'old_scraped_data' sub-directory within the main sfbay folder containing scraped rental listings data:
def df_to_csv(df, direc, CSV_file_name):
    return df.to_csv(direc + '\\'+ CSV_file_name, index=False)

# NB: I've manually created a new sub-folder within the main scraped sfbay directory, to contain just these cleaned 'old' scraped data--ie, scraped via an older version of the webcrawler (namely: prior to the bug fix on Jan 1, 2022)
direc_to_export = r"D:\\Coding and Code projects\\Python\\craigslist_data_proj\\CraigslistWebScraper\\scraped_data\\sfbay\\old_scraped_data"


## export
df_to_csv(df, direc_to_export, 'craigslist_all_sfbay_subregions_Oct_27_to_Dec_31_2021.csv')
