## Missing city names data:

### In this notebook, we want to clean data for several weeks' worth of *sfbay* rental listings in which the webcrawler had not been updated properly to scrape city names. In short, several weeks of rental listings from Jan-Feb 2023 have missing city names data (ie, for the 'cities' column).

### To do this, we need to:

### 1) Import all *sfbay* rental listings as a *single* conatenated DataFrame. 

### 1b) Then, clean and remove null city names.

### 1c) Finally, parse out and clean the city names. We want to ultimately obtain a list of all unique city names for sfbay rental listings.

### 1d) As said in 1c, derive a list of unique city names for sfbay rental listings.

### 2) Second, we need to import each week of sfbay rental listings that have missing city names data--by subregion--as *separate* dataframes, ie, with each df corresponding to a given week and a given subregion. To do this, we need to filter the data to only data from mid January 2023 to 

### 2b) Then, **replace** the missing city names with matching city names--ie, based on the list of unique city names (see step 1)--vis-a-vis a regex str.findall() search of the city names as parsed from the rental listing URLs (ie, the listing_url col).


In [1]:
# 1) Import all sfbay rental listings data, so we can derive all unique sfbay city names: 

# imports-- file processing & datetime libraries
import os
import glob
import datetime
# data analysis libraries & SQL libraries
import numpy as np
import pandas as pd
from pandas.core.frame import DataFrame
# SQL ODBC for API connection between Python & SQL Server
import pyodbc
# use json library to open a json file, which contains SQL credentials & configuration--ie, username, password, etc.
import json 

## Data pipeline of Pandas' df to SQL Server -- import scraped craigslist rental listings data from CSV files to single Pandas' df: 

# recursively search parent direc to look up CSV files within subdirectories
def recursively_import_all_CSV_and_concat_to_single_df(parent_direc, fn_regex=r'*.csv'):
    """Recursively search parent directory, and look up all CSV files.
    Then, import all CSV files to a single Pandas' df using pd.concat()"""
    # specify parent path of the relevant (sfbay) scraped rental listings CSV data -- NB: use raw text--as in r'path...', or can we use the double-back slashes to escape back-slashes
    path =  parent_direc 
    # import each CSV file from directories, and then concatenate each CSV file into a single Pandas' DataFrame 
    df_concat = pd.concat((pd.read_csv(file,
                                        sep = ',', encoding='utf-8'
                                        ) for file in glob.iglob( # iterate over each CSV file in path
                                            os.path.join(path, '**', fn_regex), # have glob.iglob() search for *only* CSV files-- ie, '*.csv'  
                                            recursive=True)), ignore_index=True)  # set recursive to True to recursively search through all relevant child directories (ie, all subregions within given parent region path) 
    # ))
    # df_concat = pd.concat((pd.read_csv(file, # import each CSV file from directory
    #                                     sep=',',encoding = 'utf-8'  # assume standard CSV (ie, comma separated ) formt and use utf-8 encoding
    #                                     ) for file in glob.iglob( # iterate over each CSV file in path
    #                                         os.path.join(path, '**', fn_regex), 
    #                                         recursive=True)), ignore_index=True)  # recursively iterate over each CSV file in path, and use os.path.join to help ensure this concatenation is OS independent
    return df_concat

In [2]:
# 1)-- cont'd: Import *all* sfbay rental listings data as a single df:
# specify path of sfbay scraped data
scraped_data_sfbay = r'D:\Coding and Code projects\Python\craigslist_data_proj\CraigslistWebScraper\scraped_data\sfbay'

# import all available scraped sfbay data:
df = recursively_import_all_CSV_and_concat_to_single_df(scraped_data_sfbay)

# sanity check
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 317751 entries, 0 to 317750
Data columns (total 49 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   listing_urls             317751 non-null  object 
 1   ids                      317394 non-null  float64
 2   sqft                     248865 non-null  object 
 3   cities                   317634 non-null  object 
 4   prices                   317324 non-null  object 
 5   bedrooms                 312888 non-null  object 
 6   bathrooms                317386 non-null  object 
 7   attr_vars                317384 non-null  object 
 8   listing_descrip          317380 non-null  object 
 9   date_of_webcrawler       317636 non-null  object 
 10  kitchen                  317634 non-null  float64
 11  date_posted              317366 non-null  object 
 12  region                   317751 non-null  object 
 13  sub_region               317751 non-null  object 
 14  cats

In [3]:
#next, let's subset the data to 2 separate dfs:



# 1) b) Clean city names by removing anyb null city names:
#  No 'cities' (ie, city names) nulls: subset data to all scraped data that actually contains city names data--ie: subset to no missing city names
def filter_out_null_vals_for_col(df, col):
    return df.loc[df[col].notnull()]

# filter out missing city names:
df_no_city_nulls = filter_out_null_vals_for_col(df, 'cities')

# sanity check
df_no_city_nulls['cities'].isnull().sum()

0

In [4]:
# 2) Import Data from mid-January 2023 to Feb 14, 2023: subset data for Jan 15, 2023 to February 14, 2023

# change col to datetime 
def transform_col_to_datetime(df, datetime_like_col):
    """ Transform to datetime. 
    NB: use utc=True since pd.datetime() will otherwise throw a ValueError: Cannot mix tz-aware with tz-naive values.
    utc=True will tell Pandas to create a timezone-aware datetime conversion."""
    return pd.to_datetime(df[datetime_like_col], utc=True)


df['date_posted'] = transform_col_to_datetime(df, 'date_posted')

# sanity check-- ensure datetime data type
df['date_posted'].dtype

datetime64[ns, UTC]

In [5]:
# new datetime format
df['date_posted']

0        2021-12-27 17:45:00+00:00
1        2022-01-03 00:49:00+00:00
2        2022-01-03 00:20:00+00:00
3        2021-12-10 13:16:00+00:00
4        2021-12-19 02:14:00+00:00
                    ...           
317746   2022-11-04 11:07:00+00:00
317747   2022-11-04 11:05:00+00:00
317748   2022-11-04 10:57:00+00:00
317749   2022-11-04 11:04:00+00:00
317750   2022-11-04 11:04:00+00:00
Name: date_posted, Length: 317751, dtype: datetime64[ns, UTC]

In [6]:
# 2) Import Data from mid-January 2023 to Feb 14, 2023: subset data for Jan 15, 2023 to February 14, 2023


import pandas as pd
from pathlib import Path
import glob

# specify subregion code
subregion_code = 'pen'

# specify parent path of all sfbay data-- NB: use an f-string combined with a raw (ie, r) string--ie, fr to modify the string so we can input the subregion code as an argument to add to the path
path = fr'D:\Coding and Code projects\Python\craigslist_data_proj\CraigslistWebScraper\scraped_data\sfbay\{subregion_code}'


# get all CSV files from path, and grab the file stems for each given CSV file 
df = pd.DataFrame({'files' : [file for file in Path(path).glob('*.csv')],
                  'file_stem' : [file.stem for file in Path(path).glob('*.csv')]}) # get file stem using .stem method

# # Parse the dates from each CSV file, and keep the same 'MM_DD_YYY' format (**including the underscore delimiters!!)
# df['date_of_file'] = pd.to_datetime(df['file_stem'].str.extract('\d{2}_\d{2}_\d{4}')[0]) 

# df['date_of_file'] = df['file_stem'].str.extract(r'\d{2}_\d{2}_\d{4}')


## "craigslist_rental_sfbay_subregion_MM_DD_YYYY.csv"
# sanity check
print(df)

# df.to_csv('CSV_files_for_peninsula_test.csv', index=False)


# ## ask user for the desired start & end dates, in format of 'MM_DD_YYYY'
# start_date_month = str(input('Enter desired Start Date month: '))
# start_date_day = str(input('Enter desired Start Date day: '))

# start_date_year = str(input('Enter desired Start Date year: '))

# end_date_month = str(input('Enter desired End Date month: '))
# end_date_day = str(input('Enter desired End Date day: '))

# end_date_year = str(input('Enter desired End Date year: '))


## create a list of the start & end dates for filtering the files; then concatenate all matching files into a DataFrame

# file_date_slice = df.set_index('date').loc[start_date:end_date]['files'].tolist()


# concat_df = pd.concat([pd.read_csv(file).compute() for file in file_date_slice])


## NB: My webcrawler program's CSV files are of this format:
## "craigslist_rental_sfbay_subregion_MM_DD_YYYY.csv"

Unnamed: 0,listing_urls,ids,sqft,cities,prices,bedrooms,bathrooms,attr_vars,listing_descrip,date_of_webcrawler,...,land,is_furnished,attached_garage,detached_garage,carport,off_street_parking,no_parking,EV_charging,air_condition,no_smoking
266209,https://sfbay.craigslist.org/sfc/apa/d/san-fra...,7.578609e+09,725.0,San Francisco,3295,1,1,application fee details: $35 per applicant for...,centrally located 1 bedroom 1 bath sunny garde...,2023-02-15,...,0.0,0,0,0,0,0,0,0,0,1
266328,https://sfbay.craigslist.org/sfc/apa/d/san-fra...,7.578635e+09,860.0,San Francisco,2850,2,1,apartment\nlaundry in bldg\nno smoking\nattach...,"address:\n155, 20 avenue, apt 4, sf, ca 94121,...",2023-02-15,...,0.0,0,1,0,0,0,0,0,0,1
266288,https://sfbay.craigslist.org/sfc/apa/d/san-fra...,7.578652e+09,1685.0,San Francisco,3850,3,2,flat\nlaundry in bldg\nno smoking\nattached ga...,#top floor /large & spacious 3br+1 extra room/...,2023-02-15,...,0.0,0,1,0,0,0,0,0,0,1
266107,https://sfbay.craigslist.org/sfc/apa/d/san-fra...,7.578653e+09,1685.0,San Francisco,3850,3,2,flat\nlaundry in bldg\nno smoking\nattached ga...,## reduced to $3850 ##\n\n#top floor /large & ...,2023-02-15,...,0.0,0,1,0,0,0,0,0,0,1
266225,https://sfbay.craigslist.org/sfc/apa/d/ready-f...,7.578670e+09,,San Francisco,3675,3,1,apartment\nno laundry on site\nno smoking\natt...,open house:\n\naddress: corner building... 191...,2023-02-15,...,0.0,0,1,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
255701,https://sfbay.craigslist.org/scz/apa/d/santa-c...,7.589953e+09,800.0,Santa Cruz,2000,1,1,air conditioning\ncottage/cabin\nw/d in unit\n...,"- lovely 1 bedroom, 1 bath stand alone cottage...",2023-02-16,...,0.0,0,0,0,0,1,0,0,1,1
255697,https://sfbay.craigslist.org/scz/apa/d/santa-c...,7.589960e+09,1300.0,Santa Cruz,4850,3,2,cats are OK - purrr\ndogs are OK - wooof\nhous...,"available april 1, 2023\n\nlong term tenant de...",2023-02-16,...,0.0,0,1,0,0,0,0,0,0,1
255698,https://sfbay.craigslist.org/scz/apa/d/santa-c...,7.589960e+09,,Santa Cruz,1250,1,1,apartment\nw/d in unit\noff-street parking\nre...,"bright and charming, lots of natural light and...",2023-02-16,...,0.0,0,0,0,0,1,0,0,0,0
255699,https://sfbay.craigslist.org/scz/apa/d/santa-c...,7.589962e+09,400.0,Santa Cruz,2750,1,1,application fee details: $45\ncats are OK - pu...,photos are similar but not actual. more photos...,2023-02-16,...,0.0,0,0,0,0,0,0,0,0,1


In [7]:
#  remove original df from memory since we no longer need it
df = [] 

In [14]:
""" Clean the city names data (ie, non null city names) by removing extraneous address & street data, non-sfbay cities, etc."""

def clean_split_city_names(df, address_critera: list, neighborhood_criteria:list, split_city_delimiters: list, incorrect_city_names:dict, cities_not_in_region:dict, cities_that_need_extra_cleaning:dict):
    """Clean city names data in several ways:
    a.) Remove extraneous address & neighborhood data placed in the city names HTML object, such as 'Rd', 'Blvd', or 'Downtown'.
    b.) Unsplit city names data that are split via ',' & '/' delimiters.
    c.) Replace abbreviated or mispelled city names.
    ci) Set all city names to lowercase by using .lower(), for sake of consistent data cleaning (casing will be parsed later).
    d.) Remove city names that do not exist within the SF Bay Area (e.g., 'Redding')--ie, by using .replace() and replacing with whitespace (ie, ' '). 
    e.)Remove any digits/integers within the city names data--ie, by using a '\d+' regex as the argument of str.replace() and replace it with empty strings.
    f.) Remove any city names records that are left with merely empty strings (ie, the other steps removed all data for that given cities record).
    g.) Remove any whitespace to avoid the same city names from being treated as different entities by Pandas, Python, or SQL. 
    h.) Use str.capwords() to capitalize words (ie, excluding apostrophes).
    i.) Replace city names that are mispelled after having removed various street and neighborhood substrings such as 'St' or 'Ca'--e.g., '. Helena' should be 'St. Helena'. 
    j) Remove any remaining empty strings, null records, or rows with literal 'nan' values (ie, resulting from previous data cleaning steps)"""
    # specify extraneous street & address data (e.g., 'Rd') that we want to remove from the city names column:
    addr_criteria = '|'.join(address_critera) # Join pipe ('|') symbols to address list so we can str.split() on any one of these criteria (ie, 'or' condition splitting on each element separated by pipes):
    # specify extraneous neighborhood criteria that we should also remove from the column
    nbhood_criteria = '|'.join(neighborhood_criteria) # specify neighborhood names as well as state abbreviation (shown on website as ' Ca') that is shown without the usual comma delimiter, which we should remove from the rows of cities col
    # b.) specify delimiters we need to refer to un-split city names:
    split_city_delimiters = '|'.join(split_city_delimiters) # join pipes to delimiters so we can use str.split() based on multiple 'or' criteria simultaneously
    # clean city names data by removing extraneous address & neighborhood data, and unsplitting city names based on ',' & '\' delimiters
    df['cities'] =  df['cities'].str.split(addr_criteria).str[-1].str.replace(nbhood_criteria, '', case=True).str.lstrip()
    df['cities'] = df['cities'].str.split(split_city_delimiters).str[0] #unsplit city names based on comma or forward-slash delimiters
    # c.) replace specific abbreviated or mispelled city names
    df = df.replace({'cities':incorrect_city_names}, regex=True) # replace mispelled & abbreviated city names
    # ci) Set all city names data to lower-case temporarily, to ease the data cleaning & wrangling:
    df['cities'] = df['cities'].str.lower()
    
    # d) remove data in which the cities are not actually located in the sfbay region:
    df['cities'] = df['cities'].replace(cities_not_in_region, '', regex=True )  # remove (via empty string) cities that are not actually located in the sfbay region
    # e.) Remove digits & integer-like data from cities column:
    df['cities'] = df['cities'].str.replace('\d+', '')  # remove any digits by using '/d+' regex to look up digits, and then replace with empty string
    # f.) Remove any rows that have empty strings or null values for cities col (having performed the various data filtering and cleaning above)
    df = df[df['cities'].str.strip().astype(bool)] # remove rows with empty strings (ie, '') for cities col 
    df = df.dropna(subset=['cities']) # remove any remaining 'cities' null records
    # g.) Remove whitespace
    df['cities'] = df['cities'].str.strip() 
    # h.) capitalize the city names using str.capwords() 
    df['cities'] = df['cities'].str.split().apply(lambda x: [val.capitalize() for val in x]).str.join(' ')
    # i) Replace city names that are mispelled after having removed various street and neighborhood substrings such as 'St' or 'Ca'--e.g., '. Helena' should be 'St. Helena' & 'San los' should be 'San Carlos'. Also, remove any non-Bay Area cities such as Redding:
    df = df.replace({'cities':cities_that_need_extra_cleaning})
    # j) Remove any remaining empty strings, null records, or rows with literal 'nan' values (ie, resulting from previous data cleaning steps)
    # remove rows with literal 'nan' values
    df['cities'] = df['cities'].replace('nan', '', regex=True)

    df = df[df['cities'].str.strip().astype(bool)] # remove rows with empty strings (ie, '') for cities col 
     
    df = df.dropna(subset=['cities']) # remove any remaining 'cities' null records
    return df



## clean split city names and clean abbreviated or incorrect city names:
# specify various address and street name that we need to remove from the city names
address_criteria = ['Boulevard', 'Blvd', 'Road', 'Rd', 'Avenue', 'Ave', 'Street', 'St', 'Drive', 'Dr', 'Real', 'E Hillsdale Blvd'] 

# specify various extraneous neighborhood names such as 'Downtown' 
neighborhood_criteria = ['Downtown', 'Central/Downtown', 'North', 'California', 'Ca.', 'Bay Area', 'St. Helena', 'St', 'nyon', 
'Jack London Square', 'Walking Distance To', 'El Camino', 'Mendocino County', 'San Mateo County', 'Alameda County', 'Rio Nido Nr', 'Mission Elementary', 
'Napa County', 'Golden Gate', 'Jennings', 'South Lake Tahoe', 'Tahoe Paradise', 'Kingswood Estates', 'South Bay', 'Skyline', 
'East Bay', 'Morton Dr', 'Cour De Jeune', 
'Area', 'Rotary Way', ' Ca', 'Near ', 'galen pl'] 

# specify what delimiters we want to search for to unsplit the split city names data:
split_city_delimiters =  [',', '/', ' - ', '_____', '#']

# specify dictionary of abbreviated & mispelled cities:
incorrect_city_names = {'Rohnert Pk':'Rohnert Park', 'Hillsborough Ca': 'Hillsborough','Fremont Ca':'Fremont', 'South Sf': 'South San Francisco', 'Ca':'', 'East San Jose':'San Jose', 'Vallejo Ca':'Vallejo', 'Westgate On Saratoga .':'San Jose', 'Bodega':'Bodega Bay', 'Briarwood At Central Park':'Fremont', 'Campbell Ca':'Campbell', 'Almaden':'San Jose', '.':'', 'East Foothills':'San Jose', 'Lake County':'', 'West End':'Alameda', 'Redwood Shores':'Redwood City', 'Park Pacifica Neighborhood':'Pacifica'}

# specify dictionary of cities that are not located in sfbay (ie, not located in the region):
cities_not_in_region = ['Ketchum', 'Baypoinr', 'Quito' 'Redding', 'Bend', 'Pla Vada Woodland', 'San Antonio Tx', 'Mountain House Ca', 'Lakeside']

# specify dictionary of city names that are mispelled after having removed various street and neighborhood substrings:
cities_that_need_extra_cleaning = {'. Helena': 'St. Helena', '. Helena Deer Park': 'St. Helena', 'San Los':'San Carlos', 'Tro Valley':'Castro Valley', 'Rohnert Pk':'Rohnert Park',
'Pbell':'Campbell', 'Pbell Ca':'Campbell', 'American Yon':'American Canyon', 'Millbrae On The Burlingame Border':'Millbrae', 'Ockton Ca': 'Stockton', '. Rohnert Park': 'Rohnert Park', 'Udio Behind Main House':'', '***---rohnert Park':'Rohnert Park',
'Discovery Bay Ca':'Discovery Bay'}

# specify list of city names that should be used explicitly instead of having multiple cities (e.g.: ''santa cruz columbia beach' or 'Felton area', instead of simply 'Santa Cruz' or 'Felton')
# ie, use this list of values, look up substr via str.contains, and then use .replace() chained to .map() to replace all matching substr values with the values in the list
city_names_for_str_contains = ['Santa Cruz', 'Felton', 'Pleasure Point', 'Lodi', 'Berkeley' ]


# clean city names data:
df_no_city_nulls = clean_split_city_names(df_no_city_nulls, address_criteria, neighborhood_criteria, split_city_delimiters, incorrect_city_names, cities_not_in_region, cities_that_need_extra_cleaning)



In [15]:
""" Next, we need to determine each unique city name for each sfbay subregion.
First, remove all city values listed as 'nan', which resulted from the previous data wrangling step--ie,  .
Then, return this set of unique city names as a Series (new col), and convert to a flattened array (ie, Python list)"""


# 

# determine all unique (non-null) city names--ie, using df_no_city_nulls
def determine_unique_col_vals(df, col):
    return df[col].unique()

# return unique city names
determine_unique_col_vals(df_no_city_nulls, 'cities') 

array([], dtype=object)

In [10]:
# next, convert the set of city names data to a list:


# convert numpy array to list
def transform_np_array_to_list(np_array):
    return np_array.tolist() 

unique_city_names = determine_unique_col_vals(df_no_city_nulls, 'cities')

unique_city_names_lis = transform_np_array_to_list(unique_city_names)

# apply lower-case to each character of each city name for sake of consistency
def each_list_el_to_lowercase(list_arg):
    return [char.lower() for char in list_arg]

unique_city_names_lis = each_list_el_to_lowercase(unique_city_names_lis)

# sanity check
unique_city_names_lis

['brentwood',
 'vallejo',
 'hayward',
 'concord',
 'westbrae',
 'elmwood',
 'hercules',
 'oakland',
 'berkeley',
 'fremont',
 'danville',
 'alameda',
 'walnut creek',
 'san leandro',
 'dublin',
 'san lorenzo',
 'pittsburg',
 'richmond',
 'livermore',
 'emeryville',
 'albany',
 'lafayette',
 'fairfield',
 'san jose',
 'nan',
 'san ramon',
 'benicia',
 'irvington high area',
 'crockett',
 'el sobrante',
 'pleasanton',
 'tracy',
 'stockton',
 'castro valley',
 'san pablo',
 'pleasant hill',
 'nevada city',
 'midtown sacramento',
 'san francisco',
 'neighborhood',
 'east palo alto',
 'oakley',
 'bay point',
 's francisco way',
 'san jose ca',
 'san mateo',
 'laurence ranch neighborhood',
 'dimond district',
 'mcarthur',
 'santa clara',
 'metrosix',
 'antioch',
 'rockridge',
 'pacifica',
 'rotary way vallejo',
 'discovery bay ca',
 'sunnyvale',
 'union city',
 'santa rosa',
 'moraga',
 'montclair',
 'niles',
 'santa fe',
 'old city vallejo',
 'bushrod',
 'el cerrito',
 'american canyon',
 '

In [46]:
""" Next, add dash delimiters *in between* each word (ie, in place of whitespace in between each word of each city name) for each element (read: city name) 
from the unique_city_names_lis list.  

Why add a dash delimiter in b/w each word of each city name?:
Because the rental listings' listing_urls URLs each contain--(as of craigslist's server's changes in Jan 2023)
--the listing's city name in the URL. 
***But!: The city names in the URL are ***always*** listed with a dash delimiter in between each word!"""

def add_dash_delimiter_in_bw_each_word_of_city_names(city_names:list):
    return [word.replace(' ', '-') for word in city_names]  # use str.replace() method to replace whitespaces with dashes

unique_city_names_lis_dash_delim = add_dash_delimiter_in_bw_each_word_of_city_names(unique_city_names_lis)

# sanity check
unique_city_names_lis_dash_delim

['brentwood-',
 'vallejo-',
 'hayward',
 'concord-',
 'westbrae',
 'elmwood',
 'hercules',
 'oakland',
 'berkeley--',
 'fremont-',
 'danville-',
 'alameda',
 'berkeley',
 'walnut-creek',
 'san-leandro',
 'dublin-',
 'san-lorenzo',
 'pittsburg-',
 'richmond',
 'livermore',
 'emeryville',
 'albany-',
 'lafayette-',
 'fairfield-',
 'east-san-jose',
 'west-end',
 'nan',
 'san-ramon',
 'dublin',
 'lafayette',
 'briarwood-at-central-park',
 'benicia',
 'fremont',
 'concord',
 'irvington-high-area',
 'crockett',
 '',
 'vallejo',
 'el-sobrante',
 'pleasanton',
 'tracy',
 'ockton-ca',
 '54',
 'tro-valley',
 'san-pablo',
 'pleasant-hill',
 'pittsburg',
 'brentwood',
 'nevada-city',
 'midtown-sacramento',
 'san-francisco',
 'neighborhood',
 'east-palo-alto',
 'san-jose',
 'oakley',
 'bay-point',
 '3569--san-jose',
 's-francisco-way',
 'san-jose-ca',
 'san-mateo',
 'danville',
 'laurence-ranch-neighborhood',
 'dimond-district',
 'mcarthur',
 'santa-clara',
 'albany',
 'metrosix55',
 'antioch',
 'r

## Webcrawler to get wikipedia table data on **all** SF Bay Area city names & SC county cities:

In [None]:
# 1) Identify a list of all unique SF Bay Area & SC county city names, and output to a list

# 1a) Create a few simple webcrawlers to grab the city names data from 2 wikipedia tables

#web crawling, web scraping & webdriver libraries and modules
from selenium import webdriver  # NB: this is the main module we will use to implement the webcrawler and webscraping. A webdriver is an automated browser.
from webdriver_manager.chrome import ChromeDriverManager # import webdriver_manager package to automatically take care of any needed updates to Chrome webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException, TimeoutException, WebDriverException, ElementClickInterceptedException
from selenium.webdriver.chrome.options import Options  # Options enables us to tell Selenium to open WebDriver browsers using maximized mode, and we can also disable any extensions or infobars

import requests


### SF bay area city names data


# sf bay area city names wiki page:
sfbay_cities_wiki_url = 'https://en.wikipedia.org/wiki/List_of_cities_and_towns_in_the_San_Francisco_Bay_Area'


# access page, and grab city names, append to list

def obtain_cities_from_wiki_sfbay(webpage_url,list_of_cities):
    # initialize web driver
            
    driver = webdriver.Chrome(ChromeDriverManager().install())  # install or update latest Chrome webdriver using using ChromeDriverManager() library
    
    # access webpage
    driver.get(webpage_url)

    xpaths_table = '//table[@class="wikitable plainrowheaders sortable jquery-tablesorter"]'

    # search for wiki data tables:
    table = driver.find_element(By.XPATH, xpaths_table)


    # print(f'Full table:\n\n{table.text}\n\n\n\n\n')

    # iterate over each table row and then row_val within each row to get data from the given table, pertaining to the city names
    for row in table.find_elements(By.CSS_SELECTOR, 'tr'): # iterate over each row in the table
        
        
        city_names =  row.find_elements(By.TAG_NAME, 'th')  # iterate over value of each row, *but* ONLY for the 1st column--ie, the 0th index
        # city_names =  row.find_elements(By.TAG_NAME, 'td')[0]  # iterate over value of each row, *but* ONLY for the 1st column--ie, the 0th index

        # extract text, but *skip* the first 2 rows of the table  rows' values since these are only the column names!
        for city_name in city_names[:2]: # skip first 2 rows 

            # append the remaining data to list
            list_of_cities.append(city_name.text)


    # exit webpage 
    driver.close()

    # # sanity check
    # print(f'List of city names:\n{list_of_cities}')

    return list_of_cities



# initialize lists:
sfbay_city_names = []


#sfbay data
obtain_cities_from_wiki_sfbay(sfbay_cities_wiki_url, sfbay_city_names)

# remove remaining col names:
sfbay_city_names = sfbay_city_names[4:]

# sanity check
print(f'sfbay city names:{sfbay_city_names}')

In [None]:
# Sc county city names data:

#web crawling, web scraping & webdriver libraries and modules
from selenium import webdriver  # NB: this is the main module we will use to implement the webcrawler and webscraping. A webdriver is an automated browser.
from webdriver_manager.chrome import ChromeDriverManager # import webdriver_manager package to automatically take care of any needed updates to Chrome webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException, TimeoutException, WebDriverException, ElementClickInterceptedException
from selenium.webdriver.chrome.options import Options  # Options enables us to tell Selenium to open WebDriver browsers using maximized mode, and we can also disable any extensions or infobars

import requests



# sc county wiki page url
sc_county_cities_wiki_url = 'https://en.wikipedia.org/wiki/Santa_Cruz_County,_California#Population_ranking'


sc_county_city_names = []


def obtain_cities_from_wiki_sc(webpage_url,list_of_cities):
    # initialize web driver
            
    driver = webdriver.Chrome(ChromeDriverManager().install())  # install or update latest Chrome webdriver using using ChromeDriverManager() library
    
    # access webpage
    driver.get(webpage_url)

    # # NB!: there are 2 tables with the same class name; only select data from the 2nd one
    # xpaths_table = '//table[@class="wikitable sortable jquery-tablesorter"][2]'  # 2nd table on webpage with this class name


    # # search for given wiki data tables:
    # table = driver.find_element(By.XPATH, xpaths_table)

    # NB!: there are 2 tables with the same class name; only select data from the 2nd one
    xpaths_table = '//table[@class="wikitable sortable jquery-tablesorter"][2]//tr//td[2]'  # 2nd table on webpage with this class name


    # search for given wiki data tables:
    table = driver.find_elements(By.XPATH, xpaths_table)


    print(f'Full table:\n\n{table}\n\n\n\n\n')

    for row in table:
        print(f'City names:{row.text}')
        list_of_cities.append(row.text)




    # print(f'Exclude 1st element of table:{table.text[1:]}')


    # # grab data from the table body:
    # table_body = table.find_elements(By.XPATH, '//*[@id="mw-content-text"]/div[1]/table[13]/tbody//b[2]/a')

    # # grab text from table body
    # table_body_2nd_col = table_body

    # print(f'Table body text\n:{table_body_2nd_col}')

    # iterate over each table row for *only* the 2nd column of the table body


#     for row in table.find_elements(By.XPATH, 'tr'): # iterate over each row from the table body (ie, tbody)

# # //*[@id="mw-content-text"]/div[1]/table[13]/tbody//b[2]/a

#         city_names =  row.text  # iterate over value of each row, *but* ONLY for the 1st column--ie, the 0th index

#         print(city_names)
        
        # city_names =  row.find_elements(By.TAG_NAME, 'td')  # iterate over value of each row, *but* ONLY for the 1st column--ie, the 0th index
        # city_names =  row.find_elements(By.TAG_NAME, 'td')[0]  # iterate over value of each row, *but* ONLY for the 1st column--ie, the 0th index

        

        # # extract text, but *skip* the first 2 rows of the table  rows' values since these are only the column names!
        # # for city_name in city_names[1]: # skip first column

        # #     append the remaining data to list
        # #     list_of_cities.append(city_name.text)

    
        # list_of_cities = city_names

    # exit webpage 
    driver.close()

    # # sanity check
    # print(f'List of city names:\n{list_of_cities}')

    return list_of_cities

obtain_cities_from_wiki_sc(sc_county_cities_wiki_url, sc_county_city_names)


#  # clean data by removing extraneous '†' char from city names list
sc_county_city_names = list(map(lambda x: x.replace('†',''), sc_county_city_names))

## finally, remove any whitespace from list-- use list comprehension
sc_county_city_names = [s for s in sc_county_city_names if s.strip()]

# sanity check
print(f'sc county city names:{sc_county_city_names}')

### 2) Import data for January 1-Feb 14: ie, the data in which the city names are missing!

### NB: We will need to use separate lists of dataframes for **Each** subregion, given the path structure of the scraped data derived from the webcrawler: 

## 2) Next, import data from Jan 15 to Feb 14, 2023

### NB: check following stackoverflow for useful info on how to do this, but I will need to *add* and apply a separate *datetime filter*,  *or* use the glob library to import CSV files based on a sort of regex, to the files such that I **only**  import the right dates of data:

### See following article on how to import CSV files only from specified date-range:




In [None]:
# 2) Import Data from mid-January 2023 to Feb 14, 2023: subset data for Jan 15, 2023 to February 14, 2023

def recursively_import_CSV_files_since_date_range(parent_path, subregion:str, month_date_range:str, day_date_range:str):
    """ Recusrively import all CSV files found in parent directory, but 
    *only* import files that match the specified date-range.

    NB: month_date_range: specify a str comprising the range of months--ie, '01-02' for January to February
    day_date_range: a str comprising the range of days: ie, '01-15' to mean the 1st through 15th 
    subregion: 
    
    Return the CSV files as separate dataframes
    within a list of dfs."""
    # specify parent path of the relevant (sfbay) scraped rental listings CSV data -- NB: use raw text--as in r'path...', or can we use the double-back slashes to escape back-slashes
    path = parent_path

    # import *only* CSV files that match specified *subregion* & *date-range*
    csv_files_date_range = glob.glob(f'./craigslist_rental_sfbay_{[subregion]}_02_17_2023_{[month_date_range]}_{[day_date_range]}-*.csv')

    # # NB: taken from stackexchange example: 
    # BANK\BANK_NIFTY_5MINs_2020-03-01.csv'

    # all_files = glob.glob('./BANK_NIFTY_5MINs_2020-[2-3]-*.csv')
    # # see URL for more details: 
    # # https://stackoverflow.com/questions/74386583/how-to-select-specific-csv-files-for-specified-date-range-from-a-folder-in-pytho



    # # specify an empty list to contain a list of dataframes
    # list_of_dfs = []

    # for files in csv_files_date_range:
    #     # Iterate recursively over each CSV file from within specified date-range, and import as separate DataFrames:
    #     dfs = pd.read_csv(files, 
    #                     sep=',',encoding = 'utf-8'  # use utf-8 encoding since it is OS-agnostic
    #                     )
        #  # append each imported df into the list of dfs

        # list_of_dfs.append(dfs)


    # use list comprehension to import all relevant dfs as a list of dfs:
    list_of_dfs = [pd.read_csv(file, sep=',',encoding = 'utf-8') for file in csv_files_date_range]



    
    # # # NB: for reference on how to import multiple CSVs as separate dfs within a list of dfs, see this example below:
    # # Read multiple CSV files into separate DataFrames in Python
    # # <https://www.geeksforgeeks.org/read-multiple-csv-files-into-separate-dataframes-in-python/>

    # # append datasets into the list
    # for i in range(len(list_of_names)):
    #     temp_df = pd.read_csv("./csv/"+list_of_names[i]+".csv")
    #     dataframes_list.append(temp_df)

    # # Nb: also see example below on another possible way of importing multiple dfs recursively:
    # list_of_dfs = pd.read_csv(csv_files_date_range, # import each CSV file from directory
    #                                     sep=',',encoding = 'utf-8'  # use utf-8 encoding since it is OS-agnostic
    #                                     ) for csv_files_date_range in glob.iglob(  # iterate over each CSV file recursively, given specified date-range
    #                                         os.path.join(path, '**', fn_regex), # have glob.iglob() search for *only* CSV files-- ie, '*.csv', & os.path.join helps ensure this concatenation is OS-independent
    #                                         recursive=True), ignore_index=True)  


    return list_of_dfs



# import all Jan 1-Feb 14th data, by subregion

# specify parent path of sfbay data
parent_path_sfbay  = r'D:\Coding and Code projects\Python\craigslist_data_proj\CraigslistWebScraper\scraped_data\sfbay'


# specify the craigslist subregion codes for as a dictionary 
dict_of_subregion_codes = {'Peninsula':'pen', 'SF':'sfc', 'East Bay':'eby', 'South Bay':'sby', 
'Santa Cruz':'scz', 'North Bay':'nby'}

# Peninsula_subregion = 'pen'
# SF_subregion ='sfc'
# ebay


# specify months & dates (strings) over which we want to apply the datetime "filter"
month_date_range_str = '01-02'

day_date_range_str = '01-31'

# Import Peninsula data:
Peninsula_dfs_since_jan_2023 = recursively_import_CSV_files_since_date_range(parent_path_sfbay, dict_of_subregion_codes.get("Peninsula"),
month_date_range_str, day_date_range_str)

# sanity check
print(f'Pensinsula data list of dfs from Jan-Feb 14, 2023--1st week (df) from this date-range: {Peninsula_dfs_since_jan_2023[0].head()}')

In [None]:
# 2) Import Data from mid-January 2023 to Feb 14, 2023: subset data for Jan 15, 2023 to February 14, 2023

# imports-- file processing & datetime libraries
import os
import glob
from pathlib import Path
import datetime
# data analysis libraries & SQL libraries
import numpy as np
import pandas as pd
from pandas.core.frame import DataFrame


# specify subregion code
subregion_code = 'sfc'

# specify parent path of all sfbay data-- NB: use an f-string combined with a raw (ie, r) string--ie, fr to modify the string so we can input the subregion code as an argument to add to the path
path = fr'D:\Coding and Code projects\Python\craigslist_data_proj\CraigslistWebScraper\scraped_data\sfbay\{subregion_code}'


# get all CSV files from path, and grab the file stems for each given CSV file 
df = pd.DataFrame({'files' : [file for file in Path(path).glob('*.csv')],
                  'file_stem' : [file.stem for file in Path(path).glob('*.csv')]}) # get file stem using .stem method

# # Parse the dates from each CSV file, and keep the same 'MM_DD_YYY' format (**including the underscore delimiters!!), as the webcrawler CSV file naming convention:

df['date_of_file'] = df['file_stem'].str.extract(r'(\d{2}_\d{2}_\d{4})')


## "craigslist_rental_sfbay_subregion_MM_DD_YYYY.csv"
# sanity check
print(f'File dates:\n{df[["date_of_file", "file_stem"]].sort_values(by="date_of_file").tail()}')



## ask user for the desired start & end dates, in format of 'MM_DD_YYYY'

# start date inputs

start_date_month = str(input('Enter desired Start Date month: '))
start_date_day = str(input('Enter desired Start Date day: '))

start_date_year = str(input('Enter desired Start Date year: '))

# concat to single string, with **underscore** delimiters in between each component
underscore_delimiter = '_'

start_date = start_date_month + underscore_delimiter + start_date_day + underscore_delimiter + start_date_year

# sanity check on  resulting str
print(f'Start date for file filter:\n{start_date}')

# end date inputs
end_date_month = str(input('Enter desired End Date month: '))
end_date_day = str(input('Enter desired End Date day: '))

end_date_year = str(input('Enter desired End Date year: '))

# concat to single string, with **underscore** delimiters in between each component
end_date = end_date_month + underscore_delimiter + end_date_day + underscore_delimiter + end_date_year


# ## create a list of the start & end dates for filtering the files; then concatenate all matching files into a DataFrame
# index the list of files, convert to a list of date values, to be used for the filter
file_date_slice = df.set_index('date_of_file').loc[start_date:end_date]['files'].tolist()


# # sanity check
print(f'File dates we will use to filter the CSV files:\n{file_date_slice}')


# use list comprehension to import all relevant dfs as a list of dfs:
list_of_dfs = [pd.read_csv(file, sep=',',encoding = 'utf-8') for file in file_date_slice]

# # concat_df = pd.concat([pd.read_csv(file).compute() for file in file_date_slice])


# print the first element in the list:
print(f'First imported rental listings df from the period of {start_date} to {end_date} for {subregion_code} subregion, from the list of dfs:\n{list_of_dfs[0]}')


## NB: My webcrawler program's CSV files are of this format:
## "craigslist_rental_sfbay_subregion_MM_DD_YYYY.csv"


In [None]:
# Finish importing the list of dfs


# SF data
SF_dfs_since_jan_2023 = recursively_import_CSV_files_since_date_range(parent_path_sfbay, dict_of_subregion_codes.get("SF"),
month_date_range_str, day_date_range_str)

# East Bay
East_bay_dfs_since_jan_2023 = recursively_import_CSV_files_since_date_range(parent_path_sfbay, dict_of_subregion_codes.get("East Bay"),
month_date_range_str, day_date_range_str)

# North Bay
North_bay_dfs_since_jan_2023 = recursively_import_CSV_files_since_date_range(parent_path_sfbay, dict_of_subregion_codes.get("North Bay"),
month_date_range_str, day_date_range_str)

# South Bay
South_bay_dfs_since_jan_2023 = recursively_import_CSV_files_since_date_range(parent_path_sfbay, dict_of_subregion_codes.get("South Bay"),
month_date_range_str, day_date_range_str)

In [None]:
# 2) Import Data from mid-January 2023 to Feb 14, 2023: subset data for Jan 15, 2023 to February 14, 2023


# # NB: the following is a possible path I could take whereby you use a regex to parse out certain files based on their dates
# import pandas as pd
# from pathlib import Path

# path = '\tmp\s3\bucket\files'

# df = pd.DataFrame({'files' : [f for f in Path(path).glob('*.csv')],
#                   'stem' : [f.stem for f in Path(path).glob('*.csv')]})


# df['date'] = pd.to_datetime(df['stem'].str.extract('(\d{4}-\d{2}-\d{2})')[0])

# print(df)

import pandas as pd
from pathlib import Path

path = r'D:\Coding and Code projects\Python\craigslist_data_proj\CraigslistWebScraper\scraped_data'

df = pd.DataFrame({'files' : [f for f in Path(path).glob('*.csv')],
                  'stem' : [f.stem for f in Path(path).glob('*.csv')]})


df['date'] = pd.to_datetime(df['stem'].str.extract('(\d{4}-\d{2}-\d{2})')[0])

print(df)


## ask user for the desired start & end dates, in format of 'MM_DD_YYYY'
# start_date = str(input('Enter desired Start Date: '))
# end_date = str(input('Enter desired End Date: '))


## create a list of the start & end dates for filtering the files; then concatenate all matching files into a DataFrame

# file_date_slice = df.set_index('date').loc[start_date:end_date]['files'].tolist()


# concat_df = pd.concat([pd.read_csv(file).compute() for file in file_date_slice])


## NB: My webcrawler program's CSV files are of this format:
## "craigslist_rental_sfbay_subregion_MM_DD_YYYY.csv"



def recursively_import_CSV_files_since_date_range(parent_path, region: str, subregion:str, month_date_range:str, 
day_date_range:str):

    """ Recusrively import all CSV files found in parent directory, but 
    *only* import files that match the specified date-range.

    NB: month_date_range: specify a str comprising the range of months--ie, '01-02' for January to February
    day_date_range: a str comprising the range of days: ie, '01-15' to mean the 1st through 15th 
    subregion: 
    
    Return the CSV files as separate dataframes
    within a list of dfs."""

    # specify parent path of the relevant (sfbay) scraped rental listings CSV data -- NB: use raw text--as in r'path...', or can we use the double-back slashes to escape back-slashes
    path = parent_path

    # get files from only the specified subregion
    subregion_full_path = glob.glob(path + subregion)

    # # import *only* CSV files that match specified *subregion* & *date-range*
    # # csv_files_date_range = glob.glob(f'./craigslist_rental_sfbay_{[subregion]}_[1-2]_*.csv')
    # csv_files_date_range = glob.glob(parent_path +  "*.csv")
    

    # sanity check on CSV files:
    print(f'List of all files from given subregion:\n{subregion_full_path}')


    # # NB: taken from stackexchange example: 
    # BANK\BANK_NIFTY_5MINs_2020-03-01.csv'

    # all_files = glob.glob('./BANK_NIFTY_5MINs_2020-[2-3]-*.csv')
    # # see URL for more details: 
    # # https://stackoverflow.com/questions/74386583/how-to-select-specific-csv-files-for-specified-date-range-from-a-folder-in-pytho



    # # specify an empty list to contain a list of dataframes
    # list_of_dfs = []

    # for files in csv_files_date_range:
    #     # Iterate recursively over each CSV file from within specified date-range, and import as separate DataFrames:
    #     dfs = pd.read_csv(files, 
    #                     sep=',',encoding = 'utf-8'  # use utf-8 encoding since it is OS-agnostic
    #                     )
        #  # append each imported df into the list of dfs

        # list_of_dfs.append(dfs)


    # use list comprehension to import all relevant dfs as a list of dfs:
    list_of_dfs = [pd.read_csv(file, sep=',',encoding = 'utf-8') for file in csv_files_date_range]

    # print the first element in the list:
    print(f'First imported rental listings df from {month_date_range} for {subregion} subregion, from the list of dfs:\n{list_of_dfs}')



    
    # # # NB: for reference on how to import multiple CSVs as separate dfs within a list of dfs, see this example below:
    # # Read multiple CSV files into separate DataFrames in Python
    # # <https://www.geeksforgeeks.org/read-multiple-csv-files-into-separate-dataframes-in-python/>

    # # append datasets into the list
    # for i in range(len(list_of_names)):
    #     temp_df = pd.read_csv("./csv/"+list_of_names[i]+".csv")
    #     dataframes_list.append(temp_df)

    # # Nb: also see example below on another possible way of importing multiple dfs recursively:
    # list_of_dfs = pd.read_csv(csv_files_date_range, # import each CSV file from directory
    #                                     sep=',',encoding = 'utf-8'  # use utf-8 encoding since it is OS-agnostic
    #                                     ) for csv_files_date_range in glob.iglob(  # iterate over each CSV file recursively, given specified date-range
    #                                         os.path.join(path, '**', fn_regex), # have glob.iglob() search for *only* CSV files-- ie, '*.csv', & os.path.join helps ensure this concatenation is OS-independent
    #                                         recursive=True), ignore_index=True)  


    return list_of_dfs



# import all Jan 1-Feb 14th data, by subregion

# specify parent path of sfbay data
parent_path_sfbay  = r'D:\Coding and Code projects\Python\craigslist_data_proj\CraigslistWebScraper\scraped_data'


# specify the craigslist subregion codes for as a dictionary 
dict_of_subregion_codes = {'Peninsula':'pen', 
                        'SF':'sfc', 'East Bay':'eby', 'South Bay':'sby', 
                        'Santa Cruz':'scz', 'North Bay':'nby'
                        }

# Peninsula_subregion = 'pen'
# SF_subregion ='sfc'
# ebay


# specify months & dates (strings) over which we want to apply the datetime "filter"
month_date_range_str = '01'

day_date_range_str = '01-31'

# specify (parent) region
region = 'sfbay' 

# Import Peninsula data:
Peninsula_dfs_since_jan_2023 = recursively_import_CSV_files_since_date_range(parent_path_sfbay, region, dict_of_subregion_codes.get("Peninsula"),
month_date_range_str, day_date_range_str)

# sanity check
print(f'Pensinsula data list of dfs from Jan-Feb 14, 2023--1st week (df) from this date-range: {Peninsula_dfs_since_jan_2023[0].head()}')

In [1]:
# filter since date function 
def filter_df_to_date_range(df, datetime_col_for_filter, date_range_start, date_range_end):
    """ Filter data to correspond to be within a specified range of dates."""
    return df[(df[datetime_col_for_filter] >= date_range_start)&(df[datetime_col_for_filter]<= date_range_end)]

# specify target start date for date-range filter
date_range_start = '2023-01-15' # start date for date range

# end date
date_range_end = '2023-02-14'

# apply datetime filter
df_since_jan_2023 = filter_df_since_given_date(df, 'date_posted', date_range_start, date_range_end)

# sanity check
df_since_jan_2023.sort_values(by=['date_posted'])

NameError: name 'filter_df_since_given_date' is not defined

In [53]:
""" Finally, parse the city names data for all of the data since mid-Jan 2023,
which are currently missing city names data"""

def parse_city_names_from_listing_URL(df, unique_city_names_dash_delim:list):
    """ 1) Use str.contains() method chained to a .join() method in which we perform an 'OR' boolean via
    the pipe (ie, '|' operator--ie, so we can search for multiple substrings (ie, each element 
    from the list arg) to look up any matching instances of city names
    from the unique_city_names... list 
    relative to the rental listing URLs (ie, listing_urls).

    2) Then, parse each such first city name by taking the first matched city name only,
    
    3) Use these parsed city name values to **replace** the values for the 'cities' column!"""
    #1) identify any matching city names based on listing URLs
    # use .join() method to add boolean "OR" pipe operators to each element from unique_city_names... list
    str_search_pattern = '|'.join(unique_city_names_dash_delim) 
    # # now, use str.contains() to search for matching city names from rental listing URLs
    # dF_filter =  df[df['listing_urls'].str.contains(str_search_pattern)] # match city names
    # return dF_filter

    ### NB!!: see useful stackoverlow discussion re: how to updaate one column based on whether *another* col contains a substring: 
    # "Pandas: Updating Column B value if A contains string": <https://stackoverflow.com/questions/69639237/pandas-updating-column-b-value-if-a-contains-string>

    ## Also see


    # # 2) parse the first city name only

    # df['cities'] = dF_filter[0]  # parse first city name only

    ## 3) Replace values of the 'cities' column **with** the parsed city names data (ie, by using a str.findall() method (ie, similar to regex findall()) to search over any matchin city names from the listing_urls col)
    df['cities'] = df[df['listing_urls'].str.findall(str_search_pattern).str[0]]  # replace city names data by finding any matching city names as based on the rental listing URLs 
    # # NB: alternatively, use str.extract in tandem with fillna():
    # df['cities'] = df['listing_urls'].str.extract(str_search_pattern, expand=False).fillna(df['cities'])

    # return the updated city names data and assign back to the original 'cities' col:
    return df['cities']

# get copy of df
df_since_jan_2023_cpy = df_since_jan_2023.copy()

df_since_jan_2023_cpy['cities'] = parse_city_names_from_listing_URL(df_since_jan_2023, unique_city_names_dash_delim)

# sanity check
print(f'Filtered data: {df_since_jan_2023_cpy}')

# df_since_jan_2023_cpy['cities'] = parse_city_names_from_listing_URL(df_since_jan_2023_cpy, unique_city_names_lis_dash_delim)
# df_since_jan_2023_cpy['cities']

error: nothing to repeat at position 3616

In [None]:
# 2) Import Data from mid-January 2023 to Feb 14, 2023: subset data for Jan 15, 2023 to February 14, 2023


# # NB: the following is a possible path I could take whereby you use a regex to parse out certain files based on their dates
# import pandas as pd
# from pathlib import Path

# path = '\tmp\s3\bucket\files'

# df = pd.DataFrame({'files' : [f for f in Path(path).glob('*.csv')],
#                   'stem' : [f.stem for f in Path(path).glob('*.csv')]})


# df['date'] = pd.to_datetime(df['stem'].str.extract('(\d{4}-\d{2}-\d{2})')[0])

# print(df)

import pandas as pd
from pathlib import Path



In [None]:
""" Finally, parse the city names data for all of the data since mid-Jan 2023,
which are currently missing city names data"""

def parse_city_names_from_listing_URL(dictionary_of_dfs: dict, unique_city_names_dash_delim:list):

  """ 1) Use str.contains() method chained to a .join() method in which we perform an 'OR' boolean via
  the pipe (ie, '|' operator--ie, so we can search for multiple substrings (ie, each element 
  from the list arg) to look up any matching instances of city names
  from the unique_city_names... list 
  relative to the rental listing URLs (ie, listing_urls).

  2) Then, parse each such first city name by taking the first matched city name only,
  
  3) Use these parsed city name values to **replace** the values for the 'cities' column!"""

  for df in dictionary_of_dfs.items():   

    # step 1: use str.split() on '/apa/d' and get the 2nd element after performing the split:
    df['listing_urls_for_str_match'] = df['listing_urls'].str.split('/apa/d/').str[1]  # obtain the 2nd resulting element

    ## 2a) First!!: convert all string elements in col to lower-case for sake of consistency
    df['listing_urls_for_str_match'] = df['listing_urls_for_str_match'].str.lower()  # apply lowercase to all characters of each row's string vals 

    ## Next, do same for the list of SF Bay + SC county names:
    unique_city_names_dash_delim  = [el.lower() for el in unique_city_names_dash_delim]


    # step 3: match a substring from this newly-parsed column-- ie, 'listing_urls_for_str_match'
    # -- to matching substrings from the  sfbay_city_names list:
    # How?: use str.contains() and join pipe operators to each element of the list to perform an essentially  boolean "OR" str.contains() search for any matching city names

    # pipe operator
    pipe_operator = '|'

    df['cities'] = np.where(
        df['listing_urls_for_str_match'].str.contains(
        pipe_operator.join(
                
                unique_city_names_dash_delim   # look for any matching city names from sfbay_city_names list
                )  # parse city names by matching the data to the list of possible SF Bay Area city names

                # flags=re.I
        ),
        df['listing_urls_for_str_match'],   # ie, if there is a matching city name, then given row value for cities column  
        df['cities']  # if no match, then keep original row value 
      )
    
    # return df['cities]




# # step 1: use str.split() on '/apa/d' and get the 2nd element after performing the split:
# df['listing_urls_for_str_match'] = df['listing_urls'].str.split('/apa/d')[1]  # obtain the 2nd resulting element

# # step 2: use str.split() on the forward slashes (ie, essentially a delimiter), and obtain the first such element
# df['listing_urls_for_str_match'] = df['listing_urls_for_str_match'].str.split('/')[0]  # obtain the 1st resulting element





# # step 3: match a substring from this newly-parsed column-- ie, 'listing_urls_for_str_match'
# -- to matching substrings from the  sfbay_city_names list:
# How?: use str.contains() and join pipe operators to each element of the list to perform an essentially  boolean "OR" str.contains() search for any matching city names

# # pipe operator
# pipe_operator = '|'

# df['cities'] = np.where(
#     df['listing_urls_for_str_match'].str.contains(
#     pipe_operator.join(
            
#             sfbay_city_names   # look for any matching city names from sfbay_city_names list
#             )  # parse city names by matching the data to the list of possible SF Bay Area city names

#             # flags=re.I
#     ),
#     df['listing_urls_for_str_match'],   # ie, if there is a matching city name, then given row value for cities column  
#     df['cities']  # if no match, then keep original row value 
#   ) 







# apply function to each dictionary of dfs
# NB: use dicitonary comprehensions to apply the function to *each* df within each given dictionary:

# Peninsula
dict_of_dfs_pen2 = {key: val.pipe(parse_city_names_from_listing_URL(dict_of_dfs_pen, sfbay_city_names)) for key, val in dict_of_dfs_pen.items()}
dict_of_dfs_pen2


### Finally, export all dfs containing  the cleaned city names--ie, since Jan 2023. 

### Loop over each df from each list of dfs, and return .to_csv() to re-export them replace the old CSV files!!


### "Exporting Pandas output for multiple CSV files": https://stackoverflow.com/questions/67959271/exporting-pandas-output-for-multiple-csv-files


In [None]:
"""Warning!!!: This is pseudo-code. I need to revise with the new dfs, and test this on a smaller scale first!!!"""

""" NB: see following for soverflow article on how to do this--*assuming I've managed to import each df as separate subregion & week data:
"How to export data frame back to the original csv file that I imported it from?" 
<https://stackoverflow.com/questions/69159238/how-to-export-data-frame-back-to-the-original-csv-file-that-i-imported-it-from>"""

# loop over all dfs containing data since Jan 2023 
for df in dfs:
    # replace all old CSV files with the cleaned city names
    return pd.to_csv(df)