## Missing city names data:

NB: 

### In this notebook, we want to clean several weeks' worth of *sfbay* rental listings data that were scraped before the webcrawler had been updated to parse city names properly. 

### In short, several weeks of rental listings from Jan-Feb 2023 are missing city names (ie, the 'cities' column). To be clear, the problem affects only the CSV files produced by webcrawler runs from around early January to February 14, 2023--ie, based on date_of_webcrawler, *not* date_posted--after which the problem was fixed. For details, see commit c60d6d9 in the GitHub repo, dated Feb 15, 2023.

### To do the needed data cleaning, we need to do the following:

### 1) Identify all unique city names for all SF Bay Area counties as well as Santa Cruz county:

### 1a) How? Run a separate webcrawler in which we grab data on all city names for a) the SF Bay Area counties & also b) Santa Cruz county. We can get these data from separate Wikipedia webpage tables.

### 1b) After extracting the city names, add '-' delimiters to match the format of the city names data that we will be parsing from the craigslist rental listing URLs from the listings data that we need to clean.


### 2) Second, we need to import each week of sfbay rental listings that have missing city names data--by subregion--as *separate* dataframes, ie, with each df corresponding to a given week and a given subregion. 

### 2a) To do this, we need to restrict the data to the files from mid-January 2023 to mid-February 2023.

### 2b) Then, **replace** the missing city names with matching city names--ie, based on the list of unique city names (see step 1)--via a regex str.findall() search of the city names as parsed from the rental listing URLs (ie, the listing_urls col).

#### 2bi) More specifically: we will take the list of SF Bay Area city names and return the *first* matched city name from each given rental listing URL. 

#### So if a rental listing URL mentions multiple city names, we will assume the first listed city name is the main city, and use that as the name for the city column for sake of simplicity.

### 2c) Finally, output the cleaned data back to the **original** CSV files and overwrite the original files!
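
The matching rule from step 2b can be sketched on a single URL as follows. This is a minimal illustration rather than the notebook's final implementation: the helper name `first_city_in_url` and the short `example_cities` list are hypothetical, and the city names are assumed to be lower-cased and dash-delimited already.

```python
import re

def first_city_in_url(listing_url, city_slugs):
    """Return the first (leftmost) city slug found in the URL, or None if there is no match."""
    # Join the slugs into one "or" pattern; re.escape guards against regex metacharacters
    pattern = re.compile('|'.join(re.escape(slug) for slug in city_slugs))
    match = pattern.search(listing_url.lower())
    return match.group(0) if match else None

example_cities = ['oakland', 'concord', 'walnut-creek']  # hypothetical subset of the full list
url = 'https://sfbay.craigslist.org/eby/apa/d/concord-small-studio/7573915526.html'
print(first_city_in_url(url, example_cities))  # concord
```

If a URL mentions several cities, `re.search` returns the one that appears earliest in the string, which corresponds to the "first listed city" rule described in step 2bi.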


## NB: Let's try a different approach to gather names of all cities & towns in the SF Bay Area:

### Namely, let's access US Census data tables containing lists of all SF Bay Area cities--in addition to a separate page for Santa Cruz county--found on Wikipedia

### We will need to implement a short webcrawler for the purpose of extracting the city names of the SF Bay Area and SC county, then append to a list, and add the desired dash ('-') delimiters:

In [42]:
# 1) Identify a list of all unique SF Bay Area & SC county city names, and output to a list

# 1a) Create a few simple webcrawlers to grab the city names data from 2 wikipedia tables

#web crawling, web scraping & webdriver libraries and modules
from selenium import webdriver  # NB: this is the main module we will use to implement the webcrawler and webscraping. A webdriver is an automated browser.
from webdriver_manager.chrome import ChromeDriverManager # import webdriver_manager package to automatically take care of any needed updates to Chrome webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException, TimeoutException, WebDriverException, ElementClickInterceptedException
from selenium.webdriver.chrome.options import Options  # Options enables us to tell Selenium to open WebDriver browsers using maximized mode, and we can also disable any extensions or infobars

import requests


### SF bay area city names data


# sf bay area city names wiki page:
sfbay_cities_wiki_url = 'https://en.wikipedia.org/wiki/List_of_cities_and_towns_in_the_San_Francisco_Bay_Area'


# access page, and grab city names, append to list

def obtain_cities_from_wiki_sfbay(webpage_url,list_of_cities):
    # initialize web driver
            
    driver = webdriver.Chrome(ChromeDriverManager().install())  # install or update the latest Chrome webdriver using the ChromeDriverManager() library
    
    # access webpage
    driver.get(webpage_url)

    xpaths_table = '//table[@class="wikitable plainrowheaders sortable jquery-tablesorter"]'

    # search for wiki data tables:
    table = driver.find_element(By.XPATH, xpaths_table)


    # iterate over each table row and collect the text of its header (<th>) cells, which hold the city names
    for row in table.find_elements(By.CSS_SELECTOR, 'tr'): # iterate over each row in the table

        # grab the <th> cells of the row: a data row should have a single <th> (the city name in the 1st column), whereas a header row has several <th> cells (the column names)
        city_names =  row.find_elements(By.TAG_NAME, 'th')
        # city_names =  row.find_elements(By.TAG_NAME, 'td')[0]  # alternative: grab the 1st <td> cell of each row--ie, the 0th index

        # NB: the [:2] slice keeps at most the first 2 <th> cells of each row--it does *not* skip any rows.
        # The header rows therefore still add their column names to the list; we remove these after calling the function (see the [4:] slice below)
        for city_name in city_names[:2]:

            # append the text of each cell to the list
            list_of_cities.append(city_name.text)


    # exit webpage 
    driver.close()


    return list_of_cities



# initialize lists:
sfbay_city_names = []


#sfbay data
obtain_cities_from_wiki_sfbay(sfbay_cities_wiki_url, sfbay_city_names)

# remove the 4 column names that the table's header rows added to the start of the list:
sfbay_city_names = sfbay_city_names[4:]

# sanity check
print(f'sfbay city names:{sfbay_city_names}')

print(f'There are {len(sfbay_city_names)} city names\nNB: There should be 101.')




Current google-chrome version is 111.0.5563
Get LATEST driver version for 111.0.5563
Driver [C:\Users\Kevin Allen\.wdm\drivers\chromedriver\win32\111.0.5563.64\chromedriver.exe] found in cache


sfbay city names:['Alameda', 'Albany', 'American Canyon', 'Antioch', 'Atherton', 'Belmont', 'Belvedere', 'Benicia', 'Berkeley', 'Brentwood', 'Brisbane', 'Burlingame', 'Calistoga', 'Campbell', 'Clayton', 'Cloverdale', 'Colma', 'Concord', 'Corte Madera', 'Cotati', 'Cupertino', 'Daly City', 'Danville', 'Dixon', 'Dublin', 'East Palo Alto', 'El Cerrito', 'Emeryville', 'Fairfax', 'Fairfield', 'Foster City', 'Fremont', 'Gilroy', 'Half Moon Bay', 'Hayward', 'Healdsburg', 'Hercules', 'Hillsborough', 'Lafayette', 'Larkspur', 'Livermore', 'Los Altos', 'Los Altos Hills', 'Los Gatos', 'Martinez', 'Menlo Park', 'Mill Valley', 'Millbrae', 'Milpitas', 'Monte Sereno', 'Moraga', 'Morgan Hill', 'Mountain View', 'Napa', 'Newark', 'Novato', 'Oakland', 'Oakley', 'Orinda', 'Pacifica', 'Palo Alto', 'Petaluma', 'Piedmont', 'Pinole', 'Pittsburg', 'Pleasant Hill', 'Pleasanton', 'Portola Valley', 'Redwood City', 'Richmond', 'Rio Vista', 'Rohnert Park', 'Ross', 'St. Helena', 'San Anselmo', 'San Bruno', 'San Carlos
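
As an alternative to trimming the column names with a hard-coded `[4:]` slice, the header rows can be told apart from the data rows by counting their `<th>` cells. The sketch below shows that idea offline with the standard library's `html.parser`. It rests on assumptions: `sample_html` is a made-up table, `RowHeaderParser` is a hypothetical helper, and each data row is assumed to carry exactly one `<th>` (the city name) while each header row carries several.

```python
from html.parser import HTMLParser

class RowHeaderParser(HTMLParser):
    """Collect the text of the <th> cells of every table row."""

    def __init__(self):
        super().__init__()
        self.rows = []        # one list of <th> texts per <tr>
        self._in_th = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.rows.append([])
        elif tag == 'th':
            self._in_th = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == 'th' and self._in_th:
            self._in_th = False
            if not self.rows:  # tolerate a <th> that appears before any <tr>
                self.rows.append([])
            self.rows[-1].append(''.join(self._buffer).strip())

    def handle_data(self, data):
        if self._in_th:
            self._buffer.append(data)

# Hypothetical table: one header row, then two data rows with a row header each
sample_html = """
<table>
  <tr><th>Name</th><th>Type</th><th>County</th></tr>
  <tr><th>Alameda</th><td>City</td><td>Alameda</td></tr>
  <tr><th>Daly City</th><td>City</td><td>San Mateo</td></tr>
</table>
"""

parser = RowHeaderParser()
parser.feed(sample_html)

# Keep only the rows with exactly one <th>, ie, the data rows
city_names = [cells[0] for cells in parser.rows if len(cells) == 1]
print(city_names)  # ['Alameda', 'Daly City']
```

This filter would misfire on a header row that has only a single `<th>` cell, so the final count should still be checked against the expected number of cities.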

In [43]:
# Sc county city names data:

#web crawling, web scraping & webdriver libraries and modules
from selenium import webdriver  # NB: this is the main module we will use to implement the webcrawler and webscraping. A webdriver is an automated browser.
from webdriver_manager.chrome import ChromeDriverManager # import webdriver_manager package to automatically take care of any needed updates to Chrome webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException, TimeoutException, WebDriverException, ElementClickInterceptedException
from selenium.webdriver.chrome.options import Options  # Options enables us to tell Selenium to open WebDriver browsers using maximized mode, and we can also disable any extensions or infobars

import requests



# sc county wiki page url
sc_county_cities_wiki_url = 'https://en.wikipedia.org/wiki/Santa_Cruz_County,_California#Population_ranking'


sc_county_city_names = []


def obtain_cities_from_wiki_sc(webpage_url,list_of_cities):
    # initialize web driver
            
    driver = webdriver.Chrome(ChromeDriverManager().install())  # install or update the latest Chrome webdriver using the ChromeDriverManager() library
    
    # access webpage
    driver.get(webpage_url)


    # NB!: there are 2 tables with the same class name; only select data from the 2nd one
    xpaths_table = '//table[@class="wikitable sortable jquery-tablesorter"][2]//tr//td[2]'  # 2nd table on webpage with this class name


    # search for given wiki data tables:
    table = driver.find_elements(By.XPATH, xpaths_table)


    print(f'Full table:\n\n{table}\n\n\n\n\n')

    for row in table:
        print(f'City names:{row.text}')
        list_of_cities.append(row.text)





    # exit webpage 
    driver.close()

    # # sanity check
    # print(f'List of city names:\n{list_of_cities}')

    return list_of_cities

obtain_cities_from_wiki_sc(sc_county_cities_wiki_url, sc_county_city_names)


# clean data by removing the extraneous '†' char from the city names list
sc_county_city_names = list(map(lambda x: x.replace('†',''), sc_county_city_names))

## finally, drop any empty or whitespace-only elements from the list-- use list comprehension
sc_county_city_names = [s for s in sc_county_city_names if s.strip()]

# sanity check
print(f'\n\nsc county city names:{sc_county_city_names}')
print(f'There are {len(sc_county_city_names)} city names for SC county.')




Current google-chrome version is 111.0.5563
Get LATEST driver version for 111.0.5563
Driver [C:\Users\Kevin Allen\.wdm\drivers\chromedriver\win32\111.0.5563.64\chromedriver.exe] found in cache


Full table:

[<selenium.webdriver.remote.webelement.WebElement (session="51034c4d59dfd249ac6227fa8243b2bb", element="546d88e8-ae6b-4e03-bdc7-c997d2b78567")>, <selenium.webdriver.remote.webelement.WebElement (session="51034c4d59dfd249ac6227fa8243b2bb", element="54552ffe-99db-4174-a95e-e3395f32dd1f")>, <selenium.webdriver.remote.webelement.WebElement (session="51034c4d59dfd249ac6227fa8243b2bb", element="0689e216-58be-4b4f-8018-bfaa30365ac1")>, <selenium.webdriver.remote.webelement.WebElement (session="51034c4d59dfd249ac6227fa8243b2bb", element="f7520c47-c054-422d-b294-56c52dd45ce4")>, <selenium.webdriver.remote.webelement.WebElement (session="51034c4d59dfd249ac6227fa8243b2bb", element="eaac9f9d-b088-4008-812a-14abaab7ce28")>, <selenium.webdriver.remote.webelement.WebElement (session="51034c4d59dfd249ac6227fa8243b2bb", element="924dfa8c-85cb-45cc-8a8f-5003a3ceda58")>, <selenium.webdriver.remote.webelement.WebElement (session="51034c4d59dfd249ac6227fa8243b2bb", element="2fe86ee0-818e-4019-
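
The clean-up applied to the Santa Cruz county names can be sketched as a single helper. Note one difference from the cell above: the `if s.strip()` filter drops blank entries but leaves any surrounding whitespace on the remaining names, whereas this sketch also trims it. The helper name `clean_city_names` and the `raw` values are hypothetical.

```python
def clean_city_names(raw_names):
    """Remove the dagger marker, trim whitespace, and drop empty entries."""
    cleaned = (name.replace('†', '').strip() for name in raw_names)
    return [name for name in cleaned if name]

raw = ['Santa Cruz†', ' Watsonville ', '', '   ', 'Capitola']  # hypothetical scraped values
print(clean_city_names(raw))  # ['Santa Cruz', 'Watsonville', 'Capitola']
```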

In [44]:
# combine both the sfbay city names & sc county names lists into one:
def combine_lists(list1, list2):
    list1.extend(list2)  # NB: list.extend() modifies list1 in place and returns None, so return the list itself
    return list1

combine_lists(sfbay_city_names, sc_county_city_names)

# sanity check
print(f'sanity check on sfbay & sc county city names data:\n\n{sfbay_city_names}')

print(f'\nThere are {len(sfbay_city_names)} cities')

sanity check on sfbay & sc county city names data:

['Alameda', 'Albany', 'American Canyon', 'Antioch', 'Atherton', 'Belmont', 'Belvedere', 'Benicia', 'Berkeley', 'Brentwood', 'Brisbane', 'Burlingame', 'Calistoga', 'Campbell', 'Clayton', 'Cloverdale', 'Colma', 'Concord', 'Corte Madera', 'Cotati', 'Cupertino', 'Daly City', 'Danville', 'Dixon', 'Dublin', 'East Palo Alto', 'El Cerrito', 'Emeryville', 'Fairfax', 'Fairfield', 'Foster City', 'Fremont', 'Gilroy', 'Half Moon Bay', 'Hayward', 'Healdsburg', 'Hercules', 'Hillsborough', 'Lafayette', 'Larkspur', 'Livermore', 'Los Altos', 'Los Altos Hills', 'Los Gatos', 'Martinez', 'Menlo Park', 'Mill Valley', 'Millbrae', 'Milpitas', 'Monte Sereno', 'Moraga', 'Morgan Hill', 'Mountain View', 'Napa', 'Newark', 'Novato', 'Oakland', 'Oakley', 'Orinda', 'Pacifica', 'Palo Alto', 'Petaluma', 'Piedmont', 'Pinole', 'Pittsburg', 'Pleasant Hill', 'Pleasanton', 'Portola Valley', 'Redwood City', 'Richmond', 'Rio Vista', 'Rohnert Park', 'Ross', 'St. Helena', 'San

In [45]:
# 1b) Add dash ('-') delimiters to the list of city names data


""" Next, add dash delimiters *in between* each word (ie, in place of the whitespace in between each word of each city name) for each element (read: city name) 
from the sfbay_city_names list.  

Why add a dash delimiter in b/w each word of each city name?:
Because each rental listing's URL (ie, the listing_urls col) contains--as of the changes to craigslist's servers in Jan 2023--
the listing's city name. 
***But!: The city names in the URLs are ***always*** listed with a dash delimiter in between each word!"""

def add_dash_delimiter_in_bw_each_word_of_city_names(city_names:list):
    return [word.replace(' ', '-') for word in city_names]  # use str.replace() method to replace whitespaces with dashes

sfbay_city_names = add_dash_delimiter_in_bw_each_word_of_city_names(sfbay_city_names)
 
# sanity check
sfbay_city_names

['Alameda',
 'Albany',
 'American-Canyon',
 'Antioch',
 'Atherton',
 'Belmont',
 'Belvedere',
 'Benicia',
 'Berkeley',
 'Brentwood',
 'Brisbane',
 'Burlingame',
 'Calistoga',
 'Campbell',
 'Clayton',
 'Cloverdale',
 'Colma',
 'Concord',
 'Corte-Madera',
 'Cotati',
 'Cupertino',
 'Daly-City',
 'Danville',
 'Dixon',
 'Dublin',
 'East-Palo-Alto',
 'El-Cerrito',
 'Emeryville',
 'Fairfax',
 'Fairfield',
 'Foster-City',
 'Fremont',
 'Gilroy',
 'Half-Moon-Bay',
 'Hayward',
 'Healdsburg',
 'Hercules',
 'Hillsborough',
 'Lafayette',
 'Larkspur',
 'Livermore',
 'Los-Altos',
 'Los-Altos-Hills',
 'Los-Gatos',
 'Martinez',
 'Menlo-Park',
 'Mill-Valley',
 'Millbrae',
 'Milpitas',
 'Monte-Sereno',
 'Moraga',
 'Morgan-Hill',
 'Mountain-View',
 'Napa',
 'Newark',
 'Novato',
 'Oakland',
 'Oakley',
 'Orinda',
 'Pacifica',
 'Palo-Alto',
 'Petaluma',
 'Piedmont',
 'Pinole',
 'Pittsburg',
 'Pleasant-Hill',
 'Pleasanton',
 'Portola-Valley',
 'Redwood-City',
 'Richmond',
 'Rio-Vista',
 'Rohnert-Park',
 'Ross'
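
Because the later cells match the URL text in lower case, the dash-delimited names also need to be lower-cased before the lookup. A minimal sketch that does both steps at once follows; the helper name `to_url_slug` and the example names are hypothetical.

```python
def to_url_slug(city_name):
    """Lower-case a city name and join its words with dashes."""
    # str.split() with no arguments also collapses repeated whitespace
    return '-'.join(city_name.lower().split())

example_names = ['East Palo Alto', 'Daly City', 'Oakland']
print([to_url_slug(name) for name in example_names])
# ['east-palo-alto', 'daly-city', 'oakland']
```

Punctuation is left untouched here, so a name such as 'St. Helena' keeps its period. Whether craigslist keeps periods in its URLs has not been verified in this notebook, so such names may need separate handling.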

### 2) Next, import the data from the webcrawler runs of January 1 to Feb 14, 2023: ie, the data in which the city names are missing!

### NB: We will need to use separate collections of dataframes for **each** subregion, given the path structure of the scraped data derived from the webcrawler. 

### NB: See the following Stack Overflow question for useful info on how to do this. We will need to *add* and apply a separate *datetime filter*, *or* use the glob library to match the CSV file names against a pattern, so that we import **only** the files from the right dates:

### "Read multiple csv files stored by date from start date to end date into a pandas dataframe" <https://stackoverflow.com/questions/61236241/read-multiple-csv-files-stored-by-date-from-start-date-to-end-date-into-a-pandas>




In [7]:
# import pandas as pd
# from pathlib import Path
# import glob
# import os

# # specify subregion code
# subregion_code = 'pen'

# # specify parent path of all sfbay data-- NB: use an f-string combined with a raw (ie, r) string--ie, fr to modify the string so we can input the subregion code as an argument to add to the path
# path = fr'D:\Coding and Code projects\Python\craigslist_data_proj\CraigslistWebScraper\scraped_data\sfbay'


# # get all CSV files from path, and grab the file stems for each given CSV file 
# df = pd.read_csv(file, # import each CSV file
#                  sep=',', 
#                  encoding = 'utf-8'  # assume standard CSV (ie, comma separated) format and use utf-8 encoding
#                  ) for file in glob.iglob(
#     os.path.join(path, '**', 
#                  fn_regex), 
#                  recursive=True)
#  # get file stem using .stem method

# # # Parse the dates from each CSV file, and keep the same 'MM_DD_YYY' format (**including the underscore delimiters!!)
# # df['date_of_file'] = pd.to_datetime(df['file_stem'].str.extract('\d{2}_\d{2}_\d{4}')[0]) 

# # df['date_of_file'] = df['file_stem'].str.extract(r'\d{2}_\d{2}_\d{4}')


# ## "craigslist_rental_sfbay_subregion_MM_DD_YYYY.csv"
# # sanity check
# print(df)

SyntaxError: invalid syntax (Temp/ipykernel_20628/4294683324.py, line 17)

In [1]:
# 2) Import Data from mid-January 2023 to Feb 14, 2023: subset data for Jan 15, 2023 to February 14, 2023

# imports-- file processing & datetime libraries
import os
import glob
from pathlib import Path
import datetime
# data analysis libraries & SQL libraries
import numpy as np
import pandas as pd
from pandas.core.frame import DataFrame


def import_dfs_matching_specific_subregion_and_dates(subregion_code:str)->dict:
    """ Import dataframes as a dictionary of dataframes, given a) a specific user-inputted range of dates & b) a specific subregion"""

    # specify parent path to all sfbay data-- NB: use an f-string combined with a raw (ie, r) string--ie, fr to modify the string so we can input the subregion code as an argument to add to the path
    path = fr'D:\Coding and Code projects\Python\craigslist_data_proj\CraigslistWebScraper\scraped_data\sfbay\{subregion_code}'


    # get all CSV files from path, and grab the file stems for each given CSV file 
    df = pd.DataFrame({'files' : [file for file in Path(path).glob('*.csv')],
                    'file_stem' : [file.stem for file in Path(path).glob('*.csv')]}) # get file stem using .stem method

    # Parse the dates from each CSV file name, and keep the same 'MM_DD_YYYY' format (**including the underscore delimiters!!) as the webcrawler CSV file naming convention:

    df['date_of_file'] = df['file_stem'].str.extract(r'(\d{2}_\d{2}_\d{4})')


    ## "craigslist_rental_sfbay_subregion_MM_DD_YYYY.csv"
    # sanity check
    print(f'\nSome of the file dates:\n{df[["date_of_file", "file_stem"]].sort_values(by="date_of_file").tail()}\n')


    ## NB: The webcrawler program's CSV files are of this format:
    ## ** "craigslist_rental_sfbay_subregion_MM_DD_YYYY.csv"-- therefore:

    ## **Prompt user for the desired start & end dates, which we will concatenate into the format of 'MM_DD_YYYY'

    # start date inputs
    start_date_month = str(input('Enter desired Start Date month: '))
    start_date_day = str(input('Enter desired Start Date day: '))

    start_date_year = str(input('Enter desired Start Date year: '))

    # concat start date inputs to single string, with **underscore** delimiters in between each component
    underscore_delimiter = '_'

    start_date = start_date_month + underscore_delimiter + start_date_day + underscore_delimiter + start_date_year

    # sanity check on  resulting str
    print(f'Start date for file filter:\n{start_date}')

    # end date inputs
    end_date_month = str(input('Enter desired End Date month: '))
    end_date_day = str(input('Enter desired End Date day: '))

    end_date_year = str(input('Enter desired End Date year: '))

    # concat end date inputs to single string, with **underscore** delimiters in between each component
    end_date = end_date_month + underscore_delimiter + end_date_day + underscore_delimiter + end_date_year

    # sanity check on  resulting str
    print(f'End date for file filter:\n{end_date}')


    # filter the files by date: set the file dates as the index, slice from the start date to the end date, and convert the matching files to a list
    # NB: date_of_file holds 'MM_DD_YYYY' *strings*, so this slice compares the labels lexicographically rather than chronologically;
    # files from other years (eg, '01_02_2022') can therefore fall inside a 2023 date range
    file_date_slice = df.set_index('date_of_file').loc[start_date:end_date]['files'].tolist()


    # initialize empty dict, which will be used as a dict of dfs
    dict_of_dfs = {}

    # iterate over each CSV file that matches the file date slice:
    for csv_file in file_date_slice:
        # read each CSV file as a separate df within the dictionary, and use the file name as the key of the dictionary 
        dict_of_dfs[csv_file] = pd.read_csv(csv_file, sep=',', encoding='utf-8')

    # sanity check: print the first key (ie, the file path) of the dict of dfs:
    print(f'\nSanity check: first imported rental listings df from the period of {start_date} to {end_date} for {subregion_code} subregion, from the dict of dfs:\n{next(iter(dict_of_dfs))}')

    return dict_of_dfs



# specify each subregion codes as separate strings:
subregion_pen, subregion_nby, subregion_sby, subregion_eby, subregion_sfc, subregion_scz  = 'pen', 'nby', 'sby', 'eby', 'sfc', 'scz'

# import Peninsula data:

dict_of_dfs_pen = import_dfs_matching_specific_subregion_and_dates(subregion_pen)



Some of the file dates:
   date_of_file                                file_stem
61   12_02_2022  craigslist_rental_sfbay_pen_12_02_2022_
62   12_08_2022  craigslist_rental_sfbay_pen_12_08_2022_
63   12_15_2022  craigslist_rental_sfbay_pen_12_15_2022_
64   12_21_2022  craigslist_rental_sfbay_pen_12_21_2022_
65   12_27_2022  craigslist_rental_sfbay_pen_12_27_2022_

Start date for file filter:
01_01_2023
End date for file filter:
02_14_2023

Sanity check: first imported rental listings df from the period of 01_01_2023 to 02_14_2023 for pen subregion, from the dict of dfs:
D:\Coding and Code projects\Python\craigslist_data_proj\CraigslistWebScraper\scraped_data\sfbay\pen\craigslist_rental_sfbay_pen_01_02_2022.csv
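
Note that the sanity check above lists a file dated 01_02_2022, even though the requested range was 01_01_2023 to 02_14_2023. This is consistent with the 'MM_DD_YYYY' strings being compared as text, month first, instead of as calendar dates. The sketch below contrasts the two comparisons on a handful of made-up dates. It is an illustration under stated assumptions, not the notebook's code: `parse_file_date` is a hypothetical helper and `file_dates` is hypothetical data.

```python
from datetime import datetime

DATE_FORMAT = '%m_%d_%Y'  # the 'MM_DD_YYYY' convention used in the CSV file names

def parse_file_date(date_str):
    """Convert an 'MM_DD_YYYY' string into a datetime object."""
    return datetime.strptime(date_str, DATE_FORMAT)

# Hypothetical file dates, for illustration only
file_dates = ['01_02_2022', '12_27_2022', '01_05_2023', '02_14_2023', '02_20_2023']
start_date, end_date = '01_01_2023', '02_14_2023'

# Lexicographic comparison: the month is compared before the year
string_filtered = [d for d in file_dates if start_date <= d <= end_date]

# Chronological comparison on parsed dates
start, end = parse_file_date(start_date), parse_file_date(end_date)
date_filtered = [d for d in file_dates if start <= parse_file_date(d) <= end]

print(string_filtered)  # ['01_02_2022', '01_05_2023', '02_14_2023']
print(date_filtered)    # ['01_05_2023', '02_14_2023']
```

Filtering on parsed dates keeps the January 2022 file out of the 2023 range. This matters before any cleaned data are written back over the original CSV files.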


In [36]:
# import East Bay data:

dict_of_dfs_eby = import_dfs_matching_specific_subregion_and_dates(subregion_eby)



Some of the file dates:
   date_of_file                                file_stem
65   12_07_2022  craigslist_rental_sfbay_eby_12_07_2022_
66   12_14_2022  craigslist_rental_sfbay_eby_12_14_2022_
67   12_19_2022  craigslist_rental_sfbay_eby_12_19_2022_
68   12_23_2022  craigslist_rental_sfbay_eby_12_23_2022_
69   12_29_2022  craigslist_rental_sfbay_eby_12_29_2022_

Start date for file filter:
01_01_2023
End date for file filter:
02_14_2023

Sanity check: first imported rental listings df from the period of 01_01_2023 to 02_14_2023 for eby subregion, from the dict of dfs:
D:\Coding and Code projects\Python\craigslist_data_proj\CraigslistWebScraper\scraped_data\sfbay\eby\craigslist_rental_sfbay_eby_01_03_2022.csv


In [None]:
# import North Bay data:

dict_of_dfs_nby = import_dfs_matching_specific_subregion_and_dates(subregion_nby)


In [None]:
# import South Bay data:

dict_of_dfs_sby = import_dfs_matching_specific_subregion_and_dates(subregion_sby)


In [None]:
# import SF data:

dict_of_dfs_sfc = import_dfs_matching_specific_subregion_and_dates(subregion_sfc)

In [None]:
# import Santa Cruz data:

dict_of_dfs_scz = import_dfs_matching_specific_subregion_and_dates(subregion_scz)


### Try to determine what day or at least week the missing city name data began to surface:

#### Iterate over each df of East Bay data, and print each unique city name per df:

In [None]:
# Let's first show the names of the East Bay files, so we can know the specific dates the webcrawler was run
print("East Bay files")  # print the label once, rather than once per file
for key in dict_of_dfs_eby:
    print(key.name)

In [40]:
def print_unique_row_vals_for_col_for_each_df_in_dict_of_dfs(dict_of_dfs, col):
    for name, df in dict_of_dfs.items():
        print(f'Unique city names per df from {name.name}')  # show name of corresponding CSV file per df
        print(name, df[col].unique())  # show unique values of given col
        print()  # include a new line in between each df


print_unique_row_vals_for_col_for_each_df_in_dict_of_dfs(dict_of_dfs_eby, 'cities')


Unique city names per df craigslist_rental_sfbay_eby_01_03_2022.csv
D:\Coding and Code projects\Python\craigslist_data_proj\CraigslistWebScraper\scraped_data\sfbay\eby\craigslist_rental_sfbay_eby_01_03_2022.csv ['Brentwood ' 'Vallejo ' 'Hayward' 'Concord ' 'Westbrae' 'Elmwood'
 'Hercules, Pinole, San Pablo, El Sob' 'Oakland' 'Berkeley North '
 'Fremont ' 'Danville ' 'Alameda' 'Berkeley' 'Walnut Creek' 'San Leandro'
 'Dublin ' 'San Lorenzo' 'Pittsburg ' 'Richmond' 'Livermore' 'Emeryville'
 'Albany ' 'Lafayette ' 'Fairfield ' 'East San Jose' 'West End' 'Nan']

Unique city names per df craigslist_rental_sfbay_eby_01_05_2023_.csv
D:\Coding and Code projects\Python\craigslist_data_proj\CraigslistWebScraper\scraped_data\sfbay\eby\craigslist_rental_sfbay_eby_01_05_2023_.csv ['Concord ' 'Oakland' 'Dublin ' 'Alameda' 'Lafayette ' 'Walnut Creek'
 'Berkeley' 'Danville ' 'Fairfield ' 'Pittsburg '
 'Hercules, Pinole, San Pablo, El Sob' 'Hayward' 'Fremont ' 'Vallejo '
 'San Leandro' 'Sunnyvale' 'Nan

In [41]:
# NB: let's take a slightly different way to determine what specific day when the city names were no longer being parsed correctly

def groupby_by_day_and_find_day_when_city_nulls_started(df):
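    """Group the data by the date when the webcrawler was run,
    and use the .nunique() method on the cities column.
    For any day on which the webcrawler failed to parse the city names,
    we expect only 1 unique value (ie, the 'Nan' placeholder)."""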
    return df.groupby('date_of_webcrawler')['cities'].nunique()


# Peninsula data--NB: this cell iterates over dict_of_dfs_pen, so the results are for the Peninsula (pen) files
dict_of_dfs_pen_null_per_col_per_day = {key: val.pipe(groupby_by_day_and_find_day_when_city_nulls_started) for key, val in dict_of_dfs_pen.items()}
print(dict_of_dfs_pen_null_per_col_per_day)

{WindowsPath('D:/Coding and Code projects/Python/craigslist_data_proj/CraigslistWebScraper/scraped_data/sfbay/pen/craigslist_rental_sfbay_pen_01_02_2022.csv'): date_of_webcrawler
1/2/2022    26
Name: cities, dtype: int64, WindowsPath('D:/Coding and Code projects/Python/craigslist_data_proj/CraigslistWebScraper/scraped_data/sfbay/pen/craigslist_rental_sfbay_pen_01_06_2023_.csv'): date_of_webcrawler
2023-01-06    32
Name: cities, dtype: int64, WindowsPath('D:/Coding and Code projects/Python/craigslist_data_proj/CraigslistWebScraper/scraped_data/sfbay/pen/craigslist_rental_sfbay_pen_01_08_2022.csv'): date_of_webcrawler
1/8/2022    27
Name: cities, dtype: int64, WindowsPath('D:/Coding and Code projects/Python/craigslist_data_proj/CraigslistWebScraper/scraped_data/sfbay/pen/craigslist_rental_sfbay_pen_01_10_2022.csv'): date_of_webcrawler
1/10/2022    25
Name: cities, dtype: int64, WindowsPath('D:/Coding and Code projects/Python/craigslist_data_proj/CraigslistWebScraper/scraped_data/sfbay/pe

## Notice that the city names start disappearing around January 14th: the webcrawler run from that day failed to parse any city names. 

### How can we be sure? The .unique() method returned only 'Nan' for that run, and the .nunique() method returns just 1 for the webcrawler runs of Jan 14th to 15th. 

### In other words, there were no valid city names data. To elaborate, the listings scraped that day have valid data in all other columns; only the city names are conspicuously absent.
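
A count of 1 from .nunique() is ambiguous on its own: the SF output that follows also reports 1 for every file. A more direct check is the share of listings per webcrawler date whose city is missing. The sketch below assumes that missing cities appear either as real nulls or as the placeholder string 'Nan', as in the outputs above. The `listings` dataframe is made-up data.

```python
import pandas as pd

# Hypothetical listings: one run with valid city names, one run without
listings = pd.DataFrame({
    'date_of_webcrawler': ['2023-01-06', '2023-01-06', '2023-01-14', '2023-01-14'],
    'cities': ['Oakland', 'Berkeley ', 'Nan', 'Nan'],
})

# Flag real nulls as well as the 'Nan' placeholder string
is_missing = listings['cities'].isna() | listings['cities'].str.strip().str.lower().eq('nan')

# Share of listings with a missing city, per webcrawler date
share_missing = is_missing.groupby(listings['date_of_webcrawler']).mean()

# Dates on which every listing lacks a city name
days_all_missing = share_missing[share_missing == 1.0].index.tolist()
print(days_all_missing)  # ['2023-01-14']
```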

In [16]:
dict_of_dfs_sfc_null_per_col_per_day = {key: val.pipe(groupby_by_day_and_find_day_when_city_nulls_started) for key, val in dict_of_dfs_sfc.items()}
print(f'SF number of unique city names: \n{dict_of_dfs_sfc_null_per_col_per_day}')

SF data nulls: 
{WindowsPath('D:/Coding and Code projects/Python/craigslist_data_proj/CraigslistWebScraper/scraped_data/sfbay/sfc/craigslist_rental_sfbay_sfc_01_03_2022.csv'): date_of_webcrawler
1/3/2022    1
Name: cities, dtype: int64, WindowsPath('D:/Coding and Code projects/Python/craigslist_data_proj/CraigslistWebScraper/scraped_data/sfbay/sfc/craigslist_rental_sfbay_sfc_01_07_2022.csv'): date_of_webcrawler
1/7/2022    1
Name: cities, dtype: int64, WindowsPath('D:/Coding and Code projects/Python/craigslist_data_proj/CraigslistWebScraper/scraped_data/sfbay/sfc/craigslist_rental_sfbay_sfc_01_08_2023_.csv'): date_of_webcrawler
2023-01-08    1
Name: cities, dtype: int64, WindowsPath('D:/Coding and Code projects/Python/craigslist_data_proj/CraigslistWebScraper/scraped_data/sfbay/sfc/craigslist_rental_sfbay_sfc_01_10_2022.csv'): date_of_webcrawler
1/10/2022    1
Name: cities, dtype: int64, WindowsPath('D:/Coding and Code projects/Python/craigslist_data_proj/CraigslistWebScraper/scraped_d

In [17]:
dict_of_dfs_nby_null_per_col_per_day = {key: val.pipe(groupby_by_day_and_find_day_when_city_nulls_started) for key, val in dict_of_dfs_nby.items()}
print(f'North Bay number of unique city names: \n{dict_of_dfs_nby_null_per_col_per_day}')

North Bay number of unique city names: 
{WindowsPath('D:/Coding and Code projects/Python/craigslist_data_proj/CraigslistWebScraper/scraped_data/sfbay/nby/craigslist_rental_sfbay_nby_01_01_2023_.csv'): date_of_webcrawler
2023-01-01    31
Name: cities, dtype: int64, WindowsPath('D:/Coding and Code projects/Python/craigslist_data_proj/CraigslistWebScraper/scraped_data/sfbay/nby/craigslist_rental_sfbay_nby_01_02_2022.csv'): date_of_webcrawler
1/2/2022    31
Name: cities, dtype: int64, WindowsPath('D:/Coding and Code projects/Python/craigslist_data_proj/CraigslistWebScraper/scraped_data/sfbay/nby/craigslist_rental_sfbay_nby_01_07_2023_.csv'): date_of_webcrawler
2023-01-07    44
Name: cities, dtype: int64, WindowsPath('D:/Coding and Code projects/Python/craigslist_data_proj/CraigslistWebScraper/scraped_data/sfbay/nby/craigslist_rental_sfbay_nby_01_09_2022.csv'): date_of_webcrawler
1/9/2022    33
Name: cities, dtype: int64, WindowsPath('D:/Coding and Code projects/Python/craigslist_data_proj/

In [44]:
file_date_slice[0].name

'craigslist_rental_sfbay_sfc_01_03_2022.csv'

### 2b) Next, clean the city names data by matching the list of city names data to the *first* matched city name from each given rental listing URL.  

## NB: do some demos before applying to the whole dictionaries of dfs:

In [90]:
# NB!: import one df to do some demos before applying a similar function to the entire dictionary of dfs (NB: despite the variable name, the file is from the Jan 14, 2023 webcrawler run):

eby_Jan_fifteenth = pd.read_csv(r'D:\Coding and Code projects\Python\craigslist_data_proj\CraigslistWebScraper\scraped_data\sfbay\eby\craigslist_rental_sfbay_eby_01_14_2023_.csv')

eby_Jan_fifteenth

Unnamed: 0,listing_urls,ids,sqft,cities,prices,bedrooms,bathrooms,attr_vars,listing_descrip,date_of_webcrawler,...,attached_garage,detached_garage,carport,off_street_parking,no_parking,EV_charging,air_condition,no_smoking,region,sub_region
0,https://sfbay.craigslist.org/eby/apa/d/pleasan...,7578627707,1305.0,Nan,2856,3,3.0,air conditioning\ncats are OK - purrr\ndogs ar...,call now - show contact info x 29\nor text 29 ...,2023-01-14,...,0,0,1,0,0,0,1,1,sfbay,eby
1,https://sfbay.craigslist.org/eby/apa/d/concord...,7573915526,,Nan,1500,0,1.0,apartment\nno laundry on site\nstreet parking\...,have small studio for one year lease in concor...,2023-01-14,...,0,0,0,0,0,0,0,0,sfbay,eby
2,https://sfbay.craigslist.org/eby/apa/d/oakland...,7574539727,580.0,Nan,2800,2,1.0,dogs are OK - wooof\nhouse\nw/d in unit\nno sm...,our previous tenant just bought their own hous...,2023-01-14,...,0,0,0,0,0,0,0,1,sfbay,eby
3,https://sfbay.craigslist.org/eby/apa/d/santa-c...,7578630159,270.0,Nan,1700,0,1.0,apartment\nlaundry on site\nno smoking\ncarpor...,"1141 miramar way, sunnyvale, ca 94086\n$1700.0...",2023-01-14,...,0,0,1,0,0,0,0,1,sfbay,eby
4,https://sfbay.craigslist.org/eby/apa/d/walnut-...,7578630071,732.0,Nan,2263,1,1.0,cats are OK - purrr\ndogs are OK - wooof\napar...,the retreat is situated in lush natural surrou...,2023-01-14,...,0,0,0,1,0,0,0,1,sfbay,eby
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
129,https://sfbay.craigslist.org/eby/apa/d/hayward...,7578616141,1147.0,Nan,3508,2,2.0,EV charging\nair conditioning\ncats are OK - p...,up to 6 weeks free!\n\n\ncadence\n28850 dixon ...,2023-01-14,...,1,0,0,0,0,1,1,1,sfbay,eby
130,https://sfbay.craigslist.org/eby/apa/d/pleasan...,7573002736,770.0,Nan,2816,2,2.0,air conditioning\ncats are OK - purrr\ndogs ar...,welcome to pleasanton heights apartment homes!...,2023-01-14,...,0,0,1,0,0,0,1,1,sfbay,eby
131,https://sfbay.craigslist.org/eby/apa/d/san-pab...,7578614984,1114.0,Nan,2876,2,2.0,cats are OK - purrr\ndogs are OK - wooof\napar...,call now - show contact info x 66\nor text 66 ...,2023-01-14,...,0,0,1,0,0,0,0,1,sfbay,eby
132,https://sfbay.craigslist.org/eby/apa/d/san-pab...,7578611839,,Nan,2800,3,1.0,house\nw/d hookups\nattached garage\nrent peri...,great 3 bed 1 bath house in richmond with gara...,2023-01-14,...,1,0,0,0,0,0,0,0,sfbay,eby


In [91]:
eby_Jan_fifteenth['lsting_urls_test'] = eby_Jan_fifteenth['listing_urls'].str.split('/apa/d/').str[1]

eby_Jan_fifteenth['lsting_urls_test']

0      pleasanton-online-rent-payments-spa/7578627707...
1                   concord-small-studio/7573915526.html
2      oakland-house-with-small-private-back/75745397...
3      santa-clara-lovely-studio-apartment-in/7578630...
4      walnut-creek-call-us-today-to-see-how/75786300...
                             ...                        
129    hayward-weeks-free-on-all-bedrooms-all/7578616...
130    pleasanton-call-today-for-tour-of-your/7573002...
131    san-pablo-fitness-center-barbecue/7578614984.html
132    san-pablo-great-bed-bath-house-in/7578611839.html
133       lafayette-one-bedroom-one-bath/7578614236.html
Name: lsting_urls_test, Length: 134, dtype: object
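
To make the split step concrete, here is a minimal sketch on two made-up URLs. The sample URLs and the names `sample` and `slug` are illustrative assumptions, not rows from the scraped data. It shows that the text after the fixed `/apa/d/` segment begins with the dash-delimited city name.

```python
import pandas as pd

# Hypothetical listing URLs in the same shape as the scraped ones
sample = pd.Series([
    "https://sfbay.craigslist.org/eby/apa/d/concord-small-studio/1111111111.html",
    "https://sfbay.craigslist.org/pen/apa/d/san-mateo-great-location/2222222222.html",
])

# Split on the fixed '/apa/d/' segment and keep the part after it
slug = sample.str.split("/apa/d/").str[1]
print(slug.tolist())
```

A URL that does not contain `/apa/d/` would split into a single piece, so `.str[1]` would return a missing value for that row.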

In [107]:
sfbay_city_names_lower_case = [el.lower() for el in sfbay_city_names]

# pipe operator for Boolean "or" lookups
pipe_operator = '|'

# count the rows whose parsed URL contains any city name from the sfbay_city_names list
eby_Jan_fifteenth['lsting_urls_test'].str.lower().str.contains(
    pipe_operator.join(sfbay_city_names_lower_case)  # 'city1|city2|...' regex alternation
    ).sum()

131

In [113]:
# pipe operator for Boolean "or" lookups
pipe_operator = '|'

# specify a regex pattern for the str.extract() method--NB: str.extract() requires a capture group, so wrap the alternation in parentheses, ie: '(city1|city2|...)'
sfbay_city_names_lower_case_for_lookup = '(' + pipe_operator.join(sfbay_city_names_lower_case) + ')'

# fill the cities column with the first matching city name parsed from each listing URL
eby_Jan_fifteenth['cities'] = eby_Jan_fifteenth['lsting_urls_test'].str.lower().str.extract(sfbay_city_names_lower_case_for_lookup, expand=False)


eby_Jan_fifteenth['cities']

0        pleasanton
1           concord
2           oakland
3       santa-clara
4      walnut-creek
           ...     
129         hayward
130      pleasanton
131       san-pablo
132       san-pablo
133       lafayette
Name: cities, Length: 134, dtype: object
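
The pattern-building step can be sketched on a small scale as follows. The city list and slugs here are made up for illustration, and sorting the names by length plus calling `re.escape()` are my own assumptions rather than steps taken in this notebook. The sort matters because a regex alternation tries its alternatives left to right, so a shorter name listed first (`los-altos`) would win over a longer name that starts with it (`los-altos-hills`).

```python
import re
import pandas as pd

# Hypothetical dash-delimited city names (stand-in for sfbay_city_names)
city_names = ["San-Pablo", "Richmond", "Los-Altos", "Los-Altos-Hills"]

# Lower-case and escape each name, then put longer names first so that
# 'los-altos-hills' is tried before its prefix 'los-altos'
names = sorted((re.escape(n.lower()) for n in city_names), key=len, reverse=True)
pattern = "(" + "|".join(names) + ")"  # one capture group, as str.extract() requires

slugs = pd.Series([
    "san-pablo-great-bed-bath-house-in-richmond/1.html",
    "los-altos-hills-quiet-cottage/2.html",
    "no-city-here/3.html",
])

# str.extract() returns the first (leftmost) match per row, or a missing value
cities = slugs.str.extract(pattern, expand=False)
print(cities.tolist())
```

The first slug mentions two cities; only the leftmost one is kept, which is the "first matched city name" rule described at the top of this notebook.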

In [None]:
import numpy as np

# pipe operator
pipe_operator = '|'

# regex alternation of all city names, and the lower-cased URL text to search
city_names_pattern = pipe_operator.join(sfbay_city_names_lower_case)
urls_lower_case = eby_Jan_fifteenth['lsting_urls_test'].str.lower()

eby_Jan_fifteenth['cities'] = np.where(
    urls_lower_case.str.contains(city_names_pattern, na=False),  # does the URL contain any city name from the list?
    urls_lower_case.str.extract('(' + city_names_pattern + ')', expand=False),  # if so, use the first matching city name
    eby_Jan_fifteenth['cities']  # if no match, then keep original row value
    )

eby_Jan_fifteenth['cities']

In [85]:
""" Finally, parse the city names data for all of the data since mid-Jan 2023,
which are currently missing city names data"""

def parse_city_names_from_listing_URL(df: pd.DataFrame, unique_city_names_dash_delim: list):

  """ 1) Use str.contains() method chained to a .join() method in which we perform an 'OR' boolean via
  the pipe (ie, '|' operator--ie, so we can search for multiple substrings (ie, each element 
  from the list arg) to look up any matching instances of city names
  from the unique_city_names... list 
  relative to the rental listing URLs (ie, listing_urls).

  2) Then, parse each such first city name by taking the first matched city name only.
  
  3) Use these parsed city name values to **replace** the values for the 'cities' column!"""


  # step 1: use str.split() on '/apa/d/' and get the 2nd element after performing the split and converting data to str:
  df['listing_urls_for_str_match'] = df['listing_urls'].str.split('/apa/d/').str[1]  # obtain the 2nd resulting element

  # step 2: find matching city name from the list of all SF Bay Area + SC county names

  ## 2a) First!!: convert all string elements to lower-case for sake of consistency
  # apply str.lower() to set all elements in col to lower case
  df['listing_urls_for_str_match'] = df['listing_urls_for_str_match'].str.lower()

  ## Next, do the same--use .lower() in a list comprehension--for the list of SF Bay + SC county names:
  unique_city_names_dash_delim  = [el.lower() for el in unique_city_names_dash_delim]


  print('Listing URLS for str match col--before adding pipe operators:\n')
  print(df['listing_urls_for_str_match'])

  # step 3: match a substring from this newly-parsed column-- ie, 'listing_urls_for_str_match'
  # -- to matching substrings from the sfbay_city_names list:
  # How?: join the city names with pipe operators to build a regex "OR" pattern, then use str.extract() to pull out the first matching city name

  # pipe operator
  pipe_operator = '|'

  # specify a regex pattern for the str.extract() method--NB: str.extract() requires a capture group, so wrap the alternation in parentheses, ie: '(city1|city2|...)'
  unique_city_names_dash_delim_pattern = '(' + pipe_operator.join(unique_city_names_dash_delim) + ')'

  # replace the cities values with the first matching city name found in the parsed URL column
  df['cities'] = df['listing_urls_for_str_match'].str.extract(unique_city_names_dash_delim_pattern, expand=False)

  
  return df
  
  # return df['cities]







# apply the function to a single df

# East Bay (eby)
eby_Jan_fifteenth = parse_city_names_from_listing_URL(eby_Jan_fifteenth, sfbay_city_names)
eby_Jan_fifteenth


Listing URLS for str match col--before adding pipe operators:

0      pleasanton-online-rent-payments-spa/7578627707...
1                   concord-small-studio/7573915526.html
2      oakland-house-with-small-private-back/75745397...
3      santa-clara-lovely-studio-apartment-in/7578630...
4      walnut-creek-call-us-today-to-see-how/75786300...
                             ...                        
129    hayward-weeks-free-on-all-bedrooms-all/7578616...
130    pleasanton-call-today-for-tour-of-your/7573002...
131    san-pablo-fitness-center-barbecue/7578614984.html
132    san-pablo-great-bed-bath-house-in/7578611839.html
133       lafayette-one-bedroom-one-bath/7578614236.html
Name: listing_urls_for_str_match, Length: 134, dtype: object


In [116]:
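# get a copy just for sake of demos--NB: plain assignment would only create a second name for the same dictionary, so copy each df explicitly
dict_of_dfs_pen2 = {key: df.copy() for key, df in dict_of_dfs_pen.items()}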

In [119]:
for key, df in dict_of_dfs_pen2.items():
    print(df['listing_urls'].str.split('/apa/d/').str[1])

0      san-mateo-great-location-near-whole/7427823581...
1        san-mateo-join-our-waitlist-pet/7427822130.html
2      san-mateo-find-your-new-apt-home-at-bay/742781...
3      palo-alto-bed-2bath-1500sqft-c-car/7427818544....
4      los-altos-all-inclusive-amenities-club/7427810...
                             ...                        
277    san-mateo-relaxing-atmosphere-in-our/742512776...
278    daly-city-entire-house-for-rent-3bd-2ba/742511...
279    belmont-2bd-25ba-townhouse-huge-master/7425121...
280    palo-alto-updated-one-bedroom-apartment/742509...
281    sunnyvale-california-modern-eichler/7422756779...
Name: listing_urls, Length: 282, dtype: object
0       palo-alto-bike-storage-entertaining/7575381763...
1       mountain-view-mt-view-top-floor-1br-1ba/757538...
2         mountain-view-nest-programmable/7575381506.html
3       san-mateo-quality-f-apt-view-greenbelt/7575380...
4       palo-alto-two-bedroom-two-bathroom/7575380415....
                              ...   

## NB: The following function does not apply the data cleaning properly to the dictionary of dfs. Instead, the cities col remains unchanged. The likely cause is that it runs str.extract() on the cities col itself, which holds the missing values, rather than on the parsed listing_urls_for_str_match col.

## Check the following Stack Overflow question for some possible ideas on how to apply a function to each df in a dictionary of dataframes: "iterate over a dictionary of dataframes and apply function to each dataframe" <https://stackoverflow.com/questions/73625468/iterate-over-a-dictionary-of-dataframes-and-apply-function-to-each-dataframe>

## Also see: <https://stackoverflow.com/questions/55388844/loop-through-a-dictionary-of-dataframes>
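
As a small-scale illustration of looping over a dictionary of dataframes, here is a hedged sketch. The string keys, the `slug` column, and the two-city pattern are hypothetical stand-ins for the real `dict_of_dfs_*` objects. It shows two things: assigning to a column inside the loop changes the df stored in the dictionary, and an alias of the dictionary changes with it, whereas a per-df `.copy()` does not.

```python
import pandas as pd

# Hypothetical stand-in for one of the dict_of_dfs_* objects (the real keys are Paths)
dict_of_dfs = {
    "week_1.csv": pd.DataFrame({"slug": ["oakland-house/1.html"], "cities": [None]}),
    "week_2.csv": pd.DataFrame({"slug": ["hayward-condo/2.html"], "cities": [None]}),
}
pattern = "(oakland|hayward)"

# An alias points at the same dictionary; copying each df gives independent data
alias = dict_of_dfs
backup = {key: df.copy() for key, df in dict_of_dfs.items()}

for key, df in dict_of_dfs.items():
    # Extract from the column that holds the text (slug), not from the empty 'cities' column
    df["cities"] = df["slug"].str.extract(pattern, expand=False)

print(alias["week_1.csv"]["cities"].tolist())   # changed along with the original
print(backup["week_1.csv"]["cities"].tolist())  # still holds the missing value
```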

In [131]:
""" Finally, parse the city names data for all of the data since mid-Jan 2023,
which are currently missing city names data"""

def parse_city_names_from_listing_URL(dictionary_of_dfs: dict, unique_city_names_dash_delim:list):

  """ 1) Use str.contains() method chained to a .join() method in which we perform an 'OR' boolean via
  the pipe (ie, '|' operator--ie, so we can search for multiple substrings (ie, each element 
  from the list arg) to look up any matching instances of city names
  from the unique_city_names... list 
  relative to the rental listing URLs (ie, listing_urls).

  2) Then, parse each such first city name by taking the first matched city name only,
  
  3) Use these parsed city name values to **replace** the values for the 'cities' column!"""

  for key, df in dictionary_of_dfs.items():   

    # step 1: use str.split() on '/apa/d' and get the 2nd element after performing the split:
    df['listing_urls_for_str_match'] = df['listing_urls'].str.split('/apa/d/').str[1]  # obtain the 2nd resulting element

    ## 2a) First!!: convert all string elements in col to lower-case for sake of consistency
    df['listing_urls_for_str_match'] = df['listing_urls_for_str_match'].str.lower()  # apply lowercase to all characters of each row's string vals 

    ## Next, do same for the list of SF Bay + SC county names:
    unique_city_names_dash_delim  = [el.lower() for el in unique_city_names_dash_delim]


    # step 3: match a substring from this newly-parsed column-- ie, 'listing_urls_for_str_match'
    # -- to matching substrings from the  sfbay_city_names list:
    # How?: use str.contains() and join pipe operators to each element of the list to perform an essentially  boolean "OR" str.contains() search for any matching city names

    # pipe operator
    pipe_operator = '|'

    # specify a regex pattern for a str.extract() method--NB: we need to wrap the pattern within a sort of tuple by using parentheses in strings--ie, '( )', so like the following format: '( regex_pattern...)'
    unique_city_names_dash_delim_pattern = '(' + pipe_operator.join(unique_city_names_dash_delim)+')'  # wrap the city names regex pattern within a 'string' tuple: ie, '(...)'

    # replace names with matching city names from regex pattern (ie, derived from list of names) using str.extract()
    df['cities'] = df['cities'].str.extract(unique_city_names_dash_delim_pattern, expand=False)

    # return df['cities']

  return dictionary_of_dfs



# apply function to each dictionary of dfs
# NB: the function itself loops over *each* df within the given dictionary:

# Peninsula
dict_of_dfs_pen2 = parse_city_names_from_listing_URL(dict_of_dfs_pen, sfbay_city_names)
dict_of_dfs_pen2

{WindowsPath('D:/Coding and Code projects/Python/craigslist_data_proj/CraigslistWebScraper/scraped_data/sfbay/pen/craigslist_rental_sfbay_pen_01_02_2022.csv'):                                           listing_urls         ids    sqft  \
 0    https://sfbay.craigslist.org/pen/apa/d/san-mat...  7427823581  1051.0   
 1    https://sfbay.craigslist.org/pen/apa/d/san-mat...  7427822130  1039.0   
 2    https://sfbay.craigslist.org/pen/apa/d/san-mat...  7427819892  1104.0   
 3    https://sfbay.craigslist.org/pen/apa/d/palo-al...  7427818544  1500.0   
 4    https://sfbay.craigslist.org/pen/apa/d/los-alt...  7427810453   735.0   
 ..                                                 ...         ...     ...   
 277  https://sfbay.craigslist.org/pen/apa/d/san-mat...  7425127763   850.0   
 278  https://sfbay.craigslist.org/pen/apa/d/daly-ci...  7425116724  1600.0   
 279  https://sfbay.craigslist.org/pen/apa/d/belmont...  7425121171  1700.0   
 280  https://sfbay.craigslist.org/pen/apa/d/palo-a

In [154]:
""" Finally, parse the city names data for all of the data since mid-Jan 2023,
which are currently missing city names data"""

def parse_city_names_from_listing_URL(dictionary_of_dfs: dict, unique_city_names_dash_delim:list):

  """ 1) Use str.contains() method chained to a .join() method in which we perform an 'OR' boolean via
  the pipe (ie, '|' operator--ie, so we can search for multiple substrings (ie, each element 
  from the list arg) to look up any matching instances of city names
  from the unique_city_names... list 
  relative to the rental listing URLs (ie, listing_urls).

  2) Then, parse each such first city name by taking the first matched city name only,
  
  3) Use these parsed city name values to **replace** the values for the 'cities' column!"""
  ## apply lower-case for the list of SF Bay + SC county names:
  unique_city_names_dash_delim  = [el.lower() for el in unique_city_names_dash_delim]

  for key, df in dictionary_of_dfs.items():   

    # step 1: use str.split() on '/apa/d' and get the 2nd element after performing the split:
    df['listing_urls_for_str_match'] = df['listing_urls'].str.split('/apa/d/').str[1]  # obtain the 2nd resulting element

    ## 2a) First!!: convert all string elements in col to lower-case for sake of consistency, ie w/ respect to list of city names
    df['listing_urls_for_str_match'] = df['listing_urls_for_str_match'].str.lower()  # apply lowercase to all characters of each row's string vals 


    # step 3: match a substring from this newly-parsed column-- ie, 'listing_urls_for_str_match'
    # -- to matching substrings from the sfbay_city_names list:
    # How?: join the city names with pipe operators to build a regex "OR" pattern, then use str.extract() to pull out the first matching city name

    # pipe operator
    pipe_operator = '|'

    # specify a regex pattern for the str.extract() method--NB: str.extract() requires a capture group, so wrap the alternation in parentheses, ie: '(city1|city2|...)'
    unique_city_names_dash_delim_pattern = '(' + pipe_operator.join(unique_city_names_dash_delim) + ')'

    # replace cities with matching city names wrt listing_urls_for_str_match col from regex pattern (ie, derived from list of names), using str.extract()
    df['cities'] = df['listing_urls_for_str_match'].str.extract(unique_city_names_dash_delim_pattern, expand=False)

    print(df['cities'])
    
    dictionary_of_dfs[key] = df


  return dictionary_of_dfs




# apply function to each dictionary of dfs
# NB: the function itself loops over *each* df within the given dictionary:

# Peninsula
dict_of_dfs_pen = parse_city_names_from_listing_URL(dict_of_dfs_pen, sfbay_city_names)

0      san-mateo
1      san-mateo
2      san-mateo
3      palo-alto
4      los-altos
         ...    
277    san-mateo
278    daly-city
279      belmont
280    palo-alto
281    sunnyvale
Name: cities, Length: 282, dtype: object
0           palo-alto
1       mountain-view
2       mountain-view
3           san-mateo
4           palo-alto
            ...      
1309        daly-city
1310    mountain-view
1311        los-altos
1312        san-mateo
1313    mountain-view
Name: cities, Length: 1314, dtype: object
0           atherton
1       redwood-city
2      mountain-view
3          san-bruno
4          san-mateo
           ...      
378        palo-alto
379        san-bruno
380     redwood-city
381        san-bruno
382        los-altos
Name: cities, Length: 383, dtype: object
0          daly-city
1          san-mateo
2      mountain-view
3            belmont
4          san-mateo
           ...      
391          belmont
392              NaN
393         millbrae
394              NaN
395   

{WindowsPath('D:/Coding and Code projects/Python/craigslist_data_proj/CraigslistWebScraper/scraped_data/sfbay/pen/craigslist_rental_sfbay_pen_01_02_2022.csv'):                                           listing_urls         ids    sqft  \
 0    https://sfbay.craigslist.org/pen/apa/d/san-mat...  7427823581  1051.0   
 1    https://sfbay.craigslist.org/pen/apa/d/san-mat...  7427822130  1039.0   
 2    https://sfbay.craigslist.org/pen/apa/d/san-mat...  7427819892  1104.0   
 3    https://sfbay.craigslist.org/pen/apa/d/palo-al...  7427818544  1500.0   
 4    https://sfbay.craigslist.org/pen/apa/d/los-alt...  7427810453   735.0   
 ..                                                 ...         ...     ...   
 277  https://sfbay.craigslist.org/pen/apa/d/san-mat...  7425127763   850.0   
 278  https://sfbay.craigslist.org/pen/apa/d/daly-ci...  7425116724  1600.0   
 279  https://sfbay.craigslist.org/pen/apa/d/belmont...  7425121171  1700.0   
 280  https://sfbay.craigslist.org/pen/apa/d/palo-a

In [159]:
# eby
dict_of_dfs_eby  = parse_city_names_from_listing_URL(dict_of_dfs_eby, sfbay_city_names)


0       brentwood
1             NaN
2         hayward
3       pittsburg
4          albany
          ...    
269     lafayette
270     san-pablo
271    union-city
272       vallejo
273           NaN
Name: cities, Length: 274, dtype: object
0       pleasant-hill
1             oakland
2          pleasanton
3             oakland
4             alameda
            ...      
1942        vacaville
1943    pleasant-hill
1944           dublin
1945     walnut-creek
1946          oakland
Name: cities, Length: 1947, dtype: object
0     oakland
1     oakland
2    richmond
Name: cities, dtype: object
0        martinez
1        martinez
2         oakland
3       lafayette
4         alameda
          ...    
411       alameda
412       oakland
413       alameda
414    emeryville
415    emeryville
Name: cities, Length: 416, dtype: object
0        pleasanton
1           concord
2           oakland
3       santa-clara
4      walnut-creek
           ...     
129         hayward
130      pleasanton
131     

In [156]:
# nby
dict_of_dfs_nby  = parse_city_names_from_listing_URL(dict_of_dfs_nby, sfbay_city_names)


# sby

dict_of_dfs_sby  = parse_city_names_from_listing_URL(dict_of_dfs_sby, sfbay_city_names)


# sfc
dict_of_dfs_sfc  = parse_city_names_from_listing_URL(dict_of_dfs_sfc, sfbay_city_names)



# scz
dict_of_dfs_scz  = parse_city_names_from_listing_URL(dict_of_dfs_scz, sfbay_city_names)


NameError: name 'dict_of_dfs_nby' is not defined

## Now that we've parsed the correct city names, remove the dashes from the city names that have them.

### Then, use capitalize() to apply proper noun capitalization to each city name
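
The two string steps can be sketched as below. The `return_capitliztion_for_col_for_dict_of_dfs` helper is not shown in this part of the notebook, so this is not a copy of it; the sample Series is made up. The sketch compares `str.capitalize()` with `str.title()`, because only the latter capitalizes every word of a multi-word city name.

```python
import pandas as pd

# Hypothetical parsed city names, including one missing value
cities = pd.Series(["san-mateo", "palo-alto", "belmont", None])

# Replace the dash delimiters with spaces: 'san-mateo' -> 'san mateo'
spaced = cities.str.replace("-", " ", regex=False)

capitalized = spaced.str.capitalize()  # upper-cases only the first letter of the string
titled = spaced.str.title()            # upper-cases the first letter of every word

print(capitalized.tolist())
print(titled.tolist())
```

Missing values pass through both steps unchanged, so rows with no matched city stay missing.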

In [161]:
def remove_dash_chars_from_col_for_dict_of_dfs(dictionary_of_dfs, col):
    """Replace the '-' delimiters in the given col with spaces, for each df in the dictionary"""
    for key, df in dictionary_of_dfs.items():
        df[col] = df[col].str.replace('-', ' ')
        print(df[col])

        dictionary_of_dfs[key] = df

    return dictionary_of_dfs


# peninsula data

dict_of_dfs_pen = remove_dash_chars_from_col_for_dict_of_dfs(dict_of_dfs_pen, 'cities')

dict_of_dfs_pen = return_capitliztion_for_col_for_dict_of_dfs(dict_of_dfs_pen, 'cities')

# eby
dict_of_dfs_eby = remove_dash_chars_from_col_for_dict_of_dfs(dict_of_dfs_eby, 'cities')

dict_of_dfs_eby = return_capitliztion_for_col_for_dict_of_dfs(dict_of_dfs_eby, 'cities')


# sby
dict_of_dfs_sby = remove_dash_chars_from_col_for_dict_of_dfs(dict_of_dfs_sby, 'cities')

dict_of_dfs_sby = return_capitliztion_for_col_for_dict_of_dfs(dict_of_dfs_sby, 'cities')

#sf
dict_of_dfs_sfc = remove_dash_chars_from_col_for_dict_of_dfs(dict_of_dfs_sfc, 'cities')

dict_of_dfs_sfc = return_capitliztion_for_col_for_dict_of_dfs(dict_of_dfs_sfc, 'cities')

# nby
dict_of_dfs_nby = remove_dash_chars_from_col_for_dict_of_dfs(dict_of_dfs_nby, 'cities')

dict_of_dfs_nby = return_capitliztion_for_col_for_dict_of_dfs(dict_of_dfs_nby, 'cities')


# scz
dict_of_dfs_scz = remove_dash_chars_from_col_for_dict_of_dfs(dict_of_dfs_scz, 'cities')

dict_of_dfs_scz = return_capitliztion_for_col_for_dict_of_dfs(dict_of_dfs_scz, 'cities')


0       sanmateo
1       sanmateo
2       sanmateo
3       paloalto
4       losaltos
         ...    
277     sanmateo
278     dalycity
279      belmont
280     paloalto
281    sunnyvale
Name: cities, Length: 282, dtype: object
0           paloalto
1       mountainview
2       mountainview
3           sanmateo
4           paloalto
            ...     
1309        dalycity
1310    mountainview
1311        losaltos
1312        sanmateo
1313    mountainview
Name: cities, Length: 1314, dtype: object
0          atherton
1       redwoodcity
2      mountainview
3          sanbruno
4          sanmateo
           ...     
378        paloalto
379        sanbruno
380     redwoodcity
381        sanbruno
382        losaltos
Name: cities, Length: 383, dtype: object
0          dalycity
1          sanmateo
2      mountainview
3           belmont
4          sanmateo
           ...     
391         belmont
392             NaN
393        millbrae
394             NaN
395        sanbruno
Name: cities, Leng

### Finally, export all dfs containing the cleaned city names--ie, since Jan 2023.

### Loop over each df from each dictionary of dfs, and use .to_csv() to re-export them and replace the old CSV files!!


### "Exporting Pandas output for multiple CSV files": https://stackoverflow.com/questions/67959271/exporting-pandas-output-for-multiple-csv-files
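
Before overwriting anything, the export loop can be tried out in a scratch folder. The sketch below assumes a dictionary keyed by `pathlib.Path` objects, as in this notebook. The paths, the data, and the temporary output folder are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp)

    # Hypothetical dict of dfs keyed by the path each CSV was read from
    dict_of_dfs = {
        Path("scraped_data/pen/week_1.csv"): pd.DataFrame({"cities": ["San Mateo"]}),
        Path("scraped_data/pen/week_2.csv"): pd.DataFrame({"cities": ["Belmont"]}),
    }

    for path, df in dict_of_dfs.items():
        # path.name is the bare file name; write to the scratch folder first
        df.to_csv(out_dir / path.name, index=False)

    # Check what was written, and that one file reads back as expected
    written = sorted(p.name for p in out_dir.iterdir())
    round_trip = pd.read_csv(out_dir / "week_1.csv")

print(written)
```

Once the scratch output looks right, passing the full `path` to `to_csv()` instead of `out_dir / path.name` would write over the original files.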


In [None]:
"""Warning!!!: This is pseudo-code. I need to revise with the new dfs, and test this on a smaller scale first!!!

NB: see the following Stack Overflow question on how to do this--*assuming* I've managed to import each df as separate subregion & week data:
"How to export data frame back to the original csv file that I imported it from?" 
<https://stackoverflow.com/questions/69159238/how-to-export-data-frame-back-to-the-original-csv-file-that-i-imported-it-from>"""

# # loop over all dfs containing data since Jan 2023 
# for csv_path, df in dict_of_dfs.items():
#     # replace each old CSV file with the cleaned city names
#     df.to_csv(csv_path, index=False)

In [64]:
next(iter(dict_of_dfs)).name

'craigslist_rental_sfbay_sfc_01_03_2022.csv'

In [163]:
for key in dict_of_dfs_pen:
    # Iterate over each key (ie, path & file), and use pathlib's .name attribute to get the original CSV file name
    print(key.name)


craigslist_rental_sfbay_pen_01_02_2022.csv
craigslist_rental_sfbay_pen_01_06_2023_.csv
craigslist_rental_sfbay_pen_01_08_2022.csv
craigslist_rental_sfbay_pen_01_10_2022.csv
craigslist_rental_sfbay_pen_01_15_2023_.csv
craigslist_rental_sfbay_pen_01_18_2022.csv
craigslist_rental_sfbay_pen_01_18_2023_.csv
craigslist_rental_sfbay_pen_01_19_2023_.csv
craigslist_rental_sfbay_pen_01_26_2023_.csv
craigslist_rental_sfbay_pen_01_27_2022.csv
craigslist_rental_sfbay_pen_02_02_2023_.csv
craigslist_rental_sfbay_pen_02_04_2022.csv
craigslist_rental_sfbay_pen_02_10_2023_.csv
craigslist_rental_sfbay_pen_02_11_2022.csv


In [79]:
# test case by writing files to local folder, before writing over to the original scraped_data folders
for key, df in dict_of_dfs_pen.items():
    # Iterate over each key (ie, path & file) & dataframe, and use pathlib's .name attribute to refer back to the original CSV file name
    df.to_csv(
        key.name, # NB!: a bare file name, so each CSV is written to the current working directory
        index=False
        )  

In [71]:
import os

## **NB!: We can access the full CSV file name from each df within each dictionary of dataframes
## ** How?: By using the .name attribute of each pathlib path

## NB2: However, the file names are stored as the dictionary keys:
# ie: we need to iterate over the dictionary and take the .name attribute of each key to get the file names
# then, using the .name of each dictionary key, we can save over the original respective CSV files



# export cleaned Peninsula data
# specify path:
path_pen = fr'D:\Coding and Code projects\Python\craigslist_data_proj\CraigslistWebScraper\scraped_data\sfbay\{subregion_pen}'


# for key in dict_of_dfs_pen:
#     print(os.path.join(path_pen, key.name))

# export cleaned Peninsula data by iterating over each df from the dict of dfs:
for key, df in dict_of_dfs_pen.items():
    # Iterate over each key (ie, the CSV file path) & dataframe, and use pathlib's .name attribute to refer back to the original CSV file name
    # replace all old CSV files with the cleaned city names
    df.to_csv( 
        os.path.join(path_pen, key.name), # 1) use os.path.join() to join the CSV filename to the relevant path &  2) NB!: use pathlib's .name attribute to refer back to the original CSV name corresponding to each df 
        index=False  # each df index is meaningless here
        )  


# next(iter(dict_of_dfs)).name

In [81]:
for key in dict_of_dfs_pen:
    print(os.path.join(path_pen, key.name))

D:\Coding and Code projects\Python\craigslist_data_proj\CraigslistWebScraper\scraped_data\sfbay\pen\craigslist_rental_sfbay_sfc_01_03_2022.csv
D:\Coding and Code projects\Python\craigslist_data_proj\CraigslistWebScraper\scraped_data\sfbay\pen\craigslist_rental_sfbay_sfc_01_07_2022.csv
D:\Coding and Code projects\Python\craigslist_data_proj\CraigslistWebScraper\scraped_data\sfbay\pen\craigslist_rental_sfbay_sfc_01_08_2023_.csv
D:\Coding and Code projects\Python\craigslist_data_proj\CraigslistWebScraper\scraped_data\sfbay\pen\craigslist_rental_sfbay_sfc_01_10_2022.csv
D:\Coding and Code projects\Python\craigslist_data_proj\CraigslistWebScraper\scraped_data\sfbay\pen\craigslist_rental_sfbay_sfc_01_17_2022.csv
D:\Coding and Code projects\Python\craigslist_data_proj\CraigslistWebScraper\scraped_data\sfbay\pen\craigslist_rental_sfbay_sfc_01_18_2023_.csv
D:\Coding and Code projects\Python\craigslist_data_proj\CraigslistWebScraper\scraped_data\sfbay\pen\craigslist_rental_sfbay_sfc_01_26_2022.c

In [68]:
for key in dict_of_dfs_pen:
    print(key.name)

craigslist_rental_sfbay_sfc_01_03_2022.csv
craigslist_rental_sfbay_sfc_01_07_2022.csv
craigslist_rental_sfbay_sfc_01_08_2023_.csv
craigslist_rental_sfbay_sfc_01_10_2022.csv
craigslist_rental_sfbay_sfc_01_17_2022.csv
craigslist_rental_sfbay_sfc_01_18_2023_.csv
craigslist_rental_sfbay_sfc_01_26_2022.csv
craigslist_rental_sfbay_sfc_02_01_2023_.csv
craigslist_rental_sfbay_sfc_02_02_2022.csv
craigslist_rental_sfbay_sfc_02_04_2022.csv
craigslist_rental_sfbay_sfc_02_05_2023_.csv
craigslist_rental_sfbay_sfc_02_10_2022.csv
