## Missing city names data:

NB: 

### In this notebook, we want to clean data for several weeks' worth of *sfbay* rental listings in which the webcrawler had not been updated properly to scrape city names. 

### In short, several weeks of rental listings from Jan-Feb 2023 are missing city names (ie, the 'cities' column). To be clear, the problem affects only the CSV files produced by webcrawler runs--ie, based on the date_of_webcrawler, *not* the date_posted per se--from around early January to February 14, 2023, after which the problem was fixed. See this commit to the GitHub repo for details: commit # c60d6d9, dated Feb 15, 2023.

### To do the needed data cleaning, we need to do the following:

### 1) Identify all unique city names for all SF Bay Area counties as well as Santa Cruz county:

### 1a) How? Run a separate webcrawler in which we grab data on all city names for a) the SF Bay Area counties & also b) Santa Cruz county. We can get these data from separate Wikipedia webpage tables.

### 1b) After extracting the city names, add '-' delimiters to match the format of the city names data that we will be parsing from the craigslist rental listing URLs from the listings data that we need to clean.


### 2) Second, we need to import each week of sfbay rental listings that have missing city names data--by subregion--as *separate* dataframes, ie, with each df corresponding to a given week and a given subregion. 

### 2a) To do this, we need to slice the data to only the files from mid January 2023 to mid February 2023.

### 2b) Then, **replace** the missing city names with matching city names--ie, based on the list of unique city names (see step 1)--via a regex str.findall() search for the city names within the rental listing URLs (ie, the listing_urls col).

#### 2bi) More specifically: we will take the list of SF Bay Area city names and return the *first* matched city name from each given rental listing URL. 

#### So if a rental listing URL mentions multiple city names, we will assume the first listed city name is the main city, and use that as the name for the city column for sake of simplicity.

### 2c) Finally, output the cleaned data back to the **original** CSV files and overwrite the original files!
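
Step 2b can be sketched in plain Python as follows. This is only an illustration of the "first matched city name" idea, not the notebook's actual function: the helper names `build_city_pattern` and `first_city_in_url`, the short city list, and the example URL are all made up for the sketch. It assumes the listing URLs contain an `/apa/d/` segment followed by a dash-delimited slug.

```python
import re

def build_city_pattern(city_names):
    """Compile one case-insensitive 'OR' pattern from dash-delimited city names."""
    # Put longer names first, so 'los-altos-hills' is preferred over 'los-altos'
    # when both could match at the same position in the URL.
    ordered = sorted(city_names, key=len, reverse=True)
    return re.compile('|'.join(re.escape(name) for name in ordered), re.IGNORECASE)

def first_city_in_url(listing_url, pattern):
    """Return the first (leftmost) city name found in the URL slug, or None."""
    slug = listing_url.split('/apa/d/')[-1]  # the part of the URL after '/apa/d/'
    match = pattern.search(slug)             # search() stops at the leftmost match
    return match.group(0).lower() if match else None

# Hypothetical inputs, for illustration only
city_pattern = build_city_pattern(['Palo-Alto', 'East-Palo-Alto', 'Los-Altos', 'Los-Altos-Hills'])
example_url = 'https://sfbay.craigslist.org/pen/apa/d/los-altos-hills-cottage-near-palo-alto/1234567890.html'
first_city = first_city_in_url(example_url, city_pattern)
no_city = first_city_in_url('https://sfbay.craigslist.org/pen/apa/d/sunny-studio/1234567890.html', city_pattern)
```

Sorting the names by length before joining them matters because a regex alternation tries its alternatives in order: without it, a URL for Los Altos Hills could be labeled as Los Altos.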


## NB: Let's try a different approach to gather names of all cities & towns in the SF Bay Area:

### Namely, let's access US Census data tables containing lists of all SF Bay Area cities--in addition to a separate page for Santa Cruz county--found on Wikipedia

### We will need to implement a short webcrawler for the purpose of extracting the city names of the SF Bay Area and SC county, then append to a list, and add the desired dash ('-') delimiters:

In [1]:
# 1) Identify a list of all unique SF Bay Area & SC county city names, and output to a list

# 1a) Create a few simple webcrawlers to grab the city names data from 2 wikipedia tables

#web crawling, web scraping & webdriver libraries and modules
from selenium import webdriver  # NB: this is the main module we will use to implement the webcrawler and webscraping. A webdriver is an automated browser.
from webdriver_manager.chrome import ChromeDriverManager # import webdriver_manager package to automatically take care of any needed updates to Chrome webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException, TimeoutException, WebDriverException, ElementClickInterceptedException
from selenium.webdriver.chrome.options import Options  # Options enables us to tell Selenium to open WebDriver browsers using maximized mode, and we can also disable any extensions or infobars

import requests


### SF bay area city names data


# sf bay area city names wiki page:
sfbay_cities_wiki_url = 'https://en.wikipedia.org/wiki/List_of_cities_and_towns_in_the_San_Francisco_Bay_Area'


# access page, and grab city names, append to list

def obtain_cities_from_wiki_sfbay(webpage_url,list_of_cities):
    # initialize web driver
            
    driver = webdriver.Chrome(ChromeDriverManager().install())  # install or update the latest Chrome webdriver using the ChromeDriverManager() library
    
    # access webpage
    driver.get(webpage_url)

    xpaths_table = '//table[@class="wikitable plainrowheaders sortable jquery-tablesorter"]'

    # search for wiki data tables:
    table = driver.find_element(By.XPATH, xpaths_table)


    # iterate over each table row and collect the text of its header (<th>) cells, which hold the city names
    for row in table.find_elements(By.CSS_SELECTOR, 'tr'): # iterate over each row in the table

        city_names =  row.find_elements(By.TAG_NAME, 'th')  # header cells of the row: the city name for a data row, or the column names for a header row
        # city_names =  row.find_elements(By.TAG_NAME, 'td')[0]  # alternative: take ONLY the 1st data cell--ie, the 0th index--of each row

        # NB: city_names[:2] keeps *at most* the first 2 header cells of each row (it does not skip anything);
        # the column names picked up from the table's header rows are sliced off after this function is called
        for city_name in city_names[:2]:

            # append the cell text to the list
            list_of_cities.append(city_name.text)


    # exit webpage 
    driver.close()


    return list_of_cities



# initialize lists:
sfbay_city_names = []


#sfbay data
obtain_cities_from_wiki_sfbay(sfbay_cities_wiki_url, sfbay_city_names)

# remove remaining col names:
sfbay_city_names = sfbay_city_names[4:]

# sanity check
print(f'sfbay city names:{sfbay_city_names}')

print(f'There are {len(sfbay_city_names)} city names\nNB: There should be 101.')




Current google-chrome version is 111.0.5563
Get LATEST driver version for 111.0.5563
Get LATEST driver version for 111.0.5563
Trying to download new driver from https://chromedriver.storage.googleapis.com/111.0.5563.64/chromedriver_win32.zip
Driver has been saved in cache [C:\Users\Kevin Allen\.wdm\drivers\chromedriver\win32\111.0.5563.64]


sfbay city names:['Alameda', 'Albany', 'American Canyon', 'Antioch', 'Atherton', 'Belmont', 'Belvedere', 'Benicia', 'Berkeley', 'Brentwood', 'Brisbane', 'Burlingame', 'Calistoga', 'Campbell', 'Clayton', 'Cloverdale', 'Colma', 'Concord', 'Corte Madera', 'Cotati', 'Cupertino', 'Daly City', 'Danville', 'Dixon', 'Dublin', 'East Palo Alto', 'El Cerrito', 'Emeryville', 'Fairfax', 'Fairfield', 'Foster City', 'Fremont', 'Gilroy', 'Half Moon Bay', 'Hayward', 'Healdsburg', 'Hercules', 'Hillsborough', 'Lafayette', 'Larkspur', 'Livermore', 'Los Altos', 'Los Altos Hills', 'Los Gatos', 'Martinez', 'Menlo Park', 'Mill Valley', 'Millbrae', 'Milpitas', 'Monte Sereno', 'Moraga', 'Morgan Hill', 'Mountain View', 'Napa', 'Newark', 'Novato', 'Oakland', 'Oakley', 'Orinda', 'Pacifica', 'Palo Alto', 'Petaluma', 'Piedmont', 'Pinole', 'Pittsburg', 'Pleasant Hill', 'Pleasanton', 'Portola Valley', 'Redwood City', 'Richmond', 'Rio Vista', 'Rohnert Park', 'Ross', 'St. Helena', 'San Anselmo', 'San Bruno', 'San Carlos
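
The row-header extraction above can also be illustrated without a browser. The following is a minimal sketch using only the standard library's `html.parser`, run on a small made-up HTML snippet rather than the live Wikipedia page. It assumes the city names sit in `<th scope="row">` cells, which is an assumption about the page's markup and not something the notebook confirms. `RowHeaderParser` and `sample_html` are hypothetical names.

```python
from html.parser import HTMLParser

class RowHeaderParser(HTMLParser):
    """Collect the text of every <th scope="row"> cell in the HTML it is fed."""

    def __init__(self):
        super().__init__()
        self.in_row_header = False  # True while we are inside a row-header cell
        self.current = []           # text fragments of the current cell
        self.names = []             # finished cell texts

    def handle_starttag(self, tag, attrs):
        if tag == 'th' and dict(attrs).get('scope') == 'row':
            self.in_row_header = True
            self.current = []

    def handle_data(self, data):
        if self.in_row_header:
            self.current.append(data)

    def handle_endtag(self, tag):
        if tag == 'th' and self.in_row_header:
            self.names.append(''.join(self.current).strip())
            self.in_row_header = False

# Made-up table, shaped like a wikitable with row headers (assumption)
sample_html = """
<table class="wikitable plainrowheaders sortable">
  <tr><th scope="col">Name</th><th scope="col">County</th></tr>
  <tr><th scope="row"><a href="/wiki/Alameda">Alameda</a></th><td>Alameda</td></tr>
  <tr><th scope="row"><a href="/wiki/Daly_City">Daly City</a></th><td>San Mateo</td></tr>
</table>
"""

parser = RowHeaderParser()
parser.feed(sample_html)
city_names = parser.names
```

Because only `scope="row"` cells are collected, the column names never enter the list, so no slicing is needed afterwards.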

In [2]:
# Sc county city names data:

#web crawling, web scraping & webdriver libraries and modules
from selenium import webdriver  # NB: this is the main module we will use to implement the webcrawler and webscraping. A webdriver is an automated browser.
from webdriver_manager.chrome import ChromeDriverManager # import webdriver_manager package to automatically take care of any needed updates to Chrome webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException, TimeoutException, WebDriverException, ElementClickInterceptedException
from selenium.webdriver.chrome.options import Options  # Options enables us to tell Selenium to open WebDriver browsers using maximized mode, and we can also disable any extensions or infobars

import requests



# sc county wiki page url
sc_county_cities_wiki_url = 'https://en.wikipedia.org/wiki/Santa_Cruz_County,_California#Population_ranking'


sc_county_city_names = []


def obtain_cities_from_wiki_sc(webpage_url,list_of_cities):
    # initialize web driver
            
    driver = webdriver.Chrome(ChromeDriverManager().install())  # install or update the latest Chrome webdriver using the ChromeDriverManager() library
    
    # access webpage
    driver.get(webpage_url)


    # NB!: there are 2 tables with the same class name; only select data from the 2nd one
    xpaths_table = '//table[@class="wikitable sortable jquery-tablesorter"][2]//tr//td[2]'  # 2nd table on webpage with this class name


    # search for given wiki data tables:
    table = driver.find_elements(By.XPATH, xpaths_table)


    print(f'Full table:\n\n{table}\n\n\n\n\n')

    for row in table:
        print(f'City names:{row.text}')
        list_of_cities.append(row.text)





    # exit webpage 
    driver.close()

    # # sanity check
    # print(f'List of city names:\n{list_of_cities}')

    return list_of_cities

obtain_cities_from_wiki_sc(sc_county_cities_wiki_url, sc_county_city_names)


# clean the data by removing the extraneous '†' char from the city names list
sc_county_city_names = list(map(lambda x: x.replace('†',''), sc_county_city_names))

# finally, drop any empty or whitespace-only entries from the list-- use list comprehension
sc_county_city_names = [s for s in sc_county_city_names if s.strip()]

# sanity check
print(f'\n\nsc county city names:{sc_county_city_names}')
print(f'There are {len(sc_county_city_names)} city names for SC county.')




Current google-chrome version is 111.0.5563
Get LATEST driver version for 111.0.5563
Driver [C:\Users\Kevin Allen\.wdm\drivers\chromedriver\win32\111.0.5563.64\chromedriver.exe] found in cache


Full table:

[<selenium.webdriver.remote.webelement.WebElement (session="0c5f5e7ed976d55545fac19613e7d827", element="5618137b-c9c4-429a-b71d-96ced1c3568c")>, <selenium.webdriver.remote.webelement.WebElement (session="0c5f5e7ed976d55545fac19613e7d827", element="e8500275-0813-4bd2-a85f-68ce7089c81a")>, <selenium.webdriver.remote.webelement.WebElement (session="0c5f5e7ed976d55545fac19613e7d827", element="bb811dde-74dd-4906-a153-c28b6b3bebd4")>, <selenium.webdriver.remote.webelement.WebElement (session="0c5f5e7ed976d55545fac19613e7d827", element="e47ea230-8bf6-4ac8-8d30-b977cac547a6")>, <selenium.webdriver.remote.webelement.WebElement (session="0c5f5e7ed976d55545fac19613e7d827", element="7c2769ee-e3c0-4c0f-b1ac-6c968ee212a0")>, <selenium.webdriver.remote.webelement.WebElement (session="0c5f5e7ed976d55545fac19613e7d827", element="c3f6834e-a764-4072-86f9-3c2b06cf3c03")>, <selenium.webdriver.remote.webelement.WebElement (session="0c5f5e7ed976d55545fac19613e7d827", element="6ff40cfa-143a-4425-
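
The cleanup applied to the Santa Cruz county names can be sketched on its own. Note that the cell above only drops empty or whitespace-only entries. This variant also trims surrounding whitespace from the names it keeps, which is an extra step and not part of the original code. `clean_city_names` and the sample cell values are hypothetical.

```python
def clean_city_names(raw_names):
    """Remove the dagger marker, trim whitespace, and drop empty entries."""
    cleaned = (name.replace('†', '').strip() for name in raw_names)
    return [name for name in cleaned if name]

# Made-up cell texts, for illustration only
sample_cells = ['Santa Cruz†', ' Watsonville ', '', '   ', 'Capitola']
cleaned_names = clean_city_names(sample_cells)
```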

In [3]:
# combine both the sfbay city names & sc county names lists into one:
def combine_lists(list1, list2):
    list1.extend(list2)  # NB: list.extend() modifies list1 in place and returns None
    return list1

combine_lists(sfbay_city_names, sc_county_city_names)

# sanity check
print(f'sanity check on sfbay & sc county city names data:\n\n{sfbay_city_names}')

print(f'\nThere are {len(sfbay_city_names)} cities')

sanity check on sfbay & sc county city names data:

['Alameda', 'Albany', 'American Canyon', 'Antioch', 'Atherton', 'Belmont', 'Belvedere', 'Benicia', 'Berkeley', 'Brentwood', 'Brisbane', 'Burlingame', 'Calistoga', 'Campbell', 'Clayton', 'Cloverdale', 'Colma', 'Concord', 'Corte Madera', 'Cotati', 'Cupertino', 'Daly City', 'Danville', 'Dixon', 'Dublin', 'East Palo Alto', 'El Cerrito', 'Emeryville', 'Fairfax', 'Fairfield', 'Foster City', 'Fremont', 'Gilroy', 'Half Moon Bay', 'Hayward', 'Healdsburg', 'Hercules', 'Hillsborough', 'Lafayette', 'Larkspur', 'Livermore', 'Los Altos', 'Los Altos Hills', 'Los Gatos', 'Martinez', 'Menlo Park', 'Mill Valley', 'Millbrae', 'Milpitas', 'Monte Sereno', 'Moraga', 'Morgan Hill', 'Mountain View', 'Napa', 'Newark', 'Novato', 'Oakland', 'Oakley', 'Orinda', 'Pacifica', 'Palo Alto', 'Petaluma', 'Piedmont', 'Pinole', 'Pittsburg', 'Pleasant Hill', 'Pleasanton', 'Portola Valley', 'Redwood City', 'Richmond', 'Rio Vista', 'Rohnert Park', 'Ross', 'St. Helena', 'San
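
Since `list.extend()` changes the first list in place, re-running the cell above appends the Santa Cruz names a second time. A minimal sketch of a combine step that leaves its inputs untouched and drops repeated names is shown below. `combine_unique` and the two short sample lists are hypothetical.

```python
def combine_unique(*name_lists):
    """Concatenate lists into a new list, dropping repeats but keeping first-seen order."""
    combined = [name for names in name_lists for name in names]
    return list(dict.fromkeys(combined))  # dict keys are unique and keep insertion order

# Made-up sample lists with one overlapping name
first_list = ['Alameda', 'Gilroy', 'Capitola']
second_list = ['Capitola', 'Watsonville']
all_city_names = combine_unique(first_list, second_list)
```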

In [4]:
# 1b) Add dash ('-') delimiters to the list of city names data


""" Next, add dash delimiters *in between* each word (ie, in place of the whitespace in between each word of each city name) for each element (read: city name) 
from the sfbay_city_names list.  

Why add a dash delimiter in b/w each word of each city name?:
Because each rental listing's URL (ie, the listing_urls col) contains--as of craigslist's server changes in Jan 2023--
the listing's city name. 
***But!: The city names in the URL are ***always*** listed with a dash delimiter in between each word!"""

def add_dash_delimiter_in_bw_each_word_of_city_names(city_names:list):
    return [word.replace(' ', '-') for word in city_names]  # use str.replace() method to replace whitespaces with dashes

sfbay_city_names = add_dash_delimiter_in_bw_each_word_of_city_names(sfbay_city_names)
 
# sanity check
sfbay_city_names

['Alameda',
 'Albany',
 'American-Canyon',
 'Antioch',
 'Atherton',
 'Belmont',
 'Belvedere',
 'Benicia',
 'Berkeley',
 'Brentwood',
 'Brisbane',
 'Burlingame',
 'Calistoga',
 'Campbell',
 'Clayton',
 'Cloverdale',
 'Colma',
 'Concord',
 'Corte-Madera',
 'Cotati',
 'Cupertino',
 'Daly-City',
 'Danville',
 'Dixon',
 'Dublin',
 'East-Palo-Alto',
 'El-Cerrito',
 'Emeryville',
 'Fairfax',
 'Fairfield',
 'Foster-City',
 'Fremont',
 'Gilroy',
 'Half-Moon-Bay',
 'Hayward',
 'Healdsburg',
 'Hercules',
 'Hillsborough',
 'Lafayette',
 'Larkspur',
 'Livermore',
 'Los-Altos',
 'Los-Altos-Hills',
 'Los-Gatos',
 'Martinez',
 'Menlo-Park',
 'Mill-Valley',
 'Millbrae',
 'Milpitas',
 'Monte-Sereno',
 'Moraga',
 'Morgan-Hill',
 'Mountain-View',
 'Napa',
 'Newark',
 'Novato',
 'Oakland',
 'Oakley',
 'Orinda',
 'Pacifica',
 'Palo-Alto',
 'Petaluma',
 'Piedmont',
 'Pinole',
 'Pittsburg',
 'Pleasant-Hill',
 'Pleasanton',
 'Portola-Valley',
 'Redwood-City',
 'Richmond',
 'Rio-Vista',
 'Rohnert-Park',
 'Ross'

### 2) Next, import the data from mid-January to Feb 14, 2023: ie, the data in which the city names are missing!

### NB: We will need to use separate collections of dataframes for **each** subregion, given the path structure of the scraped data derived from the webcrawler. 

### NB: To import **only** the files with the right dates, we need to either apply a separate *datetime filter* or use the glob library to match the CSV file names against a pattern.

### See the following Stack Overflow question on how to import CSV files from only a specified date range:

### "Read multiple csv files stored by date from start date to end date into a pandas dataframe" <https://stackoverflow.com/questions/61236241/read-multiple-csv-files-stored-by-date-from-start-date-to-end-date-into-a-pandas>
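
One way to apply a datetime filter to the file names is to parse the `MM_DD_YYYY` stamp into a real date before comparing. Comparing the stamps as plain strings would place `01_18_2022` between `01_14_2023` and `02_14_2023`, because the month and day come first. The sketch below uses only the standard library. `file_date` and `files_in_date_range` are hypothetical helpers, and the file names follow the webcrawler's naming convention.

```python
import re
from datetime import date, datetime
from pathlib import Path

DATE_STAMP = re.compile(r'(\d{2}_\d{2}_\d{4})')  # MM_DD_YYYY, as in the CSV file names

def file_date(path):
    """Return the date encoded in a file name, or None if no stamp is found."""
    match = DATE_STAMP.search(Path(path).stem)
    if match is None:
        return None
    return datetime.strptime(match.group(1), '%m_%d_%Y').date()

def files_in_date_range(paths, start, end):
    """Keep the paths whose date stamp falls within [start, end], oldest first."""
    dated = [(file_date(p), p) for p in paths]
    in_range = [(d, p) for d, p in dated if d is not None and start <= d <= end]
    return [p for d, p in sorted(in_range, key=lambda pair: pair[0])]

example_paths = [
    'craigslist_rental_sfbay_pen_01_15_2023_.csv',
    'craigslist_rental_sfbay_pen_01_18_2022.csv',
    'craigslist_rental_sfbay_pen_12_27_2022_.csv',
    'craigslist_rental_sfbay_pen_01_18_2023_.csv',
]
selected = files_in_date_range(example_paths, date(2023, 1, 14), date(2023, 2, 14))
```

With real data, `example_paths` could be replaced by the result of `Path(path).glob('*.csv')`, since `file_date` accepts both strings and `Path` objects.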






In [5]:
# 2) Import Data from mid-January 2023 to Feb 14, 2023: subset data for Jan 15, 2023 to February 14, 2023

# imports-- file processing & datetime libraries
import os
import glob
from pathlib import Path
import datetime
# data analysis libraries & SQL libraries
import numpy as np
import pandas as pd
from pandas.core.frame import DataFrame


def import_dfs_matching_specific_subregion_and_dates(subregion_code:str)->dict:
    """ Import dataframes as a dictionary of dataframes, given a) a specific user-inputted range of dates & b) a specific subregion"""

    # specify parent path to all sfbay data-- NB: use an f-string combined with a raw (ie, r) string--ie, fr to modify the string so we can input the subregion code as an argument to add to the path
    path = fr'D:\Coding and Code projects\Python\craigslist_data_proj\CraigslistWebScraper\scraped_data\sfbay\{subregion_code}'


    # get all CSV files from path, and grab the file stems for each given CSV file 
    df = pd.DataFrame({'files' : [file for file in Path(path).glob('*.csv')],
                    'file_stem' : [file.stem for file in Path(path).glob('*.csv')]}) # get file stem using .stem method

    # Parse the dates from each CSV file name, and keep the same 'MM_DD_YYYY' format (**including the underscore delimiters!!) as the webcrawler's CSV file naming convention:

    df['date_of_file'] = df['file_stem'].str.extract(r'(\d{2}_\d{2}_\d{4})')


    ## "craigslist_rental_sfbay_subregion_MM_DD_YYYY.csv"
    # sanity check
    print(f'\nSome of the file dates:\n{df[["date_of_file", "file_stem"]].sort_values(by="date_of_file").tail()}\n')


    ## NB: The webcrawler program's CSV files are of this format:
    ## ** "craigslist_rental_sfbay_subregion_MM_DD_YYYY.csv"-- therefore:

    ## **Prompt user for the desired start & end dates, which we will concatenate into the format of 'MM_DD_YYYY'

    # start date inputs
    start_date_month = str(input('Enter desired Start Date month: '))
    start_date_day = str(input('Enter desired Start Date day: '))

    start_date_year = str(input('Enter desired Start Date year: '))

    # concat start date inputs to single string, with **underscore** delimiters in between each component
    underscore_delimiter = '_'

    start_date = start_date_month + underscore_delimiter + start_date_day + underscore_delimiter + start_date_year

    # sanity check on  resulting str
    print(f'Start date for file filter:\n{start_date}')

    # end date inputs
    end_date_month = str(input('Enter desired End Date month: '))
    end_date_day = str(input('Enter desired End Date day: '))

    end_date_year = str(input('Enter desired End Date year: '))

    # concat end date inputs to single string, with **underscore** delimiters in between each component
    end_date = end_date_month + underscore_delimiter + end_date_day + underscore_delimiter + end_date_year

    # sanity check on  resulting str
    print(f'End date for file filter:\n{end_date}')


    # use the start & end dates to filter the files:
    # index the files by their date strings, slice from the start date to the end date, and convert the matching file paths to a list
    # NB: the 'MM_DD_YYYY' labels are compared as *strings*, so files from other years whose month & day fall within the range (eg, '01_18_2022') are matched as well
    file_date_slice = df.set_index('date_of_file').loc[start_date:end_date]['files'].tolist()


    # initialize empty dict, which will be used as a dict of dfs
    dict_of_dfs = {}

    # iterate over each CSV file that matches the file date slice:
    for csv_file in file_date_slice:
        # read each CSV file as a separate df within the dictionary, and use the file name as the key of the dictionary 
        dict_of_dfs[csv_file] = pd.read_csv(csv_file, sep=',', encoding='utf-8')

    # sanity check: print the first key (ie, the CSV file path) of the dict of dfs:
    print(f'\nSanity check: first imported rental listings df from the period of {start_date} to {end_date} for {subregion_code} subregion, from the dict of dfs:\n{next(iter(dict_of_dfs))}')

    return dict_of_dfs



# specify each subregion codes as separate strings:
subregion_pen, subregion_nby, subregion_sby, subregion_eby, subregion_sfc, subregion_scz  = 'pen', 'nby', 'sby', 'eby', 'sfc', 'scz'

# import Peninsula data:

dict_of_dfs_pen = import_dfs_matching_specific_subregion_and_dates(subregion_pen)



Some of the file dates:
   date_of_file                                file_stem
63   12_02_2022  craigslist_rental_sfbay_pen_12_02_2022_
64   12_08_2022  craigslist_rental_sfbay_pen_12_08_2022_
65   12_15_2022  craigslist_rental_sfbay_pen_12_15_2022_
66   12_21_2022  craigslist_rental_sfbay_pen_12_21_2022_
67   12_27_2022  craigslist_rental_sfbay_pen_12_27_2022_

Start date for file filter:
01_14_2023
End date for file filter:
02_14_2023

Sanity check: first imported rental listings df from the period of 01_14_2023 to 02_14_2023 for pen subregion, from the dict of dfs:
D:\Coding and Code projects\Python\craigslist_data_proj\CraigslistWebScraper\scraped_data\sfbay\pen\craigslist_rental_sfbay_pen_01_15_2023_.csv


In [6]:
# import East Bay data:

dict_of_dfs_eby = import_dfs_matching_specific_subregion_and_dates(subregion_eby)



Some of the file dates:
   date_of_file                                file_stem
66   12_07_2022  craigslist_rental_sfbay_eby_12_07_2022_
67   12_14_2022  craigslist_rental_sfbay_eby_12_14_2022_
68   12_19_2022  craigslist_rental_sfbay_eby_12_19_2022_
69   12_23_2022  craigslist_rental_sfbay_eby_12_23_2022_
70   12_29_2022  craigslist_rental_sfbay_eby_12_29_2022_

Start date for file filter:
01_14_2023
End date for file filter:
02_14_2023

Sanity check: first imported rental listings df from the period of 01_14_2023 to 02_14_2023 for eby subregion, from the dict of dfs:
D:\Coding and Code projects\Python\craigslist_data_proj\CraigslistWebScraper\scraped_data\sfbay\eby\craigslist_rental_sfbay_eby_01_14_2023_.csv


In [7]:
# import North Bay data:

dict_of_dfs_nby = import_dfs_matching_specific_subregion_and_dates(subregion_nby)



Some of the file dates:
   date_of_file                                file_stem
59   12_05_2022  craigslist_rental_sfbay_nby_12_05_2022_
60   12_10_2022  craigslist_rental_sfbay_nby_12_10_2022_
61   12_11_2022  craigslist_rental_sfbay_nby_12_11_2022_
62   12_17_2022  craigslist_rental_sfbay_nby_12_17_2022_
63   12_23_2022  craigslist_rental_sfbay_nby_12_23_2022_

Start date for file filter:
01_14_2023
End date for file filter:
02_14_2023

Sanity check: first imported rental listings df from the period of 01_14_2023 to 02_14_2023 for nby subregion, from the dict of dfs:
D:\Coding and Code projects\Python\craigslist_data_proj\CraigslistWebScraper\scraped_data\sfbay\nby\craigslist_rental_sfbay_nby_01_17_2022.csv


In [8]:
# import South Bay data:

dict_of_dfs_sby = import_dfs_matching_specific_subregion_and_dates(subregion_sby)



Some of the file dates:
   date_of_file                                file_stem
63   12_03_2022  craigslist_rental_sfbay_sby_12_03_2022_
64   12_10_2022  craigslist_rental_sfbay_sby_12_10_2022_
65   12_16_2022  craigslist_rental_sfbay_sby_12_16_2022_
66   12_22_2022  craigslist_rental_sfbay_sby_12_22_2022_
67   12_26_2022  craigslist_rental_sfbay_sby_12_26_2022_

Start date for file filter:
01_14_2023
End date for file filter:
02_14_2023

Sanity check: first imported rental listings df from the period of 01_14_2023 to 02_14_2023 for sby subregion, from the dict of dfs:
D:\Coding and Code projects\Python\craigslist_data_proj\CraigslistWebScraper\scraped_data\sfbay\sby\craigslist_rental_sfbay_sby_01_16_2023_.csv


In [10]:
# import SF data:

dict_of_dfs_sfc = import_dfs_matching_specific_subregion_and_dates(subregion_sfc)


Some of the file dates:
   date_of_file                                file_stem
68   12_13_2022  craigslist_rental_sfbay_sfc_12_13_2022_
69   12_18_2022  craigslist_rental_sfbay_sfc_12_18_2022_
70   12_20_2022  craigslist_rental_sfbay_sfc_12_20_2022_
71   12_25_2022  craigslist_rental_sfbay_sfc_12_25_2022_
72   12_30_2022  craigslist_rental_sfbay_sfc_12_30_2022_

Start date for file filter:
01_14_2023
End date for file filter:
02_14_2023

Sanity check: first imported rental listings df from the period of 01_14_2023 to 02_14_2023 for sfc subregion, from the dict of dfs:
D:\Coding and Code projects\Python\craigslist_data_proj\CraigslistWebScraper\scraped_data\sfbay\sfc\craigslist_rental_sfbay_sfc_01_14_2023_.csv


In [11]:
# import Santa Cruz data:

dict_of_dfs_scz = import_dfs_matching_specific_subregion_and_dates(subregion_scz)



Some of the file dates:
   date_of_file                                file_stem
62   11_25_2022  craigslist_rental_sfbay_scz_11_25_2022_
63   12_01_2022  craigslist_rental_sfbay_scz_12_01_2022_
64   12_11_2022  craigslist_rental_sfbay_scz_12_11_2022_
65   12_28_2022  craigslist_rental_sfbay_scz_12_28_2022_
66   12_29_2022  craigslist_rental_sfbay_scz_12_29_2022_

Start date for file filter:
01_14_2023
End date for file filter:
02_14_2023

Sanity check: first imported rental listings df from the period of 01_14_2023 to 02_14_2023 for scz subregion, from the dict of dfs:
D:\Coding and Code projects\Python\craigslist_data_proj\CraigslistWebScraper\scraped_data\sfbay\scz\craigslist_rental_sfbay_scz_01_16_2023_.csv


### Try to determine what day or at least week the missing city name data began to surface:

#### Iterate over each df of East Bay data, and print each unique city name per df:

In [None]:
# Let's first show the names of the East Bay files, so we can know the specific dates the webcrawler was run
print("East Bay files")  # print the label once, rather than once per file
for key in dict_of_dfs_eby:
    print(key.name)

In [12]:
def print_unique_row_vals_for_col_for_each_df_in_dict_of_dfs(dict_of_dfs, col):
    for name, df in dict_of_dfs.items():
        print(f'Unique city names per df from {name.name}')  # show name of corresponding CSV file per df
        print(name, df[col].unique())  # show unique values of given col
        print()  # include a new line in between each df


print_unique_row_vals_for_col_for_each_df_in_dict_of_dfs(dict_of_dfs_eby, 'cities')


Unique city names per df from craigslist_rental_sfbay_eby_01_14_2023_.csv
D:\Coding and Code projects\Python\craigslist_data_proj\CraigslistWebScraper\scraped_data\sfbay\eby\craigslist_rental_sfbay_eby_01_14_2023_.csv ['Nan']

Unique city names per df from craigslist_rental_sfbay_eby_01_18_2022.csv
D:\Coding and Code projects\Python\craigslist_data_proj\CraigslistWebScraper\scraped_data\sfbay\eby\craigslist_rental_sfbay_eby_01_18_2022.csv ['Fairfield ' 'Berkeley' 'San Leandro' 'Dublin ' 'Alameda' 'Oakland'
 'Fremont ' 'Richmond' 'Hercules, Pinole, San Pablo, El Sob' 'Lafayette '
 'Danville ' 'Irvington High Area' 'Emeryville' 'Hayward' 'Walnut Creek'
 'Vallejo ' 'Concord ' 'Pittsburg ' 'Nan' 'Brentwood ' 'Crockett'
 'Albany ' 'East San Jose' 'Hercules' 'Berkeley North '
 'San Leandro - Hillcrest Knolls 1 Block Above Foothill Blvd' 'East Bay'
 'San Ramon' '1634 Lousiana St Vallejo, Ca' 'Westbrae'
 'Briarwood At Central Park' 'El Sobrante' 'Pleasanton' 'West End'
 'Concord' '6775 Golden 

In [13]:
# NB: let's take a slightly different way to determine what specific day when the city names were no longer being parsed correctly

def groupby_by_day_and_find_day_when_city_nulls_started(df):
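    """Group the data by the date when the webcrawler was run,
    and use the .nunique() method on the cities column.
    For any days on which the city names failed to parse, the count is no more than 1
    (ie, only the 'Nan' placeholder), or the result has no groups at all."""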
    return df.groupby('date_of_webcrawler')['cities'].nunique()


# Peninsula data (NB: this cell iterates over dict_of_dfs_pen, so the results refer to the Peninsula files)
dict_of_dfs_pen_null_per_col_per_day = {key: val.pipe(groupby_by_day_and_find_day_when_city_nulls_started) for key, val in dict_of_dfs_pen.items()}
print(dict_of_dfs_pen_null_per_col_per_day)

{WindowsPath('D:/Coding and Code projects/Python/craigslist_data_proj/CraigslistWebScraper/scraped_data/sfbay/pen/craigslist_rental_sfbay_pen_01_15_2023_.csv'): date_of_webcrawler
2023-01-15    1
Name: cities, dtype: int64, WindowsPath('D:/Coding and Code projects/Python/craigslist_data_proj/CraigslistWebScraper/scraped_data/sfbay/pen/craigslist_rental_sfbay_pen_01_18_2022.csv'): date_of_webcrawler
1/18/2022    26
Name: cities, dtype: int64, WindowsPath('D:/Coding and Code projects/Python/craigslist_data_proj/CraigslistWebScraper/scraped_data/sfbay/pen/craigslist_rental_sfbay_pen_01_18_2023_.csv'): date_of_webcrawler
2023-01-18    1
Name: cities, dtype: int64, WindowsPath('D:/Coding and Code projects/Python/craigslist_data_proj/CraigslistWebScraper/scraped_data/sfbay/pen/craigslist_rental_sfbay_pen_01_19_2023_.csv'): date_of_webcrawler
2023-01-19    1
Name: cities, dtype: int64, WindowsPath('D:/Coding and Code projects/Python/craigslist_data_proj/CraigslistWebScraper/scraped_data/sfbay

## Notice that the city names start disappearing around January 14th, because the webcrawler run that day failed to parse any city names due to a no-longer-correct selenium xpath argument. 

### How can we be sure?: The .unique() method printed a list containing only 'Nan', and the .nunique() method shows only 1 for the webcrawler runs on Jan 14th to 15th. 

### In other words, there were no valid city names data. To elaborate, the listings scraped by the webcrawler on those days included valid data for all other columns; only the city names were conspicuously absent.
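
The same check can be sketched on plain records, without pandas, by counting the distinct *valid* city names per crawl date and reporting the earliest date with none. This is a simplified illustration, not the notebook's method. The rows below are made up, the helper names are hypothetical, and the crawl dates are assumed to be ISO-formatted strings (the real files also contain dates such as '1/18/2022', which would need parsing first).

```python
from collections import defaultdict

MISSING = {'', 'nan'}  # placeholders treated as "no city name" (assumption)

def valid_city_counts_by_day(rows):
    """Map each crawl date to the number of distinct valid city names seen that day."""
    cities_by_day = defaultdict(set)
    for row in rows:
        day = row['date_of_webcrawler']
        cities_by_day[day]  # make sure the day is recorded even with no valid cities
        city = str(row.get('cities', '')).strip()
        if city.lower() not in MISSING:
            cities_by_day[day].add(city)
    return {day: len(cities) for day, cities in cities_by_day.items()}

def first_day_without_cities(rows):
    """Return the earliest crawl date with zero valid city names, or None."""
    counts = valid_city_counts_by_day(rows)
    empty_days = sorted(day for day, n in counts.items() if n == 0)
    return empty_days[0] if empty_days else None

# Made-up rows, for illustration only
rows = [
    {'date_of_webcrawler': '2023-01-07', 'cities': 'Oakland'},
    {'date_of_webcrawler': '2023-01-07', 'cities': 'Berkeley'},
    {'date_of_webcrawler': '2023-01-14', 'cities': 'Nan'},
    {'date_of_webcrawler': '2023-01-18', 'cities': 'Nan'},
]
counts = valid_city_counts_by_day(rows)
first_bad_day = first_day_without_cities(rows)
```

Treating 'Nan' as missing makes a broken day show up as 0 rather than 1, which is slightly easier to read than the `.nunique()` result, where the 'Nan' string itself counts as one unique value.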

In [14]:
dict_of_dfs_sfc_null_per_col_per_day = {key: val.pipe(groupby_by_day_and_find_day_when_city_nulls_started) for key, val in dict_of_dfs_sfc.items()}
print(f'SF number of unique city names: \n{dict_of_dfs_sfc_null_per_col_per_day}')

SF number of unique city names: 
{WindowsPath('D:/Coding and Code projects/Python/craigslist_data_proj/CraigslistWebScraper/scraped_data/sfbay/sfc/craigslist_rental_sfbay_sfc_01_14_2023_.csv'): Series([], Name: cities, dtype: int64), WindowsPath('D:/Coding and Code projects/Python/craigslist_data_proj/CraigslistWebScraper/scraped_data/sfbay/sfc/craigslist_rental_sfbay_sfc_01_17_2022.csv'): date_of_webcrawler
1/17/2022    1
Name: cities, dtype: int64, WindowsPath('D:/Coding and Code projects/Python/craigslist_data_proj/CraigslistWebScraper/scraped_data/sfbay/sfc/craigslist_rental_sfbay_sfc_01_18_2023_.csv'): Series([], Name: cities, dtype: int64), WindowsPath('D:/Coding and Code projects/Python/craigslist_data_proj/CraigslistWebScraper/scraped_data/sfbay/sfc/craigslist_rental_sfbay_sfc_01_21_2023_.csv'): Series([], Name: cities, dtype: int64), WindowsPath('D:/Coding and Code projects/Python/craigslist_data_proj/CraigslistWebScraper/scraped_data/sfbay/sfc/craigslist_rental_sfbay_sfc_01_2

In [17]:
dict_of_dfs_nby_null_per_col_per_day = {key: val.pipe(groupby_by_day_and_find_day_when_city_nulls_started) for key, val in dict_of_dfs_nby.items()}
print(f'North Bay number of unique city names: \n{dict_of_dfs_nby_null_per_col_per_day}')

North Bay number of unique city names: 
{WindowsPath('D:/Coding and Code projects/Python/craigslist_data_proj/CraigslistWebScraper/scraped_data/sfbay/nby/craigslist_rental_sfbay_nby_01_01_2023_.csv'): date_of_webcrawler
2023-01-01    31
Name: cities, dtype: int64, WindowsPath('D:/Coding and Code projects/Python/craigslist_data_proj/CraigslistWebScraper/scraped_data/sfbay/nby/craigslist_rental_sfbay_nby_01_02_2022.csv'): date_of_webcrawler
1/2/2022    31
Name: cities, dtype: int64, WindowsPath('D:/Coding and Code projects/Python/craigslist_data_proj/CraigslistWebScraper/scraped_data/sfbay/nby/craigslist_rental_sfbay_nby_01_07_2023_.csv'): date_of_webcrawler
2023-01-07    44
Name: cities, dtype: int64, WindowsPath('D:/Coding and Code projects/Python/craigslist_data_proj/CraigslistWebScraper/scraped_data/sfbay/nby/craigslist_rental_sfbay_nby_01_09_2022.csv'): date_of_webcrawler
1/9/2022    33
Name: cities, dtype: int64, WindowsPath('D:/Coding and Code projects/Python/craigslist_data_proj/

## 2b) Next, fill in the missing city names by extracting the *first* city name from our list that appears in each rental listing URL.
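As a minimal sketch of the "first match" idea, the snippet below runs the same kind of `str.extract()` search on a few made-up URLs. The URLs, listing IDs, and the three-element city list are hypothetical and only illustrate the behavior; they are not taken from the scraped data.

```python
import pandas as pd

# Hypothetical dash-delimited city names (the real list comes from step 1).
city_names = ["palo-alto", "menlo-park", "san-mateo"]

# Hypothetical listing URLs in the general '/apa/d/<slug>/<id>.html' shape.
urls = pd.Series([
    "https://sfbay.craigslist.org/pen/apa/d/palo-alto-bright-studio-near-menlo-park/123.html",
    "https://sfbay.craigslist.org/pen/apa/d/san-mateo-2br-apartment/456.html",
    "https://sfbay.craigslist.org/pen/apa/d/montara-ocean-view-cottage/789.html",
])

# Keep only the part of each URL after '/apa/d/', in lower-case.
slugs = urls.str.split("/apa/d/").str[1].str.lower()

# '(a|b|c)': one capture group that matches any of the city names.
pattern = "(" + "|".join(city_names) + ")"

# str.extract() returns the first (leftmost) match per row, or NaN if none.
first_city = slugs.str.extract(pattern, expand=False)
print(first_city.tolist())
```

The first URL mentions two cities, and only the one that appears first in the URL is kept. The third URL names a town that is not in the list, so its value stays missing.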

In [15]:
""" 2b) Parse the city names data for all of the data since mid-Jan 2023,
which are currently missing city names data"""

def parse_city_names_from_listing_URL(dictionary_of_dfs: dict, unique_city_names_dash_delim: list) -> dict:

  """ 1) Build a regex pattern by joining each element of the unique_city_names... list
  with the pipe (ie, '|') operator, which acts as a boolean 'OR', so that we can search
  the rental listing URLs (ie, listing_urls) for any of the city names.

  2) Use str.extract() to take only the first matched city name from each URL.

  3) Use these parsed city name values to **replace** the values for the 'cities' column!"""
  # apply lower-case to the list of SF Bay + SC county city names:
  unique_city_names_dash_delim = [el.lower() for el in unique_city_names_dash_delim]

  # str.extract() requires a capture group, so wrap the 'OR' pattern in parentheses: ie, '(name-1|name-2|...)'
  pipe_operator = '|'
  unique_city_names_dash_delim_pattern = '(' + pipe_operator.join(unique_city_names_dash_delim) + ')'

  for key, df in dictionary_of_dfs.items():

    # step 1: split each URL on '/apa/d/' and keep the 2nd element, ie the part of the URL after '/apa/d/':
    df['listing_urls_for_str_match'] = df['listing_urls'].str.split('/apa/d/').str[1]

    # step 2: convert to lower-case for consistency with the list of city names
    df['listing_urls_for_str_match'] = df['listing_urls_for_str_match'].str.lower()

    # step 3: replace cities with the first city name matched in listing_urls_for_str_match; rows without a match become NaN
    df['cities'] = df['listing_urls_for_str_match'].str.extract(unique_city_names_dash_delim_pattern, expand=False)

    print(df['cities'])

    dictionary_of_dfs[key] = df

  return dictionary_of_dfs

# apply function to each dictionary of dfs
# NB: the function itself loops over *each* df within the given dictionary:

# Peninsula
dict_of_dfs_pen = parse_city_names_from_listing_URL(dict_of_dfs_pen, sfbay_city_names)

0          palo-alto
1          palo-alto
2          palo-alto
3         menlo-park
4          palo-alto
           ...      
214        san-mateo
215              NaN
216        san-bruno
217    san-francisco
218        san-mateo
Name: cities, Length: 219, dtype: object
0         san-mateo
1        menlo-park
2          millbrae
3      redwood-city
4          atherton
           ...     
327    redwood-city
328       los-altos
329       daly-city
330             NaN
331       daly-city
Name: cities, Length: 332, dtype: object
0           mountain-view
1               san-bruno
2          portola-valley
3               palo-alto
4               san-bruno
5           mountain-view
6                 belmont
7               daly-city
8               sunnyvale
9               daly-city
10          mountain-view
11              palo-alto
12          mountain-view
13    south-san-francisco
14          mountain-view
15               millbrae
Name: cities, dtype: object
0          los-altos
1 

In [16]:
# eby
dict_of_dfs_eby  = parse_city_names_from_listing_URL(dict_of_dfs_eby, sfbay_city_names)

0        pleasanton
1           concord
2           oakland
3       santa-clara
4      walnut-creek
           ...     
129         hayward
130      pleasanton
131       san-pablo
132       san-pablo
133       lafayette
Name: cities, Length: 134, dtype: object
0           vacaville
1            berkeley
2         san-leandro
3          pleasanton
4             oakland
            ...      
1089    pleasant-hill
1090        vacaville
1091           dublin
1092          oakland
1093          oakland
Name: cities, Length: 1094, dtype: object
0       union-city
1          vallejo
2          concord
3          concord
4          vallejo
          ...     
604        concord
605      san-ramon
606       richmond
607        vallejo
608    san-leandro
Name: cities, Length: 609, dtype: object
0         fremont
1      el-cerrito
2         oakland
3         fremont
4         concord
          ...    
768       oakland
769       oakland
770       oakland
771       oakland
772       oakland
Name: c

In [17]:
# nby
dict_of_dfs_nby  = parse_city_names_from_listing_URL(dict_of_dfs_nby, sfbay_city_names)


# sby

dict_of_dfs_sby  = parse_city_names_from_listing_URL(dict_of_dfs_sby, sfbay_city_names)


# sfc
dict_of_dfs_sfc  = parse_city_names_from_listing_URL(dict_of_dfs_sfc, sfbay_city_names)



# scz
dict_of_dfs_scz  = parse_city_names_from_listing_URL(dict_of_dfs_scz, sfbay_city_names)


0         sausalito
1        sebastopol
2        santa-rosa
3              napa
4          petaluma
           ...     
220      santa-rosa
221      santa-rosa
222         fairfax
223             NaN
224    corte-madera
Name: cities, Length: 225, dtype: object
0        santa-rosa
1        santa-rosa
2        san-rafael
3          petaluma
4      corte-madera
           ...     
553      santa-rosa
554    corte-madera
555     mill-valley
556      santa-rosa
557      santa-rosa
Name: cities, Length: 558, dtype: object
0      san-anselmo
1       santa-rosa
2       san-rafael
3              NaN
4      mill-valley
          ...     
226     santa-rosa
227     santa-rosa
228     santa-rosa
229     santa-rosa
230           napa
Name: cities, Length: 231, dtype: object
0     rohnert-park
1         petaluma
2      mill-valley
3       santa-rosa
4       healdsburg
          ...     
63      santa-rosa
64      santa-rosa
65        petaluma
66      santa-rosa
67            napa
Name: cities, Lengt

## Now that we've parsed the correct city names, remove the dashes from the city names that have them.

### Then, use str.title() to capitalize each word of each city name
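The choice of string method matters for multi-word city names. The sketch below uses a small made-up Series (not the scraped data) to compare `str.capitalize()`, which upper-cases only the first character of the whole string, with `str.title()`, which upper-cases the first letter of each word.

```python
import pandas as pd

# Hypothetical dash-delimited values, including one missing value.
cities = pd.Series(["palo-alto", "south-san-francisco", None])

# Replace each dash with a space (plain string replacement, not a regex).
no_dashes = cities.str.replace("-", " ", regex=False)

capitalized = no_dashes.str.capitalize()  # only the first character is upper-cased
titled = no_dashes.str.title()            # every word is capitalized

print(capitalized.tolist())
print(titled.tolist())
```

Missing values pass through both methods unchanged, so rows without a matched city remain missing after this step.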

In [46]:
def remove_dash_chars_from_col_for_dict_of_dfs(dictionary_of_dfs, col):
    for key, df in dictionary_of_dfs.items():
        df[col] = df[col].str.replace('-', ' ')
        print(df[col])

        dictionary_of_dfs[key] = df

    return dictionary_of_dfs

def return_capitlization_for_col_for_dict_of_dfs(dictionary_of_dfs, col):
    """Capitalize each word of each row from given col,
    by using str.title() method!"""
    for key, df in dictionary_of_dfs.items():
        df[col] = df[col].str.title()  # str.title() method capitalizes each word in given str
        print(df[col])

        dictionary_of_dfs[key] = df

    return dictionary_of_dfs



# peninsula data

dict_of_dfs_pen = remove_dash_chars_from_col_for_dict_of_dfs(dict_of_dfs_pen, 'cities')

dict_of_dfs_pen = return_capitlization_for_col_for_dict_of_dfs(dict_of_dfs_pen, 'cities')

# eby
dict_of_dfs_eby = remove_dash_chars_from_col_for_dict_of_dfs(dict_of_dfs_eby, 'cities')

dict_of_dfs_eby = return_capitlization_for_col_for_dict_of_dfs(dict_of_dfs_eby, 'cities')


# sby
dict_of_dfs_sby = remove_dash_chars_from_col_for_dict_of_dfs(dict_of_dfs_sby, 'cities')

dict_of_dfs_sby = return_capitlization_for_col_for_dict_of_dfs(dict_of_dfs_sby, 'cities')

#sf
dict_of_dfs_sfc = remove_dash_chars_from_col_for_dict_of_dfs(dict_of_dfs_sfc, 'cities')

dict_of_dfs_sfc = return_capitlization_for_col_for_dict_of_dfs(dict_of_dfs_sfc, 'cities')

# nby
dict_of_dfs_nby = remove_dash_chars_from_col_for_dict_of_dfs(dict_of_dfs_nby, 'cities')

dict_of_dfs_nby = return_capitlization_for_col_for_dict_of_dfs(dict_of_dfs_nby, 'cities')


# scz
dict_of_dfs_scz = remove_dash_chars_from_col_for_dict_of_dfs(dict_of_dfs_scz, 'cities')

dict_of_dfs_scz = return_capitlization_for_col_for_dict_of_dfs(dict_of_dfs_scz, 'cities')


0          Palo alto
1          Palo alto
2          Palo alto
3         Menlo park
4          Palo alto
           ...      
214        San mateo
215              NaN
216        San bruno
217    San francisco
218        San mateo
Name: cities, Length: 219, dtype: object
0         San mateo
1        Menlo park
2          Millbrae
3      Redwood city
4          Atherton
           ...     
327    Redwood city
328       Los altos
329       Daly city
330             NaN
331       Daly city
Name: cities, Length: 332, dtype: object
0           Mountain view
1               San bruno
2          Portola valley
3               Palo alto
4               San bruno
5           Mountain view
6                 Belmont
7               Daly city
8               Sunnyvale
9               Daly city
10          Mountain view
11              Palo alto
12          Mountain view
13    South san francisco
14          Mountain view
15               Millbrae
Name: cities, dtype: object
0          Los altos
1 

### Before exporting back to the original CSVs, let's examine the rows that still have missing (ie, NaN) city names data:

In [22]:
# show the number of remaining nulls for cities (and for every other column)
def print_number_of_nulls_per_col_for_dict_of_dfs(dict_of_dfs:dict):
    for key, df in dict_of_dfs.items():
        print(key)
        print(pd.isnull(df).sum())  # count the number of nulls in each column


# nulls for peninsula data
print_number_of_nulls_per_col_for_dict_of_dfs(dict_of_dfs_pen)


D:\Coding and Code projects\Python\craigslist_data_proj\CraigslistWebScraper\scraped_data\sfbay\pen\craigslist_rental_sfbay_pen_01_15_2023_.csv
listing_urls                   0
ids                            0
sqft                          24
cities                         8
prices                         0
bedrooms                       1
bathrooms                      1
attr_vars                      1
listing_descrip                1
date_of_webcrawler             0
kitchen                        0
date_posted                    0
cats_OK                        0
dogs_OK                        0
wheelchair_accessible          0
laundry_in_bldg                0
no_laundry                     0
washer_and_dryer               0
washer_and_dryer_hookup        0
laundry_on_site                0
full_kitchen                   0
dishwasher                     0
refrigerator                   0
oven                           0
flooring_carpet                0
flooring_wood                  

### The results above show that, for the January 15, 2023 Peninsula data, 8 city names are still missing, out of more than 200 rental listings obtained that day.

### Note that some of these missing city names are **not** due to a problem with our code, but to a limitation of the Wikipedia tables.

### For example, small census-designated places that are not incorporated cities are not included (at least not in all cases) in the Wikipedia tables. In other words, unincorporated (typically very small) towns are missing from our list of city names.

### One specific example of this is Montara, CA, a small unincorporated town located in a northern coastal part of San Mateo County (ie, within the Peninsula region of the SF Bay Area).

### However, this issue is not widespread in the data: such towns have small populations and are therefore unlikely to have many rental listings. It is more of a hassle to add these names back into the dataset than to simply remove the affected rows.

### In short, we can remove these rows when we run Pandas_and_SQL_ETL_and_data_cleaning.py, the CSV-to-SQL Server data pipeline script. This script automatically removes rows with nulls in several important columns, including the cities column.
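For illustration only, here is a minimal sketch of how rows with a missing city can be inspected and then dropped in pandas. The DataFrame below is made up, and the actual ETL script may implement this step differently; `dropna(subset=...)` is just one common way to do it.

```python
import pandas as pd

# Hypothetical listings; the second row has no matched city name.
df = pd.DataFrame({
    "listing_urls": ["url-a", "url-b", "url-c"],
    "cities": ["Palo Alto", None, "San Mateo"],
    "prices": [3200, 2800, 3500],
})

# Inspect the rows whose city name is still missing.
unmatched = df[df["cities"].isna()]
print(unmatched)

# Drop only the rows where 'cities' is null; nulls in other columns are ignored here.
cleaned = df.dropna(subset=["cities"])
print(len(df), "->", len(cleaned))
```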

### Finally, export all dfs containing the cleaned city names, ie, for the data since Jan 2023.

### Loop over each df from each dictionary of dfs, and use .to_csv() to re-export them and replace the old CSV files!
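Because each dictionary key is already a full `pathlib` path, one possible alternative to changing the working directory is to write each df straight back to its key. The sketch below shows this on a temporary folder; the file name and data are hypothetical, and this is an illustration rather than the approach the notebook itself takes.

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical CSV standing in for one scraped file with missing city names.
    csv_path = Path(tmp) / "craigslist_rental_example.csv"
    pd.DataFrame({"listing_urls": ["a", "b"], "cities": [None, None]}).to_csv(csv_path, index=False)

    # Dictionary of dfs keyed by the full path of the source CSV.
    dict_of_dfs = {csv_path: pd.read_csv(csv_path)}

    for path, df in dict_of_dfs.items():
        df["cities"] = ["Palo Alto", "San Mateo"]  # stand-in for the cleaning steps
        df.to_csv(path, index=False)               # overwrite the original file in place

    # Read the file back to confirm that it was overwritten.
    round_trip = pd.read_csv(csv_path)

print(round_trip)
```

Writing to the full path avoids depending on the current working directory, which makes it harder to export a subregion's data into the wrong folder.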


### "Exporting Pandas output for multiple CSV files": https://stackoverflow.com/questions/67959271/exporting-pandas-output-for-multiple-csv-files


In [163]:
for key in dict_of_dfs_pen:
    # Iterate over each key (ie, path & file), and use pathlib's .name attribute to get the original CSV file name
    print(key.name)


craigslist_rental_sfbay_pen_01_02_2022.csv
craigslist_rental_sfbay_pen_01_06_2023_.csv
craigslist_rental_sfbay_pen_01_08_2022.csv
craigslist_rental_sfbay_pen_01_10_2022.csv
craigslist_rental_sfbay_pen_01_15_2023_.csv
craigslist_rental_sfbay_pen_01_18_2022.csv
craigslist_rental_sfbay_pen_01_18_2023_.csv
craigslist_rental_sfbay_pen_01_19_2023_.csv
craigslist_rental_sfbay_pen_01_26_2023_.csv
craigslist_rental_sfbay_pen_01_27_2022.csv
craigslist_rental_sfbay_pen_02_02_2023_.csv
craigslist_rental_sfbay_pen_02_04_2022.csv
craigslist_rental_sfbay_pen_02_10_2023_.csv
craigslist_rental_sfbay_pen_02_11_2022.csv


In [79]:
# test case by writing files to local folder, before writing over to the original scraped_data folders
for key, df in dict_of_dfs_pen.items():
    # Iterate over each key (ie, path & file) & dataframe, and use pathlib's .name attribute to refer back to the original CSV file name
    df.to_csv(
        key.name, # NB!: pathlib's .name attribute gives the original CSV name corresponding to each df
        index=False
        )

In [48]:
# export function: writes each df to its original CSV file name, within the current working directory
def export_each_df_from_dict_of_dfs_back_to_original_path_and_write_over_original_csvs(dict_of_dfs):
    # NB: iterate over the dict_of_dfs argument, *not* over dict_of_dfs_pen, so that each subregion exports its own data
    for key, df in dict_of_dfs.items():
        # Iterate over each key (ie, path & file) & dataframe, and use pathlib's .name attribute to refer back to the original CSV file name
        # replace all old CSV files with the cleaned city names
        df.to_csv(
            key.name, # NB!: pathlib's .name attribute gives the original CSV name corresponding to each df
            index=False
            )
        
# Peninsula data

export_each_df_from_dict_of_dfs_back_to_original_path_and_write_over_original_csvs(dict_of_dfs_pen)


In [49]:
# export Peninsula data:
# specify full path to Peninsula scraped data
path = fr'D:\Coding and Code projects\Python\craigslist_data_proj\CraigslistWebScraper\scraped_data\sfbay\{subregion_pen}'
# change current working directory to path
os.chdir(path)


export_each_df_from_dict_of_dfs_back_to_original_path_and_write_over_original_csvs(dict_of_dfs_pen)


In [50]:
# export East Bay data:
# specify full path
path = fr'D:\Coding and Code projects\Python\craigslist_data_proj\CraigslistWebScraper\scraped_data\sfbay\{subregion_eby}'

# change current working directory to path
os.chdir(path)

export_each_df_from_dict_of_dfs_back_to_original_path_and_write_over_original_csvs(dict_of_dfs_eby)



In [51]:
# export North Bay data:
# specify full path to scraped data
path = fr'D:\Coding and Code projects\Python\craigslist_data_proj\CraigslistWebScraper\scraped_data\sfbay\{subregion_nby}'

# change current working directory to path
os.chdir(path)

export_each_df_from_dict_of_dfs_back_to_original_path_and_write_over_original_csvs(dict_of_dfs_nby)

In [52]:
# export South Bay data:
# specify full path to scraped data
path = fr'D:\Coding and Code projects\Python\craigslist_data_proj\CraigslistWebScraper\scraped_data\sfbay\{subregion_sby}'

# change current working directory to path
os.chdir(path)

export_each_df_from_dict_of_dfs_back_to_original_path_and_write_over_original_csvs(dict_of_dfs_sby)

In [53]:
# export SF data:
# specify full path to scraped data
path = fr'D:\Coding and Code projects\Python\craigslist_data_proj\CraigslistWebScraper\scraped_data\sfbay\{subregion_sfc}'

# change current working directory to path
os.chdir(path)

export_each_df_from_dict_of_dfs_back_to_original_path_and_write_over_original_csvs(dict_of_dfs_sfc)

In [54]:
# export scz data:
# specify full path to scraped data
path = fr'D:\Coding and Code projects\Python\craigslist_data_proj\CraigslistWebScraper\scraped_data\sfbay\{subregion_scz}'

# change current working directory to path
os.chdir(path)

export_each_df_from_dict_of_dfs_back_to_original_path_and_write_over_original_csvs(dict_of_dfs_scz)

In [36]:
# def export_dfs_from_dict_of_dfs_back_to_path_and_write_over_original_csvs(dict_of_dfs:dict)->dict:
#     """ Import dataframes as a dictionary of dataframes, given a) a specific user-inputted range of dates & b) a specific subregion"""

#     # specify parent path to all sfbay data-- NB: use an f-string combined with a raw (ie, r) string--ie, fr to modify the string so we can input the subregion code as an argument to add to the path
#     # path = fr'D:\Coding and Code projects\Python\craigslist_data_proj\CraigslistWebScraper\scraped_data\sfbay\{subregion}'
#     path = fr'D:\Coding and Code projects\Python\craigslist_data_proj\CraigslistWebScraper\scraped_data\sfbay'
    
    
#     # subregion:str
#     # change current working directory to path
#     os.chdir(path)

#     # path_pen = fr'D:\Coding and Code projects\Python\craigslist_data_proj\CraigslistWebScraper\scraped_data\sfbay\{subregion_pen}'

#     for df, key in dict_of_dfs.items():
#         """Iterate over each dataframe & key (ie, the CSV file names), and use the .name pathlib function to refer back to the original CSV file"""
#         # replace all old CSV files with the cleaned city names
#         df.to_csv( 
#             key.name, # 1) use os.path.join() to join the CSV filename to the relevant path &  2) NB!: use pathlib's .name function to refer back to the original CSV name corresponding to each df 
#             index=False  # each df index is meaningless here
#             )  



# # def export_each_df_from_dict_of_dfs_back_to_original_path_and_write_over_original_csvs(dict_of_dfs):
# #     for key, df in dict_of_dfs_pen.items():
# #         """Iterate over each dataframe & key (ie, path & file), and use the .name pathlib function to refer back to the original CSV file"""
# #         # replace all old CSV files with the cleaned city names
# #         df.to_csv(
# #             key.name, # NB!: use pathlib's .name function to refer back to the original CSV name corresponding to each df 
# #             index=False
# #             )


# export_dfs_from_dict_of_dfs_back_to_path_and_write_over_original_csvs(dict_of_dfs_pen)


AttributeError: 'WindowsPath' object has no attribute 'to_csv'
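The AttributeError above is consistent with unpacking `dict.items()` in the wrong order: `items()` yields `(key, value)` pairs, so writing `for df, key in ...` binds the path to `df` and the dataframe to `key`. The following minimal sketch, using a made-up one-entry dictionary, shows the difference.

```python
from pathlib import Path

import pandas as pd

# Hypothetical dictionary of dfs keyed by CSV path.
dict_of_dfs = {Path("example.csv"): pd.DataFrame({"cities": ["Palo Alto"]})}

# Swapped order: `df` is actually bound to the Path key, which has no to_csv().
for df, key in dict_of_dfs.items():
    swapped_has_to_csv = hasattr(df, "to_csv")

# Correct order: the key (path) comes first, then the value (dataframe).
for key, df in dict_of_dfs.items():
    correct_has_to_csv = hasattr(df, "to_csv")

print(swapped_has_to_csv, correct_has_to_csv)
```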