### Importing dependencies ###

In [1]:
import csv
import pandas as pd
import numpy as np
import os

## Reading in all data 

All files ending in `.csv`, `.csv.gz`, or `.geojson` are read in.

### Input folder structure: ###

```plain_text
folder_path/
├── City1/
│   ├── file1.csv
│   ├── file2.csv
│   └── file3.csv
├── City2/
│   ├── file4.csv
│   └── file5.geojson
└── City3/
    └── file6.csv.gz
```

### Output structure: ###

- Keys: City names (subdirectory names under `folder_path`).
- Values: Lists of Pandas DataFrames corresponding to the files in the respective subfolder.


```python
data_dict = {
    "City1": [DataFrame_file1, DataFrame_file2, DataFrame_file3],   # Files from `City1` subfolder
    "City2": [DataFrame_file4, DataFrame_file5],                    # Files from `City2` subfolder
    "City3": [DataFrame_file6]                                      # Files from `City3` subfolder
}
```

In [2]:


# Read all CSV, gzipped CSV, and GeoJSON files into a dictionary that maps each city/region subfolder to a list of DataFrames
folder_path = '/Users/georgtirpitz/Documents/Data_Literacy/example_data'

# Initialize the dictionary to store DataFrames
data_dict = {}

for root, dirs, files in os.walk(folder_path):
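    # Get the name of the subdirectory
    subdirectory_name = os.path.basename(root)
    
    # Skip the root folder itself: only its subfolders represent cities.
    # Comparing normalized paths avoids skipping a subfolder that shares the root's name.
    if os.path.normpath(root) == os.path.normpath(folder_path):
        continue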
    
    # Initialize a list for the current subdirectory
    data_dict[subdirectory_name] = []

    print(f"Processing folder: {root}")
    
    for file_name in files:
        if file_name.endswith(('.csv', '.csv.gz', '.geojson')):
            file_path = os.path.join(root, file_name)
            
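            # Read the file into a DataFrame
            if file_name.endswith('.geojson'):
                # pd.read_json loads the raw JSON: top-level keys such as 'type' and
                # 'features' typically become columns, and geometries are not parsed
                df = pd.read_json(file_path)
            else:
                # pd.read_csv infers gzip compression from the '.gz' suffix
                df = pd.read_csv(file_path)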
            
            # Append the DataFrame to the list for this subdirectory
            data_dict[subdirectory_name].append(df)
            
            print(f"Loaded {file_path} into {subdirectory_name}'s list of DataFrames")
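Since `pd.read_json` keeps every GeoJSON feature as a nested dictionary, one possible way to get one row per feature is to flatten the `features` list with `pd.json_normalize`. The following is a minimal sketch, assuming the file is a standard `FeatureCollection`; the helper name `read_geojson_flat` and the sample data are hypothetical and not part of the original notebook.

```python
import json
import os
import tempfile

import pandas as pd


def read_geojson_flat(file_path):
    """Hypothetical helper: load a GeoJSON FeatureCollection with one row per feature."""
    with open(file_path, encoding="utf-8") as f:
        geo = json.load(f)
    # json_normalize flattens nested dicts into dotted column names,
    # e.g. 'properties.neighbourhood' and 'geometry.type'
    return pd.json_normalize(geo["features"])


# Made-up example data, only to illustrate the resulting shape
sample = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "properties": {"neighbourhood": "Centrum-West"},
            "geometry": {"type": "Point", "coordinates": [4.89, 52.37]},
        },
        {
            "type": "Feature",
            "properties": {"neighbourhood": "Noord-Oost"},
            "geometry": {"type": "Point", "coordinates": [4.95, 52.40]},
        },
    ],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.geojson")
    with open(sample_path, "w", encoding="utf-8") as f:
        json.dump(sample, f)
    geo_df = read_geojson_flat(sample_path)

print(geo_df.columns.tolist())
```

If actual geometry objects are needed rather than nested coordinate lists, a dedicated geospatial library would be the more natural choice; this sketch only stays within pandas.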


## Merging data 

Convert `data_dict` from its current format, a dict that maps each city to a list of per-file DataFrames (dict -> list -> df_per_file):

```python
data_dict = {
    "City1": [DataFrame_file1, DataFrame_file2, DataFrame_file3],   # Files from `City1` subfolder
    "City2": [DataFrame_file4, DataFrame_file5],                    # Files from `City2` subfolder
    "City3": [DataFrame_file6]                                      # Files from `City3` subfolder
}
```
into a dict that maps each city to a single merged DataFrame (dict -> df_all_files_per_city):

```python
data_dict = {
    "City1": DataFrame_city1,                           # All files from `City1` subfolder, merged
    "City2": DataFrame_city2,                           # All files from `City2` subfolder, merged
    "City3": DataFrame_city3                            # All files from `City3` subfolder, merged
}
```

In [3]:
# Concatenate all DataFrames per city and replace each list with the merged DataFrame.
# Cities without any loaded file are dropped, because pd.concat raises a ValueError on an empty list.
data_dict = {
    city: pd.concat(df_list, ignore_index=True)
    for city, df_list in data_dict.items()
    if df_list
}

# Print the keys and the shape of the merged DataFrames to verify
for city, df in data_dict.items():
    print(f"{city}: {df.shape}")
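The files inside one city folder do not share the same columns (the combined output further down contains listing, calendar, and review columns side by side). `pd.concat` aligns the frames by column name and fills the gaps with `NaN`. A small sketch with made-up frames, where `listings` and `reviews` are hypothetical stand-ins for two files of one city:

```python
import pandas as pd

# Two made-up frames with disjoint columns
listings = pd.DataFrame({"id": [1, 2], "name": ["Loft", "Studio"]})
reviews = pd.DataFrame({"listing_id": [1], "comments": ["Great stay"]})

# Rows are stacked; the result holds the union of all columns
merged = pd.concat([listings, reviews], ignore_index=True)

print(merged.shape)
print(merged.isna().sum())
```

The merged frame therefore grows in both directions: rows add up, and every column that exists in only some files is mostly empty.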

In [5]:
print(data_dict['amsterdam'].keys())

KeyError: 'amsterdam'

## Merging into single df 

Merge all cities into a single DataFrame: a new column `region` holding the city name is added to every row, then all per-city DataFrames are concatenated.

In [32]:
df_list_with_region = []

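for city, df in data_dict.items():
    # Work on a copy so that re-running the cell does not fail on an already existing 'region' column
    df_with_region = df.copy()
    df_with_region.insert(0, 'region', city)
    df_list_with_region.append(df_with_region)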

combined_df = pd.concat(df_list_with_region, ignore_index=True)
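As an alternative sketch, `pd.concat` can take a dictionary directly and use its keys as an outer index level, which is then turned into the `region` column without modifying the per-city DataFrames. The small `example_dict` below is made up for illustration:

```python
import pandas as pd

# Made-up stand-in for data_dict after the per-city merge
example_dict = {
    "City1": pd.DataFrame({"id": [1, 2], "price": [100, 150]}),
    "City2": pd.DataFrame({"id": [3], "price": [80]}),
}

combined_example = (
    pd.concat(example_dict, names=["region"])  # dict keys become the outer index level 'region'
    .reset_index(level="region")               # move that level into a regular column
    .reset_index(drop=True)                    # discard the remaining per-city row index
)

print(combined_example)
```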


In [35]:
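# Save the combined DataFrame; create the output folder first if it does not exist
output_path = 'data/preprocessed_data.csv'
os.makedirs(os.path.dirname(output_path), exist_ok=True)
combined_df.to_csv(output_path, index=False)
print(f"Data saved to {output_path}")

# Display the combined DataFrame (must be the last expression in the cell to be rendered)
combined_df.head(5)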

Unnamed: 0,region,id,listing_url,scrape_id,last_scraped,source,name,description,neighborhood_overview,picture_url,...,listing_id,date,available,adjusted_price,type,features,reviewer_id,reviewer_name,comments,neighbourhood_group
0,amsterdam,6624170.0,https://www.airbnb.com/rooms/6624170,20240910000000.0,2024-09-05,previous scrape,"Warm, cozy sunlighted downtown appt",2 room appt. 1.8 km from central station with ...,,https://a0.muscache.com/pictures/df91da10-f7d4...,...,,,,,,,,,,
1,amsterdam,8837071.0,https://www.airbnb.com/rooms/8837071,20240910000000.0,2024-09-05,previous scrape,Cozy apartment in city center,Located in Amsterdam's sweet spot. A stone's t...,see the guide,https://a0.muscache.com/pictures/5fee12d4-61d0...,...,,,,,,,,,,
2,amsterdam,716107.0,https://www.airbnb.com/rooms/716107,20240910000000.0,2024-09-05,previous scrape,Loft style home nearby city centre,,,https://a0.muscache.com/pictures/9927048/b367a...,...,,,,,,,,,,
3,amsterdam,6.645388e+17,https://www.airbnb.com/rooms/664538756986273255,20240910000000.0,2024-09-06,previous scrape,Geweldige duurzame eco woonark op unieke plek!,This unique eco houseboat is located in the mo...,,https://a0.muscache.com/pictures/miso/Hosting-...,...,,,,,,,,,,
4,amsterdam,8191077.0,https://www.airbnb.com/rooms/8191077,20240910000000.0,2024-09-05,previous scrape,Old bar apartment,This just renovated apartment for 6 persons is...,Our apartment is situated in the centre of Ams...,https://a0.muscache.com/pictures/miso/Hosting-...,...,,,,,,,,,,
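In the output above, `id` is shown as a float (for example `6.645388e+17`). Rows that come from files without an `id` column get `NaN`, which forces the column to `float64`, and 18-digit ids no longer fit exactly into a float. One possible remedy, sketched here under the assumption that a recent pandas version is used, is to cast such columns to the nullable `Int64` dtype before concatenating. The frames below are made up; only the id value is taken from the `listing_url` shown above.

```python
import pandas as pd

big_id = 664538756986273255  # listing id taken from the listing_url in the output above

listings = pd.DataFrame({"id": [big_id]})
reviews = pd.DataFrame({"comments": ["Great stay"]})  # made-up row without an 'id' column

# Plain concat: the missing id becomes NaN, so the column is upcast to float64
lossy = pd.concat([listings, reviews], ignore_index=True)

# Nullable integer dtype: the missing id becomes <NA> and the integer is kept as is
safe = pd.concat([listings.astype({"id": "Int64"}), reviews], ignore_index=True)

print(lossy["id"].dtype, int(lossy.loc[0, "id"]) == big_id)
print(safe["id"].dtype, int(safe.loc[0, "id"]) == big_id)
```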
