# Building a Catalog

## Catalog Data Processing Classes

The code below implements a set of classes for fetching and processing catalog data from an API. This example uses data from the CNIL (Commission Nationale de l'Informatique et des Libertés). Each class is described in turn:

### GetSourceCatalog Class

This class provides basic functionalities for fetching and processing catalog data from an API.

#### Methods:
- `__init__(self, url, headers)`: Initializes the GetSourceCatalog object with the API URL and necessary headers for requests.
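- `fetch_data_from_api(self)`: Fetches data from the API and returns the parsed response. In the CNIL example this is a dictionary whose `@graph` key holds the list of catalog entries.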
- `response_to_dataframe(self, data, table_name, download_url, table_id=None, file_format=None, last_update=None, dataset_id=None, dataset_name=None, frequency=None, accessURL=None)`: Processes API response data into a DataFrame, with parameters to specify keys for different pieces of information.
- `save_to_csv(self, filename)`: Saves catalog data to a CSV file.
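
The key-mapping idea behind `response_to_dataframe` can be sketched as follows. This is a minimal illustration, not the class's actual implementation: the helper name `records_to_dataframe`, the `key_map` argument, and the sample records are all assumptions made for the example.

```python
import pandas as pd


def records_to_dataframe(records, key_map):
    """Build a DataFrame with one column per entry of key_map.

    key_map maps an output column name to the key to read in each record.
    Records that lack a key get a missing value in that column.
    (Hypothetical helper, for illustration only.)
    """
    rows = [
        {column: record.get(source_key) for column, source_key in key_map.items()}
        for record in records
    ]
    return pd.DataFrame(rows, columns=list(key_map))


# Made-up records shaped like the entries of an '@graph' list
sample_records = [
    {"title": "budget.csv", "downloadURL": "https://example.org/budget.csv", "format": "csv"},
    {"title": "Budget de la CNIL"},  # an entry without a downloadable file
]

df_sample = records_to_dataframe(
    sample_records,
    {"table_name": "title", "download_url": "downloadURL", "file_format": "format"},
)
```

Passing the keys as arguments, as the method signature above does, keeps the processing independent of any one API's field names.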

### CustomCatalog Class

This class inherits from GoogleConnector and extends its functionalities to work with custom catalogs.

#### Methods:
- `__init__(self, credentials_path, project_id=None, dataset_name=None)`: Initializes the CustomCatalog object with the credentials path, project ID, and dataset name.
- `create_catalog_gcs(self, zip_file)`: Creates a catalog from a zip file containing data, extracting information such as table name, update date, and file format.
- `bq_catalog_all_datasets(self)`: Gets BigQuery modification dates for all tables in all datasets of the project.
- `bq_raw_catalog(self)`: Gets BigQuery modification dates for all tables in a specific dataset.

### GetCnilCatalog Class

This class inherits from GetSourceCatalog and adds specific functionalities for working with CNIL catalog data.

#### Methods:
- `__init__(self, url, headers, url_additional_info)`: Initializes the GetCnilCatalog object with the CNIL API URL, headers, and the URL of a CSV file containing additional information.
- `load_additional_info(self)`: Loads additional information from the CSV file.
- `identify_datasets_info(self)`: Identifies dataset information and adds it to the catalog DataFrame.
- `merge_additional_info(self)`: Merges additional information into the catalog DataFrame.
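
A merge of this kind can be sketched with `pandas.DataFrame.merge`. The column names (`dataset_id`, `id`, `frequency`) and the sample values are assumptions chosen for illustration. The real CSV and the join key used by `merge_additional_info` may differ.

```python
import pandas as pd

# Hypothetical catalog: one row per file, tagged with the dataset it belongs to
catalog = pd.DataFrame({
    "table_name": ["budget.csv", "sanctions.xlsx"],
    "dataset_id": ["ds-1", "ds-2"],
})

# Hypothetical additional information: one row per dataset
additional = pd.DataFrame({
    "id": ["ds-1", "ds-2", "ds-3"],
    "frequency": ["annual", "monthly", "unknown"],
})

# A left join keeps every catalog row, even when no extra information exists
merged = catalog.merge(
    additional.rename(columns={"id": "dataset_id"}),
    on="dataset_id",
    how="left",
)
```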

### Code Execution

The code demonstrates how to use the GetCnilCatalog class to fetch, process, and save CNIL catalog data. Here's a summary of the steps:

1. Creating an instance of GetCnilCatalog with the CNIL API URL, headers, and the URL of the CSV file containing additional information.
2. Fetching data from the API.
3. Processing the data into a DataFrame using the response_to_dataframe method.
4. Loading additional information from the CSV file.
5. Identifying dataset information in the DataFrame.
6. Merging additional information into the DataFrame.
7. Saving the data to a CSV file.

The last line of the cell saves the catalog under `data/catalog/`. The current date is appended to the given name, as the output shows (`source_cnil_catalog_2024-04-18.csv`).


In [10]:
from classes.source_catalog import GetCnilCatalog

# CNIL catalog endpoint on data.gouv.fr and the CSV listing CNIL datasets
url = 'https://www.data.gouv.fr/api/1/organizations/534fff61a3a7292c64a77d59/catalog'
headers = {'accept': 'application/json'}
url_add = 'https://www.data.gouv.fr/fr/organizations/cnil/datasets.csv'

instance1 = GetCnilCatalog(url, headers, url_add)
data = instance1.fetch_data_from_api()
data = data['@graph']  # the catalog entries are under the '@graph' key

# Keys to read from each catalog entry
table_name = 'title'
download_url = 'downloadURL'
table_id = 'identifier'
file_format = 'format'
last_update = 'modified'
accessURL = '@id'

df_catalog = instance1.response_to_dataframe(
    data=data,
    table_name=table_name,
    download_url=download_url,
    table_id=table_id,
    file_format=file_format,
    last_update=last_update,
    accessURL=accessURL,
)
df_dataset = instance1.load_additional_info()
df_catalog = instance1.identify_datasets_info()
df_catalog = instance1.merge_additional_info()
instance1.save_to_csv('source_cnil_catalog')

Request is a success: 200
CSV file has been loaded to this path data/catalog/source_cnil_catalog_2024-04-18.csv


# Uploading Files to GCS

## File Upload to Google Cloud Storage (GCS) using Python

This code snippet demonstrates how to upload files, including data from a DataFrame, to Google Cloud Storage (GCS) using a Python script.

### Libraries Used:
- `pandas`: for handling data in DataFrames
- `requests`: for making HTTP requests to download data from URLs
- `zipfile` and `gzip`: for handling compressed files
- `os`: for interacting with the operating system
- `io`: for handling input/output operations
- `datetime`: for working with dates (the `date` class)
- `Google Cloud SDK`: for interfacing with Google Cloud Storage
- `colorama`: for colored output in the terminal

### Classes:
1. `GCSProcessor`: A class inheriting from `GoogleConnector` used to process data and upload it to GCS.

### Methods in `GCSProcessor`:
1. `__init__()`: Initializes the class with the GCS bucket name and the service account credentials path.
2. `create_bucket()`: Creates a new GCS bucket if it doesn't already exist.
3. `dl_and_up_from_URLs()`: Downloads data from multiple URLs and uploads it to GCS.
4. `upload_local_to_gcs()`: Uploads local files or DataFrame objects to GCS.
5. `list_blobs()`: Lists all the blobs (files) in the GCS bucket.
6. `extract_and_upload_selection()`: Extracts and uploads data from compressed files in the bucket.

### Code Execution:
1. Imports necessary libraries and the `GCSProcessor` class.
2. Sets up credentials, bucket name, and other necessary parameters.
3. Initializes a `GCSProcessor` object.
4. Creates a GCS bucket.
5. Prepares data for upload (in this case, a DataFrame `df_catalog`).
6. Defines destination folder and blob names.
7. Calls the `upload_local_to_gcs()` method to upload data to GCS.

### Usage:
- Replace `df_catalog` with the DataFrame containing the data to be uploaded.
- Modify `bucket_name` and `credential_path` with your GCS bucket details and service account credentials path.
- Adjust destination folder and blob names as required.
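
Before a DataFrame can be uploaded it has to be serialized and given a destination blob name. The sketch below shows one way to do both in memory. The helper names are hypothetical, and the `<date>/<folder>/<blob>` layout is an assumption inferred from paths that appear elsewhere in this notebook, such as `2024-04-15/raw/source_cnil_catalog_2024-04-15.csv`.

```python
import io
from datetime import date

import pandas as pd


def build_blob_name(upload_date, dest_folder, dest_blob):
    """Return '<date>/<folder>/<blob>' (layout assumed from the logged paths)."""
    return f"{upload_date.isoformat()}/{dest_folder}/{dest_blob}"


def dataframe_to_csv_buffer(df):
    """Serialize a DataFrame to UTF-8 CSV bytes held in memory."""
    return io.BytesIO(df.to_csv(index=False).encode("utf-8"))


# Made-up one-row catalog, used only to exercise the helpers
df_example = pd.DataFrame({"table_name": ["budget"], "file_format": ["csv"]})
blob_name = build_blob_name(date(2024, 4, 18), "raw", "source_cnil_catalog_2024-04-18.csv")
csv_buffer = dataframe_to_csv_buffer(df_example)
```

Because a DataFrame has no file name of its own, a destination blob name has to be supplied explicitly for this kind of upload.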


In [None]:
# from a dataframe (must have a dest_blob name)

from classes.gcs_processor import GCSProcessor
import os
from datetime import date

today = date.today()
bucket_name = 'cnil_csv'
credential_path = 'cred/service_account_local_py.json'
init2 = GCSProcessor(bucket_name=bucket_name, credentials_path=credential_path)
init2.create_bucket()
file_paths = [df_catalog]
dest_folder = 'raw'
dest_blobs = [f'source_cnil_catalog_{today}.csv']
init2.upload_local_to_gcs(file_paths=file_paths, dest_folder=dest_folder, dest_blobs=dest_blobs, date=today)

## File Upload to Google Cloud Storage (GCS) using Python - Local File

This code snippet demonstrates how to upload a local file to Google Cloud Storage (GCS) without requiring a destination blob name. The destination blob name will be the same as the file name.

### Libraries Used:
- `os`: for interacting with the operating system
- `datetime`: for working with dates (the `date` class)
- `Google Cloud SDK`: for interfacing with Google Cloud Storage

### Classes:
1. `GCSProcessor`: A class inheriting from `GoogleConnector` used to process data and upload it to GCS.

### Methods in `GCSProcessor`:
1. `__init__()`: Initializes the class with the GCS bucket name and the service account credentials path.
2. `create_bucket()`: Creates a new GCS bucket if it doesn't already exist.
3. `upload_local_to_gcs()`: Uploads local files to GCS. If no destination blob name is provided, it uses the file name as the destination blob name.

### Code Execution:
1. Imports necessary libraries and the `GCSProcessor` class.
2. Sets up credentials, bucket name, and other necessary parameters.
3. Initializes a `GCSProcessor` object.
4. Creates a GCS bucket.
5. Specifies the local file path.
6. Defines the destination folder (no destination blob name is needed here).
7. Calls the `upload_local_to_gcs()` method to upload the local file to GCS.

### Usage:
- Replace the file path (`f'data/catalog/source_cnil_catalog_{today}.csv'`) with the path to your local file.
- Modify `bucket_name` and `credential_path` with your GCS bucket details and service account credentials path.
- Adjust the destination folder and blob names as required.
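
The fallback described above, where a blob takes the name of the local file when no destination name is given, could look like the following. The function `default_blob_names` is a hypothetical helper written for this illustration and is not part of the notebook's classes.

```python
import os


def default_blob_names(file_paths, dest_blobs=None):
    """Return one blob name per file, falling back to each file's base name."""
    if dest_blobs is None:
        return [os.path.basename(path) for path in file_paths]
    if len(dest_blobs) != len(file_paths):
        raise ValueError("dest_blobs must contain one name per file path")
    return list(dest_blobs)


# Without explicit names, the directory part of each path is dropped
names = default_blob_names(["data/catalog/source_cnil_catalog_2024-04-18.csv"])
```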


In [None]:
# from a local file (doesn't require a dest_blob name, it will be the same as the file name)

from classes.gcs_processor import GCSProcessor
import os
from datetime import date

today = date.today()

bucket_name = 'cnil_csv'
credential_path = 'cred/service_account_local_py.json'
init2 = GCSProcessor(bucket_name=bucket_name, credentials_path=credential_path)
init2.create_bucket()
file_paths = [f'data/catalog/source_cnil_catalog_{today}.csv']
dest_folder = 'raw'
# No dest_blobs argument: the blob keeps the local file name
init2.upload_local_to_gcs(file_paths=file_paths, dest_folder=dest_folder, date=today)

## File Upload to Google Cloud Storage (GCS) using Python - From URL

This code snippet demonstrates how to download files from URLs and upload them to Google Cloud Storage (GCS) without requiring a destination blob name. The destination blob name will be the same as the file name.

### Libraries Used:
- `requests`: for making HTTP requests to download data from URLs
- `Google Cloud SDK`: for interfacing with Google Cloud Storage

### Classes:
1. `GCSProcessor`: A class inheriting from `GoogleConnector` used to process data and upload it to GCS.

### Methods in `GCSProcessor`:
1. `__init__()`: Initializes the class with the GCS bucket name and the service account credentials path.
2. `create_bucket()`: Creates a new GCS bucket if it doesn't already exist.
3. `dl_and_up_from_URLs()`: Downloads data from multiple URLs and uploads it to GCS. If no destination blob name is provided, it uses the file name from the URL as the destination blob name.

### Code Execution:
1. Sets up the GCS bucket name, service account credentials path, and other necessary parameters.
2. Initializes a `GCSProcessor` object.
3. Creates a GCS bucket.
4. Specifies the URLs from which data needs to be downloaded.
5. Defines the destination folder and destination blob names.
6. Calls the `dl_and_up_from_URLs()` method to download data from URLs and upload it to GCS.

### Usage:
- Replace the URLs (`url`) with the URLs from which you want to download data.
- Modify `bucket_name` and `credentials_path` with your GCS bucket details and service account credentials path.
- Adjust the destination folder and blob names as required.
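
Deriving a blob name from a URL can be sketched with the standard library, as below. The helper `blob_name_from_url` is hypothetical. The two URLs are the ones used in this section.

```python
import posixpath
from urllib.parse import urlparse


def blob_name_from_url(url):
    """Return the last path segment of a URL, ignoring query string and fragment."""
    path = urlparse(url).path
    return posixpath.basename(path.rstrip("/"))


csv_name = blob_name_from_url("https://www.data.gouv.fr/fr/organizations/cnil/datasets.csv")
opaque_name = blob_name_from_url(
    "https://www.data.gouv.fr/fr/datasets/r/0f678674-4327-4c4d-8819-b6f508b41d0e"
)
```

The second URL ends in an identifier rather than a file name, so the derived name carries no extension. Passing an explicit destination blob name is the safer choice for URLs like that.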


In [None]:
# from a URL (doesn't require a dest_blob name, it will be the same as the file name)

from classes.gcs_processor import GCSProcessor
import os
from datetime import date

today = date.today()

bucket_name = 'cnil_csv'
credentials_path = 'cred/service_account_local_py.json'
init2 = GCSProcessor(bucket_name=bucket_name, credentials_path=credentials_path)
init2.create_bucket()
urls = ['https://www.data.gouv.fr/fr/organizations/cnil/datasets.csv', 'https://www.data.gouv.fr/fr/datasets/r/0f678674-4327-4c4d-8819-b6f508b41d0e']
dest_folder = 'raw'
dest_blobs = ['datasets.csv', 'plaintes.csv']
init2.dl_and_up_from_URLs(urls=urls, dest_folder=dest_folder, dest_blobs=dest_blobs, date=today)

# Downloading Files from Catalog

## Download and Organize Dataset Content Class

The provided code defines a class `DlCatalogContentLocal` for downloading and organizing datasets based on a provided catalog. Below are the attributes and methods of this class:

### Attributes:
- `df_catalog` (pd.DataFrame): DataFrame containing the catalog information.

### Methods:
1. `__init__(catalog_path)`: Constructor method that initializes the object with the provided catalog path.
2. `get_tables()`: Downloads and organizes datasets based on the information in the catalog.
3. `zip_files()`: Zips all the downloaded files into a single archive.
4. `reorganize_file_name(file_name, last_date)`: Helper method to create a new filename with versioning based on the last update date.
5. `extract_date(date_str)`: Helper method to extract and convert date strings to datetime objects.

### Code Execution:

The code execution section demonstrates how to use the `DlCatalogContentLocal` class to download, organize, and zip dataset content from a provided catalog.

1. **Initialization**:
   - An instance of `DlCatalogContentLocal` is created with the path to the catalog CSV file (`source_cnil_catalog_{today}.csv`).

2. **Downloading and Organizing Datasets**:
   - The `get_tables()` method is called to download and organize datasets based on the information in the catalog. 
   - For each row in the catalog DataFrame, if a download URL is provided, the dataset is downloaded and organized into the appropriate folder structure based on the dataset name and last update date.

3. **Zipping Files**:
   - After downloading and organizing datasets, the `zip_files()` method is called to zip all the downloaded files into a single archive (`raw_datasets.zip`).

Together, these calls automate downloading, organizing, and zipping the datasets listed in a catalog.


In [None]:
from classes.download_catalog_content import DlCatalogContentLocal
from datetime import date

today = date.today()
catalog_path = f'data/catalog/source_cnil_catalog_{today}.csv'

instance3 = DlCatalogContentLocal(catalog_path=catalog_path)
instance3.get_tables()
instance3.zip_files()

## Download and Zip Files from GCS Catalog Class

The provided code defines a class `DLFromGCSCatalogToZip` for downloading and zipping files from a Google Cloud Storage (GCS) catalog. Below are the methods and attributes of this class:

### Attributes:
- `gcs_bucket_name`: Name of the Google Cloud Storage (GCS) bucket.
- `credentials_path`: Path to the service account credentials file.
- `zip_blob_name`: Name of the zip file in GCS.
- `project_id`: Optional project ID.

### Methods:
1. `__init__(self, gcs_bucket_name, credentials_path, zip_blob_name, project_id=None)`: Constructor method that initializes the object with the specified attributes.
2. `get_file_io(self)`: Retrieves the CSV catalog file from GCS and returns it as a BytesIO object.
3. `download_files_to_zip_io(self)`: Downloads files from URLs in the catalog and returns them as a list of tuples containing file paths and content.
4. `create_zip(self, files)`: Creates a zip file containing the downloaded files and returns it as a BytesIO object.
5. `extract_date(self, date_str)`: Helper method to extract and convert date strings to datetime objects.
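
Building a zip archive in memory from a list of `(path, content)` tuples, as `create_zip(files)` is described as doing, can be sketched with the standard `zipfile` module. The function below is an illustrative stand-in with made-up sample data. The class's real implementation may differ.

```python
import io
import zipfile


def create_zip(files):
    """Write (path, content) tuples into an in-memory zip and return the buffer."""
    zip_buffer = io.BytesIO()
    with zipfile.ZipFile(zip_buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path, content in files:
            # writestr accepts str or bytes; the path may include folder names
            archive.writestr(path, content)
    zip_buffer.seek(0)  # rewind so the caller can read from the start
    return zip_buffer


sample_files = [
    ("budget/budget_2024-03-29.csv", b"year;amount\n2023;1\n"),
    ("sanctions/sanctions_2024-04-08.csv", b"date;name\n"),
]
zip_io_example = create_zip(sample_files)
```

Keeping the archive in a `BytesIO` buffer avoids writing temporary files to disk before the upload.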

### Code Execution:

The code execution section demonstrates how to use the `DLFromGCSCatalogToZip` class to download and zip files from a GCS catalog.

1. **Initialization**:
   - An instance of `DLFromGCSCatalogToZip` is created with the specified GCS bucket name, credentials path, zip blob name, and optional project ID.

2. **Downloading and Zipping Files**:
   - The `get_file_io()` method is called to retrieve the CSV catalog file from GCS.
   - The `download_files_to_zip_io()` method is called to download files from URLs in the catalog.
   - The `create_zip()` method is called to create a zip file containing the downloaded files.
   - The zip file is then uploaded to GCS, which the cell below does with `upload_local_to_gcs()`.

Together, these steps automate downloading the files listed in a GCS-hosted catalog and bundling them into one zip archive.


In [5]:
from classes.gcs_processor import GCSProcessor
from datetime import date
import pandas as pd
import os

today = date.today()

gcs_bucket_name = 'cnil_csv'
credentials_path = 'cred/service_account_local_py.json'
blob_catalog = "2024-04-15/raw/source_cnil_catalog_2024-04-15.csv"

instance1 = GCSProcessor(bucket_name=gcs_bucket_name, credentials_path=credentials_path)
files = instance1.download_files_from_catalog(catalog_path=blob_catalog)
zip_file = instance1.create_zip_from_files(files)

file_paths = [zip_file]
dest_folder = 'raw'
dest_blobs = ['raw_datasets.zip']
instance1.upload_local_to_gcs(file_paths=file_paths, dest_folder=dest_folder, dest_blobs=dest_blobs, date=today)

Current file downloading: les-deliberations-de-la-cnil/CNIL: les délibérations de la Commission nationale de l'informatique et des libertés_2024-04-12.xml
Current file downloading: traitements-de-donnees-personnelles-declares-a-la-cnil-avant-le-25-mai-2018/Les traitements de données personnelles déclarés à la CNIL entre 1979 et le 24 mai 2018_2024-04-11.csv
File Organismes ayant désigné un(e) délégué(e) à la protection des données (DPD/DPO) does not have a download URL.
Current file downloading: organismes-ayant-designe-un-e-delegue-e-a-la-protection-des-donnees-dpd-dpo/opencnil-organismes-avec-dpo.xlsx_2024-04-08.xlsx
Current file downloading: organismes-ayant-designe-un-e-delegue-e-a-la-protection-des-donnees-dpd-dpo/opencnil-organismes-avec-dpo.csv_2024-04-08.csv


KeyboardInterrupt: 

# Preparing Data for Upload to BigQuery

In [5]:
from classes.gcs_processor import GCSProcessor
from datetime import date

today = date.today()

gcs_bucket_name = 'cnil_csv'
credentials_path = 'cred/service_account_local_py.json'

instance1 = GCSProcessor(bucket_name=gcs_bucket_name, credentials_path=credentials_path)
blob_name_zip = '2024-04-15/raw/raw_datasets.zip'
zip_file = instance1.get_zip_file_object(blob_name_zip)

In [None]:
# from classes.prep_data import PrepFilesBQ
# import pandas as pd

# instance5 = PrepFilesBQ(zip_file)
# zip_output = instance5.process_zip_file(zip_file)

In [6]:
import pandas as pd
from classes.prep_data import PrepDataCnilBQ

instance5 = PrepDataCnilBQ(zip_file)
dfs = instance5.process_dfs(zip_file)

current: les-deliberations-de-la-cnil/CNIL: les délibérations de la Commission nationale de l'informatique et des libertés_2024-04-12.xml
---------------------------------------------------
les-deliberations-de-la-cnil/CNIL: les délibérations de la Commission nationale de l'informatique et des libertés_2024-04-12.xml
<zipfile.ZipExtFile name="les-deliberations-de-la-cnil/CNIL: les délibérations de la Commission nationale de l'informatique et des libertés_2024-04-12.xml" mode='r' compress_type=deflate>
les-deliberations-de-la-cnil/CNIL: les délibérations de la Commission nationale de l'informatique et des libertés_2024-04-12.xml
opened df, return from open_df
this is df
les-deliberations-de-la-cnil/CNIL: les délibérations de la Commission nationale de l'informatique et des libertés_2024-04-12.xml not processed!
---------------------------------------------------
current: traitements-de-donnees-personnelles-declares-a-la-cnil-avant-le-25-mai-2018/Les trait

  warn("Workbook contains no default style, apply openpyxl's default")


(98227, 26)
try to find headers in 2nd row
opened df, return from open_df
this is df
More rows than columns, no need to transpose
The column_formatter method worked perfectly.
The column_formatter method worked perfect

  warn("Workbook contains no default style, apply openpyxl's default")


(18907, 12)
try to find headers in 2nd row
opened df, return from open_df
this is df
More rows than columns, no need to transpose
The column_formatter method worked perfectly.
Re-execution completed.
traitements-de-donnees-personnelles-declares-a-la-cnil-depuis-le-25-mai-2018/Formalités préalables reçues par la CNIL depuis le 25 mai 2018_2024-03-25.xlsx processed successfully!
---------------------------------------------------
current: sanctions-prononcees-par-la-cnil/opencn

  warn("""Cannot parse header or footer so it will be ignored""")


(430, 7)
opened df, return from open_df
this is df
More rows than columns, no need to transpose
The column_formatter method worked perfectly.
Re-execution completed.
controles-realises-par-la-cnil/Liste des contrôles réalisés par la CNIL en 2016_2017-03-30.xlsx processed successfully!
---------------------------------------------------
current: controles-realises-par-la-cnil/Liste des contrôles réalisés par la CNIL en 2015_2016-05-03.csv
---------------------------------------------------
controles-realises-par-la-cnil/Liste des contrôles réalisés par la CNIL en 2015_2016-05-03.csv
<zipfile.ZipExtFile name='controles-realises-par-la-cnil/Liste des contrôles réalisés par la CNIL en 2015_2016-05-03.csv' mode='r' compress_type=deflate>


  warn("""Cannot parse header or footer so it will be ignored""")


In [None]:
import zipfile

def save_dfs_to_zip(dfs, zip_filename):
    """
    Save several DataFrames as CSV files inside one zip archive on disk.

    Parameters:
        dfs (list[dict]): One dict per table, with the pandas DataFrame under
            'df' and its file name inside the archive under 'path'.
        zip_filename (str): The name of the zip file to create.

    Returns:
        None
    """
    filename = None  # stays None when dfs is empty
    with zipfile.ZipFile(zip_filename, 'w') as zipf:
        for file in dfs:
            df = file['df']
            filename = file['path']
            # Serialize the DataFrame to a CSV string in memory
            csv_buffer = df.to_csv(index=False, sep=';')
            # Add the CSV text to the zip archive under the given path
            zipf.writestr(filename, csv_buffer)

    # Only the last file written is reported
    print(f"DataFrame saved to {zip_filename} as {filename}")

# Example usage: write every prepared DataFrame to a local zip archive
save_dfs_to_zip(dfs, 'prep_data.zip')

DataFrame saved to prep_data.zip as controles_realises_par_la_cnil/Liste_des_contrôles_réalisés_par_la_CNIL_en_2014_2015_06_15_xlsx.csv


In [12]:
import io
import zipfile

def save_dfs_to_zip(dfs):
    """
    Save several DataFrames as CSV files inside an in-memory zip archive.

    Parameters:
        dfs (list[dict]): One dict per table, with the pandas DataFrame under
            'df' and its file name inside the archive under 'path'.

    Returns:
        io.BytesIO: The zip archive, rewound to the start of the buffer.
    """
    zip_output = io.BytesIO()
    with zipfile.ZipFile(zip_output, 'w') as zipf:
        for file in dfs:
            df = file['df']
            filename = file['path']
            # Serialize the DataFrame to a CSV string in memory
            csv_buffer = df.to_csv(index=False, sep=';')
            # Add the CSV text to the zip archive under the given path
            zipf.writestr(filename, csv_buffer)

    zip_output.seek(0)
    return zip_output

zip_io = save_dfs_to_zip(dfs)

In [2]:
import pandas as pd
from classes.prep_data import PrepDataCnilBQ

instance5 = PrepDataCnilBQ(zip_file)
zip_output = instance5.process_zip_io_file(zip_file_io=zip_file)

current: les-deliberations-de-la-cnil/CNIL: les délibérations de la Commission nationale de l'informatique et des libertés_2024-04-12.xml
---------------------------------------------------
les-deliberations-de-la-cnil/CNIL: les délibérations de la Commission nationale de l'informatique et des libertés_2024-04-12.xml
<zipfile.ZipExtFile name="les-deliberations-de-la-cnil/CNIL: les délibérations de la Commission nationale de l'informatique et des libertés_2024-04-12.xml" mode='r' compress_type=deflate>
les-deliberations-de-la-cnil/CNIL: les délibérations de la Commission nationale de l'informatique et des libertés_2024-04-12.xml
opened df, return from open_df
this is df
les-deliberations-de-la-cnil/CNIL: les délibérations de la Commission nationale de l'informatique et des libertés_2024-04-12.xml not processed!
---------------------------------------------------
current: traitements-de-donnees-personnelles-declares-a-la-cnil-avant-le-25-mai-2018/Les trait

  warn("Workbook contains no default style, apply openpyxl's default")


(98227, 26)
try to find headers in 2nd row
opened df, return from open_df
this is df
More rows than columns, no need to transpose
The column_formatter method worked perfectly.
The column_formatter method worked perfect

  warn("Workbook contains no default style, apply openpyxl's default")


(18907, 12)
try to find headers in 2nd row
opened df, return from open_df
this is df
More rows than columns, no need to transpose
The column_formatter method worked perfectly.
Re-execution completed.
traitements-de-donnees-personnelles-declares-a-la-cnil-depuis-le-25-mai-2018/Formalités préalables reçues par la CNIL depuis le 25 mai 2018_2024-03-25.xlsx processed successfully!
---------------------------------------------------
current: sanctions-prononcees-par-la-cnil/opencn

  warn("""Cannot parse header or footer so it will be ignored""")


(430, 7)
opened df, return from open_df
this is df
More rows than columns, no need to transpose
The column_formatter method worked perfectly.
Re-execution completed.
controles-realises-par-la-cnil/Liste des contrôles réalisés par la CNIL en 2016_2017-03-30.xlsx processed successfully!
---------------------------------------------------
current: controles-realises-par-la-cnil/Liste des contrôles réalisés par la CNIL en 2015_2016-05-03.csv
---------------------------------------------------
controles-realises-par-la-cnil/Liste des contrôles réalisés par la CNIL en 2015_2016-05-03.csv
Exception type: TypeError
Exception: expected str, bytes or os.PathLike object, not ZipExtFile
cant read as csv
<zipfile.ZipExtFile name

  warn("""Cannot parse header or footer so it will be ignored""")


In [13]:
from classes.gcs_processor import GCSProcessor

bucket_name = 'cnil_csv'
credentials_path = 'cred/service_account_local_py.json'
init2 = GCSProcessor(bucket_name=bucket_name, credentials_path=credentials_path)
init2.create_bucket()
file_paths = [zip_io]  # in-memory zip returned by save_dfs_to_zip
dest_folder = 'prep'
dest_blobs = ['prep_gbq_datasets.zip']
# `today` is defined in an earlier cell
init2.upload_local_to_gcs(file_paths=file_paths, dest_folder=dest_folder, dest_blobs=dest_blobs, date=today)

Bucket already exists.
file 2024-04-18/prep_test/prep_gbq_datasets.zip uploaded to GCS successfully to 2024-04-18/prep_test/prep_gbq_datasets.zip.


In [None]:
from classes.gcs_processor import GCSProcessor

bucket_name = 'cnil_csv'
credentials_path = 'cred/service_account_local_py.json'
init2 = GCSProcessor(bucket_name=bucket_name, credentials_path=credentials_path)
prefix = 'prep'
blobs = init2.list_blobs(prefix=prefix)
init2.extract_and_upload_selection(blobs = blobs, folder_name='prep/extracted', date=today)

# Loading Files from GCS into BigQuery

In [1]:
from classes.gcs_processor import GCSProcessor
from datetime import date

today = date.today()

gcs_bucket_name = 'cnil_csv'
credentials_path = 'cred/service_account_local_py.json'
folder = "2024-04-18/prep"

instance1 = GCSProcessor(bucket_name=gcs_bucket_name, credentials_path=credentials_path)
list_blob = instance1.list_blobs(folder)
zip_file = instance1.get_zip_file_object(list_blob[0].name)

cnil_csv


In [2]:
from classes.gcs_to_gcp import FromGCStoGBQ

# usage example
credentials_path = 'cred/service_account_local_py.json'
project_id = 'cnil-392113'
dataset_name = 'raw_data'

processor_bq = FromGCStoGBQ(credentials_path, project_id, dataset_name)
processor_bq.create_dataset()
processor_bq.upload_zipio_to_bq(zip_file)

Created dataset (or already exists) cnil-392113.raw_data
---------------------
organismes_ayant_designe_un_e_delegue_e_a_la_protection_des_donnees_dpd_dpo/opencnil_organismes_avec_dpo_xlsx_2024_04_08_xlsx.csv
opencnil_organismes_avec_dpo_xlsx_2024_04_08_xlsx.csv
2024-04-08
xlsx
opencnil_organismes_avec_dpo_xlsx
this is the table name:  cnil-392113.raw_data.opencnil_organismes_avec_dpo_xlsx
---------------------


1it [00:06,  6.97s/it]


organismes_ayant_designe_un_e_delegue_e_a_la_protection_des_donnees_dpd_dpo/opencnil_organismes_avec_dpo_xlsx_2024_04_08_xlsx.csv is uploaded to cnil-392113.raw_data.opencnil_organismes_avec_dpo_xlsx
---------------------
traitements_de_donnees_personnelles_declares_a_la_cnil_depuis_le_25_mai_2018/Formalités_préalables_reçues_par_la_CNIL_depuis_le_25_mai_2018_2024_04_08_csv.csv
Formalités_préalables_reçues_par_la_CNIL_depuis_le_25_mai_2018_2024_04_08_csv.csv
2024-04-08
csv
formalites_prealables_recues_par_la_cnil_depuis_le_25_mai_2018
this is the table name:  cnil-392113.raw_data.formalites_prealables_recues_par_la_cnil_depuis_le_25_mai_2018
---------------------


1it [00:05,  5.32s/it]


traitements_de_donnees_personnelles_declares_a_la_cnil_depuis_le_25_mai_2018/Formalités_préalables_reçues_par_la_CNIL_depuis_le_25_mai_2018_2024_04_08_csv.csv is uploaded to cnil-392113.raw_data.formalites_prealables_recues_par_la_cnil_depuis_le_25_mai_2018
---------------------
budget_de_la_cnil_1/opencnil_budget_depuis_2000_maj_mars_2024_csv_2024_03_29_csv.csv
opencnil_budget_depuis_2000_maj_mars_2024_csv_2024_03_29_csv.csv
2024-03-29
csv
opencnil_budget_depuis_2000_maj_mars_2024_csv
this is the table name:  cnil-392113.raw_data.opencnil_budget_depuis_2000_maj_mars_2024_csv
---------------------


1it [00:04,  4.11s/it]

budget_de_la_cnil_1/opencnil_budget_depuis_2000_maj_mars_2024_csv_2024_03_29_csv.csv is uploaded to cnil-392113.raw_data.opencnil_budget_depuis_2000_maj_mars_2024_csv
---------------------
budget_de_la_cnil_1/opencnil_budget_depuis_2000_maj_mars_2024_xlsx_2024_03_29_xlsx.csv
opencnil_budget_depuis_2000_maj_mars_2024_xlsx_2024_03_29_xlsx.csv
2024-03-29
xlsx
opencnil_budget_depuis_2000_maj_mars_2024_xlsx
this is the table name:  cnil-392113.raw_data.opencnil_budget_depuis_2000_maj_mars_2024_xlsx
---------------------


1it [00:02,  2.96s/it]

budget_de_la_cnil_1/opencnil_budget_depuis_2000_maj_mars_2024_xlsx_2024_03_29_xlsx.csv is uploaded to cnil-392113.raw_data.opencnil_budget_depuis_2000_maj_mars_2024_xlsx
---------------------
notifications_a_la_cnil_de_violations_de_donnees_a_caractere_personnel/opencnil_violationsdcpnotifiees_20231231_xlsx_2024_03_29_xlsx.csv
opencnil_violationsdcpnotifiees_20231231_xlsx_2024_03_29_xlsx.csv
2024-03-29
xlsx
opencnil_violationsdcpnotifiees_20231231_xlsx
this is the table name:  cnil-392113.raw_data.opencnil_violationsdcpnotifiees_20231231_xlsx
---------------------


1it [00:04,  4.05s/it]

notifications_a_la_cnil_de_violations_de_donnees_a_caractere_personnel/opencnil_violationsdcpnotifiees_20231231_xlsx_2024_03_29_xlsx.csv is uploaded to cnil-392113.raw_data.opencnil_violationsdcpnotifiees_20231231_xlsx
---------------------
notifications_a_la_cnil_de_violations_de_donnees_a_caractere_personnel/opencnil_violationsdcpnotifiees_20231231_csv_2024_03_29_csv.csv
opencnil_violationsdcpnotifiees_20231231_csv_2024_03_29_csv.csv
2024-03-29
csv
opencnil_violationsdcpnotifiees_20231231_csv
this is the table name:  cnil-392113.raw_data.opencnil_violationsdcpnotifiees_20231231_csv
---------------------


1it [00:06,  6.20s/it]

notifications_a_la_cnil_de_violations_de_donnees_a_caractere_personnel/opencnil_violationsdcpnotifiees_20231231_csv_2024_03_29_csv.csv is uploaded to cnil-392113.raw_data.opencnil_violationsdcpnotifiees_20231231_csv
---------------------
traitements_de_donnees_personnelles_declares_a_la_cnil_depuis_le_25_mai_2018/Formalités_préalables_reçues_par_la_CNIL_depuis_le_25_mai_2018_2024_03_25_xlsx.csv
Formalités_préalables_reçues_par_la_CNIL_depuis_le_25_mai_2018_2024_03_25_xlsx.csv
2024-03-25
xlsx
formalites_prealables_recues_par_la_cnil_depuis_le_25_mai_2018
this is the table name:  cnil-392113.raw_data.formalites_prealables_recues_par_la_cnil_depuis_le_25_mai_2018
---------------------


1it [00:04,  4.49s/it]

traitements_de_donnees_personnelles_declares_a_la_cnil_depuis_le_25_mai_2018/Formalités_préalables_reçues_par_la_CNIL_depuis_le_25_mai_2018_2024_03_25_xlsx.csv is uploaded to cnil-392113.raw_data.formalites_prealables_recues_par_la_cnil_depuis_le_25_mai_2018
---------------------
sanctions_prononcees_par_la_cnil/opencnil_sanctions_depuis_2019_maj_nov_2023_csv_2023_11_24_csv.csv
opencnil_sanctions_depuis_2019_maj_nov_2023_csv_2023_11_24_csv.csv
2023-11-24
csv
opencnil_sanctions_depuis_2019_maj_nov_2023_csv
this is the table name:  cnil-392113.raw_data.opencnil_sanctions_depuis_2019_maj_nov_2023_csv
---------------------


1it [00:03,  3.11s/it]

sanctions_prononcees_par_la_cnil/opencnil_sanctions_depuis_2019_maj_nov_2023_csv_2023_11_24_csv.csv is uploaded to cnil-392113.raw_data.opencnil_sanctions_depuis_2019_maj_nov_2023_csv
---------------------
sanctions_prononcees_par_la_cnil/opencnil_sanctions_depuis_2019_maj_nov_2023_xlsx_2023_11_24_xlsx.csv
opencnil_sanctions_depuis_2019_maj_nov_2023_xlsx_2023_11_24_xlsx.csv
2023-11-24
xlsx
opencnil_sanctions_depuis_2019_maj_nov_2023_xlsx
this is the table name:  cnil-392113.raw_data.opencnil_sanctions_depuis_2019_maj_nov_2023_xlsx
---------------------


1it [00:03,  3.43s/it]

sanctions_prononcees_par_la_cnil/opencnil_sanctions_depuis_2019_maj_nov_2023_xlsx_2023_11_24_xlsx.csv is uploaded to cnil-392113.raw_data.opencnil_sanctions_depuis_2019_maj_nov_2023_xlsx
---------------------
protection_des_donnees_personnelles_dans_le_monde/opencnil_autorites_de_protection_vd_20231010_xlsx_2023_11_09_xlsx.csv
opencnil_autorites_de_protection_vd_20231010_xlsx_2023_11_09_xlsx.csv
2023-11-09
xlsx
opencnil_autorites_de_protection_vd_20231010_xlsx
this is the table name:  cnil-392113.raw_data.opencnil_autorites_de_protection_vd_20231010_xlsx
---------------------


1it [00:03,  3.82s/it]

protection_des_donnees_personnelles_dans_le_monde/opencnil_autorites_de_protection_vd_20231010_xlsx_2023_11_09_xlsx.csv is uploaded to cnil-392113.raw_data.opencnil_autorites_de_protection_vd_20231010_xlsx
---------------------
controles_realises_par_la_cnil/open_data_controles_cnil_2022_v20231003_xlsx_2023_10_03_xlsx.csv
open_data_controles_cnil_2022_v20231003_xlsx_2023_10_03_xlsx.csv
2023-10-03
xlsx
open_data_controles_cnil_2022_v20231003_xlsx
this is the table name:  cnil-392113.raw_data.open_data_controles_cnil_2022_v20231003_xlsx
---------------------


1it [00:02,  2.50s/it]

controles_realises_par_la_cnil/open_data_controles_cnil_2022_v20231003_xlsx_2023_10_03_xlsx.csv is uploaded to cnil-392113.raw_data.open_data_controles_cnil_2022_v20231003_xlsx
---------------------
mises_en_demeure_prononcees_par_la_cnil/open_cnil_volumes_med_depuis_2014_maj_aout_2023_csv_2023_08_25_csv.csv
open_cnil_volumes_med_depuis_2014_maj_aout_2023_csv_2023_08_25_csv.csv
2023-08-25
csv
open_cnil_volumes_med_depuis_2014_maj_aout_2023_csv
this is the table name:  cnil-392113.raw_data.open_cnil_volumes_med_depuis_2014_maj_aout_2023_csv
---------------------


1it [00:03,  3.22s/it]

mises_en_demeure_prononcees_par_la_cnil/open_cnil_volumes_med_depuis_2014_maj_aout_2023_csv_2023_08_25_csv.csv is uploaded to cnil-392113.raw_data.open_cnil_volumes_med_depuis_2014_maj_aout_2023_csv
---------------------
mises_en_demeure_prononcees_par_la_cnil/open_cnil_volumes_med_depuis_2014_maj_aout_2023_xlsx_2023_08_25_xlsx.csv
open_cnil_volumes_med_depuis_2014_maj_aout_2023_xlsx_2023_08_25_xlsx.csv
2023-08-25
xlsx
open_cnil_volumes_med_depuis_2014_maj_aout_2023_xlsx
this is the table name:  cnil-392113.raw_data.open_cnil_volumes_med_depuis_2014_maj_aout_2023_xlsx
---------------------


1it [00:02,  2.21s/it]

mises_en_demeure_prononcees_par_la_cnil/open_cnil_volumes_med_depuis_2014_maj_aout_2023_xlsx_2023_08_25_xlsx.csv is uploaded to cnil-392113.raw_data.open_cnil_volumes_med_depuis_2014_maj_aout_2023_xlsx
---------------------
exercice_des_droits_indirect_donnees_generales/opencnil_volumes_dai_edi_depuis_1984_maj_juin_2023_csv_2023_06_28_csv.csv
opencnil_volumes_dai_edi_depuis_1984_maj_juin_2023_csv_2023_06_28_csv.csv
2023-06-28
csv
opencnil_volumes_dai_edi_depuis_1984_maj_juin_2023_csv
this is the table name:  cnil-392113.raw_data.opencnil_volumes_dai_edi_depuis_1984_maj_juin_2023_csv
---------------------


1it [00:02,  2.99s/it]

exercice_des_droits_indirect_donnees_generales/opencnil_volumes_dai_edi_depuis_1984_maj_juin_2023_csv_2023_06_28_csv.csv is uploaded to cnil-392113.raw_data.opencnil_volumes_dai_edi_depuis_1984_maj_juin_2023_csv
---------------------
exercice_des_droits_indirect_donnees_generales/opencnil_volumes_dai_edi_depuis_1984_maj_juin_2023_xlsx_2023_06_28_xlsx.csv
opencnil_volumes_dai_edi_depuis_1984_maj_juin_2023_xlsx_2023_06_28_xlsx.csv
2023-06-28
xlsx
opencnil_volumes_dai_edi_depuis_1984_maj_juin_2023_xlsx
this is the table name:  cnil-392113.raw_data.opencnil_volumes_dai_edi_depuis_1984_maj_juin_2023_xlsx
---------------------


1it [00:04,  4.42s/it]

exercice_des_droits_indirect_donnees_generales/opencnil_volumes_dai_edi_depuis_1984_maj_juin_2023_xlsx_2023_06_28_xlsx.csv is uploaded to cnil-392113.raw_data.opencnil_volumes_dai_edi_depuis_1984_maj_juin_2023_xlsx
---------------------
effectifs_de_la_cnil/opencnil_effectifs_depuis_1980_maj_juin_2023_xlsx_2023_06_28_xlsx.csv
opencnil_effectifs_depuis_1980_maj_juin_2023_xlsx_2023_06_28_xlsx.csv
2023-06-28
xlsx
opencnil_effectifs_depuis_1980_maj_juin_2023_xlsx
this is the table name:  cnil-392113.raw_data.opencnil_effectifs_depuis_1980_maj_juin_2023_xlsx
---------------------


1it [00:05,  5.96s/it]

effectifs_de_la_cnil/opencnil_effectifs_depuis_1980_maj_juin_2023_xlsx_2023_06_28_xlsx.csv is uploaded to cnil-392113.raw_data.opencnil_effectifs_depuis_1980_maj_juin_2023_xlsx
---------------------
effectifs_de_la_cnil/opencnil_effectifs_depuis_1980_maj_juin_2023_csv_2023_06_28_csv.csv
opencnil_effectifs_depuis_1980_maj_juin_2023_csv_2023_06_28_csv.csv
2023-06-28
csv
opencnil_effectifs_depuis_1980_maj_juin_2023_csv
this is the table name:  cnil-392113.raw_data.opencnil_effectifs_depuis_1980_maj_juin_2023_csv
---------------------


1it [00:02,  2.86s/it]

effectifs_de_la_cnil/opencnil_effectifs_depuis_1980_maj_juin_2023_csv_2023_06_28_csv.csv is uploaded to cnil-392113.raw_data.opencnil_effectifs_depuis_1980_maj_juin_2023_csv
---------------------
plaintes_recues_par_la_cnil/opencnil_volumes_plaintes_depuis_1981_maj_juin_2023_xlsx_2023_06_28_xlsx.csv
opencnil_volumes_plaintes_depuis_1981_maj_juin_2023_xlsx_2023_06_28_xlsx.csv
2023-06-28
xlsx
opencnil_volumes_plaintes_depuis_1981_maj_juin_2023_xlsx
this is the table name:  cnil-392113.raw_data.opencnil_volumes_plaintes_depuis_1981_maj_juin_2023_xlsx
---------------------


1it [00:02,  2.73s/it]

plaintes_recues_par_la_cnil/opencnil_volumes_plaintes_depuis_1981_maj_juin_2023_xlsx_2023_06_28_xlsx.csv is uploaded to cnil-392113.raw_data.opencnil_volumes_plaintes_depuis_1981_maj_juin_2023_xlsx
---------------------
plaintes_recues_par_la_cnil/opencnil_volumes_plaintes_depuis_1981_maj_juin_2023_csv_2023_06_28_csv.csv
opencnil_volumes_plaintes_depuis_1981_maj_juin_2023_csv_2023_06_28_csv.csv
2023-06-28
csv
opencnil_volumes_plaintes_depuis_1981_maj_juin_2023_csv
this is the table name:  cnil-392113.raw_data.opencnil_volumes_plaintes_depuis_1981_maj_juin_2023_csv
---------------------


1it [00:02,  2.81s/it]

plaintes_recues_par_la_cnil/opencnil_volumes_plaintes_depuis_1981_maj_juin_2023_csv_2023_06_28_csv.csv is uploaded to cnil-392113.raw_data.opencnil_volumes_plaintes_depuis_1981_maj_juin_2023_csv
---------------------
controles_realises_par_la_cnil/opencnil_nombre_controles_depuis_1990_maj_juin_2023_csv_2023_06_28_csv.csv
opencnil_nombre_controles_depuis_1990_maj_juin_2023_csv_2023_06_28_csv.csv
2023-06-28
csv
opencnil_nombre_controles_depuis_1990_maj_juin_2023_csv
this is the table name:  cnil-392113.raw_data.opencnil_nombre_controles_depuis_1990_maj_juin_2023_csv
---------------------


1it [00:03,  3.78s/it]

controles_realises_par_la_cnil/opencnil_nombre_controles_depuis_1990_maj_juin_2023_csv_2023_06_28_csv.csv is uploaded to cnil-392113.raw_data.opencnil_nombre_controles_depuis_1990_maj_juin_2023_csv
---------------------
controles_realises_par_la_cnil/opencnil_nombre_controles_depuis_1990_maj_juin_2023_xlsx_2023_06_28_xlsx.csv
opencnil_nombre_controles_depuis_1990_maj_juin_2023_xlsx_2023_06_28_xlsx.csv
2023-06-28
xlsx
opencnil_nombre_controles_depuis_1990_maj_juin_2023_xlsx
this is the table name:  cnil-392113.raw_data.opencnil_nombre_controles_depuis_1990_maj_juin_2023_xlsx
---------------------


1it [00:02,  2.70s/it]

controles_realises_par_la_cnil/opencnil_nombre_controles_depuis_1990_maj_juin_2023_xlsx_2023_06_28_xlsx.csv is uploaded to cnil-392113.raw_data.opencnil_nombre_controles_depuis_1990_maj_juin_2023_xlsx
---------------------
controles_realises_par_la_cnil/open_data_controles_2021_v20220921_xlsx_2022_10_25_xlsx.csv
open_data_controles_2021_v20220921_xlsx_2022_10_25_xlsx.csv
2022-10-25
xlsx
open_data_controles_2021_v20220921_xlsx
this is the table name:  cnil-392113.raw_data.open_data_controles_2021_v20220921_xlsx
---------------------


1it [00:02,  2.40s/it]

controles_realises_par_la_cnil/open_data_controles_2021_v20220921_xlsx_2022_10_25_xlsx.csv is uploaded to cnil-392113.raw_data.open_data_controles_2021_v20220921_xlsx
---------------------
controles_realises_par_la_cnil/open_data_controles_2020_vd_20210603_csv_2021_06_03_csv.csv
open_data_controles_2020_vd_20210603_csv_2021_06_03_csv.csv
2021-06-03
csv
open_data_controles_2020_vd_20210603_csv
this is the table name:  cnil-392113.raw_data.open_data_controles_2020_vd_20210603_csv
---------------------


1it [00:02,  2.69s/it]

controles_realises_par_la_cnil/open_data_controles_2020_vd_20210603_csv_2021_06_03_csv.csv is uploaded to cnil-392113.raw_data.open_data_controles_2020_vd_20210603_csv
---------------------
controles_realises_par_la_cnil/open_data_controles_2020_vd_20210603_xlsx_2021_06_03_xlsx.csv
open_data_controles_2020_vd_20210603_xlsx_2021_06_03_xlsx.csv
2021-06-03
xlsx
open_data_controles_2020_vd_20210603_xlsx
this is the table name:  cnil-392113.raw_data.open_data_controles_2020_vd_20210603_xlsx
---------------------


1it [00:04,  4.86s/it]

controles_realises_par_la_cnil/open_data_controles_2020_vd_20210603_xlsx_2021_06_03_xlsx.csv is uploaded to cnil-392113.raw_data.open_data_controles_2020_vd_20210603_xlsx
---------------------
marches_publics_de_la_cnil/opencnil_marches_publics_2014_2020_xlsx_2021_06_02_xlsx.csv
opencnil_marches_publics_2014_2020_xlsx_2021_06_02_xlsx.csv
2021-06-02
xlsx
opencnil_marches_publics_2014_2020_xlsx
this is the table name:  cnil-392113.raw_data.opencnil_marches_publics_2014_2020_xlsx
---------------------


1it [00:03,  3.41s/it]

marches_publics_de_la_cnil/opencnil_marches_publics_2014_2020_xlsx_2021_06_02_xlsx.csv is uploaded to cnil-392113.raw_data.opencnil_marches_publics_2014_2020_xlsx
---------------------
controles_realises_par_la_cnil/opencnil_liste_controles_2019_xlsx_2020_11_13_xlsx.csv
opencnil_liste_controles_2019_xlsx_2020_11_13_xlsx.csv
2020-11-13
xlsx
opencnil_liste_controles_2019_xlsx
this is the table name:  cnil-392113.raw_data.opencnil_liste_controles_2019_xlsx
---------------------


1it [00:05,  5.74s/it]

controles_realises_par_la_cnil/opencnil_liste_controles_2019_xlsx_2020_11_13_xlsx.csv is uploaded to cnil-392113.raw_data.opencnil_liste_controles_2019_xlsx
---------------------
controles_realises_par_la_cnil/opencnil_liste_controles_2019_csv_2020_07_03_csv.csv
opencnil_liste_controles_2019_csv_2020_07_03_csv.csv
2020-07-03
csv
opencnil_liste_controles_2019_csv
this is the table name:  cnil-392113.raw_data.opencnil_liste_controles_2019_csv
---------------------


1it [00:02,  2.82s/it]

controles_realises_par_la_cnil/opencnil_liste_controles_2019_csv_2020_07_03_csv.csv is uploaded to cnil-392113.raw_data.opencnil_liste_controles_2019_csv
---------------------
controles_realises_par_la_cnil/opencnil_liste_controles_2018_csv_2019_05_16_csv.csv
opencnil_liste_controles_2018_csv_2019_05_16_csv.csv
2019-05-16
csv
opencnil_liste_controles_2018_csv
this is the table name:  cnil-392113.raw_data.opencnil_liste_controles_2018_csv
---------------------


1it [00:02,  2.79s/it]

controles_realises_par_la_cnil/opencnil_liste_controles_2018_csv_2019_05_16_csv.csv is uploaded to cnil-392113.raw_data.opencnil_liste_controles_2018_csv
---------------------
controles_realises_par_la_cnil/opencnil_liste_controles_2018_xlsx_2019_05_16_xlsx.csv
opencnil_liste_controles_2018_xlsx_2019_05_16_xlsx.csv
2019-05-16
xlsx
opencnil_liste_controles_2018_xlsx
this is the table name:  cnil-392113.raw_data.opencnil_liste_controles_2018_xlsx
---------------------


1it [00:03,  3.62s/it]

controles_realises_par_la_cnil/opencnil_liste_controles_2018_xlsx_2019_05_16_xlsx.csv is uploaded to cnil-392113.raw_data.opencnil_liste_controles_2018_xlsx
---------------------
sanctions_prononcees_par_la_cnil/open_cnil_ventilation_sanctions_depuis_2014_vd_csv_2019_05_14_csv.csv
open_cnil_ventilation_sanctions_depuis_2014_vd_csv_2019_05_14_csv.csv
2019-05-14
csv
open_cnil_ventilation_sanctions_depuis_2014_vd_csv
this is the table name:  cnil-392113.raw_data.open_cnil_ventilation_sanctions_depuis_2014_vd_csv
---------------------


1it [00:05,  5.64s/it]

sanctions_prononcees_par_la_cnil/open_cnil_ventilation_sanctions_depuis_2014_vd_csv_2019_05_14_csv.csv is uploaded to cnil-392113.raw_data.open_cnil_ventilation_sanctions_depuis_2014_vd_csv
---------------------
sanctions_prononcees_par_la_cnil/open_cnil_ventilation_sanctions_depuis_2014_vd_xlsx_2019_05_14_xlsx.csv
open_cnil_ventilation_sanctions_depuis_2014_vd_xlsx_2019_05_14_xlsx.csv
2019-05-14
xlsx
open_cnil_ventilation_sanctions_depuis_2014_vd_xlsx
this is the table name:  cnil-392113.raw_data.open_cnil_ventilation_sanctions_depuis_2014_vd_xlsx
---------------------


1it [00:03,  3.59s/it]

sanctions_prononcees_par_la_cnil/open_cnil_ventilation_sanctions_depuis_2014_vd_xlsx_2019_05_14_xlsx.csv is uploaded to cnil-392113.raw_data.open_cnil_ventilation_sanctions_depuis_2014_vd_xlsx
---------------------
droit_dacces_indirect_taj_stic_judex/opencnil_dai_stic_judex_taj_maj_janvier_2019_xlsx_2019_05_13_xlsx.csv
opencnil_dai_stic_judex_taj_maj_janvier_2019_xlsx_2019_05_13_xlsx.csv
2019-05-13
xlsx
opencnil_dai_stic_judex_taj_maj_janvier_2019_xlsx
this is the table name:  cnil-392113.raw_data.opencnil_dai_stic_judex_taj_maj_janvier_2019_xlsx
---------------------


1it [00:03,  3.63s/it]

droit_dacces_indirect_taj_stic_judex/opencnil_dai_stic_judex_taj_maj_janvier_2019_xlsx_2019_05_13_xlsx.csv is uploaded to cnil-392113.raw_data.opencnil_dai_stic_judex_taj_maj_janvier_2019_xlsx
---------------------
controles_realises_par_la_cnil/Liste_des_contrôles_réalisés_par_la_CNIL_en_2017_2018_06_20_csv.csv
Liste_des_contrôles_réalisés_par_la_CNIL_en_2017_2018_06_20_csv.csv
2018-06-20
csv
liste_des_controles_realises_par_la_cnil_en_2017
this is the table name:  cnil-392113.raw_data.liste_des_controles_realises_par_la_cnil_en_2017
---------------------


1it [00:03,  3.46s/it]

controles_realises_par_la_cnil/Liste_des_contrôles_réalisés_par_la_CNIL_en_2017_2018_06_20_csv.csv is uploaded to cnil-392113.raw_data.liste_des_controles_realises_par_la_cnil_en_2017
---------------------
controles_realises_par_la_cnil/Liste_des_contrôles_réalisés_par_la_CNIL_en_2017_2018_06_20_xlsx.csv
Liste_des_contrôles_réalisés_par_la_CNIL_en_2017_2018_06_20_xlsx.csv
2018-06-20
xlsx
liste_des_controles_realises_par_la_cnil_en_2017
this is the table name:  cnil-392113.raw_data.liste_des_controles_realises_par_la_cnil_en_2017
---------------------


1it [00:02,  2.47s/it]

controles_realises_par_la_cnil/Liste_des_contrôles_réalisés_par_la_CNIL_en_2017_2018_06_20_xlsx.csv is uploaded to cnil-392113.raw_data.liste_des_controles_realises_par_la_cnil_en_2017
---------------------
correspondants_informatique_et_libertes_cil/Organismes_avec_CIL_xlsx_2018_05_24_xlsx.csv
Organismes_avec_CIL_xlsx_2018_05_24_xlsx.csv
2018-05-24
xlsx
organismes_avec_cil_xlsx
this is the table name:  cnil-392113.raw_data.organismes_avec_cil_xlsx
---------------------


1it [00:03,  3.13s/it]

correspondants_informatique_et_libertes_cil/Organismes_avec_CIL_xlsx_2018_05_24_xlsx.csv is uploaded to cnil-392113.raw_data.organismes_avec_cil_xlsx
---------------------
correspondants_informatique_et_libertes_cil/Organismes_avec_CIL_csv_2018_05_24_csv.csv
Organismes_avec_CIL_csv_2018_05_24_csv.csv
2018-05-24
csv
organismes_avec_cil_csv
this is the table name:  cnil-392113.raw_data.organismes_avec_cil_csv
---------------------


1it [00:06,  6.76s/it]

correspondants_informatique_et_libertes_cil/Organismes_avec_CIL_csv_2018_05_24_csv.csv is uploaded to cnil-392113.raw_data.organismes_avec_cil_csv
---------------------
controles_realises_par_la_cnil/Liste_des_contrôles_réalisés_par_la_CNIL_en_2016_2017_03_30_csv.csv
Liste_des_contrôles_réalisés_par_la_CNIL_en_2016_2017_03_30_csv.csv
2017-03-30
csv
liste_des_controles_realises_par_la_cnil_en_2016
this is the table name:  cnil-392113.raw_data.liste_des_controles_realises_par_la_cnil_en_2016
---------------------


1it [00:02,  2.66s/it]

controles_realises_par_la_cnil/Liste_des_contrôles_réalisés_par_la_CNIL_en_2016_2017_03_30_csv.csv is uploaded to cnil-392113.raw_data.liste_des_controles_realises_par_la_cnil_en_2016
---------------------
controles_realises_par_la_cnil/Liste_des_contrôles_réalisés_par_la_CNIL_en_2016_2017_03_30_xlsx.csv
Liste_des_contrôles_réalisés_par_la_CNIL_en_2016_2017_03_30_xlsx.csv
2017-03-30
xlsx
liste_des_controles_realises_par_la_cnil_en_2016
this is the table name:  cnil-392113.raw_data.liste_des_controles_realises_par_la_cnil_en_2016
---------------------


1it [00:02,  2.95s/it]

controles_realises_par_la_cnil/Liste_des_contrôles_réalisés_par_la_CNIL_en_2016_2017_03_30_xlsx.csv is uploaded to cnil-392113.raw_data.liste_des_controles_realises_par_la_cnil_en_2016
---------------------
controles_realises_par_la_cnil/Liste_des_contrôles_réalisés_par_la_CNIL_en_2015_2016_05_03_csv.csv
Liste_des_contrôles_réalisés_par_la_CNIL_en_2015_2016_05_03_csv.csv
2016-05-03
csv
liste_des_controles_realises_par_la_cnil_en_2015
this is the table name:  cnil-392113.raw_data.liste_des_controles_realises_par_la_cnil_en_2015
---------------------


1it [00:02,  2.71s/it]

controles_realises_par_la_cnil/Liste_des_contrôles_réalisés_par_la_CNIL_en_2015_2016_05_03_csv.csv is uploaded to cnil-392113.raw_data.liste_des_controles_realises_par_la_cnil_en_2015
---------------------
controles_realises_par_la_cnil/Liste_des_contrôles_réalisés_par_la_CNIL_en_2015_2016_05_03_xlsx.csv
Liste_des_contrôles_réalisés_par_la_CNIL_en_2015_2016_05_03_xlsx.csv
2016-05-03
xlsx
liste_des_controles_realises_par_la_cnil_en_2015
this is the table name:  cnil-392113.raw_data.liste_des_controles_realises_par_la_cnil_en_2015
---------------------


1it [00:03,  3.42s/it]

controles_realises_par_la_cnil/Liste_des_contrôles_réalisés_par_la_CNIL_en_2015_2016_05_03_xlsx.csv is uploaded to cnil-392113.raw_data.liste_des_controles_realises_par_la_cnil_en_2015
---------------------
controles_realises_par_la_cnil/Liste_des_contrôles_réalisés_par_la_CNIL_en_2014_2015_06_15_xlsx.csv
Liste_des_contrôles_réalisés_par_la_CNIL_en_2014_2015_06_15_xlsx.csv
2015-06-15
xlsx
liste_des_controles_realises_par_la_cnil_en_2014
this is the table name:  cnil-392113.raw_data.liste_des_controles_realises_par_la_cnil_en_2014
---------------------


1it [00:02,  2.59s/it]

controles_realises_par_la_cnil/Liste_des_contrôles_réalisés_par_la_CNIL_en_2014_2015_06_15_xlsx.csv is uploaded to cnil-392113.raw_data.liste_des_controles_realises_par_la_cnil_en_2014





# Building catalog from prep_data

In [1]:
from classes.prep_data import ZipFileProcessor

gcs_bucket_name = 'cnil_csv'
credential_path = 'cred/service_account_local_py.json'
zip_blob_name = '2024-02-17/prep/prep_datasets.zip'
output_folder_name = '2024-02-17/prep'
# Load the prepared-data zip archive from the GCS bucket as a file-like object
instance4 = ZipFileProcessor(gcs_bucket_name, credential_path, zip_blob_name, output_folder_name)
zip_file = instance4.get_zip_file_object()

ImportError: cannot import name 'ZipFileProcessor' from 'classes.prep_data' (/Users/benjamindupaquier/Documents/projets_persos/Pipeline/classes/prep_data.py)

In [None]:
from classes.source_catalog import CustomCatalog
import io

instance8 = CustomCatalog('cred/service_account_local_py.json')
df = instance8.create_catalog_gcs(zip_file)
df
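
The upload log earlier in this notebook suggests how the table name, update date and file format are derived from each file name inside the zip archive. The following is a minimal sketch of that parsing, inferred from the log only; it is not the actual `create_catalog_gcs` implementation, and `FILENAME_RE`, `parse_entry` and `find_collisions` are hypothetical names. It also flags files that map to the same table name, which the log shows for the CSV and XLSX versions of the "Formalités préalables" and "Liste des contrôles" files.

```python
import re
import unicodedata
from collections import defaultdict

# Assumed naming scheme, inferred from the log: <stem>_<YYYY>_<MM>_<DD>_<format>.csv
FILENAME_RE = re.compile(
    r"^(?P<stem>.+)_(?P<y>\d{4})_(?P<m>\d{2})_(?P<d>\d{2})_(?P<fmt>[a-z]+)\.csv$"
)


def parse_entry(path):
    """Split a zip entry path into table name, update date and original format."""
    filename = path.rsplit("/", 1)[-1]
    match = FILENAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename}")
    # Decompose accented characters, drop the accents, then lowercase
    stem = unicodedata.normalize("NFKD", match["stem"])
    stem = stem.encode("ascii", "ignore").decode("ascii").lower()
    return {
        "table_name": stem,
        "last_update": f"{match['y']}-{match['m']}-{match['d']}",
        "file_format": match["fmt"],
    }


def find_collisions(paths):
    """Return {table_name: [paths]} for table names produced by more than one file."""
    by_table = defaultdict(list)
    for path in paths:
        by_table[parse_entry(path)["table_name"]].append(path)
    return {name: files for name, files in by_table.items() if len(files) > 1}
```

Running `find_collisions` over the archive's file list before uploading would show which tables receive more than one file. Whether the later upload replaces or appends to the earlier one depends on how `upload_zipio_to_bq` configures the load, which this notebook does not show.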

In [None]:
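import io

# Serialize the catalog to an in-memory, semicolon-separated CSV buffer
csv_output = io.BytesIO()
df.to_csv(csv_output, index=False, sep=";")
csv_output.seek(0)  # rewind so the upload reads from the start of the buffer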

In [None]:
from classes.file_to_gcs import FromFileToGCS

bucket_name = 'cnil_csv'
cred_path = 'cred/service_account_local_py.json'
init2 = FromFileToGCS(bucket_name, cred_path)
init2.create_bucket()
file_paths = [csv_output]
dest_folder = 'prep'
dest_blob = ['prepdata_cnil_catalog_2024-02-17.csv']
init2.local_to_gcs(file_paths, dest_folder, dest_blob)
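
The internals of `FromFileToGCS.local_to_gcs` are not shown in this notebook. As a rough sketch of how in-memory buffers could be sent to a bucket with the `google-cloud-storage` client, assuming `bucket` behaves like a `google.cloud.storage.Bucket`; the helper name `upload_buffers` is hypothetical:

```python
def upload_buffers(bucket, buffers, dest_folder, dest_blobs, content_type="text/csv"):
    """Upload file-like objects to `bucket` as `<dest_folder>/<blob name>`.

    `bucket` is assumed to expose `blob(name)` like google.cloud.storage.Bucket.
    Returns the list of blob names that were written.
    """
    if len(buffers) != len(dest_blobs):
        raise ValueError("buffers and dest_blobs must have the same length")
    uploaded = []
    for buffer, blob_name in zip(buffers, dest_blobs):
        blob = bucket.blob(f"{dest_folder}/{blob_name}")
        # rewind=True seeks to the start of the buffer before reading it
        blob.upload_from_file(buffer, rewind=True, content_type=content_type)
        uploaded.append(blob.name)
    return uploaded


# Usage with a real bucket (requires google-cloud-storage and valid credentials):
# from google.cloud import storage
# client = storage.Client.from_service_account_json('cred/service_account_local_py.json')
# upload_buffers(client.bucket('cnil_csv'), [csv_output], 'prep',
#                ['prepdata_cnil_catalog_2024-02-17.csv'])
```

Passing the bucket in, rather than creating the client inside the function, keeps the helper easy to exercise with a stand-in object.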

# Building catalog from BQ raw_data

In [1]:
from classes.source_catalog import CustomCatalog

credential_path = 'cred/service_account_local_py.json'
dataset_name = 'raw_data'
project_id = 'cnil-392113'
instance8 = CustomCatalog(credentials_path=credential_path, project_id=project_id, dataset_name=dataset_name)
df = instance8.bq_raw_catalog()

Getting BigQuery modified dates...
Done.
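
For reference, collecting modification dates for every table of a dataset can be sketched with the `google-cloud-bigquery` client as below. This is an illustration under assumptions, not the code behind `bq_raw_catalog`: the helper name `list_modified_dates` is hypothetical, and `client` is assumed to behave like a `google.cloud.bigquery.Client`.

```python
def list_modified_dates(client, dataset_id):
    """Return one record per table in `dataset_id` with its last-modified time.

    `client` is assumed to expose `list_tables` and `get_table` like
    google.cloud.bigquery.Client.
    """
    records = []
    for item in client.list_tables(dataset_id):
        # list_tables yields lightweight items; get_table fetches full metadata
        table = client.get_table(item.reference)
        records.append({
            "dataset_id": table.dataset_id,
            "table_id": table.table_id,
            "modified": table.modified,
        })
    return records


# Usage with a real client (requires google-cloud-bigquery and valid credentials):
# from google.cloud import bigquery
# client = bigquery.Client.from_service_account_json('cred/service_account_local_py.json')
# catalog = pd.DataFrame(list_modified_dates(client, 'cnil-392113.raw_data'))
```

The loop issues one metadata request per table, which is fine for a few dozen tables such as the ones loaded above.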


In [3]:
from classes.gcs_to_gcp import FromGCStoGBQ

credentials_path = 'cred/service_account_local_py.json'
project_id = 'cnil-392113'
dataset_name = 'catalog_data'
table_name = 'cnil_catalog_bq'

processor_bq = FromGCStoGBQ(credentials_path, project_id, dataset_name)
processor_bq.create_dataset()
processor_bq.df_to_bq(df, table_name)

Created dataset (or already exists) cnil-392113.catalog_data

1it [00:03,  3.81s/it]

DataFrame is uploaded to cnil-392113.catalog_data.cnil_catalog_bq





# Additional tables

In [None]:
from classes.download_catalog_content import DlCatalogContentCnil

project_id = 'cnil-392113'
credential_path = 'cred/service_account_local_py.json'
instance1 = DlCatalogContentCnil(credentials_path=credential_path, project_id=project_id, catalog_path=None)
# Fetch the most recent record already stored, then scrape the source into a DataFrame
last_sanc = instance1.get_last_record_eu()
df = instance1.scrap_eu()
df