# Building a Catalog

## Catalog Data Processing Classes

The provided code implements a series of classes for fetching and processing catalog data from a given API, focusing on CNIL (Commission Nationale de l'Informatique et des Libertés) data in this example. Below are detailed explanations of each part of the code:

### GetSourceCatalog Class

This class provides basic functionalities for fetching and processing catalog data from an API.

#### Methods:
- `__init__(self, url, headers)`: Initializes the GetSourceCatalog object with the API URL and necessary headers for requests.
- `fetch_data_from_api(self)`: Fetches data from the API and returns it as a list.
- `response_to_dataframe(self, data, table_name, download_url, table_id=None, file_format=None, last_update=None, dataset_id=None, dataset_name=None, frequency=None, accessURL=None)`: Processes API response data into a DataFrame, with parameters to specify keys for different pieces of information.
- `save_to_csv(self, filename)`: Saves catalog data to a CSV file.

### CustomCatalog Class

This class inherits from GoogleConnector and extends its functionalities to work with custom catalogs.

#### Methods:
- `__init__(self, credentials_path, project_id=None, dataset_name=None)`: Initializes the CustomCatalog object with the credentials path, project ID, and dataset name.
- `create_catalog_gcs(self, zip_file)`: Creates a catalog from a zip file containing data, extracting information such as table name, update date, and file format.
- `bq_catalog_all_datasets(self)`: Gets BigQuery modification dates for all tables in all datasets of the project.
- `bq_raw_catalog(self)`: Gets BigQuery modification dates for all tables in a specific dataset.

### GetCnilCatalog Class

This class inherits from GetSourceCatalog and adds specific functionalities for working with CNIL catalog data.

#### Methods:
- `__init__(self, url, headers, url_additional_info)`: Initializes the GetCnilCatalog object with the CNIL API URL, headers, and the URL of a CSV file containing additional information.
- `load_additional_info(self)`: Loads additional information from the CSV file.
- `identify_datasets_info(self)`: Identifies dataset information and adds it to the catalog DataFrame.
- `merge_additional_info(self)`: Merges additional information into the catalog DataFrame.

### Code Execution

The code demonstrates how to use the GetCnilCatalog class to fetch, process, and save CNIL catalog data. Here's a summary of the steps:

1. Creating an instance of GetCnilCatalog with the CNIL API URL, headers, and the URL of the CSV file containing additional information.
2. Fetching data from the API.
3. Processing the data into a DataFrame using the response_to_dataframe method.
4. Loading additional information from the CSV file.
5. Identifying dataset information in the DataFrame.
6. Merging additional information into the DataFrame.
7. Saving the data to a CSV file.

The last line of code demonstrates how to execute these steps and save the catalog data to a CSV file named "source_cnil_catalog.csv".


In [2]:
from classes.source_catalog import GetCnilCatalog

url = 'https://www.data.gouv.fr/api/1/organizations/534fff61a3a7292c64a77d59/catalog'
headers = {'accept': 'application/json'}
url_add = 'https://www.data.gouv.fr/fr/organizations/cnil/datasets.csv'
instance1 = GetCnilCatalog(url, headers, url_add)
data = instance1.fetch_data_from_api()
data = data['@graph']
table_name = 'title'
download_url = 'downloadURL'
table_id = 'identifier' 
file_format= 'format'
last_update= 'modified' 
accessURL = '@id'
df_catalog = instance1.response_to_dataframe(data=data, table_name=table_name, download_url=download_url, table_id=table_id, file_format=file_format, last_update=last_update, accessURL=accessURL)
df_dataset = instance1.load_additional_info()
df_catalog = instance1.identify_datasets_info()
df_catalog = instance1.merge_additional_info()
instance1.save_to_csv('source_cnil_catalog')

Request is a success: 200
CSV file has been loaded to this path data/catalog/source_cnil_catalog_2024-04-13.csv


# Uploading Files to GCS

## File Upload to Google Cloud Storage (GCS) using Python

This code snippet demonstrates how to upload files, including data from a DataFrame, to Google Cloud Storage (GCS) using a Python script.

### Libraries Used:
- `pandas`: for handling data in DataFrames
- `requests`: for making HTTP requests to download data from URLs
- `zipfile` and `gzip`: for handling compressed files
- `os`: for interacting with the operating system
- `io`: for handling input/output operations
- `date`: for working with dates
- `Google Cloud SDK`: for interfacing with Google Cloud Storage
- `colorama`: for colored output in the terminal

### Classes:
1. `FromFileToGCS`: A class inheriting from `GoogleConnector` used to process data and upload it to GCS.

### Methods in `FromFileToGCS`:
1. `__init__()`: Initializes the class with GCS bucket name and service account credentials path.
2. `create_bucket()`: Creates a new GCS bucket if it doesn't exist already.
3. `download_and_upload_from_URLs()`: Downloads data from multiple URLs and uploads it to GCS.
4. `local_to_gcs()`: Uploads local files or DataFrame objects to GCS.
5. `list_blobs()`: Lists all the blobs (files) in the GCS bucket.
6. `extract_and_upload_sel()`: Extracts and uploads data from compressed files in the bucket.

### Code Execution:
1. Imports necessary libraries and the `FromFileToGCS` class.
2. Sets up credentials, bucket name, and other necessary parameters.
3. Initializes a `FromFileToGCS` object.
4. Creates a GCS bucket.
5. Prepares data for upload (in this case, a DataFrame `df_catalog`).
6. Defines destination folder and blob names.
7. Calls the `local_to_gcs()` method to upload data to GCS.

### Usage:
- Replace `df_catalog` with the DataFrame containing the data to be uploaded.
- Modify `bucket_name` and `credential_path` with your GCS bucket details and service account credentials path.
- Adjust destination folder and blob names as required.


In [4]:
# from a dataframe (must have a dest_blob name)

from classes.gcs_processor import GCSProcessor
import os
from datetime import date

today = date.today()
bucket_name = 'cnil_csv'
credential_path = 'cred/service_account_local_py.json'
init2 = GCSProcessor(bucket_name = bucket_name, credentials_path= credential_path)
init2.create_bucket()
file_paths = [df_catalog]
dest_folder = 'raw'
dest_blobs = [f'source_cnil_catalog_{today}.csv']
init2.upload_local_to_gcs(file_paths=file_paths, dest_folder=dest_folder, dest_blobs=dest_blobs, date=today)

Bucket already exists.
[32m2024-04-13/raw/source_cnil_catalog_2024-04-13.csv is uploaded to 2024-04-13/raw/source_cnil_catalog_2024-04-13.csv.[0m


## File Upload to Google Cloud Storage (GCS) using Python - Local File

This code snippet demonstrates how to upload a local file to Google Cloud Storage (GCS) without requiring a destination blob name. The destination blob name will be the same as the file name.

### Libraries Used:
- `os`: for interacting with the operating system
- `date`: for working with dates
- `Google Cloud SDK`: for interfacing with Google Cloud Storage

### Classes:
1. `FromFileToGCS`: A class inheriting from `GoogleConnector` used to process data and upload it to GCS.

### Methods in `FromFileToGCS`:
1. `__init__()`: Initializes the class with GCS bucket name and service account credentials path.
2. `create_bucket()`: Creates a new GCS bucket if it doesn't exist already.
3. `local_to_gcs()`: Uploads local files to GCS. If no destination blob name is provided, it uses the file name as the destination blob name.

### Code Execution:
1. Imports necessary libraries and the `FromFileToGCS` class.
2. Sets up credentials, bucket name, and other necessary parameters.
3. Initializes a `FromFileToGCS` object.
4. Creates a GCS bucket.
5. Specifies the local file path.
6. Defines the destination folder and destination blob names.
7. Calls the `local_to_gcs()` method to upload the local file to GCS.

### Usage:
- Replace the file path (`f'data/catalog/source_cnil_catalog_{today}.csv'`) with the path to your local file.
- Modify `bucket_name` and `credential_path` with your GCS bucket details and service account credentials path.
- Adjust the destination folder and blob names as required.


In [5]:
# from a local file (doesn't require a dest_blob name, it will be the same as the file name)

from classes.gcs_processor import GCSProcessor
import os
from datetime import date

today = date.today()

bucket_name = 'cnil_csv'
credential_path = 'cred/service_account_local_py.json'
init2 = GCSProcessor(bucket_name = bucket_name, credentials_path= credential_path)
init2.create_bucket()
file_paths = [f'data/catalog/source_cnil_catalog_{today}.csv']
dest_folder = 'raw'
dest_blob = [f'source_cnil_catalog_{today}.csv']
init2.upload_local_to_gcs(file_paths=file_paths, dest_folder=dest_folder, date=today)

Bucket already exists.
data/catalog/source_cnil_catalog_2024-04-13.csv
[32m file source_cnil_catalog_2024-04-13.csv uploaded to GCS successfully to 2024-04-13/raw/source_cnil_catalog_2024-04-13.csv.[0m


## File Upload to Google Cloud Storage (GCS) using Python - From URL

This code snippet demonstrates how to download files from URLs and upload them to Google Cloud Storage (GCS) without requiring a destination blob name. The destination blob name will be the same as the file name.

### Libraries Used:
- `requests`: for making HTTP requests to download data from URLs
- `Google Cloud SDK`: for interfacing with Google Cloud Storage

### Classes:
1. `FromFileToGCS`: A class inheriting from `GoogleConnector` used to process data and upload it to GCS.

### Methods in `FromFileToGCS`:
1. `__init__()`: Initializes the class with GCS bucket name and service account credentials path.
2. `create_bucket()`: Creates a new GCS bucket if it doesn't exist already.
3. `download_and_upload_from_URLs()`: Downloads data from multiple URLs and uploads it to GCS. If no destination blob name is provided, it uses the file name from the URL as the destination blob name.

### Code Execution:
1. Sets up the GCS bucket name, service account credentials path, and other necessary parameters.
2. Initializes a `FromFileToGCS` object.
3. Creates a GCS bucket.
4. Specifies the URLs from which data needs to be downloaded.
5. Defines the destination folder and destination blob names.
6. Calls the `download_and_upload_from_URLs()` method to download data from URLs and upload it to GCS.

### Usage:
- Replace the URLs (`url`) with the URLs from which you want to download data.
- Modify `bucket_name` and `cred_path` with your GCS bucket details and service account credentials path.
- Adjust the destination folder and blob names as required.


In [6]:
# from a URL (doesn't require a dest_blob name, it will be the same as the file name)

from classes.gcs_processor import GCSProcessor
import os
from datetime import date

today = date.today()

bucket_name = 'cnil_csv'
credentials_path = 'cred/service_account_local_py.json'
init2 = GCSProcessor(bucket_name = bucket_name, credentials_path= credentials_path)
init2.create_bucket()
urls = ['https://www.data.gouv.fr/fr/organizations/cnil/datasets.csv', 'https://www.data.gouv.fr/fr/datasets/r/0f678674-4327-4c4d-8819-b6f508b41d0e']
dest_folder = 'raw'
dest_blobs = ['datasets.csv', 'plaintes.csv']
init2.dl_and_up_from_URLs(urls=urls, dest_folder=dest_folder, dest_blobs=dest_blobs, date=today)

Bucket already exists.
[32mRaw file 2024-04-13/raw/datasets.csv downloaded and uploaded to GCS successfully to 2024-04-13/raw/datasets.csv.[0m
[32mRaw file 2024-04-13/raw/plaintes.csv downloaded and uploaded to GCS successfully to 2024-04-13/raw/plaintes.csv.[0m


# Downloading Files from Catalog

## Download and Organize Dataset Content Class

The provided code defines a class `DlCatalogContentLocal` for downloading and organizing datasets based on a provided catalog. Below are the attributes and methods of this class:

### Attributes:
- `df_catalog` (pd.DataFrame): DataFrame containing the catalog information.

### Methods:
1. `__init__(catalog_path)`: Constructor method that initializes the object with the provided catalog path.
2. `get_tables()`: Downloads and organizes datasets based on the information in the catalog.
3. `zip_files()`: Zips all the downloaded files into a single archive.
4. `reorganize_file_name(file_name, last_date)`: Helper method to create a new filename with versioning based on the last update date.
5. `extract_date(date_str)`: Helper method to extract and convert date strings to datetime objects.

### Code Execution:

The code execution section demonstrates how to use the `DlCatalogContentLocal` class to download, organize, and zip dataset content from a provided catalog.

1. **Initialization**:
   - An instance of `DlCatalogContentLocal` is created with the path to the catalog CSV file (`source_cnil_catalog_{today}.csv`).

2. **Downloading and Organizing Datasets**:
   - The `get_tables()` method is called to download and organize datasets based on the information in the catalog. 
   - For each row in the catalog DataFrame, if a download URL is provided, the dataset is downloaded and organized into the appropriate folder structure based on the dataset name and last update date.

3. **Zipping Files**:
   - After downloading and organizing datasets, the `zip_files()` method is called to zip all the downloaded files into a single archive (`raw_datasets.zip`).

The provided code demonstrates how to automate the process of downloading, organizing, and zipping datasets based on a catalog, providing a convenient way to manage dataset content efficiently.


In [7]:
from classes.download_catalog_content import DlCatalogContentLocal
from datetime import date

today = date.today()
catalog_path = f'data/catalog/source_cnil_catalog_{today}.csv'

instance3 = DlCatalogContentLocal(catalog_path=catalog_path)
instance3.get_tables()
instance3.zip_files()

Error when downloading table Organismes ayant désigné un(e) délégué(e) à la protection des données (DPD/DPO) : Invalid URL 'nan': No scheme supplied. Perhaps you meant http://nan?
Error when downloading table Budget de la CNIL : Invalid URL 'nan': No scheme supplied. Perhaps you meant http://nan?
Error when downloading table Notifications à la CNIL de violations de données à caractère personnel : Invalid URL 'nan': No scheme supplied. Perhaps you meant http://nan?
Error when downloading table Sanctions prononcées par la CNIL : Invalid URL 'nan': No scheme supplied. Perhaps you meant http://nan?
Error when downloading table Traitements de données personnelles déclarés à la CNIL depuis le 25 mai 2018 : Invalid URL 'nan': No scheme supplied. Perhaps you meant http://nan?
Error when downloading table Protection des données personnelles dans le monde : Invalid URL 'nan': No scheme supplied. Perhaps you meant http://nan?
Error when downloading table Contrôles réalisés par la CNIL : Invalid U

## Download and Zip Files from GCS Catalog Class

The provided code defines a class `DLFromGCSCatalogToZip` for downloading and zipping files from a Google Cloud Storage (GCS) catalog. Below are the methods and attributes of this class:

### Attributes:
- `gcs_bucket_name`: Name of the Google Cloud Storage (GCS) bucket.
- `credentials_path`: Path to the service account credentials file.
- `zip_blob_name`: Name of the zip file in GCS.
- `project_id`: Optional project ID.

### Methods:
1. `__init__(self, gcs_bucket_name, credentials_path, zip_blob_name, project_id=None)`: Constructor method that initializes the object with the specified attributes.
2. `get_file_io(self)`: Retrieves the CSV catalog file from GCS and returns it as a BytesIO object.
3. `download_files_to_zip_io(self)`: Downloads files from URLs in the catalog and returns them as a list of tuples containing file paths and content.
4. `create_zip(self, files)`: Creates a zip file containing the downloaded files and returns it as a BytesIO object.
5. `extract_date(self, date_str)`: Helper method to extract and convert date strings to datetime objects.

### Code Execution:

The code execution section demonstrates how to use the `DLFromGCSCatalogToZip` class to download and zip files from a GCS catalog.

1. **Initialization**:
   - An instance of `DLFromGCSCatalogToZip` is created with the specified GCS bucket name, credentials path, zip blob name, and optional project ID.

2. **Downloading and Zipping Files**:
   - The `get_file_io()` method is called to retrieve the CSV catalog file from GCS.
   - The `download_files_to_zip_io()` method is called to download files from URLs in the catalog.
   - The `create_zip()` method is called to create a zip file containing the downloaded files.
   - The zip file is then uploaded to GCS using an instance of the `FromFileToGCS` class.

The provided code demonstrates how to automate the process of downloading and zipping files from a GCS catalog, providing a convenient way to manage file content efficiently.


In [12]:
from classes.gcs_processor import GCSProcessor
from datetime import date
import pandas as pd
import os

today = date.today()

gcs_bucket_name = 'cnil_csv'
credentials_path = 'cred/service_account_local_py.json'
blob_catalog = "2024-04-12/raw/source_cnil_catalog_2024-04-12.csv"

instance1 = GCSProcessor(bucket_name=gcs_bucket_name, credentials_path=credentials_path)
files = instance1.download_files_from_catalog(catalog_path=blob_catalog)
zip_file = instance1.create_zip_from_files(files)

file_paths = [zip_file]
dest_folder = 'raw'
dest_blob = ['raw_datasets.zip']
instance1.upload_local_to_gcs(file_paths=file_paths, dest_folder=dest_folder, dest_blobs=dest_blobs, date=today)

Current file downloading: traitements-de-donnees-personnelles-declares-a-la-cnil-avant-le-25-mai-2018/Les traitements de données personnelles déclarés à la CNIL entre 1979 et le 24 mai 2018_2024-04-11
Current file downloading: organismes-ayant-designe-un-e-delegue-e-a-la-protection-des-donnees-dpd-dpo/Organismes ayant désigné un(e) délégué(e) à la protection des données (DPD/DPO)_2024-04-08
Current file downloading: organismes-ayant-designe-un-e-delegue-e-a-la-protection-des-donnees-dpd-dpo/opencnil-organismes-avec-dpo.xlsx_2024-04-08
Current file downloading: organismes-ayant-designe-un-e-delegue-e-a-la-protection-des-donnees-dpd-dpo/opencnil-organismes-avec-dpo.csv_2024-04-08
Current file downloading: les-deliberations-de-la-cnil/CNIL: les délibérations de la Commission nationale de l'informatique et des libertés_2024-04-03
Current file downloading: traitements-de-donnees-personnelles-declares-a-la-cnil-depuis-le-25-mai-2018/Formalités préalables reçues par la CNIL depuis le 25 mai 2

  return self._open_to_write(zinfo, force_zip64=force_zip64)
  return self._open_to_write(zinfo, force_zip64=force_zip64)
  return self._open_to_write(zinfo, force_zip64=force_zip64)
  return self._open_to_write(zinfo, force_zip64=force_zip64)


Current file: les-deliberations-de-la-cnil/Les délibérations de la CNIL_2017-07-26
Current file: controles-realises-par-la-cnil/Liste des contrôles réalisés par la CNIL en 2016_2017-03-30
Current file: controles-realises-par-la-cnil/Liste des contrôles réalisés par la CNIL en 2016_2017-03-30
Current file: controles-realises-par-la-cnil/Liste des contrôles réalisés par la CNIL en 2015_2016-05-03
Current file: controles-realises-par-la-cnil/Liste des contrôles réalisés par la CNIL en 2015_2016-05-03
Current file: controles-realises-par-la-cnil/Liste des contrôles réalisés par la CNIL en 2014_2015-06-15
Current file: controles-realises-par-la-cnil/Liste des contrôles réalisés par la CNIL en 2014_2015-06-15
[32m file 2024-04-13/raw/datasets.csv uploaded to GCS successfully to 2024-04-13/raw/datasets.csv.[0m


# Prep data to upload to BQ

In [2]:
from classes.gcs_processor import GCSProcessor
from datetime import date

today = date.today()

gcs_bucket_name = 'cnil_csv'
credentials_path = 'cred/service_account_local_py.json'
blob_catalog = "2024-04-12/raw/source_cnil_catalog_2024-04-12.csv"

instance1 = GCSProcessor(bucket_name=gcs_bucket_name, credentials_path=credentials_path)
blob_name_zip = '2024-04-13/raw/raw_datasets.zip'
zip_file = instance1.get_zip_file_object(blob_name_zip)

In [3]:
from classes.prep_data import PrepFilesBQ
import pandas as pd

instance5 = PrepFilesBQ(zip_file)
zip_output = instance5.process_zip_file(zip_file)

[32mcurrent: traitements-de-donnees-personnelles-declares-a-la-cnil-avant-le-25-mai-2018/Les traitements de données personnelles déclarés à la CNIL entre 1979 et le 24 mai 2018_2024-04-11[0m
---------------------------------------------------
[32mtraitements-de-donnees-personnelles-declares-a-la-cnil-avant-le-25-mai-2018/Les traitements de données personnelles déclarés à la CNIL entre 1979 et le 24 mai 2018_2024-04-11[0m
<zipfile.ZipExtFile name='traitements-de-donnees-personnelles-declares-a-la-cnil-avant-le-25-mai-2018/Les traitements de données personnelles déclarés à la CNIL entre 1979 et le 24 mai 2018_2024-04-11' mode='r' compress_type=deflate>
traitements-de-donnees-personnelles-declares-a-la-cnil-avant-le-25-mai-2018/Les traitements de données personnelles déclarés à la CNIL entre 1979 et le 24 mai 2018_2024-04-11
try to read as csv
file: <zipfile.ZipExtFile name='traitements-de-donnees-personnelles-declares-a-la-cnil-avant-le-25-mai-2018/Les traitements de données personne

  warn("Workbook contains no default style, apply openpyxl's default")


(18834, 12)
try to find headers in 2nd row
(18833, 12)
opened df, return from open_df
this is df
More rows than columns, no need to transpose
The column_formatter method worked perfectly.
The column_formatter method worked perfectly.
The column_formatter method worked perfectly.
The column_formatter method worked perfectly.
The column_formatter method worked perfectly.
The column_formatter method worked perfectly.
The column_formatter method worked perfectly.
The column_formatter method worked perfectly.
The column_formatter method worked perfectly.
The column_formatter method worked perfectly.
The column_formatter method worked perfectly.
The column_formatter method worked perfectly.
Re-exécution terminée.
[32mtraitements-de-donnees-personnelles-declares-a-la-cnil-depuis-le-25-mai-2018/Formalités préalables reçues par la CNIL depuis le 25 mai 2018_2024-03-25 processed successfully![0m
---------------------------------------------------
[32mcurrent: sanctions-prononcees-par-la-cnil/

  return self._open_to_write(zinfo, force_zip64=force_zip64)


(17886, 2)
opened df, return from open_df
this is df
More rows than columns, no need to transpose
The column_formatter method worked perfectly.
The column_formatter method worked perfectly.
Re-exécution terminée.
[32mles-deliberations-de-la-cnil/Les délibérations de la CNIL_2017-07-26 processed successfully![0m
---------------------------------------------------
[32mcurrent: controles-realises-par-la-cnil/Liste des contrôles réalisés par la CNIL en 2016_2017-03-30[0m
---------------------------------------------------
[32mcontroles-realises-par-la-cnil/Liste des contrôles réalisés par la CNIL en 2016_2017-03-30[0m
<zipfile.ZipExtFile name='controles-realises-par-la-cnil/Liste des contrôles réalisés par la CNIL en 2016_2017-03-30' mode='r' compress_type=deflate>
controles-realises-par-la-cnil/Liste des contrôles réalisés par la CNIL en 2016_2017-03-30
try to read as csv
file: <zipfile.ZipExtFile name='controles-realises-par-la-cnil/Liste des contrôles réalisés par la CNIL en 2016_

  warn("""Cannot parse header or footer so it will be ignored""")
  warn("""Cannot parse header or footer so it will be ignored""")
  return self._open_to_write(zinfo, force_zip64=force_zip64)


(430, 7)
(430, 7)
opened df, return from open_df
this is df
More rows than columns, no need to transpose
The column_formatter method worked perfectly.
The column_formatter method worked perfectly.
The column_formatter method worked perfectly.
The column_formatter method worked perfectly.
The column_formatter method worked perfectly.
The column_formatter method worked perfectly.
Re-exécution terminée.
[32mcontroles-realises-par-la-cnil/Liste des contrôles réalisés par la CNIL en 2016_2017-03-30 processed successfully![0m
---------------------------------------------------
[32mcurrent: controles-realises-par-la-cnil/Liste des contrôles réalisés par la CNIL en 2016_2017-03-30[0m
---------------------------------------------------
[32mcontroles-realises-par-la-cnil/Liste des contrôles réalisés par la CNIL en 2016_2017-03-30[0m
<zipfile.ZipExtFile name='controles-realises-par-la-cnil/Liste des contrôles réalisés par la CNIL en 2016_2017-03-30' mode='r' compress_type=deflate>
controles

  warn("""Cannot parse header or footer so it will be ignored""")


(496, 6)
(496, 6)
opened df, return from open_df
this is df
More rows than columns, no need to transpose
The column_formatter method worked perfectly.
The column_formatter method worked perfectly.
The column_formatter method worked perfectly.
The column_formatter method worked perfectly.
The column_formatter method worked perfectly.
The column_formatter method worked perfectly.
Re-exécution terminée.
[32mcontroles-realises-par-la-cnil/Liste des contrôles réalisés par la CNIL en 2015_2016-05-03 processed successfully![0m
---------------------------------------------------
[32mcurrent: controles-realises-par-la-cnil/Liste des contrôles réalisés par la CNIL en 2015_2016-05-03[0m
---------------------------------------------------
[32mcontroles-realises-par-la-cnil/Liste des contrôles réalisés par la CNIL en 2015_2016-05-03[0m
<zipfile.ZipExtFile name='controles-realises-par-la-cnil/Liste des contrôles réalisés par la CNIL en 2015_2016-05-03' mode='r' compress_type=deflate>
controles

  warn("""Cannot parse header or footer so it will be ignored""")
  return self._open_to_write(zinfo, force_zip64=force_zip64)
  return self._open_to_write(zinfo, force_zip64=force_zip64)
  return self._open_to_write(zinfo, force_zip64=force_zip64)
  return self._open_to_write(zinfo, force_zip64=force_zip64)
  return self._open_to_write(zinfo, force_zip64=force_zip64)
  return self._open_to_write(zinfo, force_zip64=force_zip64)


In [8]:
import pandas as pd
from classes.prep_data import PrepDataCnilBQ

instance5 = PrepDataCnilBQ(zip_file)
zip_output = instance5.process_zip_file(zip_file)

[32mcurrent: traitements-de-donnees-personnelles-declares-a-la-cnil-avant-le-25-mai-2018/Les traitements de données personnelles déclarés à la CNIL entre 1979 et le 24 mai 2018_2024-04-11[0m
---------------------------------------------------
[32mtraitements-de-donnees-personnelles-declares-a-la-cnil-avant-le-25-mai-2018/Les traitements de données personnelles déclarés à la CNIL entre 1979 et le 24 mai 2018_2024-04-11[0m
<zipfile.ZipExtFile name='traitements-de-donnees-personnelles-declares-a-la-cnil-avant-le-25-mai-2018/Les traitements de données personnelles déclarés à la CNIL entre 1979 et le 24 mai 2018_2024-04-11' mode='r' compress_type=deflate>
traitements-de-donnees-personnelles-declares-a-la-cnil-avant-le-25-mai-2018/Les traitements de données personnelles déclarés à la CNIL entre 1979 et le 24 mai 2018_2024-04-11
try to read as csv
file: <zipfile.ZipExtFile name='traitements-de-donnees-personnelles-declares-a-la-cnil-avant-le-25-mai-2018/Les traitements de données personne

  warn("Workbook contains no default style, apply openpyxl's default")


(18834, 12)
try to find headers in 2nd row
(18833, 12)
opened df, return from open_df
this is df
More rows than columns, no need to transpose
The column_formatter method worked perfectly.
The column_formatter method worked perfectly.
The column_formatter method worked perfectly.
The column_formatter method worked perfectly.
The column_formatter method worked perfectly.
The column_formatter method worked perfectly.
The column_formatter method worked perfectly.
The column_formatter method worked perfectly.
The column_formatter method worked perfectly.
The column_formatter method worked perfectly.
The column_formatter method worked perfectly.
The column_formatter method worked perfectly.
Re-exécution terminée.
[32mtraitements-de-donnees-personnelles-declares-a-la-cnil-depuis-le-25-mai-2018/Formalités préalables reçues par la CNIL depuis le 25 mai 2018_2024-03-25 processed successfully![0m
---------------------------------------------------
[32mcurrent: sanctions-prononcees-par-la-cnil/

  return self._open_to_write(zinfo, force_zip64=force_zip64)


(17886, 2)
opened df, return from open_df
this is df
More rows than columns, no need to transpose
The column_formatter method worked perfectly.
The column_formatter method worked perfectly.
Re-exécution terminée.
[32mles-deliberations-de-la-cnil/Les délibérations de la CNIL_2017-07-26 processed successfully![0m
---------------------------------------------------
[32mcurrent: controles-realises-par-la-cnil/Liste des contrôles réalisés par la CNIL en 2016_2017-03-30[0m
---------------------------------------------------
[32mcontroles-realises-par-la-cnil/Liste des contrôles réalisés par la CNIL en 2016_2017-03-30[0m
<zipfile.ZipExtFile name='controles-realises-par-la-cnil/Liste des contrôles réalisés par la CNIL en 2016_2017-03-30' mode='r' compress_type=deflate>
controles-realises-par-la-cnil/Liste des contrôles réalisés par la CNIL en 2016_2017-03-30
try to read as csv
file: <zipfile.ZipExtFile name='controles-realises-par-la-cnil/Liste des contrôles réalisés par la CNIL en 2016_

  warn("""Cannot parse header or footer so it will be ignored""")
  warn("""Cannot parse header or footer so it will be ignored""")
  return self._open_to_write(zinfo, force_zip64=force_zip64)


(430, 7)
(430, 7)
opened df, return from open_df
this is df
More rows than columns, no need to transpose
The column_formatter method worked perfectly.
The column_formatter method worked perfectly.
The column_formatter method worked perfectly.
The column_formatter method worked perfectly.
The column_formatter method worked perfectly.
The column_formatter method worked perfectly.
Re-exécution terminée.
[32mcontroles-realises-par-la-cnil/Liste des contrôles réalisés par la CNIL en 2016_2017-03-30 processed successfully![0m
---------------------------------------------------
[32mcurrent: controles-realises-par-la-cnil/Liste des contrôles réalisés par la CNIL en 2016_2017-03-30[0m
---------------------------------------------------
[32mcontroles-realises-par-la-cnil/Liste des contrôles réalisés par la CNIL en 2016_2017-03-30[0m
<zipfile.ZipExtFile name='controles-realises-par-la-cnil/Liste des contrôles réalisés par la CNIL en 2016_2017-03-30' mode='r' compress_type=deflate>
controles

  warn("""Cannot parse header or footer so it will be ignored""")


(496, 6)
(496, 6)
opened df, return from open_df
this is df
More rows than columns, no need to transpose
The column_formatter method worked perfectly.
The column_formatter method worked perfectly.
The column_formatter method worked perfectly.
The column_formatter method worked perfectly.
The column_formatter method worked perfectly.
The column_formatter method worked perfectly.
Re-exécution terminée.
[32mcontroles-realises-par-la-cnil/Liste des contrôles réalisés par la CNIL en 2015_2016-05-03 processed successfully![0m
---------------------------------------------------
[32mcurrent: controles-realises-par-la-cnil/Liste des contrôles réalisés par la CNIL en 2015_2016-05-03[0m
---------------------------------------------------
[32mcontroles-realises-par-la-cnil/Liste des contrôles réalisés par la CNIL en 2015_2016-05-03[0m
<zipfile.ZipExtFile name='controles-realises-par-la-cnil/Liste des contrôles réalisés par la CNIL en 2015_2016-05-03' mode='r' compress_type=deflate>
controles

  warn("""Cannot parse header or footer so it will be ignored""")
  return self._open_to_write(zinfo, force_zip64=force_zip64)
  return self._open_to_write(zinfo, force_zip64=force_zip64)
  return self._open_to_write(zinfo, force_zip64=force_zip64)
  return self._open_to_write(zinfo, force_zip64=force_zip64)
  return self._open_to_write(zinfo, force_zip64=force_zip64)
  return self._open_to_write(zinfo, force_zip64=force_zip64)


In [9]:
from classes.gcs_processor import GCSProcessor

bucket_name = 'cnil_csv'
cred_path = 'cred/service_account_local_py.json'
init2 = GCSProcessor(bucket_name=bucket_name, credentials_path=credentials_path)
init2.create_bucket()
file_paths = [zip_output]
dest_folder = 'prep'
dest_blobs = ['prep_gbq_datasets.zip']
instance1.upload_local_to_gcs(file_paths=file_paths, dest_folder=dest_folder, dest_blobs=dest_blobs, date=today)

Bucket already exists.
[32m file 2024-04-13/prep/prep_gbq_datasets.zip uploaded to GCS successfully to 2024-04-13/prep/prep_gbq_datasets.zip.[0m


In [12]:
from classes.gcs_processor import GCSProcessor

bucket_name = 'cnil_csv'
cred_path = 'cred/service_account_local_py.json'
init2 = GCSProcessor(bucket_name=bucket_name, credentials_path=credentials_path)
prefix = 'prep'
blobs = init2.list_blobs(prefix=prefix)
init2.extract_and_upload_selection(blobs = blobs, folder_name='prep/extracted', date=today)

cnil_csv
[36mStart extracting and uploading to GCS: 2024-04-13//2024-04-13/prep/prep_gbq_datasets.zip[0m


# GCS to GCP 

In [1]:
from classes.gcs_processor import GCSProcessor
from datetime import date

today = date.today()

gcs_bucket_name = 'cnil_csv'
credentials_path = 'cred/service_account_local_py.json'
folder = "2024-04-13/prep"

instance1 = GCSProcessor(bucket_name=gcs_bucket_name, credentials_path=credentials_path)
list_blob = instance1.list_blobs('2024-04-13/prep')
zip_file = instance1.get_zip_file_object(list_blob[0].name)

cnil_csv


In [2]:
from classes.gcs_to_gcp import FromGCStoGBQ

# usage exemple
credentials_path = 'cred/service_account_local_py.json'
project_id = 'cnil-392113'
dataset_name = 'raw_data'

processor_bq = FromGCStoGBQ(credentials_path, project_id, dataset_name)
processor_bq.create_dataset()
processor_bq.upload_zip_to_bq(zip_file)


[32mCreated dataset (or already exists) cnil-392113.raw_data[0m
---------------------
notifications_a_la_cnil_de_violations_de_donnees_a_caractere_personnel/Notifications_à_la_CNIL_de_violations_de_données_à_caractère_personnel_2024_03_29.csv
Notifications_à_la_CNIL_de_violations_de_données_à_caractère_personnel_2024_03_29.csv
2024_03_29
csv
notifications_a_la_cnil_de_violations_de_donnees_a_caractere_personnel
this is the table name:  cnil-392113.raw_data.notifications_a_la_cnil_de_violations_de_donnees_a_caractere_personnel
---------------------


1it [00:04,  4.29s/it]


[32mnotifications_a_la_cnil_de_violations_de_donnees_a_caractere_personnel/Notifications_à_la_CNIL_de_violations_de_données_à_caractère_personnel_2024_03_29.csv is uploaded to cnil-392113.raw_data.notifications_a_la_cnil_de_violations_de_donnees_a_caractere_personnel[0m
---------------------
traitements_de_donnees_personnelles_declares_a_la_cnil_depuis_le_25_mai_2018/Formalités_préalables_reçues_par_la_CNIL_depuis_le_25_mai_2018_2024_03_25.csv
Formalités_préalables_reçues_par_la_CNIL_depuis_le_25_mai_2018_2024_03_25.csv
2018
csv
formalites_prealables_recues_par_la_cnil_depuis_le_25_mai
this is the table name:  cnil-392113.raw_data.formalites_prealables_recues_par_la_cnil_depuis_le_25_mai
---------------------


1it [00:05,  5.04s/it]


[32mtraitements_de_donnees_personnelles_declares_a_la_cnil_depuis_le_25_mai_2018/Formalités_préalables_reçues_par_la_CNIL_depuis_le_25_mai_2018_2024_03_25.csv is uploaded to cnil-392113.raw_data.formalites_prealables_recues_par_la_cnil_depuis_le_25_mai[0m
---------------------
sanctions_prononcees_par_la_cnil/Sanctions_prononcées_par_la_CNIL_2023_11_24.csv
Sanctions_prononcées_par_la_CNIL_2023_11_24.csv
2023_11_24
csv
sanctions_prononcees_par_la_cnil
this is the table name:  cnil-392113.raw_data.sanctions_prononcees_par_la_cnil
---------------------


1it [00:03,  3.29s/it]


[32msanctions_prononcees_par_la_cnil/Sanctions_prononcées_par_la_CNIL_2023_11_24.csv is uploaded to cnil-392113.raw_data.sanctions_prononcees_par_la_cnil[0m
---------------------
traitements_de_donnees_personnelles_declares_a_la_cnil_depuis_le_25_mai_2018/Traitements_de_données_personnelles_déclarés_à_la_CNIL_depuis_le_25_mai_2018_2023_11_21.csv
Traitements_de_données_personnelles_déclarés_à_la_CNIL_depuis_le_25_mai_2018_2023_11_21.csv
2018
csv
traitements_de_donnees_personnelles_declares_a_la_cnil_depuis_le_25_mai
this is the table name:  cnil-392113.raw_data.traitements_de_donnees_personnelles_declares_a_la_cnil_depuis_le_25_mai
---------------------


1it [00:03,  3.06s/it]


[32mtraitements_de_donnees_personnelles_declares_a_la_cnil_depuis_le_25_mai_2018/Traitements_de_données_personnelles_déclarés_à_la_CNIL_depuis_le_25_mai_2018_2023_11_21.csv is uploaded to cnil-392113.raw_data.traitements_de_donnees_personnelles_declares_a_la_cnil_depuis_le_25_mai[0m
---------------------
mises_en_demeure_prononcees_par_la_cnil/Mises_en_demeure_prononcées_par_la_CNIL_2023_08_25.csv
Mises_en_demeure_prononcées_par_la_CNIL_2023_08_25.csv
2023_08_25
csv
mises_en_demeure_prononcees_par_la_cnil
this is the table name:  cnil-392113.raw_data.mises_en_demeure_prononcees_par_la_cnil
---------------------


1it [00:03,  3.12s/it]


[32mmises_en_demeure_prononcees_par_la_cnil/Mises_en_demeure_prononcées_par_la_CNIL_2023_08_25.csv is uploaded to cnil-392113.raw_data.mises_en_demeure_prononcees_par_la_cnil[0m
---------------------
exercice_des_droits_indirect_donnees_generales/Exercice_des_droits_indirect_(données_générales)_2023_06_28.csv
Exercice_des_droits_indirect_(données_générales)_2023_06_28.csv
2023_06_28
csv
exercice_des_droits_indirect_(donnees_generales)
this is the table name:  cnil-392113.raw_data.exercice_des_droits_indirect_(donnees_generales)
---------------------
[31mError: 400 GET https://bigquery.googleapis.com/bigquery/v2/projects/cnil-392113/datasets/raw_data/tables/exercice_des_droits_indirect_(donnees_generales)?prettyPrint=false: Invalid table ID "exercice_des_droits_indirect_(donnees_generales)".[0m
try to read with sep=","


EmptyDataError: No columns to parse from file

# Building catalog from prep_data

In [None]:
from classes.prep_data import ZipFileProcessor

gcs_bucket_name = 'cnil_csv'
credential_path = 'cred/service_account_local_py.json'
zip_blob_name = '2024-02-17/prep/prep_datasets.zip'
output_folder_name = '2024-02-17/'+ 'prep'
instance4 = ZipFileProcessor(gcs_bucket_name, credential_path, zip_blob_name, output_folder_name)
zip_file = instance4.get_zip_file_object()

In [None]:
from classes.source_catalog import CustomCatalog
import io

instance8 = CustomCatalog('cred/service_account_local_py.json')
df = instance8.create_catalog_gcs(zip_file)
df

In [None]:
import pandas as pd
csv_output = io.BytesIO()
df.to_csv(csv_output, index=False, sep=";")
csv_output.seek(0)

In [None]:
from classes.file_to_gcs import FromFileToGCS

bucket_name = 'cnil_csv'
cred_path = 'cred/service_account_local_py.json'
init2 = FromFileToGCS(bucket_name, cred_path)
init2.create_bucket()
file_paths = [csv_output]
dest_folder = 'prep'
dest_blob = ['prepdata_cnil_catalog_2024-02-17.csv']
init2.local_to_gcs(file_paths, dest_folder, dest_blob)

# Building catalog from BQ raw_data

In [None]:
from classes.source_catalog import CustomCatalog

credential_path = 'cred/service_account_local_py.json'
dataset_name = 'raw_data'
project_id = 'cnil-392113'
instance8 = CustomCatalog(credential_path, project_id, dataset_name)
df = instance8.bq_raw_catalog()

In [None]:
from classes.gcs_to_gcp import FromGCStoGBQ

credentials_path = 'cred/service_account_local_py.json'
project_id = 'cnil-392113'
dataset_name = 'catalog_data'
table_name = 'cnil_catalog_bq'

processor_bq = FromGCStoGBQ(credentials_path, project_id, dataset_name)
processor_bq.create_dataset()
processor_bq.df_to_bq(df, table_name)

# Additionnal tables

In [None]:
from classes.download_catalog_content import DlCatalogContentCnil

project_id = 'cnil-392113'
credential_path = 'cred/service_account_local_py.json'
instance1 = DlCatalogContentCnil(credentials_path= credential_path, project_id=project_id, catalog_path=None)
last_sanc = instance1.get_last_record_eu()
df = instance1.scrap_eu()
df