# Data Ingestion

This notebook covers the data ingestion process, i.e. downloading the data. For this notebook only, we repeat the steps presented in `README.md`.

We first need to make sure that we are working in the correct directory: the main directory should be `mlopsProject`. Run the directory change below only once on your local machine, or restart the kernel if you want to rerun all cells.

In [1]:
import os

Assuming that `01_data_ingestion.ipynb` is in `mlopsProject/research`

In [2]:
os.chdir('../')

current_path = os.getcwd()
print(current_path) # Should end with /mlopsProject

/home/corti/Desktop/mlopsProject
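
Because `os.chdir('../')` moves up one level every time it runs, rerunning that cell leaves the project root. A minimal sketch of a rerunnable alternative follows. `find_project_root` is a hypothetical helper, not part of the project, and it assumes the root directory is literally named `mlopsProject`.

```python
import os
from pathlib import Path


def find_project_root(start: Path, name: str = "mlopsProject") -> Path:
    """Return the nearest directory called `name`, checking `start` and then its parents."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if candidate.name == name:
            return candidate
    raise FileNotFoundError(f"No directory named {name!r} at or above {start}")


# Usage in the notebook (gives the same result no matter how often it runs):
# os.chdir(find_project_root(Path.cwd()))
```

Calling it from `mlopsProject/research` or from `mlopsProject` itself returns the same path, so the cell no longer depends on being executed exactly once.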


#### Step 1

Update `config.yaml`

#### Step 2

Update `params.yaml` (here skipped, no params to be added)

#### Step 3

Create `entity`



In [3]:
from dataclasses import dataclass
from pathlib import Path

In [4]:
# Use the dataclass decorator to automatically add special methods to the class, including __init__ and __repr__
# The 'frozen' parameter makes the class immutable (i.e., you can't modify the attributes once an instance is created)

@dataclass(frozen=True)
class DataIngestionConfig:
    root_dir: Path  # The root directory where the data will be stored
    source_URL: str  # The URL of the source data
    local_data_file: Path  # The path of the local file where the data will be downloaded
    unzip_dir: Path  # The directory where the data file will be unzipped
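
To illustrate what `frozen=True` buys us, here is a small self-contained sketch. The field values are placeholders chosen for illustration (the URL in particular is hypothetical); the real ones come from `config.yaml`.

```python
from dataclasses import dataclass, FrozenInstanceError
from pathlib import Path


@dataclass(frozen=True)
class DataIngestionConfig:
    root_dir: Path
    source_URL: str
    local_data_file: Path
    unzip_dir: Path


# Placeholder values for illustration only
cfg = DataIngestionConfig(
    root_dir=Path("artifacts/data_ingestion"),
    source_URL="https://example.com/data.zip",  # hypothetical URL
    local_data_file=Path("artifacts/data_ingestion/data.zip"),
    unzip_dir=Path("artifacts/data_ingestion"),
)

try:
    cfg.source_URL = "https://example.com/other.zip"
    mutated = True
except FrozenInstanceError:
    # Assignment on a frozen dataclass instance is rejected
    mutated = False

print(mutated)
```

Note that dataclasses do not validate the annotated types at runtime, so passing plain strings for the `Path` fields would be accepted silently.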

#### Step 4

Update the `configuration manager`

In [5]:
from ConversationSummarizer.constants import CONFIG_FILE_PATH, PARAMS_FILE_PATH
from ConversationSummarizer.utils.common import read_yaml, create_directories

In [6]:
# Define a class for managing configurations
class ConfigurationManager:
    # Initialize the ConfigurationManager with paths to the configuration and parameters files
    def __init__(self,
                 config_filepath = CONFIG_FILE_PATH,
                 params_filepath = PARAMS_FILE_PATH):
        
        # Read the configuration and parameters files
        self.config = read_yaml(config_filepath)
        self.params = read_yaml(params_filepath)
        
        # Create the root directory for storing artifacts
        create_directories([self.config.artifacts_root])
        
    # Define a method for getting the data ingestion configuration
    def get_data_ingestion_config(self) -> DataIngestionConfig:
        
        # Get the data ingestion configuration from the config file
        config = self.config.data_ingestion 
        
        # Create the root directory for data ingestion, if it doesn't already exist
        create_directories([config.root_dir])
        
        # Create a DataIngestionConfig object with the configuration values,
        # converting the path-like values to Path to match the declared field types
        data_ingestion_config = DataIngestionConfig(
            root_dir = Path(config.root_dir),
            source_URL = config.source_URL,
            local_data_file = Path(config.local_data_file),
            unzip_dir = Path(config.unzip_dir)
        )
        
        # Return the DataIngestionConfig object
        return data_ingestion_config
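
The manager reads nested keys with dot notation (`self.config.data_ingestion`), so `read_yaml` presumably returns an object that supports attribute access; its implementation is not shown in this notebook. The sketch below illustrates the idea with the standard library only. `to_namespace` is a hypothetical helper, and the dictionary values are placeholders that mirror the keys used above.

```python
from types import SimpleNamespace


def to_namespace(value):
    """Recursively convert dicts into SimpleNamespace objects for dot access."""
    if isinstance(value, dict):
        return SimpleNamespace(**{k: to_namespace(v) for k, v in value.items()})
    if isinstance(value, list):
        return [to_namespace(v) for v in value]
    return value


# Dict shaped like the keys this notebook reads from config.yaml.
# The URL and unzip_dir values are placeholders, not the project's real settings.
raw_config = {
    "artifacts_root": "artifacts",
    "data_ingestion": {
        "root_dir": "artifacts/data_ingestion",
        "source_URL": "https://example.com/data.zip",
        "local_data_file": "artifacts/data_ingestion/data.zip",
        "unzip_dir": "artifacts/data_ingestion",
    },
}

config = to_namespace(raw_config)
print(config.data_ingestion.root_dir)
```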

#### Step 5

Update the components

In [7]:
import os
import urllib.request as request # needed to make HTTP requests
import zipfile
from ConversationSummarizer.logging import logger
from ConversationSummarizer.utils.common import get_size

In [8]:
class DataIngestion:
    # Initialize the DataIngestion class with a configuration object
    def __init__(self, config: DataIngestionConfig):
        self.config = config 
    
    # Define a method for downloading a file
    def download_file(self):
        # Check if the local data file already exists
        if not os.path.exists(self.config.local_data_file):
            # If not, download the file from the source URL and save it to the local data file path
            filename, headers = request.urlretrieve( 
                url = self.config.source_URL,
                filename = self.config.local_data_file
            )
            # Log the filename and headers of the downloaded file
            logger.info(f"{filename} downloaded with following headers: \n{headers}")
        
        else:
            # If the file already exists, log its size
            logger.info(f"File already exists of size: {get_size(Path(self.config.local_data_file))}")
    
    # Define a method for extracting the downloaded zip file
    def extract_zip_file(self):
        # Define the path where the zip file will be extracted
        unzip_path = self.config.unzip_dir
        
        # Create the unzip directory if it doesn't already exist
        os.makedirs(unzip_path, exist_ok = True)
        
        # Open the local data file as a zip file
        with zipfile.ZipFile(self.config.local_data_file, "r") as zip_ref:
            # Extract all the contents of the zip file to the unzip directory
            zip_ref.extractall(unzip_path)
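
Since `download_file` skips the download whenever the local file exists, an interrupted download can leave a corrupt `data.zip` that is never fetched again and then fails at extraction. One possible guard, sketched here under the assumption that we want to fail early with a clear message, is to check the archive with `zipfile.is_zipfile` first. `safe_extract` is a hypothetical helper, not part of the project.

```python
import tempfile
import zipfile
from pathlib import Path


def safe_extract(zip_path: Path, unzip_dir: Path) -> list:
    """Extract `zip_path` into `unzip_dir` after checking that it is a zip archive."""
    if not zipfile.is_zipfile(zip_path):
        raise ValueError(f"{zip_path} is not a valid zip archive")
    unzip_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(unzip_dir)
        return zip_ref.namelist()


# Demonstration in a temporary directory, without any network access
with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)

    # Build a tiny valid archive and extract it
    archive = tmp_path / "data.zip"
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr("sample.txt", "hello")
    extracted = safe_extract(archive, tmp_path / "unzipped")
    content = (tmp_path / "unzipped" / "sample.txt").read_text()

    # Simulate a truncated download: the file exists but is not a zip archive
    not_a_zip = tmp_path / "broken.zip"
    not_a_zip.write_text("partial download")
    try:
        safe_extract(not_a_zip, tmp_path / "unzipped")
        rejected = False
    except ValueError:
        rejected = True
```

In the notebook, the same check could be placed at the start of `extract_zip_file`, optionally deleting the invalid file so that the next run downloads it again.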

#### Step 6

Update pipeline

In [9]:
try:
    # Instantiate ConfigurationManager and get the data ingestion configuration
    config = ConfigurationManager() 
    data_ingestion_config = config.get_data_ingestion_config()

    # Instantiate DataIngestion with the configuration and perform data ingestion
    data_ingestion = DataIngestion(config = data_ingestion_config)
    data_ingestion.download_file()
    data_ingestion.extract_zip_file()

except Exception as e:
    # Log the exception, then re-raise it so the failure is not swallowed
    logger.error(f"An error occurred during data ingestion: {e}")
    raise

[2024-01-31 22:07:00,977: INFO: common: yaml file: config/config.yaml loaded successfully]
[2024-01-31 22:07:00,979: INFO: common: yaml file: params.yaml loaded successfully]
[2024-01-31 22:07:00,980: INFO: common: created directory at: artifacts]
[2024-01-31 22:07:00,981: INFO: common: created directory at: artifacts/data_ingestion]
[2024-01-31 22:07:06,777: INFO: 3707265357: artifacts/data_ingestion/data.zip downloaded with following headers: 
Connection: close
Content-Length: 7903594
Cache-Control: max-age=300
Content-Security-Policy: default-src 'none'
Content-Type: application/zip
ETag: "c1b6d464322975cff5b61efcec056163f6fbc488dd31283f5624d3c4501a2d27"
Last-Modified: Wed, 31 Jan 2024 21:06:17 GMT
Strict-Transport-Security: max-age=31557600
timing-allow-origin: https://github.com
X-Content-Type-Options: nosniff
X-Frame-Options: deny
x-github-tenant: 
X-XSS-Protection: 1; mode=block
X-GitHub-Request-Id: FB78:0E85:342788D:363BCB7:65BAB648
Accept-Ranges: bytes
Date: Wed, 31 Jan 2024 2

#### Steps 7 & 8

These steps are done directly in `src/ConversationSummarizer`.