# Data Ingestion Process

## Overview
Data ingestion is the first step in our pipeline: the raw data is loaded, run through basic validation checks, and saved for later stages. The checks confirm that the file is non-empty and report any duplicate rows before exploratory data analysis and modeling begin.

## Key Steps
1. **Reading Data**:
   - The dataset was read from a CSV file located at `/data/Telco_customer_churn.csv`.
   - Code snippet:
     ```python
     data = pd.read_csv(input_file_path)
     print(f"Data loaded with {data.shape[0]} rows and {data.shape[1]} columns.")
     ```

2. **Validation**:
   - Performed basic validation checks:
     - Checked for empty datasets: `if data.empty`.
     - Checked for duplicate rows: `data.duplicated().sum()`.
   - Summary:
     - Rows: 7043, Columns: 33.
     - Duplicate rows: 0.

3. **Saving Processed Data**:
   - The validated dataset was saved to the `/data/processed/` directory, with a timestamp in the file name so that earlier runs are not overwritten.
   - Code snippet:
     ```python
     timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
     output_file = os.path.join(output_dir, f"ingested_data_{timestamp}.csv")
     data.to_csv(output_file, index=False)
     ```
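
The two validation checks in step 2 have edge cases that are easy to miss. The sketch below illustrates them on small in-memory inputs (the sample values are made up for illustration and are not taken from the dataset): a header-only file parses to an empty DataFrame, a zero-byte file makes `pd.read_csv` raise before `data.empty` can be tested, and `duplicated()` counts only the repeats of an earlier row.

```python
import io

import pandas as pd

# Header-only CSV: parses without error but has no rows, so .empty is True.
header_only = pd.read_csv(io.StringIO("CustomerID,Count\n"))
header_only_is_empty = header_only.empty

# Zero-byte input: read_csv raises EmptyDataError, so an `if data.empty`
# check placed after the read is never reached for this case.
try:
    pd.read_csv(io.StringIO(""))
    zero_byte_error = None
except pd.errors.EmptyDataError as exc:
    zero_byte_error = type(exc).__name__

# duplicated() flags each repeat of an earlier row (keep="first" by default),
# so three identical rows contribute two duplicates, not three.
rows = pd.DataFrame(
    {"CustomerID": ["A-1", "A-1", "A-1", "B-2"], "Count": [1, 1, 1, 1]}
)
duplicate_count = int(rows.duplicated().sum())

print(header_only_is_empty, zero_byte_error, duplicate_count)
```

In practice this means a truly empty file surfaces as an exception from the read step, while the `data.empty` check covers files that contain a header but no records.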

## Results
- Processed data saved successfully at `/data/processed/ingested_data_YYYYMMDD_HHMMSS.csv`.
- Preview of the data:
  ```plaintext
     CustomerID  Count        Country       State         City  Zip Code ...
  0  3668-QPYBK      1  United States  California  Los Angeles     90003 ...
  1  9237-HQITU      1  United States  California  Los Angeles     90005 ...
  ```


## Project Setup & Environment

In [1]:
import os
import pandas as pd
from datetime import datetime


In [6]:
# Defining input and output paths
input_file_path = "/data/Telco_customer_churn.csv"  # Adjust the path as needed
output_dir = "/data/processed/"

# Creating the output directory (if it doesn't exist already)
os.makedirs(output_dir, exist_ok=True)
print(f"Output directory set to: {output_dir}")


Output directory set to: /data/processed/
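
The paths above are built from plain strings, and the logged paths later in this notebook mix `/` and `\` separators. One possible alternative, sketched here under the assumption that a single project root is known, is the standard library's `pathlib`. The name `project_root` and the temporary directory are placeholders for illustration, not part of the original setup.

```python
import tempfile
from pathlib import Path

# A temporary directory stands in for the real project root in this sketch.
with tempfile.TemporaryDirectory() as tmp:
    project_root = Path(tmp)
    input_path = project_root / "data" / "Telco_customer_churn.csv"
    output_path = project_root / "data" / "processed"

    # parents=True creates intermediate directories;
    # exist_ok=True makes the cell safe to rerun.
    output_path.mkdir(parents=True, exist_ok=True)

    output_dir_created = output_path.is_dir()
    input_file_present = input_path.is_file()  # no CSV was written in this sketch

    print(f"Output directory exists: {output_dir_created}")
    print(f"Input file present: {input_file_present}")
```

Checking `is_file()` before reading gives a clearer message than a `FileNotFoundError` raised from inside `pd.read_csv`, although either approach works.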


## Data Ingestion

In [3]:
class DataIngestion:
    def __init__(self, input_file_path, output_dir):
        """
        Initializes the DataIngestion class with file path and output directory.

        :param input_file_path: Path to the input CSV file.
        :param output_dir: Directory to save the processed data.
        """
        self.input_file_path = input_file_path
        self.output_dir = output_dir

    def read_data(self):
        """
        Reads data from the input file.

        :return: Pandas DataFrame containing the dataset.
        """
        try:
            print(f"Reading data from {self.input_file_path}...")
            data = pd.read_csv(self.input_file_path)
            print(f"Data read successfully with {data.shape[0]} rows and {data.shape[1]} columns.")
            return data
        except Exception as e:
            print(f"Error reading data: {e}")
            raise

    def validate_data(self, data):
        """
        Performs basic validation checks on the dataset.

        :param data: Pandas DataFrame.
        """
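        if data.empty:
            raise ValueError("The dataset is empty. Please provide a valid file.")
        # pandas names blank header cells "Unnamed: N", so look for those as well as empty names
        unnamed = [c for c in data.columns if not str(c).strip() or str(c).startswith("Unnamed:")]
        if unnamed:
            raise ValueError(f"Some columns have no names: {unnamed}. Please check the file.")
        print("Basic validation checks passed.")

        # Check for duplicates (counted once and reused in the message)
        duplicate_count = int(data.duplicated().sum())
        if duplicate_count > 0:
            print(f"Warning: Dataset contains {duplicate_count} duplicate rows.")
        else:
            print("No duplicate rows found.")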

    def save_data(self, data):
        """
        Saves the ingested data to the output directory with a timestamp.

        :param data: Pandas DataFrame.
        """
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        output_file = os.path.join(self.output_dir, f"ingested_data_{timestamp}.csv")
        try:
            data.to_csv(output_file, index=False)
            print(f"Data saved to {output_file}.")
        except Exception as e:
            print(f"Error saving data: {e}")
            raise

    def run(self):
        """
        Runs the data ingestion pipeline.
        """
        data = self.read_data()
        self.validate_data(data)
        self.save_data(data)
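

A note on the column-name validation: when a header cell is blank, `pd.read_csv` fills in a placeholder such as `Unnamed: 1`, so a truthiness test on the column names alone does not detect it. The sketch below shows this on a made-up CSV string; the helper name `find_unnamed_columns` is illustrative and not part of the class above.

```python
import io

import pandas as pd

# Hypothetical CSV whose second header cell is blank.
raw = "CustomerID,,Count\nA-1,x,1\n"
frame = pd.read_csv(io.StringIO(raw))

column_names = list(frame.columns)

# Every name is a non-empty string, so this evaluates to True
# even though one column had no header in the file.
truthiness_check_passes = all(frame.columns)


def find_unnamed_columns(df: pd.DataFrame) -> list:
    """Return the placeholder names pandas assigned to blank header cells."""
    return [str(c) for c in df.columns if str(c).startswith("Unnamed:")]


unnamed = find_unnamed_columns(frame)
print(column_names, truthiness_check_passes, unnamed)
```

A prefix test like this would also flag a column that is genuinely called `Unnamed: ...` in the source file, which seems an acceptable trade-off for this dataset.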


In [4]:
# Instantiate and run the ingestion pipeline
ingestion = DataIngestion(input_file_path=input_file_path, output_dir=output_dir)
ingestion.run()


Reading data from ../data/telco_customer_churn.csv...
Data read successfully with 7043 rows and 33 columns.
Basic validation checks passed.
No duplicate rows found.
Data saved to ../data/processed/ingested_data_20250105_192040.csv.


In [5]:
# Listing all files in the processed data directory
import glob

processed_files = glob.glob(os.path.join(output_dir, "*.csv"))
print("Processed files:", processed_files)

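# Loading the latest ingested file, chosen by modification time
# (os.path.getctime is creation time on Windows but metadata-change time on Unix)
latest_file = max(processed_files, key=os.path.getmtime)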
print(f"Loading the latest processed file: {latest_file}")
processed_data = pd.read_csv(latest_file)
print(processed_data.head())


Processed files: ['../data/processed\\ingested_data_20250105_185658.csv', '../data/processed\\ingested_data_20250105_191921.csv', '../data/processed\\ingested_data_20250105_192040.csv']
Loading the latest processed file: ../data/processed\ingested_data_20250105_192040.csv
   CustomerID  Count        Country       State         City  Zip Code  \
0  3668-QPYBK      1  United States  California  Los Angeles     90003   
1  9237-HQITU      1  United States  California  Los Angeles     90005   
2  9305-CDSKC      1  United States  California  Los Angeles     90006   
3  7892-POOKP      1  United States  California  Los Angeles     90010   
4  0280-XJGEX      1  United States  California  Los Angeles     90015   

                 Lat Long   Latitude   Longitude  Gender  ...        Contract  \
0  33.964131, -118.272783  33.964131 -118.272783    Male  ...  Month-to-month   
1   34.059281, -118.30742  34.059281 -118.307420  Female  ...  Month-to-month   
2  34.048013, -118.293953  34.048013 -1
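
Because each saved file carries its timestamp in the name, the latest file can also be chosen from the names alone, without relying on filesystem times that may change when files are copied. The following is a minimal sketch under the assumption that every file follows the `ingested_data_YYYYMMDD_HHMMSS.csv` pattern; `latest_by_name` is a hypothetical helper, and the example uses the base names of the three files listed in the output above.

```python
import re
from datetime import datetime

# Matches the naming scheme produced by the save step.
TIMESTAMP_PATTERN = re.compile(r"ingested_data_(\d{8}_\d{6})\.csv$")


def latest_by_name(paths):
    """Return the path with the most recent timestamp in its name, or None."""
    stamped = []
    for path in paths:
        match = TIMESTAMP_PATTERN.search(str(path))
        if match:
            parsed = datetime.strptime(match.group(1), "%Y%m%d_%H%M%S")
            stamped.append((parsed, path))
    if not stamped:
        return None  # no file matched the expected pattern
    return max(stamped, key=lambda item: item[0])[1]


files = [
    "ingested_data_20250105_185658.csv",
    "ingested_data_20250105_192040.csv",
    "ingested_data_20250105_191921.csv",
]
latest = latest_by_name(files)
print(latest)
```

Returning `None` for an empty or non-matching list avoids the `ValueError` that `max()` raises on an empty sequence.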

### Inspect Processed Data

In [16]:
import pandas as pd
df = pd.read_csv("../data/processed/ingested_data_20250105_192040.csv")
print(df.head())


   CustomerID  Count        Country       State         City  Zip Code  \
0  3668-QPYBK      1  United States  California  Los Angeles     90003   
1  9237-HQITU      1  United States  California  Los Angeles     90005   
2  9305-CDSKC      1  United States  California  Los Angeles     90006   
3  7892-POOKP      1  United States  California  Los Angeles     90010   
4  0280-XJGEX      1  United States  California  Los Angeles     90015   

                 Lat Long   Latitude   Longitude  Gender  ...        Contract  \
0  33.964131, -118.272783  33.964131 -118.272783    Male  ...  Month-to-month   
1   34.059281, -118.30742  34.059281 -118.307420  Female  ...  Month-to-month   
2  34.048013, -118.293953  34.048013 -118.293953  Female  ...  Month-to-month   
3  34.062125, -118.315709  34.062125 -118.315709  Female  ...  Month-to-month   
4  34.039224, -118.266293  34.039224 -118.266293    Male  ...  Month-to-month   

  Paperless Billing             Payment Method  Monthly Charges Tota
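
`head()` shows only the first rows, and wide frames are wrapped and truncated in the printed output. A per-column summary can complement it. The sketch below is one possible approach; `summarize` is a hypothetical helper, and the three-row frame is synthetic, so the printed values say nothing about the actual dataset.

```python
import pandas as pd


def summarize(df: pd.DataFrame) -> pd.DataFrame:
    """One row per column: dtype, missing-value count, distinct non-null values."""
    return pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "missing": df.isna().sum(),
            "unique": df.nunique(),
        }
    )


# Synthetic sample with one missing value (illustrative only).
sample = pd.DataFrame(
    {
        "CustomerID": ["A-1", "B-2", "C-3"],
        "Monthly Charges": [53.85, None, 70.70],
    }
)
summary = summarize(sample)
print(summary)
```

Applied to the ingested file, the same call would be `summarize(df)`.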