<a href="https://colab.research.google.com/github/Mahjabeenqamar1/Machain-Learning/blob/main/Data_ETL_Pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Data ETL Pipeline using Python
To develop a Data ETL pipeline using Python, the first step is to collect data from a data source. Let’s use the Fashion-MNIST dataset provided by the Keras library to keep things beginner-friendly:

In [1]:
import tensorflow.keras as keras
(xtrain, ytrain), (xtest, ytest) = keras.datasets.fashion_mnist.load_data()

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/train-labels-idx1-ubyte.gz
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/train-images-idx3-ubyte.gz
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/t10k-labels-idx1-ubyte.gz
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/t10k-images-idx3-ubyte.gz


In [2]:
# look at the shape of the data:
print(xtrain.shape)
print(ytrain.shape)
print(xtest.shape)
print(ytest.shape)

(60000, 28, 28)
(60000,)
(10000, 28, 28)
(10000,)


In [3]:
#clean and transform the data. Here we will normalize the pixel values to be between 0 and 1 and reshape the data into a 4D tensor
import numpy as np

xtrain = xtrain.astype('float32') / 255
xtest = xtest.astype('float32') / 255

xtrain = np.reshape(xtrain, (xtrain.shape[0], 28, 28, 1))
xtest = np.reshape(xtest, (xtest.shape[0], 28, 28, 1))

print(xtrain.shape)
print(ytrain.shape)
print(xtest.shape)
print(ytest.shape)

(60000, 28, 28, 1)
(60000,)
(10000, 28, 28, 1)
(10000,)


In [5]:
#load the data into a database. We can use SQLite to create a database and load the data into it
import sqlite3

conn = sqlite3.connect('fashion_mnist.db')

conn.execute('''CREATE TABLE IF NOT EXISTS images
             (id INTEGER PRIMARY KEY AUTOINCREMENT,
             image BLOB NOT NULL,
             label INTEGER NOT NULL);''')

for i in range(xtrain.shape[0]):
    conn.execute('INSERT INTO images (image, label) VALUES (?, ?)',
                [sqlite3.Binary(xtrain[i]), ytrain[i]])

conn.commit()

for i in range(xtest.shape[0]):
    conn.execute('INSERT INTO images (image, label) VALUES (?, ?)',
                [sqlite3.Binary(xtest[i]), ytest[i]])

conn.commit()

conn.close()

In [7]:
#ead the data you stored on the SQLite database
import sqlite3
conn = sqlite3.connect('fashion_mnist.db')
cursor = conn.cursor()

cursor.execute('SELECT * FROM images')
rows = cursor.fetchall()

import pandas as pd
data = pd.read_sql_query('SELECT * FROM images', conn)
data


Unnamed: 0,id,image,label
0,1,"b""\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...",b'\t'
1,2,b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...,b'\x00'
2,3,b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...,b'\x00'
3,4,"b""\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...",b'\x03'
4,5,b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...,b'\x00'
...,...,...,...
139995,139996,b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...,b'\t'
139996,139997,"b""\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...",b'\x01'
139997,139998,b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...,b'\x08'
139998,139999,b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...,b'\x01'


##Summary
The code demonstrates the development of a Data ETL pipeline using Python. It extracts data from the Fashion-MNIST dataset, applies data cleaning and transformation, and loads it into an SQLite database. The pipeline enables efficient extraction, transformation, and loading, facilitating downstream data processing. By normalizing pixel values and reshaping the data, it prepares it for analysis. The SQLite database provides a structured storage for easy access and retrieval. Overall, this Python-based ETL pipeline serves as a practical example for understanding and implementing data extraction, transformation, and loading processes, essential for effective data management and analysis.