# [Sample Notebook] AfterWork: Data Engineering with Python and Cassandra

## Prerequisites

In [124]:
# Install the Cassandra python driver
!pip install cassandra-driver



In [125]:
# Import the necessary libraries
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider
import json

## Creating a Cassandra database

Here's a step-by-step tutorial on how to create a Cassandra database on DataStax Astra and connect to it using Python. This option does not require any credit card details.

**Step 1: Sign up for DataStax Astra**

To use DataStax Astra, you must first sign up for an account. Go to the DataStax Astra website (https://astra.datastax.com/register) and sign up for a free account.

**Step 2: Create a database**

Once you have created an account and logged in, you can create a new database. Click on the "Create Database" button and follow the prompts to create a new database.

**Step 3: Create a keyspace**

After creating a database, you need to create a keyspace. Click on the "Add Keyspace" button and follow the prompts to create a new keyspace.

**Step 4: Generate an application token**

To connect to your Cassandra database using Python, you'll need to generate an application token. Go to the "Settings" tab and click on the "Generate New Token" button. Copy the generated token or download it as a JSON file; the connection code in the next section reads it from `Cassandra-token.json`.
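
Before connecting, it can help to check that the token file contains the two fields the connection code reads, `clientId` and `secret`. The following is a minimal sketch, not part of the original notebook: the helper name `load_astra_secrets` is hypothetical, and the demo file holds placeholder values rather than real credentials.

```python
import json
import os
import tempfile


def load_astra_secrets(path):
    """Read a token JSON file and return (client_id, client_secret).

    Hypothetical helper: raises KeyError if an expected field is missing.
    """
    with open(path) as f:
        secrets = json.load(f)
    missing = [key for key in ("clientId", "secret") if key not in secrets]
    if missing:
        raise KeyError(f"Token file is missing keys: {missing}")
    return secrets["clientId"], secrets["secret"]


# Demonstration with a throwaway file and placeholder values (not real credentials)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo-token.json")
    with open(demo_path, "w") as f:
        json.dump({"clientId": "demo-id", "secret": "demo-secret"}, f)
    client_id, client_secret = load_astra_secrets(demo_path)

print(client_id)
```

With a real token file, you would pass its path (for example `Cassandra-token.json`) instead of the demo path.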

## 1. Setting up the Connection

In [None]:
# The secure connect bundle (SCB) is the zip file you download from Astra.
# Update the file name below if yours is different.
cloud_config= {
  'secure_connect_bundle': 'secure-connect-cassandra.zip'
}

# The token JSON file is generated when you download your token.
# Update the file name below if yours is different.
with open("Cassandra-token.json") as f:
    secrets = json.load(f)

CLIENT_ID = secrets["clientId"]
CLIENT_SECRET = secrets["secret"]

auth_provider = PlainTextAuthProvider(CLIENT_ID, CLIENT_SECRET)
cluster = Cluster(cloud=cloud_config, auth_provider=auth_provider)
session = cluster.connect()

# cluster.connect() raises an exception on failure,
# so reaching this line means the connection succeeded
print('Connected!')

### <font color='green'>Challenge: Create a Cassandra Database on DataStax Astra and Connect to It Using Python</font>

In this challenge, create a Cassandra database on DataStax Astra and then connect to it using Python.

Hints:

* Sign up for a free account on DataStax Astra
* Create a new database and a keyspace within it
* Generate an Application Token for authentication
* Install the DataStax Python Driver
* Connect to your Cassandra database using Python

In [None]:
# Your code to install the Cassandra driver goes here


In [None]:
# Your code to connect to your Cassandra database goes here


## 2. Loading and Reading Data from Cassandra

Before you can load data into Cassandra, you need a keyspace on DataStax Astra and an open connection to the database. With both in place, create a table. The code below assumes the keyspace is named `example`.



In [None]:
# Create a table in the 'example' keyspace
session.execute("""
    CREATE TABLE IF NOT EXISTS example.users (
        id int PRIMARY KEY,
        name text,
        age int
    )
""")

Once you have created a keyspace and a table, you can load data into Cassandra. The following code inserts two rows into the "users" table.




In [None]:
session.execute("""
    INSERT INTO example.users (id, name, age)
    VALUES (%s, %s, %s)
""", (1, 'John Doe', 30))

session.execute("""
    INSERT INTO example.users (id, name, age)
    VALUES (%s, %s, %s)
""", (2, 'Jane Doe', 28))


Finally, you can read data from Cassandra. Here is an example of how to read data from the "users" table:




In [None]:
# Select the data from the users table
rows = session.execute("SELECT * FROM example.users")

for row in rows:
    print(row.id, row.name, row.age)

# Close the session
session.shutdown()

### <font color='green'>Challenge: Loading and Reading Data from Cassandra</font>

Your task is to create a Python script that connects to a Cassandra cluster, creates a keyspace and a table to store telecommunications data, inserts data into the table, and then queries the data.

Here are the steps to follow:

1. Connect to the Cassandra cluster using the cassandra-driver.
2. Create a keyspace named `telecommunications` with a replication factor of 1.
3. Create a table named `call_logs` with the following columns:
   * `call_id` as an integer and primary key.
   * `call_time` as a timestamp.
   * `caller_number` as text.
   * `callee_number` as text.
   * `duration` as an integer.
   * `call_type` as text.
4. Insert at least 5 rows of sample data into the `call_logs` table. You can use any dummy data you like, e.g. `1, '2023-03-21 10:00:00+0000', '555-1234', '555-5678', 60, 'outgoing'`.
5. Query the data from the `call_logs` table and print the results.

Hints:
* Use the code from the lesson to connect to the Cassandra cluster, create a keyspace and a table, and insert data into the table.
* To query the data from the call_logs table, use the SELECT statement with the FROM clause and the name of the table.
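
The sample row passes `call_time` as a string such as `'2023-03-21 10:00:00+0000'`. If you would rather work with Python `datetime` objects, which the driver maps to `timestamp` columns, the sketch below shows one way to parse that string format. The helper name `parse_call_time` is hypothetical, and the format string assumes every value follows the pattern of the sample row.

```python
from datetime import datetime, timezone


def parse_call_time(text):
    """Parse a string like '2023-03-21 10:00:00+0000' into a timezone-aware datetime."""
    # %z consumes the numeric UTC offset at the end of the string
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S%z")


call_time = parse_call_time("2023-03-21 10:00:00+0000")
print(call_time.isoformat())  # 2023-03-21T10:00:00+00:00
```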

In [None]:
# Connect to the cluster
# Your code goes here

# Create the 'call_logs' table
# Your code goes here

# Insert sample data into the 'call_logs' table
session.execute("""
    INSERT INTO telecommunications.call_logs (call_id, call_time, caller_number, callee_number, duration, call_type)
    VALUES (%s, %s, %s, %s, %s, %s)
""", (1, '2023-03-21 10:00:00+0000', '555-1234', '555-5678', 60, 'outgoing'))

session.execute("""
    INSERT INTO telecommunications.call_logs (call_id, call_time, caller_number, callee_number, duration, call_type)
    VALUES (%s, %s, %s, %s, %s, %s)
""", (2, '2023-03-21 10:05:00+0000', '555-5678', '555-1234', 120, 'incoming'))

session.execute("""
    INSERT INTO telecommunications.call_logs (call_id, call_time, caller_number, callee_number, duration, call_type)
    VALUES (%s, %s, %s, %s, %s, %s)
""", (3, '2023-03-21 10:10:00+0000', '555-1234', '555-7890', 180, 'outgoing'))

# Your code goes here

# Your code goes here


In [None]:
# Select the data from the call_logs table
# Your code goes here



# Close the session
# Your code goes here


## 3. Updating and Deleting Data from Cassandra

To update data in Cassandra, we will use the UPDATE statement. We can update a user's age using the following Python code:



In [None]:
# Open a new session (the previous one was shut down)
session = cluster.connect()

In [None]:
# Define the query
query = "UPDATE example.users SET age = %s WHERE id = %s"

# execute the query
session.execute(query, (50, 1))

In [None]:
# Delete data in Cassandra
query = "DELETE FROM example.users WHERE id = %s"
session.execute(query, (2,))

In [None]:
# Select the data from the users table
rows = session.execute("SELECT * FROM example.users")

for row in rows:
    print(row.id, row.name, row.age)

# Close the session
session.shutdown()

## 4. Load CSV data into Cassandra

In [None]:
# Open a new session; the queries below qualify tables with the 'example' keyspace
session = cluster.connect()

# Load the CSV data into a pandas DataFrame
import pandas as pd
df = pd.read_csv('https://archive.org/download/e202407/example.csv')

# Prepare the insert once, then execute it for every row of the DataFrame
insert_query = session.prepare("INSERT INTO example.users (id, name, age) VALUES (?, ?, ?)")

for _, row in df.iterrows():
    # Cast to built-in int so the driver receives plain Python integers
    session.execute(insert_query, (int(row['id']), row['name'], int(row['age'])))

print("Data inserted successfully.")

In [None]:
# Select the data from the users table
rows = session.execute("SELECT * FROM example.users")

for row in rows:
    print(row.id, row.name, row.age)

# Close the session
session.shutdown()

### <font color="green">Challenge</font>

Write code to populate the call_logs table with the data in this dataset (https://archive.org/download/call_logs_202407/call_logs.csv), and then query the call_logs table to confirm that the data was loaded.
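
One way to prepare CSV rows for insertion is to convert each record into a tuple whose order matches the placeholders of the insert statement. The sketch below uses only the standard library and an in-memory sample. The column names and the timestamp format are assumptions based on the `call_logs` schema used earlier, so check them against the actual file; `csv_to_params` is a hypothetical helper.

```python
import csv
import io
from datetime import datetime

# Made-up sample that mimics the ASSUMED layout of call_logs.csv
SAMPLE_CSV = """call_id,call_time,caller_number,callee_number,duration,call_type
1,2023-03-21 10:00:00+0000,555-1234,555-5678,60,outgoing
2,2023-03-21 10:05:00+0000,555-5678,555-1234,120,incoming
"""


def csv_to_params(handle):
    """Turn CSV records into typed tuples, one per row, in column order."""
    params = []
    for record in csv.DictReader(handle):
        params.append((
            int(record["call_id"]),
            datetime.strptime(record["call_time"], "%Y-%m-%d %H:%M:%S%z"),
            record["caller_number"],
            record["callee_number"],
            int(record["duration"]),
            record["call_type"],
        ))
    return params


params = csv_to_params(io.StringIO(SAMPLE_CSV))
print(len(params), params[0][0], params[0][4])  # 2 1 60
```

Each tuple could then be passed as the parameters of a prepared insert, in the same way the users example passes `(id, name, age)`.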

In [None]:
# Connect to the cluster
# Your code goes here


# Load the CSV data into a pandas DataFrame
# Your code goes here


# Convert 'call_time' column to datetime objects
# Your code goes here


# Insert data from DataFrame into Cassandra table
# Your code goes here



In [None]:
# Select the data from the call_logs table
# Your code goes here


# Close the session
# Your code goes here



## 5. Creating a Data Pipeline using Python and Cassandra

Let's now learn how to create a data pipeline using Python and Cassandra.
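
A pipeline of this kind has three stages: extract, transform, and load. As a rough illustration of that shape, here is a minimal sketch that runs without a database. Plain Python lists stand in for the Cassandra tables, and the function names `extract`, `transform`, and `load` are hypothetical.

```python
def extract(source_rows):
    """Copy rows out of the source (a list standing in for a table)."""
    return [dict(row) for row in source_rows]


def transform(rows):
    """Return new rows with the 'name' field lowercased."""
    return [{**row, "name": row["name"].lower()} for row in rows]


def load(rows, target):
    """Append rows to the target (a list standing in for a table)."""
    target.extend(rows)
    return len(rows)


source = [
    {"id": 1, "name": "John Doe", "age": 30},
    {"id": 2, "name": "Jane Doe", "age": 28},
]
target = []

loaded = load(transform(extract(source)), target)
print(loaded, target[0]["name"])  # 2 john doe
```

Keeping each stage in its own function makes it possible to test the transformation on its own before pointing the pipeline at a real cluster.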

In [None]:
def connect_to_cassandra():
    cloud_config = {
        'secure_connect_bundle': 'secure-connect-cassandra.zip'
    }
    with open("Cassandra-token.json") as f:
        secrets = json.load(f)

    CLIENT_ID = secrets["clientId"]
    CLIENT_SECRET = secrets["secret"]

    auth_provider = PlainTextAuthProvider(CLIENT_ID, CLIENT_SECRET)
    cluster = Cluster(cloud=cloud_config, auth_provider=auth_provider)
    session = cluster.connect()
    return session

In [None]:
def extract_data(session, keyspace, table):
    query = f"SELECT * FROM {keyspace}.{table}"
    rows = session.execute(query)
    data = [row._asdict() for row in rows]
    df = pd.DataFrame(data)
    print(df.head())
    print('Data Extracted Successfully!')
    return df

In [None]:
def transform_data(df):
    df['name'] = df['name'].str.lower()
    print('Data Transformed Successfully!')
    return df

In [None]:
def load_data(session, keyspace, table, df):
    session.execute(f"""
        CREATE TABLE IF NOT EXISTS {keyspace}.{table} (
            id int PRIMARY KEY,
            name text,
            age int
        )
    """)

    insert_query = session.prepare(f"""
        INSERT INTO {keyspace}.{table} (id, name, age)
        VALUES (?, ?, ?)
    """)

    for index, row in df.iterrows():
        session.execute(insert_query, (row['id'], row['name'], row['age']))

    print('Data Loaded Successfully!')

In [None]:
if __name__ == "__main__":

    # Step 1: Connect to Cassandra
    session = connect_to_cassandra()

    # Step 2: Extract data from the 'example.users' table
    df = extract_data(session, 'example', 'users')

    # Step 3: Transform data (lowercase conversion of the 'name' field)
    df_transformed = transform_data(df)

    # Step 4: Load the transformed data back into a new table in the 'example' keyspace
    load_data(session, 'example', 'users_clean', df_transformed)

    print("Data Pipeline Executed Successfully.")

Let's confirm that the transformed data was loaded into the `users_clean` table.

In [None]:
# Select the data from the users_clean table
session = connect_to_cassandra()
rows = session.execute("SELECT * FROM example.users_clean")
for row in rows:
    print(row.id, row.name, row.age)

### <font color="green">Challenge</font>

Your task is to create a Python script that connects to a Cassandra cluster, extracts data from the `call_logs` table in the telecommunications keyspace, transforms the data by converting the duration from seconds to microseconds, and loads the transformed data into a new table in the same keyspace.

* Connect to the Cassandra cluster using the cassandra-driver.
* Extract data from the `telecommunications.call_logs` table.
* Transform the data by converting the duration column from seconds to microseconds.
* Load the transformed data back into a new table named `call_logs_transformed` in the telecommunications keyspace.
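
One detail to watch in the transformation: a CQL `int` column is a 32-bit signed integer, so a duration in microseconds no longer fits once a call exceeds roughly 35 minutes. The sketch below illustrates the arithmetic. The helper names are hypothetical, and choosing `bigint` (64-bit) for the duration column of the new table is a suggestion rather than part of the original task.

```python
INT32_MAX = 2**31 - 1  # largest value a CQL `int` column can hold
INT32_MIN = -2**31


def seconds_to_microseconds(seconds):
    """Convert a duration in seconds to microseconds."""
    return seconds * 1_000_000


def fits_in_cql_int(value):
    """Return True if the value fits in a 32-bit signed integer."""
    return INT32_MIN <= value <= INT32_MAX


short_call = seconds_to_microseconds(60)    # a one-minute call
long_call = seconds_to_microseconds(3600)   # a one-hour call

print(short_call, fits_in_cql_int(short_call))  # 60000000 True
print(long_call, fits_in_cql_int(long_call))    # 3600000000 False
```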

In [None]:
# Define the pipeline
# Your code goes here



In [None]:
# Confirm the transformation
# Your code goes here

# Close the session
# Your code goes here
