<header style="padding:1px;background:#f9f9f9;border-top:3px solid #00b2b1"><img id="Teradata-logo" src="https://www.teradata.com/Teradata/Images/Rebrand/Teradata_logo-two_color.png" alt="Teradata" width="220" align="right" />

<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>Getting Started with Jupyter Notebook, Python, and Pandas</b>
</header>

<p style = 'font-size:16px;font-family:Arial'><b>Links:</b></p>
<ul style = 'font-size:16px;font-family:Arial'>
    <li>Jupyter Notebook Reference: <a href = 'https://jupyter.org/documentation'>https://jupyter.org/documentation</a></li>
    <li>Python Pandas Reference: <a href = 'https://pandas.pydata.org/docs/user_guide/index.html'>https://pandas.pydata.org/docs/user_guide/index.html</a></li>
</ul>

<p style = 'font-size:16px;font-family:Arial'><b>Contents</b></p>
<ol style = 'font-size:16px;font-family:Arial'>
    <li>Overview of Jupyter Notebook</li>
<li>Overview of Python Pandas
    <ul><li>Series</li>
    <li>DataFrames</li>
    <li>Accessing Data</li>
    <li>Modifying DataFrames</li></ul></li>
<li>Loading Data From Files
    <ul><li>DataFrame metadata</li>
    <li>Grouping, Aggregation, and Pivot Tables</li>
        <li>Data Cleansing and Transformation</li></ul></li>
<li>Loading Data From Databases and Remote Systems
    <ul><li>Combining DataFrames - joins and concatenation</li></ul></li>
    <li>Persisting Data - saving to files</li></ol>

<hr>

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>Experience</b></p>


<p style = 'font-size:16px;font-family:Arial'>This demo takes about 2 minutes to run.</p>

<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>Section 1. Jupyter Notebook</b>

<p style = 'font-size:16px;font-family:Arial'>Jupyter Notebook is an open-source browser-based application that allows users to create and share interactive documents that contain live code, equations, visualizations, and/or narrative text.  Common uses include:</p>

<ul style = 'font-size:16px;font-family:Arial'>
    <li>Data Cleansing and Transformation</li>
    <li>Numerical Simulation</li>
    <li>Statistical Modeling</li>
    <li>Data Visualization</li>
    <li>Machine Learning</li>
    <li>And much more</li>
</ul>
    
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>Anatomy of a Notebook</b></p>

<p style = 'font-size:16px;font-family:Arial'>Notebooks consist of individual cells that can contain text, images, or executable code.</p>
    <br>

**Text Cells** use a formatting syntax called markdown that uses special character sequences to format output.  For example, to make text **bold** one would wrap the text in two asterisks, like this: ```**bold**```

<p style = 'font-size:16px;font-family:Arial'>Note that the text above doesn't look quite like the rest of the text.  Markdown has very simple styling; for more advanced control over text presentation, one can use HTML:</p>

```html
<p style = 'font-size:16px;font-family:Arial'>Note that the text above doesn't look quite like the rest of the text.  Markdown has very simple styling; for more advanced control over text presentation, one can use HTML:</p>
```

<p style = 'font-size:16px;font-family:Arial'><b>Code Cells</b> contain executable code.  The code can be in different languages; Python is the most common, although R and Julia are also common, and Teradata provides a dedicated SQL plugin for Vantage.</p>

<p style = 'font-size:16px;font-family:Arial'><b>Magic Statements</b> are special directives that aren't in the underlying language of the notebook.  These statements direct the underlying notebook application to perform some function.  For Python notebooks, the IPython engine supports magic statements documented <a href = 'https://ipython.readthedocs.io/en/stable/interactive/magics.html'>here</a>.  Usually these statements are preceded by a special character such as '%' or '!' and will instruct the environment to do something such as execute command-line statements, display images inline, save output to a file, etc.</p>

<hr>

In [1]:
# An Example of a Code Cell.
# To Run this Python code, execute the cell by clicking the "play" button
# Or pressing SHIFT-ENTER

print('hello world')

hello world


In [2]:
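# Some Python code preceded by a magic statement.
# Note: %time on a line by itself times an empty statement, not the rest of the cell
# (hence the microsecond timings below); %%time on the first line of a cell times the whole cell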

%time
print('How long did it take?')

CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 4.53 µs
How long did it take?


<p style = 'font-size:16px;font-family:Arial'><b>Note - take a look at <a href = 'https://jupyterlab.readthedocs.io/en/stable/user/interface.html#keyboard-shortcuts'>Keyboard Shortcuts</a> in the documentation.  Also, if you're in a Classic Jupyter Notebook, select Help->Keyboard Shortcuts.  There are many shortcuts to assist in working with code, executing code blocks, etc.</b></p>
<hr>

<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>Section 2. Python Pandas</b>

<p style = 'font-size:16px;font-family:Arial'><a href = 'https://pandas.pydata.org'>Python Pandas</a> is one of the most popular and powerful Python packages for working with data.  Pandas also provides the functional and linguistic basis for the Teradata "teradataml" package.  This notebook barely scratches the surface of what's available in Pandas; the documentation and community resources are a great place to continue learning.</p>
<hr>    
<b style = 'font-size:18px;font-family:Arial;color:#E37C4D'>Import Statements</b>


<p style = 'font-size:16px;font-family:Arial'>Pandas is a separate library that must be installed and loaded into your environment prior to use. Pandas can be installed via a magic command that will execute the command-line tool 'pip' - the Python package management utility. Run the following "!pip install" command if necessary.</p>

In [3]:
!pip install --upgrade pandas --user
!pip install --upgrade numpy --user



<p style = 'font-size:16px;font-family:Arial'>Common convention is to load the entire library with the alias of 'pd'.  Most online resources will use this convention.  Run the following cell:</p>

In [4]:
# commonly, all general import statements will be
# written near the top of a notebook or file

import getpass
import io
import numpy as np
import pandas as pd
from teradataml import create_context, remove_context

<hr>
<b style = 'font-size:18px;font-family:Arial;color:#E37C4D'>Pandas Series</b>
    
<p style = 'font-size:16px;font-family:Arial'>A Pandas Series is an object that contains a list of objects and a labeled index.  Think about it as a single column in an Excel worksheet, but with the added flexibility that each row can carry a label instead of just a row number.  This is very powerful, since items can be referenced using this label value instead of just position.</p>

<p style = 'font-size:16px;font-family:Arial'>Items in a Series can consist of any Python object: strings, numbers, or even functions or programs.</p>

In [5]:
# Create a simple series from a list of values

# This is a simple Python list object with some numbers and strings as items
d = [1, 2, 'foo', 'bar']

# Create a Pandas Series by passing the list above as the data parameter to the Series() constructor.
# Note the convention 'pd.Name()' for anything called from the Pandas module:
my_series = pd.Series(data = d)

# Show the Series - the index has been auto-assigned as a positional number
my_series

0      1
1      2
2    foo
3    bar
dtype: object

In [6]:
# This Series can be enhanced by adding a list of labels for the index
labels = ['row1', 'row2', 'row3', 'row4']

# Re-create the Series with the labels as an index by adding a new parameter:
my_series = pd.Series(data = d, index = labels)

my_series

row1      1
row2      2
row3    foo
row4    bar
dtype: object

In [7]:
# Index value can be used to reference the item
my_series['row2']

2
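
<p style = 'font-size:16px;font-family:Arial'>As a small supplementary sketch (not part of the original notebook cells), the same Series can be read either by label or by position.  It assumes the same data and labels used above; the names by_label, by_position, and subset are illustrative only.</p>

```python
import pandas as pd

# Same data and labels as the cells above
d = [1, 2, 'foo', 'bar']
labels = ['row1', 'row2', 'row3', 'row4']
my_series = pd.Series(data = d, index = labels)

# .loc selects by index label, .iloc selects by integer position
by_label = my_series.loc['row3']
by_position = my_series.iloc[2]

# Passing a list of labels returns a new Series with just those items
subset = my_series.loc[['row1', 'row4']]

print(by_label, by_position)
print(subset)
```

<p style = 'font-size:16px;font-family:Arial'>Using .loc and .iloc explicitly makes it clear whether a value is meant as a label or as a position.</p>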

<hr>
<b style = 'font-size:18px;font-family:Arial;color:#E37C4D'>Pandas DataFrame</b>
    
<p style = 'font-size:16px;font-family:Arial'>Pandas DataFrames form the core of Pandas Data Management capabilities.  Most analytical and other operations are performed on DataFrames using many different method calls.</p>

<p style = 'font-size:16px;font-family:Arial'>A Pandas DataFrame closely resembles a table, with rows and columns.  A more Pandas-native way to understand a DataFrame is that it consists of multiple labeled Series that share an index, where each Series is a column.</p>

In [8]:
# Create lists of values and an index of labels as source data
Age = [23, 33, 55, 78]
FName = ['Bob', 'Sue', 'Jen', 'Jack']
ID = ['p1', 'p2', 'p3', 'p4']

# Create two series using the same index values:
series_1 = pd.Series(data = Age, index = ID)
series_2 = pd.Series(data = FName, index = ID)

# Create a DataFrame by concatenating the two Series
# concat() is a Pandas Module Function (note the 'pd.' prefix) that will
# combine multiple Series or DataFrames together along either axis
# These series are passed as a List object:
df_1 = pd.concat([series_1, series_2], axis = 1) # axis = 1 means combine columns

df_1

Unnamed: 0,0,1
p1,23,Bob
p2,33,Sue
p3,55,Jen
p4,78,Jack


In [9]:
# Modifications can be performed on DataFrames
# This is an example of renaming the column labels using a dictionary object as a parameter
# the dictionary contains keys based on the current column label, and a value of the new one.
# This method call RETURNS a new DataFrame with the columns renamed - more on this below
df_1.rename(columns = {0:'Age', 1:'First Name'})

Unnamed: 0,Age,First Name
p1,23,Bob
p2,33,Sue
p3,55,Jen
p4,78,Jack


<hr>
<b style = 'font-size:18px;font-family:Arial;color:#E37C4D'>Methods that return an object vs. modify the object</b>
    
<p style = 'font-size:16px;font-family:Arial'>Most Pandas Methods will <b>Return</b> an object (DataFrame or Series) and <b>not</b> modify the underlying object.  In the cell above, the 'rename()' method returns a DataFrame as a response to the method call - that's why the output is displayed the way we expect to see it.  However, the original DataFrame has not been modified:</p>

In [10]:
# Run this to see that df_1 hasn't been modified
df_1

Unnamed: 0,0,1
p1,23,Bob
p2,33,Sue
p3,55,Jen
p4,78,Jack


<p style = 'font-size:16px;font-family:Arial'>To modify the original DataFrame or Series, assign the result of the method call back to the same variable.  Many methods also have a Boolean parameter 'inplace' which controls whether the original object is modified "in place".  However, using inplace is generally discouraged; assigning the result back is more explicit.</p>

In [11]:
# Assign equivalence:
df_1 = df_1.rename(columns = {0:'Age', 1:'First Name'})
df_1

Unnamed: 0,Age,First Name
p1,23,Bob
p2,33,Sue
p3,55,Jen
p4,78,Jack


In [12]:
# Inplace:
df_1.rename(columns = {'Age':'age', 'First Name':'FName'}, inplace = True)
df_1

Unnamed: 0,age,FName
p1,23,Bob
p2,33,Sue
p3,55,Jen
p4,78,Jack


<hr>
<b style = 'font-size:18px;font-family:Arial;color:#E37C4D'>Accessing DataFrame Contents</b>
    
<p style = 'font-size:16px;font-family:Arial'>DataFrame contents can be accessed by row, column, or by using Boolean expressions.</p>

In [13]:
# Return the Series represented by the column label 'age'
df_1['age']

p1    23
p2    33
p3    55
p4    78
Name: age, dtype: int64

In [14]:
# Or a row:
df_1.loc['p1']

age       23
FName    Bob
Name: p1, dtype: object

In [15]:
# Or a single position (row/column):
df_1.loc['p3', 'age']

55

In [16]:
# Or an expression.
df_1[df_1['age'] > 50]

Unnamed: 0,age,FName
p3,55,Jen
p4,78,Jack


In [17]:
# the expression above actually creates a Series
# that consists of Boolean values that acts as a "mask"
# applied to the DataFrame:
df_1['age'] > 50

p1    False
p2    False
p3     True
p4     True
Name: age, dtype: bool

In [18]:
# so when multiple expressions are combined:
df_1[(df_1['age'] > 50) & (df_1['FName'] == 'Jen')]
# We're actually creating a mask using Boolean operators.

Unnamed: 0,age,FName
p3,55,Jen


In [19]:
# in this case OR
# Is applied when combining two Series
(df_1['age'] > 50) | (df_1['FName'] == 'Bob')

p1     True
p2    False
p3     True
p4     True
dtype: bool

In [20]:
df_1[(df_1['age'] > 50) | (df_1['FName'] == 'Bob')]

Unnamed: 0,age,FName
p1,23,Bob
p3,55,Jen
p4,78,Jack
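
<p style = 'font-size:16px;font-family:Arial'>The mask idea extends to a few other common patterns.  The following is a minimal sketch that rebuilds a DataFrame with the same values as df_1 so it can run on its own; the variable names not_over_50, bob_or_jen, and names_over_50 are illustrative assumptions.</p>

```python
import pandas as pd

# Rebuild a DataFrame with the same values as df_1 above
df_1 = pd.DataFrame({'age': [23, 33, 55, 78],
                     'FName': ['Bob', 'Sue', 'Jen', 'Jack']},
                    index = ['p1', 'p2', 'p3', 'p4'])

# ~ inverts a Boolean mask (logical NOT)
not_over_50 = df_1[~(df_1['age'] > 50)]

# isin() builds a mask that is True where the value appears in the given list
bob_or_jen = df_1[df_1['FName'].isin(['Bob', 'Jen'])]

# .loc accepts a mask for the rows and a list of labels for the columns
names_over_50 = df_1.loc[df_1['age'] > 50, ['FName']]

print(not_over_50)
print(bob_or_jen)
print(names_over_50)
```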


<hr>
<b style = 'font-size:18px;font-family:Arial;color:#E37C4D'>DataFrame from a Dictionary Object</b>
    
<p style = 'font-size:16px;font-family:Arial'>Python Dictionaries are very powerful objects, and contain most of the information needed to construct a Pandas DataFrame.  DataFrames can be constructed directly from dictionary objects:</p>

In [21]:
# Create a dictionary consisting of two keys (A and B)
# each with a list of values
my_dict = {'A':[1,2,3,4], 'B':[5,6,7,8]}
print('Original Dictionary:')
print(my_dict)

df_from_dict = pd.DataFrame(my_dict)
print('New DataFrame:')
display(df_from_dict)

Original Dictionary:
{'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8]}
New DataFrame:


Unnamed: 0,A,B
0,1,5
1,2,6
2,3,7
3,4,8


<hr>
<b style = 'font-size:18px;font-family:Arial;color:#E37C4D'>Adding and Removing Columns</b>

In [22]:
# New columns can be created using simple declaration:
df_from_dict['C'] = df_from_dict['A'] + df_from_dict['B']
df_from_dict

Unnamed: 0,A,B,C
0,1,5,6
1,2,6,8
2,3,7,10
3,4,8,12


In [23]:
# Or using the assign() method
# note this method returns the object vs. modifying it, so to modify it, assign equivalence

df_from_dict.assign(D = df_from_dict['A'] * df_from_dict['B'])


Unnamed: 0,A,B,C,D
0,1,5,6,5
1,2,6,8,12
2,3,7,10,21
3,4,8,12,32


In [24]:
# Drop method removes rows or columns depending on axis type
# again, it returns an object, so assign equivalence or inplace = True:

df_from_dict.drop('C', axis = 1, inplace = True)
df_from_dict

Unnamed: 0,A,B
0,1,5
1,2,6
2,3,7
3,4,8


In [25]:
# Drop a row using the index:
df_from_dict.drop(3, axis = 0)

Unnamed: 0,A,B
0,1,5
1,2,6
2,3,7


<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>Section 3. Populating DataFrames using Files</b>

<p style = 'font-size:16px;font-family:Arial'>The section above provided examples of constructing Series and DataFrames from other Python objects that we created manually in code.  More commonly, users will construct DataFrames from objects created from reading files or source data vs. manual entry.  Structured data such as Excel files, CSV, JSON, etc. as well as unstructured text and images can all be used to construct DataFrames directly.</p>

    
<p style = 'font-size:16px;font-family:Arial'>A sampling of the available methods is illustrated below.  The documentation will prove invaluable for understanding the variety and flexibility available in Pandas.</p>

In [26]:
# CSV is a common structured data format
# Reading in this data can be trivially simple by using
# one of pandas' module functions - read_csv()

df_1 = pd.read_csv('data/example.csv')
df_1

Unnamed: 0,col1,col2,col3,col4
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11


<p style = 'font-size:16px;font-family:Arial'>openpyxl is a separate library that must be installed and loaded into your environment prior to use. openpyxl is a Python library to read/write Excel 2010 xlsx/xlsm/xltx/xltm files.<br>
openpyxl can be installed via a magic command that will execute the command-line tool 'pip' - the Python package management utility. Run the following "!pip install" command if necessary.</p>

In [27]:
!pip install openpyxl --user



In [28]:
# Excel is another common format
# Provide a file path and a sheet name
# Instruct which column to use as an index - in this case the first column
# finally - declare an "engine" - basically the underlying interpreter that reads the data

df_calories = pd.read_excel('data/Calorie_Log.xlsx', sheet_name = 'Sheet1', index_col = 0, engine = 'openpyxl')
df_calories

Unnamed: 0_level_0,Breakfast,Lunch,Dinner,Exercise
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2021-06-01,355,250,800 calories,Y
2022-06-02,349,435,955 calories,N
2023-06-03,280,85,735 calories,N
2024-06-03,452,125,257 calories,Y
2025-06-04,100,289,701 calories,Y
2026-06-05,70,736,321 calories,Y
2027-06-06,50,245,657 calories,N
2028-06-06,125,500,834 calories,N


<hr>
<b style = 'font-size:18px;font-family:Arial;color:#E37C4D'>Investigating DataFrames</b>
    
<p style = 'font-size:16px;font-family:Arial'>Pandas provides many methods and attributes that are used to understand the content and structure of DataFrames</p>

<p style = 'font-size:16px;font-family:Arial'>Common attributes include:</p>
<ul style = 'font-size:16px;font-family:Arial'>
    <li>shape. Returns a tuple of (number of rows, number of columns)</li>
    <li>dtypes. Returns a Series containing the Pandas datatype of each column</li>
    <li>index.  Returns an Index object holding the row labels.</li>
    <li>columns.  Returns an Index object holding the column labels</li>
</ul>

<p style = 'font-size:16px;font-family:Arial'>Common methods include:</p>
<ul style = 'font-size:16px;font-family:Arial'>
    <li>head().  Returns a DataFrame consisting of the first five rows (or pass an integer for more or fewer)</li>
    <li>info(). Prints summary information about the DataFrame (and returns None).</li>
    <li>describe().  Returns a DataFrame which contains summary statistics of the numeric columns</li>
    <li>count()/min()/max()/mean()/std().  Return the respective per-column aggregation; mean() and std() apply to numeric columns only.</li>
</ul>
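
<p style = 'font-size:16px;font-family:Arial'>The attributes and methods listed above can be tried on any DataFrame.  The sketch below uses a small made-up DataFrame (df_demo is a hypothetical name, not one of the notebook's data files) so that it runs without any external data.</p>

```python
import pandas as pd

# A small, made-up DataFrame with integer, float, and string columns
df_demo = pd.DataFrame({'A': [1, 2, 3, 4],
                        'B': [5.0, 6.5, 7.0, 8.5],
                        'C': ['w', 'x', 'y', 'z']})

print(df_demo.shape)     # tuple of (number of rows, number of columns)
print(df_demo.dtypes)    # Series of column datatypes
print(df_demo.index)     # Index object holding the row labels
print(df_demo.columns)   # Index object holding the column labels

df_demo.info()           # prints a summary; the return value is None

first_two = df_demo.head(2)                     # first two rows only
col_means = df_demo.mean(numeric_only = True)   # mean of each numeric column
print(first_two)
print(col_means)
```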

In [29]:
df_calories.head()

Unnamed: 0_level_0,Breakfast,Lunch,Dinner,Exercise
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2021-06-01,355,250,800 calories,Y
2022-06-02,349,435,955 calories,N
2023-06-03,280,85,735 calories,N
2024-06-03,452,125,257 calories,Y
2025-06-04,100,289,701 calories,Y


In [30]:
df_calories.describe()

Unnamed: 0,Breakfast,Lunch
count,8.0,8.0
mean,222.625,333.125
std,154.485263,214.489052
min,50.0,85.0
25%,92.5,215.0
50%,202.5,269.5
75%,350.5,451.25
max,452.0,736.0


<hr>
<b style = 'font-size:18px;font-family:Arial;color:#E37C4D'>GroupBy</b>
    
<p style = 'font-size:16px;font-family:Arial'>groupby is a powerful method that will group rows of data together by one or more columns.  Grouped objects can have aggregations and other calculations performed on them.</p>

In [31]:
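# Group the DataFrame by the column Exercise, and calculate the mean
# of the numeric columns (numeric_only = True skips string columns such as
# Dinner, which recent Pandas versions would otherwise raise an error on)
df_calories.groupby('Exercise').mean(numeric_only = True)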

Unnamed: 0_level_0,Breakfast,Lunch
Exercise,Unnamed: 1_level_1,Unnamed: 2_level_1
N,201.0,316.25
Y,244.25,350.0


<hr>
<b style = 'font-size:18px;font-family:Arial;color:#E37C4D'>Pivot Tables</b>
    
<p style = 'font-size:16px;font-family:Arial'>The pivot_table() method allows for very flexible control over the creation of pivot table DataFrames; including multiple groupings, aggregations, and formatting.</p>

<p style = 'font-size:16px;font-family:Arial'>For a comprehensive tutorial, see the article <a href = 'https://pbpython.com/pandas-pivot-table-explained.html'>"Practical Business Python"</a>.</p>

<p style = 'font-size:16px;font-family:Arial'>For the example below, this method call will use the same groupings as above, but add multiple aggregations and a totals row:</p>


In [32]:
# numpy is a powerful scientific/mathematical library that Pandas leverages.
# in this case, numpy provides us the mean and sum functions that are passed to the
# pivot_table method.

my_pivot = df_calories.pivot_table(index = 'Exercise', aggfunc = [np.mean, np.sum], margins = True)

my_pivot

Unnamed: 0_level_0,mean,mean,sum,sum
Unnamed: 0_level_1,Breakfast,Lunch,Breakfast,Lunch
Exercise,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
N,201.0,316.25,804,1265
Y,244.25,350.0,977,1400
All,222.625,333.125,1781,2665


<hr>
<b style = 'font-size:18px;font-family:Arial;color:#E37C4D'>Data Transformation/Manipulation</b>
    
<p style = 'font-size:16px;font-family:Arial'>Note the column "Dinner" is missing in the above aggregation examples.  This is because the column "Dinner" is a string of format '### calories', and therefore cannot be aggregated numerically.  Further above, we reviewed how to create a new column using simple assignment (df['new_column'] = EXPRESSION).  Another option is self-assignment; we can assign an existing column a new value based on an expression.  Here's the process of converting a string column to numeric:</p>

In [33]:
# Let's do a quick manual example with a piece of text first:

test_string = '234 calories'

# Python string manipulation makes this conversion/extraction easy:

new_values = test_string.split(' ') #split the string into a list using the space as the delimiter

print(new_values[0]) # The first item in the list is the string '234'

# However, it is still a string, so we need to cast it to a number:
print(type(new_values[0]))

print(type(int(new_values[0])))

234
<class 'str'>
<class 'int'>


In [34]:
# using the process above, we can convert the column
# There are some slight modifications since we're dealing with a DataFrame
# and not just a single string
#  expand = True will return the split pieces as DataFrame columns instead of a Series of lists
#  this allows us to take the first item [0]
#  we use astype(int) instead of simple cast int() to create a numeric column type:

df_calories['Dinner'] = df_calories['Dinner'].str.split(' ', expand = True)[0].astype(int)

In [35]:
# check the work:
df_calories.dtypes

Breakfast     int64
Lunch         int64
Dinner        int64
Exercise     object
dtype: object
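
<p style = 'font-size:16px;font-family:Arial'>The split-and-cast approach above assumes every value follows the '### calories' format.  As an alternative sketch for messier data, the digits can be extracted with a regular expression and converted with pd.to_numeric().  The sample values below, including the 'n/a' entry, are made up for illustration.</p>

```python
import pandas as pd

# Made-up sample values; 'n/a' stands in for a malformed entry
dinner = pd.Series(['800 calories', '955 calories', 'n/a', '257 calories'])

# Extract the first run of digits from each string; rows without digits become NaN
digits = dinner.str.extract(r'(\d+)', expand = False)

# Convert to numbers; errors = 'coerce' turns anything unparseable into NaN
dinner_numeric = pd.to_numeric(digits, errors = 'coerce')

print(dinner_numeric)
```

<p style = 'font-size:16px;font-family:Arial'>Because a missing value is present, the result is a floating-point column rather than an integer one; astype(int) would fail until the missing values are dropped or filled.</p>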

In [36]:
# Rerun the pivot_table method call, note Dinner appears:

my_pivot = df_calories.pivot_table(index = 'Exercise', aggfunc = [np.mean, np.sum], margins = True)

my_pivot

Unnamed: 0_level_0,mean,mean,mean,sum,sum,sum
Unnamed: 0_level_1,Breakfast,Dinner,Lunch,Breakfast,Dinner,Lunch
Exercise,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
N,201.0,795.25,316.25,804,3181,1265
Y,244.25,519.75,350.0,977,2079,1400
All,222.625,657.5,333.125,1781,5260,2665


<hr>
<b style = 'font-size:18px;font-family:Arial;color:#E37C4D'>Other File Formats and file-like objects</b>
    
<p style = 'font-size:16px;font-family:Arial'>Pandas has read_* functions that support an extensive list of file formats:</p>
<ul style = 'font-size:16px;font-family:Arial'>
    <li>read_json</li>
    <li>read_html</li>
    <li>read_parquet</li>
    <li>etc.</li>
    </ul>


<p style = 'font-size:16px;font-family:Arial'>Most of these follow the same convention of passing a source file as a "path" or a file-like object.  Paths can be local filesystem paths, or they can be URIs (s3://bucket/path/to/object, ftp://user:pass@host/path, etc.).  Files can also be string or binary-serializable objects that can be read into a memory buffer using io.StringIO or io.BytesIO:</p>

In [37]:
# Construct a CSV formatted string
my_string = 'col1,col2\n1,2'

# read it into a memory buffer
my_buffer = io.StringIO(my_string)

# pass it to read_csv
df = pd.read_csv(my_buffer)
df

Unnamed: 0,col1,col2
0,1,2


<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>Section 4. Populating DataFrames using Database connections</b>

<p style = 'font-size:16px;font-family:Arial'>Pandas and databases are a natural fit.  By combining these two technologies, users can leverage the power of large, server-based data management systems with the flexibility and innovation available in the Python ecosystem.</p>

<p style = 'font-size:16px;font-family:Arial'>Common relational database management systems use table structures as storage, and SQL (Structured Query Language) as the syntax used to interact in reading, operating on, and writing data.</p>
    
<p style = 'font-size:16px;font-family:Arial'>In order to access this data, users need to take some additional steps to connect to the database, and then execute SQL to operate on, or retrieve data to the client.</p>

<hr>
<b style = 'font-size:18px;font-family:Arial;color:#E37C4D'>Establishing a Database connection</b>

<p style = 'font-size:16px;font-family:Arial'>The Python ecosystem supports a common database interface called <a href = 'https://www.python.org/dev/peps/pep-0249/'>DB-API 2.0 (PEP 249)</a>.  One of the objects defined in this standard is the "connection" object type.  Pandas can use this connection object to perform the underlying communication with the target system.</p>

<p style = 'font-size:16px;font-family:Arial'><b>Connection Libraries</b> are usually required and installed separately using pip or other means.  Some of these include:</p>

<ul style = 'font-size:16px;font-family:Arial'>
    <li>pyodbc for ODBC connections (requires underlying ODBC drivers)</li>
    <li>jaydebeapi for JDBC connections (requires underlying java and JDBC drivers)</li>
    <li>vendor-specific native drivers (that usually don't require underlying configuration):
        <ul><li>teradatasql for Teradata Vantage</li>
            <li>pymssql for MS SQL Server</li>
            <li>cx_Oracle for Oracle</li>
            <li>others</li></ul></li>
    </ul>
    
<p style = 'font-size:16px;font-family:Arial'>These libraries all have their system-specific requirements and syntax for connecting to the target system. See the vendor/provider documentation.</p>

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>Make changes for your execution</b></p>

<p style = 'font-size:16px;font-family:Arial'>The Jupyter Module for Teradata provides a helper library called tdconnect - this can use the underlying client configs and pass a JWT token for SSO. Establish a connection to the Teradata Vantage server (uses the Teradata SQL Driver for Python). Before you execute the following statement, replace the variables &lt;HOSTNAME&gt;, &lt;UID&gt; and &lt;PWD&gt; with your target Vantage system hostname (or IP address), and your database user ID (QLID) and password, respectively.</p>
    
<p style = 'font-size:14px;font-family:Arial'>td_context = create_context(host="tdprdX.td.teradata.com", username="xy123456", password=getpass.getpass(prompt='Password:'), logmech="LDAP")</p>
<hr>

In [38]:
eng = create_context(host = 'host.docker.internal', username='demo_user', password = getpass.getpass())

print(eng)

 ········


Engine(teradatasql://demo_user:***@host.docker.internal)


<hr>
<b style = 'font-size:18px;font-family:Arial;color:#E37C4D'>Retrieving data using DBAPI connections</b>

<p style = 'font-size:16px;font-family:Arial'>Pandas provides a read_sql() module function for returning the results of a SQL query as a DataFrame.  Construct the query as a string, and pass it and the connection to the function:</p>

In [39]:
qry = 'SELECT top 10 * FROM DBC.TablesV;'

df_sql = pd.read_sql(qry, eng)

df_sql

Unnamed: 0,DataBaseName,TableName,Version,TableKind,ProtectionType,JournalFlag,CreatorName,RequestText,CommentString,ParentCount,...,BlockCompressionAlgorithm,BlockCompressionLevel,TableHeaderFormat,RowSizeFormat,MapName,ColocationName,TVMFlavor,FastAlterTable,IncrementalRestoreEnabled,AuthName
0,DBC,QryLogSteps_SZ,1,V,F,NN,DBC,REPLACE VIEW DBC.QryLogSteps_SZ\rAS\rLOCKING T...,,0,...,,,,0,,,,,,
1,DBC,StatsV_SZ,1,V,F,NN,DBC,REPLACE VIEW DBC.StatsV_SZ\rAS\rSELECT DBC.DB...,,0,...,,,,0,,,,,,
2,DBC,AllTempTablesVX,1,V,F,NN,DBC,REPLACE VIEW DBC.AllTempTablesVX\rAS SELECT ...,The AllTempTablesVX view provides information ...,0,...,,,,0,,,,,,
3,DBC,DBAUserAdmin,1,M,F,NN,DBC,REPLACE MACRO DBC.DBAUserAdmin AS (;);,The DBAUserAdmin exists to allow the DBA to ma...,0,...,,,,0,,,,,,
4,DBC,Indices,1,V,F,NN,DBC,REPLACE VIEW DBC.Indices\rAS\rSELECT CAST(SUBS...,The DBC.Indices view provides information abou...,0,...,,,,0,,,,,,
5,DBC,CopyCostProfile,1,M,F,NN,DBC,REPLACE MACRO DBC.CopyCostProfile\r ( Profile...,Creates an empty new cost profile of a given t...,0,...,,,,0,,,,,,
6,TDMaps,ActionsVX,1,V,F,NN,DBC,REPLACE VIEW TDMAPS.ActionsVX\rAS\rSELECT TDMA...,Provides info of recommended actions generated...,0,...,,,,0,,,,,,
7,DBC,AMPUsageVX,1,V,F,NN,DBC,REPLACE VIEW DBC.AMPUsageVX\rAS SELECT\r Acc...,The AMPUsageVX view provides information about...,0,...,,,,0,,,,,,
8,DBC,ObjectUseCountV,1,V,F,NN,DBC,REPLACE VIEW DBC.ObjectUseCountV AS\rSELECT Db...,Lists the object usage counts by usage type.,0,...,,,,0,,,,,,
9,DBC,ChildrenX,1,V,F,NN,DBC,REPLACE VIEW DBC.ChildrenX\rAS\rSELECT CAST(SU...,The ChildrenX view lists the names of database...,0,...,,,,0,,,,,,
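
<p style = 'font-size:16px;font-family:Arial'>For readers without access to a Vantage system, the same read_sql() pattern can be sketched against an in-memory SQLite database using Python's built-in sqlite3 module, which provides a DB-API connection object.  The table name, columns, and rows below are made up for illustration and have nothing to do with the Teradata example above.</p>

```python
import sqlite3
import pandas as pd

# Create a throwaway in-memory database and a small made-up table
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE people (id TEXT, age INTEGER)')
conn.executemany('INSERT INTO people VALUES (?, ?)',
                 [('p1', 23), ('p2', 33), ('p3', 55)])
conn.commit()

# Construct the query as a string; '?' is a placeholder filled in from params
qry = 'SELECT id, age FROM people WHERE age > ? ORDER BY age'

# Pass the query and the connection to read_sql() to get a DataFrame back
df_people = pd.read_sql(qry, conn, params = (30,))
conn.close()

print(df_people)
```

<p style = 'font-size:16px;font-family:Arial'>Passing values through params rather than formatting them into the SQL string lets the driver handle quoting.  Placeholder syntax differs between drivers, so check the documentation for the one in use.</p>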


<hr>
<b style = 'font-size:18px;font-family:Arial;color:#E37C4D'>SQLAlchemy Connections</b>

<p style = 'font-size:16px;font-family:Arial'>The Python ecosystem has continued to evolve database integration capabilities, and now supports a high-level architectural pattern based on a project called <a href = 'https://www.sqlalchemy.org/'>SQLAlchemy</a>.  SQLAlchemy provides a more object-oriented approach to interacting with RDBMS, essentially providing a more "pythonic" pattern for interacting with databases instead of relying wholly on SQL.</p>

<b style = 'font-size:18px;font-family:Arial;color:#E37C4D'>Teradataml Package</b>
<p style = 'font-size:16px;font-family:Arial'>The teradataml package relies heavily on SQLAlchemy architecture, and provides significant capabilities to the user and the user's interaction with Vantage.  Please see the relevant notebooks in this directory and in the <a href = 'https://docs.teradata.com/r/Teradata-Package-for-Python-User-Guide/November-2021/Introduction-to-Teradata-Package-for-Python'>Getting Started Guide</a> for more information.</p>

<hr>
<b style = 'font-size:18px;font-family:Arial;color:#E37C4D'>Combining DataFrames</b>

<p style = 'font-size:16px;font-family:Arial'>Pandas provides multiple techniques for combining data contained in DataFrames and Series.  Recall that near the beginning of this document, an example used the concat() function to combine Series into a DataFrame.</p>

<p style = 'font-size:16px;font-family:Arial'>Objects can be combined using <b>Methods</b> and <b>Module Functions</b>.  Methods are functions applied to an object, and Module functions are called under the Pandas module (usually pd)</p>

<ul style = 'font-size:16px;font-family:Arial'>
    <li>The <b>Method call</b> will combine itself with another object provided as a parameter.  These include:
        <ul><li>DataFrame.join() will combine the calling DataFrame to another DataFrame using the calling DataFrame's index.</li>
            <li>DataFrame.merge() will combine the calling DataFrame to another using arbitrary columns.</li>
            <li>DataFrame.append() will add rows to the DataFrame using another Series or DataFrame.  Note that append() was deprecated in Pandas 1.4 and removed in Pandas 2.0; use concat() instead.</li></ul></li>
    <li>The <b>Module Function</b> will take multiple standalone objects, typically as a List, and return a new DataFrame or Series.  These include:
        <ul><li>Pandas merge() will take two DataFrames, Series, or other objects and combine them using indices or column values</li>
            <li>Pandas concat() will take a List of DataFrames, Series, or other objects and combine them across an axis, and not attempt to match up column values.  In the case of mismatched indices, rows will be appended.</li></ul></li>
          </ul>  
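
<p style = 'font-size:16px;font-family:Arial'>As a minimal sketch of the concat() module function, the example below stacks two small made-up DataFrames (df_a and df_b are hypothetical names) along each axis.  Stacking along axis 0 is the usual way to add rows.</p>

```python
import pandas as pd

# Two small made-up DataFrames with the same column labels
df_a = pd.DataFrame({'A': [1, 2], 'B': [5, 6]})
df_b = pd.DataFrame({'A': [3, 4], 'B': [7, 8]})

# axis = 0 stacks rows; ignore_index = True renumbers the combined index
stacked = pd.concat([df_a, df_b], axis = 0, ignore_index = True)

# axis = 1 places the objects side by side, aligned on the row index;
# column values are not matched up, so the labels A and B appear twice
side_by_side = pd.concat([df_a, df_b], axis = 1)

print(stacked)
print(side_by_side)
```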
          

<p style = 'font-size:16px;font-family:Arial'>join() and merge() Methods, and the merge() Module Function all behave in similar ways, but syntax and parameters differ for each.  Other important concepts include:</p>

<ul style = 'font-size:16px;font-family:Arial'>
    <li>Declaring how to join objects.  Usually this parameter is one of 'left', 'right', 'outer', 'inner', 'cross'.  These are roughly equivalent to SQL-style JOIN directives.</li>
    <li>Which column or columns to use as keys to join on.  join() defaults to the index, while merge() defaults to the columns the two objects have in common; either can be pointed at other columns.</li>
    <li>Suffix.  Under conditions where the resulting object has duplicate column names, the functions will append a suffix to enforce uniqueness.</li>
    </ul>
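
<p style = 'font-size:16px;font-family:Arial'>The three concepts above (how to join, which keys to use, and suffixes) can be seen together in a short sketch.  The DataFrames people and visits, their values, and the suffix strings are made up for illustration.</p>

```python
import pandas as pd

# Two made-up DataFrames that share the columns 'id' and 'score'
people = pd.DataFrame({'id': ['p1', 'p2', 'p3'], 'score': [10, 20, 30]})
visits = pd.DataFrame({'id': ['p2', 'p3', 'p4'], 'score': [7, 8, 9]})

# Inner join on the 'id' column: only ids present in both objects are kept.
# The duplicate 'score' column gets a suffix from each side.
inner = pd.merge(people, visits, on = 'id', how = 'inner',
                 suffixes = ('_people', '_visits'))

# Left join: every row of the left object is kept; unmatched rows get NaN
left = pd.merge(people, visits, on = 'id', how = 'left',
                suffixes = ('_people', '_visits'))

# The join() method matches on the index, so move 'id' into the index first
joined = people.set_index('id').join(visits.set_index('id'),
                                     lsuffix = '_people', rsuffix = '_visits')

print(inner)
print(left)
print(joined)
```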

In [40]:
# An example of merge.
# take the cleaned up DataFrame from above, and combine it to the original one:

display(df_calories.head())

# re-read the original data with the old string column:
df_orig = pd.read_excel('data/Calorie_Log.xlsx', sheet_name = 'Sheet1', index_col = 0, engine = 'openpyxl')

display(df_orig.head())

# Call the Merge function:
df_merge = pd.merge(left = df_calories,
                    right = df_orig, 
                    left_index = True,
                    right_index = True,
                    how = 'inner')

display(df_merge.head())

Date,Breakfast,Lunch,Dinner,Exercise
2021-06-01,355,250,800,Y
2022-06-02,349,435,955,N
2023-06-03,280,85,735,N
2024-06-03,452,125,257,Y
2025-06-04,100,289,701,Y


Date,Breakfast,Lunch,Dinner,Exercise
2021-06-01,355,250,800 calories,Y
2022-06-02,349,435,955 calories,N
2023-06-03,280,85,735 calories,N
2024-06-03,452,125,257 calories,Y
2025-06-04,100,289,701 calories,Y


Date,Breakfast_x,Lunch_x,Dinner_x,Exercise_x,Breakfast_y,Lunch_y,Dinner_y,Exercise_y
2021-06-01,355,250,800,Y,355,250,800 calories,Y
2022-06-02,349,435,955,N,349,435,955 calories,N
2023-06-03,280,85,735,N,280,85,735 calories,N
2024-06-03,452,125,257,Y,452,125,257 calories,Y
2025-06-04,100,289,701,Y,100,289,701 calories,Y


<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>Section 5. Persisting DataFrames</b>

<p style = 'font-size:16px;font-family:Arial'>The end result of most workflows is an output: an analytic result, a visualization, a data representation, etc.</p>

<p style = 'font-size:16px;font-family:Arial'>Pandas provides native capabilities for persisting data into human-readable, compressed, and/or SQL database formats. Note that many third-party libraries including <b>teradatasql/teradataml</b> provide methods for writing data to the target database system.</p>
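
<p style = 'font-size:16px;font-family:Arial'>As a small sketch of the SQL database option, the example below writes a DataFrame to an in-memory SQLite database using Python's built-in <code>sqlite3</code> module and reads it back.  SQLite, the table name <code>calorie_log</code>, and the sample data are assumptions chosen so the example runs without an external database; they are not part of this notebook's Vantage workflow.</p>

```python
import sqlite3
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({"Breakfast": [355, 349], "Lunch": [250, 435]},
                  index=pd.Index(["2021-06-01", "2021-06-02"], name="Date"))

# An in-memory SQLite database stands in for a real target system
conn = sqlite3.connect(":memory:")
try:
    # Write the DataFrame; the index is stored as a "Date" column
    df.to_sql("calorie_log", conn, index=True, if_exists="replace")

    # Read it back, restoring "Date" as the index
    df_sql = pd.read_sql("SELECT * FROM calorie_log", conn, index_col="Date")
finally:
    conn.close()

print(df_sql)
```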

<b style = 'font-size:18px;font-family:Arial;color:#E37C4D'>File Persistence</b>

<p style = 'font-size:16px;font-family:Arial'>Some common human-readable formats include:</p>
<ul style = 'font-size:16px;font-family:Arial'>
    <li>Excel</li>
    <li>CSV</li>
    <li>JSON (debatably human-readable)</li>
    <li>HTML</li>
    </ul>
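
<p style = 'font-size:16px;font-family:Arial'>A brief sketch of the text-based formats follows.  To avoid writing files, it renders a small, hypothetical DataFrame to strings; passing a file path to the same methods would write to disk instead.</p>

```python
import io
import json
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({"Breakfast": [355, 349], "Exercise": ["Y", "N"]},
                  index=pd.Index(["2021-06-01", "2021-06-02"], name="Date"))

# With no path argument, these methods return the output as a string
csv_text = df.to_csv()
json_text = df.to_json()
html_text = df.to_html()

print(csv_text)

# Round-trip the CSV text back into a DataFrame
df_from_csv = pd.read_csv(io.StringIO(csv_text), index_col="Date")

# The JSON text is plain JSON and can be parsed by any JSON reader
parsed = json.loads(json_text)
```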
    
<b style = 'font-size:18px;font-family:Arial;color:#E37C4D'>Optimized Formats</b>
<p style = 'font-size:16px;font-family:Arial'>Pandas can also write binary formats that are optimized for storage space, read/scan performance, etc.  Some of these include:</p>
<ul style = 'font-size:16px;font-family:Arial'>
    <li>Apache Parquet</li>
    <li>Feather</li>
    <li>Pickle</li>
    </ul>
    
<p style = 'font-size:16px;font-family:Arial'><b>Pickle</b> is Python's native serialization format and is both common and useful.  Python stores the object itself to a file, including its data and metadata such as the index and column types.  Most Python objects can be pickled and unpickled, though a few (such as open file handles or lambda functions) cannot, and reading a pickle generally requires compatible versions of the packages that define the object.  Take care: a pickle can contain malicious code that executes as soon as the file is unpickled.  Never unpickle data from an untrusted source.</p>

In [41]:
# Pickle the merged dataframe:

df_merge.to_pickle('data/df_merge.zip') # Adding a zip extension will automatically compress the file

In [42]:
# Read the pickle and display the data:
df_unpickle = pd.read_pickle('data/df_merge.zip')
df_unpickle.head()

Date,Breakfast_x,Lunch_x,Dinner_x,Exercise_x,Breakfast_y,Lunch_y,Dinner_y,Exercise_y
2021-06-01,355,250,800,Y,355,250,800 calories,Y
2022-06-02,349,435,955,N,349,435,955 calories,N
2023-06-03,280,85,735,N,280,85,735 calories,N
2024-06-03,452,125,257,Y,452,125,257 calories,Y
2025-06-04,100,289,701,Y,100,289,701 calories,Y
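
<p style = 'font-size:16px;font-family:Arial'>The security warning about Pickle can be demonstrated harmlessly.  In the sketch below, the hypothetical class <code>Sneaky</code> tells pickle to call <code>print()</code> when the data is loaded; a malicious pickle could name any other callable in the same way.  This uses only Python's standard <code>pickle</code> module.</p>

```python
import pickle

class Sneaky:
    """Hypothetical class whose pickle runs a function when loaded."""
    def __reduce__(self):
        # pickle will call print(...) during loading instead of
        # rebuilding a Sneaky instance
        return (print, ("This message was printed while unpickling!",))

payload = pickle.dumps(Sneaky())

# Loading the bytes triggers the call above; no Sneaky object comes back
restored = pickle.loads(payload)
```

<p style = 'font-size:16px;font-family:Arial'>The value returned by <code>pickle.loads()</code> is whatever the embedded call returned, which is why only pickles from trusted sources should be read.</p>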


<b style = 'font-size:18px;font-family:Arial;color:#E37C4D'>Excel</b>

<p style = 'font-size:16px;font-family:Arial'>Excel continues to be the de facto standard for working with data in a desktop tool.  Pandas provides extremely flexible capabilities for writing Excel files, including style/formatting, layout, tabs, etc.  Here is a simple example:</p>

In [43]:
df_merge.to_excel('my_excel.xlsx', sheet_name = 'Sheet1', header = True, engine = 'openpyxl')
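
<p style = 'font-size:16px;font-family:Arial'>To illustrate the tabs capability, the sketch below writes two small, hypothetical DataFrames to separate sheets of one workbook using <code>pd.ExcelWriter</code>.  It assumes the <code>openpyxl</code> engine used elsewhere in this notebook is installed, and it writes to an in-memory buffer rather than a file; the sheet names are made up for the example.</p>

```python
import io
import pandas as pd

# Hypothetical sample data
df_meals = pd.DataFrame({"Breakfast": [355, 349]},
                        index=pd.Index(["2021-06-01", "2021-06-02"], name="Date"))
df_exercise = pd.DataFrame({"Exercise": ["Y", "N"]},
                           index=pd.Index(["2021-06-01", "2021-06-02"], name="Date"))

# Write each DataFrame to its own sheet (tab) in a single workbook
buffer = io.BytesIO()
with pd.ExcelWriter(buffer, engine="openpyxl") as writer:
    df_meals.to_excel(writer, sheet_name="Meals")
    df_exercise.to_excel(writer, sheet_name="Exercise")

# sheet_name=None reads every sheet into a dict keyed by sheet name
buffer.seek(0)
sheets = pd.read_excel(buffer, sheet_name=None, index_col=0, engine="openpyxl")

print(list(sheets))
```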

<hr>
<b style = 'font-size:18px;font-family:Arial;color:#E37C4D'>Cleanup</b>
<p style = 'font-size:16px;font-family:Arial'>It is good practice to remove the context that we created to connect to Vantage.</p>

In [44]:
remove_context()

True

<footer style="padding:10px;background:#f9f9f9;border-bottom:3px solid #394851">©2022 Teradata. All Rights Reserved</footer>