# Table of Contents

-  [Importing libraries](#01)
-  [Reading in files](#02)
-  [Previewing datasets](#09)
-  [Renaming columns and variables](#10)
-  [Indexing and filtering rows and columns](#11)
-  [Computing summary statistics](#12)
-  [Pivoting on datasets](#13)
-  [Joining multiple datasets](#14)
-  [Exporting data to Excel or CSV files](#16)

<a id='01'></a>
# Situation: Importing libraries

While Python contains numerous useful functions and features in its "base" language, libraries offer scores of added capabilities. Importing libraries requires only a couple lines of code.

In [None]:
import pandas #"Standard" importing
import pandas as pd #Added "as pd" enables the user to type just "pd" rather than "pandas" when referencing the library
import os

<a id='02'></a>
# Situation: Reading in files

Most analyses in Python begin with at least one input file, most commonly a CSV or XLSX file. Users then typically perform operations, computations, and modeling on top of those input files. First, however, the user must read the file into whichever Python development environment they use.

Multiple approaches to this exist, but below you can see tutorials for the two most popular, using base Python and Pandas.

Before beginning the reading-in process, you must ensure that Python points to the correct working directory. The working directory tells Python where to look when attempting an import. As a result, unless the working directory contains the file to import, Python will produce an error when executing the import.

Begin with changing the working directory, if necessary.

In [None]:
print(os.getcwd()) #This shows the user the current working directory.
os.chdir("/import/analytics/dev/nexus/repos/jupyter_notebook_backup/jpn/adalke")
#Replace the file path above with the desired working directory's file path.
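
As an alternative to changing the working directory, you can build the file's full path and check that Python can find it. The sketch below uses only the standard library; the temporary folder and the file contents are created purely for illustration and stand in for a directory that already holds your data.

```python
import os
import tempfile

# Create a throwaway folder and file so the example is self-contained.
# In practice, "folder" would be the directory that already holds your data.
folder = tempfile.mkdtemp()
file_path = os.path.join(folder, "Sample_Sales.csv")  # Full path = folder + file name

with open(file_path, "w") as f:
    f.write("Store,Sales_Rev\n1,100\n")

# os.path.exists confirms Python can find the file before attempting to read it
file_found = os.path.exists(file_path)
print(file_found)
```

A full path like `file_path` can be passed to `open` or `pd.read_csv` in place of a bare file name.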

After ensuring the correct working directory, you can then read in files.

##### CSV - Base Python option:

In [None]:
with open("Sample_Sales.csv", "r") as f: #Change the file name as necessary; "with" closes the file when done
    data = f.read() #Reads the whole file into one string
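
Reading a file this way returns plain text, so the rows and columns still need to be separated. One way to do that in base Python is the standard library's `csv` module. The sketch below uses a small in-memory string with made-up values in place of a real file.

```python
import csv
import io

# A small in-memory stand-in for a CSV file (hypothetical contents)
sample_text = "Store,Sales_Rev\n1,100.5\n70,88.0\n"

# csv.DictReader splits each line on commas and uses the first line as headers
reader = csv.DictReader(io.StringIO(sample_text))
rows = list(reader)

print(rows[0]["Store"])  # Values are read as strings, not numbers
print(len(rows))         # Number of data rows (the header is not counted)
```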

##### CSV - Pandas option:

In [None]:
data = pd.read_csv("Sample_Sales.csv") #Change the file name as necessary

##### XLSX - Pandas option:

In [None]:
data = pd.read_excel("Sample_Sales.xlsx", sheet_name = 0) #Change the file name as necessary; sheet_name = 0 reads the first sheet

<a id='09'></a>
# Situation: Previewing datasets

Unless you know the structure and contents of a dataset, it can prove helpful to examine key facts like column names, data types, and summary statistics. Pandas provides numerous functions and methods that aid this goal.

Print a list of column names:

In [None]:
data.columns

Calculate the number of rows and columns:

In [None]:
data.shape

Print summary statistics for each column:

In [None]:
data.describe()

Provide an overview of each column, including non-null counts and data types:

In [None]:
data.info()

Show datatypes for each column:

In [None]:
data.dtypes
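
It can also help to look at a few actual rows. A minimal sketch using the `head` and `tail` methods, with a small made-up dataframe standing in for "data":

```python
import pandas as pd

# Hypothetical miniature dataset standing in for "data"
sample = pd.DataFrame({"Store": [1, 1, 70, 70, 5, 5],
                       "Sales_Rev": [100.0, 95.5, 88.0, 91.2, 60.0, 72.3]})

first_rows = sample.head(3)  # First 3 rows (head() with no argument returns the first 5)
last_rows = sample.tail(2)   # Last 2 rows
print(first_rows)
print(last_rows)
```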

<a id='10'></a>
# Situation: Renaming columns and variables

This type of operation takes a few steps, but each one is straightforward.

In [None]:
print(data.columns) #Prints a list containing the dataset's column names
col_names = data.columns.tolist() #Creates a list containing the original column names
col_names[0] = 'Store_Num' #Renames the first column name - user can change the index number and new column name as desired
data.columns = col_names #Replaces the original column names with the new ones defined by the user
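
Pandas also offers a `rename` method that maps old names to new ones in a single call. A minimal sketch, using a small made-up dataframe in place of "data":

```python
import pandas as pd

# Hypothetical miniature dataset standing in for "data"
sample = pd.DataFrame({"Store": [1, 70], "Margin": [0.65, 0.40]})

# rename() takes a dictionary of {old name: new name} and returns a new dataframe;
# the original stays unchanged unless you reassign the result
renamed = sample.rename(columns={"Store": "Store_Num"})
print(list(renamed.columns))
```

Because the mapping uses names rather than positions, it keeps working if the column order changes.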

<a id='11'></a>
# Situation: Indexing and filtering rows and columns

To return only certain columns, simply place the column name(s) within quotation marks and brackets next to the dataset name:

In [None]:
data['Margin'] #Return only the "Margin" column

To return multiple columns, add another set of brackets, to create a list containing the column names to return.

In [None]:
data[['Date', 'Margin']]

Indexing can take multiple forms. At its most basic level, it simply entails selecting rows of a table or dataframe:

In [None]:
data[0:5]

Alternatively, Pandas contains functionality facilitating indexing: two indexers called "loc" (which selects by label) and "iloc" (which selects by integer position), both used with square brackets.

In [None]:
data.iloc[0:5] #This produces the same output as the "basic" indexing described above.
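
The difference between "loc" and "iloc" shows most clearly when the row labels are not plain integers. A small sketch with made-up data and letter labels:

```python
import pandas as pd

# Hypothetical miniature dataset with letters as row labels
sample = pd.DataFrame({"Store": [1, 70, 5], "Margin": [0.65, 0.40, 0.72]},
                      index=["a", "b", "c"])

by_position = sample.iloc[0:2]         # Rows at positions 0 and 1 (end position excluded)
by_label = sample.loc["a":"b"]         # Rows labeled "a" through "b" (end label included)
one_value = sample.loc["c", "Margin"]  # Single cell: row label first, then column name
print(by_position)
print(one_value)
```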

Python allows for easy filtering, by specifying the column(s) and values to use as criteria. Python then utilizes those criteria to index the rows meeting them.

Stated differently, Python "thinks" "return the rows where this condition proves true".

In [None]:
data[data['Margin'] > 0.6] #Returns values of rows where the "Margin" column value exceeds 0.6

You can go to an even greater level of precision by incorporating multiple filters.

In [None]:
# Filter 1:
data[data['Margin'] > 0.6]

# Filter 2:
data[data['Sales_Rev'] > 90]

# Combining the filters to return only the rows that meet BOTH:
data[(data['Margin'] > 0.6) & (data['Sales_Rev'] > 90)]

# Combining the filters to return only the rows that meet EITHER:
data[(data['Margin'] > 0.6) | (data['Sales_Rev'] > 90)]
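
Filtering rows and selecting columns can also happen in one step by passing both to "loc". A minimal sketch, with a small made-up dataframe in place of "data":

```python
import pandas as pd

# Hypothetical miniature dataset standing in for "data"
sample = pd.DataFrame({"Store": [1, 70, 5],
                       "Margin": [0.65, 0.40, 0.72],
                       "Sales_Rev": [95.0, 120.0, 60.0]})

# Store the combined condition as a True/False value per row
mask = (sample["Margin"] > 0.6) & (sample["Sales_Rev"] > 90)

# Rows where the mask is True, limited to the two listed columns
result = sample.loc[mask, ["Store", "Margin"]]
print(result)
```

Saving the condition in a variable such as `mask` keeps long filters readable and lets you reuse them.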

<a id='12'></a>
# Situation: Computing summary statistics

Python's NumPy library plays an instrumental role with summary statistics, by putting them only a function call away. For most functions, you can choose between calculating for specific columns or entire tables or dataframes.

In [None]:
import numpy as np

#Sum for each column in the "data" table:
np.sum(data)

#Sum for only the "Store" column in the "data" table - replace column name as desired:
np.sum(data['Store'])

#Average for each column in the "data" table:
np.mean(data)

#Maximum for each column in the "data" table:
np.max(data)

#Minimum for each column in the "data" table:
np.min(data)

#Median for the "Sales_Rev" column - this function requires a specified column:
np.median(data['Sales_Rev'])

#Range of the "Sales_Rev" column - also requires a specified column:
np.ptp(data['Sales_Rev'])

#Standard deviation for each column in the "data" table:
np.std(data)

# Finds values at requested percentiles of specified column - in this case, the 75th and 25th of "Sales_Rev":
np.percentile(data['Sales_Rev'], [75, 25])
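
Pandas dataframes and columns carry their own versions of most of these calculations as methods. The sketch below uses a small made-up dataframe; the `numeric_only = True` argument is included on the assumption that your table mixes numeric and text columns, as this one does.

```python
import pandas as pd

# Hypothetical miniature dataset with one text column
sample = pd.DataFrame({"Store": [1, 70, 5, 5],
                       "Sales_Rev": [100.0, 80.0, 60.0, 120.0],
                       "Date": ["2017-01-01", "2017-01-02", "2017-01-03", "2017-01-04"]})

col_means = sample.mean(numeric_only=True)   # Average of each numeric column; "Date" is skipped
rev_median = sample["Sales_Rev"].median()    # Median of one column
rev_range = sample["Sales_Rev"].max() - sample["Sales_Rev"].min()  # Range of one column
q25, q75 = sample["Sales_Rev"].quantile([0.25, 0.75])  # 25th and 75th percentiles
sample_std = sample["Sales_Rev"].std()            # Sample standard deviation (divides by n - 1)
population_std = sample["Sales_Rev"].std(ddof=0)  # Population standard deviation (divides by n)
print(col_means)
```

Note that the Pandas `std` method defaults to the sample standard deviation, while NumPy's `np.std` defaults to the population version, so the two can return different numbers for the same column.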

<a id='13'></a>
# Situation: Pivoting on datasets

We've all grown accustomed to Pivot Tables in Excel, and thankfully, Pandas contains a function to create them in Python!

The function, "pivot_table", takes arguments that allow you to control the columns on which to pivot and which aggregate calculation to use.

The arguments to the pivot_table function go as follows:
-  data: Which dataframe to use as an input
-  values: (Optional) Which column to use as the Pivot Table values (lower-right section of Excel's Pivot Table menu)
-  index: Column(s) to use as the key values (rows) in the Pivot Table (lower-left section of Excel's Pivot Table menu)
-  columns: Column(s) to place across the top of the Pivot Table (upper-right section of Excel's Pivot Table menu)
-  aggfunc: How to calculate the values in the Pivot Table (leverage the Numpy functions listed in the "Computing summary statistics" situation)
-  fill_value: (Optional) Whether to replace missing values with any particular value (defaults to "None")
-  margins: (Optional) True/False boolean for whether to include subtotals and totals (defaults to "False")
-  dropna: (Optional) True/False boolean for whether to drop columns that only include NAs (defaults to "True")
-  margins_name: (Optional) Name of the row/column that holds the totals when the margins argument is set to True (defaults to "All")

Simple example:

In [None]:
pd.pivot_table(data = data,
               values = 'Sales_Rev',
               index = 'Product_ID',
               columns = 'Store',
               aggfunc = np.sum) 

#Outputs as a Pandas dataframe

Fill the NaN's in first pivot with 0's:

In [None]:
pd.pivot_table(data = data,
               values = 'Sales_Rev',
               index = 'Product_ID',
               columns = 'Store',
               aggfunc = np.sum,
               fill_value = 0)

#Outputs as a Pandas dataframe

Move "Store" from columns to index:

In [None]:
pd.pivot_table(data = data,
               values = 'Sales_Rev',
               index = ['Product_ID', 'Store'],
               aggfunc = np.sum,
               fill_value = 0)

#Outputs as a Pandas dataframe with a two-level (Product_ID, Store) index

Add a count to values:

In [None]:
pd.pivot_table(data = data,
               values = 'Sales_Rev',
               index = 'Product_ID',
               aggfunc = [np.sum, len],
               fill_value = 0)

#Outputs as a Pandas dataframe

Restrict the count to count of Stores:

In [None]:
pd.pivot_table(data = data,
               values = ['Sales_Rev', 'Store'],
               index = 'Product_ID',
               aggfunc = {'Sales_Rev': np.sum, 'Store': len},
               fill_value = 0)

#Outputs as a Pandas dataframe

Use mean rather than sum for values:

In [None]:
pd.pivot_table(data = data,
               values = 'Sales_Rev',
               index = ['Product_ID', 'Store'],
               aggfunc = np.mean,
               fill_value = 0)

# Outputs as a Pandas dataframe with a two-level (Product_ID, Store) index

Add totals:

In [None]:
pd.pivot_table(data = data,
               values = 'Sales_Rev',
               index = ['Product_ID', 'Store'],
               aggfunc = np.sum,
               fill_value = 0,
               margins = True,
               margins_name = "Totals")

# Outputs as a Pandas dataframe with a two-level (Product_ID, Store) index
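
The aggfunc argument also accepts the name of a calculation as a string, such as "sum" or "mean", in place of a NumPy function. A minimal sketch on a small made-up dataframe, so the output is easy to check by hand:

```python
import pandas as pd

# Hypothetical miniature dataset standing in for "data"
sample = pd.DataFrame({"Product_ID": ["A", "A", "B"],
                       "Store": [1, 70, 1],
                       "Sales_Rev": [10.0, 20.0, 5.0]})

pivot = pd.pivot_table(data = sample,
                       values = 'Sales_Rev',
                       index = 'Product_ID',
                       columns = 'Store',
                       aggfunc = 'sum',   # String name instead of np.sum
                       fill_value = 0)    # Product "B" has no sales in store 70, so that cell becomes 0
print(pivot)
```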

<a id='14'></a>
# Situation: Joining multiple datasets

Since joining datasets can take multiple forms, different approaches for each exist in Python, once again with the assistance of Pandas.

For example, with appending datasets, you can append by rows or columns (in other words, place datasets "on top of" or "next to" each other).

The examples below will demonstrate each of those in order:

In [None]:
df1 = data[data['Store'] == 1]
df2 = data[data['Store'] == 70]
pd.concat([df1, df2]) #This appends the "df2" dataset below "df1" (by rows).

df3 = data[['Store', 'Date', 'Customer_ID', 'Product_ID']][data['Store'] == 1]
df4 = data[['Transaction_ID', 'Sales_Rev']][data['Store'] == 1]
pd.concat([df3, df4], axis = 1) #This appends the "df4" dataset next to "df3" (by columns).
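
When appending by rows, each piece keeps its original row labels, which can leave duplicate labels in the result. The sketch below, on small made-up dataframes, shows how the `ignore_index` argument renumbers the rows instead.

```python
import pandas as pd

# Hypothetical miniature datasets standing in for "df1" and "df2"
top = pd.DataFrame({"Store": [1, 1], "Sales_Rev": [10.0, 20.0]})
bottom = pd.DataFrame({"Store": [70], "Sales_Rev": [5.0]})

stacked = pd.concat([top, bottom])                        # Keeps original labels: 0, 1, 0
renumbered = pd.concat([top, bottom], ignore_index=True)  # Fresh labels: 0, 1, 2
print(list(stacked.index))
print(list(renumbered.index))
```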

Excel users commonly take advantage of VLOOKUPs, and SQL users regularly invoke joins to link disparate tables or datasets. In Pandas, joining takes a couple of steps: first place each dataset in its own dataframe, then combine them with the "merge" function. Merges more closely resemble SQL joins than Excel VLOOKUPs.

In [None]:
df5 = data[0:20]
df6 = data[len(data)-20:len(data)]

# Simplest method; performs an inner join by default:
pd.merge(df5, df6, on='Store')

# Adding the "how" argument enables specification of inner/left/right/outer:
pd.merge(df5, df6, on='Store', how='inner')

# When performing left or right joins, non-matching values will contain NaN:
pd.merge(df5, df6, on='Store', how='left')

# "suffixes" argument edits column labels for each joined dataframe:
pd.merge(df5, df6, on='Store', how='inner', suffixes = ('_Left', '_Right'))
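
To check which rows found a match, "merge" accepts an `indicator` argument that adds a "_merge" column to the output. A minimal sketch with small made-up dataframes; the "Region" column is invented for illustration:

```python
import pandas as pd

# Hypothetical miniature datasets; "Region" is an invented lookup column
left = pd.DataFrame({"Store": [1, 2, 3], "Sales_Rev": [10.0, 20.0, 30.0]})
right = pd.DataFrame({"Store": [2, 3, 4], "Region": ["East", "West", "North"]})

# Left join: keep every row of "left"; "_merge" records where each row came from
merged = pd.merge(left, right, on='Store', how='left', indicator=True)
print(merged)
```

Store 1 has no match in the right-hand dataframe, so its "Region" is NaN and its "_merge" value reads "left_only".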

<a id='16'></a>
# Situation: Exporting data to Excel or CSV files

This example builds off the pivot table section, to illustrate the type of output you can export:

In [None]:
pivot = pd.pivot_table(data = data,
                       values = ['Sales_Rev', 'Store'],
                       index = 'Product_ID',
                       aggfunc = {'Sales_Rev': np.sum, 'Store': len},
                       fill_value = 0)

Export to CSV:

In [None]:
pivot.to_csv("Sample_Export.csv") #Replace "Sample_Export.csv" with the desired file name
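
By default, "to_csv" writes the dataframe's index as the first column of the file. The sketch below shows this with a small made-up table and an in-memory buffer in place of a real file, then reads the text back in.

```python
import io
import pandas as pd

# Hypothetical miniature table with "Product_ID" as its index, similar to a pivot output
sample = pd.DataFrame({"Product_ID": ["A", "B"],
                       "Sales_Rev": [30.0, 5.0]}).set_index("Product_ID")

buffer = io.StringIO()           # In-memory stand-in for a file on disk
sample.to_csv(buffer)            # The index is written as the first column
csv_text = buffer.getvalue()
print(csv_text)

# Reading it back: index_col tells Pandas which column to restore as the index
restored = pd.read_csv(io.StringIO(csv_text), index_col="Product_ID")
```

Passing `index = False` to "to_csv" leaves the index out of the file, which suits tables whose index is only a row counter.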

Export to Excel via a simpler option:

In [None]:
pivot.to_excel("Sample_Export.xlsx") #Replace "Sample_Export.xlsx" with the desired file name

Export to Excel via a more advanced option:

In [None]:
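writer = pd.ExcelWriter("Sample_Export.xlsx", engine = 'xlsxwriter') #Replace "Sample_Export.xlsx" with the desired file name
pivot.to_excel(writer, sheet_name = 'SHEETNAMEXYZ') #Replace 'SHEETNAMEXYZ' with the desired sheet name

workbook = writer.book #The underlying workbook object, available for formatting
worksheet = writer.sheets['SHEETNAMEXYZ'] #The worksheet that holds the exported data

writer.close() #Saves the file and closes the writer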

Files will export to the current working directory, which defaults to the Jupyter directory housing this Notebook.

When working within the Spark Jupyter notebook, a requirement when using the APIs, a different function becomes necessary to write to a CSV.

Say you used the get_pmix_by_mitm API to pull data and then assigned it to a variable called "pull". You would then use code like the following to export the data to a CSV:

In [None]:
pull.write.csv('/user/<username>') #Replace "<username>" with your own username