# Working Through Intro to Data Structures

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Pandas Series
A pandas Series is a one-dimensional labeled array able to hold any data type. The axis labels are collectively called the index.

In [3]:
#basic series creation
data = [0, 1, 2, 3, 4, 5, 6]
index = range(0, len(data))
my_series  = pd.Series(data, index=index)
print(my_series)

0    0
1    1
2    2
3    3
4    4
5    5
6    6
dtype: int64


The data passed to pd.Series is not limited to a list. It can also be a dict, an np.ndarray, a scalar value, etc. The index behaves slightly differently for each input type, however.

Creating a series from an ndarray object requires that the index be of the same length as the ndarray.

In [4]:
#for example
new_series = pd.Series(np.random.randn(3), index = ['w', 's', 'a'])
new_series

w   -0.655309
s   -0.717232
a   -0.959745
dtype: float64

Point of interest:
pandas allows non-unique index values. If an operation that does not support duplicate index values is attempted, an exception is raised at that point.
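
A minimal sketch of that behaviour, assuming a small made-up series (the name dup and its values are illustrative only). Selecting a repeated label works, while reindex is one example of an operation that refuses duplicate labels:

```python
import pandas as pd

# A Series whose index repeats the label 'a' (hypothetical example data)
dup = pd.Series([1, 2, 3], index=['a', 'a', 'b'])

# Selecting a repeated label returns every matching row as a Series
both_a = dup['a']
print(both_a)

# reindex needs unique labels, so pandas raises instead of guessing
try:
    dup.reindex(['a', 'b'])
    reindex_failed = False
except ValueError as err:
    reindex_failed = True
    print('ValueError:', err)
```

The exact wording of the error message differs between pandas versions, so only the exception type is relied on here.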


# Creating a series from a dict:
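In this case the index argument selects which labels (dict keys) to pull values for, and in what order. If no index is passed, older pandas versions build the index from the dict's sorted keys when possible, while pandas 0.23+ on Python 3.6+ keeps the dict's insertion order.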

In [5]:
#without specifying the index
my_dict = {'Alpha' : 'A', 'Bravo': 'B', 'Charlie':'C'}
no_id = pd.Series(my_dict)
print(no_id)

Alpha      A
Bravo      B
Charlie    C
dtype: object


In [6]:
#specifying the index
with_id = pd.Series(my_dict, index = ['Alpha', 2, 'Charlie'])
print(with_id)

Alpha        A
2          NaN
Charlie      C
dtype: object


Note: NaN (not a number) is the standard marker pandas uses for missing data. The label 2 is not a key in my_dict, so its value is NaN.


# Creating a series from a scalar value:
With a scalar value an index must be specified, and the scalar is repeated to match the length of the index.

In [7]:
scalar_series = pd.Series(6., index = ['l', 'k', 'j', 'h', 'i'])
print(scalar_series)

l    6.0
k    6.0
j    6.0
h    6.0
i    6.0
dtype: float64


# How a pandas series behaves:
A Series behaves much like a NumPy ndarray and is a valid argument to most NumPy functions, with some differences. Namely, slicing the Series also slices the index.

In [8]:
L = [0,1,2,3,4,5]
I = ['a', 'b', 'c', 'd', 'e', 'f']
S = pd.Series(L, index = I)
print('slicing with numpy function yields: \n' + str(S[S >= S.mean()]))
print(S[2:4])
print(np.log(S[1:]))

slicing with numpy function yields: 
d    3
e    4
f    5
dtype: int64
c    2
d    3
dtype: int64
b    0.000000
c    0.693147
d    1.098612
e    1.386294
f    1.609438
dtype: float64


A Series can also act like a fixed-size dict in which the index labels are the keys.

In [9]:
#e.g. selecting a value by its index label
print(S['c'])
#altering the value stored under a label
S['f'] = 6
print(S)
#testing whether a label is in the index
'd' in S

2
a    0
b    1
c    2
d    3
e    4
f    6
dtype: int64


True

If the index label/key is not in the Series, a KeyError is raised.
If the .get method is used instead, a missing label returns None or a default value of your choice.

In [10]:
#e.g. looking for the value of 'g' in the current series
S['g']

KeyError: 'g'

In [None]:
# with the .get method
print(S.get('g'))
#utilizing a specified missing value
print(S.get('g', np.nan))

# Label alignment and Vectorized numpy operations with a pandas Series

Most NumPy functions built for ndarrays also work with a pandas Series. Unlike the ndarray, however, operations between Series automatically align the data on index labels.

In [None]:
print(S + new_series)

When the cell above is run, the result carries the union of both indexes, and any label that exists in only one of the two series gets a missing value (NaN) while the label itself is kept. These rows can be dropped with the dropna method, but think before doing so, as it may remove important information.
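
To make the alignment concrete, here is a small sketch with two made-up series (left and right are illustrative names, not variables from the cells above). It shows the NaN rows, dropna, and, as one possible alternative, Series.add with fill_value, which treats a missing label as 0 instead of producing NaN:

```python
import pandas as pd

# Two hypothetical series that share only the labels 'b' and 'c'
left = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
right = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# Addition aligns on labels; 'a' and 'd' exist on one side only, so they become NaN
total = left + right
print(total)

# dropna removes the rows that have no match on both sides
clean = total.dropna()
print(clean)

# add() with fill_value=0 keeps unmatched labels by treating the missing side as 0
filled = left.add(right, fill_value=0)
print(filled)
```

Whether dropping or filling is appropriate depends on what a missing label means in the data at hand.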

# Name Attribute
The name is basically a title for a particular series. It can either be specified explicitly or is assigned automatically when a column is sliced from a dataframe. A series can also be renamed, as demonstrated below.

In [None]:
#assigning a title
new_series1 = pd.Series(new_series, name = 'new series')
print(new_series1)

In [None]:
#constructing a simple dataframe from a dict of pandas series and printing it
x = {'hypo' : pd.Series(['a', 'b', 'c', 'd'], index = [1, 2, 3, 4]),
     'hyper':pd.Series(['A', 'B', 'C', 'D'], index = [1, 2, 3, 4])}
df = pd.DataFrame(x)
print(df.head())

In [None]:
# finding the name of a 1D dataframe slice
Y = df.iloc[:, 0]
print(Y)

In [None]:
#renaming a series using the pandas.Series.rename() method
Y_1 = Y.rename('hyperbole')
print(Y_1)

# DataFrame
A pandas DataFrame is similar to an Excel spreadsheet and is the most commonly used pandas object. It is two-dimensional (rows x columns) and can hold columns of different data types, all labeled, in a single frame.

It's capable of accepting multiple input types:

- Dict of 1D ndarrays, lists, dicts, or Series (as demonstrated above)
- 2D numpy.ndarray
- Structured or record ndarray
- A Series
- Another DataFrame

You can also pass your own row and column labels with the index and columns arguments. This guarantees the labels of the resulting dataframe. However, with a dict of Series or dicts, any data that does not match the passed labels is discarded.

In cases where axis labels are not provided, they are constructed automatically from the input data using common sense rules.


# constructing a dataframe from a dict of Series or dicts

The index of the resulting DataFrame will be the union of the indices from the input Series objects. Nested dicts are converted to Series before being included in the dataframe. If the 'columns' argument is absent, the column labels are taken from the dict keys (sorted in older pandas versions, in insertion order in pandas 0.23+ on Python 3.6+).

In [None]:
#create a dict of Series
d = {'alpha' : pd.Series(['a', 'b', 'c', 'd', 'e'], index = [1, 2, 3, 4 ,5]),
     'beta' : pd.Series(['g', 'h', 'f', 'j', 'k'], index = [1, 2, 3, 4, 5])}
#creating a dataframe from that dict
df = pd.DataFrame(d)
print(df)
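
The two series in the cell above share the same index, so the union rule is not visible there. A small sketch with deliberately mismatched indexes (parts and union_df are illustrative names) shows it:

```python
import pandas as pd

# Hypothetical input: the second series has one extra label, 'd'
parts = {'one': pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c']),
         'two': pd.Series([1.0, 2.0, 3.0, 4.0], index=['a', 'b', 'c', 'd'])}

# The frame's index is the union of both indexes; 'one' has no value at 'd'
union_df = pd.DataFrame(parts)
print(union_df)
```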

In [None]:
#constructing a dataframe with a specified index
df_1 = pd.DataFrame(d, index = [6, 5, 4, 8, 9])
print(df_1)

In [None]:
#constructing with specified column labels ('gamma' is not a key in d, so it starts out as NaN)
df_2 = pd.DataFrame(d, columns = ['alpha', 'beta', 'gamma'])
df_2['gamma'] = 'a'
print(df_2)

You can access the row and column labels through the index and columns attributes, respectively.

In [None]:
#accessing the index of df_2
df_2.index

In [None]:
#accessing the columns of df_2
df_2.columns

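When a specific set of columns is passed to pd.DataFrame() along with the dict, it overrides the dict keys: only the keys listed in columns are kept, in the order given, and any listed label that is not a key becomes a column of NaN.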
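
A short sketch of that override, assuming a throwaway dict of lists (raw and picked are illustrative names):

```python
import pandas as pd

# Hypothetical data with the keys 'a' and 'b'
raw = {'a': [1, 2], 'b': [3, 4]}

# Only 'b' is requested from the dict; 'c' is not a key, so it is filled with NaN,
# and 'a' is left out because it is not listed in columns
picked = pd.DataFrame(raw, columns=['b', 'c'])
print(picked)
```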

# From a dict of ndarrays/lists

Of vital importance here is that the ndarrays must all be of the same length n. If an index is passed it must also be of length n. Otherwise the index defaults to range(n).

In [None]:
#creating the dict of lists and corresponding index
#(named fruit_dict so that the built-in dict is not shadowed)
fruit_dict = {'bananas' : [3, 5, 6, 8, 4],
              'apples' : [1, 7, 6, 8, 4],
              'mangos' : [5, 6, 8, 7, 3]}
index = ['2', '4', '6', '8', '10']
# passing the dict into the dataframe function with specified index
fruit = pd.DataFrame(fruit_dict, index=index)
print(fruit)

In [None]:
#without specified index
fruit_2 = pd.DataFrame(fruit_dict)
print(fruit_2)

# From structured or record array

This case is handled the same way as a dict of arrays. Keep in mind that a DataFrame is not intended to work exactly like a 2D NumPy ndarray.
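
As a rough illustration, assuming a tiny structured array with two made-up fields (id and score are hypothetical names), the field names become the column labels:

```python
import numpy as np
import pandas as pd

# A structured array: each element is a record with an int field and a float field
records = np.array([(1, 2.5), (2, 3.5)],
                   dtype=[('id', 'i4'), ('score', 'f4')])

# Each field of the structured dtype turns into one column
records_df = pd.DataFrame(records)
print(records_df)
print(records_df.dtypes)
```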

# From a list of dicts
Works as before, but each dict becomes one row and the dicts do not all need the same keys: a key missing from a row is filled with NaN.
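
A minimal sketch with two made-up rows (rows and rows_df are illustrative names); the second dict has an extra key, so the first row gets NaN in that column:

```python
import pandas as pd

# Hypothetical list of dicts; only the second one has the key 'c'
rows = [{'a': 1, 'b': 2},
        {'a': 5, 'b': 10, 'c': 20}]

# One row per dict, one column per key seen in any dict
rows_df = pd.DataFrame(rows)
print(rows_df)
```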

# From dict of tuples

This provides an easy way of creating a multi-indexed dataframe.

In [None]:
#creating the dict of tuples
t_dict = {('m', 'k') : {('m', 'k'): 5, ('j', 'h'): 9},
         ('m', 't'): {('m', 'h'): 8, ('l', 'q'): 8}}
#creating the dataframe
t_df = pd.DataFrame(t_dict)
print(t_df)

# From a Series

A series is essentially a single column of a dataframe with an index. Constructing a dataframe from a single series therefore results in a frame with the same index as the series and one column built from it, named after the series (or 0 if the series has no name).

In [None]:
#creating a dataframe from the my_series variable used in the first section
df_s = pd.DataFrame(my_series)
print(df_s)
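
The column label depends on whether the series has a name. A quick sketch with throwaway series (all names here are illustrative), which also shows Series.to_frame as an equivalent route:

```python
import pandas as pd

# A named series: the name becomes the column label
named = pd.Series([1, 2, 3], name='values')
named_df = pd.DataFrame(named)
print(named_df.columns.tolist())

# to_frame() builds the same single-column frame
same_df = named.to_frame()

# An unnamed series falls back to the integer label 0
unnamed_df = pd.DataFrame(pd.Series([1, 2, 3]))
print(unnamed_df.columns.tolist())
```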

# alternate dataframe constructors

# pd.DataFrame.from_dict
This function takes a dict of dicts or a dict of array-like sequences and produces a dataframe. It operates like the DataFrame constructor except for the 'orient' parameter, which is set to 'columns' by default but can be changed to 'index' to use the dict keys as row labels.

In [None]:
#constructing a dict of dicts
d = {'A':{'a':3, 'b':6, 'c':5},
     'B':{'a':4, 'b':3, 'c':6},
     'C':{'a':7, 'b':4, 'c':4}}
#converting into dataframe without orient parameter
dofd_df = pd.DataFrame.from_dict(d)
print(dofd_df)

In [None]:
#with orient parameter
dofd_df = pd.DataFrame.from_dict(d, orient='index')
print(dofd_df)

# pd.DataFrame.from_records

As input it takes a list of tuples or an ndarray with a structured dtype. It works like the normal DataFrame constructor, except that the index may be taken from a specific field of the structured dtype.

In [None]:
#producing a 2D integer array from a list of tuples
#(np.int was removed in NumPy 1.24; the built-in int is equivalent)
t = np.array([(1, 2), (3, 4), (5, 6), (6, 7)], dtype = int)
#producing dataframe
t_df = pd.DataFrame.from_records(t)
print(t_df)
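
The cell above uses a plain 2D array, so the index option is not exercised. A sketch with a structured array (the field names A, B and C are made up for illustration) shows one field being promoted to the index:

```python
import numpy as np
import pandas as pd

# Hypothetical structured array with an int, a float and a one-character string field
rec = np.array([(1, 2.0, 'x'), (2, 3.0, 'y')],
               dtype=[('A', 'i4'), ('B', 'f4'), ('C', 'U1')])

# index='C' uses the field C as row labels instead of keeping it as a column
rec_df = pd.DataFrame.from_records(rec, index='C')
print(rec_df)
```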

# pd.DataFrame.from_items

Takes a sequence of (key, value) pairs. The keys become column names and the values fill the rows. It allows construction of a dataframe with a specified column order. Arrays, as usual, must all be of the same length. Note that from_items was deprecated in pandas 0.23 and removed in pandas 1.0; the cells below therefore build a dict from the pairs and pass it to from_dict, which produces the same frames.

In [12]:
#creating items
items = [('j', [2, 3, 4, 5, 6, 7]), ('k', [1, 2, 3, 4, 5, 6]), ('l', [1, 2, 3, 4, 5,6])]
#producing dataframe (a dict comprehension keeps the pair order on Python 3.7+)
items_df = pd.DataFrame.from_dict({key: values for key, values in items})
print(items_df)

   j  k  l
0  2  1  1
1  3  2  2
2  4  3  3
3  5  4  4
4  6  5  5
5  7  6  6


In [13]:
#using orient = 'index' moves the keys to the row labels; the column labels are then given with columns
columns = ['b', 'c', 'f', 'd', 'r', 'o']
items_df = pd.DataFrame.from_dict({key: values for key, values in items}, orient = 'index', columns = columns)
print(items_df)

   b  c  f  d  r  o
j  2  3  4  5  6  7
k  1  2  3  4  5  6
l  1  2  3  4  5  6


# Column selection, addition, deletion

A DataFrame works in an analogous fashion to a dict of like-indexed Series objects. Getting, setting, and deleting columns uses the same syntax as the corresponding dict operations.

In [14]:
# selecting a column from items_df
print(items_df['c'])

j    3
k    2
l    2
Name: c, dtype: int64


In [15]:
#adding a new column to items_df
items_df['Alpha']  = items_df['b'] * items_df['r'] / items_df['o']
print(items_df)

   b  c  f  d  r  o     Alpha
j  2  3  4  5  6  7  1.714286
k  1  2  3  4  5  6  0.833333
l  1  2  3  4  5  6  0.833333


In [16]:
# with a boolean operation
items_df['error'] = items_df['Alpha'] < 1
print(items_df)

   b  c  f  d  r  o     Alpha  error
j  2  3  4  5  6  7  1.714286  False
k  1  2  3  4  5  6  0.833333   True
l  1  2  3  4  5  6  0.833333   True


In [17]:
#removing a column
del items_df['error']
print(items_df)

   b  c  f  d  r  o     Alpha
j  2  3  4  5  6  7  1.714286
k  1  2  3  4  5  6  0.833333
l  1  2  3  4  5  6  0.833333


In [18]:
#inserting a scalar value broadcasts it through the entire column
items_df['happy'] = 'sad'
print(items_df)

   b  c  f  d  r  o     Alpha happy
j  2  3  4  5  6  7  1.714286   sad
k  1  2  3  4  5  6  0.833333   sad
l  1  2  3  4  5  6  0.833333   sad


In [19]:
#inserting a series whose index differs from the dataframe's aligns it to the dataframe's index, filling NaN where a label is missing
#shorter index
items_df['Series_insert'] = items_df['b'][:1]
print(items_df)

   b  c  f  d  r  o     Alpha happy  Series_insert
j  2  3  4  5  6  7  1.714286   sad            2.0
k  1  2  3  4  5  6  0.833333   sad            NaN
l  1  2  3  4  5  6  0.833333   sad            NaN


In [20]:
#larger index
items_df['big_series'] = scalar_series
print(items_df)

   b  c  f  d  r  o     Alpha happy  Series_insert  big_series
j  2  3  4  5  6  7  1.714286   sad            2.0         6.0
k  1  2  3  4  5  6  0.833333   sad            NaN         6.0
l  1  2  3  4  5  6  0.833333   sad            NaN         6.0


Apparently that does not extend the dataframe like I thought it would. Instead, the series is conformed to the dataframe's index: labels that are not in the dataframe ('h' and 'i' here) are dropped.
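
One way to picture this: assigning a series to a column behaves as if the series were first reindexed to the dataframe's index. A small sketch under that reading, using stand-in objects (frame and wide are illustrative names, not the variables above):

```python
import pandas as pd

# Hypothetical frame with three rows and a series with five labels
frame = pd.DataFrame({'x': [1, 2, 3]}, index=['j', 'k', 'l'])
wide = pd.Series(6.0, index=['l', 'k', 'j', 'h', 'i'])

# Column assignment aligns on labels; 'h' and 'i' have no row to land in
frame['y'] = wide
print(frame)

# Reindexing by hand gives the same values in the same order
by_hand = wide.reindex(frame.index)
print(by_hand)
```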


Following along with this idea, you can insert a raw ndarray, but its length must match the length of the dataframe's index.

In [21]:
#creating an array
ndarray = np.array([1, 2, 3])
print(ndarray)

[1 2 3]


In [22]:
#inserting the array
items_df.insert(10, value = ndarray, column = 'array',allow_duplicates = False)
print(items_df)

   b  c  f  d  r  o     Alpha happy  Series_insert  big_series  array
j  2  3  4  5  6  7  1.714286   sad            2.0         6.0      1
k  1  2  3  4  5  6  0.833333   sad            NaN         6.0      2
l  1  2  3  4  5  6  0.833333   sad            NaN         6.0      3


By default a new column is inserted at the end, but you can also specify its position using the insert method.

In [24]:
#deleting the array column we added
del items_df['array']
print(items_df)

   b  c  f  d  r  o     Alpha happy  Series_insert  big_series
j  2  3  4  5  6  7  1.714286   sad            2.0         6.0
k  1  2  3  4  5  6  0.833333   sad            NaN         6.0
l  1  2  3  4  5  6  0.833333   sad            NaN         6.0


In [25]:
#reinserting the array column this time specifying the position
items_df.insert(3, 'array', ndarray)
print(items_df)

   b  c  f  array  d  r  o     Alpha happy  Series_insert  big_series
j  2  3  4      1  5  6  7  1.714286   sad            2.0         6.0
k  1  2  3      2  4  5  6  0.833333   sad            NaN         6.0
l  1  2  3      3  4  5  6  0.833333   sad            NaN         6.0


# assigning new columns and method chains


DataFrames have a method called assign() that provides a convenient way to create new columns, potentially derived from existing columns. It returns a copy of the original dataframe with the new columns added and leaves the original untouched.

In [31]:
#reducing the size of the dataframe
del items_df['Series_insert'], items_df['big_series'], items_df['happy']
print(items_df)

   b  c  f  array  d  r  o     Alpha
j  2  3  4      1  5  6  7  1.714286
k  1  2  3      2  4  5  6  0.833333
l  1  2  3      3  4  5  6  0.833333


In [40]:
#creating a new column with the assign() method, using existing column data.
items_df_copy = items_df.assign(beta = items_df['Alpha'] / items_df['array'])
print(items_df_copy)

   b  c  f  array  d  r  o     Alpha      beta
j  2  3  4      1  5  6  7  1.714286  1.714286
k  1  2  3      2  4  5  6  0.833333  0.416667
l  1  2  3      3  4  5  6  0.833333  0.277778
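
Because assign() returns a new frame, calls can be chained. The following is only a sketch of such a chain on a small stand-in frame (base, chained and the column names are illustrative). Passing a callable to assign lets each step refer to the frame produced by the previous step:

```python
import pandas as pd

# Hypothetical stand-in for the frame used above
base = pd.DataFrame({'b': [2, 1, 1], 'r': [6, 5, 5], 'o': [7, 6, 6]},
                    index=['j', 'k', 'l'])

chained = (
    base
    # new column computed from existing ones
    .assign(alpha=lambda d: d['b'] * d['r'] / d['o'])
    # keep only the rows where alpha is below 1
    .loc[lambda d: d['alpha'] < 1]
    # a further column that depends on the one just created
    .assign(double=lambda d: d['alpha'] * 2)
)
print(chained)

# The original frame is untouched
print(base.columns.tolist())
```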
