# Working Through Intro to Data Structures

In [1]:
import numpy as np
import pandas as pd

# Pandas Series
A pandas Series is a one-dimensional labeled array able to hold any data type. The axis labels are collectively called the index.

In [9]:
#basic series creation
data = [0, 1, 2, 3, 4, 5, 6]
index = range(0, len(data))
my_series  = pd.Series(data, index=index)
print(my_series)

0    0
1    1
2    2
3    3
4    4
5    5
6    6
dtype: int64


# The data is not limited to a single type. It can be a dict, an np.ndarray, a scalar value, etc. How the index is handled depends on the input type, however.

Creating a series from an ndarray object requires that the index be of the same length as the ndarray.

In [3]:
#for example
new_series = pd.Series(np.random.randn(3), index = ['w', 's', 'a'])
new_series

w   -0.690117
s   -0.400602
a   -2.226436
dtype: float64
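
The length requirement can be checked directly. The following is a minimal sketch; the names `values`, `ok`, and `length_error` are made up for illustration and do not appear elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

values = np.arange(3)  # three values

# Index length matches the data length, so this works.
ok = pd.Series(values, index=['w', 's', 'a'])

# Index is shorter than the data, so pandas refuses to build the Series.
try:
    pd.Series(values, index=['w', 's'])
    length_error = None
except ValueError as exc:
    length_error = exc

print(ok)
print(type(length_error).__name__)
```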

Point of interest:
pandas allows non-unique index values. If an operation that does not support duplicate labels is attempted, an error is raised at that point.
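
As a small illustration of that behaviour, the sketch below builds a Series with a repeated label. The names `dup`, `matches`, and `reindex_error` are hypothetical, and `reindex` is used only as one example of an operation that needs unique labels.

```python
import pandas as pd

# A Series whose index repeats the label 'a'.
dup = pd.Series([1, 2, 3], index=['a', 'a', 'b'])

# Selecting a repeated label returns every matching row, not a scalar.
matches = dup['a']
print(matches)

# reindex requires unique labels, so it raises on this Series.
try:
    dup.reindex(['a', 'b'])
    reindex_error = None
except ValueError as exc:
    reindex_error = exc

print(type(reindex_error).__name__)
```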


# Creating a series from a dict:
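In this case the index argument selects which dict keys to use as labels, and in what order. When no index is passed, the index is built from the dict's keys: sorted in older pandas versions (as in the output below), and in insertion order in pandas 0.23 or later on Python 3.6 or later.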

In [14]:
#without specifying the index
my_dict = {'Alpha' : 'A', 'Bravo': 'B', 'Charlie':'C'}
no_id = pd.Series(my_dict)
print(no_id)

Alpha      A
Bravo      B
Charlie    C
dtype: object


In [15]:
#specifying the index
with_id = pd.Series(my_dict, index = ['Alpha', 2, 'Charlie'])
print(with_id)

Alpha        A
2          NaN
Charlie      C
dtype: object


Note: NaN (not a number) is the standard marker pandas uses for missing data. Here the label 2 is not a key in the dict, so its value is NaN.
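
Missing entries can be located and replaced after the fact. This is a minimal sketch that rebuilds `with_id` from the cell above. The placeholder string `'unknown'` and the names `missing_mask`, `n_missing`, and `filled` are assumptions made for the example.

```python
import pandas as pd

my_dict = {'Alpha': 'A', 'Bravo': 'B', 'Charlie': 'C'}
with_id = pd.Series(my_dict, index=['Alpha', 2, 'Charlie'])

# Boolean mask that is True wherever the value is missing.
missing_mask = with_id.isna()
n_missing = int(missing_mask.sum())

# Replace missing values with a placeholder of our choosing.
filled = with_id.fillna('unknown')

print(missing_mask)
print(n_missing)
print(filled)
```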


Creating a series from a scalar value:
With a scalar value an index must be specified, and the scalar is repeated to match the length of the index.

In [19]:
scalar_series = pd.Series(6., index = ['l', 'k', 'j'])
print(scalar_series)

l    6.0
k    6.0
j    6.0
dtype: float64


# How a pandas series behaves:
A Series behaves much like a NumPy ndarray and is a valid argument to most NumPy functions. One difference: slicing a Series also slices its index.

In [27]:
L = [0,1,2,3,4,5]
I = ['a', 'b', 'c', 'd', 'e', 'f']
S = pd.Series(L, index = I)
print('slicing with numpy function yields: \n' + str(S[S >= S.mean()]))
print(S[2:4])
print(np.log(S[1:]))

slicing with numpy function yields: 
d    3
e    4
f    5
dtype: int64
c    2
d    3
dtype: int64
b    0.000000
c    0.693147
d    1.098612
e    1.386294
f    1.609438
dtype: float64


A Series can also act like a fixed-size dictionary in which the index labels serve as keys.

In [28]:
#e.g. selecting a value by its index label
print(S['c'])
#assigning a new value to an existing label
S['f'] = 6
print(S)
#testing whether a label is in the index
'd' in S

2
a    0
b    1
c    2
d    3
e    4
f    6
dtype: int64


True

If the index label/key is not in the series, a KeyError is raised.
The .get method avoids the error: it returns None, or a default value of your choosing, when the label is missing.

In [30]:
#e.g. looking for the value of 'g' in the current series
S['g']

KeyError: 'g'

In [34]:
# with the .get method
print(S.get('g'))
#utilizing a specified missing value
print(S.get('g', np.nan))

None
nan


# Label alignment and Vectorized numpy operations with a pandas Series

Most NumPy functions built for n-dimensional arrays also work with a pandas Series. Unlike with ndarrays, however, operations between two Series align automatically on index labels rather than on position.

In [38]:
print(S + new_series)

a   -1.459871
b         NaN
c         NaN
d         NaN
e         NaN
f         NaN
s         NaN
w         NaN
dtype: float64


As seen above, when a label exists in only one of the two series, the result for that label is a missing value (NaN), while the label itself is kept, so the resulting index is the union of both indexes. These rows can be dropped with the dropna method, but think before doing so, as it may remove important information.

# Name Attribute
The name is essentially a title for a particular series. It can be specified explicitly, or it is assigned automatically when a column is selected from a dataframe. A series can also be renamed, as demonstrated below.

In [4]:
#assigning a title
new_series1 = pd.Series(new_series, name = 'new series')
print(new_series1)

w   -0.690117
s   -0.400602
a   -2.226436
Name: new series, dtype: float64


In [5]:
#constructing a simple dataframe from a dict of pandas series and printing it
x = {'hypo' : pd.Series(['a', 'b', 'c', 'd'], index = [1, 2, 3, 4]),
     'hyper':pd.Series(['A', 'B', 'C', 'D'], index = [1, 2, 3, 4])}
df = pd.DataFrame(x)
print(df.head())

  hyper hypo
1     A    a
2     B    b
3     C    c
4     D    d


In [9]:
# selecting a single column returns a Series named after that column
Y = df.iloc[:, 0]
print(Y)

1    A
2    B
3    C
4    D
Name: hyper, dtype: object


In [10]:
#renaming a series with the pandas.Series.rename() method
Y_1 = Y.rename('hyperbole')
print(Y_1)

1    A
2    B
3    C
4    D
Name: hyperbole, dtype: object


# DataFrame
A pandas DataFrame is similar to an Excel spreadsheet and is the most commonly used pandas object. It is two-dimensional (rows x columns), and its labeled columns can each hold a different data type.

It's capable of accepting multiple input types:
                Dict of 1D ndarrays, lists, dicts, or Series (as demonstrated above)
                2D numpy.ndarray
                Structured or record ndarray
                A Series
                Another DataFrame
                
You also have the option of passing in your own row and column labels with the index and columns arguments. This guarantees the labels of the resulting dataframe. However, any data whose labels are not in the passed index or columns is discarded, and passed labels with no matching data are filled with NaN.

When axis labels are not provided, they are constructed automatically from the input data using common-sense rules.
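
Two of the input types listed above are not demonstrated elsewhere in this notebook. The sketch below shows them under made-up data: the array contents, the labels (`r1`, `c1`, ...), and the field names (`id`, `score`, `tag`) are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# 2D ndarray: labels must be supplied, otherwise 0..n-1 is used.
arr = np.arange(6).reshape(3, 2)
df_arr = pd.DataFrame(arr, index=['r1', 'r2', 'r3'], columns=['c1', 'c2'])
print(df_arr)

# Structured ndarray: the field names become the column labels.
rec = np.array([(1, 2.5, 'x'), (2, 3.5, 'y')],
               dtype=[('id', 'i4'), ('score', 'f8'), ('tag', 'U1')])
df_rec = pd.DataFrame(rec)
print(df_rec)
```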


# constructing a dataframe from a dict of series or dicts

The index of the resulting DataFrame will be the union of the indices of the input Series objects. Nested dicts are converted to Series before being included in the dataframe. If the 'columns' argument is absent, the column labels are taken from the dict keys (sorted in older pandas versions, in insertion order in newer ones).

In [4]:
#create a dict of series
d = {'alpha' : pd.Series(['a', 'b', 'c', 'd', 'e'], index = [1, 2, 3, 4 ,5]),
     'beta' : pd.Series(['g', 'h', 'f', 'j', 'k'], index = [1, 2, 3, 4, 5])}
#creating a dataframe from that dict
df = pd.DataFrame(d)
print(df)

  alpha beta
1     a    g
2     b    h
3     c    f
4     d    j
5     e    k
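
The two Series above share the same index, so the union of indices is not visible in that output. A minimal sketch with differing indexes follows; `parts` and `union_df` are hypothetical names used only here.

```python
import pandas as pd

parts = {'one': pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c']),
         'two': pd.Series([1.0, 2.0, 3.0, 4.0], index=['a', 'b', 'c', 'd'])}

# The row labels are the union of both indexes; 'one' has no value
# for 'd', so that cell is NaN.
union_df = pd.DataFrame(parts)
print(union_df)
```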


In [5]:
#constructing a dataframe with a specified index
df_1 = pd.DataFrame(d, index = [6, 5, 4, 8, 9])
print(df_1)

  alpha beta
6   NaN  NaN
5     e    k
4     d    j
8   NaN  NaN
9   NaN  NaN


In [7]:
#constructing with specified column labels
df_2 = pd.DataFrame(d, columns = ['alpha', 'beta', 'gamma'])
df_2['gamma'] = 'a'
print(df_2)

  alpha beta gamma
1     a    g     a
2     b    h     a
3     c    f     a
4     d    j     a
5     e    k     a


You can access the row and column labels through the index and columns attributes, respectively.

In [8]:
#accessing the index of df_2
df_2.index

Int64Index([1, 2, 3, 4, 5], dtype='int64')

In [9]:
#accessing the columns of df_2
df_2.columns

Index(['alpha', 'beta', 'gamma'], dtype='object')

The keys in the dict can be overridden by passing a specific set of columns into pd.DataFrame() along with the data.

Currently working with multiple versions of this notebook.
Adding the next section here with the intention of merging the notebooks later.

# alternate dataframe constructors

pd.DataFrame.from_dict
This function takes a dict of dicts or a dict of array-like sequences and returns a dataframe. It behaves like the DataFrame constructor, except for the 'orient' parameter, which defaults to 'columns' but can be changed to 'index' to use the dict keys as row labels.

In [8]:
#constructing a dict of dicts
d = {'A':{'a':3, 'b':6, 'c':5},
     'B':{'a':4, 'b':3, 'c':6},
     'C':{'a':7, 'b':4, 'c':4}}
#converting into dataframe without orient parameter
dofd_df = pd.DataFrame.from_dict(d)
print(dofd_df)

   A  B  C
a  3  4  7
b  6  3  4
c  5  6  4


In [11]:
#with orient parameter
dofd_df = pd.DataFrame.from_dict(d, orient='index')
print(dofd_df)

   a  b  c
A  3  6  5
B  4  3  6
C  7  4  4


# pd.DataFrame.from_records

Takes as input a list of tuples or an ndarray with a structured dtype. The index may be taken from a field of the structured dtype.

In [22]:
#producing a 2D integer array from a list of tuples
#(np.int was removed in NumPy 1.24; the builtin int is equivalent)
t = np.array([(1, 2), (3, 4), (5, 6), (6, 7)], dtype = int)
#producing dataframe
t_df = pd.DataFrame.from_records(t)
print(t_df)

   0  1
0  1  2
1  3  4
2  5  6
3  6  7
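
The cell above uses a plain 2D array, so the index option mentioned in the description is not exercised. The sketch below uses a structured array instead. The field names (`id`, `tag`, `score`) and the values are invented for the example.

```python
import numpy as np
import pandas as pd

# Structured array: each record has named, typed fields.
records = np.array([(10, 'x', 1.5), (20, 'y', 2.5)],
                   dtype=[('id', 'i8'), ('tag', 'U1'), ('score', 'f8')])

# Use the 'id' field as the row index; the remaining fields become columns.
rec_df = pd.DataFrame.from_records(records, index='id')
print(rec_df)
```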


# pd.DataFrame.from_items

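Takes a sequence of (key, value) pairs. The keys become the column names and the values fill the rows. This allows construction of a dataframe with a specified column order. Arrays, as usual, must all be of the same length. Note that from_items was deprecated in pandas 0.23 and removed in pandas 1.0, so the cells below only run on older versions.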

In [24]:
#creating items
items = [('j', [2, 3, 4, 5, 6, 7]), ('k', [1, 2, 3, 4, 5, 6]), ('l', [1, 2, 3, 4, 5,6])]
#producing dataframe
items_df = pd.DataFrame.from_items(items)
print(items_df)

   j  k  l
0  2  1  1
1  3  2  2
2  4  3  3
3  5  4  4
4  6  5  5
5  7  6  6


In [25]:
#orient = 'index' moves the keys to the row labels. Note that column labels must then be specified
columns = ['b', 'c', 'f', 'd', 'r', 'o']
items_df = pd.DataFrame.from_items(items, orient = 'index', columns = columns)
print(items_df)

   b  c  f  d  r  o
j  2  3  4  5  6  7
k  1  2  3  4  5  6
l  1  2  3  4  5  6
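
On pandas 1.0 and later, where from_items no longer exists, the same two frames can be built with from_dict. This is a sketch of one replacement rather than the only option; it relies on dicts preserving insertion order (Python 3.7+), and the names `by_column` and `by_row` are made up here.

```python
import pandas as pd

items = [('j', [2, 3, 4, 5, 6, 7]),
         ('k', [1, 2, 3, 4, 5, 6]),
         ('l', [1, 2, 3, 4, 5, 6])]

# Keys become column labels, in the order the pairs were given.
by_column = pd.DataFrame.from_dict(dict(items))
print(by_column)

# Keys become row labels; column labels are passed explicitly.
columns = ['b', 'c', 'f', 'd', 'r', 'o']
by_row = pd.DataFrame.from_dict(dict(items), orient='index', columns=columns)
print(by_row)
```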
