# 2. Pandas Data Structures

The last chapter introduced lists and dictionaries; this one focuses on loading data, looking at Series and DataFrame, and saving.

A DataFrame is like a dictionary of Series, and a Series is like a list whose items all share a single dtype.
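
To make the analogy concrete, here is a minimal sketch. The column names and values are made up for illustration. It shows that looking up a column by key returns a Series, and that each column carries its own dtype.

```python
import pandas as pd

# Hypothetical data, for illustration only
df = pd.DataFrame({'plant': ['agave', 'aloe'], 'height_cm': [13, 40]})

# Selecting a column by key, like a dictionary lookup, yields a Series
heights = df['height_cm']
print(type(heights))

# Each column has a single dtype; an all-integer column is stored as integers
print(heights.dtype)
print(df['plant'].dtype)
```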

In [167]:
import pandas as pd

s = pd.Series(['agave',13])
print("Notice how this series has a dtype of 'object', because it's the LCD")
print(s)

print('You can also specify your own index, this doesn\'t seem much like a "series" to me though')
s = pd.Series(['Animal','Drummer Excellente'], index=['Name', 'Description'])
print(s)

Notice how this series has a dtype of 'object', because it's the LCD
0    agave
1       13
dtype: object
You can also specify your own index, this doesn't seem much like a "series" to me though
Name                       Animal
Description    Drummer Excellente
dtype: object


In [168]:
print('Answering some questions')
print('What happens if you pass different types?')

s = pd.Series([('1','A'),('2','B')])
print('a list of tuples just sets the value',s, sep='\n')
# The second positional argument of pd.Series is the index, not more data
s = pd.Series(('1','A'),('2','B'))
print('passing multiple tuples works like you might want, though the index is the second value')
dictionary = {
    "1": "A",
    "2": "B",
    "3": "C"
}
s = pd.Series(dictionary)
print('Dictionaries should work ok too:',s, sep='\n')

Answering some questions
What happens if you pass different types?
a list of tuples just sets the value
0    (1, A)
1    (2, B)
dtype: object
passing multiple tuples works like you might want, though the index is the second value
Dictionaries should work ok too:
1    A
2    B
3    C
dtype: object
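
The second call in the cell above, `pd.Series(('1','A'),('2','B'))`, never prints its result, so it is worth a closer look. The second positional argument of `pd.Series` is `index`, so the second tuple becomes the labels rather than more data. A small sketch that spells this out with keyword arguments:

```python
import pandas as pd

# Positional form: the first tuple is the data, the second is the index
positional = pd.Series(('1', 'A'), ('2', 'B'))

# The same Series written with explicit keyword arguments
explicit = pd.Series(data=['1', 'A'], index=['2', 'B'])

print(positional)
print(positional.equals(explicit))
```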


## Creating DataFrames

Dictionaries are the most common way to make DataFrames. Each key becomes a column name, and its list of values becomes that column's contents.


In [169]:
scientists = pd.DataFrame(
    data={
        'Name': ['Rose','Barb'],
        'Born': ['1999-01-01', '2040-05-03'],
        'Died': ['2999-01-01', '3040-05-03'],
        'Job': ['Chemist', 'Biologist'],
        'Age': [13,33]
    },
    index=['R','B'],
    columns=['Name','Age','Job','Born','Died']
)
print(scientists)
print()
print('Note: An OrderedDict could have worked here too')

   Name  Age        Job        Born        Died
R  Rose   13    Chemist  1999-01-01  2999-01-01
B  Barb   33  Biologist  2040-05-03  3040-05-03

Note: An OrderedDict could have worked here too


## Using Series

In [170]:
rose = scientists.loc['R']
print(type(rose))
print('You can print the list of keys (or indices) with index or keys', rose.index)
print('You can also just get the values', rose.values)

<class 'pandas.core.series.Series'>
You can print the list of keys (or indices) with index or keys Index(['Name', 'Age', 'Job', 'Born', 'Died'], dtype='object')
You can also just get the values ['Rose' 13 'Chemist' '1999-01-01' '2999-01-01']


In [171]:
ages = scientists['Age']
print('You can do interesting things with Series', ages.mean(), ages.min(), ages.max(), ages.std())

You can do interesting things with Series 23.0 13 33 14.142135623730951


## Boolean Subsetting (Filtering)

In [172]:
scientists = pd.read_csv('../data/scientists.csv')
print(scientists)
print()
print('The describe() method is a great way to get a lay of the land:',scientists['Age'].describe(),sep='\n')
print()
ages = scientists['Age']
print('Check out this expression:')
print(ages < ages.mean())
print('Filtering is easy:', ages[ages < ages.mean()])

                   Name        Born        Died  Age          Occupation
0     Rosaline Franklin  1920-07-25  1958-04-16   37             Chemist
1        William Gosset  1876-06-13  1937-10-16   61        Statistician
2  Florence Nightingale  1820-05-12  1910-08-13   90               Nurse
3           Marie Curie  1867-11-07  1934-07-04   66             Chemist
4         Rachel Carson  1907-05-27  1964-04-14   56           Biologist
5             John Snow  1813-03-15  1858-06-16   45           Physician
6           Alan Turing  1912-06-23  1954-06-07   41  Computer Scientist
7          Johann Gauss  1777-04-30  1855-02-23   77       Mathematician

The describe() method is a great way to get a lay of the land:
count     8.000000
mean     59.125000
std      18.325918
min      37.000000
25%      44.000000
50%      58.500000
75%      68.750000
max      90.000000
Name: Age, dtype: float64

Check out this expression:
0     True
1    False
2    False
3    False
4     True
5     True
6     T
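
Since the output above is cut off, here is a self-contained sketch of the same idea, with the ages from the table typed in by hand rather than read from the CSV. A comparison yields a boolean Series, and indexing with it keeps only the rows where the mask is `True`.

```python
import pandas as pd

# Ages copied from the scientists table above
ages = pd.Series([37, 61, 90, 66, 56, 45, 41, 77], name='Age')

# Comparing a Series to a scalar gives a boolean mask with the same index
below_mean = ages < ages.mean()

# Indexing with the mask keeps the rows where it is True
younger = ages[below_mean]
print(younger)
```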

## Correlating Series

If you perform operations between two vectors with the same length, the result is computed element by element. A scalar is applied to every element.

In [173]:
print(ages + ages + ages)
print('Scalars work too',ages + 1000, sep='\n')

0    111
1    183
2    270
3    198
4    168
5    135
6    123
7    231
Name: Age, dtype: int64
Scalars work too
0    1037
1    1061
2    1090
3    1066
4    1056
5    1045
6    1041
7    1077
Name: Age, dtype: int64


## Broadcasting

What if you apply operations between Series of different shapes? pandas lines the two Series up by their index labels. Where a label exists in both, the operation is performed. Where it does not, the result is a "missing" value like NaN.

In [174]:
print(ages * (ages < ages.mean()))
print('Having a bit of fun, what if we wanted to see which scientists were already in their correct sorted order?')
sortedAgesSeries = ages.sort_values().reset_index()['Age']
print(sortedAgesSeries)
print(ages == sortedAgesSeries)

0    37
1     0
2     0
3     0
4    56
5    45
6    41
7     0
Name: Age, dtype: int64
Having a bit of fun, what if we wanted to see which scientists were already in their correct sorted order?
0    37
1    41
2    45
3    56
4    61
5    66
6    77
7    90
Name: Age, dtype: int64
0     True
1    False
2    False
3    False
4    False
5    False
6    False
7    False
Name: Age, dtype: bool
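
The cell above only combines Series that share the same labels, so no missing values show up. Here is a minimal sketch of the mismatched case, using two small made-up Series: the operation is matched by label, and a label present in only one operand produces NaN.

```python
import pandas as pd

# Two hypothetical Series whose indexes only partly overlap
a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20], index=['y', 'z'])

# Addition is matched by label, not by position
total = a + b
print(total)

# 'x' exists only in a, so its result is NaN
print(total.isna())
```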


## DataFrame Subsetting

Works just like you might think! Pass a boolean Series to the DataFrame and you get back the rows where it is `True`.

In [175]:
scientists[scientists['Age'] > scientists['Age'].mean()]


                   Name        Born        Died  Age     Occupation
1        William Gosset  1876-06-13  1937-10-16   61   Statistician
2  Florence Nightingale  1820-05-12  1910-08-13   90          Nurse
3           Marie Curie  1867-11-07  1934-07-04   66        Chemist
7          Johann Gauss  1777-04-30  1855-02-23   77  Mathematician
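
Conditions can also be combined. This is a small sketch using a few rows copied from the table above; the threshold of 60 is an arbitrary choice for illustration. Each condition needs its own parentheses, with `&` for "and" and `|` for "or".

```python
import pandas as pd

# A few rows copied from the scientists table above
scientists = pd.DataFrame({
    'Name': ['William Gosset', 'Marie Curie', 'Alan Turing'],
    'Age': [61, 66, 41],
    'Occupation': ['Statistician', 'Chemist', 'Computer Scientist'],
})

# Build each mask separately, then combine them element by element
is_older = scientists['Age'] > 60
is_chemist = scientists['Occupation'] == 'Chemist'
older_chemists = scientists[is_older & is_chemist]
print(older_chemists)
```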


## Augmenting DataFrames

In [176]:
bornTime = pd.to_datetime(scientists['Born'], format='%Y-%m-%d')
diedTime = pd.to_datetime(scientists['Died'], format='%Y-%m-%d')
scientists['born_dt'], scientists['died_dt'] = bornTime, diedTime
print(scientists)

                   Name        Born        Died  Age          Occupation  \
0     Rosaline Franklin  1920-07-25  1958-04-16   37             Chemist   
1        William Gosset  1876-06-13  1937-10-16   61        Statistician   
2  Florence Nightingale  1820-05-12  1910-08-13   90               Nurse   
3           Marie Curie  1867-11-07  1934-07-04   66             Chemist   
4         Rachel Carson  1907-05-27  1964-04-14   56           Biologist   
5             John Snow  1813-03-15  1858-06-16   45           Physician   
6           Alan Turing  1912-06-23  1954-06-07   41  Computer Scientist   
7          Johann Gauss  1777-04-30  1855-02-23   77       Mathematician   

     born_dt    died_dt  
0 1920-07-25 1958-04-16  
1 1876-06-13 1937-10-16  
2 1820-05-12 1910-08-13  
3 1867-11-07 1934-07-04  
4 1907-05-27 1964-04-14  
5 1813-03-15 1858-06-16  
6 1912-06-23 1954-06-07  
7 1777-04-30 1855-02-23  


## Mutating DataFrames

Note: when mutating, prefer `loc` over chained direct indexing. Depending on the kind of array underneath, direct indexing can hand back a copy of the data instead of a reference to the original, so the assignment may never reach the DataFrame. I'm not getting it from the example below, but it's common to see a "SettingWithCopyWarning" about this. Read more here: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing-view-versus-copy
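
A minimal sketch of the `loc` form, using a made-up two-row frame. A single `.loc` call selects the rows and the column together, so the assignment lands on the DataFrame itself. The chained form `df[df['Name'] == 'Rose']['Age'] = 14` is the pattern that may end up writing to a temporary copy.

```python
import pandas as pd

# Hypothetical frame, for illustration only
df = pd.DataFrame({'Name': ['Rose', 'Barb'], 'Age': [13, 33]})

# One .loc call selects the rows and the column together,
# so the assignment is applied to df itself
df.loc[df['Name'] == 'Rose', 'Age'] = 14
print(df)
```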

In [177]:
import random
random.seed(42)
print(scientists['Age'])
random.shuffle(scientists['Age'])
print(scientists['Age'])

0    37
1    61
2    90
3    66
4    56
5    45
6    41
7    77
Name: Age, dtype: int64
0    66
1    56
2    41
3    77
4    90
5    45
6    37
7    61
Name: Age, dtype: int64


Here is a better way to do it:

In [178]:
scientists['Age'] = scientists['Age'].\
    sample(len(scientists['Age']), random_state=24).\
    reset_index(drop=True)
print(scientists['Age'])

0    61
1    45
2    37
3    90
4    56
5    66
6    77
7    41
Name: Age, dtype: int64


Reading the docs about [shuffle](https://docs.python.org/3.6/library/random.html#random.shuffle), you can see they recommend the sample method instead when the sequence is immutable: sample returns a new shuffled list rather than shuffling in place.

In [179]:
# This is how you can pick 3 random winners
scientists.Name.sample(3)

3             Marie Curie
2    Florence Nightingale
6             Alan Turing
Name: Name, dtype: object

Now that we have messed up our scientists' ages, let's recompute them from the dates. Pay special attention to that date type, that's slick!

In [180]:
scientists['age_days_dt'] = scientists.died_dt - scientists.born_dt
scientists['age_years_dt'] = scientists['age_days_dt'].astype('timedelta64[Y]')
print(scientists)

                   Name        Born        Died  Age          Occupation  \
0     Rosaline Franklin  1920-07-25  1958-04-16   61             Chemist   
1        William Gosset  1876-06-13  1937-10-16   45        Statistician   
2  Florence Nightingale  1820-05-12  1910-08-13   37               Nurse   
3           Marie Curie  1867-11-07  1934-07-04   90             Chemist   
4         Rachel Carson  1907-05-27  1964-04-14   56           Biologist   
5             John Snow  1813-03-15  1858-06-16   66           Physician   
6           Alan Turing  1912-06-23  1954-06-07   77  Computer Scientist   
7          Johann Gauss  1777-04-30  1855-02-23   41       Mathematician   

     born_dt    died_dt age_days_dt  age_years_dt  
0 1920-07-25 1958-04-16  13779 days          37.0  
1 1876-06-13 1937-10-16  22404 days          61.0  
2 1820-05-12 1910-08-13  32964 days          90.0  
3 1867-11-07 1934-07-04  24345 days          66.0  
4 1907-05-27 1964-04-14  20777 days          56.0  
5 1
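
The `astype('timedelta64[Y]')` conversion above depends on the pandas version. To my knowledge, releases from 2.0 onward reject year units. The sketch below is one version-independent alternative, not the method used above: it assumes a year is 365.25 days and rounds down, which is an approximation. The dates are copied from the first three rows of the table.

```python
import pandas as pd

# Dates copied from the first three rows of the table above
born = pd.to_datetime(pd.Series(['1920-07-25', '1876-06-13', '1820-05-12']))
died = pd.to_datetime(pd.Series(['1958-04-16', '1937-10-16', '1910-08-13']))

# Subtracting datetimes gives a timedelta Series
age_days = died - born

# Assumption: approximate a year as 365.25 days and round down
age_years = age_days.dt.days // 365.25
print(age_years)
```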

scientists['Age']

## Dropping Values

In [182]:
if 'Age' in scientists.columns:
    # drop() returns a new DataFrame; the result is discarded here
    scientists.drop(['Age'], axis=1)
print("What, the column is still here!?")
print(scientists.columns)
print("Let's reassign the result, derp")
if 'Age' in scientists.columns:
    scientists_dropped = scientists.drop(['Age'], axis=1)
# scientists itself is unchanged; the trimmed copy is scientists_dropped
print(scientists.columns)




What, the column is still here!?
Index(['Name', 'Born', 'Died', 'Age', 'Occupation', 'born_dt', 'died_dt',
       'age_days_dt', 'age_years_dt'],
      dtype='object')
Let's reassign the result, derp
Index(['Name', 'Born', 'Died', 'Age', 'Occupation', 'born_dt', 'died_dt',
       'age_days_dt', 'age_years_dt'],
      dtype='object')
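
Both printouts above show the original `scientists` frame, which is why `Age` appears twice. Here is a minimal sketch on a made-up two-column frame that shows the original and the returned frame side by side:

```python
import pandas as pd

# Hypothetical frame, for illustration only
df = pd.DataFrame({'Name': ['Rose', 'Barb'], 'Age': [13, 33]})

# drop() returns a new DataFrame and leaves df untouched
without_age = df.drop(columns=['Age'])

print(list(df.columns))           # ['Name', 'Age']
print(list(without_age.columns))  # ['Name']
```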


## Saving DataFrames

In [194]:
names = scientists['Name']
print("Serializing a Series")
names.to_pickle('../output/scientists_names_series.pickle')
print("Serializing a dataframe")
scientists.to_pickle('../output/scientists_names_df.pickle')
from_pickle = pd.read_pickle('../output/scientists_names_df.pickle')
print(from_pickle)
print("It's common to see extensions of .p, .pkl, or .pickle")

Serializing a Series
Serializing a dataframe
                   Name        Born        Died  Age          Occupation  \
0     Rosaline Franklin  1920-07-25  1958-04-16   61             Chemist   
1        William Gosset  1876-06-13  1937-10-16   45        Statistician   
2  Florence Nightingale  1820-05-12  1910-08-13   37               Nurse   
3           Marie Curie  1867-11-07  1934-07-04   90             Chemist   
4         Rachel Carson  1907-05-27  1964-04-14   56           Biologist   
5             John Snow  1813-03-15  1858-06-16   66           Physician   
6           Alan Turing  1912-06-23  1954-06-07   77  Computer Scientist   
7          Johann Gauss  1777-04-30  1855-02-23   41       Mathematician   

     born_dt    died_dt age_days_dt  age_years_dt  
0 1920-07-25 1958-04-16  13779 days          37.0  
1 1876-06-13 1937-10-16  22404 days          61.0  
2 1820-05-12 1910-08-13  32964 days          90.0  
3 1867-11-07 1934-07-04  24345 days          66.0  
4 1907-05-

In [198]:
names.to_csv('../output/scientist_names_series.csv')
scientists.to_csv('../output/scientists.csv')
print("You can remove the index, no problemo")
scientists.to_csv('../output/scientists.csv', index=False)

You can remove the index, no problemo
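
To see what `index=False` changes, here is a small round-trip sketch on a made-up frame. It writes to an in-memory string instead of a file, so no output directory is needed. When the index is saved, reading the CSV back turns it into an extra column named `Unnamed: 0`.

```python
import io
import pandas as pd

# Hypothetical frame, for illustration only
df = pd.DataFrame({'Name': ['Rose', 'Barb'], 'Age': [13, 33]})

# With no path, to_csv returns the CSV text as a string
with_index = df.to_csv()
without_index = df.to_csv(index=False)

# Reading the first version back turns the saved index into an extra column
cols_with_index = pd.read_csv(io.StringIO(with_index)).columns.tolist()
cols_without_index = pd.read_csv(io.StringIO(without_index)).columns.tolist()
print(cols_with_index)
print(cols_without_index)
```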


In [211]:
print("You can only export DataFrames to excel, but you can easily convert a series to a DataFrame")
names_df = names.to_frame()
print(names_df)
names_df.to_excel('../output/scientists_names_series_df.xls')
print("Note: had to import the xlwt module for this")
print("Note: there are lots of other formats too, interesting stuff like to_sql or to_json")


You can only export DataFrames to excel, but you can easily convert a series to a DataFrame
                   Name
0     Rosaline Franklin
1        William Gosset
2  Florence Nightingale
3           Marie Curie
4         Rachel Carson
5             John Snow
6           Alan Turing
7          Johann Gauss
Note: had to import the xlwt module for this
Note: there are lots of other formats too, interesting stuff like to_sql or to_json
