# Chapter 2 - Pandas Data Structures

## Creating Your Own Data

In [1]:
import pandas as pd

series = pd.Series(['banana', 42])
series

0    banana
1        42
dtype: object

The 0 and 1 on the left side are the "row numbers"; they are actually the index of the Series. We can manually assign names to the index values.

In [2]:
series = pd.Series(['Matt Crichton', 'Aspiring Data Scientist'],
                  index = ['Person', 'Who'])
series

Person              Matt Crichton
Who       Aspiring Data Scientist
dtype: object
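As a small sketch of why named index values are useful, the same Series can be queried either by label or by position. The variable names `by_label` and `by_position` are illustrative only.

```python
import pandas as pd

# Same Series as above, with custom index labels
series = pd.Series(['Matt Crichton', 'Aspiring Data Scientist'],
                   index=['Person', 'Who'])

by_label = series.loc['Person']   # look up by index label
by_position = series.iloc[1]      # look up by integer position

print(by_label)
print(by_position)
```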

## Creating a Data Frame
The DataFrame can be thought of as a dictionary of Series objects.
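One way to illustrate that idea is to build a small DataFrame directly from Series objects and pull a column back out. This is only a sketch; the names `names`, `ages`, and `people` are made up for the example.

```python
import pandas as pd

# Two Series that share the default integer index
names = pd.Series(['Rosaline Franklin', 'William Gosset'])
ages = pd.Series([37, 61])

# The dictionary keys become the column names
people = pd.DataFrame({'Name': names, 'Age': ages})

# Selecting a single column returns a Series again
age_column = people['Age']
print(type(age_column))
print(list(people.columns))
```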

In [3]:
scientists = pd.DataFrame({
    'Name' : ['Rosaline Franklin', 'William Gosset'],
    'Occupation' : ['Chemist', 'Statistician'],
    'Born' : ['1920-07-25', '1876-06-13'],
    'Died' : ['1954-04-16', '1937-10-16'],
    'Age' : [37, 61]
})
scientists

                Name    Occupation        Born        Died  Age
0  Rosaline Franklin       Chemist  1920-07-25  1954-04-16   37
1     William Gosset  Statistician  1876-06-13  1937-10-16   61


In [15]:
scientists = pd.DataFrame({
    'Name' : ['Rosaline Franklin', 'William Gosset'],
    'Occupation' : ['Chemist', 'Statistician'],
    'Born' : ['1920-07-25', '1876-06-13'],
    'Died' : ['1954-04-16', '1937-10-16'],
    'Age' : [37, 61],
    },
    index = ['Rosaline Franklin', 'William Gosset'],
    columns = ['Occupation', 'Born', 'Died', 'Age'])
scientists

                     Occupation        Born        Died  Age
Rosaline Franklin       Chemist  1920-07-25  1954-04-16   37
William Gosset     Statistician  1876-06-13  1937-10-16   61


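In the above examples, the column order is not guaranteed in versions of Python before 3.7, where a plain dict does not preserve insertion order. Below, we use the OrderedDict class from the collections module, which guarantees the order on any version.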

In [20]:
from collections import OrderedDict

scientists = pd.DataFrame(OrderedDict([
    ('Name', ['Rosaline Franklin', 'William Gosset']),
    ('Occupation' , ['Chemist', 'Statistician']),
    ('Born', ['1920-07-25', '1876-06-13']),
    ('Died', ['1954-04-16', '1937-10-16']),
    ('Age', [37, 61])
]))
scientists

                Name    Occupation        Born        Died  Age
0  Rosaline Franklin       Chemist  1920-07-25  1954-04-16   37
1     William Gosset  Statistician  1876-06-13  1937-10-16   61


## The Series

In [2]:
scientists = pd.DataFrame(
    data = {'Occupation' : ['Chemist', 'Statistician'],
           'Born': ['1920-07-25', '1876-06-13'],
           'Died' : ['1954-04-16', '1937-10-16'],
           'Age': [37, 61]},
    index = ['Rosaline Franklin', 'William Gosset'],
    columns = ['Occupation', 'Born', 'Died', 'Age']
)
scientists

                     Occupation        Born        Died  Age
Rosaline Franklin       Chemist  1920-07-25  1954-04-16   37
William Gosset     Statistician  1876-06-13  1937-10-16   61


We can now select a scientist by row index label rather than by row number. A Pandas Series is returned.

In [4]:
first_row = scientists.loc['William Gosset']
first_row

Occupation    Statistician
Born            1876-06-13
Died            1937-10-16
Age                     61
Name: William Gosset, dtype: object

In [7]:
type(first_row)

pandas.core.series.Series

A Series object has many attributes and methods. Two example attributes are index and values.

In [9]:
first_row.index

Index(['Occupation', 'Born', 'Died', 'Age'], dtype='object')

In [11]:
first_row.values

array(['Statistician', '1876-06-13', '1937-10-16', 61], dtype=object)

In [12]:
first_row.keys()[0]

'Occupation'

In [13]:
first_row.keys()[1]

'Born'

## Attributes of a Series

In [15]:
# loc - Subset using index value
# iloc - Subset using index position
# dtype or dtypes - The type of the Series contents
# T - Transpose of the series
# shape - Dimensions of the data
# size - Number of elements in the Series
# values - ndarray or ndarray-like of the Series
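A few of the attributes listed above can be tried on a small Series. This is a sketch that rebuilds the two ages by hand rather than reading them from the `scientists` DataFrame.

```python
import pandas as pd

ages = pd.Series([37, 61],
                 index=['Rosaline Franklin', 'William Gosset'],
                 name='Age')

print(ages.loc['William Gosset'])  # subset using the index label
print(ages.iloc[0])                # subset using the index position
print(ages.dtype)                  # type of the Series contents
print(ages.shape)                  # dimensions of the data
print(ages.size)                   # number of elements
print(ages.values)                 # underlying ndarray
```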

The Pandas Series is very similar to the numpy.ndarray. In turn, many methods and functions will operate on both of them.

In [18]:
ages = scientists['Age']
ages

Rosaline Franklin    37
William Gosset       61
Name: Age, dtype: int64

In [21]:
ages.mean()

49.0

In [22]:
ages.min()

37

In [23]:
ages.max()

61

In [24]:
ages.std()

16.97056274847714

## Boolean Subsetting: Series

In [3]:
scientists = pd.read_csv('../data/scientists.csv')

In [4]:
scientists.shape

(8, 5)

In [5]:
ages = scientists['Age']
ages

0    37
1    61
2    90
3    66
4    56
5    45
6    41
7    77
Name: Age, dtype: int64

In [30]:
ages.describe()

count     8.000000
mean     59.125000
std      18.325918
min      37.000000
25%      44.000000
50%      58.500000
75%      68.750000
max      90.000000
Name: Age, dtype: float64

In [31]:
ages.mean()

59.125

What if we wanted to subset our ages by identifying those above the mean?

In [33]:
# Returns a series including the indices of the elements which fit the conditional in the square brackets.
# We are finding all ages greater than the mean age of the dataset.
ages[ages > ages.mean()]

1    61
2    90
3    66
7    77
Name: Age, dtype: int64

In [35]:
# Below returns a boolean series showing the elements which pass/fail the conditional
ages > ages.mean()

0    False
1     True
2     True
3     True
4    False
5    False
6    False
7     True
Name: Age, dtype: bool

The statement returns a Series with a dtype of bool. In other words, we can subset values not only with labels and indices, but also by supplying a vector of boolean values. Python has many functions and methods; depending on how each is implemented, it may return labels, indices, or booleans. Keep this point in mind as you learn new methods and seek to piece together the various parts of your work.
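To make the point about boolean vectors concrete, the sketch below supplies the booleans by hand and checks that the result matches the comparison-based subset. It assumes the same eight ages as the dataset above; `mask` and `manual` are illustrative names.

```python
import pandas as pd

ages = pd.Series([37, 61, 90, 66, 56, 45, 41, 77], name='Age')

# Boolean Series produced by the comparison
mask = ages > ages.mean()

# The same booleans written out as a plain Python list
manual = [False, True, True, True, False, False, False, True]

from_mask = ages[mask]
from_list = ages[manual]
print(from_mask.equals(from_list))
```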

# Operations Are Automatically Aligned and Vectorized (Broadcasting)
Many of the methods that work on Series (and DataFrames) are vectorized, meaning that they work on the entire vector simultaneously. This approach makes the code easier to read, and optimizations are typically available to make calculations faster.

In [36]:
ages + ages

0     74
1    122
2    180
3    132
4    112
5     90
6     82
7    154
Name: Age, dtype: int64

In [37]:
ages * ages

0    1369
1    3721
2    8100
3    4356
4    3136
5    2025
6    1681
7    5929
Name: Age, dtype: int64

## Vectors With Integers (Scalars)

In [38]:
ages + 100

0    137
1    161
2    190
3    166
4    156
5    145
6    141
7    177
Name: Age, dtype: int64

## Vectors With Different Lengths
When working with vectors of different lengths, the behavior depends on the type of the vectors. With two Series, the operation is matched on the index: elements whose index label has no counterpart in the other Series are filled with NaN, signifying "not a number". This type of behavior, which is called broadcasting, differs between languages. Broadcasting in Pandas refers to how operations are calculated between arrays with different shapes.

In [42]:
ages + pd.Series([1, 100])

0     38.0
1    161.0
2      NaN
3      NaN
4      NaN
5      NaN
6      NaN
7      NaN
dtype: float64
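If the NaN values are not wanted, one option is the `add` method of a Series with its `fill_value` parameter, which treats a missing entry as the given value before adding. The sketch below assumes that filling with 0 is acceptable for the task at hand.

```python
import pandas as pd

ages = pd.Series([37, 61, 90, 66, 56, 45, 41, 77], name='Age')
short = pd.Series([1, 100])

# Entries missing from the shorter Series are treated as 0 instead of NaN
filled = ages.add(short, fill_value=0)
print(filled)
```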

With other data types, such as a NumPy array, the shapes must be compatible (here, the same length) or an error will be raised.

In [6]:
import numpy as np
ages + np.array([1, 100])

ValueError: operands could not be broadcast together with shapes (8,) (2,) 
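For contrast, a NumPy array of matching length is added element by element. The sketch below shows both cases, catching the error from the mismatched array so the example runs to completion; `mismatch_raised` is an illustrative flag.

```python
import numpy as np
import pandas as pd

ages = pd.Series([37, 61, 90, 66, 56, 45, 41, 77], name='Age')

# Shapes (8,) and (8,): added position by position
same_length = ages + np.arange(8)

# Shapes (8,) and (2,): incompatible, so an error is expected
try:
    ages + np.array([1, 100])
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

print(same_length.tolist())
print(mismatch_raised)
```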

## Vectors With Common Index Labels (Automatic Alignment)
Pandas almost always aligns indices when actions are performed.

In [45]:
ages

0    37
1    61
2    90
3    66
4    56
5    45
6    41
7    77
Name: Age, dtype: int64

In [44]:
rev_ages = ages.sort_index(ascending=False)
rev_ages

7    77
6    41
5    45
4    56
3    66
2    90
1    61
0    37
Name: Age, dtype: int64

If we perform an operation using ages and rev_ages, it will still be conducted on an element-by-element basis, but the vectors will be aligned first, before the operation is carried out.

In [46]:
# Reference output
ages * 2

0     74
1    122
2    180
3    132
4    112
5     90
6     82
7    154
Name: Age, dtype: int64

In [47]:
# Note how we get the same values even though the vector is reversed
ages + rev_ages

0     74
1    122
2    180
3    132
4    112
5     90
6     82
7    154
Name: Age, dtype: int64
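To see that alignment, not position, drives the result, the sketch below compares the aligned sum with a purely positional one. Converting the reversed Series with `to_numpy()` discards its index, so the values are paired by position instead.

```python
import pandas as pd

ages = pd.Series([37, 61, 90, 66, 56, 45, 41, 77], name='Age')
rev_ages = ages.sort_index(ascending=False)

# Matched on index labels: each age is added to itself
aligned = ages + rev_ages

# Index discarded: the first age is added to the last, and so on
positional = ages + rev_ages.to_numpy()

print(aligned.tolist())
print(positional.tolist())
```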

# The DataFrame
The DataFrame is Pandas' way of storing spreadsheet-like data in Python. Many of the features of the Series data structure carry over into the DataFrame.

## Boolean Subsetting: DataFrames

In [7]:
scientists[scientists['Age'] > scientists['Age'].mean()]

                   Name        Born        Died  Age     Occupation
1        William Gosset  1876-06-13  1937-10-16   61   Statistician
2  Florence Nightingale  1820-05-12  1910-08-13   90          Nurse
3           Marie Curie  1867-11-07  1934-07-04   66        Chemist
7          Johann Gauss  1777-04-30  1855-02-23   77  Mathematician


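A boolean vector used for subsetting should have the same length as the number of rows in the DataFrame. A shorter vector may have been accepted in older versions of Pandas, returning at most as many rows as the length of the boolean vector, but recent versions raise an error instead, so the example below supplies one boolean per row.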

In [8]:
scientists

                   Name        Born        Died  Age          Occupation
0     Rosaline Franklin  1920-07-25  1958-04-16   37             Chemist
1        William Gosset  1876-06-13  1937-10-16   61        Statistician
2  Florence Nightingale  1820-05-12  1910-08-13   90               Nurse
3           Marie Curie  1867-11-07  1934-07-04   66             Chemist
4         Rachel Carson  1907-05-27  1964-04-14   56           Biologist
5             John Snow  1813-03-15  1858-06-16   45           Physician
6           Alan Turing  1912-06-23  1954-06-07   41  Computer Scientist
7          Johann Gauss  1777-04-30  1855-02-23   77       Mathematician


In [10]:
# A boolean list shorter than the DataFrame (the commented-out line below) raises an error
# in recent versions of Pandas, so we supply one boolean per row instead.
# scientists.loc[[True, True, False, True]]
scientists.loc[[True, True, False, True, False, False, True, True]]

                Name        Born        Died  Age          Occupation
0  Rosaline Franklin  1920-07-25  1958-04-16   37             Chemist
1     William Gosset  1876-06-13  1937-10-16   61        Statistician
3        Marie Curie  1867-11-07  1934-07-04   66             Chemist
6        Alan Turing  1912-06-23  1954-06-07   41  Computer Scientist
7       Johann Gauss  1777-04-30  1855-02-23   77       Mathematician


## Broadcasting with DataFrames

In [12]:
first_half = scientists[: 4]
second_half = scientists[4 :]
first_half

                   Name        Born        Died  Age    Occupation
0     Rosaline Franklin  1920-07-25  1958-04-16   37       Chemist
1        William Gosset  1876-06-13  1937-10-16   61  Statistician
2  Florence Nightingale  1820-05-12  1910-08-13   90         Nurse
3           Marie Curie  1867-11-07  1934-07-04   66       Chemist


In [13]:
second_half

            Name        Born        Died  Age          Occupation
4  Rachel Carson  1907-05-27  1964-04-14   56           Biologist
5      John Snow  1813-03-15  1858-06-16   45           Physician
6    Alan Turing  1912-06-23  1954-06-07   41  Computer Scientist
7   Johann Gauss  1777-04-30  1855-02-23   77       Mathematician


When we perform an operation between a DataFrame and a scalar, Pandas tries to apply the operation to each cell of the DataFrame. Numbers are multiplied by 2, and strings are doubled.

In [14]:
scientists * 2

                                       Name                  Born                  Died  Age                            Occupation
0        Rosaline FranklinRosaline Franklin  1920-07-251920-07-25  1958-04-161958-04-16   74                        ChemistChemist
1              William GossetWilliam Gosset  1876-06-131876-06-13  1937-10-161937-10-16  122              StatisticianStatistician
2  Florence NightingaleFlorence Nightingale  1820-05-121820-05-12  1910-08-131910-08-13  180                            NurseNurse
3                    Marie CurieMarie Curie  1867-11-071867-11-07  1934-07-041934-07-04  132                        ChemistChemist
4                Rachel CarsonRachel Carson  1907-05-271907-05-27  1964-04-141964-04-14  112                    BiologistBiologist
5                        John SnowJohn Snow  1813-03-151813-03-15  1858-06-161858-06-16   90                    PhysicianPhysician
6                    Alan TuringAlan Turing  1912-06-231912-06-23  1954-06-071954-06-07   82  Computer ScientistComputer Scientist
7                  Johann GaussJohann Gauss  1777-04-301777-04-30  1855-02-231855-02-23  154            MathematicianMathematician
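Doubling the strings is rarely the intent. One possible way to restrict a scalar operation to the numeric columns is `select_dtypes`, sketched below on a two-row stand-in for the `scientists` DataFrame; `df` and `numeric_only` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Rosaline Franklin', 'William Gosset'],
    'Age': [37, 61],
})

# Keep only the numeric columns, then apply the scalar operation
numeric_only = df.select_dtypes(include='number') * 2
print(numeric_only)
```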


# Making Changes to Series and DataFrames

## Adding Additional Columns
The Born and Died columns are of type object, meaning they are strings.

In [16]:
scientists['Born'].dtypes

dtype('O')

In [17]:
scientists['Died'].dtype

dtype('O')

In [19]:
# Format the 'Born' column as a date/time
born_datetime = pd.to_datetime(scientists['Born'], format='%Y-%m-%d')
born_datetime

0   1920-07-25
1   1876-06-13
2   1820-05-12
3   1867-11-07
4   1907-05-27
5   1813-03-15
6   1912-06-23
7   1777-04-30
Name: Born, dtype: datetime64[ns]

In [22]:
# Format the 'Died' column now
died_datetime = pd.to_datetime(scientists['Died'], format='%Y-%m-%d')
died_datetime

0   1958-04-16
1   1937-10-16
2   1910-08-13
3   1934-07-04
4   1964-04-14
5   1858-06-16
6   1954-06-07
7   1855-02-23
Name: Died, dtype: datetime64[ns]

We'll now add two Series to our DataFrame with the datetime representation of the 'Born' and 'Died' columns.

In [25]:
scientists['born_date'], scientists['died_date'] = (born_datetime, died_datetime)
scientists.head()

                   Name        Born        Died  Age    Occupation   born_date   died_date
0     Rosaline Franklin  1920-07-25  1958-04-16   37       Chemist  1920-07-25  1958-04-16
1        William Gosset  1876-06-13  1937-10-16   61  Statistician  1876-06-13  1937-10-16
2  Florence Nightingale  1820-05-12  1910-08-13   90         Nurse  1820-05-12  1910-08-13
3           Marie Curie  1867-11-07  1934-07-04   66       Chemist  1867-11-07  1934-07-04
4         Rachel Carson  1907-05-27  1964-04-14   56     Biologist  1907-05-27  1964-04-14
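One benefit of the datetime representation is that dates can be subtracted. The sketch below uses the first two rows of the data to compute each lifespan in days; the name `lifespan` is an assumption for illustration.

```python
import pandas as pd

born = pd.to_datetime(pd.Series(['1920-07-25', '1876-06-13']), format='%Y-%m-%d')
died = pd.to_datetime(pd.Series(['1958-04-16', '1937-10-16']), format='%Y-%m-%d')

# Subtracting two datetime Series gives a Series of time differences
lifespan = died - born

# The .dt accessor exposes the number of whole days in each difference
print(lifespan.dt.days.tolist())
```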


## Directly Change a Column

In [29]:
scientists['Age']

0    37
1    61
2    90
3    66
4    56
5    45
6    41
7    77
Name: Age, dtype: int64

In [30]:
import random

# Set random seed so values are always the same
random.seed(42)
random.shuffle(scientists['Age'])

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  x[i], x[j] = x[j], x[i]


In [31]:
scientists['Age']

0    66
1    56
2    41
3    77
4    90
5    45
6    37
7    61
Name: Age, dtype: int64

In [33]:
# sample() shuffles the values; reset_index() stops them realigning to their old rows
scientists['Age'] = (scientists['Age']
                     .sample(len(scientists['Age']), random_state=24)
                     .reset_index(drop=True))  # Values stay randomized
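The `reset_index(drop=True)` step matters when shuffling a column with the Series `sample` method. Assignment to a DataFrame column aligns on the index, so a shuffled Series that keeps its original index falls back into its original order. The sketch below illustrates this on a small hypothetical DataFrame, `df`.

```python
import pandas as pd

df = pd.DataFrame({'Age': [37, 61, 90, 66]})

# frac=1 samples every row, which amounts to a shuffle
shuffled = df['Age'].sample(frac=1, random_state=24)

# Assigned as-is: aligned on the index, so the original order comes back
realigned = df.copy()
realigned['Age'] = shuffled

# Index reset first: the values keep their shuffled order
reordered = df.copy()
reordered['Age'] = shuffled.reset_index(drop=True)

print(realigned['Age'].tolist())
print(reordered['Age'].tolist())
```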

## Dropping Values

In [34]:
scientists.columns

Index(['Name', 'Born', 'Died', 'Age', 'Occupation', 'born_date', 'died_date'], dtype='object')

We provide the axis=1 argument to drop columns rather than rows.

In [36]:
scientists_dropped = scientists.drop(['Age'], axis=1)

In [37]:
scientists_dropped.columns

Index(['Name', 'Born', 'Died', 'Occupation', 'born_date', 'died_date'], dtype='object')
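The `drop` method also accepts a `columns` keyword, which some readers find clearer than `axis=1`. The sketch below uses a one-row stand-in DataFrame and also shows that the original is left unchanged.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Rosaline Franklin'],
    'Age': [37],
    'Occupation': ['Chemist'],
})

# Equivalent to df.drop(['Age'], axis=1); returns a new DataFrame
dropped = df.drop(columns=['Age'])

print(list(dropped.columns))
print(list(df.columns))
```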

## Exporting and Importing Data

In [39]:
scientists.to_pickle('../scientists_final.pickle')
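For the importing side, a pickle file can be read back with `pd.read_pickle`, and a CSV round trip works with `to_csv` and `read_csv`. The sketch below writes to a temporary directory so it does not touch the chapter's files; the file names and the small `df` are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({
    'Name': ['Rosaline Franklin', 'William Gosset'],
    'Age': [37, 61],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Pickle round trip
    pickle_path = os.path.join(tmp_dir, 'scientists_demo.pickle')
    df.to_pickle(pickle_path)
    from_pickle = pd.read_pickle(pickle_path)

    # CSV round trip; index=False avoids writing the row numbers as a column
    csv_path = os.path.join(tmp_dir, 'scientists_demo.csv')
    df.to_csv(csv_path, index=False)
    from_csv = pd.read_csv(csv_path)

print(from_pickle.equals(df))
print(from_csv.equals(df))
```

Writing the CSV without `index=False` would add the row numbers as an extra, unnamed column when the file is read back.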