# 7.14 Intro to Data Science: pandas Series and DataFrames

## 7.14.1 pandas Series

In [1]:
import pandas as pd

In [2]:
grades = pd.Series([87, 100, 94])

### Displaying a Series

In [6]:
grades

0     87
1    100
2     94
dtype: int64

### Creating a Series with All Elements Having the Same Value

In [7]:
pd.Series(98.6, range(3))

0    98.6
1    98.6
2    98.6
dtype: float64

The second argument is a one-dimensional iterable object (such as a list, an array or a range) containing the Series' indices.
The number of indices determines the number of elements.
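
To illustrate the point above, here is a minimal sketch in which the indices are strings rather than a range. The name `temps` and the day labels are invented for this example; the scalar is repeated once per label.

```python
import pandas as pd

# A scalar value is repeated once for every label in the index.
temps = pd.Series(98.6, index=['Mon', 'Tue', 'Wed', 'Thu'])

print(len(temps))     # 4: one element per index label
print(temps['Wed'])   # 98.6
```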

### Accessing a Series' Elements

In [8]:
grades[0]

87

### Producing Descriptive Statistics for a Series

In [9]:
grades.count()

3

In [10]:
grades.mean()

93.66666666666667

In [11]:
grades.min()

87

In [12]:
grades.max()

100

In [13]:
grades.std() # standard deviation

6.506407098647712

Each of these is a functional-style reduction. Calling Series method **describe** produces all of these stats and more:

In [14]:
grades.describe()

count      3.000000
mean      93.666667
std        6.506407
min       87.000000
25%       90.500000
50%       94.000000
75%       97.000000
max      100.000000
dtype: float64
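
The result of describe is itself a Series whose indices are the statistic names, so individual statistics can be looked up by label. The sketch below assumes the same three grades used above; the variable name `stats` is chosen only for illustration.

```python
import pandas as pd

grades = pd.Series([87, 100, 94])
stats = grades.describe()   # a Series indexed by 'count', 'mean', 'std', ...

# Each entry corresponds to one of the individual reductions shown earlier.
print(stats['count'])                                # 3.0
print(abs(stats['mean'] - grades.mean()) < 1e-9)     # True
print(stats['50%'])                                  # 94.0 (the median)
```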

### Creating a Series with Custom Indices

In [15]:
grades = pd.Series([87, 100, 94], index=['Wally', 'Eva', 'Sam'])

In [16]:
grades

Wally     87
Eva      100
Sam       94
dtype: int64

### Dictionary Initializers

In [17]:
grades = pd.Series({'Wally':87, 'Eva':100, 'Sam':94})

In [18]:
grades

Wally     87
Eva      100
Sam       94
dtype: int64

### Accessing Elements of a Series Via Custom Indices

In [19]:
grades['Eva']

100

In [20]:
grades.Wally

87

Series also has *built-in* attributes. For example, the **dtype attribute** returns the underlying array's element type:

In [21]:
grades.dtype

dtype('int64')

and the **values attribute** returns the underlying array:

In [22]:
grades.values

array([ 87, 100,  94], dtype=int64)
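
As an aside that is not part of the original text: the pandas documentation recommends the Series method to_numpy when you specifically need a NumPy array. A minimal sketch, assuming the same grades Series as above (the name `arr` is illustrative):

```python
import numpy as np
import pandas as pd

grades = pd.Series({'Wally': 87, 'Eva': 100, 'Sam': 94})

arr = grades.to_numpy()   # NumPy array of the values; the index labels are not included
print(isinstance(arr, np.ndarray))   # True
print(arr.tolist())                  # [87, 100, 94]
```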

### Creating a Series of Strings

In [23]:
hardware = pd.Series(['Hammer', 'Saw', 'Wrench'])

In [24]:
hardware

0    Hammer
1       Saw
2    Wrench
dtype: object

In [26]:
hardware.str.contains('a')

0     True
1     True
2    False
dtype: bool

In [28]:
hardware.str.upper()

0    HAMMER
1       SAW
2    WRENCH
dtype: object

## 7.14.2 DataFrames

A **DataFrame** is an enhanced two-dimensional array. Like Series, DataFrames can have custom indices (for both rows and columns), and they offer additional operations and capabilities that make them more convenient for many data-science-oriented tasks. DataFrames also support missing data. Each column in a DataFrame is a Series. The Series representing each column may contain different element types, as you'll soon see when we discuss loading datasets into DataFrames.
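
The following sketch illustrates three of the claims above: each column is a Series, columns may have different element types, and missing data is supported. The `inventory` DataFrame and its column names are hypothetical, invented only for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a string column, a float column with a missing value,
# and an integer column.
inventory = pd.DataFrame({
    'item': ['Hammer', 'Saw', 'Wrench'],
    'price': [12.5, np.nan, 9.0],
    'quantity': [3, 7, 2],
})

print(type(inventory['price']))          # each column is a pandas Series
print(inventory.dtypes)                  # one element type per column
print(inventory['price'].isna().sum())   # 1 missing price
```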

### Creating a DataFrame from a Dictionary

In [29]:
grades_dict = {'Wally': [87, 96, 70], 'Eva': [100, 87, 90],
               'Sam': [94, 77, 90], 'Katie': [100, 81, 82],
               'Bob': [83, 65, 85]}

In [30]:
grades = pd.DataFrame(grades_dict)

In [31]:
grades

   Wally  Eva  Sam  Katie  Bob
0     87  100   94    100   83
1     96   87   77     81   65
2     70   90   90     82   85


### Customizing a DataFrame's Indices with the index Attribute

In [32]:
grades.index = ['Test1', 'Test2', 'Test3']

In [33]:
grades

       Wally  Eva  Sam  Katie  Bob
Test1     87  100   94    100   83
Test2     96   87   77     81   65
Test3     70   90   90     82   85


### Accessing a DataFrame's Columns

In [34]:
grades['Eva']

Test1    100
Test2     87
Test3     90
Name: Eva, dtype: int64

In [35]:
grades.Sam

Test1    94
Test2    77
Test3    90
Name: Sam, dtype: int64

### Selecting Rows via the loc and iloc Attributes

In [36]:
grades.loc['Test1']

Wally     87
Eva      100
Sam       94
Katie    100
Bob       83
Name: Test1, dtype: int64

In [37]:
grades.iloc[1]

Wally    96
Eva      87
Sam      77
Katie    81
Bob      65
Name: Test2, dtype: int64

### Selecting Rows via Slices and Lists with the loc and iloc Attributes

In [38]:
grades.loc['Test1':'Test3']

       Wally  Eva  Sam  Katie  Bob
Test1     87  100   94    100   83
Test2     96   87   77     81   65
Test3     70   90   90     82   85


In [39]:
grades.iloc[0:2]

       Wally  Eva  Sam  Katie  Bob
Test1     87  100   94    100   83
Test2     96   87   77     81   65
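
The two slices above behave differently at their upper bounds: a loc slice *includes* its end label, whereas an iloc slice *excludes* its end position, as in ordinary Python slicing. The sketch below uses a smaller two-column DataFrame (an assumption made only to keep the example short) to show two slices that select the same rows.

```python
import pandas as pd

grades = pd.DataFrame(
    {'Wally': [87, 96, 70], 'Eva': [100, 87, 90]},
    index=['Test1', 'Test2', 'Test3'])

by_label = grades.loc['Test1':'Test2']   # label slice: 'Test2' is included
by_position = grades.iloc[0:2]           # position slice: position 2 is excluded

print(list(by_label.index))              # ['Test1', 'Test2']
print(by_label.equals(by_position))      # True
```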


In [40]:
grades.loc[['Test1', 'Test3']]

       Wally  Eva  Sam  Katie  Bob
Test1     87  100   94    100   83
Test3     70   90   90     82   85


In [41]:
grades.iloc[[0, 2]]

       Wally  Eva  Sam  Katie  Bob
Test1     87  100   94    100   83
Test3     70   90   90     82   85


### Selecting Subsets of the Rows and Columns

In [42]:
grades.loc['Test1':'Test2', ['Eva', 'Katie']]

       Eva  Katie
Test1  100    100
Test2   87     81


In [43]:
grades.iloc[[0, 2], 0:3]

       Wally  Eva  Sam
Test1     87  100   94
Test3     70   90   90


### Boolean Indexing

In [44]:
grades[grades >= 90]

       Wally    Eva   Sam  Katie  Bob
Test1    NaN  100.0  94.0  100.0  NaN
Test2   96.0    NaN   NaN    NaN  NaN
Test3    NaN   90.0  90.0    NaN  NaN


In [46]:
grades[(grades >= 80) & (grades < 90)]

       Wally   Eva  Sam  Katie   Bob
Test1   87.0   NaN  NaN    NaN  83.0
Test2    NaN  87.0  NaN   81.0   NaN
Test3    NaN   NaN  NaN   82.0  85.0
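
In the results above, every cell that fails the condition is replaced with NaN (not a number), which is why the remaining grades are displayed as floating-point values. The sketch below separates the two steps, using a smaller two-column DataFrame to keep the example short; the names `mask` and `a_grades` are illustrative.

```python
import pandas as pd

grades = pd.DataFrame(
    {'Wally': [87, 96, 70], 'Eva': [100, 87, 90]},
    index=['Test1', 'Test2', 'Test3'])

mask = grades >= 90        # DataFrame of True/False values, same shape as grades
a_grades = grades[mask]    # cells where mask is False become NaN

print(int(mask.sum().sum()))               # 3 grades satisfy the condition
print(int(a_grades.isna().sum().sum()))    # 3 cells were replaced with NaN
print(a_grades['Wally'].dtype)             # float64, because NaN is a float
```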


### Accessing a Specific DataFrame Cell by Row and Column

In [47]:
grades.at['Test2', 'Eva']

87

In [48]:
grades.iat[0, 2]

94

In [49]:
grades.at['Test2', 'Eva'] = 100

In [50]:
grades.at['Test2', 'Eva']

100

In [51]:
grades.iat[1, 1] = 87

In [52]:
grades.iat[1, 1]

87

### Descriptive Statistics

In [53]:
grades.describe()

           Wally         Eva        Sam       Katie        Bob
count   3.000000    3.000000   3.000000    3.000000   3.000000
mean   84.333333   92.333333  87.000000   87.666667  77.666667
std    13.203535    6.806859   8.888194   10.692677  11.015141
min    70.000000   87.000000  77.000000   81.000000  65.000000
25%    78.500000   88.500000  83.500000   81.500000  74.000000
50%    87.000000   90.000000  90.000000   82.000000  83.000000
75%    91.500000   95.000000  92.000000   91.000000  84.000000
max    96.000000  100.000000  94.000000  100.000000  85.000000


In [54]:
pd.set_option('display.precision', 2)  # full option name works across pandas versions

In [55]:
grades.describe()

       Wally     Eva    Sam   Katie    Bob
count   3.00    3.00   3.00    3.00   3.00
mean   84.33   92.33  87.00   87.67  77.67
std    13.20    6.81   8.89   10.69  11.02
min    70.00   87.00  77.00   81.00  65.00
25%    78.50   88.50  83.50   81.50  74.00
50%    87.00   90.00  90.00   82.00  83.00
75%    91.50   95.00  92.00   91.00  84.00
max    96.00  100.00  94.00  100.00  85.00


In [56]:
grades.mean()

Wally    84.33
Eva      92.33
Sam      87.00
Katie    87.67
Bob      77.67
dtype: float64

### Transposing the DataFrame with the T Attribute

In [57]:
grades.T

       Test1  Test2  Test3
Wally     87     96     70
Eva      100     87     90
Sam       94     77     90
Katie    100     81     82
Bob       83     65     85


In [58]:
grades.T.describe()

        Test1  Test2  Test3
count    5.00   5.00   5.00
mean    92.80  81.20  83.40
std      7.66  11.54   8.23
min     83.00  65.00  70.00
25%     87.00  77.00  82.00
50%     94.00  81.00  85.00
75%    100.00  87.00  90.00
max    100.00  96.00  90.00


In [59]:
grades.T.mean()

Test1    92.8
Test2    81.2
Test3    83.4
dtype: float64

### Sorting Rows by Their Indices

In [60]:
grades.sort_index(ascending=False)

       Wally  Eva  Sam  Katie  Bob
Test3     70   90   90     82   85
Test2     96   87   77     81   65
Test1     87  100   94    100   83


### Sorting by Column Indices

In [61]:
grades.sort_index(axis=1)

       Bob  Eva  Katie  Sam  Wally
Test1   83  100    100   94     87
Test2   65   87     81   77     96
Test3   85   90     82   90     70


### Sorting by Column Values

In [62]:
grades.sort_values(by='Test1', axis=1, ascending=False)

       Eva  Katie  Sam  Wally  Bob
Test1  100    100   94     87   83
Test2   87     81   77     96   65
Test3   90     82   90     70   85


In [63]:
grades.T.sort_values(by='Test1', ascending=False)

       Test1  Test2  Test3
Eva      100     87     90
Katie    100     81     82
Sam       94     77     90
Wally     87     96     70
Bob       83     65     85


In [64]:
grades.loc['Test1'].sort_values(ascending=False)

Katie    100
Eva      100
Sam       94
Wally     87
Bob       83
Name: Test1, dtype: int64

### Copy vs. In-Place Sorting

By default, sort_index and sort_values return a *copy* of the original DataFrame, which could require substantial memory in a big-data application. You can instead sort the DataFrame *in place*, rather than *copying* the data. To do so, pass the keyword argument inplace=True to either sort_index or sort_values.
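
The difference can be sketched as follows, using a smaller two-column DataFrame to keep the example short; the names `sorted_copy` and `result` are illustrative. Note that with inplace=True the method returns None, so its result should not be assigned back to the DataFrame's variable.

```python
import pandas as pd

grades = pd.DataFrame(
    {'Wally': [87, 96, 70], 'Eva': [100, 87, 90]},
    index=['Test1', 'Test2', 'Test3'])

# Default behavior: a new, sorted DataFrame is returned; grades is unchanged.
sorted_copy = grades.sort_index(ascending=False)
print(list(grades.index))        # ['Test1', 'Test2', 'Test3']

# In-place behavior: grades itself is reordered and the call returns None.
result = grades.sort_index(ascending=False, inplace=True)
print(result)                    # None
print(list(grades.index))        # ['Test3', 'Test2', 'Test1']
```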