In [2]:
import pandas as pd
import numpy as np

## The Pandas Series Object

In [3]:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data#

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

As we see in the output, the Series wraps both a sequence of values and a 
sequence of indices, which we can access with the `values` and `index` attributes. The 
`values` are simply a familiar NumPy array:

In [4]:
data.values#

array([0.25, 0.5 , 0.75, 1.  ])

The index is accessed in the same way, through `data.index`:

In [5]:
data.index#

RangeIndex(start=0, stop=4, step=1)

As with a NumPy array, data can be accessed by the associated index using square-bracket notation:

In [6]:
data[1]#

0.5

In [7]:
data[1:3]#

1    0.50
2    0.75
dtype: float64

### Series as Generalized NumPy Array

From what we’ve seen so far, it may look like the Series object is basically interchangeable with a one-dimensional NumPy array. The essential difference is the presence of the index: while the NumPy array has an implicitly defined integer index used to access the values, the Pandas Series has an explicitly defined index associated with the values.

This explicit index definition gives the Series object additional capabilities. For 
example, the index need not be an integer, but can consist of values of any desired 
type. For example, if we wish, we can use strings as an index:

In [8]:
data = pd.Series([0.25, 0.5, 0.75, 1.0], index = ['a', 'b', 'c', 'd'])


In [9]:
data['b']#

0.5

We can even use noncontiguous or nonsequential indices, such as `index=[2, 5, 3, 7]`.
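
As a small sketch of a nonsequential index, the following copies the previous example and swaps the string labels for the integers `2, 5, 3, 7` (values chosen here purely for illustration). With an explicit integer index, square brackets look up the label, not the position:

```python
import pandas as pd

# Same values as before, but with an arbitrary, nonsequential integer index
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])

# Square brackets select by the explicit label 5, not by position 5
value_at_label_5 = data[5]
print(value_at_label_5)
```

Here `data[5]` returns `0.5`, the value stored under the label `5`, even though the Series has only four elements.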

### Series as Specialized Dictionary

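In this way, you can think of a Pandas Series a bit like a specialization of a Python 
dictionary. A dictionary is a structure that maps arbitrary keys to a set of arbitrary
values, and a Series is a structure that maps typed keys to a set of typed values.
We can make the analogy concrete by constructing a Series directly from a Python dictionary: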

In [10]:
population_dict = {'California': 38332521, 
                   'Texas': 26448193, 
                   'New York': 19651127, 
                   'Florida': 19552860,
                   'Illinois': 12882135}

In [11]:
population = pd.Series(population_dict)#
population#

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

By default, a Series is created with its index drawn from the dictionary keys, in the order in which they appear in the dictionary.

From here, typical dictionary-style item access can be performed:

In [12]:
population['California']#

38332521

Unlike a dictionary, though, the Series also supports array-style operations such as slicing:

In [13]:
population['California':'Illinois']#

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

### More Series Examples

Now that we know a little more about Series objects, let's practice making a few more.

In [14]:
pd.Series(5, index=[100,200,300])#

100    5
200    5
300    5
dtype: int64

In [1]:
pd.Series({2:'a', 1:'b', 3:'c'})#

2    a
1    b
3    c
dtype: object
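
When a dictionary is passed together with an explicit `index`, only the listed keys are kept, in the order given. A minimal sketch, reusing the dictionary from the last cell (the name `subset` is just for illustration):

```python
import pandas as pd

# Only the keys named in `index` are kept, in the order requested
subset = pd.Series({2: 'a', 1: 'b', 3: 'c'}, index=[3, 2])
print(subset)
```

The result contains the entries for keys `3` and `2` only; key `1` is dropped because it is not listed in `index`.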

## The Pandas DataFrame Object

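If a Series is an analog of a one-dimensional array with flexible indices, a DataFrame is an analog of a two-dimensional array with both flexible row indices and flexible column names. To demonstrate this, let’s first construct a new Series listing the area of each of the
five states discussed in the previous section: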

In [112]:
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
 'Florida': 170312, 'Illinois': 149995}


In [18]:
area = pd.Series(area_dict)#
area#

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
dtype: int64

Now that we have this along with the population Series from before, we can use a
dictionary to construct a single two-dimensional object containing this information:


In [19]:
states = pd.DataFrame({'population':population,
                       'area':area})#
states#

            population    area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995


Like the Series object, the DataFrame has an index attribute that gives access to the 
index labels:

In [20]:
states.index#

Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')

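Additionally, the DataFrame has a `columns` attribute, which is an Index object holding the column labels:

In [21]: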
states.columns#

Index(['population', 'area'], dtype='object')
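
Because a DataFrame is built from a dictionary of Series, it can also be read back like one: indexing with a column name returns that column as a Series. The sketch below rebuilds a two-state version of `states` so that it runs on its own; `area_column` is an illustrative name.

```python
import pandas as pd

# A reduced version of the `states` DataFrame, with two rows only
area = pd.Series({'California': 423967, 'Texas': 695662})
population = pd.Series({'California': 38332521, 'Texas': 26448193})
states = pd.DataFrame({'population': population, 'area': area})

# Indexing with a column name returns that column as a Series
area_column = states['area']
print(type(area_column).__name__)
print(area_column['Texas'])
```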

## Data Selection in Series

In [22]:
data = pd.Series([0.25, 0.5, 0.75, 1.0], index = ['a', 'b', 'c', 'd'])
data#

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [23]:
#slicing by explicit index
data['a':'c']#

a    0.25
b    0.50
c    0.75
dtype: float64

In [24]:
#slicing by implicit integer index
data[0:2]#

a    0.25
b    0.50
dtype: float64

In [25]:
# masking
data[(data>0.3)&(data<0.8)]#

b    0.50
c    0.75
dtype: float64

In [26]:
data['e'] = 1.25#
data#
data[['a','e']]#

a    0.25
e    1.25
dtype: float64
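
The difference between explicit and implicit indexing matters most when the index itself is made of integers. The following sketch uses a hypothetical Series `s` with labels `1, 3, 5` and the `loc` (label-based) and `iloc` (position-based) indexers to keep the two apart:

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

by_label = s.loc[1]            # explicit label 1
by_position = s.iloc[1]        # implicit position 1 (second element)
label_slice = s.loc[1:3]       # labels 1 through 3, endpoint included
position_slice = s.iloc[1:3]   # positions 1 and 2, endpoint excluded

print(by_label, by_position)
print(list(label_slice), list(position_slice))
```

Label-based slices include the final label, while position-based slices follow the usual Python convention and exclude the final position.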

## Data Selection in DataFrame

Recall that a DataFrame acts in many ways like a two-dimensional or structured array, and in other ways like a dictionary.

Understanding this is helpful in exploring data selection within this structure.

In [27]:
area = pd.Series({'California': 423967, 
                  'Texas': 695662,
                  'New York': 141297, 
                  'Florida': 170312,
                  'Illinois': 149995})

pop = pd.Series({'California': 38332521, 
                 'Texas': 26448193,
                 'New York': 19651127, 
                 'Florida': 19552860,
                 'Illinois': 12882135})

data = pd.DataFrame({'area':area, 'pop':pop})#
data#

              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127
Florida     170312  19552860
Illinois    149995  12882135


The individual Series that make up the columns of the DataFrame can be accessed via dictionary-style indexing of the column name:

In [28]:
data['area']#

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

Equivalently, when the column name is a string, we can use attribute-style access:

In [29]:
data.area#

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

We can confirm that the two forms refer to the same object:

In [30]:
data.area is data['area']#

True

**Though this is a useful shorthand, keep in mind that it does not work in all cases!**

**If the column names are not strings, or if they conflict with methods of the DataFrame, attribute-style access is not possible.** 

**For example, the DataFrame has a `pop()` method, so `data.pop` points to this method rather than to the "pop" column:**

In [31]:
data.pop is data['pop']#

False
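
A quick way to see the conflict is to check what each form returns. This sketch uses a small stand-in DataFrame (`df`, with made-up values) rather than the state data:

```python
import pandas as pd

df = pd.DataFrame({'pop': [1, 2], 'area': [3, 4]})

# Attribute access finds the DataFrame.pop method, not the column
attribute_is_method = callable(df.pop)

# Bracket access always reaches the column
column = df['pop']

print(attribute_is_method)
print(type(column).__name__)
```

For this reason, bracket-style access is the safer choice whenever a column name might collide with a DataFrame method or attribute.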

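As with a Series, dictionary-style syntax can also be used to modify the object, in this case adding a new column:

In [32]: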
data['density'] = data['pop'] / data['area']#
data#


              area       pop     density
California  423967  38332521   90.413926
Texas       695662  26448193   38.018740
New York    141297  19651127  139.076746
Florida     170312  19552860  114.806121
Illinois    149995  12882135   85.883763


**Earlier, we selected data from a Series, which is a one-dimensional object. A DataFrame is two-dimensional, so selection must address both rows and columns.**  

**Using the `iloc` indexer, we can index the underlying data as if it were a simple NumPy array (using implicit, Python-style integer positions), while the DataFrame index and column labels are maintained in the result:**

In [33]:
data#

              area       pop     density
California  423967  38332521   90.413926
Texas       695662  26448193   38.018740
New York    141297  19651127  139.076746
Florida     170312  19552860  114.806121
Illinois    149995  12882135   85.883763


In [34]:
data.iloc[:3, :2]#

              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127


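**The `loc` indexer instead selects by the explicit index and column labels. After `reset_index()` the row labels are integers, so `loc` can combine integer row labels with column names. Note that label-based slices include the endpoint, so `:2` returns rows 0, 1, and 2:**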

In [35]:
data_copy = data.reset_index()#
data_copy#
data_copy.loc[:2, 'pop':]#

        pop     density
0  38332521   90.413926
1  26448193   38.018740
2  19651127  139.076746
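
`loc` works just as well with the original string labels, without resetting the index. The sketch below selects the same cells twice, once by position with `iloc` and once by label with `loc`, on a three-row version of the state data:

```python
import pandas as pd

data = pd.DataFrame({'area': [423967, 695662, 141297],
                     'pop': [38332521, 26448193, 19651127]},
                    index=['California', 'Texas', 'New York'])

by_position = data.iloc[:2, :1]          # first two rows, first column
by_label = data.loc[:'Texas', :'area']   # same cells, selected by label

print(by_position.equals(by_label))
```

The `iloc` slice stops before position 2, while the `loc` slice includes the `'Texas'` and `'area'` endpoints, so both return the same 2-by-1 block.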


**Any of the familiar NumPy-style data access patterns can be used within these indexers. For example, in the `loc` indexer we can combine masking and fancy indexing as in the following:**


In [36]:
data.loc[data.density >100, ['pop','density']]#

               pop     density
New York  19651127  139.076746
Florida   19552860  114.806121


**Any of these indexing conventions may also be used to set or modify values, in the same way you would assign to elements of a NumPy array:**

In [37]:
data#

              area       pop     density
California  423967  38332521   90.413926
Texas       695662  26448193   38.018740
New York    141297  19651127  139.076746
Florida     170312  19552860  114.806121
Illinois    149995  12882135   85.883763


In [38]:
data.iloc[0,2] = 90#
data#

              area       pop     density
California  423967  38332521   90.000000
Texas       695662  26448193   38.018740
New York    141297  19651127  139.076746
Florida     170312  19552860  114.806121
Illinois    149995  12882135   85.883763
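
Assignment works through `loc` as well, either by label or through a Boolean mask. A minimal sketch on a two-row version of the data (the replacement values `38.0` and `90.0` are arbitrary):

```python
import pandas as pd

data = pd.DataFrame({'area': [423967, 695662],
                     'pop': [38332521, 26448193]},
                    index=['California', 'Texas'])
data['density'] = data['pop'] / data['area']

# Label-based assignment: row 'Texas', column 'density'
data.loc['Texas', 'density'] = 38.0

# Mask-based assignment: every row whose density exceeds 50
data.loc[data['density'] > 50, 'density'] = 90.0

print(data)
```

Passing the row selector and the column name to a single `loc` call keeps the assignment on the original DataFrame rather than on an intermediate copy.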


There are a couple of extra indexing conventions that might seem unusual, but can be very useful in practice. First, while indexing refers to columns, slicing refers to rows:

In [40]:
data['Florida':'Illinois']#

            area       pop     density
Florida   170312  19552860  114.806121
Illinois  149995  12882135   85.883763


Similarly, direct masking operations are interpreted row-wise rather than column-wise:

In [41]:
data[data.density > 100] #

            area       pop     density
New York  141297  19651127  139.076746
Florida   170312  19552860  114.806121


## Missing Data in Pandas

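Pandas uses two sentinel values to mark missing data. The first is `None`, a built-in Python object; because it is a Python object, it can only be stored in arrays with `dtype=object`. The second is `NaN` (Not a Number), a special floating-point value that Pandas uses as its default marker for missing numerical data. Both can appear in the same Series: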

In [53]:
data1 = pd.Series([1, np.nan, 'hello', None])
data1 #

0        1
1      NaN
2    hello
3     None
dtype: object

Pandas data structures have two useful methods for detecting null data: `isnull()` and `notnull()`. Either one will return a Boolean mask over the data. For example:

In [55]:
data1.isnull()#

0    False
1     True
2    False
3     True
dtype: bool

In [56]:
data1[data1.notnull()] #

0        1
2    hello
dtype: object
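
Because these masks are Boolean, they can also be summed to count the missing and present entries. A short sketch using the same Series (`n_missing` and `n_present` are illustrative names):

```python
import numpy as np
import pandas as pd

data1 = pd.Series([1, np.nan, 'hello', None])

# True counts as 1 when summed, so this counts the null entries
n_missing = data1.isnull().sum()
n_present = data1.notnull().sum()

print(n_missing, n_present)
```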

### Dropping Null Values

In addition to the masking used before, there are the convenience methods, `dropna()` (which removes NA values) and `fillna()` (which fills in NA values). For a Series, the result is straightforward:

In [58]:
data1.dropna() #

0        1
2    hello
dtype: object

For a DataFrame, there are more options. Consider the following DataFrame:

In [59]:
df = pd.DataFrame([[1, np.nan, 2],
                   [2, 3, 5],
                   [np.nan, 4, 6]])
df #

     0    1  2
0  1.0  NaN  2
1  2.0  3.0  5
2  NaN  4.0  6


We cannot drop single values from a DataFrame; we can only drop full rows or full columns. Depending on the application, you might want one or the other, so `dropna()` gives a number of options for a DataFrame.

By default, `dropna()` will drop all rows in which any null value is present:

In [60]:
df.dropna()#

     0    1  2
1  2.0  3.0  5


Alternatively, you can drop NA values along a different axis; `axis=1` (or, equivalently, `axis='columns'`) drops all columns containing a null value:

In [71]:
df.dropna(axis='columns') #

   2
0  2
1  5
2  6


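Both options drop some good data as well; you might rather be interested in dropping rows or columns with all NA values, or a majority of NA values. You can specify this through the `how` and `thresh` parameters. The default is `how='any'`, which drops any row or column containing a null value; `how='all'` drops only rows or columns that are entirely null. Let's add a new column of all NA values: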

In [76]:
df[3] =np.nan #
df #

     0    1  2   3
0  1.0  NaN  2 NaN
1  2.0  3.0  5 NaN
2  NaN  4.0  6 NaN


In [80]:
df.dropna(axis='columns', how='all')#

     0    1  2
0  1.0  NaN  2
1  2.0  3.0  5
2  NaN  4.0  6
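
For finer-grained control, the `thresh` parameter of `dropna()` sets the minimum number of non-null values a row (or column) must have in order to be kept. A sketch on the same DataFrame, with a threshold of 3 chosen for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, 2],
                   [2, 3, 5],
                   [np.nan, 4, 6]])
df[3] = np.nan

# Keep only the rows that have at least 3 non-null values
kept = df.dropna(thresh=3)
print(kept)
```

Rows 0 and 2 each have only two non-null values, so only row 1 survives.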


### Filling Null Values

Sometimes rather than dropping NA values, you’d rather replace them with a valid value. This value might be a single number like zero, or it might be some sort of interpolation from the good values. For this, Pandas provides the `fillna()` method, which returns a copy of the data with the null values replaced.


In [83]:
data3 = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
data3 #

a    1.0
b    NaN
c    2.0
d    NaN
e    3.0
dtype: float64

We can fill NA entries with a single value, such as zero:

In [84]:
data3.fillna(0)#

a    1.0
b    0.0
c    2.0
d    0.0
e    3.0
dtype: float64

We can specify a forward fill with `ffill()`, which propagates the previous non-null value forward:

In [87]:
data3.ffill()#

a    1.0
b    1.0
c    2.0
d    2.0
e    3.0
dtype: float64

A back fill with `bfill()` propagates the next non-null value backward:

In [86]:
data3.bfill()#

a    1.0
b    2.0
c    2.0
d    3.0
e    3.0
dtype: float64
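
The fill value can also be computed from the valid entries. As one simple possibility (not the only way to interpolate), this sketch replaces each null with the mean of the non-null values; `filled` is an illustrative name:

```python
import numpy as np
import pandas as pd

data3 = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))

# mean() skips null values by default, so this is the mean of 1, 2 and 3
filled = data3.fillna(data3.mean())
print(filled)
```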

For DataFrames, the options are similar, but we can also specify an `axis` along which the fills take place.

In [88]:
df#

     0    1  2   3
0  1.0  NaN  2 NaN
1  2.0  3.0  5 NaN
2  NaN  4.0  6 NaN


In [93]:
df.ffill(axis=1)#

     0    1    2    3
0  1.0  1.0  2.0  2.0
1  2.0  3.0  5.0  5.0
2  NaN  4.0  6.0  6.0


## Combining Datasets: Concatenation

Here we’ll take a look at simple concatenation of Series and DataFrames with the `pd.concat` function; later we’ll dive into more sophisticated methods, such as the merges and joins that Pandas also provides.

For convenience, we first define a helper function that quickly builds a small DataFrame whose entries encode their column and row labels:

In [94]:
def make_df(cols, ind):
    """Quickly make a DataFrame"""
    data = {c: [str(c) + str(i) for i in ind]
            for c in cols}
    return pd.DataFrame(data, ind)

# example DataFrame
make_df('ABC', range(3))

    A   B   C
0  A0  B0  C0
1  A1  B1  C1
2  A2  B2  C2


In [98]:
df1 = make_df('AB', range(1,3))
df2 = make_df('AB', range(3,5))
print(df1); print(df2);

    A   B
1  A1  B1
2  A2  B2
    A   B
3  A3  B3
4  A4  B4


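`pd.concat` takes a list of objects to combine. Passing the two DataFrames stacks them one above the other:

In [99]: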
print(pd.concat([df1, df2]))#

    A   B
1  A1  B1
2  A2  B2
3  A3  B3
4  A4  B4
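
Note that `pd.concat` preserves the original index labels, so inputs that share labels produce a result with repeated ones. If the index carries no meaning, `ignore_index=True` builds a fresh integer index instead. A sketch with two small stand-in DataFrames (`x` and `y`) that both use the default index 0, 1:

```python
import pandas as pd

x = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']})
y = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']})

stacked = pd.concat([x, y])                         # labels are kept as-is
renumbered = pd.concat([x, y], ignore_index=True)   # labels are rebuilt

print(list(stacked.index))
print(list(renumbered.index))
```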


By default, the concatenation takes place row-wise within the DataFrame. To concatenate along the columns instead, pass `axis=1`, as in the following example:

In [101]:
df3 = make_df('AB', range(0,2))
df4 = make_df('CD', range(0,2))
print(df3); print(df4);

    A   B
0  A0  B0
1  A1  B1
    C   D
0  C0  D0
1  C1  D1


In [103]:
print(pd.concat([df3, df4], axis=1)) #

    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1


### Concatenation with Joins

**In the simple examples we just looked at, we were mainly concatenating DataFrames with shared column names. In practice, data from different sources might have different sets of column names, and `pd.concat` offers several options in this case.**

**Consider the concatenation of the following two DataFrames, which have some (but not all!) columns in common:**

In [108]:
df5 = make_df('ABC', [1, 2])
df6 = make_df('BCD', [3, 4])
print(df5); print(df6); print(pd.concat([df5, df6]))

    A   B   C
1  A1  B1  C1
2  A2  B2  C2
    B   C   D
3  B3  C3  D3
4  B4  C4  D4
     A   B   C    D
1   A1  B1  C1  NaN
2   A2  B2  C2  NaN
3  NaN  B3  C3   D3
4  NaN  B4  C4   D4


**By default, the entries for which no data is available are filled with NA values.** 

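**To change this, we can use the `join` parameter of `pd.concat`. The default is `join='outer'`, which takes the union of the input columns; `join='inner'` keeps only the columns common to all inputs:**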

In [110]:
print(df5); print(df6);
print(pd.concat([df5, df6], join='inner'))#

    A   B   C
1  A1  B1  C1
2  A2  B2  C2
    B   C   D
3  B3  C3  D3
4  B4  C4  D4
    B   C
1  B1  C1
2  B2  C2
3  B3  C3
4  B4  C4
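
The effect of `join` is easiest to see by comparing the resulting column sets directly. This sketch builds one-row versions of `df5` and `df6` by hand, so it does not depend on `make_df`:

```python
import pandas as pd

df5 = pd.DataFrame({'A': ['A1'], 'B': ['B1'], 'C': ['C1']}, index=[1])
df6 = pd.DataFrame({'B': ['B3'], 'C': ['C3'], 'D': ['D3']}, index=[3])

outer = pd.concat([df5, df6])                 # join='outer' is the default
inner = pd.concat([df5, df6], join='inner')   # shared columns only

print(list(outer.columns))
print(list(inner.columns))
```

The outer join keeps all four columns and fills the gaps with NA values, while the inner join keeps only `B` and `C`.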
