In [6]:
import numpy as np

In [55]:
import pandas as pd

Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic way to create a Series is to call:

s = pd.Series(data, index=index)

Data can be many different things:  
 * a Python dict  
 * an ndarray  
 * a scalar value
 
 The passed index is a list of axis labels.

## From ndarray

If data is an ndarray, the index must be the same length as the data. If no index is passed, one will be created with the values [0, ..., len(data) - 1].

In [57]:
# Here we specify the index
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
s
    

a    1.450278
b   -1.383915
c    0.496959
d   -0.651165
e    0.276804
dtype: float64

In [58]:
s.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [59]:
#Here we let Pandas create a default index
pd.Series(np.random.randn(5))

0    0.450184
1    0.813035
2   -2.118474
3    1.733930
4   -0.030232
dtype: float64

#### **_Pandas supports non-unique index values. If an operation that does not support duplicate index values is attempted, an exception will be raised at that time._**
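
A minimal sketch of that behavior. The Series `dup` and its values are made up for illustration, and `reindex` is used here only as one example of an operation that rejects duplicate labels.

```python
import pandas as pd

# Hypothetical data: the label 'a' appears twice.
dup = pd.Series([1, 2, 3], index=['a', 'a', 'b'])

# Selecting a duplicated label returns every matching row as a Series.
both_a = dup['a']

# reindex needs unique labels, so the error appears only when it is called.
try:
    dup.reindex(['a', 'b'])
    reindex_failed = False
except ValueError:
    reindex_failed = True
```

Creating `dup` succeeds; the exception is deferred until an operation actually needs unique labels.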

## From dict

*Series* can be created from dicts:

In [60]:
d = {'b': 1, 'a': 0, 'c': 2}

In [61]:
pd.Series(d)

b    1
a    0
c    2
dtype: int64

When data is a dict, and an index is not passed, the *Series* index will be ordered by the dict's insertion order. There is no sorting if you have Python version >=3.6 and Pandas version >=0.23.
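
When an index is passed together with a dict, the values are looked up by label instead. The sketch below uses a hypothetical dict `scores` and a label `'d'` that is deliberately absent from it.

```python
import pandas as pd

# Hypothetical dict and labels, chosen only for illustration.
scores = {'b': 1, 'a': 0, 'c': 2}

# With an explicit index, values are pulled out by label in that order.
# 'd' is not a key in the dict, so its value is NaN and the dtype becomes float64.
picked = pd.Series(scores, index=['b', 'c', 'd', 'a'])
```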

## From scalar value

If **data** is a scalar value, an index must be provided. The value will be repeated to match the length of the index.

In [27]:
pd.Series(5., index=['a', 'b', 'c', 'd', 'e'])

a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
dtype: float64

## Series is ndarray-like

Series acts very similarly to an **ndarray** from **NumPy** and is a valid argument to most NumPy functions. Operations such as slicing will also slice the index.

In [62]:
s.iloc[0]  # position-based access; recent Pandas versions deprecate plain s[0] on a labeled index

1.4502781617067275

In [63]:
s[:3]

a    1.450278
b   -1.383915
c    0.496959
dtype: float64

In [64]:
s[s > s.median()]

a    1.450278
c    0.496959
dtype: float64

In [65]:
s.iloc[[4, 3, 1]]  # position-based selection of several rows

e    0.276804
d   -0.651165
b   -1.383915
dtype: float64

In [66]:
np.exp(s)

a    4.264301
b    0.250595
c    1.643715
d    0.521438
e    1.318908
dtype: float64

**We will address indexing and slicing in the following tutorials.**

Each Series has a **dtype**:

In [68]:
s.dtype

dtype('float64')

While **Series** is ndarray-like, if you need an actual **ndarray**, then use **Series.to_numpy()**

In [69]:
s.to_numpy()

array([ 1.45027816, -1.38391538,  0.49695898, -0.65116549,  0.27680378])

## Series is **dict**-like

A **Series** is like a fixed-size dict in which you can get and set values by an index label.

In [70]:
s['a']

1.4502781617067275

In [71]:
s['e'] = 12.

In [72]:
s

a     1.450278
b    -1.383915
c     0.496959
d    -0.651165
e    12.000000
dtype: float64

In [73]:
'e' in s

True

In [74]:
'f' in s

False

If a label is not contained, an exception is raised:

In [75]:
s['f']

KeyError: 'f'
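
To avoid the exception, `Series.get` returns `None` or a supplied default for a missing label. A short sketch with a hypothetical Series `prices`:

```python
import numpy as np
import pandas as pd

prices = pd.Series([1.5, 2.5], index=['a', 'b'])  # hypothetical data

missing = prices.get('f')            # label absent: returns None, no exception
fallback = prices.get('f', np.nan)   # label absent: returns the supplied default
present = prices.get('a')            # label present: returns the stored value
```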

## Vectorized operations
When working with raw **NumPy** arrays, looping through value-by-value is usually not necessary. The same is true when working with **Series** in Pandas. **Series** can also be passed into most NumPy methods expecting an ndarray.

In [76]:
s + s

a     2.900556
b    -2.767831
c     0.993918
d    -1.302331
e    24.000000
dtype: float64

In [77]:
s * 2

a     2.900556
b    -2.767831
c     0.993918
d    -1.302331
e    24.000000
dtype: float64

In [78]:
np.exp(s)

a         4.264301
b         0.250595
c         1.643715
d         0.521438
e    162754.791419
dtype: float64

A key difference between **Series** and **ndarray** is that operations between **Series** automatically align data based on the label. Thus you can write computations without considering whether the **Series** involved have the same labels.

In [80]:
s1 = s[1:]

In [81]:
s2 = s[:-1]

In [82]:
s1 + s2

a         NaN
b   -2.767831
c    0.993918
d   -1.302331
e         NaN
dtype: float64

**The result of an operation between unaligned *Series* will have the union of the indexes involved. If a label is not found in one *Series* or the other, the result for that label will be marked as missing (*NaN*).**
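
If the NaN entries are unwanted, two common options are dropping them or supplying a fill value. This is a small sketch with hypothetical Series `left` and `right`; `dropna` and `add(..., fill_value=...)` are standard Series methods.

```python
import pandas as pd

left = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])      # hypothetical data
right = pd.Series([10.0, 20.0, 30.0], index=['b', 'c', 'd'])  # hypothetical data

aligned = left + right                  # 'a' and 'd' exist on one side only: NaN
overlap = aligned.dropna()              # keep only labels present in both
filled = left.add(right, fill_value=0)  # treat the missing side as 0 instead
```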

## Name attribute

**Series** can also have a **name** attribute.

In [83]:
s = pd.Series(np.random.randn(5), name='something')

In [84]:
s

0   -0.027344
1    0.354525
2   -0.347193
3    1.433865
4    0.858176
Name: something, dtype: float64

In [85]:
s.name

'something'

    
      
  
  
# **Pandas DataFrame**
In this tutorial, we will get familiar with the data type **DataFrame**.
  
## DataFrame
  
**DataFrame** is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a **dict** of **Series** objects. It is generally the most commonly used Pandas object. Like **Series**, **DataFrame** accepts many different kinds of input:
* dict of 1D ndarrays, lists, dicts, or Series
* 2D NumPy ndarray
* Series
* DataFrame
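
As a quick illustration of the 2D ndarray case from the list above, here is a minimal sketch. The array contents and the labels `r0`..`r2`, `x`, `y` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical 3x2 array; the row and column labels are made up.
arr = np.arange(6).reshape(3, 2)
frame = pd.DataFrame(arr, index=['r0', 'r1', 'r2'], columns=['x', 'y'])

# Without labels, both axes default to 0..n-1.
unlabeled = pd.DataFrame(arr)
```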
  
## From dict of Series or dicts
The resulting index will be the union of the indexes of the various Series.

In [137]:
d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}


In [138]:
df = pd.DataFrame(d)

In [139]:
df

   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0


In [140]:
pd.DataFrame(d, index=['d', 'b', 'a'])

   one  two
d  NaN  4.0
b  2.0  2.0
a  1.0  1.0


In [141]:
pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])

   two three
d  4.0   NaN
b  2.0   NaN
a  1.0   NaN


When data is a dict and *columns* is not passed, the DataFrame columns will be ordered by the dict's insertion order. There is no sorting if you have Python version >=3.6 and Pandas version >=0.23.

**If you pass an index and/or columns, you are guaranteeing the index and/or columns of the resulting DataFrame. Thus a dict of Series plus a specific index and/or columns will discard all data not matching the passed index and/or columns. See the last example above with an empty column labeled *three*.**
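
The section heading also mentions a dict of dicts. A minimal sketch of that input, using a hypothetical nested dict `nested`: the outer keys become columns and the inner keys become row labels.

```python
import pandas as pd

# Hypothetical nested dict: outer keys -> columns, inner keys -> row labels.
nested = {'one': {'a': 1.0, 'b': 2.0},
          'two': {'a': 1.0, 'b': 2.0, 'd': 4.0}}

# The index is the union of the inner keys; gaps are filled with NaN.
from_dicts = pd.DataFrame(nested)
```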
  
## From dict of ndarrays / lists

The ndarrays must all be the same length. If an index is passed, it must also be the same length as the arrays. If no index is passed, the result will be **range(n)**, where n is the array length.

In [142]:
d = {'one': [1., 2., 3., 4.],
     'two': [4., 3., 2., 1.]}

In [143]:
pd.DataFrame(d)

   one  two
0  1.0  4.0
1  2.0  3.0
2  3.0  2.0
3  4.0  1.0


In [144]:
pd.DataFrame(d, index=['a', 'b', 'c', 'd'])

   one  two
a  1.0  4.0
b  2.0  3.0
c  3.0  2.0
d  4.0  1.0
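
The equal-length requirement can be seen by violating it. This sketch uses a hypothetical dict `ragged` whose lists have different lengths; the exact error text may vary between Pandas versions, so it is captured and not quoted.

```python
import pandas as pd

# Hypothetical columns of unequal length (4 values vs 3).
ragged = {'one': [1., 2., 3., 4.], 'two': [4., 3., 2.]}

try:
    pd.DataFrame(ragged)
    error_message = None
except ValueError as exc:
    # Pandas refuses to build the frame and reports the length mismatch.
    error_message = str(exc)
```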


## From a Series

The result will be a **DataFrame** with the same index as the input **Series**, and with one column whose name is the original name of the **Series** (only if no other column name provided).

In [145]:
pd.DataFrame(pd.Series(np.random.randn(5), name='something'))

   something
0  -0.573452
1   0.394882
2   0.624802
3  -0.729485
4  -0.158351
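
An equivalent route is `Series.to_frame`, which also accepts an explicit column name. A short sketch with a hypothetical Series `named`:

```python
import pandas as pd

named = pd.Series([0.1, 0.2, 0.3], name='something')  # hypothetical data

as_frame = named.to_frame()             # column takes the Series name
renamed = named.to_frame(name='value')  # an explicit name takes precedence
```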


## Column selection, addition, deletion

You can treat a **DataFrame** semantically like a dict of like-indexed **Series** objects. Getting, setting and deleting columns works with the same syntax as the analogous dict operations:

In [146]:
df['one']

a    1.0
b    2.0
c    3.0
d    NaN
Name: one, dtype: float64

In [147]:
df['three'] = df['one'] * df['two']

In [148]:
df['flag'] = df['one'] > 2

In [149]:
df

   one  two  three   flag
a  1.0  1.0    1.0  False
b  2.0  2.0    4.0  False
c  3.0  3.0    9.0   True
d  NaN  4.0    NaN  False


Columns can be deleted like with a dict:

In [150]:
del df['two']
del df['three']

When inserting a scalar value, it will naturally be propagated to fill the column:

In [151]:
df['foo'] = 'bar'

In [152]:
df

   one   flag  foo
a  1.0  False  bar
b  2.0  False  bar
c  3.0   True  bar
d  NaN  False  bar


When inserting a **Series** that does not have the same index as the **DataFrame**, it will be conformed to the **DataFrame's** index.

In [153]:
df['one_trunc'] = df['one'][:2]

In [154]:
df

   one   flag  foo  one_trunc
a  1.0  False  bar        1.0
b  2.0  False  bar        2.0
c  3.0   True  bar        NaN
d  NaN  False  bar        NaN
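
Assignment with `df[...] = ...` appends the new column at the end. For control over position, or to remove a column while keeping its values, `DataFrame.insert` and `DataFrame.pop` can be used. The sketch below uses a hypothetical frame `frame` so that it does not modify `df`.

```python
import pandas as pd

frame = pd.DataFrame({'one': [1., 2., 3.], 'flag': [False, False, True]},
                     index=['a', 'b', 'c'])  # hypothetical data

# insert(loc, column, value) places the new column at position loc.
frame.insert(1, 'double', frame['one'] * 2)

# pop removes a column and returns it as a Series.
flag = frame.pop('flag')
```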


Operations with scalars are just as you would expect:

In [189]:
# Create a datetime index
datetimeindex = pd.date_range(start='2000-01-01', periods=3, freq='D')
# periods = number of timestamps to generate (the length of the index)
# freq = spacing between timestamps: 'D' is daily, 'h' is hourly ('H' in older Pandas)
datetimeindex

DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-03'], dtype='datetime64[ns]', freq='D')

df = pd.DataFrame(np.random.randn(8, 3), index=pd.date_range('2000-01-01', periods=8, freq='D'), columns=list('ABC'))

In [199]:
df * 5 + 2

                    A          B          C
2000-01-01   0.413814   3.374919   1.529458
2000-01-02   4.097665   1.369001   2.086235
2000-01-03   3.627526  -0.741435   5.203655
2000-01-04   5.787445   1.746354  -6.636818
2000-01-05  14.288987   8.675735  -8.700051
2000-01-06   1.082654  10.019433   0.330074
2000-01-07   0.999176  -1.177928  -2.075043
2000-01-08   -8.32807  -0.928794   1.576165


In [200]:
1/df

                    A           B           C
2000-01-01  -3.152216    3.636577  -10.626053
2000-01-02   2.383602   -7.923939    57.98088
2000-01-03   3.072147   -1.823862    1.560717
2000-01-04   1.320151  -19.712541   -0.578917
2000-01-05   0.406868    0.748981   -0.467288
2000-01-06  -5.450508    0.623485   -2.994146
2000-01-07  -4.995885   -1.573352   -1.226981
2000-01-08  -0.484118   -1.707187  -11.797057


In [201]:
df ** 4

                    A         B             C
2000-01-01   0.010128  0.005718  7.843539e-05
2000-01-02   0.030979  0.000254  8.848319e-08
2000-01-03   0.011226  0.090372       0.16854
2000-01-04   0.329235     7e-06      8.902973
2000-01-05  36.490874  3.177724      20.97313
2000-01-06   0.001133  6.617512    0.01244252
2000-01-07   0.001605  0.163191     0.4412135
2000-01-08  18.205255  0.117727  5.163037e-05


Boolean operators are vectorized as well:

In [204]:
df1 = pd.DataFrame({'a':[1,0,1], 'b': [0,1,1]}, dtype=bool)

In [205]:
df2 = pd.DataFrame({'a': [0, 1, 1], 'b': [1, 1, 0]}, dtype=bool)

In [206]:
df1 & df2

       a      b
0  False  False
1  False   True
2   True  False


In [207]:
df1 | df2

      a     b
0  True  True
1  True  True
2  True  True


In [208]:
df1 ^ df2

       a      b
0   True   True
1   True  False
2  False   True


In [209]:
-df1

       a      b
0  False   True
1   True  False
2  False  False
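
The `~` operator is the more common spelling of element-wise logical NOT on boolean data. A minimal sketch with a hypothetical boolean frame `mask`:

```python
import pandas as pd

mask = pd.DataFrame({'a': [1, 0, 1], 'b': [0, 1, 1]}, dtype=bool)  # hypothetical data

inverted = ~mask  # element-wise logical NOT
```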


  
  
  
# Pandas dtypes

We will learn about the data types that Pandas uses in Series and DataFrames. Understanding them is important for knowing what we can do with our data in Pandas.

## dtypes

For the most part, **Pandas** uses **NumPy** arrays and dtypes for **Series** or individual columns of a **DataFrame**. NumPy provides support for float, int, bool, timedelta64[ns] and datetime64[ns].

However, **NumPy** has no dedicated types for data such as categoricals, integers or booleans with missing values, or text, so **Pandas** extends NumPy's type system in a few places. The following table lists the most common Pandas extension types:

| **Kind of Data** | **Data Type** | **String Aliases** |
|---|---|---|
| Categorical | CategoricalDtype | 'category' |
| Nullable integer | Int64Dtype, ... | 'Int8', 'Int16', 'Int32', 'Int64', 'UInt8', 'UInt16', 'UInt32', 'UInt64' |
| Strings | StringDtype | 'string' |
| Boolean (with NA) | BooleanDtype | 'boolean' |
| Any | object | 'object' |
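
The string aliases in the table can be passed as `dtype` when constructing a Series. The sketch below uses made-up values and assumes Pandas 1.0 or later, where these extension types are available.

```python
import pandas as pd

cat = pd.Series(['x', 'y', 'x'], dtype='category')
nullable_int = pd.Series([1, None, 3], dtype='Int64')
text = pd.Series(['a', None, 'c'], dtype='string')
flags = pd.Series([True, None, False], dtype='boolean')

# The missing entry does not force the integer column to float.
n_missing = int(nullable_int.isna().sum())
```

Note the capital `I` in `'Int64'`: the lowercase `'int64'` refers to the plain NumPy integer type, which cannot hold missing values.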

A convenient **dtypes** attribute for DataFrame returns a Series with the data type of each column.


In [215]:
dft = pd.DataFrame({'A': np.random.rand(3),
                    'B': 1,
                    'C': 'foo',
                    'D': pd.Timestamp('20010102'),
                    'E': pd.Series([1.0] * 3).astype('float32'),
                    'F': False,
                    'G': pd.Series([1] * 3, dtype='int8')})

In [216]:
dft

          A  B    C           D    E      F  G
0  0.026454  1  foo  2001-01-02  1.0  False  1
1  0.105976  1  foo  2001-01-02  1.0  False  1
2  0.855205  1  foo  2001-01-02  1.0  False  1


In [217]:
dft.dtypes

A           float64
B             int64
C            object
D    datetime64[ns]
E           float32
F              bool
G              int8
dtype: object

**Series** has the same attribute as well:

In [218]:
dft['A'].dtype

dtype('float64')

Pandas has two ways of storing strings.  
1. **object dtype**, which can hold any Python object, including strings.  
2. **StringDtype**, which is dedicated to strings (introduced in Pandas 1.0.0, released in 2020).

It is recommended to use **StringDtype** for strings because an object column can hide values of any type, not just strings.

In [220]:
# string data forces an ''object'' dtype
pd.Series([1, 2, 3, 6., 'foo'])

0      1
1      2
2      3
3      6
4    foo
dtype: object
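
The difference between the two storage options can be sketched as follows. The Series `mixed` and `words` are hypothetical; `dtype=object` is passed explicitly so the example does not depend on how a given Pandas version infers string data.

```python
import pandas as pd

# An object column keeps each value as the Python object it was given.
mixed = pd.Series([1, 2.5, 'foo'], dtype=object)  # hypothetical data
kinds = mixed.map(type).tolist()

# A 'string' column is dedicated to text and marks missing entries as <NA>.
words = pd.Series(['foo', 'bar', None], dtype='string')
upper = words.str.upper()
```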

## Converting

You can use the **astype()** method to explicitly convert dtypes from one to another. By default it returns a copy, even if the **dtype** was unchanged (pass copy=False to change this behavior).
In addition, it will raise an exception if the **astype()** operation is invalid.
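
A minimal sketch of an invalid conversion, using a hypothetical Series `raw` that contains one non-numeric entry. `pd.to_numeric` with `errors='coerce'` is shown as one alternative when failures should become NaN instead of raising.

```python
import pandas as pd

raw = pd.Series(['1', '2', 'x'])  # hypothetical data; 'x' is not a number

try:
    raw.astype('int64')
    failed = False
except ValueError:
    # 'x' cannot be parsed as an integer, so astype raises.
    failed = True

# Unparseable entries become NaN instead of raising.
coerced = pd.to_numeric(raw, errors='coerce')
```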

In [221]:
df1 = pd.DataFrame(np.random.randn(8, 1), columns=['A'], dtype='float32')

In [222]:
df1.dtypes

A    float32
dtype: object

In [223]:
# conversion of dtypes

df1 = df1.astype('float64')

In [224]:
df1.dtypes

A    float64
dtype: object

#### You can call *.astype()* on a subset of columns as well, or on a single column (a Series).

In [225]:
dft1 = pd.DataFrame({'a' : [1, 0, 1], 'b' : [4, 5, 6], 'c' : [7, 8, 9]})

In [226]:
dft1 = dft1.astype({'a': bool, 'c': np.float64})  # np.bool was removed in NumPy 1.24; use the builtin bool

In [227]:
dft1

       a  b    c
0   True  4  7.0
1  False  5  8.0
2   True  6  9.0


In [229]:
dft.dtypes

A           float64
B             int64
C            object
D    datetime64[ns]
E           float32
F              bool
G              int8
dtype: object