# Python for Data Analysis Part 9: Pandas DataFrames
http://hamelg.blogspot.com/2015/11/python-for-data-analysis-part-9-pandas.html

NumPy's ndarray stores elements of a single dtype, so it is a poor fit for heterogeneous data, i.e., data that mixes different data types.

pandas provides two data structures for this: Series and DataFrame.
A Series is a 1-dimensional labeled array.
A DataFrame is a labeled 2-dimensional structure, like a spreadsheet or a database table.

Unlike an ndarray, both can be indexed by label as well as by integer position.
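
To make the single-dtype point concrete, here is a minimal sketch (the variable names are illustrative and not from the original post). NumPy coerces mixed input to one common dtype, whereas a Series built from the same values falls back to the `object` dtype and keeps each element as it was.

```python
import numpy as np
import pandas as pd

# An ndarray has exactly one dtype, so mixed input is coerced to a common type.
mixed_array = np.array([2, 3.5, "a"])
print(mixed_array.dtype.kind)   # 'U': every element became a string

# A Series built from the same values keeps the original Python objects.
mixed_series = pd.Series([2, 3.5, "a"], index=["x", "y", "z"])
print(mixed_series.dtype)       # object
print(type(mixed_series["x"]))  # <class 'int'>
```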

## Pandas Series

A Series is very similar to an ndarray. The main difference is that a Series carries index labels, which you can customize, and operations between Series automatically align the data on those labels.
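
As a rough illustration of label alignment, the sketch below subtracts two Series whose labels are stored in a different order. The names `prices` and `discounts` and their values are made up for the example.

```python
import pandas as pd

# Two Series with the same labels stored in a different order.
prices = pd.Series([10.0, 20.0, 30.0], index=["a", "b", "c"])
discounts = pd.Series([3.0, 1.0, 2.0], index=["c", "a", "b"])

# Subtraction pairs values by label, not by position.
net = prices - discounts
print(net["a"], net["b"], net["c"])  # 9.0 18.0 27.0
```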

<font size="3">__Series can be created using pd.Series by passing either a list, ndarray or a dictionary.__</font>

In [1]:
import numpy as np
import pandas as pd

In [7]:
# Define a new Series by passing a collection of data, such as a list or ndarray, along with a list of
# associated index labels to pd.Series()
my_series = pd.Series(data=[2, 3.5, 5, 'a'], index=['a', 'b', 'c', 'd'])  # data can be heterogeneous in a Series
my_series

a      2
b    3.5
c      5
d      a
dtype: object

In [8]:
# without an index argument, default integer labels starting from 0 are assigned
my_series2 = pd.Series(data=[2,3.5,5,'a'])
my_series2

0      2
1    3.5
2      5
3      a
dtype: object

In [12]:
my_dict = {'x':2, 'a':5, 'b':'str', 'c':5.5}
my_series3 = pd.Series(my_dict)
my_series3

x      2
a      5
b    str
c    5.5
dtype: object

<font size="3">__Items in a Series can be accessed either using a label or an index.__</font>

In [14]:
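print(my_series["a"])      # access by label
print(my_series.iloc[0])   # access by integer position; my_series[0] relies on a positional fallback that newer pandas deprecates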

2
2


In [16]:
# slice a Series: integer slices exclude the end position, label slices include the end label
print(my_series[0:2])
print(my_series["a":"c"])

a      2
b    3.5
dtype: object
a      2
b    3.5
c      5
dtype: object
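
The difference between the two slices above is easier to see with the explicit indexers. This is a small sketch using a throwaway Series `s` (not part of the original notebook).

```python
import pandas as pd

s = pd.Series([2, 3.5, 5, 7], index=["a", "b", "c", "d"])

by_position = s.iloc[0:2]    # positions 0 and 1; the end position is excluded
by_label = s.loc["a":"c"]    # labels a, b and c; the end label is included

print(list(by_position.index))  # ['a', 'b']
print(list(by_label.index))     # ['a', 'b', 'c']
```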


In [21]:
# operations performed on two series align by label
my_series + my_series

a     4
b     7
c    10
d    aa
dtype: object

In [23]:
# labels that appear in only one of the two Series get NaN in the result
my_series + my_series2

a    NaN
b    NaN
c    NaN
d    NaN
0    NaN
1    NaN
2    NaN
3    NaN
dtype: object
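
When the labels only partly overlap, the matched labels are added and the rest become NaN. The sketch below shows this with two illustrative Series, `left` and `right`, and also shows `Series.add` with `fill_value`, which is one way to treat a missing side as 0 instead.

```python
import pandas as pd

left = pd.Series([1, 2, 3], index=["a", "b", "c"])
right = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Plain addition: labels present on only one side become NaN.
total = left + right
print(total)

# Series.add with fill_value substitutes 0 for the missing side.
filled = left.add(right, fill_value=0)
print(filled)
```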

In [27]:
# a Series is also a valid argument to NumPy array functions.
my_series = pd.Series(data=[1,2,3,4.5], index=['a', 'b', 'c', 'd'])
print(np.mean(my_series))

2.625
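
Element-wise NumPy functions go one step further than reductions like `np.mean`: they return a Series that keeps the original labels. A short sketch with an illustrative Series `values`:

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, 4.0, 9.0], index=["a", "b", "c"])

# A NumPy ufunc applied to a Series returns a Series with the same index.
roots = np.sqrt(values)
print(type(roots).__name__)   # Series
print(roots["c"])             # 3.0

# Reductions such as np.mean collapse the Series to a single number.
print(np.mean(values))
```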

## DataFrame creation and indexing
A DataFrame is a 2D table with labeled columns that can each hold different types of data. DataFrames are essentially a Python implementation of the types of tables you'd see in an Excel workbook or SQL database.

pd.DataFrame() is the constructor for a DataFrame.
DataFrames can be created from dictionaries, 2D arrays, and pandas Series.
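
The cells that follow build DataFrames from dictionaries. For the other two input types, a minimal sketch might look like this (the names `grid`, `from_array` and `from_series` are illustrative):

```python
import numpy as np
import pandas as pd

# From a 2D array: rows and columns are labeled via index= and columns=.
grid = np.array([[6, 15], [3, 10], [4, 12]])
from_array = pd.DataFrame(grid, index=["Sindhu", "Varun", "Sachi"],
                          columns=["age", "weight"])

# From a dict of Series: columns are aligned on the Series labels.
age = pd.Series([6, 3], index=["Sindhu", "Varun"])
weight = pd.Series([10, 15], index=["Varun", "Sindhu"])
from_series = pd.DataFrame({"age": age, "weight": weight})

print(from_array.shape)                      # (3, 2)
print(from_series.loc["Sindhu", "weight"])   # 15
```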

In [32]:
# DataFrame creation from a dictionary
my_dict = {'name': ['Sindhu', 'Varun', 'Sachi'],
          'age' : np.array([6,3,4]),
          'weight' : (15, 10, 12),
           "height" : pd.Series([4.5, 5, 6.1], 
                                index=["Sindhu","Varun","Sachi"]),
           'siblings': 1,
           'gender': ['F', 'M', 'F']
          }
df = pd.DataFrame(my_dict)
df

|  | name | age | weight | height | siblings | gender |
|---|---|---|---|---|---|---|
| Sindhu | Sindhu | 6 | 15 | 4.5 | 1 | F |
| Varun | Varun | 3 | 10 | 5.0 | 1 | M |
| Sachi | Sachi | 4 | 12 | 6.1 | 1 | F |


In [40]:
# without index labels, numeric row index labels are assigned
my_dict2 = {'name': ['Sindhu', 'Varun', 'Sachi'],
          'age' : np.array([6,3,4]),
          'weight' : (15, 10, 12),
           "height" : [4.5, 5, 6.1],
           'siblings': 1,
           'gender': ['F', 'M', 'F']
          }
df2 = pd.DataFrame(my_dict2)
df2

|  | name | age | weight | height | siblings | gender |
|---|---|---|---|---|---|---|
| 0 | Sindhu | 6 | 15 | 4.5 | 1 | F |
| 1 | Varun | 3 | 10 | 5.0 | 1 | M |
| 2 | Sachi | 4 | 12 | 6.1 | 1 | F |


In [41]:
# You can provide custom row labels when creating a DataFrame by adding the index argument
df3 = pd.DataFrame(my_dict2, index=my_dict2['name'])
df3

|  | name | age | weight | height | siblings | gender |
|---|---|---|---|---|---|---|
| Sindhu | Sindhu | 6 | 15 | 4.5 | 1 | F |
| Varun | Varun | 3 | 10 | 5.0 | 1 | M |
| Sachi | Sachi | 4 | 12 | 6.1 | 1 | F |


In [44]:
# Get a column by name
print(df3['age'])
# or
print(df3.age)

Sindhu    6
Varun     3
Sachi     4
Name: age, dtype: int32
Sindhu    6
Varun     3
Sachi     4
Name: age, dtype: int32


In [46]:
# delete a column
del df3['gender']
df3

|  | name | age | weight | height | siblings |
|---|---|---|---|---|---|
| Sindhu | Sindhu | 6 | 15 | 4.5 | 1 |
| Varun | Varun | 3 | 10 | 5.0 | 1 |
| Sachi | Sachi | 4 | 12 | 6.1 | 1 |


In [48]:
# Add a new column
df3['class'] = ['1', 'nursery', 'LKG']
df3

|  | name | age | weight | height | siblings | class |
|---|---|---|---|---|---|---|
| Sindhu | Sindhu | 6 | 15 | 4.5 | 1 | 1 |
| Varun | Varun | 3 | 10 | 5.0 | 1 | nursery |
| Sachi | Sachi | 4 | 12 | 6.1 | 1 | LKG |
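
Both `del` and column assignment modify the DataFrame in place. If you would rather keep the original intact, `drop` and `assign` return new DataFrames. The sketch below uses a small illustrative frame `kids`; the `age_in_months` column is hypothetical.

```python
import pandas as pd

kids = pd.DataFrame({"name": ["Sindhu", "Varun", "Sachi"],
                     "age": [6, 3, 4],
                     "gender": ["F", "M", "F"]})

# drop() and assign() return new DataFrames and leave the original untouched.
trimmed = kids.drop(columns=["gender"])
extended = trimmed.assign(age_in_months=trimmed["age"] * 12)

print(list(kids.columns))      # ['name', 'age', 'gender']
print(list(extended.columns))  # ['name', 'age', 'age_in_months']
```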


In [49]:
# Assigning a single value to a column broadcasts it to every row
df3['married'] = False
df3

|  | name | age | weight | height | siblings | class | married |
|---|---|---|---|---|---|---|---|
| Sindhu | Sindhu | 6 | 15 | 4.5 | 1 | 1 | False |
| Varun | Varun | 3 | 10 | 5.0 | 1 | nursery | False |
| Sachi | Sachi | 4 | 12 | 6.1 | 1 | LKG | False |


In [53]:
# When inserting a Series into a DataFrame, rows are matched by index. Unmatched rows will be filled with NaN
df3['School'] = pd.Series(['AFS', 'Shemrock'], index=['Sindhu', 'Varun'])
df3

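|  | name | age | weight | height | siblings | class | married | School |
|---|---|---|---|---|---|---|---|---|
| Sindhu | Sindhu | 6 | 15 | 4.5 | 1 | 1 | False | AFS |
| Varun | Varun | 3 | 10 | 5.0 | 1 | nursery | False | Shemrock |
| Sachi | Sachi | 4 | 12 | 6.1 | 1 | LKG | False | NaN |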
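
The contrast with a plain list is worth spelling out: a Series is matched on its labels, while a list has no labels and must match the row count exactly. A minimal sketch, using an illustrative frame and hypothetical `school` and `grade` columns:

```python
import pandas as pd

frame = pd.DataFrame({"age": [6, 3, 4]}, index=["Sindhu", "Varun", "Sachi"])

# A Series is aligned on its labels, so order and missing labels are tolerated.
frame["school"] = pd.Series(["Shemrock", "AFS"], index=["Varun", "Sindhu"])
print(frame["school"].isna().tolist())   # [False, False, True]

# A plain list has no labels, so its length must match the number of rows.
try:
    frame["grade"] = ["1", "nursery"]
except ValueError as err:
    print("ValueError:", err)
```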


In [55]:
# You can select rows, columns, or both by label with df.loc[row, column]
df3.loc['Sindhu']

name        Sindhu
age              6
weight          15
height         4.5
siblings         1
class            1
married      False
School         AFS
Name: Sindhu, dtype: object

In [57]:
df3.loc['Sindhu', 'age']

6

In [59]:
df3.loc['Sindhu':'Varun', 'age':'class']

|  | age | weight | height | siblings | class |
|---|---|---|---|---|---|
| Sindhu | 6 | 15 | 4.5 | 1 | 1 |
| Varun | 3 | 10 | 5.0 | 1 | nursery |


In [60]:
# Select rows or columns by numeric index with df.iloc[row, column]
df3.iloc[0]

name        Sindhu
age              6
weight          15
height         4.5
siblings         1
class            1
married      False
School         AFS
Name: Sindhu, dtype: object

In [62]:
df3.iloc[0,3]

4.5

In [64]:
# Mixing a row position with column labels used to be done with df.ix[row, column],
# but .ix was deprecated in pandas 0.20 and removed in pandas 1.0.
# Instead, convert the position to a label with df.index and use .loc
df3.loc[df3.index[0], 'name':'class']


name        Sindhu
age              6
weight          15
height         4.5
siblings         1
class            1
Name: Sindhu, dtype: object
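
Without `.ix`, a mixed position-and-label lookup has to be phrased entirely in labels or entirely in positions. The sketch below shows both conversions on a small illustrative frame; `Index.get_loc` is the pandas method that maps a label to its position.

```python
import pandas as pd

frame = pd.DataFrame({"name": ["Sindhu", "Varun"], "age": [6, 3], "weight": [15, 10]},
                     index=["Sindhu", "Varun"])

# Option 1: turn the row position into a label, then use .loc throughout.
first_row_label = frame.index[0]
by_label = frame.loc[first_row_label, "name":"age"]

# Option 2: turn the column labels into positions, then use .iloc throughout.
start = frame.columns.get_loc("name")
stop = frame.columns.get_loc("age") + 1   # +1 because iloc excludes the end
by_position = frame.iloc[0, start:stop]

print(by_label.tolist())
print(by_position.tolist())
```

Which option reads better depends on whether the surrounding code thinks in labels or in positions.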

In [67]:
# select rows by passing in a boolean sequence
df3[[False, False, True]]

|  | name | age | weight | height | siblings | class | married | School |
|---|---|---|---|---|---|---|---|---|
| Sachi | Sachi | 4 | 12 | 6.1 | 1 | LKG | False | NaN |


In [73]:
# subsetting data using logical operations
boolean_index = df3.age > 5
df3[boolean_index]

# or, equivalently, in one step
df3[df3.age > 5]

|  | name | age | weight | height | siblings | class | married | School |
|---|---|---|---|---|---|---|---|---|
| Sindhu | Sindhu | 6 | 15 | 4.5 | 1 | 1 | False | AFS |
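
Several conditions can be combined into one mask. The following sketch, on a small illustrative frame, uses the element-wise operators `&`, `|` and `~`; Python's `and`/`or` do not work on Series, and each comparison needs its own parentheses.

```python
import pandas as pd

frame = pd.DataFrame({"age": [6, 3, 4], "weight": [15, 10, 12]},
                     index=["Sindhu", "Varun", "Sachi"])

# Combine conditions with & (and), | (or), ~ (not); parenthesize each comparison.
mask = (frame["age"] > 3) & (frame["weight"] < 15)
print(frame[mask].index.tolist())          # ['Sachi']

# .loc accepts a boolean mask for rows together with column labels.
print(frame.loc[mask, "weight"].tolist())  # [12]
```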


## Exploring DataFrames


In [1]:
from ggplot import mtcars

ModuleNotFoundError: No module named 'ggplot'

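In [75]:
# pip is a shell command, not Python, so a bare "pip3 install" is a SyntaxError in a notebook cell.
# Run it through the %pip magic (or prefix the command with !) instead.
%pip install ggplot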
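
The `ggplot` import above failed in this session, so the `mtcars` data is not available here. As a stand-in, the sketch below applies a few common inspection calls to a small frame built from the values used earlier in these notes; the same calls would apply to any DataFrame.

```python
import pandas as pd

frame = pd.DataFrame({"name": ["Sindhu", "Varun", "Sachi"],
                      "age": [6, 3, 4],
                      "height": [4.5, 5.0, 6.1]})

print(frame.shape)        # (3, 3): rows, columns
print(frame.head(2))      # first two rows
print(frame.dtypes)       # one dtype per column
print(frame.describe())   # summary statistics for the numeric columns
```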