**Getting Started with pandas**
Pandas stands for “Python Data Analysis Library”.
What’s cool about Pandas is that it takes data (like a CSV or TSV file, or a SQL database) and creates a Python object with rows and columns called a data frame, which looks very similar to a table in statistical software (think Excel or SPSS, for example). This is much easier to work with than lists and/or dictionaries processed through for loops or list comprehensions.
In order to use Pandas in your Python IDE (Integrated Development Environment) like Jupyter Notebook or Spyder (both of them come with Anaconda by default), you need to import the Pandas library first. Importing a library means loading it into the memory and then it’s there for you to work with. In order to import Pandas all you have to do is run the following code:
import pandas as pd 
import numpy as np 

**When you want to use Pandas for data analysis, you’ll usually use it in one of three different ways:**
-Convert a Python list, dictionary or NumPy array to a Pandas data frame
-Open a local file using Pandas, usually a CSV file, but it could also be a delimited text file (like TSV), an Excel file, etc.
-Open a remote file or database, such as a CSV or JSON file on a website through a URL, or read from a SQL table/database
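
As a rough sketch of the first two routes: the column names and CSV text below are made up for illustration, and `io.StringIO` stands in for a file on disk so that the snippet runs without any external data.

```python
import io
import pandas as pd

# 1. Convert an in-memory Python object (here a dict of lists) to a data frame
df_from_dict = pd.DataFrame({"name": ["Ann", "Bob"], "age": [31, 45]})

# 2. Read delimited text; StringIO stands in for a path such as "people.csv"
csv_text = "name,age\nAnn,31\nBob,45\n"  # hypothetical file contents
df_from_csv = pd.read_csv(io.StringIO(csv_text))

print(df_from_dict)
print(df_from_csv)
```

For the third route, `pd.read_csv` also accepts a URL in place of a path, and `pd.read_sql` reads from a database connection.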

Reference link: https://towardsdatascience.com/a-quick-introduction-to-the-pandas-python-library-f1b678f34673
Good link: https://www.tutorialspoint.com/python_pandas/

**Key Features of Pandas**
-Fast and efficient DataFrame object with default and customized indexing.
-Tools for loading data into in-memory data objects from different file formats.
-Data alignment and integrated handling of missing data.
-Reshaping and pivoting of data sets.
-Label-based slicing, indexing and subsetting of large data sets.
-Columns from a data structure can be deleted or inserted.
-Group by data for aggregation and transformations.
-High performance merging and joining of data.
-Time Series functionality.

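**Pandas deals with the following data structures −**
-Series
-DataFrame
-Panel (a three-dimensional container found in older releases; it was deprecated and later removed, so current versions of pandas provide only Series and DataFrame)
These data structures are built on top of NumPy arrays, which means they are fast.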

A Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, Python objects, etc.). The axis labels are collectively called the index.
A series can be created using various inputs like −
-Array
-Dict
-Scalar value or constant

**While pandas adopts many coding idioms from NumPy, the biggest difference is that pandas is designed for working with tabular or heterogeneous data. NumPy, by contrast, is best suited for working with homogeneous numerical array data.

**Series**

The first main data type we will learn about for pandas is the Series data type. Let's import Pandas and explore the Series object.

A Series is very similar to a NumPy array (in fact it is built on top of the NumPy array object). What differentiates the NumPy array from a Series, is that a Series can have axis labels, meaning it can be indexed by a label, instead of just a number location. It also doesn't need to hold numeric data, it can hold any arbitrary Python Object.

Let's explore this concept through some examples:

In [1]:
import numpy as np
import pandas as pd

**Creating a Series**
You can convert a list,numpy array, or dictionary to a Series:

We are going to create four objects of various types

In [2]:
# create a list of strings
labels = ['a','b','c']
labels

['a', 'b', 'c']

In [3]:
# create a list of numbers
my_list = [10,20,30]
my_list

[10, 20, 30]

In [4]:
# create a NumPy array called arr from my_list
arr = np.array(my_list)
arr

array([10, 20, 30])

In [5]:
# create a dictionary d with key a mapped to 10, key b mapped to 20 and key c mapped to 30
d = {'a':10,'b':20,'c':30}
d

{'a': 10, 'b': 20, 'c': 30}

**Creating a series using lists**
Let us see how we can create a series.
Type pd.Series and press Shift+Tab. You will find that it takes various parameters. Note that we are not giving the series a name.

**Constructor**
A pandas Series can be created using the following constructor −
--pandas.Series( data, index, dtype, copy)
**Parameter & Description**
--data takes various forms like ndarray, list, dict, or a scalar constant
--index values must be hashable and the same length as data; they do not have to be unique. Defaults to np.arange(n) if no index is passed.
--dtype is the data type. If None, the data type will be inferred
--copy controls whether the input data is copied. Default False
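
A small sketch of the constructor with every parameter spelled out may help here. The labels and values are arbitrary, chosen only for illustration.

```python
import numpy as np
import pandas as pd

values = np.array([1, 2, 3])

# data, index, dtype and copy passed explicitly
s = pd.Series(data=values, index=["x", "y", "z"], dtype="float64", copy=True)

values[0] = 99  # the Series holds its own copy, so it is unaffected
print(s)
```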


In [6]:
pd.Series(data=my_list)

0    10
1    20
2    30
dtype: int64

The key to a pandas Series is that you can specify what you want its index to be.

In [7]:
# we are going to create a series specifying the data = my_list and the index is labels
# I can call those data points using those labels index
pd.Series(data=my_list,index=labels)

a    10
b    20
c    30
dtype: int64

In [8]:
# we can do that without specifying the parameter name but provide the parameter values in the correct order
pd.Series(my_list,labels)

a    10
b    20
c    30
dtype: int64

**Creating a series using a NumPy Array**
Using a Numpy Array will be the same as using a list

In [9]:
pd.Series(arr)

0    10
1    20
2    30
dtype: int32

In [10]:
pd.Series(arr,labels)

a    10
b    20
c    30
dtype: int32

**Creating a series using a dictionary**
pandas will take the keys of the dictionary and set it as index and values of the dictionary as the data of the series

In [11]:
pd.Series(d)

a    10
b    20
c    30
dtype: int64

**Data in a Series**

A pandas Series can hold a variety of object types; almost any Python object can be one of its data points.

In [12]:
pd.Series(data=labels)

0    a
1    b
2    c
dtype: object

In [13]:
# Even functions (although unlikely that you will use this)
pd.Series([sum,print,len])

0      <built-in function sum>
1    <built-in function print>
2      <built-in function len>
dtype: object

**Using an Index**

The key to using a Series is understanding its index. Pandas makes use of these index names or numbers by allowing for fast look ups of information (works like a hash table or dictionary).

Let's see some examples of how to grab information from a Series. Let us create two series, ser1 and ser2:

In [14]:
ser1 = pd.Series([1,2,3,4],index = ['USA', 'Germany','USSR', 'Japan'])                                   

In [15]:
ser1

USA        1
Germany    2
USSR       3
Japan      4
dtype: int64

In [16]:
ser2 = pd.Series([1,2,5,4],index = ['USA', 'Germany','Italy', 'Japan'])                                   

In [17]:
ser2

USA        1
Germany    2
Italy      5
Japan      4
dtype: int64

Grabbing information from the series:

In [18]:
ser1['USA']

1

**Operations are then also done based off of index**
pandas matches up the operation based on the index labels.
--It combines the data in both series label by label. Germany, Japan and USA appear in both indexes, so their values are added.
--If it cannot find a match, it puts NaN in the result. Italy and USSR each exist in only one series, so both get NaN.
--Note that because NaN is a floating-point value, the integer results are converted to float here.

In [19]:
ser1 + ser2

Germany    4.0
Italy      NaN
Japan      8.0
USA        2.0
USSR       NaN
dtype: float64
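
If NaN for unmatched labels is not what you want, one option is the `add` method with `fill_value`, which treats a label missing from one side as that value. A minimal sketch using the same two series:

```python
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4], index=["USA", "Germany", "USSR", "Japan"])
ser2 = pd.Series([1, 2, 5, 4], index=["USA", "Germany", "Italy", "Japan"])

# labels present in only one series are treated as 0 on the other side
total = ser1.add(ser2, fill_value=0)
print(total)
```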

**Create a Series from Scalar**
If data is a scalar value, an index must be provided. The value will be repeated to match the length of index

In [20]:
#import the pandas library and aliasing as pd
import pandas as pd
import numpy as np
s = pd.Series(5, index=[0, 1, 2, 3])
print (s)

0    5
1    5
2    5
3    5
dtype: int64


In [21]:
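#Retrieve the first element. As we already know, counting starts from zero for the array,
#which means the first element is stored at the zeroth position and so on.
import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

#retrieve the first element by position; .iloc makes the positional lookup explicit
#(recent pandas versions warn when s[0] is used on a Series with a non-integer index)
print (s.iloc[0])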

1


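**Retrieve the first three elements in the Series.** A slice takes the form start:stop. If the start is omitted (as in :3), items from the beginning up to, but not including, the stop position are returned. If only the stop is omitted (as in 3:), all items from that position onwards are returned. If both are given, the items between the two positions are returned (not including the stop position).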

In [22]:
import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

#retrieve the first three element
print (s[:3])

a    1
b    2
c    3
dtype: int64
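
One detail worth a quick sketch: positional slices exclude the stop, while label slices with `.loc` include it. The example reuses the series defined above.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=["a", "b", "c", "d", "e"])

by_position = s.iloc[:3]    # positions 0, 1, 2 (stop position excluded)
by_label = s.loc["a":"c"]   # labels a through c (stop label included)

print(by_position.equals(by_label))
```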


**Create a Series with an index identifying each data point with a label:**

In [23]:
obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
print(obj2)

d    4
b    7
a   -5
c    3
dtype: int64


**Retrieve the index of a series**

In [24]:
obj2.index

Index(['d', 'b', 'a', 'c'], dtype='object')

You can use labels in the index when selecting single values or a set of values:

In [25]:
 obj2['a']

-5

In [26]:
obj2[['c', 'a', 'd']]

c    3
a   -5
d    4
dtype: int64

Using NumPy functions or NumPy-like operations, such as filtering with a boolean array, scalar multiplication, or applying math functions, will preserve the index-value link:

In [27]:
obj2[obj2 > 0]

d    4
b    7
c    3
dtype: int64

In [28]:
obj2 * 2

d     8
b    14
a   -10
c     6
dtype: int64

Should you have data contained in a Python dict, you can create a Series from it by passing the dict:

In [29]:
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

In [30]:
obj3 = pd.Series(sdata)

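When you pass only a dict, the index of the resulting Series holds the dict’s keys. Current pandas releases keep the keys in the dict’s insertion order (some older releases sorted them). You can override this by passing the keys in the order you want them to appear in the resulting Series. A key that is listed but missing from the dict ('California' below) gets NaN, and a dict key that is not listed ('Utah') is left out: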

In [31]:
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)
##display obj4
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

I will use the terms “missing” or “NA” interchangeably to refer to missing data. The isnull and notnull functions in pandas should be used to detect missing data:

In [32]:
pd.isnull(obj4)

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [33]:
pd.notnull(obj4)

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool
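
Once missing values are detected, the usual next steps are to count, drop or fill them. The sketch below rebuilds `obj4` from the example above; filling with 0 is only an illustrative choice, not a recommendation for real data.

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
obj4 = pd.Series(sdata, index=["California", "Ohio", "Oregon", "Texas"])

n_missing = int(obj4.isna().sum())  # isna is an alias of isnull
without_missing = obj4.dropna()     # drop the entries that are NaN
filled = obj4.fillna(0)             # or replace NaN with a chosen value

print(n_missing)
print(without_missing)
print(filled)
```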

**DataFrames**


- DataFrames are the workhorse of pandas and are directly inspired by the R programming language. We can think of a DataFrame as a bunch of Series objects put together to share the same index. Let's use pandas to explore this topic!
- A Data frame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns.

**Features of DataFrame**

- Potentially columns are of different types
- Size – Mutable
- Labeled axes (rows and columns)
- Can Perform Arithmetic operations on rows and columns

**Creating a pandas DataFrame**
- A pandas DataFrame can be created using the following constructor −
- pandas.DataFrame( data, index, columns, dtype, copy)

**The parameters of the constructor are as follows**

- data-takes various forms like ndarray, Series, map, lists, dict, constants and also another DataFrame.
- index-the row labels to use for the resulting frame. Optional; defaults to np.arange(n) if no index is passed.
- columns-the column labels. Optional; defaults to np.arange(n) if no column labels are passed.
- dtype-the data type to force on the columns. If None, the types are inferred.
- copy-whether to copy the input data; described here as False by default (the exact default can depend on the input type and the pandas version).
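
To make the parameters concrete, here is a small sketch with arbitrary labels (`r1`, `a` and so on are made up) that passes index, columns and dtype explicitly, next to a frame that relies on the defaults.

```python
import numpy as np
import pandas as pd

values = np.arange(6).reshape(3, 2)

# explicit row labels, column labels and dtype
df_explicit = pd.DataFrame(data=values, index=["r1", "r2", "r3"],
                           columns=["a", "b"], dtype="float64")

# no index or columns passed: both default to 0, 1, 2, ...
df_default = pd.DataFrame(values)

print(df_explicit)
print(df_default)
```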
**Good Reference**
https://www.tutorialspoint.com/python_pandas/python_pandas_dataframe.htm


**A pandas DataFrame can be created using various inputs like −**
- Lists
- dict
- Series
- NumPy ndarrays
- Another DataFrame

**Create a DataFrame from Lists**
The DataFrame can be created using a single list or a list of lists.

In [34]:
import pandas as pd
data = [1,2,3,4,5]
df = pd.DataFrame(data)
print (df)

   0
0  1
1  2
2  3
3  4
4  5


In [35]:
import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'])
print (df)

     Name  Age
0    Alex   10
1     Bob   12
2  Clarke   13


In [36]:
import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
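# cast only the numeric column; passing dtype=float to the constructor would also try to
# convert the 'Name' strings, which recent pandas versions reject with an error
df = pd.DataFrame(data,columns=['Name','Age']).astype({'Age': float})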
print (df)

     Name   Age
0    Alex  10.0
1     Bob  12.0
2  Clarke  13.0


**Create a DataFrame from Dict of ndarrays / Lists**

In [37]:
import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data)
print (df)

    Name  Age
0    Tom   28
1   Jack   34
2  Steve   29
3  Ricky   42


There are many ways to construct a DataFrame, though one of the most common is from a dict of equal-length lists or NumPy arrays:

In [38]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)
# display the dataframe
frame

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2


As the example above shows, the resulting DataFrame has its index assigned automatically, as with Series. In current pandas releases the columns appear in the order of the dict's keys (state, year, pop); some older releases placed them in sorted order instead.
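
If you need a particular column order, you can pass it through the `columns` argument. In this sketch, which uses a shortened version of the data, `debt` is a made-up column name that is not in the dict, included to show that such a column is created and filled with NaN.

```python
import pandas as pd

data = {"state": ["Ohio", "Nevada"], "year": [2000, 2001], "pop": [1.5, 2.4]}

# columns controls the order; a name missing from the dict becomes all-NaN
frame2 = pd.DataFrame(data, columns=["year", "state", "pop", "debt"])
print(frame2)
```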

For large DataFrames, the head method selects only the first five rows:

In [39]:
frame.head()

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9


In [40]:
import pandas as pd
import numpy as np
from numpy.random import randn

In [41]:
#We are going to set a seed just to make sure that we get the same random numbers together
np.random.seed(101)

In [42]:
#create a dataframe
df=pd.DataFrame(randn (5,4),['A','B','C','D','E'],['W','X','Y','Z'])

In [43]:
df

          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509


- So basically what we have here is a set of columns W, X, Y, Z and corresponding rows A, B, C, D, E.
- Each of these columns is actually just a pandas Series.
- So W is a pandas Series, as are X, Y and Z, and they all share a common index. That is basically all a data frame is: a bunch of Series that share an index.

- Let's go ahead and show how we can use indexing and selection to grab one of these Series objects

In [44]:
# grab the W column
df['W']

A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64

- Look at the output: it looks like a Series, and that is true, because each column is a pandas Series.
- Let us confirm this by calling type

In [45]:
type(df['W'])

pandas.core.series.Series

In [46]:
#let us see the type of df itself
type(df)

pandas.core.frame.DataFrame

Just to re-iterate a data frame is just a bunch of series that share the same index
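
The same idea can be run in reverse as a small sketch: build a DataFrame from two Series and let pandas align them on their shared index. The series `w` and `x` and their values are invented for illustration.

```python
import pandas as pd

w = pd.Series([1.0, 2.0, 3.0], index=["A", "B", "C"])
x = pd.Series([10.0, 20.0], index=["A", "B"])  # no entry for C

# each Series becomes a column; rows are aligned on the index labels
df_from_series = pd.DataFrame({"W": w, "X": x})
print(df_from_series)
```

Label C has no value in `x`, so that cell is filled with NaN.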

In [47]:
# another way to grab a column from a dataframe is dot notation: the dataframe name, a dot, then the column name
df.W

A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64

- It is recommended not to use this, so that column names do not get confused with the dataframe's own methods.
- You can see the different methods of the dataframe by typing the dataframe name followed by a dot and then pressing the Tab key
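
A short sketch of the kind of clash this warning is about. The frame below is hypothetical and has a column deliberately named `count`, which is also the name of a DataFrame method.

```python
import pandas as pd

df_clash = pd.DataFrame({"count": [3, 4], "W": [0.1, 0.2]})

print(type(df_clash["count"]))   # bracket notation returns the column
print(callable(df_clash.count))  # dot notation finds the method instead
print(df_clash.W.equals(df_clash["W"]))  # both agree when there is no clash
```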

In [48]:
# use the bracket notation to grab a column
df ['X']

A    0.628133
B   -0.319318
C    0.740122
D   -0.758872
E    1.978757
Name: X, dtype: float64

- If you want several columns, use the bracket notation and pass a list of column names
- When you ask for more than one column you get back a dataframe
- If you ask for a single column, you get back a series

In [49]:
df[['W','Z']]

          W         Z
A  2.706850  0.503826
B  0.651118  0.605965
C -2.018168 -0.589001
D  0.188695  0.955057
E  0.190794  0.683509


In [50]:
#creating a column which is the sum of two columns
df ['new']=df['W']+df['Y']
df

          W         X         Y         Z       new
A  2.706850  0.628133  0.907969  0.503826  3.614819
B  0.651118 -0.319318 -0.848077  0.605965 -0.196959
C -2.018168  0.740122  0.528813 -0.589001 -1.489355
D  0.188695 -0.758872 -0.933237  0.955057 -0.744542
E  0.190794  1.978757  2.605967  0.683509  2.796762


In [51]:
#to remove a column you use df.drop and pass the column name and axis=1 to specify a column
#note axis=0 refers to rows
df.drop('new', axis=1)


          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509


- Note: there is an important thing to know about the drop method.
- It does not happen in place. In other words, it returns a new data frame and does not affect the original data frame itself

In [52]:
df

          W         X         Y         Z       new
A  2.706850  0.628133  0.907969  0.503826  3.614819
B  0.651118 -0.319318 -0.848077  0.605965 -0.196959
C -2.018168  0.740122  0.528813 -0.589001 -1.489355
D  0.188695 -0.758872 -0.933237  0.955057 -0.744542
E  0.190794  1.978757  2.605967  0.683509  2.796762


- To drop that column in place you need to specify
- inplace=True as an argument to the drop method
- The default value of inplace is False; you can see that by typing df.drop and pressing Shift+Tab

In [53]:
df.drop('new',axis=1,inplace=True)
df

          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509
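
An alternative to `inplace=True` that is often seen is to reassign the result of `drop`. The `columns=` keyword used here is a shorthand for passing the name together with `axis=1`. A minimal sketch on a small made-up frame:

```python
import pandas as pd

df_demo = pd.DataFrame({"W": [1, 2], "new": [3, 4]})

# drop returns a new frame; reassigning keeps the change
df_demo = df_demo.drop(columns="new")
print(df_demo)
```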


- you can use the drop method to drop rows. You can pass the row name and optionally axis=0. Note the axis=0 is the default.


In [54]:
df.drop('E')

          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057


In [55]:
df

          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509


In [56]:
# to find the number of rows and columns in a pandas dataframe you can use the shape
# attribute (it is not a method, so there are no parentheses)
df.shape

(5, 4)

In [57]:
#if you want to drop the row in place you need to specify inplace =True
df.drop('E',inplace=True)
df

          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057


- Selecting rows
- To select a row by its label, use loc with square brackets

In [58]:
df.loc['A']

W    2.706850
X    0.628133
Y    0.907969
Z    0.503826
Name: A, dtype: float64

- It returns a series
- So in pandas dataframes not only the columns are series, the rows are series as well
- Let us check it using type

In [59]:
type (df.loc['A'])

pandas.core.series.Series

- Another way to retrieve a row is by its numeric position, using iloc

In [60]:
df.iloc[3]

W    0.188695
X   -0.758872
Y   -0.933237
Z    0.955057
Name: D, dtype: float64

- Selecting subsets of row and columns

In [61]:
df.loc['B','Y']

-0.8480769834036315

In [62]:
# let us say we want the subset made of rows A and B and column W
df.loc[['A','B'],'W']


A    2.706850
B    0.651118
Name: W, dtype: float64

In [63]:
# let us say we want a subset of A, B rows and X, Y columns
df.loc[['A','B'],['X','Y']]

          X         Y
A  0.628133  0.907969
B -0.319318 -0.848077
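
The same kind of subset can be taken by position with `iloc`, or with label slices. The sketch below uses a small frame of consecutive integers (an assumption made so the values are easy to check by eye) rather than the random data above.

```python
import numpy as np
import pandas as pd

df_sel = pd.DataFrame(np.arange(16).reshape(4, 4),
                      index=list("ABCD"), columns=list("WXYZ"))

by_label = df_sel.loc[["A", "B"], ["X", "Y"]]   # lists of labels
by_position = df_sel.iloc[[0, 1], [1, 2]]        # lists of positions
label_slice = df_sel.loc["A":"B", "X":"Y"]       # label slices include the stop

print(by_label)
```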


In [64]:
# conditional selection
# let us recreate the dataframe
df=pd.DataFrame(randn (5,4),['A','B','C','D','E'],['W','X','Y','Z'])
df

          W         X         Y         Z
A  0.302665  1.693723 -1.706086 -1.159119
B -0.134841  0.390528  0.166905  0.184502
C  0.807706  0.072960  0.638787  0.329646
D -0.497104 -0.754070 -0.943406  0.484752
E -0.116773  1.901755  0.238127  1.996652


In [65]:
# using comparison operators against the data frame 
# such as asking where is the data frame greater than zero.
# it will return true if the data value satisfies the condition and false if not
df >0

       W      X      Y      Z
A   True   True  False  False
B  False   True   True   True
C   True   True   True   True
D  False  False  False   True
E  False   True   True   True


In [66]:
# or where the data frame is less than zero
df < 0

       W      X      Y      Z
A  False  False   True   True
B   True  False  False  False
C  False  False  False  False
D   True   True   True  False
E   True  False  False  False


In [67]:
# let us do the following
booldf=df > 0
booldf

       W      X      Y      Z
A   True   True  False  False
B  False   True   True   True
C   True   True   True   True
D  False  False  False   True
E  False   True   True   True


In [68]:
# if we passed this booldf to our data frame we will get the exact value if the condition is satisfied and NaN if not
df [booldf]

          W         X         Y         Z
A  0.302665  1.693723       NaN       NaN
B       NaN  0.390528  0.166905  0.184502
C  0.807706  0.072960  0.638787  0.329646
D       NaN       NaN       NaN  0.484752
E       NaN  1.901755  0.238127  1.996652


In [69]:
# we can perform the previous two steps in one
df [df>0]

          W         X         Y         Z
A  0.302665  1.693723       NaN       NaN
B       NaN  0.390528  0.166905  0.184502
C  0.807706  0.072960  0.638787  0.329646
D       NaN       NaN       NaN  0.484752
E       NaN  1.901755  0.238127  1.996652


In [70]:
# let us recall the contents of the original df
df

          W         X         Y         Z
A  0.302665  1.693723 -1.706086 -1.159119
B -0.134841  0.390528  0.166905  0.184502
C  0.807706  0.072960  0.638787  0.329646
D -0.497104 -0.754070 -0.943406  0.484752
E -0.116773  1.901755  0.238127  1.996652


In [71]:
# let us look at the following example
df ['W'] >0

A     True
B    False
C     True
D    False
E    False
Name: W, dtype: bool

In [72]:
# we get a series of True and False and it is linked to the actual index
df [df ['W']>0]

          W         X         Y         Z
A  0.302665  1.693723 -1.706086 -1.159119
C  0.807706  0.072960  0.638787  0.329646


We found that it filtered the data frame, displaying only the rows where the value in column W is greater than zero. Rows B, D and E, where the condition is not satisfied, were left out.

In [73]:
# let us look at df again and let us say we want to retrieve only elements of the dataframe where Z values is greater than zero
df

          W         X         Y         Z
A  0.302665  1.693723 -1.706086 -1.159119
B -0.134841  0.390528  0.166905  0.184502
C  0.807706  0.072960  0.638787  0.329646
D -0.497104 -0.754070 -0.943406  0.484752
E -0.116773  1.901755  0.238127  1.996652


In [74]:
df [df ['Z']>0]

          W         X         Y         Z
B -0.134841  0.390528  0.166905  0.184502
C  0.807706  0.072960  0.638787  0.329646
D -0.497104 -0.754070 -0.943406  0.484752
E -0.116773  1.901755  0.238127  1.996652


In [75]:
# let us say we want to retrieve elements in the dataframe only for those where Z values are less than zero
df [df ['Z']<0]

          W         X         Y         Z
A  0.302665  1.693723 -1.706086 -1.159119


In [76]:
# let us look at the following example
resultdf = df [ df['W']>0]
resultdf

          W         X         Y         Z
A  0.302665  1.693723 -1.706086 -1.159119
C  0.807706  0.072960  0.638787  0.329646


In [77]:
#resultdf is a dataframe which is subset of the df
resultdf['X']

A    1.693723
C    0.072960
Name: X, dtype: float64

In [78]:
# we can do those previous two steps in one
# we want to retrieve all the X values where W values are greater than zero
df[df ['W']>0]['X']

A    1.693723
C    0.072960
Name: X, dtype: float64

In [79]:
# retrieve all X and Y values where W values are greater than zero
df [df['W']>0][['X','Y']]

          X         Y
A  1.693723 -1.706086
C  0.072960  0.638787
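
The two-step form `df[mask][cols]` works for reading values. A single `.loc` call with the row mask and the column list does the same selection in one step, and it is also the form to prefer when assigning values. A sketch on a small made-up frame:

```python
import pandas as pd

df_mask = pd.DataFrame({"W": [0.3, -0.1, 0.8],
                        "X": [1.7, 0.4, 0.1],
                        "Y": [-1.7, 0.2, 0.6]}, index=["A", "B", "C"])

chained = df_mask[df_mask["W"] > 0][["X", "Y"]]     # filter rows, then pick columns
single = df_mask.loc[df_mask["W"] > 0, ["X", "Y"]]  # rows and columns in one call

print(chained.equals(single))
```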


In [80]:
# retrieve all the values in the dataframe df where W values are greater than Zero and Y values are greater than 1
df[(df ['W']>0) & (df['Y']>1)]

Empty DataFrame
Columns: [W, X, Y, Z]
Index: []


In [81]:
# retrieve all the values in the dataframe df where W values are greater than zero or Y values are greater than 1
df[(df ['W']>0) | (df['Y']>1)]

          W         X         Y         Z
A  0.302665  1.693723 -1.706086 -1.159119
C  0.807706  0.072960  0.638787  0.329646
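
Two details are easy to miss here. The conditions are combined with `&` and `|` because Python's `and` and `or` raise an error on a Series, and each comparison needs its own parentheses because `&` and `|` bind more tightly than `>`. The sketch below, on made-up values, also shows `~` for negation and `query` as a string-based alternative.

```python
import pandas as pd

df_cond = pd.DataFrame({"W": [0.3, -0.1, 0.8],
                        "Y": [-1.7, 0.2, 1.6]}, index=["A", "B", "C"])

both = df_cond[(df_cond["W"] > 0) & (df_cond["Y"] > 1)]        # and
either = df_cond[(df_cond["W"] > 0) | (df_cond["Y"] > 1)]      # or
neither = df_cond[~((df_cond["W"] > 0) | (df_cond["Y"] > 1))]  # not (or)

# query accepts the condition as a string and allows plain "and"
same_as_both = df_cond.query("W > 0 and Y > 1")

print(both)
print(either)
print(neither)
```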


In [82]:
# to reset the index to 0,1,2,...
# you call a method reset_index on the dataframe
df.reset_index ()

  index         W         X         Y         Z
0     A  0.302665  1.693723 -1.706086 -1.159119
1     B -0.134841  0.390528  0.166905  0.184502
2     C  0.807706  0.072960  0.638787  0.329646
3     D -0.497104 -0.754070 -0.943406  0.484752
4     E -0.116773  1.901755  0.238127  1.996652


In [83]:
# note that the old index was moved into a column named index and the new index is numeric
# note this does not occur in place. If you display the original df, it has not changed
df

          W         X         Y         Z
A  0.302665  1.693723 -1.706086 -1.159119
B -0.134841  0.390528  0.166905  0.184502
C  0.807706  0.072960  0.638787  0.329646
D -0.497104 -0.754070 -0.943406  0.484752
E -0.116773  1.901755  0.238127  1.996652


In [84]:
# if you want the index reset to be done in place you need to specify inplace=True
df.reset_index(inplace=True)
df

  index         W         X         Y         Z
0     A  0.302665  1.693723 -1.706086 -1.159119
1     B -0.134841  0.390528  0.166905  0.184502
2     C  0.807706  0.072960  0.638787  0.329646
3     D -0.497104 -0.754070 -0.943406  0.484752
4     E -0.116773  1.901755  0.238127  1.996652


In [85]:
# let us recreate our df with the original labels
# (the values differ from before because the random seed was not reset)
df=pd.DataFrame(randn (5,4),['A','B','C','D','E'],['W','X','Y','Z'])
df

          W         X         Y         Z
A -0.993263  0.196800 -1.136645  0.000366
B  1.025984 -0.156598 -0.031579  0.649826
C  2.154846 -0.610259 -0.755325 -0.346419
D  0.147027 -0.479448  0.558769  1.024810
E -0.925874  1.862864 -1.133817  0.610478


In [86]:
# let us create a new index
# we are going to use a small trick to create the list:
# write the items in one string separated by spaces
# and call split on it, which splits on whitespace
newind ='CA NY WY OR CO'.split ( )
newind

['CA', 'NY', 'WY', 'OR', 'CO']

In [87]:
# let us specify this list as a column named States in the dataframe
df ['States']= newind

In [88]:
# so we added a column to our dataframe named States that contains the elements
# we specified
df

          W         X         Y         Z States
A -0.993263  0.196800 -1.136645  0.000366     CA
B  1.025984 -0.156598 -0.031579  0.649826     NY
C  2.154846 -0.610259 -0.755325 -0.346419     WY
D  0.147027 -0.479448  0.558769  1.024810     OR
E -0.925874  1.862864 -1.133817  0.610478     CO


In [89]:
# if you have a column in your dataframe that you want it to be the index
# you do the same but you use set_index instead of reset_index and you pass
# the column name that you want to be the index
df.set_index('States')


               W         X         Y         Z
States
CA     -0.993263  0.196800 -1.136645  0.000366
NY      1.025984 -0.156598 -0.031579  0.649826
WY      2.154846 -0.610259 -0.755325 -0.346419
OR      0.147027 -0.479448  0.558769  1.024810
CO     -0.925874  1.862864 -1.133817  0.610478


In [90]:
# keep in mind that set_index replaces the original index (A to E) instead of keeping it
# as a column. This is the difference between set_index and reset_index
# note that this is not in place: displaying df again shows the unchanged dataframe
df

          W         X         Y         Z States
A -0.993263  0.196800 -1.136645  0.000366     CA
B  1.025984 -0.156598 -0.031579  0.649826     NY
C  2.154846 -0.610259 -0.755325 -0.346419     WY
D  0.147027 -0.479448  0.558769  1.024810     OR
E -0.925874  1.862864 -1.133817  0.610478     CO


In [91]:
# to apply it in place (recall that this will overwrite your original index)
# you need to specify the inplace=True argument
df.set_index("States", inplace = True) 
df

               W         X         Y         Z
States
CA     -0.993263  0.196800 -1.136645  0.000366
NY      1.025984 -0.156598 -0.031579  0.649826
WY      2.154846 -0.610259 -0.755325 -0.346419
OR      0.147027 -0.479448  0.558769  1.024810
CO     -0.925874  1.862864 -1.133817  0.610478
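
To see what happens to the old labels, here is a short round trip on a small made-up frame. Calling `reset_index` first is one way to keep the original labels as a column before setting the new index.

```python
import pandas as pd

df_idx = pd.DataFrame({"W": [1.0, 2.0], "States": ["CA", "NY"]},
                      index=["A", "B"])

by_state = df_idx.set_index("States")  # old labels A, B are discarded
restored = by_state.reset_index()      # States is a column again, index is 0, 1

# keep the old labels as a column named "index" before switching
kept = df_idx.reset_index().set_index("States")

print(by_state)
print(restored)
print(kept)
```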


**Optional Section** 
Multi-Index

- Multi-Index and Index Hierarchy
- Let us go over how to work with a Multi-Index. First we'll create a quick example of what a Multi-Indexed DataFrame looks like.
- The built-in zip() pairs up the elements at the same position in several containers, so that each pair can be used as a single entity (here, a tuple).

In [92]:
# Index Levels
# we use the zip and list to create a list of tuples
outside = ['G1','G1','G1','G2','G2','G2']
inside = [1,2,3,1,2,3]
hier_index = list(zip(outside,inside))
# pd.MultiIndex.from_tuples takes a list of tuples and creates a MultiIndex from it
hier_index = pd.MultiIndex.from_tuples(hier_index)
hier_index

MultiIndex([('G1', 1),
            ('G1', 2),
            ('G1', 3),
            ('G2', 1),
            ('G2', 2),
            ('G2', 3)],
           )
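
When every outer label is paired with every inner label, as here, `pd.MultiIndex.from_product` builds the same index without writing out the repeated lists. A sketch that also names the levels up front:

```python
import pandas as pd

outside = ["G1", "G1", "G1", "G2", "G2", "G2"]
inside = [1, 2, 3, 1, 2, 3]
from_tuples = pd.MultiIndex.from_tuples(list(zip(outside, inside)))

# every combination of the two lists, with level names set at creation
from_product = pd.MultiIndex.from_product([["G1", "G2"], [1, 2, 3]],
                                          names=["Groups", "Num"])

print(list(from_product) == list(from_tuples))
```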

In [93]:
# create a dataframe with a multi-index
# randn(6,2) gives 6 rows and two columns
# the index is hier_index
# the columns are 'A' and 'B'
df = pd.DataFrame(np.random.randn(6,2),index=hier_index,columns=['A','B'])
df


             A         B
G1 1  0.386030  2.084019
   2 -0.376519  0.230336
   3  0.681209  1.035125
G2 1 -0.031160  1.939932
   2 -1.005187 -0.741790
   3  0.187125 -0.732845


In [94]:
# let us see how to retrieve elements from a multi-indexed dataframe
# We will use loc with bracket notation, starting from the outermost index level
# to get everything under G1:
df.loc['G1']

          A         B
1  0.386030  2.084019
2 -0.376519  0.230336
3  0.681209  1.035125


In [95]:
# if I want to grab the row under G1 and 1
# the idea is to go from the outer index level to the inner one
df.loc['G1'].loc[1]

A    0.386030
B    2.084019
Name: 1, dtype: float64

In [96]:
# we can name those index levels
# let us first display the dataframe
# The index levels are not named
df

             A         B
G1 1  0.386030  2.084019
   2 -0.376519  0.230336
   3  0.681209  1.035125
G2 1 -0.031160  1.939932
   2 -1.005187 -0.741790
   3  0.187125 -0.732845


In [97]:
# we are going to give the index levels names
# let us name the outermost level Groups
# and the inner one Num. We can do that by assigning to the index.names attribute
df.index.names=['Groups', 'Num']
df

                   A         B
Groups Num
G1     1    0.386030  2.084019
       2   -0.376519  0.230336
       3    0.681209  1.035125
G2     1   -0.031160  1.939932
       2   -1.005187 -0.741790
       3    0.187125 -0.732845


In [98]:
# Now the index levels are labeled
# let us see how we can grab information from the multi-indexed dataframe
# let us say we want to grab the value in column B for Groups G2 and Num 2
df.loc['G2'].loc[2]['B']

-0.7417897046689249
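
The chained lookups above can also be written as a single `.loc` call by passing the row key as a tuple. The sketch uses a small frame with integer values (made up so the result is easy to verify) instead of the random data.

```python
import pandas as pd

idx = pd.MultiIndex.from_product([["G1", "G2"], [1, 2, 3]],
                                 names=["Groups", "Num"])
df_mi = pd.DataFrame({"A": range(6), "B": range(10, 16)}, index=idx)

chained = df_mi.loc["G2"].loc[2]["B"]  # outer level, inner level, column
direct = df_mi.loc[("G2", 2), "B"]     # tuple row key and column in one call

print(chained, direct)
```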

In [99]:
# let us see a function known as cross section xs  
# that returns a cross-section (row(s) or column(s)) from the Series/DataFrame.
# Defaults to cross-section on the rows (axis=0)
# We will use that when we have multiple level index
df

                   A         B
Groups Num
G1     1    0.386030  2.084019
       2   -0.376519  0.230336
       3    0.681209  1.035125
G2     1   -0.031160  1.939932
       2   -1.005187 -0.741790
       3    0.187125 -0.732845


In [100]:
# Let us say we want to grab everything under G1
# we can do it as we did before using df.loc['G1']
# another way is to use the cross section
df.xs ('G1')

            A         B
Num
1    0.386030  2.084019
2   -0.376519  0.230336
3    0.681209  1.035125


In [101]:
# The nice thing about the cross section function xs is that it can skip the outer level and select on an inner one
# for example let us say I want to grab all rows in the dataframe with Num=1
# in other words those in Groups G1 and G2 with Num=1
# you pass the value you want to xs and then specify level as the name of the index level
df.xs(1, level='Num')

               A         B
Groups
G1      0.386030  2.084019
G2     -0.031160  1.939932


In [102]:
# Grab the cross section where level Num=2
# that is, all rows in the dataframe where Num=2
df.xs(2, level='Num')

               A         B
Groups
G1     -0.376519  0.230336
G2     -1.005187 -0.741790
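
By default `xs` drops the level it selected on, which is why Num no longer appears in the output. If you would rather keep it, `drop_level=False` retains both levels. A sketch on a small frame with made-up integer values:

```python
import pandas as pd

idx = pd.MultiIndex.from_product([["G1", "G2"], [1, 2, 3]],
                                 names=["Groups", "Num"])
df_mi = pd.DataFrame({"A": range(6), "B": range(10, 16)}, index=idx)

dropped = df_mi.xs(1, level="Num")                # index is Groups only
kept = df_mi.xs(1, level="Num", drop_level=False)  # index keeps Groups and Num

print(dropped)
print(kept)
```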
