# Intro to Pandas

- Often described as Python's version of Excel
- Provides tools for interactive data analysis, wrangling, and exploration
- Built on top of NumPy

Topics:
- Series
- DataFrames
- Data manipulation
- Slicing / Dicing / Filtering
- Operations, e.g. statistics
- Import/Export

`pip install pandas`

In [2]:
import pandas as pd



## Series

A Series is the building block of a pandas table (DataFrame). It resembles a list, but every value also carries an index label.

In [3]:
MyList = [1, 4, 5, 7]
MySer = pd.Series(MyList)

In [3]:
MySer

0    1
1    4
2    5
3    7
dtype: int64

In [4]:
MyList[0]

1

In [5]:
# rebuild the series with a custom index
MySer = pd.Series(MyList, index=['a', 'b', 'c', 'd'])
MySer

a    1
b    4
c    5
d    7
dtype: int64

In [6]:
MySer['c']

5
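Once a Series has a custom index, values can be reached either by label or by position. The following is a minimal sketch using the same values and labels as above; the variable names (`ser`, `by_label`, and so on) are illustrative and not part of the original notebook.

```python
import pandas as pd

# Same values and labels as the Series built above
ser = pd.Series([1, 4, 5, 7], index=['a', 'b', 'c', 'd'])

by_label = ser.loc['c']         # look up by index label
by_position = ser.iloc[2]       # look up by integer position

label_slice = ser.loc['b':'c']  # label slices include both endpoints
pos_slice = ser.iloc[1:3]       # position slices exclude the stop value

print(by_label, by_position)
print(label_slice.tolist(), pos_slice.tolist())
```

Being explicit with `.loc` and `.iloc` avoids ambiguity about whether a key is meant as a label or a position.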

In [7]:
MySer.reset_index(drop=True)

0    1
1    4
2    5
3    7
dtype: int64

In [11]:
s2 = pd.Series(list('Python'))
s2

0    P
1    y
2    t
3    h
4    o
5    n
dtype: object

In [12]:
# convert pandas series into a numpy array
s2.values

array(['P', 'y', 't', 'h', 'o', 'n'], dtype=object)
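The pandas documentation recommends `Series.to_numpy()` over `.values` for getting a NumPy array. A short sketch, assuming the same `s2` Series as above (the name `arr` is illustrative):

```python
import numpy as np
import pandas as pd

s2 = pd.Series(list('Python'))

# to_numpy() returns the data as a NumPy array
arr = s2.to_numpy()

print(type(arr), arr.shape)
print(list(arr))
```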

In [14]:
s2[4]

'o'

## Dataframes

**Anatomy of a DataFrame**
 
![df](https://static.packt-cdn.com/products/9781839213106/graphics/Images/B15597_01_01.png)

In [15]:
# build a dataframe from a dictionary

data = {
    'name': ['Mark', 'Anupam', 'Becky', ' Tom'],
    'age': [30, 25, 18, 45],
}

In [16]:
df = pd.DataFrame(data)
df

     name  age
0    Mark   30
1  Anupam   25
2   Becky   18
3     Tom   45


In [19]:
type(df)

pandas.core.frame.DataFrame

In [18]:
type(df['name'])

pandas.core.series.Series

In [20]:
df['name']

0      Mark
1    Anupam
2     Becky
3       Tom
Name: name, dtype: object

In [22]:
# get the first row (returned as a Series)
df.loc[0]

name    Mark
age       30
Name: 0, dtype: object


In [24]:
# get the first 3 rows
df.loc[0:2]

     name  age
0    Mark   30
1  Anupam   25
2   Becky   18
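Note that `df.loc[0:2]` returns three rows because label slices include the end label. As a rough sketch, the same three rows can be taken in other ways; the frame below is a small stand-in for the one above, and the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Mark', 'Anupam', 'Becky', 'Tom'],
    'age': [30, 25, 18, 45],
})

first_by_label = df.loc[0:2]   # labels 0, 1 and 2 (end label included)
first_by_pos = df.iloc[0:3]    # positions 0, 1 and 2 (stop excluded)
first_by_head = df.head(3)     # convenience method for the first n rows

print(first_by_label.equals(first_by_pos))
print(first_by_pos.equals(first_by_head))
```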


In [25]:
df.ndim

2

In [26]:
df.shape

(4, 2)

In [72]:
product = pd.DataFrame(
    {'Bob': ['I liked it.', 'It was awful.'],
     'Sue': ['Pretty good.', 'Bland.']},
    index=['Product A', 'Product B'],
)

product

                     Bob           Sue
Product A    I liked it.  Pretty good.
Product B  It was awful.        Bland.


In [76]:
# loc is label-based: with the default index, the labels are the integers 0, 1, 2, ...
# with a custom index, pass the custom labels instead
product.loc['Product A']

Bob     I liked it.
Sue    Pretty good.
Name: Product A, dtype: object

In [74]:
product.iloc[0]

Bob     I liked it.
Sue    Pretty good.
Name: Product A, dtype: object
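To make the difference concrete, here is a small sketch on the same `product` frame. It selects a single cell both ways and shows that the integer `0` is not a valid label once the index holds strings. The variable names are illustrative.

```python
import pandas as pd

product = pd.DataFrame(
    {'Bob': ['I liked it.', 'It was awful.'],
     'Sue': ['Pretty good.', 'Bland.']},
    index=['Product A', 'Product B'],
)

# Same cell, addressed by labels and by integer positions
cell_by_label = product.loc['Product B', 'Sue']
cell_by_position = product.iloc[1, 1]

# 0 is not one of the index labels here, so loc raises KeyError
try:
    product.loc[0]
    label_zero_found = True
except KeyError:
    label_zero_found = False

print(cell_by_label, cell_by_position, label_zero_found)
```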

In [34]:
# get the index
product.index

Index(['Product A', 'Product B'], dtype='object')

In [35]:
# get the index; the default integer index is a RangeIndex
df.index

RangeIndex(start=0, stop=4, step=1)
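An existing column can also serve as the index. The sketch below assumes a small frame like the one above and uses `set_index` and `reset_index`; the names `by_name`, `becky_age` and `restored` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Mark', 'Anupam', 'Becky', 'Tom'],
    'age': [30, 25, 18, 45],
})

# Promote the 'name' column to the index (returns a new frame)
by_name = df.set_index('name')
becky_age = by_name.loc['Becky', 'age']

# Move the index back into a regular column and restore the default index
restored = by_name.reset_index()

print(by_name.index.tolist())
print(becky_age)
print(restored.columns.tolist())
```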

In [36]:
# build a dataframe from a dictionary

data = {
    'name': ['Mark', 'Anupam', 'Becky', ' Tom'],
    'age': [30, 25, 18, 45],
    'city': ['New York', 'Nashville', 'San Diego', 'Birmingham'],
}

In [37]:
df = pd.DataFrame(data)
df

     name  age        city
0    Mark   30    New York
1  Anupam   25   Nashville
2   Becky   18   San Diego
3     Tom   45  Birmingham


In [38]:
df.T

             0          1          2           3
name      Mark     Anupam      Becky         Tom
age         30         25         18          45
city  New York  Nashville  San Diego  Birmingham


In [40]:
df['name']  # column names must match exactly, including case

0      Mark
1    Anupam
2     Becky
3       Tom
Name: name, dtype: object

In [44]:
# get a specific list of columns
df[['name', 'age']]

     name  age
0    Mark   30
1  Anupam   25
2   Becky   18
3     Tom   45


In [51]:
# using 2 slicers - get the first 2 rows of 2 selected columns
df[['name', 'age']].loc[0:1]

     name  age
0    Mark   30
1  Anupam   25


In [53]:
df['name'][0]  # chained indexing: df['name'] returns a Series, then [0] selects the label 0 from it

'Mark'
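Chained indexing like `df['name'][0]` is fine for reading a value, but a single `.loc` or `.at` call does the same in one step. For assignment, chained indexing may not update the original frame, so the pandas documentation recommends a single `.loc` call. A minimal sketch, assuming a small frame like the one above:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Mark', 'Anupam', 'Becky', 'Tom'],
    'age': [30, 25, 18, 45],
})

chained = df['name'][0]        # two steps: pick the column, then the label
single_loc = df.loc[0, 'name'] # one step: row label, column label
single_at = df.at[0, 'name']   # scalar accessor for one cell

# Assign through a single .loc call so the frame itself is updated
df.loc[0, 'age'] = 31

print(chained, single_loc, single_at)
print(df.loc[0, 'age'])
```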

### `iloc`

In [85]:
# build a dataframe from a dictionary

data = {
    'name': ['Mark', 'Anupam', 'Becky', ' Tom'],
    'age': [30, 25, 18, 45],
    'city': ['New York', 'Nashville', 'San Diego', 'Birmingham'],
}

df = pd.DataFrame(data)
df

     name  age        city
0    Mark   30    New York
1  Anupam   25   Nashville
2   Becky   18   San Diego
3     Tom   45  Birmingham


In [69]:
df.iloc[2,1] #row, col selection

18

`loc` is label-based; `iloc` is integer-position-based.


In [70]:
df.iloc[:,1:3] # row range, col range

   age        city
0   30    New York
1   25   Nashville
2   18   San Diego
3   45  Birmingham
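Besides single positions and ranges, `iloc` also accepts lists of positions and negative positions. The sketch below assumes a frame like the one above and compares an `iloc` selection with the equivalent `loc` selection; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Mark', 'Anupam', 'Becky', 'Tom'],
    'age': [30, 25, 18, 45],
    'city': ['New York', 'Nashville', 'San Diego', 'Birmingham'],
})

# Rows at positions 0 and 2, columns at positions 0 and 2
subset = df.iloc[[0, 2], [0, 2]]

# The same selection expressed with labels
same_by_label = df.loc[[0, 2], ['name', 'city']]

# Negative positions count from the end, as with Python lists
last_row = df.iloc[-1]

print(subset.equals(same_by_label))
print(last_row['name'], last_row['city'])
```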


### Iteration

Exercise: take the table above and use its values, row by row, to print meaningful insights.

In [91]:
# method 1
for i in df.index:
    print('Student Name:',df['name'][i], '\t Student Age:', df['age'][i], '\t Age After Graduation:', df['age'][i]+4)

Student Name: Mark 	 Student Age: 30 	 Age After Graduation: 34
Student Name: Anupam 	 Student Age: 25 	 Age After Graduation: 29
Student Name: Becky 	 Student Age: 18 	 Age After Graduation: 22
Student Name:  Tom 	 Student Age: 45 	 Age After Graduation: 49


In [96]:
# method 2 - iterrows
for i, row in df.iterrows():
    print('Student Name:',row['name'], '\t Student Age:', row['age'], '\t Age After Graduation:', row['age']+4)

Student Name: Mark 	 Student Age: 30 	 Age After Graduation: 34
Student Name: Anupam 	 Student Age: 25 	 Age After Graduation: 29
Student Name: Becky 	 Student Age: 18 	 Age After Graduation: 22
Student Name:  Tom 	 Student Age: 45 	 Age After Graduation: 49
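Two further options are worth knowing. `itertuples` yields one namedtuple per row, and simple arithmetic such as adding 4 to every age can be done on the whole column at once without a loop. A sketch under the assumption of a small frame like the one above; `lines` and `age_after` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Mark', 'Anupam', 'Becky', 'Tom'],
    'age': [30, 25, 18, 45],
})

# method 3 - itertuples: each row is a namedtuple, fields are attributes
lines = [
    f"Student Name: {row.name}\t Student Age: {row.age}\t Age After Graduation: {row.age + 4}"
    for row in df.itertuples(index=False)
]
for line in lines:
    print(line)

# method 4 - vectorized: operate on the whole column, no explicit loop
age_after = df['age'] + 4
print(age_after.tolist())
```

For per-row printing a loop is natural, but for computing a new column the vectorized form is usually shorter and faster.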


### DataFrame with a MultiIndex

In [81]:
import numpy as np

arrays = [
    np.array(["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"]),
    np.array(["one", "two", "one", "two", "one", "two", "one", "two"]),
]


In [82]:
df = pd.DataFrame(np.random.randn(8, 4), index=arrays)
df


                0         1         2         3
bar one -0.326090  0.024403 -1.528382  1.023829
    two -0.176471  0.167343  0.150771  0.687471
baz one -0.027088  1.084916  1.800504  0.926192
    two  0.474304  1.096145  1.812577 -0.253702
foo one -0.673919 -0.563285 -0.660790  0.390657
    two -0.125729 -0.098946  1.088000 -2.256558
qux one -0.990914 -0.723095  0.202293 -0.343853
    two  0.861814  0.657186 -1.213568  1.226791


In [83]:
df.index.name = 'index1'

In [84]:
df

                0         1         2         3
bar one -0.326090  0.024403 -1.528382  1.023829
    two -0.176471  0.167343  0.150771  0.687471
baz one -0.027088  1.084916  1.800504  0.926192
    two  0.474304  1.096145  1.812577 -0.253702
foo one -0.673919 -0.563285 -0.660790  0.390657
    two -0.125729 -0.098946  1.088000 -2.256558
qux one -0.990914 -0.723095  0.202293 -0.343853
    two  0.861814  0.657186 -1.213568  1.226791
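A MultiIndex has one name per level, set through `names`, and rows can be selected by the outer label, by a full tuple of labels, or by a cross-section on an inner level. The following is a sketch rather than the notebook's own code: the level names `first` and `second`, the variable `mdf`, and the seeded random generator are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

arrays = [
    ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
    ["one", "two", "one", "two", "one", "two", "one", "two"],
]

# 'first' and 'second' are hypothetical level names chosen for this example
index = pd.MultiIndex.from_arrays(arrays, names=["first", "second"])

# Seeded generator so repeated runs produce the same numbers
rng = np.random.default_rng(0)
mdf = pd.DataFrame(rng.standard_normal((8, 4)), index=index)

outer = mdf.loc["bar"]                # all rows under the outer label 'bar'
single = mdf.loc[("bar", "two")]      # one row, returned as a Series
ones = mdf.xs("one", level="second")  # every 'one' row across outer labels

print(list(mdf.index.names))
print(outer.shape, single.shape, ones.shape)
```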
