# Pandas

## Importing Pandas

In [6]:
import pandas as pd
import numpy as np

## Series in Pandas

In [2]:
s1 = pd.Series([1, 2, 3, 5, -3])

In [3]:
type(s1)

pandas.core.series.Series

In [4]:
print(s1)

0    1
1    2
2    3
3    5
4   -3
dtype: int64


In [7]:
s2 = pd.Series([1, 2, 3, 5, -3], dtype=np.int32)

In [8]:
print(s2)

0    1
1    2
2    3
3    5
4   -3
dtype: int32
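
The `dtype` argument above fixes the element type at construction time. As a minimal sketch, an existing series can also be converted afterwards with `astype`; the variable names `s` and `s32` are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, 5, -3])    # default integer dtype (platform dependent)

# astype returns a new Series; the original is left unchanged.
s32 = s.astype(np.int32)

print(s32.dtype)
print(s32.nbytes)                  # 5 values * 4 bytes each
```

A narrower dtype mainly matters for memory use on large series; the values themselves are the same as long as they fit in the target type.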


Pass a list as an argument to the constructor function to create a series:

In [10]:
x = [1, 2, 3 , 5, -3]
s3 = pd.Series(x)
print(s3)

0    1
1    2
2    3
3    5
4   -3
dtype: int64


Pass a NumPy ndarray as an argument to the constructor function to create a series. The `values` attribute returns the underlying data as an ndarray:

In [11]:
y = np.array(x)
s4 = pd.Series(y)
print(s4.values)

[ 1  2  3  5 -3]


Retrieve the index. When no index is given, Pandas uses a default `RangeIndex` starting at 0:

In [12]:
print(s4.index)

RangeIndex(start=0, stop=5, step=1)


Assign a custom index:

In [13]:
s5 = pd.Series( x, index = ['a', 'b', 'c', 'd', 'e'])
print(s5)

a    1
b    2
c    3
d    5
e   -3
dtype: int64
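
With a custom index in place, elements can be looked up by label as well as by position. The following is a small sketch that rebuilds the same data as `s5`; the names `first`, `subset`, and `by_position` are illustrative.

```python
import pandas as pd

# Same data and labels as s5 above.
s = pd.Series([1, 2, 3, 5, -3], index=['a', 'b', 'c', 'd', 'e'])

first = s['a']              # single label -> scalar
subset = s[['a', 'e']]      # list of labels -> Series
by_position = s.iloc[0]     # positional access, independent of the labels

print(first, by_position)
print(subset)
```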


## Basic Operations on Series

We can display the negative numbers as follows:

In [14]:
print(s5[s5 < 0])

e   -3
dtype: int64


We can also retrieve the positive numbers as follows:

In [15]:
print(s5[s5 > 0])

a    1
b    2
c    3
d    5
dtype: int64
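
Both selections above rely on a boolean mask: the comparison produces a series of `True`/`False` values, and indexing with it keeps only the `True` rows. A brief sketch of the mask itself and of combining two conditions, using the same data as `s5` (the names `mask` and `between` are illustrative):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 5, -3], index=['a', 'b', 'c', 'd', 'e'])

mask = s > 0                # boolean Series with the same index as s
print(mask)

# Combine conditions with & (and), | (or), ~ (not).
# The parentheses around each comparison are required.
between = s[(s > 1) & (s < 5)]
print(between)
```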


To perform a multiplication operation, multiply the series by a scalar. The operation is applied to every element:

In [16]:
c = 3
print(s5 * c)

a     3
b     6
c     9
d    15
e    -9
dtype: int64
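
Arithmetic also works between two series, in which case Pandas pairs the elements by index label rather than by position. The sketch below uses two small made-up series (`s` and `t` are illustrative, not from the notebook) to show what happens to labels that appear in only one of them.

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
t = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# Labels 'a' and 'd' have no partner, so the result is NaN there.
total = s + t
print(total)

# The add method accepts fill_value to treat a missing partner as 0 instead.
filled = s.add(t, fill_value=0)
print(filled)
```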


## Data Frames in Pandas

A dataframe is a two-dimensional labeled data structure with columns that can be of different datatypes. One way to create it is from a dictionary that maps column names to lists of equal length:

In [17]:
data = {'city': ['Mumbai', 'Mumbai', 'Mumbai',
                 'Hyderabad', 'Hyderabad', 'Hyderabad'],
        'year': [2010, 2011, 2012, 2010, 2011, 2012],
        'population': [10.0, 10.1, 10.2, 5.2, 5.3, 5.5]}

In [18]:
df1 = pd.DataFrame(data)
print(df1)

        city  year  population
0     Mumbai  2010        10.0
1     Mumbai  2011        10.1
2     Mumbai  2012        10.2
3  Hyderabad  2010         5.2
4  Hyderabad  2011         5.3
5  Hyderabad  2012         5.5


Display the first five records:

In [19]:
df1.head()

        city  year  population
0     Mumbai  2010        10.0
1     Mumbai  2011        10.1
2     Mumbai  2012        10.2
3  Hyderabad  2010         5.2
4  Hyderabad  2011         5.3


Display the last five records:

In [20]:
df1.tail()

        city  year  population
1     Mumbai  2011        10.1
2     Mumbai  2012        10.2
3  Hyderabad  2010         5.2
4  Hyderabad  2011         5.3
5  Hyderabad  2012         5.5
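
Both `head` and `tail` default to five rows and accept an optional row count. A minimal sketch, rebuilding the same dataframe so the block runs on its own (the names `df`, `top2`, and `last3` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['Mumbai', 'Mumbai', 'Mumbai',
             'Hyderabad', 'Hyderabad', 'Hyderabad'],
    'year': [2010, 2011, 2012, 2010, 2011, 2012],
    'population': [10.0, 10.1, 10.2, 5.2, 5.3, 5.5],
})

top2 = df.head(2)     # first two rows
last3 = df.tail(3)    # last three rows

print(top2)
print(last3)
```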


Create a dataframe with a particular order of columns by passing the `columns` argument:

In [21]:
df2 = pd.DataFrame(data, columns=['year', 'city','population'])
print(df2)

   year       city  population
0  2010     Mumbai        10.0
1  2011     Mumbai        10.1
2  2012     Mumbai        10.2
3  2010  Hyderabad         5.2
4  2011  Hyderabad         5.3
5  2012  Hyderabad         5.5


Next, let’s create a dataframe with an additional column and a custom index. The GDP column is not present in `data`, so it is filled with NaN:

In [22]:
df3 = pd.DataFrame(data, columns=['year', 'city', 'population', 'GDP'], index = ['one', 'two', 'three', 'four', 'five', 'six'])
print(df3)

       year       city  population  GDP
one    2010     Mumbai        10.0  NaN
two    2011     Mumbai        10.1  NaN
three  2012     Mumbai        10.2  NaN
four   2010  Hyderabad         5.2  NaN
five   2011  Hyderabad         5.3  NaN
six    2012  Hyderabad         5.5  NaN


Print the list of columns:

In [23]:
print(df3.columns)

Index(['year', 'city', 'population', 'GDP'], dtype='object')


Display the data of a column, using either attribute access or bracket notation:

In [24]:
print(df3.year)

one      2010
two      2011
three    2012
four     2010
five     2011
six      2012
Name: year, dtype: int64


In [26]:
print(df3['city'])

one         Mumbai
two         Mumbai
three       Mumbai
four     Hyderabad
five     Hyderabad
six      Hyderabad
Name: city, dtype: object


Display the datatype of a column:

In [27]:
print(df3['year'].dtype)

int64


In [28]:
print(df3.year.dtype)

int64


Display the datatypes of all the columns:

In [29]:
print(df3.dtypes)

year            int64
city           object
population    float64
GDP            object
dtype: object


We can retrieve any record by its index label using `loc` as follows:

In [30]:
df3.loc['one']

year            2010
city          Mumbai
population      10.0
GDP              NaN
Name: one, dtype: object
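
`loc` selects by label; its positional counterpart is `iloc`. The sketch below uses a three-row version of the dataframe (the names `df`, `row_by_label`, `row_by_position`, `cell`, and `several` are illustrative) to show both, along with selecting a single cell and a sub-table.

```python
import pandas as pd

df = pd.DataFrame(
    {'year': [2010, 2011, 2012],
     'city': ['Mumbai', 'Mumbai', 'Mumbai'],
     'population': [10.0, 10.1, 10.2]},
    index=['one', 'two', 'three'],
)

row_by_label = df.loc['two']       # row whose index label is 'two'
row_by_position = df.iloc[1]       # second row, regardless of its label

cell = df.loc['two', 'population']                            # one value
several = df.loc[['one', 'three'], ['year', 'population']]    # sub-table

print(row_by_label)
print(cell)
print(several)
```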

Assign the same value to all the members of a column. Attribute-style assignment works here because the GDP column already exists:

In [31]:
df3.GDP = 10
print(df3)

       year       city  population  GDP
one    2010     Mumbai        10.0   10
two    2011     Mumbai        10.1   10
three  2012     Mumbai        10.2   10
four   2010  Hyderabad         5.2   10
five   2011  Hyderabad         5.3   10
six    2012  Hyderabad         5.5   10


We can assign an ndarray to the GDP column as follows. Its length must match the number of rows:

In [32]:
df3.GDP = np.arange(6)
print(df3)

       year       city  population  GDP
one    2010     Mumbai        10.0    0
two    2011     Mumbai        10.1    1
three  2012     Mumbai        10.2    2
four   2010  Hyderabad         5.2    3
five   2011  Hyderabad         5.3    4
six    2012  Hyderabad         5.5    5


We can also assign it a list as follows. Because the list contains a float, the column becomes float64:

In [33]:
df3.GDP = [3, 2, 0, 9, -0.4, 7]
print(df3)

       year       city  population  GDP
one    2010     Mumbai        10.0  3.0
two    2011     Mumbai        10.1  2.0
three  2012     Mumbai        10.2  0.0
four   2010  Hyderabad         5.2  9.0
five   2011  Hyderabad         5.3 -0.4
six    2012  Hyderabad         5.5  7.0


Let’s assign a series to it. The values are matched to rows by index label, and rows without a matching label get NaN:

In [34]:
val = pd.Series([-1.4, 1.5, -1.3], index=['two', 'four', 'five'])
df3.GDP = val
print(df3)

       year       city  population  GDP
one    2010     Mumbai        10.0  NaN
two    2011     Mumbai        10.1 -1.4
three  2012     Mumbai        10.2  NaN
four   2010  Hyderabad         5.2  1.5
five   2011  Hyderabad         5.3 -1.3
six    2012  Hyderabad         5.5  NaN
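
The alignment rule works in both directions: rows with no matching label get NaN, and labels in the series that do not exist in the dataframe are dropped. A small sketch under made-up data (the label `'ten'` and the column name `GDP_filled` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'year': [2010, 2011, 2012]},
                  index=['one', 'two', 'three'])

# 'two' matches a row; 'ten' is not in df.index and is silently dropped.
val = pd.Series([1.5, 9.9], index=['two', 'ten'])
df['GDP'] = val

# Rows without a matching label hold NaN; fillna replaces them
# when a default value makes sense for the data.
df['GDP_filled'] = df['GDP'].fillna(0.0)

print(df)
```

Bracket assignment (`df['GDP'] = ...`) is used here because it also works when the column does not exist yet.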


## Reading Data Stored in CSV Format

In [35]:
df = pd.read_csv('https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv')
df.head(5)

    Country  Region
0   Algeria  AFRICA
1    Angola  AFRICA
2     Benin  AFRICA
3  Botswana  AFRICA
4   Burkina  AFRICA
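
`read_csv` accepts a file path, a URL, or any file-like object. As a sketch that runs without network access, the block below parses a three-row excerpt copied from the output above through `io.StringIO`; the names `csv_text` and `sample` are illustrative.

```python
import io

import pandas as pd

# Three rows copied from the output above, written as CSV text.
csv_text = """Country,Region
Algeria,AFRICA
Angola,AFRICA
Benin,AFRICA
"""

# The first line is used as the header row by default.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)
print(sample['Region'].value_counts())
```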
