## Pandas 

- pandas is a data analysis library for Python.
- It provides a fast and efficient DataFrame object for data manipulation with integrated indexing.
- It includes tools for reading and writing data between in-memory data structures and formats such as CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format.
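As a rough illustration of the read/write tools mentioned in the last bullet, the sketch below writes a small DataFrame to CSV text and reads it back. The sample values and the variable names `original`, `buffer`, and `restored` are made up for this example, and an in-memory `io.StringIO` buffer stands in for a file on disk.

```python
import io

import pandas as pd

# Hypothetical sample data, used only for this illustration.
original = pd.DataFrame({"Country": ["France", "Spain"], "Age": [44.0, 27.0]})

# Write CSV text to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
original.to_csv(buffer, index=False)  # index=False: do not write the row labels as a column

# Rewind the buffer and read the CSV text back into a new DataFrame.
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored)
```

Passing a file path instead of the buffer to `to_csv` and `read_csv` works the same way for files on disk.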

In [1]:
import pandas as pd
print(pd.__version__)

1.5.3


In [3]:
df = pd.read_csv('data.csv')
df # dataframe

   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
3    Spain  38.0  61000.0        No
4  Germany  40.0      NaN       Yes
5   France  35.0  58000.0       Yes
6    Spain   NaN  52000.0        No
7   France  48.0  79000.0       Yes
8  Germany  50.0  83000.0        No
9   France  37.0  67000.0       Yes
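The file `data.csv` itself is not part of this notebook. For readers who want to follow along without it, here is a minimal sketch that rebuilds an equivalent DataFrame by hand. The values are copied from the output above, and it is an assumption that the two blank cells are missing values, represented here with `np.nan`.

```python
import numpy as np
import pandas as pd

# Values copied from the displayed output; np.nan stands in for the two blank cells.
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
    "Purchased": ["No", "Yes", "No", "No", "Yes",
                  "Yes", "No", "Yes", "No", "Yes"],
})

print(df.shape)
```

Because each numeric list contains a `np.nan`, pandas stores `Age` and `Salary` as floating-point columns.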


In [6]:
df.head() # first 5 rows

   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
3    Spain  38.0  61000.0        No
4  Germany  40.0      NaN       Yes


In [5]:
df.tail() # last 5 rows

   Country   Age   Salary Purchased
5   France  35.0  58000.0       Yes
6    Spain   NaN  52000.0        No
7   France  48.0  79000.0       Yes
8  Germany  50.0  83000.0        No
9   France  37.0  67000.0       Yes


In [7]:
df.head(3) # first 3 rows

   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No


## DataFrame attributes

In [15]:
print(df.columns)  # names of all columns
print('-'*20)
print(df.dtypes)   # data type of each column
print('-'*20)
print(df.shape)    # (number of rows, number of columns)
print('-'*20)
print(df.index)    # row labels
print('-'*20)
print(df.info())   # info() prints its summary itself and returns None, hence the trailing "None"

Index(['Country', 'Age', 'Salary', 'Purchased'], dtype='object')
--------------------
Country       object
Age          float64
Salary       float64
Purchased     object
dtype: object
--------------------
(10, 4)
--------------------
RangeIndex(start=0, stop=10, step=1)
--------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Country    10 non-null     object 
 1   Age        9 non-null      float64
 2   Salary     9 non-null      float64
 3   Purchased  10 non-null     object 
dtypes: float64(2), object(2)
memory usage: 448.0+ bytes
None
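The non-null counts reported by `info()` show that `Age` and `Salary` each have one missing value. A small sketch of how the missing entries could be counted and located directly follows. It uses a stand-in frame holding only the two numeric columns, with values copied from the output earlier in this notebook, and the names `missing_per_column` and `rows_with_missing` are chosen just for this example.

```python
import numpy as np
import pandas as pd

# Stand-in for the numeric columns of df (values copied from the earlier output).
df = pd.DataFrame({
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
})

# isna() marks missing cells with True; summing counts them per column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Keep only the rows that have at least one missing value.
rows_with_missing = df[df.isna().any(axis=1)]
print(rows_with_missing)
```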


In [16]:
# describe() provides summary statistics for the numeric columns
print(df.describe())

             Age        Salary
count   9.000000      9.000000
mean   38.777778  63777.777778
std     7.693793  12265.579662
min    27.000000  48000.000000
25%    35.000000  54000.000000
50%    38.000000  61000.000000
75%    44.000000  72000.000000
max    50.000000  83000.000000
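The `count` row above is 9 rather than 10 because these statistics skip missing values. The sketch below illustrates that behaviour on the `Age` values alone (copied from the earlier output). The Series name `age` is an assumption made for this example.

```python
import numpy as np
import pandas as pd

# Age values copied from the earlier output, including the one missing entry.
age = pd.Series([44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37], name="Age")

print(age.count())              # number of non-missing values
print(age.mean())               # mean over the non-missing values only
print(age.sum() / age.count())  # the same mean, computed by hand
print(age.mean(skipna=False))   # NaN, because the missing value is no longer skipped
```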


## Indexing with loc and iloc

In [23]:
print(df.iloc[0,0])      # value in the first row, first column
print('-'*20)
print(df.iloc[:3,0])     # first 3 rows of column 0
print('-'*20)
print(df.iloc[:3,:])     # first 3 rows, all columns
print('-'*20)
print(df.iloc[:,:2])     # all rows, first 2 columns
print('-'*20)
print(df.iloc[::-1,0])   # column 0 with the rows in reverse order
print('-'*20)
print(df.iloc[:5,::-1])  # first 5 rows with the columns in reverse order

France
--------------------
0     France
1      Spain
2    Germany
Name: Country, dtype: object
--------------------
   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
--------------------
   Country   Age
0   France  44.0
1    Spain  27.0
2  Germany  30.0
3    Spain  38.0
4  Germany  40.0
5   France  35.0
6    Spain   NaN
7   France  48.0
8  Germany  50.0
9   France  37.0
--------------------
9     France
8    Germany
7     France
6      Spain
5     France
4    Germany
3      Spain
2    Germany
1      Spain
0     France
Name: Country, dtype: object
--------------------
  Purchased   Salary   Age  Country
0        No  72000.0  44.0   France
1       Yes  48000.0  27.0    Spain
2        No  54000.0  30.0  Germany
3        No  61000.0  38.0    Spain
4       Yes      NaN  40.0  Germany


In [27]:
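# using the loc indexer (label-based; the end of a slice is included)

print(df.loc[:3,"Country"])            # rows labeled 0 through 3 (inclusive) of the Country column
print('-'*20)
print(df.loc[:4,"Country":"Age"])      # rows 0 through 4, columns Country through Age
print('-'*20)
print(df.loc[:4,"Country":"Salary"])   # rows 0 through 4, columns Country through Salary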


0     France
1      Spain
2    Germany
3      Spain
Name: Country, dtype: object
--------------------
   Country   Age
0   France  44.0
1    Spain  27.0
2  Germany  30.0
3    Spain  38.0
4  Germany  40.0
--------------------
   Country   Age   Salary
0   France  44.0  72000.0
1    Spain  27.0  48000.0
2  Germany  30.0  54000.0
3    Spain  38.0  61000.0
4  Germany  40.0      NaN
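The outputs above hint at the main difference between the two indexers: `iloc` selects by integer position and excludes the end of a slice, while `loc` selects by label and includes it. The following sketch makes that explicit on a small Series. The name `countries` and its values are illustrative only.

```python
import pandas as pd

countries = pd.Series(["France", "Spain", "Germany", "Spain", "Germany"])

first_three_by_position = countries.iloc[:3]  # positions 0, 1, 2 (stop is excluded)
up_to_label_three = countries.loc[:3]         # labels 0, 1, 2, 3 (stop is included)
print(len(first_three_by_position), len(up_to_label_three))

# Once the rows are reordered, position and label no longer coincide.
reversed_rows = countries.iloc[::-1]  # the index is now 4, 3, 2, 1, 0
print(reversed_rows.iloc[0])  # first row by position
print(reversed_rows.loc[0])   # row whose label is 0
```

With the default `RangeIndex` the two indexers often look interchangeable, which is why the difference tends to surface only after sorting or filtering.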


## Condition-Based Access

In [36]:
print(df['Country']) # or df.Country
print('-'*20)
print(df.Country == 'France')
print('-'*20)
mask = df.Country == 'France'  # build a boolean mask, then use it to select rows
print(df.Country[mask])

0     France
1      Spain
2    Germany
3      Spain
4    Germany
5     France
6      Spain
7     France
8    Germany
9     France
Name: Country, dtype: object
--------------------
0     True
1    False
2    False
3    False
4    False
5     True
6    False
7     True
8    False
9     True
Name: Country, dtype: bool
--------------------
0    France
5    France
7    France
9    France
Name: Country, dtype: object


In [37]:
# access all rows where Country is Germany

Germ_mask = df.Country == 'Germany'
df[Germ_mask]

   Country   Age   Salary Purchased
2  Germany  30.0  54000.0        No
4  Germany  40.0      NaN       Yes
8  Germany  50.0  83000.0        No


In [46]:
# access Country and Salary for all rows with a salary above 60000

sal = df.Salary > 60000
df.loc[sal, ['Country','Salary']]  # select rows and columns in a single step

   Country   Salary
0   France  72000.0
3    Spain  61000.0
7   France  79000.0
8  Germany  83000.0
9   France  67000.0


In [50]:
# access Country and Purchased for all rows where a purchase was made

pur = df.Purchased == 'Yes'
df.loc[pur, ['Country','Purchased']]

   Country Purchased
1    Spain       Yes
4  Germany       Yes
5   France       Yes
7   France       Yes
9   France       Yes


In [88]:
# combine two conditions with & (each condition needs its own parentheses)

cs = (df.Country == 'France') & (df.Salary > 60000)
df.loc[cs, ['Country','Salary']]

  Country   Salary
0  France  72000.0
7  France  79000.0
9  France  67000.0
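Masks can be combined in other ways than `&`. The sketch below shows `|` (or), `~` (not), and `isin` on a stand-in frame built from the first five rows shown earlier. The variable names are chosen for this example. Note that a comparison against a missing salary evaluates to `False`, so such rows are only selected when another condition matches.

```python
import numpy as np
import pandas as pd

# Stand-in data: Country and Salary of the first five rows shown earlier.
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany"],
    "Salary": [72000.0, 48000.0, 54000.0, 61000.0, np.nan],
})

high_salary = df["Salary"] > 60000  # False for the row with a missing salary

# OR: rows that are from Germany or have a high salary.
either = df[(df["Country"] == "Germany") | high_salary]

# NOT: every row except those from Spain.
not_spain = df[~(df["Country"] == "Spain")]

# Membership test against several values at once.
france_or_germany = df[df["Country"].isin(["France", "Germany"])]

print(either)
print(not_spain)
print(france_or_germany)
```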


## Data Frames

### Creating Series and DataFrames

A DataFrame is a two-dimensional, labeled table. Each of its columns is a Series, a one-dimensional labeled array, so a Series can be thought of as a table with a single column.



In [63]:
# creating a Series from a list

import pandas as pd

l = [1,2,3,4,5,6]
ds_list = pd.Series(l)
print(ds_list)
print(type(ds_list))


0    1
1    2
2    3
3    4
4    5
5    6
dtype: int64
<class 'pandas.core.series.Series'>


In [65]:
# creating a Series from a NumPy array

import numpy as np
A = np.random.randint(1,10,10)  # 10 random integers from 1 to 9 (the upper bound 10 is excluded)
ds_numpy = pd.Series(A)
print(type(ds_numpy))
print(ds_numpy)


<class 'pandas.core.series.Series'>
0    2
1    5
2    4
3    4
4    5
5    9
6    7
7    7
8    8
9    1
dtype: int32
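The values above change on every run because the random generator is unseeded. A minimal sketch of a reproducible variant follows, using NumPy's `default_rng`. The seed value 42 and the names `rng`, `ds_seeded`, and `repeat` are arbitrary choices for this example, not something the notebook itself uses.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)   # the seed value is an arbitrary choice
values = rng.integers(1, 10, size=10)  # 10 integers from 1 to 9; the upper bound is excluded
ds_seeded = pd.Series(values)
print(ds_seeded)

# A second generator created with the same seed yields the same numbers.
repeat = pd.Series(np.random.default_rng(seed=42).integers(1, 10, size=10))
print(ds_seeded.equals(repeat))
```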


In [66]:
# Series with a custom index: the dict keys become the index labels

d = {'a':10, 'b':30, 'c':50, 'd':70}
ds_dict = pd.Series(d)
print(ds_dict)

a    10
b    30
c    50
d    70
dtype: int64
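The same custom index can also be supplied explicitly through the `index` argument instead of coming from dict keys. The sketch below does that and then reads values back by label and by position. The name `ds_indexed` is an assumption for this example.

```python
import pandas as pd

# Same values as above, with the labels passed through the index argument.
ds_indexed = pd.Series([10, 30, 50, 70], index=["a", "b", "c", "d"])

print(ds_indexed["b"])          # access by label
print(ds_indexed.iloc[1])       # access by position (same element)
print(ds_indexed.loc["b":"c"])  # label slices include both endpoints
print(ds_indexed.to_dict())     # back to a plain dict
```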


In [82]:
q = {
    'names': ['goldy', 'pinky', 'bubbly', 'rocky', 'bheem', 'ram', 'raj'],
    'age': [50, 40, 25, 21, 37, 27, 41],
    'place': ['dli', 'chn', 'mys', 'dvg', 'dli', 'jpr', 'pune'],
    'branch': ['cse', 'ise', 'ece', 'cse', 'mech', 'cse', 'ece']
}

df_dict = pd.DataFrame(q)
df_dict

    names  age place branch
0   goldy   50   dli    cse
1   pinky   40   chn    ise
2  bubbly   25   mys    ece
3   rocky   21   dvg    cse
4   bheem   37   dli   mech
5     ram   27   jpr    cse
6     raj   41  pune    ece


In [85]:
print(df_dict.branch.unique())       # distinct values, in order of first appearance
print(len(df_dict.branch.unique()))  # number of distinct values

['cse' 'ise' 'ece' 'mech']
4


In [86]:
print(df_dict.branch.value_counts())

cse     3
ece     2
ise     1
mech    1
Name: branch, dtype: int64


In [87]:
print(df_dict.place.value_counts())

dli     2
chn     1
mys     1
dvg     1
jpr     1
pune    1
Name: place, dtype: int64
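Building on the counts above, the sketch below shows two closely related summaries: relative frequencies via `value_counts(normalize=True)` and a per-group statistic via `groupby`. It rebuilds only the `age` and `branch` columns of `df_dict`, and the names `shares`, `mean_age_by_branch`, and `distinct_branches` are chosen for this example.

```python
import pandas as pd

# The age and branch columns of df_dict, copied from the cell above.
df_dict = pd.DataFrame({
    "age": [50, 40, 25, 21, 37, 27, 41],
    "branch": ["cse", "ise", "ece", "cse", "mech", "cse", "ece"],
})

# Share of rows per branch instead of raw counts.
shares = df_dict["branch"].value_counts(normalize=True)
print(shares)

# Mean age within each branch.
mean_age_by_branch = df_dict.groupby("branch")["age"].mean()
print(mean_age_by_branch)

# nunique() gives the number of distinct values directly.
distinct_branches = df_dict["branch"].nunique()
print(distinct_branches)
```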
