
In [2]:
import numpy as np
import pandas as pd

# **1. Series**

### **1. Create Series:**

In [3]:
# Using a list
list_1=['a','b','c','d']
labels_1=[1,2,3,4]

series_1 = pd.Series(data=list_1, index=labels_1)
series_1

1    a
2    b
3    c
4    d
dtype: object

In [4]:
# Using a numpy array
arr_1 = np.array([1,2,3,4])

series_2 = pd.Series(arr_1)
series_2

0    1
1    2
2    3
3    4
dtype: int64

In [5]:
# Using a dictionary
dict_1={'name':'Jai', 'surname':'Pawar', 'age':20}

series_3 = pd.Series(dict_1)
series_3

name         Jai
surname    Pawar
age           20
dtype: object

### **2. Access data in series**

In [6]:
series_3['name']

'Jai'

In [7]:
# Get data type
series_2.dtype

dtype('int64')

### **3. Math operations**

In [8]:
series_2 + series_2

0    2
1    4
2    6
3    8
dtype: int64

In [9]:
series_2 / series_2

0    1.0
1    1.0
2    1.0
3    1.0
dtype: float64

In [10]:
# Pass them into numpy functions
np.exp(series_2)

0     2.718282
1     7.389056
2    20.085537
3    54.598150
dtype: float64

The main difference between a Series and a NumPy array is that Series operations align values by index label rather than by position:

In [11]:
series_4=pd.Series({4:5,5:6,6:7,7:8})
series_2 + series_4

0   NaN
1   NaN
2   NaN
3   NaN
4   NaN
5   NaN
6   NaN
7   NaN
dtype: float64

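Every result is NaN because the two Series share no index labels: `series_2` is labelled 0 to 3 and `series_4` is labelled 4 to 7, so the result covers all eight labels and no label has a value on both sides.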
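
When the indexes overlap only partly, the shared labels are combined and the rest become NaN. The following is a minimal sketch of that behaviour; the names `left` and `right` and their values are made up for illustration and are not part of the notebook. It also shows `Series.add` with `fill_value`, which treats a label missing from one side as the given value.

```python
import numpy as np
import pandas as pd

# Two illustrative Series whose indexes overlap on 'b' and 'c' only
left = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
right = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# Values are matched by label, not by position
aligned = left + right

# fill_value=0 treats a label missing from one side as 0 instead of NaN
filled = left.add(right, fill_value=0)

print(aligned)
print(filled)
```

Labels `a` and `d` come out as NaN in `aligned` but keep their single-sided values in `filled`.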

In [12]:
# Assign names to series
series_5 = pd.Series({1:2, 3:4}, name='yeehaw')
series_5.name

'yeehaw'

# **2. Dataframe**

### **1. Creating dataframes**

In [13]:
# Using numpy array
arr_2 = np.random.randint(10, 50, size=(2, 3))
df_1 = pd.DataFrame(arr_2, index=['A', 'B'], columns=['C', 'D', 'E'])
df_1

Unnamed: 0,C,D,E
A,30,31,28
B,37,45,34


In [14]:
# Using multiple Series
dict_3 = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']), 'two': pd.Series([4, 5, 6, 7], index=['a', 'b', 'c', 'd'])}
df_2 = pd.DataFrame(dict_3)
df_2

Unnamed: 0,one,two
a,1.0,4
b,2.0,5
c,3.0,6
d,,7


In [15]:
# Using a dictionary
# from_dict() takes 3 params: data (supplied below); orient (default 'columns', as below; pass 'index' to use the keys as row labels);
# columns (list of column labels, used only when orient='index')
df_3 = pd.DataFrame.from_dict(dict([('A',[1,2,3]),('B',[4,5,6])]))
df_3

Unnamed: 0,A,B
0,1,4
1,2,5
2,3,6
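
The `orient='index'` variant mentioned in the comment can be sketched as follows. The variable name `df_by_row` and the column labels `x`, `y`, `z` are illustrative assumptions, not names from the notebook.

```python
import pandas as pd

data = {'A': [1, 2, 3], 'B': [4, 5, 6]}

# orient='index' turns each dictionary key into a row label;
# columns= supplies the column labels for that orientation
df_by_row = pd.DataFrame.from_dict(data, orient='index', columns=['x', 'y', 'z'])
print(df_by_row)
```

Compared with the default orientation, the keys `A` and `B` become the row index and each list is laid out across one row.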


In [16]:
# Shape of df
df_1.shape

(2, 3)

### **2. Accessing/Editing data**

In [17]:
# Accessing a column using a label
df_1['C']

A    30
B    37
Name: C, dtype: int64

In [18]:
df_1[['C','D']]

Unnamed: 0,C,D
A,30,31
B,37,45


In [19]:
# Accessing a row as a series
df_1.loc['A']

C    30
D    31
E    28
Name: A, dtype: int64

In [20]:
df_1.iloc[1]

C    37
D    45
E    34
Name: B, dtype: int64

In [21]:
# Accessing rows and columns
df_1.loc[['A','B'], ['D','E']]

Unnamed: 0,D,E
A,31,28
B,45,34


In [22]:
# Create a new column
df_1['Total']=df_1['C']+df_1['D']+df_1['E']
df_1

Unnamed: 0,C,D,E,Total
A,30,31,28,89
B,37,45,34,116


In [23]:
# Create a new row
dict_2 = {'C':44, 'D':45, 'E': 46}
new_row = pd.Series(dict_2, name='F')
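# Note: DataFrame.append() was removed in pandas 2.0; there, use pd.concat([df_1, new_row.to_frame().T])
df_1 = df_1.append(new_row)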
df_1

Unnamed: 0,C,D,E,Total
A,30.0,31.0,28.0,89.0
B,37.0,45.0,34.0,116.0
F,44.0,45.0,46.0,
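
`DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0. Below is a sketch of the same row addition with `pd.concat`, run on a small stand-in frame (`df_demo`, with hypothetical values) so that it does not depend on the random data above.

```python
import pandas as pd

# Stand-in for df_1; the values are illustrative only
df_demo = pd.DataFrame({'C': [30, 37], 'D': [31, 45], 'E': [28, 34]},
                       index=['A', 'B'])
new_row = pd.Series({'C': 44, 'D': 45, 'E': 46}, name='F')

# to_frame().T turns the Series into a one-row DataFrame labelled 'F',
# which concat then stacks under the existing rows
df_demo = pd.concat([df_demo, new_row.to_frame().T])
print(df_demo)
```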


In [24]:
# Delete columns
df_1.drop('Total', axis=1, inplace=True)
df_1

Unnamed: 0,C,D,E
A,30.0,31.0,28.0
B,37.0,45.0,34.0
F,44.0,45.0,46.0


In [25]:
# Delete rows
df_1.drop('B', axis=0, inplace=True)
df_1

Unnamed: 0,C,D,E
A,30.0,31.0,28.0
F,44.0,45.0,46.0


In [26]:
# Set a column as index
df_1['Sex'] = ['Men', 'Women']
df_1.set_index('Sex', inplace=True)
df_1

Sex,C,D,E
Men,30.0,31.0,28.0
Women,44.0,45.0,46.0


In [27]:
# Replace index with numbers
df_1.reset_index(inplace=True)
df_1

Unnamed: 0,Sex,C,D,E
0,Men,30.0,31.0,28.0
1,Women,44.0,45.0,46.0


In [28]:
# Adding a column using .assign()
# Like bracket assignment, assign() places the new column at the end (DataFrame.insert() can place one at a chosen position)
# Unlike bracket assignment, it returns a copy of the df with the new column (or the overwritten one), leaving the original df untouched

df_4= df_1.assign(J=df_1['C']*df_1['D'])
print(df_4)
print(df_1)

     Sex     C     D     E       J
0    Men  30.0  31.0  28.0   930.0
1  Women  44.0  45.0  46.0  1980.0
     Sex     C     D     E
0    Men  30.0  31.0  28.0
1  Women  44.0  45.0  46.0
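
To contrast where the new column ends up, here is a short sketch that uses `assign()` next to `DataFrame.insert()`. The frame `df_demo` and the names `with_j` and `positioned` are stand-ins chosen for illustration.

```python
import pandas as pd

df_demo = pd.DataFrame({'C': [30.0, 44.0], 'D': [31.0, 45.0], 'E': [28.0, 46.0]})

# assign() returns a new frame; the new column lands at the end
with_j = df_demo.assign(J=df_demo['C'] * df_demo['D'])

# insert() modifies the frame in place and takes an explicit position,
# so work on a copy to keep df_demo unchanged
positioned = df_demo.copy()
positioned.insert(1, 'J', positioned['C'] * positioned['D'])

print(list(with_j.columns))
print(list(positioned.columns))
print(list(df_demo.columns))
```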


In [29]:
# Chained indexing
# Setting values through chained indexing may modify a temporary copy instead of the original df (see the link in the warning below)
df_4['D'][0]=float('nan')
df_4

SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


Unnamed: 0,Sex,C,D,E,J
0,Men,30.0,,28.0,930.0
1,Women,44.0,45.0,46.0,1980.0


In [45]:
# Hence, loc should be used instead of chained indexing
df_4.loc[1,'D']=float('nan')
df_4

Unnamed: 0,Sex,C,D,E,J
1,Women,44.0,,46.0,1980.0
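
The single-step form can be sketched on a small stand-in frame (`df_demo`, hypothetical values). Because `.loc` receives the row and column selectors in one call, pandas assigns directly into the frame instead of into an intermediate object.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'C': [30.0, 44.0], 'D': [31.0, 45.0]})

# One indexing call: row label and column label together
df_demo.loc[0, 'D'] = np.nan

# The same form works with a boolean row mask
df_demo.loc[df_demo['C'] > 40, 'C'] = 0.0

print(df_demo)
```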


In [47]:
# Combine DataFrames: combine_first() fills NaN elements of the calling DataFrame with the corresponding elements of the other DataFrame
df_4.combine_first(df_1)

Unnamed: 0,C,D,E,J,Sex
0,30.0,31.0,28.0,,Men
1,44.0,45.0,46.0,1980.0,Women
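
A self-contained sketch of the same behaviour, using two made-up frames (`primary` and `fallback` are illustrative names): values already present in the calling frame are kept, gaps are filled from the other frame, and the result covers the union of both frames' rows and columns.

```python
import numpy as np
import pandas as pd

primary = pd.DataFrame({'x': [1.0, np.nan], 'y': [np.nan, 4.0]})
fallback = pd.DataFrame({'x': [10.0, 20.0], 'y': [30.0, 40.0], 'z': [5.0, 6.0]})

# Non-null values in primary win; holes are filled from fallback.
# Column 'z' exists only in fallback and is carried over as is.
combined = primary.combine_first(fallback)
print(combined)
```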


In [54]:
# Conditional indexing
# Comparison operators return a boolean DataFrame (or Series) of the same shape
# gt, lt, ge, le, eq & ne are the method equivalents of the normal python operators (>,<,>=,<=,==,!=)
print(df_2)

print(df_2>2)
df_2.gt(2)

   one  two
a  1.0    4
b  2.0    5
c  3.0    6
d  NaN    7
     one   two
a  False  True
b  False  True
c   True  True
d  False  True


Unnamed: 0,one,two
a,False,True
b,False,True
c,True,True
d,False,True


In [55]:
# Storing the condition result in a variable
bool_1 = df_2>5
df_2[bool_1]

Unnamed: 0,one,two
a,,
b,,
c,,6.0
d,,7.0


In [57]:
# Conditions for a single column
df_2['two']>5

a    False
b    False
c     True
d     True
Name: two, dtype: bool

In [58]:
# Return row only if cell value in column matches a condition
df_2[df_2['two']>5]

Unnamed: 0,one,two
c,3.0,6
d,,7


In [59]:
# Selections can be stacked
df_2[df_2['two']>5]['one']

c    3.0
d    NaN
Name: one, dtype: float64

In [61]:
# Multiple conditions
df_2[(df_2['two']>5) & (df_2['two']<7)]

Unnamed: 0,one,two
c,3.0,6
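
Conditions are combined with `&` (and), `|` (or) and `~` (not), each condition in its own parentheses; Python's `and`/`or` raise a `ValueError` on a Series because its truth value is ambiguous. The sketch below rebuilds a frame with the same values as `df_2` under the stand-in name `df_demo`.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'one': [1.0, 2.0, 3.0, np.nan], 'two': [4, 5, 6, 7]},
                       index=['a', 'b', 'c', 'd'])

# | is element-wise OR; each condition needs its own parentheses
either = df_demo[(df_demo['two'] < 5) | (df_demo['two'] > 6)]

# ~ negates a boolean mask: keep rows where 'one' is not missing
not_missing = df_demo[~df_demo['one'].isna()]

print(either)
print(not_missing)
```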


### **3. Importing data**
Pandas has built-in readers for CSV and other delimited text files, JSON, XML, SQL, HTML tables, Excel (XLSX), HDF5 (Hierarchical Data Format) and more. These readers can also open compressed files such as ZIP archives.
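
As a quick illustration that needs no file on disk, the sketch below parses CSV text from an in-memory buffer. The sample rows and the name `df_csv` are made up for this example.

```python
import io
import pandas as pd

# Illustrative CSV content; in practice this would be a file path or URL
csv_text = "name,age\nJai,20\nAsha,31\n"

# read_csv accepts a path, a URL, or any file-like object
df_csv = pd.read_csv(io.StringIO(csv_text))

print(df_csv)
print(df_csv.dtypes)
```

The first line of the text is taken as the header row, and column types are inferred from the values.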

In [None]:
# Read from a csv