## Data Frame
- A DataFrame is a two-dimensional, labelled data structure: data is arranged in a table of rows and columns.

#### Features
- Columns can be of different Datatypes
- Size is Mutable
- Labelled Axes (Rows and Columns)
- Arithmetic Operations can be performed on Columns
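
As a minimal sketch of the four features above (the `inventory` frame and its values are made up for illustration and are not part of this notebook):

```python
import pandas as pd

# Hypothetical data, used only to illustrate the feature list above.
inventory = pd.DataFrame(
    {"item": ["pen", "book"], "price": [1.5, 12.0], "qty": [10, 3]},
    index=["r1", "r2"],                      # labelled rows
)

print(inventory.dtypes)                      # one dtype per column

# Arithmetic between columns; assigning the result adds a column (mutable size).
inventory["total"] = inventory["price"] * inventory["qty"]

print(inventory.loc["r2", "total"])          # label-based access to one cell
```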

#### Topics Covered
* Create Data Frame using Array
* Create Data Frame using Dictionary
* 1. Slicing
* 2. Basic Operations
* 3. Merging
* 4. Joining
* 5. Concatenation
* 6. Set any Column in DataFrame as Index Column
* 7. Change Column Name
* 8. Selecting Elements in DataFrames in 2 ways(loc & iloc)
* 9. Useful for Data Preprocessing
* 10. Convert DataFrame to Array (DataFrame.to_numpy)
* 11. DataFrame Manipulations
* 12. groupby
* 13. Imputer method to handle Missing Values

In [1]:
import pandas as pd
import numpy as np

#### Create Data Frame using Series

In [2]:
dfData = {'C1': pd.Series([1,2,3,4], index=['R1','R2','R3','R4']), 'C2': pd.Series([5,6,7], index=['R1','R2','R4'])}
dfData

{'C1': R1    1
 R2    2
 R3    3
 R4    4
 dtype: int64,
 'C2': R1    5
 R2    6
 R4    7
 dtype: int64}

In [3]:
dfs = pd.DataFrame(dfData)
dfs

Unnamed: 0,C1,C2
R1,1,5.0
R2,2,6.0
R3,3,
R4,4,7.0


In [4]:
dfs['C3'] = pd.Series([12, 13, 14], index = ['R3','R1','R5'])
dfs

Unnamed: 0,C1,C2,C3
R1,1,5.0,13.0
R2,2,6.0,
R3,3,,12.0
R4,4,7.0,
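
The cell above relies on index alignment: a Series assigned as a column is matched to the DataFrame by label, not by position. A small sketch of that behaviour, using a reduced frame (`frame` and `new_col` are illustrative names):

```python
import pandas as pd

frame = pd.DataFrame({"C1": [1, 2, 3, 4]}, index=["R1", "R2", "R3", "R4"])

# The Series is matched to the frame by label, not by position.
new_col = pd.Series([12, 13, 14], index=["R3", "R1", "R5"])
frame["C3"] = new_col

print(frame)
# 'R5' does not exist in the frame, so its value (14) is not inserted;
# 'R2' and 'R4' have no matching label in the Series, so they become NaN.
```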


#### Create Data Frame using Array Directly

In [5]:
dfn= pd.DataFrame([[900,9],[800,8],[700,7]], index=['R1','R2','R3'], columns=['C1','C2'])
dfn

Unnamed: 0,C1,C2
R1,900,9
R2,800,8
R3,700,7


#### Create Data Frame using Array

In [6]:
arr = np.arange(20).reshape(4,5)
arr

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19]])

In [7]:
[[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19]]

[[0, 1, 2, 3, 4], [5, 6, 7, 8, 9], [10, 11, 12, 13, 14], [15, 16, 17, 18, 19]]

In [8]:
df = pd.DataFrame(arr, index=['R1','R2','R3','R4'], columns=['C1','C2','C3','C4','C5'])
df

Unnamed: 0,C1,C2,C3,C4,C5
R1,0,1,2,3,4
R2,5,6,7,8,9
R3,10,11,12,13,14
R4,15,16,17,18,19


#### Create Data Frame using Dictionary

In [9]:
# Creating Dictionary - Storing dictionary in a Variable 
BookShop = {"Day":[1,2,3,4,5,6], "Visitors":[100, 120, 500, 100, 460, 120], "BooksSold":[90, 40,300,87,600,40]}
print(type(BookShop))
BookShop

<class 'dict'>


{'Day': [1, 2, 3, 4, 5, 6],
 'Visitors': [100, 120, 500, 100, 460, 120],
 'BooksSold': [90, 40, 300, 87, 600, 40]}

In [10]:
df1 = pd.DataFrame(BookShop)
df1

Unnamed: 0,Day,Visitors,BooksSold
0,1,100,90
1,2,120,40
2,3,500,300
3,4,100,87
4,5,460,600
5,6,120,40


### Operations in DataFrame

#### 1. Slicing

In [11]:
df1.head()              # head() returns the first 5 rows by default

Unnamed: 0,Day,Visitors,BooksSold
0,1,100,90
1,2,120,40
2,3,500,300
3,4,100,87
4,5,460,600


In [12]:
df1.tail()              # tail() returns the last 5 rows by default

Unnamed: 0,Day,Visitors,BooksSold
1,2,120,40
2,3,500,300
3,4,100,87
4,5,460,600
5,6,120,40


In [13]:
df1.head(3)

Unnamed: 0,Day,Visitors,BooksSold
0,1,100,90
1,2,120,40
2,3,500,300


In [14]:
df1.tail(3)

Unnamed: 0,Day,Visitors,BooksSold
3,4,100,87
4,5,460,600
5,6,120,40


#### 2. Basic Operations

In [15]:
df1.shape               # shape

(6, 3)

In [16]:
df2 = pd.DataFrame({"Day":[4,7,1,10,2,9], "Visitors":[100, 120, 510, 100, 450, 110], "BooksRented":[90, 40,300,87,600,40]})
df2

Unnamed: 0,Day,Visitors,BooksRented
0,4,100,90
1,7,120,40
2,1,510,300
3,10,100,87
4,2,450,600
5,9,110,40


In [17]:
df1

Unnamed: 0,Day,Visitors,BooksSold
0,1,100,90
1,2,120,40
2,3,500,300
3,4,100,87
4,5,460,600
5,6,120,40


In [18]:
df1['Visitors'].value_counts()  #  .value_counts() --> Counts how many times each distinct value occurs

100    2
120    2
460    1
500    1
Name: Visitors, dtype: int64

In [19]:
df1['Visitors'] > 100          # Conditional Check

0    False
1     True
2     True
3    False
4     True
5     True
Name: Visitors, dtype: bool

In [20]:
df1

Unnamed: 0,Day,Visitors,BooksSold
0,1,100,90
1,2,120,40
2,3,500,300
3,4,100,87
4,5,460,600
5,6,120,40


In [21]:
df1[df1['Visitors'] > 100]    # df1['Visitors'] > 100 --> Boolean mask; only rows where it is True are kept

Unnamed: 0,Day,Visitors,BooksSold
1,2,120,40
2,3,500,300
4,5,460,600
5,6,120,40


In [22]:
df2[df1['Visitors'] > 100]     # The mask comes from df1 and is matched to df2 by index label, so df2 rows 1, 2, 4, 5 are kept

Unnamed: 0,Day,Visitors,BooksRented
1,7,120,40
2,1,510,300
4,2,450,600
5,9,110,40
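
The filtering cells above all pass a boolean Series (a "mask") inside the square brackets. The sketch below rebuilds the book shop data under the illustrative name `shop` and shows how two conditions can be combined; the second condition is an assumption added for the example.

```python
import pandas as pd

shop = pd.DataFrame({"Day": [1, 2, 3, 4, 5, 6],
                     "Visitors": [100, 120, 500, 100, 460, 120],
                     "BooksSold": [90, 40, 300, 87, 600, 40]})

mask = shop["Visitors"] > 100          # boolean Series, one entry per row
print(mask.dtype)

# Combine conditions with & (and), | (or), ~ (not); the parentheses are required.
busy_low_sales = shop[(shop["Visitors"] > 100) & (shop["BooksSold"] < 100)]
print(busy_low_sales)
```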


In [23]:
df2

Unnamed: 0,Day,Visitors,BooksRented
0,4,100,90
1,7,120,40
2,1,510,300
3,10,100,87
4,2,450,600
5,9,110,40


In [24]:
df1['Visitors'].unique()       # unique() --> Gives unique values 

array([100, 120, 500, 460], dtype=int64)

In [25]:
df1.dtypes                     # dtypes   --> Gives Datatype of columns

Day          int64
Visitors     int64
BooksSold    int64
dtype: object

In [26]:
df1

Unnamed: 0,Day,Visitors,BooksSold
0,1,100,90
1,2,120,40
2,3,500,300
3,4,100,87
4,5,460,600
5,6,120,40


In [27]:
df1.describe()                  # describe() --> Summary statistics; numeric columns only by default (include='all' covers the rest)

Unnamed: 0,Day,Visitors,BooksSold
count,6.0,6.0,6.0
mean,3.5,233.333333,192.833333
std,1.870829,191.694201,221.702879
min,1.0,100.0,40.0
25%,2.25,105.0,51.75
50%,3.5,120.0,88.5
75%,4.75,375.0,247.5
max,6.0,500.0,600.0


In [28]:
df1.info()                        # info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   Day        6 non-null      int64
 1   Visitors   6 non-null      int64
 2   BooksSold  6 non-null      int64
dtypes: int64(3)
memory usage: 272.0 bytes


#### 3. Merging

In [29]:
df1

Unnamed: 0,Day,Visitors,BooksSold
0,1,100,90
1,2,120,40
2,3,500,300
3,4,100,87
4,5,460,600
5,6,120,40


In [30]:
df3=df1.merge(df2)  # .merge() --> Default is an inner join on ALL common columns (Day, Visitors): a row is kept only if both values match
df3

Unnamed: 0,Day,Visitors,BooksSold,BooksRented
0,4,100,87,90


In [31]:
df4=pd.merge(df1,df2)   # pd.merge(df1, df2) is equivalent to df1.merge(df2)
df4

Unnamed: 0,Day,Visitors,BooksSold,BooksRented
0,4,100,87,90


In [32]:
df5=pd.merge(df1,df2, on="Visitors")  # All combination of Values on basis of Visitors
df5

Unnamed: 0,Day_x,Visitors,BooksSold,Day_y,BooksRented
0,1,100,90,4,90
1,1,100,90,10,87
2,4,100,87,4,90
3,4,100,87,10,87
4,2,120,40,7,40
5,6,120,40,7,40
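
The merges above all use the default inner join. As a hedged sketch of the other join types, the example below uses simplified versions of the two frames (the `Visitors` column is left out, and `left`/`right` are illustrative names) together with the `how` and `indicator` parameters of `pd.merge`:

```python
import pandas as pd

left = pd.DataFrame({"Day": [1, 2, 3, 4, 5, 6],
                     "BooksSold": [90, 40, 300, 87, 600, 40]})
right = pd.DataFrame({"Day": [4, 7, 1, 10, 2, 9],
                      "BooksRented": [90, 40, 300, 87, 600, 40]})

inner = pd.merge(left, right, on="Day")                  # default how="inner"
left_join = pd.merge(left, right, on="Day", how="left")  # keep every row of `left`
outer = pd.merge(left, right, on="Day", how="outer", indicator=True)

print(len(inner), len(left_join), len(outer))
print(outer["_merge"].value_counts())   # where each row came from
```

The `_merge` column added by `indicator=True` is a convenient way to check which keys matched before deciding on a join type.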


#### 4. Joining 
- join() combines DataFrames on their index values.
- By default it is a left join: every index label of the first DataFrame is kept, and matching rows of the second are attached.
- Column names must not overlap; if they do, pass lsuffix/rsuffix (or use merge()), otherwise join() raises an error.

In [33]:
df6 = pd.DataFrame({"Day1":[4,7,1,10,2], "Visitors1":[100, 120, 510, 100, 450], "BooksRented":[90, 40,300,87,40]})
df6

Unnamed: 0,Day1,Visitors1,BooksRented
0,4,100,90
1,7,120,40
2,1,510,300
3,10,100,87
4,2,450,40


In [34]:
df1

Unnamed: 0,Day,Visitors,BooksSold
0,1,100,90
1,2,120,40
2,3,500,300
3,4,100,87
4,5,460,600
5,6,120,40


In [35]:
df7 = df1.join(df6)   # join() aligns on the index. df6 has no row 5, so NaN is inserted there; NaN is a float, so df6's integer columns are upcast to float64
df7

Unnamed: 0,Day,Visitors,BooksSold,Day1,Visitors1,BooksRented
0,1,100,90,4.0,100.0,90.0
1,2,120,40,7.0,120.0,40.0
2,3,500,300,1.0,510.0,300.0
3,4,100,87,10.0,100.0,87.0
4,5,460,600,2.0,450.0,40.0
5,6,120,40,,,
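
The float conversion seen above can be reproduced on a smaller scale. The sketch below uses made-up frames `left` and `right`; the nullable `"Int64"` dtype is one possible way to keep integers and is an assumption of this example, not something the notebook uses.

```python
import pandas as pd

left = pd.DataFrame({"Day": [1, 2, 3]})
right = pd.DataFrame({"Day1": [4, 7]})          # one row shorter than `left`

# Default left join on the index: row 2 has no partner, so Day1 gets NaN
# and the whole column is upcast from int64 to float64.
joined = left.join(right)
print(joined.dtypes)

# Converting to the nullable "Int64" dtype first keeps integers
# and marks the gap as <NA> instead.
joined_nullable = left.join(right.astype("Int64"))
print(joined_nullable)

# Overlapping column names need explicit suffixes.
overlap = left.join(left, lsuffix="_left", rsuffix="_right")
print(list(overlap.columns))
```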


#### 5. Concatenation 

In [36]:
df8=pd.DataFrame({'Students':[100,120,500,100],'Penality':[90,40,300,87]}, index=[2018,2019,2020,2021])
df8

Unnamed: 0,Students,Penality
2018,100,90
2019,120,40
2020,500,300
2021,100,87


In [37]:
df9=pd.DataFrame({'Teachers':[10,12,50,10],'SalaryCut':[9,4,30,8]}, index=[2012,2019,2020,2013])
df9

Unnamed: 0,Teachers,SalaryCut
2012,10,9
2019,12,4
2020,50,30
2013,10,8


In [38]:
df10=pd.concat([df8, df9])              # Concatenation by Row (Concatenation by Column done by axis=1)
df10

Unnamed: 0,Students,Penality,Teachers,SalaryCut
2018,100.0,90.0,,
2019,120.0,40.0,,
2020,500.0,300.0,,
2021,100.0,87.0,,
2012,,,10.0,9.0
2019,,,12.0,4.0
2020,,,50.0,30.0
2013,,,10.0,8.0


In [39]:
df1010=pd.concat([df8, df9]).reset_index(drop=True)   #Index starts with 0,1,2...
df1010

Unnamed: 0,Students,Penality,Teachers,SalaryCut
0,100.0,90.0,,
1,120.0,40.0,,
2,500.0,300.0,,
3,100.0,87.0,,
4,,,10.0,9.0
5,,,12.0,4.0
6,,,50.0,30.0
7,,,10.0,8.0


In [40]:
df11=pd.concat([df8, df9])         # DataFrame.append() did the same job but was removed in pandas 2.0; use pd.concat()
df11

Unnamed: 0,Students,Penality,Teachers,SalaryCut
2018,100.0,90.0,,
2019,120.0,40.0,,
2020,500.0,300.0,,
2021,100.0,87.0,,
2012,,,10.0,9.0
2019,,,12.0,4.0
2020,,,50.0,30.0
2013,,,10.0,8.0
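
The cells above concatenate by row. A short sketch of column-wise concatenation with `axis=1` follows, using reduced frames with one column each (`students` and `teachers` are illustrative names):

```python
import pandas as pd

students = pd.DataFrame({"Students": [100, 120, 500, 100]},
                        index=[2018, 2019, 2020, 2021])
teachers = pd.DataFrame({"Teachers": [10, 12, 50, 10]},
                        index=[2012, 2019, 2020, 2013])

by_rows = pd.concat([students, teachers])                       # axis=0: stack rows
by_cols = pd.concat([students, teachers], axis=1)               # axis=1: align on index labels
common = pd.concat([students, teachers], axis=1, join="inner")  # shared labels only

print(by_rows.shape, by_cols.shape, common.shape)
```

With `axis=1` the frames are placed side by side and rows are matched by index label, so years present in only one frame get NaN in the other frame's columns unless `join="inner"` is used.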


#### 6. Set any Column in DataFrame as Index Column
- set_index() makes a column the index, but by default it returns a new DataFrame and leaves the original unchanged
-  1. Assign the result to a variable (the usual approach) (or)
-  2. Use inplace=True (to modify the DataFrame itself)

In [41]:
df12 = pd.DataFrame({"Day":[1,2,3,4,5,6], "Visitors":[100, 120, 500, 100, 460, 120], "BooksSold":[90, 40,300,87,600,40]})

In [42]:
df12.set_index('Day')       # .set_index('Column Name')

Day,Visitors,BooksSold
1,100,90
2,120,40
3,500,300
4,100,87
5,460,600
6,120,40


In [43]:
df12

Unnamed: 0,Day,Visitors,BooksSold
0,1,100,90
1,2,120,40
2,3,500,300
3,4,100,87
4,5,460,600
5,6,120,40


In [44]:
df12.set_index(['Day', 'BooksSold'])     # Set 2 columns as index Columns

Day,BooksSold,Visitors
1,90,100
2,40,120
3,300,500
4,87,100
5,600,460
6,40,120


In [45]:
df13 = df12.set_index('Day')
df13

Day,Visitors,BooksSold
1,100,90
2,120,40
3,500,300
4,100,87
5,460,600
6,120,40


In [46]:
df12.set_index('Day',inplace=True)

In [47]:
df12

Day,Visitors,BooksSold
1,100,90
2,120,40
3,500,300
4,100,87
5,460,600
6,120,40


In [48]:
df12.set_index('Visitors',inplace=True)

In [49]:
df12

Visitors,BooksSold
100,90
120,40
500,300
100,87
460,600
120,40


#### 7. Change Column Name
- rename() changes column names, but by default it returns a new DataFrame and leaves the original unchanged
-  1. Assign the result to a variable (the usual approach) (or)
-  2. Use inplace=True (to modify the DataFrame itself)

In [50]:
df14 = pd.DataFrame({"Day":[1,2,3,4,5,6], "Visitors":[100, 120, 500, 100, 460, 120], "BooksSold":[90, 40,300,87,600,40]})
df14

Unnamed: 0,Day,Visitors,BooksSold
0,1,100,90
1,2,120,40
2,3,500,300
3,4,100,87
4,5,460,600
5,6,120,40


In [51]:
df14.rename(columns={"Day": "No. of Days", "Visitors": "Students"})   # .rename --> Change column name

Unnamed: 0,No. of Days,Students,BooksSold
0,1,100,90
1,2,120,40
2,3,500,300
3,4,100,87
4,5,460,600
5,6,120,40


In [52]:
df14

Unnamed: 0,Day,Visitors,BooksSold
0,1,100,90
1,2,120,40
2,3,500,300
3,4,100,87
4,5,460,600
5,6,120,40


In [53]:
df15 = df10.rename(columns={"Day": "No. of Days", "Visitors": "Students"})   # df10 has no 'Day' or 'Visitors' column, so nothing is renamed (and no error is raised)
df15

Unnamed: 0,Students,Penality,Teachers,SalaryCut
2018,100.0,90.0,,
2019,120.0,40.0,,
2020,500.0,300.0,,
2021,100.0,87.0,,
2012,,,10.0,9.0
2019,,,12.0,4.0
2020,,,50.0,30.0
2013,,,10.0,8.0


In [54]:
df15

Unnamed: 0,Students,Penality,Teachers,SalaryCut
2018,100.0,90.0,,
2019,120.0,40.0,,
2020,500.0,300.0,,
2021,100.0,87.0,,
2012,,,10.0,9.0
2019,,,12.0,4.0
2020,,,50.0,30.0
2013,,,10.0,8.0


In [55]:
df14.rename(columns={"Day": "No. of Days", "Visitors": "Students"}, inplace=True)
df14

Unnamed: 0,No. of Days,Students,BooksSold
0,1,100,90
1,2,120,40
2,3,500,300
3,4,100,87
4,5,460,600
5,6,120,40
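
The `df10.rename(...)` call earlier in this section shows that `rename()` ignores keys that match no column. A minimal sketch of how to make such a mismatch visible with the `errors="raise"` option (the `frame` below is made up for the example):

```python
import pandas as pd

frame = pd.DataFrame({"Students": [100, 120], "Penalty": [90, 40]})

# By default, keys that match no column are ignored without any message.
unchanged = frame.rename(columns={"Day": "No. of Days"})
print(list(unchanged.columns))

# errors="raise" turns a non-matching key into a KeyError.
try:
    frame.rename(columns={"Day": "No. of Days"}, errors="raise")
except KeyError as exc:
    print("rename failed:", exc)
```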


#### 8. Selecting Elements in DataFrames in 2 ways
- loc : Selection by label (row and column names)
- iloc : Selection by integer position (0-based)

In [56]:
df16=pd.DataFrame(np.arange(2,82,2).reshape(8,5),index=['R0','R1','R2','R3','R4','R5','R6','R7'],columns=['C0','C1','C2','C3','C4']) 
df16

Unnamed: 0,C0,C1,C2,C3,C4
R0,2,4,6,8,10
R1,12,14,16,18,20
R2,22,24,26,28,30
R3,32,34,36,38,40
R4,42,44,46,48,50
R5,52,54,56,58,60
R6,62,64,66,68,70
R7,72,74,76,78,80


In [57]:
# Accessing Row Elements using loc & iloc
print(df16.loc['R2'])
print(type(df16.loc['R2']))

print("")

print(df16.iloc[1])
print(type(df16.iloc[1]))

C0    22
C1    24
C2    26
C3    28
C4    30
Name: R2, dtype: int32
<class 'pandas.core.series.Series'>

C0    12
C1    14
C2    16
C3    18
C4    20
Name: R1, dtype: int32
<class 'pandas.core.series.Series'>


In [58]:
# Selecting a Specific Element in Rowwise (Index) & Columnwise using loc & iloc
print(df16.loc['R2','C3'])
print(type(df16.loc['R2','C3']))

print("")

print(df16.iloc[1,2])
print(type(df16.iloc[1,2]))

28
<class 'numpy.int32'>

16
<class 'numpy.int32'>


In [59]:
# Selecting Specific Rows and Specific columns
df16.loc[['R2','R4'],['C1', 'C3']]

Unnamed: 0,C1,C3
R2,24,28
R4,44,48


In [60]:
df16.loc['R2':'R4']      # loc : Slicing works with labels, and the end label ('R4') is included

Unnamed: 0,C0,C1,C2,C3,C4
R2,22,24,26,28,30
R3,32,34,36,38,40
R4,42,44,46,48,50


In [61]:
print(df16.iloc[1:3])     # iloc : Slicing by position; the end position (3) is excluded. df16[1:3] slices rows the same way
print("")
print(df16[1:3])

    C0  C1  C2  C3  C4
R1  12  14  16  18  20
R2  22  24  26  28  30

    C0  C1  C2  C3  C4
R1  12  14  16  18  20
R2  22  24  26  28  30
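
The difference between the two slicing styles above is easy to miss, so here is a side-by-side sketch on a frame built like `df16` (named `grid` here to keep the example self-contained):

```python
import numpy as np
import pandas as pd

grid = pd.DataFrame(np.arange(2, 82, 2).reshape(8, 5),
                    index=[f"R{i}" for i in range(8)],
                    columns=[f"C{i}" for i in range(5)])

by_label = grid.loc["R1":"R3"]        # labels R1, R2, R3 (end label included)
by_position = grid.iloc[1:3]          # positions 1 and 2 (end position excluded)

print(by_label.index.tolist())
print(by_position.index.tolist())

# Both accept a row selector and a column selector.
print(grid.loc["R1":"R3", "C0":"C1"])
print(grid.iloc[1:3, 0:2])
```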


In [62]:
# Selecting Column elements 
print(df16['C3'])
print(type(df16['C3']))

R0     8
R1    18
R2    28
R3    38
R4    48
R5    58
R6    68
R7    78
Name: C3, dtype: int32
<class 'pandas.core.series.Series'>


In [63]:
df16.iloc[:,:]   # can use df16  # iloc[Rows: , Columns:]

Unnamed: 0,C0,C1,C2,C3,C4
R0,2,4,6,8,10
R1,12,14,16,18,20
R2,22,24,26,28,30
R3,32,34,36,38,40
R4,42,44,46,48,50
R5,52,54,56,58,60
R6,62,64,66,68,70
R7,72,74,76,78,80


In [64]:
df16[:5][:2]

Unnamed: 0,C0,C1,C2,C3,C4
R0,2,4,6,8,10
R1,12,14,16,18,20


In [65]:
df16[['C2','C4']]

Unnamed: 0,C2,C4
R0,6,10
R1,16,20
R2,26,30
R3,36,40
R4,46,50
R5,56,60
R6,66,70
R7,76,80


In [66]:
# Selecting few rows for selected columns
df16[2:5]['C2']

R2    26
R3    36
R4    46
Name: C2, dtype: int32

In [67]:
# To get Boolean with a condition
df16.loc['R2']>0

C0    True
C1    True
C2    True
C3    True
C4    True
Name: R2, dtype: bool

In [68]:
print(df16.iloc[:,:-2])    # iloc and Slicing Operation
print(type(df16.iloc[:,:-2]))

    C0  C1  C2
R0   2   4   6
R1  12  14  16
R2  22  24  26
R3  32  34  36
R4  42  44  46
R5  52  54  56
R6  62  64  66
R7  72  74  76
<class 'pandas.core.frame.DataFrame'>


In [69]:
print(df16.iloc[:2,0:1])            # A slice (0:1) as the column selector keeps the result 2-D: a DataFrame with one column
print(type(df16.iloc[:2,0:1]))
print((df16.iloc[:2,0:1]).shape)

print("")

print(df16.iloc[:2,0])           # A single integer as the column selector returns that column as a 1-D Series
print(type(df16.iloc[:2,0]))
print(df16.iloc[:2,0].shape)

print("")

print(df16.loc['R2'])         # A single row label also returns a Series, indexed by the column names
print(type(df16.loc['R2']))
print(df16.loc['R2'].shape)

    C0
R0   2
R1  12
<class 'pandas.core.frame.DataFrame'>
(2, 1)

R0     2
R1    12
Name: C0, dtype: int32
<class 'pandas.core.series.Series'>
(2,)

C0    22
C1    24
C2    26
C3    28
C4    30
Name: R2, dtype: int32
<class 'pandas.core.series.Series'>
(5,)


#### 9. Useful for Data Preprocessing

In [70]:
df17 = pd.DataFrame(np.random.randn(4,3), index=['R0','R3','R5','R7'], columns=['C0','C1','C2'])
df17

Unnamed: 0,C0,C1,C2
R0,-0.077028,0.079888,-0.665178
R3,-0.101143,0.731324,0.785212
R5,2.256737,0.554406,-2.342249
R7,-1.201626,1.006398,0.164221


In [71]:
print(df17.isnull())            # .isnull() --> True where a value is missing; a quick check that the data is complete
print("")
print(df17['C1'].isnull())
print("")
print(df17.loc['R3'].isnull())

       C0     C1     C2
R0  False  False  False
R3  False  False  False
R5  False  False  False
R7  False  False  False

R0    False
R3    False
R5    False
R7    False
Name: C1, dtype: bool

C0    False
C1    False
C2    False
Name: R3, dtype: bool


In [72]:
df18 = df17.reindex(['R0','R1','R2','R3','R4','R5','R6','R7','R8']) # .reindex() --> Conform to the given row labels; labels not in df17 become rows of NaN
df18

Unnamed: 0,C0,C1,C2
R0,-0.077028,0.079888,-0.665178
R1,,,
R2,,,
R3,-0.101143,0.731324,0.785212
R4,,,
R5,2.256737,0.554406,-2.342249
R6,,,
R7,-1.201626,1.006398,0.164221
R8,,,


In [73]:
print(df18.isnull())
print("")
print(df18.isnull().sum())           # .isnull().sum() --> No. of null values in each Column
print("")
print(df18['C1'].isnull().sum())
print("")
print(df18.loc['R2'].isnull().sum())

       C0     C1     C2
R0  False  False  False
R1   True   True   True
R2   True   True   True
R3  False  False  False
R4   True   True   True
R5  False  False  False
R6   True   True   True
R7  False  False  False
R8   True   True   True

C0    5
C1    5
C2    5
dtype: int64

5

3


In [74]:
df18.isnull().any()

C0    True
C1    True
C2    True
dtype: bool

In [75]:
df18

Unnamed: 0,C0,C1,C2
R0,-0.077028,0.079888,-0.665178
R1,,,
R2,,,
R3,-0.101143,0.731324,0.785212
R4,,,
R5,2.256737,0.554406,-2.342249
R6,,,
R7,-1.201626,1.006398,0.164221
R8,,,


In [76]:
df18.ffill()       # .ffill() --> Forward fill: each null takes the last valid value above it

Unnamed: 0,C0,C1,C2
R0,-0.077028,0.079888,-0.665178
R1,-0.077028,0.079888,-0.665178
R2,-0.077028,0.079888,-0.665178
R3,-0.101143,0.731324,0.785212
R4,-0.101143,0.731324,0.785212
R5,2.256737,0.554406,-2.342249
R6,2.256737,0.554406,-2.342249
R7,-1.201626,1.006398,0.164221
R8,-1.201626,1.006398,0.164221


In [77]:
df18

Unnamed: 0,C0,C1,C2
R0,-0.077028,0.079888,-0.665178
R1,,,
R2,,,
R3,-0.101143,0.731324,0.785212
R4,,,
R5,2.256737,0.554406,-2.342249
R6,,,
R7,-1.201626,1.006398,0.164221
R8,,,


In [78]:
df18.bfill()       # .bfill() --> Backward fill: each null takes the next valid value below it (R8 stays null, nothing follows it)

Unnamed: 0,C0,C1,C2
R0,-0.077028,0.079888,-0.665178
R1,-0.101143,0.731324,0.785212
R2,-0.101143,0.731324,0.785212
R3,-0.101143,0.731324,0.785212
R4,2.256737,0.554406,-2.342249
R5,2.256737,0.554406,-2.342249
R6,-1.201626,1.006398,0.164221
R7,-1.201626,1.006398,0.164221
R8,,,


In [79]:
df18

Unnamed: 0,C0,C1,C2
R0,-0.077028,0.079888,-0.665178
R1,,,
R2,,,
R3,-0.101143,0.731324,0.785212
R4,,,
R5,2.256737,0.554406,-2.342249
R6,,,
R7,-1.201626,1.006398,0.164221
R8,,,


In [80]:
df18.fillna(1)     # .fillna(value) --> Fill the null values with some value

Unnamed: 0,C0,C1,C2
R0,-0.077028,0.079888,-0.665178
R1,1.0,1.0,1.0
R2,1.0,1.0,1.0
R3,-0.101143,0.731324,0.785212
R4,1.0,1.0,1.0
R5,2.256737,0.554406,-2.342249
R6,1.0,1.0,1.0
R7,-1.201626,1.006398,0.164221
R8,1.0,1.0,1.0


In [81]:
df18.fillna(method='bfill')       # Same as .bfill(); the method= argument is deprecated since pandas 2.1, so prefer .bfill()

Unnamed: 0,C0,C1,C2
R0,-0.077028,0.079888,-0.665178
R1,-0.101143,0.731324,0.785212
R2,-0.101143,0.731324,0.785212
R3,-0.101143,0.731324,0.785212
R4,2.256737,0.554406,-2.342249
R5,2.256737,0.554406,-2.342249
R6,-1.201626,1.006398,0.164221
R7,-1.201626,1.006398,0.164221
R8,,,
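
Beyond the plain forward, backward and constant fills shown above, a few variations are often useful. The sketch below uses a small made-up frame (`gaps`); the chosen fill values are arbitrary.

```python
import numpy as np
import pandas as pd

gaps = pd.DataFrame({"C0": [1.0, np.nan, np.nan, 4.0],
                     "C1": [np.nan, 2.0, np.nan, 8.0]})

capped = gaps.ffill(limit=1)                       # fill at most one consecutive gap
per_column = gaps.fillna({"C0": 0.0, "C1": -1.0})  # a different constant per column
interpolated = gaps["C0"].interpolate()            # linear interpolation between known values

print(capped)
print(per_column)
print(interpolated.tolist())
```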


In [82]:
df19 = pd.DataFrame({"name": ['Alfred', 'Batman', 'Catwoman'],
                     "toy": [np.nan, 'Batmobile', 'Bullwhip'],
                     "born": [pd.NaT, pd.Timestamp("1940-04-25"), pd.NaT]})
df19

Unnamed: 0,name,toy,born
0,Alfred,,NaT
1,Batman,Batmobile,1940-04-25
2,Catwoman,Bullwhip,NaT


In [83]:
df20 = df19.dropna()           # dropna() : Drop the rows where at least one element is missing.
df20

Unnamed: 0,name,toy,born
1,Batman,Batmobile,1940-04-25


In [84]:
df21 = df19.dropna(axis=1)    # dropna(axis=1):  Drop the Columns where at least one element is missing.
df21

Unnamed: 0,name
0,Alfred
1,Batman
2,Catwoman


In [85]:
df22 = pd.DataFrame({'C1': [10,20,30,0,8000], 'C2': [13,17,0,10,18] })
df22

Unnamed: 0,C1,C2
0,10,13
1,20,17
2,30,0
3,0,10
4,8000,18


In [86]:
df22.replace({0:1, 8000:44})     # replace(): Replaces with NewValues in the Dataframe

Unnamed: 0,C1,C2
0,10,13
1,20,17
2,30,1
3,1,10
4,44,18


#### 10. Convert DataFrame to Array (DataFrame.to_numpy)

In [87]:
df22.values                   # .values --> DataFrame as a NumPy array; df22.to_numpy() is the recommended equivalent

array([[  10,   13],
       [  20,   17],
       [  30,    0],
       [   0,   10],
       [8000,   18]], dtype=int64)

In [88]:
df22.values.shape

(5, 2)
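
The heading mentions `DataFrame.to_numpy`, while the cells use `.values`. A brief sketch of `to_numpy()` on a reduced frame (the name `frame` and the three rows are illustrative):

```python
import pandas as pd

frame = pd.DataFrame({"C1": [10, 20, 30], "C2": [13, 17, 0]})

arr = frame.to_numpy()                      # same data as frame.values
as_float = frame.to_numpy(dtype="float64")  # request a specific dtype
one_col = frame["C1"].to_numpy()            # 1-D array from a single column

print(arr.shape, as_float.dtype, one_col.shape)
```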

#### 11. DataFrame Manipulations

In [89]:
# Copying a DataFrame
# df23 = df22 would NOT copy anything: both names would refer to the same object,
# so `del df23['C2']` would also remove 'C2' from df22.
# .copy() creates an independent DataFrame (a deep copy by default).
df23 = df22.copy()
df24 = df22.copy()

In [90]:
df23

Unnamed: 0,C1,C2
0,10,13
1,20,17
2,30,0
3,0,10
4,8000,18
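
To see why the `.copy()` calls above matter, the following sketch contrasts plain assignment with a copy (`original`, `alias` and `independent` are names chosen for the example):

```python
import pandas as pd

original = pd.DataFrame({"C1": [10, 20], "C2": [13, 17]})

alias = original                 # second name for the same object
independent = original.copy()    # separate object with its own data

del alias["C2"]                  # also removes the column from `original`

print(list(original.columns))     # C2 is gone here too
print(list(independent.columns))  # the copy still has both columns
```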


In [91]:
# Column Addition (the list must have one value per row; pd.NaT marks the missing entry here)
df23['C3'] = [12, 90, 65, 34, pd.NaT]
df23

Unnamed: 0,C1,C2,C3
0,10,13,12
1,20,17,90
2,30,0,65
3,0,10,34
4,8000,18,NaT


In [92]:
# Column Calculation
df23['C4'] = df23['C1']+df23['C2']
df23

Unnamed: 0,C1,C2,C3,C4
0,10,13,12,23
1,20,17,90,37
2,30,0,65,30
3,0,10,34,10
4,8000,18,NaT,8018


In [93]:
# Column Deletion
del df23['C2']       
df23

Unnamed: 0,C1,C3,C4
0,10,12,23
1,20,90,37
2,30,65,30
3,0,34,10
4,8000,NaT,8018


In [94]:
df22    

Unnamed: 0,C1,C2
0,10,13
1,20,17
2,30,0
3,0,10
4,8000,18


In [95]:
# Column Remove using Pop
df23.pop('C1')

0      10
1      20
2      30
3       0
4    8000
Name: C1, dtype: int64

In [96]:
df23

Unnamed: 0,C3,C4
0,12,23
1,90,37
2,65,30
3,34,10
4,NaT,8018


In [97]:
df24

Unnamed: 0,C1,C2
0,10,13
1,20,17
2,30,0
3,0,10
4,8000,18


In [98]:
# Row Selection using loc & iloc
df24['index']=['a','b','c','d','e']
df24.set_index('index',inplace=True)
df24

index,C1,C2
a,10,13
b,20,17
c,30,0
d,0,10
e,8000,18


In [99]:
df24.iloc[1:4]   # Row-wise slicing by position using iloc (positions 1 to 3)

index,C1,C2
b,20,17
c,30,0
d,0,10


In [100]:
df25 = pd.DataFrame([[900,9]], columns=['C1','C2'], index=['f'])
df25

Unnamed: 0,C1,C2
f,900,9


In [101]:
# Appending one df to another (RowWise addition)
df26 = pd.concat([df24, df25])   # DataFrame.append() was removed in pandas 2.0
df26

Unnamed: 0,C1,C2
a,10,13
b,20,17
c,30,0
d,0,10
e,8000,18
f,900,9


In [102]:
# Row Deletion using Drop
df26.drop('d')       # returns a new DataFrame without row 'd'; df26 itself is unchanged

Unnamed: 0,C1,C2
a,10,13
b,20,17
c,30,0
e,8000,18
f,900,9


#### 12. groupby

In [103]:
sales=pd.DataFrame({'weekday':['sun','sun','mon','mon'],
                   'city':['austin','dallas','austin','dallas'],
                   'bread':[139,237,326,456],
                   'butter':[80,20,70,30]})

sales

Unnamed: 0,weekday,city,bread,butter
0,sun,austin,139,80
1,sun,dallas,237,20
2,mon,austin,326,70
3,mon,dallas,456,30


In [104]:
sales.groupby(['weekday']).count()  # groupby and count

weekday,city,bread,butter
mon,2,2,2
sun,2,2,2


In [105]:
sales.groupby(['weekday']).sum(numeric_only=True)    # groupby and sum; numeric_only=True skips the text column 'city'

weekday,bread,butter
mon,782,100
sun,376,100


In [106]:
sales.groupby(['weekday']).mean(numeric_only=True)  # groupby and mean; numeric_only=True skips the text column 'city'

weekday,bread,butter
mon,391,50
sun,188,50


In [107]:
sales.groupby(['city'])['bread'].max() # groupby and max

city
austin    326
dallas    456
Name: bread, dtype: int64

In [108]:
sales.groupby(['city'])['butter'].max()

city
austin    80
dallas    30
Name: butter, dtype: int64

In [109]:
sales.groupby(['city'])[['bread','butter']].max()   # select several columns with a list (double brackets)


city,bread,butter
austin,326,80
dallas,456,30


In [110]:
sales

Unnamed: 0,weekday,city,bread,butter
0,sun,austin,139,80
1,sun,dallas,237,20
2,mon,austin,326,70
3,mon,dallas,456,30


In [111]:
sales.groupby('city')[['bread', 'butter']].agg(['max','sum'])   # groupby and aggregation (several functions per column)


,bread,bread,butter,butter
city,max,sum,max,sum
austin,326,465,80,150
dallas,456,693,30,50
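
The two-level column header produced above can be awkward to work with. One alternative is named aggregation, sketched below on the same `sales` data; the output column names (`max_bread`, `total_bread`, `total_butter`) are chosen for this example.

```python
import pandas as pd

sales = pd.DataFrame({"weekday": ["sun", "sun", "mon", "mon"],
                      "city": ["austin", "dallas", "austin", "dallas"],
                      "bread": [139, 237, 326, 456],
                      "butter": [80, 20, 70, 30]})

# Each keyword is an output column: name=(source column, aggregation function).
summary = sales.groupby("city").agg(
    max_bread=("bread", "max"),
    total_bread=("bread", "sum"),
    total_butter=("butter", "sum"),
)
print(summary)
```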


In [112]:
# Create dictionary with data
dict_Mov = { "ID":[1, 2, 3,4,5],"Movies":["The Godfather", "Fight Club", "Casablanca","Indra","tagore"],
                        "Week_1_Viewers":[30, 30, 40,90,95],
                        "Week_2_Viewers":[60, 40, 80,110,150],
                        "Week_3_Viewers":[40, 20, 20,50,60] }

# Convert dictionary to dataframe and use "ID" as the index
df_Mov = pd.DataFrame(dict_Mov)
df1_Mov=df_Mov.set_index("ID")
print(df1_Mov)

           Movies  Week_1_Viewers  Week_2_Viewers  Week_3_Viewers
ID
1   The Godfather              30              60              40
2      Fight Club              30              40              20
3      Casablanca              40              80              20
4           Indra              90             110              50
5          tagore              95             150              60


In [113]:
dict1_Mov={"Week_1_Viewers":"total_viewership",
      "Week_2_Viewers":"total_viewership",
      "Week_3_Viewers":"total_viewership",
      "Movies":"Movies"}

df2_Mov=df1_Mov.groupby(dict1_Mov,axis=1).sum()   # axis=1 groups columns; note that this argument is deprecated since pandas 2.1
df2_Mov

# Columns mapped to the same key in dict1_Mov form one group, and .sum() adds them up within each row,
# so the three weekly columns collapse into 'total_viewership'.
# 'Movies' is not combined with any other column, but it still has to appear in dict1_Mov;
# columns missing from the mapping are left out of the result.

ID,Movies,total_viewership
1,The Godfather,130
2,Fight Club,90
3,Casablanca,140
4,Indra,250
5,tagore,305
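
The same totals can also be computed without grouping columns, by summing the weekly columns across each row. This is an alternative sketch, not the approach used in the cell above; `movies`, `week_cols` and `totals` are names introduced for the example.

```python
import pandas as pd

movies = pd.DataFrame({"ID": [1, 2, 3, 4, 5],
                       "Movies": ["The Godfather", "Fight Club", "Casablanca", "Indra", "tagore"],
                       "Week_1_Viewers": [30, 30, 40, 90, 95],
                       "Week_2_Viewers": [60, 40, 80, 110, 150],
                       "Week_3_Viewers": [40, 20, 20, 50, 60]}).set_index("ID")

# Pick the weekly columns by their common prefix.
week_cols = [c for c in movies.columns if c.startswith("Week_")]

totals = movies[["Movies"]].copy()
totals["total_viewership"] = movies[week_cols].sum(axis=1)   # sum across columns, row by row
print(totals)
```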


#### 13. Imputer method to handle Missing Values

In [114]:
dfi = pd.DataFrame({"A":[8,3,None,4,7], 
                    "B":[None,2,4,3,5], 
                    "C":[4,None,8,5,None], 
                    "D":[None,4,2,None,1]})
dfi

Unnamed: 0,A,B,C,D
0,8.0,,4.0,
1,3.0,2.0,,4.0
2,,4.0,8.0,2.0
3,4.0,3.0,5.0,
4,7.0,5.0,,1.0


In [115]:
# Replace Missing values by Column Mean Value
import numpy as np
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
dfi[['A','B','C','D']]=imp.fit_transform(dfi[['A','B','C','D']])
dfi

Unnamed: 0,A,B,C,D
0,8.0,3.5,4.0,2.333333
1,3.0,2.0,5.666667,4.0
2,5.5,4.0,8.0,2.0
3,4.0,3.0,5.0,2.333333
4,7.0,5.0,5.666667,1.0
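
For simple mean imputation, pandas alone gives the same numbers as the `SimpleImputer` cell above. The sketch below is an alternative for comparison, not a replacement for scikit-learn's imputer, which is more convenient inside a modelling pipeline.

```python
import pandas as pd

dfi = pd.DataFrame({"A": [8, 3, None, 4, 7],
                    "B": [None, 2, 4, 3, 5],
                    "C": [4, None, 8, 5, None],
                    "D": [None, 4, 2, None, 1]})

col_means = dfi.mean()              # missing values are skipped by default
filled = dfi.fillna(col_means)      # each column is filled with its own mean

print(col_means)
print(filled)
```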


In [116]:
# Replace '?' by Nan in Dataframe
dfq = pd.DataFrame({"A":[8,3,'?',4,7], 
                    "B":[None,2,4,3,5], 
                    "C":[4,None,8,5,None], 
                    "D":[None,4,2,None,1]})
dfq

Unnamed: 0,A,B,C,D
0,8,,4.0,
1,3,2.0,,4.0
2,?,4.0,8.0,2.0
3,4,3.0,5.0,
4,7,5.0,,1.0


In [117]:
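dfq['A'] = pd.to_numeric(dfq['A'].replace('?', np.nan))   # assign the result back (more reliable than inplace on a selected column); to_numeric makes the column float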
dfq

Unnamed: 0,A,B,C,D
0,8.0,,4.0,
1,3.0,2.0,,4.0
2,,4.0,8.0,2.0
3,4.0,3.0,5.0,
4,7.0,5.0,,1.0
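
When the data comes from a file, placeholder strings such as `'?'` can be converted to NaN while loading. A minimal sketch, assuming CSV input (the in-memory text `raw` stands in for a real file):

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a real file.
raw = io.StringIO("A,B\n8,1\n3,2\n?,4\n4,3\n")

loaded = pd.read_csv(raw, na_values="?")   # treat '?' as missing while parsing
print(loaded.dtypes)
print(loaded)
```

Because the placeholder never reaches the DataFrame as text, column `A` is parsed as a numeric column straight away.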
