# <div class="alert alert-success" >(3) Pandas DataFrame

The pandas DataFrame object extends the capabilities of the Series object into two dimensions. A Series object adds an index to a NumPy array but can only associate a single data item with each index label. A DataFrame integrates multiple Series objects by aligning them along common index labels. This automatic alignment provides a seamless view across all the Series at each index label, which has the appearance of a row in a table.
    
Because of the increased dimensionality of the DataFrame object, it becomes necessary to provide a means to select both rows and columns. Carrying over from a Series, the DataFrame uses the [] operator for selection, but it is now applied to the selection of columns of data. This means that another construct must be used to select specific rows of a DataFrame object. For those operations, a DataFrame object provides several methods and attributes that can be used in various fashions to select data by rows. A DataFrame also introduces the concept of multiple axes, specifically the horizontal and vertical axis. Functions from pandas can then be applied to either axis, in essence stating that the operation be applied horizontally to all the values in the rows, or up and down each column.    
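As a minimal sketch of the two axes described above, the following applies the same function along each axis. The frame `scores` and its values are made up purely for illustration.

```python
import pandas as pd

# A small illustrative frame (values are invented for this sketch)
scores = pd.DataFrame({'math': [1, 2, 3], 'physics': [10, 20, 30]})

# axis=0 works down each column: one result per column
col_totals = scores.sum(axis=0)

# axis=1 works across each row: one result per row
row_totals = scores.sum(axis=1)

print(col_totals)
print(row_totals)
```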

In [1]:
import pandas as pd
import numpy as np

## <div class= "alert alert-info">Creating DataFrame from scratch
    
There are several ways to create a DataFrame. The examples below build one from Python lists, from a dictionary, from a two-dimensional NumPy array, and from a list of Series objects.

#### Creating dataframe from list (just to understand)

In [2]:
lst1 = [1,2,3]  
df_l1 = pd.DataFrame([lst1],index=[1,2,3],columns=['a','b','c'])  # lst is 1D so we pass it as 2D
df_l1

Unnamed: 0,a,b,c
1,1,2,3
2,1,2,3
3,1,2,3


In [3]:
lst2 = [[1,2,3] , [3,2,1] , [1,1,3] ]
df_l2 = pd.DataFrame(lst2,index=['record1','record2','record3'],columns=['series1','series2','series3'])
df_l2

Unnamed: 0,series1,series2,series3
record1,1,2,3
record2,3,2,1
record3,1,1,3


#### Creating dataframe from dictionary

A DataFrame object can also be created by passing a dictionary containing one or
more Series objects, where the dictionary keys contain the column names and each
series is one column of data:

In [4]:
students = {'rolls':[1,2,3,4,5,6],
            'names':['Ali','Amjad', 'Saima', 'Qaiser','Hamid','Nasir'], 
            'courses':['Python', 'Pandas', 'Numpy','Pandas', 'Numpy', "Python"], 
            'mode':['Online', 'Onsite','Online', 'Onsite','Online', 'Onsite']}

<b>Note:</b> every list in the dictionary must have the same length. If the lists are uneven, pad the shorter ones with NaN.
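One possible way to do that padding is sketched below. The lists `names` and `marks` are hypothetical and not part of the students data.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'marks' has one fewer entry than 'names'
names = ['Ali', 'Amjad', 'Saima']
marks = [80, 75]

# Passing the uneven lists directly would raise a ValueError,
# so pad the shorter list with NaN first
padded_marks = marks + [np.nan] * (len(names) - len(marks))

df_pad = pd.DataFrame({'names': names, 'marks': padded_marks})
print(df_pad)
```

Because NaN is a floating-point value, the padded column is stored as floats rather than integers.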

In [5]:
df_d  = pd.DataFrame(students)    # Creating a data frame
df_d

Unnamed: 0,rolls,names,courses,mode
0,1,Ali,Python,Online
1,2,Amjad,Pandas,Onsite
2,3,Saima,Numpy,Online
3,4,Qaiser,Pandas,Onsite
4,5,Hamid,Numpy,Online
5,6,Nasir,Python,Onsite


<mark>rolls, names, courses and mode are four Series combined to make a DataFrame. The horizontal rows are called records.</mark>

In [6]:
type(df_d)

pandas.core.frame.DataFrame

In [7]:
type(df_d["rolls"])

pandas.core.series.Series

In [8]:
type(df_d[["rolls",'mode']])

pandas.core.frame.DataFrame

### <div class= "alert alert-warning">DataFrame dimension
    
The dimensions of a DataFrame object can be determined using its .shape property. A DataFrame is always two-dimensional. The first value informs us about the number of rows and the second value is the number of columns:

In [9]:
# what's the shape of this DataFrame, i.e. its dimensions?
df_d.shape  # it is 6 rows by 4 columns

(6, 4)

### <div class= "alert alert-warning"> Naming dataframe columns
    
Column names can be specified at the time of creating the DataFrame by using the
columns parameter of the DataFrame constructor:

In [10]:
# specify column names
df = pd.DataFrame(np.array([[10, 11], [20, 21]]),columns=['a', 'b'])
df

Unnamed: 0,a,b
0,10,11
1,20,21


The names of the columns of a DataFrame can be accessed with its .columns property:

In [11]:
# what are the names of the columns?
df.columns

Index(['a', 'b'], dtype='object')

<mark>The value of the .columns property is actually a pandas Index. The individual column names can be accessed by position.</mark>

In [12]:
df.columns[0]     # df.columns[1]   = 'b'

'a'

The names of the columns can be changed by assigning a list of the new names to the .columns property:

In [13]:
df.columns = ['c1', 'c2'] # rename the columns
df

Unnamed: 0,c1,c2
0,10,11
1,20,21


### <div class= "alert alert-warning"> Naming dataframe index i.e labels

Index labels can likewise be assigned using the index parameter of the constructor or by assigning a list directly to the .index property.

In [14]:
# create a DataFrame with named columns and rows

df = pd.DataFrame(np.array([[0, 1], [2, 3]]),columns=['c1', 'c2'],index=['r1', 'r2']) 
df

Unnamed: 0,c1,c2
r1,0,1
r2,2,3


Similar to the Series object, the index of a DataFrame object can be accessed with its .index property: 

In [15]:
# retrieve the index of the DataFrame
df.index

Index(['r1', 'r2'], dtype='object')
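The constructor route is shown above. Assigning a list directly to the .index property could look like the following sketch, where `df_idx` is a hypothetical name used only here.

```python
import numpy as np
import pandas as pd

# Start with the default integer index
df_idx = pd.DataFrame(np.array([[0, 1], [2, 3]]), columns=['c1', 'c2'])

# Replace the index labels; the list length must match the number of rows
df_idx.index = ['r1', 'r2']
print(df_idx.index)
```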

#### Creating dataframe from ndarray object

In [16]:
df_s = pd.DataFrame(np.array([[10, 11,12], [20, 21,22]]))
df_s

Unnamed: 0,0,1,2
0,10,11,12
1,20,21,22


Each row of the array forms a row in the DataFrame object. Since we did not specify an index, pandas creates a default zero-based integer index in the same manner as for a Series object. Since we did not specify column names, pandas also labels the columns with zero-based integers.

#### Creating dataframe from list of series

In [17]:
s1 = pd.Series(np.arange(10, 15))
s2 = pd.Series(np.arange(15, 20))
df1 = pd.DataFrame([s1,s2])
df1

Unnamed: 0,0,1,2,3,4
0,10,11,12,13,14
1,15,16,17,18,19


In [18]:
s1 = pd.Series(np.arange(10, 15),index=['a','b','c','d','e'])
s2 = pd.Series(np.arange(15, 20),index=['a','b','c','d','e'])
df1 = pd.DataFrame([s1,s2])
df1

Unnamed: 0,a,b,c,d,e
0,10,11,12,13,14
1,15,16,17,18,19


In [19]:
s1 = pd.Series(np.arange(10, 15),index=['a','b','c','d','e'])  # when the indices of the Series differ (or one is missing)
s2 = pd.Series(np.arange(15, 20))
df1 = pd.DataFrame([s1,s2])
df1

Unnamed: 0,a,b,c,d,e,0,1,2,3,4
0,10.0,11.0,12.0,13.0,14.0,,,,,
1,,,,,,15.0,16.0,17.0,18.0,19.0


A DataFrame also performs automatic alignment of the data for each Series passed in by a dictionary. For example, the following code adds a third column in the DataFrame initialization. This third Series contains two values and will specify its index. When the DataFrame is created, each series in the dictionary is aligned with each other by the index label, as it is added to the DataFrame object. The code is as follows:

In [20]:
s1 = pd.Series(np.arange(1, 6, 1))
s2 = pd.Series(np.arange(6, 11, 1))
s3 =  pd.Series(np.arange(12, 14), index=[1, 2])
df2 = pd.DataFrame({'c1': s1, 'c2': s2, 'c3': s3})
df2

Unnamed: 0,c1,c2,c3
0,1,6,
1,2,7,12.0
2,3,8,13.0
3,4,9,
4,5,10,


The first two Series did not have an index specified, so they both were indexed with 0..4. The third Series has index values, and therefore the values for those indexes are placed in DataFrame in the row with the matching index from the previous columns. Then, pandas automatically filled in NaN for the values that were not supplied.
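A side effect of the NaN filling is worth checking: a column that receives NaN is stored as floats, which is why c3 shows 12.0 and 13.0 above. The sketch below rebuilds a two-column version of the example under the hypothetical name `aligned` and inspects it.

```python
import numpy as np
import pandas as pd

s1 = pd.Series(np.arange(1, 6))
s3 = pd.Series(np.arange(12, 14), index=[1, 2])
aligned = pd.DataFrame({'c1': s1, 'c3': s3})

# c3 has no value at labels 0, 3 and 4, so those cells are NaN
print(aligned['c3'].isna())

# c1 stays an integer column, c3 becomes a float column
print(aligned.dtypes)
```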

## <div class= "alert alert-info"> Adding, Accessing and modifying DataFrame data
  

In [21]:
df_d

Unnamed: 0,rolls,names,courses,mode
0,1,Ali,Python,Online
1,2,Amjad,Pandas,Onsite
2,3,Saima,Numpy,Online
3,4,Qaiser,Pandas,Onsite
4,5,Hamid,Numpy,Online
5,6,Nasir,Python,Onsite


In [22]:
df_d['names']    # selecting single column (i.e series)

0       Ali
1     Amjad
2     Saima
3    Qaiser
4     Hamid
5     Nasir
Name: names, dtype: object

In [23]:
df_d.names

0       Ali
1     Amjad
2     Saima
3    Qaiser
4     Hamid
5     Nasir
Name: names, dtype: object

In [24]:
df_d[["names", 'rolls']]  # selecting two column

Unnamed: 0,names,rolls
0,Ali,1
1,Amjad,2
2,Saima,3
3,Qaiser,4
4,Hamid,5
5,Nasir,6


In [25]:
df_d['cities']  = "islamabad" # creating new column with islamabad broadcasted to all records
df_d

Unnamed: 0,rolls,names,courses,mode,cities
0,1,Ali,Python,Online,islamabad
1,2,Amjad,Pandas,Onsite,islamabad
2,3,Saima,Numpy,Online,islamabad
3,4,Qaiser,Pandas,Onsite,islamabad
4,5,Hamid,Numpy,Online,islamabad
5,6,Nasir,Python,Onsite,islamabad


In [26]:
df_d['cities']  = ["islamabad" , "karachi","lahor","peshawar","quetta","xyz"] # reassigning column values
df_d['Ages']  =  [12,33,44,13,55,21]
df_d

Unnamed: 0,rolls,names,courses,mode,cities,Ages
0,1,Ali,Python,Online,islamabad,12
1,2,Amjad,Pandas,Onsite,karachi,33
2,3,Saima,Numpy,Online,lahor,44
3,4,Qaiser,Pandas,Onsite,peshawar,13
4,5,Hamid,Numpy,Online,quetta,55
5,6,Nasir,Python,Onsite,xyz,21


#### list comprehension for adding series to dataframe:

In [27]:
df_d['Status'] = ["Pass" if age>25 else "Fails" for age in df_d['Ages']]

This is a list comprehension: it iterates over the Ages column (the loop variable is age) and places "Pass" in the Status column if the age is greater than 25, otherwise "Fails".
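The same column can also be built without an explicit Python loop. The following sketch uses NumPy's `np.where`. This is an alternative technique, not what the cell above does, and the frame `ages` is a stand-in for df_d.

```python
import numpy as np
import pandas as pd

# Stand-in frame holding the same ages as df_d
ages = pd.DataFrame({'Ages': [12, 33, 44, 13, 55, 21]})

# Element-wise choice: "Pass" where the condition holds, "Fails" elsewhere
ages['Status'] = np.where(ages['Ages'] > 25, 'Pass', 'Fails')
print(ages)
```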

In [28]:
df_d['onsite/online'] = df_d['mode'].apply(lambda x: 'Karachi' if x=="Onsite" else "Outside Khi")
df_d

Unnamed: 0,rolls,names,courses,mode,cities,Ages,Status,onsite/online
0,1,Ali,Python,Online,islamabad,12,Fails,Outside Khi
1,2,Amjad,Pandas,Onsite,karachi,33,Pass,Karachi
2,3,Saima,Numpy,Online,lahor,44,Pass,Outside Khi
3,4,Qaiser,Pandas,Onsite,peshawar,13,Fails,Karachi
4,5,Hamid,Numpy,Online,quetta,55,Pass,Outside Khi
5,6,Nasir,Python,Onsite,xyz,21,Fails,Karachi


#### functions for adding series to dataframe:

In [29]:
def generationAssign(age):
    if age <20:
        return "Teen Ager"
    elif age<40:
        return "Young"
    elif age <60:
        return "Matured"
    else:
        return 'Senior'

In [30]:
df_d['Next Gen'] = df_d['Ages'].apply(generationAssign)
df_d

Unnamed: 0,rolls,names,courses,mode,cities,Ages,Status,onsite/online,Next Gen
0,1,Ali,Python,Online,islamabad,12,Fails,Outside Khi,Teen Ager
1,2,Amjad,Pandas,Onsite,karachi,33,Pass,Karachi,Young
2,3,Saima,Numpy,Online,lahor,44,Pass,Outside Khi,Matured
3,4,Qaiser,Pandas,Onsite,peshawar,13,Fails,Karachi,Teen Ager
4,5,Hamid,Numpy,Online,quetta,55,Pass,Outside Khi,Matured
5,6,Nasir,Python,Onsite,xyz,21,Fails,Karachi,Young


#### making an Excel file from a dataframe

In [31]:
df_d.to_excel("student.xlsx")          # write the df_d DataFrame to an Excel file

## <div class= "alert alert-info"> Selecting rows and values of a DataFrame using the index
    
Elements of an array or Series are selected using the [] operator. DataFrame overloads [] to select columns instead of rows, except for the specific case of slicing. Therefore, most operations that select one or more rows of a DataFrame require alternatives to []. Understanding this is important in pandas, as it is a common mistake to try to select rows using [] due to familiarity with other languages or data structures.
    
When doing so, errors are often received, and can often be difficult to diagnose without realizing [] is working along a completely different axis than with a Series object. Row selection using the index on a DataFrame then breaks down to the following general categories of operations:
    
• Slicing using the [] operator
    
• Label or location based lookup using .loc, .iloc
    
• Scalar lookup by label or location using .at and .iat
    
We will briefly examine each of these techniques and attributes. Remember, all of these are working against the content of the index of the DataFrame. There is no involvement with data in the columns in the selection of the rows. We will cover that in the next section on Boolean selection.    
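The common mistake mentioned above can be reproduced in a few lines. This is only a sketch on a hypothetical two-row frame named `frame`.

```python
import pandas as pd

frame = pd.DataFrame({'rolls': [1, 2], 'names': ['Ali', 'Amjad']},
                     index=['r0', 'r1'])

# [] with a single label looks for a COLUMN, so a row label fails
try:
    frame['r0']
    lookup_failed = False
except KeyError:
    lookup_failed = True

# .loc looks the label up in the row index instead
row = frame.loc['r0']
print(lookup_failed, row['names'])
```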
    
    
### (1). Slicing using the [ ] operator


Slicing a DataFrame across its index is syntactically identical to performing the same on a Series. Because of this, we will not go into the details of the various permutations of slices in this section, and only give representative examples
applied to a DataFrame. Slicing works along both positions and labels. The following code demonstrates several examples of slicing by position:    

In [32]:
df_d  = pd.DataFrame(students,index = ['r0', 'r1', 'r2', 'r3', 'r4', 'r5'])    # Creating a data frame
df_d

Unnamed: 0,rolls,names,courses,mode
r0,1,Ali,Python,Online
r1,2,Amjad,Pandas,Onsite
r2,3,Saima,Numpy,Online
r3,4,Qaiser,Pandas,Onsite
r4,5,Hamid,Numpy,Online
r5,6,Nasir,Python,Onsite


####  <span style= "color:green"> [  records_start_point :  records_end_point     ]

In [33]:
df_d[:0]         # empty slice: no records, only the column names

Unnamed: 0,rolls,names,courses,mode


In [34]:
df_d[:1]         # first record

Unnamed: 0,rolls,names,courses,mode
r0,1,Ali,Python,Online


In [35]:
df_d[:3]         # first three records

Unnamed: 0,rolls,names,courses,mode
r0,1,Ali,Python,Online
r1,2,Amjad,Pandas,Onsite
r2,3,Saima,Numpy,Online


In [36]:
df_d[:6]          # All 6 records

Unnamed: 0,rolls,names,courses,mode
r0,1,Ali,Python,Online
r1,2,Amjad,Pandas,Onsite
r2,3,Saima,Numpy,Online
r3,4,Qaiser,Pandas,Onsite
r4,5,Hamid,Numpy,Online
r5,6,Nasir,Python,Onsite


In [37]:
df_d[2:]      # All records starting from record 2

Unnamed: 0,rolls,names,courses,mode
r2,3,Saima,Numpy,Online
r3,4,Qaiser,Pandas,Onsite
r4,5,Hamid,Numpy,Online
r5,6,Nasir,Python,Onsite


In [38]:
df_d['r2':'r4']    # returns records of label r2 to r4 (including r4)

Unnamed: 0,rolls,names,courses,mode
r2,3,Saima,Numpy,Online
r3,4,Qaiser,Pandas,Onsite
r4,5,Hamid,Numpy,Online


#### <mark> Slicing by position instead excludes the record at the end position (4)

In [39]:
df_d[2:4]    # returns records at positions 2 to 4 (excluding 4)

Unnamed: 0,rolls,names,courses,mode
r2,3,Saima,Numpy,Online
r3,4,Qaiser,Pandas,Onsite
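The same two slices can be written with the explicit accessors, which makes the inclusive and exclusive end points easier to tell apart. A minimal sketch on a hypothetical single-column frame:

```python
import pandas as pd

frame = pd.DataFrame({'rolls': [1, 2, 3, 4, 5, 6]},
                     index=['r0', 'r1', 'r2', 'r3', 'r4', 'r5'])

by_label = frame.loc['r2':'r4']    # label slice: end label r4 is included
by_position = frame.iloc[2:4]      # position slice: end position 4 is excluded

print(list(by_label.index))
print(list(by_position.index))
```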


### (2). Selecting rows by index label and location:.loc[] and .iloc[]

Rows can be retrieved via an index label value using .loc[]. This is shown in the following code:

In [40]:
df_d.loc['r3'] # get row with label r3

rolls           4
names      Qaiser
courses    Pandas
mode       Onsite
Name: r3, dtype: object

In [41]:
type(df_d.loc['r3'] )    # returned as a Series

pandas.core.series.Series

In [42]:
df_d.loc[['r3','r4']]          # get records at labels r3 & r4  

Unnamed: 0,rolls,names,courses,mode
r3,4,Qaiser,Pandas,Onsite
r4,5,Hamid,Numpy,Online


Rows can also be retrieved by location using .iloc[]:


In [43]:
df_d.iloc[[3, 4]]      # get rows in locations 3 and 4

Unnamed: 0,rolls,names,courses,mode
r3,4,Qaiser,Pandas,Onsite
r4,5,Hamid,Numpy,Online


In [44]:
# get the location of row 0  and row 3 in the index
i1 = df_d.index.get_loc('r0')
i2 = df_d.index.get_loc('r3')

"{0} {1}".format(i1, i2)    # location (index) of label r0 and r3

'0 3'

In [45]:
df_d.iloc[[i1, i2]]    # and get the rows at loc i1 & i2

Unnamed: 0,rolls,names,courses,mode
r0,1,Ali,Python,Online
r3,4,Qaiser,Pandas,Onsite


### (3). Scalar lookup by label or location using .at[  ] and .iat[  ]

Scalar values can be looked up by label using .at, by passing both the row label and then the column name/value:

In [46]:
df_d

Unnamed: 0,rolls,names,courses,mode
r0,1,Ali,Python,Online
r1,2,Amjad,Pandas,Onsite
r2,3,Saima,Numpy,Online
r3,4,Qaiser,Pandas,Onsite
r4,5,Hamid,Numpy,Online
r5,6,Nasir,Python,Onsite


By label in both the index and column:

In [47]:
df_d.at['r0', 'mode']     # value in the mode column and row r0

'Online'

### By location: row 0, column 3 (the fourth column)

In [48]:
df_d.iat[0, 3]

'Online'
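.at and .iat can also be used on the left-hand side of an assignment to change a single value. The sketch below shows this on a hypothetical two-row frame. It is an illustration, not part of the original notebook.

```python
import pandas as pd

frame = pd.DataFrame({'names': ['Ali', 'Amjad'], 'mode': ['Online', 'Onsite']},
                     index=['r0', 'r1'])

frame.at['r0', 'mode'] = 'Onsite'   # set by row label and column name
frame.iat[1, 1] = 'Online'          # set by row position and column position
print(frame)
```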

## <div class= "alert alert-info"> Selecting rows of a DataFrame by Boolean selection

Rows can also be selected by using Boolean selection, using an array calculated from the result of applying a logical condition on the values in any of the columns. This allows us to build more complicated selections than those based simply upon index
labels or positions.

In [49]:
# what rows have rolls < 5?
df_d.rolls <5

r0     True
r1     True
r2     True
r3     True
r4    False
r5    False
Name: rolls, dtype: bool

This results in a Series that can be used to select the rows where the value is True, exactly the same way it was done with a Series or a NumPy array:

In [50]:
df_d[df_d.rolls <5]

Unnamed: 0,rolls,names,courses,mode
r0,1,Ali,Python,Online
r1,2,Amjad,Pandas,Onsite
r2,3,Saima,Numpy,Online
r3,4,Qaiser,Pandas,Onsite


Multiple conditions can be put together using parentheses; at the same time, it is possible to select only a subset of the columns.

In [51]:
df_d[(df_d.rolls < 5) & (df_d.rolls > 2)][['rolls']]  # get only the rolls column where rolls is < 5 and > 2

Unnamed: 0,rolls
r2,3
r3,4


In [52]:
df_d[(df_d.rolls < 5) & (df_d.rolls > 2)] # get the record where rolls is < 5 and > 2

Unnamed: 0,rolls,names,courses,mode
r2,3,Saima,Numpy,Online
r3,4,Qaiser,Pandas,Onsite
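Conditions can also be combined with | (or) and negated with ~, and .loc accepts the Boolean mask together with a list of columns in one step. The following is a sketch on a hypothetical frame. Note that the mode column is accessed as frame['mode'], because frame.mode would refer to the DataFrame method of the same name.

```python
import pandas as pd

frame = pd.DataFrame(
    {'rolls': [1, 2, 3, 4, 5, 6],
     'mode': ['Online', 'Onsite', 'Online', 'Onsite', 'Online', 'Onsite']},
    index=['r0', 'r1', 'r2', 'r3', 'r4', 'r5'])

# rows where rolls is below 3 OR the mode is Onsite
mask = (frame['rolls'] < 3) | (frame['mode'] == 'Onsite')

selected = frame.loc[mask, ['rolls']]   # rows by mask, columns by name
rest = frame.loc[~mask]                 # ~ inverts the mask

print(selected)
print(rest)
```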


## <div class= "alert alert-info">  Modifying the structure and content of DataFrame
    
The structure and content of a DataFrame can be mutated in several ways. Rows and columns can be added and removed, and data within either can be modified to take on new values. Additionally, columns, as well as index labels, can also be renamed.
Each of these will be described in the following sections.



### <div class= "alert alert-success">  Modification of Columns

In [53]:
df_d

Unnamed: 0,rolls,names,courses,mode
r0,1,Ali,Python,Online
r1,2,Amjad,Pandas,Onsite
r2,3,Saima,Numpy,Online
r3,4,Qaiser,Pandas,Onsite
r4,5,Hamid,Numpy,Online
r5,6,Nasir,Python,Onsite


### <span style="color:green"> Renaming columns
    
A column can be renamed using the .rename() method of the DataFrame. Here the rolls column is renamed to roll_no:

In [54]:
d = df_d.rename(columns = {'rolls': 'roll_no'})
d.columns

Index(['roll_no', 'names', 'courses', 'mode'], dtype='object')

This has returned a new DataFrame object with the renamed column and data copied from the original DataFrame. We can verify that the original DataFrame did not have its column names modified:

In [55]:
df_d.columns

Index(['rolls', 'names', 'courses', 'mode'], dtype='object')

To modify the DataFrame without making a copy, we can use the inplace=True parameter to .rename():

In [56]:
df_d.rename(columns = {'rolls': 'roll_no'}, inplace = True)
df_d.columns

Index(['roll_no', 'names', 'courses', 'mode'], dtype='object')
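Index labels can be renamed in the same way by passing a mapping to the index parameter of .rename(). A minimal sketch, with a hypothetical frame and the made-up new label 'first':

```python
import pandas as pd

frame = pd.DataFrame({'rolls': [1, 2]}, index=['r0', 'r1'])

# Returns a new DataFrame; labels missing from the mapping are kept as-is
renamed = frame.rename(index={'r0': 'first'})

print(list(renamed.index))
print(list(frame.index))   # the original is unchanged
```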

### <span style="color:green">Adding and inserting columns

Columns can be added to a DataFrame using several methods. The simplest is to assign a Series to a new column with the [] operator, using a name that is not already in the .columns index. The Series is merged into the DataFrame along the index. Note that this modifies the DataFrame in-place and does not result in a copy.


Alignment of data is important to understanding this process, as pandas does not simply concatenate the Series to the DataFrame. pandas first aligns the data in the DataFrame with the Series using the index of both objects, and then fills in the data from the Series at the appropriate index labels.


To demonstrate this, we will add a purely demonstrative column called Twice_roll_no, whose values are 2 * the roll_no column. Since this modifies the DataFrame object in-place, we will also make a copy and then add the column to the copy, so as to leave the original unmodified:

In [57]:
df_d_c = df_d.copy()         # make a copy
df_d_c['Twice_roll_no'] = df_d.roll_no * 2      # add a new column to the copy
df_d_c[:2]

Unnamed: 0,roll_no,names,courses,mode,Twice_roll_no
r0,1,Ali,Python,Online,2
r1,2,Amjad,Pandas,Onsite,4


<mark>This process is actually selecting the roll_no column out of the df_d object, then creating another Series with each value of the roll_no multiplied by two. The DataFrame then aligns this new Series by label, copies the data at the appropriate
labels, and adds the column at the end of the columns index.

If you want to add the column at a different location in the DataFrame object, instead of at the rightmost position, use the .insert() method of the DataFrame. The following code inserts the Twice_roll_no column between names and courses:

In [58]:
df_d_c1 = df_d.copy()         # make a copy
df_d_c1.insert(2, 'Twice_roll_no', df_d.roll_no * 2)     # insert df_d.roll_no * 2 as the third column in the DataFrame
df_d_c1[:2]

Unnamed: 0,roll_no,names,Twice_roll_no,courses,mode
r0,1,Ali,2,Python,Online
r1,2,Amjad,4,Pandas,Onsite


It is important to remember that this is not simply inserting a column into the DataFrame. The alignment process used here is performing a left join of the DataFrame and the Series by their index labels, and then creating the column and populating the
data in the appropriate cells in the DataFrame from matching entries in the Series. If an index label in the DataFrame is not matched in the Series, the value used will be NaN. Items in the Series that do not have a matching label will be ignored.

In [59]:
rcopy = df_d[0:3][['names']].copy()    # extract the first three rows and just the names column
rcopy

Unnamed: 0,names
r0,Ali
r1,Amjad
r2,Saima


In [60]:
# create a new Series to merge as a column
s = pd.Series({'MMM': 'Is in the DataFrame','r2': 'Not in the DataFrame'} ) # one label exists in rcopy (r2), and MMM does not
print(s)
rcopy['Comment'] = s # add s to rcopy as a column named 'Comment'
rcopy

MMM     Is in the DataFrame
r2     Not in the DataFrame
dtype: object


Unnamed: 0,names,Comment
r0,Ali,
r1,Amjad,
r2,Saima,Not in the DataFrame


The labels r0 and r1 were not found in the Series s, and therefore the values in the result are NaN. r2 is the only label present in both, so its value from s is put in the result. The label MMM exists only in s and is ignored.
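When the labels of a Series are not meaningful and the values should simply be placed in order, one option is to pass a plain array instead of the Series, which skips alignment. The sketch below contrasts the two. The frame, the Series `comments` and the column names are hypothetical.

```python
import pandas as pd

frame = pd.DataFrame({'names': ['Ali', 'Amjad', 'Saima']},
                     index=['r0', 'r1', 'r2'])
comments = pd.Series(['first', 'second', 'third'], index=['x', 'y', 'z'])

# Assigning the Series aligns by label: nothing matches, so every value is NaN
frame['aligned'] = comments

# Assigning a plain array fills by position: no alignment takes place
frame['positional'] = comments.to_numpy()

print(frame)
```

The array must have exactly as many elements as the DataFrame has rows, otherwise pandas raises a ValueError.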

##  <span style="color:green">Replacing the contents of a column

In general, assignment of a Series to a column using the [] operator will either create a new column if the column does not already exist, or replace the contents of a column if it already exists. To demonstrate replacement, the following code replaces the Twice_roll_no column with its own values incremented by one, instead of creating a new column:

In [64]:
df_d_c2 = df_d_c1.copy()
df_d_c2.Twice_roll_no = df_d_c1.Twice_roll_no+1 # replace the twice_roll_no column data with the new values 
df_d_c2[:6]

Unnamed: 0,roll_no,names,Twice_roll_no,courses,mode
r0,1,Ali,3,Python,Online
r1,2,Amjad,5,Pandas,Onsite
r2,3,Saima,7,Numpy,Online
r3,4,Qaiser,9,Pandas,Onsite
r4,5,Hamid,11,Numpy,Online
r5,6,Nasir,13,Python,Onsite


To emphasize that this is also doing an alignment, we can change the sample slightly. The following code only uses the Twice_roll_no values from three of the first four rows (r3, r1 and r0, in that order). The remaining three rows have no matching label, resulting in NaN values:

In [67]:
copy = df_d_c1.copy()

Twice_roll_nos = df_d_c1.iloc[[3, 1, 0]].Twice_roll_no.copy()   # this copies Twice_roll_no for the rows at positions 3, 1 and 0
Twice_roll_nos

r3    8
r1    4
r0    2
Name: Twice_roll_no, dtype: int64

In [68]:
# now replace the Twice_roll_no column with Twice_roll_nos
copy.Twice_roll_no = Twice_roll_nos
# it's not really simple insertion, it is alignment, values are put in the correct place according to labels
copy

Unnamed: 0,roll_no,names,Twice_roll_no,courses,mode
r0,1,Ali,2.0,Python,Online
r1,2,Amjad,4.0,Pandas,Onsite
r2,3,Saima,,Numpy,Online
r3,4,Qaiser,8.0,Pandas,Onsite
r4,5,Hamid,,Numpy,Online
r5,6,Nasir,,Python,Onsite


##  <span style="color:green">Deleting columns in a DataFrame
    
Columns can be deleted from a DataFrame by using the del keyword, the pop(column) method of the DataFrame, or by calling the drop() method of the DataFrame.
    
The behavior of each of these differs slightly:
    
- del will simply delete the Series from the DataFrame (in-place)
    
- pop() will both delete the Series and return the Series as a result (also in-place)
    
- drop(labels, axis=1) will return a new DataFrame with the column(s) removed (the original DataFrame object is not modified)
    
The following code demonstrates using del to delete the mode column from a copy of a subset of the df_d data:

In [71]:
copy = df_d[:2].copy()   # make a copy of a subset of the data frame
copy

Unnamed: 0,roll_no,names,courses,mode
r0,1,Ali,Python,Online
r1,2,Amjad,Pandas,Onsite


In [73]:
# delete the mode column

del copy['mode']    # deletion is in-place
copy

Unnamed: 0,roll_no,names,courses
r0,1,Ali,Python
r1,2,Amjad,Pandas


The following code demonstrates using the .pop() method to remove a column:

In [75]:
# Example of using pop to remove a column from a DataFrame


copy = df_d[:2].copy()   # make a copy of a subset of the data frame
popped = copy.pop('mode')      # this will remove mode and return it as a series. (pop works in-place)

# mode column removed in-place
copy

Unnamed: 0,roll_no,names,courses
r0,1,Ali,Python
r1,2,Amjad,Pandas


The .drop() method can be used to remove both rows and columns. To use it to remove a column, specify axis=1:

In [82]:
# Example of using drop to remove a column

copy = df_d[:2].copy()   # make a copy of a subset of the DataFrame
afterdrop = copy.drop(['mode'], axis = 1) # return a new DataFrame with mode removed (the copy DataFrame is not modified)
print("before using drop() method:\n\n",copy)
print('\n')
print("after using drop() method:\n\n",afterdrop)

before using drop() method:

     roll_no  names courses    mode
r0        1    Ali  Python  Online
r1        2  Amjad  Pandas  Onsite


after using drop() method:

     roll_no  names courses
r0        1    Ali  Python
r1        2  Amjad  Pandas
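Since .drop() handles rows as well, a short sketch of both directions may help. The frame below is hypothetical. The `columns=` keyword is an alternative spelling of axis=1.

```python
import pandas as pd

frame = pd.DataFrame({'roll_no': [1, 2, 3],
                      'mode': ['Online', 'Onsite', 'Online']},
                     index=['r0', 'r1', 'r2'])

without_row = frame.drop(['r1'], axis=0)      # axis=0 removes rows by index label
without_col = frame.drop(columns=['mode'])    # same effect as drop(['mode'], axis=1)

print(without_row)
print(without_col)
```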


### <div class= "alert alert-success">  Modification of Rows 

##  <span style="color:green">Adding rows to a DataFrame
    
Rows can be added to a DataFrame object via several different operations:
    
- Appending a DataFrame to another
    
- Concatenation of two DataFrame objects
    
- Setting with enlargement
    
### Appending rows with .append():
    
Appending is performed using the .append() method of the DataFrame. The process of appending returns a new DataFrame with the rows of the original DataFrame first, followed by the rows of the second. <mark>Appending does not perform alignment and can result in duplicate index values</mark>. Note that .append() was deprecated in pandas 1.4 and removed in pandas 2.0, so on recent versions pd.concat() must be used instead. The following code demonstrates appending two DataFrame objects extracted from the df_d data. The first DataFrame consists of rows 1 and 2, and the second consists of rows 2 and 3. Row 2 (with label r2) is included in both to demonstrate the creation of duplicate index labels. The code is as follows:

In [86]:
df1 = df_d.iloc[1:3].copy()     # copy r1 &r2 of df_d

df2 = df_d.iloc[[2, 3]]   # copy r2 and r3 rows

appended = df1.append(df2)      # append df1 and df2

appended                        # the result is the rows of the first followed by those of the second



Unnamed: 0,roll_no,names,courses,mode
r1,2,Amjad,Pandas,Onsite
r2,3,Saima,Numpy,Online
r2,3,Saima,Numpy,Online
r3,4,Qaiser,Pandas,Onsite


The set of columns of the DataFrame objects being appended do not need to be the same. The resulting DataFrame will consist of the union of the columns in both and where either did not have a column, NaN will be used as the value. The following code demonstrates this by creating a third DataFrame using the same index as df1, but having a single column with a unique column name:

In [87]:
# DataFrame using df1.index and just a PER column; also a good example of using a scalar value to initialize multiple rows

df3 = pd.DataFrame(0.0,index=df1.index,columns=['PER'])
df3

Unnamed: 0,PER
r1,0.0
r2,0.0


In [88]:
# df1 had no PER column, so NaN for those rows. df3 had no roll_no, names, courses or mode, so NaN values

df1.append(df3)



Unnamed: 0,roll_no,names,courses,mode,PER
r1,2.0,Amjad,Pandas,Onsite,
r2,3.0,Saima,Numpy,Online,
r1,,,,,0.0
r2,,,,,0.0


To append without forcing the index to be taken from either DataFrame, you can use the ignore_index=True parameter. This is useful when the index values are not of significant meaning, and you just want concatenated data with sequentially increasing integers as indexes:

In [89]:
df1.append(df3, ignore_index=True)    # ignore index labels, create default index



Unnamed: 0,roll_no,names,courses,mode,PER
0,2.0,Amjad,Pandas,Onsite,
1,3.0,Saima,Numpy,Online,
2,,,,,0.0
3,,,,,0.0
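On pandas 2.0 and later, where DataFrame.append() is no longer available, the same results can be obtained with pd.concat(). The sketch below rebuilds small stand-ins for df1 and df3 (only two columns are kept for brevity) and repeats both appends.

```python
import pandas as pd

# Stand-ins for the df1 and df3 frames used above
df1 = pd.DataFrame({'roll_no': [2, 3], 'names': ['Amjad', 'Saima']},
                   index=['r1', 'r2'])
df3 = pd.DataFrame(0.0, index=df1.index, columns=['PER'])

stacked = pd.concat([df1, df3])                         # keeps labels r1, r2, r1, r2
renumbered = pd.concat([df1, df3], ignore_index=True)   # new default index 0..3

print(stacked)
print(renumbered)
```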


##  <span style="color:green">Concatenating DataFrame objects with pd.concat()

A DataFrame can be concatenated to another using the pd.concat() function. This function works similarly to the .append() method, but also adds the ability to specify an axis (appending can be row or column based), as well as being able to perform several join operations between the objects. Also,<mark> the function takes a list of pandas objects to concatenate, so you can concatenate more than two objects in a single call </mark>. The default operation of pd.concat() on two DataFrame objects operates in the same way as the .append() method. This can be demonstrated by reconstructing the two datasets from the earlier example and concatenating them. This is shown in the following example, which concatenates along rows (axis = 0):

In [90]:
df1 = df_d.iloc[1:3].copy()     # copy r1 &r2 of df_d
df2 = df_d.iloc[[2, 3]]         # copy r2 and r3 rows
pd.concat([df1, df2])           # pass them as a list

Unnamed: 0,roll_no,names,courses,mode
r1,2,Amjad,Pandas,Onsite
r2,3,Saima,Numpy,Online
r2,3,Saima,Numpy,Online
r3,4,Qaiser,Pandas,Onsite


More precisely, pandas calculates the union of the distinct column names across all supplied objects and uses those as the columns, and then appends data along the rows for each object in the order given in the list. A slight variant of this example adds an additional column to one of the DataFrame objects and then performs the concatenation:

In [93]:
df2_2 = df2.copy()
df2_2.insert(3, 'Foo', pd.Series(0, index=df2.index))        # add a column to df2_2 that is not in df1
df2_2


Unnamed: 0,roll_no,names,courses,Foo,mode
r2,3,Saima,Numpy,0,Online
r3,4,Qaiser,Pandas,0,Onsite


In [94]:
pd.concat([df1, df2_2])                                      # after concatenating

Unnamed: 0,roll_no,names,courses,mode,Foo
r1,2,Amjad,Pandas,Onsite,
r2,3,Saima,Numpy,Online,
r2,3,Saima,Numpy,Online,0.0
r3,4,Qaiser,Pandas,Onsite,0.0


Duplicate index labels still result, as the rows are copied verbatim from the source objects. However, note the NaN values in the rows originating from df1, since it does not have a Foo column. 

<mark>Using the keys parameter, it is possible to differentiate the pandas objects from which the rows originated. The following code adds a level to the index which represents the source object:

In [95]:
# specify keys
r = pd.concat([df1, df2_2], keys=['df1', 'df2'])
r

Unnamed: 0,Unnamed: 1,roll_no,names,courses,mode,Foo
df1,r1,2,Amjad,Pandas,Onsite,
df1,r2,3,Saima,Numpy,Online,
df2,r2,3,Saima,Numpy,Online,0.0
df2,r3,4,Qaiser,Pandas,Onsite,0.0


We can change the axis of the concatenation to work along the columns by specifying axis=1, which will calculate the union of the distinct index labels from the rows and then append columns and their data from the specified objects. To demonstrate, the following splits the df_d data into two DataFrame objects, each with a different set of columns, and then concatenates along axis=1:

In [107]:
df3 = df_d[:3][['roll_no', 'names']]   
print('first three rows and columns (roll_no and names):\n\n',df3)
df4 = df_d[:3][['courses']]             
print('first three rows and column (courses):\n\n',df4,'\n\n')
print('put them back together along axis = 1:')
pd.concat([df3, df4], axis=1)         

first three rows and columns (roll_no and names):

     roll_no  names
r0        1    Ali
r1        2  Amjad
r2        3  Saima
first three rows and column (courses):

    courses
r0  Python
r1  Pandas
r2   Numpy 


put them back together along axis = 1:


Unnamed: 0,roll_no,names,courses
r0,1,Ali,Python
r1,2,Amjad,Pandas
r2,3,Saima,Numpy


We can further examine this operation by adding a column to the second DataFrame that has the same name as a column in the first. The result will have duplicate columns, as the columns are blindly appended without regard to already existing columns:

In [110]:
df4_2 = df4.copy()
df4_2.insert(1, 'names', pd.Series(1, index=df4_2.index))   
print('After add a column to df4_2, that is also in df3\n\n:',df4_2)
pd.concat([df3, df4_2], axis=1)

After add a column to df4_2, that is also in df3

:    courses  names
r0  Python      1
r1  Pandas      1
r2   Numpy      1


Unnamed: 0,roll_no,names,courses,names.1
r0,1,Ali,Python,1
r1,2,Amjad,Pandas,1
r2,3,Saima,Numpy,1
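
Duplicate column labels are easy to miss, so it can help to detect them explicitly. The following is a sketch under assumptions: the two frames are small stand-ins for df3 and df4_2, and the name names_2 used in the rename is purely illustrative.

```python
import pandas as pd

# Assumed stand-ins for df3 and df4_2
df3 = pd.DataFrame({"roll_no": [1, 2], "names": ["Ali", "Amjad"]}, index=["r0", "r1"])
df4_2 = pd.DataFrame({"courses": ["Python", "Pandas"], "names": [1, 1]}, index=["r0", "r1"])

combined = pd.concat([df3, df4_2], axis=1)

# Flag every column label that repeats an earlier label
dup_mask = combined.columns.duplicated()
print(list(combined.columns))
print(list(dup_mask))

# Selecting a duplicated label returns a DataFrame with both columns, not a Series
both_names = combined["names"]
print(type(both_names).__name__, both_names.shape)

# One way to avoid the clash: rename the column before concatenating
renamed = pd.concat([df3, df4_2.rename(columns={"names": "names_2"})], axis=1)
print(list(renamed.columns))
```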


To be very specific, pandas is performing an outer join along the labels of the specified axis. An inner join can be specified using the join='inner' parameter, which changes the operation from being a union of the distinct labels to the intersection of the labels. To demonstrate, the following selects two subsets of the df_d data with one row in common and performs an inner join:

In [117]:
df5 = df_d[:3][['roll_no','names']] 
print('first three rows and first two columns:\n\n',df5)
df6 = df_d[2:5][['roll_no','names']]
print('\nrow 2 through 4 and first two columns:\n\n',df6)
print('\ninner join on index labels will return in only one row:\n\n')
pd.concat([df5, df6], join='inner', axis=1) 

first three rows and first two columns:

     roll_no  names
r0        1    Ali
r1        2  Amjad
r2        3  Saima

row 2 through 4 and first two columns:

     roll_no   names
r2        3   Saima
r3        4  Qaiser
r4        5   Hamid

inner join on index labels will return in only one row:




Unnamed: 0,roll_no,names,roll_no.1,names.1
r2,3,Saima,3,Saima
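
To see the difference between the two join types side by side, the following sketch runs the same concatenation with the default outer join and with join='inner'. It assumes a df_d rebuilt from the tables shown in this section (only two of its columns are needed here):

```python
import pandas as pd

# Assumed reconstruction of part of df_d
df_d = pd.DataFrame(
    {
        "roll_no": [1, 2, 3, 4, 5, 6],
        "names": ["Ali", "Amjad", "Saima", "Qaiser", "Hamid", "Nasir"],
    },
    index=["r0", "r1", "r2", "r3", "r4", "r5"],
)

left = df_d.iloc[:3][["roll_no"]]    # rows r0..r2
right = df_d.iloc[2:5][["names"]]    # rows r2..r4

# Outer join (the default): every label from either side, NaN where data is missing
outer = pd.concat([left, right], axis=1)
print(outer)

# Inner join: only labels present on both sides
inner = pd.concat([left, right], axis=1, join="inner")
print(inner)
```

Note that roll_no is converted to a floating-point column in the outer result, since NaN cannot be stored in an integer column.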


##  <span style="color:green">Adding rows (and columns) via setting with enlargement
    
Rows can also be added to a DataFrame through the .loc property. <mark>This technique is referred to as setting with enlargement</mark>. The parameter for .loc specifies the index label where the row is to be placed. If the label does not exist, the values are appended to the DataFrame using the given index label. If it does exist, then the values in the specified row are replaced. 
    
The following example takes a subset of df_d and adds a row with the label FOO:

In [121]:
ss = df_d[:3].copy()                      # get a small subset of df_d; copy the slice so df_d itself is not affected
# create a new row with index label FOO and assign some values to the columns via a list
ss.loc['FOO'] = [7,'Ayesha', 'Pandas', 'Online']
ss

Unnamed: 0,roll_no,names,courses,mode
r0,1,Ali,Python,Online
r1,2,Amjad,Pandas,Onsite
r2,3,Saima,Numpy,Online
FOO,7,Ayesha,Pandas,Online


Note that the change is made in place. If FOO already exists as an index label, then the column data would be replaced. This is one of the means of updating data in a DataFrame in-place, as .loc not only retrieves row(s), but also lets you modify the
results that are returned. 
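
Both behaviours of .loc (append when the label is new, replace when it already exists) can be checked with a short sketch. It assumes a df_d rebuilt from the tables in this section; the replacement values are arbitrary:

```python
import pandas as pd

# Assumed reconstruction of df_d
df_d = pd.DataFrame(
    {
        "roll_no": [1, 2, 3, 4, 5, 6],
        "names": ["Ali", "Amjad", "Saima", "Qaiser", "Hamid", "Nasir"],
        "courses": ["Python", "Pandas", "Numpy", "Pandas", "Numpy", "Python"],
        "mode": ["Online", "Onsite", "Online", "Onsite", "Online", "Onsite"],
    },
    index=["r0", "r1", "r2", "r3", "r4", "r5"],
)

ss = df_d.iloc[:3].copy()

# Label FOO does not exist yet, so a new row is appended
ss.loc["FOO"] = [7, "Ayesha", "Pandas", "Online"]
rows_after_add = len(ss)

# Label FOO now exists, so the same statement replaces the row instead
ss.loc["FOO"] = [8, "Ayesha", "Numpy", "Onsite"]
rows_after_replace = len(ss)

print(rows_after_add, rows_after_replace)
print(ss.loc["FOO"])
```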

It is also possible to add columns in this manner. The following code demonstrates by adding a new column to a subset of df_d using .loc. Note that to accomplish this, we use the colon in the rows' position to select all rows, so that the new column and its value are set for every row:

In [122]:
ss = df_d[:3].copy()         # copy of subset / slice
ss.loc[:,'PER'] = 0          # add the new column initialized to 0
ss

Unnamed: 0,roll_no,names,courses,mode,PER
r0,1,Ali,Python,Online,0
r1,2,Amjad,Pandas,Onsite,0
r2,3,Saima,Numpy,Online,0


##  <span style="color:green">Removing rows from a DataFrame
    
Removing rows from a DataFrame object is normally performed using one of three techniques:
    
- Using the .drop() method
    
- Boolean selection
    
- Selection using a slice
    
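Technically, none of these techniques removes rows from the source object by default. The .drop() method returns a new DataFrame without the specified rows (it modifies the source only when inplace=True is passed), and the other techniques either create a copy without specific rows, or a view into the rows that are not to be dropped. Details of each are given in the following sections.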
    
### Removing rows using .drop()
To remove rows from a DataFrame by the index label, you can use the .drop() method of the DataFrame. The .drop() method takes a list of index labels and will return a copy of the DataFrame with the rows for the specified labels removed. The source DataFrame remains unmodified. The code is as follows:

In [125]:
ss = df_d[:5].copy()
print(ss)
# drop rows with labels r0 and r3
afterdrop = ss.drop(['r0', 'r3'])
afterdrop

    roll_no   names courses    mode
r0        1     Ali  Python  Online
r1        2   Amjad  Pandas  Onsite
r2        3   Saima   Numpy  Online
r3        4  Qaiser  Pandas  Onsite
r4        5   Hamid   Numpy  Online


Unnamed: 0,roll_no,names,courses,mode
r1,2,Amjad,Pandas,Onsite
r2,3,Saima,Numpy,Online
r4,5,Hamid,Numpy,Online


In [126]:
ss      # note that ss is not modified

Unnamed: 0,roll_no,names,courses,mode
r0,1,Ali,Python,Online
r1,2,Amjad,Pandas,Onsite
r2,3,Saima,Numpy,Online
r3,4,Qaiser,Pandas,Onsite
r4,5,Hamid,Numpy,Online
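
Two optional parameters of .drop() are worth knowing: errors='ignore' skips labels that are not present, and inplace=True modifies the source object instead of returning a copy. The following is a sketch, assuming a df_d rebuilt from the tables in this section (two columns only) and a made-up label 'missing':

```python
import pandas as pd

# Assumed reconstruction of part of df_d
df_d = pd.DataFrame(
    {
        "roll_no": [1, 2, 3, 4, 5, 6],
        "names": ["Ali", "Amjad", "Saima", "Qaiser", "Hamid", "Nasir"],
    },
    index=["r0", "r1", "r2", "r3", "r4", "r5"],
)

ss = df_d.iloc[:5].copy()

# Default: returns a new DataFrame, ss is untouched
dropped = ss.drop(["r0", "r3"])

# A label that does not exist raises KeyError unless errors="ignore" is given
safe = ss.drop(["r0", "missing"], errors="ignore")

# inplace=True removes the rows from ss itself and returns None
ss.drop(["r0", "r3"], inplace=True)

print(list(dropped.index))
print(list(safe.index))
print(list(ss.index))
```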


### Removing rows using Boolean selection

Boolean selection can be used to remove rows from a DataFrame by creating a new DataFrame without the undesired rows. Suppose we want to remove rows where roll_no > 4. The process is to first determine which rows match that criterion, and then to select the rows that do not. The following code builds the Boolean selection and shows the rows that match it:

In [128]:
selection = df_d.roll_no > 4
x = df_d[selection]
x

Unnamed: 0,roll_no,names,courses,mode
r4,5,Hamid,Numpy,Online
r5,6,Nasir,Python,Onsite


To remove the rows that match the criterion (roll_no > 4), invert the selection with the bitwise NOT operator (~), which keeps only the rows that do not match:

In [129]:
y = df_d[~selection]
y

Unnamed: 0,roll_no,names,courses,mode
r0,1,Ali,Python,Online
r1,2,Amjad,Pandas,Onsite
r2,3,Saima,Numpy,Online
r3,4,Qaiser,Pandas,Onsite
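
The same pattern extends to several criteria combined with & (and) and | (or). The sketch below assumes a df_d rebuilt from the tables in this section; the second criterion (mode equal to 'Onsite') is chosen only for illustration. Each comparison must be wrapped in parentheses because & and | bind more tightly than the comparison operators.

```python
import pandas as pd

# Assumed reconstruction of part of df_d
df_d = pd.DataFrame(
    {
        "roll_no": [1, 2, 3, 4, 5, 6],
        "names": ["Ali", "Amjad", "Saima", "Qaiser", "Hamid", "Nasir"],
        "mode": ["Online", "Onsite", "Online", "Onsite", "Online", "Onsite"],
    },
    index=["r0", "r1", "r2", "r3", "r4", "r5"],
)

# Rows to remove: roll_no above 4 OR mode is Onsite
selection = (df_d["roll_no"] > 4) | (df_d["mode"] == "Onsite")

# Keep everything that is NOT selected; df_d itself is unchanged
kept = df_d[~selection]
print(kept)
```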


### Removing rows using a slice

Slicing is also often used to remove records from a DataFrame. It is a process similar to Boolean selection, where we select all of the rows except for the ones we want deleted.

Suppose we want to remove all but the first three records from df_d. The slice to perform this task is [:3]:

In [130]:
onlyFirstThree = df_d[:3]
onlyFirstThree

Unnamed: 0,roll_no,names,courses,mode
r0,1,Ali,Python,Online
r1,2,Amjad,Pandas,Onsite
r2,3,Saima,Numpy,Online


<mark>Remember that this result is a slice of df_d; no data has been removed from the df_d object.</mark> Depending on the pandas version, the slice may be a view into df_d, in which case changes to these three rows can propagate to df_d (or trigger a SettingWithCopyWarning). To prevent this, the proper action is to make a copy of the slice, as follows:

In [131]:
onlyFirstThree = df_d[:3].copy()
onlyFirstThree

Unnamed: 0,roll_no,names,courses,mode
r0,1,Ali,Python,Online
r1,2,Amjad,Pandas,Onsite
r2,3,Saima,Numpy,Online
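
A quick way to confirm that the copy is independent is to modify it and then inspect the original. This is a minimal sketch, assuming a df_d rebuilt from the tables in this section; the value 'CHANGED' is arbitrary:

```python
import pandas as pd

# Assumed reconstruction of part of df_d
df_d = pd.DataFrame(
    {
        "roll_no": [1, 2, 3, 4, 5, 6],
        "names": ["Ali", "Amjad", "Saima", "Qaiser", "Hamid", "Nasir"],
    },
    index=["r0", "r1", "r2", "r3", "r4", "r5"],
)

only_first_three = df_d[:3].copy()

# Modify the copy only
only_first_three.loc["r0", "names"] = "CHANGED"

print(only_first_three.loc["r0", "names"])  # the copy holds the new value
print(df_d.loc["r0", "names"])              # the original is unaffected
```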


### <div class= "alert alert-success">Changing scalar values in a DataFrame
    
Scalar values in a DataFrame can be changed by assigning the new value to the result of a value lookup using the .iloc and .loc attributes. Both attributes accept a row selector and a column selector, and the value assigned to the result is written into the original DataFrame.
    
The following code makes a copy of df_d and then demonstrates changing the roll_no on
the r3 and r4 rows:

In [133]:
subset = df_d.copy()
subset.loc['r3', 'roll_no'] = 10
subset.loc['r4', 'roll_no'] = 20
subset

Unnamed: 0,roll_no,names,courses,mode
r0,1,Ali,Python,Online
r1,2,Amjad,Pandas,Onsite
r2,3,Saima,Numpy,Online
r3,10,Qaiser,Pandas,Onsite
r4,20,Hamid,Numpy,Online
r5,6,Nasir,Python,Onsite


.loc may suffer from lower performance, as compared to .iloc, due to the possibility of needing to map the label values into locations. The following example gets the location of the specific row and column that is desired to be changed and then uses
.iloc to execute the change :

In [135]:
subset = df_d.copy()
roll_no_loc = df_d.columns.get_loc('roll_no')    # get the location of the roll_no column
r2_row_loc = df_d.index.get_loc('r2')            # get the location of the r2 row
subset.iloc[r2_row_loc, roll_no_loc] = 1000       # change the roll_no
subset

Unnamed: 0,roll_no,names,courses,mode
r0,1,Ali,Python,Online
r1,2,Amjad,Pandas,Onsite
r2,1000,Saima,Numpy,Online
r3,4,Qaiser,Pandas,Onsite
r4,5,Hamid,Numpy,Online
r5,6,Nasir,Python,Onsite


This may look like overkill for this small example. But where code is executed frequently, such as in a loop, looking up the locations once and then always using .iloc with those values can give performance gains over repeated label lookups.
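
The pattern of resolving the positions once and reusing them can be sketched as follows. It assumes a df_d rebuilt from the tables in this section, and the list of new values is made up for illustration; no timing claim is made here.

```python
import pandas as pd

# Assumed reconstruction of part of df_d
df_d = pd.DataFrame(
    {
        "roll_no": [1, 2, 3, 4, 5, 6],
        "names": ["Ali", "Amjad", "Saima", "Qaiser", "Hamid", "Nasir"],
    },
    index=["r0", "r1", "r2", "r3", "r4", "r5"],
)

subset = df_d.copy()

# Resolve labels to integer positions once, outside the loop
col_pos = subset.columns.get_loc("roll_no")
row_pos = subset.index.get_loc("r2")

# Reuse the positions for every update (hypothetical stream of new values)
for new_value in [100, 200, 300]:
    subset.iloc[row_pos, col_pos] = new_value

print(row_pos, col_pos)
print(subset.loc["r2", "roll_no"])
```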

### <div class= "alert alert-success">Resetting and reindexing
    
A DataFrame can have its index reset by using the .reset_index() method. A common use of this is to move the contents of a DataFrame object's index into one or more columns. The following code moves the labels in the index of df_d into a column named index and replaces the index with a default integer index. The result is a new DataFrame, not an in-place update. The code is as follows:

In [136]:
reset_df_d = df_d.reset_index()
reset_df_d

Unnamed: 0,index,roll_no,names,courses,mode
0,r0,1,Ali,Python,Online
1,r1,2,Amjad,Pandas,Onsite
2,r2,3,Saima,Numpy,Online
3,r3,4,Qaiser,Pandas,Onsite
4,r4,5,Hamid,Numpy,Online
5,r5,6,Nasir,Python,Onsite


One or more columns can also be moved into the index. A common scenario is exhibited by the reset_df_d variable we just created, as this may have been data read in from a file with the row labels in a column when we really would like them in the index.
To do this, we can utilize the .set_index() method. The following code moves the index column into the index of a new DataFrame:

In [137]:
reset_df_d.set_index('index')      # move the 'index' column into the index

index,roll_no,names,courses,mode
r0,1,Ali,Python,Online
r1,2,Amjad,Pandas,Onsite
r2,3,Saima,Numpy,Online
r3,4,Qaiser,Pandas,Onsite
r4,5,Hamid,Numpy,Online
r5,6,Nasir,Python,Onsite
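
The two methods are close to inverses of each other, which the following sketch illustrates. It assumes a df_d rebuilt from the tables in this section (first three rows only); the axis name row_id is a made-up label used to show how the new column gets its name.

```python
import pandas as pd

# Assumed reconstruction of part of df_d
df_d = pd.DataFrame(
    {
        "roll_no": [1, 2, 3],
        "names": ["Ali", "Amjad", "Saima"],
    },
    index=["r0", "r1", "r2"],
)

# Round trip: index -> column named 'index' -> index again
reset_df_d = df_d.reset_index()
restored = reset_df_d.set_index("index")
restored.index.name = None   # set_index keeps the column name as the index name

# drop=True discards the old index instead of turning it into a column
renumbered = df_d.reset_index(drop=True)

# Naming the index first controls the name of the resulting column
named = df_d.rename_axis("row_id").reset_index()

print(list(restored.index))
print(list(renumbered.index))
print(list(named.columns))
```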


A DataFrame can be conformed to a new index using the .reindex() method. This method, given a list of labels representing the new index, creates a new DataFrame with those labels and then aligns the data from the source into the new object. Labels that are not present in the source are filled with NaN.

The following code demonstrates this, by using a subset of df_d and assigning a new index that contains a subset of those indexes and an additional label FOO:

In [138]:
# get first four rows
subset = df_d[:4].copy()
subset

Unnamed: 0,roll_no,names,courses,mode
r0,1,Ali,Python,Online
r1,2,Amjad,Pandas,Onsite
r2,3,Saima,Numpy,Online
r3,4,Qaiser,Pandas,Onsite


In [139]:

reindexed = subset.reindex(index=['r0', 'r3', 'FOO'])  # reindex to have r0, r3, and FOO index labels

reindexed                                                 # note that r1 and r2 are dropped and FOO has NaN values

Unnamed: 0,roll_no,names,courses,mode
r0,1.0,Ali,Python,Online
r3,4.0,Qaiser,Pandas,Onsite
FOO,,,,
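
Note how roll_no became floating point in the result above, because NaN cannot be stored in an integer column. If a default other than NaN is acceptable, the fill_value parameter of .reindex() avoids this. The sketch below assumes a numeric-only slice of df_d rebuilt from the tables in this section, and uses 0 as an arbitrary fill value:

```python
import pandas as pd

# Assumed reconstruction of the roll_no column of the first four rows of df_d
subset = pd.DataFrame(
    {"roll_no": [1, 2, 3, 4]},
    index=["r0", "r1", "r2", "r3"],
)

# Default behaviour: the missing label FOO gets NaN and the column becomes float
default = subset.reindex(index=["r0", "r3", "FOO"])
print(default.dtypes)

# With fill_value, the missing label gets 0 and the column stays integer
filled = subset.reindex(index=["r0", "r3", "FOO"], fill_value=0)
print(filled.dtypes)
print(filled)
```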


Reindexing can also be done upon the columns. The following reindexes the columns of subset:

In [143]:
subset.reindex(columns=['roll_no','Names','Courses'])

Unnamed: 0,roll_no,Names,Courses
r0,1,,
r1,2,,
r2,3,,
r3,4,,


pandas produces this result by creating a new DataFrame with the specified columns and then aligning the data for those columns from subset into the new object. Column labels are case-sensitive, so because subset has no 'Names' or 'Courses' columns (only 'names' and 'courses'), those values are filled with NaN.

Finally, a DataFrame can also be reindexed on rows and columns at the same time:

In [145]:
subset.reindex(columns=['roll_no','Names','Courses'],index=['R0','R1','R2','r3'])

Unnamed: 0,roll_no,Names,Courses
R0,,,
R1,,,
R2,,,
r3,4.0,,
