# Dataframe

The DataFrame data structure is the heart of the pandas library. It is the primary object you will work with in data analysis and cleaning tasks.

The DataFrame is conceptually a two-dimensional Series object, where there is an index and multiple columns of content, with each column having a label. In fact, the distinction between a column and a row is only a conceptual one, and you can think of the DataFrame itself as simply a two-axis labeled array.

## Creating A Dataframe


Before we continue with our discussion, let's first import the pandas library.

In [1]:
import pandas as pd

Let's start building a DataFrame. We'll go through everything step by step.

###### Step1:
Let's first create the dictionaries that will become the rows of the DataFrame we are going to build.

In [2]:
schoolGrades1={"Name":"Abhay Ohja",
             "Subject":"Physics",
             "Marks":90}
schoolGrades2={"Name":"Ashish Kumar",
              "Subject":"Maths",
              "Marks":100}
schoolGrade3={"Name":"Ayush Ohja",
             "Subject":"Computer",
             "Marks":85}

Let's now make a DataFrame. The common syntax involves passing the dictionaries together in a list and passing the row labels as a separate list through the index parameter. The code below shows what we mean.

In [3]:
dataFrameSchoolGrade=pd.DataFrame([schoolGrades1,schoolGrades2,schoolGrade3],index=["School1","School2","School1"])

# Note: let's decode the code to understand what's happening here. We noted that a DataFrame is essentially a
# two-dimensional Series and is indexed via labels just like a Series. The index parameter in the above code
# sets those labels, which function as the row headers of the DataFrame.

# Now look at the code in the previous cell, where we created 3 dictionaries. Each has keys and values, and the
# keys of all of them are the same. Keeping the keys the same matters: the keys become the column headers, and a
# dictionary that lacks a key gets NaN in that column. With the row and column headers in place, the values fill
# the table they define.

# To summarize: the index labels of the DataFrame make up the row headers, and the dictionary keys make up the
# column headers.

In [4]:
# Let's print the DataFrame
dataFrameSchoolGrade

                 Name   Subject  Marks
School1    Abhay Ohja   Physics     90
School2  Ashish Kumar     Maths    100
School1    Ayush Ohja  Computer     85
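To make the row and column labels concrete, here is a minimal sketch that rebuilds the same data under the hypothetical name `grades` (a name chosen only for this illustration) and inspects both axes through the `index`, `columns`, and `shape` attributes.

```python
import pandas as pd

# Rebuild the same data from a list of row dictionaries.
# `grades` is a hypothetical name used only in this sketch.
rows = [
    {"Name": "Abhay Ohja", "Subject": "Physics", "Marks": 90},
    {"Name": "Ashish Kumar", "Subject": "Maths", "Marks": 100},
    {"Name": "Ayush Ohja", "Subject": "Computer", "Marks": 85},
]
grades = pd.DataFrame(rows, index=["School1", "School2", "School1"])

row_labels = list(grades.index)       # labels passed through index=
column_labels = list(grades.columns)  # keys of the dictionaries
shape = grades.shape                  # (number of rows, number of columns)

print(row_labels)
print(column_labels)
print(shape)
```

Note that row labels do not have to be unique: School1 appears twice here.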


In [5]:
# The same list-of-dictionaries approach can also be written inline, where each dictionary
# represents a row of data.

students = [{'Name': 'Abhay Ohja',
              'Class': 'Physics',
              'Score': 90},
            {'Name': 'Ashish Kumar',
             'Class': 'Maths',
             'Score': 100},
            {'Name': 'Ayush Ohja',
             'Class':'Computer',
             'Score':85}]

# Then we pass this list of dictionaries into the DataFrame constructor
df = pd.DataFrame(students, index=['school1', 'school2', 'school1'])
# And let's display the first rows with head()
df.head()

                 Name     Class  Score
school1    Abhay Ohja   Physics     90
school2  Ashish Kumar     Maths    100
school1    Ayush Ohja  Computer     85


## Extracting Data Out Of Dataframes

### 1)Accessing Via Rows

Similar to the Series, we can extract data using the .iloc and .loc attributes. Because the DataFrame is two-dimensional, passing a single label to the .loc indexer returns a Series if only one row matches.

If more than one row matches, a DataFrame is returned instead.

Let's see a few examples to understand.

In [6]:
#Example1
dataFrameSchoolGrade.loc["School1"]

               Name   Subject  Marks
School1  Abhay Ohja   Physics     90
School1  Ayush Ohja  Computer     85


As we can see, two rows matched the label, so a DataFrame was returned.

In [7]:
type(dataFrameSchoolGrade.loc["School1"])

pandas.core.frame.DataFrame

In [8]:
#Example2
dataFrameSchoolGrade.loc["School2"]

Name       Ashish Kumar
Subject           Maths
Marks               100
Name: School2, dtype: object

As only one row matched, the return type of loc was a Series.

In [9]:
type(dataFrameSchoolGrade.loc["School2"])

pandas.core.series.Series
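Because the return type depends on how many rows match, it can help to force a consistent type. The sketch below assumes the same data rebuilt under the hypothetical name `grades`, and shows that passing a list of labels to `.loc` returns a DataFrame even when only one row matches.

```python
import pandas as pd

# Same data as above; `grades` is a hypothetical name for this sketch.
rows = [
    {"Name": "Abhay Ohja", "Subject": "Physics", "Marks": 90},
    {"Name": "Ashish Kumar", "Subject": "Maths", "Marks": 100},
    {"Name": "Ayush Ohja", "Subject": "Computer", "Marks": 85},
]
grades = pd.DataFrame(rows, index=["School1", "School2", "School1"])

single = grades.loc["School2"]      # one matching row, single label: Series
as_frame = grades.loc[["School2"]]  # list of labels: DataFrame with one row

print(type(single))
print(type(as_frame))
```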

In [10]:
# Let's see a few applications of the iloc attribute.
#Example3
dataFrameSchoolGrade.iloc[2]

Name       Ayush Ohja
Subject      Computer
Marks              85
Name: School1, dtype: object

In [11]:
dataFrameSchoolGrade.iloc[0]

Name       Abhay Ohja
Subject       Physics
Marks              90
Name: School1, dtype: object

As we can see, iloc with a single integer returns a single row, and its type is Series.
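Besides a single integer, `.iloc` also accepts integer slices and lists of integers, in which case a DataFrame comes back. A minimal sketch, assuming the same data rebuilt under the hypothetical name `grades`:

```python
import pandas as pd

# Same data as above; `grades` is a hypothetical name for this sketch.
rows = [
    {"Name": "Abhay Ohja", "Subject": "Physics", "Marks": 90},
    {"Name": "Ashish Kumar", "Subject": "Maths", "Marks": 100},
    {"Name": "Ayush Ohja", "Subject": "Computer", "Marks": 85},
]
grades = pd.DataFrame(rows, index=["School1", "School2", "School1"])

first_two = grades.iloc[0:2]  # positions 0 and 1; the end of the slice is excluded
picked = grades.iloc[[0, 2]]  # an explicit list of positions

print(first_two)
print(picked)
```

Since `.iloc` works on positions, the duplicated School1 label causes no ambiguity here.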


One of the powers of the pandas DataFrame is that you can quickly select data based on multiple axes. For instance, if you wanted to just list the student names for School1, you would supply two parameters to .loc, one being the row index and the other being the column name.

The general syntax can be noted as:

###### < Dataframe >.loc["Row Name","Column Name"]

This command selects the row or rows whose label matches the row name and, instead of returning the whole row, returns only the value in the requested column.

Let's see an example to fully understand how it works.

In [12]:
#Example1
dataFrameSchoolGrade.loc["School1","Name"]

School1    Abhay Ohja
School1    Ayush Ohja
Name: Name, dtype: object

In [13]:
type(dataFrameSchoolGrade.loc["School1","Name"])

pandas.core.series.Series

As we can see, because two rows share the label School1, the result is a Series. When the row label is unique, the same syntax returns a single scalar value, as the next example shows.

In [14]:
# Let's see a few more examples
dataFrameSchoolGrade.loc["School2","Subject"]

'Maths'

In [15]:
# Let's see an example where we use iloc in place of loc to access rows.
dataFrameSchoolGrade.iloc[1,"Subject"]

ValueError: Location based indexing can only have [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array] types

As we can see, this does not work: iloc accepts only integer positions, so a column name such as "Subject" is rejected.
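The positional equivalent uses an integer for the column as well. The sketch below assumes the same data rebuilt under the hypothetical name `grades`, and uses `Index.get_loc` to look up the position of a column by name when it is not known in advance.

```python
import pandas as pd

# Same data as above; `grades` is a hypothetical name for this sketch.
rows = [
    {"Name": "Abhay Ohja", "Subject": "Physics", "Marks": 90},
    {"Name": "Ashish Kumar", "Subject": "Maths", "Marks": 100},
    {"Name": "Ayush Ohja", "Subject": "Computer", "Marks": 85},
]
grades = pd.DataFrame(rows, index=["School1", "School2", "School1"])

# Translate the column name into its integer position.
subject_pos = grades.columns.get_loc("Subject")

# Row position 1, column position subject_pos.
value = grades.iloc[1, subject_pos]

print(subject_pos)
print(value)
```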

### 2)Accessing Via Columns


What would we do if we just wanted to select a single column though? Well, there are a few mechanisms. 

###### Mechanism 1

Firstly, we could transpose the matrix. This pivots all of the rows into columns and all of the columns into rows, and is done with the T attribute. After that, we can use the loc or iloc indexers to access the rows, which are the original columns turned into rows by the transpose. Here is the syntax.

###### Syntax For Applying transpose

###### < Name Of Dataframe >.T

Let's see a few examples to understand what we are talking about.

In [16]:
dataFrameSchoolGrade.T

            School1       School2     School1
Name     Abhay Ohja  Ashish Kumar  Ayush Ohja
Subject     Physics         Maths    Computer
Marks            90           100          85


In [17]:
dataFrameSchoolGrade.T.loc["Name"]

School1      Abhay Ohja
School2    Ashish Kumar
School1      Ayush Ohja
Name: Name, dtype: object

As we can see, we have successfully accessed the column "Name" of the DataFrame.

In [18]:
type(dataFrameSchoolGrade.T.loc["Name"])

pandas.core.series.Series

As we can see, the return type is a Series.

In [19]:
# Let's see another example
dataFrameSchoolGrade.T.loc["Subject"]

School1     Physics
School2       Maths
School1    Computer
Name: Subject, dtype: object
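One caveat with the transpose approach is worth checking: when the columns hold different types, every row of the transposed frame has to hold mixed values, so a numeric column comes back with the generic object dtype. A minimal sketch, assuming the same data rebuilt under the hypothetical name `grades`:

```python
import pandas as pd

# Same data as above; `grades` is a hypothetical name for this sketch.
rows = [
    {"Name": "Abhay Ohja", "Subject": "Physics", "Marks": 90},
    {"Name": "Ashish Kumar", "Subject": "Maths", "Marks": 100},
    {"Name": "Ayush Ohja", "Subject": "Computer", "Marks": 85},
]
grades = pd.DataFrame(rows, index=["School1", "School2", "School1"])

marks_direct = grades["Marks"]       # keeps the integer dtype of the column
marks_via_T = grades.T.loc["Marks"]  # same values, but stored as generic objects

print(marks_direct.dtype)
print(marks_via_T.dtype)
```

The values are the same either way, but numeric operations are generally better served by the column that kept its integer dtype.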

###### Mechanism 2

Since iloc and loc are used for row selection, pandas reserves the indexing operator for column selection. In a pandas DataFrame, columns always have a name, so this selection is always label based (by name rather than by position), and is not as confusing as it was when using the square bracket operator on Series objects.

Let's see a few examples to understand.

In [20]:
#Example1
dataFrameSchoolGrade["Name"]

School1      Abhay Ohja
School2    Ashish Kumar
School1      Ayush Ohja
Name: Name, dtype: object

In [21]:
#Example2
dataFrameSchoolGrade["Marks"]

School1     90
School2    100
School1     85
Name: Marks, dtype: int64

In [22]:
#Example3
dataFrameSchoolGrade["Subject"]

School1     Physics
School2       Maths
School1    Computer
Name: Subject, dtype: object

In [23]:
type(dataFrameSchoolGrade["Subject"])

pandas.core.series.Series

As we can see, the return type is a Series.
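The indexing operator also accepts a list of column names, in which case the result is a DataFrame rather than a Series. A minimal sketch, assuming the same data rebuilt under the hypothetical name `grades`:

```python
import pandas as pd

# Same data as above; `grades` is a hypothetical name for this sketch.
rows = [
    {"Name": "Abhay Ohja", "Subject": "Physics", "Marks": 90},
    {"Name": "Ashish Kumar", "Subject": "Maths", "Marks": 100},
    {"Name": "Ayush Ohja", "Subject": "Computer", "Marks": 85},
]
grades = pd.DataFrame(rows, index=["School1", "School2", "School1"])

# Note the double brackets: the inner pair is a Python list of column names.
subset = grades[["Name", "Marks"]]

print(type(subset))
print(subset)
```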

In practice, this works really well since you're often trying to add or drop columns. However, this also means that you get a KeyError if you try to use .loc with a column name as its only argument, because .loc looks for that name among the row labels.

In [24]:
df.loc['Name']

KeyError: 'Name'
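If you do want to reach a column through `.loc`, the column name has to go in the second position, with a full slice for the rows. The sketch below assumes the same data rebuilt under the hypothetical name `grades`, and contrasts the failing call with the working one.

```python
import pandas as pd

# Same data as above; `grades` is a hypothetical name for this sketch.
rows = [
    {"Name": "Abhay Ohja", "Subject": "Physics", "Marks": 90},
    {"Name": "Ashish Kumar", "Subject": "Maths", "Marks": 100},
    {"Name": "Ayush Ohja", "Subject": "Computer", "Marks": 85},
]
grades = pd.DataFrame(rows, index=["School1", "School2", "School1"])

# A lone label is looked up among the row labels, so this fails.
try:
    grades.loc["Name"]
    raised = False
except KeyError:
    raised = True

# All rows (:), then the column label.
names = grades.loc[:, "Name"]

print(raised)
print(names)
```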

### 3)Concept Of Chaining Using Indexing Operator

Since the result of using the indexing operator is either a DataFrame or Series, you can chain operations together.

The general syntax to follow for this process is as follows.

###### < Name of Dataframe >.loc[Name of row you want to work with][Name of column you want to display]

###### Example1
For instance, we can select all of the rows related to School1 using .loc, then project the Name column from just those rows.

In [25]:
dataFrameSchoolGrade.loc["School1"]["Name"]

School1    Abhay Ohja
School1    Ayush Ohja
Name: Name, dtype: object

In [26]:
# If you get confused, use type to check the responses from resulting operations
print(type(df.loc['school1'])) #should be a DataFrame
print(type(df.loc['school1']['Name'])) #should be a Series

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>


Chaining the indexing operator can come with some costs and is best avoided if you can use another approach. In particular, chaining tends to cause pandas to return a copy of the DataFrame instead of a view on the DataFrame. For selecting data, this is not a big deal, though it might be slower than necessary. If you are changing data, though, this is an important distinction and can be a source of error.
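For assignments, a single `.loc` call with both the row and the column label avoids the copy-versus-view question altogether. The following is a minimal sketch, assuming the same data rebuilt under the hypothetical name `grades`; the new mark of 95 is made up for the illustration.

```python
import pandas as pd

# Same data as above; `grades` is a hypothetical name for this sketch.
rows = [
    {"Name": "Abhay Ohja", "Subject": "Physics", "Marks": 90},
    {"Name": "Ashish Kumar", "Subject": "Maths", "Marks": 100},
    {"Name": "Ayush Ohja", "Subject": "Computer", "Marks": 85},
]
grades = pd.DataFrame(rows, index=["School1", "School2", "School1"])

# Chained form (not executed here): grades.loc["School2"]["Marks"] = 95
# The first step may hand back a temporary copy, so the assignment can be lost.

# Single-step form: row label and column label in one .loc call.
grades.loc["School2", "Marks"] = 95  # 95 is an illustrative value

print(grades)
```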


As we saw, .loc does row selection, and it can take two parameters, the row index and the column names. The .loc attribute also supports slicing.

If we wanted to select all rows, we can use a colon (:) to indicate a full slice from beginning to end. This is just like slicing a list in Python. Then we can add the column name as the second parameter as a string. If we wanted to include multiple columns, we could pass them in a list, and pandas will bring back only the columns we have asked for.

In [27]:
dataFrameSchoolGrade.loc[:,["Name","Subject"]]

                 Name   Subject
School1    Abhay Ohja   Physics
School2  Ashish Kumar     Maths
School1    Ayush Ohja  Computer
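Slices can also be written with labels. One difference from ordinary Python slicing is that a label slice in `.loc` includes its end point, while a positional slice in `.iloc` excludes it. A minimal sketch on the column axis, assuming the same data rebuilt under the hypothetical name `grades`:

```python
import pandas as pd

# Same data as above; `grades` is a hypothetical name for this sketch.
rows = [
    {"Name": "Abhay Ohja", "Subject": "Physics", "Marks": 90},
    {"Name": "Ashish Kumar", "Subject": "Maths", "Marks": 100},
    {"Name": "Ayush Ohja", "Subject": "Computer", "Marks": 85},
]
grades = pd.DataFrame(rows, index=["School1", "School2", "School1"])

# Label slice: both "Name" and "Subject" are included.
by_label = grades.loc[:, "Name":"Subject"]

# Positional slice: column 0 only, because the end position 1 is excluded.
by_position = grades.iloc[:, 0:1]

print(list(by_label.columns))
print(list(by_position.columns))
```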


## Deleting A Row In Dataframe

##### Method 1

Let's talk about dropping data. It's easy to delete data in Series and DataFrames, and we can use the drop function to do so. In its simplest form, this function takes a single parameter, which is the index or row label to drop. This is another tricky place for new users: the drop function doesn't change the DataFrame by default! Instead, the drop function returns a copy of the DataFrame with the given rows removed.

The general syntax of usage is

###### < Name of Dataframe >.drop(< Name of Row Index >)

Let's see an example.

In [28]:
dataFrameSchoolGrade.drop("School1")

                 Name  Subject  Marks
School2  Ashish Kumar    Maths    100


In [29]:
dataFrameSchoolGrade

                 Name   Subject  Marks
School1    Abhay Ohja   Physics     90
School2  Ashish Kumar     Maths    100
School1    Ayush Ohja  Computer     85


As we can see, when we look at our original DataFrame the data is still intact.
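Since `drop` returns a new DataFrame, the usual way to keep the result is to assign it to a name. A minimal sketch, assuming the same data rebuilt under the hypothetical name `grades`; `without_school1` is likewise a name chosen for the illustration.

```python
import pandas as pd

# Same data as above; `grades` is a hypothetical name for this sketch.
rows = [
    {"Name": "Abhay Ohja", "Subject": "Physics", "Marks": 90},
    {"Name": "Ashish Kumar", "Subject": "Maths", "Marks": 100},
    {"Name": "Ayush Ohja", "Subject": "Computer", "Marks": 85},
]
grades = pd.DataFrame(rows, index=["School1", "School2", "School1"])

# Every row labelled "School1" is removed from the returned copy.
without_school1 = grades.drop("School1")

print(without_school1)
print(len(grades))  # the original still has all of its rows
```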

##### Method 2(Modified method 1)


Drop has two interesting optional parameters. The first is called inplace, and if it's set to True, the DataFrame will be updated in place instead of a copy being returned. The second parameter is axis. By default, its value is 0, indicating the row axis, but you can change it to 1 if you want to drop a column.

The general syntax of this code is:

###### < Name of dataframe >.drop(< Name of row/column to be dropped >,inplace=True/False,axis=0/1)

Let's see an example to understand how it works.

In [30]:
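# Note: plain assignment does not copy the DataFrame. copyFrame is just a second name for the same
# object, so the in-place drops below also change dataFrameSchoolGrade. Use
# dataFrameSchoolGrade.copy() if an independent copy is needed.
copyFrame=dataFrameSchoolGrade
copyFrame.drop("School2",inplace=True,axis=0)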

In [31]:
copyFrame

               Name   Subject  Marks
School1  Abhay Ohja   Physics     90
School1  Ayush Ohja  Computer     85


As we can see, the row with index label School2 has been dropped. Let's see an example where we drop a column.

In [32]:
copyFrame.drop("Marks",inplace=True,axis=1)

In [33]:
copyFrame

               Name   Subject
School1  Abhay Ohja   Physics
School1  Ayush Ohja  Computer
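One point to be careful about with in-place drops: assigning a DataFrame to a second name does not copy it, so both names refer to the same object. The sketch below assumes the same data rebuilt under the hypothetical name `grades`, and contrasts plain assignment with the `copy` method.

```python
import pandas as pd

# Same data as above; `grades`, `alias` and `independent` are hypothetical names.
rows = [
    {"Name": "Abhay Ohja", "Subject": "Physics", "Marks": 90},
    {"Name": "Ashish Kumar", "Subject": "Maths", "Marks": 100},
    {"Name": "Ayush Ohja", "Subject": "Computer", "Marks": 85},
]
grades = pd.DataFrame(rows, index=["School1", "School2", "School1"])

alias = grades               # second name for the same object
independent = grades.copy()  # separate object with its own data

# Dropping in place on the copy leaves `grades` untouched.
independent.drop("School2", inplace=True)
rows_after_copy_drop = len(grades)

# Dropping in place through the alias changes `grades` as well.
alias.drop("Marks", axis=1, inplace=True)

print(rows_after_copy_drop)
print(list(grades.columns))
```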


##### Method 3

There is a second way to drop a column, and that's directly through the use of the indexing operator, using the del keyword. This way of dropping data, however, takes immediate effect on the DataFrame and does not return a view.

The general syntax is:

###### del < Name of dataframe >[Name of column]

Let's see an example. Note that this only works on a column.

In [34]:
del copyFrame["Name"]

In [35]:
copyFrame

          Subject
School1   Physics
School1  Computer


In [36]:
del copyFrame["School1"]

KeyError: 'School1'

As we can see, a KeyError is raised when we pass a row label, because del only looks the name up among the columns.

## Adding A Column To Pre-Existing Dataframe

In [37]:
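# Finally, adding a new column to the DataFrame is as easy as assigning it to some value using
# the indexing operator. For instance, if we wanted to add a class ranking column with a default
# value of None, we could do so by using the assignment operator after the square brackets.
# This broadcasts the default value to every row of the new column immediately.
# Because copyFrame above is the same object as dataFrameSchoolGrade, the earlier in-place
# drops left only the Subject column and the School1 rows, as the output shows.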

dataFrameSchoolGrade['ClassRanking'] = None
dataFrameSchoolGrade

          Subject  ClassRanking
School1   Physics          None
School1  Computer          None
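A new column does not have to be a single broadcast value. Assigning a list with one entry per row works the same way. The following is a minimal sketch, assuming the full original data rebuilt under the hypothetical name `grades`; the `Rank` column and its values are made up for the illustration.

```python
import pandas as pd

# Full original data; `grades` is a hypothetical name for this sketch.
rows = [
    {"Name": "Abhay Ohja", "Subject": "Physics", "Marks": 90},
    {"Name": "Ashish Kumar", "Subject": "Maths", "Marks": 100},
    {"Name": "Ayush Ohja", "Subject": "Computer", "Marks": 85},
]
grades = pd.DataFrame(rows, index=["School1", "School2", "School1"])

# A scalar is broadcast to every row.
grades["ClassRanking"] = None

# A list must have exactly one value per row (illustrative values).
grades["Rank"] = [2, 1, 3]

print(grades)
```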
