Now we will look at how to create DataFrames, perform operations on them, and see what the results look like.

In [1]:
import pandas as pd
import numpy as np
from numpy.random import randn
my_data = randn(4,3) # 4 rows, 3 columns
my_rows = ['row1', 'row2', 'row3', 'row4']
my_cols = ['monday', 'tuesday', 'wednesday']

Now we will actually create the DataFrame. The parameters are, in order: data, index (the row labels), and columns (the column labels).

In [2]:
my_df = pd.DataFrame(data=my_data, index=my_rows, columns=my_cols)
print(my_df)

        monday   tuesday  wednesday
row1 -0.203385  1.622957  -0.619391
row2 -0.297136 -1.418792  -1.337988
row3  0.521635  0.080755  -0.198449
row4 -0.297743  0.775618  -0.118279
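
Because randn draws new values on every run, the numbers printed above will differ each time. A minimal sketch of the same construction with a seeded generator is shown here; the seed value 0 and the names rng, data, rows, cols and df are illustrative choices, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Seeded generator so the random values are identical on every run.
# The seed (0) is an arbitrary, illustrative choice.
rng = np.random.default_rng(0)
data = rng.standard_normal((4, 3))  # 4 rows, 3 columns

rows = ['row1', 'row2', 'row3', 'row4']
cols = ['monday', 'tuesday', 'wednesday']

# Keyword arguments make the role of each parameter explicit.
df = pd.DataFrame(data=data, index=rows, columns=cols)

print(df.shape)          # (4, 3)
print(list(df.index))    # the row labels
print(list(df.columns))  # the column labels
```

Passing the arguments by keyword also avoids mixing up the row labels and the column labels.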


Now we will import an existing dataset that is stored as a CSV file and load it directly into a DataFrame.
pd.read_csv does this whole operation for us, so we do not have to read and parse the file ourselves. I will use the iris data that I already used in the neural network notebook in the same repo.

In [3]:
my_df2 = pd.read_csv('Data/iris.csv')
print(my_df2)

     sepal_length  sepal_width  petal_length  petal_width           class
0             5.1          3.5           1.4          0.2     Iris-setosa
1             4.9          3.0           1.4          0.2     Iris-setosa
2             4.7          3.2           1.3          0.2     Iris-setosa
3             4.6          3.1           1.5          0.2     Iris-setosa
4             5.0          3.6           1.4          0.2     Iris-setosa
..            ...          ...           ...          ...             ...
145           6.7          3.0           5.2          2.3  Iris-virginica
146           6.3          2.5           5.0          1.9  Iris-virginica
147           6.5          3.0           5.2          2.0  Iris-virginica
148           6.2          3.4           5.4          2.3  Iris-virginica
149           5.9          3.0           5.1          1.8  Iris-virginica

[150 rows x 5 columns]
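
To try read_csv without having Data/iris.csv on disk, the sketch below reads the same kind of data from an in-memory string. The three data rows are copied from the outputs shown in this notebook; the names csv_text and sample_df are assumptions made for the example.

```python
import io

import pandas as pd

# A tiny CSV held in memory; the first line is the header row.
csv_text = """sepal_length,sepal_width,petal_length,petal_width,class
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
"""

# read_csv accepts a file path or any file-like object.
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df)
print(sample_df.shape)  # (3, 5)
```

The header row becomes the column names, and pandas assigns a default integer index starting at 0.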


We can now pull out rows from this DataFrame by their index labels using loc. Note that loc is an indexer, so it takes square brackets (loc[...]) rather than parentheses.

In [4]:
print(my_df2.loc[50])

sepal_length                7.0
sepal_width                 3.2
petal_length                4.7
petal_width                 1.4
class           Iris-versicolor
Name: 50, dtype: object
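
With the default integer index, labels and positions happen to coincide, which hides the difference between loc (label based) and iloc (position based). The following sketch uses a made-up frame named demo with non-default labels to show the difference; the values are only illustrative.

```python
import pandas as pd

# Index labels 10, 20, 30 deliberately differ from positions 0, 1, 2.
demo = pd.DataFrame({'petal_length': [1.4, 4.7, 5.1]}, index=[10, 20, 30])

by_label = demo.loc[20]      # row whose index label is 20
by_position = demo.iloc[1]   # second row, whatever its label is

print(by_label['petal_length'])     # 4.7
print(by_position['petal_length'])  # 4.7

# Label slices include the end label; position slices exclude the end.
print(len(demo.loc[10:20]))   # 2
print(len(demo.iloc[0:1]))    # 1
```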


Now we will look at ways to inspect this DataFrame as a whole:
1. First 5 rows
2. First n rows
3. Last 5 rows
4. Last n rows
5. Summary info (column names, non-null counts, dtypes, memory usage)

In [6]:
my_df2.head() 

   sepal_length  sepal_width  petal_length  petal_width        class
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa


In [7]:
my_df2.head(10)

   sepal_length  sepal_width  petal_length  petal_width        class
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa
5           5.4          3.9           1.7          0.4  Iris-setosa
6           4.6          3.4           1.4          0.3  Iris-setosa
7           5.0          3.4           1.5          0.2  Iris-setosa
8           4.4          2.9           1.4          0.2  Iris-setosa
9           4.9          3.1           1.5          0.1  Iris-setosa


In [8]:
my_df2.tail()

     sepal_length  sepal_width  petal_length  petal_width           class
145           6.7          3.0           5.2          2.3  Iris-virginica
146           6.3          2.5           5.0          1.9  Iris-virginica
147           6.5          3.0           5.2          2.0  Iris-virginica
148           6.2          3.4           5.4          2.3  Iris-virginica
149           5.9          3.0           5.1          1.8  Iris-virginica


In [9]:
my_df2.tail(3)

     sepal_length  sepal_width  petal_length  petal_width           class
147           6.5          3.0           5.2          2.0  Iris-virginica
148           6.2          3.4           5.4          2.3  Iris-virginica
149           5.9          3.0           5.1          1.8  Iris-virginica


In [10]:
my_df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   class         150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
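
The head and tail calls above can be checked on a small frame where the expected rows are easy to see. This is a minimal sketch using a made-up 12-row frame called demo; it is not part of the iris data.

```python
import pandas as pd

# Twelve rows holding the values 0..11, so the selected rows are obvious.
demo = pd.DataFrame({'value': range(12)})

first_five = demo.head()     # default n=5
first_three = demo.head(3)   # first n rows
last_five = demo.tail()      # default n=5
last_two = demo.tail(2)      # last n rows

print(first_three['value'].tolist())  # [0, 1, 2]
print(last_two['value'].tolist())     # [10, 11]
```

Both methods return a new DataFrame and keep the original index labels, which is why the tail output above starts at 145 rather than 0.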


Now we will look at the following features:
1. Number of dimensions of the DataFrame
2. Data types of the columns only (without the extra detail that info() prints)
3. Basic statistics such as count, mean, standard deviation, min, max, and the 25%, 50% and 75% percentiles, here computed for a specific column


In [11]:
my_df2.ndim

2

In [12]:
my_df2.dtypes

sepal_length    float64
sepal_width     float64
petal_length    float64
petal_width     float64
class            object
dtype: object

In [13]:
my_df2['sepal_length'].describe()  # for one column

count    150.000000
mean       5.843333
std        0.828066
min        4.300000
25%        5.100000
50%        5.800000
75%        6.400000
max        7.900000
Name: sepal_length, dtype: float64
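
describe can also be applied to several columns at once by selecting them with a list of names. The sketch below uses a small made-up frame named demo (the values are illustrative, not the full iris data) and also shows shape, which complements ndim by giving the size along each dimension.

```python
import pandas as pd

# Small illustrative frame; not the real 150-row iris dataset.
demo = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 7.0, 6.3],
    'sepal_width': [3.5, 3.0, 3.2, 3.3],
    'class': ['Iris-setosa', 'Iris-setosa', 'Iris-versicolor', 'Iris-virginica'],
})

print(demo.ndim)   # 2
print(demo.shape)  # (4, 3)

# A list of column names selects a sub-frame; describe() then returns
# one column of statistics per selected column.
stats = demo[['sepal_length', 'sepal_width']].describe()
print(stats)

# Individual statistics can be read back by row label.
print(stats.loc['mean', 'sepal_length'])
```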

Now we will look at counting operations on these DataFrames.
1. Count how often each distinct value occurs in a column. By default, value_counts() sorts the counts in descending order; pass ascending=True to reverse this order.

In [14]:
my_df2['class'].value_counts()

class
Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
Name: count, dtype: int64

2. By default, value_counts() leaves missing values (None/NaN) out of the result. To include them, pass dropna=False to value_counts().
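
Since the iris class column has equal counts and no missing values, neither option changes much there. Here is a minimal sketch on a made-up Series named labels, where the counts differ and one entry is missing.

```python
import numpy as np
import pandas as pd

# Illustrative data: 'c' appears 3 times, 'b' twice, 'a' once, plus one NaN.
labels = pd.Series(['a', 'b', 'b', 'c', 'c', 'c', np.nan])

desc = labels.value_counts()                  # largest count first, NaN skipped
asc = labels.value_counts(ascending=True)     # smallest count first
with_nan = labels.value_counts(dropna=False)  # NaN gets its own entry

print(desc.index.tolist())  # ['c', 'b', 'a']
print(asc.tolist())         # [1, 2, 3]
print(len(with_nan))        # 4
```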