## Creating a Pandas dataframe

In [1]:
import pandas as pd
import copy

In [2]:
df = pd.DataFrame(data=[1, 2, 3], columns=["category"])
df

Unnamed: 0,category
0,1
1,2
2,3


Alternatively, you can create a pandas DataFrame from a dictionary that maps column names to their values

In [7]:
pd.DataFrame.from_dict({"category": [1, 2, 3]})

Unnamed: 0,category
0,1
1,2
2,3


In [6]:
pd.DataFrame({"category": pd.Series([1, 2, 3])})

Unnamed: 0,category
0,1
1,2
2,3
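
As a quick check, the three constructions above should produce the same dataframe. The sketch below rebuilds them under new, illustrative variable names (`from_list`, `from_dict`, `from_series` are not part of the notebook) and compares them with `DataFrame.equals`.

```python
import pandas as pd

# Build the same one-column frame three ways, as in the cells above.
from_list = pd.DataFrame(data=[1, 2, 3], columns=["category"])
from_dict = pd.DataFrame.from_dict({"category": [1, 2, 3]})
from_series = pd.DataFrame({"category": pd.Series([1, 2, 3])})

# DataFrame.equals compares labels and values and requires matching dtypes.
all_equal = from_list.equals(from_dict) and from_list.equals(from_series)
print(all_equal)
```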


Creating a test dataframe by duplicating df

In [4]:
df1 = copy.deepcopy(df)
df1

Unnamed: 0,category
0,1
1,2
2,3
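
pandas also has its own `DataFrame.copy` method, which makes a deep copy by default. The sketch below uses a small stand-in frame (the names `alias` and `df_copy` are illustrative) to show why a copy is needed at all: plain assignment only creates a second name for the same object.

```python
import pandas as pd

df = pd.DataFrame({"category": [1, 2, 3]})

alias = df            # plain assignment: both names refer to the same object
df_copy = df.copy()   # deep=True by default: data and index are copied

df_copy.loc[0, "category"] = 99   # change the copy only

print(alias is df)                  # the alias is the original object
print(df.loc[0, "category"])        # value in the original frame
print(df_copy.loc[0, "category"])   # value in the copy
```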


Adding a single column

In [5]:
df1["cities"] = pd.Series([["Nairobi", "Mombasa"], ["Lagos", "Nigeria"], ["Cape Town", "Johannesburg"]])
df1

Unnamed: 0,category,cities
0,1,"[Nairobi, Mombasa]"
1,2,"[Lagos, Nigeria]"
2,3,"[Cape Town, Johannesburg]"


Adding multiple columns using the assign function

In [6]:
df1 = df1.assign(
    adult_viewers = pd.Series([500000, 1000000, 750000]),
    aged_viewers = pd.Series([200000, 400000, 300000]),
    young_viewers = pd.Series([1000000, 2000000, 3500000])
)

In [7]:
df1

Unnamed: 0,category,cities,adult_viewers,aged_viewers,young_viewers
0,1,"[Nairobi, Mombasa]",500000,200000,1000000
1,2,"[Lagos, Nigeria]",1000000,400000,2000000
2,3,"[Cape Town, Johannesburg]",750000,300000,3500000
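
assign also accepts a callable, which receives the dataframe and lets a new column be derived from existing ones. A minimal sketch on a small stand-in frame; the column name `older_viewers` is made up for illustration.

```python
import pandas as pd

viewers = pd.DataFrame({
    "adult_viewers": [500000, 1000000, 750000],
    "aged_viewers": [200000, 400000, 300000],
})

# The callable receives the dataframe, so the new column
# can be computed from columns that already exist.
viewers = viewers.assign(
    older_viewers=lambda d: d["adult_viewers"] + d["aged_viewers"]
)
print(viewers)
```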


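To add a new row, wrap it in a one-row DataFrame and use pd.concat (DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0)

In [8]:
new_row = pd.DataFrame([{
    'cities': ["Kolkata", "Hyderabad"],
    'adult_viewers': 2000000,
    'aged_viewers': 2000000,
    'young_viewers': 1500000
}])
# The new row has no category, so that cell becomes NaN and the column is upcast to float
df1 = pd.concat([df1, new_row], ignore_index=True)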

In [9]:
df1

Unnamed: 0,category,cities,adult_viewers,aged_viewers,young_viewers
0,1.0,"[Nairobi, Mombasa]",500000,200000,1000000
1,2.0,"[Lagos, Nigeria]",1000000,400000,2000000
2,3.0,"[Cape Town, Johannesburg]",750000,300000,3500000
3,,"[Kolkata, Hyderabad]",2000000,2000000,1500000
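
The category values now print as 1.0, 2.0, 3.0 because the new row has no category: the missing cell is stored as NaN, and an integer column cannot hold NaN, so it is upcast to float. A minimal sketch of that effect on a made-up two-column frame, using pd.concat to add the row:

```python
import pandas as pd

df = pd.DataFrame({"category": [1, 2, 3], "viewers": [10, 20, 30]})
new_row = pd.DataFrame([{"viewers": 40}])   # no "category" value supplied

out = pd.concat([df, new_row], ignore_index=True)

print(out)
print(out.dtypes)   # category is now float, viewers is still integer
```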


To duplicate df1 and concatenate the two copies

In [10]:
df2 = pd.concat([df1, df1], sort=False)
df2

Unnamed: 0,category,cities,adult_viewers,aged_viewers,young_viewers
0,1.0,"[Nairobi, Mombasa]",500000,200000,1000000
1,2.0,"[Lagos, Nigeria]",1000000,400000,2000000
2,3.0,"[Cape Town, Johannesburg]",750000,300000,3500000
3,,"[Kolkata, Hyderabad]",2000000,2000000,1500000
0,1.0,"[Nairobi, Mombasa]",500000,200000,1000000
1,2.0,"[Lagos, Nigeria]",1000000,400000,2000000
2,3.0,"[Cape Town, Johannesburg]",750000,300000,3500000
3,,"[Kolkata, Hyderabad]",2000000,2000000,1500000
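
Note that the index labels 0 to 3 repeat in the output above, because concat keeps the labels of each input. A short sketch on a stand-in frame shows the difference that `ignore_index=True` makes:

```python
import pandas as pd

df = pd.DataFrame({"category": [1, 2, 3]})

stacked = pd.concat([df, df])                         # keeps the original labels
renumbered = pd.concat([df, df], ignore_index=True)   # assigns fresh labels

print(stacked.index.tolist())
print(renumbered.index.tolist())
```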


To delete a row

In [11]:
df1 = df1.drop([3])
df1

Unnamed: 0,category,cities,adult_viewers,aged_viewers,young_viewers
0,1.0,"[Nairobi, Mombasa]",500000,200000,1000000
1,2.0,"[Lagos, Nigeria]",1000000,400000,2000000
2,3.0,"[Cape Town, Johannesburg]",750000,300000,3500000


To delete a column

In [12]:
df1 = df1.drop(["cities"], axis=1)
df1

Unnamed: 0,category,adult_viewers,aged_viewers,young_viewers
0,1.0,500000,200000,1000000
1,2.0,1000000,400000,2000000
2,3.0,750000,300000,3500000
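
drop also takes `index=` and `columns=` keywords, which some readers find clearer than remembering what `axis=1` means. A sketch on a small stand-in frame (the data here is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "category": [1, 2, 3],
    "cities": ["Nairobi", "Lagos", "Cape Town"],
})

without_row = df.drop(index=[2])            # same as df.drop([2])
without_col = df.drop(columns=["cities"])   # same as df.drop(["cities"], axis=1)

print(without_row)
print(without_col)
```

drop returns a new dataframe and leaves `df` unchanged, which is why the notebook assigns the result back to `df1`.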


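We can chain methods as follows (dropping the cities column, then dropping duplicate rows). The cities column holds lists, which are not hashable, so it has to be dropped before drop_duplicates can compare the rows.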

In [13]:
df2 = df2.drop(["cities"], axis=1).drop_duplicates()
df2

Unnamed: 0,category,adult_viewers,aged_viewers,young_viewers
0,1.0,500000,200000,1000000
1,2.0,1000000,400000,2000000
2,3.0,750000,300000,3500000
3,,2000000,2000000,1500000


You can fill a NaN with a value of your choice, fill it with the column mean or median, or drop the row that contains it.

In [14]:
# To fill with a value of your choice
df2.fillna(4)

Unnamed: 0,category,adult_viewers,aged_viewers,young_viewers
0,1.0,500000,200000,1000000
1,2.0,1000000,400000,2000000
2,3.0,750000,300000,3500000
3,4.0,2000000,2000000,1500000


In [17]:
# To fill using a mean
df2.fillna(df2.mean())

Unnamed: 0,category,adult_viewers,aged_viewers,young_viewers
0,1.0,500000,200000,1000000
1,2.0,1000000,400000,2000000
2,3.0,750000,300000,3500000
3,2.0,2000000,2000000,1500000


In [18]:
# To fill using a median
df2.fillna(df2.median())

Unnamed: 0,category,adult_viewers,aged_viewers,young_viewers
0,1.0,500000,200000,1000000
1,2.0,1000000,400000,2000000
2,3.0,750000,300000,3500000
3,2.0,2000000,2000000,1500000


In [19]:
# To drop the NaN
df2.dropna()

Unnamed: 0,category,adult_viewers,aged_viewers,young_viewers
0,1.0,500000,200000,1000000
1,2.0,1000000,400000,2000000
2,3.0,750000,300000,3500000


In [21]:
df2 = df2.fillna(4)
df2

Unnamed: 0,category,adult_viewers,aged_viewers,young_viewers
0,1.0,500000,200000,1000000
1,2.0,1000000,400000,2000000
2,3.0,750000,300000,3500000
3,4.0,2000000,2000000,1500000
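
Before choosing a fill strategy it can help to count the missing values, and fillna also accepts a dictionary so that each column gets its own fill value. A minimal sketch on a stand-in frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "category": [1.0, 2.0, 3.0, np.nan],
    "adult_viewers": [500000, 1000000, 750000, 2000000],
})

missing_per_column = df.isna().sum()   # number of NaN values in each column
filled = df.fillna({"category": 4})    # fill only the category column

print(missing_per_column)
print(filled)
```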


Performing operations on the dataframe

In [23]:
# You can add columns as shown:
df2["viewers"] = df2["adult_viewers"] + df2["aged_viewers"] + df2["young_viewers"]
df2

Unnamed: 0,category,adult_viewers,aged_viewers,young_viewers,viewers
0,1.0,500000,200000,1000000,1700000
1,2.0,1000000,400000,2000000,3400000
2,3.0,750000,300000,3500000,4550000
3,4.0,2000000,2000000,1500000,5500000


In [24]:
# Or perform an operation on a single column:
df2["expected_clicks"] = 0.03*df2["viewers"]
df2

Unnamed: 0,category,adult_viewers,aged_viewers,young_viewers,viewers,expected_clicks
0,1.0,500000,200000,1000000,1700000,51000.0
1,2.0,1000000,400000,2000000,3400000,102000.0
2,3.0,750000,300000,3500000,4550000,136500.0
3,4.0,2000000,2000000,1500000,5500000,165000.0
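
When many columns have to be added together, a row-wise sum over a list of column names is an alternative to writing out each `+`. A sketch with two rows of the data above; `viewer_columns` is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    "adult_viewers": [500000, 1000000],
    "aged_viewers": [200000, 400000],
    "young_viewers": [1000000, 2000000],
})

viewer_columns = ["adult_viewers", "aged_viewers", "young_viewers"]
df["viewers"] = df[viewer_columns].sum(axis=1)   # sum across each row
df["expected_clicks"] = 0.03 * df["viewers"]     # the scalar applies to every row
print(df)
```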


## Iterations on columns and rows

We can iterate through the rows and columns of pandas objects using itertuples, iterrows and items.
itertuples and iterrows are used to iterate through the rows of a dataframe.
items (named iteritems before pandas 2.0, where the old name was removed) is used to iterate through the columns of a dataframe.

## 1. Itertuples

In [28]:
# Using the itertuples to iterate through the rows:
for row in df2.itertuples():
    print(row)

Pandas(Index=0, category=1.0, adult_viewers=500000, aged_viewers=200000, young_viewers=1000000, viewers=1700000, expected_clicks=51000.0)
Pandas(Index=1, category=2.0, adult_viewers=1000000, aged_viewers=400000, young_viewers=2000000, viewers=3400000, expected_clicks=102000.0)
Pandas(Index=2, category=3.0, adult_viewers=750000, aged_viewers=300000, young_viewers=3500000, viewers=4550000, expected_clicks=136500.0)
Pandas(Index=3, category=4.0, adult_viewers=2000000, aged_viewers=2000000, young_viewers=1500000, viewers=5500000, expected_clicks=165000.0)


In [29]:
# To remove the index in the tuple:
for row in df2.itertuples(index=False):
    print(row)

Pandas(category=1.0, adult_viewers=500000, aged_viewers=200000, young_viewers=1000000, viewers=1700000, expected_clicks=51000.0)
Pandas(category=2.0, adult_viewers=1000000, aged_viewers=400000, young_viewers=2000000, viewers=3400000, expected_clicks=102000.0)
Pandas(category=3.0, adult_viewers=750000, aged_viewers=300000, young_viewers=3500000, viewers=4550000, expected_clicks=136500.0)
Pandas(category=4.0, adult_viewers=2000000, aged_viewers=2000000, young_viewers=1500000, viewers=5500000, expected_clicks=165000.0)


In [30]:
# To rename the tuple:
for row in df2.itertuples(index=False, name="Monthly"):
    print(row)

Monthly(category=1.0, adult_viewers=500000, aged_viewers=200000, young_viewers=1000000, viewers=1700000, expected_clicks=51000.0)
Monthly(category=2.0, adult_viewers=1000000, aged_viewers=400000, young_viewers=2000000, viewers=3400000, expected_clicks=102000.0)
Monthly(category=3.0, adult_viewers=750000, aged_viewers=300000, young_viewers=3500000, viewers=4550000, expected_clicks=136500.0)
Monthly(category=4.0, adult_viewers=2000000, aged_viewers=2000000, young_viewers=1500000, viewers=5500000, expected_clicks=165000.0)
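
Each row produced by itertuples is a namedtuple, so its fields can be read by attribute name or by position. A minimal sketch on a stand-in frame; the `totals` dictionary is only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "category": [1, 2],
    "viewers": [1700000, 3400000],
})

totals = {}
for row in df.itertuples(index=False, name="Monthly"):
    # row.viewers and row[1] refer to the same field
    totals[row.category] = row.viewers

print(totals)
```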


## 2. Iterrows

This method iterates over the rows of a dataframe in tuples of the type (label, content), where label is the index of the row and content is a pandas series containing every item in the row.

In [35]:
for label, content in df2.iterrows():
    print(label, content["viewers"], getattr(content, "expected_clicks"))
# The getattr function returns the value of the named attribute of an object. Here the object is content and the attribute is expected_clicks, so this is equivalent to content.expected_clicks.

0 1700000.0 51000.0
1 3400000.0 102000.0
2 4550000.0 136500.0
3 5500000.0 165000.0
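
The viewers values print as floats (1700000.0) even though the column holds integers. iterrows packs each row into a single Series, and a Series has one dtype, so mixed integer and float columns are upcast to float. The sketch below, on a stand-in frame, contrasts this with itertuples, which keeps each value's own type.

```python
import pandas as pd

df = pd.DataFrame({
    "viewers": [1700000, 3400000],            # integer column
    "expected_clicks": [51000.0, 102000.0],   # float column
})

_, first_row = next(df.iterrows())               # the row as one Series
first_tuple = next(df.itertuples(index=False))   # the row as a namedtuple

print(first_row.dtype)                       # dtype shared by the whole row
print(type(first_row["viewers"]).__name__)   # type seen through iterrows
print(type(first_tuple.viewers).__name__)    # type seen through itertuples
```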


## 3. Items (iteritems)

This method iterates over the columns of a dataframe in tuples of the type (label, content), where label is the name of the column and content is the column as a pandas Series. It was called iteritems before pandas 2.0; that name has since been removed, so use items.

In [37]:
for label, content in df2.items():
    print(label, content[2])
# This prints every column label together with that column's value in the row with index label 2

category 3.0
adult_viewers 750000
aged_viewers 300000
young_viewers 3500000
viewers 4550000
expected_clicks 136500.0
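
Because each column arrives as a Series, any Series method can be used inside the loop. A small sketch on a stand-in frame that collects the maximum of every column; `column_max` is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    "viewers": [1700000, 3400000, 4550000],
    "expected_clicks": [51000.0, 102000.0, 136500.0],
})

column_max = {}
for label, content in df.items():
    # label is the column name, content is that column as a Series
    column_max[label] = content.max()

print(column_max)
```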


## Map and Apply Built-In Functions

## 1. Map

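Series.map replaces each value in a Series using a dictionary, another Series or a function, and returns a new Series of the same length. Values with no match in the dictionary become NaN.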

In [39]:
df2["category"].map({
    1:"A",
    2:"B",
    3:"C",
    4:"D"
})

0    A
1    B
2    C
3    D
Name: category, dtype: object
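
Two details of map are worth seeing on a small example: keys missing from the dictionary give NaN, and a function can be passed in place of a dictionary. A sketch using a stand-in Series (`partial` and `doubled` are illustrative names):

```python
import pandas as pd

category = pd.Series([1, 2, 3, 4], name="category")

partial = category.map({1: "A", 2: "B"})          # 3 and 4 have no entry
doubled = category.map(lambda value: value * 2)   # a function works too

print(partial)
print(doubled)
```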

## 2. Apply

This applies the passed function along an axis of the dataframe, and the function can use several columns at once. When the function returns one value per row, as here, the result is a Series.

In [41]:
df2[['young_viewers', 'viewers']].apply(lambda x: x['young_viewers'] / x['viewers'], axis=1)
# The lambda computes young_viewers/viewers, while axis=1 specifies that the function should be applied to every row

0    0.588235
1    0.588235
2    0.769231
3    0.272727
dtype: float64
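
For simple arithmetic like this ratio, dividing the two columns directly gives the same numbers without calling a Python function for every row. A sketch comparing both approaches on the same values as above:

```python
import pandas as pd

df = pd.DataFrame({
    "young_viewers": [1000000, 2000000, 3500000, 1500000],
    "viewers": [1700000, 3400000, 4550000, 5500000],
})

# Row-by-row with apply, then the column-wise equivalent.
row_wise = df.apply(lambda row: row["young_viewers"] / row["viewers"], axis=1)
vectorised = df["young_viewers"] / df["viewers"]

print(vectorised)
```

apply is still the tool to reach for when the per-row logic cannot be expressed with column operations.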

## Grouping Data

In [43]:
df2.groupby("young_viewers")["expected_clicks"].sum()

young_viewers
1000000     51000.0
1500000    165000.0
2000000    102000.0
3500000    136500.0
Name: expected_clicks, dtype: float64
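
Every young_viewers value above is unique, so each group holds a single row and the sums equal the original values. Grouping becomes more informative when the key repeats. The sketch below assumes a hypothetical `region` column, which is not in the notebook's data, to show rows being combined and several aggregates being computed at once.

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["east", "west", "east", "south"],   # hypothetical grouping key
    "expected_clicks": [51000.0, 102000.0, 136500.0, 165000.0],
})

# One aggregate per group, then several aggregates in one call.
clicks_per_region = df.groupby("region")["expected_clicks"].sum()
summary = df.groupby("region")["expected_clicks"].agg(["sum", "mean", "count"])

print(clicks_per_region)
print(summary)
```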