# What is pandas and why is it important?

The name "pandas" is derived from "panel data" and "Python data analysis." It is a Python library used for data manipulation and analysis. It provides easy-to-use data structures and functions designed to work with structured data, such as tabular data and time series.

Why pandas is important:

1) Ease of Use:
pandas is designed for ease of use and provides a high-level interface for data analysis. Its syntax is intuitive, making it accessible to both beginners and experienced data scientists.

2) Versatility:
It is versatile and can handle a wide range of data types and data sources. Whether you are working with CSV files, Excel spreadsheets, SQL databases, or other formats, pandas provides tools to read, manipulate, and analyze the data.

3) Efficiency:
pandas is built on top of NumPy, a numerical computing library for Python. It leverages NumPy arrays for efficient computation, making it suitable for large datasets and complex operations.

4) Data Analysis and Exploration:
It supports exploratory data analysis, allowing users to gain insights into the structure and characteristics of their data.

5) Integration with Other Libraries:
pandas integrates well with other data science and machine learning libraries in the Python ecosystem, such as NumPy, scikit-learn, and Matplotlib.

Key components of pandas include:

1) DataFrame:

The central data structure in pandas is the DataFrame, which is a two-dimensional, tabular data structure with labeled axes (rows and columns). It is similar to a spreadsheet or SQL table.

2) Series:

A Series is a one-dimensional labeled array and is another fundamental data structure in pandas. A DataFrame is essentially a collection of Series.
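
The relationship between the two structures can be sketched in a few lines. This is only an illustration: the names `ages`, `emails` and `people` are made up for this sketch.

```python
import pandas as pd

# A Series is a one-dimensional array of values with an index of labels.
ages = pd.Series([29, 30, 27], index=["Deb", "Philipp", "Suman"], name="age")
emails = pd.Series(
    ["deb@xyz.com", "philipp@abc.com", "suman@xyz.com"],
    index=["Deb", "Philipp", "Suman"],
    name="email",
)

# Combining Series that share an index gives a DataFrame with one column per Series.
people = pd.DataFrame({"age": ages, "email": emails})

print(people)
print(type(people["age"]))  # each column is again a Series
```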

In [160]:
import pandas as pd
import numpy as np

In [81]:
## We can create a pandas dataframe from a dictionary, where the keys become the column names.
my_dict = {"name" : ['Deb' , 'Philipp', 'Suman', 'Tom', 'Anja'] , "age" : [29, 30, 27, 24, 26], 
           'email' : ['deb@xyz.com', 'philipp@abc.com', 'suman@xyz.com', 'tom@cde.com', 'anja@abc.com']}

df = pd.DataFrame(my_dict)

In [82]:
df  ## This is what the dataframe looks like

Unnamed: 0,name,age,email
0,Deb,29,deb@xyz.com
1,Philipp,30,philipp@abc.com
2,Suman,27,suman@xyz.com
3,Tom,24,tom@cde.com
4,Anja,26,anja@abc.com


In [15]:
## Here we can see that each individual column of a dataframe is a Series, so a dataframe is made up of multiple Series

type(df['name']) 

pandas.core.series.Series

In [16]:
## You can read a CSV file into pandas as a dataframe

## Change the path below to a CSV file of your choice

df_new = pd.read_csv(r'E:\Thesis\Training_records.csv')
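
If no CSV file is at hand, the same call can be tried on in-memory text. The sketch below assumes a small made-up CSV string (`csv_text`) in place of a real file; `read_csv` accepts file-like objects as well as paths.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = """name,age,email
Deb,29,deb@xyz.com
Philipp,30,philipp@abc.com
"""

# read_csv accepts a path or any file-like object, so StringIO works here.
df_from_csv = pd.read_csv(io.StringIO(csv_text))

print(df_from_csv)
print(df_from_csv.dtypes)  # column types are inferred from the text
```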

In [17]:
'''Selecting only 2 specific columns. This is very important syntax because in real life you usually have to 
select the few columns of interest from hundreds of columns''' 

df_few = df[['name', 'email']]

df_few

Unnamed: 0,name,email
0,Deb,deb@xyz.com
1,Philipp,philipp@abc.com
2,Suman,suman@xyz.com
3,Tom,tom@cde.com
4,Anja,anja@abc.com


In [18]:
df.head(2)  ## Returns the first 2 rows. Change the number to get the desired number of rows from the beginning

Unnamed: 0,name,age,email
0,Deb,29,deb@xyz.com
1,Philipp,30,philipp@abc.com


In [286]:
df.tail(2)  ## Returns the last 2 rows. Change the number to get the desired number of rows from the end

Unnamed: 0,name,age,email
3,Tom,24,tom@cde.com
4,Anja,26,anja@abc.com


In [38]:
'''Changing the index, that is, the automatically generated first numeric column without a column name, to a column of 
your choice. Pass inplace=True to modify the main dataframe directly. 
If inplace is omitted or set to False, set_index returns a new dataframe and the main dataframe 
stays unchanged'''

df.set_index('name', inplace = True)  ## Changing the index to the name column. Index values should ideally be unique

In [39]:
df

Unnamed: 0_level_0,age,email
name,Unnamed: 1_level_1,Unnamed: 2_level_1
Deb,29,deb@xyz.com
Philipp,30,philipp@abc.com
Suman,27,suman@xyz.com
Tom,24,tom@cde.com
Anja,26,anja@abc.com
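
The effect of leaving out `inplace=True` can be checked on a small dataframe. This is a minimal sketch; `people` and `indexed` are illustrative names, not part of the notebook above.

```python
import pandas as pd

people = pd.DataFrame({"name": ["Deb", "Tom"], "age": [29, 24]})

# Without inplace=True, set_index returns a new dataframe and leaves `people` unchanged.
indexed = people.set_index("name")

print(list(people.columns))     # still contains 'name'
print(list(indexed.columns))    # 'name' has moved into the index
print(indexed.index.is_unique)  # quick check that the new index has no duplicates
```

Assigning the result to a new variable, as done here, is an alternative to `inplace=True` that keeps the original dataframe available.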


In [44]:
df.reset_index(inplace = True) ## Restoring the default numeric index; name becomes a regular column again

In [45]:
df

Unnamed: 0,name,age,email
0,Deb,29,deb@xyz.com
1,Philipp,30,philipp@abc.com
2,Suman,27,suman@xyz.com
3,Tom,24,tom@cde.com
4,Anja,26,anja@abc.com


In [46]:
## iloc stands for integer location; it selects values from a dataframe by position

## The 1st value is the position of the row and the 2nd is the position of the column

print(df.iloc[2, 0])

print(df.iloc[2, -1])

Suman
suman@xyz.com


In [48]:
## loc selects values from a dataframe by label rather than by position

## Here the 1st value is the index label of the row and the 2nd value is the name of the column

df.loc[2, 'age'] 

27
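
With the default index 0, 1, 2, ... row labels and row positions coincide, which hides the difference between loc and iloc. The sketch below uses a made-up dataframe (`people`) with a deliberately different index to make the difference visible.

```python
import pandas as pd

people = pd.DataFrame(
    {"name": ["Deb", "Philipp", "Suman"], "age": [29, 30, 27]},
    index=[10, 20, 30],  # deliberately not 0, 1, 2
)

by_position = people.iloc[2, 1]   # third row, second column
by_label = people.loc[30, "age"]  # row labelled 30, column 'age'

print(by_position, by_label)  # both refer to Suman's age
```

Here `people.loc[2, 'age']` would raise a KeyError, because no row carries the label 2.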

In [49]:
df.set_index('name', inplace = True) 

In [50]:
df

Unnamed: 0_level_0,age,email
name,Unnamed: 1_level_1,Unnamed: 2_level_1
Deb,29,deb@xyz.com
Philipp,30,philipp@abc.com
Suman,27,suman@xyz.com
Tom,24,tom@cde.com
Anja,26,anja@abc.com


In [52]:
'''Since the index of the above dataframe is now the name column, loc can take a name as the row label'''
df.loc['Tom', 'age']  

24

In [53]:
df.reset_index(inplace = True)

In [55]:
'''Finding the row where the name is Tom and showing the corresponding age and email'''

df[df['name'] == 'Tom']

Unnamed: 0,name,age,email
3,Tom,24,tom@cde.com


In [57]:
filt = (df['name'] == 'Tom')  ## Creating a filter

In [58]:
filt

0    False
1    False
2    False
3     True
4    False
Name: name, dtype: bool

In [59]:
df.loc[filt]  ## Passing the filter to loc gives the same result as the filtering shown above

Unnamed: 0,name,age,email
3,Tom,24,tom@cde.com


In [83]:
'''Inserting new rows into the dataframe'''
'''Using loc with an index label that is not yet in the dataframe creates a new row from the given list'''
df.loc[5] = ['Charlie', 22, 'charlie@zop.com']

In [84]:
df

Unnamed: 0,name,age,email
0,Deb,29,deb@xyz.com
1,Philipp,30,philipp@abc.com
2,Suman,27,suman@xyz.com
3,Tom,24,tom@cde.com
4,Anja,26,anja@abc.com
5,Charlie,22,charlie@zop.com


In [92]:
'''Inserting new rows into the dataframe using pd.concat. DataFrame.append was deprecated in pandas 1.4 
and removed in pandas 2.0'''

df = pd.concat([df, pd.DataFrame([{'name': 'Alice', 'age': 24, 'email': 'alice@abc.com'}])], ignore_index=True)

In [106]:
df

Unnamed: 0,name,age,email
0,Deb,29,deb@xyz.com
1,Philipp,30,philipp@abc.com
2,Suman,27,suman@xyz.com
3,Tom,24,tom@cde.com
4,Anja,26,anja@abc.com
5,Charlie,22,charlie@zop.com
6,Alice,24,alice@abc.com


In [100]:
'''Creating a filter with the OR operator '|' '''

filt = (df['age'] == 24) | (df['age'] == 27)

In [101]:
df.loc[filt]

Unnamed: 0,name,age,email
2,Suman,27,suman@xyz.com
3,Tom,24,tom@cde.com
6,Alice,24,alice@abc.com


In [102]:
'''Creating a filter with the AND operator '&' '''

filt = (df['age'] == 24) & (df['name'] == 'Tom')

In [103]:
df.loc[filt]

Unnamed: 0,name,age,email
3,Tom,24,tom@cde.com
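
When a filter compares one column against several values, `isin` is a shorter way to write the chain of `|` conditions. A small sketch on a made-up dataframe (`people`), kept separate from `df` above:

```python
import pandas as pd

people = pd.DataFrame(
    {"name": ["Deb", "Suman", "Tom", "Alice"], "age": [29, 27, 24, 24]}
)

# Two ways to select the rows where age is 24 or 27.
filt_or = (people["age"] == 24) | (people["age"] == 27)
filt_isin = people["age"].isin([24, 27])
print(people.loc[filt_isin])

# Each comparison needs its own parentheses, because & and | bind more tightly than ==.
filt_and = (people["age"] == 24) & (people["name"] == "Tom")
print(people.loc[filt_and])
```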


In [109]:
'''Deleting a row based on the index of the row'''

'''drop returns a new dataframe; write df = df.drop(2, axis=0) to overwrite the existing dataframe. I have not done that 
here because I don't want to delete any data for now'''

df.drop(2, axis = 0) 

Unnamed: 0,name,age,email
0,Deb,29,deb@xyz.com
1,Philipp,30,philipp@abc.com
3,Tom,24,tom@cde.com
4,Anja,26,anja@abc.com
5,Charlie,22,charlie@zop.com
6,Alice,24,alice@abc.com


In [111]:
df

Unnamed: 0,name,age,email
0,Deb,29,deb@xyz.com
1,Philipp,30,philipp@abc.com
2,Suman,27,suman@xyz.com
3,Tom,24,tom@cde.com
4,Anja,26,anja@abc.com
5,Charlie,22,charlie@zop.com
6,Alice,24,alice@abc.com


In [113]:
'''Deleting a specific column based on the name of the column'''

'''NOTE: in the drop method, axis 0 refers to rows and axis 1 refers to columns'''

df.drop('age', axis=1)

Unnamed: 0,name,email
0,Deb,deb@xyz.com
1,Philipp,philipp@abc.com
2,Suman,suman@xyz.com
3,Tom,tom@cde.com
4,Anja,anja@abc.com
5,Charlie,charlie@zop.com
6,Alice,alice@abc.com


In [114]:
df

Unnamed: 0,name,age,email
0,Deb,29,deb@xyz.com
1,Philipp,30,philipp@abc.com
2,Suman,27,suman@xyz.com
3,Tom,24,tom@cde.com
4,Anja,26,anja@abc.com
5,Charlie,22,charlie@zop.com
6,Alice,24,alice@abc.com
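
The drop method also accepts the keyword arguments `index` and `columns`, which some readers find easier to remember than axis numbers. A minimal sketch on a made-up dataframe (`people`):

```python
import pandas as pd

people = pd.DataFrame({"name": ["Deb", "Philipp", "Suman"], "age": [29, 30, 27]})

without_row = people.drop(index=2)        # same as people.drop(2, axis=0)
without_col = people.drop(columns="age")  # same as people.drop('age', axis=1)

print(without_row)
print(without_col)
print(people.shape)  # the original dataframe is unchanged
```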


In [124]:
## Creating a new dataframe and appending it to the existing one

my_new_dict_1 = {'name' : ['ali', 'michael'], 'age' : [32, 26], 'email': ['ali@abc.com', 'michael@xyz.com']}

In [125]:
df_new = pd.DataFrame(my_new_dict_1)

In [126]:
df_new

Unnamed: 0,name,age,email
0,ali,32,ali@abc.com
1,michael,26,michael@xyz.com


In [118]:
df = pd.concat([df, df_new])  ## DataFrame.append was removed in pandas 2.0, so pd.concat is used here

In [120]:
df.reset_index(inplace = True)

In [127]:
df = df.drop('index' , axis = 1)

In [128]:
df

Unnamed: 0,name,age,email
0,Deb,29,deb@xyz.com
1,Philipp,30,philipp@abc.com
2,Suman,27,suman@xyz.com
3,Tom,24,tom@cde.com
4,Anja,26,anja@abc.com
5,Charlie,22,charlie@zop.com
6,Alice,24,alice@abc.com
7,ali,32,ali@abc.com
8,michael,26,michael@xyz.com
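
Combining two dataframes and renumbering the rows can also be done in one step with `pd.concat` and `ignore_index=True`. The sketch below uses two small made-up dataframes (`first`, `second`) rather than `df` and `df_new`.

```python
import pandas as pd

first = pd.DataFrame({"name": ["Deb", "Tom"], "age": [29, 24]})
second = pd.DataFrame({"name": ["ali", "michael"], "age": [32, 26]})

# ignore_index=True discards the old row labels and numbers the rows 0..n-1,
# so no separate reset_index and drop('index') step is needed.
combined = pd.concat([first, second], ignore_index=True)

print(combined)
```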


In [139]:
'''Handling null values: first adding a row with a missing age'''

df.loc[9] = ['kate', None, 'kate@zop.com']

In [140]:
df   ## here at index 9 the age column has a null value

Unnamed: 0,name,age,email
0,Deb,29.0,deb@xyz.com
1,Philipp,30.0,philipp@abc.com
2,Suman,27.0,suman@xyz.com
3,Tom,24.0,tom@cde.com
4,Anja,26.0,anja@abc.com
5,Charlie,22.0,charlie@zop.com
6,Alice,24.0,alice@abc.com
7,ali,32.0,ali@abc.com
8,michael,26.0,michael@xyz.com
9,kate,,kate@zop.com


In [141]:
'''Returning a boolean dataframe that is True wherever there is a null value'''

null_values = df.isnull()

null_values

Unnamed: 0,name,age,email
0,False,False,False
1,False,False,False
2,False,False,False
3,False,False,False
4,False,False,False
5,False,False,False
6,False,False,False
7,False,False,False
8,False,False,False
9,False,True,False


In [142]:
'''Getting the count of null values in each column'''

null_counts = df.isnull().sum()

null_counts

name     0
age      1
email    0
dtype: int64

In [143]:
'''Deleting the rows with null values'''

df.dropna(inplace=True)

df

Unnamed: 0,name,age,email
0,Deb,29,deb@xyz.com
1,Philipp,30,philipp@abc.com
2,Suman,27,suman@xyz.com
3,Tom,24,tom@cde.com
4,Anja,26,anja@abc.com
5,Charlie,22,charlie@zop.com
6,Alice,24,alice@abc.com
7,ali,32,ali@abc.com
8,michael,26,michael@xyz.com


In [153]:
'''Getting the index label of a row based on a specific value in that row. item() raises a ValueError unless exactly one row matches'''
df.index[df['name'] == 'Tom'].item()

3

Example 1: Getting all the names that start with 'A' (ignoring case) and returning their corresponding data in a dataframe

In [149]:
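index_1 = []
for i in df['name']:
    ## Lower-case the name so that 'ali' is matched as well as 'Anja' and 'Alice'
    if i.lower().startswith('a'):
        ## item() assumes that each name occurs exactly once in the dataframe
        index_1.append(df.index[df['name'] == i].item())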

In [150]:
index_1

[4, 6, 7]

In [151]:
df.loc[index_1]

Unnamed: 0,name,age,email
4,Anja,26,anja@abc.com
6,Alice,24,alice@abc.com
7,ali,32,ali@abc.com


In [155]:
'''Creating a filter to get the rows where the name starts with 'A'. The match is case-sensitive, so 'ali' is not included'''

filt = (df['name'].str.startswith('A'))

In [156]:
df.loc[filt]

Unnamed: 0,name,age,email
4,Anja,26,anja@abc.com
6,Alice,24,alice@abc.com


In [157]:
'''Putting a ~ in front of the filter inverts it and returns the rows that do not match'''

df.loc[~filt]

Unnamed: 0,name,age,email
0,Deb,29,deb@xyz.com
1,Philipp,30,philipp@abc.com
2,Suman,27,suman@xyz.com
3,Tom,24,tom@cde.com
5,Charlie,22,charlie@zop.com
7,ali,32,ali@abc.com
8,michael,26,michael@xyz.com


In [158]:
'''Using numpy with pandas'''

df

Unnamed: 0,name,age,email
0,Deb,29,deb@xyz.com
1,Philipp,30,philipp@abc.com
2,Suman,27,suman@xyz.com
3,Tom,24,tom@cde.com
4,Anja,26,anja@abc.com
5,Charlie,22,charlie@zop.com
6,Alice,24,alice@abc.com
7,ali,32,ali@abc.com
8,michael,26,michael@xyz.com


In [161]:
'''Creating a new column indicating whose age is greater than 25'''

df['age_greater_25'] = np.where(df['age'] > 25, 'yes', 'no')

In [162]:
df

Unnamed: 0,name,age,email,age_greater_25
0,Deb,29,deb@xyz.com,yes
1,Philipp,30,philipp@abc.com,yes
2,Suman,27,suman@xyz.com,yes
3,Tom,24,tom@cde.com,no
4,Anja,26,anja@abc.com,yes
5,Charlie,22,charlie@zop.com,no
6,Alice,24,alice@abc.com,no
7,ali,32,ali@abc.com,yes
8,michael,26,michael@xyz.com,yes
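
`np.where` handles one condition with two outcomes. For more than two outcomes, `np.select` is one possible option. The age bands and labels below are made up for this sketch, as is the dataframe `people`.

```python
import numpy as np
import pandas as pd

people = pd.DataFrame({"name": ["Deb", "Tom", "Charlie"], "age": [29, 24, 22]})

# Conditions are checked in order; the first one that is True decides the label.
conditions = [people["age"] < 24, people["age"] < 28]
labels = ["under 24", "24 to 27"]

# Rows that match no condition get the default label.
people["age_band"] = np.select(conditions, labels, default="28 and over")

print(people)
```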


In [164]:

df.loc[9] = ['kate', None, 'kate@zop.com', None]

In [165]:
df

Unnamed: 0,name,age,email,age_greater_25
0,Deb,29.0,deb@xyz.com,yes
1,Philipp,30.0,philipp@abc.com,yes
2,Suman,27.0,suman@xyz.com,yes
3,Tom,24.0,tom@cde.com,no
4,Anja,26.0,anja@abc.com,yes
5,Charlie,22.0,charlie@zop.com,no
6,Alice,24.0,alice@abc.com,no
7,ali,32.0,ali@abc.com,yes
8,michael,26.0,michael@xyz.com,yes
9,kate,,kate@zop.com,


In [166]:
'''Replacing the null values in the age column with a fixed age of 20'''

df['age'] = np.where(df['age'].isnull(), 20, df['age'])

In [168]:
df['age_greater_25'] = np.where(df['age'] > 25, 'yes', 'no')

In [169]:
df

Unnamed: 0,name,age,email,age_greater_25
0,Deb,29,deb@xyz.com,yes
1,Philipp,30,philipp@abc.com,yes
2,Suman,27,suman@xyz.com,yes
3,Tom,24,tom@cde.com,no
4,Anja,26,anja@abc.com,yes
5,Charlie,22,charlie@zop.com,no
6,Alice,24,alice@abc.com,no
7,ali,32,ali@abc.com,yes
8,michael,26,michael@xyz.com,yes
9,kate,20,kate@zop.com,no
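
pandas also has a dedicated method, `fillna`, for replacing null values. The sketch below shows it on a small made-up dataframe (`people`); the fill value 20 follows the example above.

```python
import pandas as pd

people = pd.DataFrame({"name": ["Deb", "kate"], "age": [29, None]})

# The missing value makes the age column float; fillna replaces it with 20,
# and astype(int) turns the column back into whole numbers.
people["age"] = people["age"].fillna(20).astype(int)

print(people)
print(people["age"].isnull().sum())  # no null values remain
```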


Practice Example 1: Create a new column named email_abc and set its value to 'yes' if the email address falls
    in the @abc.com domain and 'no' if it does not. Also return the number of emails falling in the @abc.com domain.

In [170]:
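## endswith matches the domain literally; str.contains would treat '.' as a regex wildcard and match anywhere in the address
df['email_abc'] = np.where(df['email'].str.endswith('@abc.com'), 'yes', 'no')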

In [171]:
df

Unnamed: 0,name,age,email,age_greater_25,email_abc
0,Deb,29,deb@xyz.com,yes,no
1,Philipp,30,philipp@abc.com,yes,yes
2,Suman,27,suman@xyz.com,yes,no
3,Tom,24,tom@cde.com,no,no
4,Anja,26,anja@abc.com,yes,yes
5,Charlie,22,charlie@zop.com,no,no
6,Alice,24,alice@abc.com,no,yes
7,ali,32,ali@abc.com,yes,yes
8,michael,26,michael@xyz.com,yes,no
9,kate,20,kate@zop.com,no,no


In [172]:
'''Counting the number of 'yes' values in the email_abc column to get the number of emails in the @abc.com domain'''

yes_count = df['email_abc'].value_counts().get('yes', 0)

yes_count

4
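
The count can also be obtained without a helper column, because True counts as 1 when a boolean Series is summed. A minimal sketch on a made-up dataframe (`people`):

```python
import pandas as pd

people = pd.DataFrame(
    {"email": ["deb@xyz.com", "philipp@abc.com", "anja@abc.com"]}
)

# Boolean Series: True where the address ends with the domain.
is_abc = people["email"].str.endswith("@abc.com")

# Summing a boolean Series counts the True values.
abc_count = int(is_abc.sum())

print(abc_count)
```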

In [None]:
'''In pandas, you can use the groupby method to group a dataframe by one or more columns and then 
perform aggregation on those groups using various aggregation functions'''

In [177]:
grouped_df = df.groupby('age_greater_25').agg({
    'age': ['mean']
}).reset_index()

In [178]:
grouped_df   ## Shows the mean age of 2 groups: one with age greater than 25 and the other with age 25 or less

Unnamed: 0_level_0,age_greater_25,age
Unnamed: 0_level_1,Unnamed: 1_level_1,mean
0,no,22.5
1,yes,28.333333
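
When only one statistic of one column is needed, selecting the column after groupby gives a plain Series without the two header rows seen above. A sketch on a small made-up dataframe (`people`):

```python
import pandas as pd

people = pd.DataFrame(
    {"age": [29, 30, 24, 22], "age_greater_25": ["yes", "yes", "no", "no"]}
)

# Group the rows, pick the age column, and take the mean of each group.
mean_age = people.groupby("age_greater_25")["age"].mean()

print(mean_age)
```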


Practice Example 2: Calculate the number of users for each email domain (@abc.com, @xyz.com, @cde.com and @zop.com) and also the mean age of users in each of these domains

In [189]:
df['domains'] = None

In [190]:
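for i in df['email']:
    val = i.split('@')  ## e.g. 'deb@xyz.com' becomes ['deb', 'xyz.com']
    ## item() assumes that each email occurs exactly once in the dataframe
    idx = df.index[df['email'] == i].item()
    df.loc[idx, 'domains'] = val[-1]  ## the part after '@' is the domain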

In [191]:
df

Unnamed: 0,name,age,email,age_greater_25,email_abc,domains
0,Deb,29,deb@xyz.com,yes,no,xyz.com
1,Philipp,30,philipp@abc.com,yes,yes,abc.com
2,Suman,27,suman@xyz.com,yes,no,xyz.com
3,Tom,24,tom@cde.com,no,no,cde.com
4,Anja,26,anja@abc.com,yes,yes,abc.com
5,Charlie,22,charlie@zop.com,no,no,zop.com
6,Alice,24,alice@abc.com,no,yes,abc.com
7,ali,32,ali@abc.com,yes,yes,abc.com
8,michael,26,michael@xyz.com,yes,no,xyz.com
9,kate,20,kate@zop.com,no,no,zop.com
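
The same column can be built without a loop by using the `.str` accessor, which applies string operations to every value at once. A minimal sketch on a made-up dataframe (`people`):

```python
import pandas as pd

people = pd.DataFrame(
    {"email": ["deb@xyz.com", "philipp@abc.com", "tom@cde.com"]}
)

# Split every address at '@' and keep the last piece, the domain.
people["domains"] = people["email"].str.split("@").str[-1]

print(people)
```

Unlike the loop, this version does not depend on every email address being unique.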


In [192]:
grouped_df = df.groupby('domains').agg({
    'age': ['mean'],
    'domains' : ['count']
}).reset_index()

In [193]:
grouped_df

Unnamed: 0_level_0,domains,age,domains
Unnamed: 0_level_1,Unnamed: 1_level_1,mean,count
0,abc.com,28.0,4
1,cde.com,24.0,1
2,xyz.com,27.333333,3
3,zop.com,21.0,2
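
Named aggregation is an alternative way to write such a summary. It gives each result column a single, readable name instead of two header rows. The column names `mean_age` and `users` and the dataframe `people` are made up for this sketch, and the users are counted through the age column here.

```python
import pandas as pd

people = pd.DataFrame(
    {
        "age": [29, 30, 27, 24],
        "domains": ["xyz.com", "abc.com", "xyz.com", "cde.com"],
    }
)

# Each keyword is a new column name; the tuple is (source column, function).
summary = (
    people.groupby("domains")
    .agg(mean_age=("age", "mean"), users=("age", "count"))
    .reset_index()
)

print(summary)
```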
