This kernel demonstrates the main features of the pandas library and how to use them for data analysis. Here, data analysis covers techniques for cleaning, wrangling, preprocessing, and visualizing data.

<a class="anchor" id="0.1"></a>
# Table of Contents 

1. [Introduction to Pandas](#1)
2. [Data Structures in Pandas](#2)
3. [What is Series and how we can create it](#3)
4. [Slicing Series](#4)
5. [Appending Series](#5)
6. [Operations on Series](#6)
7. [What is DataFrame and how we can create it](#7)
8. [Delete Column in DataFrame](#8)
9. [Delete Rows in DataFrame](#9)
10. [Data Selection in DataFrame](#10)
11. [Set Values in DataFrame](#11)
12. [Dealing with Null values](#12)
13. [Importing csv files](#13)
14. [Descriptive Statistics](#14)
15. [Apply function on DataFrame](#15)
16. [Merge DataFrames](#16)
17. [Like Operations in Pandas](#17)
18. [Regex in Pandas DataFrame ](#18)
19. [Replace values in DataFrame](#19)
20. [Important DataFrame methods](#20)
21. [Group By Operations](#21)
22. [Stack and unstack in Pandas](#22)
23. [Pivot Tables](#23)
24. [Hierarchical Indexing](#24)
25. [Crosstab in Pandas](#25)
26. [Row and Column Bind](#26)
27. [Data Visualizations using Pandas](#27)

# 1. Introduction to Pandas  <a class="anchor" id="1"></a>


[Table of Contents](#0.1)

Pandas is a Python package for managing and analyzing data.

Its main contribution is two data types for storing data: the Series and the DataFrame.

Think of a pandas DataFrame as an Excel spreadsheet that stores some data. One column can hold the customer name, another the name of the product sold, another the price or quantity, and each row can be an individual sale.

A DataFrame is made up of several Series: each column of a DataFrame is a Series.

We can name each column and each row of a DataFrame.

A pandas dataframe is very similar to a data.frame in R.

Like a NumPy array, a DataFrame is a more robust way to store data than a list of lists, and it is more flexible than a NumPy array.

A NumPy array holds entries of a single data type, whereas each column of a DataFrame can have its own data type.

That's not to say NumPy arrays aren't useful. It is often easiest to convert a subset of a DataFrame to a NumPy array and do the math on that.

Pandas also has SQL-like functions for merging, joining, and sorting dataframes.
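
As a minimal sketch of the points above, the snippet below builds a small DataFrame, shows that a column is a Series with its own dtype, and converts the numeric columns to a NumPy array for arithmetic. The `sales` data and its column names are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sales data, invented for illustration only.
sales = pd.DataFrame({
    "customer": ["Asha", "Ben", "Chen"],
    "product": ["pen", "notebook", "pen"],
    "price": [1.5, 4.0, 1.5],
    "quantity": [10, 2, 7],
})

# Each column of the DataFrame is a Series with its own dtype.
price_column = sales["price"]
print(type(price_column).__name__)
print(sales.dtypes)

# Convert the numeric subset to a NumPy array and do the math there.
numeric = sales[["price", "quantity"]].to_numpy()
revenue = numeric[:, 0] * numeric[:, 1]
print(revenue)
```

Converting only the numeric columns keeps the resulting array homogeneous, which is what NumPy arithmetic expects.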


## 1.1 Import Pandas  

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
import warnings
warnings.filterwarnings("ignore")
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# 2.Data Structures in Pandas  <a class="anchor" id="2"></a>

[Table of Contents](#0.1)


Pandas provides two main data structures, Series and DataFrame, both built on top of NumPy arrays. Older releases also shipped a third structure, Panel, which was deprecated in pandas 0.20 and removed in pandas 0.25.

- **Series**: a one-dimensional, labeled, homogeneous array of immutable size.
- **DataFrame**: a two-dimensional, size-mutable, tabular data structure whose columns can have different types.
- **Panel** (removed): a three-dimensional, labeled, size-mutable array. In current pandas, a DataFrame with a MultiIndex covers the same use case.

All of these data structures are **value-mutable**.
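
A short sketch of what value-mutable and size-mutable mean in practice. The variable names are illustrative and not part of the kernel.

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=["a", "b", "c"])
s["b"] = 20                  # value mutation: an existing element changes in place

df = pd.DataFrame({"x": [1, 2, 3]})
df["y"] = df["x"] * 10       # size mutation: a new column is added to the same object
df.loc[0, "x"] = 100         # value mutation on a single cell

print(s)
print(df)
```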

# 3.What is Series and how we can create it  <a class="anchor" id="3"></a>

[Table of Contents](#0.1)

A Series is a one-dimensional array structure that stores homogeneous data, i.e. data of a single type. Some of the features of a Series are:

- The elements of a Series are value-mutable, but the Series itself is size-immutable.
- data: can be supplied as an ndarray, list, scalar constant, dict, or another Series.
- index: labels must be hashable and have the same length as the data; they do not have to be unique. If no index is passed, it defaults to the positions 0 to n-1, like np.arange(n).
- dtype: data type of the Series; if none is given, it is inferred automatically.
- copy: whether to copy the input data; defaults to False.
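
The constructor parameters listed above can be sketched as follows. This is a small illustration with made-up values, not an exhaustive tour of `pd.Series`.

```python
import numpy as np
import pandas as pd

# A scalar is broadcast to every label in the index.
constant = pd.Series(5, index=["x", "y", "z"])

# With a dict plus an explicit index, values are matched by key;
# labels missing from the dict become NaN.
from_dict = pd.Series({"a": 1, "b": 2}, index=["a", "b", "c"])

# dtype forces a type instead of letting pandas infer it.
as_float = pd.Series([1, 2, 3], dtype="float64")

# With no index, pandas assigns the default positions 0..n-1.
default_index = pd.Series(np.array([10, 20, 30]))

print(constant, from_dict, as_float, default_index, sep="\n\n")
```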


In [None]:
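# Create an empty Series
# Passing dtype explicitly avoids the warning some pandas versions emit for an empty Series
empty_series = pd.Series(dtype="float64")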
print(empty_series)

In [None]:
# Create a Series from a NumPy array
v = np.array([10,20,30,40,50,60,70])
s1 = pd.Series(v)
s1

In [None]:
#Datatype of Series
s1.dtype

In [None]:
# Number of items present in the Series
# (s1.item is a method that extracts a single value, not an item count)
s1.size

In [None]:
# Number of bytes consumed by Series
s1.nbytes

In [None]:
# Shape of the Series
s1.shape

In [None]:
# number of dimensions
s1.ndim

In [None]:
# Length of Series
len(s1)

In [None]:
# Number of non-null values in the Series
s1.count()

In [None]:
# Total number of elements, including nulls
s1.size

In [None]:
# Create series from List
s0 = pd.Series([10,20,30],index = ['A','B','C'])
s0

In [None]:
# Modifying index in Series
s1.index = ['a' , 'b' , 'c' , 'd' , 'e' , 'f' , 'g']
s1

In [None]:
# Create Series using Random and Range function
v2 = np.random.random(10)
ind2 = np.arange(0,10)
s = pd.Series(v2,ind2)
v2 , ind2 , s

In [None]:
# Creating Series from Dictionary
dict1 = {'a1' :10 , 'a2' :20 , 'a3':30 , 'a4':40}
s3 = pd.Series(dict1)
s3

In [None]:
pd.Series(99, index=[0, 1, 2, 3, 4, 5])

# 4.Slicing Series  <a class="anchor" id="4"></a>

[Table of Contents](#0.1)

Slicing retrieves part of a Series.
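
Slices can be positional or label-based, and the two differ in whether the stop value is included. The sketch below uses a small, made-up Series with string labels so the difference is visible.

```python
import pandas as pd

letters = pd.Series([10, 20, 30, 40, 50], index=["a", "b", "c", "d", "e"])

by_position = letters.iloc[1:3]   # positions 1 and 2; the stop position is excluded
by_label = letters.loc["b":"d"]   # labels b, c and d; the stop label is included
last_two = letters.iloc[-2:]      # negative positions count from the end

print(by_position, by_label, last_two, sep="\n\n")
```

Using `.iloc` and `.loc` explicitly makes the intent clear even when the index itself consists of integers.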

In [None]:
s

In [None]:
# Return all elements of the series
s[:]

In [None]:
# First three elements of the Series
s[0:3]

In [None]:
# Last element of the Series
s[-1:]

In [None]:
# Fetch first 4 elements in a series
s[:4]

In [None]:
# Return all elements of the series except last two elements.
s[:-2]

In [None]:
# Return all elements of the series except last element.
s[:-1]

In [None]:
# Return last two elements of the series
s[-2:]

In [None]:
# Return last element of the series
s[-1:]

In [None]:
# Third-last and second-last elements (the last element is excluded)
s[-3:-1]

# 5.Appending Series  <a class="anchor" id="5"></a>

[Table of Contents](#0.1)

In [None]:
s2 = s1.copy()
s2

In [None]:
s3

In [None]:
# Append the s2 and s3 Series
# (Series.append() was removed in pandas 2.0; pd.concat() works in all versions)
s4 = pd.concat([s2, s3])
s4

In [None]:
# When "inplace=False" it will return a new copy of data with the operation performed
s4.drop('a4' , inplace=False)

In [None]:
s4

In [None]:
# When we use "inplace=True" the Series itself is modified
s4.drop('a4', inplace=True)
s4

In [None]:
# Add the 'a4' entry back (pd.concat() replaces the removed Series.append())
s4 = pd.concat([s4, pd.Series({'a4': 7})])
s4
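
A compact sketch of the two ideas in this section: combining Series with `pd.concat()` and the difference between returning a new object and modifying in place. The Series here are made up for illustration.

```python
import pandas as pd

first = pd.Series([1, 2], index=["a", "b"])
second = pd.Series([3, 4], index=["a", "c"])

# concat keeps the original labels, so "a" appears twice.
combined = pd.concat([first, second])

# ignore_index=True discards the labels and renumbers from 0.
renumbered = pd.concat([first, second], ignore_index=True)

# drop() without inplace returns a new Series and leaves `combined` untouched.
trimmed = combined.drop("c")

print(combined, renumbered, trimmed, sep="\n\n")
```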

# 6. Operations on Series <a class="anchor" id="6"></a>

[Table of Contents](#0.1)

In [None]:
v1 = np.array([10,20,30])
v2 = np.array([1,2,3])
s1 = pd.Series(v1)
s2 = pd.Series(v2)
s1 , s2

In [None]:
# Addition of two series
s1.add(s2)

In [None]:
# Subtraction of two series
s1.sub(s2)

In [None]:
# Subtraction of two series
s1.subtract(s2)

In [None]:
# Increment all numbers in a series by 10
s1.add(10)

In [None]:
# Multiplication of two series
s1.mul(s2)

In [None]:
# Multiplication of two series
s1.multiply(s2)

In [None]:
# Multiply each element by 10000
s1.multiply(10000)

In [None]:
# Division
s1.divide(s2)

In [None]:
# Division
s1.div(s2)

In [None]:
# MAX number in a series
s1.max()

In [None]:
# Min number in a series
s1.min()

In [None]:
# Average
s1.mean()

In [None]:
# Median
s1.median()

In [None]:
# Standard Deviation
s1.std()

In [None]:
# Series comparison
s1.equals(s2)

In [None]:
# s4 now refers to the same Series object as s1
s4 = s1

In [None]:
# Series comparison
s1.equals(s4)

In [None]:
s5 = pd.Series([10,10,20,20,30,30], index=[0, 1, 2, 3, 4, 5])
s5

In [None]:
s5.value_counts()
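
The examples above use two Series with identical indexes. When the indexes differ, pandas aligns on labels before operating. The sketch below, with made-up data, shows the difference between the `+` operator and `add()` with `fill_value`.

```python
import pandas as pd

left = pd.Series([10, 20, 30], index=["a", "b", "c"])
right = pd.Series([1, 2, 3], index=["b", "c", "d"])

# Labels present on only one side produce NaN.
plain_sum = left + right

# fill_value substitutes 0 for the missing side before adding.
filled_sum = left.add(right, fill_value=0)

print(plain_sum)
print(filled_sum)
```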

# 7. What is dataframe and how we can create it <a class="anchor" id="7"></a>

[Table of Contents](#0.1)

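A DataFrame is a 2D structure in which the data is aligned in a tabular fashion, consisting of rows and columns. Some of the features of a DataFrame:

- A DataFrame can be created using the constructor pandas.DataFrame(data, index, columns, dtype, copy).
- data: can be supplied as an ndarray, list, constant, Series, dict, or another DataFrame.
- index / columns: labels for the rows and columns.
- dtype: a single data type to force on all columns; if omitted, each column's type is inferred.
- copy: whether to copy the input data. Older releases default to False; in pandas 1.3 and later the default is None, which copies dict input but not DataFrame or 2-D ndarray input.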
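
The constructor parameters above can be sketched like this. The array and labels are invented for illustration; the `copy=True` part shows that a copied DataFrame is unaffected by later changes to the source array.

```python
import numpy as np
import pandas as pd

values = np.arange(6).reshape(3, 2)

# index and columns label the rows and columns; dtype forces one type everywhere.
frame = pd.DataFrame(values, index=["r1", "r2", "r3"],
                     columns=["left", "right"], dtype="float64")

# copy=True makes the DataFrame independent of the source array.
copied = pd.DataFrame(values, copy=True)
values[0, 0] = 99

print(frame)
print(copied)
```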





In [None]:
#Creating an empty dataframe 

df = pd.DataFrame()
df

In [None]:
# Create Dataframe using List
lang = ['Java' , 'Python' , 'C#' , 'C++']
df = pd.DataFrame(lang)
df

In [None]:
# Add column in the Dataframe
rating = [1,2,3,4]
df[1] = rating
df

In [None]:
# Give meaningful names to the columns
df.columns = ['Language','Rating']
df

In [None]:
# Create DataFrames from a list of dictionaries
# (key 'C' is capitalized so that it matches the 'C' column requested below)
data = [{'A': 1, 'B': 2},{'A': 5, 'B': 10, 'C': 20}]
df2 = pd.DataFrame(data)
df3 = pd.DataFrame(data, index=['row1', 'row2'], columns=['A', 'B'])
df4 = pd.DataFrame(data, index=['row1', 'row2'], columns=['A', 'B' ,'C'])
df5 = pd.DataFrame(data, index=['row1', 'row2'], columns=['A', 'B' ,'C' , 'D'])

In [None]:
df2

In [None]:
df3

In [None]:
df4

In [None]:
df5

In [None]:
# Create Dataframe from Dictionary
df0 = pd.DataFrame({'ID' :[1,2,3,4] , 'Name' :['Saurav' , 'Vishal' , 'Rahul' , 'Sumit']})
df0

In [None]:
# Create a DataFrame from a dictionary of Series
# (named series_dict so that the built-in dict is not shadowed)
series_dict = {'A' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
               'B' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df1 = pd.DataFrame(series_dict)
df1

## Dataframe of Random Numbers with Date Indices

In [None]:
dates = pd.date_range(start='2021-01-20', end='2021-01-26')
dates

In [None]:
dates = pd.date_range('today',periods= 7)
dates

In [None]:
dates = pd.date_range(start='2020-01-20', periods=7)
dates

In [None]:
M = np.random.random((7,7))
M

In [None]:
dframe = pd.DataFrame(M , index=dates)
dframe

In [None]:
#Changing Column Names
dframe.columns = ['C1' , 'C2' , 'C3', 'C4', 'C5', 'C6', 'C7']
dframe

In [None]:
# List Index
dframe.index

In [None]:
# List Column Names
dframe.columns

In [None]:
# Datatype of each column
dframe.dtypes

In [None]:
# Sort Dataframe by Column 'C1' in Ascending Order
dframe.sort_values(by='C1')

In [None]:
# Sort Dataframe by Column 'C1' in Descending Order
dframe.sort_values(by='C1' , ascending=False)

# 8.Delete Column in DataFrame  <a class="anchor" id="8"></a>

[Table of Contents](#0.1)

In [None]:
df1

In [None]:
# Delete a column using the "del" statement
del df1['B']

In [None]:
df1

In [None]:
df5

In [None]:
# Delete Column using pop()
df5.pop('C')

In [None]:
df5

In [None]:
# Delete a column using drop()
series_dict = {'A' : pd.Series([1, 2, 3, 11], index=['a', 'b', 'c', 'd']),
               'B' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df12 = pd.DataFrame(series_dict)
df12

In [None]:
df12.drop(['A'], axis=1,inplace=True)
df12

# 9.Delete Rows in DataFrame  <a class="anchor" id="9"></a>

[Table of Contents](#0.1)

In [None]:
col1 = np.linspace(10, 100, 30)
col2 = np.random.randint(10,100,30)
df10 = pd.DataFrame({"C1" : col1 , "C2" :col2})
df10

In [None]:
# Delete rows with index values 17,18,19
df10 = df10.drop([17,18,19], axis=0)
df10

In [None]:
# Delete the row with index value 16 in place, without reassignment
df10.drop([16], axis=0,inplace=True)
df10

In [None]:
# Delete the row at position 5 (whatever its label is)
df10.drop(df10.index[5] , inplace=True)
df10

In [None]:
#Delete first three rows
df10 = df10.iloc[3:,]
df10

In [None]:
#Delete last four rows
df10 = df10.iloc[:-4,]
df10

In [None]:
#Keep top 10 rows
df10 = df10.iloc[:10,]
df10

In [None]:
df10

In [None]:
df10.index[df10['C2'] == 58].tolist()

In [None]:
# Delete row based on Column value
df10.drop(df10.index[df10['C2'] == 56].tolist() , axis=0,inplace=True)
df10

In [None]:
# Delete row based on Column value
df10 = df10.drop(df10[df10["C2"]==79].index)
df10

In [None]:
# Delete all rows with column C2 value 44 (C2 is random, so the value may not be present)
df10 = df10[df10.C2 != 44]
df10

In [None]:
# Delete all rows with column C2 value 22 or 93 using the isin operator
df10 = df10[~(df10.C2.isin([22, 93]))]
df10

In [None]:
# Keep only the rows with column C2 value 42 or 76 using the isin operator
df10 = df10[df10.C2.isin([42, 76])]
df10

In [None]:
series_dict = {'A' : pd.Series([1, 2, 3, 11], index=['a', 'b', 'c', 'd']),
               'B' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df11 = pd.DataFrame(series_dict)
df11

In [None]:
#Delete all rows with label "d"
df11.drop("d", axis=0,inplace=True)
df11

In [None]:
df13 = pd.DataFrame({ 'ID' :[1,2,3,4] ,
'Name' :['Saurav' , 'Vishal' , 'Swati' , 'Vijay'] ,
'location' : ['India' , 'Australia','UK' , 'US'] })
df13

In [None]:
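# Delete the row that matches several conditions at once
# (the name is 'Swati' so that the filter matches the row with ID 3 in df13)
ind = df13[((df13.Name == 'Swati') & (df13.ID == 3) & (df13.location == 'UK'))].index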
df13.drop(ind,inplace=True)
df13

# 10.Data Selection in Dataframe   <a class="anchor" id="10"></a>

[Table of Contents](#0.1)

In [None]:
df

In [None]:
df.index = [1,2,3,4]
df

In [None]:
# Data selection using row label
df.loc[1]

In [None]:
# Data selection using position (Integer Index based)
df.iloc[1]

In [None]:
df.loc[1:2]

In [None]:
df.iloc[1:2]

In [None]:
# Data selection based on Condition
df.loc[df.Rating > 2]

In [None]:
df1

In [None]:
# Row & Column label based selection
df1.loc['a']

**iloc works only with integer positions. Passing a label such as 'a' to iloc raises an error.**
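
A minimal sketch of the difference, using a made-up DataFrame with string row labels. The `try`/`except` block captures the error that `iloc` raises for a label.

```python
import pandas as pd

scores = pd.DataFrame({"score": [90, 80, 70]}, index=["a", "b", "c"])

by_label = scores.loc["b", "score"]   # loc: row label, column label
by_position = scores.iloc[1, 0]       # iloc: row position, column position

try:
    scores.iloc["b"]                  # a label is not a valid position
    error_name = None
except TypeError as exc:
    error_name = type(exc).__name__

print(by_label, by_position, error_name)
```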

In [None]:
dframe

In [None]:
# Data selection using row labels (.loc makes the row slice explicit)
dframe.loc['2020-01-20' : '2020-01-22']

In [None]:
# Selecting all rows & selected columns
dframe.loc[:,['C1' , 'C7']]

In [None]:
#row & column label based selection
dframe.loc['2020-01-20' : '2020-01-22',['C1' , 'C7']]

In [None]:
# Data selection based on Condition
dframe[dframe['C1'] > 0.5]

In [None]:
# Data selection based on Condition
dframe[(dframe['C1'] > 0.5) & (dframe['C4'] > 0.5)]

In [None]:
# Data selection using position (integer based): first row, first column
dframe.iloc[0, 0]

In [None]:
# Select all rows & first three columns
dframe.iloc[:,0:3]

In [None]:
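# Set the first row, first column to 10
# (a single iloc[row, col] call avoids chained assignment, which may not update dframe)
dframe.iloc[0, 0] = 10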

In [None]:
# Display all rows where C1 has value of 10 or 20
dframe[dframe['C1'].isin([10,20])]

# 11. Set Values in DataFrame <a class="anchor" id="11"></a>

[Table of Contents](#0.1)

In [None]:
# Set value of 888 for all elements in column 'C1'
dframe['C1'] = 888
dframe

In [None]:
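# Set value of 777 for first three rows in Column 'C6'
# (.at accepts only a single row/column label, so a positional slice needs .iloc)
dframe.iloc[0:3, dframe.columns.get_loc('C6')] = 777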

In [None]:
dframe

In [None]:
# Set value of 333 in first row and third column
dframe.iat[0,2] = 333
dframe

In [None]:
dframe.iloc[0,2] = 555
dframe

In [None]:
# Create a copy of the calling object's data along with its indices.
# Modifications to the data or indices of the copy will not be reflected in the original object.
dframe1 = dframe.copy(deep=True)

In [None]:
dframe1[(dframe1['C1'] > 0.5) & (dframe1['C4'] > 0.5)] = 0

In [None]:
dframe1[dframe1['C1'] == 0]

In [None]:
# Replace zeros in Column C1 with 99
# (selecting the column in .loc limits the change to C1 instead of the whole row)
dframe1.loc[dframe1['C1'] == 0, 'C1'] = 99
dframe1

In [None]:
# Display all rows where value of C1 is 99
dframe1[dframe1['C1'] == 99]

# 12.Dealing with NULL Values   <a class="anchor" id="12"></a>

[Table of Contents](#0.1)

In [None]:
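# Insert NULL values by row position (.at does not accept slices; np.nan is the portable spelling)
dframe.iloc[0:8, dframe.columns.get_loc('C7')] = np.nan
dframe.iloc[0:2, dframe.columns.get_loc('C6')] = np.nan
dframe.iloc[5:6, dframe.columns.get_loc('C5')] = np.nan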
dframe

In [None]:
# Detect Non-Missing Values
# It will return True for NOT-NULL values and False for NULL values
dframe.notna()

In [None]:
# Detect Missing or NULL Values
# It will return True for NULL values and False for NOT-NULL values
dframe.isna()

In [None]:
# Fill all NULL values with 1020
dframe = dframe.fillna(1020)
dframe

In [None]:
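# Insert NULL values again for the examples that follow
dframe.iloc[0:5, dframe.columns.get_loc('C7')] = np.nan
dframe.iloc[0:2, dframe.columns.get_loc('C6')] = np.nan
dframe.iloc[5:6, dframe.columns.get_loc('C5')] = np.nan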
dframe

In [None]:
# Replace Null values in Column 'C5' with number 123
# Replace Null values in Column 'C6' with number 789
dframe.fillna(value={'C5' : 123 , 'C6' : 789})

In [None]:
#Replace first NULL value in Column C7 with 789
dframe.fillna(value={'C7' : 789} , limit=1)

In [None]:
# Drop Rows with NULL values
dframe.dropna()

In [None]:
# Drop Columns with NULL values
dframe.dropna(axis='columns')

In [None]:
# Drop Rows with NULL values present in C5 or C6
dframe.dropna(subset=['C5' ,'C6'])
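
The same ideas on a tiny, made-up DataFrame, as a sketch: counting missing values per column, filling each column with its own replacement, and keeping only rows that have enough non-null values (`thresh`).

```python
import numpy as np
import pandas as pd

df_na = pd.DataFrame({
    "C5": [1.0, np.nan, 3.0],
    "C6": [np.nan, np.nan, 6.0],
    "C7": [7.0, 8.0, 9.0],
})

# Number of missing values in each column.
missing_per_column = df_na.isna().sum()

# Per-column fill values: the column mean for C5, a constant for C6.
filled = df_na.fillna({"C5": df_na["C5"].mean(), "C6": 0})

# Keep only rows that have at least two non-null values.
kept = df_na.dropna(thresh=2)

print(missing_per_column, filled, kept, sep="\n\n")
```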

# 13. Importing csv files <a class="anchor" id="13"></a>

[Table of Contents](#0.1)

In [None]:
#Reading the csv file
unv_df=pd.read_csv("/kaggle/input/world-university-rankings/cwurData.csv")

In [None]:
# Top 10 rows of the Dataframe
unv_df.head(10)

In [None]:
# Bottom 10 rows of the Dataframe
unv_df.tail(10)

In [None]:
# Unique values in Country column
unv_df['country'].unique()

In [None]:
# Number of Unique values in Country column
unv_df['country'].nunique()

In [None]:
#Dataframe information
unv_df.info()

In [None]:
# Reading columns
unv_df['country'].head(10)

In [None]:
# Reading columns
df1 = unv_df[['country' ,'institution','national_rank' , 'year']]
df1.head(10)

In [None]:
#Read specific rows
df1.iloc[1:4]

In [None]:
#Filter data
df1.loc[df1['country']== 'USA']

In [None]:
#Sort Data Frame
display('Sorted Data Frame', df1.sort_values(['country'], ascending=True).head(5))

In [None]:
#Sort Data Frame
display('Sorted Data Frame', df1.sort_values(['country'], ascending=False).head(5))

In [None]:
#Sort Data Frame - Ascending on "Country" & descending on "national rank"
display('Sorted Data Frame', df1.sort_values(['country', 'national_rank'], ascending=[True,False]))

In [None]:
#Iterating through the dataset
for index , row in df1.iterrows():
    if (row['country'] == 'United Kingdom' ):
        display(row[['country', 'institution']])

In [None]:
#Unique Values
unv_df['country'].drop_duplicates(keep='first').head(10)

In [None]:
# Countries with universities 
countries = unv_df['country'].unique()
type(countries) , countries

In [None]:
pokemon_csv=pd.read_csv("/kaggle/input/pokemon/Pokemon.csv")
pokemon_csv.head()

In [None]:
# Sum of two columns
pokemon_csv['Total'] = pokemon_csv['HP'] + pokemon_csv['Attack']
pokemon_csv.head(5)

In [None]:
# Row-wise sum of the columns at positions 4 to 9
pokemon_csv['Total'] = pokemon_csv.iloc[:,4:10].sum(axis=1)
pokemon_csv.head(5)

In [None]:
# Reorder columns: move the last column to position 10
cols = list(pokemon_csv.columns)
pokemon_csv = pokemon_csv[cols[0:10] + [cols[-1]] + cols[10:12]]
pokemon_csv.head(5)

In [None]:
# Move the last column (index -1, i.e. 12) to position 10
cols = list(pokemon_csv.columns)
pokemon_csv = pokemon_csv[cols[0:10] + [cols[-1]] + cols[10:12]]
pokemon_csv.head(5)

In [None]:
# Same move, addressing the last column by its positive index 12
cols = list(pokemon_csv.columns)
pokemon_csv = pokemon_csv[cols[0:10] + [cols[12]] + cols[10:12]]
pokemon_csv.head(5)

In [None]:
#Save to CSV file
pokemon_csv.to_csv('poke_updated.csv')

In [None]:
#Save to CSV file without index column
pokemon_csv.to_csv('poke_updated1.csv', index=False)

In [None]:
pokemon_csv.head(10)

In [None]:
# Save Dataframe as text file
pokemon_csv.to_csv('poke.txt' , sep='\t' , index=False)

In [None]:
# Save Dataframe as xlsx file
pokemon_csv.to_excel('poke.xlsx')

In [None]:
# Save Dataframe as xlsx file without row names
pokemon_csv.to_excel('poke.xlsx', index=False)

In [None]:
#Filtering using loc
pokemon_csv.loc[pokemon_csv['Type 2'] == 'Dragon']

In [None]:
#Filtering using loc
pokemon_csv_2 = pokemon_csv.loc[(pokemon_csv['Type 2'] == 'Dragon') & (pokemon_csv['Type 1'] == 'Dark')]
pokemon_csv_2

In [None]:
# Reset the index of pokemon_csv_2, keeping the old index as a column
pokemon_csv_3 = pokemon_csv_2.reset_index()
pokemon_csv_3

In [None]:
# Reset the index of pokemon_csv_2, discarding the old index
pokemon_csv_2 = pokemon_csv_2.reset_index(drop=True)
pokemon_csv_2

# 14.Descriptive Statistics  <a class="anchor" id="14"></a>

[Table of Contents](#0.1)

In [None]:
#Check for the missing values 

unv_df.isna().any()

In [None]:
#Calculating the no of missing values for broad_impact column
unv_df['broad_impact'].isna().sum()

In [None]:
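# Mean of all numeric columns
# (numeric_only=True skips text columns, which raise an error in pandas 2.0+)
unv_df.mean(numeric_only=True)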

In [None]:
# Max value per column
unv_df.max()

In [None]:
# Min value per column
unv_df.min()

In [None]:
# Median of the numeric columns
unv_df.median(numeric_only=True)

In [None]:
# Standard Deviation
unv_df.std(numeric_only=True)

In [None]:
# Variance
unv_df.var(numeric_only=True)

In [None]:
# Lower Quartile / First Quartile
unv_df.quantile(0.25, numeric_only=True)

In [None]:
# Second Quartile / Median
unv_df.quantile(0.50, numeric_only=True)

In [None]:
# Upper Quartile
unv_df.quantile(0.75, numeric_only=True)

In [None]:
# IQR (Interquartile Range)
unv_df.quantile(0.75, numeric_only=True) - unv_df.quantile(0.25, numeric_only=True)

In [None]:
# Sum of the numeric column values
unv_df.sum(numeric_only=True)

In [None]:
# GENERATES DESCRIPTIVE STATS
unv_df.describe()

In [None]:
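# Return unbiased skew
# https://www.youtube.com/watch?v=HnMGKsupF8Q
unv_df.skew(numeric_only=True)

In [None]:
# Return unbiased kurtosis using Fisher’s definition of kurtosis
# https://www.youtube.com/watch?v=HnMGKsupF8Q
unv_df.kurt(numeric_only=True)

In [None]:
# Correlation between the numeric columns
unv_df.select_dtypes(include="number").corr()

In [None]:
# Covariance between the numeric columns
unv_df.select_dtypes(include="number").cov()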

In [None]:
# Harmonic Mean
import statistics as st
st.harmonic_mean(unv_df['alumni_employment'])

In [None]:
# low median of the data with EVEN length
st.median_low(unv_df['alumni_employment'])

In [None]:
# High median of the data with EVEN length
st.median_high(unv_df['alumni_employment'])

In [None]:
# Mode of Dataset
st.mode(unv_df['publications'])

In [None]:
# Sample Variance
st.variance(unv_df['alumni_employment'])

In [None]:
#Population Variance
st.pvariance(unv_df['alumni_employment'])

In [None]:
#Sample Standard Deviation
st.stdev(unv_df['alumni_employment'])

In [None]:
#Population Standard Deviation
st.pstdev(unv_df['alumni_employment'])
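
The quartiles and IQR computed above are often used to flag outliers. The sketch below applies the common 1.5 × IQR rule to a tiny made-up Series; the rule and the data are illustrative assumptions, not part of the original kernel.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 100])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Anything beyond 1.5 * IQR from the quartiles is treated as an outlier.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print(q1, q3, iqr)
print(outliers)
```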

# 15.Apply function on Dataframe <a class="anchor" id="15"></a>

[Table of Contents](#0.1)

In [None]:
# Finding MAX value in Columns
unv_df.apply(max)

In [None]:
# Finding minimum value in Columns
unv_df.apply(min)

In [None]:
#Sum of Column Values
unv_df.apply(np.sum)

In [None]:
# Square root of all values in a DataFrame
dframe.applymap(np.sqrt)

In [None]:
# Square root of all values in a DataFrame
import math
dframe.applymap(math.sqrt)

In [None]:
dframe.applymap(float)

In [None]:
# Using Lambda function in Dataframes
unv_df.apply(lambda x: min(x))

In [None]:
# Using Lambda function in Dataframes
dframe.apply(lambda x: x*x)
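
By default `apply()` passes each column to the function; with `axis=1` it passes each row instead. A small sketch on made-up data:

```python
import pandas as pd

df_apply = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0 (default): the function receives one column at a time.
col_ranges = df_apply.apply(lambda col: col.max() - col.min())

# axis=1: the function receives one row at a time.
row_totals = df_apply.apply(lambda row: row["a"] + row["b"], axis=1)

print(col_ranges)
print(row_totals)
```

For simple arithmetic like this, vectorized expressions such as `df_apply["a"] + df_apply["b"]` give the same result and are usually faster than `apply()`.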

# 16. Merge Dataframe  <a class="anchor" id="16"></a>

[Table of Contents](#0.1)

In [None]:
shanghai_csv = pd.read_csv("/kaggle/input/world-university-rankings/shanghaiData.csv")
shanghai_csv.head()

In [None]:
time_df=pd.read_csv("/kaggle/input/world-university-rankings/timesData.csv")
time_df.head()

In [None]:
# Inner Join
pd.merge(shanghai_csv, time_df, on='university_name', how='inner')

In [None]:
# Full Outer Join
pd.merge(shanghai_csv, time_df, on='university_name', how='outer')

In [None]:
# Left Outer Join
pd.merge(shanghai_csv, time_df, on='university_name', how='left')

In [None]:
#Right Outer Join
pd.merge(shanghai_csv, time_df, on='university_name', how='right')
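
The join types are easier to compare on small tables. The sketch below uses two made-up DataFrames that share a `university_name` key; `indicator=True` adds a `_merge` column showing which side each row came from.

```python
import pandas as pd

left = pd.DataFrame({"university_name": ["A", "B", "C"], "rank_left": [1, 2, 3]})
right = pd.DataFrame({"university_name": ["B", "C", "D"], "rank_right": [5, 6, 7]})

# Inner join: only keys present in both tables.
inner = pd.merge(left, right, on="university_name", how="inner")

# Outer join: all keys; _merge reports left_only, right_only or both.
outer = pd.merge(left, right, on="university_name", how="outer", indicator=True)

print(inner)
print(outer)
```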

# 17.Like Operations in Pandas  <a class="anchor" id="17"></a>

[Table of Contents](#0.1)

In [None]:
pokemon_csv.Name.str.contains("rill").head(10)

In [None]:
# Display all rows containing Name "rill"
pokemon_csv.loc[pokemon_csv.Name.str.contains("rill")]

In [None]:
# Exclude all rows containing "rill"
pokemon_csv.loc[~pokemon_csv.Name.str.contains("rill")].head(10)

In [None]:
# Display all rows with Type-1 as "Grass" and Type-2 as "Poison"
# (na=False treats missing values as non-matches)
pokemon_csv.loc[pokemon_csv['Type 1'].str.contains("Grass", na=False) & pokemon_csv['Type 2'].str.contains("Poison", na=False)]

In [None]:
pokemon_csv.loc[pokemon_csv['Type 1'].str.contains('Grass|Water',regex = True)].head(10)

In [None]:
# The match is case-sensitive, so this will not return any data
pokemon_csv.loc[pokemon_csv['Type 1'].str.contains('grass|water',regex = True)].head(10)

In [None]:
# To ignore case we can use "case = False"
pokemon_csv.loc[pokemon_csv['Type 1'].str.contains('grass|water', case = False ,regex = True)].head(10)

In [None]:
# To ignore case we can also use "flags = re.I" from the standard-library re module
import re
pokemon_csv.loc[pokemon_csv['Type 1'].str.contains('grass|water',flags = re.I ,regex = True)].head(10)

# 18.Regex in Pandas dataframe   <a class="anchor" id="18"></a>

[Table of Contents](#0.1)

In [None]:
# Get all rows with name starting with "wa" (case-insensitive)

pokemon_csv.loc[pokemon_csv.Name.str.contains('^Wa',flags = re.I ,regex = True)].head(10)

In [None]:
#Get all rows with name starting with "wa" followed by any letter between a-l
pokemon_csv.loc[pokemon_csv.Name.str.contains('^Wa[a-l]+',flags = re.I ,regex = True)].head(10)

In [None]:
#Get all rows with name starting with x , y, z
pokemon_csv.loc[pokemon_csv.Name.str.contains('^[x-z]',flags = re.I ,regex = True)]

In [None]:
# Extracting first 3 characters from "Name" column
pokemon_csv['Name2'] = pokemon_csv.Name.str.extract(r'(^\w{3})')
pokemon_csv.head(5)

In [None]:
# Return all rows with "Name" starting with the character 'B' or 'b'
# (inside a character class, | is a literal character, so [Bb] is the correct pattern)
pokemon_csv.loc[pokemon_csv.Name.str.match(r'^[Bb]')].head(5)
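
The string methods used in the last two sections can be tried on a small, made-up Series. The sketch below also includes a missing value to show how `na=False` keeps the result a clean boolean mask.

```python
import re
import pandas as pd

names = pd.Series(["Wartortle", "Azumarill", None, "bulbasaur"])

# Case-insensitive match at the start of the string; missing values count as False.
starts_with_wa = names.str.contains(r"^wa", flags=re.IGNORECASE, na=False)

# case=False is an alternative to passing a regex flag.
contains_rill = names.str.contains("rill", case=False, na=False)

# Extract the first three word characters; missing values stay missing.
first_three = names.str.extract(r"^(\w{3})", expand=False)

print(starts_with_wa, contains_rill, first_three, sep="\n\n")
```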

# 19.Replace values in dataframe  <a class="anchor" id="19"></a>

[Table of Contents](#0.1)

In [None]:
pokemon_csv['Type 1'] = pokemon_csv['Type 1'].replace({"Grass" : "Meadow" , "Fire" :"Blaze"})
pokemon_csv.head()

In [None]:
pokemon_csv['Type 2'] = pokemon_csv['Type 2'].replace({"Poison" : "Venom"})
pokemon_csv.head()

In [None]:
pokemon_csv['Type 2'] = pokemon_csv['Type 2'].replace(['Venom' , 'Dragon'] , 'DANGER')
pokemon_csv

In [None]:
pokemon_csv.loc[pokemon_csv['Type 2'] == 'DANGER' , 'Name2'] = np.nan
pokemon_csv

In [None]:
pokemon_csv.loc[pokemon_csv['Total'] > 400 , ['Name2' , 'Legendary']] = 'ALERT'
pokemon_csv.head(10)

In [None]:
pokemon_csv.loc[pokemon_csv['Total'] > 400 , ['Legendary' , 'Name2']] = ['ALERT-1' , 'ALERT-2']
pokemon_csv.head(10)

# 20.Important DataFrame methods         <a class="anchor" id="20"></a>

[Table of Contents](#0.1)

### items() :- iterates over each column as a (column name, Series) pair. It replaces iteritems(), which was removed in pandas 2.0

In [None]:
for key,value in unv_df.items():
    print(key,value)

### iterrows() :- iterates over each row as an (index, Series) pair

In [None]:
for key,value in unv_df.iterrows():
    print(key,value)
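
A short sketch of what each iterator yields, on a made-up DataFrame. `itertuples()` is included as a third option that is not covered above; it yields one named tuple per row.

```python
import pandas as pd

df_iter = pd.DataFrame({"country": ["USA", "UK"], "score": [100.0, 97.5]})

# items(): one (column name, Series) pair per column.
column_names = [name for name, column in df_iter.items()]

# iterrows(): one (index label, Series) pair per row.
row_labels = [index for index, row in df_iter.iterrows()]

# itertuples(): one named tuple per row, with columns as attributes.
fast_rows = [(row.country, row.score) for row in df_iter.itertuples(index=False)]

print(column_names, row_labels, fast_rows)
```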

# 21. GroupBy Method   <a class="anchor" id="21"></a>

[Table of Contents](#0.1)

In [None]:
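# Mean of the numeric columns per year
unv_df.groupby(['year']).mean(numeric_only=True).head(10)

In [None]:
# Yearly means sorted by quality_of_faculty, highest first
unv_df.groupby(['year']).mean(numeric_only=True).sort_values("quality_of_faculty",ascending=False).head(10)

In [None]:
# Sum of the numeric columns over the whole DataFrame
unv_df.sum(numeric_only=True)

In [None]:
# Sum of the numeric columns per year
unv_df.groupby(['year']).sum(numeric_only=True).head()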

In [None]:
unv_df.count()

In [None]:
unv_df['count1'] = 0
unv_df.groupby(['year']).count()['count1']

In [None]:
unv_df['count1'] = 0
unv_df.groupby(['year','country']).count()['count1']
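
The helper column `count1` above is one way to count rows per group. As a hedged alternative sketch on made-up data, `size()` counts rows directly and `agg()` with named aggregation computes several statistics at once.

```python
import pandas as pd

df_group = pd.DataFrame({
    "year": [2014, 2014, 2015, 2015, 2015],
    "country": ["USA", "UK", "USA", "USA", "UK"],
    "score": [90.0, 80.0, 95.0, 85.0, 70.0],
})

# Named aggregation: output column = (input column, aggregation function).
per_year = df_group.groupby("year").agg(
    mean_score=("score", "mean"),
    n_rows=("score", "size"),
)

# size() counts the rows in each (year, country) group without a helper column.
per_group = df_group.groupby(["year", "country"]).size()

print(per_year)
print(per_group)
```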

# 22.Stack & unstack in Pandas <a class="anchor" id="22"></a>

[Table of Contents](#0.1)

In [None]:
col = pd.MultiIndex.from_product([['2010','2015'],['Literacy' , 'GDP']])
data =([[80,7,88,6],[90,8,92,7],[89,7,91,8],[87,6,93,8]])
df6 = pd.DataFrame(data, index=['India','USA' , 'Russia' , 'China'], columns=col)
df6

In [None]:
# Stack() Function stacks the columns to rows.
st_df = df6.stack()
st_df

In [None]:
#Unstacks the row to columns
unst_df = st_df.unstack()
unst_df

In [None]:
unst_df = unst_df.unstack()
unst_df

In [None]:
unst_df = unst_df.unstack()
unst_df
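
A smaller round trip may make the reshaping easier to follow. In this sketch with made-up data, `stack()` turns the columns into an inner index level and `unstack()` turns that level back into columns.

```python
import pandas as pd

df_wide = pd.DataFrame({"Literacy": [80, 90], "GDP": [7, 8]}, index=["India", "USA"])

# stack(): columns become the innermost level of a MultiIndex on the rows.
long = df_wide.stack()

# unstack(): the innermost row level becomes columns again.
wide = long.unstack()

print(long)
print(wide)
```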

# 23. Pivot Table <a class="anchor" id="23"></a>

[Table of Contents](#0.1)

In [None]:
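# Pivot table with SUM aggregation over the numeric columns
pd.pivot_table(unv_df.select_dtypes(include="number") , index= ['year' , 'quality_of_faculty'] , aggfunc='sum')

In [None]:
# Pivot table with MEAN aggregation over the numeric columns
# (text columns cannot be averaged, so they are excluded first)
pd.pivot_table(unv_df.select_dtypes(include="number") , index= ['year' , 'quality_of_faculty'] , aggfunc='mean')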
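
A pivot table can also spread one column's values across the output columns. The sketch below uses made-up data and the `index`, `columns`, `values`, and `aggfunc` parameters of `pd.pivot_table()`.

```python
import pandas as pd

df_pivot = pd.DataFrame({
    "year": [2014, 2014, 2015, 2015],
    "country": ["USA", "UK", "USA", "UK"],
    "score": [90.0, 80.0, 95.0, 70.0],
})

# One row per year, one column per country, cells hold the mean score.
table = pd.pivot_table(df_pivot, index="year", columns="country",
                       values="score", aggfunc="mean")

print(table)
```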

# 24. Hierarchical indexing                          <a class="anchor" id="24"></a>

[Table of Contents](#0.1)

In [None]:
new_unv_df=unv_df.set_index(['year', 'quality_of_faculty'])
new_unv_df

In [None]:
new_unv_df.index

In [None]:
new_unv_df.loc[2015]

In [None]:
new_unv_df=unv_df.set_index(['year', 'quality_of_faculty','country'])
new_unv_df

In [None]:
# Swapping two levels of the hierarchical index
new_unv_df = new_unv_df.swaplevel('year', 'quality_of_faculty')
new_unv_df

# 25. Crosstab in Pandas <a class="anchor" id="25"></a>

[Table of Contents](#0.1)

In [None]:
pd.crosstab(unv_df['year'] , unv_df.score , margins=True)

In [None]:
# 2 way cross table
pd.crosstab(unv_df.score,unv_df['year'] , margins=True)
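
With a continuous column such as `score`, the cross table gets very wide. A sketch on made-up categorical data shows the structure more clearly: each cell counts the rows for one combination, and `margins=True` adds the `All` totals.

```python
import pandas as pd

df_cross = pd.DataFrame({
    "year": [2014, 2014, 2015, 2015, 2015],
    "country": ["USA", "UK", "USA", "USA", "UK"],
})

# Rows are years, columns are countries, cells are row counts.
counts = pd.crosstab(df_cross["year"], df_cross["country"], margins=True)

print(counts)
```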

# 26. Row and Column Bind  <a class="anchor" id="26"></a>

[Table of Contents](#0.1)

In [None]:
df1 = pd.DataFrame({'ID' :[1,2,3,4] , 'Name' :['Anand' , 'Ranjan' , 'Rathore' , 'Kumar'] , 'Score' : ['99','66','55','88']})
df1

In [None]:
df2 = pd.DataFrame({'ID' :[5,6,7,8] , 'Name' :['Saurav' , 'Sumit' , 'Vishal' , 'Abhishek'] , 'Score' : ['99','66','55','88']})
df2

In [None]:
# Row Bind with concat() function
pd.concat([df1 , df2])

In [None]:
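# Row Bind with renumbered rows
# (DataFrame.append() was removed in pandas 2.0; concat() with ignore_index=True replaces it)
pd.concat([df1, df2], ignore_index=True)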

In [None]:
#Column Bind
pd.concat([df1,df2] , axis = 1)
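
Two details are worth checking after a bind: the row labels after a row bind, and duplicate column names after a column bind. The sketch below uses made-up tables; `ignore_index=True` renumbers the rows and `add_suffix()` keeps the column names unique.

```python
import pandas as pd

top = pd.DataFrame({"ID": [1, 2], "Name": ["Anand", "Ranjan"]})
bottom = pd.DataFrame({"ID": [3, 4], "Name": ["Saurav", "Sumit"]})

# Row bind: stack the rows and renumber the index from 0.
stacked = pd.concat([top, bottom], ignore_index=True)

# Column bind: place the tables side by side, renaming the second table's columns.
side_by_side = pd.concat([top, bottom.add_suffix("_2")], axis=1)

print(stacked)
print(side_by_side)
```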

# 27. Data Visualization   <a class="anchor" id="27"></a>

[Table of Contents](#0.1)

## Plotting graph with plot() method


In [None]:
data=pd.Series(np.random.randn(1000).cumsum())


In [None]:
data.plot()


In [None]:
data1 = pd.DataFrame(np.random.randn(100, 4),columns=list('ABCD'))

In [None]:
data1 = data1.cumsum()


In [None]:
data1.plot()


## Bar Plot

In [None]:
data1.iloc[10].plot(kind='bar')


In [None]:
data1.iloc[10].plot.bar()


In [None]:
df = pd.DataFrame(np.random.rand(10,4),columns=['a','b','c','d'])
df.plot.bar()

In [None]:
#Stacked Bar plot 

df = pd.DataFrame(np.random.rand(10,4),columns=['a','b','c','d'])
df.plot.bar(stacked=True)

In [None]:
#horizontal bar plot 


df = pd.DataFrame(np.random.rand(10,4),columns=['a','b','c','d'])

df.plot.barh(stacked=True)

## Histograms 

In [None]:
iris_df=pd.read_csv("/kaggle/input/iris/Iris.csv")

In [None]:
iris_df.head()

In [None]:
iris_df.plot.hist(alpha=0.7)

In [None]:
iris_df.plot.hist(alpha=1, stacked=True)


In [None]:
# Stacked histogram with 25 bins
bins = 25
iris_df.plot.hist(alpha=1, stacked=True, bins=bins)


In [None]:
iris_df["SepalWidthCm"].plot.hist(orientation="horizontal")


In [None]:
iris_df["SepalWidthCm"].diff().hist()


In [None]:
iris_df.hist(color="blue", alpha=1, bins=20)


In [None]:
iris_df.hist("PetalLengthCm",by="Species")


In [None]:
df = pd.DataFrame({'a': np.random.randn(1000) + 1,
                   'b': np.random.randn(1000),
                   'c': np.random.randn(1000) - 1}, columns=['a', 'b', 'c'])

df.plot.hist(bins=20)

## Box Plots 

In [None]:
iris_df.plot.box()


In [None]:
colors={'boxes': 'Red', 'whiskers': 'blue','medians': 'Black', 'caps': 'Green'}
iris_df.plot.box(color=colors)


In [None]:
iris_df.plot.box(vert=False)


In [None]:
iris_df.boxplot()


In [None]:
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"]=(8,8)
plt.style.use("ggplot")
iris_df.boxplot(by='Species')

In [None]:
df = pd.DataFrame(unv_df, columns=['alumni_employment','quality_of_faculty','publications','influence'])
df.plot.box()

## Area Chart

In [None]:
df = pd.DataFrame(np.random.rand(10, 4), columns=list("ABCD"))
df.head()

In [None]:
df["A"].plot.area()


In [None]:
df.plot.area()


In [None]:
df.plot.area(stacked=False)


In [None]:
iris_df.plot.area()


In [None]:
iris_df.plot.area(stacked=False)


## Scatter Plot 

In [None]:
df.plot.scatter(x='A', y='B')


In [None]:
iris_df.columns

In [None]:
ax=iris_df.plot.scatter(x='SepalLengthCm', y='SepalWidthCm', 
                     color='Blue', label='sepal')
iris_df.plot.scatter(x='PetalLengthCm', y='PetalWidthCm', color='red', 
                  label='petal', ax=ax)

In [None]:
iris_df.plot.scatter(x='SepalLengthCm', y='SepalWidthCm', 
                  c='PetalLengthCm', s=100)

In [None]:
iris_df.plot.scatter(x='SepalLengthCm', y='SepalWidthCm', 
                  s=iris_df['PetalLengthCm'] * 50)

## Hexagonal bin charts

In [None]:
iris_df.plot.hexbin(x="SepalLengthCm", y="SepalWidthCm", gridsize=25)


In [None]:
iris_df.plot.hexbin(x="SepalLengthCm", y="SepalWidthCm", gridsize=10)


## Pie Charts

In [None]:

df = pd.DataFrame(3 * np.random.rand(4), index=['a', 'b', 'c', 'd'], columns=['x'])
df.plot.pie(subplots=True)

In [None]:

iris_avg=iris_df["PetalWidthCm"].groupby(iris_df["Species"]).mean()
iris_avg

In [None]:
iris_avg.plot.pie()


In [None]:
iris_avg_2=iris_df[["PetalWidthCm", 
                 "PetalLengthCm"]].groupby(iris_df["Species"]).mean()

In [None]:
iris_avg_2.plot.pie(subplots=True)


In [None]:
iris_avg.plot.pie()

In [None]:
iris_avg.plot.pie(labels=["setosa","versicolor", "virginica"], 
                  colors=list("brg"), fontsize=25, figsize=(10,10))

In [None]:
iris_avg.plot.pie(labels=["setosa","versicolor", "virginica"], colors=list("brg"), 
           autopct='%.2f', fontsize=25, figsize=(10,10))

## Density Chart

In [None]:
iris_df.plot.kde()


## Scatter Matrix 

In [None]:
from pandas.plotting import scatter_matrix
scatter_matrix(iris_df, alpha=0.5, diagonal='kde')