<a href="https://colab.research.google.com/github/Gokul7120/Python-/blob/main/workbook_Pandas_Intro_main.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction
pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis/manipulation tool available in any language.
pandas is well suited for many different kinds of data:

* Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet

* Ordered and unordered (not necessarily fixed-frequency) time series data.

* Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels

* Any other form of observational / statistical data sets. The data need not be labeled at all to be placed into a pandas data structure


The two primary data structures of pandas handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering:
* **Series (1-dimensional)**
* **DataFrame (2-dimensional)**

pandas is built on top of NumPy and is intended to integrate well within a scientific computing environment with many other 3rd party libraries.

Here are just a few of the things that pandas does well:

* Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data

* Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects

* Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let Series, DataFrame, etc. automatically align the data for you in computations

* Powerful, flexible group by functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data

* Make it easy to convert ragged, differently-indexed data in other Python and NumPy data structures into DataFrame objects

* Intelligent label-based slicing, fancy indexing, and subsetting of large data sets

* Intuitive merging and joining data sets

* Flexible reshaping and pivoting of data sets

* Hierarchical labeling of axes (possible to have multiple labels per tick)

* Robust IO tools for loading data from flat files (CSV and delimited), Excel files, databases, and saving / loading data from the ultrafast HDF5 format

* Time series-specific functionality: date range generation and frequency conversion, moving window statistics, date shifting, and lagging.
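
As a rough illustration of two items from the list above (automatic label alignment and missing-data handling), the sketch below adds two Series whose indexes only partly overlap. The names `q1` and `q2` and their values are made up for this example.

```python
import pandas as pd

# Two Series with partly overlapping labels (hypothetical values).
q1 = pd.Series({"apples": 10, "mangoes": 5, "grapes": 7})
q2 = pd.Series({"mangoes": 3, "grapes": 1, "bananas": 8})

# Addition aligns on labels; a label present on only one side gives NaN.
total = q1 + q2

# fill_value=0 treats a label missing from one side as zero instead.
total_filled = q1.add(q2, fill_value=0)

print(total)
print(total_filled)
```

Neither Series had to be reordered or padded by hand; pandas matched the values by label.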

Many of these principles are here to address the shortcomings frequently experienced using other languages / scientific research environments. For data scientists, working with data is typically divided into multiple stages: munging and cleaning data, analyzing / modeling it, then organizing the results of the analysis into a form suitable for plotting or tabular display. pandas is the ideal tool for all of these tasks.

**[Reference: pandas Official website: click here](https://pandas.pydata.org/docs/getting_started/index.html)**



In [None]:
# Excel
    # -> quick look
    # -> Data transformation
    # -> Data Visualization & Dashboarding
    # -> Data Analysis
    # -> Data Validation & Management

# Power BI
  # -> Visualization tool
  # -> Dashboarding

# SQL
  # -> Query language
  # -> interact with databases
  # -> retrieval
  # -> insert, update , delete

# Python
  # -> Programming language
  # -> Machine Learning
  # -> automation purposes

# Data -> Structured Data (Tabular data) -> Pandas

In [None]:
! pip install pandas    # install pandas library



In [None]:
import pandas as pd
     # import pandas library in notebook

In [None]:
pd.__version__            # check the version of the pandas

'2.2.2'

## Creation of Pandas Series

A pandas Series can be created from a list, tuple, dictionary, or NumPy array. To create a Series, use `pandas.Series()`.



In [None]:
# Series : pd.Series(data)
import numpy as np

data = ["Apple", "Mango", "Grapes", "banana"]

arr = np.array(data)
print(arr)

['Apple' 'Mango' 'Grapes' 'banana']


In [None]:
# list/tuple/dict/numpy array

# create a series using list
data = ["Apple", "Mango", "Grapes", "banana"]

s1 = pd.Series(data)
s1

Unnamed: 0,0
0,Apple
1,Mango
2,Grapes
3,banana


In [None]:
print(type(s1))

<class 'pandas.core.series.Series'>


In [None]:
fruits_series = pd.Series(data , index = ['f1','f2','f3', 'f4'])
fruits_series

Unnamed: 0,0
f1,Apple
f2,Mango
f3,Grapes
f4,banana


In [None]:
# data : list/tuple/np.array/dict/ set?
pd.Series(data)

In [None]:
# Series created using a dict
population_dict = {
    'California':38956785,
    'Texas':26441568,
    'New York':19555647,
    'Florida':12364569,
    'Illinois':12882135
}

pd.Series(population_dict)

Unnamed: 0,0
California,38956785
Texas,26441568
New York,19555647
Florida,12364569
Illinois,12882135


In [None]:
fruits_series

Unnamed: 0,0
f1,Apple
f2,Mango
f3,Grapes
f4,banana


In [None]:
s1

Unnamed: 0,0
0,Apple
1,Mango
2,Grapes
3,banana


In [None]:
# access an element
s1[0], s1[2]

('Apple', 'Grapes')

In [None]:
fruits_series

Unnamed: 0,0
f1,Apple
f2,Mango
f3,Grapes
f4,banana


In [None]:
fruits_series['f4']

'banana'

In [None]:
["f"+str(i) for i in range(1,len(data)+1)]

['f1', 'f2', 'f3', 'f4']

In [None]:
# providing custom index
data = ("Apple", "Mango", "Grapes", "banana")
s2 = pd.Series(data, index = ["p"+str(i) for i in range(1,len(data)+1)])
s2

Unnamed: 0,0
p1,Apple
p2,Mango
p3,Grapes
p4,banana


In [None]:
# Display all the indexes
s2.index

Index(['p1', 'p2', 'p3', 'p4'], dtype='object')

In [None]:
s2.values

array(['Apple', 'Mango', 'Grapes', 'banana'], dtype=object)

In [None]:
# access an element
s2['p3']

'Grapes'

In [None]:
s2

Unnamed: 0,0
p1,Apple
p2,Mango
p3,Grapes
p4,banana


In [None]:
s2["p2":"p3"]

Unnamed: 0,0
p2,Mango
p3,Grapes


In [None]:
s1 = pd.Series(data, index = [100, 101, 102, 103])
s1

Unnamed: 0,0
100,Apple
101,Mango
102,Grapes
103,banana


In [None]:
# providing custom index


In [None]:
# pandas series using set : Not allowed
fruits = {'Apple', 'banana', 'Mango', 'Grapes'}
pd.Series(fruits)

TypeError: 'set' type is unordered
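
Because a set has no defined order, pandas refuses to build a Series from it directly. One possible workaround, sketched here, is to convert the set to a list first. Sorting is a choice made for this example so that the result is reproducible; `fruit_series` is a name introduced only for illustration.

```python
import pandas as pd

fruits = {"Apple", "banana", "Mango", "Grapes"}

# sorted() returns a list, which has a well-defined order.
fruit_series = pd.Series(sorted(fruits))
print(fruit_series)
```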

In [None]:
population_dict = {
    'California':38956785,
    'Texas':26441568,
    'New York':19555647,
    'Florida':12364569,
    'Illinois':12882135
}

population_dict

{'California': 38956785,
 'Texas': 26441568,
 'New York': 19555647,
 'Florida': 12364569,
 'Illinois': 12882135}

In [None]:
type(population_dict)

dict

In [None]:
# create a pandas series using dict
population  = pd.Series(population_dict)
population

Unnamed: 0,0
California,38956785
Texas,26441568
New York,19555647
Florida,12364569
Illinois,12882135


In [None]:
# row labels / index

In [None]:
population

Unnamed: 0,0
California,38956785
Texas,26441568
New York,19555647
Florida,12364569
Illinois,12882135


In [None]:
# print the index of that series # population.index
population.index

Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')

In [None]:
## 2nd way to change the indexes
population.index = ['California', 'XYZ', 'New York', 'Florida', 'Illinois']
population

Unnamed: 0,0
California,38956785
XYZ,26441568
New York,19555647
Florida,12364569
Illinois,12882135


In [None]:
# fetching an element using its index label
population['XYZ']

26441568

In [None]:
# fetching the same element by position
# (use .iloc; plain population[1] is deprecated on a label-indexed Series and raises a FutureWarning)
population.iloc[1]

26441568

In [None]:
population

Unnamed: 0,0
California,38956785
XYZ,26441568
New York,19555647
Florida,12364569
Illinois,12882135


In [None]:
# Slicing in Series : using custom indexes
population['XYZ':'Florida']

Unnamed: 0,0
XYZ,26441568
New York,19555647
Florida,12364569


In [None]:
# Slicing in Series : using zero based indexing
population[1:4]

Unnamed: 0,0
XYZ,26441568
New York,19555647
Florida,12364569
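
The two slices above return the same rows for different reasons: a label slice includes its end label, while a positional slice excludes its end position. A minimal sketch using the explicit `.loc` and `.iloc` accessors, which make the intent unambiguous (the Series is rebuilt here so that the block runs on its own):

```python
import pandas as pd

population = pd.Series(
    {"California": 38956785, "XYZ": 26441568, "New York": 19555647,
     "Florida": 12364569, "Illinois": 12882135}
)

# .loc slices by label and INCLUDES the end label 'Florida'.
by_label = population.loc["XYZ":"Florida"]

# .iloc slices by position and EXCLUDES the end position 4.
by_position = population.iloc[1:4]

print(by_label.equals(by_position))
```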


In [None]:
# creating series with list

# when no index is given, pandas assigns a default RangeIndex: 0, 1, ..., len(list) - 1

In [None]:
# scalar e.g. a = 5
# 1d - vector
# 2d - matrix
# >=3d - tensors

In [None]:
# creating a Series from a scalar value
pd.Series(6)

Unnamed: 0,0
0,6


In [None]:
pd.Series( 5 , index=[100,200,300, 400] )
# the index decides the length: the scalar is repeated for every label

Unnamed: 0,0
100,5
200,5
300,5
400,5


In [None]:
pd.Series( [5,2] , index=[100,200,300] )

# the index decides the length: if the number of values does not match the length of the index, pandas raises a ValueError

ValueError: Length of values (2) does not match length of index (3)

## Type and Shape of the series

To check the type of the Series use `type()` and to check the shape of the series use `.shape`.

In [None]:
# Check the type of the series
s = pd.Series([2,4,565,34])
type(s)
# s

In [None]:
s

Unnamed: 0,0
0,2
1,4
2,565
3,34


In [None]:
# shape of the series
s.shape

(4,)

# Data Frame




In [None]:
# Data Frame -( 2D data structure) / Table

# create a DataFrame in several different ways
# data in csv/xlsx/pkl/parquet/tsv files => import data in python using pandas dataframe
# Database schemas => Load that in python environment

## Creating Pandas DataFrame

A pandas DataFrame can be created using pandas Series, dictionary, list, tuple and numpy array.

To create pandas DataFrame use `pd.DataFrame()`

In [None]:
pd.DataFrame()

In [None]:
# pd.Series(data) # list, tuple, numpy array, dict

# pd.DataFrame() # list, tuple, numpy array, dict, Series,

In [None]:
# From a dictionary of lists or arrays.   e.g. []
# From a list of dictionaries.
# From a NumPy array.
# From a CSV file.
# From a SQL database.
# From a dictionary of Series.
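
Two of the options listed above, a dictionary of lists and a dictionary of Series, can be sketched as follows. The `area` figures are placeholder values chosen for illustration, not data from this notebook.

```python
import pandas as pd

# From a dictionary of lists: each key becomes a column.
df_lists = pd.DataFrame({"city": ["Delhi", "Mumbai"], "data": [1000, 2000]})

# From a dictionary of Series: rows are aligned on the Series indexes,
# and a label missing from one Series becomes NaN in that column.
area = pd.Series({"California": 423967, "Texas": 695662})  # placeholder values
population = pd.Series({"California": 38956785, "Texas": 26441568, "Florida": 12364569})
df_series = pd.DataFrame({"area": area, "population": population})

print(df_lists)
print(df_series)
```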

In [None]:
# pd.DataFrame()
arr = np.random.randint(10,20,(10,4))
arr

array([[11, 14, 12, 16],
       [16, 13, 15, 13],
       [14, 10, 11, 11],
       [13, 12, 13, 14],
       [15, 16, 12, 18],
       [17, 18, 11, 15],
       [15, 10, 16, 12],
       [10, 15, 19, 19],
       [19, 13, 12, 18],
       [16, 18, 16, 14]])

In [None]:
df = pd.DataFrame(arr)
df

Unnamed: 0,0,1,2,3
0,11,14,12,16
1,16,13,15,13
2,14,10,11,11
3,13,12,13,14
4,15,16,12,18
5,17,18,11,15
6,15,10,16,12
7,10,15,19,19
8,19,13,12,18
9,16,18,16,14


In [None]:
# Create a df with custom index & Column labels
df = pd.DataFrame(arr,
                  index = ["r"+str(i) for i in range(1,11)],
                  columns = ["c"+str(i) for i in range(1,5)])
df

Unnamed: 0,c1,c2,c3,c4
r1,11,14,12,16
r2,16,13,15,13
r3,14,10,11,11
r4,13,12,13,14
r5,15,16,12,18
r6,17,18,11,15
r7,15,10,16,12
r8,10,15,19,19
r9,19,13,12,18
r10,16,18,16,14


In [None]:
np.full((3,4), 40)

array([[40, 40, 40, 40],
       [40, 40, 40, 40],
       [40, 40, 40, 40]])

In [None]:
df1 = pd.DataFrame(np.full((3,4), 40),
                  index = [100,101,102],
                  columns = ['col1','col2','col3', 'col4']
                  )
df1

Unnamed: 0,col1,col2,col3,col4
100,40,40,40,40
101,40,40,40,40
102,40,40,40,40


In [None]:
# Fetch index
df1.index

Index([100, 101, 102], dtype='int64')

In [None]:
# Fetch columns
df1.columns

Index(['col1', 'col2', 'col3', 'col4'], dtype='object')

In [None]:
# Another way to rename the column & row labels
df1.index = ["i1","i2", "i3"]
df1.columns = ['col1', 'c2', 'col3', 'col4']
df1

Unnamed: 0,col1,c2,col3,col4
i1,40,40,40,40
i2,40,40,40,40
i3,40,40,40,40


In [None]:
# 3rd way to change a col name
df1.rename(columns = {"c2":"Col2"}, inplace = True)

In [None]:
df1

Unnamed: 0,col1,Col2,col3,col4
i1,40,40,40,40
i2,40,40,40,40
i3,40,40,40,40
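
`rename` can also return a new DataFrame and leave the original untouched, which is what happens when `inplace` is not set. A small sketch that mirrors the shape of `df1` above; the row label `'first'` is made up for illustration:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.full((3, 4), 40),
                   index=["i1", "i2", "i3"],
                   columns=["col1", "c2", "col3", "col4"])

# Without inplace=True, rename returns a renamed copy.
renamed = df1.rename(columns={"c2": "Col2"}, index={"i1": "first"})

print(list(df1.columns))      # original labels are unchanged
print(list(renamed.columns))
```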


In [None]:
population_dict = {
    'California':38956785,
    'Texas':26441568,
    'New York':19555647,
    'Florida':12364569,
    'Illinois':12882135
}

population=pd.Series(population_dict)

In [None]:
population # row labels are the dict keys, not the default zero-based index

Unnamed: 0,0
California,38956785
Texas,26441568
New York,19555647
Florida,12364569
Illinois,12882135


In [None]:
# create a dataframe using series
df = pd.DataFrame(population)
df

Unnamed: 0,0
California,38956785
Texas,26441568
New York,19555647
Florida,12364569
Illinois,12882135


In [None]:
df

Unnamed: 0,0
California,38956785
Texas,26441568
New York,19555647
Florida,12364569
Illinois,12882135


In [None]:
# row labels - index
print(df.index)
print(list(df.index))

Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')
['California', 'Texas', 'New York', 'Florida', 'Illinois']


In [None]:
df1.rename(columns = {'c3':'col3', 'c4': 'col4'}, inplace = True)  # df1 has no 'c3' or 'c4' column, so nothing changes

In [None]:
df1

Unnamed: 0,col1,Col2,col3,col4
i1,40,40,40,40
i2,40,40,40,40
i3,40,40,40,40


In [None]:
# Create a table of shape (3, 4) with all zeros with custom row & col labels

df1 = pd.DataFrame(np.zeros((3,4), dtype = int), index = [1,2,3], columns = ['col1', 'col2', 'col3', 'col4'])
df1

Unnamed: 0,col1,col2,col3,col4
1,0,0,0,0
2,0,0,0,0
3,0,0,0,0


In [None]:
# Create a table of shape (3,3) with all 1,  along with custom row & col labels
pd.DataFrame(np.ones((3,3), dtype = np.int16),
             index = [100,101,102],
             columns = ['c1','c2','c3'])

Unnamed: 0,c1,c2,c3
100,1,1,1
101,1,1,1
102,1,1,1


In [None]:
# view the row labels / index
df1.index

Index([1, 2, 3], dtype='int64')

In [None]:
# view the column labels
df1.columns

Index(['col1', 'col2', 'col3', 'col4'], dtype='object')

In [None]:
 # DataFrame shape will give you the information similar to 2D numpy array with rows and columns


In [None]:
# pd.Series(dict, index)
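
The `pd.Series(dict, index)` form noted above can be sketched as follows. When both a dict and an index are passed, pandas keeps only the requested labels, in the requested order, and fills labels that are missing from the dict with NaN. `'Ohio'` is a made-up label used to show that case.

```python
import pandas as pd

population_dict = {"California": 38956785, "Texas": 26441568, "New York": 19555647}

# 'New York' is dropped because it is not requested; 'Ohio' has no value.
selected = pd.Series(population_dict, index=["Texas", "California", "Ohio"])
print(selected)
```

The result is stored as float64 because NaN is a floating point value.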

In [None]:
# Creating the pandas DataFrame with list of the dictionries
data = [ {'a':i, 'b':2*i} for i in range(3) ]
data

[{'a': 0, 'b': 0}, {'a': 1, 'b': 2}, {'a': 2, 'b': 4}]

In [None]:
# list of dictionaries
pd.DataFrame(data)

Unnamed: 0,a,b
0,0,0
1,1,2
2,2,4


In [None]:
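df.columns   # df was built from the population Series, so it has a single default column

RangeIndex(start=0, stop=1, step=1)

In [None]:
df.index

Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')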

In [None]:
# List of dictionaries
data = [ {'a':i, 'b':2*i} for i in range(3) ]
data

[{'a': 0, 'b': 0}, {'a': 1, 'b': 2}, {'a': 2, 'b': 4}]

In [None]:
# a customized index can be set up for a smaller DataFrame
df = pd.DataFrame(data, index = ['X', 'Y', 'Z'])
df

Unnamed: 0,a,b
X,0,0
Y,1,2
Z,2,4


In [None]:
# to check the name of the columns use df.columns


In [None]:
# to change the names of the columns use 'df.columns=list of the new column names'
df.columns=['A','B']
df

Unnamed: 0,A,B
X,0,0
Y,1,2
Z,2,4


In [None]:
# to change the index after creating the DataFrame use 'df.index=list of the new index values'
df.index=['i','ii','iii']
df

Unnamed: 0,A,B
i,0,0
ii,1,2
iii,2,4


In [None]:
df.columns=['column1','column2']
df

Unnamed: 0,column1,column2
i,0,0
ii,1,2
iii,2,4


In [None]:
print(df.columns)

Index(['column1', 'column2'], dtype='object')


In [None]:
# list of column names
list(df.columns)

['column1', 'column2']

In [None]:
df.columns[0]

'column1'

In [None]:
print(df.columns[0])  # to get the name of a column by position use df.columns[position of the column]

column1


In [None]:
print(list(df.columns))     # to get the list of the column names

['column1', 'column2']


In [None]:
df.index   # to check the index values

Index(['i', 'ii', 'iii'], dtype='object')

In [None]:
print(list(df.index))    # to get the list of the index values

['i', 'ii', 'iii']


In [None]:
df

Unnamed: 0,column1,column2
i,0,0
ii,1,2
iii,2,4


In [None]:
data

[{'a': 0, 'b': 0}, {'a': 1, 'b': 2}, {'a': 2, 'b': 4}]

In [None]:
d =  [ {'a':1,'b':2} , {'b':3,'c':4} ]
d

[{'a': 1, 'b': 2}, {'b': 3, 'c': 4}]

In [None]:
# unique keys across all dicts : a,b,c
pd.DataFrame(d)

Unnamed: 0,a,b,c
0,1.0,2,
1,,3,4.0


In [None]:
# if a DataFrame is created from a list of dictionaries whose keys do not all match, the missing entries are filled with NaN

# pd.DataFrame( [ {'a':1,'b':2} , {'b':3,'c':4} ] )
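
This also explains why columns `a` and `c` are shown as `1.0` and `4.0` in the output above: NaN is a floating point value, so a column that contains it is stored as float. A short sketch; replacing the gaps with 0 is only one possible choice and is an assumption of this example:

```python
import pandas as pd

d = [{"a": 1, "b": 2}, {"b": 3, "c": 4}]
df_d = pd.DataFrame(d)

# 'a' and 'c' contain NaN and become float64; 'b' is complete and stays int64.
print(df_d.dtypes)

# Fill the gaps with 0 and convert back to integers.
filled = df_d.fillna(0).astype(int)
print(filled)
```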

In [None]:
# list of dictionaries

In [None]:
# To create a DataFrame from a numpy array
import numpy as np
arr = np.random.rand(3, 2)   # 3 x 2 array of random floats in [0, 1)
pd.DataFrame(arr, columns = ["Col1", "Col2"], index = [1,2,3])

Unnamed: 0,Col1,Col2
1,0.213671,0.376168
2,0.125871,0.268332
3,0.029248,0.925053


In [None]:
# create a df with 4 student records (name, age, roll, address).
# l = [["Aman", 24, 32, "Delhi"],
#      ["X", 24, 32, "Delhi"],
#      ["Y", 24, 32, "Delhi"],
#      ["Z", 24, 32, "Delhi"]
#       ]
l = [{'name':"Aman", 'Age':24, "Roll": 25, "Address": "Delhi"},
     {'name':"Aman", 'Age':24, "Roll": 25, "Address": "Delhi"},
     {'name':"Aman", 'Age':24, "Roll": 25, "Address": "Delhi"},
     {'name':"Aman", 'Age':24, "Roll": 25, "Address": "Delhi"}]

pd.DataFrame(l)

Unnamed: 0,name,Age,Roll,Address
0,Aman,24,25,Delhi
1,Aman,24,25,Delhi
2,Aman,24,25,Delhi
3,Aman,24,25,Delhi


In [None]:
# create a df with 4 student records (name, age, roll, address).
l = [["Aman", 24, 32, "Delhi"],
     ["X", 24, 32, "Delhi"],
     ["Y", 24, 32, "Delhi"],
     ["Z", 24, 32, "Delhi"]
      ]
pd.DataFrame(l, columns = ['name', 'age', 'roll', 'address'])

Unnamed: 0,name,age,roll,address
0,Aman,24,32,Delhi
1,X,24,32,Delhi
2,Y,24,32,Delhi
3,Z,24,32,Delhi


In [None]:
l = [["Ram",2], ["Shyam", 3]]
pd.DataFrame(l)

Unnamed: 0,0,1
0,Ram,2
1,Shyam,3


In [None]:
# List of Dictionaries

d = [ {'city':'Delhi','data':1000}, {'city':'Mumbai','data':2000},
     {'city':'Bangalore','data':1500} ]
pd.DataFrame(d)

Unnamed: 0,city,data
0,Delhi,1000
1,Mumbai,2000
2,Bangalore,1500


## Importing Data to create DataFrame

Data can be imported from CSV or Excel files to create a pandas DataFrame. Data can also be imported from an RDBMS such as MySQL.

To create DataFrame from csv file use following syntax:

```pandas.read_csv(filepath_or_buffer, *, sep=<no_default>, header='infer', names=<no_default>, index_col=None, skipinitialspace=False, skiprows=None, skipfooter=0, na_values=None, ...)```

Not all parameters are mentioned in the syntax. If the data is already well arranged and clean, just use:

``` pandas.read_csv(file_path)```


[For complete syntax click here](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)
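
A few of the parameters above can be tried without a file on disk by reading from an in-memory string. This is only a sketch: the CSV text, the column names and the `"missing"` token are invented for the example.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk.
csv_text = """# exported from some system
Henry,A,90
Judy,missing,25
"""

df_csv = pd.read_csv(
    io.StringIO(csv_text),
    skiprows=1,                        # skip the leading comment line
    names=["Name", "Grade", "Score"],  # the data has no header row
    na_values=["missing"],             # treat this token as NaN
    index_col="Name",                  # use Name as the row labels
)
print(df_csv)
```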

In [None]:
import pandas as pd

In [None]:
pd.read_csv('student_records.csv')

FileNotFoundError: [Errno 2] No such file or directory: 'student_records.csv'

In [None]:
# connect Google Drive to Colab
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# /content/drive/MyDrive/student_records.csv
path = '/content/drive/MyDrive/student_records.csv'
df = pd.read_csv(path)
df

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend
0,Henry,A,Y,90,85.0,Yes
1,John,C,N,85,51.0,Yes
2,David,F,N,10,17.0,No
3,Holmes,B,Y,75,71.0,No
4,Marvin,E,N,20,30.0,No
5,Simon,A,Y,92,79.0,Yes
6,Robert,B,Y,60,59.0,No
7,Trent,C,Y,75,33.0,No
8,Judy,,Y,25,,No
9,Chris,D,U,25,15.0,NO


In [None]:
# You can read the file in a Jupyter notebook from your local machine
# This works only if your notebook and the file are in the same folder
df1 = pd.read_csv('student_records1.csv')


## Data Exploration
Data exploration helps you find information about the data. In pandas, it can be done in different ways.

## Viewing/Inspecting Data

Use these commands to take a look at specific sections of your pandas DataFrame or Series.

| Command | Description |
|---|---|
| `df.head(n)` | First n rows of the DataFrame. By default, it returns the first 5 rows. |
| `df.tail(n)` | Last n rows of the DataFrame. By default, it returns the last 5 rows. |
| `df.shape` | Number of rows and columns. |
| `df.size` | Number of elements in this object: the number of rows for a Series, or the number of rows times the number of columns for a DataFrame. |
| `df.info()` | Index, datatype and memory information. |
| `df.describe()` | Summary statistics for numerical columns. |
| `df.value_counts(dropna=False)` | View unique values and counts. |
| `df.apply(pd.Series.value_counts)` | Unique values and counts for all columns. |
| `df.ndim` | Number of dimensions: 1 for a Series, 2 for a DataFrame. |
| `df.sample()` | Randomly selects rows (or columns) from a Series or DataFrame. It is useful when we want to select a random sample from a distribution. |
| `df.isna()` or `df.isnull()` | Returns a DataFrame of boolean values, with True indicating missing values. |
| `df.isnull().sum()` | Number of missing values in each column. |
| `df.dropna()` | Removes the rows (or columns) that contain NaN or missing values. |
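
Several of these commands can be tried on a tiny frame before running them on real data. The sketch below uses three rows modelled on the student records used later in this notebook; `df_demo` and the other variable names are introduced only for illustration.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    "Name": ["Henry", "Judy", "Chris"],
    "OverallGrade": ["A", np.nan, "D"],
    "ProjectScore": [85.0, np.nan, 15.0],
})

print(df_demo.shape, df_demo.size, df_demo.ndim)

# Missing values per column.
missing_per_column = df_demo.isna().sum()

# Keep only the rows without any missing value.
complete_rows = df_demo.dropna()

# random_state makes the random sample repeatable.
reproducible_sample = df_demo.sample(n=2, random_state=0)

print(missing_per_column)
print(complete_rows)
```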





### Basic Data Exploration

In [None]:
df

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend
0,Henry,A,Y,90,85.0,Yes
1,John,C,N,85,51.0,Yes
2,David,F,N,10,17.0,No
3,Holmes,B,Y,75,71.0,No
4,Marvin,E,N,20,30.0,No
5,Simon,A,Y,92,79.0,Yes
6,Robert,B,Y,60,59.0,No
7,Trent,C,Y,75,33.0,No
8,Judy,,Y,25,,No
9,Chris,D,U,25,15.0,NO


In [None]:
# see top n records from the table.
df.head(2)

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend
0,Henry,A,Y,90,85.0,Yes
1,John,C,N,85,51.0,Yes


In [None]:
df.head()  # by default n = 5

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend
0,Henry,A,Y,90,85.0,Yes
1,John,C,N,85,51.0,Yes
2,David,F,N,10,17.0,No
3,Holmes,B,Y,75,71.0,No
4,Marvin,E,N,20,30.0,No


In [None]:
# bottom n records
# 10 records
df.tail(3)

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend
7,Trent,C,Y,75,33.0,No
8,Judy,,Y,25,,No
9,Chris,D,U,25,15.0,NO


In [None]:
df.tail()

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend
5,Simon,A,Y,92,79.0,Yes
6,Robert,B,Y,60,59.0,No
7,Trent,C,Y,75,33.0,No
8,Judy,,Y,25,,No
9,Chris,D,U,25,15.0,NO


In [None]:
# Get a sample record
df.sample()

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend
7,Trent,C,Y,75,33.0,No


In [None]:
df.sample(2)

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend
4,Marvin,E,N,20,30.0,No
0,Henry,A,Y,90,85.0,Yes


In [None]:
# Get the shape of the df : rows x cols
df.shape

(10, 6)

In [None]:
# no of cols
df.shape[1]

6

In [None]:
df.size

60

In [None]:
df

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend
0,Henry,A,Y,90,85.0,Yes
1,John,C,N,85,51.0,Yes
2,David,F,N,10,17.0,No
3,Holmes,B,Y,75,71.0,No
4,Marvin,E,N,20,30.0,No
5,Simon,A,Y,92,79.0,Yes
6,Robert,B,Y,60,59.0,No
7,Trent,C,Y,75,33.0,No
8,Judy,,Y,25,,No
9,Chris,D,U,25,15.0,NO


In [None]:
# Information : Index, Datatype and Memory information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Name           10 non-null     object 
 1   OverallGrade   9 non-null      object 
 2   Obedient       10 non-null     object 
 3   ResearchScore  10 non-null     int64  
 4   ProjectScore   9 non-null      float64
 5   Recommend      10 non-null     object 
dtypes: float64(1), int64(1), object(4)
memory usage: 612.0+ bytes


### Statistical Summary of Pandas DataFrame

In [None]:
# Summary of the numeric columns
df.describe()   # descriptive statistics

Unnamed: 0,ResearchScore,ProjectScore
count,10.0,9.0
mean,55.7,48.888889
std,32.256093,26.412329
min,10.0,15.0
25%,25.0,30.0
50%,67.5,51.0
75%,82.5,71.0
max,92.0,85.0


In [None]:
# transpose the describe DataFrame
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
ResearchScore,10.0,55.7,32.256093,10.0,25.0,67.5,82.5,92.0
ProjectScore,9.0,48.888889,26.412329,15.0,30.0,51.0,71.0,85.0


In [None]:
# include all columns
df.describe(include = 'all')

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend
count,10,9,10,10.0,9.0,10
unique,10,6,3,,,3
top,Henry,A,Y,,,No
freq,1,2,6,,,6
mean,,,,55.7,48.888889,
std,,,,32.256093,26.412329,
min,,,,10.0,15.0,
25%,,,,25.0,30.0,
50%,,,,67.5,51.0,
75%,,,,82.5,71.0,
max,,,,92.0,85.0,


The DataFrame above reports the number of non-null records in each column as `count`, along with the mean, min, max, and standard deviation (std). The 25%, 50%, and 75% rows are quantiles of the data; the 50% value is the median.
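
The relationship between `describe()` and the individual statistics can be checked directly. A minimal sketch that re-enters the ten `ResearchScore` values from the table so that it runs on its own:

```python
import pandas as pd

scores = pd.Series([90, 85, 10, 75, 20, 92, 60, 75, 25, 25], name="ResearchScore")

summary = scores.describe()

# The 50% row is the median, and the mean row matches Series.mean().
print(summary["50%"], scores.median())
print(summary["mean"], scores.mean())
```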

In [None]:
# pick a single column from the table e.g. Name

# df['Column'] / df.Column
df['Name']

Unnamed: 0,Name
0,Henry
1,John
2,David
3,Holmes
4,Marvin
5,Simon
6,Robert
7,Trent
8,Judy
9,Chris


In [None]:
df.Name

Unnamed: 0,Name
0,Henry
1,John
2,David
3,Holmes
4,Marvin
5,Simon
6,Robert
7,Trent
8,Judy
9,Chris


In [None]:
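# for a column name that contains a space, e.g. 'First Name':
# df.First Name      -> SyntaxError, attribute access does not work
# df['First Name']   -> works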
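
A runnable version of the note above, using a small made-up frame (`people` and its values are hypothetical):

```python
import pandas as pd

people = pd.DataFrame({"First Name": ["Henry", "John"], "Age": [30, 25]})

# Bracket access works for any column name, including names with spaces.
first_names = people["First Name"]

# Attribute access only works when the name is a valid Python identifier.
ages = people.Age

print(first_names)
print(ages)
```

Bracket access is the safer habit, since attribute access also breaks when a column name clashes with an existing DataFrame attribute or method.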

In [None]:
# TO pick multiple cols
cols = ['Name', 'Recommend']

df[cols]

Unnamed: 0,Name,Recommend
0,Henry,Yes
1,John,Yes
2,David,No
3,Holmes,No
4,Marvin,No
5,Simon,Yes
6,Robert,No
7,Trent,No
8,Judy,No
9,Chris,NO


In [None]:
df[['Name', 'ResearchScore', 'ProjectScore']]

Unnamed: 0,Name,ResearchScore,ProjectScore
0,Henry,90,85.0
1,John,85,51.0
2,David,10,17.0
3,Holmes,75,71.0
4,Marvin,20,30.0
5,Simon,92,79.0
6,Robert,60,59.0
7,Trent,75,33.0
8,Judy,25,
9,Chris,25,15.0


In [None]:
df

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend
0,Henry,A,Y,90,85.0,Yes
1,John,C,N,85,51.0,Yes
2,David,F,N,10,17.0,No
3,Holmes,B,Y,75,71.0,No
4,Marvin,E,N,20,30.0,No
5,Simon,A,Y,92,79.0,Yes
6,Robert,B,Y,60,59.0,No
7,Trent,C,Y,75,33.0,No
8,Judy,,Y,25,,No
9,Chris,D,U,25,15.0,NO


In [None]:
print(df.head(1))

    Name OverallGrade Obedient  ResearchScore  ProjectScore Recommend
0  Henry            A        Y             90          85.0       Yes


In [None]:
# Split the df into two parts : numeric_df & Categorical_df
# Method1
categorical_df = df[['Name', 'OverallGrade', 'Obedient', 'Recommend']]
numeric_df = df[['ResearchScore', 'ProjectScore']]

display(categorical_df.head(2))
display(numeric_df.head(2))

Unnamed: 0,Name,OverallGrade,Obedient,Recommend
0,Henry,A,Y,Yes
1,John,C,N,Yes


Unnamed: 0,ResearchScore,ProjectScore
0,90,85.0
1,85,51.0


In [None]:
# Method 2
num_cols = list(df.describe().columns)
num_cols

['ResearchScore', 'ProjectScore']

In [None]:
all_cols = list(df.columns)
# all_cols

cat_cols = [col for col in all_cols if col not in num_cols]
cat_cols

['Name', 'OverallGrade', 'Obedient', 'Recommend']

In [None]:
df[num_cols]

Unnamed: 0,ResearchScore,ProjectScore
0,90,85.0
1,85,51.0
2,10,17.0
3,75,71.0
4,20,30.0
5,92,79.0
6,60,59.0
7,75,33.0
8,25,
9,25,15.0


In [None]:
df[cat_cols]

Unnamed: 0,Name,OverallGrade,Obedient,Recommend
0,Henry,A,Y,Yes
1,John,C,N,Yes
2,David,F,N,No
3,Holmes,B,Y,No
4,Marvin,E,N,No
5,Simon,A,Y,Yes
6,Robert,B,Y,No
7,Trent,C,Y,No
8,Judy,,Y,No
9,Chris,D,U,NO


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Name           10 non-null     object 
 1   OverallGrade   9 non-null      object 
 2   Obedient       10 non-null     object 
 3   ResearchScore  10 non-null     int64  
 4   ProjectScore   9 non-null      float64
 5   Recommend      10 non-null     object 
dtypes: float64(1), int64(1), object(4)
memory usage: 612.0+ bytes


In [None]:
# Method 3:  select_dtypes()
num_df = df.select_dtypes(include = ['int64', 'float64'])
num_df

Unnamed: 0,ResearchScore,ProjectScore
0,90,85.0
1,85,51.0
2,10,17.0
3,75,71.0
4,20,30.0
5,92,79.0
6,60,59.0
7,75,33.0
8,25,
9,25,15.0


In [None]:
num_df

In [None]:
numeric_df = df.select_dtypes(exclude = ['object'])
numeric_df

Unnamed: 0,ResearchScore,ProjectScore
0,90,85.0
1,85,51.0
2,10,17.0
3,75,71.0
4,20,30.0
5,92,79.0
6,60,59.0
7,75,33.0
8,25,
9,25,15.0


In [None]:
df.columns

Index(['Name', 'OverallGrade', 'Obedient', 'ResearchScore', 'ProjectScore',
       'Recommend'],
      dtype='object')

In [None]:
df['ResearchScore'].describe()          # to get statistical information for specific column

Unnamed: 0,ResearchScore
count,10.0
mean,55.7
std,32.256093
min,10.0
25%,25.0
50%,67.5
75%,82.5
max,92.0


In [None]:
df['ResearchScore'].max()

92

In [None]:
df['ResearchScore'].std()

32.256093308947925

In [None]:
df['ResearchScore'].quantile(0.5)   # to get specific quantile values  50% quantile is median of the data

67.5

In [None]:
df['ResearchScore'].quantile(0.25)

25.0

In [None]:
round(df[['ResearchScore','ProjectScore']].quantile(0.75),2)   # round off the 75% value to 2 decimal points

Unnamed: 0,0.75
ResearchScore,82.5
ProjectScore,71.0


In [None]:
round(df[['ResearchScore','ProjectScore']].std(),2)

Unnamed: 0,0
ResearchScore,32.26
ProjectScore,26.41


In [None]:
round(df.describe(),2)   # to round off the total DataFrame values to 2 decimal points

Unnamed: 0,ResearchScore,ProjectScore
count,10.0,9.0
mean,55.7,48.89
std,32.26,26.41
min,10.0,15.0
25%,25.0,30.0
50%,67.5,51.0
75%,82.5,71.0
max,92.0,85.0


In [None]:
LINE_BREAK = "_"*50

In [None]:
# Distinct values
df.Recommend.unique()

array(['Yes', 'No', 'NO'], dtype=object)

In [None]:
df.Recommend.value_counts()

Unnamed: 0_level_0,count
Recommend,Unnamed: 1_level_1
No,6
Yes,3
NO,1


In [None]:
df.Recommend.value_counts().reset_index().style.background_gradient(cmap = "viridis")

Unnamed: 0,Recommend,count
0,No,6
1,Yes,3
2,NO,1


In [None]:
# Total unique values
#  -- select distinct Recommend from df.
df.Recommend.unique()

array(['Yes', 'No', 'NO'], dtype=object)

In [None]:
# fetch all the distinct values from Recommend

In [None]:
# value_counts()
df.Recommend.value_counts()

Unnamed: 0_level_0,count
Recommend,Unnamed: 1_level_1
No,6
Yes,3
NO,1


In [None]:
df.Obedient.value_counts()

Unnamed: 0_level_0,count
Obedient,Unnamed: 1_level_1
Y,6
N,3
U,1


In [None]:
df.Obedient.value_counts().index

Index(['Y', 'N', 'U'], dtype='object', name='Obedient')

In [None]:
df.Obedient.value_counts().values

array([6, 3, 1])

In [None]:
df['Recommend'].value_counts()

Unnamed: 0_level_0,count
Recommend,Unnamed: 1_level_1
No,6
Yes,3
NO,1


In [None]:
# NO -> No

df['Recommend'] = df['Recommend'].replace("NO", "No")

In [None]:
df.Recommend.value_counts().reset_index()

Unnamed: 0,Recommend,count
0,No,7
1,Yes,3
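
Replacing one variant at a time works for a single stray value. When a column has several inconsistent spellings, a general normalisation may be easier. The sketch below uses pandas string methods; the extra variants with spaces and lower case are invented to show the idea.

```python
import pandas as pd

recommend = pd.Series(["Yes", "No", "NO", " yes", "no "])

# Strip surrounding whitespace, then upper-case the first letter only.
cleaned = recommend.str.strip().str.capitalize()

print(cleaned.value_counts())
```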


In [None]:
# looking to split our df into two parts: all_numeric_cols | all_categoric_cols
df

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend
0,Henry,A,Y,90,85.0,Yes
1,John,C,N,85,51.0,Yes
2,David,F,N,10,17.0,No
3,Holmes,B,Y,75,71.0,No
4,Marvin,E,N,20,30.0,No
5,Simon,A,Y,92,79.0,Yes
6,Robert,B,Y,60,59.0,No
7,Trent,C,Y,75,33.0,No
8,Judy,,Y,25,,No
9,Chris,D,U,25,15.0,No


In [None]:
# Manual pick - not feasible
df[['ResearchScore', 'ProjectScore']]

Unnamed: 0,ResearchScore,ProjectScore
0,90,85.0
1,85,51.0
2,10,17.0
3,75,71.0
4,20,30.0
5,92,79.0
6,60,59.0
7,75,33.0
8,25,
9,25,15.0


#### Split df into two: numeric_df & categorical_df

In [None]:
# to get the list of Numeric columns
num_cols = list(df.describe().columns)
num_cols

['ResearchScore', 'ProjectScore']

In [None]:
numeric_df = df[num_cols]
numeric_df.head(3)

Unnamed: 0,ResearchScore,ProjectScore
0,90,85.0
1,85,51.0
2,10,17.0


In [None]:
cat_cols = [col for col in df.columns if col not in num_cols]
cat_cols

['Name', 'OverallGrade', 'Obedient', 'Recommend']

In [None]:
cat_df = df[cat_cols]
cat_df.head(2)

Unnamed: 0,Name,OverallGrade,Obedient,Recommend
0,Henry,A,Y,Yes
1,John,C,N,Yes


In [None]:
df.select_dtypes(include = 'object')

Unnamed: 0,Name,OverallGrade,Obedient,Recommend
0,Henry,A,Y,Yes
1,John,C,N,Yes
2,David,F,N,No
3,Holmes,B,Y,No
4,Marvin,E,N,No
5,Simon,A,Y,Yes
6,Robert,B,Y,No
7,Trent,C,Y,No
8,Judy,,Y,No
9,Chris,D,U,No


In [None]:
df.select_dtypes(include=['float64', 'int64'])

Unnamed: 0,ResearchScore,ProjectScore
0,90,85.0
1,85,51.0
2,10,17.0
3,75,71.0
4,20,30.0
5,92,79.0
6,60,59.0
7,75,33.0
8,25,
9,25,15.0


In [None]:
df.select_dtypes(exclude='object')

Unnamed: 0,ResearchScore,ProjectScore
0,90,85.0
1,85,51.0
2,10,17.0
3,75,71.0
4,20,30.0
5,92,79.0
6,60,59.0
7,75,33.0
8,25,
9,25,15.0
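
The `select_dtypes` calls above name specific dtypes such as `'float64'` and `'int64'`. As a hedged alternative, the sketch below uses the generic `'number'` selector, which also catches other numeric widths. The small frame is a made-up stand-in for the student table, not the notebook's actual data.

```python
import pandas as pd

# Small stand-in for the student table (illustrative values only)
df = pd.DataFrame({
    "Name": ["Henry", "John", "Judy"],
    "OverallGrade": ["A", "C", None],
    "ResearchScore": [90, 85, 25],
    "ProjectScore": [85.0, 51.0, None],
})

# 'number' matches every numeric dtype (int32, int64, float32, float64, ...)
numeric_df = df.select_dtypes(include="number")

# Everything that is not numeric is treated as categorical in this sketch
categorical_df = df.select_dtypes(exclude="number")

print(list(numeric_df.columns))
print(list(categorical_df.columns))
```

Selecting by `'number'` avoids silently missing a column that was read in as, say, `int32` or `float32`.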


In [None]:
# all_cols - numeric_cols

In [None]:
# categorical cols


['Obedient', 'Recommend', 'Name', 'Overall Grade']

In [None]:
# df['OverallGrade']
# df.OverallGrade

# extract multiple cols : df[list of cols]
# df['OverallGrade']
# df[['Name','Recommend']]

In [None]:
df

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend
0,Henry,A,Y,90,85.0,Yes
1,John,C,N,85,51.0,Yes
2,David,F,N,10,17.0,No
3,Holmes,B,Y,75,71.0,No
4,Marvin,E,N,20,30.0,No
5,Simon,A,Y,92,79.0,Yes
6,Robert,B,Y,60,59.0,No
7,Trent,C,Y,75,33.0,No
8,Judy,,Y,25,,No
9,Chris,D,U,25,15.0,No


In [None]:
# Top 3 students based on ResearchScore
df.sort_values(by = 'ResearchScore', ascending = False).head(3)

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend
5,Simon,A,Y,92,79.0,Yes
0,Henry,A,Y,90,85.0,Yes
1,John,C,N,85,51.0,Yes


In [None]:
df.nlargest(3, "ResearchScore")

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend
5,Simon,A,Y,92,79.0,Yes
0,Henry,A,Y,90,85.0,Yes
1,John,C,N,85,51.0,Yes


In [None]:
# Bottom 3 students based on ProjectScore
df.sort_values(by = 'ProjectScore', ascending = True).head(3)

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend
9,Chris,D,U,25,15.0,No
2,David,F,N,10,17.0,No
4,Marvin,E,N,20,30.0,No


In [None]:
df.nsmallest(3, "ProjectScore")

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend
9,Chris,D,U,25,15.0,No
2,David,F,N,10,17.0,No
4,Marvin,E,N,20,30.0,No


#### Drop columns or rows

In [None]:
df

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend
0,Henry,A,Y,90,85.0,Yes
1,John,C,N,85,51.0,Yes
2,David,F,N,10,17.0,No
3,Holmes,B,Y,75,71.0,No
4,Marvin,E,N,20,30.0,No
5,Simon,A,Y,92,79.0,Yes
6,Robert,B,Y,60,59.0,No
7,Trent,C,Y,75,33.0,No
8,Judy,,Y,25,,No
9,Chris,D,U,25,15.0,No


In [None]:
# df.drop(columns, axis=1) returns a copy of the DataFrame without the given columns

# Drop the Obedient column
df.drop('Obedient', axis = 1)

Unnamed: 0,Name,OverallGrade,ResearchScore,ProjectScore,Recommend
0,Henry,A,90,85.0,Yes
1,John,C,85,51.0,Yes
2,David,F,10,17.0,No
3,Holmes,B,75,71.0,No
4,Marvin,E,20,30.0,No
5,Simon,A,92,79.0,Yes
6,Robert,B,60,59.0,No
7,Trent,C,75,33.0,No
8,Judy,,25,,No
9,Chris,D,25,15.0,No


In [None]:
# Drop Multiple columns
df.drop(['Obedient','OverallGrade'], axis = 1)

Unnamed: 0,Name,ResearchScore,ProjectScore,Recommend
0,Henry,90,85.0,Yes
1,John,85,51.0,Yes
2,David,10,17.0,No
3,Holmes,75,71.0,No
4,Marvin,20,30.0,No
5,Simon,92,79.0,Yes
6,Robert,60,59.0,No
7,Trent,75,33.0,No
8,Judy,25,,No
9,Chris,25,15.0,No


In [None]:
# Drop a row using row label
df.drop(5, axis = 0)

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend
0,Henry,A,Y,90,85.0,Yes
1,John,C,N,85,51.0,Yes
2,David,F,N,10,17.0,No
3,Holmes,B,Y,75,71.0,No
4,Marvin,E,N,20,30.0,No
6,Robert,B,Y,60,59.0,No
7,Trent,C,Y,75,33.0,No
8,Judy,,Y,25,,No
9,Chris,D,U,25,15.0,No


In [None]:
# Drop multiple rows
df.drop([0,3,7], axis = 0).reset_index(drop = True)

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend
0,John,C,N,85,51.0,Yes
1,David,F,N,10,17.0,No
2,Marvin,E,N,20,30.0,No
3,Simon,A,Y,92,79.0,Yes
4,Robert,B,Y,60,59.0,No
5,Judy,,Y,25,,No
6,Chris,D,U,25,15.0,No
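
None of the `drop` calls above changes `df` itself, because `drop` returns a new object unless `inplace=True` is passed. The sketch below illustrates this on a made-up three-row frame and uses the `columns=` and `index=` keywords, which are an alternative to passing `axis`.

```python
import pandas as pd

# Made-up miniature of the student table
df = pd.DataFrame({
    "Name": ["Henry", "John", "David"],
    "Obedient": ["Y", "N", "N"],
    "ResearchScore": [90, 85, 10],
})

# columns= / index= state what is being dropped, so axis is not needed
without_obedient = df.drop(columns=["Obedient"])
without_first_row = df.drop(index=[0]).reset_index(drop=True)

# drop() returned new objects; the original frame keeps every row and column
print(df.shape)
print(without_obedient.columns.tolist())
print(without_first_row)
```

To keep the result, assign it back (`df = df.drop(...)`), as the notebook does with `replace` elsewhere.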


In [None]:
# select specific data by column name

In [None]:
# statistical info of the categorical columns

In [None]:
x.astype('object').describe()                 # categorical-style statistics (count, unique, top, freq) for all columns

Unnamed: 0,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend
count,9,10,10,9.0,10
unique,6,3,8,9.0,3
top,A,Y,75,85.0,No
freq,2,6,2,1.0,6


In [None]:
x.describe(include='object')     # statistics for the categorical (object) columns only

Unnamed: 0,OverallGrade,Obedient,Recommend
count,9,10,10
unique,6,3,3
top,A,Y,No
freq,2,6,6


In [None]:
x.describe(include='all')       # statistics for the numerical as well as the categorical columns

Unnamed: 0,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend
count,9,10,10.0,9.0,10
unique,6,3,,,3
top,A,Y,,,No
freq,2,6,,,6
mean,,,55.7,48.888889,
std,,,32.256093,26.412329,
min,,,10.0,15.0,
25%,,,25.0,30.0,
50%,,,67.5,51.0,
75%,,,82.5,71.0,


### Value counts

In [None]:
df['Recommend'].unique()

array(['Yes', 'No', 'NO'], dtype=object)

In [None]:
df['Recommend'].value_counts()    # View unique values and counts

Unnamed: 0_level_0,count
Recommend,Unnamed: 1_level_1
No,6
Yes,3
NO,1


In [None]:
df['Recommend'].value_counts(normalize=True)  # Share of each unique value as a fraction between 0 and 1

Unnamed: 0_level_0,proportion
Recommend,Unnamed: 1_level_1
No,0.6
Yes,0.3
NO,0.1


In [None]:
df['Recommend'].value_counts(normalize=True) * 100  # Share of each unique value as a percentage

Unnamed: 0_level_0,proportion
Recommend,Unnamed: 1_level_1
No,60.0
Yes,30.0
NO,10.0


In [None]:
df["OverallGrade"].value_counts()

Unnamed: 0_level_0,count
OverallGrade,Unnamed: 1_level_1
A,2
C,2
B,2
F,1
E,1
D,1


By default, `value_counts()` leaves NaN values out of the result. To count them as well, use `.value_counts(dropna=False)`.

In [None]:
df["OverallGrade"].value_counts(dropna=False)

A      2
C      2
B      2
F      1
E      1
NaN    1
D      1
Name: OverallGrade, dtype: int64

In [None]:
df["OverallGrade"].value_counts(dropna=False,normalize=True)

A      0.2
C      0.2
B      0.2
F      0.1
E      0.1
NaN    0.1
D      0.1
Name: OverallGrade, dtype: float64

In [None]:
df["OverallGrade"].value_counts(dropna=False,normalize=True) *100

A      20.0
C      20.0
B      20.0
F      10.0
E      10.0
NaN    10.0
D      10.0
Name: OverallGrade, dtype: float64
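
The counts and the percentages above come from separate calls. A minimal sketch of putting both in one table follows; the `summary` name and its column labels are illustrative choices, and the percentage is derived from the counts so both columns share one index.

```python
import pandas as pd

# Grades as in the example table, with one missing value
grades = pd.Series(
    ["A", "C", "F", "B", "E", "A", "B", "C", None, "D"],
    name="OverallGrade",
)

# Count every distinct value, including the missing one
counts = grades.value_counts(dropna=False)

# Derive the percentage from the counts so both columns share the same index
percent = counts / counts.sum() * 100

summary = pd.DataFrame({"count": counts, "percent": percent})
print(summary)
```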

In [None]:
df.apply(pd.Series.value_counts) # Unique values and counts for all columns

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend
10,,,,1.0,,
15.0,,,,,1.0,
17.0,,,,,1.0,
20,,,,1.0,,
25,,,,2.0,,
30.0,,,,,1.0,
33.0,,,,,1.0,
51.0,,,,,1.0,
59.0,,,,,1.0,
60,,,,1.0,,


### Dealing with Null Values

#### Checking the null values in the DataFrame

In [None]:
# isna() returns a DataFrame of booleans in which True marks a missing value.
df

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend
0,Henry,A,Y,90,85.0,Yes
1,John,C,N,85,51.0,Yes
2,David,F,N,10,17.0,No
3,Holmes,B,Y,75,71.0,No
4,Marvin,E,N,20,30.0,No
5,Simon,A,Y,92,79.0,Yes
6,Robert,B,Y,60,59.0,No
7,Trent,C,Y,75,33.0,No
8,Judy,,Y,25,,No
9,Chris,D,U,25,15.0,NO


In [None]:
# Check for missing values
# Decide what to do:
    # - Drop
    # - Fill missing values

In [None]:
 # Return the number of missing values in each column.
df.isna().sum()

Unnamed: 0,0
Name,0
OverallGrade,1
Obedient,0
ResearchScore,0
ProjectScore,1
Recommend,0


In [None]:
# total no of records
len(df)

10

In [None]:
100 * df.isna().sum() / len(df)

Unnamed: 0,0
Name,0.0
OverallGrade,10.0
Obedient,0.0
ResearchScore,0.0
ProjectScore,10.0
Recommend,0.0


In [None]:
# Return the number of missing values in each column in percentage.

Name              0.0
OverallGrade     10.0
Obedient          0.0
ResearchScore     0.0
ProjectScore     10.0
Recommend         0.0
dtype: float64

In the above example OverallGrade and ProjectScore columns contain 10% null values each.
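
The same percentage can be obtained in one step, since the mean of a boolean mask is the fraction of `True` values. The sketch below assumes a small made-up frame rather than the notebook's `df`.

```python
import pandas as pd

# Made-up frame with a few missing entries
df = pd.DataFrame({
    "OverallGrade": ["A", "C", None, "B"],
    "ProjectScore": [85.0, None, None, 71.0],
    "Recommend": ["Yes", "Yes", "No", "No"],
})

# True counts as 1 and False as 0, so the column mean is the missing fraction
missing_pct = df.isna().mean() * 100
print(missing_pct)
```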

In [None]:
df

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend
0,Henry,A,Y,90,85.0,Yes
1,John,C,N,85,51.0,Yes
2,David,F,N,10,17.0,No
3,Holmes,B,Y,75,71.0,No
4,Marvin,E,N,20,30.0,No
5,Simon,A,Y,92,79.0,Yes
6,Robert,B,Y,60,59.0,No
7,Trent,C,Y,75,33.0,No
8,Judy,,Y,25,,No
9,Chris,D,U,25,15.0,NO


#### Handling the Null Values
There are several ways to deal with NaN (null) values; which one to use depends on the project and the data:

1. Drop all null values
2. Fill Null values





##### Drop all null values:

`DataFrame.dropna(axis=0, how='any', inplace=False)`

    axis : {0 or ‘index’, 1 or ‘columns’}, default 0:
    Determine if rows or columns which contain missing values are removed.
        0, or ‘index’ : Drop rows which contain missing values.
        1, or ‘columns’ : Drop columns which contain missing value.

    how : {‘any’, ‘all’}, default ‘any’:
    Determine if row or column is removed from DataFrame, when we have at least one NA or all NA.
        ‘any’ : If any NA values are present, drop that row or column.
        ‘all’ : If all values are NA, drop that row or column.

    inplace: bool, default False
    Whether to modify the DataFrame rather than creating a new one.
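
The student table never contains a row that is entirely NaN, so `how='all'` drops nothing there. The sketch below uses a made-up two-column frame where the difference between `'any'` and `'all'` is visible; it also shows the `subset` parameter, which is not listed above but is part of `dropna`.

```python
import numpy as np
import pandas as pd

# Row 0 is complete, row 1 is partly missing, row 2 is entirely missing
df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, 5.0, np.nan],
})

any_dropped = df.dropna(how="any")         # keeps only the complete row 0
all_dropped = df.dropna(how="all")         # drops only row 2, which is all NaN
subset_dropped = df.dropna(subset=["b"])   # checks column b only

print(any_dropped.index.tolist())
print(all_dropped.index.tolist())
print(subset_dropped.index.tolist())
```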

In [None]:
df

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend
0,Henry,A,Y,90,85.0,Yes
1,John,C,N,85,51.0,Yes
2,David,F,N,10,17.0,No
3,Holmes,B,Y,75,71.0,No
4,Marvin,E,N,20,30.0,No
5,Simon,A,Y,92,79.0,Yes
6,Robert,B,Y,60,59.0,No
7,Trent,C,Y,75,33.0,No
8,Judy,,Y,25,,No
9,Chris,D,U,25,15.0,NO


In [None]:
df_copy = df.copy()

In [None]:
df_copy

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend
0,Henry,A,Y,90,85.0,Yes
1,John,C,N,85,51.0,Yes
2,David,F,N,10,17.0,No
3,Holmes,B,Y,75,71.0,No
4,Marvin,E,N,20,30.0,No
5,Simon,A,Y,92,79.0,Yes
6,Robert,B,Y,60,59.0,No
7,Trent,C,Y,75,33.0,No
8,Judy,,Y,25,,No
9,Chris,D,U,25,15.0,NO


In [None]:
# drop rows having any nan
df_copy.dropna()

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend
0,Henry,A,Y,90,85.0,Yes
1,John,C,N,85,51.0,Yes
2,David,F,N,10,17.0,No
3,Holmes,B,Y,75,71.0,No
4,Marvin,E,N,20,30.0,No
5,Simon,A,Y,92,79.0,Yes
6,Robert,B,Y,60,59.0,No
7,Trent,C,Y,75,33.0,No
9,Chris,D,U,25,15.0,NO


In [None]:
df_copy

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend
0,Henry,A,Y,90,85.0,Yes
1,John,C,N,85,51.0,Yes
2,David,F,N,10,17.0,No
3,Holmes,B,Y,75,71.0,No
4,Marvin,E,N,20,30.0,No
5,Simon,A,Y,92,79.0,Yes
6,Robert,B,Y,60,59.0,No
7,Trent,C,Y,75,33.0,No
8,Judy,,Y,25,,No
9,Chris,D,U,25,15.0,NO


In [None]:
df_copy.dropna(axis = 1)

Unnamed: 0,Name,Obedient,ResearchScore,Recommend
0,Henry,Y,90,Yes
1,John,N,85,Yes
2,David,N,10,No
3,Holmes,Y,75,No
4,Marvin,N,20,No
5,Simon,Y,92,Yes
6,Robert,Y,60,No
7,Trent,Y,75,No
8,Judy,Y,25,No
9,Chris,U,25,NO


In [None]:
df_copy.dropna(axis = 0)

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend
0,Henry,A,Y,90,85.0,Yes
1,John,C,N,85,51.0,Yes
2,David,F,N,10,17.0,No
3,Holmes,B,Y,75,71.0,No
4,Marvin,E,N,20,30.0,No
5,Simon,A,Y,92,79.0,Yes
6,Robert,B,Y,60,59.0,No
7,Trent,C,Y,75,33.0,No
9,Chris,D,U,25,15.0,NO


In [None]:
df_copy.dropna(inplace = True)

In [None]:
df_copy.reset_index(drop = True, inplace = True)

In [None]:
# remove a row from a dataframe that has a NaN or missing values in it.
df_copy

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend
0,Henry,A,Y,90,85.0,Yes
1,John,C,N,85,51.0,Yes
2,David,F,N,10,17.0,No
3,Holmes,B,Y,75,71.0,No
4,Marvin,E,N,20,30.0,No
5,Simon,A,Y,92,79.0,Yes
6,Robert,B,Y,60,59.0,No
7,Trent,C,Y,75,33.0,No
8,Chris,D,U,25,15.0,NO


In [None]:
x=df_copy.dropna(axis=1) # remove a column from a dataframe that has a NaN or missing values in it.
x

Unnamed: 0,Name,Obedient,ResearchScore,Recommend
0,Henry,Y,90,Yes
1,John,N,85,Yes
2,David,N,10,No
3,Holmes,Y,75,No
4,Marvin,N,20,No
5,Simon,Y,92,Yes
6,Robert,Y,60,No
7,Trent,Y,75,No
8,Judy,Y,25,No
9,Chris,U,25,No


In [None]:
df

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend
0,Henry,A,Y,90,85.0,Yes
1,John,C,N,85,51.0,Yes
2,David,F,N,10,17.0,No
3,Holmes,B,Y,75,71.0,No
4,Marvin,E,N,20,30.0,No
5,Simon,A,Y,92,79.0,Yes
6,Robert,B,Y,60,59.0,No
7,Trent,C,Y,75,33.0,No
8,Judy,,Y,25,,No
9,Chris,D,U,25,15.0,NO


In [None]:
x=df_copy.dropna(how='any')        # ‘any’ : If any NA values are present, drop that row or column.

x

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend
0,Henry,A,Y,90,85.0,Yes
1,John,C,N,85,51.0,Yes
2,David,F,N,10,17.0,No
3,Holmes,B,Y,75,71.0,No
4,Marvin,E,N,20,30.0,No
5,Simon,A,Y,92,79.0,Yes
6,Robert,B,Y,60,59.0,No
7,Trent,C,Y,75,33.0,No
9,Chris,D,U,25,15.0,NO


In [None]:
x=df_copy.dropna(how='all') # ‘all’ : If all values are NA, drop that row or column.
x

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend
0,Henry,A,Y,90,85.0,Yes
1,John,C,N,85,51.0,Yes
2,David,F,N,10,17.0,No
3,Holmes,B,Y,75,71.0,No
4,Marvin,E,N,20,30.0,No
5,Simon,A,Y,92,79.0,Yes
6,Robert,B,Y,60,59.0,No
7,Trent,C,Y,75,33.0,No
8,Judy,,Y,25,,No
9,Chris,D,U,25,15.0,NO


In [None]:
x.dropna(inplace=True) # Whether to modify the DataFrame rather than creating a new one.
x

In [None]:
# fix the indexes / reset the indexes : reset_index()
df.reset_index(drop = True, inplace = True)

In [None]:
df.dropna()

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend
0,Henry,A,Y,90,85.0,Yes
1,John,C,N,85,51.0,Yes
2,David,F,N,10,17.0,No
3,Holmes,B,Y,75,71.0,No
4,Marvin,E,N,20,30.0,No
5,Simon,A,Y,92,79.0,Yes
6,Robert,B,Y,60,59.0,No
7,Trent,C,Y,75,33.0,No
9,Chris,D,U,25,15.0,No


##### Fill null values:

 `.fillna()`


`DataFrame.fillna(value=None, method=None, axis=None, inplace=False)`

    value: scalar, dict, Series, or DataFrame

    Value to use to fill holes (e.g. 0),
    alternately a dict/Series/DataFrame of values specifying which value to
    use for each index (for a Series) or column (for a DataFrame).
    Values not in the dict/Series/DataFrame will not be filled.
    This value cannot be a list.


    method : {‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None}, default None

    Method to use for filling holes in reindexed Series.
    pad / ffill: propagate last valid observation forward to next valid.
    backfill / bfill: use next valid observation to fill gap.
    Deprecated since pandas 2.1: prefer DataFrame.ffill() / DataFrame.bfill().

    axis : {0 or ‘index’, 1 or ‘columns’}
    Axis along which to fill missing values. For Series this parameter is
    unused and defaults to 0.

    inplace : bool, default False

    If True, fill in-place. Note: this will modify any other views on this object
    (e.g., a no-copy slice for a column in a DataFrame).
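
The `value` parameter also accepts a dict, which allows a different fill value per column in a single call. A minimal sketch follows, assuming a made-up frame and the mode/median choices used elsewhere in this notebook; `fill_values` is an illustrative name.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing grade and one missing score
x = pd.DataFrame({
    "OverallGrade": ["A", "C", "A", None],
    "ProjectScore": [85.0, 51.0, 17.0, np.nan],
})

# One fill value per column: mode for the categorical, median for the numeric
fill_values = {
    "OverallGrade": x["OverallGrade"].mode()[0],
    "ProjectScore": x["ProjectScore"].median(),
}

# fillna returns a new frame; x itself is left untouched
filled = x.fillna(value=fill_values)
print(filled)
```

Columns that are not keys of the dict are left as they are.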






In [None]:
x = df.copy()

In [None]:
x

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend
0,Henry,A,Y,90,85.0,Yes
1,John,C,N,85,51.0,Yes
2,David,F,N,10,17.0,No
3,Holmes,B,Y,75,71.0,No
4,Marvin,E,N,20,30.0,No
5,Simon,A,Y,92,79.0,Yes
6,Robert,B,Y,60,59.0,No
7,Trent,C,Y,75,33.0,No
8,Judy,,Y,25,,No
9,Chris,D,U,25,15.0,NO


In [None]:
missing_value_perct = 100*x.isna().sum()/len(x)
missing_value_perct = missing_value_perct.reset_index()

missing_value_perct.columns = ['columns','missing_perc']

# missing_value_perct

# threshold = 10
list(missing_value_perct[missing_value_perct['missing_perc']>= 10]['columns'])

['OverallGrade', 'ProjectScore']

In [None]:
x.drop(columns = list(missing_value_perct[missing_value_perct['missing_perc']>= 10]['columns']))

Unnamed: 0,Name,Obedient,ResearchScore,Recommend
0,Henry,Y,90,Yes
1,John,N,85,Yes
2,David,N,10,No
3,Holmes,Y,75,No
4,Marvin,N,20,No
5,Simon,Y,92,Yes
6,Robert,Y,60,No
7,Trent,Y,75,No
8,Judy,Y,25,No
9,Chris,U,25,NO


In [None]:
missing_value_perct.values

array([ 0., 10.,  0.,  0., 10.,  0.])

In [None]:
# dropna()
# fillna()
x.fillna(0)

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend
0,Henry,A,Y,90,85.0,Yes
1,John,C,N,85,51.0,Yes
2,David,F,N,10,17.0,No
3,Holmes,B,Y,75,71.0,No
4,Marvin,E,N,20,30.0,No
5,Simon,A,Y,92,79.0,Yes
6,Robert,B,Y,60,59.0,No
7,Trent,C,Y,75,33.0,No
8,Judy,0,Y,25,0.0,No
9,Chris,D,U,25,15.0,NO


In [None]:
# Median
x.ProjectScore.median()

51.0

In [None]:
x['ProjectScore'] = x['ProjectScore'].fillna(x.ProjectScore.median())

In [None]:
x

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend
0,Henry,A,Y,90,85.0,Yes
1,John,C,N,85,51.0,Yes
2,David,F,N,10,17.0,No
3,Holmes,B,Y,75,71.0,No
4,Marvin,E,N,20,30.0,No
5,Simon,A,Y,92,79.0,Yes
6,Robert,B,Y,60,59.0,No
7,Trent,C,Y,75,33.0,No
8,Judy,,Y,25,51.0,No
9,Chris,D,U,25,15.0,NO


In [None]:
x['OverallGrade'].mode()[0]

'A'

In [None]:
x['OverallGrade'] = x['OverallGrade'].fillna(x['OverallGrade'].mode()[0])

In [None]:
x

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend
0,Henry,A,Y,90,85.0,Yes
1,John,C,N,85,51.0,Yes
2,David,F,N,10,17.0,No
3,Holmes,B,Y,75,71.0,No
4,Marvin,E,N,20,30.0,No
5,Simon,A,Y,92,79.0,Yes
6,Robert,B,Y,60,59.0,No
7,Trent,C,Y,75,33.0,No
8,Judy,A,Y,25,51.0,No
9,Chris,D,U,25,15.0,NO


In [None]:
x= df.copy()
x

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend
0,Henry,A,Y,90,85.0,Yes
1,John,C,N,85,51.0,Yes
2,David,F,N,10,17.0,No
3,Holmes,B,Y,75,71.0,No
4,Marvin,E,N,20,30.0,No
5,Simon,A,Y,92,79.0,Yes
6,Robert,B,Y,60,59.0,No
7,Trent,C,Y,75,33.0,No
8,Judy,,Y,25,,No
9,Chris,D,U,25,15.0,NO


In [None]:
# Fill a categorical column with its mode; when several modes exist,
# mode()[0] picks the first one

    # Fill numerical columns with the mean (or median) value

In [None]:
# fillna(method=...) accepted {'backfill', 'bfill', 'pad', 'ffill'} but is deprecated since pandas 2.1;
# call ffill() / bfill() directly instead.
# ffill: propagate the last valid observation forward to the next valid one
# bfill: use the next valid observation to fill the gap
x.bfill()

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend
0,Henry,A,Y,90,85.0,Yes
1,John,C,N,85,51.0,Yes
2,David,F,N,10,17.0,No
3,Holmes,B,Y,75,71.0,No
4,Marvin,E,N,20,30.0,No
5,Simon,A,Y,92,79.0,Yes
6,Robert,B,Y,60,59.0,No
7,Trent,C,Y,75,33.0,No
8,Judy,D,Y,25,15.0,No
9,Chris,D,U,25,15.0,NO


In [None]:
x.bfill()

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend
0,Henry,A,Y,90,85.0,Yes
1,John,C,N,85,51.0,Yes
2,David,F,N,10,17.0,No
3,Holmes,B,Y,75,71.0,No
4,Marvin,E,N,20,30.0,No
5,Simon,A,Y,92,79.0,Yes
6,Robert,B,Y,60,59.0,No
7,Trent,C,Y,75,33.0,No
8,Judy,D,Y,25,15.0,No
9,Chris,D,U,25,15.0,NO


In [None]:
x.bfill(axis=1)

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend
0,Henry,A,Y,90,85.0,Yes
1,John,C,N,85,51.0,Yes
2,David,F,N,10,17.0,No
3,Holmes,B,Y,75,71.0,No
4,Marvin,E,N,20,30.0,No
5,Simon,A,Y,92,79.0,Yes
6,Robert,B,Y,60,59.0,No
7,Trent,C,Y,75,33.0,No
8,Judy,Y,Y,25,No,No
9,Chris,D,U,25,15.0,NO


In [None]:
x.ffill()

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend
0,Henry,A,Y,90,85.0,Yes
1,John,C,N,85,51.0,Yes
2,David,F,N,10,17.0,No
3,Holmes,B,Y,75,71.0,No
4,Marvin,E,N,20,30.0,No
5,Simon,A,Y,92,79.0,Yes
6,Robert,B,Y,60,59.0,No
7,Trent,C,Y,75,33.0,No
8,Judy,C,Y,25,33.0,No
9,Chris,D,U,25,15.0,NO


In [None]:
x.ffill(axis=1)

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend
0,Henry,A,Y,90,85.0,Yes
1,John,C,N,85,51.0,Yes
2,David,F,N,10,17.0,No
3,Holmes,B,Y,75,71.0,No
4,Marvin,E,N,20,30.0,No
5,Simon,A,Y,92,79.0,Yes
6,Robert,B,Y,60,59.0,No
7,Trent,C,Y,75,33.0,No
8,Judy,Judy,Y,25,25.0,No
9,Chris,D,U,25,15.0,NO


### Unique values

In [None]:
df

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend
0,Henry,A,Y,90,85.0,Yes
1,John,C,N,85,51.0,Yes
2,David,F,N,10,17.0,No
3,Holmes,B,Y,75,71.0,No
4,Marvin,E,N,20,30.0,No
5,Simon,A,Y,92,79.0,Yes
6,Robert,B,Y,60,59.0,No
7,Trent,C,Y,75,33.0,No
8,Chris,D,U,25,15.0,NO


In [None]:
df['Recommend'].unique()   # to find the unique values in the particular column

array(['Yes', 'No', 'NO'], dtype=object)

In [None]:
df['OverallGrade'].unique()

In [None]:
df['Recommend'].value_counts()

Unnamed: 0_level_0,count
Recommend,Unnamed: 1_level_1
No,6
Yes,3
NO,1


In [None]:
# to replace the value in the particular column
df['Recommend'] = df['Recommend'].replace("NO", "No")

In [None]:
df['Recommend'].unique()

array(['Yes', 'No'], dtype=object)
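
`replace("NO", "No")` fixes one known variant. When several spellings may occur, normalising the text with the `.str` accessor is a possible alternative. The sketch below assumes made-up input values, including stray whitespace that the original column does not show.

```python
import pandas as pd

# Made-up variants of the same two answers
recommend = pd.Series(["Yes", "No", "NO", " yes ", "no"])

# Trim whitespace, then upper-case the first letter and lower-case the rest
cleaned = recommend.str.strip().str.capitalize()
print(cleaned.unique())
```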

In [None]:
df.values

array([['Henry', 'A', 'Y', 90, 85.0, 'Yes'],
       ['John', 'C', 'N', 85, 51.0, 'Yes'],
       ['David', 'F', 'N', 10, 17.0, 'No'],
       ['Holmes', 'B', 'Y', 75, 71.0, 'No'],
       ['Marvin', 'E', 'N', 20, 30.0, 'No'],
       ['Simon', 'A', 'Y', 92, 79.0, 'Yes'],
       ['Robert', 'B', 'Y', 60, 59.0, 'No'],
       ['Trent', 'C', 'Y', 75, 33.0, 'No'],
       ['Judy', nan, 'Y', 25, nan, 'No'],
       ['Chris', 'D', 'U', 25, 15.0, 'No']], dtype=object)

In [None]:
nparray = df.values
print(nparray)
print(type(nparray))
print(nparray.shape)

[['Henry' 'A' 'Y' 90 85.0 'Yes']
 ['John' 'C' 'N' 85 51.0 'Yes']
 ['David' 'F' 'N' 10 17.0 'No']
 ['Holmes' 'B' 'Y' 75 71.0 'No']
 ['Marvin' 'E' 'N' 20 30.0 'No']
 ['Simon' 'A' 'Y' 92 79.0 'Yes']
 ['Robert' 'B' 'Y' 60 59.0 'No']
 ['Trent' 'C' 'Y' 75 33.0 'No']
 ['Judy' nan 'Y' 25 nan 'No']
 ['Chris' 'D' 'U' 25 15.0 'No']]
<class 'numpy.ndarray'>
(10, 6)
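
The array above has dtype `object` because the frame mixes strings and numbers. As a hedged aside, the pandas documentation recommends `DataFrame.to_numpy()` over `.values`; the sketch below applies it to numeric columns only, using a made-up frame, so the result gets a numeric dtype.

```python
import pandas as pd

# Numeric columns only: one integer column and one float column
df = pd.DataFrame({
    "ResearchScore": [90, 85, 10],
    "ProjectScore": [85.0, 51.0, 17.0],
})

# int64 and float64 columns are combined into a single float64 array
scores = df.to_numpy()
print(type(scores))
print(scores.dtype)
print(scores.shape)
```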


In [None]:
df['ResearchScore'].sum()

np.int64(557)

In [None]:
col_num = df.describe().columns
x = df[col_num]
x

Unnamed: 0,ResearchScore,ProjectScore
0,90,85.0
1,85,51.0
2,10,17.0
3,75,71.0
4,20,30.0
5,92,79.0
6,60,59.0
7,75,33.0
8,25,15.0


In [None]:
# x = df[list(df.describe().columns)]

### Arithmetic operations on DataFrame

In [None]:
import pandas as pd

# Creating the data dictionary
data = {
    "Name": ["Henry", "John", "David", "Holmes", "Marvin", "Simon", "Robert", "Trent", "Chris"],
    "OverallGrade": ["A", "C", "F", "B", "E", "A", "B", "C", "D"],
    "Obedient": ["Y", "N", "N", "Y", "N", "Y", "Y", "Y", "U"],
    "ResearchScore": [90, 85, 10, 75, 20, 92, 60, 75, 25],
    "ProjectScore": [85.0, 51.0, 17.0, 71.0, 30.0, 79.0, 59.0, 33.0, 15.0],
    "Recommend": ["Yes", "Yes", "No", "No", "No", "Yes", "No", "No", "NO"]
}

# Converting the data to a DataFrame
df = pd.DataFrame(data)

# Displaying the DataFrame
display(df)

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend
0,Henry,A,Y,90,85.0,Yes
1,John,C,N,85,51.0,Yes
2,David,F,N,10,17.0,No
3,Holmes,B,Y,75,71.0,No
4,Marvin,E,N,20,30.0,No
5,Simon,A,Y,92,79.0,Yes
6,Robert,B,Y,60,59.0,No
7,Trent,C,Y,75,33.0,No
8,Chris,D,U,25,15.0,NO


In [None]:
df

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend
0,Henry,A,Y,90,85.0,Yes
1,John,C,N,85,51.0,Yes
2,David,F,N,10,17.0,No
3,Holmes,B,Y,75,71.0,No
4,Marvin,E,N,20,30.0,No
5,Simon,A,Y,92,79.0,Yes
6,Robert,B,Y,60,59.0,No
7,Trent,C,Y,75,33.0,No
8,Chris,D,U,25,15.0,NO


In [None]:
x = df[['ProjectScore', 'ResearchScore']]
x

Unnamed: 0,ProjectScore,ResearchScore
0,85.0,90
1,51.0,85
2,17.0,10
3,71.0,75
4,30.0,20
5,79.0,92
6,59.0,60
7,33.0,75
8,15.0,25


In [None]:
x.sum(axis = 1)

Unnamed: 0,0
0,175.0
1,136.0
2,27.0
3,146.0
4,50.0
5,171.0
6,119.0
7,108.0
8,40.0


In [None]:
x.sum()

Unnamed: 0,0
ProjectScore,440.0
ResearchScore,532.0


In [None]:
# sum()
# mean()
# median()
# count()
# min()
# max()
# std()
# var()
# cov()
# corr()

In [None]:
x.ResearchScore.mean()

59.111111111111114

In [None]:
x.cov()

Unnamed: 0,ProjectScore,ResearchScore
ProjectScore,697.611111,716.138889
ResearchScore,716.138889,1039.611111


In [None]:
x.sum()

Unnamed: 0,0
ProjectScore,440.0
ResearchScore,532.0


In [None]:
x.sum(axis=1)

Unnamed: 0,0
0,175.0
1,136.0
2,27.0
3,146.0
4,50.0
5,171.0
6,119.0
7,108.0
8,40.0


In [None]:
x.mean(axis=0)

Unnamed: 0,0
ProjectScore,48.888889
ResearchScore,59.111111


In [None]:
x.mean(axis=1)

Unnamed: 0,0
0,87.5
1,68.0
2,13.5
3,73.0
4,25.0
5,85.5
6,59.5
7,54.0
8,20.0


In [None]:
x.count(axis=0)

ResearchScore    10
ProjectScore      9
dtype: int64

In [None]:
x.count(axis=1)

0    2
1    2
2    2
3    2
4    2
5    2
6    2
7    2
8    1
9    2
dtype: int64

In [None]:
x.min(axis=0)

ResearchScore    10.0
ProjectScore     15.0
dtype: float64

In [None]:
x.min(axis=1)

0    85.0
1    51.0
2    10.0
3    71.0
4    20.0
5    79.0
6    59.0
7    33.0
8    25.0
9    15.0
dtype: float64

In [None]:
x.max(axis=0)

ResearchScore    92.0
ProjectScore     85.0
dtype: float64

In [None]:
x.max(axis=1)

0    90.0
1    85.0
2    17.0
3    75.0
4    30.0
5    92.0
6    60.0
7    75.0
8    25.0
9    25.0
dtype: float64

In [None]:
x.median(axis=0)

ResearchScore    67.5
ProjectScore     51.0
dtype: float64

In [None]:
x.median(axis=0)

ResearchScore    67.5
ProjectScore     51.0
dtype: float64

In [None]:
df.mode(axis=0)

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend
0,Chris,A,Y,25.0,15.0,No
1,David,B,,75.0,17.0,
2,Henry,C,,,30.0,
3,Holmes,,,,33.0,
4,John,,,,51.0,
5,Judy,,,,59.0,
6,Marvin,,,,71.0,
7,Robert,,,,79.0,
8,Simon,,,,85.0,
9,Trent,,,,,


In [None]:
df.mode(axis=1)


Unnamed: 0,0,1,2,3,4,5
0,Henry,A,Y,90,85.0,Yes
1,John,C,N,85,51.0,Yes
2,David,F,N,10,17.0,No
3,Holmes,B,Y,75,71.0,No
4,Marvin,E,N,20,30.0,No
5,Simon,A,Y,92,79.0,Yes
6,Robert,B,Y,60,59.0,No
7,Trent,C,Y,75,33.0,No
8,Judy,Y,25,No,,
9,Chris,D,U,25,15.0,NO


In [None]:
x.std(axis=0)

ResearchScore    32.256093
ProjectScore     26.412329
dtype: float64

In [None]:
x.std(axis=1)

0     3.535534
1    24.041631
2     4.949747
3     2.828427
4     7.071068
5     9.192388
6     0.707107
7    29.698485
8          NaN
9     7.071068
dtype: float64

In [None]:
x.var(axis=0)

ResearchScore    1040.455556
ProjectScore      697.611111
dtype: float64

In [None]:
x.var(axis=1)

0     12.5
1    578.0
2     24.5
3      8.0
4     50.0
5     84.5
6      0.5
7    882.0
8      NaN
9     50.0
dtype: float64

In [None]:
x.cov()

Unnamed: 0,ResearchScore,ProjectScore
ResearchScore,1040.455556,716.138889
ProjectScore,716.138889,697.611111


In [None]:
x.corr()

Unnamed: 0,ResearchScore,ProjectScore
ResearchScore,1.0,0.840921
ProjectScore,0.840921,1.0
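
The correlation reported by `corr()` is the covariance divided by the product of the two standard deviations. The sketch below checks that relationship on a made-up frame without missing values; with NaN present the two sides can differ, because `cov()` uses only rows where both columns are valid while `std()` works column by column.

```python
import pandas as pd

# Made-up scores without missing values
x = pd.DataFrame({
    "ResearchScore": [90, 85, 10, 75, 20],
    "ProjectScore": [85.0, 51.0, 17.0, 71.0, 30.0],
})

# Pearson correlation = cov(X, Y) / (std(X) * std(Y))
cov_xy = x.cov().loc["ResearchScore", "ProjectScore"]
manual_corr = cov_xy / (x["ResearchScore"].std() * x["ProjectScore"].std())

pandas_corr = x.corr().loc["ResearchScore", "ProjectScore"]
print(manual_corr, pandas_corr)
```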


In [None]:
# Pandas dataframe.cumsum() is used to find the cumulative sum value over any axis.
# Each cell is populated with the cumulative sum of the values seen so far.
print(x)
print()
print(x.cumsum())

   ResearchScore  ProjectScore
0             90          85.0
1             85          51.0
2             10          17.0
3             75          71.0
4             20          30.0
5             92          79.0
6             60          59.0
7             75          33.0
8             25           NaN
9             25          15.0

   ResearchScore  ProjectScore
0             90          85.0
1            175         136.0
2            185         153.0
3            260         224.0
4            280         254.0
5            372         333.0
6            432         392.0
7            507         425.0
8            532           NaN
9            557         440.0


In [None]:
# Pandas dataframe.cumsum() is used to find the cumulative sum value over any axis.
# Each cell is populated with the cumulative sum of the values seen so far.
print(x)
print()
print(x.cumsum(axis=0,skipna=True))


   ResearchScore  ProjectScore
0             90          85.0
1             85          51.0
2             10          17.0
3             75          71.0
4             20          30.0
5             92          79.0
6             60          59.0
7             75          33.0
8             25           NaN
9             25          15.0

   ResearchScore  ProjectScore
0             90          85.0
1            175         136.0
2            185         153.0
3            260         224.0
4            280         254.0
5            372         333.0
6            432         392.0
7            507         425.0
8            532           NaN
9            557         440.0


The output is a DataFrame whose cells hold the cumulative sum of the values seen so far along the index axis. A NaN entry is skipped in the running total: its own cell stays NaN, and the sum resumes from the next valid value.
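
To make the effect of `skipna` concrete, the sketch below compares both settings on a short made-up Series.

```python
import numpy as np
import pandas as pd

s = pd.Series([85.0, 33.0, np.nan, 15.0])

# skipna=True (the default): the NaN cell stays NaN, the total carries on after it
skipped = s.cumsum(skipna=True)

# skipna=False: once a NaN is met, every later cell is NaN as well
propagated = s.cumsum(skipna=False)

print(skipped.tolist())
print(propagated.tolist())
```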

In [None]:
# Pandas dataframe.cumsum() is used to find the cumulative sum value over any axis.
# Each cell is populated with the cumulative sum of the values seen so far.
print(x)
print()
print(x.cumsum(axis=1,skipna=True))


   ResearchScore  ProjectScore
0             90          85.0
1             85          51.0
2             10          17.0
3             75          71.0
4             20          30.0
5             92          79.0
6             60          59.0
7             75          33.0
8             25           NaN
9             25          15.0

   ResearchScore  ProjectScore
0           90.0         175.0
1           85.0         136.0
2           10.0          27.0
3           75.0         146.0
4           20.0          50.0
5           92.0         171.0
6           60.0         119.0
7           75.0         108.0
8           25.0           NaN
9           25.0          40.0


In [None]:
x.sum()

Unnamed: 0,ProjectScore,ResearchScore
0,85.0,90
1,51.0,85
2,17.0,10
3,71.0,75
4,30.0,20
5,79.0,92
6,59.0,60
7,33.0,75
8,15.0,25


In [None]:
x.ProjectScore.sum()

440.0

### idxmax and idxmin
`.idxmax()` returns the index label of the first occurrence of the maximum along the requested axis. NA/null values are excluded from the search.

    Syntax: DataFrame.idxmax(axis=0, skipna=True)

    Parameters :
    axis : 0 or ‘index’ returns, for each column, the row label of its maximum; 1 or ‘columns’ returns, for each row, the column label of its maximum
    skipna : Exclude NA/null values. If an entire row/column is NA, the result will be NA

    Returns : idxmax : Series
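
Because `idxmax` and `idxmin` return labels rather than values, they combine naturally with `.loc` to look up other columns of the same row. A minimal sketch on a made-up four-row frame:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Henry", "John", "David", "Simon"],
    "ResearchScore": [90, 85, 10, 92],
    "ProjectScore": [85.0, 51.0, 17.0, 79.0],
})

# Row label of the highest ResearchScore
top_research_label = df["ResearchScore"].idxmax()

# Use the label with .loc to read another column of that row
top_research_name = df.loc[top_research_label, "Name"]
lowest_project_name = df.loc[df["ProjectScore"].idxmin(), "Name"]

print(top_research_label, top_research_name, lowest_project_name)
```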

In [None]:
df

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend
0,Henry,A,Y,90,85.0,Yes
1,John,C,N,85,51.0,Yes
2,David,F,N,10,17.0,No
3,Holmes,B,Y,75,71.0,No
4,Marvin,E,N,20,30.0,No
5,Simon,A,Y,92,79.0,Yes
6,Robert,B,Y,60,59.0,No
7,Trent,C,Y,75,33.0,No
8,Chris,D,U,25,15.0,NO


In [None]:
# extract numeric df
x = df[['ResearchScore', 'ProjectScore']]
x

Unnamed: 0,ResearchScore,ProjectScore
0,90,85.0
1,85,51.0
2,10,17.0
3,75,71.0
4,20,30.0
5,92,79.0
6,60,59.0
7,75,33.0
8,25,15.0


In [None]:
x.idxmax()

Unnamed: 0,0
ResearchScore,5
ProjectScore,0


In [None]:
x.idxmin()

Unnamed: 0,0
ResearchScore,2
ProjectScore,8


In [None]:
x

Unnamed: 0,ResearchScore,ProjectScore
0,90,85.0
1,85,51.0
2,10,17.0
3,75,71.0
4,20,30.0
5,92,79.0
6,60,59.0
7,75,33.0
8,25,15.0


In [None]:
x.idxmax(axis = 1)

Unnamed: 0,0
0,ResearchScore
1,ResearchScore
2,ProjectScore
3,ResearchScore
4,ProjectScore
5,ResearchScore
6,ResearchScore
7,ResearchScore
8,ResearchScore


In [None]:
# The idxmin() method returns a Series with the index of the minimum value for each column.
x.idxmin(axis = 0)

Unnamed: 0,0
ResearchScore,2
ProjectScore,9


In [None]:
x

Unnamed: 0,ResearchScore,ProjectScore
0,90,85.0
1,85,51.0
2,10,17.0
3,75,71.0
4,20,30.0
5,92,79.0
6,60,59.0
7,75,33.0
8,25,
9,25,15.0


### nlargest and nsmallest
Pandas `nlargest()` method is used to get n largest values from a data frame or a series.

    Syntax:

    DataFrame.nlargest(n, columns, keep='first')

    Parameters:

    n: int, Number of values to select
    columns: Column label(s) to order by. A Series can also be selected first and nlargest called on it.
    [For example: data["age"].nlargest(3) OR data.nlargest(3, "age")]

    keep: which rows to select when duplicates exist. Options are ‘first’, ‘last’ or ‘all’
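
The `keep` parameter only matters when the n-th value is tied. The sketch below uses a made-up frame with two students on 75 to show the default `'first'` next to `'all'`, which may return more than `n` rows.

```python
import pandas as pd

# Holmes and Trent are tied on 75
df = pd.DataFrame({
    "Name": ["Henry", "John", "Holmes", "Trent", "Robert"],
    "ResearchScore": [90, 85, 75, 75, 60],
})

# keep='first' is the default: only the first of the tied rows is returned
top3_first = df.nlargest(3, "ResearchScore")

# keep='all' returns every row tied for the last place
top3_all = df.nlargest(3, "ResearchScore", keep="all")

print(top3_first)
print(top3_all)
```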

In [None]:
# return the records of top 3 students on the basis of ResearchScore
# sort this dataframe in descending order on ResearchScore => df.head(3)
# df.sort_values(by = 'ResearchScore', ascending = False).head(3)

df

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend
0,Henry,A,Y,90,85.0,Yes
1,John,C,N,85,51.0,Yes
2,David,F,N,10,17.0,No
3,Holmes,B,Y,75,71.0,No
4,Marvin,E,N,20,30.0,No
5,Simon,A,Y,92,79.0,Yes
6,Robert,B,Y,60,59.0,No
7,Trent,C,Y,75,33.0,No
8,Chris,D,U,25,15.0,NO


In [None]:
# extract the information of top 3 performers in research score
# Select *
# from Student
# order by ResearchScore desc
# limit 3

# nlargest
df.nlargest(3, 'ResearchScore')

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend
5,Simon,A,Y,92,79.0,Yes
0,Henry,A,Y,90,85.0,Yes
1,John,C,N,85,51.0,Yes


In [None]:
df.nlargest(2, 'ProjectScore')

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend
0,Henry,A,Y,90,85.0,Yes
5,Simon,A,Y,92,79.0,Yes


In [None]:
# nsmallest
df.nsmallest(3, 'ResearchScore')

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend
2,David,F,N,10,17.0,No
4,Marvin,E,N,20,30.0,No
8,Chris,D,U,25,15.0,NO


In [None]:
# Pandas nsmallest() method is used to get n least values from a data frame or a series.
df.nsmallest(1,'ResearchScore')

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend
2,David,F,N,10,17.0,No


### Add Column

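To add a column to a DataFrame, assign to a new column name, much like setting a key in a dictionary:

`df['New_Column'] = column_data`

Here `column_data` can be a single value, a list or Series of matching length, or an expression built from existing columns, optionally based on a condition.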
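
The three kinds of assignment can be sketched as follows. The `marks` table and the `Passed` column with its threshold of 100 are hypothetical, chosen only to illustrate a condition-based column; `np.where` is one of several ways to do this.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with the same score columns as the notebook's table.
marks = pd.DataFrame({
    "Name": ["Henry", "David", "Robert"],
    "ResearchScore": [90, 10, 60],
    "ProjectScore": [85.0, 17.0, 59.0],
})

# A scalar is broadcast to every row.
marks["FullMarks"] = 200

# Arithmetic on columns is applied element-wise, row by row.
marks["Marks_obtained"] = marks["ResearchScore"] + marks["ProjectScore"]

# A condition can drive the new values: np.where picks "Yes" or "No" per row.
marks["Passed"] = np.where(marks["Marks_obtained"] >= 100, "Yes", "No")

print(marks)
```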


In [None]:
df

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend
0,Henry,A,Y,90,85.0,Yes
1,John,C,N,85,51.0,Yes
2,David,F,N,10,17.0,No
3,Holmes,B,Y,75,71.0,No
4,Marvin,E,N,20,30.0,No
5,Simon,A,Y,92,79.0,Yes
6,Robert,B,Y,60,59.0,No
7,Trent,C,Y,75,33.0,No
8,Chris,D,U,25,15.0,NO


In [None]:
# Full Marks
df['FullMarks'] = 200

In [None]:
df

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend,FullMarks
0,Henry,A,Y,90,85.0,Yes,200
1,John,C,N,85,51.0,Yes,200
2,David,F,N,10,17.0,No,200
3,Holmes,B,Y,75,71.0,No,200
4,Marvin,E,N,20,30.0,No,200
5,Simon,A,Y,92,79.0,Yes,200
6,Robert,B,Y,60,59.0,No,200
7,Trent,C,Y,75,33.0,No,200
8,Chris,D,U,25,15.0,NO,200


In [None]:
# Marks_obtained = ResearchScore + ProjectScore
df['Marks_obtained'] = df['ResearchScore'] + df['ProjectScore']
df.head(3)

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend,FullMarks,Marks_obtained
0,Henry,A,Y,90,85.0,Yes,200,175.0
1,John,C,N,85,51.0,Yes,200,136.0
2,David,F,N,10,17.0,No,200,27.0


In [None]:
# Perc_Marks = Marks_obtained * 100 / FullMarks
df['Perc_Marks'] = 100 * df['Marks_obtained'] / df['FullMarks']

In [None]:
df.sample(4)

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend,FullMarks,Marks_obtained,Perc_Marks
6,Robert,B,Y,60,59.0,No,200,119.0,59.5
3,Holmes,B,Y,75,71.0,No,200,146.0,73.0
4,Marvin,E,N,20,30.0,No,200,50.0,25.0
1,John,C,N,85,51.0,Yes,200,136.0,68.0


In [None]:
df

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend,Full Marks,Marks_obtained,Perc_Marks
0,Henry,A,Y,90,85.0,Yes,200,175.0,87.5
1,John,C,N,85,51.0,Yes,200,136.0,68.0
2,David,F,N,10,17.0,No,200,27.0,13.5
3,Holmes,B,Y,75,71.0,No,200,146.0,73.0
4,Marvin,E,N,20,30.0,No,200,50.0,25.0
5,Simon,A,Y,92,79.0,Yes,200,171.0,85.5
6,Robert,B,Y,60,59.0,No,200,119.0,59.5
7,Trent,C,Y,75,33.0,No,200,108.0,54.0
8,Judy,,Y,25,,No,200,,
9,Chris,D,U,25,15.0,NO,200,40.0,20.0


In [None]:
df['Perc_Marks'].astype(str) + "%"

Unnamed: 0,Perc_Marks
0,87.5%
1,68.0%
2,13.5%
3,73.0%
4,25.0%
5,85.5%
6,59.5%
7,54.0%
8,20.0%


In [None]:
def grade_allotment(marks):
  if marks >= 60:
    return "First"
  elif marks >= 35 and marks < 60:
    return "Second"
  elif marks >= 0 and marks < 35:
    return "Third"
  else:
    return "Others"

In [None]:
df

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend,FullMarks,Marks_obtained,Perc_Marks
0,Henry,A,Y,90,85.0,Yes,200,175.0,87.5
1,John,C,N,85,51.0,Yes,200,136.0,68.0
2,David,F,N,10,17.0,No,200,27.0,13.5
3,Holmes,B,Y,75,71.0,No,200,146.0,73.0
4,Marvin,E,N,20,30.0,No,200,50.0,25.0
5,Simon,A,Y,92,79.0,Yes,200,171.0,85.5
6,Robert,B,Y,60,59.0,No,200,119.0,59.5
7,Trent,C,Y,75,33.0,No,200,108.0,54.0
8,Chris,D,U,25,15.0,NO,200,40.0,20.0


In [None]:
df['Perc_Marks'][2]

np.float64(13.5)

In [None]:
df['grade1'] = ''

# Assign with df.loc[row, column] in a single step. Chained assignment such as
# df['grade1'][i] = "First" sets the value on an intermediate object, raises a warning,
# and will not update df under Copy-on-Write (the default behaviour in pandas 3.0).
# See https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
for i in range(len(df)):
  if df.loc[i, 'Perc_Marks'] >= 60:
    df.loc[i, 'grade1'] = "First"
  elif df.loc[i, 'Perc_Marks'] >= 35 and df.loc[i, 'Perc_Marks'] < 60:
    df.loc[i, 'grade1'] = "Second"
  else:
    df.loc[i, 'grade1'] = "Third"

df

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend,FullMarks,Marks_obtained,Perc_Marks,grade1
0,Henry,A,Y,90,85.0,Yes,200,175.0,87.5,First
1,John,C,N,85,51.0,Yes,200,136.0,68.0,First
2,David,F,N,10,17.0,No,200,27.0,13.5,Third
3,Holmes,B,Y,75,71.0,No,200,146.0,73.0,First
4,Marvin,E,N,20,30.0,No,200,50.0,25.0,Third
5,Simon,A,Y,92,79.0,Yes,200,171.0,85.5,First
6,Robert,B,Y,60,59.0,No,200,119.0,59.5,Second
7,Trent,C,Y,75,33.0,No,200,108.0,54.0,Second
8,Chris,D,U,25,15.0,NO,200,40.0,20.0,Third


In [None]:
df['grade2'] = ''

# df.loc[i, 'grade2'] assigns in one step and avoids the chained-assignment warning
# that df['grade2'][i] = ... raises (it stops updating df under Copy-on-Write in pandas 3.0).
for i in range(len(df)):
  df.loc[i, 'grade2'] = grade_allotment(df.loc[i, 'Perc_Marks'])

df

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend,FullMarks,Marks_obtained,Perc_Marks,grade1,grade2
0,Henry,A,Y,90,85.0,Yes,200,175.0,87.5,First,First
1,John,C,N,85,51.0,Yes,200,136.0,68.0,First,First
2,David,F,N,10,17.0,No,200,27.0,13.5,Third,Third
3,Holmes,B,Y,75,71.0,No,200,146.0,73.0,First,First
4,Marvin,E,N,20,30.0,No,200,50.0,25.0,Third,Third
5,Simon,A,Y,92,79.0,Yes,200,171.0,85.5,First,First
6,Robert,B,Y,60,59.0,No,200,119.0,59.5,Second,Second
7,Trent,C,Y,75,33.0,No,200,108.0,54.0,Second,Second
8,Chris,D,U,25,15.0,NO,200,40.0,20.0,Third,Third


In [None]:
def grade_allotment(marks):
  if marks >= 60:
    return "First"
  elif marks >= 35 and marks < 60:
    return "Second"
  elif marks >= 0 and marks < 35:
    return "Third"
  else:
    return "Others"


df['grade3'] = df['Perc_Marks'].apply(grade_allotment)
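
The same grading rule can also be written without a Python-level loop or `apply()`. The sketch below assumes NumPy's `np.select` as the vectorised alternative and uses a small hypothetical `perc` Series rather than the notebook's `df`; it checks that both approaches agree, including for a missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical percentages, including a missing value.
perc = pd.Series([87.5, 59.5, 13.5, np.nan], name="Perc_Marks")

def grade_allotment(marks):
    # Same rule as in the notebook: NaN fails every comparison and falls to "Others".
    if marks >= 60:
        return "First"
    elif marks >= 35:
        return "Second"
    elif marks >= 0:
        return "Third"
    return "Others"

# np.select evaluates the conditions in order and takes the first one that is True.
conditions = [perc >= 60, perc >= 35, perc >= 0]
choices = ["First", "Second", "Third"]
vectorised = pd.Series(np.select(conditions, choices, default="Others"), index=perc.index)

applied = perc.apply(grade_allotment)
print(vectorised.tolist())
print(applied.tolist())
```

On large tables the vectorised form is usually faster because the comparisons run on whole columns at once.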

In [None]:
df

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend,FullMarks,Marks_obtained,Perc_Marks,grade1,grade2,grade3
0,Henry,A,Y,90,85.0,Yes,200,175.0,87.5,First,First,First
1,John,C,N,85,51.0,Yes,200,136.0,68.0,First,First,First
2,David,F,N,10,17.0,No,200,27.0,13.5,Third,Third,Third
3,Holmes,B,Y,75,71.0,No,200,146.0,73.0,First,First,First
4,Marvin,E,N,20,30.0,No,200,50.0,25.0,Third,Third,Third
5,Simon,A,Y,92,79.0,Yes,200,171.0,85.5,First,First,First
6,Robert,B,Y,60,59.0,No,200,119.0,59.5,Second,Second,Second
7,Trent,C,Y,75,33.0,No,200,108.0,54.0,Second,Second,Second
8,Chris,D,U,25,15.0,NO,200,40.0,20.0,Third,Third,Third


In [None]:
df.columns

Index(['Name', 'OverallGrade', 'Obedient', 'ResearchScore', 'ProjectScore',
       'Recommend', 'FullMarks', 'Marks_obtained', 'Perc_Marks', 'grade1',
       'grade2', 'grade3'],
      dtype='object')

In [None]:
df.head(2)

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend,FullMarks,Marks_obtained,Perc_Marks,grade1,grade2,grade3
0,Henry,A,Y,90,85.0,Yes,200,175.0,87.5,First,First,First
1,John,C,N,85,51.0,Yes,200,136.0,68.0,First,First,First


In [None]:
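# rename() ignores keys that are not existing column names, so 'Full Marks' (with a space)
# matches nothing here: only 'OverallGrade' is renamed and the column stays 'FullMarks'.
df.rename(columns = {'OverallGrade': 'grade', 'Full Marks': "Total_marks"}, inplace = True)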

In [None]:
df.head(2)

Unnamed: 0,Name,grade,Obedient,ResearchScore,ProjectScore,Recommend,FullMarks,Marks_obtained,Perc_Marks,grade1,grade2,grade3
0,Henry,A,Y,90,85.0,Yes,200,175.0,87.5,First,First,First
1,John,C,N,85,51.0,Yes,200,136.0,68.0,First,First,First


In [None]:
df.columns = ['Name', 'Old_grade', 'Obedient', 'Research_Score', 'ProjectScore',
       'Recommend', 'Total_marks', 'Marks_obtained', 'Perc_Marks',
       'New_grade']


df.head(2)

Unnamed: 0,Name,Old_grade,Obedient,Research_Score,ProjectScore,Recommend,Total_marks,Marks_obtained,Perc_Marks,New_grade
0,Henry,A,Y,90,85.0,Yes,200,175.0,87.5,First
1,John,C,N,85,51.0,Yes,200,136.0,68.0,First


In [None]:
# rearrange order


#### Using the DataFrame.assign() method

`assign()` returns a new DataFrame that contains all the original columns plus the new ones; the original DataFrame is not modified.
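
A minimal sketch of this behaviour, using a hypothetical two-row `scores` table: the original frame keeps its columns, and a callable passed to `assign()` can refer to a column created earlier in the same call.

```python
import pandas as pd

# Hypothetical sample data.
scores = pd.DataFrame({"ResearchScore": [90, 85], "ProjectScore": [85.0, 51.0]})

# assign() returns a new DataFrame; scores itself is left untouched.
with_total = scores.assign(Total=scores["ResearchScore"] + scores["ProjectScore"])

# A callable receives the DataFrame built so far, so Mean can use the new Total column.
with_mean = scores.assign(
    Total=lambda d: d["ResearchScore"] + d["ProjectScore"],
    Mean=lambda d: d["Total"] / 2,
)

print(scores.columns.tolist())
print(with_mean)
```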

In [None]:
x=x.assign(Mean=x.mean(axis=1))

In [None]:
x

Unnamed: 0,ResearchScore,ProjectScore,TOTAL,Mean
0,90,85.0,175.0,116.666667
1,85,51.0,136.0,90.666667
2,10,17.0,27.0,18.0
3,75,71.0,146.0,97.333333
4,20,30.0,50.0,33.333333
5,92,79.0,171.0,114.0
6,60,59.0,119.0,79.333333
7,75,33.0,108.0,72.0
8,25,,,25.0
9,25,15.0,40.0,26.666667


### Rename column name

One way to rename columns in a pandas DataFrame is the `rename()` method. It is most useful when only some columns need new names, because the mapping lists only the columns to be renamed. Assigning a list to `df.columns` instead replaces every column name at once.
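
One caveat is that `rename()` ignores mapping keys that do not match any column. The sketch below uses a hypothetical one-row `df_demo` to show this, and assumes the `errors="raise"` option as a way to surface such typos.

```python
import pandas as pd

# Hypothetical one-row frame; note the column is 'FullMarks' without a space.
df_demo = pd.DataFrame({"OverallGrade": ["A"], "FullMarks": [200]})

# 'Full Marks' is not a column, so by default that entry is silently skipped.
renamed = df_demo.rename(columns={"OverallGrade": "grade", "Full Marks": "Total_marks"})
print(renamed.columns.tolist())

# With errors="raise", a key that matches no column raises a KeyError instead.
try:
    df_demo.rename(columns={"Full Marks": "Total_marks"}, errors="raise")
    typo_detected = False
except KeyError:
    typo_detected = True
print(typo_detected)
```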

In [None]:
# pd.DataFrame(list,columns = [], index = [])
df.columns = ['Name', 'OverallGrade', 'Obedient', 'ResearchScore', 'ProjectScore',
       'Recommend', 'TotalMarks', 'MarksObtained', 'Perc_Marks']

In [None]:
df

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend,TotalMarks,MarksObtained,Perc_Marks
0,Henry,A,Y,90,85.0,Yes,200,175.0,87.5
1,John,C,N,85,51.0,Yes,200,136.0,68.0
2,David,F,N,10,17.0,No,200,27.0,13.5
3,Holmes,B,Y,75,71.0,No,200,146.0,73.0
4,Marvin,E,N,20,30.0,No,200,50.0,25.0
5,Simon,A,Y,92,79.0,Yes,200,171.0,85.5
6,Robert,B,Y,60,59.0,No,200,119.0,59.5
7,Trent,C,Y,75,33.0,No,200,108.0,54.0
8,Chris,D,U,25,15.0,No,200,40.0,20.0


In [None]:
df.rename(columns = {"Perc_Marks":"PercentageMarks"}, inplace = True)

In [None]:
x

Unnamed: 0,ResearchScore,ProjectScore,Total,Mean
0,90,85.0,175.0,116.666667
1,85,51.0,136.0,90.666667
2,10,17.0,27.0,18.0
3,75,71.0,146.0,97.333333
4,20,30.0,50.0,33.333333
5,92,79.0,171.0,114.0
6,60,59.0,119.0,79.333333
7,75,33.0,108.0,72.0
8,25,,,25.0
9,25,15.0,40.0,26.666667


### Remove row or column
Remove rows or columns by specifying label names and corresponding axis, or by specifying directly index or column names. When using a multi-index, labels on different levels can be removed by specifying the level.

`DataFrame.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')`
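
The signature above can be exercised as in the following sketch, which uses a hypothetical three-row `table`. The `columns=` and `index=` keywords are equivalent to passing `labels` with `axis=1` or `axis=0`, and `errors="ignore"` skips labels that do not exist.

```python
import pandas as pd

# Hypothetical sample data.
table = pd.DataFrame({
    "Name": ["Henry", "John", "David"],
    "FullMarks": [200, 200, 200],
    "Perc_Marks": [87.5, 68.0, 13.5],
})

# Drop a column: same as table.drop("FullMarks", axis=1).
no_full = table.drop(columns="FullMarks")

# Drop a row by its index label: same as table.drop(1, axis=0).
no_row1 = table.drop(index=1)

# A missing label normally raises KeyError; errors="ignore" returns the frame unchanged.
unchanged = table.drop(columns="Missing", errors="ignore")

print(no_full.columns.tolist())
print(no_row1.index.tolist())
```

Because `inplace` defaults to `False`, each call returns a new DataFrame and `table` keeps all of its rows and columns.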

In [None]:
df.drop('FullMarks', axis = 1)

Unnamed: 0,Name,grade,Obedient,ResearchScore,ProjectScore,Recommend,Marks_obtained,Perc_Marks,grade1,grade2,grade3
0,Henry,A,Y,90,85.0,Yes,175.0,87.5,First,First,First
1,John,C,N,85,51.0,Yes,136.0,68.0,First,First,First
2,David,F,N,10,17.0,No,27.0,13.5,Third,Third,Third
3,Holmes,B,Y,75,71.0,No,146.0,73.0,First,First,First
4,Marvin,E,N,20,30.0,No,50.0,25.0,Third,Third,Third
5,Simon,A,Y,92,79.0,Yes,171.0,85.5,First,First,First
6,Robert,B,Y,60,59.0,No,119.0,59.5,Second,Second,Second
7,Trent,C,Y,75,33.0,No,108.0,54.0,Second,Second,Second
8,Chris,D,U,25,15.0,NO,40.0,20.0,Third,Third,Third


In [None]:
df.drop([8],axis=0,inplace=True)
df

Unnamed: 0,Name,grade,Obedient,ResearchScore,ProjectScore,Recommend,FullMarks,Marks_obtained,Perc_Marks,grade1,grade2,grade3
0,Henry,A,Y,90,85.0,Yes,200,175.0,87.5,First,First,First
1,John,C,N,85,51.0,Yes,200,136.0,68.0,First,First,First
2,David,F,N,10,17.0,No,200,27.0,13.5,Third,Third,Third
3,Holmes,B,Y,75,71.0,No,200,146.0,73.0,First,First,First
4,Marvin,E,N,20,30.0,No,200,50.0,25.0,Third,Third,Third
5,Simon,A,Y,92,79.0,Yes,200,171.0,85.5,First,First,First
6,Robert,B,Y,60,59.0,No,200,119.0,59.5,Second,Second,Second
7,Trent,C,Y,75,33.0,No,200,108.0,54.0,Second,Second,Second


In [None]:
df

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend,Total_Marks,Marks_obtained,Perc_Marks
0,Henry,A,Y,90,85.0,Yes,200,175.0,87.5
1,John,C,N,85,51.0,Yes,200,136.0,68.0
2,David,F,N,10,17.0,No,200,27.0,13.5
3,Holmes,B,Y,75,71.0,No,200,146.0,73.0
4,Marvin,E,N,20,30.0,No,200,50.0,25.0
5,Simon,A,Y,92,79.0,Yes,200,171.0,85.5
6,Robert,B,Y,60,59.0,No,200,119.0,59.5
7,Trent,C,Y,75,33.0,No,200,108.0,54.0
8,Judy,,Y,25,,No,200,,
9,Chris,D,U,25,15.0,NO,200,40.0,20.0


In [None]:
df.drop('Total_Marks', axis =1 )

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend,Marks_obtained,Perc_Marks
0,Henry,A,Y,90,85.0,Yes,175.0,87.5
1,John,C,N,85,51.0,Yes,136.0,68.0
2,David,F,N,10,17.0,No,27.0,13.5
3,Holmes,B,Y,75,71.0,No,146.0,73.0
4,Marvin,E,N,20,30.0,No,50.0,25.0
5,Simon,A,Y,92,79.0,Yes,171.0,85.5
6,Robert,B,Y,60,59.0,No,119.0,59.5
7,Trent,C,Y,75,33.0,No,108.0,54.0
8,Judy,,Y,25,,No,,
9,Chris,D,U,25,15.0,NO,40.0,20.0


In [None]:
df.drop([8], axis =0 , inplace  = True)

In [None]:
df

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend,Total_Marks,Marks_obtained,Perc_Marks
0,Henry,A,Y,90,85.0,Yes,200,175.0,87.5
1,John,C,N,85,51.0,Yes,200,136.0,68.0
2,David,F,N,10,17.0,No,200,27.0,13.5
3,Holmes,B,Y,75,71.0,No,200,146.0,73.0
4,Marvin,E,N,20,30.0,No,200,50.0,25.0
5,Simon,A,Y,92,79.0,Yes,200,171.0,85.5
6,Robert,B,Y,60,59.0,No,200,119.0,59.5
7,Trent,C,Y,75,33.0,No,200,108.0,54.0
9,Chris,D,U,25,15.0,NO,200,40.0,20.0


### Reset Index
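After a row is deleted from a DataFrame, its index label is removed as well, which leaves a gap in the index. To renumber the rows from 0, use `.reset_index()`; pass `drop=True` to discard the old index instead of keeping it as a new column.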
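
A minimal sketch of the gap and the two ways of resetting, using a hypothetical three-row `names` table:

```python
import pandas as pd

# Hypothetical sample data.
names = pd.DataFrame({"Name": ["Henry", "John", "David"]})

# Dropping the row labelled 1 leaves the index as [0, 2].
trimmed = names.drop(index=1)

# reset_index() keeps the old labels in a new column called 'index'.
with_old = trimmed.reset_index()

# drop=True discards the old labels and simply renumbers from 0.
renumbered = trimmed.reset_index(drop=True)

print(trimmed.index.tolist())
print(with_old)
print(renumbered)
```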

In [None]:
df.reset_index()

Unnamed: 0,index,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend,Total_Marks,Marks_obtained,Perc_Marks
0,0,Henry,A,Y,90,85.0,Yes,200,175.0,87.5
1,1,John,C,N,85,51.0,Yes,200,136.0,68.0
2,2,David,F,N,10,17.0,No,200,27.0,13.5
3,3,Holmes,B,Y,75,71.0,No,200,146.0,73.0
4,4,Marvin,E,N,20,30.0,No,200,50.0,25.0
5,5,Simon,A,Y,92,79.0,Yes,200,171.0,85.5
6,6,Robert,B,Y,60,59.0,No,200,119.0,59.5
7,7,Trent,C,Y,75,33.0,No,200,108.0,54.0
8,9,Chris,D,U,25,15.0,NO,200,40.0,20.0


In [None]:
df.reset_index(drop = True, inplace = True)

In [None]:
df

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend,Total_Marks,Marks_obtained,Perc_Marks
0,Henry,A,Y,90,85.0,Yes,200,175.0,87.5
1,John,C,N,85,51.0,Yes,200,136.0,68.0
2,David,F,N,10,17.0,No,200,27.0,13.5
3,Holmes,B,Y,75,71.0,No,200,146.0,73.0
4,Marvin,E,N,20,30.0,No,200,50.0,25.0
5,Simon,A,Y,92,79.0,Yes,200,171.0,85.5
6,Robert,B,Y,60,59.0,No,200,119.0,59.5
7,Trent,C,Y,75,33.0,No,200,108.0,54.0
8,Chris,D,U,25,15.0,NO,200,40.0,20.0


In [None]:
list(range(101, 110))

[101, 102, 103, 104, 105, 106, 107, 108, 109]

In [None]:
df.index = list(range(101, 110))

In [None]:
df

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend,Total_Marks,Marks_obtained,Perc_Marks
101,Henry,A,Y,90,85.0,Yes,200,175.0,87.5
102,John,C,N,85,51.0,Yes,200,136.0,68.0
103,David,F,N,10,17.0,No,200,27.0,13.5
104,Holmes,B,Y,75,71.0,No,200,146.0,73.0
105,Marvin,E,N,20,30.0,No,200,50.0,25.0
106,Simon,A,Y,92,79.0,Yes,200,171.0,85.5
107,Robert,B,Y,60,59.0,No,200,119.0,59.5
108,Trent,C,Y,75,33.0,No,200,108.0,54.0
109,Chris,D,U,25,15.0,NO,200,40.0,20.0


### Sort The DataFrame
#### Sort Values
The pandas `sort_values()` method sorts a DataFrame in ascending or descending order of the column(s) passed to it. It differs from Python's built-in `sorted()` function, which cannot sort the rows of a DataFrame and offers no way to select a particular column.
Let's discuss `DataFrame.sort_values()`, starting with sorting by a single column:
    Syntax:


    DataFrame.sort_values(by, axis=0, ascending=True, inplace=False,
    kind='quicksort', na_position='last')


    Every parameter except 'by' has a default value.
    Parameters:

    by: single column name or list of column names to sort the DataFrame by.
    axis: 0 or 'index' to sort rows, 1 or 'columns' to sort columns.
    ascending: Boolean (or list of Booleans, one per name in 'by'). Sorts in ascending order if True.
    inplace: Boolean. If True, sorts the passed DataFrame itself and returns None.
    kind: sorting algorithm to use: 'quicksort', 'mergesort', 'heapsort'
    or 'stable'.
    na_position: 'first' or 'last', sets the position of
    null values. Default is 'last'.
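
The parameters above can be combined as in this sketch. The `students` table is hypothetical; it contains one tie in `ResearchScore` and one missing `ProjectScore` so that the effect of a per-column `ascending` list and of `na_position` is visible.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with a tie (75) and a missing project score.
students = pd.DataFrame({
    "Name": ["Holmes", "Trent", "Judy", "Simon"],
    "ResearchScore": [75, 75, 25, 92],
    "ProjectScore": [71.0, 33.0, np.nan, 79.0],
})

# ResearchScore descending; ties are broken by ProjectScore ascending.
ranked = students.sort_values(
    by=["ResearchScore", "ProjectScore"], ascending=[False, True]
)

# Put rows with a missing ProjectScore first instead of last.
nan_first = students.sort_values(by="ProjectScore", na_position="first")

print(ranked)
print(nan_first)
```

Both calls return new DataFrames; `students` keeps its original row order unless the result is assigned back or `inplace=True` is used.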

In [None]:
df

Unnamed: 0,Name,grade,Obedient,ResearchScore,ProjectScore,Recommend,FullMarks,Marks_obtained,Perc_Marks,grade1,grade2,grade3
0,Henry,A,Y,90,85.0,Yes,200,175.0,87.5,First,First,First
1,John,C,N,85,51.0,Yes,200,136.0,68.0,First,First,First
2,David,F,N,10,17.0,No,200,27.0,13.5,Third,Third,Third
3,Holmes,B,Y,75,71.0,No,200,146.0,73.0,First,First,First
4,Marvin,E,N,20,30.0,No,200,50.0,25.0,Third,Third,Third
5,Simon,A,Y,92,79.0,Yes,200,171.0,85.5,First,First,First
6,Robert,B,Y,60,59.0,No,200,119.0,59.5,Second,Second,Second
7,Trent,C,Y,75,33.0,No,200,108.0,54.0,Second,Second,Second


In [None]:
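# Pitfall: sort_values() returns a new DataFrame, so reset_index(..., inplace=True) modifies
# that temporary copy and df itself stays unsorted (see the next output).
# To keep the result, assign it back: df = df.sort_values(...).reset_index(drop = True)
df.sort_values(by = 'Perc_Marks', ascending = False).reset_index(drop = True, inplace = True)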

In [None]:
df

Unnamed: 0,Name,grade,Obedient,ResearchScore,ProjectScore,Recommend,FullMarks,Marks_obtained,Perc_Marks,grade1,grade2,grade3
0,Henry,A,Y,90,85.0,Yes,200,175.0,87.5,First,First,First
1,John,C,N,85,51.0,Yes,200,136.0,68.0,First,First,First
2,David,F,N,10,17.0,No,200,27.0,13.5,Third,Third,Third
3,Holmes,B,Y,75,71.0,No,200,146.0,73.0,First,First,First
4,Marvin,E,N,20,30.0,No,200,50.0,25.0,Third,Third,Third
5,Simon,A,Y,92,79.0,Yes,200,171.0,85.5,First,First,First
6,Robert,B,Y,60,59.0,No,200,119.0,59.5,Second,Second,Second
7,Trent,C,Y,75,33.0,No,200,108.0,54.0,Second,Second,Second


In [None]:
df.sort_values(by = "ResearchScore", ascending = False ).reset_index(drop = True)

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend,Full_marks,Marks_obtained,Perc_Marks
0,Simon,A,Y,92,79.0,Yes,200,171.0,85.5
1,Henry,A,Y,90,85.0,Yes,200,175.0,87.5
2,John,C,N,85,51.0,Yes,200,136.0,68.0
3,Holmes,B,Y,75,71.0,No,200,146.0,73.0
4,Trent,C,Y,75,33.0,No,200,108.0,54.0
5,Robert,B,Y,60,59.0,No,200,119.0,59.5
6,Judy,,Y,25,,No,200,,
7,Chris,D,U,25,15.0,NO,200,40.0,20.0
8,Marvin,E,N,20,30.0,No,200,50.0,25.0
9,David,F,N,10,17.0,No,200,27.0,13.5


In [None]:
df.sort_values(by=['ResearchScore', 'ProjectScore'])

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend,Total_Marks,Marks_obtained,Perc_Marks
103,David,F,N,10,17.0,No,200,27.0,13.5
105,Marvin,E,N,20,30.0,No,200,50.0,25.0
109,Chris,D,U,25,15.0,NO,200,40.0,20.0
107,Robert,B,Y,60,59.0,No,200,119.0,59.5
108,Trent,C,Y,75,33.0,No,200,108.0,54.0
104,Holmes,B,Y,75,71.0,No,200,146.0,73.0
102,John,C,N,85,51.0,Yes,200,136.0,68.0
101,Henry,A,Y,90,85.0,Yes,200,175.0,87.5
106,Simon,A,Y,92,79.0,Yes,200,171.0,85.5


In [None]:
df=df.sort_values(by=['ResearchScore', 'ProjectScore'])
df

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend,Total_Marks,Marks_obtained,Perc_Marks
103,David,F,N,10,17.0,No,200,27.0,13.5
105,Marvin,E,N,20,30.0,No,200,50.0,25.0
109,Chris,D,U,25,15.0,NO,200,40.0,20.0
107,Robert,B,Y,60,59.0,No,200,119.0,59.5
108,Trent,C,Y,75,33.0,No,200,108.0,54.0
104,Holmes,B,Y,75,71.0,No,200,146.0,73.0
102,John,C,N,85,51.0,Yes,200,136.0,68.0
101,Henry,A,Y,90,85.0,Yes,200,175.0,87.5
106,Simon,A,Y,92,79.0,Yes,200,171.0,85.5


In [None]:
df.reset_index(inplace=True, drop = True)

In [None]:
df

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend,TotalMarks,MarksObtained,PercentageMarks
0,David,F,N,10,17.0,No,200,27.0,13.5
1,Marvin,E,N,20,30.0,No,200,50.0,25.0
2,Chris,D,U,25,15.0,No,200,40.0,20.0
3,Robert,B,Y,60,59.0,No,200,119.0,59.5
4,Holmes,B,Y,75,71.0,No,200,146.0,73.0
5,Trent,C,Y,75,33.0,No,200,108.0,54.0
6,John,C,N,85,51.0,Yes,200,136.0,68.0
7,Henry,A,Y,90,85.0,Yes,200,175.0,87.5
8,Simon,A,Y,92,79.0,Yes,200,171.0,85.5


In [None]:
df=df.sort_values(by='ResearchScore',ascending=False)
df

In [None]:
df=df.sort_values(by=['OverallGrade','ResearchScore'])
df

In [None]:
df=df.sort_values(by=['OverallGrade','ResearchScore'],ascending =False)
df

In [None]:
x.reset_index(inplace=True)
x

In [None]:
x.drop(columns=['index'],inplace=True)
x

In [None]:
x=df.sort_values(by=['ResearchScore','ProjectScore'])
x

In [None]:
x.sort_index()                  # sort the DataFrame with respect to index

In [None]:
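# numeric_only=True averages the numeric columns only; text columns such as Name cannot be averaged
x['Mean']=x.mean(axis=1, numeric_only=True)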

In [None]:
x

In [None]:
import numpy as np
np.arange(8,80,9)

In [None]:
import numpy as np
x['TIMEPASS'] = np.array([1,2,3,4,5,6,7,8,9])
x

### Drop duplicates
The pandas `drop_duplicates()` method removes duplicate rows from a DataFrame.

`DataFrame.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)`

Return a DataFrame with duplicate rows removed.

Restricting the comparison to certain columns is optional. Index labels are ignored when identifying duplicates.

Parameters

    subset : column label or sequence of labels, optional

        Only consider these columns when identifying duplicates. By default all columns are used.
    keep : {'first', 'last', False}, default 'first'

        Which duplicates (if any) to keep. 'first' keeps the first occurrence, 'last' keeps the last occurrence, and False drops all duplicates.
    inplace : bool, default False

        If False, return a copy. Otherwise, do operation inplace and return None.
    ignore_index : bool, default False

        If True, the resulting rows are relabeled 0, 1, ..., n - 1.

Returns

    DataFrame or None

        DataFrame with duplicates removed or None if inplace=True.
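
Two options that the cells in this section do not exercise are `keep=False` and `ignore_index=True`. The sketch below reuses the noodle ratings data from this section under the name `noodles` (a name chosen here for illustration) and also shows `duplicated()`, which returns the Boolean mask that `drop_duplicates()` acts on.

```python
import pandas as pd

# Same data as the notebook's example DataFrame in this section.
noodles = pd.DataFrame({
    "brand": ["Yum Yum", "Yum Yum", "Indomie", "Indomie", "Indomie"],
    "style": ["cup", "cup", "cup", "pack", "pack"],
    "rating": [4, 4, 3.5, 15, 5],
})

# duplicated() marks every repeat of an earlier (brand, style) pair as True.
dup_mask = noodles.duplicated(subset=["brand", "style"])

# keep=False drops every row that has a duplicate, leaving only unique pairs.
no_repeats = noodles.drop_duplicates(subset=["brand", "style"], keep=False)

# ignore_index=True relabels the surviving rows 0, 1, 2, ...
renumbered = noodles.drop_duplicates(ignore_index=True)

print(dup_mask.tolist())
print(no_repeats)
print(renumbered)
```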



In [None]:
df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5]
    })
df

Unnamed: 0,brand,style,rating
0,Yum Yum,cup,4.0
1,Yum Yum,cup,4.0
2,Indomie,cup,3.5
3,Indomie,pack,15.0
4,Indomie,pack,5.0


In [None]:
# By default, it removes duplicate rows based on all columns.
df.drop_duplicates()

Unnamed: 0,brand,style,rating
0,Yum Yum,cup,4.0
2,Indomie,cup,3.5
3,Indomie,pack,15.0
4,Indomie,pack,5.0


In [None]:
# To remove duplicates on specific column(s), use subset.

df.drop_duplicates(subset=['brand','style'])

Unnamed: 0,brand,style,rating
0,Yum Yum,cup,4.0
2,Indomie,cup,3.5
3,Indomie,pack,15.0


In [None]:
# To remove duplicates and keep last occurrences, use keep.

df.drop_duplicates(subset=['brand', 'style'], keep='last')

Unnamed: 0,brand,style,rating
1,Yum Yum,cup,4.0
2,Indomie,cup,3.5
4,Indomie,pack,5.0


In [None]:
x = 24