<h1><p style="text-align: center;">Pandas</p></h1>

![Pandas](https://upload.wikimedia.org/wikipedia/commons/thumb/e/ed/Pandas_logo.svg/512px-Pandas_logo.svg.png?20200209204934)

Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool,
built on top of the Python programming language.

## Table of Contents
1. [Introduction to Pandas](#intro)
    - [What is Pandas?](#wip)
    - [Installation and Setup](#install)
    - [Importing Pandas](#import)
    - [Why use Pandas?](#wtup)
2. [What kind of data does pandas handle?](#handle)
    - [Series](#series)
    - [DataFrame](#01a)
3. [Reading Tabular Data](#03)
    - [Reading a CSV file from a local directory](#local)
    - [Reading a CSV file from a URL](#url)
    - [Specifying custom delimiter and header row](#deli)
    - [Specifying column names and data types](#names)
    - [Displaying a Data Frame](#display)
4. [Methods of the DataFrame](#methods)
    - [Indexing & Slicing](#index)
    - [Filtering of the DataFrame](#filter)
    - [Sorting](#sort)
    - [Dropping Column/s](#drop)
    - [Missing Values](#miss)
    - [Handling Missing Values](#hmiss)
    - [Derived column from existing columns](#derived)
5. [Statistical Measures](#stats)
    - [Aggregating statistics](#agg)
    - [Aggregating statistics grouped by category](#gagg)
6. [Exporting DataFrame](#Export)


---
<h2><p style="text-align: center;"> 1. Introduction to Pandas</p></h2> <a class = 'anchor' id = 'intro'></a> 

### What is Pandas? <a class = 'anchor' id = 'wip'></a>
- Pandas is an open-source Python library for data manipulation and analysis. 
- It provides data structures for efficiently storing and manipulating **large datasets**, as well as tools for **data cleaning, grouping, filtering**, and **visualization**. 
- Pandas is built on **top of NumPy**, another popular Python library for **scientific computing**.

Pandas lies at the core of a rich ecosystem of data science libraries. A typical exploratory data science workflow might look like:

![Numpy_DS](https://numpy.org/images/content_images/ds-landscape.png)

- Extract, Transform, Load: Pandas, Intake, PyJanitor
- Exploratory analysis: Jupyter, Seaborn, Matplotlib, Altair
- Model and evaluate: scikit-learn, statsmodels, PyMC3, spaCy
- Report in a dashboard: Dash, Panel, Voila

### Installation and Setup <a class = 'anchor' id = 'install'></a>
To install pandas, we can use `pip`, the Python package manager. Open a command prompt and type the following command:
```
pip install pandas 
```
![image.png](attachment:c27f5733-fd94-4561-92d4-c620a24b9cf9.png)

### Importing Pandas <a class = 'anchor' id = 'import'></a>
To start using pandas, we need to import it into our Python code. In most cases, we will import it with an alias `'pd'`, which is a common convention in the data science community.
```
import pandas as pd
```

In [1]:
import pandas as pd

### Why use Pandas? <a class = 'anchor' id = 'wtup'></a>
Pandas provides several benefits for data manipulation and analysis:

- **Data structures**: Pandas provides two main data structures - **Series and DataFrame** - that are optimized for data analysis. A Series is a one-dimensional array-like object that can hold any data type, while a DataFrame is a two-dimensional table-like data structure with columns of potentially different data types.

- **Data cleaning**: Pandas provides tools for cleaning and transforming data, such as removing missing values, filling in missing values, and converting data types.

-  **Data analysis**: Pandas provides tools for data analysis, such as grouping, filtering, sorting, and aggregation.

- **Data visualization**: Pandas provides tools for data visualization, such as plotting and graphing.

- **Integration with other libraries**: Pandas is built on top of NumPy and integrates well with other Python libraries, such as Matplotlib and Scikit-learn.

---
<h2><p style="text-align: center;"> 2. What kind of data does pandas handle?</p></h2> <a class = 'anchor' id = 'handle'></a> 


![image.png](attachment:image.png)

### Series <a class = 'anchor' id = 'series'></a>
- In pandas, a Series is a **one-dimensional** labeled array that can hold any data type, including integers, floats, strings, and Python objects. It is similar to **a column** in a spreadsheet, CSV file, or database table.

- A Series is made up of two arrays: one holds the **data values**, and the other holds the **index labels** that **identify** each data value. The index labels can be any hashable objects, such as **integers, strings, or datetime objects**; they are usually unique, although pandas does not require this.
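
As a small sketch of these two parts, the example below builds a Series with string labels and looks up the same value once by label and once by position. The variable names and sample values are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data: three values with string index labels
temps = pd.Series([21.5, 19.0, 23.2], index=["mon", "tue", "wed"])

by_label = temps["tue"]       # lookup by index label
by_position = temps.iloc[1]   # lookup by integer position

print(by_label, by_position)
```

Both lookups reach the same element: the label `"tue"` sits at position 1.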

![01_table_series](https://pandas.pydata.org/docs/_images/01_table_series.svg)

In [2]:
import pandas as pd

# create a Series with default index labels
s1 = pd.Series([1, 2, 3, 4, 5])
# A Series is created with the default index labels of 0, 1, 2, 3, and 4. The data values are integers from 1 to 5.
print('Series 1 : \n')
print(s1)

Series 1 : 

0    1
1    2
2    3
3    4
4    5
dtype: int64


In [3]:
type(s1)

pandas.core.series.Series

In [4]:
# A Series is created with custom index labels of 'a', 'b', 'c', 'd', and 'e'. 
# Again, the data values are integers from 1 to 5.
# create a Series with custom index labels
s2 = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
print('Series 2 : ')
print(s2)

Series 2 : 
a    1
b    2
c    3
d    4
e    5
dtype: int64


In [5]:
# A Series is created from a Python dictionary.
# The dictionary keys are used as the index labels, and the dictionary values are used as the data values.
# create a Series from a dictionary
data = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}
s3 = pd.Series(data)
print('Series 3 : ')
print(s3)

Series 3 : 
a    1
b    2
c    3
d    4
e    5
dtype: int64


Once a Series has been created, we can access its **data values** and **index labels** using the `values` and `index` attributes, respectively.

In [6]:
# create a Series with custom index labels
s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

# access the data values
print('Values of Series Data : ',s.values)  

# access the index labels
print('Index of Series Data : ',s.index)   


Values of Series Data :  [1 2 3 4 5]
Index of Series Data :  Index(['a', 'b', 'c', 'd', 'e'], dtype='object')


#### Arithmetic Operations

In [7]:
# perform an arithmetic operation on the data values
# addition
add = s + 10
add

a    11
b    12
c    13
d    14
e    15
dtype: int64

In [8]:
# subtraction
sub = s - 4
sub

a   -3
b   -2
c   -1
d    0
e    1
dtype: int64

In [9]:
# multiplication by a number
multi = s*9
multi

a     9
b    18
c    27
d    36
e    45
dtype: int64

In [10]:
# division
div = s/5
div

a    0.2
b    0.4
c    0.6
d    0.8
e    1.0
dtype: float64

#### Filtering Operation
The filtering operation selects the elements of the Series that satisfy a condition, here `(s > 2.3)`. The resulting Series contains only those values.

In [11]:
# filter the Series based on a condition
s_filtered = s[s > 2.3]
s_filtered

c    3
d    4
e    5
dtype: int64

### DataFrame <a class="anchor" id="01a"></a>
- A DataFrame is a `pandas` object: a two-dimensional labeled data structure with columns of potentially different types. It is similar to a **spreadsheet, CSV file**, or **SQL table**, where the rows represent the **observations** and the columns represent the **variables**.
- A DataFrame is made up of **one or more Series objects**, each representing a column in the DataFrame. All columns share the same index and therefore have the **same length**; Series used to build a DataFrame are aligned on their **index labels**.
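
To illustrate the alignment on index labels, here is a minimal sketch that builds a DataFrame from two Series whose labels only partly overlap. The labels and numbers are invented for the example.

```python
import pandas as pd

# Two hypothetical columns whose index labels only partly overlap
age = pd.Series([25, 32], index=["maria", "allen"])
salary = pd.Series([75000, 40000], index=["allen", "rahul"])

# pandas aligns the Series on their labels; combinations without a value become NaN
people = pd.DataFrame({"Age": age, "Salary": salary})
print(people)
```

The resulting table has one row per label found in either Series, so `"maria"` has no salary and `"rahul"` has no age.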


![01_table_dataframe](https://pandas.pydata.org/docs/_images/01_table_dataframe.svg)

To manually store data in a table, create a DataFrame. When using a Python dictionary of lists, the dictionary keys will be used as column headers and the values in each list as columns of the DataFrame.

In [12]:
# create a DataFrame from a dictionary
data = {'Name': ['Maria', 'Allen', 'Rahul', 'Ali', 'Shurti'],
        'Age': [25, 32, 18, 47, 22],
        'Gender': ['F', 'M', 'M', 'M', 'F'],
        'Salary': [50000, 75000, 40000, 90000, 60000]}
df = pd.DataFrame(data)
df

Unnamed: 0,Name,Age,Gender,Salary
0,Maria,25,F,50000
1,Allen,32,M,75000
2,Rahul,18,M,40000
3,Ali,47,M,90000
4,Shurti,22,F,60000


- A DataFrame is created from a Python dictionary. 
- The keys of the dictionary correspond to the column names, and the values correspond to the data values. 
- Each column is represented as a **Series object**, and the DataFrame is constructed by aligning the Series objects based on their index labels.

In spreadsheet software, the table representation of our data would look very similar:
![image.png](attachment:image.png)
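
Since each column of a DataFrame is a Series, selecting one column gives back a Series that carries the same index as the table. A short sketch with a made-up two-row table:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Maria", "Allen"], "Age": [25, 32]})

age = df["Age"]  # selecting a single column returns a Series

print(type(age).__name__)
print(age.index.equals(df.index))  # the column shares the DataFrame's index
```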

 ---
<h2><p style="text-align: center;"> 3. Reading Tabular Data</p></h2> <a class="anchor" id="03"></a>


Pandas provides several functions to read tabular data into a DataFrame, **including read_csv, read_excel, and read_sql**. Here are some examples of how to read tabular data using read_csv.
![02_io_readwrite](https://pandas.pydata.org/docs/_images/02_io_readwrite.svg)
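
`read_csv` also accepts file-like objects, so it can be tried without any file on disk. The sketch below reads CSV text held in a string; the text itself is made up for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV content held in a string instead of a file
csv_text = """Name,Age,Salary
Himadri,21,15
Hritik,20,36
"""

df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)
```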

### Reading a CSV file from a local directory <a class = 'anchor' id = 'local'></a>
`read_csv` is used to read the CSV file located at `dataset/data.csv` and store the data in a DataFrame object named `df`.

In [13]:
# read a CSV file from a local directory
df = pd.read_csv('dataset/data.csv')
df

Unnamed: 0,Name,Age,Salary,Experience,Gender
0,Himadri,21,15,1.0,Female
1,Hritik,20,36,1.2,Male
2,Sriyanka,22,14,2.0,Female
3,Utkarsh,23,12,0.5,Male
4,Rahul,21,16,11.0,Male
5,Niharika,22,11,1.0,Female
6,Himanhsu,24,13,1.4,Male
7,Saloni,21,9,1.2,Female
8,Ayshui,20,10,0.5,Female
9,Shivam,24,12,1.0,Male


### Reading a CSV file from a URL <a class = 'anchor' id = 'url'></a>
`read_csv` is used to read a CSV file from a URL and store the data in a DataFrame object named df.

In [14]:
# read a CSV file from a URL
url = 'https://raw.githubusercontent.com/Shubham007-web/Dummy_datasets/main/df.csv'
df = pd.read_csv(url)
df

Unnamed: 0,Name,Age,Salary,Experience,Gender
0,Himadri,21,15,1.0,Female
1,Hritik,20,36,1.2,Male
2,Sriyanka,22,14,2.0,Female
3,Utkarsh,23,12,0.5,Male
4,Rahul,21,16,11.0,Male
5,Niharika,22,11,1.0,Female
6,Himanhsu,24,13,1.4,Male
7,Saloni,21,9,1.2,Female
8,Ayshui,20,10,0.5,Female
9,Shivam,24,12,1.0,Male


###  Specifying custom delimiter and header row <a class = 'anchor' id = 'deli'></a>
- `read_csv` is used to read the CSV file `'dataset/new.csv'`, which uses a custom delimiter (`'|'`) and has its header row at index 0.
- This CSV file with the delimiter `|` looks like this:
![image.png](attachment:image.png)

In [15]:
# read the '|'-delimited file with the wrong delimiter (','): each row ends up in a single column
df = pd.read_csv('dataset/new.csv', delimiter=',', header=0)  # with delimiter = ','
df

Unnamed: 0,Name|Age|Salary|Experience|Gender
0,Himadri|21|15|1.0|Female
1,Hritik|20|36|1.2|Male
2,Sriyanka|22|14|2.0|Female
3,Utkarsh|23|12|0.5|Male
4,Rahul|21|16|11.0|Male
5,Niharika|22|11|1.0|Female
6,Himanhsu|24|13|1.4|Male
7,Saloni|21|9|1.2|Female
8,Ayshui|20|10|0.5|Female
9,Shivam|24|12|1.0|Male


In [16]:
df = pd.read_csv('dataset/new.csv', delimiter='|', header=0)  # with delimiter = '|'
df

Unnamed: 0,Name,Age,Salary,Experience,Gender
0,Himadri,21,15,1.0,Female
1,Hritik,20,36,1.2,Male
2,Sriyanka,22,14,2.0,Female
3,Utkarsh,23,12,0.5,Male
4,Rahul,21,16,11.0,Male
5,Niharika,22,11,1.0,Female
6,Himanhsu,24,13,1.4,Male
7,Saloni,21,9,1.2,Female
8,Ayshui,20,10,0.5,Female
9,Shivam,24,12,1.0,Male
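
The effect of the delimiter can also be checked on a small in-memory example. This sketch assumes made-up pipe-separated text and uses `sep`, which is the same option as `delimiter`.

```python
import io
import pandas as pd

# Hypothetical pipe-separated content
pipe_text = "Name|Age\nHimadri|21\nHritik|20\n"

wrong = pd.read_csv(io.StringIO(pipe_text))           # default sep=',' -> one wide column
right = pd.read_csv(io.StringIO(pipe_text), sep="|")  # correct separator -> two columns

print(wrong.shape, right.shape)
```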


### Specifying column names and data types <a class = 'anchor' id = 'names'></a>
- `read_csv` is used to read the CSV file `'dataset/data.csv'` with custom column names (**'name', 'age', 'Salary', 'Experience', and 'Gender'**) and data types (str, int, int, float, and str, respectively).
- The `header` parameter is set to 0 to indicate that the first row of the file is a header row, and the `names` parameter supplies the column names that replace it.

In [17]:
# read a CSV file with custom column names and data types
# the dtype keys refer to the column names given in `names`
dtypes = {'name': str, 'age': int, 'Salary': int, 'Experience': float, 'Gender': str}
df = pd.read_csv('dataset/data.csv', header=0, names=['name', 'age', 'Salary', 'Experience', 'Gender'], dtype=dtypes)
df

Unnamed: 0,name,age,Salary,Experience,Gender
0,Himadri,21,15,1.0,Female
1,Hritik,20,36,1.2,Male
2,Sriyanka,22,14,2.0,Female
3,Utkarsh,23,12,0.5,Male
4,Rahul,21,16,11.0,Male
5,Niharika,22,11,1.0,Female
6,Himanhsu,24,13,1.4,Male
7,Saloni,21,9,1.2,Female
8,Ayshui,20,10,0.5,Female
9,Shivam,24,12,1.0,Male
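
A self-contained sketch of the same idea, using invented CSV text: `names` replaces the header row, and the `dtype` mapping is keyed by those new names. Here the `experience` column would be read as integers by default, so the explicit `float64` makes a visible difference.

```python
import io
import pandas as pd

# Hypothetical CSV content; the Experience values look like integers
csv_text = "Name,Age,Experience\nHimadri,21,1\nHritik,20,2\n"

df = pd.read_csv(
    io.StringIO(csv_text),
    header=0,                                         # first row is a header and gets replaced
    names=["name", "age", "experience"],              # new column names
    dtype={"age": "int64", "experience": "float64"},  # keyed by the new names
)
print(df.dtypes)
```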


In [18]:
# reload the data; the next cell renames some of its columns
df = pd.read_csv('dataset/data.csv')
df

Unnamed: 0,Name,Age,Salary,Experience,Gender
0,Himadri,21,15,1.0,Female
1,Hritik,20,36,1.2,Male
2,Sriyanka,22,14,2.0,Female
3,Utkarsh,23,12,0.5,Male
4,Rahul,21,16,11.0,Male
5,Niharika,22,11,1.0,Female
6,Himanhsu,24,13,1.4,Male
7,Saloni,21,9,1.2,Female
8,Ayshui,20,10,0.5,Female
9,Shivam,24,12,1.0,Male


In [19]:

# rename() returns a new DataFrame; df itself is left unchanged
df.rename(columns = {'Name' : 'name','Experience':'Total_Experience'})

Unnamed: 0,name,Age,Salary,Total_Experience,Gender
0,Himadri,21,15,1.0,Female
1,Hritik,20,36,1.2,Male
2,Sriyanka,22,14,2.0,Female
3,Utkarsh,23,12,0.5,Male
4,Rahul,21,16,11.0,Male
5,Niharika,22,11,1.0,Female
6,Himanhsu,24,13,1.4,Male
7,Saloni,21,9,1.2,Female
8,Ayshui,20,10,0.5,Female
9,Shivam,24,12,1.0,Male
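
Note that `rename()` returns a new DataFrame rather than changing the original, so the result has to be assigned to a variable to keep it. A minimal sketch with a made-up one-row table:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Himadri"], "Experience": [1.0]})

renamed = df.rename(columns={"Experience": "Total_Experience"})

print(list(df.columns))       # original labels are untouched
print(list(renamed.columns))  # the returned copy has the new label
```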


### Displaying a Data Frame <a class = 'class' id= 'display'></a>
Pandas provides methods that display a small portion of a DataFrame or a summary of it:
- `head(n)` --> displays the first `n` rows (5 by default)
- `tail(n)` --> displays the last `n` rows (5 by default)
- `info()` --> displays a summary of the DataFrame (index, columns, non-null counts, dtypes, and memory usage)

In [20]:
# head()
df.head() # first 5 rows

Unnamed: 0,Name,Age,Salary,Experience,Gender
0,Himadri,21,15,1.0,Female
1,Hritik,20,36,1.2,Male
2,Sriyanka,22,14,2.0,Female
3,Utkarsh,23,12,0.5,Male
4,Rahul,21,16,11.0,Male


In [21]:
# head(n)
df.head(2)  # first 2 rows

Unnamed: 0,Name,Age,Salary,Experience,Gender
0,Himadri,21,15,1.0,Female
1,Hritik,20,36,1.2,Male


In [22]:
# tail()
df.tail()  # last 5 rows

Unnamed: 0,Name,Age,Salary,Experience,Gender
5,Niharika,22,11,1.0,Female
6,Himanhsu,24,13,1.4,Male
7,Saloni,21,9,1.2,Female
8,Ayshui,20,10,0.5,Female
9,Shivam,24,12,1.0,Male


In [23]:
# tail(n)
df.tail(2) # last 2 rows

Unnamed: 0,Name,Age,Salary,Experience,Gender
8,Ayshui,20,10,0.5,Female
9,Shivam,24,12,1.0,Male


In [24]:
# information  of the DataFrame()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Name        10 non-null     object 
 1   Age         10 non-null     int64  
 2   Salary      10 non-null     int64  
 3   Experience  10 non-null     float64
 4   Gender      10 non-null     object 
dtypes: float64(1), int64(2), object(2)
memory usage: 528.0+ bytes


### Attributes of Data Frame <a class = 'anchor' id = 'attri'></a>
Like every Python object, a pandas `DataFrame` has its own properties (attributes). Some of them are shown below:
![image.png](attachment:image.png)

In [25]:
# let's create data frame 
df = pd.read_csv('dataset/data.csv')
df

Unnamed: 0,Name,Age,Salary,Experience,Gender
0,Himadri,21,15,1.0,Female
1,Hritik,20,36,1.2,Male
2,Sriyanka,22,14,2.0,Female
3,Utkarsh,23,12,0.5,Male
4,Rahul,21,16,11.0,Male
5,Niharika,22,11,1.0,Female
6,Himanhsu,24,13,1.4,Male
7,Saloni,21,9,1.2,Female
8,Ayshui,20,10,0.5,Female
9,Shivam,24,12,1.0,Male


In [26]:
# index ---> Index of the data frame
df.index

RangeIndex(start=0, stop=10, step=1)

In [27]:
# columns --> columns of the data frame
df.columns

Index(['Name', 'Age', 'Salary', 'Experience', 'Gender'], dtype='object')

In [28]:
# shape --> shape of the data frame
df.shape

(10, 5)

In [29]:
# size --> total number of elements in the data frame (rows x columns)
df.size

50

In [30]:
# ndim  --> dimension of the data frame
df.ndim

2

In [31]:
# dtypes --> data types of the columns inside the data frame
df.dtypes

Name           object
Age             int64
Salary          int64
Experience    float64
Gender         object
dtype: object

In [32]:
# axes --> rows and columns
df.axes

[RangeIndex(start=0, stop=10, step=1),
 Index(['Name', 'Age', 'Salary', 'Experience', 'Gender'], dtype='object')]

In [33]:
# values --> returns all the values of the data frame
df.values

array([['Himadri', 21, 15, 1.0, 'Female'],
       ['Hritik', 20, 36, 1.2, 'Male'],
       ['Sriyanka', 22, 14, 2.0, 'Female'],
       ['Utkarsh', 23, 12, 0.5, 'Male'],
       ['Rahul', 21, 16, 11.0, 'Male'],
       ['Niharika', 22, 11, 1.0, 'Female'],
       ['Himanhsu', 24, 13, 1.4, 'Male'],
       ['Saloni', 21, 9, 1.2, 'Female'],
       ['Ayshui', 20, 10, 0.5, 'Female'],
       ['Shivam', 24, 12, 1.0, 'Male']], dtype=object)

In [34]:
# T --> transpose of the data frame
df.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
Name,Himadri,Hritik,Sriyanka,Utkarsh,Rahul,Niharika,Himanhsu,Saloni,Ayshui,Shivam
Age,21,20,22,23,21,22,24,21,20,24
Salary,15,36,14,12,16,11,13,9,10,12
Experience,1.0,1.2,2.0,0.5,11.0,1.0,1.4,1.2,0.5,1.0
Gender,Female,Male,Female,Male,Male,Female,Male,Female,Female,Male



---
<h2><p style="text-align: center;"> 4. Methods of the DataFrame</p></h2> <a class = 'anchor' id = 'methods'></a> 

### Indexing & Slicing (Subset of the data frame)<a class = 'anchor' id = 'index'></a>
![03_subset_columns](https://pandas.pydata.org/docs/_images/03_subset_columns.svg)
- Indexing and slicing in Pandas data frames are similar to indexing and slicing in Python lists, but with some additional functionality.
- Indexing in Pandas data frames allows you to select a subset of rows and/or columns based on their labels (`loc[]`) or positions (`iloc[]`). Slicing allows you to select a range of rows and/or columns.
![image.png](attachment:image.png)

In [35]:
# let's create data frame 
df = pd.read_csv('dataset/data.csv')
df.head()

Unnamed: 0,Name,Age,Salary,Experience,Gender
0,Himadri,21,15,1.0,Female
1,Hritik,20,36,1.2,Male
2,Sriyanka,22,14,2.0,Female
3,Utkarsh,23,12,0.5,Male
4,Rahul,21,16,11.0,Male


In [36]:
# indexing
df['Name']

0     Himadri
1      Hritik
2    Sriyanka
3     Utkarsh
4       Rahul
5    Niharika
6    Himanhsu
7      Saloni
8      Ayshui
9      Shivam
Name: Name, dtype: object

In [37]:
# loc[]
df.loc[:,'Name']  # all the rows and only the Name column

0     Himadri
1      Hritik
2    Sriyanka
3     Utkarsh
4       Rahul
5    Niharika
6    Himanhsu
7      Saloni
8      Ayshui
9      Shivam
Name: Name, dtype: object

In [38]:
# we can also perform same with iloc[]
df.iloc[:,0]  # all the rows and 0th column

0     Himadri
1      Hritik
2    Sriyanka
3     Utkarsh
4       Rahul
5    Niharika
6    Himanhsu
7      Saloni
8      Ayshui
9      Shivam
Name: Name, dtype: object

In [39]:
# multiple columns
df.loc[:,['Name', 'Gender']] 

Unnamed: 0,Name,Gender
0,Himadri,Female
1,Hritik,Male
2,Sriyanka,Female
3,Utkarsh,Male
4,Rahul,Male
5,Niharika,Female
6,Himanhsu,Male
7,Saloni,Female
8,Ayshui,Female
9,Shivam,Male


In [40]:
# iloc[]
df.iloc[:,[0,-1]]   # all rows and first and last columns of the data frame

Unnamed: 0,Name,Gender
0,Himadri,Female
1,Hritik,Male
2,Sriyanka,Female
3,Utkarsh,Male
4,Rahul,Male
5,Niharika,Female
6,Himanhsu,Male
7,Saloni,Female
8,Ayshui,Female
9,Shivam,Male


In [41]:
# loc[]
df.loc[6,'Name']

'Himanhsu'

In [42]:
# iloc[]
df.iloc[6,0]

'Himanhsu'

In [43]:
# loc[] slices by label and includes the end label (rows 2 through 7)
df.loc[2:7,["Name","Age","Gender"]]

Unnamed: 0,Name,Age,Gender
2,Sriyanka,22,Female
3,Utkarsh,23,Male
4,Rahul,21,Male
5,Niharika,22,Female
6,Himanhsu,24,Male
7,Saloni,21,Female


In [44]:
# iloc[] slices by position and excludes the end position (rows 2 through 6)
df.iloc[2:7,[0,1,4]]

Unnamed: 0,Name,Age,Gender
2,Sriyanka,22,Female
3,Utkarsh,23,Male
4,Rahul,21,Male
5,Niharika,22,Female
6,Himanhsu,24,Male


### Filtering of the DataFrame <a class = 'anchor' id = 'filter'></a>
- Filtering in a Pandas DataFrame allows us to select a subset of rows based on a **certain condition or criteria**. To do this, we can use boolean indexing.
![03_subset_rows](https://pandas.pydata.org/docs/_images/03_subset_rows.svg)
- The output of the conditional expression (```>```, but also ```==```, ```!=```, ```<```, ```<=```,… would work) is actually a pandas ```Series``` of boolean values (either ```True``` or ```False```) with the same number of rows as the original ```DataFrame```. Such a ```Series``` of boolean values can be used to filter the ```DataFrame``` by putting it in between the selection brackets ```[]```. Only rows for which the value is ```True``` will be selected.

In [45]:
# let's create data frame 
df = pd.read_csv('dataset/data.csv')
df.head()

Unnamed: 0,Name,Age,Salary,Experience,Gender
0,Himadri,21,15,1.0,Female
1,Hritik,20,36,1.2,Male
2,Sriyanka,22,14,2.0,Female
3,Utkarsh,23,12,0.5,Male
4,Rahul,21,16,11.0,Male


In [46]:
# let's find the persons whose age is less than or equal to 22
df['Age'] <= 22  # returns a boolean Series

0     True
1     True
2     True
3    False
4     True
5     True
6    False
7     True
8     True
9    False
Name: Age, dtype: bool

In [47]:
df[df['Age'] <= 22]

Unnamed: 0,Name,Age,Salary,Experience,Gender
0,Himadri,21,15,1.0,Female
1,Hritik,20,36,1.2,Male
2,Sriyanka,22,14,2.0,Female
4,Rahul,21,16,11.0,Male
5,Niharika,22,11,1.0,Female
7,Saloni,21,9,1.2,Female
8,Ayshui,20,10,0.5,Female


In [48]:
# let's filter the data, where Age <= 22 and Salary <= 16
df[(df['Age'] <= 22) & (df['Salary'] <= 16)]

Unnamed: 0,Name,Age,Salary,Experience,Gender
0,Himadri,21,15,1.0,Female
2,Sriyanka,22,14,2.0,Female
4,Rahul,21,16,11.0,Male
5,Niharika,22,11,1.0,Female
7,Saloni,21,9,1.2,Female
8,Ayshui,20,10,0.5,Female


**Note**: When combining multiple conditions, each condition must be surrounded by parentheses ```()```. Moreover, Python's ```and```/```or``` keywords cannot be used here; use the element-wise operators ```&``` (and) and ```|``` (or) instead.
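
The following sketch, on an invented three-row table, combines conditions with `|` and negates them with `~`. It also shows that the plain `and` keyword fails, because a whole Series has no single truth value.

```python
import pandas as pd

df = pd.DataFrame({"Age": [21, 20, 24], "Salary": [15, 36, 13]})

either = df[(df["Age"] >= 24) | (df["Salary"] > 30)]      # OR: at least one condition holds
neither = df[~((df["Age"] >= 24) | (df["Salary"] > 30))]  # NOT: no condition holds

# Python's `and` asks each Series for one truth value, which raises ValueError
try:
    df[(df["Age"] >= 24) and (df["Salary"] > 30)]
    raised = False
except ValueError:
    raised = True

print(either.index.tolist(), neither.index.tolist(), raised)
```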

### Sorting <a class = 'anchor' id = 'sort'></a>

To sort a Pandas DataFrame in Python, we can use the `sort_values()` method. This method sorts the DataFrame by one or more columns in either ascending or descending order.

In [49]:
df = pd.read_csv('dataset/data.csv')
df.head()

Unnamed: 0,Name,Age,Salary,Experience,Gender
0,Himadri,21,15,1.0,Female
1,Hritik,20,36,1.2,Male
2,Sriyanka,22,14,2.0,Female
3,Utkarsh,23,12,0.5,Male
4,Rahul,21,16,11.0,Male


In [50]:
# Sort by a single column "Salary"
# df_sorted1 = df.sort_values('Salary')
df_sorted1 = df.sort_values('Salary',ascending=True)  # in ascending order
df_sorted1

Unnamed: 0,Name,Age,Salary,Experience,Gender
7,Saloni,21,9,1.2,Female
8,Ayshui,20,10,0.5,Female
5,Niharika,22,11,1.0,Female
3,Utkarsh,23,12,0.5,Male
9,Shivam,24,12,1.0,Male
6,Himanhsu,24,13,1.4,Male
2,Sriyanka,22,14,2.0,Female
0,Himadri,21,15,1.0,Female
4,Rahul,21,16,11.0,Male
1,Hritik,20,36,1.2,Male


**Note**: The ascending parameter is used to specify the sort order for each column. If we don't specify this parameter, the default value is `True`, which means the DataFrame will be sorted in **ascending order**.

In [51]:
df_sorted2 = df.sort_values('Salary',ascending=False)  # in descending order
df_sorted2

Unnamed: 0,Name,Age,Salary,Experience,Gender
1,Hritik,20,36,1.2,Male
4,Rahul,21,16,11.0,Male
0,Himadri,21,15,1.0,Female
2,Sriyanka,22,14,2.0,Female
6,Himanhsu,24,13,1.4,Male
3,Utkarsh,23,12,0.5,Male
9,Shivam,24,12,1.0,Male
5,Niharika,22,11,1.0,Female
8,Ayshui,20,10,0.5,Female
7,Saloni,21,9,1.2,Female


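To sort by multiple columns, pass a list of column names and a matching list of sort orders. In the next cell, the DataFrame is sorted first by the **`"Salary"`** column in **descending order**, and then by the **`"Age"`** column in **ascending order**.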

In [52]:
# Sort by multiple columns
df_sorted3 = df.sort_values(['Salary', 'Age'], ascending=[False, True])
df_sorted3

Unnamed: 0,Name,Age,Salary,Experience,Gender
1,Hritik,20,36,1.2,Male
4,Rahul,21,16,11.0,Male
0,Himadri,21,15,1.0,Female
2,Sriyanka,22,14,2.0,Female
6,Himanhsu,24,13,1.4,Male
3,Utkarsh,23,12,0.5,Male
9,Shivam,24,12,1.0,Male
5,Niharika,22,11,1.0,Female
8,Ayshui,20,10,0.5,Female
7,Saloni,21,9,1.2,Female


### Dropping Column/s <a class ='anchor' id = 'drop'></a> 
We can use the `drop()` method to drop columns from the given data.

In [53]:
df = pd.read_csv('dataset/data.csv')
df

Unnamed: 0,Name,Age,Salary,Experience,Gender
0,Himadri,21,15,1.0,Female
1,Hritik,20,36,1.2,Male
2,Sriyanka,22,14,2.0,Female
3,Utkarsh,23,12,0.5,Male
4,Rahul,21,16,11.0,Male
5,Niharika,22,11,1.0,Female
6,Himanhsu,24,13,1.4,Male
7,Saloni,21,9,1.2,Female
8,Ayshui,20,10,0.5,Female
9,Shivam,24,12,1.0,Male


Here let's drop `Gender` column from the data.

In [54]:
df.drop('Gender', axis = 1)  # axis = 1 means we are dropping a column

Unnamed: 0,Name,Age,Salary,Experience
0,Himadri,21,15,1.0
1,Hritik,20,36,1.2
2,Sriyanka,22,14,2.0
3,Utkarsh,23,12,0.5
4,Rahul,21,16,11.0
5,Niharika,22,11,1.0
6,Himanhsu,24,13,1.4
7,Saloni,21,9,1.2
8,Ayshui,20,10,0.5
9,Shivam,24,12,1.0


In [55]:
# more than one column
df.drop(['Gender','Age'], axis = 1)  # axis = 1 means we are dropping columns

Unnamed: 0,Name,Salary,Experience
0,Himadri,15,1.0
1,Hritik,36,1.2
2,Sriyanka,14,2.0
3,Utkarsh,12,0.5
4,Rahul,16,11.0
5,Niharika,11,1.0
6,Himanhsu,13,1.4
7,Saloni,9,1.2
8,Ayshui,10,0.5
9,Shivam,12,1.0
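
As an alternative spelling, `drop()` also takes a `columns=` keyword, which avoids having to remember the axis number. The sketch below uses a made-up one-row table and shows that the original DataFrame keeps all of its columns.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Himadri"], "Age": [21], "Gender": ["Female"]})

trimmed = df.drop(columns=["Gender", "Age"])  # same effect as drop([...], axis=1)

print(list(trimmed.columns))
print(list(df.columns))  # df still has all three columns
```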


### Missing Values <a class = 'anchor' id = 'miss'></a>
Pandas ```DataFrame.isna()``` or ```DataFrame.isnull()``` method returns a boolean mask where ```True``` is set for missing values (NaN) and ```False``` for non-missing values.

In [56]:
df = pd.read_csv('dataset/new_data.csv')
df.head()

Unnamed: 0,Name,Age,Salary,Experience,Gender
0,Himadri,21,,1.0,Female
1,Hritik,22,36.0,1.2,Male
2,,22,14.0,2.0,Female
3,Utkarsh,23,12.0,0.5,Male
4,Rahul,21,,11.0,Male


In [57]:
# let's find the missing values
# df.isnull()
df.isnull().sum()

Name          1
Age           0
Salary        3
Experience    0
Gender        1
dtype: int64

In [58]:
# isna()
df.isna().sum()

Name          1
Age           0
Salary        3
Experience    0
Gender        1
dtype: int64

In [59]:
# notnull() --> returns True where values are NOT missing
df.notnull().sum()

Name           9
Age           10
Salary         7
Experience    10
Gender         9
dtype: int64

In [60]:
# notna() --> returns True where values are NOT missing
df.notna().sum()

Name           9
Age           10
Salary         7
Experience    10
Gender         9
dtype: int64

In [61]:
# getting non missing values of data for Salary column
df['Salary'][df['Salary'].notnull()]

1    36.0
2    14.0
3    12.0
5    11.0
6    13.0
7     9.0
8    10.0
Name: Salary, dtype: float64
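
The boolean mask can also be used to pull out the rows that contain at least one missing value. A minimal sketch on an invented table:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Name": ["Himadri", None, "Rahul"], "Salary": [np.nan, 14.0, 12.0]})

missing_per_column = df.isna().sum()        # number of missing values in each column
rows_with_gaps = df[df.isna().any(axis=1)]  # rows with at least one missing value

print(missing_per_column)
print(rows_with_gaps.index.tolist())
```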

### Handling Missing Values<a class ='anchor' id = 'hmiss'></a>
As data comes in many shapes and forms, pandas aims to be flexible with regard to handling missing data. While `NaN` is the default missing value marker for reasons of computational speed and convenience, we need to be able to easily detect this value with data of different types: floating point, integer, boolean, and general object. Two common methods for handling missing values are:
- `dropna()`
- `fillna()`

In [62]:
df1 = df.copy()
df1.head()

Unnamed: 0,Name,Age,Salary,Experience,Gender
0,Himadri,21,,1.0,Female
1,Hritik,22,36.0,1.2,Male
2,,22,14.0,2.0,Female
3,Utkarsh,23,12.0,0.5,Male
4,Rahul,21,,11.0,Male


The dataset above has some missing values. `dropna()` returns a copy without the rows that contain them (or, with `axis=1`, without the columns that contain them).

In [63]:
df1.dropna()

Unnamed: 0,Name,Age,Salary,Experience,Gender
1,Hritik,22,36.0,1.2,Male
3,Utkarsh,23,12.0,0.5,Male
5,Niharika,22,11.0,1.0,Female
6,Himanhsu,24,13.0,1.4,Male
7,Saloni,21,9.0,1.2,Female
8,Ayshui,20,10.0,0.5,Female


The `fillna()` method can be used in the following ways:

In [64]:
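# fillna() replaces missing values; some options (commented out):
# df1.fillna(0)                                # a constant
# df1.fillna(df1.mean(numeric_only=True))      # each numeric column's mean
# df1.fillna(df1.median(numeric_only=True))    # each numeric column's median
# df1.fillna({"Age": 25, "Experience": 2})     # a different value per column
# df1.bfill()   # backward fill, formerly fillna(method='bfill') or fillna(method='backfill')
# forward fill, formerly fillna(method='ffill') or fillna(method='pad');
# the `method` argument is deprecated in recent pandas versions
df1.ffill()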

Unnamed: 0,Name,Age,Salary,Experience,Gender
0,Himadri,21,,1.0,Female
1,Hritik,22,36.0,1.2,Male
2,Hritik,22,14.0,2.0,Female
3,Utkarsh,23,12.0,0.5,Male
4,Rahul,21,12.0,11.0,Male
5,Niharika,22,11.0,1.0,Female
6,Himanhsu,24,13.0,1.4,Male
7,Saloni,21,9.0,1.2,Female
8,Ayshui,20,10.0,0.5,Female
9,Shivam,24,10.0,1.0,Female
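
In the output above, the `Salary` in row 0 is still missing because a forward fill has no earlier row to copy from. The sketch below reproduces this on a small invented table and contrasts it with `dropna()` and a per-column `fillna()`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Age": [21, 22, 23], "Salary": [np.nan, 36.0, np.nan]})

dropped = df.dropna(subset=["Salary"])                    # keep rows where Salary is present
filled_mean = df.fillna({"Salary": df["Salary"].mean()})  # fill Salary with its mean
filled_forward = df.ffill()                               # copy the previous row's value down

print(filled_forward)
```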


###  Derived column from existing columns <a class = 'anchor' id = 'derived'></a>
![05_newcolumn_1](https://pandas.pydata.org/docs/_images/05_newcolumn_1.svg)

In [65]:
air_quality = pd.read_csv("dataset/air_quality_no2.csv", index_col=0, parse_dates=True)
air_quality.head()

Unnamed: 0_level_0,station_antwerp,station_paris,station_london
datetime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2019-05-07 02:00:00,,,23.0
2019-05-07 03:00:00,50.5,25.0,19.0
2019-05-07 04:00:00,45.0,27.7,19.0
2019-05-07 05:00:00,,50.4,16.0
2019-05-07 06:00:00,,61.9,


In [66]:
# Express the NO2 concentration of the station in London in mg/m³.
air_quality["station_london"] * 1.882

datetime
2019-05-07 02:00:00    43.286
2019-05-07 03:00:00    35.758
2019-05-07 04:00:00    35.758
2019-05-07 05:00:00    30.112
2019-05-07 06:00:00       NaN
                        ...  
2019-06-20 22:00:00       NaN
2019-06-20 23:00:00       NaN
2019-06-21 00:00:00       NaN
2019-06-21 01:00:00       NaN
2019-06-21 02:00:00       NaN
Name: station_london, Length: 1035, dtype: float64

In [67]:
air_quality["london_mg_per_cubic"] = air_quality["station_london"] * 1.882

To create a new column, use the ```[]``` brackets with the new column name at the left side of the assignment.

The calculation of the values is done element-wise. This means all values in the given column are multiplied by the value 1.882 at once. We do not need to use a loop to iterate over the rows!
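
To make the element-wise behaviour concrete, this sketch compares the single vectorized expression with an explicit Python loop on a few made-up readings. Both give the same values; the vectorized form is simply shorter and avoids the loop.

```python
import pandas as pd

no2 = pd.Series([23.0, 19.0, 16.0])  # hypothetical readings

vectorized = no2 * 1.882                      # one expression for the whole column
looped = pd.Series([v * 1.882 for v in no2])  # explicit loop over the values

print(vectorized.equals(looped))
```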

![05_newcolumn_2](https://pandas.pydata.org/docs/_images/05_newcolumn_2.svg)

In [68]:
# check the ratio of the values in Paris versus Antwerp and save the result in a new column.
air_quality["station_paris"] / air_quality["station_antwerp"]

datetime
2019-05-07 02:00:00         NaN
2019-05-07 03:00:00    0.495050
2019-05-07 04:00:00    0.615556
2019-05-07 05:00:00         NaN
2019-05-07 06:00:00         NaN
                         ...   
2019-06-20 22:00:00         NaN
2019-06-20 23:00:00         NaN
2019-06-21 00:00:00         NaN
2019-06-21 01:00:00         NaN
2019-06-21 02:00:00         NaN
Length: 1035, dtype: float64

In [69]:
air_quality["ratio_paris_antwerp"] = (air_quality["station_paris"] / air_quality["station_antwerp"])

The calculation is again element-wise, so the ```/``` is applied for the values in each row.

Other mathematical operators (```+```, ```-```, ```*```, ```/```,…) and logical operators (```<```, ```>```, ```==```,…) also work element-wise. The latter were already used in the filtering section above to select rows of a table using a conditional expression.

If you need more advanced logic, you can use arbitrary Python code via ```apply()``` (covered later).

In [70]:
# rename the data columns to the corresponding station identifiers used by OpenAQ.
air_quality.rename(columns={"station_antwerp": "BETR801","station_paris": "FR04014","station_london": "London Westminster",})

Unnamed: 0_level_0,BETR801,FR04014,London Westminster,london_mg_per_cubic,ratio_paris_antwerp
datetime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2019-05-07 02:00:00,,,23.0,43.286,
2019-05-07 03:00:00,50.5,25.0,19.0,35.758,0.495050
2019-05-07 04:00:00,45.0,27.7,19.0,35.758,0.615556
2019-05-07 05:00:00,,50.4,16.0,30.112,
2019-05-07 06:00:00,,61.9,,,
...,...,...,...,...,...
2019-06-20 22:00:00,,21.4,,,
2019-06-20 23:00:00,,24.9,,,
2019-06-21 00:00:00,,26.5,,,
2019-06-21 01:00:00,,21.8,,,


In [71]:
air_quality_renamed = air_quality.rename(columns={"station_antwerp": "BETR801",
                                                  "station_paris": "FR04014","station_london": "London Westminster",})

In [72]:
air_quality_renamed.head()

Unnamed: 0_level_0,BETR801,FR04014,London Westminster,london_mg_per_cubic,ratio_paris_antwerp
datetime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2019-05-07 02:00:00,,,23.0,43.286,
2019-05-07 03:00:00,50.5,25.0,19.0,35.758,0.49505
2019-05-07 04:00:00,45.0,27.7,19.0,35.758,0.615556
2019-05-07 05:00:00,,50.4,16.0,30.112,
2019-05-07 06:00:00,,61.9,,,


The ```rename()``` function can be used for both row labels and column labels. Provide a dictionary whose keys are the current names and whose values are the new names.
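The examples in this section only rename columns. A minimal sketch of renaming row labels as well, on a hypothetical two-row table (the labels `row_a`, `row_b`, `first` and `second` are made up for illustration), might look like this:

```python
import pandas as pd

# Hypothetical table with string row labels (illustrative values only).
df = pd.DataFrame({"station_antwerp": [50.5, 45.0]}, index=["row_a", "row_b"])

# index= renames row labels, columns= renames column labels.
renamed = df.rename(index={"row_a": "first", "row_b": "second"},
                    columns={"station_antwerp": "BETR801"})

# rename() returns a new DataFrame; the original keeps its old labels
# unless the result is assigned back.
print(renamed)
print(df.columns.tolist())
```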

The mapping is not restricted to fixed names; it can also be a mapping function. For example, the column names can be converted to lowercase with a function:

In [73]:
air_quality_renamed = air_quality_renamed.rename(columns=str.lower)

In [74]:
air_quality_renamed.head()

Unnamed: 0_level_0,betr801,fr04014,london westminster,london_mg_per_cubic,ratio_paris_antwerp
datetime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2019-05-07 02:00:00,,,23.0,43.286,
2019-05-07 03:00:00,50.5,25.0,19.0,35.758,0.49505
2019-05-07 04:00:00,45.0,27.7,19.0,35.758,0.615556
2019-05-07 05:00:00,,50.4,16.0,30.112,
2019-05-07 06:00:00,,61.9,,,


**Note**: 
- Create a new column by assigning the output to the ```DataFrame``` with a new column name in between the ```[]```.
- Operations are element-wise, no need to ```loop``` over rows.
- Use ```rename``` with a dictionary or function to rename row labels or column names.

---
<h2><p style="text-align: center;"> 5. Statistical Measures</p></h2> <a class = 'anchor' id = 'stats'></a> 

### Aggregating statistics<a class="anchor" id="agg"></a>
![06_aggregate](https://pandas.pydata.org/docs/_images/06_aggregate.svg)

In [75]:
# let's create another dataframe of titanic
titanic = pd.read_csv("dataset/titanic.csv")
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [76]:
# What is the average age of the Titanic passengers?
titanic["Age"].mean()

29.69911764705882

Different statistics are available and can be applied to columns with numerical data. Operations in general exclude missing data and operate across rows by default.
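To illustrate how missing data and the axis are handled, here is a small sketch on a hypothetical three-row table (the values are illustrative, not the full Titanic data). It relies on the ```skipna``` and ```axis``` parameters of ```mean()```.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-table with one missing age.
df = pd.DataFrame({"Age": [22.0, np.nan, 26.0], "Fare": [7.25, 71.28, 7.93]})

col_means = df.mean()                            # one value per column, NaN skipped
age_mean = df["Age"].mean()                      # (22 + 26) / 2
age_mean_strict = df["Age"].mean(skipna=False)   # NaN, because a value is missing
row_means = df.mean(axis=1)                      # one value per row instead

print(col_means)
print(age_mean, age_mean_strict)
print(row_means)
```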

![06_reduction](https://pandas.pydata.org/docs/_images/06_reduction.svg)

In [77]:
titanic[["Age", "Fare"]].describe()

Unnamed: 0,Age,Fare
count,714.0,891.0
mean,29.699118,32.204208
std,14.526497,49.693429
min,0.42,0.0
25%,20.125,7.9104
50%,28.0,14.4542
75%,38.0,31.0
max,80.0,512.3292


Instead of the predefined statistics, specific combinations of aggregating statistics for given columns can be defined using the ```DataFrame.agg()``` method:

In [78]:
titanic.agg({"Age": ["min", "max", "median", "skew"],"Fare": ["min", "max", "median", "mean"],})

Unnamed: 0,Age,Fare
min,0.42,0.0
max,80.0,512.3292
median,28.0,14.4542
skew,0.389108,
mean,,32.204208


### Aggregating statistics grouped by category<a class="anchor" id="gagg"></a>

![06_groupby](https://pandas.pydata.org/docs/_images/06_groupby.svg)

In [79]:
# What is the average age for male versus female Titanic passengers?
titanic[["Sex", "Age"]].groupby("Sex").mean()

Unnamed: 0_level_0,Age
Sex,Unnamed: 1_level_1
female,27.915709
male,30.726645


As our interest is the average age for each gender, a subselection on these two columns is made first: ```titanic[["Sex", "Age"]]```. Next, the ```groupby()``` method is applied on the ```Sex``` column to make a group per category. The average ```age``` for each ```gender``` is calculated and returned.

Calculating a given statistic (e.g. ```mean``` age) for each category in a column (e.g. male/female in the Sex column) is a common pattern. The ```groupby``` method supports this type of operation. It fits the more general ```split-apply-combine``` pattern:

- Split the data into groups
- Apply a function to each group independently
- Combine the results into a data structure

The apply and combine steps are typically done together in pandas.
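The three steps can be spelled out by hand to see what ```groupby``` does. The sketch below uses a hypothetical four-row table and compares a manual split-apply-combine with the one-line ```groupby``` form; it is only meant to illustrate the idea, not how pandas implements it internally.

```python
import pandas as pd

# Hypothetical mini-table (illustrative values only).
df = pd.DataFrame({"Sex": ["male", "female", "female", "male"],
                   "Age": [22.0, 38.0, 26.0, 35.0]})

# Split: iterate over one sub-frame per category.
# Apply: take the mean age of each sub-frame.
# Combine: collect the results into a Series.
manual = pd.Series({sex: group["Age"].mean() for sex, group in df.groupby("Sex")})

# groupby performs all three steps in a single expression.
grouped = df.groupby("Sex")["Age"].mean()

print(manual)
print(grouped)
```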

In the previous example, we explicitly selected the two columns first. Without that selection, the ```mean``` method is applied to every numerical column when ```numeric_only=True``` is passed:

In [80]:
titanic.groupby("Sex").mean(numeric_only=True)

Unnamed: 0_level_0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
Sex,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
female,431.028662,0.742038,2.159236,27.915709,0.694268,0.649682,44.479818
male,454.147314,0.188908,2.389948,30.726645,0.429809,0.235702,25.523893


It does not make much sense to get the average value of ```Pclass```. If we are only interested in the average age for each ```gender```, the selection of columns (square brackets ```[]``` as usual) is supported on the grouped data as well:

In [81]:
titanic.groupby("Sex")["Age"].mean()

Sex
female    27.915709
male      30.726645
Name: Age, dtype: float64

![06_groupby_select_detail](https://pandas.pydata.org/docs/_images/06_groupby_select_detail.svg)

The ```Pclass``` column contains numerical data but actually represents 3 ```categories``` (or factors) with the labels ‘1’, ‘2’ and ‘3’ respectively. Calculating statistics on these does not make much sense. Therefore, pandas provides a ```Categorical``` data type to handle this type of data. This will be covered later.
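As a brief preview, and only as a sketch on a hypothetical four-value column, converting integer class labels to the categorical type can be done with ```astype("category")```:

```python
import pandas as pd

# Hypothetical class labels (illustrative values only).
pclass = pd.Series([3, 1, 3, 2], name="Pclass")

# Convert the integer labels to the Categorical data type.
pclass_cat = pclass.astype("category")

categories = list(pclass_cat.cat.categories)  # the distinct labels
counts = pclass_cat.value_counts()            # counting still works per category

print(pclass_cat.dtype)
print(categories)
print(counts)
```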

In [82]:
# What is the mean ticket fare price for each of the sex and cabin class combinations?
titanic.groupby(["Sex", "Pclass"])["Fare"].mean()

Sex     Pclass
female  1         106.125798
        2          21.970121
        3          16.118810
male    1          67.226127
        2          19.741782
        3          12.661633
Name: Fare, dtype: float64

Grouping can be done by multiple columns at the same time. Provide the column names as a ```list``` to the ```groupby()``` method.
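The result of grouping by several columns is indexed by every grouping column at once. A minimal sketch on a hypothetical four-row table (illustrative values only) shows two common ways to reshape such a result, using ```unstack()``` and ```reset_index()```:

```python
import pandas as pd

# Hypothetical mini-table (illustrative values only).
df = pd.DataFrame({"Sex": ["female", "female", "male", "male"],
                   "Pclass": [1, 3, 1, 3],
                   "Fare": [100.0, 16.0, 60.0, 12.0]})

# Series indexed by (Sex, Pclass) pairs.
by_both = df.groupby(["Sex", "Pclass"])["Fare"].mean()

as_table = by_both.unstack("Pclass")  # Pclass values become columns
flat = by_both.reset_index()          # grouping keys become ordinary columns

print(as_table)
print(flat)
```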

----

### Count number of records by category<a class="anchor" id="06c"></a>

![06_valuecounts](https://pandas.pydata.org/docs/_images/06_valuecounts.svg)

In [83]:
# What is the number of passengers in each of the cabin classes?
titanic["Pclass"].value_counts()

3    491
1    216
2    184
Name: Pclass, dtype: int64

The ```value_counts()``` method counts the number of records for each category in a column.

The function is a shortcut, as it is actually a groupby operation in combination with counting of the number of records within each group:

In [84]:
titanic.groupby("Pclass")["Pclass"].count()

Pclass
1    216
2    184
3    491
Name: Pclass, dtype: int64

**Note**: Both ```size``` and ```count``` can be used in combination with ```groupby```. ```size``` includes ```NaN``` values and simply gives the number of rows in each group, whereas ```count``` excludes missing values. In the ```value_counts``` method, use the ```dropna``` argument to include or exclude ```NaN``` values.

In [85]:
titanic["Pclass"].value_counts(dropna=False)

3    491
1    216
2    184
Name: Pclass, dtype: int64
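Since ```Pclass``` has no missing values, ```dropna=False``` does not change the counts above. The difference between ```size```, ```count``` and the ```dropna``` argument only becomes visible on a column that does contain ```NaN```, as in this sketch on a hypothetical five-row table:

```python
import numpy as np
import pandas as pd

# Hypothetical mini-table with two missing ages (illustrative values only).
df = pd.DataFrame({"Pclass": [1, 1, 3, 3, 3],
                   "Age": [38.0, np.nan, 22.0, 26.0, np.nan]})

sizes = df.groupby("Pclass")["Age"].size()    # rows per group, NaN included
counts = df.groupby("Pclass")["Age"].count()  # non-missing values per group

with_nan = df["Age"].value_counts(dropna=False)  # NaN gets its own entry
without_nan = df["Age"].value_counts()           # dropna=True is the default

print(sizes)
print(counts)
print(with_nan)
```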

#### Note: 
- Aggregation statistics can be calculated on entire columns or rows.
- ```groupby``` provides the power of the split-apply-combine pattern.
- ```value_counts``` is a convenient shortcut to count the number of entries in each category of a variable.

Handling dates, timezones, and Unix timestamps in Pandas can be very useful when working with time series data or other temporal data. Pandas provides a number of tools for handling these types of data in a flexible and efficient way.

---
<h2><p style="text-align: center;"> 6. Exporting DataFrame</p></h2> <a class = 'anchor' id = 'Export'></a> 

A DataFrame can be exported to the following formats:
- CSV file (with a configurable delimiter) `df.to_csv('file_name')`
- JSON file `df.to_json('file_name')`
- Excel file `df.to_excel('file_name')`
- HTML file `df.to_html('file_name')`
- HDF file  `df.to_hdf('file_name', key='df')`
- pickle file `df.to_pickle('file_name')`
- Dictionary `df.to_dict()` (returns a Python dictionary instead of writing a file)

In [86]:
data = pd.read_csv('dataset/data.csv')
data.head()

Unnamed: 0,Name,Age,Salary,Experience,Gender
0,Himadri,21,15,1.0,Female
1,Hritik,20,36,1.2,Male
2,Sriyanka,22,14,2.0,Female
3,Utkarsh,23,12,0.5,Male
4,Rahul,21,16,11.0,Male


In [87]:
# Export data in csv
# data.to_csv('csv_Data.csv')
# data.to_csv('csv_Data1.csv',sep = ',') # sep = ',' is default
data.to_csv('csv_Data2.csv',sep = '|')

In [88]:
# to JSON file
data.to_json('jsonData1.json')

In [89]:
# to excel file
data.to_excel('Excel_Data.xlsx')

In [91]:
# to dictionary data; to_dict() returns a dict and does not write a file
data.to_dict()
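
A quick way to check an export is to read the file back. The sketch below is a minimal round trip, assuming a two-row frame built from the first rows shown above and a temporary directory so that no file is left behind; the file name `csv_data.csv` and the use of ```index=False``` are choices made for this illustration.

```python
import os
import tempfile

import pandas as pd

# Two rows taken from the table shown above, rebuilt by hand for this sketch.
data = pd.DataFrame({"Name": ["Himadri", "Hritik"], "Age": [21, 20]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "csv_data.csv")
    # index=False leaves the row labels out of the file.
    data.to_csv(path, sep="|", index=False)
    # The same separator must be passed when reading the file back.
    restored = pd.read_csv(path, sep="|")

as_dict = data.to_dict()                     # {column: {row label: value}}
as_records = data.to_dict(orient="records")  # one dictionary per row

print(restored)
print(as_dict)
print(as_records)
```

The argument of ```to_dict()``` is the ```orient``` of the resulting dictionary, not a file name.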
