# Putting Some Pandas In Your Python 🐼

<img style="float: right;" width="400" height="400" src="image/00_pandas.jpg">

## Introduction to Pandas
`pandas` is a Python package providing **fast, flexible, and expressive data structures** designed to make working with `relational or labeled data` both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python.


Reference: https://pandas.pydata.org/docs/getting_started/index.html

**Question: What are the Data Structures in Pandas?**  
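**Answer:**  `Series` (one-dimensional and labeled, similar to a 1-D NumPy array) and `DataFrame` (two-dimensional and labeled, similar to a 2-D NumPy array, except that each column can have its own data type)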

**Installation Command**  
<code>! pip install pandas</code>

**Importing Pandas**  
<code>import pandas as pd</code>

### What's covered in this notebook?
1. Pandas Data Structure - Series (ndarray-like)
	- Creating Series using Python list or dict
	- Creating Series from Numpy ndarray
	- Creating Series from scalar
	- Accessing Properties/Attributes and Methods of Series
	- Accessing data using Indexing and Slicing
2. Pandas Data Structure - DataFrame
	- Creating DataFrame using Python dict, list or tuple
	- Creating DataFrame using Numpy Array
	- Accessing Attributes/Properties and Methods of DataFrame
3. Working with Tabular Data
	- Dataframe to .csv & .xlsx
	- Reading .xlsx File
	- Reading .csv File - Iris Dataset
4. Non-Visual Data Analysis using Pandas (Statistical Analysis)
	- sum()
	- min() and max()
	- mean(), median(), var() and std()
	- describe() to summarize the data
	- corr(), skew() and kurt()
	- count(), unique() and value_counts() for categorical column
	- DataFrame.agg()
5. Accessing Data in a DataFrame using Indexing and Slicing
	- Reading .csv File - Weather Dataset
	- Filtering Single Column vs Multiple Columns from a `DataFrame`
	- Filtering Rows from a `DataFrame`
	- Filtering specific rows and columns from a `DataFrame`
	- loc vs iloc


## Getting Started

In [1]:
! pip install pandas



## Import Pandas Module

In [2]:
import pandas as pd
import numpy as np

## Pandas Data Structure - Series (ndarray-like)
`Series` is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the **index**.  

The basic method to create a `Series` is to call:  
<code>s = pd.Series(data, index=index)</code>  

**Important Note:** Series data structures are `value-mutable` (the values they contain can be altered) but `not size-mutable`. 

Here, data can be many different things:
> a Python list or dict  
> an ndarray  
> a scalar value (like 5)

### Creating Series using Python list or dict

In [3]:
# pd.Series(data,index)
# index -> hashable labels, same length as data (duplicate labels are allowed). Defaults to 0, 1, ..., n-1
import pandas as pd

s = pd.Series([1, 2, 3, 4])

print(s)

0    1
1    2
2    3
3    4
dtype: int64


In [4]:
s = pd.Series(['x', 'y', 'z', 'abc'])

print(s)

0      x
1      y
2      z
3    abc
dtype: object


In [5]:
s = pd.Series(['kanav', 'bansal'])

print(s)

0     kanav
1    bansal
dtype: object


In [6]:
d = {"b": 1, "a": 0, "c": 2}

s = pd.Series(d)

print(s)

b    1
a    0
c    2
dtype: int64
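When a dict is combined with an explicit `index`, pandas looks the values up by label rather than by position. The sketch below illustrates this under one assumption: the label `"d"` is a made-up key that is not in the dict, so its entry is expected to be missing (`NaN`).

```python
import pandas as pd

d = {"b": 1, "a": 0, "c": 2}

# Values are matched to the requested labels; "d" is a hypothetical label
# that has no entry in the dict, so it becomes NaN.
s_aligned = pd.Series(d, index=["b", "c", "d", "a"])

print(s_aligned)
print("Labels without a value:", s_aligned[s_aligned.isna()].index.tolist())
```

Because `NaN` is a floating point value, the resulting Series is stored as float rather than int.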


### Creating Series from Numpy ndarray

In [7]:
data = np.array([10, 20, 30, 40, 50])

s = pd.Series(data)

print(s)

0    10
1    20
2    30
3    40
4    50
dtype: int32


In [8]:
# data = np.array([[1, 2, 3], [4, 5, 6]])

# s = pd.Series(data)

# print(s)
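The commented-out cell above passes a two-dimensional array to `pd.Series`. A `Series` is one-dimensional, so this is expected to fail. The following sketch shows the failure and one possible workaround (flattening the array first); the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

data_2d = np.array([[1, 2, 3], [4, 5, 6]])

try:
    pd.Series(data_2d)
    failed = False
except Exception as exc:  # recent pandas versions raise a ValueError here
    failed = True
    print("Could not build a Series:", exc)

# Flattening to one dimension first works
s_flat = pd.Series(data_2d.ravel())
print(s_flat)
```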

### Creating Series from scalar

In [9]:
pd.Series(5.0, index=["a", "b", "c", "d", "e"])

a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
dtype: float64

### Accessing Properties/Attributes and Methods of Series

In [54]:
import pandas as pd
import numpy as np

data = np.array([10, 20, 30, 40, 50, 60, 70, 80])

s = pd.Series(data)

In [55]:
print("Data Type:", s.dtype)
print("Shape:", s.shape)
print("Values:", s.values)
print("Array:", s.array)

Data Type: int32
Shape: (8,)
Values: [10 20 30 40 50 60 70 80]
Array: <PandasArray>
[10, 20, 30, 40, 50, 60, 70, 80]
Length: 8, dtype: int32


In [56]:
print("Method to extract actual numpy ndarray:", s.to_numpy())

Method to extract actual numpy ndarray: [10 20 30 40 50 60 70 80]


In [57]:
s.head()

0    10
1    20
2    30
3    40
4    50
dtype: int32

In [58]:
s.tail()

3    40
4    50
5    60
6    70
7    80
dtype: int32

In [59]:
s.info()

<class 'pandas.core.series.Series'>
RangeIndex: 8 entries, 0 to 7
Series name: None
Non-Null Count  Dtype
--------------  -----
8 non-null      int32
dtypes: int32(1)
memory usage: 160.0 bytes


### Accessing data using Indexing and Slicing

In [12]:
s = pd.Series([1, 2, 3, 4, 5])

print(s[2])

3


In [13]:
print(s[1:])

1    2
2    3
3    4
4    5
dtype: int64


In [14]:
print(s[1:4])

1    2
2    3
3    4
dtype: int64


In [15]:
print(s[[1, 4]])

1    2
4    5
dtype: int64


In [16]:
s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

print(s)

a    1
b    2
c    3
d    4
e    5
dtype: int64


In [17]:
print(s['a'])

1


In [18]:
print(s['a':])

a    1
b    2
c    3
d    4
e    5
dtype: int64


In [19]:
# Retrieve multiple elements

print(s[['a', 'b', 'e']])

a    1
b    2
e    5
dtype: int64


In [20]:
print(s['f'])

KeyError: 'f'

In [21]:
# Using the Series.get() method, a missing label will return None or specified default
print(s.get("f"))

None


In [22]:
print(s.get("f", np.nan))

nan
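With the default index, labels and positions coincide, so `s[2]` is unambiguous. When the index holds other integers, the two can differ. The sketch below uses made-up data to show how `.loc` (label based) and `.iloc` (position based) make the intent explicit.

```python
import pandas as pd

# Integer labels deliberately out of order (illustrative data)
s_int = pd.Series([10, 20, 30], index=[2, 0, 1])

by_label = s_int.loc[0]      # the element whose label is 0
by_position = s_int.iloc[0]  # the first element, whatever its label

print("loc[0]:", by_label)
print("iloc[0]:", by_position)
```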


## Pandas Data Structure - DataFrame

A `DataFrame` is a general 2D labeled, size-mutable tabular structure with potentially heterogeneously-typed columns.

**Important Note:** A `DataFrame` is `value-mutable` (the values it contains can be altered) as well as `size-mutable` (columns can be inserted and deleted). 
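As a rough illustration of what value-mutable and size-mutable mean in practice, the sketch below changes a value, adds a column and deletes a column. The frame `df_demo` and its column names are invented for this example.

```python
import pandas as pd

df_demo = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
shape_before = df_demo.shape

# Value mutation: overwrite a single cell
df_demo.loc[0, "A"] = 100

# Size mutation: insert a new column, then delete an existing one
df_demo["C"] = df_demo["A"] + df_demo["B"]
del df_demo["B"]

print("Shape before:", shape_before)
print("Shape after:", df_demo.shape)
print(df_demo)
```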


<img style="float: right;" width="300" height="300" src="image/01_table_dataframe.PNG">

**Question: What kind of data does pandas handle?**  
**Answer:** When working with tabular data, such as data stored in spreadsheets or databases, pandas is the right tool for you. pandas will help you to explore, clean, and process your data. In pandas, a data table is called a DataFrame.  

#### Remember
> Import the package, aka `import pandas as pd`  
> A table of data is stored as a pandas `DataFrame`  
> Each column in a DataFrame is a `Series`  
> You can do things by `applying a method` to a DataFrame or Series  

### Creating a Pandas DataFrame
**Syntax**  
<code>df = pd.DataFrame(data, index=idxs, columns=cols)</code>  

Here data can be many different things:
> Python Dict, List or Tuple  
> Numpy array

### Creating DataFrame using Python dict, list or tuple

In [23]:
# Creating dataframe using Python Dictionary

data = {
        'Name': ['Tom', 'Jack', 'Steve', 'Ricky'], 
        'Age': [28,34,np.nan,42],
        'Gender': ['Male', 'Female', 'Female', 'Male']
       }

df = pd.DataFrame(data)

df

Unnamed: 0,Name,Age,Gender
0,Tom,28.0,Male
1,Jack,34.0,Female
2,Steve,,Female
3,Ricky,42.0,Male


In [24]:
# Creating a dataframe using Tuple/list

data = [('1/1/2019', 13, 6, 'Rain'),
       ('2/1/2019', 11, 7, 'Fog'),
       ('3/1/2019', 12, 8, 'Sunny'),
       ('4/1/2019', 8, 5, 'Snow'),
       ('5/1/2019', 9, 6, 'Rain')]

df = pd.DataFrame(data)

df

Unnamed: 0,0,1,2,3
0,1/1/2019,13,6,Rain
1,2/1/2019,11,7,Fog
2,3/1/2019,12,8,Sunny
3,4/1/2019,8,5,Snow
4,5/1/2019,9,6,Rain


In [25]:
# Creating a dataframe using Tuple/list

data = (('1/1/2019', 13, 6, 'Rain'),
       ('2/1/2019', 11, 7, 'Fog'),
       ('3/1/2019', 12, 8, 'Sunny'),
       ('4/1/2019', 8, 5, 'Snow'),
       ('5/1/2019', 9, 6, 'Rain'))

df = pd.DataFrame(data, columns=['Day', 'Temperature', 'WindSpeed', 'Event'])

df

Unnamed: 0,Day,Temperature,WindSpeed,Event
0,1/1/2019,13,6,Rain
1,2/1/2019,11,7,Fog
2,3/1/2019,12,8,Sunny
3,4/1/2019,8,5,Snow
4,5/1/2019,9,6,Rain


In [26]:
# Creating a dataframe using Tuple/list

data = (['1/1/2019', 13, 6, 'Rain'],
       ['2/1/2019', 11, 7, 'Fog'],
       ['3/1/2019', 12, 8, 'Sunny'],
       ['4/1/2019', 8, 5, 'Snow'],
       ['5/1/2019', 9, 6, 'Rain'])

df = pd.DataFrame(data, 
                  index=['I1', 'I2', 'I3', 'I4', 'I5'], 
                  columns=['Day', 'Temperature', 'WindSpeed', 'Event'])

df

Unnamed: 0,Day,Temperature,WindSpeed,Event
I1,1/1/2019,13,6,Rain
I2,2/1/2019,11,7,Fog
I3,3/1/2019,12,8,Sunny
I4,4/1/2019,8,5,Snow
I5,5/1/2019,9,6,Rain


In [27]:
# print(type(df['Temperature']))

# print(type(df[['Temperature']]))

### Creating DataFrame using Numpy Array

In [60]:
import numpy as np

In [61]:
arr = np.random.randint(100, 1999, size=(1000, 100))

print(arr.shape)

(1000, 100)


In [62]:
df = pd.DataFrame(arr)

df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,1946,1750,380,1609,1865,149,999,1474,142,841,...,1188,1994,1572,648,996,722,198,1934,261,1460
1,1793,1085,329,404,899,1956,1579,1411,485,1482,...,119,948,1671,699,442,1565,252,880,110,1036
2,415,550,1555,777,1652,1013,1821,1284,1194,106,...,433,1659,1915,1285,1563,1656,1420,945,177,215
3,845,1031,419,610,1525,806,1854,1941,1236,1433,...,1177,183,1422,1937,1581,828,1369,986,1647,1762
4,868,602,995,245,1110,608,1626,1055,612,1877,...,1698,158,1873,1179,1359,356,862,937,1316,1666
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,1944,1225,837,412,989,531,884,1851,1124,805,...,131,1647,714,1543,813,1260,1784,1138,585,1857
996,1447,779,303,323,340,1520,1044,691,992,322,...,1826,776,1768,1273,803,738,817,1954,1405,1265
997,1094,1286,300,941,1276,752,313,1348,991,1323,...,1877,827,614,1873,1476,1358,423,1331,1242,1625
998,656,1401,771,1073,862,631,1075,1485,1611,1565,...,1960,1929,340,169,520,658,663,880,726,528


In [63]:
df = pd.DataFrame(arr, columns=["col_"+str(i) for i in range(1, 101) ])

df

Unnamed: 0,col_1,col_2,col_3,col_4,col_5,col_6,col_7,col_8,col_9,col_10,...,col_91,col_92,col_93,col_94,col_95,col_96,col_97,col_98,col_99,col_100
0,1946,1750,380,1609,1865,149,999,1474,142,841,...,1188,1994,1572,648,996,722,198,1934,261,1460
1,1793,1085,329,404,899,1956,1579,1411,485,1482,...,119,948,1671,699,442,1565,252,880,110,1036
2,415,550,1555,777,1652,1013,1821,1284,1194,106,...,433,1659,1915,1285,1563,1656,1420,945,177,215
3,845,1031,419,610,1525,806,1854,1941,1236,1433,...,1177,183,1422,1937,1581,828,1369,986,1647,1762
4,868,602,995,245,1110,608,1626,1055,612,1877,...,1698,158,1873,1179,1359,356,862,937,1316,1666
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,1944,1225,837,412,989,531,884,1851,1124,805,...,131,1647,714,1543,813,1260,1784,1138,585,1857
996,1447,779,303,323,340,1520,1044,691,992,322,...,1826,776,1768,1273,803,738,817,1954,1405,1265
997,1094,1286,300,941,1276,752,313,1348,991,1323,...,1877,827,614,1873,1476,1358,423,1331,1242,1625
998,656,1401,771,1073,862,631,1075,1485,1611,1565,...,1960,1929,340,169,520,658,663,880,726,528


### Accessing Attributes/Properties and Methods of DataFrame

In [28]:
# Create Dictionary of Series
import pandas as pd
import numpy as np

data = {'Name':pd.Series(['Tom', 'Jack', 'Steve', 'Ricky', 'Vin', 'James', 'Vin']),
       'Age':pd.Series([25,26,25,35,23,33,31]),
       'Rating':pd.Series([4.23,4.1,3.4,5,2.9,np.nan,3.1])}

df = pd.DataFrame(data)

df

Unnamed: 0,Name,Age,Rating
0,Tom,25,4.23
1,Jack,26,4.1
2,Steve,25,3.4
3,Ricky,35,5.0
4,Vin,23,2.9
5,James,33,
6,Vin,31,3.1


In [29]:
print('Shape of DataFrame:', df.shape)
print()
print('Name of each column:', df.columns)
print()
print('Data Types of each Columns:\n', df.dtypes)
print()
print('Axes:\n', df.axes)
print()
print('Return data as numpy array:\n', df.values)

Shape of DataFrame: (7, 3)

Name of each column: Index(['Name', 'Age', 'Rating'], dtype='object')

Data Types of each Columns:
 Name       object
Age         int64
Rating    float64
dtype: object

Axes:
 [RangeIndex(start=0, stop=7, step=1), Index(['Name', 'Age', 'Rating'], dtype='object')]

Return data as numpy array:
 [['Tom' 25 4.23]
 ['Jack' 26 4.1]
 ['Steve' 25 3.4]
 ['Ricky' 35 5.0]
 ['Vin' 23 2.9]
 ['James' 33 nan]
 ['Vin' 31 3.1]]


In [30]:
# Data types of each column

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Name    7 non-null      object 
 1   Age     7 non-null      int64  
 2   Rating  6 non-null      float64
dtypes: float64(1), int64(1), object(1)
memory usage: 296.0+ bytes


The method `info()` provides technical information about a DataFrame, so let’s explain the output in more detail:

> - It is indeed a `DataFrame`.  
> - There are `7 entries`, i.e. 7 rows.  
> - Each row has a `row label` (aka the `index`) with values ranging from `0 to 6`.  
> - The table has `3 columns`. The Name and Age columns have a value for each of the rows (all 7 values are non-null). The Rating column has a missing value, so it has fewer than 7 non-null values (6).  
> - The column Name consists of textual data (strings, aka object). The other columns hold numerical data: Age contains whole numbers (aka integer) and Rating contains real numbers (aka float).  
> - The kind of data (characters, integers,…) in the different columns is summarized by listing the `dtypes`.  
> - The approximate amount of RAM used to hold the DataFrame is provided as well.
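The non-null counts reported by `info()` can also be read from the other direction, by counting the missing values per column. The sketch below rebuilds the same small table from plain lists and uses `isna().sum()`; the name `df_people` is chosen here only to avoid clashing with `df`.

```python
import numpy as np
import pandas as pd

df_people = pd.DataFrame({
    "Name": ["Tom", "Jack", "Steve", "Ricky", "Vin", "James", "Vin"],
    "Age": [25, 26, 25, 35, 23, 33, 31],
    "Rating": [4.23, 4.1, 3.4, 5, 2.9, np.nan, 3.1],
})

# isna() marks missing cells with True; summing counts them per column
missing_per_column = df_people.isna().sum()
print(missing_per_column)
```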

In [31]:
# head -> by default head returns first 5 rows

df.head()

Unnamed: 0,Name,Age,Rating
0,Tom,25,4.23
1,Jack,26,4.1
2,Steve,25,3.4
3,Ricky,35,5.0
4,Vin,23,2.9


In [32]:
df.head(2)

Unnamed: 0,Name,Age,Rating
0,Tom,25,4.23
1,Jack,26,4.1


In [33]:
# tail -> by default tail returns last 5 rows

df.tail()

Unnamed: 0,Name,Age,Rating
2,Steve,25,3.4
3,Ricky,35,5.0
4,Vin,23,2.9
5,James,33,
6,Vin,31,3.1


In [34]:
df.tail(2)

Unnamed: 0,Name,Age,Rating
5,James,33,
6,Vin,31,3.1


## Working with Tabular Data

**Question: How do I read and write tabular data?**  
**Answer:** pandas integrates with many file formats and data sources out of the box (csv, excel, sql, json, parquet,…). Importing data from each of these data sources is provided by a function with the prefix `read_*`. Similarly, the `to_*` methods are used to store data.

#### Remember
> Getting data into pandas from many different file formats or data sources is supported by `read_*` functions.  
> Exporting data out of pandas is provided by different `to_*` methods.  
> The `head/tail/info` methods and the `dtypes` attribute are convenient for a first check.  

<img width="600" height="600" src="image/02_io_readwrite.PNG"> 

### Dataframe to .csv & .xlsx

In [35]:
import pandas as pd
import numpy as np

# Create Dictionary of Series
data = {'Name':pd.Series(['Tom', 'Jack', 'Steve', 'Ricky', 'Vin', 'James', 'Smith']),
       'Age':pd.Series([25,26,25,35,23,33,31]),
       'Rating':pd.Series([4.23,4.1,3.4,5,np.nan,4.7,3.1])}

df = pd.DataFrame(data)

In [36]:
df.head()

Unnamed: 0,Name,Age,Rating
0,Tom,25,4.23
1,Jack,26,4.1
2,Steve,25,3.4
3,Ricky,35,5.0
4,Vin,23,


In [37]:
df.tail()

Unnamed: 0,Name,Age,Rating
2,Steve,25,3.4
3,Ricky,35,5.0
4,Vin,23,
5,James,33,4.7
6,Smith,31,3.1


In [38]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Name    7 non-null      object 
 1   Age     7 non-null      int64  
 2   Rating  6 non-null      float64
dtypes: float64(1), int64(1), object(1)
memory usage: 296.0+ bytes


In [39]:
# Write Dataframe to CSV

df.to_csv('data/temp/new_csv_file.csv')

In [40]:
# Write Dataframe to CSV without index

df.to_csv('data/temp/new_csv_file_no_index.csv', index=False)

In [41]:
# Write Dataframe to XLSX

df.to_excel('data/temp/new_excel_file.xlsx', sheet_name='stud_data')

In [42]:
# Write Dataframe to XLSX without index

df.to_excel('data/temp/new_excel_file_noIndex.xlsx', sheet_name='stud_data', index=False)
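The difference between writing with and without the index becomes visible when the file is read back. The sketch below uses an in-memory buffer (`io.StringIO`) instead of a file on disk, and a small made-up frame, so it can run without the `data/temp` folder.

```python
import io
import pandas as pd

df_small = pd.DataFrame({"Name": ["Tom", "Jack"], "Age": [25, 26]})

# Write with the index (the default) and read it back
buf = io.StringIO()
df_small.to_csv(buf)
buf.seek(0)
with_index = pd.read_csv(buf)

# Write without the index and read it back
buf = io.StringIO()
df_small.to_csv(buf, index=False)
buf.seek(0)
without_index = pd.read_csv(buf)

print("Written with index:   ", with_index.columns.tolist())
print("Written without index:", without_index.columns.tolist())
```

The saved index comes back as an extra column called `Unnamed: 0`. Passing `index_col=0` to `read_csv` is one way to turn it back into the index.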

### Reading .xlsx File

In [43]:
import pandas as pd

df = pd.read_excel('data/weather_data.xlsx')

df.head()

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32,6,Rain
1,1/2/2017,35,7,Sunny
2,1/3/2017,28,2,Snow
3,1/4/2017,24,7,Snow
4,1/5/2017,32,4,Rain


### Reading .csv File - Iris Dataset

In [44]:
import pandas as pd

df = pd.read_csv('data/Iris.csv')

df.head()

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa


**Data Description**  
The Iris Dataset contains four features (length and width of sepals and petals) for 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor), 150 samples in total. 

The iris data set is widely used as a beginner's dataset for machine learning purposes.  

<table>
    <tr>
        <td> 
            <p align="center">
                <img src="image/04_iris_setosa.jpg" width="150" /> 
                <br>
                <em style="color: grey">Iris Setosa</em> 
            </p>             
        </td>
        <td> 
            <p align="center">
                <img src="image/05_iris_versicolor.jpg" width="250" /> 
                <br>
                <em style="color: grey">Iris Versicolor</em> 
            </p>
        </td>
        <td> 
            <p align="center">
                <img src="image/06_iris_virginica.jpg" width="250" /> 
                <br>
                <em style="color: grey">Iris Virginica</em> 
            </p>
        </td>
    </tr>
</table>


In [45]:
df.tail()

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
145,146,6.7,3.0,5.2,2.3,Iris-virginica
146,147,6.3,2.5,5.0,1.9,Iris-virginica
147,148,6.5,3.0,5.2,2.0,Iris-virginica
148,149,6.2,3.4,5.4,2.3,Iris-virginica
149,150,5.9,3.0,5.1,1.8,Iris-virginica


In [46]:
df.shape

(150, 6)

In [47]:
df.columns

Index(['Id', 'SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm',
       'Species'],
      dtype='object')

In [48]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             150 non-null    int64  
 1   SepalLengthCm  150 non-null    float64
 2   SepalWidthCm   150 non-null    float64
 3   PetalLengthCm  150 non-null    float64
 4   PetalWidthCm   150 non-null    float64
 5   Species        150 non-null    object 
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB


In [49]:
# Write Dataframe to CSV

df.to_csv('data/temp/new_iris.csv')

In [50]:
# Write Dataframe to CSV

df.to_csv('data/temp/new_iris_no_index.csv', index=False)

## Non-Visual Data Analysis using Pandas (Statistical Analysis)

<img style="float: right;" width="300" height="300" src="image/03_reduction.PNG">

**Question: How to calculate summary statistics?**  
**Answer:** Basic statistics (mean, median, min, max, counts…) are easily calculable. These or custom aggregations can be applied on the entire data set, a sliding window of the data, or grouped by categories. The latter is also known as the split-apply-combine approach.

#### Remember
> Aggregation statistics (mean, median, min, max, counts…) can be calculated on entire columns or rows.  
> `groupby` provides the power of the split-apply-combine pattern.  
> `value_counts` is a convenient shortcut to count the number of entries in each category of a variable.
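The split-apply-combine idea mentioned above can be sketched with `groupby`. The data below is a tiny made-up sample that only borrows the column names of the Iris file; it is not the real dataset.

```python
import pandas as pd

# Made-up sample, not the real Iris data
df_flowers = pd.DataFrame({
    "Species": ["setosa", "setosa", "virginica", "virginica"],
    "PetalLengthCm": [1.4, 1.6, 5.0, 6.0],
})

# Split the rows by Species, apply mean() to each group, combine the results
mean_by_species = df_flowers.groupby("Species")["PetalLengthCm"].mean()
print(mean_by_species)
```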

In [64]:
import pandas as pd

df = pd.read_csv('data/Iris.csv')

df.head()

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa


### sum()

In [65]:
# sum()-> returns the sum of values for requested axis. by default axis = 0

df.sum()

Id                                                           11325
SepalLengthCm                                                876.5
SepalWidthCm                                                 458.1
PetalLengthCm                                                563.8
PetalWidthCm                                                 179.8
Species          Iris-setosaIris-setosaIris-setosaIris-setosaIr...
dtype: object

In [66]:
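# axis = 1 -> row wise sum
# numeric_only=True skips the text column 'Species'. Without it, older pandas versions
# emit a FutureWarning and pandas 2.x raises a TypeError.
# Another fix is to select the numeric columns explicitly, as in the next cell.

df.sum(axis=1, numeric_only=True)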


0       11.2
1       11.5
2       12.4
3       13.4
4       15.2
       ...  
145    163.2
146    162.7
147    164.7
148    166.3
149    165.8
Length: 150, dtype: float64

In [67]:
df[['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']].sum(axis=1)

0      10.2
1       9.5
2       9.4
3       9.4
4      10.2
       ... 
145    17.2
146    15.7
147    16.7
148    17.3
149    15.8
Length: 150, dtype: float64

### min() and max()

In [68]:
df.min()

Id                         1
SepalLengthCm            4.3
SepalWidthCm             2.0
PetalLengthCm            1.0
PetalWidthCm             0.1
Species          Iris-setosa
dtype: object

In [69]:
df.max()

Id                          150
SepalLengthCm               7.9
SepalWidthCm                4.4
PetalLengthCm               6.9
PetalWidthCm                2.5
Species          Iris-virginica
dtype: object

### mean(), median(), var() and std()

In [70]:
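# mean()
# numeric_only=True restricts the calculation to numeric columns. Without it, older pandas
# versions emit a FutureWarning and pandas 2.x raises a TypeError because of 'Species'.
# Another fix is to select the numeric columns explicitly, as in the next cell.

df.mean(numeric_only=True)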


Id               75.500000
SepalLengthCm     5.843333
SepalWidthCm      3.054000
PetalLengthCm     3.758667
PetalWidthCm      1.198667
dtype: float64

In [71]:
df[['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']].mean()

SepalLengthCm    5.843333
SepalWidthCm     3.054000
PetalLengthCm    3.758667
PetalWidthCm     1.198667
dtype: float64

In [72]:
df.dtypes

Id                 int64
SepalLengthCm    float64
SepalWidthCm     float64
PetalLengthCm    float64
PetalWidthCm     float64
Species           object
dtype: object

In [73]:
# Syntax: DataFrame.select_dtypes(include=None, exclude=None)
num_cols = df.select_dtypes(include=['float64']).columns

print(num_cols)

Index(['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm'], dtype='object')
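Selecting `float64` leaves out integer columns such as `Id`. If every numeric column is wanted, `include="number"` is a possible alternative. The sketch below compares the two on a small made-up frame.

```python
import pandas as pd

# Illustrative frame with one int, one float and one text column
df_mixed = pd.DataFrame({
    "Id": [1, 2, 3],
    "Length": [5.1, 4.9, 4.7],
    "Species": ["a", "b", "c"],
})

numeric_cols = df_mixed.select_dtypes(include="number").columns.tolist()
float_cols = df_mixed.select_dtypes(include=["float64"]).columns.tolist()

print("All numeric columns:", numeric_cols)
print("Float columns only: ", float_cols)
```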


In [74]:
df[num_cols].median()

SepalLengthCm    5.80
SepalWidthCm     3.00
PetalLengthCm    4.35
PetalWidthCm     1.30
dtype: float64

In [75]:
df[num_cols].var()

SepalLengthCm    0.685694
SepalWidthCm     0.188004
PetalLengthCm    3.113179
PetalWidthCm     0.582414
dtype: float64

In [76]:
# std()
df[num_cols].std()

SepalLengthCm    0.828066
SepalWidthCm     0.433594
PetalLengthCm    1.764420
PetalWidthCm     0.763161
dtype: float64

### describe() to summarize the data

In [77]:
# describe() -> summarizing the data

df.describe()

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm
count,150.0,150.0,150.0,150.0,150.0
mean,75.5,5.843333,3.054,3.758667,1.198667
std,43.445368,0.828066,0.433594,1.76442,0.763161
min,1.0,4.3,2.0,1.0,0.1
25%,38.25,5.1,2.8,1.6,0.3
50%,75.5,5.8,3.0,4.35,1.3
75%,112.75,6.4,3.3,5.1,1.8
max,150.0,7.9,4.4,6.9,2.5


In [78]:
# include object, number, all

df.describe(include=['object'])

Unnamed: 0,Species
count,150
unique,3
top,Iris-setosa
freq,50


In [79]:
df.describe(include=['number'])

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm
count,150.0,150.0,150.0,150.0,150.0
mean,75.5,5.843333,3.054,3.758667,1.198667
std,43.445368,0.828066,0.433594,1.76442,0.763161
min,1.0,4.3,2.0,1.0,0.1
25%,38.25,5.1,2.8,1.6,0.3
50%,75.5,5.8,3.0,4.35,1.3
75%,112.75,6.4,3.3,5.1,1.8
max,150.0,7.9,4.4,6.9,2.5


In [80]:
# Don't pass 'all' as a list

df.describe(include='all')

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
count,150.0,150.0,150.0,150.0,150.0,150
unique,,,,,,3
top,,,,,,Iris-setosa
freq,,,,,,50
mean,75.5,5.843333,3.054,3.758667,1.198667,
std,43.445368,0.828066,0.433594,1.76442,0.763161,
min,1.0,4.3,2.0,1.0,0.1,
25%,38.25,5.1,2.8,1.6,0.3,
50%,75.5,5.8,3.0,4.35,1.3,
75%,112.75,6.4,3.3,5.1,1.8,
max,150.0,7.9,4.4,6.9,2.5,


### corr(), skew() and kurt()

In [71]:
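# numeric_only=True (available from pandas 1.5) skips the text column 'Species';
# pandas 2.x raises an error without it
df.corr(numeric_only=True)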

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm
Id,1.0,0.716676,-0.397729,0.882747,0.899759
SepalLengthCm,0.716676,1.0,-0.109369,0.871754,0.817954
SepalWidthCm,-0.397729,-0.109369,1.0,-0.420516,-0.356544
PetalLengthCm,0.882747,0.871754,-0.420516,1.0,0.962757
PetalWidthCm,0.899759,0.817954,-0.356544,0.962757,1.0


In [72]:
# numeric_only=True avoids the warning (an error in pandas 2.x) caused by the text column 'Species'
df.skew(numeric_only=True)


Id               0.000000
SepalLengthCm    0.314911
SepalWidthCm     0.334053
PetalLengthCm   -0.274464
PetalWidthCm    -0.104997
dtype: float64

In [73]:
# numeric_only=True avoids the warning (an error in pandas 2.x) caused by the text column 'Species'
df.kurt(numeric_only=True)


Id              -1.200000
SepalLengthCm   -0.552064
SepalWidthCm     0.290781
PetalLengthCm   -1.401921
PetalWidthCm    -1.339754
dtype: float64

In [74]:
num_cols = df.select_dtypes(include=['float64']).columns

df[num_cols].kurt()

SepalLengthCm   -0.552064
SepalWidthCm     0.290781
PetalLengthCm   -1.401921
PetalWidthCm    -1.339754
dtype: float64

### count(), unique() and value_counts() for categorical column

In [82]:
df['Species'].count()

150

In [83]:
df['Species'].unique()

array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)

In [81]:
df['Species'].value_counts()

Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
Name: Species, dtype: int64
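`value_counts` can also report proportions instead of raw counts through its `normalize` parameter. A minimal sketch on made-up labels:

```python
import pandas as pd

# Illustrative labels, not the real Iris column
species = pd.Series(["setosa", "setosa", "virginica", "versicolor"])

# normalize=True returns the share of each category instead of the count
shares = species.value_counts(normalize=True)
print(shares)
```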

### DataFrame.agg()
Instead of the predefined statistics, specific combinations of aggregating statistics for given columns can be defined using the `DataFrame.agg()` method.  

A list of the available aggregating statistics can be found in the reference below:  
Reference: https://pandas.pydata.org/docs/user_guide/basics.html#basics-stats

In [69]:
df.columns

Index(['Id', 'SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm',
       'Species'],
      dtype='object')

In [70]:
df.agg(
    {
        "SepalLengthCm" : ["min", "max", "median", "count"],
        "PetalWidthCm" : ["min", "max", "mean", "count"],
        "Species" : ["count"]
    }
)

Unnamed: 0,SepalLengthCm,PetalWidthCm,Species
min,4.3,0.1,
max,7.9,2.5,
median,5.8,,
count,150.0,150.0,150.0
mean,,1.198667,


## Accessing Data in a DataFrame using Indexing and Slicing

<img style="float: right;" width="300" height="300" src="image/07_subset_columns.PNG">

**Question: How do I select a subset of a table?**  
**Answer:** Selecting or filtering specific rows and/or columns? Filtering the data on a condition? Methods for slicing, selecting, and extracting the data you need are available in pandas.


#### Remember
> When selecting subsets of data, square brackets [] are used.  
> Inside these brackets, you can use a single column/row label, a list of column/row labels, a slice of labels, a conditional expression or a colon.  
> Select specific rows and/or columns using loc when using the row and column names.  
> Select specific rows and/or columns using iloc when using the positions in the table.  
> You can assign new values to a selection based on loc/iloc.
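The last point, assigning new values to a selection, can be sketched as follows. The frame and its column names (`city`, `max_temp`, `label`) are invented for this illustration.

```python
import pandas as pd

df_temp = pd.DataFrame({
    "city": ["A", "B", "C"],
    "max_temp": [96, 70, 99],
})

# Start with a default label, then overwrite it for rows matching a condition
df_temp["label"] = "normal"
df_temp.loc[df_temp["max_temp"] > 95, "label"] = "hot"

# iloc works the same way, but with integer positions (row 0, column 1)
df_temp.iloc[0, 1] = 97

print(df_temp)
```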



### Reading .csv File - Weather Dataset
**Data Description**  
Weather data collected from the National Weather Service for a weather station in Central Park. The file used here contains daily records for 2016 (366 rows). For each day it contains the minimum temperature, maximum temperature, average temperature, precipitation, new snow fall, and current snow depth. The temperature is measured in Fahrenheit and the depth is measured in inches. `T` means that there is a trace of precipitation.

In [1]:
import pandas as pd

df = pd.read_csv('data/nyc_weather.csv')

df.head()

Unnamed: 0,date,maximum temperature,minimum temperature,average temperature,precipitation,snow fall,snow depth
0,1-1-2016,42,34,38.0,0.0,0.0,0
1,2-1-2016,40,32,36.0,0.0,0.0,0
2,3-1-2016,45,35,40.0,0.0,0.0,0
3,4-1-2016,36,14,25.0,0.0,0.0,0
4,5-1-2016,29,11,20.0,0.0,0.0,0


In [2]:
print("Shape of DataFrame:", df.shape)
print("Features/Columns:", df.columns)

Shape of DataFrame: (366, 7)
Features/Columns: Index(['date', 'maximum temperature', 'minimum temperature',
       'average temperature', 'precipitation', 'snow fall', 'snow depth'],
      dtype='object')


In [3]:
df.describe()

# Why are precipitation, snow fall and snow depth missing from this summary?
# Hint: check their dtypes in the df.info() output below.

Unnamed: 0,maximum temperature,minimum temperature,average temperature
count,366.0,366.0,366.0
mean,64.625683,49.806011,57.215847
std,18.041787,16.570747,17.12476
min,15.0,-1.0,7.0
25%,50.0,37.25,44.0
50%,64.5,48.0,55.75
75%,81.0,65.0,73.5
max,96.0,81.0,88.5


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 366 entries, 0 to 365
Data columns (total 7 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   date                 366 non-null    object 
 1   maximum temperature  366 non-null    int64  
 2   minimum temperature  366 non-null    int64  
 3   average temperature  366 non-null    float64
 4   precipitation        366 non-null    object 
 5   snow fall            366 non-null    object 
 6   snow depth           366 non-null    object 
dtypes: float64(1), int64(2), object(4)
memory usage: 20.1+ KB
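The `object` dtype of the precipitation and snow columns is consistent with the data description: a `T` (trace) entry makes the whole column text, which is why `describe()` skips it. The sketch below shows two possible ways to get numbers back. It uses made-up values rather than the real file, and treating a trace as `0` is an assumption that may not suit every analysis.

```python
import pandas as pd

# Made-up values in the same style as the precipitation column
precip = pd.Series(["0.00", "T", "0.15", "1.20"])

# Option 1: anything that is not a number (here "T") becomes NaN
as_nan = pd.to_numeric(precip, errors="coerce")

# Option 2 (assumption): treat a trace as zero before converting
as_zero = pd.to_numeric(precip.replace("T", "0"))

print(as_nan)
print(as_zero)
```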


In [5]:
# What is the maximum of avg temperature?

df['average temperature'].max()

88.5

In [6]:
# Average of Minimum Temperature

df['minimum temperature'].mean()

49.80601092896175

### Filtering Single Column vs Multiple Columns from a `DataFrame`
To select a single column, use square brackets [] with the column name of the column of interest.

In [7]:
# Selecting Single Column

max_temp_df = df['maximum temperature']

max_temp_df.head()

0    42
1    40
2    45
3    36
4    29
Name: maximum temperature, dtype: int64

In [9]:
print("Type of df['maximum temperature']:", type(max_temp_df))
print("Shape:", max_temp_df.shape)

Type of df['maximum temperature']: <class 'pandas.core.series.Series'>
Shape: (366,)


In [10]:
# Selecting Multiple Columns

temp_df = df[['maximum temperature', 'minimum temperature']]

temp_df.head()

Unnamed: 0,maximum temperature,minimum temperature
0,42,34
1,40,32
2,45,35
3,36,14
4,29,11


In [11]:
print("Type of df[['maximum temperature', 'minimum temperature']]:", type(temp_df))
print("Shape:", temp_df.shape)

Type of df[['maximum temperature', 'minimum temperature']]: <class 'pandas.core.frame.DataFrame'>
Shape: (366, 2)


### Filtering Rows from a `DataFrame`
Similar to NumPy, pandas accepts boolean masks for indexing.  
To select rows based on a conditional expression, use a condition inside the selection brackets [].

In [14]:
df["maximum temperature"] > 95

0      False
1      False
2      False
3      False
4      False
       ...  
361    False
362    False
363    False
364    False
365    False
Name: maximum temperature, Length: 366, dtype: bool

In [15]:
# Similar to NumPy, pandas accepts a boolean mask inside []
df[ df["maximum temperature"] > 95 ]

Unnamed: 0,date,maximum temperature,minimum temperature,average temperature,precipitation,snow fall,snow depth
204,23-7-2016,96,80,88.0,0,0,0
225,13-8-2016,96,81,88.5,0,0,0


The output of the **conditional expression (>, but also ==, !=, <, <=,… would work)** is actually a **pandas Series of boolean values** (either True or False) with the same number of rows as the original DataFrame. Such a `Series` of **boolean values can be used to filter the DataFrame** by putting it in between the selection brackets []. Only **rows for which the value is True will be selected**.

In [16]:
df[df['date'].isin(['10-5-2016', '10-4-2016', '10-6-2016'])]

Unnamed: 0,date,maximum temperature,minimum temperature,average temperature,precipitation,snow fall,snow depth
100,10-4-2016,50,31,40.5,0.0,0.0,0
130,10-5-2016,63,50,56.5,0.0,0.0,0
161,10-6-2016,77,57,67.0,0.0,0.0,0


Similar to the conditional expression, **the isin() conditional function returns True for each row whose value is in the provided list**. To filter the rows based on such a function, use the conditional function inside the selection brackets []. 

The above is equivalent to filtering by rows for which the date is either '10-5-2016' or '10-4-2016' or '10-6-2016' and combining the three statements with an **| (or) operator**:

In [17]:
df[ (df['date']=='10-5-2016') | 
    (df['date']=='10-4-2016') | 
    (df['date']=='10-6-2016') 
  ]

Unnamed: 0,date,maximum temperature,minimum temperature,average temperature,precipitation,snow fall,snow depth
100,10-4-2016,50,31,40.5,0.0,0.0,0
130,10-5-2016,63,50,56.5,0.0,0.0,0
161,10-6-2016,77,57,67.0,0.0,0.0,0


**Remember**  
When combining multiple conditional statements, **each condition must be surrounded by parentheses ()**. Moreover, you cannot use the Python keywords `or`/`and`; use the element-wise operators `|` (or) and `&` (and) instead.
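The same rules apply to `&` (and) and to `~` (not). The sketch below uses a small made-up frame; the column names `max_temp` and `min_temp` are shortened stand-ins, not the names in the weather file.

```python
import pandas as pd

df_w = pd.DataFrame({
    "max_temp": [96, 50, 77, 96],
    "min_temp": [80, 31, 57, 60],
})

# Both conditions must hold; note the parentheses around each one
hot_days_and_nights = df_w[(df_w["max_temp"] > 95) & (df_w["min_temp"] >= 80)]

# ~ negates a boolean mask
not_hot = df_w[~(df_w["max_temp"] > 95)]

print(hot_days_and_nights)
print(not_hot)
```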

### Filtering specific rows and columns from a `DataFrame`

In [31]:
# # Slicing ?

df[1:5]

Unnamed: 0,date,maximum temperature,minimum temperature,average temperature,precipitation,snow fall,snow depth
1,2-1-2016,40,32,36.0,0.0,0.0,0
2,3-1-2016,45,35,40.0,0.0,0.0,0
3,4-1-2016,36,14,25.0,0.0,0.0,0
4,5-1-2016,29,11,20.0,0.0,0.0,0


In [35]:
# # What if I want a slice of 1 to 4 rows and 2 to 4 cols

# df[ 1:5, 'maximum temperature' : 'average temperature' ]

In [37]:
# # Turns out to be an InvalidIndexError. Let's try to fix it

# df[ 1:5, 1:4 ]

#### How to resolve this? 😢
If you want a subset of both rows and columns in one go, selection brackets [] alone are no longer sufficient.  
Here the `loc`/`iloc` indexers are required in front of the selection brackets []. When using loc/iloc, the part before the comma is the rows you want, and the part after the comma is the columns you want to select.

**Syntax:**  
<code>df.loc[row_label, col_label]</code>  
<code>df.iloc[row_index, col_index]</code>

### loc vs iloc

In [38]:
# Label based accessing
df.loc[100]

date                   10-4-2016
maximum temperature           50
minimum temperature           31
average temperature         40.5
precipitation               0.00
snow fall                    0.0
snow depth                     0
Name: 100, dtype: object

In [39]:
df.loc[100, "date"]

'10-4-2016'

In [40]:
# Position based accessing (integer location)
df.iloc[100]

date                   10-4-2016
maximum temperature           50
minimum temperature           31
average temperature         40.5
precipitation               0.00
snow fall                    0.0
snow depth                     0
Name: 100, dtype: object

In [42]:
df.iloc[100, 0]

'10-4-2016'

In [43]:
# # Slicing with labels

# df.loc[ 10:15, "minimum temperature":"precipitation" ]

In [44]:
# # Slicing with indexes

# df.iloc[ 10:15, 2:5 ]
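One detail worth checking when the two commented-out slices above are run: a label slice with `loc` includes its end point, while a position slice with `iloc` excludes it, as in plain Python. The sketch below demonstrates this on a made-up frame with 20 rows and columns `a`, `b`, `c`.

```python
import pandas as pd

df_rows = pd.DataFrame({
    "a": range(20),
    "b": range(20, 40),
    "c": range(40, 60),
})

by_label = df_rows.loc[10:15, "a":"b"]   # rows 10..15 and columns a..b, both ends included
by_position = df_rows.iloc[10:15, 0:2]   # rows 10..14 and columns 0..1, end excluded

print("loc shape: ", by_label.shape)
print("iloc shape:", by_position.shape)
```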

In [19]:
# Remember this?

df[ df["maximum temperature"] > 95 ]

# Equivalent to filtering rows with max temp greater than 95

Unnamed: 0,date,maximum temperature,minimum temperature,average temperature,precipitation,snow fall,snow depth
204,23-7-2016,96,80,88.0,0,0,0
225,13-8-2016,96,81,88.5,0,0,0


In [20]:
# What if we want only `dates` with max temp greater than 95 ?

df.loc[ df["maximum temperature"] > 95, "date" ]

204    23-7-2016
225    13-8-2016
Name: date, dtype: object

In [47]:
# Looks like a Series. Can we convert it to a numpy array?

df.loc[ df["maximum temperature"] > 95, "date" ].to_numpy()

array(['23-7-2016', '13-8-2016'], dtype=object)