
# Pandas - 4

- Melting
  - `pd.melt()`
- Pivoting
  - `pd.pivot()`
  - `pd.pivot_table()`
- Binning
  - `pd.cut()`

- Null/Missing values
  - `None` vs `NaN` values
  - `isna()` & `isnull()`
- Removing null values
  - `dropna()`
- String methods
- Datetime values
- Writing to a file

---

In [None]:
import warnings
warnings.filterwarnings("ignore")

### Pfizer data

For this topic, we will use data on a few drugs being developed by **Pfizer**.

Dataset: https://drive.google.com/file/d/173A59xh2mnpmljCCB9bhC4C5eP2IS6qZ/view?usp=sharing

In [None]:
!gdown 173A59xh2mnpmljCCB9bhC4C5eP2IS6qZ

Downloading...
From: https://drive.google.com/uc?id=173A59xh2mnpmljCCB9bhC4C5eP2IS6qZ
To: /content/Pfizer_1.csv
  0% 0.00/1.51k [00:00<?, ?B/s]100% 1.51k/1.51k [00:00<00:00, 5.92MB/s]


**What is the data about?**
- Temperature (K)
- Pressure (P)

The readings are recorded **every hour** each day to monitor drug stability in a drug development test.

These data points are therefore used to **identify the optimal set of values of parameters** for the stability of the drugs.

Let's explore this dataset -


In [None]:
import pandas as pd

In [None]:
data = pd.read_csv('Pfizer_1.csv')
data

Unnamed: 0,Date,Drug_Name,Parameter,1:30:00,2:30:00,3:30:00,4:30:00,5:30:00,6:30:00,7:30:00,8:30:00,9:30:00,10:30:00,11:30:00,12:30:00
0,15-10-2020,diltiazem hydrochloride,Temperature,23.0,22.0,,21.0,21.0,22,23.0,21.0,22.0,20,20.0,21
1,15-10-2020,diltiazem hydrochloride,Pressure,12.0,13.0,,11.0,13.0,14,16.0,16.0,24.0,18,19.0,20
2,15-10-2020,docetaxel injection,Temperature,,17.0,18.0,,17.0,18,,,23.0,23,25.0,25
3,15-10-2020,docetaxel injection,Pressure,,22.0,22.0,,22.0,23,,,27.0,26,29.0,28
4,15-10-2020,ketamine hydrochloride,Temperature,24.0,,,27.0,,26,25.0,24.0,23.0,22,21.0,20
5,15-10-2020,ketamine hydrochloride,Pressure,8.0,,,7.0,,9,10.0,11.0,10.0,9,9.0,11
6,16-10-2020,diltiazem hydrochloride,Temperature,34.0,35.0,36.0,36.0,37.0,38,37.0,38.0,39.0,40,,42
7,16-10-2020,diltiazem hydrochloride,Pressure,18.0,19.0,20.0,21.0,22.0,23,24.0,25.0,25.0,24,,27
8,16-10-2020,docetaxel injection,Temperature,46.0,47.0,,48.0,48.0,49,50.0,52.0,55.0,56,57.0,58
9,16-10-2020,docetaxel injection,Pressure,23.0,24.0,,25.0,26.0,27,28.0,29.0,28.0,28,29.0,30


In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18 entries, 0 to 17
Data columns (total 15 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Date       18 non-null     object 
 1   Drug_Name  18 non-null     object 
 2   Parameter  18 non-null     object 
 3   1:30:00    16 non-null     float64
 4   2:30:00    16 non-null     float64
 5   3:30:00    12 non-null     float64
 6   4:30:00    14 non-null     float64
 7   5:30:00    16 non-null     float64
 8   6:30:00    18 non-null     int64  
 9   7:30:00    16 non-null     float64
 10  8:30:00    14 non-null     float64
 11  9:30:00    16 non-null     float64
 12  10:30:00   18 non-null     int64  
 13  11:30:00   16 non-null     float64
 14  12:30:00   18 non-null     int64  
dtypes: float64(9), int64(3), object(3)
memory usage: 2.2+ KB


---

### Melting

As we saw earlier, the dataset has **18 rows** and **15 columns**.

If you notice further, you'll see:
- The columns are `1:30:00`, `2:30:00`, `3:30:00`, and so on.
- The `Temperature` and `Pressure` readings for each date are in separate rows.

**Can we restructure our data into a better format?**

- Maybe we can have a column for `time`, with `timestamps` as the column value.

**Where will the Temperature/Pressure values go?**

- We can similarly create one column containing the values of these parameters.
- **"Melt" the timestamp columns into two columns**: the timestamp and the corresponding value.

**How can we restructure our data into having every row corresponding to a single reading?**

In [None]:
pd.melt(data, id_vars=['Date', 'Parameter', 'Drug_Name'])

Unnamed: 0,Date,Parameter,Drug_Name,variable,value
0,15-10-2020,Temperature,diltiazem hydrochloride,1:30:00,23.0
1,15-10-2020,Pressure,diltiazem hydrochloride,1:30:00,12.0
2,15-10-2020,Temperature,docetaxel injection,1:30:00,
3,15-10-2020,Pressure,docetaxel injection,1:30:00,
4,15-10-2020,Temperature,ketamine hydrochloride,1:30:00,24.0
...,...,...,...,...,...
211,17-10-2020,Pressure,diltiazem hydrochloride,12:30:00,14.0
212,17-10-2020,Temperature,docetaxel injection,12:30:00,23.0
213,17-10-2020,Pressure,docetaxel injection,12:30:00,28.0
214,17-10-2020,Temperature,ketamine hydrochloride,12:30:00,24.0


This converts our data from `wide` to `long` format.

Notice that `id_vars` is the set of columns that remain unmelted.

**How does `pd.melt()` work?**

- Pass in the **DataFrame**.
- Pass in the **column names that we don't want to melt**.
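To see these two steps in isolation, here is a minimal sketch on a tiny, made-up frame (the drug names `A` and `B` and the values are hypothetical, not taken from the Pfizer file). It shows that every column not listed in `id_vars` is stacked into a label column and a value column.

```python
import pandas as pd

# Hypothetical toy frame: one row per drug, one column per reading time.
wide = pd.DataFrame({
    "Drug_Name": ["A", "B"],
    "1:30:00": [23.0, 17.0],
    "2:30:00": [22.0, 18.0],
})

# Drug_Name stays as the identifier; every other column is melted.
long_df = pd.melt(wide, id_vars=["Drug_Name"],
                  var_name="time", value_name="reading")

print(long_df)
# Rows in the long frame = rows in wide * number of melted columns.
print(long_df.shape)
```

With 2 rows and 2 melted columns, the long frame ends up with 4 rows, one per individual reading.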

But we can provide better names to these new columns.

**How can we rename the columns "variable" and "value" as per our original dataframe?**

In [None]:
data_melt = pd.melt(data,id_vars = ['Date', 'Drug_Name', 'Parameter'],
            var_name = "time",
            value_name = 'reading')
data_melt

Unnamed: 0,Date,Drug_Name,Parameter,time,reading
0,15-10-2020,diltiazem hydrochloride,Temperature,1:30:00,23.0
1,15-10-2020,diltiazem hydrochloride,Pressure,1:30:00,12.0
2,15-10-2020,docetaxel injection,Temperature,1:30:00,
3,15-10-2020,docetaxel injection,Pressure,1:30:00,
4,15-10-2020,ketamine hydrochloride,Temperature,1:30:00,24.0
...,...,...,...,...,...
211,17-10-2020,diltiazem hydrochloride,Pressure,12:30:00,14.0
212,17-10-2020,docetaxel injection,Temperature,12:30:00,23.0
213,17-10-2020,docetaxel injection,Pressure,12:30:00,28.0
214,17-10-2020,ketamine hydrochloride,Temperature,12:30:00,24.0


**Conclusion:**

- The labels of the timestamp columns are conveniently **melted into a single column**, `time`.
- All the values are retained in the `reading` column.
- The column labels such as `1:30:00` and `2:30:00` have become categories of the `time` column (named `variable` by default).
- The values from the melted columns are stored in the `reading` column (named `value` by default).

---

## Pivoting

Now suppose we want to convert our data back to the **wide format**.

The reason could be to maintain the structure for storing or some other purpose.

Notice,
- The variables `Date`, `Drug_Name` and `Parameter` will remain the same.
- The column names will be extracted from the column `time`.
- The values will be extracted from the column `reading`.

**How can we restructure our data back to the original wide format?**

In [None]:
data_melt.pivot(index=['Date','Drug_Name','Parameter'],  # Columns used to make new frame’s index
                columns = 'time',                        # Column used to make new frame’s columns
                values='reading')                        # Column used for populating new frame’s values.

Unnamed: 0_level_0,Unnamed: 1_level_0,time,10:30:00,11:30:00,12:30:00,1:30:00,2:30:00,3:30:00,4:30:00,5:30:00,6:30:00,7:30:00,8:30:00,9:30:00
Date,Drug_Name,Parameter,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
15-10-2020,diltiazem hydrochloride,Pressure,18.0,19.0,20.0,12.0,13.0,,11.0,13.0,14.0,16.0,16.0,24.0
15-10-2020,diltiazem hydrochloride,Temperature,20.0,20.0,21.0,23.0,22.0,,21.0,21.0,22.0,23.0,21.0,22.0
15-10-2020,docetaxel injection,Pressure,26.0,29.0,28.0,,22.0,22.0,,22.0,23.0,,,27.0
15-10-2020,docetaxel injection,Temperature,23.0,25.0,25.0,,17.0,18.0,,17.0,18.0,,,23.0
15-10-2020,ketamine hydrochloride,Pressure,9.0,9.0,11.0,8.0,,,7.0,,9.0,10.0,11.0,10.0
15-10-2020,ketamine hydrochloride,Temperature,22.0,21.0,20.0,24.0,,,27.0,,26.0,25.0,24.0,23.0
16-10-2020,diltiazem hydrochloride,Pressure,24.0,,27.0,18.0,19.0,20.0,21.0,22.0,23.0,24.0,25.0,25.0
16-10-2020,diltiazem hydrochloride,Temperature,40.0,,42.0,34.0,35.0,36.0,36.0,37.0,38.0,37.0,38.0,39.0
16-10-2020,docetaxel injection,Pressure,28.0,29.0,30.0,23.0,24.0,,25.0,26.0,27.0,28.0,29.0,28.0
16-10-2020,docetaxel injection,Temperature,56.0,57.0,58.0,46.0,47.0,,48.0,48.0,49.0,50.0,52.0,55.0


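Notice that `pivot()` reverses what `melt()` did. The column order differs from the original only because `pivot()` sorts the string labels, so `10:30:00` comes before `1:30:00`.

We get a **MultiIndex** here, but we can go back to a plain integer index using `reset_index()`.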

In [None]:
data_melt.pivot(index=['Date','Drug_Name','Parameter'],
                columns = 'time',
                values='reading').reset_index()

time,Date,Drug_Name,Parameter,10:30:00,11:30:00,12:30:00,1:30:00,2:30:00,3:30:00,4:30:00,5:30:00,6:30:00,7:30:00,8:30:00,9:30:00
0,15-10-2020,diltiazem hydrochloride,Pressure,18.0,19.0,20.0,12.0,13.0,,11.0,13.0,14.0,16.0,16.0,24.0
1,15-10-2020,diltiazem hydrochloride,Temperature,20.0,20.0,21.0,23.0,22.0,,21.0,21.0,22.0,23.0,21.0,22.0
2,15-10-2020,docetaxel injection,Pressure,26.0,29.0,28.0,,22.0,22.0,,22.0,23.0,,,27.0
3,15-10-2020,docetaxel injection,Temperature,23.0,25.0,25.0,,17.0,18.0,,17.0,18.0,,,23.0
4,15-10-2020,ketamine hydrochloride,Pressure,9.0,9.0,11.0,8.0,,,7.0,,9.0,10.0,11.0,10.0
5,15-10-2020,ketamine hydrochloride,Temperature,22.0,21.0,20.0,24.0,,,27.0,,26.0,25.0,24.0,23.0
6,16-10-2020,diltiazem hydrochloride,Pressure,24.0,,27.0,18.0,19.0,20.0,21.0,22.0,23.0,24.0,25.0,25.0
7,16-10-2020,diltiazem hydrochloride,Temperature,40.0,,42.0,34.0,35.0,36.0,36.0,37.0,38.0,37.0,38.0,39.0
8,16-10-2020,docetaxel injection,Pressure,28.0,29.0,30.0,23.0,24.0,,25.0,26.0,27.0,28.0,29.0,28.0
9,16-10-2020,docetaxel injection,Temperature,56.0,57.0,58.0,46.0,47.0,,48.0,48.0,49.0,50.0,52.0,55.0
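As a small sanity check of the melt/pivot round trip, the sketch below uses a made-up two-drug frame (names and values are hypothetical) and rebuilds the wide layout from the long one. It is only an illustration of the idea, not a replacement for the cells above.

```python
import pandas as pd

# Hypothetical toy frame in wide format.
wide = pd.DataFrame({
    "Drug_Name": ["A", "B"],
    "1:30:00": [23.0, 17.0],
    "2:30:00": [22.0, 18.0],
})

# Wide -> long.
long_df = wide.melt(id_vars="Drug_Name", var_name="time", value_name="reading")

# Long -> wide again: one row per drug, one column per time label.
restored = (long_df
            .pivot(index="Drug_Name", columns="time", values="reading")
            .reset_index())
restored.columns.name = None  # drop the leftover "time" label on the column axis

print(restored)
```

The restored frame has the same columns and values as `wide`; on real data the column order can change because `pivot()` sorts the labels.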


In [None]:
data_melt.head()

Unnamed: 0,Date,Drug_Name,Parameter,time,reading
0,15-10-2020,diltiazem hydrochloride,Temperature,1:30:00,23.0
1,15-10-2020,diltiazem hydrochloride,Pressure,1:30:00,12.0
2,15-10-2020,docetaxel injection,Temperature,1:30:00,
3,15-10-2020,docetaxel injection,Pressure,1:30:00,
4,15-10-2020,ketamine hydrochloride,Temperature,1:30:00,24.0


Now if you notice,
- We are using 2 rows to log readings for a single experiment.

**Can we restructure the data further by splitting the `Parameter` column into Temperature and Pressure?**

- A format like `Date | time | Drug_Name | Pressure | Temperature` would be suitable.
- We want to **split one single column into multiple columns**.

**How can we divide the `Parameter` column again?**

In [None]:
data_tidy = data_melt.pivot(index=['Date','time', 'Drug_Name'],
                            columns = 'Parameter',
                            values='reading')
data_tidy

Unnamed: 0_level_0,Unnamed: 1_level_0,Parameter,Pressure,Temperature
Date,time,Drug_Name,Unnamed: 3_level_1,Unnamed: 4_level_1
15-10-2020,10:30:00,diltiazem hydrochloride,18.0,20.0
15-10-2020,10:30:00,docetaxel injection,26.0,23.0
15-10-2020,10:30:00,ketamine hydrochloride,9.0,22.0
15-10-2020,11:30:00,diltiazem hydrochloride,19.0,20.0
15-10-2020,11:30:00,docetaxel injection,29.0,25.0
...,...,...,...,...
17-10-2020,8:30:00,docetaxel injection,26.0,19.0
17-10-2020,8:30:00,ketamine hydrochloride,11.0,20.0
17-10-2020,9:30:00,diltiazem hydrochloride,9.0,13.0
17-10-2020,9:30:00,docetaxel injection,27.0,20.0


Notice that a **multi-index** dataframe has been created.

We can use `reset_index()` to remove the multi-index.

In [None]:
data_tidy = data_tidy.reset_index()
data_tidy

Parameter,Date,time,Drug_Name,Pressure,Temperature
0,15-10-2020,10:30:00,diltiazem hydrochloride,18.0,20.0
1,15-10-2020,10:30:00,docetaxel injection,26.0,23.0
2,15-10-2020,10:30:00,ketamine hydrochloride,9.0,22.0
3,15-10-2020,11:30:00,diltiazem hydrochloride,19.0,20.0
4,15-10-2020,11:30:00,docetaxel injection,29.0,25.0
...,...,...,...,...,...
103,17-10-2020,8:30:00,docetaxel injection,26.0,19.0
104,17-10-2020,8:30:00,ketamine hydrochloride,11.0,20.0
105,17-10-2020,9:30:00,diltiazem hydrochloride,9.0,13.0
106,17-10-2020,9:30:00,docetaxel injection,27.0,20.0


The column axis still carries the name `Parameter`, left over from the pivot. We can clear it by setting `columns.name` to `None`.

In [None]:
data_tidy.columns.name = None
data_tidy.head()

Unnamed: 0,Date,time,Drug_Name,Pressure,Temperature
0,15-10-2020,10:30:00,diltiazem hydrochloride,18.0,20.0
1,15-10-2020,10:30:00,docetaxel injection,26.0,23.0
2,15-10-2020,10:30:00,ketamine hydrochloride,9.0,22.0
3,15-10-2020,11:30:00,diltiazem hydrochloride,19.0,20.0
4,15-10-2020,11:30:00,docetaxel injection,29.0,25.0


Now suppose we want to find some insights, like **mean temperature day-wise**.

**Can we use pivot to find the day-wise mean value of temperature for each drug?**

In [None]:
import numpy as np
data_tidy.pivot(index=['Drug_Name'],
                columns = 'Date',
                values=['Temperature'])

ValueError: Index contains duplicate entries, cannot reshape

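**Why did we get an error?**

- `pivot()` only reshapes; it needs exactly one value for every (`index`, `columns`) pair.
- Here each drug has many temperature readings per date, so the pairs repeat; that is what the **duplicate entries** in the error refer to.

To get the **average** temperature per day, those repeated readings have to be aggregated.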
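One way to confirm this before pivoting is to look for repeated key pairs. The sketch below uses a small made-up frame (drug names and values are hypothetical) and `DataFrame.duplicated()` to flag them, then shows that `pivot()` fails on the same data.

```python
import pandas as pd

# Hypothetical tidy data: drug "A" has two readings on the same date.
tidy = pd.DataFrame({
    "Drug_Name": ["A", "A", "B"],
    "Date": ["15-10-2020", "15-10-2020", "15-10-2020"],
    "Temperature": [20.0, 22.0, 18.0],
})

# Flag every row whose (Drug_Name, Date) pair occurs more than once.
dupes = tidy.duplicated(subset=["Drug_Name", "Date"], keep=False)
print(tidy[dupes])

# pivot() has no rule for combining the repeated readings, so it raises.
try:
    tidy.pivot(index="Drug_Name", columns="Date", values="Temperature")
except ValueError as err:
    print("pivot failed:", err)
```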

**What can we do to get our required mean values then?**

In [None]:
pd.pivot_table(data_tidy, index='Drug_Name', columns='Date', values=['Temperature'], aggfunc='mean')

Unnamed: 0_level_0,Temperature,Temperature,Temperature
Date,15-10-2020,16-10-2020,17-10-2020
Drug_Name,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
diltiazem hydrochloride,21.454545,37.454545,15.636364
docetaxel injection,20.75,51.454545,17.5
ketamine hydrochloride,23.555556,11.5,18.5


This function is similar to `pivot()`, with an extra feature of an aggregator.

**How does `pivot_table()` work?**

- The initial parameters are the same as the ones we use in `pivot()`.
- As an extra parameter, we pass the **type of aggregator**.

**Note:**

- We could have done this using `groupby` too.
- In fact, `pivot_table` uses `groupby` in the backend to group the data and perform the aggregation.
- The only difference is in the type of output we get from the two functions.
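To make the `groupby` connection concrete, here is a sketch on a small made-up frame (all names and values are hypothetical). It computes the same day-wise means both ways; `unstack()` is what turns the grouped result into the wide layout that `pivot_table` returns.

```python
import pandas as pd

# Hypothetical tidy data with repeated (drug, date) readings.
tidy = pd.DataFrame({
    "Drug_Name": ["A", "A", "B", "B"],
    "Date": ["15-10-2020", "15-10-2020", "15-10-2020", "16-10-2020"],
    "Temperature": [20.0, 22.0, 18.0, 30.0],
})

# Route 1: pivot_table with an aggregator.
by_pivot = pd.pivot_table(tidy, index="Drug_Name", columns="Date",
                          values="Temperature", aggfunc="mean")

# Route 2: group, aggregate, then move Date from the index to the columns.
by_group = (tidy.groupby(["Drug_Name", "Date"])["Temperature"]
            .mean()
            .unstack("Date"))

print(by_pivot)
print(by_group)
```

Both frames hold the same numbers; drug `A` has no reading on the second date, so that cell is `NaN` in both.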

**Similarly, what if we want to find the minimum temperature and pressure of each drug on each date?**

In [None]:
pd.pivot_table(data_tidy, index='Drug_Name', columns='Date', values=['Temperature', 'Pressure'], aggfunc='min')

Unnamed: 0_level_0,Pressure,Pressure,Pressure,Temperature,Temperature,Temperature
Date,15-10-2020,16-10-2020,17-10-2020,15-10-2020,16-10-2020,17-10-2020
Drug_Name,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
diltiazem hydrochloride,11.0,18.0,3.0,20.0,34.0,10.0
docetaxel injection,22.0,23.0,20.0,17.0,46.0,12.0
ketamine hydrochloride,7.0,12.0,8.0,20.0,8.0,13.0


---

## Binning

Sometimes, we want our data in **categorical** form instead of **continuous/numerical** form.

- Let's say that, instead of the specific value of a reading, we want to know its type.
- Depending on the level of granularity, we could have Low, Medium, High, Very High.

**How can we derive bins/buckets from continuous data?**

- use `pd.cut()`

Let's try to use this on our `Temperature` column to categorise the data into bins.

But to define categories, let's first check `min` and `max` temperature values.

In [None]:
data_tidy

Unnamed: 0,Date,time,Drug_Name,Pressure,Temperature
0,15-10-2020,10:30:00,diltiazem hydrochloride,18.0,20.0
1,15-10-2020,10:30:00,docetaxel injection,26.0,23.0
2,15-10-2020,10:30:00,ketamine hydrochloride,9.0,22.0
3,15-10-2020,11:30:00,diltiazem hydrochloride,19.0,20.0
4,15-10-2020,11:30:00,docetaxel injection,29.0,25.0
...,...,...,...,...,...
103,17-10-2020,8:30:00,docetaxel injection,26.0,19.0
104,17-10-2020,8:30:00,ketamine hydrochloride,11.0,20.0
105,17-10-2020,9:30:00,diltiazem hydrochloride,9.0,13.0
106,17-10-2020,9:30:00,docetaxel injection,27.0,20.0


In [None]:
print(data_tidy['Temperature'].min(), data_tidy['Temperature'].max())

8.0 58.0


Here,
- Min value = 8
- Max value = 58

Let's keep some buffer for future values and take the range 5 to 60 (instead of 8 to 58).

We'll divide this range into **4 bins**, each 10 to 15 degrees wide.

In [None]:
temp_points = [5, 20, 35, 50, 60]

temp_labels = ['low','medium','high','very_high'] # labels define the severity of the resultant output of the test

In [None]:
data_tidy['temp_cat'] = pd.cut(data_tidy['Temperature'], bins=temp_points, labels=temp_labels)
data_tidy.head()

Unnamed: 0,Date,time,Drug_Name,Pressure,Temperature,temp_cat
0,15-10-2020,10:30:00,diltiazem hydrochloride,18.0,20.0,low
1,15-10-2020,10:30:00,docetaxel injection,26.0,23.0,medium
2,15-10-2020,10:30:00,ketamine hydrochloride,9.0,22.0,medium
3,15-10-2020,11:30:00,diltiazem hydrochloride,19.0,20.0,low
4,15-10-2020,11:30:00,docetaxel injection,29.0,25.0,medium


In [None]:
data_tidy['temp_cat'].value_counts()

Unnamed: 0_level_0,count
temp_cat,Unnamed: 1_level_1
low,45
medium,30
high,15
very_high,5


**Note:** By default, `pd.cut()` creates intervals of the form `(x, y]`, which include the right endpoint but exclude the left one. That is why a reading of exactly `20.0` is labelled `low` above.
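The effect of the interval edges is easiest to see on a few boundary values. The sketch below reuses the same bin edges and labels on a made-up series of temperatures and compares the default with `right=False`.

```python
import pandas as pd

# Made-up temperatures sitting on or near the bin edges.
temps = pd.Series([5.0, 20.0, 20.5, 60.0])
bins = [5, 20, 35, 50, 60]
labels = ["low", "medium", "high", "very_high"]

# Default: intervals are (5, 20], (20, 35], (35, 50], (50, 60].
right_closed = pd.cut(temps, bins=bins, labels=labels)

# right=False: intervals are [5, 20), [20, 35), [35, 50), [50, 60).
left_closed = pd.cut(temps, bins=bins, labels=labels, right=False)

print(pd.DataFrame({
    "temp": temps,
    "right=True": right_closed,
    "right=False": left_closed,
}))
```

A value that falls outside every interval becomes `NaN`: with the default that is `5.0`, and with `right=False` it is `60.0`. Passing `include_lowest=True` makes the first interval include its left edge.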

---

## Data Preparation

In [None]:
import pandas as pd
import numpy as np

data = pd.read_csv('Pfizer_1.csv')

data_melt = pd.melt(data,id_vars = ['Date', 'Drug_Name', 'Parameter'],
            var_name = "time",
            value_name = 'reading')

data_tidy = data_melt.pivot(index=['Date','time', 'Drug_Name'],
                                        columns = 'Parameter',
                                        values='reading')
data_tidy = data_tidy.reset_index()
data_tidy.columns.name = None

In [None]:
data.head()

Unnamed: 0,Date,Drug_Name,Parameter,1:30:00,2:30:00,3:30:00,4:30:00,5:30:00,6:30:00,7:30:00,8:30:00,9:30:00,10:30:00,11:30:00,12:30:00
0,15-10-2020,diltiazem hydrochloride,Temperature,23.0,22.0,,21.0,21.0,22,23.0,21.0,22.0,20,20.0,21
1,15-10-2020,diltiazem hydrochloride,Pressure,12.0,13.0,,11.0,13.0,14,16.0,16.0,24.0,18,19.0,20
2,15-10-2020,docetaxel injection,Temperature,,17.0,18.0,,17.0,18,,,23.0,23,25.0,25
3,15-10-2020,docetaxel injection,Pressure,,22.0,22.0,,22.0,23,,,27.0,26,29.0,28
4,15-10-2020,ketamine hydrochloride,Temperature,24.0,,,27.0,,26,25.0,24.0,23.0,22,21.0,20


In [None]:
data_melt.head()

Unnamed: 0,Date,Drug_Name,Parameter,time,reading
0,15-10-2020,diltiazem hydrochloride,Temperature,1:30:00,23.0
1,15-10-2020,diltiazem hydrochloride,Pressure,1:30:00,12.0
2,15-10-2020,docetaxel injection,Temperature,1:30:00,
3,15-10-2020,docetaxel injection,Pressure,1:30:00,
4,15-10-2020,ketamine hydrochloride,Temperature,1:30:00,24.0


In [None]:
data_tidy.head()

Unnamed: 0,Date,time,Drug_Name,Pressure,Temperature
0,15-10-2020,10:30:00,diltiazem hydrochloride,18.0,20.0
1,15-10-2020,10:30:00,docetaxel injection,26.0,23.0
2,15-10-2020,10:30:00,ketamine hydrochloride,9.0,22.0
3,15-10-2020,11:30:00,diltiazem hydrochloride,19.0,20.0
4,15-10-2020,11:30:00,docetaxel injection,29.0,25.0


---

### `None` vs `NaN`

If you notice, there are many `NaN` values in our data.

**What are these `NaN` values?**

- They are basically **missing/null values**.
- A null value signifies an **empty cell/no data**.

There can be 2 kinds of missing values:
1. `None`
2. `NaN` (Not a Number)

**What's the difference between `None` and `NaN`?**

Both `None` and `NaN` can be used for missing values, but their representation and behaviour may differ based on the **column's data type**.

In [None]:
type(None)

NoneType

In [None]:
type(np.nan)

float

1. **None in Non-numeric** columns: None can be used directly, and it will appear as None.
2. **None in Numeric** columns: Pandas automatically converts None to NaN.
3. **NaN in Numeric** columns: NaN is used to represent missing values and appears as NaN.
4. **NaN in Non-numeric** Columns: NaN can be used, and it appears as NaN.

In [None]:
pd.Series([1, np.nan, 2, None])

Unnamed: 0,0
0,1.0
1,
2,2.0
3,


For **numerical** type, Pandas changes `None` to `NaN`.

In [None]:
pd.Series(["1", "np.nan", "2", None])

Unnamed: 0,0
0,1
1,np.nan
2,2
3,


In [None]:
pd.Series(["1", "np.nan", "2", np.nan])

Unnamed: 0,0
0,1
1,np.nan
2,2
3,


For **object** type, `None` is preserved and not changed to `NaN`. Note that the quoted `"np.nan"` is an ordinary string, not a missing value.
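A quick way to check these rules is to inspect the dtype and ask `isna()` which entries count as missing. This is a small sketch; `dtype=object` is set explicitly on the text series so the behaviour does not depend on the pandas version's default string type.

```python
import numpy as np
import pandas as pd

numeric = pd.Series([1, np.nan, 2, None])
text = pd.Series(["1", "np.nan", "2", None], dtype=object)

print(numeric.dtype)            # None was coerced to NaN, so the column is float
print(numeric.isna().tolist())
print(text.isna().tolist())     # the quoted "np.nan" is not treated as missing
print(text[3] is None)          # None is stored as-is in an object column
```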

---

### `isna()` & `isnull()`

**How can we get the count of missing values for each row/column?**

- `df.isna()`
- `df.isnull()`

In [None]:
data.isna().head()

Unnamed: 0,Date,Drug_Name,Parameter,1:30:00,2:30:00,3:30:00,4:30:00,5:30:00,6:30:00,7:30:00,8:30:00,9:30:00,10:30:00,11:30:00,12:30:00
0,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False
2,False,False,False,True,False,False,True,False,False,True,True,False,False,False,False
3,False,False,False,True,False,False,True,False,False,True,True,False,False,False,False
4,False,False,False,False,True,True,False,True,False,False,False,False,False,False,False


In [None]:
data.isnull().head()

Unnamed: 0,Date,Drug_Name,Parameter,1:30:00,2:30:00,3:30:00,4:30:00,5:30:00,6:30:00,7:30:00,8:30:00,9:30:00,10:30:00,11:30:00,12:30:00
0,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False
2,False,False,False,True,False,False,True,False,False,True,True,False,False,False,False
3,False,False,False,True,False,False,True,False,False,True,True,False,False,False,False
4,False,False,False,False,True,True,False,True,False,False,False,False,False,False,False


Notice that both `isna()` and `isnull()` give the same results.

**But why do we have two methods, `isna()` and `isnull()` for the same operation?**

- `isnull()` is just an alias for `isna()`

In [None]:
pd.isnull

In [None]:
pd.isna

As we can see, the function signature is the same for both.

- `isna()` returns a **boolean dataframe**, with each cell as a boolean value.
- This value corresponds to **whether the cell has a missing value**.
- On top of this, we can use `.sum()` to find the count of the missing values.
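The alias relationship and the counting pattern can both be checked on a small made-up frame (the column names `a` and `b` are hypothetical):

```python
import numpy as np
import pandas as pd

# At the top level, both names point to the same function object.
print(pd.isnull is pd.isna)

# Hypothetical frame with missing values in both a numeric and an object column.
df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": ["x", None, "y"],
})

mask = df.isna()                 # boolean frame, True where a value is missing
print(mask.equals(df.isnull()))  # the two methods agree cell by cell

per_column = mask.sum()          # True counts as 1, so this counts missing values
per_row = mask.sum(axis=1)
print(per_column)
print(per_row)
```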

In [None]:
data.isna().sum()

Unnamed: 0,0
Date,0
Drug_Name,0
Parameter,0
1:30:00,2
2:30:00,2
3:30:00,6
4:30:00,4
5:30:00,2
6:30:00,0
7:30:00,2


This gives us the total number of missing values in each column.

**How can we get the number of missing values in each row?**

In [None]:
data.isna().sum(axis=1)

Unnamed: 0,0
0,1
1,1
2,4
3,4
4,3
5,3
6,1
7,1
8,1
9,1


**Note:** By default, the value is `axis=0` for `sum()`.

**We now have identified the null count, but how do we deal with them?**

We have two options:
- Delete the rows/columns containing the null values.
- Fill the missing values with some data/estimate.

Let's first look at deleting the rows.

---

### Removing null values

**How can we drop rows containing null values?**

In [None]:
data.dropna()

Unnamed: 0,Date,Drug_Name,Parameter,1:30:00,2:30:00,3:30:00,4:30:00,5:30:00,6:30:00,7:30:00,8:30:00,9:30:00,10:30:00,11:30:00,12:30:00
14,17-10-2020,docetaxel injection,Temperature,12.0,13.0,14.0,15.0,16.0,17,18.0,19.0,20.0,21,22.0,23
15,17-10-2020,docetaxel injection,Pressure,20.0,22.0,22.0,22.0,22.0,23,25.0,26.0,27.0,28,29.0,28
16,17-10-2020,ketamine hydrochloride,Temperature,13.0,14.0,15.0,16.0,17.0,18,19.0,20.0,21.0,22,23.0,24
17,17-10-2020,ketamine hydrochloride,Pressure,8.0,9.0,10.0,11.0,11.0,12,12.0,11.0,12.0,13,14.0,15


Notice that rows with even a single missing value have been deleted.

**What if we want to delete the columns that have missing values?**

In [None]:
data.dropna(axis=1)

Unnamed: 0,Date,Drug_Name,Parameter,6:30:00,10:30:00,12:30:00
0,15-10-2020,diltiazem hydrochloride,Temperature,22,20,21
1,15-10-2020,diltiazem hydrochloride,Pressure,14,18,20
2,15-10-2020,docetaxel injection,Temperature,18,23,25
3,15-10-2020,docetaxel injection,Pressure,23,26,28
4,15-10-2020,ketamine hydrochloride,Temperature,26,22,20
5,15-10-2020,ketamine hydrochloride,Pressure,9,9,11
6,16-10-2020,diltiazem hydrochloride,Temperature,38,40,42
7,16-10-2020,diltiazem hydrochloride,Pressure,23,24,27
8,16-10-2020,docetaxel injection,Temperature,49,56,58
9,16-10-2020,docetaxel injection,Pressure,27,28,30


Notice that every column which had even a single missing value has been deleted.

**But what are the problems with deleting rows/columns?**
- loss of valuable data

So instead of dropping, it would be better to **fill the missing values with some data**.
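When dropping is still the right call, `dropna()` has a few parameters that make it less aggressive. The sketch below uses a made-up frame (column names are hypothetical) to compare the default with `how`, `thresh` and `subset`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 0 is complete, row 1 misses one value, row 2 is empty.
df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, 5.0, np.nan],
    "c": [3.0, 6.0, np.nan],
})

n_default = len(df.dropna())               # any missing value drops the row
n_all = len(df.dropna(how="all"))          # drop only rows that are entirely missing
n_thresh = len(df.dropna(thresh=2))        # keep rows with at least 2 non-null values
n_subset = len(df.dropna(subset=["b"]))    # only consider column "b"

print(n_default, n_all, n_thresh, n_subset)
```

Only the fully complete row survives the default call, while the other three variants also keep the row that is missing a single value.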

---


**Note**

* **Data imputation for null or missing values will be covered in detail in the upcoming modules (within the DAV - Fundamentals module).**

---

## String methods

**What kind of questions can we use string methods for?**

- Find rows which contains a particular string.

Say,

**How can you filter rows containing "hydrochloride" in their drug name?**

In [None]:
data_tidy.loc[data_tidy['Drug_Name'].str.contains('hydrochloride')].head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Date,time,Drug_Name,Pressure,Temperature,Temperature_avg,Pressure_avg
Drug_Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
diltiazem hydrochloride,0,15-10-2020,10:30:00,diltiazem hydrochloride,18.0,20.0,24.848485,15.424242
diltiazem hydrochloride,3,15-10-2020,11:30:00,diltiazem hydrochloride,19.0,20.0,24.848485,15.424242
diltiazem hydrochloride,6,15-10-2020,12:30:00,diltiazem hydrochloride,20.0,21.0,24.848485,15.424242
diltiazem hydrochloride,9,15-10-2020,1:30:00,diltiazem hydrochloride,12.0,23.0,24.848485,15.424242
diltiazem hydrochloride,12,15-10-2020,2:30:00,diltiazem hydrochloride,13.0,22.0,24.848485,15.424242


- So in general, we will be using the following format: `Series.str.function()`

- `Series.str` can be used to access the values of the series as strings and apply several methods to it.
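A few more `Series.str` methods follow the same pattern. This is a small sketch on a stand-alone series of the three drug names, not on `data_tidy` itself.

```python
import pandas as pd

drugs = pd.Series([
    "diltiazem hydrochloride",
    "docetaxel injection",
    "ketamine hydrochloride",
])

has_hcl = drugs.str.contains("hydrochloride")   # boolean mask, usable for filtering
upper = drugs.str.upper()                       # element-wise upper case
first_word = drugs.str.split(" ").str[0]        # .str[i] indexes into each list

print(has_hcl.tolist())
print(upper.tolist())
print(first_word.tolist())
```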

Now suppose we want to form a new column based on the year of the experiments.

**What can we do to form a column containing the year?**

In [None]:
data_tidy['Date'].str.split('-')

Unnamed: 0_level_0,Unnamed: 1_level_0,Date
Drug_Name,Unnamed: 1_level_1,Unnamed: 2_level_1
diltiazem hydrochloride,0,"[15, 10, 2020]"
diltiazem hydrochloride,3,"[15, 10, 2020]"
diltiazem hydrochloride,6,"[15, 10, 2020]"
diltiazem hydrochloride,9,"[15, 10, 2020]"
diltiazem hydrochloride,12,"[15, 10, 2020]"
...,...,...
ketamine hydrochloride,95,"[17, 10, 2020]"
ketamine hydrochloride,98,"[17, 10, 2020]"
ketamine hydrochloride,101,"[17, 10, 2020]"
ketamine hydrochloride,104,"[17, 10, 2020]"


To extract the year, we need to select the last element of each list.

In [None]:
data_tidy['Date'].str.split('-').apply(lambda x:x[2])

Unnamed: 0_level_0,Unnamed: 1_level_0,Date
Drug_Name,Unnamed: 1_level_1,Unnamed: 2_level_1
diltiazem hydrochloride,0,2020
diltiazem hydrochloride,3,2020
diltiazem hydrochloride,6,2020
diltiazem hydrochloride,9,2020
diltiazem hydrochloride,12,2020
...,...,...
ketamine hydrochloride,95,2020
ketamine hydrochloride,98,2020
ketamine hydrochloride,101,2020
ketamine hydrochloride,104,2020


But there are certain problems with this approach.

- The **dtype of the output is still `object`**; we would prefer a numeric type.
- The date format will **not always be day-month-year**; it can vary.

Thus, to work with such date-time type of data, we can use a special method from Pandas.

---

## Datetime

**How can we handle datetime data types?**

- We can use the `to_datetime()` function of Pandas
- It takes as input:
  - Array/Scalars with values having proper date/time format
  - `dayfirst`: Indicating if the day comes first in the date format used
  - `yearfirst`: Indicates if year comes first in the date format used
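The `dayfirst` flag matters whenever a date is ambiguous. The sketch below parses one made-up date string three ways: with the default, with `dayfirst=True`, and with an explicit `format` string, which is often the safest option when the layout is known.

```python
import pandas as pd

ambiguous = "05-10-2020"   # 5 October or 10 May?

default = pd.to_datetime(ambiguous)                      # month is read first by default
day_first = pd.to_datetime(ambiguous, dayfirst=True)     # day is read first
explicit = pd.to_datetime(ambiguous, format="%d-%m-%Y")  # layout spelled out

print(default)
print(day_first)
print(explicit)
```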

Let's first merge our `Date` and `time` columns into a new `timestamp` column.

In [None]:
data_tidy['timestamp'] = data_tidy['Date'] + " " + data_tidy['time']

In [None]:
data_tidy.drop(['Date', 'time'], axis=1, inplace=True)

In [None]:
data_tidy.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Drug_Name,Pressure,Temperature,Temperature_avg,Pressure_avg,timestamp
Drug_Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
diltiazem hydrochloride,0,diltiazem hydrochloride,18.0,20.0,24.848485,15.424242,15-10-2020 10:30:00
diltiazem hydrochloride,3,diltiazem hydrochloride,19.0,20.0,24.848485,15.424242,15-10-2020 11:30:00
diltiazem hydrochloride,6,diltiazem hydrochloride,20.0,21.0,24.848485,15.424242,15-10-2020 12:30:00
diltiazem hydrochloride,9,diltiazem hydrochloride,12.0,23.0,24.848485,15.424242,15-10-2020 1:30:00
diltiazem hydrochloride,12,diltiazem hydrochloride,13.0,22.0,24.848485,15.424242,15-10-2020 2:30:00


Now let's convert our `timestamp` column into **datetime**.

In [None]:
data_tidy['timestamp'] = pd.to_datetime(data_tidy['timestamp'], dayfirst=True)  # dates are day-month-year
data_tidy

Unnamed: 0_level_0,Unnamed: 1_level_0,Drug_Name,Pressure,Temperature,Temperature_avg,Pressure_avg,timestamp
Drug_Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
diltiazem hydrochloride,0,diltiazem hydrochloride,18.0,20.0,24.848485,15.424242,2020-10-15 10:30:00
diltiazem hydrochloride,3,diltiazem hydrochloride,19.0,20.0,24.848485,15.424242,2020-10-15 11:30:00
diltiazem hydrochloride,6,diltiazem hydrochloride,20.0,21.0,24.848485,15.424242,2020-10-15 12:30:00
diltiazem hydrochloride,9,diltiazem hydrochloride,12.0,23.0,24.848485,15.424242,2020-10-15 01:30:00
diltiazem hydrochloride,12,diltiazem hydrochloride,13.0,22.0,24.848485,15.424242,2020-10-15 02:30:00
...,...,...,...,...,...,...,...
ketamine hydrochloride,95,ketamine hydrochloride,11.0,17.0,17.709677,11.935484,2020-10-17 05:30:00
ketamine hydrochloride,98,ketamine hydrochloride,12.0,18.0,17.709677,11.935484,2020-10-17 06:30:00
ketamine hydrochloride,101,ketamine hydrochloride,12.0,19.0,17.709677,11.935484,2020-10-17 07:30:00
ketamine hydrochloride,104,ketamine hydrochloride,11.0,20.0,17.709677,11.935484,2020-10-17 08:30:00


In [None]:
data_tidy.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 108 entries, ('diltiazem hydrochloride', np.int64(0)) to ('ketamine hydrochloride', np.int64(107))
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   Drug_Name        108 non-null    object        
 1   Pressure         108 non-null    float64       
 2   Temperature      108 non-null    float64       
 3   Temperature_avg  108 non-null    float64       
 4   Pressure_avg     108 non-null    float64       
 5   timestamp        108 non-null    datetime64[ns]
dtypes: datetime64[ns](1), float64(4), object(1)
memory usage: 10.4+ KB


The dtype of the `timestamp` column has changed from `object` to `datetime64[ns]`.

Now, let's look at a single timestamp using Pandas.

**How can we extract information from a single timestamp using Pandas?**

In [None]:
ts = data_tidy['timestamp'][0]
ts

Timestamp('2020-10-15 10:30:00')

In [None]:
ts.year, ts.month, ts.day, ts.month_name()

(2020, 10, 15, 'October')

In [None]:
ts.hour, ts.minute, ts.second

(10, 30, 0)

This data parsing from `string` to `datetime` makes it easier to work with such data.

We can access these properties for a whole column at once through the `.dt` accessor.

In [None]:
data_tidy['timestamp'].dt

<pandas.core.indexes.accessors.DatetimeProperties object at 0x7de9cf621c90>

- `dt` exposes the datetime properties of the values in a column.
- From the `DatetimeProperties` of the `timestamp` column, we can extract `year`.

In [None]:
data_tidy['timestamp'].dt.year

Unnamed: 0_level_0,Unnamed: 1_level_0,timestamp
Drug_Name,Unnamed: 1_level_1,Unnamed: 2_level_1
diltiazem hydrochloride,0,2020
diltiazem hydrochloride,3,2020
diltiazem hydrochloride,6,2020
diltiazem hydrochloride,9,2020
diltiazem hydrochloride,12,2020
...,...,...
ketamine hydrochloride,95,2020
ketamine hydrochloride,98,2020
ketamine hydrochloride,101,2020
ketamine hydrochloride,104,2020
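The same accessor exposes the other components as well. Here is a minimal sketch on a stand-alone series of two timestamps (taken from the values shown above), collecting several `.dt` properties into one frame.

```python
import pandas as pd

stamps = pd.Series(pd.to_datetime(["2020-10-15 10:30:00", "2020-10-17 08:30:00"]))

# Each .dt property returns a Series aligned with the original one.
parts = pd.DataFrame({
    "year": stamps.dt.year,
    "month": stamps.dt.month_name(),
    "day": stamps.dt.day,
    "hour": stamps.dt.hour,
})

print(parts)
```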


We can use `strftime` (**short for "string format time"**) to change how a datetime is displayed.

Let's learn this with the help of a few examples.

In [None]:
data_tidy['timestamp'][0]

Timestamp('2020-10-15 10:30:00')

In [None]:
print(data_tidy['timestamp'][0].strftime('%Y')) # formatter for year

2020


Similarly, we can combine the format codes to change the datetime format as per our convenience.

A comprehensive list of other formats can be found here: https://pandas.pydata.org/docs/reference/api/pandas.Period.strftime.html

In [None]:
data_tidy['timestamp'][0].strftime('%m-%d')

'10-15'
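`strftime` is also available for a whole column through `.dt`. The sketch below formats a stand-alone series of two timestamps; the format string is just one possible combination of codes.

```python
import pandas as pd

stamps = pd.Series(pd.to_datetime(["2020-10-15 10:30:00", "2020-10-17 08:30:00"]))

# %d = day, %m = month, %Y = 4-digit year, %H = hour (24h), %M = minute
formatted = stamps.dt.strftime("%d/%m/%Y %H:%M")

print(formatted.tolist())
```

The result is a series of plain strings, so it is best done as a last step before display or export.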

---

## Writing to a file

**How can we write our dataframe to a CSV file?**

- We pass the destination path, including the file name, to `to_csv()`.

In [None]:
data_tidy.to_csv('pfizer_tidy.csv', sep=",", index=False)

Setting `index=False` leaves the index column out of the written file.
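To see what `index=False` changes without touching the disk, the sketch below writes a small made-up frame to CSV text (calling `to_csv()` without a path returns the text) and reads it back from an in-memory buffer.

```python
import io
import pandas as pd

# Hypothetical frame standing in for data_tidy.
df = pd.DataFrame({"Drug_Name": ["A", "B"], "Temperature": [20.0, 23.0]})

with_index = df.to_csv()                 # the index becomes an unnamed first column
without_index = df.to_csv(index=False)   # only the data columns are written

print(with_index.splitlines()[0])
print(without_index.splitlines()[0])

# Reading the index-free text back gives the same two columns.
round_trip = pd.read_csv(io.StringIO(without_index))
print(round_trip)
```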

---