# Day 16 of 100 days of Data Science
a crowdsourced Data Science learning program by Mr. Sharan

----
## Pandas: Date, Categorical and Sparse Data

__Date:__ Date and time values, typically used for time series data.

__Categorical Data:__ Columns, usually text, that take a limited set of repeated values and on which numerical operations cannot be performed.

__Sparse Data:__ When most of the values are 0 or null, the data can be stored in a sparse format, which keeps only the remaining values and so uses far less memory.

In [1]:
import pandas as pd
import numpy as np

### Date
__Creating a range of dates__

In [2]:
print(pd.date_range('6/1/2022', periods=10))

DatetimeIndex(['2022-06-01', '2022-06-02', '2022-06-03', '2022-06-04',
               '2022-06-05', '2022-06-06', '2022-06-07', '2022-06-08',
               '2022-06-09', '2022-06-10'],
              dtype='datetime64[ns]', freq='D')


__Date range with a frequency__

In [3]:
print(pd.date_range('6/1/2022', periods=5, freq='B'))  # 'B' = business day, so weekends are skipped

DatetimeIndex(['2022-06-01', '2022-06-02', '2022-06-03', '2022-06-06',
               '2022-06-07'],
              dtype='datetime64[ns]', freq='B')


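__Some common frequencies__
   - W   - weekly
   - M   - calendar month end
   - MS  - month start
   - SM  - semi-month end (15th and end of month)
   - Q   - quarter end
   - QS  - quarter start
   - A   - year end
   - AS  - year start
   - D   - calendar day
   - H   - hour
   - T   - minute
   - S   - second

Note: pandas 2.2 and later deprecate some of these aliases in favor of `ME`, `SME`, `QE`, `YE`, `YS`, `h`, `min` and `s`.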

In [4]:
print(pd.date_range('3/1/2022', periods=5, freq='MS'))

DatetimeIndex(['2022-03-01', '2022-04-01', '2022-05-01', '2022-06-01',
               '2022-07-01'],
              dtype='datetime64[ns]', freq='MS')
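
The other aliases in the list work the same way. As a minimal sketch, here are two more of them (`W` and `QS`); the start dates and the variable names `weekly` and `quarter_starts` are chosen only for illustration.

```python
import pandas as pd

# Weekly frequency: by default 'W' lands on Sundays,
# so the range starts at the first Sunday on or after the start date.
weekly = pd.date_range('2022-06-01', periods=3, freq='W')
print(weekly)

# Quarter-start frequency: the first day of each calendar quarter.
quarter_starts = pd.date_range('2022-01-01', periods=3, freq='QS')
print(quarter_starts)
```

Note that with `W` the first date returned is not the start date itself unless that date is already a Sunday.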


__Time Delta__

In [5]:
day1 = pd.to_datetime('today')
day2 = day1 + pd.Timedelta('1 day')
print("Day 1:", day1)
print("Day 2:", day2, day2.day_name())

Day 1: 2022-01-07 16:31:36.530496
Day 2: 2022-01-08 16:31:36.530496 Saturday
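
A `Timedelta` also comes out when one timestamp is subtracted from another. The sketch below uses two fixed, made-up timestamps (`start` and `end`) so the result is reproducible, unlike `'today'`.

```python
import pandas as pd

# Two fixed timestamps chosen only for illustration
start = pd.Timestamp('2022-01-07 16:30')
end = pd.Timestamp('2022-01-10 04:30')

# Subtracting timestamps yields a pd.Timedelta
gap = end - start
print(gap)                  # 2 days 12:00:00
print(gap.days)             # 2 (whole days only)
print(gap.total_seconds())  # 216000.0
```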


__Date Operations__

In [6]:
date = pd.Series(pd.date_range('2020-1-1', periods=7, freq='D'))
to_be_added = pd.Series([pd.Timedelta(days=i) for i in range(7)])
date_df = pd.DataFrame({'Date': date, 'To_Add': to_be_added})
print(date_df)
date_df['Final_Date'] = date_df['Date'] + date_df['To_Add']
print(date_df)

        Date To_Add
0 2020-01-01 0 days
1 2020-01-02 1 days
2 2020-01-03 2 days
3 2020-01-04 3 days
4 2020-01-05 4 days
5 2020-01-06 5 days
6 2020-01-07 6 days
        Date To_Add Final_Date
0 2020-01-01 0 days 2020-01-01
1 2020-01-02 1 days 2020-01-03
2 2020-01-03 2 days 2020-01-05
3 2020-01-04 3 days 2020-01-07
4 2020-01-05 4 days 2020-01-09
5 2020-01-06 5 days 2020-01-11
6 2020-01-07 6 days 2020-01-13


In [7]:
# Extract date components with the .dt accessor (date_df comes from the previous cell)
date_df['year'] = date_df['Date'].dt.year
date_df['month'] = date_df['Date'].dt.month
date_df['day'] = date_df['Date'].dt.day
date_df

        Date To_Add Final_Date  year  month  day
0 2020-01-01 0 days 2020-01-01  2020      1    1
1 2020-01-02 1 days 2020-01-03  2020      1    2
2 2020-01-03 2 days 2020-01-05  2020      1    3
3 2020-01-04 3 days 2020-01-07  2020      1    4
4 2020-01-05 4 days 2020-01-09  2020      1    5
5 2020-01-06 5 days 2020-01-11  2020      1    6
6 2020-01-07 6 days 2020-01-13  2020      1    7
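
The `.dt` accessor offers more than `year`, `month` and `day`. As a small sketch, the same seven dates can be labelled with weekday names and filtered down to weekends; the names `dates`, `names` and `weekend` are introduced here only for illustration.

```python
import pandas as pd

# Same seven days as in the example above
dates = pd.Series(pd.date_range('2020-01-01', periods=7, freq='D'))

# Weekday name for each date
names = dates.dt.day_name()
print(names)

# dayofweek counts Monday as 0, so 5 and 6 are Saturday and Sunday
weekend = dates[dates.dt.dayofweek >= 5]
print(weekend)
```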


### Categorical Data

In [8]:
s = pd.Series(["a","b","c","a"], dtype="category")
print(s)

0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): [a, b, c]
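
One practical reason to use the category dtype for repetitive text is memory. The sketch below compares the same column stored as plain strings and as a category; the column contents and the variable names are made up for illustration, and the exact byte counts depend on the pandas version.

```python
import pandas as pd

# A repetitive text column: 3 distinct values repeated 10,000 times
labels = pd.Series(['red', 'green', 'blue'] * 10_000)
as_category = labels.astype('category')

# deep=True counts the memory of the strings themselves
plain_bytes = labels.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print("plain strings:", plain_bytes, "bytes")
print("category     :", category_bytes, "bytes")
```

The categorical version stores each distinct string once plus a small integer code per row, which is why it should come out much smaller here.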


__Using the `pd.Categorical` constructor to create categorical data__

In [9]:
cat = pd.Categorical(['a', 'b', 'c', 'a', 'b', 'c'])
print(cat)

[a, b, c, a, b, c]
Categories (3, object): [a, b, c]


__Values that are not in the specified categories are replaced with NaN__

In [10]:
cat = pd.Categorical(['a', 'b', 'c', 'a', 'b', 'c', 'd', 'e'], categories=['b', 'c', 'a'])
print(cat)

[a, b, c, a, b, c, NaN, NaN]
Categories (3, object): [b, c, a]
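
Internally, a categorical stores an integer code for each value that points into the list of categories. A minimal sketch with a shortened version of the data above shows how a value outside the categories is represented:

```python
import pandas as pd

# 'd' is not among the declared categories
cat = pd.Categorical(['a', 'b', 'd'], categories=['b', 'c', 'a'])

# Each code is a position in cat.categories; -1 marks a missing value
print(cat.codes)   # [ 2  0 -1]
print(cat.isna())  # [False False  True]
```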


__Indexing in categorical data__

In [11]:
s = pd.Series(["a","b","c","a"], dtype="category")
print("First", s[0], " Second", s[1])

First a  Second b


__Removing specific categories__

In [12]:
print(s.cat.remove_categories("a"))

0    NaN
1      b
2      c
3    NaN
dtype: category
Categories (2, object): [b, c]
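
Note that `remove_categories` returns a new Series and leaves the original untouched. If the goal is to drop the rows as well, rather than turn them into NaN, one possible approach is to filter first and then call `remove_unused_categories`. This is a sketch; `trimmed` and `filtered` are illustrative names.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

# Returns a new Series; rows that held "a" become NaN
trimmed = s.cat.remove_categories("a")
print(list(s.cat.categories))  # original still has a, b, c
print(trimmed.isna().sum())    # 2

# Alternative: drop the rows, then drop the now-unused category
filtered = s[s != "a"].cat.remove_unused_categories()
print(filtered)
```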


### Sparse Data
__Creating a dataframe with the first 9998 rows set to NaN__

In [13]:
df = pd.DataFrame(np.random.randn(10000, 4))
df.iloc[:9998] = np.nan
sdf = df.astype(pd.SparseDtype("float", np.nan))
print(type(sdf))

<class 'pandas.core.frame.DataFrame'>


__Checking the data type of the dataframe__

In [14]:
sdf.dtypes

0    Sparse[float64, nan]
1    Sparse[float64, nan]
2    Sparse[float64, nan]
3    Sparse[float64, nan]
dtype: object

__Checking the density of the dataframe: only a tiny fraction of the values is actually stored, so memory usage is low__

In [15]:
sdf.sparse.density

0.0002
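
To see the memory saving directly, the dense and sparse versions can be compared with `memory_usage`. This is a sketch that rebuilds the same dataframe so it runs on its own; `dense_bytes` and `sparse_bytes` are illustrative names, and the exact numbers may vary between pandas versions.

```python
import numpy as np
import pandas as pd

# Rebuild the dataframe from the cells above
df = pd.DataFrame(np.random.randn(10000, 4))
df.iloc[:9998] = np.nan
sdf = df.astype(pd.SparseDtype("float", np.nan))

# Total bytes used by each version (index included)
dense_bytes = df.memory_usage().sum()
sparse_bytes = sdf.memory_usage().sum()

print("dense :", dense_bytes, "bytes")
print("sparse:", sparse_bytes, "bytes")

# Converting back to a regular dataframe when needed
restored = sdf.sparse.to_dense()
```

The density of 0.0002 corresponds to the 8 stored values (2 rows of 4 columns) out of 40,000 cells.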

## Thank You!