### Optimizing A Data Set for Memory Usage

In [5]:
import pandas as pd

In [6]:
employees_data = '/Users/ypushiev/Learning/PANDAS IN ACTION/Chapter 5 Dataframe filtering/Data/employees.csv'

In [7]:
pd.read_csv(employees_data).isna().sum()

First Name     68
Gender        147
Start Date      2
Salary          2
Mgmt           68
Team           44
dtype: int64

In [44]:
df_employees = pd.read_csv(employees_data, parse_dates = ['Start Date'])


In [45]:
df_employees['Start Date'] = pd.to_datetime(df_employees['Start Date'], format='%Y-%m-%d')

In [46]:
df_employees.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1001 entries, 0 to 1000
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   First Name  933 non-null    object        
 1   Gender      854 non-null    object        
 2   Start Date  999 non-null    datetime64[ns]
 3   Salary      999 non-null    float64       
 4   Mgmt        933 non-null    object        
 5   Team        957 non-null    object        
dtypes: datetime64[ns](1), float64(1), object(4)
memory usage: 47.1+ KB


**memory usage: 47.1+ KB**

In [47]:
df_employees.head()

  First Name  Gender Start Date    Salary   Mgmt       Team
0    Douglas    Male 1993-08-06       NaN   True  Marketing
1     Thomas    Male 1996-03-31   61933.0   True        NaN
2      Maria  Female        NaT  130590.0  False    Finance
3      Jerry     NaN 2005-03-04  138705.0   True    Finance
4      Larry    Male 1998-01-24  101004.0   True         IT


**Convert the datetime column to strings (object dtype) to drop the time component**

In [48]:
df_employees['Start Date'] = df_employees['Start Date'].dt.strftime('%Y-%m-%d')

In [49]:
df_employees.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1001 entries, 0 to 1000
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   First Name  933 non-null    object 
 1   Gender      854 non-null    object 
 2   Start Date  999 non-null    object 
 3   Salary      999 non-null    float64
 4   Mgmt        933 non-null    object 
 5   Team        957 non-null    object 
dtypes: float64(1), object(5)
memory usage: 47.1+ KB


**memory usage: 47.1+ KB**
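
The `+` in `47.1+ KB` indicates that pandas did not measure the contents of the object columns, only the references to them, which is why the figure looks unchanged after the conversion to strings. The following is a minimal sketch, using a small made-up Series rather than the employees data, of how `memory_usage(deep=True)` can expose the difference between storing dates as datetimes and as strings:

```python
import pandas as pd

# Hypothetical sample data, not the employees file.
dates = pd.Series(pd.to_datetime(['1993-08-06', '1996-03-31', '2005-03-04'] * 100))

# The same values as formatted strings.
as_strings = dates.dt.strftime('%Y-%m-%d')

# deep=True asks pandas to measure the stored values themselves,
# not just the references held by the column.
datetime_bytes = dates.memory_usage(index=False, deep=True)
string_bytes = as_strings.memory_usage(index=False, deep=True)

print(datetime_bytes)  # datetimes are stored as 8-byte integers
print(string_bytes)    # strings carry extra per-value overhead
```

If memory is the main concern, keeping the column as `datetime64` is likely the cheaper choice; the string form is mostly useful for display.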

### Converting Data Types with the astype Method

In [50]:
df_employees['Mgmt'].astype('bool').tail()

996     False
997     False
998     False
999      True
1000     True
Name: Mgmt, dtype: bool

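**Convert the Mgmt column from object to bool and check the effect on memory usage. Note that `astype('bool')` converts the 68 missing values to `True`, so the column reports 1001 non-null values afterwards.**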

In [51]:
df_employees['Mgmt'] = df_employees['Mgmt'].astype('bool')

In [52]:
df_employees.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1001 entries, 0 to 1000
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   First Name  933 non-null    object 
 1   Gender      854 non-null    object 
 2   Start Date  999 non-null    object 
 3   Salary      999 non-null    float64
 4   Mgmt        1001 non-null   bool   
 5   Team        957 non-null    object 
dtypes: bool(1), float64(1), object(4)
memory usage: 40.2+ KB


**memory usage: 40.2+ KB**
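
Because plain `bool` cannot represent missing data, the conversion above silently changes the data. A small sketch, on a made-up Series, comparing `bool` with the nullable `boolean` extension dtype, which may be preferable when the gaps matter:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing value, not the employees data.
mgmt = pd.Series([True, False, np.nan, True], dtype='object')

# Plain bool has no slot for missing data, so NaN is coerced to True.
as_bool = mgmt.astype('bool')

# The nullable 'boolean' extension dtype keeps the gap as <NA>.
as_nullable = mgmt.astype('boolean')

print(as_bool.tolist())
print(as_nullable.isna().sum())
```

The nullable dtype stores an extra mask alongside the values, so it uses somewhat more memory than plain `bool`, but still far less than an object column.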

### Converting to an Integer Type

**The Salary column contains NaN values, which a plain integer dtype cannot hold**

In [53]:
df_employees['Salary'].isna().sum()

2

**Replace the NaN values with 0**

In [54]:
df_employees['Salary'] = df_employees['Salary'].fillna(0)

In [56]:
df_employees['Salary'] = df_employees['Salary'].astype('int')

In [57]:
df_employees['Salary'].head()

0         0
1     61933
2    130590
3    138705
4    101004
Name: Salary, dtype: int64

In [37]:
df_employees.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1001 entries, 0 to 1000
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   First Name  933 non-null    object
 1   Gender      854 non-null    object
 2   Start Date  999 non-null    object
 3   Salary      1001 non-null   int64 
 4   Mgmt        1001 non-null   bool  
 5   Team        957 non-null    object
dtypes: bool(1), int64(1), object(4)
memory usage: 40.2+ KB
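
Filling missing salaries with 0 makes the integer conversion possible, but it also changes statistics such as the mean. As an alternative sketch (on made-up values, not the full data set), the nullable `Int32` extension dtype keeps the missing entries as `<NA>`:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing salary.
salary = pd.Series([np.nan, 61933.0, 130590.0])

# Approach used above: replace NaN with 0, then cast.
filled = salary.fillna(0).astype('int32')

# Alternative: nullable integer dtype (capital "I") keeps the gap.
nullable = salary.astype('Int32')

print(filled.mean())    # the artificial 0 pulls the mean down
print(nullable.mean())  # missing value is skipped
```

Which approach fits depends on whether a salary of 0 is an acceptable stand-in for "unknown" in later analysis.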


**Memory usage can be reduced further by using int32 instead of int64**

**Check the maximum value to confirm that it fits in int32**

In [65]:
df_employees['Salary'].max()

149908
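
The check above can also be done programmatically. A minimal sketch, assuming a small made-up Series, that compares the value range against the limits reported by `np.iinfo` and, as an alternative, lets `pd.to_numeric` choose the smallest fitting integer type:

```python
import numpy as np
import pandas as pd

# Hypothetical sample values, including the maximum seen above.
salary = pd.Series([0, 61933, 149908], dtype='int64')

# Smallest and largest values an int32 can hold.
limits = np.iinfo(np.int32)
fits = limits.min <= salary.min() and salary.max() <= limits.max

# Only downcast when no value would overflow.
if fits:
    salary = salary.astype('int32')

# Alternative: let pandas pick the smallest signed integer type that fits.
auto = pd.to_numeric(pd.Series([0, 61933, 149908], dtype='int64'),
                     downcast='integer')

print(salary.dtype, auto.dtype)
```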

In [59]:
df_employees['Salary']=df_employees['Salary'].astype('int32')

In [60]:
df_employees.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1001 entries, 0 to 1000
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   First Name  933 non-null    object
 1   Gender      854 non-null    object
 2   Start Date  999 non-null    object
 3   Salary      1001 non-null   int32 
 4   Mgmt        1001 non-null   bool  
 5   Team        957 non-null    object
dtypes: bool(1), int32(1), object(4)
memory usage: 36.3+ KB


In [66]:
df_employees['Salary'].max()

149908

In [61]:
df_employees['Salary'].head()

0         0
1     61933
2    130590
3    138705
4    101004
Name: Salary, dtype: int32

**memory usage: 36.3+ KB**

### Converting to the Categorical Type

**Count the unique values in each column of the DataFrame**

In [67]:
df_employees.nunique()

First Name    200
Gender          2
Start Date    971
Salary        995
Mgmt            2
Team           10
dtype: int64

**The Gender column contains only 2 unique values, which makes it a good candidate for the category dtype**
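
One way to spot such columns is to compare the number of unique values with the number of non-missing values. The sketch below uses a small made-up frame, and the 0.5 cutoff is an arbitrary threshold chosen for illustration, not a pandas rule:

```python
import pandas as pd

# Hypothetical sample frame, not the employees file.
df = pd.DataFrame({
    'First Name': ['Douglas', 'Thomas', 'Maria', 'Jerry',
                   'Larry', 'Dennis', 'Ruby', 'Angela'],
    'Gender': ['Male', 'Male', 'Female', None,
               'Male', 'Male', 'Female', 'Female'],
    'Team': ['Marketing', None, 'Finance', 'Finance',
             'IT', 'IT', 'Finance', 'Marketing'],
})

# Share of distinct values among the non-missing entries of each column.
unique_ratio = df.nunique() / df.count()

# Columns with few distinct values relative to their length.
# The 0.5 cutoff is an assumption made for this example.
candidates = unique_ratio[unique_ratio < 0.5].index.tolist()

print(candidates)
```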

In [69]:
df_employees['Gender'].astype('category').head()

0      Male
1      Male
2    Female
3       NaN
4      Male
Name: Gender, dtype: category
Categories (2, object): ['Female', 'Male']

In [70]:
df_employees['Gender'] = df_employees['Gender'].astype('category')

**Convert the Team column to the category dtype as well**

In [73]:
df_employees['Team'] = df_employees['Team'].astype('category')

In [74]:
df_employees.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1001 entries, 0 to 1000
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   First Name  933 non-null    object  
 1   Gender      854 non-null    category
 2   Start Date  999 non-null    object  
 3   Salary      1001 non-null   int32   
 4   Mgmt        1001 non-null   bool    
 5   Team        957 non-null    category
dtypes: bool(1), category(2), int32(1), object(2)
memory usage: 23.1+ KB


**memory usage: 23.1+ KB**
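
The saving comes from how categoricals are stored: each distinct value is kept once, and the column itself holds small integer codes. A minimal sketch on a made-up Series (not the employees data) that compares the two representations:

```python
import pandas as pd

# Hypothetical column with only three distinct values.
team = pd.Series(['Marketing', 'Finance', 'IT', 'Finance'] * 250)
as_category = team.astype('category')

# deep=True measures the stored values, not just the references.
before = team.memory_usage(index=False, deep=True)
after = as_category.memory_usage(index=False, deep=True)

print(before, after)

# With so few categories, the codes fit in a single byte each.
print(as_category.cat.codes.dtype)
```

The benefit shrinks as the number of distinct values approaches the number of rows, so a column such as First Name is a weaker candidate than Gender or Team.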

**The DataFrame's reported memory usage dropped from 47.1+ KB to 23.1+ KB, a reduction of roughly 50%**
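
The same dtypes can also be requested when the file is loaded, which avoids building the larger intermediate frame. This is a sketch under assumptions: the inline CSV text is a made-up three-row sample with the same column names, and the nullable `Int32` and `boolean` dtypes are used here so that missing values survive, which differs from the `fillna(0)` and `astype('bool')` steps above:

```python
import io
import pandas as pd

# Hypothetical sample standing in for employees.csv.
csv_text = """First Name,Gender,Start Date,Salary,Mgmt,Team
Douglas,Male,1993-08-06,,True,Marketing
Thomas,Male,1996-03-31,61933,True,
Maria,Female,,130590,False,Finance
"""

df = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=['Start Date'],
    # Compact dtypes applied while parsing.
    dtype={
        'Gender': 'category',
        'Team': 'category',
        'Salary': 'Int32',    # nullable integer, keeps missing salaries
        'Mgmt': 'boolean',    # nullable boolean
    },
)

print(df.dtypes)
```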