Sadman Sakib Rafi

## **Importing Libraries: NumPy and Pandas**


In [34]:
import numpy as np
import pandas as pd 

## **Reading CSV Files into DataFrame Objects**
### The data comes from Kaggle's "Regression with a Flood Prediction Dataset" competition (Playground Series S4E5)

In [35]:
DataFrame=pd.read_csv('/kaggle/input/playground-series-s4e5/train.csv')
test_df=pd.read_csv('/kaggle/input/playground-series-s4e5/test.csv')

## **Generating Descriptive Statistics for DataFrame Columns**


In [36]:
DataFrame.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| id | 1117957.0 | 558978.0 | 322726.531784 | 0.0 | 279489.0 | 558978.0 | 838467.0 | 1117956.0 |
| MonsoonIntensity | 1117957.0 | 4.92145 | 2.056387 | 0.0 | 3.0 | 5.0 | 6.0 | 16.0 |
| TopographyDrainage | 1117957.0 | 4.926671 | 2.093879 | 0.0 | 3.0 | 5.0 | 6.0 | 18.0 |
| RiverManagement | 1117957.0 | 4.955322 | 2.072186 | 0.0 | 4.0 | 5.0 | 6.0 | 16.0 |
| Deforestation | 1117957.0 | 4.94224 | 2.051689 | 0.0 | 4.0 | 5.0 | 6.0 | 17.0 |
| Urbanization | 1117957.0 | 4.942517 | 2.083391 | 0.0 | 3.0 | 5.0 | 6.0 | 17.0 |
| ClimateChange | 1117957.0 | 4.934093 | 2.057742 | 0.0 | 3.0 | 5.0 | 6.0 | 17.0 |
| DamsQuality | 1117957.0 | 4.955878 | 2.083063 | 0.0 | 4.0 | 5.0 | 6.0 | 16.0 |
| Siltation | 1117957.0 | 4.927791 | 2.065992 | 0.0 | 3.0 | 5.0 | 6.0 | 16.0 |
| AgriculturalPractices | 1117957.0 | 4.942619 | 2.068545 | 0.0 | 3.0 | 5.0 | 6.0 | 16.0 |


In [37]:
DataFrame.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1117957 entries, 0 to 1117956
Data columns (total 22 columns):
 #   Column                           Non-Null Count    Dtype  
---  ------                           --------------    -----  
 0   id                               1117957 non-null  int64  
 1   MonsoonIntensity                 1117957 non-null  int64  
 2   TopographyDrainage               1117957 non-null  int64  
 3   RiverManagement                  1117957 non-null  int64  
 4   Deforestation                    1117957 non-null  int64  
 5   Urbanization                     1117957 non-null  int64  
 6   ClimateChange                    1117957 non-null  int64  
 7   DamsQuality                      1117957 non-null  int64  
 8   Siltation                        1117957 non-null  int64  
 9   AgriculturalPractices            1117957 non-null  int64  
 10  Encroachments                    1117957 non-null  int64  
 11  IneffectiveDisasterPreparedness  1117957 non-null 

### **Observation** 
* The dataset has 22 columns in total: 21 store whole numbers with the data type int64.
* The remaining column, "FloodProbability," has the data type float64 and stores decimal numbers.
* Both int64 and float64 use 8 bytes per value, even though the feature values shown above only range from 0 to 18.
* The memory usage of the dataset is approximately 187.6 MB.
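
The reported figure can be checked by hand from the `info()` output: 22 columns, 1,117,957 rows, 8 bytes per value. The sketch below is only that arithmetic, and it ignores the small fixed overhead of the index.

```python
# Back-of-the-envelope check of the memory figure reported by DataFrame.info().
n_rows = 1_117_957      # from "RangeIndex: 1117957 entries"
n_cols = 22             # 21 int64 columns + 1 float64 column
bytes_per_value = 8     # int64 and float64 both take 8 bytes

total_bytes = n_rows * n_cols * bytes_per_value
total_mb = total_bytes / 1024**2   # bytes -> MB, same conversion the notebook uses
print(f"{total_mb:.2f} MB")        # 187.65 MB
```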


## **Calculating and Displaying Memory Usage Before Optimization**


In [38]:
# Calculate the memory usage of the DataFrame in megabytes (MB) by summing up memory usage of each column and converting from bytes to MB
memory_usage=DataFrame.memory_usage().sum()/1024**2
print("Memory Usage Before Optimization :{:.2f} MB".format(memory_usage))

Memory Usage Before Optimization :187.65 MB
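
Before summing, `memory_usage()` returns one entry per column plus one for the index, which shows where the bytes sit. A minimal sketch on a small synthetic frame (the frame `demo` and its values are made up for illustration and are not the competition data):

```python
import numpy as np
import pandas as pd

# Small synthetic stand-in for the training data (values are illustrative only).
demo = pd.DataFrame({
    "id": np.arange(1000, dtype=np.int64),
    "MonsoonIntensity": np.full(1000, 5, dtype=np.int64),
    "FloodProbability": np.full(1000, 0.5, dtype=np.float64),
})

per_column = demo.memory_usage(index=True)   # Series of bytes, labelled by column (plus "Index")
print(per_column)

columns_only = per_column.drop("Index").sum()
print(columns_only)                          # 3 columns * 1000 rows * 8 bytes = 24000
```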


## **Memory Optimization Function for DataFrame**


In [39]:
def memory_optimization(data_frame):
    """Downcast numeric columns in place and return the new memory usage in MB."""
    for col in data_frame.columns:
        column_type = str(data_frame[col].dtype)  # Name of the column's data type, e.g. 'int64'

        # Only signed integer and float columns are downcast; object, bool, etc. are left unchanged
        if column_type[:3] != 'int' and column_type[:5] != 'float':
            continue

        col_min = data_frame[col].min()  # Smallest value in the column
        col_max = data_frame[col].max()  # Largest value in the column

        if column_type[:3] == 'int':
            # Use the smallest signed integer type whose range contains every value.
            # The bounds are inclusive: a column whose maximum is exactly 127 still fits in int8.
            if col_min >= np.iinfo(np.int8).min and col_max <= np.iinfo(np.int8).max:
                data_frame[col] = data_frame[col].astype(np.int8)
            elif col_min >= np.iinfo(np.int16).min and col_max <= np.iinfo(np.int16).max:
                data_frame[col] = data_frame[col].astype(np.int16)
            elif col_min >= np.iinfo(np.int32).min and col_max <= np.iinfo(np.int32).max:
                data_frame[col] = data_frame[col].astype(np.int32)
            # Anything wider stays int64
        else:
            # This checks the range only: float16 and float32 also keep fewer significant digits
            if col_min >= np.finfo(np.float16).min and col_max <= np.finfo(np.float16).max:
                data_frame[col] = data_frame[col].astype(np.float16)
            elif col_min >= np.finfo(np.float32).min and col_max <= np.finfo(np.float32).max:
                data_frame[col] = data_frame[col].astype(np.float32)
            else:
                data_frame[col] = data_frame[col].astype(np.float64)

    optimized_memory_usage = data_frame.memory_usage().sum()/1024**2

    print("memory usage after optimization is: {:.2f} MB".format(optimized_memory_usage))

    return optimized_memory_usage
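
For comparison, pandas has a built-in downcasting option, `pd.to_numeric(..., downcast=...)`. This is an alternative to the hand-written function, not what the notebook uses. The sketch below runs it on a small made-up frame (the column names `small`, `big`, and `prob` are illustrative). Note that `downcast="float"` goes no lower than float32, so it never produces float16.

```python
import numpy as np
import pandas as pd

# Made-up columns covering the three cases seen in this dataset.
demo = pd.DataFrame({
    "small": np.array([0, 5, 127], dtype=np.int64),           # fits in int8 (127 is the int8 maximum)
    "big": np.array([0, 40_000, 1_117_956], dtype=np.int64),  # too large for int16
    "prob": np.array([0.445, 0.45, 0.53], dtype=np.float64),  # probability-like floats
})

# Pick the smallest suitable dtype per column.
demo["small"] = pd.to_numeric(demo["small"], downcast="integer")
demo["big"] = pd.to_numeric(demo["big"], downcast="integer")
demo["prob"] = pd.to_numeric(demo["prob"], downcast="float")

print(demo.dtypes)
```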

## **Comparing Memory Usage Before and After Optimization**


In [40]:
memory_usage_BEFORE = DataFrame.memory_usage().sum()/1024**2

memory_usage_AFTER = memory_optimization(DataFrame)  # Downcasts DataFrame in place and returns the new usage in MB

print("Memory Usage Reduce By: {:.2f}%".format(100*(memory_usage_BEFORE-memory_usage_AFTER)/memory_usage_BEFORE))

memory usage after optimization is: 27.72 MB
Memory Usage Reduce By: 85.23%


## **Memory Usage Reduction Achieved: 85.23%**


In [41]:
DataFrame.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1117957 entries, 0 to 1117956
Data columns (total 22 columns):
 #   Column                           Non-Null Count    Dtype  
---  ------                           --------------    -----  
 0   id                               1117957 non-null  int32  
 1   MonsoonIntensity                 1117957 non-null  int8   
 2   TopographyDrainage               1117957 non-null  int8   
 3   RiverManagement                  1117957 non-null  int8   
 4   Deforestation                    1117957 non-null  int8   
 5   Urbanization                     1117957 non-null  int8   
 6   ClimateChange                    1117957 non-null  int8   
 7   DamsQuality                      1117957 non-null  int8   
 8   Siltation                        1117957 non-null  int8   
 9   AgriculturalPractices            1117957 non-null  int8   
 10  Encroachments                    1117957 non-null  int8   
 11  IneffectiveDisasterPreparedness  1117957 non-null 

### **Observation After optimizing data** 

* The dataset now occupies approximately 27.7 MB of memory, down from 187.6 MB.
* Most of the columns now use the int8 data type, which takes 1 byte per value instead of 8.
* The "id" column uses the int32 data type, because its maximum value (1,117,956) is too large for int16.
* The "FloodProbability" column now uses the float16 data type, which takes 2 bytes per value but keeps only about three significant decimal digits, so its values are rounded slightly.
* Downcasting the integer columns loses no information; the float16 conversion trades a small amount of precision for memory, which is worth checking before modeling.
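
The effect of storing a probability as float16 can be measured directly. A minimal sketch, using 0.445 as an arbitrary probability-like value chosen for illustration:

```python
import numpy as np

p = 0.445                       # illustrative value, not taken from the dataset

as_f16 = np.float16(p)          # nearest representable float16
as_f32 = np.float32(p)          # nearest representable float32

err_f16 = abs(float(as_f16) - p)
err_f32 = abs(float(as_f32) - p)

print("float16 value:", float(as_f16), "error:", err_f16)
print("float32 value:", float(as_f32), "error:", err_f32)
print("float16 machine epsilon:", float(np.finfo(np.float16).eps))  # 2**-10
```

If the rounding error matters for the task at hand, float32 is the safer target for this column, at the cost of 2 extra bytes per row.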

## **Key Functions Used in This Process**

> `DataFrame.memory_usage(index=True, deep=False)`
**returns the memory used by each column, in bytes (including the index when `index=True`).**

> `np.iinfo(np.int8)` **returns the minimum and maximum values representable by the int8 data type.**

> `dataframe[columns].astype(np.int8)` **returns the specified columns converted to the int8 data type.**
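
The three calls can be tried together on a tiny example. This is a sketch on a made-up three-value Series, not on the competition data:

```python
import numpy as np
import pandas as pd

# Range of the int8 type.
info = np.iinfo(np.int8)
print(info.min, info.max)            # -128 127

# A made-up column whose values fit comfortably in int8.
s = pd.Series([0, 5, 16], dtype=np.int64)
bytes_before = s.memory_usage(index=False)    # 3 values * 8 bytes
s_small = s.astype(np.int8)
bytes_after = s_small.memory_usage(index=False)  # 3 values * 1 byte

print(bytes_before, bytes_after)     # 24 3
```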