![*INTERTECHNICA - SOLON EDUCATIONAL PROGRAMS - TECHNOLOGY LINE*](https://solon.intertechnica.com/assets/IntertechnicaSolonEducationalPrograms-TechnologyLine.png)

# Data Manipulation with Python - The Pandas Library - Saving Data

*Basic initialization of the workspace.*

In [1]:
!python -m pip install numpy
import numpy as np
print ("NumPy installed at version: {}".format(np.__version__))

NumPy installed at version: 1.19.5


In [2]:
!python -m pip install pandas
import pandas as pd
print ("Pandas installed at version: {}".format(pd.__version__))

#adjust pandas DataFrame display for a wider target 
pd.set_option('display.expand_frame_repr', False)

Pandas installed at version: 1.1.5


Load sample data for processing:

In [3]:
# load EU quality of life indicator 
quality_of_life_data_frame = pd.read_csv(
    "https://raw.githubusercontent.com/INTERTECHNICA-BUSINESS-SOLUTIONS-SRL/CourseDataManipulationWithPython/main/Module%203%20-%20The%20Pandas%20Library/Session%202%20-%20Pandas%20basics/data/EU_Quality_Of_Life.csv",
 )
print(
   "Loaded quality of life data frame with shape {}".format(
       quality_of_life_data_frame.shape
   ) 
)

Loaded quality of life data frame with shape (6315, 6)


## 1. Data Saving

Let's clean up the raw data by removing the records with missing values:

In [4]:
# remove records where data is empty
clean_quality_of_life_data_frame = quality_of_life_data_frame.drop(
    quality_of_life_data_frame.index[np.isnan(quality_of_life_data_frame["OBS_VALUE"])],
    axis = 0
 )

print(
    "Removing empty data has dropped {} records".format(
        quality_of_life_data_frame.shape[0] - clean_quality_of_life_data_frame.shape[0]
    )
)

Removing empty data has dropped 131 records
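
The same kind of filter can also be written with pandas' own `dropna` method. The sketch below uses a small made-up frame (the `toy` values are illustrative assumptions, not rows from the dataset) and is only one possible way to express the step:

```python
import numpy as np
import pandas as pd

# hypothetical toy frame standing in for the quality of life data
toy = pd.DataFrame(
    {
        "geo": ["AT", "BE", "BG", "CY"],
        "OBS_VALUE": [8.2, np.nan, 5.5, np.nan],
    }
)

# keep only the rows where OBS_VALUE is present
toy_clean = toy.dropna(subset=["OBS_VALUE"])

print(
    "Removing empty data has dropped {} records".format(
        toy.shape[0] - toy_clean.shape[0]
    )
)
```

Restricting `subset` to the `OBS_VALUE` column keeps rows that have missing values only in other columns, which matches the intent of the cell above.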


In [5]:
# remove redundant aggregate rows (totals across education level, sex and age groups)
previous_row_counts = clean_quality_of_life_data_frame.shape[0]
clean_quality_of_life_data_frame = clean_quality_of_life_data_frame[
     (clean_quality_of_life_data_frame["isced11"] != "TOTAL")
     &
     (clean_quality_of_life_data_frame["sex"] != "T")
     &
     (clean_quality_of_life_data_frame["age"] != "Y_GE16")
     &
     (clean_quality_of_life_data_frame["age"] != "Y_GE65")
    ]   

print(
    "Removing redundant data has dropped {} records".format(
        previous_row_counts - clean_quality_of_life_data_frame.shape[0]
    )
)

Removing redundant data has dropped 3721 records


In [6]:
# ensure that the resulting data frame has 
# the 2013 and 2018 values as separate columns 
quality_of_life_2013_data = clean_quality_of_life_data_frame[clean_quality_of_life_data_frame["TIME_PERIOD"] == 2013]
quality_of_life_2018_data = clean_quality_of_life_data_frame[clean_quality_of_life_data_frame["TIME_PERIOD"] == 2018]
merged_data = pd.merge(
    quality_of_life_2013_data,
    quality_of_life_2018_data,
    on = ["geo", "isced11", "sex", "age"],
    how = "inner"
)

print(
    "A sample of merged data results is \n{}".format(
        merged_data[0:10]
    )
)

A sample of merged data results is 
  isced11 sex     age geo  TIME_PERIOD_x  OBS_VALUE_x  TIME_PERIOD_y  OBS_VALUE_y
0   ED0-2   F  Y16-24  AT           2013          8.2           2018          8.0
1   ED0-2   F  Y16-24  BE           2013          7.7           2018          7.8
2   ED0-2   F  Y16-24  BG           2013          5.5           2018          6.3
3   ED0-2   F  Y16-24  CY           2013          7.5           2018          7.9
4   ED0-2   F  Y16-24  CZ           2013          7.8           2018          8.2
5   ED0-2   F  Y16-24  DE           2013          7.4           2018          7.7
6   ED0-2   F  Y16-24  DK           2013          8.3           2018          7.2
7   ED0-2   F  Y16-24  EE           2013          7.3           2018          7.9
8   ED0-2   F  Y16-24  EL           2013          7.0           2018          7.2
9   ED0-2   F  Y16-24  ES           2013          7.3           2018          7.6
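
The `_x` and `_y` suffixes in the sample above are the defaults that `pd.merge` applies to overlapping column names. As a minimal sketch, assuming two tiny made-up frames (`data_2013` and `data_2018` are hypothetical stand-ins for the yearly subsets), the `suffixes` parameter can produce more descriptive names directly:

```python
import pandas as pd

# hypothetical toy frames standing in for the 2013 and 2018 subsets
data_2013 = pd.DataFrame(
    {"geo": ["AT", "BE"], "TIME_PERIOD": [2013, 2013], "OBS_VALUE": [8.2, 7.7]}
)
data_2018 = pd.DataFrame(
    {"geo": ["AT", "BE"], "TIME_PERIOD": [2018, 2018], "OBS_VALUE": [8.0, 7.8]}
)

# overlapping non-key columns receive the given suffixes instead of _x / _y
merged = pd.merge(
    data_2013,
    data_2018,
    on=["geo"],
    how="inner",
    suffixes=("_2013", "_2018"),
)

print(merged.columns.tolist())
```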


In [7]:
# create a clean data frame
result_data = pd.DataFrame(
    data = {
        "geo": merged_data["geo"],
        "age": merged_data["age"],
        "sex": merged_data["sex"],
        "isced11": merged_data["isced11"],
        "OBS_VALUE_2013": merged_data["OBS_VALUE_x"],
        "OBS_VALUE_2018": merged_data["OBS_VALUE_y"]
    }
)

result_data = result_data.sort_values(
    ["geo", "age", "sex", "isced11"]
)

print(
    "A sample of clean data results is \n{}".format(
        result_data[0:10]
    )
)

A sample of clean data results is 
    geo     age sex isced11  OBS_VALUE_2013  OBS_VALUE_2018
0    AT  Y16-24   F   ED0-2             8.2             8.0
375  AT  Y16-24   F   ED3_4             8.4             8.5
754  AT  Y16-24   F   ED5-8             8.4             8.5
188  AT  Y16-24   M   ED0-2             8.3             8.4
564  AT  Y16-24   M   ED3_4             8.5             8.3
32   AT  Y25-34   F   ED0-2             6.8             7.7
407  AT  Y25-34   F   ED3_4             8.2             8.0
777  AT  Y25-34   F   ED5-8             8.6             8.4
219  AT  Y25-34   M   ED0-2             7.3             7.4
596  AT  Y25-34   M   ED3_4             8.1             8.0
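
The row labels in the sample above (0, 375, 754, ...) are the positions the rows had before sorting. If a consecutive index is preferred, one option is `reset_index(drop=True)`. A minimal sketch on a made-up frame (`toy` is a hypothetical stand-in for the sorted data):

```python
import pandas as pd

# hypothetical toy frame; the values are illustrative only
toy = pd.DataFrame(
    {
        "geo": ["BE", "AT", "AT"],
        "sex": ["F", "M", "F"],
        "OBS_VALUE_2013": [7.7, 8.3, 8.2],
    }
)

# sorting keeps the original row labels attached to each row
toy_sorted = toy.sort_values(["geo", "sex"])

# drop=True discards the old labels instead of adding them as a column
toy_sorted = toy_sorted.reset_index(drop=True)

print(toy_sorted)
```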


With large data sets, the cleanup and processing operations can be lengthy and expensive. It is preferable to save the processed data and load it later for further processing.

To save the processed data in CSV format, we can use the [**to_csv**](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html) method of the data frame:

In [8]:
result_data.to_csv("processed_data.csv")
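
By default, `to_csv` also writes the row index as an unnamed first column. When the index carries no meaning, passing `index=False` leaves it out. The sketch below assumes a small made-up frame and a temporary directory (both hypothetical, chosen only so the example is self-contained) and reads the file back to check the result:

```python
import os
import tempfile

import pandas as pd

# hypothetical toy frame standing in for result_data
toy = pd.DataFrame(
    {
        "geo": ["AT", "AT"],
        "OBS_VALUE_2013": [8.2, 8.4],
        "OBS_VALUE_2018": [8.0, 8.5],
    },
    index=[0, 375],
)

# write to a temporary location, leaving out the row index
target = os.path.join(tempfile.mkdtemp(), "processed_data.csv")
toy.to_csv(target, index=False)

# load the saved data back for further processing
reloaded = pd.read_csv(target)

print(
    "Reloaded data frame with columns {}".format(
        reloaded.columns.tolist()
    )
)
```

Without `index=False`, the reloaded frame would contain an extra `Unnamed: 0` column holding the old row labels.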

The [**to_excel**](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_excel.html) method saves the data in Excel format:

In [9]:
result_data.to_excel("processed_data.xlsx")
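
Writing `.xlsx` files relies on an optional Excel engine that is not installed together with pandas. The sketch below assumes the `openpyxl` package is used as that engine and falls back to a message when it is missing; the toy frame, the temporary directory and the sheet name `quality_of_life` are hypothetical choices for illustration:

```python
import os
import tempfile

import pandas as pd

# hypothetical toy frame standing in for result_data
toy = pd.DataFrame(
    {
        "geo": ["AT", "AT"],
        "OBS_VALUE_2013": [8.2, 8.4],
        "OBS_VALUE_2018": [8.0, 8.5],
    },
    index=[0, 375],
)

target = os.path.join(tempfile.mkdtemp(), "processed_data.xlsx")

reloaded = None
try:
    # name the worksheet and leave out the row index
    toy.to_excel(
        target, sheet_name="quality_of_life", index=False, engine="openpyxl"
    )
    reloaded = pd.read_excel(
        target, sheet_name="quality_of_life", engine="openpyxl"
    )
    print("Reloaded Excel data with shape {}".format(reloaded.shape))
except ImportError:
    # raised when the openpyxl package is not available
    print("openpyxl is not installed, so the Excel file was not written")
```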

For very high data volumes, it is preferable to use an open-source format designed to handle such volumes and complexity. The **parquet** format is widely used for this purpose; data can be saved in this format via the [**to_parquet**](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_parquet.html) method:

In [10]:
result_data.to_parquet("processed_data.parquet")
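
Parquet support in pandas depends on an optional engine, either `pyarrow` or `fastparquet`. Because parquet stores data by column, a later load can request only the columns it needs. The sketch below assumes one of those engines is installed and prints a message otherwise; the toy frame and temporary directory are hypothetical:

```python
import os
import tempfile

import pandas as pd

# hypothetical toy frame standing in for result_data
toy = pd.DataFrame(
    {
        "geo": ["AT", "AT"],
        "OBS_VALUE_2013": [8.2, 8.4],
        "OBS_VALUE_2018": [8.0, 8.5],
    },
    index=[0, 375],
)

target = os.path.join(tempfile.mkdtemp(), "processed_data.parquet")

reloaded = None
try:
    # save without the row index
    toy.to_parquet(target, index=False)
    # load back only two of the three saved columns
    reloaded = pd.read_parquet(target, columns=["geo", "OBS_VALUE_2018"])
    print("Reloaded parquet columns {}".format(reloaded.columns.tolist()))
except ImportError:
    # raised when neither pyarrow nor fastparquet is available
    print("No parquet engine is installed, so the file was not written")
```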