![*INTERTECHNICA - SOLON EDUCATIONAL PROGRAMS - TECHNOLOGY LINE*](https://solon.intertechnica.com/assets/IntertechnicaSolonEducationalPrograms-TechnologyLine.png)

# Data Manipulation with Python - The Pandas Library - Creating Data Containers

*Basic initialization of the workspace.*

In [None]:
!python -m pip install numpy
import numpy as np
print ("NumPy installed at version: {}".format(np.__version__))

NumPy installed at version: 1.19.5


In [None]:
!python -m pip install pandas
import pandas as pd
print ("Pandas installed at version: {}".format(pd.__version__))

#adjust pandas DataFrame display for a wider target 
pd.set_option('display.expand_frame_repr', False)

Pandas installed at version: 1.1.5


## 1. Pandas data containers

The Pandas library supports two kinds of data containers:

*   **Series** - an indexed one-dimensional data container;
*   **DataFrame** - an indexed two-dimensional data container (rows and columns). Each column of a DataFrame is essentially a Series, and all columns of the DataFrame share the same index.
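
As a minimal sketch of the relationship between the two containers (the names `frame` and `column` and the sample values are illustrative assumptions, not part of the course data), selecting a single column of a DataFrame yields a Series that carries the DataFrame's index:

```python
import pandas as pd

# hypothetical two-column DataFrame with a custom index
frame = pd.DataFrame(
    {"EN": ["one", "two", "three"], "FR": ["un", "deux", "trois"]},
    index=["a", "b", "c"],
)

# selecting one column returns a Series
column = frame["EN"]

print(type(column).__name__)             # Series
print(column.index.equals(frame.index))  # True
```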

The following sections focus on creating these data containers.

### 1.1 Creating Data Series

A Pandas Series can be created either from in-memory data or by loading data from files.

#### 1.1.1 Creating Data Series from data

A Pandas Series can be created directly from existing data such as Python lists, tuples and dictionaries, as well as NumPy arrays:

In [None]:
# creating a Pandas Series from a Python list
simple_array = ["One", "Two", "Three", "Four", "Five", "Six", "Seven", "Eight", "Nine"]
print(
    "The series associated to the array \n{}\n is \n{}\n".format(
        simple_array,
        pd.Series(simple_array)
    )
)

The series associated to the array 
['One', 'Two', 'Three', 'Four', 'Five', 'Six', 'Seven', 'Eight', 'Nine']
 is 
0      One
1      Two
2    Three
3     Four
4     Five
5      Six
6    Seven
7    Eight
8     Nine
dtype: object



In [None]:
# creating a Pandas Series from a Python tuple (an immutable, list-like sequence)
# mixed value types result in the generic "object" dtype
simple_list = (1, "Two", 3, "Four", 5, "Six", 7, "Eight", 9)
print(
    "The series associated to the list \n{}\n is \n{}\n".format(
        simple_list,
        pd.Series(simple_list)
    )
)

The series associated to the list 
(1, 'Two', 3, 'Four', 5, 'Six', 7, 'Eight', 9)
 is 
0        1
1      Two
2        3
3     Four
4        5
5      Six
6        7
7    Eight
8        9
dtype: object



In [None]:
# creating a Pandas Series from a Python dictionary
# the dictionary keys become the index labels
simple_dictionary = {1:"One", 2: "Two", 3:"Three", 4:"Four", 5:"Five", 6:"Six", 7:"Seven", 8:"Eight", 9:"Nine"}
print(
    "The series associated to the dictionary \n{}\n is \n{}\n".format(
        simple_dictionary,
        pd.Series(simple_dictionary)
    )
)

The series associated to the dictionary 
{1: 'One', 2: 'Two', 3: 'Three', 4: 'Four', 5: 'Five', 6: 'Six', 7: 'Seven', 8: 'Eight', 9: 'Nine'}
 is 
1      One
2      Two
3    Three
4     Four
5     Five
6      Six
7    Seven
8    Eight
9     Nine
dtype: object



In [None]:
# creating a Pandas Series from a NumPy array
numpy_array = np.arange(10) + 1
print(
    "The series associated to the NumPy array \n{}\n is \n{}\n".format(
        numpy_array,
        pd.Series(numpy_array)
    )
)

The series associated to the NumPy array 
[ 1  2  3  4  5  6  7  8  9 10]
 is 
0     1
1     2
2     3
3     4
4     5
5     6
6     7
7     8
8     9
9    10
dtype: int64
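
The outputs above report different `dtype` values because Pandas infers the data type from the values it receives. The following is a small sketch of that inference, with made-up sample values; the exact integer width produced by `np.arange` depends on the platform, so it is not shown as an expected output:

```python
import numpy as np
import pandas as pd

# mixed integers and strings fall back to the generic object dtype
mixed_series = pd.Series([1, "Two", 3])

# a NumPy integer array keeps its integer dtype (width is platform dependent)
integer_series = pd.Series(np.arange(3) + 1)

# mixing integers and floats promotes every value to float64
float_series = pd.Series([1, 2.5, 3])

print(mixed_series.dtype)    # object
print(integer_series.dtype)
print(float_series.dtype)    # float64
```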



A Pandas Series can also be created with an explicitly specified index:

In [None]:
# create a time range
year = "2022"
time_period_range = pd.period_range(
  year + "-01-01", 
  year + "-12-31", 
  freq = "M"
)

# build the list of month names, one for each period in the range
months = [
    pd.Timestamp(str(range_item) + "-01").month_name()
    for range_item in time_period_range
]

month_series = pd.Series(
    index = time_period_range, 
    data = months
)

print(
    "The dates of months in year {} are \n{}\n".format(
        year,
        month_series
    )
)

The dates of months in year 2022 are 
2022-01      January
2022-02     February
2022-03        March
2022-04        April
2022-05          May
2022-06         June
2022-07         July
2022-08       August
2022-09    September
2022-10      October
2022-11     November
2022-12     December
Freq: M, dtype: object
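
An explicit index is mainly useful because values can then be retrieved by label. The sketch below assumes a shortened, three-month version of the series above (the name `quarter_series` is illustrative) and looks up one value with `.loc` and an explicit `pd.Period` key:

```python
import pandas as pd

# a monthly PeriodIndex covering the first quarter of 2022
first_quarter = pd.period_range("2022-01", "2022-03", freq="M")

quarter_series = pd.Series(
    data=["January", "February", "March"],
    index=first_quarter,
)

# label-based lookup using a Period that matches the index frequency
march = quarter_series.loc[pd.Period("2022-03", freq="M")]
print(march)  # March
```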



#### 1.1.2 Creating Data Series from loaded data

Pandas allows the creation of Series from data stored in CSV and Excel files. The functions that load data from CSV and Excel files are [**read_csv**](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) and [**read_excel**](https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html).

Note that the loaded data is returned as a Pandas DataFrame; calling the [**squeeze**](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.squeeze.html) method transforms a single-column DataFrame into a Pandas Series.
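
The effect of `squeeze` can be sketched without network access by reading a small CSV text from memory. The CSV content and the names `csv_text`, `loaded_frame` and `loaded_series` are assumptions made for illustration only:

```python
import io
import pandas as pd

# hypothetical CSV content: one index column and one value column
csv_text = "Year,Value\n1990,20.32\n1995,18.57\n2000,17.70\n"

# read_csv always returns a DataFrame, even for a single data column
loaded_frame = pd.read_csv(io.StringIO(csv_text), index_col=0)

# squeeze collapses the single-column DataFrame into a Series
loaded_series = loaded_frame.squeeze()

print(type(loaded_frame).__name__)   # DataFrame
print(type(loaded_series).__name__)  # Series
print(loaded_series.name)            # Value
```

When a file might contain only a single row, `squeeze("columns")` is the safer call, since a plain `squeeze()` on a one-row, one-column DataFrame returns a scalar rather than a Series.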

In [None]:
# loading the data from the CSV file  
# skipping the comment row and using the year as the index column
csv_loaded_data = pd.read_csv(
   "https://raw.githubusercontent.com/INTERTECHNICA-BUSINESS-SOLUTIONS-SRL/CourseDataManipulationWithPython/main/Module%203%20-%20The%20Pandas%20Library/Session%202%20-%20Pandas%20basics/data/RO_Urban_Air_Polution.csv",
   skiprows = 1,
   index_col = 0  
)

print(
    "The CSV loaded data series is \n{}".format(
        csv_loaded_data.squeeze()
      )
)

The CSV loaded data series is 
Year
1990    20.32
1995    18.57
2000    17.70
2005    17.16
2010    17.06
2011    17.64
2012    16.32
2013    15.37
2014    15.01
2015    15.43
2016    14.54
2017    14.61
Name: Air Pollution Mean Exposure, dtype: float64


In [None]:
# loading the data from the Excel file  
# skipping the comment row and using the year as the index column
excel_loaded_data = pd.read_excel(
   "https://raw.githubusercontent.com/INTERTECHNICA-BUSINESS-SOLUTIONS-SRL/CourseDataManipulationWithPython/main/Module%203%20-%20The%20Pandas%20Library/Session%202%20-%20Pandas%20basics/data/RO_Urban_Air_Polution.xlsx",
   skiprows = 1,
   index_col = 0  
)

print(
    "The Excel loaded data series is \n{}".format(
        excel_loaded_data.squeeze()
      )
)

The Excel loaded data series is 
Year
1990    20.32
1995    18.57
2000    17.70
2005    17.16
2010    17.06
2011    17.64
2012    16.32
2013    15.37
2014    15.01
2015    15.43
2016    14.54
2017    14.61
Name: Air Pollution Mean Exposure, dtype: float64


### 1.2 Creating DataFrames

As with Series, Pandas DataFrame objects can be created from regular Python data objects or from data loaded from files.

#### 1.2.1 Creating DataFrames from data

Pandas DataFrames can likewise be created directly from existing data such as Python lists and dictionaries, as well as NumPy arrays:

In [None]:
# create the raw data needed for dataframe creation
index_values = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
numeric_values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
en_literals = ["one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten"]
fr_literals = ["un", "deux", "trois", "quatre", "cinq", "six", "sept", "huit", "neuf", "dix"]
column_names = ["EN", "FR"]

In [None]:
# create the dataframe using a dictionary format
dictionary_data = {
    "EN": en_literals,
    "FR": fr_literals
}

print(
    "The created data frame is \n{}\n".format(
        pd.DataFrame(
          data = dictionary_data,
          index = index_values
        )
    )
)

The created data frame is 
      EN      FR
0    one      un
1    two    deux
2  three   trois
3   four  quatre
4   five    cinq
5    six     six
6  seven    sept
7  eight    huit
8   nine    neuf
9    ten     dix



In [None]:
# create the dataframe from a NumPy array built with the vstack function;
# the transpose turns each source list into a column
numpy_data = np.vstack((en_literals, fr_literals)).T

print(
    "The created data frame is \n{}\n".format(
        pd.DataFrame(
          data = numpy_data,
          columns = column_names,
          index = index_values,
        )
    )
)

The created data frame is 
      EN      FR
0    one      un
1    two    deux
2  three   trois
3   four  quatre
4   five    cinq
5    six     six
6  seven    sept
7  eight    huit
8   nine    neuf
9    ten     dix
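
To see why the transpose is needed in the cell above, the following sketch (using shortened, illustrative lists) prints the array shapes before and after `.T`: `np.vstack` produces one row per input list, while the DataFrame expects one column per language.

```python
import numpy as np

en = ["one", "two", "three"]
fr = ["un", "deux", "trois"]

stacked = np.vstack((en, fr))  # one row per input list
transposed = stacked.T         # one column per input list

print(stacked.shape)     # (2, 3)
print(transposed.shape)  # (3, 2)
print(transposed[0])     # ['one' 'un']
```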



In [None]:
# create the dataframe specifying simultaneously the column names, values and
# data types (using one Series per column)
created_data_frame = pd.DataFrame(
                      data = {
                          "numeric_value": pd.Series(data = numeric_values, dtype = np.dtype("i")),
                          "EN":  pd.Series(data = en_literals, dtype = np.dtype("U16")), 
                          "FR":  pd.Series(data = fr_literals, dtype = np.dtype("U16"))
                      }
                    )

print(
    "The created data frame is \n{}\n".format(
        created_data_frame
    )
)

The created data frame is 
   numeric_value     EN      FR
0              1    one      un
1              2    two    deux
2              3  three   trois
3              4   four  quatre
4              5   five    cinq
5              6    six     six
6              7  seven    sept
7              8  eight    huit
8              9   nine    neuf
9             10    ten     dix
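
Whether the requested data types were applied can be checked through the `dtypes` attribute. The sketch below is a reduced variant of the cell above; the `ratio` column and its values are hypothetical and only serve to show a second numeric type:

```python
import numpy as np
import pandas as pd

typed_frame = pd.DataFrame(
    data={
        # explicit 32-bit integers instead of the inferred default
        "numeric_value": pd.Series(data=[1, 2, 3], dtype="int32"),
        # hypothetical float column
        "ratio": pd.Series(data=[0.5, 1.0, 1.5], dtype="float64"),
    }
)

# dtypes lists the data type of every column
print(typed_frame.dtypes)
```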



#### 1.2.2 Creating DataFrames from loaded data

The same functions [**read_csv**](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) and [**read_excel**](https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html) are used to create DataFrames from data stored in data files.

On successful execution, the functions return full-fledged DataFrame objects.
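
As a small offline sketch of this behavior, the example below reads a multi-column CSV text from memory. The column names mirror a subset of the course file's columns, but the rows and values are invented for illustration:

```python
import io
import pandas as pd

# hypothetical CSV content with a header row and two data rows
csv_text = (
    "Country Name,Year,Value\n"
    "Romania,2020,1.5\n"
    "Romania,2019,2.5\n"
)

frame = pd.read_csv(io.StringIO(csv_text))

# each column gets its own inferred data type
print(frame.shape)  # (2, 3)
print(frame.dtypes)
```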

In [None]:
# loading the data from the CSV file 
# containing the science and technology data for Romania
csv_loaded_data = pd.read_csv(
   "https://raw.githubusercontent.com/INTERTECHNICA-BUSINESS-SOLUTIONS-SRL/CourseDataManipulationWithPython/main/Module%203%20-%20The%20Pandas%20Library/Session%202%20-%20Pandas%20basics/data/RO_Science_And_Technology.csv",
)

print(
    "The data loaded from the data file is \n{}\n with the data types \n{}\n".format(
      csv_loaded_data,
      csv_loaded_data.dtypes   
    )
)

The data loaded from the data file is 
    Country Name Country ISO3  Year                                     Indicator Name     Indicator Code         Value
0        Romania          ROU  2020  Charges for the use of intellectual property, ...     BM.GSR.ROYL.CD  8.868424e+08
1        Romania          ROU  2019  Charges for the use of intellectual property, ...     BM.GSR.ROYL.CD  9.367352e+08
2        Romania          ROU  2018  Charges for the use of intellectual property, ...     BM.GSR.ROYL.CD  9.623846e+08
3        Romania          ROU  2017  Charges for the use of intellectual property, ...     BM.GSR.ROYL.CD  9.110375e+08
4        Romania          ROU  2016  Charges for the use of intellectual property, ...     BM.GSR.ROYL.CD  8.320201e+08
..           ...          ...   ...                                                ...                ...           ...
410      Romania          ROU  2011  High-technology exports (% of manufactured exp...  TX.VAL.TECH.MF.ZS  1.163584e+01
4

In [None]:
# loading the data from the Excel file
# containing the science and technology data for Romania
excel_loaded_data = pd.read_excel(
    "https://github.com/INTERTECHNICA-BUSINESS-SOLUTIONS-SRL/CourseDataManipulationWithPython/raw/main/Module%203%20-%20The%20Pandas%20Library/Session%202%20-%20Pandas%20basics/data/RO_Science_And_Technology.xlsx",
 )

print(
    "The data loaded from the data file is \n{}\n with the data types \n{}\n".format(
      excel_loaded_data,
      excel_loaded_data.dtypes   
    )
)

The data loaded from the data file is 
    Country Name Country ISO3  Year                                     Indicator Name     Indicator Code         Value
0        Romania          ROU  2020  Charges for the use of intellectual property, ...     BM.GSR.ROYL.CD  8.868424e+08
1        Romania          ROU  2019  Charges for the use of intellectual property, ...     BM.GSR.ROYL.CD  9.367352e+08
2        Romania          ROU  2018  Charges for the use of intellectual property, ...     BM.GSR.ROYL.CD  9.623846e+08
3        Romania          ROU  2017  Charges for the use of intellectual property, ...     BM.GSR.ROYL.CD  9.110375e+08
4        Romania          ROU  2016  Charges for the use of intellectual property, ...     BM.GSR.ROYL.CD  8.320201e+08
..           ...          ...   ...                                                ...                ...           ...
410      Romania          ROU  2011  High-technology exports (% of manufactured exp...  TX.VAL.TECH.MF.ZS  1.163584e+01
4