![*INTERTECHNICA - SOLON EDUCATIONAL PROGRAMS - TECHNOLOGY LINE*](https://solon.intertechnica.com/assets/IntertechnicaSolonEducationalPrograms-TechnologyLine.png)

# Data Manipulation with Python - The Pandas Library - Creating Pandas Data Containers

*Basic initialization of the workspace.*

In [16]:
!python -m pip install numpy
import numpy as np
print("NumPy installed at version: {}".format(np.__version__))

NumPy installed at version: 1.19.5


In [17]:
!python -m pip install pandas
import pandas as pd
print("Pandas installed at version: {}".format(pd.__version__))

# print wide DataFrames on a single line instead of wrapping their columns
pd.set_option('display.expand_frame_repr', False)

Pandas installed at version: 1.1.5


## 1. Pandas data containers

The Pandas library supports two kinds of data containers:

*   **Series** - an indexed one-dimensional data container;
*   **DataFrame** - an indexed two-dimensional data container (rows and columns).

In the following we will focus on creating these data containers.
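
As a quick illustration of the difference in dimensionality, the minimal sketch below builds one container of each kind and inspects the `ndim` and `shape` attributes. The variable names and sample values are illustrative only.

```python
import pandas as pd

# a Series holds one column of values plus an index
sample_series = pd.Series([10, 20, 30])

# a DataFrame holds several columns that share one index
sample_frame = pd.DataFrame({"a": [10, 20, 30], "b": [1.5, 2.5, 3.5]})

print(sample_series.ndim, sample_series.shape)  # 1 (3,)
print(sample_frame.ndim, sample_frame.shape)    # 2 (3, 2)
```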

### 1.1 Creating Data Series

Pandas series can be created either from data or by loading data from files.

#### 1.1.1 Creating Data Series from data

Pandas Data Series can be created directly from existing data such as Python lists, tuples and dictionaries, as well as NumPy arrays:

In [18]:
# creating a Pandas series from a Python list (the built-in sequence often loosely called an array)
simple_array = ["One", "Two", "Three", "Four", "Five", "Six", "Seven", "Eight", "Nine"]
print(
    "The series associated to the array \n{}\n is \n{}\n".format(
        simple_array,
        pd.Series(simple_array)
    )
)

The series associated to the array 
['One', 'Two', 'Three', 'Four', 'Five', 'Six', 'Seven', 'Eight', 'Nine']
 is 
0      One
1      Two
2    Three
3     Four
4     Five
5      Six
6    Seven
7    Eight
8     Nine
dtype: object



In [19]:
# creating a Pandas series from a Python tuple (a list with the same elements gives the same series)
simple_list = (1, "Two", 3, "Four", 5, "Six", 7, "Eight", 9)
print(
    "The series associated to the list \n{}\n is \n{}\n".format(
        simple_list,
        pd.Series(simple_list)
    )
)

The series associated to the list 
(1, 'Two', 3, 'Four', 5, 'Six', 7, 'Eight', 9)
 is 
0        1
1      Two
2        3
3     Four
4        5
5      Six
6        7
7    Eight
8        9
dtype: object
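
The `dtype: object` line above is a consequence of mixing integers and strings in one sequence. The following minimal sketch, with illustrative values, shows how the element types influence the dtype that Pandas chooses:

```python
import pandas as pd

# homogeneous integers are stored in an integer dtype
int_series = pd.Series([1, 3, 5, 7, 9])

# mixing integers and strings falls back to the generic object dtype
mixed_series = pd.Series([1, "Two", 3, "Four", 5])

print(int_series.dtype)
print(mixed_series.dtype)

# each element of the mixed series keeps its original Python type
print([type(value).__name__ for value in mixed_series])
```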



In [20]:
# creating a Pandas series from a Python dictionary
# the dictionary keys become the index labels of the series
simple_dictionary = {1:"One", 2: "Two", 3:"Three", 4:"Four", 5:"Five", 6:"Six", 7:"Seven", 8:"Eight", 9:"Nine"}
print(
    "The series associated to the dictionary \n{}\n is \n{}\n".format(
        simple_dictionary,
        pd.Series(simple_dictionary)
    )
)

The series associated to the dictionary 
{1: 'One', 2: 'Two', 3: 'Three', 4: 'Four', 5: 'Five', 6: 'Six', 7: 'Seven', 8: 'Eight', 9: 'Nine'}
 is 
1      One
2      Two
3    Three
4     Four
5     Five
6      Six
7    Seven
8    Eight
9     Nine
dtype: object
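
When a dictionary is combined with an explicit `index` argument, Pandas looks up each index label among the dictionary keys, and labels without a matching key yield `NaN`. A minimal sketch with illustrative data:

```python
import pandas as pd

number_names = {1: "One", 2: "Two", 3: "Three"}

# select and reorder entries by label; 4 is not a key, so its value is NaN
selected = pd.Series(number_names, index=[3, 1, 4])

print(selected)
print(selected.isna().tolist())
```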



In [21]:
# creating a Pandas series from a NumPy array
numpy_array = np.arange(10) + 1
print(
    "The series associated to the NumPy array \n{}\n is \n{}\n".format(
        numpy_array,
        pd.Series(numpy_array)
    )
)

The series associated to the NumPy array 
[ 1  2  3  4  5  6  7  8  9 10]
 is 
0     1
1     2
2     3
3     4
4     5
5     6
6     7
7     8
8     9
9    10
dtype: int64



A Pandas series can also be created with an explicitly specified index:

In [22]:
# create a time range
year = "2022"
time_period_range = pd.period_range(
  year + "-01-01", 
  year + "-12-31", 
  freq = "M"
)

# build the list of month names, one for each period in the range
months = [
    pd.Timestamp(str(range_item) + "-01").month_name()
    for range_item in time_period_range
]

month_series = pd.Series(
    index = time_period_range, 
    data = months
)

print(
    "The dates of months in year {} are \n{}\n".format(
        year,
        month_series
    )
)

The dates of months in year 2022 are 
2022-01      January
2022-02     February
2022-03        March
2022-04        April
2022-05          May
2022-06         June
2022-07         July
2022-08       August
2022-09    September
2022-10      October
2022-11     November
2022-12     December
Freq: M, dtype: object
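
An explicit index is not limited to dates. The sketch below, which uses made-up labels and values, shows that the index labels can then be used to look values up, alongside the usual position-based access:

```python
import pandas as pd

# illustrative data: values paired with string labels
temperatures = pd.Series(
    data=[21.5, 19.0, 23.2],
    index=["Mon", "Tue", "Wed"],
)

# label-based access with .loc
print(temperatures.loc["Tue"])

# position-based access with .iloc
print(temperatures.iloc[0])
```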



#### 1.1.2 Creating Data Series from loaded data

Pandas can also create data series from data stored in CSV and Excel files. The functions that load data from these files are [**read_csv**](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) and [**read_excel**](https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html).

Note that the loaded data is returned as a Pandas DataFrame. The [**squeeze**](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.squeeze.html) method must be called to turn a single-column DataFrame into a Pandas Series.

In [23]:
# loading the data from the CSV file  
# skipping the comment row and using the year as the index column
csv_loaded_data = pd.read_csv(
   "https://raw.githubusercontent.com/INTERTECHNICA-BUSINESS-SOLUTIONS-SRL/CourseDataManipulationWithPython/main/Module%203%20-%20The%20Pandas%20Library/Session%202%20-%20Pandas%20basics/data/RO_Urban_Air_Polution.csv",
   skiprows = 1,
   index_col = 0  
)

print(
    "The CSV loaded data series is \n{}".format(
        csv_loaded_data.squeeze()
      )
)

The CSV loaded data series is 
Year
1990    20.32
1995    18.57
2000    17.70
2005    17.16
2010    17.06
2011    17.64
2012    16.32
2013    15.37
2014    15.01
2015    15.43
2016    14.54
2017    14.61
Name: Air Pollution Mean Exposure, dtype: float64
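
The same load-then-squeeze pattern can be tried without network access. The sketch below is an assumption-laden stand-in for the course file: it reads a small hand-written CSV text from memory (the column names `Year` and `Value` are made up) and squeezes along the column axis, so the result is a Series even if the file has a single data row.

```python
import io
import pandas as pd

# illustrative CSV content: one comment row, a header row, two data rows
csv_text = (
    "# sample comment row\n"
    "Year,Value\n"
    "2000,1.5\n"
    "2001,2.5\n"
)

loaded = pd.read_csv(io.StringIO(csv_text), skiprows=1, index_col=0)
print(type(loaded).__name__)    # DataFrame

# a single-column DataFrame squeezes down to a Series
squeezed = loaded.squeeze("columns")
print(type(squeezed).__name__)  # Series
```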


In [24]:
# loading the data from the Excel file  
# skipping the comment row and using the year as the index column
excel_loaded_data = pd.read_excel(
   "https://raw.githubusercontent.com/INTERTECHNICA-BUSINESS-SOLUTIONS-SRL/CourseDataManipulationWithPython/main/Module%203%20-%20The%20Pandas%20Library/Session%202%20-%20Pandas%20basics/data/RO_Urban_Air_Polution.xlsx",
   skiprows = 1,
   index_col = 0  
)

print(
    "The Excel loaded data series is \n{}".format(
        excel_loaded_data.squeeze()
      )
)

The Excel loaded data series is 
Year
1990    20.32
1995    18.57
2000    17.70
2005    17.16
2010    17.06
2011    17.64
2012    16.32
2013    15.37
2014    15.01
2015    15.43
2016    14.54
2017    14.61
Name: Air Pollution Mean Exposure, dtype: float64


### 1.2 Creating DataFrames

As with Data Series, Pandas DataFrame objects can be created from regular Python data objects or from data loaded from files.

#### 1.2.1 Creating DataFrames from data

Pandas DataFrames can also be created directly from existing data such as Python lists and dictionaries, as well as NumPy arrays:

In [25]:
# create the raw data as regular Python lists of strings
en_literals = ["one", "two", "three", "four", "five", "six", "seven", "eight", "nine"]
fr_literals = ["un", "deux", "trois", "quatre", "cinq", "six", "sept", "huit", "neuf"]

# create the data necessary for the dataframe creation 
# (data, column names, index values)
# zip pairs the literals row by row; list() materializes the rows so they can be reused
data = list(zip(en_literals, fr_literals))
column_names = ["EN", "FR"]
index_values = np.arange(9) + 1

print(
    "The created data frame is \n{}\n".format(
        pd.DataFrame(
          data = data,
          columns = column_names,
          index = index_values
        )
    )
)


The created data frame is 
      EN      FR
1    one      un
2    two    deux
3  three   trois
4   four  quatre
5   five    cinq
6    six     six
7  seven    sept
8  eight    huit
9   nine    neuf



In [26]:
# create the dictionary data necessary for the dataframe creation
# (the keys become the column names, the values become the column data)
dictionary_data = {
    "EN": en_literals,
    "FR": fr_literals
}

print(
    "The created data frame is \n{}\n".format(
        pd.DataFrame(
          data = dictionary_data,
          index = index_values
        )
    )
)

The created data frame is 
      EN      FR
1    one      un
2    two    deux
3  three   trois
4   four  quatre
5   five    cinq
6    six     six
7  seven    sept
8  eight    huit
9   nine    neuf



In [27]:
# create the NumPy data: vstack stacks the two lists as rows,
# and .T transposes the result into two columns
numpy_data = np.vstack((en_literals, fr_literals)).T

print(
    "The created data frame is \n{}\n".format(
        pd.DataFrame(
          data = numpy_data,
          columns = column_names,
          index = index_values,
        )
    )
)

The created data frame is 
      EN      FR
1    one      un
2    two    deux
3  three   trois
4   four  quatre
5   five    cinq
6    six     six
7  seven    sept
8  eight    huit
9   nine    neuf
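
The three cells above print the same table. As a minimal sketch on shortened, illustrative lists, the following code builds a frame in each of the three ways and compares the cell values directly:

```python
import numpy as np
import pandas as pd

en = ["one", "two", "three"]
fr = ["un", "deux", "trois"]
columns = ["EN", "FR"]
index = [1, 2, 3]

# rows as tuples, a dictionary of columns, and a 2-D NumPy array
from_rows = pd.DataFrame(list(zip(en, fr)), columns=columns, index=index)
from_dict = pd.DataFrame({"EN": en, "FR": fr}, index=index)
from_numpy = pd.DataFrame(np.vstack((en, fr)).T, columns=columns, index=index)

# compare the cell values of the three frames
print(from_rows.to_numpy().tolist() == from_dict.to_numpy().tolist())
print(from_dict.to_numpy().tolist() == from_numpy.to_numpy().tolist())
```

The comparison is done on plain lists of values so that it does not depend on the exact string dtype each constructor picks.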



#### 1.2.2 Creating DataFrames from loaded data

The same functions [**read_csv**](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) and [**read_excel**](https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html) are used to create DataFrames from data stored in data files.

On successful execution, the functions return full-fledged DataFrame objects.

In [28]:

# loading the data from the CSV file  
csv_loaded_data = pd.read_csv(
   "https://raw.githubusercontent.com/INTERTECHNICA-BUSINESS-SOLUTIONS-SRL/CourseDataManipulationWithPython/main/Module%203%20-%20The%20Pandas%20Library/Session%202%20-%20Pandas%20basics/data/RO_Urban_Multi_Factor.csv",
)

print(
    "The data loaded from the data file is \n{}\n with the data types \n{}\n".format(
      csv_loaded_data,
      csv_loaded_data.dtypes   
    )
)

# we can observe that the percentage data has been loaded as the <object> data type
# because Pandas could not interpret values such as "15.92%" as numeric data


The data loaded from the data file is 
    Year  Population in largest city Population in the largest city (% of urban population)
0   1960                     1002300                                             15.92%    
1   1961                     1031002                                             15.92%    
2   1962                     1060568                                             15.96%    
3   1963                     1090981                                             16.00%    
4   1964                     1122310                                             16.04%    
..   ...                         ...                                                ...    
56  2016                     1839695                                             17.32%    
57  2017                     1830515                                             17.33%    
58  2018                     1821380                                             17.32%    
59  2019                     1812290     
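
One possible way to recover numeric values from such percentage strings is to strip the `%` sign and convert the remainder to floats. The sketch below is only an illustration: it uses a small hand-made frame with a hypothetical column name `Share` rather than the course data file.

```python
import pandas as pd

# illustrative frame; "Share" is a hypothetical column holding percentage strings
sample = pd.DataFrame({"Year": [1960, 1961], "Share": ["15.92%", "15.96%"]})

# remove the trailing percent sign, convert to float, and scale to a fraction
sample["Share"] = sample["Share"].str.rstrip("%").astype(float) / 100

print(sample.dtypes)
print(sample)
```

Dividing by 100 is a design choice that yields fractions such as 0.1592; it can be dropped if the values should stay on the 0 to 100 scale.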

In [29]:
# loading the data from the Excel file  
excel_loaded_data = pd.read_excel(
    "https://raw.githubusercontent.com/INTERTECHNICA-BUSINESS-SOLUTIONS-SRL/CourseDataManipulationWithPython/main/Module%203%20-%20The%20Pandas%20Library/Session%202%20-%20Pandas%20basics/data/RO_Urban_Multi_Factor.xlsx",
 )

print(
    "The data loaded from the data file is \n{}\n with the data types \n{}\n".format(
      excel_loaded_data,
      excel_loaded_data.dtypes   
    )
)

# because Excel files carry data type information for each cell,
# the percentage values are correctly interpreted as numeric data

The data loaded from the data file is 
    Year  Population in largest city  Population in the largest city (% of urban population)
0   1960                     1002300                                             0.1592     
1   1961                     1031002                                             0.1592     
2   1962                     1060568                                             0.1596     
3   1963                     1090981                                             0.1600     
4   1964                     1122310                                             0.1604     
..   ...                         ...                                                ...     
56  2016                     1839695                                             0.1732     
57  2017                     1830515                                             0.1733     
58  2018                     1821380                                             0.1732     
59  2019                     18