# Analysis of input data and column constraints
***

## Beginning remarks

1. The folder for this laboratory contains an example data set: [data set.zip](./input_data.xlsx).
2. To run this section, change *db_string* to match your own database settings; the code needs a working database connection.
3. To analyze and clean the input data we use the [Pandas framework](https://pandas.pydata.org/pandas-docs/stable/index.html).

Pandas is one of the most popular Python libraries for data manipulation. It allows you to:
- read and save data in many file formats (csv, xls, xlsx, etc.) and from many sources (URL, database, file, etc.),
- manage two-dimensional data tables,
- aggregate data,
- select data,
- index data,
- and a lot more.
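
A minimal sketch of a few of the operations listed above. The `demo` table and its values are made up for illustration and are not part of the laboratory data set:

```python
import pandas as pd

# Hypothetical toy table, not part of the laboratory data set
demo = pd.DataFrame(
    {
        "city": ["A", "A", "B"],
        "monument": ["m1", "m2", "m3"],
        "visitors": [10, 20, 5],
    }
)

# Selection: keep only the rows that satisfy a condition
only_a = demo[demo["city"] == "A"]

# Indexing: label-based access to a single cell
first_monument = demo.loc[0, "monument"]

# Aggregation: total number of visitors per city
visitors_per_city = demo.groupby("city")["visitors"].sum()

print(only_a)
print(first_monument)
print(visitors_per_city)
```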

Because this course focuses on databases, the methods presented here for analyzing and cleaning input data are very basic.

### Exercise

The data directory [data_set.zip](./data_set) contains data sets in CSV files. The directory `Lab4` also contains [input_data.xlsx](./input_data.xlsx). Choose a data set, propose a 3NF database structure for it and describe it using SQLAlchemy. Create a database schema in [dbdiagram](https://dbdiagram.io/). Add constraints to the created database schema and insert the data.

## Prepare the environment

In [1]:
from sqlalchemy import create_engine, Column, Integer, String, Float, ForeignKey, ForeignKeyConstraint, PrimaryKeyConstraint, Sequence, CheckConstraint, UniqueConstraint, Boolean
import pandas as pd
from sqlalchemy.orm import relationship, declarative_base
import numpy as np

In [2]:
config_PostgreSQL = {
    "database_type": "",
    "user": "",
    "password": "",
    "database_url": "",
    "port": "",
    "database_name": ""
}

db_string = "{database_type}://{user}:{password}@{database_url}:{port}/{database_name}".format(**config_PostgreSQL)

engine = create_engine(db_string)

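# test the connection; the context manager closes it again afterwards
try:
    with engine.connect() as conn:
        print("Connected successfully!")
except Exception as exc:
    # report the reason instead of silently swallowing every error
    print("Failed to connect:", exc)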

Base = declarative_base()

Connected successfully!
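
If the password contains characters such as `@` or `/`, building the connection string with `str.format` can produce an invalid URL. One possible alternative, assuming SQLAlchemy 1.4 or newer, is `sqlalchemy.engine.URL.create`, which escapes such characters. All values below (`student`, `p@ss/word`, `localhost`, `5432`, `lab4`) are placeholders chosen for illustration, not real settings:

```python
from sqlalchemy.engine import URL

# All values below are placeholders (assumptions), not real settings
demo_url = URL.create(
    drivername="postgresql",
    username="student",
    password="p@ss/word",  # contains characters that need escaping
    host="localhost",
    port=5432,
    database="lab4",
)

# Render the URL to inspect the escaped form; no connection is opened here
rendered = demo_url.render_as_string(hide_password=False)
print(rendered)
```

The resulting `demo_url` object can be passed to `create_engine` in place of the formatted string.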


In [3]:
# import zipfile
# import io

# def extract_internal_zip_from_zip(outer_zip_path, internal_zip_relative_path=None):
#     """
#     Extracts the contents of an internal ZIP file from within another ZIP file into memory as a dictionary.

#     This function reads the contents of an internal ZIP file located inside another ZIP file,
#     storing each file's data in a dictionary. The keys in the dictionary are the file names, 
#     and the values are `BytesIO` objects containing the file data.

#     Additionally, the function can list the files in the outer ZIP archive.

#     Parameters:
#     outer_zip_path (str): The file path to the outer ZIP archive.
#     internal_zip_relative_path (str): The relative path to the nested ZIP file inside the outer ZIP (optional).

#     Returns:
#     dict: A dictionary where keys are file names (str) and values are `BytesIO` objects 
#           containing the contents of each file within the nested ZIP.
          
#     or

#     list: A list of files in the outer ZIP if `internal_zip_relative_path` is None.
#     """
    
#     extracted_files = {}
#     outer_zip_name = outer_zip_path.split('/')[-1]  # Extract the outer ZIP file name
    
#     with zipfile.ZipFile(outer_zip_path, 'r') as outer_zip:
#         # List files in the outer ZIP and filter only the .zip files
#         outer_files = [f for f in outer_zip.namelist() if f.endswith('.zip')]
#         print(f"Files in the outer ZIP archive ({outer_zip_name}): ", ", ".join(outer_files))
        
#         # If no internal ZIP is specified, return the list of outer ZIP files
#         if not internal_zip_relative_path:
#             return outer_files
        
#         # Extract files from the internal ZIP if specified
#         with outer_zip.open(internal_zip_relative_path) as internal_zip_file:
#             with zipfile.ZipFile(internal_zip_file, 'r') as inner_zip:
#                 for file_name in inner_zip.namelist():
#                     with inner_zip.open(file_name) as file:
#                         extracted_files[file_name] = io.BytesIO(file.read())
                        
#     return extracted_files

# outer_zip_path = 'data_set.zip'
# files_in_outer_zip = extract_internal_zip_from_zip(outer_zip_path)

In [4]:
# extracted_files = extract_internal_zip_from_zip('data_set.zip', '0/brasilian-houses-to-rent.zip')

# for file_name, file_data in extracted_files.items():
#     print(f"File: {file_name}, Size: {len(file_data.getvalue())} bytes")

In [5]:
# csv_file = extracted_files['houses_to_rent.csv']
# csv_file.seek(0)

# data = pd.read_csv(csv_file)
# display(data)

## Read data set

The [Pandas framework](https://pandas.pydata.org/pandas-docs/stable/index.html) provides many functions that read data from a file, database, URL or other source and represent it as a [DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html). The DataFrame representation is very comfortable to use because it is a rough equivalent of a database table inside a Python program.

To read data from a file we only need to know the file type and how the data in it is structured. For example, for a CSV file we must know how the cells are separated, how the file is encoded, etc. All functions for reading data from files are part of the pandas core module; you can read more about them [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html).
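
As a small sketch of why the separator and encoding matter, the snippet below reads a CSV held in memory. The content of `raw_bytes` is invented for this example (a semicolon-separated file with decimal commas), so it runs without any file on disk:

```python
import io

import pandas as pd

# Hypothetical CSV content: semicolon-separated, decimal comma, UTF-8 encoded
raw_bytes = (
    "city;country;area\n"
    "Kraków;Poland;326,85\n"
    "Warszawa;Poland;517,24\n"
).encode("utf-8")

# sep, decimal and encoding must match the way the file was written
demo_csv = pd.read_csv(io.BytesIO(raw_bytes), sep=";", decimal=",", encoding="utf-8")

print(demo_csv)
print(demo_csv.dtypes)
```

With the default separator (a comma) each row would be read as a single column, and without `decimal=","` the area values would not be parsed as floats.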

In [6]:
data = pd.read_excel('input_data.xlsx')
display(data)

,city,country,area,population,monument,president
0,New York,US,1213.37,8175133.0,Empire State Building,Bill de Blasio
1,New York,USA,,8175133.0,Central Park,Bill de Blasio
2,New York,United States of America,,,Statue of Liberty,Bill de Blasio
3,New York,America,,,St. Patrick's Cathedral,Bill de Blasio
4,New York,USA,1213.37,,Times Square,Bill de Blasio
5,Kraków,Poland,,,Wawel,
6,Kraków,PL,326.85,774839.0,Kazimierz,
7,Kraków,Polska,326.85,,Church of St. Adalbert,
8,Kraków,Poland,326.85,,Juliusz Słowacki Theatre,
9,Kraków,Poland,326.85,,Saint Anne's Church,


As you can see, the table was loaded in full and Pandas recognized the headers and the types of the values in the columns. A table in Pandas is organized similarly to a database table and has the type [DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html). Rows are records, and each column groups cells with the same type of data. A column has the type [Series](https://pandas.pydata.org/pandas-docs/stable/reference/series.html).

To check the header names we can use the command:

In [7]:
print(data.columns)

Index(['city', 'country', 'area', 'population', 'monument', 'president'], dtype='object')


## Basic analysis

Where data is missing, Pandas assigns the value NaN. To check whether the data contains empty cells or duplicated rows we can use the commands:

In [8]:
# Check for missing values in each column
missing_values = data.isnull().sum()
print("Missing values in each column:\n", missing_values)

# Check for duplicate rows
duplicate_rows = data.duplicated().sum()
print("\nNumber of duplicate rows:", duplicate_rows)

Missing values in each column:
 city           0
country        0
area           8
population    11
monument       1
president      5
dtype: int64

Number of duplicate rows: 0
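
`duplicated()` without arguments only reports rows that repeat in every column. As a hedged sketch on an invented table (`demo` is not the laboratory data), the share of missing values per column and duplicates restricted to chosen columns can be checked like this:

```python
import numpy as np
import pandas as pd

# Hypothetical toy table used only for this illustration
demo = pd.DataFrame(
    {
        "city": ["A", "A", "B", "B"],
        "area": [1.0, np.nan, 2.0, 2.0],
        "monument": ["m1", "m2", "m3", "m3"],
    }
)

# Fraction of missing values in each column (the mean of a boolean mask)
missing_share = demo.isna().mean()

# Rows whose (city, monument) pair occurs more than once; keep=False marks all copies
repeated_pairs = demo[demo.duplicated(subset=["city", "monument"], keep=False)]

print(missing_share)
print(repeated_pairs)
```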


To show summary information about a column or the whole DataFrame we use the function `describe`. It can be used at [the DataFrame level](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html?highlight=describe#pandas.DataFrame.describe) or at [the Series level](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.describe.html?highlight=describe#pandas.Series.describe). In both cases it works the same way: it calculates and prints summary information about the data in the DataFrame or in a single column.

The only difference depends on whether the data in a column has object or numeric (float, integer, etc.) type. If we call `describe` on a column of object type, the result contains the number of elements other than NaN (count), the number of unique elements (unique), the most common value (top), the frequency of the most common value (freq), and the name and dtype of the Series. For a numeric column the description takes the form of basic statistics. Examples of both cases:

In [9]:
# summary of a text (object) column
print(data.city.describe())

count           15
unique           3
top       New York
freq             5
Name: city, dtype: object


In [10]:
print(data.area.describe())

count       7.000000
mean      607.340000
std       419.793857
min       326.850000
25%       326.850000
50%       326.850000
75%       865.305000
max      1213.370000
Name: area, dtype: float64


The function `describe` is useful for discovering the basic structure of columns. When designing a database structure, however, the function [info](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html) is more useful. This method prints information about a DataFrame, including the index data type, the column data types, the number of non-null values and memory usage. This is an easy way to see where we need to check the data. For example:

In [11]:
print(data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15 entries, 0 to 14
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   city        15 non-null     object 
 1   country     15 non-null     object 
 2   area        7 non-null      float64
 3   population  4 non-null      float64
 4   monument    14 non-null     object 
 5   president   10 non-null     object 
dtypes: float64(2), object(4)
memory usage: 848.0+ bytes
None
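
In the output above, `population` is reported as `float64` although it holds whole numbers, because a plain integer column cannot store NaN. One option, shown here only as a sketch on an invented Series, is the nullable `Int64` dtype of pandas, which keeps the missing entries and the integer type at the same time:

```python
import numpy as np
import pandas as pd

# Hypothetical column with a missing value, modelled on the population column
population = pd.Series([8175133.0, np.nan, 774839.0], name="population")

# "Int64" (capital I) is the nullable integer dtype; missing values become <NA>
population_int = population.astype("Int64")

print(population.dtype)
print(population_int)
```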


## Find unique value

To check the values in a column we can use the function `unique`. It finds the unique values in a Series and returns them in order of appearance. For example:

In [12]:
all_city = data['city'].unique()
print("City array: {0}".format(all_city))

all_country = data['country'].unique()
print("Country array: {0}".format(all_country))

all_area = data['area'].unique()
print("Area array: {0}".format(all_area))

all_population = data['population'].unique()
print("Population array: {0}".format(all_population))

all_monument = data['monument'].unique()
print("Monument array: {0}".format(all_monument))

all_president = data['president'].unique()
print("President array: {0}".format(all_president))

City array: ['New York' 'Kraków' 'Warszawa']
Country array: ['US' 'USA' ' United States of America' 'America' 'Poland' 'PL' 'Polska']
Area array: [1213.37     nan  326.85  517.24]
Population array: [8175133.      nan  774839. 1783321.]
Monument array: ['Empire State Building' 'Central Park ' ' Statue of Liberty '
 "St. Patrick's Cathedral" ' Times Square' 'Wawel' 'Kazimierz'
 ' Church of St. Adalbert ' 'Juliusz Słowacki Theatre'
 "Saint Anne's Church" 'Palace of Culture and Science' nan
 'Jabłonowski Palace' 'Holy Cross Church' 'Three Crosses Square']
President array: ['Bill de Blasio' nan 'Rafał Trzaskowski']


As we can see, the city values are correct, but there is a problem with the country names: values describing the same country have different forms, for example USA and US. There are two ways to handle this.

First, we can create one table with the alternative country names and another table with the official names, and then connect the table of official country names with the cities. The second solution is to map each country name to a form chosen by us. This script presents the second solution.
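
The arrays printed by `unique` also contain values with leading or trailing spaces (for example `' Statue of Liberty '`). A possible cleaning step, sketched here on a small invented Series rather than on *data* itself, is `Series.str.strip`:

```python
import pandas as pd

# Sample values modelled on the monument column; None stands for a missing cell
raw_names = pd.Series([" Statue of Liberty ", "Central Park ", "Wawel", None])

# str.strip removes surrounding whitespace and leaves missing values missing
clean_names = raw_names.str.strip()

print(clean_names.tolist())
```

Stripping text columns before mapping or deduplicating avoids treating `'Wawel'` and `' Wawel'` as two different values.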

## Mapping value

To implement this solution we use the function `map`. It can map the values of a Series using another Series (matched by index), a dictionary or a mapping function. The function described here is a Series method; in pandas versions where DataFrame has no `map` method, calling it on a DataFrame raises an AttributeError. If a Series of a different length is passed, the output Series still has the same length as the caller.

To use this function correctly in this case, we must create a dictionary whose keys are the values found in the `country` column, each assigned to the value we have chosen as correct. Example of use:

In [13]:
dictionary_correct = {'US': 'USA', 'USA': 'USA', ' United States of America': 'USA', 'America': 'USA', 'Poland': 'POL', 'PL': 'POL', 'Polska': 'POL'}

mapping_country = data['country'].map(dictionary_correct)

data['country'] = mapping_country
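
One detail worth keeping in mind: when `map` is given a dictionary, every value that is not among the keys becomes NaN. The sketch below uses invented values (`Deutschland` is deliberately left out of the dictionary) to show how such gaps can be detected after mapping:

```python
import pandas as pd

# Hypothetical input; 'Deutschland' has no entry in the mapping on purpose
countries = pd.Series(["US", "PL", "Deutschland"])
mapping = {"US": "USA", "PL": "POL"}

mapped = countries.map(mapping)

# Original values that the dictionary did not cover
unmapped = countries[mapped.isna()].unique()

print(mapped.tolist())
print("Values missing from the dictionary:", list(unmapped))
```

Running a check like this before overwriting the original column helps to avoid silently losing country names.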

In [14]:
display(data)

,city,country,area,population,monument,president
0,New York,USA,1213.37,8175133.0,Empire State Building,Bill de Blasio
1,New York,USA,,8175133.0,Central Park,Bill de Blasio
2,New York,USA,,,Statue of Liberty,Bill de Blasio
3,New York,USA,,,St. Patrick's Cathedral,Bill de Blasio
4,New York,USA,1213.37,,Times Square,Bill de Blasio
5,Kraków,POL,,,Wawel,
6,Kraków,POL,326.85,774839.0,Kazimierz,
7,Kraków,POL,326.85,,Church of St. Adalbert,
8,Kraków,POL,326.85,,Juliusz Słowacki Theatre,
9,Kraków,POL,326.85,,Saint Anne's Church,


## Supplementing data

In the next step we need to check the correctness of the data. For that we need to create a validation function based on our knowledge of the data set and of the problem it describes. In the present case we can check the correctness of the area and population of each city.

To do that we can iterate over the city list *all_city* and check the number of unique area and population values for each city. At the beginning of each iteration we get all unique values of area and population that are different from NaN. For this we use the [indexing and selecting data](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html) methods from Pandas.


For example, to get all unique area values for a city, we first select from *data* the rows where 'city' equals 'New York' (*data['city']== 'New York'*) and where 'area' is different from NaN (*~data['area'].isna()*); see the description of the function [isna](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.isna.html?highlight=isna#pandas.Series.isna). Finally, we take only the unique values from the result set.

In the next step we check the number of distinct *area* values by checking the length of the resulting array. If this length equals 1, we assign that area value to all records of the city. We can do that using [loc](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html).
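
The selection and assignment steps described above can be sketched for a single city on a tiny invented table (`demo` is an assumption made for this example, not the laboratory data):

```python
import numpy as np
import pandas as pd

# Hypothetical three-row table: one known area for New York, the rest missing
demo = pd.DataFrame(
    {
        "city": ["New York", "New York", "Kraków"],
        "area": [1213.37, np.nan, np.nan],
    }
)

is_new_york = demo["city"] == "New York"

# Unique non-NaN area values recorded for the city
known_area = demo[is_new_york & ~demo["area"].isna()]["area"].unique()

# Fill the gaps only when exactly one value is known
if len(known_area) == 1:
    demo.loc[is_new_york & demo["area"].isna(), "area"] = known_area[0]

print(demo)
```

The Kraków row stays empty because the mask restricts the assignment to New York.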

Full example script that extracts the area and population values and validates them:

In [15]:
# check area and population value
for city in all_city:
    # get the unique non-NaN area and population values for the city
    area = data[(data['city']==city) & (~data['area'].isna())]['area'].unique()
    population = data[(data['city']==city) & (~data['population'].isna())]['population'].unique()

    if len(area) == 1:
        data.loc[(data['city']==city) & (data['area'].isna()), 'area'] = area[0]
    else:
        print('Area data mismatch on the context of {0}'.format(city))
        
    if len(population) == 1:
        data.loc[(data['city']==city) & (data['population'].isna()), 'population'] = population[0]
    else:
        print('Population data mismatch on the context of {0}'.format(city))

data

,city,country,area,population,monument,president
0,New York,USA,1213.37,8175133.0,Empire State Building,Bill de Blasio
1,New York,USA,1213.37,8175133.0,Central Park,Bill de Blasio
2,New York,USA,1213.37,8175133.0,Statue of Liberty,Bill de Blasio
3,New York,USA,1213.37,8175133.0,St. Patrick's Cathedral,Bill de Blasio
4,New York,USA,1213.37,8175133.0,Times Square,Bill de Blasio
5,Kraków,POL,326.85,774839.0,Wawel,
6,Kraków,POL,326.85,774839.0,Kazimierz,
7,Kraków,POL,326.85,774839.0,Church of St. Adalbert,
8,Kraków,POL,326.85,774839.0,Juliusz Słowacki Theatre,
9,Kraków,POL,326.85,774839.0,Saint Anne's Church,
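
The loop above reports a mismatch city by city. As an alternative sketch (not the method used in this notebook), `groupby` with `nunique` can list all cities with conflicting values in one pass. The `demo` table is invented for the example:

```python
import numpy as np
import pandas as pd

# Hypothetical table: city B has two different area values
demo = pd.DataFrame(
    {
        "city": ["A", "A", "B", "B"],
        "area": [1.0, np.nan, 2.0, 3.0],
    }
)

# nunique ignores NaN by default, so only real values are counted
distinct_areas = demo.groupby("city")["area"].nunique()

# Cities with more than one distinct area need manual inspection
inconsistent = distinct_areas[distinct_areas > 1].index.tolist()

print(distinct_areas)
print("Cities with conflicting area values:", inconsistent)
```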


## Split data to table

In the next step we must split the data set into the tables of the relational database. This section explains the tables *countries* and *cities* in detail; the remaining tables are built in the same way.

To create the table *countries* we can use the constructor of [DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html). The first argument is the array of unique values from the column country in *data* (*data['country'].unique()*) and the second argument gives the column names (*columns=['country']*). In the next step we use the DataFrame attribute *index* to change the index name to 'id'. Example of use:

In [16]:
# get country
country_list = pd.DataFrame(data['country'].unique(), columns=['country'])

country_list.index.name = 'id'

country_list

id,country
0,USA
1,POL


Afterwards we need to create the table *cities*. Here we use the function [drop_duplicates](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html) to remove all duplicates from the set of ('city', 'country') pairs. Next we call [reset_index](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reset_index.html) to assign new id values. Finally we [drop](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html) the column 'index' created by the reset_index operation. To better present the relationship between the tables cities and countries, we can rename the column in *city_list* using the function [rename](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html).


In [17]:
# get city and connect with country
city_list = data[['city','country']].drop_duplicates().reset_index().drop(columns = ['index'])
city_list.index.name = 'id'

city_list = city_list.rename(columns = {'country':'country_id'})
city_list

id,city,country_id
0,New York,USA
1,Kraków,POL
2,Warszawa,POL


At this stage of data preparation we need to replace the values in 'country_id' with the id values from country_list. To do that we use the function [map](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.map.html?highlight=map#pandas.Series.map) once more, this time with a lambda function. Example:

In [18]:
city_list['country_id'] = city_list['country_id'].map(lambda x: country_list[country_list['country'] == x].index.values.astype(int)[0])

city_list

id,city,country_id
0,New York,0
1,Kraków,1
2,Warszawa,1
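
The lambda above filters the country table once for every row. A possible alternative, shown only as a sketch on invented copies of the two tables (`countries` and `cities` are hypothetical names), is to build a name-to-id dictionary once and pass it to `map`:

```python
import pandas as pd

# Hypothetical stand-ins for country_list and city_list
countries = pd.DataFrame({"country": ["USA", "POL"]})
countries.index.name = "id"

cities = pd.DataFrame(
    {
        "city": ["New York", "Kraków", "Warszawa"],
        "country_id": ["USA", "POL", "POL"],
    }
)

# Build the lookup once: country name -> id taken from the index
country_to_id = dict(zip(countries["country"], countries.index))

cities["country_id"] = cities["country_id"].map(country_to_id)

print(cities)
```

Both approaches give the same ids; the dictionary version avoids repeated filtering and makes a missing country show up as NaN instead of raising an IndexError.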


In [19]:
# get area and population
city_pop_area = data[['city','area', 'population', 'president']].drop_duplicates().reset_index().drop(columns = ['index'])
city_pop_area.index.name = 'id'

city_pop_area = city_pop_area.rename(columns = {'city':'city_id'})

city_pop_area['city_id'] = city_pop_area['city_id'].map(lambda x:  city_list[city_list['city'] == x].index.values.astype(int)[0])

city_pop_area

id,city_id,area,population,president
0,0,1213.37,8175133.0,Bill de Blasio
1,1,326.85,774839.0,
2,2,517.24,1783321.0,Rafał Trzaskowski


In [20]:
# get city and monument

city_monuments = data[['city', 'monument']].drop_duplicates().dropna().reset_index().drop(columns = ['index'])
city_monuments.index.name = 'id'

city_monuments = city_monuments.rename(columns = {'city':'city_id'})

city_monuments['city_id'] = city_monuments['city_id'].map(lambda x:  city_list[city_list['city'] == x].index.values.astype(int)[0])

city_monuments

id,city_id,monument
0,0,Empire State Building
1,0,Central Park
2,0,Statue of Liberty
3,0,St. Patrick's Cathedral
4,0,Times Square
5,1,Wawel
6,1,Kazimierz
7,1,Church of St. Adalbert
8,1,Juliusz Słowacki Theatre
9,1,Saint Anne's Church


## Database structure

In this section we show how to add constraints to a database table and to the columns of a table.

To add constraints to a table we use the attribute *[__table_args__](https://docs.sqlalchemy.org/en/13/orm/extensions/declarative/table_config.html)*. To define table constraints we assign a tuple of constraint objects to this class attribute. For example, we can create the following constraints:
- [CheckConstraint](https://docs.sqlalchemy.org/en/13/core/constraints.html#sqlalchemy.schema.CheckConstraint)
- [ForeignKeyConstraint](https://docs.sqlalchemy.org/en/13/core/constraints.html#sqlalchemy.schema.ForeignKeyConstraint)
- [UniqueConstraint](https://docs.sqlalchemy.org/en/13/core/constraints.html#sqlalchemy.schema.UniqueConstraint)

More examples can be found in the section [Defining Constraints and Indexes](https://docs.sqlalchemy.org/en/13/core/constraints.html) of the SQLAlchemy documentation.

To create constraints for a column we use the constructor of the class [Column](https://docs.sqlalchemy.org/en/13/core/metadata.html?highlight=column#sqlalchemy.schema.Column). The following constructor arguments set constraints and related column behaviour:
- autoincrement,
- default,
- nullable,
- primary_key,
- unique,
- onupdate.

Example script that creates a database structure with constraints:

In [21]:
if 'Country' not in globals():
    class Country(Base):
        __tablename__ = 'countries'
        __table_args__ = (
            CheckConstraint('length(country) = 3'),
            UniqueConstraint('country'),
        )
        id = Column(Integer, Sequence('seq_country_id'), primary_key=True)
        country = Column(String(3), nullable=False)

if 'City' not in globals():
    class City(Base):
        __tablename__ = 'cities'
        __table_args__ = (
            CheckConstraint('length(city) > 0'),
        )
        id = Column(Integer, Sequence('seq_city_id'), primary_key=True)
        country_id = Column(Integer, ForeignKey('countries.id'), nullable=False)
        city = Column(String(100), nullable=False)

if 'CityData' not in globals():
    class CityData(Base):
        __tablename__ = 'city_data'
        __table_args__ = (
            CheckConstraint('area > 0'),
            CheckConstraint('population >= 0'),
        )
        id = Column(Integer, Sequence('seq_city_data_id'), primary_key=True)
        city_id = Column(Integer, ForeignKey('cities.id'), nullable=False)
        # no default here: a default of 0 would violate the CHECK 'area > 0'
        area = Column(Float, nullable=True)
        population = Column(Integer, nullable=True, default=0)
        president = Column(String(60), nullable=True, default='')

if 'Monument' not in globals():
    class Monument(Base):
        __tablename__ = 'monuments'
        __table_args__ = (
            CheckConstraint('length(monument) > 0'),
        )
        id = Column(Integer, Sequence('seq_monument_id'), primary_key=True)
        city_id = Column(Integer, ForeignKey('cities.id'), nullable=False)
        monument = Column(String(100), nullable=False)

Base.metadata.create_all(engine)
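
To see the constraints in action without touching the course database, the sketch below declares a copy of the countries table on an in-memory SQLite database. The names `DemoBase`, `DemoCountry`, `demo_engine` and `try_insert` are assumptions made for this example, and it assumes SQLAlchemy 1.4 or newer:

```python
from sqlalchemy import CheckConstraint, Column, Integer, String, UniqueConstraint, create_engine
from sqlalchemy.exc import IntegrityError
from sqlalchemy.orm import Session, declarative_base

# Separate base and engine so the notebook's Base and engine stay untouched
DemoBase = declarative_base()


class DemoCountry(DemoBase):
    __tablename__ = "demo_countries"
    __table_args__ = (
        CheckConstraint("length(country) = 3"),
        UniqueConstraint("country"),
    )
    id = Column(Integer, primary_key=True)
    country = Column(String(3), nullable=False)


# In-memory SQLite stands in for the real database (assumption for the demo)
demo_engine = create_engine("sqlite:///:memory:")
DemoBase.metadata.create_all(demo_engine)


def try_insert(value):
    """Return True if the row was stored, False if a constraint rejected it."""
    with Session(demo_engine) as session:
        session.add(DemoCountry(country=value))
        try:
            session.commit()
            return True
        except IntegrityError:
            session.rollback()
            return False


# valid code, wrong length (CHECK), repeated code (UNIQUE)
results = [try_insert("POL"), try_insert("Poland"), try_insert("POL")]
print(results)
```

Only the first insert should succeed; the second breaks the length check and the third breaks the uniqueness constraint.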

## Insert data

To insert data into a database quickly and easily we can use the pandas function [to_sql](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_sql.html).

When we use the function [to_sql](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_sql.html) we need to set the arguments:
- name - name of the SQL table.
- con - SQLAlchemy engine or connection used to access the database. The user is responsible for engine disposal and connection closure.
- if_exists - how to behave if the table already exists. Possible options:
	- fail: Raise a ValueError. (default option)
	- replace: Drop the table before inserting new values.
	- append: Insert new values to the existing table.


Example of use:

In [None]:
country_list.to_sql('countries',engine, if_exists='append')
city_list.to_sql('cities',engine, if_exists='append')
city_pop_area.to_sql('city_data',engine, if_exists='append')
city_monuments.to_sql('monuments',engine, if_exists='append')
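
By default `to_sql` also writes the DataFrame index as a column, using the index name as the column name, which is why the index of each table was named 'id' earlier. The sketch below demonstrates this on an in-memory SQLite database; `demo_engine`, `demo_countries` and the table name are assumptions made for the example:

```python
import pandas as pd
from sqlalchemy import create_engine

# In-memory SQLite stands in for the course database (assumption for the demo)
demo_engine = create_engine("sqlite:///:memory:")

demo_countries = pd.DataFrame({"country": ["USA", "POL"]})
demo_countries.index.name = "id"  # the index is written as a column named 'id'

with demo_engine.begin() as conn:
    # index=True is the default; it is spelled out here for clarity
    demo_countries.to_sql("demo_countries", conn, if_exists="append", index=True)
    # read the table back to check what was stored
    stored = pd.read_sql_table("demo_countries", conn)

print(stored)
```

Note that `if_exists='replace'` drops the table and lets pandas recreate it from the DataFrame, so constraints declared through SQLAlchemy would be lost; `append` keeps the existing table definition.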

## Database schema

![db schema](images/Table.png)

```sql
Table countries {
  id int [pk, increment]            
  country char(3) [unique, not null]
}

Table cities {
  id int [pk, increment]
  city varchar(100) [not null]
  country_id int [ref: > countries.id, not null]
}

Table city_data {
  id int [pk, increment]
  city_id int [ref: > cities.id, not null]
  area float
  population int [default: 0]
  president varchar(60) [default: '']
}

Table monuments {
  id int [pk, increment]
  city_id int [ref: > cities.id, not null]
  monument varchar(100) [not null]
}
```