# <center> Working with Tabular Data Files </center> <a class="anchor" id="title"></a>

- [What is Tabular Data Format](#section_1)
- [Pandas read_csv() Function](#section_2)
- [Pandas read_excel() Function](#section_3)
- [Pandas read_sql() Function](#section_4)

<hr>

### What is Tabular Data Format <a class="anchor" id="section_1"></a>


Tabular data is structured into rows and columns and comes in various formats, including CSV files, Excel spreadsheets, and SQL tables. Tabular files can be read from the local computer or from an online source.

<br>

### Pandas read_csv() Function <a class="anchor" id="section_2"></a>

The Pandas library provides the [read_csv()](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) function to read a comma-separated values (CSV) file into a DataFrame.

The dataset file we are going to use is stored online in the [FiveThirtyEight](https://github.com/fivethirtyeight/data) GitHub repo, which holds a good collection of publicly available datasets.

The dataset we are looking at today is the alcohol-consumption dataset from the repo linked above. Let's read it below:

In [6]:
# Import Pandas library


In [8]:
# Create an alcohol-consumption DataFrame using the read_csv() function and the repo link above

# Display the DataFrame


Notice that this dataset has 193 rows and 5 columns.

Also notice that, by default, Pandas displays only the first and last 5 rows of a long DataFrame and marks the hidden rows in between with an ellipsis.
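As a minimal sketch of the basic `read_csv()` call, the snippet below reads a small in-memory sample so it runs without network access. The raw-file URL, the column names, and the sample rows are assumptions made for illustration; check them against the repo before relying on them.

```python
import io

import pandas as pd

# Assumed raw-file URL for the dataset; verify it against the repo before use.
DRINKS_URL = (
    "https://raw.githubusercontent.com/fivethirtyeight/data/"
    "master/alcohol-consumption/drinks.csv"
)

# Made-up stand-in rows (assumed column names) so the sketch runs offline.
SAMPLE_CSV = """country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol
Country A,0,0,0,0.0
Country B,89,132,54,4.9
Country C,25,0,14,0.7
"""

# read_csv() accepts a local path, a URL, or any file-like object.
drinks = pd.read_csv(io.StringIO(SAMPLE_CSV))
# drinks = pd.read_csv(DRINKS_URL)  # online variant, needs network access

print(drinks.shape)    # (number of rows, number of columns)
print(drinks.head(2))  # first two rows only
```

In a notebook, leaving `drinks` as the last expression of a cell renders the DataFrame as a table instead of printing plain text.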

<br>

In real-world work, data professionals sometimes need to access very large dataset files with thousands of rows and columns.

In that case, optional `read_csv()` parameters such as `usecols` (which columns to keep) and `nrows` (how many rows to read) let us load only a specific subset of the large data file.

This is especially useful when large datasets have to be moved across a network.
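One possible way to load only part of a file is sketched below with the `usecols` and `nrows` parameters of `read_csv()`. Which parameters the original exercise intends is not stated, so treat this choice as an assumption; the sample rows and column names are likewise made up for illustration.

```python
import io

import pandas as pd

# Made-up stand-in rows (assumed column names) so the sketch runs offline.
SAMPLE_CSV = """country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol
Country A,0,0,0,0.0
Country B,89,132,54,4.9
Country C,25,0,14,0.7
Country D,245,138,312,12.4
Country E,217,57,45,5.9
"""

subset = pd.read_csv(
    io.StringIO(SAMPLE_CSV),
    usecols=["country", "wine_servings"],  # keep only these two columns
    nrows=3,                               # read only the first 3 data rows
)

print(subset)
```

Filtering at read time means the unwanted columns and rows never become part of the DataFrame, which keeps memory use down compared with loading everything and slicing afterwards.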


In [4]:
# Create an alcohol-consumption DataFrame using the read_csv() function and the repo link above. Filter rows and columns using function parameters

# Display the DataFrame


<br>

Another useful `read_csv()` parameter is `index_col`, which assigns one of the columns as the DataFrame index.

In [5]:
# Create an alcohol-consumption DataFrame using the read_csv() function and the repo link above. Assign one column as the DataFrame index using its label

# Display the DataFrame


<br>

In [6]:
# Create an alcohol-consumption DataFrame using the read_csv() function and the repo link above. Assign one column as the DataFrame index using its numeric position

# Display the DataFrame


Notice that `index_col` accepts either the column's label or its numeric position.

Both examples above produce the same result: the same column is now assigned as the DataFrame index.
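The equivalence of the two forms can be checked with a short sketch. It assumes the first column is named `country`; the sample rows are made up for illustration.

```python
import io

import pandas as pd

# Made-up stand-in rows (assumed column names) so the sketch runs offline.
SAMPLE_CSV = """country,beer_servings,wine_servings
Country A,0,0
Country B,89,54
Country C,25,14
"""

# Select the index column by its label ...
by_label = pd.read_csv(io.StringIO(SAMPLE_CSV), index_col="country")

# ... or by its zero-based position in the file.
by_position = pd.read_csv(io.StringIO(SAMPLE_CSV), index_col=0)

# equals() compares both the values and the index of the two frames.
print(by_label.equals(by_position))
print(by_label.loc["Country B"])  # rows can now be looked up by country
```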


<br>

### Pandas read_excel() Function <a class="anchor" id="section_3"></a>

Another commonly used tabular data format is the spreadsheet. The Pandas library provides the [read_excel()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html) function to read Microsoft Excel spreadsheet files.

In the following example, we will use an Excel file that is already stored on our local machine. This file is also available in the course repo.

The workbook contains two datasets, stored in sheets named `short_list` and `long_list`. The short one lists 5 countries, while the long one lists 14 countries.

Notice how the [read_excel()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html) example uses the `sheet_name` parameter to specify which sheet contains the needed dataset. For the complete list of parameters of each function, see the official Pandas documentation.
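The following sketch shows `sheet_name` in action. Because the course workbook is not reproduced here, the snippet first writes a small stand-in workbook to a temporary folder; the file name `example_workbook.xlsx` and the data are hypothetical, and reading or writing `.xlsx` files assumes the `openpyxl` package is installed.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in data; the real workbook's contents are not shown here.
short_df = pd.DataFrame({"country": ["Country A", "Country B"], "value": [1, 2]})
long_df = pd.DataFrame(
    {"country": ["Country A", "Country B", "Country C"], "value": [1, 2, 3]}
)

with tempfile.TemporaryDirectory() as tmp_dir:
    workbook_path = Path(tmp_dir) / "example_workbook.xlsx"  # hypothetical name

    # Write one sheet per dataset (uses openpyxl for .xlsx files).
    with pd.ExcelWriter(workbook_path) as writer:
        short_df.to_excel(writer, sheet_name="short_list", index=False)
        long_df.to_excel(writer, sheet_name="long_list", index=False)

    # sheet_name selects the sheet to load; by default the first sheet is read.
    short_list = pd.read_excel(workbook_path, sheet_name="short_list")
    long_list = pd.read_excel(workbook_path, sheet_name="long_list")

print(short_list)
print(long_list)
```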

In [1]:
# Create a DataFrame object using read_excel() function. Use parameter to identify dataset


# Display the DataFrame


In [7]:
# Create a DataFrame object using read_excel() function. Use parameter to identify dataset


# Display the DataFrame


Similar to what we learned before, the `read_excel()` function also has the optional `index_col` parameter.

**Quiz**:<br>
Can you use the parameter `index_col` to assign one of the columns to be the DataFrame index?

<br>

### Pandas read_sql() Function <a class="anchor" id="section_4"></a>

Another common scenario is querying relational database tables with SQL. You first need the credentials and connection details required to establish a connection with the database server. You can then pass the SQL query and the connection to the [read_sql()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sql.html) function, which loads the result into a Pandas DataFrame object.

To simulate this scenario, the following code creates a local database using Python's built-in `sqlite3` module. We then use Pandas to access the data using SQL queries.

In [2]:
# Import SQLite library


# Assign the database name


# Create the database file


# Establish a connection with the database file


# Create a database table


# Add some data


# Commit the changes


# Close the connection
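The steps listed above can be sketched with the standard-library `sqlite3` module. The `employees` table layout and the rows are hypothetical, and the file is created in a temporary folder so the sketch leaves nothing behind; in the notebook the path would simply be `local_db_example.db`.

```python
import sqlite3
import tempfile
from pathlib import Path

# Temporary folder for the sketch; in the notebook, use a path next to it.
tmp_dir = tempfile.TemporaryDirectory()
db_path = Path(tmp_dir.name) / "local_db_example.db"

# connect() creates the database file if it does not exist yet.
conn = sqlite3.connect(db_path)

# Hypothetical table layout for the dummy employee data.
conn.execute(
    "CREATE TABLE IF NOT EXISTS employees ("
    "id INTEGER PRIMARY KEY, name TEXT, department TEXT)"
)

# Parameterized inserts keep the values separate from the SQL text.
conn.executemany(
    "INSERT INTO employees (name, department) VALUES (?, ?)",
    [("Alice", "Engineering"), ("Bob", "Finance")],
)

# Commit the changes, check the result, then close the connection.
conn.commit()
row_count = conn.execute("SELECT COUNT(*) FROM employees").fetchone()[0]
db_exists = db_path.exists()
conn.close()

print(row_count, db_exists)
tmp_dir.cleanup()  # remove the temporary folder and the database file
```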


The database file `local_db_example.db` should now appear in the same folder as your notebook. It contains dummy data describing employee details. The following code queries that data into a Pandas DataFrame object.

<br>

In [3]:
# Identify the database name


# Establish a connection with the database file


# Use Pandas function to pass SQL query and create a DataFrame object


# Print the generated DataFrame


# Close the connection
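A minimal sketch of the query step follows. To stay self-contained it builds a throwaway in-memory database instead of opening `local_db_example.db`; the `employees` table and its rows are hypothetical.

```python
import sqlite3

import pandas as pd

# In-memory database so the sketch runs on its own; in the notebook you
# would connect to the database file instead.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, department TEXT)"
)
conn.executemany(
    "INSERT INTO employees (name, department) VALUES (?, ?)",
    [("Alice", "Engineering"), ("Bob", "Finance")],
)
conn.commit()

# read_sql() takes the SQL query and an open connection and returns a DataFrame.
query = "SELECT name, department FROM employees ORDER BY name"
employees = pd.read_sql(query, conn)

# Close the connection once the data is in the DataFrame.
conn.close()

print(employees)
```

The DataFrame stays usable after the connection is closed because `read_sql()` copies the query result into memory.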


In the example above, we created a local database file, queried it with SQL through the Pandas library, and loaded the results into a Pandas DataFrame object. In practice, you may need to query relational databases that are hosted on remote servers or in the cloud.

**[Back to Top](#title)**