# Python for Data Analysis: Pandas

## Lesson Outline
1. Introduction to Pandas
2. Reading from files into Pandas Dataframe
3. Summary attributes & statistics
4. Selecting specific columns
5. Data Manipulation
6. Generating a few plots
7. Saving the dataframe

## Introduction to Pandas

The Pandas module is Python's fundamental data analytics library. It provides high-performance, easy-to-use data structures and tools for data analysis. Pandas supports creating pivot tables and computing new columns from existing ones, as well as grouping rows by column values and joining tables as in SQL. A good cheat sheet for Pandas can be found here. Pandas is a comprehensive and mature module that can be used for advanced data analytics; this tutorial presents only a basic overview of its capabilities.

Source: https://www.featureranking.com/tutorials/python-tutorials/pandas/


Several other Python libraries also work with Pandas objects.

You can have a look at its complete documentation here: https://pandas.pydata.org/docs/index.html

## Reading from files into Pandas Dataframe

In [None]:
# import necessary libraries


![pandas.PNG](attachment:pandas.PNG)

In [None]:
# get the data
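
As a minimal sketch of these two steps, the cell below imports Pandas and reads CSV text into a dataframe. The lesson's actual data file is not named here, so the in-memory CSV text and its values are made-up stand-ins; only the column names Depth, Porosity, and Quartz come from this lesson.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for the lesson's data file.
csv_text = """Depth,Porosity,Quartz
6480.0,3.2,18.0
6520.0,7.5,35.5
6550.0,4.1,25.0
"""

# pd.read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))
print(df)
```

With a real file, the path to that file would take the place of the StringIO object.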


In [None]:
# df.head() defaults to the first 5 rows of the dataframe

In [None]:
# df.head(20) if you want to output 20 rows
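
A short sketch of previewing rows with head(), assuming a made-up 25-row dataframe in place of the lesson data:

```python
import pandas as pd

# Hypothetical 25-row dataframe used only for illustration.
df = pd.DataFrame({
    "Depth": range(6500, 6525),
    "Porosity": [i * 0.5 for i in range(25)],
})

first_five = df.head()      # head() returns the first 5 rows by default
first_twenty = df.head(20)  # pass a number to get that many rows

print(first_five)
print(len(first_twenty))
```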

## Summary attributes & statistics
The shape attribute returns a tuple with the number of rows and columns. Thus, the number of rows and columns can be obtained with .shape[0] and .shape[1] respectively.
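
For instance, on a small made-up dataframe (the values are illustrative, not the lesson data):

```python
import pandas as pd

# Hypothetical sample values; only the column names come from this lesson.
df = pd.DataFrame({
    "Depth": [6480.0, 6520.0, 6550.0, 6590.0, 6620.0],
    "Porosity": [3.2, 7.5, 4.1, 9.8, 2.6],
    "Quartz": [18.0, 35.5, 25.0, 12.4, 15.1],
})

print(df.shape)  # (5, 3)

n_rows = df.shape[0]  # number of rows
n_cols = df.shape[1]  # number of columns
```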

We can sort the data with respect to a particular column by calling sort_values(). Let's sort the data by Porosity values in descending order by passing ascending=False. For ascending order, which is the default, set ascending=True.
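
A minimal sketch of the sort, assuming a made-up dataframe with a Porosity column:

```python
import pandas as pd

# Hypothetical sample values; only the column names come from this lesson.
df = pd.DataFrame({
    "Depth": [6480.0, 6520.0, 6550.0, 6590.0, 6620.0],
    "Porosity": [3.2, 7.5, 4.1, 9.8, 2.6],
})

# sort_values returns a new dataframe; the original df is left unchanged.
df_sorted = df.sort_values("Porosity", ascending=False)
print(df_sorted)
```

Note that the original row labels travel with the rows, so the index of df_sorted is no longer in order.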

describe() generates **descriptive statistics**. Keep in mind that this function excludes **null values**.
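
To see how null values are excluded, here is a small sketch with one missing Porosity value (all numbers are made up for illustration):

```python
import pandas as pd

# Hypothetical sample values with one missing Porosity entry.
df = pd.DataFrame({
    "Porosity": [3.2, 7.5, None, 9.8, 2.6],
    "Quartz": [18.0, 35.5, 25.0, 12.4, 15.1],
})

stats = df.describe()
print(stats)

# The count row shows how many non-null values each column has.
print(stats.loc["count"])
```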

We can also use the following functions to summarize the data. Except for unique(), these methods exclude null values by default.

* count() to count the number of non-null elements.
* value_counts() to get a frequency distribution of unique values.
* unique() to get the unique values themselves (use nunique() to get the number of unique values).
* mean() to calculate the arithmetic mean of a given set of numbers.
* std() to calculate the sample standard deviation of a given set of numbers.
* max() to return the maximum of the provided values.
* min() to return the minimum of the provided values.
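
The functions listed above can be tried on a single column. The sketch below uses a made-up Porosity series with one missing value:

```python
import numpy as np
import pandas as pd

# Hypothetical values, including one missing entry.
porosity = pd.Series([3.0, 5.0, 5.0, np.nan, 7.0], name="Porosity")

print(porosity.count())         # number of non-null values
print(porosity.value_counts())  # how often each value occurs
print(porosity.unique())        # the distinct values, NaN included
print(porosity.nunique())       # number of distinct non-null values
print(porosity.mean())          # arithmetic mean
print(porosity.std())           # sample standard deviation
print(porosity.max())
print(porosity.min())
```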

## Selecting specific columns and rows
When there are many columns, we may prefer to select only the ones we are interested in. 

Extract data using rows

loc and iloc are two indexers that can be used to slice data from specific rows.

* loc – locates the rows by label. It slices on the explicit index, so it takes index labels (strings, or numbers when the index is numeric), and the end of a slice is included.
* iloc – locates the rows by integer position. It slices on Python's default zero-based numerical positions, so the end of a slice is excluded.
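
A small sketch of the difference between the two, assuming a made-up dataframe whose row labels S1 to S5 are hypothetical sample names:

```python
import pandas as pd

# Hypothetical values and row labels, used only for illustration.
df = pd.DataFrame(
    {
        "Porosity": [3.2, 7.5, 4.1, 9.8, 2.6],
        "Quartz": [18.0, 35.5, 25.0, 12.4, 15.1],
    },
    index=["S1", "S2", "S3", "S4", "S5"],
)

# loc slices by label and includes the end label.
by_label = df.loc["S2":"S4"]

# iloc slices by position and excludes the end position.
by_position = df.iloc[1:3]

print(by_label)
print(by_position)
```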

Let's say we want to select the "Porosity" and the "Quartz" columns only.

You can also create a new dataframe, df_subset, from the original dataframe, df, by selecting only a few columns.
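
As a sketch with made-up values, passing a list of column names selects those columns:

```python
import pandas as pd

# Hypothetical sample values; only the column names come from this lesson.
df = pd.DataFrame({
    "Depth": [6480.0, 6520.0, 6550.0],
    "Porosity": [3.2, 7.5, 4.1],
    "Quartz": [18.0, 35.5, 25.0],
})

# Note the double brackets: the inner list holds the column names.
df_subset = df[["Porosity", "Quartz"]].copy()
print(df_subset)
```

The call to copy() is optional here; it makes df_subset independent of df, so later changes to one do not affect the other.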

You can also select **specific ranges of values and create a new dataframe**.

Here I am selecting all entries with 'Porosity' values less than 5.0

Here I am selecting all entries with 'Porosity' values less than 5.0 and all 'Quartz' values less than 20.0.

Here I am selecting all entries with 'Depth' greater than 6500 and less than 6600.
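
The three selections above can be sketched with boolean masks. The data values below are made up; the thresholds are the ones named in the text.

```python
import pandas as pd

# Hypothetical sample values; only the column names come from this lesson.
df = pd.DataFrame({
    "Depth": [6480.0, 6520.0, 6550.0, 6590.0, 6620.0],
    "Porosity": [3.2, 7.5, 4.1, 9.8, 2.6],
    "Quartz": [18.0, 35.5, 25.0, 12.4, 15.1],
})

# Porosity less than 5.0
low_porosity = df[df["Porosity"] < 5.0]

# Porosity less than 5.0 and Quartz less than 20.0.
# Each condition needs its own parentheses, joined with & (and) or | (or).
low_both = df[(df["Porosity"] < 5.0) & (df["Quartz"] < 20.0)]

# Depth greater than 6500 and less than 6600
depth_window = df[(df["Depth"] > 6500) & (df["Depth"] < 6600)]

print(low_porosity)
print(low_both)
print(depth_window)
```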

## Data Manipulation
### Handling missing values
Dealing with missing values is a time-consuming but crucial task. We should first identify the missing values and then try to determine why they are missing.

There are two basic strategies to handle missing values:

* Remove rows and columns with missing values.
* Impute missing values, replacing them with predefined values.

Missing values are a bit complicated in Python as they can be denoted by either "na" or "null" in Pandas (both mean the same thing). Furthermore, NumPy denotes missing values as "NaN" (that is, "not a number").

First, let's count the number of missing values in each column.

In [None]:
# count the number of missing values in each column
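
One common way to do this is to chain isna() and sum(). A sketch on made-up data with a few missing entries:

```python
import pandas as pd

# Hypothetical sample values with some entries missing.
df = pd.DataFrame({
    "Depth": [6480.0, 6520.0, 6550.0, 6590.0],
    "Porosity": [3.2, None, 4.1, 9.8],
    "Quartz": [18.0, None, None, 12.4],
})

# isna() marks each missing cell as True; sum() counts the True values per column.
missing_per_column = df.isna().sum()
print(missing_per_column)
```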

By default, the function dropna() drops rows with at least one missing value. With axis=1, it instead drops columns with at least one missing value.

In [None]:
# df.dropna(axis=1)
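
Both strategies, removing and imputing, can be sketched on a small made-up dataframe. Filling with the column mean is only one possible choice of predefined value and is an assumption here, not something this lesson prescribes.

```python
import pandas as pd

# Hypothetical sample values with some entries missing.
df = pd.DataFrame({
    "Depth": [6480.0, 6520.0, 6550.0],
    "Porosity": [3.2, None, 4.1],
    "Quartz": [18.0, 35.5, None],
})

rows_dropped = df.dropna()        # drop rows that contain a missing value
cols_dropped = df.dropna(axis=1)  # drop columns that contain a missing value

# Impute: replace each missing value with the mean of its column (assumed strategy).
imputed = df.fillna(df.mean())

print(rows_dropped)
print(cols_dropped)
print(imputed)
```

None of these calls modifies df itself; each returns a new dataframe.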

## A Few Plotting Routines 
Pandas allows for direct visualization of a data frame's columns. First, let's prepare the plotting environment using Matplotlib.
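
A minimal sketch of this setup, followed by a scatter plot of one column against another. The data values and the choice of a scatter plot are assumptions made for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample values; only the column names come from this lesson.
df = pd.DataFrame({
    "Depth": [6480.0, 6520.0, 6550.0, 6590.0, 6620.0],
    "Porosity": [3.2, 7.5, 4.1, 9.8, 2.6],
})

# DataFrame.plot draws with Matplotlib and returns the Axes object.
ax = df.plot(x="Depth", y="Porosity", kind="scatter", title="Porosity vs. Depth")
ax.set_xlabel("Depth")
ax.set_ylabel("Porosity")
```

In a Jupyter notebook the figure typically renders inline; in a script, call plt.show() to display it.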

Alternatively, we can examine the histogram of these columns.
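
A sketch of the histogram call, assuming made-up values and the Porosity and Quartz columns as the ones of interest:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample values; only the column names come from this lesson.
df = pd.DataFrame({
    "Porosity": [3.2, 7.5, 4.1, 9.8, 2.6],
    "Quartz": [18.0, 35.5, 25.0, 12.4, 15.1],
})

# hist() draws one histogram per selected column and returns an array of Axes.
axes = df[["Porosity", "Quartz"]].hist(bins=5)
```

The bins argument controls how many intervals each histogram uses; 5 is an arbitrary choice for this small sample.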

## Save your dataframe
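
One common way to save a dataframe is to write it to a CSV file with to_csv(). The sketch below uses made-up values and a hypothetical file name in the system's temporary directory.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample values; only the column names come from this lesson.
df = pd.DataFrame({
    "Depth": [6480.0, 6520.0, 6550.0],
    "Porosity": [3.2, 7.5, 4.1],
    "Quartz": [18.0, 35.5, 25.0],
})

# Hypothetical output path; replace it with the location you want.
out_path = os.path.join(tempfile.gettempdir(), "pandas_lesson_output.csv")

# index=False leaves the row index out of the file.
df.to_csv(out_path, index=False)

# Read the file back to check what was written.
restored = pd.read_csv(out_path)
print(restored)
```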