# <center> Exploring Data Objects </center>

- [Head and Tail Methods](#section_1)
- [The info() Method](#section_2)
- [Shape and Size Attributes](#section_3)
- [Descriptive Statistics](#section_4)
- [Unique and Value Counts](#section_5)
<hr>

### Head and Tail Methods <a class="anchor" id="section_1"></a>

[`head()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html) and [`tail()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.tail.html) are widely used methods that display the first and last rows of Pandas data objects. By default, each returns five rows.

In [5]:
# Import Pandas Library


In the example below, we will create a DataFrame using the [alcohol consumption](https://raw.githubusercontent.com/fivethirtyeight/data/master/alcohol-consumption/drinks.csv) dataset that we used earlier in the course.

This dataset has 193 rows and 5 columns. 
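
As a minimal sketch of the loading step, the snippet below reads CSV text with `pd.read_csv()`. To stay runnable offline it parses a small in-memory sample: the column names follow the dataset, but the three rows are made-up placeholders, and `DRINKS_URL` and `csv_text` are names chosen for illustration. Passing `DRINKS_URL` to `pd.read_csv()` instead should load the full dataset.

```python
import io

import pandas as pd

# URL of the full dataset (from the link above); not fetched in this sketch
DRINKS_URL = (
    "https://raw.githubusercontent.com/fivethirtyeight/data/"
    "master/alcohol-consumption/drinks.csv"
)

# Made-up placeholder rows that reuse the dataset's column names
csv_text = """country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol
Country A,0,0,0,0.0
Country B,89,132,54,4.9
Country C,25,0,14,0.7
"""

# pd.read_csv accepts a URL, a file path, or a file-like object
alcohol_data = pd.read_csv(io.StringIO(csv_text))

# Display DataFrame
print(alcohol_data)
```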

In [1]:
# Read dataset from GitHub repository


# Display DataFrame


In [3]:
# Display top DataFrame rows


In [6]:
# Display bottom DataFrame rows


We can override the default by passing the number of rows we want to display, as in these two examples:

In [5]:
# Display top 8 DataFrame rows


# Display bottom 8 DataFrame rows


In summary, these two methods are mainly used to take a quick look at the data and verify that we are using the correct dataset.
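
The calls described in this section can be sketched as follows. The `sample` DataFrame is a hypothetical 12-row stand-in with made-up values, not the real alcohol dataset.

```python
import pandas as pd

# Hypothetical stand-in data: 12 made-up rows
sample = pd.DataFrame({
    "country": [f"Country {i}" for i in range(12)],
    "beer_servings": list(range(0, 120, 10)),
})

# With no argument, head() and tail() return 5 rows
top_default = sample.head()
bottom_default = sample.tail()

# Pass a number to control how many rows are returned
top_eight = sample.head(8)
bottom_eight = sample.tail(8)

print(top_eight)
print(bottom_eight)
```

Both methods return a new DataFrame that keeps the original index labels, so the rows from `tail()` still show their positions in the full dataset.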

### The info() Method <a class="anchor" id="section_2"></a>

The [info()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html) method gives us a high-level summary of a DataFrame object.

Let's apply the `info()` method to our alcohol DataFrame below. 

In [4]:
# Display summary of the DataFrame columns


The output first shows the number of records in the DataFrame and the range of the numerical index automatically assigned to it, followed by the total number of columns (5 in our dataset). Next, it lists each column name with its count of non-null values and its data type.

In this dataset, there appear to be no missing values, since each column's non-null count equals the number of records. The country column has the Pandas object data type, which represents text values; the three servings columns (beer_servings, spirit_servings, and wine_servings) have the int64 data type, which represents integers; and the total litres column has the float64 data type, which represents real (floating-point) numbers.

At this stage, we have an idea of which changes, if any, are needed so that each column has the correct data type. For example, numerical data types such as int64 allow us to apply mathematical calculations to the values, while the object data type allows us to apply text formatting functions. In the next section about data cleaning, we will learn how to change data types.

Finally, the method reports how many columns have each data type and the memory usage of the DataFrame. The memory figure is useful when you work with a large DataFrame and want to reduce its size.
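
A minimal sketch of `info()` on a hypothetical three-row sample (made-up values, column names borrowed from the alcohol dataset) is shown below. Because `info()` prints its report rather than returning it, the sketch also captures the text through the method's `buf` parameter and uses `count()` to get the non-null counts as a Series.

```python
import io

import pandas as pd

# Hypothetical stand-in data with made-up values
sample = pd.DataFrame({
    "country": ["Country A", "Country B", "Country C"],
    "beer_servings": [0, 89, 25],
    "total_litres_of_pure_alcohol": [0.0, 4.9, 0.7],
})

# Print the summary: index range, columns, non-null counts, dtypes, memory
sample.info()

# info() returns None, so write the report to a buffer to keep it as text
buffer = io.StringIO()
sample.info(buf=buffer)
summary_text = buffer.getvalue()

# Non-null count per column; equal to len(sample) when nothing is missing
non_null_counts = sample.count()
print(non_null_counts)
```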

### Shape and Size Attributes <a class="anchor" id="section_3"></a>

The [`shape`](https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.DataFrame.shape.html) and [`size`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.size.html) attributes report the dimensions of a DataFrame and the number of elements it holds. `shape` returns a tuple of (rows, columns), while `size` returns the total number of elements, which for a DataFrame is rows multiplied by columns.

In [8]:
# How many elements in alcohol_data DataFrame


In [9]:
# How many elements in alcohol_data['country'] Series


In [10]:
# Check the dimension of alcohol_data DataFrame


In [11]:
# Print items generated from the shape attribute
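
The four steps above can be sketched on a hypothetical four-row sample (made-up values, not the real dataset):

```python
import pandas as pd

# Hypothetical stand-in data with made-up values
sample = pd.DataFrame({
    "country": ["Country A", "Country B", "Country C", "Country D"],
    "beer_servings": [0, 89, 25, 245],
    "wine_servings": [0, 54, 14, 312],
})

# Number of elements in the DataFrame (rows * columns)
total_elements = sample.size

# Number of elements in a single column (a Series)
series_elements = sample["country"].size

# Dimensions as a (rows, columns) tuple
dimensions = sample.shape

# Unpack the tuple to use each item separately
n_rows, n_cols = sample.shape
print(f"{n_rows} rows and {n_cols} columns, {total_elements} elements in total")
```

Note that `shape` and `size` are attributes, so they are written without parentheses.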


### Descriptive Statistics <a class="anchor" id="section_4"></a>

[`describe()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html) is a DataFrame method that provides descriptive statistics summarizing the central tendency, dispersion, and shape of a dataset’s distribution, excluding NaN values.

In [12]:
# Display statistical analysis of the DataFrame


From the example above, we notice that [`describe()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html) was applied only to the numerical columns and the country name column was ignored. This is because, by default, `describe()` summarizes only the numeric columns of a DataFrame that mixes data types.
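
As a minimal sketch on a hypothetical four-row sample (made-up values), the default call below summarizes only the numeric columns, while `include="all"` is one way to bring the text column into the summary as well.

```python
import pandas as pd

# Hypothetical stand-in data with made-up values
sample = pd.DataFrame({
    "country": ["Country A", "Country B", "Country C", "Country D"],
    "beer_servings": [0, 89, 25, 245],
    "wine_servings": [0, 54, 14, 312],
})

# Default: numeric columns only (count, mean, std, min, quartiles, max)
numeric_summary = sample.describe()
print(numeric_summary)

# include="all" adds non-numeric columns; statistics that do not apply
# to a column are shown as NaN
full_summary = sample.describe(include="all")
print(full_summary)
```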

In addition to the numerical statistical summary, you can also explore the features of text values in DataFrames. For this exercise, we will use the [country codes dataset](https://raw.githubusercontent.com/datasets/country-codes/master/data/country-codes.csv) from the Open Data GitHub repository. The data include many details about each country's international codes and geographic regions.

### Unique and Value Counts <a class="anchor" id="section_5"></a>

The descriptive statistics in the examples above mainly cover the numerical values in our dataset, but a dataset can also contain non-numerical columns such as free text and categories.

For such columns, we may want to know which distinct values appear and how often each one occurs.

**unique() Method**

The Pandas [unique()](https://pandas.pydata.org/docs/reference/api/pandas.unique.html) method returns the unique values in order of appearance; it does not sort them.

In [14]:
# Read the country codes dataset from GitHub repository


# Display DataFrame head


The `Region Name` column appears to be a text column that holds the geographical region of each country. To find the distinct region values and how often each one occurs, we can apply the `unique()` and `value_counts()` methods to the column, which is a Pandas Series:

In [15]:
# Display unique individual region names
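
A minimal sketch of `unique()` on a hypothetical Series is shown below. The region values are made up for illustration and are not read from the country codes dataset.

```python
import pandas as pd

# Hypothetical stand-in for the `Region Name` column (made-up values)
regions = pd.Series(
    ["Asia", "Europe", "Africa", "Europe", "Asia", "Africa", "Africa"],
    name="Region Name",
)

# Distinct values, in the order they first appear (not sorted)
unique_regions = regions.unique()
print(unique_regions)

# Number of distinct values
n_unique = len(unique_regions)
print(n_unique)
```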


**value_counts() Method**

The [value_counts()](https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html) method returns a Series containing the count of each unique value.

The resulting object is sorted in descending order, so the first element is the most frequently occurring value. NA values are excluded by default.

In [16]:
# Display the number of individual region names
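
The behaviour described above can be sketched on a hypothetical Series with made-up region values and one missing entry. Passing `dropna=False` is one way to include the missing values in the counts.

```python
import pandas as pd

# Hypothetical stand-in for the `Region Name` column (made-up values),
# with one missing entry
regions = pd.Series(
    ["Asia", "Europe", "Africa", "Europe", None, "Africa", "Africa"],
    name="Region Name",
)

# Counts per value, most frequent first; missing values are excluded
counts = regions.value_counts()
print(counts)

# Include missing values in the counts
counts_with_missing = regions.value_counts(dropna=False)
print(counts_with_missing)
```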


In this lesson, we have learned some quick commands that will help you investigate your DataFrame.

To learn more about other useful commands, stay tuned!