# <center> Exploring Data Objects </center>

- [Head and Tail Methods](#section_1)
- [The info() Method](#section_2)
- [Shape and Size Attributes](#section_3)
- [Descriptive Statistics](#section_4)
- [Unique and Value Counts](#section_5)
<hr>

### Head and Tail Methods <a class="anchor" id="section_1"></a>

[`head()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html) and [`tail()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.tail.html) are widely used methods that display the first and last rows of Pandas data objects. By default, each returns five rows.

In [1]:
# Import Pandas Library
import pandas as pd

In the example below, we will create a DataFrame using the [alcohol consumption](https://raw.githubusercontent.com/fivethirtyeight/data/master/alcohol-consumption/drinks.csv) dataset that we used earlier in the course.

This dataset has 193 rows and 5 columns. 

In [2]:
# Read dataset from GitHub repository
alcohol_data = pd.read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/alcohol-consumption/drinks.csv")

# Display DataFrame
alcohol_data

Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol
0,Afghanistan,0,0,0,0.0
1,Albania,89,132,54,4.9
2,Algeria,25,0,14,0.7
3,Andorra,245,138,312,12.4
4,Angola,217,57,45,5.9
...,...,...,...,...,...
188,Venezuela,333,100,3,7.7
189,Vietnam,111,2,1,2.0
190,Yemen,6,0,0,0.1
191,Zambia,32,19,4,2.5


In [3]:
# Display top DataFrame rows
alcohol_data.head()

Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol
0,Afghanistan,0,0,0,0.0
1,Albania,89,132,54,4.9
2,Algeria,25,0,14,0.7
3,Andorra,245,138,312,12.4
4,Angola,217,57,45,5.9


In [4]:
# Display bottom DataFrame rows
alcohol_data.tail()

Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol
188,Venezuela,333,100,3,7.7
189,Vietnam,111,2,1,2.0
190,Yemen,6,0,0,0.1
191,Zambia,32,19,4,2.5
192,Zimbabwe,64,18,4,4.7


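We can override the default of five rows by passing the number of records we want to display, as in the two examples below. Note that a notebook cell displays only the result of its last expression, which is why only the `tail(8)` output appears under the cell.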

In [5]:
# Display top 8 DataFrame rows
alcohol_data.head(8)

# Display bottom 8 DataFrame rows
alcohol_data.tail(8)

Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol
185,Uruguay,115,35,220,6.6
186,Uzbekistan,25,101,8,2.4
187,Vanuatu,21,18,11,0.9
188,Venezuela,333,100,3,7.7
189,Vietnam,111,2,1,2.0
190,Yemen,6,0,0,0.1
191,Zambia,32,19,4,2.5
192,Zimbabwe,64,18,4,4.7


In summary, these two methods are mainly used to take a quick look at the data and verify that we have loaded the correct dataset.
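Both methods also accept a negative number, which can be handy when we want everything except a few rows at one end. The sketch below uses a small made-up DataFrame (the `toy` name and its values are assumptions for illustration, not part of the alcohol dataset) so that it runs without a network connection.

```python
import pandas as pd

# Hypothetical toy data standing in for the alcohol dataset
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E", "F"],
    "beer_servings": [0, 89, 25, 245, 217, 102],
})

first_two = toy.head(2)            # first 2 rows
last_two = toy.tail(2)             # last 2 rows
all_but_last_two = toy.head(-2)    # every row except the last 2
all_but_first_two = toy.tail(-2)   # every row except the first 2

print(first_two)
print(all_but_last_two)
```

The positive form answers "show me a few rows", while the negative form answers "show me all rows but a few".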

### The info() Method <a class="anchor" id="section_2"></a>

The [info()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html) method is designed to give us a high-level summary about our DataFrame objects. 

Let's apply the `info()` method to our alcohol DataFrame below. 

In [6]:
# Display summary of the DataFrame columns
alcohol_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193 entries, 0 to 192
Data columns (total 5 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   country                       193 non-null    object 
 1   beer_servings                 193 non-null    int64  
 2   spirit_servings               193 non-null    int64  
 3   wine_servings                 193 non-null    int64  
 4   total_litres_of_pure_alcohol  193 non-null    float64
dtypes: float64(1), int64(3), object(1)
memory usage: 7.7+ KB


The output first reports the number of records in the DataFrame and the range of the numerical index that Pandas assigned automatically. It then shows the total number of columns (5 in our dataset) and lists each column name with its data type and the number of non-null values it contains.

In this dataset, there appear to be no missing values, since every column's non-null count equals the number of records. The country column has the object data type, which Pandas uses for text values; the three servings columns (beer_servings, spirit_servings, and wine_servings) have the int64 data type, which represents integers; and the total_litres_of_pure_alcohol column has the float64 data type, which represents real numbers.

At this stage, we have an idea about what changes we need to make in order to have the correct data types. For example, numerical data types such as int64 allow us to apply mathematical calculations on the values while object data type allows us to apply text formatting functions. In the next section about data cleaning, we will learn how to change data types.

Finally, the method reports how many columns there are of each data type and the memory size of the DataFrame. The memory size can be useful when working with a large DataFrame that you may wish to optimize.
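The individual pieces of the `info()` summary can also be retrieved one at a time. The following is a minimal sketch on a made-up three-row DataFrame (the `toy` data is an assumption for illustration). It also shows `memory_usage(deep=True)`, which measures the actual contents of text columns; the `+` in `7.7+ KB` above indicates that the reported figure is a lower bound because text columns were not inspected in depth.

```python
import pandas as pd

# Hypothetical toy data with one text, one integer and one float column
toy = pd.DataFrame({
    "country": ["Afghanistan", "Albania", "Algeria"],
    "beer_servings": [0, 89, 25],
    "total_litres_of_pure_alcohol": [0.0, 4.9, 0.7],
})

column_types = toy.dtypes                        # data type of each column
missing_per_column = toy.isna().sum()            # number of null values per column
shallow_bytes = toy.memory_usage().sum()         # quick estimate
deep_bytes = toy.memory_usage(deep=True).sum()   # inspects text values as well

print(column_types)
print(missing_per_column)
print(shallow_bytes, deep_bytes)
```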

### Shape and Size Attributes <a class="anchor" id="section_3"></a>

The [`shape`](https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.DataFrame.shape.html) and [`size`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.size.html) attributes are used to identify the dimensions of a DataFrame object and to count its elements. Because they are attributes rather than methods, they are written without parentheses.

In [7]:
# Number of elements (rows x columns) in the alcohol_data DataFrame
alcohol_data.size

965

In [8]:
# Number of elements in the alcohol_data['country'] Series
alcohol_data['country'].size

193

In [9]:
# Check the dimension of alcohol_data DataFrame
alcohol_data.shape

(193, 5)

In [10]:
# Print items generated from the shape attribute
v0, v1 = alcohol_data.shape

f"There are {v0} records, and {v1} columns"

'There are 193 records, and 5 columns'
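The relationship between these attributes can be checked directly: `size` equals the number of rows multiplied by the number of columns, and `len()` returns the number of rows. The sketch below uses a small made-up DataFrame (an assumption for illustration) rather than the alcohol dataset.

```python
import pandas as pd

# Hypothetical toy data: 3 rows, 2 columns
toy = pd.DataFrame({
    "country": ["Afghanistan", "Albania", "Algeria"],
    "beer_servings": [0, 89, 25],
})

n_rows, n_cols = toy.shape

print(toy.size == n_rows * n_cols)   # size is rows multiplied by columns
print(len(toy) == n_rows)            # len() counts rows only
print(toy["country"].shape)          # a Series has a one-element shape tuple
```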

### Descriptive Statistics <a class="anchor" id="section_4"></a>

[`describe()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html) is a DataFrame method that provides descriptive statistics summarizing the central tendency, dispersion, and shape of a dataset’s distribution, excluding NaN values.

In [11]:
# Display statistical analysis of the DataFrame
alcohol_data.describe()

Unnamed: 0,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol
count,193.0,193.0,193.0,193.0
mean,106.160622,80.994819,49.450777,4.717098
std,101.143103,88.284312,79.697598,3.773298
min,0.0,0.0,0.0,0.0
25%,20.0,4.0,1.0,1.3
50%,76.0,56.0,8.0,4.2
75%,188.0,128.0,59.0,7.2
max,376.0,438.0,370.0,14.4


From the example above, we notice that [`describe()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html) was applied only to the numerical columns and the country column was ignored. This is because, by default, `describe()` summarizes only the numerical columns of a DataFrame that contains them; statistics such as the mean and standard deviation are not defined for text values.
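Text columns can still be summarized, either by calling `describe()` on the column itself or by passing `include="all"`. The sketch below is a minimal illustration on made-up data (the `toy` DataFrame and its `region` values are assumptions); for text, the summary reports the count, the number of unique values, the most frequent value, and its frequency.

```python
import pandas as pd

# Hypothetical toy data with one text column and one numerical column
toy = pd.DataFrame({
    "region": ["Asia", "Europe", "Asia", "Africa", "Asia"],
    "beer_servings": [0, 89, 25, 245, 217],
})

# Summary of a single text column: count, unique, top, freq
summary = toy["region"].describe()

# Summary of every column; statistics that do not apply are shown as NaN
full = toy.describe(include="all")

print(summary)
print(full)
```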

In addition to the numerical statistical summary, you can also explore the features of text values in DataFrames. For this exercise, we will use the [country codes dataset](https://raw.githubusercontent.com/datasets/country-codes/master/data/country-codes.csv) from the Open Data GitHub repository. The data include many details about each country's international codes and geographic regions.

### Unique and Value Counts <a class="anchor" id="section_5"></a>

The descriptive statistics in the examples above mainly apply to the numerical values in our dataset, but we may also have non-numerical columns such as free text and categories.

For such columns, we may want to know which distinct values they contain and how many times each value occurs.

**unique() Method**

The Pandas [unique()](https://pandas.pydata.org/docs/reference/api/pandas.unique.html) method returns the unique values in order of appearance; it does not sort them.

In [12]:
# Read the country codes dataset from GitHub repository
countries_data = pd.read_csv("https://raw.githubusercontent.com/datasets/country-codes/master/data/country-codes.csv")

# Display DataFrame
countries_data

Unnamed: 0,FIFA,Dial,ISO3166-1-Alpha-3,MARC,is_independent,ISO3166-1-numeric,GAUL,FIPS,WMO,ISO3166-1-Alpha-2,...,Sub-region Name,official_name_ru,Global Name,Capital,Continent,TLD,Languages,Geoname ID,CLDR display name,EDGAR
0,TPE,886,TWN,ch,Yes,158.0,925,TW,,TW,...,,,,Taipei,AS,.tw,"zh-TW,zh,nan,hak",1668284.0,Taiwan,
1,AFG,93,AFG,af,Yes,4.0,1,AF,AF,AF,...,Southern Asia,Афганистан,World,Kabul,AS,.af,"fa-AF,ps,uz-AF,tk",1149361.0,Afghanistan,B2
2,ALB,355,ALB,aa,Yes,8.0,3,AL,AB,AL,...,Southern Europe,Албания,World,Tirana,EU,.al,"sq,el",783754.0,Albania,B3
3,ALG,213,DZA,ae,Yes,12.0,4,AG,AL,DZ,...,Northern Africa,Алжир,World,Algiers,AF,.dz,ar-DZ,2589581.0,Algeria,B4
4,ASA,1-684,ASM,as,Territory of US,16.0,5,AQ,,AS,...,Polynesia,Американское Самоа,World,Pago Pago,OC,.as,"en-AS,sm,to",5880801.0,American Samoa,B5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
245,SAH,212,ESH,ss,In contention,732.0,268,WI,,EH,...,Northern Africa,Западная Сахара,World,El-Aaiun,AF,.eh,"ar,mey",2461445.0,Western Sahara,U5
246,YEM,967,YEM,ye,Yes,887.0,269,YM,YE,YE,...,Western Asia,Йемен,World,Sanaa,AS,.ye,ar-YE,69543.0,Yemen,T7
247,ZAM,260,ZMB,za,Yes,894.0,270,ZA,ZB,ZM,...,Sub-Saharan Africa,Замбия,World,Lusaka,AF,.zm,"en-ZM,bem,loz,lun,lue,ny,toi",895949.0,Zambia,Y4
248,ZIM,263,ZWE,rh,Yes,716.0,271,ZI,ZW,ZW,...,Sub-Saharan Africa,Зимбабве,World,Harare,AF,.zw,"en-ZW,sn,nr,nd",878675.0,Zimbabwe,Y5


The column `Region Name` appears to be a text column that holds the geographical region of each country. To find the distinct region values and how often each one occurs, we can apply the `unique()` and `value_counts()` methods to the Series, as shown below:

In [13]:
# Display unique individual region names
countries_data['Region Name'].unique()

array([nan, 'Asia', 'Europe', 'Africa', 'Oceania', 'Americas'],
      dtype=object)
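Note that the result above includes `nan`, because `unique()` keeps missing values. When only the number of distinct values is needed, `nunique()` can be used instead; it ignores missing values unless told otherwise. The sketch below uses a short made-up Series (an assumption for illustration) in place of the `Region Name` column.

```python
import pandas as pd

# Hypothetical toy Series with one missing value
regions = pd.Series(["Asia", None, "Europe", "Asia", "Africa"])

distinct = regions.unique()                      # distinct values in order of appearance, missing value included
n_distinct = regions.nunique()                   # number of distinct values, missing value excluded
n_with_missing = regions.nunique(dropna=False)   # missing value counted as its own value

print(distinct)
print(n_distinct, n_with_missing)
```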

**value_counts() Method**

The [value_counts()](https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html) method returns a Series containing the count of each unique value.

The resulting object is sorted in descending order, so the first element is the most frequently occurring one. NA values are excluded by default.

In [14]:
# Display how many times each region name occurs
countries_data['Region Name'].value_counts()

Africa      60
Americas    57
Europe      52
Asia        50
Oceania     29
Name: Region Name, dtype: int64
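Two optional parameters are often useful here: `dropna=False` adds a row for missing values, and `normalize=True` returns proportions instead of counts. The following is a minimal sketch on a short made-up Series (an assumption for illustration), not on the country codes dataset.

```python
import pandas as pd

# Hypothetical toy Series with one missing value
regions = pd.Series(["Asia", "Europe", "Asia", None, "Africa", "Asia", "Europe"])

counts = regions.value_counts()                           # missing value excluded
counts_with_missing = regions.value_counts(dropna=False)  # missing value gets its own row
shares = regions.value_counts(normalize=True)             # proportions of the non-missing values

print(counts)
print(counts_with_missing)
print(shares)
```

Including the missing values in the count is a quick way to see how complete a text column is before cleaning it.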

In this lesson, we have learned some quick commands that will help you investigate your DataFrame.

To learn more about other useful commands, stay tuned!