## 1.2 LOADING YOUR FIRST DATA SET

### About the Gapminder Data Set

The Gapminder data set originally comes from www.gapminder.org. The version of the Gapminder data used in this book was prepared by Jennifer Bryan from the University of British Columbia. The repository can be found at: www.github.com/jennybc/gapminder.

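Search for gapminder.tsv, open the file on GitHub in raw view, then save the data as a .tsv file in C:\Users\Chak\data. The code below reads it through the relative path data/gapminder.tsv, which assumes the working directory is C:\Users\Chak.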

In [3]:
import pandas as pd

In [8]:
df = pd.read_csv('data/gapminder.tsv', sep='\t')

In [9]:
print(df.head())

       country continent  year  lifeExp       pop   gdpPercap
0  Afghanistan      Asia  1952   28.801   8425333  779.445314
1  Afghanistan      Asia  1957   30.332   9240934  820.853030
2  Afghanistan      Asia  1962   31.997  10267083  853.100710
3  Afghanistan      Asia  1967   34.020  11537966  836.197138
4  Afghanistan      Asia  1972   36.088  13079460  739.981106
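To see what the sep='\t' argument does without depending on the file on disk, here is a minimal sketch that reads a tiny tab-separated sample from memory. The sample_tsv string and the names sample_df and wrong_df are made up for illustration; only the first two rows shown above are reproduced.

```python
import io

import pandas as pd

# Hypothetical in-memory sample: the first two rows shown above, tab-separated
sample_tsv = (
    "country\tcontinent\tyear\tlifeExp\tpop\tgdpPercap\n"
    "Afghanistan\tAsia\t1952\t28.801\t8425333\t779.445314\n"
    "Afghanistan\tAsia\t1957\t30.332\t9240934\t820.853030\n"
)

# sep='\t' tells read_csv to split fields on tabs instead of commas
sample_df = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

# Without sep, the default comma delimiter leaves each line as one field
wrong_df = pd.read_csv(io.StringIO(sample_tsv))

print(sample_df.shape)  # six columns, as expected
print(wrong_df.shape)   # a single column holding the whole line
```

If a freshly loaded file shows up with only one column, a wrong delimiter is a likely cause.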


We can check whether we are working with a Pandas DataFrame by using the built-in type function (i.e., it comes directly from Python, not any package such as Pandas). 
The type function is handy when you begin working with many different types of Python objects and need to know which object you are currently working on.

In [10]:
print(type(df))

<class 'pandas.core.frame.DataFrame'>
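When the type needs to be checked inside code rather than read off the screen, the built-in isinstance function is a common companion to type. The sketch below uses a small hand-built DataFrame (small_df is a hypothetical stand-in for the Gapminder data).

```python
import pandas as pd

# Hypothetical stand-in for the Gapminder DataFrame
small_df = pd.DataFrame({"country": ["Afghanistan"], "year": [1952]})

# type() reports the exact class of the object
print(type(small_df))

# isinstance() answers a yes/no question, which is handy in conditions
is_frame = isinstance(small_df, pd.DataFrame)
print(is_frame)

# Selecting a single column gives a different kind of object, a Series
country_col = small_df["country"]
print(type(country_col))
```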


The data set we loaded is currently saved as a Pandas DataFrame object and is relatively small. Every DataFrame object has a shape attribute that will give us the number of rows and columns of the DataFrame.

In [11]:
# get the number of rows and columns

print(df.shape)

(1704, 6)


The shape attribute returns a tuple (Appendix J) in which the first value is the number of rows and the second value is the number of columns. From the preceding results, we see that our Gapminder data set has 1704 rows and 6 columns.
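Because shape is an ordinary Python tuple, it can be unpacked into two variables. A minimal sketch, using a hypothetical three-row DataFrame (small_df, n_rows and n_cols are illustrative names):

```python
import pandas as pd

# Hypothetical stand-in for the Gapminder DataFrame
small_df = pd.DataFrame(
    {
        "country": ["Afghanistan", "Afghanistan", "Afghanistan"],
        "year": [1952, 1957, 1962],
    }
)

# Tuple unpacking: first value is rows, second value is columns
n_rows, n_cols = small_df.shape
print(n_rows, n_cols)

# len() on a DataFrame gives the number of rows, matching shape[0]
print(len(small_df) == small_df.shape[0])
```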

Since shape is an attribute of the DataFrame, not a function or method, it is not followed by parentheses. If you mistakenly put parentheses after shape, Python tries to call the tuple that shape returns and raises an error.

In [12]:
# shape is an attribute, not a method

# this will cause an error

print(df.shape())

TypeError: 'tuple' object is not callable
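One way to tell attributes from methods before calling anything is the built-in callable function. The sketch below also catches the error so the script keeps running; small_df and message are hypothetical names used only for this illustration.

```python
import pandas as pd

# Hypothetical stand-in for the Gapminder DataFrame
small_df = pd.DataFrame({"year": [1952, 1957]})

# shape evaluates to a tuple, which cannot be called
shape_is_callable = callable(small_df.shape)
# head is a method, so it can be called
head_is_callable = callable(small_df.head)
print(shape_is_callable, head_is_callable)

# Calling shape raises TypeError; catching it lets us inspect the message
try:
    small_df.shape()
except TypeError as err:
    message = str(err)
    print(message)
```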

Typically, when first looking at a data set, we want to know how many rows and columns there are (we just did that). To get the gist of which information it contains, we look at the columns. Like shape, the column names are accessed through an attribute of the DataFrame object, in this case columns.

In [13]:
# get column names

print(df.columns)

Index(['country', 'continent', 'year', 'lifeExp', 'pop', 'gdpPercap'], dtype='object')
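The columns attribute is a Pandas Index object rather than a plain list. A short sketch of two common follow-ups, converting it to a list and testing whether a name is present (small_df and column_names are hypothetical names):

```python
import pandas as pd

# Hypothetical stand-in with three of the Gapminder columns
small_df = pd.DataFrame(
    {"country": ["Afghanistan"], "year": [1952], "lifeExp": [28.801]}
)

# Convert the Index of column names to an ordinary Python list
column_names = list(small_df.columns)
print(column_names)

# Membership tests work directly on the Index
has_year = "year" in small_df.columns
has_gdp = "gdpPercap" in small_df.columns
print(has_year, has_gdp)
```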


The Pandas DataFrame object is similar to the DataFrame-like objects found in other languages (e.g., Julia and R). All values within a column (Series) have to be of the same type, whereas each row can contain mixed types. In our current example, we can expect the country column to be all strings and the year column to be all integers. However, it’s best to make sure that is the case by using the dtypes attribute or the info method. Table 1.1 compares the types in Pandas to the types in native Python.

In [14]:
# get the dtype of each column

print(df.dtypes)

country       object
continent     object
year           int64
lifeExp      float64
pop            int64
gdpPercap    float64
dtype: object
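Beyond printing dtypes, the expectation that year holds integers can be checked in code. This is a sketch under assumptions: small_df is a hypothetical three-column stand-in, and the helper functions come from pandas.api.types.

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype

# Hypothetical stand-in with three of the Gapminder columns
small_df = pd.DataFrame(
    {
        "country": ["Afghanistan", "Afghanistan"],
        "year": [1952, 1957],
        "lifeExp": [28.801, 30.332],
    }
)

# Check individual columns against the types we expect
year_is_int = is_integer_dtype(small_df["year"])
life_is_float = is_float_dtype(small_df["lifeExp"])
print(year_is_int, life_is_float)

# Keep only the numeric columns and list their names
numeric_columns = list(small_df.select_dtypes(include="number").columns)
print(numeric_columns)
```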


In [15]:
# get more information about our data

print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 6 columns):
country      1704 non-null object
continent    1704 non-null object
year         1704 non-null int64
lifeExp      1704 non-null float64
pop          1704 non-null int64
gdpPercap    1704 non-null float64
dtypes: float64(2), int64(2), object(2)
memory usage: 80.0+ KB
None
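The trailing None in the output above appears because info writes its report directly and returns None, which print then displays. A minimal sketch of calling it without print, and of capturing the report as a string through the buf parameter (small_df, result, buffer and report are hypothetical names):

```python
import io

import pandas as pd

# Hypothetical stand-in for the Gapminder DataFrame
small_df = pd.DataFrame({"country": ["Afghanistan", "Afghanistan"], "year": [1952, 1957]})

# info() prints the report itself; its return value is None
result = small_df.info()
print(result is None)

# To keep the report as text, send it to an in-memory buffer instead
buffer = io.StringIO()
small_df.info(buf=buffer)
report = buffer.getvalue()
print(report.splitlines()[0])
```

Calling df.info() on its own line, without print, avoids the extra None.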
