# Data ingestion & inspection

In [1]:
import utils

## Intro to pandas DataFrames

### Inspecting Data

We can use the DataFrame methods `.head()` and `.tail()` to view the first and last few rows of a DataFrame. pandas has been imported as `pd`, and urban population data from 1960 to 2018 has been loaded as the DataFrame `urban_population` (accessed in the cells below as `utils.urban_population`). This dataset was obtained from the [World Bank](https://databank.worldbank.org/reports.aspx?source=2&type=metadata&series=SP.URB.TOTL.IN.ZS#).

Let's use `urban_population.head()` and `urban_population.tail()` to verify that the first and last rows match a file on disk. In later exercises, we will see how to extract values from DataFrames with indexing.

In [2]:
# First 5 rows
utils.urban_population.head()

| | Year | Country Name | Country Code | Urban population (% of total) | Total Population |
|---|---|---|---|---|---|
| 0 | 1960 | Afghanistan | AFG | 8.401 | 8996973.0 |
| 1 | 1960 | Albania | ALB | 30.705 | 1608800.0 |
| 2 | 1960 | Algeria | DZA | 30.51 | 11057863.0 |
| 3 | 1960 | American Samoa | ASM | 66.211 | 20123.0 |
| 4 | 1960 | Andorra | AND | 58.45 | 13411.0 |


In [3]:
# Last 5 rows
utils.urban_population.tail()

| | Year | Country Name | Country Code | Urban population (% of total) | Total Population |
|---|---|---|---|---|---|
| 15571 | 2018 | Sub-Saharan Africa | SSF | 40.176823 | 1078307000.0 |
| 15572 | 2018 | Sub-Saharan Africa (excluding high income) | SSA | 40.175341 | 1078210000.0 |
| 15573 | 2018 | Sub-Saharan Africa (IDA & IBRD countries) | TSS | 40.176823 | 1078307000.0 |
| 15574 | 2018 | Upper middle income | UMC | 66.233368 | 2655636000.0 |
| 15575 | 2018 | World | WLD | 55.270579 | 7594270000.0 |
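
Both methods accept an optional row count `n`, which defaults to 5. Here is a minimal sketch on a small made-up DataFrame. The name `toy` and its values are hypothetical stand-ins, not the World Bank data:

```python
import pandas as pd

# Hypothetical stand-in for urban_population; the values are illustrative only
toy = pd.DataFrame({
    "Year": [1960, 1970, 1980, 1990, 2000, 2010],
    "Value": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

first_two = toy.head(2)     # first 2 rows
last_two = toy.tail(2)      # last 2 rows
default_head = toy.head()   # n defaults to 5

print(first_two)
print(last_two)
```

Passing a small `n` is handy when you only need to compare a row or two against the source file.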


### DataFrame data types

pandas tracks the data type of each column in a DataFrame. It also recognizes null values such as `NaN` ('Not-a-Number'), which often indicate missing data.

We can use `urban_population.info()` to see the count of non-null entries in each column. Subtracting that count from the total number of rows gives the number of null entries, which likely indicate missing data.

In [4]:
utils.urban_population.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15576 entries, 0 to 15575
Data columns (total 5 columns):
Year                             15576 non-null int64
Country Name                     15576 non-null object
Country Code                     15576 non-null object
Urban population (% of total)    15332 non-null float64
Total Population                 15409 non-null float64
dtypes: float64(2), int64(1), object(2)
memory usage: 608.6+ KB
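
From the output above, the implied null counts are 15576 - 15332 = 244 for `Urban population (% of total)` and 15576 - 15409 = 167 for `Total Population`. pandas can also count nulls directly with `.isna().sum()`. The sketch below uses a small made-up DataFrame (`toy` and its values are hypothetical) to show that both approaches agree:

```python
import numpy as np
import pandas as pd

# Hypothetical data with a few missing values; not the World Bank dataset
toy = pd.DataFrame({
    "Year": [1960, 1960, 1970, 1970],
    "Urban population (% of total)": [8.4, np.nan, 30.5, np.nan],
    "Total Population": [8996973.0, 1608800.0, np.nan, 20123.0],
})

# Direct count of null entries per column
null_counts = toy.isna().sum()

# Same result inferred the way .info() suggests: rows minus non-null entries
inferred = len(toy) - toy.count()

print(null_counts)
```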


### NumPy and pandas working together

pandas depends on and interoperates with NumPy, the Python library for fast numeric array computations. For example, you can use the DataFrame attribute `.values` to get the contents of a DataFrame `df` as a NumPy array. You can also pass pandas data structures to NumPy functions.

In this example, we have loaded world population data every 10 years since 1960 into the DataFrame `world_population`. This dataset was derived from the one used in the previous exercise.

In [5]:
# Import numpy
import numpy as np

In [6]:
# Create array of DataFrame values: np_vals
np_vals = utils.world_population.values

In [7]:
# Create new array of base 10 logarithm values: np_vals_log10
np_vals_log10 = np.log10(np_vals)

In [8]:
# Create a new DataFrame by passing the DataFrame to np.log10(): df_log10
df_log10 = np.log10(utils.world_population)

In [9]:
# Print the type of each original and new data container
names = ['np_vals', 'np_vals_log10', 'utils.world_population', 'df_log10']
for x in names:
    print(x, 'has type', type(eval(x)))

np_vals has type <class 'numpy.ndarray'>
np_vals_log10 has type <class 'numpy.ndarray'>
utils.world_population has type <class 'pandas.core.frame.DataFrame'>
df_log10 has type <class 'pandas.core.frame.DataFrame'>
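
Because `utils` is a helper module that is not shown here, the same round trip can be sketched on a small made-up DataFrame. The name `toy` and its values are hypothetical. The sketch uses `.to_numpy()`, which recent pandas versions offer as an alternative to `.values`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for world_population; the values are illustrative only
toy = pd.DataFrame(
    {"Total Population": [1.0e9, 1.0e10]},
    index=pd.Index([1960, 1970], name="Year"),
)

arr = toy.to_numpy()        # NumPy array with the same data as toy.values
arr_log10 = np.log10(arr)   # applying a NumPy function to an array gives an array
toy_log10 = np.log10(toy)   # applying it to a DataFrame gives a DataFrame

print(type(arr_log10))
print(type(toy_log10))
```

Passing the DataFrame itself keeps the index and column labels, which the plain array drops.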

## Building DataFrames from scratch

### Zip lists to build a DataFrame

### Labeling data

### Building DataFrames with broadcasting

## Importing & exporting data

### Reading a flat file

### Delimiters, headers, and extensions

## Plotting with pandas

### Plotting series using pandas

### Plotting DataFrames

# Exploratory data analysis

## Visual exploratory data analysis

### pandas line plots

### pandas scatter plots

### pandas box plots

### pandas hist, pdf and cdf

## Statistical exploratory data analysis

### Fuel efficiency

### Bachelor's degrees awarded to women

### Median vs mean

### Quantiles

### Standard deviation of temperature

## Separating populations

### Filtering and counting

### Separate and summarize

### Separate and plot