In [1]:
import pandas as pd

# Introduction to Pandas

## What is Pandas? 

A very widely used Python library for data analysis
   - its data structures are built on NumPy
   - its plotting relies on Matplotlib
   

## What are the data structures in Pandas?

### Series

A one-dimensional labelled array of any datatype
   - It is possible to have mixed datatypes but this should be avoided.
   - Labels are called the **index**

#### Creating Series

In [2]:
# From tuple
my_tuple = (1, 3, 5, 7, 12)
my_tuple_series = pd.Series(my_tuple)

In [3]:
my_tuple_series

0     1
1     3
2     5
3     7
4    12
dtype: int64

In [4]:
# From list
my_list = [1, 3, 5, 7, 12]
my_list[0] = 2  # lists are mutable, so an entry can be changed before building the Series
pd.Series(my_list)

0     2
1     3
2     5
3     7
4    12
dtype: int64

In [5]:
# From dictionary
my_dictionary = {'names':['Bob', 'Alice'], 'ages':[25, 34]}
my_dictionary

{'names': ['Bob', 'Alice'], 'ages': [25, 34]}

In [6]:
pd.Series(my_dictionary)

names    [Bob, Alice]
ages         [25, 34]
dtype: object
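
In the dictionary above every value is a list, so each Series entry holds a whole list and the dtype is `object`. A minimal sketch of the more common case, assuming a dictionary with one scalar value per key (the `ages` dictionary is a hypothetical example, not part of the notebook):

```python
import pandas as pd

# Hypothetical dictionary: one scalar value per key
ages = {"Bob": 25, "Alice": 34}

# The keys become the index labels, the values become the data
ages_series = pd.Series(ages)

print(ages_series)
print(ages_series["Alice"])  # look up a value by its label
```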

### Index and RangeIndex

The index is the set of labels attached to the data in a Series. It is generated automatically if it is not passed explicitly.

The automatic index is a range of integers from 0 to len(Series)-1, called a RangeIndex.

An index can also be passed explicitly, e.g. as a list of the same length as the Series. Pandas assigns the first label to the first entry, and so on.
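
A short sketch of both cases, assuming a made-up list of letter labels (the `labels` list is only for illustration):

```python
import pandas as pd

values = [1, 3, 5, 7, 12]
labels = ["a", "b", "c", "d", "e"]  # hypothetical labels, one per value

default_series = pd.Series(values)                  # automatic RangeIndex
labelled_series = pd.Series(values, index=labels)   # explicit index

print(default_series.index)
print(labelled_series.index)
print(labelled_series["c"])  # value stored under the label "c"
```

Passing a list whose length differs from the data raises a `ValueError`, so the labels and values always have to line up.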

In [109]:
# pandas.RangeIndex takes start, stop and step (step can be negative)
my_list = [1, 3, 5, 7, 12]
my_list_series = pd.Series(my_list)

In [111]:
my_list_series.index

RangeIndex(start=0, stop=5, step=1)

In [113]:
list(pd.RangeIndex(10, -10, step=-1))

[10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, -1, -2, -3, -4, -5, -6, -7, -8, -9]

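A RangeIndex is an **immutable object**: it cannot be changed once it has been created. If we want a different index, we have to create a new one and assign it in place of the old one. Pandas makes indexes immutable so that they can be shared safely between objects, and a RangeIndex only stores start, stop and step instead of every label, which saves memory.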
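
A minimal sketch of what immutability means in practice: changing a single label fails, while replacing the whole index works. The numbers used for the new index are arbitrary.

```python
import pandas as pd

s = pd.Series([1, 3, 5, 7, 12])

try:
    s.index[0] = 10  # item assignment on an Index raises TypeError
except TypeError as err:
    print("TypeError:", err)

# Replacing the whole index with a new object is allowed
s.index = pd.RangeIndex(start=10, stop=15)
print(s.index)
```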

### DataFrames

The main data structure in Pandas: a two-dimensional table with a labelled index (rows) and labelled columns.


In [115]:
# A DataFrame can be made from a dictionary (the keys become the column names):
pd.DataFrame(my_dictionary)

Unnamed: 0,names,ages
0,Bob,25
1,Alice,34


In [117]:
# From 2D nested lists (each inner list becomes one row):
data = [['Alice', 'Bob', 'Charlie'], [12, 14, 13]]
pd.DataFrame(data)

Unnamed: 0,0,1,2
0,Alice,Bob,Charlie
1,12,14,13
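
Because each inner list becomes a row, the result above has the names in one row and the numbers in another. A possible way to get one row per person instead, assuming the numbers are ages and using hypothetical column names `name` and `age`:

```python
import pandas as pd

names = ["Alice", "Bob", "Charlie"]
ages = [12, 14, 13]  # assumed to be ages, for illustration only

# Build one [name, age] pair per person so that each pair becomes a row
rows = [list(pair) for pair in zip(names, ages)]
people = pd.DataFrame(rows, columns=["name", "age"])

print(people)
print(people.shape)
```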


In [118]:
!pwd

/Users/spiced/testing/week_01


In [124]:
# Import from a CSV file, using the first column as the index:

df = pd.read_csv('large_countries_2015.csv', index_col=0)

In [125]:
df

Unnamed: 0,population,fertility,continent
Bangladesh,160995600.0,2.12,Asia
Brazil,207847500.0,1.78,South America
China,1376049000.0,1.57,Asia
India,1311051000.0,2.43,Asia
Indonesia,257563800.0,2.28,Asia
Japan,126573500.0,1.45,Asia
Mexico,127017200.0,2.13,North America
Nigeria,182202000.0,5.89,Africa
Pakistan,188924900.0,3.04,Asia
Philippines,100699400.0,2.98,Asia
Russia,143456900.0,1.61,Europe
United States,321773600.0,1.97,North America


In [126]:
# Export to a CSV file:
df.to_csv('my_csv.csv')
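
A self-contained sketch of the export and import round trip. It writes to an in-memory buffer (`io.StringIO`) instead of a file so that it runs without `large_countries_2015.csv`. The small `df_small` table reuses two rows from the data above.

```python
import io

import pandas as pd

df_small = pd.DataFrame(
    {"population": [126573500.0, 127017200.0], "fertility": [1.45, 2.13]},
    index=["Japan", "Mexico"],
)

buffer = io.StringIO()
df_small.to_csv(buffer)  # the index is written as the first column
buffer.seek(0)           # rewind so the buffer can be read from the start

# index_col=0 turns the first column back into the index
restored = pd.read_csv(buffer, index_col=0)
print(restored)
```

Without `index_col=0` the country names would come back as an ordinary column and the index would be a new RangeIndex.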

### Important Methods and Attributes of data frames

In [129]:
# Check the .shape attribute:
df.shape

(12, 3)

In [135]:
# Look at the first or last rows using .head() and .tail() (5 rows by default):
df.head(10)

Unnamed: 0,population,fertility,continent
Bangladesh,160995600.0,2.12,Asia
Brazil,207847500.0,1.78,South America
China,1376049000.0,1.57,Asia
India,1311051000.0,2.43,Asia
Indonesia,257563800.0,2.28,Asia
Japan,126573500.0,1.45,Asia
Mexico,127017200.0,2.13,North America
Nigeria,182202000.0,5.89,Africa
Pakistan,188924900.0,3.04,Asia
Philippines,100699400.0,2.98,Asia
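
A small sketch of `.shape`, `.head()` and `.tail()` on a made-up table with 12 rows (`df_demo` is a hypothetical stand-in for the countries data):

```python
import pandas as pd

# Hypothetical table with 12 rows and a single column
df_demo = pd.DataFrame({"value": range(12)})

print(df_demo.shape)    # (number of rows, number of columns)
print(df_demo.head())   # first 5 rows by default
print(df_demo.tail(3))  # last 3 rows
```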


In [133]:
# Select column(s) by name:
df['population']

Bangladesh       1.609956e+08
Brazil           2.078475e+08
China            1.376049e+09
India            1.311051e+09
Indonesia        2.575638e+08
Japan            1.265735e+08
Mexico           1.270172e+08
Nigeria          1.822020e+08
Pakistan         1.889249e+08
Philippines      1.006994e+08
Russia           1.434569e+08
United States    3.217736e+08
Name: population, dtype: float64

In [134]:
# Select row(s) by label using .loc:
df.loc['Bangladesh']

population    160995642.0
fertility            2.12
continent            Asia
Name: Bangladesh, dtype: object

In [137]:
# Select rows or columns by integer position using .iloc:
df.iloc[:,0]

Bangladesh       1.609956e+08
Brazil           2.078475e+08
China            1.376049e+09
India            1.311051e+09
Indonesia        2.575638e+08
Japan            1.265735e+08
Mexico           1.270172e+08
Nigeria          1.822020e+08
Pakistan         1.889249e+08
Philippines      1.006994e+08
Russia           1.434569e+08
United States    3.217736e+08
Name: population, dtype: float64
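
To compare `.loc` and `.iloc` side by side, here is a sketch on a three-row excerpt of the table above (the `countries` name is just for this example):

```python
import pandas as pd

countries = pd.DataFrame(
    {
        "population": [160995600.0, 207847500.0, 1376049000.0],
        "fertility": [2.12, 1.78, 1.57],
        "continent": ["Asia", "South America", "Asia"],
    },
    index=["Bangladesh", "Brazil", "China"],
)

by_label = countries.loc["Brazil", "fertility"]     # row label, column label
by_position = countries.iloc[1, 1]                  # second row, second column
label_slice = countries.loc["Bangladesh":"Brazil"]  # label slices include the end
position_slice = countries.iloc[0:1]                # position slices exclude the end

print(by_label, by_position)
print(len(label_slice), len(position_slice))
```

The different slicing behaviour is easy to trip over: the `.loc` slice returns two rows, the `.iloc` slice only one.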

In [140]:
# Select by logical condition:
df.loc[df['fertility']<2]


Unnamed: 0,population,fertility,continent
Brazil,207847500.0,1.78,South America
China,1376049000.0,1.57,Asia
Japan,126573500.0,1.45,Asia
Russia,143456900.0,1.61,Europe
United States,321773600.0,1.97,North America


In [143]:
df.loc[df['continent']=='Europe']

Unnamed: 0,population,fertility,continent
Russia,143456918.0,1.61,Europe


In [144]:
# Select with .between() (both bounds are included by default):
df.loc[df['fertility'].between(1,2)]

Unnamed: 0,population,fertility,continent
Brazil,207847500.0,1.78,South America
China,1376049000.0,1.57,Asia
Japan,126573500.0,1.45,Asia
Russia,143456900.0,1.61,Europe
United States,321773600.0,1.97,North America
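
Each of the conditions above produces a boolean Series (a "mask") that `.loc` uses to keep the rows marked `True`. A sketch showing that `.between(1, 2)` matches the same rows as two comparisons combined with `&`, using four fertility values from the table:

```python
import pandas as pd

fertility = pd.Series(
    [2.12, 1.78, 1.57, 5.89],
    index=["Bangladesh", "Brazil", "China", "Nigeria"],
    name="fertility",
)

mask_between = fertility.between(1, 2)             # 1 <= x <= 2
mask_manual = (fertility >= 1) & (fertility <= 2)  # same condition spelled out

print(mask_between.equals(mask_manual))
print(fertility[mask_between])
```

The parentheses around each comparison are needed because `&` binds more tightly than `>=` and `<=`.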


In [149]:
df.columns

Index(['population', 'fertility', 'continent'], dtype='object')

### Bonus: Series .name attributes

In [17]:
# .name attribute: None by default, can be set manually. It becomes the column name when the Series is turned into a DataFrame.
# .index.name: the name of the index. It is shown above the index labels when the Series is displayed.
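
A minimal sketch of both attributes, assuming a small Series of ages (the values and names are only for illustration):

```python
import pandas as pd

s = pd.Series([25, 34], index=["Bob", "Alice"])
print(s.name)  # None until a name is set

s.name = "ages"
s.index.name = "names"
print(s)

# The Series name becomes the column name of the resulting DataFrame
frame = s.to_frame()
print(frame.columns.tolist())
print(frame.index.name)
```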

### Bonus: What does `dtype` mean?

 - Python lists store references to objects rather than the objects themselves, so each element has to be looked up at a separate location in memory.
 - NumPy arrays are homogeneous (all entries have the same `dtype`). Numbers are stored with a fixed length (e.g. 64-bit float, 8-bit integer). The data is *contiguous* (stored next to each other in memory), which makes NumPy faster and more space-efficient than lists.
 - If given mixed numeric types, NumPy automatically converts them to a single type (e.g. int to float). This is called *"implicit upcasting"*.
 - Strings have variable length, so in an `object` column they are stored as a *reference to the string in memory* rather than as the string itself.
 - If pandas (which uses NumPy under the hood) sees mixed types that include a string, it stores EVERYTHING as `dtype object`. This makes operations slower and storage less efficient, and it is a frequent source of bugs.
 - Tip: Always double-check the dtype, especially when importing data!
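
The points above can be checked directly. A short sketch, with made-up values, showing upcasting, the fallback to `object`, and one possible way (`pd.to_numeric`) to repair a column that should have been numeric:

```python
import pandas as pd

ints = pd.Series([1, 2, 3])
upcast = pd.Series([1, 2.5, 3])   # the ints are upcast to float
mixed = pd.Series([1, "2", 3.0])  # a single string forces the object dtype

print(ints.dtype, upcast.dtype, mixed.dtype)

# One way to repair a column that should be numeric
repaired = pd.to_numeric(mixed)
print(repaired.dtype)
```

For a whole DataFrame, `df.dtypes` lists the dtype of every column, which is a quick check to run right after importing data.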
 

## References:
 - [Pandas documentation](https://pandas.pydata.org/docs/getting_started/intro_tutorials/index.html)
 - [Course Material on Pandas](http://krspiced.pythonanywhere.com/chapters/project_gapminder/introduction_to_pandas.html)