<a href="https://www.bigdatauniversity.com"><img src="https://ibm.box.com/shared/static/jvcqp2iy2jlx2b32rmzdt0tx8lvxgzkp.png" width="300" align="center"></a>
# Data Analysis with Python I 


<hr>

### Table of Contents 
[**Numpy Lib**](#Numpy-Lib)   
[**Pandas Lib**](#Pandas-Lib)  
[**CO2 emissions by Country by Year**](#CO2-emissions-by-Country-by-Year)  
[**Get the Data**](#Get-the-Data)  
[**Import the data using Pandas**](#Import-the-data-using-Pandas)  
[**Dataframe characteristics**](#Dataframe-characteristics)  
[**Subsetting the Dataframe**](#Subsetting-the-Dataframe)  
[**Conditional Subsetting**](#Conditional-Subsetting)  





<hr>

<h1 align="center">NumPy Lib</h1>
__NumPy__: 
- Fast
- Multidimensional Arrays 
- Vectorized Computation


### NumPy ndarray
The ndarray is NumPy's N-dimensional array object.  
It lets you perform mathematical operations on whole blocks of data at once, without writing explicit loops.

In [1]:
# to load a library (a module of Python code), use "import"
import numpy as np

#### Creating ndarrays
- Use the __array__ function.  
- All of the elements must be the __same type__ (homogeneous)

In [2]:
data = np.array([[ 1.9526, -0.246 , -0.8856],
[ 0.5639, 0.2379, 0.9104]])
data

array([[ 1.9526, -0.246 , -0.8856],
       [ 0.5639,  0.2379,  0.9104]])

Every array has a __number of dimensions__ (`ndim`) and a __shape__ (a tuple giving the size of each dimension).

In [3]:
data.ndim

2

In [4]:
data.shape

(2, 3)

__dtype:__ an object describing the data type of the array:


In [5]:
data.dtype

dtype('float64')
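
As a small sketch of how these attributes relate to one another, the example below inspects a three-dimensional array. The name `cube` and its shape are hypothetical and are not part of the lesson's data.

```python
import numpy as np

# Hypothetical example array: 2 blocks, each with 3 rows and 4 columns
cube = np.zeros((2, 3, 4))

print(cube.ndim)   # number of dimensions
print(cube.shape)  # size of each dimension, as a tuple
print(cube.dtype)  # element type (np.zeros defaults to float64)
print(cube.size)   # total number of elements
```

The length of `shape` always equals `ndim`, and the product of its entries equals `size`.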

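__empty__, __arange__, __zeros__ and __ones__ arrays. Note that `np.empty` does not initialize its contents, so the values it shows are whatever happened to be in memory.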

In [6]:
data1 = np.empty((3,2))
data1

array([[ 1.9526,  0.246 ],
       [ 0.8856,  0.5639],
       [ 0.2379,  0.9104]])

In [7]:
data0 = np.zeros(10)
data0

array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.])

In [8]:
data3 = np.arange(10)
data3

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
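
The creation functions accept a few more arguments than the calls above use. The following is a minimal sketch, with hypothetical variable names, of passing a shape tuple, giving `arange` a start and a step, and filling an `empty` array before reading it.

```python
import numpy as np

ones_2d = np.ones((2, 3))      # shape passed as a tuple; dtype defaults to float64
evens = np.arange(2, 10, 2)    # start, stop (exclusive), step
scratch = np.empty((3, 2))     # contents are arbitrary until assigned
scratch[:] = 0.5               # fill every element before reading

print(ones_2d)
print(evens)
print(scratch)
```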

### <span style="color: red">YOUR TURN:</span> 
Create a 2x4 array of ones, then find its data type and shape.

In [13]:
# Your code

data2x4= np.array([[1,1,1,1],[1,1,1,1]])
print(data2x4)
print(data2x4.dtype)
print(data2x4.shape)


[[1 1 1 1]
 [1 1 1 1]]
int64
(2, 4)


#### Data Types

In [15]:
data = np.array([1.25, -9.6, 42], dtype=np.bytes_)  # np.string_ is the older alias, removed in NumPy 2.0
data

array([b'1.25', b'-9.6', b'42'],
      dtype='|S4')

Cast to floating point:

In [16]:
data.astype(np.float32)  # astype returns a new array cast to the given data type

array([  1.25      ,  -9.60000038,  42.        ], dtype=float32)
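
Casting can also lose information. As an illustrative sketch (the arrays here are made up for the example), converting floats to integers drops the fractional part, while numeric strings are parsed into floats:

```python
import numpy as np

floats = np.array([1.9, -2.7, 3.5])
ints = floats.astype(np.int64)        # fractional part is dropped (truncation toward zero)

text = np.array(["1.25", "-9.6", "42"])
parsed = text.astype(np.float64)      # numeric strings are parsed into floats

print(ints)
print(parsed)
print(floats)                         # the original array is unchanged
```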

### Operations

In [18]:
arr = np.array([[1., 2., 3.], [4., 5., 6.]])
arr

array([[ 1.,  2.,  3.],
       [ 4.,  5.,  6.]])

Arithmetic operations

In [19]:
arr*2

array([[  2.,   4.,   6.],
       [  8.,  10.,  12.]])

In [20]:
arr*arr

array([[  1.,   4.,   9.],
       [ 16.,  25.,  36.]])
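
Note that `arr*arr` multiplies element by element; it is not a matrix product. The sketch below compares the vectorized expression with an equivalent pure-Python loop, written out only for comparison.

```python
import numpy as np

arr = np.array([[1., 2., 3.], [4., 5., 6.]])

# Vectorized: one expression, applied element by element
vectorized = arr * arr

# Equivalent explicit loop, for comparison only
looped = np.empty_like(arr)
for i in range(arr.shape[0]):
    for j in range(arr.shape[1]):
        looped[i, j] = arr[i, j] * arr[i, j]

print(np.array_equal(vectorized, looped))
```

The vectorized form is shorter and, on large arrays, usually much faster because the loop runs inside NumPy rather than in Python.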

### Indexing and Slicing
Use indexing and slicing to select individual elements or a subset of your data.

In [22]:
arr = np.arange(10)
arr

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

One-dimensional arrays are indexed and sliced like Python lists:

In [23]:
arr[5]

5

In [25]:
arr[5:8]

array([5, 6, 7])

You can assign a scalar value to a slice; the value is written to every element of the slice:

In [26]:
arr[5:8] = 12
arr

array([ 0,  1,  2,  3,  4, 12, 12, 12,  8,  9])

In [None]:
arr[5:]
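
One difference from Python lists is worth a sketch: a basic slice of an array is a view onto the same data, not a copy. The names `view` and `independent` below are chosen for illustration.

```python
import numpy as np

arr = np.arange(10)

view = arr[5:8]                 # basic slicing returns a view onto arr's data
view[:] = 12                    # writing through the view changes arr as well

independent = arr[5:8].copy()   # .copy() makes a separate array
independent[:] = -1             # arr is not affected by this assignment

print(arr)
print(independent)
```

Use `.copy()` whenever you need to modify a slice without touching the original array.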

In a two-dimensional array, the element at each index is no longer a scalar but a one-dimensional array:

In [27]:
arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
arr2d

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [29]:
arr2d[2]


array([7, 8, 9])

In [30]:
arr2d[0][2]   # also: arr2d[0, 2]

3

In [31]:
arr2d[2:]

array([[7, 8, 9]])

A single slice such as `arr2d[2:]` selects along axis 0, the first axis.  
You can also index several axes at once, e.g. all three rows of the second column:

In [42]:
arr2d[:3, 1]

array([2, 5, 8])
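
As a further sketch of indexing along more than one axis, the example below rebuilds the same `arr2d` and takes a rectangular block from it. The result names are hypothetical.

```python
import numpy as np

arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

first_two_rows = arr2d[:2]     # slice along axis 0 only
sub_block = arr2d[:2, 1:]      # first two rows, last two columns
second_column = arr2d[:, 1]    # an integer index drops that axis, giving a 1-D result

print(first_two_rows)
print(sub_block)
print(second_column)
```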

### Boolean Indexing

In [43]:
data = np.random.rand(7,4)
data

array([[ 0.7237165 ,  0.47356256,  0.83670582,  0.4932561 ],
       [ 0.13016602,  0.65359208,  0.16800025,  0.88494227],
       [ 0.18893947,  0.43851666,  0.16913188,  0.04401804],
       [ 0.91843808,  0.61281645,  0.74275246,  0.26576825],
       [ 0.24806737,  0.88007237,  0.08016854,  0.64481533],
       [ 0.32024567,  0.33804403,  0.83857029,  0.48108211],
       [ 0.25347398,  0.41868508,  0.89839544,  0.28907641]])

In [44]:
data > 0.4    # comparison operators: ==, !=, <, >, <=, >=; combine conditions with & (and) and | (or)

array([[ True,  True,  True,  True],
       [False,  True, False,  True],
       [False,  True, False, False],
       [ True,  True,  True, False],
       [False,  True, False,  True],
       [False, False,  True,  True],
       [False,  True,  True, False]], dtype=bool)

In [45]:
data[(data>0.4)]

array([ 0.7237165 ,  0.47356256,  0.83670582,  0.4932561 ,  0.65359208,
        0.88494227,  0.43851666,  0.91843808,  0.61281645,  0.74275246,
        0.88007237,  0.64481533,  0.83857029,  0.48108211,  0.41868508,
        0.89839544])
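
Masks can be combined and can also be used on the left-hand side of an assignment. The sketch below uses a small array of fixed, made-up values instead of `np.random.rand` so that the result is reproducible.

```python
import numpy as np

# Fixed illustrative values, so the output does not change between runs
values = np.array([[0.72, 0.47, 0.83],
                   [0.13, 0.65, 0.16]])

# Parentheses are required around each comparison when combining with & or |
in_range = (values > 0.4) & (values < 0.8)
print(values[in_range])          # 1-D array of the matching elements

clipped = values.copy()
clipped[clipped < 0.4] = 0.0     # assign through a mask
print(clipped)
```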

### Functions

In [46]:
arr = np.arange(5)
np.sqrt(arr)   # square root function

array([ 0.        ,  1.        ,  1.41421356,  1.73205081,  2.        ])

In [47]:
np.exp(arr)   #exponentiation function

array([  1.        ,   2.71828183,   7.3890561 ,  20.08553692,  54.59815003])

In [48]:
np.mean(arr) #average function

2.0
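
Aggregating functions such as `np.mean` also accept an `axis` argument on multidimensional arrays. A minimal sketch, using an illustrative 2x3 array:

```python
import numpy as np

arr2d = np.array([[1., 2., 3.], [4., 5., 6.]])

overall = np.mean(arr2d)             # mean of all six elements
per_column = np.mean(arr2d, axis=0)  # mean down each column
per_row = arr2d.mean(axis=1)         # method form: mean across each row

print(overall)
print(per_column)
print(per_row)
```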

### <span style="color: red">YOUR TURN:</span> 

Create a 3x2 numpy array (use __ones__ function), and select its second row

In [56]:
## YOUR CODE BELOW

my_array = np.ones([3,2])
print(my_array)
my_array[1]



[[ 1.  1.]
 [ 1.  1.]
 [ 1.  1.]]


array([ 1.,  1.])

<h1 align="center">Pandas Lib</h1>

Pandas
- for most kinds of data analysis
- for structured or tabular data
- high-level data structure
- built on top of NumPy  
- time series manipulation


Objects:  
- Series  
- DataFrame  



First, import the Pandas package

In [57]:
import pandas as pd

#### Series

A Series is a one-dimensional array-like object containing an array of data (of any NumPy data type) and an associated array of data labels, called its index. 

In [58]:
obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
obj2

d    4
b    7
a   -5
c    3
dtype: int64

In [59]:
obj2[['c', 'a', 'd']]

c    3
a   -5
d    4
dtype: int64

In [64]:
print('b' in obj2)

print(obj2 ==-5)

# like a mask
obj2[obj2 ==-5]

True
d    False
b    False
a     True
c    False
dtype: bool


a   -5
dtype: int64
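
Because a Series carries its index with it, filtering and arithmetic keep the labels attached to the values. The sketch below repeats the same Series under the hypothetical name `obj`.

```python
import pandas as pd

obj = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

positive = obj[obj > 0]   # a boolean mask keeps the labels of the matching entries
doubled = obj * 2         # arithmetic is element-wise and preserves the index

print(positive)
print(doubled['a'])
print(obj.index.tolist())
```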

#### DataFrame
- represents a tabular, spreadsheet-like data structure 
- each column can hold a different value type (numeric, string, boolean, etc.)
- has both a row and column index
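
To make these three properties concrete, here is a minimal sketch that builds a DataFrame from a dictionary. The column names and values are hypothetical toy data, not taken from the World Bank file used later.

```python
import pandas as pd

# Hypothetical toy values, for illustration only
frame = pd.DataFrame({
    'country': ['A', 'B', 'C'],
    'emissions': [1.5, 0.3, 7.2],
    'is_island': [True, False, False],
})

print(frame)
print(frame.dtypes)    # one data type per column
print(frame.index)     # the row index
print(frame.columns)   # the column index
```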

## CO2 emissions by Country by Year

Carbon dioxide emissions are those stemming from the burning of fossil fuels and the manufacture of cement. They include carbon dioxide produced during consumption of solid, liquid, and gas fuels and gas flaring.

http://data.worldbank.org/indicator/EN.ATM.CO2E.PC/

## Get the Data

(Optional:) Data can be downloaded from The World Bank [here](http://data.worldbank.org/indicator/EN.ATM.CO2E.PC/) or from Box [here](https://ibm.box.com/shared/static/3yzxbbizo49bkl8cnjw15tymzfwkycj4.csv)

#### Here, we will use the bash command, `wget`, to fetch the csv file from a direct link

In [65]:
!wget --output-document co2emissions.csv https://ibm.box.com/shared/static/3yzxbbizo49bkl8cnjw15tymzfwkycj4.csv

--2018-03-09 15:36:41--  https://ibm.box.com/shared/static/3yzxbbizo49bkl8cnjw15tymzfwkycj4.csv
Resolving ibm.box.com (ibm.box.com)... 107.152.27.197
Connecting to ibm.box.com (ibm.box.com)|107.152.27.197|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://ibm.ent.box.com/shared/static/3yzxbbizo49bkl8cnjw15tymzfwkycj4.csv [following]
--2018-03-09 15:36:42--  https://ibm.ent.box.com/shared/static/3yzxbbizo49bkl8cnjw15tymzfwkycj4.csv
Resolving ibm.ent.box.com (ibm.ent.box.com)... 107.152.27.211
Connecting to ibm.ent.box.com (ibm.ent.box.com)|107.152.27.211|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://public.boxcloud.com/d/1/XwYvYAwqzXjNeXCyLrX6eHIAizKY36qKPxX5vFD8Qec_SHSIdNIsCk4XAiy-3vsLxHEzQrt17FRfPUL4H1ygM3EhaMAnKrqf4rm4RW3esqGnmkT7PuEyVBPeEsqWNTWVR0k0c8WtMYIjLcLHFjIEtb86id-k21jmmRixPeirrjwPF9rZE_sdjDdRYASfuRM1_YWKDiKYjnr-j1uxGTY3OjjrpMZrSrKR5FqyNu-Ky2GmaxoEUungAEALXGVtYEJ5YhMRQjVr_diHu0ogqJuNxGNV4

<hr>

<h2 align="center">Import the data using Pandas</h2>

#### Import required `pandas` library

In [67]:
import pandas as pd

#### Import data using `pd.read_csv`

In [68]:
data = pd.read_csv("co2emissions.csv", skiprows = 4)
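
The `skiprows = 4` argument tells `read_csv` to ignore the first four lines of the file, which is useful when a file begins with descriptive lines before the real header row. The sketch below demonstrates this on an in-memory string; its contents are invented for the example and only assume that such a preamble exists.

```python
import io
import pandas as pd

# Hypothetical file contents: four preamble lines, then the real header row
raw = (
    "Data Source,Example\n"
    "Last Updated,2018-01-01\n"
    "Note,first\n"
    "Note,second\n"
    "Country Name,Country Code,2010\n"
    "Aruba,ABW,24.5\n"
    "Andorra,AND,6.25\n"
)

# Skip the four preamble lines so that line five becomes the header
sample = pd.read_csv(io.StringIO(raw), skiprows=4)
print(sample)
```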

In [70]:
type(data) #Checks the type of object

pandas.core.frame.DataFrame

In [71]:
data

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2007,2008,2009,2010,2011,2012,2013,2014,2015,Unnamed: 60
0,Aruba,ABW,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,,,,,,,...,25.613715,24.750133,24.876706,24.182702,23.922412,,,,,
1,Andorra,AND,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,,,,,,,...,6.350868,6.296125,6.049173,6.124770,5.968685,,,,,
2,Afghanistan,AFG,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,0.046068,0.053615,0.073781,0.074251,0.086317,0.101499,...,0.088141,0.158962,0.249074,0.302936,0.425262,,,,,
3,Angola,AGO,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,0.104357,0.084718,0.216025,0.206877,0.216174,0.206089,...,1.311096,1.369425,1.430873,1.401654,1.354008,,,,,
4,Albania,ALB,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,1.258195,1.374186,1.439956,1.181681,1.111742,1.166099,...,1.507536,1.580113,1.533178,1.515632,1.607038,,,,,
5,Arab World,ARB,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,0.643964,0.685501,0.761148,0.875124,0.999248,1.166075,...,4.181153,4.373573,4.575251,4.764912,4.724500,,,,,
6,United Arab Emirates,ARE,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,0.118786,0.108937,0.163355,0.175712,0.132651,0.146370,...,23.195067,23.033600,21.102296,20.120957,20.433838,,,,,
7,Argentina,ARG,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,2.367473,2.442616,2.522392,2.316356,2.538380,2.641714,...,4.496834,4.744178,4.427960,4.342272,4.562049,,,,,
8,Armenia,ARM,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,,,,,,,...,1.694755,1.868611,1.469961,1.422998,1.671657,,,,,
9,American Samoa,ASM,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,,,,,,,...,,,,,,,,,,


#### Display first 5 rows of `data` using `head`

In [72]:
data.head()

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2007,2008,2009,2010,2011,2012,2013,2014,2015,Unnamed: 60
0,Aruba,ABW,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,,,,,,,...,25.613715,24.750133,24.876706,24.182702,23.922412,,,,,
1,Andorra,AND,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,,,,,,,...,6.350868,6.296125,6.049173,6.12477,5.968685,,,,,
2,Afghanistan,AFG,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,0.046068,0.053615,0.073781,0.074251,0.086317,0.101499,...,0.088141,0.158962,0.249074,0.302936,0.425262,,,,,
3,Angola,AGO,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,0.104357,0.084718,0.216025,0.206877,0.216174,0.206089,...,1.311096,1.369425,1.430873,1.401654,1.354008,,,,,
4,Albania,ALB,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,1.258195,1.374186,1.439956,1.181681,1.111742,1.166099,...,1.507536,1.580113,1.533178,1.515632,1.607038,,,,,


#### Take just the first two rows using `head`

In [73]:
data.head(2)

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2007,2008,2009,2010,2011,2012,2013,2014,2015,Unnamed: 60
0,Aruba,ABW,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,,,,,,,...,25.613715,24.750133,24.876706,24.182702,23.922412,,,,,
1,Andorra,AND,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,,,,,,,...,6.350868,6.296125,6.049173,6.12477,5.968685,,,,,


In [74]:
data.head(n=1)

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2007,2008,2009,2010,2011,2012,2013,2014,2015,Unnamed: 60
0,Aruba,ABW,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,,,,,,,...,25.613715,24.750133,24.876706,24.182702,23.922412,,,,,


### Don't remember what the parameters are for a function? Use `?`

In [78]:
#To close help window, press q
?data.head

#### Display last 3 rows of `data` using `tail`

In [79]:
data.tail(3)

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2007,2008,2009,2010,2011,2012,2013,2014,2015,Unnamed: 60
246,"Congo, Dem. Rep.",COD,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,0.152228,0.150782,0.135559,0.139446,0.116926,0.14229,...,0.047312,0.048411,0.045604,0.049328,0.050303,,,,,
247,Zambia,ZMB,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,,,,,0.950434,1.100197,...,0.150265,0.164692,0.184058,0.192079,0.21245,,,,,
248,Zimbabwe,ZWE,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,,,,,1.045374,1.179176,...,0.731867,0.569255,0.600521,0.646073,0.691698,,,,,


<hr>

<h2 align="center">Dataframe characteristics</h2>

#### How many rows and columns are there?

In [80]:
data.shape #(rows, columns)

(249, 61)

In [81]:
data.describe()

Unnamed: 0,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,...,2007,2008,2009,2010,2011,2012,2013,2014,2015,Unnamed: 60
count,180.0,181.0,182.0,183.0,188.0,188.0,188.0,188.0,187.0,188.0,...,232.0,232.0,232.0,232.0,232.0,0.0,0.0,0.0,0.0,0.0
mean,2.147848,2.273331,2.382827,2.925962,3.076591,3.203665,3.216921,3.287063,3.515189,4.13608,...,5.02304,4.980116,4.755409,4.911618,4.884828,,,,,
std,4.348322,4.542379,4.889944,8.741776,8.776709,8.964691,8.309355,7.504329,7.450713,10.20869,...,6.59831,6.328807,5.97544,6.10008,6.051724,,,,,
min,0.008022,0.007897,0.008481,0.00939,0.011598,0.01191,0.013258,0.0118,0.015594,0.016116,...,0.021964,0.021615,0.020868,0.020542,0.02135,,,,,
25%,0.176743,0.173541,0.196524,0.192676,0.213766,0.229534,0.231413,0.241491,0.269411,0.318197,...,0.765607,0.776208,0.793028,0.797337,0.829105,,,,,
50%,0.606451,0.6464,0.657922,0.65598,0.726711,0.716547,0.760381,0.785352,0.884673,0.910402,...,2.893936,2.989729,2.95836,2.94388,3.045189,,,,,
75%,1.96004,2.442616,2.674866,2.21164,2.404358,2.609091,3.256631,3.997464,4.279434,4.267795,...,7.219751,7.022516,6.522355,6.816148,6.711322,,,,,
max,36.685183,36.583778,42.637118,99.57594,92.969565,85.563167,78.734608,77.629183,76.10435,99.840439,...,55.3368,48.60162,44.836401,42.639076,44.018926,,,,,
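
In the summary above, a `count` of 0 means that a column holds no values at all. One possible way to detect and drop such columns is sketched below on a hypothetical toy frame; whether you want to drop them in the real data depends on your analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one completely empty column
toy = pd.DataFrame({
    '2011': [23.9, 5.9, np.nan],
    '2012': [np.nan, np.nan, np.nan],
})

counts = toy.count()                      # non-missing values per column
cleaned = toy.dropna(axis=1, how='all')   # drop columns in which every value is NaN

print(counts)
print(cleaned.columns.tolist())
```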


#### What are the column names?

In [82]:
data.columns

Index(['Country Name', 'Country Code', 'Indicator Name', 'Indicator Code',
       '1960', '1961', '1962', '1963', '1964', '1965', '1966', '1967', '1968',
       '1969', '1970', '1971', '1972', '1973', '1974', '1975', '1976', '1977',
       '1978', '1979', '1980', '1981', '1982', '1983', '1984', '1985', '1986',
       '1987', '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995',
       '1996', '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004',
       '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013',
       '2014', '2015', 'Unnamed: 60'],
      dtype='object')

#### What is the first column name?

In [83]:
data.columns[0]

'Country Name'

#### What are the first and second column names?

In [84]:
data.columns[[0,1]]

Index(['Country Name', 'Country Code'], dtype='object')

<hr>

## Review Questions

### <span style="color: red">YOUR TURN:</span> 

#### Print the first 3 rows of `data`

In [87]:
## YOUR CODE BELOW
data.head(3)


Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2007,2008,2009,2010,2011,2012,2013,2014,2015,Unnamed: 60
0,Aruba,ABW,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,,,,,,,...,25.613715,24.750133,24.876706,24.182702,23.922412,,,,,
1,Andorra,AND,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,,,,,,,...,6.350868,6.296125,6.049173,6.12477,5.968685,,,,,
2,Afghanistan,AFG,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,0.046068,0.053615,0.073781,0.074251,0.086317,0.101499,...,0.088141,0.158962,0.249074,0.302936,0.425262,,,,,


#### Print the names of the first and last columns

In [98]:
## YOUR CODE BELOW

# note: to select several positions at once, pass a list inside the brackets, e.g. [0] for one item, [[0, 2, 4]] for several
print(data.columns[[0, -1]])

Index(['Country Name', 'Unnamed: 60'], dtype='object')


<hr>

<h2 align="center">Subsetting the Dataframe</h2>

### Select columns

#### Select columns by name:

In [99]:
data['Country Name']

0                               Aruba
1                             Andorra
2                         Afghanistan
3                              Angola
4                             Albania
5                          Arab World
6                United Arab Emirates
7                           Argentina
8                             Armenia
9                      American Samoa
10                Antigua and Barbuda
11                          Australia
12                            Austria
13                         Azerbaijan
14                            Burundi
15                            Belgium
16                              Benin
17                       Burkina Faso
18                         Bangladesh
19                           Bulgaria
20                            Bahrain
21                       Bahamas, The
22             Bosnia and Herzegovina
23                            Belarus
24                             Belize
25                            Bermuda
26          

#### Select columns by column number:

In [100]:
data[data.columns[0]]

0                               Aruba
1                             Andorra
2                         Afghanistan
3                              Angola
4                             Albania
5                          Arab World
6                United Arab Emirates
7                           Argentina
8                             Armenia
9                      American Samoa
10                Antigua and Barbuda
11                          Australia
12                            Austria
13                         Azerbaijan
14                            Burundi
15                            Belgium
16                              Benin
17                       Burkina Faso
18                         Bangladesh
19                           Bulgaria
20                            Bahrain
21                       Bahamas, The
22             Bosnia and Herzegovina
23                            Belarus
24                             Belize
25                            Bermuda
26          

#### Select the last column:

In [101]:
data[data.columns[-1]]

0     NaN
1     NaN
2     NaN
3     NaN
4     NaN
5     NaN
6     NaN
7     NaN
8     NaN
9     NaN
10    NaN
11    NaN
12    NaN
13    NaN
14    NaN
15    NaN
16    NaN
17    NaN
18    NaN
19    NaN
20    NaN
21    NaN
22    NaN
23    NaN
24    NaN
25    NaN
26    NaN
27    NaN
28    NaN
29    NaN
       ..
219   NaN
220   NaN
221   NaN
222   NaN
223   NaN
224   NaN
225   NaN
226   NaN
227   NaN
228   NaN
229   NaN
230   NaN
231   NaN
232   NaN
233   NaN
234   NaN
235   NaN
236   NaN
237   NaN
238   NaN
239   NaN
240   NaN
241   NaN
242   NaN
243   NaN
244   NaN
245   NaN
246   NaN
247   NaN
248   NaN
Name: Unnamed: 60, Length: 249, dtype: float64

#### Subset Multiple Columns by Name

In [102]:
data[['Country Name', 'Country Code']]

Unnamed: 0,Country Name,Country Code
0,Aruba,ABW
1,Andorra,AND
2,Afghanistan,AFG
3,Angola,AGO
4,Albania,ALB
5,Arab World,ARB
6,United Arab Emirates,ARE
7,Argentina,ARG
8,Armenia,ARM
9,American Samoa,ASM


### Select rows

A few different ways:

In [103]:
data[0:2] #First and second row

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2007,2008,2009,2010,2011,2012,2013,2014,2015,Unnamed: 60
0,Aruba,ABW,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,,,,,,,...,25.613715,24.750133,24.876706,24.182702,23.922412,,,,,
1,Andorra,AND,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,,,,,,,...,6.350868,6.296125,6.049173,6.12477,5.968685,,,,,


In [104]:
data[:2] 

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2007,2008,2009,2010,2011,2012,2013,2014,2015,Unnamed: 60
0,Aruba,ABW,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,,,,,,,...,25.613715,24.750133,24.876706,24.182702,23.922412,,,,,
1,Andorra,AND,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,,,,,,,...,6.350868,6.296125,6.049173,6.12477,5.968685,,,,,


In [109]:
print(data[0:2][data.columns[0:2]])

data.iloc[[0,2]] #First and third rows

  Country Name Country Code
0        Aruba          ABW
1      Andorra          AND


Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2007,2008,2009,2010,2011,2012,2013,2014,2015,Unnamed: 60
0,Aruba,ABW,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,,,,,,,...,25.613715,24.750133,24.876706,24.182702,23.922412,,,,,
2,Afghanistan,AFG,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,0.046068,0.053615,0.073781,0.074251,0.086317,0.101499,...,0.088141,0.158962,0.249074,0.302936,0.425262,,,,,


In [107]:
data[0:2]['Country Name']

0      Aruba
1    Andorra
Name: Country Name, dtype: object

#### Select the last row

In [111]:
data.iloc[-1]

Country Name                                    Zimbabwe
Country Code                                         ZWE
Indicator Name    CO2 emissions (metric tons per capita)
Indicator Code                            EN.ATM.CO2E.PC
1960                                                 NaN
1961                                                 NaN
1962                                                 NaN
1963                                                 NaN
1964                                             1.04537
1965                                             1.17918
1966                                             1.32366
1967                                             1.12296
1968                                             1.30983
1969                                             1.34045
1970                                             1.56786
1971                                             1.62332
1972                                              1.4758
1973                           

## Conditional Subsetting

In [112]:
# use this to get help for any command
?data.loc

Recall that comparison operators such as `==`, `!=` and `>` produce boolean (True/False) values:

In [115]:
1 == 2

False

In [116]:
"Me" != "You"

True

In [117]:
1000 > 1

True

When you apply a comparison to an array, an array of True/False values is returned, one for each element.

In [118]:
import numpy as np
my_range = np.arange(1, 100)
my_range

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
       35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51,
       52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68,
       69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85,
       86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99])

In [119]:
result = my_range > 50
result

array([False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True], dtype=bool)

Then you can use the array of True/False values as a mask to return *only the elements where the condition is True* from the original array.

In [120]:
my_range[result]


array([51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67,
       68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84,
       85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99])

#### Select rows based on a condition

In [121]:
data['Country Name']

0                               Aruba
1                             Andorra
2                         Afghanistan
3                              Angola
4                             Albania
5                          Arab World
6                United Arab Emirates
7                           Argentina
8                             Armenia
9                      American Samoa
10                Antigua and Barbuda
11                          Australia
12                            Austria
13                         Azerbaijan
14                            Burundi
15                            Belgium
16                              Benin
17                       Burkina Faso
18                         Bangladesh
19                           Bulgaria
20                            Bahrain
21                       Bahamas, The
22             Bosnia and Herzegovina
23                            Belarus
24                             Belize
25                            Bermuda
26          

In [122]:
#Data where the country name is Albania. Returns True/False
data['Country Name'] == 'Albania'

0      False
1      False
2      False
3      False
4       True
5      False
6      False
7      False
8      False
9      False
10     False
11     False
12     False
13     False
14     False
15     False
16     False
17     False
18     False
19     False
20     False
21     False
22     False
23     False
24     False
25     False
26     False
27     False
28     False
29     False
       ...  
219    False
220    False
221    False
222    False
223    False
224    False
225    False
226    False
227    False
228    False
229    False
230    False
231    False
232    False
233    False
234    False
235    False
236    False
237    False
238    False
239    False
240    False
241    False
242    False
243    False
244    False
245    False
246    False
247    False
248    False
Name: Country Name, Length: 249, dtype: bool

In [123]:
data[data['Country Code'] == 'CHN'] #Subset based on condition

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2007,2008,2009,2010,2011,2012,2013,2014,2015,Unnamed: 60
38,China,CHN,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,1.170381,0.836047,0.661428,0.640002,0.625646,0.665524,...,5.153564,5.311152,5.778143,6.172489,6.710302,,,,,


#### Select rows based on multiple conditions

In [124]:
data[(data['Country Name'] == 'China') & (data['Country Code'] == 'CHN')]

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2007,2008,2009,2010,2011,2012,2013,2014,2015,Unnamed: 60
38,China,CHN,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,1.170381,0.836047,0.661428,0.640002,0.625646,0.665524,...,5.153564,5.311152,5.778143,6.172489,6.710302,,,,,


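#### Why does the first expression below return no hits?

In [128]:
# & (AND): no row is both 'Canada' and 'CHN', so this selects nothing.
# Its result is not displayed because only the last expression in a cell is shown.
data[(data['Country Name'] == 'Canada') & (data['Country Code'] == 'CHN')]
# | (OR): selects rows where either condition holds
data[(data['Country Name'] == 'Canada') | (data['Country Code'] == 'CHN')]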

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2007,2008,2009,2010,2011,2012,2013,2014,2015,Unnamed: 60
33,Canada,CAN,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,10.770847,10.627898,11.130627,11.132102,12.30537,12.814001,...,17.0519,16.366165,15.089098,14.589054,14.135813,,,,,
38,China,CHN,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,1.170381,0.836047,0.661428,0.640002,0.625646,0.665524,...,5.153564,5.311152,5.778143,6.172489,6.710302,,,,,
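
The difference between `&` and `|` can be checked on a much smaller table. The sketch below uses a hypothetical three-row frame that mirrors the column names used above.

```python
import pandas as pd

# Hypothetical three-row frame, for illustration only
toy = pd.DataFrame({
    'Country Name': ['Canada', 'China', 'Chile'],
    'Country Code': ['CAN', 'CHN', 'CHL'],
})

is_canada = toy['Country Name'] == 'Canada'
is_chn = toy['Country Code'] == 'CHN'

both = toy[is_canada & is_chn]     # AND: a row must satisfy both masks
either = toy[is_canada | is_chn]   # OR: a row may satisfy either mask

print(len(both))
print(either)
```

Storing each mask in a named variable, as done here, also makes longer conditions easier to read.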


### Select data by row and column

What does the following do?

In [130]:
data[['Country Name', 'Country Code', '2010']][16:21]

Unnamed: 0,Country Name,Country Code,2010
16,Benin,BEN,0.541771
17,Burkina Faso,BFA,0.107673
18,Bangladesh,BGD,0.37036
19,Bulgaria,BGR,5.966884
20,Bahrain,BHR,18.435025


<hr>

## Review Question

### <span style="color: red">YOUR TURN:</span> 
<h3>Find the CO2 Emission per capita for France and Germany in 2010 and 2011</h3>

Your output should look like:

<img src="https://ibm.box.com/shared/static/23drzeu5h2uydwhb0un9qa4jq9yzrjbi.png" width="300">

In [None]:
#hint
# data['Country Name']== ??
# data[['Country Name', '2010', '2011']].iloc[[??,??]]

In [157]:
## YOUR CODE BELOW

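# Option 1: inspect a positional slice (rows 52 to 73) to locate the two countries
data[['Country Name', '2010', '2011']][52:74]

# Option 2: combine two conditions with | (OR), then keep the columns of interest
data[(data['Country Name'] == 'France') | (data['Country Name'] == 'Germany')][['Country Name','2010','2011']]

# Option 3: isin tests membership in a list; only this last expression is displayed
data[data['Country Name'].isin(['France', 'Germany'])][['Country Name','2010','2011']]

# note: isin is vectorized; looping over rows with Python's `in` operator would be much slower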

Unnamed: 0,Country Name,2010,2011
52,Germany,9.179817,8.917833
73,France,5.496707,5.185043


## Want to learn more?

You can take free [Python 101](https://cocl.us/DX0108EN_PY0101EN) or [Data Analysis with Python](https://cocl.us/DX0108EN_DA0101EN) courses.  

Also, you can use Data Science Experience to run these notebooks faster with bigger datasets. Data Science Experience is IBM's leading cloud solution for data scientists, built by data scientists. With Jupyter notebooks, RStudio, Apache Spark and popular libraries pre-packaged in the cloud, DSX enables data scientists to collaborate on their projects without having to install anything. Join the fast-growing community of DSX users today with a free account at [Data Science Experience](https://cocl.us/DX0108EN_DSX). 

<hr>
Copyright &copy; 2017 [Cognitive Class](https://cognitiveclass.ai/?utm_source=cccopyrightlink&utm_medium=cclabs). This notebook and its source code are released under the terms of the [MIT License](https://cognitiveclass.ai/mit-license/).


<h3>Authors:</h3>
<article class="teacher">
<div class="teacher-image" style="    float: left;
    width: 115px;
    height: 115px;
    margin-right: 10px;
    margin-bottom: 10px;
    border: 1px solid #CCC;
    padding: 3px;
    border-radius: 3px;
    text-align: center;"><img class="alignnone wp-image-2258 " src="https://ibm.box.com/shared/static/tyd41rlrnmfrrk78jx521eb73fljwvv0.jpg" alt="Saeed Aghabozorgi" width="178" height="178" /></div>
<h4>Saeed Aghabozorgi</h4>
<p><a href="https://ca.linkedin.com/in/saeedaghabozorgi">Saeed Aghabozorgi</a>, PhD, is a Data Scientist at IBM with a track record of developing enterprise-level applications that substantially increase clients’ ability to turn data into actionable knowledge. He is a researcher in the data mining field and an expert in developing advanced analytic methods, such as machine learning and statistical modelling, on large datasets.</p>
</article>
<br>
<article class="teacher">
<div class="teacher-image" style="    float: left;
    width: 115px;
    height: 115px;
    margin-right: 10px;
    margin-bottom: 10px;
    border: 1px solid #CCC;
    padding: 3px;
    border-radius: 3px;
    text-align: center;"><img class="alignnone size-medium wp-image-2177" src="https://ibm.box.com/shared/static/2ygdi03ahcr97df2ofrr6cf8knq4kodd.jpg" alt="Polong Lin" width="300" height="300" /></div>
<h4>Polong Lin</h4>
<p>
<a href="https://ca.linkedin.com/in/polonglin">Polong Lin</a> is a Data Scientist and Lead Data Science Advocate at IBM. Polong is a regular speaker at conferences and meetups, where he teaches data science. Polong holds an M.Sc. in Cognitive Psychology.</p>
</article>