This lesson uses Gapminder data:
    - Employment Levels
    - Life Expectancy
    - GDP
    - School Completion Rates
Some questions we can answer through these datasets:
    - How has employment in a particular country varied over time
    - What are the highest and lowest employment levels
        - Which countries have them
        - Where is a particular country on the spectrum
    - The same questions for the other variables
    - How do these variables relate to each other
    - Are there consistent trends across countries
#One-Dimensional Data in NumPy and Pandas

In [1]:
import unicodecsv

def read_csv(filename):
    with open(filename, "rb") as f:
        reader = unicodecsv.DictReader(f)
        return list(reader)
    
daily_engagement = read_csv("daily_engagement.csv")

In [2]:
def get_unique_students(data):
    unique_students = set()
    for data_point in data:
        unique_students.add(data_point['acct'])
    return unique_students

unique_engagement_students = get_unique_students(daily_engagement)
len(unique_engagement_students)

1237

The whole thing can be done with pandas in far fewer lines, and typically in less time:

In [3]:
import pandas as pd

In [4]:
daily_engagement = pd.read_csv("daily_engagement.csv")

In [5]:
len(daily_engagement['acct'].unique())

1237
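
As a small aside, pandas can also return the count directly through `Series.nunique()`. The sketch below uses a made-up in-memory table (`sample_engagement` is a hypothetical stand-in for `daily_engagement.csv`, with only an `acct` column) so that it runs without the file.

```python
import pandas as pd

# Hypothetical stand-in for daily_engagement.csv: six rows, three accounts.
sample_engagement = pd.DataFrame({'acct': [0, 0, 1, 2, 2, 2]})

# unique() returns the distinct values; nunique() returns how many there are.
unique_accounts = sample_engagement['acct'].unique()
num_unique_accounts = sample_engagement['acct'].nunique()

print(unique_accounts)
print(num_unique_accounts)
```

One difference worth knowing: by default `nunique()` skips missing values, while `unique()` includes them, so the two counts can differ when a column contains NaN.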

#NumPy Arrays
One-dimensional data structures:
    - In Pandas: Series
    - In NumPy (Numerical Python): Array
**NumPy Arrays and Python Lists**

*Similarities*
    - Access elements by positions: a[0]
    - Access a range of elements: a[1:3]
    - Use Loops: for x in a:
*Differences*
    - Every element of an array has the same type: string, int, boolean, etc.
    - Arrays have convenient functions: mean(), std()
    - Can be multi-dimensional
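
The differences listed above can be illustrated with a short sketch. All variable names here are made up for the example.

```python
import numpy as np

numbers_list = [1, 2, 3]
numbers_array = np.array([1, 2, 3])

# '+' concatenates lists but adds arrays element by element.
list_sum = numbers_list + numbers_list      # [1, 2, 3, 1, 2, 3]
array_sum = numbers_array + numbers_array   # array([2, 4, 6])

# Mixing ints and floats yields one common type for the whole array.
mixed = np.array([1, 2.5])
print(mixed.dtype)            # float64

# Arrays have built-in statistics and can have more than one dimension.
grid = np.array([[1, 2, 3], [4, 5, 6]])
print(numbers_array.mean())   # 2.0
print(grid.shape)             # (2, 3)
```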

In [10]:
import numpy as np

countries = np.array(['Afghanistan', 'Albania', 'Algeria', 'Angola', 'Argentina','Armenia', 'Australia', 'Austria', 'Azerbaijan',
                      'Bahamas','Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium','Belize', 'Benin', 'Bhutan', 'Bolivia',
                      'Bosnia and Herzegovina'])

employment = np.array([55.70000076,51.40000153,50.5,75.69999695,58.40000153,40.09999847,61.5,57.09999847,60.90000153,66.59999847,
                       60.40000153,68.09999847,66.90000153,53.40000153,48.59999847,56.79999924,71.59999847,58.40000153,70.40000153,
                       41.20000076])

In [11]:
print(countries[0])
print(countries[3])

Afghanistan
Angola


In [12]:
print(countries[0:3])
print(countries[:3])
print(countries[17:])
print(countries[:])

['Afghanistan' 'Albania' 'Algeria']
['Afghanistan' 'Albania' 'Algeria']
['Bhutan' 'Bolivia' 'Bosnia and Herzegovina']
['Afghanistan' 'Albania' 'Algeria' 'Angola' 'Argentina' 'Armenia'
 'Australia' 'Austria' 'Azerbaijan' 'Bahamas' 'Bahrain' 'Bangladesh'
 'Barbados' 'Belarus' 'Belgium' 'Belize' 'Benin' 'Bhutan' 'Bolivia'
 'Bosnia and Herzegovina']


In [13]:
print(countries.dtype)
print(employment.dtype)
print(np.array([0, 1, 2, 3]).dtype)
print(np.array([1.0, 1.5, 2.0, 2.5]).dtype)
print(np.array([True, False, True]).dtype)
print(np.array(['AL', 'AK', 'AZ', 'AR', 'CA']).dtype)

<U22
float64
int32
float64
bool
<U2
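
The default integer type (`int32` in the output above) depends on the platform and the NumPy version, so the same cell may print `int64` elsewhere. When the width matters, it can be requested explicitly. A minimal sketch, with illustrative variable names:

```python
import numpy as np

# Ask for a specific integer width instead of relying on the default.
ids = np.array([0, 1, 2, 3], dtype=np.int64)
print(ids.dtype)            # int64

# astype() returns a converted copy; the original array is unchanged.
ids_as_float = ids.astype(np.float64)
print(ids_as_float.dtype)   # float64
```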


In [14]:
for country in countries:
    print('Examining country {}'.format(country))

for i in range(len(countries)):
    country = countries[i]
    country_employment = employment[i]
    print('Country {} has employment {}'.format(country,country_employment))

Examining country Afghanistan
Examining country Albania
Examining country Algeria
Examining country Angola
Examining country Argentina
Examining country Armenia
Examining country Australia
Examining country Austria
Examining country Azerbaijan
Examining country Bahamas
Examining country Bahrain
Examining country Bangladesh
Examining country Barbados
Examining country Belarus
Examining country Belgium
Examining country Belize
Examining country Benin
Examining country Bhutan
Examining country Bolivia
Examining country Bosnia and Herzegovina
Country Afghanistan has employment 55.70000076
Country Albania has employment 51.40000153
Country Algeria has employment 50.5
Country Angola has employment 75.69999695
Country Argentina has employment 58.40000153
Country Armenia has employment 40.09999847
Country Australia has employment 61.5
Country Austria has employment 57.09999847
Country Azerbaijan has employment 60.90000153
Country Bahamas has employment 66.59999847
Country Bahrain has employmen
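
The index-based loop above can also be written with `zip`, which pairs up the two arrays without a counter. This is only an alternative sketch; it uses shortened copies of the arrays so that it runs on its own, and `lines` is a helper list introduced for the example.

```python
import numpy as np

# Shortened copies of the arrays above.
countries = np.array(['Afghanistan', 'Albania', 'Algeria'])
employment = np.array([55.70000076, 51.40000153, 50.5])

lines = []
# zip() yields one (country, employment) pair per position.
for country, country_employment in zip(countries, employment):
    lines.append('Country {} has employment {}'.format(country, country_employment))

print('\n'.join(lines))
```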

In [15]:
print(employment.mean())
print(employment.std())
print(employment.max())
print(employment.sum())

58.6850000385
9.33826911369
75.69999695
1173.70000077


In [16]:
def max_employment(countries, employment):
    # argmax() returns the position of the largest value
    i = employment.argmax()

    max_country = countries[i]
    max_value = employment[i]

    return (max_country, max_value)

print(max_employment(countries,employment))

('Angola', 75.699996949999999)
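
The same pattern answers the "lowest employment" question by swapping `argmax()` for `argmin()`. The function name `min_employment` is made up for this sketch, and the arrays are shortened copies of the ones above.

```python
import numpy as np

def min_employment(countries, employment):
    """Return (country, value) for the lowest employment value."""
    # argmin() returns the position of the smallest value
    i = employment.argmin()
    return (countries[i], employment[i])

countries = np.array(['Afghanistan', 'Albania', 'Algeria', 'Angola'])
employment = np.array([55.70000076, 51.40000153, 50.5, 75.69999695])

lowest_country, lowest_value = min_employment(countries, employment)
print(lowest_country, lowest_value)
```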


##Vectorized Operations
A vector is a list of numbers.

**Vector Addition:** Add the corresponding elements of two vectors, e.g., (1, 2, 3) + (4, 5, 6) = (5, 7, 9)

**Multiplication by a Scalar:** Multiply every element by the scalar, e.g., (1, 2, 3) * 3 = (3, 6, 9)

**More Vectorized Operations:** 
    - Math Operations:
        - Add: +
        - Subtract: -
        - Multiply: *
        - Divide: /
        - Exponentiate: **
    - Logical Operations (For Boolean Arrays):
        - And: &
        - Or: |
        - Not: ~
    - Comparison Operations:
        - Greater: >
        - Greater or equal: >=
        - Less: <
        - Less or equal: <=
        - Equal: ==
        - Not Equal: !=
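
Comparison and logical operators are often combined. The sketch below, with made-up values, shows two details that are easy to miss: the parentheses around each comparison are required, and on integer arrays `&` is a bitwise operation rather than a logical one.

```python
import numpy as np

employment = np.array([55.7, 51.4, 50.5, 75.7, 58.4])

# Parentheses are required: & binds more tightly than > and <.
mid_range = (employment > 52) & (employment < 60)
print(mid_range)

# On integer arrays, & is bitwise AND, not logical AND.
bitwise = np.array([1, 2, 3]) & 2
print(bitwise)
```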

In [17]:
a = np.array([1, 2, 3, 4])
b = np.array([1, 2, 1, 2])

print(a + b)
print(a - b)
print(a * b)
print(a / b)
print(a ** b)

[2 4 4 6]
[0 0 2 2]
[1 4 3 8]
[ 1.  1.  3.  2.]
[ 1  4  3 16]


In [18]:
a = np.array([1, 2, 3, 4])
b = 2

print(a + b)
print(a - b)
print(a * b)
print(a / b)
print(a ** b)

[3 4 5 6]
[-1  0  1  2]
[2 4 6 8]
[ 0.5  1.   1.5  2. ]
[ 1  4  9 16]


In [19]:
a = np.array([True, True, False, False])
b = np.array([True, False, True, False])

print(a & b)
print(a | b)
print(~a)

print(a & True)
print(a & False)

print(a | True)
print(a | False)

[ True False False False]
[ True  True  True False]
[False False  True  True]
[ True  True False False]
[False False False False]
[ True  True  True  True]
[ True  True False False]


In [20]:
a = np.array([1, 2, 3, 4, 5])
b = np.array([5, 4, 3, 2, 1])

print(a > b)
print(a >= b)
print(a < b)
print(a <= b)
print(a == b)
print(a != b)

[False False False  True  True]
[False False  True  True  True]
[ True  True False False False]
[ True  True  True False False]
[False False  True False False]
[ True  True False  True  True]


In [21]:
a = np.array([1, 2, 3, 4])
b = 2

print(a > b)
print(a >= b)
print(a < b)
print(a <= b)
print(a == b)
print(a != b)

[False False  True  True]
[False  True  True  True]
[ True False False False]
[ True  True False False]
[False  True False False]
[ True False  True  True]
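
Because `True` counts as 1 and `False` as 0, the result of a comparison can be summed or averaged to count matches. A minimal sketch with illustrative names:

```python
import numpy as np

a = np.array([1, 2, 3, 4])

above_two = a > 2                    # boolean array
count_above = above_two.sum()        # how many elements are greater than 2
fraction_above = above_two.mean()    # what share of elements that is

print(count_above)
print(fraction_above)
```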


In [22]:
countries = np.array(['Algeria', 'Argentina', 'Armenia', 'Aruba', 'Austria','Azerbaijan','Bahamas', 'Barbados', 'Belarus',
                      'Belgium', 'Belize', 'Bolivia','Botswana', 'Brunei', 'Bulgaria', 'Burkina Faso', 'Burundi','Cambodia',
                      'Cameroon', 'Cape Verde'])

female_completion = np.array([97.35583, 104.62379, 103.02998, 95.14321, 103.69019, 98.49185, 100.88828, 95.43974, 92.11484,
                              91.54804, 95.98029, 98.22902, 96.12179, 119.28105, 97.84627, 29.07386, 38.41644, 90.70509,
                              51.7478, 95.45072])

male_completion = np.array([95.47622, 100.66476, 99.7926, 91.48936, 103.22096, 97.80458, 103.81398, 88.11736, 93.55611,
                            87.76347, 102.45714, 98.73953, 92.22388, 115.3892 , 98.70502, 37.00692, 45.39401, 91.22084,
                            62.42028, 90.66958])

In [25]:
def overall_completion_rate(female_completion, male_completion): 
    overall_completion_rate_by_country = (female_completion + male_completion)/2.0
    return overall_completion_rate_by_country

In [26]:
print(overall_completion_rate(female_completion,male_completion))

[  96.416025  102.644275  101.41129    93.316285  103.455575   98.148215
  102.35113    91.77855    92.835475   89.655755   99.218715   98.484275
   94.172835  117.335125   98.275645   33.04039    41.905225   90.962965
   57.08404    93.06015 ]
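
An equivalent way to get the same numbers is to stack the two arrays and take the mean along the first axis. This sketch uses only the first three countries. Note that either version is an unweighted average, which matches the true overall rate only under the assumption that the male and female populations are about the same size.

```python
import numpy as np

# First three values from the arrays above.
female_completion = np.array([97.35583, 104.62379, 103.02998])
male_completion = np.array([95.47622, 100.66476, 99.7926])

by_formula = (female_completion + male_completion) / 2.0

# Stack into a 2 x 3 array, then average down each column.
stacked = np.array([female_completion, male_completion])
by_mean = stacked.mean(axis=0)

print(np.allclose(by_formula, by_mean))
```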


##Standardizing Data
How does one data point compare to the rest?

One way to answer: convert each data point to the number of standard deviations it lies from the mean.

For example: mean employment rate 58.6%, standard deviation 10.5%, United States 62.3%.

The United States is therefore 62.3% - 58.6% = 3.7 percentage points above the mean, or 3.7 / 10.5 ≈ 0.35 standard deviations.

Mexico, at 57.9%, is 0.7 percentage points below the mean: -0.7 / 10.5 ≈ -0.067 standard deviations.
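
The arithmetic in this example can be checked with a few lines of plain Python. The function name `standard_score` is made up for the sketch; the mean and standard deviation are the figures quoted in the text.

```python
mean_rate = 58.6   # mean employment rate, in percent
std_rate = 10.5    # standard deviation, in percentage points

def standard_score(rate):
    """Number of standard deviations that rate lies from the mean."""
    return (rate - mean_rate) / std_rate

us_score = standard_score(62.3)
mexico_score = standard_score(57.9)

print(round(us_score, 2))
print(round(mexico_score, 3))
```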

In [27]:
countries = np.array(['Afghanistan', 'Albania', 'Algeria', 'Angola', 'Argentina', 'Armenia', 'Australia', 'Austria',
                      'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize',
                      'Benin', 'Bhutan', 'Bolivia', 'Bosnia and Herzegovina'])

employment = np.array([55.70000076, 51.40000153, 50.5, 75.69999695, 58.40000153, 40.09999847, 61.5, 57.09999847, 60.90000153,
                       66.59999847, 60.40000153, 68.09999847, 66.90000153, 53.40000153, 48.59999847, 56.79999924,
                       71.59999847, 58.40000153, 70.40000153, 41.20000076])

In [29]:
def standardize_data(values):
    # Number of standard deviations each value lies from the mean;
    # values above the mean come out positive.
    mean_value = values.mean()
    std_value = values.std()
    standard_values = (values - mean_value)/std_value
    return standard_values

In [31]:
print(standardize_data(employment))

[-0.31965231 -0.780123   -0.87650077  1.82207181 -0.03051941 -1.99019768
  0.30144772 -0.16973184  0.23719615  0.84758731  0.18365304  1.00821665
  0.87971351 -0.56595055 -1.07996476 -0.20185762  1.38301845 -0.03051941
  1.2545153  -1.87240259]
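
A quick sanity check for any standardization is that the result has mean 0 and standard deviation 1, and that values above the original mean come out positive. The sketch below defines its own helper, `standardize` (a name made up for the example), using the usual (value - mean) / std definition, and runs on the first five employment values.

```python
import numpy as np

def standardize(values):
    """Standard deviations from the mean, positive above the mean."""
    return (values - values.mean()) / values.std()

employment = np.array([55.70000076, 51.40000153, 50.5, 75.69999695, 58.40000153])
z = standardize(employment)

print(np.isclose(z.mean(), 0.0))    # mean of standardized data is 0
print(np.isclose(z.std(), 1.0))     # standard deviation is 1
print(z[employment.argmax()] > 0)   # the largest value lies above the mean
```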
