
# NumPy basics

In [3]:
import numpy as np

array = np.array([1, 4, 5, 8], float)
print(array)
print("")

array = np.array([[1, 2, 3], [4, 5, 6]], float)  # a 2D array/Matrix
print(array)

[1. 4. 5. 8.]

[[1. 2. 3.]
 [4. 5. 6.]]


The array class is the foundation of NumPy. NumPy arrays are like
Python lists, except that everything inside an array must be of the
same type, such as int or float.
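
As a small illustration of the same-type rule (a sketch that is not part of the original notebook; the variable names are made up for the example), NumPy converts mixed inputs to one common type, while a Python list keeps each item's own type:

```python
import numpy as np

# A Python list keeps each element's own type.
mixed_list = [1, 2.5, 3]
list_types = [type(x).__name__ for x in mixed_list]
print(list_types)

# NumPy upcasts everything to a single common dtype (a float type here).
mixed_array = np.array(mixed_list)
print(mixed_array.dtype)

# Passing an explicit type, as the cells in this section do with `float`,
# forces that type even when every input is an int.
forced = np.array([1, 4, 5, 8], float)
print(forced.dtype)
```

This is why the printed arrays in this section show `1.` rather than `1`.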

In [4]:
array = np.array([1, 4, 5, 8], float)
print(array)
print("")
print(array[1])
print("")
print(array[:2])
print("")
array[1] = 5.0
print(array[1])

[1. 4. 5. 8.]

4.0

[1. 4.]

5.0


You can index, slice, and modify a NumPy array much like you would a Python list.

In [5]:
# Matrix indexing and slicing in action

two_D_array = np.array([[1, 2, 3], [4, 5, 6]], float)
print(two_D_array)
print("")
print(two_D_array[1][1])
print("")
print(two_D_array[1, :])
print("")
print(two_D_array[:, 2])
print("")

[[1. 2. 3.]
 [4. 5. 6.]]

5.0

[4. 5. 6.]

[3. 6.]



In [6]:
# Here are some arithmetic operations that you can do with Numpy arrays

array_1 = np.array([1, 2, 3], float)
array_2 = np.array([5, 2, 6], float)
print(array_1 + array_2)
print("")
print(array_1 - array_2)
print("")
print(array_1 * array_2)

[6. 4. 9.]

[-4.  0. -3.]

[ 5.  4. 18.]


In [7]:
array_1 = np.array([[1, 2], [3, 4]], float)
array_2 = np.array([[5, 6], [7, 8]], float)

print(array_1 + array_2)
print("")
print(array_1 - array_2)
print("")
print(array_1 * array_2)

[[ 6.  8.]
 [10. 12.]]

[[-4. -4.]
 [-4. -4.]]

[[ 5. 12.]
 [21. 32.]]


In addition to the standard arithmetic operations, NumPy offers a range of
other mathematical operations that you can apply to arrays, such as the
mean and the dot product. To compute the product of a vector and a matrix in NumPy, use the dot() function.

In [8]:
array_1 = np.array([1, 2, 3], float)
array_2 = np.array([[6], [7], [8]], float)

print(np.mean(array_1))
print(np.mean(array_2))
print("")
print(np.dot(array_1, array_2))

2.0
7.0

[44.]
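
Note that `*` in the earlier cells multiplies element by element; it is not a matrix product. The sketch below contrasts the two on the same 2x2 values used in cell 7 (the names `a`, `b`, `elementwise` and `matrix_product` are only for this example). The `@` operator is the standard shorthand for matrix multiplication.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]], float)
b = np.array([[5, 6], [7, 8]], float)

elementwise = a * b             # multiplies matching entries
matrix_product = np.dot(a, b)   # rows of a combined with columns of b

print(elementwise)
print(matrix_product)

# For 2D arrays, a @ b gives the same result as np.dot(a, b).
print(np.array_equal(matrix_product, a @ b))
```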


# Pandas basics

* Series

You can think of a Series as a one-dimensional object that is similar to
an array, a list, or a column in a database. By default, it will assign an
index label to each item in the Series ranging from 0 to N, where N is
the number of items in the Series minus one.

In [10]:
import pandas as pd

series = pd.Series(['Dave', 'Cheng-Han', 'Udacity', 42, -1789710578])
print(series)

0           Dave
1      Cheng-Han
2        Udacity
3             42
4    -1789710578
dtype: object


You can also manually assign index labels to the items when
creating the Series.

In [11]:
series = pd.Series(['Dave', 'Cheng-Han', 359, 9001],
                       index=['Instructor', 'Curriculum Manager',
                              'Course Number', 'Power Level'])

print(series)

Instructor                 Dave
Curriculum Manager    Cheng-Han
Course Number               359
Power Level                9001
dtype: object


You can use the index labels to select specific items from the Series.

In [12]:
series = pd.Series(['Dave', 'Cheng-Han', 359, 9001],
                       index=['Instructor', 'Curriculum Manager',
                              'Course Number', 'Power Level'])

print(series['Instructor'])
print("")
print(series[['Instructor', 'Curriculum Manager', 'Course Number']])

Dave

Instructor                 Dave
Curriculum Manager    Cheng-Han
Course Number               359
dtype: object


You can also use boolean conditions to select specific items from the Series.

In [13]:
cuteness = pd.Series([1, 2, 3, 4, 5], index=['Cockroach', 'Fish', 'Mini Pig',
                                                 'Puppy', 'Kitten'])

print(cuteness > 3)
print("")
print(cuteness[cuteness > 3])

Cockroach    False
Fish         False
Mini Pig     False
Puppy         True
Kitten        True
dtype: bool

Puppy     4
Kitten    5
dtype: int64


* DataFrame

You can think of a DataFrame as something with rows and columns. It is
similar to a spreadsheet or a database table.

To create a DataFrame, you can pass a dictionary of lists to the DataFrame
constructor:

1) Each key of the dictionary becomes a column name

2) The associated list holds the values within that column.

In [14]:
data = {'year': [2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012],
        'team': ['Bears', 'Bears', 'Bears', 'Packers', 'Packers', 'Lions','Lions', 'Lions'],
        'wins': [11, 8, 10, 15, 11, 6, 10, 4],
        'losses': [5, 8, 6, 1, 5, 10, 6, 12]}


football = pd.DataFrame(data)
print(football)

   losses     team  wins  year
0       5    Bears    11  2010
1       8    Bears     8  2011
2       6    Bears    10  2012
3       1  Packers    15  2011
4       5  Packers    11  2012
5      10    Lions     6  2010
6       6    Lions    10  2011
7      12    Lions     4  2012
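
The printed output above lists the columns alphabetically, which is how older pandas releases ordered a plain dict; newer releases keep the dict's insertion order instead. If you want a specific order either way, one option is the `columns` argument of the constructor. A minimal sketch on a shortened copy of the data (the name `ordered` is illustrative):

```python
import pandas as pd

data = {'year': [2010, 2011, 2012],
        'team': ['Bears', 'Bears', 'Bears'],
        'wins': [11, 8, 10],
        'losses': [5, 8, 6]}

# `columns` fixes the column order explicitly.
ordered = pd.DataFrame(data, columns=['team', 'year', 'wins', 'losses'])
print(ordered)
```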


Pandas also has several attributes and methods that help you understand basic
information about your DataFrame. Some of these are:
- 1) dtypes: the data type of each column
- 2) describe: basic statistics for the DataFrame's numerical
   columns
- 3) head: displays the first five rows of the dataset by default
- 4) tail: displays the last five rows of the dataset by default

In [15]:
data = {'year': [2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012],
        'team': ['Bears', 'Bears', 'Bears', 'Packers', 'Packers', 'Lions', 'Lions', 'Lions'],
        'wins': [11, 8, 10, 15, 11, 6, 10, 4],
        'losses': [5, 8, 6, 1, 5, 10, 6, 12]}

football = pd.DataFrame(data)
print(football.dtypes)
print("")
print(football.describe())
print("")
print(football.head())
print("")
print(football.tail())

losses     int64
team      object
wins       int64
year       int64
dtype: object

          losses       wins         year
count   8.000000   8.000000     8.000000
mean    6.625000   9.375000  2011.125000
std     3.377975   3.377975     0.834523
min     1.000000   4.000000  2010.000000
25%     5.000000   7.500000  2010.750000
50%     6.000000  10.000000  2011.000000
75%     8.500000  11.000000  2012.000000
max    12.000000  15.000000  2012.000000

   losses     team  wins  year
0       5    Bears    11  2010
1       8    Bears     8  2011
2       6    Bears    10  2012
3       1  Packers    15  2011
4       5  Packers    11  2012

   losses     team  wins  year
3       1  Packers    15  2011
4       5  Packers    11  2012
5      10    Lions     6  2010
6       6    Lions    10  2011
7      12    Lions     4  2012
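
Five rows is only the default for `head` and `tail`; both accept a row count. A short sketch using two of the football columns (the names `first_two` and `last_three` are made up for the example):

```python
import pandas as pd

football = pd.DataFrame({
    'team': ['Bears', 'Bears', 'Bears', 'Packers', 'Packers', 'Lions', 'Lions', 'Lions'],
    'wins': [11, 8, 10, 15, 11, 6, 10, 4]})

first_two = football.head(2)    # rows 0 and 1
last_three = football.tail(3)   # rows 5, 6 and 7
print(first_two)
print(last_three)
```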


## Create a DataFrame

In [0]:
from pandas import DataFrame, Series

def create_dataframe():

    countries = ['Russian Fed.', 'Norway', 'Canada', 'United States',
                 'Netherlands', 'Germany', 'Switzerland', 'Belarus',
                 'Austria', 'France', 'Poland', 'China', 'Korea', 
                 'Sweden', 'Czech Republic', 'Slovenia', 'Japan',
                 'Finland', 'Great Britain', 'Ukraine', 'Slovakia',
                 'Italy', 'Latvia', 'Australia', 'Croatia', 'Kazakhstan']

    gold = [13, 11, 10, 9, 8, 8, 6, 5, 4, 4, 4, 3, 3, 2, 2, 2, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
    silver = [11, 5, 10, 7, 7, 6, 3, 0, 8, 4, 1, 4, 3, 7, 4, 2, 4, 3, 1, 0, 0, 2, 2, 2, 1, 0]
    bronze = [9, 10, 5, 12, 9, 5, 2, 1, 5, 7, 1, 2, 2, 6, 2, 4, 3, 1, 2, 1, 0, 6, 2, 1, 0, 1]
    
    olympic_medal_counts = {'country_name': Series(countries), 'gold': Series(gold),
                            'silver': Series(silver), 'bronze': Series(bronze)}

    olympic_medal_counts_df = DataFrame(olympic_medal_counts)
    
    return olympic_medal_counts_df

In [24]:
create_dataframe()

   bronze   country_name  gold  silver
0       9   Russian Fed.    13      11
1      10         Norway    11       5
2       5         Canada    10      10
3      12  United States     9       7
4       9    Netherlands     8       7
5       5        Germany     8       6
6       2    Switzerland     6       3
7       1        Belarus     5       0
8       5        Austria     4       8
9       7         France     4       4


In [25]:
create_dataframe().loc[21] # .loc[index] grabs a single row of a DataFrame by its index label

bronze              6
country_name    Italy
gold                0
silver              2
Name: 21, dtype: object

In [30]:
df = create_dataframe()
df[df['gold'] >= 9]

   bronze   country_name  gold  silver
0       9   Russian Fed.    13      11
1      10         Norway    11       5
2       5         Canada    10      10
3      12  United States     9       7


In [33]:
df['country_name'][df['gold']>=9]

# 1. df['country_name'] first grabs the column named country_name from the DataFrame
# 2. Of the data grabbed in step 1, only the entries where df['gold']>=9 is True
#    are kept by the filter and returned

0     Russian Fed.
1           Norway
2           Canada
3    United States
Name: country_name, dtype: object
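
The two-step selection above can also be written as a single `.loc` call that takes the row condition and the column label together. This is a sketch on a shortened copy of the medal table (the frame `medals` and the other names are made up for the example), not the notebook's own code:

```python
import pandas as pd

medals = pd.DataFrame({
    'country_name': ['Russian Fed.', 'Norway', 'Canada', 'United States', 'Netherlands'],
    'gold': [13, 11, 10, 9, 8],
})

mask = medals['gold'] >= 9                       # boolean Series, one entry per row
chained = medals['country_name'][mask]           # column first, then filter
single_step = medals.loc[mask, 'country_name']   # rows and column in one call

print(single_step)
print(chained.equals(single_step))
```

Both forms return the same Series when reading; the single `.loc` call is usually the safer choice when you later want to assign to the selection.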

In [34]:
df['gold']>=9

0      True
1      True
2      True
3      True
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17    False
18    False
19    False
20    False
21    False
22    False
23    False
24    False
25    False
Name: gold, dtype: bool

You can think of a DataFrame as a group of Series that share an index.
This makes it easy to select the specific columns that you want from the
DataFrame.

A couple of pointers:
- 1) Selecting a single column from the DataFrame will return a Series
- 2) Selecting multiple columns from the DataFrame will return a DataFrame
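
The two pointers can be checked directly by looking at the type of each result. A minimal sketch on a two-row frame (the variable names are illustrative); note that a list containing a single label still gives a DataFrame:

```python
import pandas as pd

football = pd.DataFrame({'year': [2010, 2011], 'wins': [11, 8], 'losses': [5, 8]})

one_column = football['year']              # single label -> Series
one_column_frame = football[['year']]      # list with one label -> DataFrame
two_columns = football[['year', 'wins']]   # list of labels -> DataFrame

print(type(one_column).__name__)
print(type(one_column_frame).__name__)
print(type(two_columns).__name__)
```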


In [35]:
data = {'year': [2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012],
        'team': ['Bears', 'Bears', 'Bears', 'Packers', 'Packers', 'Lions','Lions', 'Lions'],
        'wins': [11, 8, 10, 15, 11, 6, 10, 4],
        'losses': [5, 8, 6, 1, 5, 10, 6, 12]}

football = pd.DataFrame(data)

print(football['year'])
print('')
print(football.year) # shorthand for football['year']
print('')
print(football[['year', 'wins', 'losses']])

0    2010
1    2011
2    2012
3    2011
4    2012
5    2010
6    2011
7    2012
Name: year, dtype: int64

0    2010
1    2011
2    2012
3    2011
4    2012
5    2010
6    2011
7    2012
Name: year, dtype: int64

   year  wins  losses
0  2010    11       5
1  2011     8       8
2  2012    10       6
3  2011    15       1
4  2012    11       5
5  2010     6      10
6  2011    10       6
7  2012     4      12


Rows can be selected in several ways.

Some of the basic and common methods are:
   - 1) Slicing
   - 2) An individual index (through the indexers iloc or loc)
   - 3) Boolean indexing

You can also combine multiple selection requirements through boolean
operators like & (and) or | (or). Wrap each condition in parentheses,
because & and | bind more tightly than comparisons such as > or ==.

In [41]:
print(football.iloc[0]) # iloc grabs a row by position; a single index returns a Series (printed vertically)
print("")
print(football.iloc[[0]]) # .iloc[[row_index]] returns a one-row DataFrame (printed horizontally)

print("-----------------------------------------")
print(football[3:5])

print("-----------------------------------------")
print(football[football.wins > 10]) # returns only the rows where football.wins > 10 is True

print("-----------------------------------------")
print(football[(football.wins > 10) & (football.team == "Packers")])
      # returns only the rows where wins > 10 is True and the team value is "Packers"

losses        5
team      Bears
wins         11
year       2010
Name: 0, dtype: object

   losses   team  wins  year
0       5  Bears    11  2010
-----------------------------------------
   losses     team  wins  year
3       1  Packers    15  2011
4       5  Packers    11  2012
-----------------------------------------
   losses     team  wins  year
0       5    Bears    11  2010
3       1  Packers    15  2011
4       5  Packers    11  2012
-----------------------------------------
   losses     team  wins  year
3       1  Packers    15  2011
4       5  Packers    11  2012
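
The cell above only combines conditions with `&`. As a complementary sketch (the names `either` and `not_lions` are made up for the example), `|` keeps rows that match at least one condition, and `~` negates a condition:

```python
import pandas as pd

football = pd.DataFrame({
    'year': [2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012],
    'team': ['Bears', 'Bears', 'Bears', 'Packers', 'Packers', 'Lions', 'Lions', 'Lions'],
    'wins': [11, 8, 10, 15, 11, 6, 10, 4],
    'losses': [5, 8, 6, 1, 5, 10, 6, 12]})

# Rows that match either condition.
either = football[(football.wins > 10) | (football.team == "Lions")]

# ~ flips True and False: every team except the Lions.
not_lions = football[~(football.team == "Lions")]

print(either)
print(not_lions)
```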


# Pandas Vectorized Methods

In [0]:
import numpy
from pandas import DataFrame, Series


def avg_medal_count():
  
    countries = ['Russian Fed.', 'Norway', 'Canada', 'United States',
                 'Netherlands', 'Germany', 'Switzerland', 'Belarus',
                 'Austria', 'France', 'Poland', 'China', 'Korea', 
                 'Sweden', 'Czech Republic', 'Slovenia', 'Japan',
                 'Finland', 'Great Britain', 'Ukraine', 'Slovakia',
                 'Italy', 'Latvia', 'Australia', 'Croatia', 'Kazakhstan']

    gold = [13, 11, 10, 9, 8, 8, 6, 5, 4, 4, 4, 3, 3, 2, 2, 2, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
    silver = [11, 5, 10, 7, 7, 6, 3, 0, 8, 4, 1, 4, 3, 7, 4, 2, 4, 3, 1, 0, 0, 2, 2, 2, 1, 0]
    bronze = [9, 10, 5, 12, 9, 5, 2, 1, 5, 7, 1, 2, 2, 6, 2, 4, 3, 1, 2, 1, 0, 6, 2, 1, 0, 1]
    
    olympic_medal_counts = {'country_name':countries,
                            'gold': Series(gold),
                            'silver': Series(silver),
                            'bronze': Series(bronze)}    
    
    olympic_medal_counts_df = DataFrame(olympic_medal_counts)
    
    avg_medal_count = olympic_medal_counts_df[['gold', 'silver','bronze']].apply(numpy.mean)
    
    return avg_medal_count

In [43]:
avg_medal_count()

gold      3.807692
silver    3.730769
bronze    3.807692
dtype: float64
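
`apply(numpy.mean)` works, but pandas also provides the same column-wise reduction as a built-in method. The sketch below compares the two on a three-row frame (the frame `medals` and the result names are assumptions for the example, not the notebook's data):

```python
import numpy as np
import pandas as pd

medals = pd.DataFrame({'gold': [13, 11, 10],
                       'silver': [11, 5, 10],
                       'bronze': [9, 10, 5]})

via_apply = medals.apply(np.mean)   # calls np.mean once per column
via_method = medals.mean()          # built-in column-wise mean

print(via_method)
print(np.allclose(via_apply[via_method.index], via_method))
```

Other reductions such as `sum`, `min`, `max` and `std` are available on a DataFrame in the same way.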

# Matrix Multiplication

In [0]:
import numpy
from pandas import DataFrame, Series


def points():

    countries = ['Russian Fed.', 'Norway', 'Canada', 'United States',
                 'Netherlands', 'Germany', 'Switzerland', 'Belarus',
                 'Austria', 'France', 'Poland', 'China', 'Korea', 
                 'Sweden', 'Czech Republic', 'Slovenia', 'Japan',
                 'Finland', 'Great Britain', 'Ukraine', 'Slovakia',
                 'Italy', 'Latvia', 'Australia', 'Croatia', 'Kazakhstan']

    gold = [13, 11, 10, 9, 8, 8, 6, 5, 4, 4, 4, 3, 3, 2, 2, 2, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
    silver = [11, 5, 10, 7, 7, 6, 3, 0, 8, 4, 1, 4, 3, 7, 4, 2, 4, 3, 1, 0, 0, 2, 2, 2, 1, 0]
    bronze = [9, 10, 5, 12, 9, 5, 2, 1, 5, 7, 1, 2, 2, 6, 2, 4, 3, 1, 2, 1, 0, 6, 2, 1, 0, 1]
    
    olympic_medal_counts = {'country_name':countries,
                            'gold': Series(gold),
                            'silver': Series(silver),
                            'bronze': Series(bronze)}    
    olympic_medal_counts_df = DataFrame(olympic_medal_counts)
    
    medal_counts = olympic_medal_counts_df[['gold','silver','bronze']]
    points = numpy.dot(medal_counts,[4,2,1])

    olympic_points_df = DataFrame({'country_name': Series(countries), 'points': Series(points)})
    
    # Alternative using DataFrame.dot instead of numpy.dot:
    # olympic_medal_counts_df['points'] = medal_counts.dot([4, 2, 1])
    # olympic_points_df = olympic_medal_counts_df[['country_name', 'points']]
    
    return olympic_points_df

In [50]:
points()

    country_name  points
0   Russian Fed.      83
1         Norway      64
2         Canada      65
3  United States      62
4    Netherlands      55
5        Germany      49
6    Switzerland      32
7        Belarus      21
8        Austria      37
9         France      31
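
The commented-out lines in `points()` hint at a pandas-only route. A minimal sketch of that idea on the first three rows of the table (the frame `medals` is a shortened stand-in for illustration; the weights 4, 2 and 1 are the ones used in the cell above):

```python
import pandas as pd

medals = pd.DataFrame({
    'country_name': ['Russian Fed.', 'Norway', 'Canada'],
    'gold': [13, 11, 10],
    'silver': [11, 5, 10],
    'bronze': [9, 10, 5],
})

# DataFrame.dot multiplies each row by the weight vector and sums the result,
# e.g. 13*4 + 11*2 + 9*1 for the first row.
medals['points'] = medals[['gold', 'silver', 'bronze']].dot([4, 2, 1])
olympic_points = medals[['country_name', 'points']]
print(olympic_points)
```

The order of the weights has to match the order of the selected columns, since the product is positional.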
