In [1]:
import pandas as pd
import numpy as np

https://planetpython.org/

https://dataskeptic.com/ - podcast

In [10]:
k = pd.Series(np.arange(1,11,3),index=['a','b','c','d'])
k

a     1
b     4
c     7
d    10
dtype: int64

In Python, we have the None type to indicate a lack of data. But what do we do if we want a typed list, like the one inside a Series object? **Underneath, pandas does some type conversion. If we create a list of strings and one element is None, pandas inserts it as None and uses the object dtype for the underlying array.**

If we instead create a list of numbers with a None in it, as in `pd.Series([1,2,None])` below, you'll notice a couple of things.
**First, pandas inserts NaN, which is a different value from None. Second, pandas
sets the dtype of this Series to floating point numbers instead of object or ints.** That's
maybe a bit of a surprise - why not just leave this as an integer? Underneath, pandas
represents NaN as a floating point number, and because integers can be typecast to
floats, pandas went and converted our integers to floats. So when you're wondering why the
list of integers you put into a Series now holds floats, it's probably because there is some missing data.

NaN is similar to None, but it is a numeric value and is treated differently.


In [17]:
pd.Series([1,2,None])

0    1.0
1    2.0
2    NaN
dtype: float64

In [16]:
np.nan == None

False

In [19]:
np.nan == np.nan, np.isnan(np.nan)

(False, True)
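The conversion described above can be checked directly. This is a minimal sketch; the variable names are illustrative, and the nullable `Int64` dtype is not part of the original notes. It is shown only as one possible way to keep integers when a value is missing.

```python
import numpy as np
import pandas as pd

# A missing value forces an integer Series to float64
numbers = pd.Series([1, 2, None])
print(numbers.dtype)

# NaN is not equal to anything, including itself, so == cannot find it
print((numbers == np.nan).any())
# isna() is the reliable test for missing values
print(numbers.isna().tolist())

# Assumption: the nullable "Int64" extension dtype keeps integers and stores <NA>
nullable = pd.Series([1, 2, None], dtype="Int64")
print(nullable.dtype)
```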

Pandas has nice compatibility:
1. A Series can store almost anything: numbers, strings, tuples, lists, and so on.
2. A Series can be built from any iterable object.

Q: Can I store different types of data in one Series?

In [22]:
pd.Series({"ss":33,"dd":44,"ee":55}), pd.Series([[1,2,3],[2,3,4]]), pd.Series((1,2,3))

(ss    33
 dd    44
 ee    55
 dtype: int64, 0    [1, 2, 3]
 1    [2, 3, 4]
 dtype: object, 0    1
 1    2
 2    3
 dtype: int64)
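On the question of mixing types in one Series: a quick sketch suggests it is allowed, at the cost of falling back to the generic object dtype. The values below are made up for illustration.

```python
import pandas as pd

# Elements of different Python types in a single Series
mixed = pd.Series([1, "two", 3.0, (4, 5)])

print(mixed.dtype)                        # object: no single numeric type fits
print([type(v).__name__ for v in mixed])  # each element keeps its own type
```

Because an object Series stores plain Python objects, the fast vectorized numeric paths generally do not apply to it.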

In [27]:
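# With a dict, the index selects entries by key; a label missing from the dict ('d') becomes NaN.
# With a list (next cell), the lengths of data and index must match.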
pd.Series({"a":1,"b":2,"c":3},index=["a","b","d"])

a    1.0
b    2.0
d    NaN
dtype: float64

In [28]:
pd.Series([1,2],index=[1,2,3])

ValueError: Length of passed values is 2, index implies 3

loc : query by index

iloc : query by position

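**loc and iloc are attributes, not methods**, so use [ ] rather than ( ).

Pandas tries to make our code a bit more readable and provides a sort of smart syntax using the indexing operator directly on the Series itself. For instance, if you pass in an integer and the index has no integer labels, the operator has historically behaved as if you were querying via the iloc attribute; if that integer is one of the index labels, the lookup is label-based instead, as with loc. Recent pandas versions deprecate the positional fallback, which is one more reason to use loc and iloc explicitly.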

In [37]:
ex = pd.Series([1,2,3,4,5],index=['a','b','c','d',1])
ex,ex.loc['a':],ex.iloc[[2,3,4]], ex['a':]

(a    1
 b    2
 c    3
 d    4
 1    5
 dtype: int64, a    1
 b    2
 c    3
 d    4
 1    5
 dtype: int64, c    3
 d    4
 1    5
 dtype: int64, a    1
 b    2
 c    3
 d    4
 1    5
 dtype: int64)

In [38]:
# When the index contains integers, it is ambiguous whether [] means a label or a position
# (here 1 is treated as a label). Prefer loc or iloc to make the intent explicit.
ex[1]

5
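The ambiguity is easiest to see with an integer index whose labels do not match the positions. A minimal sketch with made-up values:

```python
import pandas as pd

# Integer labels deliberately out of order
s = pd.Series([10, 20, 30], index=[2, 0, 1])

by_label = s.loc[0]      # label 0 is attached to the second element
by_position = s.iloc[0]  # position 0 is the first element

print(by_label, by_position)
```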

Modern computers can do many tasks simultaneously, especially, 
but not only, tasks involving mathematics.
Pandas and the underlying numpy libraries support a method of computation called vectorization. 
Vectorization works with most of the functions in the numpy library, including the sum function.

In [40]:
np.sum(ex)

15

In [42]:
number = pd.Series(np.random.randint(0,1000,100000))
number.head()

0    451
1    744
2    426
3    815
4    174
dtype: int64

In [48]:
%%timeit -n 100
# %%timeit is an IPython magic function
# Type % and press Tab to list the available magics
total = 0
for num in number:
    total+=num
total/len(number)

11.1 ms ± 99.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [49]:
%%timeit -n 100
# Vectorization is the ability of a computer to execute multiple instructions at once
np.sum(number)

132 µs ± 2.94 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
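The two timed cells compute related quantities in different ways. The sketch below, which assumes a seeded generator and a smaller Series than the original, checks that the explicit loop and the vectorized call agree on the mean.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)          # seeded so the run is repeatable
numbers = pd.Series(rng.integers(0, 1000, 10_000))

# Explicit Python loop: one element at a time
total = 0
for value in numbers:
    total += value
loop_mean = total / len(numbers)

# Vectorized: the whole reduction runs in compiled NumPy code
vector_mean = numbers.mean()

print(loop_mean, vector_mean)
```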


In [50]:
# Broadcasting applies the operation to every value at once; much faster than updating the values one by one
number+=3
number.head()

0    454
1    747
2    429
3    818
4    177
dtype: int64

In [52]:
number[0:3] = 222
number.head()

0    222
1    222
2    222
3    818
4    177
dtype: int64

In [54]:
#Add new value
new = pd.Series([1,2,3,4,5],index=list('abcde'))
new
new['a']=3
new['f']=5
new

a    3
b    2
c    3
d    4
e    5
f    5
dtype: int64

In [55]:
# Index labels do not have to be unique.
new_dup = pd.Series([10,11,12],index=list('ggg'))
new_dup

g    10
g    11
g    12
dtype: int64

**append**

There are a couple of important considerations when using append. **First, Pandas will take 
the series and try to infer the best data types to use.** In this example, everything is an integer, 
so there are no problems here. **Second, the append method doesn't actually change the underlying Series
objects; it instead returns a new Series which is made up of the two appended together.** This is
a common pattern in pandas - by default returning a new object instead of modifying in place - and
one you should come to expect. By printing the original Series we can see that they haven't
changed.

In [56]:
# Note: Series.append was removed in pandas 2.0; there, use pd.concat([new, new_dup]) instead
new_app = new.append(new_dup)
new,new_dup,new_app

(a    3
 b    2
 c    3
 d    4
 e    5
 f    5
 dtype: int64, g    10
 g    11
 g    12
 dtype: int64, a     3
 b     2
 c     3
 d     4
 e     5
 f     5
 g    10
 g    11
 g    12
 dtype: int64)

In [57]:
new_app['g'] # Accessing the label 'g' returns every element whose index label is 'g'

g    10
g    11
g    12
dtype: int64
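`Series.append` was deprecated and then removed in pandas 2.0, so on current versions the same behaviour can be sketched with `pd.concat`. The Series names and values below are illustrative, not taken from the cells above.

```python
import pandas as pd

first = pd.Series([1, 2], index=["a", "b"])
second = pd.Series([10, 11], index=["g", "g"])

# concat returns a new Series and leaves both inputs untouched
combined = pd.concat([first, second])

print(combined)
print(len(first), len(second))   # inputs keep their original lengths
print(combined["g"].tolist())    # a duplicated label returns every match
```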

# DataFrame

In [1]:
import pandas as pd
import numpy as np

In [9]:
# A DataFrame is conceptually two-dimensional, similar to a 2D NumPy array
# Like Series, a DataFrame can be built from many kinds of input
pd.DataFrame(np.random.randint(10,100,(3,5)),columns=list('abcde'))

ex1 = {'a':1,'b':2,'c':3}
ex2 = {'a':33,'b':55,"c":'aa'}
ex3 = {'a':4,'b':1,'d':11}
pd.DataFrame([ex1,ex2])
pd.DataFrame([ex1,ex2,ex3])

Unnamed: 0,a,b,c,d
0,1,2,3,
1,33,55,aa,
2,4,1,,11.0


In [15]:
df = pd.DataFrame(np.random.randint(1,100,(200,20)))
df.head()
type(df.loc[:,2])
df.T # Transpose: rows become columns and columns become rows

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,190,191,192,193,194,195,196,197,198,199
0,70,15,85,26,78,46,24,31,27,71,...,34,28,4,96,97,53,97,6,78,78
1,15,67,17,42,57,23,66,6,70,8,...,69,78,6,25,45,98,67,89,15,26
2,53,63,79,40,52,8,7,88,27,64,...,33,15,74,33,98,35,6,17,61,15
3,75,5,64,67,58,55,84,73,33,76,...,15,58,58,36,57,55,92,44,39,36
4,75,61,72,5,40,56,34,52,50,77,...,5,40,51,95,43,45,44,78,98,48
5,30,46,42,10,7,44,32,70,77,88,...,63,82,52,1,38,54,17,76,78,47
6,62,13,9,84,56,21,14,23,99,53,...,82,43,77,34,39,53,75,71,11,2
7,54,34,30,55,12,54,43,14,76,53,...,54,57,76,71,77,82,55,80,91,6
8,68,27,78,39,15,2,1,62,49,2,...,76,40,55,29,8,86,69,9,62,56
9,53,34,26,95,6,20,62,54,66,88,...,90,58,46,54,69,66,54,7,26,63


In [18]:
df = pd.DataFrame(np.random.randint(1,100,(200,10)),columns=list('abcdefghij'))
df[5]

KeyError: 5

In [20]:
df['a'] # DataFrame [] indexing is label-based; positional access has to go through iloc
df.iloc[:,5]

0      12
1      27
2      60
3      56
4      27
       ..
195    41
196    58
197    68
198    89
199    32
Name: f, Length: 200, dtype: int64

In [34]:
print(type(df['c']))
print(type(df.loc[6:10]))
print(type(df.loc[6:10,'d']))
print(type(df.loc[7:10]['d']))

<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>


Chaining, by indexing on the return type of another index, can come with some costs and is
best avoided if you can use another approach. In particular, chaining tends to cause Pandas 
to return a copy of the DataFrame instead of a view on the DataFrame. 
For selecting data, this is not a big deal, though it might be slower than necessary. 
If you are changing data though this is an important distinction and can be a source of error.

In [36]:
#df.loc[6:10]['c':] #error
df.loc[6:10][['c','d']]
df.loc[6:10,['c','d']]

Unnamed: 0,c,d
6,65,18
7,91,29
8,80,75
9,9,8
10,98,25
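For changing data, a single `.loc` call with both the row and column selectors avoids the copy-versus-view question that chaining raises. A minimal sketch on a small made-up frame:

```python
import pandas as pd

frame = pd.DataFrame({"c": [1, 2, 3, 4], "d": [5, 6, 7, 8]})

# One .loc call: rows labelled 1 through 2 (inclusive), column 'c'
frame.loc[1:2, "c"] = 0

print(frame)
```

The chained form `frame.loc[1:2]["c"] = 0` is left out on purpose, since whether it reaches the original frame depends on the pandas version and settings.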


In [42]:
copy_df = df.copy()
df.drop(10) # drop returns a modified copy by default; df itself is unchanged
df
copy_df.drop('e',inplace=True,axis=1) 
# inplace=True modifies copy_df itself; axis chooses between rows (0) and columns (1)
copy_df
# del removes a column from the DataFrame in place
del copy_df['f']
copy_df.head()

Unnamed: 0,a,b,c,d,g,h,i,j
0,87,35,77,63,14,32,70,77
1,76,82,68,40,94,30,92,63
2,71,93,18,31,59,39,99,4
3,8,24,46,38,19,95,5,27
4,26,50,45,64,23,74,59,29


In [43]:
#Add column to dataframe
df['k'] = None
#Change value of column
df['j'] = 1
df
# <Q> How can I add a row to a DataFrame?
# <A> 1. Transpose, add a column, and transpose back
# <A> 2. Assign to a new row label with loc

Unnamed: 0,a,b,c,d,e,f,g,h,i,j,k
0,87,35,77,63,30,12,14,32,70,1,
1,76,82,68,40,60,27,94,30,92,1,
2,71,93,18,31,46,60,59,39,99,1,
3,8,24,46,38,83,56,19,95,5,1,
4,26,50,45,64,95,27,23,74,59,1,
5,11,94,64,21,11,93,75,71,76,1,
6,45,75,65,18,92,53,23,26,78,1,
7,64,34,91,29,45,14,92,74,64,1,
8,66,93,80,75,43,72,83,4,80,1,
9,45,18,9,8,61,76,45,26,37,1,
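The second answer above (adding a row with loc) can be sketched as follows. The `pd.concat` variant is an addition that does not appear in the notes, and the frame contents are made up.

```python
import pandas as pd

frame = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Assigning to a row label that does not exist yet appends a row
frame.loc[2] = [5, 6]

# Alternative: stack another DataFrame underneath; concat returns a new object
extra = pd.DataFrame({"a": [7], "b": [8]})
frame = pd.concat([frame, extra], ignore_index=True)

print(frame)
```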


Load -> Clean and Manipulate -> Data analysis and Modeling

In [6]:
# In IPython, prefix a line with '!' to run it as a shell command
!cat datasets/Admission_Predict.csv
!ls ./datasets/

Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR ,CGPA,Research,Chance of Admit 
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4,4.5,8.87,1,0.76
3,316,104,3,3,3.5,8,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
5,314,103,2,2,3,8.21,0,0.65
6,330,115,5,4.5,3,9.34,1,0.9
7,321,109,3,3,4,8.2,1,0.75
8,308,101,2,3,4,7.9,0,0.68
9,302,102,1,2,1.5,8,0,0.5
10,323,108,3,3.5,3,8.6,0,0.45
11,325,106,3,3.5,4,8.4,1,0.52
12,327,111,4,4,4.5,9,1,0.84
13,328,112,4,4,4.5,9.1,1,0.78
14,307,109,3,4,3,8,1,0.62
15,311,104,3,3.5,2,8.2,1,0.61
16,314,105,3,3.5,2.5,8.3,0,0.54
17,317,107,3,4,3,8.7,0,0.66
18,319,106,3,4,3,8,1,0.65
19,318,110,3,4,3,8.8,0,0.63
20,303,102,3,3.5,3,8.5,0,0.62
21,312,107,3,3,2,7.9,1,0.64
22,325,114,4,3,2,8.4,0,0.7
23,328,116,5,5,5,9.5,1,0.94
24,334,119,5,5,4.5,9.7,1,0.95
25,336,119,5,4,3.5,9.8,1,0.97
26,340,120,5,4.5,4.5,9.6,1,0.94
27,322,109,5,4.5,3.5,8.8,0,0.76
28,298,98,2,1.5,2.5,7.5,1,0.44
29,295,93,1,2,2,7.2,0,0.46
30,310,99

400,333,117,4,5,4,9.66,1,0.95Admission_Predict.csv  census.csv  class_grades.csv  log.csv  presidents.csv


In [7]:
df = pd.read_csv('datasets/Admission_Predict.csv')
df.head()

Unnamed: 0,Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
0,1,337,118,4,4.5,4.5,9.65,1,0.92
1,2,324,107,4,4.0,4.5,8.87,1,0.76
2,3,316,104,3,3.0,3.5,8.0,1,0.72
3,4,322,110,3,3.5,2.5,8.67,1,0.8
4,5,314,103,2,2.0,3.0,8.21,0,0.65


In [11]:
df2 = pd.read_csv('datasets/Admission_Predict.csv',index_col=0)
df2.head()

Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
5,314,103,2,2.0,3.0,8.21,0,0.65


In [27]:
new = list(df2.columns)
new[3] = 'Statement of Purpose'
new
df3 = df2.rename(columns = dict(zip(df2.columns,new)))
df3.head()

Serial No.,GRE Score,TOEFL Score,University Rating,Statement of Purpose,LOR,CGPA,Research,Chance of Admit
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
5,314,103,2,2.0,3.0,8.21,0,0.65


In [26]:
# rename does not change the original DataFrame; it returns a new one
df2.head()
df4 = df2.rename(mapper=str.strip,axis='columns')

Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
5,314,103,2,2.0,3.0,8.21,0,0.65


In [31]:
# map with a lambda, or a list comprehension, applies the same cleanup to every column name
# using map and lambda
new = list(map(lambda x:x.lower().strip(),list(df2.columns)))
# using a list comprehension (same result)
new = [x.lower().strip() for x in list(df2.columns)]
df5 = df2.rename(columns=dict(zip(df2.columns,new)))
df5.head()

Serial No.,gre score,toefl score,university rating,sop,lor,cgpa,research,chance of admit
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
5,314,103,2,2.0,3.0,8.21,0,0.65
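Since `rename` accepts a callable, the cleanup can also be written without building the `dict(zip(...))` mapping. A minimal sketch on a one-row frame that reuses three of the column names (including their trailing spaces) from the CSV header:

```python
import pandas as pd

raw = pd.DataFrame({"GRE Score": [337], "LOR ": [4.5], "Chance of Admit ": [0.92]})

# The function is applied to every column label; rename returns a new DataFrame
cleaned = raw.rename(columns=lambda name: name.strip().lower())

print(list(cleaned.columns))
print(list(raw.columns))   # the original labels are untouched
```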


### Querying Dataframe

1. Boolean masking

In [35]:
df = pd.read_csv('datasets/Admission_Predict.csv',index_col=0)
df.columns = list(map(lambda x:x.lower().strip(),list(df.columns)))
# An Index is immutable: you cannot assign to individual entries of df.columns, but you can replace df.columns as a whole
df.head()

Serial No.,gre score,toefl score,university rating,sop,lor,cgpa,research,chance of admit
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
5,314,103,2,2.0,3.0,8.21,0,0.65


In [37]:
mask = df['chance of admit'] > 0.80
df[mask]

Serial No.,gre score,toefl score,university rating,sop,lor,cgpa,research,chance of admit
1,337,118,4,4.5,4.5,9.65,1,0.92
6,330,115,5,4.5,3.0,9.34,1,0.90
12,327,111,4,4.0,4.5,9.00,1,0.84
23,328,116,5,5.0,5.0,9.50,1,0.94
24,334,119,5,5.0,4.5,9.70,1,0.95
25,336,119,5,4.0,3.5,9.80,1,0.97
26,340,120,5,4.5,4.5,9.60,1,0.94
33,338,118,4,3.0,4.5,9.40,1,0.91
34,340,114,5,4.0,4.0,9.60,1,0.90
35,331,112,5,4.0,5.0,9.80,1,0.94


In [38]:
df.where(mask).head()

# We see that the resulting data frame keeps the original indexed values, and only data which met 
# the condition was retained. All of the rows which did not meet the condition have NaN data instead,
# but these rows were not dropped from our dataset. 

Serial No.,gre score,toefl score,university rating,sop,lor,cgpa,research,chance of admit
1,337.0,118.0,4.0,4.5,4.5,9.65,1.0,0.92
2,,,,,,,,
3,,,,,,,,
4,,,,,,,,
5,,,,,,,,


In [40]:
df.where(mask).dropna().head()
#where() is not used that often

Serial No.,gre score,toefl score,university rating,sop,lor,cgpa,research,chance of admit
1,337.0,118.0,4.0,4.5,4.5,9.65,1.0,0.92
6,330.0,115.0,5.0,4.5,3.0,9.34,1.0,0.9
12,327.0,111.0,4.0,4.0,4.5,9.0,1.0,0.84
23,328.0,116.0,5.0,5.0,5.0,9.5,1.0,0.94
24,334.0,119.0,5.0,5.0,4.5,9.7,1.0,0.95


In [44]:
df['gre score']
df[['gre score','lor']]
df[df['gre score'] > 325.0] # boolean indexing: keeps the matching rows, similar to loc with a mask or where() followed by dropna()

Serial No.,gre score,toefl score,university rating,sop,lor,cgpa,research,chance of admit
1,337,118,4,4.5,4.5,9.65,1,0.92
6,330,115,5,4.5,3.0,9.34,1,0.90
12,327,111,4,4.0,4.5,9.00,1,0.84
13,328,112,4,4.0,4.5,9.10,1,0.78
23,328,116,5,5.0,5.0,9.50,1,0.94
24,334,119,5,5.0,4.5,9.70,1,0.95
25,336,119,5,4.0,3.5,9.80,1,0.97
26,340,120,5,4.5,4.5,9.60,1,0.94
32,327,103,3,4.0,4.0,8.30,1,0.74
33,338,118,4,3.0,4.5,9.40,1,0.91


In [49]:
df[df['gre score']>325.0 and df['chance of admit'] > 0.8]

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

This doesn't work. And despite using pandas for a while, I still find I regularly try to do this. The
problem is that you have Series objects, and **Python underneath doesn't know how to compare two Series using
`and` or `or`.** Instead, the pandas authors have overloaded the pipe | and ampersand & operators to handle this
for us.

In [52]:
df[(df['gre score']>325.0) & (df['chance of admit'] > 0.8)]

Serial No.,gre score,toefl score,university rating,sop,lor,cgpa,research,chance of admit
1,337,118,4,4.5,4.5,9.65,1,0.92
6,330,115,5,4.5,3.0,9.34,1,0.90
12,327,111,4,4.0,4.5,9.00,1,0.84
23,328,116,5,5.0,5.0,9.50,1,0.94
24,334,119,5,5.0,4.5,9.70,1,0.95
25,336,119,5,4.0,3.5,9.80,1,0.97
26,340,120,5,4.5,4.5,9.60,1,0.94
33,338,118,4,3.0,4.5,9.40,1,0.91
34,340,114,5,4.0,4.0,9.60,1,0.90
35,331,112,5,4.0,5.0,9.80,1,0.94


One thing to watch out for is order of operations! A common error for new pandas users is
to try to do boolean comparisons using the & operator without putting parentheses around
the individual terms they are interested in.

**The problem is that, in an expression such as `df['chance of admit'] > 0.7 & df['chance of admit'] < 0.9`, Python is trying to bitwise and a 0.7 and a pandas Series**, because & binds more tightly than > and <, when you really want
to bitwise and the two broadcasted boolean results together.
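A small sketch of the precedence problem, using a made-up Series of admission chances. The exact exception type raised by the unparenthesized form may vary between pandas versions, so the code catches both likely candidates.

```python
import pandas as pd

chance = pd.Series([0.92, 0.76, 0.72, 0.65])

try:
    # Parsed as chance > (0.7 & chance) < 0.9 because & binds tighter than > and <
    chance > 0.7 & chance < 0.9
    failed = False
except (TypeError, ValueError):
    failed = True

# Parentheses turn each comparison into a boolean Series before & combines them
both = (chance > 0.7) & (chance < 0.9)

print(failed, both.tolist())
```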

50% or more of the work you'll be doing in data cleaning involves querying DataFrames.

## Indexing Dataframe

In [53]:
df = pd.read_csv('datasets/Admission_Predict.csv',index_col=0)
df.columns = list(map(lambda x:x.lower().strip(),list(df.columns)))

In [59]:
df.head()
df['serial number'] = df.index
df.head()

Serial No.,gre score,toefl score,university rating,sop,lor,cgpa,research,chance of admit,serial number
1,337,118,4,4.5,4.5,9.65,1,0.92,1
2,324,107,4,4.0,4.5,8.87,1,0.76,2
3,316,104,3,3.0,3.5,8.0,1,0.72,3
4,322,110,3,3.5,2.5,8.67,1,0.8,4
5,314,103,2,2.0,3.0,8.21,0,0.65,5


In [60]:
#set_index doesn't change original dataframe
df.set_index('chance of admit').head()

chance of admit,gre score,toefl score,university rating,sop,lor,cgpa,research,serial number
0.92,337,118,4,4.5,4.5,9.65,1,1
0.76,324,107,4,4.0,4.5,8.87,1,2
0.72,316,104,3,3.0,3.5,8.0,1,3
0.8,322,110,3,3.5,2.5,8.67,1,4
0.65,314,103,2,2.0,3.0,8.21,0,5


In [62]:
df = df.set_index('chance of admit')
df.reset_index().head() # reset_index returns a new DataFrame (df itself is unchanged) and moves the index back into a column

Unnamed: 0,chance of admit,gre score,toefl score,university rating,sop,lor,cgpa,research,serial number
0,0.92,337,118,4,4.5,4.5,9.65,1,1
1,0.76,324,107,4,4.0,4.5,8.87,1,2
2,0.72,316,104,3,3.0,3.5,8.0,1,3
3,0.8,322,110,3,3.5,2.5,8.67,1,4
4,0.65,314,103,2,2.0,3.0,8.21,0,5


One nice feature of Pandas is multi-level indexing. This is similar to composite keys in 
relational database systems. To create a multi-level index, we simply call set_index and 
give it a list of columns that we're interested in promoting to an index.

Pandas will search through these in order, finding the distinct data and forming composite indices.
A good example of this is often found when dealing with geographical data which is sorted by 
regions or demographics.

In [66]:
df = pd.read_csv('datasets/census.csv')
df.columns = list(map(lambda x:x.lower().strip(),list(df.columns)))
df.head()

Unnamed: 0,sumlev,region,division,state,county,stname,ctyname,census2010pop,estimatesbase2010,popestimate2010,...,rdomesticmig2011,rdomesticmig2012,rdomesticmig2013,rdomesticmig2014,rdomesticmig2015,rnetmig2011,rnetmig2012,rnetmig2013,rnetmig2014,rnetmig2015
0,40,3,6,1,0,Alabama,Alabama,4779736,4780127,4785161,...,0.002295,-0.193196,0.381066,0.582002,-0.467369,1.030015,0.826644,1.383282,1.724718,0.712594
1,50,3,6,1,1,Alabama,Autauga County,54571,54571,54660,...,7.242091,-2.915927,-3.012349,2.265971,-2.530799,7.606016,-2.626146,-2.722002,2.59227,-2.187333
2,50,3,6,1,3,Alabama,Baldwin County,182265,182265,183193,...,14.83296,17.647293,21.845705,19.243287,17.197872,15.844176,18.559627,22.727626,20.317142,18.293499
3,50,3,6,1,5,Alabama,Barbour County,27457,27457,27341,...,-4.728132,-2.50069,-7.056824,-3.904217,-10.543299,-4.874741,-2.758113,-7.167664,-3.978583,-10.543299
4,50,3,6,1,7,Alabama,Bibb County,22915,22919,22861,...,-5.527043,-5.068871,-6.201001,-0.177537,0.177258,-5.088389,-4.363636,-5.403729,0.754533,1.107861


In [79]:
df_temp = df.set_index('sumlev')
df_temp
df_temp.loc[50,:]

sumlev,region,division,state,county,stname,ctyname,census2010pop,estimatesbase2010,popestimate2010,popestimate2011,...,rdomesticmig2011,rdomesticmig2012,rdomesticmig2013,rdomesticmig2014,rdomesticmig2015,rnetmig2011,rnetmig2012,rnetmig2013,rnetmig2014,rnetmig2015
50,3,6,1,1,Alabama,Autauga County,54571,54571,54660,55253,...,7.242091,-2.915927,-3.012349,2.265971,-2.530799,7.606016,-2.626146,-2.722002,2.592270,-2.187333
50,3,6,1,3,Alabama,Baldwin County,182265,182265,183193,186659,...,14.832960,17.647293,21.845705,19.243287,17.197872,15.844176,18.559627,22.727626,20.317142,18.293499
50,3,6,1,5,Alabama,Barbour County,27457,27457,27341,27226,...,-4.728132,-2.500690,-7.056824,-3.904217,-10.543299,-4.874741,-2.758113,-7.167664,-3.978583,-10.543299
50,3,6,1,7,Alabama,Bibb County,22915,22919,22861,22733,...,-5.527043,-5.068871,-6.201001,-0.177537,0.177258,-5.088389,-4.363636,-5.403729,0.754533,1.107861
50,3,6,1,9,Alabama,Blount County,57322,57322,57373,57711,...,1.807375,-1.177622,-1.748766,-2.062535,-1.369970,1.859511,-0.848580,-1.402476,-1.577232,-0.884411
50,3,6,1,11,Alabama,Bullock County,10914,10915,10887,10629,...,-30.953709,-5.180127,-1.130263,14.354290,-16.167247,-29.001673,-2.825524,1.507017,17.243790,-13.193961
50,3,6,1,13,Alabama,Butler County,20947,20946,20944,20673,...,-14.032727,-11.684234,-5.655413,1.085428,-6.529805,-13.936612,-11.586865,-5.557058,1.184103,-6.430868
50,3,6,1,15,Alabama,Calhoun County,118572,118586,118437,117768,...,-6.155670,-4.611706,-5.524649,-4.463211,-3.376322,-5.791579,-4.092677,-5.062836,-3.912834,-2.806406
50,3,6,1,17,Alabama,Chambers County,34215,34170,34098,33993,...,-2.731639,3.849092,2.872721,-2.287222,1.349468,-1.821092,4.701181,3.781439,-1.290228,2.346901
50,3,6,1,19,Alabama,Cherokee County,25989,25986,25976,26080,...,6.339327,1.113180,5.488706,-0.076806,-3.239866,6.416167,1.420264,5.757384,0.230419,-2.931307


In [77]:
df['sumlev'].unique()
df = df[df['sumlev']==50]
df.head()

Unnamed: 0,sumlev,region,division,state,county,stname,ctyname,census2010pop,estimatesbase2010,popestimate2010,...,rdomesticmig2011,rdomesticmig2012,rdomesticmig2013,rdomesticmig2014,rdomesticmig2015,rnetmig2011,rnetmig2012,rnetmig2013,rnetmig2014,rnetmig2015
1,50,3,6,1,1,Alabama,Autauga County,54571,54571,54660,...,7.242091,-2.915927,-3.012349,2.265971,-2.530799,7.606016,-2.626146,-2.722002,2.59227,-2.187333
2,50,3,6,1,3,Alabama,Baldwin County,182265,182265,183193,...,14.83296,17.647293,21.845705,19.243287,17.197872,15.844176,18.559627,22.727626,20.317142,18.293499
3,50,3,6,1,5,Alabama,Barbour County,27457,27457,27341,...,-4.728132,-2.50069,-7.056824,-3.904217,-10.543299,-4.874741,-2.758113,-7.167664,-3.978583,-10.543299
4,50,3,6,1,7,Alabama,Bibb County,22915,22919,22861,...,-5.527043,-5.068871,-6.201001,-0.177537,0.177258,-5.088389,-4.363636,-5.403729,0.754533,1.107861
5,50,3,6,1,9,Alabama,Blount County,57322,57322,57373,...,1.807375,-1.177622,-1.748766,-2.062535,-1.36997,1.859511,-0.84858,-1.402476,-1.577232,-0.884411


In [106]:
df.shape
df.loc[[6,50000],:]

Unnamed: 0,sumlev,region,division,state,county,stname,ctyname,census2010pop,estimatesbase2010,popestimate2010,...,rdomesticmig2011,rdomesticmig2012,rdomesticmig2013,rdomesticmig2014,rdomesticmig2015,rnetmig2011,rnetmig2012,rnetmig2013,rnetmig2014,rnetmig2015
6,50.0,3.0,6.0,1.0,11.0,Alabama,Bullock County,10914.0,10915.0,10887.0,...,-30.953709,-5.180127,-1.130263,14.35429,-16.167247,-29.001673,-2.825524,1.507017,17.24379,-13.193961
50000,,,,,,,,,,,...,,,,,,,,,,


In [72]:
True in df['stname'].isna(), True in df['ctyname'].isna()
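# Why does this return True?
# Because `in` on a Series tests the index labels, not the values, and True == 1 matches the row label 1 (in the pandas version used here).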

(True, True)

In [73]:
True in list(df['stname'].isna()), True in list(df['ctyname'].isna())

(False, False)
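The two cells above differ because `in` looks at a Series' index rather than its values. A minimal sketch with string labels, chosen here so that the two cannot be confused:

```python
import pandas as pd

names = pd.Series(["Alabama", None], index=["first", "second"])

print("first" in names)             # membership is checked against the index labels
print("Alabama" in names)           # the values are not consulted
print("Alabama" in names.tolist())  # check the values explicitly
print(names.isna().any())           # idiomatic test for "is anything missing?"
```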

In [81]:
df2 = df.set_index(['stname','ctyname'])
df2.head()
# <Q> Is there a multi-level equivalent for columns?
# <Q> What happens with duplicate index entries?

stname,ctyname,sumlev,region,division,state,county,census2010pop,estimatesbase2010,popestimate2010,popestimate2011,popestimate2012,...,rdomesticmig2011,rdomesticmig2012,rdomesticmig2013,rdomesticmig2014,rdomesticmig2015,rnetmig2011,rnetmig2012,rnetmig2013,rnetmig2014,rnetmig2015
Alabama,Autauga County,50,3,6,1,1,54571,54571,54660,55253,55175,...,7.242091,-2.915927,-3.012349,2.265971,-2.530799,7.606016,-2.626146,-2.722002,2.59227,-2.187333
Alabama,Baldwin County,50,3,6,1,3,182265,182265,183193,186659,190396,...,14.83296,17.647293,21.845705,19.243287,17.197872,15.844176,18.559627,22.727626,20.317142,18.293499
Alabama,Barbour County,50,3,6,1,5,27457,27457,27341,27226,27159,...,-4.728132,-2.50069,-7.056824,-3.904217,-10.543299,-4.874741,-2.758113,-7.167664,-3.978583,-10.543299
Alabama,Bibb County,50,3,6,1,7,22915,22919,22861,22733,22642,...,-5.527043,-5.068871,-6.201001,-0.177537,0.177258,-5.088389,-4.363636,-5.403729,0.754533,1.107861
Alabama,Blount County,50,3,6,1,9,57322,57322,57373,57711,57776,...,1.807375,-1.177622,-1.748766,-2.062535,-1.36997,1.859511,-0.84858,-1.402476,-1.577232,-0.884411


An immediate question which comes up is how we can query this DataFrame. We saw previously that 
the loc attribute of the DataFrame can take multiple arguments. And it could query both the 
row and the columns. When you use a MultiIndex, **you must provide the arguments in order by the 
level you wish to query. Inside of the index, each column is called a level and the outermost 
column is level zero.** 

In [89]:
df2.loc['Michigan','Washtenaw County']

sumlev          50.000000
region           2.000000
division         3.000000
state           26.000000
county         161.000000
                  ...    
rnetmig2011      5.191395
rnetmig2012      1.248106
rnetmig2013      4.226778
rnetmig2014      3.801394
rnetmig2015      0.595048
Name: (Michigan, Washtenaw County), Length: 98, dtype: float64

In [85]:
df2.loc['Michigan','Washtenaw County','region':]

UnsortedIndexError: 'MultiIndex slicing requires the index to be lexsorted: slicing on levels [2], lexsort depth 1'

In [88]:
df2.loc[('Michigan','Washtenaw County'),'division':]

division                  3.000000
state                    26.000000
county                  161.000000
census2010pop        344791.000000
estimatesbase2010    345066.000000
                         ...      
rnetmig2011               5.191395
rnetmig2012               1.248106
rnetmig2013               4.226778
rnetmig2014               3.801394
rnetmig2015               0.595048
Name: (Michigan, Washtenaw County), Length: 96, dtype: float64
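The query patterns above can be tried on a small frame. This is a sketch that reuses three rows and two data columns from the census table; calling `sort_index` first is an assumption about how to avoid the `UnsortedIndexError`, since slicing on a MultiIndex generally expects a sorted index.

```python
import pandas as pd

census = pd.DataFrame({
    "stname": ["Michigan", "Michigan", "Alabama"],
    "ctyname": ["Wayne County", "Washtenaw County", "Bibb County"],
    "region": [2, 2, 3],
    "census2010pop": [1820584, 344791, 22915],
})

# Promote two columns to a MultiIndex, then sort it lexicographically
indexed = census.set_index(["stname", "ctyname"]).sort_index()

# Full key as a tuple: one row, returned as a Series
row = indexed.loc[("Michigan", "Washtenaw County"), :]

# Outer level only: every Michigan county, indexed by ctyname
michigan = indexed.loc["Michigan"]

# Tuple for the rows plus a label slice for the columns
pops = indexed.loc[("Michigan", "Washtenaw County"), "census2010pop":]

print(row)
print(michigan)
print(pops)
```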

In [93]:
df2.loc['Michigan','region':]

ctyname,region,division,state,county,census2010pop,estimatesbase2010,popestimate2010,popestimate2011,popestimate2012,popestimate2013,...,rdomesticmig2011,rdomesticmig2012,rdomesticmig2013,rdomesticmig2014,rdomesticmig2015,rnetmig2011,rnetmig2012,rnetmig2013,rnetmig2014,rnetmig2015
Alcona County,2,3,26,1,10942,10942,10890,10775,10607,10572,...,0.276944,-2.899635,8.782284,-2.949572,3.077367,0.461574,-2.525489,9.160017,-2.568982,3.462038
Alger County,2,3,26,3,9601,9601,9564,9554,9496,9497,...,1.150748,0.314961,4.527984,-0.211093,-2.442262,1.359975,0.629921,4.843890,0.105546,-2.123706
Allegan County,2,3,26,5,111408,111408,111502,111530,111898,112391,...,-6.035008,-0.913046,0.267512,7.703397,3.056470,-5.604577,-0.546037,0.677697,8.269433,3.634485
Alpena County,2,3,26,7,29598,29598,29539,29342,29219,29026,...,-3.498582,-0.034152,-3.262083,0.172479,-2.770323,-3.294781,0.204915,-2.987381,0.448446,-2.493291
Antrim County,2,3,26,9,23580,23580,23499,23379,23337,23220,...,-3.797090,-0.342495,-0.730288,3.099240,-0.819018,-3.711762,-0.128436,-0.515497,3.357510,-0.560381
Arenac County,2,3,26,11,15899,15899,15854,15620,15496,15419,...,-11.310923,-3.085229,1.164483,0.195154,2.288554,-11.183834,-2.956678,1.293870,0.325256,2.419328
Baraga County,2,3,26,13,8860,8860,8841,8820,8715,8691,...,1.472170,-9.580838,-1.608641,-3.115265,-6.969451,2.151634,-8.896493,-0.804320,-2.307604,-6.156348
Barry County,2,3,26,15,59173,59175,59080,58970,59070,59140,...,-4.760695,-0.847170,-0.372219,1.267117,0.151831,-4.591275,-0.609963,-0.135352,1.520540,0.404882
Bay County,2,3,26,17,107771,107771,107695,107497,107121,106958,...,-2.444329,-3.075231,-1.270559,-5.853274,-5.162447,-1.914569,-2.441547,-0.653964,-5.130995,-4.445178
Benzie County,2,3,26,19,17525,17525,17507,17436,17390,17398,...,-2.403915,0.803997,3.966885,9.913188,-2.173789,-2.461151,0.746569,3.909394,9.855886,-2.230994


In [98]:
df2.loc[[('Michigan','Washtenaw County'),('Michigan','Wayne County')],'region':]

stname,ctyname,region,division,state,county,census2010pop,estimatesbase2010,popestimate2010,popestimate2011,popestimate2012,popestimate2013,...,rdomesticmig2011,rdomesticmig2012,rdomesticmig2013,rdomesticmig2014,rdomesticmig2015,rnetmig2011,rnetmig2012,rnetmig2013,rnetmig2014,rnetmig2015
Michigan,Washtenaw County,2,3,26,161,344791,345066,345563,349048,351213,354289,...,0.129569,-4.309822,-1.780293,-2.955078,-6.078985,5.191395,1.248106,4.226778,3.801394,0.595048
Michigan,Wayne County,2,3,26,163,1820584,1820641,1815199,1801273,1792514,1775713,...,-13.340073,-10.271616,-14.119617,-11.903253,-8.762835,-11.344758,-8.098421,-11.732437,-9.161648,-6.010195


In [99]:
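df2.loc[[('michigan','Washtenaw County'),('Michigan','Wayne County')],'region':]
# Interesting: this can even represent data that doesn't exist
# (the unknown key 'michigan' comes back as a row of NaN).
# Newer pandas versions may raise a KeyError for missing labels instead.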

Unnamed: 0_level_0,Unnamed: 1_level_0,region,division,state,county,census2010pop,estimatesbase2010,popestimate2010,popestimate2011,popestimate2012,popestimate2013,...,rdomesticmig2011,rdomesticmig2012,rdomesticmig2013,rdomesticmig2014,rdomesticmig2015,rnetmig2011,rnetmig2012,rnetmig2013,rnetmig2014,rnetmig2015
stname,ctyname,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
michigan,Washtenaw County,,,,,,,,,,,...,,,,,,,,,,
Michigan,Wayne County,2.0,3.0,26.0,163.0,1820584.0,1820641.0,1815199.0,1801273.0,1792514.0,1775713.0,...,-13.340073,-10.271616,-14.119617,-11.903253,-8.762835,-11.344758,-8.098421,-11.732437,-9.161648,-6.010195


In [108]:
df2.T.head()

stname,Alabama,Alabama,Alabama,Alabama,Alabama,Alabama,Alabama,Alabama,Alabama,Alabama,...,Wyoming,Wyoming,Wyoming,Wyoming,Wyoming,Wyoming,Wyoming,Wyoming,Wyoming,Wyoming
ctyname,Autauga County,Baldwin County,Barbour County,Bibb County,Blount County,Bullock County,Butler County,Calhoun County,Chambers County,Cherokee County,...,Niobrara County,Park County,Platte County,Sheridan County,Sublette County,Sweetwater County,Teton County,Uinta County,Washakie County,Weston County
sumlev,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0,...,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0
region,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,...,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0
division,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,...,8.0,8.0,8.0,8.0,8.0,8.0,8.0,8.0,8.0,8.0
state,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,56.0,56.0,56.0,56.0,56.0,56.0,56.0,56.0,56.0,56.0
county,1.0,3.0,5.0,7.0,9.0,11.0,13.0,15.0,17.0,19.0,...,27.0,29.0,31.0,33.0,35.0,37.0,39.0,41.0,43.0,45.0


## Control Missing Values

We've seen a preview of how pandas handles missing values using the None type and NumPy NaN values. Missing
values are pretty common in data cleaning activities. They can be there for any number of reasons, and I
just want to touch on a few here.

For instance, if you are running a survey and a respondent didn't answer a question, the missing value is
actually an omission. This kind of missing data is called **Missing at Random** if there are other variables
that might be used to predict the variable which is missing. In my work, when I deliver surveys, I often find
that missing data, say the interest in being involved in a follow-up study, has some correlation with
another data field, like gender or ethnicity. If there is no relationship to other variables, then we call
this data **Missing Completely at Random (MCAR)**.

These are just two examples of missing data, and there are many more. For instance, data might be missing
because it wasn't collected, either by the process responsible for collecting that data, such as a
researcher, or because it wouldn't make sense if it were collected. This last example is extremely common
when you start joining DataFrames together from multiple sources, such as joining a list of people at a
university with a list of offices in the university (students generally don't have offices).
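
As a minimal sketch of the join case described above (the names, roles, and office numbers are made up for illustration), a left join keeps every person and leaves NaN where no office exists:

```python
import pandas as pd

# Hypothetical data: names, roles and office numbers are invented for this sketch.
people = pd.DataFrame({
    "name": ["Ana", "Ben", "Chloe"],
    "role": ["staff", "student", "staff"],
})
offices = pd.DataFrame({
    "name": ["Ana", "Chloe"],
    "office": ["B101", "C220"],
})

# A left join keeps every person; people without an office get NaN.
joined = people.merge(offices, on="name", how="left")
print(joined)
```

Here the missing office for the student is not an error in the data; the value simply should not exist.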

Pandas is pretty good at detecting missing values directly from underlying data formats, like CSV files.
Although most missing values are formatted as NaN, NULL, None, or N/A, sometimes missing values are
not labeled so clearly. For example, I've worked with social scientists who regularly used the value of 99
in binary categories to indicate a missing value. The pandas read_csv() function has a parameter called
na_values to let us specify the form of missing values. It accepts a scalar, a string, a list, or a
dictionary.
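
A small sketch of na_values, assuming a made-up survey extract in which 99 is the code for "no answer" (the column names and values are hypothetical). It contrasts a list, which applies to every column, with a dictionary, which limits the rule to named columns:

```python
import io
import pandas as pd

# Hypothetical survey extract; 99 is used as the "no answer" code for smoker.
csv_text = "respondent,smoker,age\n1,0,34\n2,99,41\n3,1,99\n"

# A list applies to every column, so the genuine age of 99 is lost as well.
everywhere = pd.read_csv(io.StringIO(csv_text), na_values=[99])

# A dictionary restricts the rule to the named column.
smoker_only = pd.read_csv(io.StringIO(csv_text), na_values={"smoker": [99]})

print(everywhere)
print(smoker_only)
```

The dictionary form is usually the safer choice when a sentinel such as 99 is a legitimate value in other columns.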

In [2]:
import pandas as pd
import numpy as np

In [112]:
df = pd.read_csv('datasets/class_grades.csv')
df

Unnamed: 0,Prefix,Assignment,Tutorial,Midterm,TakeHome,Final
0,5,57.14,34.09,64.38,51.48,52.50
1,8,95.05,105.49,67.50,99.07,68.33
2,8,83.70,83.17,,63.15,48.89
3,7,,,49.38,105.93,80.56
4,8,91.32,93.64,95.00,107.41,73.89
5,7,95.00,92.58,93.12,97.78,68.06
6,8,95.05,102.99,56.25,99.07,50.00
7,7,72.85,86.85,60.00,,56.11
8,8,84.26,93.10,47.50,18.52,50.83
9,7,90.10,97.55,51.25,88.89,63.61


In [113]:
df.isnull()

Unnamed: 0,Prefix,Assignment,Tutorial,Midterm,TakeHome,Final
0,False,False,False,False,False,False
1,False,False,False,False,False,False
2,False,False,False,True,False,False
3,False,True,True,False,False,False
4,False,False,False,False,False,False
5,False,False,False,False,False,False
6,False,False,False,False,False,False
7,False,False,False,False,True,False
8,False,False,False,False,False,False
9,False,False,False,False,False,False


In [114]:
df.dropna().head(20)

Unnamed: 0,Prefix,Assignment,Tutorial,Midterm,TakeHome,Final
0,5,57.14,34.09,64.38,51.48,52.5
1,8,95.05,105.49,67.5,99.07,68.33
4,8,91.32,93.64,95.0,107.41,73.89
5,7,95.0,92.58,93.12,97.78,68.06
6,8,95.05,102.99,56.25,99.07,50.0
8,8,84.26,93.1,47.5,18.52,50.83
9,7,90.1,97.55,51.25,88.89,63.61
10,7,80.44,90.2,75.0,91.48,39.72
12,8,97.16,103.71,72.5,93.52,63.33
13,7,91.28,83.53,81.25,99.81,92.22


In [116]:
df.head(10)
df2 = df.fillna(0)
df2.head(10)

Unnamed: 0,Prefix,Assignment,Tutorial,Midterm,TakeHome,Final
0,5,57.14,34.09,64.38,51.48,52.5
1,8,95.05,105.49,67.5,99.07,68.33
2,8,83.7,83.17,0.0,63.15,48.89
3,7,0.0,0.0,49.38,105.93,80.56
4,8,91.32,93.64,95.0,107.41,73.89
5,7,95.0,92.58,93.12,97.78,68.06
6,8,95.05,102.99,56.25,99.07,50.0
7,7,72.85,86.85,60.0,0.0,56.11
8,8,84.26,93.1,47.5,18.52,50.83
9,7,90.1,97.55,51.25,88.89,63.61


We can also use the na_filter option to turn off the detection of missing-value markers (empty strings and
the entries in na_values), if empty or white space fields are actual values of interest. But in practice,
this is pretty rare. In data without any NAs, passing na_filter=False can improve the performance of
reading a large file.
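
As a rough illustration of what na_filter changes (the CSV content here is made up), the same text can be read with and without missing-value detection:

```python
import io
import pandas as pd

# Hypothetical data: row 2 has an empty comment, row 3 contains the text "NA".
csv_text = "id,comment\n1,ok\n2,\n3,NA\n"

# Default behaviour: the empty field and the "NA" marker both become NaN.
default = pd.read_csv(io.StringIO(csv_text))

# na_filter=False: no markers are detected, so both stay as plain strings.
unfiltered = pd.read_csv(io.StringIO(csv_text), na_filter=False)

print(default["comment"].isna().tolist())
print(unfiltered["comment"].tolist())
```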

In addition to rules controlling how missing values might be loaded, **it's sometimes useful to consider
missing values as actually having information.**

In [117]:
df = pd.read_csv('datasets/log.csv')
df

Unnamed: 0,time,user,video,playback position,paused,volume
0,1469974424,cheryl,intro.html,5,False,10.0
1,1469974454,cheryl,intro.html,6,,
2,1469974544,cheryl,intro.html,9,,
3,1469974574,cheryl,intro.html,10,,
4,1469977514,bob,intro.html,1,,
5,1469977544,bob,intro.html,1,,
6,1469977574,bob,intro.html,1,,
7,1469977604,bob,intro.html,1,,
8,1469974604,cheryl,intro.html,11,,
9,1469974694,cheryl,intro.html,14,,


In this log, each row records a timestamp, the user, the video being watched, and the playback position.
For Cheryl, the playback position advances by one roughly every 30 seconds. Bob is the exception: he has
paused his playback, so as time increases his playback position doesn't change. Note too how difficult it
is to derive this from the data, because it's not sorted by timestamp as one might expect. This is actually
not uncommon on systems which have a high degree of parallelism. There are also a lot of missing values in
the paused and volume columns. It's not efficient to send this information across the network if it hasn't
changed, so this particular system just inserts null values into the database if there are no changes.

In [118]:
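df = df.set_index('time')
# Note: sort_index() returns a new DataFrame; the result is not assigned here,
# so df itself stays in its original order
df.sort_index()
df.head()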

Unnamed: 0_level_0,user,video,playback position,paused,volume
time,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1469974424,cheryl,intro.html,5,False,10.0
1469974454,cheryl,intro.html,6,,
1469974544,cheryl,intro.html,9,,
1469974574,cheryl,intro.html,10,,
1469977514,bob,intro.html,1,,


In [121]:
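# Fill NaN methods
# ffill : forward fill
# bfill : backward fill
df = df.reset_index()
# Newer pandas versions deprecate fillna(method=...); df.ffill() is the equivalent call
df.fillna(method='ffill')
# If we fill with fillna, the fill values should at least come from related rows
# (though depending on the situation a random fill might also be acceptable)
# What if index values overlap? Distinguish them with a multi-level index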

Unnamed: 0,index,time,user,video,playback position,paused,volume
0,0,1469974424,cheryl,intro.html,5,False,10.0
1,1,1469974454,cheryl,intro.html,6,False,10.0
2,2,1469974544,cheryl,intro.html,9,False,10.0
3,3,1469974574,cheryl,intro.html,10,False,10.0
4,4,1469977514,bob,intro.html,1,False,10.0
5,5,1469977544,bob,intro.html,1,False,10.0
6,6,1469977574,bob,intro.html,1,False,10.0
7,7,1469977604,bob,intro.html,1,False,10.0
8,8,1469974604,cheryl,intro.html,11,False,10.0
9,9,1469974694,cheryl,intro.html,14,False,10.0
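
Forward filling only makes sense when the row above is actually related to the row being filled. One way to sketch that, on a made-up log with the same column names as above (the values are hypothetical), is to sort by user and time and then fill within each user:

```python
import numpy as np
import pandas as pd

# Hypothetical log modelled on the columns above; the values are invented.
log = pd.DataFrame({
    "time": [30, 10, 20, 10, 20],
    "user": ["cheryl", "cheryl", "cheryl", "bob", "bob"],
    "paused": [np.nan, False, np.nan, True, np.nan],
    "volume": [np.nan, 10.0, np.nan, 5.0, np.nan],
})

# Put each user's rows in chronological order first.
log = log.sort_values(["user", "time"]).reset_index(drop=True)

# Forward fill within each user so values never leak from one user to another.
log[["paused", "volume"]] = log.groupby("user")[["paused", "volume"]].ffill()
print(log)
```

Filling per user is a design choice for this kind of log; on the unsorted frame above, a plain forward fill copies Cheryl's values into Bob's rows.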


In [122]:
df2 = df.set_index(['time','user'])
df2.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,index,video,playback position,paused,volume
time,user,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
1469974424,cheryl,0,intro.html,5,False,10.0
1469974454,cheryl,1,intro.html,6,,
1469974544,cheryl,2,intro.html,9,,
1469974574,cheryl,3,intro.html,10,,
1469977514,bob,4,intro.html,1,,


In [124]:
df2 = df2.sort_index()
df2.head()
# Backward fill: each NaN takes the next valid value below it (df2.bfill() is equivalent)
df2.fillna(method='bfill')

Unnamed: 0_level_0,Unnamed: 1_level_0,index,video,playback position,paused,volume
time,user,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
1469974424,cheryl,0,intro.html,5,False,10.0
1469974424,sue,13,advanced.html,23,False,10.0
1469974454,cheryl,1,intro.html,6,True,5.0
1469974454,sue,11,advanced.html,24,True,5.0
1469974484,cheryl,18,intro.html,7,True,5.0
1469974514,cheryl,19,intro.html,8,True,5.0
1469974524,sue,12,advanced.html,25,True,5.0
1469974544,cheryl,2,intro.html,9,True,5.0
1469974554,sue,14,advanced.html,26,True,5.0
1469974574,cheryl,3,intro.html,10,True,5.0


We can also do customized fill-in to replace values with the **replace() function**. It supports **several approaches: value-to-value, list, dictionary, regex**. Let's generate a simple example.

In [136]:
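df2.head()
# Each replace() call returns a new object; none is assigned, so df2 is unchanged below
df2.replace(np.nan,20)                              # value-to-value
df2.replace([0,np.nan],[10,20])                     # list-to-list, matched by position
df2.replace('^a.+','Joker',regex=True)              # regex over the whole DataFrame
df2['video'].replace('^a.+','Joker',regex=True)     # regex on a single column
df2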

Unnamed: 0_level_0,Unnamed: 1_level_0,index,video,playback position,paused,volume
time,user,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
1469974424,cheryl,0,intro.html,5,False,10.0
1469974424,sue,13,advanced.html,23,False,10.0
1469974454,cheryl,1,intro.html,6,,
1469974454,sue,11,advanced.html,24,,
1469974484,cheryl,18,intro.html,7,,
1469974514,cheryl,19,intro.html,8,,
1469974524,sue,12,advanced.html,25,,
1469974544,cheryl,2,intro.html,9,,
1469974554,sue,14,advanced.html,26,,
1469974574,cheryl,3,intro.html,10,,
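
Because the cell above discards the results of replace(), the effect of each variant is not visible in its output. The following sketch applies the same four approaches to a small made-up frame (the column values are hypothetical) and keeps each result:

```python
import numpy as np
import pandas as pd

# Hypothetical frame; the values are invented for illustration.
demo = pd.DataFrame({
    "video": ["intro.html", "advanced.html", "intro.html"],
    "volume": [10.0, np.nan, 0.0],
})

# Value-to-value: every NaN becomes 20.
by_value = demo.replace(np.nan, 20)

# List-to-list: 0 becomes 10 and NaN becomes 20, matched by position.
by_list = demo.replace([0, np.nan], [10, 20])

# Dictionary: keys are the old values, values are the new ones.
by_dict = demo.replace({0: 10, np.nan: 20})

# Regex: any string that starts with "a" followed by at least one character.
by_regex = demo.replace(to_replace=r"^a.+", value="Joker", regex=True)

print(by_regex)
```

In every case replace() returns a new object, so the original frame keeps its NaN unless the result is assigned back.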


When you use statistical functions on DataFrames, these functions typically ignore missing values. For instance, if you calculate the mean value of a DataFrame, **the underlying computation will skip missing values by default. This is usually what you want, but you should be aware that values are being excluded.** Why the values are missing really matters, depending on the problem you are trying to solve. It might be unreasonable to infer missing values, for instance, if the data shouldn't exist in the first place.
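
A short sketch of that behaviour on a made-up Series: the default mean skips the NaN, skipna=False propagates it, and count() shows how many values actually took part.

```python
import numpy as np
import pandas as pd

# Hypothetical scores with one missing entry.
scores = pd.Series([80.0, np.nan, 100.0])

# By default the NaN is skipped: (80 + 100) / 2.
default_mean = scores.mean()

# skipna=False propagates the missing value instead.
strict_mean = scores.mean(skipna=False)

# count() reports how many non-missing values were used.
used = scores.count()

print(default_mean, strict_mean, used, "of", len(scores))
```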

In [11]:
data = {'a':1,'b':2,'c':3,'d':4}
obj1 = pd.Series(data)

In [12]:
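index = ['a','b','e','f']
# Keys missing from data ('e' and 'f') become NaN
obj2 = pd.Series(data,index=index)
pd.isnull(obj2)
#obj2['e']==None
obj2.iloc[0:3]
# .loc is label-based and this index holds strings, so the integer 1 raises the
# TypeError shown below; use obj2.loc['b'] or obj2.iloc[1] instead
obj2.loc[1]
# Not reached because of the error above. Series.append was removed in pandas 2.0;
# pd.concat([obj2, obj1]) is the replacement
obj2.append(obj1)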

TypeError: cannot do label indexing on <class 'pandas.core.indexes.base.Index'> with these indexers [1] of <class 'int'>