# Intro to Pandas & Series Data

The aim of this lecture is to become familiar with storing and manipulating one-dimensional indexed data in the Series object.

- The Series is one of the core data structures in pandas. You can think of it as a cross between a list and a dictionary:
the items are stored in order, and each has a label with which you can retrieve it.
- An easy way to visualize this is as two columns of data. The first is the special index, a lot like the keys in a dictionary, while the second is your actual data.

- It's important to note that the data column has a label of its own, which can be retrieved using the `.name` attribute.
- This is different from dictionaries and is useful when it comes to merging multiple columns of data. We'll talk about that later in the course.
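
As a rough illustration of the two-column picture described above, the sketch below builds a small Series with an explicit index and a name. The values, the labels, and the name `score` are made up for the example.

```python
import pandas as pd

# Hypothetical data: three scores labelled by student name.
scores = pd.Series([90, 80, 70], index=["Alice", "Jack", "Molly"], name="score")

print(scores.index.tolist())  # the index column (like dictionary keys)
print(scores.tolist())        # the data column (the actual values)
print(scores.name)            # the label of the data column
print(scores["Jack"])         # look up a value by its label
```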

In [1]:
import pandas as pd

In [7]:
# You can create a pandas Series by passing a list of values to pd.Series().
# pandas assigns an integer index starting from 0 and leaves the Series name as None.
students = ['Alice','Jack', 'Molly']
print(pd.Series(students))

# The dtype here is object, which is how pandas stores strings (the student names).
# If we pass a list of integers instead, the dtype is an integer type.
numbers = [1,2,3]
print(pd.Series(numbers))

0    Alice
1     Jack
2    Molly
dtype: object
0    1
1    2
2    3
dtype: int64


In [9]:
# pandas treats missing values differently for strings and numbers.
# In a list of strings, None is kept as None and the dtype
# of the Series is object.
students = ['Alice','Jack', None]
print(pd.Series(students))

# With numbers, pandas converts None to NaN ("Not a Number")
# and the dtype becomes float instead of int.
numbers = [1,2,None]
print(pd.Series(numbers))

# So if integer data you are working with is displayed as floats,
# that is probably because there is missing data.

0    Alice
1     Jack
2     None
dtype: object
0    1.0
1    2.0
2    NaN
dtype: float64
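
If keeping integers as integers matters, one option (an addition here, not something the lecture covers) is pandas' nullable integer dtype, requested with the string `"Int64"`. A minimal sketch with made-up values:

```python
import pandas as pd

# Default behaviour: the missing value forces a float dtype with NaN.
default = pd.Series([1, 2, None])

# Nullable integer dtype: values stay integers, the gap is shown as <NA>.
nullable = pd.Series([1, 2, None], dtype="Int64")

print(default.dtype)
print(nullable.dtype)
print(nullable.isna().tolist())
```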


In [11]:
# It is worth stressing that None and NaN are not the same thing; let's check:
import numpy as np
print(np.nan == None)

# It turns out that NaN is not even equal to itself:
print(np.nan == np.nan)

False
False


In [13]:
# This raises the question: how can we test for the presence of NaN?
# We can use numpy's isnan() function.

np.isnan(np.nan)

# So keep in mind: when you see NaN, its meaning is similar to None,
# but it is a numeric value and is treated differently for efficiency reasons.

True
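
Since equality cannot detect NaN, a convenient alternative to `np.isnan()` is `pd.isna()`, which accepts both `None` and NaN, along with the `.isna()` method on a Series. The sketch below uses made-up values.

```python
import numpy as np
import pandas as pd

# pd.isna() treats both None and NaN as missing.
print(pd.isna(np.nan), pd.isna(None))

# On a Series, .isna() returns a boolean Series marking the missing entries.
numbers = pd.Series([1, 2, None])
missing = numbers.isna()
print(missing.tolist())
```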

In [32]:
# So far we have created a pandas Series by passing a list.
# We can also pass a dictionary, in which case the dictionary
# keys become the index of the Series.

students_classes = {'Assem':'Math',
                   'Mohamed' :'Stat',
                    'Maged':'Physics'}

s = pd.Series(students_classes)
print(s)
print('\t')
print(s.index)

# As you play more with pandas you'll notice that a lot of things are
# implemented as numpy arrays and have a dtype set.
# This is true of indices too, and here pandas inferred
# that we were using objects for the index.

Assem         Math
Mohamed       Stat
Maged      Physics
dtype: object
	
Index(['Assem', 'Mohamed', 'Maged'], dtype='object')


In [20]:
# Now, this is kind of interesting. The dtype of object is not just for
# strings, but for arbitrary objects. Let's create a more complex type of data,
# say, a list of tuples.
students = [("Alice","Brown"),("Jack", "White"),("Molly", "Green")]
s = pd.Series(students)
s
# each of the tuples is stored in the series object with dtype object.

0    (Alice, Brown)
1     (Jack, White)
2    (Molly, Green)
dtype: object

In [30]:
# Furthermore, you can pass the index labels you want explicitly to pd.Series().
s = pd.Series(['Math','Stat','Physics'],index=['Assem','Mohamed','Maged'])
print(s)
print('\t')
# What if we build the Series from a dictionary but pass an index list
# containing a label ('Ali') that is not one of the dictionary keys?
students_classes = {'Assem':'Math',
                   'Mohamed' :'Stat',
                    'Maged':'Physics'}
s = pd.Series(students_classes,index=['Assem','Ali','Maged'])
print(s)
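# pandas keeps only the labels listed in index: 'Ali' gets NaN because it has
# no value in the dictionary, and 'Mohamed' is dropped because it is not listed.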

Assem         Math
Mohamed       Stat
Maged      Physics
dtype: object
	
Assem       Math
Ali          NaN
Maged    Physics
dtype: object


In this lecture we've explored the pandas Series data structure.
You've seen how to create a Series from `lists` and `dictionaries`, how `indices` on data work, and the way that pandas typecasts data, including missing values.

# Querying a Series

- In this lecture, we'll talk about one of the primary data types of the Pandas library, the Series.
- You'll learn about the structure of the Series, how to query and merge Series objects together, and the importance of thinking about parallelization when engaging in data science programming.

In [38]:
# To query by numeric position (starting from 0), use iloc.
# To query by index label, use loc.
import pandas as pd
students_classes = {'Assem':'Math',
                   'Mohamed' :'Stat',
                    'Maged':'Physics',
                     'Sam':'History'}
s = pd.Series(students_classes)
print(s)
print('\t')
# To see the 4th value we can use either iloc or loc:
print(s.iloc[3])      # iloc and loc are attributes, not methods,
print(s.loc['Sam'])   # so we index them with [] rather than calling them with ().
print(s[3])           # positional fallback on a label index; deprecated in recent pandas, prefer s.iloc[3]
print(s['Sam'])

Assem         Math
Mohamed       Stat
Maged      Physics
Sam        History
dtype: object
	
History
History
History
History


In [43]:
# Indexing with plain [] can be confusing when the index labels are
# themselves integers, as in this dictionary with integer keys,
# so the safer option is to use loc and iloc explicitly.
class_code = { 99: 'Physics',
               100: 'Chemistry',
               101: 'English',
               102: 'History'}
s = pd.Series(class_code)
# print(s[0])      # raises a KeyError: 0 is not one of the labels.
print(s.iloc[0])   # position 0, which gives the first item.

Physics
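
The difference between label-based and position-based lookup is easiest to see side by side. The sketch below reuses two of the class codes from the cell above; the variable names are only illustrative.

```python
import pandas as pd

class_code = pd.Series({99: "Physics", 100: "Chemistry"})

by_label = class_code.loc[99]      # the label 99
by_position = class_code.iloc[0]   # the first item, whatever its label

# Asking loc for a label that does not exist raises a KeyError.
try:
    class_code.loc[0]
    label_zero_found = True
except KeyError:
    label_zero_found = False

print(by_label, by_position, label_zero_found)
```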


In [49]:
# Now let's look at how to perform operations on a Series, or more
# explicitly, how to iterate over a Series and apply an operation.
grades = pd.Series([90,80,70,60])
total = 0
for grade in grades:
    total = grade + total       # this works, but it is slow
average = total/len(grades)     # next we'll use numpy's vectorized functions instead
print(total,average)
print('total= {} & average= {}'.format(total, average))

300 75.0
total= 300 & average= 75.0


In [50]:
# We can use numpy's sum() function to do the same operation as above.
import numpy as np
grades = pd.Series([90,80,70,60])
total = np.sum(grades)
average = total/len(grades)
print(total,average)

300 75.0
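
The same totals are also available as methods on the Series itself, which avoids the explicit division. A small sketch using the grades from the cell above:

```python
import pandas as pd

grades = pd.Series([90, 80, 70, 60])

# Series.sum() and Series.mean() are vectorized, like np.sum().
total = grades.sum()
average = grades.mean()

print(total, average)
```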


In [78]:
# head() is a pandas method that returns the first five items of a Series.
# First let's create a big Series of 10,000 random integers.
numbers = pd.Series(np.random.randint(0,1000,10000))
print(numbers.head())
len(numbers)

0    212
1    564
2    661
3    838
4    836
dtype: int32


10000

In [79]:
%%timeit -n 100
# %%timeit reports how long the cell takes to run on average;
# -n sets the number of loops per run. The cell magic must be the
# first line of the cell: in the original run these comments came first,
# which raised "UsageError: Line magic function `%%timeit` not found."
total = 0
for number in numbers:
    total = number + total
average = total/len(numbers)


In [80]:
%%timeit -n 100
total = np.sum(numbers)
total/len(numbers)

80.7 µs ± 16.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [81]:
# pandas and numpy share a broadcasting feature, by which we can apply
# an operation to every value in the Series at once.
print(numbers.head())
numbers += 2
print('\t')
print(numbers.head())


0    212
1    564
2    661
3    838
4    836
dtype: int32
	
0    214
1    566
2    663
3    840
4    838
dtype: int32
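
Broadcasting also works without modifying the original Series: an expression such as `numbers + 2` returns a new Series. The sketch below uses three made-up values rather than the random data above.

```python
import pandas as pd

numbers = pd.Series([212, 564, 661])

shifted = numbers + 2     # new Series; numbers itself is unchanged
doubled = numbers * 2     # any arithmetic operator broadcasts the same way
is_large = numbers > 500  # comparisons broadcast too and give booleans

print(shifted.tolist())
print(doubled.tolist())
print(is_large.tolist())
```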


In [90]:
# pandas supports iterating through a Series much like a dict,
# allowing you to unpack labels and values easily.
# items() returns (label, value) pairs (iteritems() was removed in pandas 2.0).
for label, value in numbers.items():
    numbers.at[label]      # only reads the value, so numbers is unchanged
numbers.head()

0    214
1    566
2    663
3    840
4    838
dtype: int32

In [None]:
# Each of these loops is meant to add 2 to every value. The first only
# reads the value, so it changes nothing. The original second loop used
# set_value(), which was removed in pandas 1.0, so .at is used here instead.

for label, value in numbers.items():
    numbers.at[label]
numbers.head()

for label, value in numbers.items():
    numbers.at[label] = value + 2
numbers.head()

for label, value in numbers.items():
    numbers.loc[label] = value + 2

numbers.head()
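
To check that an explicit loop and broadcasting really produce the same result, the sketch below applies both to a small made-up Series. Writing into a copy while iterating over the original is a choice made here to keep the example easy to reason about.

```python
import pandas as pd

numbers = pd.Series([214, 566, 663])

# Loop version: write each updated value into a copy, one label at a time.
looped = numbers.copy()
for label, value in numbers.items():
    looped.at[label] = value + 2

# Vectorized version: one expression for the whole Series.
vectorized = numbers + 2

print(looped.tolist())
print(vectorized.tolist())
```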

In [91]:
s = pd.Series(np.random.randint(0,1000,10000))
for label, value in s.items():
    s.loc[label] = value+2

numbers.head()   # note: this displays numbers, which the loop does not touch; use s.head() to see s

0    214
1    566
2    663
3    840
4    838
dtype: int32

In [94]:
%%timeit -n 10
s = pd.Series(np.random.randint(0,1000,1000)) # time of loops
for label, value in s.items():
    s.loc[label]= value+2

41.3 ms ± 520 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [95]:
%%timeit -n 10
s = pd.Series(np.random.randint(0,1000,1000)) # time of broadcasting method
s+=2

295 µs ± 145 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [96]:
# loc can also be used to add a new element to the Series.

s = pd.Series([1, 2, 3])
# We could add some new value, maybe a university course
s.loc['History'] = 102

s
# As you can see, mixing integer and string index labels is no problem for pandas.

0            1
1            2
2            3
History    102
dtype: int64

In [98]:
# What if the index labels are not unique?
students_classes = pd.Series ({'Assem':'Math',
                   'Mohamed' :'Stat',
                    'Maged':'Physics',
                     'Sam':'History'})
print(students_classes)
print('\t')
ziad_classes = pd.Series(['Chemistry','English','French'],
                         index=['Ziad','Ziad','Ziad'])
print(ziad_classes)
print('\t')

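# Series.append() was removed in pandas 2.0; pd.concat() does the same job
# and, like append(), returns a new Series without modifying the originals.
all_student_classes = pd.concat([students_classes, ziad_classes])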
print(all_student_classes)

Assem         Math
Mohamed       Stat
Maged      Physics
Sam        History
dtype: object
	
Ziad    Chemistry
Ziad      English
Ziad       French
dtype: object
	
Assem           Math
Mohamed         Stat
Maged        Physics
Sam          History
Ziad       Chemistry
Ziad         English
Ziad          French
dtype: object


In [99]:
print(students_classes)
print('\t')
print(all_student_classes.loc['Ziad'])

Assem         Math
Mohamed       Stat
Maged      Physics
Sam        History
dtype: object
	
Ziad    Chemistry
Ziad      English
Ziad       French
dtype: object
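
A compact way to see the effect of duplicate labels is to compare what `.loc` returns for a unique label and for a repeated one. The sketch below is a trimmed-down version of the data above and combines the two Series with `pd.concat()`.

```python
import pandas as pd

students = pd.Series({"Assem": "Math", "Sam": "History"})
ziad = pd.Series(["Chemistry", "English"], index=["Ziad", "Ziad"])

# pd.concat() returns a new Series and leaves both inputs untouched.
combined = pd.concat([students, ziad])

one_value = combined.loc["Sam"]     # unique label: a single value
many_values = combined.loc["Ziad"]  # repeated label: a Series

print(one_value)
print(many_values.tolist())
```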


In this lecture, we focused on one of the primary data types of the Pandas library, the Series. You learned how to `query` the Series with `.loc` and `.iloc`, that the Series is an indexed data structure, how to `merge` two Series objects together with `append()` (replaced by `pd.concat()` since pandas 2.0), and the importance of `vectorization`.

# DataFrame

__1. Dataframe data structure__

In [106]:
import pandas as pd
record1 = pd.Series({'Name':'Alice',
                    'Class':'Physics',
                    'Score': 85})
record2 = pd.Series({'Name': 'Jack',
                        'Class': 'Chemistry',
                        'Score': 82})
record3 = pd.Series({'Name': 'Helen',
                        'Class': 'Biology',
                        'Score': 90})
# Like a Series, the DataFrame is indexed. Here I'll use each Series as a row
# in the data.
df = pd.DataFrame([record1,record2,record3],
                  index=['School1','School2','School1'])
df

          Name      Class  Score
School1  Alice    Physics     85
School2   Jack  Chemistry     82
School1  Helen    Biology     90


In [107]:
# An alternative is to pass a list of dictionaries to pd.DataFrame().
students = [{'Name':'Alice',
             'Class':'Physics',
              'Score': 85},
              {'Name': 'Jack',
               'Class': 'Chemistry',
                'Score': 82},
               {'Name': 'Helen',
                'Class': 'Biology',
                'Score': 90}]
df = pd.DataFrame(students,index=['School1','School2','School1'])
df

          Name      Class  Score
School1  Alice    Physics     85
School2   Jack  Chemistry     82
School1  Helen    Biology     90
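
A third common way to build the same table, not shown in the lecture cells, is a dictionary that maps each column name to a list of values. A minimal sketch using the same records:

```python
import pandas as pd

# Each key becomes a column; the lists must all have the same length.
df = pd.DataFrame(
    {
        "Name": ["Alice", "Jack", "Helen"],
        "Class": ["Physics", "Chemistry", "Biology"],
        "Score": [85, 82, 90],
    },
    index=["School1", "School2", "School1"],
)

print(df)
print(df.shape)  # (rows, columns)
```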


In [116]:
# Just like with a Series, we can extract data using loc[] and iloc[].
print(df.loc['School2'])
# To check the type of a result we can use the type() function
# (only the last expression in a cell is displayed, so this one shows nothing).
type(df.loc['School2'])
print('\t')
# That label matches a single row, so the result is a Series.

# But 'School1' matches two rows in df, so the result is a DataFrame.
print(df.loc['School1'])
type(df.loc['School1'])


Name          Jack
Class    Chemistry
Score           82
Name: School2, dtype: object
	
          Name    Class  Score
School1  Alice  Physics     85
School1  Helen  Biology     90


pandas.core.frame.DataFrame

In [119]:
print(df.loc['School1']) # DataFrame
print('\t')
# We can pass both the row label and the column name to loc[].
print(df.loc['School1','Name']) # Series

          Name    Class  Score
School1  Alice  Physics     85
School1  Helen  Biology     90
	
School1    Alice
School1    Helen
Name: Name, dtype: object


In [125]:
# So far we've been selecting rows. What about columns?
# One way is to transpose the DataFrame with .T so that columns become
# rows, then use loc[] to pull the column out.
print(df.T)
print('\t')
print(df.T.loc['Name'])  # loc[] selects by index label (row names).

# More simply, we can select a column directly with []
df['Name']               # df[] selects by column name.

       School1    School2  School1
Name     Alice       Jack    Helen
Class  Physics  Chemistry  Biology
Score       85         82       90
	
School1    Alice
School2     Jack
School1    Helen
Name: Name, dtype: object


School1    Alice
School2     Jack
School1    Helen
Name: Name, dtype: object

In [129]:
# Since df.loc['School1'] gives us a DataFrame, we can index a column from it.
print(df.loc['School1']['Name'])   # This is called chaining.
print('\t')                        # It is better avoided; use
                                   # df.loc['School1','Name'] instead.
print(type(df.loc['School1']))
print(type(df.loc['School1']['Name']))

School1    Alice
School1    Helen
Name: Name, dtype: object
	
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>
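
One practical reason to prefer the single `.loc[row, column]` form over chaining is assignment: with one indexing step there is no question of whether you are writing to the DataFrame or to a temporary copy. The sketch below rebuilds the small table from above; the new score of 100 is made up.

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["Alice", "Jack", "Helen"], "Score": [85, 82, 90]},
    index=["School1", "School2", "School1"],
)

# One indexing operation: row label and column name together.
df.loc["School1", "Score"] = 100

print(df)
```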


In [131]:
# .loc[] also supports slicing. Here we ask for the names and scores of
# all schools: ':' selects every row.
df.loc[: , ['Name','Score']]

          Name  Score
School1  Alice     85
School2   Jack     82
School1  Helen     90


In [142]:
# We can drop rows and columns using drop() (or del for columns),
# and add columns using the assignment operator with [].

copy_df = df.copy()
print(copy_df.drop('School1'))
print('\t')
print(copy_df) 

# As you can see, drop() returns a copy and leaves copy_df unchanged. To drop
# in place, pass inplace=True; to drop columns, pass axis=1.
copy_df.drop('Name',inplace = True , axis = 1)
print('\t')
print(copy_df)

# we can drop columns also by using `del` :
del copy_df ['Class']
print('\t')
print(copy_df)

# adding columns using []
copy_df['rank'] = None
print('\t')
print(copy_df)

         Name      Class  Score
School2  Jack  Chemistry     82
	
          Name      Class  Score
School1  Alice    Physics     85
School2   Jack  Chemistry     82
School1  Helen    Biology     90
	
             Class  Score
School1    Physics     85
School2  Chemistry     82
School1    Biology     90
	
         Score
School1     85
School2     82
School1     90
	
         Score  rank
School1     85  None
School2     82  None
School1     90  None
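
If you would rather not modify a DataFrame in place, `drop(columns=...)` and `assign(...)` both return new DataFrames. The sketch below is one way to write the steps above in that style; the `Rank` values are made up.

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["Alice", "Jack", "Helen"],
     "Class": ["Physics", "Chemistry", "Biology"],
     "Score": [85, 82, 90]},
    index=["School1", "School2", "School1"],
)

# drop(columns=...) is equivalent to drop(..., axis=1) and returns a copy.
trimmed = df.drop(columns=["Class"])

# assign() returns a copy with the extra column; df itself is unchanged.
ranked = df.assign(Rank=[2, 3, 1])

print(trimmed.columns.tolist())
print(ranked.columns.tolist())
print(df.columns.tolist())
```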


__2. DataFrame Indexing And Loading__

In [146]:
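# One way to look at a csv file is a shell command such as cat
# (on Windows, where cat is not available, the `type` command plays the same role):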
!cat data/Admission_Predict.csv

'cat' is not recognized as an internal or external command,
operable program or batch file.


In [147]:
# Another way is to use the pandas read_csv() function.
import pandas as pd
df = pd.read_csv('data/Admission_Predict.csv')
df

Unnamed: 0,Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
0,1,337,118,4,4.5,4.5,9.65,1,0.92
1,2,324,107,4,4.0,4.5,8.87,1,0.76
2,3,316,104,3,3.0,3.5,8.00,1,0.72
3,4,322,110,3,3.5,2.5,8.67,1,0.80
4,5,314,103,2,2.0,3.0,8.21,0,0.65
...,...,...,...,...,...,...,...,...,...
395,396,324,110,3,3.5,3.5,9.04,1,0.82
396,397,325,107,3,3.0,3.5,9.11,1,0.84
397,398,330,116,4,5.0,4.5,9.45,1,0.91
398,399,312,103,3,3.5,4.0,8.78,0,0.67


In [151]:
# As you can see, the index column that pandas adds duplicates the
# information in the Serial No. column,
# so let's make Serial No. our index by passing index_col=0 (the column position).
df = pd.read_csv('data/Admission_Predict.csv',index_col=0)
df.head()

            GRE Score  TOEFL Score  University Rating  SOP  LOR  CGPA  Research  Chance of Admit
Serial No.
1                 337          118                  4  4.5  4.5  9.65         1             0.92
2                 324          107                  4  4.0  4.5  8.87         1             0.76
3                 316          104                  3  3.0  3.5  8.00         1             0.72
4                 322          110                  3  3.5  2.5  8.67         1             0.80
5                 314          103                  2  2.0  3.0  8.21         0             0.65


In [164]:
# SOP and LOR are not very descriptive names,
# so let's give these columns clearer names using .rename(),
# passing a dictionary that maps old names (keys) to new names (values).
new_df = df.rename(columns={'GRE Score':'GRE Score', 'TOEFL Score':'TOEFL Score',
                   'University Rating':'University Rating', 
                   'SOP': 'Statement of Purpose','LOR': 'Letter of Recommendation',
                   'CGPA':'CGPA', 'Research':'Research',
                   'Chance of Admit':'Chance of Admit'})
new_df.head()

            GRE Score  TOEFL Score  University Rating  Statement of Purpose  LOR  CGPA  Research  Chance of Admit
Serial No.
1                 337          118                  4                   4.5  4.5  9.65         1             0.92
2                 324          107                  4                   4.0  4.5  8.87         1             0.76
3                 316          104                  3                   3.0  3.5  8.00         1             0.72
4                 322          110                  3                   3.5  2.5  8.67         1             0.80
5                 314          103                  2                   2.0  3.0  8.21         0             0.65


In [165]:
# Why was only SOP renamed?
# Let's look at the column names more closely.
new_df.columns

Index(['GRE Score', 'TOEFL Score', 'University Rating', 'Statement of Purpose',
       'LOR ', 'CGPA', 'Research', 'Chance of Admit '],
      dtype='object')

In [166]:
# As you may notice, there is trailing white space after 'LOR' and 'Chance of Admit'.
new_df = new_df.rename(columns={'LOR ': 'Letter of Recommendation'})
new_df.head()               # could also add 'Chance of Admit ': 'Chance of Admit'

# Rather than renaming each column by hand, we can trim the white space with
# str.strip, passing it as the mapper to rename() with axis='columns'.
# White space in headers is a common issue, so it is worth cleaning up.
new_df=new_df.rename(mapper = str.strip , axis = 'columns')
print(new_df.columns)
print(new_df.columns)

Index(['GRE Score', 'TOEFL Score', 'University Rating', 'Statement of Purpose',
       'Letter of Recommendation', 'CGPA', 'Research', 'Chance of Admit'],
      dtype='object')


In [169]:
# Now that we are done with new_df, let's go back to the original df
# and change all of the column names to lower case with white space removed.
# First we need the column names as a list.
cols = list(df.columns)
# Then we build the cleaned names with a list comprehension
modified_cols = [x.lower().strip() for x in cols]
# and assign them back to df.columns.
df.columns = modified_cols

df.head()

            gre score  toefl score  university rating  sop  lor  cgpa  research  chance of admit
Serial No.
1                 337          118                  4  4.5  4.5  9.65         1             0.92
2                 324          107                  4  4.0  4.5  8.87         1             0.76
3                 316          104                  3  3.0  3.5  8.00         1             0.72
4                 322          110                  3  3.5  2.5  8.67         1             0.80
5                 314          103                  2  2.0  3.0  8.21         0             0.65
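
The list comprehension can also be written with the vectorized string methods on the column Index. The sketch below is self-contained: it reads a tiny in-memory CSV whose header deliberately carries trailing spaces, standing in for the real file, which is not needed to run it.

```python
import io
import pandas as pd

# Stand-in for the real file: note the trailing spaces in two header names.
csv_text = "Serial No.,GRE Score,LOR ,Chance of Admit \n1,337,4.5,0.92\n2,324,4.5,0.76\n"
df = pd.read_csv(io.StringIO(csv_text), index_col=0)
raw_columns = df.columns.tolist()

# .str applies string methods to every column name at once.
df.columns = df.columns.str.strip().str.lower()

print(raw_columns)
print(df.columns.tolist())
```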


__3. Querying a DataFrame__

- In this lecture we're going to talk about querying DataFrames. The first step in the process is to understand Boolean masking.
- Boolean masking is the heart of fast and efficient querying in numpy and pandas.
- With boolean masking, we can select data based on the criteria we desire and, frankly, you'll use it everywhere.

In [1]:
import pandas as pd
df = pd.read_csv('data/Admission_Predict.csv' , index_col=0)
# Let's fix the white space and capital letters in the column names.
df.columns = [x.lower().strip() for x in df.columns]
# (The previous section did the same via list(df.columns) and a separate list.)
df.head()

            gre score  toefl score  university rating  sop  lor  cgpa  research  chance of admit
Serial No.
1                 337          118                  4  4.5  4.5  9.65         1             0.92
2                 324          107                  4  4.0  4.5  8.87         1             0.76
3                 316          104                  3  3.0  3.5  8.00         1             0.72
4                 322          110                  3  3.5  2.5  8.67         1             0.80
5                 314          103                  2  2.0  3.0  8.21         0             0.65


In [9]:
# We want a DataFrame of the students whose chance of admit is above 0.7:
admited_mask = df['chance of admit'] > 0.7
print(admited_mask)        # this is a series
print(type(admited_mask))
df[admited_mask].head()

Serial No.
1       True
2       True
3       True
4       True
5      False
       ...  
396     True
397     True
398     True
399    False
400     True
Name: chance of admit, Length: 400, dtype: bool
<class 'pandas.core.series.Series'>


            gre score  toefl score  university rating  sop  lor  cgpa  research  chance of admit
Serial No.
1                 337          118                  4  4.5  4.5  9.65         1             0.92
2                 324          107                  4  4.0  4.5  8.87         1             0.76
3                 316          104                  3  3.0  3.5  8.00         1             0.72
4                 322          110                  3  3.5  2.5  8.67         1             0.80
6                 330          115                  5  4.5  3.0  9.34         1             0.90


In [12]:
# We can get the same rows with where(), passing the mask to it:
print(df.where(admited_mask).head())
print('\t')
# As you can see, rows that do not match the mask are filled with NaN
# (which also turns the integer columns into floats),
# so we need to drop them using the dropna() function:
df.where(admited_mask).dropna().head()


            gre score  toefl score  university rating  sop  lor  cgpa  \
Serial No.                                                              
1               337.0        118.0                4.0  4.5  4.5  9.65   
2               324.0        107.0                4.0  4.0  4.5  8.87   
3               316.0        104.0                3.0  3.0  3.5  8.00   
4               322.0        110.0                3.0  3.5  2.5  8.67   
5                 NaN          NaN                NaN  NaN  NaN   NaN   

            research  chance of admit  
Serial No.                             
1                1.0             0.92  
2                1.0             0.76  
3                1.0             0.72  
4                1.0             0.80  
5                NaN              NaN  
	


            gre score  toefl score  university rating  sop  lor  cgpa  research  chance of admit
Serial No.
1               337.0        118.0                4.0  4.5  4.5  9.65       1.0             0.92
2               324.0        107.0                4.0  4.0  4.5  8.87       1.0             0.76
3               316.0        104.0                3.0  3.0  3.5  8.00       1.0             0.72
4               322.0        110.0                3.0  3.5  2.5  8.67       1.0             0.80
6               330.0        115.0                5.0  4.5  3.0  9.34       1.0             0.90
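
The two approaches select the same rows but differ in a side effect on dtypes. The sketch below shows this on a three-row frame whose values are copied from the outputs above; it does not need the CSV file.

```python
import pandas as pd

df = pd.DataFrame(
    {"gre score": [337, 314, 330], "chance of admit": [0.92, 0.65, 0.90]},
    index=[1, 5, 6],
)
mask = df["chance of admit"] > 0.7

by_mask = df[mask]                    # keeps matching rows, dtypes untouched
by_where = df.where(mask).dropna()    # NaN-fills then drops, so ints become floats

print(by_mask["gre score"].dtype)
print(by_where["gre score"].dtype)
```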


In [14]:
# So let's review what the index operator [] can do:
# 1. Select a column or a list of columns from the DataFrame
df[['sop','lor']].head()

            sop  lor
Serial No.
1           4.5  4.5
2           4.0  4.5
3           3.0  3.5
4           3.5  2.5
5           2.0  3.0


In [15]:
# 2. Apply a boolean mask to the DataFrame to pull out the rows of interest
df[df['gre score']>320].head()

# Each of these mimics the behavior of either loc[] or
# where().dropna()

            gre score  toefl score  university rating  sop  lor  cgpa  research  chance of admit
Serial No.
1                 337          118                  4  4.5  4.5  9.65         1             0.92
2                 324          107                  4  4.0  4.5  8.87         1             0.76
4                 322          110                  3  3.5  2.5  8.67         1             0.80
6                 330          115                  5  4.5  3.0  9.34         1             0.90
7                 321          109                  3  3.0  4.0  8.20         1             0.75


In [17]:
# (df['chance of admit'] > 0.7) and (df['chance of admit'] < 0.9)  `NO`: Python's `and` raises a ValueError on Series
(df['chance of admit'] > 0.7) & (df['chance of admit'] < 0.9)
# YES: use the element-wise `&`, and don't forget the () around each condition.

Serial No.
1      False
2       True
3       True
4       True
5      False
       ...  
396     True
397     True
398    False
399    False
400    False
Name: chance of admit, Length: 400, dtype: bool

In [18]:
# Another way to write this without `>` or `<` is to use .gt() for greater than
# and .lt() for less than:
df['chance of admit'].gt(0.7) & df['chance of admit'].lt(0.9)

Serial No.
1      False
2       True
3       True
4       True
5      False
       ...  
396     True
397     True
398    False
399    False
400    False
Name: chance of admit, Length: 400, dtype: bool

In [19]:
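# You can also chain these functions, but be careful: this does NOT give the
# same answer. .gt(0.7) returns booleans, and .lt(0.9) then compares those
# booleans (True == 1, False == 0) with 0.9, so the result below is simply
# the inverse of the .gt(0.7) mask.
df['chance of admit'].gt(0.7).lt(0.9)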

Serial No.
1      False
2      False
3      False
4      False
5       True
       ...  
396    False
397    False
398    False
399     True
400    False
Name: chance of admit, Length: 400, dtype: bool

In [None]:
# Method forms such as .gt() and .lt() are available only for operators that
# pandas provides as methods. They can read more cleanly than `&` and
# parentheses, but chaining them as above compares booleans rather than the
# original values, so combine the two masks with `&` for a range test.
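
The difference between combining two masks and chaining `.gt().lt()` can be checked on three values taken from the column above (0.92, 0.76 and 0.65). This is only a sketch to make the behaviour concrete.

```python
import pandas as pd

chance = pd.Series([0.92, 0.76, 0.65])

# Two masks on the original values, combined element-wise.
combined = chance.gt(0.7) & chance.lt(0.9)

# Chained: .lt(0.9) is applied to the booleans returned by .gt(0.7).
chained = chance.gt(0.7).lt(0.9)

print(combined.tolist())
print(chained.tolist())
```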

__4. Indexing Dataframes__

In [35]:
import pandas as pd
df = pd.read_csv('data/Admission_Predict.csv',index_col=0)
df.head()

            GRE Score  TOEFL Score  University Rating  SOP  LOR  CGPA  Research  Chance of Admit
Serial No.
1                 337          118                  4  4.5  4.5  9.65         1             0.92
2                 324          107                  4  4.0  4.5  8.87         1             0.76
3                 316          104                  3  3.0  3.5  8.00         1             0.72
4                 322          110                  3  3.5  2.5  8.67         1             0.80
5                 314          103                  2  2.0  3.0  8.21         0             0.65


In [36]:
# What if we want to use the last column as the index?
# First, let's create a new column to preserve the Serial No. values.
df['Serial Number']= df.index
# Then use .set_index() to choose the column that becomes the index.
df = df.set_index('Chance of Admit ')
df.head()

                 GRE Score  TOEFL Score  University Rating  SOP  LOR  CGPA  Research  Serial Number
Chance of Admit
0.92                   337          118                  4  4.5  4.5  9.65         1              1
0.76                   324          107                  4  4.0  4.5  8.87         1              2
0.72                   316          104                  3  3.0  3.5  8.00         1              3
0.80                   322          110                  3  3.5  2.5  8.67         1              4
0.65                   314          103                  2  2.0  3.0  8.21         0              5


In [37]:
# To move the index back into a column and restore the default integer index,
# we can use the .reset_index() function:
df = df.reset_index()
df.head()

Unnamed: 0,Chance of Admit,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Serial Number
0,0.92,337,118,4,4.5,4.5,9.65,1,1
1,0.76,324,107,4,4.0,4.5,8.87,1,2
2,0.72,316,104,3,3.0,3.5,8.0,1,3
3,0.8,322,110,3,3.5,2.5,8.67,1,4
4,0.65,314,103,2,2.0,3.0,8.21,0,5
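
`set_index()` and `reset_index()` undo each other, which is easy to confirm on a small frame. The column names `serial` and `chance` below are shortened stand-ins for the real ones.

```python
import pandas as pd

df = pd.DataFrame({"serial": [1, 2, 3], "chance": [0.92, 0.76, 0.72]})

# set_index() moves a column into the index and returns a new DataFrame.
indexed = df.set_index("serial")

# reset_index() moves the index back into a column and restores 0, 1, 2, ...
restored = indexed.reset_index()

print(indexed.index.name, indexed.columns.tolist())
print(restored.columns.tolist())
```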


In [48]:
# Let's load a new dataset to work with:
df = pd.read_csv('data/census.csv')
df.head()

Unnamed: 0,SUMLEV,REGION,DIVISION,STATE,COUNTY,STNAME,CTYNAME,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010,...,RDOMESTICMIG2011,RDOMESTICMIG2012,RDOMESTICMIG2013,RDOMESTICMIG2014,RDOMESTICMIG2015,RNETMIG2011,RNETMIG2012,RNETMIG2013,RNETMIG2014,RNETMIG2015
0,40,3,6,1,0,Alabama,Alabama,4779736,4780127,4785161,...,0.002295,-0.193196,0.381066,0.582002,-0.467369,1.030015,0.826644,1.383282,1.724718,0.712594
1,50,3,6,1,1,Alabama,Autauga County,54571,54571,54660,...,7.242091,-2.915927,-3.012349,2.265971,-2.530799,7.606016,-2.626146,-2.722002,2.59227,-2.187333
2,50,3,6,1,3,Alabama,Baldwin County,182265,182265,183193,...,14.83296,17.647293,21.845705,19.243287,17.197872,15.844176,18.559627,22.727626,20.317142,18.293499
3,50,3,6,1,5,Alabama,Barbour County,27457,27457,27341,...,-4.728132,-2.50069,-7.056824,-3.904217,-10.543299,-4.874741,-2.758113,-7.167664,-3.978583,-10.543299
4,50,3,6,1,7,Alabama,Bibb County,22915,22919,22861,...,-5.527043,-5.068871,-6.201001,-0.177537,0.177258,-5.088389,-4.363636,-5.403729,0.754533,1.107861


In [49]:
# To see the unique values in a specific column
# we can use the unique() function:
df['SUMLEV'].unique()

# Now we know there are two summary levels: 40 for state rows and 50 for county rows.

array([40, 50], dtype=int64)

In [50]:
# We are interested in the rows for counties.
df = df[df['SUMLEV'] == 50]
df.head()

Unnamed: 0,SUMLEV,REGION,DIVISION,STATE,COUNTY,STNAME,CTYNAME,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010,...,RDOMESTICMIG2011,RDOMESTICMIG2012,RDOMESTICMIG2013,RDOMESTICMIG2014,RDOMESTICMIG2015,RNETMIG2011,RNETMIG2012,RNETMIG2013,RNETMIG2014,RNETMIG2015
1,50,3,6,1,1,Alabama,Autauga County,54571,54571,54660,...,7.242091,-2.915927,-3.012349,2.265971,-2.530799,7.606016,-2.626146,-2.722002,2.59227,-2.187333
2,50,3,6,1,3,Alabama,Baldwin County,182265,182265,183193,...,14.83296,17.647293,21.845705,19.243287,17.197872,15.844176,18.559627,22.727626,20.317142,18.293499
3,50,3,6,1,5,Alabama,Barbour County,27457,27457,27341,...,-4.728132,-2.50069,-7.056824,-3.904217,-10.543299,-4.874741,-2.758113,-7.167664,-3.978583,-10.543299
4,50,3,6,1,7,Alabama,Bibb County,22915,22919,22861,...,-5.527043,-5.068871,-6.201001,-0.177537,0.177258,-5.088389,-4.363636,-5.403729,0.754533,1.107861
5,50,3,6,1,9,Alabama,Blount County,57322,57322,57373,...,1.807375,-1.177622,-1.748766,-2.062535,-1.36997,1.859511,-0.84858,-1.402476,-1.577232,-0.884411


In [51]:
# Instead of all these columns, we keep only the ones we are interested in.
columns_to_keep = ['STNAME','CTYNAME','BIRTHS2010','BIRTHS2011','BIRTHS2012','BIRTHS2013',
                   'BIRTHS2014','BIRTHS2015','POPESTIMATE2010','POPESTIMATE2011',
                   'POPESTIMATE2012','POPESTIMATE2013','POPESTIMATE2014','POPESTIMATE2015']

df = df[columns_to_keep]
df.head(100)

Unnamed: 0,STNAME,CTYNAME,BIRTHS2010,BIRTHS2011,BIRTHS2012,BIRTHS2013,BIRTHS2014,BIRTHS2015,POPESTIMATE2010,POPESTIMATE2011,POPESTIMATE2012,POPESTIMATE2013,POPESTIMATE2014,POPESTIMATE2015
1,Alabama,Autauga County,151,636,615,574,623,600,54660,55253,55175,55038,55290,55347
2,Alabama,Baldwin County,517,2187,2092,2160,2186,2240,183193,186659,190396,195126,199713,203709
3,Alabama,Barbour County,70,335,300,283,260,269,27341,27226,27159,26973,26815,26489
4,Alabama,Bibb County,44,266,245,259,247,253,22861,22733,22642,22512,22549,22583
5,Alabama,Blount County,183,744,710,646,618,603,57373,57711,57776,57734,57658,57673
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
97,Alaska,Yukon-Koyukuk Census Area,21,96,87,93,85,82,5588,5731,5731,5661,5563,5533
99,Arizona,Apache County,276,1075,984,995,974,1089,71766,72387,72946,71953,71858,71474
100,Arizona,Cochise County,397,1773,1687,1632,1653,1580,131809,132996,131888,129605,127321,126427
101,Arizona,Coconino County,428,1802,1723,1683,1670,1691,134626,134182,135978,136612,137637,139097


In [52]:
# After inspecting the data, we see that the census breaks down population
# estimates by state and county, so it is useful to index the data by both
# using .set_index() with a list of columns.

df = df.set_index(['STNAME','CTYNAME'])
df

Unnamed: 0_level_0,Unnamed: 1_level_0,BIRTHS2010,BIRTHS2011,BIRTHS2012,BIRTHS2013,BIRTHS2014,BIRTHS2015,POPESTIMATE2010,POPESTIMATE2011,POPESTIMATE2012,POPESTIMATE2013,POPESTIMATE2014,POPESTIMATE2015
STNAME,CTYNAME,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
Alabama,Autauga County,151,636,615,574,623,600,54660,55253,55175,55038,55290,55347
Alabama,Baldwin County,517,2187,2092,2160,2186,2240,183193,186659,190396,195126,199713,203709
Alabama,Barbour County,70,335,300,283,260,269,27341,27226,27159,26973,26815,26489
Alabama,Bibb County,44,266,245,259,247,253,22861,22733,22642,22512,22549,22583
Alabama,Blount County,183,744,710,646,618,603,57373,57711,57776,57734,57658,57673
...,...,...,...,...,...,...,...,...,...,...,...,...,...
Wyoming,Sweetwater County,167,640,595,657,629,620,43593,44041,45104,45162,44925,44626
Wyoming,Teton County,76,259,230,261,249,269,21297,21482,21697,22347,22905,23125
Wyoming,Uinta County,73,324,311,316,316,316,21102,20912,20989,21022,20903,20822
Wyoming,Washakie County,26,108,90,95,96,90,8545,8469,8443,8443,8316,8328


In [56]:
# The next question is how to query this data, i.e. how to pull out a
# specific row. Say we want the population results for Washtenaw County in
# the state of Michigan. We use .loc[] because we are selecting rows by index
# label, but the labels must follow the order of the index levels:
# state first, then county.

# df.loc['Washtenaw County','Michigan']  `NO`
print(df.loc['Michigan','Washtenaw County'])

# for more than one value we can pass a list of tuples inside .loc[[(),()]]
df.loc[ [('Michigan', 'Washtenaw County'),
         ('Michigan', 'Wayne County')] ]

BIRTHS2010            977
BIRTHS2011           3826
BIRTHS2012           3780
BIRTHS2013           3662
BIRTHS2014           3683
BIRTHS2015           3709
POPESTIMATE2010    345563
POPESTIMATE2011    349048
POPESTIMATE2012    351213
POPESTIMATE2013    354289
POPESTIMATE2014    357029
POPESTIMATE2015    358880
Name: (Michigan, Washtenaw County), dtype: int64


Unnamed: 0_level_0,Unnamed: 1_level_0,BIRTHS2010,BIRTHS2011,BIRTHS2012,BIRTHS2013,BIRTHS2014,BIRTHS2015,POPESTIMATE2010,POPESTIMATE2011,POPESTIMATE2012,POPESTIMATE2013,POPESTIMATE2014,POPESTIMATE2015
STNAME,CTYNAME,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
Michigan,Washtenaw County,977,3826,3780,3662,3683,3709,345563,349048,351213,354289,357029,358880
Michigan,Wayne County,5918,23819,23270,23377,23607,23586,1815199,1801273,1792514,1775713,1766008,1759335
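
The lookups above can be tried on a tiny frame without the census file. This is a minimal sketch: the name `census` and the three-row sample are assumptions made for illustration, with population figures copied from the outputs shown earlier.

```python
import pandas as pd

# Hypothetical miniature of the census frame (values copied from the outputs above).
census = pd.DataFrame(
    {
        "STNAME": ["Alabama", "Michigan", "Michigan"],
        "CTYNAME": ["Bibb County", "Washtenaw County", "Wayne County"],
        "POPESTIMATE2015": [22583, 358880, 1759335],
    }
).set_index(["STNAME", "CTYNAME"])

# A tuple selects one row: outer level (state) first, then inner level (county).
one_county = census.loc[("Michigan", "Washtenaw County")]

# A single outer-level label selects every county in that state.
michigan = census.loc["Michigan"]

# A list of tuples selects several specific rows and keeps both index levels.
two_counties = census.loc[
    [("Michigan", "Washtenaw County"), ("Michigan", "Wayne County")]
]
```

Selecting by the state alone drops the state level from the result, so `michigan` is indexed by county name only.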


__5. Missing Values__

- If you are running a survey and a respondent didn't answer a question, the missing value is actually an omission. This kind of missing data is called **Missing At Random** if there are other variables that might be used to predict the missing one.
- If there is no relationship to other variables, then we call this data **Missing Completely At Random**.
- These are just two examples of missing data, and there are many more. For instance, data might be missing because it wasn't collected, either by the process responsible for collecting that data, such as a researcher, or because it wouldn't make sense if it were collected.
- This last example is extremely common when you start joining DataFrames together from multiple sources, such as joining a list of people at a university with a list of offices in the university (students generally don't have offices but they are still people).
************
__Important Notes__
1. Although missing values are often formatted as NaN, NULL, None, or N/A, sometimes they are not labeled so clearly. For example, I've worked with social scientists who regularly used the value of 99 in binary categories to indicate a missing value.

2. The pandas read_csv() function has a parameter called `na_values` to let us
specify the `form of missing values`. It allows scalar, string, list, or dictionaries to be used.
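
As a rough illustration of the dictionary form of `na_values`, here is a sketch on an in-memory CSV. The column names `age` and `smoker` and the data are invented for the example; only `read_csv` and `na_values` come from the notes above.

```python
import io
import pandas as pd

# Hypothetical survey extract: 99 marks "no answer" in the binary column
# `smoker`, but it is a legitimate value in `age`.
raw = io.StringIO("age,smoker\n34,1\n99,0\n41,99\n")

# A dict applies the sentinel per column, so age == 99 is left untouched.
survey = pd.read_csv(raw, na_values={"smoker": [99]})
print(survey)
```

Passing a plain scalar or list instead would treat 99 as missing in every column, which is rarely what you want when the same number is valid elsewhere.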

In [2]:
import pandas as pd

df = pd.read_csv('data/class_grades.csv')
df.head(10)

Unnamed: 0,Prefix,Assignment,Tutorial,Midterm,TakeHome,Final
0,5,57.14,34.09,64.38,51.48,52.5
1,8,95.05,105.49,67.5,99.07,68.33
2,8,83.7,83.17,,63.15,48.89
3,7,,,49.38,105.93,80.56
4,8,91.32,93.64,95.0,107.41,73.89
5,7,95.0,92.58,93.12,97.78,68.06
6,8,95.05,102.99,56.25,99.07,50.0
7,7,72.85,86.85,60.0,,56.11
8,8,84.26,93.1,47.5,18.52,50.83
9,7,90.1,97.55,51.25,88.89,63.61


In [3]:
# To find out whether each cell is missing, we can use isnull(), which
# returns a boolean mask with the same shape as the DataFrame:
missing_mask = df.isnull()
missing_mask

Unnamed: 0,Prefix,Assignment,Tutorial,Midterm,TakeHome,Final
0,False,False,False,False,False,False
1,False,False,False,False,False,False
2,False,False,False,True,False,False
3,False,True,True,False,False,False
4,False,False,False,False,False,False
...,...,...,...,...,...,...
94,False,True,False,False,False,False
95,False,True,False,False,False,False
96,False,False,False,False,False,False
97,False,False,False,False,False,False
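
A boolean mask this large is hard to read by eye, so it is usually summarized. The sketch below assumes a small stand-in frame called `grades` (not the real class_grades.csv) and counts the missing cells per column.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row stand-in for the grades data.
grades = pd.DataFrame(
    {
        "Assignment": [57.14, np.nan, 83.70],
        "Midterm": [64.38, 67.50, np.nan],
        "Final": [52.50, 68.33, 48.89],
    }
)

# True counts as 1 when summed, so this gives missing cells per column.
missing_per_column = grades.isnull().sum()

# any(axis=1) flags rows that have at least one missing cell.
rows_with_gaps = grades[grades.isnull().any(axis=1)]
print(missing_per_column)
```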


In [6]:
# Now that we know there are missing values, we can use dropna() to drop
# every row that contains at least one of them.
df.dropna() # dropna() returns a new DataFrame; to make the change permanent,
            # assign the result back to df or pass inplace=True.

Unnamed: 0,Prefix,Assignment,Tutorial,Midterm,TakeHome,Final
0,5,57.14,34.09,64.38,51.48,52.50
1,8,95.05,105.49,67.50,99.07,68.33
4,8,91.32,93.64,95.00,107.41,73.89
5,7,95.00,92.58,93.12,97.78,68.06
6,8,95.05,102.99,56.25,99.07,50.00
...,...,...,...,...,...,...
92,7,95.60,82.28,76.88,108.33,78.33
93,8,87.52,91.58,56.25,71.85,85.00
96,8,89.94,102.77,87.50,90.74,87.78
97,7,95.60,76.13,66.25,99.81,85.56
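
Dropping every incomplete row can discard more data than necessary. As a sketch on an assumed two-column frame named `grades`, the `subset` parameter of `dropna()` restricts the check to the columns that matter.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for two of the grade columns.
grades = pd.DataFrame(
    {"Midterm": [64.38, np.nan, 49.38], "TakeHome": [51.48, 63.15, np.nan]}
)

complete_rows = grades.dropna()                  # drops rows with any NaN
has_midterm = grades.dropna(subset=["Midterm"])  # only Midterm must be present

# Neither call modified the original frame.
print(len(grades), len(complete_rows), len(has_midterm))
```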


In [7]:
# What if we want to fill the missing values with something instead?
# fillna() does that. Here we fill them with 0, permanently:
df.fillna(0,inplace=True)
df.head(10)

Unnamed: 0,Prefix,Assignment,Tutorial,Midterm,TakeHome,Final
0,5,57.14,34.09,64.38,51.48,52.5
1,8,95.05,105.49,67.5,99.07,68.33
2,8,83.7,83.17,0.0,63.15,48.89
3,7,0.0,0.0,49.38,105.93,80.56
4,8,91.32,93.64,95.0,107.41,73.89
5,7,95.0,92.58,93.12,97.78,68.06
6,8,95.05,102.99,56.25,99.07,50.0
7,7,72.85,86.85,60.0,0.0,56.11
8,8,84.26,93.1,47.5,18.52,50.83
9,7,90.1,97.55,51.25,88.89,63.61


- We can also pass `na_filter=False` to turn off missing-value detection altogether, so that empty fields are read as empty strings rather than NaN. This is useful if white space is an actual value of interest.

- But in practice, this is pretty rare. In data without any NAs, passing `na_filter=False`, can improve the performance of reading a large file.
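
To see what `na_filter` changes, here is a small sketch that reads the same in-memory CSV twice. The columns `user` and `comment` are made up for the example.

```python
import io
import pandas as pd

# Hypothetical data with one empty field in the `comment` column.
csv_text = "user,comment\nalice,\nbob,ok\n"

# Default behaviour: the empty field is detected and stored as NaN.
default = pd.read_csv(io.StringIO(csv_text))

# With na_filter=False no detection happens and the empty string is kept.
unfiltered = pd.read_csv(io.StringIO(csv_text), na_filter=False)
print(default["comment"].tolist(), unfiltered["comment"].tolist())
```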
____
- In addition to rules controlling how missing values might be loaded, it's sometimes useful to **consider missing values as actually having information**.
- I'll give an example from my own research.  I often deal with logs from online learning systems:

    - I've looked at video use in lecture capture systems. In these systems it's common for the player to have a heartbeat functionality where playback statistics are sent to the server every so often, maybe every 30 seconds.

    - These heartbeats can get big as they can carry the whole state of the playback system, such as where the video play head is at, what the video size is, which video is being rendered to the screen, and how loud the volume is.

In [39]:
df = pd.read_csv('data/log.csv')
df.head(20)

# time : timestamp in the Unix epoch format.
# user : user name.
# video : the web page visited and the video played on it.
# playback position : position of the video play head, reported to the
#                     server with each heartbeat. As the playback position
#                     increases by 1, the time increases by about 30 seconds.

# The exception is user bob. It turns out that bob has paused his playback,
# so as time increases his playback position doesn't change.

# Note too how difficult it is to derive this knowledge from the data,
# because it's not sorted by timestamp as one might expect.

# This is actually not uncommon on systems which have a high degree of
# parallelism. There are also a lot of missing values in the paused and
# volume columns.

# It's not efficient to send this information across the network if it hasn't
# changed, so this particular system just inserts null values into the
# database if there are no changes.

Unnamed: 0,time,user,video,playback position,paused,volume
0,1469974424,cheryl,intro.html,5,False,10.0
1,1469974454,cheryl,intro.html,6,,
2,1469974544,cheryl,intro.html,9,,
3,1469974574,cheryl,intro.html,10,,
4,1469977514,bob,intro.html,1,,
5,1469977544,bob,intro.html,1,,
6,1469977574,bob,intro.html,1,,
7,1469977604,bob,intro.html,1,,
8,1469974604,cheryl,intro.html,11,,
9,1469974694,cheryl,intro.html,14,,


In [40]:
df = df.set_index('time')
df = df.sort_index()
df

Unnamed: 0_level_0,user,video,playback position,paused,volume
time,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1469974424,cheryl,intro.html,5,False,10.0
1469974424,sue,advanced.html,23,False,10.0
1469974454,cheryl,intro.html,6,,
1469974454,sue,advanced.html,24,,
1469974484,cheryl,intro.html,7,,
1469974514,cheryl,intro.html,8,,
1469974524,sue,advanced.html,25,,
1469974544,cheryl,intro.html,9,,
1469974554,sue,advanced.html,26,,
1469974574,cheryl,intro.html,10,,


In [41]:
# After setting time as the index and sorting in ascending order, we can see
# that multiple users were using the system at the same time, so it is better
# to index the data by both time and user.

df = df.reset_index()
df = df.set_index(['time','user'])
df

Unnamed: 0_level_0,Unnamed: 1_level_0,video,playback position,paused,volume
time,user,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1469974424,cheryl,intro.html,5,False,10.0
1469974424,sue,advanced.html,23,False,10.0
1469974454,cheryl,intro.html,6,,
1469974454,sue,advanced.html,24,,
1469974484,cheryl,intro.html,7,,
1469974514,cheryl,intro.html,8,,
1469974524,sue,advanced.html,25,,
1469974544,cheryl,intro.html,9,,
1469974554,sue,advanced.html,26,,
1469974574,cheryl,intro.html,10,,


In [42]:
# Now that the data is indexed and sorted properly, we can fill the NaNs
# with ffill() (forward fill) or bfill() (backward fill). Older pandas
# versions spell this fillna(method='ffill'), which is now deprecated.
df = df.ffill()
df.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,video,playback position,paused,volume
time,user,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1469974424,cheryl,intro.html,5,False,10.0
1469974424,sue,advanced.html,23,False,10.0
1469974454,cheryl,intro.html,6,False,10.0
1469974454,sue,advanced.html,24,False,10.0
1469974484,cheryl,intro.html,7,False,10.0
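
One caveat worth sketching: a plain forward fill copies the value from the row above, whoever that row belongs to. In the log above both users happen to share the same settings, so it does not matter. The hypothetical frame below gives sue a different volume to show the difference. Filling per user with `groupby` is a variation, not what the notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical interleaved log in which the two users have different volumes.
log = pd.DataFrame(
    {
        "time": [1, 1, 2, 2],
        "user": ["cheryl", "sue", "cheryl", "sue"],
        "volume": [10.0, 5.0, np.nan, np.nan],
    }
)

# Plain forward fill: cheryl's second row inherits sue's volume.
plain = log["volume"].ffill()

# Forward fill within each user: every user inherits only their own last value.
per_user = log.groupby("user")["volume"].ffill()
print(plain.tolist(), per_user.tolist())
```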


In [43]:
# We can also do a customized fill-in by replacing values with replace().
# It supports several approaches: value-to-value, list, dictionary, and
# regex. Let's generate a simple example.

df = pd.DataFrame({'A':[1,1,2,3,4],
                   'B':[3,6,3,8,9],
                   'C':['a','b','c','d','e']})
df

Unnamed: 0,A,B,C
0,1,3,a
1,1,6,b
2,2,3,c
3,3,8,d
4,4,9,e


In [46]:
df.replace(1,100)

Unnamed: 0,A,B,C
0,100,3,a
1,100,6,b
2,2,3,c
3,3,8,d
4,4,9,e


In [47]:
# To change two values at once we pass lists to replace().
# Here we want 1 to become 100 and 3 to become 300:
df.replace([1,3],[100,300])

Unnamed: 0,A,B,C
0,100,300,a
1,100,6,b
2,2,300,c
3,300,8,d
4,4,9,e
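
The dictionary approach mentioned earlier is not shown in the cells above, so here is a minimal sketch of it. The frame `df_small` is an assumed copy of columns A and B from the example.

```python
import pandas as pd

# Assumed copy of the numeric columns from the example frame.
df_small = pd.DataFrame({"A": [1, 1, 2, 3, 4], "B": [3, 6, 3, 8, 9]})

# A flat dict maps old values to new ones across the whole frame.
everywhere = df_small.replace({1: 100, 3: 300})

# A nested dict limits the replacement to the named column.
only_in_a = df_small.replace({"A": {1: 100}})
print(everywhere)
print(only_in_a)
```

The flat dict gives the same result as the two-list call, but keeps each old value next to its replacement.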


In [53]:
# What is really cool about pandas replacement is that it supports regex:
df = pd.read_csv('data/log.csv')
df

Unnamed: 0,time,user,video,playback position,paused,volume
0,1469974424,cheryl,intro.html,5,False,10.0
1,1469974454,cheryl,intro.html,6,,
2,1469974544,cheryl,intro.html,9,,
3,1469974574,cheryl,intro.html,10,,
4,1469977514,bob,intro.html,1,,
5,1469977544,bob,intro.html,1,,
6,1469977574,bob,intro.html,1,,
7,1469977604,bob,intro.html,1,,
8,1469974604,cheryl,intro.html,11,,
9,1469974694,cheryl,intro.html,14,,


In [56]:
# To replace using a regex, we make the first parameter to replace() the
# regex pattern we want to match and the second parameter the value we want
# to emit upon a match, and then we pass in a third parameter, "regex=True".

# Take a moment to pause and think about this problem: imagine we want to
# detect all html pages in the "video" column, let's say that just means they
# end with ".html", and we want to overwrite that with the keyword "webpage".
# How could we accomplish this?

# The dot before "html" is escaped so that it matches a literal "." only.
df.replace(to_replace=r".*\.html$", value="webpage", regex=True)
# df.replace(to_replace=r"\w*\.html", value="webpage", regex=True)

Unnamed: 0,time,user,video,playback position,paused,volume
0,1469974424,cheryl,webpage,5,False,10.0
1,1469974454,cheryl,webpage,6,,
2,1469974544,cheryl,webpage,9,,
3,1469974574,cheryl,webpage,10,,
4,1469977514,bob,webpage,1,,
5,1469977544,bob,webpage,1,,
6,1469977574,bob,webpage,1,,
7,1469977604,bob,webpage,1,,
8,1469974604,cheryl,webpage,11,,
9,1469974694,cheryl,webpage,14,,
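
In a regex an unescaped `.` matches any character, so a pattern meant to catch a ".html" suffix behaves differently depending on whether the dot is escaped. The sketch below uses an invented Series, `videos`, containing one value chosen to expose the difference.

```python
import pandas as pd

# Hypothetical values: "notes_html" has no literal dot before "html".
videos = pd.Series(["intro.html", "notes_html", "clip.mp4"])

# Unescaped dot: "_" is accepted in place of ".", so "notes_html" also matches.
loose = videos.replace(to_replace=r".*.html$", value="webpage", regex=True)

# Escaped dot: only values that really end in ".html" are replaced.
strict = videos.replace(to_replace=r".*\.html$", value="webpage", regex=True)
print(loose.tolist(), strict.tolist())
```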


__6. Example: Manipulating DataFrame__

In [149]:
import pandas as pd
df = pd.read_csv('data/presidents.csv')
df.head()

Unnamed: 0,#,President,Born,Age atstart of presidency,Age atend of presidency,Post-presidencytimespan,Died,Age
0,1,George Washington,"Feb 22, 1732[a]","57 years, 67 daysApr 30, 1789","65 years, 10 daysMar 4, 1797","2 years, 285 days","Dec 14, 1799","67 years, 295 days"
1,2,John Adams,"Oct 30, 1735[a]","61 years, 125 daysMar 4, 1797","65 years, 125 daysMar 4, 1801","25 years, 122 days","Jul 4, 1826","90 years, 247 days"
2,3,Thomas Jefferson,"Apr 13, 1743[a]","57 years, 325 daysMar 4, 1801","65 years, 325 daysMar 4, 1809","17 years, 122 days","Jul 4, 1826","83 years, 82 days"
3,4,James Madison,"Mar 16, 1751[a]","57 years, 353 daysMar 4, 1809","65 years, 353 daysMar 4, 1817","19 years, 116 days","Jun 28, 1836","85 years, 104 days"
4,5,James Monroe,"Apr 28, 1758","58 years, 310 daysMar 4, 1817","66 years, 310 daysMar 4, 1825","6 years, 122 days","Jul 4, 1831","73 years, 67 days"


In [150]:
# Let's try splitting the name on whitespace with .str.split():
new_df = df.copy()
new_df['First'] = new_df.President.str.split(expand=True)[0]
new_df['last'] = new_df.President.str.split(expand=True)[1]

new_df.head(10)

Unnamed: 0,#,President,Born,Age atstart of presidency,Age atend of presidency,Post-presidencytimespan,Died,Age,First,last
0,1,George Washington,"Feb 22, 1732[a]","57 years, 67 daysApr 30, 1789","65 years, 10 daysMar 4, 1797","2 years, 285 days","Dec 14, 1799","67 years, 295 days",George,Washington
1,2,John Adams,"Oct 30, 1735[a]","61 years, 125 daysMar 4, 1797","65 years, 125 daysMar 4, 1801","25 years, 122 days","Jul 4, 1826","90 years, 247 days",John,Adams
2,3,Thomas Jefferson,"Apr 13, 1743[a]","57 years, 325 daysMar 4, 1801","65 years, 325 daysMar 4, 1809","17 years, 122 days","Jul 4, 1826","83 years, 82 days",Thomas,Jefferson
3,4,James Madison,"Mar 16, 1751[a]","57 years, 353 daysMar 4, 1809","65 years, 353 daysMar 4, 1817","19 years, 116 days","Jun 28, 1836","85 years, 104 days",James,Madison
4,5,James Monroe,"Apr 28, 1758","58 years, 310 daysMar 4, 1817","66 years, 310 daysMar 4, 1825","6 years, 122 days","Jul 4, 1831","73 years, 67 days",James,Monroe
5,6,John Quincy Adams,"Jul 11, 1767","57 years, 236 daysMar 4, 1825","61 years, 236 daysMar 4, 1829","18 years, 356 days","Feb 23, 1848","80 years, 227 days",John,Quincy
6,7,Andrew Jackson,"Mar 15, 1767","61 years, 354 daysMar 4, 1829","69 years, 354 daysMar 4, 1837","8 years, 96 days","Jun 8, 1845","78 years, 85 days",Andrew,Jackson
7,8,Martin Van Buren,"Dec 5, 1782","54 years, 89 daysMar 4, 1837","58 years, 89 daysMar 4, 1841","21 years, 142 days","Jul 24, 1862","79 years, 231 days",Martin,Van
8,9,William H. Harrison,"Feb 9, 1773","68 years, 23 daysMar 4, 1841","68 years, 54 days Apr 4, 1841[b]",,"Apr 4, 1841","68 years, 54 days",William,H.
9,10,John Tyler,"Mar 29, 1790","51 years, 6 daysApr 4, 1841","54 years, 340 daysMar 4, 1845","16 years, 320 days","Jan 18, 1862","71 years, 295 days",John,Tyler


In [151]:
# After inspecting the data we found a lot of flaws, especially in the last
# name, so let's try regex with .replace() instead:
df['first'] = df['President'].replace(r'[ ]\w*','',regex=True)
df['last']= df['President'].replace(r'\w*[ ]','',regex=True)

df

Unnamed: 0,#,President,Born,Age atstart of presidency,Age atend of presidency,Post-presidencytimespan,Died,Age,first,last
0,1,George Washington,"Feb 22, 1732[a]","57 years, 67 daysApr 30, 1789","65 years, 10 daysMar 4, 1797","2 years, 285 days","Dec 14, 1799","67 years, 295 days",George,Washington
1,2,John Adams,"Oct 30, 1735[a]","61 years, 125 daysMar 4, 1797","65 years, 125 daysMar 4, 1801","25 years, 122 days","Jul 4, 1826","90 years, 247 days",John,Adams
2,3,Thomas Jefferson,"Apr 13, 1743[a]","57 years, 325 daysMar 4, 1801","65 years, 325 daysMar 4, 1809","17 years, 122 days","Jul 4, 1826","83 years, 82 days",Thomas,Jefferson
3,4,James Madison,"Mar 16, 1751[a]","57 years, 353 daysMar 4, 1809","65 years, 353 daysMar 4, 1817","19 years, 116 days","Jun 28, 1836","85 years, 104 days",James,Madison
4,5,James Monroe,"Apr 28, 1758","58 years, 310 daysMar 4, 1817","66 years, 310 daysMar 4, 1825","6 years, 122 days","Jul 4, 1831","73 years, 67 days",James,Monroe
5,6,John Quincy Adams,"Jul 11, 1767","57 years, 236 daysMar 4, 1825","61 years, 236 daysMar 4, 1829","18 years, 356 days","Feb 23, 1848","80 years, 227 days",John,Adams
6,7,Andrew Jackson,"Mar 15, 1767","61 years, 354 daysMar 4, 1829","69 years, 354 daysMar 4, 1837","8 years, 96 days","Jun 8, 1845","78 years, 85 days",Andrew,Jackson
7,8,Martin Van Buren,"Dec 5, 1782","54 years, 89 daysMar 4, 1837","58 years, 89 daysMar 4, 1841","21 years, 142 days","Jul 24, 1862","79 years, 231 days",Martin,Buren
8,9,William H. Harrison,"Feb 9, 1773","68 years, 23 daysMar 4, 1841","68 years, 54 days Apr 4, 1841[b]",,"Apr 4, 1841","68 years, 54 days",William.,H.Harrison
9,10,John Tyler,"Mar 29, 1790","51 years, 6 daysApr 4, 1841","54 years, 340 daysMar 4, 1845","16 years, 320 days","Jan 18, 1862","71 years, 295 days",John,Tyler


In [152]:
# The previous code mostly worked (row 8 is still wrong), but it was slow,
# so let's try apply(). First we need to drop the two columns we just created:
del(df['first'],df['last'])
df

Unnamed: 0,#,President,Born,Age atstart of presidency,Age atend of presidency,Post-presidencytimespan,Died,Age
0,1,George Washington,"Feb 22, 1732[a]","57 years, 67 daysApr 30, 1789","65 years, 10 daysMar 4, 1797","2 years, 285 days","Dec 14, 1799","67 years, 295 days"
1,2,John Adams,"Oct 30, 1735[a]","61 years, 125 daysMar 4, 1797","65 years, 125 daysMar 4, 1801","25 years, 122 days","Jul 4, 1826","90 years, 247 days"
2,3,Thomas Jefferson,"Apr 13, 1743[a]","57 years, 325 daysMar 4, 1801","65 years, 325 daysMar 4, 1809","17 years, 122 days","Jul 4, 1826","83 years, 82 days"
3,4,James Madison,"Mar 16, 1751[a]","57 years, 353 daysMar 4, 1809","65 years, 353 daysMar 4, 1817","19 years, 116 days","Jun 28, 1836","85 years, 104 days"
4,5,James Monroe,"Apr 28, 1758","58 years, 310 daysMar 4, 1817","66 years, 310 daysMar 4, 1825","6 years, 122 days","Jul 4, 1831","73 years, 67 days"
5,6,John Quincy Adams,"Jul 11, 1767","57 years, 236 daysMar 4, 1825","61 years, 236 daysMar 4, 1829","18 years, 356 days","Feb 23, 1848","80 years, 227 days"
6,7,Andrew Jackson,"Mar 15, 1767","61 years, 354 daysMar 4, 1829","69 years, 354 daysMar 4, 1837","8 years, 96 days","Jun 8, 1845","78 years, 85 days"
7,8,Martin Van Buren,"Dec 5, 1782","54 years, 89 daysMar 4, 1837","58 years, 89 daysMar 4, 1841","21 years, 142 days","Jul 24, 1862","79 years, 231 days"
8,9,William H. Harrison,"Feb 9, 1773","68 years, 23 daysMar 4, 1841","68 years, 54 days Apr 4, 1841[b]",,"Apr 4, 1841","68 years, 54 days"
9,10,John Tyler,"Mar 29, 1790","51 years, 6 daysApr 4, 1841","54 years, 340 daysMar 4, 1845","16 years, 320 days","Jan 18, 1862","71 years, 295 days"


In [153]:
# The apply() function on a DataFrame takes an arbitrary function you have
# written and applies it to each row or each column, passing it in as a
# Series. With axis='columns' the function receives one row at a time.
# Let's write a function which splits the President string of a single row
# into a first and a last name.

def splitname(data_f):
    data_f['first_name'] = data_f['President'].split(" ")[0]
    data_f['last_name'] = data_f['President'].split(" ")[-1]
    return data_f

df = df.apply(splitname,axis = 'columns')
df

Unnamed: 0,#,President,Born,Age atstart of presidency,Age atend of presidency,Post-presidencytimespan,Died,Age,first_name,last_name
0,1,George Washington,"Feb 22, 1732[a]","57 years, 67 daysApr 30, 1789","65 years, 10 daysMar 4, 1797","2 years, 285 days","Dec 14, 1799","67 years, 295 days",George,Washington
1,2,John Adams,"Oct 30, 1735[a]","61 years, 125 daysMar 4, 1797","65 years, 125 daysMar 4, 1801","25 years, 122 days","Jul 4, 1826","90 years, 247 days",John,Adams
2,3,Thomas Jefferson,"Apr 13, 1743[a]","57 years, 325 daysMar 4, 1801","65 years, 325 daysMar 4, 1809","17 years, 122 days","Jul 4, 1826","83 years, 82 days",Thomas,Jefferson
3,4,James Madison,"Mar 16, 1751[a]","57 years, 353 daysMar 4, 1809","65 years, 353 daysMar 4, 1817","19 years, 116 days","Jun 28, 1836","85 years, 104 days",James,Madison
4,5,James Monroe,"Apr 28, 1758","58 years, 310 daysMar 4, 1817","66 years, 310 daysMar 4, 1825","6 years, 122 days","Jul 4, 1831","73 years, 67 days",James,Monroe
5,6,John Quincy Adams,"Jul 11, 1767","57 years, 236 daysMar 4, 1825","61 years, 236 daysMar 4, 1829","18 years, 356 days","Feb 23, 1848","80 years, 227 days",John,Adams
6,7,Andrew Jackson,"Mar 15, 1767","61 years, 354 daysMar 4, 1829","69 years, 354 daysMar 4, 1837","8 years, 96 days","Jun 8, 1845","78 years, 85 days",Andrew,Jackson
7,8,Martin Van Buren,"Dec 5, 1782","54 years, 89 daysMar 4, 1837","58 years, 89 daysMar 4, 1841","21 years, 142 days","Jul 24, 1862","79 years, 231 days",Martin,Buren
8,9,William H. Harrison,"Feb 9, 1773","68 years, 23 daysMar 4, 1841","68 years, 54 days Apr 4, 1841[b]",,"Apr 4, 1841","68 years, 54 days",William,Harrison
9,10,John Tyler,"Mar 29, 1790","51 years, 6 daysApr 4, 1841","54 years, 340 daysMar 4, 1845","16 years, 320 days","Jan 18, 1862","71 years, 295 days",John,Tyler
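
For comparison, the same first/last split can be written without a row-wise `apply()`, using the vectorized `.str` accessor. This is a sketch on an assumed four-name Series, `presidents`, and is an alternative rather than the approach taken in the cell above.

```python
import pandas as pd

# Hypothetical sample containing the tricky multi-part names.
presidents = pd.Series(
    ["George Washington", "John Quincy Adams", "Martin Van Buren", "William H. Harrison"]
)

# Split each name on whitespace into a list of parts.
parts = presidents.str.split()

# .str[i] indexes into each list: first element and last element.
first_name = parts.str[0]
last_name = parts.str[-1]
print(first_name.tolist(), last_name.tolist())
```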


In [154]:
# This works, but is it really any cleaner than regex with replace()?
# Let's try the extract() function for this task.
# Drop these columns again:
del(df['first_name'])
del(df['last_name'])

# extract() takes a regex pattern and returns one column per capture group:
#   .   matches any character except a newline.
#   $   anchors the match at the end of the string.
#   ?:  marks a non-capturing group, so it produces no output column.
pattern = r'(\w*)(?:.*[ ])(\w*$)'
df['President'].str.extract(pattern).tail()

Unnamed: 0,0,1
39,Ronald,Reagan
40,George,Bush
41,Bill,Clinton
42,George,Bush
43,Barack,Obama


In [155]:
# If we name the groups with `?P<name>`, we get named columns.
pattern = r'(?P<first_n>\w*)(?:.*[ ])(?P<last_n>\w*)' # $ has been removed
names = df['President'].str.extract(pattern)  # same result.
names.tail()

Unnamed: 0,first_n,last_n
39,Ronald,Reagan
40,George,Bush
41,Bill,Clinton
42,George,Bush
43,Barack,Obama


In [156]:
# we can easily copy these into our main data frame:
df['First'] = names['first_n']
df['Last'] = names['last_n']
df

Unnamed: 0,#,President,Born,Age atstart of presidency,Age atend of presidency,Post-presidencytimespan,Died,Age,First,Last
0,1,George Washington,"Feb 22, 1732[a]","57 years, 67 daysApr 30, 1789","65 years, 10 daysMar 4, 1797","2 years, 285 days","Dec 14, 1799","67 years, 295 days",George,Washington
1,2,John Adams,"Oct 30, 1735[a]","61 years, 125 daysMar 4, 1797","65 years, 125 daysMar 4, 1801","25 years, 122 days","Jul 4, 1826","90 years, 247 days",John,Adams
2,3,Thomas Jefferson,"Apr 13, 1743[a]","57 years, 325 daysMar 4, 1801","65 years, 325 daysMar 4, 1809","17 years, 122 days","Jul 4, 1826","83 years, 82 days",Thomas,Jefferson
3,4,James Madison,"Mar 16, 1751[a]","57 years, 353 daysMar 4, 1809","65 years, 353 daysMar 4, 1817","19 years, 116 days","Jun 28, 1836","85 years, 104 days",James,Madison
4,5,James Monroe,"Apr 28, 1758","58 years, 310 daysMar 4, 1817","66 years, 310 daysMar 4, 1825","6 years, 122 days","Jul 4, 1831","73 years, 67 days",James,Monroe
5,6,John Quincy Adams,"Jul 11, 1767","57 years, 236 daysMar 4, 1825","61 years, 236 daysMar 4, 1829","18 years, 356 days","Feb 23, 1848","80 years, 227 days",John,Adams
6,7,Andrew Jackson,"Mar 15, 1767","61 years, 354 daysMar 4, 1829","69 years, 354 daysMar 4, 1837","8 years, 96 days","Jun 8, 1845","78 years, 85 days",Andrew,Jackson
7,8,Martin Van Buren,"Dec 5, 1782","54 years, 89 daysMar 4, 1837","58 years, 89 daysMar 4, 1841","21 years, 142 days","Jul 24, 1862","79 years, 231 days",Martin,Buren
8,9,William H. Harrison,"Feb 9, 1773","68 years, 23 daysMar 4, 1841","68 years, 54 days Apr 4, 1841[b]",,"Apr 4, 1841","68 years, 54 days",William,Harrison
9,10,John Tyler,"Mar 29, 1790","51 years, 6 daysApr 4, 1841","54 years, 340 daysMar 4, 1845","16 years, 320 days","Jan 18, 1862","71 years, 295 days",John,Tyler


In [158]:
# Now let's move on to cleaning up the Born column.
# First let's get rid of anything that does not match the Month Day, Year format.
df['Born'] = df['Born'].str.extract(r'([\w]{3} [\w]{1,2}, [\w]{4})')
df.head()

Unnamed: 0,#,President,Born,Age atstart of presidency,Age atend of presidency,Post-presidencytimespan,Died,Age,First,Last
0,1,George Washington,"Feb 22, 1732","57 years, 67 daysApr 30, 1789","65 years, 10 daysMar 4, 1797","2 years, 285 days","Dec 14, 1799","67 years, 295 days",George,Washington
1,2,John Adams,"Oct 30, 1735","61 years, 125 daysMar 4, 1797","65 years, 125 daysMar 4, 1801","25 years, 122 days","Jul 4, 1826","90 years, 247 days",John,Adams
2,3,Thomas Jefferson,"Apr 13, 1743","57 years, 325 daysMar 4, 1801","65 years, 325 daysMar 4, 1809","17 years, 122 days","Jul 4, 1826","83 years, 82 days",Thomas,Jefferson
3,4,James Madison,"Mar 16, 1751","57 years, 353 daysMar 4, 1809","65 years, 353 daysMar 4, 1817","19 years, 116 days","Jun 28, 1836","85 years, 104 days",James,Madison
4,5,James Monroe,"Apr 28, 1758","58 years, 310 daysMar 4, 1817","66 years, 310 daysMar 4, 1825","6 years, 122 days","Jul 4, 1831","73 years, 67 days",James,Monroe


In [164]:
# Let's check the type of the data stored in the Born column.
df['Born'].head(1)

0    Feb 22, 1732
Name: Born, dtype: object

In [165]:
# The date is stored as a string.
# It's better to convert it to datetime using pd.to_datetime():
df['Born'] = pd.to_datetime(df['Born'])
df['Born'].head(1)
# This makes date-based sorting and filtering possible,
# such as finding the presidents born in a given year, and so forth.

0   1732-02-22
Name: Born, dtype: datetime64[ns]
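
Once the column is a datetime, the `.dt` accessor and ordinary comparisons become available. The sketch below repeats the clean-up on an assumed three-value Series, `born_raw`. The stricter `\d` pattern and the explicit `format` string are choices made for this example, not taken from the cells above.

```python
import pandas as pd

# Hypothetical sample of raw Born values, including the footnote markers.
born_raw = pd.Series(["Feb 22, 1732[a]", "Oct 30, 1735[a]", "Apr 28, 1758"])

# Keep only the "Mon DD, YYYY" part; extract() returns a DataFrame, so take column 0.
born_text = born_raw.str.extract(r"(\w{3} \d{1,2}, \d{4})")[0]

# Parse with an explicit format instead of relying on inference.
born = pd.to_datetime(born_text, format="%b %d, %Y")

born_years = born.dt.year                              # year of each date
born_before_1740 = born[born < pd.Timestamp("1740-01-01")]  # date-based filter
print(born_years.tolist())
```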

In [167]:
df.head(5)

Unnamed: 0,#,President,Born,Age atstart of presidency,Age atend of presidency,Post-presidencytimespan,Died,Age,First,Last
0,1,George Washington,1732-02-22,"57 years, 67 daysApr 30, 1789","65 years, 10 daysMar 4, 1797","2 years, 285 days","Dec 14, 1799","67 years, 295 days",George,Washington
1,2,John Adams,1735-10-30,"61 years, 125 daysMar 4, 1797","65 years, 125 daysMar 4, 1801","25 years, 122 days","Jul 4, 1826","90 years, 247 days",John,Adams
2,3,Thomas Jefferson,1743-04-13,"57 years, 325 daysMar 4, 1801","65 years, 325 daysMar 4, 1809","17 years, 122 days","Jul 4, 1826","83 years, 82 days",Thomas,Jefferson
3,4,James Madison,1751-03-16,"57 years, 353 daysMar 4, 1809","65 years, 353 daysMar 4, 1817","19 years, 116 days","Jun 28, 1836","85 years, 104 days",James,Madison
4,5,James Monroe,1758-04-28,"58 years, 310 daysMar 4, 1817","66 years, 310 daysMar 4, 1825","6 years, 122 days","Jul 4, 1831","73 years, 67 days",James,Monroe


# Quiz 2

### 1


In [160]:
import pandas as pd
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj1 = pd.Series(sdata)
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj2 = pd.Series(sdata, index=states)
obj3 = pd.isnull(obj2)

In [161]:
obj1 ,obj2 ,obj3

(Ohio      35000
 Texas     71000
 Oregon    16000
 Utah       5000
 dtype: int64,
 California        NaN
 Ohio          35000.0
 Oregon        16000.0
 Texas         71000.0
 dtype: float64,
 California     True
 Ohio          False
 Oregon        False
 Texas         False
 dtype: bool)

In [162]:
obj2['California'] == None

False
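
The comparison returns False because the missing entry is stored as NaN, which is not `None` and does not even compare equal to itself. A short sketch of how to test for it reliably, rebuilding a two-entry version of `obj2`:

```python
import numpy as np
import pandas as pd

# Smaller rebuild of obj2: "California" has no entry in the dict, so it becomes NaN.
obj2 = pd.Series({"Ohio": 35000, "Texas": 71000}, index=["California", "Ohio"])
value = obj2["California"]

is_none = value is None          # False: the value is a float NaN, not None
equals_itself = value == value   # False: NaN never compares equal, even to itself
detected = pd.isnull(value)      # True: the reliable way to test for missing data
also_detected = np.isnan(value)  # True: works for numeric values only
print(is_none, equals_itself, detected, also_detected)
```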

### 2

In [163]:
import pandas as pd
d = {'1': 'Alice','2': 'Bob','3': 'Rita','4': 'Molly','5': 'Ryan'}
s = pd.Series(d)
s

1    Alice
2      Bob
3     Rita
4    Molly
5     Ryan
dtype: object

In [164]:
# Extract rows with student ranks of 3 or lower (the first three rows):
s.iloc[0:3]

1    Alice
2      Bob
3     Rita
dtype: object

### 3

In [165]:
df = pd.read_csv('data/Admission_Predict.csv',index_col=0)
df.head(1)

Unnamed: 0_level_0,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
Serial No.,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
1,337,118,4,4.5,4.5,9.65,1,0.92


In [166]:
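# rename() returns a new DataFrame. The result is not assigned here, so df
# keeps its original column names:
df.rename(mapper = lambda x: x.upper(), axis = 1)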
df.head(1)

Unnamed: 0_level_0,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
Serial No.,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
1,337,118,4,4.5,4.5,9.65,1,0.92
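
The headers are unchanged because `rename()` returns a new DataFrame instead of modifying the one it is called on. A minimal sketch, using an assumed two-column frame named `admit`:

```python
import pandas as pd

# Hypothetical two-column stand-in for the admissions data.
admit = pd.DataFrame({"GRE Score": [337], "Chance of Admit": [0.92]})

# The result is discarded, so admit keeps its original headers.
admit.rename(mapper=str.upper, axis=1)
unchanged = list(admit.columns)

# Assigning the result back makes the change stick.
admit = admit.rename(columns=str.upper)
print(unchanged, list(admit.columns))
```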


### 4

In [167]:
df = pd.DataFrame({'Serial No.' : [1,2,3,4,5],
                   'gre score':[337,324,316,322,314],
                   'toefl score':[118,107,104,110,103]})
df

Unnamed: 0,Serial No.,gre score,toefl score
0,1,337,118
1,2,324,107
2,3,316,104
3,4,322,110
4,5,314,103


In [168]:
df.where(df['toefl score'] > 105).dropna()

Unnamed: 0,Serial No.,gre score,toefl score
0,1.0,337.0,118.0
1,2.0,324.0,107.0
3,4.0,322.0,110.0


In [169]:
df[df['toefl score'] > 105]

Unnamed: 0,Serial No.,gre score,toefl score
0,1,337,118
1,2,324,107
3,4,322,110


In [170]:
df.where(df['toefl score'] > 105)

Unnamed: 0,Serial No.,gre score,toefl score
0,1.0,337.0,118.0
1,2.0,324.0,107.0
2,,,
3,4.0,322.0,110.0
4,,,


### 5
A DataFrame in pandas can be created from any of the following:
1. python dict
2. pandas Series object
3. 2D ndarray


### 6

In [171]:
nums = {'one':[0,4,8,12],
             'two':[1,5,9,13],
             'three': [2,6,10,14],
             'four':[3,7,11,15]}
df = pd.DataFrame(nums,index=['Ohio','Colorado','Utah','New York'])
df

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [172]:
df.drop(['Utah', 'Colorado'])

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
New York,12,13,14,15


In [173]:
df.drop('Ohio')

Unnamed: 0,one,two,three,four
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [174]:
df.drop('one', axis = 1)

Unnamed: 0,two,three,four
Ohio,1,2,3
Colorado,5,6,7
Utah,9,10,11
New York,13,14,15
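
By default `drop()` looks for the label among the rows, which is why the column name needs `axis = 1`. The sketch below, on an assumed two-row frame called `nums`, shows both the failure and the `columns=` spelling of the same operation.

```python
import pandas as pd

# Hypothetical cut-down version of the frame above.
nums = pd.DataFrame({"one": [0, 4], "two": [1, 5]}, index=["Ohio", "Colorado"])

# "two" is a column, not a row label, so the default row lookup fails.
try:
    nums.drop("two")
    raised = False
except KeyError:
    raised = True

# columns= is equivalent to passing axis=1.
without_two = nums.drop(columns="two")
print(raised, list(without_two.columns))
```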


In [175]:
# df.drop('two')  # raises KeyError: drop() looks for row labels by default, and 'two' is a column.

### 7

In [176]:
import pandas as pd
s1 = pd.Series({1: 'Alice', 2: 'Jack', 3: 'Molly'})
s2 = pd.Series({'Alice': 1, 'Jack': 2, 'Molly': 3})
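
`s1` is labelled by integers and `s2` by names, so label-based and position-based lookups behave differently on them. A small sketch of the difference on `s2`:

```python
import pandas as pd

s2 = pd.Series({"Alice": 1, "Jack": 2, "Molly": 3})

by_label = s2.loc["Jack"]   # .loc looks up the index label
by_position = s2.iloc[1]    # .iloc looks up the integer position

# There is no label 1 in s2, so a label lookup with 1 fails.
try:
    s2.loc[1]
    missing_label = False
except KeyError:
    missing_label = True
print(by_label, by_position, missing_label)
```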

In [177]:
# s2.loc[1]  # raises KeyError: s2 is labelled by names, so there is no label 1.

### 8

__loc and iloc are attributes of the pandas Series object, not methods.__

In [178]:
s = pd.Series({1: 'Alice', 2: 'Jack', 3: 'Molly'})
s1 = pd.Series({'Alice': 1, 'Jack': 2, 'Molly': 3})

In [179]:
pd.concat([s, s1]) # Returns a new Series rather than adding to the end of s.
                   # Series.append() behaved the same way but was removed
                   # in pandas 2.0.
s

1    Alice
2     Jack
3    Molly
dtype: object

In [180]:
s1.loc['Alice']

1

In [181]:
# Iterate over all the (label, value) pairs.
# items() replaces iteritems(), which was removed in pandas 2.0.
for items in s.items():
    print(items)

(1, 'Alice')
(2, 'Jack')
(3, 'Molly')


### 9

In [182]:
df = pd.DataFrame({'Serial No.' : [1,2,3,4,5],
                   'gre score':[337,324,316,322,314],
                   'toefl score':[118,107,104,110,103]})
df

Unnamed: 0,Serial No.,gre score,toefl score
0,1,337,118
1,2,324,107
2,3,316,104
3,4,322,110
4,5,314,103


In [183]:
(df['toefl score'] > 105) & (df['toefl score'] < 115)

0    False
1     True
2    False
3     True
4    False
Name: toefl score, dtype: bool

In [184]:
df[df['toefl score'].gt(105) & df['toefl score'].lt(115)]

Unnamed: 0,Serial No.,gre score,toefl score
1,2,324,107
3,4,322,110


In [185]:
(df['toefl score'] > 105) & (df['toefl score'] < 115)

0    False
1     True
2    False
3     True
4    False
Name: toefl score, dtype: bool
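
The same range test can also be written with `Series.between()`. This is a sketch that assumes pandas 1.3 or newer, where `inclusive` accepts the string "neither". The frame `scores` is a one-column copy of the data above.

```python
import pandas as pd

# One-column copy of the quiz data.
scores = pd.DataFrame({"toefl score": [118, 107, 104, 110, 103]})

# inclusive="neither" excludes both end points, matching > 105 and < 115.
mask = scores["toefl score"].between(105, 115, inclusive="neither")
in_range = scores[mask]
print(mask.tolist())
```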

### 10

In [186]:
data = {'Major' : ['Math','seco'],
        'Name':['Alice','Jack'],
        'Age':[20,22],
        'Gender':['F','M']}
df = pd.DataFrame(data, columns = ['Major','Name','Age','Gender'])
df = df.set_index('Major')
df

Unnamed: 0_level_0,Name,Age,Gender
Major,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Math,Alice,20,F
seco,Jack,22,M


In [187]:
df.T['Math']

Name      Alice
Age          20
Gender        F
Name: Math, dtype: object
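
Transposing and then selecting a column picks out the same values as selecting the row directly with `.loc`. A minimal sketch, rebuilding the frame under the assumed name `people`:

```python
import pandas as pd

# Rebuild of the quiz frame, indexed by Major.
people = pd.DataFrame(
    {"Name": ["Alice", "Jack"], "Age": [20, 22], "Gender": ["F", "M"]},
    index=pd.Index(["Math", "seco"], name="Major"),
)

via_transpose = people.T["Math"]  # column "Math" of the transposed frame
via_loc = people.loc["Math"]      # row "Math" of the original frame
print(list(via_transpose), list(via_loc))
```

`.loc` avoids building the transposed copy, which matters on larger frames.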