# Getting Started with Data Science

## Setup:

### Importing the necessary libraries and magic functions

Libraries:
* <a href="http://www.numpy.org/">**NumPy**</a>: Fundamental package for scientific computing
* <a href="https://pandas.pydata.org/">**Pandas**</a>: Library providing high-performance, easy-to-use data structures and data analysis tools
* <a href="https://matplotlib.org/">**Matplotlib**</a>: 2D Plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms
* <a href="https://seaborn.pydata.org/">**Seaborn**</a>: Data visualization library based on matplotlib

Magic Functions:
* `%matplotlib inline`: Jupyter Notebook magic function that displays plots inside the notebook

In [4]:
%matplotlib inline

import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns

## Pandas Series

A Series is a one-dimensional labeled array capable of holding any data type. The axis labels are collectively referred to as the index.

For this example, we will create a series of 5 samples containing information about height in cm.

In [6]:
s_height = pd.Series([165,154,170,183,120], index=['a','b','c','d','e'])

#### View the series

In [8]:
s_height

a    165
b    154
c    170
d    183
e    120
dtype: int64
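To make the "labeled array" idea concrete, the sketch below rebuilds the same series under the shorter, hypothetical name `s` and pulls its labels and values apart. It is only an illustration of the definition above, not part of the original notebook.

```python
import pandas as pd

# Same five heights (in cm) as in the cell above; `s` is an illustrative name.
s = pd.Series([165, 154, 170, 183, 120], index=["a", "b", "c", "d", "e"])

labels = list(s.index)   # axis labels, i.e. the index
values = s.to_numpy()    # underlying NumPy array of values
as_dict = s.to_dict()    # plain dict mapping label -> value

# Building a Series from a dict uses the keys as the index.
s_from_dict = pd.Series(as_dict)

print(labels)
print(values)
print(s_from_dict.equals(s))
```

Keeping labels and values together is what lets pandas align data by label rather than by position.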

#### Check series statistics

In [10]:
s_height.mean()

158.4

In [12]:
s_height.max()

183

In [14]:
s_height.min()

120

In [22]:
s_height.describe()

count      5.000000
mean     158.400000
std       23.860008
min      120.000000
25%      154.000000
50%      165.000000
75%      170.000000
max      183.000000
dtype: float64
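The `std` row reported by `describe()` is the sample standard deviation (it divides by n - 1, not n). The sketch below recomputes the mean and standard deviation by hand for the same five heights; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([165, 154, 170, 183, 120], index=list("abcde"))

n = len(s)
mean = s.sum() / n                                # 792 / 5
sample_var = ((s - mean) ** 2).sum() / (n - 1)    # divide by n - 1
sample_std = float(np.sqrt(sample_var))

print(mean, sample_std)
print(np.isclose(sample_std, s.std()))   # pandas uses ddof=1 by default
print(s.std(ddof=0))                     # population standard deviation
```

Passing `ddof=0` gives the population standard deviation, which is always slightly smaller for the same data.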

### Accessing Series Values

#### By index key:

In [24]:
s_height['b']

154

#### By index position (starting with 0)

In [26]:
s_height.iloc[2]

170

#### Access multiple values by specifying indices

In [28]:
s_height[['a','c','d']]

a    165
c    170
d    183
dtype: int64

#### Access multiple values by specifying index range

In [30]:
s_height['a':'d']

a    165
b    154
c    170
d    183
dtype: int64
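Note that the label slice above includes its end label `'d'`, unlike ordinary Python slicing. A minimal sketch of the difference between label-based and position-based slices, using the same heights under the illustrative name `s`:

```python
import pandas as pd

s = pd.Series([165, 154, 170, 183, 120], index=list("abcde"))

by_label = s["a":"d"]         # labels a, b, c, d: the end label is included
by_loc = s.loc["a":"d"]       # explicit label-based form, same result
by_position = s.iloc[0:3]     # positions 0, 1, 2: the end position is excluded

print(list(by_label.index))
print(list(by_position.index))
```

Using `.loc` and `.iloc` explicitly makes it clear to the reader which of the two rules applies.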

### Modifying Values

In [32]:
s_height['a'] = 175

In [34]:
s_height['a']

175

### Filtering Values

#### Check True/False value based on condition

In [36]:
s_height < 170

a    False
b     True
c    False
d    False
e     True
dtype: bool

#### Select samples that satisfy condition

In [38]:
s_height[s_height < 170]

b    154
e    120
dtype: int64

#### Select samples with "AND" condition

In [40]:
s_height[(s_height < 180) & (s_height > 160)]

a    175
c    170
dtype: int64

#### Select samples with "OR" condition

In [42]:
s_height[(s_height > 160) | (s_height.index == 'c' )]

a    175
c    170
d    183
dtype: int64
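The conditions above are boolean masks combined element-wise, which is why each comparison needs its own parentheses around `&` and `|`. The sketch below shows a few equivalent or related forms on the modified heights (with `a` set to 175); the names are illustrative.

```python
import pandas as pd

# Heights after the modification above (a = 175).
s = pd.Series([175, 154, 170, 183, 120], index=list("abcde"))

mask_and = (s > 160) & (s < 180)                       # element-wise AND
mask_between = s.between(160, 180, inclusive="neither")  # same open interval
mask_or = (s > 180) | s.index.isin(["b"])              # element-wise OR
not_below_170 = s[~(s < 170)]                          # ~ negates a mask

print(s[mask_and])
print(s[mask_or])
print(not_below_170)
```

Python's `and`/`or` keywords do not work here because they try to reduce a whole Series to a single truth value.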

### Plotting the series

In [46]:
s_height.plot.bar(title="Heights")

<Axes: title={'center': 'Heights'}>

## Pandas Dataframe

A DataFrame is the primary pandas data structure: a two-dimensional, size-mutable, potentially heterogeneous table with labeled axes (rows and columns). Arithmetic operations align on both row and column labels. It can be thought of as a dict-like container for Series objects.

To put it simply, a DataFrame is a table with rows (samples) and columns (features).

Pandas loads data as a DataFrame object.

### Loading Data

#### Available Readers and Writers for Pandas

<table border="1" class="colwidths-given docutils">
<colgroup>
<col width="12%" />
<col width="40%" />
<col width="24%" />
<col width="24%" />
</colgroup>
<thead valign="bottom">
<tr class="row-odd"><th class="head">Format Type</th>
<th class="head">Data Description</th>
<th class="head">Reader</th>
<th class="head">Writer</th>
</tr>
</thead>
<tbody valign="top">
<tr class="row-even"><td>text</td>
<td><a class="reference external" href="https://en.wikipedia.org/wiki/Comma-separated_values">CSV</a></td>
<td><a class="reference internal" href="#io-read-csv-table"><span class="std std-ref">read_csv</span></a></td>
<td><a class="reference internal" href="#io-store-in-csv"><span class="std std-ref">to_csv</span></a></td>
</tr>
<tr class="row-odd"><td>text</td>
<td><a class="reference external" href="http://www.json.org/">JSON</a></td>
<td><a class="reference internal" href="#io-json-reader"><span class="std std-ref">read_json</span></a></td>
<td><a class="reference internal" href="#io-json-writer"><span class="std std-ref">to_json</span></a></td>
</tr>
<tr class="row-even"><td>text</td>
<td><a class="reference external" href="https://en.wikipedia.org/wiki/HTML">HTML</a></td>
<td><a class="reference internal" href="#io-read-html"><span class="std std-ref">read_html</span></a></td>
<td><a class="reference internal" href="#io-html"><span class="std std-ref">to_html</span></a></td>
</tr>
<tr class="row-odd"><td>text</td>
<td>Local clipboard</td>
<td><a class="reference internal" href="#io-clipboard"><span class="std std-ref">read_clipboard</span></a></td>
<td><a class="reference internal" href="#io-clipboard"><span class="std std-ref">to_clipboard</span></a></td>
</tr>
<tr class="row-even"><td>binary</td>
<td><a class="reference external" href="https://en.wikipedia.org/wiki/Microsoft_Excel">MS Excel</a></td>
<td><a class="reference internal" href="#io-excel-reader"><span class="std std-ref">read_excel</span></a></td>
<td><a class="reference internal" href="#io-excel-writer"><span class="std std-ref">to_excel</span></a></td>
</tr>
<tr class="row-odd"><td>binary</td>
<td><a class="reference external" href="https://support.hdfgroup.org/HDF5/whatishdf5.html">HDF5 Format</a></td>
<td><a class="reference internal" href="#io-hdf5"><span class="std std-ref">read_hdf</span></a></td>
<td><a class="reference internal" href="#io-hdf5"><span class="std std-ref">to_hdf</span></a></td>
</tr>
<tr class="row-even"><td>binary</td>
<td><a class="reference external" href="https://github.com/wesm/feather">Feather Format</a></td>
<td><a class="reference internal" href="#io-feather"><span class="std std-ref">read_feather</span></a></td>
<td><a class="reference internal" href="#io-feather"><span class="std std-ref">to_feather</span></a></td>
</tr>
<tr class="row-odd"><td>binary</td>
<td><a class="reference external" href="https://parquet.apache.org/">Parquet Format</a></td>
<td><a class="reference internal" href="#io-parquet"><span class="std std-ref">read_parquet</span></a></td>
<td><a class="reference internal" href="#io-parquet"><span class="std std-ref">to_parquet</span></a></td>
</tr>
<tr class="row-even"><td>binary</td>
<td><a class="reference external" href="http://msgpack.org/index.html">Msgpack</a></td>
<td><a class="reference internal" href="#io-msgpack"><span class="std std-ref">read_msgpack</span></a></td>
<td><a class="reference internal" href="#io-msgpack"><span class="std std-ref">to_msgpack</span></a></td>
</tr>
<tr class="row-odd"><td>binary</td>
<td><a class="reference external" href="https://en.wikipedia.org/wiki/Stata">Stata</a></td>
<td><a class="reference internal" href="#io-stata-reader"><span class="std std-ref">read_stata</span></a></td>
<td><a class="reference internal" href="#io-stata-writer"><span class="std std-ref">to_stata</span></a></td>
</tr>
<tr class="row-even"><td>binary</td>
<td><a class="reference external" href="https://en.wikipedia.org/wiki/SAS_(software)">SAS</a></td>
<td><a class="reference internal" href="#io-sas-reader"><span class="std std-ref">read_sas</span></a></td>
<td>&#160;</td>
</tr>
<tr class="row-odd"><td>binary</td>
<td><a class="reference external" href="https://docs.python.org/3/library/pickle.html">Python Pickle Format</a></td>
<td><a class="reference internal" href="#io-pickle"><span class="std std-ref">read_pickle</span></a></td>
<td><a class="reference internal" href="#io-pickle"><span class="std std-ref">to_pickle</span></a></td>
</tr>
<tr class="row-even"><td>SQL</td>
<td><a class="reference external" href="https://en.wikipedia.org/wiki/SQL">SQL</a></td>
<td><a class="reference internal" href="#io-sql"><span class="std std-ref">read_sql</span></a></td>
<td><a class="reference internal" href="#io-sql"><span class="std std-ref">to_sql</span></a></td>
</tr>
<tr class="row-odd"><td>SQL</td>
<td><a class="reference external" href="https://en.wikipedia.org/wiki/BigQuery">Google Big Query</a></td>
<td><a class="reference internal" href="#io-bigquery"><span class="std std-ref">read_gbq</span></a></td>
<td><a class="reference internal" href="#io-bigquery"><span class="std std-ref">to_gbq</span></a></td>
</tr>
</tbody>
</table>

In [85]:
df = pd.read_csv('marvel-wikia-data.csv')
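Readers and writers come in pairs, so a table written with `to_csv` can be read back with `read_csv`. The sketch below uses a small in-memory CSV with made-up rows (the names `csv_text`, `toy` and the hero names are hypothetical) so that it runs without any file on disk.

```python
import io
import pandas as pd

# Hypothetical three-row CSV standing in for a real file.
csv_text = """page_id,name,APPEARANCES
1,Hero A,40
2,Hero B,
3,Hero C,12
"""

# index_col turns a column into the row index while reading.
toy = pd.read_csv(io.StringIO(csv_text), index_col="page_id")
print(toy.dtypes)   # the empty cell makes APPEARANCES a float column

# Write the table back out and read it in again.
buffer = io.StringIO()
toy.to_csv(buffer)
round_trip = pd.read_csv(io.StringIO(buffer.getvalue()), index_col="page_id")
print(round_trip.equals(toy))
```

Empty cells are read as missing values (`NaN`), which is why an otherwise integer column ends up as `float64`.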

### Inspect Dataset

#### View top rows

In [17]:
df.head()

| | Unnamed: 0 | PageID | Name | Universe | URL | Identity | Gender | Alive | Marital Status | Teams | Height (m) | Weight (kg) | Origin | Creators |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | A_Friend_(Earth-616) | A Friend | Earth-616 | https://marvel.fandom.com/wiki/A_Friend_(Earth... | Public | Male | Alive | Single | | | | Demon | Jim Mooney; Tony Isabella |
| 1 | 1 | A_Friend_(Earth-TRN743) | A Friend | Earth-TRN743 | https://marvel.fandom.com/wiki/A_Friend_(Earth... | Public | Male | Alive | Single | | | | | Al Ewing; Salvador Espin |
| 2 | 2 | A_La_Mode_(Earth-9047) | A La Mode | Earth-9047 | https://marvel.fandom.com/wiki/A_La_Mode_(Eart... | Secret | | Alive | Single | Lunch Legion | | | Living ice cream | Keith Wilson; Rob Tokar |
| 3 | 3 | A'di_(Earth-616) | A'di | Earth-616 | https://marvel.fandom.com/wiki/A%27di_(Earth-616) | Public | Female | Alive | Single | Alpha Cadets; Alpha Flight | | | Human | Margaret Stohl; Michele Bandini |
| 4 | 4 | A'ishah_(Earth-616) | A'ishah | Earth-616 | https://marvel.fandom.com/wiki/A%27ishah_(Eart... | Secret | Female | Alive | Single | Church of Lucifer | | | Human | Warren Ellis |


#### View bottom rows

In [20]:
df.tail()

| | Unnamed: 0 | PageID | Name | Universe | URL | Identity | Gender | Alive | Marital Status | Teams | Height (m) | Weight (kg) | Origin | Creators |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 92610 | 92610 | Rebecca_Hoover_(Earth-17372) | Rebecca Hoover | Earth-17372 | https://marvel.fandom.com/wiki/Rebecca_Hoover_... | Secret | Female | Dead | Single | Hellfire Club | | | Mutant | Matt Nix |
| 92611 | 92611 | Rebecca_Houston_(Earth-616) | Rebecca Houston | Earth-616 | https://marvel.fandom.com/wiki/Rebecca_Houston... | Public | Female | Alive | | United States Navy | | | Human | Mark Waid; Ron Garney |
| 92612 | 92612 | Rebecca_Kaplan_(Earth-616) | Rebecca Kaplan | Earth-616 | https://marvel.fandom.com/wiki/Rebecca_Kaplan_... | Public | Female | Alive | Married | | | | Human | Allan Heinberg; Andrea DiVito |
| 92613 | 92613 | Reed_Richards_(Earth-TRN829) | Reed Richards | Earth-TRN829 | https://marvel.fandom.com/wiki/Reed_Richards_(... | Public | Male | Alive | Married | Fantastic Four | | | Human mutates due to exposure to Cosmic Rays | John Harkness; Rich Buckler |
| 92614 | 92614 | Rhonda_Rubens_(Earth-616) | Rhonda Rubens | Earth-616 | https://marvel.fandom.com/wiki/Rhonda_Rubens_(... | Secret | Female | Alive | | | | | Human | Felipe Smith; Tradd Moore |


#### Inspect data types of features

In [22]:
df.dtypes

Unnamed: 0          int64
PageID             object
Name               object
Universe           object
URL                object
Identity           object
Gender             object
Alive              object
Marital Status     object
Teams              object
Height (m)        float64
Weight (kg)       float64
Origin             object
Creators           object
dtype: object

#### Describe the dataset

In [24]:
df.describe()  # This will only include numeric features

| | Unnamed: 0 | Height (m) | Weight (kg) |
|---|---|---|---|
| count | 92615.0 | 5317.0 | 4873.0 |
| mean | 46307.0 | 3.548533 | 105.851053 |
| std | 26735.791928 | 28.102151 | 92.606689 |
| min | 0.0 | 1.02 | 0.0 |
| 25% | 23153.5 | 1.73 | 63.5 |
| 50% | 46307.0 | 1.8 | 81.65 |
| 75% | 69460.5 | 1.88 | 104.33 |
| max | 92614.0 | 804.67 | 962.98 |


In [26]:
df.describe(include='object')  # This will only include object/string features

| | PageID | Name | Universe | URL | Identity | Gender | Alive | Marital Status | Teams | Origin | Creators |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 92615 | 92615 | 92609 | 92615 | 72740 | 89389 | 92614 | 38101 | 49808 | 78141 | 90565 |
| unique | 92615 | 41317 | 3590 | 92615 | 4 | 67 | 7 | 14 | 14537 | 9377 | 11131 |
| top | A_Friend_(Earth-616) | Peter Parker | Earth-616 | https://marvel.fandom.com/wiki/A_Friend_(Earth... | Public | Male | Alive | Single | X-Men | Human | Jack Kirby; Stan Lee |
| freq | 1 | 867 | 33469 | 1 | 49430 | 65471 | 69553 | 30355 | 3131 | 30272 | 701 |


In [28]:
df.describe(include='all')  # This will include all features

| | Unnamed: 0 | PageID | Name | Universe | URL | Identity | Gender | Alive | Marital Status | Teams | Height (m) | Weight (kg) | Origin | Creators |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 92615.0 | 92615 | 92615 | 92609 | 92615 | 72740 | 89389 | 92614 | 38101 | 49808 | 5317.0 | 4873.0 | 78141 | 90565 |
| unique | | 92615 | 41317 | 3590 | 92615 | 4 | 67 | 7 | 14 | 14537 | | | 9377 | 11131 |
| top | | A_Friend_(Earth-616) | Peter Parker | Earth-616 | https://marvel.fandom.com/wiki/A_Friend_(Earth... | Public | Male | Alive | Single | X-Men | | | Human | Jack Kirby; Stan Lee |
| freq | | 1 | 867 | 33469 | 1 | 49430 | 65471 | 69553 | 30355 | 3131 | | | 30272 | 701 |
| mean | 46307.0 | | | | | | | | | | 3.548533 | 105.851053 | | |
| std | 26735.791928 | | | | | | | | | | 28.102151 | 92.606689 | | |
| min | 0.0 | | | | | | | | | | 1.02 | 0.0 | | |
| 25% | 23153.5 | | | | | | | | | | 1.73 | 63.5 | | |
| 50% | 46307.0 | | | | | | | | | | 1.8 | 81.65 | | |
| 75% | 69460.5 | | | | | | | | | | 1.88 | 104.33 | | |


#### Show column summary

In [30]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 92615 entries, 0 to 92614
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Unnamed: 0      92615 non-null  int64  
 1   PageID          92615 non-null  object 
 2   Name            92615 non-null  object 
 3   Universe        92609 non-null  object 
 4   URL             92615 non-null  object 
 5   Identity        72740 non-null  object 
 6   Gender          89389 non-null  object 
 7   Alive           92614 non-null  object 
 8   Marital Status  38101 non-null  object 
 9   Teams           49808 non-null  object 
 10  Height (m)      5317 non-null   float64
 11  Weight (kg)     4873 non-null   float64
 12  Origin          78141 non-null  object 
 13  Creators        90565 non-null  object 
dtypes: float64(2), int64(1), object(11)
memory usage: 9.9+ MB


#### Show dataset size

In [32]:
df.shape

(92615, 14)
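The non-null counts reported by `info()` differ widely between columns, so it can help to look at missing values directly. The sketch below counts them per column on a small made-up table (the frame `toy` and its values are hypothetical, with column names borrowed from the output above).

```python
import numpy as np
import pandas as pd

# Hypothetical four-row table with some missing entries.
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Height (m)": [1.80, np.nan, np.nan, 1.65],
    "Teams": ["X", None, "Y", "Z"],
})

missing_count = toy.isna().sum()    # number of missing values per column
missing_share = toy.isna().mean()   # fraction of missing values per column

print(missing_count)
print(missing_share)
```

Taking the mean of a boolean mask works because `True` counts as 1 and `False` as 0.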

### Inspect Columns

#### Set index

In [87]:
df.set_index('page_id', inplace=True)

In [36]:
df.head()

| | Unnamed: 0 | PageID | Name | Universe | URL | Identity | Gender | Alive | Marital Status | Teams | Height (m) | Weight (kg) | Origin | Creators |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | A_Friend_(Earth-616) | A Friend | Earth-616 | https://marvel.fandom.com/wiki/A_Friend_(Earth... | Public | Male | Alive | Single | | | | Demon | Jim Mooney; Tony Isabella |
| 1 | 1 | A_Friend_(Earth-TRN743) | A Friend | Earth-TRN743 | https://marvel.fandom.com/wiki/A_Friend_(Earth... | Public | Male | Alive | Single | | | | | Al Ewing; Salvador Espin |
| 2 | 2 | A_La_Mode_(Earth-9047) | A La Mode | Earth-9047 | https://marvel.fandom.com/wiki/A_La_Mode_(Eart... | Secret | | Alive | Single | Lunch Legion | | | Living ice cream | Keith Wilson; Rob Tokar |
| 3 | 3 | A'di_(Earth-616) | A'di | Earth-616 | https://marvel.fandom.com/wiki/A%27di_(Earth-616) | Public | Female | Alive | Single | Alpha Cadets; Alpha Flight | | | Human | Margaret Stohl; Michele Bandini |
| 4 | 4 | A'ishah_(Earth-616) | A'ishah | Earth-616 | https://marvel.fandom.com/wiki/A%27ishah_(Eart... | Secret | Female | Alive | Single | Church of Lucifer | | | Human | Warren Ellis |


#### Show counts of categorical columns

In [89]:
df['HAIR'].value_counts()

HAIR
Black Hair               3755
Brown Hair               2339
Blond Hair               1582
No Hair                  1176
Bald                      838
White Hair                754
Red Hair                  620
Grey Hair                 531
Green Hair                117
Auburn Hair                78
Blue Hair                  56
Strawberry Blond Hair      47
Purple Hair                47
Orange Hair                43
Variable Hair              32
Pink Hair                  31
Yellow Hair                20
Silver Hair                16
Gold Hair                   8
Reddish Blond Hair          6
Light Brown Hair            6
Magenta Hair                5
Orange-brown Hair           3
Bronze Hair                 1
Dyed Hair                   1
Name: count, dtype: int64

In [91]:
df.ALIVE.value_counts()

ALIVE
Living Characters      12608
Deceased Characters     3765
Name: count, dtype: int64
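By default `value_counts()` returns absolute counts and silently skips missing values. The sketch below shows the `normalize` and `dropna` options on a small made-up column (the Series `toy` and its labels are hypothetical).

```python
import pandas as pd

# Hypothetical status column with one missing entry.
toy = pd.Series(["Living", "Living", "Deceased", None, "Living"], name="ALIVE")

counts = toy.value_counts()                    # absolute counts, NaN skipped
shares = toy.value_counts(normalize=True)      # fractions of non-missing rows
with_missing = toy.value_counts(dropna=False)  # missing values get their own row

print(counts)
print(shares)
print(with_missing)
```

Comparing `counts.sum()` with `len(toy)` is a quick way to see how many rows were left out as missing.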

#### Show unique values per category

In [93]:
df.GSM.unique()

array([nan, 'Bisexual Characters', 'Transvestites',
       'Homosexual Characters', 'Pansexual Characters',
       'Transgender Characters', 'Genderfluid Characters'], dtype=object)

#### Show column stats

In [96]:
df.APPEARANCES.max()

4043.0

In [98]:
df.APPEARANCES.mean()

17.033376963350786

## Selection & Slicing

#### Slice by Column Name

In [102]:
df[['name','SEX','Year']].head()

| page_id | name | SEX | Year |
|---|---|---|---|
| 1678 | Spider-Man (Peter Parker) | Male Characters | 1962.0 |
| 7139 | Captain America (Steven Rogers) | Male Characters | 1941.0 |
| 64786 | Wolverine (James \"Logan\" Howlett) | Male Characters | 1974.0 |
| 1868 | Iron Man (Anthony \"Tony\" Stark) | Male Characters | 1963.0 |
| 2460 | Thor (Thor Odinson) | Male Characters | 1950.0 |


#### Slice by index

In [105]:
df.loc[1678:1868]

| page_id | name | ID | ALIGN | EYE | HAIR | SEX | GSM | ALIVE | APPEARANCES | FIRST APPEARANCE | Year |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1678 | Spider-Man (Peter Parker) | Secret Identity | Good Characters | Hazel Eyes | Brown Hair | Male Characters | | Living Characters | 4043.0 | Aug-62 | 1962.0 |
| 7139 | Captain America (Steven Rogers) | Public Identity | Good Characters | Blue Eyes | White Hair | Male Characters | | Living Characters | 3360.0 | Mar-41 | 1941.0 |
| 64786 | Wolverine (James \"Logan\" Howlett) | Public Identity | Neutral Characters | Blue Eyes | Black Hair | Male Characters | | Living Characters | 3061.0 | Oct-74 | 1974.0 |
| 1868 | Iron Man (Anthony \"Tony\" Stark) | Public Identity | Good Characters | Blue Eyes | Black Hair | Male Characters | | Living Characters | 2961.0 | Mar-63 | 1963.0 |


#### Slice by index & column

In [108]:
df.loc[1678:1868,['name','SEX','Year']]

| page_id | name | SEX | Year |
|---|---|---|---|
| 1678 | Spider-Man (Peter Parker) | Male Characters | 1962.0 |
| 7139 | Captain America (Steven Rogers) | Male Characters | 1941.0 |
| 64786 | Wolverine (James \"Logan\" Howlett) | Male Characters | 1974.0 |
| 1868 | Iron Man (Anthony \"Tony\" Stark) | Male Characters | 1963.0 |


#### Slice by row number

In [111]:
df.iloc[1:6]

| page_id | name | ID | ALIGN | EYE | HAIR | SEX | GSM | ALIVE | APPEARANCES | FIRST APPEARANCE | Year |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 7139 | Captain America (Steven Rogers) | Public Identity | Good Characters | Blue Eyes | White Hair | Male Characters | | Living Characters | 3360.0 | Mar-41 | 1941.0 |
| 64786 | Wolverine (James \"Logan\" Howlett) | Public Identity | Neutral Characters | Blue Eyes | Black Hair | Male Characters | | Living Characters | 3061.0 | Oct-74 | 1974.0 |
| 1868 | Iron Man (Anthony \"Tony\" Stark) | Public Identity | Good Characters | Blue Eyes | Black Hair | Male Characters | | Living Characters | 2961.0 | Mar-63 | 1963.0 |
| 2460 | Thor (Thor Odinson) | No Dual Identity | Good Characters | Blue Eyes | Blond Hair | Male Characters | | Living Characters | 2258.0 | Nov-50 | 1950.0 |
| 2458 | Benjamin Grimm (Earth-616) | Public Identity | Good Characters | Blue Eyes | No Hair | Male Characters | | Living Characters | 2255.0 | Nov-61 | 1961.0 |


#### Slice by row number & column number

In [114]:
df.iloc[1:6,[0,5,10]]

| page_id | name | SEX | Year |
|---|---|---|---|
| 7139 | Captain America (Steven Rogers) | Male Characters | 1941.0 |
| 64786 | Wolverine (James \"Logan\" Howlett) | Male Characters | 1974.0 |
| 1868 | Iron Man (Anthony \"Tony\" Stark) | Male Characters | 1963.0 |
| 2460 | Thor (Thor Odinson) | Male Characters | 1950.0 |
| 2458 | Benjamin Grimm (Earth-616) | Male Characters | 1961.0 |
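The difference between `.loc` (labels) and `.iloc` (positions) is easy to miss when the index itself holds integers, as `page_id` does. The sketch below uses a small made-up frame (`toy`, with placeholder names) whose index values mirror the ones shown above.

```python
import pandas as pd

# Hypothetical frame; the index values mimic the page_id labels above.
toy = pd.DataFrame(
    {"name": ["A", "B", "C", "D"], "Year": [1962, 1941, 1974, 1963]},
    index=pd.Index([1678, 7139, 64786, 1868], name="page_id"),
)

by_label = toy.loc[1678:64786]      # from label 1678 through label 64786, inclusive
by_position = toy.iloc[0:2]         # first two rows, end position excluded
one_cell_label = toy.loc[7139, "name"]
one_cell_position = toy.iloc[1, 0]

print(by_label)
print(by_position)
```

A label slice follows the stored row order, not numeric order, which is why `1678:64786` covers three rows here.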


### Filtering & Conditional Selection

#### Filtering based on condition

In [118]:
df.loc[df['APPEARANCES']>3000]

| page_id | name | ID | ALIGN | EYE | HAIR | SEX | GSM | ALIVE | APPEARANCES | FIRST APPEARANCE | Year |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1678 | Spider-Man (Peter Parker) | Secret Identity | Good Characters | Hazel Eyes | Brown Hair | Male Characters | | Living Characters | 4043.0 | Aug-62 | 1962.0 |
| 7139 | Captain America (Steven Rogers) | Public Identity | Good Characters | Blue Eyes | White Hair | Male Characters | | Living Characters | 3360.0 | Mar-41 | 1941.0 |
| 64786 | Wolverine (James \"Logan\" Howlett) | Public Identity | Neutral Characters | Blue Eyes | Black Hair | Male Characters | | Living Characters | 3061.0 | Oct-74 | 1974.0 |


#### Filtering based on multiple conditions

In [121]:
df.loc[(df['GSM']=='Genderfluid Characters') | (df['GSM']=='Bisexual Characters')]

| page_id | name | ID | ALIGN | EYE | HAIR | SEX | GSM | ALIVE | APPEARANCES | FIRST APPEARANCE | Year |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 6545 | Natalia Romanova (Earth-616) | Public Identity | Good Characters | Green Eyes | Red Hair | Female Characters | Bisexual Characters | Living Characters | 1050.0 | Apr-64 | 1964.0 |
| 1818 | Hercules (Earth-616) | No Dual Identity | Good Characters | Blue Eyes | Brown Hair | Male Characters | Bisexual Characters | Living Characters | 725.0 | Oct-65 | 1965.0 |
| 2307 | Elizabeth Braddock (Earth-616) | Secret Identity | Neutral Characters | Blue Eyes | Purple Hair | Female Characters | Bisexual Characters | Living Characters | 708.0 | Dec-76 | 1976.0 |
| 2042 | Loki Laufeyson (Earth-616) | No Dual Identity | Neutral Characters | Green Eyes | Black Hair | Genderfluid Characters | Bisexual Characters | Living Characters | 532.0 | Aug-49 | 1949.0 |
| 55878 | Raven Darkholme (Earth-616) | Secret Identity | Bad Characters | Yellow Eyes | Red Hair | Female Characters | Bisexual Characters | Living Characters | 371.0 | Apr-78 | 1978.0 |
| 2182 | Heather Douglas (Earth-616) | Secret Identity | Good Characters | Blue Eyes | Bald | Female Characters | Bisexual Characters | Living Characters | 323.0 | Jan-73 | 1973.0 |
| 51769 | Akihiro (Earth-616) | Secret Identity | Bad Characters | Blue Eyes | Black Hair | Male Characters | Bisexual Characters | Living Characters | 189.0 | Oct-06 | 2006.0 |
| 18271 | Marlo Chandler (Earth-616) | Public Identity | Good Characters | Blue Eyes | Red Hair | Female Characters | Bisexual Characters | Living Characters | 131.0 | Sep-88 | 1988.0 |
| 1465 | Doop (Earth-616) | No Dual Identity | Good Characters | Variable Eyes | Black Hair | Male Characters | Bisexual Characters | Living Characters | 113.0 | Jul-01 | 2001.0 |
| 2298 | David Alleyne (Earth-616) | Secret Identity | Good Characters | Brown Eyes | Black Hair | Male Characters | Bisexual Characters | Living Characters | 113.0 | Oct-03 | 2003.0 |


#### Filtering based on string search

In [124]:
df.loc[(df['name'].str.contains('Sugar'))]

| page_id | name | ID | ALIGN | EYE | HAIR | SEX | GSM | ALIVE | APPEARANCES | FIRST APPEARANCE | Year |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 34301 | Sugar Kane (Earth-616) | Public Identity | Good Characters | | Blond Hair | Female Characters | Bisexual Characters | Living Characters | 6.0 | Aug-01 | 2001.0 |
| 397442 | Sheba Sugarfangs (Earth-616) | Secret Identity | Neutral Characters | Red Eyes | Brown Hair | Female Characters | | Living Characters | 1.0 | Nov-10 | 2010.0 |
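String searches need a little care when a column contains missing values, differently cased text, or characters that are special in regular expressions. The sketch below illustrates the `na`, `case` and `regex` options of `str.contains` on a made-up column (the third and fourth entries are hypothetical).

```python
import pandas as pd

# Two names from the output above plus two hypothetical entries.
toy = pd.Series(
    ["Sugar Kane (Earth-616)", "Sheba Sugarfangs (Earth-616)", None, "sugar plum"],
    name="name",
)

# na=False treats missing names as "no match" so the mask stays boolean.
exact_case = toy.str.contains("Sugar", na=False)
any_case = toy.str.contains("Sugar", case=False, na=False)

# regex=False searches for the literal text, parentheses included.
literal = toy.str.contains("(Earth-616)", regex=False, na=False)

print(toy[exact_case])
print(toy[any_case])
print(toy[literal])
```

Without `na=False`, a missing name yields a missing value in the mask, which cannot be used directly for boolean selection.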


#### Filtering based on exact string match

In [127]:
df.loc[df['name']=='Captain America (Steven Rogers)']

| page_id | name | ID | ALIGN | EYE | HAIR | SEX | GSM | ALIVE | APPEARANCES | FIRST APPEARANCE | Year |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 7139 | Captain America (Steven Rogers) | Public Identity | Good Characters | Blue Eyes | White Hair | Male Characters | | Living Characters | 3360.0 | Mar-41 | 1941.0 |


#### Show column value for exact string match

In [130]:
df.loc[df['name']=='Captain America (Steven Rogers)','SEX']

page_id
7139    Male Characters
Name: SEX, dtype: object

#### Show selected columns for exact string match

In [133]:
df.loc[df['name']=='Captain America (Steven Rogers)',['name','SEX','Year']]

| page_id | name | SEX | Year |
|---|---|---|---|
| 7139 | Captain America (Steven Rogers) | Male Characters | 1941.0 |


### Removing Elements

#### Dropping row by index

In [137]:
df.drop(1678, inplace=True)

In [139]:
df.shape

(16375, 11)

#### Dropping column by name

In [142]:
df2 = df.drop('GSM', axis=1)

In [144]:
df.shape

(16375, 11)

In [146]:
df2.shape

(16375, 10)
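As the shapes above show, `drop` without `inplace=True` leaves the original frame untouched and returns a new one. A minimal sketch on a made-up frame (`toy` and its values are hypothetical), using the `index=` and `columns=` keywords instead of `axis`:

```python
import pandas as pd

# Hypothetical three-row frame.
toy = pd.DataFrame(
    {"name": ["A", "B", "C"], "GSM": [None, "x", None], "Year": [1962, 1941, 1974]},
    index=[1678, 7139, 64786],
)

without_column = toy.drop(columns="GSM")             # new frame, one column fewer
without_row = toy.drop(index=1678)                   # new frame, one row fewer
unchanged = toy.drop(index=9999, errors="ignore")    # missing label is skipped

print(toy.shape, without_column.shape, without_row.shape)
```

Assigning the result to a new name, as the notebook does with `df2`, keeps the original available for comparison.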

### Group By Function

#### Group By one feature and aggregate

In [150]:
df.groupby('ID')[['name','SEX']].count()

| ID | name | SEX |
|---|---|---|
| Known to Authorities Identity | 15 | 15 |
| No Dual Identity | 1788 | 1744 |
| Public Identity | 4528 | 4423 |
| Secret Identity | 6274 | 6025 |


In [152]:
df.groupby('ID')[['APPEARANCES']].mean()

| ID | APPEARANCES |
|---|---|
| Known to Authorities Identity | 74.4 |
| No Dual Identity | 24.402013 |
| Public Identity | 23.407615 |
| Secret Identity | 16.699561 |


In [154]:
df.groupby('ID')[['APPEARANCES']].describe()

| ID | APPEARANCES count | APPEARANCES mean | APPEARANCES std | APPEARANCES min | APPEARANCES 25% | APPEARANCES 50% | APPEARANCES 75% | APPEARANCES max |
|---|---|---|---|---|---|---|---|---|
| Known to Authorities Identity | 15.0 | 74.4 | 112.960676 | 2.0 | 9.0 | 26.0 | 75.5 | 348.0 |
| No Dual Identity | 1689.0 | 24.402013 | 107.182232 | 1.0 | 1.0 | 3.0 | 10.0 | 2258.0 |
| Public Identity | 4281.0 | 23.407615 | 141.738933 | 1.0 | 1.0 | 3.0 | 8.0 | 3360.0 |
| Secret Identity | 5918.0 | 16.699561 | 56.896058 | 1.0 | 1.0 | 4.0 | 10.0 | 1265.0 |


#### Group by multiple features

In [157]:
df.groupby(['ID','ALIGN'])[['APPEARANCES']].sum()

| ID | ALIGN | APPEARANCES |
|---|---|---|
| Known to Authorities Identity | Bad Characters | 35.0 |
| Known to Authorities Identity | Good Characters | 720.0 |
| Known to Authorities Identity | Neutral Characters | 361.0 |
| No Dual Identity | Bad Characters | 4758.0 |
| No Dual Identity | Good Characters | 25296.0 |
| No Dual Identity | Neutral Characters | 9137.0 |
| Public Identity | Bad Characters | 15153.0 |
| Public Identity | Good Characters | 64698.0 |
| Public Identity | Neutral Characters | 14807.0 |
| Secret Identity | Bad Characters | 27130.0 |


#### Number of observations in each group

In [160]:
df.groupby(['ID','ALIGN']).size()

ID                             ALIGN             
Known to Authorities Identity  Bad Characters           3
                               Good Characters         10
                               Neutral Characters       2
No Dual Identity               Bad Characters         474
                               Good Characters        647
                               Neutral Characters     390
Public Identity                Bad Characters        1452
                               Good Characters       1628
                               Neutral Characters     706
Secret Identity                Bad Characters        3223
                               Good Characters       1612
                               Neutral Characters     818
dtype: int64
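`size()` counts every row in a group, while `count()` only counts non-missing values, which is why the per-column counts earlier differ from one column to the next. The sketch below puts both side by side using named aggregation on a made-up frame (`toy` and the output column names are hypothetical).

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing APPEARANCES value.
toy = pd.DataFrame({
    "ID": ["Secret", "Secret", "Public", "Public", "Public"],
    "APPEARANCES": [10.0, 30.0, 5.0, np.nan, 7.0],
})

summary = toy.groupby("ID").agg(
    n_rows=("APPEARANCES", "size"),      # rows per group, NaN included
    n_values=("APPEARANCES", "count"),   # non-missing values per group
    mean_app=("APPEARANCES", "mean"),    # mean ignores NaN
)

print(summary)
```

Named aggregation also gives each result column an explicit name, which avoids the multi-level column headers that `describe()` produces.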

### Pivot Tables

#### Single Index

In [169]:
pd.pivot_table(df, index = ['ID'])

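TypeError: agg function failed [how->mean,dtype->object]

The call fails because the default `aggfunc` is the mean, which recent pandas versions refuse to compute on non-numeric (object) columns. Restrict the table to a numeric column with `values` and state the aggregation explicitly: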

In [171]:
pd.pivot_table(df, index = ['ID'], values = 'APPEARANCES', aggfunc='sum')

| ID | APPEARANCES |
|---|---|
| Known to Authorities Identity | 1116.0 |
| No Dual Identity | 41215.0 |
| Public Identity | 100208.0 |
| Secret Identity | 98828.0 |
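A single-index pivot table with one value column is essentially a group-by followed by an aggregation. The sketch below checks this on a made-up frame (`toy` and its values are hypothetical); it is meant as an illustration, not as a statement about how pandas implements `pivot_table` internally.

```python
import pandas as pd

# Hypothetical frame with one text column and one numeric column.
toy = pd.DataFrame({
    "ID": ["Secret", "Secret", "Public", "Public"],
    "name": ["A", "B", "C", "D"],
    "APPEARANCES": [10.0, 30.0, 5.0, 7.0],
})

pivot = pd.pivot_table(toy, index="ID", values="APPEARANCES", aggfunc="sum")
grouped = toy.groupby("ID")[["APPEARANCES"]].sum()

print(pivot)
print(pivot.to_dict() == grouped.to_dict())
```

Naming the numeric column in `values` keeps text columns such as `name` out of the aggregation.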


#### Multi Index

In [173]:
pd.pivot_table(df, index=['ID','SEX'], values = ['name'], aggfunc='count')

| ID | SEX | name |
|---|---|---|
| Known to Authorities Identity | Agender Characters | 1 |
| Known to Authorities Identity | Female Characters | 5 |
| Known to Authorities Identity | Male Characters | 9 |
| No Dual Identity | Agender Characters | 11 |
| No Dual Identity | Female Characters | 486 |
| No Dual Identity | Genderfluid Characters | 1 |
| No Dual Identity | Male Characters | 1246 |
| Public Identity | Agender Characters | 5 |
| Public Identity | Female Characters | 1051 |
| Public Identity | Male Characters | 3367 |


#### Multiple functions

In [175]:
pd.pivot_table(df, index=['ID','SEX'], values='APPEARANCES', aggfunc=['sum','mean','min','max'])

| ID | SEX | sum APPEARANCES | mean APPEARANCES | min APPEARANCES | max APPEARANCES |
|---|---|---|---|---|---|
| Known to Authorities Identity | Agender Characters | 348.0 | 348.0 | 348.0 | 348.0 |
| Known to Authorities Identity | Female Characters | 393.0 | 78.6 | 4.0 | 319.0 |
| Known to Authorities Identity | Male Characters | 375.0 | 41.666667 | 2.0 | 146.0 |
| No Dual Identity | Agender Characters | 77.0 | 7.0 | 1.0 | 28.0 |
| No Dual Identity | Female Characters | 13414.0 | 28.785408 | 1.0 | 1304.0 |
| No Dual Identity | Genderfluid Characters | 532.0 | 532.0 | 532.0 | 532.0 |
| No Dual Identity | Male Characters | 27022.0 | 23.115483 | 1.0 | 2258.0 |
| Public Identity | Agender Characters | 99.0 | 19.8 | 4.0 | 61.0 |
| Public Identity | Female Characters | 25667.0 | 25.900101 | 1.0 | 1713.0 |
| Public Identity | Male Characters | 73966.0 | 23.20866 | 1.0 | 3360.0 |
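When several aggregation functions are requested, the result has two column levels (function name, value column). The sketch below, on a made-up frame (`toy` and the flattened column names are hypothetical), passes the functions as strings and flattens the columns into single names for easier access.

```python
import pandas as pd

# Hypothetical frame with two grouping columns.
toy = pd.DataFrame({
    "ID": ["Secret", "Secret", "Public", "Public"],
    "SEX": ["Male", "Female", "Male", "Male"],
    "APPEARANCES": [10.0, 30.0, 5.0, 7.0],
})

wide = pd.pivot_table(
    toy,
    index=["ID", "SEX"],
    values="APPEARANCES",
    aggfunc=["sum", "mean", "min", "max"],   # function names as strings
)

# Flatten the two column levels into names such as "sum_APPEARANCES".
flat = wide.copy()
flat.columns = [f"{func}_{col}" for func, col in wide.columns]

print(flat)
```

Passing the function names as strings lets pandas use its own built-in aggregations directly.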
