# Organising

In [1]:
import numpy as np
import pandas as pd

## The `pandas.DataFrame` object

In [2]:
data = [
    {'state': 'California', 'area': 423967, 'population': 38332521},
    {'state': 'Florida', 'area': 170312, 'population': 19552860},
    {'state': 'Illinois', 'area': 149995, 'population': 12882135},
    {'state': 'New York', 'area': 141297, 'population': 19651127},
    {'state': 'Texas', 'area': 695662, 'population': 26448193},
]

states = pd.DataFrame(data)
states

Unnamed: 0,area,population,state
0,423967,38332521,California
1,170312,19552860,Florida
2,149995,12882135,Illinois
3,141297,19651127,New York
4,695662,26448193,Texas


*Notes: [there are many ways to construct DataFrames](https://github.com/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/03.01-Introducing-Pandas-Objects.ipynb) (see "Constructing DataFrame objects"), but we will usually load data from a file. In most cases each row should be either an independent sample (the layout known as [tidy format](http://vita.had.co.nz/papers/tidy-data.pdf)) or a timestamp.*
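As one illustration of another construction route, the sketch below builds a table from a dict of columns instead of a list of row records. The names `columns`, `states_by_column` and `states_by_row` are made up for this example, and only three of the states are included to keep it short.

```python
import pandas as pd

# The same kind of data as above, organised by column instead of by row.
columns = {
    'state': ['California', 'Florida', 'Illinois'],
    'area': [423967, 170312, 149995],
    'population': [38332521, 19552860, 12882135],
}
states_by_column = pd.DataFrame(columns)

# Converting to a list of row records and back describes the same table.
records = states_by_column.to_dict('records')
states_by_row = pd.DataFrame(records)

print(states_by_column)
print(records[0])
```

Which form is more convenient mostly depends on how the raw data arrives: row records suit data collected one sample at a time, while a dict of columns suits data that already lives in separate arrays.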

### Loading from file

[Comic characters dataset from fivethirtyeight](https://github.com/fivethirtyeight/data/tree/master/comic-characters).

In [3]:
df = pd.read_csv('data/dc-wikia-data.csv')

In [4]:
df

Unnamed: 0,page_id,name,urlslug,ID,ALIGN,EYE,HAIR,SEX,GSM,ALIVE,APPEARANCES,FIRST APPEARANCE,YEAR
0,1422,Batman (Bruce Wayne),\/wiki\/Batman_(Bruce_Wayne),Secret Identity,Good Characters,Blue Eyes,Black Hair,Male Characters,,Living Characters,3093.0,"1939, May",1939.0
1,23387,Superman (Clark Kent),\/wiki\/Superman_(Clark_Kent),Secret Identity,Good Characters,Blue Eyes,Black Hair,Male Characters,,Living Characters,2496.0,"1986, October",1986.0
2,1458,Green Lantern (Hal Jordan),\/wiki\/Green_Lantern_(Hal_Jordan),Secret Identity,Good Characters,Brown Eyes,Brown Hair,Male Characters,,Living Characters,1565.0,"1959, October",1959.0
3,1659,James Gordon (New Earth),\/wiki\/James_Gordon_(New_Earth),Public Identity,Good Characters,Brown Eyes,White Hair,Male Characters,,Living Characters,1316.0,"1987, February",1987.0
4,1576,Richard Grayson (New Earth),\/wiki\/Richard_Grayson_(New_Earth),Secret Identity,Good Characters,Blue Eyes,Black Hair,Male Characters,,Living Characters,1237.0,"1940, April",1940.0
5,1448,Wonder Woman (Diana Prince),\/wiki\/Wonder_Woman_(Diana_Prince),Public Identity,Good Characters,Blue Eyes,Black Hair,Female Characters,,Living Characters,1231.0,"1941, December",1941.0
6,1486,Aquaman (Arthur Curry),\/wiki\/Aquaman_(Arthur_Curry),Public Identity,Good Characters,Blue Eyes,Blond Hair,Male Characters,,Living Characters,1121.0,"1941, November",1941.0
7,1451,Timothy Drake (New Earth),\/wiki\/Timothy_Drake_(New_Earth),Secret Identity,Good Characters,Blue Eyes,Black Hair,Male Characters,,Living Characters,1095.0,"1989, August",1989.0
8,71760,Dinah Laurel Lance (New Earth),\/wiki\/Dinah_Laurel_Lance_(New_Earth),Public Identity,Good Characters,Blue Eyes,Blond Hair,Female Characters,,Living Characters,1075.0,"1969, November",1969.0
9,1380,Flash (Barry Allen),\/wiki\/Flash_(Barry_Allen),Secret Identity,Good Characters,Blue Eyes,Blond Hair,Male Characters,,Living Characters,1028.0,"1956, October",1956.0


*Notes: [you can read and write files in many formats](https://pandas.pydata.org/pandas-docs/stable/io.html). `read_csv` (and other variants) can also read directly from a URL.*
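Besides paths and URLs, `read_csv` also accepts file-like objects, which makes it easy to try out without any file on disk. The following is a minimal sketch: the two-row `csv_text` is an invented sample, not the real dataset, and `small` and `round_trip` are names chosen for this example.

```python
import io

import pandas as pd

# A tiny CSV held in memory; io.StringIO makes the string behave like a file.
csv_text = "name,YEAR\nBatman (Bruce Wayne),1939\nSuperman (Clark Kent),1986\n"
small = pd.read_csv(io.StringIO(csv_text))

# to_csv without a path returns the CSV text instead of writing a file.
# index=False leaves the row labels out, so reading it back gives the same table.
round_trip = pd.read_csv(io.StringIO(small.to_csv(index=False)))

print(small)
print(round_trip.equals(small))
```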

## Inspecting DataFrames

In [5]:
df.head()

Unnamed: 0,page_id,name,urlslug,ID,ALIGN,EYE,HAIR,SEX,GSM,ALIVE,APPEARANCES,FIRST APPEARANCE,YEAR
0,1422,Batman (Bruce Wayne),\/wiki\/Batman_(Bruce_Wayne),Secret Identity,Good Characters,Blue Eyes,Black Hair,Male Characters,,Living Characters,3093.0,"1939, May",1939.0
1,23387,Superman (Clark Kent),\/wiki\/Superman_(Clark_Kent),Secret Identity,Good Characters,Blue Eyes,Black Hair,Male Characters,,Living Characters,2496.0,"1986, October",1986.0
2,1458,Green Lantern (Hal Jordan),\/wiki\/Green_Lantern_(Hal_Jordan),Secret Identity,Good Characters,Brown Eyes,Brown Hair,Male Characters,,Living Characters,1565.0,"1959, October",1959.0
3,1659,James Gordon (New Earth),\/wiki\/James_Gordon_(New_Earth),Public Identity,Good Characters,Brown Eyes,White Hair,Male Characters,,Living Characters,1316.0,"1987, February",1987.0
4,1576,Richard Grayson (New Earth),\/wiki\/Richard_Grayson_(New_Earth),Secret Identity,Good Characters,Blue Eyes,Black Hair,Male Characters,,Living Characters,1237.0,"1940, April",1940.0


In [6]:
df.tail()

Unnamed: 0,page_id,name,urlslug,ID,ALIGN,EYE,HAIR,SEX,GSM,ALIVE,APPEARANCES,FIRST APPEARANCE,YEAR
6891,66302,Nadine West (New Earth),\/wiki\/Nadine_West_(New_Earth),Public Identity,Good Characters,,,Female Characters,,Living Characters,,,
6892,283475,Warren Harding (New Earth),\/wiki\/Warren_Harding_(New_Earth),Public Identity,Good Characters,,,Male Characters,,Living Characters,,,
6893,283478,William Harrison (New Earth),\/wiki\/William_Harrison_(New_Earth),Public Identity,Good Characters,,,Male Characters,,Living Characters,,,
6894,283471,William McKinley (New Earth),\/wiki\/William_McKinley_(New_Earth),Public Identity,Good Characters,,,Male Characters,,Living Characters,,,
6895,150660,Mookie (New Earth),\/wiki\/Mookie_(New_Earth),Public Identity,Bad Characters,Blue Eyes,Blond Hair,Male Characters,,Living Characters,,,


In [7]:
df.dtypes

page_id               int64
name                 object
urlslug              object
ID                   object
ALIGN                object
EYE                  object
HAIR                 object
SEX                  object
GSM                  object
ALIVE                object
APPEARANCES         float64
FIRST APPEARANCE     object
YEAR                float64
dtype: object

In [8]:
df.describe()

Unnamed: 0,page_id,APPEARANCES,YEAR
count,6896.0,6541.0,6827.0
mean,147441.209252,23.625134,1989.766662
std,108388.631149,87.378509,16.824194
min,1380.0,1.0,1935.0
25%,44105.5,2.0,1983.0
50%,141267.0,6.0,1992.0
75%,213203.0,15.0,2003.0
max,404010.0,3093.0,2013.0


In [9]:
print('len:', len(df))
print('shape:', df.shape)

len: 6896
shape: (6896, 13)


## Selecting

In [10]:
df['name']

0                  Batman (Bruce Wayne)
1                 Superman (Clark Kent)
2            Green Lantern (Hal Jordan)
3              James Gordon (New Earth)
4           Richard Grayson (New Earth)
5           Wonder Woman (Diana Prince)
6                Aquaman (Arthur Curry)
7             Timothy Drake (New Earth)
8        Dinah Laurel Lance (New Earth)
9                   Flash (Barry Allen)
10                           GenderTest
11               Alan Scott (New Earth)
12           Barbara Gordon (New Earth)
13            Jason Garrick (New Earth)
14                Lois Lane (New Earth)
15        Alfred Pennyworth (New Earth)
16              Carter Hall (New Earth)
17              Kyle Rayner (New Earth)
18           Raymond Palmer (New Earth)
19         Alexander Luthor (New Earth)
20               Roy Harper (New Earth)
21               Kara Zor-L (Earth-Two)
22                Ted Grant (New Earth)
23           Garfield Logan (New Earth)
24              Guy Gardner (New Earth)


In [11]:
df.name  # Same as df['name'], but breaks when a column name clashes with a DataFrame method or is not a valid identifier

0                  Batman (Bruce Wayne)
1                 Superman (Clark Kent)
2            Green Lantern (Hal Jordan)
3              James Gordon (New Earth)
4           Richard Grayson (New Earth)
5           Wonder Woman (Diana Prince)
6                Aquaman (Arthur Curry)
7             Timothy Drake (New Earth)
8        Dinah Laurel Lance (New Earth)
9                   Flash (Barry Allen)
10                           GenderTest
11               Alan Scott (New Earth)
12           Barbara Gordon (New Earth)
13            Jason Garrick (New Earth)
14                Lois Lane (New Earth)
15        Alfred Pennyworth (New Earth)
16              Carter Hall (New Earth)
17              Kyle Rayner (New Earth)
18           Raymond Palmer (New Earth)
19         Alexander Luthor (New Earth)
20               Roy Harper (New Earth)
21               Kara Zor-L (Earth-Two)
22                Ted Grant (New Earth)
23           Garfield Logan (New Earth)
24              Guy Gardner (New Earth)
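To make the caveat about attribute access concrete, here is a small sketch on an invented frame called `toy`. Its `count` column shares a name with the `DataFrame.count` method, and `FIRST APPEARANCE` contains a space, so neither can be reached reliably with dot notation.

```python
import pandas as pd

toy = pd.DataFrame({
    'name': ['Batman', 'Superman'],
    'count': [1, 2],
    'FIRST APPEARANCE': ['1939, May', '1986, October'],
})

# 'name' does not clash with anything, so both spellings give the same column.
print(toy.name.equals(toy['name']))

# 'count' is also a DataFrame method, and the method wins over the column.
print(callable(toy.count))
print(toy['count'].tolist())

# A column name with a space is only reachable with brackets.
print(toy['FIRST APPEARANCE'].tolist())
```

Bracket access always works, which is why it is the safer default in scripts.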


In [12]:
df['ALIGN'].value_counts()  # Great for making sense of categorical columns

Bad Characters        2895
Good Characters       2832
Neutral Characters     565
Reformed Criminals       3
Name: ALIGN, dtype: int64

*Note: columns are `pandas.Series` objects.*

In [17]:
df[['name', 'YEAR']]

Unnamed: 0,name,YEAR
0,Batman (Bruce Wayne),1939.0
1,Superman (Clark Kent),1986.0
2,Green Lantern (Hal Jordan),1959.0
3,James Gordon (New Earth),1987.0
4,Richard Grayson (New Earth),1940.0
5,Wonder Woman (Diana Prince),1941.0
6,Aquaman (Arthur Curry),1941.0
7,Timothy Drake (New Earth),1989.0
8,Dinah Laurel Lance (New Earth),1969.0
9,Flash (Barry Allen),1956.0
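The difference between single and double brackets can be checked directly. In this sketch `toy`, `one_column` and `sub_frame` are names invented for the example: a single label returns a `Series`, while a list of labels returns a `DataFrame`, even when the list has only one entry.

```python
import pandas as pd

toy = pd.DataFrame({
    'name': ['Batman', 'Superman'],
    'YEAR': [1939.0, 1986.0],
})

one_column = toy['name']    # a single label selects a Series
sub_frame = toy[['name']]   # a list of labels selects a DataFrame

print(type(one_column).__name__)
print(type(sub_frame).__name__)
print(sub_frame.shape)
```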


Boolean indexing

In [18]:
df['ALIGN'] == 'Bad Characters'

0       False
1       False
2       False
3       False
4       False
5       False
6       False
7       False
8       False
9       False
10      False
11      False
12      False
13      False
14      False
15      False
16      False
17      False
18      False
19       True
20      False
21      False
22      False
23      False
24      False
25      False
26      False
27      False
28      False
29      False
        ...  
6866     True
6867    False
6868    False
6869    False
6870    False
6871     True
6872     True
6873    False
6874     True
6875    False
6876    False
6877    False
6878    False
6879     True
6880    False
6881     True
6882    False
6883    False
6884     True
6885    False
6886    False
6887    False
6888    False
6889    False
6890    False
6891    False
6892    False
6893    False
6894    False
6895     True
Name: ALIGN, Length: 6896, dtype: bool

In [34]:
df[df['ALIGN'] != 'Bad Characters'].head()

Unnamed: 0,page_id,name,urlslug,ID,ALIGN,EYE,HAIR,SEX,GSM,ALIVE,APPEARANCES,FIRST APPEARANCE,YEAR
0,1422,Batman (Bruce Wayne),\/wiki\/Batman_(Bruce_Wayne),Secret Identity,Good Characters,Blue Eyes,Black Hair,Male Characters,,Living Characters,3093.0,"1939, May",1939.0
1,23387,Superman (Clark Kent),\/wiki\/Superman_(Clark_Kent),Secret Identity,Good Characters,Blue Eyes,Black Hair,Male Characters,,Living Characters,2496.0,"1986, October",1986.0
2,1458,Green Lantern (Hal Jordan),\/wiki\/Green_Lantern_(Hal_Jordan),Secret Identity,Good Characters,Brown Eyes,Brown Hair,Male Characters,,Living Characters,1565.0,"1959, October",1959.0
3,1659,James Gordon (New Earth),\/wiki\/James_Gordon_(New_Earth),Public Identity,Good Characters,Brown Eyes,White Hair,Male Characters,,Living Characters,1316.0,"1987, February",1987.0
4,1576,Richard Grayson (New Earth),\/wiki\/Richard_Grayson_(New_Earth),Secret Identity,Good Characters,Blue Eyes,Black Hair,Male Characters,,Living Characters,1237.0,"1940, April",1940.0


In [28]:
df[df['YEAR'] > 2012].head()

Unnamed: 0,page_id,name,urlslug,ID,ALIGN,EYE,HAIR,SEX,GSM,ALIVE,APPEARANCES,FIRST APPEARANCE,YEAR
5539,85254,Springheeled Jack (Prime Earth),\/wiki\/Springheeled_Jack_(Prime_Earth),Secret Identity,Bad Characters,,,Male Characters,,Living Characters,1.0,"2013, October",2013.0


In [33]:
df[df['SEX'].isin(['Male Characters', 'Female Characters'])].head()

Unnamed: 0,page_id,name,urlslug,ID,ALIGN,EYE,HAIR,SEX,GSM,ALIVE,APPEARANCES,FIRST APPEARANCE,YEAR
0,1422,Batman (Bruce Wayne),\/wiki\/Batman_(Bruce_Wayne),Secret Identity,Good Characters,Blue Eyes,Black Hair,Male Characters,,Living Characters,3093.0,"1939, May",1939.0
1,23387,Superman (Clark Kent),\/wiki\/Superman_(Clark_Kent),Secret Identity,Good Characters,Blue Eyes,Black Hair,Male Characters,,Living Characters,2496.0,"1986, October",1986.0
2,1458,Green Lantern (Hal Jordan),\/wiki\/Green_Lantern_(Hal_Jordan),Secret Identity,Good Characters,Brown Eyes,Brown Hair,Male Characters,,Living Characters,1565.0,"1959, October",1959.0
3,1659,James Gordon (New Earth),\/wiki\/James_Gordon_(New_Earth),Public Identity,Good Characters,Brown Eyes,White Hair,Male Characters,,Living Characters,1316.0,"1987, February",1987.0
4,1576,Richard Grayson (New Earth),\/wiki\/Richard_Grayson_(New_Earth),Secret Identity,Good Characters,Blue Eyes,Black Hair,Male Characters,,Living Characters,1237.0,"1940, April",1940.0


In [36]:
df[(df['YEAR'] > 2011) & (df['ALIGN'] == 'Bad Characters')]

Unnamed: 0,page_id,name,urlslug,ID,ALIGN,EYE,HAIR,SEX,GSM,ALIVE,APPEARANCES,FIRST APPEARANCE,YEAR
4324,379078,Ragnar (Green Lantern Animated Series),\/wiki\/Ragnar_(Green_Lantern_Animated_Series),Public Identity,Bad Characters,,,Male Characters,,Living Characters,3.0,"2012, March",2012.0
5539,85254,Springheeled Jack (Prime Earth),\/wiki\/Springheeled_Jack_(Prime_Earth),Secret Identity,Bad Characters,,,Male Characters,,Living Characters,1.0,"2013, October",2013.0
5540,309868,Napalm (Prime Earth),\/wiki\/Napalm_(Prime_Earth),Secret Identity,Bad Characters,,,Male Characters,,Deceased Characters,1.0,"2012, June",2012.0
6541,306472,Matteo Bischoff (New Earth),\/wiki\/Matteo_Bischoff_(New_Earth),Secret Identity,Bad Characters,,Grey Hair,Male Characters,,Living Characters,,"2012, May",2012.0


*Note: boolean indexing is the only place where you should really use the bitwise `&` and `|` operators.*
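The reason for the bitwise operators is that Python's `and` and `or` ask for a single truth value, which a whole column does not have. The sketch below uses an invented three-row frame called `toy` to show the element-wise operators and the error raised by `and`.

```python
import pandas as pd

toy = pd.DataFrame({
    'YEAR': [1939.0, 2012.0, 2013.0],
    'ALIGN': ['Good Characters', 'Bad Characters', 'Bad Characters'],
})

recent = toy['YEAR'] > 2011
bad = toy['ALIGN'] == 'Bad Characters'

both = toy[recent & bad]      # element-wise AND
either = toy[recent | bad]    # element-wise OR
not_bad = toy[~bad]           # element-wise NOT

# `and` tries to turn each Series into one True/False value and fails.
and_error = None
try:
    toy[recent and bad]
except ValueError as error:
    and_error = error
    print('ValueError:', error)
```

The parentheses around each comparison in `df[(df['YEAR'] > 2011) & (df['ALIGN'] == 'Bad Characters')]` matter too, because `&` binds more tightly than `>` and `==`.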

## DataFrame operations

In [37]:
states

Unnamed: 0,area,population,state
0,423967,38332521,California
1,170312,19552860,Florida
2,149995,12882135,Illinois
3,141297,19651127,New York
4,695662,26448193,Texas


In [39]:
states['population'] / 1000000

0    38.332521
1    19.552860
2    12.882135
3    19.651127
4    26.448193
Name: population, dtype: float64

In [41]:
states['population'] / states['area']

0     90.413926
1    114.806121
2     85.883763
3    139.076746
4     38.018740
dtype: float64

For common functions beyond the simple math operators (e.g. log, sin) we use [numpy ufuncs](https://docs.scipy.org/doc/numpy-1.13.0/reference/ufuncs.html#available-ufuncs).

In [None]:
np.log(states['area'])
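As a small self-contained check of what a ufunc returns, the sketch below applies `np.log` to a stand-alone `Series` named `area` (a shortened copy of the column above, created just for this example). The result is again a `Series` with the same index, so it lines up with the rest of the table.

```python
import numpy as np
import pandas as pd

area = pd.Series([423967, 170312, 149995], name='area')

# The ufunc is applied element by element and keeps the index.
log_area = np.log(area)

print(log_area)
print(log_area.index.equals(area.index))
```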

Applying a custom function is the slowest option, but it is handy.

In [47]:
def is_over_populated(row):
    if row['population'] > 30000000:
        return True
    density = row['population'] / row['area']
    if density > 100:
        return True
    return False

In [50]:
states.apply(is_over_populated, axis='columns')

0     True
1     True
2    False
3     True
4    False
dtype: bool
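When speed matters, a row-wise function like this can often be rewritten with whole-column operations. The following sketch rebuilds the same five states in a frame called `toy_states` (a name chosen for this example) and expresses the two conditions with comparisons combined by `|`.

```python
import pandas as pd

toy_states = pd.DataFrame({
    'state': ['California', 'Florida', 'Illinois', 'New York', 'Texas'],
    'area': [423967, 170312, 149995, 141297, 695662],
    'population': [38332521, 19552860, 12882135, 19651127, 26448193],
})

density = toy_states['population'] / toy_states['area']

# Same rule as is_over_populated: a large population OR a high density.
over_populated = (toy_states['population'] > 30000000) | (density > 100)

print(over_populated)
```

The row-wise version stays useful when the logic does not reduce to a few column expressions.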

In [53]:
states['density'] = states['population'] / states['area']
states

Unnamed: 0,area,population,state,density
0,423967,38332521,California,90.413926
1,170312,19552860,Florida,114.806121
2,149995,12882135,Illinois,85.883763
3,141297,19651127,New York,139.076746
4,695662,26448193,Texas,38.01874


In [54]:
states['debt'] = 16.5
states

Unnamed: 0,area,population,state,density,debt
0,423967,38332521,California,90.413926,16.5
1,170312,19552860,Florida,114.806121,16.5
2,149995,12882135,Illinois,85.883763,16.5
3,141297,19651127,New York,139.076746,16.5
4,695662,26448193,Texas,38.01874,16.5


In [55]:
states.sort_values('density')

Unnamed: 0,area,population,state,density,debt
4,695662,26448193,Texas,38.01874,16.5
2,149995,12882135,Illinois,85.883763,16.5
0,423967,38332521,California,90.413926,16.5
1,170312,19552860,Florida,114.806121,16.5
3,141297,19651127,New York,139.076746,16.5


In [56]:
states.rename(columns={'population': 'pop'})

Unnamed: 0,area,pop,state,density,debt
0,423967,38332521,California,90.413926,16.5
1,170312,19552860,Florida,114.806121,16.5
2,149995,12882135,Illinois,85.883763,16.5
3,141297,19651127,New York,139.076746,16.5
4,695662,26448193,Texas,38.01874,16.5


In [57]:
states.rename(columns=str.upper)

Unnamed: 0,AREA,POPULATION,STATE,DENSITY,DEBT
0,423967,38332521,California,90.413926,16.5
1,170312,19552860,Florida,114.806121,16.5
2,149995,12882135,Illinois,85.883763,16.5
3,141297,19651127,New York,139.076746,16.5
4,695662,26448193,Texas,38.01874,16.5


## *Exercise: Let's clean our comics dataset!*

In [59]:
df

Unnamed: 0,page_id,name,urlslug,ID,ALIGN,EYE,HAIR,SEX,GSM,ALIVE,APPEARANCES,FIRST APPEARANCE,YEAR
0,1422,Batman (Bruce Wayne),\/wiki\/Batman_(Bruce_Wayne),Secret Identity,Good Characters,Blue Eyes,Black Hair,Male Characters,,Living Characters,3093.0,"1939, May",1939.0
1,23387,Superman (Clark Kent),\/wiki\/Superman_(Clark_Kent),Secret Identity,Good Characters,Blue Eyes,Black Hair,Male Characters,,Living Characters,2496.0,"1986, October",1986.0
2,1458,Green Lantern (Hal Jordan),\/wiki\/Green_Lantern_(Hal_Jordan),Secret Identity,Good Characters,Brown Eyes,Brown Hair,Male Characters,,Living Characters,1565.0,"1959, October",1959.0
3,1659,James Gordon (New Earth),\/wiki\/James_Gordon_(New_Earth),Public Identity,Good Characters,Brown Eyes,White Hair,Male Characters,,Living Characters,1316.0,"1987, February",1987.0
4,1576,Richard Grayson (New Earth),\/wiki\/Richard_Grayson_(New_Earth),Secret Identity,Good Characters,Blue Eyes,Black Hair,Male Characters,,Living Characters,1237.0,"1940, April",1940.0
5,1448,Wonder Woman (Diana Prince),\/wiki\/Wonder_Woman_(Diana_Prince),Public Identity,Good Characters,Blue Eyes,Black Hair,Female Characters,,Living Characters,1231.0,"1941, December",1941.0
6,1486,Aquaman (Arthur Curry),\/wiki\/Aquaman_(Arthur_Curry),Public Identity,Good Characters,Blue Eyes,Blond Hair,Male Characters,,Living Characters,1121.0,"1941, November",1941.0
7,1451,Timothy Drake (New Earth),\/wiki\/Timothy_Drake_(New_Earth),Secret Identity,Good Characters,Blue Eyes,Black Hair,Male Characters,,Living Characters,1095.0,"1989, August",1989.0
8,71760,Dinah Laurel Lance (New Earth),\/wiki\/Dinah_Laurel_Lance_(New_Earth),Public Identity,Good Characters,Blue Eyes,Blond Hair,Female Characters,,Living Characters,1075.0,"1969, November",1969.0
9,1380,Flash (Barry Allen),\/wiki\/Flash_(Barry_Allen),Secret Identity,Good Characters,Blue Eyes,Blond Hair,Male Characters,,Living Characters,1028.0,"1956, October",1956.0


In [63]:
def first_word(text):
    if isinstance(text, str):  # Because we can't split NaN
        words = text.split()
        return words[0]
    return text

In [65]:
def apply(func, arg):
    return func(arg)

apply(str.upper, 'hello')

'HELLO'

In [62]:
df['SEX'].apply(first_word)

0         Male
1         Male
2         Male
3         Male
4         Male
5       Female
6         Male
7         Male
8       Female
9         Male
10      Female
11        Male
12      Female
13        Male
14      Female
15        Male
16        Male
17        Male
18        Male
19        Male
20        Male
21      Female
22        Male
23        Male
24        Male
25        Male
26        Male
27        Male
28        Male
29        Male
         ...  
6866      Male
6867      Male
6868      Male
6869      Male
6870      Male
6871    Female
6872      Male
6873    Female
6874      Male
6875      Male
6876      Male
6877      Male
6878    Female
6879      Male
6880      Male
6881    Female
6882    Female
6883      Male
6884      Male
6885    Female
6886      Male
6887      Male
6888      Male
6889      Male
6890      Male
6891    Female
6892      Male
6893      Male
6894      Male
6895      Male
Name: SEX, Length: 6896, dtype: object

1. Apply the `first_word` function to the columns: `['ID', 'ALIGN', 'EYE', 'HAIR', 'SEX', 'GSM', 'ALIVE']`. Set the result into the same column.
1. Rename the columns to lower case letters. Hint: use the `str.lower` function.
1. Check your result using `df.head()`.

In [66]:
for col in ['ID', 'ALIGN', 'EYE', 'HAIR', 'SEX', 'GSM', 'ALIVE']:
    df[col] = df[col].apply(first_word)
df.head()

Unnamed: 0,page_id,name,urlslug,ID,ALIGN,EYE,HAIR,SEX,GSM,ALIVE,APPEARANCES,FIRST APPEARANCE,YEAR
0,1422,Batman (Bruce Wayne),\/wiki\/Batman_(Bruce_Wayne),Secret,Good,Blue,Black,Male,,Living,3093.0,"1939, May",1939.0
1,23387,Superman (Clark Kent),\/wiki\/Superman_(Clark_Kent),Secret,Good,Blue,Black,Male,,Living,2496.0,"1986, October",1986.0
2,1458,Green Lantern (Hal Jordan),\/wiki\/Green_Lantern_(Hal_Jordan),Secret,Good,Brown,Brown,Male,,Living,1565.0,"1959, October",1959.0
3,1659,James Gordon (New Earth),\/wiki\/James_Gordon_(New_Earth),Public,Good,Brown,White,Male,,Living,1316.0,"1987, February",1987.0
4,1576,Richard Grayson (New Earth),\/wiki\/Richard_Grayson_(New_Earth),Secret,Good,Blue,Black,Male,,Living,1237.0,"1940, April",1940.0


In [70]:
df = df.rename(columns=str.lower)

In [72]:
df.head()

Unnamed: 0,page_id,name,urlslug,id,align,eye,hair,sex,gsm,alive,appearances,first appearance,year
0,1422,Batman (Bruce Wayne),\/wiki\/Batman_(Bruce_Wayne),Secret,Good,Blue,Black,Male,,Living,3093.0,"1939, May",1939.0
1,23387,Superman (Clark Kent),\/wiki\/Superman_(Clark_Kent),Secret,Good,Blue,Black,Male,,Living,2496.0,"1986, October",1986.0
2,1458,Green Lantern (Hal Jordan),\/wiki\/Green_Lantern_(Hal_Jordan),Secret,Good,Brown,Brown,Male,,Living,1565.0,"1959, October",1959.0
3,1659,James Gordon (New Earth),\/wiki\/James_Gordon_(New_Earth),Public,Good,Brown,White,Male,,Living,1316.0,"1987, February",1987.0
4,1576,Richard Grayson (New Earth),\/wiki\/Richard_Grayson_(New_Earth),Secret,Good,Blue,Black,Male,,Living,1237.0,"1940, April",1940.0


In [None]:
df.to_csv('data/dc-wikia-data-clean.csv', index=False)

## Handling missing data

In [73]:
df = pd.read_csv('data/dc-wikia-data-clean.csv')

Checks

In [74]:
df['year'].isnull()

0       False
1       False
2       False
3       False
4       False
5       False
6       False
7       False
8       False
9       False
10      False
11      False
12      False
13      False
14      False
15      False
16      False
17      False
18      False
19      False
20      False
21      False
22      False
23      False
24      False
25      False
26      False
27      False
28      False
29      False
        ...  
6866    False
6867    False
6868    False
6869    False
6870    False
6871    False
6872    False
6873    False
6874    False
6875    False
6876    False
6877    False
6878    False
6879    False
6880    False
6881    False
6882    False
6883    False
6884    False
6885    False
6886    False
6887     True
6888     True
6889     True
6890     True
6891     True
6892     True
6893     True
6894     True
6895     True
Name: year, Length: 6896, dtype: bool

In [77]:
df[df['gsm'].notnull()].head()

Unnamed: 0,page_id,name,urlslug,id,align,eye,hair,sex,gsm,alive,appearances,first appearance,year
48,1704,John Constantine (New Earth),\/wiki\/John_Constantine_(New_Earth),Public,Good,Blue,Blond,Male,Bisexual,Living,371.0,"1984, June",1984.0
65,8856,Renee Montoya (New Earth),\/wiki\/Renee_Montoya_(New_Earth),Secret,Good,Brown,Black,Female,Homosexual,Living,308.0,"1992, March",1992.0
119,1862,Todd Rice (New Earth),\/wiki\/Todd_Rice_(New_Earth),Public,Good,Brown,Brown,Male,Homosexual,Living,208.0,"1983, September",1983.0
134,1658,Margaret Sawyer (New Earth),\/wiki\/Margaret_Sawyer_(New_Earth),Public,,Blue,,Female,Homosexual,Living,180.0,"1987, April",1987.0
173,1597,Hartley Rathaway (New Earth),\/wiki\/Hartley_Rathaway_(New_Earth),Secret,Good,Blue,Red,Male,Homosexual,Living,160.0,"1959, May",1959.0


Filtering out

In [None]:
df.dropna(subset=['gsm'])
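The effect of `subset` is easier to see on a tiny table. In this sketch `toy` is an invented three-row frame: `dropna(subset=['gsm'])` only looks at the `gsm` column, while `dropna()` without arguments drops every row that has a missing value anywhere.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'name': ['A', 'B', 'C'],
    'gsm': [np.nan, 'Bisexual', 'Homosexual'],
    'year': [1939.0, 1984.0, np.nan],
})

with_gsm = toy.dropna(subset=['gsm'])   # keeps B and C
complete = toy.dropna()                 # keeps only B

print(with_gsm['name'].tolist())
print(complete['name'].tolist())
```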

Filling empty values

In [81]:
df['year'].fillna(2000).tail(10)

6886    1936.0
6887    2000.0
6888    2000.0
6889    2000.0
6890    2000.0
6891    2000.0
6892    2000.0
6893    2000.0
6894    2000.0
6895    2000.0
Name: year, dtype: float64

*Note: these operations return new objects (DataFrames or Series). They don't change the original in place.*
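A quick way to convince yourself of this is to check the original after calling the method. In the sketch below, `toy`, `filled` and `toy_fixed` are names invented for the example. To keep a result, assign it back to a column or to a new variable.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'year': [1936.0, np.nan]})

filled = toy['year'].fillna(2000)

# The original column still contains its missing value.
print(toy['year'].isnull().sum())

# Keeping the result requires an explicit assignment.
toy_fixed = toy.copy()
toy_fixed['year'] = toy_fixed['year'].fillna(2000)
print(toy_fixed['year'].isnull().sum())
```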

## *Exercises*

1. What is the hair color of the first character that is of a gender or sexual minority?
1. When was the last neutral gender or sexual minority character introduced?
1. What is the percentage of good gender or sexual minority characters? Compare this to the percentage of good characters in general.