# Organising

In [None]:
import numpy as np
import pandas as pd

## The `pandas.DataFrame` object

In [None]:
data = [
    {'state': 'California', 'area': 423967, 'population': 38332521},
    {'state': 'Florida', 'area': 170312, 'population': 19552860},
    {'state': 'Illinois', 'area': 149995, 'population': 12882135},
    {'state': 'New York', 'area': 141297, 'population': 19651127},
    {'state': 'Texas', 'area': 695662, 'population': 26448193},
]

states = pd.DataFrame(data)
states

*Notes: [there are many ways to construct DataFrames](https://github.com/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/03.01-Introducing-Pandas-Objects.ipynb) (see "Constructing DataFrame objects"), but we will usually load data from a file. In most cases each row should be either an independent sample (one observation per row, as in the [tidy format](http://vita.had.co.nz/papers/tidy-data.pdf)) or a timestamp.*
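
As one illustration of the alternatives mentioned in the note, a DataFrame can also be built column by column from a dict of lists. This is a minimal sketch; the names `columns`, `rows`, `states_from_columns` and `states_from_rows` are made up for the example, and the values repeat the first two rows of `states`.

```python
import pandas as pd

# Column-oriented construction: one dict key per column.
columns = {
    'state': ['California', 'Florida'],
    'area': [423967, 170312],
    'population': [38332521, 19552860],
}
states_from_columns = pd.DataFrame(columns)

# Row-oriented construction: one dict per row, as in the cell above.
rows = [
    {'state': 'California', 'area': 423967, 'population': 38332521},
    {'state': 'Florida', 'area': 170312, 'population': 19552860},
]
states_from_rows = pd.DataFrame(rows)

# Both routes should describe the same table.
print(states_from_columns.equals(states_from_rows))
```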

### Loading from file

[Comic characters dataset from fivethirtyeight](https://github.com/fivethirtyeight/data/tree/master/comic-characters).

In [None]:
df = pd.read_csv('data/dc-wikia-data.csv')

In [None]:
df

*Notes: [you can read and write files in many formats](https://pandas.pydata.org/pandas-docs/stable/io.html). `read_csv` (and other variants) can also read directly from a URL.*
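
Reading and writing can be tried without any file on disk. The sketch below assumes a tiny made-up CSV (`csv_text`, with placeholder names and years that are not taken from the dataset) and uses an in-memory buffer in place of a path or URL.

```python
import io

import pandas as pd

# Hypothetical two-row sample standing in for a real file.
csv_text = "name,YEAR\nHero A,1990\nHero B,2005\n"

# read_csv accepts a path, a URL, or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

# With no path, to_csv returns the CSV text; index=False leaves out the row labels.
round_trip = sample.to_csv(index=False)
print(round_trip)
```

Passing a path instead, as in `sample.to_csv('some-file.csv', index=False)`, writes the same text to disk.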

## Inspecting DataFrames

In [None]:
df.head()

In [None]:
df.tail()

In [None]:
df.dtypes

In [None]:
df.describe()

In [None]:
print('len:', len(df))
print('shape:', df.shape)

## Selecting

In [None]:
df['name']

In [None]:
df.name  # Same result, but breaks if the column name clashes with a DataFrame method or attribute

In [None]:
df['ALIGN'].value_counts()  # Great for making sense of categorical columns

*Note: columns are `pandas.Series` objects.*
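
The caveat about attribute access can be sketched with a hypothetical frame (`toy`, not part of the dataset) that has a column named `count`, which is also the name of a DataFrame method.

```python
import pandas as pd

# Hypothetical frame: the 'count' column shares its name with DataFrame.count.
toy = pd.DataFrame({'name': ['a', 'b'], 'count': [3, 5]})

# Bracket access always returns the column as a Series.
print(type(toy['count']))

# Attribute access finds the method first, so this is not the column.
print(callable(toy.count))
```

Bracket access is therefore the safer default whenever column names are not under your control.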

In [None]:
df[['name', 'YEAR']]

Boolean indexing

In [None]:
df['ALIGN'] == 'Bad Characters'

In [None]:
df[df['ALIGN'] == 'Bad Characters']

In [None]:
df[df['YEAR'] > 2012]

In [None]:
df[df['SEX'].isin(['Male Characters', 'Female Characters'])]

In [None]:
df[(df['YEAR'] > 2011) & (df['ALIGN'] == 'Bad Characters')]

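*Note: this is the only place you should really use bitwise `&` and `|`. Wrap each condition in parentheses, because these operators bind more tightly than comparisons.*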
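
To see why the bitwise operators are needed here, the sketch below combines two conditions on a small made-up frame (`toy`, with shortened alignment labels) and then shows what happens with the keyword `and`.

```python
import pandas as pd

# Hypothetical three-row frame reusing the dataset's column names.
toy = pd.DataFrame({'YEAR': [2010, 2012, 2013], 'ALIGN': ['Bad', 'Good', 'Bad']})

# Element-wise "and": each condition sits in its own parentheses.
mask = (toy['YEAR'] > 2011) & (toy['ALIGN'] == 'Bad')
print(toy[mask])

# `and` needs a single truth value for a whole Series, which pandas refuses to guess.
try:
    (toy['YEAR'] > 2011) and (toy['ALIGN'] == 'Bad')
except ValueError as error:
    print('ValueError:', error)
```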

## DataFrame operations

In [None]:
states

In [None]:
states['population'] / 1000000

In [None]:
states['population'] / states['area']

For common functions beyond the simple math operators (e.g. log, sin) we use [numpy ufuncs](https://docs.scipy.org/doc/numpy-1.13.0/reference/ufuncs.html#available-ufuncs).

In [None]:
np.log(states['area'])

Applying a custom function is the slowest option, but it is handy.

In [None]:
def is_over_populated(row):
    if row['population'] > 30000000:
        return True
    density = row['population'] / row['area']
    if density > 100:
        return True
    return False

In [None]:
states.apply(is_over_populated, axis='columns')
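
When speed matters, a row-wise function like this one can often be rewritten with column operations. The sketch below rebuilds the `states` table so that it runs on its own, uses a slightly shortened but equivalent `is_over_populated`, and compares the two results; `density`, `row_wise` and `vectorised` are names chosen for the example.

```python
import pandas as pd

states = pd.DataFrame({
    'state': ['California', 'Florida', 'Illinois', 'New York', 'Texas'],
    'area': [423967, 170312, 149995, 141297, 695662],
    'population': [38332521, 19552860, 12882135, 19651127, 26448193],
})

def is_over_populated(row):
    # Same rule as above: a large population or a high density.
    if row['population'] > 30000000:
        return True
    return row['population'] / row['area'] > 100

# Row by row: the Python function is called once per row.
row_wise = states.apply(is_over_populated, axis='columns')

# Column by column: the same rule written with whole-Series operations.
density = states['population'] / states['area']
vectorised = (states['population'] > 30000000) | (density > 100)

# Check that both approaches agree on every row.
print((row_wise == vectorised).all())
```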

In [None]:
states['density'] = states['population'] / states['area']
states

In [None]:
states['debt'] = 16.5
states

In [None]:
states.sort_values('density')

In [None]:
states.rename(columns={'population': 'pop'})

In [None]:
states.rename(columns=str.upper)

## *Exercise: Let's clean our comics dataset!*

In [None]:
def first_word(text):
    if isinstance(text, str):  # Because we can't split NaN
        words = text.split()
        return words[0] if words else text  # Leave empty strings untouched
    return text

In [None]:
df['SEX'].apply(first_word)

1. Apply the `first_word` function to the columns: `['ID', 'ALIGN', 'EYE', 'HAIR', 'SEX', 'GSM', 'ALIVE']`. Set the result into the same column.
1. Rename the columns to lower case letters. Hint: use the `str.lower` function.
1. Check your result using `df.head()`.

## Handling missing data

In [None]:
df = pd.read_csv('data/dc-wikia-data-clean.csv')

Checks

In [None]:
df['year'].isnull()

In [None]:
df['gsm'].notnull()

Filtering out

In [None]:
df.dropna(subset=['gsm'])

Filling empty values

In [None]:
df['year'].fillna(2000)

*Note: these operations return new objects. They don't change the original in place.*
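
A short sketch of what this means in practice, using a made-up three-row frame (`toy`, with placeholder years): the filled values only persist once they are assigned back.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing year.
toy = pd.DataFrame({'year': [1990.0, np.nan, 2005.0]})

# fillna returns a new Series; `toy` itself is untouched.
filled = toy['year'].fillna(2000)
missing_before = int(toy['year'].isnull().sum())

# Assign the result back to keep the change.
toy['year'] = filled
missing_after = int(toy['year'].isnull().sum())

print(missing_before, missing_after)
```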

## *Exercises*

1. What is the hair color of the first character that is of a gender or sexual minority?
1. When was the last neutral gender or sexual minority character introduced?
1. What is the percentage of good gender or sexual minority characters? Compare this to the percentage of good characters in general.