# Data analysis and manipulation

The arrays defined by the [NumPy](https://numpy.org/) library provide essential functionality for efficient numerical processing in Python. However, they were designed for the clean, well-organized datasets typically used in numerical computing tasks. In knowledge discovery and extraction from data, it is common to deal with less structured, heterogeneous data that may have missing values. The limitations of arrays for analyzing and manipulating this kind of data quickly become evident. The [pandas](https://pandas.pydata.org/) library addresses these limitations by providing an efficient implementation of a data table (`DataFrame`). Data tables are essentially multidimensional arrays with labels attached to their rows and columns, able to handle heterogeneous types and missing values. In addition, *pandas* implements several data operations that are familiar to users of databases and spreadsheets. Because the data structures defined by *pandas* are built on top of *NumPy* arrays, these operations are performed efficiently. This makes the library an important tool for the data manipulation tasks that take up much of a data scientist's time.

In [136]:
import numpy as np
import pandas as pd

## Data structures

At a very basic level, the data structures defined by *pandas* can be seen as enhanced versions of *NumPy* arrays in which rows and columns are identified by labels rather than by a position-based index. The three fundamental data structures defined by *pandas* are the series (`Series`), the data table (`DataFrame`), and the index (`Index`). The index is an interesting structure in its own right, which can be seen either as an immutable array or as an ordered set (strictly speaking a multiset, since repeated values are allowed). However, its relevance comes from its use within the other two structures. So, for simplicity, we will focus on those two and treat the index as something similar to an array.
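The two views of an `Index` mentioned above can be illustrated with a minimal sketch. The variable names and values below are made up for illustration only:

```python
import pandas as pd

# Two small, hypothetical indices used only for this illustration.
ind_a = pd.Index([1, 3, 5, 7, 9])
ind_b = pd.Index([2, 3, 5, 7, 11])

# Array-like behaviour: positional access and slicing.
first = ind_a[0]
every_other = ind_a[::2]

# Immutability: item assignment raises a TypeError.
try:
    ind_a[1] = 0
    mutated = True
except TypeError:
    mutated = False

# Set-like behaviour: intersection and union of the labels.
common = ind_a.intersection(ind_b)
combined = ind_a.union(ind_b)

print(first, list(every_other), mutated)
print(list(common), list(combined))
```

Immutability is what makes it safe to share one index between several series and data tables.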

### Series (`Series`)

A series is a one-dimensional array of indexed data.

Series can be created in several ways, for example from sequences (e.g. lists or arrays).

In [80]:
data = pd.Series([0.25, 0.5, 0.75, 1])
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

A series combines a sequence of values with an explicit sequence of indices, which can be accessed individually through the `values` and `index` attributes, respectively. The values are stored as a *NumPy* array, while the indices are stored in an object of type `Index` (or one of its subclasses).

In [83]:
data.values

array([0.25, 0.5 , 0.75, 1.  ])

In [84]:
data.index

RangeIndex(start=0, stop=4, step=1)

As with arrays, the data in a series can be accessed with the indexing operator `[]`:

In [87]:
data[1]

0.5

In [88]:
data[1:3]

1    0.50
2    0.75
dtype: float64

So far, a series looks just like a one-dimensional *NumPy* array. However, there is an essential difference: while the array has an implicitly defined integer index used to access the values, the series has an explicitly defined index that is associated with the values. This explicit index gives the series additional capabilities. For example, the index can consist of values of any type, not just integers.

In [116]:
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

Regardless of the index type, the indexing operator continues to work in the same way.

In [113]:
data['b']

0.5

The index values do not even need to be contiguous or sequential.

In [117]:
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])
data

2    0.25
5    0.50
3    0.75
7    1.00
dtype: float64

In [115]:
data[5]

0.5
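With an integer index like this one, `data[5]` is interpreted as a label, not as a position, which can be confusing. As a small sketch of how to make the intent explicit, the `.loc` (label-based) and `.iloc` (position-based) accessors can be used on the same series:

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])

# .loc looks up by index label, .iloc by integer position.
by_label = data.loc[5]      # value stored under label 5
by_position = data.iloc[1]  # second value in the series

# The same integer gives different results depending on the accessor.
label_three = data.loc[3]      # label 3 is the third element
position_three = data.iloc[3]  # position 3 is the fourth element

print(by_label, by_position, label_three, position_three)
```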

From this perspective, a series can be seen as a specialization of a dictionary. In Python, a dictionary is a structure that maps arbitrary keys to a set of arbitrary values. A series is a structure that maps typed keys to a set of typed values. This typing is important: just as the type-specific compiled code behind a *NumPy* array makes it more efficient than a Python list for certain operations, the type information of a `Series` makes it more efficient than a Python dictionary for certain operations.

The series-as-dictionary analogy becomes even clearer when a series is constructed directly from a dictionary:

In [122]:
population_dict = {
    'California': 39538223,
    'Texas': 29145505,
    'Florida': 21538187,
    'New York': 20201249,
    'Pennsylvania': 13002700
}

population = pd.Series(population_dict)
population

California      39538223
Texas           29145505
Florida         21538187
New York        20201249
Pennsylvania    13002700
dtype: int64

In this case, the index is built from the dictionary's keys. The value associated with an index/key is accessed the same way as in a dictionary:

In [119]:
population['California']

39538223

However, unlike a dictionary, a series also supports retrieving all the values between two index labels:

In [120]:
population['California':'Florida']

California    39538223
Texas         29145505
Florida       21538187
dtype: int64
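One detail worth noting in the slice above is that the end label (`'Florida'`) is included in the result, unlike a positional slice. A minimal sketch contrasting the two, using the same population figures:

```python
import pandas as pd

population = pd.Series({
    'California': 39538223,
    'Texas': 29145505,
    'Florida': 21538187,
    'New York': 20201249,
    'Pennsylvania': 13002700,
})

# Label-based slice: both endpoints are included.
by_label = population.loc['California':'Florida']

# Position-based slice: the stop position is excluded, as in Python lists.
by_position = population.iloc[0:2]

print(list(by_label.index))
print(list(by_position.index))
```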

**Note**: When creating a series from a dictionary, you can specify the order and/or a subset of the keys to use:

In [121]:
pd.Series({2: 'a', 1: 'b', 3: 'c'}, index=[1, 2])

1    b
2    a
dtype: object
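A related case, sketched here as an illustration that is not part of the original example, is an index label that does not exist in the dictionary. The series still gets an entry for it, holding a missing value:

```python
import pandas as pd

# Label 4 is not a key of the dictionary, so its value is missing (NaN).
letters = pd.Series({2: 'a', 1: 'b', 3: 'c'}, index=[1, 2, 4])

print(letters)
print(letters.isna().tolist())
```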

**Note**: You can also create a series with a constant value for every index:

In [124]:
pd.Series(5, index=['John', 'Jane', 'Mary'])

John    5
Jane    5
Mary    5
dtype: int64

### Data table (`DataFrame`)

A data table is a sequence of series that share the same index. Like a series, a data table can be seen either as a generalization of a *NumPy* array or as a specialization of a dictionary. To illustrate how a data table is created, we start by creating a series with the areas of the US states, to combine with the population series created earlier:

In [126]:
area_dict = {
    'California': 423967,
    'Texas': 695662,
    'Florida': 170312,
    'New York': 141297,
    'Pennsylvania': 119280
}

area = pd.Series(area_dict)
area

California      423967
Texas           695662
Florida         170312
New York        141297
Pennsylvania    119280
dtype: int64

To create the data table, we can use a dictionary that associates a label with each of the series:

In [127]:
states = pd.DataFrame({'population': population, 'area': area})
states

| | population | area |
| --- | --- | --- |
| California | 39538223 | 423967 |
| Texas | 29145505 | 695662 |
| Florida | 21538187 | 170312 |
| New York | 20201249 | 141297 |
| Pennsylvania | 13002700 | 119280 |


Like a series, a data table has an `index` attribute that gives access to the index:

In [128]:
states.index

Index(['California', 'Texas', 'Florida', 'New York', 'Pennsylvania'], dtype='object')

In addition, a data table has a `columns` attribute that gives access to an index holding the column labels:

In [129]:
states.columns

Index(['population', 'area'], dtype='object')

A data table can therefore be seen as a generalization of a two-dimensional array in which both the rows and the columns have an explicit index that can be used to access the data. It can also be seen as a specialization of a dictionary in which the column labels are mapped to the corresponding data series.

In [133]:
states['area']

California      423967
Texas           695662
Florida         170312
New York        141297
Pennsylvania    119280
Name: area, dtype: int64
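Both views can be combined in practice. The sketch below rebuilds a three-state version of the table and uses dictionary-style operations on columns alongside array-style access to a single cell. The `density` column is an illustrative addition, not part of the original example:

```python
import pandas as pd

states = pd.DataFrame({
    'population': {'California': 39538223, 'Texas': 29145505, 'Florida': 21538187},
    'area': {'California': 423967, 'Texas': 695662, 'Florida': 170312},
})

# Dictionary-style: membership tests and assignment work on column labels.
has_area = 'area' in states
states['density'] = states['population'] / states['area']

# Array-style: address a single cell by row label and column label.
texas_area = states.loc['Texas', 'area']

print(has_area, texas_area)
print(states)
```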

Besides a dictionary that associates a label with each series, data tables can be created in other ways. For example, from a single series:

In [134]:
pd.DataFrame(population, columns=['population'])

| | population |
| --- | --- |
| California | 39538223 |
| Texas | 29145505 |
| Florida | 21538187 |
| New York | 20201249 |
| Pennsylvania | 13002700 |


A data table can also be created from a two-dimensional array:

In [137]:
pd.DataFrame(np.random.rand(3, 2), columns=['foo', 'bar'], index=['a', 'b', 'c'])

| | foo | bar |
| --- | --- | --- |
| a | 0.168776 | 0.039828 |
| b | 0.117703 | 0.493086 |
| c | 0.259081 | 0.450729 |


**Note**: In this case, if one or both of the indices are not specified, integer indices are used in their place.

In [141]:
pd.DataFrame(np.random.rand(3, 2))

| | 0 | 1 |
| --- | --- | --- |
| 0 | 0.659489 | 0.687858 |
| 1 | 0.372794 | 0.426883 |
| 2 | 0.729108 | 0.352116 |


Another option is to create a data table from a list of dictionaries, each of which represents one entry (row) in the table:

In [142]:
data = [{'a': i, 'b': 2 * i} for i in range(3)]
pd.DataFrame(data)

| | a | b |
| --- | --- | --- |
| 0 | 0 | 0 |
| 1 | 1 | 2 |
| 2 | 2 | 4 |


**Note**: When creating a data table, if the indices of the series or the keys of the entries do not match, the table will have missing values, represented by `NaN`.

In [144]:
s1 = pd.Series(np.random.rand(3), index=['a', 'b', 'c'])
s2 = pd.Series(np.random.rand(3), index=['a', 'd', 'c'])

pd.DataFrame({'s1': s1, 's2': s2})

| | s1 | s2 |
| --- | --- | --- |
| a | 0.464857 | 0.261234 |
| b | 0.503166 | NaN |
| c | 0.461289 | 0.640581 |
| d | NaN | 0.197477 |


In [145]:
pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

| | a | b | c |
| --- | --- | --- | --- |
| 0 | 1.0 | 2 | NaN |
| 1 | NaN | 3 | 4.0 |
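The output above also shows a side effect of missing values: columns `a` and `c` are displayed as `1.0` and `4.0` rather than as integers. A short sketch that inspects the column types of the same table suggests why, since `NaN` is a floating-point value and columns that contain it are stored as `float64`:

```python
import pandas as pd

df = pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

# Columns with a missing entry are promoted to float64; 'b' stays int64.
print(df.dtypes)

# Total number of missing cells in the table.
n_missing = int(df.isna().sum().sum())
print(n_missing)
```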


## Operations on DataFrames

DataFrames possess hundreds of methods and other operations that are crucial to any analysis. As a beginner, you should know the operations that perform simple transformations of your data and those that provide fundamental statistical analysis.

Let's load in the IMDB movies dataset to begin:

In [71]:
import os

data_path = '../data/' if os.path.exists('../data/') else 'https://raw.githubusercontent.com/TheAwesomeGe/DECD/main/data/'

movies_df = pd.read_csv(data_path + 'IMDB-Movie-Data.csv', index_col='Title')

In [72]:
movies_df

| Title | Rank | Genre | Description | Director | Actors | Year | Runtime (Minutes) | Rating | Votes | Revenue (Millions) | Metascore |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Guardians of the Galaxy | 1 | Action,Adventure,Sci-Fi | A group of intergalactic criminals are forced ... | James Gunn | Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S... | 2014 | 121 | 8.1 | 757074 | 333.13 | 76.0 |
| Prometheus | 2 | Adventure,Mystery,Sci-Fi | Following clues to the origin of mankind, a te... | Ridley Scott | Noomi Rapace, Logan Marshall-Green, Michael Fa... | 2012 | 124 | 7.0 | 485820 | 126.46 | 65.0 |
| Split | 3 | Horror,Thriller | Three girls are kidnapped by a man with a diag... | M. Night Shyamalan | James McAvoy, Anya Taylor-Joy, Haley Lu Richar... | 2016 | 117 | 7.3 | 157606 | 138.12 | 62.0 |
| Sing | 4 | Animation,Comedy,Family | In a city of humanoid animals, a hustling thea... | Christophe Lourdelet | Matthew McConaughey,Reese Witherspoon, Seth Ma... | 2016 | 108 | 7.2 | 60545 | 270.32 | 59.0 |
| Suicide Squad | 5 | Action,Adventure,Fantasy | A secret government agency recruits some of th... | David Ayer | Will Smith, Jared Leto, Margot Robbie, Viola D... | 2016 | 123 | 6.2 | 393727 | 325.02 | 40.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Secret in Their Eyes | 996 | Crime,Drama,Mystery | A tight-knit team of rising investigators, alo... | Billy Ray | Chiwetel Ejiofor, Nicole Kidman, Julia Roberts... | 2015 | 111 | 6.2 | 27585 | NaN | 45.0 |
| Hostel: Part II | 997 | Horror | Three American college students studying abroa... | Eli Roth | Lauren German, Heather Matarazzo, Bijou Philli... | 2007 | 94 | 5.5 | 73152 | 17.54 | 46.0 |
| Step Up 2: The Streets | 998 | Drama,Music,Romance | Romantic sparks occur between two dance studen... | Jon M. Chu | Robert Hoffman, Briana Evigan, Cassie Ventura,... | 2008 | 98 | 6.2 | 70699 | 58.01 | 50.0 |
| Search Party | 999 | Adventure,Comedy | A pair of friends embark on a mission to reuni... | Scot Armstrong | Adam Pally, T.J. Miller, Thomas Middleditch,Sh... | 2014 | 93 | 5.6 | 4881 | NaN | 22.0 |


Here, we used the `read_csv` function to load the dataset from a CSV file and to use the movie titles as the index.

**Note**: The pandas library provides functions for loading datasets in several formats, for example from an Excel spreadsheet (`read_excel`).
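To see the effect of `index_col` without needing the dataset file, the following sketch reads a tiny CSV from an in-memory string. The titles and values are made up for illustration:

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = """Title,Year,Rating
Movie A,2014,8.1
Movie B,2012,7.0
"""

# index_col turns the 'Title' column into the row index.
df = pd.read_csv(io.StringIO(csv_text), index_col='Title')

print(df)
print(list(df.index), list(df.columns))
```

Without `index_col`, `Title` would be an ordinary column and the rows would be labelled 0, 1, and so on.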

### Viewing the data

The first thing to do when opening a new dataset is print out a few rows to keep as a visual reference. We accomplish this with `.head()`:

In [20]:
movies_df.head()

| Title | Rank | Genre | Description | Director | Actors | Year | Runtime (Minutes) | Rating | Votes | Revenue (Millions) | Metascore |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Guardians of the Galaxy | 1 | Action,Adventure,Sci-Fi | A group of intergalactic criminals are forced ... | James Gunn | Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S... | 2014 | 121 | 8.1 | 757074 | 333.13 | 76.0 |
| Prometheus | 2 | Adventure,Mystery,Sci-Fi | Following clues to the origin of mankind, a te... | Ridley Scott | Noomi Rapace, Logan Marshall-Green, Michael Fa... | 2012 | 124 | 7.0 | 485820 | 126.46 | 65.0 |
| Split | 3 | Horror,Thriller | Three girls are kidnapped by a man with a diag... | M. Night Shyamalan | James McAvoy, Anya Taylor-Joy, Haley Lu Richar... | 2016 | 117 | 7.3 | 157606 | 138.12 | 62.0 |
| Sing | 4 | Animation,Comedy,Family | In a city of humanoid animals, a hustling thea... | Christophe Lourdelet | Matthew McConaughey,Reese Witherspoon, Seth Ma... | 2016 | 108 | 7.2 | 60545 | 270.32 | 59.0 |
| Suicide Squad | 5 | Action,Adventure,Fantasy | A secret government agency recruits some of th... | David Ayer | Will Smith, Jared Leto, Margot Robbie, Viola D... | 2016 | 123 | 6.2 | 393727 | 325.02 | 40.0 |


`.head()` outputs the **first** five rows of your DataFrame by default, but you can also pass a number: `movies_df.head(10)` would output the first ten rows, for example.

To see the **last** five rows, use `.tail()`. `.tail()` also accepts a number; here we print the last two rows:

In [22]:
movies_df.tail(2)

| Title | Rank | Genre | Description | Director | Actors | Year | Runtime (Minutes) | Rating | Votes | Revenue (Millions) | Metascore |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Search Party | 999 | Adventure,Comedy | A pair of friends embark on a mission to reuni... | Scot Armstrong | Adam Pally, T.J. Miller, Thomas Middleditch,Sh... | 2014 | 93 | 5.6 | 4881 | NaN | 22.0 |
| Nine Lives | 1000 | Comedy,Family,Fantasy | A stuffy businessman finds himself trapped ins... | Barry Sonnenfeld | Kevin Spacey, Jennifer Garner, Robbie Amell,Ch... | 2016 | 87 | 5.3 | 12435 | 19.64 | 11.0 |


Typically when we load in a dataset, we like to view the first five or so rows to see what's under the hood. Here we can see the names of each column, the index, and examples of values in each row.

You'll notice that the index in our DataFrame is the *Title* column. In the rendered notebook output, you can tell because the index name *Title* sits on its own header line, slightly lower than the column names.

### Getting information about the data

`.info()` should be one of the very first commands you run after loading your data:

In [23]:
movies_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1000 entries, Guardians of the Galaxy to Nine Lives
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Rank                1000 non-null   int64  
 1   Genre               1000 non-null   object 
 2   Description         1000 non-null   object 
 3   Director            1000 non-null   object 
 4   Actors              1000 non-null   object 
 5   Year                1000 non-null   int64  
 6   Runtime (Minutes)   1000 non-null   int64  
 7   Rating              1000 non-null   float64
 8   Votes               1000 non-null   int64  
 9   Revenue (Millions)  872 non-null    float64
 10  Metascore           936 non-null    float64
dtypes: float64(3), int64(4), object(4)
memory usage: 93.8+ KB


`.info()` provides the essential details about your dataset, such as the number of rows and columns, the number of non-null values, what type of data is in each column, and how much memory your DataFrame is using. 

Notice that our movies dataset has missing values in the `Revenue (Millions)` and `Metascore` columns. We'll look at how to handle those in a bit.

Seeing the data type quickly is quite useful. Imagine you just imported some JSON and the integers were recorded as strings. You try to do some arithmetic and hit an "unsupported operand" exception because you can't do math with strings. Calling `.info()` will quickly reveal that the column you thought held integers actually holds string objects.
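Once `.info()` has revealed such a column, one possible way to repair it is `pd.to_numeric`. The sketch below uses a made-up `votes` column; the `errors='coerce'` option turns entries that cannot be parsed into `NaN` instead of raising an error:

```python
import pandas as pd

# Hypothetical column where the numbers were recorded as strings.
raw = pd.DataFrame({'votes': ['10', '20', '30']})

# Convert the strings to a numeric dtype so arithmetic works again.
votes = pd.to_numeric(raw['votes'])
total = votes.sum()

# Entries that cannot be parsed become NaN with errors='coerce'.
coerced = pd.to_numeric(pd.Series(['7', 'n/a']), errors='coerce')

print(votes.dtype, total)
print(coerced.isna().tolist())
```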

Another fast and useful attribute is `.shape`, which outputs just a tuple of (rows, columns):

In [24]:
movies_df.shape

(1000, 11)

### Cleaning up the columns

Many times datasets will have verbose column names with symbols, upper and lowercase words, spaces, and typos. To make selecting data by column name easier we can spend a little time cleaning up their names.

Here's how to print the column names of our dataset:

In [25]:
movies_df.columns

Index(['Rank', 'Genre', 'Description', 'Director', 'Actors', 'Year',
       'Runtime (Minutes)', 'Rating', 'Votes', 'Revenue (Millions)',
       'Metascore'],
      dtype='object')

Not only does `.columns` come in handy when you want to rename columns, since it allows simple copy and paste, it is also useful when you need to understand why you are getting a `KeyError` when selecting data by column.

We can use the `.rename()` method to rename certain or all columns via a `dict`. We don't want parentheses, so let's rename those:

In [26]:
movies_df.rename(columns={
        'Runtime (Minutes)': 'Runtime', 
        'Revenue (Millions)': 'Revenue_millions'
    }, inplace=True)

movies_df.columns

Index(['Rank', 'Genre', 'Description', 'Director', 'Actors', 'Year', 'Runtime',
       'Rating', 'Votes', 'Revenue_millions', 'Metascore'],
      dtype='object')

Excellent. But what if we want to lowercase all the names? Instead of using `.rename()`, we could assign a list of names to `.columns`, like so:

In [27]:
movies_df.columns = ['rank', 'genre', 'description', 'director', 'actors', 'year', 'runtime', 
                     'rating', 'votes', 'revenue_millions', 'metascore']
movies_df.columns

Index(['rank', 'genre', 'description', 'director', 'actors', 'year', 'runtime',
       'rating', 'votes', 'revenue_millions', 'metascore'],
      dtype='object')

But that's too much work. **Question:** Can you think of another solution?
<!-- ```
movies_df.columns = [col.lower() for col in movies_df]
movies_df.columns
``` -->

`list` (and `dict`) comprehensions come in handy a lot when working with pandas and data in general.

It's a good idea to lowercase, remove special characters, and replace spaces with underscores if you'll be working with a dataset for some time.
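As one possible way to apply all three clean-up steps at once, the sketch below uses the string methods available on the column index. The regular expression and the choice of underscores are assumptions about what "clean" should mean here, not a rule from the original text:

```python
import pandas as pd

# Empty DataFrame with column names like those in the raw dataset.
df = pd.DataFrame(columns=['Runtime (Minutes)', 'Revenue (Millions)', 'Rating'])

df.columns = (
    df.columns
    .str.lower()                                   # lowercase everything
    .str.replace(r'[^a-z0-9]+', '_', regex=True)   # runs of other characters -> '_'
    .str.strip('_')                                # drop leading/trailing '_'
)

print(list(df.columns))
```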

### Missing values

When exploring data, you’ll most likely encounter missing or null values, which are essentially placeholders for non-existent values. Most commonly you'll see Python's `None` or NumPy's `np.nan`, each of which is handled differently in some situations.
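A minimal sketch of how the two markers behave, using made-up values: in a numeric series pandas converts `None` to `NaN`, and `isnull()` detects both, while `NaN` has the unusual property of not being equal to itself:

```python
import numpy as np
import pandas as pd

# None in numeric data is converted to NaN, so the series becomes float64.
numeric = pd.Series([1, None, 3])

# NaN is a special floating-point value that is not equal to itself,
# so equality tests cannot be used to find it.
nan_equals_itself = np.nan == np.nan

# isnull() is the reliable way to detect missing entries.
null_mask = numeric.isnull()

print(numeric.dtype, nan_equals_itself)
print(null_mask.tolist())
```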

There are two options in dealing with nulls: 

1. Get rid of rows or columns with nulls
2. Replace nulls with non-null values, a technique known as **imputation**

Let's calculate the total number of nulls in each column of our dataset. The first step is to check which cells in our DataFrame are null:

In [28]:
movies_df.isnull()

| Title | rank | genre | description | director | actors | year | runtime | rating | votes | revenue_millions | metascore |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Guardians of the Galaxy | False | False | False | False | False | False | False | False | False | False | False |
| Prometheus | False | False | False | False | False | False | False | False | False | False | False |
| Split | False | False | False | False | False | False | False | False | False | False | False |
| Sing | False | False | False | False | False | False | False | False | False | False | False |
| Suicide Squad | False | False | False | False | False | False | False | False | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Secret in Their Eyes | False | False | False | False | False | False | False | False | False | True | False |
| Hostel: Part II | False | False | False | False | False | False | False | False | False | False | False |
| Step Up 2: The Streets | False | False | False | False | False | False | False | False | False | False | False |
| Search Party | False | False | False | False | False | False | False | False | False | True | False |


Notice `isnull()` returns a DataFrame where each cell is either True or False depending on that cell's null status.

To count the number of nulls in each column we use an aggregate function for summing: 

In [29]:
movies_df.isnull().sum()

rank                  0
genre                 0
description           0
director              0
actors                0
year                  0
runtime               0
rating                0
votes                 0
revenue_millions    128
metascore            64
dtype: int64

`.isnull()` just by itself isn't very useful, and is usually used in conjunction with other methods, like `sum()`.

We can see now that our data has **128** missing values for `revenue_millions` and **64** missing values for `metascore`.

#### Removing null values

Data scientists and analysts regularly face the dilemma of dropping or imputing null values, a decision that requires intimate knowledge of your data and its context. Overall, removing null data is only advisable when a small fraction of the data is missing.

Removing nulls is simple: `dropna()` deletes any **row** with at least one null value. It returns a new DataFrame without altering the original one, unless you pass `inplace=True`.

In [30]:
movies_df.dropna()

| Title | rank | genre | description | director | actors | year | runtime | rating | votes | revenue_millions | metascore |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Guardians of the Galaxy | 1 | Action,Adventure,Sci-Fi | A group of intergalactic criminals are forced ... | James Gunn | Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S... | 2014 | 121 | 8.1 | 757074 | 333.13 | 76.0 |
| Prometheus | 2 | Adventure,Mystery,Sci-Fi | Following clues to the origin of mankind, a te... | Ridley Scott | Noomi Rapace, Logan Marshall-Green, Michael Fa... | 2012 | 124 | 7.0 | 485820 | 126.46 | 65.0 |
| Split | 3 | Horror,Thriller | Three girls are kidnapped by a man with a diag... | M. Night Shyamalan | James McAvoy, Anya Taylor-Joy, Haley Lu Richar... | 2016 | 117 | 7.3 | 157606 | 138.12 | 62.0 |
| Sing | 4 | Animation,Comedy,Family | In a city of humanoid animals, a hustling thea... | Christophe Lourdelet | Matthew McConaughey,Reese Witherspoon, Seth Ma... | 2016 | 108 | 7.2 | 60545 | 270.32 | 59.0 |
| Suicide Squad | 5 | Action,Adventure,Fantasy | A secret government agency recruits some of th... | David Ayer | Will Smith, Jared Leto, Margot Robbie, Viola D... | 2016 | 123 | 6.2 | 393727 | 325.02 | 40.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Resident Evil: Afterlife | 994 | Action,Adventure,Horror | While still out to destroy the evil Umbrella C... | Paul W.S. Anderson | Milla Jovovich, Ali Larter, Wentworth Miller,K... | 2010 | 97 | 5.9 | 140900 | 60.13 | 37.0 |
| Project X | 995 | Comedy | 3 high school seniors throw a birthday party t... | Nima Nourizadeh | Thomas Mann, Oliver Cooper, Jonathan Daniel Br... | 2012 | 88 | 6.7 | 164088 | 54.72 | 48.0 |
| Hostel: Part II | 997 | Horror | Three American college students studying abroa... | Eli Roth | Lauren German, Heather Matarazzo, Bijou Philli... | 2007 | 94 | 5.5 | 73152 | 17.54 | 46.0 |
| Step Up 2: The Streets | 998 | Drama,Music,Romance | Romantic sparks occur between two dance studen... | Jon M. Chu | Robert Hoffman, Briana Evigan, Cassie Ventura,... | 2008 | 98 | 6.2 | 70699 | 58.01 | 50.0 |


In the case of our dataset, this operation removes every row where `revenue_millions` or `metascore` is null: 128 rows are missing revenue and 64 are missing a metascore (a row may be missing both, so at most 192 rows are removed). This is wasteful, since there is perfectly good data in the other columns of the dropped rows. That's why we'll look at imputation next.

Besides dropping rows, you can also drop columns that contain null values by setting `axis="columns"` or `axis=1`:

In [31]:
movies_df.dropna(axis="columns")

| Title | rank | genre | description | director | actors | year | runtime | rating | votes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Guardians of the Galaxy | 1 | Action,Adventure,Sci-Fi | A group of intergalactic criminals are forced ... | James Gunn | Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S... | 2014 | 121 | 8.1 | 757074 |
| Prometheus | 2 | Adventure,Mystery,Sci-Fi | Following clues to the origin of mankind, a te... | Ridley Scott | Noomi Rapace, Logan Marshall-Green, Michael Fa... | 2012 | 124 | 7.0 | 485820 |
| Split | 3 | Horror,Thriller | Three girls are kidnapped by a man with a diag... | M. Night Shyamalan | James McAvoy, Anya Taylor-Joy, Haley Lu Richar... | 2016 | 117 | 7.3 | 157606 |
| Sing | 4 | Animation,Comedy,Family | In a city of humanoid animals, a hustling thea... | Christophe Lourdelet | Matthew McConaughey,Reese Witherspoon, Seth Ma... | 2016 | 108 | 7.2 | 60545 |
| Suicide Squad | 5 | Action,Adventure,Fantasy | A secret government agency recruits some of th... | David Ayer | Will Smith, Jared Leto, Margot Robbie, Viola D... | 2016 | 123 | 6.2 | 393727 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Secret in Their Eyes | 996 | Crime,Drama,Mystery | A tight-knit team of rising investigators, alo... | Billy Ray | Chiwetel Ejiofor, Nicole Kidman, Julia Roberts... | 2015 | 111 | 6.2 | 27585 |
| Hostel: Part II | 997 | Horror | Three American college students studying abroa... | Eli Roth | Lauren German, Heather Matarazzo, Bijou Philli... | 2007 | 94 | 5.5 | 73152 |
| Step Up 2: The Streets | 998 | Drama,Music,Romance | Romantic sparks occur between two dance studen... | Jon M. Chu | Robert Hoffman, Briana Evigan, Cassie Ventura,... | 2008 | 98 | 6.2 | 70699 |
| Search Party | 999 | Adventure,Comedy | A pair of friends embark on a mission to reuni... | Scot Armstrong | Adam Pally, T.J. Miller, Thomas Middleditch,Sh... | 2014 | 93 | 5.6 | 4881 |


In our dataset, this operation would drop the `revenue_millions` and `metascore` columns.
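The difference between the row and column variants is easier to see on a tiny table. The following sketch uses made-up values and also shows the `subset` parameter, which restricts the null check to selected columns:

```python
import numpy as np
import pandas as pd

# Small hypothetical table with one null in 'revenue' and one in 'metascore'.
df = pd.DataFrame({
    'rating': [8.1, 7.0, 6.2],
    'revenue': [333.13, np.nan, 325.02],
    'metascore': [76.0, 65.0, np.nan],
})

rows_dropped = df.dropna()                      # drop rows with any null
cols_dropped = df.dropna(axis='columns')        # drop columns with any null
subset_dropped = df.dropna(subset=['revenue'])  # only consider 'revenue'

print(list(rows_dropped.index))
print(list(cols_dropped.columns))
print(list(subset_dropped.index))
print(df.shape)  # the original table is unchanged
```

Using `subset` is a middle ground: rows are only discarded when the columns you actually need are missing.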

**Intuition side note**: What's with this `axis=1` parameter?

It's not immediately obvious where `axis` comes from and why you need it to be 1 for it to affect columns. To see why, just look at the `.shape` output:

In [32]:
movies_df.shape

(1000, 11)

As we learned above, this is a tuple that represents the shape of the DataFrame, i.e. 1000 rows and 11 columns. Note that the *rows* are at index zero of this tuple and *columns* are at **index one** of this tuple. This is why `axis=1` affects columns. This comes from NumPy, and is a great example of why learning NumPy is worth your time.
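The link between `axis` and `.shape` can be checked directly. In this small sketch (the table and its labels are made up), dropping along `axis=0` shrinks the first entry of the shape tuple and dropping along `axis=1` shrinks the second:

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]}, index=['a', 'b', 'c'])

# axis=0 refers to rows: the first number in .shape changes.
without_row = df.drop('a', axis=0)

# axis=1 refers to columns: the second number in .shape changes.
without_col = df.drop('x', axis=1)

print(df.shape, without_row.shape, without_col.shape)
```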

### Imputation

Imputation (the process of replacing missing data with substituted values) is a conventional feature engineering technique used to keep valuable data that have null values. 

There may be instances where dropping every row with a null value removes too big a chunk from your dataset, so instead we can impute that null with another value, usually the **mean** or the **median** of that column. 

Let's look at imputing the missing values in the `revenue_millions` column. First we'll extract that column into its own variable:

In [33]:
revenue = movies_df['revenue_millions'] 

Using square brackets is the general way we select columns in a DataFrame. 

If you remember back to when we created DataFrames from scratch, the keys of the `dict` ended up as column names. Now when we select columns of a DataFrame, we use brackets just like if we were accessing a Python dictionary. 

`revenue` now contains a Series:

In [34]:
revenue.head()

Title
Guardians of the Galaxy    333.13
Prometheus                 126.46
Split                      138.12
Sing                       270.32
Suicide Squad              325.02
Name: revenue_millions, dtype: float64

Slightly different formatting than a DataFrame, but we still have our `Title` index. 

We'll impute the missing values of revenue using the mean. Here's the mean value:

In [35]:
revenue_mean = revenue.mean()
revenue_mean

82.95637614678898

With the mean, let's fill the nulls using `fillna()`:

In [36]:
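movies_df['revenue_millions'] = revenue.fillna(revenue_mean)

We have now replaced all nulls in the `revenue_millions` column with the mean of the column. Assigning the result of `fillna()` back to the column updates `movies_df` explicitly. Calling `revenue.fillna(revenue_mean, inplace=True)` on the extracted Series is less reliable: whether the change reaches the original DataFrame depends on the pandas version and its copy-on-write settings.

In [37]:
movies_df.isnull().sum()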

rank                 0
genre                0
description          0
director             0
actors               0
year                 0
runtime              0
rating               0
votes                0
revenue_millions     0
metascore           64
dtype: int64

Imputing an entire column with the same value like this is a basic example. It would be a better idea to try a more granular imputation by Genre or Director. 

For example, you would find the mean of the revenue generated in each genre individually and impute the nulls in each genre with that genre's mean.
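A minimal sketch of this per-group imputation, on a made-up table in which each movie has a single genre label (the real `genre` column holds comma-separated combinations, so it would need some preparation first):

```python
import numpy as np
import pandas as pd

# Hypothetical data: one genre label per movie, two missing revenues.
df = pd.DataFrame({
    'genre': ['Action', 'Action', 'Action', 'Drama', 'Drama'],
    'revenue': [100.0, 200.0, np.nan, 10.0, np.nan],
})

# For every row, the mean revenue of that row's genre (nulls are ignored).
genre_mean = df.groupby('genre')['revenue'].transform('mean')

# Fill each null with the mean of its own genre.
df['revenue'] = df['revenue'].fillna(genre_mean)

print(df)
```

`transform('mean')` returns a series aligned with the original rows, which is what allows it to be passed straight to `fillna()`.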

Let's now look at more ways to examine and understand the dataset.

### Keeping a smaller set of columns

In [38]:
movies_df.head(3)

| Title | rank | genre | description | director | actors | year | runtime | rating | votes | revenue_millions | metascore |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Guardians of the Galaxy | 1 | Action,Adventure,Sci-Fi | A group of intergalactic criminals are forced ... | James Gunn | Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S... | 2014 | 121 | 8.1 | 757074 | 333.13 | 76.0 |
| Prometheus | 2 | Adventure,Mystery,Sci-Fi | Following clues to the origin of mankind, a te... | Ridley Scott | Noomi Rapace, Logan Marshall-Green, Michael Fa... | 2012 | 124 | 7.0 | 485820 | 126.46 | 65.0 |
| Split | 3 | Horror,Thriller | Three girls are kidnapped by a man with a diag... | M. Night Shyamalan | James McAvoy, Anya Taylor-Joy, Haley Lu Richar... | 2016 | 117 | 7.3 | 157606 | 138.12 | 62.0 |


In [39]:
movies_df.columns

Index(['rank', 'genre', 'description', 'director', 'actors', 'year', 'runtime',
       'rating', 'votes', 'revenue_millions', 'metascore'],
      dtype='object')

In [40]:
df = movies_df[['genre', 'director', 'year', 'runtime',
       'rating', 'votes', 'revenue_millions', 'metascore']]

In [41]:
df.head()

| Title | genre | director | year | runtime | rating | votes | revenue_millions | metascore |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Guardians of the Galaxy | Action,Adventure,Sci-Fi | James Gunn | 2014 | 121 | 8.1 | 757074 | 333.13 | 76.0 |
| Prometheus | Adventure,Mystery,Sci-Fi | Ridley Scott | 2012 | 124 | 7.0 | 485820 | 126.46 | 65.0 |
| Split | Horror,Thriller | M. Night Shyamalan | 2016 | 117 | 7.3 | 157606 | 138.12 | 62.0 |
| Sing | Animation,Comedy,Family | Christophe Lourdelet | 2016 | 108 | 7.2 | 60545 | 270.32 | 59.0 |
| Suicide Squad | Action,Adventure,Fantasy | David Ayer | 2016 | 123 | 6.2 | 393727 | 325.02 | 40.0 |


In [42]:
movies_df = df

### Understanding your variables

Using `describe()` on an entire DataFrame we can get a summary of the distribution of continuous variables:

In [43]:
movies_df.describe()

,year,runtime,rating,votes,revenue_millions,metascore
count,1000.0,1000.0,1000.0,1000.0,1000.0,936.0
mean,2012.783,113.172,6.7232,169808.3,82.956376,58.985043
std,3.205962,18.810908,0.945429,188762.6,96.412043,17.194757
min,2006.0,66.0,1.9,61.0,0.0,11.0
25%,2010.0,100.0,6.2,36309.0,17.4425,47.0
50%,2014.0,111.0,6.8,110799.0,60.375,59.5
75%,2016.0,123.0,7.4,239909.8,99.1775,72.0
max,2016.0,191.0,9.0,1791916.0,936.63,100.0


Understanding which numbers are continuous also comes in handy when thinking about the type of plot to use to represent your data visually. 

`.describe()` can also be used on a categorical variable to get the count of rows, the number of unique categories, the top category, and the frequency of the top category:

**Question:** show the "genre" of the first 8 movies 
<!--
movies_df['genre'].head(8).values
-->

In [44]:
movies_df['genre'].describe()

count                        1000
unique                        207
top       Action,Adventure,Sci-Fi
freq                           50
Name: genre, dtype: object

This tells us that the genre column has 207 unique values, the top value is Action/Adventure/Sci-Fi, which shows up 50 times (freq).

`.value_counts()` can tell us the frequency of all values in a column:

In [45]:
movies_df['genre'].value_counts().head(10)
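
To illustrate what `.value_counts()` returns, here is a minimal sketch on a toy Series. The `genres` values are hypothetical and only stand in for the real `genre` column.

```python
import pandas as pd

# Toy data (hypothetical values, not from the movies dataset)
genres = pd.Series(["Drama", "Action", "Drama", "Comedy", "Drama", "Action"])

# Count how often each distinct value occurs, most frequent first
counts = genres.value_counts()
print(counts)
```

The result is itself a Series whose index holds the distinct values and whose values are the counts, so `.head(10)` simply keeps the ten most frequent entries.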

#### Relationships between continuous variables

By using the correlation method `.corr()` we can generate the relationship between each continuous variable:

In [46]:
# numeric_only=True skips string columns such as genre, which cannot be converted to float
movies_df.corr(numeric_only=True)

Correlation tables are a numerical representation of the bivariate relationships in the dataset. 

Positive numbers indicate a positive correlation — one goes up the other goes up — and negative numbers represent an inverse correlation — one goes up the other goes down. 1.0 indicates a perfect correlation. 
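
The sign convention can be checked on a small example. This is only a sketch on toy data: the column names and values are hypothetical, and the `numeric_only` argument assumes a reasonably recent pandas version (1.5 or later).

```python
import pandas as pd

# Toy data (hypothetical): revenue rises with votes, age falls as votes rise
toy = pd.DataFrame(
    {
        "title": ["a", "b", "c", "d"],
        "votes": [10, 20, 30, 40],
        "revenue": [1.0, 2.0, 3.0, 4.0],
        "age": [4, 3, 2, 1],
    }
)

# Restrict the computation to numeric columns; "title" is left out
corr = toy.corr(numeric_only=True)
print(corr)
```

Here `votes` and `revenue` move together perfectly, while `votes` and `age` move in exactly opposite directions, so the two coefficients sit at the extremes of the scale.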

So looking in the first row, first column we see `year` has a perfect correlation with itself, which is obvious. On the other hand, the correlation between `votes` and `revenue_millions` is 0.6. A little more interesting.

Examining bivariate relationships comes in handy when you have an outcome or dependent variable in mind and would like to see the features most correlated to the increase or decrease of the outcome. You can visually represent bivariate relationships with scatterplots.

For a deeper look into data summarizations check out [Essential Statistics for Data Science](https://www.learndatasci.com/tutorials/data-science-statistics-using-python/).

Let's now look more at manipulating DataFrames.

### DataFrame slicing, selecting, extracting

Up until now we've focused on some basic summaries of our data. We've learned about simple column extraction using single brackets, and we imputed null values in a column using `fillna()`. Below are the other methods of slicing, selecting, and extracting you'll need to use constantly.

It's important to note that, although many methods are the same, DataFrames and Series have different attributes, so you'll need to be sure which type you are working with or else you will receive attribute errors.

Let's look at working with columns first.

#### By column

You already saw how to extract a column using square brackets like this:

In [47]:
genre_col = movies_df['genre']
type(genre_col)

pandas.core.series.Series

This will return a *Series*. To extract a column as a *DataFrame*, you need to pass a list of column names. In our case that's just a single column:

In [48]:
genre_col = movies_df[['genre']]
type(genre_col)

pandas.core.frame.DataFrame

Since it's just a list, adding another column name is easy:

In [49]:
subset = movies_df[['genre', 'rating']]
subset.head()

Title,genre,rating
Guardians of the Galaxy,"Action,Adventure,Sci-Fi",8.1
Prometheus,"Adventure,Mystery,Sci-Fi",7.0
Split,"Horror,Thriller",7.3
Sing,"Animation,Comedy,Family",7.2
Suicide Squad,"Action,Adventure,Fantasy",6.2


Now we'll look at getting data by rows.

#### By rows

For rows, we have two options: 

- `.loc` - **loc**ates by name
- `.iloc` - **loc**ates by numerical **i**ndex

Remember that we are still indexed by movie Title, so to use `.loc` we give it the Title of a movie:

In [50]:
prom = movies_df.loc["Prometheus"]
prom

genre               Adventure,Mystery,Sci-Fi
director                        Ridley Scott
year                                    2012
runtime                                  124
rating                                   7.0
votes                                 485820
revenue_millions                      126.46
metascore                               65.0
Name: Prometheus, dtype: object

On the other hand, with `iloc` we give it the numerical index of Prometheus:

In [51]:
prom = movies_df.iloc[1]
prom

genre               Adventure,Mystery,Sci-Fi
director                        Ridley Scott
year                                    2012
runtime                                  124
rating                                   7.0
votes                                 485820
revenue_millions                      126.46
metascore                               65.0
Name: Prometheus, dtype: object

`loc` and `iloc` can be thought of as similar to Python `list` slicing. To show this even further, let's select multiple rows.

How would you do it with a list? In Python, just slice with brackets like `example_list[1:4]`. It works the same way in pandas:

In [52]:
movie_subset = movies_df.iloc[1:4]
movie_subset

Title,genre,director,year,runtime,rating,votes,revenue_millions,metascore
Prometheus,"Adventure,Mystery,Sci-Fi",Ridley Scott,2012,124,7.0,485820,126.46,65.0
Split,"Horror,Thriller",M. Night Shyamalan,2016,117,7.3,157606,138.12,62.0
Sing,"Animation,Comedy,Family",Christophe Lourdelet,2016,108,7.2,60545,270.32,59.0


**Question**: How to achieve the same result using `.loc` ?
<!--
movie_subset = movies_df.loc['Prometheus':'Sing']
-->

One important distinction between using `.loc` and `.iloc` to select multiple rows is that `.loc` includes the end label (*Sing*) in the result, whereas `.iloc[1:4]` returns the rows at positions 1 to 3 and excludes the movie at position 4 (*Suicide Squad*).

Slicing with `.iloc` follows the same rules as slicing with lists: the object at the end index is not included.
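
The difference between the two slicing rules can be sketched on a small DataFrame. The `toy` frame below reuses the first five titles and ratings shown earlier; the variable names are chosen for this example only.

```python
import pandas as pd

# First five movies and their ratings, indexed by title
toy = pd.DataFrame(
    {"rating": [8.1, 7.0, 7.3, 7.2, 6.2]},
    index=["Guardians of the Galaxy", "Prometheus", "Split", "Sing", "Suicide Squad"],
)

# Position-based slice: positions 1, 2 and 3 (the end position 4 is excluded)
by_position = toy.iloc[1:4]

# Label-based slice: both the start and the end label are included
by_label = toy.loc["Prometheus":"Sing"]

print(list(by_position.index))
print(by_position.equals(by_label))
```

Label slices include the end label because there is often no obvious "next" label to name as an exclusive bound.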

#### Conditional selections
We’ve gone over how to select columns and rows, but what if we want to make a conditional selection? 

For example, what if we want to filter our movies DataFrame to show only films directed by Ridley Scott or films with a rating greater than or equal to 8.0?

To do that, we take a column from the DataFrame and apply a Boolean condition to it. Here's an example of a Boolean condition:

In [53]:
movies_df['director']

Title
Guardians of the Galaxy              James Gunn
Prometheus                         Ridley Scott
Split                        M. Night Shyamalan
Sing                       Christophe Lourdelet
Suicide Squad                        David Ayer
                                   ...         
Secret in Their Eyes                  Billy Ray
Hostel: Part II                        Eli Roth
Step Up 2: The Streets               Jon M. Chu
Search Party                     Scot Armstrong
Nine Lives                     Barry Sonnenfeld
Name: director, Length: 1000, dtype: object

In [54]:
movies_df['director'] == "Ridley Scott"

Title
Guardians of the Galaxy    False
Prometheus                  True
Split                      False
Sing                       False
Suicide Squad              False
                           ...  
Secret in Their Eyes       False
Hostel: Part II            False
Step Up 2: The Streets     False
Search Party               False
Nine Lives                 False
Name: director, Length: 1000, dtype: bool

In [55]:
(movies_df['director'] == "Ridley Scott").head()

Title
Guardians of the Galaxy    False
Prometheus                  True
Split                      False
Sing                       False
Suicide Squad              False
Name: director, dtype: bool

**Question:** Please select all the movies directed by "Ridley Scott". Use `.count()` to count them.
<!--
movies_df['director'][movies_df['director'] == "Ridley Scott"]
-->

Similar to `isnull()`, this returns a Series of True and False values: True for films directed by Ridley Scott and False for ones not directed by him. 

We want to filter out all movies not directed by Ridley Scott, in other words, we don’t want the False films. To return the rows where that condition is True we have to pass this operation into the DataFrame:

In [56]:
movies_df[movies_df['director'] == "Ridley Scott"].head()

Title,genre,director,year,runtime,rating,votes,revenue_millions,metascore
Prometheus,"Adventure,Mystery,Sci-Fi",Ridley Scott,2012,124,7.0,485820,126.46,65.0
The Martian,"Adventure,Drama,Sci-Fi",Ridley Scott,2015,144,8.0,556097,228.43,80.0
Robin Hood,"Action,Adventure,Drama",Ridley Scott,2010,140,6.7,221117,105.22,53.0
American Gangster,"Biography,Crime,Drama",Ridley Scott,2007,157,7.8,337835,130.13,76.0
Exodus: Gods and Kings,"Action,Adventure,Drama",Ridley Scott,2014,150,6.0,137299,65.01,52.0


You can get used to looking at these conditionals by reading them like this:

> Select movies_df where movies_df director equals Ridley Scott

Let's look at conditional selections using numerical values by filtering the DataFrame by ratings:

In [57]:
movies_df[movies_df['rating'] >= 8.6].head()

Title,genre,director,year,runtime,rating,votes,revenue_millions,metascore
Interstellar,"Adventure,Drama,Sci-Fi",Christopher Nolan,2014,169,8.6,1047747,187.99,74.0
The Dark Knight,"Action,Crime,Drama",Christopher Nolan,2008,152,9.0,1791916,533.32,82.0
Inception,"Action,Adventure,Sci-Fi",Christopher Nolan,2010,148,8.8,1583625,292.57,74.0
Kimi no na wa,"Animation,Drama,Fantasy",Makoto Shinkai,2016,106,8.6,34110,4.68,79.0
Dangal,"Action,Biography,Drama",Nitesh Tiwari,2016,161,8.8,48969,11.15,


We can make some richer conditionals by using logical operators `|` for "or" and `&` for "and".

Let's filter the DataFrame to show only movies by Christopher Nolan OR Ridley Scott:

In [58]:
movies_df[(movies_df['director'] == 'Christopher Nolan') | (movies_df['director'] == 'Ridley Scott')].head()

Title,genre,director,year,runtime,rating,votes,revenue_millions,metascore
Prometheus,"Adventure,Mystery,Sci-Fi",Ridley Scott,2012,124,7.0,485820,126.46,65.0
Interstellar,"Adventure,Drama,Sci-Fi",Christopher Nolan,2014,169,8.6,1047747,187.99,74.0
The Dark Knight,"Action,Crime,Drama",Christopher Nolan,2008,152,9.0,1791916,533.32,82.0
The Prestige,"Drama,Mystery,Sci-Fi",Christopher Nolan,2006,130,8.5,913152,53.08,66.0
Inception,"Action,Adventure,Sci-Fi",Christopher Nolan,2010,148,8.8,1583625,292.57,74.0


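We need to make sure to wrap each comparison in parentheses, because `|` and `&` bind more tightly than comparison operators such as `==` and `>=`; without the parentheses Python would evaluate the conditional in the wrong order.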

Using the `isin()` method we could make this more concise though:

In [59]:
movies_df[movies_df['director'].isin(['Christopher Nolan', 'Ridley Scott'])].head()

Title,genre,director,year,runtime,rating,votes,revenue_millions,metascore
Prometheus,"Adventure,Mystery,Sci-Fi",Ridley Scott,2012,124,7.0,485820,126.46,65.0
Interstellar,"Adventure,Drama,Sci-Fi",Christopher Nolan,2014,169,8.6,1047747,187.99,74.0
The Dark Knight,"Action,Crime,Drama",Christopher Nolan,2008,152,9.0,1791916,533.32,82.0
The Prestige,"Drama,Mystery,Sci-Fi",Christopher Nolan,2006,130,8.5,913152,53.08,66.0
Inception,"Action,Adventure,Sci-Fi",Christopher Nolan,2010,148,8.8,1583625,292.57,74.0
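
To see why each comparison needs its own parentheses when conditions are combined with `&` or `|`, consider the sketch below. The `toy` DataFrame and its values are hypothetical; integer columns are used so that the unparenthesized expression fails at the comparison step rather than earlier.

```python
import pandas as pd

# Toy data (hypothetical values) with two integer columns
toy = pd.DataFrame(
    {"year": [2006, 2012, 2008], "runtime": [130, 124, 100]},
    index=["a", "b", "c"],
)

# With parentheses, each comparison yields a Boolean Series and & combines them
ok = toy[(toy["year"] <= 2010) & (toy["runtime"] > 120)]
print(list(ok.index))

# Without parentheses, & binds first: year <= (2010 & runtime) > 120,
# a chained comparison that needs the truth value of a whole Series
try:
    toy[toy["year"] <= 2010 & toy["runtime"] > 120]
    raised = False
except ValueError:
    raised = True
print(raised)
```

The error message pandas raises in the second case mentions an ambiguous truth value, which is usually a hint that parentheses are missing.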


Let's say we want all movies released between 2005 and 2010, with a rating above 8.0, but below average in revenue.

Here's how we could do all of that:

In [60]:
movies_df[
    ((movies_df['year'] >= 2005) & (movies_df['year'] <= 2010))
    & (movies_df['rating'] > 8.0)
    & (movies_df['revenue_millions'] < movies_df['revenue_millions'].mean())
]

Title,genre,director,year,runtime,rating,votes,revenue_millions,metascore
The Prestige,"Drama,Mystery,Sci-Fi",Christopher Nolan,2006,130,8.5,913152,53.08,66.0
No Country for Old Men,"Crime,Drama,Thriller",Ethan Coen,2007,122,8.1,660286,74.27,91.0
Into the Wild,"Adventure,Biography,Drama",Sean Penn,2007,148,8.1,459304,18.35,73.0
Pan's Labyrinth,"Drama,Fantasy,War",Guillermo del Toro,2006,118,8.2,498879,37.62,98.0
There Will Be Blood,"Drama,History",Paul Thomas Anderson,2007,158,8.1,400682,40.22,92.0
3 Idiots,"Comedy,Drama",Rajkumar Hirani,2009,170,8.4,238789,6.52,67.0
The Lives of Others,"Drama,Thriller",Florian Henckel von Donnersmarck,2006,137,8.5,278103,11.28,89.0
Incendies,"Drama,Mystery,War",Denis Villeneuve,2010,131,8.2,92863,6.86,80.0
El secreto de sus ojos,"Drama,Mystery,Romance",Juan José Campanella,2009,129,8.2,144524,20.17,80.0
Taare Zameen Par,"Drama,Family,Music",Aamir Khan,2007,165,8.5,102697,1.2,42.0


The filter above uses the mean revenue as the cutoff, and ten movies match these criteria.

For a stricter cutoff, recall that when we used `.describe()` the 25th percentile for revenue was about 17.4; we can access this value directly by using the `quantile()` method with a float of 0.25.
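
A quantile-based filter could be sketched as follows. The `toy` DataFrame is hypothetical and only demonstrates the call; on the real data you would apply the same pattern to `movies_df['revenue_millions']`.

```python
import pandas as pd

# Toy data (hypothetical values, not from the movies dataset)
toy = pd.DataFrame(
    {"revenue_millions": [1.0, 2.0, 3.0, 4.0, 5.0]},
    index=["a", "b", "c", "d", "e"],
)

# 25th percentile of the column (linear interpolation by default)
q25 = toy["revenue_millions"].quantile(0.25)

# Keep only the rows strictly below that threshold
low_revenue = toy[toy["revenue_millions"] < q25]
print(q25, list(low_revenue.index))
```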

## Applying functions

It is possible to iterate over a DataFrame or Series as you would with a list, but doing so — especially on large datasets — is very slow.

An efficient alternative is to `apply()` a function to the dataset. For example, we could use a function to convert movies with a rating of 8.0 or greater to a string value of "good" and the rest to "bad", and use these transformed values to create a new column.

First we would create a function that, when given a rating, determines if it's good or bad:

In [61]:
def rating_function(x):
    if x >= 8.0:
        return "good"
    else:
        return "bad"

In [62]:
movies_df["rating"].apply(rating_function)

Title
Guardians of the Galaxy    good
Prometheus                  bad
Split                       bad
Sing                        bad
Suicide Squad               bad
                           ... 
Secret in Their Eyes        bad
Hostel: Part II             bad
Step Up 2: The Streets      bad
Search Party                bad
Nine Lives                  bad
Name: rating, Length: 1000, dtype: object

Now we send the entire rating column through this function with `apply()` again, this time storing the result in a new column:

In [63]:
# movies_df was created as a column subset; take an explicit copy so that
# adding a column does not trigger a SettingWithCopyWarning
movies_df = movies_df.copy()
movies_df["rating_category"] = movies_df["rating"].apply(rating_function)
movies_df.head(2)


Title,genre,director,year,runtime,rating,votes,revenue_millions,metascore,rating_category
Guardians of the Galaxy,"Action,Adventure,Sci-Fi",James Gunn,2014,121,8.1,757074,333.13,76.0,good
Prometheus,"Adventure,Mystery,Sci-Fi",Ridley Scott,2012,124,7.0,485820,126.46,65.0,bad


The `.apply()` method passes every value in the `rating` column through the `rating_function` and then returns a new Series. This Series is then assigned to a new column called `rating_category`.

You can also use anonymous functions. This lambda function achieves the same result as `rating_function`:

In [64]:
movies_df["rating_category"] = movies_df["rating"].apply(lambda x: 'good' if x >= 8.0 else 'bad')

movies_df.head(2)



Title,genre,director,year,runtime,rating,votes,revenue_millions,metascore,rating_category
Guardians of the Galaxy,"Action,Adventure,Sci-Fi",James Gunn,2014,121,8.1,757074,333.13,76.0,good
Prometheus,"Adventure,Mystery,Sci-Fi",Ridley Scott,2012,124,7.0,485820,126.46,65.0,bad


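Overall, using `apply()` is usually faster and more concise than iterating manually over rows, but it still calls the Python function once per value. For the best performance on large datasets, prefer vectorized operations where pandas or NumPy provide them.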

> Vectorization: a style of computer programming where operations are applied to whole arrays instead of individual elements —[Wikipedia](https://en.wikipedia.org/wiki/Vectorization)
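
As a rough illustration, the sketch below produces the same "good"/"bad" labels with NumPy's `np.where`, which evaluates the condition on the whole column at once. The `ratings` Series is toy data chosen for this example, and no timing claims are made here; measure on your own data before choosing.

```python
import numpy as np
import pandas as pd

# Toy ratings (hypothetical values)
ratings = pd.Series([8.1, 7.0, 8.0, 6.2], name="rating")

# Element-by-element: the lambda is called once per value
via_apply = ratings.apply(lambda x: "good" if x >= 8.0 else "bad")

# Vectorized: one comparison over the whole Series, then one selection
via_where = np.where(ratings >= 8.0, "good", "bad")

print(list(via_apply))
print(list(via_where))
```

`np.where` returns a NumPy array, so it can be assigned directly to a new DataFrame column of the same length.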

A common use of `apply()` is in natural language processing (NLP) work, where you need to apply all sorts of text-cleaning functions to strings to prepare them for machine learning.

## Wrapping up

Exploring, cleaning, transforming, and visualizing data with pandas in Python is an essential skill in data science. Cleaning and wrangling data alone is often said to take up around 80% of a data scientist's time. After a few projects and some practice, you should be very comfortable with most of the basics.

To keep improving, view the [extensive tutorials](https://pandas.pydata.org/pandas-docs/stable/tutorials.html) offered by the official pandas docs, follow along with a few [Kaggle kernels](https://www.kaggle.com/kernels), and keep working on your own projects!


See the [pandas](https://pandas.pydata.org/pandas-docs/stable/tutorials.html) library tutorials and the [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/).