# **Intro to Python for Data Analysis**
## Chapter 2: DataFrames
---
**Author:** Juan Martin Bellido  

**About**  
In this chapter we will start working with DataFrames (i.e. the pandas version of data tables). After finishing this notebook, we will be ready to move on to advanced data manipulation operations.

**Feedback?** Please share on [LinkedIn](https://www.linkedin.com/in/jmartinbellido/)  


## Table of Contents
---
1. Libraries
2. Intro to DataFrames
3. Filtering columns and rows
4. Basic operations with DataFrames
5. Exercises


Conventions used in this document

> 👉 *This is a note*

> ⚠️ *This is a warning*

# 1. Libraries
---


### Importing libraries

Python is an *open-source* and *collaborative* programming language. Anyone can therefore build and share new functionality, which is packaged into *libraries*. We install a library only once and then import it in every session in which we intend to use it.

> 👉 Many popular libraries are pre-installed, therefore we only need to import them.  

We use the following syntax to import a library.

```
import (library) as (tag)
```

After importing a library, we reference it every time we call one of its functions. To keep this short, we usually give the library a short tag (alias).

> 👉 The *tags* of the most popular libraries are standardized (e.g. *pd* for pandas).

In [2]:
# we import pandas and tag it as "pd"
import pandas as pd

### Installing libraries

We can only import a library once it is installed in our environment. There are several ways to install libraries in Python; the most popular is *pip*, Python's standard package installer.

The example below illustrates how to install a library (e.g. *pandas*) using *pip*.


```
pip install pandas
```

> ⚠️ *pip* must be available before we can use it; it is bundled with most recent Python installations.

> 👉 How do we know whether we need to install a library? The easiest way is simply to try importing it; if it is not found, we need to install it.
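
The check described in the note above can be sketched in a few lines. This is only a minimal illustration: the helper name `is_installed` is hypothetical, while `importlib.util.find_spec` is part of Python's standard library.

```python
import importlib.util

def is_installed(library_name):
    """Return True if the library can be found in this environment."""
    # find_spec returns None when a top-level module cannot be located
    return importlib.util.find_spec(library_name) is not None

print(is_installed("json"))                    # json ships with Python
print(is_installed("not_a_real_library_xyz"))  # made-up name, not installed
```

If the function returns `False` for a library we need, installing it with *pip* is the next step.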



# 2. Intro to DataFrames
---

In [35]:
# importing libraries
import pandas as pd
import numpy as np

### First steps with a DataFrame

In [None]:
# we create a dictionary
car_dic = {
    "car":['Honda Civic','VW Golf','Toyota Corolla'],
    "price":[12000,13000,15000],
    "is_new":[False,True,True]
  }

# pd.DataFrame() function: building a df manually
df_cars = pd.DataFrame(car_dic)

df_cars

Unnamed: 0,car,price,is_new
0,Honda Civic,12000,False
1,VW Golf,13000,True
2,Toyota Corolla,15000,True


In [None]:
# checking object type
type(df_cars)

pandas.core.frame.DataFrame

In [None]:
# we can check columns on this df
df_cars.columns.values

array(['car', 'price', 'is_new'], dtype=object)

In [None]:
# we can check indexes 
df_cars.index.values

array([0, 1, 2])

### Importing a DataFrame

After importing the *pandas* library, we use one of its functions to import a DataFrame into our environment.

```
pd.read_csv('path')
```
* The *pd.read_csv()* function imports a CSV or TXT file into a DataFrame
* The path can be either the location of a local file or (in our case) a URL
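
As a small sketch of the same function on data that is not comma-separated: `read_csv` accepts a `sep` parameter. The sample below reads from an in-memory text buffer (`io.StringIO`) instead of a real file so that it is self-contained; the data and the name `df_sample` are made up for illustration.

```python
import io
import pandas as pd

# made-up, tab-separated sample standing in for a TXT file
raw_text = "car\tprice\nHonda Civic\t12000\nVW Golf\t13000\n"

# sep tells read_csv which character separates the columns
df_sample = pd.read_csv(io.StringIO(raw_text), sep="\t")
print(df_sample)
```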


In [None]:
# using pd.read_csv() to import a dataframe
pd.read_csv("https://data-wizards.s3.amazonaws.com/datasets/jamesbond.csv")

Unnamed: 0,Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
0,Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
1,From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
2,Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
3,Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
4,Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,
5,You Only Live Twice,1967,Sean Connery,Lewis Gilbert,514.2,59.9,4.4
6,On Her Majesty's Secret Service,1969,George Lazenby,Peter R. Hunt,291.5,37.3,0.6
7,Diamonds Are Forever,1971,Sean Connery,Guy Hamilton,442.5,34.7,5.8
8,Live and Let Die,1973,Roger Moore,Guy Hamilton,460.3,30.8,
9,The Man with the Golden Gun,1974,Roger Moore,Guy Hamilton,334.0,27.7,


In [None]:
# repeating the operation, this time storing the data on an object
df_jamesbond = pd.read_csv("https://data-wizards.s3.amazonaws.com/datasets/jamesbond.csv")

In [None]:
# we use type() to confirm object type
type(df_jamesbond)

pandas.core.frame.DataFrame

### Exporting a DataFrame

We will now use the DataFrame's *.to_csv()* method to export a DataFrame to a file. If we provide only a file name rather than a full path, the file is stored in our working directory (the default location).

```
df.to_csv('path')
```
> ⚠️ On Google Colab, the file is written to the session's temporary storage, not to our own computer

In [None]:
# exporting dataframe
df_jamesbond.to_csv("dataset_jamesbond.csv")

In [None]:
# we can export the dataframe as a tab-separated txt file by changing the "sep" parameter
df_jamesbond.to_csv("dataset_jamesbond.txt",sep='\t')
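
One detail worth knowing: by default `to_csv()` also writes the index as an extra, unnamed first column, which is why a re-imported file often shows a column called `Unnamed: 0`. The minimal sketch below uses an in-memory buffer and a made-up DataFrame (`df_demo`, both assumptions for illustration) to show the difference that `index=False` makes.

```python
import io
import pandas as pd

df_demo = pd.DataFrame({"car": ["Honda Civic", "VW Golf"], "price": [12000, 13000]})

# without a path, to_csv() returns the CSV content as a string
with_index = df_demo.to_csv()                # default: index is written too
without_index = df_demo.to_csv(index=False)  # index is left out

cols_with = pd.read_csv(io.StringIO(with_index)).columns.tolist()
cols_without = pd.read_csv(io.StringIO(without_index)).columns.tolist()

print(cols_with)
print(cols_without)
```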

In [None]:
# we can check our working directory using os.getcwd()
import os
os.getcwd()

'/content'

### First operations on a DataFrame

The first operations we will perform on a DataFrame:
* check columns (variables) names and type of variables in our DataFrame
* review number of rows in our DataFrame
* change index in our DataFrame
* display first/last n rows in our DataFrame


In [None]:
# .dtypes attribute: shows the name and type of each variable in a df
df_jamesbond.dtypes

Film                  object
Year                   int64
Actor                 object
Director              object
Box Office           float64
Budget               float64
Bond Actor Salary    float64
dtype: object

In [None]:
# len() function: provides number of rows in df
len(df_jamesbond)

26

In [None]:
# in some cases, we might want to replace the numerical index with an index built from an existing column in our df
# .set_index(column): converts a column into the DataFrame's index
df_jamesbond.set_index("Film") # we use column "Film" as the df's new index

Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,
You Only Live Twice,1967,Sean Connery,Lewis Gilbert,514.2,59.9,4.4
On Her Majesty's Secret Service,1969,George Lazenby,Peter R. Hunt,291.5,37.3,0.6
Diamonds Are Forever,1971,Sean Connery,Guy Hamilton,442.5,34.7,5.8
Live and Let Die,1973,Roger Moore,Guy Hamilton,460.3,30.8,
The Man with the Golden Gun,1974,Roger Moore,Guy Hamilton,334.0,27.7,
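
Note that `.set_index()` returns a new DataFrame and leaves the original untouched. The sketch below illustrates this on a small made-up DataFrame (`df_demo` and `df_indexed` are hypothetical names), and also shows `.reset_index()` as the way back.

```python
import pandas as pd

df_demo = pd.DataFrame({"Film": ["Dr. No", "Goldfinger"], "Year": [1962, 1964]})

df_demo.set_index("Film")               # result is computed but not stored
print(df_demo.index.tolist())           # still the numeric index

df_indexed = df_demo.set_index("Film")  # store the result to keep the change
print(df_indexed.index.tolist())

# reset_index() moves the index back into a regular column
restored_columns = df_indexed.reset_index().columns.tolist()
print(restored_columns)
```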


In [None]:
# head()/ tail() method: review first/last n rows in df; n=5 by default
df_jamesbond.head()

Unnamed: 0,Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
0,Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
1,From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
2,Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
3,Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
4,Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,


# 3. Filtering columns and rows
---



In [4]:
# importing a df
df_jamesbond = pd.read_csv("https://data-wizards.s3.amazonaws.com/datasets/jamesbond.csv")

In [11]:
# creating a simple df manually
car_dic = {
    "car":['Honda Civic','VW Golf','Toyota Corolla'],
    "price":[12000,13000,15000],
    "is_new":[False,True,True]
    }
df_cars = pd.DataFrame(car_dic)

### Filtering columns by name

As introduced in notebook 1, Python uses [ ] (*brackets*) as the basic syntax to select elements stored in data structures.

```
data_structure[x]
```


To filter columns by name, we place the column name, as a *string*, within the brackets.

```
data_structure['col_name']
```



In [6]:
# we select one column in df
df_jamesbond['Actor']

0       Sean Connery
1       Sean Connery
2       Sean Connery
3       Sean Connery
4        David Niven
5       Sean Connery
6     George Lazenby
7       Sean Connery
8        Roger Moore
9        Roger Moore
10       Roger Moore
11       Roger Moore
12       Roger Moore
13      Sean Connery
14       Roger Moore
15       Roger Moore
16    Timothy Dalton
17    Timothy Dalton
18    Pierce Brosnan
19    Pierce Brosnan
20    Pierce Brosnan
21    Pierce Brosnan
22      Daniel Craig
23      Daniel Craig
24      Daniel Craig
25      Daniel Craig
Name: Actor, dtype: object

To filter by more than one column, we use a *list* within the brackets.

```
data_structure[['col_1','col_2','col_3']]
```

In [7]:
# we invoke more than one column by using a list
df_jamesbond[['Film','Actor','Box Office']]

Unnamed: 0,Film,Actor,Box Office
0,Dr. No,Sean Connery,448.8
1,From Russia with Love,Sean Connery,543.8
2,Goldfinger,Sean Connery,820.4
3,Thunderball,Sean Connery,848.1
4,Casino Royale,David Niven,315.0
5,You Only Live Twice,Sean Connery,514.2
6,On Her Majesty's Secret Service,George Lazenby,291.5
7,Diamonds Are Forever,Sean Connery,442.5
8,Live and Let Die,Roger Moore,460.3
9,The Man with the Golden Gun,Roger Moore,334.0


Alternatively, we can filter by one column using the following simplified syntax.

```
object.column_name
```

> ⚠️ This works only when filtering for a single column

> ⚠️ Only works for column names declared without spaces

In [8]:
df_jamesbond.Actor

0       Sean Connery
1       Sean Connery
2       Sean Connery
3       Sean Connery
4        David Niven
5       Sean Connery
6     George Lazenby
7       Sean Connery
8        Roger Moore
9        Roger Moore
10       Roger Moore
11       Roger Moore
12       Roger Moore
13      Sean Connery
14       Roger Moore
15       Roger Moore
16    Timothy Dalton
17    Timothy Dalton
18    Pierce Brosnan
19    Pierce Brosnan
20    Pierce Brosnan
21    Pierce Brosnan
22      Daniel Craig
23      Daniel Craig
24      Daniel Craig
25      Daniel Craig
Name: Actor, dtype: object
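
The two warnings above can be illustrated with a short sketch. The DataFrame below is made up for the purpose; in particular, the column named `count` is an assumption chosen because it clashes with an existing DataFrame method, which is a further case where the dot syntax does not return the column.

```python
import pandas as pd

df_demo = pd.DataFrame({"Actor": ["Sean Connery"], "Box Office": [448.8], "count": [1]})

print(df_demo.Actor.tolist())          # dot syntax works for a simple name
print(df_demo["Box Office"].tolist())  # a name with a space needs brackets

# "count" is also a DataFrame method, so the dot form returns the method
print(callable(df_demo.count))
print(df_demo["count"].tolist())       # brackets always return the column
```

Brackets work in every case, so they are the safer default when in doubt.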

### Filtering rows by positions

A first method to filter rows by position in a DataFrame is to use a list of *booleans*.

```
df[[True,False,False...]]

```
> 👉 When it detects *booleans*, pandas understands that we are trying to filter rows (instead of selecting columns)

> ⚠️ The boolean list needs to have as many elements as rows in the DataFrame we are trying to filter

This method will become particularly relevant when filtering rows using conditions.


In [12]:
df_cars[[True, False, True]]
# we can observe that we only keep the rows that correspond to True values in the list

Unnamed: 0,car,price,is_new
0,Honda Civic,12000,False
2,Toyota Corolla,15000,True
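
To see why the length of the boolean list matters, the sketch below (on a made-up copy of the cars data, named `df_demo` here) first filters correctly and then passes a list that is too short. In the pandas versions I am aware of this raises a `ValueError`; the exact message may differ between versions.

```python
import pandas as pd

df_demo = pd.DataFrame({"car": ["Honda Civic", "VW Golf", "Toyota Corolla"],
                        "price": [12000, 13000, 15000]})

kept = df_demo[[False, True, True]]  # one boolean per row
print(kept)

try:
    df_demo[[True, False]]           # only two booleans for three rows
    mismatch_raised = False
except ValueError as error:
    mismatch_raised = True
    print("ValueError:", error)
```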


### Filtering a DataFrame using Pandas: loc[ ] & iloc[ ] methods

The pandas *.loc[]* and *.iloc[]* methods allow us to filter rows and columns using matrix-like logic (rows, columns).

> ⚠️ Methods typically use parentheses () to take parameters. As an exception to this rule, these two methods use brackets [].


#### The .loc[ ] method

The *.loc[]* method allows us to access rows and columns by their *labels*.

```
object.loc[rows,columns]
object.loc[[row_1,row_2,...],[col_1,col_2,...]]
```

In [13]:
# importing a df
df_jamesbond = pd.read_csv("https://data-wizards.s3.amazonaws.com/datasets/jamesbond.csv")
df_jamesbond = df_jamesbond.set_index("Film") # changing index
df_jamesbond.head() # displaying first 5 rows

Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,


In [14]:
# let us use .loc to filter data
df_jamesbond.loc[["From Russia with Love","Goldfinger"],:] # filtering for two rows by label (given as a list), all columns (:)

Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2


In [15]:
# filtering for rows in range, two columns
df_jamesbond.loc["From Russia with Love":"Goldfinger",["Year","Director"]] # we use a list within second parameter

Film,Year,Director
From Russia with Love,1963,Terence Young
Goldfinger,1964,Guy Hamilton


#### The .iloc[ ] method
The pandas *.iloc[]* method allows us to access rows and columns by their *position* within the structure.

```
object.iloc[row_position,column_position]
```


In [17]:
# importing df
df_jamesbond = pd.read_csv("https://data-wizards.s3.amazonaws.com/datasets/jamesbond.csv")

In [18]:
# let us select elements, by position
df_jamesbond.iloc[:5,[1,3,5]] # selecting first 5 rows; columns 2nd, 4th and 6th (remember that Python indexes at 0)

Unnamed: 0,Year,Director,Budget
0,1962,Terence Young,7.0
1,1963,Terence Young,12.6
2,1964,Guy Hamilton,18.6
3,1965,Terence Young,41.9
4,1967,Ken Hughes,85.0


In [19]:
# example of nesting both methods
df_jamesbond.iloc[:10,:].loc[:,['Film','Actor','Director']]

Unnamed: 0,Film,Actor,Director
0,Dr. No,Sean Connery,Terence Young
1,From Russia with Love,Sean Connery,Terence Young
2,Goldfinger,Sean Connery,Guy Hamilton
3,Thunderball,Sean Connery,Terence Young
4,Casino Royale,David Niven,Ken Hughes
5,You Only Live Twice,Sean Connery,Lewis Gilbert
6,On Her Majesty's Secret Service,George Lazenby,Peter R. Hunt
7,Diamonds Are Forever,Sean Connery,Guy Hamilton
8,Live and Let Die,Roger Moore,Guy Hamilton
9,The Man with the Golden Gun,Roger Moore,Guy Hamilton
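
One difference between the two methods is easy to miss: a range of labels in `.loc[]` includes the end label, whereas a range of positions in `.iloc[]` excludes the end position (as with regular Python slicing). A minimal sketch on a made-up DataFrame (`df_demo` is a hypothetical name):

```python
import pandas as pd

df_demo = pd.DataFrame(
    {"Year": [1962, 1963, 1964, 1965]},
    index=["Dr. No", "From Russia with Love", "Goldfinger", "Thunderball"],
)

by_label = df_demo.loc["From Russia with Love":"Goldfinger"]  # end label included
by_position = df_demo.iloc[1:2]                               # end position excluded

print(len(by_label))
print(len(by_position))
```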


### Filtering rows, based on logical conditions

We can also choose to filter rows based on specific conditions we build. 

```
object[object['column'] >= value]
```

We use comparison operators to build a boolean Series (one True/False value per row), then we use it to filter.


In [None]:
# importing df
df_jamesbond = pd.read_csv("https://data-wizards.s3.amazonaws.com/datasets/jamesbond.csv")

In [20]:
# we first create a boolean Series by testing a condition on a column
cond = df_jamesbond["Year"] >= 1990
cond

0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17    False
18     True
19     True
20     True
21     True
22     True
23     True
24     True
25     True
Name: Year, dtype: bool

In [21]:
# we now use the boolean Series created above to filter rows
df_jamesbond[cond]

Unnamed: 0,Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
18,GoldenEye,1995,Pierce Brosnan,Martin Campbell,518.5,76.9,5.1
19,Tomorrow Never Dies,1997,Pierce Brosnan,Roger Spottiswoode,463.2,133.9,10.0
20,The World Is Not Enough,1999,Pierce Brosnan,Michael Apted,439.5,158.3,13.5
21,Die Another Day,2002,Pierce Brosnan,Lee Tamahori,465.4,154.2,17.9
22,Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
23,Quantum of Solace,2008,Daniel Craig,Marc Forster,514.2,181.4,8.1
24,Skyfall,2012,Daniel Craig,Sam Mendes,943.5,170.2,14.5
25,Spectre,2015,Daniel Craig,Sam Mendes,726.7,206.3,


In [22]:
# we repeat the operation, this time without an auxiliary object
df_jamesbond[df_jamesbond['Year'] >= 1990]

Unnamed: 0,Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
18,GoldenEye,1995,Pierce Brosnan,Martin Campbell,518.5,76.9,5.1
19,Tomorrow Never Dies,1997,Pierce Brosnan,Roger Spottiswoode,463.2,133.9,10.0
20,The World Is Not Enough,1999,Pierce Brosnan,Michael Apted,439.5,158.3,13.5
21,Die Another Day,2002,Pierce Brosnan,Lee Tamahori,465.4,154.2,17.9
22,Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
23,Quantum of Solace,2008,Daniel Craig,Marc Forster,514.2,181.4,8.1
24,Skyfall,2012,Daniel Craig,Sam Mendes,943.5,170.2,14.5
25,Spectre,2015,Daniel Craig,Sam Mendes,726.7,206.3,


In [23]:
# we create two conditions
cond_1 = df_jamesbond["Budget"]>150
cond_2 = df_jamesbond["Actor"]=='Daniel Craig'

# we link the two conditions with an "and" operator
df_jamesbond[cond_1 & cond_2] # filtering rows

Unnamed: 0,Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
23,Quantum of Solace,2008,Daniel Craig,Marc Forster,514.2,181.4,8.1
24,Skyfall,2012,Daniel Craig,Sam Mendes,943.5,170.2,14.5
25,Spectre,2015,Daniel Craig,Sam Mendes,726.7,206.3,


In [24]:
# we create two conditions
cond_1 = df_jamesbond["Budget"]>150
cond_2 = df_jamesbond["Actor"]=='Daniel Craig'

# we link the two conditions with an "or" operator
df_jamesbond[cond_1 | cond_2] # filtering rows

Unnamed: 0,Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
20,The World Is Not Enough,1999,Pierce Brosnan,Michael Apted,439.5,158.3,13.5
21,Die Another Day,2002,Pierce Brosnan,Lee Tamahori,465.4,154.2,17.9
22,Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
23,Quantum of Solace,2008,Daniel Craig,Marc Forster,514.2,181.4,8.1
24,Skyfall,2012,Daniel Craig,Sam Mendes,943.5,170.2,14.5
25,Spectre,2015,Daniel Craig,Sam Mendes,726.7,206.3,


In [25]:
# we create two conditions
cond_1 = df_jamesbond["Box Office"]>500
cond_2 = df_jamesbond["Actor"]=='Daniel Craig'

# we combine both conditions with "and"; we negate the 2nd condition with ~
df_jamesbond[cond_1 & ~cond_2]

Unnamed: 0,Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
1,From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
2,Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
3,Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
5,You Only Live Twice,1967,Sean Connery,Lewis Gilbert,514.2,59.9,4.4
10,The Spy Who Loved Me,1977,Roger Moore,Lewis Gilbert,533.0,45.1,
11,Moonraker,1979,Roger Moore,Lewis Gilbert,535.0,91.5,
18,GoldenEye,1995,Pierce Brosnan,Martin Campbell,518.5,76.9,5.1


#### Useful methods when filtering rows by conditions


In [27]:
# the .isin() method is useful when filtering for multiple values on a categorical variable
cond = df_jamesbond['Director'].isin(['Martin Campbell','Terence Young'])
df_jamesbond[cond]

Unnamed: 0,Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
0,Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
1,From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
3,Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
18,GoldenEye,1995,Pierce Brosnan,Martin Campbell,518.5,76.9,5.1
22,Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3


In [26]:
# we can use the .str.contains() method to test if text is contained in string variable
cond_1 = df_jamesbond['Film'].str.contains('love', case=False)
cond_2 = df_jamesbond['Film'].str.contains('die', case=False)

df_jamesbond[cond_1 | cond_2]

Unnamed: 0,Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
1,From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
8,Live and Let Die,1973,Roger Moore,Guy Hamilton,460.3,30.8,
10,The Spy Who Loved Me,1977,Roger Moore,Lewis Gilbert,533.0,45.1,
19,Tomorrow Never Dies,1997,Pierce Brosnan,Roger Spottiswoode,463.2,133.9,10.0
21,Die Another Day,2002,Pierce Brosnan,Lee Tamahori,465.4,154.2,17.9


### Filtering columns, based on type of variable

The *.select_dtypes()* method allows us to select columns based on their type of variable.

```
df.select_dtypes(include=[...], exclude=[...])
```





In [None]:
# importing df
# using the .dtypes attribute to review each column's type of variable
df_jamesbond = pd.read_csv("https://data-wizards.s3.amazonaws.com/datasets/jamesbond.csv")
df_jamesbond.dtypes

Film                  object
Year                   int64
Actor                 object
Director              object
Box Office           float64
Budget               float64
Bond Actor Salary    float64
dtype: object

In [None]:
# we use the .select_dtypes() method to filter only for columns type 'object' (text) or 'int64' (integer)
df_jamesbond.select_dtypes(include=['object','int64']).head()

Unnamed: 0,Film,Year,Actor,Director
0,Dr. No,1962,Sean Connery,Terence Young
1,From Russia with Love,1963,Sean Connery,Terence Young
2,Goldfinger,1964,Sean Connery,Guy Hamilton
3,Thunderball,1965,Sean Connery,Terence Young
4,Casino Royale,1967,David Niven,Ken Hughes


In [None]:
# we now use the same method, this time to select all excluding columns type 'object'
df_jamesbond.select_dtypes(exclude='object').head()

Unnamed: 0,Year,Box Office,Budget,Bond Actor Salary
0,1962,448.8,7.0,0.6
1,1963,543.8,12.6,1.6
2,1964,820.4,18.6,3.2
3,1965,848.1,41.9,4.7
4,1967,315.0,85.0,
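
Instead of listing each numeric type, `.select_dtypes()` also accepts the generic value `"number"`, which covers both integer and decimal columns. A minimal sketch on a made-up one-row DataFrame (`df_demo`):

```python
import pandas as pd

df_demo = pd.DataFrame({"Film": ["Dr. No"], "Year": [1962], "Budget": [7.0]})

# "number" matches int64 and float64 columns alike
numeric_cols = df_demo.select_dtypes(include="number")
print(numeric_cols.columns.tolist())
```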


### Filtering rows using SQL

If you are familiar with SQL (the language of relational databases), you might find it easy to filter rows with the *.query()* method, whose syntax resembles a SQL *WHERE* clause.

```
df.query("cond")
```
> 👉 Multiple conditions are combined using *and* (&) / *or* (|)


In [None]:
# importing df
df_jamesbond = pd.read_csv("https://data-wizards.s3.amazonaws.com/datasets/jamesbond.csv")

In [None]:
# Using the .query() method to filter rows
df_jamesbond.query("Year > 2000 or Actor == 'Daniel Craig'")

Unnamed: 0,Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
21,Die Another Day,2002,Pierce Brosnan,Lee Tamahori,465.4,154.2,17.9
22,Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
23,Quantum of Solace,2008,Daniel Craig,Marc Forster,514.2,181.4,8.1
24,Skyfall,2012,Daniel Craig,Sam Mendes,943.5,170.2,14.5
25,Spectre,2015,Daniel Craig,Sam Mendes,726.7,206.3,
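
Two details of `.query()` are useful with this dataset: column names containing spaces are wrapped in backticks, and a Python variable can be referenced by prefixing it with `@`. The sketch below uses a made-up three-row DataFrame (`df_demo`) and a hypothetical variable `min_year`.

```python
import pandas as pd

df_demo = pd.DataFrame({"Film": ["Dr. No", "Skyfall", "Spectre"],
                        "Year": [1962, 2012, 2015],
                        "Box Office": [448.8, 943.5, 726.7]})

min_year = 2000  # a regular Python variable

# backticks wrap a column name with spaces; @ refers to a Python variable
result = df_demo.query("`Box Office` > 800 and Year > @min_year")
print(result["Film"].tolist())
```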


# 4. Basic operations with DataFrames
---


In [28]:
# importing df
df_cars = pd.read_csv('https://data-wizards.s3.amazonaws.com/datasets/dataset_us_cars.csv')

In [29]:
# to simplify df, we keep only the first 5 rows (the default for head())
df_cars = df_cars.head()

In [30]:
# displaying df
df_cars

Unnamed: 0,year,brand,price,mileage,color,state,country
0,2008,toyota,6300,274117,black,new jersey,usa
1,2011,ford,2899,190552,silver,tennessee,usa
2,2018,dodge,5350,39590,silver,georgia,usa
3,2014,ford,25000,64146,blue,virginia,usa
4,2018,chevrolet,27700,6654,red,florida,usa


### Creating a new column (variable)

We use the following syntax to create a new column in a DataFrame. If the column name is already taken, pandas will overwrite that column with the new data provided.

```
df["new_column"] = [...]
```
> 👉 We should provide a list with as many elements as rows in the DataFrame we are modifying




In [31]:
# creating a new column
df_cars['price_eur'] = df_cars['price']/1.2

In [32]:
# we visualize the df
df_cars.head()

Unnamed: 0,year,brand,price,mileage,color,state,country,price_eur
0,2008,toyota,6300,274117,black,new jersey,usa,5250.0
1,2011,ford,2899,190552,silver,tennessee,usa,2415.833333
2,2018,dodge,5350,39590,silver,georgia,usa,4458.333333
3,2014,ford,25000,64146,blue,virginia,usa,20833.333333
4,2018,chevrolet,27700,6654,red,florida,usa,23083.333333


In [None]:
# we repeat the operation using the round() function
df_cars['price_eur'] = round(df_cars['price']/1.2)
df_cars.head()


In [33]:
# on a second example, we now create a new text column
df_cars["brand-color"] = df_cars["brand"] + " - " + df_cars["color"] # we use the + operator to concatenate
df_cars.head()

Unnamed: 0,year,brand,price,mileage,color,state,country,price_eur,brand-color
0,2008,toyota,6300,274117,black,new jersey,usa,5250.0,toyota - black
1,2011,ford,2899,190552,silver,tennessee,usa,2415.833333,ford - silver
2,2018,dodge,5350,39590,silver,georgia,usa,4458.333333,dodge - silver
3,2014,ford,25000,64146,blue,virginia,usa,20833.333333,ford - blue
4,2018,chevrolet,27700,6654,red,florida,usa,23083.333333,chevrolet - red
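
Besides a full list or a calculation on existing columns, a single value can also be assigned: pandas repeats it on every row. A list of the wrong length, on the other hand, is rejected. The sketch below illustrates both cases on a made-up DataFrame (`df_demo`); the exact error message may vary between pandas versions.

```python
import pandas as pd

df_demo = pd.DataFrame({"brand": ["toyota", "ford", "dodge"], "price": [6300, 2899, 5350]})

df_demo["country"] = "usa"                    # a single value is repeated on every row
df_demo["price_k"] = df_demo["price"] / 1000  # a calculation on an existing column

try:
    df_demo["color"] = ["black", "silver"]    # two values for three rows
    length_error = False
except ValueError as error:
    length_error = True
    print("ValueError:", error)

print(df_demo)
```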


### New column based on condition

We create a new column using an *if/else* condition test.

```
np.where(condition, value if true, value if false)
```

> 👉 This function is part of the *numpy* (np) library

In [36]:
# we create a condition
cond = df_cars['brand'].isin(['ford','dodge','chevrolet','chrysler'])
# we use the condition to build our new column
df_cars['is_Brand_US'] = np.where(cond,'US brand','non-US brand')
df_cars.head()

Unnamed: 0,year,brand,price,mileage,color,state,country,price_eur,brand-color,is_Brand_US
0,2008,toyota,6300,274117,black,new jersey,usa,5250.0,toyota - black,non-US brand
1,2011,ford,2899,190552,silver,tennessee,usa,2415.833333,ford - silver,US brand
2,2018,dodge,5350,39590,silver,georgia,usa,4458.333333,dodge - silver,US brand
3,2014,ford,25000,64146,blue,virginia,usa,20833.333333,ford - blue,US brand
4,2018,chevrolet,27700,6654,red,florida,usa,23083.333333,chevrolet - red,US brand
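
`np.where()` handles a single if/else test. When more than two outcomes are needed, one option (not covered in the original text, shown here only as a sketch) is numpy's `np.select()`, which takes a list of conditions, a list of values and a default. The DataFrame, thresholds and labels below are made up for illustration.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({"price": [2899, 6300, 27700]})

conditions = [df_demo["price"] < 5000, df_demo["price"] < 20000]
labels = ["low", "medium"]

# conditions are checked in order; default applies when none is True
df_demo["price_range"] = np.select(conditions, labels, default="high")
print(df_demo)
```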


### Deleting columns on DataFrames

To delete a column from a DataFrame, we could simply select all the columns we would like to keep, using any method introduced in the previous section. However, on some occasions it is easier to drop columns explicitly.

```
df.drop(columns=[column_list])
```

In [38]:
# deleting several columns at once
df_cars.drop(columns=['year','price_eur','brand-color'])

Unnamed: 0,brand,price,mileage,color,state,country,is_Brand_US
0,toyota,6300,274117,black,new jersey,usa,non-US brand
1,ford,2899,190552,silver,tennessee,usa,US brand
2,dodge,5350,39590,silver,georgia,usa,US brand
3,ford,25000,64146,blue,virginia,usa,US brand
4,chevrolet,27700,6654,red,florida,usa,US brand


In [39]:
# please note that we did NOT overwrite the DataFrame, therefore it still keeps the columns (drop() only returned a modified copy for display)
df_cars

Unnamed: 0,year,brand,price,mileage,color,state,country,price_eur,brand-color,is_Brand_US
0,2008,toyota,6300,274117,black,new jersey,usa,5250.0,toyota - black,non-US brand
1,2011,ford,2899,190552,silver,tennessee,usa,2415.833333,ford - silver,US brand
2,2018,dodge,5350,39590,silver,georgia,usa,4458.333333,dodge - silver,US brand
3,2014,ford,25000,64146,blue,virginia,usa,20833.333333,ford - blue,US brand
4,2018,chevrolet,27700,6654,red,florida,usa,23083.333333,chevrolet - red,US brand
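
To make the deletion permanent, we store the result of `.drop()` in an object. A minimal sketch on a made-up one-row DataFrame (`df_demo` and `df_reduced` are hypothetical names):

```python
import pandas as pd

df_demo = pd.DataFrame({"year": [2008], "brand": ["toyota"], "price": [6300]})

df_reduced = df_demo.drop(columns=["year"])  # store the result under a new name
print(df_reduced.columns.tolist())
print(df_demo.columns.tolist())              # the original is unchanged
```

Assigning back to the same name (`df_demo = df_demo.drop(...)`) overwrites the original instead.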


### Renaming columns in DataFrame

We use the *.rename()* method to rename columns in a DataFrame.

```
df.rename(columns={name:new_name})
```
> 👉 This method uses dictionaries to define which column name should be replaced by which new name


In [40]:
# let us change two column names
# first, we create a dictionary that maps each column to its new name
# then, we pass that dictionary to the .rename() method

new_names_dic = {"brand":"car_brand","price":"car_price"}
df_cars.rename(columns = new_names_dic)

Unnamed: 0,year,car_brand,car_price,mileage,color,state,country,price_eur,brand-color,is_Brand_US
0,2008,toyota,6300,274117,black,new jersey,usa,5250.0,toyota - black,non-US brand
1,2011,ford,2899,190552,silver,tennessee,usa,2415.833333,ford - silver,US brand
2,2018,dodge,5350,39590,silver,georgia,usa,4458.333333,dodge - silver,US brand
3,2014,ford,25000,64146,blue,virginia,usa,20833.333333,ford - blue,US brand
4,2018,chevrolet,27700,6654,red,florida,usa,23083.333333,chevrolet - red,US brand
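
A caveat when renaming: by default, `.rename()` silently skips dictionary keys that do not match any column, so a typo goes unnoticed. The sketch below, on a made-up DataFrame (`df_demo`) with a deliberately non-existent column `colour`, shows the default behaviour and the `errors="raise"` parameter that turns the mismatch into an error.

```python
import pandas as pd

df_demo = pd.DataFrame({"brand": ["toyota"], "price": [6300]})

# "colour" does not exist; by default rename() skips it without warning
renamed = df_demo.rename(columns={"brand": "car_brand", "colour": "car_colour"})
print(renamed.columns.tolist())

try:
    df_demo.rename(columns={"colour": "car_colour"}, errors="raise")
    missing_detected = False
except KeyError as error:
    missing_detected = True
    print("KeyError:", error)
```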


### Sorting a DataFrame

The *sort_values()* method allows us to sort by one or more columns of a DataFrame.

```
object.sort_values(column,ascending=True/False)
```



In [None]:
# importing df
df_jamesbond = pd.read_csv("https://data-wizards.s3.amazonaws.com/datasets/jamesbond.csv")

In [None]:
# sorting by column
df_jamesbond.sort_values("Bond Actor Salary",ascending=False)

Unnamed: 0,Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
21,Die Another Day,2002,Pierce Brosnan,Lee Tamahori,465.4,154.2,17.9
24,Skyfall,2012,Daniel Craig,Sam Mendes,943.5,170.2,14.5
20,The World Is Not Enough,1999,Pierce Brosnan,Michael Apted,439.5,158.3,13.5
19,Tomorrow Never Dies,1997,Pierce Brosnan,Roger Spottiswoode,463.2,133.9,10.0
15,A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
23,Quantum of Solace,2008,Daniel Craig,Marc Forster,514.2,181.4,8.1
17,Licence to Kill,1989,Timothy Dalton,John Glen,250.9,56.7,7.9
14,Octopussy,1983,Roger Moore,John Glen,373.8,53.9,7.8
7,Diamonds Are Forever,1971,Sean Connery,Guy Hamilton,442.5,34.7,5.8
16,The Living Daylights,1987,Timothy Dalton,John Glen,313.5,68.8,5.2


In [None]:
# sorting DataFrame by multiple columns
df_jamesbond.sort_values(["Actor","Year"],ascending=[True,False])
# note that we use a list within method parameter

Unnamed: 0,Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
25,Spectre,2015,Daniel Craig,Sam Mendes,726.7,206.3,
24,Skyfall,2012,Daniel Craig,Sam Mendes,943.5,170.2,14.5
23,Quantum of Solace,2008,Daniel Craig,Marc Forster,514.2,181.4,8.1
22,Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
4,Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,
6,On Her Majesty's Secret Service,1969,George Lazenby,Peter R. Hunt,291.5,37.3,0.6
21,Die Another Day,2002,Pierce Brosnan,Lee Tamahori,465.4,154.2,17.9
20,The World Is Not Enough,1999,Pierce Brosnan,Michael Apted,439.5,158.3,13.5
19,Tomorrow Never Dies,1997,Pierce Brosnan,Roger Spottiswoode,463.2,133.9,10.0
18,GoldenEye,1995,Pierce Brosnan,Martin Campbell,518.5,76.9,5.1


### Nesting operations in Python

Python lets us nest operations natively, that is to say, without the need for any library: each operation is applied to the result of the previous one.

Example of nested operations.

```
df.iloc[x,y].head().sort_values(x)
```
> 👉 In this example, we are performing three different operations on a single line of code


In [None]:
# example of nesting operations: (i) import df, (ii) filter rows, (iii) sort
pd.read_csv("https://data-wizards.s3.amazonaws.com/datasets/jamesbond.csv").iloc[0:5,:].sort_values("Film")

Unnamed: 0,Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
4,Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,
0,Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
1,From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
2,Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
3,Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7


### Python method chaining
Lines of code get long when we nest multiple operations, which makes the code difficult to read. In those situations, we can place each operation on its own line, using so-called *method chaining*.

```
df.iloc[x,y]\
  .head()\
  .sort_values(x)
```
* We use a backslash (\) to separate operations
* We place the backslash right after an operation, immediately followed by a new line

> ⚠️ Method chaining does not allow any character (including spaces and comments) after the backslash

In [None]:
# below is an example of nested operations using method chaining
pd.read_csv("https://data-wizards.s3.amazonaws.com/datasets/jamesbond.csv")\
  .iloc[0:5,:]\
  .sort_values("Film")

Unnamed: 0,Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
4,Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,
0,Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
1,From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
2,Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
3,Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
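
An alternative to backslashes, offered here as a sketch rather than as part of the original material, is to wrap the whole chain in parentheses. Python then allows line breaks freely inside them, and comments can follow each step. The DataFrame `df_demo` below is made up for illustration.

```python
import pandas as pd

df_demo = pd.DataFrame({"Film": ["Goldfinger", "Dr. No", "Thunderball"],
                        "Year": [1964, 1962, 1965]})

result = (
    df_demo
    .iloc[0:2, :]          # keep the first two rows
    .sort_values("Film")   # comments are allowed on each line here
)
print(result["Film"].tolist())
```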


# 5. Exercises
---

> 👉 Solutions to exercises are available [here](https://nbviewer.org/github/SomosDataWizards/Python-Intro-Course/blob/main/Chapter_2_Exercises.ipynb)


### Exercise #1

##### EX 1.A. Import DataFrame. Select columns *'name', 'homeworld' and 'species'* for the first 10 rows in DataFrame.

##### EX 1.B. Filter for characters that a) are not human and b) are from *homeworld* either Naboo, Endor or Kashyyyk.

> *Dataset https://data-wizards.s3.amazonaws.com/datasets/dataset_star_wars.csv*




### Exercise #2


##### EX 2.A. Select columns *'movie_title', 'country', 'director_name' and 'imdb_score'* for the first 10 rows in DataFrame.
##### EX 2.B. Select movies  
(i) produced outside the USA or with an IMDB score higher than 8.5, and  
(ii) directed by either James Cameron, Peter Jackson or Tim Burton.

> Dataset https://data-wizards.s3.amazonaws.com/datasets/movies.csv




### Exercise #3

##### EX 3.A. Import DataFrame. Select columns "Company", "Sector" and "Revenue".
##### EX 3.B. Rename column "Revenue" to "Company_Revenue".
##### EX 3.C. Create a new column "Profits_per_Employee" as the result of dividing column "Profits" by "Employees".


> Dataset https://data-wizards.s3.amazonaws.com/datasets/fortune1000.csv




### Exercise #4

Import DataFrame. Extract top 10 companies with highest revenue in the Technology sector.

> Dataset https://data-wizards.s3.amazonaws.com/datasets/fortune1000.csv




### Exercise #5

##### Import dataframe. Extract top 10 Sci-Fi movies with highest IMDB score; select only the following fields: *title_year*, *director_name* and *imdb_score*.

> Dataset https://data-wizards.s3.amazonaws.com/datasets/movies.csv


