## Pandas

__Pandas__ is a library that provides efficient data structures for managing large amounts of data

pip3 install pandas

We need to import the library before using it

In [1]:
import pandas as pd

### Reading files

- CSV: pandas.__read_csv__(filepath)
- Excel: pandas.__read_excel__(filepath) (reading .xlsx files needs the openpyxl package installed)
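read_csv also accepts file-like objects, which is handy for trying it out without a file on disk. A minimal sketch, assuming a small made-up CSV text (csv_text and movies_sample are hypothetical names, not part of the IMDB file):

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for a file on disk
csv_text = "Title,Year,Rating\nSplit,2016,7.3\nSing,2016,7.2\nPrometheus,2012,7.0\n"

# read_csv takes a path or any file-like object
movies_sample = pd.read_csv(io.StringIO(csv_text))
print(movies_sample)
```

The first line of the text is used as the column names, and a default integer index (0, 1, 2, ...) is created for the rows.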

In [2]:
csv_file: pd.DataFrame = pd.read_csv("IMDB-Movie-Data.csv")
print(csv_file)

     Rank                    Title                     Genre  \
0       1  Guardians of the Galaxy   Action,Adventure,Sci-Fi   
1       2               Prometheus  Adventure,Mystery,Sci-Fi   
2       3                    Split           Horror,Thriller   
3       4                     Sing   Animation,Comedy,Family   
4       5            Suicide Squad  Action,Adventure,Fantasy   
..    ...                      ...                       ...   
995   996     Secret in Their Eyes       Crime,Drama,Mystery   
996   997          Hostel: Part II                    Horror   
997   998   Step Up 2: The Streets       Drama,Music,Romance   
998   999             Search Party          Adventure,Comedy   
999  1000               Nine Lives     Comedy,Family,Fantasy   

                                           Description              Director  \
0    A group of intergalactic criminals are forced ...            James Gunn   
1    Following clues to the origin of mankind, a te...          Ridley 

In [3]:
excel_file: pd.DataFrame = pd.read_excel("IMDB-Movie-Data.xlsx")
print(excel_file)

     Rank                    Title                     Genre  \
0       1  Guardians of the Galaxy   Action,Adventure,Sci-Fi   
1       2               Prometheus  Adventure,Mystery,Sci-Fi   
2       3                    Split           Horror,Thriller   
3       4                     Sing   Animation,Comedy,Family   
4       5            Suicide Squad  Action,Adventure,Fantasy   
..    ...                      ...                       ...   
995   996     Secret in Their Eyes       Crime,Drama,Mystery   
996   997          Hostel: Part II                    Horror   
997   998   Step Up 2: The Streets       Drama,Music,Romance   
998   999             Search Party          Adventure,Comedy   
999  1000               Nine Lives     Comedy,Family,Fantasy   

                                           Description              Director  \
0    A group of intergalactic criminals are forced ...            James Gunn   
1    Following clues to the origin of mankind, a te...          Ridley 

- __JSON__: Parsed JSON content (a dict or a list of dicts) can be turned __into a dataframe__

pandas.__json_normalize__(json_object)

In [4]:
import json

# Let's load a JSON file so we can transform it into a dataframe object
books_json = None

try:
    # The with-block closes the file even if parsing fails
    with open('books.json') as books_file:
        # json.load returns the parsed content;
        # for this file it is a list of dictionaries
        books_json = json.load(books_file)
    print(books_json)
except FileNotFoundError as e:
    print(e)

[{'author': 'Chinua Achebe', 'country': 'Nigeria', 'imageLink': 'images/things-fall-apart.jpg', 'language': 'English', 'link': 'https://en.wikipedia.org/wiki/Things_Fall_Apart\n', 'pages': 209, 'title': 'Things Fall Apart', 'year': 1958}, {'author': 'Hans Christian Andersen', 'country': 'Denmark', 'imageLink': 'images/fairy-tales.jpg', 'language': 'Danish', 'link': 'https://en.wikipedia.org/wiki/Fairy_Tales_Told_for_Children._First_Collection.\n', 'pages': 784, 'title': 'Fairy tales', 'year': 1836}, {'author': 'Dante Alighieri', 'country': 'Italy', 'imageLink': 'images/the-divine-comedy.jpg', 'language': 'Italian', 'link': 'https://en.wikipedia.org/wiki/Divine_Comedy\n', 'pages': 928, 'title': 'The Divine Comedy', 'year': 1315}, {'author': 'Unknown', 'country': 'Sumer and Akkadian Empire', 'imageLink': 'images/the-epic-of-gilgamesh.jpg', 'language': 'Akkadian', 'link': 'https://en.wikipedia.org/wiki/Epic_of_Gilgamesh\n', 'pages': 160, 'title': 'The Epic Of Gilgamesh', 'year': -1700}, {

In [5]:
# Now we can easily turn the parsed JSON into a dataframe

books_df: pd.DataFrame = pd.json_normalize(books_json)
print(books_df)

                     author                    country  \
0             Chinua Achebe                    Nigeria   
1   Hans Christian Andersen                    Denmark   
2           Dante Alighieri                      Italy   
3                   Unknown  Sumer and Akkadian Empire   
4                   Unknown          Achaemenid Empire   
..                      ...                        ...   
95                    Vyasa                      India   
96             Walt Whitman              United States   
97           Virginia Woolf             United Kingdom   
98           Virginia Woolf             United Kingdom   
99     Marguerite Yourcenar             France/Belgium   

                           imageLink  language  \
0       images/things-fall-apart.jpg   English   
1             images/fairy-tales.jpg    Danish   
2       images/the-divine-comedy.jpg   Italian   
3   images/the-epic-of-gilgamesh.jpg  Akkadian   
4         images/the-book-of-job.jpg    Hebrew   
.. 
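json_normalize is most useful when the records contain nested objects, because it flattens them into dotted column names. A small sketch, assuming made-up records with a nested "publisher" field (this field is hypothetical and does not exist in books.json):

```python
import pandas as pd

# Hypothetical records; "publisher" is a nested object
records = [
    {"title": "Book A", "pages": 209, "publisher": {"name": "P1", "city": "Lagos"}},
    {"title": "Book B", "pages": 784, "publisher": {"name": "P2", "city": "Copenhagen"}},
]

# Nested keys become columns such as "publisher.name"
nested_df = pd.json_normalize(records)
print(nested_df.columns.tolist())
```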

- __Copy__ a dataframe (the copy is independent of the original):

new_var = df.copy()

In [6]:
copy_csv_file: pd.DataFrame = csv_file.copy()
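The difference between copying and plain assignment can be checked with a tiny dataframe. A sketch with hypothetical variable names (original, alias, independent):

```python
import pandas as pd

original = pd.DataFrame({"name": ["Spain", "France"], "GDP": [1.397, 2.958]})

alias = original               # same object, nothing is copied
independent = original.copy()  # separate object with its own data

# Changing the copy leaves the original untouched
independent.loc[0, "GDP"] = 0.0

print(original.loc[0, "GDP"])
print(alias is original)
```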

### Basic structures:

- Series: One-dimensional labelled array
- Dataframes: Two-dimensional labelled tables whose columns are Series

NOTE: To convert a dict to a dataframe:

pd.__DataFrame__(_dict_)

In [7]:
countries_dict: dict = {"name": ["Spain", "France", "UK"], "population": [47502512, 68042591, 67820364], 
                   "GDP": [1.397, 2.958, 3.131]}
dataframe_countries: pd.DataFrame = pd.DataFrame(countries_dict)
print(dataframe_countries)

     name  population    GDP
0   Spain    47502512  1.397
1  France    68042591  2.958
2      UK    67820364  3.131
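To see how the two structures relate, the sketch below builds a dataframe from two Series that share the same labels, then pulls one column back out. The variable names are hypothetical; the figures are the ones used above:

```python
import pandas as pd

population = pd.Series([47502512, 68042591], index=["Spain", "France"], name="population")
gdp = pd.Series([1.397, 2.958], index=["Spain", "France"], name="GDP")

# Series with a common index can be combined, one column each
countries = pd.DataFrame({"population": population, "GDP": gdp})

print(population.ndim, countries.ndim)
# Selecting a single column gives back a Series
print(type(countries["GDP"]).__name__)
```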


### Basic descriptions of the data in the file

- .info() -> Prints a summary of the dataframe: index range, column names, non-null counts, dtypes and memory usage
- .describe() -> Returns basic summary statistics (count, mean, std, min, quartiles, max) of the numeric columns of the dataframe/series
- .head(n) -> Returns the first n rows of the dataframe (if n is not set, 5 rows are returned)

In [8]:
# info
csv_file.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Rank                1000 non-null   int64  
 1   Title               1000 non-null   object 
 2   Genre               1000 non-null   object 
 3   Description         1000 non-null   object 
 4   Director            1000 non-null   object 
 5   Actors              1000 non-null   object 
 6   Year                1000 non-null   int64  
 7   Runtime (Minutes)   1000 non-null   int64  
 8   Rating              1000 non-null   float64
 9   Votes               1000 non-null   int64  
 10  Revenue (Millions)  872 non-null    float64
 11  Metascore           936 non-null    float64
dtypes: float64(3), int64(4), object(5)
memory usage: 93.9+ KB


In [9]:
excel_file.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Rank                1000 non-null   int64  
 1   Title               1000 non-null   object 
 2   Genre               1000 non-null   object 
 3   Description         1000 non-null   object 
 4   Director            1000 non-null   object 
 5   Actors              1000 non-null   object 
 6   Year                1000 non-null   int64  
 7   Runtime (Minutes)   1000 non-null   int64  
 8   Rating              1000 non-null   float64
 9   Votes               1000 non-null   int64  
 10  Revenue (Millions)  872 non-null    float64
 11  Metascore           936 non-null    float64
dtypes: float64(3), int64(4), object(5)
memory usage: 93.9+ KB


In [11]:
dataframe_countries.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   name        3 non-null      object 
 1   population  3 non-null      int64  
 2   GDP         3 non-null      float64
dtypes: float64(1), int64(1), object(1)
memory usage: 204.0+ bytes


In [12]:
# describe
csv_file.describe()

             Rank      Year  Runtime (Minutes)    Rating      Votes  Revenue (Millions)  Metascore
count      1000.0    1000.0             1000.0    1000.0     1000.0               872.0      936.0
mean        500.5  2012.783            113.172    6.7232   169808.3           82.956376  58.985043
std    288.819436  3.205962          18.810908  0.945429   188762.6           103.25354  17.194757
min           1.0    2006.0               66.0       1.9       61.0                 0.0       11.0
25%        250.75    2010.0              100.0       6.2    36309.0               13.27       47.0
50%         500.5    2014.0              111.0       6.8   110799.0              47.985       59.5
75%        750.25    2016.0              123.0       7.4   239909.8             113.715       72.0
max        1000.0    2016.0              191.0       9.0  1791916.0              936.63      100.0


In [13]:
excel_file.describe()

             Rank      Year  Runtime (Minutes)    Rating      Votes  Revenue (Millions)  Metascore
count      1000.0    1000.0             1000.0    1000.0     1000.0               872.0      936.0
mean        500.5  2012.783            113.172    6.7232   169808.3           82.956376  58.985043
std    288.819436  3.205962          18.810908  0.945429   188762.6           103.25354  17.194757
min           1.0    2006.0               66.0       1.9       61.0                 0.0       11.0
25%        250.75    2010.0              100.0       6.2    36309.0               13.27       47.0
50%         500.5    2014.0              111.0       6.8   110799.0              47.985       59.5
75%        750.25    2016.0              123.0       7.4   239909.8             113.715       72.0
max        1000.0    2016.0              191.0       9.0  1791916.0              936.63      100.0


In [14]:
dataframe_countries.describe()

       population       GDP
count         3.0       3.0
mean   61121820.0  2.495333
std    11795190.0   0.95511
min    47502510.0     1.397
25%    57661440.0    2.1775
50%    67820360.0     2.958
75%    67931480.0    3.0445
max    68042590.0     3.131


In [15]:
# head
csv_file.head()

,Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
0,1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0
1,2,Prometheus,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0
2,3,Split,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,138.12,62.0
3,4,Sing,"Animation,Comedy,Family","In a city of humanoid animals, a hustling thea...",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,60545,270.32,59.0
4,5,Suicide Squad,"Action,Adventure,Fantasy",A secret government agency recruits some of th...,David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,393727,325.02,40.0


In [16]:
excel_file.head(7)

,Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
0,1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0
1,2,Prometheus,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0
2,3,Split,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,138.12,62.0
3,4,Sing,"Animation,Comedy,Family","In a city of humanoid animals, a hustling thea...",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,60545,270.32,59.0
4,5,Suicide Squad,"Action,Adventure,Fantasy",A secret government agency recruits some of th...,David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,393727,325.02,40.0
5,6,The Great Wall,"Action,Adventure,Fantasy",European mercenaries searching for black powde...,Yimou Zhang,"Matt Damon, Tian Jing, Willem Dafoe, Andy Lau",2016,103,6.1,56036,45.13,42.0
6,7,La La Land,"Comedy,Drama,Music",A jazz pianist falls for an aspiring actress i...,Damien Chazelle,"Ryan Gosling, Emma Stone, Rosemarie DeWitt, J....",2016,128,8.3,258682,151.06,93.0


In [17]:
dataframe_countries.head(2)

     name  population    GDP
0   Spain    47502512  1.397
1  France    68042591  2.958
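The count row of describe() only counts non-missing values, which is why Revenue (Millions) and Metascore show 872 and 936 instead of 1000 above. A minimal sketch of the same effect, assuming a made-up four-row dataframe (scores is a hypothetical name):

```python
import pandas as pd

scores = pd.DataFrame({
    "Rating": [8.1, 7.0, 7.3, 7.2],
    "Metascore": [76.0, None, 62.0, 59.0],  # one missing value
})

summary = scores.describe()
print(summary.loc["count"])

# isna().sum() gives the number of missing values per column
print(scores.isna().sum())
```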


- To get the __column labels__ of a dataframe:

dataframe_var.__columns__

In [18]:
csv_file.columns

Index(['Rank', 'Title', 'Genre', 'Description', 'Director', 'Actors', 'Year',
       'Runtime (Minutes)', 'Rating', 'Votes', 'Revenue (Millions)',
       'Metascore'],
      dtype='object')

In [19]:
csv_file.columns.values

array(['Rank', 'Title', 'Genre', 'Description', 'Director', 'Actors',
       'Year', 'Runtime (Minutes)', 'Rating', 'Votes',
       'Revenue (Millions)', 'Metascore'], dtype=object)

- __dimensions__ of the dataframe: .__shape__

Returns a tuple with one entry per dimension of the object (two for a dataframe, one for a series)

The first entry is the number of rows

The second entry is the number of columns

In [20]:
# Our dataframe has 1000 rows and 12 columns
csv_file.shape

(1000, 12)
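Since shape is a plain tuple, it can be unpacked directly. A short sketch with a hypothetical three-row dataframe, also showing the one-element tuple of a series:

```python
import pandas as pd

frame = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# A dataframe has two dimensions: (rows, columns)
n_rows, n_cols = frame.shape

# A series has a single dimension: (rows,)
column_shape = frame["a"].shape

print(n_rows, n_cols, column_shape)
```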

- Adding __labels to rows__: to do so, set a column as the __index__

The values of that column then become the labels of their corresponding rows

- .__set_index__('column_name') -> Returns a new dataframe that uses the column as index; the original dataframe is not modified, so the result must be assigned to a variable

In [21]:
file_indexed: pd.DataFrame = csv_file.set_index('Revenue (Millions)')
print(file_indexed)

                    Rank                    Title                     Genre  \
Revenue (Millions)                                                            
333.13                 1  Guardians of the Galaxy   Action,Adventure,Sci-Fi   
126.46                 2               Prometheus  Adventure,Mystery,Sci-Fi   
138.12                 3                    Split           Horror,Thriller   
270.32                 4                     Sing   Animation,Comedy,Family   
325.02                 5            Suicide Squad  Action,Adventure,Fantasy   
...                  ...                      ...                       ...   
NaN                  996     Secret in Their Eyes       Crime,Drama,Mystery   
17.54                997          Hostel: Part II                    Horror   
58.01                998   Step Up 2: The Streets       Drama,Music,Romance   
NaN                  999             Search Party          Adventure,Comedy   
19.64               1000               Nine Lives   
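That set_index leaves the original untouched can be verified on a small example. A sketch with hypothetical names (movies, by_title, restored); reset_index is used to turn the index back into a regular column:

```python
import pandas as pd

movies = pd.DataFrame({"Title": ["Split", "Sing"], "Year": [2016, 2016]})

# set_index returns a new dataframe
by_title = movies.set_index("Title")

print(movies.index.tolist())    # the original keeps its default index
print(by_title.index.tolist())  # the new dataframe is labelled by title

# reset_index moves the index back into a column
restored = by_title.reset_index()
```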

- You can also assign a __list as index of the dataframe__ (_labels of rows in dataframe_). The list needs one label per row, and the assignment modifies the dataframe in place

dataframe.index = list

In [22]:
# One single-character label per row; the first labels are non-printable control characters
indexes: list[str] = [chr(i) for i in range(0, 1000)]

copy_csv_file.index = indexes
print(copy_csv_file)

    Rank                    Title                     Genre  \
       1  Guardians of the Galaxy   Action,Adventure,Sci-Fi   
      2               Prometheus  Adventure,Mystery,Sci-Fi   
      3                    Split           Horror,Thriller   
      4                     Sing   Animation,Comedy,Family   
      5            Suicide Squad  Action,Adventure,Fantasy   
..   ...                      ...                       ...   
ϣ    996     Secret in Their Eyes       Crime,Drama,Mystery   
Ϥ    997          Hostel: Part II                    Horror   
ϥ    998   Step Up 2: The Streets       Drama,Music,Romance   
Ϧ    999             Search Party          Adventure,Comedy   
ϧ   1000               Nine Lives     Comedy,Family,Fantasy   

                                          Description              Director  \
    A group of intergalactic criminals are forced ...            James Gunn   
   Following clues to the origin of mankind, a te...          Ridley Scott   
   Th
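The list assigned to .index must have exactly one label per row. A minimal sketch of what happens otherwise, assuming a hypothetical three-row dataframe:

```python
import pandas as pd

frame = pd.DataFrame({"Title": ["Split", "Sing", "Prometheus"]})

# Three rows, three labels: accepted, and frame itself is changed
frame.index = ["a", "b", "c"]

# Three rows, two labels: pandas raises a ValueError
try:
    frame.index = ["x", "y"]
    length_error = None
except ValueError as error:
    length_error = error
    print(error)
```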

### Subset of columns

To select only a subset of columns, pass a list of column names:

- dataframe[[__'col1', 'col2'__]]

In [23]:
csv_title_runtime_revenue: pd.DataFrame = csv_file[["Title", "Runtime (Minutes)", "Revenue (Millions)"]]
print(csv_title_runtime_revenue)

                       Title  Runtime (Minutes)  Revenue (Millions)
0    Guardians of the Galaxy                121              333.13
1                 Prometheus                124              126.46
2                      Split                117              138.12
3                       Sing                108              270.32
4              Suicide Squad                123              325.02
..                       ...                ...                 ...
995     Secret in Their Eyes                111                 NaN
996          Hostel: Part II                 94               17.54
997   Step Up 2: The Streets                 98               58.01
998             Search Party                 93                 NaN
999               Nine Lives                 87               19.64

[1000 rows x 3 columns]


In [24]:
excel_title_runtime_revenue: pd.DataFrame = excel_file[["Title", "Runtime (Minutes)", "Revenue (Millions)"]]
print(excel_title_runtime_revenue)

                       Title  Runtime (Minutes)  Revenue (Millions)
0    Guardians of the Galaxy                121              333.13
1                 Prometheus                124              126.46
2                      Split                117              138.12
3                       Sing                108              270.32
4              Suicide Squad                123              325.02
..                       ...                ...                 ...
995     Secret in Their Eyes                111                 NaN
996          Hostel: Part II                 94               17.54
997   Step Up 2: The Streets                 98               58.01
998             Search Party                 93                 NaN
999               Nine Lives                 87               19.64

[1000 rows x 3 columns]


In [25]:
country_population: pd.DataFrame = dataframe_countries[["name", "population"]]
print(country_population)

     name  population
0   Spain    47502512
1  France    68042591
2      UK    67820364


In [26]:
# A series can be subset the same way, with a list of labels
spain_dict: dict = {"name": "Spain", "population": 47502512, 
                   "GDP": 1.397}
spain_series = pd.Series(spain_dict)
print(spain_series)

spain_name_population_series: pd.Series = spain_series[['name', 'population']]
print(spain_name_population_series)

name             Spain
population    47502512
GDP              1.397
dtype: object
name             Spain
population    47502512
dtype: object
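Note the double brackets: a list of names returns a dataframe, while a single name in single brackets returns a series. A short sketch with hypothetical variable names:

```python
import pandas as pd

countries = pd.DataFrame({
    "name": ["Spain", "France", "UK"],
    "population": [47502512, 68042591, 67820364],
})

as_series = countries["name"]    # single brackets: a Series
as_frame = countries[["name"]]   # list of names: a DataFrame with one column

print(type(as_series).__name__, type(as_frame).__name__)
```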


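- .ix[i, j]: Selected a row and column by label or position -> It was deprecated in pandas 0.20 and removed in pandas 1.0; use instead
- .__iloc__[i, j]: Selects row i and column j by integer position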

In [27]:
try:
    csv_file.ix[3,3]
except AttributeError as e:
    print("ix has been deprecated")
    print(e)

ix has been deprecated
'DataFrame' object has no attribute 'ix'


In [28]:
csv_file.iloc[3,3]

"In a city of humanoid animals, a hustling theater impresario's attempt to save his theater with a singing competition becomes grander than he anticipates even as its finalists' find that their lives will never be the same."

- .__loc__[row_label, 'column_name']: Similar to iloc, but uses the labels of the rows/columns instead of their positions (with the default index, the row labels are the integers 0, 1, 2, ...)

In [29]:
csv_file.loc[0, 'Year']

2014

In [30]:
# We can also use a custom index, whose values are now the labels of the rows
print(file_indexed)

                    Rank                    Title                     Genre  \
Revenue (Millions)                                                            
333.13                 1  Guardians of the Galaxy   Action,Adventure,Sci-Fi   
126.46                 2               Prometheus  Adventure,Mystery,Sci-Fi   
138.12                 3                    Split           Horror,Thriller   
270.32                 4                     Sing   Animation,Comedy,Family   
325.02                 5            Suicide Squad  Action,Adventure,Fantasy   
...                  ...                      ...                       ...   
NaN                  996     Secret in Their Eyes       Crime,Drama,Mystery   
17.54                997          Hostel: Part II                    Horror   
58.01                998   Step Up 2: The Streets       Drama,Music,Romance   
NaN                  999             Search Party          Adventure,Comedy   
19.64               1000               Nine Lives   

In [31]:
# Year of the movie whose revenue was 17.54 million
file_indexed.loc[17.54, 'Year']

2007
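The difference between iloc and loc is easiest to see when the row labels are not 0, 1, 2, ... A sketch assuming a hypothetical dataframe labelled 10, 20, 30:

```python
import pandas as pd

frame = pd.DataFrame({"Title": ["Split", "Sing", "Prometheus"]}, index=[10, 20, 30])

by_position = frame.iloc[0, 0]     # first row, first column
by_label = frame.loc[10, "Title"]  # row labelled 10, column "Title"

# There is no row labelled 0, so loc raises a KeyError
try:
    frame.loc[0, "Title"]
    missing_label = False
except KeyError:
    missing_label = True

print(by_position, by_label, missing_label)
```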

### Working with data

- __unique__(): Returns the distinct values in a column, in order of first appearance

In [32]:
# In the dataframe csv_file, which contains information about movies in IMDB, let's get the years in which the movies were released
csv_file['Year'].unique()

array([2014, 2012, 2016, 2015, 2007, 2011, 2008, 2006, 2009, 2010, 2013],
      dtype=int64)
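Two related methods are often used together with unique(): nunique() counts the distinct values and value_counts() counts how often each one appears. A small sketch on a made-up series of years:

```python
import pandas as pd

years = pd.Series([2014, 2012, 2016, 2016, 2014, 2016], name="Year")

distinct = years.unique()      # distinct values, in order of first appearance
n_distinct = years.nunique()   # how many distinct values there are
counts = years.value_counts()  # occurrences of each value

print(distinct, n_distinct)
print(counts)
```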

- __Filtering__: dataframe_var['col'] >=/>/... var/value -> Returns a boolean series with True/False for each row, depending on whether its value in the column meets the criterion

In [33]:
# Following the same example, let's get the movies that were made in 2014 and later
csv_file['Year'] >= 2014

0       True
1      False
2       True
3       True
4       True
       ...  
995     True
996    False
997    False
998     True
999     True
Name: Year, Length: 1000, dtype: bool

As we can see, this is just a series of True/False values. To actually filter the dataframe, we need to use it to index the dataframe:

__dataframe_var__[dataframe_var['col'] >=/>/... var/value]

In [34]:
csv_file[csv_file['Year'] >= 2014]

,Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
0,1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0
2,3,Split,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,138.12,62.0
3,4,Sing,"Animation,Comedy,Family","In a city of humanoid animals, a hustling thea...",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,60545,270.32,59.0
4,5,Suicide Squad,"Action,Adventure,Fantasy",A secret government agency recruits some of th...,David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,393727,325.02,40.0
5,6,The Great Wall,"Action,Adventure,Fantasy",European mercenaries searching for black powde...,Yimou Zhang,"Matt Damon, Tian Jing, Willem Dafoe, Andy Lau",2016,103,6.1,56036,45.13,42.0
...,...,...,...,...,...,...,...,...,...,...,...,...
987,988,Endless Love,"Drama,Romance",The story of a privileged girl and a charismat...,Shana Feste,"Gabriella Wilde, Alex Pettyfer, Bruce Greenwoo...",2014,104,6.3,33688,23.39,30.0
989,990,Selma,"Biography,Drama,History",A chronicle of Martin Luther King's campaign t...,Ava DuVernay,"David Oyelowo, Carmen Ejogo, Tim Roth, Lorrain...",2014,128,7.5,67637,52.07,
995,996,Secret in Their Eyes,"Crime,Drama,Mystery","A tight-knit team of rising investigators, alo...",Billy Ray,"Chiwetel Ejiofor, Nicole Kidman, Julia Roberts...",2015,111,6.2,27585,,45.0
998,999,Search Party,"Adventure,Comedy",A pair of friends embark on a mission to reuni...,Scot Armstrong,"Adam Pally, T.J. Miller, Thomas Middleditch,Sh...",2014,93,5.6,4881,,22.0


The result of this filter can be assigned to a new variable

In [35]:
movies_2014_and_after = csv_file[csv_file['Year'] >= 2014]
movies_2014_and_after

,Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
0,1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0
2,3,Split,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,138.12,62.0
3,4,Sing,"Animation,Comedy,Family","In a city of humanoid animals, a hustling thea...",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,60545,270.32,59.0
4,5,Suicide Squad,"Action,Adventure,Fantasy",A secret government agency recruits some of th...,David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,393727,325.02,40.0
5,6,The Great Wall,"Action,Adventure,Fantasy",European mercenaries searching for black powde...,Yimou Zhang,"Matt Damon, Tian Jing, Willem Dafoe, Andy Lau",2016,103,6.1,56036,45.13,42.0
...,...,...,...,...,...,...,...,...,...,...,...,...
987,988,Endless Love,"Drama,Romance",The story of a privileged girl and a charismat...,Shana Feste,"Gabriella Wilde, Alex Pettyfer, Bruce Greenwoo...",2014,104,6.3,33688,23.39,30.0
989,990,Selma,"Biography,Drama,History",A chronicle of Martin Luther King's campaign t...,Ava DuVernay,"David Oyelowo, Carmen Ejogo, Tim Roth, Lorrain...",2014,128,7.5,67637,52.07,
995,996,Secret in Their Eyes,"Crime,Drama,Mystery","A tight-knit team of rising investigators, alo...",Billy Ray,"Chiwetel Ejiofor, Nicole Kidman, Julia Roberts...",2015,111,6.2,27585,,45.0
998,999,Search Party,"Adventure,Comedy",A pair of friends embark on a mission to reuni...,Scot Armstrong,"Adam Pally, T.J. Miller, Thomas Middleditch,Sh...",2014,93,5.6,4881,,22.0
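Several conditions can be combined with & (and) and | (or); each condition needs its own parentheses. A sketch on a four-row sample taken from the rows shown above (the variable names are hypothetical):

```python
import pandas as pd

movies = pd.DataFrame({
    "Title": ["Guardians of the Galaxy", "Prometheus", "Split", "Suicide Squad"],
    "Year": [2014, 2012, 2016, 2016],
    "Rating": [8.1, 7.0, 7.3, 6.2],
})

# Both conditions must hold
recent_and_good = movies[(movies["Year"] >= 2014) & (movies["Rating"] >= 7.0)]

# At least one condition must hold
old_or_low = movies[(movies["Year"] < 2014) | (movies["Rating"] < 7.0)]

print(recent_and_good["Title"].tolist())
print(old_or_low["Title"].tolist())
```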


- __loc__ can also be used with filters:

In [36]:
# Movies released in 2015 or before
csv_file.loc[csv_file['Year'] <= 2015]

,Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
0,1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0
1,2,Prometheus,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0
26,27,Bahubali: The Beginning,"Action,Adventure,Drama","In ancient India, an adventurous and daring ma...",S.S. Rajamouli,"Prabhas, Rana Daggubati, Anushka Shetty,Tamann...",2015,159,8.3,76193,6.50,
36,37,Interstellar,"Adventure,Drama,Sci-Fi",A team of explorers travel through a wormhole ...,Christopher Nolan,"Matthew McConaughey, Anne Hathaway, Jessica Ch...",2014,169,8.6,1047747,187.99,74.0
39,40,5- 25- 77,"Comedy,Drama","Alienated, hopeful-filmmaker Pat Johnson's epi...",Patrick Read Johnson,"John Francis Daley, Austin Pendleton, Colleen ...",2007,113,7.1,241,,
...,...,...,...,...,...,...,...,...,...,...,...,...
994,995,Project X,Comedy,3 high school seniors throw a birthday party t...,Nima Nourizadeh,"Thomas Mann, Oliver Cooper, Jonathan Daniel Br...",2012,88,6.7,164088,54.72,48.0
995,996,Secret in Their Eyes,"Crime,Drama,Mystery","A tight-knit team of rising investigators, alo...",Billy Ray,"Chiwetel Ejiofor, Nicole Kidman, Julia Roberts...",2015,111,6.2,27585,,45.0
996,997,Hostel: Part II,Horror,Three American college students studying abroa...,Eli Roth,"Lauren German, Heather Matarazzo, Bijou Philli...",2007,94,5.5,73152,17.54,46.0
997,998,Step Up 2: The Streets,"Drama,Music,Romance",Romantic sparks occur between two dance studen...,Jon M. Chu,"Robert Hoffman, Briana Evigan, Cassie Ventura,...",2008,98,6.2,70699,58.01,50.0
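One advantage of loc here is that it can filter rows and select columns in a single step. A minimal sketch, assuming a small sample of the rows above (movies and older are hypothetical names):

```python
import pandas as pd

movies = pd.DataFrame({
    "Title": ["Guardians of the Galaxy", "Prometheus", "Split"],
    "Year": [2014, 2012, 2016],
    "Rating": [8.1, 7.0, 7.3],
})

# Rows: boolean filter. Columns: list of names
older = movies.loc[movies["Year"] <= 2015, ["Title", "Year"]]
print(older)
```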


- Get only the rows whose column contains a given text:

df[df['column_name'].__str.contains__(string)]

In [37]:
# Let's get all the movies starring Julia Roberts

print(csv_file[csv_file['Actors'].str.contains('Julia Roberts')])

     Rank                 Title                 Genre  \
52     53          Mother's Day          Comedy,Drama   
569   570         Money Monster  Crime,Drama,Thriller   
995   996  Secret in Their Eyes   Crime,Drama,Mystery   

                                           Description        Director  \
52   Three generations come together in the week le...  Garry Marshall   
569  Financial TV host Lee Gates and his producer P...    Jodie Foster   
995  A tight-knit team of rising investigators, alo...       Billy Ray   

                                                Actors  Year  \
52   Jennifer Aniston, Kate Hudson, Julia Roberts, ...  2016   
569  George Clooney, Julia Roberts, Jack O'Connell,...  2016   
995  Chiwetel Ejiofor, Nicole Kidman, Julia Roberts...  2015   

     Runtime (Minutes)  Rating  Votes  Revenue (Millions)  Metascore  
52                 118     5.6  20221               32.46       18.0  
569                 98     6.5  68654               41.01       55.0  
995 
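str.contains has a few parameters worth knowing: case=False ignores upper/lower case, na=False treats missing values as not matching, and regex=False searches for the literal text instead of a regular expression. A sketch on a made-up series that includes a missing value:

```python
import pandas as pd

actors = pd.Series([
    "George Clooney, Julia Roberts",
    None,                           # missing value
    "julia roberts, Tim Roth",      # different capitalisation
    "Will Smith",
])

mask = actors.str.contains("Julia Roberts", case=False, na=False, regex=False)
print(mask.tolist())
```

Without na=False, the missing entry would produce a missing value in the mask, which cannot be used directly as a filter.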

- Get only the rows whose column value is in a specific set of values:

df[df['column_name'].__isin__([...])]

In [38]:
print(csv_file[csv_file['Director'].isin(['Jodie Foster','Garry Marshall'])])

     Rank          Title                 Genre  \
52     53   Mother's Day          Comedy,Drama   
569   570  Money Monster  Crime,Drama,Thriller   

                                           Description        Director  \
52   Three generations come together in the week le...  Garry Marshall   
569  Financial TV host Lee Gates and his producer P...    Jodie Foster   

                                                Actors  Year  \
52   Jennifer Aniston, Kate Hudson, Julia Roberts, ...  2016   
569  George Clooney, Julia Roberts, Jack O'Connell,...  2016   

     Runtime (Minutes)  Rating  Votes  Revenue (Millions)  Metascore  
52                 118     5.6  20221               32.46       18.0  
569                 98     6.5  68654               41.01       55.0  
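The opposite selection (rows whose value is not in the set) is obtained by negating the boolean series with ~. A short sketch with a hypothetical three-row dataframe:

```python
import pandas as pd

directors = pd.DataFrame({
    "Title": ["Mother's Day", "Money Monster", "Split"],
    "Director": ["Garry Marshall", "Jodie Foster", "M. Night Shyamalan"],
})
wanted = ["Jodie Foster", "Garry Marshall"]

selected = directors[directors["Director"].isin(wanted)]
others = directors[~directors["Director"].isin(wanted)]  # ~ inverts True/False

print(selected["Title"].tolist())
print(others["Title"].tolist())
```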


#### Slicing

- __Slicing rows__: df[i:j] or df.iloc[i:j] -> Rows from position i up to, but not including, position j

In [39]:
# Last 10 rows of this dataframe
csv_file[990:1000]

,Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
990,991,Underworld: Rise of the Lycans,"Action,Adventure,Fantasy",An origins story centered on the centuries-old...,Patrick Tatopoulos,"Rhona Mitra, Michael Sheen, Bill Nighy, Steven...",2009,92,6.6,129708,45.8,44.0
991,992,Taare Zameen Par,"Drama,Family,Music",An eight-year-old boy is thought to be a lazy ...,Aamir Khan,"Darsheel Safary, Aamir Khan, Tanay Chheda, Sac...",2007,165,8.5,102697,1.2,42.0
992,993,Take Me Home Tonight,"Comedy,Drama,Romance","Four years after graduation, an awkward high s...",Michael Dowse,"Topher Grace, Anna Faris, Dan Fogler, Teresa P...",2011,97,6.3,45419,6.92,
993,994,Resident Evil: Afterlife,"Action,Adventure,Horror",While still out to destroy the evil Umbrella C...,Paul W.S. Anderson,"Milla Jovovich, Ali Larter, Wentworth Miller,K...",2010,97,5.9,140900,60.13,37.0
994,995,Project X,Comedy,3 high school seniors throw a birthday party t...,Nima Nourizadeh,"Thomas Mann, Oliver Cooper, Jonathan Daniel Br...",2012,88,6.7,164088,54.72,48.0
995,996,Secret in Their Eyes,"Crime,Drama,Mystery","A tight-knit team of rising investigators, alo...",Billy Ray,"Chiwetel Ejiofor, Nicole Kidman, Julia Roberts...",2015,111,6.2,27585,,45.0
996,997,Hostel: Part II,Horror,Three American college students studying abroa...,Eli Roth,"Lauren German, Heather Matarazzo, Bijou Philli...",2007,94,5.5,73152,17.54,46.0
997,998,Step Up 2: The Streets,"Drama,Music,Romance",Romantic sparks occur between two dance studen...,Jon M. Chu,"Robert Hoffman, Briana Evigan, Cassie Ventura,...",2008,98,6.2,70699,58.01,50.0
998,999,Search Party,"Adventure,Comedy",A pair of friends embark on a mission to reuni...,Scot Armstrong,"Adam Pally, T.J. Miller, Thomas Middleditch,Sh...",2014,93,5.6,4881,,22.0
999,1000,Nine Lives,"Comedy,Family,Fantasy",A stuffy businessman finds himself trapped ins...,Barry Sonnenfeld,"Kevin Spacey, Jennifer Garner, Robbie Amell,Ch...",2016,87,5.3,12435,19.64,11.0


- __Slicing rows and columns__: df.iloc[i:j, m:n]

In [40]:
# First 4 columns of the last 10 rows
csv_file.iloc[990:1000, 0:4]

,Rank,Title,Genre,Description
990,991,Underworld: Rise of the Lycans,"Action,Adventure,Fantasy",An origins story centered on the centuries-old...
991,992,Taare Zameen Par,"Drama,Family,Music",An eight-year-old boy is thought to be a lazy ...
992,993,Take Me Home Tonight,"Comedy,Drama,Romance","Four years after graduation, an awkward high s..."
993,994,Resident Evil: Afterlife,"Action,Adventure,Horror",While still out to destroy the evil Umbrella C...
994,995,Project X,Comedy,3 high school seniors throw a birthday party t...
995,996,Secret in Their Eyes,"Crime,Drama,Mystery","A tight-knit team of rising investigators, alo..."
996,997,Hostel: Part II,Horror,Three American college students studying abroa...
997,998,Step Up 2: The Streets,"Drama,Music,Romance",Romantic sparks occur between two dance studen...
998,999,Search Party,"Adventure,Comedy",A pair of friends embark on a mission to reuni...
999,1000,Nine Lives,"Comedy,Family,Fantasy",A stuffy businessman finds himself trapped ins...


- __Slicing using labels__

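Rows: df.loc['start_row_label': 'end_row_label'] -> Includes the last row. The slice follows the order of the rows in the index, not the numeric order of the labels, which is why a slice from 333.13 to 325.02 works here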

In [41]:
file_indexed.loc[333.13: 325.02]

Revenue (Millions),Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Metascore
333.13,1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,76.0
126.46,2,Prometheus,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,65.0
138.12,3,Split,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,62.0
270.32,4,Sing,"Animation,Comedy,Family","In a city of humanoid animals, a hustling thea...",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,60545,59.0
325.02,5,Suicide Squad,"Action,Adventure,Fantasy",A secret government agency recruits some of th...,David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,393727,40.0


Rows and columns: df.loc['start_label_row': 'end_label_row', 'start_label_column': 'end_label_column'] -> Includes both the last row and the last column

In [42]:
file_indexed.loc[333.13: 325.02, "Title": "Director"]

Revenue (Millions),Title,Genre,Description,Director
333.13,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn
126.46,Prometheus,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott
138.12,Split,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan
270.32,Sing,"Animation,Comedy,Family","In a city of humanoid animals, a hustling thea...",Christophe Lourdelet
325.02,Suicide Squad,"Action,Adventure,Fantasy",A secret government agency recruits some of th...,David Ayer
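
The difference between position-based and label-based slicing is easier to see on a small dataframe. The following is a minimal sketch that assumes a made-up dataframe (the name `movies`, its values and its index labels are hypothetical, not taken from the IMDB file):

```python
import pandas as pd

# Hypothetical toy data with string labels as index
movies = pd.DataFrame(
    {
        "Title": ["A", "B", "C", "D"],
        "Genre": ["Drama", "Comedy", "Horror", "Action"],
        "Year": [2010, 2011, 2012, 2013],
    },
    index=["w", "x", "y", "z"],
)

# Position-based: the end positions are excluded,
# so this returns rows 0-1 and columns 0-1
by_position = movies.iloc[0:2, 0:2]

# Label-based: the end labels "y" and "Genre" are included,
# so this returns rows w, x, y and columns Title, Genre
by_label = movies.loc["w":"y", "Title":"Genre"]

print(by_position.shape)  # (2, 2)
print(by_label.shape)     # (3, 2)
```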


- .__transform__(func=): Applies a function to a dataframe or to one of its columns and returns a result with the same length as the input

In [43]:
# Let's get the revenue in dollars
def from_millions_to_dollars(quantity: float):
    return quantity * 10**6

csv_file['Revenue (Millions)'].transform(func=from_millions_to_dollars)

0      333130000.0
1      126460000.0
2      138120000.0
3      270320000.0
4      325020000.0
          ...     
995            NaN
996     17540000.0
997     58010000.0
998            NaN
999     19640000.0
Name: Revenue (Millions), Length: 1000, dtype: float64
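
For a simple arithmetic conversion like this one, multiplying the column directly should give the same result as transform. A minimal sketch, assuming a small made-up Series (the values and the variable names are hypothetical):

```python
import pandas as pd

# Hypothetical revenues in millions, with one missing value
revenue_millions = pd.Series([333.13, None, 17.54], name="Revenue (Millions)")


def from_millions_to_dollars(quantity: float) -> float:
    return quantity * 10**6


# Same approach as above: apply the function through transform
via_transform = revenue_millions.transform(func=from_millions_to_dollars)

# Alternative: plain arithmetic on the whole column
via_arithmetic = revenue_millions * 10**6

# Both keep the original length, and missing values stay missing
print(len(via_transform), len(via_arithmetic))
print(via_transform.isna().tolist())
```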

#### Sort rows by column value:

dataframe.__sort_values__(by=columns, ascending=True/False)

In [44]:
# For instance, let's sort the movies by revenue

print(csv_file.sort_values(by=['Revenue (Millions)'], ascending=False))

     Rank                                       Title  \
50     51  Star Wars: Episode VII - The Force Awakens   
87     88                                      Avatar   
85     86                              Jurassic World   
76     77                                The Avengers   
54     55                             The Dark Knight   
..    ...                                         ...   
977   978                               Amateur Night   
978   979              It's Only the End of the World   
988   989                                     Martyrs   
995   996                        Secret in Their Eyes   
998   999                                Search Party   

                        Genre  \
50   Action,Adventure,Fantasy   
87   Action,Adventure,Fantasy   
85    Action,Adventure,Sci-Fi   
76              Action,Sci-Fi   
54         Action,Crime,Drama   
..                        ...   
977                    Comedy   
978                     Drama   
988               

- We can sort by several columns

In [45]:
# For instance, let's sort the movies by revenue and then by metascore

print(csv_file.sort_values(by=['Revenue (Millions)', 'Metascore'], ascending=False))

     Rank                                       Title  \
50     51  Star Wars: Episode VII - The Force Awakens   
87     88                                      Avatar   
85     86                              Jurassic World   
76     77                                The Avengers   
54     55                             The Dark Knight   
..    ...                                         ...   
617   618                         Free State of Jones   
628   629                             The Whole Truth   
778   779                                 Chalk It Up   
820   821                             Suite Française   
965   966                               Inland Empire   

                        Genre  \
50   Action,Adventure,Fantasy   
87   Action,Adventure,Fantasy   
85    Action,Adventure,Sci-Fi   
76              Action,Sci-Fi   
54         Action,Crime,Drama   
..                        ...   
617    Action,Biography,Drama   
628       Crime,Drama,Mystery   
778               

In [46]:
# Let's get the title of the movie with the highest revenue

csv_file.sort_values(by=['Revenue (Millions)', 'Metascore'], ascending=False).iloc[0, 1]

'Star Wars: Episode VII - The Force Awakens'
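
The ascending parameter also accepts one value per column, and rows with a missing value in the sort column are placed last by default. A minimal sketch, assuming a made-up dataframe (the name `movies` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical toy data: two movies tie on revenue, one has no revenue
movies = pd.DataFrame(
    {
        "Title": ["A", "B", "C", "D"],
        "Revenue (Millions)": [10.0, 50.0, None, 50.0],
        "Metascore": [70.0, 40.0, 90.0, 80.0],
    }
)

# Revenue from highest to lowest; ties are ordered by metascore from lowest to highest
ranked = movies.sort_values(
    by=["Revenue (Millions)", "Metascore"],
    ascending=[False, True],
)
top_title = ranked.iloc[0]["Title"]

# Alternative that avoids sorting: idxmax returns the index label
# of the first row holding the maximum value (missing values are skipped)
top_by_idxmax = movies.loc[movies["Revenue (Millions)"].idxmax(), "Title"]

print(list(ranked["Title"]))
print(top_title, top_by_idxmax)
```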

#### Group by:

A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups

Basically, you choose a column, the rows that share the same value in that column form a group, and a function is then applied to each group

For instance, let's get the average revenue per movie for each director

In [47]:
csv_file[['Director', 'Revenue (Millions)']].groupby(['Director']).mean()

Director,Revenue (Millions)
Aamir Khan,1.200
Abdellatif Kechiche,2.200
Adam Leon,
Adam McKay,109.535
Adam Shankman,78.665
...,...
Xavier Dolan,3.490
Yimou Zhang,45.130
Yorgos Lanthimos,4.405
Zack Snyder,195.148
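
Several functions can be applied to each group at once with agg. The sketch below assumes a made-up dataframe (the name `movies`, the directors and the revenues are hypothetical). It also shows why a director can end up with an empty average, as Adam Leon does above: mean ignores missing values, so a group with no known revenue has nothing to average.

```python
import pandas as pd

# Hypothetical toy data: director Z has no known revenue
movies = pd.DataFrame(
    {
        "Director": ["X", "X", "Y", "Y", "Z"],
        "Revenue (Millions)": [100.0, 50.0, 10.0, None, None],
    }
)

# Split by director, then compute the mean and the number of
# non-missing revenues for each group
per_director = movies.groupby("Director")["Revenue (Millions)"].agg(["mean", "count"])

print(per_director)
```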


#### Merge dataframes:

Dataframes whose indexes are different cannot be filtered against each other directly. However, dataframes that share a common column can be merged on that column

- pandas.merge(dataframe1, dataframe2, on=[col1, col2], how='___')
    - dataframe1 = One of the dataframes we want to join
    - dataframe2 = The other dataframe we want to join
    - on = List of columns common to both dataframes
    - how = Strategy
      - inner = The value in the common columns has to exist in both dataframes (the result contains only records that are in both dataframes)
      - outer = All the rows of both dataframes are included; if a value is in just one of the two dataframes, the fields corresponding to the other dataframe will be NaN
      - left = All the rows in dataframe1 are included (for rows without a match, the fields corresponding to dataframe2 will be NaN)
      - right = All the rows in dataframe2 are included (for rows without a match, the fields corresponding to dataframe1 will be NaN)
See exercise 25
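
The four strategies can be compared side by side on two small dataframes. This is a minimal sketch that assumes made-up data (the names `movies` and `awards`, and the `Awards` column, are hypothetical):

```python
import pandas as pd

# Hypothetical toy data: titles B and C appear in both dataframes
movies = pd.DataFrame({"Title": ["A", "B", "C"], "Director": ["X", "Y", "Z"]})
awards = pd.DataFrame({"Title": ["B", "C", "D"], "Awards": [1, 3, 2]})

# Merge on the common column with each strategy
results = {
    how: pd.merge(movies, awards, on=["Title"], how=how)
    for how in ("inner", "outer", "left", "right")
}

for how, merged in results.items():
    # The number of rows depends on which titles each strategy keeps
    print(how, len(merged), sorted(merged["Title"]))
```

In the left merge, title A has no match in `awards`, so its `Awards` field is NaN. In the right merge, title D has no match in `movies`, so its `Director` field is NaN.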


#### Drop NA values:

To drop the rows that contain NA in at least one of their fields:

dataframe.dropna()
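
dropna returns a new dataframe and leaves the original untouched. Its subset parameter restricts the check to some columns. A minimal sketch, assuming a made-up dataframe (the name `movies` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical toy data: B has no revenue, C has no metascore
movies = pd.DataFrame(
    {
        "Title": ["A", "B", "C"],
        "Revenue (Millions)": [10.0, None, 30.0],
        "Metascore": [70.0, 60.0, None],
    }
)

# Drop every row that has NA in any column
complete_rows = movies.dropna()

# Drop only the rows that have NA in the revenue column
with_revenue = movies.dropna(subset=["Revenue (Millions)"])

print(list(complete_rows["Title"]))  # ['A']
print(list(with_revenue["Title"]))   # ['A', 'C']
print(len(movies))                   # 3, the original is unchanged
```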

#### Save datasets:

- dataframe.__to_csv__('filepath'): Saves the dataframe to a csv file
- dataframe.__to_excel__('filepath'): Saves the dataframe to an excel file
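
By default both methods also write the index as an extra column; passing index=False leaves it out. The sketch below saves a made-up dataframe to a temporary CSV file and reads it back (the name `movies`, its values and the file name are hypothetical). to_excel accepts index=False in the same way, but it needs an Excel engine such as openpyxl to be installed, so it is not run here.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy data
movies = pd.DataFrame({"Title": ["A", "B"], "Rating": [7.5, 6.0]})

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "movies.csv")  # hypothetical file name
    # index=False avoids writing the row index as an unnamed first column
    movies.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

print(list(reloaded.columns))  # ['Title', 'Rating']
```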