# **Chapter 01: Introducing pandas**  

In this chapter, we'll learn about:
- The growth of data science in the 21st century
- The history of the pandas library for data analysis
- The pros and cons of pandas and its competitors
- Data analysis in Excel versus data analysis with a programming language
- A tour of the library's features through a working example  

pandas:  
- a library for data analysis built on top of Python
    - library  
        - a collection of code for solving problems in a specific field of endeavor  
- a toolbox for data manipulation operations  
(sorting, filtering, cleaning, deduping, aggregating, pivoting, ...)  
- features
    - pairs well with other libraries for statistics, NLP, ML, data visualization, ...
    - open source library  
        - the library's source code is publicly available to download, use, modify, and distribute  
        - its license grants users more permissions than proprietary software such as Excel  
        - a global team of volunteer software developers maintains it
    - efficiency  
        - pandas relies on lower-level languages (e.g., C) for many of its calculations &rightarrow; it can manipulate data efficiently  
        - easy to accomplish a lot with a little code  

<br/>

## **1.3 A Tour of pandas**  
Seeing pandas in action is the best way to grasp its power.  
So, let's take a quick tour of pandas by analyzing a data set of more than 700 of the highest-grossing movies of all time.  
### 1.3.1 Importing a data set  
Import the pandas library to gain access to its features.

In [1]:
import pandas as pd

The data is stored in the `movies.csv` file.  
- CSV file  
    - Comma-Separated Values
    - plain-text file
    - its rows of data are divided by a line break
    - its row values are divided by a comma
    - the first row holds the column headers
- five columns:  
    - `Rank`  
    - `Title`  
    - `Studio`  
    - `Gross`  
    - `Year`  
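
The CSV layout described above can be sketched with Python's standard `csv` module. The three data rows are copied from the top of the data set; the exact quoting used inside `movies.csv` is an assumption (values that contain a comma, such as the gross, are assumed to be wrapped in double quotes).

```python
import csv
import io

# Three rows in the same layout as movies.csv.
# Assumption: values containing a comma are wrapped in double quotes.
csv_text = (
    "Rank,Title,Studio,Gross,Year\n"
    '1,Avengers: Endgame,Buena Vista,"$2,796.30",2019\n'
    '2,Avatar,Fox,"$2,789.70",2009\n'
    '3,Titanic,Paramount,"$2,187.50",1997\n'
)

# Each line break ends a row; each unquoted comma separates two values.
rows = list(csv.reader(io.StringIO(csv_text)))

# The first row holds the column headers; the rest are data rows.
header, records = rows[0], rows[1:]

print(header)      # ['Rank', 'Title', 'Studio', 'Gross', 'Year']
print(records[0])  # ['1', 'Avengers: Endgame', 'Buena Vista', '$2,796.30', '2019']
```

Every value comes back as plain text here, which is one reason a dedicated import function that infers data types is convenient.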

<br/>  

Pandas can import various file types, each of which has an associated import function.  
A function in pandas is a command that we issue to the library or to an entity within it.  
We'll use the `read_csv()` function to import the `movies.csv` file.

In [2]:
pd.read_csv("./data-sets/ch01/movies.csv")

|  | Rank | Title | Studio | Gross | Year |
|---|---|---|---|---|---|
| 0 | 1 | Avengers: Endgame | Buena Vista | $2,796.30 | 2019 |
| 1 | 2 | Avatar | Fox | $2,789.70 | 2009 |
| 2 | 3 | Titanic | Paramount | $2,187.50 | 1997 |
| 3 | 4 | Star Wars: The Force Awakens | Buena Vista | $2,068.20 | 2015 |
| 4 | 5 | Avengers: Infinity War | Buena Vista | $2,048.40 | 2018 |
| ... | ... | ... | ... | ... | ... |
| 777 | 778 | Yogi Bear | Warner Brothers | $201.60 | 2010 |
| 778 | 779 | Garfield: The Movie | Fox | $200.80 | 2004 |
| 779 | 780 | Cats & Dogs | Warner Brothers | $200.70 | 2001 |
| 780 | 781 | The Hunt for Red October | Paramount | $200.50 | 1990 |


Pandas imports the CSV file's contents into a `DataFrame` object.  
- object  
    - can be regarded as a container for storing data
    - different objects are optimized for different types of data  
    &rightarrow; we interact with different objects in different ways  
    - pandas uses one type of object, the `DataFrame`, to store multicolumn data sets  
    - it uses another type of object, the `Series`, to store single-column data sets  
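
As a rough illustration of the two object types, the sketch below builds each one by hand instead of reading `movies.csv`. The variable names `frame` and `studios` are made up for this example.

```python
import pandas as pd

# A DataFrame holds a multicolumn data set (two dimensions).
frame = pd.DataFrame(
    {
        "Title": ["Avatar", "Titanic"],
        "Studio": ["Fox", "Paramount"],
        "Year": [2009, 1997],
    }
)

# A Series holds a single column of data (one dimension).
studios = pd.Series(["Fox", "Paramount"], name="Studio")

print(type(frame).__name__, frame.shape)      # DataFrame (2, 3)
print(type(studios).__name__, studios.shape)  # Series (2,)
```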


Pandas displays only the first and last five rows of the `DataFrame` to avoid cluttering the screen.  


The `DataFrame` from the `movies.csv` file  
- consists of five columns and an index  
    - the index is the range of ascending numbers on the left side of the `DataFrame`
    - index labels serve as identifiers for the rows  
    - any column can be used as the index
    - by default, pandas generates a numeric index starting from 0  
    - a column whose values can act as a primary identifier or point of reference makes a good index  
    &rightarrow; the `Rank` column and the `Title` column are both good candidates in this case  


Therefore, we'll set the `Title` column as the index instead of the default numeric index.  
We can do this directly during the CSV file import with the `index_col` parameter of `read_csv()`.

In [3]:
pd.read_csv("./data-sets/ch01/movies.csv", index_col="Title")

| Title | Rank | Studio | Gross | Year |
|---|---|---|---|---|
| Avengers: Endgame | 1 | Buena Vista | $2,796.30 | 2019 |
| Avatar | 2 | Fox | $2,789.70 | 2009 |
| Titanic | 3 | Paramount | $2,187.50 | 1997 |
| Star Wars: The Force Awakens | 4 | Buena Vista | $2,068.20 | 2015 |
| Avengers: Infinity War | 5 | Buena Vista | $2,048.40 | 2018 |
| ... | ... | ... | ... | ... |
| Yogi Bear | 778 | Warner Brothers | $201.60 | 2010 |
| Garfield: The Movie | 779 | Fox | $200.80 | 2004 |
| Cats & Dogs | 780 | Warner Brothers | $200.70 | 2001 |
| The Hunt for Red October | 781 | Paramount | $200.50 | 1990 |
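
The difference between the default numeric index and a column-based index can be checked on a small in-memory sample. This is a minimal sketch: it reads CSV text from an `io.StringIO` buffer rather than from `movies.csv`, and the quoting of the `Gross` values is an assumption.

```python
import io

import pandas as pd

# Three rows in the same layout as movies.csv (quoting is assumed).
csv_text = (
    "Rank,Title,Studio,Gross,Year\n"
    '1,Avengers: Endgame,Buena Vista,"$2,796.30",2019\n'
    '2,Avatar,Fox,"$2,789.70",2009\n'
    '3,Titanic,Paramount,"$2,187.50",1997\n'
)

# Without index_col, pandas generates a numeric index starting from 0.
default_index = pd.read_csv(io.StringIO(csv_text))

# With index_col, the chosen column becomes the index
# and is no longer counted among the regular columns.
title_index = pd.read_csv(io.StringIO(csv_text), index_col="Title")

print(default_index.index.tolist())  # [0, 1, 2]
print(title_index.index.tolist())    # ['Avengers: Endgame', 'Avatar', 'Titanic']
print(title_index.columns.tolist())  # ['Rank', 'Studio', 'Gross', 'Year']
```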


A `DataFrame` can be assigned to a variable for easy reference elsewhere in the program.  
A variable is a user-assigned name for an object in the program.  
We'll assign the `DataFrame` to the variable `movies`.  

In [4]:
movies = pd.read_csv("./data-sets/ch01/movies.csv", index_col="Title")

### 1.3.2 Manipulating a DataFrame  
Ways to manipulate a `DataFrame`:  
- extract a few rows from the beginning

In [5]:
movies.head(4)

| Title | Rank | Studio | Gross | Year |
|---|---|---|---|---|
| Avengers: Endgame | 1 | Buena Vista | $2,796.30 | 2019 |
| Avatar | 2 | Fox | $2,789.70 | 2009 |
| Titanic | 3 | Paramount | $2,187.50 | 1997 |
| Star Wars: The Force Awakens | 4 | Buena Vista | $2,068.20 | 2015 |


- extract a few rows from the end

In [6]:
movies.tail(6)

| Title | Rank | Studio | Gross | Year |
|---|---|---|---|---|
| 21 Jump Street | 777 | Sony | $201.60 | 2012 |
| Yogi Bear | 778 | Warner Brothers | $201.60 | 2010 |
| Garfield: The Movie | 779 | Fox | $200.80 | 2004 |
| Cats & Dogs | 780 | Warner Brothers | $200.70 | 2001 |
| The Hunt for Red October | 781 | Paramount | $200.50 | 1990 |
| Valkyrie | 782 | MGM | $200.30 | 2008 |


- find out how many rows the `DataFrame` has

In [7]:
len(movies)

782

- find out how many rows and columns the `DataFrame` has

In [8]:
movies.shape

(782, 4)

- inquire about the total number of cells in the `DataFrame`

In [9]:
movies.size

3128
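
The three results above are related: `len` gives the row count, `shape` gives (rows, columns), and `size` is their product, so 782 × 4 = 3128 for `movies`. A minimal sketch on a hand-built frame (the name `frame` is made up for this example):

```python
import pandas as pd

# Small illustrative frame: 3 rows, 2 columns, with Title as the index.
frame = pd.DataFrame(
    {"Rank": [1, 2, 3], "Year": [2019, 2009, 1997]},
    index=pd.Index(["Avengers: Endgame", "Avatar", "Titanic"], name="Title"),
)

rows, columns = frame.shape

print(len(frame))  # 3 (same as rows)
print(frame.shape) # (3, 2)
print(frame.size)  # 6 (rows * columns; the index is not counted)
```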

- ask for the data types of the columns  
    - `int64`: integer
    - `object`: text

In [10]:
movies.dtypes

Rank       int64
Studio    object
Gross     object
Year       int64
dtype: object

- extract a row from the data set by its index position  
    - pandas returns a `Series` object  
        - `Series`: a one-dimensional labeled array of values  
        - it can be regarded as a single column of data with an identifier for each row  
        - the `Series`' index labels (`Rank`, `Studio`, `Gross`, `Year`) are the column names from the `DataFrame`; the row's `Title` becomes the `Series`' name
    - an index label can also be used to access a `DataFrame` row

In [11]:
movies.iloc[499]

Rank           500
Studio         Fox
Gross     $288.30 
Year          2018
Name: Maze Runner: The Death Cure, dtype: object
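
The shape of the result above can be reproduced on a tiny hand-built frame. This sketch assumes nothing about `movies.csv` beyond the column names; `frame` and `row` are illustrative names.

```python
import pandas as pd

# Two-row frame with Title as the index, mirroring the layout of movies.
frame = pd.DataFrame(
    {
        "Rank": [1, 2],
        "Studio": ["Buena Vista", "Fox"],
        "Year": [2019, 2009],
    },
    index=pd.Index(["Avengers: Endgame", "Avatar"], name="Title"),
)

# iloc selects by position, counting from 0, so position 1 is the second row.
row = frame.iloc[1]

print(type(row).__name__)  # Series
print(row.index.tolist())  # ['Rank', 'Studio', 'Year']
print(row.name)            # Avatar
```

Because the row mixes integers and text, pandas stores it with the general-purpose `object` data type, which matches the `dtype: object` shown in the output above.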

- extract row values for "Forrest Gump" by its index label  
(the index labels come from the `Title` column)  
    - index labels can contain duplicates  
    (pandas permits duplicates, but they're not recommended)
    - unique index labels speed up how quickly pandas can locate and extract data


In [12]:
movies.loc["Forrest Gump"]

Rank            119
Studio    Paramount
Gross      $677.90 
Year           1994
Name: Forrest Gump, dtype: object
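
The note on duplicate index labels can be illustrated with a short sketch. The three rows below reuse titles and values that appear in the data set; the frame itself is built by hand, and the variable names are made up for this example.

```python
import pandas as pd

# "101 Dalmatians" appears twice in the data set (1961 and 1996 releases).
frame = pd.DataFrame(
    {"Rank": [708, 425, 93], "Year": [1961, 1996, 2009]},
    index=pd.Index(["101 Dalmatians", "101 Dalmatians", "2012"], name="Title"),
)

# is_unique reports whether any label is repeated.
print(frame.index.is_unique)  # False

# A label that occurs once yields a single row as a Series.
single = frame.loc["2012"]

# A repeated label yields every matching row as a DataFrame.
repeated = frame.loc["101 Dalmatians"]

print(type(single).__name__)    # Series
print(type(repeated).__name__)  # DataFrame
print(len(repeated))            # 2
```

The return type of `loc` therefore depends on the data, which is one practical reason to prefer unique index labels.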

- sort the `DataFrame` by the values in a column  
(in this example, `Year`)

In [13]:
movies.sort_values(by="Year", ascending=False).head()

| Title | Rank | Studio | Gross | Year |
|---|---|---|---|---|
| Avengers: Endgame | 1 | Buena Vista | $2,796.30 | 2019 |
| John Wick: Chapter 3 - Parabellum | 458 | Lionsgate | $304.70 | 2019 |
| The Wandering Earth | 114 | China Film Corporation | $699.80 | 2019 |
| Toy Story 4 | 198 | Buena Vista | $519.80 | 2019 |
| How to Train Your Dragon: The Hidden World | 199 | Universal | $519.80 | 2019 |


- sort the `DataFrame` by values across multiple columns  
(in this example, `Studio` and `Year`  
&rightarrow; the films are sorted alphabetically by studio, then by release year within each studio)

In [14]:
movies.sort_values(by=["Studio", "Year"]).head()

| Title | Rank | Studio | Gross | Year |
|---|---|---|---|---|
| The Blair Witch Project | 588 | Artisan | $248.60 | 1999 |
| 101 Dalmatians | 708 | Buena Vista | $215.90 | 1961 |
| The Jungle Book | 755 | Buena Vista | $205.80 | 1967 |
| Who Framed Roger Rabbit | 410 | Buena Vista | $329.80 | 1988 |
| Dead Poets Society | 636 | Buena Vista | $235.90 | 1989 |
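
How the second sort column breaks ties in the first can be seen on a deliberately shuffled sample. The four rows reuse studios and years from the data set; the frame is built by hand for illustration.

```python
import pandas as pd

# Rows are listed out of order on purpose.
frame = pd.DataFrame(
    {
        "Studio": ["Fox", "Buena Vista", "Artisan", "Buena Vista"],
        "Year": [2009, 1989, 1999, 1961],
    },
    index=pd.Index(
        ["Avatar", "Dead Poets Society", "The Blair Witch Project", "101 Dalmatians"],
        name="Title",
    ),
)

# Sort by Studio first; rows that share a studio are then ordered by Year.
by_studio_year = frame.sort_values(by=["Studio", "Year"])

print(by_studio_year.index.tolist())
# ['The Blair Witch Project', '101 Dalmatians', 'Dead Poets Society', 'Avatar']
```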


- sort by index  
(in this case, `Title`  
 &rightarrow; we can see films in alphabetical order)

In [15]:
movies.sort_index().head()

| Title | Rank | Studio | Gross | Year |
|---|---|---|---|---|
| 10,000 B.C. | 536 | Warner Brothers | $269.80 | 2008 |
| 101 Dalmatians | 708 | Buena Vista | $215.90 | 1961 |
| 101 Dalmatians | 425 | Buena Vista | $320.70 | 1996 |
| 2 Fast 2 Furious | 632 | Universal | $236.40 | 2003 |
| 2012 | 93 | Sony | $769.70 | 2009 |


The operations we've performed so far return new `DataFrame` objects.  
&rightarrow; Pandas has not altered the original `DataFrame` from the CSV file.  
The nondestructive nature of these operations is beneficial; it actively encourages experimentation.  
&rightarrow; We can always confirm that a result is correct before making it permanent.  
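
The nondestructive behavior can be verified directly. In this minimal sketch (hand-built data, illustrative variable names), sorting produces a new object and leaves the original order intact until we choose to reassign the variable.

```python
import pandas as pd

sample = pd.DataFrame(
    {"Year": [2019, 2009, 1997]},
    index=pd.Index(["Avengers: Endgame", "Avatar", "Titanic"], name="Title"),
)

# sort_values returns a new DataFrame; sample itself is not modified.
by_year = sample.sort_values(by="Year")

order_after_sort = sample.index.tolist()
print(order_after_sort)        # ['Avengers: Endgame', 'Avatar', 'Titanic']
print(by_year.index.tolist())  # ['Titanic', 'Avatar', 'Avengers: Endgame']

# Once the result looks right, reassigning the variable makes it permanent.
sample = by_year
```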

<br/>  

### 1.3.3 Counting values in a Series  
Let's try a more sophisticated analysis.  
We'll find out which movie studio had the greatest number of highest-grossing films.  


To do this, we need to count the number of times each studio appears in the `Studio` column:  
1. extract a single column of data from the `DataFrame` as a `Series`
2. count the number of times each value appears in the `Series`

1. extract a single column of data

In [16]:
movies["Studio"]

Title
Avengers: Endgame                   Buena Vista
Avatar                                      Fox
Titanic                               Paramount
Star Wars: The Force Awakens        Buena Vista
Avengers: Infinity War              Buena Vista
                                     ...       
Yogi Bear                       Warner Brothers
Garfield: The Movie                         Fox
Cats & Dogs                     Warner Brothers
The Hunt for Red October              Paramount
Valkyrie                                    MGM
Name: Studio, Length: 782, dtype: object

Now we have isolated the `Studio` column as a `Series` object.  


2. count the number of each studio's appearances  
    - count each unique value's number of occurrences using the `value_counts()` method  
    - limit the number of results to the top 10 studios

In [17]:
movies["Studio"].value_counts().head(10)

Studio
Warner Brothers    132
Buena Vista        125
Fox                117
Universal          109
Sony                86
Paramount           76
Dreamworks          27
Lionsgate           21
New Line            16
MGM                 11
Name: count, dtype: int64
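
The same two steps can be traced on a handful of values, where the counts are easy to check by eye. The `Series` below is built by hand for illustration and is not taken from `movies.csv`.

```python
import pandas as pd

studios = pd.Series(
    ["Buena Vista", "Fox", "Paramount", "Buena Vista", "Buena Vista", "Fox"],
    name="Studio",
)

# value_counts tallies each unique value and sorts from most to least frequent.
counts = studios.value_counts()

print(counts.index.tolist())  # ['Buena Vista', 'Fox', 'Paramount']
print(counts.tolist())        # [3, 2, 1]

# head(n) keeps only the n most frequent values.
top_two = counts.head(2)
print(top_two.index.tolist())  # ['Buena Vista', 'Fox']
```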