# SLU01 - Pandas 101: Learning notebook

In this notebook we will be covering the following:

1. [What is Pandas](#1.-What-is-pandas?)
1. [Series](#2.-Series) 
1. [Dataframes](#3.-DataFrames) 
1. [Previewing and describing a dataframe](#4.-Previewing-a-DataFrame)
1. [Reading data from files into pandas](#5.-Reading-data-from-files-into-pandas)
1. [Writing data from pandas into files](#6.-Writing-data-from-pandas-into-files)

## 1. What is `pandas`?

Pandas is one of the most widely used Python libraries for working with data. It provides data structures and data manipulation tools designed to make data cleaning and analysis fast and easy in Python.

This notebook covers its most basic functionality.

### How do I call it?

In [1]:
import pandas as pd

Notice we import pandas as `pd`. This is not required but highly recommended. It's standard practice and commonly used in documentation and usage examples.

### Pandas Data Structures

There are two main data structures in pandas:

- **Series** - A 1-dimensional array of data of the same type. More documentation on Series is available on the [`pandas.Series` documentation page](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html).

![Pandas Series](assets/series.png "Pandas Series")

- **Dataframes** - Tabular structure that may be seen as a container of series (which may have different types). Be aware that it is also possible to hold a 1-dimensional array of data in a DataFrame. More documentation on Dataframes is available on the [`pandas.DataFrame` documentation page](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html).

![Pandas DataFrame](assets/dataframe.PNG "Pandas Dataframe")

---

## 2. Series

Creating a series in pandas is really easy. We will start by creating a series of numbers and printing it to see what it looks like.

[`pandas.Series` documentation page](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html)

In [2]:
s1 = pd.Series([10, 3, 5, 1, 12])
s1

0    10
1     3
2     5
3     1
4    12
dtype: int64

We can see the values we have defined, and an index (automatically set in the range from 0 to the length of the data minus 1).

Notice as well that the series has one and only one type of data, in this case `int64`. Pandas is quite good at inferring what kind of data is passed to it.

Additionally, it's possible to observe that the order of the data has been maintained.

Let's see what would have happened if we had passed it some floats, instead of integers: 

In [3]:
s2 = pd.Series([0.5, 0.2, 5.2, 1.6, -0.6])
s2

0    0.5
1    0.2
2    5.2
3    1.6
4   -0.6
dtype: float64

Ok, so now it's a `float64` series. 

Next up, the same, but this time with some strings: 

In [4]:
s3 = pd.Series(["Google", "Microsoft", "Facebook", "Apple"])
s3

0       Google
1    Microsoft
2     Facebook
3        Apple
dtype: object

Ok, this time the dtype is `object`. An `object` series stores references to arbitrary Python objects, so it can hold values of different types, as we may see in the example below.

Fair question: what happens if you pass it a mix of stuff? 

In [5]:
s4 = pd.Series([1, 2.3, "omg a string", 2])
s4

0               1
1             2.3
2    omg a string
3               2
dtype: object

Well, when everything is mixed, it makes it an object! 

Each Series has an attribute that shows us its data type. It's called `dtype` and can be used like this:

In [6]:
s1.dtype

dtype('int64')

In [7]:
s4.dtype

dtype('O')

Note that series `s1` has the dtype `int64` (integer) and `s4` has the dtype `O` (object).
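To make the inference rules above more concrete, here is a minimal sketch of two mixed cases. The variable names (`mixed_numbers`, `mixed_anything`, `element_types`) are made up for this illustration: integers mixed with floats are upcast to `float64`, while numbers mixed with strings fall back to `object`, where each element keeps its own Python type.

```python
import pandas as pd

# Integers mixed with floats are upcast to a single numeric dtype.
mixed_numbers = pd.Series([1, 2.5, 3])
print(mixed_numbers.dtype)  # float64

# Numbers mixed with strings fall back to `object`;
# each element keeps its original Python type.
mixed_anything = pd.Series([1, 2.3, "omg a string", 2])
element_types = [type(value).__name__ for value in mixed_anything]
print(mixed_anything.dtype)  # object
print(element_types)         # ['int', 'float', 'str', 'int']
```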

### Indexing 

You will have noticed that our Series so far have a bunch of numbers on the left _(0, 1, 2, 3...)_. 

Those values represent the index, which is used for (among other things) selecting. 

Even though by default the index is _0, 1, 2, 3..._ it is often useful to set a different index. 

Here is an example: 

In [8]:
s5 = pd.Series(data=["Larry", "Bill", "Mark", "Steve"], 
               index=["Google", "Microsoft", "Facebook", "Apple"])
s5

Google       Larry
Microsoft     Bill
Facebook      Mark
Apple        Steve
dtype: object

Here `Bill` has the index label `Microsoft`. Now we can actually treat this a bit like a dictionary:

In [9]:
s5['Microsoft']

'Bill'

We can also get all the values (still a bit like a dictionary): 

In [10]:
s5.values

array(['Larry', 'Bill', 'Mark', 'Steve'], dtype=object)

Or the index labels (like the `.keys()` of a dictionary):

In [11]:
s5.index

Index(['Google', 'Microsoft', 'Facebook', 'Apple'], dtype='object')

Speaking of dictionaries, can I make a Pandas Series from a dictionary? 

In [12]:
my_dict = {"Google": "Larry",
           "Microsoft": "Bill",
           "Facebook": "Mark",
           "Apple": "Steve"}

s6 = pd.Series(my_dict)
s6

Google       Larry
Microsoft     Bill
Facebook      Mark
Apple        Steve
dtype: object

The Series class automatically uses the keys of the dictionary as the index of the series and the corresponding values as its data. The interesting part is that we now have some functionality that dictionaries usually don't offer, such as slicing:

In [13]:
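try:
    my_dict[-1:]  # dictionaries cannot be sliced
except (TypeError, KeyError):  # the exact exception depends on the Python version
    print("Illegal operation")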

Illegal operation


In [14]:
s5[-1:]

Apple    Steve
dtype: object
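In the cells above, `s5['Microsoft']` looks a value up by label, while `s5[-1:]` slices by position. As a small sketch of how to make that intent explicit, pandas also offers the `.loc` (label-based) and `.iloc` (position-based) accessors. The name `founders` and the other variable names are only for this illustration.

```python
import pandas as pd

founders = pd.Series(data=["Larry", "Bill", "Mark", "Steve"],
                     index=["Google", "Microsoft", "Facebook", "Apple"])

by_label = founders.loc["Microsoft"]   # label-based lookup
by_position = founders.iloc[1]         # position-based lookup
last_item = founders.iloc[-1:]         # positional slice, returns a Series
# Label slices include both end points.
label_slice = founders.loc["Microsoft":"Facebook"]

print(by_label, by_position)  # Bill Bill
print(last_item.index.tolist())  # ['Apple']
print(label_slice.tolist())      # ['Bill', 'Mark']
```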

In newer versions of pandas, the documentation recommends replacing the `.values` attribute with one of the following, depending on whether you need a reference to the underlying data or a NumPy array: the `.array` attribute or the `.to_numpy()` method, respectively.
The reason is that `.values` sometimes returns a NumPy array and other times an ExtensionArray, whereas the newer alternatives make it explicit which one you get.

If you use the `.array` attribute on the series `s5`, you get a `PandasArray`, as you can see below (`.array` works for both Series and Index). Recent pandas releases rename this class to `NumpyExtensionArray`, so your output may differ.

In [15]:
s5.array

<PandasArray>
['Larry', 'Bill', 'Mark', 'Steve']
Length: 4, dtype: object

In [16]:
s5.index.array

<PandasArray>
['Google', 'Microsoft', 'Facebook', 'Apple']
Length: 4, dtype: object

If the data has a different type, such as `period`, the `.array` attribute returns a `PeriodArray`.

In [17]:
s6 = pd.period_range('2000', periods=4)

In [18]:
s6.array

<PeriodArray>
['2000-01-01', '2000-01-02', '2000-01-03', '2000-01-04']
Length: 4, dtype: period[D]

If you need an actual numpy array, you can do:

In [19]:
s5.to_numpy()

array(['Larry', 'Bill', 'Mark', 'Steve'], dtype=object)

In [20]:
s5.index.to_numpy()

array(['Google', 'Microsoft', 'Facebook', 'Apple'], dtype=object)

#### Key points:

- Each series has only one data type (even if it is a more inclusive one, like object).
- A list of index labels may be provided (it must have the same length as the data).
- It is possible to use dictionaries to create series.
- There are various attributes and methods to get information from a Series.

---

## 3. DataFrames

As mentioned previously, a dataframe is a tabular structure (think "Excel sheet"). This will become clear with the following examples.

[`pandas.DataFrame` documentation page](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html)

Let's make our first DataFrame: 

In [21]:
df1 = pd.DataFrame([10,122,1])

In [22]:
df1

|  | 0 |
|---|---|
| 0 | 10 |
| 1 | 122 |
| 2 | 1 |


This first dataframe is a really simple one. We can see that in this case we have a column and an index. This shows the tabular structure. But the next case will highlight this even further.

In [23]:
df2 = pd.DataFrame([[1,   2,   3,    7],  # ignore the weird spacing, it's just to be clear we have 3 lists of 4 
                    [4.2, 6.1, 8.9, -4.1], 
                    ["a", "b", "c", "z"] ])

In [24]:
df2

|  | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 7 |
| 1 | 4.2 | 6.1 | 8.9 | -4.1 |
| 2 | a | b | c | z |


Notice that the way this dataframe is being created leads to a row of values for each list of data provided. It is also possible to provide a list of names for each of the columns...

In [25]:
df3 = pd.DataFrame([[1,   2,   3,    7], 
                    [4.2, 6.1, 8.9, -4.1], 
                    ["a", "b", "c", "z"] ],
                    columns=["col_1", "col_2", "col_3", "col_4"])   # <-- The column names! 

In [26]:
df3

|  | col_1 | col_2 | col_3 | col_4 |
|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 7 |
| 1 | 4.2 | 6.1 | 8.9 | -4.1 |
| 2 | a | b | c | z |


as well as for each of the rows:

In [27]:
df4 = pd.DataFrame([[1,   2,   3,    7], 
                    [4.2, 6.1, 8.9, -4.1], 
                    ["a", "b", "c", "z"] ],
                    columns=["col_1", "col_2", "col_3", "col_4"],  # <-- The column names
                    index=["row_1", "row_2", "row_3"])   # <-- The row names

In [28]:
df4

|  | col_1 | col_2 | col_3 | col_4 |
|---|---|---|---|---|
| row_1 | 1 | 2 | 3 | 7 |
| row_2 | 4.2 | 6.1 | 8.9 | -4.1 |
| row_3 | a | b | c | z |


So far we've been creating DataFrames from lists, like so: 

In [29]:
company = ["Google", "Microsoft", "Facebook", "Apple"]
founder_name = ["Larry", "Bill", "Mark", "Steve"]
founder_surname = ["Page", "Gates", "Zuckerberg", "Jobs"]

df5 = pd.DataFrame( [ company, founder_name, founder_surname])

In [30]:
df5

|  | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | Google | Microsoft | Facebook | Apple |
| 1 | Larry | Bill | Mark | Steve |
| 2 | Page | Gates | Zuckerberg | Jobs |


But we can also do something cool, which is to make a dictionary with the lists as values, where the keys will be the column names. Let's create the dictionary first, using the lists we have defined above: 

In [31]:
tech_companies_dictionary = {
    'company': ["Google", "Microsoft", "Facebook", "Apple"],
    'founder_name': ["Larry", "Bill", "Mark", "Steve"],
    'founder_surname': ["Page", "Gates", "Zuckerberg", "Jobs"],
}

This is super readable, right? 

In [32]:
tech_companies_dictionary

{'company': ['Google', 'Microsoft', 'Facebook', 'Apple'],
 'founder_name': ['Larry', 'Bill', 'Mark', 'Steve'],
 'founder_surname': ['Page', 'Gates', 'Zuckerberg', 'Jobs']}

Now we can simply pass this to a Pandas DataFrame: 

In [33]:
df6 = pd.DataFrame(tech_companies_dictionary)

In [34]:
df6

|  | company | founder_name | founder_surname |
|---|---|---|---|
| 0 | Google | Larry | Page |
| 1 | Microsoft | Bill | Gates |
| 2 | Facebook | Mark | Zuckerberg |
| 3 | Apple | Steve | Jobs |


When a dictionary is passed to the DataFrame constructor, each key becomes a column name and each list becomes the data of that column, instead of a row. This is closer to how information is usually presented.
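To contrast the two construction styles side by side, here is a small sketch using two of the lists from above. The names `by_rows`, `by_columns` and `transposed` are made up for this illustration. Transposing with `.T` is used only to show that both frames hold the same values in a different orientation.

```python
import pandas as pd

company = ["Google", "Microsoft", "Facebook", "Apple"]
founder_name = ["Larry", "Bill", "Mark", "Steve"]

# A list of lists: each inner list becomes a ROW.
by_rows = pd.DataFrame([company, founder_name],
                       index=["company", "founder_name"])

# A dictionary of lists: each list becomes a COLUMN.
by_columns = pd.DataFrame({"company": company,
                           "founder_name": founder_name})

# Transposing the row-wise frame gives the same layout as the dictionary.
transposed = by_rows.T

print(by_rows.shape)     # (2, 4)
print(by_columns.shape)  # (4, 2)
print(transposed.values.tolist() == by_columns.values.tolist())  # True
```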

----

### Putting it all together 

Let's do the same thing again, using everything we've learned so far:  

In [35]:
# Let's say we have these lists somewhere on our computer: 
founder_names = ["Larry", "Bill", "Mark", "Steve", "Larry", "Reed"]
founder_surnames = ["Page", "Gates", "Zuckerberg", "Jobs", "Ellison", "Hastings"]
company = ["Google", "Microsoft", "Facebook", "Apple", "Oracle", "Netflix"]

Let's make some Series, using the company name as index: 

In [36]:
series_of_founder_names = pd.Series(data=founder_names, # <-- data 
                                    index=company)      # <-- index 

In [37]:
series_of_founder_names

Google       Larry
Microsoft     Bill
Facebook      Mark
Apple        Steve
Oracle       Larry
Netflix       Reed
dtype: object

Same thing, this time for surnames: 

In [38]:
series_of_founder_surnames = pd.Series(data=founder_surnames, # <-- different data
                                    index=company)        # <-- same index 

In [39]:
series_of_founder_surnames

Google             Page
Microsoft         Gates
Facebook     Zuckerberg
Apple              Jobs
Oracle          Ellison
Netflix        Hastings
dtype: object

Now with these two Series we can create a dataframe! Pandas will notice that they have the same index, and will give the DataFrame that index: 

In [40]:
df7 = pd.DataFrame({'founder_name': series_of_founder_names,  
                    'founder_surname': series_of_founder_surnames})

In [41]:
df7

|  | founder_name | founder_surname |
|---|---|---|
| Google | Larry | Page |
| Microsoft | Bill | Gates |
| Facebook | Mark | Zuckerberg |
| Apple | Steve | Jobs |
| Oracle | Larry | Ellison |
| Netflix | Reed | Hastings |


By passing series (in this case sharing the index) as values of a dictionary, pandas uses each key as a column name and the shared index as the row names. The columns and the index (rows) are also accessible, as will be shown below.

### What if my data isn't a Pandas Series?

It will often happen that you have a list or array:

In [42]:
number_of_employees = [73992, 124000, 20658, 123000, 138000, 5400]

In [43]:
series_of_number_employees = pd.Series(data=number_of_employees) # <-- data, no index 

Now, you may be tempted to add this directly to the DataFrame, and Pandas won't stop you:

In [44]:
df8 = pd.DataFrame({'founder_name': series_of_founder_names,  
                    'founder_surname': series_of_founder_surnames,
                    'number_employees': series_of_number_employees})

In [45]:
df8

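|  | founder_name | founder_surname | number_employees |
|---|---|---|---|
| 0 | NaN | NaN | 73992.0 |
| 1 | NaN | NaN | 124000.0 |
| 2 | NaN | NaN | 20658.0 |
| 3 | NaN | NaN | 123000.0 |
| 4 | NaN | NaN | 138000.0 |
| 5 | NaN | NaN | 5400.0 |
| Apple | Steve | Jobs | NaN |
| Facebook | Mark | Zuckerberg | NaN |
| Google | Larry | Page | NaN |
| Microsoft | Bill | Gates | NaN |
| Netflix | Reed | Hastings | NaN |
| Oracle | Larry | Ellison | NaN |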


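⚠️ Notice what happened: the new Series has the default index (0 to 5), which shares no labels with the company index, so pandas aligned on both indexes and filled the gaps with `NaN`. Assigning a plain list instead would avoid the `NaN` values, but that is a **lot more dangerous**, as you are making the assumption that the rows are in the same order as the list.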

In practice, you will probably end up doing this out of time constraints, or reading other people's code where lists are added directly without an index. But remember: if you have an index you are safer.
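As a sketch of the safer route, the new Series can be given the same company index before building the DataFrame, so that pandas aligns every value with the right row. The variable names `names`, `employees` and `aligned` are made up for this illustration. The data is the same as in the lists above.

```python
import pandas as pd

company = ["Google", "Microsoft", "Facebook", "Apple", "Oracle", "Netflix"]
founder_names = ["Larry", "Bill", "Mark", "Steve", "Larry", "Reed"]
number_of_employees = [73992, 124000, 20658, 123000, 138000, 5400]

# Both Series share the same index, so rows are matched by company name.
names = pd.Series(data=founder_names, index=company)
employees = pd.Series(data=number_of_employees, index=company)

aligned = pd.DataFrame({"founder_name": names,
                        "number_employees": employees})

print(aligned.shape)                                # (6, 2)
print(aligned["number_employees"].isna().sum())     # 0
print(aligned.loc["Netflix", "number_employees"])   # 5400
```

Because no value is missing, the `number_employees` column keeps an integer dtype instead of being converted to `float64`.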

-----

### Getting the index and column values 

This dataframe object contains some cool attributes, among which are the following: 

Get index, with `.index`: 

In [46]:
df8

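|  | founder_name | founder_surname | number_employees |
|---|---|---|---|
| 0 | NaN | NaN | 73992.0 |
| 1 | NaN | NaN | 124000.0 |
| 2 | NaN | NaN | 20658.0 |
| 3 | NaN | NaN | 123000.0 |
| 4 | NaN | NaN | 138000.0 |
| 5 | NaN | NaN | 5400.0 |
| Apple | Steve | Jobs | NaN |
| Facebook | Mark | Zuckerberg | NaN |
| Google | Larry | Page | NaN |
| Microsoft | Bill | Gates | NaN |
| Netflix | Reed | Hastings | NaN |
| Oracle | Larry | Ellison | NaN |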


In [47]:
df8.index

Index([          0,           1,           2,           3,           4,
                 5,     'Apple',  'Facebook',    'Google', 'Microsoft',
         'Netflix',    'Oracle'],
      dtype='object')

Get the columns, with `.columns`: 

In [48]:
df8.columns

Index(['founder_name', 'founder_surname', 'number_employees'], dtype='object')

Among other things, this might be used to iterate over the column titles.

In [49]:
for col in df8.columns:
    print(col)

founder_name
founder_surname
number_employees


We can also use `dtypes` to know the type of each series of the dataframe:

In [50]:
df8.dtypes

founder_name         object
founder_surname      object
number_employees    float64
dtype: object

As with Series, a DataFrame also has the `.values` attribute, but the pandas documentation recommends using `.to_numpy()` instead.

Note: a DataFrame doesn't have the `.array` attribute.

In [51]:
df8.to_numpy()

array([[nan, nan, 73992.0],
       [nan, nan, 124000.0],
       [nan, nan, 20658.0],
       [nan, nan, 123000.0],
       [nan, nan, 138000.0],
       [nan, nan, 5400.0],
       ['Steve', 'Jobs', nan],
       ['Mark', 'Zuckerberg', nan],
       ['Larry', 'Page', nan],
       ['Bill', 'Gates', nan],
       ['Reed', 'Hastings', nan],
       ['Larry', 'Ellison', nan]], dtype=object)

#### Key points:

- DataFrames may be seen as a tabular structure (named rows and columns).
- We can define the indexes and columns as we create the dataframe.
- It's possible to take advantage of dictionaries and Series to create DataFrames.

---

## 4. Previewing a DataFrame

#### Visualizing the DataFrame or part of it

In a Jupyter notebook, evaluating a DataFrame as the last expression of a cell displays it (as seen previously).

In [52]:
df8

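|  | founder_name | founder_surname | number_employees |
|---|---|---|---|
| 0 | NaN | NaN | 73992.0 |
| 1 | NaN | NaN | 124000.0 |
| 2 | NaN | NaN | 20658.0 |
| 3 | NaN | NaN | 123000.0 |
| 4 | NaN | NaN | 138000.0 |
| 5 | NaN | NaN | 5400.0 |
| Apple | Steve | Jobs | NaN |
| Facebook | Mark | Zuckerberg | NaN |
| Google | Larry | Page | NaN |
| Microsoft | Bill | Gates | NaN |
| Netflix | Reed | Hastings | NaN |
| Oracle | Larry | Ellison | NaN |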


If the dataframe has a lot of entries, it is only partially displayed. Even so, that may still be too much information at once, and the methods below are often a better alternative: `.head` and `.tail` return only a given number of entries from the top or from the bottom, respectively.

In [53]:
df8.head(n=2)

|  | founder_name | founder_surname | number_employees |
|---|---|---|---|
| 0 | NaN | NaN | 73992.0 |
| 1 | NaN | NaN | 124000.0 |


In [54]:
df8.tail(n=2)

|  | founder_name | founder_surname | number_employees |
|---|---|---|---|
| Netflix | Reed | Hastings | NaN |
| Oracle | Larry | Ellison | NaN |
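The benefit of `.head` and `.tail` is easier to see on a larger table. Here is a minimal sketch with a made-up 100-row dataframe (`numbers` is a hypothetical name). When `n` is omitted, both methods default to 5 rows.

```python
import pandas as pd

# A made-up dataframe with 100 rows, only for illustration.
numbers = pd.DataFrame({"value": range(100)})

first_rows = numbers.head()   # n defaults to 5
last_rows = numbers.tail(3)   # the last 3 rows

print(first_rows["value"].tolist())  # [0, 1, 2, 3, 4]
print(last_rows["value"].tolist())   # [97, 98, 99]
```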


---

### Retrieving DataFrame Information

#### Getting the relevant info

With pandas' [`.info()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.info.html) it is possible to obtain:
- the number of entries it has
- the total number of columns
- the title of each column
- the number of non-null entries in each column (missing values are not counted!)
- the type of data of the entries of a given column.

In [55]:
df8.info()

<class 'pandas.core.frame.DataFrame'>
Index: 12 entries, 0 to Oracle
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   founder_name      6 non-null      object 
 1   founder_surname   6 non-null      object 
 2   number_employees  6 non-null      float64
dtypes: float64(1), object(2)
memory usage: 384.0+ bytes


For the **NUMERICAL** variables it's also possible to print some more information using [`.describe()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html), namely:

- the number of rows for each of those columns
- the mean value
- the standard deviation
- the minimum and maximum value
- the median (50th percentile), and the 25th and 75th percentiles.

In [56]:
df8.describe()

|  | number_employees |
|---|---|
| count | 6.0 |
| mean | 80841.666667 |
| std | 57039.422527 |
| min | 5400.0 |
| 25% | 33991.5 |
| 50% | 98496.0 |
| 75% | 123750.0 |
| max | 138000.0 |
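By default `.describe()` skips non-numeric columns when numeric ones are present. As a small sketch, passing `include="all"` also summarises the text columns, adding rows such as `unique`, `top` and `freq`. The dataframe `founders` below is a simplified stand-in built from the data used earlier, not `df8` itself.

```python
import pandas as pd

founders = pd.DataFrame({
    "founder_name": ["Larry", "Bill", "Mark", "Steve", "Larry", "Reed"],
    "number_employees": [73992, 124000, 20658, 123000, 138000, 5400],
})

numeric_summary = founders.describe()             # numeric columns only
full_summary = founders.describe(include="all")   # text columns as well

print(numeric_summary.columns.tolist())             # ['number_employees']
print(full_summary.loc["unique", "founder_name"])   # 5
print(full_summary.loc["top", "founder_name"])      # Larry
```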


Finally, [`.shape`](https://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.shape.html) returns a tuple with the dimensions of the dataframe (nr_rows, nr_columns).

In [57]:
df8.shape

(12, 3)

### Key points:

- It's possible to display the whole dataframe, but it may show too many lines and be too "noisy".
- `head()` and `tail()` return the first and last n rows of the dataframe, respectively.
- `info()` shows the number of entries, the number of columns, their non-null counts and the data types.
- `describe()` returns basic statistics of the numeric columns.

---

## 5. Reading data from files into pandas

Pandas provides functions that allow us to create dataframes from several different data formats:

- CSV
- JSON
- HTML
- ... and [many more](https://pandas.pydata.org/pandas-docs/stable/io.html)

All of this is possible through the `read_*` family of functions (for example `read_csv`, `read_json` and `read_html`). Each one creates a dataframe, to which all the previously shown techniques apply.

For instance, using the 2010 census profile and housing characteristics of the city of Los Angeles ([source](https://catalog.data.gov/dataset/2010-census-populations-by-zip-code)):

In [58]:
data = pd.read_csv("data/2010_Census_Populations_by_Zip_Code.csv")

What does this look like?

In [59]:
data.head(5)

|  | Zip Code | Total Population | Median Age | Total Males | Total Females | Total Households | Average Household Size |
|---|---|---|---|---|---|---|---|
| 0 | 91371 | 1 | 73.5 | 0 | 1 | 1 | 1.0 |
| 1 | 90001 | 57110 | 26.6 | 28468 | 28642 | 12971 | 4.4 |
| 2 | 90002 | 51223 | 25.5 | 24876 | 26347 | 11731 | 4.36 |
| 3 | 90003 | 66266 | 26.3 | 32631 | 33635 | 15642 | 4.22 |
| 4 | 90004 | 62180 | 34.8 | 31302 | 30878 | 22547 | 2.73 |


Let's use `info`: 

In [60]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 319 entries, 0 to 318
Data columns (total 7 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Zip Code                319 non-null    int64  
 1   Total Population        319 non-null    int64  
 2   Median Age              319 non-null    float64
 3   Total Males             319 non-null    int64  
 4   Total Females           319 non-null    int64  
 5   Total Households        319 non-null    int64  
 6   Average Household Size  319 non-null    float64
dtypes: float64(2), int64(5)
memory usage: 17.6 KB


Now for a fuller description:

In [61]:
data.describe()

|  | Zip Code | Total Population | Median Age | Total Males | Total Females | Total Households | Average Household Size |
|---|---|---|---|---|---|---|---|
| count | 319.0 | 319.0 | 319.0 | 319.0 | 319.0 | 319.0 | 319.0 |
| mean | 91000.673981 | 33241.341693 | 36.527586 | 16391.564263 | 16849.777429 | 10964.570533 | 2.828119 |
| std | 908.360203 | 21644.417455 | 8.692999 | 10747.495566 | 10934.986468 | 6270.6464 | 0.835658 |
| min | 90001.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 90243.5 | 19318.5 | 32.4 | 9763.5 | 9633.5 | 6765.5 | 2.435 |
| 50% | 90807.0 | 31481.0 | 37.1 | 15283.0 | 16202.0 | 10968.0 | 2.83 |
| 75% | 91417.0 | 44978.0 | 41.0 | 22219.5 | 22690.5 | 14889.5 | 3.32 |
| max | 93591.0 | 105549.0 | 74.0 | 52794.0 | 53185.0 | 31087.0 | 4.67 |


### Key points:

- Pandas allows the creation of dataframes from several structures of data.
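To experiment with `read_csv` without having the census file on disk, here is a minimal sketch that reads CSV text from memory through `io.StringIO`. The three rows are copied from the preview above. The names `csv_text` and `sample` are made up for this illustration.

```python
import io

import pandas as pd

# A tiny CSV excerpt held in memory instead of a file on disk.
csv_text = """Zip Code,Total Population,Median Age
90001,57110,26.6
90002,51223,25.5
90003,66266,26.3
"""

# read_csv accepts file-like objects as well as file paths.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)   # (3, 3)
print(sample.dtypes)  # two int64 columns and one float64 column
```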

----

## 6. Writing data from pandas into files

Besides reading from disk, pandas also allows us to write our dataframe to a file, for example after applying some transformations to the data.

In [62]:
data.to_csv("data/new_csv.csv")

You should now have a new file called `new_csv.csv` in your `data` folder!
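One detail worth knowing: by default `to_csv` also writes the index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back. The sketch below illustrates this and the `index=False` alternative. The dataframe `frame` and the file names are made up, and a temporary folder is used so that nothing is left on disk.

```python
import os
import tempfile

import pandas as pd

frame = pd.DataFrame({"company": ["Google", "Apple"],
                      "number_employees": [73992, 123000]})

with tempfile.TemporaryDirectory() as folder:
    with_index_path = os.path.join(folder, "with_index.csv")
    without_index_path = os.path.join(folder, "without_index.csv")

    frame.to_csv(with_index_path)                  # default: index is written too
    frame.to_csv(without_index_path, index=False)  # skip the index column

    reloaded_with_index = pd.read_csv(with_index_path)
    reloaded_without_index = pd.read_csv(without_index_path)

print(reloaded_with_index.columns.tolist())
# ['Unnamed: 0', 'company', 'number_employees']
print(reloaded_without_index.columns.tolist())
# ['company', 'number_employees']
```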

---

### Reading/Writing with other types of data

The same way we can read data from various file types, we can also write data to various file types (CSV, JSON, HTML, ...).

All of this is possible through the `to_*` family of methods, which take as an argument the path where you want to save the file.

For example, you can write to a JSON file using `to_json`, or to an Excel spreadsheet using `to_excel`, and so on.
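As a short sketch of `to_json`: when no path is given, the method returns the JSON text instead of writing a file, which makes it easy to inspect. The dataframe `frame` is made up for this illustration, and `orient="records"` is just one of the available layouts (one JSON object per row).

```python
import json

import pandas as pd

frame = pd.DataFrame({"company": ["Google", "Apple"],
                      "number_employees": [73992, 123000]})

# With no path, `to_json` returns the JSON text instead of writing a file.
json_text = frame.to_json(orient="records")

# Parse it back with the standard library to inspect the structure.
records = json.loads(json_text)
print(records)
# [{'company': 'Google', 'number_employees': 73992},
#  {'company': 'Apple', 'number_employees': 123000}]
```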

---

## To learn more (optional)

- [Pandas Getting Started tutorials](https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/index.html)

- [Intro to data structures](https://pandas.pydata.org/pandas-docs/stable/dsintro.html)

---