# SLU01 - Pandas 101: Learning notebook


In this notebook we will cover the following:

[1. What is Pandas](#1.-What-is-pandas?)  
[2. Series](#2.-Series)   
&emsp;[2.1 Datatypes](#2.1-Datatypes)  
&emsp;[2.2 Indexing](#2.2-Indexing)  
&emsp;[2.3 Extracting data from series](#2.3-Extracting-data-from-series)  
[3. Dataframes](#3.-DataFrames)   
&emsp;[3.1 Making dataframes from series](#3.1-Making-dataframes-from-series)  
&emsp;[3.2 What if my data isn't a pandas Series?](#3.2-What-if-my-data-isn't-a-pandas-Series?)  
&emsp;[3.3 Getting the index and column values](#3.3-Getting-the-index-and-column-values)  
[4. Previewing and describing a DataFrame](#4.-Previewing-and-describing-a-DataFrame)  
&emsp;[4.1 Previewing the DataFrame or part of it](#4.1-Previewing-the-DataFrame-or-part-of-it)  
&emsp;[4.2 Retrieving DataFrame information](#4.2-Retrieving-DataFrame-information)  
[5. Reading data from files into pandas dataframes](#5.-Reading-data-from-files-into-pandas-dataframes)  
[6. Writing data from pandas into files](#6.-Writing-data-from-pandas-into-files)

So you want to be a data scientist! In the first few learning units, we'll be dealing with the data part. The science comes later.

The first thing we need to do after collecting data is to store it. People have stored all kinds of data in all kinds of media: stone tablets, tree bark markings, books, files, drawings, photographs, CDs, oral history... you name it. As a data scientist, you are most likely to encounter data in a digital format, a more or less useful one (e.g. text files, JSON files, databases, Excel files, screenshots of Excel files, digital images). You will often have to extract the data into a more easily manageable form, typically some kind of table. Here we will talk about data in this final, tabular form, and a Python package that helps you deal with it: pandas.

<img src="media/data_storage_media.jpg" width="900"/>

## 1. What is `pandas`?

`pandas` is a data manipulation and analysis tool designed to make data cleaning and analysis in Python fast and easy.  It contains tabular data structures and I/O tools for them. It is highly optimized for performance, as data scientists usually deal with large amounts of data.

In this notebook, you will learn about pandas data structures and importing/exporting of data between these data structures and files.

We will import pandas as `pd`. This is standard practice commonly used in documentation and usage examples and is highly recommended.

In [1]:
import pandas as pd
import numpy as np

Pandas has two main **data structures**:

- **Series** - A 1-dimensional array of data of the same type. The documentation on Series is available on the `pandas.Series` [documentation page](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html). Below is a Series example containing Pokémon names. 

![Pandas Series](media/series.png "Pandas Series")

- **DataFrame** - A 2D, potentially heterogeneous, tabular structure. It can be thought of as a container of Series. It is also possible to have dataframes with a single column. The documentation on DataFrame is available on the `pandas.DataFrame` [documentation page](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). Below is an example of a DataFrame with Pokémon characteristics. Notice the different datatypes - numeric, string, and boolean.

![Pandas DataFrame](media/dataframe.PNG "Pandas Dataframe")

Notice that both Series and DataFrame are **indexed** (row labels in the leftmost column). In the following two sections, we will see how Series/DataFrame can be instantiated (created) and explore their attributes and methods. If you don't remember what attributes and methods are, review the prep course chapters on objects.


## 2. Series

**Creating a series** in pandas is very easy: call the `pd.Series` constructor and pass it data. We will start by creating a series of integers. One way to pass the data is as a **list**.

In [2]:
s1 = pd.Series([10, 3, 5, 1, 12])
s1

0    10
1     3
2     5
3     1
4    12
dtype: int64

The series is always **indexed**. This time, we did not explicitly define an index, therefore it was created automatically as a sequence of consecutive integers from 0 to the length of the data minus 1. The data in the series is ordered in the same way as in the list we passed. Pandas also correctly interpreted the integer datatype, `int64`, as shown below the printed series.

Let's see what happens when we pass a list containing **floats**:

In [3]:
s2 = pd.Series([5, 2, 5.2, 1.6, -0.6,6])
s2

0    5.0
1    2.0
2    5.2
3    1.6
4   -0.6
5    6.0
dtype: float64

Although the list contained both **integers** and **floats**, the series datatype is now `float64`, the more inclusive datatype, and the **integers** were promoted to **floats**. This is because a **Series has a single datatype** shared by all of its elements.

Next, let's pass a list of **strings**:

In [4]:
s3 = pd.Series(["Google", "Microsoft", "Facebook", "Apple"])
s3

0       Google
1    Microsoft
2     Facebook
3        Apple
dtype: object

Ok, this time the datatype is `object`. What happens if we pass a mix of stuff? 

In [5]:
s4 = pd.Series([1, 2.3, "omg a string", 2])
s4

0               1
1             2.3
2    omg a string
3               2
dtype: object

An `object` again! Time to make a small detour to datatypes.

### 2.1 Datatypes
Pandas is intimately connected to NumPy and also uses its datatypes `float`, `int`, `bool`, `timedelta64[ns]` and `datetime64[ns]`. By default, pandas uses the **64-bit** versions of the integer and float types, regardless of the operating system. In addition, pandas has its own so-called **extension datatypes**, e.g. `strings`, `periods`, `intervals`, `categoricals`. You can see the full list [here](https://pandas.pydata.org/docs/user_guide/basics.html#dtypes).

Let's look at our two series with object datatypes, `s3` and `s4`. While `s4` contains different kinds of data, `s3` only has strings and it's better to make this clear with the correct datatype. To make `s3` of string datatype, we have to pass the `dtype` argument when defining the series, like this:

In [6]:
s3 = pd.Series(["Google", "Microsoft", "Facebook", "Apple"],dtype='string')
s3

0       Google
1    Microsoft
2     Facebook
3        Apple
dtype: string

The `string` dtype did not exist in older versions of pandas. Now it is recommended to use the `string` datatype for text data for several reasons.
1. First, a `string` series will not let you store anything but text data, but it can easily happen unintentionally in an `object` series. 
2. Second, there are datatype specific operations, such as selecting dataframe columns by datatype. Datatype `string` allows you to specifically select just text, unlike the `object` dtype. 
3. Third, the `string` datatype clearly identifies the contents of a series or dataframe column as text, while it is not so clear with the `object` datatype.
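
As a minimal sketch of the second point, the snippet below selects only the text column of a small dataframe with `select_dtypes`. The column names and values are made up for illustration and do not come from this notebook.

```python
import pandas as pd

# Hypothetical data: a text column, a mixed object column, and a numeric column
df = pd.DataFrame({
    "name": pd.Series(["Larry", "Bill"], dtype="string"),
    "mixed": pd.Series([1, "two"], dtype="object"),
    "age": [48, 66],
})

# Keep only the columns whose datatype is `string`
text_only = df.select_dtypes(include="string")
print(list(text_only.columns))
```

The `mixed` column is left out because its datatype is `object`, even though one of its elements happens to be text.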

A Series has an attribute that shows its datatype. It's called `dtype` (a DataFrame has the analogous `dtypes`, as we will see later) and can be used like this:

In [7]:
s4.dtype

dtype('O')

Although the datatype of `s4` is `object`, its elements retain their own datatypes. This can be seen with the `type()` function:

In [8]:
print(f"The first element of s4, '{s4[0]}', is of type {type(s4[0])}")
print(f"The second element of s4, '{s4[1]}', is of type {type(s4[1])}")
print(f"The third element of s4, '{s4[2]}', is of type {type(s4[2])}")

The first element of s4, '1', is of type <class 'int'>
The second element of s4, '2.3', is of type <class 'float'>
The third element of s4, 'omg a string', is of type <class 'str'>


Pandas has several tools for **datatype conversion**: the functions `to_numeric`, `to_datetime`, and `to_timedelta`, and the method `astype`. Each of the first three converts 1D objects to one specific datatype, while `astype` converts to whatever datatype the user specifies. 

A typical usage example is to convert string data to a numeric format:

In [9]:
pd.to_numeric(pd.Series(['1','2','3']))

0    1
1    2
2    3
dtype: int64

What if the input contains non-numeric elements? We can use the `errors` argument to specify what to do. Here we choose `errors='coerce'`, which converts the non-numeric elements to NaN.

In [10]:
pd.to_numeric(s4,errors='coerce')

0    1.0
1    2.3
2    NaN
3    2.0
dtype: float64
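
The other two conversion functions can be used in the same way. Below is a minimal sketch with made-up date and duration strings; the variable names are illustrative only.

```python
import pandas as pd

# Text that looks like dates -> datetime values
dates = pd.to_datetime(pd.Series(["2021-01-31", "2021-02-15"]))

# Text that looks like durations -> timedelta values
durations = pd.to_timedelta(pd.Series(["1 days", "2 hours"]))

# As with to_numeric, errors="coerce" turns unparseable entries into a
# missing value (NaT, "not a time") instead of raising an error
with_bad_value = pd.to_datetime(pd.Series(["2021-01-31", "not a date"]),
                                errors="coerce")

print(dates.dtype)
print(durations.dtype)
print(with_bad_value.isna().tolist())
```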

`astype` can also be applied to 2D objects, like dataframes. Here we use it to convert a float series to a string series. Notice the different syntax: `astype` is a method called on the series itself, not a function called from `pd`:

In [11]:
s2.astype('string')

0     5.0
1     2.0
2     5.2
3     1.6
4    -0.6
5     6.0
dtype: string

One more important point: in general, operating on data structures does **not change** the original data structure, but produces a **copy** of it. This ensures that you don't change your data structures unless you really want to. Many functions and methods have a `copy` or `inplace` argument for this purpose (set to `True` and `False` by default, respectively). To change the original data structure, you need to assign the result to it.

In [12]:
# s4 remained unchanged after the earlier pd.to_numeric() call.
s4

0               1
1             2.3
2    omg a string
3               2
dtype: object

In [13]:
# Now we change it.
s4=pd.to_numeric(s4,errors='coerce')
s4

0    1.0
1    2.3
2    NaN
3    2.0
dtype: float64

### 2.2 Indexing 

We have already mentioned that `Series` and `DataFrame` are indexed. The previous examples had the default index, a sequence of consecutive integers starting with 0. It is possible to pass an explicit index using the `index` argument:

In [14]:
s5 = pd.Series(data=["Larry", "Bill", "Mark", "Steve"], 
               index=["Google", "Microsoft", "Facebook", "Apple"],
               dtype='string')
s5

Google       Larry
Microsoft     Bill
Facebook      Mark
Apple        Steve
dtype: string

The index can be used to select elements of the series:

In [15]:
s5['Google']

'Larry'

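We can also select the series elements by their position, using `.iloc`. Even negative positions are allowed. (Plain square brackets with an integer position, as in `s5[-1]`, are deprecated in recent pandas versions when the index labels are not integers.)

In [16]:
s5.iloc[-1]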

'Steve'
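
To keep label-based and position-based selection apart, pandas offers the explicit accessors `.loc` (labels) and `.iloc` (positions). The sketch below rebuilds a series like `s5` under a new, illustrative name so that it runs on its own.

```python
import pandas as pd

founders = pd.Series(["Larry", "Bill", "Mark", "Steve"],
                     index=["Google", "Microsoft", "Facebook", "Apple"])

by_label = founders.loc["Facebook"]   # select by index label
by_position = founders.iloc[2]        # select by position (third element)
last = founders.iloc[-1]              # negative positions count from the end

# Slicing by label includes the end label; slicing by position excludes the end
label_slice = founders.loc["Google":"Facebook"]
position_slice = founders.iloc[0:2]

print(by_label, by_position, last)
print(len(label_slice), len(position_slice))
```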

The `index` attribute returns an `Index` object, containing the sequence of all the row labels:

In [17]:
s5.index

Index(['Google', 'Microsoft', 'Facebook', 'Apple'], dtype='object')

Another way to define an index is to create a series from a `dictionary`. The `keys` will become the `indices` and the `values` will be the `data` in the series.

In [18]:
my_dict = {"Google": "Larry",
           "Microsoft": "Bill",
           "Facebook": "Mark",
           "Apple": "Steve"}

s6 = pd.Series(my_dict,dtype='string')
s6

Google       Larry
Microsoft     Bill
Facebook      Mark
Apple        Steve
dtype: string

When we define a series from a dictionary, but still pass an index, the dictionary values corresponding to the keys in the index will be **pulled out**:

In [19]:
s7 = pd.Series(my_dict,dtype='string',index=['Facebook','Google','Apple','Google'])
s7

Facebook     Mark
Google      Larry
Apple       Steve
Google      Larry
dtype: string

Yes, some values in the index repeat! This is allowed in pandas.

<img src="media/no-problem-panda-approves.jpg" width="400"/>
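
One consequence of repeated labels is worth knowing: selecting a repeated label returns all matching rows rather than a single value. A minimal sketch, using a series built like `s7` but under a hypothetical name:

```python
import pandas as pd

s = pd.Series(["Mark", "Larry", "Steve", "Larry"],
              index=["Facebook", "Google", "Apple", "Google"])

unique_hit = s["Facebook"]    # label appears once -> a single value
repeated_hit = s["Google"]    # label appears twice -> a Series with two rows

print(type(unique_hit).__name__)
print(type(repeated_hit).__name__, len(repeated_hit))

# The index itself can tell you whether labels repeat
print(s.index.is_unique)
```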

### 2.3 Extracting data from series
Sometimes you need to extract the data from your series and have it in the form of an array. There are two ways to do it: the `to_numpy()` method and the `array` attribute. In both cases, we get an array, but `to_numpy()` returns a `NumPy array` and `array` returns a `pandas ExtensionArray` with pandas' own datatypes. There is not much difference when dealing with numeric data:

In [20]:
s2.to_numpy()

array([ 5. ,  2. ,  5.2,  1.6, -0.6,  6. ])

In [21]:
s2.array

<PandasArray>
[5.0, 2.0, 5.2, 1.6, -0.6, 6.0]
Length: 6, dtype: float64

Both arrays are of type `float`. In the case of non-numeric datatypes, it is more complicated, because `to_numpy()` has to coerce the datatypes that NumPy does not support. With a string series, `array` keeps the `string` datatype, while `to_numpy()` falls back to `object`:

In [22]:
s5.array

<StringArray>
['Larry', 'Bill', 'Mark', 'Steve']
Length: 4, dtype: string

In [23]:
s5.to_numpy()

array(['Larry', 'Bill', 'Mark', 'Steve'], dtype=object)

We can use the same syntax for `index`:

In [24]:
s5.index.array

<PandasArray>
['Google', 'Microsoft', 'Facebook', 'Apple']
Length: 4, dtype: object

In [25]:
s5.index.to_numpy()

array(['Google', 'Microsoft', 'Facebook', 'Apple'], dtype=object)

Which option should you choose, `array` or `to_numpy()`? That depends on how you plan to use the output. If it's going to serve as input into a function, check what that function asks for. Different functions and methods will accept either a `pandas ExtensionArray` or a `NumPy array`. Note that the `Series` itself is also a valid argument to most NumPy functions.
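
As a small sketch of the last remark, a NumPy function can be applied to a Series directly, or to the array extracted with `to_numpy()`. The series and its labels here are made up for illustration.

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, 4.0, 9.0], index=["a", "b", "c"])

# Passing the Series itself: the result is again a Series with the same index
roots = np.sqrt(values)

# Passing the extracted NumPy array: plain NumPy in, plain NumPy out
total = np.sum(values.to_numpy())

print(type(roots).__name__, roots.tolist())
print(total)
```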

There is another option which you may see in code using older versions of pandas, the `values` attribute. It is recommended to avoid it because in case your series contains pandas extension datatypes, it is unclear whether `values` returns a `pandas ExtensionArray` or a `NumPy array`.

In [26]:
s5.values 

<StringArray>
['Larry', 'Bill', 'Mark', 'Steve']
Length: 4, dtype: string

## 3. DataFrames

As mentioned previously, a `DataFrame` is a 2D tabular structure (think Excel sheet). It's the most commonly used data structure in pandas. The documentation for `DataFrame` is available [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html).

Let's **create** our first dataframe: 

In [27]:
df1 = pd.DataFrame([10,122,1])
df1

|   | 0 |
|---|---|
| 0 | 10 |
| 1 | 122 |
| 2 | 1 |


Now this might look like a pandas Series at first sight, but it behaves differently in some ways (and similarly in others). Like a Series, the `DataFrame` has an index, but additionally it has column names (the zero above the horizontal line in this example). Remember that when printing a Series, pandas automatically printed the `datatype`? This does not happen with the DataFrame. The DataFrame as a whole does not have a datatype, but each of its columns does. 

Let's print the dataframe column. We use **square brackets** with the column name to select it. In this case, the **column name** is just a number. (If you remember from the Series section, this was the way to select the Series rows - the first difference!)  
A DataFrame column called in this way is actually a pandas `Series`.

In [28]:
df1[0] 

0     10
1    122
2      1
Name: 0, dtype: int64

There is another way to select a column using **two pairs of square brackets**:

In [29]:
df1[[0]]

|   | 0 |
|---|---|
| 0 | 10 |
| 1 | 122 |
| 2 | 1 |


You probably guessed by the look of it that this is a `DataFrame`! **Remember**: **one** pair of brackets --> `Series`, **two pairs** of brackets --> `DataFrame`.
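
If you want to check the bracket rule yourself, the following sketch prints the type and shape of each selection. It uses a small made-up dataframe with named columns; the double brackets are simply a list of column names, so they can also select several columns at once.

```python
import pandas as pd

df = pd.DataFrame({"col1": [10, 122, 1], "col2": ["a", "b", "c"]})

single = df["col1"]             # one pair of brackets  -> Series
double = df[["col1"]]           # two pairs of brackets -> DataFrame with one column
several = df[["col1", "col2"]]  # a longer list selects more columns

print(type(single).__name__, single.shape)
print(type(double).__name__, double.shape)
print(type(several).__name__, several.shape)
```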

But how can you select a DataFrame row? Like this, selecting by **row name**:

In [30]:
df1.loc[0]

0    10
Name: 0, dtype: int64

Or like this, selecting by **row position**:

In [31]:
df1.iloc[0]

0    10
Name: 0, dtype: int64
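
With the default index, row names and row positions coincide, so `.loc[0]` and `.iloc[0]` return the same row. The difference only shows when the index is something else. A minimal sketch with a deliberately shuffled, made-up index:

```python
import pandas as pd

# Same data as before, but the row labels are no longer 0, 1, 2 in order
df = pd.DataFrame({"value": [10, 122, 1]}, index=[2, 0, 1])

row_named_0 = df.loc[0]    # the row whose label is 0 (the second row)
first_row = df.iloc[0]     # the row at position 0 (the first row)

print(row_named_0["value"])
print(first_row["value"])
```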

Now on to a dataframe with more columns. We will also pass a list of `column names` and another list for the `index`, using the appropriate arguments.

In [32]:
# ignore the weird spacing, it's just to make clear that we have 3 lists of 3 elements
# notice that this is a list of lists

df2 = pd.DataFrame([[1,   2,   7],  
                    [4.2, 6.1, -4.1], 
                    ["a", "b", "z"] ],
                    columns=['col1','col2','col3'],  # <- column names
                    index=['row1','row2','row3'])    # <- row names
df2

|      | col1 | col2 | col3 |
|------|------|------|------|
| row1 | 1 | 2 | 7 |
| row2 | 4.2 | 6.1 | -4.1 |
| row3 | a | b | z |


Hmm, this might be surprising, but that's the way it is - each list is a row, not a column. What to do if you need the lists to become columns? Instead of passing the three lists inside another list, put them into a dictionary. You get the column names for free this time.

In [33]:
df3=pd.DataFrame({ 'col1':[1,   2,   7],  
                   'col2':[4.2, 6.1, -4.1], 
                   'col3':["a", "b", "z"] },
                   index=['row1','row2','row3'])
df3

|      | col1 | col2 | col3 |
|------|------|------|------|
| row1 | 1 | 4.2 | a |
| row2 | 2 | 6.1 | b |
| row3 | 7 | -4.1 | z |
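
The orientation also affects the column datatypes. In the sketch below (same values as above, but illustrative variable names), the row-wise version mixes integers, floats and text in every column, so each column ends up as `object`, while the column-wise version keeps a numeric datatype for the numeric columns.

```python
import pandas as pd

# Each inner list is a row: every column mixes int, float and text
by_rows = pd.DataFrame([[1,   2,   7],
                        [4.2, 6.1, -4.1],
                        ["a", "b", "z"]],
                       columns=["col1", "col2", "col3"])

# Each dictionary value is a column: columns are homogeneous
by_cols = pd.DataFrame({"col1": [1,   2,   7],
                        "col2": [4.2, 6.1, -4.1],
                        "col3": ["a", "b", "z"]})

print(by_rows.dtypes)
print(by_cols.dtypes)
```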


Very often, your data comes in lists, including the column names. It's easy to zip them into a `dictionary`...

In [34]:
company = ["PiggyVest","Bumble","Backstage Capital","Blendoor","LungXpert", "Cisco","Eventbrite",
                "Adafruit Industries","Verge Genomics","23andme"]
founder_name = ["Odunayo","Whitney","Arlan","Stephanie","Sasikala","Sandy","Julia","Limor","Alice","Anne"]
founder_surname = ["Eweniyi","Wolfe Heard","Hamilton","Lampkin","Devi","Lerner","Hartz","Fried","Zhang","Wojcicki"]
column_names=["company","founder_name","founder_surname"]

In [35]:
tech_companies_dictionary=dict(zip(column_names,[company,founder_name,founder_surname]))

... then pass it to the `DataFrame`:

In [36]:
df6 = pd.DataFrame(tech_companies_dictionary)
df6

|   | company | founder_name | founder_surname |
|---|---------|--------------|-----------------|
| 0 | PiggyVest | Odunayo | Eweniyi |
| 1 | Bumble | Whitney | Wolfe Heard |
| 2 | Backstage Capital | Arlan | Hamilton |
| 3 | Blendoor | Stephanie | Lampkin |
| 4 | LungXpert | Sasikala | Devi |
| 5 | Cisco | Sandy | Lerner |
| 6 | Eventbrite | Julia | Hartz |
| 7 | Adafruit Industries | Limor | Fried |
| 8 | Verge Genomics | Alice | Zhang |
| 9 | 23andme | Anne | Wojcicki |


To make it perfect, we will also set the correct datatypes. The `convert_dtypes()` method infers the best possible datatypes. The `dtypes` attribute then lists the datatype of each column. The `dtype: object` on the last line of the output is not the datatype of the dataframe: it describes the Series returned by `dtypes`, whose elements are datatype objects.

In [37]:
df6=df6.convert_dtypes()
df6.dtypes

company            string
founder_name       string
founder_surname    string
dtype: object

You might wonder if it's possible to use the `dtype` argument, as with Series. It is, but only one datatype can be set for the whole dataframe.

In [38]:
df6 = pd.DataFrame(tech_companies_dictionary,dtype='string')
df6.dtypes

company            string
founder_name       string
founder_surname    string
dtype: object
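
When the columns need different datatypes, one option is the `astype` method with a dictionary that maps column names to datatypes. This is a minimal sketch on a made-up two-column dataframe, not the only way to do it.

```python
import pandas as pd

# Hypothetical raw data in which the numbers arrived as text
raw = pd.DataFrame({"company": ["Cisco", "Bumble"],
                    "employees": ["79500", "700"]})

# One datatype per column; astype returns a new dataframe
typed = raw.astype({"company": "string", "employees": "int64"})

print(typed.dtypes)
print(typed["employees"].sum())
```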

### 3.1 Making dataframes from series

Let's do the same thing again, using everything we've learned so far:

In [39]:
# Let's say we have these lists somewhere on our computer: 
company = ["PiggyVest","Bumble","Backstage Capital","Blendoor","LungXpert", "Cisco","Eventbrite",
                "Adafruit Industries","Verge Genomics","23andme"]
founder_name = ["Odunayo","Whitney","Arlan","Stephanie","Sasikala","Sandy","Julia","Limor","Alice","Anne"]
founder_surname = ["Eweniyi","Wolfe Heard","Hamilton","Lampkin","Devi","Lerner","Hartz","Fried","Zhang","Wojcicki"]
column_names=["company","founder_name","founder_surname"]

Let's make some series, using the company name as index: 

In [40]:
series_of_founder_names = pd.Series(data=founder_name, # <-- data 
                                    index=company,     # <-- index 
                                    dtype='string')    # <-- datatype 
series_of_founder_names

PiggyVest                Odunayo
Bumble                   Whitney
Backstage Capital          Arlan
Blendoor               Stephanie
LungXpert               Sasikala
Cisco                      Sandy
Eventbrite                 Julia
Adafruit Industries        Limor
Verge Genomics             Alice
23andme                     Anne
dtype: string

Same thing, this time for surnames: 

In [41]:
series_of_founder_surnames = pd.Series(data=founder_surname, # <-- different data
                                    index=company,           # <-- same index 
                                    dtype='string')          # <-- datatype   
series_of_founder_surnames

PiggyVest                  Eweniyi
Bumble                 Wolfe Heard
Backstage Capital         Hamilton
Blendoor                   Lampkin
LungXpert                     Devi
Cisco                       Lerner
Eventbrite                   Hartz
Adafruit Industries          Fried
Verge Genomics               Zhang
23andme                   Wojcicki
dtype: string

Now with these two series we can create a dataframe! Pandas will notice that they have the same index, and will give the dataframe that index: 

In [42]:
df7 = pd.DataFrame({'founder_name': series_of_founder_names,  
                    'founder_surname': series_of_founder_surnames})
df7

|   | founder_name | founder_surname |
|---|--------------|-----------------|
| PiggyVest | Odunayo | Eweniyi |
| Bumble | Whitney | Wolfe Heard |
| Backstage Capital | Arlan | Hamilton |
| Blendoor | Stephanie | Lampkin |
| LungXpert | Sasikala | Devi |
| Cisco | Sandy | Lerner |
| Eventbrite | Julia | Hartz |
| Adafruit Industries | Limor | Fried |
| Verge Genomics | Alice | Zhang |
| 23andme | Anne | Wojcicki |


When we pass `series` (in this case sharing the index) as the values of a `dictionary`, pandas uses each `key` as a `column` name and the `index` as the `row` names. The column names and the index (row names) are also accessible, as will be shown below.

### 3.2 What if my data isn't a pandas Series?

It will often happen that you have a list or an array:

In [43]:
number_of_employees = [71, 700, 12, 20, 10, 79500, 1000, 105, 49, 683]

In [44]:
series_number_of_employees = pd.Series(data=number_of_employees) # <-- data, no index 
# this has an index, although we did not pass it - Series always has an index
series_number_of_employees

0       71
1      700
2       12
3       20
4       10
5    79500
6     1000
7      105
8       49
9      683
dtype: int64

Now, you may be tempted to add this series directly to the dataframe, and pandas won't stop you:

In [45]:
df8 = pd.DataFrame({'founder_name': series_of_founder_names,  
                    'founder_surname': series_of_founder_surnames,
                    'number_employees': series_number_of_employees})
df8

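|   | founder_name | founder_surname | number_employees |
|---|--------------|-----------------|------------------|
| 0 |  |  | 71.0 |
| 1 |  |  | 700.0 |
| 2 |  |  | 12.0 |
| 3 |  |  | 20.0 |
| 4 |  |  | 10.0 |
| 5 |  |  | 79500.0 |
| 6 |  |  | 1000.0 |
| 7 |  |  | 105.0 |
| 8 |  |  | 49.0 |
| 9 |  |  | 683.0 |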


It worked, but not as you might have expected. The two series with the same index were combined into the same rows, but `series_number_of_employees` has a different index, so separate rows were created. Moreover, the index was **ordered** upon creating the dataframe, unlike in the case of `df7`. Remember to **think about the index** when combining `series` into `dataframes`!
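
The same alignment behaviour can be seen on a tiny scale. In this sketch, two made-up series share only the label `y`; the resulting dataframe has one row for every label found in either series, with missing values wherever a series has no entry for that label.

```python
import pandas as pd

left = pd.Series([1, 2], index=["x", "y"])
right = pd.Series([10, 20], index=["y", "z"])

# Rows are matched by index label, not by position
combined = pd.DataFrame({"left": left, "right": right})

print(combined)
print(combined.isna().sum())
```

Only the row `y` is filled in both columns. Note also that the integer columns became floats, because a missing value (NaN) is itself a float.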

Let's repair our dataframe first before moving on:

In [46]:
series_number_of_employees_indexed = pd.Series(data=number_of_employees,index=company)
df9 = pd.DataFrame({'founder_name': series_of_founder_names,  
                    'founder_surname': series_of_founder_surnames,
                    'number_employees': series_number_of_employees_indexed})
df9

|   | founder_name | founder_surname | number_employees |
|---|--------------|-----------------|------------------|
| PiggyVest | Odunayo | Eweniyi | 71 |
| Bumble | Whitney | Wolfe Heard | 700 |
| Backstage Capital | Arlan | Hamilton | 12 |
| Blendoor | Stephanie | Lampkin | 20 |
| LungXpert | Sasikala | Devi | 10 |
| Cisco | Sandy | Lerner | 79500 |
| Eventbrite | Julia | Hartz | 1000 |
| Adafruit Industries | Limor | Fried | 105 |
| Verge Genomics | Alice | Zhang | 49 |
| 23andme | Anne | Wojcicki | 683 |
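
Another possible approach, sketched here on a shortened, illustrative version of the data, is to assign the plain list as a new column of an existing dataframe. A list has no index, so pandas assigns it by position; the list must have exactly as many elements as the dataframe has rows.

```python
import pandas as pd

names = pd.Series(["Odunayo", "Whitney"],
                  index=["PiggyVest", "Bumble"],
                  dtype="string")
df = pd.DataFrame({"founder_name": names})

# A plain list is assigned row by row, in order, regardless of the index labels
df["number_employees"] = [71, 700]

print(df)
```

This is convenient, but it relies on the list being in the same order as the rows, which is something only you can guarantee.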


### 3.3 Getting the index and column values 

The `DataFrame` object contains a few attributes that are useful for getting an overview of your data.

Get the index (row names), with `.index`: 

In [47]:
df9.index

Index(['PiggyVest', 'Bumble', 'Backstage Capital', 'Blendoor', 'LungXpert',
       'Cisco', 'Eventbrite', 'Adafruit Industries', 'Verge Genomics',
       '23andme'],
      dtype='object')

Get the column names, with `.columns`: 

In [48]:
df9.columns

Index(['founder_name', 'founder_surname', 'number_employees'], dtype='object')

Among other things, this might be used to **iterate** over the column names:

In [49]:
for col in df9.columns:
    print(col)

founder_name
founder_surname
number_employees


We can also use `dtypes` to know the type of each column in the dataframe:

In [50]:
df9.dtypes

founder_name        string
founder_surname     string
number_employees     int64
dtype: object

To get the underlying data as an array, use `.to_numpy()`. The array will be 2D, like the dataframe. The DataFrame object does not have the `.array` attribute. It still has `.values`.

In [51]:
df9.to_numpy()

array([['Odunayo', 'Eweniyi', 71],
       ['Whitney', 'Wolfe Heard', 700],
       ['Arlan', 'Hamilton', 12],
       ['Stephanie', 'Lampkin', 20],
       ['Sasikala', 'Devi', 10],
       ['Sandy', 'Lerner', 79500],
       ['Julia', 'Hartz', 1000],
       ['Limor', 'Fried', 105],
       ['Alice', 'Zhang', 49],
       ['Anne', 'Wojcicki', 683]], dtype=object)

In [52]:
df9.values

array([['Odunayo', 'Eweniyi', 71],
       ['Whitney', 'Wolfe Heard', 700],
       ['Arlan', 'Hamilton', 12],
       ['Stephanie', 'Lampkin', 20],
       ['Sasikala', 'Devi', 10],
       ['Sandy', 'Lerner', 79500],
       ['Julia', 'Hartz', 1000],
       ['Limor', 'Fried', 105],
       ['Alice', 'Zhang', 49],
       ['Anne', 'Wojcicki', 683]], dtype=object)
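
The array above has the `object` datatype because a NumPy array has one datatype for all of its elements, and the columns mix text and numbers. If you select only numeric columns first, the array gets a numeric datatype. A minimal sketch with made-up columns (the `rating` values are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({"name": ["Cisco", "Bumble"],
                   "employees": [79500, 700],
                   "rating": [4.1, 3.9]})

mixed = df.to_numpy()                               # text + numbers -> object
numeric = df[["employees", "rating"]].to_numpy()    # int + float    -> float64

print(mixed.dtype, mixed.shape)
print(numeric.dtype, numeric.shape)
```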

## 4. Previewing and describing a DataFrame

### 4.1 Previewing the DataFrame or part of it

In a jupyter notebook, calling a DataFrame will display it (as seen previously):

In [53]:
df9

|   | founder_name | founder_surname | number_employees |
|---|--------------|-----------------|------------------|
| PiggyVest | Odunayo | Eweniyi | 71 |
| Bumble | Whitney | Wolfe Heard | 700 |
| Backstage Capital | Arlan | Hamilton | 12 |
| Blendoor | Stephanie | Lampkin | 20 |
| LungXpert | Sasikala | Devi | 10 |
| Cisco | Sandy | Lerner | 79500 |
| Eventbrite | Julia | Hartz | 1000 |
| Adafruit Industries | Limor | Fried | 105 |
| Verge Genomics | Alice | Zhang | 49 |
| 23andme | Anne | Wojcicki | 683 |


If the dataframe has a lot of entries, it will be only partially displayed. Nonetheless, this might still be too much information to display at once. Alternatives are the `.head()` and `.tail()` methods, which return only a certain number of entries from the top and bottom of the dataframe, respectively.

In [54]:
df9.head(n=2)

|   | founder_name | founder_surname | number_employees |
|---|--------------|-----------------|------------------|
| PiggyVest | Odunayo | Eweniyi | 71 |
| Bumble | Whitney | Wolfe Heard | 700 |


In [55]:
df9.tail(n=2)

|   | founder_name | founder_surname | number_employees |
|---|--------------|-----------------|------------------|
| Verge Genomics | Alice | Zhang | 49 |
| 23andme | Anne | Wojcicki | 683 |
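
A few details about `n`, sketched on a made-up dataframe of 20 rows: when `n` is omitted, both methods return 5 rows, and a negative `n` in `head` means "everything except the last `n` rows".

```python
import pandas as pd

df = pd.DataFrame({"n": range(20)})

default_head = df.head()         # no argument: first 5 rows
last_three = df.tail(3)          # n can also be passed positionally
all_but_last_18 = df.head(-18)   # negative n: drop the last 18 rows

print(len(default_head))
print(last_three["n"].tolist())
print(all_but_last_18["n"].tolist())
```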


### 4.2 Retrieving DataFrame information

[`.shape`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.shape.html) returns a tuple with the dimensions of the dataframe (number_of_rows, number_of_columns).

In [56]:
df9.shape

(10, 3)

With [`.info()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html), we obtain:
- the number of entries
- the number of columns
- the title of each column
- the number of non-null entries in each column (missing values are not counted!)
- the type of data of the entries of a given column.

In [57]:
df9.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10 entries, PiggyVest to 23andme
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   founder_name      10 non-null     string
 1   founder_surname   10 non-null     string
 2   number_employees  10 non-null     int64 
dtypes: int64(1), string(2)
memory usage: 620.0+ bytes
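
Our dataframe has no missing values, so every column shows `10 non-null`. To see how the non-null count behaves otherwise, here is a small sketch with made-up data containing gaps. It uses `count()` and `isna()`, which give the same information as the `Non-Null Count` column in a form you can work with in code.

```python
import pandas as pd

# Hypothetical data with one missing value per column
df = pd.DataFrame({"name": ["Ada", "Grace", None],
                   "age": [36.0, None, 45.0]})

non_null = df.count()        # number of non-missing entries per column
missing = df.isna().sum()    # number of missing entries per column

print(non_null)
print(missing)
```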


For the **numerical** columns it's also possible to obtain basic statistical information using [`.describe()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html):

- the number of rows for each numerical column
- the mean value
- the standard deviation
- the minimum and maximum value
- the median, the 25th and 75th percentile.

In [58]:
df9.describe()

|   | number_employees |
|---|------------------|
| count | 10.0 |
| mean | 8215.0 |
| std | 25049.647037 |
| min | 10.0 |
| 25% | 27.25 |
| 50% | 88.0 |
| 75% | 695.75 |
| max | 79500.0 |
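
Notice that the text columns were silently left out and that the median appears as the `50%` row. If you also want a summary of the non-numeric columns, `describe` accepts `include="all"`. A minimal sketch on a small made-up dataframe:

```python
import pandas as pd

df = pd.DataFrame({"company": ["Cisco", "Bumble", "Cisco"],
                   "employees": [79500, 700, 79500]})

numeric_only = df.describe()             # default: numeric columns only
everything = df.describe(include="all")  # adds count/unique/top/freq for text

print(numeric_only)
print(everything)
```

In the second table, statistics that do not apply to a column (for example, the mean of a text column) are shown as missing values.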


## 5. Reading data from files into pandas dataframes

Pandas has functions that allow us to create `dataframes` from several different types of data `files`:

- CSV
- JSON
- HTML
- ... and [many more](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html)

All of this is possible by using the read_*dataFormat* functions (e.g. `pd.read_csv`, `pd.read_json`, `pd.read_html`).

For instance, using the 2010 census profile and housing characteristics of the city of Los Angeles ([source](https://catalog.data.gov/dataset/2010-census-populations-by-zip-code)):

In [59]:
census_2010 = pd.read_csv("data/2010_Census_Populations_by_Zip_Code.csv")

This is the resulting dataframe:

In [60]:
census_2010.head()

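|   | Zip Code | Total Population | Median Age | Total Males | Total Females | Total Households | Average Household Size |
|---|----------|------------------|------------|-------------|---------------|------------------|------------------------|
| 0 | 91371 | 1 | 73.5 | 0 | 1 | 1 | 1.0 |
| 1 | 90001 | 57110 | 26.6 | 28468 | 28642 | 12971 | 4.4 |
| 2 | 90002 | 51223 | 25.5 | 24876 | 26347 | 11731 | 4.36 |
| 3 | 90003 | 66266 | 26.3 | 32631 | 33635 | 15642 | 4.22 |
| 4 | 90004 | 62180 | 34.8 | 31302 | 30878 | 22547 | 2.73 |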


Its size is:

In [61]:
census_2010.shape

(319, 7)

Let's use `info()` to get an overview of the column variables: 

In [62]:
census_2010.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 319 entries, 0 to 318
Data columns (total 7 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Zip Code                319 non-null    int64  
 1   Total Population        319 non-null    int64  
 2   Median Age              319 non-null    float64
 3   Total Males             319 non-null    int64  
 4   Total Females           319 non-null    int64  
 5   Total Households        319 non-null    int64  
 6   Average Household Size  319 non-null    float64
dtypes: float64(2), int64(5)
memory usage: 17.6 KB


And `.describe()` for basic statistics:

In [63]:
census_2010.describe()

|   | Zip Code | Total Population | Median Age | Total Males | Total Females | Total Households | Average Household Size |
|---|----------|------------------|------------|-------------|---------------|------------------|------------------------|
| count | 319.0 | 319.0 | 319.0 | 319.0 | 319.0 | 319.0 | 319.0 |
| mean | 91000.673981 | 33241.341693 | 36.527586 | 16391.564263 | 16849.777429 | 10964.570533 | 2.828119 |
| std | 908.360203 | 21644.417455 | 8.692999 | 10747.495566 | 10934.986468 | 6270.6464 | 0.835658 |
| min | 90001.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 90243.5 | 19318.5 | 32.4 | 9763.5 | 9633.5 | 6765.5 | 2.435 |
| 50% | 90807.0 | 31481.0 | 37.1 | 15283.0 | 16202.0 | 10968.0 | 2.83 |
| 75% | 91417.0 | 44978.0 | 41.0 | 22219.5 | 22690.5 | 14889.5 | 3.32 |
| max | 93591.0 | 105549.0 | 74.0 | 52794.0 | 53185.0 | 31087.0 | 4.67 |
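
Notice that `Zip Code` was read as an integer, which is why `describe()` computes a (meaningless) mean for it. Since a zip code is really a label, one option is to tell `read_csv` which datatype to use through its `dtype` argument. The sketch below reads from a short made-up CSV text (wrapped in `io.StringIO` so that no file is needed); the values, including the zip code with a leading zero, are invented for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = "Zip Code,Total Population\n01234,100\n90001,57110\n"

# Default: the column is inferred as integer and the leading zero is lost
default = pd.read_csv(io.StringIO(csv_text))

# With dtype: the column is kept as text, exactly as written in the file
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"Zip Code": "string"})

print(default["Zip Code"].tolist())
print(as_text["Zip Code"].tolist())
```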


## 6. Writing data from pandas into files

Besides reading from the disk, Pandas allows us to save our dataframe to a file.

In [64]:
census_2010.to_csv("data/new_csv.csv")

You should now have a new file called `new_csv.csv` in your `data` folder!

In the same way that we can read data from various file types, we can also write data to various file types (CSV, JSON, HTML, ...). All of this is possible by using the to_*dataFormat* methods, giving as an argument the path where you want to save the file. For example, you can write to the JSON format using `to_json`, or to an Excel spreadsheet using `to_excel`, and so on.
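
One detail to keep in mind: by default, `to_csv` also writes the index as the first, unnamed column. If you read such a file back, the old index shows up as a column called `Unnamed: 0`. The sketch below illustrates this with a small made-up dataframe; it calls `to_csv` without a path, which returns the CSV text instead of writing a file.

```python
import io
import json
import pandas as pd

df = pd.DataFrame({"zip": [90001, 90002], "population": [57110, 51223]})

with_index = df.to_csv()                # index written as an unnamed first column
without_index = df.to_csv(index=False)  # index left out

# Reading the first version back: the old index becomes a regular column
round_trip = pd.read_csv(io.StringIO(with_index))

# to_json without a path returns a string too; orient="records" gives one object per row
records = json.loads(df.to_json(orient="records"))

print(with_index.splitlines()[0])
print(without_index.splitlines()[0])
print(list(round_trip.columns))
print(records)
```

If the index carries no information (as with the default 0, 1, 2, ... index), passing `index=False` keeps the file clean.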

## 7. Useful links

- [Pandas Getting Started tutorials](https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/index.html)

- [Intro to data structures](https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html)

#### Image acknowledgements:
Détail de la stèle du Code de Hammurabi, roi de Babylone (musée du Louvre): By Deror avi - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=6042133  
First medical X-ray by Wilhelm Röntgen of his wife Anna Bertha Ludwig's hand: Wilhelm Röntgen., Public domain, via Wikimedia Commons
Magura cave drawing: By Vislupus - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=68925028  
Interior of the Duomo (Milan): By Darafsh - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=50973859  
