# SLU1 - Pandas 101: Learning notebook

In this notebook we will be covering the following:

- What is Pandas
- Series
- Dataframes
- Previewing a dataframe
- Columns
- Count
- Shape
- Reading data from disk
- Info
- Describe

## What is pandas?

In their own words:

> pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

What this means is that pandas is a library for Python that allows us to easily load data into a structure that we can read, print, manipulate and transform in order to extract the most from it.

In this notebook the most basic functionalities will be covered.

### How do I call it?

In [1]:
import pandas as pd

Notice that we import pandas as "pd". This is not required, but it's standard.

### Pandas Data Structures

There are two main data structures in pandas:
- **Series** - An array of data of the same type.
- **Dataframes** - Tabular structure that may be seen as a container of series (that may have different types).

## Series

Creating a series in pandas is really easy. We will start by creating a series of numbers and printing it to see what it looks like.

[Documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html)

In [2]:
s1 = pd.Series([10, 3, 5, 1, 12])
print(s1)

0    10
1     3
2     5
3     1
4    12
dtype: int64


The output shows the values we defined as data, each paired with an index (from 0 to the length of the data minus 1). Notice as well that the series has one and only one data type, in this case int64.
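As a small sketch of how that single type is chosen (the variable names below are illustrative and not part of the original notebook), pandas infers the dtype from the values, and it can be inspected through the `dtype` attribute:

```python
import pandas as pd

# Integers only: pandas infers an integer dtype.
ints = pd.Series([10, 3, 5, 1, 12])

# One float among the values is enough to make the whole series float.
mixed_numbers = pd.Series([10, 3.5, 5])

print(ints.dtype)
print(mixed_numbers.dtype)

# The default index runs from 0 to len(data) - 1.
print(list(ints.index))
```

Mixing integers and floats still yields a numeric dtype; only when values cannot share a numeric type does pandas fall back to a more general one.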

In [3]:
s2 = pd.Series(["Google", "Microsoft", "Facebook", "Apple"])
print(s2)

0       Google
1    Microsoft
2     Facebook
3        Apple
dtype: object


We can also create series of strings. Notice that in this case the data type is _object_. An object series can hold references to values of different types, as we can see in the example below.

In [4]:
s3 = pd.Series([1, 2.0, 2, "d"])
print(s3)

0    1
1    2
2    2
3    d
dtype: object


In [5]:
print(type(s3[0]))
print(type(s3[1]))
print(type(s3[3]))

<class 'int'>
<class 'float'>
<class 'str'>


It is also possible to define our own index for the data.

In [6]:
s4 = pd.Series(data=["Larry", "Bill", "Mark", "Steve"], 
               index=["Google", "Microsoft", "Facebook", "Apple"])
print(s4)

Google       Larry
Microsoft     Bill
Facebook      Mark
Apple        Steve
dtype: object


Notice as well that we are able to access the data by its index,

In [7]:
s4["Google"]

'Larry'

and also extract arrays with either the data or the index,

In [8]:
s4.values

array(['Larry', 'Bill', 'Mark', 'Steve'], dtype=object)

In [9]:
s4.index

Index(['Google', 'Microsoft', 'Facebook', 'Apple'], dtype='object')
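A minimal sketch of the two explicit ways of accessing an entry, assuming the same data as `s4` (the variable names are illustrative): `.loc` selects by index label and `.iloc` by integer position, while `tolist()` turns the values or the index into plain Python lists.

```python
import pandas as pd

founders = pd.Series(data=["Larry", "Bill", "Mark", "Steve"],
                     index=["Google", "Microsoft", "Facebook", "Apple"])

# .loc selects by index label, .iloc by integer position.
by_label = founders.loc["Facebook"]
by_position = founders.iloc[2]

# tolist() converts the values or the index into plain Python lists.
names = founders.tolist()
companies = founders.index.tolist()

print(by_label, by_position)
print(names)
print(companies)
```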

You may have noticed that this closely resembles the behaviour of Python dictionaries. Let's see how the Series constructor behaves when a dictionary is used as data.

In [10]:
my_dict = {"Google": "Larry",
           "Microsoft": "Bill",
           "Facebook": "Mark",
           "Apple": "Steve"}

s5 = pd.Series(my_dict)
print(s5)

Apple        Steve
Facebook      Mark
Google       Larry
Microsoft     Bill
dtype: object


The Series constructor will automatically use the keys of the dictionary as the index of the series and the corresponding values as its data. The interesting part is that we now have some functionality that dictionaries usually don't offer, such as slicing.

In [11]:
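try:
    # Dictionaries do not support slicing, so this raises an exception
    # (TypeError, or KeyError on Python 3.12+ where slices are hashable).
    my_dict[-1:]
except (TypeError, KeyError):
    print("Illegal operation")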

Illegal operation


In [12]:
s5[-1:]

Microsoft    Bill
dtype: object
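The same slice can be sketched with explicit positional selection. Which entry counts as "last" depends on the order of the index: recent pandas versions keep the insertion order of the dictionary, whereas the output above comes from a version that sorted the keys. Sorting the index explicitly, as assumed below, removes that ambiguity (the variable names are illustrative).

```python
import pandas as pd

founders_dict = {"Google": "Larry",
                 "Microsoft": "Bill",
                 "Facebook": "Mark",
                 "Apple": "Steve"}
founders = pd.Series(founders_dict)

# Sorting the index makes the order explicit instead of relying on
# how the dictionary happened to be ordered.
sorted_founders = founders.sort_index()

# .iloc always slices by position: here, the last entry only.
last_entry = sorted_founders.iloc[-1:]
print(last_entry)

# Label-based slices are also possible and include both endpoints.
label_slice = sorted_founders.loc["Facebook":"Google"]
print(label_slice)
```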

#### Key points:

- Each series has only one data type (even if it is a more inclusive one, like object).
- A custom index can be provided (it must have the same length as the data).
- It is possible to use dictionaries to create series.

---

## DataFrames

As mentioned previously, a dataframe is a tabular structure. This will become clear with the following examples.

[Documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html)

In [13]:
df1 = pd.DataFrame([10,122,1])

In [14]:
df1

| | 0 |
|---|---|
| 0 | 10 |
| 1 | 122 |
| 2 | 1 |


This first dataframe is a really simple one. We can see that in this case we have columns and indexes. This shows the tabular structure. But the next case will highlight this even further.

In [15]:
df2 = pd.DataFrame([ [1, 2, 3, 7], [4.2, 6.1, 8.9, -4.1], ["a", "b", "c", "z"] ])

In [16]:
df2

| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 7 |
| 1 | 4.2 | 6.1 | 8.9 | -4.1 |
| 2 | a | b | c | z |


Notice that the way this dataframe is being created leads to a row of values for each list of data provided. It is also possible to provide a list of names for each of the columns...

In [17]:
df3 = pd.DataFrame([ [1, 2, 3, 7], [4.2, 6.1, 8.9, -4.1], ["a", "b", "c", "z"] ],
                   columns=["col_1", "col_2", "col_3", "col_4"])

In [18]:
df3

| | col_1 | col_2 | col_3 | col_4 |
|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 7 |
| 1 | 4.2 | 6.1 | 8.9 | -4.1 |
| 2 | a | b | c | z |


as well as for each of the rows.

In [19]:
df4 = pd.DataFrame([ [1, 2, 3, 7], [4.2, 6.1, 8.9, -4.1], ["a", "b", "c", "z"] ],
                   columns=["col_1", "col_2", "col_3", "col_4"],
                   index=["row_1", "row_2", "row_3"])

In [20]:
df4

| | col_1 | col_2 | col_3 | col_4 |
|---|---|---|---|---|
| row_1 | 1 | 2 | 3 | 7 |
| row_2 | 4.2 | 6.1 | 8.9 | -4.1 |
| row_3 | a | b | c | z |


Let's now explore the creation of dataframes from dictionaries and series as well.

In [21]:
company = ["Google", "Microsoft", "Facebook", "Apple"]
founder_name = ["Larry", "Bill", "Mark", "Steve"]
founder_surname = ["Page", "Gates", "Zuckerberg", "Jobs"]

df5 = pd.DataFrame( [ company, founder_name, founder_surname])

In [22]:
df5

| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | Google | Microsoft | Facebook | Apple |
| 1 | Larry | Bill | Mark | Steve |
| 2 | Page | Gates | Zuckerberg | Jobs |


This is the same behaviour that we noticed in the previous examples: each list provided becomes a row of the dataframe.

In [23]:
df6 = pd.DataFrame({'company': company,
                    'founder_name': founder_name,
                    'founder_surname': founder_surname})

In [24]:
df6

| | company | founder_name | founder_surname |
|---|---|---|---|
| 0 | Google | Larry | Page |
| 1 | Microsoft | Bill | Gates |
| 2 | Facebook | Mark | Zuckerberg |
| 3 | Apple | Steve | Jobs |


By passing a dictionary as input to the creation of the dataframe, the dataframe is now able to use the key of the dictionary as the name of the column and present the data along a column, instead of along a row. This is becoming closer to how information is usually presented.
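The difference between the two orientations can be sketched as follows. This is only an illustration, with hypothetical variable names: a list of lists fills rows, a dictionary of lists fills columns, and transposing one layout with `.T` gives the values of the other.

```python
import pandas as pd

company = ["Google", "Microsoft", "Facebook", "Apple"]
founder_name = ["Larry", "Bill", "Mark", "Steve"]

# List of lists: each inner list becomes a row.
by_rows = pd.DataFrame([company, founder_name])

# Dictionary of lists: each key becomes a column.
by_columns = pd.DataFrame({"company": company,
                           "founder_name": founder_name})

# Transposing the row-oriented frame swaps rows and columns,
# so its values line up with the column-oriented frame.
transposed = by_rows.T
transposed.columns = ["company", "founder_name"]

print(by_rows.shape, by_columns.shape)
print(transposed.values.tolist() == by_columns.values.tolist())
```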

In [25]:
series_names = pd.Series(data=founder_name, index=company)
series_surnames = pd.Series(data=founder_surname, index=company)

df7 = pd.DataFrame({'founder_name': series_names,
                    'founder_surname': series_surnames})

df7

| | founder_name | founder_surname |
|---|---|---|
| Google | Larry | Page |
| Microsoft | Bill | Gates |
| Facebook | Mark | Zuckerberg |
| Apple | Steve | Jobs |


By passing series (in this case sharing the same index) as the values of a dictionary, the DataFrame constructor uses each key as a column name and the index labels as row names. The columns and the index (rows) are also accessible, as will be shown below.
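When the series do not share exactly the same index, pandas matches rows by label rather than by position. The sketch below uses deliberately incomplete, illustrative series to show this; labels missing from one of the series end up as NaN.

```python
import pandas as pd

names = pd.Series(data=["Larry", "Bill", "Mark"],
                  index=["Google", "Microsoft", "Facebook"])
surnames = pd.Series(data=["Gates", "Zuckerberg", "Jobs"],
                     index=["Microsoft", "Facebook", "Apple"])

# Rows are matched by index label, not by position; labels missing
# from one of the series are filled with NaN.
aligned = pd.DataFrame({"founder_name": names,
                        "founder_surname": surnames})
print(aligned)
```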

#### Creating a DataFrame to use for the next topics

In [26]:
company = ["Google", "Microsoft", "Facebook", "Apple", "Oracle", "Netflix"]
founder_name = ["Larry", "Bill", "Mark", "Steve", "Larry", "Reed"]
founder_surname = ["Page", "Gates", "Zuckerberg", "Jobs", "Ellison", "Hastings"]
year_found = [1998, 1975, 2004, 1976, 1977, 1997]

series_company = pd.Series(data=company)
series_names = pd.Series(data=founder_name)
series_surnames = pd.Series(data=founder_surname)
series_year_found = pd.Series(data=year_found)

In [27]:
df = pd.DataFrame({'company': series_company,
                   'founder_name': series_names,
                   'founder_surname': series_surnames,
                   'year_founded': series_year_found})

df

| | company | founder_name | founder_surname | year_founded |
|---|---|---|---|---|
| 0 | Google | Larry | Page | 1998 |
| 1 | Microsoft | Bill | Gates | 1975 |
| 2 | Facebook | Mark | Zuckerberg | 2004 |
| 3 | Apple | Steve | Jobs | 1976 |
| 4 | Oracle | Larry | Ellison | 1977 |
| 5 | Netflix | Reed | Hastings | 1997 |


To add new columns to the dataframe, it is not necessary to repeat these steps.

In [28]:
df["location"] = ["Mountain View", "Albuquerque", "Menlo Park", "Cupertino", "Santa Clara", "Scotts Valley"]
df["state"] = ["California", "New Mexico", "California", "California", "California", "California"]
df["number_employees"] = [73992, 124000, 20658, 123000, 138000, 5400]

In [29]:
df

| | company | founder_name | founder_surname | year_founded | location | state | number_employees |
|---|---|---|---|---|---|---|---|
| 0 | Google | Larry | Page | 1998 | Mountain View | California | 73992 |
| 1 | Microsoft | Bill | Gates | 1975 | Albuquerque | New Mexico | 124000 |
| 2 | Facebook | Mark | Zuckerberg | 2004 | Menlo Park | California | 20658 |
| 3 | Apple | Steve | Jobs | 1976 | Cupertino | California | 123000 |
| 4 | Oracle | Larry | Ellison | 1977 | Santa Clara | California | 138000 |
| 5 | Netflix | Reed | Hastings | 1997 | Scotts Valley | California | 5400 |
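A short sketch of two details of column assignment, using a smaller illustrative frame: a new column can be computed from an existing one, and a list whose length does not match the number of rows is rejected with a `ValueError`.

```python
import pandas as pd

companies = pd.DataFrame({"company": ["Google", "Microsoft", "Facebook"],
                          "year_founded": [1998, 1975, 2004]})

# A new column can be derived from an existing one.
companies["founded_before_2000"] = companies["year_founded"] < 2000

# A list with the wrong number of values is rejected.
try:
    companies["state"] = ["California", "New Mexico"]
except ValueError as error:
    print("Could not add column:", error)

print(companies)
```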


It is still possible to get the indexes of the dataframe and the columns as well.

In [30]:
df.index

RangeIndex(start=0, stop=6, step=1)

In [31]:
df.columns

Index(['company', 'founder_name', 'founder_surname', 'year_founded',
       'location', 'state', 'number_employees'],
      dtype='object')

For instance, this can be used to iterate over the column names.

In [32]:
for col in df.columns:
    print(col)

company
founder_name
founder_surname
year_founded
location
state
number_employees
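A related sketch, on a small illustrative frame: the `dtypes` attribute is a Series indexed by column name, so iterating over its items gives each column name together with its data type.

```python
import pandas as pd

companies = pd.DataFrame({"company": ["Google", "Microsoft"],
                          "year_founded": [1998, 1975]})

# df.dtypes is a Series indexed by column name, so items() yields
# (column name, dtype) pairs.
for column_name, dtype in companies.dtypes.items():
    print(column_name, dtype)

# The column index converts to a plain list when one is needed.
column_names = list(companies.columns)
print(column_names)
```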


#### Key points:

- DataFrames may be seen as a tabular structure (named rows and columns).
- We can define the indexes and columns as we create the dataframe.
- It is possible to take advantage of dictionaries and Series to create DataFrames.
- To add new columns of data it is not necessary to create a new dataframe.

---

## Previewing a DataFrame

#### Visualizing the DataFrame or part of it

In a Jupyter notebook, a DataFrame is displayed by evaluating it as the last expression of a cell (as seen previously).

In [33]:
df

| | company | founder_name | founder_surname | year_founded | location | state | number_employees |
|---|---|---|---|---|---|---|---|
| 0 | Google | Larry | Page | 1998 | Mountain View | California | 73992 |
| 1 | Microsoft | Bill | Gates | 1975 | Albuquerque | New Mexico | 124000 |
| 2 | Facebook | Mark | Zuckerberg | 2004 | Menlo Park | California | 20658 |
| 3 | Apple | Steve | Jobs | 1976 | Cupertino | California | 123000 |
| 4 | Oracle | Larry | Ellison | 1977 | Santa Clara | California | 138000 |
| 5 | Netflix | Reed | Hastings | 1997 | Scotts Valley | California | 5400 |


If the dataframe has a lot of entries, it will only be partially displayed. Even so, that may still be too much information at once, and the methods used below are often a better alternative. Namely, it is possible to display only a certain number of entries from the top or from the bottom using .head and .tail, respectively.

In [44]:
df.head(n=2)

| | company | founder_name | founder_surname | year_founded | location | state | number_employees |
|---|---|---|---|---|---|---|---|
| 0 | Google | Larry | Page | 1998 | Mountain View | California | 73992 |
| 1 | Microsoft | Bill | Gates | 1975 | Albuquerque | New Mexico | 124000 |


In [45]:
df.tail(n=2)

| | company | founder_name | founder_surname | year_founded | location | state | number_employees |
|---|---|---|---|---|---|---|---|
| 4 | Oracle | Larry | Ellison | 1977 | Santa Clara | California | 138000 |
| 5 | Netflix | Reed | Hastings | 1997 | Scotts Valley | California | 5400 |
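As a quick sketch on a hypothetical 20-row frame, both methods default to 5 rows when `n` is omitted, and `n` can also be passed positionally:

```python
import pandas as pd

# A hypothetical frame with 20 rows, only used to show the defaults.
numbers = pd.DataFrame({"n": range(20)})

first_rows = numbers.head()   # no argument: first 5 rows
last_rows = numbers.tail(3)   # last 3 rows

print(first_rows)
print(last_rows)
```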


#### Getting the relevant info

[count](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.count.html) may be used to get the count of entries that are not null.

In [36]:
df.count()

company             6
founder_name        6
founder_surname     6
year_founded        6
location            6
state               6
number_employees    6
dtype: int64
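In the dataframe above every column is complete, so all counts are equal. The sketch below uses made-up data with two missing values to show how `count()` differs per column, alongside `isna().sum()` for the complementary number of missing entries.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing values, for illustration only.
incomplete = pd.DataFrame({"company": ["Google", "Microsoft", "Facebook"],
                           "state": ["California", None, "California"],
                           "number_employees": [73992, np.nan, 20658]})

# count() skips missing values, so columns can report different totals.
non_null_counts = incomplete.count()
print(non_null_counts)

# isna().sum() gives the complementary number of missing entries.
missing_counts = incomplete.isna().sum()
print(missing_counts)
```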

With pandas' [info](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.info.html) it is possible to obtain:
- How many entries it has.
- The total number of columns.
- The title of each column.
- The number of non-null entries in each column.
- The type of data of the entries of a given column.

In [37]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 7 columns):
company             6 non-null object
founder_name        6 non-null object
founder_surname     6 non-null object
year_founded        6 non-null int64
location            6 non-null object
state               6 non-null object
number_employees    6 non-null int64
dtypes: int64(2), object(5)
memory usage: 416.0+ bytes


For the numerical variables it's also possible to print some more information using [describe](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html), namely:

- The number of non-null entries for each of those columns.
- The mean value.
- The standard deviation.
- The minimum and maximum value.
- The median, the 25th and 75th percentile.

In [38]:
df.describe()

| | year_founded | number_employees |
|---|---|---|
| count | 6.0 | 6.0 |
| mean | 1987.833333 | 80841.666667 |
| std | 13.197222 | 57039.422527 |
| min | 1975.0 | 5400.0 |
| 25% | 1976.25 | 33991.5 |
| 50% | 1987.0 | 98496.0 |
| 75% | 1997.75 | 123750.0 |
| max | 2004.0 | 138000.0 |
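Each row of this summary can also be computed on its own. The sketch below reuses the year_founded values from the dataframe above in a standalone, illustratively named series and calls the corresponding methods one by one.

```python
import pandas as pd

year_founded = pd.Series([1998, 1975, 2004, 1976, 1977, 1997])

# The individual statistics reported by describe() are also
# available as separate methods.
print(year_founded.mean())          # arithmetic mean
print(year_founded.std())           # sample standard deviation
print(year_founded.median())        # same as the 50% row
print(year_founded.quantile(0.25))  # same as the 25% row

# describe() on a Series returns the same summary, indexed by name.
summary = year_founded.describe()
print(summary["max"])
```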


Finally, [shape](https://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.shape.html) returns a tuple with the dimensions of the dataframe (nr_rows, nr_columns).

In [39]:
df.shape

(6, 7)
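Because shape is an ordinary tuple, it can be unpacked directly. A minimal sketch on a small illustrative frame, together with two related values, `len()` and `size`:

```python
import pandas as pd

companies = pd.DataFrame({"company": ["Google", "Microsoft", "Facebook"],
                          "year_founded": [1998, 1975, 2004]})

# The tuple unpacks directly into the two dimensions.
n_rows, n_columns = companies.shape

print(n_rows, n_columns)
print(len(companies))    # number of rows
print(companies.size)    # rows * columns
```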

#### Key points:

- It is possible to display the whole dataframe, but for large ones this shows too many lines and might be too "noisy".
- head() and tail() return the top and bottom n lines of the dataframe, respectively.
- Count returns the number of entries for each column that are not null.
- Info returns the number of entries, the number of columns, their counts and the data type.
- Describe returns basic statistical information of the numeric columns.
- Shape returns a tuple with the number of rows and columns.

---

## Reading from the disk

Pandas implements functions that allow us to create dataframes from several different types of data:

- CSV
- JSON
- HTML
- ... and [many more](https://pandas.pydata.org/pandas-docs/stable/io.html)

All of this is possible by using the read_*dataFormat* functions (for example, read_csv or read_json). With them it is possible to create a dataframe and apply all the previously shown techniques.

For instance, using the 2010 census profile and housing characteristics of the city of Los Angeles ([source](https://catalog.data.gov/dataset?res_format=CSV)):

In [40]:
data = pd.read_csv("data/2010_Census_Populations_by_Zip_Code.csv")

In [41]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 319 entries, 0 to 318
Data columns (total 7 columns):
Zip Code                  319 non-null int64
Total Population          319 non-null int64
Median Age                319 non-null float64
Total Males               319 non-null int64
Total Females             319 non-null int64
Total Households          319 non-null int64
Average Household Size    319 non-null float64
dtypes: float64(2), int64(5)
memory usage: 17.5 KB


In [42]:
data.describe()

| | Zip Code | Total Population | Median Age | Total Males | Total Females | Total Households | Average Household Size |
|---|---|---|---|---|---|---|---|
| count | 319.0 | 319.0 | 319.0 | 319.0 | 319.0 | 319.0 | 319.0 |
| mean | 91000.673981 | 33241.341693 | 36.527586 | 16391.564263 | 16849.777429 | 10964.570533 | 2.828119 |
| std | 908.360203 | 21644.417455 | 8.692999 | 10747.495566 | 10934.986468 | 6270.6464 | 0.835658 |
| min | 90001.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 90243.5 | 19318.5 | 32.4 | 9763.5 | 9633.5 | 6765.5 | 2.435 |
| 50% | 90807.0 | 31481.0 | 37.1 | 15283.0 | 16202.0 | 10968.0 | 2.83 |
| 75% | 91417.0 | 44978.0 | 41.0 | 22219.5 | 22690.5 | 14889.5 | 3.32 |
| max | 93591.0 | 105549.0 | 74.0 | 52794.0 | 53185.0 | 31087.0 | 4.67 |


In [43]:
data.head()

| | Zip Code | Total Population | Median Age | Total Males | Total Females | Total Households | Average Household Size |
|---|---|---|---|---|---|---|---|
| 0 | 91371 | 1 | 73.5 | 0 | 1 | 1 | 1.0 |
| 1 | 90001 | 57110 | 26.6 | 28468 | 28642 | 12971 | 4.4 |
| 2 | 90002 | 51223 | 25.5 | 24876 | 26347 | 11731 | 4.36 |
| 3 | 90003 | 66266 | 26.3 | 32631 | 33635 | 15642 | 4.22 |
| 4 | 90004 | 62180 | 34.8 | 31302 | 30878 | 22547 | 2.73 |
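To try read_csv without a file on disk, the function also accepts a file-like object. The sketch below is only an illustration: the CSV text is made up (it is not taken from the census file) and is wrapped in `io.StringIO` so the example is self-contained.

```python
import io
import pandas as pd

# Made-up rows in CSV form, standing in for a file on disk.
csv_text = """Zip Code,Total Population,Median Age
10001,100,30.5
10002,250,41.0
"""

# read_csv accepts a path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)
print(sample.dtypes)
```

The first line of the text is used as the column names, and the column types are inferred from the values, just as with the census file above.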


#### Key points:

- Pandas allows the creation of dataframes from several structures of data.

----

## To learn more (optional)

- [Intro to data structures](https://pandas.pydata.org/pandas-docs/stable/dsintro.html)