
# Introduction to Pandas

### What is Pandas?
As Python has grown to become a popular solution for data analysis, many new tools have been introduced to support such tasks. Pandas is arguably the most popular and widely used of these, although it does have some limitations (if working with very large datasets you may want to look at Dask and/or PySpark). 

First, we will import it into our session. We will also import numpy - another widely used Python package for working with numerical data (the name is a portmanteau of "numerical" and "python"). By convention we import pandas as "pd" and numpy as "np". This is equivalent to giving each package a shorter variable name, so we can write "pd.DataFrame" rather than "pandas.DataFrame".

In [1]:
import pandas as pd
import numpy as np

### Testing our installation
To test that everything is working we will create some fake data using numpy and load it into a pandas dataframe (more on these below). First we will create the random numbers:

In [2]:
x = np.random.rand(10,1)
x

array([[0.34548274],
       [0.97451032],
       [0.18749265],
       [0.19229023],
       [0.56354888],
       [0.25986074],
       [0.60829396],
       [0.3268164 ],
       [0.34697615],
       [0.88960271]])

The command here has told numpy to create a set of random numbers between zero and one. The arguments we have passed, "10" and "1", tell numpy we want a 10x1 array of numbers (i.e. a column vector with ten rows and one column).
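To check the shape for ourselves, we can inspect the array's attributes. The following is a minimal sketch. The call to `np.random.seed` and the seed value 42 are assumptions added here so that reruns give the same numbers; they are not part of the original cell.

```python
import numpy as np

np.random.seed(42)           # arbitrary seed (an assumption) so reruns are repeatable
x = np.random.rand(10, 1)    # first argument: rows, second argument: columns

print(x.shape)               # (10, 1): ten rows, one column
print(x.ndim)                # 2: a two-dimensional array, not a flat list of numbers
print(x.ravel().shape)       # (10,): the same values flattened to one dimension
```

Swapping the arguments to `np.random.rand(1, 10)` would give one row of ten columns instead.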

Next we will create a pandas dataframe using "x":

In [3]:
testdf = pd.DataFrame(x)
testdf

,0
0,0.345483
1,0.97451
2,0.187493
3,0.19229
4,0.563549
5,0.259861
6,0.608294
7,0.326816
8,0.346976
9,0.889603


We have now successfully created a pandas dataframe!
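By default pandas labels the single column "0". As a small sketch of how we might give it a more meaningful label, we can pass the `columns` argument when building the dataframe. The column name "random_value" is a hypothetical choice for illustration.

```python
import numpy as np
import pandas as pd

x = np.random.rand(10, 1)

# "random_value" is a hypothetical column name chosen for illustration
named_df = pd.DataFrame(x, columns=["random_value"])

print(named_df.shape)            # (10, 1)
print(list(named_df.columns))    # ['random_value']
print(named_df.head(3))          # the first three rows (values vary on each run)
```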

### What are Dataframes?
Pandas has a very elegant way of managing data, largely borrowed from the statistical language R and mostly based around dataframes. You can think of a dataframe as a bit like an Excel table, with rows, columns and common operations like sum, average and so on. We can use dataframes on top of our previous work on lists and dictionaries (etc.) to make that data more usable and malleable.
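To make the Excel comparison concrete, here is a minimal sketch of the sum and average operations on a small dataframe. The figures and the column names "units" and "price" are invented purely for illustration.

```python
import pandas as pd

# Hypothetical sales figures, invented purely for illustration
sales = pd.DataFrame({
    "units": [1, 2, 10, 2],
    "price": [5.0, 12.5, 3.0, 40.0],
})

print(sales["units"].sum())     # 15
print(sales["units"].mean())    # 3.75
print(sales.sum())              # totals for every column at once
```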

To begin with we will create a dataset - in this case a dictionary.



In [4]:
orders = {'o10001':{'date':'2021/01/10', 'product':'Social Media Detector', 'quantity':'1'},
            'o10002':{'date':'2021/01/13', 'product':'Realistic Man\'s wig', 'quantity':'2'},
            'o10003':{'date':'2021/01/14', 'product':'Weather\'s Originals', 'quantity':'10'},
            'o10004':{'date':'2021/01/15', 'product':'Brown Shoes', 'quantity':'2'}
}
orders

{'o10001': {'date': '2021/01/10',
  'product': 'Social Media Detector',
  'quantity': '1'},
 'o10002': {'date': '2021/01/13',
  'product': "Realistic Man's wig",
  'quantity': '2'},
 'o10003': {'date': '2021/01/14',
  'product': "Weather's Originals",
  'quantity': '10'},
 'o10004': {'date': '2021/01/15', 'product': 'Brown Shoes', 'quantity': '2'}}

We can convert this to a dataframe with great ease:

In [5]:
import pandas as pd
import numpy as np

orders_df = pd.DataFrame(orders)
orders_df

,o10001,o10002,o10003,o10004
date,2021/01/10,2021/01/13,2021/01/14,2021/01/15
product,Social Media Detector,Realistic Man's wig,Weather's Originals,Brown Shoes
quantity,1,2,10,2
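Notice that pandas has used the outer keys (the order IDs) as columns and the inner keys as rows. If we would rather have one row per order, one option is `pd.DataFrame.from_dict` with `orient="index"`. This is a sketch of one possible approach; transposing with `orders_df.T` is another. The name `orders_by_row` is hypothetical.

```python
import pandas as pd

orders = {
    "o10001": {"date": "2021/01/10", "product": "Social Media Detector", "quantity": "1"},
    "o10002": {"date": "2021/01/13", "product": "Realistic Man's wig", "quantity": "2"},
    "o10003": {"date": "2021/01/14", "product": "Weather's Originals", "quantity": "10"},
    "o10004": {"date": "2021/01/15", "product": "Brown Shoes", "quantity": "2"},
}

# orient="index" treats each outer key (the order ID) as a row label
orders_by_row = pd.DataFrame.from_dict(orders, orient="index")

# The quantities were stored as strings, so convert them before doing arithmetic
orders_by_row["quantity"] = orders_by_row["quantity"].astype(int)

print(orders_by_row.shape)                # (4, 3): four orders, three fields
print(orders_by_row["quantity"].sum())    # 15
```

With one row per order, column operations such as summing the quantities become straightforward.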


We can even create such outputs from more complex dictionaries, such as a dictionary which includes a nested list:

In [6]:
customers = {'James':{'name':'James Pennington', 'open_orders':3, 'orders':['o10001', 'o10002', 'o10004']},
             'Gareth':{'name':'Gareth Edwards', 'open_orders':0, 'orders':[]},
             'Mark':{'name':'Mark Bonnett', 'open_orders':1, 'orders':['o10003']},
             'Emily':{'name':'Emily Davis', 'open_orders':0, 'orders':[]}
}
customers

{'James': {'name': 'James Pennington',
  'open_orders': 3,
  'orders': ['o10001', 'o10002', 'o10004']},
 'Gareth': {'name': 'Gareth Edwards', 'open_orders': 0, 'orders': []},
 'Mark': {'name': 'Mark Bonnett', 'open_orders': 1, 'orders': ['o10003']},
 'Emily': {'name': 'Emily Davis', 'open_orders': 0, 'orders': []}}

In [7]:
customers_df = pd.DataFrame(customers)
customers_df

,James,Gareth,Mark,Emily
name,James Pennington,Gareth Edwards,Mark Bonnett,Emily Davis
open_orders,3,0,1,0
orders,"[o10001, o10002, o10004]",[],[o10003],[]
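The nested lists are stored inside the dataframe cells as ordinary Python lists. As a minimal sketch, we can pull one out with `.loc` (row label first, then column label) and count the items in each list. The names `james_orders` and `order_counts` are hypothetical.

```python
import pandas as pd

customers = {
    "James": {"name": "James Pennington", "open_orders": 3, "orders": ["o10001", "o10002", "o10004"]},
    "Gareth": {"name": "Gareth Edwards", "open_orders": 0, "orders": []},
    "Mark": {"name": "Mark Bonnett", "open_orders": 1, "orders": ["o10003"]},
    "Emily": {"name": "Emily Davis", "open_orders": 0, "orders": []},
}
customers_df = pd.DataFrame(customers)

# Select a single cell: row label "orders", column label "James"
james_orders = customers_df.loc["orders", "James"]
print(james_orders)          # ['o10001', 'o10002', 'o10004']
print(type(james_orders))    # <class 'list'>

# Apply len to every cell in the "orders" row to count each customer's orders
order_counts = customers_df.loc["orders"].apply(len)
print(order_counts["Mark"])  # 1
```

In this data the counts agree with the "open_orders" row.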


Pandas dataframes can also be built from various other data structures and sources (including Excel files, CSVs, text files, databases and many more). To conclude the session we will look at a list example:

In [8]:
a = [1, 2, 3, 4]
b = ["a", "b", "c", "d"]
c = [True, False, True, False]

listdf = pd.DataFrame([a, b, c])
listdf

,0,1,2,3
0,1,2,3,4
1,a,b,c,d
2,True,False,True,False
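Here each list has become a row. If we want each list to become a column instead, one option is to pass a dictionary of lists, where the keys become the column names. This is a sketch only; the column names "number", "letter" and "flag" are hypothetical.

```python
import pandas as pd

a = [1, 2, 3, 4]
b = ["a", "b", "c", "d"]
c = [True, False, True, False]

# Each key becomes a column name and each list becomes that column's values
columns_df = pd.DataFrame({"number": a, "letter": b, "flag": c})

print(columns_df.shape)     # (4, 3): four rows, three columns
print(columns_df.dtypes)    # each column keeps its own data type
```

Because every column now holds a single type of value, pandas can treat the numbers as numbers and the booleans as booleans rather than mixing them in one column.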


### EXERCISE
Try building your own dataframes from a list and/or dictionary you create. What would happen if one element had an item missing, e.g. if "c" in the above example only had three items (True, False, True) rather than four? Test it - does the output match your expectation?