# Common Data Structures in Python

Many different data structures are used in Python. The most prominent ones are `numpy arrays`, `pandas dataframes` and `dictionaries`.

In this notebook, we will talk about these and how they are used within the Python language.

![Numpy](../static/Numpy.png)

## Numpy Arrays

An array is a data structure we can use to store information in. Arrays can be n-dimensional, but typically they are 1- or 2-dimensional.

          This is a 1-Dimensional Array
          np.array([0,1,2,3])

          This is a 2-Dimensional Array
          np.array([[0,1,2,3],
                    [4,5,6,7]])

What you can see in this example is actually nothing more than calling the `np.array()` function and passing a `list` such as `[0,1,2,3]` to it (or, for the 2-dimensional array, a list of lists).

Within a numpy array, the stored datatypes must be `homogeneous`, meaning that all values need to belong to the same data type. Numpy arrays are optimized for numerical computations.
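To make the homogeneity rule concrete, here is a small sketch (the variable name `mixed_numbers` is made up for illustration). When a list mixes integers and a float, numpy does not keep both types but converts everything to one common type.

```python
import numpy as np

# A list that mixes two integers and one float
mixed_numbers = np.array([1, 2, 3.5])

# numpy upcasts every element to a single shared datatype
print(mixed_numbers.dtype)   # float64
```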

We will now start to explore the numpy environment and the numpy arrays.

So first, start by importing numpy: `import numpy as np`

We will not work with real data *yet*, but rather simulate our own numpy arrays to work with. Numpy offers some really useful functions we can use to generate our arrays.

Let's start with the `numpy.random.rand` function.

The key arguments we need to pass to the function are the dimensions (the `shape`) of the array we want to create. Let's start by creating an array with a single row.

The first argument determines the number of *rows* we want our array to have, whereas the second argument determines the number of columns.

### **Exercise**

Create a numpy array using the `rand` function from the `random` module (from the `numpy` package). Use it to create a numpy array with **1** Row and **20** Columns. (Strictly speaking, numpy treats this as a 2-dimensional array with a single row.)

Assign your array to a variable called "RandomArray".

In [11]:
import numpy as np

In [12]:
RandomArray = np.random.rand(1,20)

The information about the `shape` (i.e., how many rows and columns we have) is actually stored within the array object itself. We can access it with `array.shape`.

In [13]:
RandomArray.shape

(1, 20)
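The shape also tells us how many dimensions an array has. As a small sketch (the names `one_dim` and `two_dim` are only illustrative), compare calling `rand` with one argument and with two arguments:

```python
import numpy as np

one_dim = np.random.rand(20)      # a single argument gives a 1-D array
two_dim = np.random.rand(1, 20)   # two arguments give a 2-D array with 1 row

print(one_dim.shape, one_dim.ndim)   # (20,) 1
print(two_dim.shape, two_dim.ndim)   # (1, 20) 2
```

Both arrays hold 20 random values between 0 and 1, but only the second one has an explicit row dimension.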

### Indexing a Numpy Array

In [37]:
RandomArray[:,:]

array([[0.48717182, 0.03035302, 0.30656875, 0.79526604, 0.93950624,
        0.61541519, 0.79135653, 0.90462905, 0.12916138, 0.43791648,
        0.12838649, 0.91423623, 0.41345175, 0.73463452, 0.02307464,
        0.24663635, 0.81967195, 0.30473204, 0.67198669, 0.33036913]])
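`RandomArray[:,:]` selects all rows and all columns, so it returns the whole array. The sketch below uses a small array with known values (the name `example` and the other variable names are made up for illustration) to show a few more selective ways of indexing with `[row, column]`.

```python
import numpy as np

# A small 2-D array with known values: 2 rows, 4 columns
example = np.arange(8).reshape(2, 4)   # [[0 1 2 3], [4 5 6 7]]

first_row = example[0, :]       # row 0, all columns      -> [0 1 2 3]
last_column = example[:, -1]    # all rows, last column   -> [3 7]
single_value = example[1, 2]    # row 1, column 2         -> 6
left_half = example[:, :2]      # all rows, first 2 cols  -> [[0 1], [4 5]]
```

Remember that indexing starts at 0 and that the stop value of a slice is not included.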

![pandas](../static/Pandas.png)

## Pandas


Pandas is *the* Python library you want to use for organizing, manipulating and analyzing your datasets. The standard datatype used in pandas is the `dataframe`.
Pandas dataframes are basically like an Excel sheet, but way, way better.

In this section, we will download a dataset and use this to explore dataframes and apply what we have learned so far.

But first the basic import:

          import pandas as pd

This is the way to go. Again, you can use whatever abbreviation you want to, but I don't think I have ever seen any code where someone used `pandas.xyz` or `pandas as p` or something strange like that.

The dataset we will use contains reports of UFO sightings. (I was debating whether to use this one or one that is Pokémon-based.)


In [3]:
import pandas as pd

Before transitioning to the dataset, you should know a thing or two about `pandas`.

The cool thing here is that you can actually *convert* `lists` or `numpy arrays` to a `dataframe`.

What you usually want to do is put your lists or arrays into a `dictionary`. A dictionary is another vanilla Python datatype. Its syntax goes like this:

          dictionary = {"Participant Number":[0,1,2,3,4],
                        "Reaction times":[100,50,76,34,95]}      

The string is what we call a `key` (here `Participant Number` and `Reaction times`). Each key is associated with the values stored under it. We won't focus on dictionaries too much here, but you should know that a dictionary can store values (or lists of values, or arrays) under so-called keys.
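As a quick sketch of how keys work in practice, the snippet below rebuilds the dictionary from above and looks up values by key.

```python
# Same structure as the dictionary shown above
dictionary = {"Participant Number": [0, 1, 2, 3, 4],
              "Reaction times": [100, 50, 76, 34, 95]}

# List all keys of the dictionary
print(list(dictionary.keys()))           # ['Participant Number', 'Reaction times']

# Use a key to get the values stored under it
print(dictionary["Reaction times"])      # [100, 50, 76, 34, 95]

# The result is a normal list, so it can be indexed as usual
print(dictionary["Reaction times"][0])   # 100
```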

What we can now do is create a `dataframe` from this dictionary.

In [41]:
'''This code cell creates a dictionary called reaction_time_dictionary. It has two keys called Participant_number and Reaction_times.
The participant numbers are created with the np.arange function, which gives values from (start) up to, but not including, (stop).
The reaction time values are created randomly with the np.random.rand function.
The resulting (1, 10) array needs to be reshaped into a 1-D vector, so it can be passed to pd.DataFrame().
'''
reaction_time_dictionary = {"Participant_number": np.arange(0, 10),
                            "Reaction_times": np.random.rand(1, 10).reshape(-1)}
dataframe = pd.DataFrame(reaction_time_dictionary)

So our dataframe should look like this!

![df](../static/pandas_table.svg)

You can again use the `?` here to gather more information about your dataframe.

In [None]:
dataframe?

We can use the `dataframe.head(n=n)` method to display the first n rows of our dataframe.

In [50]:
dataframe.head(n=3)

| | Participant_number | Reaction_times |
|---|---|---|
| 0 | 0 | 0.800368 |
| 1 | 1 | 0.514193 |
| 2 | 2 | 0.186876 |


You can see that this object is neatly organized. It has two columns, which correspond to the `keys` from our `reaction_time_dictionary`. With this dataframe, we now have the opportunity to do many different things. But this would be pretty boring with such a small dataframe. So we will load a different one and check out the functionality of pandas based on it.

The code we are using to obtain this data is not in `python` but [`bash`](https://wiki.ubuntuusers.de/Bash/).

In [None]:
!curl -O https://raw.githubusercontent.com/JNPauli/IntroductionToPython/refs/heads/main/content/datasets/scrubbed.csv

We have now temporarily downloaded the `scrubbed.csv` file into our Google Colab session.
This also means that we can now load it into a pandas dataframe. The function we want to use for that is called

          pd.read_csv(yourfilename)

So use this function to read the `scrubbed.csv` file. Store it in a dataframe called `Ufo`. (`pd.read_csv` also accepts a URL instead of a filename, which is what the cell below uses.)

In [52]:
Ufo = pd.read_csv("https://raw.githubusercontent.com/JNPauli/IntroductionToPython/refs/heads/main/content/datasets/scrubbed.csv")
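If you want to try out `pd.read_csv` without downloading anything, the following sketch reads a tiny CSV from a string held in memory. The CSV content and the names `csv_text` and `small_df` are made up for illustration; `io.StringIO` simply makes the string behave like a file.

```python
import io
import pandas as pd

# Hypothetical miniature CSV: the first line holds the column names
csv_text = "city,shape,duration\nsan marcos,cylinder,2700\nedna,circle,20\n"

# read_csv accepts a filename, a URL or a file-like object
small_df = pd.read_csv(io.StringIO(csv_text))

print(small_df.shape)           # (2, 3)
print(list(small_df.columns))   # ['city', 'shape', 'duration']
```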


We can display our newly obtained dataframe by simply typing and running `Ufo` in a code cell. If you are only interested in viewing the first or last *n* rows of your dataframe, you can use `df.head(n=n)` or `df.tail(n=n)`, respectively.

In [53]:
Ufo.head(n=10)

| | datetime | city | state | country | shape | duration (seconds) | duration (hours/min) | comments | date posted | latitude | longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10/10/1949 20:30 | san marcos | tx | us | cylinder | 2700 | 45 minutes | This event took place in early fall around 194... | 4/27/2004 | 29.8830556 | -97.941111 |
| 1 | 10/10/1949 21:00 | lackland afb | tx | | light | 7200 | 1-2 hrs | 1949 Lackland AFB&#44 TX. Lights racing acros... | 12/16/2005 | 29.38421 | -98.581082 |
| 2 | 10/10/1955 17:00 | chester (uk/england) | | gb | circle | 20 | 20 seconds | Green/Orange circular disc over Chester&#44 En... | 1/21/2008 | 53.2 | -2.916667 |
| 3 | 10/10/1956 21:00 | edna | tx | us | circle | 20 | 1/2 hour | My older brother and twin sister were leaving ... | 1/17/2004 | 28.9783333 | -96.645833 |
| 4 | 10/10/1960 20:00 | kaneohe | hi | us | light | 900 | 15 minutes | AS a Marine 1st Lt. flying an FJ4B fighter/att... | 1/22/2004 | 21.4180556 | -157.803611 |
| 5 | 10/10/1961 19:00 | bristol | tn | us | sphere | 300 | 5 minutes | My father is now 89 my brother 52 the girl wit... | 4/27/2007 | 36.595 | -82.188889 |
| 6 | 10/10/1965 21:00 | penarth (uk/wales) | | gb | circle | 180 | about 3 mins | penarth uk circle 3mins stayed 30ft above m... | 2/14/2006 | 51.434722 | -3.18 |
| 7 | 10/10/1965 23:45 | norwalk | ct | us | disk | 1200 | 20 minutes | A bright orange color changing to reddish colo... | 10/2/1999 | 41.1175 | -73.408333 |
| 8 | 10/10/1966 20:00 | pell city | al | us | disk | 180 | 3 minutes | Strobe Lighted disk shape object observed clos... | 3/19/2009 | 33.5861111 | -86.286111 |
| 9 | 10/10/1966 21:00 | live oak | fl | us | disk | 120 | several minutes | Saucer zaps energy from powerline as my pregna... | 5/11/2005 | 30.2947222 | -82.984167 |
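`df.tail(n=n)` works the same way as `df.head(n=n)`, just from the other end of the dataframe. Here is a minimal sketch on a made-up dataframe called `numbers` (not part of the Ufo data):

```python
import pandas as pd

# Hypothetical dataframe with five rows
numbers = pd.DataFrame({"value": [10, 20, 30, 40, 50]})

first_two = numbers.head(n=2)   # rows with index 0 and 1
last_two = numbers.tail(n=2)    # rows with index 3 and 4

print(first_two["value"].tolist())   # [10, 20]
print(last_two["value"].tolist())    # [40, 50]
```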


As you can see here, a dataframe looks strikingly similar to what you would expect in an Excel sheet. It has a bunch of rows and columns, and in each column there is some information stored.

We can further examine the type and shape (rows and columns) of our dataframe by using the `df.info()` method.

In [54]:
Ufo.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 80332 entries, 0 to 80331
Data columns (total 11 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   datetime              80332 non-null  object 
 1   city                  80332 non-null  object 
 2   state                 74535 non-null  object 
 3   country               70662 non-null  object 
 4   shape                 78400 non-null  object 
 5   duration (seconds)    80332 non-null  object 
 6   duration (hours/min)  80332 non-null  object 
 7   comments              80317 non-null  object 
 8   date posted           80332 non-null  object 
 9   latitude              80332 non-null  object 
 10  longitude             80332 non-null  float64
dtypes: float64(1), object(10)
memory usage: 6.7+ MB


This method gives us information about the number and names of the columns, the `Non-Null Count`, the datatype (Dtype) of each column and how many entries (rows) there are. This is super helpful to get an idea of what kind of data we are dealing with.

However, the returned datatype refers to the `values` within each column. Each column itself is a `pandas.Series`.

To access a column of the dataframe, we have two options.

The first one is `df.column`. However, this only works if your column name has no spaces or extra characters!

So we could use `Ufo.datetime`, but to access the `duration (seconds)` column, we need to use the second option, `Ufo["duration (seconds)"]`. Both options return the same kind of object.

![Series](../static/01_table_series.svg)

In [65]:
type(Ufo.datetime),type(Ufo["duration (seconds)"])

(pandas.core.series.Series, pandas.core.series.Series)
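The two ways of accessing a column can be compared on a small, made-up dataframe (the name `sightings` and its values are only illustrative; the column names mimic the Ufo data). The sketch also shows one caveat of the dot notation: if a column has the same name as a built-in dataframe attribute, such as `shape`, the dot notation returns that attribute and not the column.

```python
import pandas as pd

# Hypothetical miniature frame; the column names mimic the Ufo dataset
sightings = pd.DataFrame({"city": ["edna", "bristol"],
                          "shape": ["circle", "sphere"],
                          "duration (seconds)": [20, 300]})

# Attribute access and bracket access return the same Series
print(sightings.city.equals(sightings["city"]))   # True

# A name with spaces or parentheses can only be reached with brackets
durations = sightings["duration (seconds)"]
print(type(durations).__name__)                   # Series

# Caveat: "shape" is also a built-in DataFrame attribute, so the dot
# notation returns the (rows, columns) tuple instead of the column
print(sightings.shape)                            # (2, 3)
print(sightings["shape"].tolist())                # ['circle', 'sphere']
```

Because of this, the bracket notation is the safer choice whenever you are unsure about a column name.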

We can separately extract the column names by using the `dataframe.columns` attribute.

In [60]:
Ufo.columns

Index(['datetime', 'city', 'state', 'country', 'shape', 'duration (seconds)',
       'duration (hours/min)', 'comments', 'date posted', 'latitude',
       'longitude '],
      dtype='object')

We can also directly convert them to a `list` by calling the `to_list()` method!

In [61]:
columns_list = Ufo.columns.to_list()

And in principle, we can now use this list to get a subset of our dataframe, extracting only the first two columns.

In [66]:
Ufo[columns_list[:2]]

| | datetime | city |
|---|---|---|
| 0 | 10/10/1949 20:30 | san marcos |
| 1 | 10/10/1949 21:00 | lackland afb |
| 2 | 10/10/1955 17:00 | chester (uk/england) |
| 3 | 10/10/1956 21:00 | edna |
| 4 | 10/10/1960 20:00 | kaneohe |
| ... | ... | ... |
| 80327 | 9/9/2013 21:15 | nashville |
| 80328 | 9/9/2013 22:00 | boise |
| 80329 | 9/9/2013 22:00 | napa |
| 80330 | 9/9/2013 22:20 | vienna |
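Selecting with a list of column names always gives back a dataframe, while selecting with a single name gives back a `Series`. A minimal sketch with a made-up dataframe called `frame`:

```python
import pandas as pd

# Hypothetical dataframe with three columns
frame = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

columns_list = frame.columns.to_list()   # ['a', 'b', 'c']

subset = frame[columns_list[:2]]   # list of names -> DataFrame with columns a and b
single = frame["a"]                # single name   -> Series

print(type(subset).__name__, subset.shape)   # DataFrame (2, 2)
print(type(single).__name__, single.shape)   # Series (2,)
```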


In [84]:
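# Drop the rows with the malformed latitude entry, then convert the column to numbers
Ufo = Ufo[Ufo['latitude'] != "33q.200088"].copy()
Ufo["latitude"] = pd.to_numeric(Ufo["latitude"])
# Durations that cannot be parsed become NaN (errors='coerce') and are dropped afterwards
Ufo["duration (seconds)"] = pd.to_numeric(Ufo["duration (seconds)"], errors='coerce')
Ufo = Ufo.dropna(subset=["duration (seconds)"])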
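The cell above cleans columns that were read in as text (`object`) even though they contain numbers. The following sketch shows the same idea on a tiny, made-up dataframe called `raw`, in which one entry is not a valid number:

```python
import pandas as pd

# Hypothetical column of numbers stored as text, with one broken entry
raw = pd.DataFrame({"duration": ["20", "300", "2x0", "45"]})

# errors="coerce" turns everything that cannot be parsed into NaN
raw["duration"] = pd.to_numeric(raw["duration"], errors="coerce")
print(raw["duration"].isna().sum())    # 1

# dropna removes the rows where the chosen column is NaN
cleaned = raw.dropna(subset=["duration"])
print(cleaned["duration"].tolist())    # [20.0, 300.0, 45.0]
```

Without `errors="coerce"`, `pd.to_numeric` would raise an error at the first entry it cannot convert.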

In [88]:
Ufo.describe()

| | duration (seconds) | latitude | longitude |
|---|---|---|---|
| count | 80328.0 | 80328.0 | 80328.0 |
| mean | 9017.336 | 38.124963 | -86.772015 |
| std | 620232.2 | 10.469146 | 39.697805 |
| min | 0.001 | -82.862752 | -176.658056 |
| 25% | 30.0 | 34.134722 | -112.073333 |
| 50% | 180.0 | 39.4125 | -87.903611 |
| 75% | 600.0 | 42.788333 | -78.755 |
| max | 97836000.0 | 72.7 | 178.4419 |
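By default, `describe()` only summarizes the numeric columns, which is why only three columns show up above. The result is itself a dataframe, so single statistics can be picked out with `.loc`. A minimal sketch on a made-up dataframe called `scores`:

```python
import pandas as pd

# Hypothetical dataframe with one numeric column
scores = pd.DataFrame({"score": [1.0, 2.0, 3.0, 4.0]})

summary = scores.describe()

# Rows of the summary are the statistics, columns are the original columns
print(summary.loc["count", "score"])   # 4.0
print(summary.loc["mean", "score"])    # 2.5
print(summary.loc["50%", "score"])     # 2.5
print(summary.loc["max", "score"])     # 4.0
```

In the Ufo output above, the mean duration (about 9017 seconds) is far larger than the median (the `50%` row, 180 seconds), which hints at a few extremely long reported durations.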
