<a href="https://colab.research.google.com/github/SSSpock/advpython/blob/main/skillspireDS_wk2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Intro to Data Structures

## Series

In [None]:
import numpy as np
import pandas as pd

Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic method to create a Series is to call:

In [None]:
# General form; data and index are placeholders, so the call is left commented out.
# s = pd.Series(data, index=index)

Here, data can be many different things:

- a Python dict
- an ndarray
- a scalar value (like 5)
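
As a minimal sketch of the three input types listed above (the variable names from_dict, from_array and from_scalar are made up for illustration), each call below builds a small Series. Note that a scalar is repeated to fill every label in the index.

```python
import numpy as np
import pandas as pd

# From a dict: the keys become the index labels.
from_dict = pd.Series({"b": 1, "a": 0, "c": 2})

# From an ndarray: the index must have the same length as the data.
from_array = pd.Series(np.arange(3), index=["x", "y", "z"])

# From a scalar: the value is repeated to match the length of the index.
from_scalar = pd.Series(5.0, index=["a", "b", "c"])

print(from_dict)
print(from_array)
print(from_scalar)
```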

In [None]:
s = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])

s.index

In [None]:
# Series can be instantiated from dicts
d = {"b": 1, "a": 0, "c": 2}

pd.Series(d)

A Series acts very much like an ndarray and is a valid argument to most NumPy functions. However, operations such as slicing also slice the index.

In [None]:
# Positional access: use .iloc, because s is indexed by string labels
s.iloc[0]

s[:3]

s[s > s.median()]

s.iloc[[4, 3, 1]]

np.exp(s)

In [None]:
# Like a NumPy array, a Series has a single dtype
s.dtype

In [None]:
# A Series is also like a dict: values can be read and set by index label

s["a"]

s["e"] = 12

"e" in s

"f" in s

In [None]:
# Vectorized Operations and label alignment

s + s

s * 2

np.exp(s)

In [None]:
# A Series has an advantage over an array: operations between Series automatically align on the labels.

s[1:] + s[:-1]

The result of an operation between unaligned Series has the union of the indexes involved. If a label is not found in one Series or the other, the result for that label is marked as missing (NaN). Being able to write code without doing any explicit data alignment grants immense freedom and flexibility in interactive data analysis and research. The integrated data alignment features of the pandas data structures set pandas apart from the majority of related tools for working with labeled data.
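
The alignment rule can be seen on two small Series that only partly share labels. This is an illustrative sketch: left, right, total, filled and overlap are made-up names. Series.add with fill_value and dropna() are shown as two common ways to deal with the resulting NaN values.

```python
import numpy as np
import pandas as pd

left = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
right = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])

# The result index is the union of both indexes.
# "a" and "d" exist in only one operand, so they come out as NaN.
total = left + right

# Treat a label missing from one side as 0 instead of producing NaN.
filled = left.add(right, fill_value=0)

# Or keep only the labels present in both operands.
overlap = total.dropna()

print(total)
print(filled)
print(overlap)
```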

## Data Frames


# Object Creation

In [None]:
# This week we are focused on the Pandas Library

# Creating a Series by passing a list of values, letting pandas create a default integer index:
s = pd.Series([1, 3, 5, np.nan, 6, 8])

In [None]:
# Creating a DataFrame by passing a NumPy array, with a datetime index using date_range() and labeled columns:

dates = pd.date_range("20130101", periods=6)

dates


In [None]:
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
df

In [None]:
# Creating a DataFrame by passing a dictionary of objects that can be converted into a series-like structure:

df2 = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20130102"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)

df2

In [None]:
# Each column of the resulting DataFrame has its own dtype
df2.dtypes

In [None]:
# From a dict of arrays/lists (all lists must have the same length)
d = {"one": [1.0, 2.0, 3.0, 4.0], "two": [4.0, 3.0, 2.0, 1.0]}

pd.DataFrame(d)



In [None]:
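# From a structured array; "S10" is a 10-byte string field (the older "a10" alias was removed in NumPy 2.0)
data = np.zeros((2,), dtype=[("A", "i4"), ("B", "f4"), ("C", "S10")])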

data[:] = [(1, 2.0, "Hello"), (2, 3.0, "World")]

pd.DataFrame(data)

pd.DataFrame(data, index=["first", "second"])

pd.DataFrame(data, columns=["C", "A", "B"])

In [None]:
# From a dict of tuples

pd.DataFrame(
    {
        ("a", "b"): {("A", "B"): 1, ("A", "C"): 2},
        ("a", "a"): {("A", "C"): 3, ("A", "B"): 4},
        ("a", "c"): {("A", "B"): 5, ("A", "C"): 6},
        ("b", "a"): {("A", "C"): 7, ("A", "B"): 8},
        ("b", "b"): {("A", "D"): 9, ("A", "B"): 10},
    }
)

# Viewing Data

In [None]:
# Use DataFrame.head() and DataFrame.tail() to view the top and bottom rows of the frame respectively:

df.head()

df.tail()

df.index

df.columns

DataFrame.to_numpy() gives a NumPy representation of the underlying data. Note that this can be an expensive operation when your DataFrame has columns with different data types, which comes down to a fundamental difference between pandas and NumPy: NumPy arrays have one dtype for the entire array, while pandas DataFrames have one dtype per column. When you call DataFrame.to_numpy(), pandas will find the NumPy dtype that can hold all of the dtypes in the DataFrame. This may end up being object, which requires casting every value to a Python object.

In [None]:
df.to_numpy()
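
To make the dtype point concrete, here is a small sketch comparing three hypothetical frames (homogeneous, numeric_mixed and mixed are illustrative names, not part of the notebook's data). It only inspects the dtype of the array that to_numpy() returns.

```python
import pandas as pd

# All columns are float64, so the array keeps that dtype.
homogeneous = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0]})

# Integer and float columns share a common numeric dtype (float64).
numeric_mixed = pd.DataFrame({"i": [1, 2], "f": [0.5, 1.5]})

# A float column next to a string column has no common numeric dtype,
# so every value is stored as a Python object.
mixed = pd.DataFrame({"x": [1.0, 2.0], "label": ["a", "b"]})

print(homogeneous.to_numpy().dtype)
print(numeric_mixed.to_numpy().dtype)
print(mixed.to_numpy().dtype)
```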

In [None]:
# Get fast summary statistics
df.describe()

In [None]:
# Transpose your data
df.T

In [None]:
# Sort by an axis (here the column labels, in descending order)
df.sort_index(axis=1, ascending=False)

In [None]:
# Sort by values
df.sort_values(by='B')

# Selection

While standard Python / NumPy expressions for selecting and setting are intuitive and come in handy for interactive work, for production code we recommend the optimized pandas data access methods: DataFrame.at, DataFrame.iat, DataFrame.loc and DataFrame.iloc. These are indexers used with square brackets (for example df.loc[...]), not functions called with parentheses.
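
As a rough side-by-side sketch of the four accessors, the example below uses a small hypothetical frame (frame, first_day and the result names are made up here so the snippet runs on its own). at and iat fetch a single scalar, while loc and iloc select blocks. Label slices with loc include the end label, whereas positional slices with iloc exclude the end position.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12).reshape(3, 4),
    index=pd.date_range("20130101", periods=3),
    columns=list("ABCD"),
)
first_day = frame.index[0]

# Scalar access: by label (at) and by integer position (iat).
by_label = frame.at[first_day, "B"]
by_position = frame.iat[0, 1]

# Block access: the label slice includes its end label ...
label_block = frame.loc[first_day:frame.index[1], ["A", "B"]]

# ... while the positional slice excludes its end position.
position_block = frame.iloc[0:2, 0:2]

print(by_label, by_position)
print(label_block)
print(position_block)
```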

In [None]:
df['A']

In [None]:
df[0:3]

df['20130102':'20130104']

In [None]:
# Selection by Label
df.loc[dates[0]]

# Select all rows (:) for columns A and B
df.loc[:, ["A", "B"]]

In [None]:
df.loc["20130102":"20130104", ["A", "B"]]

In [None]:
df.loc["20130102", ["A", "B"]]

In [None]:
# Selecting by position
df.iloc[3:5, 0:2]

df.iloc[[1, 2, 4], [0, 2]]

df.iloc[1:3, :]

df.iloc[:, 1:3]

In [None]:
# Boolean Indexing
df[df["A"] > 0]

df[df > 0]

In [None]:
# Boolean filtering

df2 = df.copy()

df2["E"] = ["one", "one", "two", "three", "four", "three"]

df2

df2[df2["E"].isin(["two", "four"])]

# Setting

In [None]:
# Setting a new column automatically aligns the data by the indexes

s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20130102", periods=6))

df["F"] = s1
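
The effect of that alignment is easiest to see on a tiny frame. In this sketch (frame and shifted are illustrative names), the Series starts one day later than the frame, so the first row receives NaN and the Series value for the day outside the frame's index is dropped.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    {"A": [1.0, 2.0, 3.0]},
    index=pd.date_range("20130101", periods=3),
)

# This Series covers 2013-01-02 through 2013-01-04.
shifted = pd.Series([10, 20, 30], index=pd.date_range("20130102", periods=3))

# Assignment aligns on the frame's index:
# 2013-01-01 has no match (NaN) and 2013-01-04 is not in the frame (dropped).
frame["F"] = shifted

print(frame)
```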

In [None]:
# Setting a value by label
df.at[dates[0], "A"] = 0

In [None]:
# Setting Values by position
df.iat[0, 1] = 0