# Python Libraries
Python, like other programming languages, has an abundance of additional modules or libraries that augment the base framework and functionality of the language.

Think of a library as a collection of functions that can be accessed to complete certain programming tasks without having to write your own algorithm.

For this course, we will focus primarily on the following libraries:

 - NumPy is a library for working with arrays of data.
 - Pandas provides high-performance, easy-to-use data structures and data analysis tools.
 - SciPy is a library of techniques for numerical and scientific computing.
 - Matplotlib is a library for making graphs.
 - Seaborn is a higher-level interface to Matplotlib that can be used to simplify many graphing tasks.
 - Statsmodels is a library that implements many statistical techniques.


# Importing Libraries
When using Python, you should begin your scripts by importing the libraries that you will be using.
The following statements import the numpy and pandas libraries and give them abbreviated names:

In [6]:
import numpy as np
import pandas as pd

# Utilizing Library Functions

In [7]:
a = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
np.mean(a)

5.0
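
Other NumPy functions are called the same way, through the np prefix. As a small sketch using the same array (the variable names a_median, a_std and a_max are only illustrative):

```python
import numpy as np

a = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

a_median = np.median(a)  # middle value of the sorted array
a_std = np.std(a)        # population standard deviation (ddof=0 by default)
a_max = np.max(a)        # largest element

print(a_median, a_std, a_max)
```

Note that np.std() divides by n by default; passing ddof=1 gives the sample standard deviation instead.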

# Data Management


# Importing Data

In [9]:
# Store the url string that hosts our .csv file
url = "https://raw.githubusercontent.com/Tublifet/Statistics-With-Python/main/data.raw"
# Read the .csv file and store it as a pandas Data Frame
df = pd.read_csv(url)
# Output object type
type(df)

pandas.core.frame.DataFrame
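
pd.read_csv() accepts file-like objects as well as URLs and file paths, which makes it possible to try it without a network connection. The sketch below assumes a comma-separated layout with a header row and reuses the first three rows shown later in this section; csv_text and df_demo are illustrative names, not part of the course files.

```python
import io

import pandas as pd

# A small in-memory stand-in for the hosted file (assumed layout)
csv_text = """Date,Open,Close,State
10/3/2018,57.5,58.0,INCREASE
10/4/2018,57.7,57.0,DECREASE
10/5/2018,7.0,56.1,DECREASE
"""

# io.StringIO wraps the string so read_csv can treat it like a file
df_demo = pd.read_csv(io.StringIO(csv_text))

print(type(df_demo))
print(df_demo.shape)
```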

# Viewing Data

In [11]:
# We can view our Data Frame by calling the head() function
df.head()

Unnamed: 0,Date,Open,Close,State
0,10/3/2018,57.5,58.0,INCREASE
1,10/4/2018,57.7,57.0,DECREASE
2,10/5/2018,7.0,56.1,DECREASE
3,10/8/2018,55.6,55.9,INCREASE
4,10/9/2018,55.9,56.7,INCREASE


The head() function simply shows the first 5 rows of our Data Frame. If we wanted to show the entire Data Frame we would simply write the following:

In [12]:
# Output entire Data Frame
df

Unnamed: 0,Date,Open,Close,State
0,10/3/2018,57.5,58.0,INCREASE
1,10/4/2018,57.7,57.0,DECREASE
2,10/5/2018,7.0,56.1,DECREASE
3,10/8/2018,55.6,55.9,INCREASE
4,10/9/2018,55.9,56.7,INCREASE
5,10/10/2018,56.4,54.1,DECREASE
6,10/11/2018,53.6,53.6,UNCHANGE


As you can see, we have a 2-Dimensional object where each row is an independent observation of our data.
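
The dimensions can also be checked directly. The sketch below rebuilds the seven rows shown above by hand so that it runs on its own; in the notebook, df already holds them.

```python
import pandas as pd

# Hand-built copy of the rows displayed above
df = pd.DataFrame({
    "Date": ["10/3/2018", "10/4/2018", "10/5/2018", "10/8/2018",
             "10/9/2018", "10/10/2018", "10/11/2018"],
    "Open": [57.5, 57.7, 7.0, 55.6, 55.9, 56.4, 53.6],
    "Close": [58.0, 57.0, 56.1, 55.9, 56.7, 54.1, 53.6],
    "State": ["INCREASE", "DECREASE", "DECREASE", "INCREASE",
              "INCREASE", "DECREASE", "UNCHANGE"],
})

n_rows, n_cols = df.shape  # (number of rows, number of columns)
print(n_rows, n_cols)
print(df.ndim)             # number of dimensions
```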

To gather more information regarding the data, we can view the column names and the data type of each column. The column names are stored in the columns attribute:

In [13]:
df.columns

Index(['Date', 'Open', 'Close', 'State'], dtype='object')

Let's say we would like to slice our data frame and select only specific portions of our data. There are three different ways of doing so.

1. .loc()
2. .iloc()
3. .ix()

We will cover the .loc() and .iloc() slicing functions. .ix() is deprecated and has been removed from pandas as of version 1.0.

# .loc()
.loc() takes two single/list/range operators separated by ','. The first one indicates the rows and the second one indicates the columns. Note that it is used with square brackets, as in df.loc[rows, columns].

In [16]:
# Return all observations of Date
df.loc[:, "Date"]

0     10/3/2018
1     10/4/2018
2     10/5/2018
3     10/8/2018
4     10/9/2018
5    10/10/2018
6    10/11/2018
Name: Date, dtype: object

In [20]:
# Select all rows for multiple columns [Date, Open, Close]
df.loc[:, ['Date','Open','Close']]

Unnamed: 0,Date,Open,Close
0,10/3/2018,57.5,58.0
1,10/4/2018,57.7,57.0
2,10/5/2018,7.0,56.1
3,10/8/2018,55.6,55.9
4,10/9/2018,55.9,56.7
5,10/10/2018,56.4,54.1
6,10/11/2018,53.6,53.6


In [21]:
# Select the first few rows (labels 0 through 4) for multiple columns ['Date','Open','Close']
df.loc[:4, ['Date','Open','Close']]

Unnamed: 0,Date,Open,Close
0,10/3/2018,57.5,58.0
1,10/4/2018,57.7,57.0
2,10/5/2018,7.0,56.1
3,10/8/2018,55.6,55.9
4,10/9/2018,55.9,56.7


In [22]:
# Select range of rows for all columns
df.loc[3:5]

Unnamed: 0,Date,Open,Close,State
3,10/8/2018,55.6,55.9,INCREASE
4,10/9/2018,55.9,56.7,INCREASE
5,10/10/2018,56.4,54.1,DECREASE


# .iloc()
.iloc() is integer-position-based slicing, whereas .loc() uses labels/column names. Here are some examples:

In [23]:
df.iloc[:4]

Unnamed: 0,Date,Open,Close,State
0,10/3/2018,57.5,58.0,INCREASE
1,10/4/2018,57.7,57.0,DECREASE
2,10/5/2018,7.0,56.1,DECREASE
3,10/8/2018,55.6,55.9,INCREASE


In [33]:
df.iloc[1:5, 2:4]

Unnamed: 0,Close,State
1,57.0,DECREASE
2,56.1,DECREASE
3,55.9,INCREASE
4,56.7,INCREASE


In [39]:
df.iloc[1:5]

Unnamed: 0,Date,Open,Close,State
1,10/4/2018,57.7,57.0,DECREASE
2,10/5/2018,7.0,56.1,DECREASE
3,10/8/2018,55.6,55.9,INCREASE
4,10/9/2018,55.9,56.7,INCREASE
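
One difference between the two indexers is easy to miss in the outputs above: a .loc slice includes its end label, while an .iloc slice excludes its end position, like ordinary Python slicing. A minimal sketch, using a hand-built copy of the data so that it runs on its own (by_label and by_position are illustrative names):

```python
import pandas as pd

# Hand-built copy of the rows displayed earlier
df = pd.DataFrame({
    "Date": ["10/3/2018", "10/4/2018", "10/5/2018", "10/8/2018",
             "10/9/2018", "10/10/2018", "10/11/2018"],
    "Open": [57.5, 57.7, 7.0, 55.6, 55.9, 56.4, 53.6],
    "Close": [58.0, 57.0, 56.1, 55.9, 56.7, 54.1, 53.6],
    "State": ["INCREASE", "DECREASE", "DECREASE", "INCREASE",
              "INCREASE", "DECREASE", "UNCHANGE"],
})

by_label = df.loc[1:4]      # labels 1, 2, 3 and 4 (end label included)
by_position = df.iloc[1:4]  # positions 1, 2 and 3 (end position excluded)

print(len(by_label), len(by_position))
```

With the default integer index, labels and positions coincide, which is why the two indexers look so similar here. They diverge as soon as the index is reordered or replaced.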


We can view the data types of our data frame columns by calling .dtypes on the data frame:

In [40]:
df.dtypes

Date      object
Open     float64
Close    float64
State     object
dtype: object
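
The Date column is stored as plain strings rather than as dates. If real dates are needed, one option is pd.to_datetime(). The sketch below assumes the strings follow a month/day/year layout, which the values shown above are consistent with but which the source does not state; dates is an illustrative name.

```python
import pandas as pd

# The Date strings exactly as displayed earlier
date_strings = pd.Series(["10/3/2018", "10/4/2018", "10/5/2018", "10/8/2018",
                          "10/9/2018", "10/10/2018", "10/11/2018"], name="Date")

# Assumption: month/day/year ordering
dates = pd.to_datetime(date_strings, format="%m/%d/%Y")

print(dates.dtype)
print(dates.dt.day_name().tolist())
```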

We may also want to observe the different unique values within a specific column:

In [41]:
# List unique values in the df['State'] column
df.State.unique()

array(['INCREASE', 'DECREASE', 'UNCHANGE'], dtype=object)

In [42]:
df.Open.unique()

array([57.5, 57.7,  7. , 55.6, 55.9, 56.4, 53.6])
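
unique() lists the distinct values but not how often each one occurs. As a short sketch, value_counts() and nunique() cover that, shown here on a hand-built copy of the State column so the example runs on its own:

```python
import pandas as pd

# The State column exactly as displayed earlier
state = pd.Series(["INCREASE", "DECREASE", "DECREASE", "INCREASE",
                   "INCREASE", "DECREASE", "UNCHANGE"], name="State")

counts = state.value_counts()  # occurrences of each distinct value
n_distinct = state.nunique()   # number of distinct values

print(counts)
print(n_distinct)
```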

It is not obvious how the Open values relate to the State labels (INCREASE, DECREASE and UNCHANGE). Let's check this quickly by observing only these two columns:

In [44]:
# Use .loc() to specify a list of multiple column names
df.loc[:,['Open', 'State']]

Unnamed: 0,Open,State
0,57.5,INCREASE
1,57.7,DECREASE
2,7.0,DECREASE
3,55.6,INCREASE
4,55.9,INCREASE
5,56.4,DECREASE
6,53.6,UNCHANGE


In [45]:
df.groupby(['Open', 'State']).size()

Open  State   
7.0   DECREASE    1
53.6  UNCHANGE    1
55.6  INCREASE    1
55.9  INCREASE    1
56.4  DECREASE    1
57.5  INCREASE    1
57.7  DECREASE    1
dtype: int64
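
Every Open value appears only once, so the grouped counts alone say little about how State is assigned. One possible follow-up is to recompute the label and compare. This rests on an assumption that the data source does not state: that State records whether Close ended above, below, or equal to Open. The names change, expected and mismatch are illustrative.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the rows displayed earlier
df = pd.DataFrame({
    "Open": [57.5, 57.7, 7.0, 55.6, 55.9, 56.4, 53.6],
    "Close": [58.0, 57.0, 56.1, 55.9, 56.7, 54.1, 53.6],
    "State": ["INCREASE", "DECREASE", "DECREASE", "INCREASE",
              "INCREASE", "DECREASE", "UNCHANGE"],
})

# Assumed rule: the label reflects the sign of Close - Open
change = df["Close"] - df["Open"]
expected = np.where(change > 0, "INCREASE",
                    np.where(change < 0, "DECREASE", "UNCHANGE"))

# Rows where the stored label disagrees with the assumed rule
mismatch = df[df["State"] != expected]
print(mismatch)
```

Under that assumption, the only row that disagrees is the one with an Open of 7.0, which makes it a natural candidate for a closer look.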