![](https://www.python.org/static/community_logos/python-logo-master-v3-TM.png)

Learning pandas and NumPy is arguably one of the most valuable things you can do to boost your data analysis skills.
<br> <br>
**Pandas** is a flexible, powerful package that lets you import, shape, and analyze data quickly and with very little code. Using Series and DataFrame objects, you can sort and organize data with ease. [READ THE DOCS](https://pandas.pydata.org/docs/)
<br> <br>
**NumPy** is the foundational numerical computing package in Python. Use it to calculate distribution metrics and the IQR, generate random arrays, etc. [READ THE DOCS](https://numpy.org/doc/stable/)
<br> <br>
If you haven't installed these packages already, type `pip install pandas` and `pip install numpy` at the command line (or `conda install pandas` and `conda install numpy` if you use the Anaconda distribution).

In [1]:
# Import the packages into Jupyter notebook
# We'll use aliases to avoid lots of typing
import pandas as pd
import numpy as np

## Basic NumPy
------------
If you're coming from R, NumPy won't seem too foreign. It's built around arrays: a 1-dimensional array is a **vector** and a 2-dimensional array is a **matrix**. Let's demonstrate the difference by converting a Python list to a NumPy vector.

In [2]:
vector = [4, 7, 15, 6]

# Convert to NumPy array
vector = np.array(vector)

vector

array([ 4,  7, 15,  6])

Now that it's in array form, we can apply NumPy computations to it.

In [3]:
np.mean(vector)

8.0
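
The same pattern works for other summary statistics. As a small sketch (the variable names `median`, `std`, `q1`, `q3`, and `iqr` are just illustrative), here are the median, the standard deviation, and the interquartile range of the same four values:

```python
import numpy as np

vector = np.array([4, 7, 15, 6])

median = np.median(vector)                 # middle of the sorted values
std = np.std(vector)                       # population standard deviation (ddof=0)
q1, q3 = np.percentile(vector, [25, 75])   # 25th and 75th percentiles
iqr = q3 - q1                              # interquartile range

print(median, std, iqr)
```

Note that `np.std` divides by `n` by default; pass `ddof=1` if you want the sample standard deviation instead.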

We can also reshape it into a 2x2 matrix.

In [4]:
matrix = np.reshape(vector, (2,2))

matrix

array([[ 4,  7],
       [15,  6]])

In [5]:
tall_vector = np.reshape(matrix, (4,1))

tall_vector

array([[ 4],
       [ 7],
       [15],
       [ 6]])
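
If you'd rather not count elements yourself, you can pass `-1` for one dimension and let NumPy work it out. A minimal sketch, reusing the same four values (the names `tall` and `wide` are made up for this example):

```python
import numpy as np

vector = np.array([4, 7, 15, 6])

# -1 asks NumPy to infer that dimension from the total element count
tall = vector.reshape(-1, 1)   # shape (4, 1)
wide = vector.reshape(1, -1)   # shape (1, 4)

print(tall.shape, wide.shape)

# A shape that does not account for exactly 4 elements raises ValueError
try:
    vector.reshape(3, 2)
except ValueError as err:
    print("reshape failed:", err)
```

The new shape always has to hold exactly as many elements as the original array, which is why the last call fails.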

You can also start out with the matrix (no need to make a flat list first).

In [7]:
my_matrix = np.array([[1,2,3], [4,5,6]])

my_matrix

array([[1, 2, 3],
       [4, 5, 6]])

NumPy treats each inner list as one row of the 2-D matrix. More generally, every extra level of list nesting adds a dimension to the array.
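
A quick way to check how NumPy read a nested list is to look at the array's `ndim` and `shape` attributes. The sketch below uses three made-up arrays (`flat`, `table`, `cube`) with one, two, and three levels of nesting:

```python
import numpy as np

flat = np.array([1, 2, 3])                              # one level of nesting
table = np.array([[1, 2, 3], [4, 5, 6]])                # two levels
cube = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])   # three levels

for arr in (flat, table, cube):
    print(arr.ndim, arr.shape)
# 1 (3,)
# 2 (2, 3)
# 3 (2, 2, 2)
```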

## Basic Pandas
------------
At its most basic, Pandas is a souped-up version of Excel with a zillion more features (it's way more than that, but we'll get there). I use Pandas mostly to store data in easy-to-navigate objects, often in tabular form (like a CSV or Excel sheet). However, the first Pandas data type we'll discuss is a **Series**.

### Series
--------

In [8]:
# Think of a Series as one column of an Excel sheet
my_list = ["Washington, D.C.", "Richmond", "Brooklyn", "San Francisco"]

pd.Series(data=my_list)

0    Washington, D.C.
1            Richmond
2            Brooklyn
3       San Francisco
dtype: object

In [9]:
# You can replace the index values easily as well
labels = ["First", "Second", "Third", "Fourth"]

pd.Series(data=my_list, index=labels)

First     Washington, D.C.
Second            Richmond
Third             Brooklyn
Fourth       San Francisco
dtype: object

In [10]:
# You can also pass the index values directly in the Series call

pd.Series(data=my_list, index=[1,2,3,4])

1    Washington, D.C.
2            Richmond
3            Brooklyn
4       San Francisco
dtype: object

In [12]:
# You can perform element-wise computations across Series as well
# (values are matched up by index label)

series1 = pd.Series(data=[4,5,6,7], index=my_list)
series2 = pd.Series(data=[1,2,1,2], index=my_list)

series1 + series2

Washington, D.C.    5
Richmond            7
Brooklyn            7
San Francisco       9
dtype: int64
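
The addition above lines up nicely because both Series share the same index. When the labels differ, pandas still matches by label, and any label found in only one Series comes back as `NaN`. A small sketch with two made-up Series, `a` and `b`:

```python
import pandas as pd

a = pd.Series([4, 5, 6], index=["Richmond", "Brooklyn", "San Francisco"])
b = pd.Series([1, 2, 1], index=["Brooklyn", "Richmond", "Washington, D.C."])

total = a + b                     # labels are matched, not positions
filled = a.add(b, fill_value=0)   # treat a missing label as 0 instead

print(total)
print(filled)
```

In `total`, only Richmond and Brooklyn get a number; San Francisco and Washington, D.C. are `NaN` because each appears in just one Series. The `fill_value` argument is one way around that if a missing entry should count as zero.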

### DataFrames
---------
This is the real money maker. When you imagine tabular data in a CSV or Excel sheet, you're picturing the visual representation of a DataFrame.
<br> <br>
Let's explore...

In [13]:
# Make a DataFrame of random numbers, for demonstration purposes
from numpy.random import randn

df = pd.DataFrame(randn(4,4), index=["A", "B", "C", "D"], columns=["L", "M", "N", "O"])

df

          L         M         N         O
A  0.117932 -1.189490 -1.238757 -2.223633
B -0.841645 -0.062229  0.610884 -1.017243
C -0.534026  0.955982 -0.115022 -0.535981
D  0.344370 -0.662984 -0.973008 -1.728365
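
Because the numbers are random, your table will differ from the one shown here. If you want the same values every time you run the cell, one option is NumPy's seeded `Generator`. This is an alternative to the `randn` call above, not what produced the table, and the seed value `42` is arbitrary:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)   # any fixed integer makes the draw repeatable

df = pd.DataFrame(
    rng.standard_normal((4, 4)),       # 4x4 draws from a standard normal
    index=["A", "B", "C", "D"],
    columns=["L", "M", "N", "O"],
)

print(df)
```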


Not too shabby! We can grab (or "slice") columns and rows of interest with very basic syntax.

In [14]:
# Grab the "L" column
df["L"]

A    0.117932
B   -0.841645
C   -0.534026
D    0.344370
Name: L, dtype: float64

In [15]:
# Grab the "L" and "O" columns
# NOTE the double brackets required here
df[["L", "O"]]

          L         O
A  0.117932 -2.223633
B -0.841645 -1.017243
C -0.534026 -0.535981
D  0.344370 -1.728365
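
Rows work a little differently from columns: you go through `.loc` (by label) or `.iloc` (by position). Here's a sketch that rebuilds the table above from its printed values so it runs on its own; the names `row_b`, `first_two`, `block`, and `cell` are just for illustration:

```python
import pandas as pd

df = pd.DataFrame(
    [[0.117932, -1.189490, -1.238757, -2.223633],
     [-0.841645, -0.062229, 0.610884, -1.017243],
     [-0.534026, 0.955982, -0.115022, -0.535981],
     [0.344370, -0.662984, -0.973008, -1.728365]],
    index=["A", "B", "C", "D"],
    columns=["L", "M", "N", "O"],
)

row_b = df.loc["B"]                       # one row by label, returned as a Series
first_two = df.iloc[0:2]                  # rows by position (end is exclusive)
block = df.loc[["A", "C"], ["L", "O"]]    # rows and columns together
cell = df.loc["D", "M"]                   # a single value

print(row_b)
print(first_two)
print(block)
print(cell)
```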


We can quickly add a new column, the same way that we'd add a new key to a dictionary.

In [18]:
# Assign a new column holding each row's mean (hence axis=1)
# With axis=0 you would get one mean per column instead
df["NEW GUY"] = df.mean(axis=1)

df

          L         M         N         O   NEW GUY
A  0.117932 -1.189490 -1.238757 -2.223633 -1.133487
B -0.841645 -0.062229  0.610884 -1.017243 -0.327558
C -0.534026  0.955982 -0.115022 -0.535981 -0.057262
D  0.344370 -0.662984 -0.973008 -1.728365 -0.754997
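
The `axis` argument is easy to mix up, so here's a tiny sketch with round numbers where you can check the means by hand. The two-row frame and the column name `ROW MEAN` are invented for this example:

```python
import pandas as pd

df = pd.DataFrame({"L": [1.0, 3.0], "M": [2.0, 6.0]}, index=["A", "B"])

col_means = df.mean(axis=0)   # one value per column: L=2.0, M=4.0
row_means = df.mean(axis=1)   # one value per row:    A=1.5, B=4.5

df["ROW MEAN"] = row_means               # new column, aligned on the row labels
trimmed = df.drop(columns="ROW MEAN")    # returns a copy without that column

print(col_means)
print(df)
print(trimmed)
```

`drop` leaves the original frame alone and hands back a new one, so `df` still has the extra column afterwards.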
