The pandas package is the most important tool at the disposal of Data Scientists and Analysts working in Python today.

# **What's Pandas for?**
This tool is essentially your data’s home. Through pandas, you get acquainted with your data by cleaning, transforming, and analyzing it.

For example, say you want to explore a dataset stored in a CSV on your computer. Pandas will extract the data from that CSV into a DataFrame — a table, basically — then let you do things like:

*   Calculate statistics and answer questions about the data, like:
    *   What's the average, median, max, or min of each column?
    *   Does column A correlate with column B?
    *   What does the distribution of data in column C look like?
*   Clean the data by doing things like removing missing values and filtering rows or columns by some criteria
*   Visualize the data with help from Matplotlib. Plot bars, lines, histograms, bubbles, and more.
*   Store the cleaned, transformed data back into a CSV, other file, or database
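
The tasks listed above can be sketched in a few lines. This is a minimal illustration, not a recommended workflow: the CSV contents and the column names `A`, `B`, and `C` are made up for the example, the data is read from an in-memory string rather than a file on disk, and the plotting step is left out so the snippet needs only pandas.

```python
import io

import pandas as pd

# Hypothetical CSV contents; a real workflow would pass a file path to pd.read_csv.
csv_text = """A,B,C
1,2.0,10
2,4.1,
3,6.2,30
4,7.9,40
"""

df = pd.read_csv(io.StringIO(csv_text))

# Summary statistics (count, mean, min, max, quartiles) for every numeric column.
summary = df.describe()

# Pearson correlation between columns A and B.
corr_ab = df["A"].corr(df["B"])

# Clean: drop rows with missing values, then keep only rows where A is greater than 1.
cleaned = df.dropna()
cleaned = cleaned[cleaned["A"] > 1]

# Store the cleaned data as CSV text (pass a path to to_csv to write a file instead).
cleaned_csv = cleaned.to_csv(index=False)
print(cleaned_csv)
```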

Before you jump into modeling or complex visualizations, you need a good understanding of the nature of your dataset, and pandas is the best avenue for getting it.

# **How does pandas fit into the data science toolkit?**
Not only is the pandas library a central component of the data science toolkit, but it is also used in conjunction with the other libraries in that collection.

Pandas is built on top of the **NumPy** package, meaning a lot of the structure of NumPy is used or replicated in pandas. Data in pandas is often used to feed statistical analysis in **SciPy**, plotting functions from **Matplotlib**, and machine learning algorithms in **Scikit-learn**.
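
As a small illustration of the NumPy connection, the sketch below converts a DataFrame to a NumPy array and applies a NumPy function to a pandas column. The column names `x` and `y` are placeholders chosen for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [4.0, 5.0, 6.0]})

# to_numpy() returns the DataFrame's values as a NumPy array.
arr = df.to_numpy()
print(type(arr), arr.shape)

# NumPy functions can be applied to pandas objects directly;
# the result here is still a pandas Series.
roots = np.sqrt(df["x"])
print(type(roots))
```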

Jupyter Notebooks (with the .ipynb file extension) offer a good environment for using pandas to do data exploration and modeling, but pandas works just as well in plain Python scripts written in a text editor.

Notebooks give us the ability to execute the code in a particular cell instead of running the entire file. This saves a lot of time when working with large datasets and complex transformations. Notebooks also provide an easy way to display pandas DataFrames and plots.

# **Pandas First Steps**


## Install and import
Pandas is an easy package to install. Open up your terminal program (for Mac users) or command line (for PC users) and install it using the following command:

    pip install pandas

On a Mac, you may need to use **`pip3`** in place of `pip`.

Because pandas is used so often, it is conventionally imported under the shorter alias `pd`:

    import pandas as pd



In [28]:
# Import pandas and NumPy, plus the weather-data helper used in the exercise at the end
import pandas as pd 
import numpy as np
from portland_wx import getting_weather_data

## Core components of pandas: Series and DataFrames
The two primary components of pandas are the Series and the DataFrame.

A ***Series*** is essentially a column, and a ***DataFrame*** is a two-dimensional table made up of a collection of Series.

DataFrames and Series are quite similar: many operations that work on one also work on the other, such as filling in null values and calculating the mean.
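
A minimal sketch of that overlap, using a small made-up table: `mean()` and `fillna()` are called the same way on a Series and on a DataFrame, and only the shape of the result differs.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [2.5, np.nan, 5.0], "visits": [1, 3, 2]})
ages = df["age"]  # a single column is a Series

# mean() on a Series returns one number; missing values are skipped by default.
series_mean = ages.mean()

# mean() on a DataFrame returns one mean per column, packaged as a Series.
frame_means = df.mean()

# fillna() replaces missing values in either kind of object.
filled_series = ages.fillna(0)
filled_frame = df.fillna(0)

print(series_mean)
print(frame_means)
```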

You'll see how these components work when we start working with data below.




## Creating DataFrames from scratch
Creating DataFrames right in Python is good to know and quite useful when testing new methods and functions you find in the pandas docs.

There are many ways to create a DataFrame from scratch, but a great option is to just use a simple `dict`.
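
For comparison, here is a sketch of two of the other constructors next to the `dict` form. The names and scores are invented for the example; all three calls describe the same two-row table.

```python
import pandas as pd

# Column-oriented: one dictionary key per column.
from_dict = pd.DataFrame({"name": ["Ann", "Bo"], "score": [90, 85]})

# Row-oriented: one dictionary per row.
from_records = pd.DataFrame([{"name": "Ann", "score": 90}, {"name": "Bo", "score": 85}])

# Row-oriented lists, with the column names supplied separately.
from_rows = pd.DataFrame([["Ann", 90], ["Bo", 85]], columns=["name", "score"])

# Compare the contents of the three tables column by column.
expected = from_dict.to_dict(orient="list")
print(from_records.to_dict(orient="list") == expected)
print(from_rows.to_dict(orient="list") == expected)
```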

Let's say we have a small set of animal records. We want a column for each attribute (animal, age, visits, and priority) and a row for each animal. To organize this as a dictionary for pandas, we could do something like:

In [11]:
# Create a DataFrame df from this dictionary "data"

data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}

data_df = pd.DataFrame(data)

data_df

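|   | animal | age | visits | priority |
|---|--------|-----|--------|----------|
| 0 | cat | 2.5 | 1 | yes |
| 1 | cat | 3.0 | 3 | yes |
| 2 | snake | 0.5 | 2 | no |
| 3 | dog | NaN | 3 | yes |
| 4 | dog | 5.0 | 2 | no |
| 5 | cat | 2.0 | 3 | no |
| 6 | snake | 4.5 | 1 | no |
| 7 | cat | NaN | 1 | yes |
| 8 | dog | 7.0 | 2 | no |
| 9 | dog | 3.0 | 1 | no |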


The dictionary-list constructor assigns values to the column labels, but just uses an ascending count from 0 (0, 1, 2, 3, ...) for the row labels.

Sometimes this is OK, but oftentimes we will want to assign these labels ourselves.

The list of row labels used in a DataFrame is known as an **Index**. We can set it by passing a list of labels to the `index` parameter of the constructor. The length of that list must equal the number of rows in the DataFrame.

In [13]:
# Create a DataFrame df from this dictionary "data" which has the index "labels".

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}

data_df2 = pd.DataFrame(data, index=labels)

data_df2

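|   | animal | age | visits | priority |
|---|--------|-----|--------|----------|
| a | cat | 2.5 | 1 | yes |
| b | cat | 3.0 | 3 | yes |
| c | snake | 0.5 | 2 | no |
| d | dog | NaN | 3 | yes |
| e | dog | 5.0 | 2 | no |
| f | cat | 2.0 | 3 | no |
| g | snake | 4.5 | 1 | no |
| h | cat | NaN | 1 | yes |
| i | dog | 7.0 | 2 | no |
| j | dog | 3.0 | 1 | no |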
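
The length requirement on the index can be checked with a short sketch. The three-row dictionary here is a cut-down stand-in for the data above; passing too few labels should make the constructor raise a `ValueError`.

```python
import pandas as pd

data = {"animal": ["cat", "snake", "dog"], "visits": [1, 2, 3]}

# Three rows and three labels: the constructor succeeds.
df = pd.DataFrame(data, index=["a", "b", "c"])
print(list(df.index))

# Three rows but only two labels: pandas rejects the mismatch.
try:
    pd.DataFrame(data, index=["a", "b"])
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc
    print("ValueError:", exc)
```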


## Creating Series from scratch

A **Series**, by contrast, is a sequence of data values. If a DataFrame is a table, a Series is a list. And in fact you can create one with nothing more than a list:

In [14]:
# basic Pandas Series
sample_series1 = pd.Series([1, 2, 3, 4, 5])

sample_series1

0    1
1    2
2    3
3    4
4    5
dtype: int64

A Series is, in essence, a single column of a DataFrame, so you can assign row labels to a Series the same way as before, using an `index` parameter. However, a Series does not have column names; it has only one overall `name`:

In [17]:
# Series with specific index names and Series name
sample_series2 = pd.Series([20, 30, 40], index = ['2015 Sales', '2016 Sales', '2017 Sales'], name = 'Product A')
sample_series2

2015 Sales    20
2016 Sales    30
2017 Sales    40
Name: Product A, dtype: int64
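
The relationship runs in the other direction too. As a rough sketch with a made-up three-row table, selecting one column from a DataFrame gives back a Series whose `name` is the column label and whose index is the DataFrame's index.

```python
import pandas as pd

df = pd.DataFrame(
    {"animal": ["cat", "snake", "dog"], "visits": [1, 2, 3]},
    index=["a", "b", "c"],
)

# Selecting a single column returns a Series.
visits = df["visits"]

print(type(visits).__name__)
print(visits.name)          # the column label becomes the Series name
print(list(visits.index))   # the row labels are carried over
```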

In [6]:
# Convert a Pandas Series to a Python list


## Converting Pandas back to Python

Almost any calculation or manipulation you might want to do with a Python list or dictionary can be done with a pandas Series or DataFrame. However, if you ever need to convert back to Python objects, you can call the `to_list()` or `to_dict()` method on a Series, or the `to_dict()` method on a DataFrame.

In [18]:
# Convert the first DataFrame above to a Python dictionary.
# The result maps each column name to a {row label: value} dictionary.

data_dict = data_df.to_dict()
data_dict

{'animal': {0: 'cat',
  1: 'cat',
  2: 'snake',
  3: 'dog',
  4: 'dog',
  5: 'cat',
  6: 'snake',
  7: 'cat',
  8: 'dog',
  9: 'dog'},
 'age': {0: 2.5,
  1: 3.0,
  2: 0.5,
  3: nan,
  4: 5.0,
  5: 2.0,
  6: 4.5,
  7: nan,
  8: 7.0,
  9: 3.0},
 'visits': {0: 1, 1: 3, 2: 2, 3: 3, 4: 2, 5: 3, 6: 1, 7: 1, 8: 2, 9: 1},
 'priority': {0: 'yes',
  1: 'yes',
  2: 'no',
  3: 'yes',
  4: 'no',
  5: 'no',
  6: 'no',
  7: 'yes',
  8: 'no',
  9: 'no'}}
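
The nested dictionary above is the default layout for a DataFrame. The sketch below shows the Series conversions alongside it, plus the `orient="list"` option of `DataFrame.to_dict()`, which produces the flatter column-to-list shape used to build the DataFrame in the first place. The small `df` here is a cut-down stand-in for the data above.

```python
import pandas as pd

sales = pd.Series(
    [20, 30, 40],
    index=["2015 Sales", "2016 Sales", "2017 Sales"],
    name="Product A",
)

# Series -> list: only the values are kept; the index is dropped.
sales_list = sales.to_list()

# Series -> dict: index labels become keys.
sales_dict = sales.to_dict()

df = pd.DataFrame({"animal": ["cat", "dog"], "visits": [1, 3]})

# DataFrame -> dict, default layout: {column: {row label: value}}.
by_column = df.to_dict()

# DataFrame -> dict of lists: {column: [values]}.
as_lists = df.to_dict(orient="list")

print(type(sales_list), type(sales_dict), type(as_lists))
```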

# Your Turn

Fill a Python dictionary with the data from the PortlandWeather data file used in the previous project.

Then make a new pandas DataFrame from that dictionary.

In [2]:
# Store PortlandWeather data in a Python dictionary
# getting_weather_data was imported by name earlier, so it is called without a module prefix
wx_data = getting_weather_data('/Users/244213/Desktop/DataAnalytics/PythonReview/PortlandWeather2013.txt')

wx_data

In [None]:
# Create a Pandas dataframe

wx_df = pd.DataFrame(wx_data)

wx_df
