# NumPy and Pandas

### By: Ananya Pattnaik and Jasmine Wu
This is the second part of our notebook references. Here is the [first one](http://localhost:8888/notebooks/Desktop/Python%20Guide%20.ipynb)!

## NumPy

Data scientists use many different libraries, and one of the most common is NumPy. It is an open-source Python library used in almost every field of science and engineering. Before we can use it, we have to import it.

In [149]:
import numpy as np

We shorten `numpy` to `np` in order to save time and also to keep code standardized so that anyone working with your code can easily understand and run it.

### Common `np` Functions

A **NumPy array** is the library's core data structure. While a Python list can contain different data types, all of the elements in a NumPy array share the same type. Because of this, NumPy arrays are generally faster and more memory-efficient than Python lists for numerical work, and they support element-wise arithmetic directly. To create a basic array:

In [150]:
np.array([1,2,3])

array([1, 2, 3])
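
As a rough sketch of the same-type rule and of element-wise arithmetic (the names `mixed`, `prices`, and `doubled` are our own, chosen only for illustration):

```python
import numpy as np

# A list can mix types, but an array stores a single dtype,
# so the integers here are upcast to floats.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)    # float64

# Arithmetic applies to every element at once, with no explicit loop.
prices = np.array([1, 2, 3])
doubled = prices * 2
print(doubled)        # [2 4 6]
```

Multiplying a plain Python list by 2 would repeat the list instead, which is one reason arrays are more convenient for numerical work.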

The `np.arange()` function is similar to the Python `range()` function we talked about earlier. But while `range()` returns a `range` object (in Python 3), `np.arange()` returns an **array**.

In [151]:
np.arange(8)

array([0, 1, 2, 3, 4, 5, 6, 7])
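
To see the difference for yourself, here is a small sketch (the variable names are just for illustration). It also shows that `np.arange()` takes optional start, stop, and step arguments, much like `range()`:

```python
import numpy as np

r = range(8)
a = np.arange(8)
print(type(r))    # <class 'range'>
print(type(a))    # <class 'numpy.ndarray'>

# Like range(), np.arange() accepts start, stop, and step.
evens = np.arange(0, 10, 2)
print(evens)      # [0 2 4 6 8]

# Unlike range(), the step may be a float.
halves = np.arange(0, 2, 0.5)
print(halves.tolist())
```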

For a more detailed breakdown of other common `np` functions, here is a helpful [cheat sheet](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Numpy_Python_Cheat_Sheet.pdf) we found online! 

## Pandas

Pandas is a Python package for working with tabular data, and an important tool for data scientists and analysts. We import it the same way we imported NumPy.

In [152]:
import pandas as pd

Pandas is used to clean data, calculate statistics and answer questions about the data, visualize it, and store the cleaned result.

### Series and DataFrames

The two primary components of pandas are the `Series` and the `DataFrame`.

A `Series` is essentially a single column, and a `DataFrame` is a two-dimensional table made up of a collection of Series.

![image.png](attachment:image.png)
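
One way to picture this relationship is to build two Series and place them side by side. This is only a sketch with made-up values, and `pd.concat` is just one of several ways to combine Series:

```python
import pandas as pd

# Two illustrative columns, each stored as a Series.
apples = pd.Series([3, 2, 0, 1], name='apples')
oranges = pd.Series([0, 3, 7, 2], name='oranges')

# Placing Series side by side (axis=1) yields a DataFrame.
table = pd.concat([apples, oranges], axis=1)
print(type(apples))
print(type(table))
print(table.shape)    # (4, 2)
```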

There are *many* ways to create a `DataFrame` from scratch, but a great option is to just use a simple `dict`.

In [153]:
data = {
    'apples': [3, 2, 0, 1], 
    'oranges': [0, 3, 7, 2]
}

In [154]:
fruits_bought = pd.DataFrame(data)

fruits_bought

|  | apples | oranges |
|---|---|---|
| 0 | 3 | 0 |
| 1 | 2 | 3 |
| 2 | 0 | 7 |
| 3 | 1 | 2 |


### Index

The index of this `DataFrame` was given to us on creation as the numbers 0-3, but we could also supply our own when we create the `DataFrame`. Adding an index to the example above:

In [155]:
fruits_bought = pd.DataFrame(data, index=['Ananya', 'Jasmine', 'Joyce', 'Ben'])

fruits_bought

|  | apples | oranges |
|---|---|---|
| Ananya | 3 | 0 |
| Jasmine | 2 | 3 |
| Joyce | 0 | 7 |
| Ben | 1 | 2 |


### `.loc` and `.iloc` 

For rows, we have two options:
- `.loc` - locates by label (the index name)
- `.iloc` - locates by integer position

`.loc` and `.iloc` are similar to Python indexing and slicing, but for DataFrames.

### `.loc`

Now, we could **loc**ate a customer's order by using their name:

In [156]:
fruits_bought.loc['Jasmine']

apples     2
oranges    3
Name: Jasmine, dtype: int64

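### `.iloc`

We could also find the same order by its integer position. Jasmine is in the second row, which is position 1: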

In [157]:
fruits_bought.iloc[1]

apples     2
oranges    3
Name: Jasmine, dtype: int64
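
Both accessors also take slices and an optional column. Here is a short sketch that rebuilds the same table so it runs on its own (the names `by_label` and `by_position` are our own):

```python
import pandas as pd

fruits_bought = pd.DataFrame(
    {'apples': [3, 2, 0, 1], 'oranges': [0, 3, 7, 2]},
    index=['Ananya', 'Jasmine', 'Joyce', 'Ben'],
)

# .loc slices by label and includes BOTH endpoints.
by_label = fruits_bought.loc['Jasmine':'Joyce']

# .iloc slices by position and excludes the stop, like a Python list.
by_position = fruits_bought.iloc[1:3]

print(by_label.equals(by_position))    # True

# A second argument picks a column as well as a row.
print(fruits_bought.loc['Joyce', 'oranges'])    # 7
print(fruits_bought.iloc[2, 1])                 # 7
```

The endpoint difference is easy to trip over: a label slice keeps the last label, while a position slice stops just before the last position.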

There is more to locating and extracting data from a `DataFrame`, but for now you should be able to create a `DataFrame` from scratch with any sample data to practice on.

### Importing Outside Files

If you have a CSV (comma-separated values) file called "file.csv": `pd.read_csv('file.csv')`

If you have a JSON file called "file.json", which looks much like a stored Python dict, pandas can read it just as easily: `pd.read_json('file.json')`
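
To try `read_csv` without a file on disk, we can hand it an in-memory text buffer instead of a path. This is only a sketch: the CSV text is made up, and `index_col` is an optional argument that picks which column becomes the index.

```python
import io
import pandas as pd

# Stand-in for the contents of a CSV file (illustrative data, not a real file).
csv_text = """name,apples,oranges
Ananya,3,0
Jasmine,2,3
"""

# read_csv accepts a path or a file-like object.
from_csv = pd.read_csv(io.StringIO(csv_text), index_col='name')
print(from_csv)

# Writing goes the other way; to_csv() with no path returns a string.
print(from_csv.to_csv())
```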

### Some Common (Useful) Pandas Functions

`head()` returns the first five rows by default, or as many rows as you pass in as an argument (here, 2):

In [158]:
fruits_bought.head(2) 

|  | apples | oranges |
|---|---|---|
| Ananya | 3 | 0 |
| Jasmine | 2 | 3 |


`tail()` returns the last five rows by default, or as many rows as you pass in as an argument (here, 2):

In [159]:
fruits_bought.tail(2) 

|  | apples | oranges |
|---|---|---|
| Joyce | 0 | 7 |
| Ben | 1 | 2 |


`isnull()` shows whether each value is missing ("null"), using True/False:

In [160]:
fruits_bought.isnull() 

|  | apples | oranges |
|---|---|---|
| Ananya | False | False |
| Jasmine | False | False |
| Joyce | False | False |
| Ben | False | False |
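
Our table has no missing values, so every entry is False. As a sketch of what happens when a value is missing, here is a made-up table (`orders` is a hypothetical name) with one `np.nan` in it:

```python
import numpy as np
import pandas as pd

# Illustrative data with one missing value (np.nan).
orders = pd.DataFrame(
    {'apples': [3, np.nan, 0], 'oranges': [0, 3, 7]},
    index=['Ananya', 'Jasmine', 'Joyce'],
)

# True counts as 1, so summing the mask counts missing values per column.
missing_per_column = orders.isnull().sum()
print(missing_per_column)

# dropna() returns a copy without the rows that contain a missing value.
complete = orders.dropna()
print(complete.shape)    # (2, 2)
```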


The mean value of each column:

In [161]:
fruits_bought.mean() 

apples     1.5
oranges    3.0
dtype: float64

Sums up the values in each column:

In [162]:
fruits_bought.sum() 

apples      6
oranges    12
dtype: int64

Returns the number of rows, number of columns:

In [163]:
fruits_bought.shape 

(4, 2)

Names of the columns:

In [164]:
fruits_bought.columns 

Index(['apples', 'oranges'], dtype='object')

There are many more pandas functions; these are just a few of the most common ones!

### Slicing, Selecting, and Extracting

To select a single column, put its name in square brackets:

In [165]:
apple = fruits_bought['apples']

apple 

Ananya     3
Jasmine    2
Joyce      0
Ben        1
Name: apples, dtype: int64

In [166]:
type(apple)

pandas.core.series.Series

This will return a `Series`. To extract a column as a `DataFrame`, you need to pass a list of column names. In our case that's just a single column:

In [167]:
apple_df = fruits_bought[['apples']]

apple_df

|  | apples |
|---|---|
| Ananya | 3 |
| Jasmine | 2 |
| Joyce | 0 |
| Ben | 1 |


In [168]:
type(apple_df)

pandas.core.frame.DataFrame

This returns a `DataFrame`, which is useful when you need a subset of the original DataFrame.
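
The same bracket syntax extends to several columns, and to rows that meet a condition. The sketch below rebuilds the table so it runs on its own; `reordered` and `apple_buyers` are names we made up for the example.

```python
import pandas as pd

fruits_bought = pd.DataFrame(
    {'apples': [3, 2, 0, 1], 'oranges': [0, 3, 7, 2]},
    index=['Ananya', 'Jasmine', 'Joyce', 'Ben'],
)

# A list of names selects several columns, in the order listed.
reordered = fruits_bought[['oranges', 'apples']]
print(list(reordered.columns))     # ['oranges', 'apples']

# A boolean Series selects the rows where the condition is True.
apple_buyers = fruits_bought[fruits_bought['apples'] >= 2]
print(list(apple_buyers.index))    # ['Ananya', 'Jasmine']
```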

### Adding and Removing Columns

**NOTE:**
In the following sections, `column_name` represents any column, `phrase` represents any phrase, and `dataframe` represents any data frame.

To add a column: `dataframe['new_column_name'] = series`, where `series` can be a `Series` or a list with one value per row.

In [169]:
markets = ["Trader Joe's", "Trader Joe's", 'Safeway', "Farmers' Market"]
fruits_bought['markets'] = markets
fruits_bought

|  | apples | oranges | markets |
|---|---|---|---|
| Ananya | 3 | 0 | Trader Joe's |
| Jasmine | 2 | 3 | Trader Joe's |
| Joyce | 0 | 7 | Safeway |
| Ben | 1 | 2 | Farmers' Market |
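
A new column can also be computed from existing ones. As a small sketch (the `total` column is a hypothetical addition, not part of the table above):

```python
import pandas as pd

fruits_bought = pd.DataFrame(
    {'apples': [3, 2, 0, 1], 'oranges': [0, 3, 7, 2]},
    index=['Ananya', 'Jasmine', 'Joyce', 'Ben'],
)

# Adding two columns works row by row and gives a new Series.
fruits_bought['total'] = fruits_bought['apples'] + fruits_bought['oranges']
print(fruits_bought['total'].tolist())    # [3, 5, 7, 3]
```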


To remove a row: `dataframe.drop('row_name')`.

To remove a column: `dataframe.drop('column_name', axis=1)`.

In [170]:
fruits_bought = fruits_bought.drop('oranges', axis=1)
fruits_bought

|  | apples | markets |
|---|---|---|
| Ananya | 3 | Trader Joe's |
| Jasmine | 2 | Trader Joe's |
| Joyce | 0 | Safeway |
| Ben | 1 | Farmers' Market |
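
Note that `drop` returns a new `DataFrame` and leaves the original untouched, which is why the cell above assigns the result back to `fruits_bought`. A quick sketch that checks this (the names `without_ben` and `no_oranges` are ours):

```python
import pandas as pd

fruits_bought = pd.DataFrame(
    {'apples': [3, 2, 0, 1], 'oranges': [0, 3, 7, 2]},
    index=['Ananya', 'Jasmine', 'Joyce', 'Ben'],
)

# Dropping a row by its index label.
without_ben = fruits_bought.drop('Ben')

# columns= is an alternative spelling of drop(..., axis=1).
no_oranges = fruits_bought.drop(columns='oranges')

print(without_ben.shape)      # (3, 2)
print(no_oranges.shape)       # (4, 1)
print(fruits_bought.shape)    # (4, 2), the original is unchanged
```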


### `str` methods

**NOTE**: `method_name` refers to any `str` method.

To manipulate string data, we could use a list comprehension, but pandas offers a more convenient approach: the `str` methods of the `Series` class.

The general format for `str` methods on a `Series` is: `dataframe['column_name'].str.method_name('phrase')`.

To keep only the rows of a `DataFrame` where the method returns True: `dataframe[dataframe['column_name'].str.method_name('phrase')]`.


Suppose we want to find out which strings in the column `'markets'` start with the letter `'T'`.

In [171]:
fruits_bought[fruits_bought['markets'].str.startswith('T')]

|  | apples | markets |
|---|---|---|
| Ananya | 3 | Trader Joe's |
| Jasmine | 2 | Trader Joe's |


Suppose we want to find out which strings in the column `'markets'` contain the lowercase letter `'t'`. The match is case-sensitive, so "Trader Joe's" does not count.

In [172]:
fruits_bought[fruits_bought['markets'].str.contains('t')]

|  | apples | markets |
|---|---|---|
| Ben | 1 | Farmers' Market |
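
On their own, `str` methods return a new `Series` with one result per row. Here is a sketch on a standalone `Series` of the same market names, including the optional `case=False` argument of `str.contains` for a case-insensitive match:

```python
import pandas as pd

markets = pd.Series(["Trader Joe's", "Trader Joe's", 'Safeway', "Farmers' Market"])

# Each method is applied to every string in the Series.
print(markets.str.lower().tolist())

# The default match is case-sensitive; case=False ignores case.
strict = markets.str.contains('t')
relaxed = markets.str.contains('t', case=False)
print(strict.tolist())     # [False, False, False, True]
print(relaxed.tolist())    # [True, True, False, True]
```

Because the result of `str.contains` is a boolean `Series`, it can be passed straight into the square brackets of a `DataFrame` to filter rows.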


A lot of `str` methods are intuitive, and there are many more common methods that we will not go over. The examples above are just a few to show you how to use them. To see a full list of `str` methods, check out the bottom of [this link](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html).

### `groupby` features

The function `groupby` groups the rows of a DataFrame by the values of one column and then applies an aggregation method to the remaining columns (usually the numeric ones) within each group.

The general format for a `groupby` function on a `DataFrame`: `dataframe.groupby(['column_name']).method_name()`.

In [173]:
mean_fruits = fruits_bought.groupby(['markets']).mean()
mean_fruits

| markets | apples |
|---|---|
| Farmers' Market | 1.0 |
| Safeway | 0.0 |
| Trader Joe's | 2.5 |


Here, we used `groupby` to group the rows by market, and used the `mean()` method to show the average number of apples bought from each market. Some other aggregate methods that can be used in place of `mean()` are `sum()`, `count()`, `max()`, `min()`, etc.
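
As a sketch of those other aggregates, the example below rebuilds the table, selects the `apples` column after grouping, and uses `agg` to compute several summaries at once (the names `totals` and `summary` are our own):

```python
import pandas as pd

fruits_bought = pd.DataFrame(
    {
        'apples': [3, 2, 0, 1],
        'markets': ["Trader Joe's", "Trader Joe's", 'Safeway', "Farmers' Market"],
    },
    index=['Ananya', 'Jasmine', 'Joyce', 'Ben'],
)

# Total apples bought at each market.
totals = fruits_bought.groupby('markets')['apples'].sum()
print(totals)

# agg() takes a list of aggregate names and returns one column per aggregate.
summary = fruits_bought.groupby('markets')['apples'].agg(['sum', 'count', 'max'])
print(summary)
```

The `count` column is a handy check on how many rows fell into each group: here, two customers shopped at Trader Joe's and one at each of the others.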

### References

We used the following Data 100 lecture slides to help us with this notebook!

[Data 100 - Lec 5](https://docs.google.com/presentation/d/1afDZnCeBrzdOlL3osFNb3hdb21Hrt0pJG02USEqIGPw/edit#slide=id.g8ae4121a16_0_811)

[Data 100 - Lec 6](https://docs.google.com/presentation/d/1m_ZbB9dbJkj492TOqYxZBf1XOOfxOA-rLULWKgm8r9I/edit#slide=id.g8a5b9458f2_0_964)

# Thank You! xoxo