# PANDAS

### What are libraries? 

A programming library is a collection of prewritten code that programmers can reuse instead of writing common functionality from scratch.

<img src='NN1.jpg' style="border-radius: 10px"/>

## Introduction to Pandas


Built on top of NumPy, pandas is a package for data manipulation and analysis in Python. The name pandas is derived from the econometrics term Panel Data. Pandas incorporates two additional data structures into Python, namely pandas Series and pandas DataFrames. These data structures allow us to work with labeled and relational data in an easy and intuitive manner. This lesson is intended as a basic overview of pandas and introduces some of its most important features.

In this lesson you will learn:

1. How to import pandas

2. How to create pandas Series and DataFrames using various methods

3. How to access and change elements in Series and DataFrames

4. How to perform arithmetic operations on Series

5. How to load data into a DataFrame

6. How to deal with Not a Number (NaN) values

Link to Pandas Documentation: https://pandas.pydata.org/pandas-docs/stable/

## Why Use Pandas?

The recent success of machine learning algorithms is partly due to the huge amounts of data available to train them on. However, when it comes to data, quantity is not the only thing that matters; the quality of your data is just as important.

Large datasets often don’t come ready to be fed into your learning algorithms. More often than not, they have missing values, outliers, incorrect values, etc. Data with a lot of missing or bad values, for example, will not allow your machine learning algorithms to perform well. Therefore, one very important step in machine learning is to look at your data first and make sure it is well suited for your training algorithm by doing some basic data analysis.

This is where pandas comes in. Pandas Series and DataFrames are designed for fast data analysis and manipulation, and they are flexible and easy to use.

## Data structures in Pandas

1. Pandas Series

2. Pandas DataFrame

### Pandas Series

First things first: install pandas if it is not already available, then import it.

In [2]:
# Install once from a terminal if needed: pip install pandas
import pandas as pd

A pandas Series is a one-dimensional array-like object that can hold many data types, such as numbers or strings. Each element can optionally be given an axis label; together these labels form the index.
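As a minimal sketch of that definition, the example below builds one Series without labels and one with labels. The names `temperatures` and `capitals` and their values are made up for illustration and are not part of the lesson data.

```python
import pandas as pd

# No index given: pandas assigns the default labels 0, 1, 2, ...
temperatures = pd.Series([21.5, 19.0, 23.2])

# Explicit axis labels supplied through the index argument
capitals = pd.Series(["Paris", "Tokyo"], index=["France", "Japan"])

print(temperatures.index.tolist())  # the default integer labels
print(temperatures.dtype)           # numeric data gets a numeric dtype
print(capitals["Japan"])            # look up a value by its label
```

The same constructor handles both cases; only the `index` argument differs.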

## Creating a Pandas Series

In [5]:
# We create a pandas Series that stores a grocery list
groceries = pd.Series( data = ['eggs', 'apples', 'milk', 'bread'], index = [0,1,2,3])

# We display the groceries pandas Series
groceries

0      eggs
1    apples
2      milk
3     bread
dtype: object

We see that pandas Series are displayed with the indices in the first column and the data in the second column. Also, notice that the index labels here are integers while the data values are all strings, which is why the dtype is reported as object.

In [3]:
groceries.shape

(4,)

In [4]:
groceries.ndim

1

In [5]:
groceries.size

4

In [6]:
# To obtain the index
groceries.index

Index([0, 1, 2, 3], dtype='int64')

In [7]:
# To obtain the data
groceries.values

array(['eggs', 'apples', 'milk', 'bread'], dtype=object)

In [8]:
# `in` checks the index labels, not the values; 4 is not a label
4 in groceries

False

In [9]:
# 3 is one of the index labels
3 in groceries

True
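Because `in` looks at index labels, testing whether a value is stored in the Series needs a different expression. The sketch below shows two possible ways, using `.values` and `Series.isin`. The variable names are illustrative only.

```python
import pandas as pd

groceries = pd.Series(data=["eggs", "apples", "milk", "bread"], index=[0, 1, 2, 3])

# `in` on a Series tests the index labels
label_present = 3 in groceries

# "milk" is a value, not a label, so this membership test fails
value_as_label = "milk" in groceries

# To test the data itself, check against .values or use isin()
value_present = "milk" in groceries.values
any_match = groceries.isin(["milk"]).any()

print(label_present, value_as_label, value_present, any_match)
```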

In [10]:
groceries.dtype

dtype('O')

In [11]:
groceries

0      eggs
1    apples
2      milk
3     bread
dtype: object

## Accessing and Deleting Elements in Pandas Series

In [12]:
cars = pd.Series(data=[2500, 3400, 2785], index=['Toyota', 'Ford', 'KIA'])
cars

Toyota    2500
Ford      3400
KIA       2785
dtype: int64

In [13]:
# Accessing with index labels
cars['Ford']

3400

In [14]:
#Accessing with a list of index labels
cars[['Ford', 'Toyota']]

Ford      3400
Toyota    2500
dtype: int64

In [15]:
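# Accessing by integer position. Plain cars[0] on a label-indexed Series is
# deprecated in recent pandas versions, so .iloc is used here.
cars.iloc[0]

2500

In [16]:
cars.iloc[-1]

2785

In [17]:
cars.iloc[[1, 2]]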

Ford    3400
KIA     2785
dtype: int64

### loc and iloc

`loc` and `iloc` are pandas Series attributes for accessing data:

loc (location): accessing with index labels

iloc (integer location): accessing with integer positions

In [32]:
# By label
cars.loc['Ford']

3400

In [30]:
# By position: position 1 is the second element, Ford
cars.iloc[1]

3400
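The distinction matters most when the index labels are themselves integers that do not match the positions. The following sketch uses a hypothetical `scores` Series, not part of the lesson data, to show the two accessors returning different elements.

```python
import pandas as pd

# Hypothetical Series whose integer labels differ from the positions 0, 1, 2
scores = pd.Series(data=[88, 92, 75], index=[10, 20, 30])

by_label = scores.loc[20]      # the element whose label is 20
by_position = scores.iloc[0]   # the first element, whatever its label

print(by_label, by_position)
```

Using `loc` or `iloc` explicitly removes any ambiguity about whether an integer means a label or a position.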

### Mutability

In [18]:
cars['Ford'] = 2889

In [19]:
cars

Toyota    2500
Ford      2889
KIA       2785
dtype: int64

### Dropping elements

In [20]:
cars.drop('Toyota')

Ford    2889
KIA     2785
dtype: int64

In [21]:
# drop() returned a new Series; the original is unchanged
cars

Toyota    2500
Ford      2889
KIA       2785
dtype: int64

In [22]:
# Pass inplace=True to modify the Series itself
cars.drop('Toyota', inplace = True)

In [23]:
cars

Ford    2889
KIA     2785
dtype: int64
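An alternative to `inplace = True` is to keep the Series that `drop` returns. The sketch below assumes the original three-car Series and also passes a list to drop several labels at once. The name `remaining` is illustrative.

```python
import pandas as pd

cars = pd.Series(data=[2500, 3400, 2785], index=["Toyota", "Ford", "KIA"])

# drop() accepts a list of labels and returns a new Series
remaining = cars.drop(["Toyota", "KIA"])

print(remaining.index.tolist())  # labels left after dropping
print(len(cars))                 # the original still holds all three elements
```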

## Arithmetic Operations on Pandas Series


In [33]:
# We create a pandas Series that stores a grocery list of just fruits
fruits = pd.Series(data = [10, 6, 3], index = ['apples', 'oranges', 'bananas'])

# We display the fruits pandas Series
fruits

apples     10
oranges     6
bananas     3
dtype: int64

In [34]:
fruits * 2

apples     20
oranges    12
bananas     6
dtype: int64

In [36]:
fruits 

apples     10
oranges     6
bananas     3
dtype: int64

In [39]:
fruits - 2

apples     8
oranges    4
bananas    1
dtype: int64

In [40]:
fruits / 2

apples     5.0
oranges    3.0
bananas    1.5
dtype: float64

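In [41]:
fruits['bananas'] - 2

1

In [42]:
fruits.loc['bananas'] - 2

1

In [43]:
# Plain [] with an integer position is deprecated on a label-indexed Series
# in recent pandas versions; .iloc is the explicit alternative
fruits[1] - 2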

4

In [46]:
fruits.iloc[1] - 2

4

In [47]:
fruits.iloc[[0, 1]] * 2

apples     20
oranges    12
dtype: int64

In [48]:
# We create a pandas Series that stores a grocery list
groceries = pd.Series( data = ['eggs', 'apples', 'milk', 'bread'], index = [0,1,2,3])

In [49]:
# For strings, * repeats each value instead of doing numeric multiplication
groceries * 2

0        eggseggs
1    applesapples
2        milkmilk
3      breadbread
dtype: object
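The operations above combine a Series with a single number. Arithmetic between two Series also works element-wise, and pandas pairs the elements by index label rather than by position. The sketch below illustrates this with a hypothetical `prices` Series whose values are made up for the example.

```python
import pandas as pd

fruits = pd.Series(data=[10, 6, 3], index=["apples", "oranges", "bananas"])

# Hypothetical unit prices, deliberately listed in a different order
prices = pd.Series(data=[0.25, 0.5, 0.4], index=["bananas", "apples", "oranges"])

# Multiplication matches elements by label, not by position
cost = fruits * prices

print(cost["apples"])  # 10 apples at 0.5 each
print(cost.sum())      # total cost across all fruits
```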

# CLASS WORK

In [7]:
# DO NOT CHANGE THE VARIABLE NAMES

# Given a list representing a few planets
planets = ['Earth', 'Saturn', 'Venus', 'Mars', 'Jupiter']



# Given another list representing the distance of each of these planets from the Sun
# The distance from the Sun is in units of 10^6 km
distance_from_sun = [149.6, 1433.5, 108.2, 227.9, 778.6]

In [8]:
# TO DO: Create a Pandas Series "dist_planets" using the lists above, representing the distance of the planet from the Sun.
# Use the `distance_from_sun` as your data, and `planets` as your index.
dist_planets = pd.Series(data = distance_from_sun, index = planets)

In [9]:
dist_planets

Earth       149.6
Saturn     1433.5
Venus       108.2
Mars        227.9
Jupiter     778.6
dtype: float64

In [10]:
# TO DO: Calculate the time (minutes) it takes light from the Sun to reach each planet. 
# You can do this by dividing each planet's distance from the Sun by the speed of light.
# Use the speed of light, c = 18, since light travels 18 x 10^6 km/minute.
time = dist_planets / 18
time

Earth       8.311111
Saturn     79.638889
Venus       6.011111
Mars       12.661111
Jupiter    43.255556
dtype: float64

In [13]:
# TO DO: Use Boolean indexing to select only those planets for which sunlight takes less
# than 40 minutes to reach them.
# We'll check your work by printing out these close planets.
low_time = time[time < 40]
low_time

Earth     8.311111
Venus     6.011111
Mars     12.661111
dtype: float64
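Boolean indexing happens in two steps, which the sketch below separates: the comparison first produces a Series of True/False values, and that mask then selects the matching elements. The names `mask` and `close_planets` are illustrative and not part of the exercise.

```python
import pandas as pd

planets = ["Earth", "Saturn", "Venus", "Mars", "Jupiter"]
distance_from_sun = [149.6, 1433.5, 108.2, 227.9, 778.6]  # units of 10^6 km
dist_planets = pd.Series(data=distance_from_sun, index=planets)

# Step 1: the comparison yields a boolean Series with the same index
mask = (dist_planets / 18) < 40

# Step 2: indexing with the mask keeps only the elements where it is True
close_planets = dist_planets[mask]

print(mask.tolist())
print(close_planets.index.tolist())
```

The mask can be applied to any Series that shares the same index, which is why the distances themselves can be filtered by travel time here.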