#### What is Pandas?
Pandas is an open-source library for data manipulation in Python. It is built on top of NumPy, so pandas requires NumPy to operate. Pandas provides an easy way to create, manipulate, and wrangle data, and it also handles time series data well.

Data scientists use Pandas for the following advantages:

- It handles missing data easily
- It uses Series for one-dimensional data and DataFrame for two-dimensional data
- It provides an efficient way to slice the data
- It provides a flexible way to merge, concatenate, or reshape the data
- It includes powerful tools for working with time series


#### What is a data frame?
A data frame is a two-dimensional data structure with labeled axes (rows and columns). It is a standard way to store tabular data.

Data frames are well known to statisticians and other data practitioners. Each row stores one observation, and each column has a name describing the values it holds. For instance, price can be the name of a column and 2, 3, 4 the price values.
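
As a minimal sketch of the price example above (the variable names `prices` and `price_column` are illustrative assumptions), a data frame with one named column can be built from a dictionary:

```python
import pandas as pd

# Hypothetical data: one column named "price" holding the values 2, 3, 4.
prices = pd.DataFrame({"price": [2, 3, 4]})

# Each column of a data frame is a one-dimensional Series.
price_column = prices["price"]

print(prices.shape)          # (number of rows, number of columns)
print(type(price_column))    # the pandas Series class
```

Selecting a single column by name returns a Series, which is why the cells below start with `pd.Series`.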

In [1]:
import pandas as pd
pd.Series([1., 2., 3.])

0    1.0
1    2.0
2    3.0
dtype: float64

In [2]:
# You can set row labels with the index argument.
# Its length must equal the number of values in the Series.

pd.Series([1., 2., 3.], index=['a', 'b', 'c'])

a    1.0
b    2.0
c    3.0
dtype: float64
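
The following sketch illustrates the two points made in the cell above: labels can be used to look up values, and the number of labels has to match the number of values. The names `value_b` and `mismatch_error` are only for illustration.

```python
import pandas as pd

s = pd.Series([1., 2., 3.], index=["a", "b", "c"])

# Labels let you look up a value by name instead of by position.
value_b = s["b"]

# A mismatch between the number of values and the number of labels is rejected.
try:
    pd.Series([1., 2., 3.], index=["a", "b"])
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc

print(value_b)
print(type(mismatch_error).__name__)
```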

In [4]:
import numpy as np
# Below, you create a pandas Series with a missing value in the third row.
# Note: pandas displays missing values as NaN. You can create one explicitly with np.nan.
# Because NaN is a float, the integers are converted to float64.

pd.Series([1,2,np.nan])

0    1.0
1    2.0
2    NaN
dtype: float64
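
Since handling missing data is listed as one of the advantages of pandas, here is a short sketch of three common methods for it: `isna()`, `fillna()`, and `dropna()`. Replacing missing values with 0 is just an example choice, not a recommendation.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, np.nan])

missing_mask = s.isna()   # True where a value is missing
filled = s.fillna(0)      # new Series with missing values replaced by 0
dropped = s.dropna()      # new Series without the missing values

print(missing_mask.tolist())
print(filled.tolist())
print(len(dropped))
```

All three methods return new objects, so `s` itself still contains the NaN afterwards.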

### Create Data frame
You can convert a NumPy array, or a nested list as in the example below, to a pandas data frame with pd.DataFrame(). The opposite is also possible: to convert a pandas data frame to a NumPy array, you can use np.array().

In [6]:
## Numpy to pandas
import numpy as np

h = [[1,2],[3,4]] 
df_h = pd.DataFrame(h)
print('Data Frame:', df_h)

## Pandas to numpy
df_h_n = np.array(df_h)
print('Numpy array:', df_h_n)

Data Frame:    0  1
0  1  2
1  3  4
Numpy array: [[1 2]
 [3 4]]
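
As an alternative to `np.array()`, pandas data frames also have a `to_numpy()` method. The sketch below shows a round trip; the names `arr` and `df_back` are illustrative.

```python
import numpy as np
import pandas as pd

df_h = pd.DataFrame([[1, 2], [3, 4]])

# DataFrame.to_numpy() returns the values as a NumPy array.
arr = df_h.to_numpy()

# Round trip: wrapping the array again gives an equal data frame.
df_back = pd.DataFrame(arr)

print(type(arr).__name__, arr.shape)
print(df_back.equals(df_h))
```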


In [8]:
# You can also use a dictionary to create a Pandas dataframe.

dic = {'Name': ["John", "Smith"], 'Age': [30, 40]}
pd.DataFrame(data=dic)

    Name  Age
0   John   30
1  Smith   40


#### Date ranges
Pandas has a convenient API to create a range of dates:
`pd.date_range(start, periods, freq)`
- The first parameter is the starting date
- The second parameter is the number of periods (optional if the end date is specified)
- The last parameter is the frequency, for example 'D' for day, 'M' for month end, and 'Y' for year end. In pandas 2.2 and later, 'ME' and 'YE' are the preferred aliases for month end and year end.

In [9]:
## Create date
# Days
dates_d = pd.date_range('20300101', periods=6, freq='D')
print('Day:', dates_d)

Day: DatetimeIndex(['2030-01-01', '2030-01-02', '2030-01-03', '2030-01-04',
               '2030-01-05', '2030-01-06'],
              dtype='datetime64[ns]', freq='D')


In [10]:
# Months
dates_m = pd.date_range('20300101', periods=6, freq='M')
print('Month:', dates_m)

Month: DatetimeIndex(['2030-01-31', '2030-02-28', '2030-03-31', '2030-04-30',
               '2030-05-31', '2030-06-30'],
              dtype='datetime64[ns]', freq='M')
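
The sketch below shows two variations on the calls above: giving an end date instead of a number of periods, and using the month-start alias `'MS'` so that each date falls on the first day of a month. The variable names are illustrative.

```python
import pandas as pd

# With both endpoints given, the number of periods is inferred (daily by default).
by_end = pd.date_range(start="2030-01-01", end="2030-01-06")

# "MS" is the month-start alias, so each date is the first day of a month.
month_starts = pd.date_range("2030-01-01", periods=6, freq="MS")

print(len(by_end))
print(month_starts[0], month_starts[-1])
```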


#### Inspecting data
You can check the first or last rows of a dataset with head() or tail(), called on the pandas data frame. Both return five rows by default.

In [11]:
# Step 1) Create a random array with NumPy. The array has 6 rows and 4 columns

random = np.random.randn(6,4)

In [12]:
# Step 2) Create a data frame using pandas.
# Use dates_m as the index of the data frame, so that each row is labeled with a date.
# Finally, name the 4 columns with the columns argument.

In [13]:
# Create data with date
df = pd.DataFrame(random,
                  index=dates_m,
                  columns=list('ABCD'))

In [18]:
# Step 3) Display the data frame. You can also use head(), for example df.head(8)
df.head(8)  # the data frame has only 6 rows, so all of them are displayed

                   A         B         C         D
2030-01-31  0.088187  0.309553 -0.430135  1.331396
2030-02-28  3.039074 -0.057611  0.377823  1.717503
2030-03-31 -0.963206  1.916118  1.943762 -0.687879
2030-04-30  1.042479 -0.544763  0.977501 -1.673582
2030-05-31  0.679855 -0.998623  1.975864  0.325947
2030-06-30  0.543184 -0.910065 -1.029135  0.458066


In [19]:
#Step 4) Using tail function

df.tail(3)

                   A         B         C         D
2030-04-30  1.042479 -0.544763  0.977501 -1.673582
2030-05-31  0.679855 -0.998623  1.975864  0.325947
2030-06-30  0.543184 -0.910065 -1.029135  0.458066


In [20]:
# Step 5) A good way to get an overview of the data is to use describe().
# It provides the count, mean, standard deviation, min, max, and quartiles of each numeric column.

df.describe()

              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean   0.738262 -0.047565  0.635947  0.245242
std    1.322524  1.084372  1.232726  1.261086
min   -0.963206 -0.998623 -1.029135 -1.673582
25%    0.201936 -0.818739 -0.228145 -0.434422
50%    0.611519 -0.301187  0.677662  0.392006
75%    0.951823  0.217762  1.702197  1.113063
max    3.039074  1.916118  1.975864  1.717503
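
Besides `describe()`, the `shape` and `dtypes` attributes give a quick structural overview. The sketch below rebuilds a similar data frame with a fixed random seed and daily dates (both are assumptions made here for reproducibility, so the numbers differ from the output above), and also shows that the result of `describe()` is itself a data frame that can be indexed.

```python
import numpy as np
import pandas as pd

# A fixed seed (an assumption for reproducibility) so the numbers repeat between runs.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((6, 4)),
                  index=pd.date_range("2030-01-01", periods=6, freq="D"),
                  columns=list("ABCD"))

summary = df.describe()

print(df.shape)              # number of rows and columns
print(df.dtypes)             # data type of each column
print(summary.loc["mean"])   # one row of the summary: the column means
```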


In [21]:
# Slice data
# This part of the tutorial shows how to slice a pandas data frame.

# You can use a column name to extract the data in that column.

## Slice
### Using name
df['A']

2030-01-31    0.088187
2030-02-28    3.039074
2030-03-31   -0.963206
2030-04-30    1.042479
2030-05-31    0.679855
2030-06-30    0.543184
Freq: M, Name: A, dtype: float64

In [23]:
# To select multiple columns, use double brackets: [[.., ..]]

# The outer pair of brackets selects from the data frame; the inner pair is a list of the column names to return.

df[['A', 'B']]

                   A         B
2030-01-31  0.088187  0.309553
2030-02-28  3.039074 -0.057611
2030-03-31 -0.963206  1.916118
2030-04-30  1.042479 -0.544763
2030-05-31  0.679855 -0.998623
2030-06-30  0.543184 -0.910065


In [24]:
# You can slice the rows with the : operator.
# The code below returns the first three rows.

### Using a slice for rows
df[0:3]

                   A         B         C         D
2030-01-31  0.088187  0.309553 -0.430135  1.331396
2030-02-28  3.039074 -0.057611  0.377823  1.717503
2030-03-31 -0.963206  1.916118  1.943762 -0.687879


In [25]:
# The loc indexer selects rows and columns by label.
# As usual, the values before the comma refer to the rows and the values after it refer to the columns.
# Use a list in brackets to select more than one column.

## Multiple columns
df.loc[:, ['A', 'B']]

                   A         B
2030-01-31  0.088187  0.309553
2030-02-28  3.039074 -0.057611
2030-03-31 -0.963206  1.916118
2030-04-30  1.042479 -0.544763
2030-05-31  0.679855 -0.998623
2030-06-30  0.543184 -0.910065


In [26]:
# There is another way to select multiple rows and columns in pandas.
# You can use iloc[], which uses integer positions instead of row and column labels.
# The code below returns the same data frame as above.

df.iloc[:, :2]

                   A         B
2030-01-31  0.088187  0.309553
2030-02-28  3.039074 -0.057611
2030-03-31 -0.963206  1.916118
2030-04-30  1.042479 -0.544763
2030-05-31  0.679855 -0.998623
2030-06-30  0.543184 -0.910065
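
One difference between the two indexers is easy to miss when slicing rows: a label slice with loc includes both endpoints, while a position slice with iloc excludes the end, as with Python lists. The sketch below uses a small data frame with daily dates and integer values, both chosen here only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(24).reshape(6, 4),
                  index=pd.date_range("2030-01-01", periods=6, freq="D"),
                  columns=list("ABCD"))

# loc is label-based: both ends of a label slice are included.
by_label = df.loc["2030-01-02":"2030-01-04", ["A", "B"]]

# iloc is position-based: the end position is excluded, as in Python lists.
by_position = df.iloc[1:4, :2]

print(by_label.shape)
print(by_label.equals(by_position))
```

Both selections return the rows for January 2 to January 4 and the columns A and B.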


In [27]:
# Drop a column
# You can drop columns with the data frame's drop() method.

df.drop(columns=['A', 'C'])

                   B         D
2030-01-31  0.309553  1.331396
2030-02-28 -0.057611  1.717503
2030-03-31  1.916118 -0.687879
2030-04-30 -0.544763 -1.673582
2030-05-31 -0.998623  0.325947
2030-06-30 -0.910065  0.458066
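
Note that `drop()` returns a new data frame rather than changing the original, so the result has to be assigned to a name if you want to keep it. A minimal sketch with made-up values:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [5, 6], "D": [7, 8]})

# drop() returns a new data frame; the original keeps all its columns.
trimmed = df.drop(columns=["A", "C"])

print(list(trimmed.columns))
print(list(df.columns))
```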


In [29]:
# Concatenation
# You can concatenate two DataFrames in pandas with pd.concat().
# First, create two DataFrames, using the creation syntax shown earlier.

df1 = pd.DataFrame({'name': ['John', 'Smith','Paul'],
                     'Age': ['25', '30', '50']},
                    index=[0, 1, 2])
df2 = pd.DataFrame({'name': ['Adam', 'Smith' ],
                     'Age': ['26', '11']},
                    index=[3, 4])  

# Finally, concatenate the two DataFrames.

df_concat = pd.concat([df1,df2]) 
df_concat

    name  Age
0   John   25
1  Smith   30
2   Paul   50
3   Adam   26
4  Smith   11
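
In the example above the two data frames were given non-overlapping index values by hand. When the inputs keep their default index, which starts at 0 in each, `pd.concat()` repeats the labels unless you pass `ignore_index=True`. A small sketch with made-up rows:

```python
import pandas as pd

df1 = pd.DataFrame({"name": ["John", "Smith"], "Age": [25, 30]})
df2 = pd.DataFrame({"name": ["Adam"], "Age": [26]})

# Both frames start their index at 0, so a plain concat repeats the label 0.
stacked = pd.concat([df1, df2])

# ignore_index=True discards the old labels and numbers the rows 0..n-1.
renumbered = pd.concat([df1, df2], ignore_index=True)

print(stacked.index.tolist())
print(renumbered.index.tolist())
```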


In [30]:
# Drop duplicates
# If a dataset may contain duplicate information, `drop_duplicates` is an easy way to exclude the duplicate rows.
# Here `df_concat` has a repeated value: `Smith` appears twice in the column `name`.
# Passing 'name' compares only that column and keeps the first occurrence.

df_concat.drop_duplicates('name')

    name  Age
0   John   25
1  Smith   30
2   Paul   50
3   Adam   26
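
The sketch below contrasts the default behavior of `drop_duplicates`, which compares whole rows, with the column-based call used above, and shows the `keep` parameter. The data frame `people` is rebuilt here with the same values so the block runs on its own.

```python
import pandas as pd

people = pd.DataFrame({"name": ["John", "Smith", "Paul", "Adam", "Smith"],
                       "Age": ["25", "30", "50", "26", "11"]})

# Default: compare whole rows. The two Smith rows differ in Age, so both stay.
whole_rows = people.drop_duplicates()

# Compare only "name" and keep the last occurrence instead of the first.
last_smith = people.drop_duplicates(subset="name", keep="last")

print(len(whole_rows))
print(last_smith["Age"].tolist())
```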


In [31]:
# Sort values
# You can sort the rows by a column with sort_values.

df_concat.sort_values('Age')

    name  Age
4  Smith   11
0   John   25
3   Adam   26
1  Smith   30
2   Paul   50
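
One caveat: the Age column in `df_concat` holds strings, and strings are sorted character by character. The result above is in numeric order only because every age has two digits. The sketch below uses a hypothetical age of "9" to show the difference, and converts the column to integers before sorting.

```python
import pandas as pd

# Hypothetical ages stored as strings, including a one-digit value.
people = pd.DataFrame({"name": ["John", "Smith", "Paul"],
                       "Age": ["25", "30", "9"]})

# String comparison is character by character, so "9" sorts after "30".
as_text = people.sort_values("Age")

# Converting to integers first gives the numeric order; ascending=False reverses it.
as_numbers = people.astype({"Age": int}).sort_values("Age", ascending=False)

print(as_text["Age"].tolist())
print(as_numbers["Age"].tolist())
```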


In [32]:
# Rename: change column names
# You can use rename to rename columns in pandas.
# In the dictionary, each key is the current column name and each value is the new column name.

df_concat.rename(columns={"name": "Surname", "Age": "Age_ppl"})

   Surname  Age_ppl
0     John       25
1    Smith       30
2     Paul       50
3     Adam       26
4    Smith       11
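
The same method can also rename row labels through its `index` argument, and like the other methods shown here it returns a new data frame. A minimal sketch, where the labels "first" and "second" are made up for illustration:

```python
import pandas as pd

people = pd.DataFrame({"name": ["John", "Smith"], "Age": [25, 30]})

# rename() returns a new data frame; columns= and index= each take an old -> new mapping.
renamed = people.rename(columns={"name": "Surname"},
                        index={0: "first", 1: "second"})

print(list(renamed.columns))
print(list(renamed.index))
print(list(people.columns))   # the original is unchanged
```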
