# Pandas
Pandas is a Python package that provides fast, flexible, and expressive data structures designed to make working with structured (tabular, multidimensional, potentially heterogeneous) and time series data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis and manipulation tool available in any language.

`pandas` is well suited for many different kinds of data:

* Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet.
* Ordered and unordered (not necessarily fixed-frequency) time series data.
* Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels.
* Any other form of observational / statistical data sets. The data need not be labeled at all to be placed into a pandas data structure.

The two primary data structures of pandas, **Series** (1-dimensional) and **DataFrame** (2-dimensional), handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering.


Normally we import pandas as follows:

In [2]:
# !conda install pandas
# Import pandas
import pandas as pd

Let's import numpy too.

In [4]:
# Import numpy
import numpy as np

## Series

A Series is very similar to a NumPy array (in fact, it is built on top of the NumPy array object). What differentiates a Series from a NumPy array is that a Series can have axis labels, meaning it can be indexed by a label instead of just a numeric position. It also doesn't need to hold numeric data; it can hold any arbitrary Python object.

A Series is, in essence, a single column of a DataFrame. You can assign labels to its values with the `index` parameter. However, a Series does not have column names; it only has one overall name, set with the `name` parameter.
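As a minimal sketch of label-based versus position-based access (the `readings` data and its labels are made up for illustration):

```python
import pandas as pd

# Hypothetical data: three readings labelled by weekday.
readings = pd.Series([21.5, 19.0, 23.2],
                     index=["mon", "tue", "wed"],
                     name="temperature")

by_label = readings["tue"]       # look up a value by its axis label
by_position = readings.iloc[1]   # look up the same value by integer position

print(by_label, by_position, readings.name)
```

Both lookups return the same element; the label simply gives it a more meaningful handle than its position.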

### Creating a series

In [6]:
# create list, array and dictionary 
labels = ['a','b','c']
my_list = [10,20,30]
arr = np.array([10,20,30])
d = {'a':10,'b':20,'c':30}

Using list

In [7]:
# Create pandas series using my_list
pd.Series(data = my_list)

0    10
1    20
2    30
dtype: int64

In [9]:
# Create pandas series using my_list and with labels as index
pd.Series(data = my_list, index = labels)

a    10
b    20
c    30
dtype: int64

In [11]:
pd.Series(my_list, labels)

a    10
b    20
c    30
dtype: int64

Using numpy Arrays

In [12]:
# Create pandas series using numpy array
pd.Series(arr)

0    10
1    20
2    30
dtype: int32

In [13]:
# Create pandas series using numpy array and labels
pd.Series(arr, labels)

a    10
b    20
c    30
dtype: int32

Using dictionary

In [None]:
# Create pandas series using dictionary (the keys become the index)
pd.Series(d)

A Series can hold a variety of object types

In [25]:
# Create pandas series with string objects
pd.Series(['a', 'b', 'c'])

0    a
1    b
2    c
dtype: object

Using an Index

In [35]:
# Create a series with values [3,4,5] with ['2019', '2020', '2021'] as index
series1 = pd.Series(data = [3,4,5], index = ['2019', '2020', '2021'])

In [36]:
# Print the series
print(series1)

2019    3
2020    4
2021    5
dtype: int64


In [38]:
# Show the value for index '2020'
series1['2020']

4

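Operations between Series are aligned on the index: values with matching labels are combined, and a label found in only one of the Series produces NaN.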

In [32]:
# Create a second series with values [1,2,3] and index ['2017', '2018', '2019']
series2 = pd.Series(data = [1,2,3], index = ['2017', '2018', '2019'])

In [33]:
# Display the second series
series2

2017    1
2018    2
2019    3
dtype: int64

In [41]:
# Add the two series 
series1 + series2

2017    NaN
2018    NaN
2019    6.0
2020    NaN
2021    NaN
dtype: float64
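If the NaN values are not wanted, one option is `Series.add` with `fill_value`, which substitutes a default for the missing side before adding. A short sketch using the same two series as above:

```python
import pandas as pd

series1 = pd.Series([3, 4, 5], index=["2019", "2020", "2021"])
series2 = pd.Series([1, 2, 3], index=["2017", "2018", "2019"])

# Plain + aligns on labels; labels present in only one operand give NaN.
aligned = series1 + series2

# add() with fill_value=0 treats a missing label as 0 instead.
filled = series1.add(series2, fill_value=0)

print(filled)
```

Only `'2019'` appears in both series, so it is the only label whose result is the same in `aligned` and `filled`.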

Giving a name to the series

In [21]:
# Create a series and give name to the series by passing 'Product A' as the name
pd.Series([1,2,3,4], ['a','b','c','d'], name = "Product A")

a    1
b    2
c    3
d    4
Name: Product A, dtype: int64

In [43]:
type(series1)

pandas.core.series.Series

## DataFrame
A DataFrame is a table. It contains an array of individual entries, each of which has a certain value. Each entry corresponds to a row (or record) and a column.

We can think of a DataFrame as a bunch of Series objects put together to share the same index.
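To make the "Series sharing an index" picture concrete, here is a small sketch that builds a DataFrame from two Series with the same index (the variable names `yes`, `no`, and `frame` are chosen only for this illustration):

```python
import pandas as pd

# Two Series that share the same index labels.
yes = pd.Series([50, 20], index=["RAI 1", "RAI 2"])
no = pd.Series([100, 200], index=["RAI 1", "RAI 2"])

# Each dictionary key becomes a column name; the shared index labels the rows.
frame = pd.DataFrame({"Yes": yes, "No": no})

# Pulling a single column back out gives a Series again.
column = frame["Yes"]

print(frame)
print(type(column))
```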




In [22]:
# Create a pandas dataframe with the dictionary {'Yes': [50, 20], 'No':[100,200]}
pd.DataFrame({'Yes': [50, 20], 'No':[100,200]})

|   | Yes | No |
|---|---|---|
| 0 | 50 | 100 |
| 1 | 20 | 200 |


In [29]:
# Create a pandas dataframe with the dictionary {'Yes': [50, 20], 'No':[100,200]} and index ['RAI 1', 'RAI 2']
pd.DataFrame({'Yes': [50, 20], 'No':[100,200]}, index = ['RAI 1', 'RAI 2'])

|   | Yes | No |
|---|---|---|
| RAI 1 | 50 | 100 |
| RAI 2 | 20 | 200 |


We can also build a DataFrame from a 2-D array of values and pass the index and column labels as lists.

In [30]:
# Create an array of random numbers with shape (5, 4)
np.random.seed(100)
random_values = np.random.randn(5,4)

In [31]:
# Display the random values
random_values

array([[-1.74976547,  0.3426804 ,  1.1530358 , -0.25243604],
       [ 0.98132079,  0.51421884,  0.22117967, -1.07004333],
       [-0.18949583,  0.25500144, -0.45802699,  0.43516349],
       [-0.58359505,  0.81684707,  0.67272081, -0.10441114],
       [-0.53128038,  1.02973269, -0.43813562, -1.11831825]])

In [39]:
# Create the dataframe with the random_values with index ['Thailand', 'Malaysia', 'Japan', 'Laos', 'Vietnam'] and columns ['Dell', 'Samsung', 'Apple', 'HP']
df = pd.DataFrame(random_values, index=['Thailand', 'Malaysia', 'Japan', 'Laos', 'Vietnam'], columns=['Dell', 'Samsung', 'Apple', 'HP'])

In [40]:
# Display the dataframe
df

|   | Dell | Samsung | Apple | HP |
|---|---|---|---|---|
| Thailand | -1.749765 | 0.34268 | 1.153036 | -0.252436 |
| Malaysia | 0.981321 | 0.514219 | 0.22118 | -1.070043 |
| Japan | -0.189496 | 0.255001 | -0.458027 | 0.435163 |
| Laos | -0.583595 | 0.816847 | 0.672721 | -0.104411 |
| Vietnam | -0.53128 | 1.029733 | -0.438136 | -1.118318 |


In [45]:
type(df)

pandas.core.frame.DataFrame

### Selection and Indexing

Let's learn the various methods to grab data from a DataFrame

In [91]:
# Select the Dell column using bracket indexing
df['Dell'] # preferred way to select a column

Thailand   -1.749765
Malaysia    0.981321
Japan      -0.189496
Laos       -0.583595
Vietnam    -0.531280
Name: Dell, dtype: float64

In [92]:
# Check the type of data returned by the above code
type(df['Dell'])

pandas.core.series.Series

In [93]:
# Select the columns Dell and HP
df[['Dell', 'HP']]

|   | Dell | HP |
|---|---|---|
| Thailand | -1.749765 | -0.252436 |
| Malaysia | 0.981321 | -1.070043 |
| Japan | -0.189496 | 0.435163 |
| Laos | -0.583595 | -0.104411 |
| Vietnam | -0.53128 | -1.118318 |


In [94]:
type(df[['Dell', 'HP']])

pandas.core.frame.DataFrame

In [95]:
# Access the column as an attribute with the dot (.) operator
df.Dell # not recommended

Thailand   -1.749765
Malaysia    0.981321
Japan      -0.189496
Laos       -0.583595
Vietnam    -0.531280
Name: Dell, dtype: float64
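One reason attribute access is discouraged: it cannot reach columns whose names are not valid Python identifiers, and it is shadowed when a column name matches an existing DataFrame attribute or method. A small sketch with a hypothetical frame (the column names are chosen only to show both cases):

```python
import pandas as pd

# Hypothetical frame: one column name contains spaces,
# another has the same name as the DataFrame.count method.
frame = pd.DataFrame({"Apple + HP": [1, 2], "count": [10, 20]})

with_space = frame["Apple + HP"]   # brackets work for any column name
count_column = frame["count"]      # brackets return the column
count_method = frame.count         # dot access returns the method, not the column

print(callable(count_method))
```

Bracket indexing behaves the same way for every column name, which is why it is the safer default.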


In [87]:
# Creating a new column
df['Apple + HP'] = df['Apple'] + df['HP']


In [98]:
# Overwrite the 'Apple + HP' column with a constant; the scalar is broadcast to every row
df['Apple + HP'] = 8999

In [99]:
df

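|   | Dell | Samsung | Apple | HP | Apple + HP |
|---|---|---|---|---|---|
| Thailand | -1.749765 | 0.34268 | 1.153036 | -0.252436 | 8999 |
| Malaysia | 0.981321 | 0.514219 | 0.22118 | -1.070043 | 8999 |
| Japan | -0.189496 | 0.255001 | -0.458027 | 0.435163 | 8999 |
| Laos | -0.583595 | 0.816847 | 0.672721 | -0.104411 | 8999 |
| Vietnam | -0.53128 | 1.029733 | -0.438136 | -1.118318 | 8999 |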


### Deleting a column

Note: If `inplace` is False (the default), `drop` does not modify the original DataFrame; it returns a new DataFrame without the column.

In [104]:
# Delete the column created above with drop (axis = 1 targets columns)
df.drop('Apple + HP', axis = 1, inplace = True)

In [105]:
# Display the df
df

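|   | Dell | Samsung | Apple | HP |
|---|---|---|---|---|
| Thailand | -1.749765 | 0.34268 | 1.153036 | -0.252436 |
| Malaysia | 0.981321 | 0.514219 | 0.22118 | -1.070043 |
| Japan | -0.189496 | 0.255001 | -0.458027 | 0.435163 |
| Laos | -0.583595 | 0.816847 | 0.672721 | -0.104411 |
| Vietnam | -0.53128 | 1.029733 | -0.438136 | -1.118318 |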
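The non-inplace form of `drop` can be sketched as follows. The small `frame` here is a made-up stand-in rather than the `df` used above:

```python
import pandas as pd

# Hypothetical two-row frame with a derived column.
frame = pd.DataFrame({"Apple": [1.0, 2.0], "HP": [3.0, 4.0]},
                     index=["Thailand", "Laos"])
frame["Apple + HP"] = frame["Apple"] + frame["HP"]

# Without inplace=True, drop returns a new DataFrame and leaves frame untouched.
trimmed = frame.drop("Apple + HP", axis=1)

print(list(frame.columns))
print(list(trimmed.columns))
```

Assigning the result back (`frame = frame.drop(...)`) has the same end effect as `inplace = True` and makes the change easy to see when reading the code.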


### Deleting a Row

In [100]:
# Delete the row Japan
df.drop('Japan', axis = 0, inplace = True)

In [77]:
# Display the df
df

|   | Dell | Samsung | Apple | HP |
|---|---|---|---|---|
| Thailand | -1.749765 | 0.34268 | 1.153036 | -0.252436 |
| Malaysia | 0.981321 | 0.514219 | 0.22118 | -1.070043 |
| Laos | -0.583595 | 0.816847 | 0.672721 | -0.104411 |
| Vietnam | -0.53128 | 1.029733 | -0.438136 | -1.118318 |


### Selecting Rows

In [90]:
# Select the row Thailand by its label
df.loc['Thailand']

Dell      -1.749765
Samsung    0.342680
Apple      1.153036
HP        -0.252436
Name: Thailand, dtype: float64

In [102]:
# Select the second row based on its integer position
df.iloc[1]

Dell       0.981321
Samsung    0.514219
Apple      0.221180
HP        -1.070043
Name: Malaysia, dtype: float64
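Both `loc` and `iloc` also accept a row selector and a column selector together. The sketch below rebuilds a frame with the same seed and labels as the tutorial's `df` so that it runs on its own:

```python
import numpy as np
import pandas as pd

np.random.seed(100)
df = pd.DataFrame(np.random.randn(5, 4),
                  index=["Thailand", "Malaysia", "Japan", "Laos", "Vietnam"],
                  columns=["Dell", "Samsung", "Apple", "HP"])

# A single cell, addressed by row label and column label.
cell = df.loc["Laos", "Apple"]

# A sub-table, selected by labels ...
block = df.loc[["Thailand", "Laos"], ["Dell", "HP"]]

# ... and the same sub-table, selected by integer positions.
same_block = df.iloc[[0, 3], [0, 3]]

print(block)
```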

### Conditional Selection

Similar to numpy, pandas also supports conditional selection using bracket notation

In [106]:
# Display the df first
df

|   | Dell | Samsung | Apple | HP |
|---|---|---|---|---|
| Thailand | -1.749765 | 0.34268 | 1.153036 | -0.252436 |
| Malaysia | 0.981321 | 0.514219 | 0.22118 | -1.070043 |
| Laos | -0.583595 | 0.816847 | 0.672721 | -0.104411 |
| Vietnam | -0.53128 | 1.029733 | -0.438136 | -1.118318 |


In [107]:
# Show positions where value > 0
df > 0

|   | Dell | Samsung | Apple | HP |
|---|---|---|---|---|
| Thailand | False | True | True | False |
| Malaysia | True | True | True | False |
| Laos | False | True | True | False |
| Vietnam | False | True | False | False |


In [108]:
# Show the values where value > 0
df[df > 0]

|   | Dell | Samsung | Apple | HP |
|---|---|---|---|---|
| Thailand | NaN | 0.34268 | 1.153036 | NaN |
| Malaysia | 0.981321 | 0.514219 | 0.22118 | NaN |
| Laos | NaN | 0.816847 | 0.672721 | NaN |
| Vietnam | NaN | 1.029733 | NaN | NaN |


In [110]:
# Display the rows where Dell has values greater than 0
df[df['Dell'] > 0]

|   | Dell | Samsung | Apple | HP |
|---|---|---|---|---|
| Malaysia | 0.981321 | 0.514219 | 0.22118 | -1.070043 |


In [111]:
# Display the columns Dell and HP for the rows where Dell has values greater than 0
df[df['Dell'] > 0][['Dell', 'HP']]

|   | Dell | HP |
|---|---|---|
| Malaysia | 0.981321 | -1.070043 |
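The same selection can be written as a single `loc` call that takes the boolean mask and the column list together, instead of chaining two bracket operations. A sketch using a hand-copied version of the four-row `df` (values rounded as displayed above):

```python
import pandas as pd

# Stand-in for the tutorial's df, with values copied from the displayed output.
df = pd.DataFrame(
    {"Dell": [-1.749765, 0.981321, -0.583595, -0.531280],
     "Samsung": [0.342680, 0.514219, 0.816847, 1.029733],
     "Apple": [1.153036, 0.221180, 0.672721, -0.438136],
     "HP": [-0.252436, -1.070043, -0.104411, -1.118318]},
    index=["Thailand", "Malaysia", "Laos", "Vietnam"])

mask = df["Dell"] > 0                      # boolean Series, one entry per row
subset = df.loc[mask, ["Dell", "HP"]]      # rows by mask, columns by label

print(subset)
```

A single `loc` call is also the form to prefer when assigning to the selection, since assignment through chained brackets may act on a temporary copy.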


We can combine multiple conditions with `|` (or) and `&` (and); each condition must be wrapped in parentheses.

In [112]:
# Display the rows where Dell > 0 and Apple > 0
df[(df['Dell'] > 0) & (df['Apple'] > 0)]

|   | Dell | Samsung | Apple | HP |
|---|---|---|---|---|
| Malaysia | 0.981321 | 0.514219 | 0.22118 | -1.070043 |
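For completeness, here is a sketch of the `|` (or) form, plus `~` for negating a condition. It again uses a hand-copied two-column stand-in for `df`:

```python
import pandas as pd

# Stand-in for the tutorial's df (Dell and Apple columns only, rounded values).
df = pd.DataFrame(
    {"Dell": [-1.749765, 0.981321, -0.583595, -0.531280],
     "Apple": [1.153036, 0.221180, 0.672721, -0.438136]},
    index=["Thailand", "Malaysia", "Laos", "Vietnam"])

# Rows where at least one of the two conditions holds.
either = df[(df["Dell"] > 0) | (df["Apple"] > 0)]

# ~ inverts a boolean mask: rows where neither condition holds.
neither = df[~((df["Dell"] > 0) | (df["Apple"] > 0))]

print(list(either.index))
print(list(neither.index))
```

The parentheses matter because `&`, `|`, and `~` bind more tightly than comparison operators such as `>`. Python's `and` / `or` keywords do not work element-wise on a Series.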
