- Pandas lets us analyze big data and draw conclusions based on statistical theory.
- Pandas can clean messy data sets and make them readable and relevant.
- Relevant data is very important in data science.
#### What can Pandas do?
- Pandas gives you answers about the data, such as:
    - Is there a correlation between two or more columns?
    - What is the average value?
    - What is the maximum value?
    - What is the minimum value?
- Pandas can also delete rows that are not relevant or that contain wrong values, such as empty or NULL values. This is called *cleaning the data*.
- Where is the Pandas codebase?
    - The source code for Pandas is located in its GitHub repository, ```pandas-dev/pandas```.
- If Python and pip are already installed on your system, installing Pandas is straightforward. Use this command: ```pip install pandas```
- Once Pandas is installed, you can import it in your applications with the ```import``` keyword.
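
As a rough illustration of the questions listed above, the sketch below builds a small made-up DataFrame and asks for the average, maximum, minimum, and correlation after dropping a row with a missing value. The column names `calories` and `duration` and all values are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical sample data; None marks a missing value.
df = pd.DataFrame({
    "calories": [420, 380, 390, None],
    "duration": [50, 40, 45, 60],
})

# Cleaning: drop every row that contains at least one missing value.
clean = df.dropna()

print(clean["calories"].mean())  # average value
print(clean["calories"].max())   # largest value
print(clean["calories"].min())   # smallest value
print(clean.corr())              # pairwise correlation between columns
```

`dropna()` returns a new DataFrame and leaves `df` unchanged, so the original data stays available for comparison.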

In [3]:
import pandas as pd

In [2]:
data = {'cars':['BMW','Volvo','Ford'],
       'passings':[3,7,2]}

In [8]:
data

{'cars': ['BMW', 'Volvo', 'Ford'], 'passings': [3, 7, 2]}

In [4]:
df = pd.DataFrame(data)

In [5]:
df

| | cars | passings |
|---|---|---|
| 0 | BMW | 3 |
| 1 | Volvo | 7 |
| 2 | Ford | 2 |


- What is a Series?
    - A Pandas Series is like a column in a table.
    - It is a one-dimensional array holding data of any type.
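
To make "data of any type" concrete, here is a minimal sketch with three Series holding integers, floats, and strings. The variable names are illustrative only, and the exact dtype names printed can vary with the pandas version.

```python
import pandas as pd

# Each Series is one-dimensional; its dtype is inferred from the values.
ints = pd.Series([1, 7, 2])
floats = pd.Series([1.5, 7.0, 2.25])
names = pd.Series(["BMW", "Volvo", "Ford"])

print(ints.dtype)    # an integer dtype
print(floats.dtype)  # float64
print(names.dtype)   # object, or a string dtype in newer pandas versions
print(ints.ndim)     # number of dimensions of a Series
```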

##### Create a simple Pandas Series from a list

In [9]:
a = [1,7,2]

In [10]:
a

[1, 7, 2]

In [11]:
s = pd.Series(a)

In [14]:
s

0    1
1    7
2    2
dtype: int64

##### Labels
- If nothing else is specified, the values are labelled with their index number: the first value has index 0, the second has index 1, and so on.

In [15]:
s[0]

1

In [16]:
s[2]

2

In [30]:
s[1]

7

With the ```index``` argument, you can assign your own labels:

In [20]:
cars = pd.Series(data['passings'],index=data['cars'])

In [21]:
cars

BMW      3
Volvo    7
Ford     2
dtype: int64

In [23]:
s_1 = pd.Series(a,index=['x','y','z'])

In [24]:
s_1

x    1
y    7
z    2
dtype: int64

In [25]:
s_1['x']

1

In [26]:
s_1['z']

2

In [27]:
s_1['y']

7

In [31]:
calories = {'day1':420,'day2':380,'day3':390}

In [32]:
calories_series = pd.Series(calories)

In [33]:
calories_series

day1    420
day2    380
day3    390
dtype: int64

In [34]:
calories_series = pd.Series(calories,index=['day1','day2'])

In [35]:
calories_series

day1    420
day2    380
dtype: int64

In [36]:
calories_series = pd.Series(calories,index=['day1','day3'])

In [37]:
calories_series

day1    420
day3    390
dtype: int64
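
The cells above build a Series from a dictionary: the keys become the labels, and the `index` argument selects which keys to keep. A small sketch of that behaviour follows. The label `day4` is a hypothetical key that is not in the dictionary, included to show what happens when a requested label is missing.

```python
import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}

# Keys become labels; only the labels listed in `index` are kept.
subset = pd.Series(calories, index=["day1", "day3"])
print(subset)

# "day4" is a made-up label that is absent from the dictionary,
# so its value is NaN and the dtype becomes float64.
with_missing = pd.Series(calories, index=["day1", "day4"])
print(with_missing)
```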

##### DataFrames
- Data sets in Pandas are usually two-dimensional tables, called DataFrames.
- A Series is like a column; a DataFrame is the whole table.
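
The column/table relationship can be checked directly: selecting a single column of a DataFrame by name returns a Series. A minimal sketch, using the same calories and duration values as the cells in this section:

```python
import pandas as pd

df = pd.DataFrame({"calories": [420, 380, 390], "duration": [50, 40, 45]})

# Selecting one column by name returns a Series.
col = df["calories"]

print(type(col).__name__)  # Series
print(type(df).__name__)   # DataFrame
print(df.shape)            # (3, 2): three rows, two columns
```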

In [38]:
data = {'calories':[420,380,390],
       'duration':[50,40,45]}

In [41]:
# Load data into a DataFrame object
df = pd.DataFrame(data)

In [40]:
df

| | calories | duration |
|---|---|---|
| 0 | 420 | 50 |
| 1 | 380 | 40 |
| 2 | 390 | 45 |


In [42]:
df.loc[0]

calories    420
duration     50
Name: 0, dtype: int64

In [43]:
df.loc[1]

calories    380
duration     40
Name: 1, dtype: int64

In [44]:
df.loc[2]

calories    390
duration     45
Name: 2, dtype: int64

In [45]:
df.loc[[0,1]]

| | calories | duration |
|---|---|---|
| 0 | 420 | 50 |
| 1 | 380 | 40 |


In [46]:
df.loc[[1,2]]

| | calories | duration |
|---|---|---|
| 1 | 380 | 40 |
| 2 | 390 | 45 |


In [47]:
df.loc[[0,2]]

| | calories | duration |
|---|---|---|
| 0 | 420 | 50 |
| 2 | 390 | 45 |


In [48]:
# Load data into a DataFrame object
df = pd.DataFrame(data,index=['day1','day2','day3'])

In [49]:
df

| | calories | duration |
|---|---|---|
| day1 | 420 | 50 |
| day2 | 380 | 40 |
| day3 | 390 | 45 |


In [50]:
df.loc['day2']

calories    380
duration     40
Name: day2, dtype: int64

In [51]:
df.loc['day1']

calories    420
duration     50
Name: day1, dtype: int64

In [52]:
df.loc['day3']

calories    390
duration     45
Name: day3, dtype: int64

In [53]:
df.loc[['day1']]

| | calories | duration |
|---|---|---|
| day1 | 420 | 50 |
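
The `loc` calls above differ in what they return: a single label gives a Series holding that row, while a list of labels gives a DataFrame, even when the list has only one element. A minimal sketch using the same data (the `iloc` line is an addition, shown for contrast with label-based selection):

```python
import pandas as pd

data = {"calories": [420, 380, 390], "duration": [50, 40, 45]}
df = pd.DataFrame(data, index=["day1", "day2", "day3"])

row = df.loc["day1"]      # single label -> Series
frame = df.loc[["day1"]]  # list of labels -> DataFrame with one row

print(type(row).__name__)    # Series
print(type(frame).__name__)  # DataFrame

# iloc selects by integer position instead of by label.
print(df.iloc[0]["calories"])  # 420
```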


### Reading from a CSV file

In [54]:
df = pd.read_csv('importexport202223.csv')

In [55]:
df.head()

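| | IMPORT/EXPORT (Quantity in '000 Metric Tonnes) | PRODUCT | APRIL | MAY | JUNE | JULY | AUGUST | SEPTEMBER | OCTOBER | NOVEMBER | DECEMBER | JANUARY | FEBRUARY | MARCH | TOTAL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | IMPORT | CRUDE OIL | 21626 | 19644 | 19441 | 20624 | 17637 | 16772 | 18123 | 19003 | 19618 | 20058 | 19285 | 20729 | 232561 |
| 1 | IMPORT | LPG | 1605 | 1363 | 1264 | 1417 | 1574 | 1448 | 1374 | 1778 | 1718 | 1709 | 1647 | 1410 | 18309 |
| 2 | IMPORT | MS | 0 | 30 | 127 | 63 | 0 | 190 | 327 | 211 | 120 | 0 | 0 | 0 | 1069 |
| 3 | IMPORT | Naphtha | 35 | 30 | 2 | 145 | 90 | 76 | 135 | 28 | 88 | 67 | 121 | 117 | 936 |
| 4 | IMPORT | ATF | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |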


In [56]:
df.tail()

| | IMPORT/EXPORT (Quantity in '000 Metric Tonnes) | PRODUCT | APRIL | MAY | JUNE | JULY | AUGUST | SEPTEMBER | OCTOBER | NOVEMBER | DECEMBER | JANUARY | FEBRUARY | MARCH | TOTAL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 23 | EXPORT | Bitumen | 0 | 0 | 1 | 0 | 0 | 2 | 2 | 0 | 0 | 0 | 0 | 3 | 9 |
| 24 | EXPORT | Petcoke / CBFS | 4 | 33 | 36 | 11 | 39 | 17 | 0 | 0 | 0 | 0 | 22 | 123 | 284 |
| 25 | EXPORT | Others% | 275 | 300 | 267 | 214 | 398 | 254 | 123 | 229 | 684 | 215 | 320 | 435 | 3717 |
| 26 | EXPORT | TOTAL PRODUCT EXPORT | 5437 | 5682 | 5506 | 4689 | 5240 | 4987 | 3972 | 4266 | 5697 | 4501 | 5056 | 6006 | 61039 |
| 27 | NET IMPORT | NET IMPORT | 20366 | 17247 | 17074 | 19674 | 15606 | 14946 | 18043 | 18776 | 17969 | 19485 | 17801 | 18367 | 215355 |
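
`read_csv` loads a CSV file into a DataFrame, and `head()` and `tail()` return the first and last rows (five by default). Because the CSV file used above is not included here, the sketch below reads a small in-memory CSV instead. Its contents are hypothetical and only mimic the shape of the real file.

```python
import io
import pandas as pd

# Hypothetical CSV content, used in place of a file on disk.
csv_text = """PRODUCT,APRIL,MAY
A,1,2
B,3,4
C,5,6
D,7,8
E,9,10
F,11,12
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())   # first five rows (A to E)
print(df.tail(2))  # last two rows (E and F)
print(df.shape)    # (6, 3)
```

Passing a number to `head()` or `tail()` changes how many rows are returned.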


In [57]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28 entries, 0 to 27
Data columns (total 15 columns):
 #   Column                                          Non-Null Count  Dtype 
---  ------                                          --------------  ----- 
 0   IMPORT/EXPORT (Quantity in '000 Metric Tonnes)  28 non-null     object
 1   PRODUCT                                         28 non-null     object
 2   APRIL                                           28 non-null     int64 
 3   MAY                                             28 non-null     int64 
 4   JUNE                                            28 non-null     int64 
 5   JULY                                            28 non-null     int64 
 6   AUGUST                                          28 non-null     int64 
 7   SEPTEMBER                                       28 non-null     int64 
 8   OCTOBER                                         28 non-null     int64 
 9   NOVEMBER                                        28 non-n

In [59]:
df.drop(columns=df.columns[0:2]).corr()

Unnamed: 0,APRIL,MAY,JUNE,JULY,AUGUST,SEPTEMBER,OCTOBER,NOVEMBER,DECEMBER,JANUARY,FEBRUARY,MARCH,TOTAL
APRIL,1.0,0.998839,0.999018,0.999671,0.998897,0.998716,0.998989,0.999244,0.999391,0.999543,0.99976,0.999138,0.999853
MAY,0.998839,1.0,0.999744,0.997952,0.999617,0.999677,0.996482,0.996813,0.999127,0.997278,0.999059,0.999618,0.999229
JUNE,0.999018,0.999744,1.0,0.998403,0.999639,0.999428,0.996848,0.997218,0.999252,0.997776,0.999349,0.999803,0.999427
JULY,0.999671,0.997952,0.998403,1.0,0.998137,0.997859,0.999464,0.999576,0.998687,0.999843,0.9995,0.9985,0.999565
AUGUST,0.998897,0.999617,0.999639,0.998137,1.0,0.999756,0.996822,0.997261,0.999672,0.997751,0.999382,0.999797,0.999439
SEPTEMBER,0.998716,0.999677,0.999428,0.997859,0.999756,1.0,0.996959,0.997162,0.999369,0.997368,0.998971,0.999386,0.999258
OCTOBER,0.998989,0.996482,0.996848,0.999464,0.996822,0.996959,1.0,0.999833,0.997827,0.999558,0.998499,0.99699,0.998793
NOVEMBER,0.999244,0.996813,0.997218,0.999576,0.997261,0.997162,0.999833,1.0,0.998265,0.999788,0.998964,0.997437,0.999067
DECEMBER,0.999391,0.999127,0.999252,0.998687,0.999672,0.999369,0.997827,0.998265,1.0,0.998576,0.999621,0.999595,0.999677
JANUARY,0.999543,0.997278,0.997776,0.999843,0.997751,0.997368,0.999558,0.999788,0.998576,1.0,0.999339,0.998044,0.999344
