## PANDAS
##### [Visit Pandas Documentation](https://pandas.pydata.org/docs/user_guide/index.html)
---

In [1]:
# importing libraries
import pandas as pd
import numpy as np
# note that pandas is built on top of numpy

### Pandas Series
- We'll start by analyzing "The Group of Seven" (G7), a political forum formed by Canada, France, Germany, Italy, Japan, the United Kingdom and the United States. We'll look at population first, and for that we'll use a pandas.Series object.

In [2]:
# pandas.Series : Series is a one-dimensional labeled array that can hold any data type 
# (integers, floats, strings, Python objects, etc.). 
# It is similar to a column in a spreadsheet or a one-dimensional NumPy array but with labels (an index) attached to each value.
g7_pop = pd.Series([35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523])
g7_pop

0     35.467
1     63.951
2     80.940
3     60.665
4    127.061
5     64.511
6    318.523
dtype: float64

Note : Someone might not know we're representing population in millions of inhabitants. Series can have a name, to better document the purpose of the Series:

In [3]:
# .name is used to add a name to the Series
g7_pop.name = "G7 population in millions"

In [4]:
g7_pop

0     35.467
1     63.951
2     80.940
3     60.665
4    127.061
5     64.511
6    318.523
Name: G7 population in millions, dtype: float64

In [5]:
# finding the datatype using dtype (the dtypes come from numpy)
g7_pop.dtype

dtype('float64')

In [6]:
# The .values attribute in Pandas is used to extract the underlying data from a Series 
# or DataFrame as a NumPy array (or an array-like structure). 
# It helps when you need to perform NumPy-based operations or work with raw data.
g7_pop.values

array([ 35.467,  63.951,  80.94 ,  60.665, 127.061,  64.511, 318.523])

In [7]:
# we can check the type of the array above
type(g7_pop.values)

numpy.ndarray

In [8]:
type(g7_pop)
# the result is a class <class 'pandas.core.series.Series'>

pandas.core.series.Series

In [9]:
# even though g7_pop is a Series instance, with the default integer index its elements can still be accessed just like array elements
g7_pop[0], g7_pop[1]

(35.467, 63.951)

In [10]:
# The .index attribute in Pandas is used to retrieve or modify the index labels of a Series or DataFrame.
# note that by default the index is from 0 to n-1 with a step of 1
# Pandas Index is immutable, meaning you cannot modify it in place, but you can replace it entirely with a new index.
g7_pop.index

RangeIndex(start=0, stop=7, step=1)

Note : we can replace the whole index by assigning a new list of labels to .index, much like setting an attribute on an object

In [11]:
# here we replace the entire index with the country names
g7_pop.index = [
    'Canada',
    'France',
    'Germany',
    'Italy',
    'Japan',
    'United Kingdom',
    'United States',
]
# note that the type of .index is also a pandas class

In [12]:
g7_pop

Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 population in millions, dtype: float64
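The difference between replacing the index and editing one of its labels can be checked with a small sketch. The two-element Series `pop` and the short labels "CA" and "FR" are made up for illustration; `rename` is shown as one way to change selected labels without touching the Index object in place.

```python
import pandas as pd

# Small stand-in Series (values in millions); names here are illustrative only
pop = pd.Series([35.467, 63.951], index=["Canada", "France"])

# Assigning to a single position of the Index is not allowed
try:
    pop.index[0] = "CA"
    in_place_failed = False
except TypeError:
    in_place_failed = True

# Replacing the whole index at once works
pop.index = ["CA", "FR"]

# rename() returns a new Series with only the mapped labels changed
renamed = pop.rename({"CA": "Canada"})

print(in_place_failed)
print(list(pop.index))
print(list(renamed.index))
```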

In [13]:
# here we create the same series from scratch using a dictionary (the keys become the index)
pd.Series({
    "Canada" : 35.467,
    "France" : 63.951,
    "Germany" : 80.940,
    "Italy" : 60.665,
    "Japan" : 127.061,
    "United Kingdom" : 64.511,
    "United States" : 318.523
},name = "G7 population in million!!")

Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 population in million!!, dtype: float64

In [14]:
# a series can also be created from scratch from a list of values plus an explicit index
pd.Series(
    [35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523],
    index = ['Canada','France','Germany','Italy','Japan','United Kingdom','United States',],
    name="G7 population in millions!!"
)

Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 population in millions!!, dtype: float64

In [15]:
# We can create a new series out of an existing series by specifying the index labels we want
pd.Series(
    g7_pop,
    index = ["France","Germany","Italy","Spain"]
)
# the value for Spain is NaN because the label "Spain" is not present in g7_pop

France     63.951
Germany    80.940
Italy      60.665
Spain         NaN
Name: G7 population in millions, dtype: float64
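Passing an existing Series together with a new index appears to behave like a reindex. The sketch below uses `Series.reindex` to make that step explicit; the `fill_value=0.0` variant is an illustrative choice, not something the notebook uses.

```python
import numpy as np
import pandas as pd

g7_pop = pd.Series(
    [35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523],
    index=["Canada", "France", "Germany", "Italy", "Japan",
           "United Kingdom", "United States"],
    name="G7 population in millions",
)

labels = ["France", "Germany", "Italy", "Spain"]

# Labels that are missing from g7_pop come back as NaN
subset = g7_pop.reindex(labels)

# Optionally substitute a placeholder for the missing labels
filled = g7_pop.reindex(labels, fill_value=0.0)

print(subset)
print(filled)
```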

### Indexing
- Indexing works similarly to lists and dictionaries: you use the index of the element you're looking for.

In [16]:
g7_pop

Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 population in millions, dtype: float64

In [17]:
# now using the name of the country as the index
g7_pop["Canada"]

35.467

In [18]:
g7_pop["Japan"]

127.061

In [19]:
# but we can still get a value by its numerical position using the iloc attribute
# The .iloc attribute in Pandas is used for position-based (integer) indexing to access rows and columns in a DataFrame or Series.
g7_pop.iloc[1]

63.951

In [20]:
g7_pop.iloc[-1]

318.523

In [21]:
# selecting multiple indexes
g7_pop[["Italy","France"]]

Italy     60.665
France    63.951
Name: G7 population in millions, dtype: float64

In [22]:
# now selecting multiple elements by position using iloc
g7_pop.iloc[[1,-1]]

France            63.951
United States    318.523
Name: G7 population in millions, dtype: float64
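Label-based and position-based selection can be kept apart by always going through `.loc` and `.iloc`. This is a minimal sketch of that habit; the variable names `by_label` and `by_position` are only for illustration.

```python
import pandas as pd

g7_pop = pd.Series(
    [35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523],
    index=["Canada", "France", "Germany", "Italy", "Japan",
           "United Kingdom", "United States"],
    name="G7 population in millions",
)

# .loc selects by index label, in the order requested
by_label = g7_pop.loc[["Italy", "France"]]

# .iloc selects by integer position; negative positions count from the end
by_position = g7_pop.iloc[[1, -1]]

print(by_label)
print(by_position)
```

Being explicit avoids any ambiguity about whether an integer inside the brackets is meant as a label or as a position.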

#### Slicing in pandas
note that :
- there is a fundamental difference between slicing a python list and slicing a pandas series by label
- In a python list the upper limit of the slice is not included
- In a pandas series sliced by label the upper limit is included
- Position-based slicing with .iloc follows the python rule and excludes the upper limit


In [23]:
# slicing 
g7_pop["Canada" : "Japan"]

Canada      35.467
France      63.951
Germany     80.940
Italy       60.665
Japan      127.061
Name: G7 population in millions, dtype: float64
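The two slicing rules can be compared side by side. The sketch below slices the same Series once by label with `.loc` and once by position with `.iloc`; the bounds were picked only to show the difference.

```python
import pandas as pd

g7_pop = pd.Series(
    [35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523],
    index=["Canada", "France", "Germany", "Italy", "Japan",
           "United Kingdom", "United States"],
    name="G7 population in millions",
)

# Label slice: "Japan" (the upper limit) is included
by_label = g7_pop.loc["Canada":"Japan"]

# Position slice: position 4 (Japan) is excluded, as in a Python list
by_position = g7_pop.iloc[0:4]

print(len(by_label), by_label.index[-1])
print(len(by_position), by_position.index[-1])
```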

### Conditional Selection ( boolean array )
- The same boolean array techniques we saw applied to numpy arrays can be used for Pandas Series
- Note that a condition written on its own returns a boolean series
- Note that a condition written inside the [] returns a series of the values that satisfy the condition

In [24]:
g7_pop

Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 population in millions, dtype: float64

In [25]:
g7_pop > 70

Canada            False
France            False
Germany            True
Italy             False
Japan              True
United Kingdom    False
United States      True
Name: G7 population in millions, dtype: bool

Note that the code above returned a series of boolean values, one per element, according to the given condition

In [26]:
# getting another series based on the condition applied to the elements in the array
g7_pop[g7_pop > 70]

Germany           80.940
Japan            127.061
United States    318.523
Name: G7 population in millions, dtype: float64

In [27]:
# calculating the mean of the array
# The axis parameter in .mean() determines whether Pandas computes the mean along rows or columns.
# axis applies if the array is not 1D
g7_pop.mean()

107.30257142857144

In [28]:
g7_pop[g7_pop > g7_pop.mean()]

Japan            127.061
United States    318.523
Name: G7 population in millions, dtype: float64

In [29]:
# to find the standard deviation we will be using std
# The axis parameter in .std() determines whether Pandas computes the standard deviation along rows or columns.
# axis applies if the array is not 1D
g7_pop.std()

97.24996987121581
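One detail worth checking when comparing against numpy: `Series.std()` uses the sample standard deviation (`ddof=1`) by default, while `np.std()` defaults to the population form (`ddof=0`). A minimal sketch of the comparison, with illustrative variable names:

```python
import numpy as np
import pandas as pd

g7_pop = pd.Series([35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523])

# pandas default: sample standard deviation (divides by n - 1)
sample_std = g7_pop.std()

# population standard deviation (divides by n)
population_std = g7_pop.std(ddof=0)

# numpy default on the underlying array: population standard deviation
numpy_std = np.std(g7_pop.to_numpy())

print(sample_std, population_std, numpy_std)
```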

In [30]:
# operators used in pandas and numpy underlying arrays
# ~ not
# | or
# & and

In [31]:
g7_pop[(g7_pop > g7_pop.mean() - g7_pop.std() / 2) | (g7_pop > g7_pop.mean() + g7_pop.std() / 2)]

France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 population in millions, dtype: float64
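A related selection keeps only the values that lie within half a standard deviation of the mean, which combines two comparisons with `&`, and `~` then gives the complement. This is a sketch of that variant; the names `lower`, `upper`, `inside` and `outside` are illustrative. Each comparison needs its own parentheses because `&`, `|` and `~` bind more tightly than `>` and `<`.

```python
import pandas as pd

g7_pop = pd.Series(
    [35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523],
    index=["Canada", "France", "Germany", "Italy", "Japan",
           "United Kingdom", "United States"],
    name="G7 population in millions",
)

mean, std = g7_pop.mean(), g7_pop.std()
lower, upper = mean - std / 2, mean + std / 2

# Boolean mask: True where the value falls strictly between the two bounds
in_band = (g7_pop > lower) & (g7_pop < upper)

inside = g7_pop[in_band]
outside = g7_pop[~in_band]  # ~ negates the mask element by element

print(inside)
print(outside)
```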

### Operations and methods
- Series also support vectorized operations and aggregation functions, just like Numpy arrays:

In [32]:
g7_pop

Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 population in millions, dtype: float64

In [33]:
# multiplying every element by one million
# this does not change the series; it returns a new series
g7_pop * 1_000_000
g7_pop * 1_000_000

Canada             35467000.0
France             63951000.0
Germany            80940000.0
Italy              60665000.0
Japan             127061000.0
United Kingdom     64511000.0
United States     318523000.0
Name: G7 population in millions, dtype: float64

In [34]:
# the mean of g7_pop is unchanged, because the multiplication above did not modify the series
g7_pop.mean()

107.30257142857144
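To keep the result of a vectorized operation, it has to be assigned to a name. The sketch below uses a shortened two-country Series and the illustrative names `in_millions` and `in_units` to show that the original stays untouched.

```python
import pandas as pd

# Shortened stand-in for the population Series (values in millions)
in_millions = pd.Series([35.467, 63.951], index=["Canada", "France"])

# The multiplication builds a new Series; in_millions is not modified
in_units = in_millions * 1_000_000

print(in_millions["Canada"])
print(in_units["Canada"])
```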

In [35]:
# np.log() is used to compute the natural logarithm (ln) of numbers, meaning it calculates logarithms to the base e (~2.718).
np.log(g7_pop)

Canada            3.568603
France            4.158117
Germany           4.393708
Italy             4.105367
Japan             4.844667
United Kingdom    4.166836
United States     5.763695
Name: G7 population in millions, dtype: float64

In [36]:
# finding the mean of a particular slice of g7_pop array
g7_pop["Germany":"United Kingdom"].mean()

83.29425

### Boolean Operation
- Conditions can be combined using the boolean operators | (or), & (and) and ~ (not)

In [37]:
g7_pop

Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 population in millions, dtype: float64

In [38]:
g7_pop > 80

Canada            False
France            False
Germany            True
Italy             False
Japan              True
United Kingdom    False
United States      True
Name: G7 population in millions, dtype: bool

In [39]:
g7_pop[g7_pop > 80]

Germany           80.940
Japan            127.061
United States    318.523
Name: G7 population in millions, dtype: float64

In [40]:
g7_pop[(g7_pop > 80) | (g7_pop < 40)]

Canada            35.467
Germany           80.940
Japan            127.061
United States    318.523
Name: G7 population in millions, dtype: float64

In [41]:
g7_pop[(g7_pop > 80) & (g7_pop < 200)]

Germany     80.940
Japan      127.061
Name: G7 population in millions, dtype: float64

### Modifying series
Notes : 
- A pandas series is only partially mutable
- The index labels are immutable: they cannot be changed individually in place, but the whole index can be replaced
- The values of a series are mutable

In [42]:
# changing the value of Canada using its label
g7_pop["Canada"] = 40.5
g7_pop

Canada             40.500
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 population in millions, dtype: float64

In [43]:
# now changing a value by its position using .iloc
g7_pop.iloc[-1] = 500.456
g7_pop

Canada             40.500
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     500.456
Name: G7 population in millions, dtype: float64

In [44]:
# now changing every value that satisfies the provided condition
g7_pop[g7_pop < 80] = 99.56
g7_pop

Canada             99.560
France             99.560
Germany            80.940
Italy              99.560
Japan             127.061
United Kingdom     99.560
United States     500.456
Name: G7 population in millions, dtype: float64
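Because these assignments modify the Series in place, it can be useful to work on a copy when the original values are still needed. A minimal sketch of the same three edits applied to a copy; the names `original` and `edited` are illustrative.

```python
import pandas as pd

original = pd.Series(
    [35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523],
    index=["Canada", "France", "Germany", "Italy", "Japan",
           "United Kingdom", "United States"],
    name="G7 population in millions",
)

# copy() gives an independent Series, so edits do not touch the original
edited = original.copy()

edited.loc["Canada"] = 40.5        # assign by label
edited.iloc[-1] = 500.456          # assign by position
edited.loc[edited < 80] = 99.56    # assign wherever the mask is True

print(edited)
print(original["Canada"])
```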