<a href="https://colab.research.google.com/github/CHOCOCHANEL/Data-Analysis/blob/main/Python_(Pandas).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Pandas
* pandas is designed for tabular data, such as data stored in spreadsheets or databases.
* It helps you explore, clean, and process that data.
* In pandas, a data table is called a `DataFrame`.

In [1]:
import numpy as np
import pandas as pd

### Object creation
* Creating a `Series` by passing a list of values, letting pandas create a default integer index.

In [2]:
series = pd.Series([1, 3, 5, np.nan, 6, 8])
series

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
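
The `float64` dtype in the output above comes from the `np.nan` entry: NaN is a floating-point value, so pandas stores the whole `Series` as floats. A minimal sketch of that behaviour, using illustrative variable names that are not part of the original notebook:

```python
import numpy as np
import pandas as pd

# All-integer input keeps an integer dtype.
ints = pd.Series([1, 3, 5])

# A single np.nan forces the whole Series to float64,
# because NaN is a floating-point value.
with_nan = pd.Series([1, 3, 5, np.nan])

print(ints.dtype)              # an integer dtype, typically int64
print(with_nan.dtype)          # float64
print(with_nan.isna().sum())   # number of missing values
```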

* Creating a `DataFrame` by passing a NumPy array, with a datetime index using `date_range()` and labeled columns.

In [3]:
dates = pd.date_range("20230603", periods=7)
dates

DatetimeIndex(['2023-06-03', '2023-06-04', '2023-06-05', '2023-06-06',
               '2023-06-07', '2023-06-08', '2023-06-09'],
              dtype='datetime64[ns]', freq='D')

In [4]:
df = pd.DataFrame(np.random.randn(7, 5), index=dates, columns=['A', 'B', 'C', 'D', 'E'])
df

|  | A | B | C | D | E |
| --- | --- | --- | --- | --- | --- |
| 2023-06-03 | -0.77693 | -0.381407 | -0.450462 | 1.977023 | 1.286982 |
| 2023-06-04 | 0.040707 | -0.999979 | -2.381096 | -1.213421 | 0.03484 |
| 2023-06-05 | -0.11547 | -0.860277 | -0.115213 | -0.541741 | -2.13703 |
| 2023-06-06 | 1.617192 | 1.523158 | 0.403473 | -0.271044 | -0.22507 |
| 2023-06-07 | -0.914472 | 0.484513 | -0.829975 | -0.557842 | -1.146289 |
| 2023-06-08 | -0.6282 | 0.497147 | -0.888865 | -0.828357 | -0.701441 |
| 2023-06-09 | -0.068338 | -0.50416 | 0.113768 | 1.2886 | -1.568774 |


* Creating a `DataFrame` by passing a dictionary of objects that can be converted into a series-like structure.

In [9]:
df2 = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.date_range("20230603", periods=4),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"])
    }
)
df2

|  | A | B | C | D | E |
| --- | --- | --- | --- | --- | --- |
| 0 | 1.0 | 2023-06-03 | 1.0 | 3 | test |
| 1 | 1.0 | 2023-06-04 | 1.0 | 3 | train |
| 2 | 1.0 | 2023-06-05 | 1.0 | 3 | test |
| 3 | 1.0 | 2023-06-06 | 1.0 | 3 | train |


In [13]:
# The columns have different dtypes.
df2.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
dtype: object
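
Because each column carries its own dtype, columns can also be filtered by dtype. The sketch below rebuilds a smaller frame in the style of `df2` (the names `frame` and `numeric` are illustrative) and uses `DataFrame.select_dtypes` to keep only the numeric columns:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    {
        "A": 1.0,  # a scalar is broadcast to every row
        "B": pd.date_range("20230603", periods=4),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
    }
)

# Keep only the numeric columns (float64 and int32 here).
numeric = frame.select_dtypes(include="number")
print(list(numeric.columns))  # ['A', 'D']
```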

### Viewing data

* Use `DataFrame.head()` and `DataFrame.tail()` to view the top and bottom rows of the frame respectively.

In [15]:
df.head()

|  | A | B | C | D | E |
| --- | --- | --- | --- | --- | --- |
| 2023-06-03 | -0.77693 | -0.381407 | -0.450462 | 1.977023 | 1.286982 |
| 2023-06-04 | 0.040707 | -0.999979 | -2.381096 | -1.213421 | 0.03484 |
| 2023-06-05 | -0.11547 | -0.860277 | -0.115213 | -0.541741 | -2.13703 |
| 2023-06-06 | 1.617192 | 1.523158 | 0.403473 | -0.271044 | -0.22507 |
| 2023-06-07 | -0.914472 | 0.484513 | -0.829975 | -0.557842 | -1.146289 |


In [16]:
df.tail()

|  | A | B | C | D | E |
| --- | --- | --- | --- | --- | --- |
| 2023-06-05 | -0.11547 | -0.860277 | -0.115213 | -0.541741 | -2.13703 |
| 2023-06-06 | 1.617192 | 1.523158 | 0.403473 | -0.271044 | -0.22507 |
| 2023-06-07 | -0.914472 | 0.484513 | -0.829975 | -0.557842 | -1.146289 |
| 2023-06-08 | -0.6282 | 0.497147 | -0.888865 | -0.828357 | -0.701441 |
| 2023-06-09 | -0.068338 | -0.50416 | 0.113768 | 1.2886 | -1.568774 |


In [17]:
df.index

DatetimeIndex(['2023-06-03', '2023-06-04', '2023-06-05', '2023-06-06',
               '2023-06-07', '2023-06-08', '2023-06-09'],
              dtype='datetime64[ns]', freq='D')

In [19]:
df.columns

Index(['A', 'B', 'C', 'D', 'E'], dtype='object')
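
The `head()` and `tail()` calls above return five rows because that is their default. Both accept a row count, as this small sketch on a toy frame shows (the frame and variable names are illustrative, not from the notebook):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"x": np.arange(10)})

first_three = frame.head(3)  # rows 0, 1, 2
last_two = frame.tail(2)     # rows 8, 9

print(first_three["x"].tolist())  # [0, 1, 2]
print(last_two["x"].tolist())     # [8, 9]
```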

### Fundamental Difference between Pandas and NumPy

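* NumPy arrays have one dtype for the entire array, while pandas DataFrames have one dtype per column.
* `DataFrame.to_numpy()` therefore has to pick a single dtype that can hold every column. When the columns are mixed, as in `df2`, the result falls back to `object`.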

In [20]:
df.to_numpy()

array([[-0.77692953, -0.38140736, -0.45046188,  1.97702349,  1.28698174],
       [ 0.04070689, -0.99997855, -2.38109609, -1.21342143,  0.03483991],
       [-0.11546979, -0.86027742, -0.11521281, -0.54174054, -2.13702993],
       [ 1.61719204,  1.52315759,  0.40347311, -0.27104395, -0.22506984],
       [-0.9144718 ,  0.48451284, -0.82997485, -0.55784228, -1.14628948],
       [-0.62819963,  0.49714681, -0.88886513, -0.82835652, -0.70144055],
       [-0.06833847, -0.50416038,  0.11376798,  1.28859993, -1.56877358]])

In [21]:
df2.to_numpy()

array([[1.0, Timestamp('2023-06-03 00:00:00'), 1.0, 3, 'test'],
       [1.0, Timestamp('2023-06-04 00:00:00'), 1.0, 3, 'train'],
       [1.0, Timestamp('2023-06-05 00:00:00'), 1.0, 3, 'test'],
       [1.0, Timestamp('2023-06-06 00:00:00'), 1.0, 3, 'train']],
      dtype=object)
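
The two outputs above differ in dtype because `to_numpy()` must settle on one dtype for the whole array. A minimal sketch with three small, made-up frames illustrates the cases:

```python
import numpy as np
import pandas as pd

# Every column is float64, so the array stays float64.
uniform = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})

# Integer and float columns are upcast to a common float64 type.
upcast = pd.DataFrame({"i": [1, 2], "f": [0.5, 1.5]})

# Numbers and strings share no numeric type, so the result is object.
mixed = pd.DataFrame({"a": [1.0, 2.0], "label": ["x", "y"]})

print(uniform.to_numpy().dtype)  # float64
print(upcast.to_numpy().dtype)   # float64
print(mixed.to_numpy().dtype)    # object
```

Object arrays lose most of NumPy's vectorized speed, so it is usually worth selecting the numeric columns before converting.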

* `describe()` shows a quick statistical summary of your data.

In [22]:
df.describe()

|  | A | B | C | D | E |
| --- | --- | --- | --- | --- | --- |
| count | 7.0 | 7.0 | 7.0 | 7.0 | 7.0 |
| mean | -0.120787 | -0.034429 | -0.592624 | -0.020969 | -0.636683 |
| std | 0.853166 | 0.906762 | 0.920331 | 1.183409 | 1.13318 |
| min | -0.914472 | -0.999979 | -2.381096 | -1.213421 | -2.13703 |
| 25% | -0.702565 | -0.682219 | -0.85942 | -0.693099 | -1.357532 |
| 50% | -0.11547 | -0.381407 | -0.450462 | -0.541741 | -0.701441 |
| 75% | -0.013816 | 0.49083 | -0.000722 | 0.508778 | -0.095115 |
| max | 1.617192 | 1.523158 | 0.403473 | 1.977023 | 1.286982 |
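
The summary returned by `describe()` is itself a `DataFrame`, so individual statistics can be read back by label. The sketch below uses a small hand-written frame (an assumption for illustration, since `df` holds random values) and also shows that non-numeric columns are skipped unless `include="all"` is passed:

```python
import pandas as pd

scores = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [10.0, 20.0, 30.0, 40.0]})
summary = scores.describe()

# Each statistic is addressable by its row label.
print(summary.loc["mean", "x"])  # 2.5
print(summary.loc["50%", "y"])   # 25.0, the median

# By default describe() summarizes only the numeric columns.
scores["group"] = ["a", "b", "a", "b"]
print(scores.describe().columns.tolist())               # ['x', 'y']
print(scores.describe(include="all").columns.tolist())  # ['x', 'y', 'group']
```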


### Selection
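* Plain `[]` indexing is convenient for interactive work. For production code, prefer the optimized pandas indexers `DataFrame.at[]`, `DataFrame.iat[]`, `DataFrame.loc[]`, and `DataFrame.iloc[]`. They are used with square brackets, not called like methods.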

In [24]:
df

|  | A | B | C | D | E |
| --- | --- | --- | --- | --- | --- |
| 2023-06-03 | -0.77693 | -0.381407 | -0.450462 | 1.977023 | 1.286982 |
| 2023-06-04 | 0.040707 | -0.999979 | -2.381096 | -1.213421 | 0.03484 |
| 2023-06-05 | -0.11547 | -0.860277 | -0.115213 | -0.541741 | -2.13703 |
| 2023-06-06 | 1.617192 | 1.523158 | 0.403473 | -0.271044 | -0.22507 |
| 2023-06-07 | -0.914472 | 0.484513 | -0.829975 | -0.557842 | -1.146289 |
| 2023-06-08 | -0.6282 | 0.497147 | -0.888865 | -0.828357 | -0.701441 |
| 2023-06-09 | -0.068338 | -0.50416 | 0.113768 | 1.2886 | -1.568774 |


In [23]:
df['A']

2023-06-03   -0.776930
2023-06-04    0.040707
2023-06-05   -0.115470
2023-06-06    1.617192
2023-06-07   -0.914472
2023-06-08   -0.628200
2023-06-09   -0.068338
Freq: D, Name: A, dtype: float64

In [25]:
df[:2]

|  | A | B | C | D | E |
| --- | --- | --- | --- | --- | --- |
| 2023-06-03 | -0.77693 | -0.381407 | -0.450462 | 1.977023 | 1.286982 |
| 2023-06-04 | 0.040707 | -0.999979 | -2.381096 | -1.213421 | 0.03484 |


In [26]:
df["20230605":"20230608"]

|  | A | B | C | D | E |
| --- | --- | --- | --- | --- | --- |
| 2023-06-05 | -0.11547 | -0.860277 | -0.115213 | -0.541741 | -2.13703 |
| 2023-06-06 | 1.617192 | 1.523158 | 0.403473 | -0.271044 | -0.22507 |
| 2023-06-07 | -0.914472 | 0.484513 | -0.829975 | -0.557842 | -1.146289 |
| 2023-06-08 | -0.6282 | 0.497147 | -0.888865 | -0.828357 | -0.701441 |
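
The two row slices above follow different endpoint rules: an integer slice such as `df[:2]` excludes its end position, while a label slice such as `df["20230605":"20230608"]` includes its end label. A minimal sketch on a deterministic frame (illustrative names, same date range as the notebook):

```python
import numpy as np
import pandas as pd

idx = pd.date_range("20230603", periods=7)
frame = pd.DataFrame({"A": np.arange(7)}, index=idx)

by_position = frame[2:5]                  # positions 2, 3, 4 (end excluded)
by_label = frame["20230605":"20230607"]   # 06-05 through 06-07 (end included)

print(len(by_position))              # 3
print(by_position.equals(by_label))  # True
```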


#### Selection by label
* `DataFrame.loc[]` selects rows and columns by their labels.

In [28]:
df.loc[["20230604", "20230605"], ["B", "C"]]

|  | B | C |
| --- | --- | --- |
| 2023-06-04 | -0.999979 | -2.381096 |
| 2023-06-05 | -0.860277 | -0.115213 |
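
Beyond lists of labels, label-based selection also accepts slices, and `DataFrame.at[]` reads a single cell by label. The following sketch assumes a small frame with known values so the results can be checked by eye:

```python
import numpy as np
import pandas as pd

idx = pd.date_range("20230603", periods=4)
frame = pd.DataFrame(np.arange(8).reshape(4, 2), index=idx, columns=["B", "C"])

# Label slices in .loc include both endpoints.
window = frame.loc["20230604":"20230605", "B"]
print(window.tolist())  # [2, 4]

# .at reads one cell by row label and column label.
cell = frame.at[idx[0], "C"]
print(cell)  # 1
```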


#### Selection by position
* `DataFrame.iloc[]` selects rows and columns by their integer positions.

In [29]:
df.iloc[3]

A    1.617192
B    1.523158
C    0.403473
D   -0.271044
E   -0.225070
Name: 2023-06-06 00:00:00, dtype: float64

In [30]:
df.iloc[3:5, :2]

|  | A | B |
| --- | --- | --- |
| 2023-06-06 | 1.617192 | 1.523158 |
| 2023-06-07 | -0.914472 | 0.484513 |
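
Positional selection also accepts lists of integers and bare slices, and `DataFrame.iat[]` reads a single cell by position. A short sketch on a toy frame with predictable values (not the random `df` used above):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape(4, 3), columns=["A", "B", "C"])

# Lists of integer positions pick arbitrary rows and columns.
picked = frame.iloc[[0, 2], [0, 2]]
print(picked.to_numpy().tolist())  # [[0, 2], [6, 8]]

# A bare slice keeps that axis whole: every row of the second column.
print(frame.iloc[:, 1].tolist())   # [1, 4, 7, 10]

# .iat reads one cell by row position and column position.
print(frame.iat[1, 1])             # 4
```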


#### Boolean Indexing
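* Select rows where a boolean condition on a column holds, or apply a condition to the whole `DataFrame` to mask individual values (cells that fail the condition become `NaN`).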

In [31]:
df[df["A"] > 0]

|  | A | B | C | D | E |
| --- | --- | --- | --- | --- | --- |
| 2023-06-04 | 0.040707 | -0.999979 | -2.381096 | -1.213421 | 0.03484 |
| 2023-06-06 | 1.617192 | 1.523158 | 0.403473 | -0.271044 | -0.22507 |
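
Row filters like `df[df["A"] > 0]` can be combined. The sketch below, on a small made-up frame, joins two conditions with `&` and `|`. Each comparison needs its own parentheses because these operators bind more tightly than `>` and `<`:

```python
import pandas as pd

frame = pd.DataFrame({"A": [-1.0, 0.5, 2.0, 3.0], "B": [1.0, -2.0, 0.3, -0.4]})

# & means "and", | means "or"; wrap each comparison in parentheses.
both_positive = frame[(frame["A"] > 0) & (frame["B"] > 0)]
either_negative = frame[(frame["A"] < 0) | (frame["B"] < 0)]

print(both_positive.index.tolist())    # [2]
print(either_negative.index.tolist())  # [0, 1, 3]
```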


In [32]:
df[df > 0]

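|  | A | B | C | D | E |
| --- | --- | --- | --- | --- | --- |
| 2023-06-03 | NaN | NaN | NaN | 1.977023 | 1.286982 |
| 2023-06-04 | 0.040707 | NaN | NaN | NaN | 0.03484 |
| 2023-06-05 | NaN | NaN | NaN | NaN | NaN |
| 2023-06-06 | 1.617192 | 1.523158 | 0.403473 | NaN | NaN |
| 2023-06-07 | NaN | 0.484513 | NaN | NaN | NaN |
| 2023-06-08 | NaN | 0.497147 | NaN | NaN | NaN |
| 2023-06-09 | NaN | NaN | 0.113768 | 1.2886 | NaN |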
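
Indexing with a boolean `DataFrame`, as in `df[df > 0]`, keeps the original shape and replaces the non-matching cells with `NaN`. As far as the result goes, this matches `DataFrame.where`, which can also take a replacement value. A minimal sketch with illustrative data:

```python
import pandas as pd

frame = pd.DataFrame({"A": [-1.0, 2.0], "B": [3.0, -4.0]})

masked = frame[frame > 0]              # non-matching cells become NaN
same = frame.where(frame > 0)          # the same operation, spelled out
filled = frame.where(frame > 0, 0.0)   # use 0.0 instead of NaN

print(masked.equals(same))         # True
print(filled.to_numpy().tolist())  # [[0.0, 3.0], [2.0, 0.0]]
```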


* Using `isin()` for filtering.

In [34]:
df_copied = df.copy()
df_copied

|  | A | B | C | D | E |
| --- | --- | --- | --- | --- | --- |
| 2023-06-03 | -0.77693 | -0.381407 | -0.450462 | 1.977023 | 1.286982 |
| 2023-06-04 | 0.040707 | -0.999979 | -2.381096 | -1.213421 | 0.03484 |
| 2023-06-05 | -0.11547 | -0.860277 | -0.115213 | -0.541741 | -2.13703 |
| 2023-06-06 | 1.617192 | 1.523158 | 0.403473 | -0.271044 | -0.22507 |
| 2023-06-07 | -0.914472 | 0.484513 | -0.829975 | -0.557842 | -1.146289 |
| 2023-06-08 | -0.6282 | 0.497147 | -0.888865 | -0.828357 | -0.701441 |
| 2023-06-09 | -0.068338 | -0.50416 | 0.113768 | 1.2886 | -1.568774 |


In [37]:
df_copied["E"] = ["one", "one", "two", "three", "four", "three", "two"]
df_copied

|  | A | B | C | D | E |
| --- | --- | --- | --- | --- | --- |
| 2023-06-03 | -0.77693 | -0.381407 | -0.450462 | 1.977023 | one |
| 2023-06-04 | 0.040707 | -0.999979 | -2.381096 | -1.213421 | one |
| 2023-06-05 | -0.11547 | -0.860277 | -0.115213 | -0.541741 | two |
| 2023-06-06 | 1.617192 | 1.523158 | 0.403473 | -0.271044 | three |
| 2023-06-07 | -0.914472 | 0.484513 | -0.829975 | -0.557842 | four |
| 2023-06-08 | -0.6282 | 0.497147 | -0.888865 | -0.828357 | three |
| 2023-06-09 | -0.068338 | -0.50416 | 0.113768 | 1.2886 | two |


In [38]:
df_copied[df_copied["E"].isin(["one", "two"])]

|  | A | B | C | D | E |
| --- | --- | --- | --- | --- | --- |
| 2023-06-03 | -0.77693 | -0.381407 | -0.450462 | 1.977023 | one |
| 2023-06-04 | 0.040707 | -0.999979 | -2.381096 | -1.213421 | one |
| 2023-06-05 | -0.11547 | -0.860277 | -0.115213 | -0.541741 | two |
| 2023-06-09 | -0.068338 | -0.50416 | 0.113768 | 1.2886 | two |
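
The `isin()` mask can be inverted with `~` to keep the rows that are not in the list. A short sketch reusing the same labels as column `E` above (the names `frame` and `keep` are illustrative):

```python
import pandas as pd

frame = pd.DataFrame({"E": ["one", "one", "two", "three", "four", "three", "two"]})

# True where the value is one of the listed labels.
keep = frame["E"].isin(["one", "two"])

print(int(keep.sum()))               # 4 rows match
print(frame[~keep]["E"].tolist())    # ['three', 'four', 'three']
```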
